Hi,
Continuing an earlier discussion [1] about having a separate module for Hadoop-related utility components, I went through our project briefly and found the following components that I feel could be moved to a separate module, for reusability and a better module structure.

Module: flink-hadoop-fs
Class: flink.runtime.util.HadoopUtils
Used at / Remarks:
* flink-runtime => HadoopModule & HadoopModuleFactory
* flink-swift-fs-hadoop => SwiftFileSystemFactory
* flink-yarn => Utils, YarnClusterDescriptor

Module: flink-hadoop-compatibility
Classes: api.java.hadoop.mapred.utils.HadoopUtils and api.java.hadoop.mapreduce.utils.HadoopUtils
Used at / Remarks: Both belong to the same module but live in different packages (api.java.hadoop.mapred and api.java.hadoop.mapreduce).

Module: flink-sequence-file
Class: formats.sequencefile.SerializableHadoopConfiguration
Used at / Remarks: Currently used by formats.sequencefile.SequenceFileWriterFactory, but it could also be used by HadoopCompressionBulkWriter, a potential OrcBulkWriter, and pretty much everywhere else to avoid NotSerializableException.

*Proposal*
To summarise, I believe we can create a new module (flink-hadoop-utils?) and move these reusable components into it, with an optional/provided dependency on flink-shaded-hadoop-2.

*Structure*
In its present form, I think we would have two classes, with the package structure being *org.apache.flink.hadoop.[utils/serialization]*:
1. HadoopUtils with all static methods (after combining the three HadoopUtils classes mentioned above and eliminating the duplicated code fragments).
2. The existing SerializableHadoopConfiguration, moved from flink-sequence-file to this new module.

*Justification*
* With this change, we would strip the dependency on flink-hadoop-fs from flink-runtime, as I don't see any other classes from flink-hadoop-fs being used anywhere in the flink-runtime module.
* We would have a common place for all Hadoop-related utilities, which could then be reused easily without leading to jar hell.
In addition, if you are aware of any other classes that fit this approach, please share the details here.

*Note*
I don't have a complete understanding here, but I did see two implementations of the following classes, under the two packages *.mapred and *.mapreduce:
* HadoopInputFormat
* HadoopInputFormatBase
* HadoopOutputFormat
* HadoopOutputFormatBase

Can we somehow figure out how to have them in this new module as well?

Thanks,
Sivaprasanna

[1] https://lists.apache.org/thread.html/r198f09496ba46885adbcc41fe778a7a34ad1cd685eeae8beb71e6fbb%40%3Cdev.flink.apache.org%3E
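The SerializableHadoopConfiguration mentioned above relies on Java's custom-serialization hooks (a transient payload plus private writeObject/readObject methods). Below is a self-contained sketch of that wrapper pattern; FakeConfig is a hypothetical stand-in for org.apache.hadoop.conf.Configuration (which is not Serializable), and the real wrapper delegates to Configuration#write and Configuration#readFields in the same places.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class SerializableConfigSketch {

    /** Hypothetical stand-in for a non-serializable configuration class. */
    public static final class FakeConfig {
        private final Map<String, String> entries = new HashMap<>();

        public void set(String key, String value) { entries.put(key, value); }
        public String get(String key) { return entries.get(key); }

        // Mimics Configuration#write(DataOutput): manual key/value encoding.
        void write(ObjectOutputStream out) throws IOException {
            out.writeInt(entries.size());
            for (Map.Entry<String, String> e : entries.entrySet()) {
                out.writeUTF(e.getKey());
                out.writeUTF(e.getValue());
            }
        }

        // Mimics Configuration#readFields(DataInput).
        void readFields(ObjectInputStream in) throws IOException {
            int size = in.readInt();
            for (int i = 0; i < size; i++) {
                entries.put(in.readUTF(), in.readUTF());
            }
        }
    }

    /** Serializable wrapper: the payload is transient and written by hand. */
    public static final class SerializableConfig implements Serializable {
        private static final long serialVersionUID = 1L;

        private transient FakeConfig config;

        public SerializableConfig(FakeConfig config) { this.config = config; }
        public FakeConfig get() { return config; }

        private void writeObject(ObjectOutputStream out) throws IOException {
            out.defaultWriteObject();
            config.write(out); // serialize the wrapped object manually
        }

        private void readObject(ObjectInputStream in)
                throws IOException, ClassNotFoundException {
            in.defaultReadObject();
            config = new FakeConfig();
            config.readFields(in); // rebuild the wrapped object manually
        }
    }

    /** Round-trips the wrapper through Java serialization. */
    public static SerializableConfig roundTrip(SerializableConfig original)
            throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(original);
        }
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (SerializableConfig) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        FakeConfig config = new FakeConfig();
        config.set("fs.defaultFS", "hdfs://namenode.example:8020");
        SerializableConfig copy = roundTrip(new SerializableConfig(config));
        System.out.println(copy.get().get("fs.defaultFS")); // prints hdfs://namenode.example:8020
    }
}
```

Any factory that carries such a wrapper as a field (instead of a raw Configuration) becomes serializable for free, which is exactly why reusing one shared class avoids the NotSerializableException mentioned above.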
Hi Sivaprasanna,
thanks for starting this discussion. In general I like the idea of removing duplication and moving common code to a shared module. As a recommendation, I would exclude the whole part about Flink's Hadoop compatibility modules, because they are legacy code and hardly used anymore. This would also have the benefit of making the scope of the proposal a bit smaller.

What we now need is a committer who wants to help with this effort. It might be that this takes a bit of time, as many of the committers are quite busy.

Cheers,
Till

On Thu, Mar 19, 2020 at 2:15 PM Sivaprasanna <[hidden email]> wrote:
Hello Till,
I agree with keeping the scope limited and more concentrated. I can file a Jira and get started with the code changes; as and when someone has some bandwidth, the review can be done as well. What do you think?

Cheers,
Sivaprasanna

On Mon, Mar 30, 2020 at 3:00 PM Till Rohrmann <[hidden email]> wrote:
I would recommend waiting until a committer has signed up to review your changes before preparing any PR. Otherwise the chances are high that you invest a lot of time but the changes never get in.

On 30/03/2020 11:42, Sivaprasanna wrote:
Bump.
Please let me know if someone is interested in reviewing this one; I am willing to start working on it. BTW, a small new addition to the list: with FLINK-10114 merged, OrcBulkWriterFactory can also reuse `SerializableHadoopConfiguration`, along with SequenceFileWriterFactory and CompressWriterFactory.

CC Kostas Kloudas, since he has a better understanding of `SerializableHadoopConfiguration`.

Cheers,
Sivaprasanna

On Mon, Mar 30, 2020 at 3:17 PM Chesnay Schepler <[hidden email]> wrote:
Hi Sivaprasanna,
thanks a lot for your proposal. Now that I have run into a HadoopUtils-related issue myself [1], I see the benefit of this proposal.

I'm happy to be the Flink committer who mentors this change. If we do this, I would like to keep the scope of the initial change small:
- create a "flink-hadoop-utils" module
- move generic, common utils into that module (for example, SerializableHadoopConfiguration)

I agree with Till that we should initially leave out the Hadoop compatibility modules.

You can go ahead with filing a JIRA ticket! Let's discuss the exact scope there.

[1] https://github.com/apache/flink/pull/12146

On Thu, Apr 30, 2020 at 6:54 PM Sivaprasanna <[hidden email]> wrote:
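The "flink-hadoop-utils" module sketched above still has to declare its Hadoop dependency somehow. A hypothetical pom.xml fragment following the thread's optional/provided suggestion might look like this; the version properties are illustrative placeholders, not taken from the actual build:

```xml
<!-- Hypothetical dependency block for the proposed flink-hadoop-utils module.
     "provided" keeps the shaded Hadoop classes off the runtime classpath
     (and out of the transitive dependency tree), so downstream users supply
     their own Hadoop distribution; <optional>true</optional> additionally
     documents that consumers can function without it. -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-shaded-hadoop-2</artifactId>
    <version>${hadoop.version}-${flink.shaded.version}</version>
    <scope>provided</scope>
    <optional>true</optional>
</dependency>
```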
Awesome. : )
Thanks, Robert, for signing up to be the reviewer. I will create the Jira and share the link here.

Stay safe.

-
Sivaprasanna

On Thu, May 28, 2020 at 12:13 PM Robert Metzger <[hidden email]> wrote:
FYI.
I have created a Jira to track this improvement: https://issues.apache.org/jira/browse/FLINK-18013

-
Sivaprasanna

On Thu, May 28, 2020 at 12:22 PM Sivaprasanna <[hidden email]> wrote:
>> > > Otherwise the chances are high that you invest a lot of time but the >> > > changes never get in. >> > > >> > > On 30/03/2020 11:42, Sivaprasanna wrote: >> > > > Hello Till, >> > > > >> > > > I agree with having the scope limited and more concentrated. I can >> > file a >> > > > Jira and get started with the code changes, as and when someone has >> > some >> > > > bandwidth, the review can also be done. What do you think? >> > > > >> > > > Cheers, >> > > > Sivaprasanna >> > > > >> > > > On Mon, Mar 30, 2020 at 3:00 PM Till Rohrmann <[hidden email] >> > >> > > wrote: >> > > > >> > > >> Hi Sivaprasanna, >> > > >> >> > > >> thanks for starting this discussion. In general I like the idea to >> > > remove >> > > >> duplications and move common code to a shared module. As a >> > > recommendation, >> > > >> I would exclude the whole part about Flink's Hadoop compatibility >> > > modules >> > > >> because they are legacy code and hardly used anymore. This would >> also >> > > have >> > > >> the benefit of making the scope of the proposal a bit smaller. >> > > >> >> > > >> What we now need is a committer who wants to help with this >> effort. It >> > > >> might be that this takes a bit of time as many of the committers >> are >> > > quite >> > > >> busy. >> > > >> >> > > >> Cheers, >> > > >> Till >> > > >> >> > > >> On Thu, Mar 19, 2020 at 2:15 PM Sivaprasanna < >> > [hidden email] >> > > > >> > > >> wrote: >> > > >> >> > > >>> Hi, >> > > >>> >> > > >>> Continuing on an earlier discussion[1] regarding having a separate >> > > module >> > > >>> for Hadoop related utility components, I have gone through our >> > project >> > > >>> briefly and found the following components which I feel could be >> > moved >> > > >> to a >> > > >>> separate module for reusability, and better module structure. 
>> > > >>> >> > > >>> Module Name Class Name Used at / Remarks >> > > >>> >> > > >>> flink-hadoop-fs >> > > >>> flink.runtime.util.HadoopUtils >> > > >>> flink-runtime => HadoopModule & HadoopModuleFactory >> > > >>> flink-swift-fs-hadoop => SwiftFileSystemFactory >> > > >>> flink-yarn => Utils, YarnClusterDescriptor >> > > >>> >> > > >>> flink-hadoop-compatability >> > > >>> api.java.hadoop.mapred.utils.HadoopUtils >> > > >>> Both belong to the same module but with different packages >> > > >>> (api.java.hadoop.mapred and api.java.hadoop.mapreduce) >> > > >>> api.java.hadoop.mapreduce.utils.HadoopUtils >> > > >>> flink-sequeunce-file >> > > >>> formats.sequeuncefile.SerializableHadoopConfiguration Currently, >> > > >>> it is used at formats.sequencefile.SequenceFileWriterFactory but >> can >> > > also >> > > >>> be used at HadoopCompressionBulkWriter, a potential OrcBulkWriter >> and >> > > >>> pretty much everywhere to avoid NotSerializableException. >> > > >>> >> > > >>> *Proposal* >> > > >>> To summarise, I believe we can create a new module >> > (flink-hadoop-utils >> > > ?) >> > > >>> and move these reusable components to this new module which will >> have >> > > an >> > > >>> optional/provided dependency on flink-shaded-hadoop-2. >> > > >>> >> > > >>> *Structure* >> > > >>> In the present form, I think we will have two classes with the >> > > packaging >> > > >>> structure being *org.apache.flink.hadoop.[utils/serialization]* >> > > >>> 1. HadoopUtils with all static methods ( after combining and >> > > eliminating >> > > >>> the duplicate code fragments from the three HadoopUtils classes >> > > mentioned >> > > >>> above) >> > > >>> 2. Move the existing SerializableHadoopConfiguration from the >> > > >>> flink-sequence-file to this new module . 
>> > > >>> >> > > >>> *Justification* >> > > >>> * With this change, we would be stripping away the dependency on >> > > >>> flink-hadoop-fs from flink-runtime as I don't see any other >> classes >> > > from >> > > >>> flink-hadoop-fs is being used anywhere in flink-runtime module. >> > > >>> * We will have a common place where all the utilities related to >> > Hadoop >> > > >> can >> > > >>> go which can be reused easily without leading to jar hell. >> > > >>> >> > > >>> In addition to this, if you are aware of any other classes that >> fit >> > in >> > > >> this >> > > >>> approach, please share the details here. >> > > >>> >> > > >>> *Note* >> > > >>> I don't have a complete understanding here but I did see two >> > > >>> implementations of the following classes under two different >> packages >> > > >>> *.mapred and *.mapreduce. >> > > >>> * HadoopInputFormat >> > > >>> * HadoopInputFormatBase >> > > >>> * HadoopOutputFormat >> > > >>> * HadoopOutputFormatBase >> > > >>> >> > > >>> Can we somehow figure and have them in this new module? >> > > >>> >> > > >>> Thanks, >> > > >>> Sivaprasanna >> > > >>> >> > > >>> [1] >> > > >>> >> > > >>> >> > > >> >> > > >> > >> https://lists.apache.org/thread.html/r198f09496ba46885adbcc41fe778a7a34ad1cd685eeae8beb71e6fbb%40%3Cdev.flink.apache.org%3E >> > > >> > > >> > > >> > >> > |
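The proposal's "optional/provided dependency on flink-shaded-hadoop-2" for the suggested flink-hadoop-utils module might look roughly like the following pom fragment. The coordinates and version properties here are illustrative assumptions, not the project's actual build configuration:

```xml
<!-- Hypothetical dependency section for the proposed flink-hadoop-utils
     module; artifact coordinates and versions are illustrative only. -->
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-shaded-hadoop-2</artifactId>
    <version>${hadoop.version}-${flink.shaded.version}</version>
    <!-- 'provided' keeps Hadoop off the transitive classpath, so
         downstream users bring their own Hadoop and avoid jar hell -->
    <scope>provided</scope>
</dependency>
```

Marking the dependency `provided` (or `optional`) is what lets other modules depend on the utilities without pulling a Hadoop distribution onto every user's classpath.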
Thanks a lot! Let's continue the discussion in the ticket! (I might not be
able to respond there before Monday.)