Hi Stephan,

Good idea. Just like Hadoop, we can have flink-shaded-hive-uber. With one or two such pre-bundled jars, the startup of the Hive integration becomes very simple; users just add these dependencies:
- flink-connector-hive.jar
- flink-shaded-hive-uber-<version>.jar

Some changes are needed, but I think it should work.

Another thing: can we put flink-connector-hive.jar into flink/lib? It should be clean, with no extra dependencies.

Best,
Jingsong Lee

On Thu, Feb 6, 2020 at 7:13 PM Stephan Ewen <[hidden email]> wrote: > Hi Jingsong! > > This sounds that with two pre-bundled versions (hive 1.2.1 and hive 2.3.6) > you can cover a lot of versions. > > Would it make sense to add these to flink-shaded (with proper dependency > exclusions of unnecessary dependencies) and offer them as a download, > similar as we offer pre-shaded Hadoop downloads? > > Best, > Stephan > > > On Thu, Feb 6, 2020 at 10:26 AM Jingsong Li <[hidden email]> > wrote: > >> Hi Stephan, >> >> The hive/lib/ has many jars, this lib is for execution, metastore, hive >> client and all things. >> What we really depend on is hive-exec.jar. (hive-metastore.jar is also >> required in the low version hive) >> And hive-exec.jar is a uber jar. We just want half classes of it. These >> half classes are not so clean, but it is OK to have them. >> >> Our solution now: >> - exclude hive jars from build >> - provide 8 versions dependencies way, user choose by his hive version.[1] >> >> Spark's solution: >> - build-in hive 1.2.1 dependencies to support hive 0.12.0 through 2.3.3. >> [2] >> - hive-exec.jar is hive-exec.spark.jar, Spark has modified the >> hive-exec build pom to exclude unnecessary classes including Orc and >> parquet. >> - build-in orc and parquet dependencies to optimizer performance. >> - support hive version 2.3.3 upper by "mvn install -Phive-2.3", to >> built-in hive-exec-2.3.6.jar. It seems that since this version, hive's API >> has been seriously incompatible. >> Most of the versions used by users are hive 0.12.0 through 2.3.3. So the >> default build of Spark is good to most of users. >> >> Presto's solution: >> - Built-in presto's hive.[3] Shade hive classes instead of thrift classes. >> - Rewrite some client related code to solve kinds of issues. >> This approach is the heaviest, but also the cleanest. It can support all >> kinds of hive versions with one build. >> >> So I think we can do: >> >> - The eight versions we now maintain are too many. I think we can move >> forward in the direction of Presto/Spark and try to reduce dependencies >> versions. >> >> - As your said, about provide fat/uber jars or helper script, I prefer >> uber jars, user can download one jar to their startup. Just like Kafka. >> >> [1] >> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies >> [2] >> https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore >> [3] https://github.com/prestodb/presto-hive-apache >> >> Best, >> Jingsong Lee >> >> On Wed, Feb 5, 2020 at 10:15 PM Stephan Ewen <[hidden email]> wrote: >> >>> Some thoughts about other options we have: >>> >>> - Put fat/shaded jars for the common versions into "flink-shaded" and >>> offer them for download on the website, similar to pre-bundles Hadoop >>> versions. >>> >>> - Look at the Presto code (Metastore protocol) and see if we can reuse >>> that >>> >>> - Have a setup helper script that takes the versions and pulls the >>> required dependencies. 
>>> >>> Can you share how can a "built-in" dependency could work, if there are >>> so many different conflicting versions? >>> >>> Thanks, >>> Stephan >>> >>> >>> On Tue, Feb 4, 2020 at 12:59 PM Rui Li <[hidden email]> wrote: >>> >>>> Hi Stephan, >>>> >>>> As Jingsong stated, in our documentation the recommended way to add Hive >>>> deps is to use exactly what users have installed. It's just we ask >>>> users to >>>> manually add those jars, instead of automatically find them based on env >>>> variables. I prefer to keep it this way for a while, and see if there're >>>> real concerns/complaints from user feedbacks. >>>> >>>> Please also note the Hive jars are not the only ones needed to integrate >>>> with Hive, users have to make sure flink-connector-hive and Hadoop jars >>>> are >>>> in classpath too. So I'm afraid a single "HIVE" env variable wouldn't >>>> save >>>> all the manual work for our users. >>>> >>>> On Tue, Feb 4, 2020 at 5:54 PM Jingsong Li <[hidden email]> >>>> wrote: >>>> >>>> > Hi all, >>>> > >>>> > For your information, we have document the dependencies detailed >>>> > information [1]. I think it's a lot clearer than before, but it's >>>> worse >>>> > than presto and spark (they avoid or have built-in hive dependency). >>>> > >>>> > I thought about Stephan's suggestion: >>>> > - The hive/lib has 200+ jars, but we only need hive-exec.jar or plus >>>> two >>>> > or three jars, if so many jars are introduced, maybe will there be a >>>> big >>>> > conflict. >>>> > - And hive/lib is not available on every machine. We need to upload so >>>> > many jars. >>>> > - A separate classloader maybe hard to work too, our >>>> flink-connector-hive >>>> > need hive jars, we may need to deal with flink-connector-hive jar >>>> spacial >>>> > too. >>>> > CC: Rui Li >>>> > >>>> > I think the best system to integrate with hive is presto, which only >>>> > connects hive metastore through thrift protocol. But I understand >>>> that it >>>> > costs a lot to rewrite the code. >>>> > >>>> > [1] >>>> > >>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies >>>> > >>>> > Best, >>>> > Jingsong Lee >>>> > >>>> > On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <[hidden email]> wrote: >>>> > >>>> >> We have had much trouble in the past from "too deep too custom" >>>> >> integrations that everyone got out of the box, i.e., Hadoop. >>>> >> Flink has has such a broad spectrum of use cases, if we have custom >>>> build >>>> >> for every other framework in that spectrum, we'll be in trouble. >>>> >> >>>> >> So I would also be -1 for custom builds. >>>> >> >>>> >> Couldn't we do something similar as we started doing for Hadoop? >>>> Moving >>>> >> away from convenience downloads to allowing users to "export" their >>>> setup >>>> >> for Flink? >>>> >> >>>> >> - We can have a "hive module (loader)" in flink/lib by default >>>> >> - The module loader would look for an environment variable like >>>> >> "HIVE_CLASSPATH" and load these classes (ideally in a separate >>>> >> classloader). >>>> >> - The loader can search for certain classes and instantiate >>>> catalog / >>>> >> functions / etc. when finding them instantiates the hive module >>>> >> referencing >>>> >> them >>>> >> - That way, we use exactly what users have installed, without >>>> needing to >>>> >> build our own bundles. >>>> >> >>>> >> Could that work? 
>>>> >> >>>> >> Best, >>>> >> Stephan >>>> >> >>>> >> >>>> >> On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <[hidden email]> >>>> >> wrote: >>>> >> >>>> >> > Couldn't it simply be documented which jars are in the convenience >>>> jars >>>> >> > which are pre built and can be downloaded from the website? Then >>>> people >>>> >> who >>>> >> > need a custom version know which jars they need to provide to >>>> Flink? >>>> >> > >>>> >> > Cheers, >>>> >> > Till >>>> >> > >>>> >> > On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <[hidden email]> >>>> wrote: >>>> >> > >>>> >> > > I'm not sure providing an uber jar would be possible. >>>> >> > > >>>> >> > > Different from kafka and elasticsearch connector who have >>>> dependencies >>>> >> > for >>>> >> > > a specific kafka/elastic version, or the kafka universal >>>> connector >>>> >> that >>>> >> > > provides good compatibilities, hive connector needs to deal with >>>> hive >>>> >> > jars >>>> >> > > in all 1.x, 2.x, 3.x versions (let alone all the HDP/CDH >>>> >> distributions) >>>> >> > > with incompatibility even between minor versions, different >>>> versioned >>>> >> > > hadoop and other extra dependency jars for each hive version. >>>> >> > > >>>> >> > > Besides, users usually need to be able to easily see which >>>> individual >>>> >> > jars >>>> >> > > are required, which is invisible from an uber jar. Hive users >>>> already >>>> >> > have >>>> >> > > their hive deployments. They usually have to use their own hive >>>> jars >>>> >> > > because, unlike hive jars on mvn, their own jars contain changes >>>> >> in-house >>>> >> > > or from vendors. They need to easily tell which jars Flink >>>> requires >>>> >> for >>>> >> > > corresponding open sourced hive version to their own hive >>>> deployment, >>>> >> and >>>> >> > > copy in-hosue jars over from hive deployments as replacements. >>>> >> > > >>>> >> > > Providing a script to download all the individual jars for a >>>> specified >>>> >> > hive >>>> >> > > version can be an alternative. >>>> >> > > >>>> >> > > The goal is we need to provide a *product*, not a technology, to >>>> make >>>> >> it >>>> >> > > less hassle for Hive users. Afterall, it's Flink embracing Hive >>>> >> community >>>> >> > > and ecosystem, not the other way around. I'd argue Hive >>>> connector can >>>> >> be >>>> >> > > treat differently because its community/ecosystem/userbase is >>>> much >>>> >> larger >>>> >> > > than the other connectors, and it's way more important than other >>>> >> > > connectors to Flink on the mission of becoming a batch/streaming >>>> >> unified >>>> >> > > engine and get Flink more widely adopted. >>>> >> > > >>>> >> > > >>>> >> > > On Sun, Dec 15, 2019 at 10:03 PM Danny Chan < >>>> [hidden email]> >>>> >> > wrote: >>>> >> > > >>>> >> > > > Also -1 on separate builds. >>>> >> > > > >>>> >> > > > After referencing some other BigData engines for >>>> distribution[1], i >>>> >> > > didn't >>>> >> > > > find strong needs to publish a separate build >>>> >> > > > for just a separate Hive version, indeed there are builds for >>>> >> different >>>> >> > > > Hadoop version. >>>> >> > > > >>>> >> > > > Just like Seth and Aljoscha said, we could push a >>>> >> > > > flink-hive-version-uber.jar to use as a lib of SQL-CLI or >>>> other use >>>> >> > > cases. 
>>>> >> > > > >>>> >> > > > [1] https://spark.apache.org/downloads.html >>>> >> > > > [2] >>>> >> > > >>>> >> >>>> https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html >>>> >> > > > >>>> >> > > > Best, >>>> >> > > > Danny Chan >>>> >> > > > On Dec 14, 2019 at 3:03 AM +0800, [hidden email] wrote: >>>> >> > > > > >>>> >> > > > > >>>> >> > > > >>>> >> > > >>>> >> > >>>> >> >>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies >>>> >> > > > >>>> >> > > >>>> >> > >>>> >> >>>> > >>>> > >>>> > -- >>>> > Best, Jingsong Lee >>>> > >>>> >>> >> >> -- >> Best, Jingsong Lee >> > -- Best, Jingsong Lee
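A minimal sketch of what that setup could look like from the user side, assuming flink-connector-hive.jar and a pre-bundled flink-shaded-hive-uber jar are dropped into flink/lib (plus the usual Hadoop classpath). The catalog name, default database, Hive conf dir, and Hive version below are placeholders for illustration, not values fixed anywhere in this thread:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveQuickStart {
    public static void main(String[] args) {
        // batch Table environment; the Hive classes come from the jars in flink/lib
        TableEnvironment tableEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inBatchMode().build());

        // placeholders: point these at your own Hive conf dir and version
        HiveCatalog hive = new HiveCatalog("myhive", "default", "/opt/hive/conf", "2.3.6");

        tableEnv.registerCatalog("myhive", hive);
        tableEnv.useCatalog("myhive");

        // list the Hive tables visible through the catalog
        for (String table : tableEnv.listTables()) {
            System.out.println(table);
        }
    }
}

With the two pre-bundled jars in place, the user never has to enumerate individual Hive jars; the per-version jar lists in [1] would only matter for people who must match an in-house or vendor build.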
IIRC, Guowei wants to work on supporting Table API connectors in Plugins.
With that, we could have the Hive dependency as a plugin, avoiding dependency conflicts.

On Thu, Feb 6, 2020 at 1:11 PM Jingsong Li <[hidden email]> wrote:

> Hi Stephan,
>
> Good idea. Just like Hadoop, we can have flink-shaded-hive-uber. With one
> or two such pre-bundled jars, the startup of the Hive integration becomes
> very simple; users just add these dependencies:
> - flink-connector-hive.jar
> - flink-shaded-hive-uber-<version>.jar
>
> Some changes are needed, but I think it should work.
>
> Another thing: can we put flink-connector-hive.jar into flink/lib? It
> should be clean, with no extra dependencies.
>
> Best,
> Jingsong Lee
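As a rough illustration of the module-loader / plugin direction (Stephan's HIVE_CLASSPATH idea earlier in the thread, and the plugin approach mentioned above): discover the user's Hive jars from an environment variable, load them in a separate classloader, and only activate Hive support if the expected classes are actually present. This is a sketch under those assumptions, not existing Flink code; the class name and the choice of probe class are illustrative:

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

public final class HiveModuleLoader {

    // Builds a separate classloader from the jars/dirs listed in HIVE_CLASSPATH.
    public static ClassLoader createHiveClassLoader() throws Exception {
        String hiveClasspath = System.getenv("HIVE_CLASSPATH");
        if (hiveClasspath == null || hiveClasspath.isEmpty()) {
            return null; // the user did not export a Hive setup
        }
        List<URL> urls = new ArrayList<>();
        for (String entry : hiveClasspath.split(File.pathSeparator)) {
            File file = new File(entry);
            if (file.isDirectory()) {
                // add every jar found in the directory
                File[] jars = file.listFiles((dir, name) -> name.endsWith(".jar"));
                if (jars != null) {
                    for (File jar : jars) {
                        urls.add(jar.toURI().toURL());
                    }
                }
            } else if (file.isFile()) {
                urls.add(file.toURI().toURL());
            }
        }
        // keep the Hive jars isolated in their own classloader, parented to Flink's
        return new URLClassLoader(urls.toArray(new URL[0]),
                HiveModuleLoader.class.getClassLoader());
    }

    // Probes for a well-known Hive class before instantiating any catalog or functions.
    public static boolean hiveAvailable(ClassLoader hiveLoader) {
        try {
            Class.forName("org.apache.hadoop.hive.conf.HiveConf", false, hiveLoader);
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        }
    }
}

A plugin, as suggested above, would get the same isolation through Flink's plugins directory instead of an environment variable, but the basic classloading idea is the same.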