Hi all,
I want to propose having a couple of separate Flink distributions bundled with dependencies for specific Hive versions (2.3.4 and 1.2.1). The distributions would be provided to users on the Flink download page [1].

A few reasons to do this:

1) Flink-Hive integration is important to many Flink and Hive users in two dimensions:
    a) For Flink metadata: HiveCatalog is the only persistent catalog for managing Flink tables. With Flink 1.10 supporting more DDL, a persistent catalog will play an even more critical role in users' workflows.
    b) For Flink data: the Hive data connector (source/sink) helps both Flink and Hive users unlock new use cases in streaming, near-realtime/realtime data warehousing, backfill, etc.

2) Currently users have to go through a *really* tedious process to get started, because it requires lots of extra jars (see [2]) that are absent from Flink's lean distribution. We've had many users, from the public mailing list, private email, and DingTalk groups, who got frustrated spending lots of time figuring out the jars themselves. They would rather have an "out of the box" quickstart experience and play with the catalog and source/sink without hassle.

3) It becomes easier for users to swap in dependencies for their own Hive version: just replace those jars with the right versions, with no need to hunt through the docs.

* Hive 2.3.4 and 1.2.1 are two versions that represent a large share of the user base out there, which is why we use them as examples for dependencies in [2], even though we now support almost all Hive versions [3].

I want to hear what the community thinks about this, and how to achieve it if we believe that's the way to go.

Cheers,
Bowen

[1] https://flink.apache.org/downloads.html
[2] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
[3] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#supported-hive-versions
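For context, once the right jars are in place, registering a HiveCatalog in the SQL client is only a few lines of `sql-client-defaults.yaml`. This is a sketch, not an endorsement of any packaging option; the catalog name, conf path, and version are illustrative:

```yaml
catalogs:
  - name: myhive
    type: hive
    hive-conf-dir: /opt/hive/conf   # directory containing hive-site.xml
    hive-version: 2.3.4
```

The configuration itself is short; the pain discussed in this thread is entirely in assembling the jars that make it work.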
cc user ML in case anyone wants to chime in
On Fri, Dec 13, 2019 at 00:44 Bowen Li <[hidden email]> wrote:
> [snip]
+1, this is definitely necessary for a better user experience. Setting up the environment is always painful for many big data tools.

On Fri, Dec 13, 2019 at 5:02 PM Bowen Li <[hidden email]> wrote:
> [snip]

--
Best Regards

Jeff Zhang
Hi Bowen~
Thanks for driving this. I tried using the SQL client with the Hive connector about two weeks ago, and in my experience it's painful to set up the environment.

+1 for this proposal.

Best,
Terry Wang

> On Dec 13, 2019, at 16:44, Bowen Li <[hidden email]> wrote:
> [snip]
Hi Bowen,
Thanks for driving this. +1 for the proposal. Because of our multi-version support, users have to pull in different dependencies per Hive version, which does break the "out of the box" experience. And now that the client defaults to child-first classloader resolution, the bar for getting user dependencies right is even higher; this has already led to a bug when running Hive jobs with the dependencies from the documentation [1]. It is really hard to use.

Some further thoughts:
- I think we can keep the user's jar as thin as possible by providing appropriate excludes. Transferring fat jars sometimes consumes a lot of resources and time.
- Why not also add one for Hive version 3?

[1] https://issues.apache.org/jira/browse/FLINK-14849

Best,
Jingsong Lee

On Fri, Dec 13, 2019 at 5:12 PM Terry Wang <[hidden email]> wrote:
> [snip]
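The "appropriate excludes" idea could look roughly like this in a user's pom.xml. This is a hypothetical sketch; the version is illustrative, and it assumes the cluster's lib/ directory already carries the Hive connector so the user jar doesn't have to:

```xml
<!-- Sketch: mark the Hive connector as provided so it is NOT bundled
     into the user's fat jar; the cluster's lib/ supplies it at runtime. -->
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-hive_2.11</artifactId>
  <version>1.10.0</version>
  <scope>provided</scope>
</dependency>
```

With `provided` scope, the user jar shipped to the cluster stays thin, addressing the transfer-cost concern above.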
-1
We shouldn't need to deploy additional binaries to make a feature remotely usable. That usually points to something else being done incorrectly.

If it is indeed such a hassle to set up Hive on Flink, then my conclusion would be that either
a) the documentation needs to be improved,
b) the architecture needs to be improved,
or, if all else fails, c) we provide a utility script to make setup easier.

We spent a lot of time reducing the number of binaries in the Hadoop days, and we also went to extra lengths to avoid a separate Java 11 binary. I see no reason why Hive should get special treatment on this matter.

Regards,
Chesnay

On 13/12/2019 09:44, Bowen Li wrote:
> [snip]
I'm generally not opposed to convenience binaries if a huge number of people would benefit from them and the overhead for the Flink project is low. I haven't seen huge demand for such binaries yet (nor for the Flink + Hive integration). Looking at Apache Spark, they also offer convenience binaries only for Hadoop.

Maybe we could provide a "Docker Playground" for Flink + Hive in the documentation (and in the flink-playgrounds.git repo)? Similar to
https://ci.apache.org/projects/flink/flink-docs-master/getting-started/docker-playgrounds/flink-operations-playground.html

On Fri, Dec 13, 2019 at 3:04 PM Chesnay Schepler <[hidden email]> wrote:
> [snip]
I'm also -1 on separate builds.
What about publishing convenience jars that bundle the dependencies for each version? For example, there could be a flink-hive-1.2.1-uber.jar that users could simply drop into their lib folder, containing all the dependencies necessary to connect to that Hive version.

On Fri, Dec 13, 2019 at 8:50 AM Robert Metzger <[hidden email]> wrote:
> [snip]
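The workflow being proposed would amount to something like the following. This is a hypothetical sketch: the jar name and Flink directory are illustrative, and a placeholder file stands in for the real artifact, which would come from the Flink downloads page:

```shell
# Sketch of the single-uber-jar workflow: one download into lib/
# replaces the current multi-jar scavenger hunt.
FLINK_HOME=./flink-1.10.0
mkdir -p "$FLINK_HOME/lib"
# Placeholder standing in for the downloaded uber jar.
touch "$FLINK_HOME/lib/flink-hive-1.2.1-uber.jar"
# lib/ now holds everything the Hive catalog and connector need.
ls "$FLINK_HOME/lib"
```

From the user's perspective this is one file to place (or swap out for a different Hive version) rather than a list of transitive dependencies to assemble by hand.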
I was going to suggest the same thing as Seth. So yes, I'm against Flink distributions that contain Hive, but in favor of convenience downloads like the ones we have for Hadoop.

Best,
Aljoscha

> On 13. Dec 2019, at 18:04, Seth Wiesman <[hidden email]> wrote:
> [snip]
I agree with Seth and Aljoscha and think that is the right way to go.

We already provide uber jars for Kafka and Elasticsearch for an out-of-the-box experience; see the download links on this page [1]. Users can easily download the connector and version they want and drop it into the SQL CLI lib directory. The uber jars contain all the required dependencies and may be shaded, so users can skip building an uber jar themselves. Hive is indeed a "connector" too, and should follow the same path.

Best,
Jark

[1]: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies

On Sat, 14 Dec 2019 at 03:03, Aljoscha Krettek <[hidden email]> wrote:
> [snip]
Thanks all for explaining.
I misunderstood the original proposal. -1 to put them in our distributions +1 to have provide hive uber jars as Seth and Aljoscha advice Hive is just a connector no matter how important it is. So I totally agree that we shouldn't put them in our distributions. We can start offering three uber jars: - flink-sql-connector-hive-1 (uber jar with hive dependent version 1.2.1) - flink-sql-connector-hive-2 (uber jar with hive dependent version 2.3.4) - flink-sql-connector-hive-3 (uber jar with hive dependent version 3.1.1) My understanding is quite enough to users. Best, Jingsong Lee On Sun, Dec 15, 2019 at 12:42 PM Jark Wu <[hidden email]> wrote: > I agree with Seth and Aljoscha and think that is a right way to go. > We already provided uber jars for kafka and elasticsearch for out-of-box, > you can see the download links in this page[1]. > Users can easily to download the connectors and versions they like and drag > to SQL CLI lib directories. The uber jars > contains all the dependencies required and may be shaded. In this way, > users can skip to build a uber jar themselves. > Hive is indeed a "connector" too, and should also follow this way. > > Best, > Jark > > [1]: > > https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies > > On Sat, 14 Dec 2019 at 03:03, Aljoscha Krettek <[hidden email]> > wrote: > > > I was going to suggest the same thing as Seth. So yes, I’m against having > > Flink distributions that contain Hive but for convenience downloads as we > > have for Hadoop. > > > > Best, > > Aljoscha > > > > > On 13. Dec 2019, at 18:04, Seth Wiesman <[hidden email]> wrote: > > > > > > I'm also -1 on separate builds. > > > > > > What about publishing convenience jars that contain the dependencies > for > > > each version? For example, there could be a flink-hive-1.2.1-uber.jar > > that > > > users could just add to their lib folder that contains all the > necessary > > > dependencies to connect to that hive version. 
> > > >
> > > > On Fri, Dec 13, 2019 at 8:50 AM Robert Metzger <[hidden email]> wrote:
> > > >> I'm generally not opposed to convenience binaries, if a huge number of
> > > >> people would benefit from them and the overhead for the Flink project is
> > > >> low. I have not seen a huge demand for such binaries yet (nor for the
> > > >> Flink + Hive integration). Looking at Apache Spark, they also only
> > > >> offer convenience binaries for Hadoop.
> > > >>
> > > >> Maybe we could provide a "Docker Playground" for Flink + Hive in the
> > > >> documentation (and the flink-playgrounds.git repo)? (similar to
> > > >> https://ci.apache.org/projects/flink/flink-docs-master/getting-started/docker-playgrounds/flink-operations-playground.html )
> > > >>
> > > >> On Fri, Dec 13, 2019 at 3:04 PM Chesnay Schepler <[hidden email]> wrote:
> > > >>> -1
> > > >>>
> > > >>> We shouldn't need to deploy additional binaries to have a feature be
> > > >>> remotely usable. This usually points to something else being done incorrectly.
> > > >>>
> > > >>> If it is indeed such a hassle to set up Hive on Flink, then my conclusion
> > > >>> would be that either a) the documentation needs to be improved,
> > > >>> b) the architecture needs to be improved, or, if all else fails,
> > > >>> c) we provide a utility script to make the setup easier.
> > > >>>
> > > >>> We spent a lot of time reducing the number of binaries in the Hadoop
> > > >>> days, and also went to extra lengths to prevent a separate Java 11 binary, and
> > > >>> I see no reason why Hive should get special treatment on this matter.
> > > >>>
> > > >>> Regards,
> > > >>> Chesnay
> > > >>>
> > > >>> On 13/12/2019 09:44, Bowen Li wrote:
> > > >>>> Hi all,
> > > >>>>
> > > >>>> I want to propose to have a couple separate Flink distributions with Hive
> > > >>>> dependencies on specific Hive versions (2.3.4 and 1.2.1). The distributions
> > > >>>> will be provided to users on the Flink download page [1].
> > > >>>>
> > > >>>> [1] https://flink.apache.org/downloads.html
-- Best, Jingsong Lee |
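If the three uber jars proposed above were published, the quickstart could look roughly like this (a sketch only: the artifact names follow the proposal and are not published artifacts, and the download step is left symbolic):

```shell
# Sketch: pick the proposed uber jar matching the local Hive version.
# Artifact names are hypothetical, taken from the proposal above.
HIVE_VERSION="2.3.4"
case "$HIVE_VERSION" in
  1.*) ARTIFACT="flink-sql-connector-hive-1" ;;
  2.*) ARTIFACT="flink-sql-connector-hive-2" ;;
  3.*) ARTIFACT="flink-sql-connector-hive-3" ;;
  *)   echo "unsupported Hive version: $HIVE_VERSION" >&2; exit 1 ;;
esac
# A real script would now fetch the jar; here we only show the intent.
echo "would download ${ARTIFACT}.jar into \$FLINK_HOME/lib"
```

The user would then start the SQL CLI as usual, with the single jar in lib/ replacing today's list of individual dependencies.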
Also -1 on separate builds.
After looking at how some other big data engines handle distributions [1], I didn't find a strong need to publish a separate build just for a specific Hive version; there are, however, builds for different Hadoop versions. Just like Seth and Aljoscha said, we could publish a flink-hive-<version>-uber.jar to use in the lib directory of the SQL CLI, or for other use cases.

[1] https://spark.apache.org/downloads.html
[2] https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html

Best,
Danny Chan
On 2019-12-14 at 3:03 AM +0800, [hidden email] wrote:
> > https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies |
I'm not sure providing an uber jar would be possible.
Different from the Kafka and Elasticsearch connectors, which have dependencies on a specific Kafka/Elasticsearch version (or the universal Kafka connector, which provides good compatibility), the Hive connector needs to deal with Hive jars across all 1.x, 2.x, and 3.x versions (let alone all the HDP/CDH distributions), with incompatibilities even between minor versions, plus a different Hadoop version and other extra dependency jars for each Hive version.

Besides, users usually need to be able to easily see which individual jars are required, which is invisible in an uber jar. Hive users already have their Hive deployments. They usually have to use their own Hive jars because, unlike the Hive jars on Maven, their own jars contain changes made in-house or by vendors. They need to easily tell which jars Flink requires for the corresponding open-source Hive version, and then copy the in-house jars over from their Hive deployments as replacements.

Providing a script that downloads all the individual jars for a specified Hive version could be an alternative.

The goal is that we need to provide a *product*, not a technology, to reduce the hassle for Hive users. After all, it's Flink embracing the Hive community and ecosystem, not the other way around. I'd argue the Hive connector can be treated differently because its community/ecosystem/user base is much larger than that of the other connectors, and it's far more important than other connectors to Flink's mission of becoming a unified batch/streaming engine and getting Flink more widely adopted.

On Sun, Dec 15, 2019 at 10:03 PM Danny Chan <[hidden email]> wrote:
> Also -1 on separate builds. |
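The download-script alternative mentioned above could be sketched roughly as follows. The Maven Central URL layout is real, but the per-version jar lists are illustrative assumptions only; the authoritative lists live in the Flink Hive documentation:

```shell
# Sketch of a helper that resolves the individual jars for a given Hive
# version on Maven Central. The per-version jar lists are illustrative
# assumptions, not the authoritative lists from the Flink docs.
maven_central="https://repo1.maven.org/maven2"

jar_url() {  # jar_url <group-path> <artifact> <version>
  echo "${maven_central}/$1/$2/$3/$2-$3.jar"
}

hive_jars() {  # hive_jars <hive-version> -> prints one URL per jar
  case "$1" in
    2.3.4) jar_url org/apache/hive hive-exec 2.3.4 ;;
    1.2.1) jar_url org/apache/hive hive-exec 1.2.1
           jar_url org/apache/hive hive-metastore 1.2.1 ;;
    *) echo "no jar list for Hive $1" >&2; return 1 ;;
  esac
}

# Usage (not executed here): fetch everything into Flink's lib directory.
# hive_jars 2.3.4 | xargs -n1 wget -P "$FLINK_HOME/lib"
```

Because the script prints the exact jar URLs, users with in-house Hive builds can still see which jars Flink expects and substitute their own versions, addressing the visibility concern raised above.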
Couldn't it simply be documented which jars are in the convenience jars
which are pre-built and can be downloaded from the website? Then people who need a custom version would know which jars they have to provide to Flink.

Cheers,
Till

On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <[hidden email]> wrote:
> I'm not sure providing an uber jar would be possible. |
We have had much trouble in the past from "too deep too custom"
integrations that everyone got out of the box, i.e., Hadoop. Flink has such a broad spectrum of use cases that if we had a custom build for every other framework in that spectrum, we'd be in trouble.

So I would also be -1 on custom builds.

Couldn't we do something similar to what we started doing for Hadoop, moving away from convenience downloads to letting users "export" their existing setup for Flink?

- We can have a "hive module (loader)" in flink/lib by default.
- The module loader would look for an environment variable like "HIVE_CLASSPATH" and load those classes (ideally in a separate classloader).
- The loader can search for certain classes and, when it finds them, instantiate the catalog / functions / etc. and register the Hive module referencing them.
- That way, we use exactly what users have installed, without needing to build our own bundles.

Could that work?

Best,
Stephan

On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <[hidden email]> wrote:
> Couldn't it simply be documented which jars are in the convenience jars
> which are pre built and can be downloaded from the website? |
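From the user's side, the env-variable idea sketched above might look like this. Everything here is hypothetical: "HIVE_CLASSPATH" is a proposed variable name, not an existing Flink feature:

```shell
# Sketch: what a hypothetical "hive module loader" could discover from a
# user-exported HIVE_CLASSPATH. Nothing here is an existing Flink feature;
# the variable name comes from the proposal above.
export HIVE_CLASSPATH="/opt/hive/lib/hive-exec-2.3.4.jar:/opt/hive/lib/hive-metastore-2.3.4.jar"

# The loader would put these jars into a separate classloader and probe
# them for catalog/function classes; here we only list what it would load.
echo "$HIVE_CLASSPATH" | tr ':' '\n' | while read -r jar; do
  echo "would load: $jar"
done
```

The appeal of this design is that the jars come straight from the user's Hive installation, so vendor-patched jars are picked up automatically instead of being shadowed by bundled ones.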
Hi all,
For your information, we have now documented the detailed dependency information [1]. I think it's a lot clearer than before, but still worse than Presto and Spark (they either avoid the Hive dependency or build it in).

I thought about Stephan's suggestion:
- hive/lib has 200+ jars, but we only need hive-exec.jar plus two or three others; if that many jars are introduced, there may be big conflicts.
- hive/lib is not available on every machine, so we would need to upload many jars.
- A separate classloader may also be hard to make work: our flink-connector-hive needs the Hive jars, so we might need to treat the flink-connector-hive jar specially too.
CC: Rui Li

I think the system that integrates best with Hive is Presto, which only talks to the Hive metastore through the Thrift protocol. But I understand that it would cost a lot to rewrite the code.

[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies

Best,
Jingsong Lee

On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <[hidden email]> wrote:
> We have had much trouble in the past from "too deep too custom"
> integrations that everyone got out of the box, i.e., Hadoop.
-- Best, Jingsong Lee |
Hi Stephan,
As Jingsong stated, in our documentation the recommended way to add Hive dependencies is to use exactly what users have installed. It's just that we ask users to add those jars manually, instead of automatically finding them based on env variables. I'd prefer to keep it this way for a while and see whether real concerns/complaints come up in user feedback.

Please also note the Hive jars are not the only ones needed to integrate with Hive; users have to make sure flink-connector-hive and the Hadoop jars are on the classpath too. So I'm afraid a single "HIVE" env variable wouldn't save users all the manual work.

On Tue, Feb 4, 2020 at 5:54 PM Jingsong Li <[hidden email]> wrote:
> Hi all,
>
> For your information, we have document the dependencies detailed
> information [1]. |
Some thoughts about other options we have:
- Put fat/shaded jars for the common versions into "flink-shaded" and offer them for download on the website, similar to the pre-bundled Hadoop versions.
- Look at the Presto code (metastore protocol) and see if we can reuse that.
- Have a setup helper script that takes the versions and pulls the required dependencies.

Can you share how a "built-in" dependency could work, given that there are so many different conflicting versions?

Thanks,
Stephan

On Tue, Feb 4, 2020 at 12:59 PM Rui Li <[hidden email]> wrote:
> Hi Stephan,
>
> As Jingsong stated, in our documentation the recommended way to add Hive
> deps is to use exactly what users have installed. |
Hi Stephan,
The hive/lib/ directory has many jars; it serves execution, the metastore, the Hive client, and everything else. What we really depend on is hive-exec.jar (hive-metastore.jar is also required for older Hive versions). And hive-exec.jar is an uber jar; we only want about half of its classes. Those classes are not cleanly separated, but it is OK to have them.

Our solution now:
- exclude Hive jars from the build
- provide dependency instructions for 8 versions; users choose based on their Hive version [1]

Spark's solution:
- built-in Hive 1.2.1 dependencies to support Hive 0.12.0 through 2.3.3 [2]
- its hive-exec.jar is hive-exec.spark.jar: Spark has modified the hive-exec build pom to exclude unnecessary classes, including ORC and Parquet
- built-in ORC and Parquet dependencies to optimize performance
- Hive versions above 2.3.3 are supported via "mvn install -Phive-2.3", which builds in hive-exec-2.3.6.jar. It seems that starting with this version, Hive's API has become seriously incompatible.
Most users run Hive 0.12.0 through 2.3.3, so Spark's default build is good for most of them.

Presto's solution:
- a built-in fork of Hive [3] that shades Hive classes instead of Thrift classes
- some client-related code rewritten to solve various issues
This approach is the heaviest, but also the cleanest. It can support all Hive versions with a single build.

So I think we can do:
- The eight versions we now maintain are too many. We can move in the direction of Presto/Spark and try to reduce the number of dependency versions.
- As you said, regarding fat/uber jars vs. a helper script: I prefer uber jars, so users can download one jar for their setup, just like with Kafka.
[1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
[2] https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore
[3] https://github.com/prestodb/presto-hive-apache

Best,
Jingsong Lee

On Wed, Feb 5, 2020 at 10:15 PM Stephan Ewen <[hidden email]> wrote:

> Some thoughts about other options we have:
>
> - Put fat/shaded jars for the common versions into "flink-shaded" and
> offer them for download on the website, similar to the pre-bundled Hadoop
> versions.
>
> - Look at the Presto code (metastore protocol) and see if we can reuse
> that.
>
> - Have a setup helper script that takes the versions and pulls the
> required dependencies.
>
> Can you share how a "built-in" dependency could work, if there are so
> many different conflicting versions?
>
> Thanks,
> Stephan
>
> On Tue, Feb 4, 2020 at 12:59 PM Rui Li <[hidden email]> wrote:
>
>> Hi Stephan,
>>
>> As Jingsong stated, in our documentation the recommended way to add Hive
>> deps is to use exactly what users have installed. It's just that we ask
>> users to add those jars manually, instead of finding them automatically
>> based on env variables. I'd prefer to keep it this way for a while and
>> see whether there are real concerns/complaints in user feedback.
>>
>> Please also note the Hive jars are not the only ones needed to integrate
>> with Hive; users have to make sure flink-connector-hive and the Hadoop
>> jars are on the classpath too. So I'm afraid a single "HIVE" env
>> variable wouldn't save all the manual work for our users.
>>
>> On Tue, Feb 4, 2020 at 5:54 PM Jingsong Li <[hidden email]> wrote:
>>
>>> Hi all,
>>>
>>> For your information, we have documented the detailed dependency
>>> information [1]. I think it's a lot clearer than before, but still
>>> worse than Presto and Spark (they avoid or build in the Hive
>>> dependency).
>>>
>>> I thought about Stephan's suggestion:
>>> - hive/lib has 200+ jars, but we only need hive-exec.jar plus two or
>>> three others; if that many jars are introduced, there may be big
>>> conflicts.
>>> - hive/lib is not available on every machine, so we would need to
>>> upload many jars.
>>> - A separate classloader may be hard to make work too: our
>>> flink-connector-hive needs the Hive jars, so we may need to treat the
>>> flink-connector-hive jar specially as well.
>>> CC: Rui Li
>>>
>>> I think the system that integrates best with Hive is Presto, which
>>> connects to the Hive metastore only through the Thrift protocol. But I
>>> understand that rewriting the code would cost a lot.
>>>
>>> [1] https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
>>>
>>> Best,
>>> Jingsong Lee
>>>
>>> On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <[hidden email]> wrote:
>>>
>>>> We have had much trouble in the past with "too deep, too custom"
>>>> integrations that everyone got out of the box, i.e., Hadoop.
>>>> Flink has such a broad spectrum of use cases that if we had a custom
>>>> build for every other framework in that spectrum, we'd be in trouble.
>>>>
>>>> So I would also be -1 for custom builds.
>>>>
>>>> Couldn't we do something similar to what we started doing for Hadoop?
>>>> Moving away from convenience downloads to allowing users to "export"
>>>> their setup for Flink:
>>>>
>>>> - We can have a "hive module (loader)" in flink/lib by default.
>>>> - The module loader would look for an environment variable like
>>>> "HIVE_CLASSPATH" and load these classes (ideally in a separate
>>>> classloader).
>>>> - The loader can search for certain classes and, when it finds them,
>>>> instantiate the Hive module (catalog, functions, etc.) referencing
>>>> them.
>>>> - That way, we use exactly what users have installed, without needing
>>>> to build our own bundles.
>>>>
>>>> Could that work?
>>>> Best,
>>>> Stephan
>>>>
>>>> On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <[hidden email]> wrote:
>>>>
>>>>> Couldn't it simply be documented which jars are in the pre-built
>>>>> convenience jars that can be downloaded from the website? Then people
>>>>> who need a custom version would know which jars they have to provide
>>>>> to Flink.
>>>>>
>>>>> Cheers,
>>>>> Till

>>> --
>>> Best, Jingsong Lee

--
Best, Jingsong Lee
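From the user's side, the "HIVE_CLASSPATH" environment-variable idea quoted above might look something like the sketch below. Note that `HIVE_CLASSPATH` is the hypothetical variable name from the proposal, not an existing Flink setting, and the jar patterns are assumptions.

```shell
# Sketch of the user-side setup for the proposed module loader: collect
# the Hive jars of an existing installation into the hypothetical
# HIVE_CLASSPATH variable that the loader would scan.
build_hive_classpath() {
    hive_home="$1"
    hcp=""
    # Pick up only the jars Flink would need, not all of hive/lib.
    for jar in "$hive_home"/lib/hive-exec-*.jar \
               "$hive_home"/lib/hive-metastore-*.jar; do
        [ -f "$jar" ] && hcp="$hcp:$jar"
    done
    printf '%s\n' "${hcp#:}"   # strip the leading ':'
}

HIVE_CLASSPATH="$(build_hive_classpath "${HIVE_HOME:-/opt/hive}")"
export HIVE_CLASSPATH
```

With something along these lines, the loader Stephan describes could pick up the user's own (possibly vendor-patched) jars without Flink shipping any Hive bits.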
Hi Jingsong!
It sounds like two pre-bundled versions (Hive 1.2.1 and Hive 2.3.6) would cover a lot of versions.

Would it make sense to add these to flink-shaded (with proper exclusions of unnecessary dependencies) and offer them as downloads, similar to how we offer pre-shaded Hadoop downloads?

Best,
Stephan

On Thu, Feb 6, 2020 at 10:26 AM Jingsong Li <[hidden email]> wrote: