[DISCUSS] have separate Flink distributions with built-in Hive dependencies


[DISCUSS] have separate Flink distributions with built-in Hive dependencies

bowen.li
Hi all,

I want to propose having a couple of separate Flink distributions with built-in
Hive dependencies for specific Hive versions (2.3.4 and 1.2.1). The distributions
would be provided to users on the Flink download page [1].

A few reasons to do this:

1) Flink-Hive integration is important to a great many Flink and Hive users, in
two dimensions:
     a) for Flink metadata: HiveCatalog is the only persistent catalog for
managing Flink tables. With Flink 1.10 supporting more DDL, the persistent
catalog will play an even more critical role in users' workflows (see the
sketch after this list)
     b) for Flink data: the Hive data connector (source/sink) helps both Flink
and Hive users unlock new use cases in streaming, near-realtime/realtime
data warehousing, backfill, etc.

2) currently users have to go through a *really* tedious process to get
started, because it requires lots of extra jars (see [2]) that are absent
from Flink's lean distribution. We've had many users on the public mailing
list, in private email, and in DingTalk groups who got frustrated spending
lots of time figuring out the jars themselves. They would rather have a more
"out of the box" quickstart experience, and play with the catalog and
source/sink without hassle.

3) it's easier for users to swap in the Hive dependencies for their own
Hive versions - they just replace those jars with the right versions, with no
need to dig through the docs.

* Hive 2.3.4 and 1.2.1 are two versions that cover a large share of the user
base out there, which is why we use them as examples for dependencies in
[1], even though we now support almost all Hive versions [3].
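
For context, a minimal sketch (using the documented Table API) of what the
quickstart looks like once the required jars are actually in place; the catalog
name, database, conf dir, and version below are placeholder values:

import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.catalog.hive.HiveCatalog;

public class HiveCatalogQuickstart {
    public static void main(String[] args) {
        // Blink planner in batch mode; streaming mode works the same way.
        EnvironmentSettings settings =
                EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build();
        TableEnvironment tableEnv = TableEnvironment.create(settings);

        // Placeholder values: hiveConfDir must point at the directory with hive-site.xml.
        String name            = "myhive";
        String defaultDatabase = "default";
        String hiveConfDir     = "/opt/hive-conf";
        String version         = "2.3.4";

        HiveCatalog hive = new HiveCatalog(name, defaultDatabase, hiveConfDir, version);
        tableEnv.registerCatalog(name, hive);
        tableEnv.useCatalog(name);

        // Tables created from here on are persisted in the Hive Metastore.
    }
}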

I want to hear what the community thinks about this, and how to achieve it
if we believe that's the way to go.

Cheers,
Bowen

[1] https://flink.apache.org/downloads.html
[2]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
[3]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#supported-hive-versions

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

bowen.li
cc user ML in case anyone wants to chime in

On Fri, Dec 13, 2019 at 00:44 Bowen Li <[hidden email]> wrote:

> [quoted text trimmed]

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Jeff Zhang
+1, this is definitely necessary for a better user experience. Setting up the
environment is always painful with many big data tools.



On Fri, Dec 13, 2019 at 5:02 PM, Bowen Li <[hidden email]> wrote:

> [quoted text trimmed]

--
Best Regards

Jeff Zhang

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Terry Wang
In reply to this post by bowen.li
Hi Bowen~

Thanks for driving this. I tried using the SQL client with the Hive connector about two weeks ago, and in my experience it's painful to set up the environment.
+1 for this proposal.

Best,
Terry Wang



> On Dec 13, 2019, at 16:44, Bowen Li <[hidden email]> wrote:
>
> [quoted text trimmed]


Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Jingsong Li
Hi Bowen,

Thanks for driving this.
+1 for this proposal.

Because of our multi-version support, users have to rely on
different dependencies per version, which does break the "out of the box"
experience. Now that the client resolves classes child-first by default,
the requirements on user dependencies are even stricter, which has already
led to a bug when running Hive jobs with the dependencies listed in the
documentation [1].
It is really hard to use.

I have some more thoughts:
- We could keep the user's jar as thin as possible by providing the
appropriate excludes. Sometimes, transferring large jars consumes a lot of
resources and time.
- Why not also add a build for Hive 3?

[1] https://issues.apache.org/jira/browse/FLINK-14849

Best,
Jingsong Lee

On Fri, Dec 13, 2019 at 5:12 PM Terry Wang <[hidden email]> wrote:

> [quoted text trimmed]

--
Best, Jingsong Lee

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Chesnay Schepler-3
In reply to this post by bowen.li
-1

We shouldn't need to deploy additional binaries to have a feature be
remotely usable.
This usually points to something else being done incorrectly.

If it is indeed such a hassle to set up Hive on Flink, then my conclusion
would be that either
a) the documentation needs to be improved,
b) the architecture needs to be improved,
or, if all else fails, c) we provide a utility script that makes setup easier.

We spent a lot of time reducing the number of binaries back in the Hadoop
days, and also went to extra lengths to avoid a separate Java 11 binary, and
I see no reason why Hive should get special treatment on this matter.

Regards,
Chesnay

On 13/12/2019 09:44, Bowen Li wrote:

> [quoted text trimmed]


Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Robert Metzger
I'm generally not opposed to convenience binaries if a huge number of
people would benefit from them and the overhead for the Flink project is
low. I haven't seen a huge demand for such binaries yet (nor for the
Flink + Hive integration). Looking at Apache Spark, they also offer
convenience binaries only for Hadoop.

Maybe we could provide a "Docker Playground" for Flink + Hive in the
documentation (and the flink-playgrounds.git repo)?
(similar to
https://ci.apache.org/projects/flink/flink-docs-master/getting-started/docker-playgrounds/flink-operations-playground.html
)



On Fri, Dec 13, 2019 at 3:04 PM Chesnay Schepler <[hidden email]> wrote:

> [quoted text trimmed]

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Seth Wiesman-4
I'm also -1 on separate builds.

What about publishing convenience jars that contain the dependencies for
each version? For example, there could be a flink-hive-1.2.1-uber.jar,
containing all the necessary dependencies to connect to that Hive version,
that users could just add to their lib folder.


On Fri, Dec 13, 2019 at 8:50 AM Robert Metzger <[hidden email]> wrote:

> [quoted text trimmed]

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Aljoscha Krettek-2
I was going to suggest the same thing as Seth. So yes, I'm against having Flink distributions that contain Hive, but in favor of convenience downloads, as we have for Hadoop.

Best,
Aljoscha

> On 13. Dec 2019, at 18:04, Seth Wiesman <[hidden email]> wrote:
>
> [quoted text trimmed]


Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Jark Wu-2
I agree with Seth and Aljoscha and think that is the right way to go.
We already provide uber jars for Kafka and Elasticsearch for an out-of-the-box
experience; you can see the download links on this page [1].
Users can easily download the connectors and versions they like and drop
them into the SQL CLI lib directory. The uber jars contain all the required
dependencies and may be shaded. In this way, users can skip building an
uber jar themselves.
Hive is indeed a "connector" too, and should follow the same approach.

Best,
Jark

[1]:
https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies

On Sat, 14 Dec 2019 at 03:03, Aljoscha Krettek <[hidden email]> wrote:

> [quoted text trimmed]

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Jingsong Li
Thanks all for explaining.

I misunderstood the original proposal.
-1 to putting them in our distributions
+1 to providing Hive uber jars, as Seth and Aljoscha suggest

Hive is just a connector, no matter how important it is,
so I totally agree that we shouldn't put its dependencies in our distributions.
We could start by offering three uber jars:
- flink-sql-connector-hive-1 (uber jar built against Hive 1.2.1)
- flink-sql-connector-hive-2 (uber jar built against Hive 2.3.4)
- flink-sql-connector-hive-3 (uber jar built against Hive 3.1.1)
In my understanding, that would be quite enough for users.

Best,
Jingsong Lee

On Sun, Dec 15, 2019 at 12:42 PM Jark Wu <[hidden email]> wrote:

> [quoted text trimmed]


--
Best, Jingsong Lee

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Danny Chan
In reply to this post by Aljoscha Krettek-2
Also -1 on separate builds.

After looking at how some other big data engines handle distributions [1], I didn't find a strong need to publish a separate build
for just a particular Hive version, though there are indeed builds for different Hadoop versions.

Just like Seth and Aljoscha said, we could publish a flink-hive-version-uber.jar to use as a lib for the SQL CLI or other use cases.

[1] https://spark.apache.org/downloads.html
[2] https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html

Best,
Danny Chan
On Dec 14, 2019, 3:03 AM +0800, [hidden email] wrote:
>
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

bowen.li
I'm not sure providing an uber jar would be possible.

Unlike the Kafka and Elasticsearch connectors, which depend on a specific
Kafka/Elasticsearch version, or the universal Kafka connector, which provides
good compatibility, the Hive connector needs to deal with Hive jars across
all 1.x, 2.x, and 3.x versions (let alone all the HDP/CDH distributions),
with incompatibilities even between minor versions, plus differently versioned
Hadoop and other extra dependency jars for each Hive version.

Besides, users usually need to be able to easily see which individual jars
are required, which is hidden in an uber jar. Hive users already have
their Hive deployments, and they usually have to use their own Hive jars
because, unlike the Hive jars on Maven, their own jars contain in-house or
vendor changes. They need to easily map the jars Flink requires for the
corresponding open source Hive version to their own Hive deployment, and
copy the in-house jars over from their Hive deployment as replacements.

Providing a script that downloads all the individual jars for a specified
Hive version could be an alternative (a rough sketch follows).
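
For illustration only, a hypothetical sketch of such a helper; the Maven
coordinates and the single hive-exec entry are placeholders, not the actual
per-version dependency list from the docs:

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical helper that downloads the individual jars for a given Hive
 * version into Flink's lib/ directory. A real script would carry the full
 * per-Hive-version artifact list from the documentation.
 */
public class DownloadHiveDeps {

    private static final String MAVEN_CENTRAL = "https://repo1.maven.org/maven2/";

    public static void main(String[] args) throws Exception {
        String hiveVersion = args.length > 0 ? args[0] : "2.3.4";
        Path libDir = Paths.get(args.length > 1 ? args[1] : "lib");
        Files.createDirectories(libDir);

        // groupId/artifactId -> version (illustrative; extend per Hive version).
        Map<String, String> artifacts = new LinkedHashMap<>();
        artifacts.put("org/apache/hive/hive-exec", hiveVersion);

        for (Map.Entry<String, String> e : artifacts.entrySet()) {
            String path = e.getKey();
            String version = e.getValue();
            String jarName = path.substring(path.lastIndexOf('/') + 1) + "-" + version + ".jar";
            String url = MAVEN_CENTRAL + path + "/" + version + "/" + jarName;
            System.out.println("Downloading " + url);
            try (InputStream in = new URL(url).openStream()) {
                Files.copy(in, libDir.resolve(jarName), StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}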

The goal is that we need to provide a *product*, not a technology, to make it
less of a hassle for Hive users. After all, it's Flink embracing the Hive
community and ecosystem, not the other way around. I'd argue the Hive
connector can be treated differently because its community/ecosystem/user base
is much larger than that of the other connectors, and it's way more important
than other connectors to Flink's mission of becoming a unified batch/streaming
engine and getting Flink more widely adopted.


On Sun, Dec 15, 2019 at 10:03 PM Danny Chan <[hidden email]> wrote:

> [quoted text trimmed]

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Till Rohrmann
Couldn't we simply document which jars are contained in the pre-built
convenience jars that can be downloaded from the website? Then people who
need a custom version would know which jars they have to provide to Flink.

Cheers,
Till

On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <[hidden email]> wrote:

> [quoted text trimmed]

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Stephan Ewen
We have had much trouble in the past from "too deep, too custom"
integrations that everyone got out of the box, e.g., Hadoop.
Flink has such a broad spectrum of use cases that if we had a custom build
for every other framework in that spectrum, we'd be in trouble.

So I would also be -1 for custom builds.

Couldn't we do something similar to what we started doing for Hadoop? Moving
away from convenience downloads to letting users "export" their setup
for Flink?

  - We could have a "hive module (loader)" in flink/lib by default
  - The module loader would look for an environment variable like
"HIVE_CLASSPATH" and load those classes (ideally in a separate classloader).
  - The loader can search for certain classes and, when it finds them,
instantiate the catalog / functions / etc. and the Hive module referencing
them
  - That way, we use exactly what users have installed, without needing to
build our own bundles (a rough sketch of the idea follows below this message).

Could that work?

Best,
Stephan
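
To make the loader idea above concrete, a minimal, hypothetical Java sketch;
the "HIVE_CLASSPATH" variable name and the probe class are assumptions for
illustration, not an existing Flink API:

import java.io.File;
import java.net.URL;
import java.net.URLClassLoader;
import java.util.ArrayList;
import java.util.List;

/**
 * Sketch of the "hive module (loader)" idea: build a child classloader from
 * the jars on a HIVE_CLASSPATH env variable (an assumed name) and only
 * activate the Hive integration if the expected classes are actually present.
 */
public final class HiveModuleLoader {

    public static ClassLoader loadHiveClasspath() throws Exception {
        String hiveClasspath = System.getenv("HIVE_CLASSPATH");
        if (hiveClasspath == null || hiveClasspath.isEmpty()) {
            return null; // Hive integration stays disabled.
        }

        List<URL> urls = new ArrayList<>();
        for (String entry : hiveClasspath.split(File.pathSeparator)) {
            urls.add(new File(entry).toURI().toURL());
        }

        // A child classloader keeps Hive's transitive deps out of Flink's own classpath.
        URLClassLoader hiveLoader =
                new URLClassLoader(urls.toArray(new URL[0]), HiveModuleLoader.class.getClassLoader());

        // Probe for a well-known Hive class before instantiating catalog / functions.
        Class.forName("org.apache.hadoop.hive.conf.HiveConf", false, hiveLoader);
        return hiveLoader;
    }
}

Whether flink-connector-hive itself could work against such a child
classloader is exactly the concern Jingsong raises later in this thread.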


On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <[hidden email]> wrote:

> [quoted text trimmed]

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Jingsong Li
Hi all,

For your information, we have now documented the dependencies in detail
[1]. I think it's a lot clearer than before, but it's still worse than
Presto and Spark (they either avoid the Hive dependency or build it in).

I thought about Stephan's suggestion:
- hive/lib contains 200+ jars, but we only need hive-exec.jar plus two or
three others; if that many jars are introduced, there may be big conflicts.
- hive/lib is not available on every machine, so we would need to upload a
lot of jars.
- A separate classloader may be hard to make work too: flink-connector-hive
itself needs the Hive jars, so we may need to treat the flink-connector-hive
jar specially as well.
CC: Rui Li

I think the system that integrates best with Hive is Presto, which only
talks to the Hive metastore through the Thrift protocol. But I understand
that rewriting the code that way would cost a lot.

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies

Best,
Jingsong Lee

On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <[hidden email]> wrote:

> [quoted text trimmed]


--
Best, Jingsong Lee

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Rui Li-2
Hi Stephan,

As Jingsong stated, the way our documentation recommends adding the Hive
deps is to use exactly what users have installed; we just ask users to add
those jars manually instead of finding them automatically based on env
variables. I'd prefer to keep it this way for a while and see whether real
concerns/complaints come up in user feedback.

Please also note that the Hive jars are not the only ones needed to integrate
with Hive; users have to make sure the flink-connector-hive and Hadoop jars
are on the classpath too. So I'm afraid a single "HIVE" env variable wouldn't
save our users all the manual work.

On Tue, Feb 4, 2020 at 5:54 PM Jingsong Li <[hidden email]> wrote:

> [quoted text trimmed]
>> be
>> > > treat differently because its community/ecosystem/userbase is much
>> larger
>> > > than the other connectors, and it's way more important than other
>> > > connectors to Flink on the mission of becoming a batch/streaming
>> unified
>> > > engine and get Flink more widely adopted.
>> > >
>> > >
>> > > On Sun, Dec 15, 2019 at 10:03 PM Danny Chan <[hidden email]>
>> > wrote:
>> > >
>> > > > Also -1 on separate builds.
>> > > >
>> > > > After referencing some other BigData engines for distribution[1], i
>> > > didn't
>> > > > find strong needs to publish a separate build
>> > > > for just a separate Hive version, indeed there are builds for
>> different
>> > > > Hadoop version.
>> > > >
>> > > > Just like Seth and Aljoscha said, we could push a
>> > > > flink-hive-version-uber.jar to use as a lib of SQL-CLI or other use
>> > > cases.
>> > > >
>> > > > [1] https://spark.apache.org/downloads.html
>> > > > [2]
>> > >
>> https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
>> > > >
>> > > > Best,
>> > > > Danny Chan
>> > > > On Dec 14, 2019 at 3:03 AM +0800, [hidden email] wrote:
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies
>> > > >
>> > >
>> >
>>
>
>
> --
> Best, Jingsong Lee
>

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Stephan Ewen
Some thoughts about other options we have:

  - Put fat/shaded jars for the common versions into "flink-shaded" and
offer them for download on the website, similar to the pre-bundled Hadoop
versions.

  - Look at the Presto code (Metastore protocol) and see if we can reuse
that

  - Have a setup helper script that takes the versions and pulls the
required dependencies (a rough sketch of this option follows below).
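
For illustration, a rough sketch of what such a helper could look like: map the
requested Hive version to a list of Maven artifacts and drop them into
flink/lib. The artifact coordinates below are placeholders, not the
authoritative set from the documentation.

import java.io.InputStream;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;
import java.util.List;
import java.util.Map;

public class HiveDependencyFetcher {

    // Illustrative mapping only; the real per-version list lives in the Flink docs.
    private static final Map<String, List<String>> ARTIFACTS = Map.of(
            "2.3.4", List.of(
                    "org/apache/flink/flink-connector-hive_2.11/1.10.0/flink-connector-hive_2.11-1.10.0.jar",
                    "org/apache/hive/hive-exec/2.3.4/hive-exec-2.3.4.jar"),
            "1.2.1", List.of(
                    "org/apache/flink/flink-connector-hive_2.11/1.10.0/flink-connector-hive_2.11-1.10.0.jar",
                    "org/apache/hive/hive-exec/1.2.1/hive-exec-1.2.1.jar",
                    "org/apache/hive/hive-metastore/1.2.1/hive-metastore-1.2.1.jar"));

    public static void main(String[] args) throws Exception {
        String hiveVersion = args.length > 0 ? args[0] : "2.3.4";
        Path libDir = Paths.get(args.length > 1 ? args[1] : "lib");
        Files.createDirectories(libDir);

        for (String artifact : ARTIFACTS.getOrDefault(hiveVersion, List.of())) {
            URL url = new URL("https://repo1.maven.org/maven2/" + artifact);
            Path target = libDir.resolve(artifact.substring(artifact.lastIndexOf('/') + 1));
            try (InputStream in = url.openStream()) {
                // Download the jar from Maven Central into the target lib directory.
                Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
            }
            System.out.println("Downloaded " + target);
        }
    }
}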

Can you share how a "built-in" dependency could work, given that there are so
many different conflicting versions?

Thanks,
Stephan


On Tue, Feb 4, 2020 at 12:59 PM Rui Li <[hidden email]> wrote:

> Hi Stephan,
>
> As Jingsong stated, in our documentation the recommended way to add Hive
> deps is to use exactly what users have installed. It's just we ask users to
> manually add those jars, instead of automatically find them based on env
> variables. I prefer to keep it this way for a while, and see if there're
> real concerns/complaints from user feedbacks.
>
> Please also note the Hive jars are not the only ones needed to integrate
> with Hive, users have to make sure flink-connector-hive and Hadoop jars are
> in classpath too. So I'm afraid a single "HIVE" env variable wouldn't save
> all the manual work for our users.
>
> On Tue, Feb 4, 2020 at 5:54 PM Jingsong Li <[hidden email]> wrote:
>
> > Hi all,
> >
> > For your information, we have document the dependencies detailed
> > information [1]. I think it's a lot clearer than before, but it's worse
> > than presto and spark (they avoid or have built-in hive dependency).
> >
> > I thought about Stephan's suggestion:
> > - The hive/lib has 200+ jars, but we only need hive-exec.jar or plus two
> > or three jars, if so many jars are introduced, maybe will there be a big
> > conflict.
> > - And hive/lib is not available on every machine. We need to upload so
> > many jars.
> > - A separate classloader maybe hard to work too, our flink-connector-hive
> > need hive jars, we may need to deal with flink-connector-hive jar spacial
> > too.
> > CC: Rui Li
> >
> > I think the best system to integrate with hive is presto, which only
> > connects hive metastore through thrift protocol. But I understand that it
> > costs a lot to rewrite the code.
> >
> > [1]
> >
> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
> >
> > Best,
> > Jingsong Lee
> >
> > On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <[hidden email]> wrote:
> >
> >> We have had much trouble in the past from "too deep too custom"
> >> integrations that everyone got out of the box, i.e., Hadoop.
> >> Flink has has such a broad spectrum of use cases, if we have custom
> build
> >> for every other framework in that spectrum, we'll be in trouble.
> >>
> >> So I would also be -1 for custom builds.
> >>
> >> Couldn't we do something similar as we started doing for Hadoop? Moving
> >> away from convenience downloads to allowing users to "export" their
> setup
> >> for Flink?
> >>
> >>   - We can have a "hive module (loader)" in flink/lib by default
> >>   - The module loader would look for an environment variable like
> >> "HIVE_CLASSPATH" and load these classes (ideally in a separate
> >> classloader).
> >>   - The loader can search for certain classes and instantiate catalog /
> >> functions / etc. when finding them instantiates the hive module
> >> referencing
> >> them
> >>   - That way, we use exactly what users have installed, without needing
> to
> >> build our own bundles.
> >>
> >> Could that work?
> >>
> >> Best,
> >> Stephan
> >>
> >>
> >> On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <[hidden email]>
> >> wrote:
> >>
> >> > Couldn't it simply be documented which jars are in the convenience
> jars
> >> > which are pre built and can be downloaded from the website? Then
> people
> >> who
> >> > need a custom version know which jars they need to provide to Flink?
> >> >
> >> > Cheers,
> >> > Till
> >> >
> >> > On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <[hidden email]> wrote:
> >> >
> >> > > I'm not sure providing an uber jar would be possible.
> >> > >
> >> > > Different from kafka and elasticsearch connector who have
> dependencies
> >> > for
> >> > > a specific kafka/elastic version, or the kafka universal connector
> >> that
> >> > > provides good compatibilities, hive connector needs to deal with
> hive
> >> > jars
> >> > > in all 1.x, 2.x, 3.x versions (let alone all the HDP/CDH
> >> distributions)
> >> > > with incompatibility even between minor versions, different
> versioned
> >> > > hadoop and other extra dependency jars for each hive version.
> >> > >
> >> > > Besides, users usually need to be able to easily see which
> individual
> >> > jars
> >> > > are required, which is invisible from an uber jar. Hive users
> already
> >> > have
> >> > > their hive deployments. They usually have to use their own hive jars
> >> > > because, unlike hive jars on mvn, their own jars contain changes
> >> in-house
> >> > > or from vendors. They need to easily tell which jars Flink requires
> >> for
> >> > > corresponding open sourced hive version to their own hive
> deployment,
> >> and
> >> > > copy in-hosue jars over from hive deployments as replacements.
> >> > >
> >> > > Providing a script to download all the individual jars for a
> specified
> >> > hive
> >> > > version can be an alternative.
> >> > >
> >> > > The goal is we need to provide a *product*, not a technology, to
> make
> >> it
> >> > > less hassle for Hive users. Afterall, it's Flink embracing Hive
> >> community
> >> > > and ecosystem, not the other way around. I'd argue Hive connector
> can
> >> be
> >> > > treat differently because its community/ecosystem/userbase is much
> >> larger
> >> > > than the other connectors, and it's way more important than other
> >> > > connectors to Flink on the mission of becoming a batch/streaming
> >> unified
> >> > > engine and get Flink more widely adopted.
> >> > >
> >> > >
> >> > > On Sun, Dec 15, 2019 at 10:03 PM Danny Chan <[hidden email]>
> >> > wrote:
> >> > >
> >> > > > Also -1 on separate builds.
> >> > > >
> >> > > > After referencing some other BigData engines for distribution[1],
> i
> >> > > didn't
> >> > > > find strong needs to publish a separate build
> >> > > > for just a separate Hive version, indeed there are builds for
> >> different
> >> > > > Hadoop version.
> >> > > >
> >> > > > Just like Seth and Aljoscha said, we could push a
> >> > > > flink-hive-version-uber.jar to use as a lib of SQL-CLI or other
> use
> >> > > cases.
> >> > > >
> >> > > > [1] https://spark.apache.org/downloads.html
> >> > > > [2]
> >> > >
> >> https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
> >> > > >
> >> > > > Best,
> >> > > > Danny Chan
> >> > > > On Dec 14, 2019 at 3:03 AM +0800, [hidden email] wrote:
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies
> >> > > >
> >> > >
> >> >
> >>
> >
> >
> > --
> > Best, Jingsong Lee
> >
>

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Jingsong Li
Hi Stephan,

The hive/lib/ directory has many jars; it covers execution, the metastore, the
Hive client, and everything else.
What we really depend on is hive-exec.jar (hive-metastore.jar is also
required for lower Hive versions).
And hive-exec.jar is an uber jar; we only want about half of its classes. Those
classes are not so clean, but it is OK to have them.

Our solution now:
- exclude Hive jars from the build
- document the dependency set for 8 Hive versions; users pick the one matching
their Hive version [1]

Spark's solution:
- build in Hive 1.2.1 dependencies to support Hive 0.12.0 through 2.3.3 [2]
    - its hive-exec.jar is actually hive-exec.spark.jar; Spark modified the
hive-exec build pom to exclude unnecessary classes, including ORC and
Parquet
    - ORC and Parquet dependencies are built in to optimize performance
- support Hive versions above 2.3.3 via "mvn install -Phive-2.3", which builds in
hive-exec-2.3.6.jar. It seems that since that version, Hive's API has become
seriously incompatible.
Most users run Hive 0.12.0 through 2.3.3, so the default build of Spark works
for most of them.

Presto's solution:
- Build in Presto's own fork of Hive [3], shading the Hive classes instead of the
Thrift classes.
- Rewrite some client-related code to solve various issues.
This approach is the heaviest, but also the cleanest. It can support all
Hive versions with one build.

So I think we can do the following:

- The eight versions we now maintain are too many. I think we can move
forward in the direction of Presto/Spark and try to reduce the number of
dependency versions we support.

- As you said, between providing fat/uber jars or a helper script, I prefer uber
jars; users can download a single jar into their setup, just like with Kafka.

[1]
https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
[2]
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore
[3] https://github.com/prestodb/presto-hive-apache

Best,
Jingsong Lee

On Wed, Feb 5, 2020 at 10:15 PM Stephan Ewen <[hidden email]> wrote:

> Some thoughts about other options we have:
>
>   - Put fat/shaded jars for the common versions into "flink-shaded" and
> offer them for download on the website, similar to pre-bundles Hadoop
> versions.
>
>   - Look at the Presto code (Metastore protocol) and see if we can reuse
> that
>
>   - Have a setup helper script that takes the versions and pulls the
> required dependencies.
>
> Can you share how can a "built-in" dependency could work, if there are so
> many different conflicting versions?
>
> Thanks,
> Stephan
>
>
> On Tue, Feb 4, 2020 at 12:59 PM Rui Li <[hidden email]> wrote:
>
>> Hi Stephan,
>>
>> As Jingsong stated, in our documentation the recommended way to add Hive
>> deps is to use exactly what users have installed. It's just we ask users
>> to
>> manually add those jars, instead of automatically find them based on env
>> variables. I prefer to keep it this way for a while, and see if there're
>> real concerns/complaints from user feedbacks.
>>
>> Please also note the Hive jars are not the only ones needed to integrate
>> with Hive, users have to make sure flink-connector-hive and Hadoop jars
>> are
>> in classpath too. So I'm afraid a single "HIVE" env variable wouldn't save
>> all the manual work for our users.
>>
>> On Tue, Feb 4, 2020 at 5:54 PM Jingsong Li <[hidden email]>
>> wrote:
>>
>> > Hi all,
>> >
>> > For your information, we have document the dependencies detailed
>> > information [1]. I think it's a lot clearer than before, but it's worse
>> > than presto and spark (they avoid or have built-in hive dependency).
>> >
>> > I thought about Stephan's suggestion:
>> > - The hive/lib has 200+ jars, but we only need hive-exec.jar or plus two
>> > or three jars, if so many jars are introduced, maybe will there be a big
>> > conflict.
>> > - And hive/lib is not available on every machine. We need to upload so
>> > many jars.
>> > - A separate classloader maybe hard to work too, our
>> flink-connector-hive
>> > need hive jars, we may need to deal with flink-connector-hive jar
>> spacial
>> > too.
>> > CC: Rui Li
>> >
>> > I think the best system to integrate with hive is presto, which only
>> > connects hive metastore through thrift protocol. But I understand that
>> it
>> > costs a lot to rewrite the code.
>> >
>> > [1]
>> >
>> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
>> >
>> > Best,
>> > Jingsong Lee
>> >
>> > On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <[hidden email]> wrote:
>> >
>> >> We have had much trouble in the past from "too deep too custom"
>> >> integrations that everyone got out of the box, i.e., Hadoop.
>> >> Flink has has such a broad spectrum of use cases, if we have custom
>> build
>> >> for every other framework in that spectrum, we'll be in trouble.
>> >>
>> >> So I would also be -1 for custom builds.
>> >>
>> >> Couldn't we do something similar as we started doing for Hadoop? Moving
>> >> away from convenience downloads to allowing users to "export" their
>> setup
>> >> for Flink?
>> >>
>> >>   - We can have a "hive module (loader)" in flink/lib by default
>> >>   - The module loader would look for an environment variable like
>> >> "HIVE_CLASSPATH" and load these classes (ideally in a separate
>> >> classloader).
>> >>   - The loader can search for certain classes and instantiate catalog /
>> >> functions / etc. when finding them instantiates the hive module
>> >> referencing
>> >> them
>> >>   - That way, we use exactly what users have installed, without
>> needing to
>> >> build our own bundles.
>> >>
>> >> Could that work?
>> >>
>> >> Best,
>> >> Stephan
>> >>
>> >>
>> >> On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <[hidden email]>
>> >> wrote:
>> >>
>> >> > Couldn't it simply be documented which jars are in the convenience
>> jars
>> >> > which are pre built and can be downloaded from the website? Then
>> people
>> >> who
>> >> > need a custom version know which jars they need to provide to Flink?
>> >> >
>> >> > Cheers,
>> >> > Till
>> >> >
>> >> > On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <[hidden email]>
>> wrote:
>> >> >
>> >> > > I'm not sure providing an uber jar would be possible.
>> >> > >
>> >> > > Different from kafka and elasticsearch connector who have
>> dependencies
>> >> > for
>> >> > > a specific kafka/elastic version, or the kafka universal connector
>> >> that
>> >> > > provides good compatibilities, hive connector needs to deal with
>> hive
>> >> > jars
>> >> > > in all 1.x, 2.x, 3.x versions (let alone all the HDP/CDH
>> >> distributions)
>> >> > > with incompatibility even between minor versions, different
>> versioned
>> >> > > hadoop and other extra dependency jars for each hive version.
>> >> > >
>> >> > > Besides, users usually need to be able to easily see which
>> individual
>> >> > jars
>> >> > > are required, which is invisible from an uber jar. Hive users
>> already
>> >> > have
>> >> > > their hive deployments. They usually have to use their own hive
>> jars
>> >> > > because, unlike hive jars on mvn, their own jars contain changes
>> >> in-house
>> >> > > or from vendors. They need to easily tell which jars Flink requires
>> >> for
>> >> > > corresponding open sourced hive version to their own hive
>> deployment,
>> >> and
>> >> > > copy in-hosue jars over from hive deployments as replacements.
>> >> > >
>> >> > > Providing a script to download all the individual jars for a
>> specified
>> >> > hive
>> >> > > version can be an alternative.
>> >> > >
>> >> > > The goal is we need to provide a *product*, not a technology, to
>> make
>> >> it
>> >> > > less hassle for Hive users. Afterall, it's Flink embracing Hive
>> >> community
>> >> > > and ecosystem, not the other way around. I'd argue Hive connector
>> can
>> >> be
>> >> > > treat differently because its community/ecosystem/userbase is much
>> >> larger
>> >> > > than the other connectors, and it's way more important than other
>> >> > > connectors to Flink on the mission of becoming a batch/streaming
>> >> unified
>> >> > > engine and get Flink more widely adopted.
>> >> > >
>> >> > >
>> >> > > On Sun, Dec 15, 2019 at 10:03 PM Danny Chan <[hidden email]>
>> >> > wrote:
>> >> > >
>> >> > > > Also -1 on separate builds.
>> >> > > >
>> >> > > > After referencing some other BigData engines for
>> distribution[1], i
>> >> > > didn't
>> >> > > > find strong needs to publish a separate build
>> >> > > > for just a separate Hive version, indeed there are builds for
>> >> different
>> >> > > > Hadoop version.
>> >> > > >
>> >> > > > Just like Seth and Aljoscha said, we could push a
>> >> > > > flink-hive-version-uber.jar to use as a lib of SQL-CLI or other
>> use
>> >> > > cases.
>> >> > > >
>> >> > > > [1] https://spark.apache.org/downloads.html
>> >> > > > [2]
>> >> > >
>> >> https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
>> >> > > >
>> >> > > > Best,
>> >> > > > Danny Chan
>> >> > > > On Dec 14, 2019 at 3:03 AM +0800, [hidden email] wrote:
>> >> > > > >
>> >> > > > >
>> >> > > >
>> >> > >
>> >> >
>> >>
>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies
>> >> > > >
>> >> > >
>> >> >
>> >>
>> >
>> >
>> > --
>> > Best, Jingsong Lee
>> >
>>
>

--
Best, Jingsong Lee

Re: [DISCUSS] have separate Flink distributions with built-in Hive dependencies

Stephan Ewen
Hi Jingsong!

It sounds like, with two pre-bundled versions (Hive 1.2.1 and Hive 2.3.6),
you can cover a lot of versions.

Would it make sense to add these to flink-shaded (with proper exclusions
of unnecessary dependencies) and offer them as a download,
similar to how we offer pre-shaded Hadoop downloads?

Best,
Stephan


On Thu, Feb 6, 2020 at 10:26 AM Jingsong Li <[hidden email]> wrote:

> Hi Stephan,
>
> The hive/lib/ has many jars, this lib is for execution, metastore, hive
> client and all things.
> What we really depend on is hive-exec.jar. (hive-metastore.jar is also
> required in the low version hive)
> And hive-exec.jar is a uber jar. We just want half classes of it. These
> half classes are not so clean, but it is OK to have them.
>
> Our solution now:
> - exclude hive jars from build
> - provide 8 versions dependencies way, user choose by his hive version.[1]
>
> Spark's solution:
> - build-in hive 1.2.1 dependencies to support hive 0.12.0 through 2.3.3.
> [2]
>     - hive-exec.jar is hive-exec.spark.jar, Spark has modified the
> hive-exec build pom to exclude unnecessary classes including Orc and
> parquet.
>     - build-in orc and parquet dependencies to optimizer performance.
> - support hive version 2.3.3 upper by "mvn install -Phive-2.3", to
> built-in hive-exec-2.3.6.jar. It seems that since this version, hive's API
> has been seriously incompatible.
> Most of the versions used by users are hive 0.12.0 through 2.3.3. So the
> default build of Spark is good to most of users.
>
> Presto's solution:
> - Built-in presto's hive.[3] Shade hive classes instead of thrift classes.
> - Rewrite some client related code to solve kinds of issues.
> This approach is the heaviest, but also the cleanest. It can support all
> kinds of hive versions with one build.
>
> So I think we can do:
>
> - The eight versions we now maintain are too many. I think we can move
> forward in the direction of Presto/Spark and try to reduce dependencies
> versions.
>
> - As your said, about provide fat/uber jars or helper script, I prefer
> uber jars, user can download one jar to their startup. Just like Kafka.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
> [2]
> https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html#interacting-with-different-versions-of-hive-metastore
> [3] https://github.com/prestodb/presto-hive-apache
>
> Best,
> Jingsong Lee
>
> On Wed, Feb 5, 2020 at 10:15 PM Stephan Ewen <[hidden email]> wrote:
>
>> Some thoughts about other options we have:
>>
>>   - Put fat/shaded jars for the common versions into "flink-shaded" and
>> offer them for download on the website, similar to pre-bundles Hadoop
>> versions.
>>
>>   - Look at the Presto code (Metastore protocol) and see if we can reuse
>> that
>>
>>   - Have a setup helper script that takes the versions and pulls the
>> required dependencies.
>>
>> Can you share how can a "built-in" dependency could work, if there are so
>> many different conflicting versions?
>>
>> Thanks,
>> Stephan
>>
>>
>> On Tue, Feb 4, 2020 at 12:59 PM Rui Li <[hidden email]> wrote:
>>
>>> Hi Stephan,
>>>
>>> As Jingsong stated, in our documentation the recommended way to add Hive
>>> deps is to use exactly what users have installed. It's just we ask users
>>> to
>>> manually add those jars, instead of automatically find them based on env
>>> variables. I prefer to keep it this way for a while, and see if there're
>>> real concerns/complaints from user feedbacks.
>>>
>>> Please also note the Hive jars are not the only ones needed to integrate
>>> with Hive, users have to make sure flink-connector-hive and Hadoop jars
>>> are
>>> in classpath too. So I'm afraid a single "HIVE" env variable wouldn't
>>> save
>>> all the manual work for our users.
>>>
>>> On Tue, Feb 4, 2020 at 5:54 PM Jingsong Li <[hidden email]>
>>> wrote:
>>>
>>> > Hi all,
>>> >
>>> > For your information, we have document the dependencies detailed
>>> > information [1]. I think it's a lot clearer than before, but it's worse
>>> > than presto and spark (they avoid or have built-in hive dependency).
>>> >
>>> > I thought about Stephan's suggestion:
>>> > - The hive/lib has 200+ jars, but we only need hive-exec.jar or plus
>>> two
>>> > or three jars, if so many jars are introduced, maybe will there be a
>>> big
>>> > conflict.
>>> > - And hive/lib is not available on every machine. We need to upload so
>>> > many jars.
>>> > - A separate classloader maybe hard to work too, our
>>> flink-connector-hive
>>> > need hive jars, we may need to deal with flink-connector-hive jar
>>> spacial
>>> > too.
>>> > CC: Rui Li
>>> >
>>> > I think the best system to integrate with hive is presto, which only
>>> > connects hive metastore through thrift protocol. But I understand that
>>> it
>>> > costs a lot to rewrite the code.
>>> >
>>> > [1]
>>> >
>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/table/hive/#dependencies
>>> >
>>> > Best,
>>> > Jingsong Lee
>>> >
>>> > On Tue, Feb 4, 2020 at 1:44 AM Stephan Ewen <[hidden email]> wrote:
>>> >
>>> >> We have had much trouble in the past from "too deep too custom"
>>> >> integrations that everyone got out of the box, i.e., Hadoop.
>>> >> Flink has has such a broad spectrum of use cases, if we have custom
>>> build
>>> >> for every other framework in that spectrum, we'll be in trouble.
>>> >>
>>> >> So I would also be -1 for custom builds.
>>> >>
>>> >> Couldn't we do something similar as we started doing for Hadoop?
>>> Moving
>>> >> away from convenience downloads to allowing users to "export" their
>>> setup
>>> >> for Flink?
>>> >>
>>> >>   - We can have a "hive module (loader)" in flink/lib by default
>>> >>   - The module loader would look for an environment variable like
>>> >> "HIVE_CLASSPATH" and load these classes (ideally in a separate
>>> >> classloader).
>>> >>   - The loader can search for certain classes and instantiate catalog
>>> /
>>> >> functions / etc. when finding them instantiates the hive module
>>> >> referencing
>>> >> them
>>> >>   - That way, we use exactly what users have installed, without
>>> needing to
>>> >> build our own bundles.
>>> >>
>>> >> Could that work?
>>> >>
>>> >> Best,
>>> >> Stephan
>>> >>
>>> >>
>>> >> On Wed, Dec 18, 2019 at 9:43 AM Till Rohrmann <[hidden email]>
>>> >> wrote:
>>> >>
>>> >> > Couldn't it simply be documented which jars are in the convenience
>>> jars
>>> >> > which are pre built and can be downloaded from the website? Then
>>> people
>>> >> who
>>> >> > need a custom version know which jars they need to provide to Flink?
>>> >> >
>>> >> > Cheers,
>>> >> > Till
>>> >> >
>>> >> > On Tue, Dec 17, 2019 at 6:49 PM Bowen Li <[hidden email]>
>>> wrote:
>>> >> >
>>> >> > > I'm not sure providing an uber jar would be possible.
>>> >> > >
>>> >> > > Different from kafka and elasticsearch connector who have
>>> dependencies
>>> >> > for
>>> >> > > a specific kafka/elastic version, or the kafka universal connector
>>> >> that
>>> >> > > provides good compatibilities, hive connector needs to deal with
>>> hive
>>> >> > jars
>>> >> > > in all 1.x, 2.x, 3.x versions (let alone all the HDP/CDH
>>> >> distributions)
>>> >> > > with incompatibility even between minor versions, different
>>> versioned
>>> >> > > hadoop and other extra dependency jars for each hive version.
>>> >> > >
>>> >> > > Besides, users usually need to be able to easily see which
>>> individual
>>> >> > jars
>>> >> > > are required, which is invisible from an uber jar. Hive users
>>> already
>>> >> > have
>>> >> > > their hive deployments. They usually have to use their own hive
>>> jars
>>> >> > > because, unlike hive jars on mvn, their own jars contain changes
>>> >> in-house
>>> >> > > or from vendors. They need to easily tell which jars Flink
>>> requires
>>> >> for
>>> >> > > corresponding open sourced hive version to their own hive
>>> deployment,
>>> >> and
>>> >> > > copy in-hosue jars over from hive deployments as replacements.
>>> >> > >
>>> >> > > Providing a script to download all the individual jars for a
>>> specified
>>> >> > hive
>>> >> > > version can be an alternative.
>>> >> > >
>>> >> > > The goal is we need to provide a *product*, not a technology, to
>>> make
>>> >> it
>>> >> > > less hassle for Hive users. Afterall, it's Flink embracing Hive
>>> >> community
>>> >> > > and ecosystem, not the other way around. I'd argue Hive connector
>>> can
>>> >> be
>>> >> > > treat differently because its community/ecosystem/userbase is much
>>> >> larger
>>> >> > > than the other connectors, and it's way more important than other
>>> >> > > connectors to Flink on the mission of becoming a batch/streaming
>>> >> unified
>>> >> > > engine and get Flink more widely adopted.
>>> >> > >
>>> >> > >
>>> >> > > On Sun, Dec 15, 2019 at 10:03 PM Danny Chan <[hidden email]
>>> >
>>> >> > wrote:
>>> >> > >
>>> >> > > > Also -1 on separate builds.
>>> >> > > >
>>> >> > > > After referencing some other BigData engines for
>>> distribution[1], i
>>> >> > > didn't
>>> >> > > > find strong needs to publish a separate build
>>> >> > > > for just a separate Hive version, indeed there are builds for
>>> >> different
>>> >> > > > Hadoop version.
>>> >> > > >
>>> >> > > > Just like Seth and Aljoscha said, we could push a
>>> >> > > > flink-hive-version-uber.jar to use as a lib of SQL-CLI or other
>>> use
>>> >> > > cases.
>>> >> > > >
>>> >> > > > [1] https://spark.apache.org/downloads.html
>>> >> > > > [2]
>>> >> > >
>>> >>
>>> https://www.elastic.co/guide/en/elasticsearch/hadoop/current/hive.html
>>> >> > > >
>>> >> > > > Best,
>>> >> > > > Danny Chan
>>> >> > > > On Dec 14, 2019 at 3:03 AM +0800, [hidden email] wrote:
>>> >> > > > >
>>> >> > > > >
>>> >> > > >
>>> >> > >
>>> >> >
>>> >>
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/table/connect.html#dependencies
>>> >> > > >
>>> >> > >
>>> >> >
>>> >>
>>> >
>>> >
>>> > --
>>> > Best, Jingsong Lee
>>> >
>>>
>>
>
> --
> Best, Jingsong Lee
>