[DISCUSS] Releasing "fat" and "slim" Flink distributions

[DISCUSS] Releasing "fat" and "slim" Flink distributions

Aljoscha Krettek-2
Hi Everyone,

I'd like to discuss releasing a more full-featured Flink
distribution. The motivation is that there is friction for SQL/Table API
users who want to use Table connectors that are not included in the
current Flink distribution. For these users the workflow is currently
roughly:

  - download the Flink dist
  - configure CSV/Kafka/JSON connectors in the configuration
  - run the SQL Client or program
  - decipher the error message and research a solution
  - download the additional connector jars
  - the program works correctly

I realize that this can be made to work, but if every SQL user has this
as their first experience, that doesn't seem good to me.
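
To make the failure mode concrete, here is a minimal sketch of the kind
of first DDL a SQL Client user might try (hypothetical table/topic
names; property keys in the 1.10 style, with other required properties
possibly omitted):

  -- Sketch only: hypothetical names, 1.10-era property keys.
  CREATE TABLE orders (
    order_id BIGINT,
    price DOUBLE,
    order_time TIMESTAMP(3)
  ) WITH (
    'connector.type' = 'kafka',
    'connector.version' = 'universal',
    'connector.topic' = 'orders',
    'connector.properties.bootstrap.servers' = 'localhost:9092',
    'format.type' = 'json'
  );

  -- Without the Kafka SQL connector jar and flink-json in lib/, this
  -- fails with something like: "Could not find a suitable table factory
  -- for 'org.apache.flink.table.factories.TableSourceFactory'".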

My proposal is to provide two versions of the Flink distribution in the
future: "fat" and "slim" (names to be discussed):

  - slim would be even trimmer than today's distribution
  - fat would contain a lot of convenience connectors (which ones is yet
    to be determined)

And yes, I realize that there are already more dimensions of Flink
releases (Scala version and Java version).

For background, our current Flink dist has these in the opt directory:

  - flink-azure-fs-hadoop-1.10.0.jar
  - flink-cep-scala_2.12-1.10.0.jar
  - flink-cep_2.12-1.10.0.jar
  - flink-gelly-scala_2.12-1.10.0.jar
  - flink-gelly_2.12-1.10.0.jar
  - flink-metrics-datadog-1.10.0.jar
  - flink-metrics-graphite-1.10.0.jar
  - flink-metrics-influxdb-1.10.0.jar
  - flink-metrics-prometheus-1.10.0.jar
  - flink-metrics-slf4j-1.10.0.jar
  - flink-metrics-statsd-1.10.0.jar
  - flink-oss-fs-hadoop-1.10.0.jar
  - flink-python_2.12-1.10.0.jar
  - flink-queryable-state-runtime_2.12-1.10.0.jar
  - flink-s3-fs-hadoop-1.10.0.jar
  - flink-s3-fs-presto-1.10.0.jar
  - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
  - flink-sql-client_2.12-1.10.0.jar
  - flink-state-processor-api_2.12-1.10.0.jar
  - flink-swift-fs-hadoop-1.10.0.jar

The current Flink dist is 267 MB. If we removed everything from opt, we
would go down to 126 MB. I would recommend this, because the large
majority of the files in opt are probably unused.

What do you think?

Best,
Aljoscha


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Jeff Zhang
Hi Aljoscha,

Big +1 for the fat Flink distribution. Where do you plan to put these
connectors, opt or lib?


--
Best Regards

Jeff Zhang

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Kurt Young
In reply to this post by Aljoscha Krettek-2
Big +1 from my side.

From my experience, missing connector & format jars is the number one
problem that SQL users run into. Similar questions are raised in Flink's
DingTalk group almost every one or two days, and I have personally
answered dozens of them. Sometimes it is not even enough for users to
download the jars and put them in the lib directory; they also have to
restart their Flink cluster, which is not obvious, and the situation
then looks very tricky.

Best,
Kurt



Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Jark Wu-2
In reply to this post by Jeff Zhang
+1 to the proposal. I also found the "download additional jar" step is
really tedious when I prepare webinars.

At least, I think flink-csv and flink-json should be in the
distribution; they are quite small and don't have other dependencies.

Best,
Jark


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Dian Fu-2
+1 to this proposal.

Missing connector jars is also a big problem for PyFlink users. Currently, after a Python user has installed PyFlink using `pip`, they have to manually copy the connector fat jars into the PyFlink installation directory if they want to run jobs locally. This process is very confusing for users and hurts the experience a lot.

Regards,
Dian


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

godfreyhe
Big +1.
This will improve the user experience (especially for new Flink users).
We have answered so many questions about "class not found".

Best,
Godfrey


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Jingsong Li
Big +1.

I like "fat" and "slim".

For CSV and JSON, like Jark said, they are quite small and don't have
other dependencies. They are important to the Kafka connector, and to
the upcoming file system connector too. So can we include them in both
"fat" and "slim"? They're so important, and they're so lightweight.

Best,
Jingsong Lee


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Kurt Young
Regarding the specific solution, I'm not sure about the "fat" and
"slim" approach though. I get the idea that we can make the slim one
even more lightweight than the current distribution, but what about the
"fat" one? Do you mean that we would package all connectors and formats
into it? I'm not sure that is feasible. For example, we can't put all
versions of the Kafka and Hive connector jars into the lib directory,
and we might also need Hadoop jars when using the filesystem connector
to access data on HDFS.

So my guess would be that we hand-pick some of the most frequently used
connectors and formats for our lib directory, like the Kafka, CSV, and
JSON ones mentioned above, and still leave the other connectors out. If
that is the case, then why don't we just provide that one distribution
to users? I'm not sure I see the benefit of providing another super
"slim" distribution (we would have to pay some cost to maintain another
suite of distributions).

What do you think?

Best,
Kurt



Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Chesnay Schepler-3
I don't see a lot of value in having multiple distributions.

The simple reality is that no fat distribution we could provide would
satisfy all use-cases, so why even try?
If users commonly run into issues with certain jars, then maybe those
should be added to the current distribution.

Personally, though, I still believe we should only distribute a slim
version. I'd rather have users always add the required jars to the
distribution than only when they go outside our "expected" use-cases.
Then we might finally address this issue properly, e.g., with tooling to
assemble custom distributions and/or better error messages when
Flink-provided extensions cannot be found.


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Jark Wu-2
Hi,

I think we should first reach a consensus on what problem we want to
solve: (1) improve the first experience, or (2) improve the production
experience?

As far as I can see from the discussion above, what we want to solve is
the "first experience". And I think the slim distribution is still the
best one for production, because assembling jars is easier than
excluding jars, and it avoids potential class conflicts.

If we want to improve the "first experience", I think it makes sense to
have a fat distribution that gives users a smoother first experience.
But I would like to call it a "playground distribution" or something
like that, to explicitly distinguish it from the slim production-purpose
distribution. The "playground distribution" could contain some widely
used jars, like the universal Kafka SQL connector, the Elasticsearch 7
SQL connector, Avro, JSON, CSV, etc.
We could even provide a playground Docker image which contains the fat
distribution, Python 3, and Hive.

Best,
Jark



Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Jingsong Li
Hi,

I am thinking about both improving the first experience and improving
the production experience.

I'm thinking about what the common usage patterns of Flink are.
Streaming jobs use Kafka? Batch jobs use Hive?

Hive 1.2.1 dependencies are compatible with most Hive server versions,
so Spark and Presto ship a built-in Hive 1.2.1 dependency. Flink is
currently mainly used for streaming, so let's not talk about Hive.

For streaming jobs, the jobs I have in mind are (related to connectors;
see the sketch below):
- ETL jobs: Kafka -> Kafka
- Join jobs: Kafka -> DimJDBC -> Kafka
- Aggregation jobs: Kafka -> JDBCSink
So Kafka and JDBC are probably the most commonly used, along with the
CSV and JSON formats.
So when we provide such a fat distribution:
- with CSV and JSON,
- with flink-kafka-universal and its Kafka dependencies,
- with flink-jdbc,
most users can run their jobs well with it. (A JDBC driver jar is still
required, but that is very natural to add.) Can these dependencies lead
to conflicts? Only Kafka may, but if our goal is to use the universal
Kafka connector to support all Kafka versions, we can hope to cover the
vast majority of users.

We don't want to put every jar into the fat distribution, only common
ones with little conflict potential; which jars go in is of course a
matter of consideration. We have the opportunity to help the majority
of users while still leaving room for customization.
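
As a concrete sketch of the aggregation case above (hypothetical
table/topic/database names; 1.10-style DDL property keys, with some
required properties possibly omitted), this is roughly the kind of job
such a fat distribution could run out of the box, plus a JDBC driver
jar:

  -- Sketch only: hypothetical names, 1.10-era property keys.
  CREATE TABLE clicks (
    user_id BIGINT,
    url STRING
  ) WITH (
    'connector.type' = 'kafka',
    'connector.version' = 'universal',
    'connector.topic' = 'clicks',
    'connector.properties.bootstrap.servers' = 'localhost:9092',
    'format.type' = 'json'
  );

  CREATE TABLE click_counts (
    user_id BIGINT,
    cnt BIGINT
  ) WITH (
    'connector.type' = 'jdbc',
    'connector.url' = 'jdbc:mysql://localhost:3306/stats',
    'connector.table' = 'click_counts'
  );

  -- Aggregation job: Kafka -> JDBCSink
  INSERT INTO click_counts
  SELECT user_id, COUNT(*) AS cnt FROM clicks GROUP BY user_id;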

Best,
Jingsong Lee


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

wenlong.lwl
Hi all,

Regarding slim and fat distributions, I think different kinds of jobs
may prefer different types of distribution:

For DataStream jobs, I think we may not want a fat distribution
containing connectors, because the user code always has to depend on
the connector anyway, so it is easy to include the connector jar in the
user lib. Fewer jars in lib means fewer class conflicts and problems.

For SQL jobs, we are trying to encourage users to use pure SQL (DDL +
DML) to construct their jobs. To improve the user experience, it may be
important for Flink not only to provide as many connector jars in the
distribution as possible (especially the connectors and formats we have
documented well), but also to provide a mechanism that loads connectors
according to the DDLs.

So I think it could be good to place connector/format jars in a
directory like opt/connector, which would not affect jobs by default,
and to introduce a mechanism of dynamic discovery for SQL, as sketched
below.
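
To illustrate the idea (a rough sketch; the discovery mechanism itself
is hypothetical, and the DDL uses 1.10-style property keys with some
required properties omitted): the WITH clause already names everything
a loader would need to resolve jars under opt/connector:

  CREATE TABLE metrics (
    name STRING,
    val DOUBLE
  ) WITH (
    -- could resolve opt/connector/flink-sql-connector-elasticsearch7*.jar
    'connector.type' = 'elasticsearch',
    'connector.version' = '7',
    'connector.hosts' = 'http://localhost:9200',
    'connector.index' = 'metrics',
    -- could resolve opt/connector/flink-json*.jar
    'format.type' = 'json'
  );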

Best,
Wenlong


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Aljoscha Krettek-2
I want to reinforce my opinion from earlier: this is about improving the
situation both for first-time users and for experienced users that want
to use a Flink dist in production. The current Flink dist is too "thin"
for first-time SQL users and too "fat" for production users; that is, we
are serving no-one properly with the current middle ground. That's why I
think introducing those specialized "spins" of Flink dist would be good.

By the way, at some point in the future production users might not even
need to get a Flink dist anymore. They should be able to have Flink as a
dependency of their project (including the runtime) and then build an
image from this for Kubernetes or a fat jar for YARN.

Aljoscha



Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

dwysakowicz
Hi all,

A few points from my side:

1. I like the idea of simplifying the experience for first-time users.
As for production use cases, I share Jark's opinion that there I would
expect users to assemble their distribution manually. In such scenarios
it is important to understand the interconnections. Personally, I'd
expect the slimmest possible distribution that I can extend further with
whatever I need in my production scenario.

2. There is also the problem that the matrix of useful combinations is
already big. Do we want to have a distribution for:

    SQL users: which connectors should we include? Should we include
Hive? Which other catalogs?

    DataStream users: which connectors should we include?

    And for both of the above, should we include YARN/Kubernetes?

I would opt for providing only the "slim" distribution as a release
artifact.

3. However, as I said, I think it's worth investigating how we can
improve the user experience. What do you think of providing a tool, e.g.
a shell script, that constructs a distribution based on the user's
choices? I think that is also what Chesnay meant by "tooling to assemble
custom distributions". In the end, the difference between a slim and a
fat distribution comes down to which jars we put into lib, right? The
tool could have a few "screens":

1. Which API are you interested in:
a. SQL API
b. DataStream API


2. [SQL] Which connectors do you want to use? [multichoice]:
a. Kafka
b. Elasticsearch
...

3. [SQL] Which catalog do you want to use?

...

Such a tool would download all the dependencies from Maven and put them
into the correct folder. In the future we could extend it with
additional rules, e.g. that kafka-0.9 cannot be chosen at the same time
as kafka-universal.
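
To make the idea concrete, here is a minimal sketch of the core of such
a tool (written in Java purely for illustration; an official version
could just as well be a shell script, and it would also need checksums,
Scala-suffix handling, and the conflict rules mentioned above; the
coordinates and target path below are only examples):

import java.io.InputStream;
import java.net.URI;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

/** Sketch of an assembly tool: fetch the chosen connector/format jars
 *  from Maven Central into a distribution's lib/ directory. */
public final class DistAssembler {

    private static final String REPO = "https://repo1.maven.org/maven2";

    public static void main(String[] args) throws Exception {
        Path lib = Paths.get("flink-dist/lib"); // example target path
        Files.createDirectories(lib);
        // Choices that the interactive "screens" above would collect.
        String[][] picks = {
                {"org.apache.flink", "flink-csv", "1.10.0"},
                {"org.apache.flink", "flink-json", "1.10.0"},
        };
        for (String[] p : picks) {
            String jar = p[1] + "-" + p[2] + ".jar";
            // Standard Maven repository layout:
            // groupId(with slashes)/artifactId/version/artifactId-version.jar
            URI uri = URI.create(String.format("%s/%s/%s/%s/%s",
                    REPO, p[0].replace('.', '/'), p[1], p[2], jar));
            try (InputStream in = uri.toURL().openStream()) {
                Files.copy(in, lib.resolve(jar), StandardCopyOption.REPLACE_EXISTING);
            }
        }
    }
}

Rules such as "kafka-0.9 and kafka-universal are mutually exclusive"
could then be enforced over the list of picks before anything is
downloaded.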

The benefit would be that the distribution we release could remain
"slim", or we could even make it slimmer. I might be missing something
here, though.

Best,

Dawid




Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Till Rohrmann
I think what Chesnay and Dawid proposed would be the ideal solution.
Ideally, we would also have a nice web tool for the website which generates
the corresponding distribution for download.

To get things started, we could begin by only supporting
downloading/creating the "fat" version with the script. The fat version
would then consist of the slim distribution plus whatever we deem
important for new users to get started.

Cheers,
Till


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Kurt Young
I'm not so sure about the web tool solution, though. My concern with
this approach is that the final generated distribution is kind of
non-deterministic: we might generate too many different combinations
when users try to package different types of connectors, formats, and
maybe even Hadoop releases. As far as I can tell, most open source and
Apache projects only release a few pre-defined distributions, which most
users are already familiar with and which are thus hard to change, IMO.
I have also seen cases where users re-distribute the release package
because of unstable network access to the Apache website from China.
With the web tool solution, I don't think this kind of re-distribution
would be possible anymore.

In the meantime, I also have a concern that we would fall into our trap
again if we try to offer this smart & flexible solution, because it
requires users to cooperate with such a mechanism. It's exactly the
situation we currently fell into:
1. We offered a smart solution.
2. We hoped users would follow the correct instructions.
3. Everything works as expected if users follow the right instructions.

In reality, I suspect not all users will do the second step correctly.
And for new users who are only trying to get a quick first experience
with Flink, I would bet most will get it wrong.

So, my proposal would be one of the following two options:
1. Provide a slim distribution for advanced production users, plus a
distribution that contains some popular built-in jars.
2. Only provide a distribution that contains some popular built-in jars.

If we are trying to reduce the number of distributions we release, I
would prefer 2 over 1.

Best,
Kurt


On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <[hidden email]> wrote:

> I think what Chesnay and Dawid proposed would be the ideal solution.
> Ideally, we would also have a nice web tool for the website which generates
> the corresponding distribution for download.
>
> To get things started we could start with only supporting to
> download/creating the "fat" version with the script. The fat version would
> then consist of the slim distribution and whatever we deem important for
> new users to get started.
>
> Cheers,
> Till
>
> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <[hidden email]>
> wrote:
>
> > Hi all,
> >
> > Few points from my side:
> >
> > 1. I like the idea of simplifying the experience for first time users.
> > As for production use cases I share Jark's opinion that in this case I
> > would expect users to combine their distribution manually. I think in
> > such scenarios it is important to understand interconnections.
> > Personally I'd expect the slimmest possible distribution that I can
> > extend further with what I need in my production scenario.
> >
> > 2. I think there is also the problem that the matrix of possible
> > combinations that can be useful is already big. Do we want to have a
> > distribution for:
> >
> >     SQL users: which connectors should we include? should we include
> > hive? which other catalog?
> >
> >     DataStream users: which connectors should we include?
> >
> >    For both of the above should we include yarn/kubernetes?
> >
> > I would opt for providing only the "slim" distribution as a release
> > artifact.
> >
> > 3. However, as I said I think its worth investigating how we can improve
> > users experience. What do you think of providing a tool, could be e.g. a
> > shell script that constructs a distribution based on users choice. I
> > think that was also what Chesnay mentioned as "tooling to
> > assemble custom distributions" In the end how I see the difference
> > between a slim and fat distribution is which jars do we put into the
> > lib, right? It could have a few "screens".
> >
> > 1. Which API are you interested in:
> > a. SQL API
> > b. DataStream API
> >
> >
> > 2. [SQL] Which connectors do you want to use? [multichoice]:
> > a. Kafka
> > b. Elasticsearch
> > ...
> >
> > 3. [SQL] Which catalog you want to use?
> >
> > ...
> >
> > Such a tool would download all the dependencies from maven and put them
> > into the correct folder. In the future we can extend it with additional
> > rules e.g. kafka-0.9 cannot be chosen at the same time with
> > kafka-universal etc.
> >
> > The benefit of it would be that the distribution that we release could
> > remain "slim" or we could even make it slimmer. I might be missing
> > something here though.
> >
> > Best,
> >
> > Dawdi
> >
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Chesnay Schepler-3
The problem with having a distribution with "popular" stuff is that it
doesn't really /solve/ a problem, it just hides it for users who fall
into these particular use-cases.
Move out of them and you once again run into the exact same problems outlined above.

This is exactly why I like the tooling approach; you have to deal with
it from the start and transitioning to a custom use-case is easier.

Would users following instructions really be such a big problem?
I would expect that users generally know /what/ they need, just not
necessarily how it is assembled correctly (where to get which jar, which
directory to put it in).
It seems like these are exactly the problems this would solve?
I just don't see how moving a jar corresponding to some feature from opt
to some directory (lib/plugins) is less error-prone than just selecting
the feature and having the tool handle the rest.
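
To make that contrast concrete (the jar name is taken from the opt listing
earlier in this thread; the assembler script is purely hypothetical and is
sketched further below):

  # today: manually promote a feature's jar from opt/ to lib/
  cp opt/flink-sql-client_2.12-1.10.0.jar lib/

  # with tooling: name the feature, let the tool resolve and place the jar
  ./assemble-dist.sh sql-client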

As for re-distributions, it depends on the form that the tool would take.
It could be an application that runs locally and works against maven
central (note: not necessarily /using/ maven); this should work in
China, no?

A web tool would of course be fancy, but I don't know how feasible this
is with the ASF infrastructure.
You wouldn't be able to mirror the distribution, so the load can't be
distributed. I doubt INFRA would like this.

Note that third-parties could also start distributing use-case oriented
distributions, which would be perfectly fine as far as I'm concerned.
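
As a rough illustration of that local-tool idea, here is a minimal sketch.
Everything in it is hypothetical (the script name, the feature names, the
jar selection); the coordinates are the 1.10.0 artifacts mentioned in this
thread, fetched straight from Maven Central over HTTP rather than through
maven itself:

  #!/usr/bin/env sh
  # assemble-dist.sh (hypothetical): fetch the jars for the chosen features
  # from Maven Central and drop them into the right directory of a slim dist.
  set -e
  FLINK_VERSION="1.10.0"
  SCALA_VERSION="2.12"
  REPO="https://repo1.maven.org/maven2/org/apache/flink"

  # fetch_into <artifactId> <targetDir>: download one release jar.
  fetch_into() {
    jar="$1-${FLINK_VERSION}.jar"
    mkdir -p "$2"
    curl -fL "${REPO}/$1/${FLINK_VERSION}/${jar}" -o "$2/${jar}"
  }

  for feature in "$@"; do
    case "${feature}" in
      sql-client) fetch_into "flink-sql-client_${SCALA_VERSION}" lib ;;
      kafka)      fetch_into "flink-sql-connector-kafka_${SCALA_VERSION}" lib ;;
      csv)        fetch_into "flink-csv" lib ;;
      json)       fetch_into "flink-json" lib ;;
      *)          echo "unknown feature: ${feature}" >&2; exit 1 ;;
    esac
  done

Run from the root of an unpacked slim dist, e.g. "./assemble-dist.sh kafka
csv json". The case table would also be a natural place for the exclusion
rules Dawid mentions above (e.g. rejecting kafka-0.9 together with
kafka-universal).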

On 16/04/2020 16:57, Kurt Young wrote:

> I'm not so sure about the web tool solution though. The concern I have for
> this approach is that the final generated distribution is kind of
> non-deterministic. We might generate too many different combinations when
> users try to package different types of connectors, formats, and maybe even
> Hadoop releases. As far as I can tell, most open source and Apache projects
> only release some pre-defined distributions, which most users are already
> familiar with, and which are thus hard to change IMO. I have also gone
> through cases where users re-distribute the release package because of the
> unstable network to the Apache website from China. With a web tool
> solution, I don't think this kind of re-distribution would be possible
> anymore.
>
> In the meantime, I also have a concern that we will fall back into our trap
> again if we try to offer this smart & flexible solution, because it needs
> users to cooperate with such a mechanism. It's exactly the situation we
> currently fell into:
> 1. We offered a smart solution.
> 2. We hope users will follow the correct instructions.
> 3. Everything will work as expected if users followed the right
> instructions.
>
> In reality, I suspect not all users will do the second step correctly. And
> for new users who are only trying to get a quick experience with Flink, I
> would bet most will do it wrong.
>
> So, my proposal would be one of the following 2 options:
> 1. Provide a slim distribution for advanced production users, plus a
> distribution which has some popular built-in jars.
> 2. Only provide a distribution which has some popular built-in jars.
>
> If we are trying to reduce the distributions we release, I would prefer
> 2 over 1.
> Best,
> Kurt

Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Kurt Young
I'm not saying pre-bundling some jars will make this problem go away, and
you're right that it only hides the problem for some users. But what if this
solution can hide the problem for 90% of users? Wouldn't that be good enough
for us to try?

Regarding whether users following instructions would really be such a big
problem: I'm afraid yes. Otherwise I wouldn't have answered such questions at
least a dozen times, and I wouldn't keep seeing them come up from time to
time. During some periods, I even saw such questions every day.

Best,
Kurt


On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <[hidden email]>
wrote:

> The problem with having a distribution with "popular" stuff is that it
> doesn't really *solve* a problem, it just hides it for users who fall
> into these particular use-cases.
> Move out of them and you once again run into the exact same problems outlined above.
>
> This is exactly why I like the tooling approach; you have to deal with it
> from the start and transitioning to a custom use-case is easier.
>
> Would users following instructions really be such a big problem?
> I would expect that users generally know *what* they need, just not
> necessarily how it is assembled correctly (where to get which jar, which
> directory to put it in).
> It seems like these are exactly the problems this would solve?
> I just don't see how moving a jar corresponding to some feature from opt
> to some directory (lib/plugins) is less error-prone than just selecting the
> feature and having the tool handle the rest.
>
> As for re-distributions, it depends on the form that the tool would take.
> It could be an application that runs locally and works against maven
> central (note: not necessarily *using* maven); this should work in
> China, no?
>
> A web tool would of course be fancy, but I don't know how feasible this is
> with the ASF infrastructure.
> You wouldn't be able to mirror the distribution, so the load can't be
> distributed. I doubt INFRA would like this.
>
> Note that third-parties could also start distributing use-case oriented
> distributions, which would be perfectly fine as far as I'm concerned.
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Rafi Aroch-2
As a reference for a nice first experience I had, take a look at
https://code.quarkus.io/
You reach this page after you click "Start Coding" at the project homepage.

Rafi


On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <[hidden email]> wrote:

> I'm not saying pre-bundling some jars will make this problem go away, and
> you're right that it only hides the problem for some users. But what if
> this solution can hide the problem for 90% of users? Wouldn't that be good
> enough for us to try?
>
> Regarding whether users following instructions would really be such a big
> problem: I'm afraid yes. Otherwise I wouldn't have answered such questions
> at least a dozen times, and I wouldn't keep seeing them come up from time
> to time. During some periods, I even saw such questions every day.
>
> Best,
> Kurt
>
>
> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <[hidden email]>
> wrote:
>
> > The problem with having a distribution with "popular" stuff is that it
> > doesn't really *solve* a problem, it just hides it for users who fall
> > into these particular use-cases.
> > Move out of it and you once again run into exact same problems out-lined.
> >
> > This is exactly why I like the tooling approach; you have to deal with it
> > from the start and transitioning to a custom use-case is easier.
> >
> > Would users following instructions really be such a big problem?
> > I would expect that users generally know *what *they need, just not
> > necessarily how it is assembled correctly (where do get which jar, which
> > directory to put it in).
> > It seems like these are exactly the problem this would solve?
> > I just don't see how moving a jar corresponding to some feature from opt
> > to some directory (lib/plugins) is less error-prone than just selecting
> the
> > feature and having the tool handle the rest.
> >
> > As for re-distributions, it depends on the form that the tool would take.
> > It could be an application that runs locally and works against maven
> > central (note: not necessarily *using* maven); this should would work in
> > China, no?
> >
> > A web tool would of course be fancy, but I don't know how feasible this
> is
> > with the ASF infrastructure.
> > You wouldn't be able to mirror the distribution, so the load can't be
> > distributed. I doubt INFRA would like this.
> >
> > Note that third-parties could also start distributing use-case oriented
> > distributions, which would be perfectly fine as far as I'm concerned.
> >
> > On 16/04/2020 16:57, Kurt Young wrote:
> >
> > I'm not so sure about the web tool solution though. The concern I have
> for
> > this approach is the final generated
> > distribution is kind of non-deterministic. We might generate too many
> > different combinations when user trying to
> > package different types of connector, format, and even maybe hadoop
> > releases.  As far as I can tell, most open
> > source projects and apache projects will only release some
> > pre-defined distributions, which most users are already
> > familiar with, thus hard to change IMO. And I also have went through in
> > some cases, users will try to re-distribute
> > the release package, because of the unstable network of apache website
> from
> > China. In web tool solution, I don't
> > think this kind of re-distribution would be possible anymore.
> >
> > In the meantime, I also have a concern that we will fall back into our
> trap
> > again if we try to offer this smart & flexible
> > solution. Because it needs users to cooperate with such mechanism. It's
> > exactly the situation what we currently fell
> > into:
> > 1. We offered a smart solution.
> > 2. We hope users will follow the correct instructions.
> > 3. Everything will work as expected if users followed the right
> > instructions.
> >
> > In reality, I suspect not all users will do the second step correctly.
> And
> > for new users who only trying to have a quick
> > experience with Flink, I would bet most users will do it wrong.
> >
> > So, my proposal would be one of the following 2 options:
> > 1. Provide a slim distribution for advanced product users and provide a
> > distribution which will have some popular builtin jars.
> > 2. Only provide a distribution which will have some popular builtin jars.
> >
> > If we are trying to reduce the distributions we released, I would prefer
> 2
> >
> > 1.
> >
> > Best,
> > Kurt
> >
> >
> > On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <[hidden email]> <
> [hidden email]> wrote:
> >
> >
> > I think what Chesnay and Dawid proposed would be the ideal solution.
> > Ideally, we would also have a nice web tool for the website which
> generates
> > the corresponding distribution for download.
> >
> > To get things started we could start with only supporting to
> > download/creating the "fat" version with the script. The fat version
> would
> > then consist of the slim distribution and whatever we deem important for
> > new users to get started.
> >
> > Cheers,
> > Till
> >
> > On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
> [hidden email]> <[hidden email]>
> > wrote:
> >
> >
> > Hi all,
> >
> > Few points from my side:
> >
> > 1. I like the idea of simplifying the experience for first time users.
> > As for production use cases I share Jark's opinion that in this case I
> > would expect users to combine their distribution manually. I think in
> > such scenarios it is important to understand interconnections.
> > Personally I'd expect the slimmest possible distribution that I can
> > extend further with what I need in my production scenario.
> >
> > 2. I think there is also the problem that the matrix of possible
> > combinations that can be useful is already big. Do we want to have a
> > distribution for:
> >
> >     SQL users: which connectors should we include? should we include
> > hive? which other catalog?
> >
> >     DataStream users: which connectors should we include?
> >
> >    For both of the above should we include yarn/kubernetes?
> >
> > I would opt for providing only the "slim" distribution as a release
> > artifact.
> >
> > 3. However, as I said I think its worth investigating how we can improve
> > users experience. What do you think of providing a tool, could be e.g. a
> > shell script that constructs a distribution based on users choice. I
> > think that was also what Chesnay mentioned as "tooling to
> > assemble custom distributions" In the end how I see the difference
> > between a slim and fat distribution is which jars do we put into the
> > lib, right? It could have a few "screens".
> >
> > 1. Which API are you interested in:
> > a. SQL API
> > b. DataStream API
> >
> >
> > 2. [SQL] Which connectors do you want to use? [multichoice]:
> > a. Kafka
> > b. Elasticsearch
> > ...
> >
> > 3. [SQL] Which catalog you want to use?
> >
> > ...
> >
> > Such a tool would download all the dependencies from maven and put them
> > into the correct folder. In the future we can extend it with additional
> > rules e.g. kafka-0.9 cannot be chosen at the same time with
> > kafka-universal etc.
> >
> > The benefit of it would be that the distribution that we release could
> > remain "slim" or we could even make it slimmer. I might be missing
> > something here though.
> >
> > Best,
> >
> > Dawdi
> >
> > On 16/04/2020 11:02, Aljoscha Krettek wrote:
> >
> > I want to reinforce my opinion from earlier: This is about improving
> > the situation both for first-time users and for experienced users that
> > want to use a Flink dist in production. The current Flink dist is too
> > "thin" for first-time SQL users and it is too "fat" for production
> > users, that is where serving no-one properly with the current
> > middle-ground. That's why I think introducing those specialized
> > "spins" of Flink dist would be good.
> >
> > By the way, at some point in the future production users might not
> > even need to get a Flink dist anymore. They should be able to have
> > Flink as a dependency of their project (including the runtime) and
> > then build an image from this for Kubernetes or a fat jar for YARN.
> >
> > Aljoscha
> >
> > On 15.04.20 18:14, wenlong.lwl wrote:
> >
> > Hi all,
> >
> > Regarding slim and fat distributions, I think different kinds of jobs
> > may prefer different types of distribution:
> >
> > For DataStream jobs, I think we may not want a fat distribution
> > containing connectors, because users always need to depend on the
> > connector in user code anyway, so it is easy to include the connector
> > jar in the user lib. Fewer jars in lib means fewer class conflicts and
> > problems.
> >
> > For SQL jobs, I think we are trying to encourage users to use pure SQL
> > (DDL + DML) to construct their jobs. In order to improve the user
> > experience, it may be important for Flink not only to provide as many
> > connector jars in the distribution as possible, especially the
> > connectors and formats we have documented well, but also to provide a
> > mechanism to load connectors according to the DDLs.
> >
> > So I think it could be good to place connector/format jars in some dir
> > like opt/connector, which would not affect jobs by default, and to
> > introduce a mechanism of dynamic discovery for SQL.
> >
> > Best,
> > Wenlong
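
[Editorial illustration: to make the dynamic-discovery idea concrete, a
rough Python sketch. The opt/connector layout and the jar naming
convention are assumptions for illustration, not an existing Flink
mechanism.]

    # Sketch: map the 'connector' options found in DDL statements to jars
    # under opt/connector, so only the needed jars go on the classpath.
    import re
    from pathlib import Path

    def jars_for_ddl(ddl: str, connector_dir: str = "opt/connector"):
        names = set(re.findall(r"'connector'\s*=\s*'([\w.-]+)'", ddl))
        jars = []
        for name in names:
            jars.extend(Path(connector_dir).glob(f"flink-sql-connector-{name}-*.jar"))
        return jars

    ddl = "CREATE TABLE t (id INT) WITH ('connector' = 'kafka', 'format' = 'json')"
    print(jars_for_ddl(ddl))  # the jars a SQL job would dynamically load
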
> >
> > On Wed, 15 Apr 2020 at 22:46, Jingsong Li <[hidden email]> wrote:
> >
> >
> > Hi,
> >
> > I am thinking about both "improve the first experience" and "improve
> > the production experience".
> >
> > I'm thinking about what the most common modes of Flink are.
> > Streaming jobs use Kafka? Batch jobs use Hive?
> >
> > Hive 1.2.1 dependencies can be compatible with most Hive server
> > versions, which is why Spark and Presto have a built-in Hive 1.2.1
> > dependency. Flink is currently mainly used for streaming, so let's not
> > talk about Hive.
> >
> > For streaming jobs, the jobs I have in mind are (related to
> > connectors):
> > - ETL jobs: Kafka -> Kafka
> > - Join jobs: Kafka -> DimJDBC -> Kafka
> > - Aggregation jobs: Kafka -> JDBCSink
> > So Kafka and JDBC are probably the most commonly used. Of course, this
> > also includes the CSV and JSON formats.
> > So we could provide such a fat distribution:
> > - With CSV, JSON.
> > - With flink-kafka-universal and Kafka dependencies.
> > - With flink-jdbc.
> > Using this fat distribution, most users can run their jobs well. (A
> > JDBC driver jar is still required, but that is very natural to add.)
> > Can these dependencies lead to conflicts? Only Kafka may have
> > conflicts, but if our goal is to use kafka-universal to support all
> > Kafka versions, we can hope to cover the vast majority of users.
> >
> > We don't want to put every jar into the fat distribution, only the
> > common ones with few conflicts. Of course, which jars go into the fat
> > distribution is a matter of consideration.
> > We have the opportunity to help the majority of users while also
> > leaving room for customization.
> >
> > Best,
> > Jingsong Lee
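
[Editorial illustration: a minimal PyFlink sketch of the Kafka -> Kafka
ETL shape above. It only runs once the Kafka connector and JSON format
jars are on the classpath, which is exactly the friction being
discussed; the API and option names follow later Flink releases and are
illustrative here.]

    # Minimal Kafka -> Kafka ETL job in PyFlink. Without the Kafka
    # connector jar in lib/, table creation fails with a "could not find
    # factory" style error -- the missing-jar problem discussed here.
    from pyflink.table import EnvironmentSettings, TableEnvironment

    t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

    for name, topic in [("src", "orders"), ("dst", "orders_clean")]:
        t_env.execute_sql(f"""
            CREATE TABLE {name} (user_id STRING, amount DOUBLE) WITH (
                'connector' = 'kafka',
                'topic' = '{topic}',
                'properties.bootstrap.servers' = 'localhost:9092',
                'format' = 'json'
            )
        """)

    t_env.execute_sql("INSERT INTO dst SELECT user_id, amount FROM src WHERE amount > 0")
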
> >
> > On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <[hidden email]> wrote:
> >
> >
> > Hi,
> >
> > I think we should first reach a consensus on "what problem do we want
> > to solve?"
> > (1) improve the first experience? or (2) improve the production
> > experience?
> >
> > As far as I can see from the above discussion, I think what we want to
> > solve is the "first experience".
> > And I think the slim distribution is still the best for production,
> > because it's easier to assemble jars than to exclude jars, and it can
> > avoid potential class conflicts.
> >
> > If we want to improve the "first experience", I think it makes sense
> > to have a fat distribution to give users a smoother first experience.
> > But I would like to call it a "playground distribution" or something
> > like that, to explicitly differentiate it from the "slim
> > production-purpose distribution".
> >
> > The "playground distribution" could contain some widely used jars,
> > like universal-kafka-sql-connector, elasticsearch7-sql-connector,
> > avro, json, csv, etc.
> > We could even provide a playground Docker image which contains the fat
> > distribution, python3, and hive.
> >
> > Best,
> > Jark
> >
> >
> > On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <[hidden email]> wrote:
> >
> > I don't see a lot of value in having multiple distributions.
> >
> > The simple reality is that no fat distribution we could provide would
> > satisfy all use-cases, so why even try?
> > If users commonly run into issues for certain jars, then maybe those
> > should be added to the current distribution.
> >
> > Personally though, I still believe we should only distribute a slim
> > version. I'd rather have users always add required jars to the
> > distribution than only when they go outside our "expected" use-cases.
> >
> > Then we might finally address this issue properly, i.e., tooling to
> > assemble custom distributions and/or better error messages if
> > Flink-provided extensions cannot be found.
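
[Editorial illustration: a sketch of the "better error messages" idea,
mapping a missing factory identifier to the jar the user probably
needs. The identifier -> jar mapping below is illustrative, not an
actual Flink table.]

    # Sketch: suggest the jar to add when a connector factory is missing.
    # The identifier -> jar mapping is an illustrative assumption.
    SUGGESTED_JARS = {
        "kafka": "flink-sql-connector-kafka",
        "elasticsearch-7": "flink-sql-connector-elasticsearch7",
    }

    def explain_missing_factory(identifier: str) -> str:
        jar = SUGGESTED_JARS.get(identifier)
        hint = f" Try adding {jar} to the lib/ directory." if jar else ""
        return f"Could not find any factory for identifier '{identifier}'.{hint}"

    print(explain_missing_factory("kafka"))
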
> >
> > On 15/04/2020 15:23, Kurt Young wrote:
> >
> > Regarding the specific solution, I'm not sure about the "fat" and
> > "slim" solution though. I get the idea that we can make the slim one
> > even more lightweight than the current distribution, but what about
> > the "fat" one? Do you mean that we would package all connectors and
> > formats into it? I'm not sure that is feasible. For example, we can't
> > put all versions of the Kafka and Hive connector jars into the lib
> > directory, and we also might need Hadoop jars when using the
> > filesystem connector to access data from HDFS.
> >
> > So my guess would be that we hand-pick some of the most frequently
> > used connectors and formats for our "lib" directory, like the kafka,
> > csv, and json ones mentioned above, and still leave some other
> > connectors out of it. If this is the case, then why not just provide
> > this distribution to users? I'm not sure I get the benefit of
> > providing another super "slim" distribution (we would have to pay some
> > cost to provide another suite of distributions).
> >
> > What do you think?
> >
> > Best,
> > Kurt
> >
> >
> > On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <[hidden email]> wrote:
> >
> > Big +1.
> >
> > I like "fat" and "slim".
> >
> > For csv and json, like Jark said, they are quite small and don't have
> > other dependencies. They are important to the Kafka connector, and
> > important to the upcoming file system connector too.
> > So can we include them in both "fat" and "slim"? They're so important,
> > and they're so lightweight.
> >
> > Best,
> > Jingsong Lee
> >
> > On Wed, Apr 15, 2020 at 4:53 PM godfrey he <[hidden email]> wrote:
> >
> > Big +1.
> > This will improve the user experience (especially for new Flink
> > users). We have answered so many questions about "class not found".
> >
> > Best,
> > Godfrey
> >
> > Dian Fu <[hidden email]> wrote on Wed, Apr 15, 2020 at 4:30 PM:
> >
> >
> > +1 to this proposal.
> >
> > Missing connector jars is also a big problem for PyFlink users.
> > Currently, after a Python user has installed PyFlink using `pip`, they
> > have to manually copy the connector fat jars into the PyFlink
> > installation directory for the connectors to be usable in locally run
> > jobs. This process is very confusing for users and hurts the
> > experience a lot.
> >
> > Regards,
> > Dian
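
[Editorial illustration: a small Python snippet that locates the
directory where a pip-installed PyFlink keeps its jars; connector jars
dropped in here are picked up for local runs.]

    # Locate the jar directory of a pip-installed PyFlink; connector jars
    # copied here become visible to locally started jobs.
    import os
    import pyflink

    jar_dir = os.path.join(os.path.dirname(os.path.abspath(pyflink.__file__)), "lib")
    print(jar_dir)  # e.g. .../site-packages/pyflink/lib
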
> >
> >
> > On Apr 15, 2020, at 3:51 PM, Jark Wu <[hidden email]> wrote:
> >
> > +1 to the proposal. I also found the "download additional jars" step
> > really tedious when preparing webinars.
> >
> > At the very least, I think flink-csv and flink-json should be in the
> > distribution; they are quite small and don't have other dependencies.
> >
> > Best,
> > Jark
> >

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Jark Wu-2
Hi,

I like the idea of a web tool to assemble a fat distribution, and
https://code.quarkus.io/ looks very nice.
All users need to do is select what they need (I think this step can't
be omitted anyway).
We can also provide a default fat distribution on the web which
pre-selects some popular connectors.

Best,
Jark

On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <[hidden email]> wrote:

> As a reference for a nice first experience I had, take a look at
> https://code.quarkus.io/
> You reach this page after you click "Start Coding" on the project homepage.
>
> Rafi
>
>
> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <[hidden email]> wrote:
>
> > I'm not saying pre-bundling some jars will make this problem go away,
> > and you're right that it only hides the problem for some users. But
> > what if this solution can hide the problem for 90% of users? Wouldn't
> > that be good enough for us to try?
> >
> > Regarding "would users following instructions really be such a big
> > problem?":
> > I'm afraid yes. Otherwise I wouldn't have answered such questions at
> > least a dozen times, and I wouldn't see such questions coming up from
> > time to time. During some periods, I even saw such questions every
> > day.
> >
> > Best,
> > Kurt
> >
> >
> > On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <[hidden email]>
> > wrote:
> >
> > > The problem with having a distribution with "popular" stuff is that
> > > it doesn't really *solve* a problem, it just hides it for users who
> > > fall into these particular use-cases.
> > > Move out of them and you once again run into exactly the same
> > > problems outlined above.
> > >
> > > This is exactly why I like the tooling approach; you have to deal
> > > with it from the start, and transitioning to a custom use-case is
> > > easier.
> > >
> > > Would users following instructions really be such a big problem?
> > > I would expect that users generally know *what* they need, just not
> > > necessarily how it is assembled correctly (where to get which jar,
> > > which directory to put it in).
> > > It seems like these are exactly the problems this would solve?
> > > I just don't see how moving a jar corresponding to some feature from
> > > opt to some directory (lib/plugins) is less error-prone than just
> > > selecting the feature and having the tool handle the rest.
> > >
> > > As for re-distribution, it depends on the form that the tool would
> > > take. It could be an application that runs locally and works against
> > > Maven Central (note: not necessarily *using* Maven); this should work
> > > in China, no?
> > >
> > > A web tool would of course be fancy, but I don't know how feasible
> > > that is with the ASF infrastructure.
> > > You wouldn't be able to mirror the distribution, so the load can't be
> > > distributed. I doubt INFRA would like this.
> > >
> > > Note that third parties could also start distributing use-case
> > > oriented distributions, which would be perfectly fine as far as I'm
> > > concerned.
> > >
> > > On 16/04/2020 16:57, Kurt Young wrote:
> > >
> > > I'm not so sure about the web tool solution though. The concern I
> > > have with this approach is that the final generated distribution is
> > > kind of non-deterministic. We might generate too many different
> > > combinations when users try to package different types of
> > > connectors, formats, and maybe even Hadoop releases. As far as I can
> > > tell, most open source and Apache projects only release some
> > > pre-defined distributions, which most users are already familiar
> > > with and which are thus hard to change, IMO. I have also seen cases
> > > where users re-distribute the release package because of unstable
> > > network access to the Apache website from China. With the web tool
> > > solution, I don't think this kind of re-distribution would be
> > > possible anymore.
> > >
> > > In the meantime, I am also concerned that we will fall back into our
> > > trap again if we try to offer this smart & flexible solution,
> > > because it needs users to cooperate with such a mechanism. It's
> > > exactly the situation we currently fell into:
> > > 1. We offered a smart solution.
> > > 2. We hope users will follow the correct instructions.
> > > 3. Everything will work as expected if users follow the right
> > > instructions.
> > >
> > > In reality, I suspect not all users will do the second step
> > > correctly. And for new users who are only trying to have a quick
> > > experience with Flink, I would bet most will get it wrong.
> > >
> > > So, my proposal would be one of the following 2 options:
> > > 1. Provide a slim distribution for advanced production users and a
> > > distribution which has some popular built-in jars.
> > > 2. Only provide a distribution which has some popular built-in jars.
> > >
> > > If we are trying to reduce the number of distributions we release, I
> > > would prefer 2 over 1.
> > >
> > > Best,
> > > Kurt
> > >
> > >
> > > On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <[hidden email]> wrote:
> > >
> > >
> > > I think what Chesnay and Dawid proposed would be the ideal solution.
> > > Ideally, we would also have a nice web tool on the website which
> > > generates the corresponding distribution for download.
> > >
> > > To get things started, we could begin with only supporting
> > > downloading/creating the "fat" version with the script. The fat
> > > version would then consist of the slim distribution plus whatever we
> > > deem important for new users to get started.
> > >
> > > Cheers,
> > > Till
> > >
> > > On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <[hidden email]> wrote:
> > >
> > >
> > > Hi all,
> > >
> > > Few points from my side:
> > >
> > > 1. I like the idea of simplifying the experience for first-time
> > > users. As for production use cases, I share Jark's opinion that I
> > > would expect users to combine their distribution manually. I think
> > > in such scenarios it is important to understand the interconnections.