[DISCUSS] Releasing "fat" and "slim" Flink distributions

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Benchao Li
+1 to including them for the sql-client by default;
+0 to putting them into lib and exposing them to all kinds of jobs, including DataStream.

Danny Chan <[hidden email]> wrote on Fri, Jun 5, 2020 at 2:31 PM:

> +1. At the very least, we should keep an out-of-the-box SQL CLI; it's a
> very poor experience to make SQL users add such required format jars.
>
> Best,
> Danny Chan
> On Jun 5, 2020 at 11:14 AM +0800, Jingsong Li <[hidden email]> wrote:
> > Hi all,
> >
> > Considering that 1.11 will be released soon, what about my previous
> > proposal to put flink-csv, flink-json and flink-avro under lib?
> > These three formats are very small, have no third-party dependencies,
> > and are widely used by table users.
> >
> > Best,
> > Jingsong Lee
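[Editor's note] To make the proposal above concrete, here is a rough sketch of what "put the three format jars under lib" amounts to. The layout and version numbers are assumptions for illustration, and the steps are demonstrated on a scratch directory rather than a real installation:

```shell
# Hypothetical sketch: promote the three small format jars into lib/ so
# table programs see them by default. Paths and versions are assumptions,
# demonstrated on a scratch directory standing in for a Flink dist.
set -eu
FLINK_HOME="$(mktemp -d)"
mkdir -p "$FLINK_HOME/opt" "$FLINK_HOME/lib"
# stand-ins for the real jars a distribution would ship under opt/
for fmt in csv json avro; do
  touch "$FLINK_HOME/opt/flink-$fmt-1.10.0.jar"
done
# the actual "move them under lib" step being proposed
for fmt in csv json avro; do
  cp "$FLINK_HOME/opt/flink-$fmt-1.10.0.jar" "$FLINK_HOME/lib/"
done
ls "$FLINK_HOME/lib"
```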
> >
> > On Tue, May 12, 2020 at 4:19 PM Jingsong Li <[hidden email]> wrote:
> >
> > > Thanks for your discussion.
> > >
> > > Sorry to start discussing another thing:
> > >
> > > The biggest problem I see is the variety of issues caused by users
> > > missing format dependencies.
> > > As Aljoscha said, these three formats are very small, have no
> > > third-party dependencies, and are widely used by table users.
> > > Actually, we don't have any other built-in table formats now... In
> > > total 151K...
> > >
> > > 73K flink-avro-1.10.0.jar
> > > 36K flink-csv-1.10.0.jar
> > > 42K flink-json-1.10.0.jar
> > >
> > > So, can we just put them into "lib/" or flink-table-uber?
> > > It doesn't solve all problems, and maybe it is independent of "fat"
> > > and "slim", but it would still improve usability.
> > > What do you think? Any objections?
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <[hidden email]> wrote:
> > >
> > > > One downside would be that we're shipping more stuff when running on
> > > > YARN, for example, since the entire plugins directory is shipped by
> > > > default.
> > > >
> > > > On 17/04/2020 16:38, Stephan Ewen wrote:
> > > > > @Aljoscha I think that is an interesting line of thinking. The
> > > > > swift-fs may be rarely used enough to move it to an optional
> > > > > download.
> > > > >
> > > > > I would still add two more thoughts:
> > > > >
> > > > > (1) Now that we have plugins support, is there a reason to have a
> > > > > metrics reporter or file system in /opt instead of /plugins? They
> > > > > don't spoil the class path any more.
> > > > >
> > > > > (2) I can imagine there still being a desire to have a "minimal"
> > > > > docker file, for users that want to keep the container images as
> > > > > small as possible, to speed up deployment. It is fine if that would
> > > > > not be the default, though.
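[Editor's note] For readers unfamiliar with the plugins mechanism referenced in point (1): Flink loads each subdirectory of plugins/ in its own isolated classloader, which is why jars placed there no longer spoil the main classpath. A sketch of the layout, with example plugin and jar names on a scratch directory:

```shell
# Illustrative plugins/ layout: one subdirectory per plugin, each loaded in
# an isolated classloader. Directory and jar names are examples only,
# demonstrated on a scratch directory standing in for FLINK_HOME.
set -eu
FLINK_HOME="$(mktemp -d)"
mkdir -p "$FLINK_HOME/plugins/s3-fs-hadoop" \
         "$FLINK_HOME/plugins/metrics-prometheus"
touch "$FLINK_HOME/plugins/s3-fs-hadoop/flink-s3-fs-hadoop-1.10.0.jar"
touch "$FLINK_HOME/plugins/metrics-prometheus/flink-metrics-prometheus-1.10.0.jar"
find "$FLINK_HOME/plugins" -name '*.jar' | sort
```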
> > > > >
> > > > >
> > > > > On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <[hidden email]> wrote:
> > > > >
> > > > > > I think having such tools and/or tailor-made distributions can
> > > > > > be nice, but I also think the discussion is missing the main
> > > > > > point: the initial observation/motivation is that apparently a
> > > > > > lot of users (Kurt and I talked about this) on the Chinese
> > > > > > DingTalk support groups and other support channels have problems
> > > > > > when first using the SQL client because of these missing
> > > > > > connectors/formats. For these users, having additional tools
> > > > > > would not solve anything, because they would also not take that
> > > > > > extra step. I think that even tiny friction should be avoided,
> > > > > > because the annoyance from it accumulates across the (hopefully)
> > > > > > many users that we want to have.
> > > > > >
> > > > > > Maybe we should take a step back from discussing the
> "fat"/"slim" idea
> > > > > > and instead think about the composition of the current dist. As
> > > > > > mentioned we have these jars in opt/:
> > > > > >
> > > > > > 17M flink-azure-fs-hadoop-1.10.0.jar
> > > > > > 52K flink-cep-scala_2.11-1.10.0.jar
> > > > > > 180K flink-cep_2.11-1.10.0.jar
> > > > > > 746K flink-gelly-scala_2.11-1.10.0.jar
> > > > > > 626K flink-gelly_2.11-1.10.0.jar
> > > > > > 512K flink-metrics-datadog-1.10.0.jar
> > > > > > 159K flink-metrics-graphite-1.10.0.jar
> > > > > > 1.0M flink-metrics-influxdb-1.10.0.jar
> > > > > > 102K flink-metrics-prometheus-1.10.0.jar
> > > > > > 10K flink-metrics-slf4j-1.10.0.jar
> > > > > > 12K flink-metrics-statsd-1.10.0.jar
> > > > > > 36M flink-oss-fs-hadoop-1.10.0.jar
> > > > > > 28M flink-python_2.11-1.10.0.jar
> > > > > > 22K flink-queryable-state-runtime_2.11-1.10.0.jar
> > > > > > 18M flink-s3-fs-hadoop-1.10.0.jar
> > > > > > 31M flink-s3-fs-presto-1.10.0.jar
> > > > > > 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> > > > > > 518K flink-sql-client_2.11-1.10.0.jar
> > > > > > 99K flink-state-processor-api_2.11-1.10.0.jar
> > > > > > 25M flink-swift-fs-hadoop-1.10.0.jar
> > > > > > 160M opt
> > > > > >
> > > > > > The "filesystem" connectors are the heavy hitters there.
> > > > > >
> > > > > > I downloaded most of the SQL connectors/formats, and this is
> > > > > > what I got:
> > > > > >
> > > > > > 73K flink-avro-1.10.0.jar
> > > > > > 36K flink-csv-1.10.0.jar
> > > > > > 55K flink-hbase_2.11-1.10.0.jar
> > > > > > 88K flink-jdbc_2.11-1.10.0.jar
> > > > > > 42K flink-json-1.10.0.jar
> > > > > > 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> > > > > > 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> > > > > > 24M sql-connectors-formats
> > > > > >
> > > > > > We could just add these to the Flink distribution without
> > > > > > blowing it up by much. We could drop any of the existing
> > > > > > "filesystem" connectors from opt, add the SQL connectors/formats,
> > > > > > and not change the size of the Flink dist. So maybe we should do
> > > > > > that instead?
> > > > > >
> > > > > > We would need some tooling for the sql-client shell script to
> > > > > > pick up the connectors/formats from opt/, because we don't want
> > > > > > to add them to lib/. We're already doing that for finding the
> > > > > > flink-sql-client jar, which is also not in lib/.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > Best,
> > > > > > Aljoscha
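[Editor's note] The "pick up connectors/formats from opt/" tooling described above could look roughly like the following. This is an assumption about how such a script might work, not the actual sql-client.sh; it is demonstrated with stand-in jars in a scratch directory:

```shell
# Sketch (assumption, not the real sql-client.sh): collect connector and
# format jars from opt/ and append them to the SQL Client classpath,
# mirroring how the script already locates the flink-sql-client jar
# outside lib/. Demonstrated with stand-in jars in a scratch directory.
set -eu
FLINK_OPT_DIR="$(mktemp -d)"
touch "$FLINK_OPT_DIR/flink-csv-1.10.0.jar" \
      "$FLINK_OPT_DIR/flink-json-1.10.0.jar"
CC_CLASSPATH=""
for jar in "$FLINK_OPT_DIR"/flink-*.jar; do
  CC_CLASSPATH="$CC_CLASSPATH:$jar"
done
CC_CLASSPATH="${CC_CLASSPATH#:}"   # drop the leading separator
echo "$CC_CLASSPATH"
```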
> > > > > >
> > > > > > On 17.04.20 05:22, Jark Wu wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I like the idea of a web tool to assemble a fat distribution,
> > > > > > > and https://code.quarkus.io/ looks very nice.
> > > > > > > All users need to do is select what they need (I think this
> > > > > > > step can't be omitted anyway).
> > > > > > > We could also provide a default fat distribution on the web
> > > > > > > which pre-selects some popular connectors.
> > > > > > >
> > > > > > > Best,
> > > > > > > Jark
> > > > > > >
> > > > > > > On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <[hidden email]> wrote:
> > > > > > >
> > > > > > > > As a reference for a nice first experience I had, take a
> > > > > > > > look at https://code.quarkus.io/
> > > > > > > > You reach this page after you click "Start Coding" on the
> > > > > > > > project homepage.
> > > > > > > > Rafi
> > > > > > > >
> > > > > > > >
> > > > > > > > On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <[hidden email]> wrote:
> > > > > > > >
> > > > > > > > > I'm not saying pre-bundling some jars will make this
> > > > > > > > > problem go away, and you're right that it only hides the
> > > > > > > > > problem for some users. But what if this solution can hide
> > > > > > > > > the problem for 90% of users? Wouldn't that be good enough
> > > > > > > > > for us to try?
> > > > > > > > >
> > > > > > > > > Regarding "would users following instructions really be
> > > > > > > > > such a big problem?": I'm afraid yes. Otherwise I wouldn't
> > > > > > > > > have answered such questions at least a dozen times, and I
> > > > > > > > > wouldn't keep seeing them come up from time to time. During
> > > > > > > > > some periods, I even saw such questions every day.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Kurt
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <[hidden email]> wrote:
> > > > > > > > >
> > > > > > > > > > The problem with having a distribution with "popular"
> > > > > > > > > > stuff is that it doesn't really *solve* a problem; it
> > > > > > > > > > just hides it for users who fall into these particular
> > > > > > > > > > use-cases. Move outside of those and you once again run
> > > > > > > > > > into the exact same problems outlined above.
> > > > > > > > > > This is exactly why I like the tooling approach; you have
> > > > > > > > > > to deal with it from the start, and transitioning to a
> > > > > > > > > > custom use-case is easier.
> > > > > > > > > >
> > > > > > > > > > Would users following instructions really be such a big
> > > > > > > > > > problem? I would expect that users generally know *what*
> > > > > > > > > > they need, just not necessarily how it is assembled
> > > > > > > > > > correctly (where to get which jar, which directory to put
> > > > > > > > > > it in). It seems like these are exactly the problems this
> > > > > > > > > > would solve?
> > > > > > > > > > I just don't see how moving a jar corresponding to some
> > > > > > > > > > feature from opt to some directory (lib/plugins) is less
> > > > > > > > > > error-prone than just selecting the feature and having
> > > > > > > > > > the tool handle the rest.
> > > > > > > > > >
> > > > > > > > > > As for re-distributions, it depends on the form the tool
> > > > > > > > > > would take. It could be an application that runs locally
> > > > > > > > > > and works against Maven Central (note: not necessarily
> > > > > > > > > > *using* Maven); this should work in China, no?
> > > > > > > > > >
> > > > > > > > > > A web tool would of course be fancy, but I don't know
> > > > > > > > > > how feasible this is with the ASF infrastructure. You
> > > > > > > > > > wouldn't be able to mirror the distribution, so the load
> > > > > > > > > > can't be distributed. I doubt INFRA would like this.
> > > > > > > > > >
> > > > > > > > > > Note that third parties could also start distributing
> > > > > > > > > > use-case-oriented distributions, which would be perfectly
> > > > > > > > > > fine as far as I'm concerned.
> > > > > > > > > >
> > > > > > > > > > On 16/04/2020 16:57, Kurt Young wrote:
> > > > > > > > > >
> > > > > > > > > > I'm not so sure about the web tool solution though. The
> > > > > > > > > > concern I have with this approach is that the final
> > > > > > > > > > generated distribution is kind of non-deterministic. We
> > > > > > > > > > might generate too many different combinations when users
> > > > > > > > > > try to package different types of connectors, formats,
> > > > > > > > > > and maybe even Hadoop releases. As far as I can tell,
> > > > > > > > > > most open source and Apache projects only release some
> > > > > > > > > > pre-defined distributions, which most users are already
> > > > > > > > > > familiar with, and thus hard to change IMO. I have also
> > > > > > > > > > seen cases where users re-distribute the release package
> > > > > > > > > > because of unstable network access to the Apache website
> > > > > > > > > > from China. With a web tool, I don't think this kind of
> > > > > > > > > > re-distribution would be possible anymore.
> > > > > > > > > >
> > > > > > > > > > In the meantime, I'm also concerned that we will fall
> > > > > > > > > > into our trap again if we try to offer this smart &
> > > > > > > > > > flexible solution, because it requires users to cooperate
> > > > > > > > > > with the mechanism. It's exactly the situation we
> > > > > > > > > > currently find ourselves in:
> > > > > > > > > > 1. We offered a smart solution.
> > > > > > > > > > 2. We hope users will follow the correct instructions.
> > > > > > > > > > 3. Everything will work as expected if users followed the
> > > > > > > > > > right instructions.
> > > > > > > > > >
> > > > > > > > > > In reality, I suspect not all users will do the second
> > > > > > > > > > step correctly. And for new users who are only trying to
> > > > > > > > > > have a quick experience with Flink, I would bet most will
> > > > > > > > > > get it wrong.
> > > > > > > > > >
> > > > > > > > > > So, my proposal would be one of the following 2 options:
> > > > > > > > > > 1. Provide a slim distribution for advanced production
> > > > > > > > > > users, plus a distribution with some popular built-in
> > > > > > > > > > jars.
> > > > > > > > > > 2. Only provide a distribution with some popular built-in
> > > > > > > > > > jars.
> > > > > > > > > > If we are trying to reduce the number of distributions we
> > > > > > > > > > release, I would prefer 2 over 1.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Kurt
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > I think what Chesnay and Dawid proposed would be the
> > > > > > > > > > ideal solution. Ideally, we would also have a nice web
> > > > > > > > > > tool for the website which generates the corresponding
> > > > > > > > > > distribution for download.
> > > > > > > > > >
> > > > > > > > > > To get things started, we could begin by only supporting
> > > > > > > > > > downloading/creating the "fat" version with the script.
> > > > > > > > > > The fat version would then consist of the slim
> > > > > > > > > > distribution and whatever we deem important for new users
> > > > > > > > > > to get started.
> > > > > > > > > > new users to get started.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Till
> > > > > > > > > >
> > > > > > > > > > On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Few points from my side:
> > > > > > > > > >
> > > > > > > > > > 1. I like the idea of simplifying the experience for
> > > > > > > > > > first-time users. As for production use cases, I share
> > > > > > > > > > Jark's opinion that there I would expect users to
> > > > > > > > > > assemble their distribution manually. I think in such
> > > > > > > > > > scenarios it is important to understand the
> > > > > > > > > > interconnections. Personally, I'd expect the slimmest
> > > > > > > > > > possible distribution that I can extend further with what
> > > > > > > > > > I need in my production scenario.
> > > > > > > > > >
> > > > > > > > > > 2. I think there is also the problem that the matrix of
> > > > > > > > > > possible combinations that can be useful is already big.
> > > > > > > > > > Do we want to have a distribution for:
> > > > > > > > > >
> > > > > > > > > > SQL users: which connectors should we include? Should we
> > > > > > > > > > include Hive? Which other catalogs?
> > > > > > > > > >
> > > > > > > > > > DataStream users: which connectors should we include?
> > > > > > > > > >
> > > > > > > > > > For both of the above, should we include YARN/Kubernetes?
> > > > > > > > > >
> > > > > > > > > > I would opt for providing only the "slim" distribution as
> > > > > > > > > > a release artifact.
> > > > > > > > > >
> > > > > > > > > > 3. However, as I said, I think it's worth investigating
> > > > > > > > > > how we can improve the user experience. What do you think
> > > > > > > > > > of providing a tool, e.g. a shell script, that constructs
> > > > > > > > > > a distribution based on the user's choices? I think that
> > > > > > > > > > is also what Chesnay meant by "tooling to assemble custom
> > > > > > > > > > distributions". In the end, the difference between a slim
> > > > > > > > > > and a fat distribution comes down to which jars we put
> > > > > > > > > > into lib, right? The tool could have a few "screens":
> > > > > > > > > >
> > > > > > > > > > 1. Which API are you interested in:
> > > > > > > > > > a. SQL API
> > > > > > > > > > b. DataStream API
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > 2. [SQL] Which connectors do you want to use? [multichoice]:
> > > > > > > > > > a. Kafka
> > > > > > > > > > b. Elasticsearch
> > > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > 3. [SQL] Which catalog do you want to use?
> > > > > > > > > >
> > > > > > > > > > ...
> > > > > > > > > >
> > > > > > > > > > Such a tool would download all the dependencies from
> > > > > > > > > > Maven and put them into the correct folders. In the
> > > > > > > > > > future we could extend it with additional rules, e.g.
> > > > > > > > > > kafka-0.9 cannot be chosen at the same time as
> > > > > > > > > > kafka-universal, etc.
> > > > > > > > > >
> > > > > > > > > > The benefit would be that the distribution we release
> > > > > > > > > > could remain "slim", or we could even make it slimmer. I
> > > > > > > > > > might be missing something here though.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > >
> > > > > > > > > > Dawid
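[Editor's note] A minimal sketch of the "assemble a custom distribution" tool proposed above: map a user's selections to Maven Central download URLs. The coordinates, versions, and repository layout here are assumptions for illustration; a real tool would also download the jars, place them under lib/ or plugins/, and enforce rules such as the kafka-0.9 vs. kafka-universal exclusion:

```shell
# Hypothetical sketch: turn the user's connector/format choices (collected
# by the "screens" described above) into Maven Central download URLs.
# Artifact ids and versions are illustrative assumptions.
set -eu
FLINK_VERSION="1.10.0"
MAVEN_BASE="https://repo1.maven.org/maven2/org/apache/flink"

url_for() {  # artifact id -> Maven Central URL for that jar
  printf '%s/%s/%s/%s-%s.jar\n' \
    "$MAVEN_BASE" "$1" "$FLINK_VERSION" "$1" "$FLINK_VERSION"
}

# example selections a user might make
for choice in flink-csv flink-json flink-sql-connector-kafka_2.11; do
  url_for "$choice"
done
```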
> > > > > > > > > >
> > > > > > > > > > On 16/04/2020 11:02, Aljoscha Krettek wrote:
> > > > > > > > > >
> > > > > > > > > > I want to reinforce my opinion from earlier: this is
> > > > > > > > > > about improving the situation both for first-time users
> > > > > > > > > > and for experienced users that want to use a Flink dist
> > > > > > > > > > in production. The current Flink dist is too "thin" for
> > > > > > > > > > first-time SQL users and too "fat" for production users,
> > > > > > > > > > so we are serving no-one properly with the current
> > > > > > > > > > middle ground. That's why I think introducing those
> > > > > > > > > > specialized "spins" of Flink dist would be good.
> > > > > > > > > >
> > > > > > > > > > By the way, at some point in the future production users
> > > > > > > > > > might not even need to get a Flink dist anymore. They
> > > > > > > > > > should be able to have Flink as a dependency of their
> > > > > > > > > > project (including the runtime) and then build an image
> > > > > > > > > > from this for Kubernetes or a fat jar for YARN.
> > > > > > > > > >
> > > > > > > > > > Aljoscha
> > > > > > > > > >
> > > > > > > > > > On 15.04.20 18:14, wenlong.lwl wrote:
> > > > > > > > > >
> > > > > > > > > > Hi all,
> > > > > > > > > >
> > > > > > > > > > Regarding slim and fat distributions, I think different
> > > > > > > > > > kinds of jobs may prefer different types of distribution:
> > > > > > > > > >
> > > > > > > > > > For DataStream jobs, I think we may not want a fat
> > > > > > > > > > distribution containing connectors, because the user
> > > > > > > > > > always needs to depend on the connector in user code
> > > > > > > > > > anyway, and it is easy to include the connector jar in
> > > > > > > > > > the user lib. Fewer jars in lib means fewer class
> > > > > > > > > > conflicts and problems.
> > > > > > > > > >
> > > > > > > > > > For SQL jobs, I think we are trying to encourage users
> > > > > > > > > > to use pure SQL (DDL + DML) to construct their jobs. In
> > > > > > > > > > order to improve the user experience, it may be important
> > > > > > > > > > for Flink not only to provide as many connector jars in
> > > > > > > > > > the distribution as possible (especially the connectors
> > > > > > > > > > and formats we have documented well), but also to provide
> > > > > > > > > > a mechanism to load connectors according to the DDLs.
> > > > > > > > > >
> > > > > > > > > > So I think it could be good to place connector/format
> > > > > > > > > > jars in a directory like opt/connector, which would not
> > > > > > > > > > affect jobs by default, and introduce a mechanism of
> > > > > > > > > > dynamic discovery for SQL.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Wenlong
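[Editor's note] The dynamic-discovery idea above could be sketched as follows: resolve the jar to load from the DDL's 'connector' value, looking in an opt/connector directory that stays off the default classpath. The jar-naming convention and directory are assumptions for illustration, shown on a scratch directory:

```shell
# Sketch of DDL-driven connector discovery: map a 'connector' option value
# to a jar in an opt/connector-style directory. Naming convention and
# layout are assumptions, demonstrated with a stand-in jar.
set -eu
CONNECTOR_DIR="$(mktemp -d)"
touch "$CONNECTOR_DIR/flink-sql-connector-kafka_2.11-1.10.0.jar"

jar_for_connector() {  # DDL 'connector' value -> jar path (empty if absent)
  ls "$CONNECTOR_DIR"/flink-sql-connector-"$1"*.jar 2>/dev/null || true
}

jar_for_connector kafka
```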
> > > > > > > > > >
> > > > > > > > > > On Wed, 15 Apr 2020 at 22:46, Jingsong Li <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I am thinking about both "improve the first experience"
> > > > > > > > > > and "improve the production experience".
> > > > > > > > > >
> > > > > > > > > > I'm thinking about what the common modes of Flink usage
> > > > > > > > > > are: streaming jobs using Kafka? Batch jobs using Hive?
> > > > > > > > > >
> > > > > > > > > > Hive 1.2.1 dependencies can be compatible with most Hive
> > > > > > > > > > server versions, so Spark and Presto have a built-in Hive
> > > > > > > > > > 1.2.1 dependency. Flink is currently mainly used for
> > > > > > > > > > streaming, so let's not talk about Hive.
> > > > > > > > > >
> > > > > > > > > > For streaming jobs, the jobs I have in mind are (with
> > > > > > > > > > respect to connectors):
> > > > > > > > > > - ETL jobs: Kafka -> Kafka
> > > > > > > > > > - Join jobs: Kafka -> DimJDBC -> Kafka
> > > > > > > > > > - Aggregation jobs: Kafka -> JDBCSink
> > > > > > > > > > So Kafka and JDBC are probably the most commonly used,
> > > > > > > > > > along with the CSV and JSON formats.
> > > > > > > > > > So when we provide such a fat distribution:
> > > > > > > > > > - With CSV, JSON.
> > > > > > > > > > - With flink-kafka-universal and Kafka dependencies.
> > > > > > > > > > - With flink-jdbc.
> > > > > > > > > > Using this fat distribution, most users can run their
> > > > > > > > > > jobs well. (A JDBC driver jar is required, but this is
> > > > > > > > > > very natural to do.)
> > > > > > > > > > Can these dependencies lead to conflicts? Only Kafka may
> > > > > > > > > > have conflicts, but if our goal is to use kafka-universal
> > > > > > > > > > to support all Kafka versions, it should cover the vast
> > > > > > > > > > majority of users.
> > > > > > > > > >
> > > > > > > > > > We don't want to put all jars into the fat distribution,
> > > > > > > > > > only those that are common and unlikely to conflict. Of
> > > > > > > > > > course, which jars to put into the fat distribution is a
> > > > > > > > > > matter of consideration. We have the opportunity to serve
> > > > > > > > > > the majority of users while also leaving room for
> > > > > > > > > > customization.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jingsong Lee
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi,
> > > > > > > > > >
> > > > > > > > > > I think we should first reach a consensus on what
> > > > > > > > > > problem we want to solve: (1) improve the first
> > > > > > > > > > experience, or (2) improve the production experience?
> > > > > > > > > >
> > > > > > > > > > As far as I can see from the above discussion, what we
> > > > > > > > > > want to solve is the "first experience". And I think the
> > > > > > > > > > slim jar is still the best distribution for production,
> > > > > > > > > > because assembling jars is easier than excluding jars and
> > > > > > > > > > avoids potential class conflicts.
> > > > > > > > > >
> > > > > > > > > > If we want to improve the "first experience", I think it
> > > > > > > > > > makes sense to have a fat distribution to give users a
> > > > > > > > > > smoother first experience. But I would like to call it a
> > > > > > > > > > "playground distribution" or something like that, to
> > > > > > > > > > explicitly differentiate it from the "slim
> > > > > > > > > > production-purpose distribution".
> > > > > > > > > >
> > > > > > > > > > The "playground distribution" could contain some widely
> > > > > > > > > > used jars, like universal-kafka-sql-connector,
> > > > > > > > > > elasticsearch7-sql-connector, avro, json, csv, etc.
> > > > > > > > > > We could even provide a playground Docker image which
> > > > > > > > > > contains the fat distribution, python3, and Hive.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jark
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > I don't see a lot of value in having multiple
> > > > > > > > > > distributions.
> > > > > > > > > >
> > > > > > > > > > The simple reality is that no fat distribution we could
> > > > > > > > > > provide would satisfy all use-cases, so why even try?
> > > > > > > > > > If users commonly run into issues for certain jars, then
> > > > > > > > > > maybe those should be added to the current distribution.
> > > > > > > > > >
> > > > > > > > > > Personally though, I still believe we should only
> > > > > > > > > > distribute a slim version. I'd rather have users always
> > > > > > > > > > add required jars to the distribution than only when they
> > > > > > > > > > go outside our "expected" use-cases.
> > > > > > > > > >
> > > > > > > > > > Then we might finally address this issue properly, i.e.,
> > > > > > > > > > with tooling to assemble custom distributions and/or
> > > > > > > > > > better error messages if Flink-provided extensions cannot
> > > > > > > > > > be found.
> > > > > > > > > >
> > > > > > > > > > On 15/04/2020 15:23, Kurt Young wrote:
> > > > > > > > > >
> > > > > > > > > > Regarding the specific solution, I'm not sure about the
> > > > > > > > > > "fat" and "slim" split though. I get the idea that we can
> > > > > > > > > > make the slim one even more lightweight than the current
> > > > > > > > > > distribution, but what about the "fat" one? Do you mean
> > > > > > > > > > that we would package all connectors and formats into it?
> > > > > > > > > > I'm not sure that is feasible. For example, we can't put
> > > > > > > > > > all versions of the Kafka and Hive connector jars into
> > > > > > > > > > the lib directory, and we also might need Hadoop jars
> > > > > > > > > > when using the filesystem connector to access data on
> > > > > > > > > > HDFS.
> > > > > > > > > >
> > > > > > > > > > So my guess would be that we hand-pick some of the most
> > > > > > > > > > frequently used connectors and formats for our "lib"
> > > > > > > > > > directory, like Kafka, CSV, and JSON mentioned above, and
> > > > > > > > > > still leave some other connectors out. If this is the
> > > > > > > > > > case, then why not just provide this distribution to
> > > > > > > > > > users? I'm not sure I see the benefit of providing
> > > > > > > > > > another super "slim" jar (we have to pay some costs to
> > > > > > > > > > provide another suite of distributions).
> > > > > > > > > >
> > > > > > > > > > What do you think?
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Kurt
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > Big +1.
> > > > > > > > > >
> > > > > > > > > > I like "fat" and "slim".
> > > > > > > > > >
> > > > > > > > > > For csv and json: like Jark said, they are quite small
> > > > > > > > > > and don't have other dependencies. They are important to
> > > > > > > > > > the Kafka connector, and important to the upcoming
> > > > > > > > > > filesystem connector too.
> > > > > > > > > > So can we include them in both "fat" and "slim"? They're
> > > > > > > > > > that important, and they're that lightweight.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jingsong Lee
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 15, 2020 at 4:53 PM godfrey he <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > Big +1.
> > > > > > > > > > This will improve the user experience (especially for
> > > > > > > > > > new Flink users).
> > > > > > > > > > We have answered so many questions about "class not found".
> > > > > > > > > > We answered so many questions about "class not found".
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Godfrey
> > > > > > > > > >
> > > > > > > > > > Dian Fu <[hidden email]> wrote on Wed, Apr 15, 2020 at 4:30 PM:
> > > > > > > > > >
> > > > > > > > > > +1 to this proposal.
> > > > > > > > > >
> > > > > > > > > > Missing connector jars are also a big problem for PyFlink
> > > > > > > > > > users. Currently, after a Python user has installed
> > > > > > > > > > PyFlink using `pip`, they have to manually copy the
> > > > > > > > > > connector fat jars into the PyFlink installation
> > > > > > > > > > directory for the connectors to be usable when running
> > > > > > > > > > jobs locally. This process is very confusing for users
> > > > > > > > > > and hurts the experience a lot.
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Dian
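[Editor's note] A hypothetical helper for the manual step Dian describes: find the PyFlink installation directory and copy a connector jar into its lib/ folder. The lib/ layout of a pip-installed PyFlink is an assumption here, and the sketch falls back to a scratch directory when pyflink is not installed so the flow can be demonstrated anywhere:

```shell
# Hypothetical sketch of the manual jar-copy step for PyFlink users. The
# pyflink lib/ layout is an assumption; a scratch fallback directory is
# used when pyflink is not installed.
set -eu
WORK_DIR="$(mktemp -d)"
PYFLINK_LIB="$(python3 - "$WORK_DIR" <<'PY'
import importlib.util, os, sys
spec = importlib.util.find_spec("pyflink")
if spec is not None and spec.origin:
    base = os.path.dirname(spec.origin)       # real pip-installed pyflink
else:
    base = os.path.join(sys.argv[1], "pyflink-demo")  # demo fallback
print(os.path.join(base, "lib"))
PY
)"
mkdir -p "$PYFLINK_LIB"
touch "$WORK_DIR/flink-sql-connector-kafka_2.11-1.10.0.jar"  # stand-in jar
cp "$WORK_DIR/flink-sql-connector-kafka_2.11-1.10.0.jar" "$PYFLINK_LIB/"
ls "$PYFLINK_LIB"
```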
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Apr 15, 2020 at 3:51 PM, Jark Wu <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > +1 to the proposal. I also found the "download
> > > > > > > > > > additional jars" step really verbose when preparing
> > > > > > > > > > webinars.
> > > > > > > > > >
> > > > > > > > > > At least, I think the flink-csv and flink-json should in
> the
> > > > > > > > > >
> > > > > > > > > > distribution,
> > > > > > > > > >
> > > > > > > > > > they are quite small and don't have other dependencies.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jark
> > > > > > > > > >
> > > > > > > > > > On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Aljoscha,
> > > > > > > > > >
> > > > > > > > > > Big +1 for the fat Flink distribution. Where do you plan to put these
> > > > > > > > > > connectors? opt or lib?
> > > > > > > > > >
> > > > > > > > > > On Wed, Apr 15, 2020 at 3:30 PM, Aljoscha Krettek <[hidden email]> wrote:
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Hi Everyone,
> > > > > > > > > >
> > > > > > > > > > I'd like to discuss releasing a more full-featured Flink distribution.
> > > > > > > > > > The motivation is that there is friction for SQL/Table API users that
> > > > > > > > > > want to use Table connectors which are not in the current Flink
> > > > > > > > > > distribution. For these users the workflow is currently roughly:
> > > > > > > > > >
> > > > > > > > > > - download Flink dist
> > > > > > > > > > - configure csv/Kafka/json connectors per configuration
> > > > > > > > > > - run SQL client or program
> > > > > > > > > > - decrypt error message and research the solution
> > > > > > > > > > - download additional connector jars
> > > > > > > > > > - program works correctly
> > > > > > > > > >
> > > > > > > > > > I realize that this can be made to work, but if every SQL user has this
> > > > > > > > > > as their first experience, that doesn't seem good to me.
> > > > > > > > > >
> > > > > > > > > > My proposal is to provide two versions of the Flink distribution in the
> > > > > > > > > > future: "fat" and "slim" (names to be discussed):
> > > > > > > > > >
> > > > > > > > > > - slim would be even trimmer than today's distribution
> > > > > > > > > > - fat would contain a lot of convenience connectors (yet to be
> > > > > > > > > >   determined which ones)
> > > > > > > > > >
> > > > > > > > > > And yes, I realize that there are already more dimensions of Flink
> > > > > > > > > > releases (Scala version and Java version).
> > > > > > > > > >
> > > > > > > > > > For background, our current Flink dist has these in the opt directory:
> > > > > > > > > >
> > > > > > > > > > - flink-azure-fs-hadoop-1.10.0.jar
> > > > > > > > > > - flink-cep-scala_2.12-1.10.0.jar
> > > > > > > > > > - flink-cep_2.12-1.10.0.jar
> > > > > > > > > > - flink-gelly-scala_2.12-1.10.0.jar
> > > > > > > > > > - flink-gelly_2.12-1.10.0.jar
> > > > > > > > > > - flink-metrics-datadog-1.10.0.jar
> > > > > > > > > > - flink-metrics-graphite-1.10.0.jar
> > > > > > > > > > - flink-metrics-influxdb-1.10.0.jar
> > > > > > > > > > - flink-metrics-prometheus-1.10.0.jar
> > > > > > > > > > - flink-metrics-slf4j-1.10.0.jar
> > > > > > > > > > - flink-metrics-statsd-1.10.0.jar
> > > > > > > > > > - flink-oss-fs-hadoop-1.10.0.jar
> > > > > > > > > > - flink-python_2.12-1.10.0.jar
> > > > > > > > > > - flink-queryable-state-runtime_2.12-1.10.0.jar
> > > > > > > > > > - flink-s3-fs-hadoop-1.10.0.jar
> > > > > > > > > > - flink-s3-fs-presto-1.10.0.jar
> > > > > > > > > > - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> > > > > > > > > > - flink-sql-client_2.12-1.10.0.jar
> > > > > > > > > > - flink-state-processor-api_2.12-1.10.0.jar
> > > > > > > > > > - flink-swift-fs-hadoop-1.10.0.jar
> > > > > > > > > >
> > > > > > > > > > Current Flink dist is 267M. If we removed everything from opt we would
> > > > > > > > > > go down to 126M. I would recommend this, because the large majority of
> > > > > > > > > > the files in opt are probably unused.
> > > > > > > > > >
> > > > > > > > > > What do you think?
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Aljoscha
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best Regards
> > > > > > > > > >
> > > > > > > > > > Jeff Zhang
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best, Jingsong Lee
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Best, Jingsong Lee
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > >
> > > >
> > > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> >
> >
> > --
> > Best, Jingsong Lee
>


--

Best,
Benchao Li

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Leonard Xu
+1 for Jingsong’s proposal to put flink-csv, flink-json and flink-avro under the lib/ directory.
I have heard many SQL users (mostly newbies) complain about the out-of-the-box experience on the mailing list.

Best,
Leonard Xu


> On Jun 5, 2020, at 14:39, Benchao Li <[hidden email]> wrote:
>
> +1 to include them for sql-client by default;
> +0 to put into lib and exposed to all kinds of jobs, including DataStream.
>
>> On Fri, Jun 5, 2020 at 2:31 PM, Danny Chan <[hidden email]> wrote:
>
>> +1, at least, we should keep an out of the box SQL-CLI, it’s very poor
>> experience to add such required format jars for SQL users.
>>
>> Best,
>> Danny Chan
>> On Jun 5, 2020 at 11:14 AM +0800, Jingsong Li <[hidden email]> wrote:
>>> Hi all,
>>>
>>> Considering that 1.11 will be released soon, what about my previous
>>> proposal? Put flink-csv, flink-json and flink-avro under lib.
>>> These three formats are very small and no third party dependence, and
>> they
>>> are widely used by table users.
>>>
>>> Best,
>>> Jingsong Lee
>>>
>>> On Tue, May 12, 2020 at 4:19 PM Jingsong Li <[hidden email]>
>> wrote:
>>>
>>>> Thanks for your discussion.
>>>>
>>>> Sorry to start discussing another thing:
>>>>
>>>> The biggest problem I see is the variety of problems caused by users'
>> lack
>>>> of format dependency.
>>>> As Aljoscha said, these three formats are very small and no third party
>>>> dependence, and they are widely used by table users.
>>>> Actually, we don't have any other built-in table formats now... In
>> total
>>>> 151K...
>>>>
>>>> 73K flink-avro-1.10.0.jar
>>>> 36K flink-csv-1.10.0.jar
>>>> 42K flink-json-1.10.0.jar
>>>>
>>>> So, Can we just put them into "lib/" or flink-table-uber?
>>>> It does not solve all problems, and maybe it is independent of "fat" and
>>>> "slim". But it also improves usability.
>>>> What do you think? Any objections?
>>>>
>>>> Best,
>>>> Jingsong Lee
>>>>
>>>> On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <[hidden email]>
>>>> wrote:
>>>>
>>>>> One downside would be that we're shipping more stuff when running on
>>>>> YARN for example, since the entire plugins directory is shipped by
>> default.
>>>>>
>>>>> On 17/04/2020 16:38, Stephan Ewen wrote:
>>>>>> @Aljoscha I think that is an interesting line of thinking. The swift-fs
>>>>>> may be used rarely enough to move it to an optional download.
>>>>>>
>>>>>> I would still drop two more thoughts:
>>>>>>
>>>>>> (1) Now that we have plugins support, is there a reason to have a
>>>>> metrics
>>>>>> reporter or file system in /opt instead of /plugins? They don't
>> spoil
>>>>> the
>>>>>> class path any more.
>>>>>>
>>>>>> (2) I can imagine there still being a desire to have a "minimal"
>> docker
>>>>>> file, for users that want to keep the container images as small as
>>>>>> possible, to speed up deployment. It is fine if that would not be
>> the
>>>>>> default, though.
>>>>>>
>>>>>>
>>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <
>> [hidden email]>
>>>>>> wrote:
>>>>>>
>>>>>>> I think having such tools and/or tailor-made distributions can
>> be nice
>>>>>>> but I also think the discussion is missing the main point: The
>> initial
>>>>>>> observation/motivation is that apparently a lot of users (Kurt
>> and I
>>>>>>> talked about this) on the Chinese DingTalk support groups, and
>> other
>>>>>>> support channels have problems when first using the SQL client
>> because
>>>>>>> of these missing connectors/formats. For these, having
>> additional tools
>>>>>>> would not solve anything because they would also not take that
>> extra
>>>>>>> step. I think that even tiny friction should be avoided because
>> the
>>>>>>> annoyance from it accumulates over the (hopefully) many users that we
>>>>>>> want to have.
>>>>>>>
>>>>>>> Maybe we should take a step back from discussing the
>> "fat"/"slim" idea
>>>>>>> and instead think about the composition of the current dist. As
>>>>>>> mentioned we have these jars in opt/:
>>>>>>>
>>>>>>> 17M flink-azure-fs-hadoop-1.10.0.jar
>>>>>>> 52K flink-cep-scala_2.11-1.10.0.jar
>>>>>>> 180K flink-cep_2.11-1.10.0.jar
>>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
>>>>>>> 626K flink-gelly_2.11-1.10.0.jar
>>>>>>> 512K flink-metrics-datadog-1.10.0.jar
>>>>>>> 159K flink-metrics-graphite-1.10.0.jar
>>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
>>>>>>> 102K flink-metrics-prometheus-1.10.0.jar
>>>>>>> 10K flink-metrics-slf4j-1.10.0.jar
>>>>>>> 12K flink-metrics-statsd-1.10.0.jar
>>>>>>> 36M flink-oss-fs-hadoop-1.10.0.jar
>>>>>>> 28M flink-python_2.11-1.10.0.jar
>>>>>>> 22K flink-queryable-state-runtime_2.11-1.10.0.jar
>>>>>>> 18M flink-s3-fs-hadoop-1.10.0.jar
>>>>>>> 31M flink-s3-fs-presto-1.10.0.jar
>>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>> 518K flink-sql-client_2.11-1.10.0.jar
>>>>>>> 99K flink-state-processor-api_2.11-1.10.0.jar
>>>>>>> 25M flink-swift-fs-hadoop-1.10.0.jar
>>>>>>> 160M opt
>>>>>>>
>>>>>>> The "filesystem" connectors ar ethe heavy hitters, there.
>>>>>>>
>>>>>>> I downloaded most of the SQL connectors/formats and this is what
>> I got:
>>>>>>>
>>>>>>> 73K flink-avro-1.10.0.jar
>>>>>>> 36K flink-csv-1.10.0.jar
>>>>>>> 55K flink-hbase_2.11-1.10.0.jar
>>>>>>> 88K flink-jdbc_2.11-1.10.0.jar
>>>>>>> 42K flink-json-1.10.0.jar
>>>>>>> 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
>>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
>>>>>>> 24M sql-connectors-formats
>>>>>>>
>>>>>>> We could just add these to the Flink distribution without
>> blowing it up
>>>>>>> by much. We could drop any of the existing "filesystem"
>> connectors from
>>>>>>> opt and add the SQL connectors/formats and not change the size
>> of Flink
>>>>>>> dist. So maybe we should do that instead?
>>>>>>>
>>>>>>> We would need some tooling for the sql-client shell script to pick up
>>>>>>> the connectors/formats from opt/, because we don't want to add them to
>>>>>>> lib/. We're already doing that for finding the flink-sql-client jar,
>>>>>>> which is also not in lib/.
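A minimal sketch of what such a pick-up step could look like, assuming jar name patterns like the ones listed earlier in the thread. This is illustrative only, not the actual sql-client script; the function and directory names are made up for the example:

```shell
#!/usr/bin/env sh
# Sketch: collect connector/format jars from an opt/ directory into a
# classpath string, the way a sql-client launcher could.
collect_sql_jars() {
    _opt_dir="$1"
    _cp=""
    for jar in "${_opt_dir}"/flink-sql-connector-*.jar \
               "${_opt_dir}"/flink-csv-*.jar \
               "${_opt_dir}"/flink-json-*.jar \
               "${_opt_dir}"/flink-avro-*.jar; do
        [ -e "$jar" ] || continue    # skip glob patterns that matched nothing
        _cp="${_cp}:${jar}"
    done
    printf '%s\n' "${_cp#:}"         # strip the leading colon
}

# Demo against a throwaway directory containing fake jar files.
demo=$(mktemp -d)
touch "${demo}/flink-csv-1.10.0.jar" \
      "${demo}/flink-sql-connector-kafka_2.11-1.10.0.jar"
collect_sql_jars "${demo}"
```

The launcher would append the resulting string to the JVM classpath it builds for the SQL client, leaving lib/ untouched for other deployments.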
>>>>>>>
>>>>>>> What do you think?
>>>>>>>
>>>>>>> Best,
>>>>>>> Aljoscha
>>>>>>>
>>>>>>> On 17.04.20 05:22, Jark Wu wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I like the idea of web tool to assemble fat distribution. And
>> the
>>>>>>>> https://code.quarkus.io/ looks very nice.
>>>>>>>> All the users need to do is just select what he/she need (I
>> think this
>>>>>>> step
>>>>>>>> can't be omitted anyway).
>>>>>>>> We can also provide a default fat distribution on the web which
>>>>> default
>>>>>>>> selects some popular connectors.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Jark
>>>>>>>>
>>>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <[hidden email]> wrote:
>>>>>>>>
>>>>>>>>> As a reference for a nice first-experience I had, take a
>> look at
>>>>>>>>> https://code.quarkus.io/
>>>>>>>>> You reach this page after you click "Start Coding" at the
>> project
>>>>>>> homepage.
>>>>>>>>> Rafi
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <[hidden email]>
>> wrote:
>>>>>>>>>
>>>>>>>>>> I'm not saying that pre-bundling some jars will make this problem
>>>>>>>>>> go away, and you're right that it only hides the problem for
>>>>>>>>>> some users. But what if this solution can hide the problem for 90%
>>>>>>>>>> of users? Wouldn't that be good enough for us to try?
>>>>>>>>>>
>>>>>>>>>> Regarding whether users following instructions would really be such a
>>>>>>>>>> big problem? I'm afraid yes. Otherwise I wouldn't have answered such
>>>>>>>>>> questions at least a dozen times, and I wouldn't keep seeing such
>>>>>>>>>> questions coming up from time to time. During some periods, I even saw
>>>>>>>>>> such questions every day.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Kurt
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
>>>>> [hidden email]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> The problem with having a distribution with "popular"
>> stuff is
>>>>> that it
>>>>>>>>>>> doesn't really *solve* a problem, it just hides it for
>> users who
>>>>> fall
>>>>>>>>>>> into these particular use-cases.
>>>>>>>>>>> Move out of it and you once again run into exact same
>> problems
>>>>>>>>> out-lined.
>>>>>>>>>>> This is exactly why I like the tooling approach; you
>> have to deal
>>>>> with
>>>>>>>>> it
>>>>>>>>>>> from the start and transitioning to a custom use-case is
>> easier.
>>>>>>>>>>>
>>>>>>>>>>> Would users following instructions really be such a big
>> problem?
>>>>>>>>>>> I would expect that users generally know *what *they
>> need, just not
>>>>>>>>>>> necessarily how it is assembled correctly (where do get
>> which jar,
>>>>>>>>> which
>>>>>>>>>>> directory to put it in).
>>>>>>>>>>> It seems like these are exactly the problem this would
>> solve?
>>>>>>>>>>> I just don't see how moving a jar corresponding to some
>> feature
>>>>> from
>>>>>>>>> opt
>>>>>>>>>>> to some directory (lib/plugins) is less error-prone than
>> just
>>>>>>> selecting
>>>>>>>>>> the
>>>>>>>>>>> feature and having the tool handle the rest.
>>>>>>>>>>>
>>>>>>>>>>> As for re-distributions, it depends on the form that the
>> tool would
>>>>>>>>> take.
>>>>>>>>>>> It could be an application that runs locally and works against
>>>>>>>>>>> maven central (note: not necessarily *using* maven); this should
>>>>>>>>>>> work in China, no?
>>>>>>>>>>>
>>>>>>>>>>> A web tool would of course be fancy, but I don't know
>> how feasible
>>>>>>> this
>>>>>>>>>> is
>>>>>>>>>>> with the ASF infrastructure.
>>>>>>>>>>> You wouldn't be able to mirror the distribution, so the
>> load can't
>>>>> be
>>>>>>>>>>> distributed. I doubt INFRA would like this.
>>>>>>>>>>>
>>>>>>>>>>> Note that third-parties could also start distributing
>> use-case
>>>>>>> oriented
>>>>>>>>>>> distributions, which would be perfectly fine as far as
>> I'm
>>>>> concerned.
>>>>>>>>>>>
>>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
>>>>>>>>>>>
>>>>>>>>>>> I'm not so sure about the web tool solution though. The
>> concern I
>>>>> have
>>>>>>>>>> for
>>>>>>>>>>> this approach is the final generated
>>>>>>>>>>> distribution is kind of non-deterministic. We might
>> generate too
>>>>> many
>>>>>>>>>>> different combinations when user trying to
>>>>>>>>>>> package different types of connector, format, and even
>> maybe hadoop
>>>>>>>>>>> releases. As far as I can tell, most open
>>>>>>>>>>> source projects and apache projects will only release
>> some
>>>>>>>>>>> pre-defined distributions, which most users are already
>>>>>>>>>>> familiar with, thus hard to change IMO. And I have also gone through
>>>>>>>>>>> some cases where users will try to re-distribute
>>>>>>>>>>> the release package, because of the unstable network of
>> apache
>>>>> website
>>>>>>>>>> from
>>>>>>>>>>> China. In web tool solution, I don't
>>>>>>>>>>> think this kind of re-distribution would be possible
>> anymore.
>>>>>>>>>>>
>>>>>>>>>>> In the meantime, I also have a concern that we will fall
>> back into
>>>>> our
>>>>>>>>>> trap
>>>>>>>>>>> again if we try to offer this smart & flexible
>>>>>>>>>>> solution. Because it needs users to cooperate with such
>> mechanism.
>>>>>>> It's
>>>>>>>>>>> exactly the situation what we currently fell
>>>>>>>>>>> into:
>>>>>>>>>>> 1. We offered a smart solution.
>>>>>>>>>>> 2. We hope users will follow the correct instructions.
>>>>>>>>>>> 3. Everything will work as expected if users followed
>> the right
>>>>>>>>>>> instructions.
>>>>>>>>>>>
>>>>>>>>>>> In reality, I suspect not all users will do the second
>> step
>>>>> correctly.
>>>>>>>>>> And
>>>>>>>>>>> for new users who only trying to have a quick
>>>>>>>>>>> experience with Flink, I would bet most users will do it
>> wrong.
>>>>>>>>>>>
>>>>>>>>>>> So, my proposal would be one of the following 2 options:
>>>>>>>>>>> 1. Provide a slim distribution for advanced product
>> users and
>>>>> provide
>>>>>>> a
>>>>>>>>>>> distribution which will have some popular builtin jars.
>>>>>>>>>>> 2. Only provide a distribution which will have some
>> popular builtin
>>>>>>>>> jars.
>>>>>>>>>>> If we are trying to reduce the distributions we release, I would
>>>>>>>>>>> prefer 2 over 1.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kurt
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <
>>>>> [hidden email]>
>>>>>>> <
>>>>>>>>>> [hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I think what Chesnay and Dawid proposed would be the
>> ideal
>>>>> solution.
>>>>>>>>>>> Ideally, we would also have a nice web tool for the
>> website which
>>>>>>>>>> generates
>>>>>>>>>>> the corresponding distribution for download.
>>>>>>>>>>>
>>>>>>>>>>> To get things started we could start with only
>> supporting to
>>>>>>>>>>> download/creating the "fat" version with the script. The
>> fat
>>>>> version
>>>>>>>>>> would
>>>>>>>>>>> then consist of the slim distribution and whatever we
>> deem
>>>>> important
>>>>>>>>> for
>>>>>>>>>>> new users to get started.
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Till
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
>>>>>>>>>> [hidden email]> <[hidden email]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> Few points from my side:
>>>>>>>>>>>
>>>>>>>>>>> 1. I like the idea of simplifying the experience for
>> first time
>>>>> users.
>>>>>>>>>>> As for production use cases I share Jark's opinion that
>> in this
>>>>> case I
>>>>>>>>>>> would expect users to combine their distribution
>> manually. I think
>>>>> in
>>>>>>>>>>> such scenarios it is important to understand
>> interconnections.
>>>>>>>>>>> Personally I'd expect the slimmest possible distribution
>> that I can
>>>>>>>>>>> extend further with what I need in my production
>> scenario.
>>>>>>>>>>>
>>>>>>>>>>> 2. I think there is also the problem that the matrix of
>> possible
>>>>>>>>>>> combinations that can be useful is already big. Do we
>> want to have
>>>>> a
>>>>>>>>>>> distribution for:
>>>>>>>>>>>
>>>>>>>>>>> SQL users: which connectors should we include? should we
>>>>> include
>>>>>>>>>>> hive? which other catalog?
>>>>>>>>>>>
>>>>>>>>>>> DataStream users: which connectors should we include?
>>>>>>>>>>>
>>>>>>>>>>> For both of the above should we include yarn/kubernetes?
>>>>>>>>>>>
>>>>>>>>>>> I would opt for providing only the "slim" distribution
>> as a release
>>>>>>>>>>> artifact.
>>>>>>>>>>>
>>>>>>>>>>> 3. However, as I said I think it's worth investigating how we can
>>>>>>>>>>> improve the user experience. What do you think of providing a tool,
>>>>>>>>>>> e.g. a shell script, that constructs a distribution based on the
>>>>>>>>>>> user's choice? I
>>>>>>>>>>> think that was also what Chesnay mentioned as "tooling to
>>>>>>>>>>> assemble custom distributions" In the end how I see the
>> difference
>>>>>>>>>>> between a slim and fat distribution is which jars do we
>> put into
>>>>> the
>>>>>>>>>>> lib, right? It could have a few "screens".
>>>>>>>>>>>
>>>>>>>>>>> 1. Which API are you interested in:
>>>>>>>>>>> a. SQL API
>>>>>>>>>>> b. DataStream API
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> 2. [SQL] Which connectors do you want to use?
>> [multichoice]:
>>>>>>>>>>> a. Kafka
>>>>>>>>>>> b. Elasticsearch
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> 3. [SQL] Which catalog you want to use?
>>>>>>>>>>>
>>>>>>>>>>> ...
>>>>>>>>>>>
>>>>>>>>>>> Such a tool would download all the dependencies from
>> maven and put
>>>>>>> them
>>>>>>>>>>> into the correct folder. In the future we can extend it
>> with
>>>>>>> additional
>>>>>>>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time
>> with
>>>>>>>>>>> kafka-universal etc.
>>>>>>>>>>>
>>>>>>>>>>> The benefit of it would be that the distribution that we
>> release
>>>>> could
>>>>>>>>>>> remain "slim" or we could even make it slimmer. I might
>> be missing
>>>>>>>>>>> something here though.
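A rough sketch of the assembly tool Dawid describes, assuming the standard Maven Central repository layout. The artifact names, version, and the "selection" stand in for the user's answers on the proposed screens; the script only prints what it would fetch (dry run) and is not a real Flink tool:

```shell
#!/usr/bin/env sh
# Sketch of the "assemble a custom distribution" idea (dry run).
# Version and artifact names below are illustrative assumptions.
VERSION="1.10.0"
REPO="https://repo.maven.apache.org/maven2/org/apache/flink"

# Pretend the user picked SQL API + Kafka + csv/json on the screens.
SELECTED="flink-sql-connector-kafka_2.11 flink-csv flink-json"

# For each selection, print the Maven Central URL it would download into lib/.
for artifact in ${SELECTED}; do
    echo "fetch ${REPO}/${artifact}/${VERSION}/${artifact}-${VERSION}.jar -> lib/"
done
```

A real tool would add the compatibility rules mentioned above (e.g. rejecting kafka-0.9 together with kafka-universal) before resolving any downloads.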
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>>
>>>>>>>>>>> Dawid
>>>>>>>>>>>
>>>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
>>>>>>>>>>>
>>>>>>>>>>> I want to reinforce my opinion from earlier: This is
>> about
>>>>> improving
>>>>>>>>>>> the situation both for first-time users and for
>> experienced users
>>>>> that
>>>>>>>>>>> want to use a Flink dist in production. The current
>> Flink dist is
>>>>> too
>>>>>>>>>>> "thin" for first-time SQL users and it is too "fat" for
>> production
>>>>>>>>>>> users, that is where serving no-one properly with the
>> current
>>>>>>>>>>> middle-ground. That's why I think introducing those
>> specialized
>>>>>>>>>>> "spins" of Flink dist would be good.
>>>>>>>>>>>
>>>>>>>>>>> By the way, at some point in the future production users
>> might not
>>>>>>>>>>> even need to get a Flink dist anymore. They should be
>> able to have
>>>>>>>>>>> Flink as a dependency of their project (including the
>> runtime) and
>>>>>>>>>>> then build an image from this for Kubernetes or a fat
>> jar for YARN.
>>>>>>>>>>>
>>>>>>>>>>> Aljoscha
>>>>>>>>>>>
>>>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> Regarding slim and fat distributions, I think different kinds of
>>>>>>>>>>> jobs may prefer different types of distribution:
>>>>>>>>>>>
>>>>>>>>>>> For DataStream jobs, I think we may not like a fat distribution
>>>>>>>>>>> containing connectors, because users would always need to depend on
>>>>>>>>>>> the connector in user code; it is easy to include the connector jar
>>>>>>>>>>> in the user lib. Fewer jars in lib means fewer class conflicts and
>>>>>>>>>>> problems.
>>>>>>>>>>>
>>>>>>>>>>> For SQL jobs, I think we are trying to encourage users to use pure
>>>>>>>>>>> SQL (DDL + DML) to construct their jobs. In order to improve the
>>>>>>>>>>> user experience, it may be important for Flink not only to provide
>>>>>>>>>>> as many connector jars in the distribution as possible, especially
>>>>>>>>>>> the connectors and formats we have well documented, but also to
>>>>>>>>>>> provide a mechanism to load connectors according to the DDLs.
>>>>>>>>>>>
>>>>>>>>>>> So I think it could be good to place connector/format jars in some
>>>>>>>>>>> dir like opt/connector, which would not affect jobs by default, and
>>>>>>>>>>> introduce a mechanism of dynamic discovery for SQL.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Wenlong
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <
>> [hidden email]>
>>>>> <
>>>>>>>>>> [hidden email]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I am thinking both "improve first experience" and
>> "improve
>>>>> production
>>>>>>>>>>> experience".
>>>>>>>>>>>
>>>>>>>>>>> I'm thinking about what's the common mode of Flink?
>>>>>>>>>>> Streaming jobs use Kafka? Batch jobs use Hive?
>>>>>>>>>>>
>>>>>>>>>>> Hive 1.2.1 dependencies can be compatible with most of
>> Hive server
>>>>>>>>>>> versions. So Spark and Presto have built-in Hive 1.2.1
>> dependency.
>>>>>>>>>>> Flink is currently mainly used for streaming, so let's
>> not talk
>>>>>>>>>>> about hive.
>>>>>>>>>>>
>>>>>>>>>>> For streaming jobs, first of all, the jobs in my mind is
>> (related
>>>>> to
>>>>>>>>>>> connectors):
>>>>>>>>>>> - ETL jobs: Kafka -> Kafka
>>>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
>>>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
>>>>>>>>>>> So Kafka and JDBC are probably the most commonly used. Of course,
>>>>>>>>>>> this also includes the CSV and JSON formats.
>>>>>>>>>>> So when we provide such a fat distribution:
>>>>>>>>>>> - With CSV, JSON.
>>>>>>>>>>> - With flink-kafka-universal and kafka dependencies.
>>>>>>>>>>> - With flink-jdbc.
>>>>>>>>>>> Using this fat distribution, most users can run their jobs well
>>>>>>>>>>> (jdbc driver jar required, but this is very natural to do).
>>>>>>>>>>> Can these dependencies lead to conflicts? Only Kafka may have
>>>>>>>>>>> conflicts, but if our goal is to use kafka-universal to support all
>>>>>>>>>>> Kafka versions, it is hopeful to target the vast majority of users.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> We don't want to plug all jars into the fat distribution; we only
>>>>>>>>>>> need the common, low-conflict ones. Of course, it is a matter of
>>>>>>>>>>> consideration which jars to put into the fat distribution.
>>>>>>>>>>> We have the opportunity to facilitate the majority of users, while
>>>>>>>>>>> also leaving opportunities for customization.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jingsong Lee
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <
>> [hidden email]> <
>>>>>>>>>> [hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi,
>>>>>>>>>>>
>>>>>>>>>>> I think we should first reach a consensus on "what problem do we
>>>>>>>>>>> want to solve?"
>>>>>>>>>>> (1) improve first experience? or (2) improve production
>> experience?
>>>>>>>>>>>
>>>>>>>>>>> As far as I can see, with the above discussion, I think
>> what we
>>>>>>>>>>> want to
>>>>>>>>>>> solve is the "first experience".
>>>>>>>>>>> And I think the slim jar is still the best distribution for
>>>>>>>>>>> production, because it's easier to assemble jars than to exclude
>>>>>>>>>>> jars, and it can avoid potential class conflicts.
>>>>>>>>>>>
>>>>>>>>>>> If we want to improve the "first experience", I think it makes
>>>>>>>>>>> sense to have a fat distribution to give users a smoother first
>>>>>>>>>>> experience.
>>>>>>>>>>> But I would like to call it "playground distribution" or
>> something
>>>>>>>>>>> like
>>>>>>>>>>> that to explicitly differ from the "slim
>> production-purpose
>>>>>>>>>>>
>>>>>>>>>>> distribution".
>>>>>>>>>>>
>>>>>>>>>>> The "playground distribution" can contains some widely
>> used jars,
>>>>>>>>>>>
>>>>>>>>>>> like
>>>>>>>>>>>
>>>>>>>>>>> universal-kafka-sql-connector,
>> elasticsearch7-sql-connector, avro,
>>>>>>>>>>> json,
>>>>>>>>>>> csv, etc..
>>>>>>>>>>> Even we can provide a playground docker which may
>> contain the fat
>>>>>>>>>>> distribution, python3, and hive.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jark
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <
>> [hidden email]>
>>>>> <
>>>>>>>>>> [hidden email]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I don't see a lot of value in having multiple
>> distributions.
>>>>>>>>>>>
>>>>>>>>>>> The simple reality is that no fat distribution we could
>> provide
>>>>>>>>>>>
>>>>>>>>>>> would
>>>>>>>>>>>
>>>>>>>>>>> satisfy all use-cases, so why even try.
>>>>>>>>>>> If users commonly run into issues for certain jars, then
>> maybe
>>>>>>>>>>>
>>>>>>>>>>> those
>>>>>>>>>>>
>>>>>>>>>>> should be added to the current distribution.
>>>>>>>>>>>
>>>>>>>>>>> Personally though I still believe we should only
>> distribute a slim
>>>>>>>>>>> version. I'd rather have users always add required jars
>> to the
>>>>>>>>>>> distribution than only when they go outside our
>> "expected"
>>>>>>>>>>>
>>>>>>>>>>> use-cases.
>>>>>>>>>>>
>>>>>>>>>>> Then we might finally address this issue properly, i.e.,
>> tooling to
>>>>>>>>>>> assemble custom distributions and/or better error
>> messages if
>>>>>>>>>>> Flink-provided extensions cannot be found.
>>>>>>>>>>>
>>>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
>>>>>>>>>>>
>>>>>>>>>>> Regarding the specific solution, I'm not sure about the "fat" and
>>>>>>>>>>> "slim" solution though. I get the idea
>>>>>>>>>>> that we can make the slim one even more lightweight than
>> current
>>>>>>>>>>> distribution, but what about the "fat"
>>>>>>>>>>> one? Do you mean that we would package all connectors
>> and formats
>>>>>>>>>>>
>>>>>>>>>>> into
>>>>>>>>>>>
>>>>>>>>>>> this? I'm not sure if this is
>>>>>>>>>>> feasible. For example, we can't put all versions of
>> kafka and hive
>>>>>>>>>>> connector jars into lib directory, and
>>>>>>>>>>> we also might need hadoop jars when using filesystem
>> connector to
>>>>>>>>>>>
>>>>>>>>>>> access
>>>>>>>>>>>
>>>>>>>>>>> data from HDFS.
>>>>>>>>>>>
>>>>>>>>>>> So my guess would be we might hand-pick some of the most
>>>>>>>>>>>
>>>>>>>>>>> frequently
>>>>>>>>>>>
>>>>>>>>>>> used
>>>>>>>>>>>
>>>>>>>>>>> connectors and formats
>>>>>>>>>>> into our "lib" directory, like kafka, csv, json mentioned above,
>>>>>>>>>>>
>>>>>>>>>>> and
>>>>>>>>>>>
>>>>>>>>>>> still
>>>>>>>>>>>
>>>>>>>>>>> leave some other connectors out of it.
>>>>>>>>>>> If this is the case, then why don't we just provide this
>>>>>>>>>>> distribution to users? I'm not sure I get the benefit of
>>>>>>>>>>> providing another super "slim" jar (we have to pay some costs to
>>>>>>>>>>> provide another suite of distribution).
>>>>>>>>>>>
>>>>>>>>>>> What do you think?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Kurt
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
>>>>>>>>>>>
>>>>>>>>>>> [hidden email]
>>>>>>>>>>>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Big +1.
>>>>>>>>>>>
>>>>>>>>>>> I like "fat" and "slim".
>>>>>>>>>>>
>>>>>>>>>>> For csv and json, like Jark said, they are quite small and don't
>>>>>>>>>>> have other dependencies. They are important to the Kafka
>>>>>>>>>>> connector, and important to the upcoming file system connector
>>>>>>>>>>> too. So can we move them to both "fat" and "slim"? They're so
>>>>>>>>>>> important, and they're so lightweight.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jingsong Lee
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <
>> [hidden email]> <
>>>>>>>>>> [hidden email]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Big +1.
>>>>>>>>>>> This will improve the user experience (especially for new Flink
>>>>>>>>>>> users). We have answered so many questions about "class not found".
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Godfrey
>>>>>>>>>>>
>>>>>>>>>>> Dian Fu <[hidden email]> wrote on Wednesday, April 15, 2020 at
>>>>>>>>>>> 4:30 PM:
>>>>>>>>>>>
>>>>>>>>>>> +1 to this proposal.
>>>>>>>>>>>
>>>>>>>>>>> Missing connector jars is also a big problem for PyFlink users.
>>>>>>>>>>> Currently, after a Python user has installed PyFlink using `pip`,
>>>>>>>>>>> he has to manually copy the connector fat jars to the PyFlink
>>>>>>>>>>> installation directory for the connectors to be used if he wants
>>>>>>>>>>> to run jobs locally. This process is very confusing for users and
>>>>>>>>>>> affects the experience a lot.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Dian
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On April 15, 2020, at 3:51 PM, Jark Wu <[hidden email]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> +1 to the proposal. I also found the "download additional jar"
>>>>>>>>>>> step is really verbose when I prepare webinars.
>>>>>>>>>>>
>>>>>>>>>>> At least, I think flink-csv and flink-json should be in the
>>>>>>>>>>> distribution; they are quite small and don't have other
>>>>>>>>>>> dependencies.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Jark
>>>>>>>>>>>
>>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <
>> [hidden email]> <
>>>>>>>>>> [hidden email]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Aljoscha,
>>>>>>>>>>>
>>>>>>>>>>> Big +1 for the fat Flink distribution. Where do you plan to put
>>>>>>>>>>> these connectors? opt or lib?
>>>>>>>>>>>
>>>>>>>>>>> Aljoscha Krettek <[hidden email]> wrote on Wednesday, April 15,
>>>>>>>>>>> 2020 at 3:30 PM:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Hi Everyone,
>>>>>>>>>>>
>>>>>>>>>>> I'd like to discuss releasing a more full-featured Flink
>>>>>>>>>>> distribution. The motivation is that there is friction for
>>>>>>>>>>> SQL/Table API users who want to use Table connectors that are not
>>>>>>>>>>> in the current Flink distribution. For these users the workflow
>>>>>>>>>>> is currently roughly:
>>>>>>>>>>>
>>>>>>>>>>> - download Flink dist
>>>>>>>>>>> - configure csv/Kafka/json connectors per configuration
>>>>>>>>>>> - run SQL client or program
>>>>>>>>>>> - decrypt error message and research the solution
>>>>>>>>>>> - download additional connector jars
>>>>>>>>>>> - program works correctly
>>>>>>>>>>>
>>>>>>>>>>> I realize that this can be made to work, but if every SQL user
>>>>>>>>>>> has this as their first experience, that doesn't seem good to me.
>>>>>>>>>>>
>>>>>>>>>>> My proposal is to provide two versions of the Flink distribution
>>>>>>>>>>> in the future: "fat" and "slim" (names to be discussed):
>>>>>>>>>>>
>>>>>>>>>>> - slim would be even trimmer than today's distribution
>>>>>>>>>>> - fat would contain a lot of convenience connectors (yet to be
>>>>>>>>>>> determined which ones)
>>>>>>>>>>>
>>>>>>>>>>> And yes, I realize that there are already more dimensions of
>>>>>>>>>>> Flink releases (Scala version and Java version).
>>>>>>>>>>>
>>>>>>>>>>> For background, our current Flink dist has these in the opt
>>>>>>>>>>> directory:
>>>>>>>>>>>
>>>>>>>>>>> - flink-azure-fs-hadoop-1.10.0.jar
>>>>>>>>>>> - flink-cep-scala_2.12-1.10.0.jar
>>>>>>>>>>> - flink-cep_2.12-1.10.0.jar
>>>>>>>>>>> - flink-gelly-scala_2.12-1.10.0.jar
>>>>>>>>>>> - flink-gelly_2.12-1.10.0.jar
>>>>>>>>>>> - flink-metrics-datadog-1.10.0.jar
>>>>>>>>>>> - flink-metrics-graphite-1.10.0.jar
>>>>>>>>>>> - flink-metrics-influxdb-1.10.0.jar
>>>>>>>>>>> - flink-metrics-prometheus-1.10.0.jar
>>>>>>>>>>> - flink-metrics-slf4j-1.10.0.jar
>>>>>>>>>>> - flink-metrics-statsd-1.10.0.jar
>>>>>>>>>>> - flink-oss-fs-hadoop-1.10.0.jar
>>>>>>>>>>> - flink-python_2.12-1.10.0.jar
>>>>>>>>>>> - flink-queryable-state-runtime_2.12-1.10.0.jar
>>>>>>>>>>> - flink-s3-fs-hadoop-1.10.0.jar
>>>>>>>>>>> - flink-s3-fs-presto-1.10.0.jar
>>>>>>>>>>> - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
>>>>>>>>>>> - flink-sql-client_2.12-1.10.0.jar
>>>>>>>>>>> - flink-state-processor-api_2.12-1.10.0.jar
>>>>>>>>>>> - flink-swift-fs-hadoop-1.10.0.jar
>>>>>>>>>>>
>>>>>>>>>>> Current Flink dist is 267M. If we removed everything from opt we
>>>>>>>>>>> would go down to 126M. I would recommend this, because the large
>>>>>>>>>>> majority of the files in opt are probably unused.
>>>>>>>>>>>
>>>>>>>>>>> What do you think?
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Aljoscha
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best Regards
>>>>>>>>>>>
>>>>>>>>>>> Jeff Zhang
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best, Jingsong Lee
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Best, Jingsong Lee
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Best, Jingsong Lee
>>>>
>>>
>>>
>>> --
>>> Best, Jingsong Lee
>>
>
>
> --
>
> Best,
> Benchao Li


Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Rui Li
+1 to add the lightweight formats into lib

On Fri, Jun 5, 2020 at 3:28 PM Leonard Xu <[hidden email]> wrote:

> +1 for Jingsong’s proposal to put flink-csv, flink-json and flink-avro
> under the lib/ directory.
> I have heard many SQL users (mostly newbies) complain about the out-of-box
> experience on the mailing list.
>
> Best,
> Leonard Xu
>
>
> > On June 5, 2020, at 14:39, Benchao Li <[hidden email]> wrote:
> >
> > +1 to include them for sql-client by default;
> > +0 to put into lib and exposed to all kinds of jobs, including
> DataStream.
> >
> > Danny Chan <[hidden email]> wrote on Friday, June 5, 2020 at 2:31 PM:
> >
> >> +1, at least, we should keep an out of the box SQL-CLI, it’s very poor
> >> experience to add such required format jars for SQL users.
> >>
> >> Best,
> >> Danny Chan
> >> On June 5, 2020 at 11:14 AM +0800, Jingsong Li <[hidden email]> wrote:
> >>> Hi all,
> >>>
> >>> Considering that 1.11 will be released soon, what about my previous
> >>> proposal? Put flink-csv, flink-json and flink-avro under lib.
> >>> These three formats are very small and no third party dependence, and
> >> they
> >>> are widely used by table users.
> >>>
> >>> Best,
> >>> Jingsong Lee
> >>>
> >>> On Tue, May 12, 2020 at 4:19 PM Jingsong Li <[hidden email]>
> >> wrote:
> >>>
> >>>> Thanks for your discussion.
> >>>>
> >>>> Sorry to start discussing another thing:
> >>>>
> >>>> The biggest problem I see is the variety of problems caused by users'
> >> lack
> >>>> of format dependency.
> >>>> As Aljoscha said, these three formats are very small and no third
> party
> >>>> dependence, and they are widely used by table users.
> >>>> Actually, we don't have any other built-in table formats now... In
> >> total
> >>>> 151K...
> >>>>
> >>>> 73K flink-avro-1.10.0.jar
> >>>> 36K flink-csv-1.10.0.jar
> >>>> 42K flink-json-1.10.0.jar
> >>>>
> >>>> So, Can we just put them into "lib/" or flink-table-uber?
> >>>> It not solve all problems and maybe it is independent of "fat" and
> >> "slim".
> >>>> But also improve usability.
> >>>> What do you think? Any objections?
> >>>>
> >>>> Best,
> >>>> Jingsong Lee
> >>>>
> >>>> On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <[hidden email]>
> >>>> wrote:
> >>>>
> >>>>> One downside would be that we're shipping more stuff when running on
> >>>>> YARN, for example, since the entire plugins directory is shipped by
> >>>>> default.
> >>>>>
> >>>>> On 17/04/2020 16:38, Stephan Ewen wrote:
> >>>>>> @Aljoscha I think that is an interesting line of thinking. The
> >>>>>> swift-fs may be used rarely enough to move it to an optional
> >>>>>> download.
> >>>>>>
> >>>>>> I would still drop two more thoughts:
> >>>>>>
> >>>>>> (1) Now that we have plugins support, is there a reason to have a
> >>>>>> metrics reporter or file system in /opt instead of /plugins? They
> >>>>>> don't spoil the class path any more.
> >>>>>>
> >>>>>> (2) I can imagine there still being a desire to have a "minimal"
> >>>>>> docker file, for users that want to keep the container images as
> >>>>>> small as possible, to speed up deployment. It is fine if that would
> >>>>>> not be the default, though.
> >>>>>>
> >>>>>>
> >>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <
> >> [hidden email]>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> I think having such tools and/or tailor-made distributions can be
> >>>>>>> nice, but I also think the discussion is missing the main point:
> >>>>>>> the initial observation/motivation is that apparently a lot of
> >>>>>>> users (Kurt and I talked about this) in the Chinese DingTalk
> >>>>>>> support groups and other support channels have problems when first
> >>>>>>> using the SQL client because of these missing connectors/formats.
> >>>>>>> For these users, having additional tools would not solve anything
> >>>>>>> because they would also not take that extra step. I think that even
> >>>>>>> tiny friction should be avoided, because the annoyance from it
> >>>>>>> accumulates over the (hopefully) many users that we want to have.
> >>>>>>>
> >>>>>>> Maybe we should take a step back from discussing the "fat"/"slim"
> >>>>>>> idea and instead think about the composition of the current dist.
> >>>>>>> As mentioned, we have these jars in opt/:
> >>>>>>>
> >>>>>>> 17M flink-azure-fs-hadoop-1.10.0.jar
> >>>>>>> 52K flink-cep-scala_2.11-1.10.0.jar
> >>>>>>> 180K flink-cep_2.11-1.10.0.jar
> >>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
> >>>>>>> 626K flink-gelly_2.11-1.10.0.jar
> >>>>>>> 512K flink-metrics-datadog-1.10.0.jar
> >>>>>>> 159K flink-metrics-graphite-1.10.0.jar
> >>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
> >>>>>>> 102K flink-metrics-prometheus-1.10.0.jar
> >>>>>>> 10K flink-metrics-slf4j-1.10.0.jar
> >>>>>>> 12K flink-metrics-statsd-1.10.0.jar
> >>>>>>> 36M flink-oss-fs-hadoop-1.10.0.jar
> >>>>>>> 28M flink-python_2.11-1.10.0.jar
> >>>>>>> 22K flink-queryable-state-runtime_2.11-1.10.0.jar
> >>>>>>> 18M flink-s3-fs-hadoop-1.10.0.jar
> >>>>>>> 31M flink-s3-fs-presto-1.10.0.jar
> >>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>>>>> 518K flink-sql-client_2.11-1.10.0.jar
> >>>>>>> 99K flink-state-processor-api_2.11-1.10.0.jar
> >>>>>>> 25M flink-swift-fs-hadoop-1.10.0.jar
> >>>>>>> 160M opt
> >>>>>>>
> >>>>>>> The "filesystem" connectors are the heavy hitters there.
> >>>>>>>
> >>>>>>> I downloaded most of the SQL connectors/formats and this is what
> >> I got:
> >>>>>>>
> >>>>>>> 73K flink-avro-1.10.0.jar
> >>>>>>> 36K flink-csv-1.10.0.jar
> >>>>>>> 55K flink-hbase_2.11-1.10.0.jar
> >>>>>>> 88K flink-jdbc_2.11-1.10.0.jar
> >>>>>>> 42K flink-json-1.10.0.jar
> >>>>>>> 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> >>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> >>>>>>> 24M sql-connectors-formats
> >>>>>>>
> >>>>>>> We could just add these to the Flink distribution without blowing
> >>>>>>> it up by much. We could drop any of the existing "filesystem"
> >>>>>>> connectors from opt and add the SQL connectors/formats and not
> >>>>>>> change the size of Flink dist. So maybe we should do that instead?
> >>>>>>>
> >>>>>>> We would need some tooling for the sql-client shell script to pick
> >>>>>>> up the connectors/formats from opt/, because we don't want to add
> >>>>>>> them to lib/. We're already doing that for finding the
> >>>>>>> flink-sql-client jar, which is also not in lib/.
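[The opt/ pickup described above could look roughly like the sketch below. This is a hypothetical snippet, not the actual sql-client.sh: the function name `collect_opt_jars` and the jar patterns are assumptions for illustration only.]

```shell
# Hypothetical sketch: build a classpath for the SQL Client from jars
# living in opt/ (so they never pollute lib/ for other job types).
collect_opt_jars() {
    # $1 = the opt/ directory; prints a colon-separated classpath
    dir="$1"
    cp=""
    for jar in "$dir"/flink-sql-connector-*.jar "$dir"/flink-csv-*.jar \
               "$dir"/flink-json-*.jar "$dir"/flink-avro-*.jar; do
        # skip glob patterns that matched no file
        [ -e "$jar" ] && cp="$cp:$jar"
    done
    printf '%s\n' "${cp#:}"
}

# Example use inside a launcher script (FLINK_HOME is assumed):
# CC_CLASSPATH=$(collect_opt_jars "$FLINK_HOME/opt")
```

The point of the sketch is only that the SQL Client launcher, not the user, decides which optional jars get on the classpath.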
> >>>>>>>
> >>>>>>> What do you think?
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Aljoscha
> >>>>>>>
> >>>>>>> On 17.04.20 05:22, Jark Wu wrote:
> >>>>>>>> Hi,
> >>>>>>>>
> >>>>>>>> I like the idea of a web tool to assemble a fat distribution, and
> >>>>>>>> https://code.quarkus.io/ looks very nice.
> >>>>>>>> All the users need to do is just select what they need (I think
> >>>>>>>> this step can't be omitted anyway).
> >>>>>>>> We can also provide a default fat distribution on the web which
> >>>>>>>> preselects some popular connectors.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Jark
> >>>>>>>>
> >>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <[hidden email]
> >>>
> >>>>> wrote:
> >>>>>>>>
> >>>>>>>>> As a reference for a nice first experience I had, take a look at
> >>>>>>>>> https://code.quarkus.io/
> >>>>>>>>> You reach this page after you click "Start Coding" on the project
> >>>>>>>>> homepage.
> >>>>>>>>> Rafi
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <[hidden email]>
> >> wrote:
> >>>>>>>>>
> >>>>>>>>>> I'm not saying pre-bundling some jars will make this problem go
> >>>>>>>>>> away, and you're right that it only hides the problem for some
> >>>>>>>>>> users. But what if this solution can hide the problem for 90% of
> >>>>>>>>>> users? Wouldn't that be good enough for us to try?
> >>>>>>>>>>
> >>>>>>>>>> Regarding whether users following instructions would really be
> >>>>>>>>>> such a big problem: I'm afraid yes. Otherwise I wouldn't have
> >>>>>>>>>> answered such questions at least a dozen times, and I wouldn't
> >>>>>>>>>> see such questions coming up from time to time. During some
> >>>>>>>>>> periods, I even saw such questions every day.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Kurt
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
> >>>>> [hidden email]>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> The problem with having a distribution with "popular" stuff is
> >>>>>>>>>>> that it doesn't really *solve* a problem, it just hides it for
> >>>>>>>>>>> users who fall into these particular use-cases.
> >>>>>>>>>>> Move out of them and you once again run into the exact same
> >>>>>>>>>>> problems outlined.
> >>>>>>>>>>> This is exactly why I like the tooling approach; you have to
> >>>>>>>>>>> deal with it from the start, and transitioning to a custom
> >>>>>>>>>>> use-case is easier.
> >>>>>>>>>>>
> >>>>>>>>>>> Would users following instructions really be such a big
> >>>>>>>>>>> problem?
> >>>>>>>>>>> I would expect that users generally know *what* they need, just
> >>>>>>>>>>> not necessarily how it is assembled correctly (where to get
> >>>>>>>>>>> which jar, which directory to put it in).
> >>>>>>>>>>> It seems like these are exactly the problems this would solve?
> >>>>>>>>>>> I just don't see how moving a jar corresponding to some feature
> >>>>>>>>>>> from opt to some directory (lib/plugins) is less error-prone
> >>>>>>>>>>> than just selecting the feature and having the tool handle the
> >>>>>>>>>>> rest.
> >>>>>>>>>>>
> >>>>>>>>>>> As for re-distributions, it depends on the form that the tool
> >>>>>>>>>>> would take.
> >>>>>>>>>>> It could be an application that runs locally and works against
> >>>>>>>>>>> Maven Central (note: not necessarily *using* Maven); this
> >>>>>>>>>>> should work in China, no?
> >>>>>>>>>>>
> >>>>>>>>>>> A web tool would of course be fancy, but I don't know how
> >>>>>>>>>>> feasible this is with the ASF infrastructure.
> >>>>>>>>>>> You wouldn't be able to mirror the distribution, so the load
> >>>>>>>>>>> can't be distributed. I doubt INFRA would like this.
> >>>>>>>>>>>
> >>>>>>>>>>> Note that third parties could also start distributing use-case
> >>>>>>>>>>> oriented distributions, which would be perfectly fine as far as
> >>>>>>>>>>> I'm concerned.
> >>>>>>>>>>>
> >>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> I'm not so sure about the web tool solution though. The concern
> >>>>>>>>>>> I have with this approach is that the final generated
> >>>>>>>>>>> distribution is kind of non-deterministic. We might generate too
> >>>>>>>>>>> many different combinations when users try to package different
> >>>>>>>>>>> types of connectors, formats, and maybe even Hadoop releases. As
> >>>>>>>>>>> far as I can tell, most open source projects and Apache projects
> >>>>>>>>>>> only release some pre-defined distributions, which most users
> >>>>>>>>>>> are already familiar with, and that is hard to change IMO. And I
> >>>>>>>>>>> have also gone through some cases where users try to
> >>>>>>>>>>> re-distribute the release package, because of the unstable
> >>>>>>>>>>> network to the Apache website from China. With the web tool
> >>>>>>>>>>> solution, I don't think this kind of re-distribution would be
> >>>>>>>>>>> possible anymore.
> >>>>>>>>>>>
> >>>>>>>>>>> In the meantime, I also have a concern that we will fall back
> >>>>>>>>>>> into our trap again if we try to offer this smart & flexible
> >>>>>>>>>>> solution, because it needs users to cooperate with such a
> >>>>>>>>>>> mechanism. It's exactly the situation we currently fell into:
> >>>>>>>>>>> 1. We offered a smart solution.
> >>>>>>>>>>> 2. We hope users will follow the correct instructions.
> >>>>>>>>>>> 3. Everything will work as expected if users follow the right
> >>>>>>>>>>> instructions.
> >>>>>>>>>>>
> >>>>>>>>>>> In reality, I suspect not all users will do the second step
> >>>>>>>>>>> correctly. And for new users who are only trying to have a
> >>>>>>>>>>> quick experience with Flink, I would bet most will do it wrong.
> >>>>>>>>>>>
> >>>>>>>>>>> So, my proposal would be one of the following 2 options:
> >>>>>>>>>>> 1. Provide a slim distribution for advanced production users,
> >>>>>>>>>>> and provide a distribution which has some popular built-in jars.
> >>>>>>>>>>> 2. Only provide a distribution which has some popular built-in
> >>>>>>>>>>> jars.
> >>>>>>>>>>> If we are trying to reduce the distributions we release, I
> >>>>>>>>>>> would prefer 2 over 1.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Kurt
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <
> >>>>> [hidden email]>
> >>>>>>> <
> >>>>>>>>>> [hidden email]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> I think what Chesnay and Dawid proposed would be the ideal
> >>>>>>>>>>> solution. Ideally, we would also have a nice web tool for the
> >>>>>>>>>>> website which generates the corresponding distribution for
> >>>>>>>>>>> download.
> >>>>>>>>>>>
> >>>>>>>>>>> To get things started, we could begin with only supporting
> >>>>>>>>>>> downloading/creating the "fat" version with the script. The fat
> >>>>>>>>>>> version would then consist of the slim distribution and
> >>>>>>>>>>> whatever we deem important for new users to get started.
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Till
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
> >>>>>>>>>> [hidden email]> <[hidden email]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> Few points from my side:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. I like the idea of simplifying the experience for first-time
> >>>>>>>>>>> users. As for production use cases, I share Jark's opinion that
> >>>>>>>>>>> in this case I would expect users to combine their distribution
> >>>>>>>>>>> manually. I think in such scenarios it is important to
> >>>>>>>>>>> understand the interconnections. Personally, I'd expect the
> >>>>>>>>>>> slimmest possible distribution that I can extend further with
> >>>>>>>>>>> what I need in my production scenario.
> >>>>>>>>>>>
> >>>>>>>>>>> 2. I think there is also the problem that the matrix of
> >> possible
> >>>>>>>>>>> combinations that can be useful is already big. Do we
> >> want to have
> >>>>> a
> >>>>>>>>>>> distribution for:
> >>>>>>>>>>>
> >>>>>>>>>>> SQL users: which connectors should we include? should we
> >>>>> include
> >>>>>>>>>>> hive? which other catalog?
> >>>>>>>>>>>
> >>>>>>>>>>> DataStream users: which connectors should we include?
> >>>>>>>>>>>
> >>>>>>>>>>> For both of the above should we include yarn/kubernetes?
> >>>>>>>>>>>
> >>>>>>>>>>> I would opt for providing only the "slim" distribution
> >> as a release
> >>>>>>>>>>> artifact.
> >>>>>>>>>>>
> >>>>>>>>>>> 3. However, as I said, I think it's worth investigating how we
> >>>>>>>>>>> can improve the user experience. What do you think of providing
> >>>>>>>>>>> a tool, e.g. a shell script, that constructs a distribution
> >>>>>>>>>>> based on the user's choice? I think that was also what Chesnay
> >>>>>>>>>>> mentioned as "tooling to assemble custom distributions". In the
> >>>>>>>>>>> end, how I see the difference between a slim and fat
> >>>>>>>>>>> distribution is which jars we put into lib, right? It could
> >>>>>>>>>>> have a few "screens".
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Which API are you interested in:
> >>>>>>>>>>> a. SQL API
> >>>>>>>>>>> b. DataStream API
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> 2. [SQL] Which connectors do you want to use?
> >> [multichoice]:
> >>>>>>>>>>> a. Kafka
> >>>>>>>>>>> b. Elasticsearch
> >>>>>>>>>>> ...
> >>>>>>>>>>>
> >>>>>>>>>>> 3. [SQL] Which catalog you want to use?
> >>>>>>>>>>>
> >>>>>>>>>>> ...
> >>>>>>>>>>>
> >>>>>>>>>>> Such a tool would download all the dependencies from Maven and
> >>>>>>>>>>> put them into the correct folder. In the future we can extend
> >>>>>>>>>>> it with additional rules, e.g. kafka-0.9 cannot be chosen at
> >>>>>>>>>>> the same time as kafka-universal, etc.
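[As a rough illustration of the kind of assembly script described here: the sketch below maps chosen Flink artifacts to their Maven Central paths and fetches them into a lib/ directory. The function names, artifact names, and the 1.10.0 version are assumptions for illustration, not an existing Flink tool; only the Maven Central repository layout is real.]

```shell
# Hypothetical distribution-assembly sketch: resolve org.apache.flink
# artifacts against Maven Central and download them into a lib/ folder.
MAVEN_REPO="https://repo1.maven.org/maven2"
FLINK_VERSION="1.10.0"

artifact_url() {
    # org.apache.flink:<artifact>:<version> -> Maven Central download URL
    printf '%s/org/apache/flink/%s/%s/%s-%s.jar\n' \
        "$MAVEN_REPO" "$1" "$FLINK_VERSION" "$1" "$FLINK_VERSION"
}

assemble_dist() {
    # $1 = target lib directory, remaining args = chosen artifact ids
    lib="$1"; shift
    mkdir -p "$lib"
    for a in "$@"; do
        curl -sSfL -o "$lib/$a-$FLINK_VERSION.jar" "$(artifact_url "$a")"
    done
}

# Example: assemble_dist ./flink-dist/lib flink-csv flink-json
```

Validation rules (such as rejecting conflicting Kafka connector versions) could then be layered on top of the artifact list before downloading.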
> >>>>>>>>>>>
> >>>>>>>>>>> The benefit would be that the distribution we release could
> >>>>>>>>>>> remain "slim", or we could even make it slimmer. I might be
> >>>>>>>>>>> missing something here though.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>>
> >>>>>>>>>>> Dawid
> >>>>>>>>>>>
> >>>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> I want to reinforce my opinion from earlier: this is about
> >>>>>>>>>>> improving the situation both for first-time users and for
> >>>>>>>>>>> experienced users that want to use a Flink dist in production.
> >>>>>>>>>>> The current Flink dist is too "thin" for first-time SQL users
> >>>>>>>>>>> and too "fat" for production users, so we are serving no-one
> >>>>>>>>>>> properly with the current middle ground. That's why I think
> >>>>>>>>>>> introducing those specialized "spins" of Flink dist would be
> >>>>>>>>>>> good.
> >>>>>>>>>>>
> >>>>>>>>>>> By the way, at some point in the future production users
> >> might not
> >>>>>>>>>>> even need to get a Flink dist anymore. They should be
> >> able to have
> >>>>>>>>>>> Flink as a dependency of their project (including the
> >> runtime) and
> >>>>>>>>>>> then build an image from this for Kubernetes or a fat
> >> jar for YARN.
> >>>>>>>>>>>
> >>>>>>>>>>> Aljoscha
> >>>>>>>>>>>
> >>>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi all,
> >>>>>>>>>>>
> >>>>>>>>>>> Regarding slim and fat distributions, I think different kinds
> >>>>>>>>>>> of jobs may prefer different types of distribution:
> >>>>>>>>>>>
> >>>>>>>>>>> For DataStream jobs, I think we may not like a fat distribution
> >>>>>>>>>>> containing connectors, because users would always need to
> >>>>>>>>>>> depend on the connector in user code, and it is easy to include
> >>>>>>>>>>> the connector jar in the user lib. Fewer jars in lib means
> >>>>>>>>>>> fewer class conflicts and problems.
> >>>>>>>>>>>
> >>>>>>>>>>> For SQL jobs, I think we are trying to encourage users to use
> >>>>>>>>>>> pure SQL (DDL + DML) to construct their jobs. In order to
> >>>>>>>>>>> improve the user experience, it may be important for Flink not
> >>>>>>>>>>> only to provide as many connector jars in the distribution as
> >>>>>>>>>>> possible, especially the connectors and formats we have well
> >>>>>>>>>>> documented, but also to provide a mechanism to load connectors
> >>>>>>>>>>> according to the DDLs.
> >>>>>>>>>>>
> >>>>>>>>>>> So I think it could be good to place connector/format jars in
> >>>>>>>>>>> some dir like opt/connector, which would not affect jobs by
> >>>>>>>>>>> default, and introduce a mechanism of dynamic discovery for SQL.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Wenlong
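[A toy sketch of the dynamic-discovery idea above, under stated assumptions: it scans a SQL script for `'connector.type'` properties (the Flink 1.10 DDL property key) and stages matching jars from an opt/connector directory into lib/. The function name `stage_connectors` and the jar-name matching are hypothetical illustrations, not an existing Flink mechanism.]

```shell
# Hypothetical sketch: derive needed connector jars from the DDLs in a
# SQL script, and copy them from opt/connector into lib/ before launch.
stage_connectors() {
    # $1 = SQL script, $2 = source dir (e.g. opt/connector), $3 = lib dir
    mkdir -p "$3"
    grep -o "'connector.type' *= *'[^']*'" "$1" \
        | sed "s/.*= *'\([^']*\)'/\1/" \
        | sort -u \
        | while read -r type; do
            # copy every jar whose name mentions this connector type
            for jar in "$2"/flink-*"$type"*.jar; do
                [ -e "$jar" ] && cp "$jar" "$3/"
            done
        done
}
```

In practice this would more likely live in the SQL Client itself (via factory discovery) than in a shell script, but the sketch shows the flow: DDL properties in, staged jars out.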
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <
> >> [hidden email]>
> >>>>> <
> >>>>>>>>>> [hidden email]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> I am thinking both "improve first experience" and
> >> "improve
> >>>>> production
> >>>>>>>>>>> experience".
> >>>>>>>>>>>
> >>>>>>>>>>> I'm thinking about what the common modes of Flink are.
> >>>>>>>>>>> Streaming jobs use Kafka? Batch jobs use Hive?
> >>>>>>>>>>>
> >>>>>>>>>>> Hive 1.2.1 dependencies are compatible with most Hive server
> >>>>>>>>>>> versions, so Spark and Presto have a built-in Hive 1.2.1
> >>>>>>>>>>> dependency. Flink is currently mainly used for streaming, so
> >>>>>>>>>>> let's not talk about Hive.
> >>>>>>>>>>>
> >>>>>>>>>>> For streaming jobs, first of all, the jobs in my mind are
> >>>>>>>>>>> (related to connectors):
> >>>>>>>>>>> - ETL jobs: Kafka -> Kafka
> >>>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
> >>>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
> >>>>>>>>>>> So Kafka and JDBC are probably the most commonly used. Of
> >>>>>>>>>>> course, this also includes the CSV and JSON formats.
> >>>>>>>>>>> So when we provide such a fat distribution:
> >>>>>>>>>>> - With CSV, JSON.
> >>>>>>>>>>> - With flink-kafka-universal and Kafka dependencies.
> >>>>>>>>>>> - With flink-jdbc.
> >>>>>>>>>>> Using this fat distribution, most users can run their jobs
> >>>>>>>>>>> well (a JDBC driver jar is required, but this is very natural
> >>>>>>>>>>> to do).
> >>>>>>>>>>> Can these dependencies lead to conflicts? Only Kafka may have
> >>>>>>>>>>> conflicts, but if our goal is to use kafka-universal to support
> >>>>>>>>>>> all Kafka versions, we can hope to cover the vast majority of
> >>>>>>>>>>> users.
> >>>>>>>>>>>
> >>>>>>>>>>> We don't want to put all jars into the fat distribution, only
> >>>>>>>>>>> ones that are common and unlikely to conflict. Of course, which
> >>>>>>>>>>> jars to put into the fat distribution is a matter of
> >>>>>>>>>>> consideration. We have the opportunity to help the majority of
> >>>>>>>>>>> users, while also leaving opportunities for customization.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Jingsong Lee
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <
> >> [hidden email]> <
> >>>>>>>>>> [hidden email]> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi,
> >>>>>>>>>>>
> >>>>>>>>>>> I think we should first reach a consensus on "what problem do
> >>>>>>>>>>> we want to solve?":
> >>>>>>>>>>> (1) improve the first experience? or (2) improve the production
> >>>>>>>>>>> experience?
> >>>>>>>>>>>
> >>>>>>>>>>> As far as I can see, with the above discussion, I think what we
> >>>>>>>>>>> want to solve is the "first experience".
> >>>>>>>>>>> And I think the slim jar is still the best distribution for
> >>>>>>>>>>> production, because assembling jars is easier than excluding
> >>>>>>>>>>> jars, and it can avoid potential class conflicts.
> >>>>>>>>>>>
> >>>>>>>>>>> If we want to improve the "first experience", I think it makes
> >>>>>>>>>>> sense to have a fat distribution to give users a smoother first
> >>>>>>>>>>> experience. But I would like to call it a "playground
> >>>>>>>>>>> distribution" or something like that, to explicitly
> >>>>>>>>>>> differentiate it from the "slim production-purpose
> >>>>>>>>>>> distribution".
> >>>>>>>>>>>
> >>>>>>>>>>> The "playground distribution" can contain some widely used
> >>>>>>>>>>> jars, like the universal-kafka-sql-connector,
> >>>>>>>>>>> elasticsearch7-sql-connector, avro, json, csv, etc.
> >>>>>>>>>>> We can even provide a playground Docker image which may contain
> >>>>>>>>>>> the fat distribution, python3, and Hive.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Jark
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <
> >> [hidden email]>
> >>>>> <
> >>>>>>>>>> [hidden email]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> I don't see a lot of value in having multiple distributions.
> >>>>>>>>>>>
> >>>>>>>>>>> The simple reality is that no fat distribution we could provide
> >>>>>>>>>>> would satisfy all use-cases, so why even try.
> >>>>>>>>>>> If users commonly run into issues for certain jars, then maybe
> >>>>>>>>>>> those should be added to the current distribution.
> >>>>>>>>>>>
> >>>>>>>>>>> Personally though, I still believe we should only distribute a
> >>>>>>>>>>> slim version. I'd rather have users always add required jars to
> >>>>>>>>>>> the distribution than only when they go outside our "expected"
> >>>>>>>>>>> use-cases.
> >>>>>>>>>>>
> >>>>>>>>>>> Then we might finally address this issue properly, i.e.,
> >>>>>>>>>>> tooling to assemble custom distributions and/or better error
> >>>>>>>>>>> messages if Flink-provided extensions cannot be found.
> >>>>>>>>>>>
> >>>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Regarding to the specific solution, I'm not sure about
> >> the "fat"
> >>>>>>>>>>>
> >>>>>>>>>>> and
> >>>>>>>>>>>
> >>>>>>>>>>> "slim"
> >>>>>>>>>>>
> >>>>>>>>>>> solution though. I get the idea
> >>>>>>>>>>> that we can make the slim one even more lightweight than
> >> current
> >>>>>>>>>>> distribution, but what about the "fat"
> >>>>>>>>>>> one? Do you mean that we would package all connectors
> >> and formats
> >>>>>>>>>>>
> >>>>>>>>>>> into
> >>>>>>>>>>>
> >>>>>>>>>>> this? I'm not sure if this is
> >>>>>>>>>>> feasible. For example, we can't put all versions of
> >> kafka and hive
> >>>>>>>>>>> connector jars into lib directory, and
> >>>>>>>>>>> we also might need hadoop jars when using filesystem
> >> connector to
> >>>>>>>>>>>
> >>>>>>>>>>> access
> >>>>>>>>>>>
> >>>>>>>>>>> data from HDFS.
> >>>>>>>>>>>
> >>>>>>>>>>> So my guess would be we might hand-pick some of the most
> >>>>>>>>>>>
> >>>>>>>>>>> frequently
> >>>>>>>>>>>
> >>>>>>>>>>> used
> >>>>>>>>>>>
> >>>>>>>>>>> connectors and formats
> >>>>>>>>>>> into our "lib" directory, like kafka, csv, json mentioned
> >> above,
> >>>>>>>>>>>
> >>>>>>>>>>> and
> >>>>>>>>>>>
> >>>>>>>>>>> still
> >>>>>>>>>>>
> >>>>>>>>>>> leave some other connectors out of it.
> >>>>>>>>>>> If this is the case, then why not we just provide this
> >>>>>>>>>>>
> >>>>>>>>>>> distribution
> >>>>>>>>>>>
> >>>>>>>>>>> to
> >>>>>>>>>>>
> >>>>>>>>>>> user? I'm not sure i get the benefit of
> >>>>>>>>>>> providing another super "slim" jar (we have to pay some
> >> costs to
> >>>>>>>>>>>
> >>>>>>>>>>> provide
> >>>>>>>>>>>
> >>>>>>>>>>> another suit of distribution).
> >>>>>>>>>>>
> >>>>>>>>>>> What do you think?
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Kurt
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
> >>>>>>>>>>>
> >>>>>>>>>>> [hidden email]
> >>>>>>>>>>>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Big +1.
> >>>>>>>>>>>
> >>>>>>>>>>> I like "fat" and "slim".
> >>>>>>>>>>>
> >>>>>>>>>>> For csv and json, like Jark said, they are quite small
> >> and don't
> >>>>>>>>>>>
> >>>>>>>>>>> have
> >>>>>>>>>>>
> >>>>>>>>>>> other
> >>>>>>>>>>>
> >>>>>>>>>>> dependencies. They are important to kafka connector, and
> >>>>>>>>>>>
> >>>>>>>>>>> important
> >>>>>>>>>>>
> >>>>>>>>>>> to upcoming file system connector too.
> >>>>>>>>>>> So can we move them to both "fat" and "slim"? They're so
> >>>>>>>>>>>
> >>>>>>>>>>> important,
> >>>>>>>>>>>
> >>>>>>>>>>> and
> >>>>>>>>>>>
> >>>>>>>>>>> they're so lightweight.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Jingsong Lee
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <
> >> [hidden email]> <
> >>>>>>>>>> [hidden email]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Big +1.
> >>>>>>>>>>> This will improve the user experience (especially for new Flink
> >>>>>>>>>>> users).
> >>>>>>>>>>> We answered so many questions about "class not found".
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Godfrey
> >>>>>>>>>>>
> >>>>>>>>>>> Dian Fu <[hidden email]> <[hidden email]>
> >>>>> 于2020年4月15日周三
> >>>>>>>>>> 下午4:30写道:
> >>>>>>>>>>>
> >>>>>>>>>>> +1 to this proposal.
> >>>>>>>>>>>
> >>>>>>>>>>> Missing connector jars is also a big problem for PyFlink
> >> users.
> >>>>>>>>>>>
> >>>>>>>>>>> Currently,
> >>>>>>>>>>>
> >>>>>>>>>>> after a Python user has installed PyFlink using `pip`,
> >> he has
> >>>>>>>>>>>
> >>>>>>>>>>> to
> >>>>>>>>>>>
> >>>>>>>>>>> manually
> >>>>>>>>>>>
> >>>>>>>>>>> copy the connector fat jars to the PyFlink installation
> >>>>>>>>>>>
> >>>>>>>>>>> directory
> >>>>>>>>>>>
> >>>>>>>>>>> for
> >>>>>>>>>>>
> >>>>>>>>>>> the
> >>>>>>>>>>>
> >>>>>>>>>>> connectors to be used if he wants to run jobs locally.
> >> This
> >>>>>>>>>>>
> >>>>>>>>>>> process
> >>>>>>>>>>>
> >>>>>>>>>>> is
> >>>>>>>>>>>
> >>>>>>>>>>> very
> >>>>>>>>>>>
> >>>>>>>>>>> confusing for users and affects the experience a lot.
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Dian
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> 在 2020年4月15日,下午3:51,Jark Wu <[hidden email]> <
> >> [hidden email]>
> >>>>> 写道:
> >>>>>>>>>>>
> >>>>>>>>>>> +1 to the proposal. I also found the "download
> >> additional jar"
> >>>>>>>>>>>
> >>>>>>>>>>> step
> >>>>>>>>>>>
> >>>>>>>>>>> is
> >>>>>>>>>>>
> >>>>>>>>>>> really verbose when I prepare webinars.
> >>>>>>>>>>>
> >>>>>>>>>>> At least, I think the flink-csv and flink-json should in
> >> the
> >>>>>>>>>>>
> >>>>>>>>>>> distribution,
> >>>>>>>>>>>
> >>>>>>>>>>> they are quite small and don't have other dependencies.
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Jark
> >>>>>>>>>>>
> >>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <
> >> [hidden email]> <
> >>>>>>>>>> [hidden email]>
> >>>>>>>>>>> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Aljoscha,
> >>>>>>>>>>>
> >>>>>>>>>>> Big +1 for the fat flink distribution, where do you plan
> >> to
> >>>>>>>>>>>
> >>>>>>>>>>> put
> >>>>>>>>>>>
> >>>>>>>>>>> these
> >>>>>>>>>>>
> >>>>>>>>>>> connectors? opt or lib?
> >>>>>>>>>>>
> >>>>>>>>>>> Aljoscha Krettek <[hidden email]> <
> >> [hidden email]>
> >>>>>>>>>> 于2020年4月15日周三
> >>>>>>>>>>> 下午3:30写道:
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Hi Everyone,
> >>>>>>>>>>>
> >>>>>>>>>>> I'd like to discuss about releasing a more full-featured
> >>>>>>>>>>>
> >>>>>>>>>>> Flink
> >>>>>>>>>>>
> >>>>>>>>>>> distribution. The motivation is that there is friction
> >> for
> >>>>>>>>>>>
> >>>>>>>>>>> SQL/Table
> >>>>>>>>>>>
> >>>>>>>>>>> API
> >>>>>>>>>>>
> >>>>>>>>>>> users that want to use Table connectors which are not
> >> there
> >>>>>>>>>>>
> >>>>>>>>>>> in
> >>>>>>>>>>>
> >>>>>>>>>>> the
> >>>>>>>>>>>
> >>>>>>>>>>> current Flink Distribution. For these users the workflow
> >> is
> >>>>>>>>>>>
> >>>>>>>>>>> currently
> >>>>>>>>>>>
> >>>>>>>>>>> roughly:
> >>>>>>>>>>>
> >>>>>>>>>>> - download Flink dist
> >>>>>>>>>>> - configure csv/Kafka/json connectors per configuration
> >>>>>>>>>>> - run SQL client or program
> >>>>>>>>>>> - decrypt error message and research the solution
> >>>>>>>>>>> - download additional connector jars
> >>>>>>>>>>> - program works correctly
> >>>>>>>>>>>
> >>>>>>>>>>> I realize that this can be made to work but if every SQL
> >>>>>>>>>>>
> >>>>>>>>>>> user
> >>>>>>>>>>>
> >>>>>>>>>>> has
> >>>>>>>>>>>
> >>>>>>>>>>> this
> >>>>>>>>>>>
> >>>>>>>>>>> as their first experience that doesn't seem good to me.
> >>>>>>>>>>>
> >>>>>>>>>>> My proposal is to provide two versions of the Flink
> >>>>>>>>>>>
> >>>>>>>>>>> Distribution
> >>>>>>>>>>>
> >>>>>>>>>>> in
> >>>>>>>>>>>
> >>>>>>>>>>> the
> >>>>>>>>>>>
> >>>>>>>>>>> future: "fat" and "slim" (names to be discussed):
> >>>>>>>>>>>
> >>>>>>>>>>> - slim would be even trimmer than todays distribution
> >>>>>>>>>>> - fat would contain a lot of convenience connectors (yet
> >>>>>>>>>>>
> >>>>>>>>>>> to
> >>>>>>>>>>>
> >>>>>>>>>>> be
> >>>>>>>>>>>
> >>>>>>>>>>> determined which one)
> >>>>>>>>>>>
> >>>>>>>>>>> And yes, I realize that there are already more
> >> dimensions of
> >>>>>>>>>>>
> >>>>>>>>>>> Flink
> >>>>>>>>>>>
> >>>>>>>>>>> releases (Scala version and Java version).
> >>>>>>>>>>>
> >>>>>>>>>>> For background, our current Flink dist has these in the
> >> opt
> >>>>>>>>>>>
> >>>>>>>>>>> directory:
> >>>>>>>>>>>
> >>>>>>>>>>> - flink-azure-fs-hadoop-1.10.0.jar
> >>>>>>>>>>> - flink-cep-scala_2.12-1.10.0.jar
> >>>>>>>>>>> - flink-cep_2.12-1.10.0.jar
> >>>>>>>>>>> - flink-gelly-scala_2.12-1.10.0.jar
> >>>>>>>>>>> - flink-gelly_2.12-1.10.0.jar
> >>>>>>>>>>> - flink-metrics-datadog-1.10.0.jar
> >>>>>>>>>>> - flink-metrics-graphite-1.10.0.jar
> >>>>>>>>>>> - flink-metrics-influxdb-1.10.0.jar
> >>>>>>>>>>> - flink-metrics-prometheus-1.10.0.jar
> >>>>>>>>>>> - flink-metrics-slf4j-1.10.0.jar
> >>>>>>>>>>> - flink-metrics-statsd-1.10.0.jar
> >>>>>>>>>>> - flink-oss-fs-hadoop-1.10.0.jar
> >>>>>>>>>>> - flink-python_2.12-1.10.0.jar
> >>>>>>>>>>> - flink-queryable-state-runtime_2.12-1.10.0.jar
> >>>>>>>>>>> - flink-s3-fs-hadoop-1.10.0.jar
> >>>>>>>>>>> - flink-s3-fs-presto-1.10.0.jar
> >>>>>>>>>>> -
> >>>>>>>>>>>
> >>>>>>>>>>> flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> >>>>>>>>>>>
> >>>>>>>>>>> - flink-sql-client_2.12-1.10.0.jar
> >>>>>>>>>>> - flink-state-processor-api_2.12-1.10.0.jar
> >>>>>>>>>>> - flink-swift-fs-hadoop-1.10.0.jar
> >>>>>>>>>>>
> >>>>>>>>>>> Current Flink dist is 267M. If we removed everything from
> >>>>>>>>>>>
> >>>>>>>>>>> opt
> >>>>>>>>>>>
> >>>>>>>>>>> we
> >>>>>>>>>>>
> >>>>>>>>>>> would
> >>>>>>>>>>>
> >>>>>>>>>>> go down to 126M. I would recommend this, because the
> >> large
> >>>>>>>>>>>
> >>>>>>>>>>> majority
> >>>>>>>>>>>
> >>>>>>>>>>> of
> >>>>>>>>>>>
> >>>>>>>>>>> the files in opt are probably unused.
> >>>>>>>>>>>
> >>>>>>>>>>> What do you think?
> >>>>>>>>>>>
> >>>>>>>>>>> Best,
> >>>>>>>>>>> Aljoscha
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Best Regards
> >>>>>>>>>>>
> >>>>>>>>>>> Jeff Zhang
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Best, Jingsong Lee
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> --
> >>>>>>>>>>> Best, Jingsong Lee
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>
> >>>>>
> >>>>>
> >>>>
> >>>> --
> >>>> Best, Jingsong Lee
> >>>>
> >>>
> >>>
> >>> --
> >>> Best, Jingsong Lee
> >>
> >
> >
> > --
> >
> > Best,
> > Benchao Li
>
>

--
Best regards!
Rui Li

Re: [DISCUSS] Releasing "fat" and "slim" Flink distributions

Jingsong Li
Hi,

Thanks all for your feedback.

I created JIRA for bundling format jars in lib. [1] FYI.

[1]https://issues.apache.org/jira/browse/FLINK-18173

Best,
Jingsong Lee

On Fri, Jun 5, 2020 at 3:59 PM Rui Li <[hidden email]> wrote:

> +1 to add the lightweight formats into lib
>
> On Fri, Jun 5, 2020 at 3:28 PM Leonard Xu <[hidden email]> wrote:
>
> > +1 for Jingsong’s proposal to put flink-csv, flink-json and flink-avro
> > under lib/ directory.
> > I have heard many SQL users (mostly newbies) complain about the
> > out-of-the-box experience on the mailing list.
> >
> > Best,
> > Leonard Xu
> >
> >
> > > 在 2020年6月5日,14:39,Benchao Li <[hidden email]> 写道:
> > >
> > > +1 to include them for sql-client by default;
> > > +0 to put into lib and exposed to all kinds of jobs, including
> > DataStream.
> > >
> > > Danny Chan <[hidden email]> 于2020年6月5日周五 下午2:31写道:
> > >
> > >> +1, at least we should keep an out-of-the-box SQL CLI; it’s a very poor
> > >> experience to have to add such required format jars for SQL users.
> > >>
> > >> Best,
> > >> Danny Chan
> > >> 在 2020年6月5日 +0800 AM11:14,Jingsong Li <[hidden email]>,写道:
> > >>> Hi all,
> > >>>
> > >>> Considering that 1.11 will be released soon, what about my previous
> > >>> proposal? Put flink-csv, flink-json and flink-avro under lib.
> > >>> These three formats are very small, have no third-party dependencies,
> > >>> and are widely used by table users.
> > >>>
> > >>> Best,
> > >>> Jingsong Lee
> > >>>
> > >>> On Tue, May 12, 2020 at 4:19 PM Jingsong Li <[hidden email]>
> > >> wrote:
> > >>>
> > >>>> Thanks for your discussion.
> > >>>>
> > >>>> Sorry to start discussing another thing:
> > >>>>
> > >>>> The biggest problem I see is the variety of problems caused by
> > >>>> users' missing format dependencies.
> > >>>> As Aljoscha said, these three formats are very small, have no
> > >>>> third-party dependencies, and are widely used by table users.
> > >>>> Actually, we don't have any other built-in table formats now... In
> > >> total
> > >>>> 151K...
> > >>>>
> > >>>> 73K flink-avro-1.10.0.jar
> > >>>> 36K flink-csv-1.10.0.jar
> > >>>> 42K flink-json-1.10.0.jar
> > >>>>
> > >>>> So, can we just put them into "lib/" or flink-table-uber?
> > >>>> It doesn't solve all problems, and maybe it is independent of "fat"
> > >>>> and "slim", but it would also improve usability.
> > >>>> What do you think? Any objections?
> > >>>>
> > >>>> Best,
> > >>>> Jingsong Lee
> > >>>>
> > >>>> On Mon, May 11, 2020 at 5:48 PM Chesnay Schepler <
> [hidden email]>
> > >>>> wrote:
> > >>>>
> > >>>>> One downside would be that we're shipping more stuff when running
> on
> > >>>>> YARN for example, since the entire plugins directory is shipped by
> > >> default.
> > >>>>>
> > >>>>> On 17/04/2020 16:38, Stephan Ewen wrote:
> > >>>>>> @Aljoscha I think that is an interesting line of thinking. The
> > >>>>>> swift-fs may be used rarely enough to move it to an optional
> > >>>>>> download.
> > >>>>>>
> > >>>>>> I would still drop two more thoughts:
> > >>>>>>
> > >>>>>> (1) Now that we have plugins support, is there a reason to have a
> > >>>>> metrics
> > >>>>>> reporter or file system in /opt instead of /plugins? They don't
> > >> spoil
> > >>>>> the
> > >>>>>> class path any more.
> > >>>>>>
> > >>>>>> (2) I can imagine there still being a desire to have a "minimal"
> > >> docker
> > >>>>>> file, for users that want to keep the container images as small as
> > >>>>>> possible, to speed up deployment. It is fine if that would not be
> > >> the
> > >>>>>> default, though.
> > >>>>>>
> > >>>>>>
> > >>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek <
> > >> [hidden email]>
> > >>>>>> wrote:
> > >>>>>>
> > >>>>>>> I think having such tools and/or tailor-made distributions can
> > >> be nice
> > >>>>>>> but I also think the discussion is missing the main point: The
> > >> initial
> > >>>>>>> observation/motivation is that apparently a lot of users (Kurt
> > >> and I
> > >>>>>>> talked about this) on the chinese DingTalk support groups, and
> > >> other
> > >>>>>>> support channels have problems when first using the SQL client
> > >> because
> > >>>>>>> of these missing connectors/formats. For these, having
> > >> additional tools
> > >>>>>>> would not solve anything because they would also not take that
> > >> extra
> > >>>>>>> step. I think that even tiny friction should be avoided because
> > >> the
> > >>>>>>> annoyance from it accumulates of the (hopefully) many users that
> > >> we
> > >>>>> want
> > >>>>>>> to have.
> > >>>>>>>
> > >>>>>>> Maybe we should take a step back from discussing the
> > >> "fat"/"slim" idea
> > >>>>>>> and instead think about the composition of the current dist. As
> > >>>>>>> mentioned we have these jars in opt/:
> > >>>>>>>
> > >>>>>>> 17M flink-azure-fs-hadoop-1.10.0.jar
> > >>>>>>> 52K flink-cep-scala_2.11-1.10.0.jar
> > >>>>>>> 180K flink-cep_2.11-1.10.0.jar
> > >>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar
> > >>>>>>> 626K flink-gelly_2.11-1.10.0.jar
> > >>>>>>> 512K flink-metrics-datadog-1.10.0.jar
> > >>>>>>> 159K flink-metrics-graphite-1.10.0.jar
> > >>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar
> > >>>>>>> 102K flink-metrics-prometheus-1.10.0.jar
> > >>>>>>> 10K flink-metrics-slf4j-1.10.0.jar
> > >>>>>>> 12K flink-metrics-statsd-1.10.0.jar
> > >>>>>>> 36M flink-oss-fs-hadoop-1.10.0.jar
> > >>>>>>> 28M flink-python_2.11-1.10.0.jar
> > >>>>>>> 22K flink-queryable-state-runtime_2.11-1.10.0.jar
> > >>>>>>> 18M flink-s3-fs-hadoop-1.10.0.jar
> > >>>>>>> 31M flink-s3-fs-presto-1.10.0.jar
> > >>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> > >>>>>>> 518K flink-sql-client_2.11-1.10.0.jar
> > >>>>>>> 99K flink-state-processor-api_2.11-1.10.0.jar
> > >>>>>>> 25M flink-swift-fs-hadoop-1.10.0.jar
> > >>>>>>> 160M opt
> > >>>>>>>
> > >>>>>>> The "filesystem" connectors are the heavy hitters there.
> > >>>>>>>
> > >>>>>>> I downloaded most of the SQL connectors/formats and this is what
> > >> I got:
> > >>>>>>>
> > >>>>>>> 73K flink-avro-1.10.0.jar
> > >>>>>>> 36K flink-csv-1.10.0.jar
> > >>>>>>> 55K flink-hbase_2.11-1.10.0.jar
> > >>>>>>> 88K flink-jdbc_2.11-1.10.0.jar
> > >>>>>>> 42K flink-json-1.10.0.jar
> > >>>>>>> 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
> > >>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar
> > >>>>>>> 24M sql-connectors-formats
> > >>>>>>>
> > >>>>>>> We could just add these to the Flink distribution without
> > >> blowing it up
> > >>>>>>> by much. We could drop any of the existing "filesystem"
> > >> connectors from
> > >>>>>>> opt and add the SQL connectors/formats and not change the size
> > >> of Flink
> > >>>>>>> dist. So maybe we should do that instead?
> > >>>>>>>
> > >>>>>>> We would need some tooling for the sql-client shell script to pick
> > >>>>>>> the connectors/formats up from opt/ because we don't want to add
> > >>>>>>> them to lib/. We're already doing that for finding the
> > >>>>>>> flink-sql-client jar, which is also not in lib/.
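To illustrate the kind of tooling this would take, here is a minimal sketch of that jar discovery. It is not the actual sql-client.sh logic; the glob patterns, jar names, and the `discover_sql_jars` helper are assumptions for illustration only.

```python
from pathlib import Path


def discover_sql_jars(opt_dir):
    """Collect SQL connector/format jars from an opt/-style directory
    and join them into a classpath string.

    The glob patterns below are assumptions: they stand in for whatever
    naming convention the sql-client script would actually match on.
    """
    patterns = ("flink-sql-connector-*.jar", "flink-csv-*.jar",
                "flink-json-*.jar", "flink-avro-*.jar")
    jars = []
    for pattern in patterns:
        # sorted() keeps the classpath order deterministic across runs
        jars.extend(sorted(Path(opt_dir).glob(pattern)))
    return ":".join(str(j) for j in jars)
```

The point of the sketch is that only jars matching known connector/format patterns end up on the classpath, so everything else in opt/ stays inert by default.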
> > >>>>>>>
> > >>>>>>> What do you think?
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Aljoscha
> > >>>>>>>
> > >>>>>>> On 17.04.20 05:22, Jark Wu wrote:
> > >>>>>>>> Hi,
> > >>>>>>>>
> > >>>>>>>> I like the idea of web tool to assemble fat distribution. And
> > >> the
> > >>>>>>>> https://code.quarkus.io/ looks very nice.
> > >>>>>>>> All the users need to do is just select what he/she need (I
> > >> think this
> > >>>>>>> step
> > >>>>>>>> can't be omitted anyway).
> > >>>>>>>> We can also provide a default fat distribution on the web which
> > >>>>> default
> > >>>>>>>> selects some popular connectors.
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Jark
> > >>>>>>>>
> > >>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <[hidden email]
> > >>>
> > >>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> As a reference for a nice first-experience I had, take a
> > >> look at
> > >>>>>>>>> https://code.quarkus.io/
> > >>>>>>>>> You reach this page after you click "Start Coding" at the
> > >> project
> > >>>>>>> homepage.
> > >>>>>>>>> Rafi
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <[hidden email]>
> > >> wrote:
> > >>>>>>>>>
> > >>>>>>>>>> I'm not saying pre-bundle some jars will make this problem
> > >> go away,
> > >>>>> and
> > >>>>>>>>>> you're right that only hides the problem for
> > >>>>>>>>>> some users. But what if this solution can hide the problem
> > >> for 90%
> > >>>>>>> users?
> > >>>>>>>>>> Wouldn't that be good enough for us to try?
> > >>>>>>>>>>
> > >>>>>>>>>> Regarding whether users following instructions would really be
> > >>>>>>>>>> such a big problem: I'm afraid yes. Otherwise I wouldn't have
> > >>>>>>>>>> answered such questions at least a dozen times, and I wouldn't
> > >>>>>>>>>> see such questions coming up from time to time. During some
> > >>>>>>>>>> periods, I even saw such questions every day.
> > >>>>>>>>>>
> > >>>>>>>>>> Best,
> > >>>>>>>>>> Kurt
> > >>>>>>>>>>
> > >>>>>>>>>>
> > >>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler <
> > >>>>> [hidden email]>
> > >>>>>>>>>> wrote:
> > >>>>>>>>>>
> > >>>>>>>>>>> The problem with having a distribution with "popular"
> > >> stuff is
> > >>>>> that it
> > >>>>>>>>>>> doesn't really *solve* a problem, it just hides it for
> > >> users who
> > >>>>> fall
> > >>>>>>>>>>> into these particular use-cases.
> > >>>>>>>>>>> Move out of it and you once again run into the exact same
> > >>>>>>>>>>> problems outlined.
> > >>>>>>>>>>> This is exactly why I like the tooling approach; you
> > >> have to deal
> > >>>>> with
> > >>>>>>>>> it
> > >>>>>>>>>>> from the start and transitioning to a custom use-case is
> > >> easier.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Would users following instructions really be such a big
> > >> problem?
> > >>>>>>>>>>> I would expect that users generally know *what *they
> > >> need, just not
> > >>>>>>>>>>> necessarily how it is assembled correctly (where do get
> > >> which jar,
> > >>>>>>>>> which
> > >>>>>>>>>>> directory to put it in).
> > >>>>>>>>>>> It seems like these are exactly the problem this would
> > >> solve?
> > >>>>>>>>>>> I just don't see how moving a jar corresponding to some
> > >> feature
> > >>>>> from
> > >>>>>>>>> opt
> > >>>>>>>>>>> to some directory (lib/plugins) is less error-prone than
> > >> just
> > >>>>>>> selecting
> > >>>>>>>>>> the
> > >>>>>>>>>>> feature and having the tool handle the rest.
> > >>>>>>>>>>>
> > >>>>>>>>>>> As for re-distributions, it depends on the form that the
> > >> tool would
> > >>>>>>>>> take.
> > >>>>>>>>>>> It could be an application that runs locally and works against
> > >>>>>>>>>>> Maven Central (note: not necessarily *using* Maven); this
> > >>>>>>>>>>> should work in China, no?
> > >>>>>>>>>>>
> > >>>>>>>>>>> A web tool would of course be fancy, but I don't know
> > >> how feasible
> > >>>>>>> this
> > >>>>>>>>>> is
> > >>>>>>>>>>> with the ASF infrastructure.
> > >>>>>>>>>>> You wouldn't be able to mirror the distribution, so the
> > >> load can't
> > >>>>> be
> > >>>>>>>>>>> distributed. I doubt INFRA would like this.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Note that third-parties could also start distributing
> > >> use-case
> > >>>>>>> oriented
> > >>>>>>>>>>> distributions, which would be perfectly fine as far as
> > >> I'm
> > >>>>> concerned.
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> I'm not so sure about the web tool solution though. The
> > >> concern I
> > >>>>> have
> > >>>>>>>>>> for
> > >>>>>>>>>>> this approach is the final generated
> > >>>>>>>>>>> distribution is kind of non-deterministic. We might
> > >> generate too
> > >>>>> many
> > >>>>>>>>>>> different combinations when users try to
> > >>>>>>>>>>> package different types of connector, format, and even
> > >> maybe hadoop
> > >>>>>>>>>>> releases. As far as I can tell, most open
> > >>>>>>>>>>> source projects and apache projects will only release
> > >> some
> > >>>>>>>>>>> pre-defined distributions, which most users are already
> > >>>>>>>>>>> familiar with, thus hard to change IMO. I have also seen cases
> > >>>>>>>>>>> where users try to re-distribute the release package because
> > >>>>>>>>>>> of the unstable network to the Apache website from China. With
> > >>>>>>>>>>> the web tool solution, I don't think this kind of
> > >>>>>>>>>>> re-distribution would be possible anymore.
> > >>>>>>>>>>>
> > >>>>>>>>>>> In the meantime, I also have a concern that we will fall back
> > >>>>>>>>>>> into our trap again if we try to offer this smart & flexible
> > >>>>>>>>>>> solution, because it needs users to cooperate with such a
> > >>>>>>>>>>> mechanism. It's exactly the situation we currently fell into:
> > >>>>>>>>>>> 1. We offered a smart solution.
> > >>>>>>>>>>> 2. We hope users will follow the correct instructions.
> > >>>>>>>>>>> 3. Everything will work as expected if users followed
> > >> the right
> > >>>>>>>>>>> instructions.
> > >>>>>>>>>>>
> > >>>>>>>>>>> In reality, I suspect not all users will do the second step
> > >>>>>>>>>>> correctly. And for new users who are only trying to have a
> > >>>>>>>>>>> quick experience with Flink, I would bet most will do it wrong.
> > >>>>>>>>>>>
> > >>>>>>>>>>> So, my proposal would be one of the following 2 options:
> > >>>>>>>>>>> 1. Provide a slim distribution for advanced product
> > >> users and
> > >>>>> provide
> > >>>>>>> a
> > >>>>>>>>>>> distribution which will have some popular builtin jars.
> > >>>>>>>>>>> 2. Only provide a distribution which will have some
> > >> popular builtin
> > >>>>>>>>> jars.
> > >>>>>>>>>>> If we are trying to reduce the distributions we release, I
> > >>>>>>>>>>> would prefer 2 over 1.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Kurt
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <
> > >>>>> [hidden email]>
> > >>>>>>> <
> > >>>>>>>>>> [hidden email]> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> I think what Chesnay and Dawid proposed would be the
> > >> ideal
> > >>>>> solution.
> > >>>>>>>>>>> Ideally, we would also have a nice web tool for the
> > >> website which
> > >>>>>>>>>> generates
> > >>>>>>>>>>> the corresponding distribution for download.
> > >>>>>>>>>>>
> > >>>>>>>>>>> To get things started we could start with only
> > >> supporting to
> > >>>>>>>>>>> download/creating the "fat" version with the script. The
> > >> fat
> > >>>>> version
> > >>>>>>>>>> would
> > >>>>>>>>>>> then consist of the slim distribution and whatever we
> > >> deem
> > >>>>> important
> > >>>>>>>>> for
> > >>>>>>>>>>> new users to get started.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Cheers,
> > >>>>>>>>>>> Till
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <
> > >>>>>>>>>> [hidden email]> <[hidden email]>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi all,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Few points from my side:
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. I like the idea of simplifying the experience for
> > >> first time
> > >>>>> users.
> > >>>>>>>>>>> As for production use cases I share Jark's opinion that
> > >> in this
> > >>>>> case I
> > >>>>>>>>>>> would expect users to combine their distribution
> > >> manually. I think
> > >>>>> in
> > >>>>>>>>>>> such scenarios it is important to understand
> > >> interconnections.
> > >>>>>>>>>>> Personally I'd expect the slimmest possible distribution
> > >> that I can
> > >>>>>>>>>>> extend further with what I need in my production
> > >> scenario.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2. I think there is also the problem that the matrix of
> > >> possible
> > >>>>>>>>>>> combinations that can be useful is already big. Do we
> > >> want to have
> > >>>>> a
> > >>>>>>>>>>> distribution for:
> > >>>>>>>>>>>
> > >>>>>>>>>>> SQL users: which connectors should we include? should we
> > >>>>> include
> > >>>>>>>>>>> hive? which other catalog?
> > >>>>>>>>>>>
> > >>>>>>>>>>> DataStream users: which connectors should we include?
> > >>>>>>>>>>>
> > >>>>>>>>>>> For both of the above should we include yarn/kubernetes?
> > >>>>>>>>>>>
> > >>>>>>>>>>> I would opt for providing only the "slim" distribution
> > >> as a release
> > >>>>>>>>>>> artifact.
> > >>>>>>>>>>>
> > >>>>>>>>>>> 3. However, as I said I think its worth investigating
> > >> how we can
> > >>>>>>>>> improve
> > >>>>>>>>>>> users experience. What do you think of providing a tool,
> > >> could be
> > >>>>> e.g.
> > >>>>>>>>> a
> > >>>>>>>>>>> shell script that constructs a distribution based on
> > >> users choice.
> > >>>>> I
> > >>>>>>>>>>> think that was also what Chesnay mentioned as "tooling to
> > >>>>>>>>>>> assemble custom distributions" In the end how I see the
> > >> difference
> > >>>>>>>>>>> between a slim and fat distribution is which jars do we
> > >> put into
> > >>>>> the
> > >>>>>>>>>>> lib, right? It could have a few "screens".
> > >>>>>>>>>>>
> > >>>>>>>>>>> 1. Which API are you interested in:
> > >>>>>>>>>>> a. SQL API
> > >>>>>>>>>>> b. DataStream API
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> 2. [SQL] Which connectors do you want to use?
> > >> [multichoice]:
> > >>>>>>>>>>> a. Kafka
> > >>>>>>>>>>> b. Elasticsearch
> > >>>>>>>>>>> ...
> > >>>>>>>>>>>
> > >>>>>>>>>>> 3. [SQL] Which catalog you want to use?
> > >>>>>>>>>>>
> > >>>>>>>>>>> ...
> > >>>>>>>>>>>
> > >>>>>>>>>>> Such a tool would download all the dependencies from
> > >> maven and put
> > >>>>>>> them
> > >>>>>>>>>>> into the correct folder. In the future we can extend it
> > >> with
> > >>>>>>> additional
> > >>>>>>>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time
> > >> with
> > >>>>>>>>>>> kafka-universal etc.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The benefit of it would be that the distribution that we
> > >> release
> > >>>>> could
> > >>>>>>>>>>> remain "slim" or we could even make it slimmer. I might
> > >> be missing
> > >>>>>>>>>>> something here though.
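A rough sketch of how the assembly tool described above could map user choices to downloads. The `CHOICES` map, the artifact names, and the repository layout are assumptions for illustration, not an actual Flink tool or its real artifact coordinates.

```python
# Assumption: Maven Central's standard repository layout.
MAVEN_REPO = "https://repo1.maven.org/maven2"

# Assumption: a hand-maintained map from user-facing choices
# (the "screens" above) to Maven group/artifact coordinates.
CHOICES = {
    "kafka": ("org.apache.flink", "flink-sql-connector-kafka_2.11"),
    "elasticsearch6": ("org.apache.flink",
                       "flink-sql-connector-elasticsearch6_2.11"),
    "json": ("org.apache.flink", "flink-json"),
}


def jar_urls(selected, version="1.10.0"):
    """Turn the user's selections into download URLs whose jars the
    tool would then place into the distribution's lib/ directory."""
    urls = []
    for choice in selected:
        group, artifact = CHOICES[choice]
        # groupId dots become path segments in the repo layout
        path = group.replace(".", "/")
        urls.append(
            f"{MAVEN_REPO}/{path}/{artifact}/{version}/"
            f"{artifact}-{version}.jar")
    return urls
```

Compatibility rules (e.g. forbidding kafka-0.9 together with kafka-universal) could then be expressed as constraints over the keys of `CHOICES` before any download starts.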
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Dawid
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> I want to reinforce my opinion from earlier: This is
> > >> about
> > >>>>> improving
> > >>>>>>>>>>> the situation both for first-time users and for
> > >> experienced users
> > >>>>> that
> > >>>>>>>>>>> want to use a Flink dist in production. The current Flink dist
> > >>>>>>>>>>> is too "thin" for first-time SQL users and too "fat" for
> > >>>>>>>>>>> production users, so we are serving no one properly with the
> > >>>>>>>>>>> current middle ground. That's why I think introducing those
> > >>>>>>>>>>> specialized "spins" of Flink dist would be good.
> > >>>>>>>>>>>
> > >>>>>>>>>>> By the way, at some point in the future production users
> > >> might not
> > >>>>>>>>>>> even need to get a Flink dist anymore. They should be
> > >> able to have
> > >>>>>>>>>>> Flink as a dependency of their project (including the
> > >> runtime) and
> > >>>>>>>>>>> then build an image from this for Kubernetes or a fat
> > >> jar for YARN.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Aljoscha
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi all,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regarding slim and fat distributions, I think different kinds
> > >>>>>>>>>>> of jobs may prefer different types of distribution:
> > >>>>>>>>>>>
> > >>>>>>>>>>> For DataStream jobs, I think we may not want a fat
> > >>>>>>>>>>> distribution containing connectors, because users always need
> > >>>>>>>>>>> to depend on the connector in user code, and it is easy to
> > >>>>>>>>>>> include the connector jar in the user lib. Fewer jars in lib
> > >>>>>>>>>>> means fewer class conflicts and problems.
> > >>>>>>>>>>>
> > >>>>>>>>>>> For SQL jobs, I think we are trying to encourage users to use
> > >>>>>>>>>>> pure SQL (DDL + DML) to construct their jobs. In order to
> > >>>>>>>>>>> improve the user experience, it may be important for Flink not
> > >>>>>>>>>>> only to provide as many connector jars in the distribution as
> > >>>>>>>>>>> possible, especially the connectors and formats we have well
> > >>>>>>>>>>> documented, but also to provide a mechanism to load connectors
> > >>>>>>>>>>> according to the DDLs.
> > >>>>>>>>>>>
> > >>>>>>>>>>> So I think it could be good to place connector/format
> > >> jars in some
> > >>>>>>>>>>> dir like
> > >>>>>>>>>>> opt/connector which would not affect jobs by default, and
> > >>>>> introduce a
> > >>>>>>>>>>> mechanism of dynamic discovery for SQL.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Wenlong
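One way the dynamic discovery proposed above could work is sketched below. The property names, the jar naming scheme, and the `opt/connector` layout are assumptions for illustration; Flink's actual factory discovery goes through Java SPI, not a regex.

```python
import re


def connectors_in_ddl(ddl):
    """Return the connector identifiers referenced by a DDL string.

    Assumption: connectors are declared via a 'connector' (or legacy
    'connector.type') property in the CREATE TABLE WITH clause.
    """
    return set(re.findall(r"'connector(?:\.type)?'\s*=\s*'([^']+)'", ddl))


def jars_to_load(ddl, connector_dir="opt/connector"):
    """Map each referenced connector to the jar it would be loaded from.

    The flink-sql-connector-<name>.jar naming scheme is an assumption
    for illustration only.
    """
    return [f"{connector_dir}/flink-sql-connector-{c}.jar"
            for c in sorted(connectors_in_ddl(ddl))]
```

With something like this, jars under opt/connector would stay off the classpath until a submitted DDL actually references them.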
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <
> > >> [hidden email]>
> > >>>>> <
> > >>>>>>>>>> [hidden email]>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I am thinking about both "improve first experience" and
> > >>>>>>>>>>> "improve production experience".
> > >>>>>>>>>>>
> > >>>>>>>>>>> I'm thinking about what the common usage modes of Flink are.
> > >>>>>>>>>>> Streaming jobs use Kafka? Batch jobs use Hive?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hive 1.2.1 dependencies can be compatible with most of
> > >> Hive server
> > >>>>>>>>>>> versions. So Spark and Presto have built-in Hive 1.2.1
> > >> dependency.
> > >>>>>>>>>>> Flink is currently mainly used for streaming, so let's
> > >> not talk
> > >>>>>>>>>>> about hive.
> > >>>>>>>>>>>
> > >>>>>>>>>>> For streaming jobs, first of all, the jobs in my mind is
> > >> (related
> > >>>>> to
> > >>>>>>>>>>> connectors):
> > >>>>>>>>>>> - ETL jobs: Kafka -> Kafka
> > >>>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka
> > >>>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink
> > >>>>>>>>>>> So Kafka and JDBC are probably the most commonly used.
> > >> Of course,
> > >>>>>>>>>>>
> > >>>>>>>>>>> also
> > >>>>>>>>>>>
> > >>>>>>>>>>> includes CSV, JSON's formats.
> > >>>>>>>>>>> So when we provide such a fat distribution:
> > >>>>>>>>>>> - With CSV, JSON.
> > >>>>>>>>>>> - With flink-kafka-universal and kafka dependencies.
> > >>>>>>>>>>> - With flink-jdbc.
> > >>>>>>>>>>> With this fat distribution, most users can run their
> > >>>>>>>>>>> jobs out of the box. (A JDBC driver jar is still
> > >>>>>>>>>>> required, but adding that is very natural.)
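As a concrete illustration of the kind of job described above, a Kafka source declared in SQL only works once the kafka connector and json format jars are on the classpath. A sketch in the 1.10-era DDL property style; the table name, topic, schema, and broker address are invented for illustration:

```sql
-- Hypothetical source table; topic, schema, and broker are illustrative.
CREATE TABLE orders_src (
  order_id BIGINT,
  amount DOUBLE
) WITH (
  'connector.type' = 'kafka',          -- needs the kafka connector jar
  'connector.version' = 'universal',
  'connector.topic' = 'orders',
  'connector.properties.bootstrap.servers' = 'localhost:9092',
  'format.type' = 'json'               -- needs the flink-json jar
);
```

If either jar is missing, the job fails when the table factory is looked up, which is exactly the friction being discussed.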
> > >>>>>>>>>>> Could these dependencies lead to conflicts? Only Kafka
> > >>>>>>>>>>> might, but if our goal is for kafka-universal to support
> > >>>>>>>>>>> all Kafka versions, we can hope to cover the vast
> > >>>>>>>>>>> majority of users.
> > >>>>>>>>>>>
> > >>>>>>>>>>> We don't want to put every jar into the fat
> > >>>>>>>>>>> distribution, only those that are common and unlikely to
> > >>>>>>>>>>> conflict. Of course, exactly which jars go into the fat
> > >>>>>>>>>>> distribution is a matter of judgment. We have the
> > >>>>>>>>>>> opportunity to serve the majority of users while still
> > >>>>>>>>>>> leaving room for customization.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Jingsong Lee
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <
> > >> [hidden email]> <
> > >>>>>>>>>> [hidden email]> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I think we should first reach a consensus on "what
> > >>>>>>>>>>> problem do we want to solve?"
> > >>>>>>>>>>> (1) improve the first experience, or (2) improve the
> > >>>>>>>>>>> production experience?
> > >>>>>>>>>>>
> > >>>>>>>>>>> As far as I can see from the discussion above, what we
> > >>>>>>>>>>> want to solve is the "first experience".
> > >>>>>>>>>>> And I think the slim distribution is still the best one
> > >>>>>>>>>>> for production, because it's easier to add jars than to
> > >>>>>>>>>>> exclude them, and it avoids potential class conflicts.
> > >>>>>>>>>>>
> > >>>>>>>>>>> If we want to improve the "first experience", I think it
> > >>>>>>>>>>> makes sense to have a fat distribution that gives users
> > >>>>>>>>>>> a smoother start. But I would like to call it a
> > >>>>>>>>>>> "playground distribution" or something like that, to
> > >>>>>>>>>>> explicitly distinguish it from the slim
> > >>>>>>>>>>> production-purpose distribution.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The "playground distribution" could contain some widely
> > >>>>>>>>>>> used jars, like universal-kafka-sql-connector,
> > >>>>>>>>>>> elasticsearch7-sql-connector, avro, json, csv, etc.
> > >>>>>>>>>>> We could even provide a playground docker image that
> > >>>>>>>>>>> contains the fat distribution, python3, and hive.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Jark
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <
> > >> [hidden email]>
> > >>>>> <
> > >>>>>>>>>> [hidden email]>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> I don't see a lot of value in having multiple
> > >> distributions.
> > >>>>>>>>>>>
> > >>>>>>>>>>> The simple reality is that no fat distribution we could
> > >>>>>>>>>>> provide would satisfy all use-cases, so why even try.
> > >>>>>>>>>>> If users commonly run into issues with certain jars,
> > >>>>>>>>>>> then maybe those should be added to the current
> > >>>>>>>>>>> distribution.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Personally though, I still believe we should only
> > >>>>>>>>>>> distribute a slim version. I'd rather have users always
> > >>>>>>>>>>> add the required jars to the distribution than only when
> > >>>>>>>>>>> they go outside our "expected" use-cases.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Then we might finally address this issue properly, i.e.,
> > >>>>>>>>>>> tooling to assemble custom distributions and/or better
> > >>>>>>>>>>> error messages if Flink-provided extensions cannot be
> > >>>>>>>>>>> found.
> > >>>>>>>>>>>
> > >>>>>>>>>>> On 15/04/2020 15:23, Kurt Young wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regarding the specific solution, I'm not sure about the
> > >>>>>>>>>>> "fat" and "slim" approach though. I get the idea that we
> > >>>>>>>>>>> can make the slim one even more lightweight than the
> > >>>>>>>>>>> current distribution, but what about the "fat" one? Do
> > >>>>>>>>>>> you mean that we would package all connectors and
> > >>>>>>>>>>> formats into it? I'm not sure that is feasible. For
> > >>>>>>>>>>> example, we can't put all versions of the kafka and hive
> > >>>>>>>>>>> connector jars into the lib directory, and we might also
> > >>>>>>>>>>> need hadoop jars when using the filesystem connector to
> > >>>>>>>>>>> access data from HDFS.
> > >>>>>>>>>>>
> > >>>>>>>>>>> So my guess would be that we hand-pick some of the most
> > >>>>>>>>>>> frequently used connectors and formats for our "lib"
> > >>>>>>>>>>> directory, like the kafka, csv, and json ones mentioned
> > >>>>>>>>>>> above, and still leave other connectors out of it.
> > >>>>>>>>>>> If that is the case, then why don't we just provide this
> > >>>>>>>>>>> distribution to users? I'm not sure I see the benefit of
> > >>>>>>>>>>> providing another super "slim" distribution (we would
> > >>>>>>>>>>> have to pay some cost to maintain another variant of the
> > >>>>>>>>>>> distribution).
> > >>>>>>>>>>>
> > >>>>>>>>>>> What do you think?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Kurt
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <
> > >>>>>>>>>>>
> > >>>>>>>>>>> [hidden email]
> > >>>>>>>>>>>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Big +1.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I like "fat" and "slim".
> > >>>>>>>>>>>
> > >>>>>>>>>>> For csv and json, as Jark said, they are quite small and
> > >>>>>>>>>>> don't have other dependencies. They are important to the
> > >>>>>>>>>>> kafka connector, and important to the upcoming file
> > >>>>>>>>>>> system connector too.
> > >>>>>>>>>>> So can we include them in both "fat" and "slim"? They're
> > >>>>>>>>>>> that important, and they're that lightweight.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Jingsong Lee
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <
> > >> [hidden email]> <
> > >>>>>>>>>> [hidden email]>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Big +1.
> > >>>>>>>>>>> This will improve the user experience (especially for
> > >>>>>>>>>>> new Flink users).
> > >>>>>>>>>>> We have answered so many questions about "class not found".
> > >>>>>>>>>>> We answered so many questions about "class not found".
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Godfrey
> > >>>>>>>>>>>
> > >>>>>>>>>>> Dian Fu <[hidden email]> <[hidden email]> wrote on Wed,
> > >>>>>>>>>>> Apr 15, 2020 at 4:30 PM:
> > >>>>>>>>>>>
> > >>>>>>>>>>> +1 to this proposal.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Missing connector jars are also a big problem for
> > >>>>>>>>>>> PyFlink users. Currently, after installing PyFlink with
> > >>>>>>>>>>> `pip`, a user has to manually copy the connector fat
> > >>>>>>>>>>> jars into the PyFlink installation directory before the
> > >>>>>>>>>>> connectors can be used to run jobs locally. This process
> > >>>>>>>>>>> is very confusing and hurts the experience a lot.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Regards,
> > >>>>>>>>>>> Dian
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Apr 15, 2020, at 3:51 PM, Jark Wu <[hidden email]> <
> > >>>>>>>>>>> [hidden email]> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> +1 to the proposal. I have also found the "download
> > >>>>>>>>>>> additional jar" step really tedious when preparing
> > >>>>>>>>>>> webinars.
> > >>>>>>>>>>>
> > >>>>>>>>>>> At least, I think flink-csv and flink-json should be in
> > >>>>>>>>>>> the distribution; they are quite small and don't have
> > >>>>>>>>>>> other dependencies.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Jark
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <
> > >> [hidden email]> <
> > >>>>>>>>>> [hidden email]>
> > >>>>>>>>>>> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi Aljoscha,
> > >>>>>>>>>>>
> > >>>>>>>>>>> Big +1 for the fat Flink distribution. Where do you plan
> > >>>>>>>>>>> to put these connectors, opt or lib?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Aljoscha Krettek <[hidden email]> <[hidden email]> wrote
> > >>>>>>>>>>> on Wed, Apr 15, 2020 at 3:30 PM:
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Hi Everyone,
> > >>>>>>>>>>>
> > >>>>>>>>>>> I'd like to discuss releasing a more full-featured Flink
> > >>>>>>>>>>> distribution. The motivation is that there is friction
> > >>>>>>>>>>> for SQL/Table API users who want to use Table connectors
> > >>>>>>>>>>> that are not in the current Flink distribution. For
> > >>>>>>>>>>> these users the workflow is currently roughly:
> > >>>>>>>>>>>
> > >>>>>>>>>>> - download Flink dist
> > >>>>>>>>>>> - configure csv/Kafka/json connectors per configuration
> > >>>>>>>>>>> - run SQL client or program
> > >>>>>>>>>>> - decipher the error message and research a solution
> > >>>>>>>>>>> - download additional connector jars
> > >>>>>>>>>>> - program works correctly
> > >>>>>>>>>>>
> > >>>>>>>>>>> I realize that this can be made to work, but if every
> > >>>>>>>>>>> SQL user has this as their first experience, that
> > >>>>>>>>>>> doesn't seem good to me.
> > >>>>>>>>>>>
> > >>>>>>>>>>> My proposal is to provide two versions of the Flink
> > >>>>>>>>>>> distribution in the future, "fat" and "slim" (names to
> > >>>>>>>>>>> be discussed):
> > >>>>>>>>>>>
> > >>>>>>>>>>> - slim would be even trimmer than today's distribution
> > >>>>>>>>>>> - fat would contain a lot of convenience connectors
> > >>>>>>>>>>>   (yet to be determined which ones)
> > >>>>>>>>>>>
> > >>>>>>>>>>> And yes, I realize that there are already more
> > >> dimensions of
> > >>>>>>>>>>>
> > >>>>>>>>>>> Flink
> > >>>>>>>>>>>
> > >>>>>>>>>>> releases (Scala version and Java version).
> > >>>>>>>>>>>
> > >>>>>>>>>>> For background, our current Flink dist has these in the
> > >> opt
> > >>>>>>>>>>>
> > >>>>>>>>>>> directory:
> > >>>>>>>>>>>
> > >>>>>>>>>>> - flink-azure-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>> - flink-cep-scala_2.12-1.10.0.jar
> > >>>>>>>>>>> - flink-cep_2.12-1.10.0.jar
> > >>>>>>>>>>> - flink-gelly-scala_2.12-1.10.0.jar
> > >>>>>>>>>>> - flink-gelly_2.12-1.10.0.jar
> > >>>>>>>>>>> - flink-metrics-datadog-1.10.0.jar
> > >>>>>>>>>>> - flink-metrics-graphite-1.10.0.jar
> > >>>>>>>>>>> - flink-metrics-influxdb-1.10.0.jar
> > >>>>>>>>>>> - flink-metrics-prometheus-1.10.0.jar
> > >>>>>>>>>>> - flink-metrics-slf4j-1.10.0.jar
> > >>>>>>>>>>> - flink-metrics-statsd-1.10.0.jar
> > >>>>>>>>>>> - flink-oss-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>> - flink-python_2.12-1.10.0.jar
> > >>>>>>>>>>> - flink-queryable-state-runtime_2.12-1.10.0.jar
> > >>>>>>>>>>> - flink-s3-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>> - flink-s3-fs-presto-1.10.0.jar
> > >>>>>>>>>>> - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> > >>>>>>>>>>>
> > >>>>>>>>>>> - flink-sql-client_2.12-1.10.0.jar
> > >>>>>>>>>>> - flink-state-processor-api_2.12-1.10.0.jar
> > >>>>>>>>>>> - flink-swift-fs-hadoop-1.10.0.jar
> > >>>>>>>>>>>
> > >>>>>>>>>>> The current Flink dist is 267M. If we removed everything
> > >>>>>>>>>>> from opt we would go down to 126M. I would recommend
> > >>>>>>>>>>> this, because the large majority of the files in opt are
> > >>>>>>>>>>> probably unused.
> > >>>>>>>>>>>
> > >>>>>>>>>>> What do you think?
> > >>>>>>>>>>>
> > >>>>>>>>>>> Best,
> > >>>>>>>>>>> Aljoscha
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Best Regards
> > >>>>>>>>>>>
> > >>>>>>>>>>> Jeff Zhang
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Best, Jingsong Lee
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> --
> > >>>>>>>>>>> Best, Jingsong Lee
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>> --
> > >>>> Best, Jingsong Lee
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Best, Jingsong Lee
> > >>
> > >
> > >
> > > --
> > >
> > > Best,
> > > Benchao Li
> >
> >
>
> --
> Best regards!
> Rui Li
>


--
Best, Jingsong Lee