+1 to include them for sql-client by default;
+0 to put them into lib, exposed to all kinds of jobs, including DataStream.

On Fri, Jun 5, 2020 at 2:31 PM, Danny Chan <[hidden email]> wrote:

+1. At the least, we should keep an out-of-the-box SQL CLI; it is a very poor experience for SQL users to have to add such required format jars themselves.

Best,
Danny Chan

On Jun 5, 2020, at 11:14 AM (+0800), Jingsong Li <[hidden email]> wrote:

Hi all,

Considering that 1.11 will be released soon, what about my previous proposal? Put flink-csv, flink-json and flink-avro under lib.
These three formats are very small, have no third-party dependencies, and are widely used by table users.

Best,
Jingsong Lee

On Tue, May 12, 2020 at 4:19 PM, Jingsong Li <[hidden email]> wrote:

Thanks for your discussion.

Sorry to start discussing another thing:

The biggest problem I see is the variety of issues caused by users' missing format dependencies.
As Aljoscha said, these three formats are very small, have no third-party dependencies, and are widely used by table users.
Actually, we don't have any other built-in table formats now... In total they are only 151K:

 73K  flink-avro-1.10.0.jar
 36K  flink-csv-1.10.0.jar
 42K  flink-json-1.10.0.jar

So, can we just put them into "lib/" or flink-table-uber?
It does not solve all problems, and it is independent of the "fat" and "slim" question, but it would still improve usability.
What do you think? Any objections?

Best,
Jingsong Lee

On Mon, May 11, 2020 at 5:48 PM, Chesnay Schepler <[hidden email]> wrote:

One downside would be that we're shipping more stuff when running on YARN, for example, since the entire plugins directory is shipped by default.

On 17/04/2020 16:38, Stephan Ewen wrote:

@Aljoscha I think that is an interesting line of thinking. The swift-fs may be used rarely enough to move it to an optional download.

I would still drop two more thoughts:

(1) Now that we have plugins support, is there a reason to have a metrics reporter or file system in /opt instead of /plugins? They don't spoil the class path any more.

(2) I can imagine there still being a desire to have a "minimal" docker file, for users that want to keep the container images as small as possible, to speed up deployment. It is fine if that would not be the default, though.

On Fri, Apr 17, 2020 at 12:16 PM, Aljoscha Krettek <[hidden email]> wrote:

I think having such tools and/or tailor-made distributions can be nice, but I also think the discussion is missing the main point: the initial observation/motivation is that apparently a lot of users (Kurt and I talked about this) on the Chinese DingTalk support groups and other support channels have problems when first using the SQL client because of these missing connectors/formats. For these users, having additional tools would not solve anything, because they would also not take that extra step. I think that even tiny friction should be avoided, because the annoyance from it accumulates across the (hopefully) many users that we want to have.

Maybe we should take a step back from discussing the "fat"/"slim" idea and instead think about the composition of the current dist. As mentioned, we have these jars in opt/:

 17M  flink-azure-fs-hadoop-1.10.0.jar
 52K  flink-cep-scala_2.11-1.10.0.jar
180K  flink-cep_2.11-1.10.0.jar
746K  flink-gelly-scala_2.11-1.10.0.jar
626K  flink-gelly_2.11-1.10.0.jar
512K  flink-metrics-datadog-1.10.0.jar
159K  flink-metrics-graphite-1.10.0.jar
1.0M  flink-metrics-influxdb-1.10.0.jar
102K  flink-metrics-prometheus-1.10.0.jar
 10K  flink-metrics-slf4j-1.10.0.jar
 12K  flink-metrics-statsd-1.10.0.jar
 36M  flink-oss-fs-hadoop-1.10.0.jar
 28M  flink-python_2.11-1.10.0.jar
 22K  flink-queryable-state-runtime_2.11-1.10.0.jar
 18M  flink-s3-fs-hadoop-1.10.0.jar
 31M  flink-s3-fs-presto-1.10.0.jar
196K  flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
518K  flink-sql-client_2.11-1.10.0.jar
 99K  flink-state-processor-api_2.11-1.10.0.jar
 25M  flink-swift-fs-hadoop-1.10.0.jar
160M  opt

The "filesystem" connectors are the heavy hitters there.

I downloaded most of the SQL connectors/formats and this is what I got:

 73K  flink-avro-1.10.0.jar
 36K  flink-csv-1.10.0.jar
 55K  flink-hbase_2.11-1.10.0.jar
 88K  flink-jdbc_2.11-1.10.0.jar
 42K  flink-json-1.10.0.jar
 20M  flink-sql-connector-elasticsearch6_2.11-1.10.0.jar
2.8M  flink-sql-connector-kafka_2.11-1.10.0.jar
 24M  sql-connectors-formats

We could just add these to the Flink distribution without blowing it up by much. We could drop any of the existing "filesystem" connectors from opt, add the SQL connectors/formats, and not change the size of the Flink dist. So maybe we should do that instead?

We would need some tooling for the sql-client shell script to pick up the connectors/formats from opt/, because we don't want to add them to lib/. We're already doing that for finding the flink-sql-client jar, which is also not in lib/.

What do you think?

Best,
Aljoscha
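A rough sketch of what that opt/ pickup could look like, in the spirit of how sql-client.sh already locates the flink-sql-client jar outside lib/. This is a sketch under assumptions, not the actual script: the globs, variable names, and using the SQL client's --jar option as the injection point are all illustrative.

#!/usr/bin/env bash
# Hypothetical launcher wrapper: collect SQL connector/format jars from opt/
# and hand them to the SQL client via its --jar option, so they never have
# to live in lib/.
FLINK_HOME="$(cd "$(dirname "$0")/.." && pwd)"

JAR_ARGS=()
for jar in "$FLINK_HOME"/opt/flink-sql-connector-*.jar \
           "$FLINK_HOME"/opt/flink-csv-*.jar \
           "$FLINK_HOME"/opt/flink-json-*.jar \
           "$FLINK_HOME"/opt/flink-avro-*.jar; do
  [ -e "$jar" ] || continue   # skip globs that matched nothing
  JAR_ARGS+=(--jar "$jar")
done

exec "$FLINK_HOME/bin/sql-client.sh" embedded "${JAR_ARGS[@]}" "$@"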
On 17.04.20 05:22, Jark Wu wrote:

Hi,

I like the idea of a web tool to assemble a fat distribution, and https://code.quarkus.io/ looks very nice.
All users need to do is select what they need (I think this step can't be omitted anyway).
We can also provide a default fat distribution on the web which pre-selects some popular connectors.

Best,
Jark

On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <[hidden email]> wrote:

As a reference for a nice first experience I had, take a look at https://code.quarkus.io/
You reach this page after you click "Start Coding" on the project homepage.

Rafi

On Thu, Apr 16, 2020 at 6:53 PM, Kurt Young <[hidden email]> wrote:

I'm not saying pre-bundling some jars will make this problem go away, and you're right that it only hides the problem for some users. But what if this solution can hide the problem for 90% of users? Wouldn't that be good enough for us to try?

Regarding whether users following instructions would really be such a big problem: I'm afraid yes. Otherwise I wouldn't have answered such questions at least a dozen times, and I wouldn't see them coming up from time to time. During some periods, I even saw such questions every day.

Best,
Kurt

On Thu, Apr 16, 2020 at 11:21 PM, Chesnay Schepler <[hidden email]> wrote:

The problem with having a distribution with "popular" stuff is that it doesn't really *solve* a problem, it just hides it for users who fall into these particular use-cases. Move out of them and you once again run into the exact same problems outlined. This is exactly why I like the tooling approach; you have to deal with it from the start, and transitioning to a custom use-case is easier.

Would users following instructions really be such a big problem? I would expect that users generally know *what* they need, just not necessarily how it is assembled correctly (where to get which jar, which directory to put it in). It seems like these are exactly the problems this would solve? I just don't see how moving a jar corresponding to some feature from opt to some directory (lib/plugins) is less error-prone than just selecting the feature and having the tool handle the rest.

As for re-distributions, it depends on the form that the tool would take. It could be an application that runs locally and works against Maven Central (note: not necessarily *using* Maven); this should work in China, no?

A web tool would of course be fancy, but I don't know how feasible this is with the ASF infrastructure. You wouldn't be able to mirror the distribution, so the load can't be distributed. I doubt INFRA would like this.

Note that third parties could also start distributing use-case-oriented distributions, which would be perfectly fine as far as I'm concerned.

On 16/04/2020 16:57, Kurt Young wrote:

I'm not so sure about the web tool solution, though. The concern I have with this approach is that the final generated distribution is kind of non-deterministic. We might generate too many different combinations when users try to package different types of connectors, formats, and maybe even Hadoop releases. As far as I can tell, most open source projects and Apache projects only release some pre-defined distributions, which most users are already familiar with and which are thus hard to change, IMO. And I have also seen cases where users re-distribute the release package because of the unstable network to the Apache website from China. With a web tool solution, I don't think this kind of re-distribution would be possible anymore.

In the meantime, I also have a concern that we will fall into our trap again if we try to offer this smart & flexible solution, because it needs users to cooperate with such a mechanism. It's exactly the situation we currently fell into:
1. We offered a smart solution.
2. We hope users will follow the correct instructions.
3. Everything will work as expected if users followed the right instructions.

In reality, I suspect not all users will do the second step correctly. And for new users who are only trying to have a quick experience with Flink, I would bet most will do it wrong.

So, my proposal would be one of the following 2 options:
1. Provide a slim distribution for advanced production users and a distribution which has some popular built-in jars.
2. Only provide a distribution which has some popular built-in jars.
If we are trying to reduce the distributions we release, I would prefer 2 over 1.

Best,
Kurt
On Thu, Apr 16, 2020 at 9:33 PM, Till Rohrmann <[hidden email]> wrote:

I think what Chesnay and Dawid proposed would be the ideal solution. Ideally, we would also have a nice web tool for the website which generates the corresponding distribution for download.

To get things started, we could begin with only supporting downloading/creating the "fat" version with the script. The fat version would then consist of the slim distribution plus whatever we deem important for new users to get started.

Cheers,
Till
On Thu, Apr 16, 2020 at 11:33 AM, Dawid Wysakowicz <[hidden email]> wrote:

Hi all,

Few points from my side:

1. I like the idea of simplifying the experience for first-time users. As for production use cases, I share Jark's opinion that there I would expect users to combine their distribution manually. I think in such scenarios it is important to understand the interconnections. Personally, I'd expect the slimmest possible distribution that I can extend further with what I need in my production scenario.

2. I think there is also the problem that the matrix of possible combinations that can be useful is already big. Do we want to have a distribution for:

- SQL users: which connectors should we include? Should we include Hive? Which other catalog?
- DataStream users: which connectors should we include?
- For both of the above, should we include YARN/Kubernetes?

I would opt for providing only the "slim" distribution as a release artifact.

3. However, as I said, I think it's worth investigating how we can improve the user experience. What do you think of providing a tool, e.g. a shell script, that constructs a distribution based on the user's choice? I think that is also what Chesnay mentioned as "tooling to assemble custom distributions". In the end, the difference between a slim and a fat distribution is just which jars we put into lib, right? It could have a few "screens":

1. Which API are you interested in:
   a. SQL API
   b. DataStream API

2. [SQL] Which connectors do you want to use? [multichoice]:
   a. Kafka
   b. Elasticsearch
   ...

3. [SQL] Which catalog do you want to use?

...

Such a tool would download all the dependencies from Maven and put them into the correct folder. In the future we can extend it with additional rules, e.g. kafka-0.9 cannot be chosen at the same time as kafka-universal, etc.

The benefit would be that the distribution we release could remain "slim", or we could even make it slimmer. I might be missing something here, though.

Best,
Dawid
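A minimal sketch of such an assembly script, assuming artifacts are fetched straight from Maven Central; the script name, argument handling, and artifact list are illustrative assumptions, not a worked-out design.

#!/usr/bin/env bash
# Hypothetical assemble.sh: extend an unpacked slim distribution with the
# connector/format jars the user picked, fetched from Maven Central.
set -eu

FLINK_HOME="${1:?usage: assemble.sh <flink-home> <artifact>...}"
shift
FLINK_VERSION="1.10.0"
REPO="https://repo1.maven.org/maven2/org/apache/flink"

for artifact in "$@"; do   # e.g. flink-json, flink-sql-connector-kafka_2.11
  jar="${artifact}-${FLINK_VERSION}.jar"
  echo "Fetching ${jar} into ${FLINK_HOME}/lib"
  curl -fSL -o "${FLINK_HOME}/lib/${jar}" \
    "${REPO}/${artifact}/${FLINK_VERSION}/${jar}"
done

A "screen"-driven front end would then just translate the user's choices into that artifact list, e.g. ./assemble.sh ./flink-1.10.0 flink-json flink-csv flink-sql-connector-kafka_2.11.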
On 16/04/2020 11:02, Aljoscha Krettek wrote:

I want to reinforce my opinion from earlier: this is about improving the situation both for first-time users and for experienced users that want to use a Flink dist in production. The current Flink dist is too "thin" for first-time SQL users and too "fat" for production users; that is, we are serving no one properly with the current middle ground. That's why I think introducing those specialized "spins" of Flink dist would be good.

By the way, at some point in the future production users might not even need to get a Flink dist anymore. They should be able to have Flink as a dependency of their project (including the runtime) and then build an image from this for Kubernetes or a fat jar for YARN.

Aljoscha

On 15.04.20 18:14, wenlong.lwl wrote:

Hi all,

Regarding slim and fat distributions, I think different kinds of jobs may prefer different types of distribution:

For DataStream jobs, I think we may not like a fat distribution containing connectors, because the user always needs to depend on the connector in user code anyway, and it is easy to include the connector jar in the user lib. Fewer jars in lib means fewer class conflicts and problems.

For SQL jobs, I think we are trying to encourage users to use pure SQL (DDL + DML) to construct their jobs. To improve the user experience, it may be important for Flink not only to provide as many connector jars in the distribution as possible, especially the connectors and formats we have documented well, but also to provide a mechanism to load connectors according to the DDLs.

So I think it could be good to place connector/format jars in some dir like opt/connector, which would not affect jobs by default, and to introduce a mechanism of dynamic discovery for SQL.

Best,
Wenlong

On Wed, 15 Apr 2020 at 22:46, Jingsong Li <[hidden email]> wrote:

Hi,

I am thinking about both "improve the first experience" and "improve the production experience".

I'm thinking about what the common mode of Flink is. Streaming jobs use Kafka? Batch jobs use Hive?

Hive 1.2.1 dependencies can be compatible with most Hive server versions, so Spark and Presto have a built-in Hive 1.2.1 dependency. Flink is currently mainly used for streaming, so let's not talk about Hive.

For streaming jobs, the jobs in my mind are (in terms of connectors):
- ETL jobs: Kafka -> Kafka
- Join jobs: Kafka -> DimJDBC -> Kafka
- Aggregation jobs: Kafka -> JDBCSink

So Kafka and JDBC are probably the most commonly used; of course, this also includes the CSV and JSON formats. So we could provide a fat distribution:
- with CSV and JSON;
- with flink-kafka-universal and its Kafka dependencies;
- with flink-jdbc.

Using this fat distribution, most users can run their jobs well (a JDBC driver jar is still required, but that is very natural to do). Can these dependencies lead to conflicts? Only Kafka may have conflicts, but if our goal is to use kafka-universal to support all Kafka versions, we can hope to cover the vast majority of users.

We don't want to put every jar into the fat distribution, only common ones with few conflicts; of course, exactly which jars go into the fat distribution is a matter of consideration. We have the opportunity to make things easy for the majority of users while also leaving room for customization.

Best,
Jingsong Lee
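For context, a sketch of the first pattern above (Kafka -> Kafka ETL) as a pure-SQL job, the kind the bundled jars would serve. The WITH property keys follow the 1.10-era descriptor style from memory and may not match any release exactly; the whole script is illustrative.

#!/usr/bin/env bash
# Illustrative Kafka -> Kafka ETL job in pure SQL (DDL + DML). The statements
# assume flink-json and the Kafka connector jars are on the classpath.
cat > etl-job.sql <<'SQL'
CREATE TABLE src (
  user_id BIGINT,
  action  STRING
) WITH (
  'connector.type' = 'kafka',
  'connector.version' = 'universal',
  'connector.topic' = 'input',
  'connector.properties.bootstrap.servers' = 'localhost:9092',
  'format.type' = 'json'
);

CREATE TABLE dst (
  user_id BIGINT,
  action  STRING
) WITH (
  'connector.type' = 'kafka',
  'connector.version' = 'universal',
  'connector.topic' = 'output',
  'connector.properties.bootstrap.servers' = 'localhost:9092',
  'format.type' = 'json'
);

INSERT INTO dst SELECT user_id, action FROM src;
SQL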
On Wed, Apr 15, 2020 at 10:09 PM, Jark Wu <[hidden email]> wrote:

Hi,

I think we should first reach a consensus on "what problem do we want to solve?":
(1) improve the first experience, or (2) improve the production experience?

As far as I can see from the above discussion, what we want to solve is the "first experience". And I think the slim jar is still the best distribution for production, because it's easier to assemble jars than to exclude jars, and it avoids potential class conflicts.

If we want to improve the "first experience", I think it makes sense to have a fat distribution to give users a smoother first experience. But I would like to call it a "playground distribution" or something like that, to explicitly differ from the "slim production-purpose distribution". The "playground distribution" can contain some widely used jars, like the universal-kafka-sql-connector, elasticsearch7-sql-connector, avro, json, csv, etc. We could even provide a playground Docker image which contains the fat distribution, python3, and Hive.

Best,
Jark
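A sketch of what that playground image could look like, layered on the official Flink image. The tag, the jar list, and the paths are assumptions for illustration; no such image is published.

#!/usr/bin/env bash
# Hypothetical playground image: official Flink base plus the common SQL
# connector/format jars in lib/ and python3 installed.
set -eu
mkdir -p jars
for a in flink-json flink-csv flink-avro flink-sql-connector-kafka_2.11; do
  curl -fSL -o "jars/${a}-1.10.0.jar" \
    "https://repo1.maven.org/maven2/org/apache/flink/${a}/1.10.0/${a}-1.10.0.jar"
done

cat > Dockerfile <<'EOF'
FROM flink:1.10.0-scala_2.11
# On the default classpath, so the SQL client finds them out of the box.
COPY jars/*.jar /opt/flink/lib/
RUN apt-get update && apt-get install -y python3 && rm -rf /var/lib/apt/lists/*
EOF

docker build -t flink-sql-playground .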
On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <[hidden email]> wrote:

I don't see a lot of value in having multiple distributions. The simple reality is that no fat distribution we could provide would satisfy all use-cases, so why even try? If users commonly run into issues for certain jars, then maybe those should be added to the current distribution.

Personally, though, I still believe we should only distribute a slim version. I'd rather have users always add required jars to the distribution than only when they go outside our "expected" use-cases. Then we might finally address this issue properly, i.e., tooling to assemble custom distributions and/or better error messages if Flink-provided extensions cannot be found.

On 15/04/2020 15:23, Kurt Young wrote:

Regarding the specific solution, I'm not sure about the "fat" and "slim" solution, though. I get the idea that we can make the slim one even more lightweight than the current distribution, but what about the "fat" one? Do you mean that we would package all connectors and formats into it? I'm not sure that is feasible. For example, we can't put all versions of the Kafka and Hive connector jars into the lib directory, and we also might need Hadoop jars when using the filesystem connector to access data on HDFS.

So my guess would be that we hand-pick some of the most frequently used connectors and formats into our "lib" directory, like kafka, csv, and json mentioned above, and still leave some other connectors out of it. If this is the case, then why don't we just provide this distribution to users? I'm not sure I get the benefit of providing another super "slim" distribution (we have to pay some cost to provide another distribution).

What do you think?

Best,
Kurt

On Wed, Apr 15, 2020 at 7:08 PM, Jingsong Li <[hidden email]> wrote:

Big +1.

I like "fat" and "slim".

For csv and json, like Jark said, they are quite small and don't have other dependencies. They are important to the Kafka connector, and important to the upcoming file system connector too. So can we move them into both "fat" and "slim"? They're so important, and they're so lightweight.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 4:53 PM, godfrey he <[hidden email]> wrote:

Big +1. This will improve the user experience (especially for new Flink users). We have answered so many questions about "class not found".

Best,
Godfrey
On Wed, Apr 15, 2020 at 4:30 PM, Dian Fu <[hidden email]> wrote:

+1 to this proposal.

Missing connector jars is also a big problem for PyFlink users. Currently, after a Python user has installed PyFlink using `pip`, he has to manually copy the connector fat jars into the PyFlink installation directory for the connectors to be usable if he wants to run jobs locally. This process is very confusing for users and affects the experience a lot.

Regards,
Dian
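Spelled out, that manual step looks roughly like the following; the way of locating pyflink's bundled lib/ directory and the chosen connector jar are assumptions for illustration.

#!/usr/bin/env bash
# The manual workaround Dian describes: after installing PyFlink, drop the
# connector fat jar into the lib/ directory bundled inside the package.
pip install apache-flink==1.10.0

PYFLINK_LIB="$(python3 -c 'import os, pyflink; print(os.path.join(os.path.dirname(pyflink.__file__), "lib"))')"

curl -fSL -o "$PYFLINK_LIB/flink-sql-connector-kafka_2.11-1.10.0.jar" \
  "https://repo1.maven.org/maven2/org/apache/flink/flink-sql-connector-kafka_2.11/1.10.0/flink-sql-connector-kafka_2.11-1.10.0.jar"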
On Apr 15, 2020, at 3:51 PM, Jark Wu <[hidden email]> wrote:

+1 to the proposal. I also found the "download additional jars" step really verbose when preparing webinars.

At least, I think flink-csv and flink-json should be in the distribution; they are quite small and don't have other dependencies.

Best,
Jark

On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <[hidden email]> wrote:

Hi Aljoscha,

Big +1 for the fat Flink distribution. Where do you plan to put these connectors, opt or lib?

Best Regards,
Jeff Zhang

On Apr 15, 2020, at 3:30 PM, Aljoscha Krettek <[hidden email]> wrote:

Hi Everyone,

I'd like to discuss releasing a more full-featured Flink distribution. The motivation is that there is friction for SQL/Table API users that want to use Table connectors which are not in the current Flink distribution. For these users the workflow is currently roughly:

- download Flink dist
- configure csv/Kafka/json connectors per configuration
- run SQL client or program
- decipher the error message and research the solution
- download additional connector jars
- program works correctly

I realize that this can be made to work, but if every SQL user has this as their first experience, that doesn't seem good to me.

My proposal is to provide two versions of the Flink distribution in the future: "fat" and "slim" (names to be discussed):

- slim would be even trimmer than today's distribution
- fat would contain a lot of convenience connectors (yet to be determined which ones)

And yes, I realize that there are already more dimensions of Flink releases (Scala version and Java version).

For background, our current Flink dist has these in the opt directory:

- flink-azure-fs-hadoop-1.10.0.jar
- flink-cep-scala_2.12-1.10.0.jar
- flink-cep_2.12-1.10.0.jar
- flink-gelly-scala_2.12-1.10.0.jar
- flink-gelly_2.12-1.10.0.jar
- flink-metrics-datadog-1.10.0.jar
- flink-metrics-graphite-1.10.0.jar
- flink-metrics-influxdb-1.10.0.jar
- flink-metrics-prometheus-1.10.0.jar
- flink-metrics-slf4j-1.10.0.jar
- flink-metrics-statsd-1.10.0.jar
- flink-oss-fs-hadoop-1.10.0.jar
- flink-python_2.12-1.10.0.jar
- flink-queryable-state-runtime_2.12-1.10.0.jar
- flink-s3-fs-hadoop-1.10.0.jar
- flink-s3-fs-presto-1.10.0.jar
- flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
- flink-sql-client_2.12-1.10.0.jar
- flink-state-processor-api_2.12-1.10.0.jar
- flink-swift-fs-hadoop-1.10.0.jar

The current Flink dist is 267M. If we removed everything from opt we would go down to 126M. I would recommend this, because the large majority of the files in opt are probably unused.

What do you think?

Best,
Aljoscha

--
Best, Benchao Li
+1 for Jingsong’s proposal to put flink-csv, flink-json and flink-avro under the lib/ directory.
I have heard many SQL users (mostly newbies) complain about the out-of-the-box experience on the mailing list.

Best,
Leonard Xu

On Jun 5, 2020, at 14:39, Benchao Li <[hidden email]> wrote:

> +1 to include them for sql-client by default;
> +0 to put them into lib, exposed to all kinds of jobs, including DataStream.
I think that even tiny friction should be avoided because >> the >>>>>>> annoyance from it accumulates of the (hopefully) many users that >> we >>>>> want >>>>>>> to have. >>>>>>> >>>>>>> Maybe we should take a step back from discussing the >> "fat"/"slim" idea >>>>>>> and instead think about the composition of the current dist. As >>>>>>> mentioned we have these jars in opt/: >>>>>>> >>>>>>> 17M flink-azure-fs-hadoop-1.10.0.jar >>>>>>> 52K flink-cep-scala_2.11-1.10.0.jar >>>>>>> 180K flink-cep_2.11-1.10.0.jar >>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar >>>>>>> 626K flink-gelly_2.11-1.10.0.jar >>>>>>> 512K flink-metrics-datadog-1.10.0.jar >>>>>>> 159K flink-metrics-graphite-1.10.0.jar >>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar >>>>>>> 102K flink-metrics-prometheus-1.10.0.jar >>>>>>> 10K flink-metrics-slf4j-1.10.0.jar >>>>>>> 12K flink-metrics-statsd-1.10.0.jar >>>>>>> 36M flink-oss-fs-hadoop-1.10.0.jar >>>>>>> 28M flink-python_2.11-1.10.0.jar >>>>>>> 22K flink-queryable-state-runtime_2.11-1.10.0.jar >>>>>>> 18M flink-s3-fs-hadoop-1.10.0.jar >>>>>>> 31M flink-s3-fs-presto-1.10.0.jar >>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar >>>>>>> 518K flink-sql-client_2.11-1.10.0.jar >>>>>>> 99K flink-state-processor-api_2.11-1.10.0.jar >>>>>>> 25M flink-swift-fs-hadoop-1.10.0.jar >>>>>>> 160M opt >>>>>>> >>>>>>> The "filesystem" connectors ar ethe heavy hitters, there. >>>>>>> >>>>>>> I downloaded most of the SQL connectors/formats and this is what >> I got: >>>>>>> >>>>>>> 73K flink-avro-1.10.0.jar >>>>>>> 36K flink-csv-1.10.0.jar >>>>>>> 55K flink-hbase_2.11-1.10.0.jar >>>>>>> 88K flink-jdbc_2.11-1.10.0.jar >>>>>>> 42K flink-json-1.10.0.jar >>>>>>> 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar >>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar >>>>>>> 24M sql-connectors-formats >>>>>>> >>>>>>> We could just add these to the Flink distribution without >> blowing it up >>>>>>> by much. We could drop any of the existing "filesystem" >> connectors from >>>>>>> opt and add the SQL connectors/formats and not change the size >> of Flink >>>>>>> dist. So maybe we should do that instead? >>>>>>> >>>>>>> We would need some tooling for the sql-client shell script to >> pick-up >>>>>>> the connectors/formats up from opt/ because we don't want to add >> them >>>>> to >>>>>>> lib/. We're already doing that for finding the flink-sql-client >> jar, >>>>>>> which is also not in lib/. >>>>>>> >>>>>>> What do you think? >>>>>>> >>>>>>> Best, >>>>>>> Aljoscha >>>>>>> >>>>>>> On 17.04.20 05:22, Jark Wu wrote: >>>>>>>> Hi, >>>>>>>> >>>>>>>> I like the idea of web tool to assemble fat distribution. And >> the >>>>>>>> https://code.quarkus.io/ looks very nice. >>>>>>>> All the users need to do is just select what he/she need (I >> think this >>>>>>> step >>>>>>>> can't be omitted anyway). >>>>>>>> We can also provide a default fat distribution on the web which >>>>> default >>>>>>>> selects some popular connectors. >>>>>>>> >>>>>>>> Best, >>>>>>>> Jark >>>>>>>> >>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <[hidden email] >>> >>>>> wrote: >>>>>>>> >>>>>>>>> As a reference for a nice first-experience I had, take a >> look at >>>>>>>>> https://code.quarkus.io/ >>>>>>>>> You reach this page after you click "Start Coding" at the >> project >>>>>>> homepage. 
>>>>>>>>> Rafi >>>>>>>>> >>>>>>>>> >>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <[hidden email]> >> wrote: >>>>>>>>> >>>>>>>>>> I'm not saying pre-bundle some jars will make this problem >> go away, >>>>> and >>>>>>>>>> you're right that only hides the problem for >>>>>>>>>> some users. But what if this solution can hide the problem >> for 90% >>>>>>> users? >>>>>>>>>> Would't that be good enough for us to try? >>>>>>>>>> >>>>>>>>>> Regarding to would users following instructions really be >> such a big >>>>>>>>>> problem? >>>>>>>>>> I'm afraid yes. Otherwise I won't answer such questions >> for at >>>>> least a >>>>>>>>>> dozen times and I won't see such questions coming >>>>>>>>>> up from time to time. During some periods, I even saw such >> questions >>>>>>>>> every >>>>>>>>>> day. >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Kurt >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler < >>>>> [hidden email]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> The problem with having a distribution with "popular" >> stuff is >>>>> that it >>>>>>>>>>> doesn't really *solve* a problem, it just hides it for >> users who >>>>> fall >>>>>>>>>>> into these particular use-cases. >>>>>>>>>>> Move out of it and you once again run into exact same >> problems >>>>>>>>> out-lined. >>>>>>>>>>> This is exactly why I like the tooling approach; you >> have to deal >>>>> with >>>>>>>>> it >>>>>>>>>>> from the start and transitioning to a custom use-case is >> easier. >>>>>>>>>>> >>>>>>>>>>> Would users following instructions really be such a big >> problem? >>>>>>>>>>> I would expect that users generally know *what *they >> need, just not >>>>>>>>>>> necessarily how it is assembled correctly (where do get >> which jar, >>>>>>>>> which >>>>>>>>>>> directory to put it in). >>>>>>>>>>> It seems like these are exactly the problem this would >> solve? >>>>>>>>>>> I just don't see how moving a jar corresponding to some >> feature >>>>> from >>>>>>>>> opt >>>>>>>>>>> to some directory (lib/plugins) is less error-prone than >> just >>>>>>> selecting >>>>>>>>>> the >>>>>>>>>>> feature and having the tool handle the rest. >>>>>>>>>>> >>>>>>>>>>> As for re-distributions, it depends on the form that the >> tool would >>>>>>>>> take. >>>>>>>>>>> It could be an application that runs locally and works >> against >>>>> maven >>>>>>>>>>> central (note: not necessarily *using* maven); this >> should would >>>>> work >>>>>>>>> in >>>>>>>>>>> China, no? >>>>>>>>>>> >>>>>>>>>>> A web tool would of course be fancy, but I don't know >> how feasible >>>>>>> this >>>>>>>>>> is >>>>>>>>>>> with the ASF infrastructure. >>>>>>>>>>> You wouldn't be able to mirror the distribution, so the >> load can't >>>>> be >>>>>>>>>>> distributed. I doubt INFRA would like this. >>>>>>>>>>> >>>>>>>>>>> Note that third-parties could also start distributing >> use-case >>>>>>> oriented >>>>>>>>>>> distributions, which would be perfectly fine as far as >> I'm >>>>> concerned. >>>>>>>>>>> >>>>>>>>>>> On 16/04/2020 16:57, Kurt Young wrote: >>>>>>>>>>> >>>>>>>>>>> I'm not so sure about the web tool solution though. The >> concern I >>>>> have >>>>>>>>>> for >>>>>>>>>>> this approach is the final generated >>>>>>>>>>> distribution is kind of non-deterministic. We might >> generate too >>>>> many >>>>>>>>>>> different combinations when user trying to >>>>>>>>>>> package different types of connector, format, and even >> maybe hadoop >>>>>>>>>>> releases. 
As far as I can tell, most open >>>>>>>>>>> source projects and apache projects will only release >> some >>>>>>>>>>> pre-defined distributions, which most users are already >>>>>>>>>>> familiar with, thus hard to change IMO. And I also have >> went >>>>> through >>>>>>> in >>>>>>>>>>> some cases, users will try to re-distribute >>>>>>>>>>> the release package, because of the unstable network of >> apache >>>>> website >>>>>>>>>> from >>>>>>>>>>> China. In web tool solution, I don't >>>>>>>>>>> think this kind of re-distribution would be possible >> anymore. >>>>>>>>>>> >>>>>>>>>>> In the meantime, I also have a concern that we will fall >> back into >>>>> our >>>>>>>>>> trap >>>>>>>>>>> again if we try to offer this smart & flexible >>>>>>>>>>> solution. Because it needs users to cooperate with such >> mechanism. >>>>>>> It's >>>>>>>>>>> exactly the situation what we currently fell >>>>>>>>>>> into: >>>>>>>>>>> 1. We offered a smart solution. >>>>>>>>>>> 2. We hope users will follow the correct instructions. >>>>>>>>>>> 3. Everything will work as expected if users followed >> the right >>>>>>>>>>> instructions. >>>>>>>>>>> >>>>>>>>>>> In reality, I suspect not all users will do the second >> step >>>>> correctly. >>>>>>>>>> And >>>>>>>>>>> for new users who only trying to have a quick >>>>>>>>>>> experience with Flink, I would bet most users will do it >> wrong. >>>>>>>>>>> >>>>>>>>>>> So, my proposal would be one of the following 2 options: >>>>>>>>>>> 1. Provide a slim distribution for advanced product >> users and >>>>> provide >>>>>>> a >>>>>>>>>>> distribution which will have some popular builtin jars. >>>>>>>>>>> 2. Only provide a distribution which will have some >> popular builtin >>>>>>>>> jars. >>>>>>>>>>> If we are trying to reduce the distributions we >> released, I would >>>>>>>>> prefer >>>>>>>>>> 2 >>>>>>>>>>> 1. >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> Kurt >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann < >>>>> [hidden email]> >>>>>>> < >>>>>>>>>> [hidden email]> wrote: >>>>>>>>>>> >>>>>>>>>>> I think what Chesnay and Dawid proposed would be the >> ideal >>>>> solution. >>>>>>>>>>> Ideally, we would also have a nice web tool for the >> website which >>>>>>>>>> generates >>>>>>>>>>> the corresponding distribution for download. >>>>>>>>>>> >>>>>>>>>>> To get things started we could start with only >> supporting to >>>>>>>>>>> download/creating the "fat" version with the script. The >> fat >>>>> version >>>>>>>>>> would >>>>>>>>>>> then consist of the slim distribution and whatever we >> deem >>>>> important >>>>>>>>> for >>>>>>>>>>> new users to get started. >>>>>>>>>>> >>>>>>>>>>> Cheers, >>>>>>>>>>> Till >>>>>>>>>>> >>>>>>>>>>> On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz < >>>>>>>>>> [hidden email]> <[hidden email]> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Hi all, >>>>>>>>>>> >>>>>>>>>>> Few points from my side: >>>>>>>>>>> >>>>>>>>>>> 1. I like the idea of simplifying the experience for >> first time >>>>> users. >>>>>>>>>>> As for production use cases I share Jark's opinion that >> in this >>>>> case I >>>>>>>>>>> would expect users to combine their distribution >> manually. I think >>>>> in >>>>>>>>>>> such scenarios it is important to understand >> interconnections. >>>>>>>>>>> Personally I'd expect the slimmest possible distribution >> that I can >>>>>>>>>>> extend further with what I need in my production >> scenario. >>>>>>>>>>> >>>>>>>>>>> 2. 
I think there is also the problem that the matrix of >> possible >>>>>>>>>>> combinations that can be useful is already big. Do we >> want to have >>>>> a >>>>>>>>>>> distribution for: >>>>>>>>>>> >>>>>>>>>>> SQL users: which connectors should we include? should we >>>>> include >>>>>>>>>>> hive? which other catalog? >>>>>>>>>>> >>>>>>>>>>> DataStream users: which connectors should we include? >>>>>>>>>>> >>>>>>>>>>> For both of the above should we include yarn/kubernetes? >>>>>>>>>>> >>>>>>>>>>> I would opt for providing only the "slim" distribution >> as a release >>>>>>>>>>> artifact. >>>>>>>>>>> >>>>>>>>>>> 3. However, as I said I think its worth investigating >> how we can >>>>>>>>> improve >>>>>>>>>>> users experience. What do you think of providing a tool, >> could be >>>>> e.g. >>>>>>>>> a >>>>>>>>>>> shell script that constructs a distribution based on >> users choice. >>>>> I >>>>>>>>>>> think that was also what Chesnay mentioned as "tooling to >>>>>>>>>>> assemble custom distributions" In the end how I see the >> difference >>>>>>>>>>> between a slim and fat distribution is which jars do we >> put into >>>>> the >>>>>>>>>>> lib, right? It could have a few "screens". >>>>>>>>>>> >>>>>>>>>>> 1. Which API are you interested in: >>>>>>>>>>> a. SQL API >>>>>>>>>>> b. DataStream API >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> 2. [SQL] Which connectors do you want to use? >> [multichoice]: >>>>>>>>>>> a. Kafka >>>>>>>>>>> b. Elasticsearch >>>>>>>>>>> ... >>>>>>>>>>> >>>>>>>>>>> 3. [SQL] Which catalog you want to use? >>>>>>>>>>> >>>>>>>>>>> ... >>>>>>>>>>> >>>>>>>>>>> Such a tool would download all the dependencies from >> maven and put >>>>>>> them >>>>>>>>>>> into the correct folder. In the future we can extend it >> with >>>>>>> additional >>>>>>>>>>> rules e.g. kafka-0.9 cannot be chosen at the same time >> with >>>>>>>>>>> kafka-universal etc. >>>>>>>>>>> >>>>>>>>>>> The benefit of it would be that the distribution that we >> release >>>>> could >>>>>>>>>>> remain "slim" or we could even make it slimmer. I might >> be missing >>>>>>>>>>> something here though. >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> >>>>>>>>>>> Dawdi >>>>>>>>>>> >>>>>>>>>>> On 16/04/2020 11:02, Aljoscha Krettek wrote: >>>>>>>>>>> >>>>>>>>>>> I want to reinforce my opinion from earlier: This is >> about >>>>> improving >>>>>>>>>>> the situation both for first-time users and for >> experienced users >>>>> that >>>>>>>>>>> want to use a Flink dist in production. The current >> Flink dist is >>>>> too >>>>>>>>>>> "thin" for first-time SQL users and it is too "fat" for >> production >>>>>>>>>>> users, that is where serving no-one properly with the >> current >>>>>>>>>>> middle-ground. That's why I think introducing those >> specialized >>>>>>>>>>> "spins" of Flink dist would be good. >>>>>>>>>>> >>>>>>>>>>> By the way, at some point in the future production users >> might not >>>>>>>>>>> even need to get a Flink dist anymore. They should be >> able to have >>>>>>>>>>> Flink as a dependency of their project (including the >> runtime) and >>>>>>>>>>> then build an image from this for Kubernetes or a fat >> jar for YARN. 
>>>>>>>>>>> >>>>>>>>>>> Aljoscha >>>>>>>>>>> >>>>>>>>>>> On 15.04.20 18:14, wenlong.lwl wrote: >>>>>>>>>>> >>>>>>>>>>> Hi all, >>>>>>>>>>> >>>>>>>>>>> Regarding slim and fat distributions, I think different >> kinds of >>>>> jobs >>>>>>>>>>> may >>>>>>>>>>> prefer different type of distribution: >>>>>>>>>>> >>>>>>>>>>> For DataStream job, I think we may not like fat >> distribution >>>>>>>>>>> >>>>>>>>>>> containing >>>>>>>>>>> >>>>>>>>>>> connectors because user would always need to depend on >> the >>>>> connector >>>>>>>>>>> >>>>>>>>>>> in >>>>>>>>>>> >>>>>>>>>>> user code, it is easy to include the connector jar in >> the user lib. >>>>>>>>>>> >>>>>>>>>>> Less >>>>>>>>>>> >>>>>>>>>>> jar in lib means less class conflicts and problems. >>>>>>>>>>> >>>>>>>>>>> For SQL job, I think we are trying to encourage user to >> user pure >>>>>>>>>>> sql(DDL + >>>>>>>>>>> DML) to construct their job, In order to improve user >> experience, >>>>> It >>>>>>>>>>> may be >>>>>>>>>>> important for flink, not only providing as many >> connector jar in >>>>>>>>>>> distribution as possible especially the connector and >> format we >>>>> have >>>>>>>>>>> well >>>>>>>>>>> documented, but also providing an mechanism to load >> connectors >>>>>>>>>>> according >>>>>>>>>>> to the DDLs, >>>>>>>>>>> >>>>>>>>>>> So I think it could be good to place connector/format >> jars in some >>>>>>>>>>> dir like >>>>>>>>>>> opt/connector which would not affect jobs by default, and >>>>> introduce a >>>>>>>>>>> mechanism of dynamic discovery for SQL. >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> Wenlong >>>>>>>>>>> >>>>>>>>>>> On Wed, 15 Apr 2020 at 22:46, Jingsong Li < >> [hidden email]> >>>>> < >>>>>>>>>> [hidden email]> >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> Hi, >>>>>>>>>>> >>>>>>>>>>> I am thinking both "improve first experience" and >> "improve >>>>> production >>>>>>>>>>> experience". >>>>>>>>>>> >>>>>>>>>>> I'm thinking about what's the common mode of Flink? >>>>>>>>>>> Streaming job use Kafka? Batch job use Hive? >>>>>>>>>>> >>>>>>>>>>> Hive 1.2.1 dependencies can be compatible with most of >> Hive server >>>>>>>>>>> versions. So Spark and Presto have built-in Hive 1.2.1 >> dependency. >>>>>>>>>>> Flink is currently mainly used for streaming, so let's >> not talk >>>>>>>>>>> about hive. >>>>>>>>>>> >>>>>>>>>>> For streaming jobs, first of all, the jobs in my mind is >> (related >>>>> to >>>>>>>>>>> connectors): >>>>>>>>>>> - ETL jobs: Kafka -> Kafka >>>>>>>>>>> - Join jobs: Kafka -> DimJDBC -> Kafka >>>>>>>>>>> - Aggregation jobs: Kafka -> JDBCSink >>>>>>>>>>> So Kafka and JDBC are probably the most commonly used. >> Of course, >>>>>>>>>>> >>>>>>>>>>> also >>>>>>>>>>> >>>>>>>>>>> includes CSV, JSON's formats. >>>>>>>>>>> So when we provide such a fat distribution: >>>>>>>>>>> - With CSV, JSON. >>>>>>>>>>> - With flink-kafka-universal and kafka dependencies. >>>>>>>>>>> - With flink-jdbc. >>>>>>>>>>> Using this fat distribution, most users can run their >> jobs well. >>>>>>>>>>> >>>>>>>>>>> (jdbc >>>>>>>>>>> >>>>>>>>>>> driver jar required, but this is very natural to do) >>>>>>>>>>> Can these dependencies lead to kinds of conflicts? Only >> Kafka may >>>>>>>>>>> >>>>>>>>>>> have >>>>>>>>>>> >>>>>>>>>>> conflicts, but if our goal is to use kafka-universal to >> support all >>>>>>>>>>> Kafka >>>>>>>>>>> versions, it is hopeful to target the vast majority of >> users. >>>>>>>>>>> >>>>>>>>>>> We don't want to plug all jars into the fat >> distribution. 
> On Wed, 15 Apr 2020 at 22:46, Jingsong Li <[hidden email]> wrote:
>
> Hi,
>
> I am thinking about both "improve the first experience" and "improve
> the production experience".
>
> What is the common mode of Flink? Streaming jobs use Kafka? Batch
> jobs use Hive?
>
> Hive 1.2.1 dependencies are compatible with most Hive server
> versions, which is why Spark and Presto ship a built-in Hive 1.2.1
> dependency. Flink is currently mainly used for streaming, so let's
> not talk about Hive.
>
> For streaming jobs, the jobs I have in mind are (as far as connectors
> are concerned):
> - ETL jobs: Kafka -> Kafka
> - Join jobs: Kafka -> DimJDBC -> Kafka
> - Aggregation jobs: Kafka -> JDBCSink
>
> So Kafka and JDBC are probably the most commonly used, along with the
> CSV and JSON formats. If we provide a fat distribution:
> - with CSV and JSON,
> - with flink-kafka-universal and its Kafka dependencies,
> - with flink-jdbc,
> then most users can run their jobs well (a JDBC driver jar is still
> required, but adding that is very natural).
>
> Can these dependencies lead to conflicts? Only Kafka may have
> conflicts, but if our goal is to use kafka-universal to support all
> Kafka versions, we can hope to cover the vast majority of users.
>
> We don't want to put every jar into the fat distribution -- only
> common ones with few conflicts. Of course, which jars go into the fat
> distribution is a matter of consideration. We have the opportunity to
> help the majority of users while leaving room for customization.
>
> Best,
> Jingsong Lee
>
> On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <[hidden email]> wrote:
>
> Hi,
>
> I think we should first reach a consensus on what problem we want to
> solve: (1) improve the first experience, or (2) improve the
> production experience?
>
> As far as I can see from the above discussion, what we want to solve
> is the "first experience". And I think the slim jar is still the best
> distribution for production, because assembling jars is easier than
> excluding jars, and it avoids potential class conflicts.
>
> If we want to improve the "first experience", it makes sense to have
> a fat distribution that gives users a smoother first experience. But
> I would like to call it a "playground distribution" or something like
> that, to explicitly distinguish it from the slim production-purpose
> distribution. The playground distribution can contain some widely
> used jars, like the universal Kafka SQL connector, the Elasticsearch
> 7 SQL connector, Avro, JSON, CSV, etc. We could even provide a
> playground Docker image which contains the fat distribution,
> Python 3, and Hive.
>
> Best,
> Jark
>
> On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <[hidden email]> wrote:
>
> I don't see a lot of value in having multiple distributions.
>
> The simple reality is that no fat distribution we could provide would
> satisfy all use-cases, so why even try? If users commonly run into
> issues for certain jars, then maybe those should be added to the
> current distribution.
>
> Personally though, I still believe we should only distribute a slim
> version. I'd rather have users always add required jars to the
> distribution than only when they go outside our "expected" use-cases.
> Then we might finally address this issue properly, i.e., tooling to
> assemble custom distributions and/or better error messages if
> Flink-provided extensions cannot be found.
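For context on those error messages: the failure a first-time SQL user
typically hits today when a format jar is missing, and the manual fix,
look roughly like the following. The exception text is abridged from
what Flink 1.10 commonly reports; the exact wording varies by version:

    # Typical symptom when e.g. flink-json is not on the classpath:
    #   org.apache.flink.table.api.NoMatchingTableFactoryException:
    #   Could not find a suitable table factory for
    #   'org.apache.flink.table.factories.DeserializationSchemaFactory'
    #   in the classpath.
    # The fix users currently have to research on their own:
    curl -fLO https://repo.maven.apache.org/maven2/org/apache/flink/flink-json/1.10.0/flink-json-1.10.0.jar
    mv flink-json-1.10.0.jar "${FLINK_HOME}/lib/"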
> On 15/04/2020 15:23, Kurt Young wrote:
>
> Regarding the specific solution, I'm not sure about the "fat" and
> "slim" idea though. I get that we can make the slim one even more
> lightweight than the current distribution, but what about the "fat"
> one? Do you mean that we would package all connectors and formats
> into it? I'm not sure that is feasible. For example, we can't put all
> versions of the Kafka and Hive connector jars into the lib directory,
> and we also might need Hadoop jars when using the filesystem
> connector to access data from HDFS.
>
> So my guess would be that we hand-pick some of the most frequently
> used connectors and formats for our lib directory, like the Kafka,
> CSV and JSON ones mentioned above, and still leave some other
> connectors out. If that is the case, then why not just provide this
> one distribution to users? I'm not sure I see the benefit of
> providing another super "slim" distribution (we would have to pay
> some cost to maintain another suite of distributions).
>
> What do you think?
>
> Best,
> Kurt
>
> On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <[hidden email]> wrote:
>
> Big +1.
>
> I like "fat" and "slim".
>
> For CSV and JSON, like Jark said, they are quite small and don't have
> other dependencies. They are important to the Kafka connector, and
> important to the upcoming filesystem connector too. So can we put
> them into both "fat" and "slim"? They're that important, and they're
> that lightweight.
>
> Best,
> Jingsong Lee
>
> On Wed, Apr 15, 2020 at 4:53 PM godfrey he <[hidden email]> wrote:
>
> Big +1. This will improve the user experience (especially for new
> Flink users). We have answered so many questions about "class not
> found".
>
> Best,
> Godfrey
>
> On Wed, Apr 15, 2020 at 4:30 PM, Dian Fu <[hidden email]> wrote:
>
> +1 to this proposal.
>
> Missing connector jars is also a big problem for PyFlink users.
> Currently, after a Python user has installed PyFlink using `pip`,
> they have to manually copy the connector fat jars into the PyFlink
> installation directory for the connectors to be usable in locally
> run jobs. This process is very confusing for users and hurts the
> experience a lot.
>
> Regards,
> Dian
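The manual step Dian describes looks roughly like the following sketch,
assuming PyFlink 1.10's layout where the bundled distribution lives
under the installed pyflink package; the path may differ across
versions:

    # Locate pip-installed PyFlink's bundled lib/ and copy a connector
    # fat jar into it so locally run jobs can find it.
    PYFLINK_LIB=$(python -c 'import os, pyflink; print(os.path.join(os.path.dirname(os.path.abspath(pyflink.__file__)), "lib"))')
    cp flink-sql-connector-kafka_2.11-1.10.0.jar "${PYFLINK_LIB}/"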
> On Wed, Apr 15, 2020 at 3:51 PM, Jark Wu <[hidden email]> wrote:
>
> +1 to the proposal. I also found the "download additional jar" step
> really tedious when preparing webinars.
>
> At the very least, I think flink-csv and flink-json should be in the
> distribution; they are quite small and don't have other dependencies.
>
> Best,
> Jark
>
> On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <[hidden email]> wrote:
>
> Hi Aljoscha,
>
> Big +1 for the fat Flink distribution. Where do you plan to put these
> connectors, opt or lib?
>
> Best Regards,
> Jeff Zhang
>
> On Wed, Apr 15, 2020 at 3:30 PM, Aljoscha Krettek <[hidden email]> wrote:
>
> Hi Everyone,
>
> I'd like to discuss releasing a more full-featured Flink
> distribution. The motivation is that there is friction for SQL/Table
> API users who want to use Table connectors that are not in the
> current Flink distribution. For these users the workflow is currently
> roughly:
>
> - download Flink dist
> - configure csv/Kafka/json connectors per configuration
> - run SQL client or program
> - decrypt the error message and research the solution
> - download additional connector jars
> - program works correctly
>
> I realize that this can be made to work, but if every SQL user has
> this as their first experience, that doesn't seem good to me.
>
> My proposal is to provide two versions of the Flink distribution in
> the future: "fat" and "slim" (names to be discussed):
>
> - slim would be even trimmer than today's distribution
> - fat would contain a lot of convenience connectors (yet to be
> determined which ones)
>
> And yes, I realize that there are already more dimensions of Flink
> releases (Scala version and Java version).
> For background, our current Flink dist has these in the opt
> directory:
>
> - flink-azure-fs-hadoop-1.10.0.jar
> - flink-cep-scala_2.12-1.10.0.jar
> - flink-cep_2.12-1.10.0.jar
> - flink-gelly-scala_2.12-1.10.0.jar
> - flink-gelly_2.12-1.10.0.jar
> - flink-metrics-datadog-1.10.0.jar
> - flink-metrics-graphite-1.10.0.jar
> - flink-metrics-influxdb-1.10.0.jar
> - flink-metrics-prometheus-1.10.0.jar
> - flink-metrics-slf4j-1.10.0.jar
> - flink-metrics-statsd-1.10.0.jar
> - flink-oss-fs-hadoop-1.10.0.jar
> - flink-python_2.12-1.10.0.jar
> - flink-queryable-state-runtime_2.12-1.10.0.jar
> - flink-s3-fs-hadoop-1.10.0.jar
> - flink-s3-fs-presto-1.10.0.jar
> - flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
> - flink-sql-client_2.12-1.10.0.jar
> - flink-state-processor-api_2.12-1.10.0.jar
> - flink-swift-fs-hadoop-1.10.0.jar
>
> The current Flink dist is 267M. If we removed everything from opt, we
> would go down to 126M. I would recommend this, because the large
> majority of the files in opt are probably unused.
>
> What do you think?
>
> Best,
> Aljoscha

--
Best,
Benchao Li
+1 to add the lightweight formats into lib/
On Fri, Jun 5, 2020 at 3:28 PM Leonard Xu <[hidden email]> wrote:
> +1 for Jingsong's proposal to put flink-csv, flink-json and
> flink-avro under the lib/ directory. I have heard many SQL users
> (mostly newbies) complain about the out-of-the-box experience on the
> mailing list.
>
> Best,
> Leonard Xu
>
> [earlier quoted messages trimmed]
--
Best regards!
Rui Li
Hi,
Thanks all for your feedback. I have created a JIRA issue for bundling
the format jars in lib [1]. FYI.

[1] https://issues.apache.org/jira/browse/FLINK-18173

Best,
Jingsong Lee

On Fri, Jun 5, 2020 at 3:59 PM Rui Li <[hidden email]> wrote:
> [earlier quoted messages trimmed]
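Once the format jars ship in lib/ as FLINK-18173 proposes, a quick
sanity check on a fresh download could look like the following
(illustrative only; the exact jar names depend on the release):

    # Verify the bundled formats are on the default classpath.
    ls "${FLINK_HOME}/lib" | grep -E 'flink-(csv|json|avro)-.*\.jar'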
> > >>>>>> > > >>>>>> > > >>>>>> On Fri, Apr 17, 2020 at 12:16 PM Aljoscha Krettek < > > >> [hidden email]> > > >>>>>> wrote: > > >>>>>> > > >>>>>>> I think having such tools and/or tailor-made distributions can > > >> be nice > > >>>>>>> but I also think the discussion is missing the main point: The > > >> initial > > >>>>>>> observation/motivation is that apparently a lot of users (Kurt > > >> and I > > >>>>>>> talked about this) on the chinese DingTalk support groups, and > > >> other > > >>>>>>> support channels have problems when first using the SQL client > > >> because > > >>>>>>> of these missing connectors/formats. For these, having > > >> additional tools > > >>>>>>> would not solve anything because they would also not take that > > >> extra > > >>>>>>> step. I think that even tiny friction should be avoided because > > >> the > > >>>>>>> annoyance from it accumulates of the (hopefully) many users that > > >> we > > >>>>> want > > >>>>>>> to have. > > >>>>>>> > > >>>>>>> Maybe we should take a step back from discussing the > > >> "fat"/"slim" idea > > >>>>>>> and instead think about the composition of the current dist. As > > >>>>>>> mentioned we have these jars in opt/: > > >>>>>>> > > >>>>>>> 17M flink-azure-fs-hadoop-1.10.0.jar > > >>>>>>> 52K flink-cep-scala_2.11-1.10.0.jar > > >>>>>>> 180K flink-cep_2.11-1.10.0.jar > > >>>>>>> 746K flink-gelly-scala_2.11-1.10.0.jar > > >>>>>>> 626K flink-gelly_2.11-1.10.0.jar > > >>>>>>> 512K flink-metrics-datadog-1.10.0.jar > > >>>>>>> 159K flink-metrics-graphite-1.10.0.jar > > >>>>>>> 1.0M flink-metrics-influxdb-1.10.0.jar > > >>>>>>> 102K flink-metrics-prometheus-1.10.0.jar > > >>>>>>> 10K flink-metrics-slf4j-1.10.0.jar > > >>>>>>> 12K flink-metrics-statsd-1.10.0.jar > > >>>>>>> 36M flink-oss-fs-hadoop-1.10.0.jar > > >>>>>>> 28M flink-python_2.11-1.10.0.jar > > >>>>>>> 22K flink-queryable-state-runtime_2.11-1.10.0.jar > > >>>>>>> 18M flink-s3-fs-hadoop-1.10.0.jar > > >>>>>>> 31M flink-s3-fs-presto-1.10.0.jar > > >>>>>>> 196K flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar > > >>>>>>> 518K flink-sql-client_2.11-1.10.0.jar > > >>>>>>> 99K flink-state-processor-api_2.11-1.10.0.jar > > >>>>>>> 25M flink-swift-fs-hadoop-1.10.0.jar > > >>>>>>> 160M opt > > >>>>>>> > > >>>>>>> The "filesystem" connectors ar ethe heavy hitters, there. > > >>>>>>> > > >>>>>>> I downloaded most of the SQL connectors/formats and this is what > > >> I got: > > >>>>>>> > > >>>>>>> 73K flink-avro-1.10.0.jar > > >>>>>>> 36K flink-csv-1.10.0.jar > > >>>>>>> 55K flink-hbase_2.11-1.10.0.jar > > >>>>>>> 88K flink-jdbc_2.11-1.10.0.jar > > >>>>>>> 42K flink-json-1.10.0.jar > > >>>>>>> 20M flink-sql-connector-elasticsearch6_2.11-1.10.0.jar > > >>>>>>> 2.8M flink-sql-connector-kafka_2.11-1.10.0.jar > > >>>>>>> 24M sql-connectors-formats > > >>>>>>> > > >>>>>>> We could just add these to the Flink distribution without > > >> blowing it up > > >>>>>>> by much. We could drop any of the existing "filesystem" > > >> connectors from > > >>>>>>> opt and add the SQL connectors/formats and not change the size > > >> of Flink > > >>>>>>> dist. So maybe we should do that instead? > > >>>>>>> > > >>>>>>> We would need some tooling for the sql-client shell script to > > >> pick-up > > >>>>>>> the connectors/formats up from opt/ because we don't want to add > > >> them > > >>>>> to > > >>>>>>> lib/. We're already doing that for finding the flink-sql-client > > >> jar, > > >>>>>>> which is also not in lib/. > > >>>>>>> > > >>>>>>> What do you think? 
> > >>>>>>> > > >>>>>>> Best, > > >>>>>>> Aljoscha > > >>>>>>> > > >>>>>>> On 17.04.20 05:22, Jark Wu wrote: > > >>>>>>>> Hi, > > >>>>>>>> > > >>>>>>>> I like the idea of web tool to assemble fat distribution. And > > >> the > > >>>>>>>> https://code.quarkus.io/ looks very nice. > > >>>>>>>> All the users need to do is just select what he/she need (I > > >> think this > > >>>>>>> step > > >>>>>>>> can't be omitted anyway). > > >>>>>>>> We can also provide a default fat distribution on the web which > > >>>>> default > > >>>>>>>> selects some popular connectors. > > >>>>>>>> > > >>>>>>>> Best, > > >>>>>>>> Jark > > >>>>>>>> > > >>>>>>>> On Fri, 17 Apr 2020 at 02:29, Rafi Aroch <[hidden email] > > >>> > > >>>>> wrote: > > >>>>>>>> > > >>>>>>>>> As a reference for a nice first-experience I had, take a > > >> look at > > >>>>>>>>> https://code.quarkus.io/ > > >>>>>>>>> You reach this page after you click "Start Coding" at the > > >> project > > >>>>>>> homepage. > > >>>>>>>>> Rafi > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> On Thu, Apr 16, 2020 at 6:53 PM Kurt Young <[hidden email]> > > >> wrote: > > >>>>>>>>> > > >>>>>>>>>> I'm not saying pre-bundle some jars will make this problem > > >> go away, > > >>>>> and > > >>>>>>>>>> you're right that only hides the problem for > > >>>>>>>>>> some users. But what if this solution can hide the problem > > >> for 90% > > >>>>>>> users? > > >>>>>>>>>> Would't that be good enough for us to try? > > >>>>>>>>>> > > >>>>>>>>>> Regarding to would users following instructions really be > > >> such a big > > >>>>>>>>>> problem? > > >>>>>>>>>> I'm afraid yes. Otherwise I won't answer such questions > > >> for at > > >>>>> least a > > >>>>>>>>>> dozen times and I won't see such questions coming > > >>>>>>>>>> up from time to time. During some periods, I even saw such > > >> questions > > >>>>>>>>> every > > >>>>>>>>>> day. > > >>>>>>>>>> > > >>>>>>>>>> Best, > > >>>>>>>>>> Kurt > > >>>>>>>>>> > > >>>>>>>>>> > > >>>>>>>>>> On Thu, Apr 16, 2020 at 11:21 PM Chesnay Schepler < > > >>>>> [hidden email]> > > >>>>>>>>>> wrote: > > >>>>>>>>>> > > >>>>>>>>>>> The problem with having a distribution with "popular" > > >> stuff is > > >>>>> that it > > >>>>>>>>>>> doesn't really *solve* a problem, it just hides it for > > >> users who > > >>>>> fall > > >>>>>>>>>>> into these particular use-cases. > > >>>>>>>>>>> Move out of it and you once again run into exact same > > >> problems > > >>>>>>>>> out-lined. > > >>>>>>>>>>> This is exactly why I like the tooling approach; you > > >> have to deal > > >>>>> with > > >>>>>>>>> it > > >>>>>>>>>>> from the start and transitioning to a custom use-case is > > >> easier. > > >>>>>>>>>>> > > >>>>>>>>>>> Would users following instructions really be such a big > > >> problem? > > >>>>>>>>>>> I would expect that users generally know *what *they > > >> need, just not > > >>>>>>>>>>> necessarily how it is assembled correctly (where do get > > >> which jar, > > >>>>>>>>> which > > >>>>>>>>>>> directory to put it in). > > >>>>>>>>>>> It seems like these are exactly the problem this would > > >> solve? > > >>>>>>>>>>> I just don't see how moving a jar corresponding to some > > >> feature > > >>>>> from > > >>>>>>>>> opt > > >>>>>>>>>>> to some directory (lib/plugins) is less error-prone than > > >> just > > >>>>>>> selecting > > >>>>>>>>>> the > > >>>>>>>>>>> feature and having the tool handle the rest. > > >>>>>>>>>>> > > >>>>>>>>>>> As for re-distributions, it depends on the form that the > > >> tool would > > >>>>>>>>> take. 
It could be an application that runs locally and works against Maven Central (note: not necessarily *using* Maven); this should work in China, no?

A web tool would of course be fancy, but I don't know how feasible this is with the ASF infrastructure. You wouldn't be able to mirror the distribution, so the load can't be distributed. I doubt INFRA would like this.

Note that third parties could also start distributing use-case-oriented distributions, which would be perfectly fine as far as I'm concerned.

On 16/04/2020 16:57, Kurt Young wrote:

I'm not so sure about the web tool solution though. The concern I have with this approach is that the final generated distribution is kind of non-deterministic. We might generate too many different combinations when users package different types of connectors, formats, and maybe even Hadoop releases. As far as I can tell, most open source and Apache projects only release a few pre-defined distributions, which most users are already familiar with, and that is hard to change IMO. I have also seen cases where users re-distribute the release package themselves because of the unstable network connection to the Apache website from China. With a web tool solution, I don't think this kind of re-distribution would be possible anymore.

In the meantime, I am also concerned that we will fall into our own trap again if we try to offer this smart & flexible solution, because it requires users to cooperate with such a mechanism. It's exactly the situation we currently find ourselves in:
1. We offered a smart solution.
2. We hope users will follow the correct instructions.
3. Everything will work as expected if users followed the right instructions.

In reality, I suspect not all users will do the second step correctly. And for new users who are only trying to get a quick first experience with Flink, I would bet most will do it wrong.

So, my proposal would be one of the following two options:
1. Provide a slim distribution for advanced production users, plus a distribution which has some popular built-in jars.
2. Only provide a distribution which has some popular built-in jars.

If we are trying to reduce the number of distributions we release, I would prefer option 2 over option 1.
Best,
Kurt

On Thu, Apr 16, 2020 at 9:33 PM Till Rohrmann <[hidden email]> wrote:

I think what Chesnay and Dawid proposed would be the ideal solution. Ideally, we would also have a nice web tool for the website which generates the corresponding distribution for download.

To get things started we could begin with only supporting downloading/creating the "fat" version with the script. The fat version would then consist of the slim distribution plus whatever we deem important for new users to get started.

Cheers,
Till

On Thu, Apr 16, 2020 at 11:33 AM Dawid Wysakowicz <[hidden email]> wrote:

Hi all,

A few points from my side:

1. I like the idea of simplifying the experience for first-time users. As for production use cases, I share Jark's opinion that in this case I would expect users to assemble their distribution manually. I think in such scenarios it is important to understand the interconnections. Personally, I'd expect the slimmest possible distribution that I can extend further with what I need in my production scenario.

2. I think there is also the problem that the matrix of possible combinations that could be useful is already big. Do we want a distribution for:

SQL users: which connectors should we include? Should we include Hive? Which other catalogs?

DataStream users: which connectors should we include?

For both of the above, should we include YARN/Kubernetes?

I would opt for providing only the "slim" distribution as a release artifact.

3. However, as I said, I think it's worth investigating how we can improve the user experience. What do you think of providing a tool, e.g. a shell script, that constructs a distribution based on the user's choices? I think that is also what Chesnay meant by "tooling to assemble custom distributions". In the end, the way I see it, the difference between a slim and a fat distribution is which jars we put into lib, right? The tool could have a few "screens":

1. Which API are you interested in:
   a. SQL API
   b. DataStream API

2. [SQL] Which connectors do you want to use? [multichoice]:
   a. Kafka
   b. Elasticsearch
   ...

3. [SQL] Which catalog do you want to use?

...

Such a tool would download all the dependencies from Maven and put them into the correct folder. In the future we could extend it with additional rules, e.g. kafka-0.9 cannot be chosen at the same time as kafka-universal, etc.

The benefit would be that the distribution we release could remain "slim", or we could even make it slimmer. I might be missing something here though.

Best,
Dawid
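To make the tool idea concrete, here is a minimal sketch of such an assembler, assuming a hand-maintained catalog of choices, illustrative Maven coordinates, and a single conflict rule; this is not an existing Flink tool, just one possible shape of it:

    #!/usr/bin/env python3
    """Sketch of the distribution-assembly tool discussed above.

    Everything here is illustrative: the catalog of choices, the
    Maven coordinates, and the conflict rule are assumptions.
    """
    import urllib.request
    from pathlib import Path

    MAVEN = "https://repo1.maven.org/maven2"
    VERSION = "1.10.0"

    # Hypothetical catalog: user-facing choice -> (group path, artifact id).
    CATALOG = {
        "kafka": ("org/apache/flink", "flink-sql-connector-kafka_2.11"),
        "elasticsearch6": ("org/apache/flink", "flink-sql-connector-elasticsearch6_2.11"),
        "json": ("org/apache/flink", "flink-json"),
        "csv": ("org/apache/flink", "flink-csv"),
    }

    # Rules of the kind mentioned above: choices that must not be combined.
    MUTUALLY_EXCLUSIVE = [{"kafka-0.9", "kafka"}]

    def assemble(choices, lib_dir="flink-dist/lib"):
        """Download the jars for the selected features into the lib directory."""
        for rule in MUTUALLY_EXCLUSIVE:
            if rule <= set(choices):
                raise ValueError(f"cannot combine {sorted(rule)}")
        target = Path(lib_dir)
        target.mkdir(parents=True, exist_ok=True)
        for choice in choices:
            group, artifact = CATALOG[choice]
            jar = f"{artifact}-{VERSION}.jar"
            url = f"{MAVEN}/{group}/{artifact}/{VERSION}/{jar}"
            print(f"fetching {url}")
            urllib.request.urlretrieve(url, target / jar)

    if __name__ == "__main__":
        # Example: a SQL user who picked Kafka plus the JSON and CSV formats.
        assemble(["kafka", "json", "csv"])

A real tool would presumably ship the catalog and conflict rules as versioned metadata with each release, so that a local script and a web frontend could share them.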
On 16/04/2020 11:02, Aljoscha Krettek wrote:

I want to reinforce my opinion from earlier: this is about improving the situation both for first-time users and for experienced users that want to use a Flink dist in production. The current Flink dist is too "thin" for first-time SQL users and too "fat" for production users, which means we are serving no one properly with the current middle ground. That's why I think introducing those specialized "spins" of Flink dist would be good.

By the way, at some point in the future production users might not even need to get a Flink dist anymore. They should be able to have Flink as a dependency of their project (including the runtime) and then build an image from this for Kubernetes, or a fat jar for YARN.

Aljoscha

On 15.04.20 18:14, wenlong.lwl wrote:

Hi all,

Regarding slim and fat distributions, I think different kinds of jobs may prefer different types of distribution.

For DataStream jobs, I think we may not want a fat distribution containing connectors, because the user always needs to depend on the connector in user code anyway, and it is easy to include the connector jar in the user lib. Fewer jars in lib means fewer class conflicts and problems.

For SQL jobs, I think we are trying to encourage users to use pure SQL (DDL + DML) to construct their jobs. In order to improve the user experience, it may be important for Flink not only to provide as many connector jars in the distribution as possible (especially the connectors and formats we have documented well), but also to provide a mechanism to load connectors according to the DDLs.

So I think it could be good to place connector/format jars in a directory like opt/connector, which would not affect jobs by default, and to introduce a mechanism of dynamic discovery for SQL.

Best,
Wenlong
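A rough sketch of the dynamic-discovery mechanism wenlong describes; the property keys, the name-based jar-matching rule, and the opt/connector layout are all assumptions for illustration:

    """Sketch: scan the DDL for connector/format properties and pick
    matching jars from an opt/connector directory."""
    import re
    from pathlib import Path

    DDL = """
    CREATE TABLE orders (
        order_id BIGINT,
        amount DOUBLE
    ) WITH (
        'connector.type' = 'kafka',
        'format.type' = 'json'
    )
    """

    def required_features(ddl: str) -> set:
        """Collect the connector/format values used in WITH clauses."""
        pattern = r"'(?:connector|format)(?:\.type)?'\s*=\s*'([^']+)'"
        return set(re.findall(pattern, ddl))

    def discover_jars(features: set, opt_dir: str = "opt/connector") -> list:
        """Naive rule: a jar serves a feature if the feature name appears
        in the jar's file name, e.g. 'kafka' matches
        flink-sql-connector-kafka_2.11-1.10.0.jar."""
        return [jar for jar in Path(opt_dir).glob("*.jar")
                if any(feature in jar.name for feature in features)]

    features = required_features(DDL)
    print(features)  # -> {'kafka', 'json'} (set order may vary)
    # The discovered jars would then be appended to the job's classpath.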
On Wed, 15 Apr 2020 at 22:46, Jingsong Li <[hidden email]> wrote:

Hi,

I am thinking about both "improving the first experience" and "improving the production experience".

What are the common modes of Flink? Streaming jobs use Kafka? Batch jobs use Hive? Hive 1.2.1 dependencies are compatible with most Hive server versions, which is why Spark and Presto ship a built-in Hive 1.2.1 dependency. Flink is currently mainly used for streaming, so let's not talk about Hive.

For streaming jobs, the jobs I have in mind are (with respect to connectors):
- ETL jobs: Kafka -> Kafka
- Join jobs: Kafka -> DimJDBC -> Kafka
- Aggregation jobs: Kafka -> JDBCSink

So Kafka and JDBC are probably the most commonly used, along with the CSV and JSON formats. So we could provide a fat distribution:
- with CSV and JSON;
- with flink-kafka-universal and its Kafka dependencies;
- with flink-jdbc.

Using this fat distribution, most users can run their jobs well (a JDBC driver jar is still required, but that is a very natural thing to add). Can these dependencies lead to conflicts? Only Kafka may have conflicts, but if our goal is to use kafka-universal to support all Kafka versions, we can hope to cover the vast majority of users.

We don't want to put every jar into the fat distribution, only common ones with few conflicts. Of course, exactly which jars go into the fat distribution is a matter of deliberation. We have the opportunity to help the majority of users while still leaving room for customization.

Best,
Jingsong Lee
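To illustrate why that jar set covers the common patterns, here is what the ETL job above could look like as pure DDL + DML, sketched with PyFlink; the topics, fields, and property keys are invented and follow the Flink 1.10 property style, so they may be incomplete:

    # Sketch of the "ETL job: Kafka -> Kafka" shape, assuming flink-json
    # and the universal Kafka connector are on the classpath.
    from pyflink.datastream import StreamExecutionEnvironment
    from pyflink.table import StreamTableEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()
    t_env = StreamTableEnvironment.create(env)

    source_ddl = """
    CREATE TABLE src (
        user_id BIGINT,
        amount DOUBLE
    ) WITH (
        'connector.type' = 'kafka',
        'connector.version' = 'universal',
        'connector.topic' = 'orders',
        'connector.properties.bootstrap.servers' = 'localhost:9092',
        'format.type' = 'json',
        'update-mode' = 'append'
    )
    """

    sink_ddl = """
    CREATE TABLE dst (
        user_id BIGINT,
        amount DOUBLE
    ) WITH (
        'connector.type' = 'kafka',
        'connector.version' = 'universal',
        'connector.topic' = 'orders-clean',
        'connector.properties.bootstrap.servers' = 'localhost:9092',
        'format.type' = 'json',
        'update-mode' = 'append'
    )
    """

    # The whole job: two DDLs and one DML statement (1.10-era API names).
    t_env.sql_update(source_ddl)
    t_env.sql_update(sink_ddl)
    t_env.sql_update("INSERT INTO dst SELECT user_id, amount FROM src WHERE amount > 0")
    t_env.execute("kafka-etl")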
On Wed, Apr 15, 2020 at 10:09 PM Jark Wu <[hidden email]> wrote:

Hi,

I think we should first reach a consensus on "what problem do we want to solve?": (1) improve the first experience, or (2) improve the production experience?

As far as I can see from the above discussion, what we want to solve is the "first experience". And I think the slim jar is still the best distribution for production, because it's easier to assemble jars than to exclude them, and it avoids potential class conflicts.

If we want to improve the "first experience", I think it makes sense to have a fat distribution that gives users a smoother start. But I would like to call it a "playground distribution" or something like that, to explicitly distinguish it from the "slim production-purpose distribution". The "playground distribution" can contain some widely used jars, like the universal Kafka SQL connector, the Elasticsearch 7 SQL connector, Avro, JSON, CSV, etc. We could even provide a playground Docker image containing the fat distribution, Python 3, and Hive.

Best,
Jark

On Wed, 15 Apr 2020 at 21:47, Chesnay Schepler <[hidden email]> wrote:

I don't see a lot of value in having multiple distributions. The simple reality is that no fat distribution we could provide would satisfy all use-cases, so why even try? If users commonly run into issues with certain jars, then maybe those should be added to the current distribution.

Personally, though, I still believe we should only distribute a slim version. I'd rather have users always add required jars to the distribution than only when they go outside our "expected" use-cases. Then we might finally address this issue properly, i.e., tooling to assemble custom distributions and/or better error messages if Flink-provided extensions cannot be found.

On 15/04/2020 15:23, Kurt Young wrote:

Regarding the specific solution, I'm not so sure about the "fat" and "slim" approach though.
I get the idea that we can make the slim one even more lightweight than the current distribution, but what about the "fat" one? Do you mean that we would package all connectors and formats into it? I'm not sure that is feasible. For example, we can't put all versions of the Kafka and Hive connector jars into the lib directory, and we might also need Hadoop jars when using the filesystem connector to access data on HDFS.

So my guess is that we would hand-pick some of the most frequently used connectors and formats for our lib directory, like the Kafka, CSV, and JSON ones mentioned above, and still leave some other connectors out. If that is the case, then why not just provide this one distribution to users? I'm not sure I see the benefit of providing another super-"slim" distribution (we would have to pay some cost to maintain another variant).

What do you think?

Best,
Kurt

On Wed, Apr 15, 2020 at 7:08 PM Jingsong Li <[hidden email]> wrote:

Big +1. I like "fat" and "slim".

For CSV and JSON, as Jark said, they are quite small and have no other dependencies. They are important to the Kafka connector, and important to the upcoming filesystem connector too. So can we put them into both "fat" and "slim"? They're that important, and they're that lightweight.

Best,
Jingsong Lee

On Wed, Apr 15, 2020 at 4:53 PM godfrey he <[hidden email]> wrote:

Big +1. This will improve the user experience (especially for new Flink users). We have answered so many questions about "class not found".

Best,
Godfrey

On Wed, Apr 15, 2020 at 4:30 PM, Dian Fu <[hidden email]> wrote:

+1 to this proposal.

Missing connector jars is also a big problem for PyFlink users. Currently, after a Python user has installed PyFlink using `pip`, they have to manually copy the connector fat jars into the PyFlink installation directory for the connectors to be usable in locally-run jobs. This process is very confusing for users and hurts the experience a lot.

Regards,
Dian
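For reference, a sketch of the manual copy step Dian describes, assuming a standard pip installation and an illustrative connector jar name:

    # Copy a connector jar into the lib/ directory of the pip-installed
    # PyFlink package so that locally-run jobs can find it.
    import shutil
    from pathlib import Path

    import pyflink

    # Locate the lib/ directory inside the installed pyflink package.
    pyflink_lib = Path(pyflink.__file__).parent / "lib"
    connector_jar = Path("flink-sql-connector-kafka_2.11-1.10.0.jar")

    print(f"copying {connector_jar} into {pyflink_lib}")
    shutil.copy(connector_jar, pyflink_lib)  # fails if the jar was not downloaded first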
On Apr 15, 2020, at 3:51 PM, Jark Wu <[hidden email]> wrote:

+1 to the proposal. I also found the "download additional jar" step really tedious when preparing webinars.

At the very least, I think flink-csv and flink-json should be in the distribution; they are quite small and have no other dependencies.

Best,
Jark

On Wed, 15 Apr 2020 at 15:44, Jeff Zhang <[hidden email]> wrote:

Hi Aljoscha,

Big +1 for the fat Flink distribution. Where do you plan to put these connectors, opt or lib?

Best Regards,
Jeff Zhang

On Wed, Apr 15, 2020 at 3:30 PM, Aljoscha Krettek <[hidden email]> wrote:

Hi Everyone,

I'd like to discuss releasing a more full-featured Flink distribution. The motivation is that there is friction for SQL/Table API users that want to use Table connectors which are not in the current Flink distribution. For these users the workflow is currently roughly:

- download Flink dist
- configure csv/Kafka/json connectors per configuration
- run SQL client or program
- decrypt the error message and research the solution
- download additional connector jars
- program works correctly

I realize that this can be made to work, but if every SQL user has this as their first experience, that doesn't seem good to me.
My proposal is to provide two versions of the Flink distribution in the future: "fat" and "slim" (names to be discussed):

- slim would be even trimmer than today's distribution
- fat would contain a lot of convenience connectors (yet to be determined which ones)

And yes, I realize that there are already more dimensions of Flink releases (Scala version and Java version).

For background, our current Flink dist has these in the opt directory:

- flink-azure-fs-hadoop-1.10.0.jar
- flink-cep-scala_2.12-1.10.0.jar
- flink-cep_2.12-1.10.0.jar
- flink-gelly-scala_2.12-1.10.0.jar
- flink-gelly_2.12-1.10.0.jar
- flink-metrics-datadog-1.10.0.jar
- flink-metrics-graphite-1.10.0.jar
- flink-metrics-influxdb-1.10.0.jar
- flink-metrics-prometheus-1.10.0.jar
- flink-metrics-slf4j-1.10.0.jar
- flink-metrics-statsd-1.10.0.jar
- flink-oss-fs-hadoop-1.10.0.jar
- flink-python_2.12-1.10.0.jar
- flink-queryable-state-runtime_2.12-1.10.0.jar
- flink-s3-fs-hadoop-1.10.0.jar
- flink-s3-fs-presto-1.10.0.jar
- flink-shaded-netty-tcnative-dynamic-2.0.25.Final-9.0.jar
- flink-sql-client_2.12-1.10.0.jar
- flink-state-processor-api_2.12-1.10.0.jar
- flink-swift-fs-hadoop-1.10.0.jar

The current Flink dist is 267M. If we removed everything from opt we would go down to 126M. I would recommend this, because the large majority of the files in opt are probably unused.

What do you think?

Best,
Aljoscha

--
Best, Jingsong Lee