Hi all,
I heard some users complain that the Table API is difficult to test. Now, with the SQL client, users are more and more inclined to use it for testing rather than writing a program.
The most common example is a Kafka source. If users need to test their SQL output and checkpointing, they have to:

- 1. Launch a standalone Kafka and create a Kafka topic.
- 2. Write a program that mocks input records and produces them to the Kafka topic.
- 3. Then run the test in Flink.

Steps 1 and 2 are annoying, even though the resulting test is end-to-end.

Then I found StatefulSequenceSource. It is very useful because it already handles checkpointing, so it exercises the checkpoint mechanism well. Usually, users have checkpointing turned on in production.

With computed columns, users can easily create a sequence-source DDL with the same schema as a Kafka DDL. They can then test entirely inside Flink without launching anything else (a rough sketch follows below).

Have you considered this? What do you think?

CC: @Aljoscha Krettek <[hidden email]>, the author of StatefulSequenceSource.

Best,
Jingsong Lee
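For illustration only, here is a rough sketch of that idea. Everything below is hedged: the 'sequence' connector type and its option keys are hypothetical (no such built-in table connector exists at this point), and the computed columns merely show how Kafka-like fields could be derived from a generated id:

CREATE TABLE orders_test (
    id BIGINT,
    -- the hypothetical source only produces `id`; the rest are computed columns
    user_name AS 'user_' || CAST(MOD(id, 100) AS STRING),
    amount AS CAST(MOD(id, 1000) AS DOUBLE) / 10,
    order_time AS PROCTIME()
) WITH (
    'connector.type' = 'sequence',   -- hypothetical connector type
    'sequence.start' = '1',          -- hypothetical option key
    'sequence.end' = '1000000'       -- hypothetical option key
)

A query under test could then read from orders_test exactly as it would from the real Kafka table, and the DDL can be swapped back once local testing is done.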
+1.
I would suggest taking a step even further and looking at what users really need to test/try/play with the Table API and Flink SQL. Besides this one, here are some more sources and sinks that I have developed or used previously to facilitate building Flink Table/SQL pipelines.

1. random input data source
   - should generate random data at a specified rate according to the schema
   - purposes
     - test that a Flink pipeline works and that data ends up in external storage correctly
     - stress-test a Flink sink as well as tune the external storage
2. print data sink
   - should print data in row format to the console
   - purposes
     - make it easier to test a Flink SQL job end-to-end in the IDE
     - test a Flink pipeline and ensure the output data format/values are correct
3. no-output data sink
   - just swallows output data without doing anything
   - purpose
     - evaluate and tune the performance of a Flink source and the whole pipeline; users don't need to worry about sink back pressure

These may be taken into consideration all together as an effort to lower the threshold of running Flink SQL / Table API and to facilitate users' daily work.

Cheers,
Bowen
+1 to Bowen's proposal. I have also seen many requirements for such built-in connectors. I will leave some of my thoughts here:

> 1. datagen source (random source)
I think we can merge the functionality of a sequence source into the random source to allow users to customize their data values. Flink can generate random data according to the field types, and users can customize the values to be more domain-specific, e.g. 'field.user'='User_[1-9]{0,1}' (see the DDL sketch below). This would be similar to kafka-connect-datagen [1].

> 2. console sink (print sink)
This will be very useful in production debugging, to easily output an intermediate view or result view to a `.out` file, so that we can look into the data representation or check dirty data. This should work out of the box, without manual DDL registration.

> 3. blackhole sink (no output sink)
This is very useful for performance testing of Flink, to measure the throughput of the whole pipeline without a sink. Presto also provides this as a built-in connector [2].

Best,
Jark

[1]: https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
[2]: https://prestodb.io/docs/current/connector/blackhole.html
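To make the value-customization idea concrete, here is a hedged sketch; the 'field.*' option keys are assumptions modeled on kafka-connect-datagen, not an agreed-upon Flink syntax:

CREATE TABLE users (
    user_name STRING,
    age INT
) WITH (
    'connector.type' = 'datagen',
    'field.user_name.pattern' = 'User_[1-9]{0,1}',  -- assumed: regex-style value pattern
    'field.age.min' = '18',                         -- assumed: numeric range bounds
    'field.age.max' = '99'
)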
Thanks Jingsong for bringing up this discussion. +1 to this proposal. Bowen's proposal makes a lot of sense to me.

This is also a painful problem for PyFlink users. Currently there is no built-in, easy-to-use table source/sink, and users have to write a lot of code just to try out PyFlink. This is especially painful for new users who are not familiar with PyFlink/Flink. I have also gone through the tedious process Bowen described, e.g. writing a random source connector, a print sink and a blackhole sink, as there are no built-in ones to use.

Regards,
Dian
Thanks Bowen, Jark and Dian for your feedback and suggestions.
I have reorganized the proposal based on your suggestions; here is an attempt at the DDLs:

1. datagen source:
- easy startup/testing for streaming jobs
- performance testing

DDL:
CREATE TABLE user (
    id BIGINT,
    age INT,
    description STRING
) WITH (
    'connector.type' = 'datagen',
    'connector.rows-per-second' = '100',
    'connector.total-records' = '1000000',

    'schema.id.generator' = 'sequence',
    'schema.id.generator.start' = '1',

    'schema.age.generator' = 'random',
    'schema.age.generator.min' = '0',
    'schema.age.generator.max' = '100',

    'schema.description.generator' = 'random',
    'schema.description.generator.length' = '100'
)

The default is the random generator.
Hi Jark, I don't want to introduce complicated value patterns, because that can be done through computed columns, and it is hard to define standard patterns. I think we can leave that to the future.

2. print sink:
- easy testing for streaming jobs
- very useful in production debugging

DDL:
CREATE TABLE print_table (
    ...
) WITH (
    'connector.type' = 'print'
)

3. blackhole sink:
- very useful for performance testing of Flink
- I've also run into users trying to use a UDF for output instead of a sink, so they need this sink as well.

DDL:
CREATE TABLE blackhole_table (
    ...
) WITH (
    'connector.type' = 'blackhole'
)

A sketch of how these could be combined for a quick end-to-end test follows below. What do you think?

Best,
Jingsong Lee
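As a usage sketch, the proposed connectors could be combined for a quick end-to-end test with no external system involved; the option keys follow the proposal above and are of course still subject to change:

CREATE TABLE user_gen (
    id BIGINT,
    age INT
) WITH (
    'connector.type' = 'datagen',
    'connector.rows-per-second' = '100',
    'schema.id.generator' = 'sequence',
    'schema.id.generator.start' = '1',
    'schema.age.generator' = 'random',
    'schema.age.generator.min' = '0',
    'schema.age.generator.max' = '100'
)

CREATE TABLE print_sink (
    id BIGINT,
    age INT
) WITH (
    'connector.type' = 'print'
)

-- results appear in the console / `.out` files of the TaskManagers
INSERT INTO print_sink
SELECT id, age FROM user_gen WHERE age > 50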
Hi Jingsong,
Thanks for bringing this up. Generally, it's a very good proposal.

About the datagen source: do you think we need to support more column types?

About the print sink: do we need to specify the schema?

--
Benchao Li
School of Electronics Engineering and Computer Science, Peking University
Tel: +86-15650713730
Email: [hidden email]; [hidden email]
Hi Jingsong,
Regarding (2) and (3), I was thinking of skipping the manual DDL work entirely, so that users can use them directly:

# this will log results to `.out` files
INSERT INTO console
SELECT ...

# this will drop all received records
INSERT INTO blackhole
SELECT ...

Here `console` and `blackhole` are system sinks, similar to system functions.

Best,
Jark
Hi Benchao,
> do you think we need to add more columns with various types?

I didn't list all the types, but we should support primitive types, VARCHAR, DECIMAL, TIMESTAMP, etc. This can be done incrementally.

Hi Benchao, Jark,
About console and blackhole: yes, they can have no schema; the schema can be inferred from the upstream node.
- But we don't currently have a mechanism for such configurable, schema-less sinks.
- If we want to support them, we need a single, uniform way to support both sinks.
- And users can use "create table like" and other means to simplify the DDL (see the sketch below).

And about providing system/registered tables (`console` and `blackhole`):
- I have no strong opinion on these system tables. In SQL this would look like "insert into blackhole select a /*int*/, b /*string*/ from tableA" and "insert into blackhole select a /*double*/, b /*Map*/, c /*string*/ from tableB". Blackhole then becomes a universal table that accepts any schema, which intuitively feels wrong to me.
- Can users override these tables? If so, we need to ensure they can be overridden by catalog tables.

So I think we can leave these system tables to the future too.
What do you think?

Best,
Jingsong Lee
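For illustration, a hedged sketch of the "create table like" point: assuming a LIKE clause that copies the schema of an existing table (no such clause exists yet at this point, so the exact syntax below is an assumption), a print sink would not need to repeat the columns:

-- copy the schema of an existing table, set only the connector
CREATE TABLE print_sink
WITH (
    'connector.type' = 'print'
)
LIKE orders (EXCLUDING OPTIONS)

INSERT INTO print_sink SELECT * FROM orders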
I agree with Jingsong that sink schema inference and system tables can be considered later. I wouldn't recommend tackling them just for the sake of simplifying the user experience to the extreme. Providing the handy source and sink implementations above already offers users a ton of immediate value.

Bowen
Hi all,
I created https://issues.apache.org/jira/browse/FLINK-16743 for the follow-up discussion. FYI.

Best,
Jingsong Lee
> > > > > > > > > > I reorganize with your suggestions, and try to expose DDLs: > > > > > > > > > > 1.datagen source: > > > > > - easy startup/test for streaming job > > > > > - performance testing > > > > > > > > > > DDL: > > > > > CREATE TABLE user ( > > > > > id BIGINT, > > > > > age INT, > > > > > description STRING > > > > > ) WITH ( > > > > > 'connector.type' = 'datagen', > > > > > 'connector.rows-per-second'='100', > > > > > 'connector.total-records'='1000000', > > > > > > > > > > 'schema.id.generator' = 'sequence', > > > > > 'schema.id.generator.start' = '1', > > > > > > > > > > 'schema.age.generator' = 'random', > > > > > 'schema.age.generator.min' = '0', > > > > > 'schema.age.generator.max' = '100', > > > > > > > > > > 'schema.description.generator' = 'random', > > > > > 'schema.description.generator.length' = '100' > > > > > ) > > > > > > > > > > Default is random generator. > > > > > Hi Jark, I don't want to bring complicated regularities, because it > > can > > > > be > > > > > done through computed columns. And it is hard to define > > > > > standard regularities, I think we can leave it to the future. > > > > > > > > > > 2.print sink: > > > > > - easy test for streaming job > > > > > - be very useful in production debugging > > > > > > > > > > DDL: > > > > > CREATE TABLE print_table ( > > > > > ... > > > > > ) WITH ( > > > > > 'connector.type' = 'print' > > > > > ) > > > > > > > > > > 3.blackhole sink > > > > > - very useful for high performance testing of Flink > > > > > - I've also run into users trying UDF to output, not sink, so they > > need > > > > > this sink as well. > > > > > > > > > > DDL: > > > > > CREATE TABLE blackhole_table ( > > > > > ... > > > > > ) WITH ( > > > > > 'connector.type' = 'blackhole' > > > > > ) > > > > > > > > > > What do you think? > > > > > > > > > > Best, > > > > > Jingsong Lee > > > > > > > > > > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <[hidden email]> > > > wrote: > > > > > > > > > > > Thanks Jingsong for bringing up this discussion. +1 to this > > > proposal. I > > > > > > think Bowen's proposal makes much sense to me. > > > > > > > > > > > > This is also a painful problem for PyFlink users. Currently there > > is > > > no > > > > > > built-in easy-to-use table source/sink and it requires users to > > > write a > > > > > lot > > > > > > of code to trying out PyFlink. This is especially painful for new > > > users > > > > > who > > > > > > are not familiar with PyFlink/Flink. I have also encountered the > > > > tedious > > > > > > process Bowen encountered, e.g. writing random source connector, > > > > > sink > > > > > > and also blackhole print sink as there are no built-in ones to > use. > > > > > > > > > > > > Regards, > > > > > > Dian > > > > > > > > > > > > > 在 2020年3月22日,上午11:24,Jark Wu <[hidden email]> 写道: > > > > > > > > > > > > > > +1 to Bowen's proposal. I also saw many requirements on such > > > built-in > > > > > > > connectors. > > > > > > > > > > > > > > I will leave some my thoughts here: > > > > > > > > > > > > > >> 1. datagen source (random source) > > > > > > > I think we can merge the functinality of sequence-source into > > > random > > > > > > source > > > > > > > to allow users to custom their data values. > > > > > > > Flink can generate random data according to the field types, > > users > > > > > > > can customize their values to be more domain specific, e.g. > > > > > > > 'field.user'='User_[1-9]{0,1}' > > > > > > > This will be similar to kafka-datagen-connect[1]. > > > > > > > > > > > > > >> 2. 
console sink (print sink) > > > > > > > This will be very useful in production debugging, to easily > > output > > > an > > > > > > > intermediate view or result view to a `.out` file. > > > > > > > So that we can look into the data representation, or check > dirty > > > > data. > > > > > > > This should be out-of-box without manually DDL registration. > > > > > > > > > > > > > >> 3. blackhole sink (no output sink) > > > > > > > This is very useful for high performance testing of Flink, to > > > > meansure > > > > > > the > > > > > > > throughput of the whole pipeline without sink. > > > > > > > Presto also provides this as a built-in connector [2]. > > > > > > > > > > > > > > Best, > > > > > > > Jark > > > > > > > > > > > > > > [1]: > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification > > > > > > > [2]: https://prestodb.io/docs/current/connector/blackhole.html > > > > > > > > > > > > > > > > > > > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li <[hidden email]> > > > wrote: > > > > > > > > > > > > > >> +1. > > > > > > >> > > > > > > >> I would suggest to take a step even further and see what users > > > > really > > > > > > need > > > > > > >> to test/try/play with table API and Flink SQL. Besides this > one, > > > > > here're > > > > > > >> some more sources and sinks that I have developed or used > > > previously > > > > > to > > > > > > >> facilitate building Flink table/SQL pipelines. > > > > > > >> > > > > > > >> > > > > > > >> 1. random input data source > > > > > > >> - should generate random data at a specified rate > according > > > to > > > > > > schema > > > > > > >> - purposes > > > > > > >> - test Flink pipeline and data can end up in external > > > > storage > > > > > > >> correctly > > > > > > >> - stress test Flink sink as well as tuning up external > > > > storage > > > > > > >> 2. print data sink > > > > > > >> - should print data in row format in console > > > > > > >> - purposes > > > > > > >> - make it easier to test Flink SQL job e2e in IDE > > > > > > >> - test Flink pipeline and ensure output data > > format/value > > > is > > > > > > >> correct > > > > > > >> 3. no output data sink > > > > > > >> - just swallow output data without doing anything > > > > > > >> - purpose > > > > > > >> - evaluate and tune performance of Flink source and > the > > > > whole > > > > > > >> pipeline. Users' don't need to worry about sink back > > > > pressure > > > > > > >> > > > > > > >> These may be taken into consideration all together as an > effort > > to > > > > > lower > > > > > > >> the threshold of running Flink SQL/table API, and facilitate > > > users' > > > > > > daily > > > > > > >> work. > > > > > > >> > > > > > > >> Cheers, > > > > > > >> Bowen > > > > > > >> > > > > > > >> > > > > > > >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li < > > > > [hidden email]> > > > > > > >> wrote: > > > > > > >> > > > > > > >>> Hi all, > > > > > > >>> > > > > > > >>> I heard some users complain that table is difficult to test. > > Now > > > > with > > > > > > SQL > > > > > > >>> client, users are more and more inclined to use it to test > > rather > > > > > than > > > > > > >>> program. > > > > > > >>> The most common example is Kafka source. If users need to > test > > > > their > > > > > > SQL > > > > > > >>> output and checkpoint, they need to: > > > > > > >>> > > > > > > >>> - 1.Launch a Kafka standalone, create a Kafka topic . 
> > > > > > >>> - 2.Write a program, mock input records, and produce records > to > > > > Kafka > > > > > > >>> topic. > > > > > > >>> - 3.Then test in Flink. > > > > > > >>> > > > > > > >>> The step 1 and 2 are annoying, although this test is E2E. > > > > > > >>> > > > > > > >>> Then I found StatefulSequenceSource, it is very good because > it > > > has > > > > > > deal > > > > > > >>> with checkpoint things, so it is very good to checkpoint > > > > > > >> mechanism.Usually, > > > > > > >>> users are turned on checkpoint in production. > > > > > > >>> > > > > > > >>> With computed columns, user are easy to create a sequence > > source > > > > DDL > > > > > > same > > > > > > >>> to Kafka DDL. Then they can test inside Flink, don't need > > launch > > > > > other > > > > > > >>> things. > > > > > > >>> > > > > > > >>> Have you consider this? What do you think? > > > > > > >>> > > > > > > >>> CC: @Aljoscha Krettek <[hidden email]> the author > > > > > > >>> of StatefulSequenceSource. > > > > > > >>> > > > > > > >>> Best, > > > > > > >>> Jingsong Lee > > > > > > >>> > > > > > > >> > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Best, Jingsong Lee > > > > > > > > > > > > > > > > > -- > > > > > > > > Benchao Li > > > > School of Electronics Engineering and Computer Science, Peking > > University > > > > Tel:+86-15650713730 > > > > Email: [hidden email]; [hidden email] > > > > > > > > > > > > > -- > > Best, Jingsong Lee > > > -- Best, Jingsong Lee |
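To make the proposal quoted above concrete, here is a minimal sketch of the zero-dependency test flow it enables: a datagen source feeding a print sink, with no external system involved. The option keys follow the in-thread proposal (they were still under discussion at this point, so the final connector options may differ), and the table and field names are only illustrative.

-- sketch only: option keys are taken from the in-thread proposal, not a finalized API
CREATE TABLE orders_gen (
  order_id BIGINT,
  amount INT
) WITH (
  'connector.type' = 'datagen',
  'connector.rows-per-second' = '10',
  'connector.total-records' = '100',
  'schema.order_id.generator' = 'sequence',
  'schema.order_id.generator.start' = '1',
  'schema.amount.generator' = 'random',
  'schema.amount.generator.min' = '1',
  'schema.amount.generator.max' = '500'
);

-- print sink: writes every received row to the console / `.out` file
CREATE TABLE orders_print (
  order_id BIGINT,
  amount INT
) WITH (
  'connector.type' = 'print'
);

-- the whole test runs inside Flink; no Kafka or other external storage is needed
INSERT INTO orders_print SELECT order_id, amount FROM orders_gen;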
Hi everyone,
sorry for reviving this thread at this point in time. Generally, I think this is a very valuable effort. Have we considered only providing a very basic data generator (plus discarding and printing sink tables) in Apache Flink, and moving a more comprehensive data-generating table source to an ecosystem project promoted on flink-packages.org? I think this has a lot of potential (e.g. in combination with Java Faker [1]), but it would probably be better served in a small, separately maintained repository. Cheers, Konstantin [1] https://github.com/DiUS/java-faker
On Tue, Mar 24, 2020 at 9:10 AM Jingsong Li <[hidden email]> wrote: [...]
-- Konstantin Knauf https://twitter.com/snntrable https://github.com/knaufk |
Hi Konstantin,
Thanks for the link to Java Faker. It's an interesting project and could benefit a comprehensive datagen source.

What would the discarding and printing sinks look like, in your view?

1) manually create a table with a `blackhole` or `print` connector, e.g.

CREATE TABLE my_sink (
  a INT,
  b STRING,
  c DOUBLE
) WITH (
  'connector' = 'print'
);
INSERT INTO my_sink SELECT a, b, c FROM my_source;

2) a system built-in table named `blackhole` and `print` without manual schema work, e.g.
INSERT INTO print SELECT a, b, c, d FROM my_source;

Best, Jark
On Thu, 30 Apr 2020 at 21:19, Konstantin Knauf <[hidden email]> wrote: [...]
|
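One way to keep option 1) while avoiding most of the schema repetition by hand is the "create table like" clause that Jingsong mentioned earlier in the thread. A rough sketch under that assumption (the exact LIKE syntax was still evolving at the time and may differ; `my_source` is just the illustrative source table from the example above):

-- sketch: derive a print sink from an existing table, copying its schema
-- but replacing the connector options
CREATE TABLE my_sink_print
WITH (
  'connector' = 'print'
)
LIKE my_source (EXCLUDING OPTIONS);

INSERT INTO my_sink_print SELECT * FROM my_source;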
Hi Jark,
my gut feeling is 1), because of its consistency with other connectors (it does not introduce two secret keywords), although it is more verbose. Best, Konstantin
On Thu, Apr 30, 2020 at 5:01 PM Jark Wu <[hidden email]> wrote: [...]
-- Konstantin Knauf https://twitter.com/snntrable https://github.com/knaufk |
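For completeness, the blackhole counterpart of option 1) would look just like Jark's print example above; a short sketch (schema and table names are illustrative):

CREATE TABLE my_blackhole (
  a INT,
  b STRING,
  c DOUBLE
) WITH (
  'connector' = 'blackhole'
);

-- drops every row; useful to measure source and pipeline throughput
-- without any sink back pressure
INSERT INTO my_blackhole SELECT a, b, c FROM my_source;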
Thanks Konstantin for your Faker link.
It looks very interesting and very real. We can add this generator to datagen source. Best, Jingsong Lee On Fri, May 1, 2020 at 1:00 AM Konstantin Knauf <[hidden email]> wrote: > Hi Jark, > > my gut feeling is 1), because of its consistency with other connectors > (does not add two secret keywords) although it is more verbose. > > Best, > > Konstantin > > > > On Thu, Apr 30, 2020 at 5:01 PM Jark Wu <[hidden email]> wrote: > > > Hi Konstantin, > > > > Thanks for the link of Java Faker. It's an intereting project and > > could benefit to a comprehensive datagen source. > > > > What the discarding and printing sink look like in your thought? > > 1) manually create a table with a `blackhole` or `print` connector, e.g. > > > > CREATE TABLE my_sink ( > > a INT, > > b STRNG, > > c DOUBLE > > ) WITH ( > > 'connector' = 'print' > > ); > > INSERT INTO my_sink SELECT a, b, c FROM my_source; > > > > 2) a system built-in table named `blackhole` and `print` without manually > > schema work, e.g. > > INSERT INTO print SELECT a, b, c, d FROM my_source; > > > > Best, > > Jark > > > > > > > > On Thu, 30 Apr 2020 at 21:19, Konstantin Knauf <[hidden email]> > wrote: > > > > > Hi everyone, > > > > > > sorry for reviving this thread at this point in time. Generally, I > think, > > > this is a very valuable effort. Have we considered only providing a > very > > > basic data generator (+ discarding and printing sink tables) in Apache > > > Flink and moving a more comprehensive data generating table source to > an > > > ecosystem project promoted on flink-packages.org. I think this has a > lot > > > of > > > potential (e.g. in combination with Java Faker [1]), but it would > > probably > > > be better served in a small separately maintained repository. > > > > > > Cheers, > > > > > > Konstantin > > > > > > [1] https://github.com/DiUS/java-faker > > > > > > > > > On Tue, Mar 24, 2020 at 9:10 AM Jingsong Li <[hidden email]> > > > wrote: > > > > > > > Hi all, > > > > > > > > I created https://issues.apache.org/jira/browse/FLINK-16743 for > > > follow-up > > > > discussion. FYI. > > > > > > > > Best, > > > > Jingsong Lee > > > > > > > > On Tue, Mar 24, 2020 at 2:20 PM Bowen Li <[hidden email]> > wrote: > > > > > > > > > I agree with Jingsong that sink schema inference and system tables > > can > > > be > > > > > considered later. I wouldn’t recommend to tackle them for the sake > of > > > > > simplifying user experience to the extreme. Providing the above > handy > > > > > source and sink implementations already offer users a ton of > > immediate > > > > > value. > > > > > > > > > > > > > > > On Mon, Mar 23, 2020 at 20:20 Jingsong Li <[hidden email]> > > > > wrote: > > > > > > > > > > > Hi Benchao, > > > > > > > > > > > > > do you think we need to add more columns with various types? > > > > > > > > > > > > I didn't list all types, but we should support primitive types, > > > > varchar, > > > > > > Decimal, Timestamp and etc... > > > > > > This can be done continuously. > > > > > > > > > > > > Hi Benchao, Jark, > > > > > > About console and blackhole, yes, they can have no schema, the > > schema > > > > can > > > > > > be inferred by upstream node. > > > > > > - But now we don't have this mechanism to do these configurable > > sink > > > > > > things. > > > > > > - If we want to support, we need a single way to support these > two > > > > sinks. > > > > > > - And uses can use "create table like" and others way to simplify > > > DDL. 
> > > > > > > > > > > > And for providing system/registered tables (`console` and > > > `blackhole`): > > > > > > - I have no strong opinion on these system tables. In SQL, will > be > > > > > "insert > > > > > > into blackhole select a /*int*/, b /*string*/ from tableA", > "insert > > > > into > > > > > > blackhole select a /*double*/, b /*Map*/, c /*string*/ from > > tableB". > > > It > > > > > > seems that Blackhole is a universal thing, which makes me feel > bad > > > > > > intuitively. > > > > > > - Can user override these tables? If can, we need ensure it can > be > > > > > > overwrite by catalog tables. > > > > > > > > > > > > So I think we can leave these system tables to future too. > > > > > > What do you think? > > > > > > > > > > > > Best, > > > > > > Jingsong Lee > > > > > > > > > > > > On Mon, Mar 23, 2020 at 4:44 PM Jark Wu <[hidden email]> > wrote: > > > > > > > > > > > > > Hi Jingsong, > > > > > > > > > > > > > > Regarding (2) and (3), I was thinking to ignore manually DDL > > work, > > > so > > > > > > users > > > > > > > can use them directly: > > > > > > > > > > > > > > # this will log results to `.out` files > > > > > > > INSERT INTO console > > > > > > > SELECT ... > > > > > > > > > > > > > > # this will drop all received records > > > > > > > INSERT INTO blackhole > > > > > > > SELECT ... > > > > > > > > > > > > > > Here `console` and `blackhole` are system sinks which is > similar > > to > > > > > > system > > > > > > > functions. > > > > > > > > > > > > > > Best, > > > > > > > Jark > > > > > > > > > > > > > > On Mon, 23 Mar 2020 at 16:33, Benchao Li <[hidden email]> > > > > wrote: > > > > > > > > > > > > > > > Hi Jingsong, > > > > > > > > > > > > > > > > Thanks for bring this up. Generally, it's a very good > proposal. > > > > > > > > > > > > > > > > About data gen source, do you think we need to add more > columns > > > > with > > > > > > > > various types? > > > > > > > > > > > > > > > > About print sink, do we need to specify the schema? > > > > > > > > > > > > > > > > Jingsong Li <[hidden email]> 于2020年3月23日周一 下午1:51写道: > > > > > > > > > > > > > > > > > Thanks Bowen, Jark and Dian for your feedback and > > suggestions. > > > > > > > > > > > > > > > > > > I reorganize with your suggestions, and try to expose DDLs: > > > > > > > > > > > > > > > > > > 1.datagen source: > > > > > > > > > - easy startup/test for streaming job > > > > > > > > > - performance testing > > > > > > > > > > > > > > > > > > DDL: > > > > > > > > > CREATE TABLE user ( > > > > > > > > > id BIGINT, > > > > > > > > > age INT, > > > > > > > > > description STRING > > > > > > > > > ) WITH ( > > > > > > > > > 'connector.type' = 'datagen', > > > > > > > > > 'connector.rows-per-second'='100', > > > > > > > > > 'connector.total-records'='1000000', > > > > > > > > > > > > > > > > > > 'schema.id.generator' = 'sequence', > > > > > > > > > 'schema.id.generator.start' = '1', > > > > > > > > > > > > > > > > > > 'schema.age.generator' = 'random', > > > > > > > > > 'schema.age.generator.min' = '0', > > > > > > > > > 'schema.age.generator.max' = '100', > > > > > > > > > > > > > > > > > > 'schema.description.generator' = 'random', > > > > > > > > > 'schema.description.generator.length' = '100' > > > > > > > > > ) > > > > > > > > > > > > > > > > > > Default is random generator. > > > > > > > > > Hi Jark, I don't want to bring complicated regularities, > > > because > > > > it > > > > > > can > > > > > > > > be > > > > > > > > > done through computed columns. 
And it is hard to define > > > > > > > > > standard regularities, I think we can leave it to the > future. > > > > > > > > > > > > > > > > > > 2.print sink: > > > > > > > > > - easy test for streaming job > > > > > > > > > - be very useful in production debugging > > > > > > > > > > > > > > > > > > DDL: > > > > > > > > > CREATE TABLE print_table ( > > > > > > > > > ... > > > > > > > > > ) WITH ( > > > > > > > > > 'connector.type' = 'print' > > > > > > > > > ) > > > > > > > > > > > > > > > > > > 3.blackhole sink > > > > > > > > > - very useful for high performance testing of Flink > > > > > > > > > - I've also run into users trying UDF to output, not sink, > so > > > > they > > > > > > need > > > > > > > > > this sink as well. > > > > > > > > > > > > > > > > > > DDL: > > > > > > > > > CREATE TABLE blackhole_table ( > > > > > > > > > ... > > > > > > > > > ) WITH ( > > > > > > > > > 'connector.type' = 'blackhole' > > > > > > > > > ) > > > > > > > > > > > > > > > > > > What do you think? > > > > > > > > > > > > > > > > > > Best, > > > > > > > > > Jingsong Lee > > > > > > > > > > > > > > > > > > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu < > > > [hidden email]> > > > > > > > wrote: > > > > > > > > > > > > > > > > > > > Thanks Jingsong for bringing up this discussion. +1 to > this > > > > > > > proposal. I > > > > > > > > > > think Bowen's proposal makes much sense to me. > > > > > > > > > > > > > > > > > > > > This is also a painful problem for PyFlink users. > Currently > > > > there > > > > > > is > > > > > > > no > > > > > > > > > > built-in easy-to-use table source/sink and it requires > > users > > > to > > > > > > > write a > > > > > > > > > lot > > > > > > > > > > of code to trying out PyFlink. This is especially painful > > for > > > > new > > > > > > > users > > > > > > > > > who > > > > > > > > > > are not familiar with PyFlink/Flink. I have also > > encountered > > > > the > > > > > > > > tedious > > > > > > > > > > process Bowen encountered, e.g. writing random source > > > > connector, > > > > > > > > > sink > > > > > > > > > > and also blackhole print sink as there are no built-in > ones > > > to > > > > > use. > > > > > > > > > > > > > > > > > > > > Regards, > > > > > > > > > > Dian > > > > > > > > > > > > > > > > > > > > > 在 2020年3月22日,上午11:24,Jark Wu <[hidden email]> 写道: > > > > > > > > > > > > > > > > > > > > > > +1 to Bowen's proposal. I also saw many requirements on > > > such > > > > > > > built-in > > > > > > > > > > > connectors. > > > > > > > > > > > > > > > > > > > > > > I will leave some my thoughts here: > > > > > > > > > > > > > > > > > > > > > >> 1. datagen source (random source) > > > > > > > > > > > I think we can merge the functinality of > sequence-source > > > into > > > > > > > random > > > > > > > > > > source > > > > > > > > > > > to allow users to custom their data values. > > > > > > > > > > > Flink can generate random data according to the field > > > types, > > > > > > users > > > > > > > > > > > can customize their values to be more domain specific, > > e.g. > > > > > > > > > > > 'field.user'='User_[1-9]{0,1}' > > > > > > > > > > > This will be similar to kafka-datagen-connect[1]. > > > > > > > > > > > > > > > > > > > > > >> 2. console sink (print sink) > > > > > > > > > > > This will be very useful in production debugging, to > > easily > > > > > > output > > > > > > > an > > > > > > > > > > > intermediate view or result view to a `.out` file. 
--
Best,
Jingsong Lee