Hi everyone,
Currently, the data gen table source only supports a subset of Flink SQL types. One missing type in particular is TIMESTAMP(3). The reason, I suspect, it was not added originally is that it doesn't really make sense to have random timestamps. What you really want is for them to be ascending. In the use cases of data generation, users typically don't care about late data. The workaround proposed in the docs is to create your event time attribute using a computed column. CREATE TABLE t ( ts AS LOCALTIMESTAMP ) WITH ( 'connector' = 'datagen' ) The problem is that this does not play well with the LIKE clause. Many users do not create datagen backed tables from scratch but using the LIKE clause to shadow a physical table in their catalog - such as Kafka. The problem is the LIKE clause does not allow redefining columns so there is no way to do this for a table with an event time attribute. The below will fail. CREATE TABLE Orders ( order_id BIGINT, order_time TIMESTAMP(3) quantity INT, cost AS price * quantity, WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND, PRIMARY KEY (order_id) NOT ENFORCED ) WITH ( 'connector' = 'kafka', 'topic' = 'orders', 'properties.bootstrap.servers' = 'localhost:9092', 'properties.group.id' = 'orderGroup', 'format' = 'csv' ) CREATE TEMPORARY TABLE Orders WITH ( 'connector' = 'datagen' ) LIKE Orders (EXCLUDING ALL) I see two solutions to this and would like to hear what people think. 1) Support TIMESTAMP in datagen tables but always supply strictly ascending timestamps. The above would now "just work". This semantic makes sense given the way event time attributes are used in streaming applications and we can clearly document the behavior. 2) Relax the constraints of the LIKE clause to allow overriding physical columns with computed columns. This would make it clearer to the user what is happening but would require substantially higher development effort and I don't know if this feature would add value beyond this one use case. In practice, this would allow the following. Please let me know what you think. CREATE TEMPORARY TABLE Orders ( order_time AS LOCALTIMESTAMP ) WITH ( 'connector' = 'datagen' ) LIKE Orders (EXCLUDING ALL) Seth |
Hi Seth,
Thanks for bringing up this topic. I think the second approach is a more generic solution. Other connectors can also benefit from this. We also keep the flexibility for generating random timestamps for some scenarios. Best, Godfrey Seth Wiesman <[hidden email]> 于2020年7月24日周五 下午11:30写道: > Hi everyone, > > Currently, the data gen table source only supports a subset of Flink SQL > types. One missing type in particular is TIMESTAMP(3). The reason, I > suspect, it was not added originally is that it doesn't really make sense > to have random timestamps. What you really want is for them to be > ascending. In the use cases of data generation, users typically don't care > about late data. The workaround proposed in the docs is to create your > event time attribute using a computed column. > > CREATE TABLE t ( > ts AS LOCALTIMESTAMP > ) WITH ( > 'connector' = 'datagen' > ) > > The problem is that this does not play well with the LIKE clause. Many > users do not create datagen backed tables from scratch but using the LIKE > clause to shadow a physical table in their catalog - such as Kafka. > > The problem is the LIKE clause does not allow redefining columns so there > is no way to do this for a table with an event time attribute. The below > will fail. > > CREATE TABLE Orders ( > order_id BIGINT, > order_time TIMESTAMP(3) > quantity INT, > cost AS price * quantity, > WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND, > PRIMARY KEY (order_id) NOT ENFORCED > ) WITH ( > 'connector' = 'kafka', > 'topic' = 'orders', > 'properties.bootstrap.servers' = 'localhost:9092', > 'properties.group.id' = 'orderGroup', > 'format' = 'csv' > ) > > CREATE TEMPORARY TABLE Orders WITH ( > 'connector' = 'datagen' > ) LIKE Orders (EXCLUDING ALL) > > > I see two solutions to this and would like to hear what people think. > > 1) Support TIMESTAMP in datagen tables but always supply strictly ascending > timestamps. The above would now "just work". This semantic makes sense > given the way event time attributes are used in streaming applications and > we can clearly document the behavior. > > 2) Relax the constraints of the LIKE clause to allow overriding physical > columns with computed columns. This would make it clearer to the user what > is happening but would require substantially higher development effort and > I don't know if this feature would add value beyond this one use case. In > practice, this would allow the following. > > Please let me know what you think. > CREATE TEMPORARY TABLE Orders ( > order_time AS LOCALTIMESTAMP > ) WITH ( > 'connector' = 'datagen' > ) LIKE Orders (EXCLUDING ALL) > > Seth > |
+1 to option#1.
I think it makes sense to enhance the datagen connector. In this case, I think we can support the default TIMESTAMP generation strategy as "sequence" with an optional start point. This strategy can be changed to "constant", "random", or others. This would be really helpful and cool if we can support this, and that's why I prefer #1 than #2: CREATE TEMPORARY TABLE Orders WITH ( 'connector' = 'datagen' ) LIKE Orders (EXCLUDING ALL) Regarding #2, if we want to extend the LIKE clause, maybe we can add an "OVERWRITING COLUMNS" like option. But unless we have other strong use cases for this, otherwise, I think this makes things complicated (the current like options are already puzzling). Maybe @Dawid Wysakowicz <[hidden email]> has more thoughts on this. Best, Jark On Mon, 27 Jul 2020 at 10:50, godfrey he <[hidden email]> wrote: > Hi Seth, > Thanks for bringing up this topic. > > I think the second approach is a more generic solution. > Other connectors can also benefit from this. > We also keep the flexibility for generating random timestamps for some > scenarios. > > Best, > Godfrey > > Seth Wiesman <[hidden email]> 于2020年7月24日周五 下午11:30写道: > > > Hi everyone, > > > > Currently, the data gen table source only supports a subset of Flink SQL > > types. One missing type in particular is TIMESTAMP(3). The reason, I > > suspect, it was not added originally is that it doesn't really make sense > > to have random timestamps. What you really want is for them to be > > ascending. In the use cases of data generation, users typically don't > care > > about late data. The workaround proposed in the docs is to create your > > event time attribute using a computed column. > > > > CREATE TABLE t ( > > ts AS LOCALTIMESTAMP > > ) WITH ( > > 'connector' = 'datagen' > > ) > > > > The problem is that this does not play well with the LIKE clause. Many > > users do not create datagen backed tables from scratch but using the LIKE > > clause to shadow a physical table in their catalog - such as Kafka. > > > > The problem is the LIKE clause does not allow redefining columns so there > > is no way to do this for a table with an event time attribute. The below > > will fail. > > > > CREATE TABLE Orders ( > > order_id BIGINT, > > order_time TIMESTAMP(3) > > quantity INT, > > cost AS price * quantity, > > WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND, > > PRIMARY KEY (order_id) NOT ENFORCED > > ) WITH ( > > 'connector' = 'kafka', > > 'topic' = 'orders', > > 'properties.bootstrap.servers' = 'localhost:9092', > > 'properties.group.id' = 'orderGroup', > > 'format' = 'csv' > > ) > > > > CREATE TEMPORARY TABLE Orders WITH ( > > 'connector' = 'datagen' > > ) LIKE Orders (EXCLUDING ALL) > > > > > > I see two solutions to this and would like to hear what people think. > > > > 1) Support TIMESTAMP in datagen tables but always supply strictly > ascending > > timestamps. The above would now "just work". This semantic makes sense > > given the way event time attributes are used in streaming applications > and > > we can clearly document the behavior. > > > > 2) Relax the constraints of the LIKE clause to allow overriding physical > > columns with computed columns. This would make it clearer to the user > what > > is happening but would require substantially higher development effort > and > > I don't know if this feature would add value beyond this one use case. In > > practice, this would allow the following. > > > > Please let me know what you think. > > CREATE TEMPORARY TABLE Orders ( > > order_time AS LOCALTIMESTAMP > > ) WITH ( > > 'connector' = 'datagen' > > ) LIKE Orders (EXCLUDING ALL) > > > > Seth > > > |
Hi everyone,
I have the feeling that #1 suffices for the vast majority of use cases. Cheers, Konstantin On Mon, Jul 27, 2020 at 6:36 AM Jark Wu <[hidden email]> wrote: > +1 to option#1. > > I think it makes sense to enhance the datagen connector. > In this case, I think we can support the default TIMESTAMP generation > strategy as "sequence" with an optional start point. > This strategy can be changed to "constant", "random", or others. > This would be really helpful and cool if we can support this, and that's > why I prefer #1 than #2: > > CREATE TEMPORARY TABLE Orders WITH ( > 'connector' = 'datagen' > ) LIKE Orders (EXCLUDING ALL) > > Regarding #2, if we want to extend the LIKE clause, maybe we can add an > "OVERWRITING COLUMNS" like option. > But unless we have other strong use cases for this, otherwise, I think this > makes things complicated (the current like options are already puzzling). > Maybe @Dawid Wysakowicz <[hidden email]> has more thoughts on > this. > > > Best, > Jark > > On Mon, 27 Jul 2020 at 10:50, godfrey he <[hidden email]> wrote: > > > Hi Seth, > > Thanks for bringing up this topic. > > > > I think the second approach is a more generic solution. > > Other connectors can also benefit from this. > > We also keep the flexibility for generating random timestamps for some > > scenarios. > > > > Best, > > Godfrey > > > > Seth Wiesman <[hidden email]> 于2020年7月24日周五 下午11:30写道: > > > > > Hi everyone, > > > > > > Currently, the data gen table source only supports a subset of Flink > SQL > > > types. One missing type in particular is TIMESTAMP(3). The reason, I > > > suspect, it was not added originally is that it doesn't really make > sense > > > to have random timestamps. What you really want is for them to be > > > ascending. In the use cases of data generation, users typically don't > > care > > > about late data. The workaround proposed in the docs is to create your > > > event time attribute using a computed column. > > > > > > CREATE TABLE t ( > > > ts AS LOCALTIMESTAMP > > > ) WITH ( > > > 'connector' = 'datagen' > > > ) > > > > > > The problem is that this does not play well with the LIKE clause. Many > > > users do not create datagen backed tables from scratch but using the > LIKE > > > clause to shadow a physical table in their catalog - such as Kafka. > > > > > > The problem is the LIKE clause does not allow redefining columns so > there > > > is no way to do this for a table with an event time attribute. The > below > > > will fail. > > > > > > CREATE TABLE Orders ( > > > order_id BIGINT, > > > order_time TIMESTAMP(3) > > > quantity INT, > > > cost AS price * quantity, > > > WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND, > > > PRIMARY KEY (order_id) NOT ENFORCED > > > ) WITH ( > > > 'connector' = 'kafka', > > > 'topic' = 'orders', > > > 'properties.bootstrap.servers' = 'localhost:9092', > > > 'properties.group.id' = 'orderGroup', > > > 'format' = 'csv' > > > ) > > > > > > CREATE TEMPORARY TABLE Orders WITH ( > > > 'connector' = 'datagen' > > > ) LIKE Orders (EXCLUDING ALL) > > > > > > > > > I see two solutions to this and would like to hear what people think. > > > > > > 1) Support TIMESTAMP in datagen tables but always supply strictly > > ascending > > > timestamps. The above would now "just work". This semantic makes sense > > > given the way event time attributes are used in streaming applications > > and > > > we can clearly document the behavior. > > > > > > 2) Relax the constraints of the LIKE clause to allow overriding > physical > > > columns with computed columns. This would make it clearer to the user > > what > > > is happening but would require substantially higher development effort > > and > > > I don't know if this feature would add value beyond this one use case. > In > > > practice, this would allow the following. > > > > > > Please let me know what you think. > > > CREATE TEMPORARY TABLE Orders ( > > > order_time AS LOCALTIMESTAMP > > > ) WITH ( > > > 'connector' = 'datagen' > > > ) LIKE Orders (EXCLUDING ALL) > > > > > > Seth > > > > > > -- Konstantin Knauf https://twitter.com/snntrable https://github.com/knaufk |
Ok, let's go with option 1. This doesn't prevent up from adding option 2
latter if necessary. I've opened a ticket: https://issues.apache.org/jira/browse/FLINK-18735 On Mon, Jul 27, 2020 at 12:05 AM Konstantin Knauf <[hidden email]> wrote: > Hi everyone, > > I have the feeling that #1 suffices for the vast majority of use cases. > > Cheers, > > Konstantin > > On Mon, Jul 27, 2020 at 6:36 AM Jark Wu <[hidden email]> wrote: > > > +1 to option#1. > > > > I think it makes sense to enhance the datagen connector. > > In this case, I think we can support the default TIMESTAMP generation > > strategy as "sequence" with an optional start point. > > This strategy can be changed to "constant", "random", or others. > > This would be really helpful and cool if we can support this, and that's > > why I prefer #1 than #2: > > > > CREATE TEMPORARY TABLE Orders WITH ( > > 'connector' = 'datagen' > > ) LIKE Orders (EXCLUDING ALL) > > > > Regarding #2, if we want to extend the LIKE clause, maybe we can add an > > "OVERWRITING COLUMNS" like option. > > But unless we have other strong use cases for this, otherwise, I think > this > > makes things complicated (the current like options are already puzzling). > > Maybe @Dawid Wysakowicz <[hidden email]> has more thoughts on > > this. > > > > > > Best, > > Jark > > > > On Mon, 27 Jul 2020 at 10:50, godfrey he <[hidden email]> wrote: > > > > > Hi Seth, > > > Thanks for bringing up this topic. > > > > > > I think the second approach is a more generic solution. > > > Other connectors can also benefit from this. > > > We also keep the flexibility for generating random timestamps for some > > > scenarios. > > > > > > Best, > > > Godfrey > > > > > > Seth Wiesman <[hidden email]> 于2020年7月24日周五 下午11:30写道: > > > > > > > Hi everyone, > > > > > > > > Currently, the data gen table source only supports a subset of Flink > > SQL > > > > types. One missing type in particular is TIMESTAMP(3). The reason, I > > > > suspect, it was not added originally is that it doesn't really make > > sense > > > > to have random timestamps. What you really want is for them to be > > > > ascending. In the use cases of data generation, users typically don't > > > care > > > > about late data. The workaround proposed in the docs is to create > your > > > > event time attribute using a computed column. > > > > > > > > CREATE TABLE t ( > > > > ts AS LOCALTIMESTAMP > > > > ) WITH ( > > > > 'connector' = 'datagen' > > > > ) > > > > > > > > The problem is that this does not play well with the LIKE clause. > Many > > > > users do not create datagen backed tables from scratch but using the > > LIKE > > > > clause to shadow a physical table in their catalog - such as Kafka. > > > > > > > > The problem is the LIKE clause does not allow redefining columns so > > there > > > > is no way to do this for a table with an event time attribute. The > > below > > > > will fail. > > > > > > > > CREATE TABLE Orders ( > > > > order_id BIGINT, > > > > order_time TIMESTAMP(3) > > > > quantity INT, > > > > cost AS price * quantity, > > > > WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND, > > > > PRIMARY KEY (order_id) NOT ENFORCED > > > > ) WITH ( > > > > 'connector' = 'kafka', > > > > 'topic' = 'orders', > > > > 'properties.bootstrap.servers' = 'localhost:9092', > > > > 'properties.group.id' = 'orderGroup', > > > > 'format' = 'csv' > > > > ) > > > > > > > > CREATE TEMPORARY TABLE Orders WITH ( > > > > 'connector' = 'datagen' > > > > ) LIKE Orders (EXCLUDING ALL) > > > > > > > > > > > > I see two solutions to this and would like to hear what people think. > > > > > > > > 1) Support TIMESTAMP in datagen tables but always supply strictly > > > ascending > > > > timestamps. The above would now "just work". This semantic makes > sense > > > > given the way event time attributes are used in streaming > applications > > > and > > > > we can clearly document the behavior. > > > > > > > > 2) Relax the constraints of the LIKE clause to allow overriding > > physical > > > > columns with computed columns. This would make it clearer to the user > > > what > > > > is happening but would require substantially higher development > effort > > > and > > > > I don't know if this feature would add value beyond this one use > case. > > In > > > > practice, this would allow the following. > > > > > > > > Please let me know what you think. > > > > CREATE TEMPORARY TABLE Orders ( > > > > order_time AS LOCALTIMESTAMP > > > > ) WITH ( > > > > 'connector' = 'datagen' > > > > ) LIKE Orders (EXCLUDING ALL) > > > > > > > > Seth > > > > > > > > > > > > -- > > Konstantin Knauf > > https://twitter.com/snntrable > > https://github.com/knaufk > |
Free forum by Nabble | Edit this page |