[Discuss] Handling of Timestamp in DataGen table created via LIKE

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

[Discuss] Handling of Timestamp in DataGen table created via LIKE

Seth Wiesman-4
Hi everyone,

Currently, the data gen table source only supports a subset of Flink SQL
types. One missing type in particular is TIMESTAMP(3). The reason, I
suspect, it was not added originally is that it doesn't really make sense
to have random timestamps. What you really want is for them to be
ascending. In the use cases of data generation, users typically don't care
about late data. The workaround proposed in the docs is to create your
event time attribute using a computed column.

CREATE TABLE t (
    ts AS LOCALTIMESTAMP
) WITH (
    'connector' = 'datagen'
)

The problem is that this does not play well with the LIKE clause. Many
users do not create datagen backed tables from scratch but using the LIKE
clause to shadow a physical table in their catalog - such as Kafka.

The problem is the LIKE clause does not allow redefining columns so there
is no way to do this for a table with an event time attribute. The below
will fail.

CREATE TABLE Orders (
    order_id   BIGINT,
    order_time TIMESTAMP(3)
    quantity   INT,
    cost       AS price * quantity,
    WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND,
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'connector' = 'kafka',
    'topic' = 'orders',
    'properties.bootstrap.servers' = 'localhost:9092',
    'properties.group.id' = 'orderGroup',
    'format' = 'csv'
)

CREATE TEMPORARY TABLE Orders WITH (
    'connector' = 'datagen'
) LIKE Orders (EXCLUDING ALL)


I see two solutions to this and would like to hear what people think.

1) Support TIMESTAMP in datagen tables but always supply strictly ascending
timestamps. The above would now "just work". This semantic makes sense
given the way event time attributes are used in streaming applications and
we can clearly document the behavior.

2) Relax the constraints of the LIKE clause to allow overriding physical
columns with computed columns. This would make it clearer to the user what
is happening but would require substantially higher development effort and
I don't know if this feature would add value beyond this one use case. In
practice, this would allow the following.

Please let me know what you think.
CREATE TEMPORARY TABLE Orders (
    order_time AS LOCALTIMESTAMP
) WITH (
     'connector' = 'datagen'
) LIKE Orders (EXCLUDING ALL)

Seth
Reply | Threaded
Open this post in threaded view
|

Re: [Discuss] Handling of Timestamp in DataGen table created via LIKE

godfreyhe
Hi Seth,
Thanks for bringing up this topic.

I think the second approach is a more generic solution.
Other connectors can also benefit from this.
We also keep the flexibility for generating random timestamps for some
scenarios.

Best,
Godfrey

Seth Wiesman <[hidden email]> 于2020年7月24日周五 下午11:30写道:

> Hi everyone,
>
> Currently, the data gen table source only supports a subset of Flink SQL
> types. One missing type in particular is TIMESTAMP(3). The reason, I
> suspect, it was not added originally is that it doesn't really make sense
> to have random timestamps. What you really want is for them to be
> ascending. In the use cases of data generation, users typically don't care
> about late data. The workaround proposed in the docs is to create your
> event time attribute using a computed column.
>
> CREATE TABLE t (
>     ts AS LOCALTIMESTAMP
> ) WITH (
>     'connector' = 'datagen'
> )
>
> The problem is that this does not play well with the LIKE clause. Many
> users do not create datagen backed tables from scratch but using the LIKE
> clause to shadow a physical table in their catalog - such as Kafka.
>
> The problem is the LIKE clause does not allow redefining columns so there
> is no way to do this for a table with an event time attribute. The below
> will fail.
>
> CREATE TABLE Orders (
>     order_id   BIGINT,
>     order_time TIMESTAMP(3)
>     quantity   INT,
>     cost       AS price * quantity,
>     WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND,
>     PRIMARY KEY (order_id) NOT ENFORCED
> ) WITH (
>     'connector' = 'kafka',
>     'topic' = 'orders',
>     'properties.bootstrap.servers' = 'localhost:9092',
>     'properties.group.id' = 'orderGroup',
>     'format' = 'csv'
> )
>
> CREATE TEMPORARY TABLE Orders WITH (
>     'connector' = 'datagen'
> ) LIKE Orders (EXCLUDING ALL)
>
>
> I see two solutions to this and would like to hear what people think.
>
> 1) Support TIMESTAMP in datagen tables but always supply strictly ascending
> timestamps. The above would now "just work". This semantic makes sense
> given the way event time attributes are used in streaming applications and
> we can clearly document the behavior.
>
> 2) Relax the constraints of the LIKE clause to allow overriding physical
> columns with computed columns. This would make it clearer to the user what
> is happening but would require substantially higher development effort and
> I don't know if this feature would add value beyond this one use case. In
> practice, this would allow the following.
>
> Please let me know what you think.
> CREATE TEMPORARY TABLE Orders (
>     order_time AS LOCALTIMESTAMP
> ) WITH (
>      'connector' = 'datagen'
> ) LIKE Orders (EXCLUDING ALL)
>
> Seth
>
Reply | Threaded
Open this post in threaded view
|

Re: [Discuss] Handling of Timestamp in DataGen table created via LIKE

Jark Wu-2
+1 to option#1.

I think it makes sense to enhance the datagen connector.
In this case, I think we can support the default TIMESTAMP generation
strategy as "sequence" with an optional start point.
This strategy can be changed to "constant", "random", or others.
This would be really helpful and cool if we can support this, and that's
why I prefer #1 than #2:

CREATE TEMPORARY TABLE Orders WITH (
    'connector' = 'datagen'
) LIKE Orders (EXCLUDING ALL)

Regarding #2, if we want to extend the LIKE clause, maybe we can add an
"OVERWRITING COLUMNS" like option.
But unless we have other strong use cases for this, otherwise, I think this
makes things complicated (the current like options are already puzzling).
Maybe @Dawid Wysakowicz <[hidden email]>  has more thoughts on
this.


Best,
Jark

On Mon, 27 Jul 2020 at 10:50, godfrey he <[hidden email]> wrote:

> Hi Seth,
> Thanks for bringing up this topic.
>
> I think the second approach is a more generic solution.
> Other connectors can also benefit from this.
> We also keep the flexibility for generating random timestamps for some
> scenarios.
>
> Best,
> Godfrey
>
> Seth Wiesman <[hidden email]> 于2020年7月24日周五 下午11:30写道:
>
> > Hi everyone,
> >
> > Currently, the data gen table source only supports a subset of Flink SQL
> > types. One missing type in particular is TIMESTAMP(3). The reason, I
> > suspect, it was not added originally is that it doesn't really make sense
> > to have random timestamps. What you really want is for them to be
> > ascending. In the use cases of data generation, users typically don't
> care
> > about late data. The workaround proposed in the docs is to create your
> > event time attribute using a computed column.
> >
> > CREATE TABLE t (
> >     ts AS LOCALTIMESTAMP
> > ) WITH (
> >     'connector' = 'datagen'
> > )
> >
> > The problem is that this does not play well with the LIKE clause. Many
> > users do not create datagen backed tables from scratch but using the LIKE
> > clause to shadow a physical table in their catalog - such as Kafka.
> >
> > The problem is the LIKE clause does not allow redefining columns so there
> > is no way to do this for a table with an event time attribute. The below
> > will fail.
> >
> > CREATE TABLE Orders (
> >     order_id   BIGINT,
> >     order_time TIMESTAMP(3)
> >     quantity   INT,
> >     cost       AS price * quantity,
> >     WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND,
> >     PRIMARY KEY (order_id) NOT ENFORCED
> > ) WITH (
> >     'connector' = 'kafka',
> >     'topic' = 'orders',
> >     'properties.bootstrap.servers' = 'localhost:9092',
> >     'properties.group.id' = 'orderGroup',
> >     'format' = 'csv'
> > )
> >
> > CREATE TEMPORARY TABLE Orders WITH (
> >     'connector' = 'datagen'
> > ) LIKE Orders (EXCLUDING ALL)
> >
> >
> > I see two solutions to this and would like to hear what people think.
> >
> > 1) Support TIMESTAMP in datagen tables but always supply strictly
> ascending
> > timestamps. The above would now "just work". This semantic makes sense
> > given the way event time attributes are used in streaming applications
> and
> > we can clearly document the behavior.
> >
> > 2) Relax the constraints of the LIKE clause to allow overriding physical
> > columns with computed columns. This would make it clearer to the user
> what
> > is happening but would require substantially higher development effort
> and
> > I don't know if this feature would add value beyond this one use case. In
> > practice, this would allow the following.
> >
> > Please let me know what you think.
> > CREATE TEMPORARY TABLE Orders (
> >     order_time AS LOCALTIMESTAMP
> > ) WITH (
> >      'connector' = 'datagen'
> > ) LIKE Orders (EXCLUDING ALL)
> >
> > Seth
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [Discuss] Handling of Timestamp in DataGen table created via LIKE

Konstantin Knauf-4
Hi everyone,

I have the feeling that #1 suffices for the vast majority of use cases.

Cheers,

Konstantin

On Mon, Jul 27, 2020 at 6:36 AM Jark Wu <[hidden email]> wrote:

> +1 to option#1.
>
> I think it makes sense to enhance the datagen connector.
> In this case, I think we can support the default TIMESTAMP generation
> strategy as "sequence" with an optional start point.
> This strategy can be changed to "constant", "random", or others.
> This would be really helpful and cool if we can support this, and that's
> why I prefer #1 than #2:
>
> CREATE TEMPORARY TABLE Orders WITH (
>     'connector' = 'datagen'
> ) LIKE Orders (EXCLUDING ALL)
>
> Regarding #2, if we want to extend the LIKE clause, maybe we can add an
> "OVERWRITING COLUMNS" like option.
> But unless we have other strong use cases for this, otherwise, I think this
> makes things complicated (the current like options are already puzzling).
> Maybe @Dawid Wysakowicz <[hidden email]>  has more thoughts on
> this.
>
>
> Best,
> Jark
>
> On Mon, 27 Jul 2020 at 10:50, godfrey he <[hidden email]> wrote:
>
> > Hi Seth,
> > Thanks for bringing up this topic.
> >
> > I think the second approach is a more generic solution.
> > Other connectors can also benefit from this.
> > We also keep the flexibility for generating random timestamps for some
> > scenarios.
> >
> > Best,
> > Godfrey
> >
> > Seth Wiesman <[hidden email]> 于2020年7月24日周五 下午11:30写道:
> >
> > > Hi everyone,
> > >
> > > Currently, the data gen table source only supports a subset of Flink
> SQL
> > > types. One missing type in particular is TIMESTAMP(3). The reason, I
> > > suspect, it was not added originally is that it doesn't really make
> sense
> > > to have random timestamps. What you really want is for them to be
> > > ascending. In the use cases of data generation, users typically don't
> > care
> > > about late data. The workaround proposed in the docs is to create your
> > > event time attribute using a computed column.
> > >
> > > CREATE TABLE t (
> > >     ts AS LOCALTIMESTAMP
> > > ) WITH (
> > >     'connector' = 'datagen'
> > > )
> > >
> > > The problem is that this does not play well with the LIKE clause. Many
> > > users do not create datagen backed tables from scratch but using the
> LIKE
> > > clause to shadow a physical table in their catalog - such as Kafka.
> > >
> > > The problem is the LIKE clause does not allow redefining columns so
> there
> > > is no way to do this for a table with an event time attribute. The
> below
> > > will fail.
> > >
> > > CREATE TABLE Orders (
> > >     order_id   BIGINT,
> > >     order_time TIMESTAMP(3)
> > >     quantity   INT,
> > >     cost       AS price * quantity,
> > >     WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND,
> > >     PRIMARY KEY (order_id) NOT ENFORCED
> > > ) WITH (
> > >     'connector' = 'kafka',
> > >     'topic' = 'orders',
> > >     'properties.bootstrap.servers' = 'localhost:9092',
> > >     'properties.group.id' = 'orderGroup',
> > >     'format' = 'csv'
> > > )
> > >
> > > CREATE TEMPORARY TABLE Orders WITH (
> > >     'connector' = 'datagen'
> > > ) LIKE Orders (EXCLUDING ALL)
> > >
> > >
> > > I see two solutions to this and would like to hear what people think.
> > >
> > > 1) Support TIMESTAMP in datagen tables but always supply strictly
> > ascending
> > > timestamps. The above would now "just work". This semantic makes sense
> > > given the way event time attributes are used in streaming applications
> > and
> > > we can clearly document the behavior.
> > >
> > > 2) Relax the constraints of the LIKE clause to allow overriding
> physical
> > > columns with computed columns. This would make it clearer to the user
> > what
> > > is happening but would require substantially higher development effort
> > and
> > > I don't know if this feature would add value beyond this one use case.
> In
> > > practice, this would allow the following.
> > >
> > > Please let me know what you think.
> > > CREATE TEMPORARY TABLE Orders (
> > >     order_time AS LOCALTIMESTAMP
> > > ) WITH (
> > >      'connector' = 'datagen'
> > > ) LIKE Orders (EXCLUDING ALL)
> > >
> > > Seth
> > >
> >
>


--

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk
Reply | Threaded
Open this post in threaded view
|

Re: [Discuss] Handling of Timestamp in DataGen table created via LIKE

Seth Wiesman-4
Ok, let's go with option 1. This doesn't prevent up from adding option 2
latter if necessary.

I've opened a ticket: https://issues.apache.org/jira/browse/FLINK-18735

On Mon, Jul 27, 2020 at 12:05 AM Konstantin Knauf <[hidden email]> wrote:

> Hi everyone,
>
> I have the feeling that #1 suffices for the vast majority of use cases.
>
> Cheers,
>
> Konstantin
>
> On Mon, Jul 27, 2020 at 6:36 AM Jark Wu <[hidden email]> wrote:
>
> > +1 to option#1.
> >
> > I think it makes sense to enhance the datagen connector.
> > In this case, I think we can support the default TIMESTAMP generation
> > strategy as "sequence" with an optional start point.
> > This strategy can be changed to "constant", "random", or others.
> > This would be really helpful and cool if we can support this, and that's
> > why I prefer #1 than #2:
> >
> > CREATE TEMPORARY TABLE Orders WITH (
> >     'connector' = 'datagen'
> > ) LIKE Orders (EXCLUDING ALL)
> >
> > Regarding #2, if we want to extend the LIKE clause, maybe we can add an
> > "OVERWRITING COLUMNS" like option.
> > But unless we have other strong use cases for this, otherwise, I think
> this
> > makes things complicated (the current like options are already puzzling).
> > Maybe @Dawid Wysakowicz <[hidden email]>  has more thoughts on
> > this.
> >
> >
> > Best,
> > Jark
> >
> > On Mon, 27 Jul 2020 at 10:50, godfrey he <[hidden email]> wrote:
> >
> > > Hi Seth,
> > > Thanks for bringing up this topic.
> > >
> > > I think the second approach is a more generic solution.
> > > Other connectors can also benefit from this.
> > > We also keep the flexibility for generating random timestamps for some
> > > scenarios.
> > >
> > > Best,
> > > Godfrey
> > >
> > > Seth Wiesman <[hidden email]> 于2020年7月24日周五 下午11:30写道:
> > >
> > > > Hi everyone,
> > > >
> > > > Currently, the data gen table source only supports a subset of Flink
> > SQL
> > > > types. One missing type in particular is TIMESTAMP(3). The reason, I
> > > > suspect, it was not added originally is that it doesn't really make
> > sense
> > > > to have random timestamps. What you really want is for them to be
> > > > ascending. In the use cases of data generation, users typically don't
> > > care
> > > > about late data. The workaround proposed in the docs is to create
> your
> > > > event time attribute using a computed column.
> > > >
> > > > CREATE TABLE t (
> > > >     ts AS LOCALTIMESTAMP
> > > > ) WITH (
> > > >     'connector' = 'datagen'
> > > > )
> > > >
> > > > The problem is that this does not play well with the LIKE clause.
> Many
> > > > users do not create datagen backed tables from scratch but using the
> > LIKE
> > > > clause to shadow a physical table in their catalog - such as Kafka.
> > > >
> > > > The problem is the LIKE clause does not allow redefining columns so
> > there
> > > > is no way to do this for a table with an event time attribute. The
> > below
> > > > will fail.
> > > >
> > > > CREATE TABLE Orders (
> > > >     order_id   BIGINT,
> > > >     order_time TIMESTAMP(3)
> > > >     quantity   INT,
> > > >     cost       AS price * quantity,
> > > >     WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND,
> > > >     PRIMARY KEY (order_id) NOT ENFORCED
> > > > ) WITH (
> > > >     'connector' = 'kafka',
> > > >     'topic' = 'orders',
> > > >     'properties.bootstrap.servers' = 'localhost:9092',
> > > >     'properties.group.id' = 'orderGroup',
> > > >     'format' = 'csv'
> > > > )
> > > >
> > > > CREATE TEMPORARY TABLE Orders WITH (
> > > >     'connector' = 'datagen'
> > > > ) LIKE Orders (EXCLUDING ALL)
> > > >
> > > >
> > > > I see two solutions to this and would like to hear what people think.
> > > >
> > > > 1) Support TIMESTAMP in datagen tables but always supply strictly
> > > ascending
> > > > timestamps. The above would now "just work". This semantic makes
> sense
> > > > given the way event time attributes are used in streaming
> applications
> > > and
> > > > we can clearly document the behavior.
> > > >
> > > > 2) Relax the constraints of the LIKE clause to allow overriding
> > physical
> > > > columns with computed columns. This would make it clearer to the user
> > > what
> > > > is happening but would require substantially higher development
> effort
> > > and
> > > > I don't know if this feature would add value beyond this one use
> case.
> > In
> > > > practice, this would allow the following.
> > > >
> > > > Please let me know what you think.
> > > > CREATE TEMPORARY TABLE Orders (
> > > >     order_time AS LOCALTIMESTAMP
> > > > ) WITH (
> > > >      'connector' = 'datagen'
> > > > ) LIKE Orders (EXCLUDING ALL)
> > > >
> > > > Seth
> > > >
> > >
> >
>
>
> --
>
> Konstantin Knauf
>
> https://twitter.com/snntrable
>
> https://github.com/knaufk
>