Question about FLIP-66


Question about FLIP-66

Jungtaek Lim-2
Hi devs,

I'm interested in the new change in FLIP-66 [1], because if I understand
correctly, Flink hasn't had the event-time timestamp field (column) as part
of the "normal" schema so far, and FLIP-66 tries to change that.

That sounds like the column may be open to modification, like a rename (alias)
or other operations, or may even be dropped via projection. Will such
operations affect the event-time timestamp of the record? If you have an idea
of how Spark Structured Streaming works with watermarks, then you might catch
my point.

Maybe the question could be reworded as: does the definition of the event-time
timestamp column in the DDL apply only to the source definition, or does it
carry over the entire query, letting each operator treat that column as the
event-time timestamp? (SSS works as the latter.) I think this is a huge
difference; for me it's like stability vs. flexibility, and there are drawbacks
to the latter (there are drawbacks to the former as well, but computed columns
may cover them up).

Thanks in advance!
Jungtaek Lim (HeartSaVioR)

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-66%3A+Support+Time+Attribute+in+SQL+DDL
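
To make the question concrete, a small sketch (table and column names are
hypothetical, not from the FLIP), assuming orders declares order_time as its
event-time timestamp in the DDL:

    -- Does ts remain the event-time timestamp after the rename?
    SELECT order_id, order_time AS ts FROM orders;

    -- And is event time lost entirely once the column is projected away?
    SELECT order_id, price FROM orders;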

Re: Question about FLIP-66

Kurt Young
The current behavior is the latter. Flink gets the time attribute column from
the source table and tries to analyze and keep the time attribute column as
much as possible. E.g. a simple projection or filter which doesn't affect the
column will keep the time attribute, and a window aggregate will generate its
own time attribute if you select window_start or window_end. But you're right,
sometimes the framework will lose the information about the time attribute
column, and after that, some operations will throw an exception.

Best,
Kurt
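
A rough sketch of that propagation behavior, using hypothetical table/column
names and the group-window syntax (TUMBLE / TUMBLE_START / TUMBLE_END); assume
orders declares order_time as its event-time attribute:

    -- The inner projection + filter don't touch order_time, so the time
    -- attribute survives and the outer window aggregate can still use it.
    SELECT
      TUMBLE_START(order_time, INTERVAL '1' MINUTE) AS window_start,
      TUMBLE_END(order_time, INTERVAL '1' MINUTE)   AS window_end,
      COUNT(*)                                      AS cnt
    FROM (
      SELECT order_id, order_time FROM orders WHERE price > 10
    ) AS filtered_orders
    GROUP BY TUMBLE(order_time, INTERVAL '1' MINUTE);

    -- To carry a time attribute out of the aggregate (e.g. for a cascading
    -- window), select TUMBLE_ROWTIME(order_time, INTERVAL '1' MINUTE);
    -- unlike TUMBLE_END, its result is itself a rowtime attribute.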



Re: Question about FLIP-66

Jark Wu-2
Hi Jungtaek,

Kurt has said what I wanted to say. I will add some background.
Flink Table API & SQL only supports defining a processing-time attribute and an
event-time attribute (watermark) on the source; it does not support defining a
new one in a query.
The time attributes pass through the query, and time-based operations can only
be applied to the time attributes.

The reason why Flink Table & SQL only supports defining the watermark on the
source is that this allows us to do per-partition watermarks and source
idleness handling, and it simplifies things.
There is also some discussion about "disabling arbitrary watermark assigners in
the middle of a pipeline in DataStream" in the comments of that JIRA issue [1].

Best,
Jark

[1]: https://issues.apache.org/jira/browse/FLINK-11286
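
For reference, a minimal sketch of the DDL this refers to (table name, columns,
and connector options are made up and version-dependent): the watermark, and
with it the event-time attribute, is declared once on the source table, not in
the query.

    CREATE TABLE orders (
      order_id   BIGINT,
      price      DECIMAL(10, 2),
      order_time TIMESTAMP(3),
      -- computed column declaring a processing-time attribute
      proc_time  AS PROCTIME(),
      -- order_time becomes the event-time (rowtime) attribute of the table
      WATERMARK FOR order_time AS order_time - INTERVAL '5' SECOND
    ) WITH (
      -- connector options are illustrative only
      'connector' = 'kafka',
      'topic'     = 'orders',
      'format'    = 'json'
    );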



Re: Question about FLIP-66

Jungtaek Lim-2
Thanks Kurt and Jark for the detailed explanation! It helped a lot in
understanding FLIP-66.

That sounds like Flink won't leverage the timestamp in StreamRecord (which is
hidden and cannot be modified easily) and instead handles the time semantics
via the input schema of each operation, to unify the semantics between batch
and streaming. Did I understand it correctly?

I'm not familiar with the internals of Flink, so it's not easy for me to digest
the information in FLINK-11286, but in general I'd be supportive of defining
the watermark as close to the source as possible, as it's easier to reason
about. (I'm basically referring to the timestamp assigner rather than the
watermark assigner, though.)

- Jungtaek Lim


Re: Question about FLIP-66

Jark Wu-2
Hi Jungtaek,

Yes. Your understanding is correct :)

Best,
Jark


Re: Question about FLIP-66

Jungtaek Lim-2
Thanks for confirming! Honestly, I'm in favor of treating the timestamp field
as special and restricting its modification (the way the DataStream API does),
but I agree the new approach could be more natural for unifying the SQL
semantics between batch and streaming.

Thanks again,
Jungtaek Lim (HeartSaVioR)
