Hi devs,
I'm interested in the new change in FLIP-66 [1], because if I understand correctly, Flink so far hasn't exposed the event-time timestamp field (column) as part of the "normal" schema, and FLIP-66 tries to change that.

That suggests the column may be open for modification, like rename (alias) or other operations, or may even be dropped via projection. Will such operations affect the event-time timestamp of the record? If you have an idea of how Spark Structured Streaming works with watermarks, then you might catch the point.

Maybe the question could be reworded as: does the definition of the event-time timestamp column in DDL apply only to the source definition, or does it carry over the entire query and let each operator treat that column as the event-time timestamp? (SSS works as the latter.) I think this is a huge difference; to me it's stability vs. flexibility, and there are drawbacks to the latter (there are drawbacks to the former as well, but computed columns may cover them).

Thanks in advance!
Jungtaek Lim (HeartSaVioR)

1. https://cwiki.apache.org/confluence/display/FLINK/FLIP-66%3A+Support+Time+Attribute+in+SQL+DDL
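For concreteness, here is a minimal sketch of the kind of DDL FLIP-66 proposes, where the event-time attribute is declared as a regular column plus a WATERMARK clause. The table name, columns, and connector options below are made up for illustration, not taken from the FLIP:

    CREATE TABLE user_actions (
        user_name STRING,
        action    STRING,
        ts        TIMESTAMP(3),
        -- declare ts as the event-time attribute, with a 5-second
        -- bounded-out-of-orderness watermark
        WATERMARK FOR ts AS ts - INTERVAL '5' SECOND
    ) WITH (
        -- connector options omitted; they depend on the actual source
        'connector' = '...'
    );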
The current behavior is the latter. Flink gets the time attribute column from the source table and tries to analyze and keep the time attribute column as much as possible; e.g. a simple projection or filter which doesn't affect the column will keep the time attribute, and a window aggregate will generate its own time attribute if you select window_start or window_end. But you're right, sometimes the framework will lose the information about the time attribute column, and after that, some operations will throw an exception.

Best,
Kurt
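To illustrate the behavior Kurt describes, a rough sketch against the hypothetical user_actions table above, using the legacy group-window functions; exact function names and behavior may differ between Flink versions:

    -- A simple projection/filter that doesn't touch ts keeps the
    -- event-time attribute for downstream operations.
    SELECT user_name, ts
    FROM user_actions
    WHERE action = 'click';

    -- A window aggregate can expose a fresh time attribute for
    -- follow-up time-based operations (e.g. cascading windows).
    SELECT
        user_name,
        TUMBLE_END(ts, INTERVAL '1' MINUTE)     AS window_end,
        TUMBLE_ROWTIME(ts, INTERVAL '1' MINUTE) AS new_rowtime,  -- carries the time attribute
        COUNT(*) AS cnt
    FROM user_actions
    GROUP BY user_name, TUMBLE(ts, INTERVAL '1' MINUTE);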
Hi Jungtaek,
Kurt has said what I wanted to say; I will add some background. Flink Table API & SQL only supports defining the processing-time attribute and the event-time attribute (watermark) on the source; it does not support defining a new one inside a query. The time attributes pass through the query, and time-based operations can only be applied on the time attributes.

The reason Flink Table & SQL only supports defining the watermark on the source is that this allows us to do per-partition watermarks and source idleness handling, and simplifies things. There is also some discussion about "disable arbitrary watermark assigners in the middle of a pipeline in DataStream" in the comments of the JIRA issue [1].

Best,
Jark

[1]: https://issues.apache.org/jira/browse/FLINK-11286
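As an example of a time-based operation that is only legal on declared time attributes, an interval join sketch; Orders and Shipments are assumed (hypothetical) tables whose order_time and ship_time columns were declared as event-time attributes in their DDL:

    -- The join condition bounds one time attribute by the other;
    -- this works only because both columns were declared as event-time
    -- attributes (with watermarks) on their source tables.
    SELECT o.order_id, s.ship_time
    FROM Orders o
    JOIN Shipments s
      ON o.order_id = s.order_id
     AND s.ship_time BETWEEN o.order_time
                         AND o.order_time + INTERVAL '4' HOUR;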
Thanks Kurt and Jark for the detailed explanation! It helped a lot in understanding FLIP-66.

So it sounds like Flink won't leverage the timestamp in StreamRecord (which is hidden and cannot easily be modified) and instead handles the time semantics through the input schema of each operation, to unify the semantics between batch and stream. Did I understand that correctly?

I'm not familiar with the internals of Flink, so the information in FLINK-11286 is not easy for me to digest, but in general I'd be supportive of defining the watermark as close to the source as possible, as it'll be easier to reason about. (I'm basically referring to the timestamp assigner rather than the watermark assigner, though.)

- Jungtaek Lim
Hi Jungtaek,
Yes. Your understanding is correct :)

Best,
Jark
Thanks for confirming! Honestly, I'm in favor of treating the timestamp field as special and restricting modification (the way the DataStream API does), but I agree the new approach could be more natural for unifying the SQL semantics for both batch and stream.

Thanks again,
Jungtaek Lim (HeartSaVioR)