Stream SQL and Dynamic tables

Stream SQL and Dynamic tables

Radu Tudoran

Hi all,

 

I have a question with respect to the scope of the initiative behind relational queries on data streams:

https://docs.google.com/document/d/1qVVt_16kdaZQ8RTfA_f4konQPW4tnl8THw6rzGUdaqU/edit#

 

Is the approach of using dynamic tables intended to replace the implementation and mechanisms built so far in Stream SQL? Or will the two co-exist, or be built one on top of the other?

 

Also, is the document in its final form, or can we still provide feedback / ask questions?

 

Thanks for the clarification (and sorry if I missed the discussion that might have clarified this at some point).

 

Dr. Radu Tudoran

Senior Research Engineer - Big Data Expert

IT R&D Division

 


Re: Stream SQL and Dynamic tables

Fabian Hueske-2
Hi Radu,

the idea is to have dynamic tables as the common ground for Table API and
SQL.
I don't think it is a good idea to implement and maintain 3 different
relational APIs with possibly varying semantics.

Actually, you can see the current status of the Table API / SQL on streams
as a subset of the proposed semantics.
Right now, all streams are implicitly converted into Tables with APPEND
semantics. The currently supported operations (selection, filter, union,
group windows) return streams.
The only thing that would change for these operations is that the output
mode would be retraction mode by default, in order to be able to emit
updated records (e.g., updated aggregates due to late records).
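
To make that concrete, here is a minimal sketch of such a program as it can
be written today (package and method names as of the current Table API, so
treat them as illustrative rather than definitive):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.types.Row;

public class AppendSemanticsSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    StreamTableEnvironment tEnv = TableEnvironment.getTableEnvironment(env);

    DataStream<Tuple2<String, Long>> clicks = env.fromElements(
        Tuple2.of("userA", 1L), Tuple2.of("userB", 1L), Tuple2.of("userA", 1L));

    // The stream is implicitly interpreted as a table with APPEND semantics.
    Table t = tEnv.fromDataStream(clicks, "userId, cnt");

    // Selection and filter are among the operations supported today; the
    // result is again an append stream.
    Table result = t.filter("userId === 'userA'").select("userId, cnt");

    // Under the dynamic table proposal, updating results would be emitted
    // in retraction mode by default instead of as plain appends.
    DataStream<Row> out = tEnv.toDataStream(result, Row.class);
    out.print();

    env.execute();
  }
}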

The document is not final and we can of course discuss the proposal.

Best, Fabian


RE: Stream SQL and Dynamic tables

Radu Tudoran
Hi,

Thanks for the clarification, Fabian - it is really useful.
I agree that we should consolidate the module and avoid having to maintain 3 different "projects". It does make sense to see the current (I would call it) "Stream SQL" as a table with append semantics. However, one thing that should be clarified is what the best way is, from an implementation point of view, to keep the state of the table (if we can actually keep it - though the need is clear for supporting retraction). As the input is a stream and the table is append-only, we of course run into the classical unbounded-growth issue that streams have. What should be the approach?
Should we consider keeping the data in something like the state backend used now for windows, and then pushing it to disk (e.g., like the FsStateBackend)? Perhaps with the disk we can at least enlarge the horizon of what we keep.
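Roughly what I have in mind, just as a sketch on top of the existing DataStream API (the checkpoint path is only a placeholder):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class FsBackendSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Keep working state on the heap, but snapshot it to a distributed file
    // system on each checkpoint (the URI below is only an example).
    env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));
    env.enableCheckpointing(60_000);
    // ... the streaming / table program that holds the "table state" ...
  }
}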
I will give some comments and some thoughts in the document about this.  



Re: Stream SQL and Dynamic tables

Fabian Hueske-2
Hi Radu,

the idea is to only support operations that are bounded in space and
compute time:

- space: the size of the state may not grow infinitely over time or with
growing key domains. For these cases, the optimizer will enforce a cleanup
timeout, and all data that is past that timeout will be discarded.
Operations that cannot be bounded in space will be rejected.

- compute time: certain queries cannot be executed efficiently because
newly arriving data (late data or just newly appended rows) might trigger
recomputation of large parts of the current state. Operations that would
result in such a computation pattern will be rejected. One example would be
event-time OVER ROWS windows, as we discussed in the other thread.

So the plan is that the optimizer takes care of limiting the space
requirements and the computation effort.
However, you are of course right: retraction and long-running windows can
result in significant amounts of operator state.
I don't think this is a special requirement of the Table API (there are
users of the DataStream API with jobs that manage TBs of state). Persisting
state to disk with RocksDB and scaling out to more nodes should address the
scaling problem initially. In the long run, the Flink community will work
to improve the handling of large state with features such as incremental
checkpoints and new state backends.
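
For example, a minimal sketch with the RocksDB backend as it exists today
(requires the flink-statebackend-rocksdb dependency; the checkpoint URI is
just a placeholder):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RocksDBBackendSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    // Operator state lives in RocksDB on local disk and is snapshotted to
    // the given URI on each checkpoint, so it can grow beyond the heap.
    env.setStateBackend(new RocksDBStateBackend("hdfs:///flink/checkpoints"));
    env.enableCheckpointing(60_000);
    // ... Table API / SQL program as usual ...
  }
}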

Looking forward to your comments.

Best,
Fabian


RE: Stream SQL and Dynamic tables

Radu Tudoran
Hi Fabian,

Thanks for the clarifications. I have a follow-up question: you say that operations are expected to be bounded in space and time (e.g., the optimizer will do a cleanup after a certain timeout period). Can I assume this implies that the system will expose a couple of parameters that hold these thresholds and can potentially be configured?

For example, on the execution environment:

env.setCleanupTimeout(100, TimeUnit.MINUTES);

...or alternatively perhaps directly at the level of the table (either the table environment or the table itself):

TableEnvironment tbEnv = ...;
tbEnv.setCleanupTimeout(100, TimeUnit.MINUTES);
Table tb = ...;
tb.setCleanupTimeout(100, TimeUnit.MINUTES);




Re: Stream SQL and Dynamic tables

Fabian Hueske-2
Hi Radu,

yes, the clean-up timeout would need to be defined somewhere.
I would actually prefer to do that within the query, because the clean-up
timeout affects the result and hence the semantics of the query.
This could, for instance, look like:

SELECT a, sum(b)
FROM myTable
WHERE rowtime BETWEEN now() - INTERVAL '1' DAY AND now()
GROUP BY a;

In this query, now() would always refer to the current time, i.e., the
current wall-clock time for processing time or the current watermark time
for event time.
The result of the query would be the grouped aggregate of the data received
in the last day.
We can add syntactic sugar with built-in functions, for example:
last(rowtime, INTERVAL '1' DAY).
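With such a (purely hypothetical) built-in, meant as a shorthand for the
BETWEEN predicate above, the query could then be written as:

SELECT a, sum(b)
FROM myTable
WHERE last(rowtime, INTERVAL '1' DAY)
GROUP BY a;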

In addition we can also add a configuration parameter to the
TableEnvironment to control the clean-up timeout.

Cheers,
Fabian


RE: Stream SQL and Dynamic tables

Radu Tudoran
Hi,

As someone who likes to program in a more stream-like fashion, I must say I like the syntax you proposed, so I would be fine with keeping it this way.

My only question/concern is whether someone who writes SQL as a day-to-day job would like this way of writing queries, in which we port at least the time concepts from streaming into SQL. However, it is not a complex concept to port, so I would like to believe that it is not a big deal to write SQL queries with this syntax. Nevertheless, I just wanted to raise the point.




RE: Stream SQL and Dynamic tables

Radu Tudoran
In reply to this post by Fabian Hueske-2
Hi,

I made some comments on the dynamic tables document. I am not sure how to ask for feedback on them... therefore this email.

Please let me know what you think:

https://docs.google.com/document/d/1qVVt_16kdaZQ8RTfA_f4konQPW4tnl8THw6rzGUdaqU/edit#heading=h.3eo2vkvydld6
