Hi guys,
I would like to push forward the work here:
https://issues.apache.org/jira/browse/FLINK-2147

Could anyone more familiar with the streaming API verify whether this is a feasible task? The intention is to summarize data over a window, as in the case of StreamGroupedFold; specifically, to implement a count-min sketch over a window.

Best,
Stavros
Hi,
with how the window API currently works, this can only be done for non-parallel windows. For keyed windows, everything that happens is scoped to the key of the elements: window contents are kept in per-key state, and triggers fire on a per-key basis. A count-min sketch therefore cannot be used, because it would require keeping state across keys.

For non-parallel windows, a user could do this:

    DataStream input = ...
    input
        .windowAll(<some window>)
        .fold(new MySketch(), new MySketchFoldFunction())

with sketch data types and a fold function tailored to the user's types. I would therefore prefer not to add a special API for this, and vote to close https://issues.apache.org/jira/browse/FLINK-2147. I already commented on https://issues.apache.org/jira/browse/FLINK-2144, saying a similar thing.

What I would welcome very much is adding some well-documented examples to Flink that showcase how some of these operations can be written.

Cheers,
Aljoscha
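[Editor's note: for readers of the archive, the snippet below is a minimal sketch of what the accumulator type in such a fold could look like. It is plain Java with no Flink dependency; the class name, parameters, and hashing scheme are invented for illustration and are not part of any Flink API.]

```java
import java.util.Random;

// Minimal count-min sketch: a depth x width matrix of counters with one
// hash function per row. Estimates never undercount; hash collisions can
// only inflate them.
public class CountMinSketch {
    private final long[][] counts;
    private final int depth;
    private final int width;
    private final int[] seeds;

    public CountMinSketch(int depth, int width, long seed) {
        this.depth = depth;
        this.width = width;
        this.counts = new long[depth][width];
        this.seeds = new int[depth];
        Random rnd = new Random(seed);
        for (int i = 0; i < depth; i++) {
            seeds[i] = rnd.nextInt();
        }
    }

    private int bucket(int row, Object item) {
        int h = item.hashCode() ^ seeds[row];
        h ^= (h >>> 16); // spread the high bits into the low bits
        return Math.floorMod(h, width);
    }

    // Record one occurrence of item in every row.
    public void add(Object item) {
        for (int i = 0; i < depth; i++) {
            counts[i][bucket(i, item)]++;
        }
    }

    // The estimate is the minimum counter over all rows.
    public long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int i = 0; i < depth; i++) {
            min = Math.min(min, counts[i][bucket(i, item)]);
        }
        return min;
    }

    public static void main(String[] args) {
        CountMinSketch sketch = new CountMinSketch(4, 1024, 42L);
        for (int i = 0; i < 100; i++) sketch.add("flink");
        sketch.add("spark");
        System.out.println(sketch.estimate("flink")); // at least 100
        System.out.println(sketch.estimate("spark")); // at least 1
    }
}
```

In the fold pattern above, `add` would be called in the fold function and the sketch itself would be the accumulator that the window emits on firing.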
Hi, thanks for the feedback.
So there is a limitation due to the parallel windows implementation. Are there no intentions to change that somehow to accommodate similar estimations?

Is windowAll in practice used as a step in the pipeline? I mean, since it is inherently not parallel, it cannot scale, correct? Although there is an exception: "Only for special cases, such as aligned time windows is it possible to perform this operation in parallel". Probably I am missing something...

I could try to do the example stuff (and open a new feature on JIRA for that). I will also vote for closing the old issue, since there is no other way, at least for the time being...

Thanks,
Stavros
Hi,
no such changes are planned right now. The separation between the keys is very strict, in order to make the windowing state re-partitionable so that we can implement dynamic rescaling of the parallelism of a program.

WindowAll is only used for specific cases where you need a Trigger that sees all elements of the stream. I personally don't think it is very useful, because it is not scalable. In theory, for time windows this can be parallelized, but it is not currently done in Flink.

Do you have a specific use case for the count-min sketch in mind? If not, maybe our energy is better spent on producing examples with real-world applicability. I'm not against having an example for a count-min sketch; I'm just worried that you might put your energy into something that is not useful to a lot of people.

Cheers,
Aljoscha
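[Editor's note: the reason aligned time windows are parallelizable in theory is that sketches of this kind merge by cell-wise counter addition, so partial sketches built on parallel instances can be combined into one sketch for the whole stream. A hedged illustration, assuming both sketches share the same dimensions and hash scheme; all names here are invented for the example.]

```java
// Illustration of mergeability: two sketches built over disjoint parts of
// a stream combine by adding their counter matrices cell-wise, provided
// they share the same depth, width, and per-row hash functions.
public class MergeableSketch {
    final long[][] counts;
    final int width;

    MergeableSketch(int depth, int width) {
        this.counts = new long[depth][width];
        this.width = width;
    }

    // One simple hash per row; a real sketch would use stronger,
    // independently seeded hash functions.
    private int bucket(int row, Object item) {
        return Math.floorMod(item.hashCode() * (row + 31), width);
    }

    void add(Object item) {
        for (int row = 0; row < counts.length; row++) {
            counts[row][bucket(row, item)]++;
        }
    }

    long estimate(Object item) {
        long min = Long.MAX_VALUE;
        for (int row = 0; row < counts.length; row++) {
            min = Math.min(min, counts[row][bucket(row, item)]);
        }
        return min;
    }

    // Cell-wise addition: the merged sketch estimates as if it had seen
    // both input streams.
    void merge(MergeableSketch other) {
        for (int row = 0; row < counts.length; row++) {
            for (int col = 0; col < width; col++) {
                counts[row][col] += other.counts[row][col];
            }
        }
    }

    public static void main(String[] args) {
        MergeableSketch part1 = new MergeableSketch(4, 512);
        MergeableSketch part2 = new MergeableSketch(4, 512);
        for (int i = 0; i < 5; i++) part1.add("x");
        for (int i = 0; i < 3; i++) part2.add("x");
        part1.merge(part2);
        System.out.println(part1.estimate("x")); // at least 8
    }
}
```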
Hey Aljoscha,
Thanks for the useful comments. I have recently looked at Spark sketches:
http://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics
So there must be value in this effort. In my experience, counting in general is a common need for large data sets. For example, in a non-streaming setting people often use Redis for its HyperLogLog algorithm.

What other areas do you find more important, or of higher priority, for the time being?

Best,
Stavros
Hi,
that link was interesting, thanks! As I said, though, it's probably not a good fit for Flink right now.

The things that I feel are important right now are:

- Dynamic scaling: the ability of a streaming pipeline to adapt to changes in the amount of incoming data. This is tricky with stateful operations and long-running pipelines. For Spark this is easier to do, because every mini-batch is individually scheduled, and mini-batches can therefore be scheduled on differing numbers of machines.

- An API for joining static (or slowly evolving) data with streaming data: this has been coming up in different forms on the mailing lists and when talking with people. Apache Beam solves this with "side inputs". In Flink we want to add something as well, maybe along the lines of side inputs, or maybe something more specific for the case of pure joins.

- Working on managed memory: in Flink we were always very conscious about how memory is used, using our own abstractions for dealing with memory and efficient serialization. We call this the "managed memory" abstraction. Spark recently also started going in this direction with Project Tungsten. For the streaming API there are still some places where we could make things more efficient by working on managed memory more; for example, there is no state backend that uses managed memory. We are either completely on the Java heap or use RocksDB there.

- Stream SQL: this is obvious, and everybody wants it.

- A generic cross-runner API: this is what Apache Beam (née Google Dataflow) does. It is very interesting to write a program once and then be able to run it on different runners. This brings more flexibility for users. It's not clear how this will play out in the long run, but it's very interesting to keep an eye on.

For most of these, the Flink community is in various stages of implementing them, so that's good. :-)

Cheers,
Aljoscha
Hi Stavros!
I think what Aljoscha wants to say is that the community is a bit hard-pressed reviewing new and complex things right now. There are a lot of threads going on already.

If you want to work on this, why not make your own GitHub project, "Approximate algorithms on Apache Flink" or so?

Greetings,
Stephan
Hi Stephan,
An external project would be possible, and maybe it could be merged in the future if that makes sense. I just wanted to point out that in general there is a need, but I understand the priorities and may also try to work on those.

Best,
Stavros