buffering in operators, implementing statistics

Stavros Kontopoulos
Hi guys,

I would like to push forward the work here:
https://issues.apache.org/jira/browse/FLINK-2147

Could anyone more familiar with the streaming API verify whether this is a
reasonable task to take on?
The intention is to summarize data over a window, as in the case of
StreamGroupedFold, and specifically to implement a count-min sketch over a
window.

Best,
Stavros

Re: buffering in operators, implementing statistics

Aljoscha Krettek
Hi,
with how the window API currently works, this can only be done for
non-parallel windows. For keyed windows, everything that happens is scoped
to the key of the elements: window contents are kept in per-key state and
triggers fire on a per-key basis. A count-min sketch therefore cannot be
used there, because it would require keeping state across keys.

For non-parallel windows a user could do this:

DataStream input = ...
input
  .windowAll(<some window>)
  .fold(new MySketch(), new MySketchFoldFunction())

with sketch data types and a fold function tailored to the user's types.
Therefore, I would prefer not to add a special API for this and vote to
close https://issues.apache.org/jira/browse/FLINK-2147. I already
commented on https://issues.apache.org/jira/browse/FLINK-2144, saying a
similar thing.

What I would very much welcome is adding some well-documented examples to
Flink that showcase how some of these operations can be written.
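
For illustration, here is a rough, self-contained sketch of what such an
example might look like, assuming the fold()-style window API from the
snippet above. The CountMinSketch class is a toy stand-in with a
deliberately weak hash family, not an existing Flink or library class:

import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

public class CountMinOverWindowAll {

  // Toy count-min sketch, for illustration only: fixed depth/width and a
  // hash family derived from hashCode(). A real implementation would use
  // pairwise-independent hash functions and configurable dimensions.
  public static class CountMinSketch implements java.io.Serializable {
    private static final long serialVersionUID = 1L;
    private final long[][] counts = new long[4][1024];

    public void add(String item) {
      for (int row = 0; row < counts.length; row++) {
        counts[row][bucket(item, row)]++;
      }
    }

    public long estimate(String item) {
      long min = Long.MAX_VALUE;
      for (int row = 0; row < counts.length; row++) {
        min = Math.min(min, counts[row][bucket(item, row)]);
      }
      return min;
    }

    private int bucket(String item, int row) {
      // masking keeps the index non-negative
      return (item.hashCode() * (31 * row + 17) & 0x7fffffff) % counts[row].length;
    }
  }

  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

    // any DataStream<String> works here; a socket source keeps the example small
    DataStream<String> input = env.socketTextStream("localhost", 9999);

    input
        // non-parallel window over the whole stream
        .timeWindowAll(Time.seconds(10))
        // fold every element of the window into a single sketch
        .fold(new CountMinSketch(), new FoldFunction<String, CountMinSketch>() {
          @Override
          public CountMinSketch fold(CountMinSketch sketch, String value) {
            sketch.add(value);
            return sketch;
          }
        })
        .print();

    env.execute("count-min sketch over a non-parallel window");
  }
}

A downstream operator would then query the emitted per-window sketch via
estimate(), e.g. sketch.estimate("some-item").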

Cheers,
Aljoscha

Re: buffering in operators, implementing statistics

Stavros Kontopoulos
Hi, thanks for the feedback.

So there is a limitation due to the parallel windows implementation.
Are there no intentions to change that somehow to accommodate estimations
like this?

Is WindowAll in practice used as a step in the pipeline? I mean, since it is
inherently not parallel, it cannot scale, correct?
Although there is an exception: "Only for special cases, such as aligned
time windows is it possible to perform this operation in parallel".
I am probably missing something...

I could try to do the example work (and open a new feature on JIRA for that).
I will also vote for closing the old issue, since there is no other way,
at least for the time being...

Thanks,
Stavros

Re: buffering in operators, implementing statistics

Aljoscha Krettek
Hi,
no such changes are planned right now. The separation between the keys is
very strict in order to make the windowing state re-partitionable, so that
we can implement dynamic rescaling of the parallelism of a program.

WindowAll is only used for specific cases where you need a Trigger that
sees all elements of the stream. I personally don't think it is very useful
because it is not scalable. In theory, time windows could be parallelized,
but this is not currently done in Flink.
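
To make the "in theory parallelizable" point a bit more concrete: sketches
of this kind are mergeable, so one could pre-aggregate partial sketches in
a keyed (parallel) window and merge them in a final windowAll stage. A very
rough, untested fragment of that pattern follows, reusing the toy
CountMinSketch from the example earlier in this thread plus a hypothetical
merge() method, and glossing over how the two window stages line up:

import org.apache.flink.api.common.functions.FoldFunction;
import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.windowing.time.Time;

// partial, per-key sketches built in parallel ...
// (input is the DataStream<String> from the previous example)
DataStream<CountMinSketch> partials = input
    .keyBy(new KeySelector<String, Integer>() {
      @Override
      public Integer getKey(String value) {
        // spread elements over a small, bounded set of keys
        return (value.hashCode() & 0x7fffffff) % 8;
      }
    })
    .timeWindow(Time.seconds(10))
    .fold(new CountMinSketch(), new FoldFunction<String, CountMinSketch>() {
      @Override
      public CountMinSketch fold(CountMinSketch sketch, String value) {
        sketch.add(value);
        return sketch;
      }
    });

// ... then merged in a single, non-parallel window
partials
    .timeWindowAll(Time.seconds(10))
    .fold(new CountMinSketch(), new FoldFunction<CountMinSketch, CountMinSketch>() {
      @Override
      public CountMinSketch fold(CountMinSketch merged, CountMinSketch partial) {
        merged.merge(partial); // hypothetical: element-wise addition of count arrays
        return merged;
      }
    })
    .print();

The heavy lifting happens in the parallel keyed stage; the windowAll stage
only merges a bounded number of pre-aggregated sketches per window.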

Do you have a specific use case for the count-min sketch in mind? If not,
maybe our energy is better spent on producing examples with real-world
applicability. I'm not against having an example for a count-min sketch;
I'm just worried that you might put your energy into something that is not
useful to a lot of people.

Cheers,
Aljoscha

Re: buffering in operators, implementing statistics

Stavros Kontopoulos
Hey Aljoscha,

Thanks for the useful comments. I recently looked at Spark's sketches:
http://www.slideshare.net/databricks/sketching-big-data-with-spark-randomized-algorithms-for-largescale-data-analytics
So there must be value in this effort.
In my experience, counting in general is a common need for large data sets.
For example, in non-streaming settings people often use Redis for its
HyperLogLog algorithm.

What other areas do you find more important or of higher priority for the
time being?

Best,
Stavros

Re: buffering in operators, implementing statistics

Aljoscha Krettek
Hi,
that link was interesting, thanks! As I said though, it's probably not a
good fit for Flink right now.

The things that I feel are important right now are:

 - dynamic scaling: the ability of a streaming pipeline to adapt to changes
in the amount of incoming data. This is tricky with stateful operations and
long-running pipelines. For Spark this is easier to do because every
mini-batch is scheduled individually, so successive mini-batches can be
scheduled on differing numbers of machines.

 - an API for joining static (or slowly evolving) data with streaming data:
this has been coming up in different forms on the mailing lists and when
talking with people. Apache Beam solves this with "side inputs". In Flink
we want to add something as well, maybe along the lines of side inputs or
maybe something more specific for the case of pure joins.

 - working on managed memory: In Flink we have always been very conscious
about how memory is used; we use our own abstractions for dealing with
memory and efficient serialization. We call this the "managed memory"
abstraction. Spark recently also started going in this direction with
Project Tungsten. For the streaming API there are still some places where
we could make things more efficient by working on the managed memory more;
for example, there is no state backend that uses managed memory. We are
either completely on the Java heap or use RocksDB there.

 - stream SQL: this is obvious and everybody wants it.

 - A generic cross-runner API: This is what Apache Beam (née Google
Dataflow) does. It is very interesting to write a program once and then be
able to run it on different runners. This brings more flexibility for
users. It's not clear how this will play out in the long run but it's very
interesting to keep an eye on.

For most of these, the Flink community is in various stages of implementing
them, so that's good. :-)

Cheers,
Aljoscha

Re: buffering in operators, implementing statistics

Stephan Ewen
Hi Stavros!

I think what Aljoscha wants to say is that the community is a bit
hard-pressed to review new and complex things right now.
There are a lot of threads going on already.

If you want to work on this, why not make your own GitHub project
"Approximate algorithms on Apache Flink" or so?

Greetings,
Stephan

Re: buffering in operators, implementing statistics

Stavros Kontopoulos
Hi Stephan,

An external project would be possible, and it could maybe be merged in the
future if it makes sense. I just wanted to point out that there is a need
for this in general, but I understand the priorities and may also try to
work on some of those.

Best,
Stavros
