I think the though was to explicitly not have the same terminology as the
batch API to not confuse people. But this is a minor naming issue IMO. On Tue, Jul 14, 2015 at 12:40 PM, Gyula Fóra <[hidden email]> wrote: > I see your point, reduceByKey is much clearer. > > The question is whether we want to introduce this inconsistency across the > two api-s or stick with what we have. > On Tue, Jul 14, 2015 at 10:57 AM Aljoscha Krettek <[hidden email]> > wrote: > > > I agree, the groupBy, in the batch API is misleading, since a > > ds.groupBy().reduce() does not really build any groups, it is really a > > ds.keyBy().reduceByKey(). In the streaming API we can still fix this, > IMHO. > > > > On Tue, 14 Jul 2015 at 10:56 Stephan Ewen <[hidden email]> wrote: > > > > > It is not a bit different than the batch API, because streaming > semantics > > > are a bit different ;-) > > > > > > One good thing is that we can make things better that were sub-optimal > in > > > the Batch API. > > > > > > On Tue, Jul 14, 2015 at 10:55 AM, Stephan Ewen <[hidden email]> > wrote: > > > > > > > keyBy() does not do any grouping. Grouping in streams in not defined > > > > without windows. > > > > > > > > On Tue, Jul 14, 2015 at 10:48 AM, Gyula Fóra <[hidden email]> > > > wrote: > > > > > > > >> If we only want to have either keyBy or groupBy, why not keep > groupBy? > > > >> That > > > >> would be more consistent with the batch api. > > > >> On Tue, Jul 14, 2015 at 10:35 AM Stephan Ewen <[hidden email]> > > wrote: > > > >> > > > >> > Concerning your comments: > > > >> > > > > >> > 1) In the new design, there is no grouping without windowing. The > > > >> > KeyedDataStream subsumes the grouping and key-ing for partitioned > > > state. > > > >> > > > > >> > The keyBy() + window() makes a parallel grouped window > > > >> > keyBy() alone allows access to partitioned state. > > > >> > > > > >> > My thought was that this is simpler, because it needs not > > > groupBy() > > > >> and > > > >> > keyBy(), but one construct to handle both cases. > > > >> > > > > >> > 2) The discretization is a rough thought and is nothing for the > > short > > > >> term. > > > >> > It totally needs more thoughts. I put it there to have it as a > > sketch > > > >> for > > > >> > how to evolve this. > > > >> > > > > >> > The idea is of course to not have a single data set, but a > > series > > > of > > > >> > data set. In each discrete time slice, the data set can be treated > > > like > > > >> a > > > >> > regular data set. > > > >> > > > > >> > Let's kick off a separate design for the discretization. Joins > > are > > > >> good > > > >> > to talk about (data sets can be joined with data set), and I am > sure > > > >> there > > > >> > are more questions coming up. > > > >> > > > > >> > > > > >> > Does that make sense? > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > On Tue, Jul 14, 2015 at 10:05 AM, Gyula Fóra < > [hidden email]> > > > >> wrote: > > > >> > > > > >> > > I think Marton has some good points here. > > > >> > > > > > >> > > 1) Is KeyedDataStream a better name if this is only a renaming? > > > >> > > > > > >> > > 2) the discretize semantics is unclear indeed. Are we operating > > on a > > > >> > single > > > >> > > or sequence of datasets? If the latter why not call it something > > > else > > > >> > > (dstream). How are joins and other binary operators defined for > > > >> different > > > >> > > discretizations etc. > > > >> > > On Mon, Jul 13, 2015 at 7:37 PM Márton Balassi < > > [hidden email] > > > > > > > >> > > wrote: > > > >> > > > > > >> > > > Generally I agree with the new design. Two concerns: > > > >> > > > > > > >> > > > 1) Does KeyedDataStream replace GroupedDataStream or is it the > > > >> latter a > > > >> > > > special case of the former? > > > >> > > > > > > >> > > > The KeyedDataStream as described in the design document is a > bit > > > >> > unclear > > > >> > > > for me. It lists the following usages: > > > >> > > > a) It is the first step in building a window stream, on top > of > > > >> which > > > >> > > the > > > >> > > > grouped/windowed aggregation and reduce-style function can be > > > >> applied > > > >> > > > b) It allows to use the "by-key" state of functions. Here, > > every > > > >> > record > > > >> > > > has access to a state that is scoped by its key. Key-scoped > > state > > > >> can > > > >> > be > > > >> > > > automatically redistributed and repartitioned. > > > >> > > > > > > >> > > > The code snippet describes a use case where the computation > and > > > the > > > >> > > access > > > >> > > > of the state is used the way currently the GroupedDataStream > > > should > > > >> > > work. I > > > >> > > > suppose this is the example for case b). Would case a) also > > window > > > >> > > elements > > > >> > > > by key? If yes, then this is practically a renaming and > > > enhancement > > > >> of > > > >> > > the > > > >> > > > GroupedDataStream functionality with keyed state. Then the > > > >> > > > StreamExecutionEnvironment.createKeyedStream(Partitioner, > > > >> > > > KeySelector)construction does not make much sense as the user > > only > > > >> > > operates > > > >> > > > within the scope of the keyselector and not the partitioner > > > anyway. > > > >> > > > > > > >> > > > I personally think KeyedDataStream as a name does not > > necessarily > > > >> > suggest > > > >> > > > that the records are grouped by key, it only suggests > > partitioning > > > >> by > > > >> > > key - > > > >> > > > at least for me. :) > > > >> > > > > > > >> > > > 2) The API for discretization is not convenient IMHO > > > >> > > > > > > >> > > > The discretization part declares that the output of > > > >> > > DataStream.discretize() > > > >> > > > is a sequence of DataSets. I love this approach, but then in > the > > > >> code > > > >> > > > snippet the return value of this function is simply a DataSet > > and > > > >> uses > > > >> > it > > > >> > > > as such. The take home message of that code is the following: > > this > > > >> is > > > >> > > > actually the way you would like to program on these sequence > of > > > >> > DataSets, > > > >> > > > most probably you would like to do the same with each of them. > > If > > > >> that > > > >> > is > > > >> > > > the case we should provide a nice utility for that. I think > > Spark > > > >> > > > Streaming's DStream.foreachRDD() is fairly useful for this > > > purpose. > > > >> > > > > > > >> > > > On Mon, Jul 13, 2015 at 6:30 PM, Gyula Fóra < > > [hidden email] > > > > > > > >> > > wrote: > > > >> > > > > > > >> > > > > +1 > > > >> > > > > On Mon, Jul 13, 2015 at 6:23 PM Stephan Ewen < > > [hidden email]> > > > >> > wrote: > > > >> > > > > > > > >> > > > > > If naming is the only concern, then we should go ahead, > > > because > > > >> we > > > >> > > can > > > >> > > > > > change names easily (before the release). > > > >> > > > > > > > > >> > > > > > In fact, I don't think it leaves a bad impression. Global > > > >> windows > > > >> > are > > > >> > > > > > non-parallel windows. There are also parallel windows. > Pick > > > what > > > >> > you > > > >> > > > need > > > >> > > > > > and what works. > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > On Mon, Jul 13, 2015 at 6:13 PM, Gyula Fóra < > > > >> [hidden email]> > > > >> > > > > wrote: > > > >> > > > > > > > > >> > > > > > > I think we agree on everything its more of a naming > issue > > :) > > > >> > > > > > > > > > >> > > > > > > I thought it might be misleading that global time > windows > > > are > > > >> > > > > > > "non-parallel" windows. We dont want to give a bad > > > impression. > > > >> > > (Also > > > >> > > > we > > > >> > > > > > > dont want them to think that every global window is > > parallel > > > >> but > > > >> > > > thats > > > >> > > > > > not > > > >> > > > > > > a problem here) > > > >> > > > > > > > > > >> > > > > > > Gyula > > > >> > > > > > > On Mon, Jul 13, 2015 at 5:22 PM Stephan Ewen < > > > >> [hidden email]> > > > >> > > > wrote: > > > >> > > > > > > > > > >> > > > > > > > Okay, what is missing about the windowing in your > > opinion? > > > >> > > > > > > > > > > >> > > > > > > > The core points of the document are: > > > >> > > > > > > > > > > >> > > > > > > > - The parallel windows are per group only. > > > >> > > > > > > > > > > >> > > > > > > > - The implementation of the parallel windows holds > > > window > > > >> > data > > > >> > > in > > > >> > > > > the > > > >> > > > > > > > group buffers. > > > >> > > > > > > > > > > >> > > > > > > > - The global windows are non-parallel. May have > > parallel > > > >> > > > > > > pre-aggregation, > > > >> > > > > > > > if they are time windows. > > > >> > > > > > > > > > > >> > > > > > > > - Time may be operator time (timer thread), or > > watermark > > > >> > time. > > > >> > > > > > > Watermark > > > >> > > > > > > > time can refer to ingress or event time. > > > >> > > > > > > > > > > >> > > > > > > > - Windows that do not pre-aggregate may require > > elements > > > >> in > > > >> > > > order. > > > >> > > > > > Not > > > >> > > > > > > > part of the first prototype. > > > >> > > > > > > > > > > >> > > > > > > > Do we agree on those points? > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > On Mon, Jul 13, 2015 at 4:50 PM, Gyula Fóra < > > > >> > > [hidden email]> > > > >> > > > > > > wrote: > > > >> > > > > > > > > > > >> > > > > > > > > In general I like it, although the main difference > > > between > > > >> > the > > > >> > > > > > current > > > >> > > > > > > > and > > > >> > > > > > > > > the new one is the windowing and that is still not > > very > > > >> > clear. > > > >> > > > > > > > > > > > >> > > > > > > > > Where do we have the full stream time windows for > > > >> > > instance?(which > > > >> > > > > is > > > >> > > > > > > > > parallel but not keyed) > > > >> > > > > > > > > On Mon, Jul 13, 2015 at 4:28 PM Aljoscha Krettek < > > > >> > > > > > [hidden email]> > > > >> > > > > > > > > wrote: > > > >> > > > > > > > > > > > >> > > > > > > > > > +1 I like it as well. > > > >> > > > > > > > > > > > > >> > > > > > > > > > On Mon, 13 Jul 2015 at 16:17 Kostas Tzoumas < > > > >> > > > [hidden email] > > > >> > > > > > > > > >> > > > > > > > wrote: > > > >> > > > > > > > > > > > > >> > > > > > > > > > > +1 from my side > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > On Mon, Jul 13, 2015 at 4:15 PM, Stephan Ewen < > > > >> > > > > [hidden email]> > > > >> > > > > > > > > wrote: > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > Do we have consensus on these designs? > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > If we have, we should get to implementing this > > > soon, > > > >> > > > because > > > >> > > > > > > > > basically > > > >> > > > > > > > > > > all > > > >> > > > > > > > > > > > streaming patches will have to be revisited in > > > >> light of > > > >> > > > > this... > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > On Tue, Jul 7, 2015 at 3:41 PM, Gyula Fóra < > > > >> > > > > > [hidden email] > > > >> > > > > > > > > > > >> > > > > > > > > > wrote: > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > You are right thats an important issue. > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > And I think we should also do some renaming > > with > > > >> the > > > >> > > > > > > "iterations" > > > >> > > > > > > > > > > because > > > >> > > > > > > > > > > > > they are not really iterations like in the > > batch > > > >> case > > > >> > > and > > > >> > > > > it > > > >> > > > > > > > might > > > >> > > > > > > > > > > > confuse > > > >> > > > > > > > > > > > > some users. > > > >> > > > > > > > > > > > > Maybe we can call them loops or cycles and > > > rename > > > >> the > > > >> > > api > > > >> > > > > > calls > > > >> > > > > > > > to > > > >> > > > > > > > > > make > > > >> > > > > > > > > > > > it > > > >> > > > > > > > > > > > > more intuitive what happens. It is really > > just a > > > >> > cyclic > > > >> > > > > > > dataflow. > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > Aljoscha Krettek <[hidden email]> ezt > > írta > > > >> > > > (időpont: > > > >> > > > > > > 2015. > > > >> > > > > > > > > júl. > > > >> > > > > > > > > > > 7., > > > >> > > > > > > > > > > > > K, > > > >> > > > > > > > > > > > > 15:35): > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Hi, > > > >> > > > > > > > > > > > > > I just noticed that we don't have anything > > > about > > > >> > how > > > >> > > > > > > iterations > > > >> > > > > > > > > and > > > >> > > > > > > > > > > > > > timestamps/watermarks should interact. > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Cheers, > > > >> > > > > > > > > > > > > > Aljoscha > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > On Mon, 6 Jul 2015 at 23:56 Stephan Ewen < > > > >> > > > > [hidden email] > > > >> > > > > > > > > > >> > > > > > > > > wrote: > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Hi all! > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > As many of you know, there are a ongoing > > > >> efforts > > > >> > to > > > >> > > > > > > > consolidate > > > >> > > > > > > > > > the > > > >> > > > > > > > > > > > > > > streaming API for the next release, and > > then > > > >> > > graduate > > > >> > > > > it > > > >> > > > > > > > (from > > > >> > > > > > > > > > beta > > > >> > > > > > > > > > > > > > > status). > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > In the process of this consolidation, we > > > want > > > >> to > > > >> > > > > achieve > > > >> > > > > > > the > > > >> > > > > > > > > > > > following > > > >> > > > > > > > > > > > > > > goals. > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Make the code more robust and > simplify > > it > > > >> in > > > >> > > parts > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Clearly define the semantics of the > > > >> > constructs. > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Prepare it for support of more > advanced > > > >> > > concepts, > > > >> > > > > like > > > >> > > > > > > > > > > > partitionable > > > >> > > > > > > > > > > > > > > state, and event time. > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Cut support for certain corner cases > > that > > > >> were > > > >> > > > > > > prototyped, > > > >> > > > > > > > > but > > > >> > > > > > > > > > > > > turned > > > >> > > > > > > > > > > > > > > out to be not efficiently doable > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Based on prior discussions on the > mailing > > > >> list, > > > >> > > > > Aljoscha > > > >> > > > > > > and > > > >> > > > > > > > me > > > >> > > > > > > > > > > > drafted > > > >> > > > > > > > > > > > > > the > > > >> > > > > > > > > > > > > > > design documents below, which outline > how > > > the > > > >> > > > > > consolidated > > > >> > > > > > > > API > > > >> > > > > > > > > > > would > > > >> > > > > > > > > > > > > > like. > > > >> > > > > > > > > > > > > > > We focused in constructs, time, and > window > > > >> > > semantics. > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Design document on how to restructure > the > > > >> > Streaming > > > >> > > > > API: > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Design document on definitions of time, > > > order, > > > >> > and > > > >> > > > the > > > >> > > > > > > > > resulting > > > >> > > > > > > > > > > > > > semantics: > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Note: The design of the interfaces and > > > >> concepts > > > >> > for > > > >> > > > > > > advanced > > > >> > > > > > > > > > state > > > >> > > > > > > > > > > in > > > >> > > > > > > > > > > > > > > functions is not in here. That is part > of > > a > > > >> > > separate > > > >> > > > > > design > > > >> > > > > > > > > > > > discussion > > > >> > > > > > > > > > > > > > and > > > >> > > > > > > > > > > > > > > orthogonal to the designs drafted here. > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Please have a look and voice questions > and > > > >> > > concerns. > > > >> > > > > > Since > > > >> > > > > > > we > > > >> > > > > > > > > > > should > > > >> > > > > > > > > > > > > not > > > >> > > > > > > > > > > > > > > break the streaming API more than once, > we > > > >> should > > > >> > > > make > > > >> > > > > > sure > > > >> > > > > > > > > this > > > >> > > > > > > > > > > > > > > consolidation brings it into the shape > we > > > >> want it > > > >> > > to > > > >> > > > be > > > >> > > > > > in. > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Greetings, > > > >> > > > > > > > > > > > > > > Stephan > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > > > > > > > > > |
In reply to this post by Gyula Fóra
There is no inconsistency between the Batch and Streaming API. They have
different semantics - the batch API is implicitly always windowed. There is a naming difference between the two APIs. There is a strong inconsistency within the Streaming API right now. Grouping and aggregating without windows is plain dangerous in streaming. It either blows up or is undefined in its behavior. On Tue, Jul 14, 2015 at 12:40 PM, Gyula Fóra <[hidden email]> wrote: > I see your point, reduceByKey is much clearer. > > The question is whether we want to introduce this inconsistency across the > two api-s or stick with what we have. > On Tue, Jul 14, 2015 at 10:57 AM Aljoscha Krettek <[hidden email]> > wrote: > > > I agree, the groupBy, in the batch API is misleading, since a > > ds.groupBy().reduce() does not really build any groups, it is really a > > ds.keyBy().reduceByKey(). In the streaming API we can still fix this, > IMHO. > > > > On Tue, 14 Jul 2015 at 10:56 Stephan Ewen <[hidden email]> wrote: > > > > > It is not a bit different than the batch API, because streaming > semantics > > > are a bit different ;-) > > > > > > One good thing is that we can make things better that were sub-optimal > in > > > the Batch API. > > > > > > On Tue, Jul 14, 2015 at 10:55 AM, Stephan Ewen <[hidden email]> > wrote: > > > > > > > keyBy() does not do any grouping. Grouping in streams in not defined > > > > without windows. > > > > > > > > On Tue, Jul 14, 2015 at 10:48 AM, Gyula Fóra <[hidden email]> > > > wrote: > > > > > > > >> If we only want to have either keyBy or groupBy, why not keep > groupBy? > > > >> That > > > >> would be more consistent with the batch api. > > > >> On Tue, Jul 14, 2015 at 10:35 AM Stephan Ewen <[hidden email]> > > wrote: > > > >> > > > >> > Concerning your comments: > > > >> > > > > >> > 1) In the new design, there is no grouping without windowing. The > > > >> > KeyedDataStream subsumes the grouping and key-ing for partitioned > > > state. > > > >> > > > > >> > The keyBy() + window() makes a parallel grouped window > > > >> > keyBy() alone allows access to partitioned state. > > > >> > > > > >> > My thought was that this is simpler, because it needs not > > > groupBy() > > > >> and > > > >> > keyBy(), but one construct to handle both cases. > > > >> > > > > >> > 2) The discretization is a rough thought and is nothing for the > > short > > > >> term. > > > >> > It totally needs more thoughts. I put it there to have it as a > > sketch > > > >> for > > > >> > how to evolve this. > > > >> > > > > >> > The idea is of course to not have a single data set, but a > > series > > > of > > > >> > data set. In each discrete time slice, the data set can be treated > > > like > > > >> a > > > >> > regular data set. > > > >> > > > > >> > Let's kick off a separate design for the discretization. Joins > > are > > > >> good > > > >> > to talk about (data sets can be joined with data set), and I am > sure > > > >> there > > > >> > are more questions coming up. > > > >> > > > > >> > > > > >> > Does that make sense? > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > On Tue, Jul 14, 2015 at 10:05 AM, Gyula Fóra < > [hidden email]> > > > >> wrote: > > > >> > > > > >> > > I think Marton has some good points here. > > > >> > > > > > >> > > 1) Is KeyedDataStream a better name if this is only a renaming? > > > >> > > > > > >> > > 2) the discretize semantics is unclear indeed. Are we operating > > on a > > > >> > single > > > >> > > or sequence of datasets? If the latter why not call it something > > > else > > > >> > > (dstream). How are joins and other binary operators defined for > > > >> different > > > >> > > discretizations etc. > > > >> > > On Mon, Jul 13, 2015 at 7:37 PM Márton Balassi < > > [hidden email] > > > > > > > >> > > wrote: > > > >> > > > > > >> > > > Generally I agree with the new design. Two concerns: > > > >> > > > > > > >> > > > 1) Does KeyedDataStream replace GroupedDataStream or is it the > > > >> latter a > > > >> > > > special case of the former? > > > >> > > > > > > >> > > > The KeyedDataStream as described in the design document is a > bit > > > >> > unclear > > > >> > > > for me. It lists the following usages: > > > >> > > > a) It is the first step in building a window stream, on top > of > > > >> which > > > >> > > the > > > >> > > > grouped/windowed aggregation and reduce-style function can be > > > >> applied > > > >> > > > b) It allows to use the "by-key" state of functions. Here, > > every > > > >> > record > > > >> > > > has access to a state that is scoped by its key. Key-scoped > > state > > > >> can > > > >> > be > > > >> > > > automatically redistributed and repartitioned. > > > >> > > > > > > >> > > > The code snippet describes a use case where the computation > and > > > the > > > >> > > access > > > >> > > > of the state is used the way currently the GroupedDataStream > > > should > > > >> > > work. I > > > >> > > > suppose this is the example for case b). Would case a) also > > window > > > >> > > elements > > > >> > > > by key? If yes, then this is practically a renaming and > > > enhancement > > > >> of > > > >> > > the > > > >> > > > GroupedDataStream functionality with keyed state. Then the > > > >> > > > StreamExecutionEnvironment.createKeyedStream(Partitioner, > > > >> > > > KeySelector)construction does not make much sense as the user > > only > > > >> > > operates > > > >> > > > within the scope of the keyselector and not the partitioner > > > anyway. > > > >> > > > > > > >> > > > I personally think KeyedDataStream as a name does not > > necessarily > > > >> > suggest > > > >> > > > that the records are grouped by key, it only suggests > > partitioning > > > >> by > > > >> > > key - > > > >> > > > at least for me. :) > > > >> > > > > > > >> > > > 2) The API for discretization is not convenient IMHO > > > >> > > > > > > >> > > > The discretization part declares that the output of > > > >> > > DataStream.discretize() > > > >> > > > is a sequence of DataSets. I love this approach, but then in > the > > > >> code > > > >> > > > snippet the return value of this function is simply a DataSet > > and > > > >> uses > > > >> > it > > > >> > > > as such. The take home message of that code is the following: > > this > > > >> is > > > >> > > > actually the way you would like to program on these sequence > of > > > >> > DataSets, > > > >> > > > most probably you would like to do the same with each of them. > > If > > > >> that > > > >> > is > > > >> > > > the case we should provide a nice utility for that. I think > > Spark > > > >> > > > Streaming's DStream.foreachRDD() is fairly useful for this > > > purpose. > > > >> > > > > > > >> > > > On Mon, Jul 13, 2015 at 6:30 PM, Gyula Fóra < > > [hidden email] > > > > > > > >> > > wrote: > > > >> > > > > > > >> > > > > +1 > > > >> > > > > On Mon, Jul 13, 2015 at 6:23 PM Stephan Ewen < > > [hidden email]> > > > >> > wrote: > > > >> > > > > > > > >> > > > > > If naming is the only concern, then we should go ahead, > > > because > > > >> we > > > >> > > can > > > >> > > > > > change names easily (before the release). > > > >> > > > > > > > > >> > > > > > In fact, I don't think it leaves a bad impression. Global > > > >> windows > > > >> > are > > > >> > > > > > non-parallel windows. There are also parallel windows. > Pick > > > what > > > >> > you > > > >> > > > need > > > >> > > > > > and what works. > > > >> > > > > > > > > >> > > > > > > > > >> > > > > > On Mon, Jul 13, 2015 at 6:13 PM, Gyula Fóra < > > > >> [hidden email]> > > > >> > > > > wrote: > > > >> > > > > > > > > >> > > > > > > I think we agree on everything its more of a naming > issue > > :) > > > >> > > > > > > > > > >> > > > > > > I thought it might be misleading that global time > windows > > > are > > > >> > > > > > > "non-parallel" windows. We dont want to give a bad > > > impression. > > > >> > > (Also > > > >> > > > we > > > >> > > > > > > dont want them to think that every global window is > > parallel > > > >> but > > > >> > > > thats > > > >> > > > > > not > > > >> > > > > > > a problem here) > > > >> > > > > > > > > > >> > > > > > > Gyula > > > >> > > > > > > On Mon, Jul 13, 2015 at 5:22 PM Stephan Ewen < > > > >> [hidden email]> > > > >> > > > wrote: > > > >> > > > > > > > > > >> > > > > > > > Okay, what is missing about the windowing in your > > opinion? > > > >> > > > > > > > > > > >> > > > > > > > The core points of the document are: > > > >> > > > > > > > > > > >> > > > > > > > - The parallel windows are per group only. > > > >> > > > > > > > > > > >> > > > > > > > - The implementation of the parallel windows holds > > > window > > > >> > data > > > >> > > in > > > >> > > > > the > > > >> > > > > > > > group buffers. > > > >> > > > > > > > > > > >> > > > > > > > - The global windows are non-parallel. May have > > parallel > > > >> > > > > > > pre-aggregation, > > > >> > > > > > > > if they are time windows. > > > >> > > > > > > > > > > >> > > > > > > > - Time may be operator time (timer thread), or > > watermark > > > >> > time. > > > >> > > > > > > Watermark > > > >> > > > > > > > time can refer to ingress or event time. > > > >> > > > > > > > > > > >> > > > > > > > - Windows that do not pre-aggregate may require > > elements > > > >> in > > > >> > > > order. > > > >> > > > > > Not > > > >> > > > > > > > part of the first prototype. > > > >> > > > > > > > > > > >> > > > > > > > Do we agree on those points? > > > >> > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > On Mon, Jul 13, 2015 at 4:50 PM, Gyula Fóra < > > > >> > > [hidden email]> > > > >> > > > > > > wrote: > > > >> > > > > > > > > > > >> > > > > > > > > In general I like it, although the main difference > > > between > > > >> > the > > > >> > > > > > current > > > >> > > > > > > > and > > > >> > > > > > > > > the new one is the windowing and that is still not > > very > > > >> > clear. > > > >> > > > > > > > > > > > >> > > > > > > > > Where do we have the full stream time windows for > > > >> > > instance?(which > > > >> > > > > is > > > >> > > > > > > > > parallel but not keyed) > > > >> > > > > > > > > On Mon, Jul 13, 2015 at 4:28 PM Aljoscha Krettek < > > > >> > > > > > [hidden email]> > > > >> > > > > > > > > wrote: > > > >> > > > > > > > > > > > >> > > > > > > > > > +1 I like it as well. > > > >> > > > > > > > > > > > > >> > > > > > > > > > On Mon, 13 Jul 2015 at 16:17 Kostas Tzoumas < > > > >> > > > [hidden email] > > > >> > > > > > > > > >> > > > > > > > wrote: > > > >> > > > > > > > > > > > > >> > > > > > > > > > > +1 from my side > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > On Mon, Jul 13, 2015 at 4:15 PM, Stephan Ewen < > > > >> > > > > [hidden email]> > > > >> > > > > > > > > wrote: > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > Do we have consensus on these designs? > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > If we have, we should get to implementing this > > > soon, > > > >> > > > because > > > >> > > > > > > > > basically > > > >> > > > > > > > > > > all > > > >> > > > > > > > > > > > streaming patches will have to be revisited in > > > >> light of > > > >> > > > > this... > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > On Tue, Jul 7, 2015 at 3:41 PM, Gyula Fóra < > > > >> > > > > > [hidden email] > > > >> > > > > > > > > > > >> > > > > > > > > > wrote: > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > You are right thats an important issue. > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > And I think we should also do some renaming > > with > > > >> the > > > >> > > > > > > "iterations" > > > >> > > > > > > > > > > because > > > >> > > > > > > > > > > > > they are not really iterations like in the > > batch > > > >> case > > > >> > > and > > > >> > > > > it > > > >> > > > > > > > might > > > >> > > > > > > > > > > > confuse > > > >> > > > > > > > > > > > > some users. > > > >> > > > > > > > > > > > > Maybe we can call them loops or cycles and > > > rename > > > >> the > > > >> > > api > > > >> > > > > > calls > > > >> > > > > > > > to > > > >> > > > > > > > > > make > > > >> > > > > > > > > > > > it > > > >> > > > > > > > > > > > > more intuitive what happens. It is really > > just a > > > >> > cyclic > > > >> > > > > > > dataflow. > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > Aljoscha Krettek <[hidden email]> ezt > > írta > > > >> > > > (időpont: > > > >> > > > > > > 2015. > > > >> > > > > > > > > júl. > > > >> > > > > > > > > > > 7., > > > >> > > > > > > > > > > > > K, > > > >> > > > > > > > > > > > > 15:35): > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Hi, > > > >> > > > > > > > > > > > > > I just noticed that we don't have anything > > > about > > > >> > how > > > >> > > > > > > iterations > > > >> > > > > > > > > and > > > >> > > > > > > > > > > > > > timestamps/watermarks should interact. > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Cheers, > > > >> > > > > > > > > > > > > > Aljoscha > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > On Mon, 6 Jul 2015 at 23:56 Stephan Ewen < > > > >> > > > > [hidden email] > > > >> > > > > > > > > > >> > > > > > > > > wrote: > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Hi all! > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > As many of you know, there are a ongoing > > > >> efforts > > > >> > to > > > >> > > > > > > > consolidate > > > >> > > > > > > > > > the > > > >> > > > > > > > > > > > > > > streaming API for the next release, and > > then > > > >> > > graduate > > > >> > > > > it > > > >> > > > > > > > (from > > > >> > > > > > > > > > beta > > > >> > > > > > > > > > > > > > > status). > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > In the process of this consolidation, we > > > want > > > >> to > > > >> > > > > achieve > > > >> > > > > > > the > > > >> > > > > > > > > > > > following > > > >> > > > > > > > > > > > > > > goals. > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Make the code more robust and > simplify > > it > > > >> in > > > >> > > parts > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Clearly define the semantics of the > > > >> > constructs. > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Prepare it for support of more > advanced > > > >> > > concepts, > > > >> > > > > like > > > >> > > > > > > > > > > > partitionable > > > >> > > > > > > > > > > > > > > state, and event time. > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Cut support for certain corner cases > > that > > > >> were > > > >> > > > > > > prototyped, > > > >> > > > > > > > > but > > > >> > > > > > > > > > > > > turned > > > >> > > > > > > > > > > > > > > out to be not efficiently doable > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Based on prior discussions on the > mailing > > > >> list, > > > >> > > > > Aljoscha > > > >> > > > > > > and > > > >> > > > > > > > me > > > >> > > > > > > > > > > > drafted > > > >> > > > > > > > > > > > > > the > > > >> > > > > > > > > > > > > > > design documents below, which outline > how > > > the > > > >> > > > > > consolidated > > > >> > > > > > > > API > > > >> > > > > > > > > > > would > > > >> > > > > > > > > > > > > > like. > > > >> > > > > > > > > > > > > > > We focused in constructs, time, and > window > > > >> > > semantics. > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Design document on how to restructure > the > > > >> > Streaming > > > >> > > > > API: > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Design document on definitions of time, > > > order, > > > >> > and > > > >> > > > the > > > >> > > > > > > > > resulting > > > >> > > > > > > > > > > > > > semantics: > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Note: The design of the interfaces and > > > >> concepts > > > >> > for > > > >> > > > > > > advanced > > > >> > > > > > > > > > state > > > >> > > > > > > > > > > in > > > >> > > > > > > > > > > > > > > functions is not in here. That is part > of > > a > > > >> > > separate > > > >> > > > > > design > > > >> > > > > > > > > > > > discussion > > > >> > > > > > > > > > > > > > and > > > >> > > > > > > > > > > > > > > orthogonal to the designs drafted here. > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Please have a look and voice questions > and > > > >> > > concerns. > > > >> > > > > > Since > > > >> > > > > > > we > > > >> > > > > > > > > > > should > > > >> > > > > > > > > > > > > not > > > >> > > > > > > > > > > > > > > break the streaming API more than once, > we > > > >> should > > > >> > > > make > > > >> > > > > > sure > > > >> > > > > > > > > this > > > >> > > > > > > > > > > > > > > consolidation brings it into the shape > we > > > >> want it > > > >> > > to > > > >> > > > be > > > >> > > > > > in. > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Greetings, > > > >> > > > > > > > > > > > > > > Stephan > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > >> > > > > > > > > > > > > > > |
Ok, thanks for the clarification. Let us try to document it in a way that
those thoughts are reflected then. Discretization will not happen upfront we can wait with that. On Tue, Jul 14, 2015 at 12:49 PM, Stephan Ewen <[hidden email]> wrote: > There is no inconsistency between the Batch and Streaming API. They have > different semantics - the batch API is implicitly always windowed. > > There is a naming difference between the two APIs. > > There is a strong inconsistency within the Streaming API right now. > Grouping and aggregating without windows is plain dangerous in streaming. > It either blows up or is undefined in its behavior. > > > > On Tue, Jul 14, 2015 at 12:40 PM, Gyula Fóra <[hidden email]> wrote: > > > I see your point, reduceByKey is much clearer. > > > > The question is whether we want to introduce this inconsistency across > the > > two api-s or stick with what we have. > > On Tue, Jul 14, 2015 at 10:57 AM Aljoscha Krettek <[hidden email]> > > wrote: > > > > > I agree, the groupBy, in the batch API is misleading, since a > > > ds.groupBy().reduce() does not really build any groups, it is really a > > > ds.keyBy().reduceByKey(). In the streaming API we can still fix this, > > IMHO. > > > > > > On Tue, 14 Jul 2015 at 10:56 Stephan Ewen <[hidden email]> wrote: > > > > > > > It is not a bit different than the batch API, because streaming > > semantics > > > > are a bit different ;-) > > > > > > > > One good thing is that we can make things better that were > sub-optimal > > in > > > > the Batch API. > > > > > > > > On Tue, Jul 14, 2015 at 10:55 AM, Stephan Ewen <[hidden email]> > > wrote: > > > > > > > > > keyBy() does not do any grouping. Grouping in streams in not > defined > > > > > without windows. > > > > > > > > > > On Tue, Jul 14, 2015 at 10:48 AM, Gyula Fóra <[hidden email] > > > > > > wrote: > > > > > > > > > >> If we only want to have either keyBy or groupBy, why not keep > > groupBy? > > > > >> That > > > > >> would be more consistent with the batch api. > > > > >> On Tue, Jul 14, 2015 at 10:35 AM Stephan Ewen <[hidden email]> > > > wrote: > > > > >> > > > > >> > Concerning your comments: > > > > >> > > > > > >> > 1) In the new design, there is no grouping without windowing. > The > > > > >> > KeyedDataStream subsumes the grouping and key-ing for > partitioned > > > > state. > > > > >> > > > > > >> > The keyBy() + window() makes a parallel grouped window > > > > >> > keyBy() alone allows access to partitioned state. > > > > >> > > > > > >> > My thought was that this is simpler, because it needs not > > > > groupBy() > > > > >> and > > > > >> > keyBy(), but one construct to handle both cases. > > > > >> > > > > > >> > 2) The discretization is a rough thought and is nothing for the > > > short > > > > >> term. > > > > >> > It totally needs more thoughts. I put it there to have it as a > > > sketch > > > > >> for > > > > >> > how to evolve this. > > > > >> > > > > > >> > The idea is of course to not have a single data set, but a > > > series > > > > of > > > > >> > data set. In each discrete time slice, the data set can be > treated > > > > like > > > > >> a > > > > >> > regular data set. > > > > >> > > > > > >> > Let's kick off a separate design for the discretization. > Joins > > > are > > > > >> good > > > > >> > to talk about (data sets can be joined with data set), and I am > > sure > > > > >> there > > > > >> > are more questions coming up. > > > > >> > > > > > >> > > > > > >> > Does that make sense? > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > > > > > >> > On Tue, Jul 14, 2015 at 10:05 AM, Gyula Fóra < > > [hidden email]> > > > > >> wrote: > > > > >> > > > > > >> > > I think Marton has some good points here. > > > > >> > > > > > > >> > > 1) Is KeyedDataStream a better name if this is only a > renaming? > > > > >> > > > > > > >> > > 2) the discretize semantics is unclear indeed. Are we > operating > > > on a > > > > >> > single > > > > >> > > or sequence of datasets? If the latter why not call it > something > > > > else > > > > >> > > (dstream). How are joins and other binary operators defined > for > > > > >> different > > > > >> > > discretizations etc. > > > > >> > > On Mon, Jul 13, 2015 at 7:37 PM Márton Balassi < > > > [hidden email] > > > > > > > > > >> > > wrote: > > > > >> > > > > > > >> > > > Generally I agree with the new design. Two concerns: > > > > >> > > > > > > > >> > > > 1) Does KeyedDataStream replace GroupedDataStream or is it > the > > > > >> latter a > > > > >> > > > special case of the former? > > > > >> > > > > > > > >> > > > The KeyedDataStream as described in the design document is a > > bit > > > > >> > unclear > > > > >> > > > for me. It lists the following usages: > > > > >> > > > a) It is the first step in building a window stream, on > top > > of > > > > >> which > > > > >> > > the > > > > >> > > > grouped/windowed aggregation and reduce-style function can > be > > > > >> applied > > > > >> > > > b) It allows to use the "by-key" state of functions. Here, > > > every > > > > >> > record > > > > >> > > > has access to a state that is scoped by its key. Key-scoped > > > state > > > > >> can > > > > >> > be > > > > >> > > > automatically redistributed and repartitioned. > > > > >> > > > > > > > >> > > > The code snippet describes a use case where the computation > > and > > > > the > > > > >> > > access > > > > >> > > > of the state is used the way currently the GroupedDataStream > > > > should > > > > >> > > work. I > > > > >> > > > suppose this is the example for case b). Would case a) also > > > window > > > > >> > > elements > > > > >> > > > by key? If yes, then this is practically a renaming and > > > > enhancement > > > > >> of > > > > >> > > the > > > > >> > > > GroupedDataStream functionality with keyed state. Then the > > > > >> > > > StreamExecutionEnvironment.createKeyedStream(Partitioner, > > > > >> > > > KeySelector)construction does not make much sense as the > user > > > only > > > > >> > > operates > > > > >> > > > within the scope of the keyselector and not the partitioner > > > > anyway. > > > > >> > > > > > > > >> > > > I personally think KeyedDataStream as a name does not > > > necessarily > > > > >> > suggest > > > > >> > > > that the records are grouped by key, it only suggests > > > partitioning > > > > >> by > > > > >> > > key - > > > > >> > > > at least for me. :) > > > > >> > > > > > > > >> > > > 2) The API for discretization is not convenient IMHO > > > > >> > > > > > > > >> > > > The discretization part declares that the output of > > > > >> > > DataStream.discretize() > > > > >> > > > is a sequence of DataSets. I love this approach, but then in > > the > > > > >> code > > > > >> > > > snippet the return value of this function is simply a > DataSet > > > and > > > > >> uses > > > > >> > it > > > > >> > > > as such. The take home message of that code is the > following: > > > this > > > > >> is > > > > >> > > > actually the way you would like to program on these sequence > > of > > > > >> > DataSets, > > > > >> > > > most probably you would like to do the same with each of > them. > > > If > > > > >> that > > > > >> > is > > > > >> > > > the case we should provide a nice utility for that. I think > > > Spark > > > > >> > > > Streaming's DStream.foreachRDD() is fairly useful for this > > > > purpose. > > > > >> > > > > > > > >> > > > On Mon, Jul 13, 2015 at 6:30 PM, Gyula Fóra < > > > [hidden email] > > > > > > > > > >> > > wrote: > > > > >> > > > > > > > >> > > > > +1 > > > > >> > > > > On Mon, Jul 13, 2015 at 6:23 PM Stephan Ewen < > > > [hidden email]> > > > > >> > wrote: > > > > >> > > > > > > > > >> > > > > > If naming is the only concern, then we should go ahead, > > > > because > > > > >> we > > > > >> > > can > > > > >> > > > > > change names easily (before the release). > > > > >> > > > > > > > > > >> > > > > > In fact, I don't think it leaves a bad impression. > Global > > > > >> windows > > > > >> > are > > > > >> > > > > > non-parallel windows. There are also parallel windows. > > Pick > > > > what > > > > >> > you > > > > >> > > > need > > > > >> > > > > > and what works. > > > > >> > > > > > > > > > >> > > > > > > > > > >> > > > > > On Mon, Jul 13, 2015 at 6:13 PM, Gyula Fóra < > > > > >> [hidden email]> > > > > >> > > > > wrote: > > > > >> > > > > > > > > > >> > > > > > > I think we agree on everything its more of a naming > > issue > > > :) > > > > >> > > > > > > > > > > >> > > > > > > I thought it might be misleading that global time > > windows > > > > are > > > > >> > > > > > > "non-parallel" windows. We dont want to give a bad > > > > impression. > > > > >> > > (Also > > > > >> > > > we > > > > >> > > > > > > dont want them to think that every global window is > > > parallel > > > > >> but > > > > >> > > > thats > > > > >> > > > > > not > > > > >> > > > > > > a problem here) > > > > >> > > > > > > > > > > >> > > > > > > Gyula > > > > >> > > > > > > On Mon, Jul 13, 2015 at 5:22 PM Stephan Ewen < > > > > >> [hidden email]> > > > > >> > > > wrote: > > > > >> > > > > > > > > > > >> > > > > > > > Okay, what is missing about the windowing in your > > > opinion? > > > > >> > > > > > > > > > > > >> > > > > > > > The core points of the document are: > > > > >> > > > > > > > > > > > >> > > > > > > > - The parallel windows are per group only. > > > > >> > > > > > > > > > > > >> > > > > > > > - The implementation of the parallel windows holds > > > > window > > > > >> > data > > > > >> > > in > > > > >> > > > > the > > > > >> > > > > > > > group buffers. > > > > >> > > > > > > > > > > > >> > > > > > > > - The global windows are non-parallel. May have > > > parallel > > > > >> > > > > > > pre-aggregation, > > > > >> > > > > > > > if they are time windows. > > > > >> > > > > > > > > > > > >> > > > > > > > - Time may be operator time (timer thread), or > > > watermark > > > > >> > time. > > > > >> > > > > > > Watermark > > > > >> > > > > > > > time can refer to ingress or event time. > > > > >> > > > > > > > > > > > >> > > > > > > > - Windows that do not pre-aggregate may require > > > elements > > > > >> in > > > > >> > > > order. > > > > >> > > > > > Not > > > > >> > > > > > > > part of the first prototype. > > > > >> > > > > > > > > > > > >> > > > > > > > Do we agree on those points? > > > > >> > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > On Mon, Jul 13, 2015 at 4:50 PM, Gyula Fóra < > > > > >> > > [hidden email]> > > > > >> > > > > > > wrote: > > > > >> > > > > > > > > > > > >> > > > > > > > > In general I like it, although the main difference > > > > between > > > > >> > the > > > > >> > > > > > current > > > > >> > > > > > > > and > > > > >> > > > > > > > > the new one is the windowing and that is still not > > > very > > > > >> > clear. > > > > >> > > > > > > > > > > > > >> > > > > > > > > Where do we have the full stream time windows for > > > > >> > > instance?(which > > > > >> > > > > is > > > > >> > > > > > > > > parallel but not keyed) > > > > >> > > > > > > > > On Mon, Jul 13, 2015 at 4:28 PM Aljoscha Krettek < > > > > >> > > > > > [hidden email]> > > > > >> > > > > > > > > wrote: > > > > >> > > > > > > > > > > > > >> > > > > > > > > > +1 I like it as well. > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > On Mon, 13 Jul 2015 at 16:17 Kostas Tzoumas < > > > > >> > > > [hidden email] > > > > >> > > > > > > > > > >> > > > > > > > wrote: > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > +1 from my side > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > On Mon, Jul 13, 2015 at 4:15 PM, Stephan Ewen > < > > > > >> > > > > [hidden email]> > > > > >> > > > > > > > > wrote: > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > Do we have consensus on these designs? > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > If we have, we should get to implementing > this > > > > soon, > > > > >> > > > because > > > > >> > > > > > > > > basically > > > > >> > > > > > > > > > > all > > > > >> > > > > > > > > > > > streaming patches will have to be revisited > in > > > > >> light of > > > > >> > > > > this... > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > On Tue, Jul 7, 2015 at 3:41 PM, Gyula Fóra < > > > > >> > > > > > [hidden email] > > > > >> > > > > > > > > > > > >> > > > > > > > > > wrote: > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > You are right thats an important issue. > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > And I think we should also do some > renaming > > > with > > > > >> the > > > > >> > > > > > > "iterations" > > > > >> > > > > > > > > > > because > > > > >> > > > > > > > > > > > > they are not really iterations like in the > > > batch > > > > >> case > > > > >> > > and > > > > >> > > > > it > > > > >> > > > > > > > might > > > > >> > > > > > > > > > > > confuse > > > > >> > > > > > > > > > > > > some users. > > > > >> > > > > > > > > > > > > Maybe we can call them loops or cycles and > > > > rename > > > > >> the > > > > >> > > api > > > > >> > > > > > calls > > > > >> > > > > > > > to > > > > >> > > > > > > > > > make > > > > >> > > > > > > > > > > > it > > > > >> > > > > > > > > > > > > more intuitive what happens. It is really > > > just a > > > > >> > cyclic > > > > >> > > > > > > dataflow. > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > Aljoscha Krettek <[hidden email]> > ezt > > > írta > > > > >> > > > (időpont: > > > > >> > > > > > > 2015. > > > > >> > > > > > > > > júl. > > > > >> > > > > > > > > > > 7., > > > > >> > > > > > > > > > > > > K, > > > > >> > > > > > > > > > > > > 15:35): > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Hi, > > > > >> > > > > > > > > > > > > > I just noticed that we don't have > anything > > > > about > > > > >> > how > > > > >> > > > > > > iterations > > > > >> > > > > > > > > and > > > > >> > > > > > > > > > > > > > timestamps/watermarks should interact. > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > Cheers, > > > > >> > > > > > > > > > > > > > Aljoscha > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > On Mon, 6 Jul 2015 at 23:56 Stephan > Ewen < > > > > >> > > > > [hidden email] > > > > >> > > > > > > > > > > >> > > > > > > > > wrote: > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Hi all! > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > As many of you know, there are a > ongoing > > > > >> efforts > > > > >> > to > > > > >> > > > > > > > consolidate > > > > >> > > > > > > > > > the > > > > >> > > > > > > > > > > > > > > streaming API for the next release, > and > > > then > > > > >> > > graduate > > > > >> > > > > it > > > > >> > > > > > > > (from > > > > >> > > > > > > > > > beta > > > > >> > > > > > > > > > > > > > > status). > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > In the process of this consolidation, > we > > > > want > > > > >> to > > > > >> > > > > achieve > > > > >> > > > > > > the > > > > >> > > > > > > > > > > > following > > > > >> > > > > > > > > > > > > > > goals. > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Make the code more robust and > > simplify > > > it > > > > >> in > > > > >> > > parts > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Clearly define the semantics of the > > > > >> > constructs. > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Prepare it for support of more > > advanced > > > > >> > > concepts, > > > > >> > > > > like > > > > >> > > > > > > > > > > > partitionable > > > > >> > > > > > > > > > > > > > > state, and event time. > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > - Cut support for certain corner > cases > > > that > > > > >> were > > > > >> > > > > > > prototyped, > > > > >> > > > > > > > > but > > > > >> > > > > > > > > > > > > turned > > > > >> > > > > > > > > > > > > > > out to be not efficiently doable > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Based on prior discussions on the > > mailing > > > > >> list, > > > > >> > > > > Aljoscha > > > > >> > > > > > > and > > > > >> > > > > > > > me > > > > >> > > > > > > > > > > > drafted > > > > >> > > > > > > > > > > > > > the > > > > >> > > > > > > > > > > > > > > design documents below, which outline > > how > > > > the > > > > >> > > > > > consolidated > > > > >> > > > > > > > API > > > > >> > > > > > > > > > > would > > > > >> > > > > > > > > > > > > > like. > > > > >> > > > > > > > > > > > > > > We focused in constructs, time, and > > window > > > > >> > > semantics. > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Design document on how to restructure > > the > > > > >> > Streaming > > > > >> > > > > API: > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Streams+and+Operations+on+Streams > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Design document on definitions of > time, > > > > order, > > > > >> > and > > > > >> > > > the > > > > >> > > > > > > > > resulting > > > > >> > > > > > > > > > > > > > semantics: > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/Time+and+Order+in+Streams > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Note: The design of the interfaces and > > > > >> concepts > > > > >> > for > > > > >> > > > > > > advanced > > > > >> > > > > > > > > > state > > > > >> > > > > > > > > > > in > > > > >> > > > > > > > > > > > > > > functions is not in here. That is part > > of > > > a > > > > >> > > separate > > > > >> > > > > > design > > > > >> > > > > > > > > > > > discussion > > > > >> > > > > > > > > > > > > > and > > > > >> > > > > > > > > > > > > > > orthogonal to the designs drafted > here. > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Please have a look and voice questions > > and > > > > >> > > concerns. > > > > >> > > > > > Since > > > > >> > > > > > > we > > > > >> > > > > > > > > > > should > > > > >> > > > > > > > > > > > > not > > > > >> > > > > > > > > > > > > > > break the streaming API more than > once, > > we > > > > >> should > > > > >> > > > make > > > > >> > > > > > sure > > > > >> > > > > > > > > this > > > > >> > > > > > > > > > > > > > > consolidation brings it into the shape > > we > > > > >> want it > > > > >> > > to > > > > >> > > > be > > > > >> > > > > > in. > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > Greetings, > > > > >> > > > > > > > > > > > > > > Stephan > > > > >> > > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > > >> > > > > > > > > > > > > > > >> > > > > > > > > > > > > > >> > > > > > > > > > > > > >> > > > > > > > > > > > >> > > > > > > > > > > >> > > > > > > > > > >> > > > > > > > > >> > > > > > > > >> > > > > > > >> > > > > > >> > > > > > > > > > > > > > > > > > > > > |
Free forum by Nabble | Edit this page |