Hey all,
I've noticed a few times now, when trying to help users implement particular things in the Flink API, that it can be complicated to map what they know they are trying to do onto higher-level Flink concepts such as windowing or Connect/CoFlatMap/ValueState, etc.

At some point it just becomes easier to think about writing a Flink operator yourself that is integrated into the pipeline with a transform() call.

It can just be easier to think at a more basic level. For example, I can write an operator that consumes one or two input streams (should probably be N), updates state which is managed for me fault-tolerantly, and outputs elements or sets up timers/triggers that give me callbacks from which I can also update state or emit elements. (A minimal sketch of such an operator follows at the end of this mail.)

When you think at this level you realize you can program just about anything you want. You can create whatever fault-tolerant data structures you want, and easily execute robust stateful computation over data streams at scale. This is the real technology and power of Flink, IMO.

Also, at this level I don't have to think about the complexities of windowing semantics, learn as much API, etc. I can easily have some inputs that are broadcast, others that are keyed, manage my own state in whatever data structure makes sense, etc. If I know exactly what I actually want to do, I can just do it with the full power of my chosen language, data structures, etc. I'm not "restricted" to trying to map everything onto higher-level Flink constructs, which is sometimes actually more complicated.

Programming at this level is actually fairly easy to do, but people seem a bit afraid of this level of the API. They think of it as low-level or custom hacking.

Anyway, I guess my thought is this: should we explain Flink to people at this level *first*? Show that you have nearly unlimited power and flexibility to build what you want, *and only then* explain the higher-level APIs they can use *if* those match their use cases well.

Would this better demonstrate to people the power of Flink and maybe *liberate* them a bit from feeling they have to map their problem onto a more complex set of higher-level primitives? I see people trying to shoe-horn what they are really trying to do, which is simple to explain in English, onto windows, triggers, CoFlatMaps, etc., and this gets complicated sometimes. It's like an impedance mismatch, when the problem could often be solved very easily in straight Java/Scala.

Anyway, it's very easy to drop down a level in the API and program whatever you want, but users don't seem to *perceive* it that way.

Just some thoughts... Any feedback? Have any of you had similar experiences when working with newer Flink users, or as a newer Flink user yourself? Can/should we do anything to make the *lower*-level API more accessible/visible to users?

-Jamie
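A minimal sketch of such a custom operator (all names are illustrative, and the counter is a plain field for brevity — a real operator would register its state with Flink's checkpointing so it is actually fault tolerant):

```java
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

// Emits a running count of the elements it has seen.
public class RunningCountOperator extends AbstractStreamOperator<Long>
        implements OneInputStreamOperator<String, Long> {

    // For brevity only: not checkpointed, so not actually fault tolerant.
    private long count;

    @Override
    public void processElement(StreamRecord<String> element) throws Exception {
        count++;
        // 'output' is provided by AbstractStreamOperator; the base class
        // also takes care of forwarding watermarks downstream.
        output.collect(new StreamRecord<>(count));
    }
}

// Wired into a pipeline with transform():
// DataStream<Long> counts = input.transform(
//     "running-count", BasicTypeInfo.LONG_TYPE_INFO, new RunningCountOperator());
```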
Hi Jamie,
I agree that it is often much easier to work with the lower-level APIs if you know what you are doing.

I think it would be nice to have very clean abstractions on that level so we could teach this to users first, but currently I think it's not easy enough to be a good starting point.

The user needs to understand a lot about the system if they don't want to hurt other parts of the pipeline: for instance, working with the StreamRecords, propagating watermarks, working with state internals.

All of this might be overwhelming at first glance. But maybe we can slim some abstractions down to the point where this becomes a kind of extension of the RichFunctions.

Cheers,
Gyula
It really depends on the skill level of the developer. Using the low-level API requires thinking about many details (e.g. state handling) that can be done wrong.

As Flink gets a broader community, more people will use it who might not have the skill level required to deal with the low-level API. For more experienced users, it is of course a powerful tool!

I guess it boils down to the question of what type of developer Flink targets, and whether the low-level API should be advertised aggressively or not. Also keep in mind that many people criticized Storm's low-level API as hard to program.

-Matthias
Hi All,
I also thought about this recently. It would be good to add a user-facing operator that behaves more or less like an enhanced FlatMap with multiple inputs, multiple outputs, state access, and keyed timers. I'm a bit hesitant, though, since users rarely think about the implications that come with state updates and out-of-order events. If you don't implement a stateful operator correctly you get pretty much arbitrary results.

The problem with out-of-order event arrival and state updates is that the state basically has to transition monotonically "upwards" through a lattice for the computation to make sense. I know this sounds rather theoretical, so I'll try to explain with an example. Say you have an operator that waits for timestamped elements A, B, C to arrive in timestamp order and then does some processing. The naive approach would be a small state machine that tracks which elements you have seen so far. The state machine has four states: NOTHING, SEEN_A, SEEN_B, and SEEN_C, and is supposed to traverse them linearly as the elements arrive. This doesn't work, however, when elements arrive in an order that does not match their timestamp order. What the user should do instead is keep a "Set" state that tracks the elements seen so far (see the sketch at the end of this mail). Once it has seen {A, B, C} the operator checks the timestamps and then does the processing, if required. The set of possible combinations of A, B, and C forms a lattice under the "subset" relation, and traversal through these sets is monotonically "upwards", so it works regardless of the order the elements arrive in. (I recently pointed this out on the Beam mailing list and Kenneth Knowles rightly pointed out that what I was describing was in fact a lattice.)

I know this is a bit off-topic, but I think it's very easy for users to write wrong operations when they are dealing with state. We should still have a good API for it, though. Just wanted to make people aware of this.

Cheers,
Aljoscha
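A small Java sketch of the two approaches from the example above (names are illustrative):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Fragile: a linear state machine that silently assumes A, B, C arrive
// in timestamp order. Out-of-order arrival leaves it stuck or wrong.
enum Progress { NOTHING, SEEN_A, SEEN_B, SEEN_C }

// Robust: track the *set* of elements seen so far. Adding to a set only
// moves "upwards" through the lattice of subsets, so the result is the
// same regardless of arrival order.
class AbcTracker {
    private final Set<String> seen = new HashSet<>();

    void onElement(String element, long timestamp) {
        seen.add(element);
        if (seen.containsAll(Arrays.asList("A", "B", "C"))) {
            // All three elements are present; only now inspect the
            // timestamps and do the actual processing.
        }
    }
}
```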
Hi,
I'm also not sure whether we should start teaching Flink by demonstrating the low-level APIs. In my experience, people new to Flink should first learn a very basic set of primitive operations, usually map, flatMap, join, windows, etc. The semantics of these operations are well defined, and one doesn't have many opportunities to shoot oneself in the foot. The more restrictive the API is, the less likely it is that something goes wrong. Of course, this sometimes means the program cannot be expressed as elegantly as it could have been.

As an advanced (maybe very advanced) topic, however, we should also cover the lower-level APIs in our documentation. It probably also makes sense to clean them up a little and offer some tooling around them. But given that this level of abstraction involves a lot of details which are hard to grasp for a Flink newbie, I think it's not the perfect starting point for learning Flink.

Cheers,
Till
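For reference, the kind of basic, well-defined program Till means — the classic windowed word count (a sketch; the socket source and port are placeholders):

```java
import org.apache.flink.api.common.functions.FlatMapFunction;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingProcessingTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowedWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        env.socketTextStream("localhost", 9999)
           .flatMap(new FlatMapFunction<String, Tuple2<String, Integer>>() {
               @Override
               public void flatMap(String line, Collector<Tuple2<String, Integer>> out) {
                   for (String word : line.split("\\s+")) {
                       out.collect(Tuple2.of(word, 1));
                   }
               }
           })
           .keyBy(0) // key by the word
           .window(TumblingProcessingTimeWindows.of(Time.seconds(10)))
           .sum(1)   // sum the per-word counts within each window
           .print();

        env.execute("windowed word count");
    }
}
```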
You lost me at lattice, Aljoscha ;)
I do think something like the more powerful N-way FlatMap with timers Aljoscha is describing here would probably solve most of the problem. Often Flink's higher-level primitives work well for people, and that's great. It's just that I also spend a fair amount of time discussing with people how to map what they know they want to do onto operations that aren't a perfect fit, and it sometimes liberates them when they realize they can just implement it the way they want by dropping down a level. They usually don't go there themselves, though.

I mention teaching this "first" and then the higher layers, I guess, because that's just a matter of teaching philosophy. I think it's good to see the basic operations that are available first and then understand that the other abstractions are built on top of them. That way you're not afraid to drop down to basics when you know what you want to get done.

-Jamie
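To make the shape of that idea concrete, a sketch of what such a user-facing primitive could look like — this API is purely hypothetical, not something that exists in Flink today:

```java
import org.apache.flink.util.Collector;

// Hypothetical timer handle; all names below are made up for illustration.
interface TimerRegistry {
    void registerEventTimeTimer(long timestamp);
}

// The "enhanced FlatMap": per-element processing plus timer callbacks,
// with state kept in Flink-managed (checkpointed) state rather than
// plain fields.
public abstract class StatefulFlatMapFunction<IN, OUT> {

    // Called once per input element; may update state, emit, set timers.
    public abstract void processElement(IN value, TimerRegistry timers,
                                        Collector<OUT> out) throws Exception;

    // Called when a previously registered timer fires; may also update
    // state or emit elements.
    public abstract void onTimer(long timestamp, TimerRegistry timers,
                                 Collector<OUT> out) throws Exception;
}
```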
Hi Jamie,
thanks for sharing your thoughts on this! You're raising some interesting points.

Whether users find the lower-level primitives more intuitive depends on their background, I believe. From what I've seen, if users are coming from the S4/Storm world and are used to the "compositional" way of streaming, then indeed it's easier for them to think and operate on that level. These are usually people who have seen/built streaming things before trying out Flink. But if we're talking about analysts and people coming from the "batch" way of thinking, or people used to working with SQL/Python, then the higher-level declarative API is probably easier to understand.

I do think that we should make the lower-level API more visible and document it properly, but I'm not sure we should teach Flink at this level first. I think that presenting it as a set of "advanced" features actually makes more sense.

Cheers,
-Vasia.
Jamie,
I think you raise a valid concern, but I would hesitate to accept the suggestion that the low-level API be promoted to app developers.

Higher-level abstractions tend to be more constrained and more optimized, whereas lower-level abstractions tend to be more powerful, more laborious to use, and to give the system less knowledge to work with. It is a classic tradeoff.

I think it's important to consider what the important/distinguishing characteristics of the Flink framework are: exactly-once guarantees, event-time support, support for job upgrade without data loss, fault tolerance, etc. I'm speculating that the high-level abstraction provided to app developers is probably needed to retain those characteristics.

I think Vasia makes a good point that SQL might be a good alternative way to ease into Flink.

Finally, I believe the low-level API is primarily intended for extension purposes (connectors, operations, etc.), not app development. It could use better documentation to ensure that third-party extensions support those key characteristics.

-Eron
I agree with Vasia that for data scientists it's likely easier to learn the high-level API. I like the material from http://dataartisans.github.io/flink-training/ but all of it focuses on the high-level API. Maybe we could have a guide (blog post, lecture, whatever) on how to get into Flink as a Storm person. That would allow people with that background to dive directly into the lower-level API, working with models similar to what they are used to. I would volunteer, but I'm not familiar with Storm.

For my part, I have always had to use some lower-level API in every one of my applications, most of the time pestering Aljoscha about the details. So either I'm the exception, or there is a need for more complex examples showcasing the lower-level API methods. One thing I have used in several pipelines so far is extracting the start and end timestamps from a window and adding them to the window aggregate. Maybe something like this could be a useful example to include in the training (a rough sketch follows at the end of this mail).

Side question: I assume there are recurring design patterns in the stream applications users develop. Is there any chance we could identify or create some design patterns (similar to Java design patterns)? That would make it easier to use the lower-level API and might help people avoid pitfalls like the one Aljoscha mentioned.

cheers,
Martin

PS: I hope it's fine for me to butt into the discussion like this.
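A rough sketch of that window-timestamp pattern (Event and Result are illustrative types, not part of any existing API):

```java
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

// Illustrative input/output types for the sketch.
class Event { String key; }
class Result {
    final String key; final long windowStart; final long windowEnd; final long count;
    Result(String key, long windowStart, long windowEnd, long count) {
        this.key = key; this.windowStart = windowStart;
        this.windowEnd = windowEnd; this.count = count;
    }
}

// Emits one Result per key and window, carrying the window's start and
// end timestamps alongside the aggregate (here: a simple count).
public class CountWithWindowBounds
        implements WindowFunction<Event, Result, String, TimeWindow> {

    @Override
    public void apply(String key, TimeWindow window, Iterable<Event> events,
                      Collector<Result> out) {
        long count = 0;
        for (Event ignored : events) {
            count++;
        }
        out.collect(new Result(key, window.getStart(), window.getEnd(), count));
    }
}

// Usage (sketch):
// stream.keyBy(e -> e.key)
//       .timeWindow(Time.minutes(1))
//       .apply(new CountWithWindowBounds());
```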
|
I do actually think that all levels of abstraction have their value. If you wish, we have (top to bottom):

(1) SQL
(2) Table API with simplified Stream/Table duality
(3) DataStream API / windows
(4) DataStream API with custom windows and triggers
(5) Custom Operators

The data scientist may not like (5), but there is surely a group of people who just want the basic fabric (streams and fault-tolerant state) to stitch together whatever they want. I think it would be great to present this layering and simply say "pick your entry point!"

The abstraction at level (5) is actually similar to the operator-centric (non-fluent) API of Kafka Streams. It is only slightly more involved in Flink, because it exposes too many internals. With a simple wrapper/template, it could become a full-fledged API function, or even a separate API in itself.

Stephan
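For reference, a hedged sketch of what level (5) looks like today: a custom operator wired into the pipeline with transform(). It follows the Flink 1.x OneInputStreamOperator interface; exact signatures vary between versions, and the operator itself (a running element count) is purely illustrative. Note how much internal machinery (StreamRecord, watermark forwarding) it exposes:

    import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
    import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
    import org.apache.flink.streaming.api.watermark.Watermark;
    import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

    // Emits a running count of the elements it has seen.
    public class CountingOperator extends AbstractStreamOperator<Long>
            implements OneInputStreamOperator<String, Long> {

        private long count; // NOTE: not checkpointed in this minimal sketch

        @Override
        public void processElement(StreamRecord<String> element) throws Exception {
            count++;
            // The operator deals in StreamRecords, not plain elements.
            output.collect(new StreamRecord<>(count));
        }

        @Override
        public void processWatermark(Watermark mark) throws Exception {
            // Forwarding watermarks is the operator's own responsibility here.
            output.emitWatermark(mark);
        }
    }

    // Wiring it into a pipeline:
    StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
    DataStream<Long> counts = env.fromElements("a", "b", "c")
        .transform("count", BasicTypeInfo.LONG_TYPE_INFO, new CountingOperator());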
|
I think it's better to provide lower-level APIs. Though high-level APIs are preferred by many users, lower-level APIs are still needed to enhance expressiveness and ease optimization.

Flink's APIs now provide good abstractions for many operations, but higher abstraction always means poorer expressiveness. Blink (derived from Flink) is now heavily used at Alibaba for real-time computing, and we come across many applications that cannot easily be expressed with the Flink APIs.

Take a concrete example. In an advertising application, we need to perform a set of aggregations on windows. Each record fires a window that contains all previously arrived records, and the window size is determined by the record's value. Since it's impossible to know up front which windows a record belongs to, we were unable to implement the operation with the Flink window abstractions (Assigner/Trigger/Evictor). In the end we implemented the application with flatMap and customized state (a sketch follows at the end of this message). Beyond the advertising application, many applications are implemented this way. If lower-level Flink APIs were provided, we could ease the development of these applications and substantially improve their performance.

I think Microsoft's Naiad does a good job in its abstraction of lower-level APIs. Unlike the low-level APIs in Storm or S4, Naiad also provides APIs to track progress (NotifyAt and OnNotify). With knowledge of the job's topology, we can easily track the progress of the execution. Checkpoints can then be viewed as a special kind of progress tracking and can be implemented with those two methods. In such cases, we can implement the customized fault-tolerance mechanisms our users keep asking for.

The ability to track progress can also be used to optimize machine learning jobs. It has been shown that many ML jobs can be optimized with bounded asynchronous iterations: by tracking the producers' progress, the consumers can proceed to the next iteration once sufficiently many producers have completed their work.

Xiaogang
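A hedged sketch of the workaround described above — record-fired, value-sized windows implemented with flatMap over keyed managed state rather than the window API. It assumes a keyed stream and the descriptor-based ListState API of recent Flink versions; the Event type and its fields (timestamp, windowSize, value) are illustrative:

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.api.common.state.ListState;
    import org.apache.flink.api.common.state.ListStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    // Illustrative event type: each record defines its own window length.
    public class Event {
        public long timestamp;   // event time of the record
        public long windowSize;  // length of the window this record fires
        public double value;
        public Event() {}        // POJO no-arg constructor
    }

    // Must run on a keyed stream, so that "seen" is per-key managed state.
    public class ValueSizedWindows extends RichFlatMapFunction<Event, Double> {

        private transient ListState<Event> seen;

        @Override
        public void open(Configuration parameters) {
            seen = getRuntimeContext().getListState(
                new ListStateDescriptor<>("seen", Event.class));
        }

        @Override
        public void flatMap(Event e, Collector<Double> out) throws Exception {
            seen.add(e);
            // Each record fires an aggregation over all previously seen records
            // that fall into the window it defines:
            // [e.timestamp - e.windowSize, e.timestamp].
            double sum = 0.0;
            for (Event prior : seen.get()) {
                if (prior.timestamp >= e.timestamp - e.windowSize
                        && prior.timestamp <= e.timestamp) {
                    sum += prior.value;
                }
            }
            // A real implementation would also prune records that can no
            // longer fall into any future window.
            out.collect(sum);
        }
    }

    // Usage: input.keyBy(...).flatMap(new ValueSizedWindows())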
|