Hi,
I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, India. I'm currently interested in distributed systems and stream processing and am looking to delve deeper into the subject, and hope to get some insight by contributing to Apache Flink. I've gathered some idea of the flink-streaming codebase by recently working on a PR for FLINK-1450[2]. Both FLINK-1617[3] and FLINK-1534[4] are interesting projects that I would love to work on over the summer. I was wondering which amongst these would be more appreciated by the community, so I can start working towards a proposal for either one. Regarding FLINK-1534, I was wondering why would simply merging and filtering the existing streams for events we want to detect not work? Also on going through the document mentioned by @mbalassi in the JIRA comment[5], the authors specify some Runtime Event Detection concepts in Section 5.2. I'm assuming the project entails on building a similar analogy using Flink and the deliverables would include working pattern matching operators over Flink DataStreams as described in the report. If so, then shouldn't it be trivial to implement the described the Binary operator using a WindowedStream and a Filter? I hope my questions don't seem misplaced here and I would appreciate links to literature where I can learn more on the topic. Regards, Akshay Dixit [1] : http://akshaydixi.me [2] : https://github.com/apache/flink/pull/481 [3] : https://issues.apache.org/jira/browse/FLINK-1617 [4] : https://issues.apache.org/jira/browse/FLINK-1534 [5] : http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf |
Dear Akshay,
Thanks again for your interest and for the recent contribution to streaming. Both of the projects mentioned wold be largely appreciated by the community, and you can also propose other project suggestions here for discussion. Regarding FLINK-1534, the thesis I mentioned serves as a starting point and indeed the basic solution can be implemented with filtering and windowing/mapping with some state storing whether the cause of an event has been already seen. Solely relying on the now existing windowing API this however might cause performance issues if the events also have an expiration timeout - some optimization there would be included. The further challenge is to try to further exploit the parallel job execution of Flink to possibly scale a pattern matching query. Best, Marton On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit <[hidden email]> wrote: > Hi, > I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, India. I'm > currently interested in distributed systems and stream processing and am > looking to delve deeper into the subject, and hope to get some insight by > contributing to Apache Flink. I've gathered some idea of the > flink-streaming codebase by recently working on a PR for FLINK-1450[2]. > > Both FLINK-1617[3] and FLINK-1534[4] are interesting projects that I would > love to work on over the summer. I was wondering which amongst these would > be more appreciated by the community, so I can start working towards a > proposal for either one. > > Regarding FLINK-1534, I was wondering why would simply merging and > filtering the existing streams for events we want to detect not work? Also > on going through the document mentioned by @mbalassi in the JIRA > comment[5], the authors specify some Runtime Event Detection concepts in > Section 5.2. I'm assuming the project entails on building a similar analogy > using Flink and the deliverables would include working pattern matching > operators over Flink DataStreams as described in the report. If so, then > shouldn't it be trivial to implement the described the Binary operator > using a WindowedStream and a Filter? > I hope my questions don't seem misplaced here and I would appreciate links > to literature where I can learn more on the topic. > > Regards, > Akshay Dixit > > [1] : http://akshaydixi.me > [2] : https://github.com/apache/flink/pull/481 > [3] : https://issues.apache.org/jira/browse/FLINK-1617 > [4] : https://issues.apache.org/jira/browse/FLINK-1534 > [5] : > http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf > |
Thanks for the explanation Marton. I've decided to try out for FLINK-1534.
After reading through the thesis[4] and a few other papers[1][2][3], I believe I've gathered a little context to ask more questions. But I'm still not sure how Flink's internals work so please bear with me. Although the ongoing effort to document the architecture and internal is really helpful for newbies like me and would greatly decrease the ramping up time. Detecting a pattern of events would comprise of a pipeline that accepts the pattern query and sources of DataStreams, and outputs detected matches of that pattern to a sink or forwards it along to another stream for further computation. As you said, a simple filter-join-aggregate query system could be developed implementing using the existing Streaming windowing API. But matching over complex events and decoding their pattern queries would require implementing a DSL that transforms queries into an evaluation model. For e.g, in [1], the authors have implemented an NFA automaton with a shared versioned buffer that models the queries. In [4], the authors propose a new language that is much more expressive and compiles into a topology graph for Storm. So in Flink's case, I believe the proposed DSL would generate operator graphs for the Flink compiler to schedule Jobgraphs over TaskManagers. If we don't depend on the Windowing API, would we need to create new operators such as the Projection, Conjunction and Union operators defined in [4] ? Also I would like to hear your thoughts on how to approach scaling the pattern matching query. Note all these techniques talk about scaling a single query. I've read various ways such as 1. Merging equivalent runs[1] -: This seems a good way to squash multiple instances of pattern matching forks into a single one if they have the same state. But I'm not sure how we would implement this in Flink since this is a runtime optimization. 2. Implementing a matched version buffer[1] -: This would involve sharing state of a buffer datastructure across multiple candidate match instances for the pattern. 3. Splitting complex composite patterns into simpler sub-patterns[4] and executing separate queries to detect those sub-patterns. This might translate into different tasks and duplicating the source datastreams to all the new generated tasks. Also since I don't know how the Flink compiler behaves, would some of the optimizations involve making changes to it too? Regards, Akshay Dixit [1] : Efficient Pattern Matching over Event Streams <http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf> [2] : On Supporting Kleene Closure over Event Streams <http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf> [3] : Processing Flows of Information: From Data Stream to Complex Event Processing <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf> [4] : Distributing Complex Event Detection <http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf> On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi <[hidden email]> wrote: > Dear Akshay, > > Thanks again for your interest and for the recent contribution to > streaming. > > Both of the projects mentioned wold be largely appreciated by the > community, and you can also propose other project suggestions here for > discussion. > > Regarding FLINK-1534, the thesis I mentioned serves as a starting point and > indeed the basic solution can be implemented with filtering and > windowing/mapping with some state storing whether the cause of an event has > been already seen. Solely relying on the now existing windowing API this > however might cause performance issues if the events also have an > expiration timeout - some optimization there would be included. The further > challenge is to try to further exploit the parallel job execution of Flink > to possibly scale a pattern matching query. > > Best, > > Marton > > On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit <[hidden email]> > wrote: > > > Hi, > > I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, India. I'm > > currently interested in distributed systems and stream processing and am > > looking to delve deeper into the subject, and hope to get some insight by > > contributing to Apache Flink. I've gathered some idea of the > > flink-streaming codebase by recently working on a PR for FLINK-1450[2]. > > > > Both FLINK-1617[3] and FLINK-1534[4] are interesting projects that I > would > > love to work on over the summer. I was wondering which amongst these > would > > be more appreciated by the community, so I can start working towards a > > proposal for either one. > > > > Regarding FLINK-1534, I was wondering why would simply merging and > > filtering the existing streams for events we want to detect not work? > Also > > on going through the document mentioned by @mbalassi in the JIRA > > comment[5], the authors specify some Runtime Event Detection concepts in > > Section 5.2. I'm assuming the project entails on building a similar > analogy > > using Flink and the deliverables would include working pattern matching > > operators over Flink DataStreams as described in the report. If so, then > > shouldn't it be trivial to implement the described the Binary operator > > using a WindowedStream and a Filter? > > I hope my questions don't seem misplaced here and I would appreciate > links > > to literature where I can learn more on the topic. > > > > Regards, > > Akshay Dixit > > > > [1] : http://akshaydixi.me > > [2] : https://github.com/apache/flink/pull/481 > > [3] : https://issues.apache.org/jira/browse/FLINK-1617 > > [4] : https://issues.apache.org/jira/browse/FLINK-1534 > > [5] : > > http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf > > > |
Hi,
It'd really help if I got a reply soon. It'll be helpful in writing the proposal since the deadline is on 27th. Thanks Regards, Akshay Dixit On Sun, Mar 22, 2015 at 1:17 AM, Akshay Dixit <[hidden email]> wrote: > Thanks for the explanation Marton. I've decided to try out for FLINK-1534. > > After reading through the thesis[4] and a few other papers[1][2][3], I > believe I've gathered a little context to ask more questions. But I'm still > not sure how Flink's internals work > so please bear with me. Although the ongoing effort to document the > architecture and internal is really helpful for newbies like me and would > greatly decrease the ramping up time. > > Detecting a pattern of events would comprise of a pipeline that accepts > the pattern query and > sources of DataStreams, and outputs detected matches of that pattern to a > sink or forwards it > along to another stream for further computation. > > As you said, a simple filter-join-aggregate query system could be > developed implementing using the existing Streaming windowing API. > But matching over complex events and decoding their pattern queries would > require implementing a DSL that transforms queries into an evaluation > model. For e.g, > in [1], the authors have implemented an NFA automaton with a shared > versioned buffer that models the queries. In [4], the authors > propose a new language that is much more expressive and compiles into a > topology graph for Storm. > > So in Flink's case, I believe the proposed DSL would generate operator > graphs for the Flink compiler to schedule Jobgraphs over TaskManagers. > If we don't depend on the Windowing API, would we need to create new > operators such as the Projection, Conjunction and Union operators defined > in [4] ? > Also I would like to hear your thoughts on how to approach scaling the > pattern matching query. Note all these techniques talk about scaling a > single query. > I've read various ways such as > > 1. Merging equivalent runs[1] -: This seems a good way to squash multiple > instances of pattern matching forks into a single one if they have the same > state. > But I'm not sure how we would implement this in Flink since this is a > runtime optimization. > > 2. Implementing a matched version buffer[1] -: This would involve sharing > state of a buffer datastructure across multiple candidate match instances > for the pattern. > > 3. Splitting complex composite patterns into simpler sub-patterns[4] and > executing separate queries to detect those sub-patterns. This might > translate into different > tasks and duplicating the source datastreams to all the new generated > tasks. > > Also since I don't know how the Flink compiler behaves, would some of the > optimizations involve making changes to it too? > > Regards, > Akshay Dixit > > [1] : Efficient Pattern Matching over Event Streams > <http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf> > [2] : On Supporting Kleene Closure over Event Streams > <http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf> > [3] : Processing Flows of Information: From Data Stream to Complex Event > Processing > <http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf> > [4] : Distributing Complex Event Detection > <http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf> > > On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi <[hidden email]> > wrote: > >> Dear Akshay, >> >> Thanks again for your interest and for the recent contribution to >> streaming. >> >> Both of the projects mentioned wold be largely appreciated by the >> community, and you can also propose other project suggestions here for >> discussion. >> >> Regarding FLINK-1534, the thesis I mentioned serves as a starting point >> and >> indeed the basic solution can be implemented with filtering and >> windowing/mapping with some state storing whether the cause of an event >> has >> been already seen. Solely relying on the now existing windowing API this >> however might cause performance issues if the events also have an >> expiration timeout - some optimization there would be included. The >> further >> challenge is to try to further exploit the parallel job execution of Flink >> to possibly scale a pattern matching query. >> >> Best, >> >> Marton >> >> On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit <[hidden email]> >> wrote: >> >> > Hi, >> > I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, India. I'm >> > currently interested in distributed systems and stream processing and am >> > looking to delve deeper into the subject, and hope to get some insight >> by >> > contributing to Apache Flink. I've gathered some idea of the >> > flink-streaming codebase by recently working on a PR for FLINK-1450[2]. >> > >> > Both FLINK-1617[3] and FLINK-1534[4] are interesting projects that I >> would >> > love to work on over the summer. I was wondering which amongst these >> would >> > be more appreciated by the community, so I can start working towards a >> > proposal for either one. >> > >> > Regarding FLINK-1534, I was wondering why would simply merging and >> > filtering the existing streams for events we want to detect not work? >> Also >> > on going through the document mentioned by @mbalassi in the JIRA >> > comment[5], the authors specify some Runtime Event Detection concepts in >> > Section 5.2. I'm assuming the project entails on building a similar >> analogy >> > using Flink and the deliverables would include working pattern matching >> > operators over Flink DataStreams as described in the report. If so, then >> > shouldn't it be trivial to implement the described the Binary operator >> > using a WindowedStream and a Filter? >> > I hope my questions don't seem misplaced here and I would appreciate >> links >> > to literature where I can learn more on the topic. >> > >> > Regards, >> > Akshay Dixit >> > >> > [1] : http://akshaydixi.me >> > [2] : https://github.com/apache/flink/pull/481 >> > [3] : https://issues.apache.org/jira/browse/FLINK-1617 >> > [4] : https://issues.apache.org/jira/browse/FLINK-1534 >> > [5] : >> > http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf >> > >> > > |
Hey,
Give me an hour or so as I am in a meeting currently, but I will get back to you afterwards. Regards, Gyula On Tue, Mar 24, 2015 at 11:03 AM, Akshay Dixit <[hidden email]> wrote: > Hi, > It'd really help if I got a reply soon. It'll be helpful in writing the > proposal since the deadline is on 27th. Thanks > Regards, > Akshay Dixit > > On Sun, Mar 22, 2015 at 1:17 AM, Akshay Dixit <[hidden email]> > wrote: > > > Thanks for the explanation Marton. I've decided to try out for > FLINK-1534. > > > > After reading through the thesis[4] and a few other papers[1][2][3], I > > believe I've gathered a little context to ask more questions. But I'm > still > > not sure how Flink's internals work > > so please bear with me. Although the ongoing effort to document the > > architecture and internal is really helpful for newbies like me and would > > greatly decrease the ramping up time. > > > > Detecting a pattern of events would comprise of a pipeline that accepts > > the pattern query and > > sources of DataStreams, and outputs detected matches of that pattern to a > > sink or forwards it > > along to another stream for further computation. > > > > As you said, a simple filter-join-aggregate query system could be > > developed implementing using the existing Streaming windowing API. > > But matching over complex events and decoding their pattern queries would > > require implementing a DSL that transforms queries into an evaluation > > model. For e.g, > > in [1], the authors have implemented an NFA automaton with a shared > > versioned buffer that models the queries. In [4], the authors > > propose a new language that is much more expressive and compiles into a > > topology graph for Storm. > > > > So in Flink's case, I believe the proposed DSL would generate operator > > graphs for the Flink compiler to schedule Jobgraphs over TaskManagers. > > If we don't depend on the Windowing API, would we need to create new > > operators such as the Projection, Conjunction and Union operators defined > > in [4] ? > > Also I would like to hear your thoughts on how to approach scaling the > > pattern matching query. Note all these techniques talk about scaling a > > single query. > > I've read various ways such as > > > > 1. Merging equivalent runs[1] -: This seems a good way to squash > multiple > > instances of pattern matching forks into a single one if they have the > same > > state. > > But I'm not sure how we would implement this in Flink since this is a > > runtime optimization. > > > > 2. Implementing a matched version buffer[1] -: This would involve > sharing > > state of a buffer datastructure across multiple candidate match instances > > for the pattern. > > > > 3. Splitting complex composite patterns into simpler sub-patterns[4] and > > executing separate queries to detect those sub-patterns. This might > > translate into different > > tasks and duplicating the source datastreams to all the new generated > > tasks. > > > > Also since I don't know how the Flink compiler behaves, would some of the > > optimizations involve making changes to it too? > > > > Regards, > > Akshay Dixit > > > > [1] : Efficient Pattern Matching over Event Streams > > <http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf> > > [2] : On Supporting Kleene Closure over Event Streams > > <http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf> > > [3] : Processing Flows of Information: From Data Stream to Complex Event > > Processing > > < > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf > > > > [4] : Distributing Complex Event Detection > > <http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf > > > > > > On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi < > [hidden email]> > > wrote: > > > >> Dear Akshay, > >> > >> Thanks again for your interest and for the recent contribution to > >> streaming. > >> > >> Both of the projects mentioned wold be largely appreciated by the > >> community, and you can also propose other project suggestions here for > >> discussion. > >> > >> Regarding FLINK-1534, the thesis I mentioned serves as a starting point > >> and > >> indeed the basic solution can be implemented with filtering and > >> windowing/mapping with some state storing whether the cause of an event > >> has > >> been already seen. Solely relying on the now existing windowing API this > >> however might cause performance issues if the events also have an > >> expiration timeout - some optimization there would be included. The > >> further > >> challenge is to try to further exploit the parallel job execution of > Flink > >> to possibly scale a pattern matching query. > >> > >> Best, > >> > >> Marton > >> > >> On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit <[hidden email]> > >> wrote: > >> > >> > Hi, > >> > I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, India. I'm > >> > currently interested in distributed systems and stream processing and > am > >> > looking to delve deeper into the subject, and hope to get some insight > >> by > >> > contributing to Apache Flink. I've gathered some idea of the > >> > flink-streaming codebase by recently working on a PR for > FLINK-1450[2]. > >> > > >> > Both FLINK-1617[3] and FLINK-1534[4] are interesting projects that I > >> would > >> > love to work on over the summer. I was wondering which amongst these > >> would > >> > be more appreciated by the community, so I can start working towards a > >> > proposal for either one. > >> > > >> > Regarding FLINK-1534, I was wondering why would simply merging and > >> > filtering the existing streams for events we want to detect not work? > >> Also > >> > on going through the document mentioned by @mbalassi in the JIRA > >> > comment[5], the authors specify some Runtime Event Detection concepts > in > >> > Section 5.2. I'm assuming the project entails on building a similar > >> analogy > >> > using Flink and the deliverables would include working pattern > matching > >> > operators over Flink DataStreams as described in the report. If so, > then > >> > shouldn't it be trivial to implement the described the Binary operator > >> > using a WindowedStream and a Filter? > >> > I hope my questions don't seem misplaced here and I would appreciate > >> links > >> > to literature where I can learn more on the topic. > >> > > >> > Regards, > >> > Akshay Dixit > >> > > >> > [1] : http://akshaydixi.me > >> > [2] : https://github.com/apache/flink/pull/481 > >> > [3] : https://issues.apache.org/jira/browse/FLINK-1617 > >> > [4] : https://issues.apache.org/jira/browse/FLINK-1534 > >> > [5] : > >> > > http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf > >> > > >> > > > > > |
Hey Dixit,
Sorry for the delay, I had to discuss this in more detail with some of our other core developers. The consensus seems to be that we would like push this project in a direction where the changes can be quickly included in the next releases. For this it is essential that we implement features that are complete (and clean) from the users perspective. This does not necessarily mean that we would like to have everything at once but rather that it is preferable to start with something clean and simple (for instance the naive chained filter approach) and progressively build more complex logic. This also mean that we would like to avoid "researchy" code in the codebase as much as possible. Of course once we have a stable api for this functionality we can work towards making the optimizations that you have mentioned like operator sharing and so on. The ideal proposal would give a clear sketch of the pattern matching API that you would like to implement, which might be some added operators at first to the current API and possible a DSL later with more advanced functionality (this would probably go in a separate library until it is very stable). So please in the proposal include a preview of what the pattern matching syntax would look like integrated with the current operators, how it would interact with other parts of the system etc. These are the thing we need to figure out before we consider the optimizations I think, because it usually turns out, that the API semantics you would like to provide can hugely affect (probably limit) the possibilities that you have afterwards in terms of optimizations. Let me know if you have further questions regarding this :) Gyula On Tue, Mar 24, 2015 at 12:01 PM, Gyula Fóra <[hidden email]> wrote: > Hey, > > Give me an hour or so as I am in a meeting currently, but I will get back > to you afterwards. > > Regards, > Gyula > > On Tue, Mar 24, 2015 at 11:03 AM, Akshay Dixit <[hidden email]> > wrote: > >> Hi, >> It'd really help if I got a reply soon. It'll be helpful in writing the >> proposal since the deadline is on 27th. Thanks >> Regards, >> Akshay Dixit >> >> On Sun, Mar 22, 2015 at 1:17 AM, Akshay Dixit <[hidden email]> >> wrote: >> >> > Thanks for the explanation Marton. I've decided to try out for >> FLINK-1534. >> > >> > After reading through the thesis[4] and a few other papers[1][2][3], I >> > believe I've gathered a little context to ask more questions. But I'm >> still >> > not sure how Flink's internals work >> > so please bear with me. Although the ongoing effort to document the >> > architecture and internal is really helpful for newbies like me and >> would >> > greatly decrease the ramping up time. >> > >> > Detecting a pattern of events would comprise of a pipeline that accepts >> > the pattern query and >> > sources of DataStreams, and outputs detected matches of that pattern to >> a >> > sink or forwards it >> > along to another stream for further computation. >> > >> > As you said, a simple filter-join-aggregate query system could be >> > developed implementing using the existing Streaming windowing API. >> > But matching over complex events and decoding their pattern queries >> would >> > require implementing a DSL that transforms queries into an evaluation >> > model. For e.g, >> > in [1], the authors have implemented an NFA automaton with a shared >> > versioned buffer that models the queries. In [4], the authors >> > propose a new language that is much more expressive and compiles into a >> > topology graph for Storm. >> > >> > So in Flink's case, I believe the proposed DSL would generate operator >> > graphs for the Flink compiler to schedule Jobgraphs over TaskManagers. >> > If we don't depend on the Windowing API, would we need to create new >> > operators such as the Projection, Conjunction and Union operators >> defined >> > in [4] ? >> > Also I would like to hear your thoughts on how to approach scaling the >> > pattern matching query. Note all these techniques talk about scaling a >> > single query. >> > I've read various ways such as >> > >> > 1. Merging equivalent runs[1] -: This seems a good way to squash >> multiple >> > instances of pattern matching forks into a single one if they have the >> same >> > state. >> > But I'm not sure how we would implement this in Flink since this is a >> > runtime optimization. >> > >> > 2. Implementing a matched version buffer[1] -: This would involve >> sharing >> > state of a buffer datastructure across multiple candidate match >> instances >> > for the pattern. >> > >> > 3. Splitting complex composite patterns into simpler sub-patterns[4] >> and >> > executing separate queries to detect those sub-patterns. This might >> > translate into different >> > tasks and duplicating the source datastreams to all the new generated >> > tasks. >> > >> > Also since I don't know how the Flink compiler behaves, would some of >> the >> > optimizations involve making changes to it too? >> > >> > Regards, >> > Akshay Dixit >> > >> > [1] : Efficient Pattern Matching over Event Streams >> > <http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf >> > >> > [2] : On Supporting Kleene Closure over Event Streams >> > <http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf> >> > [3] : Processing Flows of Information: From Data Stream to Complex Event >> > Processing >> > < >> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf >> > >> > [4] : Distributing Complex Event Detection >> > < >> http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf> >> > >> > On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi < >> [hidden email]> >> > wrote: >> > >> >> Dear Akshay, >> >> >> >> Thanks again for your interest and for the recent contribution to >> >> streaming. >> >> >> >> Both of the projects mentioned wold be largely appreciated by the >> >> community, and you can also propose other project suggestions here for >> >> discussion. >> >> >> >> Regarding FLINK-1534, the thesis I mentioned serves as a starting point >> >> and >> >> indeed the basic solution can be implemented with filtering and >> >> windowing/mapping with some state storing whether the cause of an event >> >> has >> >> been already seen. Solely relying on the now existing windowing API >> this >> >> however might cause performance issues if the events also have an >> >> expiration timeout - some optimization there would be included. The >> >> further >> >> challenge is to try to further exploit the parallel job execution of >> Flink >> >> to possibly scale a pattern matching query. >> >> >> >> Best, >> >> >> >> Marton >> >> >> >> On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit <[hidden email]> >> >> wrote: >> >> >> >> > Hi, >> >> > I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, India. I'm >> >> > currently interested in distributed systems and stream processing >> and am >> >> > looking to delve deeper into the subject, and hope to get some >> insight >> >> by >> >> > contributing to Apache Flink. I've gathered some idea of the >> >> > flink-streaming codebase by recently working on a PR for >> FLINK-1450[2]. >> >> > >> >> > Both FLINK-1617[3] and FLINK-1534[4] are interesting projects that I >> >> would >> >> > love to work on over the summer. I was wondering which amongst these >> >> would >> >> > be more appreciated by the community, so I can start working towards >> a >> >> > proposal for either one. >> >> > >> >> > Regarding FLINK-1534, I was wondering why would simply merging and >> >> > filtering the existing streams for events we want to detect not work? >> >> Also >> >> > on going through the document mentioned by @mbalassi in the JIRA >> >> > comment[5], the authors specify some Runtime Event Detection >> concepts in >> >> > Section 5.2. I'm assuming the project entails on building a similar >> >> analogy >> >> > using Flink and the deliverables would include working pattern >> matching >> >> > operators over Flink DataStreams as described in the report. If so, >> then >> >> > shouldn't it be trivial to implement the described the Binary >> operator >> >> > using a WindowedStream and a Filter? >> >> > I hope my questions don't seem misplaced here and I would appreciate >> >> links >> >> > to literature where I can learn more on the topic. >> >> > >> >> > Regards, >> >> > Akshay Dixit >> >> > >> >> > [1] : http://akshaydixi.me >> >> > [2] : https://github.com/apache/flink/pull/481 >> >> > [3] : https://issues.apache.org/jira/browse/FLINK-1617 >> >> > [4] : https://issues.apache.org/jira/browse/FLINK-1534 >> >> > [5] : >> >> > >> http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf >> >> > >> >> >> > >> > >> > > |
Thanks Gyula.
I agree too that simple and working implementations are preferrable over hacky complex solutions. I'll start sketching out an initial straighforward API with only basic pattern matching features and base it on the existing windowing API. I'll post a draft of the proposal, keeping the points you've said in mind, tomorrow, so you can look it over to see if its all right. Regards, Akshay Dixit On Tue, Mar 24, 2015 at 6:30 PM, Gyula Fóra <[hidden email]> wrote: > Hey Dixit, > > Sorry for the delay, I had to discuss this in more detail with some of our > other core developers. > > The consensus seems to be that we would like push this project in a > direction where the changes can be quickly included in the next releases. > For this it is essential that we implement features that are complete (and > clean) from the users perspective. This does not necessarily mean that we > would like to have everything at once but rather that it is preferable to > start with something clean and simple (for instance the naive chained > filter approach) and progressively build more complex logic. > > This also mean that we would like to avoid "researchy" code in the codebase > as much as possible. Of course once we have a stable api for this > functionality we can work towards making the optimizations that you have > mentioned like operator sharing and so on. > > The ideal proposal would give a clear sketch of the pattern matching API > that you would like to implement, which might be some added operators at > first to the current API and possible a DSL later with more advanced > functionality (this would probably go in a separate library until it is > very stable). > > So please in the proposal include a preview of what the pattern matching > syntax would look like integrated with the current operators, how it would > interact with other parts of the system etc. > > These are the thing we need to figure out before we consider the > optimizations I think, because it usually turns out, that the API semantics > you would like to provide can hugely affect (probably limit) the > possibilities that you have afterwards in terms of optimizations. > > Let me know if you have further questions regarding this :) > > Gyula > > On Tue, Mar 24, 2015 at 12:01 PM, Gyula Fóra <[hidden email]> wrote: > > > Hey, > > > > Give me an hour or so as I am in a meeting currently, but I will get back > > to you afterwards. > > > > Regards, > > Gyula > > > > On Tue, Mar 24, 2015 at 11:03 AM, Akshay Dixit <[hidden email]> > > wrote: > > > >> Hi, > >> It'd really help if I got a reply soon. It'll be helpful in writing the > >> proposal since the deadline is on 27th. Thanks > >> Regards, > >> Akshay Dixit > >> > >> On Sun, Mar 22, 2015 at 1:17 AM, Akshay Dixit <[hidden email]> > >> wrote: > >> > >> > Thanks for the explanation Marton. I've decided to try out for > >> FLINK-1534. > >> > > >> > After reading through the thesis[4] and a few other papers[1][2][3], I > >> > believe I've gathered a little context to ask more questions. But I'm > >> still > >> > not sure how Flink's internals work > >> > so please bear with me. Although the ongoing effort to document the > >> > architecture and internal is really helpful for newbies like me and > >> would > >> > greatly decrease the ramping up time. > >> > > >> > Detecting a pattern of events would comprise of a pipeline that > accepts > >> > the pattern query and > >> > sources of DataStreams, and outputs detected matches of that pattern > to > >> a > >> > sink or forwards it > >> > along to another stream for further computation. > >> > > >> > As you said, a simple filter-join-aggregate query system could be > >> > developed implementing using the existing Streaming windowing API. > >> > But matching over complex events and decoding their pattern queries > >> would > >> > require implementing a DSL that transforms queries into an evaluation > >> > model. For e.g, > >> > in [1], the authors have implemented an NFA automaton with a shared > >> > versioned buffer that models the queries. In [4], the authors > >> > propose a new language that is much more expressive and compiles into > a > >> > topology graph for Storm. > >> > > >> > So in Flink's case, I believe the proposed DSL would generate operator > >> > graphs for the Flink compiler to schedule Jobgraphs over TaskManagers. > >> > If we don't depend on the Windowing API, would we need to create new > >> > operators such as the Projection, Conjunction and Union operators > >> defined > >> > in [4] ? > >> > Also I would like to hear your thoughts on how to approach scaling the > >> > pattern matching query. Note all these techniques talk about scaling a > >> > single query. > >> > I've read various ways such as > >> > > >> > 1. Merging equivalent runs[1] -: This seems a good way to squash > >> multiple > >> > instances of pattern matching forks into a single one if they have the > >> same > >> > state. > >> > But I'm not sure how we would implement this in Flink since this is a > >> > runtime optimization. > >> > > >> > 2. Implementing a matched version buffer[1] -: This would involve > >> sharing > >> > state of a buffer datastructure across multiple candidate match > >> instances > >> > for the pattern. > >> > > >> > 3. Splitting complex composite patterns into simpler sub-patterns[4] > >> and > >> > executing separate queries to detect those sub-patterns. This might > >> > translate into different > >> > tasks and duplicating the source datastreams to all the new generated > >> > tasks. > >> > > >> > Also since I don't know how the Flink compiler behaves, would some of > >> the > >> > optimizations involve making changes to it too? > >> > > >> > Regards, > >> > Akshay Dixit > >> > > >> > [1] : Efficient Pattern Matching over Event Streams > >> > < > http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf > >> > > >> > [2] : On Supporting Kleene Closure over Event Streams > >> > <http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf> > >> > [3] : Processing Flows of Information: From Data Stream to Complex > Event > >> > Processing > >> > < > >> > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf > >> > > >> > [4] : Distributing Complex Event Detection > >> > < > >> http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf > > > >> > > >> > On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi < > >> [hidden email]> > >> > wrote: > >> > > >> >> Dear Akshay, > >> >> > >> >> Thanks again for your interest and for the recent contribution to > >> >> streaming. > >> >> > >> >> Both of the projects mentioned wold be largely appreciated by the > >> >> community, and you can also propose other project suggestions here > for > >> >> discussion. > >> >> > >> >> Regarding FLINK-1534, the thesis I mentioned serves as a starting > point > >> >> and > >> >> indeed the basic solution can be implemented with filtering and > >> >> windowing/mapping with some state storing whether the cause of an > event > >> >> has > >> >> been already seen. Solely relying on the now existing windowing API > >> this > >> >> however might cause performance issues if the events also have an > >> >> expiration timeout - some optimization there would be included. The > >> >> further > >> >> challenge is to try to further exploit the parallel job execution of > >> Flink > >> >> to possibly scale a pattern matching query. > >> >> > >> >> Best, > >> >> > >> >> Marton > >> >> > >> >> On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit <[hidden email]> > >> >> wrote: > >> >> > >> >> > Hi, > >> >> > I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, India. > I'm > >> >> > currently interested in distributed systems and stream processing > >> and am > >> >> > looking to delve deeper into the subject, and hope to get some > >> insight > >> >> by > >> >> > contributing to Apache Flink. I've gathered some idea of the > >> >> > flink-streaming codebase by recently working on a PR for > >> FLINK-1450[2]. > >> >> > > >> >> > Both FLINK-1617[3] and FLINK-1534[4] are interesting projects that > I > >> >> would > >> >> > love to work on over the summer. I was wondering which amongst > these > >> >> would > >> >> > be more appreciated by the community, so I can start working > towards > >> a > >> >> > proposal for either one. > >> >> > > >> >> > Regarding FLINK-1534, I was wondering why would simply merging and > >> >> > filtering the existing streams for events we want to detect not > work? > >> >> Also > >> >> > on going through the document mentioned by @mbalassi in the JIRA > >> >> > comment[5], the authors specify some Runtime Event Detection > >> concepts in > >> >> > Section 5.2. I'm assuming the project entails on building a similar > >> >> analogy > >> >> > using Flink and the deliverables would include working pattern > >> matching > >> >> > operators over Flink DataStreams as described in the report. If so, > >> then > >> >> > shouldn't it be trivial to implement the described the Binary > >> operator > >> >> > using a WindowedStream and a Filter? > >> >> > I hope my questions don't seem misplaced here and I would > appreciate > >> >> links > >> >> > to literature where I can learn more on the topic. > >> >> > > >> >> > Regards, > >> >> > Akshay Dixit > >> >> > > >> >> > [1] : http://akshaydixi.me > >> >> > [2] : https://github.com/apache/flink/pull/481 > >> >> > [3] : https://issues.apache.org/jira/browse/FLINK-1617 > >> >> > [4] : https://issues.apache.org/jira/browse/FLINK-1534 > >> >> > [5] : > >> >> > > >> http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf > >> >> > > >> >> > >> > > >> > > >> > > > > > |
Hi,
The link to the draft proposal that I've prepared is https://gist.github.com/akshaydixi/88f3fbcebab0119a6a31 It would be great if I could get some feedback on it. Regards, Akshay Dixit On Wed, Mar 25, 2015 at 2:03 AM, Akshay Dixit <[hidden email]> wrote: > Thanks Gyula. > > I agree too that simple and working implementations are preferrable over > hacky complex solutions. I'll start sketching out an initial straighforward > API with only basic pattern matching features > and base it on the existing windowing API. I'll post a draft of the > proposal, keeping the points you've said in mind, tomorrow, so you can > look it over to see if its all right. > Regards, > Akshay Dixit > > On Tue, Mar 24, 2015 at 6:30 PM, Gyula Fóra <[hidden email]> wrote: > >> Hey Dixit, >> >> Sorry for the delay, I had to discuss this in more detail with some of our >> other core developers. >> >> The consensus seems to be that we would like push this project in a >> direction where the changes can be quickly included in the next releases. >> For this it is essential that we implement features that are complete (and >> clean) from the users perspective. This does not necessarily mean that we >> would like to have everything at once but rather that it is preferable to >> start with something clean and simple (for instance the naive chained >> filter approach) and progressively build more complex logic. >> >> This also mean that we would like to avoid "researchy" code in the >> codebase >> as much as possible. Of course once we have a stable api for this >> functionality we can work towards making the optimizations that you have >> mentioned like operator sharing and so on. >> >> The ideal proposal would give a clear sketch of the pattern matching API >> that you would like to implement, which might be some added operators at >> first to the current API and possible a DSL later with more advanced >> functionality (this would probably go in a separate library until it is >> very stable). >> >> So please in the proposal include a preview of what the pattern matching >> syntax would look like integrated with the current operators, how it would >> interact with other parts of the system etc. >> >> These are the thing we need to figure out before we consider the >> optimizations I think, because it usually turns out, that the API >> semantics >> you would like to provide can hugely affect (probably limit) the >> possibilities that you have afterwards in terms of optimizations. >> >> Let me know if you have further questions regarding this :) >> >> Gyula >> >> On Tue, Mar 24, 2015 at 12:01 PM, Gyula Fóra <[hidden email]> wrote: >> >> > Hey, >> > >> > Give me an hour or so as I am in a meeting currently, but I will get >> back >> > to you afterwards. >> > >> > Regards, >> > Gyula >> > >> > On Tue, Mar 24, 2015 at 11:03 AM, Akshay Dixit <[hidden email]> >> > wrote: >> > >> >> Hi, >> >> It'd really help if I got a reply soon. It'll be helpful in writing the >> >> proposal since the deadline is on 27th. Thanks >> >> Regards, >> >> Akshay Dixit >> >> >> >> On Sun, Mar 22, 2015 at 1:17 AM, Akshay Dixit <[hidden email]> >> >> wrote: >> >> >> >> > Thanks for the explanation Marton. I've decided to try out for >> >> FLINK-1534. >> >> > >> >> > After reading through the thesis[4] and a few other papers[1][2][3], >> I >> >> > believe I've gathered a little context to ask more questions. But I'm >> >> still >> >> > not sure how Flink's internals work >> >> > so please bear with me. Although the ongoing effort to document the >> >> > architecture and internal is really helpful for newbies like me and >> >> would >> >> > greatly decrease the ramping up time. >> >> > >> >> > Detecting a pattern of events would comprise of a pipeline that >> accepts >> >> > the pattern query and >> >> > sources of DataStreams, and outputs detected matches of that pattern >> to >> >> a >> >> > sink or forwards it >> >> > along to another stream for further computation. >> >> > >> >> > As you said, a simple filter-join-aggregate query system could be >> >> > developed implementing using the existing Streaming windowing API. >> >> > But matching over complex events and decoding their pattern queries >> >> would >> >> > require implementing a DSL that transforms queries into an evaluation >> >> > model. For e.g, >> >> > in [1], the authors have implemented an NFA automaton with a shared >> >> > versioned buffer that models the queries. In [4], the authors >> >> > propose a new language that is much more expressive and compiles >> into a >> >> > topology graph for Storm. >> >> > >> >> > So in Flink's case, I believe the proposed DSL would generate >> operator >> >> > graphs for the Flink compiler to schedule Jobgraphs over >> TaskManagers. >> >> > If we don't depend on the Windowing API, would we need to create new >> >> > operators such as the Projection, Conjunction and Union operators >> >> defined >> >> > in [4] ? >> >> > Also I would like to hear your thoughts on how to approach scaling >> the >> >> > pattern matching query. Note all these techniques talk about scaling >> a >> >> > single query. >> >> > I've read various ways such as >> >> > >> >> > 1. Merging equivalent runs[1] -: This seems a good way to squash >> >> multiple >> >> > instances of pattern matching forks into a single one if they have >> the >> >> same >> >> > state. >> >> > But I'm not sure how we would implement this in Flink since this is a >> >> > runtime optimization. >> >> > >> >> > 2. Implementing a matched version buffer[1] -: This would involve >> >> sharing >> >> > state of a buffer datastructure across multiple candidate match >> >> instances >> >> > for the pattern. >> >> > >> >> > 3. Splitting complex composite patterns into simpler sub-patterns[4] >> >> and >> >> > executing separate queries to detect those sub-patterns. This might >> >> > translate into different >> >> > tasks and duplicating the source datastreams to all the new generated >> >> > tasks. >> >> > >> >> > Also since I don't know how the Flink compiler behaves, would some of >> >> the >> >> > optimizations involve making changes to it too? >> >> > >> >> > Regards, >> >> > Akshay Dixit >> >> > >> >> > [1] : Efficient Pattern Matching over Event Streams >> >> > < >> http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf >> >> > >> >> > [2] : On Supporting Kleene Closure over Event Streams >> >> > <http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf> >> >> > [3] : Processing Flows of Information: From Data Stream to Complex >> Event >> >> > Processing >> >> > < >> >> >> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf >> >> > >> >> > [4] : Distributing Complex Event Detection >> >> > < >> >> >> http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf> >> >> > >> >> > On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi < >> >> [hidden email]> >> >> > wrote: >> >> > >> >> >> Dear Akshay, >> >> >> >> >> >> Thanks again for your interest and for the recent contribution to >> >> >> streaming. >> >> >> >> >> >> Both of the projects mentioned wold be largely appreciated by the >> >> >> community, and you can also propose other project suggestions here >> for >> >> >> discussion. >> >> >> >> >> >> Regarding FLINK-1534, the thesis I mentioned serves as a starting >> point >> >> >> and >> >> >> indeed the basic solution can be implemented with filtering and >> >> >> windowing/mapping with some state storing whether the cause of an >> event >> >> >> has >> >> >> been already seen. Solely relying on the now existing windowing API >> >> this >> >> >> however might cause performance issues if the events also have an >> >> >> expiration timeout - some optimization there would be included. The >> >> >> further >> >> >> challenge is to try to further exploit the parallel job execution of >> >> Flink >> >> >> to possibly scale a pattern matching query. >> >> >> >> >> >> Best, >> >> >> >> >> >> Marton >> >> >> >> >> >> On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit <[hidden email] >> > >> >> >> wrote: >> >> >> >> >> >> > Hi, >> >> >> > I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, India. >> I'm >> >> >> > currently interested in distributed systems and stream processing >> >> and am >> >> >> > looking to delve deeper into the subject, and hope to get some >> >> insight >> >> >> by >> >> >> > contributing to Apache Flink. I've gathered some idea of the >> >> >> > flink-streaming codebase by recently working on a PR for >> >> FLINK-1450[2]. >> >> >> > >> >> >> > Both FLINK-1617[3] and FLINK-1534[4] are interesting projects >> that I >> >> >> would >> >> >> > love to work on over the summer. I was wondering which amongst >> these >> >> >> would >> >> >> > be more appreciated by the community, so I can start working >> towards >> >> a >> >> >> > proposal for either one. >> >> >> > >> >> >> > Regarding FLINK-1534, I was wondering why would simply merging and >> >> >> > filtering the existing streams for events we want to detect not >> work? >> >> >> Also >> >> >> > on going through the document mentioned by @mbalassi in the JIRA >> >> >> > comment[5], the authors specify some Runtime Event Detection >> >> concepts in >> >> >> > Section 5.2. I'm assuming the project entails on building a >> similar >> >> >> analogy >> >> >> > using Flink and the deliverables would include working pattern >> >> matching >> >> >> > operators over Flink DataStreams as described in the report. If >> so, >> >> then >> >> >> > shouldn't it be trivial to implement the described the Binary >> >> operator >> >> >> > using a WindowedStream and a Filter? >> >> >> > I hope my questions don't seem misplaced here and I would >> appreciate >> >> >> links >> >> >> > to literature where I can learn more on the topic. >> >> >> > >> >> >> > Regards, >> >> >> > Akshay Dixit >> >> >> > >> >> >> > [1] : http://akshaydixi.me >> >> >> > [2] : https://github.com/apache/flink/pull/481 >> >> >> > [3] : https://issues.apache.org/jira/browse/FLINK-1617 >> >> >> > [4] : https://issues.apache.org/jira/browse/FLINK-1534 >> >> >> > [5] : >> >> >> > >> >> >> http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf >> >> >> > >> >> >> >> >> > >> >> > >> >> >> > >> > >> > > |
I think it looks good for a start, we will have to work on the API a little
bit together to make it fit smoothly with what we currently have. There is a few gaps in the timeline but that you have probably noticed :) Otherwise +1 from me. On Wed, Mar 25, 2015 at 11:35 PM, Akshay Dixit <[hidden email]> wrote: > Hi, > The link to the draft proposal that I've prepared is > https://gist.github.com/akshaydixi/88f3fbcebab0119a6a31 > It would be great if I could get some feedback on it. > Regards, > Akshay Dixit > > On Wed, Mar 25, 2015 at 2:03 AM, Akshay Dixit <[hidden email]> > wrote: > > > Thanks Gyula. > > > > I agree too that simple and working implementations are preferrable over > > hacky complex solutions. I'll start sketching out an initial > straighforward > > API with only basic pattern matching features > > and base it on the existing windowing API. I'll post a draft of the > > proposal, keeping the points you've said in mind, tomorrow, so you can > > look it over to see if its all right. > > Regards, > > Akshay Dixit > > > > On Tue, Mar 24, 2015 at 6:30 PM, Gyula Fóra <[hidden email]> wrote: > > > >> Hey Dixit, > >> > >> Sorry for the delay, I had to discuss this in more detail with some of > our > >> other core developers. > >> > >> The consensus seems to be that we would like push this project in a > >> direction where the changes can be quickly included in the next > releases. > >> For this it is essential that we implement features that are complete > (and > >> clean) from the users perspective. This does not necessarily mean that > we > >> would like to have everything at once but rather that it is preferable > to > >> start with something clean and simple (for instance the naive chained > >> filter approach) and progressively build more complex logic. > >> > >> This also mean that we would like to avoid "researchy" code in the > >> codebase > >> as much as possible. Of course once we have a stable api for this > >> functionality we can work towards making the optimizations that you have > >> mentioned like operator sharing and so on. > >> > >> The ideal proposal would give a clear sketch of the pattern matching API > >> that you would like to implement, which might be some added operators at > >> first to the current API and possible a DSL later with more advanced > >> functionality (this would probably go in a separate library until it is > >> very stable). > >> > >> So please in the proposal include a preview of what the pattern matching > >> syntax would look like integrated with the current operators, how it > would > >> interact with other parts of the system etc. > >> > >> These are the thing we need to figure out before we consider the > >> optimizations I think, because it usually turns out, that the API > >> semantics > >> you would like to provide can hugely affect (probably limit) the > >> possibilities that you have afterwards in terms of optimizations. > >> > >> Let me know if you have further questions regarding this :) > >> > >> Gyula > >> > >> On Tue, Mar 24, 2015 at 12:01 PM, Gyula Fóra <[hidden email]> wrote: > >> > >> > Hey, > >> > > >> > Give me an hour or so as I am in a meeting currently, but I will get > >> back > >> > to you afterwards. > >> > > >> > Regards, > >> > Gyula > >> > > >> > On Tue, Mar 24, 2015 at 11:03 AM, Akshay Dixit <[hidden email]> > >> > wrote: > >> > > >> >> Hi, > >> >> It'd really help if I got a reply soon. It'll be helpful in writing > the > >> >> proposal since the deadline is on 27th. Thanks > >> >> Regards, > >> >> Akshay Dixit > >> >> > >> >> On Sun, Mar 22, 2015 at 1:17 AM, Akshay Dixit <[hidden email]> > >> >> wrote: > >> >> > >> >> > Thanks for the explanation Marton. I've decided to try out for > >> >> FLINK-1534. > >> >> > > >> >> > After reading through the thesis[4] and a few other > papers[1][2][3], > >> I > >> >> > believe I've gathered a little context to ask more questions. But > I'm > >> >> still > >> >> > not sure how Flink's internals work > >> >> > so please bear with me. Although the ongoing effort to document the > >> >> > architecture and internal is really helpful for newbies like me and > >> >> would > >> >> > greatly decrease the ramping up time. > >> >> > > >> >> > Detecting a pattern of events would comprise of a pipeline that > >> accepts > >> >> > the pattern query and > >> >> > sources of DataStreams, and outputs detected matches of that > pattern > >> to > >> >> a > >> >> > sink or forwards it > >> >> > along to another stream for further computation. > >> >> > > >> >> > As you said, a simple filter-join-aggregate query system could be > >> >> > developed implementing using the existing Streaming windowing API. > >> >> > But matching over complex events and decoding their pattern queries > >> >> would > >> >> > require implementing a DSL that transforms queries into an > evaluation > >> >> > model. For e.g, > >> >> > in [1], the authors have implemented an NFA automaton with a shared > >> >> > versioned buffer that models the queries. In [4], the authors > >> >> > propose a new language that is much more expressive and compiles > >> into a > >> >> > topology graph for Storm. > >> >> > > >> >> > So in Flink's case, I believe the proposed DSL would generate > >> operator > >> >> > graphs for the Flink compiler to schedule Jobgraphs over > >> TaskManagers. > >> >> > If we don't depend on the Windowing API, would we need to create > new > >> >> > operators such as the Projection, Conjunction and Union operators > >> >> defined > >> >> > in [4] ? > >> >> > Also I would like to hear your thoughts on how to approach scaling > >> the > >> >> > pattern matching query. Note all these techniques talk about > scaling > >> a > >> >> > single query. > >> >> > I've read various ways such as > >> >> > > >> >> > 1. Merging equivalent runs[1] -: This seems a good way to squash > >> >> multiple > >> >> > instances of pattern matching forks into a single one if they have > >> the > >> >> same > >> >> > state. > >> >> > But I'm not sure how we would implement this in Flink since this > is a > >> >> > runtime optimization. > >> >> > > >> >> > 2. Implementing a matched version buffer[1] -: This would involve > >> >> sharing > >> >> > state of a buffer datastructure across multiple candidate match > >> >> instances > >> >> > for the pattern. > >> >> > > >> >> > 3. Splitting complex composite patterns into simpler > sub-patterns[4] > >> >> and > >> >> > executing separate queries to detect those sub-patterns. This might > >> >> > translate into different > >> >> > tasks and duplicating the source datastreams to all the new > generated > >> >> > tasks. > >> >> > > >> >> > Also since I don't know how the Flink compiler behaves, would some > of > >> >> the > >> >> > optimizations involve making changes to it too? > >> >> > > >> >> > Regards, > >> >> > Akshay Dixit > >> >> > > >> >> > [1] : Efficient Pattern Matching over Event Streams > >> >> > < > >> http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf > >> >> > > >> >> > [2] : On Supporting Kleene Closure over Event Streams > >> >> > <http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf> > >> >> > [3] : Processing Flows of Information: From Data Stream to Complex > >> Event > >> >> > Processing > >> >> > < > >> >> > >> > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf > >> >> > > >> >> > [4] : Distributing Complex Event Detection > >> >> > < > >> >> > >> http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf > > > >> >> > > >> >> > On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi < > >> >> [hidden email]> > >> >> > wrote: > >> >> > > >> >> >> Dear Akshay, > >> >> >> > >> >> >> Thanks again for your interest and for the recent contribution to > >> >> >> streaming. > >> >> >> > >> >> >> Both of the projects mentioned wold be largely appreciated by the > >> >> >> community, and you can also propose other project suggestions here > >> for > >> >> >> discussion. > >> >> >> > >> >> >> Regarding FLINK-1534, the thesis I mentioned serves as a starting > >> point > >> >> >> and > >> >> >> indeed the basic solution can be implemented with filtering and > >> >> >> windowing/mapping with some state storing whether the cause of an > >> event > >> >> >> has > >> >> >> been already seen. Solely relying on the now existing windowing > API > >> >> this > >> >> >> however might cause performance issues if the events also have an > >> >> >> expiration timeout - some optimization there would be included. > The > >> >> >> further > >> >> >> challenge is to try to further exploit the parallel job execution > of > >> >> Flink > >> >> >> to possibly scale a pattern matching query. > >> >> >> > >> >> >> Best, > >> >> >> > >> >> >> Marton > >> >> >> > >> >> >> On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit < > [hidden email] > >> > > >> >> >> wrote: > >> >> >> > >> >> >> > Hi, > >> >> >> > I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, India. > >> I'm > >> >> >> > currently interested in distributed systems and stream > processing > >> >> and am > >> >> >> > looking to delve deeper into the subject, and hope to get some > >> >> insight > >> >> >> by > >> >> >> > contributing to Apache Flink. I've gathered some idea of the > >> >> >> > flink-streaming codebase by recently working on a PR for > >> >> FLINK-1450[2]. > >> >> >> > > >> >> >> > Both FLINK-1617[3] and FLINK-1534[4] are interesting projects > >> that I > >> >> >> would > >> >> >> > love to work on over the summer. I was wondering which amongst > >> these > >> >> >> would > >> >> >> > be more appreciated by the community, so I can start working > >> towards > >> >> a > >> >> >> > proposal for either one. > >> >> >> > > >> >> >> > Regarding FLINK-1534, I was wondering why would simply merging > and > >> >> >> > filtering the existing streams for events we want to detect not > >> work? > >> >> >> Also > >> >> >> > on going through the document mentioned by @mbalassi in the JIRA > >> >> >> > comment[5], the authors specify some Runtime Event Detection > >> >> concepts in > >> >> >> > Section 5.2. I'm assuming the project entails on building a > >> similar > >> >> >> analogy > >> >> >> > using Flink and the deliverables would include working pattern > >> >> matching > >> >> >> > operators over Flink DataStreams as described in the report. If > >> so, > >> >> then > >> >> >> > shouldn't it be trivial to implement the described the Binary > >> >> operator > >> >> >> > using a WindowedStream and a Filter? > >> >> >> > I hope my questions don't seem misplaced here and I would > >> appreciate > >> >> >> links > >> >> >> > to literature where I can learn more on the topic. > >> >> >> > > >> >> >> > Regards, > >> >> >> > Akshay Dixit > >> >> >> > > >> >> >> > [1] : http://akshaydixi.me > >> >> >> > [2] : https://github.com/apache/flink/pull/481 > >> >> >> > [3] : https://issues.apache.org/jira/browse/FLINK-1617 > >> >> >> > [4] : https://issues.apache.org/jira/browse/FLINK-1534 > >> >> >> > [5] : > >> >> >> > > >> >> > >> http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf > >> >> >> > > >> >> >> > >> >> > > >> >> > > >> >> > >> > > >> > > >> > > > > > |
Thanks for going through it Gyula.
I've made the necessary amends to the timeline and submitted the proposal. Regards, Akshay Dixit On Thu, Mar 26, 2015 at 8:53 PM, Gyula Fóra <[hidden email]> wrote: > I think it looks good for a start, we will have to work on the API a little > bit together to make it fit smoothly with what we currently have. > > There is a few gaps in the timeline but that you have probably noticed :) > > Otherwise +1 from me. > > On Wed, Mar 25, 2015 at 11:35 PM, Akshay Dixit <[hidden email]> > wrote: > > > Hi, > > The link to the draft proposal that I've prepared is > > https://gist.github.com/akshaydixi/88f3fbcebab0119a6a31 > > It would be great if I could get some feedback on it. > > Regards, > > Akshay Dixit > > > > On Wed, Mar 25, 2015 at 2:03 AM, Akshay Dixit <[hidden email]> > > wrote: > > > > > Thanks Gyula. > > > > > > I agree too that simple and working implementations are preferrable > over > > > hacky complex solutions. I'll start sketching out an initial > > straighforward > > > API with only basic pattern matching features > > > and base it on the existing windowing API. I'll post a draft of the > > > proposal, keeping the points you've said in mind, tomorrow, so you can > > > look it over to see if its all right. > > > Regards, > > > Akshay Dixit > > > > > > On Tue, Mar 24, 2015 at 6:30 PM, Gyula Fóra <[hidden email]> wrote: > > > > > >> Hey Dixit, > > >> > > >> Sorry for the delay, I had to discuss this in more detail with some of > > our > > >> other core developers. > > >> > > >> The consensus seems to be that we would like push this project in a > > >> direction where the changes can be quickly included in the next > > releases. > > >> For this it is essential that we implement features that are complete > > (and > > >> clean) from the users perspective. This does not necessarily mean that > > we > > >> would like to have everything at once but rather that it is preferable > > to > > >> start with something clean and simple (for instance the naive chained > > >> filter approach) and progressively build more complex logic. > > >> > > >> This also mean that we would like to avoid "researchy" code in the > > >> codebase > > >> as much as possible. Of course once we have a stable api for this > > >> functionality we can work towards making the optimizations that you > have > > >> mentioned like operator sharing and so on. > > >> > > >> The ideal proposal would give a clear sketch of the pattern matching > API > > >> that you would like to implement, which might be some added operators > at > > >> first to the current API and possible a DSL later with more advanced > > >> functionality (this would probably go in a separate library until it > is > > >> very stable). > > >> > > >> So please in the proposal include a preview of what the pattern > matching > > >> syntax would look like integrated with the current operators, how it > > would > > >> interact with other parts of the system etc. > > >> > > >> These are the thing we need to figure out before we consider the > > >> optimizations I think, because it usually turns out, that the API > > >> semantics > > >> you would like to provide can hugely affect (probably limit) the > > >> possibilities that you have afterwards in terms of optimizations. > > >> > > >> Let me know if you have further questions regarding this :) > > >> > > >> Gyula > > >> > > >> On Tue, Mar 24, 2015 at 12:01 PM, Gyula Fóra <[hidden email]> > wrote: > > >> > > >> > Hey, > > >> > > > >> > Give me an hour or so as I am in a meeting currently, but I will get > > >> back > > >> > to you afterwards. > > >> > > > >> > Regards, > > >> > Gyula > > >> > > > >> > On Tue, Mar 24, 2015 at 11:03 AM, Akshay Dixit < > [hidden email]> > > >> > wrote: > > >> > > > >> >> Hi, > > >> >> It'd really help if I got a reply soon. It'll be helpful in writing > > the > > >> >> proposal since the deadline is on 27th. Thanks > > >> >> Regards, > > >> >> Akshay Dixit > > >> >> > > >> >> On Sun, Mar 22, 2015 at 1:17 AM, Akshay Dixit < > [hidden email]> > > >> >> wrote: > > >> >> > > >> >> > Thanks for the explanation Marton. I've decided to try out for > > >> >> FLINK-1534. > > >> >> > > > >> >> > After reading through the thesis[4] and a few other > > papers[1][2][3], > > >> I > > >> >> > believe I've gathered a little context to ask more questions. But > > I'm > > >> >> still > > >> >> > not sure how Flink's internals work > > >> >> > so please bear with me. Although the ongoing effort to document > the > > >> >> > architecture and internal is really helpful for newbies like me > and > > >> >> would > > >> >> > greatly decrease the ramping up time. > > >> >> > > > >> >> > Detecting a pattern of events would comprise of a pipeline that > > >> accepts > > >> >> > the pattern query and > > >> >> > sources of DataStreams, and outputs detected matches of that > > pattern > > >> to > > >> >> a > > >> >> > sink or forwards it > > >> >> > along to another stream for further computation. > > >> >> > > > >> >> > As you said, a simple filter-join-aggregate query system could be > > >> >> > developed implementing using the existing Streaming windowing > API. > > >> >> > But matching over complex events and decoding their pattern > queries > > >> >> would > > >> >> > require implementing a DSL that transforms queries into an > > evaluation > > >> >> > model. For e.g, > > >> >> > in [1], the authors have implemented an NFA automaton with a > shared > > >> >> > versioned buffer that models the queries. In [4], the authors > > >> >> > propose a new language that is much more expressive and compiles > > >> into a > > >> >> > topology graph for Storm. > > >> >> > > > >> >> > So in Flink's case, I believe the proposed DSL would generate > > >> operator > > >> >> > graphs for the Flink compiler to schedule Jobgraphs over > > >> TaskManagers. > > >> >> > If we don't depend on the Windowing API, would we need to create > > new > > >> >> > operators such as the Projection, Conjunction and Union operators > > >> >> defined > > >> >> > in [4] ? > > >> >> > Also I would like to hear your thoughts on how to approach > scaling > > >> the > > >> >> > pattern matching query. Note all these techniques talk about > > scaling > > >> a > > >> >> > single query. > > >> >> > I've read various ways such as > > >> >> > > > >> >> > 1. Merging equivalent runs[1] -: This seems a good way to squash > > >> >> multiple > > >> >> > instances of pattern matching forks into a single one if they > have > > >> the > > >> >> same > > >> >> > state. > > >> >> > But I'm not sure how we would implement this in Flink since this > > is a > > >> >> > runtime optimization. > > >> >> > > > >> >> > 2. Implementing a matched version buffer[1] -: This would > involve > > >> >> sharing > > >> >> > state of a buffer datastructure across multiple candidate match > > >> >> instances > > >> >> > for the pattern. > > >> >> > > > >> >> > 3. Splitting complex composite patterns into simpler > > sub-patterns[4] > > >> >> and > > >> >> > executing separate queries to detect those sub-patterns. This > might > > >> >> > translate into different > > >> >> > tasks and duplicating the source datastreams to all the new > > generated > > >> >> > tasks. > > >> >> > > > >> >> > Also since I don't know how the Flink compiler behaves, would > some > > of > > >> >> the > > >> >> > optimizations involve making changes to it too? > > >> >> > > > >> >> > Regards, > > >> >> > Akshay Dixit > > >> >> > > > >> >> > [1] : Efficient Pattern Matching over Event Streams > > >> >> > < > > >> > http://people.cs.umass.edu/~yanlei/publications/sase-sigmod08-long.pdf > > >> >> > > > >> >> > [2] : On Supporting Kleene Closure over Event Streams > > >> >> > <http://people.cs.umass.edu/~yanlei/publications/sase-icde08.pdf > > > > >> >> > [3] : Processing Flows of Information: From Data Stream to > Complex > > >> Event > > >> >> > Processing > > >> >> > < > > >> >> > > >> > > > http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.396.1785&rep=rep1&type=pdf > > >> >> > > > >> >> > [4] : Distributing Complex Event Detection > > >> >> > < > > >> >> > > >> > http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf > > > > > >> >> > > > >> >> > On Mon, Mar 16, 2015 at 3:22 PM, Márton Balassi < > > >> >> [hidden email]> > > >> >> > wrote: > > >> >> > > > >> >> >> Dear Akshay, > > >> >> >> > > >> >> >> Thanks again for your interest and for the recent contribution > to > > >> >> >> streaming. > > >> >> >> > > >> >> >> Both of the projects mentioned wold be largely appreciated by > the > > >> >> >> community, and you can also propose other project suggestions > here > > >> for > > >> >> >> discussion. > > >> >> >> > > >> >> >> Regarding FLINK-1534, the thesis I mentioned serves as a > starting > > >> point > > >> >> >> and > > >> >> >> indeed the basic solution can be implemented with filtering and > > >> >> >> windowing/mapping with some state storing whether the cause of > an > > >> event > > >> >> >> has > > >> >> >> been already seen. Solely relying on the now existing windowing > > API > > >> >> this > > >> >> >> however might cause performance issues if the events also have > an > > >> >> >> expiration timeout - some optimization there would be included. > > The > > >> >> >> further > > >> >> >> challenge is to try to further exploit the parallel job > execution > > of > > >> >> Flink > > >> >> >> to possibly scale a pattern matching query. > > >> >> >> > > >> >> >> Best, > > >> >> >> > > >> >> >> Marton > > >> >> >> > > >> >> >> On Sun, Mar 15, 2015 at 3:22 PM, Akshay Dixit < > > [hidden email] > > >> > > > >> >> >> wrote: > > >> >> >> > > >> >> >> > Hi, > > >> >> >> > I'm Akshay Dixit[1], a 4th year undergrad at VIT Vellore, > India. > > >> I'm > > >> >> >> > currently interested in distributed systems and stream > > processing > > >> >> and am > > >> >> >> > looking to delve deeper into the subject, and hope to get some > > >> >> insight > > >> >> >> by > > >> >> >> > contributing to Apache Flink. I've gathered some idea of the > > >> >> >> > flink-streaming codebase by recently working on a PR for > > >> >> FLINK-1450[2]. > > >> >> >> > > > >> >> >> > Both FLINK-1617[3] and FLINK-1534[4] are interesting projects > > >> that I > > >> >> >> would > > >> >> >> > love to work on over the summer. I was wondering which amongst > > >> these > > >> >> >> would > > >> >> >> > be more appreciated by the community, so I can start working > > >> towards > > >> >> a > > >> >> >> > proposal for either one. > > >> >> >> > > > >> >> >> > Regarding FLINK-1534, I was wondering why would simply merging > > and > > >> >> >> > filtering the existing streams for events we want to detect > not > > >> work? > > >> >> >> Also > > >> >> >> > on going through the document mentioned by @mbalassi in the > JIRA > > >> >> >> > comment[5], the authors specify some Runtime Event Detection > > >> >> concepts in > > >> >> >> > Section 5.2. I'm assuming the project entails on building a > > >> similar > > >> >> >> analogy > > >> >> >> > using Flink and the deliverables would include working pattern > > >> >> matching > > >> >> >> > operators over Flink DataStreams as described in the report. > If > > >> so, > > >> >> then > > >> >> >> > shouldn't it be trivial to implement the described the Binary > > >> >> operator > > >> >> >> > using a WindowedStream and a Filter? > > >> >> >> > I hope my questions don't seem misplaced here and I would > > >> appreciate > > >> >> >> links > > >> >> >> > to literature where I can learn more on the topic. > > >> >> >> > > > >> >> >> > Regards, > > >> >> >> > Akshay Dixit > > >> >> >> > > > >> >> >> > [1] : http://akshaydixi.me > > >> >> >> > [2] : https://github.com/apache/flink/pull/481 > > >> >> >> > [3] : https://issues.apache.org/jira/browse/FLINK-1617 > > >> >> >> > [4] : https://issues.apache.org/jira/browse/FLINK-1534 > > >> >> >> > [5] : > > >> >> >> > > > >> >> > > >> > http://www.doc.ic.ac.uk/teaching/distinguished-projects/2012/k.nagy.pdf > > >> >> >> > > > >> >> >> > > >> >> > > > >> >> > > > >> >> > > >> > > > >> > > > >> > > > > > > > > > |
Free forum by Nabble | Edit this page |