Hello,
I would like to propose an enrichment of the existing Flink SQL MATCH_RECOGNIZE syntax to cover the case of the absence of an event. Such an enrichment would help our company solve a business case that involves handling timed-out patterns. An example of such a case, taken from the Flink training exercises, is the task of identifying taxi rides with a START event that is not followed by an END event within two hours. Currently, this task can be solved with CEP and a timeout handler. However, as far as I know, it is impossible to express it in Flink SQL.

I can think of two ways for such a feature to be incorporated into the existing MATCH_RECOGNIZE syntax:

- In analogy to CEP, a keyword could be added which would determine whether timed-out matches should be dropped altogether or made available through either a side output or the main output. SQL usage could be similar to the current WITHIN clause, e.g. "PATTERN (A B C) TIMEOUT INTERVAL '30' SECOND" would output partially matched patterns 30 seconds after the A event appears.

- Add the possibility to define the absence of an event inside the pattern definition - for example, "PATTERN (A B !C) WITHIN INTERVAL '30' SECOND" would output partially matched patterns containing the A and B events 30 seconds after the A event appears.

In our company we did some basic testing of this concept - we modified the existing MatchCodeGenerator to add a processTimedOutMatch function based on a boolean trigger and tested it against the aforementioned business case.

I'm interested to hear your thoughts about how we could help Flink SQL express these kinds of cases.

With regards,
Kosma Grochowski
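To make the taxi-ride case concrete, here is a minimal Python sketch of the semantics being asked for: report every START that has no END within two hours, but only once event time has moved past the deadline. This is not Flink code and not the proposed implementation; the function name `timed_out_rides`, the `(ride_id, kind, timestamp)` tuple layout, and the `watermark` argument are assumptions made purely for illustration.

```python
from datetime import datetime, timedelta

TIMEOUT = timedelta(hours=2)  # the two-hour limit from the taxi-ride example


def timed_out_rides(events, watermark):
    """Return ids of rides whose START has no END within TIMEOUT.

    events:    iterable of (ride_id, kind, timestamp), kind is "START" or "END"
    watermark: event time up to which the input is assumed to be complete
               (an assumption of this sketch, standing in for stream progress)
    """
    starts, ends = {}, {}
    for ride_id, kind, ts in events:
        if kind == "START":
            starts[ride_id] = ts
        elif kind == "END":
            ends[ride_id] = ts

    result = []
    for ride_id, started in sorted(starts.items()):
        deadline = started + TIMEOUT
        ended = ends.get(ride_id)
        finished_in_time = ended is not None and started <= ended <= deadline
        # Report a ride only after the watermark has passed its deadline;
        # before that, a matching END could still arrive.
        if not finished_in_time and watermark >= deadline:
            result.append(ride_id)
    return result


t0 = datetime(2020, 9, 18, 12, 0)
events = [
    (1, "START", t0),
    (1, "END", t0 + timedelta(minutes=30)),  # finished in time
    (2, "START", t0),                        # never finished
    (3, "START", t0),
    (3, "END", t0 + timedelta(hours=3)),     # finished too late
    (4, "START", t0 + timedelta(hours=2)),   # deadline not reached yet
]
print(timed_out_rides(events, watermark=t0 + timedelta(hours=3)))  # prints [2, 3]
```

Ride 4 is deliberately left out of the result: its deadline has not passed yet, which is the part a plain pattern match cannot express without some notion of a timeout.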
Hi Kosma,
Thanks for the proposal. I like it, and we have also supported a similar syntax in our company.

The problem is that Flink SQL leverages Calcite as the query parser, so if we want to support this syntax, we may have to push it back to the Calcite community. Besides, the SQL standard doesn't define a timeout syntax for MATCH_RECOGNIZE, so we would have to extend the standard, and this is usually not trivial.

So I think it would be better to have a joint discussion with the Calcite and Flink communities. What do you think?

Best,
Jark

On Fri, 18 Sep 2020 at 22:48, Kosma Grochowski <[hidden email]> wrote:
> I would like to propose an enrichment of existing Flink SQL MATCH_RECOGNIZE syntax to cover for the case of the absence of an event. [...]
Hi Jark,
Thank you for your e-mail. I agree, let's engage all interested parties in this discussion - I'm writing this e-mail to both the Flink and Calcite dev mailing lists.

I'll repeat myself to present the proposal to the Calcite community.

I would like to propose an enrichment of the existing Flink SQL MATCH_RECOGNIZE syntax to cover the case of the absence of an event. Such an enrichment would help our company solve a business case that involves handling timed-out patterns. An example of such a case, taken from the Flink training exercises, is the task of identifying taxi rides with a START event that is not followed by an END event within two hours. Currently, this task can be solved with CEP and a timeout handler. However, as far as I know, it is impossible to express it in Flink SQL.

I can think of two ways for such a feature to be incorporated into the existing MATCH_RECOGNIZE syntax:

- In analogy to CEP, a keyword could be added which would determine whether timed-out matches should be dropped altogether or made available through either a side output or the main output. SQL usage could be similar to the current WITHIN clause, e.g. "PATTERN (A B C) TIMEOUT INTERVAL '30' SECOND" would output partially matched patterns 30 seconds after the A event appears.

- Add the possibility to define the absence of an event inside the pattern definition - for example, "PATTERN (A B !C) WITHIN INTERVAL '30' SECOND" would output partially matched patterns containing the A and B events 30 seconds after the A event appears.

In our company we did some basic testing of this concept - we modified the existing MatchCodeGenerator to add a processTimedOutMatch function based on a boolean trigger and tested it against the aforementioned business case.

I'm interested to hear your thoughts about how we could help Flink SQL express these kinds of cases.

With regards,
Kosma Grochowski

On 21 Sep 2020, at 05:12, Jark Wu <[hidden email]> wrote:
> Thanks for the proposal. I like it and we also have supported similar syntax in our company. [...]
Is there a better way?
I'm an idealist with regard to streaming SQL semantics, and I'm going to make the 'slippery slope' argument: if we add a TIMEOUT parameter to MATCH_RECOGNIZE, won't we also need to add it to GROUP BY and JOIN? (Because those are also "blocking" operators.)

Maybe JOIN and GROUP BY are simpler because (absent retractions) they are monotonic. If more data arrives, it will not cause rows to disappear from your result. So, maybe anti-join is the best comparison. How does Flink deal with, say, "show me all orders from customers who have not made a product return in the last 3 months"? You'd need a timeout on the PRODUCT_RETURNS stream, right?

My hunch is that Flink can express these semantics without extending the syntax of JOIN, and if so, we could use the same approach to make MATCH_RECOGNIZE work with late data.

Julian

On Mon, Sep 21, 2020 at 12:05 AM Kosma Grochowski <[hidden email]> wrote:
> Thank you for your e-mail. I agree, let's engage all interested parties in this discussion - I'm writing this e-mail to both Flink and Calcite dev mailing lists. [...]
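For readers who want to see the shape of the anti-join Julian describes, the sketch below runs the "orders from customers with no product return in the last 3 months" query on a static snapshot, using SQLite from Python as a stand-in. It does not show how Flink would evaluate this on a stream; the table layouts, column names, sample rows, and the fixed `NOW` date are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schemas; only the table roles come from the discussion.
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER);
    CREATE TABLE product_returns (customer_id INTEGER, returned_at TEXT);
    INSERT INTO orders VALUES (100, 1), (101, 2), (102, 3);
    INSERT INTO product_returns VALUES
        (1, '2020-09-01'),   -- recent return: customer 1 is excluded
        (2, '2020-01-15');   -- old return: customer 2 still qualifies
""")

NOW = "2020-09-22"  # fixed "current date" so the result is reproducible

rows = conn.execute("""
    SELECT o.order_id
    FROM orders o
    WHERE NOT EXISTS (
        SELECT 1
        FROM product_returns r
        WHERE r.customer_id = o.customer_id
          AND r.returned_at >= date(?, '-3 months')
    )
    ORDER BY o.order_id
""", (NOW,)).fetchall()

order_ids = [row[0] for row in rows]
print(order_ids)  # prints [101, 102]
```

On a snapshot the answer is fixed. On a stream, a later return from customer 2 or 3 would have to remove a row that was already emitted, which is the non-monotonic behaviour the message contrasts with plain JOIN and GROUP BY.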
Thank you Julian for mentioning the anti-join. With its help, I managed to solve our particular case along the following lines:

```
SELECT e.*
FROM events e
LEFT JOIN patterns p
  ON e.record_id = p.begin_record_id
WHERE e.pattern_val = 'BEGIN'
  AND p.begin_record_id IS NULL
```

However, I think such an approach will fail for patterns more complicated than `BEGIN !END`. For example, determining on which event the pattern `A B{1,N} A{1,N} B` timed out does not seem feasible with this approach. Moreover, this way of proceeding seems like a workaround for the limitations of MATCH_RECOGNIZE in dealing with absent events. I can't think of a way to solve these cases generically with the existing syntax, and such pattern extensions would be the way to do that.

With regards,
Kosma

On 22 Sep 2020, at 20:29, Julian Hyde <[hidden email]> wrote:
> Is there a better way? [...]
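Kosma's anti-join query can be tried on toy data. The sketch below runs it against an in-memory SQLite database from Python, so it only demonstrates the query logic on a static snapshot and says nothing about how a streaming engine would emit or retract rows. The `end_record_id` column and the sample rows are assumptions; the thread only names `events`, `patterns`, `record_id`, `pattern_val` and `begin_record_id`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (record_id INTEGER, pattern_val TEXT);
    -- One row per completed BEGIN..END match, e.g. as produced by a
    -- MATCH_RECOGNIZE query; end_record_id is a hypothetical extra column.
    CREATE TABLE patterns (begin_record_id INTEGER, end_record_id INTEGER);

    INSERT INTO events VALUES
        (1, 'BEGIN'), (2, 'END'),
        (3, 'BEGIN'),
        (4, 'BEGIN'), (5, 'END');
    INSERT INTO patterns VALUES (1, 2), (4, 5);
""")

# The anti-join: BEGIN events that never became the start of a match.
unmatched = conn.execute("""
    SELECT e.record_id
    FROM events e
    LEFT JOIN patterns p
      ON e.record_id = p.begin_record_id
    WHERE e.pattern_val = 'BEGIN'
      AND p.begin_record_id IS NULL
    ORDER BY e.record_id
""").fetchall()

print(unmatched)  # prints [(3,)]
```

The query reports that record 3 never completed, but it cannot say how far a longer pattern got before timing out, which is the limitation raised for patterns such as `A B{1,N} A{1,N} B`.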