Hi all,
I would like to start a discussion on this feature request (JIRA link). <https://issues.apache.org/jira/browse/FLINK-16039> Consider the events : [1, event], [2, event] where first element is event timestamp in seconds and second element is event code/name. Also consider that an Event time session window with inactivityGap = 2 seconds is acting on above stream. When the first event arrives, a session window should be created that is [1,1]. When the second event arrives, a new session window should be created that is [2,2]. Since this falls within firstWindowTimestamp+inactivityGap, it should be merged into session window [1,2] and [2,2] should be deleted. *This is my understanding of how session windows are created. Please correct me if wrong.* However, Flink does not follow such a definition of windows semantically. If I call the getEnd() method of the TimeWindow() class, I get back timestamp + inactivityGap. For the above example, after processing the first element, I would get 1 + 2 = 3 seconds as the window "end". The actual window end should be the timestamp 1, which is the last event in the session window. A solution would be to change the "end" definition of all windows, but I suppose this would be breaking and would need some debate. Therefore, I propose an intermediate solution : add a new API method that keeps track of the last element added in the session window. If there is agreement on this, I would like to start drafting a change document and implement this. |
Hi Manas,
First of all I think your understanding of how the session windows work is correct. I tend to slightly disagree that the end for a session window is wrong. It is my personal opinion though. I see it this way that a TimeWindow in case of a session window is the session itself. The session always ends after a period of inactivity. Take a user session on a webpage. Such a session does not end/isn't brought down at the time of a last event. It is closed after a period of inactivity. In such scenario I think the behavior of the session window is correct. Moreover you can achieve what you are describing with an aggregate[1] function. You can easily maintain the biggest number seen for a window. Lastly, I think the overall feeling in the community is that we are very skeptical towards extending the Windows API. From what I've heard and experienced the ProcessFunction[2] is a much better principle to build custom solutions upon, as in fact its easier to control and even understand. That said I am rather against introducing that change. Best, Dawid [1] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/windows.html#aggregatefunction [2] https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/process_function.html#process-function-low-level-operations On 13/03/2020 09:46, Manas Kale wrote: > Hi all, > I would like to start a discussion on this feature request (JIRA link). > <https://issues.apache.org/jira/browse/FLINK-16039> > > Consider the events : > > [1, event], [2, event] > > where first element is event timestamp in seconds and second element is > event code/name. > > Also consider that an Event time session window with inactivityGap = 2 > seconds is acting on above stream. > > When the first event arrives, a session window should be created that is > [1,1]. > > When the second event arrives, a new session window should be created that > is [2,2]. Since this falls within firstWindowTimestamp+inactivityGap, it > should be merged into session window [1,2] and [2,2] should be deleted. > > > *This is my understanding of how session windows are created. Please > correct me if wrong.* > However, Flink does not follow such a definition of windows semantically. > If I call the getEnd() method of the TimeWindow() class, I get back > timestamp + inactivityGap. > > For the above example, after processing the first element, I would get 1 + > 2 = 3 seconds as the window "end". > > The actual window end should be the timestamp 1, which is the last event in > the session window. > > A solution would be to change the "end" definition of all windows, but I > suppose this would be breaking and would need some debate. > > Therefore, I propose an intermediate solution : add a new API method that > keeps track of the last element added in the session window. > > If there is agreement on this, I would like to start drafting a change > document and implement this. > signature.asc (849 bytes) Download Attachment |
Hi Dawid,
Thank you for the response, I see your point. I was perhaps thinking only from the perspective of my use case where I think such a definition makes sense and did not account for the general case. Regards, Manas On Thu, Mar 26, 2020 at 8:40 PM Dawid Wysakowicz <[hidden email]> wrote: > Hi Manas, > > First of all I think your understanding of how the session windows work > is correct. > > I tend to slightly disagree that the end for a session window is wrong. > It is my personal opinion though. I see it this way that a TimeWindow in > case of a session window is the session itself. The session always ends > after a period of inactivity. Take a user session on a webpage. Such a > session does not end/isn't brought down at the time of a last event. It > is closed after a period of inactivity. In such scenario I think the > behavior of the session window is correct. > > Moreover you can achieve what you are describing with an aggregate[1] > function. You can easily maintain the biggest number seen for a window. > > Lastly, I think the overall feeling in the community is that we are very > skeptical towards extending the Windows API. From what I've heard and > experienced the ProcessFunction[2] is a much better principle to build > custom solutions upon, as in fact its easier to control and even > understand. That said I am rather against introducing that change. > > Best, > > Dawid > > [1] > > https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/windows.html#aggregatefunction > > [2] > > https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/stream/operators/process_function.html#process-function-low-level-operations > > On 13/03/2020 09:46, Manas Kale wrote: > > Hi all, > > I would like to start a discussion on this feature request (JIRA link). > > <https://issues.apache.org/jira/browse/FLINK-16039> > > > > Consider the events : > > > > [1, event], [2, event] > > > > where first element is event timestamp in seconds and second element is > > event code/name. > > > > Also consider that an Event time session window with inactivityGap = 2 > > seconds is acting on above stream. > > > > When the first event arrives, a session window should be created that is > > [1,1]. > > > > When the second event arrives, a new session window should be created > that > > is [2,2]. Since this falls within firstWindowTimestamp+inactivityGap, it > > should be merged into session window [1,2] and [2,2] should be deleted. > > > > > > *This is my understanding of how session windows are created. Please > > correct me if wrong.* > > However, Flink does not follow such a definition of windows semantically. > > If I call the getEnd() method of the TimeWindow() class, I get back > > timestamp + inactivityGap. > > > > For the above example, after processing the first element, I would get 1 > + > > 2 = 3 seconds as the window "end". > > > > The actual window end should be the timestamp 1, which is the last event > in > > the session window. > > > > A solution would be to change the "end" definition of all windows, but I > > suppose this would be breaking and would need some debate. > > > > Therefore, I propose an intermediate solution : add a new API method that > > keeps track of the last element added in the session window. > > > > If there is agreement on this, I would like to start drafting a change > > document and implement this. > > > > |
Free forum by Nabble | Edit this page |