[jira] [Created] (FLINK-22792) Limit size of already processed files in File Source SplitEnumerator

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-22792) Limit size of already processed files in File Source SplitEnumerator

Shang Yuanchun (Jira)
Tianxin Zhao created FLINK-22792:
------------------------------------

             Summary: Limit size of already processed files in File Source SplitEnumerator
                 Key: FLINK-22792
                 URL: https://issues.apache.org/jira/browse/FLINK-22792
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / FileSystem
            Reporter: Tianxin Zhao


File Source makes use of {{ContinuousFileSplitEnumerator}} to discover files in selected file system. Task inside the SplitEnumerator periodically lists given path and creates splits from the path. To avoid splits getting reprocessed, currently all processed paths is recorded in the set {{pathsAlreadyProcessed}}. However, this set could grow indefinitely with new files added to the input path and eventually result in out of memory issue. (Original PR: [https://github.com/apache/flink/pull/13401])

This ticket aim to limit the size of {{pathsAlreadyProcessed}} in use of a configurable SLA such that files older than some (watermark - SLA) would be ignored to be processed and also cleaned up from the {{pathsAlreadyProcessed}} set. Watermark is decided based on the minimum modification time of unprocessed files. {{pathsAlreadyProcessed}} set would be cleaned up during every snapshot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)