Tianxin Zhao created FLINK-22792:
------------------------------------
Summary: Limit size of already processed files in File Source SplitEnumerator
Key: FLINK-22792
URL:
https://issues.apache.org/jira/browse/FLINK-22792 Project: Flink
Issue Type: Improvement
Components: Connectors / FileSystem
Reporter: Tianxin Zhao
File Source makes use of {{ContinuousFileSplitEnumerator}} to discover files in selected file system. Task inside the SplitEnumerator periodically lists given path and creates splits from the path. To avoid splits getting reprocessed, currently all processed paths is recorded in the set {{pathsAlreadyProcessed}}. However, this set could grow indefinitely with new files added to the input path and eventually result in out of memory issue. (Original PR: [
https://github.com/apache/flink/pull/13401])
This ticket aim to limit the size of {{pathsAlreadyProcessed}} in use of a configurable SLA such that files older than some (watermark - SLA) would be ignored to be processed and also cleaned up from the {{pathsAlreadyProcessed}} set. Watermark is decided based on the minimum modification time of unprocessed files. {{pathsAlreadyProcessed}} set would be cleaned up during every snapshot.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)