Stephan Ewen created FLINK-5332:
-----------------------------------
Summary: Non-thread safe FileSystem::initOutPathLocalFS() can cause lost files/directories in local execution
Key: FLINK-5332
URL:
https://issues.apache.org/jira/browse/FLINK-5332 Project: Flink
Issue Type: Bug
Components: Core
Affects Versions: 1.2.0
Reporter: Stephan Ewen
Assignee: Stephan Ewen
Priority: Critical
Fix For: 1.2.0
This is mainly relevant to tests and Local Mini Cluster executions.
The {{FileOutputFormat}} and its subclasses rely on {{FileSystem::initOutPathLocalFS()}} to prepare the output directory. When multiple parallel output writers call that method, there is a slim chance that one parallel threads deletes the others directory. The checks that the method has are not bullet proof.
I believe that this is the cause for many Travis test instabilities that we observed over time.
Simply synchronizing that method per process should do the trick. Since it is a rare initialization method, and only relevant in tests & local mini cluster executions, it should be a price that is okay to pay. I see no other way, as we do not have simple access to an atomic "check and delete and recreate" file operation.
The synchronization also makes many "re-try" code paths obsolete (there should be no re-tries needed on proper file systems).
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)