[jira] [Created] (FLINK-1287) Improve File Input Split assignment


Shang Yuanchun (Jira)
Robert Metzger created FLINK-1287:
-------------------------------------

             Summary: Improve File Input Split assignment
                 Key: FLINK-1287
                 URL: https://issues.apache.org/jira/browse/FLINK-1287
             Project: Flink
          Issue Type: Improvement
          Components: Local Runtime
            Reporter: Robert Metzger


While running some DFS read-intensive benchmarks, I found that the assignment of input splits is not optimal, in particular when numWorker != numDataNodes and the replication factor is low (in my case it was 1).

In this particular example, the input had 40960 splits, of which 4694 (about 11.5%) were read remotely. Spark did only 2056 remote reads (about 5%) for the same dataset.
With the replication factor increased to 2, Flink did only 290 remote reads (about 0.7%), so most users should not be affected by this issue.
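
To illustrate why the numWorker != numDataNodes case hurts at replication factor 1, here is a minimal simulation sketch. It is not Flink's actual split assigner. The function `assign_splits`, its `prefer_orphans` flag, and the round-robin request order are all hypothetical simplifications. The idea it shows: when a worker has no local split left, handing it a split that no worker could read locally anyway avoids "stealing" a split that is local to another worker.

```python
def assign_splits(split_hosts, workers, prefer_orphans):
    """Simulate workers requesting splits in round-robin order.

    split_hosts: list of sets; split_hosts[i] holds the hosts storing split i.
    workers: list of host names that run a worker.
    prefer_orphans: if True, a worker without local splits is first given
        splits that no worker can read locally (hypothetical heuristic).
    Returns the number of remote reads.
    """
    worker_set = set(workers)
    unassigned = set(range(len(split_hosts)))
    remote_reads = 0
    turn = 0
    while unassigned:
        worker = workers[turn % len(workers)]
        turn += 1
        candidates = sorted(unassigned)
        local = [s for s in candidates if worker in split_hosts[s]]
        if local:
            choice = local[0]
        else:
            if prefer_orphans:
                # Splits stored only on data nodes without a worker.
                orphans = [s for s in candidates
                           if not (split_hosts[s] & worker_set)]
                if orphans:
                    candidates = orphans
            choice = candidates[0]
            remote_reads += 1
        unassigned.remove(choice)
    return remote_reads


# Three data nodes (A, B, C) but only two workers (A, B), replication factor 1.
splits = [{"A"}, {"B"}, {"B"}, {"B"}, {"C"}, {"C"}]
workers = ["A", "B"]

naive = assign_splits(splits, workers, prefer_orphans=False)
aware = assign_splits(splits, workers, prefer_orphans=True)
print(naive, aware)
```

In this toy setup the two splits on node C must be read remotely in any case. The naive variant adds a third remote read because worker A takes a split that was local to worker B.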





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)