[jira] [Created] (FLINK-13956) Add sequence file format with repeated sync blocks


Shang Yuanchun (Jira)
Arvid Heise created FLINK-13956:
-----------------------------------

             Summary: Add sequence file format with repeated sync blocks
                 Key: FLINK-13956
                 URL: https://issues.apache.org/jira/browse/FLINK-13956
             Project: Flink
          Issue Type: Improvement
            Reporter: Arvid Heise


The current {{SequenceFileFormat}} produces files that are tightly coupled to the block size of the file system. While this was a somewhat plausible assumption in the old HDFS days, it can lead to [hard-to-debug issues in other file systems|https://lists.apache.org/thread.html/bdd87cbb5eb7b19ab4be6501940ec5659e8f6ce6c27ccefa2680732c@%3Cdev.flink.apache.org%3E].

We could implement a file format similar to the current version of Hadoop's SequenceFileFormat: add a sync block between records whenever X bytes have been written. Hadoop uses 2k, but I'd propose using 1M.
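
To make the idea concrete, here is a minimal sketch of the writer side. It is not Flink's or Hadoop's actual code: the class name {{SyncBlockWriterSketch}}, the length-prefixed record layout, the escape value -1, and the 16-byte marker size are all assumptions chosen for illustration. It writes to an in-memory buffer and emits a sync block before a record whenever at least {{syncInterval}} bytes have been written since the last sync block.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;

/** Illustrative sketch only; all names and the on-disk layout are hypothetical. */
public class SyncBlockWriterSketch {
    // Assumed escape value written in place of a record length; a real length is never negative.
    static final int SYNC_ESCAPE = -1;
    // Assumed marker size; a real format would pick a random marker per file.
    static final int SYNC_MARKER_SIZE = 16;

    private final ByteArrayOutputStream out = new ByteArrayOutputStream();
    private final byte[] syncMarker;
    private final long syncInterval;
    private long lastSyncPos = 0;
    private int syncCount = 0;

    public SyncBlockWriterSketch(byte[] syncMarker, long syncInterval) {
        if (syncMarker.length != SYNC_MARKER_SIZE) {
            throw new IllegalArgumentException("sync marker must be " + SYNC_MARKER_SIZE + " bytes");
        }
        this.syncMarker = syncMarker.clone();
        this.syncInterval = syncInterval;
    }

    /** Writes one length-prefixed record, preceded by a sync block if the interval has elapsed. */
    public void writeRecord(byte[] record) {
        if (out.size() - lastSyncPos >= syncInterval) {
            writeSync();
        }
        writeInt(record.length);
        out.write(record, 0, record.length);
    }

    /** Writes the escape value followed by the marker and remembers the position after it. */
    private void writeSync() {
        writeInt(SYNC_ESCAPE);
        out.write(syncMarker, 0, syncMarker.length);
        lastSyncPos = out.size();
        syncCount++;
    }

    /** Writes a 4-byte big-endian int. */
    private void writeInt(int value) {
        byte[] bytes = ByteBuffer.allocate(4).putInt(value).array();
        out.write(bytes, 0, bytes.length);
    }

    public int syncCount() {
        return syncCount;
    }

    public byte[] toByteArray() {
        return out.toByteArray();
    }
}
```

With such a layout, a reader that starts at an arbitrary split offset could scan forward for the marker and resume at the next record boundary, independent of the file system's block size. The value of {{syncInterval}} trades space overhead against how far a reader has to scan, which is what the 2k versus 1M choice is about.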



--
This message was sent by Atlassian Jira
(v8.3.2#803003)