Gyula Fora created FLINK-20221:
----------------------------------
Summary: DelimitedInputFormat does not restore compressed filesplits correctly leading to dataloss
Key: FLINK-20221
URL: https://issues.apache.org/jira/browse/FLINK-20221
Project: Flink
Issue Type: Bug
Components: Connectors / FileSystem
Affects Versions: 1.11.2, 1.10.2, 1.12.0
Reporter: Gyula Fora
Assignee: Gyula Fora
It seems that DelimitedInputFormat cannot correctly restore input splits that belong to compressed files. When a compressed file split is restored partway through, it is not read any further, which leads to data loss.
The cause is that for compressed splits read through an inflater stream, the split length is set to the magic number -1. The reopen method does not account for this value, so the split goes to the `end` state immediately.
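To illustrate the failure mode, here is a minimal, self-contained sketch. It is not the Flink code: the class name, the method names, and the exact form of the boundary check are assumptions made for illustration. Only the -1 sentinel and the "split ends immediately" outcome come from the description above.

```java
// Hypothetical model of a restore-time boundary check; not taken from Flink.
final class SplitRestoreSketch {

    // Sentinel meaning "read the whole file", assumed to be -1 as described above.
    static final long READ_WHOLE_SPLIT = -1L;

    // Treats the sentinel as a real length, so any restored offset >= splitStart
    // looks as if it lies past the end of the split.
    static boolean endsImmediatelyNaive(long splitStart, long splitLength, long restoredOffset) {
        return restoredOffset > splitStart + splitLength;
    }

    // Recognises the sentinel first: a whole-file split has no length boundary to exceed.
    static boolean endsImmediatelyGuarded(long splitStart, long splitLength, long restoredOffset) {
        if (splitLength == READ_WHOLE_SPLIT) {
            return false;
        }
        return restoredOffset > splitStart + splitLength;
    }

    public static void main(String[] args) {
        long restoredOffset = 4096L; // arbitrary offset in the middle of a compressed file
        // prints "true": the split is wrongly treated as finished
        System.out.println(endsImmediatelyNaive(0L, READ_WHOLE_SPLIT, restoredOffset));
        // prints "false": reading would continue from the restored state
        System.out.println(endsImmediatelyGuarded(0L, READ_WHOLE_SPLIT, restoredOffset));
    }
}
```

With a regular positive length both checks agree. They differ only for the sentinel, which is the case the compressed splits hit.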
The problem and the fix are shown in this commit:
[https://github.com/gyfora/flink/commit/4adc8ba8d1989fff2db43881c9cb3799848c6e0d]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)