[jira] [Created] (FLINK-21825) AvroInputFormat doesn't honor -1 length on FileInputSplits

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-21825) AvroInputFormat doesn't honor -1 length on FileInputSplits

Shang Yuanchun (Jira)
Nicolas Ferrario created FLINK-21825:
----------------------------------------

             Summary: AvroInputFormat doesn't honor -1 length on FileInputSplits
                 Key: FLINK-21825
                 URL: https://issues.apache.org/jira/browse/FLINK-21825
             Project: Flink
          Issue Type: Bug
          Components: API / Core
    Affects Versions: 1.12.2
            Reporter: Nicolas Ferrario


FileInputSplit documentation says a length of *-1* means that the whole file should be read, however AvroInputFormat expects the exact size.

[https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-core/src/main/java/org/apache/flink/core/fs/FileInputSplit.java#L50]
{code:java}
/**
  * Constructs a split with host information.
  *      
  * @param num the number of this input split
  * @param file the file name
  * @param start the position of the first byte in the file to process
  * @param length the number of bytes in the file to process (-1 is flag for "read whole file")
  * @param hosts the list of hosts containing the block, possibly <code>null</code>
  */
  public FileInputSplit(int num, Path file, long start, long length, String[] hosts){code}
[https://github.com/apache/flink/blob/97bfd049951f8d52a2e0aed14265074c4255ead0/flink-formats/flink-avro/src/main/java/org/apache/flink/formats/avro/AvroInputFormat.java#L141]

 
{code:java}
private DataFileReader<E> initReader(FileInputSplit split) throws IOException {
    DatumReader<E> datumReader;

    if (org.apache.avro.generic.GenericRecord.class == avroValueType) {
        datumReader = new GenericDatumReader<E>();
    } else {
        datumReader =
                org.apache.avro.specific.SpecificRecordBase.class.isAssignableFrom(
                                avroValueType)
                        ? new SpecificDatumReader<E>(avroValueType)
                        : new ReflectDatumReader<E>(avroValueType);
    }
    if (LOG.isInfoEnabled()) {
        LOG.info("Opening split {}", split);
    }

    SeekableInput in =
            new FSDataInputStreamWrapper(
                    stream,
                    split.getPath().getFileSystem().getFileStatus(split.getPath()).getLen());
    DataFileReader<E> dataFileReader =
            (DataFileReader) DataFileReader.openReader(in, datumReader);

    if (LOG.isDebugEnabled()) {
        LOG.debug("Loaded SCHEMA: {}", dataFileReader.getSchema());
    }

    end = split.getStart() + split.getLength(); <--- THIS LINE
    recordsReadSinceLastSync = 0;
    return dataFileReader;
}
{code}
This could be fixed either by updating the documentation, or by honoring -1 in AvroInputFormat.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)