[jira] [Created] (FLINK-21350) ParquetInputFormat incorrectly interprets timestamps encoded in microseconds as timestamps encoded in milliseconds


Shang Yuanchun (Jira)
Jeffrey Charles created FLINK-21350:
---------------------------------------

             Summary: ParquetInputFormat incorrectly interprets timestamps encoded in microseconds as timestamps encoded in milliseconds
                 Key: FLINK-21350
                 URL: https://issues.apache.org/jira/browse/FLINK-21350
             Project: Flink
          Issue Type: Bug
          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
    Affects Versions: 1.12.1, 1.12.0
            Reporter: Jeffrey Charles


Given a parquet file with a schema that has a field with a physical type of INT64 and a logical type of TIMESTAMP_MICROS, all of the ParquetInputFormat sub-classes deserialize the timestamp as tens of thousands of years in the future.

Looking at the code at https://github.com/apache/flink/blob/release-1.12.1/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/utils/RowConverter.java#L326, it looks like the row converter interprets the field value as milliseconds rather than microseconds. Specifically, the millisecond and microsecond cases share the same code path, which instantiates a java.sql.Timestamp via a constructor that expects a millisecond value, but the microsecond case statement passes it a value in microseconds. I tested a change locally that divides the value by 1000 in the microseconds case statement, and that produces a timestamp with the expected value.
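A minimal sketch of the misinterpretation described above (not the Flink code itself; the epoch value and class name here are illustrative): passing a microsecond epoch value straight into java.sql.Timestamp's millisecond-based constructor yields a date tens of thousands of years in the future, while dividing by 1000 first gives the intended instant.

```java
import java.sql.Timestamp;

public class TimestampMicrosDemo {
    public static void main(String[] args) {
        // Epoch value in MICROseconds for 2021-02-10T00:00:00Z,
        // as a Parquet TIMESTAMP_MICROS field would store it.
        long micros = 1612915200000000L;

        // Buggy interpretation: the constructor expects milliseconds,
        // so the raw microsecond value lands ~50,000 years in the future.
        Timestamp wrong = new Timestamp(micros);

        // Fixed interpretation: convert microseconds to milliseconds first.
        Timestamp right = new Timestamp(micros / 1000);

        System.out.println("wrong = " + wrong.toInstant());
        System.out.println("right = " + right.toInstant()); // 2021-02-10T00:00:00Z
    }
}
```

Note that dividing by 1000 truncates sub-millisecond precision; a full fix would also carry the remaining microseconds into the Timestamp's nanos field.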



--
This message was sent by Atlassian Jira
(v8.3.4#803005)