Felix Neutatz created FLINK-1271:
------------------------------------
Summary: Extend HadoopOutputFormat and HadoopInputFormat to handle Void.class
Key: FLINK-1271
URL: https://issues.apache.org/jira/browse/FLINK-1271
Project: Flink
Issue Type: Wish
Components: Hadoop Compatibility
Reporter: Felix Neutatz
Priority: Minor
Parquet, one of the most popular and efficient columnar storage formats in the Hadoop ecosystem, uses Void.class as its key type!
At the moment, only keys that extend Writable are allowed.
For example, we would need to be able to do something like this:
HadoopInputFormat<Void, AminoAcid> hadoopInputFormat = new HadoopInputFormat<Void, AminoAcid>(new ParquetThriftInputFormat<AminoAcid>(), Void.class, AminoAcid.class, job);
ParquetThriftInputFormat.addInputPath(job, new Path("newpath"));
ParquetThriftInputFormat.setReadSupportClass(job, AminoAcid.class);
// Create a Flink job with it
DataSet<Tuple2<Void, AminoAcid>> data = env.createInput(hadoopInputFormat);
Here, AminoAcid is a generated Thrift class.
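To make this concrete, here is a minimal, self-contained sketch of the kind of change this wish implies: the key-class check would have to accept Void.class in addition to Writable subclasses. Note that the Writable interface below is a stand-in so the sketch compiles on its own; the real code would check against org.apache.hadoop.io.Writable, and the method names here are hypothetical, not Flink's actual API.

```java
// Sketch only: how the key-class check in HadoopInputFormat/HadoopOutputFormat
// could special-case Void.class. "Writable" is a local stand-in for
// org.apache.hadoop.io.Writable so this example is self-contained.
public class VoidKeyCheckSketch {

    interface Writable { /* stand-in for org.apache.hadoop.io.Writable */ }

    static class IntWritable implements Writable { }

    // Current behaviour: only Writable keys pass the check.
    static boolean currentCheck(Class<?> keyClass) {
        return Writable.class.isAssignableFrom(keyClass);
    }

    // Proposed behaviour: additionally accept Void.class, since formats
    // like ParquetThriftInputFormat use Void as the key type.
    static boolean proposedCheck(Class<?> keyClass) {
        return Void.class.equals(keyClass)
                || Writable.class.isAssignableFrom(keyClass);
    }

    public static void main(String[] args) {
        System.out.println(currentCheck(Void.class));          // false: rejected today
        System.out.println(proposedCheck(Void.class));         // true: would be accepted
        System.out.println(proposedCheck(IntWritable.class));  // true: Writables still work
    }
}
```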
However, I have already figured out how to write Parquet files by creating a class that extends HadoopOutputFormat.
Now we have to discuss what the best approach is to make the Parquet integration happen.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)