Caizhi Weng created FLINK-19641:
-----------------------------------
Summary: Optimize parallelism calculating of HiveTableSource by checking file number
Key: FLINK-19641
URL: https://issues.apache.org/jira/browse/FLINK-19641
Project: Flink
Issue Type: Improvement
Components: Connectors / Hive
Reporter: Caizhi Weng
Fix For: 1.12.0
The current implementation of {{HiveTableSource#calculateParallelism}} directly uses {{inputFormat.createInputSplits(0).length}} as the number of splits. However, {{createInputSplits}} may be costly because it reads some data from every source file, especially when the table is not partitioned and the number of files is large.
Many Hive tables maintain the number of files in the table's metadata, and the number of splits is always at least the number of files. So we can first fetch the file count (almost without cost), and if it already reaches the maximum parallelism we can directly use the maximum parallelism without calling {{createInputSplits}}.
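The shortcut described above can be sketched as follows. This is a hypothetical illustration, not the actual {{HiveTableSource}} code; the method and parameter names are made up for the example.

```java
import java.util.function.IntSupplier;

public class ParallelismCalculator {

    /**
     * Hypothetical sketch of the proposed optimization.
     *
     * @param fileCount        file count read cheaply from table metadata
     * @param maxParallelism   configured maximum source parallelism
     * @param costlySplitCount lazily invoked, expensive split computation
     *                         (stands in for createInputSplits(0).length)
     */
    static int calculateParallelism(
            int fileCount, int maxParallelism, IntSupplier costlySplitCount) {
        if (fileCount >= maxParallelism) {
            // Each file yields at least one split, so the split count is
            // already >= maxParallelism; skip the expensive computation.
            return maxParallelism;
        }
        // Fall back to the costly path: actually compute the splits.
        return Math.min(costlySplitCount.getAsInt(), maxParallelism);
    }
}
```

With 15000 files and a maximum parallelism of, say, 128, the supplier is never invoked and the expensive file scan is avoided entirely.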
This is a significant optimization for the current Flink TPC-DS benchmark, which creates some unpartitioned tables with 15000 files. This optimization will improve the performance of the whole benchmark by more than 300 seconds.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)