[jira] [Created] (FLINK-19641) Optimize parallelism calculating of HiveTableSource by checking file number

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-19641) Optimize parallelism calculating of HiveTableSource by checking file number

Shang Yuanchun (Jira)
Caizhi Weng created FLINK-19641:
-----------------------------------

             Summary: Optimize parallelism calculating of HiveTableSource by checking file number
                 Key: FLINK-19641
                 URL: https://issues.apache.org/jira/browse/FLINK-19641
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / Hive
            Reporter: Caizhi Weng
             Fix For: 1.12.0


The current implementation of {{HiveTableSource#calculateParallelism}} directly uses {{inputFormat.createInputSplits(0).length}} as the number of splits. However {{createInputSplits}} may be costly as it will read some data from all source files, especially when the table is not partitioned and the number of files are large.

Many Hive tables maintain the number of files in that table, and it's obvious that the number of splits is at least the number of files. So we can try to fetch the number of files (almost without cost) first and if the number of files already exceeds maximum parallelism we can directly use the maximum parallelism without calling {{createInputSplits}}.

This is a significant optimization on the current flink TPCDS benchmark, which will create some table with 15000 files without partitioning. This optimization will improve the performance of the whole benchmark by 300s and more.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)