[Discussion][FLINK-20416] Need a cached catalog for batch SQL job

Sebastian Liu
Hi all,

I'd like to discuss a feature that supports the Flink OLAP scenario.

In OLAP scenarios there are usually analytical queries whose running time
is relatively short. These queries are also sensitive to latency. In the
current Blink SQL processing, the parse, validate, and optimize stages all
need metadata from the catalog API, but each request to the catalog
re-runs the underlying metadata query.

We may need a cached catalog that caches table schemas and statistics to
avoid unnecessary repeated metadata requests. The most straightforward
scenario is using Flink batch SQL to query Hive data: with a cached Hive
catalog, we would save a lot of round-trip latency to the Hive Metastore
(HMS).
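
To make the idea concrete, here is a minimal sketch of a read-through cache
with a time-to-live in front of a slow metadata lookup. It is not taken from
the design doc and does not use Flink's Catalog API: the class name
CachedMetaLookup, the loader function, the TTL parameter, and the injected
clock are all hypothetical, chosen only to illustrate the caching behavior.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/**
 * Hypothetical read-through cache for metadata lookups (illustration only).
 * The loader stands in for an expensive call such as a request to HMS.
 */
final class CachedMetaLookup<K, V> {

    /** A cached value together with the time it was loaded. */
    private static final class Entry<V> {
        final V value;
        final long loadedAtMillis;

        Entry(V value, long loadedAtMillis) {
            this.value = value;
            this.loadedAtMillis = loadedAtMillis;
        }
    }

    private final Function<K, V> loader;   // assumed: the slow metadata query
    private final long ttlMillis;          // assumed: how long an entry stays fresh
    private final LongSupplier clock;      // injected so expiry can be tested
    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();

    CachedMetaLookup(Function<K, V> loader, long ttlMillis, LongSupplier clock) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value if it is still fresh, otherwise reloads it. */
    V get(K key) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.compute(key, (k, old) ->
                (old != null && now - old.loadedAtMillis < ttlMillis)
                        ? old
                        : new Entry<>(loader.apply(k), now));
        return entry.value;
    }

    /** Drops one entry, e.g. after a DDL statement changes the table. */
    void invalidate(K key) {
        entries.remove(key);
    }
}
```

With such a wrapper, the parse, validate, and optimize stages of one query
would trigger at most one underlying lookup per table while the entry is
fresh. The trade-off is staleness: a schema or statistics change made
outside this session is not visible until the entry expires or is
invalidated, so the TTL and invalidation policy would need discussion.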

I have drafted a design doc about this:
https://docs.google.com/document/d/1oL8HUpv2WaF6OkFvbH5iefXkOJB__Dal_bYsIZJA_Gk/edit?usp=sharing

Jira issue: https://issues.apache.org/jira/browse/FLINK-20416

IMO, this feature can further improve the stability and execution speed of
analytical queries in Flink SQL.

Looking forward to your feedback; any discussion or comments are welcome.


--

*With kind regards
------------------------------------------------------------
Sebastian Liu 刘洋
Institute of Computing Technology, Chinese Academy of Science
Mobile\WeChat: +86—15201613655
E-mail: [hidden email] <[hidden email]>
QQ: 3239559*