[DISCUSS] Increase the parallel of flink connector.

jie mei
Hi Community,

I want to start a preliminary discussion about increasing the parallelism of
Flink connectors, with optional transaction support. I look forward to your
feedback and advice.

*Below are the three scenarios that I think we should support:*

1. *unordered at least once*: this is for storage engines that mainly
support appending data, such as ClickHouse.

2. *ordered per partition with optional flush on checkpoint*: this is for
storage engines that support upsert but do not support transactions across
rows, such as HBase and Elasticsearch.

3. *globally ordered with optional transaction*: this is for storage engines
that support transactions across rows.


*Here is the basic data write flow:*

*1. Base data sink flow:*

    [sink data]    -------- |
                            |            ThreadSafe
  [timeout event]  -------- | --------> [Ring Buffer] -------> [Sink Buffer Manager] -------> [trigger flush of the sink buffer data]
                            |
[checkpoint event] -------- |
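
To make the flow above concrete, here is a minimal Java sketch, not a proposed implementation. All class and method names (SinkEvent, SinkBufferManager, publish, drain) are hypothetical, and a java.util.concurrent.ArrayBlockingQueue stands in for the ring buffer; a real connector might use a dedicated ring buffer library instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical event type covering the three inputs in the diagram. */
final class SinkEvent {
    enum Kind { DATA, TIMEOUT, CHECKPOINT }

    final Kind kind;
    final String record; // only set for DATA events

    private SinkEvent(Kind kind, String record) {
        this.kind = kind;
        this.record = record;
    }

    static SinkEvent data(String record) { return new SinkEvent(Kind.DATA, record); }
    static SinkEvent timeout() { return new SinkEvent(Kind.TIMEOUT, null); }
    static SinkEvent checkpoint() { return new SinkEvent(Kind.CHECKPOINT, null); }
}

/** Hypothetical buffer manager: the single consumer of the event queue. */
final class SinkBufferManager {
    // Assumption: a bounded, thread-safe queue plays the role of the ring buffer.
    private final BlockingQueue<SinkEvent> queue = new ArrayBlockingQueue<>(1024);
    private final List<String> buffer = new ArrayList<>();
    // Stands in for the external store: each entry is one flushed batch.
    final List<List<String>> flushed = new ArrayList<>();

    /** May be called from several producer threads; returns false if the queue is full. */
    boolean publish(SinkEvent event) {
        return queue.offer(event);
    }

    /** Handles all pending events; a real sink would run this in a dedicated thread. */
    void drain() {
        SinkEvent event;
        while ((event = queue.poll()) != null) {
            if (event.kind == SinkEvent.Kind.DATA) {
                buffer.add(event.record);
            } else {
                flush(); // timeout and checkpoint events both trigger a flush here
            }
        }
    }

    /** Number of records buffered but not yet flushed. */
    int pending() {
        return buffer.size();
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        flushed.add(new ArrayList<>(buffer));
        buffer.clear();
    }
}
```

Because only the manager touches the buffer, the buffer itself needs no locking; the queue is the only shared structure.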

*2. When to flush the data:*

  a. If flushing on checkpoint or transactions are enabled, just flush the
data when the checkpoint event is received.
  b. The max batch size is exceeded.
  c. A timeout event is received.
  d. A checkpoint event is received.
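
The flush rules above could be expressed as a single decision function. The sketch below is illustrative only: FlushPolicy, Trigger, and the flushOnCheckpointOnly flag are hypothetical names, and it assumes one reading of rule a, namely that with flush-on-checkpoint or transactions enabled the buffer is flushed only on checkpoint events. It also assumes a batch is flushed as soon as it reaches the max batch size.

```java
/** Hypothetical flush policy implementing rules a to d. */
final class FlushPolicy {
    enum Trigger { RECORD_ADDED, TIMEOUT, CHECKPOINT }

    private final int maxBatchSize;
    // Assumption: true when flush-on-checkpoint or transactions are enabled.
    private final boolean flushOnCheckpointOnly;

    FlushPolicy(int maxBatchSize, boolean flushOnCheckpointOnly) {
        this.maxBatchSize = maxBatchSize;
        this.flushOnCheckpointOnly = flushOnCheckpointOnly;
    }

    boolean shouldFlush(Trigger trigger, int bufferedRecords) {
        if (trigger == Trigger.CHECKPOINT) {
            return true; // rules a and d: a checkpoint always flushes
        }
        if (flushOnCheckpointOnly) {
            return false; // assumed reading of rule a: nothing else may flush
        }
        if (trigger == Trigger.TIMEOUT) {
            return bufferedRecords > 0; // rule c, skipping empty buffers
        }
        return bufferedRecords >= maxBatchSize; // rule b
    }
}
```

For example, `new FlushPolicy(3, false).shouldFlush(FlushPolicy.Trigger.RECORD_ADDED, 3)` returns true, while the same call on a policy created with `flushOnCheckpointOnly = true` returns false.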

*3. How the sink buffer should be organized:*

   a. *unordered at least once:*
      The sink buffer should be organized as a linked list with a size limit,
      and the data should be flushed concurrently.

   b. *ordered per partition with optional flush on checkpoint:*
      The sink buffer should be organized as a map. The key should be
      determined by the partition key or primary key (primary key % sink
      parallelism), and the value should be a sink buffer list with a size
      limit. We move the data into the sink buffer by partition key or
      primary key, and flush the different sink buffer lists concurrently.

   c. *globally ordered with optional transaction:*
      The sink buffer should be organized as a linked list with a size limit,
      and the data should be flushed serially.
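
As a rough illustration of case b, the map-based buffer could look like the Java sketch below. PartitionedSinkBuffer and its methods are hypothetical names, records are plain strings for brevity, and the sketch assumes a numeric primary key. Cases a and c would reduce to a single list, flushed concurrently or serially.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical per-partition sink buffer: slot -> bounded list of records. */
final class PartitionedSinkBuffer {
    private final int parallelism;
    private final int maxBatchSize;
    private final Map<Integer, List<String>> buffers = new HashMap<>();

    PartitionedSinkBuffer(int parallelism, int maxBatchSize) {
        this.parallelism = parallelism;
        this.maxBatchSize = maxBatchSize;
    }

    /** primary key % sink parallelism; floorMod keeps negative keys in range. */
    int slotFor(long primaryKey) {
        return (int) Math.floorMod(primaryKey, (long) parallelism);
    }

    /**
     * Appends a record to its slot. Returns the full batch once the slot
     * reaches the size limit, otherwise an empty list.
     */
    List<String> add(long primaryKey, String record) {
        int slot = slotFor(primaryKey);
        List<String> list = buffers.computeIfAbsent(slot, k -> new ArrayList<>());
        list.add(record);
        if (list.size() >= maxBatchSize) {
            buffers.remove(slot);
            return list;
        }
        return Collections.emptyList();
    }

    /**
     * Removes and returns all pending batches, one per slot. Records with the
     * same key always share a slot, so each batch can be flushed by its own
     * task without reordering updates to a single key.
     */
    Map<Integer, List<String>> drainAll() {
        Map<Integer, List<String>> out = new HashMap<>(buffers);
        buffers.clear();
        return out;
    }
}
```

With a parallelism of 4, keys 1 and 5 land in the same slot and stay in insertion order, while key 2 goes to a different slot that can be flushed independently.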


*How the logic should be implemented:*

I think we should use the latest FLIP-143: Unified Sink API [1], and
support split tables.



[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-143%3A+Unified+Sink+API

--

*Best Regards*
*Jeremy Mei*