Jin Xing created FLINK-22672:
--------------------------------
Summary: Some enhancements for pluggable shuffle service framework
Key: FLINK-22672
URL: https://issues.apache.org/jira/browse/FLINK-22672
Project: Flink
Issue Type: Improvement
Components: Runtime / Network
Reporter: Jin Xing
"Pluggable shuffle service" in Flink provides an architecture that is unified for both streaming and batch jobs, allowing users to customize how data is transferred between shuffle stages according to their scenarios.
There are already a number of "remote shuffle service" implementations on Spark, such as [1][2][3]. Remote shuffle makes it possible to shuffle data from/to a remote cluster and brings benefits such as:
# The lifecycle of computing resources can be decoupled from that of shuffle data: once a computing task finishes, idle computing nodes can be released while their completed shuffle data stays on the remote shuffle cluster.
# There is no need to reserve disk capacity for shuffle on computing nodes. The remote shuffle cluster serves shuffle requests with better scalability and relieves local disk pressure on computing nodes when data is skewed.
Based on "pluggable shuffle service", we built our own "remote shuffle service" for Flink, named Lattice, which aims to provide the required functionality and improve performance for batch processing jobs. Basically it works as below:
# The Lattice cluster works as an independent service that handles shuffle requests;
# LatticeShuffleMaster extends ShuffleMaster, works inside the JM, and talks to the remote Lattice cluster to apply for shuffle resources and manage the lifecycle of shuffle data;
# LatticeShuffleEnvironment extends ShuffleEnvironment, works inside the TM, and provides an environment for shuffling data from/to the remote Lattice cluster;
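The division of responsibilities above can be illustrated with a rough, self-contained Java sketch. Note that this is not Lattice code and does not use Flink's real ShuffleMaster / ShuffleEnvironment interfaces, which are larger and have different signatures. The types SimpleShuffleMaster, SimpleShuffleEnvironment, FakeRemoteCluster and the "remote/" location format are all hypothetical stand-ins, and the remote cluster is replaced by an in-memory map.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical, heavily simplified stand-in for the JM-side role (not Flink's ShuffleMaster).
interface SimpleShuffleMaster {
    String registerPartition(String partitionId); // returns a location descriptor
    void releasePartition(String partitionId);    // ends the shuffle data lifecycle
}

// Hypothetical, heavily simplified stand-in for the TM-side role (not Flink's ShuffleEnvironment).
interface SimpleShuffleEnvironment {
    void write(String location, byte[] data);
    byte[] read(String location);
}

// Stand-in for the independent remote shuffle cluster: an in-memory map instead of a network service.
final class FakeRemoteCluster {
    private final Map<String, byte[]> store = new HashMap<>();

    String allocate(String partitionId) { return "remote/" + partitionId; }
    void put(String location, byte[] data) { store.put(location, Arrays.copyOf(data, data.length)); }
    byte[] get(String location) { return store.get(location); }
    void delete(String location) { store.remove(location); }
}

// JM side: applies for shuffle resources and manages the shuffle data lifecycle.
final class SketchShuffleMaster implements SimpleShuffleMaster {
    private final FakeRemoteCluster cluster;
    private final Map<String, String> locations = new HashMap<>();

    SketchShuffleMaster(FakeRemoteCluster cluster) { this.cluster = cluster; }

    @Override
    public String registerPartition(String partitionId) {
        String location = cluster.allocate(partitionId);
        locations.put(partitionId, location);
        return location;
    }

    @Override
    public void releasePartition(String partitionId) {
        String location = locations.remove(partitionId);
        if (location != null) {
            cluster.delete(location);
        }
    }
}

// TM side: writes shuffle data to, and reads it from, the remote cluster.
final class SketchShuffleEnvironment implements SimpleShuffleEnvironment {
    private final FakeRemoteCluster cluster;

    SketchShuffleEnvironment(FakeRemoteCluster cluster) { this.cluster = cluster; }

    @Override
    public void write(String location, byte[] data) { cluster.put(location, data); }

    @Override
    public byte[] read(String location) { return cluster.get(location); }
}

final class LatticeSketchDemo {
    public static void main(String[] args) {
        FakeRemoteCluster cluster = new FakeRemoteCluster();
        SimpleShuffleMaster master = new SketchShuffleMaster(cluster);
        SimpleShuffleEnvironment env = new SketchShuffleEnvironment(cluster);

        String location = master.registerPartition("p1"); // JM applies for a shuffle resource
        env.write(location, new byte[] {1, 2, 3});        // producer task writes remotely
        System.out.println(Arrays.toString(env.read(location)));
        master.releasePartition("p1");                    // JM releases the shuffle data
        System.out.println(env.read(location) == null);
    }
}
```

In this sketch the shuffle data lives only in the cluster object, so the "computing" side holds no shuffle state of its own, which mirrors the lifecycle decoupling described earlier.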
While building Lattice we found some potential enhancements to "pluggable shuffle service". I will enumerate them and create sub-JIRAs under this umbrella.
[1] https://www.alibabacloud.com/blog/emr-remote-shuffle-service-a-powerful-elastic-tool-of-serverless-spark_597728
[2] https://bestoreo.github.io/post/cosco/cosco/
[3] https://github.com/uber/RemoteShuffleService