To the best of my knowledge, there are currently two options for deploying Flink on Kubernetes: (1) the active K8s integration, with a separate JobManager per job, and (2) a reactive container mode that rescales automatically based on some metrics. Could you please give me a hint on the following:

A - Are both integrations already part of recent Flink releases? Is there any documentation on them?

B - In all cases, is it necessary to kill and restart the job? That is a concern for some critical use cases. Can a rolling upgrade be used to achieve zero downtime while rescaling/upgrading?

C - In such a rescale mechanism, does Kubernetes/Flink identify which stream operator is the source of the load and rescale it individually, or is rescaling done at the granularity of the whole job?

D - For stateful operators/jobs, how are state repartitioning and assignment to new instances performed? Is this repartitioning/reassignment time-consuming, especially for large states?

Thank you.

--
Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
Hi, Mazen
AFAIK, we now have two K8s integrations: native[1] and standalone[2]. I guess the native K8s integration is what you mean by the active K8s integration.

Regarding the reactive mode, it is still a work in progress; you can refer to [3].

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html
[3] https://issues.apache.org/jira/browse/FLINK-10407

Best,
Yangze Guo

On Mon, Aug 24, 2020 at 6:20 PM Mazen <[hidden email]> wrote:
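As a rough illustration of the "auto rescale based on some metrics" idea behind the reactive mode discussed above, here is a hypothetical sketch. The function name, thresholds, and scaling policy are illustrative assumptions, not Flink's actual reactive-mode logic:

```python
# Hypothetical sketch of metric-driven rescaling, loosely matching the
# "reactive container mode" described in the question. This is NOT
# Flink's actual reactive-mode implementation; the thresholds and the
# doubling/halving policy are assumptions for illustration only.

def decide_parallelism(current, cpu_utilization,
                       scale_up_at=0.80, scale_down_at=0.30,
                       min_p=1, max_p=128):
    """Return a new job parallelism based on observed CPU utilization."""
    if cpu_utilization > scale_up_at:
        return min(current * 2, max_p)   # overloaded: scale out
    if cpu_utilization < scale_down_at:
        return max(current // 2, min_p)  # underutilized: scale in
    return current                       # within the comfort band
```

A real controller would read such metrics from the Flink REST API or the Kubernetes metrics server and then restart the job at the new parallelism.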
Let me add some additional information.
A. Refer to Yangze's answer.

B. If you are using the native K8s integration with ZooKeeper HA enabled, then whenever you want to upgrade, stop the Flink cluster and start a new one with the same cluster id. It will recover from the latest checkpoint. If you are using the standalone mode, an operator will help a lot with upgrades[1][2]. However, both approaches involve some downtime.

C. AFAIK, even once we have the reactive mode, I am afraid Flink will not identify which operator is the source of the load and rescale it individually. Since slot sharing is enabled by default, only the maximum operator parallelism matters.

D. If you store the state on a DFS (e.g. HDFS, S3, GFS, etc.), it will be fetched remotely when the job restarts, so large states will take more time. But I think you could try a StatefulSet with a PV (persistent volume) so that local recovery can work.

[1] https://github.com/lyft/flinkk8soperator
[2] https://github.com/GoogleCloudPlatform/flink-on-k8s-operator

Best,
Yang

On Mon, Aug 24, 2020 at 7:55 PM Yangze Guo <[hidden email]> wrote:
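On question D above: Flink repartitions keyed state at the granularity of key groups. Here is a minimal Python sketch of the assignment scheme, modeled on Flink's KeyGroupRangeAssignment; Python's built-in hash stands in for the murmur hash Flink actually applies to the key's hashCode:

```python
# Sketch of key-group-based state assignment, modeled on Flink's
# KeyGroupRangeAssignment. Python's hash() is a stand-in for the murmur
# hash Flink applies internally.

def key_group_for(key, max_parallelism):
    # Every key maps to a fixed key group in [0, max_parallelism),
    # independent of the current parallelism.
    return hash(key) % max_parallelism

def operator_index_for(key_group, max_parallelism, parallelism):
    # Key groups are split into contiguous ranges, one per subtask.
    return key_group * parallelism // max_parallelism

def assignment(keys, max_parallelism, parallelism):
    # Map each key to the subtask index that owns its state.
    return {k: operator_index_for(key_group_for(k, max_parallelism),
                                  max_parallelism, parallelism)
            for k in keys}

# Rescaling from 2 to 3 subtasks reassigns whole key-group ranges, not
# individual keys, which bounds the amount of state reshuffling.
before = assignment(range(10), max_parallelism=128, parallelism=2)
after = assignment(range(10), max_parallelism=128, parallelism=3)
```

Since each subtask still has to load its new key-group ranges from the checkpoint (e.g. from S3/HDFS), restore time grows with state size, which matches the point above about large states being time-consuming.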