Flink kubernetes autoscale

Flink kubernetes autoscale

Mazen Ezzeddine
To the best of my knowledge, for Flink deployments on Kubernetes we have two
options as of now: (1) active K8s integration with a separate JobManager per
job, and (2) reactive container mode with automatic rescaling based on some
metrics. Could you please give me a hint on the questions below?

A - Are the two integrations already available in recent Flink releases? Is
there any documentation on them?

B - Is it necessary in all cases to kill and restart the job? That is a
concern for some critical use cases. Can a rolling upgrade be used to achieve
zero downtime while rescaling/upgrading?

C - In such a rescale mechanism, does Kubernetes/Flink identify which stream
operator is the source of the load/utilization and rescale it individually,
or is the rescaling done at the granularity of the whole job?

D - For stateful operators/jobs, how are state repartitioning and assignment
to new instances performed? Is this repartitioning/reassignment
time-consuming, especially for large states?

Thank you.



--
Sent from: http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/
Re: Flink kubernetes autoscale

Yangze Guo
Hi Mazen,

AFAIK, we now have two K8s integrations, native[1] and standalone[2]. I
guess the native K8s integration is what you mean by "active K8S
integration".

Regarding the reactive mode, I think it is still a work in progress;
you could refer to [3].
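
For concreteness, a minimal sketch of starting a session cluster with the
native integration and submitting a job to it (assuming Flink 1.11 and a
working kubectl context; the cluster-id and resource values below are just
illustrative):

# Bring up a native K8s session cluster; TaskManager pods are
# allocated on demand as jobs request slots.
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=my-flink-session \
  -Dtaskmanager.memory.process.size=4096m \
  -Dkubernetes.taskmanager.cpu=2 \
  -Dtaskmanager.numberOfTaskSlots=4

# Submit one of the bundled examples to that session.
./bin/flink run -d -t kubernetes-session \
  -Dkubernetes.cluster-id=my-flink-session \
  examples/streaming/WindowJoin.jar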

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/native_kubernetes.html
[2] https://ci.apache.org/projects/flink/flink-docs-release-1.11/ops/deployment/kubernetes.html
[3] https://issues.apache.org/jira/browse/FLINK-10407

Best,
Yangze Guo

Re: Flink kubernetes autoscale

Yang Wang
Adding some additional information.

A. Refer to Yangze's answer.

B. If you are using the native K8s integration with ZooKeeper HA enabled,
then whenever you want to upgrade, you can stop the Flink cluster and start
a new one with the same cluster-id; it will recover from the latest
checkpoint (see the first sketch below). If you are using the standalone
mode, an operator will help a lot with the upgrade[1][2]. However, both
approaches involve some downtime.

C. AFAIK, even once we have the reactive mode in the future, I am afraid
that Flink will not be able to identify which operator is the bottleneck and
rescale it individually. Since slot sharing is enabled by default, only the
maximum operator parallelism matters for the number of slots.

D. If you store the state on a DFS (e.g. HDFS, S3, GFS, etc.), it will be
fetched remotely when the job restarts, so large states will take more time.
But I think you could try a StatefulSet with PVs (i.e. persistent volumes)
so that local recovery can work (see the second sketch below).
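
To make B concrete, a rough sketch of the stop-and-restart upgrade on the
native integration (the ZooKeeper quorum and the storage directory below are
placeholders, not recommendations):

# Kill the old cluster. With HA enabled, the job graph and the pointer
# to the latest checkpoint survive in ZooKeeper / the HA storage dir.
kubectl delete deployment/my-flink-session

# Start the upgraded cluster with the SAME cluster-id so the jobs are
# recovered from the latest checkpoint.
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=my-flink-session \
  -Dhigh-availability=zookeeper \
  -Dhigh-availability.zookeeper.quorum=zk-quorum:2181 \
  -Dhigh-availability.storageDir=s3://my-bucket/flink/ha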
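
And for D, a hypothetical sketch of wiring local recovery to a persistent
volume. In a standalone setup the TaskManagers would run as a StatefulSet
whose volumeClaimTemplates mounts a PV at /pv/flink-local-state (the K8s
YAML is omitted here); on the Flink side the relevant options would be:

# Append the local-recovery options to the image's Flink configuration;
# the paths and the checkpoint bucket are placeholders.
cat >> conf/flink-conf.yaml <<'EOF'
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
state.backend.local-recovery: true
taskmanager.state.local.root-dirs: /pv/flink-local-state
EOF

Note that local recovery is designed for task failover, so whether it kicks
in after a full pod restart on the same PV is something you would need to
verify.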



[1] https://github.com/lyft/flinkk8soperator
[2] https://github.com/GoogleCloudPlatform/flink-on-k8s-operator


Best,
Yang
