The timeline of Flink support Reactive mode in Yarn

classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

The timeline of Flink support Reactive mode in Yarn

Bing Jiang
Hi, community.

For streaming applications, auto scaling is critical to guarantee our
resource provision, not big or small. i.e. it is cost efficiency
requirements from public cloud (AWS) players.

I read the document of FLIP-159
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-159%3A+Reactive+Mode>,
and it is claimed that it aims at the requirements of elastic scale.
Unfortunately, it only supports standalone clusters. So I'd like to figure
out whether the community has already tried out this feature on Yarn, and
which kinds of issues will be the blocker to support this feature from Yarn
perspective?

I appreciate your insights!

Thanks
Bing
--
Reply | Threaded
Open this post in threaded view
|

Re: The timeline of Flink support Reactive mode in Yarn

Robert Metzger
Hey Bing,

The idea of reactive mode is that it reacts to changing resource
availability, where an outside service is adding or removing machines to a
Flink cluster.
Reactive Mode doesn't work on an active resource manager such as YARN,
because you can not tell YARN from the outside to remove resources from a
Flink Application. The Flink application has to do this by itself.

For reactive mode, we introduced Adaptive Scheduler (FLIP-160) and
declarative resource management (FLIP-138). Both of these changes are very
important building blocks for adding autoscaling to active resource
managers -- I would say we've completed most of the heavy lifting for
autoscaling already.

The main blocker for autoscaling in my opinion is figuring out a good API
for determining the scale of a running job: Do we expose a REST API where
users can adjust the parallelism of individual operators? Do we expose an
API where some code runs somewhere adjusting the parallelism of operators?
Do we provide a set of configuration parameters defining the "min",
"target", "max" parallelism, or some scaling based on some metrics (Kafka
consumer lag, latency, throughput, backpressure, cpu load, ...).

I've personally started playing a bit with a small prototype for
autoscaling, but it's not ready to share yet. From the topics the main
contributors are currently working on, I'm not aware of anybody working on
autoscaling in the Flink 1.14 release cycle, hence I believe it is unlikely
to be part of 1.14.

Let me know what you think, I'm in particular interested in how you would
like to use autoscaling!

Best,
Robert


On Thu, May 20, 2021 at 9:11 AM Bing Jiang <[hidden email]> wrote:

> Hi, community.
>
> For streaming applications, auto scaling is critical to guarantee our
> resource provision, not big or small. i.e. it is cost efficiency
> requirements from public cloud (AWS) players.
>
> I read the document of FLIP-159
> <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-159%3A+Reactive+Mode
> >,
> and it is claimed that it aims at the requirements of elastic scale.
> Unfortunately, it only supports standalone clusters. So I'd like to figure
> out whether the community has already tried out this feature on Yarn, and
> which kinds of issues will be the blocker to support this feature from Yarn
> perspective?
>
> I appreciate your insights!
>
> Thanks
> Bing
> --
>