(DEPRECATED) Apache Flink Mailing List archive.

Handling skewness and Heterogeneity

Anis Nasir

Handling skewness and Heterogeneity

Dear All,

I have a few use cases for Flink streaming where the cluster consists of
heterogeneous machines.

Additionally, there is skew in both the input distribution (e.g., each
tuple is drawn from a Zipf distribution) and the service time (e.g., the
service time required for each tuple follows a Zipf distribution).

I would like to know how Flink handles such use cases, assuming that the
distributions of both the workload and the cluster capacity are not known
in advance.

Any help will be highly appreciated!

Regards,
Anis

Fabian Hueske-2

Re: Handling skewness and Heterogeneity

Hi Anis,

Flink uses regular hash-partitioning to shuffle records and does not have a
mechanism to counter data skew (other than scaling out).
Heterogeneous hardware can (to some extent) be addressed by adapting the
number of processing slots (or task managers) per machine, i.e., configure
fewer slots on machines with lower performance.

Best, Fabian
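
To make the effect of plain hash-partitioning under skew concrete, here is a minimal sketch in plain Java. It is a standalone simulation, not Flink code and not Flink's actual partitioner: the class `SkewDemo`, its methods, and all parameters are hypothetical names chosen for illustration. It draws keys from a Zipf-like distribution (probability proportional to 1/rank), assigns each record to a partition by hashing its key, and counts the records per partition. As an assumption beyond what the reply proposes, it also shows key salting (hashing the key together with a small random salt), a common user-level workaround that spreads a hot key over several partitions.

```java
import java.util.Arrays;
import java.util.Objects;
import java.util.Random;

/** Hypothetical, Flink-free simulation of hash partitioning under Zipf-like keys. */
public class SkewDemo {

    /** Cumulative distribution for keys 0..numKeys-1 with P(key) proportional to 1/(key+1). */
    public static double[] zipfCdf(int numKeys) {
        double[] cdf = new double[numKeys];
        double sum = 0.0;
        for (int i = 0; i < numKeys; i++) {
            sum += 1.0 / (i + 1);
            cdf[i] = sum;
        }
        for (int i = 0; i < numKeys; i++) {
            cdf[i] /= sum; // normalize so the last entry is exactly 1.0
        }
        return cdf;
    }

    /** Samples one key: binary search for the first index whose CDF value is >= u. */
    public static int sampleKey(double[] cdf, Random rnd) {
        double u = rnd.nextDouble();
        int lo = 0, hi = cdf.length - 1;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (cdf[mid] < u) lo = mid + 1; else hi = mid;
        }
        return lo;
    }

    /**
     * Counts records per partition. With salts == 1 the partition depends on the
     * key only (plain hash partitioning); with salts > 1 a random salt is hashed
     * together with the key, so one hot key can land on several partitions.
     */
    public static long[] partitionLoads(int records, int numKeys, int partitions,
                                        int salts, long seed) {
        Random rnd = new Random(seed);
        double[] cdf = zipfCdf(numKeys);
        long[] load = new long[partitions];
        for (int i = 0; i < records; i++) {
            int key = sampleKey(cdf, rnd);
            int salt = salts > 1 ? rnd.nextInt(salts) : 0;
            int hash = Objects.hash(key, salt);
            load[Math.floorMod(hash, partitions)]++;
        }
        return load;
    }

    /** Largest per-partition record count. */
    public static long max(long[] loads) {
        return Arrays.stream(loads).max().getAsLong();
    }

    public static void main(String[] args) {
        long[] plain = partitionLoads(100_000, 1_000, 8, 1, 42L);
        long[] salted = partitionLoads(100_000, 1_000, 8, 16, 42L);
        System.out.println("per-partition load, plain hash:  " + Arrays.toString(plain));
        System.out.println("per-partition load, salted keys: " + Arrays.toString(salted));
    }
}
```

With salting, records of the same key no longer arrive at a single partition, so any per-key result has to be merged in a second aggregation step keyed on the original key. For the heterogeneous-hardware part, the number of slots per TaskManager is set with `taskmanager.numberOfTaskSlots` in `flink-conf.yaml`, so weaker machines can simply be given a smaller value.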

In reply to: Anis Nasir <[hidden email]>, 2017-02-15 2:12 GMT+01:00

Anis Nasir

Re: Handling skewness and Heterogeneity

Dear Fabian,

Could you have a look at this issue? What would be required to resolve
it?

https://issues.apache.org/jira/browse/FLINK-1725

Regards,
Anis

In reply to: Fabian Hueske <[hidden email]>, Wed, Feb 15, 2017 at 6:36 PM