DataStream.partitionCustom() - define parallelism

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

DataStream.partitionCustom() - define parallelism

Jaromir Vanek
Hi all,

I've got one question considering custom partitioning in DataStream API. Is it possible to define/change parallelism when doing 'partitionCustom' transformation?

As far as I discovered there is no way how to call 'setParallelism()' on the 'PartitionTransformation' because it's hidden from the user.

In another thread (http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Range-partitioning-td10947.html) I found a short notice about custom partitioner - 'a partitioning will only be valid to the point that you change the parallelism.' So the possibility of changing parallelism in the subsequent transformation won't solve my problem.

How to make sure that both partitioner and number of partitions are valid for subsequent transformations?

Thank you
Reply | Threaded
Open this post in threaded view
|

Re: DataStream.partitionCustom() - define parallelism

Aljoscha Krettek-2
Hi,
it should be possible to set the parallelism on the actual downstream
operation. The partitioning operation is just an intermediate.

Cheers,
Aljoscha

On Mon, 18 Jul 2016 at 20:20 vanekjar <[hidden email]> wrote:

> Hi all,
>
> I've got one question considering custom partitioning in DataStream API. Is
> it possible to define/change parallelism when doing 'partitionCustom'
> transformation?
>
> As far as I discovered there is no way how to call 'setParallelism()' on
> the
> 'PartitionTransformation' because it's hidden from the user.
>
> In another thread
> (
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Range-partitioning-td10947.html
> )
> I found a short notice about custom partitioner - 'a partitioning will only
> be valid to the point that you change the parallelism.' So the possibility
> of changing parallelism in the subsequent transformation won't solve my
> problem.
>
> How to make sure that both partitioner and number of partitions are valid
> for subsequent transformations?
>
> Thank you
>
>
>
> --
> View this message in context:
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DataStream-partitionCustom-define-parallelism-tp12597.html
> Sent from the Apache Flink Mailing List archive. mailing list archive at
> Nabble.com.
>
Reply | Threaded
Open this post in threaded view
|

Re: DataStream.partitionCustom() - define parallelism

Jaromir Vanek
Aljoscha Krettek-2 wrote
Hi,
it should be possible to set the parallelism on the actual downstream
operation. The partitioning operation is just an intermediate.

Cheers,
Aljoscha
Are you sure about that? The mentioned discussion about range partitioner says exactly opposite:

A partitioning will only be valid to the point that you change the parallelism.

In the modified program the data will be correctly partitioned (lets say into 8 partitions if the default parallelism is 8).
After the partitioning, the 8 partitions have to be reduced to 3 partitions as defined by the map-partition
operator with parallelism 3. This is done by randomly shuffling which destroys the range-partitioning.
It says the the downstream operation will use "random shuffle" when changing parallelism. Is 'customPartition()' different case?
Reply | Threaded
Open this post in threaded view
|

Re: DataStream.partitionCustom() - define parallelism

Aljoscha Krettek-2
Hi,
I think that was just related to the DataSet API. If I'm not mistaken
changing the parallelism should work after a "partitionCustom()".

Cheers,
Aljoscha

On Tue, 19 Jul 2016 at 19:25 Jaromir Vanek <[hidden email]> wrote:

> Aljoscha Krettek-2 wrote
> > Hi,
> > it should be possible to set the parallelism on the actual downstream
> > operation. The partitioning operation is just an intermediate.
> >
> > Cheers,
> > Aljoscha
>
> Are you sure about that? The  mentioned discussion
> <
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/Range-partitioning-tp10947p10953.html
> >
> about range partitioner says exactly opposite:
>
>
> > A partitioning will only be valid to the point that you change the
> > parallelism.
> >
> > In the modified program the data will be correctly partitioned (lets say
> > into 8 partitions if the default parallelism is 8).
> > After the partitioning, the 8 partitions have to be reduced to 3
> > partitions as defined by the map-partition
> > operator with parallelism 3. This is done by randomly shuffling which
> > destroys the range-partitioning.
>
> It says the the downstream operation will use "random shuffle" when
> changing
> parallelism. Is 'customPartition()' different case?
>
>
>
> --
> View this message in context:
> http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DataStream-partitionCustom-define-parallelism-tp12597p12617.html
> Sent from the Apache Flink Mailing List archive. mailing list archive at
> Nabble.com.
>