Question about Re-Partition Operator

Question about Re-Partition Operator

Aljoscha Krettek-2
Hi,
(mostly @fabian) Why is the re-partition operator special-cased the way
it is now? You can only do map/filter on partitioned data. Wouldn't it
be nice if we had a general re-partition operator? Operators that
normally do their own repartitioning would notice that the data is
already partitioned and reuse it. The way it is now, the
implementation relies heavily on special-case handling.
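
To make this concrete, here is a rough, untested sketch against the Java
DataSet API as I understand it (method names such as partitionByHash and the
restriction to map-style consumers are my reading of the current snapshot and
may not match it exactly):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class RepartitionSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> data = env.fromElements(
                Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

        // Today: an explicit hash partitioning can only be followed by
        // map-style operators (map/filter/mapPartition).
        data.partitionByHash(0)
            .map(new MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                @Override
                public Tuple2<String, Integer> map(Tuple2<String, Integer> value) {
                    return value; // expensive per-record work would go here
                }
            })
            .print();

        // The idea: make re-partitioning a general operator that any downstream
        // operator (reduce, join, ...) could notice and reuse instead of
        // shuffling the data again.
    }
}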

In the long run we could even introduce combine and sort as special
operators that users could insert themselves. The optimiser would then
also insert these before operations when required. This would
simplify/generalise things a bit.

What do you think?

Cheers,
Aljoscha

Re: Question about Re-Partition Operator

Fabian Hueske
Hi,

I limited explicit repartitioning to map-like operators because map
operations should be by far the most common use cases (rebalancing for
expensive mappers, repartitioning for mapPartition, ...), and I doubt the
benefits of repartitioning before other operators.
Right now, partitioning is implemented as its own runtime operator with a
no-op driver. Hence, there would be serialization overhead if subsequent
operators cannot be chained (e.g., a join).
Also, having a partitioning in front of a combinable reduce would complicate
things, because the partitioning would have to be moved between the combiner
and the reducer.
Overall, I think there are not many benefits to offering repartitioning
for all operators, and if you still want it, you can use an identity mapper.
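
For illustration, a rough sketch of that identity-mapper workaround (untested
and written from memory of the DataSet API, so names and return types may be
slightly off; whether the optimizer actually reuses the partitioning for the
downstream reduce is exactly the open question here):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class IdentityMapperWorkaround {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> data = env.fromElements(
                Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

        // Explicit hash partitioning, followed by an identity map whose only
        // purpose is to get back a DataSet that any operator can consume.
        DataSet<Tuple2<String, Integer>> partitioned = data
            .partitionByHash(0)
            .map(new MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                @Override
                public Tuple2<String, Integer> map(Tuple2<String, Integer> value) {
                    return value; // identity mapper
                }
            });

        // A grouped reduce attached after the explicit repartitioning.
        partitioned.groupBy(0).sum(1).print();
    }
}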

Does that make sense?
If not, we can also change the implementation without breaking the API and
offer repartitioning for all operators. ;-)

Anyway, I'm with you that we should think about how to integrate physical
operators (partition, sort, combine) into the API.

Cheers, Fabian



Re: Question about Re-Partition Operator

Aljoscha Krettek-2
Yes, I see your points on why you limited it to the map-style
operators. My gripe with it is that it moves knowledge about the
Partition Operator into some operators that now need special-case
handling, and that these operations are duplicated on
PartitionedDataSet. If we had Partition as just another operator,
there would be no special-case handling and the integration would be
scalable (in the sense that we don't have to touch other operators, or
add partition support whenever we add another map-style operator).

My view is that users should know what they are doing if they add a
repartition before a reduce. In the future, the optimizer could catch
those cases and merge a custom repartition with a reduce if there are
no other consumers of the repartitioning.

Cheers,
Aljoscha


Re: Question about Re-Partition Operator

Fabian Hueske
I wanted the change to be as minimally invasive as possible, and there were a
few tricky details.
For example, the DOP of a partition operator must be the same as that of the
following operator (otherwise the data is reshuffled again).
Also, multiple consumers of a repartition are a nasty detail (especially if
they have different DOPs).

These things were a bit tricky to integrate into the current translation
mechanism without major changes.
Right now, each map operator injects its own partitioning operator in front
of itself and sets that operator's DOP to its own DOP.
That's why the logic for this was put into the map operators. A drawback of
this approach is that there is no reuse of repartitionings, even when it
would be possible...
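
To illustrate the reuse problem, a rough, untested sketch of the problematic
pattern (using the plain DataSet API for brevity; in the actual snapshot the
partitioning comes back as a PartitionedDataSet, and the output paths are
just placeholders):

import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class NoPartitionReuse {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> data = env.fromElements(
                Tuple2.of("a", 1), Tuple2.of("b", 2), Tuple2.of("a", 3));

        DataSet<Tuple2<String, Integer>> partitioned = data.partitionByHash(0);

        // Two map-style consumers of the same explicit partitioning. With the
        // current translation, each map injects its own partitioning operator
        // in front of itself, so the data is shuffled once per consumer even
        // though a single repartitioning would be enough.
        partitioned
            .map(new MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                @Override
                public Tuple2<String, Integer> map(Tuple2<String, Integer> value) {
                    return value; // consumer #1
                }
            })
            .writeAsText("/tmp/out1"); // placeholder path

        partitioned
            .map(new MapFunction<Tuple2<String, Integer>, Tuple2<String, Integer>>() {
                @Override
                public Tuple2<String, Integer> map(Tuple2<String, Integer> value) {
                    return value; // consumer #2
                }
            })
            .writeAsText("/tmp/out2"); // placeholder path

        env.execute("partition reuse sketch");
    }
}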

There might be smarter ways to implement this, but I think my (limited)
solution works correctly in all the corner cases that came to my mind.


Re: Question about Re-Partition Operator

Aljoscha Krettek-2
I opened a pull request that contains my suggestions. It also fixes the
problem of getting multiple repartition operators when you apply several
map operators to a repartitioned data set: each of them would get its own
private repartition.

What do you guys think?
