(DEPRECATED) Apache Flink Mailing List archive.

Channel indexing with pointwise connection pattern

Gyula Fóra

Channel indexing with pointwise connection pattern

Hey,

I was hoping that someone can answer this right away without me having to dig through all the code :)

How does the channel indexing work when more than one consumer subtask is connected to an intermediate dataset in the pointwise pattern?
I am trying to figure out which one is the “in-memory” channel so that I can set up proper partitioning for streams in this case. The idea would be to push the majority of messages to the in-memory channel while still pushing some to the network channel to leverage operator parallelism, but to implement this I need to figure out the index of the in-memory channel.

Thank you,
Gyula

Stephan Ewen

Re: Channel indexing with pointwise connection pattern

This is a bit tricky, since the new scheduling is more flexible...

Assume we have a PointWise connection with two receiving tasks per sending
task: outgoing channels 0 and 1.

When scheduling in the most basic mode, the receivers can go anywhere, but
the scheduler will try to give them a slot on the same instance, if
possible. Both could be in-memory, but both could also be remote.

When using a SlotSharingGroup (we do that by default right now), one of the
receivers can share the slot of the sender, but not both. Which one does
depends on which one starts first. Currently that is the one which gets data
first, but this is going to change soon with the new channels and
deployment.

When you use a CoLocationGroup, you are guaranteed that subtask n of the
sender is co-located with subtask n of the receiver. But in the above
PointWise model, you would rather want subtask n to be co-located with
subtask 2*n.

I don't think there is a reliable way to guarantee that, other than having
a slightly modified version of the co-location constraint.
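
The index relationship in the PointWise example above can be sketched as follows. This is only an illustration, not Flink code: the class PointwiseMapping and its methods are hypothetical, and it assumes that each sender feeds a contiguous block of receivers, so that with two receivers per sender, sender subtask n feeds receiver subtasks 2*n and 2*n+1.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of Flink. */
final class PointwiseMapping {

    /**
     * Receiver subtask indices fed by the given sender subtask.
     * Assumption: each sender feeds a contiguous block of receivers,
     * and the position in the returned array is the outgoing channel index.
     */
    static int[] receiversOf(int senderSubtask, int receiversPerSender) {
        int[] receivers = new int[receiversPerSender];
        for (int channel = 0; channel < receiversPerSender; channel++) {
            receivers[channel] = senderSubtask * receiversPerSender + channel;
        }
        return receivers;
    }

    /** Sender subtask that feeds the given receiver, under the same assumption. */
    static int senderOf(int receiverSubtask, int receiversPerSender) {
        return receiverSubtask / receiversPerSender;
    }

    public static void main(String[] args) {
        // Two receivers per sender, as in the example (outgoing channels 0 and 1).
        for (int n = 0; n < 3; n++) {
            System.out.println("sender " + n + " -> receivers "
                    + Arrays.toString(receiversOf(n, 2)));
        }
    }
}
```

Under this assumption, a plain co-location constraint (n with n) would pair sender 1 with receiver 1, which is fed by sender 0. Pairing n with 2*n is what would put channel 0 of each sender in memory.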

Stephan
In reply to "Gyula Fora" <[hidden email]>, 27.11.2014 00:45.

Gyula Fóra

Re: Channel indexing with pointwise connection pattern

Thanks Stephan,

So I took a quick look at the ChannelSelectors the batch API uses, and I see that the Forward strategy uses round-robin. My question was aimed exactly at avoiding this. Isn’t this sub-optimal?
Maybe we could pass the channel info to the channel selector, so it can make a “smarter” decision.

Gyula

In reply to Stephan Ewen <[hidden email]>, 27 Nov 2014, 10:45.

Stephan Ewen

Re: Channel indexing with pointwise connection pattern

Our assumption so far was that forwarding means evenly scattering over
successors - a balanced load being the important goal.
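
As a minimal sketch of what evenly scattering over successors means, the class below cycles through the outgoing channel indices in order. It is a hypothetical stand-alone class for illustration and does not reproduce Flink's actual ChannelSelector interface.

```java
/** Hypothetical round-robin selector for illustration; not Flink's ChannelSelector. */
final class RoundRobinSelector {
    private final int numChannels;
    private int next = 0;

    RoundRobinSelector(int numChannels) {
        if (numChannels <= 0) {
            throw new IllegalArgumentException("numChannels must be positive");
        }
        this.numChannels = numChannels;
    }

    /** Returns the channel for the next record, cycling 0, 1, ..., numChannels - 1. */
    int selectChannel() {
        int channel = next;
        next = (next + 1) % numChannels;
        return channel;
    }
}
```

With two outgoing channels every second record goes to each receiver, regardless of whether a channel is in-memory or remote.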

If you find different requirements in streaming, you could define a new
type of selector.
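
One possible shape for such a selector is sketched below: it sends a fixed number of records to a preferred channel, then one record to a remote channel, and repeats. Everything here is hypothetical (the class name, the localChannel and localPerCycle parameters), and it assumes the index of the in-memory channel is known, which, as discussed earlier in this thread, is not reliably the case at the moment.

```java
/** Hypothetical locality-biased selector for illustration; not part of Flink. */
final class LocalityBiasedSelector {
    private final int numChannels;
    private final int localChannel;   // assumed to be known; see the caveat above
    private final int localPerCycle;  // records sent locally before one remote record
    private int positionInCycle = 0;
    private int nextRemote = 0;       // offset among the remote channels

    LocalityBiasedSelector(int numChannels, int localChannel, int localPerCycle) {
        if (numChannels < 2) {
            throw new IllegalArgumentException("need at least one remote channel");
        }
        if (localChannel < 0 || localChannel >= numChannels) {
            throw new IllegalArgumentException("localChannel out of range");
        }
        if (localPerCycle < 1) {
            throw new IllegalArgumentException("localPerCycle must be at least 1");
        }
        this.numChannels = numChannels;
        this.localChannel = localChannel;
        this.localPerCycle = localPerCycle;
    }

    /** Returns the channel for the next record. */
    int selectChannel() {
        if (positionInCycle < localPerCycle) {
            positionInCycle++;
            return localChannel;
        }
        positionInCycle = 0;
        // Map the remote offset (0 .. numChannels - 2) onto channel indices,
        // skipping the local channel; remote channels are used round-robin.
        int channel = nextRemote < localChannel ? nextRemote : nextRemote + 1;
        nextRemote = (nextRemote + 1) % (numChannels - 1);
        return channel;
    }
}
```

For example, new LocalityBiasedSelector(2, 0, 3) sends three out of every four records to channel 0 and the fourth to channel 1. The trade-off is deliberate load imbalance in exchange for less network traffic, which is the opposite of the balanced-load goal of round-robin forwarding.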

In reply to Gyula Fora <[hidden email]>, Thu, Nov 27, 2014, 11:02 AM.