Hey,
I was hoping that someone can answer this right away without me having to dig through all the code :) How does the channel indexing go when more then one consumer subtask is connected to an intermediate dataset in the pointwise pattern? I am trying to figure out which one is the “in-memory” channel to set up proper partitioning for streams in this case. The idea would be to push the majority to the in memory channel while still push some messages to the network channel to leverage operator parallelism, but to implement this I need to figure out the index of the in-memory channel. Thank you, Gyula |
This is a bit tricky, since the new scheduling is more flexible...
Assume we have a PointWise connection with two receiving tasks per sending task: outgoing channels 0 and 1. When scheduling in the most basic mode, the receivers can go anywhere, but the schedule will try to give them a slot on the same instance, if possible. Both could be in-memory, but both could be remote. When using a SlotSharingGroup (we do that by default right now), one of the receivers can share the slot of the sender, but not both. Which one does depends wich one stays first. Currently that is the one which gets data first, but this is going to change soon with the new channels and deployment. When you use a CoLocationGroup, you are guaranteed that subtasks n of the sender is Co-located with subtask n of the receiver. But in the above PointWise model, you would rather want subtask n to be co-located with subtask 2*n. I don't think there is a reliable way to guarantee that, other than having a slightly modified version of the co-location constraint. Stephan Am 27.11.2014 00:45 schrieb "Gyula Fora" <[hidden email]>: > Hey, > > I was hoping that someone can answer this right away without me having to > dig through all the code :) > > How does the channel indexing go when more then one consumer subtask is > connected to an intermediate dataset in the pointwise pattern? > I am trying to figure out which one is the “in-memory” channel to set up > proper partitioning for streams in this case. The idea would be to push the > majority to the in memory channel while still push some messages to the > network channel to leverage operator parallelism, but to implement this I > need to figure out the index of the in-memory channel. > > Thank you, > Gyula |
Thanks Stephan,
So I took a quick look at the ChannelSelectors the batch api uses and I see that for Forward strategy uses round-robin. My question was aimed exactly to avoid having to do this. Isn’t this sub-optimal? Maybe we could pass the channel info to the channel selector, so it can make “smarter” decision. Gyula > On 27 Nov 2014, at 10:45, Stephan Ewen <[hidden email]> wrote: > > This is a bit tricky, since the new scheduling is more flexible... > > Assume we have a PointWise connection with two receiving tasks per sending > task: outgoing channels 0 and 1. > > When scheduling in the most basic mode, the receivers can go anywhere, but > the schedule will try to give them a slot on the same instance, if > possible. Both could be in-memory, but both could be remote. > > When using a SlotSharingGroup (we do that by default right now), one of the > receivers can share the slot of the sender, but not both. Which one does > depends wich one stays first. Currently that is the one which gets data > first, but this is going to change soon with the new channels and > deployment. > > When you use a CoLocationGroup, you are guaranteed that subtasks n of the > sender is Co-located with subtask n of the receiver. But in the above > PointWise model, you would rather want subtask n to be co-located with > subtask 2*n. > > I don't think there is a reliable way to guarantee that, other than having > a slightly modified version of the co-location constraint. > > Stephan > Am 27.11.2014 00:45 schrieb "Gyula Fora" <[hidden email]>: > >> Hey, >> >> I was hoping that someone can answer this right away without me having to >> dig through all the code :) >> >> How does the channel indexing go when more then one consumer subtask is >> connected to an intermediate dataset in the pointwise pattern? >> I am trying to figure out which one is the “in-memory” channel to set up >> proper partitioning for streams in this case. The idea would be to push the >> majority to the in memory channel while still push some messages to the >> network channel to leverage operator parallelism, but to implement this I >> need to figure out the index of the in-memory channel. >> >> Thank you, >> Gyula |
Our implication so far was that forwarding means evenly scattering over
successors - a balanced load being the important goal. If you find different requirements in streaming, you could define a new type of selector. On Thu, Nov 27, 2014 at 11:02 AM, Gyula Fora <[hidden email]> wrote: > Thanks Stephan, > > So I took a quick look at the ChannelSelectors the batch api uses and I > see that for Forward strategy uses round-robin. My question was aimed > exactly to avoid having to do this. Isn’t this sub-optimal? > Maybe we could pass the channel info to the channel selector, so it can > make “smarter” decision. > > Gyula > > > On 27 Nov 2014, at 10:45, Stephan Ewen <[hidden email]> wrote: > > > > This is a bit tricky, since the new scheduling is more flexible... > > > > Assume we have a PointWise connection with two receiving tasks per > sending > > task: outgoing channels 0 and 1. > > > > When scheduling in the most basic mode, the receivers can go anywhere, > but > > the schedule will try to give them a slot on the same instance, if > > possible. Both could be in-memory, but both could be remote. > > > > When using a SlotSharingGroup (we do that by default right now), one of > the > > receivers can share the slot of the sender, but not both. Which one does > > depends wich one stays first. Currently that is the one which gets data > > first, but this is going to change soon with the new channels and > > deployment. > > > > When you use a CoLocationGroup, you are guaranteed that subtasks n of the > > sender is Co-located with subtask n of the receiver. But in the above > > PointWise model, you would rather want subtask n to be co-located with > > subtask 2*n. > > > > I don't think there is a reliable way to guarantee that, other than > having > > a slightly modified version of the co-location constraint. > > > > Stephan > > Am 27.11.2014 00:45 schrieb "Gyula Fora" <[hidden email]>: > > > >> Hey, > >> > >> I was hoping that someone can answer this right away without me having > to > >> dig through all the code :) > >> > >> How does the channel indexing go when more then one consumer subtask is > >> connected to an intermediate dataset in the pointwise pattern? > >> I am trying to figure out which one is the “in-memory” channel to set up > >> proper partitioning for streams in this case. The idea would be to push > the > >> majority to the in memory channel while still push some messages to the > >> network channel to leverage operator parallelism, but to implement this > I > >> need to figure out the index of the in-memory channel. > >> > >> Thank you, > >> Gyula > > |
Free forum by Nabble | Edit this page |