Hi
I have a small problem with doing a custom join, that I would need some help with. Maybe I'm also approaching the problem wrong. So basically I have two dataset. The simplified example: The first one has a start and end value. The second dataset is just a list of ordered numbers and some value (value is ignored in the example) Example One = {3,6},{5,7} Two = 1,2,3,4,5,6,7 What I need is a sort of custom join, that brings to the first dataset all elements from the second that are within the range. Something like .. join where one.start <= two.number <= one.end So {3,6} from one would only need to "see" 3,4,5 Joining does not work out of the box here as the key is sort of "dynamic" depending on the value of one. I can just use a map for the first dataset and broadcast the second into the mapper which can then select the required elements - but my assumption is that the second dataset might actually be very large as well, but the qualifying join "numbers" from two will actually be small. Is there something I could do in this particular case? Thanks a lot Johannes |
I thought about your problem over the weekend. Unfortunately the algorithm
that you describe does not fit "regular" equi-join semantics, but I think it could be "fitted" with a more complex dataflow. To achieve that, I would partition the (active) domain of the two datasets on fine-granular intervals (for the sake of the example, let's say 10. You can prepare a "coarse-grained" join key on the inputs using a "x % 10" (Flat)Map: One: (0, {3,6}), (0, {5,7}) Two: (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7) Upon that you can do a regular join on the "coarse-grained" key (in the first component of the tuples), and follow that with a filter that evaluates the actual "one.start <= two.number <= one.end" predicate. Regards, Alex 2015-04-24 20:55 GMT+02:00 Kirschnick, Johannes < [hidden email]>: > Hi > I have a small problem with doing a custom join, that I would need some > help with. Maybe I'm also approaching the problem wrong. > So basically I have two dataset. > The simplified example: The first one has a start and end value. The > second dataset is just a list of ordered numbers and some value (value is > ignored in the example) > Example > One = {3,6},{5,7} > Two = 1,2,3,4,5,6,7 > What I need is a sort of custom join, that brings to the first dataset all > elements from the second that are within the range. > Something like .. join where one.start <= two.number <= one.end > So {3,6} from one would only need to "see" 3,4,5 > Joining does not work out of the box here as the key is sort of "dynamic" > depending on the value of one. > I can just use a map for the first dataset and broadcast the second into > the mapper which can then select the required elements - but my assumption > is that the second dataset might actually be very large as well, but the > qualifying join "numbers" from two will actually be small. > Is there something I could do in this particular case? > Thanks a lot > Johannes > |
That's a good solution. In order to deal with ranges which overlap two
intervals you have to create multiple "coarse-grained" join keys. One key for each interval contained in the range. Cheers, Till On Apr 26, 2015 11:22 PM, "Alexander Alexandrov" < [hidden email]> wrote: > I thought about your problem over the weekend. Unfortunately the algorithm > that you describe does not fit "regular" equi-join semantics, but I think > it could be "fitted" with a more complex dataflow. > > To achieve that, I would partition the (active) domain of the two datasets > on fine-granular intervals (for the sake of the example, let's say 10. > > You can prepare a "coarse-grained" join key on the inputs using a "x % 10" > (Flat)Map: > > One: (0, {3,6}), (0, {5,7}) > Two: (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7) > > Upon that you can do a regular join on the "coarse-grained" key (in the > first component of the tuples), and follow that with a filter that > evaluates the actual "one.start <= two.number <= one.end" predicate. > > Regards, > Alex > > > 2015-04-24 20:55 GMT+02:00 Kirschnick, Johannes < > [hidden email]>: > > > Hi > > I have a small problem with doing a custom join, that I would need some > > help with. Maybe I'm also approaching the problem wrong. > > So basically I have two dataset. > > The simplified example: The first one has a start and end value. The > > second dataset is just a list of ordered numbers and some value (value is > > ignored in the example) > > Example > > One = {3,6},{5,7} > > Two = 1,2,3,4,5,6,7 > > What I need is a sort of custom join, that brings to the first dataset > all > > elements from the second that are within the range. > > Something like .. join where one.start <= two.number <= one.end > > So {3,6} from one would only need to "see" 3,4,5 > > Joining does not work out of the box here as the key is sort of "dynamic" > > depending on the value of one. > > I can just use a map for the first dataset and broadcast the second into > > the mapper which can then select the required elements - but my > assumption > > is that the second dataset might actually be very large as well, but the > > qualifying join "numbers" from two will actually be small. > > Is there something I could do in this particular case? > > Thanks a lot > > Johannes > > > |
In reply to this post by Kirschnick, Johannes
Hi,
thanks for dedicating some of your weekend time to this problem. Your solution looks quite neat, thanks .. For your information, what I'm trying to do is multiply matrix blocks (dataset one) with individual elements of a vector (dataset two, of row id and some value). In this case "one" needs to see all the qualifying entries from "two" to correctly multiply, ideally as a group for streamlined multiplication. Johannes -----Ursprüngliche Nachricht----- Von: Alexander Alexandrov [mailto:[hidden email]] Gesendet: Sonntag, 26. April 2015 23:22 An: [hidden email] Betreff: Re: Join with a custom predicate I thought about your problem over the weekend. Unfortunately the algorithm that you describe does not fit "regular" equi-join semantics, but I think it could be "fitted" with a more complex dataflow. To achieve that, I would partition the (active) domain of the two datasets on fine-granular intervals (for the sake of the example, let's say 10. You can prepare a "coarse-grained" join key on the inputs using a "x % 10" (Flat)Map: One: (0, {3,6}), (0, {5,7}) Two: (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7) Upon that you can do a regular join on the "coarse-grained" key (in the first component of the tuples), and follow that with a filter that evaluates the actual "one.start <= two.number <= one.end" predicate. Regards, Alex 2015-04-24 20:55 GMT+02:00 Kirschnick, Johannes < [hidden email]>: > Hi > I have a small problem with doing a custom join, that I would need > some help with. Maybe I'm also approaching the problem wrong. > So basically I have two dataset. > The simplified example: The first one has a start and end value. The > second dataset is just a list of ordered numbers and some value (value > is ignored in the example) Example One = {3,6},{5,7} Two = > 1,2,3,4,5,6,7 What I need is a sort of custom join, that brings to the > first dataset all elements from the second that are within the range. > Something like .. join where one.start <= two.number <= one.end So > {3,6} from one would only need to "see" 3,4,5 Joining does not work > out of the box here as the key is sort of "dynamic" > depending on the value of one. > I can just use a map for the first dataset and broadcast the second > into the mapper which can then select the required elements - but my > assumption is that the second dataset might actually be very large as > well, but the qualifying join "numbers" from two will actually be small. > Is there something I could do in this particular case? > Thanks a lot > Johannes > |
I'm curios to learn more, maybe we can discuss this over a coffee in the
next days ;) 2015-04-27 12:16 GMT+02:00 Kirschnick, Johannes < [hidden email]>: > Hi, > > thanks for dedicating some of your weekend time to this problem. > Your solution looks quite neat, thanks .. > > For your information, what I'm trying to do is multiply matrix blocks > (dataset one) with individual elements of a vector (dataset two, of row id > and some value). > In this case "one" needs to see all the qualifying entries from "two" to > correctly multiply, ideally as a group for streamlined multiplication. > > Johannes > > -----Ursprüngliche Nachricht----- > Von: Alexander Alexandrov [mailto:[hidden email]] > Gesendet: Sonntag, 26. April 2015 23:22 > An: [hidden email] > Betreff: Re: Join with a custom predicate > > I thought about your problem over the weekend. Unfortunately the algorithm > that you describe does not fit "regular" equi-join semantics, but I think > it could be "fitted" with a more complex dataflow. > > To achieve that, I would partition the (active) domain of the two datasets > on fine-granular intervals (for the sake of the example, let's say 10. > > You can prepare a "coarse-grained" join key on the inputs using a "x % 10" > (Flat)Map: > > One: (0, {3,6}), (0, {5,7}) > Two: (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7) > > Upon that you can do a regular join on the "coarse-grained" key (in the > first component of the tuples), and follow that with a filter that > evaluates the actual "one.start <= two.number <= one.end" predicate. > > Regards, > Alex > > > 2015-04-24 20:55 GMT+02:00 Kirschnick, Johannes < > [hidden email]>: > > > Hi > > I have a small problem with doing a custom join, that I would need > > some help with. Maybe I'm also approaching the problem wrong. > > So basically I have two dataset. > > The simplified example: The first one has a start and end value. The > > second dataset is just a list of ordered numbers and some value (value > > is ignored in the example) Example One = {3,6},{5,7} Two = > > 1,2,3,4,5,6,7 What I need is a sort of custom join, that brings to the > > first dataset all elements from the second that are within the range. > > Something like .. join where one.start <= two.number <= one.end So > > {3,6} from one would only need to "see" 3,4,5 Joining does not work > > out of the box here as the key is sort of "dynamic" > > depending on the value of one. > > I can just use a map for the first dataset and broadcast the second > > into the mapper which can then select the required elements - but my > > assumption is that the second dataset might actually be very large as > > well, but the qualifying join "numbers" from two will actually be small. > > Is there something I could do in this particular case? > > Thanks a lot > > Johannes > > > |
Free forum by Nabble | Edit this page |