Join with a custom predicate

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Join with a custom predicate

Kirschnick, Johannes
Hi
I have a small problem with doing a custom join, that I would need some help with. Maybe I'm also approaching the problem wrong.
So basically I have two dataset.
The simplified example: The first one has a start and end value. The second dataset is just a list of ordered numbers and some value (value is ignored in the example)
Example
One = {3,6},{5,7}
Two = 1,2,3,4,5,6,7
What I need is a sort of custom join, that brings to the first dataset all elements from the second that are within the range.
Something like .. join where one.start <= two.number <= one.end
So {3,6} from one would only need to "see" 3,4,5
Joining does not work out of the box here as the key is sort of "dynamic" depending on the value of one.
I can just use a map for the first dataset and broadcast the second into the mapper which can then select the required elements - but my assumption is that the second dataset might actually be very large as well, but the qualifying join "numbers" from two will actually be small.
Is there something I could do in this particular case?
Thanks a lot
Johannes
Reply | Threaded
Open this post in threaded view
|

Re: Join with a custom predicate

aalexandrov
I thought about your problem over the weekend. Unfortunately the algorithm
that you describe does not fit "regular" equi-join semantics, but I think
it could be "fitted" with a more complex dataflow.

To achieve that, I would partition the (active) domain of the two datasets
on fine-granular intervals (for the sake of the example, let's say 10.

You can prepare a "coarse-grained" join key on the inputs using a "x % 10"
(Flat)Map:

One: (0, {3,6}), (0, {5,7})
Two: (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7)

Upon that you can do a regular join on the "coarse-grained" key (in the
first component of the tuples), and follow that with a filter that
evaluates the actual "one.start <= two.number <= one.end" predicate.

Regards,
Alex


2015-04-24 20:55 GMT+02:00 Kirschnick, Johannes <
[hidden email]>:

> Hi
> I have a small problem with doing a custom join, that I would need some
> help with. Maybe I'm also approaching the problem wrong.
> So basically I have two dataset.
> The simplified example: The first one has a start and end value. The
> second dataset is just a list of ordered numbers and some value (value is
> ignored in the example)
> Example
> One = {3,6},{5,7}
> Two = 1,2,3,4,5,6,7
> What I need is a sort of custom join, that brings to the first dataset all
> elements from the second that are within the range.
> Something like .. join where one.start <= two.number <= one.end
> So {3,6} from one would only need to "see" 3,4,5
> Joining does not work out of the box here as the key is sort of "dynamic"
> depending on the value of one.
> I can just use a map for the first dataset and broadcast the second into
> the mapper which can then select the required elements - but my assumption
> is that the second dataset might actually be very large as well, but the
> qualifying join "numbers" from two will actually be small.
> Is there something I could do in this particular case?
> Thanks a lot
> Johannes
>
Reply | Threaded
Open this post in threaded view
|

Re: Join with a custom predicate

Till Rohrmann
That's a good solution. In order to deal with ranges which overlap two
intervals you have to create multiple "coarse-grained" join keys. One key
for each interval contained in the range.

Cheers,
Till
On Apr 26, 2015 11:22 PM, "Alexander Alexandrov" <
[hidden email]> wrote:

> I thought about your problem over the weekend. Unfortunately the algorithm
> that you describe does not fit "regular" equi-join semantics, but I think
> it could be "fitted" with a more complex dataflow.
>
> To achieve that, I would partition the (active) domain of the two datasets
> on fine-granular intervals (for the sake of the example, let's say 10.
>
> You can prepare a "coarse-grained" join key on the inputs using a "x % 10"
> (Flat)Map:
>
> One: (0, {3,6}), (0, {5,7})
> Two: (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7)
>
> Upon that you can do a regular join on the "coarse-grained" key (in the
> first component of the tuples), and follow that with a filter that
> evaluates the actual "one.start <= two.number <= one.end" predicate.
>
> Regards,
> Alex
>
>
> 2015-04-24 20:55 GMT+02:00 Kirschnick, Johannes <
> [hidden email]>:
>
> > Hi
> > I have a small problem with doing a custom join, that I would need some
> > help with. Maybe I'm also approaching the problem wrong.
> > So basically I have two dataset.
> > The simplified example: The first one has a start and end value. The
> > second dataset is just a list of ordered numbers and some value (value is
> > ignored in the example)
> > Example
> > One = {3,6},{5,7}
> > Two = 1,2,3,4,5,6,7
> > What I need is a sort of custom join, that brings to the first dataset
> all
> > elements from the second that are within the range.
> > Something like .. join where one.start <= two.number <= one.end
> > So {3,6} from one would only need to "see" 3,4,5
> > Joining does not work out of the box here as the key is sort of "dynamic"
> > depending on the value of one.
> > I can just use a map for the first dataset and broadcast the second into
> > the mapper which can then select the required elements - but my
> assumption
> > is that the second dataset might actually be very large as well, but the
> > qualifying join "numbers" from two will actually be small.
> > Is there something I could do in this particular case?
> > Thanks a lot
> > Johannes
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: Join with a custom predicate

Kirschnick, Johannes
In reply to this post by Kirschnick, Johannes
Hi,

thanks for dedicating some of your weekend time to this problem.
Your solution looks quite neat, thanks ..  

For your information, what I'm trying to do is multiply matrix blocks (dataset one) with individual elements of a vector (dataset two, of row id and some value).
In this case "one" needs to see all the qualifying entries from "two" to correctly multiply, ideally as a group for streamlined multiplication.

Johannes

-----Ursprüngliche Nachricht-----
Von: Alexander Alexandrov [mailto:[hidden email]]
Gesendet: Sonntag, 26. April 2015 23:22
An: [hidden email]
Betreff: Re: Join with a custom predicate

I thought about your problem over the weekend. Unfortunately the algorithm that you describe does not fit "regular" equi-join semantics, but I think it could be "fitted" with a more complex dataflow.

To achieve that, I would partition the (active) domain of the two datasets on fine-granular intervals (for the sake of the example, let's say 10.

You can prepare a "coarse-grained" join key on the inputs using a "x % 10"
(Flat)Map:

One: (0, {3,6}), (0, {5,7})
Two: (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7)

Upon that you can do a regular join on the "coarse-grained" key (in the first component of the tuples), and follow that with a filter that evaluates the actual "one.start <= two.number <= one.end" predicate.

Regards,
Alex


2015-04-24 20:55 GMT+02:00 Kirschnick, Johannes <
[hidden email]>:

> Hi
> I have a small problem with doing a custom join, that I would need
> some help with. Maybe I'm also approaching the problem wrong.
> So basically I have two dataset.
> The simplified example: The first one has a start and end value. The
> second dataset is just a list of ordered numbers and some value (value
> is ignored in the example) Example One = {3,6},{5,7} Two =
> 1,2,3,4,5,6,7 What I need is a sort of custom join, that brings to the
> first dataset all elements from the second that are within the range.
> Something like .. join where one.start <= two.number <= one.end So
> {3,6} from one would only need to "see" 3,4,5 Joining does not work
> out of the box here as the key is sort of "dynamic"
> depending on the value of one.
> I can just use a map for the first dataset and broadcast the second
> into the mapper which can then select the required elements - but my
> assumption is that the second dataset might actually be very large as
> well, but the qualifying join "numbers" from two will actually be small.
> Is there something I could do in this particular case?
> Thanks a lot
> Johannes
>
Reply | Threaded
Open this post in threaded view
|

Re: Join with a custom predicate

aalexandrov
I'm curios to learn more, maybe we can discuss this over a coffee in the
next days ;)

2015-04-27 12:16 GMT+02:00 Kirschnick, Johannes <
[hidden email]>:

> Hi,
>
> thanks for dedicating some of your weekend time to this problem.
> Your solution looks quite neat, thanks ..
>
> For your information, what I'm trying to do is multiply matrix blocks
> (dataset one) with individual elements of a vector (dataset two, of row id
> and some value).
> In this case "one" needs to see all the qualifying entries from "two" to
> correctly multiply, ideally as a group for streamlined multiplication.
>
> Johannes
>
> -----Ursprüngliche Nachricht-----
> Von: Alexander Alexandrov [mailto:[hidden email]]
> Gesendet: Sonntag, 26. April 2015 23:22
> An: [hidden email]
> Betreff: Re: Join with a custom predicate
>
> I thought about your problem over the weekend. Unfortunately the algorithm
> that you describe does not fit "regular" equi-join semantics, but I think
> it could be "fitted" with a more complex dataflow.
>
> To achieve that, I would partition the (active) domain of the two datasets
> on fine-granular intervals (for the sake of the example, let's say 10.
>
> You can prepare a "coarse-grained" join key on the inputs using a "x % 10"
> (Flat)Map:
>
> One: (0, {3,6}), (0, {5,7})
> Two: (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7)
>
> Upon that you can do a regular join on the "coarse-grained" key (in the
> first component of the tuples), and follow that with a filter that
> evaluates the actual "one.start <= two.number <= one.end" predicate.
>
> Regards,
> Alex
>
>
> 2015-04-24 20:55 GMT+02:00 Kirschnick, Johannes <
> [hidden email]>:
>
> > Hi
> > I have a small problem with doing a custom join, that I would need
> > some help with. Maybe I'm also approaching the problem wrong.
> > So basically I have two dataset.
> > The simplified example: The first one has a start and end value. The
> > second dataset is just a list of ordered numbers and some value (value
> > is ignored in the example) Example One = {3,6},{5,7} Two =
> > 1,2,3,4,5,6,7 What I need is a sort of custom join, that brings to the
> > first dataset all elements from the second that are within the range.
> > Something like .. join where one.start <= two.number <= one.end So
> > {3,6} from one would only need to "see" 3,4,5 Joining does not work
> > out of the box here as the key is sort of "dynamic"
> > depending on the value of one.
> > I can just use a map for the first dataset and broadcast the second
> > into the mapper which can then select the required elements - but my
> > assumption is that the second dataset might actually be very large as
> > well, but the qualifying join "numbers" from two will actually be small.
> > Is there something I could do in this particular case?
> > Thanks a lot
> > Johannes
> >
>