(DEPRECATED) Apache Flink Mailing List archive.

Join with a custom predicate

Kirschnick, Johannes

Join with a custom predicate

Hi
I have a small problem with a custom join that I need some help with. Maybe I'm also approaching the problem the wrong way.
So basically I have two datasets.
The simplified example: the first one has a start and an end value. The second dataset is just a list of ordered numbers, each with some value (the value is ignored in the example).
Example
One = {3,6},{5,7}
Two = 1,2,3,4,5,6,7
What I need is a sort of custom join that brings to each element of the first dataset all elements from the second that fall within its range.
Something like: join where one.start <= two.number <= one.end
So {3,6} from one would only need to "see" 3,4,5,6
Joining does not work out of the box here, as the key is sort of "dynamic", depending on the value of one.
I could just use a map on the first dataset and broadcast the second into the mapper, which can then select the required elements. However, my assumption is that the second dataset might be very large as well, while the number of qualifying join "numbers" from two will be small.
Is there something I could do in this particular case?
Thanks a lot
Johannes
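
For reference, the requested semantics can be written down as a plain nested loop. This is only a minimal sketch in plain Python rather than Flink code: the function name `range_join_naive` is made up for illustration, and the predicate is taken to be inclusive on both ends, exactly as written in the "join where" line above. It corresponds to the broadcast idea, where every range gets to look at every number.

```python
def range_join_naive(ones, twos):
    """Pair every (start, end) range with each number that lies inside it.

    Every range is compared against every number, which is what the
    broadcast approach amounts to.
    """
    return [((start, end), n)
            for (start, end) in ones
            for n in twos
            if start <= n <= end]  # inclusive on both ends, as in the post


one = [(3, 6), (5, 7)]
two = [1, 2, 3, 4, 5, 6, 7]
pairs = range_join_naive(one, two)
```

The cost is len(one) * len(two) comparisons, which is the reason this does not scale once the second dataset gets large.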

aalexandrov

Re: Join with a custom predicate

I thought about your problem over the weekend. Unfortunately the algorithm
that you describe does not fit "regular" equi-join semantics, but I think
it could be "fitted" with a more complex dataflow.

To achieve that, I would partition the (active) domain of the two datasets
into fine-granular intervals (for the sake of the example, let's say of width 10).

You can prepare a "coarse-grained" join key on the inputs using an "x / 10"
(integer division) (Flat)Map:

One: (0, {3,6}), (0, {5,7})
Two: (0, 1), (0, 2), (0, 3), (0, 4), (0, 5), (0, 6), (0, 7)

After that you can do a regular join on the "coarse-grained" key (the
first component of the tuples), and follow it with a filter that
evaluates the actual "one.start <= two.number <= one.end" predicate.
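
The dataflow described here can be sketched in plain Python (not Flink code) as three steps: key, equi-join on the key, then filter. The names `coarse_key` and `bucketed_range_join` are made up for illustration, and the sketch assumes that every range lies inside a single interval. The key is computed with integer division by the interval width, which is what yields key 0 for every value in the example.

```python
from collections import defaultdict

WIDTH = 10  # interval width from the example; an assumed tuning parameter


def coarse_key(x, width=WIDTH):
    """Coarse-grained join key: the index of the interval that contains x."""
    return x // width


def bucketed_range_join(ones, twos, width=WIDTH):
    """Equi-join on the coarse key, then filter on the real predicate.

    Assumes each (start, end) range lies within one interval, so the
    range is keyed by its start value only.
    """
    # Build side: group the numbers by their coarse key.
    by_key = defaultdict(list)
    for n in twos:
        by_key[coarse_key(n, width)].append(n)

    result = []
    for (start, end) in ones:
        # Regular equi-join: only numbers with the same coarse key are seen.
        for n in by_key.get(coarse_key(start, width), []):
            # Filter: evaluate the actual range predicate.
            if start <= n <= end:
                result.append(((start, end), n))
    return result


one = [(3, 6), (5, 7)]
two = [1, 2, 3, 4, 5, 6, 7]
keyed_one = [(coarse_key(start), (start, end)) for (start, end) in one]
joined = bucketed_range_join(one, two)
```

The interval width trades off the two costs: wider intervals mean more candidates reach the filter, narrower ones mean more ranges cross an interval boundary.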

Regards,
Alex

In reply to the post by Kirschnick, Johannes (2015-04-24 20:55 GMT+02:00)

Till Rohrmann

Re: Join with a custom predicate

That's a good solution. To deal with ranges that span more than one
interval, you have to create multiple "coarse-grained" join keys: one key
for each interval that the range overlaps.
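
A minimal sketch of this extension, again in plain Python rather than Flink code and with made-up names (`coarse_keys_for_range`, `range_join_multi_key`): the range side emits one key per interval it overlaps, which is the job of the FlatMap, while each number still gets exactly one key. An interval width of 10 and integer-division keys are assumed.

```python
from collections import defaultdict


def coarse_keys_for_range(start, end, width=10):
    """All interval indices that the inclusive range [start, end] overlaps."""
    return list(range(start // width, end // width + 1))


def range_join_multi_key(ones, twos, width=10):
    """Range join where a range may span several intervals."""
    # Each number belongs to exactly one interval.
    keyed_two = defaultdict(list)
    for n in twos:
        keyed_two[n // width].append(n)

    result = []
    for (start, end) in ones:
        # FlatMap step: the range is replicated once per overlapped interval.
        for key in coarse_keys_for_range(start, end, width):
            for n in keyed_two.get(key, []):
                if start <= n <= end:  # final filter on the real predicate
                    result.append(((start, end), n))
    return result


matches = range_join_multi_key([(8, 23)], range(1, 31))
```

Because every number carries a single key, a matching pair is produced only once even though the range is replicated.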

Cheers,
Till
In reply to the post by Alexander Alexandrov (Apr 26, 2015 11:22 PM)

Kirschnick, Johannes

Re: Join with a custom predicate

In reply to this post by Kirschnick, Johannes

Hi,

Thanks for dedicating some of your weekend time to this problem.
Your solution looks quite neat, thanks.

For your information, what I'm trying to do is multiply matrix blocks (dataset one) with individual elements of a vector (dataset two, consisting of a row id and some value).
In this case, "one" needs to see all the qualifying entries from "two" to multiply correctly, ideally as a group for streamlined multiplication.
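
To make the use case concrete, here is a rough sketch in plain Python of what one block would do with its group of qualifying vector entries. The layout is an assumption made purely for illustration: a block is taken to be a tuple (start, end, rows) whose columns correspond to the vector row ids start..end, and vector entries missing from the group are treated as zero. The name `multiply_block` is hypothetical.

```python
def multiply_block(block, vector_entries):
    """Multiply one matrix block with the vector entries in its range.

    block          -- (start, end, rows); each row has end - start + 1 values
    vector_entries -- iterable of (row_id, value) pairs
    """
    start, end, rows = block
    # Collect the qualifying entries as one group, keyed by row id.
    group = {i: v for (i, v) in vector_entries if start <= i <= end}
    # One partial result per block row; absent ids contribute zero.
    return [sum(row[i - start] * group.get(i, 0.0)
                for i in range(start, end + 1))
            for row in rows]


block = (3, 5, [[1.0, 2.0, 3.0],
                [0.0, 1.0, 0.0]])
vector = [(1, 10.0), (3, 1.0), (4, 2.0), (5, 3.0), (7, 9.0)]
partial = multiply_block(block, vector)
```

In a distributed setting the group would come out of the join on the coarse-grained key followed by a grouping per block, rather than from a scan over the whole vector as done here.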

Johannes

aalexandrov

Re: Join with a custom predicate

I'm curious to learn more; maybe we can discuss this over a coffee in the
next few days ;)

In reply to the post by Kirschnick, Johannes (2015-04-27 12:16 GMT+02:00)