broadcast set size

broadcast set size

Martin Neumann-2
Hej,

Up to what sizes are broadcast sets a good idea?

I have a large dataset (~5 GB) and I'm only interested in the lines whose ID
appears in a separate file; that file has ~10k entries.
I could either join the dataset with the ID list, or broadcast the
ID list and do the filtering in a mapper.
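
For illustration, a minimal sketch of the broadcast-set variant in Flink's Java
DataSet API; file paths, tuple layout, and field names are assumptions, not
taken from the post:

import java.util.HashSet;
import java.util.Set;

import org.apache.flink.api.common.functions.RichFilterFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple1;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;

public class BroadcastFilterSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical inputs: the ID list (~10k entries) and the large dataset
        // (~5 GB), parsed as (id) and (id, payload) tuples respectively.
        DataSet<Tuple1<String>> idList = env.readCsvFile("hdfs:///path/to/id-list")
                .types(String.class);
        DataSet<Tuple2<String, String>> lines = env.readCsvFile("hdfs:///path/to/large-dataset")
                .types(String.class, String.class);

        // Broadcast the ID list and do the filtering in a rich filter function.
        DataSet<Tuple2<String, String>> matching = lines
                .filter(new RichFilterFunction<Tuple2<String, String>>() {
                    private Set<String> ids;

                    @Override
                    public void open(Configuration parameters) {
                        // Materialize the broadcast set into a HashSet for fast lookups.
                        ids = new HashSet<>();
                        for (Tuple1<String> id : getRuntimeContext().<Tuple1<String>>getBroadcastVariable("ids")) {
                            ids.add(id.f0);
                        }
                    }

                    @Override
                    public boolean filter(Tuple2<String, String> line) {
                        return ids.contains(line.f0);  // keep lines whose ID is in the list
                    }
                })
                .withBroadcastSet(idList, "ids");

        matching.writeAsText("hdfs:///path/to/output");
        env.execute("broadcast-filter sketch");
    }
}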

What would be the better solution given the data sizes described above?
Is there a good rule of thumb when to switch from one solution to the other?

cheers Martin

Re: broadcast set size

aalexandrov
Hi Martin,

The answer to your question really depends on the DOP (degree of parallelism)
at which you will be running the job, and on the expected selectivity (the
fraction of lines whose ID is in the list), in case that selectivity does not
depend on "the other side" and the list can be pre-filtered prior to
broadcasting.

However, since Flink's optimizer determines the actual data shipment
strategy for the join (broadcast or re-partition) based on the estimated
input size, I think sticking with the join is the more conservative and, in
the long term, more flexible solution in your case.

To summarize: if you want to do a join, I would always stick with Flink's
join primitive. Flink differs from other systems in that it hides multiple
strategies, such as the "map-side" join (which in other systems is done
manually with a broadcast variable) and the "reduce-side" join (also known
as a re-partition join), behind one abstract primitive and decides which one
to use in a cost-based way.
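
A sketch of that plain join, reusing the hypothetical idList / lines tuples
from the broadcast sketch earlier in the thread; no shipping strategy is
forced, so the optimizer is free to choose:

// Requires org.apache.flink.api.common.functions.JoinFunction in addition
// to the imports of the earlier sketch.
DataSet<Tuple2<String, String>> matching = lines
        .join(idList)                     // optimizer picks broadcast or re-partition
        .where(0)                         // id field of the large dataset
        .equalTo(0)                       // the single field of the ID list
        .with(new JoinFunction<Tuple2<String, String>, Tuple1<String>, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> join(Tuple2<String, String> line, Tuple1<String> id) {
                return line;              // keep only the matching line
            }
        });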

PS: I think this kind of question is better suited for the user list.


Re: broadcast set size

Stephan Ewen
In reply to this post by Martin Neumann-2
Hi Martin!

You can use a broadcast join for that as well. You use it exactly like the
usual join, but you write "joinWithTiny" or "joinWithHuge", depending on
whether the data set that is the argument to the call is the small or the
large one.
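
Under the same assumptions as the earlier sketches, the only change is the
method name; "joinWithTiny" tells Flink that the argument (the ID list) is the
small side, "joinWithHuge" would say the opposite:

DataSet<Tuple2<String, String>> matching = lines
        .joinWithTiny(idList)             // hint: the argument data set is tiny, broadcast it
        .where(0)
        .equalTo(0)
        .with(new JoinFunction<Tuple2<String, String>, Tuple1<String>, Tuple2<String, String>>() {
            @Override
            public Tuple2<String, String> join(Tuple2<String, String> line, Tuple1<String> id) {
                return line;
            }
        });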

The broadcast join internally also broadcasts the small data set (like
broadcast sets do), but keeps it in managed memory, which has the benefit
that it can spill to disk if needed.

BTW, in many cases (if you join two files, a small one and a large one), the
Flink optimizer will automatically pick the broadcast join. Currently that
depends on whether the client can access HDFS and gather statistics, and on
what kind of functions you use between the data sources and the join (i.e.,
whether the size estimates are still good or already fuzzy).

If you want to figure out which strategy is used, have a look at the
execution plan (either dump the JSON plan and load it into the plan
visualizer HTML page, or use the web client).
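
For example (again reusing the hypothetical job from the sketches above), the
JSON plan can be dumped before executing:

// A sink must be defined for the plan to be complete; the path is made up.
matching.writeAsText("hdfs:///path/to/output");

// Prints the optimizer's plan as JSON; paste it into Flink's plan visualizer
// HTML page or inspect the job in the web client to see whether the join is
// executed with a broadcast or a re-partition strategy.
System.out.println(env.getExecutionPlan());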

Greetings,
Stephan


