Hi all,
Let’s imagine a simple repartition hash Join oft two tables. As soon as the first table is hashed completely (all EndOfPartition Events sent) the shipping and probing of the second table starts. What I can’t find: 1. What triggers to start the probing exactly? 2. Where can I find it in the code? My final goal is to change the 2-phase join mechanism to a mixed implementation where probing for finished subpartitions begins earlier. I appreciate any help. Thanks. Benjamin |
Hi Benjamin,
As you mentioned hash join I assume that you are referring to `HashJoinOperator` in blink planner. The input is selected by `nextSelection` method. As you can see, it will first read all records in the build side then read all records in the probe side. So the probing will only start after the build side has been fully received. Apart from this specific question, I'm interested in how you're going to implement a hash join where any of the two sides can be read. Could you share your ideas or give some hints about this? Thanks a lot. Benjamin Burkhardt <[hidden email]> 于2019年7月24日周三 上午1:24写道: > Hi all, > > Let’s imagine a simple repartition hash Join oft two tables. > > As soon as the first table is hashed completely (all EndOfPartition Events > sent) the shipping and probing of the second table starts. > > What I can’t find: > > 1. What triggers to start the probing exactly? > 2. Where can I find it in the code? > > > My final goal is to change the 2-phase join mechanism to a mixed > implementation where probing for finished subpartitions begins earlier. > > I appreciate any help. > > Thanks. > > Benjamin > |
Hi,
while doing a join, a MutableHashTable is created and filled while building. After building it is closed and the probing can begin. I would like to start probing while building the hash table still runs. (ignoring the fact that this would lead to join misses...) Anyone having an idea how one could do that? Am 24. Juli 2019, 04:14 +0200 schrieb Caizhi Weng <[hidden email]>: > Hi Benjamin, > > As you mentioned hash join I assume that you are referring to > `HashJoinOperator` in blink planner. > > The input is selected by `nextSelection` method. As you can see, it will > first read all records in the build side then read all records in the probe > side. So the probing will only start after the build side has been fully > received. > > Apart from this specific question, I'm interested in how you're going to > implement a hash join where any of the two sides can be read. Could you > share your ideas or give some hints about this? Thanks a lot. > > Benjamin Burkhardt <[hidden email]> 于2019年7月24日周三 上午1:24写道: > > > Hi all, > > > > Let’s imagine a simple repartition hash Join oft two tables. > > > > As soon as the first table is hashed completely (all EndOfPartition Events > > sent) the shipping and probing of the second table starts. > > > > What I can’t find: > > > > 1. What triggers to start the probing exactly? > > 2. Where can I find it in the code? > > > > > > My final goal is to change the 2-phase join mechanism to a mixed > > implementation where probing for finished subpartitions begins earlier. > > > > I appreciate any help. > > > > Thanks. > > > > Benjamin > > |
Hi!
The join implementations used for the DataSet API and for the Blink Planner are quite intricate. They make use of these custom memory segments, to operate as much as possible on bytes, to control JVM memory utilization and to save serialization costs. That makes the implementation super complicated. Do your experiments need that kind of behavior? If not, it might actually be simpler to just start with some very simple object-based hash table. If you want to actually work with these classes, I would suggest to first change the MutableHashTable a bit. It is probably easier to remove the logic to pull from iterators into build and probe side, but rather do this outside the class. (I am assuming you plan to work with a fork for research here). Best, Stephan On Thu, Jul 25, 2019 at 11:58 AM Benjamin Burkhardt < [hidden email]> wrote: > Hi, > > while doing a join, a MutableHashTable is created and filled while > building. After building it is closed and the probing can begin. > I would like to start probing while building the hash table still runs. > (ignoring the fact that this would lead to join misses...) > > Anyone having an idea how one could do that? > > Am 24. Juli 2019, 04:14 +0200 schrieb Caizhi Weng <[hidden email]>: > > Hi Benjamin, > > > > As you mentioned hash join I assume that you are referring to > > `HashJoinOperator` in blink planner. > > > > The input is selected by `nextSelection` method. As you can see, it will > > first read all records in the build side then read all records in the > probe > > side. So the probing will only start after the build side has been fully > > received. > > > > Apart from this specific question, I'm interested in how you're going to > > implement a hash join where any of the two sides can be read. Could you > > share your ideas or give some hints about this? Thanks a lot. > > > > Benjamin Burkhardt <[hidden email]> 于2019年7月24日周三 上午1:24写道: > > > > > Hi all, > > > > > > Let’s imagine a simple repartition hash Join oft two tables. > > > > > > As soon as the first table is hashed completely (all EndOfPartition > Events > > > sent) the shipping and probing of the second table starts. > > > > > > What I can’t find: > > > > > > 1. What triggers to start the probing exactly? > > > 2. Where can I find it in the code? > > > > > > > > > My final goal is to change the 2-phase join mechanism to a mixed > > > implementation where probing for finished subpartitions begins earlier. > > > > > > I appreciate any help. > > > > > > Thanks. > > > > > > Benjamin > > > > |
Free forum by Nabble | Edit this page |