|
Hello to my squirrels,
I've started looking into FLINK-1943 <https://issues.apache.org/jira/browse/FLINK-1943> and I need some help to understand what to test and how to do it properly.

In the corresponding Spargel compiler test, the following functionality is checked:

1. sink: the ship strategy is FORWARD and the parallelism is correct
2. iteration: degree of parallelism
3. solution set join: parallelism, and the input1 ship strategy is PARTITION_HASH
4. workset join: parallelism, input1 (edges) ship strategy is PARTITION_HASH and cached, input2 (workset) ship strategy is FORWARD
5. check that the initial partitioning is pushed out of the loop
6. check that the initial workset sort is outside the loop

I have been able to verify 1-4 of the above for the GSA iteration plan, but I'm not sure how to check (5) and (6) or whether they are expected to hold in the GSA case.

In [1] you can see what the GSA iteration operators look like and in [2] you can see what the visualizer tool generates for the GSA connected components.

Any pointers would be greatly appreciated!

Cheers,
Vasia.

[1]: https://docs.google.com/drawings/d/1tiNQeOphWtkNXTGlnDJ3Ipanh0Tm2R8sHe8XNyTnf98/edit?usp=sharing
[2]: http://imgur.com/GQZ48ZI |
|
Hey,
any input on this? or a hint? or where to look to figure this out by myself?

Thanks!
-Vasia.

On 7 July 2015 at 15:20, Vasiliki Kalavri <[hidden email]> wrote:
> [...] |
|
Hey Vasia!
Sorry for the late response... Thanks for pinging again!

The optimizer is acting a little funky here; it seems to be an artifact of the "properties" optimization.

-> The initial join needs to be partitioned and sorted. Can you check whether one partitioning and sorting happens before the iteration? That part is cut off in the screenshot you sent. It must be either on the input of the iteration, or the output.

-> The iteration needs to make sure it leaves the data partitioned and sorted. There is a "re-sorting" operator at the end ("Rebuild Workset Properties"), but it does not partition. The test should make sure the data is known to be partitioned at the very end of the iteration (after the "Rebuild Workset Properties" operator). This is probably true if the join has some forward field annotation.

We can have a quick Skype chat later, if you have more questions...

Greetings,
Stephan

On Wed, Jul 15, 2015 at 12:08 PM, Vasiliki Kalavri <[hidden email]> wrote:
> [...] |
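The two checks discussed above (partitioning and sort applied once, before the loop, and not repeated inside it) can be sketched on a toy plan model. This is only an illustration of the logic: `Channel`, `ShipStrategy`, `LocalStrategy` and both helper methods are hypothetical stand-ins defined here, not Flink optimizer classes, and a real compiler test would read the same information from the optimized plan's nodes.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model of plan inputs. All names are hypothetical stand-ins, not Flink classes. */
class IterationInputCheck {

    enum ShipStrategy { FORWARD, PARTITION_HASH }
    enum LocalStrategy { NONE, SORT }

    /** One input edge of an operator: how records are shipped and prepared locally. */
    static final class Channel {
        final String target;
        final ShipStrategy ship;
        final LocalStrategy local;

        Channel(String target, ShipStrategy ship, LocalStrategy local) {
            this.target = target;
            this.ship = ship;
            this.local = local;
        }
    }

    /** True if the workset is hash partitioned and sorted on the way INTO the iteration. */
    static boolean preparedOutsideLoop(Channel initialWorkset) {
        return initialWorkset.ship == ShipStrategy.PARTITION_HASH
                && initialWorkset.local == LocalStrategy.SORT;
    }

    /** True if no workset channel inside the loop body repeats the partitioning or the sort. */
    static boolean notRepeatedInsideLoop(List<Channel> worksetChannelsInLoop) {
        for (Channel c : worksetChannelsInLoop) {
            if (c.ship != ShipStrategy.FORWARD || c.local != LocalStrategy.NONE) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Channel intoIteration = new Channel("iteration (initial workset)",
                ShipStrategy.PARTITION_HASH, LocalStrategy.SORT);
        List<Channel> inLoop = Arrays.asList(
                new Channel("workset join, input2", ShipStrategy.FORWARD, LocalStrategy.NONE));

        System.out.println("prepared outside loop: " + preparedOutsideLoop(intoIteration));
        System.out.println("not repeated inside:   " + notRepeatedInsideLoop(inLoop));
    }
}
```

Only the workset side is modelled. The edges input of the workset join is expected to be hash partitioned and cached inside the loop, so it is deliberately left out of `notRepeatedInsideLoop`.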
|
Hi,
thank you Stephan!

Here's the missing part of the plan: http://i.imgur.com/N861tg1.png
There is one hash partition / sort. Is this what you're talking about?

Regarding your second point, how can I test if the data is known to be partitioned at the end?

-Vasia.

On 15 July 2015 at 13:13, Stephan Ewen <[hidden email]> wrote:
> [...] |
|
Lady Kalamari,
The plan looks good.

To test whether the data is partitioned there: if you have the optimizer plan, make sure the global properties have a partitioning property of "HASH_PARTITIONED".

Thanks,
Stephan

On Wed, Jul 15, 2015 at 2:07 PM, Vasiliki Kalavri <[hidden email]> wrote:
> [...] |
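As a rough illustration of the suggested check, here is a toy model of global properties. It shows why a forwarded-field annotation on the join matters: the hash partitioning stays "known" only if the key field passes through unchanged. `GlobalProps`, `Partitioning`, `afterOperator` and `isHashPartitionedOn` are hypothetical names made up for this sketch, not Flink's API, and the propagation rule is simplified (no field remapping).

```java
import java.util.Arrays;

/** Toy model of a plan node's global properties. Hypothetical names, not Flink classes. */
class GlobalPropertiesCheck {

    enum Partitioning { RANDOM, HASH_PARTITIONED }

    /** What is known about the data distribution at some point in the plan. */
    static final class GlobalProps {
        final Partitioning partitioning;
        final int[] keyFields;

        GlobalProps(Partitioning partitioning, int... keyFields) {
            this.partitioning = partitioning;
            this.keyFields = keyFields.clone();
        }
    }

    /** An operator keeps a partitioning only if every key field is forwarded unchanged. */
    static GlobalProps afterOperator(GlobalProps in, int... forwardedFields) {
        for (int key : in.keyFields) {
            boolean forwarded = false;
            for (int f : forwardedFields) {
                if (f == key) {
                    forwarded = true;
                }
            }
            if (!forwarded) {
                // The key may have been modified, so the partitioning is no longer known.
                return new GlobalProps(Partitioning.RANDOM);
            }
        }
        return in;
    }

    /** The assertion a test would make at the very end of the iteration. */
    static boolean isHashPartitionedOn(GlobalProps props, int... expectedKeys) {
        return props.partitioning == Partitioning.HASH_PARTITIONED
                && Arrays.equals(props.keyFields, expectedKeys);
    }

    public static void main(String[] args) {
        GlobalProps beforeJoin = new GlobalProps(Partitioning.HASH_PARTITIONED, 0);
        GlobalProps withAnnotation = afterOperator(beforeJoin, 0);
        GlobalProps withoutAnnotation = afterOperator(beforeJoin);

        System.out.println("join forwards field 0: " + isHashPartitionedOn(withAnnotation, 0));
        System.out.println("join forwards nothing: " + isHashPartitionedOn(withoutAnnotation, 0));
    }
}
```

In an actual compiler test the properties would come from the plan node that follows the "Rebuild Workset Properties" operator rather than being built by hand.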
