Hi there,
I'm working on getting a complex Scalding flow running on top of Flink. I've gotten pretty far along with Cyrille Chépélov's Cascading 3 branch. However, the distributed cache support in Scalding assumes Hadoop, and our jobs use it heavily, so I've been working on adding that support.

I succeeded in at least setting the files by proxying between the Scalding/Hadoop config and the Flink Config and back again. Where I'm stuck is getting the files back out inside the running flow: I just cannot figure out how to get an instance of org.apache.flink.api.common.cache.DistributedCache so I can call getFile(), or, for that matter, how to get a RuntimeContext inside the Cascading/Scalding job. I've spent a bunch of time poring over the cascading-flink code and I still can't figure it out.

Can somebody point me in the right direction? I will post the code as a Scalding PR once I have it working.

Thanks,
-M

--
Milosz Tanski
CTO
16 East 34th Street, 15th floor
New York, NY 10016

p: 646-253-9055
e: [hidden email]
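A rough sketch of that config-proxying step (simplified: `toFlinkCacheEntries` is a made-up helper name, and in real code the resulting entries would be fed to Flink's `ExecutionEnvironment.registerCachedFile(path, name)`; the Hadoop key is the standard `mapreduce.job.cache.files`):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CacheProxySketch {
    // Parse the Hadoop-style cache-file list ("uri#symlink,uri,...") into
    // (name -> uri) pairs suitable for Flink's registerCachedFile(uri, name).
    static Map<String, String> toFlinkCacheEntries(String cacheFiles) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String entry : cacheFiles.split(",")) {
            // Hadoop allows a "uri#symlink" fragment; use the fragment as the
            // Flink name, falling back to the URI's last path segment.
            String[] parts = entry.split("#", 2);
            String name = parts.length == 2
                ? parts[1]
                : parts[0].substring(parts[0].lastIndexOf('/') + 1);
            entries.put(name, parts[0]);
        }
        return entries;
    }

    public static void main(String[] args) {
        // Illustrative value as Scalding would see it in the Hadoop config.
        String cacheFiles = "hdfs:///data/lookup.dat#lookup.dat,hdfs:///data/geo.db";
        System.out.println(toFlinkCacheEntries(cacheFiles));
        // → {lookup.dat=hdfs:///data/lookup.dat, geo.db=hdfs:///data/geo.db}
    }
}
```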
Hi Milosz,
The RuntimeContext of a Flink operator is a member of FlinkFlowProcess. You could try adding a getRuntimeContext() method to FlinkFlowProcess and casting the FlowProcess parameter in your user code to FlinkFlowProcess. Of course, this change would tie the program to Flink as the execution engine.

As far as I know, Cascading does not provide an interface for the distributed cache that would allow for a back-end-transparent implementation.

Please let me know if you have further questions.

Best, Fabian

2016-01-20 14:52 GMT+01:00 Milosz Tanski <[hidden email]>:
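In outline, the cast looks like this. The classes below are stub stand-ins, only to show the shape of the change: the real FlowProcess lives in cascading-core, the real RuntimeContext in flink-core (where getDistributedCache().getFile(name) returns a java.io.File), and the proposed getRuntimeContext() accessor does not exist in cascading-flink today.

```java
// Stub for org.apache.flink.api.common.functions.RuntimeContext.
class RuntimeContext {
    // In real Flink: getDistributedCache().getFile(name) returns a File.
    String getFile(String name) { return "/tmp/flink-cache/" + name; }
}

// Stub for cascading.flow.FlowProcess, the engine-neutral base type.
class FlowProcess { }

class FlinkFlowProcess extends FlowProcess {
    private final RuntimeContext ctx;
    FlinkFlowProcess(RuntimeContext ctx) { this.ctx = ctx; }
    // The proposed accessor: expose the operator's RuntimeContext.
    RuntimeContext getRuntimeContext() { return ctx; }
}

public class CastSketch {
    // User code receives the engine-neutral FlowProcess; the downcast is
    // what ties the job to Flink as the execution engine.
    static String cachedPath(FlowProcess fp, String name) {
        if (!(fp instanceof FlinkFlowProcess)) {
            throw new IllegalStateException("not running on Flink");
        }
        return ((FlinkFlowProcess) fp).getRuntimeContext().getFile(name);
    }

    public static void main(String[] args) {
        FlowProcess fp = new FlinkFlowProcess(new RuntimeContext());
        System.out.println(cachedPath(fp, "lookup.dat"));
        // → /tmp/flink-cache/lookup.dat
    }
}
```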
Fabian,
It's true that Cascading doesn't have native distributed cache support, but Scalding does: https://github.com/twitter/scalding/blob/develop/scalding-core/src/main/scala/com/twitter/scalding/filecache/DistributedCacheFile.scala

It already abstracts over local and Hadoop mode. I'm trying to add Flink distributed cache support there, so it just works once the cascading3 branch gets merged.

I'm going to have to give some thought to how to plumb what you suggested from within the running operation to get access to FlinkFlowProcess -> RuntimeContext -> DistributedCache.

- M

On Thu, Jan 21, 2016 at 5:12 AM, Fabian Hueske <[hidden email]> wrote:
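A first sketch of how that per-backend plumbing might look. All names here are hypothetical (Scalding's real DistributedCacheFile API is Scala and shaped differently); the point is only the dispatch: local mode resolves against the filesystem directly, while a Flink backend would delegate to runtimeContext.getDistributedCache().getFile(name).

```java
import java.io.File;
import java.util.function.Function;

public class CacheBackendSketch {
    // Hypothetical backend interface: resolve a cached file's local path.
    interface CacheBackend {
        File resolve(String name);
    }

    // Local mode: the "cached" file is just a path under some base dir.
    static CacheBackend local(File baseDir) {
        return name -> new File(baseDir, name);
    }

    // Flink mode: the supplied function stands in for
    // runtimeContext.getDistributedCache()::getFile.
    static CacheBackend flink(Function<String, File> getFile) {
        return getFile::apply;
    }

    public static void main(String[] args) {
        CacheBackend backend = local(new File("/data"));
        System.out.println(backend.resolve("lookup.dat"));
        // → /data/lookup.dat
    }
}
```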