Hi Flink Developers
I am sending this email to let you know about XGBoost4J, a package that we are planning to announce next week . Here is the draft version of the post https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md In short, XGBoost is a machine learning package that is used by more than half of the machine challenge winning solutions and is already widely used in industry. The distributed version scale to billion examples(10x faster than spark.mllib in the experiment) with fewer resources (see . http://arxiv.org/abs/1603.02754) We are interested in putting distributed XGBoost into all Dataflow platforms include Flink. This does not mean we re-implement it on Flink. But instead we build a portable API that has a communication library, and being able to run on different DataFlow programs. We hope this can benefit the Flink users, to enable them to get access to one of the state-of-art machine learning algorithm. I am sending this email to the mail-list to let you know about it, and hoping to get some contributors to help improving the XGBoost Flink API to be more compatible with current FlinkML stack. We also hope to get some support from the system side, to enable some abstraction needed in XGBoost for using multiple threads within even one slot for maximum performance. Let us know about your thoughts. Cheers Tianqi |
This is a really interesting approach. The idea of a ML library over
DataFlow is probably a winning move and I hope it will stop the proliferation of worthless reimplementation that is taking place in the big data world. Do you think that DataFlow posed specific problems to your work? Does it missing something that you had to fill in with your work? Here at RadicalBit we are interested both in DataFlow/Apache Beam and in distributed ML and your approach to us look the best and I hope more and more teams follow your example, maybe integrating existing libraries like H2O with DataFlow. Keep us updated if you plan to develop other algorithms. 2016-03-11 21:32 GMT+01:00 Tianqi Chen <[hidden email]>: > Hi Flink Developers > I am sending this email to let you know about XGBoost4J, a package that > we are planning to announce next week . Here is the draft version of the > post > https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md > > In short, XGBoost is a machine learning package that is used by more > than half of the machine challenge winning solutions and is already widely > used in industry. The distributed version scale to billion examples(10x > faster than spark.mllib in the experiment) with fewer resources (see . > http://arxiv.org/abs/1603.02754) > > We are interested in putting distributed XGBoost into all Dataflow > platforms include Flink. This does not mean we re-implement it on Flink. > But instead we build a portable API that has a communication library, and > being able to run on different DataFlow programs. > > We hope this can benefit the Flink users, to enable them to get access > to one of the state-of-art machine learning algorithm. I am sending this > email to the mail-list to let you know about it, and hoping to get some > contributors to help improving the XGBoost Flink API to be more compatible > with current FlinkML stack. We also hope to get some support from the > system side, to enable some abstraction needed in XGBoost for using > multiple threads within even one slot for maximum performance. > > > Let us know about your thoughts. > > Cheers > > Tianqi > |
Hello Tianqui,
Yes that definitely sounds interesting for us and we are looking forward to help out with the implementation. Regards, Theodore -- Sent from a mobile device. May contain autocorrect errors. On Mar 12, 2016 11:29 AM, "Simone Robutti" <[hidden email]> wrote: > This is a really interesting approach. The idea of a ML library over > DataFlow is probably a winning move and I hope it will stop the > proliferation of worthless reimplementation that is taking place in the big > data world. Do you think that DataFlow posed specific problems to your > work? Does it missing something that you had to fill in with your work? > > Here at RadicalBit we are interested both in DataFlow/Apache Beam and in > distributed ML and your approach to us look the best and I hope more and > more teams follow your example, maybe integrating existing libraries like > H2O with DataFlow. > > Keep us updated if you plan to develop other algorithms. > > 2016-03-11 21:32 GMT+01:00 Tianqi Chen <[hidden email]>: > > > Hi Flink Developers > > I am sending this email to let you know about XGBoost4J, a package > that > > we are planning to announce next week . Here is the draft version of the > > post > > https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md > > > > In short, XGBoost is a machine learning package that is used by more > > than half of the machine challenge winning solutions and is already > widely > > used in industry. The distributed version scale to billion examples(10x > > faster than spark.mllib in the experiment) with fewer resources (see . > > http://arxiv.org/abs/1603.02754) > > > > We are interested in putting distributed XGBoost into all Dataflow > > platforms include Flink. This does not mean we re-implement it on Flink. > > But instead we build a portable API that has a communication library, and > > being able to run on different DataFlow programs. > > > > We hope this can benefit the Flink users, to enable them to get > access > > to one of the state-of-art machine learning algorithm. I am sending this > > email to the mail-list to let you know about it, and hoping to get some > > contributors to help improving the XGBoost Flink API to be more > compatible > > with current FlinkML stack. We also hope to get some support from the > > system side, to enable some abstraction needed in XGBoost for using > > multiple threads within even one slot for maximum performance. > > > > > > Let us know about your thoughts. > > > > Cheers > > > > Tianqi > > > |
Thanks for the reply. I am writing a long email to give the answers to
Simone and clarifies what we do I want to mention that *you can use the library already in Flink*. See Flink example here: https://github.com/dmlc/xgboost/tree/master/jvm-packages#xgboost-flink I have not run pressure test on top of Flink, but we did the pressure test thing on Spark and it is consistent with our standalone version on a 100M example dataset, gives around 10x over mllib's version. I assume same thing holds for Flink as well. So if you are interested, please try it out. I imagine this can be very useful to have a Flink demo that can directly give competitive result on say a kaggle competition, and can attract more users to Flink community as well. *The Internal Details* Here at dmlc.ml , we are building libraries that we dive deep and aim for the best performance and flexibility. We build our own abstractions when needed, for example, XGBoost relies on Allreduce, and MXNet, another well known deep learning project, relies on parameter server abstraction. We tried to make these abstractions portable, so they are not stand-alone C++ programs, but can be used as library in other languages, e.g. scala. So essentially, here is what is needed: - Start a tracker on driver side to connect the workers together to use our version of communication library (this can be swapped depending on level of integration, if communication API is provided natively by the platform). - An API to start concurrent jobs(containers), that can execute a function (either worker or server). - Gettng accessed to partition of data in each worker. Take XGBoost for example, what we do in Flink is to as a MapPartition stage, and treat each slot as an worker. Each worker then collaboratively solve the machine learning problem, and return the trained model. *What is needed from DataFlow * Dataflow is a nice abstraction for data processing. As you can see, the approach we take is somewhat more low level, I would call it developer API. Since the requirement is basically to start the worker other types of containers that runs a scala function from driver side. MapPartition works well for the XGBoost case, but here are what can be improved: - Being able to specify resources of the slot at the ML stage, for example, xgboost as well as deep learning program can benefit from using multiple cores in each worker. While currently mapPartition uses one core for each Parititon. - Being able to launch container that does not take data, for example parameter server instance. This is mainly needed for the deep learning program. *Why not implement them using DataFlow?* One thing I can expect people to argue is why not directly use (multiple) data-flow stages to implement these algorithm. This is a possible approach, here are the reasons why - Most work in advanced ML algorithm is actually the machine learning part, and add a bit communication into it. So directly using communication library inside ML code allows easier migration from optimized single machine version to distributed one. - Not all dataflow executors are alike, for machine learning usually benefit from persistent program state (which Flink have but not spark), and we want to be invariant of such difference. Dataflow was originally designed for data process, and I do feel sometimes other abstraction fits machine learning well. The idea of embedding the ML's abstraction into one stage of dataflow allow us to take benefit from the flexible data processing phase, and also use the best learning algorithm. *Fault Tolerance?* Most algorithm we have assumes a fail-restart scheme from the host platform, which means we will rely on system to restart the failed jobs somewhere, and provide the same input data. Then internally the communication library will kick in and try to recover, usually via some checkpoint. Of course if there is checkpoint feature from the host, this can also be used. *More Machine Learning Algorithms?* XGBoost is part of DMLC http://dmlc.ml project. Our goal is *not to *develop general library that covers all algorithms, like FlinkML. Instead, we pick all the most important ones which are used in production pipeline, and build *deeply optimized for each specific one as a package.* Of course there are also shared components like communication library and duplicated effort among the libraries are shared. I believe we covered most things people need, plus some simple ones that can be directly implemented in FlinkML(Kmeans, linear model). One thing that could be interesting to try next is MXNet https://github.com/dmlc/mxnet/tree/master/scala-package, which is a full fledge deep learning library that comes with all the features you need as well as a Scala Binding. However, we do need a bit more things that I mentioned in the requirement section. *What Help we can get from Flink Community * I will list the points that are clear and actionable here: *- *Improve xgboost-Flink API so that it is consistent with current FlinkML pipeline - Provide some "developer API" that allows perf improvement as I mentioned in "What is needed from DataFlow" - Support abstraction needed for MXNet, and enable *streaming, GPU-enabled distributed deep learning on Flink* - Main obstacle will be the "developer API" While some of these effort seems to be a lot to port specific machine learning library. Enable them basically enable port all machine learning libraries we build and we will be building using these abstractions. Tianqi On Sat, Mar 12, 2016 at 4:51 AM, Theodore Vasiloudis < [hidden email]> wrote: > Hello Tianqui, > > Yes that definitely sounds interesting for us and we are looking forward to > help out with the implementation. > > Regards, > Theodore > -- > Sent from a mobile device. May contain autocorrect errors. > On Mar 12, 2016 11:29 AM, "Simone Robutti" <[hidden email]> > wrote: > > > This is a really interesting approach. The idea of a ML library over > > DataFlow is probably a winning move and I hope it will stop the > > proliferation of worthless reimplementation that is taking place in the > big > > data world. Do you think that DataFlow posed specific problems to your > > work? Does it missing something that you had to fill in with your work? > > > > Here at RadicalBit we are interested both in DataFlow/Apache Beam and in > > distributed ML and your approach to us look the best and I hope more and > > more teams follow your example, maybe integrating existing libraries like > > H2O with DataFlow. > > > > Keep us updated if you plan to develop other algorithms. > > > > 2016-03-11 21:32 GMT+01:00 Tianqi Chen <[hidden email]>: > > > > > Hi Flink Developers > > > I am sending this email to let you know about XGBoost4J, a package > > that > > > we are planning to announce next week . Here is the draft version of > the > > > post > > > https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md > > > > > > In short, XGBoost is a machine learning package that is used by > more > > > than half of the machine challenge winning solutions and is already > > widely > > > used in industry. The distributed version scale to billion examples(10x > > > faster than spark.mllib in the experiment) with fewer resources (see . > > > http://arxiv.org/abs/1603.02754) > > > > > > We are interested in putting distributed XGBoost into all Dataflow > > > platforms include Flink. This does not mean we re-implement it on > Flink. > > > But instead we build a portable API that has a communication library, > and > > > being able to run on different DataFlow programs. > > > > > > We hope this can benefit the Flink users, to enable them to get > > access > > > to one of the state-of-art machine learning algorithm. I am sending > this > > > email to the mail-list to let you know about it, and hoping to get some > > > contributors to help improving the XGBoost Flink API to be more > > compatible > > > with current FlinkML stack. We also hope to get some support from the > > > system side, to enable some abstraction needed in XGBoost for using > > > multiple threads within even one slot for maximum performance. > > > > > > > > > Let us know about your thoughts. > > > > > > Cheers > > > > > > Tianqi > > > > > > |
Thanks for the insight, what you're doing is really interesting. I will
definitely spend some time looking at DMLC and MXNet. 2016-03-12 18:35 GMT+01:00 Tianqi Chen <[hidden email]>: > Thanks for the reply. I am writing a long email to give the answers to > Simone and clarifies what we do > > I want to mention that *you can use the library already in Flink*. See > Flink example here: > https://github.com/dmlc/xgboost/tree/master/jvm-packages#xgboost-flink > > I have not run pressure test on top of Flink, but we did the pressure test > thing on Spark and it is consistent with our standalone version on a 100M > example dataset, gives around 10x over mllib's version. I assume same thing > holds for Flink as well. > So if you are interested, please try it out. I imagine this can be very > useful to have a Flink demo that can directly give competitive result on > say a kaggle competition, and can attract more users to Flink community as > well. > > *The Internal Details* > Here at dmlc.ml , we are building libraries that we dive deep and aim > for the best performance and flexibility. We build our own abstractions > when needed, for example, XGBoost relies on Allreduce, and MXNet, another > well known deep learning project, relies on parameter server abstraction. > We tried to make these abstractions portable, so they are not stand-alone > C++ programs, but can be used as library in other languages, e.g. scala. > So essentially, here is what is needed: > - Start a tracker on driver side to connect the workers together to use our > version of communication library (this can be swapped depending on level of > integration, if communication API is provided natively by the platform). > - An API to start concurrent jobs(containers), that can execute a function > (either worker or server). > - Gettng accessed to partition of data in each worker. > > Take XGBoost for example, what we do in Flink is to as a MapPartition > stage, and treat each slot as an worker. Each worker then collaboratively > solve the machine learning problem, and return the trained model. > > *What is needed from DataFlow * > Dataflow is a nice abstraction for data processing. As you can see, the > approach we take is somewhat more low level, I would call it developer > API. Since the requirement is basically to start the worker other types of > containers that runs a scala function from driver side. MapPartition works > well for the XGBoost case, but here are what can be improved: > - Being able to specify resources of the slot at the ML stage, for example, > xgboost as well as deep learning program can benefit from using multiple > cores in each worker. While currently mapPartition uses one core for each > Parititon. > - Being able to launch container that does not take data, for example > parameter server instance. This is mainly needed for the deep learning > program. > > > *Why not implement them using DataFlow?* > One thing I can expect people to argue is why not directly use (multiple) > data-flow stages to implement these algorithm. This is a possible approach, > here are the reasons why > - Most work in advanced ML algorithm is actually the machine learning part, > and add a bit communication into it. So directly using communication > library inside ML code allows easier migration from optimized single > machine version to distributed one. > - Not all dataflow executors are alike, for machine learning usually > benefit from persistent program state (which Flink have but not spark), and > we want to be invariant of such difference. > Dataflow was originally designed for data process, and I do feel sometimes > other abstraction fits machine learning well. The idea of embedding the > ML's abstraction into one stage of dataflow allow us to take benefit from > the flexible data processing phase, and also use the best learning > algorithm. > > *Fault Tolerance?* > Most algorithm we have assumes a fail-restart scheme from the host > platform, which means we will rely on system to restart the failed jobs > somewhere, and provide the same input data. Then internally the > communication library will kick in and try to recover, usually via some > checkpoint. Of course if there is checkpoint feature from the host, this > can also be used. > > > > *More Machine Learning Algorithms?* > XGBoost is part of DMLC http://dmlc.ml project. Our goal is *not to > *develop > general library that covers all algorithms, like FlinkML. Instead, we pick > all the most important ones which are used in production pipeline, and > build *deeply optimized for each specific one as a package.* Of course > there are also shared components like communication library and duplicated > effort among the libraries are shared. I believe we covered most things > people need, plus some simple ones that can be directly implemented in > FlinkML(Kmeans, linear model). > > One thing that could be interesting to try next is MXNet > https://github.com/dmlc/mxnet/tree/master/scala-package, which is a full > fledge deep learning library that comes with all the features you need as > well as a Scala Binding. However, > we do need a bit more things that I mentioned in the requirement section. > > > *What Help we can get from Flink Community * > I will list the points that are clear and actionable here: > > *- *Improve xgboost-Flink API so that it is consistent with current FlinkML > pipeline > - Provide some "developer API" that allows perf improvement as I mentioned > in "What is needed from DataFlow" > - Support abstraction needed for MXNet, and enable *streaming, GPU-enabled > distributed deep learning on Flink* > - Main obstacle will be the "developer API" > > While some of these effort seems to be a lot to port specific machine > learning library. Enable them basically enable port all machine learning > libraries we build and we will be building using these abstractions. > > Tianqi > > > On Sat, Mar 12, 2016 at 4:51 AM, Theodore Vasiloudis < > [hidden email]> wrote: > > > Hello Tianqui, > > > > Yes that definitely sounds interesting for us and we are looking forward > to > > help out with the implementation. > > > > Regards, > > Theodore > > -- > > Sent from a mobile device. May contain autocorrect errors. > > On Mar 12, 2016 11:29 AM, "Simone Robutti" <[hidden email] > > > > wrote: > > > > > This is a really interesting approach. The idea of a ML library over > > > DataFlow is probably a winning move and I hope it will stop the > > > proliferation of worthless reimplementation that is taking place in the > > big > > > data world. Do you think that DataFlow posed specific problems to your > > > work? Does it missing something that you had to fill in with your work? > > > > > > Here at RadicalBit we are interested both in DataFlow/Apache Beam and > in > > > distributed ML and your approach to us look the best and I hope more > and > > > more teams follow your example, maybe integrating existing libraries > like > > > H2O with DataFlow. > > > > > > Keep us updated if you plan to develop other algorithms. > > > > > > 2016-03-11 21:32 GMT+01:00 Tianqi Chen <[hidden email]>: > > > > > > > Hi Flink Developers > > > > I am sending this email to let you know about XGBoost4J, a > package > > > that > > > > we are planning to announce next week . Here is the draft version of > > the > > > > post > > > > > https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md > > > > > > > > In short, XGBoost is a machine learning package that is used by > > more > > > > than half of the machine challenge winning solutions and is already > > > widely > > > > used in industry. The distributed version scale to billion > examples(10x > > > > faster than spark.mllib in the experiment) with fewer resources (see > . > > > > http://arxiv.org/abs/1603.02754) > > > > > > > > We are interested in putting distributed XGBoost into all > Dataflow > > > > platforms include Flink. This does not mean we re-implement it on > > Flink. > > > > But instead we build a portable API that has a communication library, > > and > > > > being able to run on different DataFlow programs. > > > > > > > > We hope this can benefit the Flink users, to enable them to get > > > access > > > > to one of the state-of-art machine learning algorithm. I am sending > > this > > > > email to the mail-list to let you know about it, and hoping to get > some > > > > contributors to help improving the XGBoost Flink API to be more > > > compatible > > > > with current FlinkML stack. We also hope to get some support from > the > > > > system side, to enable some abstraction needed in XGBoost for using > > > > multiple threads within even one slot for maximum performance. > > > > > > > > > > > > Let us know about your thoughts. > > > > > > > > Cheers > > > > > > > > Tianqi > > > > > > > > > > |
Hi Tianqi,
dmlc looks really cool and it would be great to integrate it with Flink. As far as I understood your requirements, I think that you can already implement most of it on Flink. For example, starting a special container which does not receive any input could be a specialized SourceOperator. In this SourceOperator you could then start a parameter server which receives the partial updates of the processing tasks. You're right that each task is executed by its own thread. However, you always can spawn new threads from within your user function. This should allow you to run the ML code multi-threaded. But then it might be advisable to lower the number of slots a bit so that you have more cores available than slots defined. The integration of the dmlc API with the existing ML pipelines should not be too hard. As long as one has access to the resulting data set it should be easy to plug it into another predictor/estimator instance. I guess we would mainly need some tooling around it. Looking forward running xgboost and mxnet with Flink :-) Cheers, Till On Sat, Mar 12, 2016 at 7:17 PM, Simone Robutti < [hidden email]> wrote: > Thanks for the insight, what you're doing is really interesting. I will > definitely spend some time looking at DMLC and MXNet. > > 2016-03-12 18:35 GMT+01:00 Tianqi Chen <[hidden email]>: > > > Thanks for the reply. I am writing a long email to give the answers to > > Simone and clarifies what we do > > > > I want to mention that *you can use the library already in Flink*. See > > Flink example here: > > https://github.com/dmlc/xgboost/tree/master/jvm-packages#xgboost-flink > > > > I have not run pressure test on top of Flink, but we did the pressure > test > > thing on Spark and it is consistent with our standalone version on a 100M > > example dataset, gives around 10x over mllib's version. I assume same > thing > > holds for Flink as well. > > So if you are interested, please try it out. I imagine this can be very > > useful to have a Flink demo that can directly give competitive result on > > say a kaggle competition, and can attract more users to Flink community > as > > well. > > > > *The Internal Details* > > Here at dmlc.ml , we are building libraries that we dive deep and aim > > for the best performance and flexibility. We build our own abstractions > > when needed, for example, XGBoost relies on Allreduce, and MXNet, another > > well known deep learning project, relies on parameter server abstraction. > > We tried to make these abstractions portable, so they are not stand-alone > > C++ programs, but can be used as library in other languages, e.g. scala. > > So essentially, here is what is needed: > > - Start a tracker on driver side to connect the workers together to use > our > > version of communication library (this can be swapped depending on level > of > > integration, if communication API is provided natively by the platform). > > - An API to start concurrent jobs(containers), that can execute a > function > > (either worker or server). > > - Gettng accessed to partition of data in each worker. > > > > Take XGBoost for example, what we do in Flink is to as a MapPartition > > stage, and treat each slot as an worker. Each worker then > collaboratively > > solve the machine learning problem, and return the trained model. > > > > *What is needed from DataFlow * > > Dataflow is a nice abstraction for data processing. As you can see, the > > approach we take is somewhat more low level, I would call it developer > > API. Since the requirement is basically to start the worker other types > of > > containers that runs a scala function from driver side. MapPartition > works > > well for the XGBoost case, but here are what can be improved: > > - Being able to specify resources of the slot at the ML stage, for > example, > > xgboost as well as deep learning program can benefit from using multiple > > cores in each worker. While currently mapPartition uses one core for each > > Parititon. > > - Being able to launch container that does not take data, for example > > parameter server instance. This is mainly needed for the deep learning > > program. > > > > > > *Why not implement them using DataFlow?* > > One thing I can expect people to argue is why not directly use (multiple) > > data-flow stages to implement these algorithm. This is a possible > approach, > > here are the reasons why > > - Most work in advanced ML algorithm is actually the machine learning > part, > > and add a bit communication into it. So directly using communication > > library inside ML code allows easier migration from optimized single > > machine version to distributed one. > > - Not all dataflow executors are alike, for machine learning usually > > benefit from persistent program state (which Flink have but not spark), > and > > we want to be invariant of such difference. > > Dataflow was originally designed for data process, and I do feel > sometimes > > other abstraction fits machine learning well. The idea of embedding the > > ML's abstraction into one stage of dataflow allow us to take benefit from > > the flexible data processing phase, and also use the best learning > > algorithm. > > > > *Fault Tolerance?* > > Most algorithm we have assumes a fail-restart scheme from the host > > platform, which means we will rely on system to restart the failed jobs > > somewhere, and provide the same input data. Then internally the > > communication library will kick in and try to recover, usually via some > > checkpoint. Of course if there is checkpoint feature from the host, this > > can also be used. > > > > > > > > *More Machine Learning Algorithms?* > > XGBoost is part of DMLC http://dmlc.ml project. Our goal is *not to > > *develop > > general library that covers all algorithms, like FlinkML. Instead, we > pick > > all the most important ones which are used in production pipeline, and > > build *deeply optimized for each specific one as a package.* Of course > > there are also shared components like communication library and > duplicated > > effort among the libraries are shared. I believe we covered most things > > people need, plus some simple ones that can be directly implemented in > > FlinkML(Kmeans, linear model). > > > > One thing that could be interesting to try next is MXNet > > https://github.com/dmlc/mxnet/tree/master/scala-package, which is a full > > fledge deep learning library that comes with all the features you need as > > well as a Scala Binding. However, > > we do need a bit more things that I mentioned in the requirement section. > > > > > > *What Help we can get from Flink Community * > > I will list the points that are clear and actionable here: > > > > *- *Improve xgboost-Flink API so that it is consistent with current > FlinkML > > pipeline > > - Provide some "developer API" that allows perf improvement as I > mentioned > > in "What is needed from DataFlow" > > - Support abstraction needed for MXNet, and enable *streaming, > GPU-enabled > > distributed deep learning on Flink* > > - Main obstacle will be the "developer API" > > > > While some of these effort seems to be a lot to port specific machine > > learning library. Enable them basically enable port all machine learning > > libraries we build and we will be building using these abstractions. > > > > Tianqi > > > > > > On Sat, Mar 12, 2016 at 4:51 AM, Theodore Vasiloudis < > > [hidden email]> wrote: > > > > > Hello Tianqui, > > > > > > Yes that definitely sounds interesting for us and we are looking > forward > > to > > > help out with the implementation. > > > > > > Regards, > > > Theodore > > > -- > > > Sent from a mobile device. May contain autocorrect errors. > > > On Mar 12, 2016 11:29 AM, "Simone Robutti" < > [hidden email] > > > > > > wrote: > > > > > > > This is a really interesting approach. The idea of a ML library over > > > > DataFlow is probably a winning move and I hope it will stop the > > > > proliferation of worthless reimplementation that is taking place in > the > > > big > > > > data world. Do you think that DataFlow posed specific problems to > your > > > > work? Does it missing something that you had to fill in with your > work? > > > > > > > > Here at RadicalBit we are interested both in DataFlow/Apache Beam and > > in > > > > distributed ML and your approach to us look the best and I hope more > > and > > > > more teams follow your example, maybe integrating existing libraries > > like > > > > H2O with DataFlow. > > > > > > > > Keep us updated if you plan to develop other algorithms. > > > > > > > > 2016-03-11 21:32 GMT+01:00 Tianqi Chen <[hidden email]>: > > > > > > > > > Hi Flink Developers > > > > > I am sending this email to let you know about XGBoost4J, a > > package > > > > that > > > > > we are planning to announce next week . Here is the draft version > of > > > the > > > > > post > > > > > > > https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md > > > > > > > > > > In short, XGBoost is a machine learning package that is used by > > > more > > > > > than half of the machine challenge winning solutions and is already > > > > widely > > > > > used in industry. The distributed version scale to billion > > examples(10x > > > > > faster than spark.mllib in the experiment) with fewer resources > (see > > . > > > > > http://arxiv.org/abs/1603.02754) > > > > > > > > > > We are interested in putting distributed XGBoost into all > > Dataflow > > > > > platforms include Flink. This does not mean we re-implement it on > > > Flink. > > > > > But instead we build a portable API that has a communication > library, > > > and > > > > > being able to run on different DataFlow programs. > > > > > > > > > > We hope this can benefit the Flink users, to enable them to get > > > > access > > > > > to one of the state-of-art machine learning algorithm. I am sending > > > this > > > > > email to the mail-list to let you know about it, and hoping to get > > some > > > > > contributors to help improving the XGBoost Flink API to be more > > > > compatible > > > > > with current FlinkML stack. We also hope to get some support from > > the > > > > > system side, to enable some abstraction needed in XGBoost for using > > > > > multiple threads within even one slot for maximum performance. > > > > > > > > > > > > > > > Let us know about your thoughts. > > > > > > > > > > Cheers > > > > > > > > > > Tianqi > > > > > > > > > > > > > > > |
Thanks! I am not aware of SrcOperator before. Then yes things can be done.
About multi-threading issue, I am looking for more principled API to specify the resources requirement, e.g. the slots in this stage needs 4 GPU cores and 1 GPU. So the resource allocator can be aware of that. We have published the release announcement http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html And you can try the Flink version out:) On Mon, Mar 14, 2016 at 10:00 AM, Till Rohrmann <[hidden email]> wrote: > Hi Tianqi, > > dmlc looks really cool and it would be great to integrate it with Flink. As > far as I understood your requirements, I think that you can already > implement most of it on Flink. > > For example, starting a special container which does not receive any input > could be a specialized SourceOperator. In this SourceOperator you could > then start a parameter server which receives the partial updates of the > processing tasks. > > You're right that each task is executed by its own thread. However, you > always can spawn new threads from within your user function. This should > allow you to run the ML code multi-threaded. But then it might be advisable > to lower the number of slots a bit so that you have more cores available > than slots defined. > > The integration of the dmlc API with the existing ML pipelines should not > be too hard. As long as one has access to the resulting data set it should > be easy to plug it into another predictor/estimator instance. I guess we > would mainly need some tooling around it. > > Looking forward running xgboost and mxnet with Flink :-) > > Cheers, > Till > > On Sat, Mar 12, 2016 at 7:17 PM, Simone Robutti < > [hidden email]> wrote: > > > Thanks for the insight, what you're doing is really interesting. I will > > definitely spend some time looking at DMLC and MXNet. > > > > 2016-03-12 18:35 GMT+01:00 Tianqi Chen <[hidden email]>: > > > > > Thanks for the reply. I am writing a long email to give the answers to > > > Simone and clarifies what we do > > > > > > I want to mention that *you can use the library already in Flink*. See > > > Flink example here: > > > https://github.com/dmlc/xgboost/tree/master/jvm-packages#xgboost-flink > > > > > > I have not run pressure test on top of Flink, but we did the pressure > > test > > > thing on Spark and it is consistent with our standalone version on a > 100M > > > example dataset, gives around 10x over mllib's version. I assume same > > thing > > > holds for Flink as well. > > > So if you are interested, please try it out. I imagine this can be very > > > useful to have a Flink demo that can directly give competitive result > on > > > say a kaggle competition, and can attract more users to Flink community > > as > > > well. > > > > > > *The Internal Details* > > > Here at dmlc.ml , we are building libraries that we dive deep and > aim > > > for the best performance and flexibility. We build our own abstractions > > > when needed, for example, XGBoost relies on Allreduce, and MXNet, > another > > > well known deep learning project, relies on parameter server > abstraction. > > > We tried to make these abstractions portable, so they are not > stand-alone > > > C++ programs, but can be used as library in other languages, e.g. > scala. > > > So essentially, here is what is needed: > > > - Start a tracker on driver side to connect the workers together to use > > our > > > version of communication library (this can be swapped depending on > level > > of > > > integration, if communication API is provided natively by the > platform). > > > - An API to start concurrent jobs(containers), that can execute a > > function > > > (either worker or server). > > > - Gettng accessed to partition of data in each worker. > > > > > > Take XGBoost for example, what we do in Flink is to as a MapPartition > > > stage, and treat each slot as an worker. Each worker then > > collaboratively > > > solve the machine learning problem, and return the trained model. > > > > > > *What is needed from DataFlow * > > > Dataflow is a nice abstraction for data processing. As you can see, the > > > approach we take is somewhat more low level, I would call it developer > > > API. Since the requirement is basically to start the worker other > types > > of > > > containers that runs a scala function from driver side. MapPartition > > works > > > well for the XGBoost case, but here are what can be improved: > > > - Being able to specify resources of the slot at the ML stage, for > > example, > > > xgboost as well as deep learning program can benefit from using > multiple > > > cores in each worker. While currently mapPartition uses one core for > each > > > Parititon. > > > - Being able to launch container that does not take data, for example > > > parameter server instance. This is mainly needed for the deep learning > > > program. > > > > > > > > > *Why not implement them using DataFlow?* > > > One thing I can expect people to argue is why not directly use > (multiple) > > > data-flow stages to implement these algorithm. This is a possible > > approach, > > > here are the reasons why > > > - Most work in advanced ML algorithm is actually the machine learning > > part, > > > and add a bit communication into it. So directly using communication > > > library inside ML code allows easier migration from optimized single > > > machine version to distributed one. > > > - Not all dataflow executors are alike, for machine learning usually > > > benefit from persistent program state (which Flink have but not spark), > > and > > > we want to be invariant of such difference. > > > Dataflow was originally designed for data process, and I do feel > > sometimes > > > other abstraction fits machine learning well. The idea of embedding the > > > ML's abstraction into one stage of dataflow allow us to take benefit > from > > > the flexible data processing phase, and also use the best learning > > > algorithm. > > > > > > *Fault Tolerance?* > > > Most algorithm we have assumes a fail-restart scheme from the host > > > platform, which means we will rely on system to restart the failed jobs > > > somewhere, and provide the same input data. Then internally the > > > communication library will kick in and try to recover, usually via some > > > checkpoint. Of course if there is checkpoint feature from the host, > this > > > can also be used. > > > > > > > > > > > > *More Machine Learning Algorithms?* > > > XGBoost is part of DMLC http://dmlc.ml project. Our goal is *not to > > > *develop > > > general library that covers all algorithms, like FlinkML. Instead, we > > pick > > > all the most important ones which are used in production pipeline, and > > > build *deeply optimized for each specific one as a package.* Of course > > > there are also shared components like communication library and > > duplicated > > > effort among the libraries are shared. I believe we covered most things > > > people need, plus some simple ones that can be directly implemented in > > > FlinkML(Kmeans, linear model). > > > > > > One thing that could be interesting to try next is MXNet > > > https://github.com/dmlc/mxnet/tree/master/scala-package, which is a > full > > > fledge deep learning library that comes with all the features you need > as > > > well as a Scala Binding. However, > > > we do need a bit more things that I mentioned in the requirement > section. > > > > > > > > > *What Help we can get from Flink Community * > > > I will list the points that are clear and actionable here: > > > > > > *- *Improve xgboost-Flink API so that it is consistent with current > > FlinkML > > > pipeline > > > - Provide some "developer API" that allows perf improvement as I > > mentioned > > > in "What is needed from DataFlow" > > > - Support abstraction needed for MXNet, and enable *streaming, > > GPU-enabled > > > distributed deep learning on Flink* > > > - Main obstacle will be the "developer API" > > > > > > While some of these effort seems to be a lot to port specific machine > > > learning library. Enable them basically enable port all machine > learning > > > libraries we build and we will be building using these abstractions. > > > > > > Tianqi > > > > > > > > > On Sat, Mar 12, 2016 at 4:51 AM, Theodore Vasiloudis < > > > [hidden email]> wrote: > > > > > > > Hello Tianqui, > > > > > > > > Yes that definitely sounds interesting for us and we are looking > > forward > > > to > > > > help out with the implementation. > > > > > > > > Regards, > > > > Theodore > > > > -- > > > > Sent from a mobile device. May contain autocorrect errors. > > > > On Mar 12, 2016 11:29 AM, "Simone Robutti" < > > [hidden email] > > > > > > > > wrote: > > > > > > > > > This is a really interesting approach. The idea of a ML library > over > > > > > DataFlow is probably a winning move and I hope it will stop the > > > > > proliferation of worthless reimplementation that is taking place in > > the > > > > big > > > > > data world. Do you think that DataFlow posed specific problems to > > your > > > > > work? Does it missing something that you had to fill in with your > > work? > > > > > > > > > > Here at RadicalBit we are interested both in DataFlow/Apache Beam > and > > > in > > > > > distributed ML and your approach to us look the best and I hope > more > > > and > > > > > more teams follow your example, maybe integrating existing > libraries > > > like > > > > > H2O with DataFlow. > > > > > > > > > > Keep us updated if you plan to develop other algorithms. > > > > > > > > > > 2016-03-11 21:32 GMT+01:00 Tianqi Chen <[hidden email]>: > > > > > > > > > > > Hi Flink Developers > > > > > > I am sending this email to let you know about XGBoost4J, a > > > package > > > > > that > > > > > > we are planning to announce next week . Here is the draft version > > of > > > > the > > > > > > post > > > > > > > > > https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md > > > > > > > > > > > > In short, XGBoost is a machine learning package that is used > by > > > > more > > > > > > than half of the machine challenge winning solutions and is > already > > > > > widely > > > > > > used in industry. The distributed version scale to billion > > > examples(10x > > > > > > faster than spark.mllib in the experiment) with fewer resources > > (see > > > . > > > > > > http://arxiv.org/abs/1603.02754) > > > > > > > > > > > > We are interested in putting distributed XGBoost into all > > > Dataflow > > > > > > platforms include Flink. This does not mean we re-implement it on > > > > Flink. > > > > > > But instead we build a portable API that has a communication > > library, > > > > and > > > > > > being able to run on different DataFlow programs. > > > > > > > > > > > > We hope this can benefit the Flink users, to enable them to > get > > > > > access > > > > > > to one of the state-of-art machine learning algorithm. I am > sending > > > > this > > > > > > email to the mail-list to let you know about it, and hoping to > get > > > some > > > > > > contributors to help improving the XGBoost Flink API to be more > > > > > compatible > > > > > > with current FlinkML stack. We also hope to get some support > from > > > the > > > > > > system side, to enable some abstraction needed in XGBoost for > using > > > > > > multiple threads within even one slot for maximum performance. > > > > > > > > > > > > > > > > > > Let us know about your thoughts. > > > > > > > > > > > > Cheers > > > > > > > > > > > > Tianqi > > > > > > > > > > > > > > > > > > > > > |
Hi Tianqi,
I am also impressed with XGBoost and looking forward to having integration with Apache Flink. As for resource requirements, Apache Flink is using abstraction of slots to express parallel execution in the TaskManager [1] I suppose you were looking for some DSL or specification configuration to set resources such as CPU, memory, and GPU to be set per execution of a task in the TaskManager? - Henry [1] https://ci.apache.org/projects/flink/flink-docs-master/concepts/concepts.html#distributed-execution On Mon, Mar 14, 2016 at 2:39 PM, Tianqi Chen <[hidden email]> wrote: > Thanks! I am not aware of SrcOperator before. Then yes things can be done. > > About multi-threading issue, I am looking for more principled API to > specify the resources requirement, e.g. the slots in this stage needs 4 GPU > cores and 1 GPU. So the resource allocator can be aware of that. > > We have published the release announcement > > http://dmlc.ml/2016/03/14/xgboost4j-portable-distributed-xgboost-in-spark-flink-and-dataflow.html > > And you can try the Flink version out:) > > > > > On Mon, Mar 14, 2016 at 10:00 AM, Till Rohrmann <[hidden email]> > wrote: > > > Hi Tianqi, > > > > dmlc looks really cool and it would be great to integrate it with Flink. > As > > far as I understood your requirements, I think that you can already > > implement most of it on Flink. > > > > For example, starting a special container which does not receive any > input > > could be a specialized SourceOperator. In this SourceOperator you could > > then start a parameter server which receives the partial updates of the > > processing tasks. > > > > You're right that each task is executed by its own thread. However, you > > always can spawn new threads from within your user function. This should > > allow you to run the ML code multi-threaded. But then it might be > advisable > > to lower the number of slots a bit so that you have more cores available > > than slots defined. > > > > The integration of the dmlc API with the existing ML pipelines should not > > be too hard. As long as one has access to the resulting data set it > should > > be easy to plug it into another predictor/estimator instance. I guess we > > would mainly need some tooling around it. > > > > Looking forward running xgboost and mxnet with Flink :-) > > > > Cheers, > > Till > > > > On Sat, Mar 12, 2016 at 7:17 PM, Simone Robutti < > > [hidden email]> wrote: > > > > > Thanks for the insight, what you're doing is really interesting. I will > > > definitely spend some time looking at DMLC and MXNet. > > > > > > 2016-03-12 18:35 GMT+01:00 Tianqi Chen <[hidden email]>: > > > > > > > Thanks for the reply. I am writing a long email to give the answers > to > > > > Simone and clarifies what we do > > > > > > > > I want to mention that *you can use the library already in Flink*. > See > > > > Flink example here: > > > > > https://github.com/dmlc/xgboost/tree/master/jvm-packages#xgboost-flink > > > > > > > > I have not run pressure test on top of Flink, but we did the pressure > > > test > > > > thing on Spark and it is consistent with our standalone version on a > > 100M > > > > example dataset, gives around 10x over mllib's version. I assume same > > > thing > > > > holds for Flink as well. > > > > So if you are interested, please try it out. I imagine this can be > very > > > > useful to have a Flink demo that can directly give competitive result > > on > > > > say a kaggle competition, and can attract more users to Flink > community > > > as > > > > well. > > > > > > > > *The Internal Details* > > > > Here at dmlc.ml , we are building libraries that we dive deep and > > aim > > > > for the best performance and flexibility. We build our own > abstractions > > > > when needed, for example, XGBoost relies on Allreduce, and MXNet, > > another > > > > well known deep learning project, relies on parameter server > > abstraction. > > > > We tried to make these abstractions portable, so they are not > > stand-alone > > > > C++ programs, but can be used as library in other languages, e.g. > > scala. > > > > So essentially, here is what is needed: > > > > - Start a tracker on driver side to connect the workers together to > use > > > our > > > > version of communication library (this can be swapped depending on > > level > > > of > > > > integration, if communication API is provided natively by the > > platform). > > > > - An API to start concurrent jobs(containers), that can execute a > > > function > > > > (either worker or server). > > > > - Gettng accessed to partition of data in each worker. > > > > > > > > Take XGBoost for example, what we do in Flink is to as a MapPartition > > > > stage, and treat each slot as an worker. Each worker then > > > collaboratively > > > > solve the machine learning problem, and return the trained model. > > > > > > > > *What is needed from DataFlow * > > > > Dataflow is a nice abstraction for data processing. As you can see, > the > > > > approach we take is somewhat more low level, I would call it > developer > > > > API. Since the requirement is basically to start the worker other > > types > > > of > > > > containers that runs a scala function from driver side. MapPartition > > > works > > > > well for the XGBoost case, but here are what can be improved: > > > > - Being able to specify resources of the slot at the ML stage, for > > > example, > > > > xgboost as well as deep learning program can benefit from using > > multiple > > > > cores in each worker. While currently mapPartition uses one core for > > each > > > > Parititon. > > > > - Being able to launch container that does not take data, for example > > > > parameter server instance. This is mainly needed for the deep > learning > > > > program. > > > > > > > > > > > > *Why not implement them using DataFlow?* > > > > One thing I can expect people to argue is why not directly use > > (multiple) > > > > data-flow stages to implement these algorithm. This is a possible > > > approach, > > > > here are the reasons why > > > > - Most work in advanced ML algorithm is actually the machine learning > > > part, > > > > and add a bit communication into it. So directly using communication > > > > library inside ML code allows easier migration from optimized single > > > > machine version to distributed one. > > > > - Not all dataflow executors are alike, for machine learning usually > > > > benefit from persistent program state (which Flink have but not > spark), > > > and > > > > we want to be invariant of such difference. > > > > Dataflow was originally designed for data process, and I do feel > > > sometimes > > > > other abstraction fits machine learning well. The idea of embedding > the > > > > ML's abstraction into one stage of dataflow allow us to take benefit > > from > > > > the flexible data processing phase, and also use the best learning > > > > algorithm. > > > > > > > > *Fault Tolerance?* > > > > Most algorithm we have assumes a fail-restart scheme from the host > > > > platform, which means we will rely on system to restart the failed > jobs > > > > somewhere, and provide the same input data. Then internally the > > > > communication library will kick in and try to recover, usually via > some > > > > checkpoint. Of course if there is checkpoint feature from the host, > > this > > > > can also be used. > > > > > > > > > > > > > > > > *More Machine Learning Algorithms?* > > > > XGBoost is part of DMLC http://dmlc.ml project. Our goal is *not to > > > > *develop > > > > general library that covers all algorithms, like FlinkML. Instead, we > > > pick > > > > all the most important ones which are used in production pipeline, > and > > > > build *deeply optimized for each specific one as a package.* Of > course > > > > there are also shared components like communication library and > > > duplicated > > > > effort among the libraries are shared. I believe we covered most > things > > > > people need, plus some simple ones that can be directly implemented > in > > > > FlinkML(Kmeans, linear model). > > > > > > > > One thing that could be interesting to try next is MXNet > > > > https://github.com/dmlc/mxnet/tree/master/scala-package, which is a > > full > > > > fledge deep learning library that comes with all the features you > need > > as > > > > well as a Scala Binding. However, > > > > we do need a bit more things that I mentioned in the requirement > > section. > > > > > > > > > > > > *What Help we can get from Flink Community * > > > > I will list the points that are clear and actionable here: > > > > > > > > *- *Improve xgboost-Flink API so that it is consistent with current > > > FlinkML > > > > pipeline > > > > - Provide some "developer API" that allows perf improvement as I > > > mentioned > > > > in "What is needed from DataFlow" > > > > - Support abstraction needed for MXNet, and enable *streaming, > > > GPU-enabled > > > > distributed deep learning on Flink* > > > > - Main obstacle will be the "developer API" > > > > > > > > While some of these effort seems to be a lot to port specific machine > > > > learning library. Enable them basically enable port all machine > > learning > > > > libraries we build and we will be building using these abstractions. > > > > > > > > Tianqi > > > > > > > > > > > > On Sat, Mar 12, 2016 at 4:51 AM, Theodore Vasiloudis < > > > > [hidden email]> wrote: > > > > > > > > > Hello Tianqui, > > > > > > > > > > Yes that definitely sounds interesting for us and we are looking > > > forward > > > > to > > > > > help out with the implementation. > > > > > > > > > > Regards, > > > > > Theodore > > > > > -- > > > > > Sent from a mobile device. May contain autocorrect errors. > > > > > On Mar 12, 2016 11:29 AM, "Simone Robutti" < > > > [hidden email] > > > > > > > > > > wrote: > > > > > > > > > > > This is a really interesting approach. The idea of a ML library > > over > > > > > > DataFlow is probably a winning move and I hope it will stop the > > > > > > proliferation of worthless reimplementation that is taking place > in > > > the > > > > > big > > > > > > data world. Do you think that DataFlow posed specific problems to > > > your > > > > > > work? Does it missing something that you had to fill in with your > > > work? > > > > > > > > > > > > Here at RadicalBit we are interested both in DataFlow/Apache Beam > > and > > > > in > > > > > > distributed ML and your approach to us look the best and I hope > > more > > > > and > > > > > > more teams follow your example, maybe integrating existing > > libraries > > > > like > > > > > > H2O with DataFlow. > > > > > > > > > > > > Keep us updated if you plan to develop other algorithms. > > > > > > > > > > > > 2016-03-11 21:32 GMT+01:00 Tianqi Chen <[hidden email] > >: > > > > > > > > > > > > > Hi Flink Developers > > > > > > > I am sending this email to let you know about XGBoost4J, a > > > > package > > > > > > that > > > > > > > we are planning to announce next week . Here is the draft > version > > > of > > > > > the > > > > > > > post > > > > > > > > > > > > https://github.com/dmlc/xgboost/blob/master/doc/jvm/xgboost4j-intro.md > > > > > > > > > > > > > > In short, XGBoost is a machine learning package that is > used > > by > > > > > more > > > > > > > than half of the machine challenge winning solutions and is > > already > > > > > > widely > > > > > > > used in industry. The distributed version scale to billion > > > > examples(10x > > > > > > > faster than spark.mllib in the experiment) with fewer resources > > > (see > > > > . > > > > > > > http://arxiv.org/abs/1603.02754) > > > > > > > > > > > > > > We are interested in putting distributed XGBoost into all > > > > Dataflow > > > > > > > platforms include Flink. This does not mean we re-implement it > on > > > > > Flink. > > > > > > > But instead we build a portable API that has a communication > > > library, > > > > > and > > > > > > > being able to run on different DataFlow programs. > > > > > > > > > > > > > > We hope this can benefit the Flink users, to enable them to > > get > > > > > > access > > > > > > > to one of the state-of-art machine learning algorithm. I am > > sending > > > > > this > > > > > > > email to the mail-list to let you know about it, and hoping to > > get > > > > some > > > > > > > contributors to help improving the XGBoost Flink API to be > more > > > > > > compatible > > > > > > > with current FlinkML stack. We also hope to get some support > > from > > > > the > > > > > > > system side, to enable some abstraction needed in XGBoost for > > using > > > > > > > multiple threads within even one slot for maximum performance. > > > > > > > > > > > > > > > > > > > > > Let us know about your thoughts. > > > > > > > > > > > > > > Cheers > > > > > > > > > > > > > > Tianqi > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
Free forum by Nabble | Edit this page |