Some ideas for long-term Flink-related research and implementation projects

Kostas Tzoumas
Hi Folks,

After talking with Stephan, Fabian, Robert, and Ufuk, we gathered a few
project ideas that people have been throwing around. These do not
immediately classify as issues as they are major extensions of Flink (some
might classify as completely different projects). These would make nice
standalone implementation projects, for example for University theses. Some
of them also require research and architecture work.

The relevance to this mailing list is that perhaps someone is interested in
picking up such a project.

Here is the idea dump:

---------------

Domain-specific language for graph processing: Create a GraphDataSet that
abstracts away the internal representation of a graph and operations on the
GraphDataSet. The project involves gathering requirements for graph
processing functionality, architecting the DSL, implementation, and
possible work on optimizing the operations when a graph operation can be
mapped to different DataSet to DataSet transformations.

Distributed mutable state: Currently, delta iterations internally use a hash
index to store the state of the iteration, and they invoke index merging
functionality. One idea would be to surface an operator (with care) to the
APIs that essentially allows mutable state manipulations. Another idea
would be to implement something along the lines of a parameter server and
make such functionality accessible to the APIs.

Domain-specific language for spatial data: Create spatial data types
(point, region, etc.) and operations on them.

Integration into Apache BigTop

Integration with Apache Ambari

Pig frontend for Flink: An initial effort is described here:
http://kth.diva-portal.org/smash/get/diva2:539046/FULLTEXT01.pdf

Cascading on Flink

Optimizing the integration with columnar file formats (Parquet, ORCFile)
and perhaps eventually pushing filters down to data scans.

Statistical operators to extract statistical information from a DataSet
(e.g., histograms of value distributions)

Integration with Apache Mahout (ongoing effort)

Integration with Apache Tez (ongoing effort)

Flink Streaming (ongoing effort)

Eclipse plugin that includes functionality for execution plan debugging

Local execution of programs using Java Collections

---------------
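
To make the "distributed mutable state" item above a bit more concrete, here is a minimal single-machine sketch of a hash index that holds iteration state and merges a delta into it. It is an illustration only: the class name `SolutionSetSketch`, the method `mergeDelta`, and the keep-the-smaller-value merge rule are assumptions made for this example and are not Flink's internal implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of iteration state kept in a hash index (not Flink code). */
final class SolutionSetSketch {

    // The hash index: key -> current value of the iteration state.
    private final Map<Long, Long> index = new HashMap<>();

    SolutionSetSketch(Map<Long, Long> initial) {
        index.putAll(initial);
    }

    /**
     * Merges a delta into the index. As an assumed merge rule, an entry is
     * written only if the key is new or the delta value is smaller than the
     * stored one. Returns the keys whose state actually changed.
     */
    List<Long> mergeDelta(Map<Long, Long> delta) {
        List<Long> changed = new ArrayList<>();
        for (Map.Entry<Long, Long> e : delta.entrySet()) {
            Long current = index.get(e.getKey());
            if (current == null || e.getValue() < current) {
                index.put(e.getKey(), e.getValue());
                changed.add(e.getKey());
            }
        }
        return changed;
    }

    /** Returns the stored value for the key, or null if the key is absent. */
    Long get(long key) {
        return index.get(key);
    }

    int size() {
        return index.size();
    }
}
```

A caller would feed each iteration's delta to `mergeDelta` and stop once it returns an empty list, which is the point where the state no longer changes. Surfacing something like this in the APIs would additionally need partitioning and fault-tolerance decisions that the sketch leaves out.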

Feel free to extend the descriptions that are empty and to extend this list.

Do you think that these would qualify as JIRA tickets classified as
"wishes"?

Kostas

Re: Some ideas for long-term Flink-related research and implementation projects

Robert Metzger
Thank you for writing down the ideas.

I think we should not open JIRAs for these ideas. I would prefer to
put the list on the website or a wiki (once we have that).


On Fri, Jun 20, 2014 at 6:25 PM, Kostas Tzoumas <[hidden email]> wrote:

> Do you think that these would qualify as JIRA tickets classified as
> "wishes"?
>
> Kostas
>

Re: Some ideas for long-term Flink-related research and implementation projects

Henry Saputra
The last email thread did not settle whether we want a wiki or not. It seems
like a good idea to have a wiki, at least for now, to share ideas like this.

- Henry

On Friday, June 20, 2014, Robert Metzger <[hidden email]> wrote:

> Thank you for writing down the ideas.
>
> I think we should not open JIRAs for these ideas. I would rather prefer to
> put the list on the website or a wiki (once we have that).

Re: Some ideas for long-term Flink-related research and implementation projects

Fabian Hueske
I'm still +1 for a wiki.


2014-06-20 21:49 GMT+02:00 Henry Saputra <[hidden email]>:

> Last email thread was not closed whether we want wiki or not. Seems like it
> is good idea to have wiki, at least for now, to share ideas like this.
>
> - Henry

Re: Some ideas for long-term Flink-related research and implementation projects

till.rohrmann
I agree, a wiki would be a good place to list these ideas.


On Fri, Jun 20, 2014 at 10:02 PM, Fabian Hueske <[hidden email]> wrote:

> I'm still +1 for a wiki.

Re: Some ideas for long-term Flink-related research and implementation projects

Ufuk Celebi
I don't want to move the wiki discussion here, but I couldn't find the wiki thread.

If I'm not mistaken, it's possible to activate GitHub wikis on a per-repo basis. Since everyone has a GH account and might know our GH-based wiki from before, couldn't we just try it out?

On 20 Jun 2014, at 22:57, Till Rohrmann <[hidden email]> wrote:

> I agree, a wiki would be a good place to list these ideas.

Re: Some ideas for long-term Flink-related research and implementation projects

Rajika Kumarasiri
Why don't we have a web page on the website with "Open Projects" or
something, and link from there?

Rajika


On Fri, Jun 20, 2014 at 3:13 PM, Robert Metzger <[hidden email]> wrote:

> Thank you for writing down the ideas.
>
> I think we should not open JIRAs for these ideas. I would rather prefer to
> put the list on the website or a wiki (once we have that).

Re: Some ideas for long-term Flink-related research and implementation projects

Stephan Ewen
I understood that this is exactly the plan.

Ufuk is preparing a stub for the website and a wiki has been requested. I
think this list with projects goes into the wiki and will be linked from
the website.


On Mon, Jun 23, 2014 at 4:03 PM, Rajika Kumarasiri <[hidden email]> wrote:

> Why don't we have a web page on the website with "Open Projects" or
> something and link from there ?
>
> Rajika

Re: Some ideas for long-term Flink-related research and implementation projects

Henry Saputra
I am interested to see how Flink integrates with Apache Tez. Does anyone
have a reference, JIRA, or doc showing how far the ongoing effort
has come?


Thanks,

- Henry

On Fri, Jun 20, 2014 at 9:25 AM, Kostas Tzoumas
<[hidden email]> wrote:

> Integration with Apache Tez (ongoing effort)

Re: Some ideas for long-term Flink-related research and implementation projects

Kostas Tzoumas
Henry,

I am currently travelling and will be able to write more about this next week.
The idea is to use Tez as the distributed engine and port Flink's runtime
operators (for joins, aggregations, etc.) on top of it. The Flink APIs and
optimizer should not need many changes. In theory, this should be possible
for the non-iterative parts of Flink. Filip has started an early effort to
get a WordCount that uses Stratosphere types and operators to run on
top of Tez:
https://github.com/filiphaase/incubator-tez/tree/stratosphere-input-output-proto1/tez-mapreduce-examples/src/main/java/org/apache/tez/stratosphere

Kostas


On Tue, Jun 24, 2014 at 12:33 AM, Henry Saputra <[hidden email]>
wrote:

> I am interested to see how Flink integrate with Apache Tez. Anyone has
> any reference or JIRA or any doc to see how far the ongoing effort
> been going?
>
>
> Thanks,
>
> - Henry

Re: Some ideas for long-term Flink-related research and implementation projects

Stephan Ewen
@everyone interested in the Tez work:

I created a JIRA Issue with a brief summary of the current status and
plans: https://issues.apache.org/jira/browse/FLINK-972

I was thinking about a brief, dedicated Tez Hangout next week. Please post
here if you would like to have a Hangout on Flink&Tez next week.

Stephan

Re: Some ideas for long-term Flink-related research and implementation projects

Kostas Tzoumas
Stephan, great work, thank you! I am interested.

Kostas


On Tue, Jun 24, 2014 at 3:09 PM, Stephan Ewen <[hidden email]> wrote:

> @everyone interested in the Tez work:
>
> I created a JIRA Issue with a brief summary of the current status and
> plans: https://issues.apache.org/jira/browse/FLINK-972
>
> I was thinking about a brief dedicate Tez Hangout next week. Please post
> here, if you would like to have a Hangout on Flink&Tez next week.
>
> Stephan
>

Re: Some ideas for long-term Flink-related research and implementation projects

Fabian Hueske
I'm in as well.


2014-06-24 16:35 GMT+02:00 Kostas Tzoumas <[hidden email]>:

> Stephan, great work, thank you! I am interested
>
> Kostas
>
>
> On Tue, Jun 24, 2014 at 3:09 PM, Stephan Ewen <[hidden email]> wrote:
>
> > @everyone interested in the Tez work:
> >
> > I created a JIRA Issue with a brief summary of the current status and
> > plans: https://issues.apache.org/jira/browse/FLINK-972
> >
> > I was thinking about a brief, dedicated Tez Hangout next week. Please post
> > here if you would like to have a Hangout on Flink&Tez next week.
> >
> > Stephan
> >
>

Re: Some ideas for long-term Flink-related research and implementation projects

Henry Saputra
In reply to this post by Kostas Tzoumas
Thanks for the explanation Kostas.
I am hoping to keep the Flink APIs (i.e. the operator functions) clean
and hide all Tez nitty gritty in the plan execution =)


- Henry

On Tue, Jun 24, 2014 at 5:05 AM, Kostas Tzoumas
<[hidden email]> wrote:

> Henry,
>
> I am currently travelling and will be able to write more about this next week.
> The idea is to use Tez as the distributed engine and port Flink's runtime
> operators (for joins, aggregations, etc.) on top of that. The Flink APIs and
> optimizer should not need many changes. This should, in theory, be possible
> for the non-iterative parts of Flink. Filip has started an early effort of
> getting a WordCount that uses Stratosphere types and operators to run on
> top of Tez:
> https://github.com/filiphaase/incubator-tez/tree/stratosphere-input-output-proto1/tez-mapreduce-examples/src/main/java/org/apache/tez/stratosphere
>
> Kostas
>
>
> On Tue, Jun 24, 2014 at 12:33 AM, Henry Saputra <[hidden email]>
> wrote:
>
>> I am interested to see how Flink integrates with Apache Tez. Does anyone
>> have a reference, JIRA, or doc that shows how far the ongoing effort
>> has come?
>>
>>
>> Thanks,
>>
>> - Henry
>>

Re: Some ideas for long-term Flink-related research and implementation projects

Henry Saputra
In reply to this post by Stephan Ewen
+1

I am in

Might as well share the Hangout link on the dev@ list, just in case
people would like to drop by.

- Henry

On Tue, Jun 24, 2014 at 6:09 AM, Stephan Ewen <[hidden email]> wrote:
> @everyone interested in the Tez work:
>
> I created a JIRA Issue with a brief summary of the current status and
> plans: https://issues.apache.org/jira/browse/FLINK-972
>
> I was thinking about a brief, dedicated Tez Hangout next week. Please post
> here if you would like to have a Hangout on Flink&Tez next week.
>
> Stephan

Re: Some ideas for long-term Flink-related research and implementation projects

Robert Metzger
In reply to this post by Fabian Hueske
I've copied the project ideas into our newly created wiki:
https://cwiki.apache.org/confluence/display/FLINK/Project+Ideas


On Fri, Jun 20, 2014 at 10:02 PM, Fabian Hueske <[hidden email]> wrote:

> I'm still +1 for a wiki.
>
>
> 2014-06-20 21:49 GMT+02:00 Henry Saputra <[hidden email]>:
>
> > The last email thread did not settle whether we want a wiki or not. It
> > seems like a good idea to have a wiki, at least for now, to share ideas
> > like this.
> >
> > - Henry
> >
> > On Friday, June 20, 2014, Robert Metzger <[hidden email]> wrote:
> >
> > > Thank you for writing down the ideas.
> > >
> > > I think we should not open JIRAs for these ideas. I would prefer to
> > > put the list on the website or a wiki (once we have that).
> > >
> > >
> >
>