[jira] [Commented] (FLINK-947) Add support for "Named Datasets"


Shang Yuanchun (Jira)

    [ https://issues.apache.org/jira/browse/FLINK-947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14035631#comment-14035631 ]

Aljoscha Krettek commented on FLINK-947:
----------------------------------------

I developed a small prototype to showcase the idea. The code is online here: https://github.com/aljoscha/stratosphere/tree/named-datasets. Right now only Join and Map work, and the implementation is still rough, but it works.

I'm using tuples underneath, but maybe we can change this to operate on more efficient structures, although that will require deeper changes. What do you think? If you like it, we could start working on this with more people to integrate it properly with the existing code.
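To make the "tuples underneath" idea concrete, here is a minimal sketch in plain Java, without any Flink or Stratosphere dependency. It is not the prototype's code: the class `NamedRowsSketch` and its methods are hypothetical, and it only assumes that a named dataset can be modelled as a list of field names kept alongside positional rows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: this is not the prototype's NamedDataSet.
final class NamedRowsSketch {
    final List<String> names;   // field names, in positional order
    final List<Object[]> rows;  // each row plays the role of a positional tuple

    NamedRowsSketch(List<String> names, List<Object[]> rows) {
        this.names = names;
        this.rows = rows;
    }

    // Resolve a field name to its position in the underlying tuple.
    int indexOf(String name) {
        int i = names.indexOf(name);
        if (i < 0) {
            throw new IllegalArgumentException("Unknown field: " + name);
        }
        return i;
    }

    // Keep only the given fields, in the given order.
    NamedRowsSketch project(String... fields) {
        int[] idx = new int[fields.length];
        for (int i = 0; i < fields.length; i++) {
            idx[i] = indexOf(fields[i]);
        }
        List<Object[]> out = new ArrayList<>();
        for (Object[] row : rows) {
            Object[] projected = new Object[idx.length];
            for (int i = 0; i < idx.length; i++) {
                projected[i] = row[idx[i]];
            }
            out.add(projected);
        }
        return new NamedRowsSketch(Arrays.asList(fields), out);
    }

    // Rename the fields; the row data is shared, not copied.
    NamedRowsSketch as(String... newNames) {
        if (newNames.length != names.size()) {
            throw new IllegalArgumentException("Expected " + names.size() + " names");
        }
        return new NamedRowsSketch(Arrays.asList(newNames), rows);
    }

    public static void main(String[] args) {
        List<Object[]> data = new ArrayList<>();
        data.add(new Object[] {"eins", 1});
        data.add(new Object[] {"zwei", 2});
        NamedRowsSketch one = new NamedRowsSketch(Arrays.asList("word", "count"), data);
        NamedRowsSketch swapped = one.project("count", "word").as("a", "b");
        for (Object[] row : swapped.rows) {
            System.out.println(swapped.names + " " + Arrays.toString(row));
        }
    }
}
```

In this sketch the names are pure metadata, so renaming is free and projection is an index lookup per field. A more efficient underlying structure would change how rows are stored but could keep the same name-to-position mapping.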

Please ask questions if you want to learn more about it.

Example code:
{code:java}
public class NamedDataSetPlayground {

    public static void main(String[] args) throws Exception {

        final ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> in1 = env.fromElements(
                new Tuple2<String, Integer>("eins", 1),
                new Tuple2<String, Integer>("zwei", 2),
                new Tuple2<String, Integer>("drei", 3)
        );
        DataSet<Tuple2<String, Integer>> in2 = env.fromElements(
                new Tuple2<String, Integer>("eins", 11),
                new Tuple2<String, Integer>("zwei", 12),
                new Tuple2<String, Integer>("drei", 13)
        );

        // Attach field names to the positional tuple fields.
        NamedDataSet one = in1.named("word", "count");
        NamedDataSet two = in2.named("word2", "count2");

        NamedDataSet result =
                // Join on one.word == two.word2.
                one.join(two, new String[] {"word"}, new String[] {"word2"})
                .project("count", "count2", "word")
                // Rename the projected fields to a, b, c.
                .as("a", "b", "c")
                .join(two, new String[]{"b"}, new String[]{"count2"})
                .project("a", "b", "word2")
                // Map between user types; FooIn's fields mirror the projected names.
                .map(new MapFunction<FooIn, FooOut>() {
                    @Override
                    public FooOut map(FooIn value) throws Exception {
                        FooOut out = new FooOut();
                        out.x = value.a;
                        out.y = value.b;
                        out.z = value.word2;
                        return out;
                    }
                })
                // FooOut's fields (x, y, z) are usable as names in the next join.
                .join(two, new String[]{"z"}, new String[]{"word2"});

        result.toDataSet().print();

        env.execute();
    }

    public static class FooIn {
        int a;
        Integer b;
        String word2;
    }

    public static class FooOut {
        Integer x;
        int y;
        String z;
    }
}
{code}
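The example maps from `FooIn`, whose fields carry the same names as the projected columns. One way such a binding could work is by reflection on field names. The sketch below is an assumption about the general technique, not a description of the prototype's implementation; `NamedFieldBindingSketch` and `bind` are hypothetical names.

```java
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;

// Hypothetical illustration only: the prototype may bind fields differently.
final class NamedFieldBindingSketch {

    // Same shape as FooIn in the example above.
    static class FooIn {
        int a;
        Integer b;
        String word2;
    }

    // Create an instance of `type` and copy each value into the field
    // whose name matches the corresponding entry in `names`.
    static <T> T bind(Class<T> type, String[] names, Object[] values) {
        try {
            Constructor<T> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            T target = ctor.newInstance();
            for (int i = 0; i < names.length; i++) {
                Field field = type.getDeclaredField(names[i]);
                field.setAccessible(true);
                // Field.set unwraps boxed values for primitive fields such as int.
                field.set(target, values[i]);
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot bind fields to " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        FooIn in = bind(FooIn.class,
                new String[] {"a", "b", "word2"},
                new Object[] {1, 11, "eins"});
        System.out.println(in.a + " " + in.b + " " + in.word2);
    }
}
```

Reflection keeps user types free of annotations or interfaces, at the cost of per-record overhead; a real implementation would likely resolve the `Field` objects once per operator rather than once per record.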

> Add support for "Named Datasets"
> --------------------------------
>
>                 Key: FLINK-947
>                 URL: https://issues.apache.org/jira/browse/FLINK-947
>             Project: Flink
>          Issue Type: New Feature
>          Components: Java API
>            Reporter: Aljoscha Krettek
>            Assignee: Aljoscha Krettek
>            Priority: Minor
>
> This would create an API that mixes SQL-like declarativity with the power of user-defined functions. Example user code could look like this:
> {code:Java}
> NamedDataSet one = ...
> NamedDataSet two = ...
> NamedDataSet result = one.join(two).where("key").equalTo("otherKey")
>   .project("a", "b", "c")
>   .map( (UserTypeIn in) -> new UserTypeOut(...) )
>   .print();
> {code}


