Hi All,
I shall be joining DIMA as a Senior Researcher from Tuesday 4 Nov onwards.

I have more of a background in C++ (for OO) and languages like SML and Haskell (for FP), so I am learning the Scala syntax at the moment.

I came across the following improvement request:
https://issues.apache.org/jira/browse/FLINK-553

Since I have not yet had a chance to look at the Scala API and the internals of Flink, would it be OK for me to work on this improvement, or is it part of a bigger change? I see that most of the aggregation functions will need a major overhaul anyway, so I am not sure whether this change falls within the same ambit.

I think the idea might be to return more data from the first function, so that the next function in the composition has more to take as input. I am not sure how many users are already using Flink, since such a change would obviously break a lot of code.

If this issue has other dependencies that I am not seeing at the moment, then please feel free to suggest some other Scala-related change.

Thanks,
Manu

--
The greater danger for most of us lies not in setting our aim too high and falling short; but in setting our aim too low, and achieving our mark.
- Michelangelo
Hi Manu!
I think the change to explicitly expose the group reduce key is separate from the overhaul of the aggregation functions. I have added a comment to the issue, outlining the ideas gathered so far.

There is also an interesting follow-up possible: reduce on individual fields. Frequently, the reduce affects only one field, as for example in Spark's reduceByKey, which is a special case for key-value pairs. We could add a nice generic version "reduceField(fieldExpression, ReduceFunction<T>)".

An example of how this may look in a program:

    DataSet<MyType> pairs = ...
    pairs.groupBy("name")
         .reduceField("counts", (a, b) -> a + b);

Stephan
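For comparison, here is a minimal sketch (against the 0.7-era Java API, with a hypothetical MyType POJO and hypothetical field names) of how the same per-field reduce can be written today: a plain ReduceFunction that carries the grouping field over unchanged and combines only the "counts" field. The proposed reduceField would presumably generate this kind of function from the field expression, so the user supplies only the single-field combiner.

    import org.apache.flink.api.common.functions.ReduceFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class ReduceFieldSketch {

        // Hypothetical POJO: "name" is the grouping key, "counts" the reduced field.
        public static class MyType {
            public String name;
            public long counts;

            public MyType() {}

            public MyType(String name, long counts) {
                this.name = name;
                this.counts = counts;
            }
        }

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<MyType> pairs = env.fromElements(
                    new MyType("ford", 1), new MyType("ford", 2), new MyType("opel", 5));

            // What reduceField("counts", (a, b) -> a + b) would expand to:
            // the key field is copied through, only "counts" is combined.
            DataSet<MyType> summed = pairs
                    .groupBy("name")
                    .reduce(new ReduceFunction<MyType>() {
                        @Override
                        public MyType reduce(MyType a, MyType b) {
                            return new MyType(a.name, a.counts + b.counts);
                        }
                    });

            summed.print();
            env.execute();
        }
    }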
And we could also combine several reduceFields(), similar to how we
combine aggregates, for example andReduceField().
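For reference, the built-in aggregations already chain this way in the current Java API; a minimal sketch on a hypothetical dataset of (name, counts, price) tuples, which a chained reduceField(...).andReduceField(...) could mirror:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.aggregation.Aggregations;
    import org.apache.flink.api.java.tuple.Tuple3;

    public class ChainedAggregatesSketch {

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // Hypothetical data: (name, counts, price)
            DataSet<Tuple3<String, Long, Double>> items = env.fromElements(
                    new Tuple3<String, Long, Double>("ford", 1L, 20000.0),
                    new Tuple3<String, Long, Double>("ford", 2L, 18000.0),
                    new Tuple3<String, Long, Double>("opel", 5L, 15000.0));

            // Per group: sum field 1 and take the minimum of field 2,
            // expressed with the existing combined aggregates.
            DataSet<Tuple3<String, Long, Double>> result = items
                    .groupBy(0)
                    .aggregate(Aggregations.SUM, 1)
                    .and(Aggregations.MIN, 2);

            result.print();
            env.execute();
        }
    }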
Hello,
Like some other colleagues before me, I would like to introduce myself to the list. I am a PhD student from Berlin who wants to work with Flink.

As suggested by the getting started guide, I had a look at some starter issues and found the issue about comments in CSV lines:
https://issues.apache.org/jira/browse/FLINK-1208

While looking into it, I noticed that the current CSV parser does not correctly read escaped fields. There is of course a debate as to how to escape values in CSV files, but the common convention is to use " as the escape character. So the following line will not parse:

1997,Ford,E350,"Super, ""luxurious"" truck"

I had a look into why that is and whether I could propose a fix. Being a novice to the codebase, I noticed that the CSV parser uses the parsers from

org.apache.flink.types.parser.*

So the question I have: are these parsers used only for CSV files, so that introducing the escaping mechanism would just work? Or are they used in a lot of other places, requiring special handling in the CSV case, so that fixing the escaping would actually break (or fix) a lot of other things?

Johannes
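For illustration, a minimal standalone sketch (plain Java, not Flink code, assuming well-formed input) of the quoting convention described above: a field opened by a double quote runs to the next lone double quote, and a doubled "" inside it decodes to a single literal quote.

    import java.util.ArrayList;
    import java.util.List;

    public class QuotedCsvSketch {

        public static List<String> parseLine(String line) {
            List<String> fields = new ArrayList<String>();
            int i = 0;
            while (i <= line.length()) {
                StringBuilder field = new StringBuilder();
                if (i < line.length() && line.charAt(i) == '"') {
                    i++; // skip the opening quote
                    while (i < line.length()) {
                        char c = line.charAt(i);
                        if (c == '"') {
                            if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                                field.append('"'); // "" decodes to a literal quote
                                i += 2;
                            } else {
                                i++; // closing quote
                                break;
                            }
                        } else {
                            field.append(c);
                            i++;
                        }
                    }
                } else {
                    while (i < line.length() && line.charAt(i) != ',') {
                        field.append(line.charAt(i));
                        i++;
                    }
                }
                fields.add(field.toString());
                i++; // step past the comma (or past the end of the line)
            }
            return fields;
        }

        public static void main(String[] args) {
            // Prints: [1997, Ford, E350, Super, "luxurious" truck]
            System.out.println(
                    parseLine("1997,Ford,E350,\"Super, \"\"luxurious\"\" truck\""));
        }
    }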
Hi Johannes!
Welcome :-)

Right now, the parsers are used only in the CSV formats, so you can adjust them to that format's needs.

Greetings,
Stephan