Introduction


Introduction

Manu Kaul
Hi All,
I shall be joining DIMA as a Senior Researcher from
Tuesday 4 Nov onwards.

I have more of a background in C++ (for OO) and languages like
SML and Haskell (for FP), so I am learning Scala syntax at
the moment.

I came across the following improvement request:
https://issues.apache.org/jira/browse/FLINK-553

Seeing that I have not yet had a chance to look at the Scala API
and internals of Flink, would it be ok for me to work on this improvement,
or is it part of a bigger change? I see that most of the aggregation
functions will need a major overhaul anyway, so I am not sure whether this
change falls within the same scope.

I think the idea might be to return more data from the first function, so
that the next function in the composition has more to take as
input. I am also not sure how many users are already on Flink,
because such a change would obviously break a lot of existing code.

If this issue has other dependencies that I am not seeing at the
moment, then please feel free to suggest some other Scala-related
change.

Thanks,
Manu


--

The greater danger for most of us lies not in setting our aim too high and
falling short; but in setting our aim too low, and achieving our mark.
- Michelangelo

Re: Introduction

Stephan Ewen
Hi Manu!

I think the change to explicitly expose the group reduce key is separate
from the overhaul of the aggregation functions.

I have added a comment to the issue, outlining the ideas gathered so far.

There is also an interesting possible follow-up: reducing individual
fields. Frequently, the reduce affects only one field, as for example in
Spark's reduceByKey, which is the special case for key-value pairs. We could
add a nice generic version "reduceField(fieldExpression, ReduceFunction<F>)".

An example of how this might look in a program:

DataSet<MyType> pairs = ...
pairs.groupBy("name")
     .reduceField("counts", (a, b) -> a + b);
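
For contrast, a rough sketch of what the same aggregation requires today
with a plain ReduceFunction over the whole type (assuming MyType is a POJO
with a String "name" field and an int "counts" field):

DataSet<MyType> pairs = ...
pairs.groupBy("name")
     .reduce((a, b) -> {
         // merge two records of the same group by hand
         MyType merged = new MyType();
         merged.name = a.name;                 // carry the grouping key through
         merged.counts = a.counts + b.counts;  // the only field that actually changes
         return merged;
     });

reduceField would collapse this boilerplate down to the one field that changes.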


Stephan





Re: Introduction

Aljoscha Krettek-2
And we could also combine several reduceField() calls, similar to how we
combine aggregates, for example with andReduceField().
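
For instance, a chained call might look like this (both methods are
hypothetical at this point, and "total" is just an assumed second numeric
field on MyType):

DataSet<MyType> pairs = ...
pairs.groupBy("name")
     .reduceField("counts", (a, b) -> a + b)
     .andReduceField("total", (a, b) -> a + b);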


Introduction

Kirschnick, Johannes
In reply to this post by Manu Kaul
Hello,

As some other colleagues have already done, I would like to introduce myself to the list as well.
I am a PhD student from Berlin who wants to work with Flink.

As suggested by the getting-started guide, I had a look at some starter issues and found the one about comments in CSV lines:
https://issues.apache.org/jira/browse/FLINK-1208

While looking into it, I noticed that the current CSV parser does not correctly read escaped fields.
There is of course some debate about how best to escape values in CSV files, but the common convention is to quote a field with " and to escape an embedded quote by doubling it.

So the following line will not parse:

1997,Ford,E350,"Super, ""luxurious"" truck"
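
For illustration, here is a minimal, self-contained sketch of the
doubled-quote unescaping technique for a single quoted field; this is just
the general idea, not Flink's actual parser code:

public class QuotedCsvFieldSketch {

    // Parses the field starting at 'pos' and returns its unescaped value.
    static String parseField(String line, int pos) {
        if (line.charAt(pos) != '"') {
            // unquoted field: read up to the next comma (or end of line)
            int end = line.indexOf(',', pos);
            return end < 0 ? line.substring(pos) : line.substring(pos, end);
        }
        StringBuilder sb = new StringBuilder();
        int i = pos + 1; // skip the opening quote
        while (i < line.length()) {
            char c = line.charAt(i);
            if (c == '"') {
                if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                    sb.append('"'); // doubled quote -> one literal quote
                    i += 2;
                } else {
                    break; // a single quote closes the field
                }
            } else {
                sb.append(c);
                i++;
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String line = "1997,Ford,E350,\"Super, \"\"luxurious\"\" truck\"";
        // the quoted field starts at index 15
        System.out.println(parseField(line, 15)); // Super, "luxurious" truck
    }
}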

I had a look into why that is and whether I could propose a fix.
Being a novice to the codebase, I noticed that the CSV parser uses the parsers from

org.apache.flink.types.parser.*

So the question I have is:

Are these parsers used only for CSV files, so that introducing the escaping mechanism would just work? Or are they used in many other places that would require special handling for the CSV case, so that fixing the escaping would actually break (or fix) a lot of other things?

Johannes

Re: Introduction

Stephan Ewen
Hi Johannes!

Welcome :-)

Right now, the parsers are used only in the CSV formats, so you can adjust
them to that format's needs.

Greetings,
Stephan

