CsvInputFormat delimiter fields

Martin Neumann
Hej,

A lot of my inputs are CSV files, so I use the CsvInputFormat a lot. What I
find kind of odd is that the line delimiter is a String but the field
delimiter is a Character.

*see:* new CsvInputFormat<Tuple2<String, String>>(new Path(pVecPath), "\n", '\t', String.class, String.class)

Is there a reason for this? I'm currently working with a file that has a
more complex field delimiter so I had to write a mapper to read from
StringInputFormat.

cheers Martin
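
A minimal sketch of the workaround described above: read each line as a plain string and split it on a multi-character field delimiter in a map step. The class name `MultiCharFieldSplitter`, the method `splitFields`, and the `"||"` delimiter are hypothetical and not part of Flink; only standard Java library calls are used.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Flink: splits one text line on a
// multi-character field delimiter, as a map function over lines might do.
class MultiCharFieldSplitter {

    // Split `line` on the literal string `delimiter`.
    // Pattern.quote treats regex metacharacters in the delimiter literally;
    // the limit -1 keeps trailing empty fields instead of dropping them.
    static List<String> splitFields(String line, String delimiter) {
        return Arrays.asList(line.split(Pattern.quote(delimiter), -1));
    }

    public static void main(String[] args) {
        // "||" is an assumed example delimiter, not taken from the thread.
        System.out.println(splitFields("alice||42||", "||"));
    }
}
```

The cost of this approach is that every line is first materialized as a String and then split, instead of being parsed directly from the bytes.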

Re: CsvInputFormat delimiter fields

Stephan Ewen
Hi!

The reason is the way the CSV parsers currently work. They are pushed down
into the byte-stream parsing and can only recognize single-character
delimiters. It is possible to change that, but it would take a bit of work.

Stephan
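
To illustrate what parsing at the byte level with a one-character delimiter can look like, here is a rough sketch. It is not Flink's actual parser code; `SingleByteFieldScanner` and `scan` are hypothetical names, and the sketch assumes the delimiter fits in a single byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a byte-level field scanner for a single-byte delimiter.
class SingleByteFieldScanner {

    // Walks the raw bytes of one record and cuts a field every time the
    // current byte equals the delimiter byte. One comparison per byte.
    static List<String> scan(byte[] record, byte delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < record.length; i++) {
            if (record[i] == delimiter) {
                fields.add(new String(record, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // The remaining bytes after the last delimiter form the final field.
        fields.add(new String(record, start, record.length - start, StandardCharsets.UTF_8));
        return fields;
    }

    public static void main(String[] args) {
        byte[] record = "alice\t42\tberlin".getBytes(StandardCharsets.UTF_8);
        System.out.println(scan(record, (byte) '\t'));
    }
}
```

Because the check is a single byte comparison, there is no state to carry between bytes; a multi-character delimiter would need a partial-match state, which is the extra work mentioned above.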

On Wed, Oct 15, 2014 at 3:36 PM, Martin Neumann <[hidden email]>
wrote:

> Hej,
>
> A lot of my inputs are CSV files, so I use the CsvInputFormat a lot. What I
> find kind of odd is that the line delimiter is a String but the field
> delimiter is a Character.
>
> *see:* new CsvInputFormat<Tuple2<String,String>>(new
> Path(pVecPath),"\n",'\t',String.class,String.class)
>
> Is there a reason for this? I'm currently working with a file that has a
> more complex field delimiter so I had to write a mapper to read from
> StringInputFormat.
>
> cheers Martin
>

Re: CsvInputFormat delimiter fields

Martin Neumann
Would changing it cost performance?
If not, I think it would be a good change to make, since it would allow us to
(ab)use the CSV reader to load structured text files (for example, by using
keywords as delimiters).

Being able to put a regular expression there would be even nicer, but maybe
that should end up in its own InputFormat.

cheers Martin
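
As a rough illustration of the keyword and regular-expression idea, the sketch below splits a line of structured text on keywords. The class `KeywordDelimiterSplitter`, the keywords `NAME` and `AGE`, and the sample line are all made up for the example; this is plain Java and not an existing Flink InputFormat.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical example: treats the keywords NAME and AGE (with surrounding
// whitespace) as field delimiters in a line of structured text.
class KeywordDelimiterSplitter {

    // Compiled once and reused for every line.
    private static final Pattern FIELD_DELIMITER =
            Pattern.compile("\\s+(?:NAME|AGE)\\s+");

    // The limit -1 keeps trailing empty fields.
    static List<String> splitFields(String line) {
        return Arrays.asList(FIELD_DELIMITER.split(line, -1));
    }

    public static void main(String[] args) {
        System.out.println(splitFields("id=7 NAME alice AGE 42"));
    }
}
```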

On Wed, Oct 15, 2014 at 3:47 PM, Stephan Ewen <[hidden email]> wrote:

> Hi!
>
> The reason is the way the CSV parsers currently work. They are pushed down
> into the byte-stream parsing and can only recognize single-character
> delimiters. It is possible to change that, but it would take a bit of work.
>
> Stephan

Re: CsvInputFormat delimiter fields

Fabian Hueske
I don't think that multi-char field delimiters would cause a performance
problem. The data needs to be parsed anyway.
Only when the delimiter has a prefix that occurs often in the regular data
could it have a major impact.

Fabian
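
The prefix effect can be made concrete with a small experiment. The sketch below uses a deliberately naive matcher (an assumption for illustration, not how Flink does or would implement this) and counts byte comparisons while searching for a multi-byte delimiter. `MultiByteDelimiterCost` and `countComparisons` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: counts the byte comparisons a naive scanner performs
// while searching `data` for every occurrence of a multi-byte delimiter.
class MultiByteDelimiterCost {

    static long countComparisons(byte[] data, byte[] delim) {
        long comparisons = 0;
        int i = 0;
        while (i + delim.length <= data.length) {
            int j = 0;
            // Compare the delimiter byte by byte until a mismatch.
            while (j < delim.length) {
                comparisons++;
                if (data[i + j] != delim[j]) {
                    break;
                }
                j++;
            }
            // Skip past a full match; otherwise advance by one byte.
            i += (j == delim.length) ? delim.length : 1;
        }
        return comparisons;
    }

    public static void main(String[] args) {
        byte[] delim = "aab".getBytes(StandardCharsets.US_ASCII);
        byte[] rare = "xxxxxxxx".getBytes(StandardCharsets.US_ASCII);
        byte[] frequent = "aaaaaaaa".getBytes(StandardCharsets.US_ASCII);
        // The delimiter prefix "aa" never occurs in `rare` but occurs
        // at every position in `frequent`.
        System.out.println(countComparisons(rare, delim));     // prints 6
        System.out.println(countComparisons(frequent, delim)); // prints 18
    }
}
```

When the first delimiter byte is rare in the data, the scan costs about one comparison per byte, the same as a single-character delimiter. When the prefix is common, the cost grows toward the delimiter length per byte.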

2014-10-15 16:07 GMT+02:00 Martin Neumann <[hidden email]>:

> Would changing it cost performance?
> If not, I think it would be a good change to make, since it would allow us to
> (ab)use the CSV reader to load structured text files (for example, by using
> keywords as delimiters).
>
> Being able to put a regular expression there would be even nicer but maybe
> it should end up in its own InputFormat then.
>
> cheers Martin

Re: CsvInputFormat delimiter fields

Fabian Hueske
I created FLINK-1168 for this feature request.

2014-10-16 11:28 GMT+02:00 Fabian Hueske <[hidden email]>:

> I don't think that multi-char field delimiters would cause a performance
> problem. The data needs to be parsed anyway.
> Only when the delimiter has a prefix that occurs often in the regular data
> could it have a major impact.
>
> Fabian