(DEPRECATED) Apache Flink Mailing List archive.

CsvInputFormat delimiter fields

Martin Neumann

CsvInputFormat delimiter fields

Hej,

A lot of my inputs are CSV files, so I use the CsvInputFormat a lot. What I
find kind of odd is that the line delimiter is a String but the field
delimiter is a Character.

*see:* new CsvInputFormat<Tuple2<String,String>>(new Path(pVecPath), "\n", '\t', String.class, String.class)

Is there a reason for this? I'm currently working with a file that has a
more complex field delimiter, so I had to write a mapper to read from
StringInputFormat.

cheers Martin
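
The workaround described above (read each line as a plain string, then split
it in a mapper) can be sketched roughly as follows. This is only a minimal
sketch in plain Java, not Flink code: the class name `MultiCharFieldSplit`,
the method `parsePair`, and the `::` delimiter are hypothetical and chosen
for illustration. In a real job, the body of `parsePair` would presumably
live inside the mapper that is applied to each line.

```java
final class MultiCharFieldSplit {

    // Hypothetical multi-character field delimiter, for illustration only.
    static final String FIELD_DELIMITER = "::";

    /**
     * Splits one line into exactly two fields at the FIRST occurrence of the
     * delimiter. Any further occurrences stay inside the second field.
     */
    static String[] parsePair(String line, String delimiter) {
        int at = line.indexOf(delimiter);
        if (at < 0) {
            throw new IllegalArgumentException("No field delimiter in line: " + line);
        }
        String first = line.substring(0, at);
        String second = line.substring(at + delimiter.length());
        return new String[] {first, second};
    }

    public static void main(String[] args) {
        String[] pair = parsePair("alice::42", FIELD_DELIMITER);
        System.out.println(pair[0] + " | " + pair[1]);
    }
}
```

Using `indexOf` instead of `String.split` avoids treating the delimiter as a
regular expression, so delimiters such as `||` or `.*` need no escaping.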

Stephan Ewen

Re: CsvInputFormat delimiter fields

Hi!

The reason is the way the CSV parsers currently work. They are pushed down
into the byte-stream parsing and are restricted to recognizing single-char
delimiters. It is possible to change that, but it would be a bit of work.

Stephan
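
To illustrate what recognizing a one-char delimiter directly on the byte
stream can look like, here is a minimal sketch. It is not the actual Flink
parser: the class `SingleByteFieldScanner` and the method `splitFields` are
hypothetical, and the sketch assumes the record is UTF-8 text and the
delimiter is a single ASCII byte.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

final class SingleByteFieldScanner {

    /**
     * Scans the raw bytes of one record and cuts a field at every occurrence
     * of the delimiter byte. Each byte is compared exactly once, and no
     * state is needed beyond the start offset of the current field.
     */
    static List<String> splitFields(byte[] record, byte delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < record.length; i++) {
            if (record[i] == delimiter) {
                fields.add(new String(record, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // The last field runs to the end of the record.
        fields.add(new String(record, start, record.length - start, StandardCharsets.UTF_8));
        return fields;
    }

    public static void main(String[] args) {
        byte[] record = "key\tvalue".getBytes(StandardCharsets.UTF_8);
        System.out.println(splitFields(record, (byte) '\t'));
    }
}
```

With a single-byte delimiter, the comparison in the loop is one equality
check per byte. A multi-byte delimiter would need extra matching logic at
that point, which is presumably the "bit of work" mentioned above.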

On Wed, Oct 15, 2014 at 3:36 PM, Martin Neumann <[hidden email]>
wrote:

> Hej,
>
> A lot of my inputs are csv files so I use the CsvInputFormat a lot. What I
> find kind of odd that the Line delimiter is a String but the Field
> delimiter is a Character.
>
> *see:* new CsvInputFormat<Tuple2<String,String>>(new
> Path(pVecPath),"\n",'\t',String.class,String.class)
>
> Is there a reason for this? I'm currently working with a file that has a
> more complex field delimiter so I had to write a mapper to read from
> StringInputFormat.
>
> cheers Martin
>

Martin Neumann

Re: CsvInputFormat delimiter fields

Would changing it cost performance?
If not, I think it would be a good change to make, since it would allow
(ab)using the CSV reader to load structured text files (for example, by
using keywords as delimiters).

Being able to put a regular expression there would be even nicer, but maybe
that should end up in its own InputFormat.

cheers Martin
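
As a rough illustration of the regular-expression idea, the following sketch
splits a line on keyword delimiters with `java.util.regex.Pattern`. It is
plain Java rather than a Flink InputFormat, and the class `RegexFieldSplit`,
the method `splitFields`, and the `FROM:` / `TO:` keywords are hypothetical
examples, not anything taken from an existing API.

```java
import java.util.regex.Pattern;

final class RegexFieldSplit {

    // Hypothetical keyword delimiters with optional surrounding whitespace.
    static final Pattern KEYWORD_DELIMITER = Pattern.compile("\\s*(?:FROM|TO):\\s*");

    /**
     * Splits a line wherever the pattern matches. The limit of -1 keeps
     * trailing empty fields instead of silently dropping them.
     */
    static String[] splitFields(String line, Pattern delimiter) {
        return delimiter.split(line, -1);
    }

    public static void main(String[] args) {
        for (String field : splitFields("msg1 FROM: alice TO: bob", KEYWORD_DELIMITER)) {
            System.out.println(field);
        }
    }
}
```

Compiling the pattern once and reusing it for every line avoids paying the
compilation cost per record.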

On Wed, Oct 15, 2014 at 3:47 PM, Stephan Ewen <[hidden email]> wrote:

> Hi!
>
> The reason is the current way the csv parsers work. They are pushed into
> the byte stream parsing and are restricted to recognize one char
> delimiters. It is possible to change that, but would be a bit of work.
>
> Stephan

Fabian Hueske

Re: CsvInputFormat delimiter fields

I don't think that multi-char field delimiters would cause a performance
problem. The data needs to be parsed anyway.
Only in cases where the delimiter has a prefix that occurs often in the
regular data could it have a major impact.

Fabian
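
The point about frequent delimiter prefixes can be made concrete with a
small sketch. It uses a naive byte-wise matcher that counts comparisons.
This is an assumption made for illustration, not how Flink parses CSV, and
the names `MultiByteDelimiterCost`, `ScanResult`, and `scan` are
hypothetical.

```java
import java.nio.charset.StandardCharsets;

final class MultiByteDelimiterCost {

    /** Number of fields found and number of byte comparisons spent finding them. */
    static final class ScanResult {
        final int fields;
        final long comparisons;

        ScanResult(int fields, long comparisons) {
            this.fields = fields;
            this.comparisons = comparisons;
        }
    }

    /** Naive scan: at every position, try to match the whole delimiter. */
    static ScanResult scan(byte[] data, byte[] delimiter) {
        if (delimiter.length == 0) {
            throw new IllegalArgumentException("Delimiter must not be empty");
        }
        int fields = 1;
        long comparisons = 0;
        int i = 0;
        while (i < data.length) {
            int matched = 0;
            while (matched < delimiter.length && i + matched < data.length) {
                comparisons++;
                if (data[i + matched] != delimiter[matched]) {
                    break; // mismatch: give up on this position
                }
                matched++;
            }
            if (matched == delimiter.length) {
                fields++;              // full delimiter found
                i += delimiter.length; // continue after it
            } else {
                i++;                   // partial or no match: advance by one byte
            }
        }
        return new ScanResult(fields, comparisons);
    }

    public static void main(String[] args) {
        byte[] delimiter = "##".getBytes(StandardCharsets.US_ASCII);
        // Same length and same number of fields; only the second input
        // contains the delimiter prefix "#" inside the regular data.
        ScanResult rare = scan("abcd##efgh".getBytes(StandardCharsets.US_ASCII), delimiter);
        ScanResult frequent = scan("a#b#c##e#f".getBytes(StandardCharsets.US_ASCII), delimiter);
        System.out.println("prefix rare:     " + rare.comparisons + " comparisons");
        System.out.println("prefix frequent: " + frequent.comparisons + " comparisons");
    }
}
```

Every false start on a delimiter prefix costs extra comparisons before the
scanner can move on. When the prefix is rare in the data, the work stays
close to one comparison per byte, as in the single-char case.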

2014-10-15 16:07 GMT+02:00 Martin Neumann <[hidden email]>:

> Would changing it cost performance?
> If not I thing it would be a good change to make since it allows to (ab)use
> the csv reader to load structured Text files (for example by putting
> Keywords as delimiter).
>
> Being able to put a regular expression there would be even nicer but maybe
> it should end up in its own InputFormat then.
>
> cheers Martin

Fabian Hueske

Re: CsvInputFormat delimiter fields

I created FLINK-1168 for this feature request.

2014-10-16 11:28 GMT+02:00 Fabian Hueske <[hidden email]>:

> I don't think, that multi-char field delimiters would cause a performance
> problem. The data needs to be parsed anyway.
> Only in cases where the delimiter has a prefix that occurs often in the
> regular data, it could have a major impact.
>
> Fabian