Hej,
A lot of my inputs are CSV files, so I use the CsvInputFormat a lot. What I find kind of odd is that the line delimiter is a String but the field delimiter is a Character.

*see:* new CsvInputFormat<Tuple2<String,String>>(new Path(pVecPath),"\n",'\t',String.class,String.class)

Is there a reason for this? I'm currently working with a file that has a more complex field delimiter, so I had to write a mapper to read from StringInputFormat.

cheers
Martin

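For illustration, the core of such a mapper workaround could look roughly like the sketch below. It is plain Java, not the mapper from the thread (which is not shown), and it does not use any Flink classes. The class name, the method name, and the `::=::` delimiter are all hypothetical.

```java
import java.util.regex.Pattern;

class MultiCharSplit {
    // Hypothetical helper: splits one line on a literal delimiter of any length.
    static String[] splitLine(String line, String delimiter) {
        // Pattern.quote treats the delimiter as literal text, not as a regex.
        // The limit of -1 keeps trailing empty fields instead of dropping them.
        return line.split(Pattern.quote(delimiter), -1);
    }

    public static void main(String[] args) {
        // "::=::" is a made-up multi-character field delimiter.
        String[] fields = splitLine("key1::=::value1", "::=::");
        System.out.println(fields[0] + " / " + fields[1]);
    }
}
```

Each input line would be passed through such a helper inside the map function, and the resulting fields would be packed into the output tuple.
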
Hi!
The reason is the way the CSV parsers currently work. They are pushed into the byte stream parsing and are restricted to recognizing single-character delimiters. It is possible to change that, but it would be a bit of work.

Stephan

On Wed, Oct 15, 2014 at 3:36 PM, Martin Neumann <[hidden email]> wrote:
> Is there a reason for this? I'm currently working with a file that has a
> more complex field delimiter so I had to write a mapper to read from
> StringInputFormat.

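To make the restriction concrete, here is a minimal sketch of byte-level field splitting with a single delimiter byte. It is not the actual parser code, only an assumed simplification: every name in it is hypothetical, and it ignores quoting, escaping, and typed field parsing.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class SingleByteFieldScanner {
    // Splits the bytes of one record into fields at each occurrence of one delimiter byte.
    static List<String> splitFields(byte[] record, byte delimiter) {
        List<String> fields = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < record.length; i++) {
            // A single byte comparison decides whether a field ends here.
            if (record[i] == delimiter) {
                fields.add(new String(record, start, i - start, StandardCharsets.UTF_8));
                start = i + 1;
            }
        }
        // The remainder after the last delimiter is the final field.
        fields.add(new String(record, start, record.length - start, StandardCharsets.UTF_8));
        return fields;
    }

    public static void main(String[] args) {
        byte[] record = "id42\tsome text".getBytes(StandardCharsets.UTF_8);
        System.out.println(splitFields(record, (byte) '\t'));
    }
}
```

With a one-byte delimiter, the scanner never has to look ahead or back up, which is what makes this form easy to fold into the byte stream parsing.
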
Would changing it cost performance?
If not, I think it would be a good change to make, since it would allow (ab)using the CSV reader to load structured text files (for example, by using keywords as delimiters).

Being able to put a regular expression there would be even nicer, but maybe that should end up in its own InputFormat.

cheers
Martin

On Wed, Oct 15, 2014 at 3:47 PM, Stephan Ewen <[hidden email]> wrote:
> The reason is the current way the csv parsers work. They are pushed into
> the byte stream parsing and are restricted to recognize one char
> delimiters. It is possible to change that, but would be a bit of work.

I don't think that multi-char field delimiters would cause a performance problem. The data needs to be parsed anyway.
Only in cases where the delimiter has a prefix that occurs often in the regular data could it have a major impact.

Fabian

2014-10-15 16:07 GMT+02:00 Martin Neumann <[hidden email]>:
> Would changing it cost performance?

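The prefix effect can be seen in a small sketch of a naive multi-byte delimiter search. This is an assumed, simplified matcher written for illustration, not code from the project; the class, the method, and the comparison counter are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class MultiByteDelimiterScanner {
    // Counts byte comparisons so the cost of partial matches becomes visible.
    static long comparisons = 0;

    // Returns the index of the first occurrence of delim in data at or after from, or -1.
    static int indexOf(byte[] data, int from, byte[] delim) {
        for (int i = from; i + delim.length <= data.length; i++) {
            int j = 0;
            while (j < delim.length) {
                comparisons++;
                if (data[i + j] != delim[j]) {
                    break; // partial match failed, restart at the next position
                }
                j++;
            }
            if (j == delim.length) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] delim = "aab".getBytes(StandardCharsets.UTF_8);

        // Data full of the delimiter's prefix: most positions start a partial match.
        comparisons = 0;
        indexOf("aaaaaaab".getBytes(StandardCharsets.UTF_8), 0, delim);
        System.out.println("prefix-heavy data: " + comparisons + " comparisons");

        // Data without the prefix: each position is rejected after one comparison.
        comparisons = 0;
        indexOf("xxxxxaab".getBytes(StandardCharsets.UTF_8), 0, delim);
        System.out.println("prefix-free data:  " + comparisons + " comparisons");
    }
}
```

Both inputs have the same length and contain the delimiter at the same position, yet the prefix-heavy input needs more comparisons because many positions match the first bytes of the delimiter before failing.
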
I created FLINK-1168 for this feature request.

2014-10-16 11:28 GMT+02:00 Fabian Hueske <[hidden email]>:
> I don't think, that multi-char field delimiters would cause a performance
> problem. The data needs to be parsed anyway.
> Only in cases where the delimiter has a prefix that occurs often in the
> regular data, it could have a major impact.