Discuss about [FLINK-10134] UTF-16 support for TextInputFormat bug fixed

x1q1j1
Dear everyone,

I would like to discuss this Jira issue with everyone so that we can handle it better. Here are some of my thoughts:

1. Where should the BOM be read? I think we still need to add logic that processes the BOM when reading starts at the beginning of the file, together with a variable that records the encoding indicated by the BOM. For example, this could be done in the function createInputSplits.
2. We can then use that variable to determine which case applies (UTF-8 with BOM, UTF-16 with BOM, or UTF-32 with BOM) and choose the byte size according to the encoding when handling the end of each line. I found that the previous bug is actually an encoding problem: the end of each record was handled improperly. To address this, I made the following change:



String utf8 = "UTF-8";
String utf16 = "UTF-16";
String utf32 = "UTF-32";

// Width in bytes of one code unit in the configured charset
// (stays 0 if the charset name matches none of the three below)
int stepSize = 0;
String charsetName = this.getCharsetName();
if (charsetName.contains(utf8)) {
    stepSize = 1;
} else if (charsetName.contains(utf16)) {
    stepSize = 2;
} else if (charsetName.contains(utf32)) {
    stepSize = 4;
}

// Check if \n is used as delimiter and the end of this line is a \r, then remove \r from the line
if (this.getDelimiter() != null && this.getDelimiter().length == 1
        && this.getDelimiter()[0] == NEW_LINE && offset + numBytes >= stepSize
        && bytes[offset + numBytes - stepSize] == CARRIAGE_RETURN) {
    numBytes -= stepSize;
}
numBytes = numBytes - stepSize + 1;
return new String(bytes, offset, numBytes, this.getCharsetName());
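
As a separate illustration of point 2 (this is not the code in the PR), here is a minimal sketch that derives the step size by encoding \r in the target charset instead of hard-coding it per charset name. The names LineEndSketch and decodeRecord are made up for this example. It assumes the charset has an explicit byte order (UTF-8, UTF-16LE, UTF-16BE) and that the record was already split on the fully encoded \n.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class LineEndSketch {

    /**
     * Decodes one record and drops a trailing carriage return, if present.
     * Assumption: cs has an explicit byte order, so encoding "\r" writes no BOM.
     */
    static String decodeRecord(byte[] bytes, int offset, int numBytes, Charset cs) {
        // "\r" encodes to 1 byte in UTF-8 and to 2 bytes in UTF-16LE/BE,
        // so the length of this array plays the role of the step size.
        byte[] cr = "\r".getBytes(cs);
        int tail = offset + numBytes - cr.length;
        if (numBytes >= cr.length && regionEquals(bytes, tail, cr)) {
            numBytes -= cr.length;
        }
        return new String(bytes, offset, numBytes, cs);
    }

    /** Compares expected against the bytes starting at index start. */
    private static boolean regionEquals(byte[] bytes, int start, byte[] expected) {
        for (int i = 0; i < expected.length; i++) {
            if (bytes[start + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] le = "hello\r".getBytes(StandardCharsets.UTF_16LE);
        byte[] u8 = "hello\r".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeRecord(le, 0, le.length, StandardCharsets.UTF_16LE));
        System.out.println(decodeRecord(u8, 0, u8.length, StandardCharsets.UTF_8));
    }
}
```

With the plain "UTF-16" charset, Java's encoder prepends a BOM, so "\r".getBytes(StandardCharsets.UTF_16) returns 4 bytes rather than 2. That is one more reason to resolve the byte order from the BOM first.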




       


If anything is still unclear, the detailed implementation is in the PR I submitted.
Here is the link to the PR: https://github.com/apache/flink/pull/6710
Here is the link to the Jira issue: https://issues.apache.org/jira/browse/FLINK-10134
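
For point 1, here is a minimal sketch of how the BOM could be recognized from the first bytes of a file. It is only an illustration, not the implementation in the PR. BomSketch, Bom and detect are hypothetical names, and the UTF-32 charsets are looked up by name with Charset.forName, which assumes the running JDK provides them.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class BomSketch {

    /** Charset implied by the BOM, plus the number of BOM bytes to skip. */
    static final class Bom {
        final Charset charset;
        final int length;

        Bom(Charset charset, int length) {
            this.charset = charset;
            this.length = length;
        }
    }

    /** Inspects the first bytes of a file; returns the fallback with length 0 if no BOM is found. */
    static Bom detect(byte[] head, Charset fallback) {
        // UTF-32 is tested before UTF-16 because the UTF-32LE BOM (FF FE 00 00)
        // begins with the UTF-16LE BOM (FF FE).
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) {
            return new Bom(Charset.forName("UTF-32BE"), 4);
        }
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) {
            return new Bom(Charset.forName("UTF-32LE"), 4);
        }
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) {
            return new Bom(StandardCharsets.UTF_8, 3);
        }
        if (startsWith(head, 0xFE, 0xFF)) {
            return new Bom(StandardCharsets.UTF_16BE, 2);
        }
        if (startsWith(head, 0xFF, 0xFE)) {
            return new Bom(StandardCharsets.UTF_16LE, 2);
        }
        return new Bom(fallback, 0);
    }

    /** True if head begins with the given unsigned byte values. */
    private static boolean startsWith(byte[] head, int... bom) {
        if (head.length < bom.length) {
            return false;
        }
        for (int i = 0; i < bom.length; i++) {
            if ((head[i] & 0xFF) != bom[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = {(byte) 0xFF, (byte) 0xFE, 'a', 0};
        Bom bom = detect(head, StandardCharsets.UTF_8);
        System.out.println(bom.charset + " / skip " + bom.length + " bytes");
    }
}
```

A BOM only exists at the very start of the file, so only the reader of the split at offset 0 can see it. The detected charset would therefore have to be recorded once and made available to the other splits, which is why I suggest doing this while the input splits are created.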



Looking forward to your reply.

Best wishes,
qianjinxu

Re: Discuss about [FLINK-10134] UTF-16 support for TextInputFormat bug fixed

Chesnay Schepler-3
Please move this discussion to the PR. There's little value in
spreading discussions over several channels; any insight raised here
should also be visible in the PR.

On 28.09.2018 07:07, x1q1j1 wrote:

> Dear everyone,
>
> I would like to discuss this Jira issue with everyone so that we can handle it better. Here are some of my thoughts:
>
> 1. Where should the BOM be read? I think we still need to add logic that processes the BOM when reading starts at the beginning of the file, together with a variable that records the encoding indicated by the BOM. For example, this could be done in the function createInputSplits.
> 2. We can then use that variable to determine which case applies (UTF-8 with BOM, UTF-16 with BOM, or UTF-32 with BOM) and choose the byte size according to the encoding when handling the end of each line. I found that the previous bug is actually an encoding problem: the end of each record was handled improperly. To address this, I made the following change:
>
> String utf8 = "UTF-8";
> String utf16 = "UTF-16";
> String utf32 = "UTF-32";
> int stepSize = 0;
> String charsetName = this.getCharsetName();
> if (charsetName.contains(utf8)) {
>     stepSize = 1;
> } else if (charsetName.contains(utf16)) {
>     stepSize = 2;
> } else if (charsetName.contains(utf32)) {
>     stepSize = 4;
> }
> // Check if \n is used as delimiter and the end of this line is a \r, then remove \r from the line
> if (this.getDelimiter() != null && this.getDelimiter().length == 1
>         && this.getDelimiter()[0] == NEW_LINE && offset + numBytes >= stepSize
>         && bytes[offset + numBytes - stepSize] == CARRIAGE_RETURN) {
>     numBytes -= stepSize;
> }
> numBytes = numBytes - stepSize + 1;
> return new String(bytes, offset, numBytes, this.getCharsetName());
>
> If anything is still unclear, the detailed implementation is in the PR I submitted.
> Here is the link to the PR: https://github.com/apache/flink/pull/6710
> Here is the link to the Jira issue: https://issues.apache.org/jira/browse/FLINK-10134
>
> Looking forward to your reply.
>
> Best wishes,
> qianjinxu