(DEPRECATED) Apache Flink Mailing List archive.

Discuss about [FLINK-10134] UTF-16 support for TextInputFormat bug fixed

x1q1j1

Discuss about [FLINK-10134] UTF-16 support for TextInputFormat bug fixed

Dear everyone,

I would like to discuss this JIRA issue with everyone so that we can handle it better. Here are some of my thoughts:

1. Where should the BOM be read? I think that when reading starts at the beginning of the file, we still need to add logic for processing the BOM, along with a variable that records the encoding the BOM indicates. For example, this could go in the function createInputSplits.
2. We can use that variable to determine the encoding (UTF-8 with BOM, UTF-16 with BOM, or UTF-32 with BOM) and set the byte step size according to the encoding type when handling the end of each line. I found that the previous bug is actually an encoding problem: the end of each record is handled improperly. To address this, I did the following work:

    String utf8 = "UTF-8";
    String utf16 = "UTF-16";
    String utf32 = "UTF-32";
    int stepSize = 0;
    String charsetName = this.getCharsetName();

    if (charsetName.contains(utf8)) {
        stepSize = 1;
    } else if (charsetName.contains(utf16)) {
        stepSize = 2;
    } else if (charsetName.contains(utf32)) {
        stepSize = 4;
    }

    // Check if \n is used as delimiter and the end of this line is a \r, then remove \r from the line
    if (this.getDelimiter() != null && this.getDelimiter().length == 1
            && this.getDelimiter()[0] == NEW_LINE && offset + numBytes >= stepSize
            && bytes[offset + numBytes - stepSize] == CARRIAGE_RETURN) {
        numBytes -= stepSize;
    }

    numBytes = numBytes - stepSize + 1;

    return new String(bytes, offset, numBytes, this.getCharsetName());
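
As a rough illustration of the two points above (this is not the code from the PR), the sketch below detects the encoding from the first bytes of a file and then drops a trailing carriage return by comparing against the CR as encoded in that charset, instead of a fixed step size. The names `BomSketch`, `detectBom`, and `lengthWithoutTrailingCr` are hypothetical and not part of Flink. It assumes the record starts on a code-unit boundary and that an explicit-endian charset such as UTF-16LE is used, since encoding with plain "UTF-16" in Java prepends a BOM.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Flink.
final class BomSketch {

    // Unsigned byte at index i, or -1 if the array is too short.
    private static int at(byte[] b, int i) {
        return i < b.length ? (b[i] & 0xFF) : -1;
    }

    /** Returns the charset implied by a leading BOM, or null if no BOM is found. */
    static Charset detectBom(byte[] head) {
        int b0 = at(head, 0), b1 = at(head, 1), b2 = at(head, 2), b3 = at(head, 3);
        // UTF-32 is checked first: FF FE 00 00 also begins with the UTF-16LE BOM.
        if (b0 == 0x00 && b1 == 0x00 && b2 == 0xFE && b3 == 0xFF) {
            return Charset.forName("UTF-32BE");
        }
        if (b0 == 0xFF && b1 == 0xFE && b2 == 0x00 && b3 == 0x00) {
            return Charset.forName("UTF-32LE");
        }
        if (b0 == 0xEF && b1 == 0xBB && b2 == 0xBF) {
            return StandardCharsets.UTF_8;
        }
        if (b0 == 0xFE && b1 == 0xFF) {
            return StandardCharsets.UTF_16BE;
        }
        if (b0 == 0xFF && b1 == 0xFE) {
            return StandardCharsets.UTF_16LE;
        }
        return null;
    }

    /** Returns numBytes minus the length of one trailing CR, if the record ends with one. */
    static int lengthWithoutTrailingCr(byte[] bytes, int offset, int numBytes, Charset charset) {
        byte[] cr = "\r".getBytes(charset); // 1, 2 or 4 bytes depending on the charset
        if (numBytes < cr.length) {
            return numBytes;
        }
        int start = offset + numBytes - cr.length;
        for (int i = 0; i < cr.length; i++) {
            if (bytes[start + i] != cr[i]) {
                return numBytes; // record does not end with an encoded CR
            }
        }
        return numBytes - cr.length;
    }

    public static void main(String[] args) {
        // UTF-16LE BOM followed by "a\r".
        byte[] file = {(byte) 0xFF, (byte) 0xFE, 'a', 0, '\r', 0};
        Charset cs = detectBom(file);
        int bomLen = 2; // length of the UTF-16 BOM in this example
        int len = lengthWithoutTrailingCr(file, bomLen, file.length - bomLen, cs);
        System.out.println(cs + " -> " + new String(file, bomLen, len, cs));
    }
}
```

Comparing the full encoded CR avoids depending on which byte of a multi-byte code unit holds 0x0D, which differs between big-endian and little-endian encodings.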

If anything I have described is unclear, you can see the detailed code implementation in the PR I submitted.
Here is the link to the PR: https://github.com/apache/flink/pull/6710
Here is the link to the JIRA issue: https://issues.apache.org/jira/browse/FLINK-10134

Looking forward to your reply.

Best wishes.
qianjinxu

Chesnay Schepler-3

Re: Discuss about [FLINK-10134] UTF-16 support for TextInputFormat bug fixed

Please move this discussion to the PR. There's little value in
spreading discussions over several channels; any insight raised here
should also be visible in the PR.

On 28.09.2018 07:07, x1q1j1 wrote:

> [...]