[jira] [Created] (FLINK-17438) Flink StreamingFileSink chinese garbled

classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

[jira] [Created] (FLINK-17438) Flink StreamingFileSink chinese garbled

Shang Yuanchun (Jira)
颖 created FLINK-17438:
-------------------------

             Summary: Flink StreamingFileSink chinese garbled
                 Key: FLINK-17438
                 URL: https://issues.apache.org/jira/browse/FLINK-17438
             Project: Flink
          Issue Type: Bug
          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
    Affects Versions: 1.10.0
         Environment: CDH6.0.1 hadoop3.0.0 Flink 1.10.0 
            Reporter: 颖


val writer:CompressWriterFactory[String] = new CompressWriterFactory[String](new DefaultExtractor[String]())
 .withHadoopCompression(s"SnappyCodec")//${compress}

 val fileConfig = OutputFileConfig.builder().withPartPrefix(s"${prefix}").withPartSuffix(s"${suffix}").build()

 val bulkFormatBuilder = StreamingFileSink.forBulkFormat(new Path(output), writer)
 // 自定义分桶策略
 bulkFormatBuilder.withBucketAssigner(new DemoAssigner())
 // 自定义输出文件配置
 bulkFormatBuilder.withOutputFileConfig(fileConfig)

 val sink = bulkFormatBuilder.build()

// val rollingPolicy = DefaultRollingPolicy.builder().withRolloverInterval(TimeUnit.MINUTES.toMillis(5)).withInactivityInterval(TimeUnit.MINUTES.toMillis(3)).withMaxPartSize(1 * 1024 * 1024)
// val bulkFormatBuilder = StreamingFileSink.forRowFormat(new Path(output), new SimpleStringEncoder[String]()).withRollingPolicy(rollingPolicy.build())
// val sink = bulkFormatBuilder.build()

 ds.map(_.log).addSink(sink).setParallelism(fileNum).name("snappy sink to hdfs")

 

In this way, flink API is called and written to HDFS. There are Chinese fields in the log, and the corresponding scrambled code is after hive is resolved,

CREATE EXTERNAL TABLE `demo_app`(
 `str` string COMMENT '原始记录json')
COMMENT 'app flink埋点日志'
PARTITIONED BY (
 `ymd` string COMMENT '日期分区yyyymmdd')
ROW FORMAT SERDE
 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
 'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
 'hdfs://nameservice1/user/xxx/inke_back.db'

kafka source data :

{"name":"inke.dcc.flume.collect","type":"flume","status":"完成","batchDuration":3000,"proccessDelay":0,"shedulerDelay":0,"topic":"newserverlog_opd_operate_log","endpoint":"ali-a-opd-script01.bj","batchId":"xxx","batchTime":1588065997320,"numRecords":-1,"numBytes":-1,"totalRecords":0,"totalBytes":0,"ipAddr":"10.111.27.230"}

 

hive data :

{"name":"inke.dcc.flume.collect","type":"flume","status":"������","batchDuration":3000,"proccessDelay":0,"shedulerDelay":0,"topic":"newserverlog_opd_operate_log","endpoint":"ali-a-opd-script01.bj","batchId":"xxx","batchTime":1588065997320,"numRecords":-1,"numBytes":-1,"totalRecords":0,"totalBytes":0,"ipAddr":"10.111.27.230"}

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)