Fwd: Verifying correctness of StreamingFileSink (Kafka -> S3)


Kostas Kloudas-5
Hi Amran,

I am also including the public ML because this may be interesting to
other users as well.

You can always write your input stream to two sinks, as shown below:

DataStream<...> myInputFromKafka =
    env.addSource(new FlinkKafkaConsumerVERSION(...));

final StreamingFileSink<String> sinkA = StreamingFileSink
    .forRowFormat(new Path(outputPathA), new SimpleStringEncoder<String>("UTF-8"))
    .withBucketAssigner(assignerA)
    .build();

final StreamingFileSink<String> sinkB = StreamingFileSink
    .forRowFormat(new Path(outputPathB), new SimpleStringEncoder<String>("UTF-8"))
    .withBucketAssigner(assignerB)
    .build();

myInputFromKafka.addSink(sinkA);
myInputFromKafka.addSink(sinkB);
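
If sinkB should carry only the offset metadata rather than the raw payload
(as in your question below), you can split the same input before the sinks.
A rough sketch, assuming the consumer emits Tuple3<partition, offset, payload>
tuples (e.g. via a KafkaDeserializationSchema; the names here are illustrative):

DataStream<Tuple3<Integer, Long, String>> recordsWithMeta = ...; // partition, offset, payload

// Raw payloads keep the existing format and go to sinkA.
recordsWithMeta
    .map(t -> t.f2)
    .returns(Types.STRING)
    .addSink(sinkA);

// Only "partition,offset" lines go to sinkB, for integrity checks.
recordsWithMeta
    .map(t -> t.f0 + "," + t.f1)
    .returns(Types.STRING)
    .addSink(sinkB);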

Cheers,
Kostas

---------- Forwarded message ---------
From: amran dean <[hidden email]>
Date: Wed, Oct 16, 2019 at 8:43 PM
Subject: Re: Verifying correctness of StreamingFileSink (Kafka -> S3)
To: Kostas Kloudas <[hidden email]>


Hi Kostas,
Suppose there is a requirement that the current S3 object format cannot
change (to avoid client migration).
Would it be possible to achieve a sort of dual-write, where Kafka
records are written to two separate S3 prefixes:

/bucket/topic/data/dt=2019-10-16/part-x-xx...
- One prefix holds raw record data in the format clients already expect.

/bucket/topic/metadata/dt=2019-10-16/partition_x/part-x-xx...
- A second prefix holds only offset data, for data-integrity verification
via periodic checks (i.e., no "holes").
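
Concretely, I picture the data prefix coming from the built-in time-based
assigner and the metadata prefix from a small custom BucketAssigner, roughly
like the sketch below (untested; it assumes the metadata stream carries
(partition, offset, payload) tuples, and the class name is illustrative):

// Data prefix: buckets like dt=2019-10-16, so the client-facing layout stays as-is.
BucketAssigner<String, String> dataAssigner =
    new DateTimeBucketAssigner<>("'dt='yyyy-MM-dd");

// Metadata prefix: buckets like dt=2019-10-16/partition_3, holding only offset records.
public class OffsetMetadataBucketAssigner
        implements BucketAssigner<Tuple3<Integer, Long, String>, String> {

    // Static so the non-serializable formatter is not shipped with the assigner.
    private static final DateTimeFormatter DAY =
        DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    @Override
    public String getBucketId(Tuple3<Integer, Long, String> element, Context context) {
        String day = DAY.format(Instant.ofEpochMilli(context.currentProcessingTime()));
        return "dt=" + day + "/partition_" + element.f0;
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}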

Does StreamingFileSink permit this?

On Wed, Oct 16, 2019 at 12:34 AM Kostas Kloudas <[hidden email]> wrote:

>
> Hi Amran,
>
> If you want to know which partition your input data comes from,
> you can always have a separate bucket for each partition.
> As described in [1], you can extract the offset/partition/topic
> information for an incoming record and, based on this, decide the
> appropriate bucket to put the record in.
>
> Cheers,
> Kostas
>
> [1] https://ci.apache.org/projects/flink/flink-docs-stable/dev/connectors/kafka.html
>
> On Wed, Oct 16, 2019 at 4:00 AM amran dean <[hidden email]> wrote:
> >
> > I am evaluating StreamingFileSink (Kafka 0.10.11) as a production-ready alternative to a current Kafka -> S3 solution.
> >
> > Is there any way to verify the integrity of data written to S3? I'm confused about how the file names (e.g. part-1-17) map to Kafka partitions, and further unsure how to ensure that no Kafka records are lost (I know Flink guarantees exactly-once, but this is more of a sanity check).
> >
> >
> >
> >
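
For the "extract the offset/partition/topic information for an incoming
record" step mentioned above, a sketch of a KafkaDeserializationSchema that
keeps the partition and offset next to the payload could look like the
following (untested, and the class name is illustrative):

import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

import java.nio.charset.StandardCharsets;

public class RecordWithOffsetsSchema
        implements KafkaDeserializationSchema<Tuple3<Integer, Long, String>> {

    @Override
    public boolean isEndOfStream(Tuple3<Integer, Long, String> nextElement) {
        return false; // keep consuming indefinitely
    }

    @Override
    public Tuple3<Integer, Long, String> deserialize(ConsumerRecord<byte[], byte[]> record) {
        String payload = record.value() == null
            ? null
            : new String(record.value(), StandardCharsets.UTF_8);
        return Tuple3.of(record.partition(), record.offset(), payload);
    }

    @Override
    public TypeInformation<Tuple3<Integer, Long, String>> getProducedType() {
        return Types.TUPLE(Types.INT, Types.LONG, Types.STRING);
    }
}

Passing an instance of this schema to the Kafka consumer (instead of a plain
string schema) makes the partition and offset visible to downstream maps and
bucket assigners, which is what the per-partition bucketing and the
offset-only prefix discussed above rely on.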