Using Flink Streaming to write to multiple output files in HDFS

Andra Lungu
Hey guys,

Long time, no see :). I recently started a new job, and it involves
performing real-time data analytics using Apache Kafka, Storm and
Flume.

What happens, at a very high level, is that a set of signals is
collected, stored into a Kafka topic, and then Storm is used to filter
certain fields out or to enrich them with other meta-information.
Finally, Flume writes the output into multiple HDFS files depending on
the date, hour, etc.

Now, I saw that Flink can support a similar pipeline, but without
needing Flume for the writing-to-HDFS part (see
http://data-artisans.com/kafka-flink-a-practical-how-to/). Which
brings me to my question: how does Flink handle writing to multiple
files in a streaming fashion? (Until now, I was working with the batch
API, and writeAsCsv takes just one file as a parameter.)

Next question: what are the prerequisites for deploying a Flink
Streaming job on a cluster? YARN, HDFS, anything else?

Final question, more of a request: I'd like to play around with Flink
Streaming to determine whether it can substitute for Storm in this use
case and whether it can outrun it :P. To this end, I'll need some
starting points: docs, blog posts, examples to read. Any input would
be useful.

I wanted to dig up a newbie task in the streaming area, but I could
not find one... can we think of something easy to get me started?

Thanks! Hope you guys had fun at Flink Forward!
Andra

Re: Using Flink Streaming to write to multiple output files in HDFS

Aljoscha Krettek
Hi,
the documentation has a guide to the Streaming API:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html

This also contains a section about the rolling (HDFS) FileSystem sink:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#hadoop-filesystem
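
For example, a minimal sketch with that sink could look like the code
below (this assumes the RollingSink from the flink-connector-filesystem
module in 0.10; the HDFS path, bucket format, and batch size are
placeholders):

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
import org.apache.flink.streaming.connectors.fs.RollingSink;
import org.apache.flink.streaming.connectors.fs.StringWriter;

public class RollingSinkSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    // Checkpointing backs the sink's exactly-once guarantee.
    env.enableCheckpointing(5000);

    // Placeholder source; in your pipeline this would be the Kafka stream.
    DataStream<String> events = env.socketTextStream("localhost", 9999);

    RollingSink<String> sink = new RollingSink<String>("hdfs:///data/events");
    // One bucket directory per hour, like the date/hour split Flume gave you.
    sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HH"));
    sink.setWriter(new StringWriter<String>());
    sink.setBatchSize(1024 * 1024 * 400); // roll to a new part file at ~400 MB

    events.addSink(sink);
    env.execute("Rolling HDFS sink sketch");
  }
}

This also answers the "multiple files" part of your question: the
Bucketer decides which directory each element goes into, and part files
within a bucket are rolled when they reach the batch size.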

For blog entries I would suggest these:
 - http://data-artisans.com/real-time-stream-processing-the-next-step-for-apache-flink/
 - http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
 - http://data-artisans.com/kafka-flink-a-practical-how-to/

I don’t think we have any easy starter issues on the Streaming API right now, but some might come up in the future. :D

Cheers,
Aljoscha


Re: Using Flink Streaming to write to multiple output files in HDFS

Fabian Hueske
There are also training slides and programming exercises (incl. reference
solutions) for the DataStream API at

--> http://dataartisans.github.io/flink-training/

Cheers, Fabian


Re: Using Flink Streaming to write to multiple output files in HDFS

Robert Metzger
Hey Andra,

Were you able to answer your questions with Aljoscha's and Fabian's links?

Flink's streaming file sink is quite unique (compared to Flume) because it
supports exactly-once semantics. Also, the performance compared to Storm is
probably much better, so you can save a lot of resources.
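
The exactly-once part only requires that checkpointing is enabled on
the environment, e.g. (the 5000 ms interval below is an arbitrary
placeholder):

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env =
    StreamExecutionEnvironment.getExecutionEnvironment();
// A checkpoint (distributed snapshot) is drawn every 5 seconds; on
// failure, sources and the file sink recover from the last one
// consistently, so no record is written twice or lost.
env.enableCheckpointing(5000);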



Re: Using Flink Streaming to write to multiple output files in HDFS

Nyamath Ulla Khan
Hi Andra,

You can find some very interesting examples for Flink streaming with
Kafka (input/output) here:

https://flink.apache.org/news/2015/02/09/streaming-example.html
http://dataartisans.github.io/flink-training/exercises/ (contains most
of the different operator examples)
http://dataartisans.github.io/flink-training/exercises/rideCleansing.html

I hope this helps you get started with the Flink Streaming API.
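
As a quick taste, consuming a Kafka topic looks roughly like the sketch
below (this assumes the FlinkKafkaConsumer082 from the
flink-connector-kafka module; the broker/ZooKeeper addresses, topic
name, and group id are placeholders):

import java.util.Properties;

import org.apache.flink.api.common.functions.FilterFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer082;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaReadSketch {
  public static void main(String[] args) throws Exception {
    StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();
    env.enableCheckpointing(5000); // needed for exactly-once from Kafka

    Properties props = new Properties();
    props.setProperty("bootstrap.servers", "localhost:9092"); // placeholder
    props.setProperty("zookeeper.connect", "localhost:2181"); // placeholder
    props.setProperty("group.id", "flink-demo");              // placeholder

    DataStream<String> signals = env.addSource(
        new FlinkKafkaConsumer082<String>(
            "signals", new SimpleStringSchema(), props));

    // The filtering/enrichment that Storm did before would go here.
    DataStream<String> filtered = signals.filter(new FilterFunction<String>() {
      @Override
      public boolean filter(String value) {
        return !value.isEmpty();
      }
    });

    filtered.print();
    env.execute("Kafka read sketch");
  }
}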

Cheers,
Nyamath Ulla Khan
