Hey guys,
Long time, no see :). I recently started a new job that involves performing a set of real-time data analytics using Apache Kafka, Storm and Flume.

What happens, at a very high level, is that a set of signals is collected and stored in a Kafka topic, and then Storm is used to filter certain fields out or to enrich the fields with other meta-information. Finally, Flume writes the output into multiple HDFS files depending on the date, hour, etc.

Now, I saw that Flink can handle a similar pipeline, but without needing Flume for the writing-to-HDFS part (see http://data-artisans.com/kafka-flink-a-practical-how-to/). Which brings me to my question: how does Flink handle writing to multiple files in a streaming fashion? Until now, I was playing with batch, and writeAsCsv just took one file as a parameter.

Next question: what are the prerequisites to deploy a Flink Streaming job on a cluster? YARN, HDFS, anything else?

Final question, more of a request: I'd like to play around with Flink Streaming to see whether it can substitute Storm in this use case and whether it can outrun it :P. To this end, I'll need some starting points: docs, blog posts, examples to read. Any input would be useful.

I wanted to dig for a newbie task in the streaming area, but I could not find one... can we think of something easy to get me started?

Thanks! Hope you guys had fun at Flink Forward!
Andra
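For reference, the batch writeAsCsv pattern mentioned above looks roughly like this; the paths and the map logic are placeholders, not the actual job:

```java
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;

public class BatchCsvExample {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read lines and turn each one into a (line, 1) pair -- a stand-in
        // for the real filtering/enrichment logic.
        DataSet<Tuple2<String, Integer>> result = env
                .readTextFile("hdfs:///input/signals")
                .map(new MapFunction<String, Tuple2<String, Integer>>() {
                    @Override
                    public Tuple2<String, Integer> map(String line) {
                        return new Tuple2<String, Integer>(line, 1);
                    }
                });

        // writeAsCsv takes a single output path; there is no built-in
        // per-date/per-hour splitting like Flume's HDFS sink provides.
        result.writeAsCsv("hdfs:///output/result");

        env.execute("Batch CSV example");
    }
}
```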
Hi,
the documentation has a guide about the Streaming API:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html

It also contains a section about the rolling (HDFS) FileSystem sink:
https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming_guide.html#hadoop-filesystem

For blog entries, I would suggest these:
- http://data-artisans.com/real-time-stream-processing-the-next-step-for-apache-flink/
- http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/
- http://data-artisans.com/kafka-flink-a-practical-how-to/

I don't think we have any easy starter issues on the Streaming API right now, but some might come up in the future. :D

Cheers,
Aljoscha
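To make the rolling (HDFS) sink concrete, here is a minimal sketch of date/hour-bucketed output, assuming the RollingSink and DateTimeBucketer described in the linked documentation section (flink-connector-filesystem module); the base path, source, and bucket format string are placeholders:

```java
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
import org.apache.flink.streaming.connectors.fs.RollingSink;

public class RollingSinkExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Placeholder source; in the real pipeline this would be Kafka.
        DataStream<String> signals = env.socketTextStream("localhost", 9999);

        // The RollingSink splits its output into "buckets"; the
        // DateTimeBucketer derives the bucket path from the current time,
        // giving per-date/per-hour directories, much like Flume's HDFS sink.
        RollingSink<String> sink = new RollingSink<String>("hdfs:///output/signals");
        sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd--HH"));

        signals.addSink(sink);

        env.execute("Rolling sink example");
    }
}
```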
There are also training slides and programming exercises (incl. reference
solutions) for the DataStream API at:
http://dataartisans.github.io/flink-training/

Cheers,
Fabian
Hey Andra,
were you able to answer your questions with Aljoscha's and Fabian's links?

Flink's streaming file sink is quite unique (compared to Flume) because it supports exactly-once semantics. Also, the performance compared to Storm is probably much better, so you can save a lot of resources.
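For illustration, the exactly-once guarantee of the file sink relies on Flink's checkpointing, which is enabled on the execution environment. A minimal sketch (the 5-second interval is an arbitrary choice, and the pipeline body is a placeholder):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointingExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Draw a checkpoint every 5 seconds. The rolling file sink
        // coordinates with these checkpoints so that, on recovery,
        // partially written files are not duplicated -- this is where the
        // exactly-once guarantee comes from.
        env.enableCheckpointing(5000);

        // Minimal pipeline just to make the job runnable; in practice this
        // would be the Kafka source and the rolling HDFS sink.
        env.fromElements("signal-1", "signal-2", "signal-3").print();

        env.execute("Checkpointed pipeline");
    }
}
```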
Hi Andra,
You can find a very interesting example of Flink streaming with Kafka (input/output) here:
https://flink.apache.org/news/2015/02/09/streaming-example.html

The training exercises cover most of the different operators:
http://dataartisans.github.io/flink-training/exercises/
http://dataartisans.github.io/flink-training/exercises/rideCleansing.html

I hope this helps you get started with the Flink Streaming API.
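As a sketch of the Kafka input side covered by that example, assuming the FlinkKafkaConsumer082 connector from this era of Flink; the topic name, broker/ZooKeeper addresses, and group id are placeholders:

```java
import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer082;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaSourceExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "localhost:9092");
        props.setProperty("zookeeper.connect", "localhost:2181");
        props.setProperty("group.id", "flink-demo");

        // Read the raw signals from a Kafka topic as plain strings.
        DataStream<String> signals = env.addSource(
                new FlinkKafkaConsumer082<String>(
                        "signals", new SimpleStringSchema(), props));

        signals.print();

        env.execute("Kafka source example");
    }
}
```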
Thanks and Regards,
Nyamath Ulla Khan