Hi,
Good afternoon! I work as an engineer at Symantec. My team works on a multi-tenant event processing system. As high-level background: our customers write data to Kafka brokers through agents like Logstash, and we process the events and save the log data in Elasticsearch and S3.

Use case: We write batches of events to S3 when a file size limit of 1 MB (specific to our case) or a certain time threshold is reached. We are planning to merge the files in a folder into a single file on a time interval, such as every 24 hours.

We are considering the various options available today and would like to know whether Apache Flink can be used to serve this purpose.

Looking forward to hearing from you.

Thank you,
Suma Cherukuri
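For concreteness, the flush policy described above — write out a batch once it reaches 1 MB or once a time threshold passes — could look roughly like the following sketch. The 60-second threshold and the plain OutputStream target are made up for illustration; the real system writes S3 objects.

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

/** Sketch of a batcher that flushes on a size OR a time threshold. */
class SizeOrTimeBatcher {
    private static final long MAX_BYTES = 1024 * 1024; // the 1 MB limit from the use case
    private static final long MAX_AGE_MS = 60_000;     // hypothetical time threshold

    private final OutputStream out; // stands in for an S3 object upload
    private long bufferedBytes;
    private long batchStartMs = System.currentTimeMillis();

    SizeOrTimeBatcher(OutputStream out) {
        this.out = out;
    }

    void add(String event) throws IOException {
        byte[] bytes = (event + "\n").getBytes(StandardCharsets.UTF_8);
        out.write(bytes);
        bufferedBytes += bytes.length;
        if (bufferedBytes >= MAX_BYTES
                || System.currentTimeMillis() - batchStartMs >= MAX_AGE_MS) {
            out.flush(); // in the real system: close this S3 object and start a new one
            bufferedBytes = 0;
            batchStartMs = System.currentTimeMillis();
        }
    }
}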
In reply to this post by Suma Cherukuri
From the use case description, it seems like you are looking to aggregate files based on either a size threshold or a time threshold and ship them to S3, correct?

Flink might be overkill here; you could look at frameworks like Apache NiFi that have pre-built (and configurable) processors to do just what you are describing.
Hi Suma Cherukuri,
Apache Flink can certainly serve your use case very well. Here's why:

1) Apache Flink has connectors for Kafka and Elasticsearch, and it supports reading from and writing to the S3 file system.

2) Apache Flink includes a RollingSink which splits data into files with a configurable maximum file size. The RollingSink includes a "Bucketer" which lets you control when and how to create new directories or files.

3) Apache Flink's streaming API and runtime for event processing are among the most advanced out there (support for event time, windowing, and exactly-once semantics).

These are just first pointers. Please don't hesitate to ask more questions. I think we would need a bit more detail about your use case to understand how exactly you would use Apache Flink.

Best,
Max
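To make points 1) and 2) concrete, here is a minimal sketch of a job that reads from Kafka and rolls files to S3, assuming the 1.1-era APIs (FlinkKafkaConsumer08 from the Kafka 0.8 connector, RollingSink from flink-connector-filesystem). The broker addresses, topic, and bucket path are made up.

import java.util.Properties;

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.fs.DateTimeBucketer;
import org.apache.flink.streaming.connectors.fs.RollingSink;
import org.apache.flink.streaming.connectors.fs.StringWriter;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer08;
import org.apache.flink.streaming.util.serialization.SimpleStringSchema;

public class KafkaToS3Job {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Hypothetical Kafka 0.8 settings; the 0.8 consumer also needs ZooKeeper.
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "broker1:9092");
        props.setProperty("zookeeper.connect", "zk1:2181");
        props.setProperty("group.id", "log-aggregator");

        DataStream<String> events = env.addSource(
                new FlinkKafkaConsumer08<>("logs", new SimpleStringSchema(), props));

        // One directory per day; roll a new part file whenever 1 MB is reached.
        RollingSink<String> sink = new RollingSink<>("s3://my-bucket/aggregated-logs");
        sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd"));
        sink.setWriter(new StringWriter<String>());
        sink.setBatchSize(1024 * 1024); // 1 MB, matching the use case

        events.addSink(sink);
        env.execute("Kafka to rolling S3 files");
    }
}

Here the Bucketer controls the directory layout (one folder per day in this sketch), while setBatchSize controls when a part file is rolled.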
Hi Suma Cherukuri,
From what I understand, you have many small files and you want to aggregate them into bigger ones containing the logs of the last 24 hours.

As Max said, the RollingSink will give you exactly-once semantics when writing your aggregated results to your file system.

As far as reading your input is concerned, Flink recently integrated functionality to periodically monitor a directory, e.g. your log directory, and process only the new files as they appear. This will be part of the 1.1 release, which is coming possibly during this week or the next, but you can always find it on the master branch. The method you need is:

readFile(FileInputFormat<OUT> inputFormat,
         String filePath,
         FileProcessingMode watchType,
         long interval,
         FilePathFilter filter,
         TypeInformation<OUT> typeInformation)

which allows you to specify the FileProcessingMode (which you should set to FileProcessingMode.PROCESS_CONTINUOUSLY) and the interval at which Flink is going to monitor the directory (path) for new files. In addition, you can find helper methods in the StreamExecutionEnvironment class that let you omit some of these parameters.
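As a rough sketch of how that method could be wired up — the directory path and the one-minute scan interval are hypothetical, and the package locations shown are the 1.1-era ones, which may move in later releases:

import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.FilePathFilter;
import org.apache.flink.streaming.api.functions.source.FileProcessingMode;

public class MonitorLogDirectory {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        String logDir = "s3://my-bucket/small-log-files"; // hypothetical input directory
        TextInputFormat format = new TextInputFormat(new Path(logDir));

        // Re-scan the directory every 60 s; only files that appeared
        // since the last scan are read.
        DataStream<String> lines = env.readFile(
                format,
                logDir,
                FileProcessingMode.PROCESS_CONTINUOUSLY,
                60_000L,
                FilePathFilter.createDefaultFilter(),
                BasicTypeInfo.STRING_TYPE_INFO);

        lines.print(); // downstream: window per 24 h and hand off to a RollingSink
        env.execute("Continuously monitor log directory");
    }
}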
I believe that with the above two features (the RollingSink and the continuous monitoring source), Flink can be the tool for your job, as both of them also provide exactly-once guarantees.

I hope this helps. Let us know what you think,
Kostas
In reply to this post by Suma Cherukuri
Hi Suma Cherukuri,
I also replied to your question on the dev list, but I repeat the short version here just in case you missed it: the continuous monitoring file source (part of the upcoming 1.1 release) can read the new small files as they appear, and the RollingSink can write the aggregated output, with exactly-once guarantees on both ends.

I hope this helps. Let us know what you think,
Kostas