Flink solution to active - active Multi site cloud data ingestion

classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|

Flink solution to active - active Multi site cloud data ingestion

Melekh, Gregory-2
Hello Flink Experts.



We have Flink job consuming data from Kafka and ingest it to multi-site (Azure-east – Azure-west) replicated Cassandra.

Now we have to aggregate data hourly. The problem is that device X can report once on site A and once on site B. This means that some messages for that device, will be processed by Flink in site A and some messages will be processed on site B.

I want an aggregation result that will reflect all messages transmitted by specific device X.

Are there any best practices to handle multi-site ingestion?

Any idea how to handle the scenario above?

Thanks in advance.

Reply | Threaded
Open this post in threaded view
|

Re: Flink solution to active - active Multi site cloud data ingestion

Seth Wiesman-3
Hi Gregory,

The easiest solution would be to include the site in your key so that at
query time the rows from each site can be aggregated together.

Instead of <Key, Value>, the table would be <Key, Site, Value> and your
query would become Select sum(value) FROM table GROUP BY key;

Otherwise, you will need to get all that data into a single site to perform
a final aggregation prior to writing to Cassandra.

On Wed, May 15, 2019 at 3:45 AM Melekh, Gregory <[hidden email]>
wrote:

> Hello Flink Experts.
>
>
>
> We have Flink job consuming data from Kafka and ingest it to multi-site
> (Azure-east – Azure-west) replicated Cassandra.
>
> Now we have to aggregate data hourly. The problem is that device X can
> report once on site A and once on site B. This means that some messages for
> that device, will be processed by Flink in site A and some messages will be
> processed on site B.
>
> I want an aggregation result that will reflect all messages transmitted by
> specific device X.
>
> Are there any best practices to handle multi-site ingestion?
>
> Any idea how to handle the scenario above?
>
> Thanks in advance.
>
>

--

Seth Wiesman | Solutions Architect

+1 314 387 1463

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
Reply | Threaded
Open this post in threaded view
|

Re: Flink solution to active - active Multi site cloud data ingestion

Seth Wiesman-3
In the future, these kinds of questions are more appropriate for the user
mailing list ([hidden email]). Dev is for internal Flink development.

On Wed, May 15, 2019 at 12:06 PM Seth Wiesman <[hidden email]> wrote:

> Hi Gregory,
>
> The easiest solution would be to include the site in your key so that at
> query time the rows from each site can be aggregated together.
>
> Instead of <Key, Value>, the table would be <Key, Site, Value> and your
> query would become Select sum(value) FROM table GROUP BY key;
>
> Otherwise, you will need to get all that data into a single site to
> perform a final aggregation prior to writing to Cassandra.
>
> On Wed, May 15, 2019 at 3:45 AM Melekh, Gregory <
> [hidden email]> wrote:
>
>> Hello Flink Experts.
>>
>>
>>
>> We have Flink job consuming data from Kafka and ingest it to multi-site
>> (Azure-east – Azure-west) replicated Cassandra.
>>
>> Now we have to aggregate data hourly. The problem is that device X can
>> report once on site A and once on site B. This means that some messages for
>> that device, will be processed by Flink in site A and some messages will be
>> processed on site B.
>>
>> I want an aggregation result that will reflect all messages transmitted
>> by specific device X.
>>
>> Are there any best practices to handle multi-site ingestion?
>>
>> Any idea how to handle the scenario above?
>>
>> Thanks in advance.
>>
>>
>
> --
>
> Seth Wiesman | Solutions Architect
>
> +1 314 387 1463
>
> <https://www.ververica.com/>
>
> Follow us @VervericaData
>
> --
>
> Join Flink Forward <https://flink-forward.org/> - The Apache Flink
> Conference
>
> Stream Processing | Event Driven | Real Time
>
> --
>
> Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
>
> --
> Data Artisans GmbH
> Registered at Amtsgericht Charlottenburg: HRB 158244 B
> Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
>


--

Seth Wiesman | Solutions Architect

+1 314 387 1463

<https://www.ververica.com/>

Follow us @VervericaData

--

Join Flink Forward <https://flink-forward.org/> - The Apache Flink
Conference

Stream Processing | Event Driven | Real Time

--

Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany

--
Data Artisans GmbH
Registered at Amtsgericht Charlottenburg: HRB 158244 B
Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
Reply | Threaded
Open this post in threaded view
|

Re: Flink solution to active - active Multi site cloud data ingestion

Melekh, Gregory-2
Thanks for your answer.
We also thought about this solution, but finally we rejected it.
In additional use case when needed to get unique reporting devices during time range this solution can't help us.

Because could happen fallowing:
East-Site reported devices: 1,2,3
West-Site reported devices: 2,4,5

3 devices on East-Site and 3 devices on West-Site are not distinct 6 devices.







On 15/05/2019, 20:40, "Seth Wiesman" <[hidden email]> wrote:

    In the future, these kinds of questions are more appropriate for the user
    mailing list ([hidden email]). Dev is for internal Flink development.
   
    On Wed, May 15, 2019 at 12:06 PM Seth Wiesman <[hidden email]> wrote:
   
    > Hi Gregory,
    >
    > The easiest solution would be to include the site in your key so that at
    > query time the rows from each site can be aggregated together.
    >
    > Instead of <Key, Value>, the table would be <Key, Site, Value> and your
    > query would become Select sum(value) FROM table GROUP BY key;
    >
    > Otherwise, you will need to get all that data into a single site to
    > perform a final aggregation prior to writing to Cassandra.
    >
    > On Wed, May 15, 2019 at 3:45 AM Melekh, Gregory <
    > [hidden email]> wrote:
    >
    >> Hello Flink Experts.
    >>
    >>
    >>
    >> We have Flink job consuming data from Kafka and ingest it to multi-site
    >> (Azure-east – Azure-west) replicated Cassandra.
    >>
    >> Now we have to aggregate data hourly. The problem is that device X can
    >> report once on site A and once on site B. This means that some messages for
    >> that device, will be processed by Flink in site A and some messages will be
    >> processed on site B.
    >>
    >> I want an aggregation result that will reflect all messages transmitted
    >> by specific device X.
    >>
    >> Are there any best practices to handle multi-site ingestion?
    >>
    >> Any idea how to handle the scenario above?
    >>
    >> Thanks in advance.
    >>
    >>
    >
    > --
    >
    > Seth Wiesman | Solutions Architect
    >
    > +1 314 387 1463
    >
    > <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ververica.com_&d=DwIFaQ&c=LFYZ-o9_HUMeMTSQicvjIg&r=NXeRKule6ZRAMa70nBLA4dC9UIPKHnuHoD6Pxag2OLw&m=0AT5d1BW1kOW39xNxw5yWGj9l9VA6EljVioSwdPqxis&s=byZ1WkLo1Kygmp2LlY5jXlRqKKQMsQ0C4A2rIdEeFK4&e=>
    >
    > Follow us @VervericaData
    >
    > --
    >
    > Join Flink Forward <https://urldefense.proofpoint.com/v2/url?u=https-3A__flink-2Dforward.org_&d=DwIFaQ&c=LFYZ-o9_HUMeMTSQicvjIg&r=NXeRKule6ZRAMa70nBLA4dC9UIPKHnuHoD6Pxag2OLw&m=0AT5d1BW1kOW39xNxw5yWGj9l9VA6EljVioSwdPqxis&s=8IhuyN-y3mvy1rW4YLHtKRdZyFbDCT9r0enMvr94Vbc&e=> - The Apache Flink
    > Conference
    >
    > Stream Processing | Event Driven | Real Time
    >
    > --
    >
    > Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
    >
    > --
    > Data Artisans GmbH
    > Registered at Amtsgericht Charlottenburg: HRB 158244 B
    > Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen
    >
   
   
    --
   
    Seth Wiesman | Solutions Architect
   
    +1 314 387 1463
   
    <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ververica.com_&d=DwIFaQ&c=LFYZ-o9_HUMeMTSQicvjIg&r=NXeRKule6ZRAMa70nBLA4dC9UIPKHnuHoD6Pxag2OLw&m=0AT5d1BW1kOW39xNxw5yWGj9l9VA6EljVioSwdPqxis&s=byZ1WkLo1Kygmp2LlY5jXlRqKKQMsQ0C4A2rIdEeFK4&e=>
   
    Follow us @VervericaData
   
    --
   
    Join Flink Forward <https://urldefense.proofpoint.com/v2/url?u=https-3A__flink-2Dforward.org_&d=DwIFaQ&c=LFYZ-o9_HUMeMTSQicvjIg&r=NXeRKule6ZRAMa70nBLA4dC9UIPKHnuHoD6Pxag2OLw&m=0AT5d1BW1kOW39xNxw5yWGj9l9VA6EljVioSwdPqxis&s=8IhuyN-y3mvy1rW4YLHtKRdZyFbDCT9r0enMvr94Vbc&e=> - The Apache Flink
    Conference
   
    Stream Processing | Event Driven | Real Time
   
    --
   
    Data Artisans GmbH | Invalidenstrasse 115, 10115 Berlin, Germany
   
    --
    Data Artisans GmbH
    Registered at Amtsgericht Charlottenburg: HRB 158244 B
    Managing Directors: Dr. Kostas Tzoumas, Dr. Stephan Ewen