Hello Flink Experts.
We have a Flink job that consumes data from Kafka and ingests it into a multi-site (Azure-east – Azure-west) replicated Cassandra cluster. Now we need to aggregate the data hourly. The problem is that device X can report once to site A and once to site B, which means that some messages for that device are processed by Flink on site A and some on site B. I want an aggregation result that reflects all messages transmitted by a specific device X.

Are there any best practices for handling multi-site ingestion? Any ideas on how to handle the scenario above?

Thanks in advance.
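For illustration, an hourly per-device aggregation running independently on each site might look roughly like the Flink SQL sketch below. The table and column names (kafka_readings, device_id, val, event_time) are assumptions for the example, not the actual job from this thread.

    -- Hourly per-device sum computed locally on one site (sketch only).
    SELECT
      device_id,
      TUMBLE_START(event_time, INTERVAL '1' HOUR) AS hour_start,
      SUM(val) AS hourly_value
    FROM kafka_readings
    GROUP BY
      device_id,
      TUMBLE(event_time, INTERVAL '1' HOUR);

Run per site, this produces only site-local results, which is exactly the gap described above.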
Hi Gregory,
The easiest solution would be to include the site in your key so that at query time the rows from each site can be aggregated together. Instead of <Key, Value>, the table would be <Key, Site, Value>, and your query would become SELECT sum(value) FROM table GROUP BY key.

Otherwise, you will need to get all of that data into a single site and perform the final aggregation there before writing to Cassandra.
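A minimal CQL sketch of that layout, with placeholder table and column names (not an existing schema):

    CREATE TABLE hourly_agg (
        device_id text,
        site      text,       -- e.g. 'east' or 'west', written by the local Flink job
        hour      timestamp,
        value     bigint,
        PRIMARY KEY ((device_id, hour), site)
    );

    -- Each site writes only its own partial sum; a read then combines the
    -- per-site rows of a single partition:
    SELECT sum(value)
    FROM hourly_agg
    WHERE device_id = 'X' AND hour = '2019-05-15 10:00:00+0000';

Aggregating across many devices in one query would need the GROUP BY form above (CQL supports GROUP BY on primary-key columns since Cassandra 3.10) or a merge in the application layer.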
In the future, these kinds of questions are more appropriate for the user mailing list ([hidden email]). Dev is for internal Flink development.
Thanks for your answer.
We also considered this solution, but ultimately rejected it. In an additional use case, where we need to get the unique reporting devices during a time range, this approach can't help us, because the following can happen:

East site reported devices: 1, 2, 3
West site reported devices: 2, 4, 5

Three devices on the East site plus three devices on the West site are not six distinct devices; device 2 reported to both sites, so there are only five.
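A small illustration of why per-site distinct counts cannot simply be summed (written as plain SQL for readability, since CQL has no UNION; the table names are hypothetical):

    -- east_devices holds 1, 2, 3 and west_devices holds 2, 4, 5.
    -- Summing per-site distinct counts gives 3 + 3 = 6, but the true
    -- cross-site distinct count is 5, because device 2 reported to both sites.
    SELECT count(*) AS distinct_devices
    FROM (
        SELECT device_id FROM east_devices
        UNION                             -- UNION (not UNION ALL) removes the duplicate device 2
        SELECT device_id FROM west_devices
    ) AS all_devices;                     -- returns 5, not 6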