I'm looking at using Flink for a streaming project that has to use some
internal systems as event sources. They are very similar to Kafka in their semantic. The data is partitioned and each partition can be replayed from a specified offset. The first system creates and deletes such partitions dynamically based on load. It provides an API to get list of partitions as well as their state (open, closed for append). The second system has a fixed set of a few thousand partitions, but they are allocated to a dynamic set of hosts and each host provides poll API that returns events from all partitions that currently reside on it. The metadata API that returns current mapping of partitions to hosts is provided. I found a thread <http://mail-archives.apache.org/mod_mbox/flink-user/201602.mbox/%3CCANjo42xgyUZAU=fmgGVFXVYMj7nVt67=3eJY=pWRc_nZdQ-EkA@...%3E> that mentioned that changing parallelism is one of the high priority items for this year. Has any work started on it? And would it support the type of dynamic sources we have? I could try adding such support myself if it would help to speed things up. Thanks, Maxim. |
Hi Maxim,
you can implement a source for the system you are describing without changing the parallelism of Flink. What you have to do is implement your own data sources for Flink. I would start by implementing the ParallelSourceFunction interface, where each parallel source instance is reading from a subset of servers. So basically one "flink partition" is reading from one or more partitions of your system. On Mon, Mar 7, 2016 at 9:18 PM, Maxim <[hidden email]> wrote: > I'm looking at using Flink for a streaming project that has to use some > internal systems as event sources. They are very similar to Kafka in their > semantic. The data is partitioned and each partition can be replayed from a > specified offset. > > The first system creates and deletes such partitions dynamically based on > load. It provides an API to get list of partitions as well as their state > (open, closed for append). > > The second system has a fixed set of a few thousand partitions, but they > are allocated to a dynamic set of hosts and each host provides poll API > that returns events from all partitions that currently reside on it. The > metadata API that returns current mapping of partitions to hosts is > provided. > > I found a thread > < > http://mail-archives.apache.org/mod_mbox/flink-user/201602.mbox/%3CCANjo42xgyUZAU=fmgGVFXVYMj7nVt67=3eJY=pWRc_nZdQ-EkA@...%3E > > > that > mentioned that changing parallelism is one of the high priority items for > this year. Has any work started on it? And would it support the type of > dynamic sources we have? > > I could try adding such support myself if it would help to speed things up. > > Thanks, > > Maxim. > |
Free forum by Nabble | Edit this page |