[DISCUSS] FLIP-27: Refactor Source Interface

classic Classic list List threaded Threaded
86 messages Options
12345
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Kostas Kloudas
Hi Biao,

Thanks for the answer!

So given the multi-threaded readers, now we have as open questions:

1) How do we let the checkpoints pass through our multi-threaded reader
operator?

2) Do we have separate reader and source operators or not? In the strategy
that has a separate source, the source operator has a parallelism of 1 and
is responsible for split recovery only.

For the first one, given also the constraints (blocking, finite queues,
etc), I do not have an answer yet.

For the 2nd, I think that we should go with separate operators for the
source and the readers, for the following reasons:

1) This is more aligned with a potential future improvement where the split
discovery becomes a responsibility of the JobManager and readers are
pooling more work from the JM.

2) The source is going to be the "single point of truth". It will know what
has been processed and what not. If the source and the readers are a single
operator with parallelism > 1, or in general, if the split discovery is
done by each task individually, then:
   i) we have to have a deterministic scheme for each reader to assign
splits to itself (e.g. mod subtaskId). This is not necessarily trivial for
all sources.
   ii) each reader would have to keep a copy of all its processed slpits
   iii) the state has to be a union state with a non-trivial merging logic
in order to support rescaling.

Two additional points that you raised above:

i) The point that you raised that we need to keep all splits (processed and
not-processed) I think is a bit of a strong requirement. This would imply
that for infinite sources the state will grow indefinitely. This is problem
is even more pronounced if we do not have a single source that assigns
splits to readers, as each reader will have its own copy of the state.

ii) it is true that for finite sources we need to somehow not close the
readers when the source/split discoverer finishes. The
ContinuousFileReaderOperator has a work-around for that. It is not elegant,
and checkpoints are not emitted after closing the source, but this, I
believe, is a bigger problem which requires more changes than just
refactoring the source interface.

Cheers,
Kostas
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Stephan Ewen
Thanks everyone for the lively discussion. Let me try to summarize where I
see convergence in the discussion and open issues.
I'll try to group this by design aspect of the source. Please let me know
if I got things wrong or missed something crucial here.

For issues 1-3, if the below reflects the state of the discussion, I would
try and update the FLIP in the next days.
For the remaining ones we need more discussion.

I would suggest to fork each of these aspects into a separate mail thread,
or will loose sight of the individual aspects.

*(1) Separation of Split Enumerator and Split Reader*

  - All seem to agree this is a good thing
  - Split Enumerator could in the end live on JobManager (and assign splits
via RPC) or in a task (and assign splits via data streams)
  - this discussion is orthogonal and should come later, when the interface
is agreed upon.

*(2) Split Readers for one or more splits*

  - Discussion seems to agree that we need to support one reader that
possibly handles multiple splits concurrently.
  - The requirement comes from sources where one poll()-style call fetches
data from different splits / partitions
    --> example sources that require that would be for example Kafka,
Pravega, Pulsar

  - Could have one split reader per source, or multiple split readers that
share the "poll()" function
  - To not make it too complicated, we can start with thinking about one
split reader for all splits initially and see if that covers all
requirements

*(3) Threading model of the Split Reader*

  - Most active part of the discussion ;-)

  - A non-blocking way for Flink's task code to interact with the source is
needed in order to a task runtime code based on a
single-threaded/actor-style task design
    --> I personally am a big proponent of that, it will help with
well-behaved checkpoints, efficiency, and simpler yet more robust runtime
code

  - Users care about simple abstraction, so as a subclass of SplitReader
(non-blocking / async) we need to have a BlockingSplitReader which will
form the basis of most source implementations. BlockingSplitReader lets
users do blocking simple poll() calls.
  - The BlockingSplitReader would spawn a thread (or more) and the
thread(s) can make blocking calls and hand over data buffers via a blocking
queue
  - This should allow us to cover both, a fully async runtime, and a simple
blocking interface for users.
  - This is actually very similar to how the Kafka connectors work. Kafka
9+ with one thread, Kafka 8 with multiple threads

  - On the base SplitReader (the async one), the non-blocking method that
gets the next chunk of data would signal data availability via a
CompletableFuture, because that gives the best flexibility (can await
completion or register notification handlers).
  - The source task would register a "thenHandle()" (or similar) on the
future to put a "take next data" task into the actor-style mailbox

*(4) Split Enumeration and Assignment*

  - Splits may be generated lazily, both in cases where there is a limited
number of splits (but very many), or splits are discovered over time
  - Assignment should also be lazy, to get better load balancing
  - Assignment needs support locality preferences

  - Possible design based on discussion so far:

    --> SplitReader has a method "addSplits(SplitT...)" to add one or more
splits. Some split readers might assume they have only one split ever,
concurrently, others assume multiple splits. (Note: idea behind being able
to add multiple splits at the same time is to ease startup where multiple
splits may be assigned instantly.)
    --> SplitReader has a context object on which it can call indicate when
splits are completed. The enumerator gets that notification and can use to
decide when to assign new splits. This should help both in cases of sources
that take splits lazily (file readers) and in case the source needs to
preserve a partial order between splits (Kinesis, Pravega, Pulsar may need
that).
    --> SplitEnumerator gets notification when SplitReaders start and when
they finish splits. They can decide at that moment to push more splits to
that reader
    --> The SplitEnumerator should probably be aware of the source
parallelism, to build its initial distribution.

  - Open question: Should the source expose something like "host
preferences", so that yarn/mesos/k8s can take this into account when
selecting a node to start a TM on?

*(5) Watermarks and event time alignment*

  - Watermark generation, as well as idleness, needs to be per split (like
currently in the Kafka Source, per partition)
  - It is desirable to support optional event-time-alignment, meaning that
splits that are ahead are back-pressured or temporarily unsubscribed

  - I think i would be desirable to encapsulate watermark generation logic
in watermark generators, for a separation of concerns. The watermark
generators should run per split.
  - Using watermark generators would also help with another problem of the
suggested interface, namely supporting non-periodic watermarks efficiently.

  - Need a way to "dispatch" next record to different watermark generators
  - Need a way to tell SplitReader to "suspend" a split until a certain
watermark is reached (event time backpressure)
  - This would in fact be not needed (and thus simpler) if we had a
SplitReader per split and may be a reason to re-open that discussion

*(6) Watermarks across splits and in the Split Enumerator*

  - The split enumerator may need some watermark awareness, which should be
purely based on split metadata (like create timestamp of file splits)
  - If there are still more splits with overlapping event time range for a
split reader, then that split reader should not advance the watermark
within the split beyond the overlap boundary. Otherwise future splits will
produce late data.

  - One way to approach this could be that the split enumerator may send
watermarks to the readers, and the readers cannot emit watermarks beyond
that received watermark.
  - Many split enumerators would simply immediately send Long.MAX out and
leave the progress purely to the split readers.

  - For event-time alignment / split back pressure, this begs the question
how we can avoid deadlocks that may arise when splits are suspended for
event time back pressure,

*(7) Batch and streaming Unification*

  - Functionality wise, the above design should support both
  - Batch often (mostly) does not care about reading "in order" and
generating watermarks
    --> Might use different enumerator logic that is more locality aware
and ignores event time order
    --> Does not generate watermarks
  - Would be great if bounded sources could be identified at compile time,
so that "env.addBoundedSource(...)" is type safe and can return a
"BoundedDataStream".
  - Possible to defer this discussion until later

*Miscellaneous Comments*

  - Should the source have a TypeInformation for the produced type, instead
of a serializer? We need a type information in the stream anyways, and can
derive the serializer from that. Plus, creating the serializer should
respect the ExecutionConfig.

  - The TypeSerializer interface is very powerful but also not easy to
implement. Its purpose is to handle data super efficiently, support
flexible ways of evolution, etc.
  For metadata I would suggest to look at the SimpleVersionedSerializer
instead, which is used for example for checkpoint master hooks, or for the
streaming file sink. I think that is is a good match for cases where we do
not need more than ser/deser (no copy, etc.) and don't need to push
versioning out of the serialization paths for best performance (as in the
TypeSerializer)


On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <[hidden email]>
wrote:

> Hi Biao,
>
> Thanks for the answer!
>
> So given the multi-threaded readers, now we have as open questions:
>
> 1) How do we let the checkpoints pass through our multi-threaded reader
> operator?
>
> 2) Do we have separate reader and source operators or not? In the strategy
> that has a separate source, the source operator has a parallelism of 1 and
> is responsible for split recovery only.
>
> For the first one, given also the constraints (blocking, finite queues,
> etc), I do not have an answer yet.
>
> For the 2nd, I think that we should go with separate operators for the
> source and the readers, for the following reasons:
>
> 1) This is more aligned with a potential future improvement where the split
> discovery becomes a responsibility of the JobManager and readers are
> pooling more work from the JM.
>
> 2) The source is going to be the "single point of truth". It will know what
> has been processed and what not. If the source and the readers are a single
> operator with parallelism > 1, or in general, if the split discovery is
> done by each task individually, then:
>    i) we have to have a deterministic scheme for each reader to assign
> splits to itself (e.g. mod subtaskId). This is not necessarily trivial for
> all sources.
>    ii) each reader would have to keep a copy of all its processed slpits
>    iii) the state has to be a union state with a non-trivial merging logic
> in order to support rescaling.
>
> Two additional points that you raised above:
>
> i) The point that you raised that we need to keep all splits (processed and
> not-processed) I think is a bit of a strong requirement. This would imply
> that for infinite sources the state will grow indefinitely. This is problem
> is even more pronounced if we do not have a single source that assigns
> splits to readers, as each reader will have its own copy of the state.
>
> ii) it is true that for finite sources we need to somehow not close the
> readers when the source/split discoverer finishes. The
> ContinuousFileReaderOperator has a work-around for that. It is not elegant,
> and checkpoints are not emitted after closing the source, but this, I
> believe, is a bigger problem which requires more changes than just
> refactoring the source interface.
>
> Cheers,
> Kostas
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Biao Liu
Hi community,
The summary of Stephan makes a lot sense to me. It is much clearer indeed
after splitting the complex topic into small ones.
I was wondering is there any detail plan for next step? If not, I would
like to push this thing forward by creating some JIRA issues.
Another question is that should version 1.8 include these features?

Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:

> Thanks everyone for the lively discussion. Let me try to summarize where I
> see convergence in the discussion and open issues.
> I'll try to group this by design aspect of the source. Please let me know
> if I got things wrong or missed something crucial here.
>
> For issues 1-3, if the below reflects the state of the discussion, I would
> try and update the FLIP in the next days.
> For the remaining ones we need more discussion.
>
> I would suggest to fork each of these aspects into a separate mail thread,
> or will loose sight of the individual aspects.
>
> *(1) Separation of Split Enumerator and Split Reader*
>
>   - All seem to agree this is a good thing
>   - Split Enumerator could in the end live on JobManager (and assign splits
> via RPC) or in a task (and assign splits via data streams)
>   - this discussion is orthogonal and should come later, when the interface
> is agreed upon.
>
> *(2) Split Readers for one or more splits*
>
>   - Discussion seems to agree that we need to support one reader that
> possibly handles multiple splits concurrently.
>   - The requirement comes from sources where one poll()-style call fetches
> data from different splits / partitions
>     --> example sources that require that would be for example Kafka,
> Pravega, Pulsar
>
>   - Could have one split reader per source, or multiple split readers that
> share the "poll()" function
>   - To not make it too complicated, we can start with thinking about one
> split reader for all splits initially and see if that covers all
> requirements
>
> *(3) Threading model of the Split Reader*
>
>   - Most active part of the discussion ;-)
>
>   - A non-blocking way for Flink's task code to interact with the source is
> needed in order to a task runtime code based on a
> single-threaded/actor-style task design
>     --> I personally am a big proponent of that, it will help with
> well-behaved checkpoints, efficiency, and simpler yet more robust runtime
> code
>
>   - Users care about simple abstraction, so as a subclass of SplitReader
> (non-blocking / async) we need to have a BlockingSplitReader which will
> form the basis of most source implementations. BlockingSplitReader lets
> users do blocking simple poll() calls.
>   - The BlockingSplitReader would spawn a thread (or more) and the
> thread(s) can make blocking calls and hand over data buffers via a blocking
> queue
>   - This should allow us to cover both, a fully async runtime, and a simple
> blocking interface for users.
>   - This is actually very similar to how the Kafka connectors work. Kafka
> 9+ with one thread, Kafka 8 with multiple threads
>
>   - On the base SplitReader (the async one), the non-blocking method that
> gets the next chunk of data would signal data availability via a
> CompletableFuture, because that gives the best flexibility (can await
> completion or register notification handlers).
>   - The source task would register a "thenHandle()" (or similar) on the
> future to put a "take next data" task into the actor-style mailbox
>
> *(4) Split Enumeration and Assignment*
>
>   - Splits may be generated lazily, both in cases where there is a limited
> number of splits (but very many), or splits are discovered over time
>   - Assignment should also be lazy, to get better load balancing
>   - Assignment needs support locality preferences
>
>   - Possible design based on discussion so far:
>
>     --> SplitReader has a method "addSplits(SplitT...)" to add one or more
> splits. Some split readers might assume they have only one split ever,
> concurrently, others assume multiple splits. (Note: idea behind being able
> to add multiple splits at the same time is to ease startup where multiple
> splits may be assigned instantly.)
>     --> SplitReader has a context object on which it can call indicate when
> splits are completed. The enumerator gets that notification and can use to
> decide when to assign new splits. This should help both in cases of sources
> that take splits lazily (file readers) and in case the source needs to
> preserve a partial order between splits (Kinesis, Pravega, Pulsar may need
> that).
>     --> SplitEnumerator gets notification when SplitReaders start and when
> they finish splits. They can decide at that moment to push more splits to
> that reader
>     --> The SplitEnumerator should probably be aware of the source
> parallelism, to build its initial distribution.
>
>   - Open question: Should the source expose something like "host
> preferences", so that yarn/mesos/k8s can take this into account when
> selecting a node to start a TM on?
>
> *(5) Watermarks and event time alignment*
>
>   - Watermark generation, as well as idleness, needs to be per split (like
> currently in the Kafka Source, per partition)
>   - It is desirable to support optional event-time-alignment, meaning that
> splits that are ahead are back-pressured or temporarily unsubscribed
>
>   - I think i would be desirable to encapsulate watermark generation logic
> in watermark generators, for a separation of concerns. The watermark
> generators should run per split.
>   - Using watermark generators would also help with another problem of the
> suggested interface, namely supporting non-periodic watermarks efficiently.
>
>   - Need a way to "dispatch" next record to different watermark generators
>   - Need a way to tell SplitReader to "suspend" a split until a certain
> watermark is reached (event time backpressure)
>   - This would in fact be not needed (and thus simpler) if we had a
> SplitReader per split and may be a reason to re-open that discussion
>
> *(6) Watermarks across splits and in the Split Enumerator*
>
>   - The split enumerator may need some watermark awareness, which should be
> purely based on split metadata (like create timestamp of file splits)
>   - If there are still more splits with overlapping event time range for a
> split reader, then that split reader should not advance the watermark
> within the split beyond the overlap boundary. Otherwise future splits will
> produce late data.
>
>   - One way to approach this could be that the split enumerator may send
> watermarks to the readers, and the readers cannot emit watermarks beyond
> that received watermark.
>   - Many split enumerators would simply immediately send Long.MAX out and
> leave the progress purely to the split readers.
>
>   - For event-time alignment / split back pressure, this begs the question
> how we can avoid deadlocks that may arise when splits are suspended for
> event time back pressure,
>
> *(7) Batch and streaming Unification*
>
>   - Functionality wise, the above design should support both
>   - Batch often (mostly) does not care about reading "in order" and
> generating watermarks
>     --> Might use different enumerator logic that is more locality aware
> and ignores event time order
>     --> Does not generate watermarks
>   - Would be great if bounded sources could be identified at compile time,
> so that "env.addBoundedSource(...)" is type safe and can return a
> "BoundedDataStream".
>   - Possible to defer this discussion until later
>
> *Miscellaneous Comments*
>
>   - Should the source have a TypeInformation for the produced type, instead
> of a serializer? We need a type information in the stream anyways, and can
> derive the serializer from that. Plus, creating the serializer should
> respect the ExecutionConfig.
>
>   - The TypeSerializer interface is very powerful but also not easy to
> implement. Its purpose is to handle data super efficiently, support
> flexible ways of evolution, etc.
>   For metadata I would suggest to look at the SimpleVersionedSerializer
> instead, which is used for example for checkpoint master hooks, or for the
> streaming file sink. I think that is is a good match for cases where we do
> not need more than ser/deser (no copy, etc.) and don't need to push
> versioning out of the serialization paths for best performance (as in the
> TypeSerializer)
>
>
> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
> [hidden email]>
> wrote:
>
> > Hi Biao,
> >
> > Thanks for the answer!
> >
> > So given the multi-threaded readers, now we have as open questions:
> >
> > 1) How do we let the checkpoints pass through our multi-threaded reader
> > operator?
> >
> > 2) Do we have separate reader and source operators or not? In the
> strategy
> > that has a separate source, the source operator has a parallelism of 1
> and
> > is responsible for split recovery only.
> >
> > For the first one, given also the constraints (blocking, finite queues,
> > etc), I do not have an answer yet.
> >
> > For the 2nd, I think that we should go with separate operators for the
> > source and the readers, for the following reasons:
> >
> > 1) This is more aligned with a potential future improvement where the
> split
> > discovery becomes a responsibility of the JobManager and readers are
> > pooling more work from the JM.
> >
> > 2) The source is going to be the "single point of truth". It will know
> what
> > has been processed and what not. If the source and the readers are a
> single
> > operator with parallelism > 1, or in general, if the split discovery is
> > done by each task individually, then:
> >    i) we have to have a deterministic scheme for each reader to assign
> > splits to itself (e.g. mod subtaskId). This is not necessarily trivial
> for
> > all sources.
> >    ii) each reader would have to keep a copy of all its processed slpits
> >    iii) the state has to be a union state with a non-trivial merging
> logic
> > in order to support rescaling.
> >
> > Two additional points that you raised above:
> >
> > i) The point that you raised that we need to keep all splits (processed
> and
> > not-processed) I think is a bit of a strong requirement. This would imply
> > that for infinite sources the state will grow indefinitely. This is
> problem
> > is even more pronounced if we do not have a single source that assigns
> > splits to readers, as each reader will have its own copy of the state.
> >
> > ii) it is true that for finite sources we need to somehow not close the
> > readers when the source/split discoverer finishes. The
> > ContinuousFileReaderOperator has a work-around for that. It is not
> elegant,
> > and checkpoints are not emitted after closing the source, but this, I
> > believe, is a bigger problem which requires more changes than just
> > refactoring the source interface.
> >
> > Cheers,
> > Kostas
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Stephan Ewen
Before creating any JIRA issues, we need to converge a bit further on the
design.
There are too many unsolved questions in the above summary.

I would try and come up with a next version of the interface proposal in
the coming days and
use that as the base to continue the discussion.

Whether this can be part of 1.8 or not depends on how fast we converge. If
the release interval is similar
to the past releases, we would see feature freeze in the mid of next month.

Best,
Stephan


On Sun, Jan 20, 2019 at 4:09 PM Biao Liu <[hidden email]> wrote:

> Hi community,
> The summary of Stephan makes a lot sense to me. It is much clearer indeed
> after splitting the complex topic into small ones.
> I was wondering is there any detail plan for next step? If not, I would
> like to push this thing forward by creating some JIRA issues.
> Another question is that should version 1.8 include these features?
>
> Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
>
> > Thanks everyone for the lively discussion. Let me try to summarize where
> I
> > see convergence in the discussion and open issues.
> > I'll try to group this by design aspect of the source. Please let me know
> > if I got things wrong or missed something crucial here.
> >
> > For issues 1-3, if the below reflects the state of the discussion, I
> would
> > try and update the FLIP in the next days.
> > For the remaining ones we need more discussion.
> >
> > I would suggest to fork each of these aspects into a separate mail
> thread,
> > or will loose sight of the individual aspects.
> >
> > *(1) Separation of Split Enumerator and Split Reader*
> >
> >   - All seem to agree this is a good thing
> >   - Split Enumerator could in the end live on JobManager (and assign
> splits
> > via RPC) or in a task (and assign splits via data streams)
> >   - this discussion is orthogonal and should come later, when the
> interface
> > is agreed upon.
> >
> > *(2) Split Readers for one or more splits*
> >
> >   - Discussion seems to agree that we need to support one reader that
> > possibly handles multiple splits concurrently.
> >   - The requirement comes from sources where one poll()-style call
> fetches
> > data from different splits / partitions
> >     --> example sources that require that would be for example Kafka,
> > Pravega, Pulsar
> >
> >   - Could have one split reader per source, or multiple split readers
> that
> > share the "poll()" function
> >   - To not make it too complicated, we can start with thinking about one
> > split reader for all splits initially and see if that covers all
> > requirements
> >
> > *(3) Threading model of the Split Reader*
> >
> >   - Most active part of the discussion ;-)
> >
> >   - A non-blocking way for Flink's task code to interact with the source
> is
> > needed in order to a task runtime code based on a
> > single-threaded/actor-style task design
> >     --> I personally am a big proponent of that, it will help with
> > well-behaved checkpoints, efficiency, and simpler yet more robust runtime
> > code
> >
> >   - Users care about simple abstraction, so as a subclass of SplitReader
> > (non-blocking / async) we need to have a BlockingSplitReader which will
> > form the basis of most source implementations. BlockingSplitReader lets
> > users do blocking simple poll() calls.
> >   - The BlockingSplitReader would spawn a thread (or more) and the
> > thread(s) can make blocking calls and hand over data buffers via a
> blocking
> > queue
> >   - This should allow us to cover both, a fully async runtime, and a
> simple
> > blocking interface for users.
> >   - This is actually very similar to how the Kafka connectors work. Kafka
> > 9+ with one thread, Kafka 8 with multiple threads
> >
> >   - On the base SplitReader (the async one), the non-blocking method that
> > gets the next chunk of data would signal data availability via a
> > CompletableFuture, because that gives the best flexibility (can await
> > completion or register notification handlers).
> >   - The source task would register a "thenHandle()" (or similar) on the
> > future to put a "take next data" task into the actor-style mailbox
> >
> > *(4) Split Enumeration and Assignment*
> >
> >   - Splits may be generated lazily, both in cases where there is a
> limited
> > number of splits (but very many), or splits are discovered over time
> >   - Assignment should also be lazy, to get better load balancing
> >   - Assignment needs support locality preferences
> >
> >   - Possible design based on discussion so far:
> >
> >     --> SplitReader has a method "addSplits(SplitT...)" to add one or
> more
> > splits. Some split readers might assume they have only one split ever,
> > concurrently, others assume multiple splits. (Note: idea behind being
> able
> > to add multiple splits at the same time is to ease startup where multiple
> > splits may be assigned instantly.)
> >     --> SplitReader has a context object on which it can call indicate
> when
> > splits are completed. The enumerator gets that notification and can use
> to
> > decide when to assign new splits. This should help both in cases of
> sources
> > that take splits lazily (file readers) and in case the source needs to
> > preserve a partial order between splits (Kinesis, Pravega, Pulsar may
> need
> > that).
> >     --> SplitEnumerator gets notification when SplitReaders start and
> when
> > they finish splits. They can decide at that moment to push more splits to
> > that reader
> >     --> The SplitEnumerator should probably be aware of the source
> > parallelism, to build its initial distribution.
> >
> >   - Open question: Should the source expose something like "host
> > preferences", so that yarn/mesos/k8s can take this into account when
> > selecting a node to start a TM on?
> >
> > *(5) Watermarks and event time alignment*
> >
> >   - Watermark generation, as well as idleness, needs to be per split
> (like
> > currently in the Kafka Source, per partition)
> >   - It is desirable to support optional event-time-alignment, meaning
> that
> > splits that are ahead are back-pressured or temporarily unsubscribed
> >
> >   - I think i would be desirable to encapsulate watermark generation
> logic
> > in watermark generators, for a separation of concerns. The watermark
> > generators should run per split.
> >   - Using watermark generators would also help with another problem of
> the
> > suggested interface, namely supporting non-periodic watermarks
> efficiently.
> >
> >   - Need a way to "dispatch" next record to different watermark
> generators
> >   - Need a way to tell SplitReader to "suspend" a split until a certain
> > watermark is reached (event time backpressure)
> >   - This would in fact be not needed (and thus simpler) if we had a
> > SplitReader per split and may be a reason to re-open that discussion
> >
> > *(6) Watermarks across splits and in the Split Enumerator*
> >
> >   - The split enumerator may need some watermark awareness, which should
> be
> > purely based on split metadata (like create timestamp of file splits)
> >   - If there are still more splits with overlapping event time range for
> a
> > split reader, then that split reader should not advance the watermark
> > within the split beyond the overlap boundary. Otherwise future splits
> will
> > produce late data.
> >
> >   - One way to approach this could be that the split enumerator may send
> > watermarks to the readers, and the readers cannot emit watermarks beyond
> > that received watermark.
> >   - Many split enumerators would simply immediately send Long.MAX out and
> > leave the progress purely to the split readers.
> >
> >   - For event-time alignment / split back pressure, this begs the
> question
> > how we can avoid deadlocks that may arise when splits are suspended for
> > event time back pressure,
> >
> > *(7) Batch and streaming Unification*
> >
> >   - Functionality wise, the above design should support both
> >   - Batch often (mostly) does not care about reading "in order" and
> > generating watermarks
> >     --> Might use different enumerator logic that is more locality aware
> > and ignores event time order
> >     --> Does not generate watermarks
> >   - Would be great if bounded sources could be identified at compile
> time,
> > so that "env.addBoundedSource(...)" is type safe and can return a
> > "BoundedDataStream".
> >   - Possible to defer this discussion until later
> >
> > *Miscellaneous Comments*
> >
> >   - Should the source have a TypeInformation for the produced type,
> instead
> > of a serializer? We need a type information in the stream anyways, and
> can
> > derive the serializer from that. Plus, creating the serializer should
> > respect the ExecutionConfig.
> >
> >   - The TypeSerializer interface is very powerful but also not easy to
> > implement. Its purpose is to handle data super efficiently, support
> > flexible ways of evolution, etc.
> >   For metadata I would suggest to look at the SimpleVersionedSerializer
> > instead, which is used for example for checkpoint master hooks, or for
> the
> > streaming file sink. I think that is is a good match for cases where we
> do
> > not need more than ser/deser (no copy, etc.) and don't need to push
> > versioning out of the serialization paths for best performance (as in the
> > TypeSerializer)
> >
> >
> > On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
> > [hidden email]>
> > wrote:
> >
> > > Hi Biao,
> > >
> > > Thanks for the answer!
> > >
> > > So given the multi-threaded readers, now we have as open questions:
> > >
> > > 1) How do we let the checkpoints pass through our multi-threaded reader
> > > operator?
> > >
> > > 2) Do we have separate reader and source operators or not? In the
> > strategy
> > > that has a separate source, the source operator has a parallelism of 1
> > and
> > > is responsible for split recovery only.
> > >
> > > For the first one, given also the constraints (blocking, finite queues,
> > > etc), I do not have an answer yet.
> > >
> > > For the 2nd, I think that we should go with separate operators for the
> > > source and the readers, for the following reasons:
> > >
> > > 1) This is more aligned with a potential future improvement where the
> > split
> > > discovery becomes a responsibility of the JobManager and readers are
> > > pooling more work from the JM.
> > >
> > > 2) The source is going to be the "single point of truth". It will know
> > what
> > > has been processed and what not. If the source and the readers are a
> > single
> > > operator with parallelism > 1, or in general, if the split discovery is
> > > done by each task individually, then:
> > >    i) we have to have a deterministic scheme for each reader to assign
> > > splits to itself (e.g. mod subtaskId). This is not necessarily trivial
> > for
> > > all sources.
> > >    ii) each reader would have to keep a copy of all its processed
> slpits
> > >    iii) the state has to be a union state with a non-trivial merging
> > logic
> > > in order to support rescaling.
> > >
> > > Two additional points that you raised above:
> > >
> > > i) The point that you raised that we need to keep all splits (processed
> > and
> > > not-processed) I think is a bit of a strong requirement. This would
> imply
> > > that for infinite sources the state will grow indefinitely. This is
> > problem
> > > is even more pronounced if we do not have a single source that assigns
> > > splits to readers, as each reader will have its own copy of the state.
> > >
> > > ii) it is true that for finite sources we need to somehow not close the
> > > readers when the source/split discoverer finishes. The
> > > ContinuousFileReaderOperator has a work-around for that. It is not
> > elegant,
> > > and checkpoints are not emitted after closing the source, but this, I
> > > believe, is a bigger problem which requires more changes than just
> > > refactoring the source interface.
> > >
> > > Cheers,
> > > Kostas
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Piotr Nowojski-2
In reply to this post by Biao Liu
Hi Biao!

This discussion was stalled because of preparations for the open sourcing & merging Blink. I think before creating the tickets we should split this discussion into topics/areas outlined by Stephan and create Flips for that.

I think there is no chance for this to be completed in couple of remaining weeks/1 month before 1.8 feature freeze, however it would be good to aim with those changes for 1.9.

Piotrek

> On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]> wrote:
>
> Hi community,
> The summary of Stephan makes a lot sense to me. It is much clearer indeed
> after splitting the complex topic into small ones.
> I was wondering is there any detail plan for next step? If not, I would
> like to push this thing forward by creating some JIRA issues.
> Another question is that should version 1.8 include these features?
>
> Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
>
>> Thanks everyone for the lively discussion. Let me try to summarize where I
>> see convergence in the discussion and open issues.
>> I'll try to group this by design aspect of the source. Please let me know
>> if I got things wrong or missed something crucial here.
>>
>> For issues 1-3, if the below reflects the state of the discussion, I would
>> try and update the FLIP in the next days.
>> For the remaining ones we need more discussion.
>>
>> I would suggest to fork each of these aspects into a separate mail thread,
>> or will loose sight of the individual aspects.
>>
>> *(1) Separation of Split Enumerator and Split Reader*
>>
>>  - All seem to agree this is a good thing
>>  - Split Enumerator could in the end live on JobManager (and assign splits
>> via RPC) or in a task (and assign splits via data streams)
>>  - this discussion is orthogonal and should come later, when the interface
>> is agreed upon.
>>
>> *(2) Split Readers for one or more splits*
>>
>>  - Discussion seems to agree that we need to support one reader that
>> possibly handles multiple splits concurrently.
>>  - The requirement comes from sources where one poll()-style call fetches
>> data from different splits / partitions
>>    --> example sources that require that would be for example Kafka,
>> Pravega, Pulsar
>>
>>  - Could have one split reader per source, or multiple split readers that
>> share the "poll()" function
>>  - To not make it too complicated, we can start with thinking about one
>> split reader for all splits initially and see if that covers all
>> requirements
>>
>> *(3) Threading model of the Split Reader*
>>
>>  - Most active part of the discussion ;-)
>>
>>  - A non-blocking way for Flink's task code to interact with the source is
>> needed in order to a task runtime code based on a
>> single-threaded/actor-style task design
>>    --> I personally am a big proponent of that, it will help with
>> well-behaved checkpoints, efficiency, and simpler yet more robust runtime
>> code
>>
>>  - Users care about simple abstraction, so as a subclass of SplitReader
>> (non-blocking / async) we need to have a BlockingSplitReader which will
>> form the basis of most source implementations. BlockingSplitReader lets
>> users do blocking simple poll() calls.
>>  - The BlockingSplitReader would spawn a thread (or more) and the
>> thread(s) can make blocking calls and hand over data buffers via a blocking
>> queue
>>  - This should allow us to cover both, a fully async runtime, and a simple
>> blocking interface for users.
>>  - This is actually very similar to how the Kafka connectors work. Kafka
>> 9+ with one thread, Kafka 8 with multiple threads
>>
>>  - On the base SplitReader (the async one), the non-blocking method that
>> gets the next chunk of data would signal data availability via a
>> CompletableFuture, because that gives the best flexibility (can await
>> completion or register notification handlers).
>>  - The source task would register a "thenHandle()" (or similar) on the
>> future to put a "take next data" task into the actor-style mailbox
>>
>> *(4) Split Enumeration and Assignment*
>>
>>  - Splits may be generated lazily, both in cases where there is a limited
>> number of splits (but very many), or splits are discovered over time
>>  - Assignment should also be lazy, to get better load balancing
>>  - Assignment needs support locality preferences
>>
>>  - Possible design based on discussion so far:
>>
>>    --> SplitReader has a method "addSplits(SplitT...)" to add one or more
>> splits. Some split readers might assume they have only one split ever,
>> concurrently, others assume multiple splits. (Note: idea behind being able
>> to add multiple splits at the same time is to ease startup where multiple
>> splits may be assigned instantly.)
>>    --> SplitReader has a context object on which it can call indicate when
>> splits are completed. The enumerator gets that notification and can use to
>> decide when to assign new splits. This should help both in cases of sources
>> that take splits lazily (file readers) and in case the source needs to
>> preserve a partial order between splits (Kinesis, Pravega, Pulsar may need
>> that).
>>    --> SplitEnumerator gets notification when SplitReaders start and when
>> they finish splits. They can decide at that moment to push more splits to
>> that reader
>>    --> The SplitEnumerator should probably be aware of the source
>> parallelism, to build its initial distribution.
>>
>>  - Open question: Should the source expose something like "host
>> preferences", so that yarn/mesos/k8s can take this into account when
>> selecting a node to start a TM on?
>>
>> *(5) Watermarks and event time alignment*
>>
>>  - Watermark generation, as well as idleness, needs to be per split (like
>> currently in the Kafka Source, per partition)
>>  - It is desirable to support optional event-time-alignment, meaning that
>> splits that are ahead are back-pressured or temporarily unsubscribed
>>
>>  - I think i would be desirable to encapsulate watermark generation logic
>> in watermark generators, for a separation of concerns. The watermark
>> generators should run per split.
>>  - Using watermark generators would also help with another problem of the
>> suggested interface, namely supporting non-periodic watermarks efficiently.
>>
>>  - Need a way to "dispatch" next record to different watermark generators
>>  - Need a way to tell SplitReader to "suspend" a split until a certain
>> watermark is reached (event time backpressure)
>>  - This would in fact be not needed (and thus simpler) if we had a
>> SplitReader per split and may be a reason to re-open that discussion
>>
>> *(6) Watermarks across splits and in the Split Enumerator*
>>
>>  - The split enumerator may need some watermark awareness, which should be
>> purely based on split metadata (like create timestamp of file splits)
>>  - If there are still more splits with overlapping event time range for a
>> split reader, then that split reader should not advance the watermark
>> within the split beyond the overlap boundary. Otherwise future splits will
>> produce late data.
>>
>>  - One way to approach this could be that the split enumerator may send
>> watermarks to the readers, and the readers cannot emit watermarks beyond
>> that received watermark.
>>  - Many split enumerators would simply immediately send Long.MAX out and
>> leave the progress purely to the split readers.
>>
>>  - For event-time alignment / split back pressure, this begs the question
>> how we can avoid deadlocks that may arise when splits are suspended for
>> event time back pressure,
>>
>> *(7) Batch and streaming Unification*
>>
>>  - Functionality wise, the above design should support both
>>  - Batch often (mostly) does not care about reading "in order" and
>> generating watermarks
>>    --> Might use different enumerator logic that is more locality aware
>> and ignores event time order
>>    --> Does not generate watermarks
>>  - Would be great if bounded sources could be identified at compile time,
>> so that "env.addBoundedSource(...)" is type safe and can return a
>> "BoundedDataStream".
>>  - Possible to defer this discussion until later
>>
>> *Miscellaneous Comments*
>>
>>  - Should the source have a TypeInformation for the produced type, instead
>> of a serializer? We need a type information in the stream anyways, and can
>> derive the serializer from that. Plus, creating the serializer should
>> respect the ExecutionConfig.
>>
>>  - The TypeSerializer interface is very powerful but also not easy to
>> implement. Its purpose is to handle data super efficiently, support
>> flexible ways of evolution, etc.
>>  For metadata I would suggest to look at the SimpleVersionedSerializer
>> instead, which is used for example for checkpoint master hooks, or for the
>> streaming file sink. I think that is is a good match for cases where we do
>> not need more than ser/deser (no copy, etc.) and don't need to push
>> versioning out of the serialization paths for best performance (as in the
>> TypeSerializer)
>>
>>
>> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
>> [hidden email]>
>> wrote:
>>
>>> Hi Biao,
>>>
>>> Thanks for the answer!
>>>
>>> So given the multi-threaded readers, now we have as open questions:
>>>
>>> 1) How do we let the checkpoints pass through our multi-threaded reader
>>> operator?
>>>
>>> 2) Do we have separate reader and source operators or not? In the
>> strategy
>>> that has a separate source, the source operator has a parallelism of 1
>> and
>>> is responsible for split recovery only.
>>>
>>> For the first one, given also the constraints (blocking, finite queues,
>>> etc), I do not have an answer yet.
>>>
>>> For the 2nd, I think that we should go with separate operators for the
>>> source and the readers, for the following reasons:
>>>
>>> 1) This is more aligned with a potential future improvement where the
>> split
>>> discovery becomes a responsibility of the JobManager and readers are
>>> pooling more work from the JM.
>>>
>>> 2) The source is going to be the "single point of truth". It will know
>> what
>>> has been processed and what not. If the source and the readers are a
>> single
>>> operator with parallelism > 1, or in general, if the split discovery is
>>> done by each task individually, then:
>>>   i) we have to have a deterministic scheme for each reader to assign
>>> splits to itself (e.g. mod subtaskId). This is not necessarily trivial
>> for
>>> all sources.
>>>   ii) each reader would have to keep a copy of all its processed slpits
>>>   iii) the state has to be a union state with a non-trivial merging
>> logic
>>> in order to support rescaling.
>>>
>>> Two additional points that you raised above:
>>>
>>> i) The point that you raised that we need to keep all splits (processed
>> and
>>> not-processed) I think is a bit of a strong requirement. This would imply
>>> that for infinite sources the state will grow indefinitely. This is
>> problem
>>> is even more pronounced if we do not have a single source that assigns
>>> splits to readers, as each reader will have its own copy of the state.
>>>
>>> ii) it is true that for finite sources we need to somehow not close the
>>> readers when the source/split discoverer finishes. The
>>> ContinuousFileReaderOperator has a work-around for that. It is not
>> elegant,
>>> and checkpoints are not emitted after closing the source, but this, I
>>> believe, is a bigger problem which requires more changes than just
>>> refactoring the source interface.
>>>
>>> Cheers,
>>> Kostas
>>>
>>

Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Biao Liu
Hi Stephan & Piotrek,

Thank you for feedback.

It seems that there are a lot of things to do in community. I am just
afraid that this discussion may be forgotten since there so many proposals
recently.
Anyway, wish to see the split topics soon :)

Piotr Nowojski <[hidden email]> 于2019年1月24日周四 下午8:21写道:

> Hi Biao!
>
> This discussion was stalled because of preparations for the open sourcing
> & merging Blink. I think before creating the tickets we should split this
> discussion into topics/areas outlined by Stephan and create Flips for that.
>
> I think there is no chance for this to be completed in couple of remaining
> weeks/1 month before 1.8 feature freeze, however it would be good to aim
> with those changes for 1.9.
>
> Piotrek
>
> > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]> wrote:
> >
> > Hi community,
> > The summary of Stephan makes a lot sense to me. It is much clearer indeed
> > after splitting the complex topic into small ones.
> > I was wondering is there any detail plan for next step? If not, I would
> > like to push this thing forward by creating some JIRA issues.
> > Another question is that should version 1.8 include these features?
> >
> > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
> >
> >> Thanks everyone for the lively discussion. Let me try to summarize
> where I
> >> see convergence in the discussion and open issues.
> >> I'll try to group this by design aspect of the source. Please let me
> know
> >> if I got things wrong or missed something crucial here.
> >>
> >> For issues 1-3, if the below reflects the state of the discussion, I
> would
> >> try and update the FLIP in the next days.
> >> For the remaining ones we need more discussion.
> >>
> >> I would suggest to fork each of these aspects into a separate mail
> thread,
> >> or will loose sight of the individual aspects.
> >>
> >> *(1) Separation of Split Enumerator and Split Reader*
> >>
> >>  - All seem to agree this is a good thing
> >>  - Split Enumerator could in the end live on JobManager (and assign
> splits
> >> via RPC) or in a task (and assign splits via data streams)
> >>  - this discussion is orthogonal and should come later, when the
> interface
> >> is agreed upon.
> >>
> >> *(2) Split Readers for one or more splits*
> >>
> >>  - Discussion seems to agree that we need to support one reader that
> >> possibly handles multiple splits concurrently.
> >>  - The requirement comes from sources where one poll()-style call
> fetches
> >> data from different splits / partitions
> >>    --> example sources that require that would be for example Kafka,
> >> Pravega, Pulsar
> >>
> >>  - Could have one split reader per source, or multiple split readers
> that
> >> share the "poll()" function
> >>  - To not make it too complicated, we can start with thinking about one
> >> split reader for all splits initially and see if that covers all
> >> requirements
> >>
> >> *(3) Threading model of the Split Reader*
> >>
> >>  - Most active part of the discussion ;-)
> >>
> >>  - A non-blocking way for Flink's task code to interact with the source
> is
> >> needed in order to a task runtime code based on a
> >> single-threaded/actor-style task design
> >>    --> I personally am a big proponent of that, it will help with
> >> well-behaved checkpoints, efficiency, and simpler yet more robust
> runtime
> >> code
> >>
> >>  - Users care about simple abstraction, so as a subclass of SplitReader
> >> (non-blocking / async) we need to have a BlockingSplitReader which will
> >> form the basis of most source implementations. BlockingSplitReader lets
> >> users do blocking simple poll() calls.
> >>  - The BlockingSplitReader would spawn a thread (or more) and the
> >> thread(s) can make blocking calls and hand over data buffers via a
> blocking
> >> queue
> >>  - This should allow us to cover both, a fully async runtime, and a
> simple
> >> blocking interface for users.
> >>  - This is actually very similar to how the Kafka connectors work. Kafka
> >> 9+ with one thread, Kafka 8 with multiple threads
> >>
> >>  - On the base SplitReader (the async one), the non-blocking method that
> >> gets the next chunk of data would signal data availability via a
> >> CompletableFuture, because that gives the best flexibility (can await
> >> completion or register notification handlers).
> >>  - The source task would register a "thenHandle()" (or similar) on the
> >> future to put a "take next data" task into the actor-style mailbox
> >>
> >> *(4) Split Enumeration and Assignment*
> >>
> >>  - Splits may be generated lazily, both in cases where there is a
> limited
> >> number of splits (but very many), or splits are discovered over time
> >>  - Assignment should also be lazy, to get better load balancing
> >>  - Assignment needs support locality preferences
> >>
> >>  - Possible design based on discussion so far:
> >>
> >>    --> SplitReader has a method "addSplits(SplitT...)" to add one or
> more
> >> splits. Some split readers might assume they have only one split ever,
> >> concurrently, others assume multiple splits. (Note: idea behind being
> able
> >> to add multiple splits at the same time is to ease startup where
> multiple
> >> splits may be assigned instantly.)
> >>    --> SplitReader has a context object on which it can call indicate
> when
> >> splits are completed. The enumerator gets that notification and can use
> to
> >> decide when to assign new splits. This should help both in cases of
> sources
> >> that take splits lazily (file readers) and in case the source needs to
> >> preserve a partial order between splits (Kinesis, Pravega, Pulsar may
> need
> >> that).
> >>    --> SplitEnumerator gets notification when SplitReaders start and
> when
> >> they finish splits. They can decide at that moment to push more splits
> to
> >> that reader
> >>    --> The SplitEnumerator should probably be aware of the source
> >> parallelism, to build its initial distribution.
> >>
> >>  - Open question: Should the source expose something like "host
> >> preferences", so that yarn/mesos/k8s can take this into account when
> >> selecting a node to start a TM on?
> >>
> >> *(5) Watermarks and event time alignment*
> >>
> >>  - Watermark generation, as well as idleness, needs to be per split
> (like
> >> currently in the Kafka Source, per partition)
> >>  - It is desirable to support optional event-time-alignment, meaning
> that
> >> splits that are ahead are back-pressured or temporarily unsubscribed
> >>
> >>  - I think i would be desirable to encapsulate watermark generation
> logic
> >> in watermark generators, for a separation of concerns. The watermark
> >> generators should run per split.
> >>  - Using watermark generators would also help with another problem of
> the
> >> suggested interface, namely supporting non-periodic watermarks
> efficiently.
> >>
> >>  - Need a way to "dispatch" next record to different watermark
> generators
> >>  - Need a way to tell SplitReader to "suspend" a split until a certain
> >> watermark is reached (event time backpressure)
> >>  - This would in fact be not needed (and thus simpler) if we had a
> >> SplitReader per split and may be a reason to re-open that discussion
> >>
> >> *(6) Watermarks across splits and in the Split Enumerator*
> >>
> >>  - The split enumerator may need some watermark awareness, which should
> be
> >> purely based on split metadata (like create timestamp of file splits)
> >>  - If there are still more splits with overlapping event time range for
> a
> >> split reader, then that split reader should not advance the watermark
> >> within the split beyond the overlap boundary. Otherwise future splits
> will
> >> produce late data.
> >>
> >>  - One way to approach this could be that the split enumerator may send
> >> watermarks to the readers, and the readers cannot emit watermarks beyond
> >> that received watermark.
> >>  - Many split enumerators would simply immediately send Long.MAX out and
> >> leave the progress purely to the split readers.
> >>
> >>  - For event-time alignment / split back pressure, this begs the
> question
> >> how we can avoid deadlocks that may arise when splits are suspended for
> >> event time back pressure,
> >>
> >> *(7) Batch and streaming Unification*
> >>
> >>  - Functionality wise, the above design should support both
> >>  - Batch often (mostly) does not care about reading "in order" and
> >> generating watermarks
> >>    --> Might use different enumerator logic that is more locality aware
> >> and ignores event time order
> >>    --> Does not generate watermarks
> >>  - Would be great if bounded sources could be identified at compile
> time,
> >> so that "env.addBoundedSource(...)" is type safe and can return a
> >> "BoundedDataStream".
> >>  - Possible to defer this discussion until later
> >>
> >> *Miscellaneous Comments*
> >>
> >>  - Should the source have a TypeInformation for the produced type,
> instead
> >> of a serializer? We need a type information in the stream anyways, and
> can
> >> derive the serializer from that. Plus, creating the serializer should
> >> respect the ExecutionConfig.
> >>
> >>  - The TypeSerializer interface is very powerful but also not easy to
> >> implement. Its purpose is to handle data super efficiently, support
> >> flexible ways of evolution, etc.
> >>  For metadata I would suggest to look at the SimpleVersionedSerializer
> >> instead, which is used for example for checkpoint master hooks, or for
> the
> >> streaming file sink. I think that is is a good match for cases where we
> do
> >> not need more than ser/deser (no copy, etc.) and don't need to push
> >> versioning out of the serialization paths for best performance (as in
> the
> >> TypeSerializer)
> >>
> >>
> >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
> >> [hidden email]>
> >> wrote:
> >>
> >>> Hi Biao,
> >>>
> >>> Thanks for the answer!
> >>>
> >>> So given the multi-threaded readers, now we have as open questions:
> >>>
> >>> 1) How do we let the checkpoints pass through our multi-threaded reader
> >>> operator?
> >>>
> >>> 2) Do we have separate reader and source operators or not? In the
> >> strategy
> >>> that has a separate source, the source operator has a parallelism of 1
> >> and
> >>> is responsible for split recovery only.
> >>>
> >>> For the first one, given also the constraints (blocking, finite queues,
> >>> etc), I do not have an answer yet.
> >>>
> >>> For the 2nd, I think that we should go with separate operators for the
> >>> source and the readers, for the following reasons:
> >>>
> >>> 1) This is more aligned with a potential future improvement where the
> >> split
> >>> discovery becomes a responsibility of the JobManager and readers are
> >>> pooling more work from the JM.
> >>>
> >>> 2) The source is going to be the "single point of truth". It will know
> >> what
> >>> has been processed and what not. If the source and the readers are a
> >> single
> >>> operator with parallelism > 1, or in general, if the split discovery is
> >>> done by each task individually, then:
> >>>   i) we have to have a deterministic scheme for each reader to assign
> >>> splits to itself (e.g. mod subtaskId). This is not necessarily trivial
> >> for
> >>> all sources.
> >>>   ii) each reader would have to keep a copy of all its processed slpits
> >>>   iii) the state has to be a union state with a non-trivial merging
> >> logic
> >>> in order to support rescaling.
> >>>
> >>> Two additional points that you raised above:
> >>>
> >>> i) The point that you raised that we need to keep all splits (processed
> >> and
> >>> not-processed) I think is a bit of a strong requirement. This would
> imply
> >>> that for infinite sources the state will grow indefinitely. This is
> >> problem
> >>> is even more pronounced if we do not have a single source that assigns
> >>> splits to readers, as each reader will have its own copy of the state.
> >>>
> >>> ii) it is true that for finite sources we need to somehow not close the
> >>> readers when the source/split discoverer finishes. The
> >>> ContinuousFileReaderOperator has a work-around for that. It is not
> >> elegant,
> >>> and checkpoints are not emitted after closing the source, but this, I
> >>> believe, is a bigger problem which requires more changes than just
> >>> refactoring the source interface.
> >>>
> >>> Cheers,
> >>> Kostas
> >>>
> >>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Steven Wu
This proposal mentioned that SplitEnumerator might run on the JobManager or
in a single task on a TaskManager.

if enumerator is a single task on a taskmanager, then the job DAG can never
been embarrassingly parallel anymore. That will nullify the leverage of
fine-grained recovery for embarrassingly parallel jobs.

It's not clear to me what's the implication of running enumerator on the
jobmanager. So I will leave that out for now.

On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]> wrote:

> Hi Stephan & Piotrek,
>
> Thank you for feedback.
>
> It seems that there are a lot of things to do in community. I am just
> afraid that this discussion may be forgotten since there so many proposals
> recently.
> Anyway, wish to see the split topics soon :)
>
> Piotr Nowojski <[hidden email]> 于2019年1月24日周四 下午8:21写道:
>
> > Hi Biao!
> >
> > This discussion was stalled because of preparations for the open sourcing
> > & merging Blink. I think before creating the tickets we should split this
> > discussion into topics/areas outlined by Stephan and create Flips for
> that.
> >
> > I think there is no chance for this to be completed in couple of
> remaining
> > weeks/1 month before 1.8 feature freeze, however it would be good to aim
> > with those changes for 1.9.
> >
> > Piotrek
> >
> > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]> wrote:
> > >
> > > Hi community,
> > > The summary of Stephan makes a lot sense to me. It is much clearer
> indeed
> > > after splitting the complex topic into small ones.
> > > I was wondering is there any detail plan for next step? If not, I would
> > > like to push this thing forward by creating some JIRA issues.
> > > Another question is that should version 1.8 include these features?
> > >
> > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
> > >
> > >> Thanks everyone for the lively discussion. Let me try to summarize
> > where I
> > >> see convergence in the discussion and open issues.
> > >> I'll try to group this by design aspect of the source. Please let me
> > know
> > >> if I got things wrong or missed something crucial here.
> > >>
> > >> For issues 1-3, if the below reflects the state of the discussion, I
> > would
> > >> try and update the FLIP in the next days.
> > >> For the remaining ones we need more discussion.
> > >>
> > >> I would suggest to fork each of these aspects into a separate mail
> > thread,
> > >> or will loose sight of the individual aspects.
> > >>
> > >> *(1) Separation of Split Enumerator and Split Reader*
> > >>
> > >>  - All seem to agree this is a good thing
> > >>  - Split Enumerator could in the end live on JobManager (and assign
> > splits
> > >> via RPC) or in a task (and assign splits via data streams)
> > >>  - this discussion is orthogonal and should come later, when the
> > interface
> > >> is agreed upon.
> > >>
> > >> *(2) Split Readers for one or more splits*
> > >>
> > >>  - Discussion seems to agree that we need to support one reader that
> > >> possibly handles multiple splits concurrently.
> > >>  - The requirement comes from sources where one poll()-style call
> > fetches
> > >> data from different splits / partitions
> > >>    --> example sources that require that would be for example Kafka,
> > >> Pravega, Pulsar
> > >>
> > >>  - Could have one split reader per source, or multiple split readers
> > that
> > >> share the "poll()" function
> > >>  - To not make it too complicated, we can start with thinking about
> one
> > >> split reader for all splits initially and see if that covers all
> > >> requirements
> > >>
> > >> *(3) Threading model of the Split Reader*
> > >>
> > >>  - Most active part of the discussion ;-)
> > >>
> > >>  - A non-blocking way for Flink's task code to interact with the
> source
> > is
> > >> needed in order to a task runtime code based on a
> > >> single-threaded/actor-style task design
> > >>    --> I personally am a big proponent of that, it will help with
> > >> well-behaved checkpoints, efficiency, and simpler yet more robust
> > runtime
> > >> code
> > >>
> > >>  - Users care about simple abstraction, so as a subclass of
> SplitReader
> > >> (non-blocking / async) we need to have a BlockingSplitReader which
> will
> > >> form the basis of most source implementations. BlockingSplitReader
> lets
> > >> users do blocking simple poll() calls.
> > >>  - The BlockingSplitReader would spawn a thread (or more) and the
> > >> thread(s) can make blocking calls and hand over data buffers via a
> > blocking
> > >> queue
> > >>  - This should allow us to cover both, a fully async runtime, and a
> > simple
> > >> blocking interface for users.
> > >>  - This is actually very similar to how the Kafka connectors work.
> Kafka
> > >> 9+ with one thread, Kafka 8 with multiple threads
> > >>
> > >>  - On the base SplitReader (the async one), the non-blocking method
> that
> > >> gets the next chunk of data would signal data availability via a
> > >> CompletableFuture, because that gives the best flexibility (can await
> > >> completion or register notification handlers).
> > >>  - The source task would register a "thenHandle()" (or similar) on the
> > >> future to put a "take next data" task into the actor-style mailbox
> > >>
> > >> *(4) Split Enumeration and Assignment*
> > >>
> > >>  - Splits may be generated lazily, both in cases where there is a
> > limited
> > >> number of splits (but very many), or splits are discovered over time
> > >>  - Assignment should also be lazy, to get better load balancing
> > >>  - Assignment needs support locality preferences
> > >>
> > >>  - Possible design based on discussion so far:
> > >>
> > >>    --> SplitReader has a method "addSplits(SplitT...)" to add one or
> > more
> > >> splits. Some split readers might assume they have only one split ever,
> > >> concurrently, others assume multiple splits. (Note: idea behind being
> > able
> > >> to add multiple splits at the same time is to ease startup where
> > multiple
> > >> splits may be assigned instantly.)
> > >>    --> SplitReader has a context object on which it can call indicate
> > when
> > >> splits are completed. The enumerator gets that notification and can
> use
> > to
> > >> decide when to assign new splits. This should help both in cases of
> > sources
> > >> that take splits lazily (file readers) and in case the source needs to
> > >> preserve a partial order between splits (Kinesis, Pravega, Pulsar may
> > need
> > >> that).
> > >>    --> SplitEnumerator gets notification when SplitReaders start and
> > when
> > >> they finish splits. They can decide at that moment to push more splits
> > to
> > >> that reader
> > >>    --> The SplitEnumerator should probably be aware of the source
> > >> parallelism, to build its initial distribution.
> > >>
> > >>  - Open question: Should the source expose something like "host
> > >> preferences", so that yarn/mesos/k8s can take this into account when
> > >> selecting a node to start a TM on?
> > >>
> > >> *(5) Watermarks and event time alignment*
> > >>
> > >>  - Watermark generation, as well as idleness, needs to be per split
> > (like
> > >> currently in the Kafka Source, per partition)
> > >>  - It is desirable to support optional event-time-alignment, meaning
> > that
> > >> splits that are ahead are back-pressured or temporarily unsubscribed
> > >>
> > >>  - I think i would be desirable to encapsulate watermark generation
> > logic
> > >> in watermark generators, for a separation of concerns. The watermark
> > >> generators should run per split.
> > >>  - Using watermark generators would also help with another problem of
> > the
> > >> suggested interface, namely supporting non-periodic watermarks
> > efficiently.
> > >>
> > >>  - Need a way to "dispatch" next record to different watermark
> > generators
> > >>  - Need a way to tell SplitReader to "suspend" a split until a certain
> > >> watermark is reached (event time backpressure)
> > >>  - This would in fact be not needed (and thus simpler) if we had a
> > >> SplitReader per split and may be a reason to re-open that discussion
> > >>
> > >> *(6) Watermarks across splits and in the Split Enumerator*
> > >>
> > >>  - The split enumerator may need some watermark awareness, which
> should
> > be
> > >> purely based on split metadata (like create timestamp of file splits)
> > >>  - If there are still more splits with overlapping event time range
> for
> > a
> > >> split reader, then that split reader should not advance the watermark
> > >> within the split beyond the overlap boundary. Otherwise future splits
> > will
> > >> produce late data.
> > >>
> > >>  - One way to approach this could be that the split enumerator may
> send
> > >> watermarks to the readers, and the readers cannot emit watermarks
> beyond
> > >> that received watermark.
> > >>  - Many split enumerators would simply immediately send Long.MAX out
> and
> > >> leave the progress purely to the split readers.
> > >>
> > >>  - For event-time alignment / split back pressure, this begs the
> > question
> > >> how we can avoid deadlocks that may arise when splits are suspended
> for
> > >> event time back pressure,
> > >>
> > >> *(7) Batch and streaming Unification*
> > >>
> > >>  - Functionality wise, the above design should support both
> > >>  - Batch often (mostly) does not care about reading "in order" and
> > >> generating watermarks
> > >>    --> Might use different enumerator logic that is more locality
> aware
> > >> and ignores event time order
> > >>    --> Does not generate watermarks
> > >>  - Would be great if bounded sources could be identified at compile
> > time,
> > >> so that "env.addBoundedSource(...)" is type safe and can return a
> > >> "BoundedDataStream".
> > >>  - Possible to defer this discussion until later
> > >>
> > >> *Miscellaneous Comments*
> > >>
> > >>  - Should the source have a TypeInformation for the produced type,
> > instead
> > >> of a serializer? We need a type information in the stream anyways, and
> > can
> > >> derive the serializer from that. Plus, creating the serializer should
> > >> respect the ExecutionConfig.
> > >>
> > >>  - The TypeSerializer interface is very powerful but also not easy to
> > >> implement. Its purpose is to handle data super efficiently, support
> > >> flexible ways of evolution, etc.
> > >>  For metadata I would suggest to look at the SimpleVersionedSerializer
> > >> instead, which is used for example for checkpoint master hooks, or for
> > the
> > >> streaming file sink. I think that is is a good match for cases where
> we
> > do
> > >> not need more than ser/deser (no copy, etc.) and don't need to push
> > >> versioning out of the serialization paths for best performance (as in
> > the
> > >> TypeSerializer)
> > >>
> > >>
> > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
> > >> [hidden email]>
> > >> wrote:
> > >>
> > >>> Hi Biao,
> > >>>
> > >>> Thanks for the answer!
> > >>>
> > >>> So given the multi-threaded readers, now we have as open questions:
> > >>>
> > >>> 1) How do we let the checkpoints pass through our multi-threaded
> reader
> > >>> operator?
> > >>>
> > >>> 2) Do we have separate reader and source operators or not? In the
> > >> strategy
> > >>> that has a separate source, the source operator has a parallelism of
> 1
> > >> and
> > >>> is responsible for split recovery only.
> > >>>
> > >>> For the first one, given also the constraints (blocking, finite
> queues,
> > >>> etc), I do not have an answer yet.
> > >>>
> > >>> For the 2nd, I think that we should go with separate operators for
> the
> > >>> source and the readers, for the following reasons:
> > >>>
> > >>> 1) This is more aligned with a potential future improvement where the
> > >> split
> > >>> discovery becomes a responsibility of the JobManager and readers are
> > >>> pooling more work from the JM.
> > >>>
> > >>> 2) The source is going to be the "single point of truth". It will
> know
> > >> what
> > >>> has been processed and what not. If the source and the readers are a
> > >> single
> > >>> operator with parallelism > 1, or in general, if the split discovery
> is
> > >>> done by each task individually, then:
> > >>>   i) we have to have a deterministic scheme for each reader to assign
> > >>> splits to itself (e.g. mod subtaskId). This is not necessarily
> trivial
> > >> for
> > >>> all sources.
> > >>>   ii) each reader would have to keep a copy of all its processed
> slpits
> > >>>   iii) the state has to be a union state with a non-trivial merging
> > >> logic
> > >>> in order to support rescaling.
> > >>>
> > >>> Two additional points that you raised above:
> > >>>
> > >>> i) The point that you raised that we need to keep all splits
> (processed
> > >> and
> > >>> not-processed) I think is a bit of a strong requirement. This would
> > imply
> > >>> that for infinite sources the state will grow indefinitely. This is
> > >> problem
> > >>> is even more pronounced if we do not have a single source that
> assigns
> > >>> splits to readers, as each reader will have its own copy of the
> state.
> > >>>
> > >>> ii) it is true that for finite sources we need to somehow not close
> the
> > >>> readers when the source/split discoverer finishes. The
> > >>> ContinuousFileReaderOperator has a work-around for that. It is not
> > >> elegant,
> > >>> and checkpoints are not emitted after closing the source, but this, I
> > >>> believe, is a bigger problem which requires more changes than just
> > >>> refactoring the source interface.
> > >>>
> > >>> Cheers,
> > >>> Kostas
> > >>>
> > >>
> >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Biao Liu
Hi Steven,
Thank you for the feedback. Please take a look at the document FLIP-27
<https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface>
which
is updated recently. A lot of details of enumerator were added in this
document. I think it would help.

Steven Wu <[hidden email]> 于2019年3月28日周四 下午12:52写道:

> This proposal mentioned that SplitEnumerator might run on the JobManager or
> in a single task on a TaskManager.
>
> if enumerator is a single task on a taskmanager, then the job DAG can never
> been embarrassingly parallel anymore. That will nullify the leverage of
> fine-grained recovery for embarrassingly parallel jobs.
>
> It's not clear to me what's the implication of running enumerator on the
> jobmanager. So I will leave that out for now.
>
> On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]> wrote:
>
> > Hi Stephan & Piotrek,
> >
> > Thank you for feedback.
> >
> > It seems that there are a lot of things to do in community. I am just
> > afraid that this discussion may be forgotten since there so many
> proposals
> > recently.
> > Anyway, wish to see the split topics soon :)
> >
> > Piotr Nowojski <[hidden email]> 于2019年1月24日周四 下午8:21写道:
> >
> > > Hi Biao!
> > >
> > > This discussion was stalled because of preparations for the open
> sourcing
> > > & merging Blink. I think before creating the tickets we should split
> this
> > > discussion into topics/areas outlined by Stephan and create Flips for
> > that.
> > >
> > > I think there is no chance for this to be completed in couple of
> > remaining
> > > weeks/1 month before 1.8 feature freeze, however it would be good to
> aim
> > > with those changes for 1.9.
> > >
> > > Piotrek
> > >
> > > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]> wrote:
> > > >
> > > > Hi community,
> > > > The summary of Stephan makes a lot sense to me. It is much clearer
> > indeed
> > > > after splitting the complex topic into small ones.
> > > > I was wondering is there any detail plan for next step? If not, I
> would
> > > > like to push this thing forward by creating some JIRA issues.
> > > > Another question is that should version 1.8 include these features?
> > > >
> > > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
> > > >
> > > >> Thanks everyone for the lively discussion. Let me try to summarize
> > > where I
> > > >> see convergence in the discussion and open issues.
> > > >> I'll try to group this by design aspect of the source. Please let me
> > > know
> > > >> if I got things wrong or missed something crucial here.
> > > >>
> > > >> For issues 1-3, if the below reflects the state of the discussion, I
> > > would
> > > >> try and update the FLIP in the next days.
> > > >> For the remaining ones we need more discussion.
> > > >>
> > > >> I would suggest to fork each of these aspects into a separate mail
> > > thread,
> > > >> or will loose sight of the individual aspects.
> > > >>
> > > >> *(1) Separation of Split Enumerator and Split Reader*
> > > >>
> > > >>  - All seem to agree this is a good thing
> > > >>  - Split Enumerator could in the end live on JobManager (and assign
> > > splits
> > > >> via RPC) or in a task (and assign splits via data streams)
> > > >>  - this discussion is orthogonal and should come later, when the
> > > interface
> > > >> is agreed upon.
> > > >>
> > > >> *(2) Split Readers for one or more splits*
> > > >>
> > > >>  - Discussion seems to agree that we need to support one reader that
> > > >> possibly handles multiple splits concurrently.
> > > >>  - The requirement comes from sources where one poll()-style call
> > > fetches
> > > >> data from different splits / partitions
> > > >>    --> example sources that require that would be for example Kafka,
> > > >> Pravega, Pulsar
> > > >>
> > > >>  - Could have one split reader per source, or multiple split readers
> > > that
> > > >> share the "poll()" function
> > > >>  - To not make it too complicated, we can start with thinking about
> > one
> > > >> split reader for all splits initially and see if that covers all
> > > >> requirements
> > > >>
> > > >> *(3) Threading model of the Split Reader*
> > > >>
> > > >>  - Most active part of the discussion ;-)
> > > >>
> > > >>  - A non-blocking way for Flink's task code to interact with the
> > source
> > > is
> > > >> needed in order to a task runtime code based on a
> > > >> single-threaded/actor-style task design
> > > >>    --> I personally am a big proponent of that, it will help with
> > > >> well-behaved checkpoints, efficiency, and simpler yet more robust
> > > runtime
> > > >> code
> > > >>
> > > >>  - Users care about simple abstraction, so as a subclass of
> > SplitReader
> > > >> (non-blocking / async) we need to have a BlockingSplitReader which
> > will
> > > >> form the basis of most source implementations. BlockingSplitReader
> > lets
> > > >> users do blocking simple poll() calls.
> > > >>  - The BlockingSplitReader would spawn a thread (or more) and the
> > > >> thread(s) can make blocking calls and hand over data buffers via a
> > > blocking
> > > >> queue
> > > >>  - This should allow us to cover both, a fully async runtime, and a
> > > simple
> > > >> blocking interface for users.
> > > >>  - This is actually very similar to how the Kafka connectors work.
> > Kafka
> > > >> 9+ with one thread, Kafka 8 with multiple threads
> > > >>
> > > >>  - On the base SplitReader (the async one), the non-blocking method
> > that
> > > >> gets the next chunk of data would signal data availability via a
> > > >> CompletableFuture, because that gives the best flexibility (can
> await
> > > >> completion or register notification handlers).
> > > >>  - The source task would register a "thenHandle()" (or similar) on
> the
> > > >> future to put a "take next data" task into the actor-style mailbox
> > > >>
> > > >> *(4) Split Enumeration and Assignment*
> > > >>
> > > >>  - Splits may be generated lazily, both in cases where there is a
> > > limited
> > > >> number of splits (but very many), or splits are discovered over time
> > > >>  - Assignment should also be lazy, to get better load balancing
> > > >>  - Assignment needs support locality preferences
> > > >>
> > > >>  - Possible design based on discussion so far:
> > > >>
> > > >>    --> SplitReader has a method "addSplits(SplitT...)" to add one or
> > > more
> > > >> splits. Some split readers might assume they have only one split
> ever,
> > > >> concurrently, others assume multiple splits. (Note: idea behind
> being
> > > able
> > > >> to add multiple splits at the same time is to ease startup where
> > > multiple
> > > >> splits may be assigned instantly.)
> > > >>    --> SplitReader has a context object on which it can call
> indicate
> > > when
> > > >> splits are completed. The enumerator gets that notification and can
> > use
> > > to
> > > >> decide when to assign new splits. This should help both in cases of
> > > sources
> > > >> that take splits lazily (file readers) and in case the source needs
> to
> > > >> preserve a partial order between splits (Kinesis, Pravega, Pulsar
> may
> > > need
> > > >> that).
> > > >>    --> SplitEnumerator gets notification when SplitReaders start and
> > > when
> > > >> they finish splits. They can decide at that moment to push more
> splits
> > > to
> > > >> that reader
> > > >>    --> The SplitEnumerator should probably be aware of the source
> > > >> parallelism, to build its initial distribution.
> > > >>
> > > >>  - Open question: Should the source expose something like "host
> > > >> preferences", so that yarn/mesos/k8s can take this into account when
> > > >> selecting a node to start a TM on?
> > > >>
> > > >> *(5) Watermarks and event time alignment*
> > > >>
> > > >>  - Watermark generation, as well as idleness, needs to be per split
> > > (like
> > > >> currently in the Kafka Source, per partition)
> > > >>  - It is desirable to support optional event-time-alignment, meaning
> > > that
> > > >> splits that are ahead are back-pressured or temporarily unsubscribed
> > > >>
> > > >>  - I think i would be desirable to encapsulate watermark generation
> > > logic
> > > >> in watermark generators, for a separation of concerns. The watermark
> > > >> generators should run per split.
> > > >>  - Using watermark generators would also help with another problem
> of
> > > the
> > > >> suggested interface, namely supporting non-periodic watermarks
> > > efficiently.
> > > >>
> > > >>  - Need a way to "dispatch" next record to different watermark
> > > generators
> > > >>  - Need a way to tell SplitReader to "suspend" a split until a
> certain
> > > >> watermark is reached (event time backpressure)
> > > >>  - This would in fact be not needed (and thus simpler) if we had a
> > > >> SplitReader per split and may be a reason to re-open that discussion
> > > >>
> > > >> *(6) Watermarks across splits and in the Split Enumerator*
> > > >>
> > > >>  - The split enumerator may need some watermark awareness, which
> > should
> > > be
> > > >> purely based on split metadata (like create timestamp of file
> splits)
> > > >>  - If there are still more splits with overlapping event time range
> > for
> > > a
> > > >> split reader, then that split reader should not advance the
> watermark
> > > >> within the split beyond the overlap boundary. Otherwise future
> splits
> > > will
> > > >> produce late data.
> > > >>
> > > >>  - One way to approach this could be that the split enumerator may
> > send
> > > >> watermarks to the readers, and the readers cannot emit watermarks
> > beyond
> > > >> that received watermark.
> > > >>  - Many split enumerators would simply immediately send Long.MAX out
> > and
> > > >> leave the progress purely to the split readers.
> > > >>
> > > >>  - For event-time alignment / split back pressure, this begs the
> > > question
> > > >> how we can avoid deadlocks that may arise when splits are suspended
> > for
> > > >> event time back pressure,
> > > >>
> > > >> *(7) Batch and streaming Unification*
> > > >>
> > > >>  - Functionality wise, the above design should support both
> > > >>  - Batch often (mostly) does not care about reading "in order" and
> > > >> generating watermarks
> > > >>    --> Might use different enumerator logic that is more locality
> > aware
> > > >> and ignores event time order
> > > >>    --> Does not generate watermarks
> > > >>  - Would be great if bounded sources could be identified at compile
> > > time,
> > > >> so that "env.addBoundedSource(...)" is type safe and can return a
> > > >> "BoundedDataStream".
> > > >>  - Possible to defer this discussion until later
> > > >>
> > > >> *Miscellaneous Comments*
> > > >>
> > > >>  - Should the source have a TypeInformation for the produced type,
> > > instead
> > > >> of a serializer? We need a type information in the stream anyways,
> and
> > > can
> > > >> derive the serializer from that. Plus, creating the serializer
> should
> > > >> respect the ExecutionConfig.
> > > >>
> > > >>  - The TypeSerializer interface is very powerful but also not easy
> to
> > > >> implement. Its purpose is to handle data super efficiently, support
> > > >> flexible ways of evolution, etc.
> > > >>  For metadata I would suggest to look at the
> SimpleVersionedSerializer
> > > >> instead, which is used for example for checkpoint master hooks, or
> for
> > > the
> > > >> streaming file sink. I think that is is a good match for cases where
> > we
> > > do
> > > >> not need more than ser/deser (no copy, etc.) and don't need to push
> > > >> versioning out of the serialization paths for best performance (as
> in
> > > the
> > > >> TypeSerializer)
> > > >>
> > > >>
> > > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
> > > >> [hidden email]>
> > > >> wrote:
> > > >>
> > > >>> Hi Biao,
> > > >>>
> > > >>> Thanks for the answer!
> > > >>>
> > > >>> So given the multi-threaded readers, now we have as open questions:
> > > >>>
> > > >>> 1) How do we let the checkpoints pass through our multi-threaded
> > reader
> > > >>> operator?
> > > >>>
> > > >>> 2) Do we have separate reader and source operators or not? In the
> > > >> strategy
> > > >>> that has a separate source, the source operator has a parallelism
> of
> > 1
> > > >> and
> > > >>> is responsible for split recovery only.
> > > >>>
> > > >>> For the first one, given also the constraints (blocking, finite
> > queues,
> > > >>> etc), I do not have an answer yet.
> > > >>>
> > > >>> For the 2nd, I think that we should go with separate operators for
> > the
> > > >>> source and the readers, for the following reasons:
> > > >>>
> > > >>> 1) This is more aligned with a potential future improvement where
> the
> > > >> split
> > > >>> discovery becomes a responsibility of the JobManager and readers
> are
> > > >>> pooling more work from the JM.
> > > >>>
> > > >>> 2) The source is going to be the "single point of truth". It will
> > know
> > > >> what
> > > >>> has been processed and what not. If the source and the readers are
> a
> > > >> single
> > > >>> operator with parallelism > 1, or in general, if the split
> discovery
> > is
> > > >>> done by each task individually, then:
> > > >>>   i) we have to have a deterministic scheme for each reader to
> assign
> > > >>> splits to itself (e.g. mod subtaskId). This is not necessarily
> > trivial
> > > >> for
> > > >>> all sources.
> > > >>>   ii) each reader would have to keep a copy of all its processed
> > slpits
> > > >>>   iii) the state has to be a union state with a non-trivial merging
> > > >> logic
> > > >>> in order to support rescaling.
> > > >>>
> > > >>> Two additional points that you raised above:
> > > >>>
> > > >>> i) The point that you raised that we need to keep all splits
> > (processed
> > > >> and
> > > >>> not-processed) I think is a bit of a strong requirement. This would
> > > imply
> > > >>> that for infinite sources the state will grow indefinitely. This is
> > > >> problem
> > > >>> is even more pronounced if we do not have a single source that
> > assigns
> > > >>> splits to readers, as each reader will have its own copy of the
> > state.
> > > >>>
> > > >>> ii) it is true that for finite sources we need to somehow not close
> > the
> > > >>> readers when the source/split discoverer finishes. The
> > > >>> ContinuousFileReaderOperator has a work-around for that. It is not
> > > >> elegant,
> > > >>> and checkpoints are not emitted after closing the source, but
> this, I
> > > >>> believe, is a bigger problem which requires more changes than just
> > > >>> refactoring the source interface.
> > > >>>
> > > >>> Cheers,
> > > >>> Kostas
> > > >>>
> > > >>
> > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Biao Liu
Hi devs,

Since 1.9 is nearly released, I think we could get back to FLIP-27. I
believe it should be included in 1.10.

There are so many things mentioned in document of FLIP-27. [1] I think we'd
better discuss them separately. However the wiki is not a good place to
discuss. I wrote google doc about SplitReader API which misses some details
in the document. [2]

1.
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27:+Refactor+Source+Interface
2.
https://docs.google.com/document/d/1R1s_89T4S3CZwq7Tf31DciaMCqZwrLHGZFqPASu66oE/edit?usp=sharing

CC Stephan, Aljoscha, Piotrek, Becket


On Thu, Mar 28, 2019 at 4:38 PM Biao Liu <[hidden email]> wrote:

> Hi Steven,
> Thank you for the feedback. Please take a look at the document FLIP-27
> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface> which
> is updated recently. A lot of details of enumerator were added in this
> document. I think it would help.
>
> Steven Wu <[hidden email]> 于2019年3月28日周四 下午12:52写道:
>
>> This proposal mentioned that SplitEnumerator might run on the JobManager
>> or
>> in a single task on a TaskManager.
>>
>> if enumerator is a single task on a taskmanager, then the job DAG can
>> never
>> been embarrassingly parallel anymore. That will nullify the leverage of
>> fine-grained recovery for embarrassingly parallel jobs.
>>
>> It's not clear to me what's the implication of running enumerator on the
>> jobmanager. So I will leave that out for now.
>>
>> On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]> wrote:
>>
>> > Hi Stephan & Piotrek,
>> >
>> > Thank you for feedback.
>> >
>> > It seems that there are a lot of things to do in community. I am just
>> > afraid that this discussion may be forgotten since there so many
>> proposals
>> > recently.
>> > Anyway, wish to see the split topics soon :)
>> >
>> > Piotr Nowojski <[hidden email]> 于2019年1月24日周四 下午8:21写道:
>> >
>> > > Hi Biao!
>> > >
>> > > This discussion was stalled because of preparations for the open
>> sourcing
>> > > & merging Blink. I think before creating the tickets we should split
>> this
>> > > discussion into topics/areas outlined by Stephan and create Flips for
>> > that.
>> > >
>> > > I think there is no chance for this to be completed in couple of
>> > remaining
>> > > weeks/1 month before 1.8 feature freeze, however it would be good to
>> aim
>> > > with those changes for 1.9.
>> > >
>> > > Piotrek
>> > >
>> > > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]> wrote:
>> > > >
>> > > > Hi community,
>> > > > The summary of Stephan makes a lot sense to me. It is much clearer
>> > indeed
>> > > > after splitting the complex topic into small ones.
>> > > > I was wondering is there any detail plan for next step? If not, I
>> would
>> > > > like to push this thing forward by creating some JIRA issues.
>> > > > Another question is that should version 1.8 include these features?
>> > > >
>> > > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
>> > > >
>> > > >> Thanks everyone for the lively discussion. Let me try to summarize
>> > > where I
>> > > >> see convergence in the discussion and open issues.
>> > > >> I'll try to group this by design aspect of the source. Please let
>> me
>> > > know
>> > > >> if I got things wrong or missed something crucial here.
>> > > >>
>> > > >> For issues 1-3, if the below reflects the state of the discussion,
>> I
>> > > would
>> > > >> try and update the FLIP in the next days.
>> > > >> For the remaining ones we need more discussion.
>> > > >>
>> > > >> I would suggest to fork each of these aspects into a separate mail
>> > > thread,
>> > > >> or will loose sight of the individual aspects.
>> > > >>
>> > > >> *(1) Separation of Split Enumerator and Split Reader*
>> > > >>
>> > > >>  - All seem to agree this is a good thing
>> > > >>  - Split Enumerator could in the end live on JobManager (and assign
>> > > splits
>> > > >> via RPC) or in a task (and assign splits via data streams)
>> > > >>  - this discussion is orthogonal and should come later, when the
>> > > interface
>> > > >> is agreed upon.
>> > > >>
>> > > >> *(2) Split Readers for one or more splits*
>> > > >>
>> > > >>  - Discussion seems to agree that we need to support one reader
>> that
>> > > >> possibly handles multiple splits concurrently.
>> > > >>  - The requirement comes from sources where one poll()-style call
>> > > fetches
>> > > >> data from different splits / partitions
>> > > >>    --> example sources that require that would be for example
>> Kafka,
>> > > >> Pravega, Pulsar
>> > > >>
>> > > >>  - Could have one split reader per source, or multiple split
>> readers
>> > > that
>> > > >> share the "poll()" function
>> > > >>  - To not make it too complicated, we can start with thinking about
>> > one
>> > > >> split reader for all splits initially and see if that covers all
>> > > >> requirements
>> > > >>
>> > > >> *(3) Threading model of the Split Reader*
>> > > >>
>> > > >>  - Most active part of the discussion ;-)
>> > > >>
>> > > >>  - A non-blocking way for Flink's task code to interact with the
>> > source
>> > > is
>> > > >> needed in order to a task runtime code based on a
>> > > >> single-threaded/actor-style task design
>> > > >>    --> I personally am a big proponent of that, it will help with
>> > > >> well-behaved checkpoints, efficiency, and simpler yet more robust
>> > > runtime
>> > > >> code
>> > > >>
>> > > >>  - Users care about simple abstraction, so as a subclass of
>> > SplitReader
>> > > >> (non-blocking / async) we need to have a BlockingSplitReader which
>> > will
>> > > >> form the basis of most source implementations. BlockingSplitReader
>> > lets
>> > > >> users do blocking simple poll() calls.
>> > > >>  - The BlockingSplitReader would spawn a thread (or more) and the
>> > > >> thread(s) can make blocking calls and hand over data buffers via a
>> > > blocking
>> > > >> queue
>> > > >>  - This should allow us to cover both, a fully async runtime, and a
>> > > simple
>> > > >> blocking interface for users.
>> > > >>  - This is actually very similar to how the Kafka connectors work.
>> > Kafka
>> > > >> 9+ with one thread, Kafka 8 with multiple threads
>> > > >>
>> > > >>  - On the base SplitReader (the async one), the non-blocking method
>> > that
>> > > >> gets the next chunk of data would signal data availability via a
>> > > >> CompletableFuture, because that gives the best flexibility (can
>> await
>> > > >> completion or register notification handlers).
>> > > >>  - The source task would register a "thenHandle()" (or similar) on
>> the
>> > > >> future to put a "take next data" task into the actor-style mailbox
>> > > >>
>> > > >> *(4) Split Enumeration and Assignment*
>> > > >>
>> > > >>  - Splits may be generated lazily, both in cases where there is a
>> > > limited
>> > > >> number of splits (but very many), or splits are discovered over
>> time
>> > > >>  - Assignment should also be lazy, to get better load balancing
>> > > >>  - Assignment needs support locality preferences
>> > > >>
>> > > >>  - Possible design based on discussion so far:
>> > > >>
>> > > >>    --> SplitReader has a method "addSplits(SplitT...)" to add one
>> or
>> > > more
>> > > >> splits. Some split readers might assume they have only one split
>> ever,
>> > > >> concurrently, others assume multiple splits. (Note: idea behind
>> being
>> > > able
>> > > >> to add multiple splits at the same time is to ease startup where
>> > > multiple
>> > > >> splits may be assigned instantly.)
>> > > >>    --> SplitReader has a context object on which it can call
>> indicate
>> > > when
>> > > >> splits are completed. The enumerator gets that notification and can
>> > use
>> > > to
>> > > >> decide when to assign new splits. This should help both in cases of
>> > > sources
>> > > >> that take splits lazily (file readers) and in case the source
>> needs to
>> > > >> preserve a partial order between splits (Kinesis, Pravega, Pulsar
>> may
>> > > need
>> > > >> that).
>> > > >>    --> SplitEnumerator gets notification when SplitReaders start
>> and
>> > > when
>> > > >> they finish splits. They can decide at that moment to push more
>> splits
>> > > to
>> > > >> that reader
>> > > >>    --> The SplitEnumerator should probably be aware of the source
>> > > >> parallelism, to build its initial distribution.
>> > > >>
>> > > >>  - Open question: Should the source expose something like "host
>> > > >> preferences", so that yarn/mesos/k8s can take this into account
>> when
>> > > >> selecting a node to start a TM on?
>> > > >>
>> > > >> *(5) Watermarks and event time alignment*
>> > > >>
>> > > >>  - Watermark generation, as well as idleness, needs to be per split
>> > > (like
>> > > >> currently in the Kafka Source, per partition)
>> > > >>  - It is desirable to support optional event-time-alignment,
>> meaning
>> > > that
>> > > >> splits that are ahead are back-pressured or temporarily
>> unsubscribed
>> > > >>
>> > > >>  - I think i would be desirable to encapsulate watermark generation
>> > > logic
>> > > >> in watermark generators, for a separation of concerns. The
>> watermark
>> > > >> generators should run per split.
>> > > >>  - Using watermark generators would also help with another problem
>> of
>> > > the
>> > > >> suggested interface, namely supporting non-periodic watermarks
>> > > efficiently.
>> > > >>
>> > > >>  - Need a way to "dispatch" next record to different watermark
>> > > generators
>> > > >>  - Need a way to tell SplitReader to "suspend" a split until a
>> certain
>> > > >> watermark is reached (event time backpressure)
>> > > >>  - This would in fact be not needed (and thus simpler) if we had a
>> > > >> SplitReader per split and may be a reason to re-open that
>> discussion
>> > > >>
>> > > >> *(6) Watermarks across splits and in the Split Enumerator*
>> > > >>
>> > > >>  - The split enumerator may need some watermark awareness, which
>> > should
>> > > be
>> > > >> purely based on split metadata (like create timestamp of file
>> splits)
>> > > >>  - If there are still more splits with overlapping event time range
>> > for
>> > > a
>> > > >> split reader, then that split reader should not advance the
>> watermark
>> > > >> within the split beyond the overlap boundary. Otherwise future
>> splits
>> > > will
>> > > >> produce late data.
>> > > >>
>> > > >>  - One way to approach this could be that the split enumerator may
>> > send
>> > > >> watermarks to the readers, and the readers cannot emit watermarks
>> > beyond
>> > > >> that received watermark.
>> > > >>  - Many split enumerators would simply immediately send Long.MAX
>> out
>> > and
>> > > >> leave the progress purely to the split readers.
>> > > >>
>> > > >>  - For event-time alignment / split back pressure, this begs the
>> > > question
>> > > >> how we can avoid deadlocks that may arise when splits are suspended
>> > for
>> > > >> event time back pressure,
>> > > >>
>> > > >> *(7) Batch and streaming Unification*
>> > > >>
>> > > >>  - Functionality wise, the above design should support both
>> > > >>  - Batch often (mostly) does not care about reading "in order" and
>> > > >> generating watermarks
>> > > >>    --> Might use different enumerator logic that is more locality
>> > aware
>> > > >> and ignores event time order
>> > > >>    --> Does not generate watermarks
>> > > >>  - Would be great if bounded sources could be identified at compile
>> > > time,
>> > > >> so that "env.addBoundedSource(...)" is type safe and can return a
>> > > >> "BoundedDataStream".
>> > > >>  - Possible to defer this discussion until later
>> > > >>
>> > > >> *Miscellaneous Comments*
>> > > >>
>> > > >>  - Should the source have a TypeInformation for the produced type,
>> > > instead
>> > > >> of a serializer? We need a type information in the stream anyways,
>> and
>> > > can
>> > > >> derive the serializer from that. Plus, creating the serializer
>> should
>> > > >> respect the ExecutionConfig.
>> > > >>
>> > > >>  - The TypeSerializer interface is very powerful but also not easy
>> to
>> > > >> implement. Its purpose is to handle data super efficiently, support
>> > > >> flexible ways of evolution, etc.
>> > > >>  For metadata I would suggest to look at the
>> SimpleVersionedSerializer
>> > > >> instead, which is used for example for checkpoint master hooks, or
>> for
>> > > the
>> > > >> streaming file sink. I think that is is a good match for cases
>> where
>> > we
>> > > do
>> > > >> not need more than ser/deser (no copy, etc.) and don't need to push
>> > > >> versioning out of the serialization paths for best performance (as
>> in
>> > > the
>> > > >> TypeSerializer)
>> > > >>
>> > > >>
>> > > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
>> > > >> [hidden email]>
>> > > >> wrote:
>> > > >>
>> > > >>> Hi Biao,
>> > > >>>
>> > > >>> Thanks for the answer!
>> > > >>>
>> > > >>> So given the multi-threaded readers, now we have as open
>> questions:
>> > > >>>
>> > > >>> 1) How do we let the checkpoints pass through our multi-threaded
>> > reader
>> > > >>> operator?
>> > > >>>
>> > > >>> 2) Do we have separate reader and source operators or not? In the
>> > > >> strategy
>> > > >>> that has a separate source, the source operator has a parallelism
>> of
>> > 1
>> > > >> and
>> > > >>> is responsible for split recovery only.
>> > > >>>
>> > > >>> For the first one, given also the constraints (blocking, finite
>> > queues,
>> > > >>> etc), I do not have an answer yet.
>> > > >>>
>> > > >>> For the 2nd, I think that we should go with separate operators for
>> > the
>> > > >>> source and the readers, for the following reasons:
>> > > >>>
>> > > >>> 1) This is more aligned with a potential future improvement where
>> the
>> > > >> split
>> > > >>> discovery becomes a responsibility of the JobManager and readers
>> are
>> > > >>> pooling more work from the JM.
>> > > >>>
>> > > >>> 2) The source is going to be the "single point of truth". It will
>> > know
>> > > >> what
>> > > >>> has been processed and what not. If the source and the readers
>> are a
>> > > >> single
>> > > >>> operator with parallelism > 1, or in general, if the split
>> discovery
>> > is
>> > > >>> done by each task individually, then:
>> > > >>>   i) we have to have a deterministic scheme for each reader to
>> assign
>> > > >>> splits to itself (e.g. mod subtaskId). This is not necessarily
>> > trivial
>> > > >> for
>> > > >>> all sources.
>> > > >>>   ii) each reader would have to keep a copy of all its processed
>> > slpits
>> > > >>>   iii) the state has to be a union state with a non-trivial
>> merging
>> > > >> logic
>> > > >>> in order to support rescaling.
>> > > >>>
>> > > >>> Two additional points that you raised above:
>> > > >>>
>> > > >>> i) The point that you raised that we need to keep all splits
>> > (processed
>> > > >> and
>> > > >>> not-processed) I think is a bit of a strong requirement. This
>> would
>> > > imply
>> > > >>> that for infinite sources the state will grow indefinitely. This
>> is
>> > > >> problem
>> > > >>> is even more pronounced if we do not have a single source that
>> > assigns
>> > > >>> splits to readers, as each reader will have its own copy of the
>> > state.
>> > > >>>
>> > > >>> ii) it is true that for finite sources we need to somehow not
>> close
>> > the
>> > > >>> readers when the source/split discoverer finishes. The
>> > > >>> ContinuousFileReaderOperator has a work-around for that. It is not
>> > > >> elegant,
>> > > >>> and checkpoints are not emitted after closing the source, but
>> this, I
>> > > >>> believe, is a bigger problem which requires more changes than just
>> > > >>> refactoring the source interface.
>> > > >>>
>> > > >>> Cheers,
>> > > >>> Kostas
>> > > >>>
>> > > >>
>> > >
>> > >
>> >
>>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Stephan Ewen
Hi Biao!

Thanks for reviving this. I would like to join this discussion, but am
quite occupied with the 1.9 release, so can we maybe pause this discussion
for a week or so?

In the meantime I can share some suggestion based on prior experiments:

How to do watermarks / timestamp extractors in a simpler and more flexible
way. I think that part is quite promising should be part of the new source
interface.
https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/eventtime

https://github.com/StephanEwen/flink/blob/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src/SourceOutput.java



Some experiments on how to build the source reader and its library for
common threading/split patterns:
https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src


Best,
Stephan


On Thu, Jul 25, 2019 at 10:03 AM Biao Liu <[hidden email]> wrote:

> Hi devs,
>
> Since 1.9 is nearly released, I think we could get back to FLIP-27. I
> believe it should be included in 1.10.
>
> There are so many things mentioned in document of FLIP-27. [1] I think
> we'd better discuss them separately. However the wiki is not a good place
> to discuss. I wrote google doc about SplitReader API which misses some
> details in the document. [2]
>
> 1.
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27:+Refactor+Source+Interface
> 2.
> https://docs.google.com/document/d/1R1s_89T4S3CZwq7Tf31DciaMCqZwrLHGZFqPASu66oE/edit?usp=sharing
>
> CC Stephan, Aljoscha, Piotrek, Becket
>
>
> On Thu, Mar 28, 2019 at 4:38 PM Biao Liu <[hidden email]> wrote:
>
>> Hi Steven,
>> Thank you for the feedback. Please take a look at the document FLIP-27
>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface> which
>> is updated recently. A lot of details of enumerator were added in this
>> document. I think it would help.
>>
>> Steven Wu <[hidden email]> 于2019年3月28日周四 下午12:52写道:
>>
>>> This proposal mentioned that SplitEnumerator might run on the JobManager
>>> or
>>> in a single task on a TaskManager.
>>>
>>> if enumerator is a single task on a taskmanager, then the job DAG can
>>> never
>>> been embarrassingly parallel anymore. That will nullify the leverage of
>>> fine-grained recovery for embarrassingly parallel jobs.
>>>
>>> It's not clear to me what's the implication of running enumerator on the
>>> jobmanager. So I will leave that out for now.
>>>
>>> On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]> wrote:
>>>
>>> > Hi Stephan & Piotrek,
>>> >
>>> > Thank you for feedback.
>>> >
>>> > It seems that there are a lot of things to do in community. I am just
>>> > afraid that this discussion may be forgotten since there so many
>>> proposals
>>> > recently.
>>> > Anyway, wish to see the split topics soon :)
>>> >
>>> > Piotr Nowojski <[hidden email]> 于2019年1月24日周四 下午8:21写道:
>>> >
>>> > > Hi Biao!
>>> > >
>>> > > This discussion was stalled because of preparations for the open
>>> sourcing
>>> > > & merging Blink. I think before creating the tickets we should split
>>> this
>>> > > discussion into topics/areas outlined by Stephan and create Flips for
>>> > that.
>>> > >
>>> > > I think there is no chance for this to be completed in couple of
>>> > remaining
>>> > > weeks/1 month before 1.8 feature freeze, however it would be good to
>>> aim
>>> > > with those changes for 1.9.
>>> > >
>>> > > Piotrek
>>> > >
>>> > > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]> wrote:
>>> > > >
>>> > > > Hi community,
>>> > > > The summary of Stephan makes a lot sense to me. It is much clearer
>>> > indeed
>>> > > > after splitting the complex topic into small ones.
>>> > > > I was wondering is there any detail plan for next step? If not, I
>>> would
>>> > > > like to push this thing forward by creating some JIRA issues.
>>> > > > Another question is that should version 1.8 include these features?
>>> > > >
>>> > > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
>>> > > >
>>> > > >> Thanks everyone for the lively discussion. Let me try to summarize
>>> > > where I
>>> > > >> see convergence in the discussion and open issues.
>>> > > >> I'll try to group this by design aspect of the source. Please let
>>> me
>>> > > know
>>> > > >> if I got things wrong or missed something crucial here.
>>> > > >>
>>> > > >> For issues 1-3, if the below reflects the state of the
>>> discussion, I
>>> > > would
>>> > > >> try and update the FLIP in the next days.
>>> > > >> For the remaining ones we need more discussion.
>>> > > >>
>>> > > >> I would suggest to fork each of these aspects into a separate mail
>>> > > thread,
>>> > > >> or will loose sight of the individual aspects.
>>> > > >>
>>> > > >> *(1) Separation of Split Enumerator and Split Reader*
>>> > > >>
>>> > > >>  - All seem to agree this is a good thing
>>> > > >>  - Split Enumerator could in the end live on JobManager (and
>>> assign
>>> > > splits
>>> > > >> via RPC) or in a task (and assign splits via data streams)
>>> > > >>  - this discussion is orthogonal and should come later, when the
>>> > > interface
>>> > > >> is agreed upon.
>>> > > >>
>>> > > >> *(2) Split Readers for one or more splits*
>>> > > >>
>>> > > >>  - Discussion seems to agree that we need to support one reader
>>> that
>>> > > >> possibly handles multiple splits concurrently.
>>> > > >>  - The requirement comes from sources where one poll()-style call
>>> > > fetches
>>> > > >> data from different splits / partitions
>>> > > >>    --> example sources that require that would be for example
>>> Kafka,
>>> > > >> Pravega, Pulsar
>>> > > >>
>>> > > >>  - Could have one split reader per source, or multiple split
>>> readers
>>> > > that
>>> > > >> share the "poll()" function
>>> > > >>  - To not make it too complicated, we can start with thinking
>>> about
>>> > one
>>> > > >> split reader for all splits initially and see if that covers all
>>> > > >> requirements
>>> > > >>
>>> > > >> *(3) Threading model of the Split Reader*
>>> > > >>
>>> > > >>  - Most active part of the discussion ;-)
>>> > > >>
>>> > > >>  - A non-blocking way for Flink's task code to interact with the
>>> > source
>>> > > is
>>> > > >> needed in order to a task runtime code based on a
>>> > > >> single-threaded/actor-style task design
>>> > > >>    --> I personally am a big proponent of that, it will help with
>>> > > >> well-behaved checkpoints, efficiency, and simpler yet more robust
>>> > > runtime
>>> > > >> code
>>> > > >>
>>> > > >>  - Users care about simple abstraction, so as a subclass of
>>> > SplitReader
>>> > > >> (non-blocking / async) we need to have a BlockingSplitReader which
>>> > will
>>> > > >> form the basis of most source implementations. BlockingSplitReader
>>> > lets
>>> > > >> users do blocking simple poll() calls.
>>> > > >>  - The BlockingSplitReader would spawn a thread (or more) and the
>>> > > >> thread(s) can make blocking calls and hand over data buffers via a
>>> > > blocking
>>> > > >> queue
>>> > > >>  - This should allow us to cover both, a fully async runtime, and
>>> a
>>> > > simple
>>> > > >> blocking interface for users.
>>> > > >>  - This is actually very similar to how the Kafka connectors work.
>>> > Kafka
>>> > > >> 9+ with one thread, Kafka 8 with multiple threads
>>> > > >>
>>> > > >>  - On the base SplitReader (the async one), the non-blocking
>>> method
>>> > that
>>> > > >> gets the next chunk of data would signal data availability via a
>>> > > >> CompletableFuture, because that gives the best flexibility (can
>>> await
>>> > > >> completion or register notification handlers).
>>> > > >>  - The source task would register a "thenHandle()" (or similar)
>>> on the
>>> > > >> future to put a "take next data" task into the actor-style mailbox
>>> > > >>
>>> > > >> *(4) Split Enumeration and Assignment*
>>> > > >>
>>> > > >>  - Splits may be generated lazily, both in cases where there is a
>>> > > limited
>>> > > >> number of splits (but very many), or splits are discovered over
>>> time
>>> > > >>  - Assignment should also be lazy, to get better load balancing
>>> > > >>  - Assignment needs support locality preferences
>>> > > >>
>>> > > >>  - Possible design based on discussion so far:
>>> > > >>
>>> > > >>    --> SplitReader has a method "addSplits(SplitT...)" to add one
>>> or
>>> > > more
>>> > > >> splits. Some split readers might assume they have only one split
>>> ever,
>>> > > >> concurrently, others assume multiple splits. (Note: idea behind
>>> being
>>> > > able
>>> > > >> to add multiple splits at the same time is to ease startup where
>>> > > multiple
>>> > > >> splits may be assigned instantly.)
>>> > > >>    --> SplitReader has a context object on which it can call
>>> indicate
>>> > > when
>>> > > >> splits are completed. The enumerator gets that notification and
>>> can
>>> > use
>>> > > to
>>> > > >> decide when to assign new splits. This should help both in cases
>>> of
>>> > > sources
>>> > > >> that take splits lazily (file readers) and in case the source
>>> needs to
>>> > > >> preserve a partial order between splits (Kinesis, Pravega, Pulsar
>>> may
>>> > > need
>>> > > >> that).
>>> > > >>    --> SplitEnumerator gets notification when SplitReaders start
>>> and
>>> > > when
>>> > > >> they finish splits. They can decide at that moment to push more
>>> splits
>>> > > to
>>> > > >> that reader
>>> > > >>    --> The SplitEnumerator should probably be aware of the source
>>> > > >> parallelism, to build its initial distribution.
>>> > > >>
>>> > > >>  - Open question: Should the source expose something like "host
>>> > > >> preferences", so that yarn/mesos/k8s can take this into account
>>> when
>>> > > >> selecting a node to start a TM on?
>>> > > >>
>>> > > >> *(5) Watermarks and event time alignment*
>>> > > >>
>>> > > >>  - Watermark generation, as well as idleness, needs to be per
>>> split
>>> > > (like
>>> > > >> currently in the Kafka Source, per partition)
>>> > > >>  - It is desirable to support optional event-time-alignment,
>>> meaning
>>> > > that
>>> > > >> splits that are ahead are back-pressured or temporarily
>>> unsubscribed
>>> > > >>
>>> > > >>  - I think i would be desirable to encapsulate watermark
>>> generation
>>> > > logic
>>> > > >> in watermark generators, for a separation of concerns. The
>>> watermark
>>> > > >> generators should run per split.
>>> > > >>  - Using watermark generators would also help with another
>>> problem of
>>> > > the
>>> > > >> suggested interface, namely supporting non-periodic watermarks
>>> > > efficiently.
>>> > > >>
>>> > > >>  - Need a way to "dispatch" next record to different watermark
>>> > > generators
>>> > > >>  - Need a way to tell SplitReader to "suspend" a split until a
>>> certain
>>> > > >> watermark is reached (event time backpressure)
>>> > > >>  - This would in fact be not needed (and thus simpler) if we had a
>>> > > >> SplitReader per split and may be a reason to re-open that
>>> discussion
>>> > > >>
>>> > > >> *(6) Watermarks across splits and in the Split Enumerator*
>>> > > >>
>>> > > >>  - The split enumerator may need some watermark awareness, which
>>> > should
>>> > > be
>>> > > >> purely based on split metadata (like create timestamp of file
>>> splits)
>>> > > >>  - If there are still more splits with overlapping event time
>>> range
>>> > for
>>> > > a
>>> > > >> split reader, then that split reader should not advance the
>>> watermark
>>> > > >> within the split beyond the overlap boundary. Otherwise future
>>> splits
>>> > > will
>>> > > >> produce late data.
>>> > > >>
>>> > > >>  - One way to approach this could be that the split enumerator may
>>> > send
>>> > > >> watermarks to the readers, and the readers cannot emit watermarks
>>> > beyond
>>> > > >> that received watermark.
>>> > > >>  - Many split enumerators would simply immediately send Long.MAX
>>> out
>>> > and
>>> > > >> leave the progress purely to the split readers.
>>> > > >>
>>> > > >>  - For event-time alignment / split back pressure, this begs the
>>> > > question
>>> > > >> how we can avoid deadlocks that may arise when splits are
>>> suspended
>>> > for
>>> > > >> event time back pressure,
>>> > > >>
>>> > > >> *(7) Batch and streaming Unification*
>>> > > >>
>>> > > >>  - Functionality wise, the above design should support both
>>> > > >>  - Batch often (mostly) does not care about reading "in order" and
>>> > > >> generating watermarks
>>> > > >>    --> Might use different enumerator logic that is more locality
>>> > aware
>>> > > >> and ignores event time order
>>> > > >>    --> Does not generate watermarks
>>> > > >>  - Would be great if bounded sources could be identified at
>>> compile
>>> > > time,
>>> > > >> so that "env.addBoundedSource(...)" is type safe and can return a
>>> > > >> "BoundedDataStream".
>>> > > >>  - Possible to defer this discussion until later
>>> > > >>
>>> > > >> *Miscellaneous Comments*
>>> > > >>
>>> > > >>  - Should the source have a TypeInformation for the produced type,
>>> > > instead
>>> > > >> of a serializer? We need a type information in the stream
>>> anyways, and
>>> > > can
>>> > > >> derive the serializer from that. Plus, creating the serializer
>>> should
>>> > > >> respect the ExecutionConfig.
>>> > > >>
>>> > > >>  - The TypeSerializer interface is very powerful but also not
>>> easy to
>>> > > >> implement. Its purpose is to handle data super efficiently,
>>> support
>>> > > >> flexible ways of evolution, etc.
>>> > > >>  For metadata I would suggest to look at the
>>> SimpleVersionedSerializer
>>> > > >> instead, which is used for example for checkpoint master hooks,
>>> or for
>>> > > the
>>> > > >> streaming file sink. I think that is is a good match for cases
>>> where
>>> > we
>>> > > do
>>> > > >> not need more than ser/deser (no copy, etc.) and don't need to
>>> push
>>> > > >> versioning out of the serialization paths for best performance
>>> (as in
>>> > > the
>>> > > >> TypeSerializer)
>>> > > >>
>>> > > >>
>>> > > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
>>> > > >> [hidden email]>
>>> > > >> wrote:
>>> > > >>
>>> > > >>> Hi Biao,
>>> > > >>>
>>> > > >>> Thanks for the answer!
>>> > > >>>
>>> > > >>> So given the multi-threaded readers, now we have as open
>>> questions:
>>> > > >>>
>>> > > >>> 1) How do we let the checkpoints pass through our multi-threaded
>>> > reader
>>> > > >>> operator?
>>> > > >>>
>>> > > >>> 2) Do we have separate reader and source operators or not? In the
>>> > > >> strategy
>>> > > >>> that has a separate source, the source operator has a
>>> parallelism of
>>> > 1
>>> > > >> and
>>> > > >>> is responsible for split recovery only.
>>> > > >>>
>>> > > >>> For the first one, given also the constraints (blocking, finite
>>> > queues,
>>> > > >>> etc), I do not have an answer yet.
>>> > > >>>
>>> > > >>> For the 2nd, I think that we should go with separate operators
>>> for
>>> > the
>>> > > >>> source and the readers, for the following reasons:
>>> > > >>>
>>> > > >>> 1) This is more aligned with a potential future improvement
>>> where the
>>> > > >> split
>>> > > >>> discovery becomes a responsibility of the JobManager and readers
>>> are
>>> > > >>> pooling more work from the JM.
>>> > > >>>
>>> > > >>> 2) The source is going to be the "single point of truth". It will
>>> > know
>>> > > >> what
>>> > > >>> has been processed and what not. If the source and the readers
>>> are a
>>> > > >> single
>>> > > >>> operator with parallelism > 1, or in general, if the split
>>> discovery
>>> > is
>>> > > >>> done by each task individually, then:
>>> > > >>>   i) we have to have a deterministic scheme for each reader to
>>> assign
>>> > > >>> splits to itself (e.g. mod subtaskId). This is not necessarily
>>> > trivial
>>> > > >> for
>>> > > >>> all sources.
>>> > > >>>   ii) each reader would have to keep a copy of all its processed
>>> > slpits
>>> > > >>>   iii) the state has to be a union state with a non-trivial
>>> merging
>>> > > >> logic
>>> > > >>> in order to support rescaling.
>>> > > >>>
>>> > > >>> Two additional points that you raised above:
>>> > > >>>
>>> > > >>> i) The point that you raised that we need to keep all splits
>>> > (processed
>>> > > >> and
>>> > > >>> not-processed) I think is a bit of a strong requirement. This
>>> would
>>> > > imply
>>> > > >>> that for infinite sources the state will grow indefinitely. This
>>> is
>>> > > >> problem
>>> > > >>> is even more pronounced if we do not have a single source that
>>> > assigns
>>> > > >>> splits to readers, as each reader will have its own copy of the
>>> > state.
>>> > > >>>
>>> > > >>> ii) it is true that for finite sources we need to somehow not
>>> close
>>> > the
>>> > > >>> readers when the source/split discoverer finishes. The
>>> > > >>> ContinuousFileReaderOperator has a work-around for that. It is
>>> not
>>> > > >> elegant,
>>> > > >>> and checkpoints are not emitted after closing the source, but
>>> this, I
>>> > > >>> believe, is a bigger problem which requires more changes than
>>> just
>>> > > >>> refactoring the source interface.
>>> > > >>>
>>> > > >>> Cheers,
>>> > > >>> Kostas
>>> > > >>>
>>> > > >>
>>> > >
>>> > >
>>> >
>>>
>>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Biao Liu
Hi Stephan,

Thank you for feedback!
Will take a look at your branch before public discussing.


On Fri, Jul 26, 2019 at 12:01 AM Stephan Ewen <[hidden email]> wrote:

> Hi Biao!
>
> Thanks for reviving this. I would like to join this discussion, but am
> quite occupied with the 1.9 release, so can we maybe pause this discussion
> for a week or so?
>
> In the meantime I can share some suggestion based on prior experiments:
>
> How to do watermarks / timestamp extractors in a simpler and more flexible
> way. I think that part is quite promising should be part of the new source
> interface.
>
> https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/eventtime
>
>
> https://github.com/StephanEwen/flink/blob/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src/SourceOutput.java
>
>
>
> Some experiments on how to build the source reader and its library for
> common threading/split patterns:
>
> https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src
>
>
> Best,
> Stephan
>
>
> On Thu, Jul 25, 2019 at 10:03 AM Biao Liu <[hidden email]> wrote:
>
>> Hi devs,
>>
>> Since 1.9 is nearly released, I think we could get back to FLIP-27. I
>> believe it should be included in 1.10.
>>
>> There are so many things mentioned in document of FLIP-27. [1] I think
>> we'd better discuss them separately. However the wiki is not a good place
>> to discuss. I wrote google doc about SplitReader API which misses some
>> details in the document. [2]
>>
>> 1.
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27:+Refactor+Source+Interface
>> 2.
>> https://docs.google.com/document/d/1R1s_89T4S3CZwq7Tf31DciaMCqZwrLHGZFqPASu66oE/edit?usp=sharing
>>
>> CC Stephan, Aljoscha, Piotrek, Becket
>>
>>
>> On Thu, Mar 28, 2019 at 4:38 PM Biao Liu <[hidden email]> wrote:
>>
>>> Hi Steven,
>>> Thank you for the feedback. Please take a look at the document FLIP-27
>>> <https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface> which
>>> is updated recently. A lot of details of enumerator were added in this
>>> document. I think it would help.
>>>
>>> Steven Wu <[hidden email]> 于2019年3月28日周四 下午12:52写道:
>>>
>>>> This proposal mentioned that SplitEnumerator might run on the
>>>> JobManager or
>>>> in a single task on a TaskManager.
>>>>
>>>> if enumerator is a single task on a taskmanager, then the job DAG can
>>>> never
>>>> been embarrassingly parallel anymore. That will nullify the leverage of
>>>> fine-grained recovery for embarrassingly parallel jobs.
>>>>
>>>> It's not clear to me what's the implication of running enumerator on the
>>>> jobmanager. So I will leave that out for now.
>>>>
>>>> On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]> wrote:
>>>>
>>>> > Hi Stephan & Piotrek,
>>>> >
>>>> > Thank you for feedback.
>>>> >
>>>> > It seems that there are a lot of things to do in community. I am just
>>>> > afraid that this discussion may be forgotten since there so many
>>>> proposals
>>>> > recently.
>>>> > Anyway, wish to see the split topics soon :)
>>>> >
>>>> > Piotr Nowojski <[hidden email]> 于2019年1月24日周四 下午8:21写道:
>>>> >
>>>> > > Hi Biao!
>>>> > >
>>>> > > This discussion was stalled because of preparations for the open
>>>> sourcing
>>>> > > & merging Blink. I think before creating the tickets we should
>>>> split this
>>>> > > discussion into topics/areas outlined by Stephan and create Flips
>>>> for
>>>> > that.
>>>> > >
>>>> > > I think there is no chance for this to be completed in couple of
>>>> > remaining
>>>> > > weeks/1 month before 1.8 feature freeze, however it would be good
>>>> to aim
>>>> > > with those changes for 1.9.
>>>> > >
>>>> > > Piotrek
>>>> > >
>>>> > > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]> wrote:
>>>> > > >
>>>> > > > Hi community,
>>>> > > > The summary of Stephan makes a lot sense to me. It is much clearer
>>>> > indeed
>>>> > > > after splitting the complex topic into small ones.
>>>> > > > I was wondering is there any detail plan for next step? If not, I
>>>> would
>>>> > > > like to push this thing forward by creating some JIRA issues.
>>>> > > > Another question is that should version 1.8 include these
>>>> features?
>>>> > > >
>>>> > > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
>>>> > > >
>>>> > > >> Thanks everyone for the lively discussion. Let me try to
>>>> summarize
>>>> > > where I
>>>> > > >> see convergence in the discussion and open issues.
>>>> > > >> I'll try to group this by design aspect of the source. Please
>>>> let me
>>>> > > know
>>>> > > >> if I got things wrong or missed something crucial here.
>>>> > > >>
>>>> > > >> For issues 1-3, if the below reflects the state of the
>>>> discussion, I
>>>> > > would
>>>> > > >> try and update the FLIP in the next days.
>>>> > > >> For the remaining ones we need more discussion.
>>>> > > >>
>>>> > > >> I would suggest to fork each of these aspects into a separate
>>>> mail
>>>> > > thread,
>>>> > > >> or will loose sight of the individual aspects.
>>>> > > >>
>>>> > > >> *(1) Separation of Split Enumerator and Split Reader*
>>>> > > >>
>>>> > > >>  - All seem to agree this is a good thing
>>>> > > >>  - Split Enumerator could in the end live on JobManager (and
>>>> assign
>>>> > > splits
>>>> > > >> via RPC) or in a task (and assign splits via data streams)
>>>> > > >>  - this discussion is orthogonal and should come later, when the
>>>> > > interface
>>>> > > >> is agreed upon.
>>>> > > >>
>>>> > > >> *(2) Split Readers for one or more splits*
>>>> > > >>
>>>> > > >>  - Discussion seems to agree that we need to support one reader
>>>> that
>>>> > > >> possibly handles multiple splits concurrently.
>>>> > > >>  - The requirement comes from sources where one poll()-style call
>>>> > > fetches
>>>> > > >> data from different splits / partitions
>>>> > > >>    --> example sources that require that would be for example
>>>> Kafka,
>>>> > > >> Pravega, Pulsar
>>>> > > >>
>>>> > > >>  - Could have one split reader per source, or multiple split
>>>> readers
>>>> > > that
>>>> > > >> share the "poll()" function
>>>> > > >>  - To not make it too complicated, we can start with thinking
>>>> about
>>>> > one
>>>> > > >> split reader for all splits initially and see if that covers all
>>>> > > >> requirements
>>>> > > >>
>>>> > > >> *(3) Threading model of the Split Reader*
>>>> > > >>
>>>> > > >>  - Most active part of the discussion ;-)
>>>> > > >>
>>>> > > >>  - A non-blocking way for Flink's task code to interact with the
>>>> > source
>>>> > > is
>>>> > > >> needed in order to a task runtime code based on a
>>>> > > >> single-threaded/actor-style task design
>>>> > > >>    --> I personally am a big proponent of that, it will help with
>>>> > > >> well-behaved checkpoints, efficiency, and simpler yet more robust
>>>> > > runtime
>>>> > > >> code
>>>> > > >>
>>>> > > >>  - Users care about simple abstraction, so as a subclass of
>>>> > SplitReader
>>>> > > >> (non-blocking / async) we need to have a BlockingSplitReader
>>>> which
>>>> > will
>>>> > > >> form the basis of most source implementations.
>>>> BlockingSplitReader
>>>> > lets
>>>> > > >> users do blocking simple poll() calls.
>>>> > > >>  - The BlockingSplitReader would spawn a thread (or more) and the
>>>> > > >> thread(s) can make blocking calls and hand over data buffers via
>>>> a
>>>> > > blocking
>>>> > > >> queue
>>>> > > >>  - This should allow us to cover both, a fully async runtime,
>>>> and a
>>>> > > simple
>>>> > > >> blocking interface for users.
>>>> > > >>  - This is actually very similar to how the Kafka connectors
>>>> work.
>>>> > Kafka
>>>> > > >> 9+ with one thread, Kafka 8 with multiple threads
>>>> > > >>
>>>> > > >>  - On the base SplitReader (the async one), the non-blocking
>>>> method
>>>> > that
>>>> > > >> gets the next chunk of data would signal data availability via a
>>>> > > >> CompletableFuture, because that gives the best flexibility (can
>>>> await
>>>> > > >> completion or register notification handlers).
>>>> > > >>  - The source task would register a "thenHandle()" (or similar)
>>>> on the
>>>> > > >> future to put a "take next data" task into the actor-style
>>>> mailbox
>>>> > > >>
>>>> > > >> *(4) Split Enumeration and Assignment*
>>>> > > >>
>>>> > > >>  - Splits may be generated lazily, both in cases where there is a
>>>> > > limited
>>>> > > >> number of splits (but very many), or splits are discovered over
>>>> time
>>>> > > >>  - Assignment should also be lazy, to get better load balancing
>>>> > > >>  - Assignment needs support locality preferences
>>>> > > >>
>>>> > > >>  - Possible design based on discussion so far:
>>>> > > >>
>>>> > > >>    --> SplitReader has a method "addSplits(SplitT...)" to add
>>>> one or
>>>> > > more
>>>> > > >> splits. Some split readers might assume they have only one split
>>>> ever,
>>>> > > >> concurrently, others assume multiple splits. (Note: idea behind
>>>> being
>>>> > > able
>>>> > > >> to add multiple splits at the same time is to ease startup where
>>>> > > multiple
>>>> > > >> splits may be assigned instantly.)
>>>> > > >>    --> SplitReader has a context object on which it can call
>>>> indicate
>>>> > > when
>>>> > > >> splits are completed. The enumerator gets that notification and
>>>> can
>>>> > use
>>>> > > to
>>>> > > >> decide when to assign new splits. This should help both in cases
>>>> of
>>>> > > sources
>>>> > > >> that take splits lazily (file readers) and in case the source
>>>> needs to
>>>> > > >> preserve a partial order between splits (Kinesis, Pravega,
>>>> Pulsar may
>>>> > > need
>>>> > > >> that).
>>>> > > >>    --> SplitEnumerator gets notification when SplitReaders start
>>>> and
>>>> > > when
>>>> > > >> they finish splits. They can decide at that moment to push more
>>>> splits
>>>> > > to
>>>> > > >> that reader
>>>> > > >>    --> The SplitEnumerator should probably be aware of the source
>>>> > > >> parallelism, to build its initial distribution.
>>>> > > >>
>>>> > > >>  - Open question: Should the source expose something like "host
>>>> > > >> preferences", so that yarn/mesos/k8s can take this into account
>>>> when
>>>> > > >> selecting a node to start a TM on?
>>>> > > >>
>>>> > > >> *(5) Watermarks and event time alignment*
>>>> > > >>
>>>> > > >>  - Watermark generation, as well as idleness, needs to be per
>>>> split
>>>> > > (like
>>>> > > >> currently in the Kafka Source, per partition)
>>>> > > >>  - It is desirable to support optional event-time-alignment,
>>>> meaning
>>>> > > that
>>>> > > >> splits that are ahead are back-pressured or temporarily
>>>> unsubscribed
>>>> > > >>
>>>> > > >>  - I think i would be desirable to encapsulate watermark
>>>> generation
>>>> > > logic
>>>> > > >> in watermark generators, for a separation of concerns. The
>>>> watermark
>>>> > > >> generators should run per split.
>>>> > > >>  - Using watermark generators would also help with another
>>>> problem of
>>>> > > the
>>>> > > >> suggested interface, namely supporting non-periodic watermarks
>>>> > > efficiently.
>>>> > > >>
>>>> > > >>  - Need a way to "dispatch" next record to different watermark
>>>> > > generators
>>>> > > >>  - Need a way to tell SplitReader to "suspend" a split until a
>>>> certain
>>>> > > >> watermark is reached (event time backpressure)
>>>> > > >>  - This would in fact be not needed (and thus simpler) if we had
>>>> a
>>>> > > >> SplitReader per split and may be a reason to re-open that
>>>> discussion
>>>> > > >>
>>>> > > >> *(6) Watermarks across splits and in the Split Enumerator*
>>>> > > >>
>>>> > > >>  - The split enumerator may need some watermark awareness, which
>>>> > should
>>>> > > be
>>>> > > >> purely based on split metadata (like create timestamp of file
>>>> splits)
>>>> > > >>  - If there are still more splits with overlapping event time
>>>> range
>>>> > for
>>>> > > a
>>>> > > >> split reader, then that split reader should not advance the
>>>> watermark
>>>> > > >> within the split beyond the overlap boundary. Otherwise future
>>>> splits
>>>> > > will
>>>> > > >> produce late data.
>>>> > > >>
>>>> > > >>  - One way to approach this could be that the split enumerator
>>>> may
>>>> > send
>>>> > > >> watermarks to the readers, and the readers cannot emit watermarks
>>>> > beyond
>>>> > > >> that received watermark.
>>>> > > >>  - Many split enumerators would simply immediately send Long.MAX
>>>> out
>>>> > and
>>>> > > >> leave the progress purely to the split readers.
>>>> > > >>
>>>> > > >>  - For event-time alignment / split back pressure, this begs the
>>>> > > question
>>>> > > >> how we can avoid deadlocks that may arise when splits are
>>>> suspended
>>>> > for
>>>> > > >> event time back pressure,
>>>> > > >>
>>>> > > >> *(7) Batch and streaming Unification*
>>>> > > >>
>>>> > > >>  - Functionality wise, the above design should support both
>>>> > > >>  - Batch often (mostly) does not care about reading "in order"
>>>> and
>>>> > > >> generating watermarks
>>>> > > >>    --> Might use different enumerator logic that is more locality
>>>> > aware
>>>> > > >> and ignores event time order
>>>> > > >>    --> Does not generate watermarks
>>>> > > >>  - Would be great if bounded sources could be identified at
>>>> compile
>>>> > > time,
>>>> > > >> so that "env.addBoundedSource(...)" is type safe and can return a
>>>> > > >> "BoundedDataStream".
>>>> > > >>  - Possible to defer this discussion until later
>>>> > > >>
>>>> > > >> *Miscellaneous Comments*
>>>> > > >>
>>>> > > >>  - Should the source have a TypeInformation for the produced
>>>> type,
>>>> > > instead
>>>> > > >> of a serializer? We need a type information in the stream
>>>> anyways, and
>>>> > > can
>>>> > > >> derive the serializer from that. Plus, creating the serializer
>>>> should
>>>> > > >> respect the ExecutionConfig.
>>>> > > >>
>>>> > > >>  - The TypeSerializer interface is very powerful but also not
>>>> easy to
>>>> > > >> implement. Its purpose is to handle data super efficiently,
>>>> support
>>>> > > >> flexible ways of evolution, etc.
>>>> > > >>  For metadata I would suggest to look at the
>>>> SimpleVersionedSerializer
>>>> > > >> instead, which is used for example for checkpoint master hooks,
>>>> or for
>>>> > > the
>>>> > > >> streaming file sink. I think that is is a good match for cases
>>>> where
>>>> > we
>>>> > > do
>>>> > > >> not need more than ser/deser (no copy, etc.) and don't need to
>>>> push
>>>> > > >> versioning out of the serialization paths for best performance
>>>> (as in
>>>> > > the
>>>> > > >> TypeSerializer)
>>>> > > >>
>>>> > > >>
>>>> > > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
>>>> > > >> [hidden email]>
>>>> > > >> wrote:
>>>> > > >>
>>>> > > >>> Hi Biao,
>>>> > > >>>
>>>> > > >>> Thanks for the answer!
>>>> > > >>>
>>>> > > >>> So given the multi-threaded readers, now we have as open
>>>> questions:
>>>> > > >>>
>>>> > > >>> 1) How do we let the checkpoints pass through our multi-threaded
>>>> > reader
>>>> > > >>> operator?
>>>> > > >>>
>>>> > > >>> 2) Do we have separate reader and source operators or not? In
>>>> the
>>>> > > >> strategy
>>>> > > >>> that has a separate source, the source operator has a
>>>> parallelism of
>>>> > 1
>>>> > > >> and
>>>> > > >>> is responsible for split recovery only.
>>>> > > >>>
>>>> > > >>> For the first one, given also the constraints (blocking, finite
>>>> > queues,
>>>> > > >>> etc), I do not have an answer yet.
>>>> > > >>>
>>>> > > >>> For the 2nd, I think that we should go with separate operators
>>>> for
>>>> > the
>>>> > > >>> source and the readers, for the following reasons:
>>>> > > >>>
>>>> > > >>> 1) This is more aligned with a potential future improvement
>>>> where the
>>>> > > >> split
>>>> > > >>> discovery becomes a responsibility of the JobManager and
>>>> readers are
>>>> > > >>> pooling more work from the JM.
>>>> > > >>>
>>>> > > >>> 2) The source is going to be the "single point of truth". It
>>>> will
>>>> > know
>>>> > > >> what
>>>> > > >>> has been processed and what not. If the source and the readers
>>>> are a
>>>> > > >> single
>>>> > > >>> operator with parallelism > 1, or in general, if the split
>>>> discovery
>>>> > is
>>>> > > >>> done by each task individually, then:
>>>> > > >>>   i) we have to have a deterministic scheme for each reader to
>>>> assign
>>>> > > >>> splits to itself (e.g. mod subtaskId). This is not necessarily
>>>> > trivial
>>>> > > >> for
>>>> > > >>> all sources.
>>>> > > >>>   ii) each reader would have to keep a copy of all its processed
>>>> > slpits
>>>> > > >>>   iii) the state has to be a union state with a non-trivial
>>>> merging
>>>> > > >> logic
>>>> > > >>> in order to support rescaling.
>>>> > > >>>
>>>> > > >>> Two additional points that you raised above:
>>>> > > >>>
>>>> > > >>> i) The point that you raised that we need to keep all splits
>>>> > (processed
>>>> > > >> and
>>>> > > >>> not-processed) I think is a bit of a strong requirement. This
>>>> would
>>>> > > imply
>>>> > > >>> that for infinite sources the state will grow indefinitely.
>>>> This is
>>>> > > >> problem
>>>> > > >>> is even more pronounced if we do not have a single source that
>>>> > assigns
>>>> > > >>> splits to readers, as each reader will have its own copy of the
>>>> > state.
>>>> > > >>>
>>>> > > >>> ii) it is true that for finite sources we need to somehow not
>>>> close
>>>> > the
>>>> > > >>> readers when the source/split discoverer finishes. The
>>>> > > >>> ContinuousFileReaderOperator has a work-around for that. It is
>>>> not
>>>> > > >> elegant,
>>>> > > >>> and checkpoints are not emitted after closing the source, but
>>>> this, I
>>>> > > >>> believe, is a bigger problem which requires more changes than
>>>> just
>>>> > > >>> refactoring the source interface.
>>>> > > >>>
>>>> > > >>> Cheers,
>>>> > > >>> Kostas
>>>> > > >>>
>>>> > > >>
>>>> > >
>>>> > >
>>>> >
>>>>
>>>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Thomas Weise-2
Thanks for putting together this proposal!

I see that the "Per Split Event Time" and "Event Time Alignment" sections
are still TBD.

It would probably be good to flesh those out a bit before proceeding too far
as the event time alignment will probably influence the interaction with
the split reader, specifically ReaderStatus emitNext(SourceOutput<E>
output).

We currently have only one implementation for event time alignment in the
Kinesis consumer. The synchronization in that case takes place as the last
step before records are emitted downstream (RecordEmitter). With the
currently proposed interfaces, the equivalent can be implemented in the
reader loop, although note that in the Kinesis consumer the per shard
threads push records.

Synchronization has not been implemented for the Kafka consumer yet.

https://issues.apache.org/jira/browse/FLINK-12675

When I looked at it, I realized that the implementation will look quite
different
from Kinesis because it needs to take place in the pull part, where records
are taken from the Kafka client. Due to the multiplexing it cannot be done
by blocking the split thread like it currently works for Kinesis. Reading
from individual Kafka partitions needs to be controlled via pause/resume
on the Kafka client.

To take on that responsibility the split thread would need to be aware of
the
watermarks or at least whether it should or should not continue to consume
a given split and this may require a different SourceReader or SourceOutput
interface.

Thanks,
Thomas


On Fri, Jul 26, 2019 at 1:39 AM Biao Liu <[hidden email]> wrote:

> Hi Stephan,
>
> Thank you for feedback!
> Will take a look at your branch before public discussing.
>
>
> On Fri, Jul 26, 2019 at 12:01 AM Stephan Ewen <[hidden email]> wrote:
>
> > Hi Biao!
> >
> > Thanks for reviving this. I would like to join this discussion, but am
> > quite occupied with the 1.9 release, so can we maybe pause this
> discussion
> > for a week or so?
> >
> > In the meantime I can share some suggestion based on prior experiments:
> >
> > How to do watermarks / timestamp extractors in a simpler and more
> flexible
> > way. I think that part is quite promising should be part of the new
> source
> > interface.
> >
> >
> https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/eventtime
> >
> >
> >
> https://github.com/StephanEwen/flink/blob/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src/SourceOutput.java
> >
> >
> >
> > Some experiments on how to build the source reader and its library for
> > common threading/split patterns:
> >
> >
> https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src
> >
> >
> > Best,
> > Stephan
> >
> >
> > On Thu, Jul 25, 2019 at 10:03 AM Biao Liu <[hidden email]> wrote:
> >
> >> Hi devs,
> >>
> >> Since 1.9 is nearly released, I think we could get back to FLIP-27. I
> >> believe it should be included in 1.10.
> >>
> >> There are so many things mentioned in document of FLIP-27. [1] I think
> >> we'd better discuss them separately. However the wiki is not a good
> place
> >> to discuss. I wrote google doc about SplitReader API which misses some
> >> details in the document. [2]
> >>
> >> 1.
> >>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27:+Refactor+Source+Interface
> >> 2.
> >>
> https://docs.google.com/document/d/1R1s_89T4S3CZwq7Tf31DciaMCqZwrLHGZFqPASu66oE/edit?usp=sharing
> >>
> >> CC Stephan, Aljoscha, Piotrek, Becket
> >>
> >>
> >> On Thu, Mar 28, 2019 at 4:38 PM Biao Liu <[hidden email]> wrote:
> >>
> >>> Hi Steven,
> >>> Thank you for the feedback. Please take a look at the document FLIP-27
> >>> <
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface>
> which
> >>> is updated recently. A lot of details of enumerator were added in this
> >>> document. I think it would help.
> >>>
> >>> Steven Wu <[hidden email]> 于2019年3月28日周四 下午12:52写道:
> >>>
> >>>> This proposal mentioned that SplitEnumerator might run on the
> >>>> JobManager or
> >>>> in a single task on a TaskManager.
> >>>>
> >>>> if enumerator is a single task on a taskmanager, then the job DAG can
> >>>> never
> >>>> been embarrassingly parallel anymore. That will nullify the leverage
> of
> >>>> fine-grained recovery for embarrassingly parallel jobs.
> >>>>
> >>>> It's not clear to me what's the implication of running enumerator on
> the
> >>>> jobmanager. So I will leave that out for now.
> >>>>
> >>>> On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]> wrote:
> >>>>
> >>>> > Hi Stephan & Piotrek,
> >>>> >
> >>>> > Thank you for feedback.
> >>>> >
> >>>> > It seems that there are a lot of things to do in community. I am
> just
> >>>> > afraid that this discussion may be forgotten since there so many
> >>>> proposals
> >>>> > recently.
> >>>> > Anyway, wish to see the split topics soon :)
> >>>> >
> >>>> > Piotr Nowojski <[hidden email]> 于2019年1月24日周四 下午8:21写道:
> >>>> >
> >>>> > > Hi Biao!
> >>>> > >
> >>>> > > This discussion was stalled because of preparations for the open
> >>>> sourcing
> >>>> > > & merging Blink. I think before creating the tickets we should
> >>>> split this
> >>>> > > discussion into topics/areas outlined by Stephan and create Flips
> >>>> for
> >>>> > that.
> >>>> > >
> >>>> > > I think there is no chance for this to be completed in couple of
> >>>> > remaining
> >>>> > > weeks/1 month before 1.8 feature freeze, however it would be good
> >>>> to aim
> >>>> > > with those changes for 1.9.
> >>>> > >
> >>>> > > Piotrek
> >>>> > >
> >>>> > > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]> wrote:
> >>>> > > >
> >>>> > > > Hi community,
> >>>> > > > The summary of Stephan makes a lot sense to me. It is much
> clearer
> >>>> > indeed
> >>>> > > > after splitting the complex topic into small ones.
> >>>> > > > I was wondering is there any detail plan for next step? If not,
> I
> >>>> would
> >>>> > > > like to push this thing forward by creating some JIRA issues.
> >>>> > > > Another question is that should version 1.8 include these
> >>>> features?
> >>>> > > >
> >>>> > > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
> >>>> > > >
> >>>> > > >> Thanks everyone for the lively discussion. Let me try to
> >>>> summarize
> >>>> > > where I
> >>>> > > >> see convergence in the discussion and open issues.
> >>>> > > >> I'll try to group this by design aspect of the source. Please
> >>>> let me
> >>>> > > know
> >>>> > > >> if I got things wrong or missed something crucial here.
> >>>> > > >>
> >>>> > > >> For issues 1-3, if the below reflects the state of the
> >>>> discussion, I
> >>>> > > would
> >>>> > > >> try and update the FLIP in the next days.
> >>>> > > >> For the remaining ones we need more discussion.
> >>>> > > >>
> >>>> > > >> I would suggest to fork each of these aspects into a separate
> >>>> mail
> >>>> > > thread,
> >>>> > > >> or will loose sight of the individual aspects.
> >>>> > > >>
> >>>> > > >> *(1) Separation of Split Enumerator and Split Reader*
> >>>> > > >>
> >>>> > > >>  - All seem to agree this is a good thing
> >>>> > > >>  - Split Enumerator could in the end live on JobManager (and
> >>>> assign
> >>>> > > splits
> >>>> > > >> via RPC) or in a task (and assign splits via data streams)
> >>>> > > >>  - this discussion is orthogonal and should come later, when
> the
> >>>> > > interface
> >>>> > > >> is agreed upon.
> >>>> > > >>
> >>>> > > >> *(2) Split Readers for one or more splits*
> >>>> > > >>
> >>>> > > >>  - Discussion seems to agree that we need to support one reader
> >>>> that
> >>>> > > >> possibly handles multiple splits concurrently.
> >>>> > > >>  - The requirement comes from sources where one poll()-style
> call
> >>>> > > fetches
> >>>> > > >> data from different splits / partitions
> >>>> > > >>    --> example sources that require that would be for example
> >>>> Kafka,
> >>>> > > >> Pravega, Pulsar
> >>>> > > >>
> >>>> > > >>  - Could have one split reader per source, or multiple split
> >>>> readers
> >>>> > > that
> >>>> > > >> share the "poll()" function
> >>>> > > >>  - To not make it too complicated, we can start with thinking
> >>>> about
> >>>> > one
> >>>> > > >> split reader for all splits initially and see if that covers
> all
> >>>> > > >> requirements
> >>>> > > >>
> >>>> > > >> *(3) Threading model of the Split Reader*
> >>>> > > >>
> >>>> > > >>  - Most active part of the discussion ;-)
> >>>> > > >>
> >>>> > > >>  - A non-blocking way for Flink's task code to interact with
> the
> >>>> > source
> >>>> > > is
> >>>> > > >> needed in order to a task runtime code based on a
> >>>> > > >> single-threaded/actor-style task design
> >>>> > > >>    --> I personally am a big proponent of that, it will help
> with
> >>>> > > >> well-behaved checkpoints, efficiency, and simpler yet more
> robust
> >>>> > > runtime
> >>>> > > >> code
> >>>> > > >>
> >>>> > > >>  - Users care about simple abstraction, so as a subclass of
> >>>> > SplitReader
> >>>> > > >> (non-blocking / async) we need to have a BlockingSplitReader
> >>>> which
> >>>> > will
> >>>> > > >> form the basis of most source implementations.
> >>>> BlockingSplitReader
> >>>> > lets
> >>>> > > >> users do blocking simple poll() calls.
> >>>> > > >>  - The BlockingSplitReader would spawn a thread (or more) and
> the
> >>>> > > >> thread(s) can make blocking calls and hand over data buffers
> via
> >>>> a
> >>>> > > blocking
> >>>> > > >> queue
> >>>> > > >>  - This should allow us to cover both, a fully async runtime,
> >>>> and a
> >>>> > > simple
> >>>> > > >> blocking interface for users.
> >>>> > > >>  - This is actually very similar to how the Kafka connectors
> >>>> work.
> >>>> > Kafka
> >>>> > > >> 9+ with one thread, Kafka 8 with multiple threads
> >>>> > > >>
> >>>> > > >>  - On the base SplitReader (the async one), the non-blocking
> >>>> method
> >>>> > that
> >>>> > > >> gets the next chunk of data would signal data availability via
> a
> >>>> > > >> CompletableFuture, because that gives the best flexibility (can
> >>>> await
> >>>> > > >> completion or register notification handlers).
> >>>> > > >>  - The source task would register a "thenHandle()" (or similar)
> >>>> on the
> >>>> > > >> future to put a "take next data" task into the actor-style
> >>>> mailbox
> >>>> > > >>
> >>>> > > >> *(4) Split Enumeration and Assignment*
> >>>> > > >>
> >>>> > > >>  - Splits may be generated lazily, both in cases where there
> is a
> >>>> > > limited
> >>>> > > >> number of splits (but very many), or splits are discovered over
> >>>> time
> >>>> > > >>  - Assignment should also be lazy, to get better load balancing
> >>>> > > >>  - Assignment needs support locality preferences
> >>>> > > >>
> >>>> > > >>  - Possible design based on discussion so far:
> >>>> > > >>
> >>>> > > >>    --> SplitReader has a method "addSplits(SplitT...)" to add
> >>>> one or
> >>>> > > more
> >>>> > > >> splits. Some split readers might assume they have only one
> split
> >>>> ever,
> >>>> > > >> concurrently, others assume multiple splits. (Note: idea behind
> >>>> being
> >>>> > > able
> >>>> > > >> to add multiple splits at the same time is to ease startup
> where
> >>>> > > multiple
> >>>> > > >> splits may be assigned instantly.)
> >>>> > > >>    --> SplitReader has a context object on which it can call
> >>>> indicate
> >>>> > > when
> >>>> > > >> splits are completed. The enumerator gets that notification and
> >>>> can
> >>>> > use
> >>>> > > to
> >>>> > > >> decide when to assign new splits. This should help both in
> cases
> >>>> of
> >>>> > > sources
> >>>> > > >> that take splits lazily (file readers) and in case the source
> >>>> needs to
> >>>> > > >> preserve a partial order between splits (Kinesis, Pravega,
> >>>> Pulsar may
> >>>> > > need
> >>>> > > >> that).
> >>>> > > >>    --> SplitEnumerator gets notification when SplitReaders
> start
> >>>> and
> >>>> > > when
> >>>> > > >> they finish splits. They can decide at that moment to push more
> >>>> splits
> >>>> > > to
> >>>> > > >> that reader
> >>>> > > >>    --> The SplitEnumerator should probably be aware of the
> source
> >>>> > > >> parallelism, to build its initial distribution.
> >>>> > > >>
> >>>> > > >>  - Open question: Should the source expose something like "host
> >>>> > > >> preferences", so that yarn/mesos/k8s can take this into account
> >>>> when
> >>>> > > >> selecting a node to start a TM on?
> >>>> > > >>
> >>>> > > >> *(5) Watermarks and event time alignment*
> >>>> > > >>
> >>>> > > >>  - Watermark generation, as well as idleness, needs to be per
> >>>> split
> >>>> > > (like
> >>>> > > >> currently in the Kafka Source, per partition)
> >>>> > > >>  - It is desirable to support optional event-time-alignment,
> >>>> meaning
> >>>> > > that
> >>>> > > >> splits that are ahead are back-pressured or temporarily
> >>>> unsubscribed
> >>>> > > >>
> >>>> > > >>  - I think i would be desirable to encapsulate watermark
> >>>> generation
> >>>> > > logic
> >>>> > > >> in watermark generators, for a separation of concerns. The
> >>>> watermark
> >>>> > > >> generators should run per split.
> >>>> > > >>  - Using watermark generators would also help with another
> >>>> problem of
> >>>> > > the
> >>>> > > >> suggested interface, namely supporting non-periodic watermarks
> >>>> > > efficiently.
> >>>> > > >>
> >>>> > > >>  - Need a way to "dispatch" next record to different watermark
> >>>> > > generators
> >>>> > > >>  - Need a way to tell SplitReader to "suspend" a split until a
> >>>> certain
> >>>> > > >> watermark is reached (event time backpressure)
> >>>> > > >>  - This would in fact be not needed (and thus simpler) if we
> had
> >>>> a
> >>>> > > >> SplitReader per split and may be a reason to re-open that
> >>>> discussion
> >>>> > > >>
> >>>> > > >> *(6) Watermarks across splits and in the Split Enumerator*
> >>>> > > >>
> >>>> > > >>  - The split enumerator may need some watermark awareness,
> which
> >>>> > should
> >>>> > > be
> >>>> > > >> purely based on split metadata (like create timestamp of file
> >>>> splits)
> >>>> > > >>  - If there are still more splits with overlapping event time
> >>>> range
> >>>> > for
> >>>> > > a
> >>>> > > >> split reader, then that split reader should not advance the
> >>>> watermark
> >>>> > > >> within the split beyond the overlap boundary. Otherwise future
> >>>> splits
> >>>> > > will
> >>>> > > >> produce late data.
> >>>> > > >>
> >>>> > > >>  - One way to approach this could be that the split enumerator
> >>>> may
> >>>> > send
> >>>> > > >> watermarks to the readers, and the readers cannot emit
> watermarks
> >>>> > beyond
> >>>> > > >> that received watermark.
> >>>> > > >>  - Many split enumerators would simply immediately send
> Long.MAX
> >>>> out
> >>>> > and
> >>>> > > >> leave the progress purely to the split readers.
> >>>> > > >>
> >>>> > > >>  - For event-time alignment / split back pressure, this begs
> the
> >>>> > > question
> >>>> > > >> how we can avoid deadlocks that may arise when splits are
> >>>> suspended
> >>>> > for
> >>>> > > >> event time back pressure,
> >>>> > > >>
> >>>> > > >> *(7) Batch and streaming Unification*
> >>>> > > >>
> >>>> > > >>  - Functionality wise, the above design should support both
> >>>> > > >>  - Batch often (mostly) does not care about reading "in order"
> >>>> and
> >>>> > > >> generating watermarks
> >>>> > > >>    --> Might use different enumerator logic that is more
> locality
> >>>> > aware
> >>>> > > >> and ignores event time order
> >>>> > > >>    --> Does not generate watermarks
> >>>> > > >>  - Would be great if bounded sources could be identified at
> >>>> compile
> >>>> > > time,
> >>>> > > >> so that "env.addBoundedSource(...)" is type safe and can
> return a
> >>>> > > >> "BoundedDataStream".
> >>>> > > >>  - Possible to defer this discussion until later
> >>>> > > >>
> >>>> > > >> *Miscellaneous Comments*
> >>>> > > >>
> >>>> > > >>  - Should the source have a TypeInformation for the produced
> >>>> type,
> >>>> > > instead
> >>>> > > >> of a serializer? We need a type information in the stream
> >>>> anyways, and
> >>>> > > can
> >>>> > > >> derive the serializer from that. Plus, creating the serializer
> >>>> should
> >>>> > > >> respect the ExecutionConfig.
> >>>> > > >>
> >>>> > > >>  - The TypeSerializer interface is very powerful but also not
> >>>> easy to
> >>>> > > >> implement. Its purpose is to handle data super efficiently,
> >>>> support
> >>>> > > >> flexible ways of evolution, etc.
> >>>> > > >>  For metadata I would suggest to look at the
> >>>> SimpleVersionedSerializer
> >>>> > > >> instead, which is used for example for checkpoint master hooks,
> >>>> or for
> >>>> > > the
> >>>> > > >> streaming file sink. I think that is is a good match for cases
> >>>> where
> >>>> > we
> >>>> > > do
> >>>> > > >> not need more than ser/deser (no copy, etc.) and don't need to
> >>>> push
> >>>> > > >> versioning out of the serialization paths for best performance
> >>>> (as in
> >>>> > > the
> >>>> > > >> TypeSerializer)
> >>>> > > >>
> >>>> > > >>
> >>>> > > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
> >>>> > > >> [hidden email]>
> >>>> > > >> wrote:
> >>>> > > >>
> >>>> > > >>> Hi Biao,
> >>>> > > >>>
> >>>> > > >>> Thanks for the answer!
> >>>> > > >>>
> >>>> > > >>> So given the multi-threaded readers, now we have as open
> >>>> questions:
> >>>> > > >>>
> >>>> > > >>> 1) How do we let the checkpoints pass through our
> multi-threaded
> >>>> > reader
> >>>> > > >>> operator?
> >>>> > > >>>
> >>>> > > >>> 2) Do we have separate reader and source operators or not? In
> >>>> the
> >>>> > > >> strategy
> >>>> > > >>> that has a separate source, the source operator has a
> >>>> parallelism of
> >>>> > 1
> >>>> > > >> and
> >>>> > > >>> is responsible for split recovery only.
> >>>> > > >>>
> >>>> > > >>> For the first one, given also the constraints (blocking,
> finite
> >>>> > queues,
> >>>> > > >>> etc), I do not have an answer yet.
> >>>> > > >>>
> >>>> > > >>> For the 2nd, I think that we should go with separate operators
> >>>> for
> >>>> > the
> >>>> > > >>> source and the readers, for the following reasons:
> >>>> > > >>>
> >>>> > > >>> 1) This is more aligned with a potential future improvement
> >>>> where the
> >>>> > > >> split
> >>>> > > >>> discovery becomes a responsibility of the JobManager and
> >>>> readers are
> >>>> > > >>> pooling more work from the JM.
> >>>> > > >>>
> >>>> > > >>> 2) The source is going to be the "single point of truth". It
> >>>> will
> >>>> > know
> >>>> > > >> what
> >>>> > > >>> has been processed and what not. If the source and the readers
> >>>> are a
> >>>> > > >> single
> >>>> > > >>> operator with parallelism > 1, or in general, if the split
> >>>> discovery
> >>>> > is
> >>>> > > >>> done by each task individually, then:
> >>>> > > >>>   i) we have to have a deterministic scheme for each reader to
> >>>> assign
> >>>> > > >>> splits to itself (e.g. mod subtaskId). This is not necessarily
> >>>> > trivial
> >>>> > > >> for
> >>>> > > >>> all sources.
> >>>> > > >>>   ii) each reader would have to keep a copy of all its
> processed
> >>>> > slpits
> >>>> > > >>>   iii) the state has to be a union state with a non-trivial
> >>>> merging
> >>>> > > >> logic
> >>>> > > >>> in order to support rescaling.
> >>>> > > >>>
> >>>> > > >>> Two additional points that you raised above:
> >>>> > > >>>
> >>>> > > >>> i) The point that you raised that we need to keep all splits
> >>>> > (processed
> >>>> > > >> and
> >>>> > > >>> not-processed) I think is a bit of a strong requirement. This
> >>>> would
> >>>> > > imply
> >>>> > > >>> that for infinite sources the state will grow indefinitely.
> >>>> This is
> >>>> > > >> problem
> >>>> > > >>> is even more pronounced if we do not have a single source that
> >>>> > assigns
> >>>> > > >>> splits to readers, as each reader will have its own copy of
> the
> >>>> > state.
> >>>> > > >>>
> >>>> > > >>> ii) it is true that for finite sources we need to somehow not
> >>>> close
> >>>> > the
> >>>> > > >>> readers when the source/split discoverer finishes. The
> >>>> > > >>> ContinuousFileReaderOperator has a work-around for that. It is
> >>>> not
> >>>> > > >> elegant,
> >>>> > > >>> and checkpoints are not emitted after closing the source, but
> >>>> this, I
> >>>> > > >>> believe, is a bigger problem which requires more changes than
> >>>> just
> >>>> > > >>> refactoring the source interface.
> >>>> > > >>>
> >>>> > > >>> Cheers,
> >>>> > > >>> Kostas
> >>>> > > >>>
> >>>> > > >>
> >>>> > >
> >>>> > >
> >>>> >
> >>>>
> >>>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Łukasz Jędrzejewski
Hi,

This proposal looks very promising for us. Do you have any plans in which Flink release it is going to be released? We are thinking on using a Data Set API for our future use cases but on the other hand Data Set API is going to be deprecated so using proposed bounded data streams solution could be more viable in the long term.

Thanks,
Łukasz

On 2019/10/01 15:48:03, Thomas Weise <[hidden email]> wrote:

> Thanks for putting together this proposal!
>
> I see that the "Per Split Event Time" and "Event Time Alignment" sections
> are still TBD.
>
> It would probably be good to flesh those out a bit before proceeding too far
> as the event time alignment will probably influence the interaction with
> the split reader, specifically ReaderStatus emitNext(SourceOutput<E>
> output).
>
> We currently have only one implementation for event time alignment in the
> Kinesis consumer. The synchronization in that case takes place as the last
> step before records are emitted downstream (RecordEmitter). With the
> currently proposed interfaces, the equivalent can be implemented in the
> reader loop, although note that in the Kinesis consumer the per shard
> threads push records.
>
> Synchronization has not been implemented for the Kafka consumer yet.
>
> https://issues.apache.org/jira/browse/FLINK-12675
>
> When I looked at it, I realized that the implementation will look quite
> different
> from Kinesis because it needs to take place in the pull part, where records
> are taken from the Kafka client. Due to the multiplexing it cannot be done
> by blocking the split thread like it currently works for Kinesis. Reading
> from individual Kafka partitions needs to be controlled via pause/resume
> on the Kafka client.
>
> To take on that responsibility the split thread would need to be aware of
> the
> watermarks or at least whether it should or should not continue to consume
> a given split and this may require a different SourceReader or SourceOutput
> interface.
>
> Thanks,
> Thomas
>
>
> On Fri, Jul 26, 2019 at 1:39 AM Biao Liu <[hidden email]> wrote:
>
> > Hi Stephan,
> >
> > Thank you for feedback!
> > Will take a look at your branch before public discussing.
> >
> >
> > On Fri, Jul 26, 2019 at 12:01 AM Stephan Ewen <[hidden email]> wrote:
> >
> > > Hi Biao!
> > >
> > > Thanks for reviving this. I would like to join this discussion, but am
> > > quite occupied with the 1.9 release, so can we maybe pause this
> > discussion
> > > for a week or so?
> > >
> > > In the meantime I can share some suggestion based on prior experiments:
> > >
> > > How to do watermarks / timestamp extractors in a simpler and more
> > flexible
> > > way. I think that part is quite promising should be part of the new
> > source
> > > interface.
> > >
> > >
> > https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/eventtime
> > >
> > >
> > >
> > https://github.com/StephanEwen/flink/blob/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src/SourceOutput.java
> > >
> > >
> > >
> > > Some experiments on how to build the source reader and its library for
> > > common threading/split patterns:
> > >
> > >
> > https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src
> > >
> > >
> > > Best,
> > > Stephan
> > >
> > >
> > > On Thu, Jul 25, 2019 at 10:03 AM Biao Liu <[hidden email]> wrote:
> > >
> > >> Hi devs,
> > >>
> > >> Since 1.9 is nearly released, I think we could get back to FLIP-27. I
> > >> believe it should be included in 1.10.
> > >>
> > >> There are so many things mentioned in document of FLIP-27. [1] I think
> > >> we'd better discuss them separately. However the wiki is not a good
> > place
> > >> to discuss. I wrote google doc about SplitReader API which misses some
> > >> details in the document. [2]
> > >>
> > >> 1.
> > >>
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27:+Refactor+Source+Interface
> > >> 2.
> > >>
> > https://docs.google.com/document/d/1R1s_89T4S3CZwq7Tf31DciaMCqZwrLHGZFqPASu66oE/edit?usp=sharing
> > >>
> > >> CC Stephan, Aljoscha, Piotrek, Becket
> > >>
> > >>
> > >> On Thu, Mar 28, 2019 at 4:38 PM Biao Liu <[hidden email]> wrote:
> > >>
> > >>> Hi Steven,
> > >>> Thank you for the feedback. Please take a look at the document FLIP-27
> > >>> <
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface>
> > which
> > >>> is updated recently. A lot of details of enumerator were added in this
> > >>> document. I think it would help.
> > >>>
> > >>> Steven Wu <[hidden email]> 于2019年3月28日周四 下午12:52写道:
> > >>>
> > >>>> This proposal mentioned that SplitEnumerator might run on the
> > >>>> JobManager or
> > >>>> in a single task on a TaskManager.
> > >>>>
> > >>>> if enumerator is a single task on a taskmanager, then the job DAG can
> > >>>> never
> > >>>> been embarrassingly parallel anymore. That will nullify the leverage
> > of
> > >>>> fine-grained recovery for embarrassingly parallel jobs.
> > >>>>
> > >>>> It's not clear to me what's the implication of running enumerator on
> > the
> > >>>> jobmanager. So I will leave that out for now.
> > >>>>
> > >>>> On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]> wrote:
> > >>>>
> > >>>> > Hi Stephan & Piotrek,
> > >>>> >
> > >>>> > Thank you for feedback.
> > >>>> >
> > >>>> > It seems that there are a lot of things to do in community. I am
> > just
> > >>>> > afraid that this discussion may be forgotten since there so many
> > >>>> proposals
> > >>>> > recently.
> > >>>> > Anyway, wish to see the split topics soon :)
> > >>>> >
> > >>>> > Piotr Nowojski <[hidden email]> 于2019年1月24日周四 下午8:21写道:
> > >>>> >
> > >>>> > > Hi Biao!
> > >>>> > >
> > >>>> > > This discussion was stalled because of preparations for the open
> > >>>> sourcing
> > >>>> > > & merging Blink. I think before creating the tickets we should
> > >>>> split this
> > >>>> > > discussion into topics/areas outlined by Stephan and create Flips
> > >>>> for
> > >>>> > that.
> > >>>> > >
> > >>>> > > I think there is no chance for this to be completed in couple of
> > >>>> > remaining
> > >>>> > > weeks/1 month before 1.8 feature freeze, however it would be good
> > >>>> to aim
> > >>>> > > with those changes for 1.9.
> > >>>> > >
> > >>>> > > Piotrek
> > >>>> > >
> > >>>> > > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]> wrote:
> > >>>> > > >
> > >>>> > > > Hi community,
> > >>>> > > > The summary of Stephan makes a lot sense to me. It is much
> > clearer
> > >>>> > indeed
> > >>>> > > > after splitting the complex topic into small ones.
> > >>>> > > > I was wondering is there any detail plan for next step? If not,
> > I
> > >>>> would
> > >>>> > > > like to push this thing forward by creating some JIRA issues.
> > >>>> > > > Another question is that should version 1.8 include these
> > >>>> features?
> > >>>> > > >
> > >>>> > > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
> > >>>> > > >
> > >>>> > > >> Thanks everyone for the lively discussion. Let me try to
> > >>>> summarize
> > >>>> > > where I
> > >>>> > > >> see convergence in the discussion and open issues.
> > >>>> > > >> I'll try to group this by design aspect of the source. Please
> > >>>> let me
> > >>>> > > know
> > >>>> > > >> if I got things wrong or missed something crucial here.
> > >>>> > > >>
> > >>>> > > >> For issues 1-3, if the below reflects the state of the
> > >>>> discussion, I
> > >>>> > > would
> > >>>> > > >> try and update the FLIP in the next days.
> > >>>> > > >> For the remaining ones we need more discussion.
> > >>>> > > >>
> > >>>> > > >> I would suggest to fork each of these aspects into a separate
> > >>>> mail
> > >>>> > > thread,
> > >>>> > > >> or will loose sight of the individual aspects.
> > >>>> > > >>
> > >>>> > > >> *(1) Separation of Split Enumerator and Split Reader*
> > >>>> > > >>
> > >>>> > > >>  - All seem to agree this is a good thing
> > >>>> > > >>  - Split Enumerator could in the end live on JobManager (and
> > >>>> assign
> > >>>> > > splits
> > >>>> > > >> via RPC) or in a task (and assign splits via data streams)
> > >>>> > > >>  - this discussion is orthogonal and should come later, when
> > the
> > >>>> > > interface
> > >>>> > > >> is agreed upon.
> > >>>> > > >>
> > >>>> > > >> *(2) Split Readers for one or more splits*
> > >>>> > > >>
> > >>>> > > >>  - Discussion seems to agree that we need to support one reader
> > >>>> that
> > >>>> > > >> possibly handles multiple splits concurrently.
> > >>>> > > >>  - The requirement comes from sources where one poll()-style
> > call
> > >>>> > > fetches
> > >>>> > > >> data from different splits / partitions
> > >>>> > > >>    --> example sources that require that would be for example
> > >>>> Kafka,
> > >>>> > > >> Pravega, Pulsar
> > >>>> > > >>
> > >>>> > > >>  - Could have one split reader per source, or multiple split
> > >>>> readers
> > >>>> > > that
> > >>>> > > >> share the "poll()" function
> > >>>> > > >>  - To not make it too complicated, we can start with thinking
> > >>>> about
> > >>>> > one
> > >>>> > > >> split reader for all splits initially and see if that covers
> > all
> > >>>> > > >> requirements
> > >>>> > > >>
> > >>>> > > >> *(3) Threading model of the Split Reader*
> > >>>> > > >>
> > >>>> > > >>  - Most active part of the discussion ;-)
> > >>>> > > >>
> > >>>> > > >>  - A non-blocking way for Flink's task code to interact with
> > the
> > >>>> > source
> > >>>> > > is
> > >>>> > > >> needed in order to a task runtime code based on a
> > >>>> > > >> single-threaded/actor-style task design
> > >>>> > > >>    --> I personally am a big proponent of that, it will help
> > with
> > >>>> > > >> well-behaved checkpoints, efficiency, and simpler yet more
> > robust
> > >>>> > > runtime
> > >>>> > > >> code
> > >>>> > > >>
> > >>>> > > >>  - Users care about simple abstraction, so as a subclass of
> > >>>> > SplitReader
> > >>>> > > >> (non-blocking / async) we need to have a BlockingSplitReader
> > >>>> which
> > >>>> > will
> > >>>> > > >> form the basis of most source implementations.
> > >>>> BlockingSplitReader
> > >>>> > lets
> > >>>> > > >> users do blocking simple poll() calls.
> > >>>> > > >>  - The BlockingSplitReader would spawn a thread (or more) and
> > the
> > >>>> > > >> thread(s) can make blocking calls and hand over data buffers
> > via
> > >>>> a
> > >>>> > > blocking
> > >>>> > > >> queue
> > >>>> > > >>  - This should allow us to cover both, a fully async runtime,
> > >>>> and a
> > >>>> > > simple
> > >>>> > > >> blocking interface for users.
> > >>>> > > >>  - This is actually very similar to how the Kafka connectors
> > >>>> work.
> > >>>> > Kafka
> > >>>> > > >> 9+ with one thread, Kafka 8 with multiple threads
> > >>>> > > >>
> > >>>> > > >>  - On the base SplitReader (the async one), the non-blocking
> > >>>> method
> > >>>> > that
> > >>>> > > >> gets the next chunk of data would signal data availability via
> > a
> > >>>> > > >> CompletableFuture, because that gives the best flexibility (can
> > >>>> await
> > >>>> > > >> completion or register notification handlers).
> > >>>> > > >>  - The source task would register a "thenHandle()" (or similar)
> > >>>> on the
> > >>>> > > >> future to put a "take next data" task into the actor-style
> > >>>> mailbox
> > >>>> > > >>
> > >>>> > > >> *(4) Split Enumeration and Assignment*
> > >>>> > > >>
> > >>>> > > >>  - Splits may be generated lazily, both in cases where there
> > is a
> > >>>> > > limited
> > >>>> > > >> number of splits (but very many), or splits are discovered over
> > >>>> time
> > >>>> > > >>  - Assignment should also be lazy, to get better load balancing
> > >>>> > > >>  - Assignment needs support locality preferences
> > >>>> > > >>
> > >>>> > > >>  - Possible design based on discussion so far:
> > >>>> > > >>
> > >>>> > > >>    --> SplitReader has a method "addSplits(SplitT...)" to add
> > >>>> one or
> > >>>> > > more
> > >>>> > > >> splits. Some split readers might assume they have only one
> > split
> > >>>> ever,
> > >>>> > > >> concurrently, others assume multiple splits. (Note: idea behind
> > >>>> being
> > >>>> > > able
> > >>>> > > >> to add multiple splits at the same time is to ease startup
> > where
> > >>>> > > multiple
> > >>>> > > >> splits may be assigned instantly.)
> > >>>> > > >>    --> SplitReader has a context object on which it can call
> > >>>> indicate
> > >>>> > > when
> > >>>> > > >> splits are completed. The enumerator gets that notification and
> > >>>> can
> > >>>> > use
> > >>>> > > to
> > >>>> > > >> decide when to assign new splits. This should help both in
> > cases
> > >>>> of
> > >>>> > > sources
> > >>>> > > >> that take splits lazily (file readers) and in case the source
> > >>>> needs to
> > >>>> > > >> preserve a partial order between splits (Kinesis, Pravega,
> > >>>> Pulsar may
> > >>>> > > need
> > >>>> > > >> that).
> > >>>> > > >>    --> SplitEnumerator gets notification when SplitReaders
> > start
> > >>>> and
> > >>>> > > when
> > >>>> > > >> they finish splits. They can decide at that moment to push more
> > >>>> splits
> > >>>> > > to
> > >>>> > > >> that reader
> > >>>> > > >>    --> The SplitEnumerator should probably be aware of the
> > source
> > >>>> > > >> parallelism, to build its initial distribution.
> > >>>> > > >>
> > >>>> > > >>  - Open question: Should the source expose something like "host
> > >>>> > > >> preferences", so that yarn/mesos/k8s can take this into account
> > >>>> when
> > >>>> > > >> selecting a node to start a TM on?
> > >>>> > > >>
> > >>>> > > >> *(5) Watermarks and event time alignment*
> > >>>> > > >>
> > >>>> > > >>  - Watermark generation, as well as idleness, needs to be per
> > >>>> split
> > >>>> > > (like
> > >>>> > > >> currently in the Kafka Source, per partition)
> > >>>> > > >>  - It is desirable to support optional event-time-alignment,
> > >>>> meaning
> > >>>> > > that
> > >>>> > > >> splits that are ahead are back-pressured or temporarily
> > >>>> unsubscribed
> > >>>> > > >>
> > >>>> > > >>  - I think i would be desirable to encapsulate watermark
> > >>>> generation
> > >>>> > > logic
> > >>>> > > >> in watermark generators, for a separation of concerns. The
> > >>>> watermark
> > >>>> > > >> generators should run per split.
> > >>>> > > >>  - Using watermark generators would also help with another
> > >>>> problem of
> > >>>> > > the
> > >>>> > > >> suggested interface, namely supporting non-periodic watermarks
> > >>>> > > efficiently.
> > >>>> > > >>
> > >>>> > > >>  - Need a way to "dispatch" next record to different watermark
> > >>>> > > generators
> > >>>> > > >>  - Need a way to tell SplitReader to "suspend" a split until a
> > >>>> certain
> > >>>> > > >> watermark is reached (event time backpressure)
> > >>>> > > >>  - This would in fact be not needed (and thus simpler) if we
> > had
> > >>>> a
> > >>>> > > >> SplitReader per split and may be a reason to re-open that
> > >>>> discussion
> > >>>> > > >>
> > >>>> > > >> *(6) Watermarks across splits and in the Split Enumerator*
> > >>>> > > >>
> > >>>> > > >>  - The split enumerator may need some watermark awareness,
> > which
> > >>>> > should
> > >>>> > > be
> > >>>> > > >> purely based on split metadata (like create timestamp of file
> > >>>> splits)
> > >>>> > > >>  - If there are still more splits with overlapping event time
> > >>>> range
> > >>>> > for
> > >>>> > > a
> > >>>> > > >> split reader, then that split reader should not advance the
> > >>>> watermark
> > >>>> > > >> within the split beyond the overlap boundary. Otherwise future
> > >>>> splits
> > >>>> > > will
> > >>>> > > >> produce late data.
> > >>>> > > >>
> > >>>> > > >>  - One way to approach this could be that the split enumerator
> > >>>> may
> > >>>> > send
> > >>>> > > >> watermarks to the readers, and the readers cannot emit
> > watermarks
> > >>>> > beyond
> > >>>> > > >> that received watermark.
> > >>>> > > >>  - Many split enumerators would simply immediately send
> > Long.MAX
> > >>>> out
> > >>>> > and
> > >>>> > > >> leave the progress purely to the split readers.
> > >>>> > > >>
> > >>>> > > >>  - For event-time alignment / split back pressure, this begs
> > the
> > >>>> > > question
> > >>>> > > >> how we can avoid deadlocks that may arise when splits are
> > >>>> suspended
> > >>>> > for
> > >>>> > > >> event time back pressure,
> > >>>> > > >>
> > >>>> > > >> *(7) Batch and streaming Unification*
> > >>>> > > >>
> > >>>> > > >>  - Functionality wise, the above design should support both
> > >>>> > > >>  - Batch often (mostly) does not care about reading "in order"
> > >>>> and
> > >>>> > > >> generating watermarks
> > >>>> > > >>    --> Might use different enumerator logic that is more
> > locality
> > >>>> > aware
> > >>>> > > >> and ignores event time order
> > >>>> > > >>    --> Does not generate watermarks
> > >>>> > > >>  - Would be great if bounded sources could be identified at
> > >>>> compile
> > >>>> > > time,
> > >>>> > > >> so that "env.addBoundedSource(...)" is type safe and can
> > return a
> > >>>> > > >> "BoundedDataStream".
> > >>>> > > >>  - Possible to defer this discussion until later
> > >>>> > > >>
> > >>>> > > >> *Miscellaneous Comments*
> > >>>> > > >>
> > >>>> > > >>  - Should the source have a TypeInformation for the produced
> > >>>> type,
> > >>>> > > instead
> > >>>> > > >> of a serializer? We need a type information in the stream
> > >>>> anyways, and
> > >>>> > > can
> > >>>> > > >> derive the serializer from that. Plus, creating the serializer
> > >>>> should
> > >>>> > > >> respect the ExecutionConfig.
> > >>>> > > >>
> > >>>> > > >>  - The TypeSerializer interface is very powerful but also not
> > >>>> easy to
> > >>>> > > >> implement. Its purpose is to handle data super efficiently,
> > >>>> support
> > >>>> > > >> flexible ways of evolution, etc.
> > >>>> > > >>  For metadata I would suggest to look at the
> > >>>> SimpleVersionedSerializer
> > >>>> > > >> instead, which is used for example for checkpoint master hooks,
> > >>>> or for
> > >>>> > > the
> > >>>> > > >> streaming file sink. I think that is is a good match for cases
> > >>>> where
> > >>>> > we
> > >>>> > > do
> > >>>> > > >> not need more than ser/deser (no copy, etc.) and don't need to
> > >>>> push
> > >>>> > > >> versioning out of the serialization paths for best performance
> > >>>> (as in
> > >>>> > > the
> > >>>> > > >> TypeSerializer)
> > >>>> > > >>
> > >>>> > > >>
> > >>>> > > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
> > >>>> > > >> [hidden email]>
> > >>>> > > >> wrote:
> > >>>> > > >>
> > >>>> > > >>> Hi Biao,
> > >>>> > > >>>
> > >>>> > > >>> Thanks for the answer!
> > >>>> > > >>>
> > >>>> > > >>> So given the multi-threaded readers, now we have as open
> > >>>> questions:
> > >>>> > > >>>
> > >>>> > > >>> 1) How do we let the checkpoints pass through our
> > multi-threaded
> > >>>> > reader
> > >>>> > > >>> operator?
> > >>>> > > >>>
> > >>>> > > >>> 2) Do we have separate reader and source operators or not? In
> > >>>> the
> > >>>> > > >> strategy
> > >>>> > > >>> that has a separate source, the source operator has a
> > >>>> parallelism of
> > >>>> > 1
> > >>>> > > >> and
> > >>>> > > >>> is responsible for split recovery only.
> > >>>> > > >>>
> > >>>> > > >>> For the first one, given also the constraints (blocking,
> > finite
> > >>>> > queues,
> > >>>> > > >>> etc), I do not have an answer yet.
> > >>>> > > >>>
> > >>>> > > >>> For the 2nd, I think that we should go with separate operators
> > >>>> for
> > >>>> > the
> > >>>> > > >>> source and the readers, for the following reasons:
> > >>>> > > >>>
> > >>>> > > >>> 1) This is more aligned with a potential future improvement
> > >>>> where the
> > >>>> > > >> split
> > >>>> > > >>> discovery becomes a responsibility of the JobManager and
> > >>>> readers are
> > >>>> > > >>> pooling more work from the JM.
> > >>>> > > >>>
> > >>>> > > >>> 2) The source is going to be the "single point of truth". It
> > >>>> will
> > >>>> > know
> > >>>> > > >> what
> > >>>> > > >>> has been processed and what not. If the source and the readers
> > >>>> are a
> > >>>> > > >> single
> > >>>> > > >>> operator with parallelism > 1, or in general, if the split
> > >>>> discovery
> > >>>> > is
> > >>>> > > >>> done by each task individually, then:
> > >>>> > > >>>   i) we have to have a deterministic scheme for each reader to
> > >>>> assign
> > >>>> > > >>> splits to itself (e.g. mod subtaskId). This is not necessarily
> > >>>> > trivial
> > >>>> > > >> for
> > >>>> > > >>> all sources.
> > >>>> > > >>>   ii) each reader would have to keep a copy of all its
> > processed
> > >>>> > slpits
> > >>>> > > >>>   iii) the state has to be a union state with a non-trivial
> > >>>> merging
> > >>>> > > >> logic
> > >>>> > > >>> in order to support rescaling.
> > >>>> > > >>>
> > >>>> > > >>> Two additional points that you raised above:
> > >>>> > > >>>
> > >>>> > > >>> i) The point that you raised that we need to keep all splits
> > >>>> > (processed
> > >>>> > > >> and
> > >>>> > > >>> not-processed) I think is a bit of a strong requirement. This
> > >>>> would
> > >>>> > > imply
> > >>>> > > >>> that for infinite sources the state will grow indefinitely.
> > >>>> This is
> > >>>> > > >> problem
> > >>>> > > >>> is even more pronounced if we do not have a single source that
> > >>>> > assigns
> > >>>> > > >>> splits to readers, as each reader will have its own copy of
> > the
> > >>>> > state.
> > >>>> > > >>>
> > >>>> > > >>> ii) it is true that for finite sources we need to somehow not
> > >>>> close
> > >>>> > the
> > >>>> > > >>> readers when the source/split discoverer finishes. The
> > >>>> > > >>> ContinuousFileReaderOperator has a work-around for that. It is
> > >>>> not
> > >>>> > > >> elegant,
> > >>>> > > >>> and checkpoints are not emitted after closing the source, but
> > >>>> this, I
> > >>>> > > >>> believe, is a bigger problem which requires more changes than
> > >>>> just
> > >>>> > > >>> refactoring the source interface.
> > >>>> > > >>>
> > >>>> > > >>> Cheers,
> > >>>> > > >>> Kostas
> > >>>> > > >>>
> > >>>> > > >>
> > >>>> > >
> > >>>> > >
> > >>>> >
> > >>>>
> > >>>
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Stephan Ewen
Hi  Łukasz!

Becket and me are working hard on figuring out the last details and
implementing the first PoC. We would update the FLIP hopefully next week.

There is a fair chance that a first version of this will be in 1.10, but I
think it will take another release to battle test it and migrate the
connectors.

Best,
Stephan




On Fri, Nov 15, 2019 at 11:14 AM Łukasz Jędrzejewski <[hidden email]> wrote:

> Hi,
>
> This proposal looks very promising for us. Do you have any plans in which
> Flink release it is going to be released? We are thinking on using a Data
> Set API for our future use cases but on the other hand Data Set API is
> going to be deprecated so using proposed bounded data streams solution
> could be more viable in the long term.
>
> Thanks,
> Łukasz
>
> On 2019/10/01 15:48:03, Thomas Weise <[hidden email]> wrote:
> > Thanks for putting together this proposal!
> >
> > I see that the "Per Split Event Time" and "Event Time Alignment" sections
> > are still TBD.
> >
> > It would probably be good to flesh those out a bit before proceeding too
> far
> > as the event time alignment will probably influence the interaction with
> > the split reader, specifically ReaderStatus emitNext(SourceOutput<E>
> > output).
> >
> > We currently have only one implementation for event time alignment in the
> > Kinesis consumer. The synchronization in that case takes place as the
> last
> > step before records are emitted downstream (RecordEmitter). With the
> > currently proposed interfaces, the equivalent can be implemented in the
> > reader loop, although note that in the Kinesis consumer the per shard
> > threads push records.
> >
> > Synchronization has not been implemented for the Kafka consumer yet.
> >
> > https://issues.apache.org/jira/browse/FLINK-12675
> >
> > When I looked at it, I realized that the implementation will look quite
> > different
> > from Kinesis because it needs to take place in the pull part, where
> records
> > are taken from the Kafka client. Due to the multiplexing it cannot be
> done
> > by blocking the split thread like it currently works for Kinesis. Reading
> > from individual Kafka partitions needs to be controlled via pause/resume
> > on the Kafka client.
> >
> > To take on that responsibility the split thread would need to be aware of
> > the
> > watermarks or at least whether it should or should not continue to
> consume
> > a given split and this may require a different SourceReader or
> SourceOutput
> > interface.
> >
> > Thanks,
> > Thomas
> >
> >
> > On Fri, Jul 26, 2019 at 1:39 AM Biao Liu <[hidden email]> wrote:
> >
> > > Hi Stephan,
> > >
> > > Thank you for feedback!
> > > Will take a look at your branch before public discussing.
> > >
> > >
> > > On Fri, Jul 26, 2019 at 12:01 AM Stephan Ewen <[hidden email]>
> wrote:
> > >
> > > > Hi Biao!
> > > >
> > > > Thanks for reviving this. I would like to join this discussion, but
> am
> > > > quite occupied with the 1.9 release, so can we maybe pause this
> > > discussion
> > > > for a week or so?
> > > >
> > > > In the meantime I can share some suggestion based on prior
> experiments:
> > > >
> > > > How to do watermarks / timestamp extractors in a simpler and more
> > > flexible
> > > > way. I think that part is quite promising should be part of the new
> > > source
> > > > interface.
> > > >
> > > >
> > >
> https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/eventtime
> > > >
> > > >
> > > >
> > >
> https://github.com/StephanEwen/flink/blob/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src/SourceOutput.java
> > > >
> > > >
> > > >
> > > > Some experiments on how to build the source reader and its library
> for
> > > > common threading/split patterns:
> > > >
> > > >
> > >
> https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src
> > > >
> > > >
> > > > Best,
> > > > Stephan
> > > >
> > > >
> > > > On Thu, Jul 25, 2019 at 10:03 AM Biao Liu <[hidden email]>
> wrote:
> > > >
> > > >> Hi devs,
> > > >>
> > > >> Since 1.9 is nearly released, I think we could get back to FLIP-27.
> I
> > > >> believe it should be included in 1.10.
> > > >>
> > > >> There are so many things mentioned in document of FLIP-27. [1] I
> think
> > > >> we'd better discuss them separately. However the wiki is not a good
> > > place
> > > >> to discuss. I wrote google doc about SplitReader API which misses
> some
> > > >> details in the document. [2]
> > > >>
> > > >> 1.
> > > >>
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27:+Refactor+Source+Interface
> > > >> 2.
> > > >>
> > >
> https://docs.google.com/document/d/1R1s_89T4S3CZwq7Tf31DciaMCqZwrLHGZFqPASu66oE/edit?usp=sharing
> > > >>
> > > >> CC Stephan, Aljoscha, Piotrek, Becket
> > > >>
> > > >>
> > > >> On Thu, Mar 28, 2019 at 4:38 PM Biao Liu <[hidden email]>
> wrote:
> > > >>
> > > >>> Hi Steven,
> > > >>> Thank you for the feedback. Please take a look at the document
> FLIP-27
> > > >>> <
> > >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
> >
> > > which
> > > >>> is updated recently. A lot of details of enumerator were added in
> this
> > > >>> document. I think it would help.
> > > >>>
> > > >>> Steven Wu <[hidden email]> 于2019年3月28日周四 下午12:52写道:
> > > >>>
> > > >>>> This proposal mentioned that SplitEnumerator might run on the
> > > >>>> JobManager or
> > > >>>> in a single task on a TaskManager.
> > > >>>>
> > > >>>> if enumerator is a single task on a taskmanager, then the job DAG
> can
> > > >>>> never
> > > >>>> been embarrassingly parallel anymore. That will nullify the
> leverage
> > > of
> > > >>>> fine-grained recovery for embarrassingly parallel jobs.
> > > >>>>
> > > >>>> It's not clear to me what's the implication of running enumerator
> on
> > > the
> > > >>>> jobmanager. So I will leave that out for now.
> > > >>>>
> > > >>>> On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]>
> wrote:
> > > >>>>
> > > >>>> > Hi Stephan & Piotrek,
> > > >>>> >
> > > >>>> > Thank you for feedback.
> > > >>>> >
> > > >>>> > It seems that there are a lot of things to do in community. I am
> > > just
> > > >>>> > afraid that this discussion may be forgotten since there so many
> > > >>>> proposals
> > > >>>> > recently.
> > > >>>> > Anyway, wish to see the split topics soon :)
> > > >>>> >
> > > >>>> > Piotr Nowojski <[hidden email]> 于2019年1月24日周四 下午8:21写道:
> > > >>>> >
> > > >>>> > > Hi Biao!
> > > >>>> > >
> > > >>>> > > This discussion was stalled because of preparations for the
> open
> > > >>>> sourcing
> > > >>>> > > & merging Blink. I think before creating the tickets we should
> > > >>>> split this
> > > >>>> > > discussion into topics/areas outlined by Stephan and create
> Flips
> > > >>>> for
> > > >>>> > that.
> > > >>>> > >
> > > >>>> > > I think there is no chance for this to be completed in couple
> of
> > > >>>> > remaining
> > > >>>> > > weeks/1 month before 1.8 feature freeze, however it would be
> good
> > > >>>> to aim
> > > >>>> > > with those changes for 1.9.
> > > >>>> > >
> > > >>>> > > Piotrek
> > > >>>> > >
> > > >>>> > > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]>
> wrote:
> > > >>>> > > >
> > > >>>> > > > Hi community,
> > > >>>> > > > The summary of Stephan makes a lot sense to me. It is much
> > > clearer
> > > >>>> > indeed
> > > >>>> > > > after splitting the complex topic into small ones.
> > > >>>> > > > I was wondering is there any detail plan for next step? If
> not,
> > > I
> > > >>>> would
> > > >>>> > > > like to push this thing forward by creating some JIRA
> issues.
> > > >>>> > > > Another question is that should version 1.8 include these
> > > >>>> features?
> > > >>>> > > >
> > > >>>> > > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
> > > >>>> > > >
> > > >>>> > > >> Thanks everyone for the lively discussion. Let me try to
> > > >>>> summarize
> > > >>>> > > where I
> > > >>>> > > >> see convergence in the discussion and open issues.
> > > >>>> > > >> I'll try to group this by design aspect of the source.
> Please
> > > >>>> let me
> > > >>>> > > know
> > > >>>> > > >> if I got things wrong or missed something crucial here.
> > > >>>> > > >>
> > > >>>> > > >> For issues 1-3, if the below reflects the state of the
> > > >>>> discussion, I
> > > >>>> > > would
> > > >>>> > > >> try and update the FLIP in the next days.
> > > >>>> > > >> For the remaining ones we need more discussion.
> > > >>>> > > >>
> > > >>>> > > >> I would suggest to fork each of these aspects into a
> separate
> > > >>>> mail
> > > >>>> > > thread,
> > > >>>> > > >> or will loose sight of the individual aspects.
> > > >>>> > > >>
> > > >>>> > > >> *(1) Separation of Split Enumerator and Split Reader*
> > > >>>> > > >>
> > > >>>> > > >>  - All seem to agree this is a good thing
> > > >>>> > > >>  - Split Enumerator could in the end live on JobManager
> (and
> > > >>>> assign
> > > >>>> > > splits
> > > >>>> > > >> via RPC) or in a task (and assign splits via data streams)
> > > >>>> > > >>  - this discussion is orthogonal and should come later,
> when
> > > the
> > > >>>> > > interface
> > > >>>> > > >> is agreed upon.
> > > >>>> > > >>
> > > >>>> > > >> *(2) Split Readers for one or more splits*
> > > >>>> > > >>
> > > >>>> > > >>  - Discussion seems to agree that we need to support one
> reader
> > > >>>> that
> > > >>>> > > >> possibly handles multiple splits concurrently.
> > > >>>> > > >>  - The requirement comes from sources where one
> poll()-style
> > > call
> > > >>>> > > fetches
> > > >>>> > > >> data from different splits / partitions
> > > >>>> > > >>    --> example sources that require that would be for
> example
> > > >>>> Kafka,
> > > >>>> > > >> Pravega, Pulsar
> > > >>>> > > >>
> > > >>>> > > >>  - Could have one split reader per source, or multiple
> split
> > > >>>> readers
> > > >>>> > > that
> > > >>>> > > >> share the "poll()" function
> > > >>>> > > >>  - To not make it too complicated, we can start with
> thinking
> > > >>>> about
> > > >>>> > one
> > > >>>> > > >> split reader for all splits initially and see if that
> covers
> > > all
> > > >>>> > > >> requirements
> > > >>>> > > >>
> > > >>>> > > >> *(3) Threading model of the Split Reader*
> > > >>>> > > >>
> > > >>>> > > >>  - Most active part of the discussion ;-)
> > > >>>> > > >>
> > > >>>> > > >>  - A non-blocking way for Flink's task code to interact
> with
> > > the
> > > >>>> > source
> > > >>>> > > is
> > > >>>> > > >> needed in order to a task runtime code based on a
> > > >>>> > > >> single-threaded/actor-style task design
> > > >>>> > > >>    --> I personally am a big proponent of that, it will
> help
> > > with
> > > >>>> > > >> well-behaved checkpoints, efficiency, and simpler yet more
> > > robust
> > > >>>> > > runtime
> > > >>>> > > >> code
> > > >>>> > > >>
> > > >>>> > > >>  - Users care about simple abstraction, so as a subclass of
> > > >>>> > SplitReader
> > > >>>> > > >> (non-blocking / async) we need to have a
> BlockingSplitReader
> > > >>>> which
> > > >>>> > will
> > > >>>> > > >> form the basis of most source implementations.
> > > >>>> BlockingSplitReader
> > > >>>> > lets
> > > >>>> > > >> users do blocking simple poll() calls.
> > > >>>> > > >>  - The BlockingSplitReader would spawn a thread (or more)
> and
> > > the
> > > >>>> > > >> thread(s) can make blocking calls and hand over data
> buffers
> > > via
> > > >>>> a
> > > >>>> > > blocking
> > > >>>> > > >> queue
> > > >>>> > > >>  - This should allow us to cover both, a fully async
> runtime,
> > > >>>> and a
> > > >>>> > > simple
> > > >>>> > > >> blocking interface for users.
> > > >>>> > > >>  - This is actually very similar to how the Kafka
> connectors
> > > >>>> work.
> > > >>>> > Kafka
> > > >>>> > > >> 9+ with one thread, Kafka 8 with multiple threads
> > > >>>> > > >>
> > > >>>> > > >>  - On the base SplitReader (the async one), the
> non-blocking
> > > >>>> method
> > > >>>> > that
> > > >>>> > > >> gets the next chunk of data would signal data availability
> via
> > > a
> > > >>>> > > >> CompletableFuture, because that gives the best flexibility
> (can
> > > >>>> await
> > > >>>> > > >> completion or register notification handlers).
> > > >>>> > > >>  - The source task would register a "thenHandle()" (or
> similar)
> > > >>>> on the
> > > >>>> > > >> future to put a "take next data" task into the actor-style
> > > >>>> mailbox
> > > >>>> > > >>
> > > >>>> > > >> *(4) Split Enumeration and Assignment*
> > > >>>> > > >>
> > > >>>> > > >>  - Splits may be generated lazily, both in cases where
> there
> > > is a
> > > >>>> > > limited
> > > >>>> > > >> number of splits (but very many), or splits are discovered
> over
> > > >>>> time
> > > >>>> > > >>  - Assignment should also be lazy, to get better load
> balancing
> > > >>>> > > >>  - Assignment needs support locality preferences
> > > >>>> > > >>
> > > >>>> > > >>  - Possible design based on discussion so far:
> > > >>>> > > >>
> > > >>>> > > >>    --> SplitReader has a method "addSplits(SplitT...)" to
> add
> > > >>>> one or
> > > >>>> > > more
> > > >>>> > > >> splits. Some split readers might assume they have only one
> > > split
> > > >>>> ever,
> > > >>>> > > >> concurrently, others assume multiple splits. (Note: idea
> behind
> > > >>>> being
> > > >>>> > > able
> > > >>>> > > >> to add multiple splits at the same time is to ease startup
> > > where
> > > >>>> > > multiple
> > > >>>> > > >> splits may be assigned instantly.)
> > > >>>> > > >>    --> SplitReader has a context object on which it can
> call
> > > >>>> indicate
> > > >>>> > > when
> > > >>>> > > >> splits are completed. The enumerator gets that
> notification and
> > > >>>> can
> > > >>>> > use
> > > >>>> > > to
> > > >>>> > > >> decide when to assign new splits. This should help both in
> > > cases
> > > >>>> of
> > > >>>> > > sources
> > > >>>> > > >> that take splits lazily (file readers) and in case the
> source
> > > >>>> needs to
> > > >>>> > > >> preserve a partial order between splits (Kinesis, Pravega,
> > > >>>> Pulsar may
> > > >>>> > > need
> > > >>>> > > >> that).
> > > >>>> > > >>    --> SplitEnumerator gets notification when SplitReaders
> > > start
> > > >>>> and
> > > >>>> > > when
> > > >>>> > > >> they finish splits. They can decide at that moment to push
> more
> > > >>>> splits
> > > >>>> > > to
> > > >>>> > > >> that reader
> > > >>>> > > >>    --> The SplitEnumerator should probably be aware of the
> > > source
> > > >>>> > > >> parallelism, to build its initial distribution.
> > > >>>> > > >>
> > > >>>> > > >>  - Open question: Should the source expose something like
> "host
> > > >>>> > > >> preferences", so that yarn/mesos/k8s can take this into
> account
> > > >>>> when
> > > >>>> > > >> selecting a node to start a TM on?
> > > >>>> > > >>
> > > >>>> > > >> *(5) Watermarks and event time alignment*
> > > >>>> > > >>
> > > >>>> > > >>  - Watermark generation, as well as idleness, needs to be
> per
> > > >>>> split
> > > >>>> > > (like
> > > >>>> > > >> currently in the Kafka Source, per partition)
> > > >>>> > > >>  - It is desirable to support optional
> event-time-alignment,
> > > >>>> meaning
> > > >>>> > > that
> > > >>>> > > >> splits that are ahead are back-pressured or temporarily
> > > >>>> unsubscribed
> > > >>>> > > >>
> > > >>>> > > >>  - I think i would be desirable to encapsulate watermark
> > > >>>> generation
> > > >>>> > > logic
> > > >>>> > > >> in watermark generators, for a separation of concerns. The
> > > >>>> watermark
> > > >>>> > > >> generators should run per split.
> > > >>>> > > >>  - Using watermark generators would also help with another
> > > >>>> problem of
> > > >>>> > > the
> > > >>>> > > >> suggested interface, namely supporting non-periodic
> watermarks
> > > >>>> > > efficiently.
> > > >>>> > > >>
> > > >>>> > > >>  - Need a way to "dispatch" next record to different
> watermark
> > > >>>> > > generators
> > > >>>> > > >>  - Need a way to tell SplitReader to "suspend" a split
> until a
> > > >>>> certain
> > > >>>> > > >> watermark is reached (event time backpressure)
> > > >>>> > > >>  - This would in fact be not needed (and thus simpler) if
> we
> > > had
> > > >>>> a
> > > >>>> > > >> SplitReader per split and may be a reason to re-open that
> > > >>>> discussion
> > > >>>> > > >>
> > > >>>> > > >> *(6) Watermarks across splits and in the Split Enumerator*
> > > >>>> > > >>
> > > >>>> > > >>  - The split enumerator may need some watermark awareness,
> > > which
> > > >>>> > should
> > > >>>> > > be
> > > >>>> > > >> purely based on split metadata (like create timestamp of
> file
> > > >>>> splits)
> > > >>>> > > >>  - If there are still more splits with overlapping event
> time
> > > >>>> range
> > > >>>> > for
> > > >>>> > > a
> > > >>>> > > >> split reader, then that split reader should not advance the
> > > >>>> watermark
> > > >>>> > > >> within the split beyond the overlap boundary. Otherwise
> future
> > > >>>> splits
> > > >>>> > > will
> > > >>>> > > >> produce late data.
> > > >>>> > > >>
> > > >>>> > > >>  - One way to approach this could be that the split
> enumerator
> > > >>>> may
> > > >>>> > send
> > > >>>> > > >> watermarks to the readers, and the readers cannot emit
> > > watermarks
> > > >>>> > beyond
> > > >>>> > > >> that received watermark.
> > > >>>> > > >>  - Many split enumerators would simply immediately send
> > > Long.MAX
> > > >>>> out
> > > >>>> > and
> > > >>>> > > >> leave the progress purely to the split readers.
> > > >>>> > > >>
> > > >>>> > > >>  - For event-time alignment / split back pressure, this
> begs
> > > the
> > > >>>> > > question
> > > >>>> > > >> how we can avoid deadlocks that may arise when splits are
> > > >>>> suspended
> > > >>>> > for
> > > >>>> > > >> event time back pressure,
> > > >>>> > > >>
> > > >>>> > > >> *(7) Batch and streaming Unification*
> > > >>>> > > >>
> > > >>>> > > >>  - Functionality wise, the above design should support both
> > > >>>> > > >>  - Batch often (mostly) does not care about reading "in
> order"
> > > >>>> and
> > > >>>> > > >> generating watermarks
> > > >>>> > > >>    --> Might use different enumerator logic that is more
> > > locality
> > > >>>> > aware
> > > >>>> > > >> and ignores event time order
> > > >>>> > > >>    --> Does not generate watermarks
> > > >>>> > > >>  - Would be great if bounded sources could be identified at
> > > >>>> compile
> > > >>>> > > time,
> > > >>>> > > >> so that "env.addBoundedSource(...)" is type safe and can
> > > return a
> > > >>>> > > >> "BoundedDataStream".
> > > >>>> > > >>  - Possible to defer this discussion until later
> > > >>>> > > >>
> > > >>>> > > >> *Miscellaneous Comments*
> > > >>>> > > >>
> > > >>>> > > >>  - Should the source have a TypeInformation for the
> produced
> > > >>>> type,
> > > >>>> > > instead
> > > >>>> > > >> of a serializer? We need a type information in the stream
> > > >>>> anyways, and
> > > >>>> > > can
> > > >>>> > > >> derive the serializer from that. Plus, creating the
> serializer
> > > >>>> should
> > > >>>> > > >> respect the ExecutionConfig.
> > > >>>> > > >>
> > > >>>> > > >>  - The TypeSerializer interface is very powerful but also
> not
> > > >>>> easy to
> > > >>>> > > >> implement. Its purpose is to handle data super efficiently,
> > > >>>> support
> > > >>>> > > >> flexible ways of evolution, etc.
> > > >>>> > > >>  For metadata I would suggest to look at the
> > > >>>> SimpleVersionedSerializer
> > > >>>> > > >> instead, which is used for example for checkpoint master
> hooks,
> > > >>>> or for
> > > >>>> > > the
> > > >>>> > > >> streaming file sink. I think that is is a good match for
> cases
> > > >>>> where
> > > >>>> > we
> > > >>>> > > do
> > > >>>> > > >> not need more than ser/deser (no copy, etc.) and don't
> need to
> > > >>>> push
> > > >>>> > > >> versioning out of the serialization paths for best
> performance
> > > >>>> (as in
> > > >>>> > > the
> > > >>>> > > >> TypeSerializer)
> > > >>>> > > >>
> > > >>>> > > >>
> > > >>>> > > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
> > > >>>> > > >> [hidden email]>
> > > >>>> > > >> wrote:
> > > >>>> > > >>
> > > >>>> > > >>> Hi Biao,
> > > >>>> > > >>>
> > > >>>> > > >>> Thanks for the answer!
> > > >>>> > > >>>
> > > >>>> > > >>> So given the multi-threaded readers, now we have as open
> > > >>>> questions:
> > > >>>> > > >>>
> > > >>>> > > >>> 1) How do we let the checkpoints pass through our
> > > multi-threaded
> > > >>>> > reader
> > > >>>> > > >>> operator?
> > > >>>> > > >>>
> > > >>>> > > >>> 2) Do we have separate reader and source operators or
> not? In
> > > >>>> the
> > > >>>> > > >> strategy
> > > >>>> > > >>> that has a separate source, the source operator has a
> > > >>>> parallelism of
> > > >>>> > 1
> > > >>>> > > >> and
> > > >>>> > > >>> is responsible for split recovery only.
> > > >>>> > > >>>
> > > >>>> > > >>> For the first one, given also the constraints (blocking,
> > > finite
> > > >>>> > queues,
> > > >>>> > > >>> etc), I do not have an answer yet.
> > > >>>> > > >>>
> > > >>>> > > >>> For the 2nd, I think that we should go with separate
> operators
> > > >>>> for
> > > >>>> > the
> > > >>>> > > >>> source and the readers, for the following reasons:
> > > >>>> > > >>>
> > > >>>> > > >>> 1) This is more aligned with a potential future
> improvement
> > > >>>> where the
> > > >>>> > > >> split
> > > >>>> > > >>> discovery becomes a responsibility of the JobManager and
> > > >>>> readers are
> > > >>>> > > >>> pooling more work from the JM.
> > > >>>> > > >>>
> > > >>>> > > >>> 2) The source is going to be the "single point of truth".
> It
> > > >>>> will
> > > >>>> > know
> > > >>>> > > >> what
> > > >>>> > > >>> has been processed and what not. If the source and the
> readers
> > > >>>> are a
> > > >>>> > > >> single
> > > >>>> > > >>> operator with parallelism > 1, or in general, if the split
> > > >>>> discovery
> > > >>>> > is
> > > >>>> > > >>> done by each task individually, then:
> > > >>>> > > >>>   i) we have to have a deterministic scheme for each
> reader to
> > > >>>> assign
> > > >>>> > > >>> splits to itself (e.g. mod subtaskId). This is not
> necessarily
> > > >>>> > trivial
> > > >>>> > > >> for
> > > >>>> > > >>> all sources.
> > > >>>> > > >>>   ii) each reader would have to keep a copy of all its
> > > processed
> > > >>>> > slpits
> > > >>>> > > >>>   iii) the state has to be a union state with a
> non-trivial
> > > >>>> merging
> > > >>>> > > >> logic
> > > >>>> > > >>> in order to support rescaling.
> > > >>>> > > >>>
> > > >>>> > > >>> Two additional points that you raised above:
> > > >>>> > > >>>
> > > >>>> > > >>> i) The point that you raised that we need to keep all
> splits
> > > >>>> > (processed
> > > >>>> > > >> and
> > > >>>> > > >>> not-processed) I think is a bit of a strong requirement.
> This
> > > >>>> would
> > > >>>> > > imply
> > > >>>> > > >>> that for infinite sources the state will grow
> indefinitely.
> > > >>>> This is
> > > >>>> > > >> problem
> > > >>>> > > >>> is even more pronounced if we do not have a single source
> that
> > > >>>> > assigns
> > > >>>> > > >>> splits to readers, as each reader will have its own copy
> of
> > > the
> > > >>>> > state.
> > > >>>> > > >>>
> > > >>>> > > >>> ii) it is true that for finite sources we need to somehow
> not
> > > >>>> close
> > > >>>> > the
> > > >>>> > > >>> readers when the source/split discoverer finishes. The
> > > >>>> > > >>> ContinuousFileReaderOperator has a work-around for that.
> It is
> > > >>>> not
> > > >>>> > > >> elegant,
> > > >>>> > > >>> and checkpoints are not emitted after closing the source,
> but
> > > >>>> this, I
> > > >>>> > > >>> believe, is a bigger problem which requires more changes
> than
> > > >>>> just
> > > >>>> > > >>> refactoring the source interface.
> > > >>>> > > >>>
> > > >>>> > > >>> Cheers,
> > > >>>> > > >>> Kostas
> > > >>>> > > >>>
> > > >>>> > > >>
> > > >>>> > >
> > > >>>> > >
> > > >>>> >
> > > >>>>
> > > >>>
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Becket Qin
Hi all,

Sorry for the long belated update. I have updated FLIP-27 wiki page with
the latest proposals. Some noticeable changes include:
1. A new generic communication mechanism between SplitEnumerator and
SourceReader.
2. Some detail API method signature changes.

We left a few things out of this FLIP and will address them in separate
FLIPs. Including:
1. Per split event time.
2. Event time alignment.
3. Fine grained failover for SplitEnumerator failure.

Please let us know if you have any question.

Thanks,

Jiangjie (Becket) Qin

On Sat, Nov 16, 2019 at 6:10 AM Stephan Ewen <[hidden email]> wrote:

> Hi  Łukasz!
>
> Becket and me are working hard on figuring out the last details and
> implementing the first PoC. We would update the FLIP hopefully next week.
>
> There is a fair chance that a first version of this will be in 1.10, but I
> think it will take another release to battle test it and migrate the
> connectors.
>
> Best,
> Stephan
>
>
>
>
> On Fri, Nov 15, 2019 at 11:14 AM Łukasz Jędrzejewski <[hidden email]> wrote:
>
> > Hi,
> >
> > This proposal looks very promising for us. Do you have any plans in which
> > Flink release it is going to be released? We are thinking on using a Data
> > Set API for our future use cases but on the other hand Data Set API is
> > going to be deprecated so using proposed bounded data streams solution
> > could be more viable in the long term.
> >
> > Thanks,
> > Łukasz
> >
> > On 2019/10/01 15:48:03, Thomas Weise <[hidden email]> wrote:
> > > Thanks for putting together this proposal!
> > >
> > > I see that the "Per Split Event Time" and "Event Time Alignment"
> sections
> > > are still TBD.
> > >
> > > It would probably be good to flesh those out a bit before proceeding
> too
> > far
> > > as the event time alignment will probably influence the interaction
> with
> > > the split reader, specifically ReaderStatus emitNext(SourceOutput<E>
> > > output).
> > >
> > > We currently have only one implementation for event time alignment in
> the
> > > Kinesis consumer. The synchronization in that case takes place as the
> > last
> > > step before records are emitted downstream (RecordEmitter). With the
> > > currently proposed interfaces, the equivalent can be implemented in the
> > > reader loop, although note that in the Kinesis consumer the per shard
> > > threads push records.
> > >
> > > Synchronization has not been implemented for the Kafka consumer yet.
> > >
> > > https://issues.apache.org/jira/browse/FLINK-12675
> > >
> > > When I looked at it, I realized that the implementation will look quite
> > > different
> > > from Kinesis because it needs to take place in the pull part, where
> > records
> > > are taken from the Kafka client. Due to the multiplexing it cannot be
> > done
> > > by blocking the split thread like it currently works for Kinesis.
> Reading
> > > from individual Kafka partitions needs to be controlled via
> pause/resume
> > > on the Kafka client.
> > >
> > > To take on that responsibility the split thread would need to be aware
> of
> > > the
> > > watermarks or at least whether it should or should not continue to
> > consume
> > > a given split and this may require a different SourceReader or
> > SourceOutput
> > > interface.
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > On Fri, Jul 26, 2019 at 1:39 AM Biao Liu <[hidden email]> wrote:
> > >
> > > > Hi Stephan,
> > > >
> > > > Thank you for feedback!
> > > > Will take a look at your branch before public discussing.
> > > >
> > > >
> > > > On Fri, Jul 26, 2019 at 12:01 AM Stephan Ewen <[hidden email]>
> > wrote:
> > > >
> > > > > Hi Biao!
> > > > >
> > > > > Thanks for reviving this. I would like to join this discussion, but
> > am
> > > > > quite occupied with the 1.9 release, so can we maybe pause this
> > > > discussion
> > > > > for a week or so?
> > > > >
> > > > > In the meantime I can share some suggestion based on prior
> > experiments:
> > > > >
> > > > > How to do watermarks / timestamp extractors in a simpler and more
> > > > flexible
> > > > > way. I think that part is quite promising should be part of the new
> > > > source
> > > > > interface.
> > > > >
> > > > >
> > > >
> >
> https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/eventtime
> > > > >
> > > > >
> > > > >
> > > >
> >
> https://github.com/StephanEwen/flink/blob/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src/SourceOutput.java
> > > > >
> > > > >
> > > > >
> > > > > Some experiments on how to build the source reader and its library
> > for
> > > > > common threading/split patterns:
> > > > >
> > > > >
> > > >
> >
> https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src
> > > > >
> > > > >
> > > > > Best,
> > > > > Stephan
> > > > >
> > > > >
> > > > > On Thu, Jul 25, 2019 at 10:03 AM Biao Liu <[hidden email]>
> > wrote:
> > > > >
> > > > >> Hi devs,
> > > > >>
> > > > >> Since 1.9 is nearly released, I think we could get back to
> FLIP-27.
> > I
> > > > >> believe it should be included in 1.10.
> > > > >>
> > > > >> There are so many things mentioned in document of FLIP-27. [1] I
> > think
> > > > >> we'd better discuss them separately. However the wiki is not a
> good
> > > > place
> > > > >> to discuss. I wrote google doc about SplitReader API which misses
> > some
> > > > >> details in the document. [2]
> > > > >>
> > > > >> 1.
> > > > >>
> > > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27:+Refactor+Source+Interface
> > > > >> 2.
> > > > >>
> > > >
> >
> https://docs.google.com/document/d/1R1s_89T4S3CZwq7Tf31DciaMCqZwrLHGZFqPASu66oE/edit?usp=sharing
> > > > >>
> > > > >> CC Stephan, Aljoscha, Piotrek, Becket
> > > > >>
> > > > >>
> > > > >> On Thu, Mar 28, 2019 at 4:38 PM Biao Liu <[hidden email]>
> > wrote:
> > > > >>
> > > > >>> Hi Steven,
> > > > >>> Thank you for the feedback. Please take a look at the document
> > FLIP-27
> > > > >>> <
> > > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
> > >
> > > > which
> > > > >>> is updated recently. A lot of details of enumerator were added in
> > this
> > > > >>> document. I think it would help.
> > > > >>>
> > > > >>> Steven Wu <[hidden email]> 于2019年3月28日周四 下午12:52写道:
> > > > >>>
> > > > >>>> This proposal mentioned that SplitEnumerator might run on the
> > > > >>>> JobManager or
> > > > >>>> in a single task on a TaskManager.
> > > > >>>>
> > > > >>>> if enumerator is a single task on a taskmanager, then the job
> DAG
> > can
> > > > >>>> never
> > > > >>>> been embarrassingly parallel anymore. That will nullify the
> > leverage
> > > > of
> > > > >>>> fine-grained recovery for embarrassingly parallel jobs.
> > > > >>>>
> > > > >>>> It's not clear to me what's the implication of running
> enumerator
> > on
> > > > the
> > > > >>>> jobmanager. So I will leave that out for now.
> > > > >>>>
> > > > >>>> On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]>
> > wrote:
> > > > >>>>
> > > > >>>> > Hi Stephan & Piotrek,
> > > > >>>> >
> > > > >>>> > Thank you for feedback.
> > > > >>>> >
> > > > >>>> > It seems that there are a lot of things to do in community. I
> am
> > > > just
> > > > >>>> > afraid that this discussion may be forgotten since there so
> many
> > > > >>>> proposals
> > > > >>>> > recently.
> > > > >>>> > Anyway, wish to see the split topics soon :)
> > > > >>>> >
> > > > >>>> > Piotr Nowojski <[hidden email]> 于2019年1月24日周四
> 下午8:21写道:
> > > > >>>> >
> > > > >>>> > > Hi Biao!
> > > > >>>> > >
> > > > >>>> > > This discussion was stalled because of preparations for the
> > open
> > > > >>>> sourcing
> > > > >>>> > > & merging Blink. I think before creating the tickets we
> should
> > > > >>>> split this
> > > > >>>> > > discussion into topics/areas outlined by Stephan and create
> > Flips
> > > > >>>> for
> > > > >>>> > that.
> > > > >>>> > >
> > > > >>>> > > I think there is no chance for this to be completed in
> couple
> > of
> > > > >>>> > remaining
> > > > >>>> > > weeks/1 month before 1.8 feature freeze, however it would be
> > good
> > > > >>>> to aim
> > > > >>>> > > with those changes for 1.9.
> > > > >>>> > >
> > > > >>>> > > Piotrek
> > > > >>>> > >
> > > > >>>> > > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]>
> > wrote:
> > > > >>>> > > >
> > > > >>>> > > > Hi community,
> > > > >>>> > > > The summary of Stephan makes a lot sense to me. It is much
> > > > clearer
> > > > >>>> > indeed
> > > > >>>> > > > after splitting the complex topic into small ones.
> > > > >>>> > > > I was wondering is there any detail plan for next step? If
> > not,
> > > > I
> > > > >>>> would
> > > > >>>> > > > like to push this thing forward by creating some JIRA
> > issues.
> > > > >>>> > > > Another question is that should version 1.8 include these
> > > > >>>> features?
> > > > >>>> > > >
> > > > >>>> > > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
> > > > >>>> > > >
> > > > >>>> > > >> Thanks everyone for the lively discussion. Let me try to
> > > > >>>> summarize
> > > > >>>> > > where I
> > > > >>>> > > >> see convergence in the discussion and open issues.
> > > > >>>> > > >> I'll try to group this by design aspect of the source.
> > Please
> > > > >>>> let me
> > > > >>>> > > know
> > > > >>>> > > >> if I got things wrong or missed something crucial here.
> > > > >>>> > > >>
> > > > >>>> > > >> For issues 1-3, if the below reflects the state of the
> > > > >>>> discussion, I
> > > > >>>> > > would
> > > > >>>> > > >> try and update the FLIP in the next days.
> > > > >>>> > > >> For the remaining ones we need more discussion.
> > > > >>>> > > >>
> > > > >>>> > > >> I would suggest to fork each of these aspects into a
> > separate
> > > > >>>> mail
> > > > >>>> > > thread,
> > > > >>>> > > >> or will loose sight of the individual aspects.
> > > > >>>> > > >>
> > > > >>>> > > >> *(1) Separation of Split Enumerator and Split Reader*
> > > > >>>> > > >>
> > > > >>>> > > >>  - All seem to agree this is a good thing
> > > > >>>> > > >>  - Split Enumerator could in the end live on JobManager
> > (and
> > > > >>>> assign
> > > > >>>> > > splits
> > > > >>>> > > >> via RPC) or in a task (and assign splits via data
> streams)
> > > > >>>> > > >>  - this discussion is orthogonal and should come later,
> > when
> > > > the
> > > > >>>> > > interface
> > > > >>>> > > >> is agreed upon.
> > > > >>>> > > >>
> > > > >>>> > > >> *(2) Split Readers for one or more splits*
> > > > >>>> > > >>
> > > > >>>> > > >>  - Discussion seems to agree that we need to support one
> > reader
> > > > >>>> that
> > > > >>>> > > >> possibly handles multiple splits concurrently.
> > > > >>>> > > >>  - The requirement comes from sources where one
> > poll()-style
> > > > call
> > > > >>>> > > fetches
> > > > >>>> > > >> data from different splits / partitions
> > > > >>>> > > >>    --> example sources that require that would be for
> > example
> > > > >>>> Kafka,
> > > > >>>> > > >> Pravega, Pulsar
> > > > >>>> > > >>
> > > > >>>> > > >>  - Could have one split reader per source, or multiple
> > split
> > > > >>>> readers
> > > > >>>> > > that
> > > > >>>> > > >> share the "poll()" function
> > > > >>>> > > >>  - To not make it too complicated, we can start with
> > thinking
> > > > >>>> about
> > > > >>>> > one
> > > > >>>> > > >> split reader for all splits initially and see if that
> > covers
> > > > all
> > > > >>>> > > >> requirements
> > > > >>>> > > >>
> > > > >>>> > > >> *(3) Threading model of the Split Reader*
> > > > >>>> > > >>
> > > > >>>> > > >>  - Most active part of the discussion ;-)
> > > > >>>> > > >>
> > > > >>>> > > >>  - A non-blocking way for Flink's task code to interact
> > with
> > > > the
> > > > >>>> > source
> > > > >>>> > > is
> > > > >>>> > > >> needed in order to a task runtime code based on a
> > > > >>>> > > >> single-threaded/actor-style task design
> > > > >>>> > > >>    --> I personally am a big proponent of that, it will
> > help
> > > > with
> > > > >>>> > > >> well-behaved checkpoints, efficiency, and simpler yet
> more
> > > > robust
> > > > >>>> > > runtime
> > > > >>>> > > >> code
> > > > >>>> > > >>
> > > > >>>> > > >>  - Users care about simple abstraction, so as a subclass
> of
> > > > >>>> > SplitReader
> > > > >>>> > > >> (non-blocking / async) we need to have a
> > BlockingSplitReader
> > > > >>>> which
> > > > >>>> > will
> > > > >>>> > > >> form the basis of most source implementations.
> > > > >>>> BlockingSplitReader
> > > > >>>> > lets
> > > > >>>> > > >> users do blocking simple poll() calls.
> > > > >>>> > > >>  - The BlockingSplitReader would spawn a thread (or more)
> > and
> > > > the
> > > > >>>> > > >> thread(s) can make blocking calls and hand over data
> > buffers
> > > > via
> > > > >>>> a
> > > > >>>> > > blocking
> > > > >>>> > > >> queue
> > > > >>>> > > >>  - This should allow us to cover both, a fully async
> > runtime,
> > > > >>>> and a
> > > > >>>> > > simple
> > > > >>>> > > >> blocking interface for users.
> > > > >>>> > > >>  - This is actually very similar to how the Kafka
> > connectors
> > > > >>>> work.
> > > > >>>> > Kafka
> > > > >>>> > > >> 9+ with one thread, Kafka 8 with multiple threads
> > > > >>>> > > >>
> > > > >>>> > > >>  - On the base SplitReader (the async one), the
> > non-blocking
> > > > >>>> method
> > > > >>>> > that
> > > > >>>> > > >> gets the next chunk of data would signal data
> availability
> > via
> > > > a
> > > > >>>> > > >> CompletableFuture, because that gives the best
> flexibility
> > (can
> > > > >>>> await
> > > > >>>> > > >> completion or register notification handlers).
> > > > >>>> > > >>  - The source task would register a "thenHandle()" (or
> > similar)
> > > > >>>> on the
> > > > >>>> > > >> future to put a "take next data" task into the
> actor-style
> > > > >>>> mailbox
> > > > >>>> > > >>
> > > > >>>> > > >> *(4) Split Enumeration and Assignment*
> > > > >>>> > > >>
> > > > >>>> > > >>  - Splits may be generated lazily, both in cases where
> > there
> > > > is a
> > > > >>>> > > limited
> > > > >>>> > > >> number of splits (but very many), or splits are
> discovered
> > over
> > > > >>>> time
> > > > >>>> > > >>  - Assignment should also be lazy, to get better load
> > balancing
> > > > >>>> > > >>  - Assignment needs support locality preferences
> > > > >>>> > > >>
> > > > >>>> > > >>  - Possible design based on discussion so far:
> > > > >>>> > > >>
> > > > >>>> > > >>    --> SplitReader has a method "addSplits(SplitT...)" to
> > add
> > > > >>>> one or
> > > > >>>> > > more
> > > > >>>> > > >> splits. Some split readers might assume they have only
> one
> > > > split
> > > > >>>> ever,
> > > > >>>> > > >> concurrently, others assume multiple splits. (Note: idea
> > behind
> > > > >>>> being
> > > > >>>> > > able
> > > > >>>> > > >> to add multiple splits at the same time is to ease
> startup
> > > > where
> > > > >>>> > > multiple
> > > > >>>> > > >> splits may be assigned instantly.)
> > > > >>>> > > >>    --> SplitReader has a context object on which it can
> > call
> > > > >>>> indicate
> > > > >>>> > > when
> > > > >>>> > > >> splits are completed. The enumerator gets that
> > notification and
> > > > >>>> can
> > > > >>>> > use
> > > > >>>> > > to
> > > > >>>> > > >> decide when to assign new splits. This should help both
> in
> > > > cases
> > > > >>>> of
> > > > >>>> > > sources
> > > > >>>> > > >> that take splits lazily (file readers) and in case the
> > source
> > > > >>>> needs to
> > > > >>>> > > >> preserve a partial order between splits (Kinesis,
> Pravega,
> > > > >>>> Pulsar may
> > > > >>>> > > need
> > > > >>>> > > >> that).
> > > > >>>> > > >>    --> SplitEnumerator gets notification when
> SplitReaders
> > > > start
> > > > >>>> and
> > > > >>>> > > when
> > > > >>>> > > >> they finish splits. They can decide at that moment to
> push
> > more
> > > > >>>> splits
> > > > >>>> > > to
> > > > >>>> > > >> that reader
> > > > >>>> > > >>    --> The SplitEnumerator should probably be aware of
> the
> > > > source
> > > > >>>> > > >> parallelism, to build its initial distribution.
> > > > >>>> > > >>
> > > > >>>> > > >>  - Open question: Should the source expose something like
> > "host
> > > > >>>> > > >> preferences", so that yarn/mesos/k8s can take this into
> > account
> > > > >>>> when
> > > > >>>> > > >> selecting a node to start a TM on?
> > > > >>>> > > >>
> > > > >>>> > > >> *(5) Watermarks and event time alignment*
> > > > >>>> > > >>
> > > > >>>> > > >>  - Watermark generation, as well as idleness, needs to be
> > per
> > > > >>>> split
> > > > >>>> > > (like
> > > > >>>> > > >> currently in the Kafka Source, per partition)
> > > > >>>> > > >>  - It is desirable to support optional
> > event-time-alignment,
> > > > >>>> meaning
> > > > >>>> > > that
> > > > >>>> > > >> splits that are ahead are back-pressured or temporarily
> > > > >>>> unsubscribed
> > > > >>>> > > >>
> > > > >>>> > > >>  - I think i would be desirable to encapsulate watermark
> > > > >>>> generation
> > > > >>>> > > logic
> > > > >>>> > > >> in watermark generators, for a separation of concerns.
> The
> > > > >>>> watermark
> > > > >>>> > > >> generators should run per split.
> > > > >>>> > > >>  - Using watermark generators would also help with
> another
> > > > >>>> problem of
> > > > >>>> > > the
> > > > >>>> > > >> suggested interface, namely supporting non-periodic
> > watermarks
> > > > >>>> > > efficiently.
> > > > >>>> > > >>
> > > > >>>> > > >>  - Need a way to "dispatch" next record to different
> > watermark
> > > > >>>> > > generators
> > > > >>>> > > >>  - Need a way to tell SplitReader to "suspend" a split
> > until a
> > > > >>>> certain
> > > > >>>> > > >> watermark is reached (event time backpressure)
> > > > >>>> > > >>  - This would in fact be not needed (and thus simpler) if
> > we
> > > > had
> > > > >>>> a
> > > > >>>> > > >> SplitReader per split and may be a reason to re-open that
> > > > >>>> discussion
> > > > >>>> > > >>
> > > > >>>> > > >> *(6) Watermarks across splits and in the Split
> Enumerator*
> > > > >>>> > > >>
> > > > >>>> > > >>  - The split enumerator may need some watermark
> awareness,
> > > > which
> > > > >>>> > should
> > > > >>>> > > be
> > > > >>>> > > >> purely based on split metadata (like create timestamp of
> > file
> > > > >>>> splits)
> > > > >>>> > > >>  - If there are still more splits with overlapping event
> > time
> > > > >>>> range
> > > > >>>> > for
> > > > >>>> > > a
> > > > >>>> > > >> split reader, then that split reader should not advance
> the
> > > > >>>> watermark
> > > > >>>> > > >> within the split beyond the overlap boundary. Otherwise
> > future
> > > > >>>> splits
> > > > >>>> > > will
> > > > >>>> > > >> produce late data.
> > > > >>>> > > >>
> > > > >>>> > > >>  - One way to approach this could be that the split
> > enumerator
> > > > >>>> may
> > > > >>>> > send
> > > > >>>> > > >> watermarks to the readers, and the readers cannot emit
> > > > watermarks
> > > > >>>> > beyond
> > > > >>>> > > >> that received watermark.
> > > > >>>> > > >>  - Many split enumerators would simply immediately send
> > > > Long.MAX
> > > > >>>> out
> > > > >>>> > and
> > > > >>>> > > >> leave the progress purely to the split readers.
> > > > >>>> > > >>
> > > > >>>> > > >>  - For event-time alignment / split back pressure, this
> > begs
> > > > the
> > > > >>>> > > question
> > > > >>>> > > >> how we can avoid deadlocks that may arise when splits are
> > > > >>>> suspended
> > > > >>>> > for
> > > > >>>> > > >> event time back pressure,
> > > > >>>> > > >>
> > > > >>>> > > >> *(7) Batch and streaming Unification*
> > > > >>>> > > >>
> > > > >>>> > > >>  - Functionality wise, the above design should support
> both
> > > > >>>> > > >>  - Batch often (mostly) does not care about reading "in
> > order"
> > > > >>>> and
> > > > >>>> > > >> generating watermarks
> > > > >>>> > > >>    --> Might use different enumerator logic that is more
> > > > locality
> > > > >>>> > aware
> > > > >>>> > > >> and ignores event time order
> > > > >>>> > > >>    --> Does not generate watermarks
> > > > >>>> > > >>  - Would be great if bounded sources could be identified
> at
> > > > >>>> compile
> > > > >>>> > > time,
> > > > >>>> > > >> so that "env.addBoundedSource(...)" is type safe and can
> > > > return a
> > > > >>>> > > >> "BoundedDataStream".
> > > > >>>> > > >>  - Possible to defer this discussion until later
> > > > >>>> > > >>
> > > > >>>> > > >> *Miscellaneous Comments*
> > > > >>>> > > >>
> > > > >>>> > > >>  - Should the source have a TypeInformation for the
> > produced
> > > > >>>> type,
> > > > >>>> > > instead
> > > > >>>> > > >> of a serializer? We need a type information in the stream
> > > > >>>> anyways, and
> > > > >>>> > > can
> > > > >>>> > > >> derive the serializer from that. Plus, creating the
> > serializer
> > > > >>>> should
> > > > >>>> > > >> respect the ExecutionConfig.
> > > > >>>> > > >>
> > > > >>>> > > >>  - The TypeSerializer interface is very powerful but also
> > not
> > > > >>>> easy to
> > > > >>>> > > >> implement. Its purpose is to handle data super
> efficiently,
> > > > >>>> support
> > > > >>>> > > >> flexible ways of evolution, etc.
> > > > >>>> > > >>  For metadata I would suggest to look at the
> > > > >>>> SimpleVersionedSerializer
> > > > >>>> > > >> instead, which is used for example for checkpoint master
> > hooks,
> > > > >>>> or for
> > > > >>>> > > the
> > > > >>>> > > >> streaming file sink. I think that is is a good match for
> > cases
> > > > >>>> where
> > > > >>>> > we
> > > > >>>> > > do
> > > > >>>> > > >> not need more than ser/deser (no copy, etc.) and don't
> > need to
> > > > >>>> push
> > > > >>>> > > >> versioning out of the serialization paths for best
> > performance
> > > > >>>> (as in
> > > > >>>> > > the
> > > > >>>> > > >> TypeSerializer)
> > > > >>>> > > >>
> > > > >>>> > > >>
> > > > >>>> > > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
> > > > >>>> > > >> [hidden email]>
> > > > >>>> > > >> wrote:
> > > > >>>> > > >>
> > > > >>>> > > >>> Hi Biao,
> > > > >>>> > > >>>
> > > > >>>> > > >>> Thanks for the answer!
> > > > >>>> > > >>>
> > > > >>>> > > >>> So given the multi-threaded readers, now we have as open
> > > > >>>> questions:
> > > > >>>> > > >>>
> > > > >>>> > > >>> 1) How do we let the checkpoints pass through our
> > > > multi-threaded
> > > > >>>> > reader
> > > > >>>> > > >>> operator?
> > > > >>>> > > >>>
> > > > >>>> > > >>> 2) Do we have separate reader and source operators or
> > not? In
> > > > >>>> the
> > > > >>>> > > >> strategy
> > > > >>>> > > >>> that has a separate source, the source operator has a
> > > > >>>> parallelism of
> > > > >>>> > 1
> > > > >>>> > > >> and
> > > > >>>> > > >>> is responsible for split recovery only.
> > > > >>>> > > >>>
> > > > >>>> > > >>> For the first one, given also the constraints (blocking,
> > > > finite
> > > > >>>> > queues,
> > > > >>>> > > >>> etc), I do not have an answer yet.
> > > > >>>> > > >>>
> > > > >>>> > > >>> For the 2nd, I think that we should go with separate
> > operators
> > > > >>>> for
> > > > >>>> > the
> > > > >>>> > > >>> source and the readers, for the following reasons:
> > > > >>>> > > >>>
> > > > >>>> > > >>> 1) This is more aligned with a potential future
> > improvement
> > > > >>>> where the
> > > > >>>> > > >> split
> > > > >>>> > > >>> discovery becomes a responsibility of the JobManager and
> > > > >>>> readers are
> > > > >>>> > > >>> pooling more work from the JM.
> > > > >>>> > > >>>
> > > > >>>> > > >>> 2) The source is going to be the "single point of
> truth".
> > It
> > > > >>>> will
> > > > >>>> > know
> > > > >>>> > > >> what
> > > > >>>> > > >>> has been processed and what not. If the source and the
> > readers
> > > > >>>> are a
> > > > >>>> > > >> single
> > > > >>>> > > >>> operator with parallelism > 1, or in general, if the
> split
> > > > >>>> discovery
> > > > >>>> > is
> > > > >>>> > > >>> done by each task individually, then:
> > > > >>>> > > >>>   i) we have to have a deterministic scheme for each
> > reader to
> > > > >>>> assign
> > > > >>>> > > >>> splits to itself (e.g. mod subtaskId). This is not
> > necessarily
> > > > >>>> > trivial
> > > > >>>> > > >> for
> > > > >>>> > > >>> all sources.
> > > > >>>> > > >>>   ii) each reader would have to keep a copy of all its
> > > > processed
> > > > >>>> > slpits
> > > > >>>> > > >>>   iii) the state has to be a union state with a
> > non-trivial
> > > > >>>> merging
> > > > >>>> > > >> logic
> > > > >>>> > > >>> in order to support rescaling.
> > > > >>>> > > >>>
> > > > >>>> > > >>> Two additional points that you raised above:
> > > > >>>> > > >>>
> > > > >>>> > > >>> i) The point that you raised that we need to keep all
> > splits
> > > > >>>> > (processed
> > > > >>>> > > >> and
> > > > >>>> > > >>> not-processed) I think is a bit of a strong requirement.
> > This
> > > > >>>> would
> > > > >>>> > > imply
> > > > >>>> > > >>> that for infinite sources the state will grow
> > indefinitely.
> > > > >>>> This is
> > > > >>>> > > >> problem
> > > > >>>> > > >>> is even more pronounced if we do not have a single
> source
> > that
> > > > >>>> > assigns
> > > > >>>> > > >>> splits to readers, as each reader will have its own copy
> > of
> > > > the
> > > > >>>> > state.
> > > > >>>> > > >>>
> > > > >>>> > > >>> ii) it is true that for finite sources we need to
> somehow
> > not
> > > > >>>> close
> > > > >>>> > the
> > > > >>>> > > >>> readers when the source/split discoverer finishes. The
> > > > >>>> > > >>> ContinuousFileReaderOperator has a work-around for that.
> > It is
> > > > >>>> not
> > > > >>>> > > >> elegant,
> > > > >>>> > > >>> and checkpoints are not emitted after closing the
> source,
> > but
> > > > >>>> this, I
> > > > >>>> > > >>> believe, is a bigger problem which requires more changes
> > than
> > > > >>>> just
> > > > >>>> > > >>> refactoring the source interface.
> > > > >>>> > > >>>
> > > > >>>> > > >>> Cheers,
> > > > >>>> > > >>> Kostas
> > > > >>>> > > >>>
> > > > >>>> > > >>
> > > > >>>> > >
> > > > >>>> > >
> > > > >>>> >
> > > > >>>>
> > > > >>>
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Stephan Ewen
Thanks, Becket, for updating this.

I agree with moving the aspects you mentioned into separate FLIPs - this
one way becoming unwieldy in size.

+1 to the FLIP in its current state. Its a very detailed write-up, nicely
done!

On Wed, Dec 4, 2019 at 7:38 AM Becket Qin <[hidden email]> wrote:

> Hi all,
>
> Sorry for the long belated update. I have updated FLIP-27 wiki page with
> the latest proposals. Some noticeable changes include:
> 1. A new generic communication mechanism between SplitEnumerator and
> SourceReader.
> 2. Some detail API method signature changes.
>
> We left a few things out of this FLIP and will address them in separate
> FLIPs. Including:
> 1. Per split event time.
> 2. Event time alignment.
> 3. Fine grained failover for SplitEnumerator failure.
>
> Please let us know if you have any question.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Sat, Nov 16, 2019 at 6:10 AM Stephan Ewen <[hidden email]> wrote:
>
> > Hi  Łukasz!
> >
> > Becket and me are working hard on figuring out the last details and
> > implementing the first PoC. We would update the FLIP hopefully next week.
> >
> > There is a fair chance that a first version of this will be in 1.10, but
> I
> > think it will take another release to battle test it and migrate the
> > connectors.
> >
> > Best,
> > Stephan
> >
> >
> >
> >
> > On Fri, Nov 15, 2019 at 11:14 AM Łukasz Jędrzejewski <[hidden email]>
> wrote:
> >
> > > Hi,
> > >
> > > This proposal looks very promising for us. Do you have any plans in
> which
> > > Flink release it is going to be released? We are thinking on using a
> Data
> > > Set API for our future use cases but on the other hand Data Set API is
> > > going to be deprecated so using proposed bounded data streams solution
> > > could be more viable in the long term.
> > >
> > > Thanks,
> > > Łukasz
> > >
> > > On 2019/10/01 15:48:03, Thomas Weise <[hidden email]> wrote:
> > > > Thanks for putting together this proposal!
> > > >
> > > > I see that the "Per Split Event Time" and "Event Time Alignment"
> > sections
> > > > are still TBD.
> > > >
> > > > It would probably be good to flesh those out a bit before proceeding
> > too
> > > far
> > > > as the event time alignment will probably influence the interaction
> > with
> > > > the split reader, specifically ReaderStatus emitNext(SourceOutput<E>
> > > > output).
> > > >
> > > > We currently have only one implementation for event time alignment in
> > the
> > > > Kinesis consumer. The synchronization in that case takes place as the
> > > last
> > > > step before records are emitted downstream (RecordEmitter). With the
> > > > currently proposed interfaces, the equivalent can be implemented in
> the
> > > > reader loop, although note that in the Kinesis consumer the per shard
> > > > threads push records.
> > > >
> > > > Synchronization has not been implemented for the Kafka consumer yet.
> > > >
> > > > https://issues.apache.org/jira/browse/FLINK-12675
> > > >
> > > > When I looked at it, I realized that the implementation will look
> quite
> > > > different
> > > > from Kinesis because it needs to take place in the pull part, where
> > > records
> > > > are taken from the Kafka client. Due to the multiplexing it cannot be
> > > done
> > > > by blocking the split thread like it currently works for Kinesis.
> > Reading
> > > > from individual Kafka partitions needs to be controlled via
> > pause/resume
> > > > on the Kafka client.
> > > >
> > > > To take on that responsibility the split thread would need to be
> aware
> > of
> > > > the
> > > > watermarks or at least whether it should or should not continue to
> > > consume
> > > > a given split and this may require a different SourceReader or
> > > SourceOutput
> > > > interface.
> > > >
> > > > Thanks,
> > > > Thomas
> > > >
> > > >
> > > > On Fri, Jul 26, 2019 at 1:39 AM Biao Liu <[hidden email]> wrote:
> > > >
> > > > > Hi Stephan,
> > > > >
> > > > > Thank you for feedback!
> > > > > Will take a look at your branch before public discussing.
> > > > >
> > > > >
> > > > > On Fri, Jul 26, 2019 at 12:01 AM Stephan Ewen <[hidden email]>
> > > wrote:
> > > > >
> > > > > > Hi Biao!
> > > > > >
> > > > > > Thanks for reviving this. I would like to join this discussion,
> but
> > > am
> > > > > > quite occupied with the 1.9 release, so can we maybe pause this
> > > > > discussion
> > > > > > for a week or so?
> > > > > >
> > > > > > In the meantime I can share some suggestion based on prior
> > > experiments:
> > > > > >
> > > > > > How to do watermarks / timestamp extractors in a simpler and more
> > > > > flexible
> > > > > > way. I think that part is quite promising should be part of the
> new
> > > > > source
> > > > > > interface.
> > > > > >
> > > > > >
> > > > >
> > >
> >
> https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/eventtime
> > > > > >
> > > > > >
> > > > > >
> > > > >
> > >
> >
> https://github.com/StephanEwen/flink/blob/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src/SourceOutput.java
> > > > > >
> > > > > >
> > > > > >
> > > > > > Some experiments on how to build the source reader and its
> library
> > > for
> > > > > > common threading/split patterns:
> > > > > >
> > > > > >
> > > > >
> > >
> >
> https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Stephan
> > > > > >
> > > > > >
> > > > > > On Thu, Jul 25, 2019 at 10:03 AM Biao Liu <[hidden email]>
> > > wrote:
> > > > > >
> > > > > >> Hi devs,
> > > > > >>
> > > > > >> Since 1.9 is nearly released, I think we could get back to
> > FLIP-27.
> > > I
> > > > > >> believe it should be included in 1.10.
> > > > > >>
> > > > > >> There are so many things mentioned in document of FLIP-27. [1] I
> > > think
> > > > > >> we'd better discuss them separately. However the wiki is not a
> > good
> > > > > place
> > > > > >> to discuss. I wrote google doc about SplitReader API which
> misses
> > > some
> > > > > >> details in the document. [2]
> > > > > >>
> > > > > >> 1.
> > > > > >>
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27:+Refactor+Source+Interface
> > > > > >> 2.
> > > > > >>
> > > > >
> > >
> >
> https://docs.google.com/document/d/1R1s_89T4S3CZwq7Tf31DciaMCqZwrLHGZFqPASu66oE/edit?usp=sharing
> > > > > >>
> > > > > >> CC Stephan, Aljoscha, Piotrek, Becket
> > > > > >>
> > > > > >>
> > > > > >> On Thu, Mar 28, 2019 at 4:38 PM Biao Liu <[hidden email]>
> > > wrote:
> > > > > >>
> > > > > >>> Hi Steven,
> > > > > >>> Thank you for the feedback. Please take a look at the document
> > > FLIP-27
> > > > > >>> <
> > > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
> > > >
> > > > > which
> > > > > >>> is updated recently. A lot of details of enumerator were added
> in
> > > this
> > > > > >>> document. I think it would help.
> > > > > >>>
> > > > > >>> Steven Wu <[hidden email]> 于2019年3月28日周四 下午12:52写道:
> > > > > >>>
> > > > > >>>> This proposal mentioned that SplitEnumerator might run on the
> > > > > >>>> JobManager or
> > > > > >>>> in a single task on a TaskManager.
> > > > > >>>>
> > > > > >>>> if enumerator is a single task on a taskmanager, then the job
> > DAG
> > > can
> > > > > >>>> never
> > > > > >>>> been embarrassingly parallel anymore. That will nullify the
> > > leverage
> > > > > of
> > > > > >>>> fine-grained recovery for embarrassingly parallel jobs.
> > > > > >>>>
> > > > > >>>> It's not clear to me what's the implication of running
> > enumerator
> > > on
> > > > > the
> > > > > >>>> jobmanager. So I will leave that out for now.
> > > > > >>>>
> > > > > >>>> On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]>
> > > wrote:
> > > > > >>>>
> > > > > >>>> > Hi Stephan & Piotrek,
> > > > > >>>> >
> > > > > >>>> > Thank you for feedback.
> > > > > >>>> >
> > > > > >>>> > It seems that there are a lot of things to do in community.
> I
> > am
> > > > > just
> > > > > >>>> > afraid that this discussion may be forgotten since there so
> > many
> > > > > >>>> proposals
> > > > > >>>> > recently.
> > > > > >>>> > Anyway, wish to see the split topics soon :)
> > > > > >>>> >
> > > > > >>>> > Piotr Nowojski <[hidden email]> 于2019年1月24日周四
> > 下午8:21写道:
> > > > > >>>> >
> > > > > >>>> > > Hi Biao!
> > > > > >>>> > >
> > > > > >>>> > > This discussion was stalled because of preparations for
> the
> > > open
> > > > > >>>> sourcing
> > > > > >>>> > > & merging Blink. I think before creating the tickets we
> > should
> > > > > >>>> split this
> > > > > >>>> > > discussion into topics/areas outlined by Stephan and
> create
> > > Flips
> > > > > >>>> for
> > > > > >>>> > that.
> > > > > >>>> > >
> > > > > >>>> > > I think there is no chance for this to be completed in
> > couple
> > > of
> > > > > >>>> > remaining
> > > > > >>>> > > weeks/1 month before 1.8 feature freeze, however it would
> be
> > > good
> > > > > >>>> to aim
> > > > > >>>> > > with those changes for 1.9.
> > > > > >>>> > >
> > > > > >>>> > > Piotrek
> > > > > >>>> > >
> > > > > >>>> > > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]>
> > > wrote:
> > > > > >>>> > > >
> > > > > >>>> > > > Hi community,
> > > > > >>>> > > > The summary of Stephan makes a lot sense to me. It is
> much
> > > > > clearer
> > > > > >>>> > indeed
> > > > > >>>> > > > after splitting the complex topic into small ones.
> > > > > >>>> > > > I was wondering is there any detail plan for next step?
> If
> > > not,
> > > > > I
> > > > > >>>> would
> > > > > >>>> > > > like to push this thing forward by creating some JIRA
> > > issues.
> > > > > >>>> > > > Another question is that should version 1.8 include
> these
> > > > > >>>> features?
> > > > > >>>> > > >
> > > > > >>>> > > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
> > > > > >>>> > > >
> > > > > >>>> > > >> Thanks everyone for the lively discussion. Let me try
> to
> > > > > >>>> summarize
> > > > > >>>> > > where I
> > > > > >>>> > > >> see convergence in the discussion and open issues.
> > > > > >>>> > > >> I'll try to group this by design aspect of the source.
> > > Please
> > > > > >>>> let me
> > > > > >>>> > > know
> > > > > >>>> > > >> if I got things wrong or missed something crucial here.
> > > > > >>>> > > >>
> > > > > >>>> > > >> For issues 1-3, if the below reflects the state of the
> > > > > >>>> discussion, I
> > > > > >>>> > > would
> > > > > >>>> > > >> try and update the FLIP in the next days.
> > > > > >>>> > > >> For the remaining ones we need more discussion.
> > > > > >>>> > > >>
> > > > > >>>> > > >> I would suggest to fork each of these aspects into a
> > > separate
> > > > > >>>> mail
> > > > > >>>> > > thread,
> > > > > >>>> > > >> or will loose sight of the individual aspects.
> > > > > >>>> > > >>
> > > > > >>>> > > >> *(1) Separation of Split Enumerator and Split Reader*
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - All seem to agree this is a good thing
> > > > > >>>> > > >>  - Split Enumerator could in the end live on JobManager
> > > (and
> > > > > >>>> assign
> > > > > >>>> > > splits
> > > > > >>>> > > >> via RPC) or in a task (and assign splits via data
> > streams)
> > > > > >>>> > > >>  - this discussion is orthogonal and should come later,
> > > when
> > > > > the
> > > > > >>>> > > interface
> > > > > >>>> > > >> is agreed upon.
> > > > > >>>> > > >>
> > > > > >>>> > > >> *(2) Split Readers for one or more splits*
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - Discussion seems to agree that we need to support
> one
> > > reader
> > > > > >>>> that
> > > > > >>>> > > >> possibly handles multiple splits concurrently.
> > > > > >>>> > > >>  - The requirement comes from sources where one
> > > poll()-style
> > > > > call
> > > > > >>>> > > fetches
> > > > > >>>> > > >> data from different splits / partitions
> > > > > >>>> > > >>    --> example sources that require that would be for
> > > example
> > > > > >>>> Kafka,
> > > > > >>>> > > >> Pravega, Pulsar
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - Could have one split reader per source, or multiple
> > > split
> > > > > >>>> readers
> > > > > >>>> > > that
> > > > > >>>> > > >> share the "poll()" function
> > > > > >>>> > > >>  - To not make it too complicated, we can start with
> > > thinking
> > > > > >>>> about
> > > > > >>>> > one
> > > > > >>>> > > >> split reader for all splits initially and see if that
> > > covers
> > > > > all
> > > > > >>>> > > >> requirements
> > > > > >>>> > > >>
> > > > > >>>> > > >> *(3) Threading model of the Split Reader*
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - Most active part of the discussion ;-)
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - A non-blocking way for Flink's task code to interact
> > > with
> > > > > the
> > > > > >>>> > source
> > > > > >>>> > > is
> > > > > >>>> > > >> needed in order to a task runtime code based on a
> > > > > >>>> > > >> single-threaded/actor-style task design
> > > > > >>>> > > >>    --> I personally am a big proponent of that, it will
> > > help
> > > > > with
> > > > > >>>> > > >> well-behaved checkpoints, efficiency, and simpler yet
> > more
> > > > > robust
> > > > > >>>> > > runtime
> > > > > >>>> > > >> code
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - Users care about simple abstraction, so as a
> subclass
> > of
> > > > > >>>> > SplitReader
> > > > > >>>> > > >> (non-blocking / async) we need to have a
> > > BlockingSplitReader
> > > > > >>>> which
> > > > > >>>> > will
> > > > > >>>> > > >> form the basis of most source implementations.
> > > > > >>>> BlockingSplitReader
> > > > > >>>> > lets
> > > > > >>>> > > >> users do blocking simple poll() calls.
> > > > > >>>> > > >>  - The BlockingSplitReader would spawn a thread (or
> more)
> > > and
> > > > > the
> > > > > >>>> > > >> thread(s) can make blocking calls and hand over data
> > > buffers
> > > > > via
> > > > > >>>> a
> > > > > >>>> > > blocking
> > > > > >>>> > > >> queue
> > > > > >>>> > > >>  - This should allow us to cover both, a fully async
> > > runtime,
> > > > > >>>> and a
> > > > > >>>> > > simple
> > > > > >>>> > > >> blocking interface for users.
> > > > > >>>> > > >>  - This is actually very similar to how the Kafka
> > > connectors
> > > > > >>>> work.
> > > > > >>>> > Kafka
> > > > > >>>> > > >> 9+ with one thread, Kafka 8 with multiple threads
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - On the base SplitReader (the async one), the
> > > non-blocking
> > > > > >>>> method
> > > > > >>>> > that
> > > > > >>>> > > >> gets the next chunk of data would signal data
> > availability
> > > via
> > > > > a
> > > > > >>>> > > >> CompletableFuture, because that gives the best
> > flexibility
> > > (can
> > > > > >>>> await
> > > > > >>>> > > >> completion or register notification handlers).
> > > > > >>>> > > >>  - The source task would register a "thenHandle()" (or
> > > similar)
> > > > > >>>> on the
> > > > > >>>> > > >> future to put a "take next data" task into the
> > actor-style
> > > > > >>>> mailbox
> > > > > >>>> > > >>
> > > > > >>>> > > >> *(4) Split Enumeration and Assignment*
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - Splits may be generated lazily, both in cases where
> > > there
> > > > > is a
> > > > > >>>> > > limited
> > > > > >>>> > > >> number of splits (but very many), or splits are
> > discovered
> > > over
> > > > > >>>> time
> > > > > >>>> > > >>  - Assignment should also be lazy, to get better load
> > > balancing
> > > > > >>>> > > >>  - Assignment needs support locality preferences
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - Possible design based on discussion so far:
> > > > > >>>> > > >>
> > > > > >>>> > > >>    --> SplitReader has a method "addSplits(SplitT...)"
> to
> > > add
> > > > > >>>> one or
> > > > > >>>> > > more
> > > > > >>>> > > >> splits. Some split readers might assume they have only
> > one
> > > > > split
> > > > > >>>> ever,
> > > > > >>>> > > >> concurrently, others assume multiple splits. (Note:
> idea
> > > behind
> > > > > >>>> being
> > > > > >>>> > > able
> > > > > >>>> > > >> to add multiple splits at the same time is to ease
> > startup
> > > > > where
> > > > > >>>> > > multiple
> > > > > >>>> > > >> splits may be assigned instantly.)
> > > > > >>>> > > >>    --> SplitReader has a context object on which it can
> > > call
> > > > > >>>> indicate
> > > > > >>>> > > when
> > > > > >>>> > > >> splits are completed. The enumerator gets that
> > > notification and
> > > > > >>>> can
> > > > > >>>> > use
> > > > > >>>> > > to
> > > > > >>>> > > >> decide when to assign new splits. This should help both
> > in
> > > > > cases
> > > > > >>>> of
> > > > > >>>> > > sources
> > > > > >>>> > > >> that take splits lazily (file readers) and in case the
> > > source
> > > > > >>>> needs to
> > > > > >>>> > > >> preserve a partial order between splits (Kinesis,
> > Pravega,
> > > > > >>>> Pulsar may
> > > > > >>>> > > need
> > > > > >>>> > > >> that).
> > > > > >>>> > > >>    --> SplitEnumerator gets notification when
> > SplitReaders
> > > > > start
> > > > > >>>> and
> > > > > >>>> > > when
> > > > > >>>> > > >> they finish splits. They can decide at that moment to
> > push
> > > more
> > > > > >>>> splits
> > > > > >>>> > > to
> > > > > >>>> > > >> that reader
> > > > > >>>> > > >>    --> The SplitEnumerator should probably be aware of
> > the
> > > > > source
> > > > > >>>> > > >> parallelism, to build its initial distribution.
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - Open question: Should the source expose something
> like
> > > "host
> > > > > >>>> > > >> preferences", so that yarn/mesos/k8s can take this into
> > > account
> > > > > >>>> when
> > > > > >>>> > > >> selecting a node to start a TM on?
> > > > > >>>> > > >>
> > > > > >>>> > > >> *(5) Watermarks and event time alignment*
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - Watermark generation, as well as idleness, needs to
> be
> > > per
> > > > > >>>> split
> > > > > >>>> > > (like
> > > > > >>>> > > >> currently in the Kafka Source, per partition)
> > > > > >>>> > > >>  - It is desirable to support optional
> > > event-time-alignment,
> > > > > >>>> meaning
> > > > > >>>> > > that
> > > > > >>>> > > >> splits that are ahead are back-pressured or temporarily
> > > > > >>>> unsubscribed
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - I think i would be desirable to encapsulate
> watermark
> > > > > >>>> generation
> > > > > >>>> > > logic
> > > > > >>>> > > >> in watermark generators, for a separation of concerns.
> > The
> > > > > >>>> watermark
> > > > > >>>> > > >> generators should run per split.
> > > > > >>>> > > >>  - Using watermark generators would also help with
> > another
> > > > > >>>> problem of
> > > > > >>>> > > the
> > > > > >>>> > > >> suggested interface, namely supporting non-periodic
> > > watermarks
> > > > > >>>> > > efficiently.
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - Need a way to "dispatch" next record to different
> > > watermark
> > > > > >>>> > > generators
> > > > > >>>> > > >>  - Need a way to tell SplitReader to "suspend" a split
> > > until a
> > > > > >>>> certain
> > > > > >>>> > > >> watermark is reached (event time backpressure)
> > > > > >>>> > > >>  - This would in fact be not needed (and thus simpler)
> if
> > > we
> > > > > had
> > > > > >>>> a
> > > > > >>>> > > >> SplitReader per split and may be a reason to re-open
> that
> > > > > >>>> discussion
> > > > > >>>> > > >>
> > > > > >>>> > > >> *(6) Watermarks across splits and in the Split
> > Enumerator*
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - The split enumerator may need some watermark
> > awareness,
> > > > > which
> > > > > >>>> > should
> > > > > >>>> > > be
> > > > > >>>> > > >> purely based on split metadata (like create timestamp
> of
> > > file
> > > > > >>>> splits)
> > > > > >>>> > > >>  - If there are still more splits with overlapping
> event
> > > time
> > > > > >>>> range
> > > > > >>>> > for
> > > > > >>>> > > a
> > > > > >>>> > > >> split reader, then that split reader should not advance
> > the
> > > > > >>>> watermark
> > > > > >>>> > > >> within the split beyond the overlap boundary. Otherwise
> > > future
> > > > > >>>> splits
> > > > > >>>> > > will
> > > > > >>>> > > >> produce late data.
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - One way to approach this could be that the split
> > > enumerator
> > > > > >>>> may
> > > > > >>>> > send
> > > > > >>>> > > >> watermarks to the readers, and the readers cannot emit
> > > > > watermarks
> > > > > >>>> > beyond
> > > > > >>>> > > >> that received watermark.
> > > > > >>>> > > >>  - Many split enumerators would simply immediately send
> > > > > Long.MAX
> > > > > >>>> out
> > > > > >>>> > and
> > > > > >>>> > > >> leave the progress purely to the split readers.
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - For event-time alignment / split back pressure, this
> > > begs
> > > > > the
> > > > > >>>> > > question
> > > > > >>>> > > >> how we can avoid deadlocks that may arise when splits
> are
> > > > > >>>> suspended
> > > > > >>>> > for
> > > > > >>>> > > >> event time back pressure,
> > > > > >>>> > > >>
> > > > > >>>> > > >> *(7) Batch and streaming Unification*
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - Functionality wise, the above design should support
> > both
> > > > > >>>> > > >>  - Batch often (mostly) does not care about reading "in
> > > order"
> > > > > >>>> and
> > > > > >>>> > > >> generating watermarks
> > > > > >>>> > > >>    --> Might use different enumerator logic that is
> more
> > > > > locality
> > > > > >>>> > aware
> > > > > >>>> > > >> and ignores event time order
> > > > > >>>> > > >>    --> Does not generate watermarks
> > > > > >>>> > > >>  - Would be great if bounded sources could be
> identified
> > at
> > > > > >>>> compile
> > > > > >>>> > > time,
> > > > > >>>> > > >> so that "env.addBoundedSource(...)" is type safe and
> can
> > > > > return a
> > > > > >>>> > > >> "BoundedDataStream".
> > > > > >>>> > > >>  - Possible to defer this discussion until later
> > > > > >>>> > > >>
> > > > > >>>> > > >> *Miscellaneous Comments*
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - Should the source have a TypeInformation for the
> > > produced
> > > > > >>>> type,
> > > > > >>>> > > instead
> > > > > >>>> > > >> of a serializer? We need a type information in the
> stream
> > > > > >>>> anyways, and
> > > > > >>>> > > can
> > > > > >>>> > > >> derive the serializer from that. Plus, creating the
> > > serializer
> > > > > >>>> should
> > > > > >>>> > > >> respect the ExecutionConfig.
> > > > > >>>> > > >>
> > > > > >>>> > > >>  - The TypeSerializer interface is very powerful but
> also
> > > not
> > > > > >>>> easy to
> > > > > >>>> > > >> implement. Its purpose is to handle data super
> > efficiently,
> > > > > >>>> support
> > > > > >>>> > > >> flexible ways of evolution, etc.
> > > > > >>>> > > >>  For metadata I would suggest to look at the
> > > > > >>>> SimpleVersionedSerializer
> > > > > >>>> > > >> instead, which is used for example for checkpoint
> master
> > > hooks,
> > > > > >>>> or for
> > > > > >>>> > > the
> > > > > >>>> > > >> streaming file sink. I think that is is a good match
> for
> > > cases
> > > > > >>>> where
> > > > > >>>> > we
> > > > > >>>> > > do
> > > > > >>>> > > >> not need more than ser/deser (no copy, etc.) and don't
> > > need to
> > > > > >>>> push
> > > > > >>>> > > >> versioning out of the serialization paths for best
> > > performance
> > > > > >>>> (as in
> > > > > >>>> > > the
> > > > > >>>> > > >> TypeSerializer)
> > > > > >>>> > > >>
> > > > > >>>> > > >>
> > > > > >>>> > > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
> > > > > >>>> > > >> [hidden email]>
> > > > > >>>> > > >> wrote:
> > > > > >>>> > > >>
> > > > > >>>> > > >>> Hi Biao,
> > > > > >>>> > > >>>
> > > > > >>>> > > >>> Thanks for the answer!
> > > > > >>>> > > >>>
> > > > > >>>> > > >>> So given the multi-threaded readers, now we have as
> open
> > > > > >>>> questions:
> > > > > >>>> > > >>>
> > > > > >>>> > > >>> 1) How do we let the checkpoints pass through our
> > > > > multi-threaded
> > > > > >>>> > reader
> > > > > >>>> > > >>> operator?
> > > > > >>>> > > >>>
> > > > > >>>> > > >>> 2) Do we have separate reader and source operators or
> > > not? In
> > > > > >>>> the
> > > > > >>>> > > >> strategy
> > > > > >>>> > > >>> that has a separate source, the source operator has a
> > > > > >>>> parallelism of
> > > > > >>>> > 1
> > > > > >>>> > > >> and
> > > > > >>>> > > >>> is responsible for split recovery only.
> > > > > >>>> > > >>>
> > > > > >>>> > > >>> For the first one, given also the constraints
> (blocking,
> > > > > finite
> > > > > >>>> > queues,
> > > > > >>>> > > >>> etc), I do not have an answer yet.
> > > > > >>>> > > >>>
> > > > > >>>> > > >>> For the 2nd, I think that we should go with separate
> > > operators
> > > > > >>>> for
> > > > > >>>> > the
> > > > > >>>> > > >>> source and the readers, for the following reasons:
> > > > > >>>> > > >>>
> > > > > >>>> > > >>> 1) This is more aligned with a potential future
> > > improvement
> > > > > >>>> where the
> > > > > >>>> > > >> split
> > > > > >>>> > > >>> discovery becomes a responsibility of the JobManager
> and
> > > > > >>>> readers are
> > > > > >>>> > > >>> pooling more work from the JM.
> > > > > >>>> > > >>>
> > > > > >>>> > > >>> 2) The source is going to be the "single point of
> > truth".
> > > It
> > > > > >>>> will
> > > > > >>>> > know
> > > > > >>>> > > >> what
> > > > > >>>> > > >>> has been processed and what not. If the source and the
> > > readers
> > > > > >>>> are a
> > > > > >>>> > > >> single
> > > > > >>>> > > >>> operator with parallelism > 1, or in general, if the
> > split
> > > > > >>>> discovery
> > > > > >>>> > is
> > > > > >>>> > > >>> done by each task individually, then:
> > > > > >>>> > > >>>   i) we have to have a deterministic scheme for each
> > > reader to
> > > > > >>>> assign
> > > > > >>>> > > >>> splits to itself (e.g. mod subtaskId). This is not
> > > necessarily
> > > > > >>>> > trivial
> > > > > >>>> > > >> for
> > > > > >>>> > > >>> all sources.
> > > > > >>>> > > >>>   ii) each reader would have to keep a copy of all its
> > > > > processed
> > > > > >>>> > slpits
> > > > > >>>> > > >>>   iii) the state has to be a union state with a
> > > non-trivial
> > > > > >>>> merging
> > > > > >>>> > > >> logic
> > > > > >>>> > > >>> in order to support rescaling.
> > > > > >>>> > > >>>
> > > > > >>>> > > >>> Two additional points that you raised above:
> > > > > >>>> > > >>>
> > > > > >>>> > > >>> i) The point that you raised that we need to keep all
> > > splits
> > > > > >>>> > (processed
> > > > > >>>> > > >> and
> > > > > >>>> > > >>> not-processed) I think is a bit of a strong
> requirement.
> > > This
> > > > > >>>> would
> > > > > >>>> > > imply
> > > > > >>>> > > >>> that for infinite sources the state will grow
> > > indefinitely.
> > > > > >>>> This is
> > > > > >>>> > > >> problem
> > > > > >>>> > > >>> is even more pronounced if we do not have a single
> > source
> > > that
> > > > > >>>> > assigns
> > > > > >>>> > > >>> splits to readers, as each reader will have its own
> copy
> > > of
> > > > > the
> > > > > >>>> > state.
> > > > > >>>> > > >>>
> > > > > >>>> > > >>> ii) it is true that for finite sources we need to
> > somehow
> > > not
> > > > > >>>> close
> > > > > >>>> > the
> > > > > >>>> > > >>> readers when the source/split discoverer finishes. The
> > > > > >>>> > > >>> ContinuousFileReaderOperator has a work-around for
> that.
> > > It is
> > > > > >>>> not
> > > > > >>>> > > >> elegant,
> > > > > >>>> > > >>> and checkpoints are not emitted after closing the
> > source,
> > > but
> > > > > >>>> this, I
> > > > > >>>> > > >>> believe, is a bigger problem which requires more
> changes
> > > than
> > > > > >>>> just
> > > > > >>>> > > >>> refactoring the source interface.
> > > > > >>>> > > >>>
> > > > > >>>> > > >>> Cheers,
> > > > > >>>> > > >>> Kostas
> > > > > >>>> > > >>>
> > > > > >>>> > > >>
> > > > > >>>> > >
> > > > > >>>> > >
> > > > > >>>> >
> > > > > >>>>
> > > > > >>>
> > > > >
> > > >
> > >
> >
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Jiayi Liao
In reply to this post by Aljoscha Krettek-2
Hi Becket,


Thanks for updating the progress!


I have a question about the #OperatorCoordinator. Will there be any communication between different #OperatorCoordinators (or in the future plan)? Because in that way it may be able to cover some cases in FLIP-27[1] like initializing static data before main input processing. Of course it requires more thinking, just want to speak up some ideas in my mind.


+1 to the FLIP and detailed design!




[1]. https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API


Best,
Jiayi Liao


 Original Message
Sender: Stephan Ewen<[hidden email]>
Recipient: dev<[hidden email]>
Date: Wednesday, Dec 4, 2019 18:25
Subject: Re: [DISCUSS] FLIP-27: Refactor Source Interface


Thanks, Becket, for updating this. I agree with moving the aspects you mentioned into separate FLIPs - this one way becoming unwieldy in size. +1 to the FLIP in its current state. Its a very detailed write-up, nicely done! On Wed, Dec 4, 2019 at 7:38 AM Becket Qin <[hidden email]> wrote: > Hi all, > > Sorry for the long belated update. I have updated FLIP-27 wiki page with > the latest proposals. Some noticeable changes include: > 1. A new generic communication mechanism between SplitEnumerator and > SourceReader. > 2. Some detail API method signature changes. > > We left a few things out of this FLIP and will address them in separate > FLIPs. Including: > 1. Per split event time. > 2. Event time alignment. > 3. Fine grained failover for SplitEnumerator failure. > > Please let us know if you have any question. > > Thanks, > > Jiangjie (Becket) Qin > > On Sat, Nov 16, 2019 at 6:10 AM Stephan Ewen <[hidden email]> wrote: > > > Hi Łukasz! > > > > Becket and me are working hard on figuring out the last details and > > implementing the first PoC. We would update the FLIP hopefully next week. > > > > There is a fair chance that a first version of this will be in 1.10, but > I > > think it will take another release to battle test it and migrate the > > connectors. > > > > Best, > > Stephan > > > > > > > > > > On Fri, Nov 15, 2019 at 11:14 AM Łukasz Jędrzejewski <[hidden email]> > wrote: > > > > > Hi, > > > > > > This proposal looks very promising for us. Do you have any plans in > which > > > Flink release it is going to be released? We are thinking on using a > Data > > > Set API for our future use cases but on the other hand Data Set API is > > > going to be deprecated so using proposed bounded data streams solution > > > could be more viable in the long term. > > > > > > Thanks, > > > Łukasz > > > > > > On 2019/10/01 15:48:03, Thomas Weise <[hidden email]> wrote: > > > > Thanks for putting together this proposal! > > > > > > > > I see that the "Per Split Event Time" and "Event Time Alignment" > > sections > > > > are still TBD. > > > > > > > > It would probably be good to flesh those out a bit before proceeding > > too > > > far > > > > as the event time alignment will probably influence the interaction > > with > > > > the split reader, specifically ReaderStatus emitNext(SourceOutput<E> > > > > output). > > > > > > > > We currently have only one implementation for event time alignment in > > the > > > > Kinesis consumer. The synchronization in that case takes place as the > > > last > > > > step before records are emitted downstream (RecordEmitter). With the > > > > currently proposed interfaces, the equivalent can be implemented in > the > > > > reader loop, although note that in the Kinesis consumer the per shard > > > > threads push records. > > > > > > > > Synchronization has not been implemented for the Kafka consumer yet. > > > > > > > > https://issues.apache.org/jira/browse/FLINK-12675 > > > > > > > > When I looked at it, I realized that the implementation will look > quite > > > > different > > > > from Kinesis because it needs to take place in the pull part, where > > > records > > > > are taken from the Kafka client. Due to the multiplexing it cannot be > > > done > > > > by blocking the split thread like it currently works for Kinesis. > > Reading > > > > from individual Kafka partitions needs to be controlled via > > pause/resume > > > > on the Kafka client. > > > > > > > > To take on that responsibility the split thread would need to be > aware > > of > > > > the > > > > watermarks or at least whether it should or should not continue to > > > consume > > > > a given split and this may require a different SourceReader or > > > SourceOutput > > > > interface. > > > > > > > > Thanks, > > > > Thomas > > > > > > > > > > > > On Fri, Jul 26, 2019 at 1:39 AM Biao Liu <[hidden email]> wrote: > > > > > > > > > Hi Stephan, > > > > > > > > > > Thank you for feedback! > > > > > Will take a look at your branch before public discussing. > > > > > > > > > > > > > > > On Fri, Jul 26, 2019 at 12:01 AM Stephan Ewen <[hidden email]> > > > wrote: > > > > > > > > > > > Hi Biao! > > > > > > > > > > > > Thanks for reviving this. I would like to join this discussion, > but > > > am > > > > > > quite occupied with the 1.9 release, so can we maybe pause this > > > > > discussion > > > > > > for a week or so? > > > > > > > > > > > > In the meantime I can share some suggestion based on prior > > > experiments: > > > > > > > > > > > > How to do watermarks / timestamp extractors in a simpler and more > > > > > flexible > > > > > > way. I think that part is quite promising should be part of the > new > > > > > source > > > > > > interface. > > > > > > > > > > > > > > > > > > > > > > > https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/eventtime > > > > > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/StephanEwen/flink/blob/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src/SourceOutput.java > > > > > > > > > > > > > > > > > > > > > > > > Some experiments on how to build the source reader and its > library > > > for > > > > > > common threading/split patterns: > > > > > > > > > > > > > > > > > > > > > > > https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src > > > > > > > > > > > > > > > > > > Best, > > > > > > Stephan > > > > > > > > > > > > > > > > > > On Thu, Jul 25, 2019 at 10:03 AM Biao Liu <[hidden email]> > > > wrote: > > > > > > > > > > > >> Hi devs, > > > > > >> > > > > > >> Since 1.9 is nearly released, I think we could get back to > > FLIP-27. > > > I > > > > > >> believe it should be included in 1.10. > > > > > >> > > > > > >> There are so many things mentioned in document of FLIP-27. [1] I > > > think > > > > > >> we'd better discuss them separately. However the wiki is not a > > good > > > > > place > > > > > >> to discuss. I wrote google doc about SplitReader API which > misses > > > some > > > > > >> details in the document. [2] > > > > > >> > > > > > >> 1. > > > > > >> > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27:+Refactor+Source+Interface > > > > > >> 2. > > > > > >> > > > > > > > > > > > https://docs.google.com/document/d/1R1s_89T4S3CZwq7Tf31DciaMCqZwrLHGZFqPASu66oE/edit?usp=sharing > > > > > >> > > > > > >> CC Stephan, Aljoscha, Piotrek, Becket > > > > > >> > > > > > >> > > > > > >> On Thu, Mar 28, 2019 at 4:38 PM Biao Liu <[hidden email]> > > > wrote: > > > > > >> > > > > > >>> Hi Steven, > > > > > >>> Thank you for the feedback. Please take a look at the document > > > FLIP-27 > > > > > >>> < > > > > > > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface > > > > > > > > > which > > > > > >>> is updated recently. A lot of details of enumerator were added > in > > > this > > > > > >>> document. I think it would help. > > > > > >>> > > > > > >>> Steven Wu <[hidden email]> 于2019年3月28日周四 下午12:52写道: > > > > > >>> > > > > > >>>> This proposal mentioned that SplitEnumerator might run on the > > > > > >>>> JobManager or > > > > > >>>> in a single task on a TaskManager. > > > > > >>>> > > > > > >>>> if enumerator is a single task on a taskmanager, then the job > > DAG > > > can > > > > > >>>> never > > > > > >>>> been embarrassingly parallel anymore. That will nullify the > > > leverage > > > > > of > > > > > >>>> fine-grained recovery for embarrassingly parallel jobs. > > > > > >>>> > > > > > >>>> It's not clear to me what's the implication of running > > enumerator > > > on > > > > > the > > > > > >>>> jobmanager. So I will leave that out for now. > > > > > >>>> > > > > > >>>> On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]> > > > wrote: > > > > > >>>> > > > > > >>>> > Hi Stephan & Piotrek, > > > > > >>>> > > > > > > >>>> > Thank you for feedback. > > > > > >>>> > > > > > > >>>> > It seems that there are a lot of things to do in community. > I > > am > > > > > just > > > > > >>>> > afraid that this discussion may be forgotten since there so > > many > > > > > >>>> proposals > > > > > >>>> > recently. > > > > > >>>> > Anyway, wish to see the split topics soon :) > > > > > >>>> > > > > > > >>>> > Piotr Nowojski <[hidden email]> 于2019年1月24日周四 > > 下午8:21写道: > > > > > >>>> > > > > > > >>>> > > Hi Biao! > > > > > >>>> > > > > > > > >>>> > > This discussion was stalled because of preparations for > the > > > open > > > > > >>>> sourcing > > > > > >>>> > > & merging Blink. I think before creating the tickets we > > should > > > > > >>>> split this > > > > > >>>> > > discussion into topics/areas outlined by Stephan and > create > > > Flips > > > > > >>>> for > > > > > >>>> > that. > > > > > >>>> > > > > > > > >>>> > > I think there is no chance for this to be completed in > > couple > > > of > > > > > >>>> > remaining > > > > > >>>> > > weeks/1 month before 1.8 feature freeze, however it would > be > > > good > > > > > >>>> to aim > > > > > >>>> > > with those changes for 1.9. > > > > > >>>> > > > > > > > >>>> > > Piotrek > > > > > >>>> > > > > > > > >>>> > > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]> > > > wrote: > > > > > >>>> > > > > > > > > >>>> > > > Hi community, > > > > > >>>> > > > The summary of Stephan makes a lot sense to me. It is > much > > > > > clearer > > > > > >>>> > indeed > > > > > >>>> > > > after splitting the complex topic into small ones. > > > > > >>>> > > > I was wondering is there any detail plan for next step? > If > > > not, > > > > > I > > > > > >>>> would > > > > > >>>> > > > like to push this thing forward by creating some JIRA > > > issues. > > > > > >>>> > > > Another question is that should version 1.8 include > these > > > > > >>>> features? > > > > > >>>> > > > > > > > > >>>> > > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道: > > > > > >>>> > > > > > > > > >>>> > > >> Thanks everyone for the lively discussion. Let me try > to > > > > > >>>> summarize > > > > > >>>> > > where I > > > > > >>>> > > >> see convergence in the discussion and open issues. > > > > > >>>> > > >> I'll try to group this by design aspect of the source. > > > Please > > > > > >>>> let me > > > > > >>>> > > know > > > > > >>>> > > >> if I got things wrong or missed something crucial here. > > > > > >>>> > > >> > > > > > >>>> > > >> For issues 1-3, if the below reflects the state of the > > > > > >>>> discussion, I > > > > > >>>> > > would > > > > > >>>> > > >> try and update the FLIP in the next days. > > > > > >>>> > > >> For the remaining ones we need more discussion. > > > > > >>>> > > >> > > > > > >>>> > > >> I would suggest to fork each of these aspects into a > > > separate > > > > > >>>> mail > > > > > >>>> > > thread, > > > > > >>>> > > >> or will loose sight of the individual aspects. > > > > > >>>> > > >> > > > > > >>>> > > >> *(1) Separation of Split Enumerator and Split Reader* > > > > > >>>> > > >> > > > > > >>>> > > >> - All seem to agree this is a good thing > > > > > >>>> > > >> - Split Enumerator could in the end live on JobManager > > > (and > > > > > >>>> assign > > > > > >>>> > > splits > > > > > >>>> > > >> via RPC) or in a task (and assign splits via data > > streams) > > > > > >>>> > > >> - this discussion is orthogonal and should come later, > > > when > > > > > the > > > > > >>>> > > interface > > > > > >>>> > > >> is agreed upon. > > > > > >>>> > > >> > > > > > >>>> > > >> *(2) Split Readers for one or more splits* > > > > > >>>> > > >> > > > > > >>>> > > >> - Discussion seems to agree that we need to support > one > > > reader > > > > > >>>> that > > > > > >>>> > > >> possibly handles multiple splits concurrently. > > > > > >>>> > > >> - The requirement comes from sources where one > > > poll()-style > > > > > call > > > > > >>>> > > fetches > > > > > >>>> > > >> data from different splits / partitions > > > > > >>>> > > >> --> example sources that require that would be for > > > example > > > > > >>>> Kafka, > > > > > >>>> > > >> Pravega, Pulsar > > > > > >>>> > > >> > > > > > >>>> > > >> - Could have one split reader per source, or multiple > > > split > > > > > >>>> readers > > > > > >>>> > > that > > > > > >>>> > > >> share the "poll()" function > > > > > >>>> > > >> - To not make it too complicated, we can start with > > > thinking > > > > > >>>> about > > > > > >>>> > one > > > > > >>>> > > >> split reader for all splits initially and see if that > > > covers > > > > > all > > > > > >>>> > > >> requirements > > > > > >>>> > > >> > > > > > >>>> > > >> *(3) Threading model of the Split Reader* > > > > > >>>> > > >> > > > > > >>>> > > >> - Most active part of the discussion ;-) > > > > > >>>> > > >> > > > > > >>>> > > >> - A non-blocking way for Flink's task code to interact > > > with > > > > > the > > > > > >>>> > source > > > > > >>>> > > is > > > > > >>>> > > >> needed in order to a task runtime code based on a > > > > > >>>> > > >> single-threaded/actor-style task design > > > > > >>>> > > >> --> I personally am a big proponent of that, it will > > > help > > > > > with > > > > > >>>> > > >> well-behaved checkpoints, efficiency, and simpler yet > > more > > > > > robust > > > > > >>>> > > runtime > > > > > >>>> > > >> code > > > > > >>>> > > >> > > > > > >>>> > > >> - Users care about simple abstraction, so as a > subclass > > of > > > > > >>>> > SplitReader > > > > > >>>> > > >> (non-blocking / async) we need to have a > > > BlockingSplitReader > > > > > >>>> which > > > > > >>>> > will > > > > > >>>> > > >> form the basis of most source implementations. > > > > > >>>> BlockingSplitReader > > > > > >>>> > lets > > > > > >>>> > > >> users do blocking simple poll() calls. > > > > > >>>> > > >> - The BlockingSplitReader would spawn a thread (or > more) > > > and > > > > > the > > > > > >>>> > > >> thread(s) can make blocking calls and hand over data > > > buffers > > > > > via > > > > > >>>> a > > > > > >>>> > > blocking > > > > > >>>> > > >> queue > > > > > >>>> > > >> - This should allow us to cover both, a fully async > > > runtime, > > > > > >>>> and a > > > > > >>>> > > simple > > > > > >>>> > > >> blocking interface for users. > > > > > >>>> > > >> - This is actually very similar to how the Kafka > > > connectors > > > > > >>>> work. > > > > > >>>> > Kafka > > > > > >>>> > > >> 9+ with one thread, Kafka 8 with multiple threads > > > > > >>>> > > >> > > > > > >>>> > > >> - On the base SplitReader (the async one), the > > > non-blocking > > > > > >>>> method > > > > > >>>> > that > > > > > >>>> > > >> gets the next chunk of data would signal data > > availability > > > via > > > > > a > > > > > >>>> > > >> CompletableFuture, because that gives the best > > flexibility > > > (can > > > > > >>>> await > > > > > >>>> > > >> completion or register notification handlers). > > > > > >>>> > > >> - The source task would register a "thenHandle()" (or > > > similar) > > > > > >>>> on the > > > > > >>>> > > >> future to put a "take next data" task into the > > actor-style > > > > > >>>> mailbox > > > > > >>>> > > >> > > > > > >>>> > > >> *(4) Split Enumeration and Assignment* > > > > > >>>> > > >> > > > > > >>>> > > >> - Splits may be generated lazily, both in cases where > > > there > > > > > is a > > > > > >>>> > > limited > > > > > >>>> > > >> number of splits (but very many), or splits are > > discovered > > > over > > > > > >>>> time > > > > > >>>> > > >> - Assignment should also be lazy, to get better load > > > balancing > > > > > >>>> > > >> - Assignment needs support locality preferences > > > > > >>>> > > >> > > > > > >>>> > > >> - Possible design based on discussion so far: > > > > > >>>> > > >> > > > > > >>>> > > >> --> SplitReader has a method "addSplits(SplitT...)" > to > > > add > > > > > >>>> one or > > > > > >>>> > > more > > > > > >>>> > > >> splits. Some split readers might assume they have only > > one > > > > > split > > > > > >>>> ever, > > > > > >>>> > > >> concurrently, others assume multiple splits. (Note: > idea > > > behind > > > > > >>>> being > > > > > >>>> > > able > > > > > >>>> > > >> to add multiple splits at the same time is to ease > > startup > > > > > where > > > > > >>>> > > multiple > > > > > >>>> > > >> splits may be assigned instantly.) > > > > > >>>> > > >> --> SplitReader has a context object on which it can > > > call > > > > > >>>> indicate > > > > > >>>> > > when > > > > > >>>> > > >> splits are completed. The enumerator gets that > > > notification and > > > > > >>>> can > > > > > >>>> > use > > > > > >>>> > > to > > > > > >>>> > > >> decide when to assign new splits. This should help both > > in > > > > > cases > > > > > >>>> of > > > > > >>>> > > sources > > > > > >>>> > > >> that take splits lazily (file readers) and in case the > > > source > > > > > >>>> needs to > > > > > >>>> > > >> preserve a partial order between splits (Kinesis, > > Pravega, > > > > > >>>> Pulsar may > > > > > >>>> > > need > > > > > >>>> > > >> that). > > > > > >>>> > > >> --> SplitEnumerator gets notification when > > SplitReaders > > > > > start > > > > > >>>> and > > > > > >>>> > > when > > > > > >>>> > > >> they finish splits. They can decide at that moment to > > push > > > more > > > > > >>>> splits > > > > > >>>> > > to > > > > > >>>> > > >> that reader > > > > > >>>> > > >> --> The SplitEnumerator should probably be aware of > > the > > > > > source > > > > > >>>> > > >> parallelism, to build its initial distribution. > > > > > >>>> > > >> > > > > > >>>> > > >> - Open question: Should the source expose something > like > > > "host > > > > > >>>> > > >> preferences", so that yarn/mesos/k8s can take this into > > > account > > > > > >>>> when > > > > > >>>> > > >> selecting a node to start a TM on? > > > > > >>>> > > >> > > > > > >>>> > > >> *(5) Watermarks and event time alignment* > > > > > >>>> > > >> > > > > > >>>> > > >> - Watermark generation, as well as idleness, needs to > be > > > per > > > > > >>>> split > > > > > >>>> > > (like > > > > > >>>> > > >> currently in the Kafka Source, per partition) > > > > > >>>> > > >> - It is desirable to support optional > > > event-time-alignment, > > > > > >>>> meaning > > > > > >>>> > > that > > > > > >>>> > > >> splits that are ahead are back-pressured or temporarily > > > > > >>>> unsubscribed > > > > > >>>> > > >> > > > > > >>>> > > >> - I think i would be desirable to encapsulate > watermark > > > > > >>>> generation > > > > > >>>> > > logic > > > > > >>>> > > >> in watermark generators, for a separation of concerns. > > The > > > > > >>>> watermark > > > > > >>>> > > >> generators should run per split. > > > > > >>>> > > >> - Using watermark generators would also help with > > another > > > > > >>>> problem of > > > > > >>>> > > the > > > > > >>>> > > >> suggested interface, namely supporting non-periodic > > > watermarks > > > > > >>>> > > efficiently. > > > > > >>>> > > >> > > > > > >>>> > > >> - Need a way to "dispatch" next record to different > > > watermark > > > > > >>>> > > generators > > > > > >>>> > > >> - Need a way to tell SplitReader to "suspend" a split > > > until a > > > > > >>>> certain > > > > > >>>> > > >> watermark is reached (event time backpressure) > > > > > >>>> > > >> - This would in fact be not needed (and thus simpler) > if > > > we > > > > > had > > > > > >>>> a > > > > > >>>> > > >> SplitReader per split and may be a reason to re-open > that > > > > > >>>> discussion > > > > > >>>> > > >> > > > > > >>>> > > >> *(6) Watermarks across splits and in the Split > > Enumerator* > > > > > >>>> > > >> > > > > > >>>> > > >> - The split enumerator may need some watermark > > awareness, > > > > > which > > > > > >>>> > should > > > > > >>>> > > be > > > > > >>>> > > >> purely based on split metadata (like create timestamp > of > > > file > > > > > >>>> splits) > > > > > >>>> > > >> - If there are still more splits with overlapping > event > > > time > > > > > >>>> range > > > > > >>>> > for > > > > > >>>> > > a > > > > > >>>> > > >> split reader, then that split reader should not advance > > the > > > > > >>>> watermark > > > > > >>>> > > >> within the split beyond the overlap boundary. Otherwise > > > future > > > > > >>>> splits > > > > > >>>> > > will > > > > > >>>> > > >> produce late data. > > > > > >>>> > > >> > > > > > >>>> > > >> - One way to approach this could be that the split > > > enumerator > > > > > >>>> may > > > > > >>>> > send > > > > > >>>> > > >> watermarks to the readers, and the readers cannot emit > > > > > watermarks > > > > > >>>> > beyond > > > > > >>>> > > >> that received watermark. > > > > > >>>> > > >> - Many split enumerators would simply immediately send > > > > > Long.MAX > > > > > >>>> out > > > > > >>>> > and > > > > > >>>> > > >> leave the progress purely to the split readers. > > > > > >>>> > > >> > > > > > >>>> > > >> - For event-time alignment / split back pressure, this > > > begs > > > > > the > > > > > >>>> > > question > > > > > >>>> > > >> how we can avoid deadlocks that may arise when splits > are > > > > > >>>> suspended > > > > > >>>> > for > > > > > >>>> > > >> event time back pressure, > > > > > >>>> > > >> > > > > > >>>> > > >> *(7) Batch and streaming Unification* > > > > > >>>> > > >> > > > > > >>>> > > >> - Functionality wise, the above design should support > > both > > > > > >>>> > > >> - Batch often (mostly) does not care about reading "in > > > order" > > > > > >>>> and > > > > > >>>> > > >> generating watermarks > > > > > >>>> > > >> --> Might use different enumerator logic that is > more > > > > > locality > > > > > >>>> > aware > > > > > >>>> > > >> and ignores event time order > > > > > >>>> > > >> --> Does not generate watermarks > > > > > >>>> > > >> - Would be great if bounded sources could be > identified > > at > > > > > >>>> compile > > > > > >>>> > > time, > > > > > >>>> > > >> so that "env.addBoundedSource(...)" is type safe and > can > > > > > return a > > > > > >>>> > > >> "BoundedDataStream". > > > > > >>>> > > >> - Possible to defer this discussion until later > > > > > >>>> > > >> > > > > > >>>> > > >> *Miscellaneous Comments* > > > > > >>>> > > >> > > > > > >>>> > > >> - Should the source have a TypeInformation for the > > > produced > > > > > >>>> type, > > > > > >>>> > > instead > > > > > >>>> > > >> of a serializer? We need a type information in the > stream > > > > > >>>> anyways, and > > > > > >>>> > > can > > > > > >>>> > > >> derive the serializer from that. Plus, creating the > > > serializer > > > > > >>>> should > > > > > >>>> > > >> respect the ExecutionConfig. > > > > > >>>> > > >> > > > > > >>>> > > >> - The TypeSerializer interface is very powerful but > also > > > not > > > > > >>>> easy to > > > > > >>>> > > >> implement. Its purpose is to handle data super > > efficiently, > > > > > >>>> support > > > > > >>>> > > >> flexible ways of evolution, etc. > > > > > >>>> > > >> For metadata I would suggest to look at the > > > > > >>>> SimpleVersionedSerializer > > > > > >>>> > > >> instead, which is used for example for checkpoint > master > > > hooks, > > > > > >>>> or for > > > > > >>>> > > the > > > > > >>>> > > >> streaming file sink. I think that is is a good match > for > > > cases > > > > > >>>> where > > > > > >>>> > we > > > > > >>>> > > do > > > > > >>>> > > >> not need more than ser/deser (no copy, etc.) and don't > > > need to > > > > > >>>> push > > > > > >>>> > > >> versioning out of the serialization paths for best > > > performance > > > > > >>>> (as in > > > > > >>>> > > the > > > > > >>>> > > >> TypeSerializer) > > > > > >>>> > > >> > > > > > >>>> > > >> > > > > > >>>> > > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas < > > > > > >>>> > > >> [hidden email]> > > > > > >>>> > > >> wrote: > > > > > >>>> > > >> > > > > > >>>> > > >>> Hi Biao, > > > > > >>>> > > >>> > > > > > >>>> > > >>> Thanks for the answer! > > > > > >>>> > > >>> > > > > > >>>> > > >>> So given the multi-threaded readers, now we have as > open > > > > > >>>> questions: > > > > > >>>> > > >>> > > > > > >>>> > > >>> 1) How do we let the checkpoints pass through our > > > > > multi-threaded > > > > > >>>> > reader > > > > > >>>> > > >>> operator? > > > > > >>>> > > >>> > > > > > >>>> > > >>> 2) Do we have separate reader and source operators or > > > not? In > > > > > >>>> the > > > > > >>>> > > >> strategy > > > > > >>>> > > >>> that has a separate source, the source operator has a > > > > > >>>> parallelism of > > > > > >>>> > 1 > > > > > >>>> > > >> and > > > > > >>>> > > >>> is responsible for split recovery only. > > > > > >>>> > > >>> > > > > > >>>> > > >>> For the first one, given also the constraints > (blocking, > > > > > finite > > > > > >>>> > queues, > > > > > >>>> > > >>> etc), I do not have an answer yet. > > > > > >>>> > > >>> > > > > > >>>> > > >>> For the 2nd, I think that we should go with separate > > > operators > > > > > >>>> for > > > > > >>>> > the > > > > > >>>> > > >>> source and the readers, for the following reasons: > > > > > >>>> > > >>> > > > > > >>>> > > >>> 1) This is more aligned with a potential future > > > improvement > > > > > >>>> where the > > > > > >>>> > > >> split > > > > > >>>> > > >>> discovery becomes a responsibility of the JobManager > and > > > > > >>>> readers are > > > > > >>>> > > >>> pooling more work from the JM. > > > > > >>>> > > >>> > > > > > >>>> > > >>> 2) The source is going to be the "single point of > > truth". > > > It > > > > > >>>> will > > > > > >>>> > know > > > > > >>>> > > >> what > > > > > >>>> > > >>> has been processed and what not. If the source and the > > > readers > > > > > >>>> are a > > > > > >>>> > > >> single > > > > > >>>> > > >>> operator with parallelism > 1, or in general, if the > > split > > > > > >>>> discovery > > > > > >>>> > is > > > > > >>>> > > >>> done by each task individually, then: > > > > > >>>> > > >>> i) we have to have a deterministic scheme for each > > > reader to > > > > > >>>> assign > > > > > >>>> > > >>> splits to itself (e.g. mod subtaskId). This is not > > > necessarily > > > > > >>>> > trivial > > > > > >>>> > > >> for > > > > > >>>> > > >>> all sources. > > > > > >>>> > > >>> ii) each reader would have to keep a copy of all its > > > > > processed > > > > > >>>> > slpits > > > > > >>>> > > >>> iii) the state has to be a union state with a > > > non-trivial > > > > > >>>> merging > > > > > >>>> > > >> logic > > > > > >>>> > > >>> in order to support rescaling. > > > > > >>>> > > >>> > > > > > >>>> > > >>> Two additional points that you raised above: > > > > > >>>> > > >>> > > > > > >>>> > > >>> i) The point that you raised that we need to keep all > > > splits > > > > > >>>> > (processed > > > > > >>>> > > >> and > > > > > >>>> > > >>> not-processed) I think is a bit of a strong > requirement. > > > This > > > > > >>>> would > > > > > >>>> > > imply > > > > > >>>> > > >>> that for infinite sources the state will grow > > > indefinitely. > > > > > >>>> This is > > > > > >>>> > > >> problem > > > > > >>>> > > >>> is even more pronounced if we do not have a single > > source > > > that > > > > > >>>> > assigns > > > > > >>>> > > >>> splits to readers, as each reader will have its own > copy > > > of > > > > > the > > > > > >>>> > state. > > > > > >>>> > > >>> > > > > > >>>> > > >>> ii) it is true that for finite sources we need to > > somehow > > > not > > > > > >>>> close > > > > > >>>> > the > > > > > >>>> > > >>> readers when the source/split discoverer finishes. The > > > > > >>>> > > >>> ContinuousFileReaderOperator has a work-around for > that. > > > It is > > > > > >>>> not > > > > > >>>> > > >> elegant, > > > > > >>>> > > >>> and checkpoints are not emitted after closing the > > source, > > > but > > > > > >>>> this, I > > > > > >>>> > > >>> believe, is a bigger problem which requires more > changes > > > than > > > > > >>>> just > > > > > >>>> > > >>> refactoring the source interface. > > > > > >>>> > > >>> > > > > > >>>> > > >>> Cheers, > > > > > >>>> > > >>> Kostas > > > > > >>>> > > >>> > > > > > >>>> > > >> > > > > > >>>> > > > > > > > >>>> > > > > > > > >>>> > > > > > > >>>> > > > > > >>> > > > > > > > > > > > > > > >
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Becket Qin
Hi Jiayi,

For now there is no communication between the coordinators. And I do see
some use cases if we can open up that channel. But it won't be in this FLIP.

Thanks,

Jiangjie (Becket) Qin

On Wed, Dec 4, 2019 at 6:53 PM bupt_ljy <[hidden email]> wrote:

> Hi Becket,
>
>
> Thanks for updating the progress!
>
>
> I have a question about the #OperatorCoordinator. Will there be any
> communication between different #OperatorCoordinators (or in the future
> plan)? Because in that way it may be able to cover some cases in FLIP-27[1]
> like initializing static data before main input processing. Of course it
> requires more thinking, just want to speak up some ideas in my mind.
>
>
> +1 to the FLIP and detailed design!
>
>
>
> [1].
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
>
>
> Best,
>
> Jiayi Liao
>
>  Original Message
> *Sender:* Stephan Ewen<[hidden email]>
> *Recipient:* dev<[hidden email]>
> *Date:* Wednesday, Dec 4, 2019 18:25
> *Subject:* Re: [DISCUSS] FLIP-27: Refactor Source Interface
>
> Thanks, Becket, for updating this.
>
> I agree with moving the aspects you mentioned into separate FLIPs - this
> one way becoming unwieldy in size.
>
> +1 to the FLIP in its current state. Its a very detailed write-up, nicely
> done!
>
> On Wed, Dec 4, 2019 at 7:38 AM Becket Qin <[hidden email]> wrote:
>
> > Hi all,
> >
> > Sorry for the long belated update. I have updated FLIP-27 wiki page with
> > the latest proposals. Some noticeable changes include:
> > 1. A new generic communication mechanism between SplitEnumerator and
> > SourceReader.
> > 2. Some detail API method signature changes.
> >
> > We left a few things out of this FLIP and will address them in separate
> > FLIPs. Including:
> > 1. Per split event time.
> > 2. Event time alignment.
> > 3. Fine grained failover for SplitEnumerator failure.
> >
> > Please let us know if you have any question.
> >
> > Thanks,
> >
> > Jiangjie (Becket) Qin
> >
> > On Sat, Nov 16, 2019 at 6:10 AM Stephan Ewen <[hidden email]> wrote:
> >
> > > Hi  Łukasz!
> > >
> > > Becket and me are working hard on figuring out the last details and
> > > implementing the first PoC. We would update the FLIP hopefully next week.
> > >
> > > There is a fair chance that a first version of this will be in 1.10, but
> > I
> > > think it will take another release to battle test it and migrate the
> > > connectors.
> > >
> > > Best,
> > > Stephan
> > >
> > >
> > >
> > >
> > > On Fri, Nov 15, 2019 at 11:14 AM Łukasz Jędrzejewski <[hidden email]>
> > wrote:
> > >
> > > > Hi,
> > > >
> > > > This proposal looks very promising for us. Do you have any plans in
> > which
> > > > Flink release it is going to be released? We are thinking on using a
> > Data
> > > > Set API for our future use cases but on the other hand Data Set API is
> > > > going to be deprecated so using proposed bounded data streams solution
> > > > could be more viable in the long term.
> > > >
> > > > Thanks,
> > > > Łukasz
> > > >
> > > > On 2019/10/01 15:48:03, Thomas Weise <[hidden email]> wrote:
> > > > > Thanks for putting together this proposal!
> > > > >
> > > > > I see that the "Per Split Event Time" and "Event Time Alignment"
> > > sections
> > > > > are still TBD.
> > > > >
> > > > > It would probably be good to flesh those out a bit before proceeding
> > > too
> > > > far
> > > > > as the event time alignment will probably influence the interaction
> > > with
> > > > > the split reader, specifically ReaderStatus emitNext(SourceOutput<E>
> > > > > output).
> > > > >
> > > > > We currently have only one implementation for event time alignment in
> > > the
> > > > > Kinesis consumer. The synchronization in that case takes place as the
> > > > last
> > > > > step before records are emitted downstream (RecordEmitter). With the
> > > > > currently proposed interfaces, the equivalent can be implemented in
> > the
> > > > > reader loop, although note that in the Kinesis consumer the per shard
> > > > > threads push records.
> > > > >
> > > > > Synchronization has not been implemented for the Kafka consumer yet.
> > > > >
> > > > > https://issues.apache.org/jira/browse/FLINK-12675
> > > > >
> > > > > When I looked at it, I realized that the implementation will look
> > quite
> > > > > different
> > > > > from Kinesis because it needs to take place in the pull part, where
> > > > records
> > > > > are taken from the Kafka client. Due to the multiplexing it cannot be
> > > > done
> > > > > by blocking the split thread like it currently works for Kinesis.
> > > Reading
> > > > > from individual Kafka partitions needs to be controlled via
> > > pause/resume
> > > > > on the Kafka client.
> > > > >
> > > > > To take on that responsibility the split thread would need to be
> > aware
> > > of
> > > > > the
> > > > > watermarks or at least whether it should or should not continue to
> > > > consume
> > > > > a given split and this may require a different SourceReader or
> > > > SourceOutput
> > > > > interface.
> > > > >
> > > > > Thanks,
> > > > > Thomas
> > > > >
> > > > >
> > > > > On Fri, Jul 26, 2019 at 1:39 AM Biao Liu <[hidden email]> wrote:
> > > > >
> > > > > > Hi Stephan,
> > > > > >
> > > > > > Thank you for feedback!
> > > > > > Will take a look at your branch before public discussing.
> > > > > >
> > > > > >
> > > > > > On Fri, Jul 26, 2019 at 12:01 AM Stephan Ewen <[hidden email]>
> > > > wrote:
> > > > > >
> > > > > > > Hi Biao!
> > > > > > >
> > > > > > > Thanks for reviving this. I would like to join this discussion,
> > but
> > > > am
> > > > > > > quite occupied with the 1.9 release, so can we maybe pause this
> > > > > > discussion
> > > > > > > for a week or so?
> > > > > > >
> > > > > > > In the meantime I can share some suggestion based on prior
> > > > experiments:
> > > > > > >
> > > > > > > How to do watermarks / timestamp extractors in a simpler and more
> > > > > > flexible
> > > > > > > way. I think that part is quite promising should be part of the
> > new
> > > > > > source
> > > > > > > interface.
> > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> > https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/eventtime
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> > https://github.com/StephanEwen/flink/blob/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src/SourceOutput.java
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Some experiments on how to build the source reader and its
> > library
> > > > for
> > > > > > > common threading/split patterns:
> > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> > https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Stephan
> > > > > > >
> > > > > > >
> > > > > > > On Thu, Jul 25, 2019 at 10:03 AM Biao Liu <[hidden email]>
> > > > wrote:
> > > > > > >
> > > > > > >> Hi devs,
> > > > > > >>
> > > > > > >> Since 1.9 is nearly released, I think we could get back to
> > > FLIP-27.
> > > > I
> > > > > > >> believe it should be included in 1.10.
> > > > > > >>
> > > > > > >> There are so many things mentioned in document of FLIP-27. [1] I
> > > > think
> > > > > > >> we'd better discuss them separately. However the wiki is not a
> > > good
> > > > > > place
> > > > > > >> to discuss. I wrote google doc about SplitReader API which
> > misses
> > > > some
> > > > > > >> details in the document. [2]
> > > > > > >>
> > > > > > >> 1.
> > > > > > >>
> > > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27:+Refactor+Source+Interface
> > > > > > >> 2.
> > > > > > >>
> > > > > >
> > > >
> > >
> > https://docs.google.com/document/d/1R1s_89T4S3CZwq7Tf31DciaMCqZwrLHGZFqPASu66oE/edit?usp=sharing
> > > > > > >>
> > > > > > >> CC Stephan, Aljoscha, Piotrek, Becket
> > > > > > >>
> > > > > > >>
> > > > > > >> On Thu, Mar 28, 2019 at 4:38 PM Biao Liu <[hidden email]>
> > > > wrote:
> > > > > > >>
> > > > > > >>> Hi Steven,
> > > > > > >>> Thank you for the feedback. Please take a look at the document
> > > > FLIP-27
> > > > > > >>> <
> > > > > >
> > > >
> > >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
> > > > >
> > > > > > which
> > > > > > >>> is updated recently. A lot of details of enumerator were added
> > in
> > > > this
> > > > > > >>> document. I think it would help.
> > > > > > >>>
> > > > > > >>> Steven Wu <[hidden email]> 于2019年3月28日周四 下午12:52写道:
> > > > > > >>>
> > > > > > >>>> This proposal mentioned that SplitEnumerator might run on the
> > > > > > >>>> JobManager or
> > > > > > >>>> in a single task on a TaskManager.
> > > > > > >>>>
> > > > > > >>>> if enumerator is a single task on a taskmanager, then the job
> > > DAG
> > > > can
> > > > > > >>>> never
> > > > > > >>>> been embarrassingly parallel anymore. That will nullify the
> > > > leverage
> > > > > > of
> > > > > > >>>> fine-grained recovery for embarrassingly parallel jobs.
> > > > > > >>>>
> > > > > > >>>> It's not clear to me what's the implication of running
> > > enumerator
> > > > on
> > > > > > the
> > > > > > >>>> jobmanager. So I will leave that out for now.
> > > > > > >>>>
> > > > > > >>>> On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]>
> > > > wrote:
> > > > > > >>>>
> > > > > > >>>> > Hi Stephan & Piotrek,
> > > > > > >>>> >
> > > > > > >>>> > Thank you for feedback.
> > > > > > >>>> >
> > > > > > >>>> > It seems that there are a lot of things to do in community.
> > I
> > > am
> > > > > > just
> > > > > > >>>> > afraid that this discussion may be forgotten since there so
> > > many
> > > > > > >>>> proposals
> > > > > > >>>> > recently.
> > > > > > >>>> > Anyway, wish to see the split topics soon :)
> > > > > > >>>> >
> > > > > > >>>> > Piotr Nowojski <[hidden email]> 于2019年1月24日周四
> > > 下午8:21写道:
> > > > > > >>>> >
> > > > > > >>>> > > Hi Biao!
> > > > > > >>>> > >
> > > > > > >>>> > > This discussion was stalled because of preparations for
> > the
> > > > open
> > > > > > >>>> sourcing
> > > > > > >>>> > > & merging Blink. I think before creating the tickets we
> > > should
> > > > > > >>>> split this
> > > > > > >>>> > > discussion into topics/areas outlined by Stephan and
> > create
> > > > Flips
> > > > > > >>>> for
> > > > > > >>>> > that.
> > > > > > >>>> > >
> > > > > > >>>> > > I think there is no chance for this to be completed in
> > > couple
> > > > of
> > > > > > >>>> > remaining
> > > > > > >>>> > > weeks/1 month before 1.8 feature freeze, however it would
> > be
> > > > good
> > > > > > >>>> to aim
> > > > > > >>>> > > with those changes for 1.9.
> > > > > > >>>> > >
> > > > > > >>>> > > Piotrek
> > > > > > >>>> > >
> > > > > > >>>> > > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]>
> > > > wrote:
> > > > > > >>>> > > >
> > > > > > >>>> > > > Hi community,
> > > > > > >>>> > > > The summary of Stephan makes a lot sense to me. It is
> > much
> > > > > > clearer
> > > > > > >>>> > indeed
> > > > > > >>>> > > > after splitting the complex topic into small ones.
> > > > > > >>>> > > > I was wondering is there any detail plan for next step?
> > If
> > > > not,
> > > > > > I
> > > > > > >>>> would
> > > > > > >>>> > > > like to push this thing forward by creating some JIRA
> > > > issues.
> > > > > > >>>> > > > Another question is that should version 1.8 include
> > these
> > > > > > >>>> features?
> > > > > > >>>> > > >
> > > > > > >>>> > > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
> > > > > > >>>> > > >
> > > > > > >>>> > > >> Thanks everyone for the lively discussion. Let me try
> > to
> > > > > > >>>> summarize
> > > > > > >>>> > > where I
> > > > > > >>>> > > >> see convergence in the discussion and open issues.
> > > > > > >>>> > > >> I'll try to group this by design aspect of the source.
> > > > Please
> > > > > > >>>> let me
> > > > > > >>>> > > know
> > > > > > >>>> > > >> if I got things wrong or missed something crucial here.
> > > > > > >>>> > > >>
> > > > > > >>>> > > >> For issues 1-3, if the below reflects the state of the
> > > > > > >>>> discussion, I
> > > > > > >>>> > > would
> > > > > > >>>> > > >> try and update the FLIP in the next days.
> > > > > > >>>> > > >> For the remaining ones we need more discussion.
> > > > > > >>>> > > >>
> > > > > > >>>> > > >> I would suggest to fork each of these aspects into a
> > > > separate
> > > > > > >>>> mail
> > > > > > >>>> > > thread,
> > > > > > >>>> > > >> or will loose sight of the individual aspects.
> > > > > > >>>> > > >>
> > > > > > >>>> > > >> *(1) Separation of Split Enumerator and Split Reader*
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - All seem to agree this is a good thing
> > > > > > >>>> > > >>  - Split Enumerator could in the end live on JobManager
> > > > (and
> > > > > > >>>> assign
> > > > > > >>>> > > splits
> > > > > > >>>> > > >> via RPC) or in a task (and assign splits via data
> > > streams)
> > > > > > >>>> > > >>  - this discussion is orthogonal and should come later,
> > > > when
> > > > > > the
> > > > > > >>>> > > interface
> > > > > > >>>> > > >> is agreed upon.
> > > > > > >>>> > > >>
> > > > > > >>>> > > >> *(2) Split Readers for one or more splits*
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - Discussion seems to agree that we need to support
> > one
> > > > reader
> > > > > > >>>> that
> > > > > > >>>> > > >> possibly handles multiple splits concurrently.
> > > > > > >>>> > > >>  - The requirement comes from sources where one
> > > > poll()-style
> > > > > > call
> > > > > > >>>> > > fetches
> > > > > > >>>> > > >> data from different splits / partitions
> > > > > > >>>> > > >>    --> example sources that require that would be for
> > > > example
> > > > > > >>>> Kafka,
> > > > > > >>>> > > >> Pravega, Pulsar
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - Could have one split reader per source, or multiple
> > > > split
> > > > > > >>>> readers
> > > > > > >>>> > > that
> > > > > > >>>> > > >> share the "poll()" function
> > > > > > >>>> > > >>  - To not make it too complicated, we can start with
> > > > thinking
> > > > > > >>>> about
> > > > > > >>>> > one
> > > > > > >>>> > > >> split reader for all splits initially and see if that
> > > > covers
> > > > > > all
> > > > > > >>>> > > >> requirements
> > > > > > >>>> > > >>
> > > > > > >>>> > > >> *(3) Threading model of the Split Reader*
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - Most active part of the discussion ;-)
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - A non-blocking way for Flink's task code to interact
> > > > with
> > > > > > the
> > > > > > >>>> > source
> > > > > > >>>> > > is
> > > > > > >>>> > > >> needed in order to a task runtime code based on a
> > > > > > >>>> > > >> single-threaded/actor-style task design
> > > > > > >>>> > > >>    --> I personally am a big proponent of that, it will
> > > > help
> > > > > > with
> > > > > > >>>> > > >> well-behaved checkpoints, efficiency, and simpler yet
> > > more
> > > > > > robust
> > > > > > >>>> > > runtime
> > > > > > >>>> > > >> code
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - Users care about simple abstraction, so as a
> > subclass
> > > of
> > > > > > >>>> > SplitReader
> > > > > > >>>> > > >> (non-blocking / async) we need to have a
> > > > BlockingSplitReader
> > > > > > >>>> which
> > > > > > >>>> > will
> > > > > > >>>> > > >> form the basis of most source implementations.
> > > > > > >>>> BlockingSplitReader
> > > > > > >>>> > lets
> > > > > > >>>> > > >> users do blocking simple poll() calls.
> > > > > > >>>> > > >>  - The BlockingSplitReader would spawn a thread (or
> > more)
> > > > and
> > > > > > the
> > > > > > >>>> > > >> thread(s) can make blocking calls and hand over data
> > > > buffers
> > > > > > via
> > > > > > >>>> a
> > > > > > >>>> > > blocking
> > > > > > >>>> > > >> queue
> > > > > > >>>> > > >>  - This should allow us to cover both, a fully async
> > > > runtime,
> > > > > > >>>> and a
> > > > > > >>>> > > simple
> > > > > > >>>> > > >> blocking interface for users.
> > > > > > >>>> > > >>  - This is actually very similar to how the Kafka
> > > > connectors
> > > > > > >>>> work.
> > > > > > >>>> > Kafka
> > > > > > >>>> > > >> 9+ with one thread, Kafka 8 with multiple threads
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - On the base SplitReader (the async one), the
> > > > non-blocking
> > > > > > >>>> method
> > > > > > >>>> > that
> > > > > > >>>> > > >> gets the next chunk of data would signal data
> > > availability
> > > > via
> > > > > > a
> > > > > > >>>> > > >> CompletableFuture, because that gives the best
> > > flexibility
> > > > (can
> > > > > > >>>> await
> > > > > > >>>> > > >> completion or register notification handlers).
> > > > > > >>>> > > >>  - The source task would register a "thenHandle()" (or
> > > > similar)
> > > > > > >>>> on the
> > > > > > >>>> > > >> future to put a "take next data" task into the
> > > actor-style
> > > > > > >>>> mailbox
> > > > > > >>>> > > >>
> > > > > > >>>> > > >> *(4) Split Enumeration and Assignment*
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - Splits may be generated lazily, both in cases where
> > > > there
> > > > > > is a
> > > > > > >>>> > > limited
> > > > > > >>>> > > >> number of splits (but very many), or splits are
> > > discovered
> > > > over
> > > > > > >>>> time
> > > > > > >>>> > > >>  - Assignment should also be lazy, to get better load
> > > > balancing
> > > > > > >>>> > > >>  - Assignment needs support locality preferences
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - Possible design based on discussion so far:
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>    --> SplitReader has a method "addSplits(SplitT...)"
> > to
> > > > add
> > > > > > >>>> one or
> > > > > > >>>> > > more
> > > > > > >>>> > > >> splits. Some split readers might assume they have only
> > > one
> > > > > > split
> > > > > > >>>> ever,
> > > > > > >>>> > > >> concurrently, others assume multiple splits. (Note:
> > idea
> > > > behind
> > > > > > >>>> being
> > > > > > >>>> > > able
> > > > > > >>>> > > >> to add multiple splits at the same time is to ease
> > > startup
> > > > > > where
> > > > > > >>>> > > multiple
> > > > > > >>>> > > >> splits may be assigned instantly.)
> > > > > > >>>> > > >>    --> SplitReader has a context object on which it can
> > > > call
> > > > > > >>>> indicate
> > > > > > >>>> > > when
> > > > > > >>>> > > >> splits are completed. The enumerator gets that
> > > > notification and
> > > > > > >>>> can
> > > > > > >>>> > use
> > > > > > >>>> > > to
> > > > > > >>>> > > >> decide when to assign new splits. This should help both
> > > in
> > > > > > cases
> > > > > > >>>> of
> > > > > > >>>> > > sources
> > > > > > >>>> > > >> that take splits lazily (file readers) and in case the
> > > > source
> > > > > > >>>> needs to
> > > > > > >>>> > > >> preserve a partial order between splits (Kinesis,
> > > Pravega,
> > > > > > >>>> Pulsar may
> > > > > > >>>> > > need
> > > > > > >>>> > > >> that).
> > > > > > >>>> > > >>    --> SplitEnumerator gets notification when
> > > SplitReaders
> > > > > > start
> > > > > > >>>> and
> > > > > > >>>> > > when
> > > > > > >>>> > > >> they finish splits. They can decide at that moment to
> > > push
> > > > more
> > > > > > >>>> splits
> > > > > > >>>> > > to
> > > > > > >>>> > > >> that reader
> > > > > > >>>> > > >>    --> The SplitEnumerator should probably be aware of
> > > the
> > > > > > source
> > > > > > >>>> > > >> parallelism, to build its initial distribution.
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - Open question: Should the source expose something
> > like
> > > > "host
> > > > > > >>>> > > >> preferences", so that yarn/mesos/k8s can take this into
> > > > account
> > > > > > >>>> when
> > > > > > >>>> > > >> selecting a node to start a TM on?
> > > > > > >>>> > > >>
> > > > > > >>>> > > >> *(5) Watermarks and event time alignment*
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - Watermark generation, as well as idleness, needs to
> > be
> > > > per
> > > > > > >>>> split
> > > > > > >>>> > > (like
> > > > > > >>>> > > >> currently in the Kafka Source, per partition)
> > > > > > >>>> > > >>  - It is desirable to support optional
> > > > event-time-alignment,
> > > > > > >>>> meaning
> > > > > > >>>> > > that
> > > > > > >>>> > > >> splits that are ahead are back-pressured or temporarily
> > > > > > >>>> unsubscribed
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - I think i would be desirable to encapsulate
> > watermark
> > > > > > >>>> generation
> > > > > > >>>> > > logic
> > > > > > >>>> > > >> in watermark generators, for a separation of concerns.
> > > The
> > > > > > >>>> watermark
> > > > > > >>>> > > >> generators should run per split.
> > > > > > >>>> > > >>  - Using watermark generators would also help with
> > > another
> > > > > > >>>> problem of
> > > > > > >>>> > > the
> > > > > > >>>> > > >> suggested interface, namely supporting non-periodic
> > > > watermarks
> > > > > > >>>> > > efficiently.
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - Need a way to "dispatch" next record to different
> > > > watermark
> > > > > > >>>> > > generators
> > > > > > >>>> > > >>  - Need a way to tell SplitReader to "suspend" a split
> > > > until a
> > > > > > >>>> certain
> > > > > > >>>> > > >> watermark is reached (event time backpressure)
> > > > > > >>>> > > >>  - This would in fact be not needed (and thus simpler)
> > if
> > > > we
> > > > > > had
> > > > > > >>>> a
> > > > > > >>>> > > >> SplitReader per split and may be a reason to re-open
> > that
> > > > > > >>>> discussion
> > > > > > >>>> > > >>
> > > > > > >>>> > > >> *(6) Watermarks across splits and in the Split
> > > Enumerator*
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - The split enumerator may need some watermark
> > > awareness,
> > > > > > which
> > > > > > >>>> > should
> > > > > > >>>> > > be
> > > > > > >>>> > > >> purely based on split metadata (like create timestamp
> > of
> > > > file
> > > > > > >>>> splits)
> > > > > > >>>> > > >>  - If there are still more splits with overlapping
> > event
> > > > time
> > > > > > >>>> range
> > > > > > >>>> > for
> > > > > > >>>> > > a
> > > > > > >>>> > > >> split reader, then that split reader should not advance
> > > the
> > > > > > >>>> watermark
> > > > > > >>>> > > >> within the split beyond the overlap boundary. Otherwise
> > > > future
> > > > > > >>>> splits
> > > > > > >>>> > > will
> > > > > > >>>> > > >> produce late data.
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - One way to approach this could be that the split
> > > > enumerator
> > > > > > >>>> may
> > > > > > >>>> > send
> > > > > > >>>> > > >> watermarks to the readers, and the readers cannot emit
> > > > > > watermarks
> > > > > > >>>> > beyond
> > > > > > >>>> > > >> that received watermark.
> > > > > > >>>> > > >>  - Many split enumerators would simply immediately send
> > > > > > Long.MAX
> > > > > > >>>> out
> > > > > > >>>> > and
> > > > > > >>>> > > >> leave the progress purely to the split readers.
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - For event-time alignment / split back pressure, this
> > > > begs
> > > > > > the
> > > > > > >>>> > > question
> > > > > > >>>> > > >> how we can avoid deadlocks that may arise when splits
> > are
> > > > > > >>>> suspended
> > > > > > >>>> > for
> > > > > > >>>> > > >> event time back pressure,
> > > > > > >>>> > > >>
> > > > > > >>>> > > >> *(7) Batch and streaming Unification*
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - Functionality wise, the above design should support
> > > both
> > > > > > >>>> > > >>  - Batch often (mostly) does not care about reading "in
> > > > order"
> > > > > > >>>> and
> > > > > > >>>> > > >> generating watermarks
> > > > > > >>>> > > >>    --> Might use different enumerator logic that is
> > more
> > > > > > locality
> > > > > > >>>> > aware
> > > > > > >>>> > > >> and ignores event time order
> > > > > > >>>> > > >>    --> Does not generate watermarks
> > > > > > >>>> > > >>  - Would be great if bounded sources could be
> > identified
> > > at
> > > > > > >>>> compile
> > > > > > >>>> > > time,
> > > > > > >>>> > > >> so that "env.addBoundedSource(...)" is type safe and
> > can
> > > > > > return a
> > > > > > >>>> > > >> "BoundedDataStream".
> > > > > > >>>> > > >>  - Possible to defer this discussion until later
> > > > > > >>>> > > >>
> > > > > > >>>> > > >> *Miscellaneous Comments*
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - Should the source have a TypeInformation for the
> > > > produced
> > > > > > >>>> type,
> > > > > > >>>> > > instead
> > > > > > >>>> > > >> of a serializer? We need a type information in the
> > stream
> > > > > > >>>> anyways, and
> > > > > > >>>> > > can
> > > > > > >>>> > > >> derive the serializer from that. Plus, creating the
> > > > serializer
> > > > > > >>>> should
> > > > > > >>>> > > >> respect the ExecutionConfig.
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>  - The TypeSerializer interface is very powerful but
> > also
> > > > not
> > > > > > >>>> easy to
> > > > > > >>>> > > >> implement. Its purpose is to handle data super
> > > efficiently,
> > > > > > >>>> support
> > > > > > >>>> > > >> flexible ways of evolution, etc.
> > > > > > >>>> > > >>  For metadata I would suggest to look at the
> > > > > > >>>> SimpleVersionedSerializer
> > > > > > >>>> > > >> instead, which is used for example for checkpoint
> > master
> > > > hooks,
> > > > > > >>>> or for
> > > > > > >>>> > > the
> > > > > > >>>> > > >> streaming file sink. I think that is is a good match
> > for
> > > > cases
> > > > > > >>>> where
> > > > > > >>>> > we
> > > > > > >>>> > > do
> > > > > > >>>> > > >> not need more than ser/deser (no copy, etc.) and don't
> > > > need to
> > > > > > >>>> push
> > > > > > >>>> > > >> versioning out of the serialization paths for best
> > > > performance
> > > > > > >>>> (as in
> > > > > > >>>> > > the
> > > > > > >>>> > > >> TypeSerializer)
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>
> > > > > > >>>> > > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
> > > > > > >>>> > > >> [hidden email]>
> > > > > > >>>> > > >> wrote:
> > > > > > >>>> > > >>
> > > > > > >>>> > > >>> Hi Biao,
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>> Thanks for the answer!
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>> So given the multi-threaded readers, now we have as
> > open
> > > > > > >>>> questions:
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>> 1) How do we let the checkpoints pass through our
> > > > > > multi-threaded
> > > > > > >>>> > reader
> > > > > > >>>> > > >>> operator?
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>> 2) Do we have separate reader and source operators or
> > > > not? In
> > > > > > >>>> the
> > > > > > >>>> > > >> strategy
> > > > > > >>>> > > >>> that has a separate source, the source operator has a
> > > > > > >>>> parallelism of
> > > > > > >>>> > 1
> > > > > > >>>> > > >> and
> > > > > > >>>> > > >>> is responsible for split recovery only.
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>> For the first one, given also the constraints
> > (blocking,
> > > > > > finite
> > > > > > >>>> > queues,
> > > > > > >>>> > > >>> etc), I do not have an answer yet.
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>> For the 2nd, I think that we should go with separate
> > > > operators
> > > > > > >>>> for
> > > > > > >>>> > the
> > > > > > >>>> > > >>> source and the readers, for the following reasons:
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>> 1) This is more aligned with a potential future
> > > > improvement
> > > > > > >>>> where the
> > > > > > >>>> > > >> split
> > > > > > >>>> > > >>> discovery becomes a responsibility of the JobManager
> > and
> > > > > > >>>> readers are
> > > > > > >>>> > > >>> pooling more work from the JM.
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>> 2) The source is going to be the "single point of
> > > truth".
> > > > It
> > > > > > >>>> will
> > > > > > >>>> > know
> > > > > > >>>> > > >> what
> > > > > > >>>> > > >>> has been processed and what not. If the source and the
> > > > readers
> > > > > > >>>> are a
> > > > > > >>>> > > >> single
> > > > > > >>>> > > >>> operator with parallelism > 1, or in general, if the
> > > split
> > > > > > >>>> discovery
> > > > > > >>>> > is
> > > > > > >>>> > > >>> done by each task individually, then:
> > > > > > >>>> > > >>>   i) we have to have a deterministic scheme for each
> > > > reader to
> > > > > > >>>> assign
> > > > > > >>>> > > >>> splits to itself (e.g. mod subtaskId). This is not
> > > > necessarily
> > > > > > >>>> > trivial
> > > > > > >>>> > > >> for
> > > > > > >>>> > > >>> all sources.
> > > > > > >>>> > > >>>   ii) each reader would have to keep a copy of all its
> > > > > > processed
> > > > > > >>>> > slpits
> > > > > > >>>> > > >>>   iii) the state has to be a union state with a
> > > > non-trivial
> > > > > > >>>> merging
> > > > > > >>>> > > >> logic
> > > > > > >>>> > > >>> in order to support rescaling.
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>> Two additional points that you raised above:
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>> i) The point that you raised that we need to keep all
> > > > splits
> > > > > > >>>> > (processed
> > > > > > >>>> > > >> and
> > > > > > >>>> > > >>> not-processed) I think is a bit of a strong
> > requirement.
> > > > This
> > > > > > >>>> would
> > > > > > >>>> > > imply
> > > > > > >>>> > > >>> that for infinite sources the state will grow
> > > > indefinitely.
> > > > > > >>>> This is
> > > > > > >>>> > > >> problem
> > > > > > >>>> > > >>> is even more pronounced if we do not have a single
> > > source
> > > > that
> > > > > > >>>> > assigns
> > > > > > >>>> > > >>> splits to readers, as each reader will have its own
> > copy
> > > > of
> > > > > > the
> > > > > > >>>> > state.
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>> ii) it is true that for finite sources we need to
> > > somehow
> > > > not
> > > > > > >>>> close
> > > > > > >>>> > the
> > > > > > >>>> > > >>> readers when the source/split discoverer finishes. The
> > > > > > >>>> > > >>> ContinuousFileReaderOperator has a work-around for
> > that.
> > > > It is
> > > > > > >>>> not
> > > > > > >>>> > > >> elegant,
> > > > > > >>>> > > >>> and checkpoints are not emitted after closing the
> > > source,
> > > > but
> > > > > > >>>> this, I
> > > > > > >>>> > > >>> believe, is a bigger problem which requires more
> > changes
> > > > than
> > > > > > >>>> just
> > > > > > >>>> > > >>> refactoring the source interface.
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>> Cheers,
> > > > > > >>>> > > >>> Kostas
> > > > > > >>>> > > >>>
> > > > > > >>>> > > >>
> > > > > > >>>> > >
> > > > > > >>>> > >
> > > > > > >>>> >
> > > > > > >>>>
> > > > > > >>>
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

Becket Qin
Thanks Stephan,

I have to mention that most of the design work and FLIP wiki had actually
been done by Aljoscha, Biao and you, before I picked up this FLIP.

Given that this FLIP has gone through an extended discussion and release
1.10 code freeze is approaching, I'd like to start a vote thread in about
12 hours if there is no further objections.

Thanks,

Jiangjie (Becket) Qin

On Wed, Dec 4, 2019 at 7:04 PM Becket Qin <[hidden email]> wrote:

> Hi Jiayi,
>
> For now there is no communication between the coordinators. And I do see
> some use cases if we can open up that channel. But it won't be in this FLIP.
>
> Thanks,
>
> Jiangjie (Becket) Qin
>
> On Wed, Dec 4, 2019 at 6:53 PM bupt_ljy <[hidden email]> wrote:
>
>> Hi Becket,
>>
>>
>> Thanks for updating the progress!
>>
>>
>> I have a question about the #OperatorCoordinator. Will there be any
>> communication between different #OperatorCoordinators (or in the future
>> plan)? Because in that way it may be able to cover some cases in FLIP-27[1]
>> like initializing static data before main input processing. Of course it
>> requires more thinking, just want to speak up some ideas in my mind.
>>
>>
>> +1 to the FLIP and detailed design!
>>
>>
>>
>> [1].
>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-17+Side+Inputs+for+DataStream+API
>>
>>
>> Best,
>>
>> Jiayi Liao
>>
>>  Original Message
>> *Sender:* Stephan Ewen<[hidden email]>
>> *Recipient:* dev<[hidden email]>
>> *Date:* Wednesday, Dec 4, 2019 18:25
>> *Subject:* Re: [DISCUSS] FLIP-27: Refactor Source Interface
>>
>> Thanks, Becket, for updating this.
>>
>> I agree with moving the aspects you mentioned into separate FLIPs - this
>> one way becoming unwieldy in size.
>>
>> +1 to the FLIP in its current state. Its a very detailed write-up, nicely
>> done!
>>
>> On Wed, Dec 4, 2019 at 7:38 AM Becket Qin <[hidden email]> wrote:
>>
>> > Hi all,
>> >
>> > Sorry for the long belated update. I have updated FLIP-27 wiki page with
>> > the latest proposals. Some noticeable changes include:
>> > 1. A new generic communication mechanism between SplitEnumerator and
>> > SourceReader.
>> > 2. Some detail API method signature changes.
>> >
>> > We left a few things out of this FLIP and will address them in separate
>> > FLIPs. Including:
>> > 1. Per split event time.
>> > 2. Event time alignment.
>> > 3. Fine grained failover for SplitEnumerator failure.
>> >
>> > Please let us know if you have any question.
>> >
>> > Thanks,
>> >
>> > Jiangjie (Becket) Qin
>> >
>> > On Sat, Nov 16, 2019 at 6:10 AM Stephan Ewen <[hidden email]> wrote:
>> >
>> > > Hi  Łukasz!
>> > >
>> > > Becket and me are working hard on figuring out the last details and
>> > > implementing the first PoC. We would update the FLIP hopefully next week.
>> > >
>> > > There is a fair chance that a first version of this will be in 1.10, but
>> > I
>> > > think it will take another release to battle test it and migrate the
>> > > connectors.
>> > >
>> > > Best,
>> > > Stephan
>> > >
>> > >
>> > >
>> > >
>> > > On Fri, Nov 15, 2019 at 11:14 AM Łukasz Jędrzejewski <[hidden email]>
>> > wrote:
>> > >
>> > > > Hi,
>> > > >
>> > > > This proposal looks very promising for us. Do you have any plans in
>> > which
>> > > > Flink release it is going to be released? We are thinking on using a
>> > Data
>> > > > Set API for our future use cases but on the other hand Data Set API is
>> > > > going to be deprecated so using proposed bounded data streams solution
>> > > > could be more viable in the long term.
>> > > >
>> > > > Thanks,
>> > > > Łukasz
>> > > >
>> > > > On 2019/10/01 15:48:03, Thomas Weise <[hidden email]> wrote:
>> > > > > Thanks for putting together this proposal!
>> > > > >
>> > > > > I see that the "Per Split Event Time" and "Event Time Alignment"
>> > > sections
>> > > > > are still TBD.
>> > > > >
>> > > > > It would probably be good to flesh those out a bit before proceeding
>> > > too
>> > > > far
>> > > > > as the event time alignment will probably influence the interaction
>> > > with
>> > > > > the split reader, specifically ReaderStatus emitNext(SourceOutput<E>
>> > > > > output).
>> > > > >
>> > > > > We currently have only one implementation for event time alignment in
>> > > the
>> > > > > Kinesis consumer. The synchronization in that case takes place as the
>> > > > last
>> > > > > step before records are emitted downstream (RecordEmitter). With the
>> > > > > currently proposed interfaces, the equivalent can be implemented in
>> > the
>> > > > > reader loop, although note that in the Kinesis consumer the per shard
>> > > > > threads push records.
>> > > > >
>> > > > > Synchronization has not been implemented for the Kafka consumer yet.
>> > > > >
>> > > > > https://issues.apache.org/jira/browse/FLINK-12675
>> > > > >
>> > > > > When I looked at it, I realized that the implementation will look
>> > quite
>> > > > > different
>> > > > > from Kinesis because it needs to take place in the pull part, where
>> > > > records
>> > > > > are taken from the Kafka client. Due to the multiplexing it cannot be
>> > > > done
>> > > > > by blocking the split thread like it currently works for Kinesis.
>> > > Reading
>> > > > > from individual Kafka partitions needs to be controlled via
>> > > pause/resume
>> > > > > on the Kafka client.
>> > > > >
>> > > > > To take on that responsibility the split thread would need to be
>> > aware
>> > > of
>> > > > > the
>> > > > > watermarks or at least whether it should or should not continue to
>> > > > consume
>> > > > > a given split and this may require a different SourceReader or
>> > > > SourceOutput
>> > > > > interface.
>> > > > >
>> > > > > Thanks,
>> > > > > Thomas
>> > > > >
>> > > > >
>> > > > > On Fri, Jul 26, 2019 at 1:39 AM Biao Liu <[hidden email]> wrote:
>> > > > >
>> > > > > > Hi Stephan,
>> > > > > >
>> > > > > > Thank you for feedback!
>> > > > > > Will take a look at your branch before public discussing.
>> > > > > >
>> > > > > >
>> > > > > > On Fri, Jul 26, 2019 at 12:01 AM Stephan Ewen <[hidden email]>
>> > > > wrote:
>> > > > > >
>> > > > > > > Hi Biao!
>> > > > > > >
>> > > > > > > Thanks for reviving this. I would like to join this discussion,
>> > but
>> > > > am
>> > > > > > > quite occupied with the 1.9 release, so can we maybe pause this
>> > > > > > discussion
>> > > > > > > for a week or so?
>> > > > > > >
>> > > > > > > In the meantime I can share some suggestion based on prior
>> > > > experiments:
>> > > > > > >
>> > > > > > > How to do watermarks / timestamp extractors in a simpler and more
>> > > > > > flexible
>> > > > > > > way. I think that part is quite promising should be part of the
>> > new
>> > > > > > source
>> > > > > > > interface.
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > >
>> > >
>> > https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/eventtime
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > >
>> > >
>> > https://github.com/StephanEwen/flink/blob/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src/SourceOutput.java
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > Some experiments on how to build the source reader and its
>> > library
>> > > > for
>> > > > > > > common threading/split patterns:
>> > > > > > >
>> > > > > > >
>> > > > > >
>> > > >
>> > >
>> > https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src
>> > > > > > >
>> > > > > > >
>> > > > > > > Best,
>> > > > > > > Stephan
>> > > > > > >
>> > > > > > >
>> > > > > > > On Thu, Jul 25, 2019 at 10:03 AM Biao Liu <[hidden email]>
>> > > > wrote:
>> > > > > > >
>> > > > > > >> Hi devs,
>> > > > > > >>
>> > > > > > >> Since 1.9 is nearly released, I think we could get back to
>> > > FLIP-27.
>> > > > I
>> > > > > > >> believe it should be included in 1.10.
>> > > > > > >>
>> > > > > > >> There are so many things mentioned in document of FLIP-27. [1] I
>> > > > think
>> > > > > > >> we'd better discuss them separately. However the wiki is not a
>> > > good
>> > > > > > place
>> > > > > > >> to discuss. I wrote google doc about SplitReader API which
>> > misses
>> > > > some
>> > > > > > >> details in the document. [2]
>> > > > > > >>
>> > > > > > >> 1.
>> > > > > > >>
>> > > > > >
>> > > >
>> > >
>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27:+Refactor+Source+Interface
>> > > > > > >> 2.
>> > > > > > >>
>> > > > > >
>> > > >
>> > >
>> > https://docs.google.com/document/d/1R1s_89T4S3CZwq7Tf31DciaMCqZwrLHGZFqPASu66oE/edit?usp=sharing
>> > > > > > >>
>> > > > > > >> CC Stephan, Aljoscha, Piotrek, Becket
>> > > > > > >>
>> > > > > > >>
>> > > > > > >> On Thu, Mar 28, 2019 at 4:38 PM Biao Liu <[hidden email]>
>> > > > wrote:
>> > > > > > >>
>> > > > > > >>> Hi Steven,
>> > > > > > >>> Thank you for the feedback. Please take a look at the document
>> > > > FLIP-27
>> > > > > > >>> <
>> > > > > >
>> > > >
>> > >
>> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface
>> > > > >
>> > > > > > which
>> > > > > > >>> is updated recently. A lot of details of enumerator were added
>> > in
>> > > > this
>> > > > > > >>> document. I think it would help.
>> > > > > > >>>
>> > > > > > >>> Steven Wu <[hidden email]> 于2019年3月28日周四 下午12:52写道:
>> > > > > > >>>
>> > > > > > >>>> This proposal mentioned that SplitEnumerator might run on the
>> > > > > > >>>> JobManager or
>> > > > > > >>>> in a single task on a TaskManager.
>> > > > > > >>>>
>> > > > > > >>>> if enumerator is a single task on a taskmanager, then the job
>> > > DAG
>> > > > can
>> > > > > > >>>> never
>> > > > > > >>>> been embarrassingly parallel anymore. That will nullify the
>> > > > leverage
>> > > > > > of
>> > > > > > >>>> fine-grained recovery for embarrassingly parallel jobs.
>> > > > > > >>>>
>> > > > > > >>>> It's not clear to me what's the implication of running
>> > > enumerator
>> > > > on
>> > > > > > the
>> > > > > > >>>> jobmanager. So I will leave that out for now.
>> > > > > > >>>>
>> > > > > > >>>> On Mon, Jan 28, 2019 at 3:05 AM Biao Liu <[hidden email]>
>> > > > wrote:
>> > > > > > >>>>
>> > > > > > >>>> > Hi Stephan & Piotrek,
>> > > > > > >>>> >
>> > > > > > >>>> > Thank you for feedback.
>> > > > > > >>>> >
>> > > > > > >>>> > It seems that there are a lot of things to do in community.
>> > I
>> > > am
>> > > > > > just
>> > > > > > >>>> > afraid that this discussion may be forgotten since there so
>> > > many
>> > > > > > >>>> proposals
>> > > > > > >>>> > recently.
>> > > > > > >>>> > Anyway, wish to see the split topics soon :)
>> > > > > > >>>> >
>> > > > > > >>>> > Piotr Nowojski <[hidden email]> 于2019年1月24日周四
>> > > 下午8:21写道:
>> > > > > > >>>> >
>> > > > > > >>>> > > Hi Biao!
>> > > > > > >>>> > >
>> > > > > > >>>> > > This discussion was stalled because of preparations for
>> > the
>> > > > open
>> > > > > > >>>> sourcing
>> > > > > > >>>> > > & merging Blink. I think before creating the tickets we
>> > > should
>> > > > > > >>>> split this
>> > > > > > >>>> > > discussion into topics/areas outlined by Stephan and
>> > create
>> > > > Flips
>> > > > > > >>>> for
>> > > > > > >>>> > that.
>> > > > > > >>>> > >
>> > > > > > >>>> > > I think there is no chance for this to be completed in
>> > > couple
>> > > > of
>> > > > > > >>>> > remaining
>> > > > > > >>>> > > weeks/1 month before 1.8 feature freeze, however it would
>> > be
>> > > > good
>> > > > > > >>>> to aim
>> > > > > > >>>> > > with those changes for 1.9.
>> > > > > > >>>> > >
>> > > > > > >>>> > > Piotrek
>> > > > > > >>>> > >
>> > > > > > >>>> > > > On 20 Jan 2019, at 16:08, Biao Liu <[hidden email]>
>> > > > wrote:
>> > > > > > >>>> > > >
>> > > > > > >>>> > > > Hi community,
>> > > > > > >>>> > > > The summary of Stephan makes a lot sense to me. It is
>> > much
>> > > > > > clearer
>> > > > > > >>>> > indeed
>> > > > > > >>>> > > > after splitting the complex topic into small ones.
>> > > > > > >>>> > > > I was wondering is there any detail plan for next step?
>> > If
>> > > > not,
>> > > > > > I
>> > > > > > >>>> would
>> > > > > > >>>> > > > like to push this thing forward by creating some JIRA
>> > > > issues.
>> > > > > > >>>> > > > Another question is that should version 1.8 include
>> > these
>> > > > > > >>>> features?
>> > > > > > >>>> > > >
>> > > > > > >>>> > > > Stephan Ewen <[hidden email]> 于2018年12月1日周六 上午4:20写道:
>> > > > > > >>>> > > >
>> > > > > > >>>> > > >> Thanks everyone for the lively discussion. Let me try
>> > to
>> > > > > > >>>> summarize
>> > > > > > >>>> > > where I
>> > > > > > >>>> > > >> see convergence in the discussion and open issues.
>> > > > > > >>>> > > >> I'll try to group this by design aspect of the source.
>> > > > Please
>> > > > > > >>>> let me
>> > > > > > >>>> > > know
>> > > > > > >>>> > > >> if I got things wrong or missed something crucial here.
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >> For issues 1-3, if the below reflects the state of the
>> > > > > > >>>> discussion, I
>> > > > > > >>>> > > would
>> > > > > > >>>> > > >> try and update the FLIP in the next days.
>> > > > > > >>>> > > >> For the remaining ones we need more discussion.
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >> I would suggest to fork each of these aspects into a
>> > > > separate
>> > > > > > >>>> mail
>> > > > > > >>>> > > thread,
>> > > > > > >>>> > > >> or will loose sight of the individual aspects.
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >> *(1) Separation of Split Enumerator and Split Reader*
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - All seem to agree this is a good thing
>> > > > > > >>>> > > >>  - Split Enumerator could in the end live on JobManager
>> > > > (and
>> > > > > > >>>> assign
>> > > > > > >>>> > > splits
>> > > > > > >>>> > > >> via RPC) or in a task (and assign splits via data
>> > > streams)
>> > > > > > >>>> > > >>  - this discussion is orthogonal and should come later,
>> > > > when
>> > > > > > the
>> > > > > > >>>> > > interface
>> > > > > > >>>> > > >> is agreed upon.
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >> *(2) Split Readers for one or more splits*
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - Discussion seems to agree that we need to support
>> > one
>> > > > reader
>> > > > > > >>>> that
>> > > > > > >>>> > > >> possibly handles multiple splits concurrently.
>> > > > > > >>>> > > >>  - The requirement comes from sources where one
>> > > > poll()-style
>> > > > > > call
>> > > > > > >>>> > > fetches
>> > > > > > >>>> > > >> data from different splits / partitions
>> > > > > > >>>> > > >>    --> example sources that require that would be for
>> > > > example
>> > > > > > >>>> Kafka,
>> > > > > > >>>> > > >> Pravega, Pulsar
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - Could have one split reader per source, or multiple
>> > > > split
>> > > > > > >>>> readers
>> > > > > > >>>> > > that
>> > > > > > >>>> > > >> share the "poll()" function
>> > > > > > >>>> > > >>  - To not make it too complicated, we can start with
>> > > > thinking
>> > > > > > >>>> about
>> > > > > > >>>> > one
>> > > > > > >>>> > > >> split reader for all splits initially and see if that
>> > > > covers
>> > > > > > all
>> > > > > > >>>> > > >> requirements
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >> *(3) Threading model of the Split Reader*
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - Most active part of the discussion ;-)
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - A non-blocking way for Flink's task code to interact
>> > > > with
>> > > > > > the
>> > > > > > >>>> > source
>> > > > > > >>>> > > is
>> > > > > > >>>> > > >> needed in order to a task runtime code based on a
>> > > > > > >>>> > > >> single-threaded/actor-style task design
>> > > > > > >>>> > > >>    --> I personally am a big proponent of that, it will
>> > > > help
>> > > > > > with
>> > > > > > >>>> > > >> well-behaved checkpoints, efficiency, and simpler yet
>> > > more
>> > > > > > robust
>> > > > > > >>>> > > runtime
>> > > > > > >>>> > > >> code
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - Users care about simple abstraction, so as a
>> > subclass
>> > > of
>> > > > > > >>>> > SplitReader
>> > > > > > >>>> > > >> (non-blocking / async) we need to have a
>> > > > BlockingSplitReader
>> > > > > > >>>> which
>> > > > > > >>>> > will
>> > > > > > >>>> > > >> form the basis of most source implementations.
>> > > > > > >>>> BlockingSplitReader
>> > > > > > >>>> > lets
>> > > > > > >>>> > > >> users do blocking simple poll() calls.
>> > > > > > >>>> > > >>  - The BlockingSplitReader would spawn a thread (or
>> > more)
>> > > > and
>> > > > > > the
>> > > > > > >>>> > > >> thread(s) can make blocking calls and hand over data
>> > > > buffers
>> > > > > > via
>> > > > > > >>>> a
>> > > > > > >>>> > > blocking
>> > > > > > >>>> > > >> queue
>> > > > > > >>>> > > >>  - This should allow us to cover both, a fully async
>> > > > runtime,
>> > > > > > >>>> and a
>> > > > > > >>>> > > simple
>> > > > > > >>>> > > >> blocking interface for users.
>> > > > > > >>>> > > >>  - This is actually very similar to how the Kafka
>> > > > connectors
>> > > > > > >>>> work.
>> > > > > > >>>> > Kafka
>> > > > > > >>>> > > >> 9+ with one thread, Kafka 8 with multiple threads
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - On the base SplitReader (the async one), the
>> > > > non-blocking
>> > > > > > >>>> method
>> > > > > > >>>> > that
>> > > > > > >>>> > > >> gets the next chunk of data would signal data
>> > > availability
>> > > > via
>> > > > > > a
>> > > > > > >>>> > > >> CompletableFuture, because that gives the best
>> > > flexibility
>> > > > (can
>> > > > > > >>>> await
>> > > > > > >>>> > > >> completion or register notification handlers).
>> > > > > > >>>> > > >>  - The source task would register a "thenHandle()" (or
>> > > > similar)
>> > > > > > >>>> on the
>> > > > > > >>>> > > >> future to put a "take next data" task into the
>> > > actor-style
>> > > > > > >>>> mailbox
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >> *(4) Split Enumeration and Assignment*
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - Splits may be generated lazily, both in cases where
>> > > > there
>> > > > > > is a
>> > > > > > >>>> > > limited
>> > > > > > >>>> > > >> number of splits (but very many), or splits are
>> > > discovered
>> > > > over
>> > > > > > >>>> time
>> > > > > > >>>> > > >>  - Assignment should also be lazy, to get better load
>> > > > balancing
>> > > > > > >>>> > > >>  - Assignment needs support locality preferences
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - Possible design based on discussion so far:
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>    --> SplitReader has a method "addSplits(SplitT...)"
>> > to
>> > > > add
>> > > > > > >>>> one or
>> > > > > > >>>> > > more
>> > > > > > >>>> > > >> splits. Some split readers might assume they have only
>> > > one
>> > > > > > split
>> > > > > > >>>> ever,
>> > > > > > >>>> > > >> concurrently, others assume multiple splits. (Note:
>> > idea
>> > > > behind
>> > > > > > >>>> being
>> > > > > > >>>> > > able
>> > > > > > >>>> > > >> to add multiple splits at the same time is to ease
>> > > startup
>> > > > > > where
>> > > > > > >>>> > > multiple
>> > > > > > >>>> > > >> splits may be assigned instantly.)
>> > > > > > >>>> > > >>    --> SplitReader has a context object on which it can
>> > > > call
>> > > > > > >>>> indicate
>> > > > > > >>>> > > when
>> > > > > > >>>> > > >> splits are completed. The enumerator gets that
>> > > > notification and
>> > > > > > >>>> can
>> > > > > > >>>> > use
>> > > > > > >>>> > > to
>> > > > > > >>>> > > >> decide when to assign new splits. This should help both
>> > > in
>> > > > > > cases
>> > > > > > >>>> of
>> > > > > > >>>> > > sources
>> > > > > > >>>> > > >> that take splits lazily (file readers) and in case the
>> > > > source
>> > > > > > >>>> needs to
>> > > > > > >>>> > > >> preserve a partial order between splits (Kinesis,
>> > > Pravega,
>> > > > > > >>>> Pulsar may
>> > > > > > >>>> > > need
>> > > > > > >>>> > > >> that).
>> > > > > > >>>> > > >>    --> SplitEnumerator gets notification when
>> > > SplitReaders
>> > > > > > start
>> > > > > > >>>> and
>> > > > > > >>>> > > when
>> > > > > > >>>> > > >> they finish splits. They can decide at that moment to
>> > > push
>> > > > more
>> > > > > > >>>> splits
>> > > > > > >>>> > > to
>> > > > > > >>>> > > >> that reader
>> > > > > > >>>> > > >>    --> The SplitEnumerator should probably be aware of
>> > > the
>> > > > > > source
>> > > > > > >>>> > > >> parallelism, to build its initial distribution.
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - Open question: Should the source expose something
>> > like
>> > > > "host
>> > > > > > >>>> > > >> preferences", so that yarn/mesos/k8s can take this into
>> > > > account
>> > > > > > >>>> when
>> > > > > > >>>> > > >> selecting a node to start a TM on?
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >> *(5) Watermarks and event time alignment*
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - Watermark generation, as well as idleness, needs to
>> > be
>> > > > per
>> > > > > > >>>> split
>> > > > > > >>>> > > (like
>> > > > > > >>>> > > >> currently in the Kafka Source, per partition)
>> > > > > > >>>> > > >>  - It is desirable to support optional
>> > > > event-time-alignment,
>> > > > > > >>>> meaning
>> > > > > > >>>> > > that
>> > > > > > >>>> > > >> splits that are ahead are back-pressured or temporarily
>> > > > > > >>>> unsubscribed
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - I think i would be desirable to encapsulate
>> > watermark
>> > > > > > >>>> generation
>> > > > > > >>>> > > logic
>> > > > > > >>>> > > >> in watermark generators, for a separation of concerns.
>> > > The
>> > > > > > >>>> watermark
>> > > > > > >>>> > > >> generators should run per split.
>> > > > > > >>>> > > >>  - Using watermark generators would also help with
>> > > another
>> > > > > > >>>> problem of
>> > > > > > >>>> > > the
>> > > > > > >>>> > > >> suggested interface, namely supporting non-periodic
>> > > > watermarks
>> > > > > > >>>> > > efficiently.
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - Need a way to "dispatch" next record to different
>> > > > watermark
>> > > > > > >>>> > > generators
>> > > > > > >>>> > > >>  - Need a way to tell SplitReader to "suspend" a split
>> > > > until a
>> > > > > > >>>> certain
>> > > > > > >>>> > > >> watermark is reached (event time backpressure)
>> > > > > > >>>> > > >>  - This would in fact be not needed (and thus simpler)
>> > if
>> > > > we
>> > > > > > had
>> > > > > > >>>> a
>> > > > > > >>>> > > >> SplitReader per split and may be a reason to re-open
>> > that
>> > > > > > >>>> discussion
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >> *(6) Watermarks across splits and in the Split
>> > > Enumerator*
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - The split enumerator may need some watermark
>> > > awareness,
>> > > > > > which
>> > > > > > >>>> > should
>> > > > > > >>>> > > be
>> > > > > > >>>> > > >> purely based on split metadata (like create timestamp
>> > of
>> > > > file
>> > > > > > >>>> splits)
>> > > > > > >>>> > > >>  - If there are still more splits with overlapping
>> > event
>> > > > time
>> > > > > > >>>> range
>> > > > > > >>>> > for
>> > > > > > >>>> > > a
>> > > > > > >>>> > > >> split reader, then that split reader should not advance
>> > > the
>> > > > > > >>>> watermark
>> > > > > > >>>> > > >> within the split beyond the overlap boundary. Otherwise
>> > > > future
>> > > > > > >>>> splits
>> > > > > > >>>> > > will
>> > > > > > >>>> > > >> produce late data.
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - One way to approach this could be that the split
>> > > > enumerator
>> > > > > > >>>> may
>> > > > > > >>>> > send
>> > > > > > >>>> > > >> watermarks to the readers, and the readers cannot emit
>> > > > > > watermarks
>> > > > > > >>>> > beyond
>> > > > > > >>>> > > >> that received watermark.
>> > > > > > >>>> > > >>  - Many split enumerators would simply immediately send
>> > > > > > Long.MAX
>> > > > > > >>>> out
>> > > > > > >>>> > and
>> > > > > > >>>> > > >> leave the progress purely to the split readers.
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - For event-time alignment / split back pressure, this
>> > > > begs
>> > > > > > the
>> > > > > > >>>> > > question
>> > > > > > >>>> > > >> how we can avoid deadlocks that may arise when splits
>> > are
>> > > > > > >>>> suspended
>> > > > > > >>>> > for
>> > > > > > >>>> > > >> event time back pressure,
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >> *(7) Batch and streaming Unification*
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - Functionality wise, the above design should support
>> > > both
>> > > > > > >>>> > > >>  - Batch often (mostly) does not care about reading "in
>> > > > order"
>> > > > > > >>>> and
>> > > > > > >>>> > > >> generating watermarks
>> > > > > > >>>> > > >>    --> Might use different enumerator logic that is
>> > more
>> > > > > > locality
>> > > > > > >>>> > aware
>> > > > > > >>>> > > >> and ignores event time order
>> > > > > > >>>> > > >>    --> Does not generate watermarks
>> > > > > > >>>> > > >>  - Would be great if bounded sources could be
>> > identified
>> > > at
>> > > > > > >>>> compile
>> > > > > > >>>> > > time,
>> > > > > > >>>> > > >> so that "env.addBoundedSource(...)" is type safe and
>> > can
>> > > > > > return a
>> > > > > > >>>> > > >> "BoundedDataStream".
>> > > > > > >>>> > > >>  - Possible to defer this discussion until later
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >> *Miscellaneous Comments*
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - Should the source have a TypeInformation for the
>> > > > produced
>> > > > > > >>>> type,
>> > > > > > >>>> > > instead
>> > > > > > >>>> > > >> of a serializer? We need a type information in the
>> > stream
>> > > > > > >>>> anyways, and
>> > > > > > >>>> > > can
>> > > > > > >>>> > > >> derive the serializer from that. Plus, creating the
>> > > > serializer
>> > > > > > >>>> should
>> > > > > > >>>> > > >> respect the ExecutionConfig.
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>  - The TypeSerializer interface is very powerful but
>> > also
>> > > > not
>> > > > > > >>>> easy to
>> > > > > > >>>> > > >> implement. Its purpose is to handle data super
>> > > efficiently,
>> > > > > > >>>> support
>> > > > > > >>>> > > >> flexible ways of evolution, etc.
>> > > > > > >>>> > > >>  For metadata I would suggest to look at the
>> > > > > > >>>> SimpleVersionedSerializer
>> > > > > > >>>> > > >> instead, which is used for example for checkpoint
>> > master
>> > > > hooks,
>> > > > > > >>>> or for
>> > > > > > >>>> > > the
>> > > > > > >>>> > > >> streaming file sink. I think that is is a good match
>> > for
>> > > > cases
>> > > > > > >>>> where
>> > > > > > >>>> > we
>> > > > > > >>>> > > do
>> > > > > > >>>> > > >> not need more than ser/deser (no copy, etc.) and don't
>> > > > need to
>> > > > > > >>>> push
>> > > > > > >>>> > > >> versioning out of the serialization paths for best
>> > > > performance
>> > > > > > >>>> (as in
>> > > > > > >>>> > > the
>> > > > > > >>>> > > >> TypeSerializer)
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >> On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
>> > > > > > >>>> > > >> [hidden email]>
>> > > > > > >>>> > > >> wrote:
>> > > > > > >>>> > > >>
>> > > > > > >>>> > > >>> Hi Biao,
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>> Thanks for the answer!
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>> So given the multi-threaded readers, now we have as
>> > open
>> > > > > > >>>> questions:
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>> 1) How do we let the checkpoints pass through our
>> > > > > > multi-threaded
>> > > > > > >>>> > reader
>> > > > > > >>>> > > >>> operator?
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>> 2) Do we have separate reader and source operators or
>> > > > not? In
>> > > > > > >>>> the
>> > > > > > >>>> > > >> strategy
>> > > > > > >>>> > > >>> that has a separate source, the source operator has a
>> > > > > > >>>> parallelism of
>> > > > > > >>>> > 1
>> > > > > > >>>> > > >> and
>> > > > > > >>>> > > >>> is responsible for split recovery only.
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>> For the first one, given also the constraints
>> > (blocking,
>> > > > > > finite
>> > > > > > >>>> > queues,
>> > > > > > >>>> > > >>> etc), I do not have an answer yet.
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>> For the 2nd, I think that we should go with separate
>> > > > operators
>> > > > > > >>>> for
>> > > > > > >>>> > the
>> > > > > > >>>> > > >>> source and the readers, for the following reasons:
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>> 1) This is more aligned with a potential future
>> > > > improvement
>> > > > > > >>>> where the
>> > > > > > >>>> > > >> split
>> > > > > > >>>> > > >>> discovery becomes a responsibility of the JobManager
>> > and
>> > > > > > >>>> readers are
>> > > > > > >>>> > > >>> pooling more work from the JM.
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>> 2) The source is going to be the "single point of
>> > > truth".
>> > > > It
>> > > > > > >>>> will
>> > > > > > >>>> > know
>> > > > > > >>>> > > >> what
>> > > > > > >>>> > > >>> has been processed and what not. If the source and the
>> > > > readers
>> > > > > > >>>> are a
>> > > > > > >>>> > > >> single
>> > > > > > >>>> > > >>> operator with parallelism > 1, or in general, if the
>> > > split
>> > > > > > >>>> discovery
>> > > > > > >>>> > is
>> > > > > > >>>> > > >>> done by each task individually, then:
>> > > > > > >>>> > > >>>   i) we have to have a deterministic scheme for each
>> > > > reader to
>> > > > > > >>>> assign
>> > > > > > >>>> > > >>> splits to itself (e.g. mod subtaskId). This is not
>> > > > necessarily
>> > > > > > >>>> > trivial
>> > > > > > >>>> > > >> for
>> > > > > > >>>> > > >>> all sources.
>> > > > > > >>>> > > >>>   ii) each reader would have to keep a copy of all its
>> > > > > > processed
>> > > > > > >>>> > slpits
>> > > > > > >>>> > > >>>   iii) the state has to be a union state with a
>> > > > non-trivial
>> > > > > > >>>> merging
>> > > > > > >>>> > > >> logic
>> > > > > > >>>> > > >>> in order to support rescaling.
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>> Two additional points that you raised above:
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>> i) The point that you raised that we need to keep all
>> > > > splits
>> > > > > > >>>> > (processed
>> > > > > > >>>> > > >> and
>> > > > > > >>>> > > >>> not-processed) I think is a bit of a strong
>> > requirement.
>> > > > This
>> > > > > > >>>> would
>> > > > > > >>>> > > imply
>> > > > > > >>>> > > >>> that for infinite sources the state will grow
>> > > > indefinitely.
>> > > > > > >>>> This is
>> > > > > > >>>> > > >> problem
>> > > > > > >>>> > > >>> is even more pronounced if we do not have a single
>> > > source
>> > > > that
>> > > > > > >>>> > assigns
>> > > > > > >>>> > > >>> splits to readers, as each reader will have its own
>> > copy
>> > > > of
>> > > > > > the
>> > > > > > >>>> > state.
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>> ii) it is true that for finite sources we need to
>> > > somehow
>> > > > not
>> > > > > > >>>> close
>> > > > > > >>>> > the
>> > > > > > >>>> > > >>> readers when the source/split discoverer finishes. The
>> > > > > > >>>> > > >>> ContinuousFileReaderOperator has a work-around for
>> > that.
>> > > > It is
>> > > > > > >>>> not
>> > > > > > >>>> > > >> elegant,
>> > > > > > >>>> > > >>> and checkpoints are not emitted after closing the
>> > > source,
>> > > > but
>> > > > > > >>>> this, I
>> > > > > > >>>> > > >>> believe, is a bigger problem which requires more
>> > changes
>> > > > than
>> > > > > > >>>> just
>> > > > > > >>>> > > >>> refactoring the source interface.
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>> Cheers,
>> > > > > > >>>> > > >>> Kostas
>> > > > > > >>>> > > >>>
>> > > > > > >>>> > > >>
>> > > > > > >>>> > >
>> > > > > > >>>> > >
>> > > > > > >>>> >
>> > > > > > >>>>
>> > > > > > >>>
>> > > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>>
>>
Reply | Threaded
Open this post in threaded view
|

Re: [DISCUSS] FLIP-27: Refactor Source Interface

dwysakowicz
In reply to this post by Stephan Ewen

Hi all,

Really well written proposal and very important one. I must admit I have not understood all the intricacies of it yet.

One question I have though is about where does the information about boundedness come from. I think in most cases it is a property of the source. As you described it might be e.g. end offset, a flag should it monitor new splits etc. I think it would be a really nice use case to be able to say:

new KafkaSource().readUntil(long timestamp), 

which could work as an "end offset". Moreover I think all Bounded sources support continuous mode, but no intrinsically continuous source support the Bounded mode. If I understood the proposal correctly it suggest the boundedness sort of "comes" from the outside of the source, from the invokation of either boundedStream or continousSource.

I am wondering if it would make sense to actually change the method

boolean Source#supportsBoundedness(Boundedness) 

to

Boundedness Source#getBoundedness(). 

As for the methods #boundedSource, #continousSource, assuming the boundedness is property of the source they do not affect how the enumerator works, but mostly how the dag is scheduled, right? I am not against those methods, but I think it is a very specific use case to actually override the property of the source. In general I would expect users to only call env.source(theSource), where the source tells if it is bounded or not. I would suggest considering following set of methods:

// boundedness depends of source property, imo this should always be preferred
DataStream<MyType> stream = env.source(theSource);
 
// always continous execution, whether bounded or unbounded source
DataStream<MyType> boundedStream = env.continousSource(theSource);
 
// imo this would make sense if the BoundedDataStream provides additional features unavailable for continous mode
BoundedDataStream<MyType> batch = env.boundedSource(theSource);

Best,

Dawid


On 04/12/2019 11:25, Stephan Ewen wrote:
Thanks, Becket, for updating this.

I agree with moving the aspects you mentioned into separate FLIPs - this
one way becoming unwieldy in size.

+1 to the FLIP in its current state. Its a very detailed write-up, nicely
done!

On Wed, Dec 4, 2019 at 7:38 AM Becket Qin [hidden email] wrote:

Hi all,

Sorry for the long belated update. I have updated FLIP-27 wiki page with
the latest proposals. Some noticeable changes include:
1. A new generic communication mechanism between SplitEnumerator and
SourceReader.
2. Some detail API method signature changes.

We left a few things out of this FLIP and will address them in separate
FLIPs. Including:
1. Per split event time.
2. Event time alignment.
3. Fine grained failover for SplitEnumerator failure.

Please let us know if you have any question.

Thanks,

Jiangjie (Becket) Qin

On Sat, Nov 16, 2019 at 6:10 AM Stephan Ewen [hidden email] wrote:

Hi  Łukasz!

Becket and me are working hard on figuring out the last details and
implementing the first PoC. We would update the FLIP hopefully next week.

There is a fair chance that a first version of this will be in 1.10, but
I
think it will take another release to battle test it and migrate the
connectors.

Best,
Stephan




On Fri, Nov 15, 2019 at 11:14 AM Łukasz Jędrzejewski [hidden email]
wrote:

          
Hi,

This proposal looks very promising for us. Do you have any plans in
which
Flink release it is going to be released? We are thinking on using a
Data
Set API for our future use cases but on the other hand Data Set API is
going to be deprecated so using proposed bounded data streams solution
could be more viable in the long term.

Thanks,
Łukasz

On 2019/10/01 15:48:03, Thomas Weise [hidden email] wrote:
Thanks for putting together this proposal!

I see that the "Per Split Event Time" and "Event Time Alignment"
sections
are still TBD.

It would probably be good to flesh those out a bit before proceeding
too
far
as the event time alignment will probably influence the interaction
with
the split reader, specifically ReaderStatus emitNext(SourceOutput<E>
output).

We currently have only one implementation for event time alignment in
the
Kinesis consumer. The synchronization in that case takes place as the
last
step before records are emitted downstream (RecordEmitter). With the
currently proposed interfaces, the equivalent can be implemented in
the
reader loop, although note that in the Kinesis consumer the per shard
threads push records.

Synchronization has not been implemented for the Kafka consumer yet.

https://issues.apache.org/jira/browse/FLINK-12675

When I looked at it, I realized that the implementation will look
quite
different
from Kinesis because it needs to take place in the pull part, where
records
are taken from the Kafka client. Due to the multiplexing it cannot be
done
by blocking the split thread like it currently works for Kinesis.
Reading
from individual Kafka partitions needs to be controlled via
pause/resume
on the Kafka client.

To take on that responsibility the split thread would need to be
aware
of
the
watermarks or at least whether it should or should not continue to
consume
a given split and this may require a different SourceReader or
SourceOutput
interface.

Thanks,
Thomas


On Fri, Jul 26, 2019 at 1:39 AM Biao Liu [hidden email] wrote:

Hi Stephan,

Thank you for feedback!
Will take a look at your branch before public discussing.


On Fri, Jul 26, 2019 at 12:01 AM Stephan Ewen [hidden email]
wrote:

                
Hi Biao!

Thanks for reviving this. I would like to join this discussion,
but
am
quite occupied with the 1.9 release, so can we maybe pause this
discussion
for a week or so?

In the meantime I can share some suggestion based on prior
experiments:
How to do watermarks / timestamp extractors in a simpler and more
flexible
way. I think that part is quite promising should be part of the
new
source
interface.



              

          

        
https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/eventtime



              

          

        
https://github.com/StephanEwen/flink/blob/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src/SourceOutput.java


Some experiments on how to build the source reader and its
library
for
common threading/split patterns:



              

          

        
https://github.com/StephanEwen/flink/tree/source_interface/flink-core/src/main/java/org/apache/flink/api/common/src

Best,
Stephan


On Thu, Jul 25, 2019 at 10:03 AM Biao Liu [hidden email]
wrote:

                  
Hi devs,

Since 1.9 is nearly released, I think we could get back to
FLIP-27.
I
believe it should be included in 1.10.

There are so many things mentioned in document of FLIP-27. [1] I
think
we'd better discuss them separately. However the wiki is not a
good
place
to discuss. I wrote google doc about SplitReader API which
misses
some
details in the document. [2]

1.


              

          

        
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27:+Refactor+Source+Interface
2.


              

          

        
https://docs.google.com/document/d/1R1s_89T4S3CZwq7Tf31DciaMCqZwrLHGZFqPASu66oE/edit?usp=sharing
CC Stephan, Aljoscha, Piotrek, Becket


On Thu, Mar 28, 2019 at 4:38 PM Biao Liu [hidden email]
wrote:

                    
Hi Steven,
Thank you for the feedback. Please take a look at the document
FLIP-27
<

              

          

        
https://cwiki.apache.org/confluence/display/FLINK/FLIP-27%3A+Refactor+Source+Interface

              
which
is updated recently. A lot of details of enumerator were added
in
this
document. I think it would help.

Steven Wu [hidden email] 于2019年3月28日周四 下午12:52写道:

This proposal mentioned that SplitEnumerator might run on the
JobManager or
in a single task on a TaskManager.

if enumerator is a single task on a taskmanager, then the job
DAG
can
never
been embarrassingly parallel anymore. That will nullify the
leverage
of
fine-grained recovery for embarrassingly parallel jobs.

It's not clear to me what's the implication of running
enumerator
on
the
jobmanager. So I will leave that out for now.

On Mon, Jan 28, 2019 at 3:05 AM Biao Liu [hidden email]
wrote:

                        
Hi Stephan & Piotrek,

Thank you for feedback.

It seems that there are a lot of things to do in community.
I
am
just
afraid that this discussion may be forgotten since there so
many
proposals
recently.
Anyway, wish to see the split topics soon :)

Piotr Nowojski [hidden email] 于2019年1月24日周四
下午8:21写道:

                          
Hi Biao!

This discussion was stalled because of preparations for
the
open
sourcing
& merging Blink. I think before creating the tickets we
should
split this
discussion into topics/areas outlined by Stephan and
create
Flips
for
that.
I think there is no chance for this to be completed in
couple
of
remaining
weeks/1 month before 1.8 feature freeze, however it would
be
good
to aim
with those changes for 1.9.

Piotrek

On 20 Jan 2019, at 16:08, Biao Liu [hidden email]
wrote:
Hi community,
The summary of Stephan makes a lot sense to me. It is
much
clearer
indeed
after splitting the complex topic into small ones.
I was wondering is there any detail plan for next step?
If
not,
I
would
like to push this thing forward by creating some JIRA
issues.
Another question is that should version 1.8 include
these
features?
Stephan Ewen [hidden email] 于2018年12月1日周六 上午4:20写道:

Thanks everyone for the lively discussion. Let me try
to
summarize
where I
see convergence in the discussion and open issues.
I'll try to group this by design aspect of the source.
Please
let me
know
if I got things wrong or missed something crucial here.

For issues 1-3, if the below reflects the state of the
discussion, I
would
try and update the FLIP in the next days.
For the remaining ones we need more discussion.

I would suggest to fork each of these aspects into a
separate
mail
thread,
or will loose sight of the individual aspects.

*(1) Separation of Split Enumerator and Split Reader*

 - All seem to agree this is a good thing
 - Split Enumerator could in the end live on JobManager
(and
assign
splits
via RPC) or in a task (and assign splits via data
streams)
 - this discussion is orthogonal and should come later,
when
the
interface
is agreed upon.

*(2) Split Readers for one or more splits*

 - Discussion seems to agree that we need to support
one
reader
that
possibly handles multiple splits concurrently.
 - The requirement comes from sources where one
poll()-style
call
fetches
data from different splits / partitions
   --> example sources that require that would be for
example
Kafka,
Pravega, Pulsar

 - Could have one split reader per source, or multiple
split
readers
that
share the "poll()" function
 - To not make it too complicated, we can start with
thinking
about
one
split reader for all splits initially and see if that
covers
all
requirements

*(3) Threading model of the Split Reader*

 - Most active part of the discussion ;-)

 - A non-blocking way for Flink's task code to interact
with
the
source
is
needed in order to a task runtime code based on a
single-threaded/actor-style task design
   --> I personally am a big proponent of that, it will
help
with
well-behaved checkpoints, efficiency, and simpler yet
more
robust
runtime
code

 - Users care about simple abstraction, so as a
subclass
of
SplitReader
(non-blocking / async) we need to have a
BlockingSplitReader
which
will
form the basis of most source implementations.
BlockingSplitReader
lets
users do blocking simple poll() calls.
 - The BlockingSplitReader would spawn a thread (or
more)
and
the
thread(s) can make blocking calls and hand over data
buffers
via
a
blocking
queue
 - This should allow us to cover both, a fully async
runtime,
and a
simple
blocking interface for users.
 - This is actually very similar to how the Kafka
connectors
work.
Kafka
9+ with one thread, Kafka 8 with multiple threads

 - On the base SplitReader (the async one), the
non-blocking
method
that
gets the next chunk of data would signal data
availability
via
a
CompletableFuture, because that gives the best
flexibility
(can
await
completion or register notification handlers).
 - The source task would register a "thenHandle()" (or
similar)
on the
future to put a "take next data" task into the
actor-style
mailbox
*(4) Split Enumeration and Assignment*

 - Splits may be generated lazily, both in cases where
there
is a
limited
number of splits (but very many), or splits are
discovered
over
time
 - Assignment should also be lazy, to get better load
balancing
 - Assignment needs support locality preferences

 - Possible design based on discussion so far:

   --> SplitReader has a method "addSplits(SplitT...)"
to
add
one or
more
splits. Some split readers might assume they have only
one
split
ever,
concurrently, others assume multiple splits. (Note:
idea
behind
being
able
to add multiple splits at the same time is to ease
startup
where
multiple
splits may be assigned instantly.)
   --> SplitReader has a context object on which it can
call
indicate
when
splits are completed. The enumerator gets that
notification and
can
use
to
decide when to assign new splits. This should help both
in
cases
of
sources
that take splits lazily (file readers) and in case the
source
needs to
preserve a partial order between splits (Kinesis,
Pravega,
Pulsar may
need
that).
   --> SplitEnumerator gets notification when
SplitReaders
start
and
when
they finish splits. They can decide at that moment to
push
more
splits
to
that reader
   --> The SplitEnumerator should probably be aware of
the
source
parallelism, to build its initial distribution.

 - Open question: Should the source expose something
like
"host
preferences", so that yarn/mesos/k8s can take this into
account
when
selecting a node to start a TM on?

*(5) Watermarks and event time alignment*

 - Watermark generation, as well as idleness, needs to
be
per
split
(like
currently in the Kafka Source, per partition)
 - It is desirable to support optional
event-time-alignment,
meaning
that
splits that are ahead are back-pressured or temporarily
unsubscribed
 - I think i would be desirable to encapsulate
watermark
generation
logic
in watermark generators, for a separation of concerns.
The
watermark
generators should run per split.
 - Using watermark generators would also help with
another
problem of
the
suggested interface, namely supporting non-periodic
watermarks
efficiently.
 - Need a way to "dispatch" next record to different
watermark
generators
 - Need a way to tell SplitReader to "suspend" a split
until a
certain
watermark is reached (event time backpressure)
 - This would in fact be not needed (and thus simpler)
if
we
had
a
SplitReader per split and may be a reason to re-open
that
discussion
*(6) Watermarks across splits and in the Split
Enumerator*
 - The split enumerator may need some watermark
awareness,
which
should
be
purely based on split metadata (like create timestamp
of
file
splits)
 - If there are still more splits with overlapping
event
time
range
for
a
split reader, then that split reader should not advance
the
watermark
within the split beyond the overlap boundary. Otherwise
future
splits
will
produce late data.

 - One way to approach this could be that the split
enumerator
may
send
watermarks to the readers, and the readers cannot emit
watermarks
beyond
that received watermark.
 - Many split enumerators would simply immediately send
Long.MAX
out
and
leave the progress purely to the split readers.

 - For event-time alignment / split back pressure, this
begs
the
question
how we can avoid deadlocks that may arise when splits
are
suspended
for
event time back pressure,

*(7) Batch and streaming Unification*

 - Functionality wise, the above design should support
both
 - Batch often (mostly) does not care about reading "in
order"
and
generating watermarks
   --> Might use different enumerator logic that is
more
locality
aware
and ignores event time order
   --> Does not generate watermarks
 - Would be great if bounded sources could be
identified
at
compile
time,
so that "env.addBoundedSource(...)" is type safe and
can
return a
"BoundedDataStream".
 - Possible to defer this discussion until later

*Miscellaneous Comments*

 - Should the source have a TypeInformation for the
produced
type,
instead
of a serializer? We need a type information in the
stream
anyways, and
can
derive the serializer from that. Plus, creating the
serializer
should
respect the ExecutionConfig.

 - The TypeSerializer interface is very powerful but
also
not
easy to
implement. Its purpose is to handle data super
efficiently,
support
flexible ways of evolution, etc.
 For metadata I would suggest to look at the
SimpleVersionedSerializer
instead, which is used for example for checkpoint
master
hooks,
or for
the
streaming file sink. I think that is is a good match
for
cases
where
we
do
not need more than ser/deser (no copy, etc.) and don't
need to
push
versioning out of the serialization paths for best
performance
(as in
the
TypeSerializer)


On Tue, Nov 27, 2018 at 11:45 AM Kostas Kloudas <
[hidden email]>
wrote:

Hi Biao,

Thanks for the answer!

So given the multi-threaded readers, now we have as
open
questions:
1) How do we let the checkpoints pass through our
multi-threaded
reader
operator?

2) Do we have separate reader and source operators or
not? In
the
strategy
that has a separate source, the source operator has a
parallelism of
1
and
is responsible for split recovery only.

For the first one, given also the constraints
(blocking,
finite
queues,
etc), I do not have an answer yet.

For the 2nd, I think that we should go with separate
operators
for
the
source and the readers, for the following reasons:

1) This is more aligned with a potential future
improvement
where the
split
discovery becomes a responsibility of the JobManager
and
readers are
pooling more work from the JM.

2) The source is going to be the "single point of
truth".
It
will
know
what
has been processed and what not. If the source and the
readers
are a
single
operator with parallelism > 1, or in general, if the
split
discovery
is
done by each task individually, then:
  i) we have to have a deterministic scheme for each
reader to
assign
splits to itself (e.g. mod subtaskId). This is not
necessarily
trivial
for
all sources.
  ii) each reader would have to keep a copy of all its
processed
slpits
  iii) the state has to be a union state with a
non-trivial
merging
logic
in order to support rescaling.

Two additional points that you raised above:

i) The point that you raised that we need to keep all
splits
(processed
and
not-processed) I think is a bit of a strong
requirement.
This
would
imply
that for infinite sources the state will grow
indefinitely.
This is
problem
is even more pronounced if we do not have a single
source
that
assigns
splits to readers, as each reader will have its own
copy
of
the
state.
ii) it is true that for finite sources we need to
somehow
not
close
the
readers when the source/split discoverer finishes. The
ContinuousFileReaderOperator has a work-around for
that.
It is
not
elegant,
and checkpoints are not emitted after closing the
source,
but
this, I
believe, is a bigger problem which requires more
changes
than
just
refactoring the source interface.

Cheers,
Kostas


                              


                        

                      

                    

              

            

          

        

      

    

signature.asc (849 bytes) Download Attachment
12345