(DEPRECATED) Apache Flink Mailing List archive.

expose side output stream

Classic

List

Threaded

5 messages Options

chenqin

expose side output stream

Hi there,

I am thinking of implement sideOutput into Flink which seems missing
support.
https://cloud.google.com/dataflow/model/par-do#side-outputs

It is useful because it will help pipeline author redirect corrputed input/
code bug to a side stream or write to a table and reconsile afterwards.

After some hack prototyping, I were able to get it works for simple tests.
Basically, It allows env to register a side output typeInfo which will be
passed to configurations during graph building; Adding a new transform
which similar to selection transform but holding different input type;
StreamEdge will has a boolean to see if that is side output edge, if so,
create output writer loads side output type serializer and emit record only
when sideOutput is called.

I have some problem passing side output type as template to each data
stream. It means it will have to expose any output stream with two type
parameters. As you can imagine, the API interface change will be sizable.

Any suggestion?

Chen

Stephan Ewen

Re: expose side output stream

Hi!

This is a very big change, both on the semantics, the runtime classes.
These changes are tricky to get in, and usually work best if you document
the changes and all implications well.

Something like a deep design doc, or a FLIP would be great for this.
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals

Greetings,
Stephan

On Thu, Aug 11, 2016 at 12:41 AM, Chen Qin <[hidden email]> wrote:

> Hi there,
>
> I am thinking of implement sideOutput into Flink which seems missing
> support.
> https://cloud.google.com/dataflow/model/par-do#side-outputs
>
> It is useful because it will help pipeline author redirect corrputed input/
> code bug to a side stream or write to a table and reconsile afterwards.
>
> After some hack prototyping, I were able to get it works for simple tests.
> Basically, It allows env to register a side output typeInfo which will be
> passed to configurations during graph building; Adding a new transform
> which similar to selection transform but holding different input type;
> StreamEdge will has a boolean to see if that is side output edge, if so,
> create output writer loads side output type serializer and emit record only
> when sideOutput is called.
>
> I have some problem passing side output type as template to each data
> stream. It means it will have to expose any output stream with two type
> parameters. As you can imagine, the API interface change will be sizable.
>
> Any suggestion?
>
> Chen
>

Chen Qin

Re: expose side output stream

Stephan

Humm... I see.
Back off one step, how do Flink deal with corrupted input data right now,
like a dead letter queue?

Thanks,
Chen

On Thu, Aug 11, 2016 at 5:40 AM, Stephan Ewen <[hidden email]> wrote:

> Hi!
>
> This is a very big change, both on the semantics, the runtime classes.
> These changes are tricky to get in, and usually work best if you document
> the changes and all implications well.
>
> Something like a deep design doc, or a FLIP would be great for this.
> https://cwiki.apache.org/confluence/display/FLINK/
> Flink+Improvement+Proposals
>
> Greetings,
> Stephan
>
>
> On Thu, Aug 11, 2016 at 12:41 AM, Chen Qin <[hidden email]> wrote:
>
> > Hi there,
> >
> > I am thinking of implement sideOutput into Flink which seems missing
> > support.
> > https://cloud.google.com/dataflow/model/par-do#side-outputs
> >
> > It is useful because it will help pipeline author redirect corrputed
> input/
> > code bug to a side stream or write to a table and reconsile afterwards.
> >
> > After some hack prototyping, I were able to get it works for simple
> tests.
> > Basically, It allows env to register a side output typeInfo which will be
> > passed to configurations during graph building; Adding a new transform
> > which similar to selection transform but holding different input type;
> > StreamEdge will has a boolean to see if that is side output edge, if so,
> > create output writer loads side output type serializer and emit record
> only
> > when sideOutput is called.
> >
> > I have some problem passing side output type as template to each data
> > stream. It means it will have to expose any output stream with two type
> > parameters. As you can imagine, the API interface change will be sizable.
> >
> > Any suggestion?
> >
> > Chen
> >
>

--
-Chen Qin

Aljoscha Krettek-2

Re: expose side output stream

In reply to this post by Stephan Ewen

Hi Chen,
could you maybe share the code that you have so far?

If you wan't you can start a google doc and then we can work together on
fleshing out an API/implementation that we can present to the Flink
community as a FLIP.

Cheers,
Aljoscha

On Thu, 11 Aug 2016 at 14:40 Stephan Ewen <[hidden email]> wrote:

> Hi!
>
> This is a very big change, both on the semantics, the runtime classes.
> These changes are tricky to get in, and usually work best if you document
> the changes and all implications well.
>
> Something like a deep design doc, or a FLIP would be great for this.
>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>
> Greetings,
> Stephan
>
>
> On Thu, Aug 11, 2016 at 12:41 AM, Chen Qin <[hidden email]> wrote:
>
> > Hi there,
> >
> > I am thinking of implement sideOutput into Flink which seems missing
> > support.
> > https://cloud.google.com/dataflow/model/par-do#side-outputs
> >
> > It is useful because it will help pipeline author redirect corrputed
> input/
> > code bug to a side stream or write to a table and reconsile afterwards.
> >
> > After some hack prototyping, I were able to get it works for simple
> tests.
> > Basically, It allows env to register a side output typeInfo which will be
> > passed to configurations during graph building; Adding a new transform
> > which similar to selection transform but holding different input type;
> > StreamEdge will has a boolean to see if that is side output edge, if so,
> > create output writer loads side output type serializer and emit record
> only
> > when sideOutput is called.
> >
> > I have some problem passing side output type as template to each data
> > stream. It means it will have to expose any output stream with two type
> > parameters. As you can imagine, the API interface change will be sizable.
> >
> > Any suggestion?
> >
> > Chen
> >
>

Chen Qin

Re: expose side output stream

Stephan & Aljoscha,

What I did was a API hack without much thoughts into architect at beginning :-)

patch attached, I think it would be "tricky" to achieve backward compatibility with current architect.

We might need to encapsulate "Collectors" to ProcessContext as starting point.

I would try to work on a FLIP next week.

Best Regards,

Chen

On Fri, Aug 12, 2016 at 2:13 AM, Aljoscha Krettek <[hidden email]> wrote:

Hi Chen,
could you maybe share the code that you have so far?

If you wan't you can start a google doc and then we can work together on
fleshing out an API/implementation that we can present to the Flink
community as a FLIP.

Cheers,
Aljoscha

On Thu, 11 Aug 2016 at 14:40 Stephan Ewen <[hidden email]> wrote:

> Hi!
>
> This is a very big change, both on the semantics, the runtime classes.
> These changes are tricky to get in, and usually work best if you document
> the changes and all implications well.
>
> Something like a deep design doc, or a FLIP would be great for this.
>
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
>
> Greetings,
> Stephan
>
>
> On Thu, Aug 11, 2016 at 12:41 AM, Chen Qin <[hidden email]> wrote:
>
> > Hi there,
> >
> > I am thinking of implement sideOutput into Flink which seems missing
> > support.
> > https://cloud.google.com/dataflow/model/par-do#side-outputs
> >
> > It is useful because it will help pipeline author redirect corrputed
> input/
> > code bug to a side stream or write to a table and reconsile afterwards.
> >
> > After some hack prototyping, I were able to get it works for simple
> tests.
> > Basically, It allows env to register a side output typeInfo which will be
> > passed to configurations during graph building; Adding a new transform
> > which similar to selection transform but holding different input type;
> > StreamEdge will has a boolean to see if that is side output edge, if so,
> > create output writer loads side output type serializer and emit record
> only
> > when sideOutput is called.
> >
> > I have some problem passing side output type as template to each data
> > stream. It means it will have to expose any output stream with two type
> > parameters. As you can imagine, the API interface change will be sizable.
> >
> > Any suggestion?
> >
> > Chen
> >
>

-Chen Qin