Hi there,
I am thinking of implement sideOutput into Flink which seems missing support. https://cloud.google.com/dataflow/model/par-do#side-outputs It is useful because it will help pipeline author redirect corrputed input/ code bug to a side stream or write to a table and reconsile afterwards. After some hack prototyping, I were able to get it works for simple tests. Basically, It allows env to register a side output typeInfo which will be passed to configurations during graph building; Adding a new transform which similar to selection transform but holding different input type; StreamEdge will has a boolean to see if that is side output edge, if so, create output writer loads side output type serializer and emit record only when sideOutput is called. I have some problem passing side output type as template to each data stream. It means it will have to expose any output stream with two type parameters. As you can imagine, the API interface change will be sizable. Any suggestion? Chen |
Hi!
This is a very big change, both on the semantics, the runtime classes. These changes are tricky to get in, and usually work best if you document the changes and all implications well. Something like a deep design doc, or a FLIP would be great for this. https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals Greetings, Stephan On Thu, Aug 11, 2016 at 12:41 AM, Chen Qin <[hidden email]> wrote: > Hi there, > > I am thinking of implement sideOutput into Flink which seems missing > support. > https://cloud.google.com/dataflow/model/par-do#side-outputs > > It is useful because it will help pipeline author redirect corrputed input/ > code bug to a side stream or write to a table and reconsile afterwards. > > After some hack prototyping, I were able to get it works for simple tests. > Basically, It allows env to register a side output typeInfo which will be > passed to configurations during graph building; Adding a new transform > which similar to selection transform but holding different input type; > StreamEdge will has a boolean to see if that is side output edge, if so, > create output writer loads side output type serializer and emit record only > when sideOutput is called. > > I have some problem passing side output type as template to each data > stream. It means it will have to expose any output stream with two type > parameters. As you can imagine, the API interface change will be sizable. > > Any suggestion? > > Chen > |
Stephan
Humm... I see. Back off one step, how do Flink deal with corrupted input data right now, like a dead letter queue? Thanks, Chen On Thu, Aug 11, 2016 at 5:40 AM, Stephan Ewen <[hidden email]> wrote: > Hi! > > This is a very big change, both on the semantics, the runtime classes. > These changes are tricky to get in, and usually work best if you document > the changes and all implications well. > > Something like a deep design doc, or a FLIP would be great for this. > https://cwiki.apache.org/confluence/display/FLINK/ > Flink+Improvement+Proposals > > Greetings, > Stephan > > > On Thu, Aug 11, 2016 at 12:41 AM, Chen Qin <[hidden email]> wrote: > > > Hi there, > > > > I am thinking of implement sideOutput into Flink which seems missing > > support. > > https://cloud.google.com/dataflow/model/par-do#side-outputs > > > > It is useful because it will help pipeline author redirect corrputed > input/ > > code bug to a side stream or write to a table and reconsile afterwards. > > > > After some hack prototyping, I were able to get it works for simple > tests. > > Basically, It allows env to register a side output typeInfo which will be > > passed to configurations during graph building; Adding a new transform > > which similar to selection transform but holding different input type; > > StreamEdge will has a boolean to see if that is side output edge, if so, > > create output writer loads side output type serializer and emit record > only > > when sideOutput is called. > > > > I have some problem passing side output type as template to each data > > stream. It means it will have to expose any output stream with two type > > parameters. As you can imagine, the API interface change will be sizable. > > > > Any suggestion? > > > > Chen > > > -- -Chen Qin |
In reply to this post by Stephan Ewen
Hi Chen,
could you maybe share the code that you have so far? If you wan't you can start a google doc and then we can work together on fleshing out an API/implementation that we can present to the Flink community as a FLIP. Cheers, Aljoscha On Thu, 11 Aug 2016 at 14:40 Stephan Ewen <[hidden email]> wrote: > Hi! > > This is a very big change, both on the semantics, the runtime classes. > These changes are tricky to get in, and usually work best if you document > the changes and all implications well. > > Something like a deep design doc, or a FLIP would be great for this. > > https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals > > Greetings, > Stephan > > > On Thu, Aug 11, 2016 at 12:41 AM, Chen Qin <[hidden email]> wrote: > > > Hi there, > > > > I am thinking of implement sideOutput into Flink which seems missing > > support. > > https://cloud.google.com/dataflow/model/par-do#side-outputs > > > > It is useful because it will help pipeline author redirect corrputed > input/ > > code bug to a side stream or write to a table and reconsile afterwards. > > > > After some hack prototyping, I were able to get it works for simple > tests. > > Basically, It allows env to register a side output typeInfo which will be > > passed to configurations during graph building; Adding a new transform > > which similar to selection transform but holding different input type; > > StreamEdge will has a boolean to see if that is side output edge, if so, > > create output writer loads side output type serializer and emit record > only > > when sideOutput is called. > > > > I have some problem passing side output type as template to each data > > stream. It means it will have to expose any output stream with two type > > parameters. As you can imagine, the API interface change will be sizable. > > > > Any suggestion? > > > > Chen > > > |
Stephan & Aljoscha, What I did was a API hack without much thoughts into architect at beginning :-) patch attached, I think it would be "tricky" to achieve backward compatibility with current architect. We might need to encapsulate "Collectors" to ProcessContext as starting point. I would try to work on a FLIP next week. Best Regards, Chen On Fri, Aug 12, 2016 at 2:13 AM, Aljoscha Krettek <[hidden email]> wrote: Hi Chen, -Chen Qin |
Free forum by Nabble | Edit this page |