[DISCUSS] FLIP-105: Support to Interpret and Emit Changelog in Flink SQL

[DISCUSS] FLIP-105: Support to Interpret and Emit Changelog in Flink SQL

Jark Wu-2
Hi everyone,

I would like to start a discussion about how to support interpreting external
changelogs in Flink SQL, and how to emit changelogs from Flink SQL.

This topic has already been mentioned several times in the past. CDC
(Change Data Capture) data has become a very important kind of streaming
data. Connecting to CDC sources is a significant feature for Flink; it fills
a missing piece of Flink's stream processing.

In FLIP-105, we propose two approaches to achieve this.
One is introducing a new TableSource interface (higher priority);
the other is introducing new SQL syntax to interpret and emit changelogs.

FLIP-105:
https://docs.google.com/document/d/1onyIUUdWAHfr_Yd5nZOE7SOExBc6TiW5C4LiL5FrjtQ/edit#

Thanks for any feedback!

Best,
Jark

Re: [DISCUSS] FLIP-105: Support to Interpret and Emit Changelog in Flink SQL

Jark Wu-2
Hi everyone,

Since this FLIP was proposed, the community has discussed the first
approach at length: introducing new TableSource and TableSink interfaces to
support changelogs.
And yes, that is FLIP-95, which was accepted last week. So most of the
work has been merged into FLIP-95.

To achieve the goal of FLIP-105, there are still a few things to
discuss, chiefly how to connect to external CDC formats.
We propose to introduce two new formats: a Debezium format and a Canal format.
They are the most popular CDC tools according to the surveys on the user [1]
and user-zh [2] mailing lists.

I have updated the FLIP:
https://cwiki.apache.org/confluence/display/FLINK/FLIP-105%3A+Support+to+Interpret+and+Emit+Changelog+in+Flink+SQL

Feedback is welcome!

Best,
Jark

[1]:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/SURVEY-What-Change-Data-Capture-tools-are-you-using-td33569.html
[2]: http://apache-flink.147419.n8.nabble.com/SURVEY-CDC-td1910.html


On Fri, 14 Feb 2020 at 22:08, Jark Wu <[hidden email]> wrote:


Re: [DISCUSS] FLIP-105: Support to Interpret and Emit Changelog in Flink SQL

Kurt Young
One minor comment: is there any other encoding or format in Debezium? I'm
asking because the format name is debezium-json, and I'm wondering whether
plain debezium would be enough. This also applies to Canal.

Best,
Kurt


On Tue, Apr 7, 2020 at 11:49 AM Jark Wu <[hidden email]> wrote:


Re: [DISCUSS] FLIP-105: Support to Interpret and Emit Changelog in Flink SQL

Jark Wu-2
Hi Kurt,

The JSON encoding of Debezium can be configured to include or exclude the
message schema using the `value.converter.schemas.enable` property.
That's why we propose a `format.schema-include` property key to
configure how the JSON is parsed.
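
To illustrate why the parser needs to know about the schema setting, here is
a minimal Python sketch, not the proposed Flink implementation. It assumes
the usual Debezium envelope (`before`, `after`, `op`), which is nested under
`payload` when the schema is included. The function name, the
`schema_include` argument and the row-kind strings are hypothetical names
chosen for this example.

```python
import json


def debezium_to_changelog(message, schema_include=False):
    """Turn one Debezium JSON event into a list of (kind, row) entries."""
    event = json.loads(message)
    # Assumption: with schemas enabled, the change data sits under "payload".
    if schema_include:
        event = event["payload"]
    op = event["op"]
    before = event.get("before")
    after = event.get("after")
    # Assumed op codes: c = create, r = snapshot read, u = update, d = delete.
    if op in ("c", "r"):
        return [("INSERT", after)]
    if op == "u":
        return [("UPDATE_BEFORE", before), ("UPDATE_AFTER", after)]
    if op == "d":
        return [("DELETE", before)]
    raise ValueError("unknown Debezium op: %r" % (op,))


# The same update event, without and with the schema wrapper.
plain = json.dumps({
    "before": {"id": 1, "name": "a"},
    "after": {"id": 1, "name": "b"},
    "op": "u",
})
wrapped = json.dumps({"schema": {}, "payload": json.loads(plain)})

print(debezium_to_changelog(plain))
print(debezium_to_changelog(wrapped, schema_include=True))
```

Both calls should yield the same two entries. Parsing the wrapped message
without the flag fails on the missing top-level `op` field, which is the case
the property key is meant to avoid.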

Besides, the encoding format of Debezium is stable and unified across
different databases (MySQL, Oracle, SQL Server, DB2, PostgreSQL).
However, because of limitations of some databases, the CDC encoding for
a few of them is different (Cassandra and MongoDB).
If we want to support them in the future, we can introduce an optional
property key, e.g. `format.encoding-connector=mongodb`, to recognize this
special encoding.

Canal currently only supports capturing changes from MySQL, so there is
only one encoding in Canal. But both Canal and Debezium may evolve their
encodings in the future.
We can also introduce a `format.encoding-version` property in the future if
needed.
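
For comparison, a rough Python sketch of interpreting a Canal JSON message
as changelog entries. This is only an illustration under assumptions: the
field names `data`, `old` and `type` reflect my understanding of Canal's
JSON messages and are not spelled out in this thread, and the function name
and row-kind strings are hypothetical.

```python
import json


def canal_to_changelog(message):
    """Turn one Canal JSON message into a list of (kind, row) entries."""
    event = json.loads(message)
    kind = event["type"]
    rows = event.get("data") or []
    if kind == "INSERT":
        return [("INSERT", row) for row in rows]
    if kind == "DELETE":
        return [("DELETE", row) for row in rows]
    if kind == "UPDATE":
        changes = []
        old_rows = event.get("old") or []
        for new_row, old_fields in zip(rows, old_rows):
            # Assumption: "old" carries only the changed columns, so the
            # full previous row is the new row overlaid with the old values.
            old_row = dict(new_row)
            old_row.update(old_fields)
            changes.append(("UPDATE_BEFORE", old_row))
            changes.append(("UPDATE_AFTER", new_row))
        return changes
    raise ValueError("unsupported Canal type: %r" % (kind,))


update = json.dumps({
    "data": [{"id": "1", "name": "b"}],
    "old": [{"name": "a"}],
    "type": "UPDATE",
})
print(canal_to_changelog(update))
```

Unlike the Debezium sketch, one message may carry several rows, so a single
input can expand into many changelog entries.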

Best,
Jark


On Wed, 8 Apr 2020 at 14:26, Kurt Young <[hidden email]> wrote:


Re: [DISCUSS] FLIP-105: Support to Interpret and Emit Changelog in Flink SQL

Jark Wu-2
Hi,

After a short offline discussion with Kurt, it seems that I misunderstood
his question.
Kurt meant: is `format=debezium` enough, or should it be split into two
options, `format=debezium` and `format.encoding=json`?

Debezium supports not only JSON encoding but also Avro. Canal supports JSON
and Protobuf. So a single `format=debezium` is not enough (in the long
term).
The reasons I proposed a single option `format=debezium-json` instead of two:
 - It's simpler to write a single option instead of two; we also made this
design decision for "connector" and "version".
 - I didn't find a good name for separate option keys, because JSON is also
a format, not an encoding, but `format.format=json` is weird.

Hi everyone,

If there are no further concerns, I would like to start a voting thread
tomorrow.

Best,
Jark



On Wed, 8 Apr 2020 at 15:37, Jark Wu <[hidden email]> wrote:
