[DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

[DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

Artsem Semianenka
Hi guys!

I'm working on an External Catalog for Confluent Kafka. The main idea is to
register an external catalog that provides the list of Kafka topics and lets
us execute SQL queries like:
Select * from kafka.topic_name

I'm going to retrieve the table schema from the Confluent Schema Registry. The
main disadvantage is that the topic name must match the name of the schema
subject in the Schema Registry (a prefix or postfix is accepted).
For example:
topic: test-topic-prod
schema subject: test-topic
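
Just to illustrate the lookup I have in mind (a rough sketch only; the class and the prefix/postfix configuration are made up for this example and are not an existing Flink API):

import java.util.List;
import java.util.Optional;

/**
 * Sketch: map a Kafka topic name to its Schema Registry subject by stripping a
 * configurable prefix/postfix, e.g. "test-topic-prod" -> "test-topic".
 */
public class SubjectResolver {

    private final String prefix;   // optional, e.g. ""
    private final String postfix;  // optional, e.g. "-prod"

    public SubjectResolver(String prefix, String postfix) {
        this.prefix = prefix;
        this.postfix = postfix;
    }

    /** Returns the subject if the stripped topic name is a known subject. */
    public Optional<String> resolve(String topic, List<String> knownSubjects) {
        String candidate = topic;
        if (!prefix.isEmpty() && candidate.startsWith(prefix)) {
            candidate = candidate.substring(prefix.length());
        }
        if (!postfix.isEmpty() && candidate.endsWith(postfix)) {
            candidate = candidate.substring(0, candidate.length() - postfix.length());
        }
        return knownSubjects.contains(candidate) ? Optional.of(candidate) : Optional.empty();
    }
}

With the postfix "-prod", the topic test-topic-prod from the example above resolves to the subject test-topic.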

I would like to contribute this solution to the main Flink branch and would
like to discuss the pros and cons of this approach.

Best regards,
Artsem

Re: [DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

dwysakowicz
Hi Artsem,

I think it totally makes sense to have a catalog for the Schema
Registry. It is also good to hear you want to contribute that. There are a
few important things to consider though:

1. The Catalog interface is currently under rework. You may take a look
at the corresponding FLIP-30 [1], and also have a look at the first PR
that introduces the basic interfaces [2]. I think it would be worth
considering those changes already. I'm cc'ing Xuefu, who is participating
in the Catalog integration efforts.

2. There is still an ongoing discussion about what properties we should
store for streaming tables and how (a sketch of the kind of properties in
question follows below). I think this might affect (but maybe doesn't have
to) the design of the Catalog [3]. I'm cc'ing Timo, who might give more
insight into whether those discussions should block the work on this catalog.
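
For illustration, such properties might look roughly like the following (the keys shown are only indicative of the current connector/format property style and are not a finalized set):

import java.util.HashMap;
import java.util.Map;

// Illustrative only: the kind of key/value properties a catalog might have to
// store for a streaming Kafka table. The exact keys and layout are exactly what
// is still being discussed.
public class StreamingTableProperties {
    public static Map<String, String> example() {
        Map<String, String> props = new HashMap<>();
        props.put("connector.type", "kafka");
        props.put("connector.topic", "test-topic-prod");
        props.put("connector.properties.bootstrap.servers", "localhost:9092");
        props.put("format.type", "avro");
        props.put("update-mode", "append");
        return props;
    }
}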

Best,

Dawid

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs

[2] https://github.com/apache/flink/pull/8007

[3]
https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit#heading=h.egn858cgizao


Re: [DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

Artsem Semianenka
Thank you, Dawid!
This is very helpful information. I will keep a close eye on the updates to
FLIP-30 and contribute whenever possible.
I guess I may create a Jira ticket for my proposal in which I describe the
idea and attach an intermediate pull request based on the current API (just
for the initial discussion). The final pull request will definitely be based
on the FLIP-30 API.

Best regards,
Artsem

--

Best regards,
Артем Семененко

Re: [DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

Rong Rong
Thanks, Artsem, for looking into this problem, and thanks, Dawid, for
bringing up the discussion on FLIP-30.

We've observed similar scenarios where we would also like to reuse the schema
registry both for Kafka streams and for the raw Kafka messages ingested into
our data lake.
FYI, another, more catalog-oriented document can be found here [1]. I do have
one question to follow up on Dawid's point (2): are we suggesting that
different Kafka topics (e.g. test-topic-prod, test-topic-non-prod, etc.) be
considered "views" of a logical table with a schema (e.g. test-topic)?

Also, it seems that a few of the FLIPs, like the FLIP-30 page, are not linked
from the main FLIP Confluence wiki page [2] for some reason.
I tried to fix that but it seems I don't have permission. Maybe someone can
take a look?

Thanks,
Rong


[1]
https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit#heading=h.xp424vn7ioei
[2]
https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals

Re: [DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

Jark Wu-2
Hi Rong,

Thanks for pointing out the missing FLIPs on the FLIP main page. I have added
all the missing FLIPs (incl. FLIP-14, FLIP-22, FLIP-29, FLIP-30, FLIP-31) to
the page.

I'm also including @[hidden email] <[hidden email]> and @Bowen
Li <[hidden email]> in the thread, as they are familiar with the latest
catalog design.

Thanks,
Jark

Re: [DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

bowen.li
Hi,

Thanks, Artsem and Rong, for bringing up the demand from the user perspective.
A Kafka/Confluent Schema Registry catalog would have a good use case in Flink.
We actually mentioned the potential of the Unified Catalog APIs for Kafka in
our talk a couple of weeks ago at Flink Forward SF [1], and we're glad to learn
you are interested in contributing. I think creating a JIRA ticket linked to
FLINK-11275 [2], and starting with discussions and a design, would help
advance the effort.

The most interesting part of the Confluent Schema Registry, from my point of
view, is the core idea of smoothing the real production experience and the
things built around it, including versioned schemas, schema evolution, and
compatibility checks. Introducing a Confluent-Schema-Registry-backed catalog
to Flink may also help our design benefit from those ideas.
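
For example, the registry client already exposes the version history and the compatibility checks such a catalog could build on. A rough sketch using the Confluent client (URL and subject name are just examples):

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaMetadata;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import org.apache.avro.Schema;

// Rough sketch: how a Schema-Registry-backed catalog could look at versioned
// schemas and run compatibility checks before accepting a table definition.
public class CompatibilityCheckSketch {
    public static void main(String[] args) throws Exception {
        SchemaRegistryClient client =
                new CachedSchemaRegistryClient("http://localhost:8081", 100);

        // Latest registered schema (and its version) for a subject.
        SchemaMetadata latest = client.getLatestSchemaMetadata("test-topic");
        Schema latestSchema = new Schema.Parser().parse(latest.getSchema());
        System.out.println("version " + latest.getVersion() + ": " + latestSchema.getFullName());

        // Would a new (e.g. user-declared) schema be compatible with the subject?
        Schema candidate = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Test\",\"fields\":[]}");
        System.out.println("compatible: " + client.testCompatibility("test-topic", candidate));
    }
}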

To add to Dawid's points: I assume the MVP for this project would be
supporting Kafka as streaming tables through the new catalog. FLIP-30 covers
both streaming and batch tables, so this work won't be blocked by the whole of
FLIP-30. I think as soon as we finish the table operation APIs, finalize
properties and formats, and connect the APIs to Calcite, this work can be
unblocked. Timo and Xuefu may have more to say.

[1]
https://www.slideshare.net/BowenLi9/integrating-flink-with-hive-flink-forward-sf-2019/23
[2] https://issues.apache.org/jira/browse/FLINK-11275

Re: [DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

Timo Walther-2
Hi Artsem,

having catalog support for the Confluent Schema Registry would be a great
addition. Although the implementation of FLIP-30 is still ongoing, we merged
the stable interfaces today [0]. This should unblock people from contributing
new catalog implementations, so you could already start designing an
implementation. The implementation could be unit tested for now, until it can
also be registered in a table environment for integration tests/end-to-end
tests.

I hope we can reuse the existing SQL Kafka connector and SQL Avro format?
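
For illustration, what a single catalog entry would have to boil down to with today's descriptors might look roughly like this (a sketch only; topic, servers, and fields are invented, and the exact descriptor calls may differ between Flink versions):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.descriptors.Avro;
import org.apache.flink.table.descriptors.Kafka;
import org.apache.flink.table.descriptors.Schema;

// Sketch: a catalog entry expressed with the existing SQL Kafka connector and
// Avro format descriptors (all concrete values are made up).
public class DescriptorSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);

        String avroSchema =
                "{\"type\":\"record\",\"name\":\"Test\",\"fields\":["
                        + "{\"name\":\"user_id\",\"type\":\"long\"},"
                        + "{\"name\":\"message\",\"type\":\"string\"}]}";

        tableEnv.connect(
                        new Kafka()
                                .version("universal")
                                .topic("test-topic-prod")
                                .property("bootstrap.servers", "localhost:9092"))
                .withFormat(new Avro().avroSchema(avroSchema))
                .withSchema(
                        new Schema()
                                .field("user_id", Types.LONG)
                                .field("message", Types.STRING))
                .inAppendMode()
                .registerTableSource("test_topic");
    }
}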

Looking forward to a JIRA issue and a little design document on how to
connect the APIs.

Thanks,
Timo

[0] https://github.com/apache/flink/pull/8007

Re: [DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

Artsem Semianenka
Thank you guys so much!

You've provided me with a lot of helpful information.
I've created the Jira ticket [1] and added an initial description with only
the main purpose of the new feature. A more detailed implementation
description will be added later.

Hi Rong, to tell the truth, my first idea was to use a predefined
prefix/postfix for the topic name and look up the mapping between topic and
schema subject. But the idea of a separate view of a logical table with a
schema looks more elegant and flexible.

Also, I thought about other approaches to define the mapping between the
topic and the schema subject in case they have different names: define the
"subject" as a part of the table definition:

Select * from kafka.topic.subject
or
Select * from kafka.topic#subject

If the subject is not defined, try to find a subject with the same name as
the topic.
If the subject is still not found, take the last message and try to infer the
schema (retrieve the schema id from the message and get the last registered
schema).
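
For that last fallback, the schema id can be read from the Confluent wire format of the record (one magic byte followed by a 4-byte schema id). A rough sketch, assuming the standard Confluent client:

import java.nio.ByteBuffer;

import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import org.apache.avro.Schema;

// Rough sketch of the fallback: read the schema id from a Confluent-serialized
// Kafka record (1 magic byte + 4-byte schema id) and look the schema up by id.
public class SchemaInference {

    public static Schema inferFromLastMessage(byte[] lastValue, SchemaRegistryClient client)
            throws Exception {
        ByteBuffer buffer = ByteBuffer.wrap(lastValue);
        if (buffer.get() != 0x0) {
            throw new IllegalArgumentException("Record is not in the Confluent wire format");
        }
        int schemaId = buffer.getInt();
        return client.getById(schemaId); // the writer schema registered under this id
    }
}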

But I see one disadvantage in all of these approaches: the subject name may
contain symbols that are not supported in SQL.

I will investigate how to escape the illegal symbols in the table name
definition.
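
One option could be Flink SQL's backtick-quoted identifiers, which might already cover such names. A sketch only; whether every problematic character is accepted still needs to be verified:

import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.TableEnvironment;

// Sketch: backtick-quoted identifiers as a possible way to reference topic and
// subject names that contain characters such as '-' or '#'.
public class QuotedIdentifierSketch {
    public static Table query(TableEnvironment tableEnv) {
        return tableEnv.sqlQuery(
                "SELECT * FROM kafka.`test-topic-prod`.`test-topic#subject`");
    }
}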

Thanks,
Artsem

[1] https://issues.apache.org/jira/browse/FLINK-11275


--

Best regards,
Артем Семененко

Re: [DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

Artsem Semianenka
Sorry guys, I attached the wrong Jira ticket link in the previous email. This
is the correct link:
https://issues.apache.org/jira/browse/FLINK-12256


--

Best regards,
Артем Семененко

Re: [DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

bowen.li
Great! I linked that JIRA to FLINK-11275
<https://issues.apache.org/jira/browse/FLINK-11275>, and put it alongside the
JIRAs for HiveCatalog and GenericHiveMetastoreCatalog.

I have some initial thoughts on the solution you described, but I'll wait
until a more complete Google design doc comes up, since this discussion is
about engaging community interest.

Re: [DISCUSS] [FLINK SQL] External catalog for Confluent Kafka

Artsem Semianenka
Hi guys!

I've created a first draft of the design document [1] and attached it to the
Jira ticket [2]. Let's continue the discussion in the ticket or in the
comments in the Google Doc.

[1]
https://docs.google.com/document/d/14thwgV2RY1AA9KgYztv_kLYSz4K1TckJ-YiGfkB5650/edit?usp=sharing
[2] https://issues.apache.org/jira/browse/FLINK-12256

Best regards,
Artsem

On Thu, 18 Apr 2019 at 23:57, Bowen Li <[hidden email]> wrote:

> Great! I linked that JIRA to FLINK-11275
> <https://issues.apache.org/jira/browse/FLINK-11275> , and put it along
> with
> JIRAs for HiveCatalog and GenericHiveMetastoreCatalog.
>
> I have some initial thoughts on the solution you described, but I'll wait
> till a more complete google design doc comes up, since this discussion is
> about engaging community interest.
>
> On Thu, Apr 18, 2019 at 8:39 AM Artsem Semianenka <[hidden email]>
> wrote:
>
> > Sorry guys I've attached the wrong link for Jira ticket in the
> > previous email. This is the correct link :
> > https://issues.apache.org/jira/browse/FLINK-12256
> >
> > On Thu, 18 Apr 2019 at 18:29, Artsem Semianenka <[hidden email]>
> > wrote:
> >
> > > Thank you guys so much!
> > >
> > > You provided me a lot of helpful information.
> > > I've created the Jira ticket[1] and added to it an initial description
> > > only with the main purpose of the new feature. More detailed
> > implementation
> > > description will be added further.
> > >
> > > Hi Rong, to tell the truth, my first idea was to use some predefined
> > > prefix/postfix for topic name and lookup mapping between
> > > topic/schema-subject.  But the idea with a separated view of a logical
> > > table with schema looks more elegant and flexible.
> > >
> > > Also, I thought about other approaches on how to define the mapping
> > > between topic and schema-subject in case if they have different names:
> > > Define the "subject" as a part of the table definition:
> > >
> > > Select * from kafka.topic.subject
> > > or
> > > Select * from kafka.topic#subject
> > >
> > > In case if the subject is not defined try to find a subject with the
> same
> > > name as a topic.
> > > If the subject still not found  -  take one last message and try to
> infer
> > > the schema ( retrieve schema id from the message and get last defined
> > > schema)
> > >
> > > But I see one disadvantage for all of these approaches: the subject
> name
> > > may contain not supported in SQL symbols.
> > >
> > > I try to investigate how to escape the illegal symbols in the table
> name
> > > definition.
> > >
> > > Thanks,
> > > Artsem
> > >
> > > [1] https://issues.apache.org/jira/browse/FLINK-11275
> > >
> > > On Thu, 18 Apr 2019 at 11:54, Timo Walther <[hidden email]> wrote:
> > >
> > >> Hi Artsem,
> > >>
> > >> having catalog support for the Confluent Schema Registry would be a great
> > >> addition. Although the implementation of FLIP-30 is still ongoing, we
> > >> merged the stable interfaces today [0]. This should unblock people from
> > >> contributing new catalog implementations. So you could already start
> > >> designing an implementation. The implementation could be unit tested for
> > >> now until it can also be registered in a table environment for
> > >> integration tests/end-to-end tests.
> > >>
> > >> I hope we can reuse the existing SQL Kafka connector and SQL Avro
> > >> format?
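
A rough sketch of that reuse, assuming the current descriptor API (the Kafka
connector, Avro format and Schema descriptors); the topic name, schema JSON and
field list are hard-coded here only for illustration, whereas the catalog would
derive them from the Schema Registry:

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.descriptors.Avro;
import org.apache.flink.table.descriptors.Kafka;
import org.apache.flink.table.descriptors.Schema;

public class KafkaRegistryTableSketch {

    // Illustrative only: the catalog would generate the equivalent
    // connector/format properties per topic instead of hard-coding them.
    public static void registerTopicAsTable(StreamTableEnvironment tableEnv,
                                            String topic,
                                            String avroSchemaJson) {
        tableEnv
            .connect(new Kafka()
                .version("universal")
                .topic(topic)
                .property("bootstrap.servers", "localhost:9092"))
            .withFormat(new Avro().avroSchema(avroSchemaJson))
            .withSchema(new Schema()
                .field("id", Types.LONG)
                .field("name", Types.STRING))
            .inAppendMode()
            .registerTableSource(topic.replace('-', '_'));
    }
}

The catalog would effectively do this per topic on behalf of the user, so the
existing connector and format code paths stay untouched.
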
> > >>
> > >> Looking forward to a JIRA issue and a little design document on how to
> > >> connect the APIs.
> > >>
> > >> Thanks,
> > >> Timo
> > >>
> > >> [0] https://github.com/apache/flink/pull/8007
> > >>
> > >> On 18.04.19 at 07:03, Bowen Li wrote:
> > >> > Hi,
> > >> >
> > >> > Thanks Artsem and Rong for bringing up the demand from the user
> > >> > perspective. A Kafka/Confluent Schema Registry catalog would have a good
> > >> > use case in Flink. We actually mentioned the potential of the Unified
> > >> > Catalog APIs for Kafka in our talk a couple of weeks ago at Flink
> > >> > Forward SF [1], and we are glad to learn you are interested in
> > >> > contributing. I think creating a JIRA ticket linked from FLINK-11275
> > >> > [2], and starting with discussion and design, would help advance the
> > >> > effort.
> > >> >
> > >> > The most interesting part of the Confluent Schema Registry, from my
> > >> > point of view, is the core idea of smoothing the real production
> > >> > experience and the things built around it, including versioned schemas,
> > >> > schema evolution, compatibility checks, etc. Introducing a
> > >> > confluent-schema-registry-backed catalog to Flink may also help our
> > >> > design benefit from those ideas.
> > >> >
> > >> > To add to Dawid's points: I assume the MVP for this project would be
> > >> > supporting Kafka as streaming tables through the new catalog. FLIP-30
> > >> > covers both streaming and batch tables, so this work won't be blocked by
> > >> > the whole of FLIP-30. I think as soon as we finish the table operation
> > >> > APIs, finalize properties and formats, and connect the APIs to Calcite,
> > >> > this work can be unblocked. Timo and Xuefu may have more things to say.
> > >> >
> > >> > [1]
> > >> >
> > >>
> >
> https://www.slideshare.net/BowenLi9/integrating-flink-with-hive-flink-forward-sf-2019/23
> > >> > [2] https://issues.apache.org/jira/browse/FLINK-11275
> > >> >
> > >> > On Wed, Apr 17, 2019 at 6:39 PM Jark Wu <[hidden email]> wrote:
> > >> >
> > >> >> Hi Rong,
> > >> >>
> > >> >> Thanks for pointing out the missing FLIPs in the FLIP main page. I
> > >> >> added all the missing FLIPs (incl. FLIP-14, FLIP-22, FLIP-29, FLIP-30
> > >> >> and FLIP-31) to the page.
> > >> >>
> > >> >> I've also included @[hidden email] <[hidden email]> and
> > >> >> @Bowen Li <[hidden email]> in the thread; they are familiar with the
> > >> >> latest catalog design.
> > >> >>
> > >> >> Thanks,
> > >> >> Jark
> > >> >>
> > >> >> On Thu, 18 Apr 2019 at 02:39, Rong Rong <[hidden email]>
> wrote:
> > >> >>
> > >> >>> Thanks Artsem for looking into this problem, and thanks Dawid for
> > >> >>> bringing up the discussion on FLIP-30.
> > >> >>>
> > >> >>> We've observed similar scenarios where we would also like to reuse
> > >> >>> the schema registry for both Kafka streams and the raw ingested Kafka
> > >> >>> messages in the data lake.
> > >> >>> FYI, another more catalog-oriented document can be found here [1]. I
> > >> >>> do have one question to follow up on Dawid's point (2): are we
> > >> >>> suggesting that different Kafka topics (e.g. test-topic-prod,
> > >> >>> test-topic-non-prod, etc.) are considered views of a logical table
> > >> >>> with a schema (e.g. test-topic)?
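
As one purely illustrative reading of that question: the environment-specific
topics could be folded into a single logical table name by stripping a known,
configurable prefix/postfix. A minimal sketch (the suffix list here is an
assumption, not something Flink or Confluent define):

import java.util.Arrays;
import java.util.List;

public class LogicalTableNames {

    // Assumed, configurable environment suffixes; longest first so that
    // "-non-prod" wins over "-prod".
    private static final List<String> ENV_SUFFIXES =
        Arrays.asList("-non-prod", "-prod", "-dev");

    /** Maps a physical topic (e.g. "test-topic-prod") to its logical name ("test-topic"). */
    public static String toLogicalName(String topic) {
        for (String suffix : ENV_SUFFIXES) {
            if (topic.endsWith(suffix)) {
                return topic.substring(0, topic.length() - suffix.length());
            }
        }
        return topic; // no known suffix: topic name and subject are assumed to match
    }

    public static void main(String[] args) {
        System.out.println(toLogicalName("test-topic-prod"));     // test-topic
        System.out.println(toLogicalName("test-topic-non-prod")); // test-topic
    }
}
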
> > >> >>>
> > >> >>> Also, it seems that a few of the FLIPs, like the FLIP-30 page, are
> > >> >>> not linked from the main FLIP Confluence wiki page [2] for some
> > >> >>> reason. I tried to fix that but it seems I don't have permission.
> > >> >>> Maybe someone can also take a look?
> > >> >>>
> > >> >>> Thanks,
> > >> >>> Rong
> > >> >>>
> > >> >>>
> > >> >>> [1]
> > >> >>>
> > >> >>>
> > >>
> >
> https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit#heading=h.xp424vn7ioei
> > >> >>> [2]
> > >> >>>
> > >> >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/Flink+Improvement+Proposals
> > >> >>>
> > >> >>> On Wed, Apr 17, 2019 at 2:30 AM Artsem Semianenka <
> > >> [hidden email]
> > >> >>> wrote:
> > >> >>>
> > >> >>>> Thank you, Dawid!
> > >> >>>> This is very helpful information. I will keep a close eye on the
> > >> >>> updates of
> > >> >>>> FLIP-30 and contribute whenever it possible.
> > >> >>>> I guess I may create a Jira ticket for my proposal in which I
> > >> describe
> > >> >>> the
> > >> >>>> idea and attach intermediate pull request based on current
> API(just
> > >> for
> > >> >>>> initial discuss). But the final pull request definitely will be
> > >> based on
> > >> >>>> FLIP-30 API.
> > >> >>>>
> > >> >>>> Best regards,
> > >> >>>> Artsem
> > >> >>>>
> > >> >>>> On Wed, 17 Apr 2019 at 09:36, Dawid Wysakowicz <
> > >> [hidden email]>
> > >> >>>> wrote:
> > >> >>>>
> > >> >>>>> Hi Artsem,
> > >> >>>>>
> > >> >>>>> I think it totally makes sense to have a catalog for the Schema
> > >> >>>>> Registry. It is also good to hear you want to contribute that.
> > There
> > >> >>> is
> > >> >>>>> few important things to consider though:
> > >> >>>>>
> > >> >>>>> 1. The Catalog interface is currently under rework. You make
> take
> > a
> > >> >>> look
> > >> >>>>> at the corresponding FLIP-30[1], and also have a look at the
> first
> > >> PR
> > >> >>>>> that introduces the basic interfaces[2]. I think it would be
> worth
> > >> to
> > >> >>>>> already consider those changes. I cc Xuefu who is participating
> in
> > >> the
> > >> >>>>> efforts of Catalog integration.
> > >> >>>>>
> > >> >>>>> 2. There is still ongoing discussion about what properties
> should
> > we
> > >> >>>>> store for streaming tables and how. I think this might affect
> (but
> > >> >>> maybe
> > >> >>>>> doesn't have to) the design of the Catalog.[3] I cc Timo who
> might
> > >> >>> give
> > >> >>>>> more insights if those should be blocking for the work around
> this
> > >> >>>> Catalog.
> > >> >>>>> Best,
> > >> >>>>>
> > >> >>>>> Dawid
> > >> >>>>>
> > >> >>>>> [1]
> > >> >>>>>
> > >> >>>>>
> > >> >>>
> > >>
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs
> > >> >>>>> [2] https://github.com/apache/flink/pull/8007
> > >> >>>>>
> > >> >>>>> [3]
> > >> >>>>>
> > >> >>>>>
> > >> >>>
> > >>
> >
> https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit#heading=h.egn858cgizao
> > >> >>>>> On 16/04/2019 17:35, Artsem Semianenka wrote:
> > >> >>>>>> Hi guys!
> > >> >>>>>>
> > >> >>>>>> I'm working on External Catalog for Confluent Kafka. The main
> > idea
> > >> >>> to
> > >> >>>>>> register the external catalog which provides the list of Kafka
> > >> >>> topics
> > >> >>>> and
> > >> >>>>>> execute SQL queries like :
> > >> >>>>>> Select * form kafka.topic_name
> > >> >>>>>>
> > >> >>>>>> I'm going to receive the table schema from Confluent schema
> > >> >>> registry.
> > >> >>>> The
> > >> >>>>>> main disadvantage is: we should have the topic name with the
> same
> > >> >>> name
> > >> >>>>>> (prefix and postfix are accepted ) as this schema subject in
> > Schema
> > >> >>>>>> Registry.
> > >> >>>>>> For example :
> > >> >>>>>> topic: test-topic-prod
> > >> >>>>>> schema subject: test-topic
> > >> >>>>>>
> > >> >>>>>> I would like to contribute this solution into the main Flink
> > branch
> > >> >>> and
> > >> >>>>>> would like to discuss the pros and cons of this approach.
> > >> >>>>>>
> > >> >>>>>> Best regards,
> > >> >>>>>> Artsem
> > >> >>>>>>
> > >> >>>>>
> > >> >>>> --
> > >> >>>>
> > >> >>>> Best regards,
> > >> >>>> Artsem Semianenka
> > >> >>>>
> > >>
> > >>
> > >
> > > --
> > >
> > > Best regards,
> > > Artsem Semianenka
> > >
> >
> >
> > --
> >
> > Best regards,
> > Artsem Semianenka
> >
>


--

Best regards,
Artsem Semianenka