(DEPRECATED) Apache Flink Mailing List archive.

[DISCUSS] FLIP-57 - Rework FunctionCatalog

73 messages

bowen.li

[DISCUSS] FLIP-57 - Rework FunctionCatalog

Hi folks,

I'd like to kick off a discussion on reworking Flink's FunctionCatalog.
It's critically helpful to improve function usability in SQL.

https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing

In short, it:
- adds support for precise function reference with fully/partially
qualified name
- redefines function resolution order for ambiguous function reference
- adds support for Hive's rich built-in functions (support for Hive user
defined functions was already added in 1.9.0)
- clarifies the concept of temporary functions
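
To make the first bullet a bit more concrete, here is a minimal sketch of how a partially qualified function name could be expanded into a fully qualified one. The class name `FunctionNameQualifier` and the rule of filling missing parts from the current catalog and database are assumptions for illustration only, not code from the design doc.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Flink: expands "func", "db.func" or
// "cat.db.func" into a [catalog, database, function] triple.
final class FunctionNameQualifier {
    private final String currentCatalog;
    private final String currentDatabase;

    FunctionNameQualifier(String currentCatalog, String currentDatabase) {
        this.currentCatalog = currentCatalog;
        this.currentDatabase = currentDatabase;
    }

    List<String> qualify(String name) {
        String[] parts = name.split("\\.");
        switch (parts.length) {
            case 1:
                // Bare name: an ambiguous reference. Assumption: this is only the
                // catalog-function fallback, tried after any higher-priority lookups.
                return Arrays.asList(currentCatalog, currentDatabase, parts[0]);
            case 2:
                // Partially qualified: the database is given, the catalog is implied.
                return Arrays.asList(currentCatalog, parts[0], parts[1]);
            case 3:
                // Fully qualified: nothing to fill in.
                return Arrays.asList(parts[0], parts[1], parts[2]);
            default:
                throw new IllegalArgumentException("Invalid function name: " + name);
        }
    }
}
```

With a current catalog `mycat` and database `mydb`, `db2.f` would become `mycat.db2.f`, while `c.d.f` is left unchanged.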

Would love to hear your thoughts.

Bowen

bowen.li

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Thanks everyone for the feedback.

I have updated the document accordingly. Here is a summary of the changes:

- clarify the concept of temporary functions, to facilitate deciding
function resolution order
- provide two options to support Hive built-in functions, with the 2nd one
being preferred
- add detailed prototype code for FunctionCatalog#lookupFunction(name)
- move the section of "rename existing FunctionCatalog APIs in favor of
temporary functions" out of the scope of the FLIP
- add another reasonable limitation for function resolution, to not
consider resolving overloaded functions - those with the same name but
different params. (It's still valid to have a single function with
overloaded eval() methods)
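
To illustrate the last point, here is a rough sketch in plain Java. `RepeatFunction` stands in for a user-defined function (in Flink it would extend `org.apache.flink.table.functions.ScalarFunction`), and `NameOnlyFunctionRegistry` is a hypothetical registry, not the FunctionCatalog prototype from the doc. Rejecting a second registration under the same name is an assumption of this sketch.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a UDF: one function with several eval() overloads.
final class RepeatFunction {
    public String eval(String s) {
        return eval(s, 2);
    }

    public String eval(String s, int times) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < times; i++) {
            sb.append(s);
        }
        return sb.toString();
    }
}

// Hypothetical registry that resolves by name only: parameters play no
// role in the lookup, so two functions cannot share a name.
final class NameOnlyFunctionRegistry {
    private final Map<String, Object> functionsByName = new HashMap<>();

    void register(String name, Object function) {
        if (functionsByName.containsKey(name)) {
            // Assumption: a clash is rejected instead of silently overwritten.
            throw new IllegalStateException("Function already registered: " + name);
        }
        functionsByName.put(name, function);
    }

    Object lookup(String name) {
        return functionsByName.get(name);
    }
}
```

The overload choice between `eval(String)` and `eval(String, int)` happens inside the single registered function, after name resolution is already done.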

Please take another look.

Thanks,
Bowen

On Tue, Aug 27, 2019 at 11:49 AM Bowen Li <[hidden email]> wrote:

> Hi folks,
>
> I'd like to kick off a discussion on reworking Flink's FunctionCatalog.
> It's critically helpful to improve function usability in SQL.
>
>
> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
>
> In short, it:
> - adds support for precise function reference with fully/partially
> qualified name
> - redefines function resolution order for ambiguous function reference
> - adds support for Hive's rich built-in functions (support for Hive user
> defined functions was already added in 1.9.0)
> - clarifies the concept of temporary functions
>
> Would love to hear your thoughts.
>
> Bowen
>

Kurt Young

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Thanks Bowen for driving this.

+1 for the general idea. It makes function resolution behavior clearer
and more deterministic. Besides, users can use all Hive built-in
functions, which is a great feature.

I only have one comment, but it may touch your design, so I think
it makes sense to reply to this mail instead of commenting on the Google doc.
Regarding the classification of functions, you currently have 4 types
of functions, which are:
1. temporary functions
2. Flink built-in functions
3. Hive built-in functions (or generalized as external built-in functions)
4. catalog functions

What I want to propose is that we merge #3 and #4 and put them both under
the "catalog" concept, by extending the catalog function concept so that a
catalog can also have built-in catalog functions. Some benefits I can see
from this approach:
1. We don't have to introduce a new concept like external built-in functions.
Actually, I don't see a full story about how to treat such built-in
functions, and the concept seems to conflict a little with catalogs. As a
result, you have to add restrictions like "hive built-in functions can only
be used when current catalog is hive catalog".

2. It makes it easier for us to adopt another system's built-in functions in
Flink, such as MySQL's. If we don't treat "external built-in functions" and
"external catalog functions" uniformly, cases like a user setting the current
catalog to Hive but wanting to use MySQL's built-in functions will come up.

One more thing: following this approach, the answer to your question about
how to support external built-in functions becomes clear, which is "add a
getBuiltInFunction to current Catalog API".

What do you think?

Best,
Kurt

On Fri, Aug 30, 2019 at 7:14 AM Bowen Li <[hidden email]> wrote:

> Thanks everyone for the feedback.
>
> I have updated the document accordingly. Here're the summary of changes:
>
> - clarify the concept of temporary functions, to facilitate deciding
> function resolution order
> - provide two options to support Hive built-in functions, with the 2nd one
> being preferred
> - add detailed prototype code for FunctionCatalog#lookupFunction(name)
> - move the section of ”rename existing FunctionCatalog APIs in favor of
> temporary functions“ out of the scope of the FLIP
> - add another reasonable limitation for function resolution, to not
> consider resolving overloaded functions - those with the same name but
> different params. (It's still valid to have a single function with
> overloaded eval() methods)
>
> Please take another look.
>
> Thanks,
> Bowen

JingsongLee-2

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Thanks Bowen,

+1 for this. And +1 to Kurt's suggestion. My other points are:

1. Hive built-in functions are an intermediate solution, so we should
not introduce interfaces that influence the framework. To make
Flink itself more powerful, we should implement the functions
we need in Flink.

2. With non-Flink built-in functions, behavior can easily change under
users. If we support some Flink built-in functions in the
future that act differently from the non-Flink built-ins, this will lead to
changes in the behavior users see.

3. Falling back to non-Flink built-in functions is a bad choice for
performance. Without Flink's internal codegen and data format,
and with the added data format conversion, the performance is not so
good.

We need to support more complete Hive jobs now, so we need to
have this fallback strategy. But it's not worth adding this
concept at the catalog interface level, and it's not worth
encouraging other catalogs to do so.

Another question: does this fallback include all
Hive built-in functions? As far as I know, some Hive functions
are somewhat hacky. If possible, can we start with a whitelist?
Once we implement some functions as Flink built-ins, we can
also update the whitelist.
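
A minimal sketch of what I mean by the whitelist, assuming a hypothetical class `WhitelistedFallback` (nothing like this exists in Flink today). Flink built-ins are checked first, and only whitelisted Hive functions are eligible for the fallback.

```java
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch: decides which system should serve a function name.
final class WhitelistedFallback {
    private final Set<String> flinkBuiltIns;
    private final Set<String> hiveWhitelist;

    WhitelistedFallback(Set<String> flinkBuiltIns, Set<String> hiveWhitelist) {
        this.flinkBuiltIns = new HashSet<>(flinkBuiltIns);
        this.hiveWhitelist = new HashSet<>(hiveWhitelist);
    }

    // Returns "flink", "hive", or empty if the name is not served at all.
    Optional<String> resolve(String name) {
        if (flinkBuiltIns.contains(name)) {
            return Optional.of("flink");
        }
        if (hiveWhitelist.contains(name)) {
            return Optional.of("hive");
        }
        return Optional.empty();
    }

    // Called once Flink gains a native implementation of a function.
    void promoteToFlinkBuiltIn(String name) {
        flinkBuiltIns.add(name);
        hiveWhitelist.remove(name);
    }
}
```

Hive functions that are not on the whitelist are simply not resolvable, which keeps the hacky ones out until someone has reviewed them.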

Best,
Jingsong Lee

------------------------------------------------------------------
From:Kurt Young <[hidden email]>
Send Time: Tuesday, September 3, 2019, 15:41
To:dev <[hidden email]>
Subject:Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Thanks Bowen for driving this.

+1 for the general idea. It makes the function resolved behavior more
clear and deterministic. Besides, the user can use all hive built-in
functions, which is a great feature.

I only have one comment, but maybe it may touch your design so I think
it would make sense to reply this mail instead of comment on google doc.
Regarding to the classfication of functions, you currently have 4 types
of functions, which are:
1. temporary functions
2. Flink built-in functions
3. Hive built-in functions (or generalized as external built-in functions)
4. catalog functions

What I want to propose is we can merge #3 and #4, make them both under
"catalog" concept, by extending catalog function to make it have ability to
have built-in catalog functions. Some benefits I can see from this approach:
1. We don't have to introduce new concept like external built-in functions.
Actually
I don't see a full story about how to treat a built-in functions, and it
seems a little
bit disrupt with catalog. As a result, you have to make some restriction
like "hive
built-in functions can only be used when current catalog is hive catalog".

2. It makes us easier to adopt another system's built-in functions to
Flink, such as
MySQL. If we don't treat uniformly with "external built-in functions" and
"external
catalog function", things like user set current catalog to hive but want to
use MySQL's
built-in function will happen.

One more thing, follow this approach, it's clear for your question about
how to support
external built-in functions, which is "add a getBuiltInFunction to current
Catalog API".

What do you think?

Best,
Kurt


Timo Walther-2

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Hi Bowen,

thanks for your proposal. Here are some thoughts:

1) We should not have the restriction "hive built-in functions can only
be used when current catalog is hive catalog". Switching a catalog
should only have implications on the cat.db.object resolution but not
functions. It would be quite convenient for users to use Hive built-ins
even if they use a Confluent schema registry or just the in-memory catalog.

2) I would propose to have separate concepts for catalog and built-in
functions. In particular it would be nice to modularize built-in
functions. Some built-in functions are very crucial (like AS, CAST,
MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe
we add more experimental functions in the future or functions for some
special application area (Geo functions, ML functions). A data platform
team might not want to make every built-in function available. Or a
function module like ML functions is in a different Maven module.

3) Following the suggestion above, we can have a separate discovery
mechanism for built-in functions. Instead of just going through a static
list like in BuiltInFunctionDefinitions, a platform team should be able
to select function modules like
catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
HiveFunctions) or via service discovery.

4) Dawid and I discussed the resolution order again. I agree with Kurt
that we should unify built-in functions (external or internal) under a
common layer. However, the resolution order should be:
   1. built-in functions
   2. temporary functions
   3. regular catalog resolution logic
Otherwise a temporary function could cause clashes with Flink's built-in
functions. If you take a look at other vendors, like SQL Server, they
also do not allow overwriting built-in functions.
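
As a rough sketch of the resolution order I have in mind: the class `OrderedFunctionResolver` and its string-valued "implementations" are hypothetical placeholders, not existing Flink code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical resolver for ambiguous (unqualified) function references.
final class OrderedFunctionResolver {
    private final Map<String, String> builtIn = new HashMap<>();
    private final Map<String, String> temporary = new HashMap<>();
    private final Map<String, String> catalog = new HashMap<>();

    void registerBuiltIn(String name, String impl) { builtIn.put(name, impl); }
    void registerTemporary(String name, String impl) { temporary.put(name, impl); }
    void registerCatalog(String name, String impl) { catalog.put(name, impl); }

    Optional<String> resolve(String name) {
        // 1. built-in functions always win, so they cannot be overwritten
        if (builtIn.containsKey(name)) {
            return Optional.of(builtIn.get(name));
        }
        // 2. temporary functions
        if (temporary.containsKey(name)) {
            return Optional.of(temporary.get(name));
        }
        // 3. regular catalog resolution (reduced to a single map here)
        return Optional.ofNullable(catalog.get(name));
    }
}
```

With this order, a temporary function that reuses a built-in name is simply never reached.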

Regards,
Timo

On 03.09.19 10:35, JingsongLee wrote:

> Thanks Bowen:
>
> +1 for this. And +1 to Kurt's suggestion. My other points are:
>
> 1.Hive built-in functions is an intermediate solution. So we should
> not introduce interfaces to influence the framework. To make
> Flink itself more powerful, we should implement the functions
> we need to add.
>
> 2.Non-flink built-in functions are easy for users to change their
> behavior. If we support some flink built-in functions in the
> future but act differently from non-flink built-in, this will lead to
> changes in user behavior.
>
> 3.Fallback to Non-flink built-in functions is a bad choice to
> performance. Without flink internal codegen and data format,
> and bring data format conversion, the performance is not so
> good.
>
> We need to support more complete hive jobs now, we need to
> have this fallback strategy. But it's not worth adding this
> concept at the catalog interface level, and it's not worth
> encouraging other catalogs to do so.
>
> Another question is, does this fallback include all
> hive built-in functions? As far as I know, some hive functions
> have some hacky. If possible, can we start with a white list?
> Once we implement some functions to flink built-in, we can
> also update the whitelist.
>
> Best,
> Jingsong Lee
>
>
> ------------------------------------------------------------------
> From:Kurt Young <[hidden email]>
> Send Time: Tuesday, September 3, 2019, 15:41
> To:dev <[hidden email]>
> Subject:Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog
>
> Thanks Bowen for driving this.
>
> +1 for the general idea. It makes the function resolved behavior more
> clear and deterministic. Besides, the user can use all hive built-in
> functions, which is a great feature.
>
> I only have one comment, but maybe it may touch your design so I think
> it would make sense to reply this mail instead of comment on google doc.
> Regarding to the classfication of functions, you currently have 4 types
> of functions, which are:
> 1. temporary functions
> 2. Flink built-in functions
> 3. Hive built-in functions (or generalized as external built-in functions)
> 4. catalog functions
>
> What I want to propose is we can merge #3 and #4, make them both under
> "catalog" concept, by extending catalog function to make it have ability to
> have built-in catalog functions. Some benefits I can see from this approach:
> 1. We don't have to introduce new concept like external built-in functions.
> Actually
> I don't see a full story about how to treat a built-in functions, and it
> seems a little
> bit disrupt with catalog. As a result, you have to make some restriction
> like "hive
> built-in functions can only be used when current catalog is hive catalog".
>
> 2. It makes us easier to adopt another system's built-in functions to
> Flink, such as
> MySQL. If we don't treat uniformly with "external built-in functions" and
> "external
> catalog function", things like user set current catalog to hive but want to
> use MySQL's
> built-in function will happen.
>
> One more thing, follow this approach, it's clear for your question about
> how to support
> external built-in functions, which is "add a getBuiltInFunction to current
> Catalog API".
>
> What do you think?
>
> Best,
> Kurt

Danny Chan

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

In reply to this post by bowen.li

Thanks Bowen for bringing up this topic. I think it's a useful refactoring to make our function usage more user-friendly.

On the topic of how to organize the built-in operators and the Hive operators, here is a solution from Apache Calcite [1]: Calcite makes each dialect's operators a "library", and users can specify which libraries they want to use for a SQL query. The built-in operators always come first as first-class objects, and the others are used in the order they appear. Maybe you can take it as a reference.

[1] https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
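
To show the idea in a few lines, here is a sketch of such an ordered library lookup. It is not Calcite's API; `FunctionLibraryChain` and the map-based libraries are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: the core library is always searched first, then the
// user-selected libraries in the order they were specified.
final class FunctionLibraryChain {
    private final List<Map<String, String>> libraries = new ArrayList<>();

    FunctionLibraryChain(Map<String, String> core, List<Map<String, String>> selected) {
        libraries.add(core);
        libraries.addAll(selected);
    }

    Optional<String> lookup(String name) {
        for (Map<String, String> library : libraries) {
            String impl = library.get(name);
            if (impl != null) {
                return Optional.of(impl); // first match wins
            }
        }
        return Optional.empty();
    }
}
```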

Best,
Danny Chan
On Aug 28, 2019, at 2:50 AM +0800, Bowen Li <[hidden email]> wrote:

> Hi folks,
>
> I'd like to kick off a discussion on reworking Flink's FunctionCatalog.
> It's critically helpful to improve function usability in SQL.
>
> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
>
> In short, it:
> - adds support for precise function reference with fully/partially
> qualified name
> - redefines function resolution order for ambiguous function reference
> - adds support for Hive's rich built-in functions (support for Hive user
> defined functions was already added in 1.9.0)
> - clarifies the concept of temporary functions
>
> Would love to hear your thoughts.
>
> Bowen

Timo Walther-2

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

This sounds exactly like the module approach I mentioned, no?

Regards,
Timo

On 03.09.19 13:42, Danny Chan wrote:

> Thanks Bowen for bring up this topic, I think it’s a useful refactoring to make our function usage more user friendly.
>
> For the topic of how to organize the builtin operators and operators of Hive, here is a solution from Apache Calcite, the Calcite way is to make every dialect operators a “Library”, user can specify which libraries they want to use for a sql query. The builtin operators always comes as the first class objects and the others are used from the order they appears. Maybe you can take a reference.
>
> [1] https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
>
> Best,
> Danny Chan

Danny Chan

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

The approach you proposed is basically the same as what Calcite does. I think we are on the same page.

Best,
Danny Chan
On Sep 3, 2019, at 7:57 PM +0800, Timo Walther <[hidden email]> wrote:

> This sounds exactly as the module approach I mentioned, no?
>
> Regards,
> Timo

Kurt Young

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

So this only affects the functions and operations we currently have in SQL
and has no effect on tables, right? And it looks like this is a concept
orthogonal to Catalog?
If the answer to both is yes, then won't the catalog function be a weird
concept?

Best,
Kurt

On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email]> wrote:

> The way you proposed are basically the same as what Calcite does, I think
> we are in the same line.
>
> Best,
> Danny Chan

Timo Walther-2

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Hi Kurt,

it should not affect the functions and operations we currently have in
SQL. It just categorizes the available built-in functions. It is kind of
an orthogonal concept to the catalog API but built-in functions deserve
this special kind of treatment. CatalogFunction still fits perfectly in
there because the regular catalog object resolution logic is not
affected. So tables and functions are resolved in the same way but with
built-in functions that have priority as in the original design.

Regards,
Timo

On 03.09.19 15:26, Kurt Young wrote:

> Does this only affect the functions and operations we currently have in SQL
> and
> have no effect on tables, right? Looks like this is an orthogonal concept
> with Catalog?
> If the answer are both yes, then the catalog function will be a weird
> concept?
>
> Best,
> Kurt

bowen.li

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Hi Kurt,

Re:
> What I want to propose is we can merge #3 and #4, make them both under
> "catalog" concept, by extending catalog function to make it have ability to
> have built-in catalog functions. Some benefits I can see from this approach:
> 1. We don't have to introduce new concept like external built-in functions.
> Actually I don't see a full story about how to treat a built-in functions,
> and it seems a little bit disrupt with catalog. As a result, you have to make
> some restriction like "hive built-in functions can only be used when current
> catalog is hive catalog".

Yes, I've unified #3 and #4 but it seems I didn't update some part of the
doc. I've modified those sections, and they are up to date now.

In short, built-in functions of external systems are now defined as a
special kind of catalog function in Flink, and handled by Flink as
follows:
- An external built-in function must be associated with a catalog for the
purpose of decoupling flink-table and external systems.
- It always resides in front of catalog functions in the ambiguous function
reference order, just like in its own external system.
- It is a special catalog function that doesn't have a schema/database
namespace.
- It goes through the same instantiation logic as other user-defined catalog
functions in the external system.
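
Roughly, the lookup inside one catalog could be sketched as below. The interface `SketchCatalog` and the other class names are hypothetical; only the method name `getBuiltInFunction` follows Kurt's suggestion earlier in this thread, and functions are reduced to plain strings for brevity.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical, simplified catalog: built-ins have no database namespace.
interface SketchCatalog {
    Optional<String> getBuiltInFunction(String name);
    Optional<String> getFunction(String database, String name);
}

final class InMemorySketchCatalog implements SketchCatalog {
    private final Map<String, String> builtIns = new HashMap<>();
    private final Map<String, String> functions = new HashMap<>();

    void addBuiltIn(String name, String impl) {
        builtIns.put(name, impl);
    }

    void addFunction(String database, String name, String impl) {
        functions.put(database + "." + name, impl);
    }

    @Override
    public Optional<String> getBuiltInFunction(String name) {
        return Optional.ofNullable(builtIns.get(name));
    }

    @Override
    public Optional<String> getFunction(String database, String name) {
        return Optional.ofNullable(functions.get(database + "." + name));
    }
}

final class CatalogFunctionLookup {
    private CatalogFunctionLookup() {}

    // The external system's built-ins sit in front of its catalog functions.
    static Optional<String> lookup(SketchCatalog catalog, String currentDatabase, String name) {
        Optional<String> builtIn = catalog.getBuiltInFunction(name);
        return builtIn.isPresent() ? builtIn : catalog.getFunction(currentDatabase, name);
    }
}
```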

Please take another look at the doc, and let me know if you have more
questions.

On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> wrote:

> Hi Kurt,
>
> it should not affect the functions and operations we currently have in
> SQL. It just categorizes the available built-in functions. It is kind of
> an orthogonal concept to the catalog API but built-in functions deserve
> this special kind of treatment. CatalogFunction still fits perfectly in
> there because the regular catalog object resolution logic is not
> affected. So tables and functions are resolved in the same way but with
> built-in functions that have priority as in the original design.
>
> Regards,
> Timo

bowen.li

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Hi Jingsong,

Re> 1.Hive built-in functions is an intermediate solution. So we should
> not introduce interfaces to influence the framework. To make
> Flink itself more powerful, we should implement the functions
> we need to add.

Yes, please see the doc.

Re> 2.Non-flink built-in functions are easy for users to change their
> behavior. If we support some flink built-in functions in the
> future but act differently from non-flink built-in, this will lead to
> changes in user behavior.

There's no such concept as "external built-in functions" any more. Built-in
functions of external systems will be treated as special catalog functions.

Re> Another question is, does this fallback include all
> hive built-in functions? As far as I know, some hive functions
> have some hacky. If possible, can we start with a white list?
> Once we implement some functions to flink built-in, we can
> also update the whitelist.

Yes, that's something we thought of too. I don't think it's super critical
to the scope of this FLIP, thus I'd like to leave it to future efforts as a
nice-to-have feature.

On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> wrote:

> Hi Kurt,
>
> Re: > What I want to propose is we can merge #3 and #4, make them both
> under
> >"catalog" concept, by extending catalog function to make it have ability
> to
> >have built-in catalog functions. Some benefits I can see from this
> approach:
> >1. We don't have to introduce new concept like external built-in
> functions.
> >Actually I don't see a full story about how to treat a built-in
> functions, and it
> >seems a little bit disrupt with catalog. As a result, you have to make
> some restriction
> >like "hive built-in functions can only be used when current catalog is
> hive catalog".
>
> Yes, I've unified #3 and #4 but it seems I didn't update some part of the
> doc. I've modified those sections, and they are up to date now.
>
> In short, now built-in function of external systems are defined as a
> special kind of catalog function in Flink, and handled by Flink as
> following:
> - An external built-in function must be associated with a catalog for the
> purpose of decoupling flink-table and external systems.
> - It always resides in front of catalog functions in ambiguous function
> reference order, just like in its own external system
> - It is a special catalog function that doesn’t have a schema/database
> namespace
> - It goes thru the same instantiation logic as other user defined catalog
> functions in the external system
>
> Please take another look at the doc, and let me know if you have more
> questions.
>
>
> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> wrote:
>
>> Hi Kurt,
>>
>> it should not affect the functions and operations we currently have in
>> SQL. It just categorizes the available built-in functions. It is kind of
>> an orthogonal concept to the catalog API but built-in functions deserve
>> this special kind of treatment. CatalogFunction still fits perfectly in
>> there because the regular catalog object resolution logic is not
>> affected. So tables and functions are resolved in the same way but with
>> built-in functions that have priority as in the original design.
>>
>> Regards,
>> Timo
>>
>>
>> On 03.09.19 15:26, Kurt Young wrote:
>> > Does this only affect the functions and operations we currently have in
>> SQL
>> > and
>> > have no effect on tables, right? Looks like this is an orthogonal
>> concept
>> > with Catalog?
>> > If the answer are both yes, then the catalog function will be a weird
>> > concept?
>> >
>> > Best,
>> > Kurt
>> >
>> >
>> > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email]> wrote:
>> >
>> >> The way you proposed are basically the same as what Calcite does, I
>> think
>> >> we are in the same line.
>> >>
>> >> Best,
>> >> Danny Chan
>> >> On Sep 3, 2019, at 7:57 PM +0800, Timo Walther <[hidden email]> wrote:
>> >>> This sounds exactly as the module approach I mentioned, no?
>> >>>
>> >>> Regards,
>> >>> Timo
>> >>>
>> >>> On 03.09.19 13:42, Danny Chan wrote:
>> >>>> Thanks Bowen for bring up this topic, I think it’s a useful
>> >> refactoring to make our function usage more user friendly.
>> >>>> For the topic of how to organize the builtin operators and operators
>> >> of Hive, here is a solution from Apache Calcite, the Calcite way is to
>> make
>> >> every dialect operators a “Library”, user can specify which libraries
>> they
>> >> want to use for a sql query. The builtin operators always comes as the
>> >> first class objects and the others are used from the order they
>> appears.
>> >> Maybe you can take a reference.
>> >>>> [1]
>> >>
>> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
>> >>>> Best,
>> >>>> Danny Chan
>> >>>> On Aug 28, 2019, at 2:50 AM +0800, Bowen Li <[hidden email]> wrote:
>> >>>>> Hi folks,
>> >>>>>
>> >>>>> I'd like to kick off a discussion on reworking Flink's
>> >> FunctionCatalog.
>> >>>>> It's critically helpful to improve function usability in SQL.
>> >>>>>
>> >>>>>
>> >>
>> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
>> >>>>> In short, it:
>> >>>>> - adds support for precise function reference with fully/partially
>> >>>>> qualified name
>> >>>>> - redefines function resolution order for ambiguous function
>> >> reference
>> >>>>> - adds support for Hive's rich built-in functions (support for Hive
>> >> user
>> >>>>> defined functions was already added in 1.9.0)
>> >>>>> - clarifies the concept of temporary functions
>> >>>>>
>> >>>>> Would love to hear your thoughts.
>> >>>>>
>> >>>>> Bowen
>> >>>
>>
>>

bowen.li

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Hi Timo,

Re> 1) We should not have the restriction "hive built-in functions can only
> be used when current catalog is hive catalog". Switching a catalog
> should only have implications on the cat.db.object resolution but not
> functions. It would be quite convenient for users to use Hive built-ins
> even if they use a Confluent schema registry or just the in-memory
> catalog.

There might be a misunderstanding here.

First of all, Hive built-in functions are not part of Flink's built-in
functions; they are catalog functions. Thus, if the current catalog is not a
HiveCatalog but, say, a schema registry catalog, an ambiguous function
reference simply shouldn't be resolved against a different catalog.

Second, Hive built-in functions could potentially be referenced across
catalogs, but they don't have a db namespace and we currently don't have a
SQL syntax for that. It can be enabled once such a syntax is defined,
e.g. "catalog::function", but that's out of scope for this FLIP.
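
To illustrate the first point, here is a minimal sketch in Java. Everything
in it (the class names, the "registry" catalog, the function names) is
hypothetical and is not Flink's actual catalog API. It only shows an
unqualified name being checked against the current catalog's external
built-ins (which have no db namespace) and then its regular functions, with
no fallback to other catalogs.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch; not Flink's actual Catalog API. */
final class CatalogScopedLookupSketch {

    /** Minimal stand-in for a catalog holding two kinds of functions. */
    static final class Catalog {
        final String name;
        // Built-ins of the external system: no database namespace.
        final Set<String> externalBuiltIns = new HashSet<>();
        // Regular catalog functions, keyed by database name.
        final Map<String, Set<String>> functionsByDb = new HashMap<>();

        Catalog(String name) {
            this.name = name;
        }
    }

    /** Resolves an unqualified name against the current catalog only. */
    static Optional<String> resolve(String fn, Catalog current, String currentDb) {
        if (current.externalBuiltIns.contains(fn)) {
            return Optional.of(current.name + " built-in: " + fn);
        }
        Set<String> dbFunctions = current.functionsByDb.get(currentDb);
        if (dbFunctions != null && dbFunctions.contains(fn)) {
            return Optional.of(current.name + "." + currentDb + "." + fn);
        }
        // No fallback to other catalogs for an ambiguous reference.
        return Optional.empty();
    }

    public static void main(String[] args) {
        Catalog hive = new Catalog("hive");
        hive.externalBuiltIns.add("hive_only_fn");
        hive.functionsByDb.put("default", new HashSet<>(Arrays.asList("my_udf")));
        Catalog registry = new Catalog("registry");

        // Found while the Hive catalog is current.
        System.out.println(resolve("hive_only_fn", hive, "default"));
        // Not found once the current catalog is a different one.
        System.out.println(resolve("hive_only_fn", registry, "default"));
    }
}
```

Flink's own built-in and temporary functions are left out of this sketch on
purpose; it only covers the catalog-scoped part of the lookup.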

Re> 2) I would propose to have separate concepts for catalog and built-in
> functions. In particular it would be nice to modularize built-in
> functions. Some built-in functions are very crucial (like AS, CAST,
> MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe
> we add more experimental functions in the future or function for some
> special application area (Geo functions, ML functions). A data platform
> team might not want to make every built-in function available. Or a
> function module like ML functions is in a different Maven module.

I think this is orthogonal to this FLIP, especially since we no longer have
"external built-in functions" and the built-in function category currently
remains untouched.

But just to share some thoughts on the proposal, I'm not sure about it:
- I don't know whether any other database handles built-in functions like
that. Maybe you can give some examples? IMHO, built-in functions are system
info and should be deterministic, not dependent on loaded libraries. Geo
functions should either be built in already or just be library functions,
and library functions can be adapted to the catalog APIs or exposed through
some other syntax
- I don't know if all the use cases hold up, and many can be achieved by
other approaches too. E.g. experimental functions can be handled well
through documentation, annotations, etc.
- the proposal basically introduces a concept like a pluggable built-in
function catalog, despite the already existing catalog APIs
- it brings even more complicated scenarios into the design. E.g. how do
you handle built-in functions that live in different modules but share the
same name?

In short, I'm not sure it really holds up, and it looks like overkill to
me. I'd rather not go down that route. Related discussion can happen in its
own thread.

Re> 3) Following the suggestion above, we can have a separate discovery
> mechanism for built-in functions. Instead of just going through a static
> list like in BuiltInFunctionDefinitions, a platform team should be able
> to select function modules like
> catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
> HiveFunctions) or via service discovery;

Same as above. I'll leave it to its own thread.
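
For readers following along, the module idea quoted above could look
roughly like the Java sketch below. Only the names setFunctionModules,
CoreFunctions, GeoFunctions and HiveFunctions come from the quoted
suggestion; the FunctionModule class, the lookup method, and the rule that
the first enabled module wins on a name clash are assumptions made purely
for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch of pluggable built-in function modules. */
final class FunctionModulesSketch {

    /** Hypothetical module: a named group of built-in function names. */
    static final class FunctionModule {
        final String name;
        final Set<String> functions;

        FunctionModule(String name, String... functions) {
            this.name = name;
            this.functions = new HashSet<>(Arrays.asList(functions));
        }
    }

    private List<FunctionModule> modules = Collections.emptyList();

    /** Enables the given modules; their order is the lookup order. */
    void setFunctionModules(FunctionModule... enabled) {
        modules = Arrays.asList(enabled);
    }

    /** Returns the name of the first enabled module that defines the function. */
    Optional<String> lookup(String functionName) {
        for (FunctionModule module : modules) {
            if (module.functions.contains(functionName)) {
                return Optional.of(module.name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        FunctionModule core = new FunctionModule("CoreFunctions", "cast", "md5");
        FunctionModule hive = new FunctionModule("HiveFunctions", "md5", "hive_only_fn");

        FunctionModulesSketch manager = new FunctionModulesSketch();
        manager.setFunctionModules(core, hive);

        System.out.println(manager.lookup("md5"));          // defined in both, core wins
        System.out.println(manager.lookup("hive_only_fn")); // only in the Hive module
    }
}
```

A platform team that does not want Geo or ML functions exposed would simply
leave those modules out of the setFunctionModules call in this sketch.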

Re> 3) Dawid and I discussed the resolution order again. I agree with Kurt
> that we should unify built-in function (external or internal) under a
> common layer. However, the resolution order should be:
> 1. built-in functions
> 2. temporary functions
> 3. regular catalog resolution logic
> Otherwise a temporary function could cause clashes with Flink's built-in
> functions. If you take a look at other vendors, like SQL Server they
> also do not allow to overwrite built-in functions.

"I agree with Kurt that we should unify built-in function (external or
internal) under a common layer." <- I don't think this is what Kurt meant.
Kurt and I are in favor of unifying built-in functions of external systems
with catalog functions. Was that a typo?

Besides, I'm not sure about the resolution order you proposed. Temporary
functions live for the duration of a session and are visible only to the
session owner. They are unique to each user, and users create them on
purpose to take the highest priority so that they can override system info
(built-in functions in this case).

With your order, why would users ever give a temporary function the same
name as a built-in function? Since an ambiguous reference to that name
would always resolve to the built-in function, creating a same-named
temporary function would be meaningless.
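
To make the two candidate orders concrete, here is a minimal sketch in
Java. The class, the scope labels, and the function names are all
hypothetical and are not taken from the FLIP document. It only shows that
whichever scope is probed first wins a name clash, so with built-ins first
a same-named temporary function is never reached.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch comparing two resolution orders for ambiguous names. */
final class ResolutionOrderSketch {

    /** Returns the label of the first scope, in iteration order, that defines the name. */
    static Optional<String> resolve(String name, Map<String, Set<String>> scopesInOrder) {
        for (Map.Entry<String, Set<String>> scope : scopesInOrder.entrySet()) {
            if (scope.getValue().contains(name)) {
                return Optional.of(scope.getKey());
            }
        }
        return Optional.empty();
    }

    private static Set<String> names(String... values) {
        return new HashSet<>(Arrays.asList(values));
    }

    /** Order argued for in this reply: temporary, then built-in, then catalog. */
    static Map<String, Set<String>> temporaryFirst() {
        Map<String, Set<String>> order = new LinkedHashMap<>();
        order.put("temporary", names("concat_ws"));
        order.put("built-in", names("concat_ws", "md5"));
        order.put("catalog", names("my_udf"));
        return order;
    }

    /** Order from the quoted suggestion: built-in, then temporary, then catalog. */
    static Map<String, Set<String>> builtInFirst() {
        Map<String, Set<String>> order = new LinkedHashMap<>();
        order.put("built-in", names("concat_ws", "md5"));
        order.put("temporary", names("concat_ws"));
        order.put("catalog", names("my_udf"));
        return order;
    }

    public static void main(String[] args) {
        System.out.println(resolve("concat_ws", temporaryFirst())); // temporary shadows built-in
        System.out.println(resolve("concat_ws", builtInFirst()));   // temporary is unreachable
    }
}
```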

On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> wrote:

> Hi Jingsong,
>
> Re> 1.Hive built-in functions is an intermediate solution. So we should
> > not introduce interfaces to influence the framework. To make
> > Flink itself more powerful, we should implement the functions
> > we need to add.
>
> Yes, please see the doc.
>
> Re> 2.Non-flink built-in functions are easy for users to change their
> > behavior. If we support some flink built-in functions in the
> > future but act differently from non-flink built-in, this will lead to
> > changes in user behavior.
>
> There's no such concept as "external built-in functions" any more.
> Built-in functions of external systems will be treated as special catalog
> functions.
>
> Re> Another question is, does this fallback include all
> > hive built-in functions? As far as I know, some hive functions
> > have some hacky. If possible, can we start with a white list?
> > Once we implement some functions to flink built-in, we can
> > also update the whitelist.
>
> Yes, that's something we thought of too. I don't think it's super critical
> to the scope of this FLIP, thus I'd like to leave it to future efforts as a
> nice-to-have feature.
>
>
> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> wrote:
>
>> Hi Kurt,
>>
>> Re: > What I want to propose is we can merge #3 and #4, make them both
>> under
>> >"catalog" concept, by extending catalog function to make it have ability
>> to
>> >have built-in catalog functions. Some benefits I can see from this
>> approach:
>> >1. We don't have to introduce new concept like external built-in
>> functions.
>> >Actually I don't see a full story about how to treat a built-in
>> functions, and it
>> >seems a little bit disrupt with catalog. As a result, you have to make
>> some restriction
>> >like "hive built-in functions can only be used when current catalog is
>> hive catalog".
>>
>> Yes, I've unified #3 and #4 but it seems I didn't update some part of the
>> doc. I've modified those sections, and they are up to date now.
>>
>> In short, now built-in function of external systems are defined as a
>> special kind of catalog function in Flink, and handled by Flink as
>> following:
>> - An external built-in function must be associated with a catalog for the
>> purpose of decoupling flink-table and external systems.
>> - It always resides in front of catalog functions in ambiguous function
>> reference order, just like in its own external system
>> - It is a special catalog function that doesn’t have a schema/database
>> namespace
>> - It goes thru the same instantiation logic as other user defined catalog
>> functions in the external system
>>
>> Please take another look at the doc, and let me know if you have more
>> questions.
>>
>>
>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> wrote:
>>
>>> Hi Kurt,
>>>
>>> it should not affect the functions and operations we currently have in
>>> SQL. It just categorizes the available built-in functions. It is kind of
>>> an orthogonal concept to the catalog API but built-in functions deserve
>>> this special kind of treatment. CatalogFunction still fits perfectly in
>>> there because the regular catalog object resolution logic is not
>>> affected. So tables and functions are resolved in the same way but with
>>> built-in functions that have priority as in the original design.
>>>
>>> Regards,
>>> Timo
>>>
>>>
>>> On 03.09.19 15:26, Kurt Young wrote:
>>> > Does this only affect the functions and operations we currently have
>>> in SQL
>>> > and
>>> > have no effect on tables, right? Looks like this is an orthogonal
>>> concept
>>> > with Catalog?
>>> > If the answer are both yes, then the catalog function will be a weird
>>> > concept?
>>> >
>>> > Best,
>>> > Kurt
>>> >
>>> >
>>> > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email]>
>>> wrote:
>>> >
>>> >> The way you proposed are basically the same as what Calcite does, I
>>> think
>>> >> we are in the same line.
>>> >>
>>> >> Best,
>>> >> Danny Chan
>>> >> On Sep 3, 2019, at 7:57 PM +0800, Timo Walther <[hidden email]> wrote:
>>> >>> This sounds exactly as the module approach I mentioned, no?
>>> >>>
>>> >>> Regards,
>>> >>> Timo
>>> >>>
>>> >>> On 03.09.19 13:42, Danny Chan wrote:
>>> >>>> Thanks Bowen for bring up this topic, I think it’s a useful
>>> >> refactoring to make our function usage more user friendly.
>>> >>>> For the topic of how to organize the builtin operators and operators
>>> >> of Hive, here is a solution from Apache Calcite, the Calcite way is
>>> to make
>>> >> every dialect operators a “Library”, user can specify which libraries
>>> they
>>> >> want to use for a sql query. The builtin operators always comes as the
>>> >> first class objects and the others are used from the order they
>>> appears.
>>> >> Maybe you can take a reference.
>>> >>>> [1]
>>> >>
>>> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
>>> >>>> Best,
>>> >>>> Danny Chan
>>> >>>> On Aug 28, 2019, at 2:50 AM +0800, Bowen Li <[hidden email]> wrote:
>>> >>>>> Hi folks,
>>> >>>>>
>>> >>>>> I'd like to kick off a discussion on reworking Flink's
>>> >> FunctionCatalog.
>>> >>>>> It's critically helpful to improve function usability in SQL.
>>> >>>>>
>>> >>>>>
>>> >>
>>> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
>>> >>>>> In short, it:
>>> >>>>> - adds support for precise function reference with fully/partially
>>> >>>>> qualified name
>>> >>>>> - redefines function resolution order for ambiguous function
>>> >> reference
>>> >>>>> - adds support for Hive's rich built-in functions (support for Hive
>>> >> user
>>> >>>>> defined functions was already added in 1.9.0)
>>> >>>>> - clarifies the concept of temporary functions
>>> >>>>>
>>> >>>>> Would love to hear your thoughts.
>>> >>>>>
>>> >>>>> Bowen
>>> >>>
>>>
>>>

bowen.li

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Hi all,

Thanks for the feedback. Just a kind reminder that the [Proposal] section
in the Google doc has been updated; please take a look first and let me
know if you have more questions.

On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> wrote:

> Hi Timo,
>
> Re> 1) We should not have the restriction "hive built-in functions can
> only
> > be used when current catalog is hive catalog". Switching a catalog
> > should only have implications on the cat.db.object resolution but not
> > functions. It would be quite convenient for users to use Hive built-ins
> > even if they use a Confluent schema registry or just the in-memory
> catalog.
>
> There might be a misunderstanding here.
>
> First of all, Hive built-in functions are not part of Flink built-in
> functions, they are catalog functions, thus if the current catalog is not a
> HiveCatalog but, say, a schema registry catalog, ambiguous functions
> reference just shouldn't be resolved to a different catalog.
>
> Second, Hive built-in functions can potentially be referenced across
> catalog, but it doesn't have db namespace and we currently just don't have
> a SQL syntax for it. It can be enabled when such a SQL syntax is defined,
> e.g. "catalog::function", but it's out of scope of this FLIP.
>
> 2) I would propose to have separate concepts for catalog and built-in
> functions. In particular it would be nice to modularize built-in
> functions. Some built-in functions are very crucial (like AS, CAST,
> MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe
> we add more experimental functions in the future or function for some
> special application area (Geo functions, ML functions). A data platform
> team might not want to make every built-in function available. Or a
> function module like ML functions is in a different Maven module.
>
> I think this is orthogonal to this FLIP, especially we don't have the
> "external built-in functions" anymore and currently the built-in function
> category remains untouched.
>
> But just to share some thoughts on the proposal, I'm not sure about it:
> - I don't know if any other databases handle built-in functions like that.
> Maybe you can give some examples? IMHO, built-in functions are system info
> and should be deterministic, not depending on loaded libraries. Geo
> functions should be either built-in already or just libraries functions,
> and library functions can be adapted to catalog APIs or of some other
> syntax to use
> - I don't know if all use cases stand, and many can be achieved by other
> approaches too. E.g. experimental functions can be taken good care of by
> documentations, annotations, etc
> - the proposal basically introduces some concept like a pluggable built-in
> function catalog, despite the already existing catalog APIs
> - it brings in even more complicated scenarios to the design. E.g. how do
> you handle built-in functions in different modules but different names?
>
> In short, I'm not sure if it really stands and it looks like an overkill
> to me. I'd rather not go to that route. Related discussion can be on its
> own thread.
>
> 3) Following the suggestion above, we can have a separate discovery
> mechanism for built-in functions. Instead of just going through a static
> list like in BuiltInFunctionDefinitions, a platform team should be able
> to select function modules like
> catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
> HiveFunctions) or via service discovery;
>
> Same as above. I'll leave it to its own thread.
>
> re > 3) Dawid and I discussed the resolution order again. I agree with
> Kurt
> > that we should unify built-in function (external or internal) under a
> > common layer. However, the resolution order should be:
> > 1. built-in functions
> > 2. temporary functions
> > 3. regular catalog resolution logic
> > Otherwise a temporary function could cause clashes with Flink's built-in
> > functions. If you take a look at other vendors, like SQL Server they
> > also do not allow to overwrite built-in functions.
>
> ”I agree with Kurt that we should unify built-in function (external or
> internal) under a common layer.“ <- I don't think this is what Kurt means.
> Kurt and I are in favor of unifying built-in functions of external systems
> and catalog functions. Did you type a mistake?
>
> Besides, I'm not sure about the resolution order you proposed. Temporary
> functions have a lifespan over a session and are only visible to the
> session owner, they are unique to each user, and users create them on
> purpose to be the highest priority in order to overwrite system info
> (built-in functions in this case).
>
> In your case, why would users name a temporary function the same as a
> built-in function then? Since using that name in ambiguous function
> reference will always be resolved to built-in functions, creating a
> same-named temp function would be meaningless in the end.
>
>
> On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> wrote:
>
>> Hi Jingsong,
>>
>> Re> 1.Hive built-in functions is an intermediate solution. So we should
>> > not introduce interfaces to influence the framework. To make
>> > Flink itself more powerful, we should implement the functions
>> > we need to add.
>>
>> Yes, please see the doc.
>>
>> Re> 2.Non-flink built-in functions are easy for users to change their
>> > behavior. If we support some flink built-in functions in the
>> > future but act differently from non-flink built-in, this will lead to
>> > changes in user behavior.
>>
>> There's no such concept as "external built-in functions" any more.
>> Built-in functions of external systems will be treated as special catalog
>> functions.
>>
>> Re> Another question is, does this fallback include all
>> > hive built-in functions? As far as I know, some hive functions
>> > have some hacky. If possible, can we start with a white list?
>> > Once we implement some functions to flink built-in, we can
>> > also update the whitelist.
>>
>> Yes, that's something we thought of too. I don't think it's super
>> critical to the scope of this FLIP, thus I'd like to leave it to future
>> efforts as a nice-to-have feature.
>>
>>
>> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> wrote:
>>
>>> Hi Kurt,
>>>
>>> Re: > What I want to propose is we can merge #3 and #4, make them both
>>> under
>>> >"catalog" concept, by extending catalog function to make it have
>>> ability to
>>> >have built-in catalog functions. Some benefits I can see from this
>>> approach:
>>> >1. We don't have to introduce new concept like external built-in
>>> functions.
>>> >Actually I don't see a full story about how to treat a built-in
>>> functions, and it
>>> >seems a little bit disrupt with catalog. As a result, you have to make
>>> some restriction
>>> >like "hive built-in functions can only be used when current catalog is
>>> hive catalog".
>>>
>>> Yes, I've unified #3 and #4 but it seems I didn't update some part of
>>> the doc. I've modified those sections, and they are up to date now.
>>>
>>> In short, now built-in function of external systems are defined as a
>>> special kind of catalog function in Flink, and handled by Flink as
>>> following:
>>> - An external built-in function must be associated with a catalog for
>>> the purpose of decoupling flink-table and external systems.
>>> - It always resides in front of catalog functions in ambiguous function
>>> reference order, just like in its own external system
>>> - It is a special catalog function that doesn’t have a schema/database
>>> namespace
>>> - It goes thru the same instantiation logic as other user defined
>>> catalog functions in the external system
>>>
>>> Please take another look at the doc, and let me know if you have more
>>> questions.
>>>
>>>
>>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> wrote:
>>>
>>>> Hi Kurt,
>>>>
>>>> it should not affect the functions and operations we currently have in
>>>> SQL. It just categorizes the available built-in functions. It is kind
>>>> of
>>>> an orthogonal concept to the catalog API but built-in functions deserve
>>>> this special kind of treatment. CatalogFunction still fits perfectly in
>>>> there because the regular catalog object resolution logic is not
>>>> affected. So tables and functions are resolved in the same way but with
>>>> built-in functions that have priority as in the original design.
>>>>
>>>> Regards,
>>>> Timo
>>>>
>>>>
>>>> On 03.09.19 15:26, Kurt Young wrote:
>>>> > Does this only affect the functions and operations we currently have
>>>> in SQL
>>>> > and
>>>> > have no effect on tables, right? Looks like this is an orthogonal
>>>> concept
>>>> > with Catalog?
>>>> > If the answer are both yes, then the catalog function will be a weird
>>>> > concept?
>>>> >
>>>> > Best,
>>>> > Kurt
>>>> >
>>>> >
>>>> > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email]>
>>>> wrote:
>>>> >
>>>> >> The way you proposed are basically the same as what Calcite does, I
>>>> think
>>>> >> we are in the same line.
>>>> >>
>>>> >> Best,
>>>> >> Danny Chan
>>>> >> On Sep 3, 2019, at 7:57 PM +0800, Timo Walther <[hidden email]> wrote:
>>>> >>> This sounds exactly as the module approach I mentioned, no?
>>>> >>>
>>>> >>> Regards,
>>>> >>> Timo
>>>> >>>
>>>> >>> On 03.09.19 13:42, Danny Chan wrote:
>>>> >>>> Thanks Bowen for bring up this topic, I think it’s a useful
>>>> >> refactoring to make our function usage more user friendly.
>>>> >>>> For the topic of how to organize the builtin operators and
>>>> operators
>>>> >> of Hive, here is a solution from Apache Calcite, the Calcite way is
>>>> to make
>>>> >> every dialect operators a “Library”, user can specify which
>>>> libraries they
>>>> >> want to use for a sql query. The builtin operators always comes as
>>>> the
>>>> >> first class objects and the others are used from the order they
>>>> appears.
>>>> >> Maybe you can take a reference.
>>>> >>>> [1]
>>>> >>
>>>> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
>>>> >>>> Best,
>>>> >>>> Danny Chan
>>>> >>>> On Aug 28, 2019, at 2:50 AM +0800, Bowen Li <[hidden email]> wrote:
>>>> >>>>> Hi folks,
>>>> >>>>>
>>>> >>>>> I'd like to kick off a discussion on reworking Flink's
>>>> >> FunctionCatalog.
>>>> >>>>> It's critically helpful to improve function usability in SQL.
>>>> >>>>>
>>>> >>>>>
>>>> >>
>>>> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
>>>> >>>>> In short, it:
>>>> >>>>> - adds support for precise function reference with fully/partially
>>>> >>>>> qualified name
>>>> >>>>> - redefines function resolution order for ambiguous function
>>>> >> reference
>>>> >>>>> - adds support for Hive's rich built-in functions (support for
>>>> Hive
>>>> >> user
>>>> >>>>> defined functions was already added in 1.9.0)
>>>> >>>>> - clarifies the concept of temporary functions
>>>> >>>>>
>>>> >>>>> Would love to hear your thoughts.
>>>> >>>>>
>>>> >>>>> Bowen
>>>> >>>
>>>>
>>>>

Kurt Young

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is the
same as Bowen's. But after thinking about it, I'm currently leaning toward
Timo's suggestion.

The reason is backward compatibility. If we follow Bowen's approach, let's
say we first look for a function in Flink's built-in functions, and then in
Hive's built-ins. For example, `foo` is not supported by Flink, but Hive has
such a built-in function, so users get Hive's behavior for `foo`. In the
next release, Flink realizes this is a very popular function and adds it to
Flink's built-in functions, but with different behavior from Hive's. So in
the next release, the behavior changes.

With Timo's approach, IIUC, users have to tell the framework explicitly
which kind of built-in functions they would like to use. They can just tell
the framework to abandon Flink's built-in functions and use Hive's instead.
Users can only choose between them, not use them at the same time. I think
this approach is more predictable.
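
The compatibility concern can be sketched in a few lines of Java. The class
and the function sets are made up for illustration; only `foo` and the
Flink-then-Hive fallback order come from the description above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch of the fallback order: Flink built-ins first, then Hive's. */
final class FallbackCompatibilitySketch {

    /** Returns which system's implementation an unqualified name resolves to. */
    static String resolve(String name, Set<String> flinkBuiltIns, Set<String> hiveBuiltIns) {
        if (flinkBuiltIns.contains(name)) {
            return "flink";
        }
        if (hiveBuiltIns.contains(name)) {
            return "hive";
        }
        return "unresolved";
    }

    public static void main(String[] args) {
        Set<String> hive = new HashSet<>(Arrays.asList("foo", "md5"));

        // Release N: Flink has no `foo`, so the lookup falls through to Hive.
        Set<String> flinkReleaseN = new HashSet<>(Arrays.asList("cast", "md5"));
        System.out.println(resolve("foo", flinkReleaseN, hive));

        // Release N+1: Flink adds its own `foo`; the same query now binds to
        // Flink's implementation without any change on the user's side.
        Set<String> flinkReleaseNPlus1 = new HashSet<>(Arrays.asList("cast", "md5", "foo"));
        System.out.println(resolve("foo", flinkReleaseNPlus1, hive));
    }
}
```

If the two implementations of `foo` differ in behavior, the user's query
results change across the upgrade even though the query text did not.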

Best,
Kurt

On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> wrote:

> Hi all,
>
> Thanks for the feedback. Just a kind reminder that the [Proposal] section
> in the google doc was updated, please take a look first and let me know if
> you have more questions.
>
> On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> wrote:
>
> > Hi Timo,
> >
> > Re> 1) We should not have the restriction "hive built-in functions can
> > only
> > > be used when current catalog is hive catalog". Switching a catalog
> > > should only have implications on the cat.db.object resolution but not
> > > functions. It would be quite convenient for users to use Hive built-ins
> > > even if they use a Confluent schema registry or just the in-memory
> > catalog.
> >
> > There might be a misunderstanding here.
> >
> > First of all, Hive built-in functions are not part of Flink built-in
> > functions, they are catalog functions, thus if the current catalog is
> not a
> > HiveCatalog but, say, a schema registry catalog, ambiguous functions
> > reference just shouldn't be resolved to a different catalog.
> >
> > Second, Hive built-in functions can potentially be referenced across
> > catalog, but it doesn't have db namespace and we currently just don't
> have
> > a SQL syntax for it. It can be enabled when such a SQL syntax is defined,
> > e.g. "catalog::function", but it's out of scope of this FLIP.
> >
> > 2) I would propose to have separate concepts for catalog and built-in
> > functions. In particular it would be nice to modularize built-in
> > functions. Some built-in functions are very crucial (like AS, CAST,
> > MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe
> > we add more experimental functions in the future or function for some
> > special application area (Geo functions, ML functions). A data platform
> > team might not want to make every built-in function available. Or a
> > function module like ML functions is in a different Maven module.
> >
> > I think this is orthogonal to this FLIP, especially we don't have the
> > "external built-in functions" anymore and currently the built-in function
> > category remains untouched.
> >
> > But just to share some thoughts on the proposal, I'm not sure about it:
> > - I don't know if any other databases handle built-in functions like
> that.
> > Maybe you can give some examples? IMHO, built-in functions are system
> info
> > and should be deterministic, not depending on loaded libraries. Geo
> > functions should be either built-in already or just libraries functions,
> > and library functions can be adapted to catalog APIs or of some other
> > syntax to use
> > - I don't know if all use cases stand, and many can be achieved by other
> > approaches too. E.g. experimental functions can be taken good care of by
> > documentations, annotations, etc
> > - the proposal basically introduces some concept like a pluggable
> built-in
> > function catalog, despite the already existing catalog APIs
> > - it brings in even more complicated scenarios to the design. E.g. how do
> > you handle built-in functions in different modules but different names?
> >
> > In short, I'm not sure if it really stands and it looks like an overkill
> > to me. I'd rather not go to that route. Related discussion can be on its
> > own thread.
> >
> > 3) Following the suggestion above, we can have a separate discovery
> > mechanism for built-in functions. Instead of just going through a static
> > list like in BuiltInFunctionDefinitions, a platform team should be able
> > to select function modules like
> > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
> > HiveFunctions) or via service discovery;
> >
> > Same as above. I'll leave it to its own thread.
> >
> > re > 3) Dawid and I discussed the resolution order again. I agree with
> > Kurt
> > > that we should unify built-in function (external or internal) under a
> > > common layer. However, the resolution order should be:
> > > 1. built-in functions
> > > 2. temporary functions
> > > 3. regular catalog resolution logic
> > > Otherwise a temporary function could cause clashes with Flink's
> built-in
> > > functions. If you take a look at other vendors, like SQL Server they
> > > also do not allow to overwrite built-in functions.
> >
> > ”I agree with Kurt that we should unify built-in function (external or
> > internal) under a common layer.“ <- I don't think this is what Kurt
> means.
> > Kurt and I are in favor of unifying built-in functions of external
> systems
> > and catalog functions. Did you type a mistake?
> >
> > Besides, I'm not sure about the resolution order you proposed. Temporary
> > functions have a lifespan over a session and are only visible to the
> > session owner, they are unique to each user, and users create them on
> > purpose to be the highest priority in order to overwrite system info
> > (built-in functions in this case).
> >
> > In your case, why would users name a temporary function the same as a
> > built-in function then? Since using that name in ambiguous function
> > reference will always be resolved to built-in functions, creating a
> > same-named temp function would be meaningless in the end.
> >
> >
> > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> wrote:
> >
> >> Hi Jingsong,
> >>
> >> Re> 1.Hive built-in functions is an intermediate solution. So we should
> >> > not introduce interfaces to influence the framework. To make
> >> > Flink itself more powerful, we should implement the functions
> >> > we need to add.
> >>
> >> Yes, please see the doc.
> >>
> >> Re> 2.Non-flink built-in functions are easy for users to change their
> >> > behavior. If we support some flink built-in functions in the
> >> > future but act differently from non-flink built-in, this will lead to
> >> > changes in user behavior.
> >>
> >> There's no such concept as "external built-in functions" any more.
> >> Built-in functions of external systems will be treated as special
> >> catalog functions.
> >>
> >> Re> Another question is, does this fallback include all
> >> > hive built-in functions? As far as I know, some hive functions
> >> > are somewhat hacky. If possible, can we start with a white list?
> >> > Once we implement some functions as flink built-ins, we can
> >> > also update the whitelist.
> >>
> >> Yes, that's something we thought of too. I don't think it's super
> >> critical to the scope of this FLIP, thus I'd like to leave it to future
> >> efforts as a nice-to-have feature.
> >>
> >>
> >> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> wrote:
> >>
> >>> Hi Kurt,
> >>>
> >>> Re: > What I want to propose is that we merge #3 and #4, putting them
> >>> > both under the "catalog" concept, by extending catalog functions so
> >>> > that a catalog can have built-in catalog functions. Some benefits I
> >>> > can see from this approach:
> >>> > 1. We don't have to introduce a new concept like external built-in
> >>> > functions. Actually I don't see a full story about how to treat
> >>> > built-in functions, and it seems a little at odds with the catalog.
> >>> > As a result, you have to make some restriction like "hive built-in
> >>> > functions can only be used when current catalog is hive catalog".
> >>>
> >>> Yes, I've unified #3 and #4 but it seems I didn't update some part of
> >>> the doc. I've modified those sections, and they are up to date now.
> >>>
> >>> In short, built-in functions of external systems are now defined as a
> >>> special kind of catalog function in Flink, and handled by Flink as
> >>> follows:
> >>> - An external built-in function must be associated with a catalog for
> >>> the purpose of decoupling flink-table and external systems.
> >>> - It always resides in front of catalog functions in ambiguous function
> >>> reference order, just like in its own external system
> >>> - It is a special catalog function that doesn’t have a schema/database
> >>> namespace
> >>> - It goes thru the same instantiation logic as other user defined
> >>> catalog functions in the external system
> >>>
> >>> Please take another look at the doc, and let me know if you have more
> >>> questions.
> >>>
> >>>
> >>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> wrote:
> >>>
> >>>> Hi Kurt,
> >>>>
> >>>> it should not affect the functions and operations we currently have in
> >>>> SQL. It just categorizes the available built-in functions. It is kind
> >>>> of an orthogonal concept to the catalog API, but built-in functions
> >>>> deserve this special kind of treatment. CatalogFunction still fits
> >>>> perfectly in there because the regular catalog object resolution logic
> >>>> is not affected. So tables and functions are resolved in the same way,
> >>>> but with built-in functions that have priority as in the original design.
> >>>>
> >>>> Regards,
> >>>> Timo
> >>>>
> >>>>
> >>>> On 03.09.19 15:26, Kurt Young wrote:
> >>>> > Does this only affect the functions and operations we currently have
> >>>> > in SQL, with no effect on tables? Looks like this is a concept
> >>>> > orthogonal to Catalog?
> >>>> > If the answer to both is yes, then won't the catalog function be a
> >>>> > weird concept?
> >>>> >
> >>>> > Best,
> >>>> > Kurt
> >>>> >
> >>>> >
> >>>> > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email]> wrote:
> >>>> >
> >>>> >> What you proposed is basically the same as what Calcite does; I
> >>>> >> think we are on the same page.
> >>>> >>
> >>>> >> Best,
> >>>> >> Danny Chan
> >>>> >> On Sep 3, 2019, 7:57 PM +0800, Timo Walther <[hidden email]> wrote:
> >>>> >>> This sounds exactly as the module approach I mentioned, no?
> >>>> >>>
> >>>> >>> Regards,
> >>>> >>> Timo
> >>>> >>>
> >>>> >>> On 03.09.19 13:42, Danny Chan wrote:
> >>>> >>>> Thanks Bowen for bringing up this topic. I think it's a useful
> >>>> >>>> refactoring to make our function usage more user-friendly.
> >>>> >>>> On the topic of how to organize the built-in operators and the
> >>>> >>>> operators of Hive, here is a solution from Apache Calcite: the
> >>>> >>>> Calcite way is to make every dialect's operators a "Library", and
> >>>> >>>> users can specify which libraries they want to use for a SQL query.
> >>>> >>>> The built-in operators always come first as first-class objects, and
> >>>> >>>> the others are used in the order in which they appear. Maybe you can
> >>>> >>>> use it as a reference.
> >>>> >>>> [1]
> >>>> >>>> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
> >>>> >>>> Best,
> >>>> >>>> Danny Chan
> >>>> >>>> On Aug 28, 2019, 2:50 AM +0800, Bowen Li <[hidden email]> wrote:
> >>>> >>>>> Hi folks,
> >>>> >>>>>
> >>>> >>>>> I'd like to kick off a discussion on reworking Flink's
> >>>> >>>>> FunctionCatalog.
> >>>> >>>>> It's critically helpful to improve function usability in SQL.
> >>>> >>>>>
> >>>> >>>>> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
> >>>> >>>>>
> >>>> >>>>> In short, it:
> >>>> >>>>> - adds support for precise function reference with fully/partially
> >>>> >>>>> qualified name
> >>>> >>>>> - redefines function resolution order for ambiguous function
> >>>> >>>>> reference
> >>>> >>>>> - adds support for Hive's rich built-in functions (support for Hive
> >>>> >>>>> user defined functions was already added in 1.9.0)
> >>>> >>>>> - clarifies the concept of temporary functions
> >>>> >>>>>
> >>>> >>>>> Would love to hear your thoughts.
> >>>> >>>>>
> >>>> >>>>> Bowen
> >>>> >>>
> >>>>
> >>>>
>

JingsongLee-2

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Thank you for your wonderful points.

I like Timo's proposal to turn built-in functions into flexible function
modules (for example, a financial module would be useful for banking systems).
But I agree with Bowen: I don't think Hive functions deserve to be a
function module. I think all function modules should be Flink built-in
functions. That way we can control their standardization, rather
than exposing controversial functions like Hive's.

About Kurt's concern: yes, every function added to Flink
changes user behavior. But in the near future we'll cover all
of Hive's functions (those in the white list). So if the final form does
not include Hive functions, this behavioral change will come sooner or
later. Do we then need to let users choose?

Back to the goal of Hive built-ins: I always thought they were just an
intermediate solution. Do we need to offer a Hive built-in
function mode to users in the future?

Best,
Jingsong Lee

------------------------------------------------------------------
From:Kurt Young <[hidden email]>
Send Time: Sep 4, 2019 (Wednesday) 10:11
To:dev <[hidden email]>
Subject:Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is the
same as Bowen's. But after thinking about it, I'm currently leaning toward
Timo's suggestion.

The reason is backward compatibility. If we follow Bowen's approach, let's
say we first look up a function in Flink's built-in functions, and then in
Hive's built-ins. For example, `foo` is not supported by Flink, but Hive has
such a built-in function, so the user gets Hive's behavior for `foo`. In the
next release, Flink realizes this is a very popular function and adds it to
Flink's built-in functions, but with different behavior from Hive's. So in
the next release, the behavior changes.
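
The compatibility concern above can be illustrated with a minimal sketch. It assumes a simple two-step fallback (Flink built-ins first, then Hive built-ins); the class `FallbackDriftSketch` and its `resolve` method are hypothetical names for illustration and are not part of Flink's API.

```java
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch: a name silently changes its source once Flink adds it. */
final class FallbackDriftSketch {

    /**
     * Returns which built-in set serves the name under an assumed
     * "Flink built-ins first, then Hive built-ins" fallback.
     */
    static Optional<String> resolve(String name, Set<String> flinkBuiltIns, Set<String> hiveBuiltIns) {
        if (flinkBuiltIns.contains(name)) {
            return Optional.of("flink");
        }
        if (hiveBuiltIns.contains(name)) {
            return Optional.of("hive");
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Set<String> hive = Set.of("foo");
        // Release N: Flink has no `foo`, so the Hive implementation is used.
        System.out.println(resolve("foo", Set.of("concat"), hive));
        // Release N+1: Flink ships its own `foo`, so the same query now hits Flink's.
        System.out.println(resolve("foo", Set.of("concat", "foo"), hive));
    }
}
```

The user's query does not change between the two calls; only the set of Flink built-ins does, which is exactly the kind of implicit behavior change described above.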

With Timo's approach, IIUC, users have to tell the framework explicitly which
kind of built-in functions they would like to use. A user can tell the
framework to abandon Flink's built-in functions and use Hive's instead. Users
can only choose between them, not use them at the same time. I think this
approach is more predictable.

Best,
Kurt

On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> wrote:

Xuefu Z

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

In reply to this post by Kurt Young

From what I have seen, there are a couple of focal disagreements:

1. Resolution order: temp function --> Flink built-in function --> catalog
function vs. Flink built-in function --> temp function --> catalog function.
2. "External" built-in functions: how to treat built-in functions in
external systems and how users reference them

For #1, I agree with Bowen that temp functions need to have the highest
priority, because that's how a user might overwrite a built-in function
without referencing a persistent, overwriting catalog function by its fully
qualified name. Putting built-in functions at the highest priority
eliminates that usage.
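
To make the difference between the two orders concrete, here is a minimal sketch. It is not Flink's FunctionCatalog; the class `ResolutionOrderSketch`, the `Kind` enum, and the registry shape are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

/** Hypothetical sketch, not Flink's FunctionCatalog: compares the two proposed orders. */
final class ResolutionOrderSketch {

    /** The three function categories discussed in this thread. */
    enum Kind { TEMPORARY, BUILT_IN, CATALOG }

    // Order favored by Bowen and Xuefu: a temporary function can shadow a built-in.
    static final List<Kind> TEMP_FIRST = List.of(Kind.TEMPORARY, Kind.BUILT_IN, Kind.CATALOG);
    // Order proposed by Timo: built-in functions always win.
    static final List<Kind> BUILT_IN_FIRST = List.of(Kind.BUILT_IN, Kind.TEMPORARY, Kind.CATALOG);

    /** Returns the first category, in the given order, that defines the name. */
    static Optional<Kind> resolve(String name, Map<Kind, Set<String>> registry, List<Kind> order) {
        for (Kind kind : order) {
            if (registry.getOrDefault(kind, Set.of()).contains(name)) {
                return Optional.of(kind);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Map<Kind, Set<String>> registry = Map.of(
                Kind.TEMPORARY, Set.of("concat"),
                Kind.BUILT_IN, Set.of("concat", "md5"));
        System.out.println(resolve("concat", registry, TEMP_FIRST));
        System.out.println(resolve("concat", registry, BUILT_IN_FIRST));
    }
}
```

With a temporary `concat` registered, the first order resolves to the temporary function, while the second always resolves to the built-in and the temporary one becomes unreachable by its short name.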

For #2, I saw general agreement that referencing "external" built-in
functions such as those in Hive needs to be explicit and deterministic, even
though different approaches are proposed. To limit the scope and simplify the
usage, it seems to make sense to me to introduce special syntax for users to
explicitly reference an external built-in function, such as hive1::sqrt or
hive1._built_in.sqrt. This is a DML syntax that nicely matches the Catalog API
call hive1.getFunction(ObjectPath functionName), where the database name is
absent for built-in functions available in that catalog hive1. I understand
that Bowen's original proposal was trying to avoid this, but this could
turn out to be a clean and simple solution.
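
As a rough illustration of the `hive1::sqrt` form, the sketch below splits a reference into an optional catalog part and a function name. The `catalog::function` syntax is only a proposal in this thread, and the class `ExternalBuiltInRef` and its `parse` method are hypothetical; nothing here is an existing Flink parser.

```java
import java.util.Optional;

/** Hypothetical sketch: parses the proposed "catalog::function" reference form. */
final class ExternalBuiltInRef {
    final Optional<String> catalog; // empty when the reference is unqualified
    final String function;

    private ExternalBuiltInRef(Optional<String> catalog, String function) {
        this.catalog = catalog;
        this.function = function;
    }

    /** Splits "hive1::sqrt" into catalog "hive1" and function "sqrt". */
    static ExternalBuiltInRef parse(String reference) {
        int idx = reference.indexOf("::");
        if (idx < 0) {
            // No separator: an ambiguous reference left to the normal resolution order.
            return new ExternalBuiltInRef(Optional.empty(), reference);
        }
        String catalog = reference.substring(0, idx);
        String function = reference.substring(idx + 2);
        if (catalog.isEmpty() || function.isEmpty()) {
            throw new IllegalArgumentException("Malformed function reference: " + reference);
        }
        return new ExternalBuiltInRef(Optional.of(catalog), function);
    }

    public static void main(String[] args) {
        ExternalBuiltInRef ref = parse("hive1::sqrt");
        System.out.println(ref.catalog + " / " + ref.function);
    }
}
```

Because the catalog is named explicitly, such a reference never goes through ambiguous resolution, which is the determinism asked for above.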

(Timo's modular approach is a great way to "expand" Flink's built-in function
set. It seems orthogonal and complementary to this, and could be
tackled in future work.)

I'd be happy to hear further thoughts on the two points.

Thanks,
Xuefu

On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> wrote:


--
Xuefu Zhang

"In Honey We Trust!"

bowen.li

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Hi,

I agree with Xuefu that there are two main points of controversy. My
thoughts on them:

1) Determinism of referencing Hive built-in functions. We can either remove
Hive built-in functions from ambiguous function resolution and require
users to use special syntax for their qualified names, or add a config flag
to the catalog constructor/yaml for turning Hive built-in functions on and
off, with the flag set to 'false' by default and proper docs added to help
users make their decisions.

2) Flink temp functions vs. Flink built-in functions in the ambiguous function
resolution order. We believe Flink temp functions should precede Flink
built-in functions, and I have presented my reasons. In case we cannot reach
an agreement, I propose that, for the moment, we forbid users from
registering temp functions with the same name as a built-in function, like
MySQL's approach. It won't raise any performance concern, since built-in
functions are all in memory and thus the cost of a name check will be
trivial.
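
The fallback in point 2 could look roughly like the sketch below: an in-memory name check at registration time. The class `TempFunctionRegistrySketch` and its methods are hypothetical and only illustrate the idea; they are not Flink's API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: rejects temp functions whose name clashes with a built-in. */
final class TempFunctionRegistrySketch {
    private final Set<String> builtInNames;
    private final Map<String, Object> tempFunctions = new HashMap<>();

    TempFunctionRegistrySketch(Set<String> builtInNames) {
        this.builtInNames = Set.copyOf(builtInNames);
    }

    /** Registers a temp function unless a built-in already owns the name. */
    void registerTemporaryFunction(String name, Object function) {
        // The check is a single in-memory set lookup, so its cost is negligible.
        if (builtInNames.contains(name)) {
            throw new IllegalArgumentException(
                    "Cannot register temporary function '" + name + "': name is used by a built-in function");
        }
        tempFunctions.put(name, function);
    }

    boolean hasTemporaryFunction(String name) {
        return tempFunctions.containsKey(name);
    }

    public static void main(String[] args) {
        TempFunctionRegistrySketch registry = new TempFunctionRegistrySketch(Set.of("concat", "md5"));
        registry.registerTemporaryFunction("my_concat", new Object());
        System.out.println(registry.hasTemporaryFunction("my_concat"));
    }
}
```

With such a check in place, the relative order of temp and built-in functions no longer matters for ambiguous references, because the two sets of names cannot overlap.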

On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> wrote:

> From what I have seen, there are a couple of focal disagreements:
>
> 1. Resolution order: temp function --> flink built-in function --> catalog
> function vs flink built-in function --> temp function -> catalog function.
> 2. "External" built-in functions: how to treat built-in functions in
> external system and how users reference them
>
> For #1, I agree with Bowen that temp function needs to be at the highest
> priority because that's how a user might overwrite a built-in function
> without referencing a persistent, overwriting catalog function with a fully
> qualified name. Putting built-in functions at the highest priority
> eliminates that usage.
>
> For #2, I saw a general agreement on referencing "external" built-in
> functions such as those in Hive needs to be explicit and deterministic even
> though different approaches are proposed. To limit the scope and simply the
> usage, it seems making sense to me to introduce special syntax for user to
> explicitly reference an external built-in function such as hive1::sqrt or
> hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API call
> hive1.getFunction(ObjectPath functionName) where the database name is
> absent for bulit-in functions available in that catalog hive1. I understand
> that Bowen's original proposal was trying to avoid this, but this could
> turn out to be a clean and simple solution.
>
> (Timo's modular approach is great way to "expand" Flink's built-in function
> set, which seems orthogonal and complementary to this, which could be
> tackled in further future work.)
>
> I'd be happy to hear further thoughts on the two points.
>
> Thanks,
> Xuefu
>
> On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> wrote:
>
> > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is the
> > same
> > as Bowen's. But after thinking about it, I'm currently lean to Timo's
> > suggestion.
> >
> > The reason is backward compatibility. If we follow Bowen's approach,
> let's
> > say we
> > first find function in Flink's built-in functions, and then hive's
> > built-in. For example, `foo`
> > is not supported by Flink, but hive has such built-in function. So user
> > will have hive's
> > behavior for function `foo`. And in next release, Flink realize this is a
> > very popular function
> > and add it into Flink's built-in functions, but with different behavior
> as
> > hive's. So in next
> > release, the behavior changes.
> >
> > With Timo's approach, IIUC user have to tell the framework explicitly
> what
> > kind of
> > built-in functions he would like to use. He can just tell framework to
> > abandon Flink's built-in
> > functions, and use hive's instead. User can only choose between them, but
> > not use
> > them at the same time. I think this approach is more predictable.
> >
> > Best,
> > Kurt
> >
> >
> > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> wrote:
> >
> > > Hi all,
> > >
> > > Thanks for the feedback. Just a kindly reminder that the [Proposal]
> > section
> > > in the google doc was updated, please take a look first and let me know
> > if
> > > you have more questions.
> > >
> > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> wrote:
> > >
> > > > Hi Timo,
> > > >
> > > > Re> 1) We should not have the restriction "hive built-in functions
> can
> > > > only
> > > > > be used when current catalog is hive catalog". Switching a catalog
> > > > > should only have implications on the cat.db.object resolution but
> not
> > > > > functions. It would be quite convinient for users to use Hive
> > built-ins
> > > > > even if they use a Confluent schema registry or just the in-memory
> > > > catalog.
> > > >
> > > > There might be a misunderstanding here.
> > > >
> > > > First of all, Hive built-in functions are not part of Flink built-in
> > > > functions, they are catalog functions, thus if the current catalog is
> > > not a
> > > > HiveCatalog but, say, a schema registry catalog, ambiguous functions
> > > > reference just shouldn't be resolved to a different catalog.
> > > >
> > > > Second, Hive built-in functions can potentially be referenced across
> > > > catalog, but it doesn't have db namespace and we currently just don't
> > > have
> > > > a SQL syntax for it. It can be enabled when such a SQL syntax is
> > defined,
> > > > e.g. "catalog::function", but it's out of scope of this FLIP.
> > > >
> > > > 2) I would propose to have separate concepts for catalog and built-in
> > > > functions. In particular it would be nice to modularize built-in
> > > > functions. Some built-in functions are very crucial (like AS, CAST,
> > > > MINUS), others are more optional but stable (MD5, CONCAT_WS), and
> maybe
> > > > we add more experimental functions in the future or function for some
> > > > special application area (Geo functions, ML functions). A data
> platform
> > > > team might not want to make every built-in function available. Or a
> > > > function module like ML functions is in a different Maven module.
> > > >
> > > > I think this is orthogonal to this FLIP, especially we don't have the
> > > > "external built-in functions" anymore and currently the built-in
> > function
> > > > category remains untouched.
> > > >
> > > > But just to share some thoughts on the proposal, I'm not sure about
> it:
> > > > - I don't know if any other databases handle built-in functions like
> > > that.
> > > > Maybe you can give some examples? IMHO, built-in functions are system
> > > info
> > > > and should be deterministic, not depending on loaded libraries. Geo
> > > > functions should be either built-in already or just libraries
> > functions,
> > > > and library functions can be adapted to catalog APIs or of some other
> > > > syntax to use
> > > > - I don't know if all use cases stand, and many can be achieved by
> > other
> > > > approaches too. E.g. experimental functions can be taken good care of
> > by
> > > > documentations, annotations, etc
> > > > - the proposal basically introduces some concept like a pluggable
> > > built-in
> > > > function catalog, despite the already existing catalog APIs
> > > > - it brings in even more complicated scenarios to the design. E.g.
> how
> > do
> > > > you handle built-in functions in different modules but different
> names?
> > > >
> > > > In short, I'm not sure if it really stands and it looks like an
> > overkill
> > > > to me. I'd rather not go to that route. Related discussion can be on
> > its
> > > > own thread.
> > > >
> > > > 3) Following the suggestion above, we can have a separate discovery
> > > > mechanism for built-in functions. Instead of just going through a
> > static
> > > > list like in BuiltInFunctionDefinitions, a platform team should be
> able
> > > > to select function modules like
> > > > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
> > > > HiveFunctions) or via service discovery;
> > > >
> > > > Same as above. I'll leave it to its own thread.
> > > >
> > > > re > 3) Dawid and I discussed the resulution order again. I agree
> with
> > > > Kurt
> > > > > that we should unify built-in function (external or internal)
> under a
> > > > > common layer. However, the resolution order should be:
> > > > > 1. built-in functions
> > > > > 2. temporary functions
> > > > > 3. regular catalog resolution logic
> > > > > Otherwise a temporary function could cause clashes with Flink's
> > > built-in
> > > > > functions. If you take a look at other vendors, like SQL Server
> they
> > > > > also do not allow to overwrite built-in functions.
> > > >
> > > > ”I agree with Kurt that we should unify built-in function (external
> or
> > > > internal) under a common layer.“ <- I don't think this is what Kurt
> > > means.
> > > > Kurt and I are in favor of unifying built-in functions of external
> > > systems
> > > > and catalog functions. Did you type a mistake?
> > > >
> > > > Besides, I'm not sure about the resolution order you proposed.
> > Temporary
> > > > functions have a lifespan over a session and are only visible to the
> > > > session owner, they are unique to each user, and users create them on
> > > > purpose to be the highest priority in order to overwrite system info
> > > > (built-in functions in this case).
> > > >
> > > > In your case, why would users name a temporary function the same as a
> > > > built-in function then? Since using that name in ambiguous function
> > > > reference will always be resolved to built-in functions, creating a
> > > > same-named temp function would be meaningless in the end.
> > > >
> > > >
> > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> wrote:
> > > >
> > > >> Hi Jingsong,
> > > >>
> > > >> Re> 1.Hive built-in functions is an intermediate solution. So we
> > should
> > > >> > not introduce interfaces to influence the framework. To make
> > > >> > Flink itself more powerful, we should implement the functions
> > > >> > we need to add.
> > > >>
> > > >> Yes, please see the doc.
> > > >>
> > > >> Re> 2.Non-flink built-in functions are easy for users to change
> their
> > > >> > behavior. If we support some flink built-in functions in the
> > > >> > future but act differently from non-flink built-in, this will lead
> > to
> > > >> > changes in user behavior.
> > > >>
> > > >> There's no such concept as "external built-in functions" any more.
> > > >> Built-in functions of external systems will be treated as special
> > > catalog
> > > >> functions.
> > > >>
> > > >> Re> Another question is, does this fallback include all
> > > >> > hive built-in functions? As far as I know, some hive functions
> > > >> > have some hacky. If possible, can we start with a white list?
> > > >> > Once we implement some functions to flink built-in, we can
> > > >> > also update the whitelist.
> > > >>
> > > >> Yes, that's something we thought of too. I don't think it's super
> > > >> critical to the scope of this FLIP, thus I'd like to leave it to
> > future
> > > >> efforts as a nice-to-have feature.
> > > >>
> > > >>
> > > >> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]>
> wrote:
> > > >>
> > > >>> Hi Kurt,
> > > >>>
> > > >>> Re: > What I want to propose is we can merge #3 and #4, make them
> > both
> > > >>> under
> > > >>> >"catalog" concept, by extending catalog function to make it have
> > > >>> ability to
> > > >>> >have built-in catalog functions. Some benefits I can see from this
> > > >>> approach:
> > > >>> >1. We don't have to introduce new concept like external built-in
> > > >>> functions.
> > > >>> >Actually I don't see a full story about how to treat a built-in
> > > >>> functions, and it
> > > >>> >seems a little bit disrupt with catalog. As a result, you have to
> > make
> > > >>> some restriction
> > > >>> >like "hive built-in functions can only be used when current
> catalog
> > is
> > > >>> hive catalog".
> > > >>>
> > > >>> Yes, I've unified #3 and #4 but it seems I didn't update some part
> of
> > > >>> the doc. I've modified those sections, and they are up to date now.
> > > >>>
> > > >>> In short, now built-in function of external systems are defined as
> a
> > > >>> special kind of catalog function in Flink, and handled by Flink as
> > > >>> following:
> > > >>> - An external built-in function must be associated with a catalog
> for
> > > >>> the purpose of decoupling flink-table and external systems.
> > > >>> - It always resides in front of catalog functions in ambiguous
> > function
> > > >>> reference order, just like in its own external system
> > > >>> - It is a special catalog function that doesn’t have a
> > schema/database
> > > >>> namespace
> > > >>> - It goes thru the same instantiation logic as other user defined
> > > >>> catalog functions in the external system
> > > >>>
> > > >>> Please take another look at the doc, and let me know if you have
> more
> > > >>> questions.
> > > >>>
> > > >>>
> > > >>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]>
> > > wrote:
> > > >>>
> > > >>>> Hi Kurt,
> > > >>>>
> > > >>>> it should not affect the functions and operations we currently
> have
> > in
> > > >>>> SQL. It just categorizes the available built-in functions. It is
> > kind
> > > >>>> of
> > > >>>> an orthogonal concept to the catalog API but built-in functions
> > > deserve
> > > >>>> this special kind of treatment. CatalogFunction still fits
> perfectly
> > > in
> > > >>>> there because the regular catalog object resolution logic is not
> > > >>>> affected. So tables and functions are resolved in the same way but
> > > with
> > > >>>> built-in functions that have priority as in the original design.
> > > >>>>
> > > >>>> Regards,
> > > >>>> Timo
> > > >>>>
> > > >>>>
> > > >>>> On 03.09.19 15:26, Kurt Young wrote:
> > > >>>> > Does this only affect the functions and operations we currently
> > have
> > > >>>> in SQL
> > > >>>> > and
> > > >>>> > have no effect on tables, right? Looks like this is an
> orthogonal
> > > >>>> concept
> > > >>>> > with Catalog?
> > > >>>> > If the answer are both yes, then the catalog function will be a
> > > weird
> > > >>>> > concept?
> > > >>>> >
> > > >>>> > Best,
> > > >>>> > Kurt
> > > >>>> >
> > > >>>> >
> > > >>>> > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email]
> >
> > > >>>> wrote:
> > > >>>> >
> > > >>>> >> The way you proposed are basically the same as what Calcite
> > does, I
> > > >>>> think
> > > >>>> >> we are in the same line.
> > > >>>> >>
> > > >>>> >> Best,
> > > >>>> >> Danny Chan
> > > >>>> >> 在 2019年9月3日 +0800 PM7:57，Timo Walther <[hidden email]>，写道：
> > > >>>> >>> This sounds exactly as the module approach I mentioned, no?
> > > >>>> >>>
> > > >>>> >>> Regards,
> > > >>>> >>> Timo
> > > >>>> >>>
> > > >>>> >>> On 03.09.19 13:42, Danny Chan wrote:
> > > >>>> >>>> Thanks Bowen for bring up this topic, I think it’s a useful
> > > >>>> >> refactoring to make our function usage more user friendly.
> > > >>>> >>>> For the topic of how to organize the builtin operators and
> > > >>>> operators
> > > >>>> >> of Hive, here is a solution from Apache Calcite, the Calcite
> way
> > is
> > > >>>> to make
> > > >>>> >> every dialect operators a “Library”, user can specify which
> > > >>>> libraries they
> > > >>>> >> want to use for a sql query. The builtin operators always comes
> > as
> > > >>>> the
> > > >>>> >> first class objects and the others are used from the order they
> > > >>>> appears.
> > > >>>> >> Maybe you can take a reference.
> > > >>>> >>>> [1]
> > > >>>> >>
> > > >>>>
> > >
> >
> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
> > > >>>> >>>> Best,
> > > >>>> >>>> Danny Chan
> > > >>>> >>>> 在 2019年8月28日 +0800 AM2:50，Bowen Li <[hidden email]>，写道：
> > > >>>> >>>>> Hi folks,
> > > >>>> >>>>>
> > > >>>> >>>>> I'd like to kick off a discussion on reworking Flink's
> > > >>>> >> FunctionCatalog.
> > > >>>> >>>>> It's critically helpful to improve function usability in
> SQL.
> > > >>>> >>>>>
> > > >>>> >>>>>
> > > >>>> >>
> > > >>>>
> > >
> >
> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
> > > >>>> >>>>> In short, it:
> > > >>>> >>>>> - adds support for precise function reference with
> > > fully/partially
> > > >>>> >>>>> qualified name
> > > >>>> >>>>> - redefines function resolution order for ambiguous function
> > > >>>> >> reference
> > > >>>> >>>>> - adds support for Hive's rich built-in functions (support
> for
> > > >>>> Hive
> > > >>>> >> user
> > > >>>> >>>>> defined functions was already added in 1.9.0)
> > > >>>> >>>>> - clarifies the concept of temporary functions
> > > >>>> >>>>>
> > > >>>> >>>>> Would love to hear your thoughts.
> > > >>>> >>>>>
> > > >>>> >>>>> Bowen
> > > >>>> >>>
> > > >>>>
> > > >>>>
> > >
> >
>
>
> --
> Xuefu Zhang
>
> "In Honey We Trust!"
>

dwysakowicz

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Hi all,

Just an opinion on the built-in vs. temporary function resolution and
naming issue. I think we should not allow overriding built-in
functions, as this may pose serious issues and, to be honest, is rather
infeasible and would require major rework. What happens if a user
wants to override CAST? Calls to that function are generated at
different layers of the stack that unfortunately do not always go
through the Catalog API (at least not yet). Moreover, from what I've
checked, no other systems allow overriding built-in functions. All the
systems I've checked so far register temporary functions in a
database/schema (either a special database for temporary functions, or
just the current database). What I would suggest is to always register
temporary functions with a 3-part identifier, the same way as tables,
views, etc. This effectively means you cannot override built-in
functions. With such an approach it is natural that temporary functions
end up a step lower in the resolution order:

1. built-in functions (1 part, maybe 2? - this is still under discussion)

2. temporary functions (always 3 part path)

3. catalog functions (always 3 part path)
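
To make the order concrete, here is a minimal sketch of how a lookup
could behave. Everything in it is an assumption for illustration: the
class ResolutionSketch, its methods, and the plain maps standing in for
the built-in list, the temporary function registry and the catalogs are
hypothetical and are not Flink APIs. It also assumes names are
case-insensitive and that an unqualified name is expanded with the
current catalog and database.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the proposed order; none of these names are Flink APIs. */
public class ResolutionSketch {
    // Built-in functions are keyed by a single-part name.
    private final Map<String, String> builtIn = new HashMap<>();
    // Temporary and catalog functions are keyed by a 3-part path: catalog.database.name.
    private final Map<String, String> temporary = new HashMap<>();
    private final Map<String, String> catalog = new HashMap<>();
    private final String currentCatalog;
    private final String currentDatabase;

    public ResolutionSketch(String currentCatalog, String currentDatabase) {
        this.currentCatalog = currentCatalog;
        this.currentDatabase = currentDatabase;
    }

    public void registerBuiltIn(String name) {
        builtIn.put(normalize(name), "built-in:" + normalize(name));
    }

    public void registerTemporary(String cat, String db, String name) {
        String path = path(cat, db, name);
        temporary.put(path, "temporary:" + path);
    }

    public void registerCatalogFunction(String cat, String db, String name) {
        String path = path(cat, db, name);
        catalog.put(path, "catalog:" + path);
    }

    /** Resolves an unqualified name: built-in, then temporary, then catalog. */
    public Optional<String> lookup(String name) {
        String key = normalize(name);
        // 1. Built-in functions always win, so they cannot be overridden.
        if (builtIn.containsKey(key)) {
            return Optional.of(builtIn.get(key));
        }
        // 2. Temporary functions, after expanding the name to a 3-part path.
        String path = path(currentCatalog, currentDatabase, name);
        if (temporary.containsKey(path)) {
            return Optional.of(temporary.get(path));
        }
        // 3. Regular catalog functions under the same 3-part path.
        return Optional.ofNullable(catalog.get(path));
    }

    // Assumption: function names are matched case-insensitively.
    private static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT);
    }

    private static String path(String cat, String db, String name) {
        return cat + "." + db + "." + normalize(name);
    }
}
```

In this sketch a temporary function registered as cat.db.cast is never
returned for the name "cast" as long as a built-in CAST exists, while a
temporary function still shadows a catalog function with the same path.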

Let me know what you think.

Best,

Dawid

On 04/09/2019 06:13, Bowen Li wrote:

> Hi,
>
> I agree with Xuefu that the main controversial points are mainly the two
> places. My thoughts on them:
>
> 1) Determinism of referencing Hive built-in functions. We can either remove
> Hive built-in functions from ambiguous function resolution and require
> users to use special syntax for their qualified names, or add a config flag
> to catalog constructor/yaml for turning on and off Hive built-in functions
> with the flag set to 'false' by default and proper doc added to help users
> make their decisions.
>
> 2) Flink temp functions v.s. Flink built-in functions in ambiguous function
> resolution order. We believe Flink temp functions should precede Flink
> built-in functions, and I have presented my reasons. Just in case if we
> cannot reach an agreement, I propose forbid users registering temp
> functions in the same name as a built-in function, like MySQL's approach,
> for the moment. It won't have any performance concern, since built-in
> functions are all in memory and thus cost of a name check will be really
> trivial.
>
>
> On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> wrote:
>
>> From what I have seen, there are a couple of focal disagreements:
>>
>> 1. Resolution order: temp function --> flink built-in function --> catalog
>> function vs flink built-in function --> temp function -> catalog function.
>> 2. "External" built-in functions: how to treat built-in functions in
>> external system and how users reference them
>>
>> For #1, I agree with Bowen that temp function needs to be at the highest
>> priority because that's how a user might overwrite a built-in function
>> without referencing a persistent, overwriting catalog function with a fully
>> qualified name. Putting built-in functions at the highest priority
>> eliminates that usage.
>>
>> For #2, I saw a general agreement on referencing "external" built-in
>> functions such as those in Hive needs to be explicit and deterministic even
>> though different approaches are proposed. To limit the scope and simply the
>> usage, it seems making sense to me to introduce special syntax for user to
>> explicitly reference an external built-in function such as hive1::sqrt or
>> hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API call
>> hive1.getFunction(ObjectPath functionName) where the database name is
>> absent for bulit-in functions available in that catalog hive1. I understand
>> that Bowen's original proposal was trying to avoid this, but this could
>> turn out to be a clean and simple solution.
>>
>> (Timo's modular approach is great way to "expand" Flink's built-in function
>> set, which seems orthogonal and complementary to this, which could be
>> tackled in further future work.)
>>
>> I'd be happy to hear further thoughts on the two points.
>>
>> Thanks,
>> Xuefu
>>
>> On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> wrote:
>>
>>> Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is the
>>> same
>>> as Bowen's. But after thinking about it, I'm currently lean to Timo's
>>> suggestion.
>>>
>>> The reason is backward compatibility. If we follow Bowen's approach,
>> let's
>>> say we
>>> first find function in Flink's built-in functions, and then hive's
>>> built-in. For example, `foo`
>>> is not supported by Flink, but hive has such built-in function. So user
>>> will have hive's
>>> behavior for function `foo`. And in next release, Flink realize this is a
>>> very popular function
>>> and add it into Flink's built-in functions, but with different behavior
>> as
>>> hive's. So in next
>>> release, the behavior changes.
>>>
>>> With Timo's approach, IIUC user have to tell the framework explicitly
>> what
>>> kind of
>>> built-in functions he would like to use. He can just tell framework to
>>> abandon Flink's built-in
>>> functions, and use hive's instead. User can only choose between them, but
>>> not use
>>> them at the same time. I think this approach is more predictable.
>>>
>>> Best,
>>> Kurt
>>>
>>>
>>> On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Thanks for the feedback. Just a kindly reminder that the [Proposal]
>>> section
>>>> in the google doc was updated, please take a look first and let me know
>>> if
>>>> you have more questions.
>>>>
>>>> On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> wrote:
>>>>
>>>>> Hi Timo,
>>>>>
>>>>> Re> 1) We should not have the restriction "hive built-in functions
>> can
>>>>> only
>>>>>> be used when current catalog is hive catalog". Switching a catalog
>>>>>> should only have implications on the cat.db.object resolution but
>> not
>>>>>> functions. It would be quite convinient for users to use Hive
>>> built-ins
>>>>>> even if they use a Confluent schema registry or just the in-memory
>>>>> catalog.
>>>>>
>>>>> There might be a misunderstanding here.
>>>>>
>>>>> First of all, Hive built-in functions are not part of Flink built-in
>>>>> functions, they are catalog functions, thus if the current catalog is
>>>> not a
>>>>> HiveCatalog but, say, a schema registry catalog, ambiguous functions
>>>>> reference just shouldn't be resolved to a different catalog.
>>>>>
>>>>> Second, Hive built-in functions can potentially be referenced across
>>>>> catalog, but it doesn't have db namespace and we currently just don't
>>>> have
>>>>> a SQL syntax for it. It can be enabled when such a SQL syntax is
>>> defined,
>>>>> e.g. "catalog::function", but it's out of scope of this FLIP.
>>>>>
>>>>> 2) I would propose to have separate concepts for catalog and built-in
>>>>> functions. In particular it would be nice to modularize built-in
>>>>> functions. Some built-in functions are very crucial (like AS, CAST,
>>>>> MINUS), others are more optional but stable (MD5, CONCAT_WS), and
>> maybe
>>>>> we add more experimental functions in the future or function for some
>>>>> special application area (Geo functions, ML functions). A data
>> platform
>>>>> team might not want to make every built-in function available. Or a
>>>>> function module like ML functions is in a different Maven module.
>>>>>
>>>>> I think this is orthogonal to this FLIP, especially we don't have the
>>>>> "external built-in functions" anymore and currently the built-in
>>> function
>>>>> category remains untouched.
>>>>>
>>>>> But just to share some thoughts on the proposal, I'm not sure about
>> it:
>>>>> - I don't know if any other databases handle built-in functions like
>>>> that.
>>>>> Maybe you can give some examples? IMHO, built-in functions are system
>>>> info
>>>>> and should be deterministic, not depending on loaded libraries. Geo
>>>>> functions should be either built-in already or just libraries
>>> functions,
>>>>> and library functions can be adapted to catalog APIs or of some other
>>>>> syntax to use
>>>>> - I don't know if all use cases stand, and many can be achieved by
>>> other
>>>>> approaches too. E.g. experimental functions can be taken good care of
>>> by
>>>>> documentations, annotations, etc
>>>>> - the proposal basically introduces some concept like a pluggable
>>>> built-in
>>>>> function catalog, despite the already existing catalog APIs
>>>>> - it brings in even more complicated scenarios to the design. E.g.
>> how
>>> do
>>>>> you handle built-in functions in different modules but different
>> names?
>>>>> In short, I'm not sure if it really stands and it looks like an
>>> overkill
>>>>> to me. I'd rather not go to that route. Related discussion can be on
>>> its
>>>>> own thread.
>>>>>
>>>>> 3) Following the suggestion above, we can have a separate discovery
>>>>> mechanism for built-in functions. Instead of just going through a
>>> static
>>>>> list like in BuiltInFunctionDefinitions, a platform team should be
>> able
>>>>> to select function modules like
>>>>> catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
>>>>> HiveFunctions) or via service discovery;
>>>>>
>>>>> Same as above. I'll leave it to its own thread.
>>>>>
>>>>> re > 3) Dawid and I discussed the resulution order again. I agree
>> with
>>>>> Kurt
>>>>>> that we should unify built-in function (external or internal)
>> under a
>>>>>> common layer. However, the resolution order should be:
>>>>>> 1. built-in functions
>>>>>> 2. temporary functions
>>>>>> 3. regular catalog resolution logic
>>>>>> Otherwise a temporary function could cause clashes with Flink's
>>>> built-in
>>>>>> functions. If you take a look at other vendors, like SQL Server
>> they
>>>>>> also do not allow to overwrite built-in functions.
>>>>> ”I agree with Kurt that we should unify built-in function (external
>> or
>>>>> internal) under a common layer.“ <- I don't think this is what Kurt
>>>> means.
>>>>> Kurt and I are in favor of unifying built-in functions of external
>>>> systems
>>>>> and catalog functions. Did you type a mistake?
>>>>>
>>>>> Besides, I'm not sure about the resolution order you proposed.
>>> Temporary
>>>>> functions have a lifespan over a session and are only visible to the
>>>>> session owner, they are unique to each user, and users create them on
>>>>> purpose to be the highest priority in order to overwrite system info
>>>>> (built-in functions in this case).
>>>>>
>>>>> In your case, why would users name a temporary function the same as a
>>>>> built-in function then? Since using that name in ambiguous function
>>>>> reference will always be resolved to built-in functions, creating a
>>>>> same-named temp function would be meaningless in the end.
>>>>>
>>>>>
>>>>> On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> wrote:
>>>>>
>>>>>> Hi Jingsong,
>>>>>>
>>>>>> Re> 1.Hive built-in functions is an intermediate solution. So we
>>> should
>>>>>>> not introduce interfaces to influence the framework. To make
>>>>>>> Flink itself more powerful, we should implement the functions
>>>>>>> we need to add.
>>>>>> Yes, please see the doc.
>>>>>>
>>>>>> Re> 2.Non-flink built-in functions are easy for users to change
>> their
>>>>>>> behavior. If we support some flink built-in functions in the
>>>>>>> future but act differently from non-flink built-in, this will lead
>>> to
>>>>>>> changes in user behavior.
>>>>>> There's no such concept as "external built-in functions" any more.
>>>>>> Built-in functions of external systems will be treated as special
>>>> catalog
>>>>>> functions.
>>>>>>
>>>>>> Re> Another question is, does this fallback include all
>>>>>>> hive built-in functions? As far as I know, some hive functions
>>>>>>> have some hacky. If possible, can we start with a white list?
>>>>>>> Once we implement some functions to flink built-in, we can
>>>>>>> also update the whitelist.
>>>>>> Yes, that's something we thought of too. I don't think it's super
>>>>>> critical to the scope of this FLIP, thus I'd like to leave it to
>>> future
>>>>>> efforts as a nice-to-have feature.
>>>>>>
>>>>>>
>>>>>> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]>
>> wrote:
>>>>>>> Hi Kurt,
>>>>>>>
>>>>>>> Re: > What I want to propose is we can merge #3 and #4, make them
>>> both
>>>>>>> under
>>>>>>>> "catalog" concept, by extending catalog function to make it have
>>>>>>> ability to
>>>>>>>> have built-in catalog functions. Some benefits I can see from this
>>>>>>> approach:
>>>>>>>> 1. We don't have to introduce new concept like external built-in
>>>>>>> functions.
>>>>>>>> Actually I don't see a full story about how to treat a built-in
>>>>>>> functions, and it
>>>>>>>> seems a little bit disrupt with catalog. As a result, you have to
>>> make
>>>>>>> some restriction
>>>>>>>> like "hive built-in functions can only be used when current
>> catalog
>>> is
>>>>>>> hive catalog".
>>>>>>>
>>>>>>> Yes, I've unified #3 and #4 but it seems I didn't update some part
>> of
>>>>>>> the doc. I've modified those sections, and they are up to date now.
>>>>>>>
>>>>>>> In short, now built-in function of external systems are defined as
>> a
>>>>>>> special kind of catalog function in Flink, and handled by Flink as
>>>>>>> following:
>>>>>>> - An external built-in function must be associated with a catalog
>> for
>>>>>>> the purpose of decoupling flink-table and external systems.
>>>>>>> - It always resides in front of catalog functions in ambiguous
>>> function
>>>>>>> reference order, just like in its own external system
>>>>>>> - It is a special catalog function that doesn’t have a
>>> schema/database
>>>>>>> namespace
>>>>>>> - It goes thru the same instantiation logic as other user defined
>>>>>>> catalog functions in the external system
>>>>>>>
>>>>>>> Please take another look at the doc, and let me know if you have
>> more
>>>>>>> questions.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]>
>>>> wrote:
>>>>>>>> Hi Kurt,
>>>>>>>>
>>>>>>>> it should not affect the functions and operations we currently
>> have
>>> in
>>>>>>>> SQL. It just categorizes the available built-in functions. It is
>>> kind
>>>>>>>> of
>>>>>>>> an orthogonal concept to the catalog API but built-in functions
>>>> deserve
>>>>>>>> this special kind of treatment. CatalogFunction still fits
>> perfectly
>>>> in
>>>>>>>> there because the regular catalog object resolution logic is not
>>>>>>>> affected. So tables and functions are resolved in the same way but
>>>> with
>>>>>>>> built-in functions that have priority as in the original design.
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Timo
>>>>>>>>
>>>>>>>>
>>>>>>>> On 03.09.19 15:26, Kurt Young wrote:
>>>>>>>>> Does this only affect the functions and operations we currently
>>> have
>>>>>>>> in SQL
>>>>>>>>> and
>>>>>>>>> have no effect on tables, right? Looks like this is an
>> orthogonal
>>>>>>>> concept
>>>>>>>>> with Catalog?
>>>>>>>>> If the answer are both yes, then the catalog function will be a
>>>> weird
>>>>>>>>> concept?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Kurt
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email]
>>>>>>>> wrote:
>>>>>>>>>> The way you proposed are basically the same as what Calcite
>>> does, I
>>>>>>>> think
>>>>>>>>>> we are in the same line.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Danny Chan
>>>>>>>>>> 在 2019年9月3日 +0800 PM7:57，Timo Walther <[hidden email]>，写道：
>>>>>>>>>>> This sounds exactly as the module approach I mentioned, no?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Timo
>>>>>>>>>>>
>>>>>>>>>>> On 03.09.19 13:42, Danny Chan wrote:
>>>>>>>>>>>> Thanks Bowen for bring up this topic, I think it’s a useful
>>>>>>>>>> refactoring to make our function usage more user friendly.
>>>>>>>>>>>> For the topic of how to organize the builtin operators and
>>>>>>>> operators
>>>>>>>>>> of Hive, here is a solution from Apache Calcite, the Calcite
>> way
>>> is
>>>>>>>> to make
>>>>>>>>>> every dialect operators a “Library”, user can specify which
>>>>>>>> libraries they
>>>>>>>>>> want to use for a sql query. The builtin operators always comes
>>> as
>>>>>>>> the
>>>>>>>>>> first class objects and the others are used from the order they
>>>>>>>> appears.
>>>>>>>>>> Maybe you can take a reference.
>>>>>>>>>>>> [1]
>> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Danny Chan
>>>>>>>>>>>> 在 2019年8月28日 +0800 AM2:50，Bowen Li <[hidden email]>，写道：
>>>>>>>>>>>>> Hi folks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd like to kick off a discussion on reworking Flink's
>>>>>>>>>> FunctionCatalog.
>>>>>>>>>>>>> It's critically helpful to improve function usability in
>> SQL.
>>>>>>>>>>>>>
>> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
>>>>>>>>>>>>> In short, it:
>>>>>>>>>>>>> - adds support for precise function reference with
>>>> fully/partially
>>>>>>>>>>>>> qualified name
>>>>>>>>>>>>> - redefines function resolution order for ambiguous function
>>>>>>>>>> reference
>>>>>>>>>>>>> - adds support for Hive's rich built-in functions (support
>> for
>>>>>>>> Hive
>>>>>>>>>> user
>>>>>>>>>>>>> defined functions was already added in 1.9.0)
>>>>>>>>>>>>> - clarifies the concept of temporary functions
>>>>>>>>>>>>>
>>>>>>>>>>>>> Would love to hear your thoughts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Bowen
>>>>>>>>
>>
>> --
>> Xuefu Zhang
>>
>> "In Honey We Trust!"
>>


Xuefu Z

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Hi Dawid,

Thank you for sharing your findings. It seems to me that there is no SQL
standard regarding temporary functions. There are a few systems that support
them. Here is what I have found:

1. Hive: no DB qualifier allowed. Can overwrite built-in.
2. Spark: basically follows Hive
(https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html)
3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting
behavior.
(http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html)

Because there is no standard, it's perfectly fine for Flink to define
whatever it sees as appropriate. Thus, your proposal (no overwriting, and
temporary functions must have a DB as holder) is one option. The advantage
is simplicity. The downside is the deviation from Hive, which is popular
and the de facto standard in the big data world.
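
To illustrate where the two options diverge, here is a small sketch that
resolves the same name under both policies. It is only an illustration:
OverridePolicySketch, the Policy enum and resolve() are hypothetical
names, and the sketch simplifies by treating temporary functions as a
flat set of names rather than DB-qualified paths.

```java
import java.util.Set;

/** Hypothetical comparison of the two policies; not a Flink or Hive API. */
public class OverridePolicySketch {
    enum Policy {
        HIVE_STYLE,   // temporary functions may shadow built-in functions
        NO_OVERRIDE   // built-in functions always take precedence
    }

    /** Returns which kind of function an unqualified name resolves to. */
    static String resolve(Policy policy, String name,
                          Set<String> builtIns, Set<String> tempFunctions) {
        boolean isBuiltIn = builtIns.contains(name);
        boolean isTemp = tempFunctions.contains(name);
        if (policy == Policy.HIVE_STYLE) {
            // Temporary functions are checked first, so they can overwrite built-ins.
            if (isTemp) return "temporary";
            if (isBuiltIn) return "built-in";
        } else {
            // Built-ins are checked first, so a same-named temporary function is never hit.
            if (isBuiltIn) return "built-in";
            if (isTemp) return "temporary";
        }
        // Anything else falls through to regular catalog resolution.
        return "catalog";
    }
}
```

The two policies only disagree when a temporary function has the same
name as a built-in one; for every other name they resolve identically.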

However, I don't think we have to follow Hive. More importantly, we need a
consensus. I have no objection if your proposal is generally agreed upon.

Thanks,
Xuefu

On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email]>
wrote:

> Hi all,
>
> Just an opinion on the built-in <> temporary functions resolution and
> NAMING issue. I think we should not allow overriding the built-in
> functions, as this may pose serious issues and to be honest is rather
> not feasible and would require major rework. What happens if a user
> wants to override CAST? Calls to that function are generated at
> different layers of the stack that unfortunately does not always go
> through the Catalog API (at least yet). Moreover from what I've checked
> no other systems allow overriding the built-in functions. All the
> systems I've checked so far register temporary functions in a
> database/schema (either special database for temporary functions, or
> just current database). What I would suggest is to always register
> temporary functions with a 3 part identifier. The same way as tables,
> views etc. This effectively means you cannot override built-in
> functions. With such approach it is natural that the temporary functions
> end up a step lower in the resolution order:
>
> 1. built-in functions (1 part, maybe 2? - this is still under discussion)
>
> 2. temporary functions (always 3 part path)
>
> 3. catalog functions (always 3 part path)
>
> Let me know what do you think.
>
> Best,
>
> Dawid
>
> On 04/09/2019 06:13, Bowen Li wrote:
> > Hi,
> >
> > I agree with Xuefu that the main controversial points are mainly the two
> > places. My thoughts on them:
> >
> > 1) Determinism of referencing Hive built-in functions. We can either
> remove
> > Hive built-in functions from ambiguous function resolution and require
> > users to use special syntax for their qualified names, or add a config
> flag
> > to catalog constructor/yaml for turning on and off Hive built-in
> functions
> > with the flag set to 'false' by default and proper doc added to help
> users
> > make their decisions.
> >
> > 2) Flink temp functions v.s. Flink built-in functions in ambiguous
> function
> > resolution order. We believe Flink temp functions should precede Flink
> > built-in functions, and I have presented my reasons. Just in case if we
> > cannot reach an agreement, I propose forbid users registering temp
> > functions in the same name as a built-in function, like MySQL's approach,
> > for the moment. It won't have any performance concern, since built-in
> > functions are all in memory and thus cost of a name check will be really
> > trivial.
> >
> >
> > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> wrote:
> >
> >> From what I have seen, there are a couple of focal disagreements:
> >>
> >> 1. Resolution order: temp function --> flink built-in function -->
> catalog
> >> function vs flink built-in function --> temp function -> catalog
> function.
> >> 2. "External" built-in functions: how to treat built-in functions in
> >> external system and how users reference them
> >>
> >> For #1, I agree with Bowen that temp function needs to be at the highest
> >> priority because that's how a user might overwrite a built-in function
> >> without referencing a persistent, overwriting catalog function with a
> fully
> >> qualified name. Putting built-in functions at the highest priority
> >> eliminates that usage.
> >>
> >> For #2, I saw a general agreement on referencing "external" built-in
> >> functions such as those in Hive needs to be explicit and deterministic
> even
> >> though different approaches are proposed. To limit the scope and simply
> the
> >> usage, it seems making sense to me to introduce special syntax for
> user to
> >> explicitly reference an external built-in function such as hive1::sqrt
> or
> >> hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API
> call
> >> hive1.getFunction(ObjectPath functionName) where the database name is
> >> absent for bulit-in functions available in that catalog hive1. I
> understand
> >> that Bowen's original proposal was trying to avoid this, but this could
> >> turn out to be a clean and simple solution.
> >>
> >> (Timo's modular approach is great way to "expand" Flink's built-in
> function
> >> set, which seems orthogonal and complementary to this, which could be
> >> tackled in further future work.)
> >>
> >> I'd be happy to hear further thoughts on the two points.
> >>
> >> Thanks,
> >> Xuefu
> >>
> >> On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> wrote:
> >>
> >>> Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is
> the
> >>> same
> >>> as Bowen's. But after thinking about it, I'm currently lean to Timo's
> >>> suggestion.
> >>>
> >>> The reason is backward compatibility. If we follow Bowen's approach,
> >> let's
> >>> say we
> >>> first find function in Flink's built-in functions, and then hive's
> >>> built-in. For example, `foo`
> >>> is not supported by Flink, but hive has such built-in function. So user
> >>> will have hive's
> >>> behavior for function `foo`. And in next release, Flink realize this
> is a
> >>> very popular function
> >>> and add it into Flink's built-in functions, but with different behavior
> >> as
> >>> hive's. So in next
> >>> release, the behavior changes.
> >>>
> >>> With Timo's approach, IIUC user have to tell the framework explicitly
> >> what
> >>> kind of
> >>> built-in functions he would like to use. He can just tell framework to
> >>> abandon Flink's built-in
> >>> functions, and use hive's instead. User can only choose between them,
> but
> >>> not use
> >>> them at the same time. I think this approach is more predictable.
> >>>
> >>> Best,
> >>> Kurt
> >>>
> >>>
> >>> On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> wrote:
> >>>
> >>>> Hi all,
> >>>>
> >>>> Thanks for the feedback. Just a kind reminder that the [Proposal]
> >>> section
> >>>> in the google doc was updated, please take a look first and let me
> know
> >>> if
> >>>> you have more questions.
> >>>>
> >>>> On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> wrote:
> >>>>
> >>>>> Hi Timo,
> >>>>>
> >>>>> Re> 1) We should not have the restriction "hive built-in functions
> >> can
> >>>>> only
> >>>>>> be used when current catalog is hive catalog". Switching a catalog
> >>>>>> should only have implications on the cat.db.object resolution but
> >> not
> >>>>>> functions. It would be quite convenient for users to use Hive
> >>> built-ins
> >>>>>> even if they use a Confluent schema registry or just the in-memory
> >>>>> catalog.
> >>>>>
> >>>>> There might be a misunderstanding here.
> >>>>>
> >>>>> First of all, Hive built-in functions are not part of Flink built-in
> >>>>> functions, they are catalog functions, thus if the current catalog is
> >>>> not a
> >>>>> HiveCatalog but, say, a schema registry catalog, an ambiguous function
> >>>>> reference just shouldn't be resolved to a different catalog.
> >>>>>
> >>>>> Second, Hive built-in functions can potentially be referenced across
> >>>>> catalogs, but they don't have a db namespace and we currently just don't
> >>>> have
> >>>>> a SQL syntax for it. It can be enabled when such a SQL syntax is
> >>> defined,
> >>>>> e.g. "catalog::function", but it's out of scope of this FLIP.
> >>>>>
> >>>>> 2) I would propose to have separate concepts for catalog and built-in
> >>>>> functions. In particular it would be nice to modularize built-in
> >>>>> functions. Some built-in functions are very crucial (like AS, CAST,
> >>>>> MINUS), others are more optional but stable (MD5, CONCAT_WS), and
> >> maybe
> >>>>> we add more experimental functions in the future or function for some
> >>>>> special application area (Geo functions, ML functions). A data
> >> platform
> >>>>> team might not want to make every built-in function available. Or a
> >>>>> function module like ML functions is in a different Maven module.
> >>>>>
> >>>>> I think this is orthogonal to this FLIP, especially we don't have the
> >>>>> "external built-in functions" anymore and currently the built-in
> >>> function
> >>>>> category remains untouched.
> >>>>>
> >>>>> But just to share some thoughts on the proposal, I'm not sure about
> >> it:
> >>>>> - I don't know if any other databases handle built-in functions like
> >>>> that.
> >>>>> Maybe you can give some examples? IMHO, built-in functions are system
> >>>> info
> >>>>> and should be deterministic, not depending on loaded libraries. Geo
> >>>>> functions should be either built-in already or just libraries
> >>> functions,
> >>>>> and library functions can be adapted to catalog APIs or of some other
> >>>>> syntax to use
> >>>>> - I don't know if all use cases stand, and many can be achieved by
> >>> other
> >>>>> approaches too. E.g. experimental functions can be taken good care of
> >>> by
> >>>>> documentation, annotations, etc.
> >>>>> - the proposal basically introduces some concept like a pluggable
> >>>> built-in
> >>>>> function catalog, despite the already existing catalog APIs
> >>>>> - it brings in even more complicated scenarios to the design. E.g.
> >> how
> >>> do
> >>>>> you handle built-in functions in different modules but different
> >> names?
> >>>>> In short, I'm not sure if it really stands and it looks like an
> >>> overkill
> >>>>> to me. I'd rather not go down that route. Related discussion can be on
> >>> its
> >>>>> own thread.
> >>>>>
> >>>>> 3) Following the suggestion above, we can have a separate discovery
> >>>>> mechanism for built-in functions. Instead of just going through a
> >>> static
> >>>>> list like in BuiltInFunctionDefinitions, a platform team should be
> >> able
> >>>>> to select function modules like
> >>>>> catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
> >>>>> HiveFunctions) or via service discovery;
> >>>>>
> >>>>> Same as above. I'll leave it to its own thread.
> >>>>>
> >>>>> re > 3) Dawid and I discussed the resolution order again. I agree
> >> with
> >>>>> Kurt
> >>>>>> that we should unify built-in function (external or internal)
> >> under a
> >>>>>> common layer. However, the resolution order should be:
> >>>>>> 1. built-in functions
> >>>>>> 2. temporary functions
> >>>>>> 3. regular catalog resolution logic
> >>>>>> Otherwise a temporary function could cause clashes with Flink's
> >>>> built-in
> >>>>>> functions. If you take a look at other vendors, like SQL Server
> >> they
> >>>>>> also do not allow to overwrite built-in functions.
> >>>>> "I agree with Kurt that we should unify built-in function (external
> >> or
> >>>>> internal) under a common layer." <- I don't think this is what Kurt
> >>>> means.
> >>>>> Kurt and I are in favor of unifying built-in functions of external
> >>>> systems
> >>>>> and catalog functions. Was that a typo?
> >>>>>
> >>>>> Besides, I'm not sure about the resolution order you proposed.
> >>> Temporary
> >>>>> functions have a lifespan over a session and are only visible to the
> >>>>> session owner, they are unique to each user, and users create them on
> >>>>> purpose to be the highest priority in order to overwrite system info
> >>>>> (built-in functions in this case).
> >>>>>
> >>>>> In your case, why would users name a temporary function the same as a
> >>>>> built-in function then? Since using that name in ambiguous function
> >>>>> reference will always be resolved to built-in functions, creating a
> >>>>> same-named temp function would be meaningless in the end.
> >>>>>
> >>>>>
> >>>>> On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> wrote:
> >>>>>
> >>>>>> Hi Jingsong,
> >>>>>>
> >>>>>> Re> 1.Hive built-in functions is an intermediate solution. So we
> >>> should
> >>>>>>> not introduce interfaces to influence the framework. To make
> >>>>>>> Flink itself more powerful, we should implement the functions
> >>>>>>> we need to add.
> >>>>>> Yes, please see the doc.
> >>>>>>
> >>>>>> Re> 2.Non-flink built-in functions are easy for users to change
> >> their
> >>>>>>> behavior. If we support some flink built-in functions in the
> >>>>>>> future but act differently from non-flink built-in, this will lead
> >>> to
> >>>>>>> changes in user behavior.
> >>>>>> There's no such concept as "external built-in functions" any more.
> >>>>>> Built-in functions of external systems will be treated as special
> >>>> catalog
> >>>>>> functions.
> >>>>>>
> >>>>>> Re> Another question is, does this fallback include all
> >>>>>>> hive built-in functions? As far as I know, some hive functions
> >>>>>>> have some hacky. If possible, can we start with a white list?
> >>>>>>> Once we implement some functions to flink built-in, we can
> >>>>>>> also update the whitelist.
> >>>>>> Yes, that's something we thought of too. I don't think it's super
> >>>>>> critical to the scope of this FLIP, thus I'd like to leave it to
> >>> future
> >>>>>> efforts as a nice-to-have feature.
> >>>>>>
> >>>>>>
> >>>>>> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]>
> >> wrote:
> >>>>>>> Hi Kurt,
> >>>>>>>
> >>>>>>> Re: > What I want to propose is we can merge #3 and #4, make them
> >>> both
> >>>>>>> under
> >>>>>>>> "catalog" concept, by extending catalog function to make it have
> >>>>>>> ability to
> >>>>>>>> have built-in catalog functions. Some benefits I can see from this
> >>>>>>> approach:
> >>>>>>>> 1. We don't have to introduce new concept like external built-in
> >>>>>>> functions.
> >>>>>>>> Actually I don't see a full story about how to treat a built-in
> >>>>>>> functions, and it
> >>>>>>>> seems a little bit disrupt with catalog. As a result, you have to
> >>> make
> >>>>>>> some restriction
> >>>>>>>> like "hive built-in functions can only be used when current
> >> catalog
> >>> is
> >>>>>>> hive catalog".
> >>>>>>>
> >>>>>>> Yes, I've unified #3 and #4 but it seems I didn't update some part
> >> of
> >>>>>>> the doc. I've modified those sections, and they are up to date now.
> >>>>>>>
> >>>>>>> In short, built-in functions of external systems are now defined as
> >> a
> >>>>>>> special kind of catalog function in Flink, and handled by Flink as
> >>>>>>> follows:
> >>>>>>> - An external built-in function must be associated with a catalog
> >> for
> >>>>>>> the purpose of decoupling flink-table and external systems.
> >>>>>>> - It always resides in front of catalog functions in ambiguous
> >>> function
> >>>>>>> reference order, just like in its own external system
> >>>>>>> - It is a special catalog function that doesn’t have a
> >>> schema/database
> >>>>>>> namespace
> >>>>>>> - It goes thru the same instantiation logic as other user defined
> >>>>>>> catalog functions in the external system
> >>>>>>>
> >>>>>>> Please take another look at the doc, and let me know if you have
> >> more
> >>>>>>> questions.
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]>
> >>>> wrote:
> >>>>>>>> Hi Kurt,
> >>>>>>>>
> >>>>>>>> it should not affect the functions and operations we currently
> >> have
> >>> in
> >>>>>>>> SQL. It just categorizes the available built-in functions. It is
> >>> kind
> >>>>>>>> of
> >>>>>>>> an orthogonal concept to the catalog API but built-in functions
> >>>> deserve
> >>>>>>>> this special kind of treatment. CatalogFunction still fits
> >> perfectly
> >>>> in
> >>>>>>>> there because the regular catalog object resolution logic is not
> >>>>>>>> affected. So tables and functions are resolved in the same way but
> >>>> with
> >>>>>>>> built-in functions that have priority as in the original design.
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> Timo
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 03.09.19 15:26, Kurt Young wrote:
> >>>>>>>>> Does this only affect the functions and operations we currently
> >>> have
> >>>>>>>> in SQL
> >>>>>>>>> and
> >>>>>>>>> have no effect on tables, right? Looks like this is an
> >> orthogonal
> >>>>>>>> concept
> >>>>>>>>> with Catalog?
> >>>>>>>>> If the answers are both yes, then the catalog function will be a
> >>>> weird
> >>>>>>>>> concept?
> >>>>>>>>>
> >>>>>>>>> Best,
> >>>>>>>>> Kurt
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email]
> >>>>>>>> wrote:
> >>>>>>>>>> The way you proposed is basically the same as what Calcite
> >>> does, I
> >>>>>>>> think
> >>>>>>>>>> we are on the same page.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Danny Chan
> >>>>>>>>>> On Sep 3, 2019 at 7:57 PM +0800, Timo Walther <[hidden email]> wrote:
> >>>>>>>>>>> This sounds exactly like the module approach I mentioned, no?
> >>>>>>>>>>>
> >>>>>>>>>>> Regards,
> >>>>>>>>>>> Timo
> >>>>>>>>>>>
> >>>>>>>>>>> On 03.09.19 13:42, Danny Chan wrote:
> >>>>>>>>>>>> Thanks Bowen for bringing up this topic, I think it’s a useful
> >>>>>>>>>> refactoring to make our function usage more user friendly.
> >>>>>>>>>>>> For the topic of how to organize the builtin operators and
> >>>>>>>> operators
> >>>>>>>>>> of Hive, here is a solution from Apache Calcite, the Calcite
> >> way
> >>> is
> >>>>>>>> to make
> >>>>>>>>>> every dialect's operators a “Library”; users can specify which
> >>>>>>>> libraries they
> >>>>>>>>>> want to use for a sql query. The builtin operators always come
> >>> as
> >>>>>>>> the
> >>>>>>>>>> first-class objects and the others are used in the order they
> >>>>>>>> appear.
> >>>>>>>>>> Maybe you can use it as a reference.
> >>>>>>>>>>>> [1]
> >>
> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Danny Chan
> >>>>>>>>>>>> On Aug 28, 2019 at 2:50 AM +0800, Bowen Li <[hidden email]> wrote:
> >>>>>>>>>>>>> Hi folks,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'd like to kick off a discussion on reworking Flink's
> >>>>>>>>>> FunctionCatalog.
> >>>>>>>>>>>>> It's critically helpful to improve function usability in
> >> SQL.
> >>>>>>>>>>>>>
> >>
> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
> >>>>>>>>>>>>> In short, it:
> >>>>>>>>>>>>> - adds support for precise function reference with
> >>>> fully/partially
> >>>>>>>>>>>>> qualified name
> >>>>>>>>>>>>> - redefines function resolution order for ambiguous function
> >>>>>>>>>> reference
> >>>>>>>>>>>>> - adds support for Hive's rich built-in functions (support
> >> for
> >>>>>>>> Hive
> >>>>>>>>>> user
> >>>>>>>>>>>>> defined functions was already added in 1.9.0)
> >>>>>>>>>>>>> - clarifies the concept of temporary functions
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Would love to hear your thoughts.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Bowen
> >>>>>>>>
> >>
> >> --
> >> Xuefu Zhang
> >>
> >> "In Honey We Trust!"
> >>
>
>

--
Xuefu Zhang

"In Honey We Trust!"
