Hi all,
Regarding #1 temp function <> built-in function and naming. I'm fine with temp functions should precede built-in function and can override built-in functions (we already support to override built-in function in 1.9). If we don't allow the same name as a built-in function, I'm afraid we will have compatibility issues in the future. Say users register a user defined function named "explode" in 1.9, and we support a built-in "explode" function in 1.10. Then the user's jobs which call the registered "explode" function in 1.9 will all fail in 1.10 because of naming conflict. Regarding #2 "External" built-in functions. I think if we store external built-in functions in catalog, then "hive1::sqrt" is a good way to go. However, I would prefer to support a discovery mechanism (e.g. SPI) for built-in functions as Timo suggested above. This gives us the flexibility to add Hive or MySQL or Geo or whatever function set as built-in functions in an easy way. Best, Jark On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> wrote: > Hi David, > > Thank you for sharing your findings. It seems to me that there is no SQL > standard regarding temporary functions. There are few systems that support > it. Here are what I have found: > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > 2. Spark: basically follows Hive ( > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > ) > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting > behavior. ( > http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html) > > Because of lack of standard, it's perfectly fine for Flink to define > whatever it sees appropriate. Thus, your proposal (no overwriting and must > have DB as holder) is one option. The advantage is simplicity, The downside > is the deviation from Hive, which is popular and de facto standard in big > data world. > > However, I don't think we have to follow Hive. More importantly, we need a > consensus. I have no objection if your proposal is generally agreed upon. > > Thanks, > Xuefu > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email]> > wrote: > > > Hi all, > > > > Just an opinion on the built-in <> temporary functions resolution and > > NAMING issue. I think we should not allow overriding the built-in > > functions, as this may pose serious issues and to be honest is rather > > not feasible and would require major rework. What happens if a user > > wants to override CAST? Calls to that function are generated at > > different layers of the stack that unfortunately does not always go > > through the Catalog API (at least yet). Moreover from what I've checked > > no other systems allow overriding the built-in functions. All the > > systems I've checked so far register temporary functions in a > > database/schema (either special database for temporary functions, or > > just current database). What I would suggest is to always register > > temporary functions with a 3 part identifier. The same way as tables, > > views etc. This effectively means you cannot override built-in > > functions. With such approach it is natural that the temporary functions > > end up a step lower in the resolution order: > > > > 1. built-in functions (1 part, maybe 2? - this is still under discussion) > > > > 2. temporary functions (always 3 part path) > > > > 3. catalog functions (always 3 part path) > > > > Let me know what do you think. > > > > Best, > > > > Dawid > > > > On 04/09/2019 06:13, Bowen Li wrote: > > > Hi, > > > > > > I agree with Xuefu that the main controversial points are mainly the > two > > > places. My thoughts on them: > > > > > > 1) Determinism of referencing Hive built-in functions. We can either > > remove > > > Hive built-in functions from ambiguous function resolution and require > > > users to use special syntax for their qualified names, or add a config > > flag > > > to catalog constructor/yaml for turning on and off Hive built-in > > functions > > > with the flag set to 'false' by default and proper doc added to help > > users > > > make their decisions. > > > > > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > function > > > resolution order. We believe Flink temp functions should precede Flink > > > built-in functions, and I have presented my reasons. Just in case if we > > > cannot reach an agreement, I propose forbid users registering temp > > > functions in the same name as a built-in function, like MySQL's > approach, > > > for the moment. It won't have any performance concern, since built-in > > > functions are all in memory and thus cost of a name check will be > really > > > trivial. > > > > > > > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> wrote: > > > > > >> From what I have seen, there are a couple of focal disagreements: > > >> > > >> 1. Resolution order: temp function --> flink built-in function --> > > catalog > > >> function vs flink built-in function --> temp function -> catalog > > function. > > >> 2. "External" built-in functions: how to treat built-in functions in > > >> external system and how users reference them > > >> > > >> For #1, I agree with Bowen that temp function needs to be at the > highest > > >> priority because that's how a user might overwrite a built-in function > > >> without referencing a persistent, overwriting catalog function with a > > fully > > >> qualified name. Putting built-in functions at the highest priority > > >> eliminates that usage. > > >> > > >> For #2, I saw a general agreement on referencing "external" built-in > > >> functions such as those in Hive needs to be explicit and deterministic > > even > > >> though different approaches are proposed. To limit the scope and > simply > > the > > >> usage, it seems making sense to me to introduce special syntax for > > user to > > >> explicitly reference an external built-in function such as hive1::sqrt > > or > > >> hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API > > call > > >> hive1.getFunction(ObjectPath functionName) where the database name is > > >> absent for bulit-in functions available in that catalog hive1. I > > understand > > >> that Bowen's original proposal was trying to avoid this, but this > could > > >> turn out to be a clean and simple solution. > > >> > > >> (Timo's modular approach is great way to "expand" Flink's built-in > > function > > >> set, which seems orthogonal and complementary to this, which could be > > >> tackled in further future work.) > > >> > > >> I'd be happy to hear further thoughts on the two points. > > >> > > >> Thanks, > > >> Xuefu > > >> > > >> On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> wrote: > > >> > > >>> Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > the > > >>> same > > >>> as Bowen's. But after thinking about it, I'm currently lean to Timo's > > >>> suggestion. > > >>> > > >>> The reason is backward compatibility. If we follow Bowen's approach, > > >> let's > > >>> say we > > >>> first find function in Flink's built-in functions, and then hive's > > >>> built-in. For example, `foo` > > >>> is not supported by Flink, but hive has such built-in function. So > user > > >>> will have hive's > > >>> behavior for function `foo`. And in next release, Flink realize this > > is a > > >>> very popular function > > >>> and add it into Flink's built-in functions, but with different > behavior > > >> as > > >>> hive's. So in next > > >>> release, the behavior changes. > > >>> > > >>> With Timo's approach, IIUC user have to tell the framework explicitly > > >> what > > >>> kind of > > >>> built-in functions he would like to use. He can just tell framework > to > > >>> abandon Flink's built-in > > >>> functions, and use hive's instead. User can only choose between them, > > but > > >>> not use > > >>> them at the same time. I think this approach is more predictable. > > >>> > > >>> Best, > > >>> Kurt > > >>> > > >>> > > >>> On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> wrote: > > >>> > > >>>> Hi all, > > >>>> > > >>>> Thanks for the feedback. Just a kindly reminder that the [Proposal] > > >>> section > > >>>> in the google doc was updated, please take a look first and let me > > know > > >>> if > > >>>> you have more questions. > > >>>> > > >>>> On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> > wrote: > > >>>> > > >>>>> Hi Timo, > > >>>>> > > >>>>> Re> 1) We should not have the restriction "hive built-in functions > > >> can > > >>>>> only > > >>>>>> be used when current catalog is hive catalog". Switching a catalog > > >>>>>> should only have implications on the cat.db.object resolution but > > >> not > > >>>>>> functions. It would be quite convinient for users to use Hive > > >>> built-ins > > >>>>>> even if they use a Confluent schema registry or just the in-memory > > >>>>> catalog. > > >>>>> > > >>>>> There might be a misunderstanding here. > > >>>>> > > >>>>> First of all, Hive built-in functions are not part of Flink > built-in > > >>>>> functions, they are catalog functions, thus if the current catalog > is > > >>>> not a > > >>>>> HiveCatalog but, say, a schema registry catalog, ambiguous > functions > > >>>>> reference just shouldn't be resolved to a different catalog. > > >>>>> > > >>>>> Second, Hive built-in functions can potentially be referenced > across > > >>>>> catalog, but it doesn't have db namespace and we currently just > don't > > >>>> have > > >>>>> a SQL syntax for it. It can be enabled when such a SQL syntax is > > >>> defined, > > >>>>> e.g. "catalog::function", but it's out of scope of this FLIP. > > >>>>> > > >>>>> 2) I would propose to have separate concepts for catalog and > built-in > > >>>>> functions. In particular it would be nice to modularize built-in > > >>>>> functions. Some built-in functions are very crucial (like AS, CAST, > > >>>>> MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > >> maybe > > >>>>> we add more experimental functions in the future or function for > some > > >>>>> special application area (Geo functions, ML functions). A data > > >> platform > > >>>>> team might not want to make every built-in function available. Or a > > >>>>> function module like ML functions is in a different Maven module. > > >>>>> > > >>>>> I think this is orthogonal to this FLIP, especially we don't have > the > > >>>>> "external built-in functions" anymore and currently the built-in > > >>> function > > >>>>> category remains untouched. > > >>>>> > > >>>>> But just to share some thoughts on the proposal, I'm not sure about > > >> it: > > >>>>> - I don't know if any other databases handle built-in functions > like > > >>>> that. > > >>>>> Maybe you can give some examples? IMHO, built-in functions are > system > > >>>> info > > >>>>> and should be deterministic, not depending on loaded libraries. Geo > > >>>>> functions should be either built-in already or just libraries > > >>> functions, > > >>>>> and library functions can be adapted to catalog APIs or of some > other > > >>>>> syntax to use > > >>>>> - I don't know if all use cases stand, and many can be achieved by > > >>> other > > >>>>> approaches too. E.g. experimental functions can be taken good care > of > > >>> by > > >>>>> documentations, annotations, etc > > >>>>> - the proposal basically introduces some concept like a pluggable > > >>>> built-in > > >>>>> function catalog, despite the already existing catalog APIs > > >>>>> - it brings in even more complicated scenarios to the design. E.g. > > >> how > > >>> do > > >>>>> you handle built-in functions in different modules but different > > >> names? > > >>>>> In short, I'm not sure if it really stands and it looks like an > > >>> overkill > > >>>>> to me. I'd rather not go to that route. Related discussion can be > on > > >>> its > > >>>>> own thread. > > >>>>> > > >>>>> 3) Following the suggestion above, we can have a separate discovery > > >>>>> mechanism for built-in functions. Instead of just going through a > > >>> static > > >>>>> list like in BuiltInFunctionDefinitions, a platform team should be > > >> able > > >>>>> to select function modules like > > >>>>> catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > > >>>>> HiveFunctions) or via service discovery; > > >>>>> > > >>>>> Same as above. I'll leave it to its own thread. > > >>>>> > > >>>>> re > 3) Dawid and I discussed the resulution order again. I agree > > >> with > > >>>>> Kurt > > >>>>>> that we should unify built-in function (external or internal) > > >> under a > > >>>>>> common layer. However, the resolution order should be: > > >>>>>> 1. built-in functions > > >>>>>> 2. temporary functions > > >>>>>> 3. regular catalog resolution logic > > >>>>>> Otherwise a temporary function could cause clashes with Flink's > > >>>> built-in > > >>>>>> functions. If you take a look at other vendors, like SQL Server > > >> they > > >>>>>> also do not allow to overwrite built-in functions. > > >>>>> ”I agree with Kurt that we should unify built-in function (external > > >> or > > >>>>> internal) under a common layer.“ <- I don't think this is what Kurt > > >>>> means. > > >>>>> Kurt and I are in favor of unifying built-in functions of external > > >>>> systems > > >>>>> and catalog functions. Did you type a mistake? > > >>>>> > > >>>>> Besides, I'm not sure about the resolution order you proposed. > > >>> Temporary > > >>>>> functions have a lifespan over a session and are only visible to > the > > >>>>> session owner, they are unique to each user, and users create them > on > > >>>>> purpose to be the highest priority in order to overwrite system > info > > >>>>> (built-in functions in this case). > > >>>>> > > >>>>> In your case, why would users name a temporary function the same > as a > > >>>>> built-in function then? Since using that name in ambiguous function > > >>>>> reference will always be resolved to built-in functions, creating a > > >>>>> same-named temp function would be meaningless in the end. > > >>>>> > > >>>>> > > >>>>> On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> > wrote: > > >>>>> > > >>>>>> Hi Jingsong, > > >>>>>> > > >>>>>> Re> 1.Hive built-in functions is an intermediate solution. So we > > >>> should > > >>>>>>> not introduce interfaces to influence the framework. To make > > >>>>>>> Flink itself more powerful, we should implement the functions > > >>>>>>> we need to add. > > >>>>>> Yes, please see the doc. > > >>>>>> > > >>>>>> Re> 2.Non-flink built-in functions are easy for users to change > > >> their > > >>>>>>> behavior. If we support some flink built-in functions in the > > >>>>>>> future but act differently from non-flink built-in, this will > lead > > >>> to > > >>>>>>> changes in user behavior. > > >>>>>> There's no such concept as "external built-in functions" any more. > > >>>>>> Built-in functions of external systems will be treated as special > > >>>> catalog > > >>>>>> functions. > > >>>>>> > > >>>>>> Re> Another question is, does this fallback include all > > >>>>>>> hive built-in functions? As far as I know, some hive functions > > >>>>>>> have some hacky. If possible, can we start with a white list? > > >>>>>>> Once we implement some functions to flink built-in, we can > > >>>>>>> also update the whitelist. > > >>>>>> Yes, that's something we thought of too. I don't think it's super > > >>>>>> critical to the scope of this FLIP, thus I'd like to leave it to > > >>> future > > >>>>>> efforts as a nice-to-have feature. > > >>>>>> > > >>>>>> > > >>>>>> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> > > >> wrote: > > >>>>>>> Hi Kurt, > > >>>>>>> > > >>>>>>> Re: > What I want to propose is we can merge #3 and #4, make them > > >>> both > > >>>>>>> under > > >>>>>>>> "catalog" concept, by extending catalog function to make it have > > >>>>>>> ability to > > >>>>>>>> have built-in catalog functions. Some benefits I can see from > this > > >>>>>>> approach: > > >>>>>>>> 1. We don't have to introduce new concept like external built-in > > >>>>>>> functions. > > >>>>>>>> Actually I don't see a full story about how to treat a built-in > > >>>>>>> functions, and it > > >>>>>>>> seems a little bit disrupt with catalog. As a result, you have > to > > >>> make > > >>>>>>> some restriction > > >>>>>>>> like "hive built-in functions can only be used when current > > >> catalog > > >>> is > > >>>>>>> hive catalog". > > >>>>>>> > > >>>>>>> Yes, I've unified #3 and #4 but it seems I didn't update some > part > > >> of > > >>>>>>> the doc. I've modified those sections, and they are up to date > now. > > >>>>>>> > > >>>>>>> In short, now built-in function of external systems are defined > as > > >> a > > >>>>>>> special kind of catalog function in Flink, and handled by Flink > as > > >>>>>>> following: > > >>>>>>> - An external built-in function must be associated with a catalog > > >> for > > >>>>>>> the purpose of decoupling flink-table and external systems. > > >>>>>>> - It always resides in front of catalog functions in ambiguous > > >>> function > > >>>>>>> reference order, just like in its own external system > > >>>>>>> - It is a special catalog function that doesn’t have a > > >>> schema/database > > >>>>>>> namespace > > >>>>>>> - It goes thru the same instantiation logic as other user defined > > >>>>>>> catalog functions in the external system > > >>>>>>> > > >>>>>>> Please take another look at the doc, and let me know if you have > > >> more > > >>>>>>> questions. > > >>>>>>> > > >>>>>>> > > >>>>>>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> > > >>>> wrote: > > >>>>>>>> Hi Kurt, > > >>>>>>>> > > >>>>>>>> it should not affect the functions and operations we currently > > >> have > > >>> in > > >>>>>>>> SQL. It just categorizes the available built-in functions. It is > > >>> kind > > >>>>>>>> of > > >>>>>>>> an orthogonal concept to the catalog API but built-in functions > > >>>> deserve > > >>>>>>>> this special kind of treatment. CatalogFunction still fits > > >> perfectly > > >>>> in > > >>>>>>>> there because the regular catalog object resolution logic is not > > >>>>>>>> affected. So tables and functions are resolved in the same way > but > > >>>> with > > >>>>>>>> built-in functions that have priority as in the original design. > > >>>>>>>> > > >>>>>>>> Regards, > > >>>>>>>> Timo > > >>>>>>>> > > >>>>>>>> > > >>>>>>>> On 03.09.19 15:26, Kurt Young wrote: > > >>>>>>>>> Does this only affect the functions and operations we currently > > >>> have > > >>>>>>>> in SQL > > >>>>>>>>> and > > >>>>>>>>> have no effect on tables, right? Looks like this is an > > >> orthogonal > > >>>>>>>> concept > > >>>>>>>>> with Catalog? > > >>>>>>>>> If the answer are both yes, then the catalog function will be a > > >>>> weird > > >>>>>>>>> concept? > > >>>>>>>>> > > >>>>>>>>> Best, > > >>>>>>>>> Kurt > > >>>>>>>>> > > >>>>>>>>> > > >>>>>>>>> On Tue, Sep 3, 2019 at 8:10 PM Danny Chan < > [hidden email] > > >>>>>>>> wrote: > > >>>>>>>>>> The way you proposed are basically the same as what Calcite > > >>> does, I > > >>>>>>>> think > > >>>>>>>>>> we are in the same line. > > >>>>>>>>>> > > >>>>>>>>>> Best, > > >>>>>>>>>> Danny Chan > > >>>>>>>>>> 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > >,写道: > > >>>>>>>>>>> This sounds exactly as the module approach I mentioned, no? > > >>>>>>>>>>> > > >>>>>>>>>>> Regards, > > >>>>>>>>>>> Timo > > >>>>>>>>>>> > > >>>>>>>>>>> On 03.09.19 13:42, Danny Chan wrote: > > >>>>>>>>>>>> Thanks Bowen for bring up this topic, I think it’s a useful > > >>>>>>>>>> refactoring to make our function usage more user friendly. > > >>>>>>>>>>>> For the topic of how to organize the builtin operators and > > >>>>>>>> operators > > >>>>>>>>>> of Hive, here is a solution from Apache Calcite, the Calcite > > >> way > > >>> is > > >>>>>>>> to make > > >>>>>>>>>> every dialect operators a “Library”, user can specify which > > >>>>>>>> libraries they > > >>>>>>>>>> want to use for a sql query. The builtin operators always > comes > > >>> as > > >>>>>>>> the > > >>>>>>>>>> first class objects and the others are used from the order > they > > >>>>>>>> appears. > > >>>>>>>>>> Maybe you can take a reference. > > >>>>>>>>>>>> [1] > > >> > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > >>>>>>>>>>>> Best, > > >>>>>>>>>>>> Danny Chan > > >>>>>>>>>>>> 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > >,写道: > > >>>>>>>>>>>>> Hi folks, > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> I'd like to kick off a discussion on reworking Flink's > > >>>>>>>>>> FunctionCatalog. > > >>>>>>>>>>>>> It's critically helpful to improve function usability in > > >> SQL. > > >>>>>>>>>>>>> > > >> > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > >>>>>>>>>>>>> In short, it: > > >>>>>>>>>>>>> - adds support for precise function reference with > > >>>> fully/partially > > >>>>>>>>>>>>> qualified name > > >>>>>>>>>>>>> - redefines function resolution order for ambiguous > function > > >>>>>>>>>> reference > > >>>>>>>>>>>>> - adds support for Hive's rich built-in functions (support > > >> for > > >>>>>>>> Hive > > >>>>>>>>>> user > > >>>>>>>>>>>>> defined functions was already added in 1.9.0) > > >>>>>>>>>>>>> - clarifies the concept of temporary functions > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Would love to hear your thoughts. > > >>>>>>>>>>>>> > > >>>>>>>>>>>>> Bowen > > >>>>>>>> > > >> > > >> -- > > >> Xuefu Zhang > > >> > > >> "In Honey We Trust!" > > >> > > > > > > -- > Xuefu Zhang > > "In Honey We Trust!" > |
Hi all,
thanks for the healthy discussion. It is already a very long discussion with a lot of text. So I will just post my opinion to a couple of statements: > Hive built-in functions are not part of Flink built-in functions, they are catalog functions That is not entirely true. Correct me if I'm wrong but I think Hive built-in functions are also not catalog functions. They are not stored in every Hive metastore catalog that is freshly created but are a set of functions that are listed somewhere and made available. > ambiguous functions reference just shouldn't be resolved to a different catalog I agree. They should not be resolved to a different catalog. That's why I am suggesting to split the concept of built-in functions and catalog lookup semantics. > I don't know if any other databases handle built-in functions like that What I called "module" is: - Extension in Postgres [1] - Plugin in Presto [2] Btw. Presto even mentions example modules that are similar to the ones that we will introduce in the near future both for ML and System XYZ compatibility: "See either the presto-ml module for machine learning functions or the presto-teradata-functions module for Teradata-compatible functions, both in the root of the Presto source." > functions should be either built-in already or just libraries functions, and library functions can be adapted to catalog APIs or of some other syntax to use Regarding "built-in already", of course we can add a lot of functions as built-ins but we will end-up in a dependency hell in the near future if we don't introduce a pluggable approach. Library functions is what you also suggest but storing them in a catalog means to always fully qualify them or modifying the existing catalog design that was inspired by the standard. I don't think "it brings in even more complicated scenarios to the design", it just does clear separation of concerns. Integrating the functionality into the current design makes the catalog API more complicated. > why would users name a temporary function the same as a built-in function then? Because you never know what users do. If they don't, my suggested resolution order should not be a problem, right? > I don't think hive functions deserves be a function module Our goal is not to create a Hive clone. We need to think forward and Hive is just one of many systems that we can support. Not every built-in function behaves and will behave exactly like Hive. > regarding temporary functions, there are few systems that support it IMHO Spark and Hive are not always the best examples for consistent design. Systems like Postgres, Presto, or SQL Server should be used as a reference. I don't think that a user can overwrite a built-in function there. Regards, Timo [1] https://www.postgresql.org/docs/10/extend-extensions.html [2] https://prestodb.github.io/docs/current/develop/functions.html On 04.09.19 13:44, Jark Wu wrote: > Hi all, > > Regarding #1 temp function <> built-in function and naming. > I'm fine with temp functions should precede built-in function and can > override built-in functions (we already support to override built-in > function in 1.9). > If we don't allow the same name as a built-in function, I'm afraid we will > have compatibility issues in the future. > Say users register a user defined function named "explode" in 1.9, and we > support a built-in "explode" function in 1.10. > Then the user's jobs which call the registered "explode" function in 1.9 > will all fail in 1.10 because of naming conflict. > > Regarding #2 "External" built-in functions. > I think if we store external built-in functions in catalog, then > "hive1::sqrt" is a good way to go. > However, I would prefer to support a discovery mechanism (e.g. SPI) for > built-in functions as Timo suggested above. > This gives us the flexibility to add Hive or MySQL or Geo or whatever > function set as built-in functions in an easy way. > > Best, > Jark > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> wrote: > >> Hi David, >> >> Thank you for sharing your findings. It seems to me that there is no SQL >> standard regarding temporary functions. There are few systems that support >> it. Here are what I have found: >> >> 1. Hive: no DB qualifier allowed. Can overwrite built-in. >> 2. Spark: basically follows Hive ( >> >> https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html >> ) >> 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting >> behavior. ( >> http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html) >> >> Because of lack of standard, it's perfectly fine for Flink to define >> whatever it sees appropriate. Thus, your proposal (no overwriting and must >> have DB as holder) is one option. The advantage is simplicity, The downside >> is the deviation from Hive, which is popular and de facto standard in big >> data world. >> >> However, I don't think we have to follow Hive. More importantly, we need a >> consensus. I have no objection if your proposal is generally agreed upon. >> >> Thanks, >> Xuefu >> >> On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email]> >> wrote: >> >>> Hi all, >>> >>> Just an opinion on the built-in <> temporary functions resolution and >>> NAMING issue. I think we should not allow overriding the built-in >>> functions, as this may pose serious issues and to be honest is rather >>> not feasible and would require major rework. What happens if a user >>> wants to override CAST? Calls to that function are generated at >>> different layers of the stack that unfortunately does not always go >>> through the Catalog API (at least yet). Moreover from what I've checked >>> no other systems allow overriding the built-in functions. All the >>> systems I've checked so far register temporary functions in a >>> database/schema (either special database for temporary functions, or >>> just current database). What I would suggest is to always register >>> temporary functions with a 3 part identifier. The same way as tables, >>> views etc. This effectively means you cannot override built-in >>> functions. With such approach it is natural that the temporary functions >>> end up a step lower in the resolution order: >>> >>> 1. built-in functions (1 part, maybe 2? - this is still under discussion) >>> >>> 2. temporary functions (always 3 part path) >>> >>> 3. catalog functions (always 3 part path) >>> >>> Let me know what do you think. >>> >>> Best, >>> >>> Dawid >>> >>> On 04/09/2019 06:13, Bowen Li wrote: >>>> Hi, >>>> >>>> I agree with Xuefu that the main controversial points are mainly the >> two >>>> places. My thoughts on them: >>>> >>>> 1) Determinism of referencing Hive built-in functions. We can either >>> remove >>>> Hive built-in functions from ambiguous function resolution and require >>>> users to use special syntax for their qualified names, or add a config >>> flag >>>> to catalog constructor/yaml for turning on and off Hive built-in >>> functions >>>> with the flag set to 'false' by default and proper doc added to help >>> users >>>> make their decisions. >>>> >>>> 2) Flink temp functions v.s. Flink built-in functions in ambiguous >>> function >>>> resolution order. We believe Flink temp functions should precede Flink >>>> built-in functions, and I have presented my reasons. Just in case if we >>>> cannot reach an agreement, I propose forbid users registering temp >>>> functions in the same name as a built-in function, like MySQL's >> approach, >>>> for the moment. It won't have any performance concern, since built-in >>>> functions are all in memory and thus cost of a name check will be >> really >>>> trivial. >>>> >>>> >>>> On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> wrote: >>>> >>>>> From what I have seen, there are a couple of focal disagreements: >>>>> >>>>> 1. Resolution order: temp function --> flink built-in function --> >>> catalog >>>>> function vs flink built-in function --> temp function -> catalog >>> function. >>>>> 2. "External" built-in functions: how to treat built-in functions in >>>>> external system and how users reference them >>>>> >>>>> For #1, I agree with Bowen that temp function needs to be at the >> highest >>>>> priority because that's how a user might overwrite a built-in function >>>>> without referencing a persistent, overwriting catalog function with a >>> fully >>>>> qualified name. Putting built-in functions at the highest priority >>>>> eliminates that usage. >>>>> >>>>> For #2, I saw a general agreement on referencing "external" built-in >>>>> functions such as those in Hive needs to be explicit and deterministic >>> even >>>>> though different approaches are proposed. To limit the scope and >> simply >>> the >>>>> usage, it seems making sense to me to introduce special syntax for >>> user to >>>>> explicitly reference an external built-in function such as hive1::sqrt >>> or >>>>> hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API >>> call >>>>> hive1.getFunction(ObjectPath functionName) where the database name is >>>>> absent for bulit-in functions available in that catalog hive1. I >>> understand >>>>> that Bowen's original proposal was trying to avoid this, but this >> could >>>>> turn out to be a clean and simple solution. >>>>> >>>>> (Timo's modular approach is great way to "expand" Flink's built-in >>> function >>>>> set, which seems orthogonal and complementary to this, which could be >>>>> tackled in further future work.) >>>>> >>>>> I'd be happy to hear further thoughts on the two points. >>>>> >>>>> Thanks, >>>>> Xuefu >>>>> >>>>> On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> wrote: >>>>> >>>>>> Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is >>> the >>>>>> same >>>>>> as Bowen's. But after thinking about it, I'm currently lean to Timo's >>>>>> suggestion. >>>>>> >>>>>> The reason is backward compatibility. If we follow Bowen's approach, >>>>> let's >>>>>> say we >>>>>> first find function in Flink's built-in functions, and then hive's >>>>>> built-in. For example, `foo` >>>>>> is not supported by Flink, but hive has such built-in function. So >> user >>>>>> will have hive's >>>>>> behavior for function `foo`. And in next release, Flink realize this >>> is a >>>>>> very popular function >>>>>> and add it into Flink's built-in functions, but with different >> behavior >>>>> as >>>>>> hive's. So in next >>>>>> release, the behavior changes. >>>>>> >>>>>> With Timo's approach, IIUC user have to tell the framework explicitly >>>>> what >>>>>> kind of >>>>>> built-in functions he would like to use. He can just tell framework >> to >>>>>> abandon Flink's built-in >>>>>> functions, and use hive's instead. User can only choose between them, >>> but >>>>>> not use >>>>>> them at the same time. I think this approach is more predictable. >>>>>> >>>>>> Best, >>>>>> Kurt >>>>>> >>>>>> >>>>>> On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> wrote: >>>>>> >>>>>>> Hi all, >>>>>>> >>>>>>> Thanks for the feedback. Just a kindly reminder that the [Proposal] >>>>>> section >>>>>>> in the google doc was updated, please take a look first and let me >>> know >>>>>> if >>>>>>> you have more questions. >>>>>>> >>>>>>> On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> >> wrote: >>>>>>>> Hi Timo, >>>>>>>> >>>>>>>> Re> 1) We should not have the restriction "hive built-in functions >>>>> can >>>>>>>> only >>>>>>>>> be used when current catalog is hive catalog". Switching a catalog >>>>>>>>> should only have implications on the cat.db.object resolution but >>>>> not >>>>>>>>> functions. It would be quite convinient for users to use Hive >>>>>> built-ins >>>>>>>>> even if they use a Confluent schema registry or just the in-memory >>>>>>>> catalog. >>>>>>>> >>>>>>>> There might be a misunderstanding here. >>>>>>>> >>>>>>>> First of all, Hive built-in functions are not part of Flink >> built-in >>>>>>>> functions, they are catalog functions, thus if the current catalog >> is >>>>>>> not a >>>>>>>> HiveCatalog but, say, a schema registry catalog, ambiguous >> functions >>>>>>>> reference just shouldn't be resolved to a different catalog. >>>>>>>> >>>>>>>> Second, Hive built-in functions can potentially be referenced >> across >>>>>>>> catalog, but it doesn't have db namespace and we currently just >> don't >>>>>>> have >>>>>>>> a SQL syntax for it. It can be enabled when such a SQL syntax is >>>>>> defined, >>>>>>>> e.g. "catalog::function", but it's out of scope of this FLIP. >>>>>>>> >>>>>>>> 2) I would propose to have separate concepts for catalog and >> built-in >>>>>>>> functions. In particular it would be nice to modularize built-in >>>>>>>> functions. Some built-in functions are very crucial (like AS, CAST, >>>>>>>> MINUS), others are more optional but stable (MD5, CONCAT_WS), and >>>>> maybe >>>>>>>> we add more experimental functions in the future or function for >> some >>>>>>>> special application area (Geo functions, ML functions). A data >>>>> platform >>>>>>>> team might not want to make every built-in function available. Or a >>>>>>>> function module like ML functions is in a different Maven module. >>>>>>>> >>>>>>>> I think this is orthogonal to this FLIP, especially we don't have >> the >>>>>>>> "external built-in functions" anymore and currently the built-in >>>>>> function >>>>>>>> category remains untouched. >>>>>>>> >>>>>>>> But just to share some thoughts on the proposal, I'm not sure about >>>>> it: >>>>>>>> - I don't know if any other databases handle built-in functions >> like >>>>>>> that. >>>>>>>> Maybe you can give some examples? IMHO, built-in functions are >> system >>>>>>> info >>>>>>>> and should be deterministic, not depending on loaded libraries. Geo >>>>>>>> functions should be either built-in already or just libraries >>>>>> functions, >>>>>>>> and library functions can be adapted to catalog APIs or of some >> other >>>>>>>> syntax to use >>>>>>>> - I don't know if all use cases stand, and many can be achieved by >>>>>> other >>>>>>>> approaches too. E.g. experimental functions can be taken good care >> of >>>>>> by >>>>>>>> documentations, annotations, etc >>>>>>>> - the proposal basically introduces some concept like a pluggable >>>>>>> built-in >>>>>>>> function catalog, despite the already existing catalog APIs >>>>>>>> - it brings in even more complicated scenarios to the design. E.g. >>>>> how >>>>>> do >>>>>>>> you handle built-in functions in different modules but different >>>>> names? >>>>>>>> In short, I'm not sure if it really stands and it looks like an >>>>>> overkill >>>>>>>> to me. I'd rather not go to that route. Related discussion can be >> on >>>>>> its >>>>>>>> own thread. >>>>>>>> >>>>>>>> 3) Following the suggestion above, we can have a separate discovery >>>>>>>> mechanism for built-in functions. Instead of just going through a >>>>>> static >>>>>>>> list like in BuiltInFunctionDefinitions, a platform team should be >>>>> able >>>>>>>> to select function modules like >>>>>>>> catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, >>>>>>>> HiveFunctions) or via service discovery; >>>>>>>> >>>>>>>> Same as above. I'll leave it to its own thread. >>>>>>>> >>>>>>>> re > 3) Dawid and I discussed the resulution order again. I agree >>>>> with >>>>>>>> Kurt >>>>>>>>> that we should unify built-in function (external or internal) >>>>> under a >>>>>>>>> common layer. However, the resolution order should be: >>>>>>>>> 1. built-in functions >>>>>>>>> 2. temporary functions >>>>>>>>> 3. regular catalog resolution logic >>>>>>>>> Otherwise a temporary function could cause clashes with Flink's >>>>>>> built-in >>>>>>>>> functions. If you take a look at other vendors, like SQL Server >>>>> they >>>>>>>>> also do not allow to overwrite built-in functions. >>>>>>>> ”I agree with Kurt that we should unify built-in function (external >>>>> or >>>>>>>> internal) under a common layer.“ <- I don't think this is what Kurt >>>>>>> means. >>>>>>>> Kurt and I are in favor of unifying built-in functions of external >>>>>>> systems >>>>>>>> and catalog functions. Did you type a mistake? >>>>>>>> >>>>>>>> Besides, I'm not sure about the resolution order you proposed. >>>>>> Temporary >>>>>>>> functions have a lifespan over a session and are only visible to >> the >>>>>>>> session owner, they are unique to each user, and users create them >> on >>>>>>>> purpose to be the highest priority in order to overwrite system >> info >>>>>>>> (built-in functions in this case). >>>>>>>> >>>>>>>> In your case, why would users name a temporary function the same >> as a >>>>>>>> built-in function then? Since using that name in ambiguous function >>>>>>>> reference will always be resolved to built-in functions, creating a >>>>>>>> same-named temp function would be meaningless in the end. >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> >> wrote: >>>>>>>>> Hi Jingsong, >>>>>>>>> >>>>>>>>> Re> 1.Hive built-in functions is an intermediate solution. So we >>>>>> should >>>>>>>>>> not introduce interfaces to influence the framework. To make >>>>>>>>>> Flink itself more powerful, we should implement the functions >>>>>>>>>> we need to add. >>>>>>>>> Yes, please see the doc. >>>>>>>>> >>>>>>>>> Re> 2.Non-flink built-in functions are easy for users to change >>>>> their >>>>>>>>>> behavior. If we support some flink built-in functions in the >>>>>>>>>> future but act differently from non-flink built-in, this will >> lead >>>>>> to >>>>>>>>>> changes in user behavior. >>>>>>>>> There's no such concept as "external built-in functions" any more. >>>>>>>>> Built-in functions of external systems will be treated as special >>>>>>> catalog >>>>>>>>> functions. >>>>>>>>> >>>>>>>>> Re> Another question is, does this fallback include all >>>>>>>>>> hive built-in functions? As far as I know, some hive functions >>>>>>>>>> have some hacky. If possible, can we start with a white list? >>>>>>>>>> Once we implement some functions to flink built-in, we can >>>>>>>>>> also update the whitelist. >>>>>>>>> Yes, that's something we thought of too. I don't think it's super >>>>>>>>> critical to the scope of this FLIP, thus I'd like to leave it to >>>>>> future >>>>>>>>> efforts as a nice-to-have feature. >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> >>>>> wrote: >>>>>>>>>> Hi Kurt, >>>>>>>>>> >>>>>>>>>> Re: > What I want to propose is we can merge #3 and #4, make them >>>>>> both >>>>>>>>>> under >>>>>>>>>>> "catalog" concept, by extending catalog function to make it have >>>>>>>>>> ability to >>>>>>>>>>> have built-in catalog functions. Some benefits I can see from >> this >>>>>>>>>> approach: >>>>>>>>>>> 1. We don't have to introduce new concept like external built-in >>>>>>>>>> functions. >>>>>>>>>>> Actually I don't see a full story about how to treat a built-in >>>>>>>>>> functions, and it >>>>>>>>>>> seems a little bit disrupt with catalog. As a result, you have >> to >>>>>> make >>>>>>>>>> some restriction >>>>>>>>>>> like "hive built-in functions can only be used when current >>>>> catalog >>>>>> is >>>>>>>>>> hive catalog". >>>>>>>>>> >>>>>>>>>> Yes, I've unified #3 and #4 but it seems I didn't update some >> part >>>>> of >>>>>>>>>> the doc. I've modified those sections, and they are up to date >> now. >>>>>>>>>> In short, now built-in function of external systems are defined >> as >>>>> a >>>>>>>>>> special kind of catalog function in Flink, and handled by Flink >> as >>>>>>>>>> following: >>>>>>>>>> - An external built-in function must be associated with a catalog >>>>> for >>>>>>>>>> the purpose of decoupling flink-table and external systems. >>>>>>>>>> - It always resides in front of catalog functions in ambiguous >>>>>> function >>>>>>>>>> reference order, just like in its own external system >>>>>>>>>> - It is a special catalog function that doesn’t have a >>>>>> schema/database >>>>>>>>>> namespace >>>>>>>>>> - It goes thru the same instantiation logic as other user defined >>>>>>>>>> catalog functions in the external system >>>>>>>>>> >>>>>>>>>> Please take another look at the doc, and let me know if you have >>>>> more >>>>>>>>>> questions. >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> >>>>>>> wrote: >>>>>>>>>>> Hi Kurt, >>>>>>>>>>> >>>>>>>>>>> it should not affect the functions and operations we currently >>>>> have >>>>>> in >>>>>>>>>>> SQL. It just categorizes the available built-in functions. It is >>>>>> kind >>>>>>>>>>> of >>>>>>>>>>> an orthogonal concept to the catalog API but built-in functions >>>>>>> deserve >>>>>>>>>>> this special kind of treatment. CatalogFunction still fits >>>>> perfectly >>>>>>> in >>>>>>>>>>> there because the regular catalog object resolution logic is not >>>>>>>>>>> affected. So tables and functions are resolved in the same way >> but >>>>>>> with >>>>>>>>>>> built-in functions that have priority as in the original design. >>>>>>>>>>> >>>>>>>>>>> Regards, >>>>>>>>>>> Timo >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On 03.09.19 15:26, Kurt Young wrote: >>>>>>>>>>>> Does this only affect the functions and operations we currently >>>>>> have >>>>>>>>>>> in SQL >>>>>>>>>>>> and >>>>>>>>>>>> have no effect on tables, right? Looks like this is an >>>>> orthogonal >>>>>>>>>>> concept >>>>>>>>>>>> with Catalog? >>>>>>>>>>>> If the answer are both yes, then the catalog function will be a >>>>>>> weird >>>>>>>>>>>> concept? >>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> Kurt >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Sep 3, 2019 at 8:10 PM Danny Chan < >> [hidden email] >>>>>>>>>>> wrote: >>>>>>>>>>>>> The way you proposed are basically the same as what Calcite >>>>>> does, I >>>>>>>>>>> think >>>>>>>>>>>>> we are in the same line. >>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>> Danny Chan >>>>>>>>>>>>> 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] >>> ,写道: >>>>>>>>>>>>>> This sounds exactly as the module approach I mentioned, no? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Regards, >>>>>>>>>>>>>> Timo >>>>>>>>>>>>>> >>>>>>>>>>>>>> On 03.09.19 13:42, Danny Chan wrote: >>>>>>>>>>>>>>> Thanks Bowen for bring up this topic, I think it’s a useful >>>>>>>>>>>>> refactoring to make our function usage more user friendly. >>>>>>>>>>>>>>> For the topic of how to organize the builtin operators and >>>>>>>>>>> operators >>>>>>>>>>>>> of Hive, here is a solution from Apache Calcite, the Calcite >>>>> way >>>>>> is >>>>>>>>>>> to make >>>>>>>>>>>>> every dialect operators a “Library”, user can specify which >>>>>>>>>>> libraries they >>>>>>>>>>>>> want to use for a sql query. The builtin operators always >> comes >>>>>> as >>>>>>>>>>> the >>>>>>>>>>>>> first class objects and the others are used from the order >> they >>>>>>>>>>> appears. >>>>>>>>>>>>> Maybe you can take a reference. >>>>>>>>>>>>>>> [1] >> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 >>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>> Danny Chan >>>>>>>>>>>>>>> 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] >>> ,写道: >>>>>>>>>>>>>>>> Hi folks, >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> I'd like to kick off a discussion on reworking Flink's >>>>>>>>>>>>> FunctionCatalog. >>>>>>>>>>>>>>>> It's critically helpful to improve function usability in >>>>> SQL. >> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing >>>>>>>>>>>>>>>> In short, it: >>>>>>>>>>>>>>>> - adds support for precise function reference with >>>>>>> fully/partially >>>>>>>>>>>>>>>> qualified name >>>>>>>>>>>>>>>> - redefines function resolution order for ambiguous >> function >>>>>>>>>>>>> reference >>>>>>>>>>>>>>>> - adds support for Hive's rich built-in functions (support >>>>> for >>>>>>>>>>> Hive >>>>>>>>>>>>> user >>>>>>>>>>>>>>>> defined functions was already added in 1.9.0) >>>>>>>>>>>>>>>> - clarifies the concept of temporary functions >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Would love to hear your thoughts. >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Bowen >>>>> -- >>>>> Xuefu Zhang >>>>> >>>>> "In Honey We Trust!" >>>>> >>> >> -- >> Xuefu Zhang >> >> "In Honey We Trust!" >> |
Hi, Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they are very inconsistent in that manner (spark being way worse on that). Hive: You cannot overwrite all the built-in functions. I could overwrite most of the functions I tried e.g. length, e, pi, round, rtrim, but there are functions I cannot overwrite e.g. CAST, ARRAY I get: ParseException line 1:29 cannot recognize input near
'array' 'AS' What is interesting is that I cannot ovewrite array, but I can ovewrite map or struct. Though hive behaves reasonable well if I manage to overwrite a function. When I drop the temporary function the native function is still available. Spark: Spark's behavior imho is super bad. Theoretically I could overwrite all functions. I was able e.g. to overwrite CAST function. I had to use though CREATE OR REPLACE TEMPORARY FUNCTION syntax. Otherwise I get an exception that a function already exists. However when I used the CAST function in a query it used the native, built-in one. When I overwrote current_date() function, it was used in a query, but it completely replaces the built-in function and I can no longer use the native function in any way. I cannot also drop the temporary function. I get: Error in query: Cannot drop native function
'current_date'; Additional note, both systems do not allow creating TEMPORARY
FUNCTIONS with a database. Temporary functions are always
represented as a single name. In my opinion neither of the systems have consistent behavior. Generally speaking I think overwriting any system provided functions is just dangerous. Regarding Jark's concerns. Such functions would be registered in a current catalog/database schema, so a user could still use its own function, but would have to fully qualify the function (because built-in functions take precedence). Moreover users would have the same problem with permanent functions. Imagine a user have a permanent function 'cat.db.explode'. In 1.9 the user could use just the 'explode' function as long as the 'cat' & 'db' were the default catalog & database. If we introduce 'explode' built-in function in 1.10, the user has to fully qualify the function. Best, Dawid On 04/09/2019 15:19, Timo Walther
wrote:
Hi all, signature.asc (849 bytes) Download Attachment |
Let me try to summarize and conclude the long thread so far:
1. For order of temp function v.s. built-in function: I think Dawid's point that temp function should be of fully qualified path is a better reasoning to back the newly proposed order, and i agree we don't need to follow Hive/Spark. However, I'd rather not change fundamentals of temporary functions in this FLIP. It belongs to a bigger story of how temporary objects should be redefined and be handled uniformly - currently temporary tables and views (those registered from TableEnv#registerTable()) behave different than what Dawid propose for temp functions, and we need a FLIP to just unify their APIs and behaviors. I agree that backward compatibility is not an issue w.r.t Jark's points. ***Seems we do have consensus that it's acceptable to prevent users registering a temp function in the same name as a built-in function. To help us move forward, I'd like to propose setting such a restraint on temp functions in this FLIP to simplify the design and avoid disputes.*** It will also leave rooms for improvements in the future. 2. For Hive built-in function: Thanks Timo for providing the Presto and Postgres examples. I feel modular built-in functions can be a good fit for the geo and ml example as a native Flink extension, but not sure if it fits well with external integrations. Anyway, I think modular built-in functions is a bigger story and can be on its own thread too, and our proposal doesn't prevent Flink from doing that in the future. ***Seems we have consensus that users should be able to use built-in functions of Hive or other external systems in SQL explicitly and deterministically regardless of Flink built-in functions and the potential modular built-in functions, via some new syntax like "mycat::func"? If so, I'd like to propose removing Hive built-in functions from ambiguous function resolution order, and empower users with such a syntax. This way we sacrifice a little convenience for certainty*** What do you think? On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz <[hidden email]> wrote: > Hi, > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they are > very inconsistent in that manner (spark being way worse on that). > > Hive: > > You cannot overwrite all the built-in functions. I could overwrite most of > the functions I tried e.g. length, e, pi, round, rtrim, but there are > functions I cannot overwrite e.g. CAST, ARRAY I get: > > > * ParseException line 1:29 cannot recognize input near 'array' 'AS' * > > What is interesting is that I cannot ovewrite *array*, but I can ovewrite > *map* or *struct*. Though hive behaves reasonable well if I manage to > overwrite a function. When I drop the temporary function the native > function is still available. > > Spark: > > Spark's behavior imho is super bad. > > Theoretically I could overwrite all functions. I was able e.g. to > overwrite CAST function. I had to use though CREATE OR REPLACE TEMPORARY > FUNCTION syntax. Otherwise I get an exception that a function already > exists. However when I used the CAST function in a query it used the > native, built-in one. > > When I overwrote current_date() function, it was used in a query, but it > completely replaces the built-in function and I can no longer use the > native function in any way. I cannot also drop the temporary function. I > get: > > * Error in query: Cannot drop native function 'current_date';* > > Additional note, both systems do not allow creating TEMPORARY FUNCTIONS > with a database. Temporary functions are always represented as a single > name. > > In my opinion neither of the systems have consistent behavior. Generally > speaking I think overwriting any system provided functions is just > dangerous. > > Regarding Jark's concerns. Such functions would be registered in a current > catalog/database schema, so a user could still use its own function, but > would have to fully qualify the function (because built-in functions take > precedence). Moreover users would have the same problem with permanent > functions. Imagine a user have a permanent function 'cat.db.explode'. In > 1.9 the user could use just the 'explode' function as long as the 'cat' & > 'db' were the default catalog & database. If we introduce 'explode' > built-in function in 1.10, the user has to fully qualify the function. > > Best, > > Dawid > On 04/09/2019 15:19, Timo Walther wrote: > > Hi all, > > thanks for the healthy discussion. It is already a very long discussion > with a lot of text. So I will just post my opinion to a couple of > statements: > > > Hive built-in functions are not part of Flink built-in functions, they > are catalog functions > > That is not entirely true. Correct me if I'm wrong but I think Hive > built-in functions are also not catalog functions. They are not stored in > every Hive metastore catalog that is freshly created but are a set of > functions that are listed somewhere and made available. > > > ambiguous functions reference just shouldn't be resolved to a different > catalog > > I agree. They should not be resolved to a different catalog. That's why I > am suggesting to split the concept of built-in functions and catalog lookup > semantics. > > > I don't know if any other databases handle built-in functions like that > > What I called "module" is: > - Extension in Postgres [1] > - Plugin in Presto [2] > > Btw. Presto even mentions example modules that are similar to the ones > that we will introduce in the near future both for ML and System XYZ > compatibility: > "See either the presto-ml module for machine learning functions or the > presto-teradata-functions module for Teradata-compatible functions, both in > the root of the Presto source." > > > functions should be either built-in already or just libraries functions, > and library functions can be adapted to catalog APIs or of some other > syntax to use > > Regarding "built-in already", of course we can add a lot of functions as > built-ins but we will end-up in a dependency hell in the near future if we > don't introduce a pluggable approach. Library functions is what you also > suggest but storing them in a catalog means to always fully qualify them or > modifying the existing catalog design that was inspired by the standard. > > I don't think "it brings in even more complicated scenarios to the > design", it just does clear separation of concerns. Integrating the > functionality into the current design makes the catalog API more > complicated. > > > why would users name a temporary function the same as a built-in > function then? > > Because you never know what users do. If they don't, my suggested > resolution order should not be a problem, right? > > > I don't think hive functions deserves be a function module > > Our goal is not to create a Hive clone. We need to think forward and Hive > is just one of many systems that we can support. Not every built-in > function behaves and will behave exactly like Hive. > > > regarding temporary functions, there are few systems that support it > > IMHO Spark and Hive are not always the best examples for consistent > design. Systems like Postgres, Presto, or SQL Server should be used as a > reference. I don't think that a user can overwrite a built-in function > there. > > Regards, > Timo > > [1] https://www.postgresql.org/docs/10/extend-extensions.html > [2] https://prestodb.github.io/docs/current/develop/functions.html > > > On 04.09.19 13:44, Jark Wu wrote: > > Hi all, > > Regarding #1 temp function <> built-in function and naming. > I'm fine with temp functions should precede built-in function and can > override built-in functions (we already support to override built-in > function in 1.9). > If we don't allow the same name as a built-in function, I'm afraid we will > have compatibility issues in the future. > Say users register a user defined function named "explode" in 1.9, and we > support a built-in "explode" function in 1.10. > Then the user's jobs which call the registered "explode" function in 1.9 > will all fail in 1.10 because of naming conflict. > > Regarding #2 "External" built-in functions. > I think if we store external built-in functions in catalog, then > "hive1::sqrt" is a good way to go. > However, I would prefer to support a discovery mechanism (e.g. SPI) for > built-in functions as Timo suggested above. > This gives us the flexibility to add Hive or MySQL or Geo or whatever > function set as built-in functions in an easy way. > > Best, > Jark > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> > <[hidden email]> wrote: > > Hi David, > > Thank you for sharing your findings. It seems to me that there is no SQL > standard regarding temporary functions. There are few systems that support > it. Here are what I have found: > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > 2. Spark: basically follows Hive ( > > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > ) > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting > behavior. ( > http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html) > > Because of lack of standard, it's perfectly fine for Flink to define > whatever it sees appropriate. Thus, your proposal (no overwriting and must > have DB as holder) is one option. The advantage is simplicity, The > downside > is the deviation from Hive, which is popular and de facto standard in big > data world. > > However, I don't think we have to follow Hive. More importantly, we need a > consensus. I have no objection if your proposal is generally agreed upon. > > Thanks, > Xuefu > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email]> > <[hidden email]> > wrote: > > Hi all, > > Just an opinion on the built-in <> temporary functions resolution and > NAMING issue. I think we should not allow overriding the built-in > functions, as this may pose serious issues and to be honest is rather > not feasible and would require major rework. What happens if a user > wants to override CAST? Calls to that function are generated at > different layers of the stack that unfortunately does not always go > through the Catalog API (at least yet). Moreover from what I've checked > no other systems allow overriding the built-in functions. All the > systems I've checked so far register temporary functions in a > database/schema (either special database for temporary functions, or > just current database). What I would suggest is to always register > temporary functions with a 3 part identifier. The same way as tables, > views etc. This effectively means you cannot override built-in > functions. With such approach it is natural that the temporary functions > end up a step lower in the resolution order: > > 1. built-in functions (1 part, maybe 2? - this is still under discussion) > > 2. temporary functions (always 3 part path) > > 3. catalog functions (always 3 part path) > > Let me know what do you think. > > Best, > > Dawid > > On 04/09/2019 06:13, Bowen Li wrote: > > Hi, > > I agree with Xuefu that the main controversial points are mainly the > > two > > places. My thoughts on them: > > 1) Determinism of referencing Hive built-in functions. We can either > > remove > > Hive built-in functions from ambiguous function resolution and require > users to use special syntax for their qualified names, or add a config > > flag > > to catalog constructor/yaml for turning on and off Hive built-in > > functions > > with the flag set to 'false' by default and proper doc added to help > > users > > make their decisions. > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > function > > resolution order. We believe Flink temp functions should precede Flink > built-in functions, and I have presented my reasons. Just in case if we > cannot reach an agreement, I propose forbid users registering temp > functions in the same name as a built-in function, like MySQL's > > approach, > > for the moment. It won't have any performance concern, since built-in > functions are all in memory and thus cost of a name check will be > > really > > trivial. > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> > <[hidden email]> wrote: > > From what I have seen, there are a couple of focal disagreements: > > 1. Resolution order: temp function --> flink built-in function --> > > catalog > > function vs flink built-in function --> temp function -> catalog > > function. > > 2. "External" built-in functions: how to treat built-in functions in > external system and how users reference them > > For #1, I agree with Bowen that temp function needs to be at the > > highest > > priority because that's how a user might overwrite a built-in function > without referencing a persistent, overwriting catalog function with a > > fully > > qualified name. Putting built-in functions at the highest priority > eliminates that usage. > > For #2, I saw a general agreement on referencing "external" built-in > functions such as those in Hive needs to be explicit and deterministic > > even > > though different approaches are proposed. To limit the scope and > > simply > > the > > usage, it seems making sense to me to introduce special syntax for > > user to > > explicitly reference an external built-in function such as hive1::sqrt > > or > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API > > call > > hive1.getFunction(ObjectPath functionName) where the database name is > absent for bulit-in functions available in that catalog hive1. I > > understand > > that Bowen's original proposal was trying to avoid this, but this > > could > > turn out to be a clean and simple solution. > > (Timo's modular approach is great way to "expand" Flink's built-in > > function > > set, which seems orthogonal and complementary to this, which could be > tackled in further future work.) > > I'd be happy to hear further thoughts on the two points. > > Thanks, > Xuefu > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> > <[hidden email]> wrote: > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > the > > same > as Bowen's. But after thinking about it, I'm currently lean to Timo's > suggestion. > > The reason is backward compatibility. If we follow Bowen's approach, > > let's > > say we > first find function in Flink's built-in functions, and then hive's > built-in. For example, `foo` > is not supported by Flink, but hive has such built-in function. So > > user > > will have hive's > behavior for function `foo`. And in next release, Flink realize this > > is a > > very popular function > and add it into Flink's built-in functions, but with different > > behavior > > as > > hive's. So in next > release, the behavior changes. > > With Timo's approach, IIUC user have to tell the framework explicitly > > what > > kind of > built-in functions he would like to use. He can just tell framework > > to > > abandon Flink's built-in > functions, and use hive's instead. User can only choose between them, > > but > > not use > them at the same time. I think this approach is more predictable. > > Best, > Kurt > > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> > <[hidden email]> wrote: > > Hi all, > > Thanks for the feedback. Just a kindly reminder that the [Proposal] > > section > > in the google doc was updated, please take a look first and let me > > know > > if > > you have more questions. > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> > <[hidden email]> > > wrote: > > Hi Timo, > > Re> 1) We should not have the restriction "hive built-in functions > > can > > only > > be used when current catalog is hive catalog". Switching a catalog > should only have implications on the cat.db.object resolution but > > not > > functions. It would be quite convinient for users to use Hive > > built-ins > > even if they use a Confluent schema registry or just the in-memory > > catalog. > > There might be a misunderstanding here. > > First of all, Hive built-in functions are not part of Flink > > built-in > > functions, they are catalog functions, thus if the current catalog > > is > > not a > > HiveCatalog but, say, a schema registry catalog, ambiguous > > functions > > reference just shouldn't be resolved to a different catalog. > > Second, Hive built-in functions can potentially be referenced > > across > > catalog, but it doesn't have db namespace and we currently just > > don't > > have > > a SQL syntax for it. It can be enabled when such a SQL syntax is > > defined, > > e.g. "catalog::function", but it's out of scope of this FLIP. > > 2) I would propose to have separate concepts for catalog and > > built-in > > functions. In particular it would be nice to modularize built-in > functions. Some built-in functions are very crucial (like AS, CAST, > MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > maybe > > we add more experimental functions in the future or function for > > some > > special application area (Geo functions, ML functions). A data > > platform > > team might not want to make every built-in function available. Or a > function module like ML functions is in a different Maven module. > > I think this is orthogonal to this FLIP, especially we don't have > > the > > "external built-in functions" anymore and currently the built-in > > function > > category remains untouched. > > But just to share some thoughts on the proposal, I'm not sure about > > it: > > - I don't know if any other databases handle built-in functions > > like > > that. > > Maybe you can give some examples? IMHO, built-in functions are > > system > > info > > and should be deterministic, not depending on loaded libraries. Geo > functions should be either built-in already or just libraries > > functions, > > and library functions can be adapted to catalog APIs or of some > > other > > syntax to use > - I don't know if all use cases stand, and many can be achieved by > > other > > approaches too. E.g. experimental functions can be taken good care > > of > > by > > documentations, annotations, etc > - the proposal basically introduces some concept like a pluggable > > built-in > > function catalog, despite the already existing catalog APIs > - it brings in even more complicated scenarios to the design. E.g. > > how > > do > > you handle built-in functions in different modules but different > > names? > > In short, I'm not sure if it really stands and it looks like an > > overkill > > to me. I'd rather not go to that route. Related discussion can be > > on > > its > > own thread. > > 3) Following the suggestion above, we can have a separate discovery > mechanism for built-in functions. Instead of just going through a > > static > > list like in BuiltInFunctionDefinitions, a platform team should be > > able > > to select function modules like > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > HiveFunctions) or via service discovery; > > Same as above. I'll leave it to its own thread. > > re > 3) Dawid and I discussed the resulution order again. I agree > > with > > Kurt > > that we should unify built-in function (external or internal) > > under a > > common layer. However, the resolution order should be: > 1. built-in functions > 2. temporary functions > 3. regular catalog resolution logic > Otherwise a temporary function could cause clashes with Flink's > > built-in > > functions. If you take a look at other vendors, like SQL Server > > they > > also do not allow to overwrite built-in functions. > > ”I agree with Kurt that we should unify built-in function (external > > or > > internal) under a common layer.“ <- I don't think this is what Kurt > > means. > > Kurt and I are in favor of unifying built-in functions of external > > systems > > and catalog functions. Did you type a mistake? > > Besides, I'm not sure about the resolution order you proposed. > > Temporary > > functions have a lifespan over a session and are only visible to > > the > > session owner, they are unique to each user, and users create them > > on > > purpose to be the highest priority in order to overwrite system > > info > > (built-in functions in this case). > > In your case, why would users name a temporary function the same > > as a > > built-in function then? Since using that name in ambiguous function > reference will always be resolved to built-in functions, creating a > same-named temp function would be meaningless in the end. > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> > <[hidden email]> > > wrote: > > Hi Jingsong, > > Re> 1.Hive built-in functions is an intermediate solution. So we > > should > > not introduce interfaces to influence the framework. To make > Flink itself more powerful, we should implement the functions > we need to add. > > Yes, please see the doc. > > Re> 2.Non-flink built-in functions are easy for users to change > > their > > behavior. If we support some flink built-in functions in the > future but act differently from non-flink built-in, this will > > lead > > to > > changes in user behavior. > > There's no such concept as "external built-in functions" any more. > Built-in functions of external systems will be treated as special > > catalog > > functions. > > Re> Another question is, does this fallback include all > > hive built-in functions? As far as I know, some hive functions > have some hacky. If possible, can we start with a white list? > Once we implement some functions to flink built-in, we can > also update the whitelist. > > Yes, that's something we thought of too. I don't think it's super > critical to the scope of this FLIP, thus I'd like to leave it to > > future > > efforts as a nice-to-have feature. > > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> > <[hidden email]> > > wrote: > > Hi Kurt, > > Re: > What I want to propose is we can merge #3 and #4, make them > > both > > under > > "catalog" concept, by extending catalog function to make it have > > ability to > > have built-in catalog functions. Some benefits I can see from > > this > > approach: > > 1. We don't have to introduce new concept like external built-in > > functions. > > Actually I don't see a full story about how to treat a built-in > > functions, and it > > seems a little bit disrupt with catalog. As a result, you have > > to > > make > > some restriction > > like "hive built-in functions can only be used when current > > catalog > > is > > hive catalog". > > Yes, I've unified #3 and #4 but it seems I didn't update some > > part > > of > > the doc. I've modified those sections, and they are up to date > > now. > > In short, now built-in function of external systems are defined > > as > > a > > special kind of catalog function in Flink, and handled by Flink > > as > > following: > - An external built-in function must be associated with a catalog > > for > > the purpose of decoupling flink-table and external systems. > - It always resides in front of catalog functions in ambiguous > > function > > reference order, just like in its own external system > - It is a special catalog function that doesn’t have a > > schema/database > > namespace > - It goes thru the same instantiation logic as other user defined > catalog functions in the external system > > Please take another look at the doc, and let me know if you have > > more > > questions. > > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> > <[hidden email]> > > wrote: > > Hi Kurt, > > it should not affect the functions and operations we currently > > have > > in > > SQL. It just categorizes the available built-in functions. It is > > kind > > of > an orthogonal concept to the catalog API but built-in functions > > deserve > > this special kind of treatment. CatalogFunction still fits > > perfectly > > in > > there because the regular catalog object resolution logic is not > affected. So tables and functions are resolved in the same way > > but > > with > > built-in functions that have priority as in the original design. > > Regards, > Timo > > > On 03.09.19 15:26, Kurt Young wrote: > > Does this only affect the functions and operations we currently > > have > > in SQL > > and > have no effect on tables, right? Looks like this is an > > orthogonal > > concept > > with Catalog? > If the answer are both yes, then the catalog function will be a > > weird > > concept? > > Best, > Kurt > > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan < > > [hidden email] > > wrote: > > The way you proposed are basically the same as what Calcite > > does, I > > think > > we are in the same line. > > Best, > Danny Chan > 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > > ,写道: > > This sounds exactly as the module approach I mentioned, no? > > Regards, > Timo > > On 03.09.19 13:42, Danny Chan wrote: > > Thanks Bowen for bring up this topic, I think it’s a useful > > refactoring to make our function usage more user friendly. > > For the topic of how to organize the builtin operators and > > operators > > of Hive, here is a solution from Apache Calcite, the Calcite > > way > > is > > to make > > every dialect operators a “Library”, user can specify which > > libraries they > > want to use for a sql query. The builtin operators always > > comes > > as > > the > > first class objects and the others are used from the order > > they > > appears. > > Maybe you can take a reference. > > [1] > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > Best, > Danny Chan > 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > > ,写道: > > Hi folks, > > I'd like to kick off a discussion on reworking Flink's > > FunctionCatalog. > > It's critically helpful to improve function usability in > > SQL. > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > In short, it: > - adds support for precise function reference with > > fully/partially > > qualified name > - redefines function resolution order for ambiguous > > function > > reference > > - adds support for Hive's rich built-in functions (support > > for > > Hive > > user > > defined functions was already added in 1.9.0) > - clarifies the concept of temporary functions > > Would love to hear your thoughts. > > Bowen > > -- > Xuefu Zhang > > "In Honey We Trust!" > > > -- > Xuefu Zhang > > "In Honey We Trust!" > > > |
In reply to this post by dwysakowicz
Hi David,
Thanks for sharing the findings about temporary functions. Because of strong inconsistency observed in Spark, we can probably ignore it for now. For Hive, I understand one may not be able to overwrite everything, but the capability is being offered. Whether we offer this capability is to be determined, I don't see the "danger" you mentioned. It's user's action, which only impacts the user's current session. Other users are not impacted. (That's one of the benefits of temporary objects.) We cannot always prevent users from making mistakes. If we think overwriting is useful to the user, then we can allow it. To make it more consistent, we can also further restrict by further blacklisting built-in functions that may not be overwritten. In my past experience, I did see user needs to overwriting a built-in function in Hive. Without this capability, user has to create a permanent function and modify all the queries referencing the function with it fully-qualified name. This is equivalent to create a new, user-defined, function. This way can work, but the usability is bad. (Jark's "explode" example is actually important because forcing user to modify query is of bad experience after an upgrade.) In short, we can theoretically disallow function overwriting since there is no standard. However, I don't see strong reasons for doing so, especially such a capability is useful to some users. Thanks, Xuefu On Wed, Sep 4, 2019 at 10:02 PM Dawid Wysakowicz <[hidden email]> wrote: > Hi, > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they are > very inconsistent in that manner (spark being way worse on that). > > Hive: > > You cannot overwrite all the built-in functions. I could overwrite most of > the functions I tried e.g. length, e, pi, round, rtrim, but there are > functions I cannot overwrite e.g. CAST, ARRAY I get: > > > * ParseException line 1:29 cannot recognize input near 'array' 'AS' * > > What is interesting is that I cannot ovewrite *array*, but I can ovewrite > *map* or *struct*. Though hive behaves reasonable well if I manage to > overwrite a function. When I drop the temporary function the native > function is still available. > > Spark: > > Spark's behavior imho is super bad. > > Theoretically I could overwrite all functions. I was able e.g. to > overwrite CAST function. I had to use though CREATE OR REPLACE TEMPORARY > FUNCTION syntax. Otherwise I get an exception that a function already > exists. However when I used the CAST function in a query it used the > native, built-in one. > > When I overwrote current_date() function, it was used in a query, but it > completely replaces the built-in function and I can no longer use the > native function in any way. I cannot also drop the temporary function. I > get: > > * Error in query: Cannot drop native function 'current_date';* > > Additional note, both systems do not allow creating TEMPORARY FUNCTIONS > with a database. Temporary functions are always represented as a single > name. > > In my opinion neither of the systems have consistent behavior. Generally > speaking I think overwriting any system provided functions is just > dangerous. > > Regarding Jark's concerns. Such functions would be registered in a current > catalog/database schema, so a user could still use its own function, but > would have to fully qualify the function (because built-in functions take > precedence). Moreover users would have the same problem with permanent > functions. Imagine a user have a permanent function 'cat.db.explode'. In > 1.9 the user could use just the 'explode' function as long as the 'cat' & > 'db' were the default catalog & database. If we introduce 'explode' > built-in function in 1.10, the user has to fully qualify the function. > > Best, > > Dawid > On 04/09/2019 15:19, Timo Walther wrote: > > Hi all, > > thanks for the healthy discussion. It is already a very long discussion > with a lot of text. So I will just post my opinion to a couple of > statements: > > > Hive built-in functions are not part of Flink built-in functions, they > are catalog functions > > That is not entirely true. Correct me if I'm wrong but I think Hive > built-in functions are also not catalog functions. They are not stored in > every Hive metastore catalog that is freshly created but are a set of > functions that are listed somewhere and made available. > > > ambiguous functions reference just shouldn't be resolved to a different > catalog > > I agree. They should not be resolved to a different catalog. That's why I > am suggesting to split the concept of built-in functions and catalog lookup > semantics. > > > I don't know if any other databases handle built-in functions like that > > What I called "module" is: > - Extension in Postgres [1] > - Plugin in Presto [2] > > Btw. Presto even mentions example modules that are similar to the ones > that we will introduce in the near future both for ML and System XYZ > compatibility: > "See either the presto-ml module for machine learning functions or the > presto-teradata-functions module for Teradata-compatible functions, both in > the root of the Presto source." > > > functions should be either built-in already or just libraries functions, > and library functions can be adapted to catalog APIs or of some other > syntax to use > > Regarding "built-in already", of course we can add a lot of functions as > built-ins but we will end-up in a dependency hell in the near future if we > don't introduce a pluggable approach. Library functions is what you also > suggest but storing them in a catalog means to always fully qualify them or > modifying the existing catalog design that was inspired by the standard. > > I don't think "it brings in even more complicated scenarios to the > design", it just does clear separation of concerns. Integrating the > functionality into the current design makes the catalog API more > complicated. > > > why would users name a temporary function the same as a built-in > function then? > > Because you never know what users do. If they don't, my suggested > resolution order should not be a problem, right? > > > I don't think hive functions deserves be a function module > > Our goal is not to create a Hive clone. We need to think forward and Hive > is just one of many systems that we can support. Not every built-in > function behaves and will behave exactly like Hive. > > > regarding temporary functions, there are few systems that support it > > IMHO Spark and Hive are not always the best examples for consistent > design. Systems like Postgres, Presto, or SQL Server should be used as a > reference. I don't think that a user can overwrite a built-in function > there. > > Regards, > Timo > > [1] https://www.postgresql.org/docs/10/extend-extensions.html > [2] https://prestodb.github.io/docs/current/develop/functions.html > > > On 04.09.19 13:44, Jark Wu wrote: > > Hi all, > > Regarding #1 temp function <> built-in function and naming. > I'm fine with temp functions should precede built-in function and can > override built-in functions (we already support to override built-in > function in 1.9). > If we don't allow the same name as a built-in function, I'm afraid we will > have compatibility issues in the future. > Say users register a user defined function named "explode" in 1.9, and we > support a built-in "explode" function in 1.10. > Then the user's jobs which call the registered "explode" function in 1.9 > will all fail in 1.10 because of naming conflict. > > Regarding #2 "External" built-in functions. > I think if we store external built-in functions in catalog, then > "hive1::sqrt" is a good way to go. > However, I would prefer to support a discovery mechanism (e.g. SPI) for > built-in functions as Timo suggested above. > This gives us the flexibility to add Hive or MySQL or Geo or whatever > function set as built-in functions in an easy way. > > Best, > Jark > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> > <[hidden email]> wrote: > > Hi David, > > Thank you for sharing your findings. It seems to me that there is no SQL > standard regarding temporary functions. There are few systems that support > it. Here are what I have found: > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > 2. Spark: basically follows Hive ( > > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > ) > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting > behavior. ( > http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html) > > Because of lack of standard, it's perfectly fine for Flink to define > whatever it sees appropriate. Thus, your proposal (no overwriting and must > have DB as holder) is one option. The advantage is simplicity, The > downside > is the deviation from Hive, which is popular and de facto standard in big > data world. > > However, I don't think we have to follow Hive. More importantly, we need a > consensus. I have no objection if your proposal is generally agreed upon. > > Thanks, > Xuefu > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email]> > <[hidden email]> > wrote: > > Hi all, > > Just an opinion on the built-in <> temporary functions resolution and > NAMING issue. I think we should not allow overriding the built-in > functions, as this may pose serious issues and to be honest is rather > not feasible and would require major rework. What happens if a user > wants to override CAST? Calls to that function are generated at > different layers of the stack that unfortunately does not always go > through the Catalog API (at least yet). Moreover from what I've checked > no other systems allow overriding the built-in functions. All the > systems I've checked so far register temporary functions in a > database/schema (either special database for temporary functions, or > just current database). What I would suggest is to always register > temporary functions with a 3 part identifier. The same way as tables, > views etc. This effectively means you cannot override built-in > functions. With such approach it is natural that the temporary functions > end up a step lower in the resolution order: > > 1. built-in functions (1 part, maybe 2? - this is still under discussion) > > 2. temporary functions (always 3 part path) > > 3. catalog functions (always 3 part path) > > Let me know what do you think. > > Best, > > Dawid > > On 04/09/2019 06:13, Bowen Li wrote: > > Hi, > > I agree with Xuefu that the main controversial points are mainly the > > two > > places. My thoughts on them: > > 1) Determinism of referencing Hive built-in functions. We can either > > remove > > Hive built-in functions from ambiguous function resolution and require > users to use special syntax for their qualified names, or add a config > > flag > > to catalog constructor/yaml for turning on and off Hive built-in > > functions > > with the flag set to 'false' by default and proper doc added to help > > users > > make their decisions. > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > function > > resolution order. We believe Flink temp functions should precede Flink > built-in functions, and I have presented my reasons. Just in case if we > cannot reach an agreement, I propose forbid users registering temp > functions in the same name as a built-in function, like MySQL's > > approach, > > for the moment. It won't have any performance concern, since built-in > functions are all in memory and thus cost of a name check will be > > really > > trivial. > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> > <[hidden email]> wrote: > > From what I have seen, there are a couple of focal disagreements: > > 1. Resolution order: temp function --> flink built-in function --> > > catalog > > function vs flink built-in function --> temp function -> catalog > > function. > > 2. "External" built-in functions: how to treat built-in functions in > external system and how users reference them > > For #1, I agree with Bowen that temp function needs to be at the > > highest > > priority because that's how a user might overwrite a built-in function > without referencing a persistent, overwriting catalog function with a > > fully > > qualified name. Putting built-in functions at the highest priority > eliminates that usage. > > For #2, I saw a general agreement on referencing "external" built-in > functions such as those in Hive needs to be explicit and deterministic > > even > > though different approaches are proposed. To limit the scope and > > simply > > the > > usage, it seems making sense to me to introduce special syntax for > > user to > > explicitly reference an external built-in function such as hive1::sqrt > > or > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API > > call > > hive1.getFunction(ObjectPath functionName) where the database name is > absent for bulit-in functions available in that catalog hive1. I > > understand > > that Bowen's original proposal was trying to avoid this, but this > > could > > turn out to be a clean and simple solution. > > (Timo's modular approach is great way to "expand" Flink's built-in > > function > > set, which seems orthogonal and complementary to this, which could be > tackled in further future work.) > > I'd be happy to hear further thoughts on the two points. > > Thanks, > Xuefu > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> > <[hidden email]> wrote: > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > the > > same > as Bowen's. But after thinking about it, I'm currently lean to Timo's > suggestion. > > The reason is backward compatibility. If we follow Bowen's approach, > > let's > > say we > first find function in Flink's built-in functions, and then hive's > built-in. For example, `foo` > is not supported by Flink, but hive has such built-in function. So > > user > > will have hive's > behavior for function `foo`. And in next release, Flink realize this > > is a > > very popular function > and add it into Flink's built-in functions, but with different > > behavior > > as > > hive's. So in next > release, the behavior changes. > > With Timo's approach, IIUC user have to tell the framework explicitly > > what > > kind of > built-in functions he would like to use. He can just tell framework > > to > > abandon Flink's built-in > functions, and use hive's instead. User can only choose between them, > > but > > not use > them at the same time. I think this approach is more predictable. > > Best, > Kurt > > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> > <[hidden email]> wrote: > > Hi all, > > Thanks for the feedback. Just a kindly reminder that the [Proposal] > > section > > in the google doc was updated, please take a look first and let me > > know > > if > > you have more questions. > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> > <[hidden email]> > > wrote: > > Hi Timo, > > Re> 1) We should not have the restriction "hive built-in functions > > can > > only > > be used when current catalog is hive catalog". Switching a catalog > should only have implications on the cat.db.object resolution but > > not > > functions. It would be quite convinient for users to use Hive > > built-ins > > even if they use a Confluent schema registry or just the in-memory > > catalog. > > There might be a misunderstanding here. > > First of all, Hive built-in functions are not part of Flink > > built-in > > functions, they are catalog functions, thus if the current catalog > > is > > not a > > HiveCatalog but, say, a schema registry catalog, ambiguous > > functions > > reference just shouldn't be resolved to a different catalog. > > Second, Hive built-in functions can potentially be referenced > > across > > catalog, but it doesn't have db namespace and we currently just > > don't > > have > > a SQL syntax for it. It can be enabled when such a SQL syntax is > > defined, > > e.g. "catalog::function", but it's out of scope of this FLIP. > > 2) I would propose to have separate concepts for catalog and > > built-in > > functions. In particular it would be nice to modularize built-in > functions. Some built-in functions are very crucial (like AS, CAST, > MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > maybe > > we add more experimental functions in the future or function for > > some > > special application area (Geo functions, ML functions). A data > > platform > > team might not want to make every built-in function available. Or a > function module like ML functions is in a different Maven module. > > I think this is orthogonal to this FLIP, especially we don't have > > the > > "external built-in functions" anymore and currently the built-in > > function > > category remains untouched. > > But just to share some thoughts on the proposal, I'm not sure about > > it: > > - I don't know if any other databases handle built-in functions > > like > > that. > > Maybe you can give some examples? IMHO, built-in functions are > > system > > info > > and should be deterministic, not depending on loaded libraries. Geo > functions should be either built-in already or just libraries > > functions, > > and library functions can be adapted to catalog APIs or of some > > other > > syntax to use > - I don't know if all use cases stand, and many can be achieved by > > other > > approaches too. E.g. experimental functions can be taken good care > > of > > by > > documentations, annotations, etc > - the proposal basically introduces some concept like a pluggable > > built-in > > function catalog, despite the already existing catalog APIs > - it brings in even more complicated scenarios to the design. E.g. > > how > > do > > you handle built-in functions in different modules but different > > names? > > In short, I'm not sure if it really stands and it looks like an > > overkill > > to me. I'd rather not go to that route. Related discussion can be > > on > > its > > own thread. > > 3) Following the suggestion above, we can have a separate discovery > mechanism for built-in functions. Instead of just going through a > > static > > list like in BuiltInFunctionDefinitions, a platform team should be > > able > > to select function modules like > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > HiveFunctions) or via service discovery; > > Same as above. I'll leave it to its own thread. > > re > 3) Dawid and I discussed the resulution order again. I agree > > with > > Kurt > > that we should unify built-in function (external or internal) > > under a > > common layer. However, the resolution order should be: > 1. built-in functions > 2. temporary functions > 3. regular catalog resolution logic > Otherwise a temporary function could cause clashes with Flink's > > built-in > > functions. If you take a look at other vendors, like SQL Server > > they > > also do not allow to overwrite built-in functions. > > ”I agree with Kurt that we should unify built-in function (external > > or > > internal) under a common layer.“ <- I don't think this is what Kurt > > means. > > Kurt and I are in favor of unifying built-in functions of external > > systems > > and catalog functions. Did you type a mistake? > > Besides, I'm not sure about the resolution order you proposed. > > Temporary > > functions have a lifespan over a session and are only visible to > > the > > session owner, they are unique to each user, and users create them > > on > > purpose to be the highest priority in order to overwrite system > > info > > (built-in functions in this case). > > In your case, why would users name a temporary function the same > > as a > > built-in function then? Since using that name in ambiguous function > reference will always be resolved to built-in functions, creating a > same-named temp function would be meaningless in the end. > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> > <[hidden email]> > > wrote: > > Hi Jingsong, > > Re> 1.Hive built-in functions is an intermediate solution. So we > > should > > not introduce interfaces to influence the framework. To make > Flink itself more powerful, we should implement the functions > we need to add. > > Yes, please see the doc. > > Re> 2.Non-flink built-in functions are easy for users to change > > their > > behavior. If we support some flink built-in functions in the > future but act differently from non-flink built-in, this will > > lead > > to > > changes in user behavior. > > There's no such concept as "external built-in functions" any more. > Built-in functions of external systems will be treated as special > > catalog > > functions. > > Re> Another question is, does this fallback include all > > hive built-in functions? As far as I know, some hive functions > have some hacky. If possible, can we start with a white list? > Once we implement some functions to flink built-in, we can > also update the whitelist. > > Yes, that's something we thought of too. I don't think it's super > critical to the scope of this FLIP, thus I'd like to leave it to > > future > > efforts as a nice-to-have feature. > > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> > <[hidden email]> > > wrote: > > Hi Kurt, > > Re: > What I want to propose is we can merge #3 and #4, make them > > both > > under > > "catalog" concept, by extending catalog function to make it have > > ability to > > have built-in catalog functions. Some benefits I can see from > > this > > approach: > > 1. We don't have to introduce new concept like external built-in > > functions. > > Actually I don't see a full story about how to treat a built-in > > functions, and it > > seems a little bit disrupt with catalog. As a result, you have > > to > > make > > some restriction > > like "hive built-in functions can only be used when current > > catalog > > is > > hive catalog". > > Yes, I've unified #3 and #4 but it seems I didn't update some > > part > > of > > the doc. I've modified those sections, and they are up to date > > now. > > In short, now built-in function of external systems are defined > > as > > a > > special kind of catalog function in Flink, and handled by Flink > > as > > following: > - An external built-in function must be associated with a catalog > > for > > the purpose of decoupling flink-table and external systems. > - It always resides in front of catalog functions in ambiguous > > function > > reference order, just like in its own external system > - It is a special catalog function that doesn’t have a > > schema/database > > namespace > - It goes thru the same instantiation logic as other user defined > catalog functions in the external system > > Please take another look at the doc, and let me know if you have > > more > > questions. > > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> > <[hidden email]> > > wrote: > > Hi Kurt, > > it should not affect the functions and operations we currently > > have > > in > > SQL. It just categorizes the available built-in functions. It is > > kind > > of > an orthogonal concept to the catalog API but built-in functions > > deserve > > this special kind of treatment. CatalogFunction still fits > > perfectly > > in > > there because the regular catalog object resolution logic is not > affected. So tables and functions are resolved in the same way > > but > > with > > built-in functions that have priority as in the original design. > > Regards, > Timo > > > On 03.09.19 15:26, Kurt Young wrote: > > Does this only affect the functions and operations we currently > > have > > in SQL > > and > have no effect on tables, right? Looks like this is an > > orthogonal > > concept > > with Catalog? > If the answer are both yes, then the catalog function will be a > > weird > > concept? > > Best, > Kurt > > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan < > > [hidden email] > > wrote: > > The way you proposed are basically the same as what Calcite > > does, I > > think > > we are in the same line. > > Best, > Danny Chan > 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > > ,写道: > > This sounds exactly as the module approach I mentioned, no? > > Regards, > Timo > > On 03.09.19 13:42, Danny Chan wrote: > > Thanks Bowen for bring up this topic, I think it’s a useful > > refactoring to make our function usage more user friendly. > > For the topic of how to organize the builtin operators and > > operators > > of Hive, here is a solution from Apache Calcite, the Calcite > > way > > is > > to make > > every dialect operators a “Library”, user can specify which > > libraries they > > want to use for a sql query. The builtin operators always > > comes > > as > > the > > first class objects and the others are used from the order > > they > > appears. > > Maybe you can take a reference. > > [1] > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > Best, > Danny Chan > 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > > ,写道: > > Hi folks, > > I'd like to kick off a discussion on reworking Flink's > > FunctionCatalog. > > It's critically helpful to improve function usability in > > SQL. > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > In short, it: > - adds support for precise function reference with > > fully/partially > > qualified name > - redefines function resolution order for ambiguous > > function > > reference > > - adds support for Hive's rich built-in functions (support > > for > > Hive > > user > > defined functions was already added in 1.9.0) > - clarifies the concept of temporary functions > > Would love to hear your thoughts. > > Bowen > > -- > Xuefu Zhang > > "In Honey We Trust!" > > > -- > Xuefu Zhang > > "In Honey We Trust!" > > > -- Xuefu Zhang "In Honey We Trust!" |
In reply to this post by bowen.li
Thank all for the sharing thoughts. I think we have gathered some useful
initial feedback from this long discussion with a couple of focal points sticking out. We will go back to do more research and adapt our proposal. Once it's ready, we will ask for a new round of review. If there is any disagreement, we will start a new discussion thread on each rather than having a mega discussion like this. Thanks to everyone for participating. Regards, Xuefu On Thu, Sep 5, 2019 at 2:52 AM Bowen Li <[hidden email]> wrote: > Let me try to summarize and conclude the long thread so far: > > 1. For order of temp function v.s. built-in function: > > I think Dawid's point that temp function should be of fully qualified path > is a better reasoning to back the newly proposed order, and i agree we > don't need to follow Hive/Spark. > > However, I'd rather not change fundamentals of temporary functions in this > FLIP. It belongs to a bigger story of how temporary objects should be > redefined and be handled uniformly - currently temporary tables and views > (those registered from TableEnv#registerTable()) behave different than what > Dawid propose for temp functions, and we need a FLIP to just unify their > APIs and behaviors. > > I agree that backward compatibility is not an issue w.r.t Jark's points. > > ***Seems we do have consensus that it's acceptable to prevent users > registering a temp function in the same name as a built-in function. To > help us move forward, I'd like to propose setting such a restraint on temp > functions in this FLIP to simplify the design and avoid disputes.*** It > will also leave rooms for improvements in the future. > > > 2. For Hive built-in function: > > Thanks Timo for providing the Presto and Postgres examples. I feel modular > built-in functions can be a good fit for the geo and ml example as a native > Flink extension, but not sure if it fits well with external integrations. > Anyway, I think modular built-in functions is a bigger story and can be on > its own thread too, and our proposal doesn't prevent Flink from doing that > in the future. > > ***Seems we have consensus that users should be able to use built-in > functions of Hive or other external systems in SQL explicitly and > deterministically regardless of Flink built-in functions and the potential > modular built-in functions, via some new syntax like "mycat::func"? If so, > I'd like to propose removing Hive built-in functions from ambiguous > function resolution order, and empower users with such a syntax. This way > we sacrifice a little convenience for certainty*** > > > What do you think? > > On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz <[hidden email]> > wrote: > > > Hi, > > > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just > > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they > are > > very inconsistent in that manner (spark being way worse on that). > > > > Hive: > > > > You cannot overwrite all the built-in functions. I could overwrite most > of > > the functions I tried e.g. length, e, pi, round, rtrim, but there are > > functions I cannot overwrite e.g. CAST, ARRAY I get: > > > > > > * ParseException line 1:29 cannot recognize input near 'array' 'AS' * > > > > What is interesting is that I cannot ovewrite *array*, but I can ovewrite > > *map* or *struct*. Though hive behaves reasonable well if I manage to > > overwrite a function. When I drop the temporary function the native > > function is still available. > > > > Spark: > > > > Spark's behavior imho is super bad. > > > > Theoretically I could overwrite all functions. I was able e.g. to > > overwrite CAST function. I had to use though CREATE OR REPLACE TEMPORARY > > FUNCTION syntax. Otherwise I get an exception that a function already > > exists. However when I used the CAST function in a query it used the > > native, built-in one. > > > > When I overwrote current_date() function, it was used in a query, but it > > completely replaces the built-in function and I can no longer use the > > native function in any way. I cannot also drop the temporary function. I > > get: > > > > * Error in query: Cannot drop native function 'current_date';* > > > > Additional note, both systems do not allow creating TEMPORARY FUNCTIONS > > with a database. Temporary functions are always represented as a single > > name. > > > > In my opinion neither of the systems have consistent behavior. Generally > > speaking I think overwriting any system provided functions is just > > dangerous. > > > > Regarding Jark's concerns. Such functions would be registered in a > current > > catalog/database schema, so a user could still use its own function, but > > would have to fully qualify the function (because built-in functions take > > precedence). Moreover users would have the same problem with permanent > > functions. Imagine a user have a permanent function 'cat.db.explode'. In > > 1.9 the user could use just the 'explode' function as long as the 'cat' & > > 'db' were the default catalog & database. If we introduce 'explode' > > built-in function in 1.10, the user has to fully qualify the function. > > > > Best, > > > > Dawid > > On 04/09/2019 15:19, Timo Walther wrote: > > > > Hi all, > > > > thanks for the healthy discussion. It is already a very long discussion > > with a lot of text. So I will just post my opinion to a couple of > > statements: > > > > > Hive built-in functions are not part of Flink built-in functions, they > > are catalog functions > > > > That is not entirely true. Correct me if I'm wrong but I think Hive > > built-in functions are also not catalog functions. They are not stored in > > every Hive metastore catalog that is freshly created but are a set of > > functions that are listed somewhere and made available. > > > > > ambiguous functions reference just shouldn't be resolved to a different > > catalog > > > > I agree. They should not be resolved to a different catalog. That's why I > > am suggesting to split the concept of built-in functions and catalog > lookup > > semantics. > > > > > I don't know if any other databases handle built-in functions like that > > > > What I called "module" is: > > - Extension in Postgres [1] > > - Plugin in Presto [2] > > > > Btw. Presto even mentions example modules that are similar to the ones > > that we will introduce in the near future both for ML and System XYZ > > compatibility: > > "See either the presto-ml module for machine learning functions or the > > presto-teradata-functions module for Teradata-compatible functions, both > in > > the root of the Presto source." > > > > > functions should be either built-in already or just libraries > functions, > > and library functions can be adapted to catalog APIs or of some other > > syntax to use > > > > Regarding "built-in already", of course we can add a lot of functions as > > built-ins but we will end-up in a dependency hell in the near future if > we > > don't introduce a pluggable approach. Library functions is what you also > > suggest but storing them in a catalog means to always fully qualify them > or > > modifying the existing catalog design that was inspired by the standard. > > > > I don't think "it brings in even more complicated scenarios to the > > design", it just does clear separation of concerns. Integrating the > > functionality into the current design makes the catalog API more > > complicated. > > > > > why would users name a temporary function the same as a built-in > > function then? > > > > Because you never know what users do. If they don't, my suggested > > resolution order should not be a problem, right? > > > > > I don't think hive functions deserves be a function module > > > > Our goal is not to create a Hive clone. We need to think forward and Hive > > is just one of many systems that we can support. Not every built-in > > function behaves and will behave exactly like Hive. > > > > > regarding temporary functions, there are few systems that support it > > > > IMHO Spark and Hive are not always the best examples for consistent > > design. Systems like Postgres, Presto, or SQL Server should be used as a > > reference. I don't think that a user can overwrite a built-in function > > there. > > > > Regards, > > Timo > > > > [1] https://www.postgresql.org/docs/10/extend-extensions.html > > [2] https://prestodb.github.io/docs/current/develop/functions.html > > > > > > On 04.09.19 13:44, Jark Wu wrote: > > > > Hi all, > > > > Regarding #1 temp function <> built-in function and naming. > > I'm fine with temp functions should precede built-in function and can > > override built-in functions (we already support to override built-in > > function in 1.9). > > If we don't allow the same name as a built-in function, I'm afraid we > will > > have compatibility issues in the future. > > Say users register a user defined function named "explode" in 1.9, and we > > support a built-in "explode" function in 1.10. > > Then the user's jobs which call the registered "explode" function in 1.9 > > will all fail in 1.10 because of naming conflict. > > > > Regarding #2 "External" built-in functions. > > I think if we store external built-in functions in catalog, then > > "hive1::sqrt" is a good way to go. > > However, I would prefer to support a discovery mechanism (e.g. SPI) for > > built-in functions as Timo suggested above. > > This gives us the flexibility to add Hive or MySQL or Geo or whatever > > function set as built-in functions in an easy way. > > > > Best, > > Jark > > > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> > > <[hidden email]> wrote: > > > > Hi David, > > > > Thank you for sharing your findings. It seems to me that there is no SQL > > standard regarding temporary functions. There are few systems that > support > > it. Here are what I have found: > > > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > > 2. Spark: basically follows Hive ( > > > > > > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > > ) > > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting > > behavior. ( > > http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html > ) > > > > Because of lack of standard, it's perfectly fine for Flink to define > > whatever it sees appropriate. Thus, your proposal (no overwriting and > must > > have DB as holder) is one option. The advantage is simplicity, The > > downside > > is the deviation from Hive, which is popular and de facto standard in big > > data world. > > > > However, I don't think we have to follow Hive. More importantly, we need > a > > consensus. I have no objection if your proposal is generally agreed upon. > > > > Thanks, > > Xuefu > > > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email] > > > > <[hidden email]> > > wrote: > > > > Hi all, > > > > Just an opinion on the built-in <> temporary functions resolution and > > NAMING issue. I think we should not allow overriding the built-in > > functions, as this may pose serious issues and to be honest is rather > > not feasible and would require major rework. What happens if a user > > wants to override CAST? Calls to that function are generated at > > different layers of the stack that unfortunately does not always go > > through the Catalog API (at least yet). Moreover from what I've checked > > no other systems allow overriding the built-in functions. All the > > systems I've checked so far register temporary functions in a > > database/schema (either special database for temporary functions, or > > just current database). What I would suggest is to always register > > temporary functions with a 3 part identifier. The same way as tables, > > views etc. This effectively means you cannot override built-in > > functions. With such approach it is natural that the temporary functions > > end up a step lower in the resolution order: > > > > 1. built-in functions (1 part, maybe 2? - this is still under discussion) > > > > 2. temporary functions (always 3 part path) > > > > 3. catalog functions (always 3 part path) > > > > Let me know what do you think. > > > > Best, > > > > Dawid > > > > On 04/09/2019 06:13, Bowen Li wrote: > > > > Hi, > > > > I agree with Xuefu that the main controversial points are mainly the > > > > two > > > > places. My thoughts on them: > > > > 1) Determinism of referencing Hive built-in functions. We can either > > > > remove > > > > Hive built-in functions from ambiguous function resolution and require > > users to use special syntax for their qualified names, or add a config > > > > flag > > > > to catalog constructor/yaml for turning on and off Hive built-in > > > > functions > > > > with the flag set to 'false' by default and proper doc added to help > > > > users > > > > make their decisions. > > > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > > > function > > > > resolution order. We believe Flink temp functions should precede Flink > > built-in functions, and I have presented my reasons. Just in case if we > > cannot reach an agreement, I propose forbid users registering temp > > functions in the same name as a built-in function, like MySQL's > > > > approach, > > > > for the moment. It won't have any performance concern, since built-in > > functions are all in memory and thus cost of a name check will be > > > > really > > > > trivial. > > > > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> > > <[hidden email]> wrote: > > > > From what I have seen, there are a couple of focal disagreements: > > > > 1. Resolution order: temp function --> flink built-in function --> > > > > catalog > > > > function vs flink built-in function --> temp function -> catalog > > > > function. > > > > 2. "External" built-in functions: how to treat built-in functions in > > external system and how users reference them > > > > For #1, I agree with Bowen that temp function needs to be at the > > > > highest > > > > priority because that's how a user might overwrite a built-in function > > without referencing a persistent, overwriting catalog function with a > > > > fully > > > > qualified name. Putting built-in functions at the highest priority > > eliminates that usage. > > > > For #2, I saw a general agreement on referencing "external" built-in > > functions such as those in Hive needs to be explicit and deterministic > > > > even > > > > though different approaches are proposed. To limit the scope and > > > > simply > > > > the > > > > usage, it seems making sense to me to introduce special syntax for > > > > user to > > > > explicitly reference an external built-in function such as hive1::sqrt > > > > or > > > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API > > > > call > > > > hive1.getFunction(ObjectPath functionName) where the database name is > > absent for bulit-in functions available in that catalog hive1. I > > > > understand > > > > that Bowen's original proposal was trying to avoid this, but this > > > > could > > > > turn out to be a clean and simple solution. > > > > (Timo's modular approach is great way to "expand" Flink's built-in > > > > function > > > > set, which seems orthogonal and complementary to this, which could be > > tackled in further future work.) > > > > I'd be happy to hear further thoughts on the two points. > > > > Thanks, > > Xuefu > > > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> > > <[hidden email]> wrote: > > > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > > > the > > > > same > > as Bowen's. But after thinking about it, I'm currently lean to Timo's > > suggestion. > > > > The reason is backward compatibility. If we follow Bowen's approach, > > > > let's > > > > say we > > first find function in Flink's built-in functions, and then hive's > > built-in. For example, `foo` > > is not supported by Flink, but hive has such built-in function. So > > > > user > > > > will have hive's > > behavior for function `foo`. And in next release, Flink realize this > > > > is a > > > > very popular function > > and add it into Flink's built-in functions, but with different > > > > behavior > > > > as > > > > hive's. So in next > > release, the behavior changes. > > > > With Timo's approach, IIUC user have to tell the framework explicitly > > > > what > > > > kind of > > built-in functions he would like to use. He can just tell framework > > > > to > > > > abandon Flink's built-in > > functions, and use hive's instead. User can only choose between them, > > > > but > > > > not use > > them at the same time. I think this approach is more predictable. > > > > Best, > > Kurt > > > > > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> > > <[hidden email]> wrote: > > > > Hi all, > > > > Thanks for the feedback. Just a kindly reminder that the [Proposal] > > > > section > > > > in the google doc was updated, please take a look first and let me > > > > know > > > > if > > > > you have more questions. > > > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> > > <[hidden email]> > > > > wrote: > > > > Hi Timo, > > > > Re> 1) We should not have the restriction "hive built-in functions > > > > can > > > > only > > > > be used when current catalog is hive catalog". Switching a catalog > > should only have implications on the cat.db.object resolution but > > > > not > > > > functions. It would be quite convinient for users to use Hive > > > > built-ins > > > > even if they use a Confluent schema registry or just the in-memory > > > > catalog. > > > > There might be a misunderstanding here. > > > > First of all, Hive built-in functions are not part of Flink > > > > built-in > > > > functions, they are catalog functions, thus if the current catalog > > > > is > > > > not a > > > > HiveCatalog but, say, a schema registry catalog, ambiguous > > > > functions > > > > reference just shouldn't be resolved to a different catalog. > > > > Second, Hive built-in functions can potentially be referenced > > > > across > > > > catalog, but it doesn't have db namespace and we currently just > > > > don't > > > > have > > > > a SQL syntax for it. It can be enabled when such a SQL syntax is > > > > defined, > > > > e.g. "catalog::function", but it's out of scope of this FLIP. > > > > 2) I would propose to have separate concepts for catalog and > > > > built-in > > > > functions. In particular it would be nice to modularize built-in > > functions. Some built-in functions are very crucial (like AS, CAST, > > MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > > > maybe > > > > we add more experimental functions in the future or function for > > > > some > > > > special application area (Geo functions, ML functions). A data > > > > platform > > > > team might not want to make every built-in function available. Or a > > function module like ML functions is in a different Maven module. > > > > I think this is orthogonal to this FLIP, especially we don't have > > > > the > > > > "external built-in functions" anymore and currently the built-in > > > > function > > > > category remains untouched. > > > > But just to share some thoughts on the proposal, I'm not sure about > > > > it: > > > > - I don't know if any other databases handle built-in functions > > > > like > > > > that. > > > > Maybe you can give some examples? IMHO, built-in functions are > > > > system > > > > info > > > > and should be deterministic, not depending on loaded libraries. Geo > > functions should be either built-in already or just libraries > > > > functions, > > > > and library functions can be adapted to catalog APIs or of some > > > > other > > > > syntax to use > > - I don't know if all use cases stand, and many can be achieved by > > > > other > > > > approaches too. E.g. experimental functions can be taken good care > > > > of > > > > by > > > > documentations, annotations, etc > > - the proposal basically introduces some concept like a pluggable > > > > built-in > > > > function catalog, despite the already existing catalog APIs > > - it brings in even more complicated scenarios to the design. E.g. > > > > how > > > > do > > > > you handle built-in functions in different modules but different > > > > names? > > > > In short, I'm not sure if it really stands and it looks like an > > > > overkill > > > > to me. I'd rather not go to that route. Related discussion can be > > > > on > > > > its > > > > own thread. > > > > 3) Following the suggestion above, we can have a separate discovery > > mechanism for built-in functions. Instead of just going through a > > > > static > > > > list like in BuiltInFunctionDefinitions, a platform team should be > > > > able > > > > to select function modules like > > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > > HiveFunctions) or via service discovery; > > > > Same as above. I'll leave it to its own thread. > > > > re > 3) Dawid and I discussed the resulution order again. I agree > > > > with > > > > Kurt > > > > that we should unify built-in function (external or internal) > > > > under a > > > > common layer. However, the resolution order should be: > > 1. built-in functions > > 2. temporary functions > > 3. regular catalog resolution logic > > Otherwise a temporary function could cause clashes with Flink's > > > > built-in > > > > functions. If you take a look at other vendors, like SQL Server > > > > they > > > > also do not allow to overwrite built-in functions. > > > > ”I agree with Kurt that we should unify built-in function (external > > > > or > > > > internal) under a common layer.“ <- I don't think this is what Kurt > > > > means. > > > > Kurt and I are in favor of unifying built-in functions of external > > > > systems > > > > and catalog functions. Did you type a mistake? > > > > Besides, I'm not sure about the resolution order you proposed. > > > > Temporary > > > > functions have a lifespan over a session and are only visible to > > > > the > > > > session owner, they are unique to each user, and users create them > > > > on > > > > purpose to be the highest priority in order to overwrite system > > > > info > > > > (built-in functions in this case). > > > > In your case, why would users name a temporary function the same > > > > as a > > > > built-in function then? Since using that name in ambiguous function > > reference will always be resolved to built-in functions, creating a > > same-named temp function would be meaningless in the end. > > > > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> > > <[hidden email]> > > > > wrote: > > > > Hi Jingsong, > > > > Re> 1.Hive built-in functions is an intermediate solution. So we > > > > should > > > > not introduce interfaces to influence the framework. To make > > Flink itself more powerful, we should implement the functions > > we need to add. > > > > Yes, please see the doc. > > > > Re> 2.Non-flink built-in functions are easy for users to change > > > > their > > > > behavior. If we support some flink built-in functions in the > > future but act differently from non-flink built-in, this will > > > > lead > > > > to > > > > changes in user behavior. > > > > There's no such concept as "external built-in functions" any more. > > Built-in functions of external systems will be treated as special > > > > catalog > > > > functions. > > > > Re> Another question is, does this fallback include all > > > > hive built-in functions? As far as I know, some hive functions > > have some hacky. If possible, can we start with a white list? > > Once we implement some functions to flink built-in, we can > > also update the whitelist. > > > > Yes, that's something we thought of too. I don't think it's super > > critical to the scope of this FLIP, thus I'd like to leave it to > > > > future > > > > efforts as a nice-to-have feature. > > > > > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> > > <[hidden email]> > > > > wrote: > > > > Hi Kurt, > > > > Re: > What I want to propose is we can merge #3 and #4, make them > > > > both > > > > under > > > > "catalog" concept, by extending catalog function to make it have > > > > ability to > > > > have built-in catalog functions. Some benefits I can see from > > > > this > > > > approach: > > > > 1. We don't have to introduce new concept like external built-in > > > > functions. > > > > Actually I don't see a full story about how to treat a built-in > > > > functions, and it > > > > seems a little bit disrupt with catalog. As a result, you have > > > > to > > > > make > > > > some restriction > > > > like "hive built-in functions can only be used when current > > > > catalog > > > > is > > > > hive catalog". > > > > Yes, I've unified #3 and #4 but it seems I didn't update some > > > > part > > > > of > > > > the doc. I've modified those sections, and they are up to date > > > > now. > > > > In short, now built-in function of external systems are defined > > > > as > > > > a > > > > special kind of catalog function in Flink, and handled by Flink > > > > as > > > > following: > > - An external built-in function must be associated with a catalog > > > > for > > > > the purpose of decoupling flink-table and external systems. > > - It always resides in front of catalog functions in ambiguous > > > > function > > > > reference order, just like in its own external system > > - It is a special catalog function that doesn’t have a > > > > schema/database > > > > namespace > > - It goes thru the same instantiation logic as other user defined > > catalog functions in the external system > > > > Please take another look at the doc, and let me know if you have > > > > more > > > > questions. > > > > > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> > > <[hidden email]> > > > > wrote: > > > > Hi Kurt, > > > > it should not affect the functions and operations we currently > > > > have > > > > in > > > > SQL. It just categorizes the available built-in functions. It is > > > > kind > > > > of > > an orthogonal concept to the catalog API but built-in functions > > > > deserve > > > > this special kind of treatment. CatalogFunction still fits > > > > perfectly > > > > in > > > > there because the regular catalog object resolution logic is not > > affected. So tables and functions are resolved in the same way > > > > but > > > > with > > > > built-in functions that have priority as in the original design. > > > > Regards, > > Timo > > > > > > On 03.09.19 15:26, Kurt Young wrote: > > > > Does this only affect the functions and operations we currently > > > > have > > > > in SQL > > > > and > > have no effect on tables, right? Looks like this is an > > > > orthogonal > > > > concept > > > > with Catalog? > > If the answer are both yes, then the catalog function will be a > > > > weird > > > > concept? > > > > Best, > > Kurt > > > > > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan < > > > > [hidden email] > > > > wrote: > > > > The way you proposed are basically the same as what Calcite > > > > does, I > > > > think > > > > we are in the same line. > > > > Best, > > Danny Chan > > 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > > > > ,写道: > > > > This sounds exactly as the module approach I mentioned, no? > > > > Regards, > > Timo > > > > On 03.09.19 13:42, Danny Chan wrote: > > > > Thanks Bowen for bring up this topic, I think it’s a useful > > > > refactoring to make our function usage more user friendly. > > > > For the topic of how to organize the builtin operators and > > > > operators > > > > of Hive, here is a solution from Apache Calcite, the Calcite > > > > way > > > > is > > > > to make > > > > every dialect operators a “Library”, user can specify which > > > > libraries they > > > > want to use for a sql query. The builtin operators always > > > > comes > > > > as > > > > the > > > > first class objects and the others are used from the order > > > > they > > > > appears. > > > > Maybe you can take a reference. > > > > [1] > > > > > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > > > Best, > > Danny Chan > > 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > > > > ,写道: > > > > Hi folks, > > > > I'd like to kick off a discussion on reworking Flink's > > > > FunctionCatalog. > > > > It's critically helpful to improve function usability in > > > > SQL. > > > > > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > > > In short, it: > > - adds support for precise function reference with > > > > fully/partially > > > > qualified name > > - redefines function resolution order for ambiguous > > > > function > > > > reference > > > > - adds support for Hive's rich built-in functions (support > > > > for > > > > Hive > > > > user > > > > defined functions was already added in 1.9.0) > > - clarifies the concept of temporary functions > > > > Would love to hear your thoughts. > > > > Bowen > > > > -- > > Xuefu Zhang > > > > "In Honey We Trust!" > > > > > > -- > > Xuefu Zhang > > > > "In Honey We Trust!" > > > > > > > -- Xuefu Zhang "In Honey We Trust!" |
Maybe Xuefu missed my email. Please let me know what your thoughts are on
the summary, if there's still major controversy, I can take time to reevaluate that part. On Wed, Sep 4, 2019 at 2:25 PM Xuefu Z <[hidden email]> wrote: > Thank all for the sharing thoughts. I think we have gathered some useful > initial feedback from this long discussion with a couple of focal points > sticking out. > > We will go back to do more research and adapt our proposal. Once it's > ready, we will ask for a new round of review. If there is any disagreement, > we will start a new discussion thread on each rather than having a mega > discussion like this. > > Thanks to everyone for participating. > > Regards, > Xuefu > > > On Thu, Sep 5, 2019 at 2:52 AM Bowen Li <[hidden email]> wrote: > > > Let me try to summarize and conclude the long thread so far: > > > > 1. For order of temp function v.s. built-in function: > > > > I think Dawid's point that temp function should be of fully qualified > path > > is a better reasoning to back the newly proposed order, and i agree we > > don't need to follow Hive/Spark. > > > > However, I'd rather not change fundamentals of temporary functions in > this > > FLIP. It belongs to a bigger story of how temporary objects should be > > redefined and be handled uniformly - currently temporary tables and views > > (those registered from TableEnv#registerTable()) behave different than > what > > Dawid propose for temp functions, and we need a FLIP to just unify their > > APIs and behaviors. > > > > I agree that backward compatibility is not an issue w.r.t Jark's points. > > > > ***Seems we do have consensus that it's acceptable to prevent users > > registering a temp function in the same name as a built-in function. To > > help us move forward, I'd like to propose setting such a restraint on > temp > > functions in this FLIP to simplify the design and avoid disputes.*** It > > will also leave rooms for improvements in the future. > > > > > > 2. For Hive built-in function: > > > > Thanks Timo for providing the Presto and Postgres examples. I feel > modular > > built-in functions can be a good fit for the geo and ml example as a > native > > Flink extension, but not sure if it fits well with external integrations. > > Anyway, I think modular built-in functions is a bigger story and can be > on > > its own thread too, and our proposal doesn't prevent Flink from doing > that > > in the future. > > > > ***Seems we have consensus that users should be able to use built-in > > functions of Hive or other external systems in SQL explicitly and > > deterministically regardless of Flink built-in functions and the > potential > > modular built-in functions, via some new syntax like "mycat::func"? If > so, > > I'd like to propose removing Hive built-in functions from ambiguous > > function resolution order, and empower users with such a syntax. This way > > we sacrifice a little convenience for certainty*** > > > > > > What do you think? > > > > On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz <[hidden email]> > > wrote: > > > > > Hi, > > > > > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just > > > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they > > are > > > very inconsistent in that manner (spark being way worse on that). > > > > > > Hive: > > > > > > You cannot overwrite all the built-in functions. I could overwrite most > > of > > > the functions I tried e.g. length, e, pi, round, rtrim, but there are > > > functions I cannot overwrite e.g. CAST, ARRAY I get: > > > > > > > > > * ParseException line 1:29 cannot recognize input near 'array' 'AS' > * > > > > > > What is interesting is that I cannot ovewrite *array*, but I can > ovewrite > > > *map* or *struct*. Though hive behaves reasonable well if I manage to > > > overwrite a function. When I drop the temporary function the native > > > function is still available. > > > > > > Spark: > > > > > > Spark's behavior imho is super bad. > > > > > > Theoretically I could overwrite all functions. I was able e.g. to > > > overwrite CAST function. I had to use though CREATE OR REPLACE > TEMPORARY > > > FUNCTION syntax. Otherwise I get an exception that a function already > > > exists. However when I used the CAST function in a query it used the > > > native, built-in one. > > > > > > When I overwrote current_date() function, it was used in a query, but > it > > > completely replaces the built-in function and I can no longer use the > > > native function in any way. I cannot also drop the temporary function. > I > > > get: > > > > > > * Error in query: Cannot drop native function 'current_date';* > > > > > > Additional note, both systems do not allow creating TEMPORARY FUNCTIONS > > > with a database. Temporary functions are always represented as a single > > > name. > > > > > > In my opinion neither of the systems have consistent behavior. > Generally > > > speaking I think overwriting any system provided functions is just > > > dangerous. > > > > > > Regarding Jark's concerns. Such functions would be registered in a > > current > > > catalog/database schema, so a user could still use its own function, > but > > > would have to fully qualify the function (because built-in functions > take > > > precedence). Moreover users would have the same problem with permanent > > > functions. Imagine a user have a permanent function 'cat.db.explode'. > In > > > 1.9 the user could use just the 'explode' function as long as the > 'cat' & > > > 'db' were the default catalog & database. If we introduce 'explode' > > > built-in function in 1.10, the user has to fully qualify the function. > > > > > > Best, > > > > > > Dawid > > > On 04/09/2019 15:19, Timo Walther wrote: > > > > > > Hi all, > > > > > > thanks for the healthy discussion. It is already a very long discussion > > > with a lot of text. So I will just post my opinion to a couple of > > > statements: > > > > > > > Hive built-in functions are not part of Flink built-in functions, > they > > > are catalog functions > > > > > > That is not entirely true. Correct me if I'm wrong but I think Hive > > > built-in functions are also not catalog functions. They are not stored > in > > > every Hive metastore catalog that is freshly created but are a set of > > > functions that are listed somewhere and made available. > > > > > > > ambiguous functions reference just shouldn't be resolved to a > different > > > catalog > > > > > > I agree. They should not be resolved to a different catalog. That's > why I > > > am suggesting to split the concept of built-in functions and catalog > > lookup > > > semantics. > > > > > > > I don't know if any other databases handle built-in functions like > that > > > > > > What I called "module" is: > > > - Extension in Postgres [1] > > > - Plugin in Presto [2] > > > > > > Btw. Presto even mentions example modules that are similar to the ones > > > that we will introduce in the near future both for ML and System XYZ > > > compatibility: > > > "See either the presto-ml module for machine learning functions or the > > > presto-teradata-functions module for Teradata-compatible functions, > both > > in > > > the root of the Presto source." > > > > > > > functions should be either built-in already or just libraries > > functions, > > > and library functions can be adapted to catalog APIs or of some other > > > syntax to use > > > > > > Regarding "built-in already", of course we can add a lot of functions > as > > > built-ins but we will end-up in a dependency hell in the near future if > > we > > > don't introduce a pluggable approach. Library functions is what you > also > > > suggest but storing them in a catalog means to always fully qualify > them > > or > > > modifying the existing catalog design that was inspired by the > standard. > > > > > > I don't think "it brings in even more complicated scenarios to the > > > design", it just does clear separation of concerns. Integrating the > > > functionality into the current design makes the catalog API more > > > complicated. > > > > > > > why would users name a temporary function the same as a built-in > > > function then? > > > > > > Because you never know what users do. If they don't, my suggested > > > resolution order should not be a problem, right? > > > > > > > I don't think hive functions deserves be a function module > > > > > > Our goal is not to create a Hive clone. We need to think forward and > Hive > > > is just one of many systems that we can support. Not every built-in > > > function behaves and will behave exactly like Hive. > > > > > > > regarding temporary functions, there are few systems that support it > > > > > > IMHO Spark and Hive are not always the best examples for consistent > > > design. Systems like Postgres, Presto, or SQL Server should be used as > a > > > reference. I don't think that a user can overwrite a built-in function > > > there. > > > > > > Regards, > > > Timo > > > > > > [1] https://www.postgresql.org/docs/10/extend-extensions.html > > > [2] https://prestodb.github.io/docs/current/develop/functions.html > > > > > > > > > On 04.09.19 13:44, Jark Wu wrote: > > > > > > Hi all, > > > > > > Regarding #1 temp function <> built-in function and naming. > > > I'm fine with temp functions should precede built-in function and can > > > override built-in functions (we already support to override built-in > > > function in 1.9). > > > If we don't allow the same name as a built-in function, I'm afraid we > > will > > > have compatibility issues in the future. > > > Say users register a user defined function named "explode" in 1.9, and > we > > > support a built-in "explode" function in 1.10. > > > Then the user's jobs which call the registered "explode" function in > 1.9 > > > will all fail in 1.10 because of naming conflict. > > > > > > Regarding #2 "External" built-in functions. > > > I think if we store external built-in functions in catalog, then > > > "hive1::sqrt" is a good way to go. > > > However, I would prefer to support a discovery mechanism (e.g. SPI) for > > > built-in functions as Timo suggested above. > > > This gives us the flexibility to add Hive or MySQL or Geo or whatever > > > function set as built-in functions in an easy way. > > > > > > Best, > > > Jark > > > > > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> > > > <[hidden email]> wrote: > > > > > > Hi David, > > > > > > Thank you for sharing your findings. It seems to me that there is no > SQL > > > standard regarding temporary functions. There are few systems that > > support > > > it. Here are what I have found: > > > > > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > > > 2. Spark: basically follows Hive ( > > > > > > > > > > > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > > > ) > > > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of > overwriting > > > behavior. ( > > > > http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html > > ) > > > > > > Because of lack of standard, it's perfectly fine for Flink to define > > > whatever it sees appropriate. Thus, your proposal (no overwriting and > > must > > > have DB as holder) is one option. The advantage is simplicity, The > > > downside > > > is the deviation from Hive, which is popular and de facto standard in > big > > > data world. > > > > > > However, I don't think we have to follow Hive. More importantly, we > need > > a > > > consensus. I have no objection if your proposal is generally agreed > upon. > > > > > > Thanks, > > > Xuefu > > > > > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz < > [hidden email] > > > > > > <[hidden email]> > > > wrote: > > > > > > Hi all, > > > > > > Just an opinion on the built-in <> temporary functions resolution and > > > NAMING issue. I think we should not allow overriding the built-in > > > functions, as this may pose serious issues and to be honest is rather > > > not feasible and would require major rework. What happens if a user > > > wants to override CAST? Calls to that function are generated at > > > different layers of the stack that unfortunately does not always go > > > through the Catalog API (at least yet). Moreover from what I've checked > > > no other systems allow overriding the built-in functions. All the > > > systems I've checked so far register temporary functions in a > > > database/schema (either special database for temporary functions, or > > > just current database). What I would suggest is to always register > > > temporary functions with a 3 part identifier. The same way as tables, > > > views etc. This effectively means you cannot override built-in > > > functions. With such approach it is natural that the temporary > functions > > > end up a step lower in the resolution order: > > > > > > 1. built-in functions (1 part, maybe 2? - this is still under > discussion) > > > > > > 2. temporary functions (always 3 part path) > > > > > > 3. catalog functions (always 3 part path) > > > > > > Let me know what do you think. > > > > > > Best, > > > > > > Dawid > > > > > > On 04/09/2019 06:13, Bowen Li wrote: > > > > > > Hi, > > > > > > I agree with Xuefu that the main controversial points are mainly the > > > > > > two > > > > > > places. My thoughts on them: > > > > > > 1) Determinism of referencing Hive built-in functions. We can either > > > > > > remove > > > > > > Hive built-in functions from ambiguous function resolution and require > > > users to use special syntax for their qualified names, or add a config > > > > > > flag > > > > > > to catalog constructor/yaml for turning on and off Hive built-in > > > > > > functions > > > > > > with the flag set to 'false' by default and proper doc added to help > > > > > > users > > > > > > make their decisions. > > > > > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > > > > > function > > > > > > resolution order. We believe Flink temp functions should precede Flink > > > built-in functions, and I have presented my reasons. Just in case if we > > > cannot reach an agreement, I propose forbid users registering temp > > > functions in the same name as a built-in function, like MySQL's > > > > > > approach, > > > > > > for the moment. It won't have any performance concern, since built-in > > > functions are all in memory and thus cost of a name check will be > > > > > > really > > > > > > trivial. > > > > > > > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> > > > <[hidden email]> wrote: > > > > > > From what I have seen, there are a couple of focal disagreements: > > > > > > 1. Resolution order: temp function --> flink built-in function --> > > > > > > catalog > > > > > > function vs flink built-in function --> temp function -> catalog > > > > > > function. > > > > > > 2. "External" built-in functions: how to treat built-in functions in > > > external system and how users reference them > > > > > > For #1, I agree with Bowen that temp function needs to be at the > > > > > > highest > > > > > > priority because that's how a user might overwrite a built-in function > > > without referencing a persistent, overwriting catalog function with a > > > > > > fully > > > > > > qualified name. Putting built-in functions at the highest priority > > > eliminates that usage. > > > > > > For #2, I saw a general agreement on referencing "external" built-in > > > functions such as those in Hive needs to be explicit and deterministic > > > > > > even > > > > > > though different approaches are proposed. To limit the scope and > > > > > > simply > > > > > > the > > > > > > usage, it seems making sense to me to introduce special syntax for > > > > > > user to > > > > > > explicitly reference an external built-in function such as hive1::sqrt > > > > > > or > > > > > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API > > > > > > call > > > > > > hive1.getFunction(ObjectPath functionName) where the database name is > > > absent for bulit-in functions available in that catalog hive1. I > > > > > > understand > > > > > > that Bowen's original proposal was trying to avoid this, but this > > > > > > could > > > > > > turn out to be a clean and simple solution. > > > > > > (Timo's modular approach is great way to "expand" Flink's built-in > > > > > > function > > > > > > set, which seems orthogonal and complementary to this, which could be > > > tackled in further future work.) > > > > > > I'd be happy to hear further thoughts on the two points. > > > > > > Thanks, > > > Xuefu > > > > > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> > > > <[hidden email]> wrote: > > > > > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > > > > > the > > > > > > same > > > as Bowen's. But after thinking about it, I'm currently lean to Timo's > > > suggestion. > > > > > > The reason is backward compatibility. If we follow Bowen's approach, > > > > > > let's > > > > > > say we > > > first find function in Flink's built-in functions, and then hive's > > > built-in. For example, `foo` > > > is not supported by Flink, but hive has such built-in function. So > > > > > > user > > > > > > will have hive's > > > behavior for function `foo`. And in next release, Flink realize this > > > > > > is a > > > > > > very popular function > > > and add it into Flink's built-in functions, but with different > > > > > > behavior > > > > > > as > > > > > > hive's. So in next > > > release, the behavior changes. > > > > > > With Timo's approach, IIUC user have to tell the framework explicitly > > > > > > what > > > > > > kind of > > > built-in functions he would like to use. He can just tell framework > > > > > > to > > > > > > abandon Flink's built-in > > > functions, and use hive's instead. User can only choose between them, > > > > > > but > > > > > > not use > > > them at the same time. I think this approach is more predictable. > > > > > > Best, > > > Kurt > > > > > > > > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> > > > <[hidden email]> wrote: > > > > > > Hi all, > > > > > > Thanks for the feedback. Just a kindly reminder that the [Proposal] > > > > > > section > > > > > > in the google doc was updated, please take a look first and let me > > > > > > know > > > > > > if > > > > > > you have more questions. > > > > > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> > > > <[hidden email]> > > > > > > wrote: > > > > > > Hi Timo, > > > > > > Re> 1) We should not have the restriction "hive built-in functions > > > > > > can > > > > > > only > > > > > > be used when current catalog is hive catalog". Switching a catalog > > > should only have implications on the cat.db.object resolution but > > > > > > not > > > > > > functions. It would be quite convinient for users to use Hive > > > > > > built-ins > > > > > > even if they use a Confluent schema registry or just the in-memory > > > > > > catalog. > > > > > > There might be a misunderstanding here. > > > > > > First of all, Hive built-in functions are not part of Flink > > > > > > built-in > > > > > > functions, they are catalog functions, thus if the current catalog > > > > > > is > > > > > > not a > > > > > > HiveCatalog but, say, a schema registry catalog, ambiguous > > > > > > functions > > > > > > reference just shouldn't be resolved to a different catalog. > > > > > > Second, Hive built-in functions can potentially be referenced > > > > > > across > > > > > > catalog, but it doesn't have db namespace and we currently just > > > > > > don't > > > > > > have > > > > > > a SQL syntax for it. It can be enabled when such a SQL syntax is > > > > > > defined, > > > > > > e.g. "catalog::function", but it's out of scope of this FLIP. > > > > > > 2) I would propose to have separate concepts for catalog and > > > > > > built-in > > > > > > functions. In particular it would be nice to modularize built-in > > > functions. Some built-in functions are very crucial (like AS, CAST, > > > MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > > > > > maybe > > > > > > we add more experimental functions in the future or function for > > > > > > some > > > > > > special application area (Geo functions, ML functions). A data > > > > > > platform > > > > > > team might not want to make every built-in function available. Or a > > > function module like ML functions is in a different Maven module. > > > > > > I think this is orthogonal to this FLIP, especially we don't have > > > > > > the > > > > > > "external built-in functions" anymore and currently the built-in > > > > > > function > > > > > > category remains untouched. > > > > > > But just to share some thoughts on the proposal, I'm not sure about > > > > > > it: > > > > > > - I don't know if any other databases handle built-in functions > > > > > > like > > > > > > that. > > > > > > Maybe you can give some examples? IMHO, built-in functions are > > > > > > system > > > > > > info > > > > > > and should be deterministic, not depending on loaded libraries. Geo > > > functions should be either built-in already or just libraries > > > > > > functions, > > > > > > and library functions can be adapted to catalog APIs or of some > > > > > > other > > > > > > syntax to use > > > - I don't know if all use cases stand, and many can be achieved by > > > > > > other > > > > > > approaches too. E.g. experimental functions can be taken good care > > > > > > of > > > > > > by > > > > > > documentations, annotations, etc > > > - the proposal basically introduces some concept like a pluggable > > > > > > built-in > > > > > > function catalog, despite the already existing catalog APIs > > > - it brings in even more complicated scenarios to the design. E.g. > > > > > > how > > > > > > do > > > > > > you handle built-in functions in different modules but different > > > > > > names? > > > > > > In short, I'm not sure if it really stands and it looks like an > > > > > > overkill > > > > > > to me. I'd rather not go to that route. Related discussion can be > > > > > > on > > > > > > its > > > > > > own thread. > > > > > > 3) Following the suggestion above, we can have a separate discovery > > > mechanism for built-in functions. Instead of just going through a > > > > > > static > > > > > > list like in BuiltInFunctionDefinitions, a platform team should be > > > > > > able > > > > > > to select function modules like > > > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > > > HiveFunctions) or via service discovery; > > > > > > Same as above. I'll leave it to its own thread. > > > > > > re > 3) Dawid and I discussed the resulution order again. I agree > > > > > > with > > > > > > Kurt > > > > > > that we should unify built-in function (external or internal) > > > > > > under a > > > > > > common layer. However, the resolution order should be: > > > 1. built-in functions > > > 2. temporary functions > > > 3. regular catalog resolution logic > > > Otherwise a temporary function could cause clashes with Flink's > > > > > > built-in > > > > > > functions. If you take a look at other vendors, like SQL Server > > > > > > they > > > > > > also do not allow to overwrite built-in functions. > > > > > > ”I agree with Kurt that we should unify built-in function (external > > > > > > or > > > > > > internal) under a common layer.“ <- I don't think this is what Kurt > > > > > > means. > > > > > > Kurt and I are in favor of unifying built-in functions of external > > > > > > systems > > > > > > and catalog functions. Did you type a mistake? > > > > > > Besides, I'm not sure about the resolution order you proposed. > > > > > > Temporary > > > > > > functions have a lifespan over a session and are only visible to > > > > > > the > > > > > > session owner, they are unique to each user, and users create them > > > > > > on > > > > > > purpose to be the highest priority in order to overwrite system > > > > > > info > > > > > > (built-in functions in this case). > > > > > > In your case, why would users name a temporary function the same > > > > > > as a > > > > > > built-in function then? Since using that name in ambiguous function > > > reference will always be resolved to built-in functions, creating a > > > same-named temp function would be meaningless in the end. > > > > > > > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> > > > <[hidden email]> > > > > > > wrote: > > > > > > Hi Jingsong, > > > > > > Re> 1.Hive built-in functions is an intermediate solution. So we > > > > > > should > > > > > > not introduce interfaces to influence the framework. To make > > > Flink itself more powerful, we should implement the functions > > > we need to add. > > > > > > Yes, please see the doc. > > > > > > Re> 2.Non-flink built-in functions are easy for users to change > > > > > > their > > > > > > behavior. If we support some flink built-in functions in the > > > future but act differently from non-flink built-in, this will > > > > > > lead > > > > > > to > > > > > > changes in user behavior. > > > > > > There's no such concept as "external built-in functions" any more. > > > Built-in functions of external systems will be treated as special > > > > > > catalog > > > > > > functions. > > > > > > Re> Another question is, does this fallback include all > > > > > > hive built-in functions? As far as I know, some hive functions > > > have some hacky. If possible, can we start with a white list? > > > Once we implement some functions to flink built-in, we can > > > also update the whitelist. > > > > > > Yes, that's something we thought of too. I don't think it's super > > > critical to the scope of this FLIP, thus I'd like to leave it to > > > > > > future > > > > > > efforts as a nice-to-have feature. > > > > > > > > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> > > > <[hidden email]> > > > > > > wrote: > > > > > > Hi Kurt, > > > > > > Re: > What I want to propose is we can merge #3 and #4, make them > > > > > > both > > > > > > under > > > > > > "catalog" concept, by extending catalog function to make it have > > > > > > ability to > > > > > > have built-in catalog functions. Some benefits I can see from > > > > > > this > > > > > > approach: > > > > > > 1. We don't have to introduce new concept like external built-in > > > > > > functions. > > > > > > Actually I don't see a full story about how to treat a built-in > > > > > > functions, and it > > > > > > seems a little bit disrupt with catalog. As a result, you have > > > > > > to > > > > > > make > > > > > > some restriction > > > > > > like "hive built-in functions can only be used when current > > > > > > catalog > > > > > > is > > > > > > hive catalog". > > > > > > Yes, I've unified #3 and #4 but it seems I didn't update some > > > > > > part > > > > > > of > > > > > > the doc. I've modified those sections, and they are up to date > > > > > > now. > > > > > > In short, now built-in function of external systems are defined > > > > > > as > > > > > > a > > > > > > special kind of catalog function in Flink, and handled by Flink > > > > > > as > > > > > > following: > > > - An external built-in function must be associated with a catalog > > > > > > for > > > > > > the purpose of decoupling flink-table and external systems. > > > - It always resides in front of catalog functions in ambiguous > > > > > > function > > > > > > reference order, just like in its own external system > > > - It is a special catalog function that doesn’t have a > > > > > > schema/database > > > > > > namespace > > > - It goes thru the same instantiation logic as other user defined > > > catalog functions in the external system > > > > > > Please take another look at the doc, and let me know if you have > > > > > > more > > > > > > questions. > > > > > > > > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> > > > <[hidden email]> > > > > > > wrote: > > > > > > Hi Kurt, > > > > > > it should not affect the functions and operations we currently > > > > > > have > > > > > > in > > > > > > SQL. It just categorizes the available built-in functions. It is > > > > > > kind > > > > > > of > > > an orthogonal concept to the catalog API but built-in functions > > > > > > deserve > > > > > > this special kind of treatment. CatalogFunction still fits > > > > > > perfectly > > > > > > in > > > > > > there because the regular catalog object resolution logic is not > > > affected. So tables and functions are resolved in the same way > > > > > > but > > > > > > with > > > > > > built-in functions that have priority as in the original design. > > > > > > Regards, > > > Timo > > > > > > > > > On 03.09.19 15:26, Kurt Young wrote: > > > > > > Does this only affect the functions and operations we currently > > > > > > have > > > > > > in SQL > > > > > > and > > > have no effect on tables, right? Looks like this is an > > > > > > orthogonal > > > > > > concept > > > > > > with Catalog? > > > If the answer are both yes, then the catalog function will be a > > > > > > weird > > > > > > concept? > > > > > > Best, > > > Kurt > > > > > > > > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan < > > > > > > [hidden email] > > > > > > wrote: > > > > > > The way you proposed are basically the same as what Calcite > > > > > > does, I > > > > > > think > > > > > > we are in the same line. > > > > > > Best, > > > Danny Chan > > > 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > > > > > > ,写道: > > > > > > This sounds exactly as the module approach I mentioned, no? > > > > > > Regards, > > > Timo > > > > > > On 03.09.19 13:42, Danny Chan wrote: > > > > > > Thanks Bowen for bring up this topic, I think it’s a useful > > > > > > refactoring to make our function usage more user friendly. > > > > > > For the topic of how to organize the builtin operators and > > > > > > operators > > > > > > of Hive, here is a solution from Apache Calcite, the Calcite > > > > > > way > > > > > > is > > > > > > to make > > > > > > every dialect operators a “Library”, user can specify which > > > > > > libraries they > > > > > > want to use for a sql query. The builtin operators always > > > > > > comes > > > > > > as > > > > > > the > > > > > > first class objects and the others are used from the order > > > > > > they > > > > > > appears. > > > > > > Maybe you can take a reference. > > > > > > [1] > > > > > > > > > > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > > > > > Best, > > > Danny Chan > > > 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > > > > > > ,写道: > > > > > > Hi folks, > > > > > > I'd like to kick off a discussion on reworking Flink's > > > > > > FunctionCatalog. > > > > > > It's critically helpful to improve function usability in > > > > > > SQL. > > > > > > > > > > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > > > > > In short, it: > > > - adds support for precise function reference with > > > > > > fully/partially > > > > > > qualified name > > > - redefines function resolution order for ambiguous > > > > > > function > > > > > > reference > > > > > > - adds support for Hive's rich built-in functions (support > > > > > > for > > > > > > Hive > > > > > > user > > > > > > defined functions was already added in 1.9.0) > > > - clarifies the concept of temporary functions > > > > > > Would love to hear your thoughts. > > > > > > Bowen > > > > > > -- > > > Xuefu Zhang > > > > > > "In Honey We Trust!" > > > > > > > > > -- > > > Xuefu Zhang > > > > > > "In Honey We Trust!" > > > > > > > > > > > > > > -- > Xuefu Zhang > > "In Honey We Trust!" > |
In reply to this post by Xuefu Z
Hi Xuefu, Just wanted to summarize my opinion on the one topic (temporary functions). My preference would be to make temporary functions always 3-part qualified (as a result that would prohibit overriding built-in functions). Having said that if the community decides that it's better to allow overriding built-in functions I am fine with it and can commit to that decision. I wanted to ask if you could clarify a few points for me around that option.
I would be really grateful if you could explain me those questions, thanks. BTW, Thank you all for a healthy discussion. Best, Dawid On 04/09/2019 23:25, Xuefu Z wrote:
Thank all for the sharing thoughts. I think we have gathered some useful initial feedback from this long discussion with a couple of focal points sticking out. We will go back to do more research and adapt our proposal. Once it's ready, we will ask for a new round of review. If there is any disagreement, we will start a new discussion thread on each rather than having a mega discussion like this. Thanks to everyone for participating. Regards, Xuefu On Thu, Sep 5, 2019 at 2:52 AM Bowen Li [hidden email] wrote:Let me try to summarize and conclude the long thread so far: 1. For order of temp function v.s. built-in function: I think Dawid's point that temp function should be of fully qualified path is a better reasoning to back the newly proposed order, and i agree we don't need to follow Hive/Spark. However, I'd rather not change fundamentals of temporary functions in this FLIP. It belongs to a bigger story of how temporary objects should be redefined and be handled uniformly - currently temporary tables and views (those registered from TableEnv#registerTable()) behave different than what Dawid propose for temp functions, and we need a FLIP to just unify their APIs and behaviors. I agree that backward compatibility is not an issue w.r.t Jark's points. ***Seems we do have consensus that it's acceptable to prevent users registering a temp function in the same name as a built-in function. To help us move forward, I'd like to propose setting such a restraint on temp functions in this FLIP to simplify the design and avoid disputes.*** It will also leave rooms for improvements in the future. 2. For Hive built-in function: Thanks Timo for providing the Presto and Postgres examples. I feel modular built-in functions can be a good fit for the geo and ml example as a native Flink extension, but not sure if it fits well with external integrations. Anyway, I think modular built-in functions is a bigger story and can be on its own thread too, and our proposal doesn't prevent Flink from doing that in the future. ***Seems we have consensus that users should be able to use built-in functions of Hive or other external systems in SQL explicitly and deterministically regardless of Flink built-in functions and the potential modular built-in functions, via some new syntax like "mycat::func"? If so, I'd like to propose removing Hive built-in functions from ambiguous function resolution order, and empower users with such a syntax. This way we sacrifice a little convenience for certainty*** What do you think? On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz [hidden email] wrote:Hi, Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just performed some experiments (hive-2.3.2 & spark 2.4.4) and I think theyarevery inconsistent in that manner (spark being way worse on that). Hive: You cannot overwrite all the built-in functions. I could overwrite mostofthe functions I tried e.g. length, e, pi, round, rtrim, but there are functions I cannot overwrite e.g. CAST, ARRAY I get: * ParseException line 1:29 cannot recognize input near 'array' 'AS' * What is interesting is that I cannot ovewrite *array*, but I can ovewrite *map* or *struct*. Though hive behaves reasonable well if I manage to overwrite a function. When I drop the temporary function the native function is still available. Spark: Spark's behavior imho is super bad. Theoretically I could overwrite all functions. I was able e.g. to overwrite CAST function. I had to use though CREATE OR REPLACE TEMPORARY FUNCTION syntax. Otherwise I get an exception that a function already exists. However when I used the CAST function in a query it used the native, built-in one. When I overwrote current_date() function, it was used in a query, but it completely replaces the built-in function and I can no longer use the native function in any way. I cannot also drop the temporary function. I get: * Error in query: Cannot drop native function 'current_date';* Additional note, both systems do not allow creating TEMPORARY FUNCTIONS with a database. Temporary functions are always represented as a single name. In my opinion neither of the systems have consistent behavior. Generally speaking I think overwriting any system provided functions is just dangerous. Regarding Jark's concerns. Such functions would be registered in acurrentcatalog/database schema, so a user could still use its own function, but would have to fully qualify the function (because built-in functions take precedence). Moreover users would have the same problem with permanent functions. Imagine a user have a permanent function 'cat.db.explode'. In 1.9 the user could use just the 'explode' function as long as the 'cat' & 'db' were the default catalog & database. If we introduce 'explode' built-in function in 1.10, the user has to fully qualify the function. Best, Dawid On 04/09/2019 15:19, Timo Walther wrote: Hi all, thanks for the healthy discussion. It is already a very long discussion with a lot of text. So I will just post my opinion to a couple of statements:Hive built-in functions are not part of Flink built-in functions, theyare catalog functions That is not entirely true. Correct me if I'm wrong but I think Hive built-in functions are also not catalog functions. They are not stored in every Hive metastore catalog that is freshly created but are a set of functions that are listed somewhere and made available.ambiguous functions reference just shouldn't be resolved to a differentcatalog I agree. They should not be resolved to a different catalog. That's why I am suggesting to split the concept of built-in functions and cataloglookupsemantics.I don't know if any other databases handle built-in functions like thatWhat I called "module" is: - Extension in Postgres [1] - Plugin in Presto [2] Btw. Presto even mentions example modules that are similar to the ones that we will introduce in the near future both for ML and System XYZ compatibility: "See either the presto-ml module for machine learning functions or the presto-teradata-functions module for Teradata-compatible functions, bothinthe root of the Presto source."functions should be either built-in already or just librariesfunctions,and library functions can be adapted to catalog APIs or of some other syntax to use Regarding "built-in already", of course we can add a lot of functions as built-ins but we will end-up in a dependency hell in the near future ifwedon't introduce a pluggable approach. Library functions is what you also suggest but storing them in a catalog means to always fully qualify themormodifying the existing catalog design that was inspired by the standard. I don't think "it brings in even more complicated scenarios to the design", it just does clear separation of concerns. Integrating the functionality into the current design makes the catalog API more complicated.why would users name a temporary function the same as a built-infunction then? Because you never know what users do. If they don't, my suggested resolution order should not be a problem, right?I don't think hive functions deserves be a function moduleOur goal is not to create a Hive clone. We need to think forward and Hive is just one of many systems that we can support. Not every built-in function behaves and will behave exactly like Hive.regarding temporary functions, there are few systems that support itIMHO Spark and Hive are not always the best examples for consistent design. Systems like Postgres, Presto, or SQL Server should be used as a reference. I don't think that a user can overwrite a built-in function there. Regards, Timo [1] https://www.postgresql.org/docs/10/extend-extensions.html [2] https://prestodb.github.io/docs/current/develop/functions.html On 04.09.19 13:44, Jark Wu wrote: Hi all, Regarding #1 temp function <> built-in function and naming. I'm fine with temp functions should precede built-in function and can override built-in functions (we already support to override built-in function in 1.9). If we don't allow the same name as a built-in function, I'm afraid wewillhave compatibility issues in the future. Say users register a user defined function named "explode" in 1.9, and we support a built-in "explode" function in 1.10. Then the user's jobs which call the registered "explode" function in 1.9 will all fail in 1.10 because of naming conflict. Regarding #2 "External" built-in functions. I think if we store external built-in functions in catalog, then "hive1::sqrt" is a good way to go. However, I would prefer to support a discovery mechanism (e.g. SPI) for built-in functions as Timo suggested above. This gives us the flexibility to add Hive or MySQL or Geo or whatever function set as built-in functions in an easy way. Best, Jark On Wed, 4 Sep 2019 at 17:47, Xuefu Z [hidden email] [hidden email] wrote: Hi David, Thank you for sharing your findings. It seems to me that there is no SQL standard regarding temporary functions. There are few systems thatsupportit. Here are what I have found: 1. Hive: no DB qualifier allowed. Can overwrite built-in. 2. Spark: basically follows Hive (https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html) 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting behavior. ( http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html)Because of lack of standard, it's perfectly fine for Flink to define whatever it sees appropriate. Thus, your proposal (no overwriting andmusthave DB as holder) is one option. The advantage is simplicity, The downside is the deviation from Hive, which is popular and de facto standard in big data world. However, I don't think we have to follow Hive. More importantly, we needaconsensus. I have no objection if your proposal is generally agreed upon. Thanks, Xuefu On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email] [hidden email] wrote: Hi all, Just an opinion on the built-in <> temporary functions resolution and NAMING issue. I think we should not allow overriding the built-in functions, as this may pose serious issues and to be honest is rather not feasible and would require major rework. What happens if a user wants to override CAST? Calls to that function are generated at different layers of the stack that unfortunately does not always go through the Catalog API (at least yet). Moreover from what I've checked no other systems allow overriding the built-in functions. All the systems I've checked so far register temporary functions in a database/schema (either special database for temporary functions, or just current database). What I would suggest is to always register temporary functions with a 3 part identifier. The same way as tables, views etc. This effectively means you cannot override built-in functions. With such approach it is natural that the temporary functions end up a step lower in the resolution order: 1. built-in functions (1 part, maybe 2? - this is still under discussion) 2. temporary functions (always 3 part path) 3. catalog functions (always 3 part path) Let me know what do you think. Best, Dawid On 04/09/2019 06:13, Bowen Li wrote: Hi, I agree with Xuefu that the main controversial points are mainly the two places. My thoughts on them: 1) Determinism of referencing Hive built-in functions. We can either remove Hive built-in functions from ambiguous function resolution and require users to use special syntax for their qualified names, or add a config flag to catalog constructor/yaml for turning on and off Hive built-in functions with the flag set to 'false' by default and proper doc added to help users make their decisions. 2) Flink temp functions v.s. Flink built-in functions in ambiguous function resolution order. We believe Flink temp functions should precede Flink built-in functions, and I have presented my reasons. Just in case if we cannot reach an agreement, I propose forbid users registering temp functions in the same name as a built-in function, like MySQL's approach, for the moment. It won't have any performance concern, since built-in functions are all in memory and thus cost of a name check will be really trivial. On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z [hidden email] [hidden email] wrote: From what I have seen, there are a couple of focal disagreements: 1. Resolution order: temp function --> flink built-in function --> catalog function vs flink built-in function --> temp function -> catalog function. 2. "External" built-in functions: how to treat built-in functions in external system and how users reference them For #1, I agree with Bowen that temp function needs to be at the highest priority because that's how a user might overwrite a built-in function without referencing a persistent, overwriting catalog function with a fully qualified name. Putting built-in functions at the highest priority eliminates that usage. For #2, I saw a general agreement on referencing "external" built-in functions such as those in Hive needs to be explicit and deterministic even though different approaches are proposed. To limit the scope and simply the usage, it seems making sense to me to introduce special syntax for user to explicitly reference an external built-in function such as hive1::sqrt or hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API call hive1.getFunction(ObjectPath functionName) where the database name is absent for bulit-in functions available in that catalog hive1. I understand that Bowen's original proposal was trying to avoid this, but this could turn out to be a clean and simple solution. (Timo's modular approach is great way to "expand" Flink's built-in function set, which seems orthogonal and complementary to this, which could be tackled in further future work.) I'd be happy to hear further thoughts on the two points. Thanks, Xuefu On Tue, Sep 3, 2019 at 7:11 PM Kurt Young [hidden email] [hidden email] wrote: Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is the same as Bowen's. But after thinking about it, I'm currently lean to Timo's suggestion. The reason is backward compatibility. If we follow Bowen's approach, let's say we first find function in Flink's built-in functions, and then hive's built-in. For example, `foo` is not supported by Flink, but hive has such built-in function. So user will have hive's behavior for function `foo`. And in next release, Flink realize this is a very popular function and add it into Flink's built-in functions, but with different behavior as hive's. So in next release, the behavior changes. With Timo's approach, IIUC user have to tell the framework explicitly what kind of built-in functions he would like to use. He can just tell framework to abandon Flink's built-in functions, and use hive's instead. User can only choose between them, but not use them at the same time. I think this approach is more predictable. Best, Kurt On Wed, Sep 4, 2019 at 8:00 AM Bowen Li [hidden email] [hidden email] wrote: Hi all, Thanks for the feedback. Just a kindly reminder that the [Proposal] section in the google doc was updated, please take a look first and let me know if you have more questions. On Tue, Sep 3, 2019 at 4:57 PM Bowen Li [hidden email] [hidden email] wrote: Hi Timo, Re> 1) We should not have the restriction "hive built-in functions can only be used when current catalog is hive catalog". Switching a catalog should only have implications on the cat.db.object resolution but not functions. It would be quite convinient for users to use Hive built-ins even if they use a Confluent schema registry or just the in-memory catalog. There might be a misunderstanding here. First of all, Hive built-in functions are not part of Flink built-in functions, they are catalog functions, thus if the current catalog is not a HiveCatalog but, say, a schema registry catalog, ambiguous functions reference just shouldn't be resolved to a different catalog. Second, Hive built-in functions can potentially be referenced across catalog, but it doesn't have db namespace and we currently just don't have a SQL syntax for it. It can be enabled when such a SQL syntax is defined, e.g. "catalog::function", but it's out of scope of this FLIP. 2) I would propose to have separate concepts for catalog and built-in functions. In particular it would be nice to modularize built-in functions. Some built-in functions are very crucial (like AS, CAST, MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe we add more experimental functions in the future or function for some special application area (Geo functions, ML functions). A data platform team might not want to make every built-in function available. Or a function module like ML functions is in a different Maven module. I think this is orthogonal to this FLIP, especially we don't have the "external built-in functions" anymore and currently the built-in function category remains untouched. But just to share some thoughts on the proposal, I'm not sure about it: - I don't know if any other databases handle built-in functions like that. Maybe you can give some examples? IMHO, built-in functions are system info and should be deterministic, not depending on loaded libraries. Geo functions should be either built-in already or just libraries functions, and library functions can be adapted to catalog APIs or of some other syntax to use - I don't know if all use cases stand, and many can be achieved by other approaches too. E.g. experimental functions can be taken good care of by documentations, annotations, etc - the proposal basically introduces some concept like a pluggable built-in function catalog, despite the already existing catalog APIs - it brings in even more complicated scenarios to the design. E.g. how do you handle built-in functions in different modules but different names? In short, I'm not sure if it really stands and it looks like an overkill to me. I'd rather not go to that route. Related discussion can be on its own thread. 3) Following the suggestion above, we can have a separate discovery mechanism for built-in functions. Instead of just going through a static list like in BuiltInFunctionDefinitions, a platform team should be able to select function modules like catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, HiveFunctions) or via service discovery; Same as above. I'll leave it to its own thread. re > 3) Dawid and I discussed the resulution order again. I agree with Kurt that we should unify built-in function (external or internal) under a common layer. However, the resolution order should be: 1. built-in functions 2. temporary functions 3. regular catalog resolution logic Otherwise a temporary function could cause clashes with Flink's built-in functions. If you take a look at other vendors, like SQL Server they also do not allow to overwrite built-in functions. ”I agree with Kurt that we should unify built-in function (external or internal) under a common layer.“ <- I don't think this is what Kurt means. Kurt and I are in favor of unifying built-in functions of external systems and catalog functions. Did you type a mistake? Besides, I'm not sure about the resolution order you proposed. Temporary functions have a lifespan over a session and are only visible to the session owner, they are unique to each user, and users create them on purpose to be the highest priority in order to overwrite system info (built-in functions in this case). In your case, why would users name a temporary function the same as a built-in function then? Since using that name in ambiguous function reference will always be resolved to built-in functions, creating a same-named temp function would be meaningless in the end. On Tue, Sep 3, 2019 at 1:44 PM Bowen Li [hidden email] [hidden email] wrote: Hi Jingsong, Re> 1.Hive built-in functions is an intermediate solution. So we should not introduce interfaces to influence the framework. To make Flink itself more powerful, we should implement the functions we need to add. Yes, please see the doc. Re> 2.Non-flink built-in functions are easy for users to change their behavior. If we support some flink built-in functions in the future but act differently from non-flink built-in, this will lead to changes in user behavior. There's no such concept as "external built-in functions" any more. Built-in functions of external systems will be treated as special catalog functions. Re> Another question is, does this fallback include all hive built-in functions? As far as I know, some hive functions have some hacky. If possible, can we start with a white list? Once we implement some functions to flink built-in, we can also update the whitelist. Yes, that's something we thought of too. I don't think it's super critical to the scope of this FLIP, thus I'd like to leave it to future efforts as a nice-to-have feature. On Tue, Sep 3, 2019 at 1:37 PM Bowen Li [hidden email] [hidden email] wrote: Hi Kurt, Re: > What I want to propose is we can merge #3 and #4, make them both under "catalog" concept, by extending catalog function to make it have ability to have built-in catalog functions. Some benefits I can see from this approach: 1. We don't have to introduce new concept like external built-in functions. Actually I don't see a full story about how to treat a built-in functions, and it seems a little bit disrupt with catalog. As a result, you have to make some restriction like "hive built-in functions can only be used when current catalog is hive catalog". Yes, I've unified #3 and #4 but it seems I didn't update some part of the doc. I've modified those sections, and they are up to date now. In short, now built-in function of external systems are defined as a special kind of catalog function in Flink, and handled by Flink as following: - An external built-in function must be associated with a catalog for the purpose of decoupling flink-table and external systems. - It always resides in front of catalog functions in ambiguous function reference order, just like in its own external system - It is a special catalog function that doesn’t have a schema/database namespace - It goes thru the same instantiation logic as other user defined catalog functions in the external system Please take another look at the doc, and let me know if you have more questions. On Tue, Sep 3, 2019 at 7:28 AM Timo Walther [hidden email] [hidden email] wrote: Hi Kurt, it should not affect the functions and operations we currently have in SQL. It just categorizes the available built-in functions. It is kind of an orthogonal concept to the catalog API but built-in functions deserve this special kind of treatment. CatalogFunction still fits perfectly in there because the regular catalog object resolution logic is not affected. So tables and functions are resolved in the same way but with built-in functions that have priority as in the original design. Regards, Timo On 03.09.19 15:26, Kurt Young wrote: Does this only affect the functions and operations we currently have in SQL and have no effect on tables, right? Looks like this is an orthogonal concept with Catalog? If the answer are both yes, then the catalog function will be a weird concept? Best, Kurt On Tue, Sep 3, 2019 at 8:10 PM Danny Chan < [hidden email] wrote: The way you proposed are basically the same as what Calcite does, I think we are in the same line. Best, Danny Chan 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] ,写道: This sounds exactly as the module approach I mentioned, no? Regards, Timo On 03.09.19 13:42, Danny Chan wrote: Thanks Bowen for bring up this topic, I think it’s a useful refactoring to make our function usage more user friendly. For the topic of how to organize the builtin operators and operators of Hive, here is a solution from Apache Calcite, the Calcite way is to make every dialect operators a “Library”, user can specify which libraries they want to use for a sql query. The builtin operators always comes as the first class objects and the others are used from the order they appears. Maybe you can take a reference. [1]https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28Best, Danny Chan 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] ,写道: Hi folks, I'd like to kick off a discussion on reworking Flink's FunctionCatalog. It's critically helpful to improve function usability in SQL.https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharingIn short, it: - adds support for precise function reference with fully/partially qualified name - redefines function resolution order for ambiguous function reference - adds support for Hive's rich built-in functions (support for Hive user defined functions was already added in 1.9.0) - clarifies the concept of temporary functions Would love to hear your thoughts. Bowen -- Xuefu Zhang "In Honey We Trust!" -- Xuefu Zhang "In Honey We Trust!" signature.asc (849 bytes) Download Attachment |
In reply to this post by bowen.li
Yeah, sorry I prematurely concluded the discussions here, thinking they
were not converging. However, I did feel we needed to do more research and restart with individual topics. Please continue voicing your comments/suggestions while we revise and clarify our proposal. Thanks, Xuefu On Thu, Sep 5, 2019 at 12:04 PM Bowen Li <[hidden email]> wrote: > Maybe Xuefu missed my email. Please let me know what your thoughts are on > the summary, if there's still major controversy, I can take time to > reevaluate that part. > > > On Wed, Sep 4, 2019 at 2:25 PM Xuefu Z <[hidden email]> wrote: > > > Thank all for the sharing thoughts. I think we have gathered some useful > > initial feedback from this long discussion with a couple of focal points > > sticking out. > > > > We will go back to do more research and adapt our proposal. Once it's > > ready, we will ask for a new round of review. If there is any > disagreement, > > we will start a new discussion thread on each rather than having a mega > > discussion like this. > > > > Thanks to everyone for participating. > > > > Regards, > > Xuefu > > > > > > On Thu, Sep 5, 2019 at 2:52 AM Bowen Li <[hidden email]> wrote: > > > > > Let me try to summarize and conclude the long thread so far: > > > > > > 1. For order of temp function v.s. built-in function: > > > > > > I think Dawid's point that temp function should be of fully qualified > > path > > > is a better reasoning to back the newly proposed order, and i agree we > > > don't need to follow Hive/Spark. > > > > > > However, I'd rather not change fundamentals of temporary functions in > > this > > > FLIP. It belongs to a bigger story of how temporary objects should be > > > redefined and be handled uniformly - currently temporary tables and > views > > > (those registered from TableEnv#registerTable()) behave different than > > what > > > Dawid propose for temp functions, and we need a FLIP to just unify > their > > > APIs and behaviors. > > > > > > I agree that backward compatibility is not an issue w.r.t Jark's > points. > > > > > > ***Seems we do have consensus that it's acceptable to prevent users > > > registering a temp function in the same name as a built-in function. To > > > help us move forward, I'd like to propose setting such a restraint on > > temp > > > functions in this FLIP to simplify the design and avoid disputes.*** It > > > will also leave rooms for improvements in the future. > > > > > > > > > 2. For Hive built-in function: > > > > > > Thanks Timo for providing the Presto and Postgres examples. I feel > > modular > > > built-in functions can be a good fit for the geo and ml example as a > > native > > > Flink extension, but not sure if it fits well with external > integrations. > > > Anyway, I think modular built-in functions is a bigger story and can be > > on > > > its own thread too, and our proposal doesn't prevent Flink from doing > > that > > > in the future. > > > > > > ***Seems we have consensus that users should be able to use built-in > > > functions of Hive or other external systems in SQL explicitly and > > > deterministically regardless of Flink built-in functions and the > > potential > > > modular built-in functions, via some new syntax like "mycat::func"? If > > so, > > > I'd like to propose removing Hive built-in functions from ambiguous > > > function resolution order, and empower users with such a syntax. This > way > > > we sacrifice a little convenience for certainty*** > > > > > > > > > What do you think? > > > > > > On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz < > [hidden email]> > > > wrote: > > > > > > > Hi, > > > > > > > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just > > > > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think > they > > > are > > > > very inconsistent in that manner (spark being way worse on that). > > > > > > > > Hive: > > > > > > > > You cannot overwrite all the built-in functions. I could overwrite > most > > > of > > > > the functions I tried e.g. length, e, pi, round, rtrim, but there are > > > > functions I cannot overwrite e.g. CAST, ARRAY I get: > > > > > > > > > > > > * ParseException line 1:29 cannot recognize input near 'array' > 'AS' > > * > > > > > > > > What is interesting is that I cannot ovewrite *array*, but I can > > ovewrite > > > > *map* or *struct*. Though hive behaves reasonable well if I manage to > > > > overwrite a function. When I drop the temporary function the native > > > > function is still available. > > > > > > > > Spark: > > > > > > > > Spark's behavior imho is super bad. > > > > > > > > Theoretically I could overwrite all functions. I was able e.g. to > > > > overwrite CAST function. I had to use though CREATE OR REPLACE > > TEMPORARY > > > > FUNCTION syntax. Otherwise I get an exception that a function already > > > > exists. However when I used the CAST function in a query it used the > > > > native, built-in one. > > > > > > > > When I overwrote current_date() function, it was used in a query, but > > it > > > > completely replaces the built-in function and I can no longer use the > > > > native function in any way. I cannot also drop the temporary > function. > > I > > > > get: > > > > > > > > * Error in query: Cannot drop native function 'current_date';* > > > > > > > > Additional note, both systems do not allow creating TEMPORARY > FUNCTIONS > > > > with a database. Temporary functions are always represented as a > single > > > > name. > > > > > > > > In my opinion neither of the systems have consistent behavior. > > Generally > > > > speaking I think overwriting any system provided functions is just > > > > dangerous. > > > > > > > > Regarding Jark's concerns. Such functions would be registered in a > > > current > > > > catalog/database schema, so a user could still use its own function, > > but > > > > would have to fully qualify the function (because built-in functions > > take > > > > precedence). Moreover users would have the same problem with > permanent > > > > functions. Imagine a user have a permanent function 'cat.db.explode'. > > In > > > > 1.9 the user could use just the 'explode' function as long as the > > 'cat' & > > > > 'db' were the default catalog & database. If we introduce 'explode' > > > > built-in function in 1.10, the user has to fully qualify the > function. > > > > > > > > Best, > > > > > > > > Dawid > > > > On 04/09/2019 15:19, Timo Walther wrote: > > > > > > > > Hi all, > > > > > > > > thanks for the healthy discussion. It is already a very long > discussion > > > > with a lot of text. So I will just post my opinion to a couple of > > > > statements: > > > > > > > > > Hive built-in functions are not part of Flink built-in functions, > > they > > > > are catalog functions > > > > > > > > That is not entirely true. Correct me if I'm wrong but I think Hive > > > > built-in functions are also not catalog functions. They are not > stored > > in > > > > every Hive metastore catalog that is freshly created but are a set of > > > > functions that are listed somewhere and made available. > > > > > > > > > ambiguous functions reference just shouldn't be resolved to a > > different > > > > catalog > > > > > > > > I agree. They should not be resolved to a different catalog. That's > > why I > > > > am suggesting to split the concept of built-in functions and catalog > > > lookup > > > > semantics. > > > > > > > > > I don't know if any other databases handle built-in functions like > > that > > > > > > > > What I called "module" is: > > > > - Extension in Postgres [1] > > > > - Plugin in Presto [2] > > > > > > > > Btw. Presto even mentions example modules that are similar to the > ones > > > > that we will introduce in the near future both for ML and System XYZ > > > > compatibility: > > > > "See either the presto-ml module for machine learning functions or > the > > > > presto-teradata-functions module for Teradata-compatible functions, > > both > > > in > > > > the root of the Presto source." > > > > > > > > > functions should be either built-in already or just libraries > > > functions, > > > > and library functions can be adapted to catalog APIs or of some other > > > > syntax to use > > > > > > > > Regarding "built-in already", of course we can add a lot of functions > > as > > > > built-ins but we will end-up in a dependency hell in the near future > if > > > we > > > > don't introduce a pluggable approach. Library functions is what you > > also > > > > suggest but storing them in a catalog means to always fully qualify > > them > > > or > > > > modifying the existing catalog design that was inspired by the > > standard. > > > > > > > > I don't think "it brings in even more complicated scenarios to the > > > > design", it just does clear separation of concerns. Integrating the > > > > functionality into the current design makes the catalog API more > > > > complicated. > > > > > > > > > why would users name a temporary function the same as a built-in > > > > function then? > > > > > > > > Because you never know what users do. If they don't, my suggested > > > > resolution order should not be a problem, right? > > > > > > > > > I don't think hive functions deserves be a function module > > > > > > > > Our goal is not to create a Hive clone. We need to think forward and > > Hive > > > > is just one of many systems that we can support. Not every built-in > > > > function behaves and will behave exactly like Hive. > > > > > > > > > regarding temporary functions, there are few systems that support > it > > > > > > > > IMHO Spark and Hive are not always the best examples for consistent > > > > design. Systems like Postgres, Presto, or SQL Server should be used > as > > a > > > > reference. I don't think that a user can overwrite a built-in > function > > > > there. > > > > > > > > Regards, > > > > Timo > > > > > > > > [1] https://www.postgresql.org/docs/10/extend-extensions.html > > > > [2] https://prestodb.github.io/docs/current/develop/functions.html > > > > > > > > > > > > On 04.09.19 13:44, Jark Wu wrote: > > > > > > > > Hi all, > > > > > > > > Regarding #1 temp function <> built-in function and naming. > > > > I'm fine with temp functions should precede built-in function and can > > > > override built-in functions (we already support to override built-in > > > > function in 1.9). > > > > If we don't allow the same name as a built-in function, I'm afraid we > > > will > > > > have compatibility issues in the future. > > > > Say users register a user defined function named "explode" in 1.9, > and > > we > > > > support a built-in "explode" function in 1.10. > > > > Then the user's jobs which call the registered "explode" function in > > 1.9 > > > > will all fail in 1.10 because of naming conflict. > > > > > > > > Regarding #2 "External" built-in functions. > > > > I think if we store external built-in functions in catalog, then > > > > "hive1::sqrt" is a good way to go. > > > > However, I would prefer to support a discovery mechanism (e.g. SPI) > for > > > > built-in functions as Timo suggested above. > > > > This gives us the flexibility to add Hive or MySQL or Geo or whatever > > > > function set as built-in functions in an easy way. > > > > > > > > Best, > > > > Jark > > > > > > > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> > > > > <[hidden email]> wrote: > > > > > > > > Hi David, > > > > > > > > Thank you for sharing your findings. It seems to me that there is no > > SQL > > > > standard regarding temporary functions. There are few systems that > > > support > > > > it. Here are what I have found: > > > > > > > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > > > > 2. Spark: basically follows Hive ( > > > > > > > > > > > > > > > > > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > > > > ) > > > > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of > > overwriting > > > > behavior. ( > > > > > > http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html > > > ) > > > > > > > > Because of lack of standard, it's perfectly fine for Flink to define > > > > whatever it sees appropriate. Thus, your proposal (no overwriting and > > > must > > > > have DB as holder) is one option. The advantage is simplicity, The > > > > downside > > > > is the deviation from Hive, which is popular and de facto standard in > > big > > > > data world. > > > > > > > > However, I don't think we have to follow Hive. More importantly, we > > need > > > a > > > > consensus. I have no objection if your proposal is generally agreed > > upon. > > > > > > > > Thanks, > > > > Xuefu > > > > > > > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz < > > [hidden email] > > > > > > > > <[hidden email]> > > > > wrote: > > > > > > > > Hi all, > > > > > > > > Just an opinion on the built-in <> temporary functions resolution and > > > > NAMING issue. I think we should not allow overriding the built-in > > > > functions, as this may pose serious issues and to be honest is rather > > > > not feasible and would require major rework. What happens if a user > > > > wants to override CAST? Calls to that function are generated at > > > > different layers of the stack that unfortunately does not always go > > > > through the Catalog API (at least yet). Moreover from what I've > checked > > > > no other systems allow overriding the built-in functions. All the > > > > systems I've checked so far register temporary functions in a > > > > database/schema (either special database for temporary functions, or > > > > just current database). What I would suggest is to always register > > > > temporary functions with a 3 part identifier. The same way as tables, > > > > views etc. This effectively means you cannot override built-in > > > > functions. With such approach it is natural that the temporary > > functions > > > > end up a step lower in the resolution order: > > > > > > > > 1. built-in functions (1 part, maybe 2? - this is still under > > discussion) > > > > > > > > 2. temporary functions (always 3 part path) > > > > > > > > 3. catalog functions (always 3 part path) > > > > > > > > Let me know what do you think. > > > > > > > > Best, > > > > > > > > Dawid > > > > > > > > On 04/09/2019 06:13, Bowen Li wrote: > > > > > > > > Hi, > > > > > > > > I agree with Xuefu that the main controversial points are mainly the > > > > > > > > two > > > > > > > > places. My thoughts on them: > > > > > > > > 1) Determinism of referencing Hive built-in functions. We can either > > > > > > > > remove > > > > > > > > Hive built-in functions from ambiguous function resolution and > require > > > > users to use special syntax for their qualified names, or add a > config > > > > > > > > flag > > > > > > > > to catalog constructor/yaml for turning on and off Hive built-in > > > > > > > > functions > > > > > > > > with the flag set to 'false' by default and proper doc added to help > > > > > > > > users > > > > > > > > make their decisions. > > > > > > > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > > > > > > > function > > > > > > > > resolution order. We believe Flink temp functions should precede > Flink > > > > built-in functions, and I have presented my reasons. Just in case if > we > > > > cannot reach an agreement, I propose forbid users registering temp > > > > functions in the same name as a built-in function, like MySQL's > > > > > > > > approach, > > > > > > > > for the moment. It won't have any performance concern, since built-in > > > > functions are all in memory and thus cost of a name check will be > > > > > > > > really > > > > > > > > trivial. > > > > > > > > > > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> > > > > <[hidden email]> wrote: > > > > > > > > From what I have seen, there are a couple of focal disagreements: > > > > > > > > 1. Resolution order: temp function --> flink built-in function --> > > > > > > > > catalog > > > > > > > > function vs flink built-in function --> temp function -> catalog > > > > > > > > function. > > > > > > > > 2. "External" built-in functions: how to treat built-in functions in > > > > external system and how users reference them > > > > > > > > For #1, I agree with Bowen that temp function needs to be at the > > > > > > > > highest > > > > > > > > priority because that's how a user might overwrite a built-in > function > > > > without referencing a persistent, overwriting catalog function with a > > > > > > > > fully > > > > > > > > qualified name. Putting built-in functions at the highest priority > > > > eliminates that usage. > > > > > > > > For #2, I saw a general agreement on referencing "external" built-in > > > > functions such as those in Hive needs to be explicit and > deterministic > > > > > > > > even > > > > > > > > though different approaches are proposed. To limit the scope and > > > > > > > > simply > > > > > > > > the > > > > > > > > usage, it seems making sense to me to introduce special syntax for > > > > > > > > user to > > > > > > > > explicitly reference an external built-in function such as > hive1::sqrt > > > > > > > > or > > > > > > > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog > API > > > > > > > > call > > > > > > > > hive1.getFunction(ObjectPath functionName) where the database name is > > > > absent for bulit-in functions available in that catalog hive1. I > > > > > > > > understand > > > > > > > > that Bowen's original proposal was trying to avoid this, but this > > > > > > > > could > > > > > > > > turn out to be a clean and simple solution. > > > > > > > > (Timo's modular approach is great way to "expand" Flink's built-in > > > > > > > > function > > > > > > > > set, which seems orthogonal and complementary to this, which could be > > > > tackled in further future work.) > > > > > > > > I'd be happy to hear further thoughts on the two points. > > > > > > > > Thanks, > > > > Xuefu > > > > > > > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> > > > > <[hidden email]> wrote: > > > > > > > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > > > > > > > the > > > > > > > > same > > > > as Bowen's. But after thinking about it, I'm currently lean to Timo's > > > > suggestion. > > > > > > > > The reason is backward compatibility. If we follow Bowen's approach, > > > > > > > > let's > > > > > > > > say we > > > > first find function in Flink's built-in functions, and then hive's > > > > built-in. For example, `foo` > > > > is not supported by Flink, but hive has such built-in function. So > > > > > > > > user > > > > > > > > will have hive's > > > > behavior for function `foo`. And in next release, Flink realize this > > > > > > > > is a > > > > > > > > very popular function > > > > and add it into Flink's built-in functions, but with different > > > > > > > > behavior > > > > > > > > as > > > > > > > > hive's. So in next > > > > release, the behavior changes. > > > > > > > > With Timo's approach, IIUC user have to tell the framework explicitly > > > > > > > > what > > > > > > > > kind of > > > > built-in functions he would like to use. He can just tell framework > > > > > > > > to > > > > > > > > abandon Flink's built-in > > > > functions, and use hive's instead. User can only choose between them, > > > > > > > > but > > > > > > > > not use > > > > them at the same time. I think this approach is more predictable. > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> > > > > <[hidden email]> wrote: > > > > > > > > Hi all, > > > > > > > > Thanks for the feedback. Just a kindly reminder that the [Proposal] > > > > > > > > section > > > > > > > > in the google doc was updated, please take a look first and let me > > > > > > > > know > > > > > > > > if > > > > > > > > you have more questions. > > > > > > > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> > > > > <[hidden email]> > > > > > > > > wrote: > > > > > > > > Hi Timo, > > > > > > > > Re> 1) We should not have the restriction "hive built-in functions > > > > > > > > can > > > > > > > > only > > > > > > > > be used when current catalog is hive catalog". Switching a catalog > > > > should only have implications on the cat.db.object resolution but > > > > > > > > not > > > > > > > > functions. It would be quite convinient for users to use Hive > > > > > > > > built-ins > > > > > > > > even if they use a Confluent schema registry or just the in-memory > > > > > > > > catalog. > > > > > > > > There might be a misunderstanding here. > > > > > > > > First of all, Hive built-in functions are not part of Flink > > > > > > > > built-in > > > > > > > > functions, they are catalog functions, thus if the current catalog > > > > > > > > is > > > > > > > > not a > > > > > > > > HiveCatalog but, say, a schema registry catalog, ambiguous > > > > > > > > functions > > > > > > > > reference just shouldn't be resolved to a different catalog. > > > > > > > > Second, Hive built-in functions can potentially be referenced > > > > > > > > across > > > > > > > > catalog, but it doesn't have db namespace and we currently just > > > > > > > > don't > > > > > > > > have > > > > > > > > a SQL syntax for it. It can be enabled when such a SQL syntax is > > > > > > > > defined, > > > > > > > > e.g. "catalog::function", but it's out of scope of this FLIP. > > > > > > > > 2) I would propose to have separate concepts for catalog and > > > > > > > > built-in > > > > > > > > functions. In particular it would be nice to modularize built-in > > > > functions. Some built-in functions are very crucial (like AS, CAST, > > > > MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > > > > > > > maybe > > > > > > > > we add more experimental functions in the future or function for > > > > > > > > some > > > > > > > > special application area (Geo functions, ML functions). A data > > > > > > > > platform > > > > > > > > team might not want to make every built-in function available. Or a > > > > function module like ML functions is in a different Maven module. > > > > > > > > I think this is orthogonal to this FLIP, especially we don't have > > > > > > > > the > > > > > > > > "external built-in functions" anymore and currently the built-in > > > > > > > > function > > > > > > > > category remains untouched. > > > > > > > > But just to share some thoughts on the proposal, I'm not sure about > > > > > > > > it: > > > > > > > > - I don't know if any other databases handle built-in functions > > > > > > > > like > > > > > > > > that. > > > > > > > > Maybe you can give some examples? IMHO, built-in functions are > > > > > > > > system > > > > > > > > info > > > > > > > > and should be deterministic, not depending on loaded libraries. Geo > > > > functions should be either built-in already or just libraries > > > > > > > > functions, > > > > > > > > and library functions can be adapted to catalog APIs or of some > > > > > > > > other > > > > > > > > syntax to use > > > > - I don't know if all use cases stand, and many can be achieved by > > > > > > > > other > > > > > > > > approaches too. E.g. experimental functions can be taken good care > > > > > > > > of > > > > > > > > by > > > > > > > > documentations, annotations, etc > > > > - the proposal basically introduces some concept like a pluggable > > > > > > > > built-in > > > > > > > > function catalog, despite the already existing catalog APIs > > > > - it brings in even more complicated scenarios to the design. E.g. > > > > > > > > how > > > > > > > > do > > > > > > > > you handle built-in functions in different modules but different > > > > > > > > names? > > > > > > > > In short, I'm not sure if it really stands and it looks like an > > > > > > > > overkill > > > > > > > > to me. I'd rather not go to that route. Related discussion can be > > > > > > > > on > > > > > > > > its > > > > > > > > own thread. > > > > > > > > 3) Following the suggestion above, we can have a separate discovery > > > > mechanism for built-in functions. Instead of just going through a > > > > > > > > static > > > > > > > > list like in BuiltInFunctionDefinitions, a platform team should be > > > > > > > > able > > > > > > > > to select function modules like > > > > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > > > > HiveFunctions) or via service discovery; > > > > > > > > Same as above. I'll leave it to its own thread. > > > > > > > > re > 3) Dawid and I discussed the resulution order again. I agree > > > > > > > > with > > > > > > > > Kurt > > > > > > > > that we should unify built-in function (external or internal) > > > > > > > > under a > > > > > > > > common layer. However, the resolution order should be: > > > > 1. built-in functions > > > > 2. temporary functions > > > > 3. regular catalog resolution logic > > > > Otherwise a temporary function could cause clashes with Flink's > > > > > > > > built-in > > > > > > > > functions. If you take a look at other vendors, like SQL Server > > > > > > > > they > > > > > > > > also do not allow to overwrite built-in functions. > > > > > > > > ”I agree with Kurt that we should unify built-in function (external > > > > > > > > or > > > > > > > > internal) under a common layer.“ <- I don't think this is what Kurt > > > > > > > > means. > > > > > > > > Kurt and I are in favor of unifying built-in functions of external > > > > > > > > systems > > > > > > > > and catalog functions. Did you type a mistake? > > > > > > > > Besides, I'm not sure about the resolution order you proposed. > > > > > > > > Temporary > > > > > > > > functions have a lifespan over a session and are only visible to > > > > > > > > the > > > > > > > > session owner, they are unique to each user, and users create them > > > > > > > > on > > > > > > > > purpose to be the highest priority in order to overwrite system > > > > > > > > info > > > > > > > > (built-in functions in this case). > > > > > > > > In your case, why would users name a temporary function the same > > > > > > > > as a > > > > > > > > built-in function then? Since using that name in ambiguous function > > > > reference will always be resolved to built-in functions, creating a > > > > same-named temp function would be meaningless in the end. > > > > > > > > > > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> > > > > <[hidden email]> > > > > > > > > wrote: > > > > > > > > Hi Jingsong, > > > > > > > > Re> 1.Hive built-in functions is an intermediate solution. So we > > > > > > > > should > > > > > > > > not introduce interfaces to influence the framework. To make > > > > Flink itself more powerful, we should implement the functions > > > > we need to add. > > > > > > > > Yes, please see the doc. > > > > > > > > Re> 2.Non-flink built-in functions are easy for users to change > > > > > > > > their > > > > > > > > behavior. If we support some flink built-in functions in the > > > > future but act differently from non-flink built-in, this will > > > > > > > > lead > > > > > > > > to > > > > > > > > changes in user behavior. > > > > > > > > There's no such concept as "external built-in functions" any more. > > > > Built-in functions of external systems will be treated as special > > > > > > > > catalog > > > > > > > > functions. > > > > > > > > Re> Another question is, does this fallback include all > > > > > > > > hive built-in functions? As far as I know, some hive functions > > > > have some hacky. If possible, can we start with a white list? > > > > Once we implement some functions to flink built-in, we can > > > > also update the whitelist. > > > > > > > > Yes, that's something we thought of too. I don't think it's super > > > > critical to the scope of this FLIP, thus I'd like to leave it to > > > > > > > > future > > > > > > > > efforts as a nice-to-have feature. > > > > > > > > > > > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> > > > > <[hidden email]> > > > > > > > > wrote: > > > > > > > > Hi Kurt, > > > > > > > > Re: > What I want to propose is we can merge #3 and #4, make them > > > > > > > > both > > > > > > > > under > > > > > > > > "catalog" concept, by extending catalog function to make it have > > > > > > > > ability to > > > > > > > > have built-in catalog functions. Some benefits I can see from > > > > > > > > this > > > > > > > > approach: > > > > > > > > 1. We don't have to introduce new concept like external built-in > > > > > > > > functions. > > > > > > > > Actually I don't see a full story about how to treat a built-in > > > > > > > > functions, and it > > > > > > > > seems a little bit disrupt with catalog. As a result, you have > > > > > > > > to > > > > > > > > make > > > > > > > > some restriction > > > > > > > > like "hive built-in functions can only be used when current > > > > > > > > catalog > > > > > > > > is > > > > > > > > hive catalog". > > > > > > > > Yes, I've unified #3 and #4 but it seems I didn't update some > > > > > > > > part > > > > > > > > of > > > > > > > > the doc. I've modified those sections, and they are up to date > > > > > > > > now. > > > > > > > > In short, now built-in function of external systems are defined > > > > > > > > as > > > > > > > > a > > > > > > > > special kind of catalog function in Flink, and handled by Flink > > > > > > > > as > > > > > > > > following: > > > > - An external built-in function must be associated with a catalog > > > > > > > > for > > > > > > > > the purpose of decoupling flink-table and external systems. > > > > - It always resides in front of catalog functions in ambiguous > > > > > > > > function > > > > > > > > reference order, just like in its own external system > > > > - It is a special catalog function that doesn’t have a > > > > > > > > schema/database > > > > > > > > namespace > > > > - It goes thru the same instantiation logic as other user defined > > > > catalog functions in the external system > > > > > > > > Please take another look at the doc, and let me know if you have > > > > > > > > more > > > > > > > > questions. > > > > > > > > > > > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> > > > > <[hidden email]> > > > > > > > > wrote: > > > > > > > > Hi Kurt, > > > > > > > > it should not affect the functions and operations we currently > > > > > > > > have > > > > > > > > in > > > > > > > > SQL. It just categorizes the available built-in functions. It is > > > > > > > > kind > > > > > > > > of > > > > an orthogonal concept to the catalog API but built-in functions > > > > > > > > deserve > > > > > > > > this special kind of treatment. CatalogFunction still fits > > > > > > > > perfectly > > > > > > > > in > > > > > > > > there because the regular catalog object resolution logic is not > > > > affected. So tables and functions are resolved in the same way > > > > > > > > but > > > > > > > > with > > > > > > > > built-in functions that have priority as in the original design. > > > > > > > > Regards, > > > > Timo > > > > > > > > > > > > On 03.09.19 15:26, Kurt Young wrote: > > > > > > > > Does this only affect the functions and operations we currently > > > > > > > > have > > > > > > > > in SQL > > > > > > > > and > > > > have no effect on tables, right? Looks like this is an > > > > > > > > orthogonal > > > > > > > > concept > > > > > > > > with Catalog? > > > > If the answer are both yes, then the catalog function will be a > > > > > > > > weird > > > > > > > > concept? > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan < > > > > > > > > [hidden email] > > > > > > > > wrote: > > > > > > > > The way you proposed are basically the same as what Calcite > > > > > > > > does, I > > > > > > > > think > > > > > > > > we are in the same line. > > > > > > > > Best, > > > > Danny Chan > > > > 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > > > > > > > > ,写道: > > > > > > > > This sounds exactly as the module approach I mentioned, no? > > > > > > > > Regards, > > > > Timo > > > > > > > > On 03.09.19 13:42, Danny Chan wrote: > > > > > > > > Thanks Bowen for bring up this topic, I think it’s a useful > > > > > > > > refactoring to make our function usage more user friendly. > > > > > > > > For the topic of how to organize the builtin operators and > > > > > > > > operators > > > > > > > > of Hive, here is a solution from Apache Calcite, the Calcite > > > > > > > > way > > > > > > > > is > > > > > > > > to make > > > > > > > > every dialect operators a “Library”, user can specify which > > > > > > > > libraries they > > > > > > > > want to use for a sql query. The builtin operators always > > > > > > > > comes > > > > > > > > as > > > > > > > > the > > > > > > > > first class objects and the others are used from the order > > > > > > > > they > > > > > > > > appears. > > > > > > > > Maybe you can take a reference. > > > > > > > > [1] > > > > > > > > > > > > > > > > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > > > > > > > Best, > > > > Danny Chan > > > > 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > > > > > > > > ,写道: > > > > > > > > Hi folks, > > > > > > > > I'd like to kick off a discussion on reworking Flink's > > > > > > > > FunctionCatalog. > > > > > > > > It's critically helpful to improve function usability in > > > > > > > > SQL. > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > > > > > > > In short, it: > > > > - adds support for precise function reference with > > > > > > > > fully/partially > > > > > > > > qualified name > > > > - redefines function resolution order for ambiguous > > > > > > > > function > > > > > > > > reference > > > > > > > > - adds support for Hive's rich built-in functions (support > > > > > > > > for > > > > > > > > Hive > > > > > > > > user > > > > > > > > defined functions was already added in 1.9.0) > > > > - clarifies the concept of temporary functions > > > > > > > > Would love to hear your thoughts. > > > > > > > > Bowen > > > > > > > > -- > > > > Xuefu Zhang > > > > > > > > "In Honey We Trust!" > > > > > > > > > > > > -- > > > > Xuefu Zhang > > > > > > > > "In Honey We Trust!" > > > > > > > > > > > > > > > > > > > > > -- > > Xuefu Zhang > > > > "In Honey We Trust!" > > > -- Xuefu Zhang "In Honey We Trust!" |
In reply to this post by dwysakowicz
Hi David,
Thanks for sharing your thoughts and request for clarifications. I believe that I fully understood your proposal, which does has its merit. However, it's different from ours. Here are the answers to your questions: Re #1: yes, the temp functions in the proposal are global and have just one-part names, similar to built-in functions. Two- or three-part names are not allowed. Re #2: not applicable as two- or three-part names are disallowed. Re #3: same as above. Referencing external built-in functions is achieved either implicitly (only the built-in functions in the current catalogs are considered) or via special syntax such as cat::function. However, we are looking into the modular approach that Time suggested with other feedback received from the community. Re #4: the resolution order goes like the following in our proposal: 1. temporary functions 2. bulit-in functions (including those augmented by add-on modules) 3. built-in functions in current catalog (this will not be needed if the special syntax "cat::function" is required) 4. functions in current catalog and db. If we go with the modular approach and make external built-in functions as an add-on module, the 2 and 3 above will be combined. In essence, the resolution order is equivalent in the two approaches. By the way, resolution order matters only for simple name reference. For names such as db.function (interpreted as current_cat/db/function) or cat.db.function, the reference is unambiguous, so on resolution is needed. As it can be seen, the proposed concept regarding temp function and function resolution is quite simple. Additionally, the proposed resolution order allows temp function to shadow a built-in function, which is important (though not decisive) in our opinion. I started liking the modular approach as the resolution order will only include 1, 2, and 4, which is simpler and more generic. That's why I suggested we look more into this direction. Please let me know if there are further questions. Thanks, Xuefu On Thu, Sep 5, 2019 at 2:42 PM Dawid Wysakowicz <[hidden email]> wrote: > Hi Xuefu, > > Just wanted to summarize my opinion on the one topic (temporary functions). > > My preference would be to make temporary functions always 3-part qualified > (as a result that would prohibit overriding built-in functions). Having > said that if the community decides that it's better to allow overriding > built-in functions I am fine with it and can commit to that decision. > > I wanted to ask if you could clarify a few points for me around that > option. > > 1. Would you enforce temporary functions to be always just a single > name (without db & cat) as hive does, or would you allow also 3 or even 2 > part identifiers? > 2. Assuming 2/3-part paths. How would you register a function from a > following statement: CREATE TEMPORARY FUNCTION db.func? Would that shadow > all functions named 'func' in all databases named 'db' in all catalogs? Or > would you shadow only function 'func' in database 'db' in current catalog? > 3. This point is still under discussion, but was mentioned a few > times, that maybe we want to enable syntax cat.func for "external built-in > functions". How would that affect statement from previous point? Would > 'db.func' shadow "external built-in function" in 'db' catalog or user > functions as in point 2? Or maybe both? > 4. Lastly in fact to summarize the previous points. Assuming 2/3-part > paths. Would the function resolution be actually as follows?: > 1. temporary functions (1-part path) > 2. built-in functions > 3. temporary functions (2-part path) > 4. 2-part catalog functions a.k.a. "external built-in functions" > (cat + func) - this is still under discussion, if we want that in the other > focal point > 5. temporary functions (3-part path) > 6. 3-part catalog functions a.k.a. user functions > > I would be really grateful if you could explain me those questions, thanks. > > BTW, Thank you all for a healthy discussion. > > Best, > > Dawid > On 04/09/2019 23:25, Xuefu Z wrote: > > Thank all for the sharing thoughts. I think we have gathered some useful > initial feedback from this long discussion with a couple of focal points > sticking out. > > We will go back to do more research and adapt our proposal. Once it's > ready, we will ask for a new round of review. If there is any disagreement, > we will start a new discussion thread on each rather than having a mega > discussion like this. > > Thanks to everyone for participating. > > Regards, > Xuefu > > > On Thu, Sep 5, 2019 at 2:52 AM Bowen Li <[hidden email]> <[hidden email]> wrote: > > > Let me try to summarize and conclude the long thread so far: > > 1. For order of temp function v.s. built-in function: > > I think Dawid's point that temp function should be of fully qualified path > is a better reasoning to back the newly proposed order, and i agree we > don't need to follow Hive/Spark. > > However, I'd rather not change fundamentals of temporary functions in this > FLIP. It belongs to a bigger story of how temporary objects should be > redefined and be handled uniformly - currently temporary tables and views > (those registered from TableEnv#registerTable()) behave different than what > Dawid propose for temp functions, and we need a FLIP to just unify their > APIs and behaviors. > > I agree that backward compatibility is not an issue w.r.t Jark's points. > > ***Seems we do have consensus that it's acceptable to prevent users > registering a temp function in the same name as a built-in function. To > help us move forward, I'd like to propose setting such a restraint on temp > functions in this FLIP to simplify the design and avoid disputes.*** It > will also leave rooms for improvements in the future. > > > 2. For Hive built-in function: > > Thanks Timo for providing the Presto and Postgres examples. I feel modular > built-in functions can be a good fit for the geo and ml example as a native > Flink extension, but not sure if it fits well with external integrations. > Anyway, I think modular built-in functions is a bigger story and can be on > its own thread too, and our proposal doesn't prevent Flink from doing that > in the future. > > ***Seems we have consensus that users should be able to use built-in > functions of Hive or other external systems in SQL explicitly and > deterministically regardless of Flink built-in functions and the potential > modular built-in functions, via some new syntax like "mycat::func"? If so, > I'd like to propose removing Hive built-in functions from ambiguous > function resolution order, and empower users with such a syntax. This way > we sacrifice a little convenience for certainty*** > > > What do you think? > > On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz <[hidden email]> <[hidden email]> > wrote: > > > Hi, > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they > > are > > very inconsistent in that manner (spark being way worse on that). > > Hive: > > You cannot overwrite all the built-in functions. I could overwrite most > > of > > the functions I tried e.g. length, e, pi, round, rtrim, but there are > functions I cannot overwrite e.g. CAST, ARRAY I get: > > > * ParseException line 1:29 cannot recognize input near 'array' 'AS' * > > What is interesting is that I cannot ovewrite *array*, but I can ovewrite > *map* or *struct*. Though hive behaves reasonable well if I manage to > overwrite a function. When I drop the temporary function the native > function is still available. > > Spark: > > Spark's behavior imho is super bad. > > Theoretically I could overwrite all functions. I was able e.g. to > overwrite CAST function. I had to use though CREATE OR REPLACE TEMPORARY > FUNCTION syntax. Otherwise I get an exception that a function already > exists. However when I used the CAST function in a query it used the > native, built-in one. > > When I overwrote current_date() function, it was used in a query, but it > completely replaces the built-in function and I can no longer use the > native function in any way. I cannot also drop the temporary function. I > get: > > * Error in query: Cannot drop native function 'current_date';* > > Additional note, both systems do not allow creating TEMPORARY FUNCTIONS > with a database. Temporary functions are always represented as a single > name. > > In my opinion neither of the systems have consistent behavior. Generally > speaking I think overwriting any system provided functions is just > dangerous. > > Regarding Jark's concerns. Such functions would be registered in a > > current > > catalog/database schema, so a user could still use its own function, but > would have to fully qualify the function (because built-in functions take > precedence). Moreover users would have the same problem with permanent > functions. Imagine a user have a permanent function 'cat.db.explode'. In > 1.9 the user could use just the 'explode' function as long as the 'cat' & > 'db' were the default catalog & database. If we introduce 'explode' > built-in function in 1.10, the user has to fully qualify the function. > > Best, > > Dawid > On 04/09/2019 15:19, Timo Walther wrote: > > Hi all, > > thanks for the healthy discussion. It is already a very long discussion > with a lot of text. So I will just post my opinion to a couple of > statements: > > > Hive built-in functions are not part of Flink built-in functions, they > > are catalog functions > > That is not entirely true. Correct me if I'm wrong but I think Hive > built-in functions are also not catalog functions. They are not stored in > every Hive metastore catalog that is freshly created but are a set of > functions that are listed somewhere and made available. > > > ambiguous functions reference just shouldn't be resolved to a different > > catalog > > I agree. They should not be resolved to a different catalog. That's why I > am suggesting to split the concept of built-in functions and catalog > > lookup > > semantics. > > > I don't know if any other databases handle built-in functions like that > > What I called "module" is: > - Extension in Postgres [1] > - Plugin in Presto [2] > > Btw. Presto even mentions example modules that are similar to the ones > that we will introduce in the near future both for ML and System XYZ > compatibility: > "See either the presto-ml module for machine learning functions or the > presto-teradata-functions module for Teradata-compatible functions, both > > in > > the root of the Presto source." > > > functions should be either built-in already or just libraries > > functions, > > and library functions can be adapted to catalog APIs or of some other > syntax to use > > Regarding "built-in already", of course we can add a lot of functions as > built-ins but we will end-up in a dependency hell in the near future if > > we > > don't introduce a pluggable approach. Library functions is what you also > suggest but storing them in a catalog means to always fully qualify them > > or > > modifying the existing catalog design that was inspired by the standard. > > I don't think "it brings in even more complicated scenarios to the > design", it just does clear separation of concerns. Integrating the > functionality into the current design makes the catalog API more > complicated. > > > why would users name a temporary function the same as a built-in > > function then? > > Because you never know what users do. If they don't, my suggested > resolution order should not be a problem, right? > > > I don't think hive functions deserves be a function module > > Our goal is not to create a Hive clone. We need to think forward and Hive > is just one of many systems that we can support. Not every built-in > function behaves and will behave exactly like Hive. > > > regarding temporary functions, there are few systems that support it > > IMHO Spark and Hive are not always the best examples for consistent > design. Systems like Postgres, Presto, or SQL Server should be used as a > reference. I don't think that a user can overwrite a built-in function > there. > > Regards, > Timo > > [1] https://www.postgresql.org/docs/10/extend-extensions.html > [2] https://prestodb.github.io/docs/current/develop/functions.html > > > On 04.09.19 13:44, Jark Wu wrote: > > Hi all, > > Regarding #1 temp function <> built-in function and naming. > I'm fine with temp functions should precede built-in function and can > override built-in functions (we already support to override built-in > function in 1.9). > If we don't allow the same name as a built-in function, I'm afraid we > > will > > have compatibility issues in the future. > Say users register a user defined function named "explode" in 1.9, and we > support a built-in "explode" function in 1.10. > Then the user's jobs which call the registered "explode" function in 1.9 > will all fail in 1.10 because of naming conflict. > > Regarding #2 "External" built-in functions. > I think if we store external built-in functions in catalog, then > "hive1::sqrt" is a good way to go. > However, I would prefer to support a discovery mechanism (e.g. SPI) for > built-in functions as Timo suggested above. > This gives us the flexibility to add Hive or MySQL or Geo or whatever > function set as built-in functions in an easy way. > > Best, > Jark > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> wrote: > > Hi David, > > Thank you for sharing your findings. It seems to me that there is no SQL > standard regarding temporary functions. There are few systems that > > support > > it. Here are what I have found: > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > 2. Spark: basically follows Hive ( > > > > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > > ) > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting > behavior. (http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html > > ) > > Because of lack of standard, it's perfectly fine for Flink to define > whatever it sees appropriate. Thus, your proposal (no overwriting and > > must > > have DB as holder) is one option. The advantage is simplicity, The > downside > is the deviation from Hive, which is popular and de facto standard in big > data world. > > However, I don't think we have to follow Hive. More importantly, we need > > a > > consensus. I have no objection if your proposal is generally agreed upon. > > Thanks, > Xuefu > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email] > <[hidden email]> <[hidden email]> > wrote: > > Hi all, > > Just an opinion on the built-in <> temporary functions resolution and > NAMING issue. I think we should not allow overriding the built-in > functions, as this may pose serious issues and to be honest is rather > not feasible and would require major rework. What happens if a user > wants to override CAST? Calls to that function are generated at > different layers of the stack that unfortunately does not always go > through the Catalog API (at least yet). Moreover from what I've checked > no other systems allow overriding the built-in functions. All the > systems I've checked so far register temporary functions in a > database/schema (either special database for temporary functions, or > just current database). What I would suggest is to always register > temporary functions with a 3 part identifier. The same way as tables, > views etc. This effectively means you cannot override built-in > functions. With such approach it is natural that the temporary functions > end up a step lower in the resolution order: > > 1. built-in functions (1 part, maybe 2? - this is still under discussion) > > 2. temporary functions (always 3 part path) > > 3. catalog functions (always 3 part path) > > Let me know what do you think. > > Best, > > Dawid > > On 04/09/2019 06:13, Bowen Li wrote: > > Hi, > > I agree with Xuefu that the main controversial points are mainly the > > two > > places. My thoughts on them: > > 1) Determinism of referencing Hive built-in functions. We can either > > remove > > Hive built-in functions from ambiguous function resolution and require > users to use special syntax for their qualified names, or add a config > > flag > > to catalog constructor/yaml for turning on and off Hive built-in > > functions > > with the flag set to 'false' by default and proper doc added to help > > users > > make their decisions. > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > function > > resolution order. We believe Flink temp functions should precede Flink > built-in functions, and I have presented my reasons. Just in case if we > cannot reach an agreement, I propose forbid users registering temp > functions in the same name as a built-in function, like MySQL's > > approach, > > for the moment. It won't have any performance concern, since built-in > functions are all in memory and thus cost of a name check will be > > really > > trivial. > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> wrote: > > From what I have seen, there are a couple of focal disagreements: > > 1. Resolution order: temp function --> flink built-in function --> > > catalog > > function vs flink built-in function --> temp function -> catalog > > function. > > 2. "External" built-in functions: how to treat built-in functions in > external system and how users reference them > > For #1, I agree with Bowen that temp function needs to be at the > > highest > > priority because that's how a user might overwrite a built-in function > without referencing a persistent, overwriting catalog function with a > > fully > > qualified name. Putting built-in functions at the highest priority > eliminates that usage. > > For #2, I saw a general agreement on referencing "external" built-in > functions such as those in Hive needs to be explicit and deterministic > > even > > though different approaches are proposed. To limit the scope and > > simply > > the > > usage, it seems making sense to me to introduce special syntax for > > user to > > explicitly reference an external built-in function such as hive1::sqrt > > or > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API > > call > > hive1.getFunction(ObjectPath functionName) where the database name is > absent for bulit-in functions available in that catalog hive1. I > > understand > > that Bowen's original proposal was trying to avoid this, but this > > could > > turn out to be a clean and simple solution. > > (Timo's modular approach is great way to "expand" Flink's built-in > > function > > set, which seems orthogonal and complementary to this, which could be > tackled in further future work.) > > I'd be happy to hear further thoughts on the two points. > > Thanks, > Xuefu > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> wrote: > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > the > > same > as Bowen's. But after thinking about it, I'm currently lean to Timo's > suggestion. > > The reason is backward compatibility. If we follow Bowen's approach, > > let's > > say we > first find function in Flink's built-in functions, and then hive's > built-in. For example, `foo` > is not supported by Flink, but hive has such built-in function. So > > user > > will have hive's > behavior for function `foo`. And in next release, Flink realize this > > is a > > very popular function > and add it into Flink's built-in functions, but with different > > behavior > > as > > hive's. So in next > release, the behavior changes. > > With Timo's approach, IIUC user have to tell the framework explicitly > > what > > kind of > built-in functions he would like to use. He can just tell framework > > to > > abandon Flink's built-in > functions, and use hive's instead. User can only choose between them, > > but > > not use > them at the same time. I think this approach is more predictable. > > Best, > Kurt > > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> wrote: > > Hi all, > > Thanks for the feedback. Just a kindly reminder that the [Proposal] > > section > > in the google doc was updated, please take a look first and let me > > know > > if > > you have more questions. > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> > > wrote: > > Hi Timo, > > Re> 1) We should not have the restriction "hive built-in functions > > can > > only > > be used when current catalog is hive catalog". Switching a catalog > should only have implications on the cat.db.object resolution but > > not > > functions. It would be quite convinient for users to use Hive > > built-ins > > even if they use a Confluent schema registry or just the in-memory > > catalog. > > There might be a misunderstanding here. > > First of all, Hive built-in functions are not part of Flink > > built-in > > functions, they are catalog functions, thus if the current catalog > > is > > not a > > HiveCatalog but, say, a schema registry catalog, ambiguous > > functions > > reference just shouldn't be resolved to a different catalog. > > Second, Hive built-in functions can potentially be referenced > > across > > catalog, but it doesn't have db namespace and we currently just > > don't > > have > > a SQL syntax for it. It can be enabled when such a SQL syntax is > > defined, > > e.g. "catalog::function", but it's out of scope of this FLIP. > > 2) I would propose to have separate concepts for catalog and > > built-in > > functions. In particular it would be nice to modularize built-in > functions. Some built-in functions are very crucial (like AS, CAST, > MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > maybe > > we add more experimental functions in the future or function for > > some > > special application area (Geo functions, ML functions). A data > > platform > > team might not want to make every built-in function available. Or a > function module like ML functions is in a different Maven module. > > I think this is orthogonal to this FLIP, especially we don't have > > the > > "external built-in functions" anymore and currently the built-in > > function > > category remains untouched. > > But just to share some thoughts on the proposal, I'm not sure about > > it: > > - I don't know if any other databases handle built-in functions > > like > > that. > > Maybe you can give some examples? IMHO, built-in functions are > > system > > info > > and should be deterministic, not depending on loaded libraries. Geo > functions should be either built-in already or just libraries > > functions, > > and library functions can be adapted to catalog APIs or of some > > other > > syntax to use > - I don't know if all use cases stand, and many can be achieved by > > other > > approaches too. E.g. experimental functions can be taken good care > > of > > by > > documentations, annotations, etc > - the proposal basically introduces some concept like a pluggable > > built-in > > function catalog, despite the already existing catalog APIs > - it brings in even more complicated scenarios to the design. E.g. > > how > > do > > you handle built-in functions in different modules but different > > names? > > In short, I'm not sure if it really stands and it looks like an > > overkill > > to me. I'd rather not go to that route. Related discussion can be > > on > > its > > own thread. > > 3) Following the suggestion above, we can have a separate discovery > mechanism for built-in functions. Instead of just going through a > > static > > list like in BuiltInFunctionDefinitions, a platform team should be > > able > > to select function modules like > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > HiveFunctions) or via service discovery; > > Same as above. I'll leave it to its own thread. > > re > 3) Dawid and I discussed the resulution order again. I agree > > with > > Kurt > > that we should unify built-in function (external or internal) > > under a > > common layer. However, the resolution order should be: > 1. built-in functions > 2. temporary functions > 3. regular catalog resolution logic > Otherwise a temporary function could cause clashes with Flink's > > built-in > > functions. If you take a look at other vendors, like SQL Server > > they > > also do not allow to overwrite built-in functions. > > ”I agree with Kurt that we should unify built-in function (external > > or > > internal) under a common layer.“ <- I don't think this is what Kurt > > means. > > Kurt and I are in favor of unifying built-in functions of external > > systems > > and catalog functions. Did you type a mistake? > > Besides, I'm not sure about the resolution order you proposed. > > Temporary > > functions have a lifespan over a session and are only visible to > > the > > session owner, they are unique to each user, and users create them > > on > > purpose to be the highest priority in order to overwrite system > > info > > (built-in functions in this case). > > In your case, why would users name a temporary function the same > > as a > > built-in function then? Since using that name in ambiguous function > reference will always be resolved to built-in functions, creating a > same-named temp function would be meaningless in the end. > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> > > wrote: > > Hi Jingsong, > > Re> 1.Hive built-in functions is an intermediate solution. So we > > should > > not introduce interfaces to influence the framework. To make > Flink itself more powerful, we should implement the functions > we need to add. > > Yes, please see the doc. > > Re> 2.Non-flink built-in functions are easy for users to change > > their > > behavior. If we support some flink built-in functions in the > future but act differently from non-flink built-in, this will > > lead > > to > > changes in user behavior. > > There's no such concept as "external built-in functions" any more. > Built-in functions of external systems will be treated as special > > catalog > > functions. > > Re> Another question is, does this fallback include all > > hive built-in functions? As far as I know, some hive functions > have some hacky. If possible, can we start with a white list? > Once we implement some functions to flink built-in, we can > also update the whitelist. > > Yes, that's something we thought of too. I don't think it's super > critical to the scope of this FLIP, thus I'd like to leave it to > > future > > efforts as a nice-to-have feature. > > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> > > wrote: > > Hi Kurt, > > Re: > What I want to propose is we can merge #3 and #4, make them > > both > > under > > "catalog" concept, by extending catalog function to make it have > > ability to > > have built-in catalog functions. Some benefits I can see from > > this > > approach: > > 1. We don't have to introduce new concept like external built-in > > functions. > > Actually I don't see a full story about how to treat a built-in > > functions, and it > > seems a little bit disrupt with catalog. As a result, you have > > to > > make > > some restriction > > like "hive built-in functions can only be used when current > > catalog > > is > > hive catalog". > > Yes, I've unified #3 and #4 but it seems I didn't update some > > part > > of > > the doc. I've modified those sections, and they are up to date > > now. > > In short, now built-in function of external systems are defined > > as > > a > > special kind of catalog function in Flink, and handled by Flink > > as > > following: > - An external built-in function must be associated with a catalog > > for > > the purpose of decoupling flink-table and external systems. > - It always resides in front of catalog functions in ambiguous > > function > > reference order, just like in its own external system > - It is a special catalog function that doesn’t have a > > schema/database > > namespace > - It goes thru the same instantiation logic as other user defined > catalog functions in the external system > > Please take another look at the doc, and let me know if you have > > more > > questions. > > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> > > wrote: > > Hi Kurt, > > it should not affect the functions and operations we currently > > have > > in > > SQL. It just categorizes the available built-in functions. It is > > kind > > of > an orthogonal concept to the catalog API but built-in functions > > deserve > > this special kind of treatment. CatalogFunction still fits > > perfectly > > in > > there because the regular catalog object resolution logic is not > affected. So tables and functions are resolved in the same way > > but > > with > > built-in functions that have priority as in the original design. > > Regards, > Timo > > > On 03.09.19 15:26, Kurt Young wrote: > > Does this only affect the functions and operations we currently > > have > > in SQL > > and > have no effect on tables, right? Looks like this is an > > orthogonal > > concept > > with Catalog? > If the answer are both yes, then the catalog function will be a > > weird > > concept? > > Best, > Kurt > > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan < > [hidden email] > > wrote: > > The way you proposed are basically the same as what Calcite > > does, I > > think > > we are in the same line. > > Best, > Danny Chan > 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > > ,写道: > > This sounds exactly as the module approach I mentioned, no? > > Regards, > Timo > > On 03.09.19 13:42, Danny Chan wrote: > > Thanks Bowen for bring up this topic, I think it’s a useful > > refactoring to make our function usage more user friendly. > > For the topic of how to organize the builtin operators and > > operators > > of Hive, here is a solution from Apache Calcite, the Calcite > > way > > is > > to make > > every dialect operators a “Library”, user can specify which > > libraries they > > want to use for a sql query. The builtin operators always > > comes > > as > > the > > first class objects and the others are used from the order > > they > > appears. > > Maybe you can take a reference. > > [1] > > > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > Best, > Danny Chan > 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > > ,写道: > > Hi folks, > > I'd like to kick off a discussion on reworking Flink's > > FunctionCatalog. > > It's critically helpful to improve function usability in > > SQL. > > > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > In short, it: > - adds support for precise function reference with > > fully/partially > > qualified name > - redefines function resolution order for ambiguous > > function > > reference > > - adds support for Hive's rich built-in functions (support > > for > > Hive > > user > > defined functions was already added in 1.9.0) > - clarifies the concept of temporary functions > > Would love to hear your thoughts. > > Bowen > > -- > Xuefu Zhang > > "In Honey We Trust!" > > > -- > Xuefu Zhang > > "In Honey We Trust!" > > > > > -- Xuefu Zhang "In Honey We Trust!" |
Hi Xuefu, Thank you for your answers. Let me summarize my understanding. In principle we differ only in
regards to the fact if a temporary function can be only 1-part or
only 3-part identified. I can reconfirm that if the community
decides it prefers the 1-part approach I will commit to that, with
the assumption that we will force ONLY 1-part function names. (We
will parse identifier and throw exception if a user tries to
register e.g. db.temp_func). My preference is though the 3-part approach:
As it can be seen, the proposed concept regarding temp function and function resolution is quite simple. Both approaches are equally simple. I would even say the 3-part approach is slightly simpler as it does not have to care about some special built-in functions such as CAST. I don't want to express my opinion on the differentiation between built-in functions and "external" built-in functions in this thread as it is rather orthogonal, but I also like the modular approach and I definitely don't like the special syntax "cat::function". I think it's better to stick to a standard or at least other proved solutions from other systems. Best, Dawid On 05/09/2019 10:12, Xuefu Z wrote:
Hi David, Thanks for sharing your thoughts and request for clarifications. I believe that I fully understood your proposal, which does has its merit. However, it's different from ours. Here are the answers to your questions: Re #1: yes, the temp functions in the proposal are global and have just one-part names, similar to built-in functions. Two- or three-part names are not allowed. Re #2: not applicable as two- or three-part names are disallowed. Re #3: same as above. Referencing external built-in functions is achieved either implicitly (only the built-in functions in the current catalogs are considered) or via special syntax such as cat::function. However, we are looking into the modular approach that Time suggested with other feedback received from the community. Re #4: the resolution order goes like the following in our proposal: 1. temporary functions 2. bulit-in functions (including those augmented by add-on modules) 3. built-in functions in current catalog (this will not be needed if the special syntax "cat::function" is required) 4. functions in current catalog and db. If we go with the modular approach and make external built-in functions as an add-on module, the 2 and 3 above will be combined. In essence, the resolution order is equivalent in the two approaches. By the way, resolution order matters only for simple name reference. For names such as db.function (interpreted as current_cat/db/function) or cat.db.function, the reference is unambiguous, so on resolution is needed. As it can be seen, the proposed concept regarding temp function and function resolution is quite simple. Additionally, the proposed resolution order allows temp function to shadow a built-in function, which is important (though not decisive) in our opinion. I started liking the modular approach as the resolution order will only include 1, 2, and 4, which is simpler and more generic. That's why I suggested we look more into this direction. Please let me know if there are further questions. Thanks, Xuefu On Thu, Sep 5, 2019 at 2:42 PM Dawid Wysakowicz [hidden email] wrote:Hi Xuefu, Just wanted to summarize my opinion on the one topic (temporary functions). My preference would be to make temporary functions always 3-part qualified (as a result that would prohibit overriding built-in functions). Having said that if the community decides that it's better to allow overriding built-in functions I am fine with it and can commit to that decision. I wanted to ask if you could clarify a few points for me around that option. 1. Would you enforce temporary functions to be always just a single name (without db & cat) as hive does, or would you allow also 3 or even 2 part identifiers? 2. Assuming 2/3-part paths. How would you register a function from a following statement: CREATE TEMPORARY FUNCTION db.func? Would that shadow all functions named 'func' in all databases named 'db' in all catalogs? Or would you shadow only function 'func' in database 'db' in current catalog? 3. This point is still under discussion, but was mentioned a few times, that maybe we want to enable syntax cat.func for "external built-in functions". How would that affect statement from previous point? Would 'db.func' shadow "external built-in function" in 'db' catalog or user functions as in point 2? Or maybe both? 4. Lastly in fact to summarize the previous points. Assuming 2/3-part paths. Would the function resolution be actually as follows?: 1. temporary functions (1-part path) 2. built-in functions 3. temporary functions (2-part path) 4. 2-part catalog functions a.k.a. "external built-in functions" (cat + func) - this is still under discussion, if we want that in the other focal point 5. temporary functions (3-part path) 6. 3-part catalog functions a.k.a. user functions I would be really grateful if you could explain me those questions, thanks. BTW, Thank you all for a healthy discussion. Best, Dawid On 04/09/2019 23:25, Xuefu Z wrote: Thank all for the sharing thoughts. I think we have gathered some useful initial feedback from this long discussion with a couple of focal points sticking out. We will go back to do more research and adapt our proposal. Once it's ready, we will ask for a new round of review. If there is any disagreement, we will start a new discussion thread on each rather than having a mega discussion like this. Thanks to everyone for participating. Regards, Xuefu On Thu, Sep 5, 2019 at 2:52 AM Bowen Li [hidden email] [hidden email] wrote: Let me try to summarize and conclude the long thread so far: 1. For order of temp function v.s. built-in function: I think Dawid's point that temp function should be of fully qualified path is a better reasoning to back the newly proposed order, and i agree we don't need to follow Hive/Spark. However, I'd rather not change fundamentals of temporary functions in this FLIP. It belongs to a bigger story of how temporary objects should be redefined and be handled uniformly - currently temporary tables and views (those registered from TableEnv#registerTable()) behave different than what Dawid propose for temp functions, and we need a FLIP to just unify their APIs and behaviors. I agree that backward compatibility is not an issue w.r.t Jark's points. ***Seems we do have consensus that it's acceptable to prevent users registering a temp function in the same name as a built-in function. To help us move forward, I'd like to propose setting such a restraint on temp functions in this FLIP to simplify the design and avoid disputes.*** It will also leave rooms for improvements in the future. 2. For Hive built-in function: Thanks Timo for providing the Presto and Postgres examples. I feel modular built-in functions can be a good fit for the geo and ml example as a native Flink extension, but not sure if it fits well with external integrations. Anyway, I think modular built-in functions is a bigger story and can be on its own thread too, and our proposal doesn't prevent Flink from doing that in the future. ***Seems we have consensus that users should be able to use built-in functions of Hive or other external systems in SQL explicitly and deterministically regardless of Flink built-in functions and the potential modular built-in functions, via some new syntax like "mycat::func"? If so, I'd like to propose removing Hive built-in functions from ambiguous function resolution order, and empower users with such a syntax. This way we sacrifice a little convenience for certainty*** What do you think? On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz [hidden email] [hidden email] wrote: Hi, Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they are very inconsistent in that manner (spark being way worse on that). Hive: You cannot overwrite all the built-in functions. I could overwrite most of the functions I tried e.g. length, e, pi, round, rtrim, but there are functions I cannot overwrite e.g. CAST, ARRAY I get: * ParseException line 1:29 cannot recognize input near 'array' 'AS' * What is interesting is that I cannot ovewrite *array*, but I can ovewrite *map* or *struct*. Though hive behaves reasonable well if I manage to overwrite a function. When I drop the temporary function the native function is still available. Spark: Spark's behavior imho is super bad. Theoretically I could overwrite all functions. I was able e.g. to overwrite CAST function. I had to use though CREATE OR REPLACE TEMPORARY FUNCTION syntax. Otherwise I get an exception that a function already exists. However when I used the CAST function in a query it used the native, built-in one. When I overwrote current_date() function, it was used in a query, but it completely replaces the built-in function and I can no longer use the native function in any way. I cannot also drop the temporary function. I get: * Error in query: Cannot drop native function 'current_date';* Additional note, both systems do not allow creating TEMPORARY FUNCTIONS with a database. Temporary functions are always represented as a single name. In my opinion neither of the systems have consistent behavior. Generally speaking I think overwriting any system provided functions is just dangerous. Regarding Jark's concerns. Such functions would be registered in a current catalog/database schema, so a user could still use its own function, but would have to fully qualify the function (because built-in functions take precedence). Moreover users would have the same problem with permanent functions. Imagine a user have a permanent function 'cat.db.explode'. In 1.9 the user could use just the 'explode' function as long as the 'cat' & 'db' were the default catalog & database. If we introduce 'explode' built-in function in 1.10, the user has to fully qualify the function. Best, Dawid On 04/09/2019 15:19, Timo Walther wrote: Hi all, thanks for the healthy discussion. It is already a very long discussion with a lot of text. So I will just post my opinion to a couple of statements: Hive built-in functions are not part of Flink built-in functions, they are catalog functions That is not entirely true. Correct me if I'm wrong but I think Hive built-in functions are also not catalog functions. They are not stored in every Hive metastore catalog that is freshly created but are a set of functions that are listed somewhere and made available. ambiguous functions reference just shouldn't be resolved to a different catalog I agree. They should not be resolved to a different catalog. That's why I am suggesting to split the concept of built-in functions and catalog lookup semantics. I don't know if any other databases handle built-in functions like that What I called "module" is: - Extension in Postgres [1] - Plugin in Presto [2] Btw. Presto even mentions example modules that are similar to the ones that we will introduce in the near future both for ML and System XYZ compatibility: "See either the presto-ml module for machine learning functions or the presto-teradata-functions module for Teradata-compatible functions, both in the root of the Presto source." functions should be either built-in already or just libraries functions, and library functions can be adapted to catalog APIs or of some other syntax to use Regarding "built-in already", of course we can add a lot of functions as built-ins but we will end-up in a dependency hell in the near future if we don't introduce a pluggable approach. Library functions is what you also suggest but storing them in a catalog means to always fully qualify them or modifying the existing catalog design that was inspired by the standard. I don't think "it brings in even more complicated scenarios to the design", it just does clear separation of concerns. Integrating the functionality into the current design makes the catalog API more complicated. why would users name a temporary function the same as a built-in function then? Because you never know what users do. If they don't, my suggested resolution order should not be a problem, right? I don't think hive functions deserves be a function module Our goal is not to create a Hive clone. We need to think forward and Hive is just one of many systems that we can support. Not every built-in function behaves and will behave exactly like Hive. regarding temporary functions, there are few systems that support it IMHO Spark and Hive are not always the best examples for consistent design. Systems like Postgres, Presto, or SQL Server should be used as a reference. I don't think that a user can overwrite a built-in function there. Regards, Timo [1] https://www.postgresql.org/docs/10/extend-extensions.html [2] https://prestodb.github.io/docs/current/develop/functions.html On 04.09.19 13:44, Jark Wu wrote: Hi all, Regarding #1 temp function <> built-in function and naming. I'm fine with temp functions should precede built-in function and can override built-in functions (we already support to override built-in function in 1.9). If we don't allow the same name as a built-in function, I'm afraid we will have compatibility issues in the future. Say users register a user defined function named "explode" in 1.9, and we support a built-in "explode" function in 1.10. Then the user's jobs which call the registered "explode" function in 1.9 will all fail in 1.10 because of naming conflict. Regarding #2 "External" built-in functions. I think if we store external built-in functions in catalog, then "hive1::sqrt" is a good way to go. However, I would prefer to support a discovery mechanism (e.g. SPI) for built-in functions as Timo suggested above. This gives us the flexibility to add Hive or MySQL or Geo or whatever function set as built-in functions in an easy way. Best, Jark On Wed, 4 Sep 2019 at 17:47, Xuefu Z [hidden email] [hidden email][hidden email] [hidden email] wrote: Hi David, Thank you for sharing your findings. It seems to me that there is no SQL standard regarding temporary functions. There are few systems that support it. Here are what I have found: 1. Hive: no DB qualifier allowed. Can overwrite built-in. 2. Spark: basically follows Hive ( https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html ) 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting behavior. (http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html ) Because of lack of standard, it's perfectly fine for Flink to define whatever it sees appropriate. Thus, your proposal (no overwriting and must have DB as holder) is one option. The advantage is simplicity, The downside is the deviation from Hive, which is popular and de facto standard in big data world. However, I don't think we have to follow Hive. More importantly, we need a consensus. I have no objection if your proposal is generally agreed upon. Thanks, Xuefu On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email] [hidden email] [hidden email] wrote: Hi all, Just an opinion on the built-in <> temporary functions resolution and NAMING issue. I think we should not allow overriding the built-in functions, as this may pose serious issues and to be honest is rather not feasible and would require major rework. What happens if a user wants to override CAST? Calls to that function are generated at different layers of the stack that unfortunately does not always go through the Catalog API (at least yet). Moreover from what I've checked no other systems allow overriding the built-in functions. All the systems I've checked so far register temporary functions in a database/schema (either special database for temporary functions, or just current database). What I would suggest is to always register temporary functions with a 3 part identifier. The same way as tables, views etc. This effectively means you cannot override built-in functions. With such approach it is natural that the temporary functions end up a step lower in the resolution order: 1. built-in functions (1 part, maybe 2? - this is still under discussion) 2. temporary functions (always 3 part path) 3. catalog functions (always 3 part path) Let me know what do you think. Best, Dawid On 04/09/2019 06:13, Bowen Li wrote: Hi, I agree with Xuefu that the main controversial points are mainly the two places. My thoughts on them: 1) Determinism of referencing Hive built-in functions. We can either remove Hive built-in functions from ambiguous function resolution and require users to use special syntax for their qualified names, or add a config flag to catalog constructor/yaml for turning on and off Hive built-in functions with the flag set to 'false' by default and proper doc added to help users make their decisions. 2) Flink temp functions v.s. Flink built-in functions in ambiguous function resolution order. We believe Flink temp functions should precede Flink built-in functions, and I have presented my reasons. Just in case if we cannot reach an agreement, I propose forbid users registering temp functions in the same name as a built-in function, like MySQL's approach, for the moment. It won't have any performance concern, since built-in functions are all in memory and thus cost of a name check will be really trivial. On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z [hidden email] [hidden email][hidden email] [hidden email] wrote: From what I have seen, there are a couple of focal disagreements: 1. Resolution order: temp function --> flink built-in function --> catalog function vs flink built-in function --> temp function -> catalog function. 2. "External" built-in functions: how to treat built-in functions in external system and how users reference them For #1, I agree with Bowen that temp function needs to be at the highest priority because that's how a user might overwrite a built-in function without referencing a persistent, overwriting catalog function with a fully qualified name. Putting built-in functions at the highest priority eliminates that usage. For #2, I saw a general agreement on referencing "external" built-in functions such as those in Hive needs to be explicit and deterministic even though different approaches are proposed. To limit the scope and simply the usage, it seems making sense to me to introduce special syntax for user to explicitly reference an external built-in function such as hive1::sqrt or hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API call hive1.getFunction(ObjectPath functionName) where the database name is absent for bulit-in functions available in that catalog hive1. I understand that Bowen's original proposal was trying to avoid this, but this could turn out to be a clean and simple solution. (Timo's modular approach is great way to "expand" Flink's built-in function set, which seems orthogonal and complementary to this, which could be tackled in further future work.) I'd be happy to hear further thoughts on the two points. Thanks, Xuefu On Tue, Sep 3, 2019 at 7:11 PM Kurt Young [hidden email] [hidden email][hidden email] [hidden email] wrote: Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is the same as Bowen's. But after thinking about it, I'm currently lean to Timo's suggestion. The reason is backward compatibility. If we follow Bowen's approach, let's say we first find function in Flink's built-in functions, and then hive's built-in. For example, `foo` is not supported by Flink, but hive has such built-in function. So user will have hive's behavior for function `foo`. And in next release, Flink realize this is a very popular function and add it into Flink's built-in functions, but with different behavior as hive's. So in next release, the behavior changes. With Timo's approach, IIUC user have to tell the framework explicitly what kind of built-in functions he would like to use. He can just tell framework to abandon Flink's built-in functions, and use hive's instead. User can only choose between them, but not use them at the same time. I think this approach is more predictable. Best, Kurt On Wed, Sep 4, 2019 at 8:00 AM Bowen Li [hidden email] [hidden email][hidden email] [hidden email] wrote: Hi all, Thanks for the feedback. Just a kindly reminder that the [Proposal] section in the google doc was updated, please take a look first and let me know if you have more questions. On Tue, Sep 3, 2019 at 4:57 PM Bowen Li [hidden email] [hidden email][hidden email] [hidden email] wrote: Hi Timo, Re> 1) We should not have the restriction "hive built-in functions can only be used when current catalog is hive catalog". Switching a catalog should only have implications on the cat.db.object resolution but not functions. It would be quite convinient for users to use Hive built-ins even if they use a Confluent schema registry or just the in-memory catalog. There might be a misunderstanding here. First of all, Hive built-in functions are not part of Flink built-in functions, they are catalog functions, thus if the current catalog is not a HiveCatalog but, say, a schema registry catalog, ambiguous functions reference just shouldn't be resolved to a different catalog. Second, Hive built-in functions can potentially be referenced across catalog, but it doesn't have db namespace and we currently just don't have a SQL syntax for it. It can be enabled when such a SQL syntax is defined, e.g. "catalog::function", but it's out of scope of this FLIP. 2) I would propose to have separate concepts for catalog and built-in functions. In particular it would be nice to modularize built-in functions. Some built-in functions are very crucial (like AS, CAST, MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe we add more experimental functions in the future or function for some special application area (Geo functions, ML functions). A data platform team might not want to make every built-in function available. Or a function module like ML functions is in a different Maven module. I think this is orthogonal to this FLIP, especially we don't have the "external built-in functions" anymore and currently the built-in function category remains untouched. But just to share some thoughts on the proposal, I'm not sure about it: - I don't know if any other databases handle built-in functions like that. Maybe you can give some examples? IMHO, built-in functions are system info and should be deterministic, not depending on loaded libraries. Geo functions should be either built-in already or just libraries functions, and library functions can be adapted to catalog APIs or of some other syntax to use - I don't know if all use cases stand, and many can be achieved by other approaches too. E.g. experimental functions can be taken good care of by documentations, annotations, etc - the proposal basically introduces some concept like a pluggable built-in function catalog, despite the already existing catalog APIs - it brings in even more complicated scenarios to the design. E.g. how do you handle built-in functions in different modules but different names? In short, I'm not sure if it really stands and it looks like an overkill to me. I'd rather not go to that route. Related discussion can be on its own thread. 3) Following the suggestion above, we can have a separate discovery mechanism for built-in functions. Instead of just going through a static list like in BuiltInFunctionDefinitions, a platform team should be able to select function modules like catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, HiveFunctions) or via service discovery; Same as above. I'll leave it to its own thread. re > 3) Dawid and I discussed the resulution order again. I agree with Kurt that we should unify built-in function (external or internal) under a common layer. However, the resolution order should be: 1. built-in functions 2. temporary functions 3. regular catalog resolution logic Otherwise a temporary function could cause clashes with Flink's built-in functions. If you take a look at other vendors, like SQL Server they also do not allow to overwrite built-in functions. ”I agree with Kurt that we should unify built-in function (external or internal) under a common layer.“ <- I don't think this is what Kurt means. Kurt and I are in favor of unifying built-in functions of external systems and catalog functions. Did you type a mistake? Besides, I'm not sure about the resolution order you proposed. Temporary functions have a lifespan over a session and are only visible to the session owner, they are unique to each user, and users create them on purpose to be the highest priority in order to overwrite system info (built-in functions in this case). In your case, why would users name a temporary function the same as a built-in function then? Since using that name in ambiguous function reference will always be resolved to built-in functions, creating a same-named temp function would be meaningless in the end. On Tue, Sep 3, 2019 at 1:44 PM Bowen Li [hidden email] [hidden email][hidden email] [hidden email] wrote: Hi Jingsong, Re> 1.Hive built-in functions is an intermediate solution. So we should not introduce interfaces to influence the framework. To make Flink itself more powerful, we should implement the functions we need to add. Yes, please see the doc. Re> 2.Non-flink built-in functions are easy for users to change their behavior. If we support some flink built-in functions in the future but act differently from non-flink built-in, this will lead to changes in user behavior. There's no such concept as "external built-in functions" any more. Built-in functions of external systems will be treated as special catalog functions. Re> Another question is, does this fallback include all hive built-in functions? As far as I know, some hive functions have some hacky. If possible, can we start with a white list? Once we implement some functions to flink built-in, we can also update the whitelist. Yes, that's something we thought of too. I don't think it's super critical to the scope of this FLIP, thus I'd like to leave it to future efforts as a nice-to-have feature. On Tue, Sep 3, 2019 at 1:37 PM Bowen Li [hidden email] [hidden email][hidden email] [hidden email] wrote: Hi Kurt, Re: > What I want to propose is we can merge #3 and #4, make them both under "catalog" concept, by extending catalog function to make it have ability to have built-in catalog functions. Some benefits I can see from this approach: 1. We don't have to introduce new concept like external built-in functions. Actually I don't see a full story about how to treat a built-in functions, and it seems a little bit disrupt with catalog. As a result, you have to make some restriction like "hive built-in functions can only be used when current catalog is hive catalog". Yes, I've unified #3 and #4 but it seems I didn't update some part of the doc. I've modified those sections, and they are up to date now. In short, now built-in function of external systems are defined as a special kind of catalog function in Flink, and handled by Flink as following: - An external built-in function must be associated with a catalog for the purpose of decoupling flink-table and external systems. - It always resides in front of catalog functions in ambiguous function reference order, just like in its own external system - It is a special catalog function that doesn’t have a schema/database namespace - It goes thru the same instantiation logic as other user defined catalog functions in the external system Please take another look at the doc, and let me know if you have more questions. On Tue, Sep 3, 2019 at 7:28 AM Timo Walther [hidden email] [hidden email][hidden email] [hidden email] wrote: Hi Kurt, it should not affect the functions and operations we currently have in SQL. It just categorizes the available built-in functions. It is kind of an orthogonal concept to the catalog API but built-in functions deserve this special kind of treatment. CatalogFunction still fits perfectly in there because the regular catalog object resolution logic is not affected. So tables and functions are resolved in the same way but with built-in functions that have priority as in the original design. Regards, Timo On 03.09.19 15:26, Kurt Young wrote: Does this only affect the functions and operations we currently have in SQL and have no effect on tables, right? Looks like this is an orthogonal concept with Catalog? If the answer are both yes, then the catalog function will be a weird concept? Best, Kurt On Tue, Sep 3, 2019 at 8:10 PM Danny Chan < [hidden email] wrote: The way you proposed are basically the same as what Calcite does, I think we are in the same line. Best, Danny Chan 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] ,写道: This sounds exactly as the module approach I mentioned, no? Regards, Timo On 03.09.19 13:42, Danny Chan wrote: Thanks Bowen for bring up this topic, I think it’s a useful refactoring to make our function usage more user friendly. For the topic of how to organize the builtin operators and operators of Hive, here is a solution from Apache Calcite, the Calcite way is to make every dialect operators a “Library”, user can specify which libraries they want to use for a sql query. The builtin operators always comes as the first class objects and the others are used from the order they appears. Maybe you can take a reference. [1] https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 Best, Danny Chan 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] ,写道: Hi folks, I'd like to kick off a discussion on reworking Flink's FunctionCatalog. It's critically helpful to improve function usability in SQL. https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing In short, it: - adds support for precise function reference with fully/partially qualified name - redefines function resolution order for ambiguous function reference - adds support for Hive's rich built-in functions (support for Hive user defined functions was already added in 1.9.0) - clarifies the concept of temporary functions Would love to hear your thoughts. Bowen -- Xuefu Zhang "In Honey We Trust!" -- Xuefu Zhang "In Honey We Trust!" signature.asc (849 bytes) Download Attachment |
Hi Dawid,
Thank you for your summary. While the only difference in the two proposals is one- or three-part in naming, the consequence would be substantial. To me, there are two major use cases of temporary functions compared to persistent ones: 1. Temporary in nature and auto managed by the session. More often than not, admin doesn't even allow user to create persistent functions. 2. Provide an opportunity to overwriting system built-in functions. Since built-in functions has one-part name, requiring three-part name for temporary functions eliminates the overwriting opportunity. One-part naming essentially puts all temp functions under a single namespace and simplifies function resolution, such as we don't need to consider the case of a temp function and a persistent function with the same name under the same database. I agree having three-parts does have its merits, such as consistency with other temporary objects (table) and minor difference between temp vs catalog functions. However, there is a slight difference between tables and function in that there is no built-in table in SQL so there is no need to overwrite it. I'm not sure if I fully agree the benefits you listed as the advantages of the three-part naming of temp functions. -- Allowing overwriting built-in functions is a benefit and the solution for disallowing certain overwriting shouldn't be totally banning it. -- Catalog functions are defined by users, and we suppose they can drop/alter it in any way they want. Thus, overwriting a catalog function doesn't seem to be a strong use case that we should be concerned about. Rather, there are known use case for overwriting built-in functions. Thus, personally I would prefer one-part name for temporary functions. In lack of SQL standard on this, I certainly like to get opinions from others to see if a consensus can be eventually reached. (To your point on modular approach to support external built-in functions, we saw the value and are actively looking into it. Thanks for sharing your opinion on that.) Thanks, Xuefu On Fri, Sep 6, 2019 at 3:48 PM Dawid Wysakowicz <[hidden email]> wrote: > Hi Xuefu, > > Thank you for your answers. > > Let me summarize my understanding. In principle we differ only in regards > to the fact if a temporary function can be only 1-part or only 3-part > identified. I can reconfirm that if the community decides it prefers the > 1-part approach I will commit to that, with the assumption that we will > force ONLY 1-part function names. (We will parse identifier and throw > exception if a user tries to register e.g. db.temp_func). > > My preference is though the 3-part approach: > > - there are some functions that it makes no sense to override, e.g. > CAST, moreover I'm afraid that allowing overriding such will lead to high > inconsistency, similar to those that I mentioned spark has > - you cannot shadow a fully-qualified function. (If a user fully > qualifies his/her objects in a SQL query, which is often considered a good > practice) > - it does not differentiate between functions & temporary functions. > Temporary functions just differ with regards to their life-cycle. The > registration & usage is exactly the same. > > As it can be seen, the proposed concept regarding temp function and > function resolution is quite simple. > > Both approaches are equally simple. I would even say the 3-part approach > is slightly simpler as it does not have to care about some special built-in > functions such as CAST. > > I don't want to express my opinion on the differentiation between built-in > functions and "external" built-in functions in this thread as it is rather > orthogonal, but I also like the modular approach and I definitely don't > like the special syntax "cat::function". I think it's better to stick to a > standard or at least other proved solutions from other systems. > > Best, > > Dawid > On 05/09/2019 10:12, Xuefu Z wrote: > > Hi David, > > Thanks for sharing your thoughts and request for clarifications. I believe > that I fully understood your proposal, which does has its merit. However, > it's different from ours. Here are the answers to your questions: > > Re #1: yes, the temp functions in the proposal are global and have just > one-part names, similar to built-in functions. Two- or three-part names are > not allowed. > > Re #2: not applicable as two- or three-part names are disallowed. > > Re #3: same as above. Referencing external built-in functions is achieved > either implicitly (only the built-in functions in the current catalogs are > considered) or via special syntax such as cat::function. However, we are > looking into the modular approach that Time suggested with other feedback > received from the community. > > Re #4: the resolution order goes like the following in our proposal: > > 1. temporary functions > 2. bulit-in functions (including those augmented by add-on modules) > 3. built-in functions in current catalog (this will not be needed if the > special syntax "cat::function" is required) > 4. functions in current catalog and db. > > If we go with the modular approach and make external built-in functions as > an add-on module, the 2 and 3 above will be combined. In essence, the > resolution order is equivalent in the two approaches. > > By the way, resolution order matters only for simple name reference. For > names such as db.function (interpreted as current_cat/db/function) or > cat.db.function, the reference is unambiguous, so on resolution is needed. > > As it can be seen, the proposed concept regarding temp function and > function resolution is quite simple. Additionally, the proposed resolution > order allows temp function to shadow a built-in function, which is > important (though not decisive) in our opinion. > > I started liking the modular approach as the resolution order will only > include 1, 2, and 4, which is simpler and more generic. That's why I > suggested we look more into this direction. > > Please let me know if there are further questions. > > Thanks, > Xuefu > > > > > On Thu, Sep 5, 2019 at 2:42 PM Dawid Wysakowicz <[hidden email]> <[hidden email]> > wrote: > > > Hi Xuefu, > > Just wanted to summarize my opinion on the one topic (temporary functions). > > My preference would be to make temporary functions always 3-part qualified > (as a result that would prohibit overriding built-in functions). Having > said that if the community decides that it's better to allow overriding > built-in functions I am fine with it and can commit to that decision. > > I wanted to ask if you could clarify a few points for me around that > option. > > 1. Would you enforce temporary functions to be always just a single > name (without db & cat) as hive does, or would you allow also 3 or even 2 > part identifiers? > 2. Assuming 2/3-part paths. How would you register a function from a > following statement: CREATE TEMPORARY FUNCTION db.func? Would that shadow > all functions named 'func' in all databases named 'db' in all catalogs? Or > would you shadow only function 'func' in database 'db' in current catalog? > 3. This point is still under discussion, but was mentioned a few > times, that maybe we want to enable syntax cat.func for "external built-in > functions". How would that affect statement from previous point? Would > 'db.func' shadow "external built-in function" in 'db' catalog or user > functions as in point 2? Or maybe both? > 4. Lastly in fact to summarize the previous points. Assuming 2/3-part > paths. Would the function resolution be actually as follows?: > 1. temporary functions (1-part path) > 2. built-in functions > 3. temporary functions (2-part path) > 4. 2-part catalog functions a.k.a. "external built-in functions" > (cat + func) - this is still under discussion, if we want that in the other > focal point > 5. temporary functions (3-part path) > 6. 3-part catalog functions a.k.a. user functions > > I would be really grateful if you could explain me those questions, thanks. > > BTW, Thank you all for a healthy discussion. > > Best, > > Dawid > On 04/09/2019 23:25, Xuefu Z wrote: > > Thank all for the sharing thoughts. I think we have gathered some useful > initial feedback from this long discussion with a couple of focal points > sticking out. > > We will go back to do more research and adapt our proposal. Once it's > ready, we will ask for a new round of review. If there is any disagreement, > we will start a new discussion thread on each rather than having a mega > discussion like this. > > Thanks to everyone for participating. > > Regards, > Xuefu > > > On Thu, Sep 5, 2019 at 2:52 AM Bowen Li <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > > Let me try to summarize and conclude the long thread so far: > > 1. For order of temp function v.s. built-in function: > > I think Dawid's point that temp function should be of fully qualified path > is a better reasoning to back the newly proposed order, and i agree we > don't need to follow Hive/Spark. > > However, I'd rather not change fundamentals of temporary functions in this > FLIP. It belongs to a bigger story of how temporary objects should be > redefined and be handled uniformly - currently temporary tables and views > (those registered from TableEnv#registerTable()) behave different than what > Dawid propose for temp functions, and we need a FLIP to just unify their > APIs and behaviors. > > I agree that backward compatibility is not an issue w.r.t Jark's points. > > ***Seems we do have consensus that it's acceptable to prevent users > registering a temp function in the same name as a built-in function. To > help us move forward, I'd like to propose setting such a restraint on temp > functions in this FLIP to simplify the design and avoid disputes.*** It > will also leave rooms for improvements in the future. > > > 2. For Hive built-in function: > > Thanks Timo for providing the Presto and Postgres examples. I feel modular > built-in functions can be a good fit for the geo and ml example as a native > Flink extension, but not sure if it fits well with external integrations. > Anyway, I think modular built-in functions is a bigger story and can be on > its own thread too, and our proposal doesn't prevent Flink from doing that > in the future. > > ***Seems we have consensus that users should be able to use built-in > functions of Hive or other external systems in SQL explicitly and > deterministically regardless of Flink built-in functions and the potential > modular built-in functions, via some new syntax like "mycat::func"? If so, > I'd like to propose removing Hive built-in functions from ambiguous > function resolution order, and empower users with such a syntax. This way > we sacrifice a little convenience for certainty*** > > > What do you think? > > On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > wrote: > > > Hi, > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they > > are > > very inconsistent in that manner (spark being way worse on that). > > Hive: > > You cannot overwrite all the built-in functions. I could overwrite most > > of > > the functions I tried e.g. length, e, pi, round, rtrim, but there are > functions I cannot overwrite e.g. CAST, ARRAY I get: > > > * ParseException line 1:29 cannot recognize input near 'array' 'AS' * > > What is interesting is that I cannot ovewrite *array*, but I can ovewrite > *map* or *struct*. Though hive behaves reasonable well if I manage to > overwrite a function. When I drop the temporary function the native > function is still available. > > Spark: > > Spark's behavior imho is super bad. > > Theoretically I could overwrite all functions. I was able e.g. to > overwrite CAST function. I had to use though CREATE OR REPLACE TEMPORARY > FUNCTION syntax. Otherwise I get an exception that a function already > exists. However when I used the CAST function in a query it used the > native, built-in one. > > When I overwrote current_date() function, it was used in a query, but it > completely replaces the built-in function and I can no longer use the > native function in any way. I cannot also drop the temporary function. I > get: > > * Error in query: Cannot drop native function 'current_date';* > > Additional note, both systems do not allow creating TEMPORARY FUNCTIONS > with a database. Temporary functions are always represented as a single > name. > > In my opinion neither of the systems have consistent behavior. Generally > speaking I think overwriting any system provided functions is just > dangerous. > > Regarding Jark's concerns. Such functions would be registered in a > > current > > catalog/database schema, so a user could still use its own function, but > would have to fully qualify the function (because built-in functions take > precedence). Moreover users would have the same problem with permanent > functions. Imagine a user have a permanent function 'cat.db.explode'. In > 1.9 the user could use just the 'explode' function as long as the 'cat' & > 'db' were the default catalog & database. If we introduce 'explode' > built-in function in 1.10, the user has to fully qualify the function. > > Best, > > Dawid > On 04/09/2019 15:19, Timo Walther wrote: > > Hi all, > > thanks for the healthy discussion. It is already a very long discussion > with a lot of text. So I will just post my opinion to a couple of > statements: > > > Hive built-in functions are not part of Flink built-in functions, they > > are catalog functions > > That is not entirely true. Correct me if I'm wrong but I think Hive > built-in functions are also not catalog functions. They are not stored in > every Hive metastore catalog that is freshly created but are a set of > functions that are listed somewhere and made available. > > > ambiguous functions reference just shouldn't be resolved to a different > > catalog > > I agree. They should not be resolved to a different catalog. That's why I > am suggesting to split the concept of built-in functions and catalog > > lookup > > semantics. > > > I don't know if any other databases handle built-in functions like that > > What I called "module" is: > - Extension in Postgres [1] > - Plugin in Presto [2] > > Btw. Presto even mentions example modules that are similar to the ones > that we will introduce in the near future both for ML and System XYZ > compatibility: > "See either the presto-ml module for machine learning functions or the > presto-teradata-functions module for Teradata-compatible functions, both > > in > > the root of the Presto source." > > > functions should be either built-in already or just libraries > > functions, > > and library functions can be adapted to catalog APIs or of some other > syntax to use > > Regarding "built-in already", of course we can add a lot of functions as > built-ins but we will end-up in a dependency hell in the near future if > > we > > don't introduce a pluggable approach. Library functions is what you also > suggest but storing them in a catalog means to always fully qualify them > > or > > modifying the existing catalog design that was inspired by the standard. > > I don't think "it brings in even more complicated scenarios to the > design", it just does clear separation of concerns. Integrating the > functionality into the current design makes the catalog API more > complicated. > > > why would users name a temporary function the same as a built-in > > function then? > > Because you never know what users do. If they don't, my suggested > resolution order should not be a problem, right? > > > I don't think hive functions deserves be a function module > > Our goal is not to create a Hive clone. We need to think forward and Hive > is just one of many systems that we can support. Not every built-in > function behaves and will behave exactly like Hive. > > > regarding temporary functions, there are few systems that support it > > IMHO Spark and Hive are not always the best examples for consistent > design. Systems like Postgres, Presto, or SQL Server should be used as a > reference. I don't think that a user can overwrite a built-in function > there. > > Regards, > Timo > > [1] https://www.postgresql.org/docs/10/extend-extensions.html > [2] https://prestodb.github.io/docs/current/develop/functions.html > > > On 04.09.19 13:44, Jark Wu wrote: > > Hi all, > > Regarding #1 temp function <> built-in function and naming. > I'm fine with temp functions should precede built-in function and can > override built-in functions (we already support to override built-in > function in 1.9). > If we don't allow the same name as a built-in function, I'm afraid we > > will > > have compatibility issues in the future. > Say users register a user defined function named "explode" in 1.9, and we > support a built-in "explode" function in 1.10. > Then the user's jobs which call the registered "explode" function in 1.9 > will all fail in 1.10 because of naming conflict. > > Regarding #2 "External" built-in functions. > I think if we store external built-in functions in catalog, then > "hive1::sqrt" is a good way to go. > However, I would prefer to support a discovery mechanism (e.g. SPI) for > built-in functions as Timo suggested above. > This gives us the flexibility to add Hive or MySQL or Geo or whatever > function set as built-in functions in an easy way. > > Best, > Jark > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > Hi David, > > Thank you for sharing your findings. It seems to me that there is no SQL > standard regarding temporary functions. There are few systems that > > support > > it. Here are what I have found: > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > 2. Spark: basically follows Hive ( > > > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > > ) > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting > behavior. (http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html > > ) > > Because of lack of standard, it's perfectly fine for Flink to define > whatever it sees appropriate. Thus, your proposal (no overwriting and > > must > > have DB as holder) is one option. The advantage is simplicity, The > downside > is the deviation from Hive, which is popular and de facto standard in big > data world. > > However, I don't think we have to follow Hive. More importantly, we need > > a > > consensus. I have no objection if your proposal is generally agreed upon. > > Thanks, > Xuefu > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email]<[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > wrote: > > Hi all, > > Just an opinion on the built-in <> temporary functions resolution and > NAMING issue. I think we should not allow overriding the built-in > functions, as this may pose serious issues and to be honest is rather > not feasible and would require major rework. What happens if a user > wants to override CAST? Calls to that function are generated at > different layers of the stack that unfortunately does not always go > through the Catalog API (at least yet). Moreover from what I've checked > no other systems allow overriding the built-in functions. All the > systems I've checked so far register temporary functions in a > database/schema (either special database for temporary functions, or > just current database). What I would suggest is to always register > temporary functions with a 3 part identifier. The same way as tables, > views etc. This effectively means you cannot override built-in > functions. With such approach it is natural that the temporary functions > end up a step lower in the resolution order: > > 1. built-in functions (1 part, maybe 2? - this is still under discussion) > > 2. temporary functions (always 3 part path) > > 3. catalog functions (always 3 part path) > > Let me know what do you think. > > Best, > > Dawid > > On 04/09/2019 06:13, Bowen Li wrote: > > Hi, > > I agree with Xuefu that the main controversial points are mainly the > > two > > places. My thoughts on them: > > 1) Determinism of referencing Hive built-in functions. We can either > > remove > > Hive built-in functions from ambiguous function resolution and require > users to use special syntax for their qualified names, or add a config > > flag > > to catalog constructor/yaml for turning on and off Hive built-in > > functions > > with the flag set to 'false' by default and proper doc added to help > > users > > make their decisions. > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > function > > resolution order. We believe Flink temp functions should precede Flink > built-in functions, and I have presented my reasons. Just in case if we > cannot reach an agreement, I propose forbid users registering temp > functions in the same name as a built-in function, like MySQL's > > approach, > > for the moment. It won't have any performance concern, since built-in > functions are all in memory and thus cost of a name check will be > > really > > trivial. > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > From what I have seen, there are a couple of focal disagreements: > > 1. Resolution order: temp function --> flink built-in function --> > > catalog > > function vs flink built-in function --> temp function -> catalog > > function. > > 2. "External" built-in functions: how to treat built-in functions in > external system and how users reference them > > For #1, I agree with Bowen that temp function needs to be at the > > highest > > priority because that's how a user might overwrite a built-in function > without referencing a persistent, overwriting catalog function with a > > fully > > qualified name. Putting built-in functions at the highest priority > eliminates that usage. > > For #2, I saw a general agreement on referencing "external" built-in > functions such as those in Hive needs to be explicit and deterministic > > even > > though different approaches are proposed. To limit the scope and > > simply > > the > > usage, it seems making sense to me to introduce special syntax for > > user to > > explicitly reference an external built-in function such as hive1::sqrt > > or > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API > > call > > hive1.getFunction(ObjectPath functionName) where the database name is > absent for bulit-in functions available in that catalog hive1. I > > understand > > that Bowen's original proposal was trying to avoid this, but this > > could > > turn out to be a clean and simple solution. > > (Timo's modular approach is great way to "expand" Flink's built-in > > function > > set, which seems orthogonal and complementary to this, which could be > tackled in further future work.) > > I'd be happy to hear further thoughts on the two points. > > Thanks, > Xuefu > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > the > > same > as Bowen's. But after thinking about it, I'm currently lean to Timo's > suggestion. > > The reason is backward compatibility. If we follow Bowen's approach, > > let's > > say we > first find function in Flink's built-in functions, and then hive's > built-in. For example, `foo` > is not supported by Flink, but hive has such built-in function. So > > user > > will have hive's > behavior for function `foo`. And in next release, Flink realize this > > is a > > very popular function > and add it into Flink's built-in functions, but with different > > behavior > > as > > hive's. So in next > release, the behavior changes. > > With Timo's approach, IIUC user have to tell the framework explicitly > > what > > kind of > built-in functions he would like to use. He can just tell framework > > to > > abandon Flink's built-in > functions, and use hive's instead. User can only choose between them, > > but > > not use > them at the same time. I think this approach is more predictable. > > Best, > Kurt > > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > Hi all, > > Thanks for the feedback. Just a kindly reminder that the [Proposal] > > section > > in the google doc was updated, please take a look first and let me > > know > > if > > you have more questions. > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > > wrote: > > Hi Timo, > > Re> 1) We should not have the restriction "hive built-in functions > > can > > only > > be used when current catalog is hive catalog". Switching a catalog > should only have implications on the cat.db.object resolution but > > not > > functions. It would be quite convinient for users to use Hive > > built-ins > > even if they use a Confluent schema registry or just the in-memory > > catalog. > > There might be a misunderstanding here. > > First of all, Hive built-in functions are not part of Flink > > built-in > > functions, they are catalog functions, thus if the current catalog > > is > > not a > > HiveCatalog but, say, a schema registry catalog, ambiguous > > functions > > reference just shouldn't be resolved to a different catalog. > > Second, Hive built-in functions can potentially be referenced > > across > > catalog, but it doesn't have db namespace and we currently just > > don't > > have > > a SQL syntax for it. It can be enabled when such a SQL syntax is > > defined, > > e.g. "catalog::function", but it's out of scope of this FLIP. > > 2) I would propose to have separate concepts for catalog and > > built-in > > functions. In particular it would be nice to modularize built-in > functions. Some built-in functions are very crucial (like AS, CAST, > MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > maybe > > we add more experimental functions in the future or function for > > some > > special application area (Geo functions, ML functions). A data > > platform > > team might not want to make every built-in function available. Or a > function module like ML functions is in a different Maven module. > > I think this is orthogonal to this FLIP, especially we don't have > > the > > "external built-in functions" anymore and currently the built-in > > function > > category remains untouched. > > But just to share some thoughts on the proposal, I'm not sure about > > it: > > - I don't know if any other databases handle built-in functions > > like > > that. > > Maybe you can give some examples? IMHO, built-in functions are > > system > > info > > and should be deterministic, not depending on loaded libraries. Geo > functions should be either built-in already or just libraries > > functions, > > and library functions can be adapted to catalog APIs or of some > > other > > syntax to use > - I don't know if all use cases stand, and many can be achieved by > > other > > approaches too. E.g. experimental functions can be taken good care > > of > > by > > documentations, annotations, etc > - the proposal basically introduces some concept like a pluggable > > built-in > > function catalog, despite the already existing catalog APIs > - it brings in even more complicated scenarios to the design. E.g. > > how > > do > > you handle built-in functions in different modules but different > > names? > > In short, I'm not sure if it really stands and it looks like an > > overkill > > to me. I'd rather not go to that route. Related discussion can be > > on > > its > > own thread. > > 3) Following the suggestion above, we can have a separate discovery > mechanism for built-in functions. Instead of just going through a > > static > > list like in BuiltInFunctionDefinitions, a platform team should be > > able > > to select function modules like > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > HiveFunctions) or via service discovery; > > Same as above. I'll leave it to its own thread. > > re > 3) Dawid and I discussed the resulution order again. I agree > > with > > Kurt > > that we should unify built-in function (external or internal) > > under a > > common layer. However, the resolution order should be: > 1. built-in functions > 2. temporary functions > 3. regular catalog resolution logic > Otherwise a temporary function could cause clashes with Flink's > > built-in > > functions. If you take a look at other vendors, like SQL Server > > they > > also do not allow to overwrite built-in functions. > > ”I agree with Kurt that we should unify built-in function (external > > or > > internal) under a common layer.“ <- I don't think this is what Kurt > > means. > > Kurt and I are in favor of unifying built-in functions of external > > systems > > and catalog functions. Did you type a mistake? > > Besides, I'm not sure about the resolution order you proposed. > > Temporary > > functions have a lifespan over a session and are only visible to > > the > > session owner, they are unique to each user, and users create them > > on > > purpose to be the highest priority in order to overwrite system > > info > > (built-in functions in this case). > > In your case, why would users name a temporary function the same > > as a > > built-in function then? Since using that name in ambiguous function > reference will always be resolved to built-in functions, creating a > same-named temp function would be meaningless in the end. > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > > wrote: > > Hi Jingsong, > > Re> 1.Hive built-in functions is an intermediate solution. So we > > should > > not introduce interfaces to influence the framework. To make > Flink itself more powerful, we should implement the functions > we need to add. > > Yes, please see the doc. > > Re> 2.Non-flink built-in functions are easy for users to change > > their > > behavior. If we support some flink built-in functions in the > future but act differently from non-flink built-in, this will > > lead > > to > > changes in user behavior. > > There's no such concept as "external built-in functions" any more. > Built-in functions of external systems will be treated as special > > catalog > > functions. > > Re> Another question is, does this fallback include all > > hive built-in functions? As far as I know, some hive functions > have some hacky. If possible, can we start with a white list? > Once we implement some functions to flink built-in, we can > also update the whitelist. > > Yes, that's something we thought of too. I don't think it's super > critical to the scope of this FLIP, thus I'd like to leave it to > > future > > efforts as a nice-to-have feature. > > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > > wrote: > > Hi Kurt, > > Re: > What I want to propose is we can merge #3 and #4, make them > > both > > under > > "catalog" concept, by extending catalog function to make it have > > ability to > > have built-in catalog functions. Some benefits I can see from > > this > > approach: > > 1. We don't have to introduce new concept like external built-in > > functions. > > Actually I don't see a full story about how to treat a built-in > > functions, and it > > seems a little bit disrupt with catalog. As a result, you have > > to > > make > > some restriction > > like "hive built-in functions can only be used when current > > catalog > > is > > hive catalog". > > Yes, I've unified #3 and #4 but it seems I didn't update some > > part > > of > > the doc. I've modified those sections, and they are up to date > > now. > > In short, now built-in function of external systems are defined > > as > > a > > special kind of catalog function in Flink, and handled by Flink > > as > > following: > - An external built-in function must be associated with a catalog > > for > > the purpose of decoupling flink-table and external systems. > - It always resides in front of catalog functions in ambiguous > > function > > reference order, just like in its own external system > - It is a special catalog function that doesn’t have a > > schema/database > > namespace > - It goes thru the same instantiation logic as other user defined > catalog functions in the external system > > Please take another look at the doc, and let me know if you have > > more > > questions. > > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > > wrote: > > Hi Kurt, > > it should not affect the functions and operations we currently > > have > > in > > SQL. It just categorizes the available built-in functions. It is > > kind > > of > an orthogonal concept to the catalog API but built-in functions > > deserve > > this special kind of treatment. CatalogFunction still fits > > perfectly > > in > > there because the regular catalog object resolution logic is not > affected. So tables and functions are resolved in the same way > > but > > with > > built-in functions that have priority as in the original design. > > Regards, > Timo > > > On 03.09.19 15:26, Kurt Young wrote: > > Does this only affect the functions and operations we currently > > have > > in SQL > > and > have no effect on tables, right? Looks like this is an > > orthogonal > > concept > > with Catalog? > If the answer are both yes, then the catalog function will be a > > weird > > concept? > > Best, > Kurt > > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email] > > wrote: > > The way you proposed are basically the same as what Calcite > > does, I > > think > > we are in the same line. > > Best, > Danny Chan > 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > > ,写道: > > This sounds exactly as the module approach I mentioned, no? > > Regards, > Timo > > On 03.09.19 13:42, Danny Chan wrote: > > Thanks Bowen for bring up this topic, I think it’s a useful > > refactoring to make our function usage more user friendly. > > For the topic of how to organize the builtin operators and > > operators > > of Hive, here is a solution from Apache Calcite, the Calcite > > way > > is > > to make > > every dialect operators a “Library”, user can specify which > > libraries they > > want to use for a sql query. The builtin operators always > > comes > > as > > the > > first class objects and the others are used from the order > > they > > appears. > > Maybe you can take a reference. > > [1] > > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > Best, > Danny Chan > 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > > ,写道: > > Hi folks, > > I'd like to kick off a discussion on reworking Flink's > > FunctionCatalog. > > It's critically helpful to improve function usability in > > SQL. > > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > In short, it: > - adds support for precise function reference with > > fully/partially > > qualified name > - redefines function resolution order for ambiguous > > function > > reference > > - adds support for Hive's rich built-in functions (support > > for > > Hive > > user > > defined functions was already added in 1.9.0) > - clarifies the concept of temporary functions > > Would love to hear your thoughts. > > Bowen > > -- > Xuefu Zhang > > "In Honey We Trust!" > > > -- > Xuefu Zhang > > "In Honey We Trust!" > > > > > > > -- Xuefu Zhang "In Honey We Trust!" |
I agree the consequences of the decision are substantial. Let's see what others think. -- Catalog functions are defined by users, and we suppose they can drop/alter it in any way they want. Thus, overwriting a catalog function doesn't seem to be a strong use case that we should be concerned about. Rather, there are known use case for overwriting built-in functions. Not always users are in full control of the catalog functions. There is also the case where different teams manage the catalog & use the catalog. As for overriding built-in functions with 3-part approach user can always use an equally named function from a catalog. E.g. to override SELECT explode(arr) FROM ... user can always write: SELECT db.explode(arr) FROM ... Best, Dawid On 06/09/2019 10:54, Xuefu Z wrote:
Hi Dawid, Thank you for your summary. While the only difference in the two proposals is one- or three-part in naming, the consequence would be substantial. To me, there are two major use cases of temporary functions compared to persistent ones: 1. Temporary in nature and auto managed by the session. More often than not, admin doesn't even allow user to create persistent functions. 2. Provide an opportunity to overwriting system built-in functions. Since built-in functions has one-part name, requiring three-part name for temporary functions eliminates the overwriting opportunity. One-part naming essentially puts all temp functions under a single namespace and simplifies function resolution, such as we don't need to consider the case of a temp function and a persistent function with the same name under the same database. I agree having three-parts does have its merits, such as consistency with other temporary objects (table) and minor difference between temp vs catalog functions. However, there is a slight difference between tables and function in that there is no built-in table in SQL so there is no need to overwrite it. I'm not sure if I fully agree the benefits you listed as the advantages of the three-part naming of temp functions. -- Allowing overwriting built-in functions is a benefit and the solution for disallowing certain overwriting shouldn't be totally banning it. -- Catalog functions are defined by users, and we suppose they can drop/alter it in any way they want. Thus, overwriting a catalog function doesn't seem to be a strong use case that we should be concerned about. Rather, there are known use case for overwriting built-in functions. Thus, personally I would prefer one-part name for temporary functions. In lack of SQL standard on this, I certainly like to get opinions from others to see if a consensus can be eventually reached. (To your point on modular approach to support external built-in functions, we saw the value and are actively looking into it. Thanks for sharing your opinion on that.) Thanks, Xuefu On Fri, Sep 6, 2019 at 3:48 PM Dawid Wysakowicz [hidden email] wrote:Hi Xuefu, Thank you for your answers. Let me summarize my understanding. In principle we differ only in regards to the fact if a temporary function can be only 1-part or only 3-part identified. I can reconfirm that if the community decides it prefers the 1-part approach I will commit to that, with the assumption that we will force ONLY 1-part function names. (We will parse identifier and throw exception if a user tries to register e.g. db.temp_func). My preference is though the 3-part approach: - there are some functions that it makes no sense to override, e.g. CAST, moreover I'm afraid that allowing overriding such will lead to high inconsistency, similar to those that I mentioned spark has - you cannot shadow a fully-qualified function. (If a user fully qualifies his/her objects in a SQL query, which is often considered a good practice) - it does not differentiate between functions & temporary functions. Temporary functions just differ with regards to their life-cycle. The registration & usage is exactly the same. As it can be seen, the proposed concept regarding temp function and function resolution is quite simple. Both approaches are equally simple. I would even say the 3-part approach is slightly simpler as it does not have to care about some special built-in functions such as CAST. I don't want to express my opinion on the differentiation between built-in functions and "external" built-in functions in this thread as it is rather orthogonal, but I also like the modular approach and I definitely don't like the special syntax "cat::function". I think it's better to stick to a standard or at least other proved solutions from other systems. Best, Dawid On 05/09/2019 10:12, Xuefu Z wrote: Hi David, Thanks for sharing your thoughts and request for clarifications. I believe that I fully understood your proposal, which does has its merit. However, it's different from ours. Here are the answers to your questions: Re #1: yes, the temp functions in the proposal are global and have just one-part names, similar to built-in functions. Two- or three-part names are not allowed. Re #2: not applicable as two- or three-part names are disallowed. Re #3: same as above. Referencing external built-in functions is achieved either implicitly (only the built-in functions in the current catalogs are considered) or via special syntax such as cat::function. However, we are looking into the modular approach that Time suggested with other feedback received from the community. Re #4: the resolution order goes like the following in our proposal: 1. temporary functions 2. bulit-in functions (including those augmented by add-on modules) 3. built-in functions in current catalog (this will not be needed if the special syntax "cat::function" is required) 4. functions in current catalog and db. If we go with the modular approach and make external built-in functions as an add-on module, the 2 and 3 above will be combined. In essence, the resolution order is equivalent in the two approaches. By the way, resolution order matters only for simple name reference. For names such as db.function (interpreted as current_cat/db/function) or cat.db.function, the reference is unambiguous, so on resolution is needed. As it can be seen, the proposed concept regarding temp function and function resolution is quite simple. Additionally, the proposed resolution order allows temp function to shadow a built-in function, which is important (though not decisive) in our opinion. I started liking the modular approach as the resolution order will only include 1, 2, and 4, which is simpler and more generic. That's why I suggested we look more into this direction. Please let me know if there are further questions. Thanks, Xuefu On Thu, Sep 5, 2019 at 2:42 PM Dawid Wysakowicz [hidden email] [hidden email] wrote: Hi Xuefu, Just wanted to summarize my opinion on the one topic (temporary functions). My preference would be to make temporary functions always 3-part qualified (as a result that would prohibit overriding built-in functions). Having said that if the community decides that it's better to allow overriding built-in functions I am fine with it and can commit to that decision. I wanted to ask if you could clarify a few points for me around that option. 1. Would you enforce temporary functions to be always just a single name (without db & cat) as hive does, or would you allow also 3 or even 2 part identifiers? 2. Assuming 2/3-part paths. How would you register a function from a following statement: CREATE TEMPORARY FUNCTION db.func? Would that shadow all functions named 'func' in all databases named 'db' in all catalogs? Or would you shadow only function 'func' in database 'db' in current catalog? 3. This point is still under discussion, but was mentioned a few times, that maybe we want to enable syntax cat.func for "external built-in functions". How would that affect statement from previous point? Would 'db.func' shadow "external built-in function" in 'db' catalog or user functions as in point 2? Or maybe both? 4. Lastly in fact to summarize the previous points. Assuming 2/3-part paths. Would the function resolution be actually as follows?: 1. temporary functions (1-part path) 2. built-in functions 3. temporary functions (2-part path) 4. 2-part catalog functions a.k.a. "external built-in functions" (cat + func) - this is still under discussion, if we want that in the other focal point 5. temporary functions (3-part path) 6. 3-part catalog functions a.k.a. user functions I would be really grateful if you could explain me those questions, thanks. BTW, Thank you all for a healthy discussion. Best, Dawid On 04/09/2019 23:25, Xuefu Z wrote: Thank all for the sharing thoughts. I think we have gathered some useful initial feedback from this long discussion with a couple of focal points sticking out. We will go back to do more research and adapt our proposal. Once it's ready, we will ask for a new round of review. If there is any disagreement, we will start a new discussion thread on each rather than having a mega discussion like this. Thanks to everyone for participating. Regards, Xuefu On Thu, Sep 5, 2019 at 2:52 AM Bowen Li [hidden email] [hidden email] [hidden email] [hidden email] wrote: Let me try to summarize and conclude the long thread so far: 1. For order of temp function v.s. built-in function: I think Dawid's point that temp function should be of fully qualified path is a better reasoning to back the newly proposed order, and i agree we don't need to follow Hive/Spark. However, I'd rather not change fundamentals of temporary functions in this FLIP. It belongs to a bigger story of how temporary objects should be redefined and be handled uniformly - currently temporary tables and views (those registered from TableEnv#registerTable()) behave different than what Dawid propose for temp functions, and we need a FLIP to just unify their APIs and behaviors. I agree that backward compatibility is not an issue w.r.t Jark's points. ***Seems we do have consensus that it's acceptable to prevent users registering a temp function in the same name as a built-in function. To help us move forward, I'd like to propose setting such a restraint on temp functions in this FLIP to simplify the design and avoid disputes.*** It will also leave rooms for improvements in the future. 2. For Hive built-in function: Thanks Timo for providing the Presto and Postgres examples. I feel modular built-in functions can be a good fit for the geo and ml example as a native Flink extension, but not sure if it fits well with external integrations. Anyway, I think modular built-in functions is a bigger story and can be on its own thread too, and our proposal doesn't prevent Flink from doing that in the future. ***Seems we have consensus that users should be able to use built-in functions of Hive or other external systems in SQL explicitly and deterministically regardless of Flink built-in functions and the potential modular built-in functions, via some new syntax like "mycat::func"? If so, I'd like to propose removing Hive built-in functions from ambiguous function resolution order, and empower users with such a syntax. This way we sacrifice a little convenience for certainty*** What do you think? On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz [hidden email] [hidden email] [hidden email] [hidden email] wrote: Hi, Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they are very inconsistent in that manner (spark being way worse on that). Hive: You cannot overwrite all the built-in functions. I could overwrite most of the functions I tried e.g. length, e, pi, round, rtrim, but there are functions I cannot overwrite e.g. CAST, ARRAY I get: * ParseException line 1:29 cannot recognize input near 'array' 'AS' * What is interesting is that I cannot ovewrite *array*, but I can ovewrite *map* or *struct*. Though hive behaves reasonable well if I manage to overwrite a function. When I drop the temporary function the native function is still available. Spark: Spark's behavior imho is super bad. Theoretically I could overwrite all functions. I was able e.g. to overwrite CAST function. I had to use though CREATE OR REPLACE TEMPORARY FUNCTION syntax. Otherwise I get an exception that a function already exists. However when I used the CAST function in a query it used the native, built-in one. When I overwrote current_date() function, it was used in a query, but it completely replaces the built-in function and I can no longer use the native function in any way. I cannot also drop the temporary function. I get: * Error in query: Cannot drop native function 'current_date';* Additional note, both systems do not allow creating TEMPORARY FUNCTIONS with a database. Temporary functions are always represented as a single name. In my opinion neither of the systems have consistent behavior. Generally speaking I think overwriting any system provided functions is just dangerous. Regarding Jark's concerns. Such functions would be registered in a current catalog/database schema, so a user could still use its own function, but would have to fully qualify the function (because built-in functions take precedence). Moreover users would have the same problem with permanent functions. Imagine a user have a permanent function 'cat.db.explode'. In 1.9 the user could use just the 'explode' function as long as the 'cat' & 'db' were the default catalog & database. If we introduce 'explode' built-in function in 1.10, the user has to fully qualify the function. Best, Dawid On 04/09/2019 15:19, Timo Walther wrote: Hi all, thanks for the healthy discussion. It is already a very long discussion with a lot of text. So I will just post my opinion to a couple of statements: Hive built-in functions are not part of Flink built-in functions, they are catalog functions That is not entirely true. Correct me if I'm wrong but I think Hive built-in functions are also not catalog functions. They are not stored in every Hive metastore catalog that is freshly created but are a set of functions that are listed somewhere and made available. ambiguous functions reference just shouldn't be resolved to a different catalog I agree. They should not be resolved to a different catalog. That's why I am suggesting to split the concept of built-in functions and catalog lookup semantics. I don't know if any other databases handle built-in functions like that What I called "module" is: - Extension in Postgres [1] - Plugin in Presto [2] Btw. Presto even mentions example modules that are similar to the ones that we will introduce in the near future both for ML and System XYZ compatibility: "See either the presto-ml module for machine learning functions or the presto-teradata-functions module for Teradata-compatible functions, both in the root of the Presto source." functions should be either built-in already or just libraries functions, and library functions can be adapted to catalog APIs or of some other syntax to use Regarding "built-in already", of course we can add a lot of functions as built-ins but we will end-up in a dependency hell in the near future if we don't introduce a pluggable approach. Library functions is what you also suggest but storing them in a catalog means to always fully qualify them or modifying the existing catalog design that was inspired by the standard. I don't think "it brings in even more complicated scenarios to the design", it just does clear separation of concerns. Integrating the functionality into the current design makes the catalog API more complicated. why would users name a temporary function the same as a built-in function then? Because you never know what users do. If they don't, my suggested resolution order should not be a problem, right? I don't think hive functions deserves be a function module Our goal is not to create a Hive clone. We need to think forward and Hive is just one of many systems that we can support. Not every built-in function behaves and will behave exactly like Hive. regarding temporary functions, there are few systems that support it IMHO Spark and Hive are not always the best examples for consistent design. Systems like Postgres, Presto, or SQL Server should be used as a reference. I don't think that a user can overwrite a built-in function there. Regards, Timo [1] https://www.postgresql.org/docs/10/extend-extensions.html [2] https://prestodb.github.io/docs/current/develop/functions.html On 04.09.19 13:44, Jark Wu wrote: Hi all, Regarding #1 temp function <> built-in function and naming. I'm fine with temp functions should precede built-in function and can override built-in functions (we already support to override built-in function in 1.9). If we don't allow the same name as a built-in function, I'm afraid we will have compatibility issues in the future. Say users register a user defined function named "explode" in 1.9, and we support a built-in "explode" function in 1.10. Then the user's jobs which call the registered "explode" function in 1.9 will all fail in 1.10 because of naming conflict. Regarding #2 "External" built-in functions. I think if we store external built-in functions in catalog, then "hive1::sqrt" is a good way to go. However, I would prefer to support a discovery mechanism (e.g. SPI) for built-in functions as Timo suggested above. This gives us the flexibility to add Hive or MySQL or Geo or whatever function set as built-in functions in an easy way. Best, Jark On Wed, 4 Sep 2019 at 17:47, Xuefu Z [hidden email] [hidden email] [hidden email] [hidden email][hidden email] [hidden email] [hidden email] [hidden email] wrote: Hi David, Thank you for sharing your findings. It seems to me that there is no SQL standard regarding temporary functions. There are few systems that support it. Here are what I have found: 1. Hive: no DB qualifier allowed. Can overwrite built-in. 2. Spark: basically follows Hive ( https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html ) 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting behavior. (http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html ) Because of lack of standard, it's perfectly fine for Flink to define whatever it sees appropriate. Thus, your proposal (no overwriting and must have DB as holder) is one option. The advantage is simplicity, The downside is the deviation from Hive, which is popular and de facto standard in big data world. However, I don't think we have to follow Hive. More importantly, we need a consensus. I have no objection if your proposal is generally agreed upon. Thanks, Xuefu On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email][hidden email] [hidden email] [hidden email] [hidden email] wrote: Hi all, Just an opinion on the built-in <> temporary functions resolution and NAMING issue. I think we should not allow overriding the built-in functions, as this may pose serious issues and to be honest is rather not feasible and would require major rework. What happens if a user wants to override CAST? Calls to that function are generated at different layers of the stack that unfortunately does not always go through the Catalog API (at least yet). Moreover from what I've checked no other systems allow overriding the built-in functions. All the systems I've checked so far register temporary functions in a database/schema (either special database for temporary functions, or just current database). What I would suggest is to always register temporary functions with a 3 part identifier. The same way as tables, views etc. This effectively means you cannot override built-in functions. With such approach it is natural that the temporary functions end up a step lower in the resolution order: 1. built-in functions (1 part, maybe 2? - this is still under discussion) 2. temporary functions (always 3 part path) 3. catalog functions (always 3 part path) Let me know what do you think. Best, Dawid On 04/09/2019 06:13, Bowen Li wrote: Hi, I agree with Xuefu that the main controversial points are mainly the two places. My thoughts on them: 1) Determinism of referencing Hive built-in functions. We can either remove Hive built-in functions from ambiguous function resolution and require users to use special syntax for their qualified names, or add a config flag to catalog constructor/yaml for turning on and off Hive built-in functions with the flag set to 'false' by default and proper doc added to help users make their decisions. 2) Flink temp functions v.s. Flink built-in functions in ambiguous function resolution order. We believe Flink temp functions should precede Flink built-in functions, and I have presented my reasons. Just in case if we cannot reach an agreement, I propose forbid users registering temp functions in the same name as a built-in function, like MySQL's approach, for the moment. It won't have any performance concern, since built-in functions are all in memory and thus cost of a name check will be really trivial. On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z [hidden email] [hidden email] [hidden email] [hidden email][hidden email] [hidden email] [hidden email] [hidden email] wrote: From what I have seen, there are a couple of focal disagreements: 1. Resolution order: temp function --> flink built-in function --> catalog function vs flink built-in function --> temp function -> catalog function. 2. "External" built-in functions: how to treat built-in functions in external system and how users reference them For #1, I agree with Bowen that temp function needs to be at the highest priority because that's how a user might overwrite a built-in function without referencing a persistent, overwriting catalog function with a fully qualified name. Putting built-in functions at the highest priority eliminates that usage. For #2, I saw a general agreement on referencing "external" built-in functions such as those in Hive needs to be explicit and deterministic even though different approaches are proposed. To limit the scope and simply the usage, it seems making sense to me to introduce special syntax for user to explicitly reference an external built-in function such as hive1::sqrt or hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API call hive1.getFunction(ObjectPath functionName) where the database name is absent for bulit-in functions available in that catalog hive1. I understand that Bowen's original proposal was trying to avoid this, but this could turn out to be a clean and simple solution. (Timo's modular approach is great way to "expand" Flink's built-in function set, which seems orthogonal and complementary to this, which could be tackled in further future work.) I'd be happy to hear further thoughts on the two points. Thanks, Xuefu On Tue, Sep 3, 2019 at 7:11 PM Kurt Young [hidden email] [hidden email] [hidden email] [hidden email][hidden email] [hidden email] [hidden email] [hidden email] wrote: Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is the same as Bowen's. But after thinking about it, I'm currently lean to Timo's suggestion. The reason is backward compatibility. If we follow Bowen's approach, let's say we first find function in Flink's built-in functions, and then hive's built-in. For example, `foo` is not supported by Flink, but hive has such built-in function. So user will have hive's behavior for function `foo`. And in next release, Flink realize this is a very popular function and add it into Flink's built-in functions, but with different behavior as hive's. So in next release, the behavior changes. With Timo's approach, IIUC user have to tell the framework explicitly what kind of built-in functions he would like to use. He can just tell framework to abandon Flink's built-in functions, and use hive's instead. User can only choose between them, but not use them at the same time. I think this approach is more predictable. Best, Kurt On Wed, Sep 4, 2019 at 8:00 AM Bowen Li [hidden email] [hidden email] [hidden email] [hidden email][hidden email] [hidden email] [hidden email] [hidden email] wrote: Hi all, Thanks for the feedback. Just a kindly reminder that the [Proposal] section in the google doc was updated, please take a look first and let me know if you have more questions. On Tue, Sep 3, 2019 at 4:57 PM Bowen Li [hidden email] [hidden email] [hidden email] [hidden email][hidden email] [hidden email] [hidden email] [hidden email] wrote: Hi Timo, Re> 1) We should not have the restriction "hive built-in functions can only be used when current catalog is hive catalog". Switching a catalog should only have implications on the cat.db.object resolution but not functions. It would be quite convinient for users to use Hive built-ins even if they use a Confluent schema registry or just the in-memory catalog. There might be a misunderstanding here. First of all, Hive built-in functions are not part of Flink built-in functions, they are catalog functions, thus if the current catalog is not a HiveCatalog but, say, a schema registry catalog, ambiguous functions reference just shouldn't be resolved to a different catalog. Second, Hive built-in functions can potentially be referenced across catalog, but it doesn't have db namespace and we currently just don't have a SQL syntax for it. It can be enabled when such a SQL syntax is defined, e.g. "catalog::function", but it's out of scope of this FLIP. 2) I would propose to have separate concepts for catalog and built-in functions. In particular it would be nice to modularize built-in functions. Some built-in functions are very crucial (like AS, CAST, MINUS), others are more optional but stable (MD5, CONCAT_WS), and maybe we add more experimental functions in the future or function for some special application area (Geo functions, ML functions). A data platform team might not want to make every built-in function available. Or a function module like ML functions is in a different Maven module. I think this is orthogonal to this FLIP, especially we don't have the "external built-in functions" anymore and currently the built-in function category remains untouched. But just to share some thoughts on the proposal, I'm not sure about it: - I don't know if any other databases handle built-in functions like that. Maybe you can give some examples? IMHO, built-in functions are system info and should be deterministic, not depending on loaded libraries. Geo functions should be either built-in already or just libraries functions, and library functions can be adapted to catalog APIs or of some other syntax to use - I don't know if all use cases stand, and many can be achieved by other approaches too. E.g. experimental functions can be taken good care of by documentations, annotations, etc - the proposal basically introduces some concept like a pluggable built-in function catalog, despite the already existing catalog APIs - it brings in even more complicated scenarios to the design. E.g. how do you handle built-in functions in different modules but different names? In short, I'm not sure if it really stands and it looks like an overkill to me. I'd rather not go to that route. Related discussion can be on its own thread. 3) Following the suggestion above, we can have a separate discovery mechanism for built-in functions. Instead of just going through a static list like in BuiltInFunctionDefinitions, a platform team should be able to select function modules like catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, HiveFunctions) or via service discovery; Same as above. I'll leave it to its own thread. re > 3) Dawid and I discussed the resulution order again. I agree with Kurt that we should unify built-in function (external or internal) under a common layer. However, the resolution order should be: 1. built-in functions 2. temporary functions 3. regular catalog resolution logic Otherwise a temporary function could cause clashes with Flink's built-in functions. If you take a look at other vendors, like SQL Server they also do not allow to overwrite built-in functions. ”I agree with Kurt that we should unify built-in function (external or internal) under a common layer.“ <- I don't think this is what Kurt means. Kurt and I are in favor of unifying built-in functions of external systems and catalog functions. Did you type a mistake? Besides, I'm not sure about the resolution order you proposed. Temporary functions have a lifespan over a session and are only visible to the session owner, they are unique to each user, and users create them on purpose to be the highest priority in order to overwrite system info (built-in functions in this case). In your case, why would users name a temporary function the same as a built-in function then? Since using that name in ambiguous function reference will always be resolved to built-in functions, creating a same-named temp function would be meaningless in the end. On Tue, Sep 3, 2019 at 1:44 PM Bowen Li [hidden email] [hidden email] [hidden email] [hidden email][hidden email] [hidden email] [hidden email] [hidden email] wrote: Hi Jingsong, Re> 1.Hive built-in functions is an intermediate solution. So we should not introduce interfaces to influence the framework. To make Flink itself more powerful, we should implement the functions we need to add. Yes, please see the doc. Re> 2.Non-flink built-in functions are easy for users to change their behavior. If we support some flink built-in functions in the future but act differently from non-flink built-in, this will lead to changes in user behavior. There's no such concept as "external built-in functions" any more. Built-in functions of external systems will be treated as special catalog functions. Re> Another question is, does this fallback include all hive built-in functions? As far as I know, some hive functions have some hacky. If possible, can we start with a white list? Once we implement some functions to flink built-in, we can also update the whitelist. Yes, that's something we thought of too. I don't think it's super critical to the scope of this FLIP, thus I'd like to leave it to future efforts as a nice-to-have feature. On Tue, Sep 3, 2019 at 1:37 PM Bowen Li [hidden email] [hidden email] [hidden email] [hidden email][hidden email] [hidden email] [hidden email] [hidden email] wrote: Hi Kurt, Re: > What I want to propose is we can merge #3 and #4, make them both under "catalog" concept, by extending catalog function to make it have ability to have built-in catalog functions. Some benefits I can see from this approach: 1. We don't have to introduce new concept like external built-in functions. Actually I don't see a full story about how to treat a built-in functions, and it seems a little bit disrupt with catalog. As a result, you have to make some restriction like "hive built-in functions can only be used when current catalog is hive catalog". Yes, I've unified #3 and #4 but it seems I didn't update some part of the doc. I've modified those sections, and they are up to date now. In short, now built-in function of external systems are defined as a special kind of catalog function in Flink, and handled by Flink as following: - An external built-in function must be associated with a catalog for the purpose of decoupling flink-table and external systems. - It always resides in front of catalog functions in ambiguous function reference order, just like in its own external system - It is a special catalog function that doesn’t have a schema/database namespace - It goes thru the same instantiation logic as other user defined catalog functions in the external system Please take another look at the doc, and let me know if you have more questions. On Tue, Sep 3, 2019 at 7:28 AM Timo Walther [hidden email] [hidden email] [hidden email] [hidden email][hidden email] [hidden email] [hidden email] [hidden email] wrote: Hi Kurt, it should not affect the functions and operations we currently have in SQL. It just categorizes the available built-in functions. It is kind of an orthogonal concept to the catalog API but built-in functions deserve this special kind of treatment. CatalogFunction still fits perfectly in there because the regular catalog object resolution logic is not affected. So tables and functions are resolved in the same way but with built-in functions that have priority as in the original design. Regards, Timo On 03.09.19 15:26, Kurt Young wrote: Does this only affect the functions and operations we currently have in SQL and have no effect on tables, right? Looks like this is an orthogonal concept with Catalog? If the answer are both yes, then the catalog function will be a weird concept? Best, Kurt On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email] wrote: The way you proposed are basically the same as what Calcite does, I think we are in the same line. Best, Danny Chan 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] ,写道: This sounds exactly as the module approach I mentioned, no? Regards, Timo On 03.09.19 13:42, Danny Chan wrote: Thanks Bowen for bring up this topic, I think it’s a useful refactoring to make our function usage more user friendly. For the topic of how to organize the builtin operators and operators of Hive, here is a solution from Apache Calcite, the Calcite way is to make every dialect operators a “Library”, user can specify which libraries they want to use for a sql query. The builtin operators always comes as the first class objects and the others are used from the order they appears. Maybe you can take a reference. [1] https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 Best, Danny Chan 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] ,写道: Hi folks, I'd like to kick off a discussion on reworking Flink's FunctionCatalog. It's critically helpful to improve function usability in SQL. https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing In short, it: - adds support for precise function reference with fully/partially qualified name - redefines function resolution order for ambiguous function reference - adds support for Hive's rich built-in functions (support for Hive user defined functions was already added in 1.9.0) - clarifies the concept of temporary functions Would love to hear your thoughts. Bowen -- Xuefu Zhang "In Honey We Trust!" -- Xuefu Zhang "In Honey We Trust!" signature.asc (849 bytes) Download Attachment |
Hi,
W.r.t temp functions, I feel both options have their benefits and can theoretically achieve similar functionalities one way or another. In the end, it's more about use cases, users habits, and trade-offs. Re> Not always users are in full control of the catalog functions. There is also the case where different teams manage the catalog & use the catalog. Temp functions live within a session, and not within a catalog. Having 3-part paths may implies temp functions are tied to a catalog in two aspects. 1) it may indicate each catalog manages their temp functions, which is not true as we seem all agree they should reside at a central place, either in FunctionCatalog or CatalogManager 2) it may indicate there's some access control. When users are forbidden to manipulate some objects in the catalog that's managed by other teams, but are allowed to manipulate some other objects (temp functions in this case) belonging to the catalog in namespaces, users may think we introduced extra complexity and confusion with some kind of access control into the problem. It doesn't feel intuitive enough for end users. Thus, I'd be in favor of 1-part path for temporary functions, and other temp objects. Thanks, Bowen On Fri, Sep 6, 2019 at 2:16 AM Dawid Wysakowicz <[hidden email]> wrote: > I agree the consequences of the decision are substantial. Let's see what > others think. > > -- Catalog functions are defined by users, and we suppose they can > drop/alter it in any way they want. Thus, overwriting a catalog function > doesn't seem to be a strong use case that we should be concerned about. > Rather, there are known use case for overwriting built-in functions. > > Not always users are in full control of the catalog functions. There is > also the case where different teams manage the catalog & use the catalog. > As for overriding built-in functions with 3-part approach user can always > use an equally named function from a catalog. E.g. to override > > * SELECT explode(arr) FROM ...* > > user can always write: > > * SELECT db.explode(arr) FROM ...* > > Best, > > Dawid > On 06/09/2019 10:54, Xuefu Z wrote: > > Hi Dawid, > > Thank you for your summary. While the only difference in the two proposals > is one- or three-part in naming, the consequence would be substantial. > > To me, there are two major use cases of temporary functions compared to > persistent ones: > 1. Temporary in nature and auto managed by the session. More often than > not, admin doesn't even allow user to create persistent functions. > 2. Provide an opportunity to overwriting system built-in functions. > > Since built-in functions has one-part name, requiring three-part name for > temporary functions eliminates the overwriting opportunity. > > One-part naming essentially puts all temp functions under a single > namespace and simplifies function resolution, such as we don't need to > consider the case of a temp function and a persistent function with the > same name under the same database. > > I agree having three-parts does have its merits, such as consistency with > other temporary objects (table) and minor difference between temp vs > catalog functions. However, there is a slight difference between tables and > function in that there is no built-in table in SQL so there is no need to > overwrite it. > > I'm not sure if I fully agree the benefits you listed as the advantages of > the three-part naming of temp functions. > -- Allowing overwriting built-in functions is a benefit and the solution > for disallowing certain overwriting shouldn't be totally banning it. > -- Catalog functions are defined by users, and we suppose they can > drop/alter it in any way they want. Thus, overwriting a catalog function > doesn't seem to be a strong use case that we should be concerned about. > Rather, there are known use case for overwriting built-in functions. > > Thus, personally I would prefer one-part name for temporary functions. In > lack of SQL standard on this, I certainly like to get opinions from others > to see if a consensus can be eventually reached. > > (To your point on modular approach to support external built-in functions, > we saw the value and are actively looking into it. Thanks for sharing your > opinion on that.) > > Thanks, > Xuefu > > On Fri, Sep 6, 2019 at 3:48 PM Dawid Wysakowicz <[hidden email]> <[hidden email]> > wrote: > > > Hi Xuefu, > > Thank you for your answers. > > Let me summarize my understanding. In principle we differ only in regards > to the fact if a temporary function can be only 1-part or only 3-part > identified. I can reconfirm that if the community decides it prefers the > 1-part approach I will commit to that, with the assumption that we will > force ONLY 1-part function names. (We will parse identifier and throw > exception if a user tries to register e.g. db.temp_func). > > My preference is though the 3-part approach: > > - there are some functions that it makes no sense to override, e.g. > CAST, moreover I'm afraid that allowing overriding such will lead to high > inconsistency, similar to those that I mentioned spark has > - you cannot shadow a fully-qualified function. (If a user fully > qualifies his/her objects in a SQL query, which is often considered a good > practice) > - it does not differentiate between functions & temporary functions. > Temporary functions just differ with regards to their life-cycle. The > registration & usage is exactly the same. > > As it can be seen, the proposed concept regarding temp function and > function resolution is quite simple. > > Both approaches are equally simple. I would even say the 3-part approach > is slightly simpler as it does not have to care about some special built-in > functions such as CAST. > > I don't want to express my opinion on the differentiation between built-in > functions and "external" built-in functions in this thread as it is rather > orthogonal, but I also like the modular approach and I definitely don't > like the special syntax "cat::function". I think it's better to stick to a > standard or at least other proved solutions from other systems. > > Best, > > Dawid > On 05/09/2019 10:12, Xuefu Z wrote: > > Hi David, > > Thanks for sharing your thoughts and request for clarifications. I believe > that I fully understood your proposal, which does has its merit. However, > it's different from ours. Here are the answers to your questions: > > Re #1: yes, the temp functions in the proposal are global and have just > one-part names, similar to built-in functions. Two- or three-part names are > not allowed. > > Re #2: not applicable as two- or three-part names are disallowed. > > Re #3: same as above. Referencing external built-in functions is achieved > either implicitly (only the built-in functions in the current catalogs are > considered) or via special syntax such as cat::function. However, we are > looking into the modular approach that Time suggested with other feedback > received from the community. > > Re #4: the resolution order goes like the following in our proposal: > > 1. temporary functions > 2. bulit-in functions (including those augmented by add-on modules) > 3. built-in functions in current catalog (this will not be needed if the > special syntax "cat::function" is required) > 4. functions in current catalog and db. > > If we go with the modular approach and make external built-in functions as > an add-on module, the 2 and 3 above will be combined. In essence, the > resolution order is equivalent in the two approaches. > > By the way, resolution order matters only for simple name reference. For > names such as db.function (interpreted as current_cat/db/function) or > cat.db.function, the reference is unambiguous, so on resolution is needed. > > As it can be seen, the proposed concept regarding temp function and > function resolution is quite simple. Additionally, the proposed resolution > order allows temp function to shadow a built-in function, which is > important (though not decisive) in our opinion. > > I started liking the modular approach as the resolution order will only > include 1, 2, and 4, which is simpler and more generic. That's why I > suggested we look more into this direction. > > Please let me know if there are further questions. > > Thanks, > Xuefu > > > > > On Thu, Sep 5, 2019 at 2:42 PM Dawid Wysakowicz <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > wrote: > > > Hi Xuefu, > > Just wanted to summarize my opinion on the one topic (temporary functions). > > My preference would be to make temporary functions always 3-part qualified > (as a result that would prohibit overriding built-in functions). Having > said that if the community decides that it's better to allow overriding > built-in functions I am fine with it and can commit to that decision. > > I wanted to ask if you could clarify a few points for me around that > option. > > 1. Would you enforce temporary functions to be always just a single > name (without db & cat) as hive does, or would you allow also 3 or even 2 > part identifiers? > 2. Assuming 2/3-part paths. How would you register a function from a > following statement: CREATE TEMPORARY FUNCTION db.func? Would that shadow > all functions named 'func' in all databases named 'db' in all catalogs? Or > would you shadow only function 'func' in database 'db' in current catalog? > 3. This point is still under discussion, but was mentioned a few > times, that maybe we want to enable syntax cat.func for "external built-in > functions". How would that affect statement from previous point? Would > 'db.func' shadow "external built-in function" in 'db' catalog or user > functions as in point 2? Or maybe both? > 4. Lastly in fact to summarize the previous points. Assuming 2/3-part > paths. Would the function resolution be actually as follows?: > 1. temporary functions (1-part path) > 2. built-in functions > 3. temporary functions (2-part path) > 4. 2-part catalog functions a.k.a. "external built-in functions" > (cat + func) - this is still under discussion, if we want that in the other > focal point > 5. temporary functions (3-part path) > 6. 3-part catalog functions a.k.a. user functions > > I would be really grateful if you could explain me those questions, thanks. > > BTW, Thank you all for a healthy discussion. > > Best, > > Dawid > On 04/09/2019 23:25, Xuefu Z wrote: > > Thank all for the sharing thoughts. I think we have gathered some useful > initial feedback from this long discussion with a couple of focal points > sticking out. > > We will go back to do more research and adapt our proposal. Once it's > ready, we will ask for a new round of review. If there is any disagreement, > we will start a new discussion thread on each rather than having a mega > discussion like this. > > Thanks to everyone for participating. > > Regards, > Xuefu > > > On Thu, Sep 5, 2019 at 2:52 AM Bowen Li <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > > Let me try to summarize and conclude the long thread so far: > > 1. For order of temp function v.s. built-in function: > > I think Dawid's point that temp function should be of fully qualified path > is a better reasoning to back the newly proposed order, and i agree we > don't need to follow Hive/Spark. > > However, I'd rather not change fundamentals of temporary functions in this > FLIP. It belongs to a bigger story of how temporary objects should be > redefined and be handled uniformly - currently temporary tables and views > (those registered from TableEnv#registerTable()) behave different than what > Dawid propose for temp functions, and we need a FLIP to just unify their > APIs and behaviors. > > I agree that backward compatibility is not an issue w.r.t Jark's points. > > ***Seems we do have consensus that it's acceptable to prevent users > registering a temp function in the same name as a built-in function. To > help us move forward, I'd like to propose setting such a restraint on temp > functions in this FLIP to simplify the design and avoid disputes.*** It > will also leave rooms for improvements in the future. > > > 2. For Hive built-in function: > > Thanks Timo for providing the Presto and Postgres examples. I feel modular > built-in functions can be a good fit for the geo and ml example as a native > Flink extension, but not sure if it fits well with external integrations. > Anyway, I think modular built-in functions is a bigger story and can be on > its own thread too, and our proposal doesn't prevent Flink from doing that > in the future. > > ***Seems we have consensus that users should be able to use built-in > functions of Hive or other external systems in SQL explicitly and > deterministically regardless of Flink built-in functions and the potential > modular built-in functions, via some new syntax like "mycat::func"? If so, > I'd like to propose removing Hive built-in functions from ambiguous > function resolution order, and empower users with such a syntax. This way > we sacrifice a little convenience for certainty*** > > > What do you think? > > On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > wrote: > > > Hi, > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they > > are > > very inconsistent in that manner (spark being way worse on that). > > Hive: > > You cannot overwrite all the built-in functions. I could overwrite most > > of > > the functions I tried e.g. length, e, pi, round, rtrim, but there are > functions I cannot overwrite e.g. CAST, ARRAY I get: > > > * ParseException line 1:29 cannot recognize input near 'array' 'AS' * > > What is interesting is that I cannot ovewrite *array*, but I can ovewrite > *map* or *struct*. Though hive behaves reasonable well if I manage to > overwrite a function. When I drop the temporary function the native > function is still available. > > Spark: > > Spark's behavior imho is super bad. > > Theoretically I could overwrite all functions. I was able e.g. to > overwrite CAST function. I had to use though CREATE OR REPLACE TEMPORARY > FUNCTION syntax. Otherwise I get an exception that a function already > exists. However when I used the CAST function in a query it used the > native, built-in one. > > When I overwrote current_date() function, it was used in a query, but it > completely replaces the built-in function and I can no longer use the > native function in any way. I cannot also drop the temporary function. I > get: > > * Error in query: Cannot drop native function 'current_date';* > > Additional note, both systems do not allow creating TEMPORARY FUNCTIONS > with a database. Temporary functions are always represented as a single > name. > > In my opinion neither of the systems have consistent behavior. Generally > speaking I think overwriting any system provided functions is just > dangerous. > > Regarding Jark's concerns. Such functions would be registered in a > > current > > catalog/database schema, so a user could still use its own function, but > would have to fully qualify the function (because built-in functions take > precedence). Moreover users would have the same problem with permanent > functions. Imagine a user have a permanent function 'cat.db.explode'. In > 1.9 the user could use just the 'explode' function as long as the 'cat' & > 'db' were the default catalog & database. If we introduce 'explode' > built-in function in 1.10, the user has to fully qualify the function. > > Best, > > Dawid > On 04/09/2019 15:19, Timo Walther wrote: > > Hi all, > > thanks for the healthy discussion. It is already a very long discussion > with a lot of text. So I will just post my opinion to a couple of > statements: > > > Hive built-in functions are not part of Flink built-in functions, they > > are catalog functions > > That is not entirely true. Correct me if I'm wrong but I think Hive > built-in functions are also not catalog functions. They are not stored in > every Hive metastore catalog that is freshly created but are a set of > functions that are listed somewhere and made available. > > > ambiguous functions reference just shouldn't be resolved to a different > > catalog > > I agree. They should not be resolved to a different catalog. That's why I > am suggesting to split the concept of built-in functions and catalog > > lookup > > semantics. > > > I don't know if any other databases handle built-in functions like that > > What I called "module" is: > - Extension in Postgres [1] > - Plugin in Presto [2] > > Btw. Presto even mentions example modules that are similar to the ones > that we will introduce in the near future both for ML and System XYZ > compatibility: > "See either the presto-ml module for machine learning functions or the > presto-teradata-functions module for Teradata-compatible functions, both > > in > > the root of the Presto source." > > > functions should be either built-in already or just libraries > > functions, > > and library functions can be adapted to catalog APIs or of some other > syntax to use > > Regarding "built-in already", of course we can add a lot of functions as > built-ins but we will end-up in a dependency hell in the near future if > > we > > don't introduce a pluggable approach. Library functions is what you also > suggest but storing them in a catalog means to always fully qualify them > > or > > modifying the existing catalog design that was inspired by the standard. > > I don't think "it brings in even more complicated scenarios to the > design", it just does clear separation of concerns. Integrating the > functionality into the current design makes the catalog API more > complicated. > > > why would users name a temporary function the same as a built-in > > function then? > > Because you never know what users do. If they don't, my suggested > resolution order should not be a problem, right? > > > I don't think hive functions deserves be a function module > > Our goal is not to create a Hive clone. We need to think forward and Hive > is just one of many systems that we can support. Not every built-in > function behaves and will behave exactly like Hive. > > > regarding temporary functions, there are few systems that support it > > IMHO Spark and Hive are not always the best examples for consistent > design. Systems like Postgres, Presto, or SQL Server should be used as a > reference. I don't think that a user can overwrite a built-in function > there. > > Regards, > Timo > > [1] https://www.postgresql.org/docs/10/extend-extensions.html > [2] https://prestodb.github.io/docs/current/develop/functions.html > > > On 04.09.19 13:44, Jark Wu wrote: > > Hi all, > > Regarding #1 temp function <> built-in function and naming. > I'm fine with temp functions should precede built-in function and can > override built-in functions (we already support to override built-in > function in 1.9). > If we don't allow the same name as a built-in function, I'm afraid we > > will > > have compatibility issues in the future. > Say users register a user defined function named "explode" in 1.9, and we > support a built-in "explode" function in 1.10. > Then the user's jobs which call the registered "explode" function in 1.9 > will all fail in 1.10 because of naming conflict. > > Regarding #2 "External" built-in functions. > I think if we store external built-in functions in catalog, then > "hive1::sqrt" is a good way to go. > However, I would prefer to support a discovery mechanism (e.g. SPI) for > built-in functions as Timo suggested above. > This gives us the flexibility to add Hive or MySQL or Geo or whatever > function set as built-in functions in an easy way. > > Best, > Jark > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > Hi David, > > Thank you for sharing your findings. It seems to me that there is no SQL > standard regarding temporary functions. There are few systems that > > support > > it. Here are what I have found: > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > 2. Spark: basically follows Hive ( > > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > > ) > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting > behavior. (http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html > > ) > > Because of lack of standard, it's perfectly fine for Flink to define > whatever it sees appropriate. Thus, your proposal (no overwriting and > > must > > have DB as holder) is one option. The advantage is simplicity, The > downside > is the deviation from Hive, which is popular and de facto standard in big > data world. > > However, I don't think we have to follow Hive. More importantly, we need > > a > > consensus. I have no objection if your proposal is generally agreed upon. > > Thanks, > Xuefu > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email]<[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > wrote: > > Hi all, > > Just an opinion on the built-in <> temporary functions resolution and > NAMING issue. I think we should not allow overriding the built-in > functions, as this may pose serious issues and to be honest is rather > not feasible and would require major rework. What happens if a user > wants to override CAST? Calls to that function are generated at > different layers of the stack that unfortunately does not always go > through the Catalog API (at least yet). Moreover from what I've checked > no other systems allow overriding the built-in functions. All the > systems I've checked so far register temporary functions in a > database/schema (either special database for temporary functions, or > just current database). What I would suggest is to always register > temporary functions with a 3 part identifier. The same way as tables, > views etc. This effectively means you cannot override built-in > functions. With such approach it is natural that the temporary functions > end up a step lower in the resolution order: > > 1. built-in functions (1 part, maybe 2? - this is still under discussion) > > 2. temporary functions (always 3 part path) > > 3. catalog functions (always 3 part path) > > Let me know what do you think. > > Best, > > Dawid > > On 04/09/2019 06:13, Bowen Li wrote: > > Hi, > > I agree with Xuefu that the main controversial points are mainly the > > two > > places. My thoughts on them: > > 1) Determinism of referencing Hive built-in functions. We can either > > remove > > Hive built-in functions from ambiguous function resolution and require > users to use special syntax for their qualified names, or add a config > > flag > > to catalog constructor/yaml for turning on and off Hive built-in > > functions > > with the flag set to 'false' by default and proper doc added to help > > users > > make their decisions. > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > function > > resolution order. We believe Flink temp functions should precede Flink > built-in functions, and I have presented my reasons. Just in case if we > cannot reach an agreement, I propose forbid users registering temp > functions in the same name as a built-in function, like MySQL's > > approach, > > for the moment. It won't have any performance concern, since built-in > functions are all in memory and thus cost of a name check will be > > really > > trivial. > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > From what I have seen, there are a couple of focal disagreements: > > 1. Resolution order: temp function --> flink built-in function --> > > catalog > > function vs flink built-in function --> temp function -> catalog > > function. > > 2. "External" built-in functions: how to treat built-in functions in > external system and how users reference them > > For #1, I agree with Bowen that temp function needs to be at the > > highest > > priority because that's how a user might overwrite a built-in function > without referencing a persistent, overwriting catalog function with a > > fully > > qualified name. Putting built-in functions at the highest priority > eliminates that usage. > > For #2, I saw a general agreement on referencing "external" built-in > functions such as those in Hive needs to be explicit and deterministic > > even > > though different approaches are proposed. To limit the scope and > > simply > > the > > usage, it seems making sense to me to introduce special syntax for > > user to > > explicitly reference an external built-in function such as hive1::sqrt > > or > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API > > call > > hive1.getFunction(ObjectPath functionName) where the database name is > absent for bulit-in functions available in that catalog hive1. I > > understand > > that Bowen's original proposal was trying to avoid this, but this > > could > > turn out to be a clean and simple solution. > > (Timo's modular approach is great way to "expand" Flink's built-in > > function > > set, which seems orthogonal and complementary to this, which could be > tackled in further future work.) > > I'd be happy to hear further thoughts on the two points. > > Thanks, > Xuefu > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > the > > same > as Bowen's. But after thinking about it, I'm currently lean to Timo's > suggestion. > > The reason is backward compatibility. If we follow Bowen's approach, > > let's > > say we > first find function in Flink's built-in functions, and then hive's > built-in. For example, `foo` > is not supported by Flink, but hive has such built-in function. So > > user > > will have hive's > behavior for function `foo`. And in next release, Flink realize this > > is a > > very popular function > and add it into Flink's built-in functions, but with different > > behavior > > as > > hive's. So in next > release, the behavior changes. > > With Timo's approach, IIUC user have to tell the framework explicitly > > what > > kind of > built-in functions he would like to use. He can just tell framework > > to > > abandon Flink's built-in > functions, and use hive's instead. User can only choose between them, > > but > > not use > them at the same time. I think this approach is more predictable. > > Best, > Kurt > > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > Hi all, > > Thanks for the feedback. Just a kindly reminder that the [Proposal] > > section > > in the google doc was updated, please take a look first and let me > > know > > if > > you have more questions. > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > > wrote: > > Hi Timo, > > Re> 1) We should not have the restriction "hive built-in functions > > can > > only > > be used when current catalog is hive catalog". Switching a catalog > should only have implications on the cat.db.object resolution but > > not > > functions. It would be quite convinient for users to use Hive > > built-ins > > even if they use a Confluent schema registry or just the in-memory > > catalog. > > There might be a misunderstanding here. > > First of all, Hive built-in functions are not part of Flink > > built-in > > functions, they are catalog functions, thus if the current catalog > > is > > not a > > HiveCatalog but, say, a schema registry catalog, ambiguous > > functions > > reference just shouldn't be resolved to a different catalog. > > Second, Hive built-in functions can potentially be referenced > > across > > catalog, but it doesn't have db namespace and we currently just > > don't > > have > > a SQL syntax for it. It can be enabled when such a SQL syntax is > > defined, > > e.g. "catalog::function", but it's out of scope of this FLIP. > > 2) I would propose to have separate concepts for catalog and > > built-in > > functions. In particular it would be nice to modularize built-in > functions. Some built-in functions are very crucial (like AS, CAST, > MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > maybe > > we add more experimental functions in the future or function for > > some > > special application area (Geo functions, ML functions). A data > > platform > > team might not want to make every built-in function available. Or a > function module like ML functions is in a different Maven module. > > I think this is orthogonal to this FLIP, especially we don't have > > the > > "external built-in functions" anymore and currently the built-in > > function > > category remains untouched. > > But just to share some thoughts on the proposal, I'm not sure about > > it: > > - I don't know if any other databases handle built-in functions > > like > > that. > > Maybe you can give some examples? IMHO, built-in functions are > > system > > info > > and should be deterministic, not depending on loaded libraries. Geo > functions should be either built-in already or just libraries > > functions, > > and library functions can be adapted to catalog APIs or of some > > other > > syntax to use > - I don't know if all use cases stand, and many can be achieved by > > other > > approaches too. E.g. experimental functions can be taken good care > > of > > by > > documentations, annotations, etc > - the proposal basically introduces some concept like a pluggable > > built-in > > function catalog, despite the already existing catalog APIs > - it brings in even more complicated scenarios to the design. E.g. > > how > > do > > you handle built-in functions in different modules but different > > names? > > In short, I'm not sure if it really stands and it looks like an > > overkill > > to me. I'd rather not go to that route. Related discussion can be > > on > > its > > own thread. > > 3) Following the suggestion above, we can have a separate discovery > mechanism for built-in functions. Instead of just going through a > > static > > list like in BuiltInFunctionDefinitions, a platform team should be > > able > > to select function modules like > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > HiveFunctions) or via service discovery; > > Same as above. I'll leave it to its own thread. > > re > 3) Dawid and I discussed the resulution order again. I agree > > with > > Kurt > > that we should unify built-in function (external or internal) > > under a > > common layer. However, the resolution order should be: > 1. built-in functions > 2. temporary functions > 3. regular catalog resolution logic > Otherwise a temporary function could cause clashes with Flink's > > built-in > > functions. If you take a look at other vendors, like SQL Server > > they > > also do not allow to overwrite built-in functions. > > ”I agree with Kurt that we should unify built-in function (external > > or > > internal) under a common layer.“ <- I don't think this is what Kurt > > means. > > Kurt and I are in favor of unifying built-in functions of external > > systems > > and catalog functions. Did you type a mistake? > > Besides, I'm not sure about the resolution order you proposed. > > Temporary > > functions have a lifespan over a session and are only visible to > > the > > session owner, they are unique to each user, and users create them > > on > > purpose to be the highest priority in order to overwrite system > > info > > (built-in functions in this case). > > In your case, why would users name a temporary function the same > > as a > > built-in function then? Since using that name in ambiguous function > reference will always be resolved to built-in functions, creating a > same-named temp function would be meaningless in the end. > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > > wrote: > > Hi Jingsong, > > Re> 1.Hive built-in functions is an intermediate solution. So we > > should > > not introduce interfaces to influence the framework. To make > Flink itself more powerful, we should implement the functions > we need to add. > > Yes, please see the doc. > > Re> 2.Non-flink built-in functions are easy for users to change > > their > > behavior. If we support some flink built-in functions in the > future but act differently from non-flink built-in, this will > > lead > > to > > changes in user behavior. > > There's no such concept as "external built-in functions" any more. > Built-in functions of external systems will be treated as special > > catalog > > functions. > > Re> Another question is, does this fallback include all > > hive built-in functions? As far as I know, some hive functions > have some hacky. If possible, can we start with a white list? > Once we implement some functions to flink built-in, we can > also update the whitelist. > > Yes, that's something we thought of too. I don't think it's super > critical to the scope of this FLIP, thus I'd like to leave it to > > future > > efforts as a nice-to-have feature. > > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > > wrote: > > Hi Kurt, > > Re: > What I want to propose is we can merge #3 and #4, make them > > both > > under > > "catalog" concept, by extending catalog function to make it have > > ability to > > have built-in catalog functions. Some benefits I can see from > > this > > approach: > > 1. We don't have to introduce new concept like external built-in > > functions. > > Actually I don't see a full story about how to treat a built-in > > functions, and it > > seems a little bit disrupt with catalog. As a result, you have > > to > > make > > some restriction > > like "hive built-in functions can only be used when current > > catalog > > is > > hive catalog". > > Yes, I've unified #3 and #4 but it seems I didn't update some > > part > > of > > the doc. I've modified those sections, and they are up to date > > now. > > In short, now built-in function of external systems are defined > > as > > a > > special kind of catalog function in Flink, and handled by Flink > > as > > following: > - An external built-in function must be associated with a catalog > > for > > the purpose of decoupling flink-table and external systems. > - It always resides in front of catalog functions in ambiguous > > function > > reference order, just like in its own external system > - It is a special catalog function that doesn’t have a > > schema/database > > namespace > - It goes thru the same instantiation logic as other user defined > catalog functions in the external system > > Please take another look at the doc, and let me know if you have > > more > > questions. > > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > > wrote: > > Hi Kurt, > > it should not affect the functions and operations we currently > > have > > in > > SQL. It just categorizes the available built-in functions. It is > > kind > > of > an orthogonal concept to the catalog API but built-in functions > > deserve > > this special kind of treatment. CatalogFunction still fits > > perfectly > > in > > there because the regular catalog object resolution logic is not > affected. So tables and functions are resolved in the same way > > but > > with > > built-in functions that have priority as in the original design. > > Regards, > Timo > > > On 03.09.19 15:26, Kurt Young wrote: > > Does this only affect the functions and operations we currently > > have > > in SQL > > and > have no effect on tables, right? Looks like this is an > > orthogonal > > concept > > with Catalog? > If the answer are both yes, then the catalog function will be a > > weird > > concept? > > Best, > Kurt > > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email] > > wrote: > > The way you proposed are basically the same as what Calcite > > does, I > > think > > we are in the same line. > > Best, > Danny Chan > 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > > ,写道: > > This sounds exactly as the module approach I mentioned, no? > > Regards, > Timo > > On 03.09.19 13:42, Danny Chan wrote: > > Thanks Bowen for bring up this topic, I think it’s a useful > > refactoring to make our function usage more user friendly. > > For the topic of how to organize the builtin operators and > > operators > > of Hive, here is a solution from Apache Calcite, the Calcite > > way > > is > > to make > > every dialect operators a “Library”, user can specify which > > libraries they > > want to use for a sql query. The builtin operators always > > comes > > as > > the > > first class objects and the others are used from the order > > they > > appears. > > Maybe you can take a reference. > > [1] > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > Best, > Danny Chan > 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > > ,写道: > > Hi folks, > > I'd like to kick off a discussion on reworking Flink's > > FunctionCatalog. > > It's critically helpful to improve function usability in > > SQL. > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > In short, it: > - adds support for precise function reference with > > fully/partially > > qualified name > - redefines function resolution order for ambiguous > > function > > reference > > - adds support for Hive's rich built-in functions (support > > for > > Hive > > user > > defined functions was already added in 1.9.0) > - clarifies the concept of temporary functions > > Would love to hear your thoughts. > > Bowen > > -- > Xuefu Zhang > > "In Honey We Trust!" > > > -- > Xuefu Zhang > > "In Honey We Trust!" > > > > > > > > > |
Hi all,
I'd like to add my opinion on this topic as well ;-) In general, I think overriding built-in function with temp functions has a couple of benefits but also a few challenges: * Users can reimplement the behavior of a built-in functions of a different system, e.g., for backward compatibility after a migration. * I don't think that "accidental" overrides and surprising semantics are an issue or dangerous. The user registered the temp function in the same session and should therefore be aware of the changed semantics. * I see that not all built-in functions can be overridden, like the CAST example that Dawid gave. However, I think these should be a small fraction and such functions could be blacklisted. Sure, that's not super consistent, but should (IMO) not be a big issue in practice. * Temp functions should be easy to use. Requiring a 3-part addressing makes them a lot less user friendly, IMO. Users need to think about what catalog and db to choose when registering them. Also using a temp function in a query becomes less convenient. Moreover, I agree with Bowen's concerns that a 3-part addressing scheme reduces the temporal appearance of the function. From the three possible solutions, my preference order is 1) 1-part address with override of built-in 2) 1-part address without override of built-in 3) 3-part address Regarding the issue of external built-in functions, I don't think that Timo's proposal of modules is fully orthogonal to this discussion. A Hive function module could be an alternative to offering Hive functions as part of Hive's catalog. From a user's point of view, I think that modules would be a "cleaner" integration ("Why do I need a Hive catalog if all I want to do is apply a Hive function on a Kafka table?"). However, the module approach clearly has the problem of dealing with same-named functions in different modules (e.g., a Hive function and a Flink built-in function). The catalog approach as the benefit that functions can be addressed like hiveCat::func (or a similar path). I'm not sure what's the best solution here. Cheers, Fabian Am Mo., 9. Sept. 2019 um 06:30 Uhr schrieb Bowen Li <[hidden email]>: > Hi, > > W.r.t temp functions, I feel both options have their benefits and can > theoretically achieve similar functionalities one way or another. In the > end, it's more about use cases, users habits, and trade-offs. > > Re> Not always users are in full control of the catalog functions. There is > also the case where different teams manage the catalog & use the catalog. > > Temp functions live within a session, and not within a catalog. Having > 3-part paths may implies temp functions are tied to a catalog in two > aspects. > 1) it may indicate each catalog manages their temp functions, which is not > true as we seem all agree they should reside at a central place, either in > FunctionCatalog or CatalogManager > 2) it may indicate there's some access control. When users are forbidden to > manipulate some objects in the catalog that's managed by other teams, but > are allowed to manipulate some other objects (temp functions in this case) > belonging to the catalog in namespaces, users may think we introduced extra > complexity and confusion with some kind of access control into the problem. > It doesn't feel intuitive enough for end users. > > Thus, I'd be in favor of 1-part path for temporary functions, and other > temp objects. > > Thanks, > Bowen > > > > On Fri, Sep 6, 2019 at 2:16 AM Dawid Wysakowicz <[hidden email]> > wrote: > > > I agree the consequences of the decision are substantial. Let's see what > > others think. > > > > -- Catalog functions are defined by users, and we suppose they can > > drop/alter it in any way they want. Thus, overwriting a catalog function > > doesn't seem to be a strong use case that we should be concerned about. > > Rather, there are known use case for overwriting built-in functions. > > > > Not always users are in full control of the catalog functions. There is > > also the case where different teams manage the catalog & use the catalog. > > As for overriding built-in functions with 3-part approach user can always > > use an equally named function from a catalog. E.g. to override > > > > * SELECT explode(arr) FROM ...* > > > > user can always write: > > > > * SELECT db.explode(arr) FROM ...* > > > > Best, > > > > Dawid > > On 06/09/2019 10:54, Xuefu Z wrote: > > > > Hi Dawid, > > > > Thank you for your summary. While the only difference in the two > proposals > > is one- or three-part in naming, the consequence would be substantial. > > > > To me, there are two major use cases of temporary functions compared to > > persistent ones: > > 1. Temporary in nature and auto managed by the session. More often than > > not, admin doesn't even allow user to create persistent functions. > > 2. Provide an opportunity to overwriting system built-in functions. > > > > Since built-in functions has one-part name, requiring three-part name for > > temporary functions eliminates the overwriting opportunity. > > > > One-part naming essentially puts all temp functions under a single > > namespace and simplifies function resolution, such as we don't need to > > consider the case of a temp function and a persistent function with the > > same name under the same database. > > > > I agree having three-parts does have its merits, such as consistency with > > other temporary objects (table) and minor difference between temp vs > > catalog functions. However, there is a slight difference between tables > and > > function in that there is no built-in table in SQL so there is no need to > > overwrite it. > > > > I'm not sure if I fully agree the benefits you listed as the advantages > of > > the three-part naming of temp functions. > > -- Allowing overwriting built-in functions is a benefit and the > solution > > for disallowing certain overwriting shouldn't be totally banning it. > > -- Catalog functions are defined by users, and we suppose they can > > drop/alter it in any way they want. Thus, overwriting a catalog function > > doesn't seem to be a strong use case that we should be concerned about. > > Rather, there are known use case for overwriting built-in functions. > > > > Thus, personally I would prefer one-part name for temporary functions. In > > lack of SQL standard on this, I certainly like to get opinions from > others > > to see if a consensus can be eventually reached. > > > > (To your point on modular approach to support external built-in > functions, > > we saw the value and are actively looking into it. Thanks for sharing > your > > opinion on that.) > > > > Thanks, > > Xuefu > > > > On Fri, Sep 6, 2019 at 3:48 PM Dawid Wysakowicz <[hidden email]> > <[hidden email]> > > wrote: > > > > > > Hi Xuefu, > > > > Thank you for your answers. > > > > Let me summarize my understanding. In principle we differ only in regards > > to the fact if a temporary function can be only 1-part or only 3-part > > identified. I can reconfirm that if the community decides it prefers the > > 1-part approach I will commit to that, with the assumption that we will > > force ONLY 1-part function names. (We will parse identifier and throw > > exception if a user tries to register e.g. db.temp_func). > > > > My preference is though the 3-part approach: > > > > - there are some functions that it makes no sense to override, e.g. > > CAST, moreover I'm afraid that allowing overriding such will lead to > high > > inconsistency, similar to those that I mentioned spark has > > - you cannot shadow a fully-qualified function. (If a user fully > > qualifies his/her objects in a SQL query, which is often considered a > good > > practice) > > - it does not differentiate between functions & temporary functions. > > Temporary functions just differ with regards to their life-cycle. The > > registration & usage is exactly the same. > > > > As it can be seen, the proposed concept regarding temp function and > > function resolution is quite simple. > > > > Both approaches are equally simple. I would even say the 3-part approach > > is slightly simpler as it does not have to care about some special > built-in > > functions such as CAST. > > > > I don't want to express my opinion on the differentiation between > built-in > > functions and "external" built-in functions in this thread as it is > rather > > orthogonal, but I also like the modular approach and I definitely don't > > like the special syntax "cat::function". I think it's better to stick to > a > > standard or at least other proved solutions from other systems. > > > > Best, > > > > Dawid > > On 05/09/2019 10:12, Xuefu Z wrote: > > > > Hi David, > > > > Thanks for sharing your thoughts and request for clarifications. I > believe > > that I fully understood your proposal, which does has its merit. However, > > it's different from ours. Here are the answers to your questions: > > > > Re #1: yes, the temp functions in the proposal are global and have just > > one-part names, similar to built-in functions. Two- or three-part names > are > > not allowed. > > > > Re #2: not applicable as two- or three-part names are disallowed. > > > > Re #3: same as above. Referencing external built-in functions is achieved > > either implicitly (only the built-in functions in the current catalogs > are > > considered) or via special syntax such as cat::function. However, we are > > looking into the modular approach that Time suggested with other feedback > > received from the community. > > > > Re #4: the resolution order goes like the following in our proposal: > > > > 1. temporary functions > > 2. bulit-in functions (including those augmented by add-on modules) > > 3. built-in functions in current catalog (this will not be needed if the > > special syntax "cat::function" is required) > > 4. functions in current catalog and db. > > > > If we go with the modular approach and make external built-in functions > as > > an add-on module, the 2 and 3 above will be combined. In essence, the > > resolution order is equivalent in the two approaches. > > > > By the way, resolution order matters only for simple name reference. For > > names such as db.function (interpreted as current_cat/db/function) or > > cat.db.function, the reference is unambiguous, so on resolution is > needed. > > > > As it can be seen, the proposed concept regarding temp function and > > function resolution is quite simple. Additionally, the proposed > resolution > > order allows temp function to shadow a built-in function, which is > > important (though not decisive) in our opinion. > > > > I started liking the modular approach as the resolution order will only > > include 1, 2, and 4, which is simpler and more generic. That's why I > > suggested we look more into this direction. > > > > Please let me know if there are further questions. > > > > Thanks, > > Xuefu > > > > > > > > > > On Thu, Sep 5, 2019 at 2:42 PM Dawid Wysakowicz <[hidden email]> > <[hidden email]> <[hidden email]> <[hidden email]> > > wrote: > > > > > > Hi Xuefu, > > > > Just wanted to summarize my opinion on the one topic (temporary > functions). > > > > My preference would be to make temporary functions always 3-part > qualified > > (as a result that would prohibit overriding built-in functions). Having > > said that if the community decides that it's better to allow overriding > > built-in functions I am fine with it and can commit to that decision. > > > > I wanted to ask if you could clarify a few points for me around that > > option. > > > > 1. Would you enforce temporary functions to be always just a single > > name (without db & cat) as hive does, or would you allow also 3 or > even 2 > > part identifiers? > > 2. Assuming 2/3-part paths. How would you register a function from a > > following statement: CREATE TEMPORARY FUNCTION db.func? Would that > shadow > > all functions named 'func' in all databases named 'db' in all > catalogs? Or > > would you shadow only function 'func' in database 'db' in current > catalog? > > 3. This point is still under discussion, but was mentioned a few > > times, that maybe we want to enable syntax cat.func for "external > built-in > > functions". How would that affect statement from previous point? Would > > 'db.func' shadow "external built-in function" in 'db' catalog or user > > functions as in point 2? Or maybe both? > > 4. Lastly in fact to summarize the previous points. Assuming 2/3-part > > paths. Would the function resolution be actually as follows?: > > 1. temporary functions (1-part path) > > 2. built-in functions > > 3. temporary functions (2-part path) > > 4. 2-part catalog functions a.k.a. "external built-in functions" > > (cat + func) - this is still under discussion, if we want that in > the other > > focal point > > 5. temporary functions (3-part path) > > 6. 3-part catalog functions a.k.a. user functions > > > > I would be really grateful if you could explain me those questions, > thanks. > > > > BTW, Thank you all for a healthy discussion. > > > > Best, > > > > Dawid > > On 04/09/2019 23:25, Xuefu Z wrote: > > > > Thank all for the sharing thoughts. I think we have gathered some useful > > initial feedback from this long discussion with a couple of focal points > > sticking out. > > > > We will go back to do more research and adapt our proposal. Once it's > > ready, we will ask for a new round of review. If there is any > disagreement, > > we will start a new discussion thread on each rather than having a mega > > discussion like this. > > > > Thanks to everyone for participating. > > > > Regards, > > Xuefu > > > > > > On Thu, Sep 5, 2019 at 2:52 AM Bowen Li <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> wrote: > > > > > > Let me try to summarize and conclude the long thread so far: > > > > 1. For order of temp function v.s. built-in function: > > > > I think Dawid's point that temp function should be of fully qualified > path > > is a better reasoning to back the newly proposed order, and i agree we > > don't need to follow Hive/Spark. > > > > However, I'd rather not change fundamentals of temporary functions in > this > > FLIP. It belongs to a bigger story of how temporary objects should be > > redefined and be handled uniformly - currently temporary tables and views > > (those registered from TableEnv#registerTable()) behave different than > what > > Dawid propose for temp functions, and we need a FLIP to just unify their > > APIs and behaviors. > > > > I agree that backward compatibility is not an issue w.r.t Jark's points. > > > > ***Seems we do have consensus that it's acceptable to prevent users > > registering a temp function in the same name as a built-in function. To > > help us move forward, I'd like to propose setting such a restraint on > temp > > functions in this FLIP to simplify the design and avoid disputes.*** It > > will also leave rooms for improvements in the future. > > > > > > 2. For Hive built-in function: > > > > Thanks Timo for providing the Presto and Postgres examples. I feel > modular > > built-in functions can be a good fit for the geo and ml example as a > native > > Flink extension, but not sure if it fits well with external integrations. > > Anyway, I think modular built-in functions is a bigger story and can be > on > > its own thread too, and our proposal doesn't prevent Flink from doing > that > > in the future. > > > > ***Seems we have consensus that users should be able to use built-in > > functions of Hive or other external systems in SQL explicitly and > > deterministically regardless of Flink built-in functions and the > potential > > modular built-in functions, via some new syntax like "mycat::func"? If > so, > > I'd like to propose removing Hive built-in functions from ambiguous > > function resolution order, and empower users with such a syntax. This way > > we sacrifice a little convenience for certainty*** > > > > > > What do you think? > > > > On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz <[hidden email]> > <[hidden email]> <[hidden email]> <[hidden email]> > <[hidden email]> <[hidden email]> <[hidden email]> > <[hidden email]> > > wrote: > > > > > > Hi, > > > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just > > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they > > > > are > > > > very inconsistent in that manner (spark being way worse on that). > > > > Hive: > > > > You cannot overwrite all the built-in functions. I could overwrite most > > > > of > > > > the functions I tried e.g. length, e, pi, round, rtrim, but there are > > functions I cannot overwrite e.g. CAST, ARRAY I get: > > > > > > * ParseException line 1:29 cannot recognize input near 'array' 'AS' * > > > > What is interesting is that I cannot ovewrite *array*, but I can ovewrite > > *map* or *struct*. Though hive behaves reasonable well if I manage to > > overwrite a function. When I drop the temporary function the native > > function is still available. > > > > Spark: > > > > Spark's behavior imho is super bad. > > > > Theoretically I could overwrite all functions. I was able e.g. to > > overwrite CAST function. I had to use though CREATE OR REPLACE TEMPORARY > > FUNCTION syntax. Otherwise I get an exception that a function already > > exists. However when I used the CAST function in a query it used the > > native, built-in one. > > > > When I overwrote current_date() function, it was used in a query, but it > > completely replaces the built-in function and I can no longer use the > > native function in any way. I cannot also drop the temporary function. I > > get: > > > > * Error in query: Cannot drop native function 'current_date';* > > > > Additional note, both systems do not allow creating TEMPORARY FUNCTIONS > > with a database. Temporary functions are always represented as a single > > name. > > > > In my opinion neither of the systems have consistent behavior. Generally > > speaking I think overwriting any system provided functions is just > > dangerous. > > > > Regarding Jark's concerns. Such functions would be registered in a > > > > current > > > > catalog/database schema, so a user could still use its own function, but > > would have to fully qualify the function (because built-in functions take > > precedence). Moreover users would have the same problem with permanent > > functions. Imagine a user have a permanent function 'cat.db.explode'. In > > 1.9 the user could use just the 'explode' function as long as the 'cat' & > > 'db' were the default catalog & database. If we introduce 'explode' > > built-in function in 1.10, the user has to fully qualify the function. > > > > Best, > > > > Dawid > > On 04/09/2019 15:19, Timo Walther wrote: > > > > Hi all, > > > > thanks for the healthy discussion. It is already a very long discussion > > with a lot of text. So I will just post my opinion to a couple of > > statements: > > > > > > Hive built-in functions are not part of Flink built-in functions, they > > > > are catalog functions > > > > That is not entirely true. Correct me if I'm wrong but I think Hive > > built-in functions are also not catalog functions. They are not stored in > > every Hive metastore catalog that is freshly created but are a set of > > functions that are listed somewhere and made available. > > > > > > ambiguous functions reference just shouldn't be resolved to a different > > > > catalog > > > > I agree. They should not be resolved to a different catalog. That's why I > > am suggesting to split the concept of built-in functions and catalog > > > > lookup > > > > semantics. > > > > > > I don't know if any other databases handle built-in functions like that > > > > What I called "module" is: > > - Extension in Postgres [1] > > - Plugin in Presto [2] > > > > Btw. Presto even mentions example modules that are similar to the ones > > that we will introduce in the near future both for ML and System XYZ > > compatibility: > > "See either the presto-ml module for machine learning functions or the > > presto-teradata-functions module for Teradata-compatible functions, both > > > > in > > > > the root of the Presto source." > > > > > > functions should be either built-in already or just libraries > > > > functions, > > > > and library functions can be adapted to catalog APIs or of some other > > syntax to use > > > > Regarding "built-in already", of course we can add a lot of functions as > > built-ins but we will end-up in a dependency hell in the near future if > > > > we > > > > don't introduce a pluggable approach. Library functions is what you also > > suggest but storing them in a catalog means to always fully qualify them > > > > or > > > > modifying the existing catalog design that was inspired by the standard. > > > > I don't think "it brings in even more complicated scenarios to the > > design", it just does clear separation of concerns. Integrating the > > functionality into the current design makes the catalog API more > > complicated. > > > > > > why would users name a temporary function the same as a built-in > > > > function then? > > > > Because you never know what users do. If they don't, my suggested > > resolution order should not be a problem, right? > > > > > > I don't think hive functions deserves be a function module > > > > Our goal is not to create a Hive clone. We need to think forward and Hive > > is just one of many systems that we can support. Not every built-in > > function behaves and will behave exactly like Hive. > > > > > > regarding temporary functions, there are few systems that support it > > > > IMHO Spark and Hive are not always the best examples for consistent > > design. Systems like Postgres, Presto, or SQL Server should be used as a > > reference. I don't think that a user can overwrite a built-in function > > there. > > > > Regards, > > Timo > > > > [1] https://www.postgresql.org/docs/10/extend-extensions.html > > [2] https://prestodb.github.io/docs/current/develop/functions.html > > > > > > On 04.09.19 13:44, Jark Wu wrote: > > > > Hi all, > > > > Regarding #1 temp function <> built-in function and naming. > > I'm fine with temp functions should precede built-in function and can > > override built-in functions (we already support to override built-in > > function in 1.9). > > If we don't allow the same name as a built-in function, I'm afraid we > > > > will > > > > have compatibility issues in the future. > > Say users register a user defined function named "explode" in 1.9, and we > > support a built-in "explode" function in 1.10. > > Then the user's jobs which call the registered "explode" function in 1.9 > > will all fail in 1.10 because of naming conflict. > > > > Regarding #2 "External" built-in functions. > > I think if we store external built-in functions in catalog, then > > "hive1::sqrt" is a good way to go. > > However, I would prefer to support a discovery mechanism (e.g. SPI) for > > built-in functions as Timo suggested above. > > This gives us the flexibility to add Hive or MySQL or Geo or whatever > > function set as built-in functions in an easy way. > > > > Best, > > Jark > > > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]><[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > Hi David, > > > > Thank you for sharing your findings. It seems to me that there is no SQL > > standard regarding temporary functions. There are few systems that > > > > support > > > > it. Here are what I have found: > > > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > > 2. Spark: basically follows Hive ( > > > > > > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > > > > ) > > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting > > behavior. ( > http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html > > > > ) > > > > Because of lack of standard, it's perfectly fine for Flink to define > > whatever it sees appropriate. Thus, your proposal (no overwriting and > > > > must > > > > have DB as holder) is one option. The advantage is simplicity, The > > downside > > is the deviation from Hive, which is popular and de facto standard in big > > data world. > > > > However, I don't think we have to follow Hive. More importantly, we need > > > > a > > > > consensus. I have no objection if your proposal is generally agreed upon. > > > > Thanks, > > Xuefu > > > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz <[hidden email] > <[hidden email]> <[hidden email]> <[hidden email]> > <[hidden email]> <[hidden email]> <[hidden email]> > <[hidden email]> <[hidden email]> > > wrote: > > > > Hi all, > > > > Just an opinion on the built-in <> temporary functions resolution and > > NAMING issue. I think we should not allow overriding the built-in > > functions, as this may pose serious issues and to be honest is rather > > not feasible and would require major rework. What happens if a user > > wants to override CAST? Calls to that function are generated at > > different layers of the stack that unfortunately does not always go > > through the Catalog API (at least yet). Moreover from what I've checked > > no other systems allow overriding the built-in functions. All the > > systems I've checked so far register temporary functions in a > > database/schema (either special database for temporary functions, or > > just current database). What I would suggest is to always register > > temporary functions with a 3 part identifier. The same way as tables, > > views etc. This effectively means you cannot override built-in > > functions. With such approach it is natural that the temporary functions > > end up a step lower in the resolution order: > > > > 1. built-in functions (1 part, maybe 2? - this is still under discussion) > > > > 2. temporary functions (always 3 part path) > > > > 3. catalog functions (always 3 part path) > > > > Let me know what do you think. > > > > Best, > > > > Dawid > > > > On 04/09/2019 06:13, Bowen Li wrote: > > > > Hi, > > > > I agree with Xuefu that the main controversial points are mainly the > > > > two > > > > places. My thoughts on them: > > > > 1) Determinism of referencing Hive built-in functions. We can either > > > > remove > > > > Hive built-in functions from ambiguous function resolution and require > > users to use special syntax for their qualified names, or add a config > > > > flag > > > > to catalog constructor/yaml for turning on and off Hive built-in > > > > functions > > > > with the flag set to 'false' by default and proper doc added to help > > > > users > > > > make their decisions. > > > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > > > function > > > > resolution order. We believe Flink temp functions should precede Flink > > built-in functions, and I have presented my reasons. Just in case if we > > cannot reach an agreement, I propose forbid users registering temp > > functions in the same name as a built-in function, like MySQL's > > > > approach, > > > > for the moment. It won't have any performance concern, since built-in > > functions are all in memory and thus cost of a name check will be > > > > really > > > > trivial. > > > > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]><[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > From what I have seen, there are a couple of focal disagreements: > > > > 1. Resolution order: temp function --> flink built-in function --> > > > > catalog > > > > function vs flink built-in function --> temp function -> catalog > > > > function. > > > > 2. "External" built-in functions: how to treat built-in functions in > > external system and how users reference them > > > > For #1, I agree with Bowen that temp function needs to be at the > > > > highest > > > > priority because that's how a user might overwrite a built-in function > > without referencing a persistent, overwriting catalog function with a > > > > fully > > > > qualified name. Putting built-in functions at the highest priority > > eliminates that usage. > > > > For #2, I saw a general agreement on referencing "external" built-in > > functions such as those in Hive needs to be explicit and deterministic > > > > even > > > > though different approaches are proposed. To limit the scope and > > > > simply > > > > the > > > > usage, it seems making sense to me to introduce special syntax for > > > > user to > > > > explicitly reference an external built-in function such as hive1::sqrt > > > > or > > > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API > > > > call > > > > hive1.getFunction(ObjectPath functionName) where the database name is > > absent for bulit-in functions available in that catalog hive1. I > > > > understand > > > > that Bowen's original proposal was trying to avoid this, but this > > > > could > > > > turn out to be a clean and simple solution. > > > > (Timo's modular approach is great way to "expand" Flink's built-in > > > > function > > > > set, which seems orthogonal and complementary to this, which could be > > tackled in further future work.) > > > > I'd be happy to hear further thoughts on the two points. > > > > Thanks, > > Xuefu > > > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > <[hidden email]> <[hidden email]> <[hidden email]><[hidden email]> > <[hidden email]> <[hidden email]> <[hidden email]> <[hidden email]> > <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > > > the > > > > same > > as Bowen's. But after thinking about it, I'm currently lean to Timo's > > suggestion. > > > > The reason is backward compatibility. If we follow Bowen's approach, > > > > let's > > > > say we > > first find function in Flink's built-in functions, and then hive's > > built-in. For example, `foo` > > is not supported by Flink, but hive has such built-in function. So > > > > user > > > > will have hive's > > behavior for function `foo`. And in next release, Flink realize this > > > > is a > > > > very popular function > > and add it into Flink's built-in functions, but with different > > > > behavior > > > > as > > > > hive's. So in next > > release, the behavior changes. > > > > With Timo's approach, IIUC user have to tell the framework explicitly > > > > what > > > > kind of > > built-in functions he would like to use. He can just tell framework > > > > to > > > > abandon Flink's built-in > > functions, and use hive's instead. User can only choose between them, > > > > but > > > > not use > > them at the same time. I think this approach is more predictable. > > > > Best, > > Kurt > > > > > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]><[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > Hi all, > > > > Thanks for the feedback. Just a kindly reminder that the [Proposal] > > > > section > > > > in the google doc was updated, please take a look first and let me > > > > know > > > > if > > > > you have more questions. > > > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]><[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> > > > > wrote: > > > > Hi Timo, > > > > Re> 1) We should not have the restriction "hive built-in functions > > > > can > > > > only > > > > be used when current catalog is hive catalog". Switching a catalog > > should only have implications on the cat.db.object resolution but > > > > not > > > > functions. It would be quite convinient for users to use Hive > > > > built-ins > > > > even if they use a Confluent schema registry or just the in-memory > > > > catalog. > > > > There might be a misunderstanding here. > > > > First of all, Hive built-in functions are not part of Flink > > > > built-in > > > > functions, they are catalog functions, thus if the current catalog > > > > is > > > > not a > > > > HiveCatalog but, say, a schema registry catalog, ambiguous > > > > functions > > > > reference just shouldn't be resolved to a different catalog. > > > > Second, Hive built-in functions can potentially be referenced > > > > across > > > > catalog, but it doesn't have db namespace and we currently just > > > > don't > > > > have > > > > a SQL syntax for it. It can be enabled when such a SQL syntax is > > > > defined, > > > > e.g. "catalog::function", but it's out of scope of this FLIP. > > > > 2) I would propose to have separate concepts for catalog and > > > > built-in > > > > functions. In particular it would be nice to modularize built-in > > functions. Some built-in functions are very crucial (like AS, CAST, > > MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > > > maybe > > > > we add more experimental functions in the future or function for > > > > some > > > > special application area (Geo functions, ML functions). A data > > > > platform > > > > team might not want to make every built-in function available. Or a > > function module like ML functions is in a different Maven module. > > > > I think this is orthogonal to this FLIP, especially we don't have > > > > the > > > > "external built-in functions" anymore and currently the built-in > > > > function > > > > category remains untouched. > > > > But just to share some thoughts on the proposal, I'm not sure about > > > > it: > > > > - I don't know if any other databases handle built-in functions > > > > like > > > > that. > > > > Maybe you can give some examples? IMHO, built-in functions are > > > > system > > > > info > > > > and should be deterministic, not depending on loaded libraries. Geo > > functions should be either built-in already or just libraries > > > > functions, > > > > and library functions can be adapted to catalog APIs or of some > > > > other > > > > syntax to use > > - I don't know if all use cases stand, and many can be achieved by > > > > other > > > > approaches too. E.g. experimental functions can be taken good care > > > > of > > > > by > > > > documentations, annotations, etc > > - the proposal basically introduces some concept like a pluggable > > > > built-in > > > > function catalog, despite the already existing catalog APIs > > - it brings in even more complicated scenarios to the design. E.g. > > > > how > > > > do > > > > you handle built-in functions in different modules but different > > > > names? > > > > In short, I'm not sure if it really stands and it looks like an > > > > overkill > > > > to me. I'd rather not go to that route. Related discussion can be > > > > on > > > > its > > > > own thread. > > > > 3) Following the suggestion above, we can have a separate discovery > > mechanism for built-in functions. Instead of just going through a > > > > static > > > > list like in BuiltInFunctionDefinitions, a platform team should be > > > > able > > > > to select function modules like > > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > > HiveFunctions) or via service discovery; > > > > Same as above. I'll leave it to its own thread. > > > > re > 3) Dawid and I discussed the resulution order again. I agree > > > > with > > > > Kurt > > > > that we should unify built-in function (external or internal) > > > > under a > > > > common layer. However, the resolution order should be: > > 1. built-in functions > > 2. temporary functions > > 3. regular catalog resolution logic > > Otherwise a temporary function could cause clashes with Flink's > > > > built-in > > > > functions. If you take a look at other vendors, like SQL Server > > > > they > > > > also do not allow to overwrite built-in functions. > > > > ”I agree with Kurt that we should unify built-in function (external > > > > or > > > > internal) under a common layer.“ <- I don't think this is what Kurt > > > > means. > > > > Kurt and I are in favor of unifying built-in functions of external > > > > systems > > > > and catalog functions. Did you type a mistake? > > > > Besides, I'm not sure about the resolution order you proposed. > > > > Temporary > > > > functions have a lifespan over a session and are only visible to > > > > the > > > > session owner, they are unique to each user, and users create them > > > > on > > > > purpose to be the highest priority in order to overwrite system > > > > info > > > > (built-in functions in this case). > > > > In your case, why would users name a temporary function the same > > > > as a > > > > built-in function then? Since using that name in ambiguous function > > reference will always be resolved to built-in functions, creating a > > same-named temp function would be meaningless in the end. > > > > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]><[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> > > > > wrote: > > > > Hi Jingsong, > > > > Re> 1.Hive built-in functions is an intermediate solution. So we > > > > should > > > > not introduce interfaces to influence the framework. To make > > Flink itself more powerful, we should implement the functions > > we need to add. > > > > Yes, please see the doc. > > > > Re> 2.Non-flink built-in functions are easy for users to change > > > > their > > > > behavior. If we support some flink built-in functions in the > > future but act differently from non-flink built-in, this will > > > > lead > > > > to > > > > changes in user behavior. > > > > There's no such concept as "external built-in functions" any more. > > Built-in functions of external systems will be treated as special > > > > catalog > > > > functions. > > > > Re> Another question is, does this fallback include all > > > > hive built-in functions? As far as I know, some hive functions > > have some hacky. If possible, can we start with a white list? > > Once we implement some functions to flink built-in, we can > > also update the whitelist. > > > > Yes, that's something we thought of too. I don't think it's super > > critical to the scope of this FLIP, thus I'd like to leave it to > > > > future > > > > efforts as a nice-to-have feature. > > > > > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]><[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> > > > > wrote: > > > > Hi Kurt, > > > > Re: > What I want to propose is we can merge #3 and #4, make them > > > > both > > > > under > > > > "catalog" concept, by extending catalog function to make it have > > > > ability to > > > > have built-in catalog functions. Some benefits I can see from > > > > this > > > > approach: > > > > 1. We don't have to introduce new concept like external built-in > > > > functions. > > > > Actually I don't see a full story about how to treat a built-in > > > > functions, and it > > > > seems a little bit disrupt with catalog. As a result, you have > > > > to > > > > make > > > > some restriction > > > > like "hive built-in functions can only be used when current > > > > catalog > > > > is > > > > hive catalog". > > > > Yes, I've unified #3 and #4 but it seems I didn't update some > > > > part > > > > of > > > > the doc. I've modified those sections, and they are up to date > > > > now. > > > > In short, now built-in function of external systems are defined > > > > as > > > > a > > > > special kind of catalog function in Flink, and handled by Flink > > > > as > > > > following: > > - An external built-in function must be associated with a catalog > > > > for > > > > the purpose of decoupling flink-table and external systems. > > - It always resides in front of catalog functions in ambiguous > > > > function > > > > reference order, just like in its own external system > > - It is a special catalog function that doesn’t have a > > > > schema/database > > > > namespace > > - It goes thru the same instantiation logic as other user defined > > catalog functions in the external system > > > > Please take another look at the doc, and let me know if you have > > > > more > > > > questions. > > > > > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]><[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> <[hidden email]> <[hidden email]> > > > > wrote: > > > > Hi Kurt, > > > > it should not affect the functions and operations we currently > > > > have > > > > in > > > > SQL. It just categorizes the available built-in functions. It is > > > > kind > > > > of > > an orthogonal concept to the catalog API but built-in functions > > > > deserve > > > > this special kind of treatment. CatalogFunction still fits > > > > perfectly > > > > in > > > > there because the regular catalog object resolution logic is not > > affected. So tables and functions are resolved in the same way > > > > but > > > > with > > > > built-in functions that have priority as in the original design. > > > > Regards, > > Timo > > > > > > On 03.09.19 15:26, Kurt Young wrote: > > > > Does this only affect the functions and operations we currently > > > > have > > > > in SQL > > > > and > > have no effect on tables, right? Looks like this is an > > > > orthogonal > > > > concept > > > > with Catalog? > > If the answer are both yes, then the catalog function will be a > > > > weird > > > > concept? > > > > Best, > > Kurt > > > > > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email] > > > > wrote: > > > > The way you proposed are basically the same as what Calcite > > > > does, I > > > > think > > > > we are in the same line. > > > > Best, > > Danny Chan > > 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > > > > ,写道: > > > > This sounds exactly as the module approach I mentioned, no? > > > > Regards, > > Timo > > > > On 03.09.19 13:42, Danny Chan wrote: > > > > Thanks Bowen for bring up this topic, I think it’s a useful > > > > refactoring to make our function usage more user friendly. > > > > For the topic of how to organize the builtin operators and > > > > operators > > > > of Hive, here is a solution from Apache Calcite, the Calcite > > > > way > > > > is > > > > to make > > > > every dialect operators a “Library”, user can specify which > > > > libraries they > > > > want to use for a sql query. The builtin operators always > > > > comes > > > > as > > > > the > > > > first class objects and the others are used from the order > > > > they > > > > appears. > > > > Maybe you can take a reference. > > > > [1] > > > > > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > > > Best, > > Danny Chan > > 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > > > > ,写道: > > > > Hi folks, > > > > I'd like to kick off a discussion on reworking Flink's > > > > FunctionCatalog. > > > > It's critically helpful to improve function usability in > > > > SQL. > > > > > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > > > In short, it: > > - adds support for precise function reference with > > > > fully/partially > > > > qualified name > > - redefines function resolution order for ambiguous > > > > function > > > > reference > > > > - adds support for Hive's rich built-in functions (support > > > > for > > > > Hive > > > > user > > > > defined functions was already added in 1.9.0) > > - clarifies the concept of temporary functions > > > > Would love to hear your thoughts. > > > > Bowen > > > > -- > > Xuefu Zhang > > > > "In Honey We Trust!" > > > > > > -- > > Xuefu Zhang > > > > "In Honey We Trust!" > > > > > > > > > > > > > > > > > > > |
Hi Fabian,
Thank you for your response. Regarding the temporary function, just wanted to clarify one thing: the 3-part identifier does not mean the user always has to provide the catalog & database explicitly. The same way user does not have to provide them in e.g. when creating permanent table, view etc. It means though functions are always stored within a database. The same way as all the permanent objects and other temporary objects(tables, views). If not given explicitly the current catalog & database would be used, both in the create statement or when using the function. Point taken though your preference would be to support overriding built-in functions. Best, Dawid On Wed, 11 Sep 2019, 21:14 Fabian Hueske, <[hidden email]> wrote: > Hi all, > > I'd like to add my opinion on this topic as well ;-) > > In general, I think overriding built-in function with temp functions has a > couple of benefits but also a few challenges: > > * Users can reimplement the behavior of a built-in functions of a different > system, e.g., for backward compatibility after a migration. > * I don't think that "accidental" overrides and surprising semantics are an > issue or dangerous. The user registered the temp function in the same > session and should therefore be aware of the changed semantics. > * I see that not all built-in functions can be overridden, like the CAST > example that Dawid gave. However, I think these should be a small fraction > and such functions could be blacklisted. Sure, that's not super consistent, > but should (IMO) not be a big issue in practice. > * Temp functions should be easy to use. Requiring a 3-part addressing makes > them a lot less user friendly, IMO. Users need to think about what catalog > and db to choose when registering them. Also using a temp function in a > query becomes less convenient. Moreover, I agree with Bowen's concerns that > a 3-part addressing scheme reduces the temporal appearance of the function. > > From the three possible solutions, my preference order is > 1) 1-part address with override of built-in > 2) 1-part address without override of built-in > 3) 3-part address > > Regarding the issue of external built-in functions, I don't think that > Timo's proposal of modules is fully orthogonal to this discussion. > A Hive function module could be an alternative to offering Hive functions > as part of Hive's catalog. > From a user's point of view, I think that modules would be a "cleaner" > integration ("Why do I need a Hive catalog if all I want to do is apply a > Hive function on a Kafka table?"). > However, the module approach clearly has the problem of dealing with > same-named functions in different modules (e.g., a Hive function and a > Flink built-in function). > The catalog approach as the benefit that functions can be addressed like > hiveCat::func (or a similar path). > > I'm not sure what's the best solution here. > > Cheers, > Fabian > > > Am Mo., 9. Sept. 2019 um 06:30 Uhr schrieb Bowen Li <[hidden email]>: > > > Hi, > > > > W.r.t temp functions, I feel both options have their benefits and can > > theoretically achieve similar functionalities one way or another. In the > > end, it's more about use cases, users habits, and trade-offs. > > > > Re> Not always users are in full control of the catalog functions. There > is > > also the case where different teams manage the catalog & use the catalog. > > > > Temp functions live within a session, and not within a catalog. Having > > 3-part paths may implies temp functions are tied to a catalog in two > > aspects. > > 1) it may indicate each catalog manages their temp functions, which is > not > > true as we seem all agree they should reside at a central place, either > in > > FunctionCatalog or CatalogManager > > 2) it may indicate there's some access control. When users are forbidden > to > > manipulate some objects in the catalog that's managed by other teams, but > > are allowed to manipulate some other objects (temp functions in this > case) > > belonging to the catalog in namespaces, users may think we introduced > extra > > complexity and confusion with some kind of access control into the > problem. > > It doesn't feel intuitive enough for end users. > > > > Thus, I'd be in favor of 1-part path for temporary functions, and other > > temp objects. > > > > Thanks, > > Bowen > > > > > > > > On Fri, Sep 6, 2019 at 2:16 AM Dawid Wysakowicz <[hidden email]> > > wrote: > > > > > I agree the consequences of the decision are substantial. Let's see > what > > > others think. > > > > > > -- Catalog functions are defined by users, and we suppose they can > > > drop/alter it in any way they want. Thus, overwriting a catalog > function > > > doesn't seem to be a strong use case that we should be concerned about. > > > Rather, there are known use case for overwriting built-in functions. > > > > > > Not always users are in full control of the catalog functions. There is > > > also the case where different teams manage the catalog & use the > catalog. > > > As for overriding built-in functions with 3-part approach user can > always > > > use an equally named function from a catalog. E.g. to override > > > > > > * SELECT explode(arr) FROM ...* > > > > > > user can always write: > > > > > > * SELECT db.explode(arr) FROM ...* > > > > > > Best, > > > > > > Dawid > > > On 06/09/2019 10:54, Xuefu Z wrote: > > > > > > Hi Dawid, > > > > > > Thank you for your summary. While the only difference in the two > > proposals > > > is one- or three-part in naming, the consequence would be substantial. > > > > > > To me, there are two major use cases of temporary functions compared to > > > persistent ones: > > > 1. Temporary in nature and auto managed by the session. More often than > > > not, admin doesn't even allow user to create persistent functions. > > > 2. Provide an opportunity to overwriting system built-in functions. > > > > > > Since built-in functions has one-part name, requiring three-part name > for > > > temporary functions eliminates the overwriting opportunity. > > > > > > One-part naming essentially puts all temp functions under a single > > > namespace and simplifies function resolution, such as we don't need to > > > consider the case of a temp function and a persistent function with the > > > same name under the same database. > > > > > > I agree having three-parts does have its merits, such as consistency > with > > > other temporary objects (table) and minor difference between temp vs > > > catalog functions. However, there is a slight difference between tables > > and > > > function in that there is no built-in table in SQL so there is no need > to > > > overwrite it. > > > > > > I'm not sure if I fully agree the benefits you listed as the advantages > > of > > > the three-part naming of temp functions. > > > -- Allowing overwriting built-in functions is a benefit and the > > solution > > > for disallowing certain overwriting shouldn't be totally banning it. > > > -- Catalog functions are defined by users, and we suppose they can > > > drop/alter it in any way they want. Thus, overwriting a catalog > function > > > doesn't seem to be a strong use case that we should be concerned about. > > > Rather, there are known use case for overwriting built-in functions. > > > > > > Thus, personally I would prefer one-part name for temporary functions. > In > > > lack of SQL standard on this, I certainly like to get opinions from > > others > > > to see if a consensus can be eventually reached. > > > > > > (To your point on modular approach to support external built-in > > functions, > > > we saw the value and are actively looking into it. Thanks for sharing > > your > > > opinion on that.) > > > > > > Thanks, > > > Xuefu > > > > > > On Fri, Sep 6, 2019 at 3:48 PM Dawid Wysakowicz < > [hidden email]> > > <[hidden email]> > > > wrote: > > > > > > > > > Hi Xuefu, > > > > > > Thank you for your answers. > > > > > > Let me summarize my understanding. In principle we differ only in > regards > > > to the fact if a temporary function can be only 1-part or only 3-part > > > identified. I can reconfirm that if the community decides it prefers > the > > > 1-part approach I will commit to that, with the assumption that we will > > > force ONLY 1-part function names. (We will parse identifier and throw > > > exception if a user tries to register e.g. db.temp_func). > > > > > > My preference is though the 3-part approach: > > > > > > - there are some functions that it makes no sense to override, e.g. > > > CAST, moreover I'm afraid that allowing overriding such will lead to > > high > > > inconsistency, similar to those that I mentioned spark has > > > - you cannot shadow a fully-qualified function. (If a user fully > > > qualifies his/her objects in a SQL query, which is often considered > a > > good > > > practice) > > > - it does not differentiate between functions & temporary functions. > > > Temporary functions just differ with regards to their life-cycle. > The > > > registration & usage is exactly the same. > > > > > > As it can be seen, the proposed concept regarding temp function and > > > function resolution is quite simple. > > > > > > Both approaches are equally simple. I would even say the 3-part > approach > > > is slightly simpler as it does not have to care about some special > > built-in > > > functions such as CAST. > > > > > > I don't want to express my opinion on the differentiation between > > built-in > > > functions and "external" built-in functions in this thread as it is > > rather > > > orthogonal, but I also like the modular approach and I definitely don't > > > like the special syntax "cat::function". I think it's better to stick > to > > a > > > standard or at least other proved solutions from other systems. > > > > > > Best, > > > > > > Dawid > > > On 05/09/2019 10:12, Xuefu Z wrote: > > > > > > Hi David, > > > > > > Thanks for sharing your thoughts and request for clarifications. I > > believe > > > that I fully understood your proposal, which does has its merit. > However, > > > it's different from ours. Here are the answers to your questions: > > > > > > Re #1: yes, the temp functions in the proposal are global and have just > > > one-part names, similar to built-in functions. Two- or three-part names > > are > > > not allowed. > > > > > > Re #2: not applicable as two- or three-part names are disallowed. > > > > > > Re #3: same as above. Referencing external built-in functions is > achieved > > > either implicitly (only the built-in functions in the current catalogs > > are > > > considered) or via special syntax such as cat::function. However, we > are > > > looking into the modular approach that Time suggested with other > feedback > > > received from the community. > > > > > > Re #4: the resolution order goes like the following in our proposal: > > > > > > 1. temporary functions > > > 2. bulit-in functions (including those augmented by add-on modules) > > > 3. built-in functions in current catalog (this will not be needed if > the > > > special syntax "cat::function" is required) > > > 4. functions in current catalog and db. > > > > > > If we go with the modular approach and make external built-in functions > > as > > > an add-on module, the 2 and 3 above will be combined. In essence, the > > > resolution order is equivalent in the two approaches. > > > > > > By the way, resolution order matters only for simple name reference. > For > > > names such as db.function (interpreted as current_cat/db/function) or > > > cat.db.function, the reference is unambiguous, so on resolution is > > needed. > > > > > > As it can be seen, the proposed concept regarding temp function and > > > function resolution is quite simple. Additionally, the proposed > > resolution > > > order allows temp function to shadow a built-in function, which is > > > important (though not decisive) in our opinion. > > > > > > I started liking the modular approach as the resolution order will only > > > include 1, 2, and 4, which is simpler and more generic. That's why I > > > suggested we look more into this direction. > > > > > > Please let me know if there are further questions. > > > > > > Thanks, > > > Xuefu > > > > > > > > > > > > > > > On Thu, Sep 5, 2019 at 2:42 PM Dawid Wysakowicz < > [hidden email]> > > <[hidden email]> <[hidden email]> < > [hidden email]> > > > wrote: > > > > > > > > > Hi Xuefu, > > > > > > Just wanted to summarize my opinion on the one topic (temporary > > functions). > > > > > > My preference would be to make temporary functions always 3-part > > qualified > > > (as a result that would prohibit overriding built-in functions). Having > > > said that if the community decides that it's better to allow overriding > > > built-in functions I am fine with it and can commit to that decision. > > > > > > I wanted to ask if you could clarify a few points for me around that > > > option. > > > > > > 1. Would you enforce temporary functions to be always just a single > > > name (without db & cat) as hive does, or would you allow also 3 or > > even 2 > > > part identifiers? > > > 2. Assuming 2/3-part paths. How would you register a function from a > > > following statement: CREATE TEMPORARY FUNCTION db.func? Would that > > shadow > > > all functions named 'func' in all databases named 'db' in all > > catalogs? Or > > > would you shadow only function 'func' in database 'db' in current > > catalog? > > > 3. This point is still under discussion, but was mentioned a few > > > times, that maybe we want to enable syntax cat.func for "external > > built-in > > > functions". How would that affect statement from previous point? > Would > > > 'db.func' shadow "external built-in function" in 'db' catalog or > user > > > functions as in point 2? Or maybe both? > > > 4. Lastly in fact to summarize the previous points. Assuming > 2/3-part > > > paths. Would the function resolution be actually as follows?: > > > 1. temporary functions (1-part path) > > > 2. built-in functions > > > 3. temporary functions (2-part path) > > > 4. 2-part catalog functions a.k.a. "external built-in functions" > > > (cat + func) - this is still under discussion, if we want that in > > the other > > > focal point > > > 5. temporary functions (3-part path) > > > 6. 3-part catalog functions a.k.a. user functions > > > > > > I would be really grateful if you could explain me those questions, > > thanks. > > > > > > BTW, Thank you all for a healthy discussion. > > > > > > Best, > > > > > > Dawid > > > On 04/09/2019 23:25, Xuefu Z wrote: > > > > > > Thank all for the sharing thoughts. I think we have gathered some > useful > > > initial feedback from this long discussion with a couple of focal > points > > > sticking out. > > > > > > We will go back to do more research and adapt our proposal. Once it's > > > ready, we will ask for a new round of review. If there is any > > disagreement, > > > we will start a new discussion thread on each rather than having a mega > > > discussion like this. > > > > > > Thanks to everyone for participating. > > > > > > Regards, > > > Xuefu > > > > > > > > > On Thu, Sep 5, 2019 at 2:52 AM Bowen Li <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> wrote: > > > > > > > > > Let me try to summarize and conclude the long thread so far: > > > > > > 1. For order of temp function v.s. built-in function: > > > > > > I think Dawid's point that temp function should be of fully qualified > > path > > > is a better reasoning to back the newly proposed order, and i agree we > > > don't need to follow Hive/Spark. > > > > > > However, I'd rather not change fundamentals of temporary functions in > > this > > > FLIP. It belongs to a bigger story of how temporary objects should be > > > redefined and be handled uniformly - currently temporary tables and > views > > > (those registered from TableEnv#registerTable()) behave different than > > what > > > Dawid propose for temp functions, and we need a FLIP to just unify > their > > > APIs and behaviors. > > > > > > I agree that backward compatibility is not an issue w.r.t Jark's > points. > > > > > > ***Seems we do have consensus that it's acceptable to prevent users > > > registering a temp function in the same name as a built-in function. To > > > help us move forward, I'd like to propose setting such a restraint on > > temp > > > functions in this FLIP to simplify the design and avoid disputes.*** It > > > will also leave rooms for improvements in the future. > > > > > > > > > 2. For Hive built-in function: > > > > > > Thanks Timo for providing the Presto and Postgres examples. I feel > > modular > > > built-in functions can be a good fit for the geo and ml example as a > > native > > > Flink extension, but not sure if it fits well with external > integrations. > > > Anyway, I think modular built-in functions is a bigger story and can be > > on > > > its own thread too, and our proposal doesn't prevent Flink from doing > > that > > > in the future. > > > > > > ***Seems we have consensus that users should be able to use built-in > > > functions of Hive or other external systems in SQL explicitly and > > > deterministically regardless of Flink built-in functions and the > > potential > > > modular built-in functions, via some new syntax like "mycat::func"? If > > so, > > > I'd like to propose removing Hive built-in functions from ambiguous > > > function resolution order, and empower users with such a syntax. This > way > > > we sacrifice a little convenience for certainty*** > > > > > > > > > What do you think? > > > > > > On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz < > [hidden email]> > > <[hidden email]> <[hidden email]> < > [hidden email]> > > <[hidden email]> <[hidden email]> < > [hidden email]> > > <[hidden email]> > > > wrote: > > > > > > > > > Hi, > > > > > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just > > > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they > > > > > > are > > > > > > very inconsistent in that manner (spark being way worse on that). > > > > > > Hive: > > > > > > You cannot overwrite all the built-in functions. I could overwrite most > > > > > > of > > > > > > the functions I tried e.g. length, e, pi, round, rtrim, but there are > > > functions I cannot overwrite e.g. CAST, ARRAY I get: > > > > > > > > > * ParseException line 1:29 cannot recognize input near 'array' 'AS' > * > > > > > > What is interesting is that I cannot ovewrite *array*, but I can > ovewrite > > > *map* or *struct*. Though hive behaves reasonable well if I manage to > > > overwrite a function. When I drop the temporary function the native > > > function is still available. > > > > > > Spark: > > > > > > Spark's behavior imho is super bad. > > > > > > Theoretically I could overwrite all functions. I was able e.g. to > > > overwrite CAST function. I had to use though CREATE OR REPLACE > TEMPORARY > > > FUNCTION syntax. Otherwise I get an exception that a function already > > > exists. However when I used the CAST function in a query it used the > > > native, built-in one. > > > > > > When I overwrote current_date() function, it was used in a query, but > it > > > completely replaces the built-in function and I can no longer use the > > > native function in any way. I cannot also drop the temporary function. > I > > > get: > > > > > > * Error in query: Cannot drop native function 'current_date';* > > > > > > Additional note, both systems do not allow creating TEMPORARY FUNCTIONS > > > with a database. Temporary functions are always represented as a single > > > name. > > > > > > In my opinion neither of the systems have consistent behavior. > Generally > > > speaking I think overwriting any system provided functions is just > > > dangerous. > > > > > > Regarding Jark's concerns. Such functions would be registered in a > > > > > > current > > > > > > catalog/database schema, so a user could still use its own function, > but > > > would have to fully qualify the function (because built-in functions > take > > > precedence). Moreover users would have the same problem with permanent > > > functions. Imagine a user have a permanent function 'cat.db.explode'. > In > > > 1.9 the user could use just the 'explode' function as long as the > 'cat' & > > > 'db' were the default catalog & database. If we introduce 'explode' > > > built-in function in 1.10, the user has to fully qualify the function. > > > > > > Best, > > > > > > Dawid > > > On 04/09/2019 15:19, Timo Walther wrote: > > > > > > Hi all, > > > > > > thanks for the healthy discussion. It is already a very long discussion > > > with a lot of text. So I will just post my opinion to a couple of > > > statements: > > > > > > > > > Hive built-in functions are not part of Flink built-in functions, they > > > > > > are catalog functions > > > > > > That is not entirely true. Correct me if I'm wrong but I think Hive > > > built-in functions are also not catalog functions. They are not stored > in > > > every Hive metastore catalog that is freshly created but are a set of > > > functions that are listed somewhere and made available. > > > > > > > > > ambiguous functions reference just shouldn't be resolved to a different > > > > > > catalog > > > > > > I agree. They should not be resolved to a different catalog. That's > why I > > > am suggesting to split the concept of built-in functions and catalog > > > > > > lookup > > > > > > semantics. > > > > > > > > > I don't know if any other databases handle built-in functions like that > > > > > > What I called "module" is: > > > - Extension in Postgres [1] > > > - Plugin in Presto [2] > > > > > > Btw. Presto even mentions example modules that are similar to the ones > > > that we will introduce in the near future both for ML and System XYZ > > > compatibility: > > > "See either the presto-ml module for machine learning functions or the > > > presto-teradata-functions module for Teradata-compatible functions, > both > > > > > > in > > > > > > the root of the Presto source." > > > > > > > > > functions should be either built-in already or just libraries > > > > > > functions, > > > > > > and library functions can be adapted to catalog APIs or of some other > > > syntax to use > > > > > > Regarding "built-in already", of course we can add a lot of functions > as > > > built-ins but we will end-up in a dependency hell in the near future if > > > > > > we > > > > > > don't introduce a pluggable approach. Library functions is what you > also > > > suggest but storing them in a catalog means to always fully qualify > them > > > > > > or > > > > > > modifying the existing catalog design that was inspired by the > standard. > > > > > > I don't think "it brings in even more complicated scenarios to the > > > design", it just does clear separation of concerns. Integrating the > > > functionality into the current design makes the catalog API more > > > complicated. > > > > > > > > > why would users name a temporary function the same as a built-in > > > > > > function then? > > > > > > Because you never know what users do. If they don't, my suggested > > > resolution order should not be a problem, right? > > > > > > > > > I don't think hive functions deserves be a function module > > > > > > Our goal is not to create a Hive clone. We need to think forward and > Hive > > > is just one of many systems that we can support. Not every built-in > > > function behaves and will behave exactly like Hive. > > > > > > > > > regarding temporary functions, there are few systems that support it > > > > > > IMHO Spark and Hive are not always the best examples for consistent > > > design. Systems like Postgres, Presto, or SQL Server should be used as > a > > > reference. I don't think that a user can overwrite a built-in function > > > there. > > > > > > Regards, > > > Timo > > > > > > [1] https://www.postgresql.org/docs/10/extend-extensions.html > > > [2] https://prestodb.github.io/docs/current/develop/functions.html > > > > > > > > > On 04.09.19 13:44, Jark Wu wrote: > > > > > > Hi all, > > > > > > Regarding #1 temp function <> built-in function and naming. > > > I'm fine with temp functions should precede built-in function and can > > > override built-in functions (we already support to override built-in > > > function in 1.9). > > > If we don't allow the same name as a built-in function, I'm afraid we > > > > > > will > > > > > > have compatibility issues in the future. > > > Say users register a user defined function named "explode" in 1.9, and > we > > > support a built-in "explode" function in 1.10. > > > Then the user's jobs which call the registered "explode" function in > 1.9 > > > will all fail in 1.10 because of naming conflict. > > > > > > Regarding #2 "External" built-in functions. > > > I think if we store external built-in functions in catalog, then > > > "hive1::sqrt" is a good way to go. > > > However, I would prefer to support a discovery mechanism (e.g. SPI) for > > > built-in functions as Timo suggested above. > > > This gives us the flexibility to add Hive or MySQL or Geo or whatever > > > function set as built-in functions in an easy way. > > > > > > Best, > > > Jark > > > > > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]><[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > > > Hi David, > > > > > > Thank you for sharing your findings. It seems to me that there is no > SQL > > > standard regarding temporary functions. There are few systems that > > > > > > support > > > > > > it. Here are what I have found: > > > > > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > > > 2. Spark: basically follows Hive ( > > > > > > > > > > > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > > > > > > ) > > > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of > overwriting > > > behavior. ( > > http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html > > > > > > ) > > > > > > Because of lack of standard, it's perfectly fine for Flink to define > > > whatever it sees appropriate. Thus, your proposal (no overwriting and > > > > > > must > > > > > > have DB as holder) is one option. The advantage is simplicity, The > > > downside > > > is the deviation from Hive, which is popular and de facto standard in > big > > > data world. > > > > > > However, I don't think we have to follow Hive. More importantly, we > need > > > > > > a > > > > > > consensus. I have no objection if your proposal is generally agreed > upon. > > > > > > Thanks, > > > Xuefu > > > > > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz < > [hidden email] > > <[hidden email]> <[hidden email]> < > [hidden email]> > > <[hidden email]> <[hidden email]> < > [hidden email]> > > <[hidden email]> <[hidden email]> > > > wrote: > > > > > > Hi all, > > > > > > Just an opinion on the built-in <> temporary functions resolution and > > > NAMING issue. I think we should not allow overriding the built-in > > > functions, as this may pose serious issues and to be honest is rather > > > not feasible and would require major rework. What happens if a user > > > wants to override CAST? Calls to that function are generated at > > > different layers of the stack that unfortunately does not always go > > > through the Catalog API (at least yet). Moreover from what I've checked > > > no other systems allow overriding the built-in functions. All the > > > systems I've checked so far register temporary functions in a > > > database/schema (either special database for temporary functions, or > > > just current database). What I would suggest is to always register > > > temporary functions with a 3 part identifier. The same way as tables, > > > views etc. This effectively means you cannot override built-in > > > functions. With such approach it is natural that the temporary > functions > > > end up a step lower in the resolution order: > > > > > > 1. built-in functions (1 part, maybe 2? - this is still under > discussion) > > > > > > 2. temporary functions (always 3 part path) > > > > > > 3. catalog functions (always 3 part path) > > > > > > Let me know what do you think. > > > > > > Best, > > > > > > Dawid > > > > > > On 04/09/2019 06:13, Bowen Li wrote: > > > > > > Hi, > > > > > > I agree with Xuefu that the main controversial points are mainly the > > > > > > two > > > > > > places. My thoughts on them: > > > > > > 1) Determinism of referencing Hive built-in functions. We can either > > > > > > remove > > > > > > Hive built-in functions from ambiguous function resolution and require > > > users to use special syntax for their qualified names, or add a config > > > > > > flag > > > > > > to catalog constructor/yaml for turning on and off Hive built-in > > > > > > functions > > > > > > with the flag set to 'false' by default and proper doc added to help > > > > > > users > > > > > > make their decisions. > > > > > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > > > > > function > > > > > > resolution order. We believe Flink temp functions should precede Flink > > > built-in functions, and I have presented my reasons. Just in case if we > > > cannot reach an agreement, I propose forbid users registering temp > > > functions in the same name as a built-in function, like MySQL's > > > > > > approach, > > > > > > for the moment. It won't have any performance concern, since built-in > > > functions are all in memory and thus cost of a name check will be > > > > > > really > > > > > > trivial. > > > > > > > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]><[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > > > From what I have seen, there are a couple of focal disagreements: > > > > > > 1. Resolution order: temp function --> flink built-in function --> > > > > > > catalog > > > > > > function vs flink built-in function --> temp function -> catalog > > > > > > function. > > > > > > 2. "External" built-in functions: how to treat built-in functions in > > > external system and how users reference them > > > > > > For #1, I agree with Bowen that temp function needs to be at the > > > > > > highest > > > > > > priority because that's how a user might overwrite a built-in function > > > without referencing a persistent, overwriting catalog function with a > > > > > > fully > > > > > > qualified name. Putting built-in functions at the highest priority > > > eliminates that usage. > > > > > > For #2, I saw a general agreement on referencing "external" built-in > > > functions such as those in Hive needs to be explicit and deterministic > > > > > > even > > > > > > though different approaches are proposed. To limit the scope and > > > > > > simply > > > > > > the > > > > > > usage, it seems making sense to me to introduce special syntax for > > > > > > user to > > > > > > explicitly reference an external built-in function such as hive1::sqrt > > > > > > or > > > > > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API > > > > > > call > > > > > > hive1.getFunction(ObjectPath functionName) where the database name is > > > absent for bulit-in functions available in that catalog hive1. I > > > > > > understand > > > > > > that Bowen's original proposal was trying to avoid this, but this > > > > > > could > > > > > > turn out to be a clean and simple solution. > > > > > > (Timo's modular approach is great way to "expand" Flink's built-in > > > > > > function > > > > > > set, which seems orthogonal and complementary to this, which could be > > > tackled in further future work.) > > > > > > I'd be happy to hear further thoughts on the two points. > > > > > > Thanks, > > > Xuefu > > > > > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> > > <[hidden email]> <[hidden email]> <[hidden email]>< > [hidden email]> > > <[hidden email]> <[hidden email]> <[hidden email]> < > [hidden email]> > > <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > > > > > the > > > > > > same > > > as Bowen's. But after thinking about it, I'm currently lean to Timo's > > > suggestion. > > > > > > The reason is backward compatibility. If we follow Bowen's approach, > > > > > > let's > > > > > > say we > > > first find function in Flink's built-in functions, and then hive's > > > built-in. For example, `foo` > > > is not supported by Flink, but hive has such built-in function. So > > > > > > user > > > > > > will have hive's > > > behavior for function `foo`. And in next release, Flink realize this > > > > > > is a > > > > > > very popular function > > > and add it into Flink's built-in functions, but with different > > > > > > behavior > > > > > > as > > > > > > hive's. So in next > > > release, the behavior changes. > > > > > > With Timo's approach, IIUC user have to tell the framework explicitly > > > > > > what > > > > > > kind of > > > built-in functions he would like to use. He can just tell framework > > > > > > to > > > > > > abandon Flink's built-in > > > functions, and use hive's instead. User can only choose between them, > > > > > > but > > > > > > not use > > > them at the same time. I think this approach is more predictable. > > > > > > Best, > > > Kurt > > > > > > > > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]><[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > > > Hi all, > > > > > > Thanks for the feedback. Just a kindly reminder that the [Proposal] > > > > > > section > > > > > > in the google doc was updated, please take a look first and let me > > > > > > know > > > > > > if > > > > > > you have more questions. > > > > > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]><[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> > > > > > > wrote: > > > > > > Hi Timo, > > > > > > Re> 1) We should not have the restriction "hive built-in functions > > > > > > can > > > > > > only > > > > > > be used when current catalog is hive catalog". Switching a catalog > > > should only have implications on the cat.db.object resolution but > > > > > > not > > > > > > functions. It would be quite convinient for users to use Hive > > > > > > built-ins > > > > > > even if they use a Confluent schema registry or just the in-memory > > > > > > catalog. > > > > > > There might be a misunderstanding here. > > > > > > First of all, Hive built-in functions are not part of Flink > > > > > > built-in > > > > > > functions, they are catalog functions, thus if the current catalog > > > > > > is > > > > > > not a > > > > > > HiveCatalog but, say, a schema registry catalog, ambiguous > > > > > > functions > > > > > > reference just shouldn't be resolved to a different catalog. > > > > > > Second, Hive built-in functions can potentially be referenced > > > > > > across > > > > > > catalog, but it doesn't have db namespace and we currently just > > > > > > don't > > > > > > have > > > > > > a SQL syntax for it. It can be enabled when such a SQL syntax is > > > > > > defined, > > > > > > e.g. "catalog::function", but it's out of scope of this FLIP. > > > > > > 2) I would propose to have separate concepts for catalog and > > > > > > built-in > > > > > > functions. In particular it would be nice to modularize built-in > > > functions. Some built-in functions are very crucial (like AS, CAST, > > > MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > > > > > maybe > > > > > > we add more experimental functions in the future or function for > > > > > > some > > > > > > special application area (Geo functions, ML functions). A data > > > > > > platform > > > > > > team might not want to make every built-in function available. Or a > > > function module like ML functions is in a different Maven module. > > > > > > I think this is orthogonal to this FLIP, especially we don't have > > > > > > the > > > > > > "external built-in functions" anymore and currently the built-in > > > > > > function > > > > > > category remains untouched. > > > > > > But just to share some thoughts on the proposal, I'm not sure about > > > > > > it: > > > > > > - I don't know if any other databases handle built-in functions > > > > > > like > > > > > > that. > > > > > > Maybe you can give some examples? IMHO, built-in functions are > > > > > > system > > > > > > info > > > > > > and should be deterministic, not depending on loaded libraries. Geo > > > functions should be either built-in already or just libraries > > > > > > functions, > > > > > > and library functions can be adapted to catalog APIs or of some > > > > > > other > > > > > > syntax to use > > > - I don't know if all use cases stand, and many can be achieved by > > > > > > other > > > > > > approaches too. E.g. experimental functions can be taken good care > > > > > > of > > > > > > by > > > > > > documentations, annotations, etc > > > - the proposal basically introduces some concept like a pluggable > > > > > > built-in > > > > > > function catalog, despite the already existing catalog APIs > > > - it brings in even more complicated scenarios to the design. E.g. > > > > > > how > > > > > > do > > > > > > you handle built-in functions in different modules but different > > > > > > names? > > > > > > In short, I'm not sure if it really stands and it looks like an > > > > > > overkill > > > > > > to me. I'd rather not go to that route. Related discussion can be > > > > > > on > > > > > > its > > > > > > own thread. > > > > > > 3) Following the suggestion above, we can have a separate discovery > > > mechanism for built-in functions. Instead of just going through a > > > > > > static > > > > > > list like in BuiltInFunctionDefinitions, a platform team should be > > > > > > able > > > > > > to select function modules like > > > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > > > HiveFunctions) or via service discovery; > > > > > > Same as above. I'll leave it to its own thread. > > > > > > re > 3) Dawid and I discussed the resulution order again. I agree > > > > > > with > > > > > > Kurt > > > > > > that we should unify built-in function (external or internal) > > > > > > under a > > > > > > common layer. However, the resolution order should be: > > > 1. built-in functions > > > 2. temporary functions > > > 3. regular catalog resolution logic > > > Otherwise a temporary function could cause clashes with Flink's > > > > > > built-in > > > > > > functions. If you take a look at other vendors, like SQL Server > > > > > > they > > > > > > also do not allow to overwrite built-in functions. > > > > > > ”I agree with Kurt that we should unify built-in function (external > > > > > > or > > > > > > internal) under a common layer.“ <- I don't think this is what Kurt > > > > > > means. > > > > > > Kurt and I are in favor of unifying built-in functions of external > > > > > > systems > > > > > > and catalog functions. Did you type a mistake? > > > > > > Besides, I'm not sure about the resolution order you proposed. > > > > > > Temporary > > > > > > functions have a lifespan over a session and are only visible to > > > > > > the > > > > > > session owner, they are unique to each user, and users create them > > > > > > on > > > > > > purpose to be the highest priority in order to overwrite system > > > > > > info > > > > > > (built-in functions in this case). > > > > > > In your case, why would users name a temporary function the same > > > > > > as a > > > > > > built-in function then? Since using that name in ambiguous function > > > reference will always be resolved to built-in functions, creating a > > > same-named temp function would be meaningless in the end. > > > > > > > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]><[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> > > > > > > wrote: > > > > > > Hi Jingsong, > > > > > > Re> 1.Hive built-in functions is an intermediate solution. So we > > > > > > should > > > > > > not introduce interfaces to influence the framework. To make > > > Flink itself more powerful, we should implement the functions > > > we need to add. > > > > > > Yes, please see the doc. > > > > > > Re> 2.Non-flink built-in functions are easy for users to change > > > > > > their > > > > > > behavior. If we support some flink built-in functions in the > > > future but act differently from non-flink built-in, this will > > > > > > lead > > > > > > to > > > > > > changes in user behavior. > > > > > > There's no such concept as "external built-in functions" any more. > > > Built-in functions of external systems will be treated as special > > > > > > catalog > > > > > > functions. > > > > > > Re> Another question is, does this fallback include all > > > > > > hive built-in functions? As far as I know, some hive functions > > > have some hacky. If possible, can we start with a white list? > > > Once we implement some functions to flink built-in, we can > > > also update the whitelist. > > > > > > Yes, that's something we thought of too. I don't think it's super > > > critical to the scope of this FLIP, thus I'd like to leave it to > > > > > > future > > > > > > efforts as a nice-to-have feature. > > > > > > > > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]><[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> > > > > > > wrote: > > > > > > Hi Kurt, > > > > > > Re: > What I want to propose is we can merge #3 and #4, make them > > > > > > both > > > > > > under > > > > > > "catalog" concept, by extending catalog function to make it have > > > > > > ability to > > > > > > have built-in catalog functions. Some benefits I can see from > > > > > > this > > > > > > approach: > > > > > > 1. We don't have to introduce new concept like external built-in > > > > > > functions. > > > > > > Actually I don't see a full story about how to treat a built-in > > > > > > functions, and it > > > > > > seems a little bit disrupt with catalog. As a result, you have > > > > > > to > > > > > > make > > > > > > some restriction > > > > > > like "hive built-in functions can only be used when current > > > > > > catalog > > > > > > is > > > > > > hive catalog". > > > > > > Yes, I've unified #3 and #4 but it seems I didn't update some > > > > > > part > > > > > > of > > > > > > the doc. I've modified those sections, and they are up to date > > > > > > now. > > > > > > In short, now built-in function of external systems are defined > > > > > > as > > > > > > a > > > > > > special kind of catalog function in Flink, and handled by Flink > > > > > > as > > > > > > following: > > > - An external built-in function must be associated with a catalog > > > > > > for > > > > > > the purpose of decoupling flink-table and external systems. > > > - It always resides in front of catalog functions in ambiguous > > > > > > function > > > > > > reference order, just like in its own external system > > > - It is a special catalog function that doesn’t have a > > > > > > schema/database > > > > > > namespace > > > - It goes thru the same instantiation logic as other user defined > > > catalog functions in the external system > > > > > > Please take another look at the doc, and let me know if you have > > > > > > more > > > > > > questions. > > > > > > > > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]><[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> <[hidden email]> <[hidden email]> > > > > > > wrote: > > > > > > Hi Kurt, > > > > > > it should not affect the functions and operations we currently > > > > > > have > > > > > > in > > > > > > SQL. It just categorizes the available built-in functions. It is > > > > > > kind > > > > > > of > > > an orthogonal concept to the catalog API but built-in functions > > > > > > deserve > > > > > > this special kind of treatment. CatalogFunction still fits > > > > > > perfectly > > > > > > in > > > > > > there because the regular catalog object resolution logic is not > > > affected. So tables and functions are resolved in the same way > > > > > > but > > > > > > with > > > > > > built-in functions that have priority as in the original design. > > > > > > Regards, > > > Timo > > > > > > > > > On 03.09.19 15:26, Kurt Young wrote: > > > > > > Does this only affect the functions and operations we currently > > > > > > have > > > > > > in SQL > > > > > > and > > > have no effect on tables, right? Looks like this is an > > > > > > orthogonal > > > > > > concept > > > > > > with Catalog? > > > If the answer are both yes, then the catalog function will be a > > > > > > weird > > > > > > concept? > > > > > > Best, > > > Kurt > > > > > > > > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email] > > > > > > wrote: > > > > > > The way you proposed are basically the same as what Calcite > > > > > > does, I > > > > > > think > > > > > > we are in the same line. > > > > > > Best, > > > Danny Chan > > > 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > > > > > > ,写道: > > > > > > This sounds exactly as the module approach I mentioned, no? > > > > > > Regards, > > > Timo > > > > > > On 03.09.19 13:42, Danny Chan wrote: > > > > > > Thanks Bowen for bring up this topic, I think it’s a useful > > > > > > refactoring to make our function usage more user friendly. > > > > > > For the topic of how to organize the builtin operators and > > > > > > operators > > > > > > of Hive, here is a solution from Apache Calcite, the Calcite > > > > > > way > > > > > > is > > > > > > to make > > > > > > every dialect operators a “Library”, user can specify which > > > > > > libraries they > > > > > > want to use for a sql query. The builtin operators always > > > > > > comes > > > > > > as > > > > > > the > > > > > > first class objects and the others are used from the order > > > > > > they > > > > > > appears. > > > > > > Maybe you can take a reference. > > > > > > [1] > > > > > > > > > > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > > > > > Best, > > > Danny Chan > > > 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > > > > > > ,写道: > > > > > > Hi folks, > > > > > > I'd like to kick off a discussion on reworking Flink's > > > > > > FunctionCatalog. > > > > > > It's critically helpful to improve function usability in > > > > > > SQL. > > > > > > > > > > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > > > > > In short, it: > > > - adds support for precise function reference with > > > > > > fully/partially > > > > > > qualified name > > > - redefines function resolution order for ambiguous > > > > > > function > > > > > > reference > > > > > > - adds support for Hive's rich built-in functions (support > > > > > > for > > > > > > Hive > > > > > > user > > > > > > defined functions was already added in 1.9.0) > > > - clarifies the concept of temporary functions > > > > > > Would love to hear your thoughts. > > > > > > Bowen > > > > > > -- > > > Xuefu Zhang > > > > > > "In Honey We Trust!" > > > > > > > > > -- > > > Xuefu Zhang > > > > > > "In Honey We Trust!" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
Hi,
Thanks @Fabian @Dawid and everyone else for sharing your thoughts! First, I'd like to take Hive built-in functions out of this FLIP to keep our original scope and make it less controversial on a potential modular approach. I will remove Hive built-in functions from the google doc. Then the focus of debate is mainly function resolution order and temp function namespace, which are somewhat related. I roughly summarized this thread, and currently we are debating on two approaches with preference from the following people: Option 1: Proposal: temp functions will be of 1-part path (function name only), and can override built-in functions. The ambiguous function resolution order is thus 1) temp functions 2) built-in functions 3) catalog functions in the current catalog/database Votes: Xuefu, Bowen, Fabian, Jark Option 2: Proposal: temp functions will be of 3-part path (with catalog, database, and function name), and temp functions cannot override built-in functions. The ambiguous function resolution order is thus 1) built-in functions 2) temp functions (in 3-part path) 3) catalog functions in the current catalog/database Votes: Dawid, Timo Do you think we need a separate voting thread on the two options in the community, or are we able to conclude from the above summary? On Wed, Sep 11, 2019 at 8:09 AM Dawid Wysakowicz <[hidden email]> wrote: > Hi Fabian, > Thank you for your response. > Regarding the temporary function, just wanted to clarify one thing: the > 3-part identifier does not mean the user always has to provide the catalog > & database explicitly. The same way user does not have to provide them in > e.g. when creating permanent table, view etc. It means though functions are > always stored within a database. The same way as all the permanent objects > and other temporary objects(tables, views). If not given explicitly the > current catalog & database would be used, both in the create statement or > when using the function. > > Point taken though your preference would be to support overriding built-in > functions. > > Best, > Dawid > > On Wed, 11 Sep 2019, 21:14 Fabian Hueske, <[hidden email]> wrote: > > > Hi all, > > > > I'd like to add my opinion on this topic as well ;-) > > > > In general, I think overriding built-in function with temp functions has > a > > couple of benefits but also a few challenges: > > > > * Users can reimplement the behavior of a built-in functions of a > different > > system, e.g., for backward compatibility after a migration. > > * I don't think that "accidental" overrides and surprising semantics are > an > > issue or dangerous. The user registered the temp function in the same > > session and should therefore be aware of the changed semantics. > > * I see that not all built-in functions can be overridden, like the CAST > > example that Dawid gave. However, I think these should be a small > fraction > > and such functions could be blacklisted. Sure, that's not super > consistent, > > but should (IMO) not be a big issue in practice. > > * Temp functions should be easy to use. Requiring a 3-part addressing > makes > > them a lot less user friendly, IMO. Users need to think about what > catalog > > and db to choose when registering them. Also using a temp function in a > > query becomes less convenient. Moreover, I agree with Bowen's concerns > that > > a 3-part addressing scheme reduces the temporal appearance of the > function. > > > > From the three possible solutions, my preference order is > > 1) 1-part address with override of built-in > > 2) 1-part address without override of built-in > > 3) 3-part address > > > > Regarding the issue of external built-in functions, I don't think that > > Timo's proposal of modules is fully orthogonal to this discussion. > > A Hive function module could be an alternative to offering Hive functions > > as part of Hive's catalog. > > From a user's point of view, I think that modules would be a "cleaner" > > integration ("Why do I need a Hive catalog if all I want to do is apply a > > Hive function on a Kafka table?"). > > However, the module approach clearly has the problem of dealing with > > same-named functions in different modules (e.g., a Hive function and a > > Flink built-in function). > > The catalog approach as the benefit that functions can be addressed like > > hiveCat::func (or a similar path). > > > > I'm not sure what's the best solution here. > > > > Cheers, > > Fabian > > > > > > Am Mo., 9. Sept. 2019 um 06:30 Uhr schrieb Bowen Li <[hidden email] > >: > > > > > Hi, > > > > > > W.r.t temp functions, I feel both options have their benefits and can > > > theoretically achieve similar functionalities one way or another. In > the > > > end, it's more about use cases, users habits, and trade-offs. > > > > > > Re> Not always users are in full control of the catalog functions. > There > > is > > > also the case where different teams manage the catalog & use the > catalog. > > > > > > Temp functions live within a session, and not within a catalog. Having > > > 3-part paths may implies temp functions are tied to a catalog in two > > > aspects. > > > 1) it may indicate each catalog manages their temp functions, which is > > not > > > true as we seem all agree they should reside at a central place, either > > in > > > FunctionCatalog or CatalogManager > > > 2) it may indicate there's some access control. When users are > forbidden > > to > > > manipulate some objects in the catalog that's managed by other teams, > but > > > are allowed to manipulate some other objects (temp functions in this > > case) > > > belonging to the catalog in namespaces, users may think we introduced > > extra > > > complexity and confusion with some kind of access control into the > > problem. > > > It doesn't feel intuitive enough for end users. > > > > > > Thus, I'd be in favor of 1-part path for temporary functions, and other > > > temp objects. > > > > > > Thanks, > > > Bowen > > > > > > > > > > > > On Fri, Sep 6, 2019 at 2:16 AM Dawid Wysakowicz < > [hidden email]> > > > wrote: > > > > > > > I agree the consequences of the decision are substantial. Let's see > > what > > > > others think. > > > > > > > > -- Catalog functions are defined by users, and we suppose they can > > > > drop/alter it in any way they want. Thus, overwriting a catalog > > function > > > > doesn't seem to be a strong use case that we should be concerned > about. > > > > Rather, there are known use case for overwriting built-in functions. > > > > > > > > Not always users are in full control of the catalog functions. There > is > > > > also the case where different teams manage the catalog & use the > > catalog. > > > > As for overriding built-in functions with 3-part approach user can > > always > > > > use an equally named function from a catalog. E.g. to override > > > > > > > > * SELECT explode(arr) FROM ...* > > > > > > > > user can always write: > > > > > > > > * SELECT db.explode(arr) FROM ...* > > > > > > > > Best, > > > > > > > > Dawid > > > > On 06/09/2019 10:54, Xuefu Z wrote: > > > > > > > > Hi Dawid, > > > > > > > > Thank you for your summary. While the only difference in the two > > > proposals > > > > is one- or three-part in naming, the consequence would be > substantial. > > > > > > > > To me, there are two major use cases of temporary functions compared > to > > > > persistent ones: > > > > 1. Temporary in nature and auto managed by the session. More often > than > > > > not, admin doesn't even allow user to create persistent functions. > > > > 2. Provide an opportunity to overwriting system built-in functions. > > > > > > > > Since built-in functions has one-part name, requiring three-part name > > for > > > > temporary functions eliminates the overwriting opportunity. > > > > > > > > One-part naming essentially puts all temp functions under a single > > > > namespace and simplifies function resolution, such as we don't need > to > > > > consider the case of a temp function and a persistent function with > the > > > > same name under the same database. > > > > > > > > I agree having three-parts does have its merits, such as consistency > > with > > > > other temporary objects (table) and minor difference between temp vs > > > > catalog functions. However, there is a slight difference between > tables > > > and > > > > function in that there is no built-in table in SQL so there is no > need > > to > > > > overwrite it. > > > > > > > > I'm not sure if I fully agree the benefits you listed as the > advantages > > > of > > > > the three-part naming of temp functions. > > > > -- Allowing overwriting built-in functions is a benefit and the > > > solution > > > > for disallowing certain overwriting shouldn't be totally banning it. > > > > -- Catalog functions are defined by users, and we suppose they can > > > > drop/alter it in any way they want. Thus, overwriting a catalog > > function > > > > doesn't seem to be a strong use case that we should be concerned > about. > > > > Rather, there are known use case for overwriting built-in functions. > > > > > > > > Thus, personally I would prefer one-part name for temporary > functions. > > In > > > > lack of SQL standard on this, I certainly like to get opinions from > > > others > > > > to see if a consensus can be eventually reached. > > > > > > > > (To your point on modular approach to support external built-in > > > functions, > > > > we saw the value and are actively looking into it. Thanks for sharing > > > your > > > > opinion on that.) > > > > > > > > Thanks, > > > > Xuefu > > > > > > > > On Fri, Sep 6, 2019 at 3:48 PM Dawid Wysakowicz < > > [hidden email]> > > > <[hidden email]> > > > > wrote: > > > > > > > > > > > > Hi Xuefu, > > > > > > > > Thank you for your answers. > > > > > > > > Let me summarize my understanding. In principle we differ only in > > regards > > > > to the fact if a temporary function can be only 1-part or only 3-part > > > > identified. I can reconfirm that if the community decides it prefers > > the > > > > 1-part approach I will commit to that, with the assumption that we > will > > > > force ONLY 1-part function names. (We will parse identifier and throw > > > > exception if a user tries to register e.g. db.temp_func). > > > > > > > > My preference is though the 3-part approach: > > > > > > > > - there are some functions that it makes no sense to override, > e.g. > > > > CAST, moreover I'm afraid that allowing overriding such will lead > to > > > high > > > > inconsistency, similar to those that I mentioned spark has > > > > - you cannot shadow a fully-qualified function. (If a user fully > > > > qualifies his/her objects in a SQL query, which is often > considered > > a > > > good > > > > practice) > > > > - it does not differentiate between functions & temporary > functions. > > > > Temporary functions just differ with regards to their life-cycle. > > The > > > > registration & usage is exactly the same. > > > > > > > > As it can be seen, the proposed concept regarding temp function and > > > > function resolution is quite simple. > > > > > > > > Both approaches are equally simple. I would even say the 3-part > > approach > > > > is slightly simpler as it does not have to care about some special > > > built-in > > > > functions such as CAST. > > > > > > > > I don't want to express my opinion on the differentiation between > > > built-in > > > > functions and "external" built-in functions in this thread as it is > > > rather > > > > orthogonal, but I also like the modular approach and I definitely > don't > > > > like the special syntax "cat::function". I think it's better to stick > > to > > > a > > > > standard or at least other proved solutions from other systems. > > > > > > > > Best, > > > > > > > > Dawid > > > > On 05/09/2019 10:12, Xuefu Z wrote: > > > > > > > > Hi David, > > > > > > > > Thanks for sharing your thoughts and request for clarifications. I > > > believe > > > > that I fully understood your proposal, which does has its merit. > > However, > > > > it's different from ours. Here are the answers to your questions: > > > > > > > > Re #1: yes, the temp functions in the proposal are global and have > just > > > > one-part names, similar to built-in functions. Two- or three-part > names > > > are > > > > not allowed. > > > > > > > > Re #2: not applicable as two- or three-part names are disallowed. > > > > > > > > Re #3: same as above. Referencing external built-in functions is > > achieved > > > > either implicitly (only the built-in functions in the current > catalogs > > > are > > > > considered) or via special syntax such as cat::function. However, we > > are > > > > looking into the modular approach that Time suggested with other > > feedback > > > > received from the community. > > > > > > > > Re #4: the resolution order goes like the following in our proposal: > > > > > > > > 1. temporary functions > > > > 2. bulit-in functions (including those augmented by add-on modules) > > > > 3. built-in functions in current catalog (this will not be needed if > > the > > > > special syntax "cat::function" is required) > > > > 4. functions in current catalog and db. > > > > > > > > If we go with the modular approach and make external built-in > functions > > > as > > > > an add-on module, the 2 and 3 above will be combined. In essence, the > > > > resolution order is equivalent in the two approaches. > > > > > > > > By the way, resolution order matters only for simple name reference. > > For > > > > names such as db.function (interpreted as current_cat/db/function) or > > > > cat.db.function, the reference is unambiguous, so on resolution is > > > needed. > > > > > > > > As it can be seen, the proposed concept regarding temp function and > > > > function resolution is quite simple. Additionally, the proposed > > > resolution > > > > order allows temp function to shadow a built-in function, which is > > > > important (though not decisive) in our opinion. > > > > > > > > I started liking the modular approach as the resolution order will > only > > > > include 1, 2, and 4, which is simpler and more generic. That's why I > > > > suggested we look more into this direction. > > > > > > > > Please let me know if there are further questions. > > > > > > > > Thanks, > > > > Xuefu > > > > > > > > > > > > > > > > > > > > On Thu, Sep 5, 2019 at 2:42 PM Dawid Wysakowicz < > > [hidden email]> > > > <[hidden email]> <[hidden email]> < > > [hidden email]> > > > > wrote: > > > > > > > > > > > > Hi Xuefu, > > > > > > > > Just wanted to summarize my opinion on the one topic (temporary > > > functions). > > > > > > > > My preference would be to make temporary functions always 3-part > > > qualified > > > > (as a result that would prohibit overriding built-in functions). > Having > > > > said that if the community decides that it's better to allow > overriding > > > > built-in functions I am fine with it and can commit to that decision. > > > > > > > > I wanted to ask if you could clarify a few points for me around that > > > > option. > > > > > > > > 1. Would you enforce temporary functions to be always just a > single > > > > name (without db & cat) as hive does, or would you allow also 3 or > > > even 2 > > > > part identifiers? > > > > 2. Assuming 2/3-part paths. How would you register a function > from a > > > > following statement: CREATE TEMPORARY FUNCTION db.func? Would that > > > shadow > > > > all functions named 'func' in all databases named 'db' in all > > > catalogs? Or > > > > would you shadow only function 'func' in database 'db' in current > > > catalog? > > > > 3. This point is still under discussion, but was mentioned a few > > > > times, that maybe we want to enable syntax cat.func for "external > > > built-in > > > > functions". How would that affect statement from previous point? > > Would > > > > 'db.func' shadow "external built-in function" in 'db' catalog or > > user > > > > functions as in point 2? Or maybe both? > > > > 4. Lastly in fact to summarize the previous points. Assuming > > 2/3-part > > > > paths. Would the function resolution be actually as follows?: > > > > 1. temporary functions (1-part path) > > > > 2. built-in functions > > > > 3. temporary functions (2-part path) > > > > 4. 2-part catalog functions a.k.a. "external built-in > functions" > > > > (cat + func) - this is still under discussion, if we want that > in > > > the other > > > > focal point > > > > 5. temporary functions (3-part path) > > > > 6. 3-part catalog functions a.k.a. user functions > > > > > > > > I would be really grateful if you could explain me those questions, > > > thanks. > > > > > > > > BTW, Thank you all for a healthy discussion. > > > > > > > > Best, > > > > > > > > Dawid > > > > On 04/09/2019 23:25, Xuefu Z wrote: > > > > > > > > Thank all for the sharing thoughts. I think we have gathered some > > useful > > > > initial feedback from this long discussion with a couple of focal > > points > > > > sticking out. > > > > > > > > We will go back to do more research and adapt our proposal. Once > it's > > > > ready, we will ask for a new round of review. If there is any > > > disagreement, > > > > we will start a new discussion thread on each rather than having a > mega > > > > discussion like this. > > > > > > > > Thanks to everyone for participating. > > > > > > > > Regards, > > > > Xuefu > > > > > > > > > > > > On Thu, Sep 5, 2019 at 2:52 AM Bowen Li <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> wrote: > > > > > > > > > > > > Let me try to summarize and conclude the long thread so far: > > > > > > > > 1. For order of temp function v.s. built-in function: > > > > > > > > I think Dawid's point that temp function should be of fully qualified > > > path > > > > is a better reasoning to back the newly proposed order, and i agree > we > > > > don't need to follow Hive/Spark. > > > > > > > > However, I'd rather not change fundamentals of temporary functions in > > > this > > > > FLIP. It belongs to a bigger story of how temporary objects should be > > > > redefined and be handled uniformly - currently temporary tables and > > views > > > > (those registered from TableEnv#registerTable()) behave different > than > > > what > > > > Dawid propose for temp functions, and we need a FLIP to just unify > > their > > > > APIs and behaviors. > > > > > > > > I agree that backward compatibility is not an issue w.r.t Jark's > > points. > > > > > > > > ***Seems we do have consensus that it's acceptable to prevent users > > > > registering a temp function in the same name as a built-in function. > To > > > > help us move forward, I'd like to propose setting such a restraint on > > > temp > > > > functions in this FLIP to simplify the design and avoid disputes.*** > It > > > > will also leave rooms for improvements in the future. > > > > > > > > > > > > 2. For Hive built-in function: > > > > > > > > Thanks Timo for providing the Presto and Postgres examples. I feel > > > modular > > > > built-in functions can be a good fit for the geo and ml example as a > > > native > > > > Flink extension, but not sure if it fits well with external > > integrations. > > > > Anyway, I think modular built-in functions is a bigger story and can > be > > > on > > > > its own thread too, and our proposal doesn't prevent Flink from doing > > > that > > > > in the future. > > > > > > > > ***Seems we have consensus that users should be able to use built-in > > > > functions of Hive or other external systems in SQL explicitly and > > > > deterministically regardless of Flink built-in functions and the > > > potential > > > > modular built-in functions, via some new syntax like "mycat::func"? > If > > > so, > > > > I'd like to propose removing Hive built-in functions from ambiguous > > > > function resolution order, and empower users with such a syntax. This > > way > > > > we sacrifice a little convenience for certainty*** > > > > > > > > > > > > What do you think? > > > > > > > > On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz < > > [hidden email]> > > > <[hidden email]> <[hidden email]> < > > [hidden email]> > > > <[hidden email]> <[hidden email]> < > > [hidden email]> > > > <[hidden email]> > > > > wrote: > > > > > > > > > > > > Hi, > > > > > > > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just > > > > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think > they > > > > > > > > are > > > > > > > > very inconsistent in that manner (spark being way worse on that). > > > > > > > > Hive: > > > > > > > > You cannot overwrite all the built-in functions. I could overwrite > most > > > > > > > > of > > > > > > > > the functions I tried e.g. length, e, pi, round, rtrim, but there are > > > > functions I cannot overwrite e.g. CAST, ARRAY I get: > > > > > > > > > > > > * ParseException line 1:29 cannot recognize input near 'array' > 'AS' > > * > > > > > > > > What is interesting is that I cannot ovewrite *array*, but I can > > ovewrite > > > > *map* or *struct*. Though hive behaves reasonable well if I manage to > > > > overwrite a function. When I drop the temporary function the native > > > > function is still available. > > > > > > > > Spark: > > > > > > > > Spark's behavior imho is super bad. > > > > > > > > Theoretically I could overwrite all functions. I was able e.g. to > > > > overwrite CAST function. I had to use though CREATE OR REPLACE > > TEMPORARY > > > > FUNCTION syntax. Otherwise I get an exception that a function already > > > > exists. However when I used the CAST function in a query it used the > > > > native, built-in one. > > > > > > > > When I overwrote current_date() function, it was used in a query, but > > it > > > > completely replaces the built-in function and I can no longer use the > > > > native function in any way. I cannot also drop the temporary > function. > > I > > > > get: > > > > > > > > * Error in query: Cannot drop native function 'current_date';* > > > > > > > > Additional note, both systems do not allow creating TEMPORARY > FUNCTIONS > > > > with a database. Temporary functions are always represented as a > single > > > > name. > > > > > > > > In my opinion neither of the systems have consistent behavior. > > Generally > > > > speaking I think overwriting any system provided functions is just > > > > dangerous. > > > > > > > > Regarding Jark's concerns. Such functions would be registered in a > > > > > > > > current > > > > > > > > catalog/database schema, so a user could still use its own function, > > but > > > > would have to fully qualify the function (because built-in functions > > take > > > > precedence). Moreover users would have the same problem with > permanent > > > > functions. Imagine a user have a permanent function 'cat.db.explode'. > > In > > > > 1.9 the user could use just the 'explode' function as long as the > > 'cat' & > > > > 'db' were the default catalog & database. If we introduce 'explode' > > > > built-in function in 1.10, the user has to fully qualify the > function. > > > > > > > > Best, > > > > > > > > Dawid > > > > On 04/09/2019 15:19, Timo Walther wrote: > > > > > > > > Hi all, > > > > > > > > thanks for the healthy discussion. It is already a very long > discussion > > > > with a lot of text. So I will just post my opinion to a couple of > > > > statements: > > > > > > > > > > > > Hive built-in functions are not part of Flink built-in functions, > they > > > > > > > > are catalog functions > > > > > > > > That is not entirely true. Correct me if I'm wrong but I think Hive > > > > built-in functions are also not catalog functions. They are not > stored > > in > > > > every Hive metastore catalog that is freshly created but are a set of > > > > functions that are listed somewhere and made available. > > > > > > > > > > > > ambiguous functions reference just shouldn't be resolved to a > different > > > > > > > > catalog > > > > > > > > I agree. They should not be resolved to a different catalog. That's > > why I > > > > am suggesting to split the concept of built-in functions and catalog > > > > > > > > lookup > > > > > > > > semantics. > > > > > > > > > > > > I don't know if any other databases handle built-in functions like > that > > > > > > > > What I called "module" is: > > > > - Extension in Postgres [1] > > > > - Plugin in Presto [2] > > > > > > > > Btw. Presto even mentions example modules that are similar to the > ones > > > > that we will introduce in the near future both for ML and System XYZ > > > > compatibility: > > > > "See either the presto-ml module for machine learning functions or > the > > > > presto-teradata-functions module for Teradata-compatible functions, > > both > > > > > > > > in > > > > > > > > the root of the Presto source." > > > > > > > > > > > > functions should be either built-in already or just libraries > > > > > > > > functions, > > > > > > > > and library functions can be adapted to catalog APIs or of some other > > > > syntax to use > > > > > > > > Regarding "built-in already", of course we can add a lot of functions > > as > > > > built-ins but we will end-up in a dependency hell in the near future > if > > > > > > > > we > > > > > > > > don't introduce a pluggable approach. Library functions is what you > > also > > > > suggest but storing them in a catalog means to always fully qualify > > them > > > > > > > > or > > > > > > > > modifying the existing catalog design that was inspired by the > > standard. > > > > > > > > I don't think "it brings in even more complicated scenarios to the > > > > design", it just does clear separation of concerns. Integrating the > > > > functionality into the current design makes the catalog API more > > > > complicated. > > > > > > > > > > > > why would users name a temporary function the same as a built-in > > > > > > > > function then? > > > > > > > > Because you never know what users do. If they don't, my suggested > > > > resolution order should not be a problem, right? > > > > > > > > > > > > I don't think hive functions deserves be a function module > > > > > > > > Our goal is not to create a Hive clone. We need to think forward and > > Hive > > > > is just one of many systems that we can support. Not every built-in > > > > function behaves and will behave exactly like Hive. > > > > > > > > > > > > regarding temporary functions, there are few systems that support it > > > > > > > > IMHO Spark and Hive are not always the best examples for consistent > > > > design. Systems like Postgres, Presto, or SQL Server should be used > as > > a > > > > reference. I don't think that a user can overwrite a built-in > function > > > > there. > > > > > > > > Regards, > > > > Timo > > > > > > > > [1] https://www.postgresql.org/docs/10/extend-extensions.html > > > > [2] https://prestodb.github.io/docs/current/develop/functions.html > > > > > > > > > > > > On 04.09.19 13:44, Jark Wu wrote: > > > > > > > > Hi all, > > > > > > > > Regarding #1 temp function <> built-in function and naming. > > > > I'm fine with temp functions should precede built-in function and can > > > > override built-in functions (we already support to override built-in > > > > function in 1.9). > > > > If we don't allow the same name as a built-in function, I'm afraid we > > > > > > > > will > > > > > > > > have compatibility issues in the future. > > > > Say users register a user defined function named "explode" in 1.9, > and > > we > > > > support a built-in "explode" function in 1.10. > > > > Then the user's jobs which call the registered "explode" function in > > 1.9 > > > > will all fail in 1.10 because of naming conflict. > > > > > > > > Regarding #2 "External" built-in functions. > > > > I think if we store external built-in functions in catalog, then > > > > "hive1::sqrt" is a good way to go. > > > > However, I would prefer to support a discovery mechanism (e.g. SPI) > for > > > > built-in functions as Timo suggested above. > > > > This gives us the flexibility to add Hive or MySQL or Geo or whatever > > > > function set as built-in functions in an easy way. > > > > > > > > Best, > > > > Jark > > > > > > > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]><[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > > > > > Hi David, > > > > > > > > Thank you for sharing your findings. It seems to me that there is no > > SQL > > > > standard regarding temporary functions. There are few systems that > > > > > > > > support > > > > > > > > it. Here are what I have found: > > > > > > > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > > > > 2. Spark: basically follows Hive ( > > > > > > > > > > > > > > > > > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > > > > > > > > ) > > > > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of > > overwriting > > > > behavior. ( > > > > http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html > > > > > > > > ) > > > > > > > > Because of lack of standard, it's perfectly fine for Flink to define > > > > whatever it sees appropriate. Thus, your proposal (no overwriting and > > > > > > > > must > > > > > > > > have DB as holder) is one option. The advantage is simplicity, The > > > > downside > > > > is the deviation from Hive, which is popular and de facto standard in > > big > > > > data world. > > > > > > > > However, I don't think we have to follow Hive. More importantly, we > > need > > > > > > > > a > > > > > > > > consensus. I have no objection if your proposal is generally agreed > > upon. > > > > > > > > Thanks, > > > > Xuefu > > > > > > > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz < > > [hidden email] > > > <[hidden email]> <[hidden email]> < > > [hidden email]> > > > <[hidden email]> <[hidden email]> < > > [hidden email]> > > > <[hidden email]> <[hidden email]> > > > > wrote: > > > > > > > > Hi all, > > > > > > > > Just an opinion on the built-in <> temporary functions resolution and > > > > NAMING issue. I think we should not allow overriding the built-in > > > > functions, as this may pose serious issues and to be honest is rather > > > > not feasible and would require major rework. What happens if a user > > > > wants to override CAST? Calls to that function are generated at > > > > different layers of the stack that unfortunately does not always go > > > > through the Catalog API (at least yet). Moreover from what I've > checked > > > > no other systems allow overriding the built-in functions. All the > > > > systems I've checked so far register temporary functions in a > > > > database/schema (either special database for temporary functions, or > > > > just current database). What I would suggest is to always register > > > > temporary functions with a 3 part identifier. The same way as tables, > > > > views etc. This effectively means you cannot override built-in > > > > functions. With such approach it is natural that the temporary > > functions > > > > end up a step lower in the resolution order: > > > > > > > > 1. built-in functions (1 part, maybe 2? - this is still under > > discussion) > > > > > > > > 2. temporary functions (always 3 part path) > > > > > > > > 3. catalog functions (always 3 part path) > > > > > > > > Let me know what do you think. > > > > > > > > Best, > > > > > > > > Dawid > > > > > > > > On 04/09/2019 06:13, Bowen Li wrote: > > > > > > > > Hi, > > > > > > > > I agree with Xuefu that the main controversial points are mainly the > > > > > > > > two > > > > > > > > places. My thoughts on them: > > > > > > > > 1) Determinism of referencing Hive built-in functions. We can either > > > > > > > > remove > > > > > > > > Hive built-in functions from ambiguous function resolution and > require > > > > users to use special syntax for their qualified names, or add a > config > > > > > > > > flag > > > > > > > > to catalog constructor/yaml for turning on and off Hive built-in > > > > > > > > functions > > > > > > > > with the flag set to 'false' by default and proper doc added to help > > > > > > > > users > > > > > > > > make their decisions. > > > > > > > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > > > > > > > function > > > > > > > > resolution order. We believe Flink temp functions should precede > Flink > > > > built-in functions, and I have presented my reasons. Just in case if > we > > > > cannot reach an agreement, I propose forbid users registering temp > > > > functions in the same name as a built-in function, like MySQL's > > > > > > > > approach, > > > > > > > > for the moment. It won't have any performance concern, since built-in > > > > functions are all in memory and thus cost of a name check will be > > > > > > > > really > > > > > > > > trivial. > > > > > > > > > > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]><[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > > > > > From what I have seen, there are a couple of focal disagreements: > > > > > > > > 1. Resolution order: temp function --> flink built-in function --> > > > > > > > > catalog > > > > > > > > function vs flink built-in function --> temp function -> catalog > > > > > > > > function. > > > > > > > > 2. "External" built-in functions: how to treat built-in functions in > > > > external system and how users reference them > > > > > > > > For #1, I agree with Bowen that temp function needs to be at the > > > > > > > > highest > > > > > > > > priority because that's how a user might overwrite a built-in > function > > > > without referencing a persistent, overwriting catalog function with a > > > > > > > > fully > > > > > > > > qualified name. Putting built-in functions at the highest priority > > > > eliminates that usage. > > > > > > > > For #2, I saw a general agreement on referencing "external" built-in > > > > functions such as those in Hive needs to be explicit and > deterministic > > > > > > > > even > > > > > > > > though different approaches are proposed. To limit the scope and > > > > > > > > simply > > > > > > > > the > > > > > > > > usage, it seems making sense to me to introduce special syntax for > > > > > > > > user to > > > > > > > > explicitly reference an external built-in function such as > hive1::sqrt > > > > > > > > or > > > > > > > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog > API > > > > > > > > call > > > > > > > > hive1.getFunction(ObjectPath functionName) where the database name is > > > > absent for bulit-in functions available in that catalog hive1. I > > > > > > > > understand > > > > > > > > that Bowen's original proposal was trying to avoid this, but this > > > > > > > > could > > > > > > > > turn out to be a clean and simple solution. > > > > > > > > (Timo's modular approach is great way to "expand" Flink's built-in > > > > > > > > function > > > > > > > > set, which seems orthogonal and complementary to this, which could be > > > > tackled in further future work.) > > > > > > > > I'd be happy to hear further thoughts on the two points. > > > > > > > > Thanks, > > > > Xuefu > > > > > > > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> > > > <[hidden email]> <[hidden email]> <[hidden email]>< > > [hidden email]> > > > <[hidden email]> <[hidden email]> <[hidden email]> < > > [hidden email]> > > > <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > > > > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is > > > > > > > > the > > > > > > > > same > > > > as Bowen's. But after thinking about it, I'm currently lean to Timo's > > > > suggestion. > > > > > > > > The reason is backward compatibility. If we follow Bowen's approach, > > > > > > > > let's > > > > > > > > say we > > > > first find function in Flink's built-in functions, and then hive's > > > > built-in. For example, `foo` > > > > is not supported by Flink, but hive has such built-in function. So > > > > > > > > user > > > > > > > > will have hive's > > > > behavior for function `foo`. And in next release, Flink realize this > > > > > > > > is a > > > > > > > > very popular function > > > > and add it into Flink's built-in functions, but with different > > > > > > > > behavior > > > > > > > > as > > > > > > > > hive's. So in next > > > > release, the behavior changes. > > > > > > > > With Timo's approach, IIUC user have to tell the framework explicitly > > > > > > > > what > > > > > > > > kind of > > > > built-in functions he would like to use. He can just tell framework > > > > > > > > to > > > > > > > > abandon Flink's built-in > > > > functions, and use hive's instead. User can only choose between them, > > > > > > > > but > > > > > > > > not use > > > > them at the same time. I think this approach is more predictable. > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]><[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> > wrote: > > > > > > > > Hi all, > > > > > > > > Thanks for the feedback. Just a kindly reminder that the [Proposal] > > > > > > > > section > > > > > > > > in the google doc was updated, please take a look first and let me > > > > > > > > know > > > > > > > > if > > > > > > > > you have more questions. > > > > > > > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]><[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> > > > > > > > > wrote: > > > > > > > > Hi Timo, > > > > > > > > Re> 1) We should not have the restriction "hive built-in functions > > > > > > > > can > > > > > > > > only > > > > > > > > be used when current catalog is hive catalog". Switching a catalog > > > > should only have implications on the cat.db.object resolution but > > > > > > > > not > > > > > > > > functions. It would be quite convinient for users to use Hive > > > > > > > > built-ins > > > > > > > > even if they use a Confluent schema registry or just the in-memory > > > > > > > > catalog. > > > > > > > > There might be a misunderstanding here. > > > > > > > > First of all, Hive built-in functions are not part of Flink > > > > > > > > built-in > > > > > > > > functions, they are catalog functions, thus if the current catalog > > > > > > > > is > > > > > > > > not a > > > > > > > > HiveCatalog but, say, a schema registry catalog, ambiguous > > > > > > > > functions > > > > > > > > reference just shouldn't be resolved to a different catalog. > > > > > > > > Second, Hive built-in functions can potentially be referenced > > > > > > > > across > > > > > > > > catalog, but it doesn't have db namespace and we currently just > > > > > > > > don't > > > > > > > > have > > > > > > > > a SQL syntax for it. It can be enabled when such a SQL syntax is > > > > > > > > defined, > > > > > > > > e.g. "catalog::function", but it's out of scope of this FLIP. > > > > > > > > 2) I would propose to have separate concepts for catalog and > > > > > > > > built-in > > > > > > > > functions. In particular it would be nice to modularize built-in > > > > functions. Some built-in functions are very crucial (like AS, CAST, > > > > MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > > > > > > > maybe > > > > > > > > we add more experimental functions in the future or function for > > > > > > > > some > > > > > > > > special application area (Geo functions, ML functions). A data > > > > > > > > platform > > > > > > > > team might not want to make every built-in function available. Or a > > > > function module like ML functions is in a different Maven module. > > > > > > > > I think this is orthogonal to this FLIP, especially we don't have > > > > > > > > the > > > > > > > > "external built-in functions" anymore and currently the built-in > > > > > > > > function > > > > > > > > category remains untouched. > > > > > > > > But just to share some thoughts on the proposal, I'm not sure about > > > > > > > > it: > > > > > > > > - I don't know if any other databases handle built-in functions > > > > > > > > like > > > > > > > > that. > > > > > > > > Maybe you can give some examples? IMHO, built-in functions are > > > > > > > > system > > > > > > > > info > > > > > > > > and should be deterministic, not depending on loaded libraries. Geo > > > > functions should be either built-in already or just libraries > > > > > > > > functions, > > > > > > > > and library functions can be adapted to catalog APIs or of some > > > > > > > > other > > > > > > > > syntax to use > > > > - I don't know if all use cases stand, and many can be achieved by > > > > > > > > other > > > > > > > > approaches too. E.g. experimental functions can be taken good care > > > > > > > > of > > > > > > > > by > > > > > > > > documentations, annotations, etc > > > > - the proposal basically introduces some concept like a pluggable > > > > > > > > built-in > > > > > > > > function catalog, despite the already existing catalog APIs > > > > - it brings in even more complicated scenarios to the design. E.g. > > > > > > > > how > > > > > > > > do > > > > > > > > you handle built-in functions in different modules but different > > > > > > > > names? > > > > > > > > In short, I'm not sure if it really stands and it looks like an > > > > > > > > overkill > > > > > > > > to me. I'd rather not go to that route. Related discussion can be > > > > > > > > on > > > > > > > > its > > > > > > > > own thread. > > > > > > > > 3) Following the suggestion above, we can have a separate discovery > > > > mechanism for built-in functions. Instead of just going through a > > > > > > > > static > > > > > > > > list like in BuiltInFunctionDefinitions, a platform team should be > > > > > > > > able > > > > > > > > to select function modules like > > > > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > > > > HiveFunctions) or via service discovery; > > > > > > > > Same as above. I'll leave it to its own thread. > > > > > > > > re > 3) Dawid and I discussed the resulution order again. I agree > > > > > > > > with > > > > > > > > Kurt > > > > > > > > that we should unify built-in function (external or internal) > > > > > > > > under a > > > > > > > > common layer. However, the resolution order should be: > > > > 1. built-in functions > > > > 2. temporary functions > > > > 3. regular catalog resolution logic > > > > Otherwise a temporary function could cause clashes with Flink's > > > > > > > > built-in > > > > > > > > functions. If you take a look at other vendors, like SQL Server > > > > > > > > they > > > > > > > > also do not allow to overwrite built-in functions. > > > > > > > > ”I agree with Kurt that we should unify built-in function (external > > > > > > > > or > > > > > > > > internal) under a common layer.“ <- I don't think this is what Kurt > > > > > > > > means. > > > > > > > > Kurt and I are in favor of unifying built-in functions of external > > > > > > > > systems > > > > > > > > and catalog functions. Did you type a mistake? > > > > > > > > Besides, I'm not sure about the resolution order you proposed. > > > > > > > > Temporary > > > > > > > > functions have a lifespan over a session and are only visible to > > > > > > > > the > > > > > > > > session owner, they are unique to each user, and users create them > > > > > > > > on > > > > > > > > purpose to be the highest priority in order to overwrite system > > > > > > > > info > > > > > > > > (built-in functions in this case). > > > > > > > > In your case, why would users name a temporary function the same > > > > > > > > as a > > > > > > > > built-in function then? Since using that name in ambiguous function > > > > reference will always be resolved to built-in functions, creating a > > > > same-named temp function would be meaningless in the end. > > > > > > > > > > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]><[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> > > > > > > > > wrote: > > > > > > > > Hi Jingsong, > > > > > > > > Re> 1.Hive built-in functions is an intermediate solution. So we > > > > > > > > should > > > > > > > > not introduce interfaces to influence the framework. To make > > > > Flink itself more powerful, we should implement the functions > > > > we need to add. > > > > > > > > Yes, please see the doc. > > > > > > > > Re> 2.Non-flink built-in functions are easy for users to change > > > > > > > > their > > > > > > > > behavior. If we support some flink built-in functions in the > > > > future but act differently from non-flink built-in, this will > > > > > > > > lead > > > > > > > > to > > > > > > > > changes in user behavior. > > > > > > > > There's no such concept as "external built-in functions" any more. > > > > Built-in functions of external systems will be treated as special > > > > > > > > catalog > > > > > > > > functions. > > > > > > > > Re> Another question is, does this fallback include all > > > > > > > > hive built-in functions? As far as I know, some hive functions > > > > have some hacky. If possible, can we start with a white list? > > > > Once we implement some functions to flink built-in, we can > > > > also update the whitelist. > > > > > > > > Yes, that's something we thought of too. I don't think it's super > > > > critical to the scope of this FLIP, thus I'd like to leave it to > > > > > > > > future > > > > > > > > efforts as a nice-to-have feature. > > > > > > > > > > > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]><[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> > > > > > > > > wrote: > > > > > > > > Hi Kurt, > > > > > > > > Re: > What I want to propose is we can merge #3 and #4, make them > > > > > > > > both > > > > > > > > under > > > > > > > > "catalog" concept, by extending catalog function to make it have > > > > > > > > ability to > > > > > > > > have built-in catalog functions. Some benefits I can see from > > > > > > > > this > > > > > > > > approach: > > > > > > > > 1. We don't have to introduce new concept like external built-in > > > > > > > > functions. > > > > > > > > Actually I don't see a full story about how to treat a built-in > > > > > > > > functions, and it > > > > > > > > seems a little bit disrupt with catalog. As a result, you have > > > > > > > > to > > > > > > > > make > > > > > > > > some restriction > > > > > > > > like "hive built-in functions can only be used when current > > > > > > > > catalog > > > > > > > > is > > > > > > > > hive catalog". > > > > > > > > Yes, I've unified #3 and #4 but it seems I didn't update some > > > > > > > > part > > > > > > > > of > > > > > > > > the doc. I've modified those sections, and they are up to date > > > > > > > > now. > > > > > > > > In short, now built-in function of external systems are defined > > > > > > > > as > > > > > > > > a > > > > > > > > special kind of catalog function in Flink, and handled by Flink > > > > > > > > as > > > > > > > > following: > > > > - An external built-in function must be associated with a catalog > > > > > > > > for > > > > > > > > the purpose of decoupling flink-table and external systems. > > > > - It always resides in front of catalog functions in ambiguous > > > > > > > > function > > > > > > > > reference order, just like in its own external system > > > > - It is a special catalog function that doesn’t have a > > > > > > > > schema/database > > > > > > > > namespace > > > > - It goes thru the same instantiation logic as other user defined > > > > catalog functions in the external system > > > > > > > > Please take another look at the doc, and let me know if you have > > > > > > > > more > > > > > > > > questions. > > > > > > > > > > > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]><[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> <[hidden email]> <[hidden email]> > > > > > > > > wrote: > > > > > > > > Hi Kurt, > > > > > > > > it should not affect the functions and operations we currently > > > > > > > > have > > > > > > > > in > > > > > > > > SQL. It just categorizes the available built-in functions. It is > > > > > > > > kind > > > > > > > > of > > > > an orthogonal concept to the catalog API but built-in functions > > > > > > > > deserve > > > > > > > > this special kind of treatment. CatalogFunction still fits > > > > > > > > perfectly > > > > > > > > in > > > > > > > > there because the regular catalog object resolution logic is not > > > > affected. So tables and functions are resolved in the same way > > > > > > > > but > > > > > > > > with > > > > > > > > built-in functions that have priority as in the original design. > > > > > > > > Regards, > > > > Timo > > > > > > > > > > > > On 03.09.19 15:26, Kurt Young wrote: > > > > > > > > Does this only affect the functions and operations we currently > > > > > > > > have > > > > > > > > in SQL > > > > > > > > and > > > > have no effect on tables, right? Looks like this is an > > > > > > > > orthogonal > > > > > > > > concept > > > > > > > > with Catalog? > > > > If the answer are both yes, then the catalog function will be a > > > > > > > > weird > > > > > > > > concept? > > > > > > > > Best, > > > > Kurt > > > > > > > > > > > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email] > > > > > > > > wrote: > > > > > > > > The way you proposed are basically the same as what Calcite > > > > > > > > does, I > > > > > > > > think > > > > > > > > we are in the same line. > > > > > > > > Best, > > > > Danny Chan > > > > 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > > > > > > > > ,写道: > > > > > > > > This sounds exactly as the module approach I mentioned, no? > > > > > > > > Regards, > > > > Timo > > > > > > > > On 03.09.19 13:42, Danny Chan wrote: > > > > > > > > Thanks Bowen for bring up this topic, I think it’s a useful > > > > > > > > refactoring to make our function usage more user friendly. > > > > > > > > For the topic of how to organize the builtin operators and > > > > > > > > operators > > > > > > > > of Hive, here is a solution from Apache Calcite, the Calcite > > > > > > > > way > > > > > > > > is > > > > > > > > to make > > > > > > > > every dialect operators a “Library”, user can specify which > > > > > > > > libraries they > > > > > > > > want to use for a sql query. The builtin operators always > > > > > > > > comes > > > > > > > > as > > > > > > > > the > > > > > > > > first class objects and the others are used from the order > > > > > > > > they > > > > > > > > appears. > > > > > > > > Maybe you can take a reference. > > > > > > > > [1] > > > > > > > > > > > > > > > > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > > > > > > > Best, > > > > Danny Chan > > > > 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > > > > > > > > ,写道: > > > > > > > > Hi folks, > > > > > > > > I'd like to kick off a discussion on reworking Flink's > > > > > > > > FunctionCatalog. > > > > > > > > It's critically helpful to improve function usability in > > > > > > > > SQL. > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > > > > > > > In short, it: > > > > - adds support for precise function reference with > > > > > > > > fully/partially > > > > > > > > qualified name > > > > - redefines function resolution order for ambiguous > > > > > > > > function > > > > > > > > reference > > > > > > > > - adds support for Hive's rich built-in functions (support > > > > > > > > for > > > > > > > > Hive > > > > > > > > user > > > > > > > > defined functions was already added in 1.9.0) > > > > - clarifies the concept of temporary functions > > > > > > > > Would love to hear your thoughts. > > > > > > > > Bowen > > > > > > > > -- > > > > Xuefu Zhang > > > > > > > > "In Honey We Trust!" > > > > > > > > > > > > -- > > > > Xuefu Zhang > > > > > > > > "In Honey We Trust!" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
Hi all,
Thanks Dawid for the additional explanation! As others summarized there are two questions: 1) Are temporal functions a) top-level functions (1-part address) and not associated with a catalog/db or b) do we threat them like any other database object with a 3-part address. 2) If we treat them as top-level functions, do we a) allow overriding built-in functions (by giving preference over built-in functions) or b) not. From that, we can have three combinations: 1-part/override: It has the (IMO) benefit of allowing to change the default behavior of Flink SQL by overriding built-in functions. However, it means that temp functions are differently treated than other db objects. 1-part/no-override: Unambiguous function resolution, special treatment for temp functions 3-part: No special treatment of temp functions, but also no way to change the semantics of a built-in function (unambiguous semantics) In fact, I don't have strong preference for any of the choices but 1-part/no-override would be my least favorite (no gains but special treatment of temp functions). All have pros and cons and can be justified. Cheers, Fabian Am Mi., 11. Sept. 2019 um 20:52 Uhr schrieb Bowen Li <[hidden email]>: > Hi, > > Thanks @Fabian @Dawid and everyone else for sharing your thoughts! > > First, I'd like to take Hive built-in functions out of this FLIP to keep > our original scope and make it less controversial on a potential modular > approach. I will remove Hive built-in functions from the google doc. > > Then the focus of debate is mainly function resolution order and temp > function namespace, which are somewhat related. I roughly summarized this > thread, and currently we are debating on two approaches with preference > from the following people: > > Option 1: > Proposal: temp functions will be of 1-part path (function name only), > and can override built-in functions. The ambiguous function resolution > order is thus 1) temp functions 2) built-in functions 3) catalog functions > in the current catalog/database > Votes: Xuefu, Bowen, Fabian, Jark > > Option 2: > Proposal: temp functions will be of 3-part path (with catalog, > database, and function name), and temp functions cannot override built-in > functions. The ambiguous function resolution order is thus 1) built-in > functions 2) temp functions (in 3-part path) 3) catalog functions in the > current catalog/database > Votes: Dawid, Timo > > > Do you think we need a separate voting thread on the two options in the > community, or are we able to conclude from the above summary? > > > > On Wed, Sep 11, 2019 at 8:09 AM Dawid Wysakowicz <[hidden email]> > wrote: > > > Hi Fabian, > > Thank you for your response. > > Regarding the temporary function, just wanted to clarify one thing: the > > 3-part identifier does not mean the user always has to provide the > catalog > > & database explicitly. The same way user does not have to provide them in > > e.g. when creating permanent table, view etc. It means though functions > are > > always stored within a database. The same way as all the permanent > objects > > and other temporary objects(tables, views). If not given explicitly the > > current catalog & database would be used, both in the create statement or > > when using the function. > > > > Point taken though your preference would be to support overriding > built-in > > functions. > > > > Best, > > Dawid > > > > On Wed, 11 Sep 2019, 21:14 Fabian Hueske, <[hidden email]> wrote: > > > > > Hi all, > > > > > > I'd like to add my opinion on this topic as well ;-) > > > > > > In general, I think overriding built-in function with temp functions > has > > a > > > couple of benefits but also a few challenges: > > > > > > * Users can reimplement the behavior of a built-in functions of a > > different > > > system, e.g., for backward compatibility after a migration. > > > * I don't think that "accidental" overrides and surprising semantics > are > > an > > > issue or dangerous. The user registered the temp function in the same > > > session and should therefore be aware of the changed semantics. > > > * I see that not all built-in functions can be overridden, like the > CAST > > > example that Dawid gave. However, I think these should be a small > > fraction > > > and such functions could be blacklisted. Sure, that's not super > > consistent, > > > but should (IMO) not be a big issue in practice. > > > * Temp functions should be easy to use. Requiring a 3-part addressing > > makes > > > them a lot less user friendly, IMO. Users need to think about what > > catalog > > > and db to choose when registering them. Also using a temp function in a > > > query becomes less convenient. Moreover, I agree with Bowen's concerns > > that > > > a 3-part addressing scheme reduces the temporal appearance of the > > function. > > > > > > From the three possible solutions, my preference order is > > > 1) 1-part address with override of built-in > > > 2) 1-part address without override of built-in > > > 3) 3-part address > > > > > > Regarding the issue of external built-in functions, I don't think that > > > Timo's proposal of modules is fully orthogonal to this discussion. > > > A Hive function module could be an alternative to offering Hive > functions > > > as part of Hive's catalog. > > > From a user's point of view, I think that modules would be a "cleaner" > > > integration ("Why do I need a Hive catalog if all I want to do is > apply a > > > Hive function on a Kafka table?"). > > > However, the module approach clearly has the problem of dealing with > > > same-named functions in different modules (e.g., a Hive function and a > > > Flink built-in function). > > > The catalog approach as the benefit that functions can be addressed > like > > > hiveCat::func (or a similar path). > > > > > > I'm not sure what's the best solution here. > > > > > > Cheers, > > > Fabian > > > > > > > > > Am Mo., 9. Sept. 2019 um 06:30 Uhr schrieb Bowen Li < > [hidden email] > > >: > > > > > > > Hi, > > > > > > > > W.r.t temp functions, I feel both options have their benefits and can > > > > theoretically achieve similar functionalities one way or another. In > > the > > > > end, it's more about use cases, users habits, and trade-offs. > > > > > > > > Re> Not always users are in full control of the catalog functions. > > There > > > is > > > > also the case where different teams manage the catalog & use the > > catalog. > > > > > > > > Temp functions live within a session, and not within a catalog. > Having > > > > 3-part paths may implies temp functions are tied to a catalog in two > > > > aspects. > > > > 1) it may indicate each catalog manages their temp functions, which > is > > > not > > > > true as we seem all agree they should reside at a central place, > either > > > in > > > > FunctionCatalog or CatalogManager > > > > 2) it may indicate there's some access control. When users are > > forbidden > > > to > > > > manipulate some objects in the catalog that's managed by other teams, > > but > > > > are allowed to manipulate some other objects (temp functions in this > > > case) > > > > belonging to the catalog in namespaces, users may think we introduced > > > extra > > > > complexity and confusion with some kind of access control into the > > > problem. > > > > It doesn't feel intuitive enough for end users. > > > > > > > > Thus, I'd be in favor of 1-part path for temporary functions, and > other > > > > temp objects. > > > > > > > > Thanks, > > > > Bowen > > > > > > > > > > > > > > > > On Fri, Sep 6, 2019 at 2:16 AM Dawid Wysakowicz < > > [hidden email]> > > > > wrote: > > > > > > > > > I agree the consequences of the decision are substantial. Let's see > > > what > > > > > others think. > > > > > > > > > > -- Catalog functions are defined by users, and we suppose they can > > > > > drop/alter it in any way they want. Thus, overwriting a catalog > > > function > > > > > doesn't seem to be a strong use case that we should be concerned > > about. > > > > > Rather, there are known use case for overwriting built-in > functions. > > > > > > > > > > Not always users are in full control of the catalog functions. > There > > is > > > > > also the case where different teams manage the catalog & use the > > > catalog. > > > > > As for overriding built-in functions with 3-part approach user can > > > always > > > > > use an equally named function from a catalog. E.g. to override > > > > > > > > > > * SELECT explode(arr) FROM ...* > > > > > > > > > > user can always write: > > > > > > > > > > * SELECT db.explode(arr) FROM ...* > > > > > > > > > > Best, > > > > > > > > > > Dawid > > > > > On 06/09/2019 10:54, Xuefu Z wrote: > > > > > > > > > > Hi Dawid, > > > > > > > > > > Thank you for your summary. While the only difference in the two > > > > proposals > > > > > is one- or three-part in naming, the consequence would be > > substantial. > > > > > > > > > > To me, there are two major use cases of temporary functions > compared > > to > > > > > persistent ones: > > > > > 1. Temporary in nature and auto managed by the session. More often > > than > > > > > not, admin doesn't even allow user to create persistent functions. > > > > > 2. Provide an opportunity to overwriting system built-in functions. > > > > > > > > > > Since built-in functions has one-part name, requiring three-part > name > > > for > > > > > temporary functions eliminates the overwriting opportunity. > > > > > > > > > > One-part naming essentially puts all temp functions under a single > > > > > namespace and simplifies function resolution, such as we don't need > > to > > > > > consider the case of a temp function and a persistent function with > > the > > > > > same name under the same database. > > > > > > > > > > I agree having three-parts does have its merits, such as > consistency > > > with > > > > > other temporary objects (table) and minor difference between temp > vs > > > > > catalog functions. However, there is a slight difference between > > tables > > > > and > > > > > function in that there is no built-in table in SQL so there is no > > need > > > to > > > > > overwrite it. > > > > > > > > > > I'm not sure if I fully agree the benefits you listed as the > > advantages > > > > of > > > > > the three-part naming of temp functions. > > > > > -- Allowing overwriting built-in functions is a benefit and the > > > > solution > > > > > for disallowing certain overwriting shouldn't be totally banning > it. > > > > > -- Catalog functions are defined by users, and we suppose they > can > > > > > drop/alter it in any way they want. Thus, overwriting a catalog > > > function > > > > > doesn't seem to be a strong use case that we should be concerned > > about. > > > > > Rather, there are known use case for overwriting built-in > functions. > > > > > > > > > > Thus, personally I would prefer one-part name for temporary > > functions. > > > In > > > > > lack of SQL standard on this, I certainly like to get opinions from > > > > others > > > > > to see if a consensus can be eventually reached. > > > > > > > > > > (To your point on modular approach to support external built-in > > > > functions, > > > > > we saw the value and are actively looking into it. Thanks for > sharing > > > > your > > > > > opinion on that.) > > > > > > > > > > Thanks, > > > > > Xuefu > > > > > > > > > > On Fri, Sep 6, 2019 at 3:48 PM Dawid Wysakowicz < > > > [hidden email]> > > > > <[hidden email]> > > > > > wrote: > > > > > > > > > > > > > > > Hi Xuefu, > > > > > > > > > > Thank you for your answers. > > > > > > > > > > Let me summarize my understanding. In principle we differ only in > > > regards > > > > > to the fact if a temporary function can be only 1-part or only > 3-part > > > > > identified. I can reconfirm that if the community decides it > prefers > > > the > > > > > 1-part approach I will commit to that, with the assumption that we > > will > > > > > force ONLY 1-part function names. (We will parse identifier and > throw > > > > > exception if a user tries to register e.g. db.temp_func). > > > > > > > > > > My preference is though the 3-part approach: > > > > > > > > > > - there are some functions that it makes no sense to override, > > e.g. > > > > > CAST, moreover I'm afraid that allowing overriding such will > lead > > to > > > > high > > > > > inconsistency, similar to those that I mentioned spark has > > > > > - you cannot shadow a fully-qualified function. (If a user fully > > > > > qualifies his/her objects in a SQL query, which is often > > considered > > > a > > > > good > > > > > practice) > > > > > - it does not differentiate between functions & temporary > > functions. > > > > > Temporary functions just differ with regards to their > life-cycle. > > > The > > > > > registration & usage is exactly the same. > > > > > > > > > > As it can be seen, the proposed concept regarding temp function and > > > > > function resolution is quite simple. > > > > > > > > > > Both approaches are equally simple. I would even say the 3-part > > > approach > > > > > is slightly simpler as it does not have to care about some special > > > > built-in > > > > > functions such as CAST. > > > > > > > > > > I don't want to express my opinion on the differentiation between > > > > built-in > > > > > functions and "external" built-in functions in this thread as it is > > > > rather > > > > > orthogonal, but I also like the modular approach and I definitely > > don't > > > > > like the special syntax "cat::function". I think it's better to > stick > > > to > > > > a > > > > > standard or at least other proved solutions from other systems. > > > > > > > > > > Best, > > > > > > > > > > Dawid > > > > > On 05/09/2019 10:12, Xuefu Z wrote: > > > > > > > > > > Hi David, > > > > > > > > > > Thanks for sharing your thoughts and request for clarifications. I > > > > believe > > > > > that I fully understood your proposal, which does has its merit. > > > However, > > > > > it's different from ours. Here are the answers to your questions: > > > > > > > > > > Re #1: yes, the temp functions in the proposal are global and have > > just > > > > > one-part names, similar to built-in functions. Two- or three-part > > names > > > > are > > > > > not allowed. > > > > > > > > > > Re #2: not applicable as two- or three-part names are disallowed. > > > > > > > > > > Re #3: same as above. Referencing external built-in functions is > > > achieved > > > > > either implicitly (only the built-in functions in the current > > catalogs > > > > are > > > > > considered) or via special syntax such as cat::function. However, > we > > > are > > > > > looking into the modular approach that Time suggested with other > > > feedback > > > > > received from the community. > > > > > > > > > > Re #4: the resolution order goes like the following in our > proposal: > > > > > > > > > > 1. temporary functions > > > > > 2. bulit-in functions (including those augmented by add-on modules) > > > > > 3. built-in functions in current catalog (this will not be needed > if > > > the > > > > > special syntax "cat::function" is required) > > > > > 4. functions in current catalog and db. > > > > > > > > > > If we go with the modular approach and make external built-in > > functions > > > > as > > > > > an add-on module, the 2 and 3 above will be combined. In essence, > the > > > > > resolution order is equivalent in the two approaches. > > > > > > > > > > By the way, resolution order matters only for simple name > reference. > > > For > > > > > names such as db.function (interpreted as current_cat/db/function) > or > > > > > cat.db.function, the reference is unambiguous, so on resolution is > > > > needed. > > > > > > > > > > As it can be seen, the proposed concept regarding temp function and > > > > > function resolution is quite simple. Additionally, the proposed > > > > resolution > > > > > order allows temp function to shadow a built-in function, which is > > > > > important (though not decisive) in our opinion. > > > > > > > > > > I started liking the modular approach as the resolution order will > > only > > > > > include 1, 2, and 4, which is simpler and more generic. That's why > I > > > > > suggested we look more into this direction. > > > > > > > > > > Please let me know if there are further questions. > > > > > > > > > > Thanks, > > > > > Xuefu > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Sep 5, 2019 at 2:42 PM Dawid Wysakowicz < > > > [hidden email]> > > > > <[hidden email]> <[hidden email]> < > > > [hidden email]> > > > > > wrote: > > > > > > > > > > > > > > > Hi Xuefu, > > > > > > > > > > Just wanted to summarize my opinion on the one topic (temporary > > > > functions). > > > > > > > > > > My preference would be to make temporary functions always 3-part > > > > qualified > > > > > (as a result that would prohibit overriding built-in functions). > > Having > > > > > said that if the community decides that it's better to allow > > overriding > > > > > built-in functions I am fine with it and can commit to that > decision. > > > > > > > > > > I wanted to ask if you could clarify a few points for me around > that > > > > > option. > > > > > > > > > > 1. Would you enforce temporary functions to be always just a > > single > > > > > name (without db & cat) as hive does, or would you allow also 3 > or > > > > even 2 > > > > > part identifiers? > > > > > 2. Assuming 2/3-part paths. How would you register a function > > from a > > > > > following statement: CREATE TEMPORARY FUNCTION db.func? Would > that > > > > shadow > > > > > all functions named 'func' in all databases named 'db' in all > > > > catalogs? Or > > > > > would you shadow only function 'func' in database 'db' in > current > > > > catalog? > > > > > 3. This point is still under discussion, but was mentioned a few > > > > > times, that maybe we want to enable syntax cat.func for > "external > > > > built-in > > > > > functions". How would that affect statement from previous point? > > > Would > > > > > 'db.func' shadow "external built-in function" in 'db' catalog or > > > user > > > > > functions as in point 2? Or maybe both? > > > > > 4. Lastly in fact to summarize the previous points. Assuming > > > 2/3-part > > > > > paths. Would the function resolution be actually as follows?: > > > > > 1. temporary functions (1-part path) > > > > > 2. built-in functions > > > > > 3. temporary functions (2-part path) > > > > > 4. 2-part catalog functions a.k.a. "external built-in > > functions" > > > > > (cat + func) - this is still under discussion, if we want > that > > in > > > > the other > > > > > focal point > > > > > 5. temporary functions (3-part path) > > > > > 6. 3-part catalog functions a.k.a. user functions > > > > > > > > > > I would be really grateful if you could explain me those questions, > > > > thanks. > > > > > > > > > > BTW, Thank you all for a healthy discussion. > > > > > > > > > > Best, > > > > > > > > > > Dawid > > > > > On 04/09/2019 23:25, Xuefu Z wrote: > > > > > > > > > > Thank all for the sharing thoughts. I think we have gathered some > > > useful > > > > > initial feedback from this long discussion with a couple of focal > > > points > > > > > sticking out. > > > > > > > > > > We will go back to do more research and adapt our proposal. Once > > it's > > > > > ready, we will ask for a new round of review. If there is any > > > > disagreement, > > > > > we will start a new discussion thread on each rather than having a > > mega > > > > > discussion like this. > > > > > > > > > > Thanks to everyone for participating. > > > > > > > > > > Regards, > > > > > Xuefu > > > > > > > > > > > > > > > On Thu, Sep 5, 2019 at 2:52 AM Bowen Li <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> wrote: > > > > > > > > > > > > > > > Let me try to summarize and conclude the long thread so far: > > > > > > > > > > 1. For order of temp function v.s. built-in function: > > > > > > > > > > I think Dawid's point that temp function should be of fully > qualified > > > > path > > > > > is a better reasoning to back the newly proposed order, and i agree > > we > > > > > don't need to follow Hive/Spark. > > > > > > > > > > However, I'd rather not change fundamentals of temporary functions > in > > > > this > > > > > FLIP. It belongs to a bigger story of how temporary objects should > be > > > > > redefined and be handled uniformly - currently temporary tables and > > > views > > > > > (those registered from TableEnv#registerTable()) behave different > > than > > > > what > > > > > Dawid propose for temp functions, and we need a FLIP to just unify > > > their > > > > > APIs and behaviors. > > > > > > > > > > I agree that backward compatibility is not an issue w.r.t Jark's > > > points. > > > > > > > > > > ***Seems we do have consensus that it's acceptable to prevent users > > > > > registering a temp function in the same name as a built-in > function. > > To > > > > > help us move forward, I'd like to propose setting such a restraint > on > > > > temp > > > > > functions in this FLIP to simplify the design and avoid > disputes.*** > > It > > > > > will also leave rooms for improvements in the future. > > > > > > > > > > > > > > > 2. For Hive built-in function: > > > > > > > > > > Thanks Timo for providing the Presto and Postgres examples. I feel > > > > modular > > > > > built-in functions can be a good fit for the geo and ml example as > a > > > > native > > > > > Flink extension, but not sure if it fits well with external > > > integrations. > > > > > Anyway, I think modular built-in functions is a bigger story and > can > > be > > > > on > > > > > its own thread too, and our proposal doesn't prevent Flink from > doing > > > > that > > > > > in the future. > > > > > > > > > > ***Seems we have consensus that users should be able to use > built-in > > > > > functions of Hive or other external systems in SQL explicitly and > > > > > deterministically regardless of Flink built-in functions and the > > > > potential > > > > > modular built-in functions, via some new syntax like "mycat::func"? > > If > > > > so, > > > > > I'd like to propose removing Hive built-in functions from ambiguous > > > > > function resolution order, and empower users with such a syntax. > This > > > way > > > > > we sacrifice a little convenience for certainty*** > > > > > > > > > > > > > > > What do you think? > > > > > > > > > > On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz < > > > [hidden email]> > > > > <[hidden email]> <[hidden email]> < > > > [hidden email]> > > > > <[hidden email]> <[hidden email]> < > > > [hidden email]> > > > > <[hidden email]> > > > > > wrote: > > > > > > > > > > > > > > > Hi, > > > > > > > > > > Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've > just > > > > > performed some experiments (hive-2.3.2 & spark 2.4.4) and I think > > they > > > > > > > > > > are > > > > > > > > > > very inconsistent in that manner (spark being way worse on that). > > > > > > > > > > Hive: > > > > > > > > > > You cannot overwrite all the built-in functions. I could overwrite > > most > > > > > > > > > > of > > > > > > > > > > the functions I tried e.g. length, e, pi, round, rtrim, but there > are > > > > > functions I cannot overwrite e.g. CAST, ARRAY I get: > > > > > > > > > > > > > > > * ParseException line 1:29 cannot recognize input near 'array' > > 'AS' > > > * > > > > > > > > > > What is interesting is that I cannot ovewrite *array*, but I can > > > ovewrite > > > > > *map* or *struct*. Though hive behaves reasonable well if I manage > to > > > > > overwrite a function. When I drop the temporary function the native > > > > > function is still available. > > > > > > > > > > Spark: > > > > > > > > > > Spark's behavior imho is super bad. > > > > > > > > > > Theoretically I could overwrite all functions. I was able e.g. to > > > > > overwrite CAST function. I had to use though CREATE OR REPLACE > > > TEMPORARY > > > > > FUNCTION syntax. Otherwise I get an exception that a function > already > > > > > exists. However when I used the CAST function in a query it used > the > > > > > native, built-in one. > > > > > > > > > > When I overwrote current_date() function, it was used in a query, > but > > > it > > > > > completely replaces the built-in function and I can no longer use > the > > > > > native function in any way. I cannot also drop the temporary > > function. > > > I > > > > > get: > > > > > > > > > > * Error in query: Cannot drop native function 'current_date';* > > > > > > > > > > Additional note, both systems do not allow creating TEMPORARY > > FUNCTIONS > > > > > with a database. Temporary functions are always represented as a > > single > > > > > name. > > > > > > > > > > In my opinion neither of the systems have consistent behavior. > > > Generally > > > > > speaking I think overwriting any system provided functions is just > > > > > dangerous. > > > > > > > > > > Regarding Jark's concerns. Such functions would be registered in a > > > > > > > > > > current > > > > > > > > > > catalog/database schema, so a user could still use its own > function, > > > but > > > > > would have to fully qualify the function (because built-in > functions > > > take > > > > > precedence). Moreover users would have the same problem with > > permanent > > > > > functions. Imagine a user have a permanent function > 'cat.db.explode'. > > > In > > > > > 1.9 the user could use just the 'explode' function as long as the > > > 'cat' & > > > > > 'db' were the default catalog & database. If we introduce 'explode' > > > > > built-in function in 1.10, the user has to fully qualify the > > function. > > > > > > > > > > Best, > > > > > > > > > > Dawid > > > > > On 04/09/2019 15:19, Timo Walther wrote: > > > > > > > > > > Hi all, > > > > > > > > > > thanks for the healthy discussion. It is already a very long > > discussion > > > > > with a lot of text. So I will just post my opinion to a couple of > > > > > statements: > > > > > > > > > > > > > > > Hive built-in functions are not part of Flink built-in functions, > > they > > > > > > > > > > are catalog functions > > > > > > > > > > That is not entirely true. Correct me if I'm wrong but I think Hive > > > > > built-in functions are also not catalog functions. They are not > > stored > > > in > > > > > every Hive metastore catalog that is freshly created but are a set > of > > > > > functions that are listed somewhere and made available. > > > > > > > > > > > > > > > ambiguous functions reference just shouldn't be resolved to a > > different > > > > > > > > > > catalog > > > > > > > > > > I agree. They should not be resolved to a different catalog. That's > > > why I > > > > > am suggesting to split the concept of built-in functions and > catalog > > > > > > > > > > lookup > > > > > > > > > > semantics. > > > > > > > > > > > > > > > I don't know if any other databases handle built-in functions like > > that > > > > > > > > > > What I called "module" is: > > > > > - Extension in Postgres [1] > > > > > - Plugin in Presto [2] > > > > > > > > > > Btw. Presto even mentions example modules that are similar to the > > ones > > > > > that we will introduce in the near future both for ML and System > XYZ > > > > > compatibility: > > > > > "See either the presto-ml module for machine learning functions or > > the > > > > > presto-teradata-functions module for Teradata-compatible functions, > > > both > > > > > > > > > > in > > > > > > > > > > the root of the Presto source." > > > > > > > > > > > > > > > functions should be either built-in already or just libraries > > > > > > > > > > functions, > > > > > > > > > > and library functions can be adapted to catalog APIs or of some > other > > > > > syntax to use > > > > > > > > > > Regarding "built-in already", of course we can add a lot of > functions > > > as > > > > > built-ins but we will end-up in a dependency hell in the near > future > > if > > > > > > > > > > we > > > > > > > > > > don't introduce a pluggable approach. Library functions is what you > > > also > > > > > suggest but storing them in a catalog means to always fully qualify > > > them > > > > > > > > > > or > > > > > > > > > > modifying the existing catalog design that was inspired by the > > > standard. > > > > > > > > > > I don't think "it brings in even more complicated scenarios to the > > > > > design", it just does clear separation of concerns. Integrating the > > > > > functionality into the current design makes the catalog API more > > > > > complicated. > > > > > > > > > > > > > > > why would users name a temporary function the same as a built-in > > > > > > > > > > function then? > > > > > > > > > > Because you never know what users do. If they don't, my suggested > > > > > resolution order should not be a problem, right? > > > > > > > > > > > > > > > I don't think hive functions deserves be a function module > > > > > > > > > > Our goal is not to create a Hive clone. We need to think forward > and > > > Hive > > > > > is just one of many systems that we can support. Not every built-in > > > > > function behaves and will behave exactly like Hive. > > > > > > > > > > > > > > > regarding temporary functions, there are few systems that support > it > > > > > > > > > > IMHO Spark and Hive are not always the best examples for consistent > > > > > design. Systems like Postgres, Presto, or SQL Server should be used > > as > > > a > > > > > reference. I don't think that a user can overwrite a built-in > > function > > > > > there. > > > > > > > > > > Regards, > > > > > Timo > > > > > > > > > > [1] https://www.postgresql.org/docs/10/extend-extensions.html > > > > > [2] https://prestodb.github.io/docs/current/develop/functions.html > > > > > > > > > > > > > > > On 04.09.19 13:44, Jark Wu wrote: > > > > > > > > > > Hi all, > > > > > > > > > > Regarding #1 temp function <> built-in function and naming. > > > > > I'm fine with temp functions should precede built-in function and > can > > > > > override built-in functions (we already support to override > built-in > > > > > function in 1.9). > > > > > If we don't allow the same name as a built-in function, I'm afraid > we > > > > > > > > > > will > > > > > > > > > > have compatibility issues in the future. > > > > > Say users register a user defined function named "explode" in 1.9, > > and > > > we > > > > > support a built-in "explode" function in 1.10. > > > > > Then the user's jobs which call the registered "explode" function > in > > > 1.9 > > > > > will all fail in 1.10 because of naming conflict. > > > > > > > > > > Regarding #2 "External" built-in functions. > > > > > I think if we store external built-in functions in catalog, then > > > > > "hive1::sqrt" is a good way to go. > > > > > However, I would prefer to support a discovery mechanism (e.g. SPI) > > for > > > > > built-in functions as Timo suggested above. > > > > > This gives us the flexibility to add Hive or MySQL or Geo or > whatever > > > > > function set as built-in functions in an easy way. > > > > > > > > > > Best, > > > > > Jark > > > > > > > > > > On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]><[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > > > > > > > Hi David, > > > > > > > > > > Thank you for sharing your findings. It seems to me that there is > no > > > SQL > > > > > standard regarding temporary functions. There are few systems that > > > > > > > > > > support > > > > > > > > > > it. Here are what I have found: > > > > > > > > > > 1. Hive: no DB qualifier allowed. Can overwrite built-in. > > > > > 2. Spark: basically follows Hive ( > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html > > > > > > > > > > ) > > > > > 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of > > > overwriting > > > > > behavior. ( > > > > > > http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html > > > > > > > > > > ) > > > > > > > > > > Because of lack of standard, it's perfectly fine for Flink to > define > > > > > whatever it sees appropriate. Thus, your proposal (no overwriting > and > > > > > > > > > > must > > > > > > > > > > have DB as holder) is one option. The advantage is simplicity, The > > > > > downside > > > > > is the deviation from Hive, which is popular and de facto standard > in > > > big > > > > > data world. > > > > > > > > > > However, I don't think we have to follow Hive. More importantly, we > > > need > > > > > > > > > > a > > > > > > > > > > consensus. I have no objection if your proposal is generally agreed > > > upon. > > > > > > > > > > Thanks, > > > > > Xuefu > > > > > > > > > > On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz < > > > [hidden email] > > > > <[hidden email]> <[hidden email]> < > > > [hidden email]> > > > > <[hidden email]> <[hidden email]> < > > > [hidden email]> > > > > <[hidden email]> <[hidden email]> > > > > > wrote: > > > > > > > > > > Hi all, > > > > > > > > > > Just an opinion on the built-in <> temporary functions resolution > and > > > > > NAMING issue. I think we should not allow overriding the built-in > > > > > functions, as this may pose serious issues and to be honest is > rather > > > > > not feasible and would require major rework. What happens if a user > > > > > wants to override CAST? Calls to that function are generated at > > > > > different layers of the stack that unfortunately does not always go > > > > > through the Catalog API (at least yet). Moreover from what I've > > checked > > > > > no other systems allow overriding the built-in functions. All the > > > > > systems I've checked so far register temporary functions in a > > > > > database/schema (either special database for temporary functions, > or > > > > > just current database). What I would suggest is to always register > > > > > temporary functions with a 3 part identifier. The same way as > tables, > > > > > views etc. This effectively means you cannot override built-in > > > > > functions. With such approach it is natural that the temporary > > > functions > > > > > end up a step lower in the resolution order: > > > > > > > > > > 1. built-in functions (1 part, maybe 2? - this is still under > > > discussion) > > > > > > > > > > 2. temporary functions (always 3 part path) > > > > > > > > > > 3. catalog functions (always 3 part path) > > > > > > > > > > Let me know what do you think. > > > > > > > > > > Best, > > > > > > > > > > Dawid > > > > > > > > > > On 04/09/2019 06:13, Bowen Li wrote: > > > > > > > > > > Hi, > > > > > > > > > > I agree with Xuefu that the main controversial points are mainly > the > > > > > > > > > > two > > > > > > > > > > places. My thoughts on them: > > > > > > > > > > 1) Determinism of referencing Hive built-in functions. We can > either > > > > > > > > > > remove > > > > > > > > > > Hive built-in functions from ambiguous function resolution and > > require > > > > > users to use special syntax for their qualified names, or add a > > config > > > > > > > > > > flag > > > > > > > > > > to catalog constructor/yaml for turning on and off Hive built-in > > > > > > > > > > functions > > > > > > > > > > with the flag set to 'false' by default and proper doc added to > help > > > > > > > > > > users > > > > > > > > > > make their decisions. > > > > > > > > > > 2) Flink temp functions v.s. Flink built-in functions in ambiguous > > > > > > > > > > function > > > > > > > > > > resolution order. We believe Flink temp functions should precede > > Flink > > > > > built-in functions, and I have presented my reasons. Just in case > if > > we > > > > > cannot reach an agreement, I propose forbid users registering temp > > > > > functions in the same name as a built-in function, like MySQL's > > > > > > > > > > approach, > > > > > > > > > > for the moment. It won't have any performance concern, since > built-in > > > > > functions are all in memory and thus cost of a name check will be > > > > > > > > > > really > > > > > > > > > > trivial. > > > > > > > > > > > > > > > On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]><[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > > > > > > > From what I have seen, there are a couple of focal disagreements: > > > > > > > > > > 1. Resolution order: temp function --> flink built-in function --> > > > > > > > > > > catalog > > > > > > > > > > function vs flink built-in function --> temp function -> catalog > > > > > > > > > > function. > > > > > > > > > > 2. "External" built-in functions: how to treat built-in functions > in > > > > > external system and how users reference them > > > > > > > > > > For #1, I agree with Bowen that temp function needs to be at the > > > > > > > > > > highest > > > > > > > > > > priority because that's how a user might overwrite a built-in > > function > > > > > without referencing a persistent, overwriting catalog function > with a > > > > > > > > > > fully > > > > > > > > > > qualified name. Putting built-in functions at the highest priority > > > > > eliminates that usage. > > > > > > > > > > For #2, I saw a general agreement on referencing "external" > built-in > > > > > functions such as those in Hive needs to be explicit and > > deterministic > > > > > > > > > > even > > > > > > > > > > though different approaches are proposed. To limit the scope and > > > > > > > > > > simply > > > > > > > > > > the > > > > > > > > > > usage, it seems making sense to me to introduce special syntax for > > > > > > > > > > user to > > > > > > > > > > explicitly reference an external built-in function such as > > hive1::sqrt > > > > > > > > > > or > > > > > > > > > > hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog > > API > > > > > > > > > > call > > > > > > > > > > hive1.getFunction(ObjectPath functionName) where the database name > is > > > > > absent for bulit-in functions available in that catalog hive1. I > > > > > > > > > > understand > > > > > > > > > > that Bowen's original proposal was trying to avoid this, but this > > > > > > > > > > could > > > > > > > > > > turn out to be a clean and simple solution. > > > > > > > > > > (Timo's modular approach is great way to "expand" Flink's built-in > > > > > > > > > > function > > > > > > > > > > set, which seems orthogonal and complementary to this, which could > be > > > > > tackled in further future work.) > > > > > > > > > > I'd be happy to hear further thoughts on the two points. > > > > > > > > > > Thanks, > > > > > Xuefu > > > > > > > > > > On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> > > > > <[hidden email]> <[hidden email]> <[hidden email]>< > > > [hidden email]> > > > > <[hidden email]> <[hidden email]> <[hidden email]> < > > > [hidden email]> > > > > <[hidden email]> <[hidden email]> <[hidden email]> wrote: > > > > > > > > > > Thanks Timo & Bowen for the feedback. Bowen was right, my proposal > is > > > > > > > > > > the > > > > > > > > > > same > > > > > as Bowen's. But after thinking about it, I'm currently lean to > Timo's > > > > > suggestion. > > > > > > > > > > The reason is backward compatibility. If we follow Bowen's > approach, > > > > > > > > > > let's > > > > > > > > > > say we > > > > > first find function in Flink's built-in functions, and then hive's > > > > > built-in. For example, `foo` > > > > > is not supported by Flink, but hive has such built-in function. So > > > > > > > > > > user > > > > > > > > > > will have hive's > > > > > behavior for function `foo`. And in next release, Flink realize > this > > > > > > > > > > is a > > > > > > > > > > very popular function > > > > > and add it into Flink's built-in functions, but with different > > > > > > > > > > behavior > > > > > > > > > > as > > > > > > > > > > hive's. So in next > > > > > release, the behavior changes. > > > > > > > > > > With Timo's approach, IIUC user have to tell the framework > explicitly > > > > > > > > > > what > > > > > > > > > > kind of > > > > > built-in functions he would like to use. He can just tell framework > > > > > > > > > > to > > > > > > > > > > abandon Flink's built-in > > > > > functions, and use hive's instead. User can only choose between > them, > > > > > > > > > > but > > > > > > > > > > not use > > > > > them at the same time. I think this approach is more predictable. > > > > > > > > > > Best, > > > > > Kurt > > > > > > > > > > > > > > > On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]><[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> > > wrote: > > > > > > > > > > Hi all, > > > > > > > > > > Thanks for the feedback. Just a kindly reminder that the [Proposal] > > > > > > > > > > section > > > > > > > > > > in the google doc was updated, please take a look first and let me > > > > > > > > > > know > > > > > > > > > > if > > > > > > > > > > you have more questions. > > > > > > > > > > On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]><[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> > > > > > > > > > > wrote: > > > > > > > > > > Hi Timo, > > > > > > > > > > Re> 1) We should not have the restriction "hive built-in functions > > > > > > > > > > can > > > > > > > > > > only > > > > > > > > > > be used when current catalog is hive catalog". Switching a catalog > > > > > should only have implications on the cat.db.object resolution but > > > > > > > > > > not > > > > > > > > > > functions. It would be quite convinient for users to use Hive > > > > > > > > > > built-ins > > > > > > > > > > even if they use a Confluent schema registry or just the in-memory > > > > > > > > > > catalog. > > > > > > > > > > There might be a misunderstanding here. > > > > > > > > > > First of all, Hive built-in functions are not part of Flink > > > > > > > > > > built-in > > > > > > > > > > functions, they are catalog functions, thus if the current catalog > > > > > > > > > > is > > > > > > > > > > not a > > > > > > > > > > HiveCatalog but, say, a schema registry catalog, ambiguous > > > > > > > > > > functions > > > > > > > > > > reference just shouldn't be resolved to a different catalog. > > > > > > > > > > Second, Hive built-in functions can potentially be referenced > > > > > > > > > > across > > > > > > > > > > catalog, but it doesn't have db namespace and we currently just > > > > > > > > > > don't > > > > > > > > > > have > > > > > > > > > > a SQL syntax for it. It can be enabled when such a SQL syntax is > > > > > > > > > > defined, > > > > > > > > > > e.g. "catalog::function", but it's out of scope of this FLIP. > > > > > > > > > > 2) I would propose to have separate concepts for catalog and > > > > > > > > > > built-in > > > > > > > > > > functions. In particular it would be nice to modularize built-in > > > > > functions. Some built-in functions are very crucial (like AS, CAST, > > > > > MINUS), others are more optional but stable (MD5, CONCAT_WS), and > > > > > > > > > > maybe > > > > > > > > > > we add more experimental functions in the future or function for > > > > > > > > > > some > > > > > > > > > > special application area (Geo functions, ML functions). A data > > > > > > > > > > platform > > > > > > > > > > team might not want to make every built-in function available. Or a > > > > > function module like ML functions is in a different Maven module. > > > > > > > > > > I think this is orthogonal to this FLIP, especially we don't have > > > > > > > > > > the > > > > > > > > > > "external built-in functions" anymore and currently the built-in > > > > > > > > > > function > > > > > > > > > > category remains untouched. > > > > > > > > > > But just to share some thoughts on the proposal, I'm not sure about > > > > > > > > > > it: > > > > > > > > > > - I don't know if any other databases handle built-in functions > > > > > > > > > > like > > > > > > > > > > that. > > > > > > > > > > Maybe you can give some examples? IMHO, built-in functions are > > > > > > > > > > system > > > > > > > > > > info > > > > > > > > > > and should be deterministic, not depending on loaded libraries. Geo > > > > > functions should be either built-in already or just libraries > > > > > > > > > > functions, > > > > > > > > > > and library functions can be adapted to catalog APIs or of some > > > > > > > > > > other > > > > > > > > > > syntax to use > > > > > - I don't know if all use cases stand, and many can be achieved by > > > > > > > > > > other > > > > > > > > > > approaches too. E.g. experimental functions can be taken good care > > > > > > > > > > of > > > > > > > > > > by > > > > > > > > > > documentations, annotations, etc > > > > > - the proposal basically introduces some concept like a pluggable > > > > > > > > > > built-in > > > > > > > > > > function catalog, despite the already existing catalog APIs > > > > > - it brings in even more complicated scenarios to the design. E.g. > > > > > > > > > > how > > > > > > > > > > do > > > > > > > > > > you handle built-in functions in different modules but different > > > > > > > > > > names? > > > > > > > > > > In short, I'm not sure if it really stands and it looks like an > > > > > > > > > > overkill > > > > > > > > > > to me. I'd rather not go to that route. Related discussion can be > > > > > > > > > > on > > > > > > > > > > its > > > > > > > > > > own thread. > > > > > > > > > > 3) Following the suggestion above, we can have a separate discovery > > > > > mechanism for built-in functions. Instead of just going through a > > > > > > > > > > static > > > > > > > > > > list like in BuiltInFunctionDefinitions, a platform team should be > > > > > > > > > > able > > > > > > > > > > to select function modules like > > > > > catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, > > > > > HiveFunctions) or via service discovery; > > > > > > > > > > Same as above. I'll leave it to its own thread. > > > > > > > > > > re > 3) Dawid and I discussed the resulution order again. I agree > > > > > > > > > > with > > > > > > > > > > Kurt > > > > > > > > > > that we should unify built-in function (external or internal) > > > > > > > > > > under a > > > > > > > > > > common layer. However, the resolution order should be: > > > > > 1. built-in functions > > > > > 2. temporary functions > > > > > 3. regular catalog resolution logic > > > > > Otherwise a temporary function could cause clashes with Flink's > > > > > > > > > > built-in > > > > > > > > > > functions. If you take a look at other vendors, like SQL Server > > > > > > > > > > they > > > > > > > > > > also do not allow to overwrite built-in functions. > > > > > > > > > > ”I agree with Kurt that we should unify built-in function (external > > > > > > > > > > or > > > > > > > > > > internal) under a common layer.“ <- I don't think this is what Kurt > > > > > > > > > > means. > > > > > > > > > > Kurt and I are in favor of unifying built-in functions of external > > > > > > > > > > systems > > > > > > > > > > and catalog functions. Did you type a mistake? > > > > > > > > > > Besides, I'm not sure about the resolution order you proposed. > > > > > > > > > > Temporary > > > > > > > > > > functions have a lifespan over a session and are only visible to > > > > > > > > > > the > > > > > > > > > > session owner, they are unique to each user, and users create them > > > > > > > > > > on > > > > > > > > > > purpose to be the highest priority in order to overwrite system > > > > > > > > > > info > > > > > > > > > > (built-in functions in this case). > > > > > > > > > > In your case, why would users name a temporary function the same > > > > > > > > > > as a > > > > > > > > > > built-in function then? Since using that name in ambiguous function > > > > > reference will always be resolved to built-in functions, creating a > > > > > same-named temp function would be meaningless in the end. > > > > > > > > > > > > > > > On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]><[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> > > > > > > > > > > wrote: > > > > > > > > > > Hi Jingsong, > > > > > > > > > > Re> 1.Hive built-in functions is an intermediate solution. So we > > > > > > > > > > should > > > > > > > > > > not introduce interfaces to influence the framework. To make > > > > > Flink itself more powerful, we should implement the functions > > > > > we need to add. > > > > > > > > > > Yes, please see the doc. > > > > > > > > > > Re> 2.Non-flink built-in functions are easy for users to change > > > > > > > > > > their > > > > > > > > > > behavior. If we support some flink built-in functions in the > > > > > future but act differently from non-flink built-in, this will > > > > > > > > > > lead > > > > > > > > > > to > > > > > > > > > > changes in user behavior. > > > > > > > > > > There's no such concept as "external built-in functions" any more. > > > > > Built-in functions of external systems will be treated as special > > > > > > > > > > catalog > > > > > > > > > > functions. > > > > > > > > > > Re> Another question is, does this fallback include all > > > > > > > > > > hive built-in functions? As far as I know, some hive functions > > > > > have some hacky. If possible, can we start with a white list? > > > > > Once we implement some functions to flink built-in, we can > > > > > also update the whitelist. > > > > > > > > > > Yes, that's something we thought of too. I don't think it's super > > > > > critical to the scope of this FLIP, thus I'd like to leave it to > > > > > > > > > > future > > > > > > > > > > efforts as a nice-to-have feature. > > > > > > > > > > > > > > > On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]><[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> > > > > > > > > > > wrote: > > > > > > > > > > Hi Kurt, > > > > > > > > > > Re: > What I want to propose is we can merge #3 and #4, make them > > > > > > > > > > both > > > > > > > > > > under > > > > > > > > > > "catalog" concept, by extending catalog function to make it have > > > > > > > > > > ability to > > > > > > > > > > have built-in catalog functions. Some benefits I can see from > > > > > > > > > > this > > > > > > > > > > approach: > > > > > > > > > > 1. We don't have to introduce new concept like external built-in > > > > > > > > > > functions. > > > > > > > > > > Actually I don't see a full story about how to treat a built-in > > > > > > > > > > functions, and it > > > > > > > > > > seems a little bit disrupt with catalog. As a result, you have > > > > > > > > > > to > > > > > > > > > > make > > > > > > > > > > some restriction > > > > > > > > > > like "hive built-in functions can only be used when current > > > > > > > > > > catalog > > > > > > > > > > is > > > > > > > > > > hive catalog". > > > > > > > > > > Yes, I've unified #3 and #4 but it seems I didn't update some > > > > > > > > > > part > > > > > > > > > > of > > > > > > > > > > the doc. I've modified those sections, and they are up to date > > > > > > > > > > now. > > > > > > > > > > In short, now built-in function of external systems are defined > > > > > > > > > > as > > > > > > > > > > a > > > > > > > > > > special kind of catalog function in Flink, and handled by Flink > > > > > > > > > > as > > > > > > > > > > following: > > > > > - An external built-in function must be associated with a catalog > > > > > > > > > > for > > > > > > > > > > the purpose of decoupling flink-table and external systems. > > > > > - It always resides in front of catalog functions in ambiguous > > > > > > > > > > function > > > > > > > > > > reference order, just like in its own external system > > > > > - It is a special catalog function that doesn’t have a > > > > > > > > > > schema/database > > > > > > > > > > namespace > > > > > - It goes thru the same instantiation logic as other user defined > > > > > catalog functions in the external system > > > > > > > > > > Please take another look at the doc, and let me know if you have > > > > > > > > > > more > > > > > > > > > > questions. > > > > > > > > > > > > > > > On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]><[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> < > > > > [hidden email]> <[hidden email]> <[hidden email]> > > > > > > > > > > wrote: > > > > > > > > > > Hi Kurt, > > > > > > > > > > it should not affect the functions and operations we currently > > > > > > > > > > have > > > > > > > > > > in > > > > > > > > > > SQL. It just categorizes the available built-in functions. It is > > > > > > > > > > kind > > > > > > > > > > of > > > > > an orthogonal concept to the catalog API but built-in functions > > > > > > > > > > deserve > > > > > > > > > > this special kind of treatment. CatalogFunction still fits > > > > > > > > > > perfectly > > > > > > > > > > in > > > > > > > > > > there because the regular catalog object resolution logic is not > > > > > affected. So tables and functions are resolved in the same way > > > > > > > > > > but > > > > > > > > > > with > > > > > > > > > > built-in functions that have priority as in the original design. > > > > > > > > > > Regards, > > > > > Timo > > > > > > > > > > > > > > > On 03.09.19 15:26, Kurt Young wrote: > > > > > > > > > > Does this only affect the functions and operations we currently > > > > > > > > > > have > > > > > > > > > > in SQL > > > > > > > > > > and > > > > > have no effect on tables, right? Looks like this is an > > > > > > > > > > orthogonal > > > > > > > > > > concept > > > > > > > > > > with Catalog? > > > > > If the answer are both yes, then the catalog function will be a > > > > > > > > > > weird > > > > > > > > > > concept? > > > > > > > > > > Best, > > > > > Kurt > > > > > > > > > > > > > > > On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[hidden email] > > > > > > > > > > wrote: > > > > > > > > > > The way you proposed are basically the same as what Calcite > > > > > > > > > > does, I > > > > > > > > > > think > > > > > > > > > > we are in the same line. > > > > > > > > > > Best, > > > > > Danny Chan > > > > > 在 2019年9月3日 +0800 PM7:57,Timo Walther <[hidden email] > > > > > > > > > > ,写道: > > > > > > > > > > This sounds exactly as the module approach I mentioned, no? > > > > > > > > > > Regards, > > > > > Timo > > > > > > > > > > On 03.09.19 13:42, Danny Chan wrote: > > > > > > > > > > Thanks Bowen for bring up this topic, I think it’s a useful > > > > > > > > > > refactoring to make our function usage more user friendly. > > > > > > > > > > For the topic of how to organize the builtin operators and > > > > > > > > > > operators > > > > > > > > > > of Hive, here is a solution from Apache Calcite, the Calcite > > > > > > > > > > way > > > > > > > > > > is > > > > > > > > > > to make > > > > > > > > > > every dialect operators a “Library”, user can specify which > > > > > > > > > > libraries they > > > > > > > > > > want to use for a sql query. The builtin operators always > > > > > > > > > > comes > > > > > > > > > > as > > > > > > > > > > the > > > > > > > > > > first class objects and the others are used from the order > > > > > > > > > > they > > > > > > > > > > appears. > > > > > > > > > > Maybe you can take a reference. > > > > > > > > > > [1] > > > > > > > > > > > > > > > > > > > > > > > > > https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28 > > > > > > > > > > Best, > > > > > Danny Chan > > > > > 在 2019年8月28日 +0800 AM2:50,Bowen Li <[hidden email] > > > > > > > > > > ,写道: > > > > > > > > > > Hi folks, > > > > > > > > > > I'd like to kick off a discussion on reworking Flink's > > > > > > > > > > FunctionCatalog. > > > > > > > > > > It's critically helpful to improve function usability in > > > > > > > > > > SQL. > > > > > > > > > > > > > > > > > > > > > > > > > https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing > > > > > > > > > > In short, it: > > > > > - adds support for precise function reference with > > > > > > > > > > fully/partially > > > > > > > > > > qualified name > > > > > - redefines function resolution order for ambiguous > > > > > > > > > > function > > > > > > > > > > reference > > > > > > > > > > - adds support for Hive's rich built-in functions (support > > > > > > > > > > for > > > > > > > > > > Hive > > > > > > > > > > user > > > > > > > > > > defined functions was already added in 1.9.0) > > > > > - clarifies the concept of temporary functions > > > > > > > > > > Would love to hear your thoughts. > > > > > > > > > > Bowen > > > > > > > > > > -- > > > > > Xuefu Zhang > > > > > > > > > > "In Honey We Trust!" > > > > > > > > > > > > > > > -- > > > > > Xuefu Zhang > > > > > > > > > > "In Honey We Trust!" > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > |
Hi Fabian,
Yes, I agree 1-part/no-override is the least favorable thus I didn't include that as a voting option, and the discussion is mainly between 1-part/override builtin and 3-part/not override builtin. Re > However, it means that temp functions are differently treated than other db objects. IMO, the treatment difference results from the fact that functions are a bit different from other objects - Flink don't have any other built-in objects (tables, views) except functions. Cheers, Bowen |
Hi Bowen,
I understand the potential benefit of overriding certain built-in functions. I'm open to such a feature if many people agree. However, it would be great to still support overriding catalog functions with temporary functions in order to prototype a query even though a catalog/database might not be available currently or should not be modified yet. How about we support both cases? CREATE TEMPORARY FUNCTION abs -> creates/overrides a built-in function and never consideres current catalog and database; inconsistent with other DDL but acceptable for functions I guess. CREATE TEMPORARY FUNCTION cat.db.fun -> creates/overrides a catalog function Regarding "Flink don't have any other built-in objects (tables, views) except functions", this might change in the near future. Take https://issues.apache.org/jira/browse/FLINK-13900 as an example. Thanks, Timo On 14.09.19 01:40, Bowen Li wrote: > Hi Fabian, > > Yes, I agree 1-part/no-override is the least favorable thus I didn't > include that as a voting option, and the discussion is mainly between > 1-part/override builtin and 3-part/not override builtin. > > Re > However, it means that temp functions are differently treated than > other db objects. > IMO, the treatment difference results from the fact that functions are a > bit different from other objects - Flink don't have any other built-in > objects (tables, views) except functions. > > Cheers, > Bowen > |
Free forum by Nabble | Edit this page |