[DISCUSS] modular built-in functions

[DISCUSS] modular built-in functions

bowen.li
Hi all,

During the discussion of how to support Hive built-in functions in Flink in
FLIP-57 [1], an idea of "modular built-in functions" was brought up with
examples of "Extension" in Postgres [2] and "Plugin" in Presto [3]. Thus
I'd like to kick off a discussion to see if we should adopt such an
approach.

Let me try to summarize the basics of the idea:
    - functions from modules (e.g. Geo, ML) can be loaded into Flink as
built-in functions
    - modules can be configured with an order, and either discovered via SPI
or set in code like "catalogManager.setFunctionModules(CoreFunctions,
GeoFunctions, HiveFunctions)"
    - built-in functions from external systems, like Hive, can be packaged
into such a module (see the sketch below)
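
To make the idea concrete, here is a minimal sketch of what such a module
abstraction and its registration might look like. FunctionModule,
setFunctionModules, and the module class names are hypothetical names from
this discussion, not existing Flink APIs; FunctionDefinition stands for
Flink's existing function definition interface:

    import java.util.Optional;
    import java.util.Set;
    import org.apache.flink.table.functions.FunctionDefinition;

    // Hypothetical abstraction: a pluggable set of built-in functions.
    public interface FunctionModule {
        // names of all functions this module provides
        Set<String> listFunctions();

        // resolve a name to a definition, if this module provides it
        Optional<FunctionDefinition> getFunction(String name);
    }

    // Hypothetical registration, in resolution order:
    // catalogManager.setFunctionModules(
    //     new CoreFunctions(), new GeoFunctions(), new HiveFunctions());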

I took some time to research Presto's Plugin and Postgres's Extension, and
here are some of my findings.

Presto:
    - "A Presto Catalog is associated with a connector, and a catalog only
contains schemas and references a data source via a connector." [4] A
Presto catalog doesn't have the concept of catalog functions, so Presto
functions have no namespaces. Neither does Presto have function DDL [5].
    - Plugins are not specific to functions - "Plugins can provide
additional Connectors, Types, Functions, and System Access Control" [6]
    - Thus, I feel a Plugin in Presto acts more like a "catalog", similar to
catalogs in Flink. Since Presto functions have no namespaces, a Plugin can
probably also be seen as a built-in function module.

Postgres:
    - A Postgres extension is always installed into a schema, not the entire
cluster. There's a "schema_name" param in the extension creation DDL - "The
name of the schema in which to install the extension's objects, given that
the extension allows its contents to be relocated. The named schema must
already exist. If not specified, and the extension's control file does not
specify a schema either, the current default object creation schema is
used." [7] Thus an extension also acts like a "catalog" scoped to a schema,
and functions in an extension are not built-in functions in Postgres.

Therefore, I feel the examples are not exactly the "built-in function
modules" that were brought up, but feel free to correct me if I'm wrong.

Going back to the idea itself: while it seems to be a simpler concept and
design in some ways, I have two concerns.
1. The major one is still name resolution - how do we deal with name
collisions?
    - Disallowing duplicate names won't work for Hive built-in functions,
as many of them share names with Flink's, so we must allow modules
containing identically named functions to be registered
    - One assumption of this approach seems to be that, given modules are
specified in order, functions from modules can be overridden according to
that order?
    - If so, how can users reference a function that is overridden in the
above case (e.g. I may want to switch KMEANS between modules ML1 and ML2
with different implementations)? See the sketch after these concerns.
         - If it's supported, it seems we still need some new syntax?
         - If it's not supported, that seems to be a major limitation for
users
2. The minor one is that allowing built-in functions from external systems
to be accessed so widely within Flink can bring performance issues to
users' jobs
    - Unlike potential native Flink Geo or ML functions, built-in
functions from external systems come with a pretty big performance penalty
in Flink due to data conversions and different invocation mechanisms.
Supporting Hive built-in functions is mainly for simplifying migration from
Hive. I'm not sure it makes sense for a user job that has nothing to do
with Hive data to unintentionally end up using Hive built-in functions
without knowing about the performance penalty. Though docs can help to some
extent, not all users really read docs in detail.
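
To illustrate the resolution concern in #1, here is a minimal sketch,
reusing the hypothetical FunctionModule interface above, of what
order-based resolution would presumably look like. The first module
providing a name wins, so without new syntax there is no way to reach ML2's
KMEANS once ML1 is registered ahead of it:

    import java.util.List;
    import java.util.Optional;
    import org.apache.flink.table.functions.FunctionDefinition;

    public class ModuleResolver {
        // First-match-wins resolution over the ordered module list.
        public static Optional<FunctionDefinition> resolveFunction(
                String name, List<FunctionModule> modules) {
            for (FunctionModule module : modules) {
                Optional<FunctionDefinition> f = module.getFunction(name);
                if (f.isPresent()) {
                    // e.g. "KMEANS" resolves to ML1's version when ML1
                    // precedes ML2; ML2's version is shadowed/unreachable
                    return f;
                }
            }
            return Optional.empty();
        }
    }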

An alternative is to treat "function modules" as catalogs.
- For Flink native function modules like Geo or ML, they can be discovered
and registered automatically at runtime, each under a predefined catalog
name like "ml" or "ml1", which should be unique. Their functions are
considered built-in functions of that catalog, and can be referenced in
some new syntax like "catalog::func", e.g. "ml::kmeans" and "ml1::kmeans".
- For built-in functions from external systems (e.g. Hive), they have to be
referenced either as "catalog::func", to make sure users are explicitly
expecting those external functions, or as complementary built-in functions
to Flink if a config "enable_hive_built_in_functions" in HiveCatalog is
turned on. Both usages are sketched below.
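
Here is a rough sketch of how this alternative might look from a user's
perspective. The "::" syntax, the catalog names, and the config key are all
proposals in this thread, not existing Flink features:

    // Native function modules, registered as catalogs "ml" and "ml1":
    tableEnv.sqlQuery("SELECT ml::kmeans(features) FROM points");
    tableEnv.sqlQuery("SELECT ml1::kmeans(features) FROM points");

    // Hive built-in functions, referenced explicitly ...
    tableEnv.sqlQuery("SELECT hive::str_to_map(kv) FROM t");

    // ... or used as complementary Flink built-in functions, but only
    // after the proposed HiveCatalog config is turned on:
    // enable_hive_built_in_functions = true
    tableEnv.sqlQuery("SELECT str_to_map(kv) FROM t");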

Either approach seems to have its own benefits. I'm open for discussion and
would like to hear others' opinions, as well as use cases where a specific
solution is required.

Thanks,
Bowen


[1]
http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html
[2] https://www.postgresql.org/docs/10/extend-extensions.html
[3] https://prestodb.github.io/docs/current/develop/functions.html
[4]
https://prestodb.github.io/docs/current/overview/concepts.html#data-sources
[5] https://prestodb.github.io/docs/current/sql
[6] https://prestodb.github.io/docs/current/develop/spi-overview.html
[7] https://www.postgresql.org/docs/9.1/sql-createextension.html

Re: [DISCUSS] modular built-in functions

Xuefu Z
Hi Bowen,

Thank you for sharing your research summary. The concerns you raised about
the modular approach are very valuable and practical. Here are some of my
thoughts.

1. Naming conflicts and resolution. Naming conflicts are likely, and as you
suggested, resolution can simply be based on the specified order of the
add-on modules, with native Flink built-in functions taking the highest
priority. Moreover, we can allow users to specify a black-/whitelist of
functions for each module (including the native one). With this, if a user
prefers a function at a lower resolution priority, they can just blacklist
the one at the higher resolution priority (see the sketch below). Please
note that black-/whitelisting is more generally needed because a user may
not want to take whatever functions are available in a module. It's up to
the user to selectively "import" whatever is needed.
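
As a concrete sketch of the black-/whitelisting idea (again with
hypothetical module names and a hypothetical blacklist/whitelist API): a
user who prefers ML2's KMEANS over ML1's, even though ML1 resolves first,
could simply blacklist ML1's version:

    // Blacklist ML1's KMEANS so resolution falls through to ML2's
    // version, despite ML1's higher priority in the module order.
    catalogManager.setFunctionModules(
        new CoreFunctions(),
        new Ml1Functions().blacklist("KMEANS"),
        new Ml2Functions());

    // Or whitelist to selectively "import" only what's needed:
    // new HiveFunctions().whitelist("str_to_map", "sentences")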

Nevertheless, I imagine that naming conflicts and resolution are mostly a
matter of initial setup. Thus, the burden is bearable.

2. Performance issue. While the performance of the add-on built-in
functions may not be as good as the native ones, introducing modular
functions is meant to solve the problem of missing functions. Thus,
performance is less of an issue in my opinion. Moreover, more performant
native implementations may gradually be added to the native set, which
would reduce the dependency on external implementations and improve
performance. Thus, the performance issue is secondary and solvable.

In short, compared to the special syntax, which isn't SQL standard, I
personally feel that the modular approach is a simpler, better option for
supporting Hive built-in functions. With additional black-/whitelisting, it
offers enough flexibility to meet common practical usage.

Thanks,
Xuefu

--
Xuefu Zhang

"In Honey We Trust!"