Hi all,
During the discussion of how to support Hive built-in functions in Flink in FLIP-57 [1], the idea of "modular built-in functions" was brought up, with "Extension" in Postgres [2] and "Plugin" in Presto [3] as examples. I'd like to kick off a discussion to see whether we should adopt such an approach.

To summarize the basics of the idea:
- functions from modules (e.g. Geo, ML) can be loaded into Flink as built-in functions
- modules can be configured with an order, and are discovered using SPI or set via code like "catalogManager.setFunctionModules(CoreFunctions, GeoFunctions, HiveFunctions)"
- built-in functions from external systems, like Hive, can be packaged into such a module

I took some time to research Presto Plugins and Postgres Extensions, and here are some of my findings.

Presto:
- "Presto's Catalog associated with a connector, and a catalog only contains schemas and references a data source via a connector." [4] A Presto catalog doesn't have the concept of catalog functions, so no Presto function has a namespace. Presto doesn't have function DDL either [5].
- Plugins are not specific to functions: "Plugins can provide additional Connectors, Types, Functions, and System Access Control" [6]
- Thus, I feel a Plugin in Presto acts more like a "catalog", similar to catalogs in Flink. Since Presto functions don't have namespaces, a Plugin can probably be seen as a built-in function module.

Postgres:
- A Postgres extension is always installed into a schema, not the entire cluster. There's a "schema_name" param in the extension creation DDL: "The name of the schema in which to install the extension's objects, given that the extension allows its contents to be relocated. The named schema must already exist. If not specified, and the extension's control file does not specify a schema either, the current default object creation schema is used." [7] Thus it also acts as a "catalog" for the schema, and functions in an extension are not built-in functions of Postgres.

Therefore, I feel the examples are not exactly the "built-in function modules" that were brought up, but feel free to correct me if I'm wrong.

Going back to the idea itself: although it seems to be a simpler concept and design in some ways, I have two concerns:

1. The major one is still name resolution: how do we deal with name collisions?
   - Disallowing duplicate names won't work for Hive built-in functions, as many of them share names with Flink's, so we must allow modules containing same-named functions to be registered
   - One assumption of this approach seems to be that, given modules are specified in order, functions from modules can be overridden according to that order?
   - If so, how can users reference a function that is overridden in the above case (e.g. I may want to switch KMEANS between modules ML1 and ML2, which have different implementations)?
     - If it's supported, it seems we still need some new syntax?
     - If it's not supported, that seems to be a major limitation for users
2. The minor one is that making built-in functions from external systems so widely accessible within Flink can cause performance issues in users' jobs
   - Unlike the potential native Flink Geo or ML functions, built-in functions from external systems come with a pretty big performance penalty in Flink due to data conversions and a different invocation mechanism. Supporting Hive built-in functions is mainly for simplifying migration from Hive. I'm not sure it makes sense for a user job that has nothing to do with Hive data to unintentionally end up using Hive built-in functions without the user knowing about the performance penalty. Though docs can help to some extent, not all users read docs in detail.

An alternative is to treat "function modules" as catalogs.
- For Flink native function modules like Geo or ML, they can be discovered and registered automatically at runtime with a predefined catalog name of their own, like "ml" or "ml1", which should be unique.
  Their functions are considered built-in functions of the catalog, and can be referenced in some new syntax like "catalog::func", e.g. "ml::kmeans" and "ml1::kmeans".
- For built-in functions from external systems (e.g. Hive), they have to be referenced either as "catalog::func", to make sure users explicitly expect those external functions, or as complementary built-in functions of Flink if a config "enable_hive_built_in_functions" in HiveCatalog is turned on.

Either approach seems to have its own benefits. I'm open to discussion and would like to hear others' opinions and use cases where a specific solution is required.

Thanks,
Bowen

[1] http://apache-flink-mailing-list-archive.1008284.n3.nabble.com/DISCUSS-FLIP-57-Rework-FunctionCatalog-td32291.html
[2] https://www.postgresql.org/docs/10/extend-extensions.html
[3] https://prestodb.github.io/docs/current/develop/functions.html
[4] https://prestodb.github.io/docs/current/overview/concepts.html#data-sources
[5] https://prestodb.github.io/docs/current/sql
[6] https://prestodb.github.io/docs/current/develop/spi-overview.html
[7] https://www.postgresql.org/docs/9.1/sql-createextension.html

Hi Bowen,
Thank you for sharing your research summaries. The concerns you raised about the modular approach are very valuable and practical. Here are some of my thoughts.

1. Naming conflicts and resolution. Naming conflicts are likely, and as you suggested, resolution can simply follow the specified order of the add-on modules, with the native Flink built-in functions coming first. Moreover, we can allow users to specify a black-/whitelist of functions for each module (including the native one). With this, if a user prefers a function at a lower resolution priority, they can just blacklist the one at the higher priority. Note that black-/whitelisting is needed more generally, because users may not want to take every function that is available in a module. It's up to the user to selectively "import" whatever is needed. Nevertheless, I imagine that naming conflicts and resolution are mostly a task of initial setup, so the burden is bearable.

2. Performance issue. While the add-on built-in functions may not perform as well as the native ones, the point of introducing modular functions is to solve the problem of missing functions. Thus, performance is less of an issue in my opinion. Moreover, more performant native implementations may gradually be added to the native set, which would reduce the dependency on external implementations and improve performance. Thus, the performance issue is secondary and solvable.

In short, compared to the special syntax, which isn't standard SQL, I personally feel that the modular approach is a simpler, better option for supporting Hive built-in functions. With additional black-/whitelisting, it offers enough flexibility to meet common practical usage.
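
To make the ordered resolution with black-/whitelisting concrete, here is a minimal sketch in Python. It is not Flink code: `FunctionModule`, `resolve`, and the toy `core`/`ml1`/`ml2` modules are all hypothetical names made up for illustration, and it assumes function names are matched case-insensitively and stored in lowercase.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set, Tuple


@dataclass
class FunctionModule:
    """Hypothetical module: a named set of functions plus optional filters."""
    name: str
    functions: Dict[str, Callable]        # keys are assumed to be lowercase
    whitelist: Optional[Set[str]] = None  # None means "expose everything"
    blacklist: Set[str] = field(default_factory=set)

    def lookup(self, func_name: str) -> Optional[Callable]:
        key = func_name.lower()
        if key in self.blacklist:
            return None  # explicitly hidden by the user
        if self.whitelist is not None and key not in self.whitelist:
            return None  # not selectively "imported"
        return self.functions.get(key)


def resolve(modules: List[FunctionModule], func_name: str) -> Tuple[str, Callable]:
    """Return (module name, function) from the first module that exposes it."""
    for module in modules:
        func = module.lookup(func_name)
        if func is not None:
            return module.name, func
    raise KeyError(f"function not found: {func_name}")


# Toy stand-ins for real implementations (assumptions, for illustration only).
core = FunctionModule("core", {"concat": lambda a, b: a + b})
ml1 = FunctionModule("ml1", {"kmeans": lambda: "kmeans from ml1"})
ml2 = FunctionModule("ml2", {"kmeans": lambda: "kmeans from ml2"})

# Resolution order: native functions first, then add-on modules as listed.
modules = [core, ml1, ml2]
```

With this setup, `resolve(modules, "KMEANS")` picks the `ml1` implementation because `ml1` comes first in the list. After `ml1.blacklist.add("kmeans")`, the same call falls through to `ml2`, which is the "blacklist the one at the higher priority" workflow described above. Note that the sketch only switches implementations through configuration; it does not let a single query reference both `kmeans` variants at once.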

Thanks,
Xuefu

On Tue, Sep 10, 2019 at 7:20 AM Bowen Li wrote:
> Hi all,
> [...]

--
Xuefu Zhang
"In Honey We Trust!"