[DISCUSS] A mechanism to validate the precision of columns for connectors

Zhenghua Gao
Hi dev,

I'd like to kick off a discussion on a mechanism to validate the precision
of columns for some connectors.

We've agreed that the user should be informed if the connector does not
support the desired precision. From the connector developer's point of view,
there are three levels of information to consider:

   -  the ability of the external system (e.g. Apache Derby supports
   TIMESTAMP(9), MySQL supports TIMESTAMP(6), etc.)

Connector developers should use this information to validate the user's DDL
and throw an exception if a column's declared precision is out of range.


   - schema of referenced tables in external systems

If the schema information of the referenced tables is available at compile
time, connector developers could use it to find mismatches with the DDL. But
in many cases the schema information is unavailable because of network
isolation or access control, so we should use it with caution.


   - schema-less external systems (e.g. HBase)

If the external system is schema-less, like HBase, the connector developer
should make sure the connector doesn't cause precision loss (e.g. flink-hbase
serializes java.sql.Timestamp to a long encoded as bytes, which only keeps
millisecond precision).
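
For illustration, here is a minimal Java snippet (not taken from flink-hbase
itself) showing why a round trip through epoch milliseconds keeps only
millisecond precision:

    import java.sql.Timestamp;

    public class TimestampPrecisionLossDemo {
        public static void main(String[] args) {
            // A timestamp with nanosecond precision.
            Timestamp original = Timestamp.valueOf("2020-01-10 11:47:00.123456789");

            // Serializing via epoch milliseconds (a long, as flink-hbase does)
            // drops everything below millisecond precision.
            long epochMillis = original.getTime();
            Timestamp restored = new Timestamp(epochMillis);

            System.out.println(original);  // 2020-01-10 11:47:00.123456789
            System.out.println(restored);  // 2020-01-10 11:47:00.123
        }
    }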

To make this more concrete, some scenarios for the JDBC connector are listed
below, followed by a rough sketch of such a validation check:

   - The underlying DB supports DECIMAL(65, 30), which is out of the range
   of Flink's DECIMAL
   - The underlying DB supports TIMESTAMP(6), and the user wants to define a
   table with TIMESTAMP(9) in Flink
   - The user defines a table with DECIMAL(10, 4) in the underlying DB, and
   wants to define a table with DECIMAL(5, 2) in Flink
   - The precision supported by the underlying DB varies between versions

What do you think about this? Any feedback is appreciated.

*Best Regards,*
*Zhenghua Gao*

Re: [DISCUSS] A mechanism to validate the precision of columns for connectors

Jingsong Li
Hi Zhenghua,

I think it's not just about the precision of types: connectors don't
validate the types themselves either.
There is currently a "SchemaValidator", but it only validates the type
properties, not a connector's type support.
I think we could have something like a "DataTypeValidator" to help connectors
validate their type support.

Considering the current validator design, the validator is called by the
connector itself, so it's more like a util class than a mechanism.
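
To make the idea more concrete, here is a rough sketch of what such a helper
could look like (the name and shape below are hypothetical, not existing
Flink API):

    import org.apache.flink.table.types.DataType;

    /**
     * Hypothetical helper for connectors to check their type support.
     * As a real mechanism, the planner would call this for every declared
     * column instead of each connector calling it ad hoc.
     */
    public interface DataTypeValidator {

        /** Throws ValidationException if the connector cannot support the type. */
        void validateDataType(String columnName, DataType declaredType);
    }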

Best,
Jingsong Lee


Re: [DISCUSS] A mechanism to validate the precision of columns for connectors

Zhenghua Gao
Hi Jingsong Lee,

You are right that the connectors don't validate data types now, either.
We seem to lack a mechanism to validate properties [1], data types, etc. for
CREATE TABLE.

[1] https://issues.apache.org/jira/browse/FLINK-15509

*Best Regards,*
*Zhenghua Gao*



Re: [DISCUSS] A mechanism to validate the precision of columns for connectors

bowen.li
Hi Zhenghua,

For external systems with a schema, I think the schema information is
available most of the time and should be the single source of truth for
programmatically mapping column precision via Flink catalogs, to minimize
users' effort in redundantly recreating schemas and to avoid human errors.
Such schemas will always be a subset of the types and precisions the system
supports, so you don't need to validate the 1st category, "the ability of the
external system". This would apply to most schema storage systems, like
relational DBMSs, the Hive metastore, and Avro schemas in the Confluent
Schema Registry for Kafka.
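
As a rough illustration of that direction, here is a sketch that derives
column precision from a JDBC database via standard DatabaseMetaData and maps
it to Flink types. The class name is made up, only DECIMAL and TIMESTAMP are
handled, and how drivers report fractional-seconds precision varies.

    import java.sql.Connection;
    import java.sql.DatabaseMetaData;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Types;
    import java.util.LinkedHashMap;
    import java.util.Map;

    import org.apache.flink.table.api.DataTypes;
    import org.apache.flink.table.types.DataType;

    public final class JdbcSchemaDeriver {

        /** Reads column precision/scale from the DB and maps it to Flink types. */
        public static Map<String, DataType> deriveSchema(Connection conn, String table)
                throws SQLException {
            Map<String, DataType> columns = new LinkedHashMap<>();
            DatabaseMetaData meta = conn.getMetaData();
            try (ResultSet rs = meta.getColumns(null, null, table, null)) {
                while (rs.next()) {
                    String name = rs.getString("COLUMN_NAME");
                    int jdbcType = rs.getInt("DATA_TYPE");
                    int precision = rs.getInt("COLUMN_SIZE");
                    int scale = rs.getInt("DECIMAL_DIGITS");
                    switch (jdbcType) {
                        case Types.DECIMAL:
                        case Types.NUMERIC:
                            // Flink's DecimalType only supports precision up to 38,
                            // so this throws for e.g. DECIMAL(65, 30).
                            columns.put(name, DataTypes.DECIMAL(precision, scale));
                            break;
                        case Types.TIMESTAMP:
                            // Many drivers report fractional-seconds precision here.
                            columns.put(name, DataTypes.TIMESTAMP(scale));
                            break;
                        default:
                            // Other types omitted in this sketch.
                            break;
                    }
                }
            }
            return columns;
        }
    }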

From my observation, the real problem right now is that Flink cannot truly
leverage external system schemas via Flink Catalogs, as documented in [1].

I'm not sure there are any unsolvable network or authorization problems, as
most systems nowadays can be read with a simple access id/key pair via VPC,
intranet, or the internet. What problems have you run into?

For schema-less systems, we'd have to rely on full test coverage in Flink.

[1] https://issues.apache.org/jira/browse/FLINK-15545
