Hi,
I wanted to bring up the topic of primary key constraints that we could leverage for runtime optimizations. Please have a look at the proposal and I would especially draw your attention to the topic of nullability of columns that are part of a primary key. Some time in the future we will also be able to leverage it for upserting streaming tables. Here is the proposal: https://cwiki.apache.org/confluence/display/FLINK/FLIP+87%3A+Primary+key+constraints+in+Table+API Best, Dawid signature.asc (849 bytes) Download Attachment |
Hi Dawid,
Thanks for driving this. Primary key is a very important and useful feature to Flink Table API / SQL. I'm +1 to the general design. And have some thoughts as following: 1. > The design says Flink only support NOT ENFORCED mode. But the DDL and KeyConstraint#primaryKey(..) can pass in ENFORCED mode. What will we do when user specify an ENFORCED mode? Shall we forbid it? 2. > "When creating a table, creating a primary key constraint will alter the columns nullability." I think we should force the columns to be declared NOT NULL, instead of change the columns nullability implicitly. Otherwise, it may be confused what is the nullbility of the primary key columns. 3. > "Support for Unique key is not part of the FLIP. It is just mentioned to show how can we extend the primary key concept with more constraints in the future." I think we can include Unique Key as part of the FLIP, as it is already stated clearly. We can put the implemenation task of unique key in a lower priority. 4. > Method for retrieving primary key constraint is in CatalogBaseTable. In SQL standard, primary key is declared in line or out of line. If it is out of line, it is still in the schema part, i.e. besides column definitions and in the parentheses. So I think maybe primary key should belong to `TableSchema` becuase it is a part of schema. CREATE TABLE xxx ( a int, b varchar, c bigint, primary key (a, b) ) with (...) Best, Jark On Tue, 19 Nov 2019 at 23:18, Dawid Wysakowicz <[hidden email]> wrote: > Hi, > > I wanted to bring up the topic of primary key constraints that we could > leverage for runtime optimizations. Please have a look at the proposal > and I would especially draw your attention to the topic of nullability > of columns that are part of a primary key. Some time in the future we > will also be able to leverage it for upserting streaming tables. > > Here is the proposal: > > https://cwiki.apache.org/confluence/display/FLINK/FLIP+87%3A+Primary+key+constraints+in+Table+API > > Best, > > Dawid > > > |
Hi Jark, Thank you for your comments. Ad. 1 My intention was to throw an exception if a user tries to use the ENFORCE mode. SQL standard states this is the default mode, so if we ever want to support the ENFORCE/NOT ENFORCE mode we need to make sure users now always provide the NOT ENFORCE flag. Does it make sense to you? If it does, I will clarify the FLIP on this topic. Ad. 2 This is the behavior of majority of RDBM systems. I think this is quite useful from the usability perspective. A column is nullable by default. It is also a good practice to always define PRIMARY KEY for a table. Therefore we would force users to add additional annotations. E.g. instead of: CREATE TABLE tableName(id INT PRIMARY KEY); user would always have to type CREATE TABLE tableName(id INT NOT NULL PRIMARY KEY); I am not totally against it, if this is the preferred way. Ad. 3 The only reason why I didn't want to add a full UNIQUE
support is that we would have to decide how do we handle null for
the UNIQUE constraint. Some databases allow multiple rows to have
a null in a column with unique constraint, whereas other allow
only one. That means if you have a table CREATE TABLE tableName
(id INTEGER UNIQUE); In some databases it is legal to have rows
(null; null; null; null; ....) while in others you can have at
most (null). Do you think we should still discuss this as well? Ad. 4 I don't have a strong opinion on that. Actually SQL standard describes a Table descriptor. Some excerpts from the standard: — The name of the base table. ... — An indication of whether the table is insertable-into or not. — The column descriptor of each column in the table. — The descriptor of each table constraint specified for the
table. In this context I thought primary key belongs to the Table rather than schema, where schema is as far as I understand just a list of Table column descriptors. Tbh the TableSchema does not fit very well into that description of the descriptor. If you have a good reasons why you think it would be better to have it in the TableSchema, I would be fine with it. We should though have a clear separation what belongs to the Table schema and what does not (e.g. partitions are in the Catalog table now). Best, Dawid On 20/11/2019 04:32, Jark Wu wrote:
Hi Dawid, Thanks for driving this. Primary key is a very important and useful feature to Flink Table API / SQL. I'm +1 to the general design. And have some thoughts as following: 1. > The design says Flink only support NOT ENFORCED mode. But the DDL and KeyConstraint#primaryKey(..) can pass in ENFORCED mode. What will we do when user specify an ENFORCED mode? Shall we forbid it? 2. > "When creating a table, creating a primary key constraint will alter the columns nullability." I think we should force the columns to be declared NOT NULL, instead of change the columns nullability implicitly. Otherwise, it may be confused what is the nullbility of the primary key columns. 3. > "Support for Unique key is not part of the FLIP. It is just mentioned to show how can we extend the primary key concept with more constraints in the future." I think we can include Unique Key as part of the FLIP, as it is already stated clearly. We can put the implemenation task of unique key in a lower priority. 4. > Method for retrieving primary key constraint is in CatalogBaseTable. In SQL standard, primary key is declared in line or out of line. If it is out of line, it is still in the schema part, i.e. besides column definitions and in the parentheses. So I think maybe primary key should belong to `TableSchema` becuase it is a part of schema. CREATE TABLE xxx ( a int, b varchar, c bigint, primary key (a, b) ) with (...) Best, Jark On Tue, 19 Nov 2019 at 23:18, Dawid Wysakowicz [hidden email] wrote:Hi, I wanted to bring up the topic of primary key constraints that we could leverage for runtime optimizations. Please have a look at the proposal and I would especially draw your attention to the topic of nullability of columns that are part of a primary key. Some time in the future we will also be able to leverage it for upserting streaming tables. Here is the proposal: https://cwiki.apache.org/confluence/display/FLINK/FLIP+87%3A+Primary+key+constraints+in+Table+API Best, Dawid signature.asc (849 bytes) Download Attachment |
In reply to this post by Jark Wu-2
Hey,
After an offline discussion I updated the FLIP slightly. In the initial FLIP I put the OUT_OF_LINE definition at a wrong position. I moved it into the statement paranthesis. I also moved the primary key to the TableSchema class. I also extended sections about UNIQUE key support and how do the primary keys affect sources and sinks. Please have another look. If there are no further concerns, I would like to start the vote soon. Best, Dawid On 20/11/2019 04:32, Jark Wu wrote: > Hi Dawid, > > Thanks for driving this. Primary key is a very important and useful feature > to Flink Table API / SQL. > > I'm +1 to the general design. > > And have some thoughts as following: > > 1. > The design says Flink only support NOT ENFORCED mode. But the DDL > and KeyConstraint#primaryKey(..) can pass in ENFORCED mode. > What will we do when user specify an ENFORCED mode? Shall we forbid > it? > > 2. > "When creating a table, creating a primary key constraint will alter > the columns nullability." > I think we should force the columns to be declared NOT NULL, instead of > change the columns nullability implicitly. > Otherwise, it may be confused what is the nullbility of the primary key > columns. > > 3. > "Support for Unique key is not part of the FLIP. It is just mentioned > to show how can we extend the primary key concept with more constraints in > the future." > I think we can include Unique Key as part of the FLIP, as it is already > stated clearly. We can put the implemenation task of unique key in a lower > priority. > > 4. > Method for retrieving primary key constraint is in CatalogBaseTable. > In SQL standard, primary key is declared in line or out of line. If it > is out of line, it is still in the schema part, i.e. besides column > definitions and in the parentheses. > So I think maybe primary key should belong to `TableSchema` becuase it > is a part of schema. > > CREATE TABLE xxx ( > a int, > b varchar, > c bigint, > primary key (a, b) > ) with (...) > > > Best, > Jark > > > On Tue, 19 Nov 2019 at 23:18, Dawid Wysakowicz <[hidden email]> > wrote: > >> Hi, >> >> I wanted to bring up the topic of primary key constraints that we could >> leverage for runtime optimizations. Please have a look at the proposal >> and I would especially draw your attention to the topic of nullability >> of columns that are part of a primary key. Some time in the future we >> will also be able to leverage it for upserting streaming tables. >> >> Here is the proposal: >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP+87%3A+Primary+key+constraints+in+Table+API >> >> Best, >> >> Dawid >> >> >> signature.asc (849 bytes) Download Attachment |
Hi all,
Thanks Dawid for bringing this up, this is very important property of SQL and I'm big +1 to have this. But before the discussion going deeply, I want to first mention that according to SQL standard, any unique key constraint which including primary key, should be always be enforced. Please see: 4.18.3.1 Introduction to table constraints "NOTE 50 — A unique constraint is always enforced. A table check constraint or a referential constraint can either be enforced or not enforced." Best, Kurt On Wed, Nov 20, 2019 at 5:43 PM Dawid Wysakowicz <[hidden email]> wrote: > Hey, > > After an offline discussion I updated the FLIP slightly. In the initial > FLIP I put the OUT_OF_LINE definition at a wrong position. I moved it > into the statement paranthesis. I also moved the primary key to the > TableSchema class. > > I also extended sections about UNIQUE key support and how do the primary > keys affect sources and sinks. > > Please have another look. If there are no further concerns, I would like > to start the vote soon. > > Best, > > Dawid > > On 20/11/2019 04:32, Jark Wu wrote: > > Hi Dawid, > > > > Thanks for driving this. Primary key is a very important and useful > feature > > to Flink Table API / SQL. > > > > I'm +1 to the general design. > > > > And have some thoughts as following: > > > > 1. > The design says Flink only support NOT ENFORCED mode. But the DDL > > and KeyConstraint#primaryKey(..) can pass in ENFORCED mode. > > What will we do when user specify an ENFORCED mode? Shall we forbid > > it? > > > > 2. > "When creating a table, creating a primary key constraint will alter > > the columns nullability." > > I think we should force the columns to be declared NOT NULL, instead > of > > change the columns nullability implicitly. > > Otherwise, it may be confused what is the nullbility of the primary > key > > columns. > > > > 3. > "Support for Unique key is not part of the FLIP. It is just > mentioned > > to show how can we extend the primary key concept with more constraints > in > > the future." > > I think we can include Unique Key as part of the FLIP, as it is > already > > stated clearly. We can put the implemenation task of unique key in a > lower > > priority. > > > > 4. > Method for retrieving primary key constraint is in CatalogBaseTable. > > In SQL standard, primary key is declared in line or out of line. If > it > > is out of line, it is still in the schema part, i.e. besides column > > definitions and in the parentheses. > > So I think maybe primary key should belong to `TableSchema` becuase > it > > is a part of schema. > > > > CREATE TABLE xxx ( > > a int, > > b varchar, > > c bigint, > > primary key (a, b) > > ) with (...) > > > > > > Best, > > Jark > > > > > > On Tue, 19 Nov 2019 at 23:18, Dawid Wysakowicz <[hidden email]> > > wrote: > > > >> Hi, > >> > >> I wanted to bring up the topic of primary key constraints that we could > >> leverage for runtime optimizations. Please have a look at the proposal > >> and I would especially draw your attention to the topic of nullability > >> of columns that are part of a primary key. Some time in the future we > >> will also be able to leverage it for upserting streaming tables. > >> > >> Here is the proposal: > >> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP+87%3A+Primary+key+constraints+in+Table+API > >> > >> Best, > >> > >> Dawid > >> > >> > >> > > |
Hi Kurt,
You are right that this is included in the SQL standard. The problem is we are not a database and we do not own the data. There is no way for us to truly enforce the uniqueness constraint. We are always writing to some external systems that, moreover we are not an exclusive produce/consumer for that data. We cannot ensure that external systems will not corrupt the data. The way RDBM systems ensure uniqueness constraint is that they keep the index of all keys for a table that they check against for every write. This works only if the system that maintains the index is the only writer to that table. Moreover some of the systems also allow loosening this restriction e.g. DB2: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.wn.doc/doc/c0061153.html HIVE(with a non standard syntax, they do it for the same reasons as listed above): https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Constraints What do you think? Best, Dawid On 20/11/2019 13:51, Kurt Young wrote: > Hi all, > > Thanks Dawid for bringing this up, this is very important property of SQL > and > I'm big +1 to have this. > > But before the discussion going deeply, I want to first mention that > according to > SQL standard, any unique key constraint which including primary key, should > be > always be enforced. > > Please see: > > 4.18.3.1 Introduction to table constraints > > "NOTE 50 — A unique constraint is always enforced. A table check constraint > or > a referential constraint can either be enforced or not enforced." > > Best, > Kurt > > > On Wed, Nov 20, 2019 at 5:43 PM Dawid Wysakowicz <[hidden email]> > wrote: > >> Hey, >> >> After an offline discussion I updated the FLIP slightly. In the initial >> FLIP I put the OUT_OF_LINE definition at a wrong position. I moved it >> into the statement paranthesis. I also moved the primary key to the >> TableSchema class. >> >> I also extended sections about UNIQUE key support and how do the primary >> keys affect sources and sinks. >> >> Please have another look. If there are no further concerns, I would like >> to start the vote soon. >> >> Best, >> >> Dawid >> >> On 20/11/2019 04:32, Jark Wu wrote: >>> Hi Dawid, >>> >>> Thanks for driving this. Primary key is a very important and useful >> feature >>> to Flink Table API / SQL. >>> >>> I'm +1 to the general design. >>> >>> And have some thoughts as following: >>> >>> 1. > The design says Flink only support NOT ENFORCED mode. But the DDL >>> and KeyConstraint#primaryKey(..) can pass in ENFORCED mode. >>> What will we do when user specify an ENFORCED mode? Shall we forbid >>> it? >>> >>> 2. > "When creating a table, creating a primary key constraint will alter >>> the columns nullability." >>> I think we should force the columns to be declared NOT NULL, instead >> of >>> change the columns nullability implicitly. >>> Otherwise, it may be confused what is the nullbility of the primary >> key >>> columns. >>> >>> 3. > "Support for Unique key is not part of the FLIP. It is just >> mentioned >>> to show how can we extend the primary key concept with more constraints >> in >>> the future." >>> I think we can include Unique Key as part of the FLIP, as it is >> already >>> stated clearly. We can put the implemenation task of unique key in a >> lower >>> priority. >>> >>> 4. > Method for retrieving primary key constraint is in CatalogBaseTable. >>> In SQL standard, primary key is declared in line or out of line. If >> it >>> is out of line, it is still in the schema part, i.e. besides column >>> definitions and in the parentheses. >>> So I think maybe primary key should belong to `TableSchema` becuase >> it >>> is a part of schema. >>> >>> CREATE TABLE xxx ( >>> a int, >>> b varchar, >>> c bigint, >>> primary key (a, b) >>> ) with (...) >>> >>> >>> Best, >>> Jark >>> >>> >>> On Tue, 19 Nov 2019 at 23:18, Dawid Wysakowicz <[hidden email]> >>> wrote: >>> >>>> Hi, >>>> >>>> I wanted to bring up the topic of primary key constraints that we could >>>> leverage for runtime optimizations. Please have a look at the proposal >>>> and I would especially draw your attention to the topic of nullability >>>> of columns that are part of a primary key. Some time in the future we >>>> will also be able to leverage it for upserting streaming tables. >>>> >>>> Here is the proposal: >>>> >>>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP+87%3A+Primary+key+constraints+in+Table+API >>>> Best, >>>> >>>> Dawid >>>> >>>> >>>> >> signature.asc (849 bytes) Download Attachment |
Hi Dawid,
Actually I'm not 100% sure about both choices: always enforce primary key by Flink and not enforce at all. As you said, always enforce primary key is hard to achieve since Flink doesn't own data most of the time. But with some certain cases, it's actually doable, like writing a new table in a batch job, or if we have upsert flag supported in message queue. So it's hard to say Flink has no chance to enforce primary key here. On the other hand, not enforce primary key indeed save us a lot of work, but also would cause confusing sometimes. Especially when the table is written and read by Flink only. For example, some user created a table with primary key not enforced, and they will use it as source and sink table. This situation is quite confusing, because Flink didn't enforce primary key when writing, so it's hard to say we can rely on such information during read. It looks we are contradictory with ourself. AFAIK, most databases will only do validity check only when writing to table, and simply trust such information during read. I think we could also follow this way. Since we are not sure yet what kind of behavior when writing to a table with primary key, I would suggest to defer this util we start to talk about the source and sink behavior with primary key, maybe as well as introducing update, delete flag. To summarize, my proposal would be: 1. Add primary key but not introducing enforce flag now 2. Table with primary key can't be used as sink, and we will simply trust pk info during read. Best, Kurt On Wed, Nov 20, 2019 at 9:26 PM Dawid Wysakowicz <[hidden email]> wrote: > Hi Kurt, > > You are right that this is included in the SQL standard. > > The problem is we are not a database and we do not own the data. There > is no way for us to truly enforce the uniqueness constraint. We are > always writing to some external systems that, moreover we are not an > exclusive produce/consumer for that data. We cannot ensure that external > systems will not corrupt the data. > > The way RDBM systems ensure uniqueness constraint is that they keep the > index of all keys for a table that they check against for every write. > This works only if the system that maintains the index is the only > writer to that table. > > Moreover some of the systems also allow loosening this restriction e.g. > > DB2: > > https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.wn.doc/doc/c0061153.html > > HIVE(with a non standard syntax, they do it for the same reasons as > listed above): > > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Constraints > > > What do you think? > > Best, > > Dawid > > > On 20/11/2019 13:51, Kurt Young wrote: > > Hi all, > > > > Thanks Dawid for bringing this up, this is very important property of SQL > > and > > I'm big +1 to have this. > > > > But before the discussion going deeply, I want to first mention that > > according to > > SQL standard, any unique key constraint which including primary key, > should > > be > > always be enforced. > > > > Please see: > > > > 4.18.3.1 Introduction to table constraints > > > > "NOTE 50 — A unique constraint is always enforced. A table check > constraint > > or > > a referential constraint can either be enforced or not enforced." > > > > Best, > > Kurt > > > > > > On Wed, Nov 20, 2019 at 5:43 PM Dawid Wysakowicz <[hidden email] > > > > wrote: > > > >> Hey, > >> > >> After an offline discussion I updated the FLIP slightly. In the initial > >> FLIP I put the OUT_OF_LINE definition at a wrong position. I moved it > >> into the statement paranthesis. I also moved the primary key to the > >> TableSchema class. > >> > >> I also extended sections about UNIQUE key support and how do the primary > >> keys affect sources and sinks. > >> > >> Please have another look. If there are no further concerns, I would like > >> to start the vote soon. > >> > >> Best, > >> > >> Dawid > >> > >> On 20/11/2019 04:32, Jark Wu wrote: > >>> Hi Dawid, > >>> > >>> Thanks for driving this. Primary key is a very important and useful > >> feature > >>> to Flink Table API / SQL. > >>> > >>> I'm +1 to the general design. > >>> > >>> And have some thoughts as following: > >>> > >>> 1. > The design says Flink only support NOT ENFORCED mode. But the DDL > >>> and KeyConstraint#primaryKey(..) can pass in ENFORCED mode. > >>> What will we do when user specify an ENFORCED mode? Shall we > forbid > >>> it? > >>> > >>> 2. > "When creating a table, creating a primary key constraint will > alter > >>> the columns nullability." > >>> I think we should force the columns to be declared NOT NULL, > instead > >> of > >>> change the columns nullability implicitly. > >>> Otherwise, it may be confused what is the nullbility of the primary > >> key > >>> columns. > >>> > >>> 3. > "Support for Unique key is not part of the FLIP. It is just > >> mentioned > >>> to show how can we extend the primary key concept with more constraints > >> in > >>> the future." > >>> I think we can include Unique Key as part of the FLIP, as it is > >> already > >>> stated clearly. We can put the implemenation task of unique key in a > >> lower > >>> priority. > >>> > >>> 4. > Method for retrieving primary key constraint is in > CatalogBaseTable. > >>> In SQL standard, primary key is declared in line or out of line. If > >> it > >>> is out of line, it is still in the schema part, i.e. besides column > >>> definitions and in the parentheses. > >>> So I think maybe primary key should belong to `TableSchema` becuase > >> it > >>> is a part of schema. > >>> > >>> CREATE TABLE xxx ( > >>> a int, > >>> b varchar, > >>> c bigint, > >>> primary key (a, b) > >>> ) with (...) > >>> > >>> > >>> Best, > >>> Jark > >>> > >>> > >>> On Tue, 19 Nov 2019 at 23:18, Dawid Wysakowicz <[hidden email] > > > >>> wrote: > >>> > >>>> Hi, > >>>> > >>>> I wanted to bring up the topic of primary key constraints that we > could > >>>> leverage for runtime optimizations. Please have a look at the proposal > >>>> and I would especially draw your attention to the topic of nullability > >>>> of columns that are part of a primary key. Some time in the future we > >>>> will also be able to leverage it for upserting streaming tables. > >>>> > >>>> Here is the proposal: > >>>> > >>>> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP+87%3A+Primary+key+constraints+in+Table+API > >>>> Best, > >>>> > >>>> Dawid > >>>> > >>>> > >>>> > >> > > |
Hi all,
After some offline discussion with Dawid, I changed my mind. We still agree on that we should only support reading from tables with primary key, forbid writing to such table. But the enforce flag should be introduced with primary key, and probably should be `NOR ENFORCE` be default. The problem with making `ENFORCE` by default is when creating a table with primary key constraint. If we follow the SQL standard here, we should do the validity check during the creating operation, and this will introduce a really big burden here and might make supporting this feature not feasible. Best, Kurt On Thu, Nov 21, 2019 at 10:25 AM Kurt Young <[hidden email]> wrote: > Hi Dawid, > > Actually I'm not 100% sure about both choices: always enforce primary key > by Flink and not enforce at all. > > As you said, always enforce primary key is hard to achieve since Flink > doesn't > own data most of the time. But with some certain cases, it's actually > doable, like > writing a new table in a batch job, or if we have upsert flag supported in > message > queue. So it's hard to say Flink has no chance to enforce primary key > here. > > On the other hand, not enforce primary key indeed save us a lot of work, > but also > would cause confusing sometimes. Especially when the table is written and > read > by Flink only. For example, some user created a table with primary key not > enforced, > and they will use it as source and sink table. This situation is quite > confusing, because > Flink didn't enforce primary key when writing, so it's hard to say we can > rely on such > information during read. It looks we are contradictory with ourself. > > AFAIK, most databases will only do validity check only when writing to > table, and > simply trust such information during read. I think we could also follow > this way. Since > we are not sure yet what kind of behavior when writing to a table with > primary key, I would > suggest to defer this util we start to talk about the source and sink > behavior with primary > key, maybe as well as introducing update, delete flag. > > To summarize, my proposal would be: > 1. Add primary key but not introducing enforce flag now > 2. Table with primary key can't be used as sink, and we will simply trust > pk info during read. > > Best, > Kurt > > > On Wed, Nov 20, 2019 at 9:26 PM Dawid Wysakowicz <[hidden email]> > wrote: > >> Hi Kurt, >> >> You are right that this is included in the SQL standard. >> >> The problem is we are not a database and we do not own the data. There >> is no way for us to truly enforce the uniqueness constraint. We are >> always writing to some external systems that, moreover we are not an >> exclusive produce/consumer for that data. We cannot ensure that external >> systems will not corrupt the data. >> >> The way RDBM systems ensure uniqueness constraint is that they keep the >> index of all keys for a table that they check against for every write. >> This works only if the system that maintains the index is the only >> writer to that table. >> >> Moreover some of the systems also allow loosening this restriction e.g. >> >> DB2: >> >> https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.wn.doc/doc/c0061153.html >> >> HIVE(with a non standard syntax, they do it for the same reasons as >> listed above): >> >> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Constraints >> >> >> What do you think? >> >> Best, >> >> Dawid >> >> >> On 20/11/2019 13:51, Kurt Young wrote: >> > Hi all, >> > >> > Thanks Dawid for bringing this up, this is very important property of >> SQL >> > and >> > I'm big +1 to have this. >> > >> > But before the discussion going deeply, I want to first mention that >> > according to >> > SQL standard, any unique key constraint which including primary key, >> should >> > be >> > always be enforced. >> > >> > Please see: >> > >> > 4.18.3.1 Introduction to table constraints >> > >> > "NOTE 50 — A unique constraint is always enforced. A table check >> constraint >> > or >> > a referential constraint can either be enforced or not enforced." >> > >> > Best, >> > Kurt >> > >> > >> > On Wed, Nov 20, 2019 at 5:43 PM Dawid Wysakowicz < >> [hidden email]> >> > wrote: >> > >> >> Hey, >> >> >> >> After an offline discussion I updated the FLIP slightly. In the initial >> >> FLIP I put the OUT_OF_LINE definition at a wrong position. I moved it >> >> into the statement paranthesis. I also moved the primary key to the >> >> TableSchema class. >> >> >> >> I also extended sections about UNIQUE key support and how do the >> primary >> >> keys affect sources and sinks. >> >> >> >> Please have another look. If there are no further concerns, I would >> like >> >> to start the vote soon. >> >> >> >> Best, >> >> >> >> Dawid >> >> >> >> On 20/11/2019 04:32, Jark Wu wrote: >> >>> Hi Dawid, >> >>> >> >>> Thanks for driving this. Primary key is a very important and useful >> >> feature >> >>> to Flink Table API / SQL. >> >>> >> >>> I'm +1 to the general design. >> >>> >> >>> And have some thoughts as following: >> >>> >> >>> 1. > The design says Flink only support NOT ENFORCED mode. But the DDL >> >>> and KeyConstraint#primaryKey(..) can pass in ENFORCED mode. >> >>> What will we do when user specify an ENFORCED mode? Shall we >> forbid >> >>> it? >> >>> >> >>> 2. > "When creating a table, creating a primary key constraint will >> alter >> >>> the columns nullability." >> >>> I think we should force the columns to be declared NOT NULL, >> instead >> >> of >> >>> change the columns nullability implicitly. >> >>> Otherwise, it may be confused what is the nullbility of the >> primary >> >> key >> >>> columns. >> >>> >> >>> 3. > "Support for Unique key is not part of the FLIP. It is just >> >> mentioned >> >>> to show how can we extend the primary key concept with more >> constraints >> >> in >> >>> the future." >> >>> I think we can include Unique Key as part of the FLIP, as it is >> >> already >> >>> stated clearly. We can put the implemenation task of unique key in a >> >> lower >> >>> priority. >> >>> >> >>> 4. > Method for retrieving primary key constraint is in >> CatalogBaseTable. >> >>> In SQL standard, primary key is declared in line or out of line. >> If >> >> it >> >>> is out of line, it is still in the schema part, i.e. besides column >> >>> definitions and in the parentheses. >> >>> So I think maybe primary key should belong to `TableSchema` >> becuase >> >> it >> >>> is a part of schema. >> >>> >> >>> CREATE TABLE xxx ( >> >>> a int, >> >>> b varchar, >> >>> c bigint, >> >>> primary key (a, b) >> >>> ) with (...) >> >>> >> >>> >> >>> Best, >> >>> Jark >> >>> >> >>> >> >>> On Tue, 19 Nov 2019 at 23:18, Dawid Wysakowicz < >> [hidden email]> >> >>> wrote: >> >>> >> >>>> Hi, >> >>>> >> >>>> I wanted to bring up the topic of primary key constraints that we >> could >> >>>> leverage for runtime optimizations. Please have a look at the >> proposal >> >>>> and I would especially draw your attention to the topic of >> nullability >> >>>> of columns that are part of a primary key. Some time in the future we >> >>>> will also be able to leverage it for upserting streaming tables. >> >>>> >> >>>> Here is the proposal: >> >>>> >> >>>> >> >> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP+87%3A+Primary+key+constraints+in+Table+API >> >>>> Best, >> >>>> >> >>>> Dawid >> >>>> >> >>>> >> >>>> >> >> >> >> |
Hi Dawid,
I don't have other concerns now. Thanks for the explanation. Best, Jark On Thu, 21 Nov 2019 at 18:13, Kurt Young <[hidden email]> wrote: > Hi all, > > After some offline discussion with Dawid, I changed my mind. We still agree > on that > we should only support reading from tables with primary key, forbid writing > to such table. > > But the enforce flag should be introduced with primary key, and probably > should be > `NOR ENFORCE` be default. The problem with making `ENFORCE` by default is > when > creating a table with primary key constraint. If we follow the SQL standard > here, we should > do the validity check during the creating operation, and this will > introduce a really big > burden here and might make supporting this feature not feasible. > > Best, > Kurt > > > On Thu, Nov 21, 2019 at 10:25 AM Kurt Young <[hidden email]> wrote: > > > Hi Dawid, > > > > Actually I'm not 100% sure about both choices: always enforce primary key > > by Flink and not enforce at all. > > > > As you said, always enforce primary key is hard to achieve since Flink > > doesn't > > own data most of the time. But with some certain cases, it's actually > > doable, like > > writing a new table in a batch job, or if we have upsert flag supported > in > > message > > queue. So it's hard to say Flink has no chance to enforce primary key > > here. > > > > On the other hand, not enforce primary key indeed save us a lot of work, > > but also > > would cause confusing sometimes. Especially when the table is written and > > read > > by Flink only. For example, some user created a table with primary key > not > > enforced, > > and they will use it as source and sink table. This situation is quite > > confusing, because > > Flink didn't enforce primary key when writing, so it's hard to say we can > > rely on such > > information during read. It looks we are contradictory with ourself. > > > > AFAIK, most databases will only do validity check only when writing to > > table, and > > simply trust such information during read. I think we could also follow > > this way. Since > > we are not sure yet what kind of behavior when writing to a table with > > primary key, I would > > suggest to defer this util we start to talk about the source and sink > > behavior with primary > > key, maybe as well as introducing update, delete flag. > > > > To summarize, my proposal would be: > > 1. Add primary key but not introducing enforce flag now > > 2. Table with primary key can't be used as sink, and we will simply trust > > pk info during read. > > > > Best, > > Kurt > > > > > > On Wed, Nov 20, 2019 at 9:26 PM Dawid Wysakowicz <[hidden email] > > > > wrote: > > > >> Hi Kurt, > >> > >> You are right that this is included in the SQL standard. > >> > >> The problem is we are not a database and we do not own the data. There > >> is no way for us to truly enforce the uniqueness constraint. We are > >> always writing to some external systems that, moreover we are not an > >> exclusive produce/consumer for that data. We cannot ensure that external > >> systems will not corrupt the data. > >> > >> The way RDBM systems ensure uniqueness constraint is that they keep the > >> index of all keys for a table that they check against for every write. > >> This works only if the system that maintains the index is the only > >> writer to that table. > >> > >> Moreover some of the systems also allow loosening this restriction e.g. > >> > >> DB2: > >> > >> > https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.wn.doc/doc/c0061153.html > >> > >> HIVE(with a non standard syntax, they do it for the same reasons as > >> listed above): > >> > >> > https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Constraints > >> > >> > >> What do you think? > >> > >> Best, > >> > >> Dawid > >> > >> > >> On 20/11/2019 13:51, Kurt Young wrote: > >> > Hi all, > >> > > >> > Thanks Dawid for bringing this up, this is very important property of > >> SQL > >> > and > >> > I'm big +1 to have this. > >> > > >> > But before the discussion going deeply, I want to first mention that > >> > according to > >> > SQL standard, any unique key constraint which including primary key, > >> should > >> > be > >> > always be enforced. > >> > > >> > Please see: > >> > > >> > 4.18.3.1 Introduction to table constraints > >> > > >> > "NOTE 50 — A unique constraint is always enforced. A table check > >> constraint > >> > or > >> > a referential constraint can either be enforced or not enforced." > >> > > >> > Best, > >> > Kurt > >> > > >> > > >> > On Wed, Nov 20, 2019 at 5:43 PM Dawid Wysakowicz < > >> [hidden email]> > >> > wrote: > >> > > >> >> Hey, > >> >> > >> >> After an offline discussion I updated the FLIP slightly. In the > initial > >> >> FLIP I put the OUT_OF_LINE definition at a wrong position. I moved it > >> >> into the statement paranthesis. I also moved the primary key to the > >> >> TableSchema class. > >> >> > >> >> I also extended sections about UNIQUE key support and how do the > >> primary > >> >> keys affect sources and sinks. > >> >> > >> >> Please have another look. If there are no further concerns, I would > >> like > >> >> to start the vote soon. > >> >> > >> >> Best, > >> >> > >> >> Dawid > >> >> > >> >> On 20/11/2019 04:32, Jark Wu wrote: > >> >>> Hi Dawid, > >> >>> > >> >>> Thanks for driving this. Primary key is a very important and useful > >> >> feature > >> >>> to Flink Table API / SQL. > >> >>> > >> >>> I'm +1 to the general design. > >> >>> > >> >>> And have some thoughts as following: > >> >>> > >> >>> 1. > The design says Flink only support NOT ENFORCED mode. But the > DDL > >> >>> and KeyConstraint#primaryKey(..) can pass in ENFORCED mode. > >> >>> What will we do when user specify an ENFORCED mode? Shall we > >> forbid > >> >>> it? > >> >>> > >> >>> 2. > "When creating a table, creating a primary key constraint will > >> alter > >> >>> the columns nullability." > >> >>> I think we should force the columns to be declared NOT NULL, > >> instead > >> >> of > >> >>> change the columns nullability implicitly. > >> >>> Otherwise, it may be confused what is the nullbility of the > >> primary > >> >> key > >> >>> columns. > >> >>> > >> >>> 3. > "Support for Unique key is not part of the FLIP. It is just > >> >> mentioned > >> >>> to show how can we extend the primary key concept with more > >> constraints > >> >> in > >> >>> the future." > >> >>> I think we can include Unique Key as part of the FLIP, as it is > >> >> already > >> >>> stated clearly. We can put the implemenation task of unique key in a > >> >> lower > >> >>> priority. > >> >>> > >> >>> 4. > Method for retrieving primary key constraint is in > >> CatalogBaseTable. > >> >>> In SQL standard, primary key is declared in line or out of line. > >> If > >> >> it > >> >>> is out of line, it is still in the schema part, i.e. besides column > >> >>> definitions and in the parentheses. > >> >>> So I think maybe primary key should belong to `TableSchema` > >> becuase > >> >> it > >> >>> is a part of schema. > >> >>> > >> >>> CREATE TABLE xxx ( > >> >>> a int, > >> >>> b varchar, > >> >>> c bigint, > >> >>> primary key (a, b) > >> >>> ) with (...) > >> >>> > >> >>> > >> >>> Best, > >> >>> Jark > >> >>> > >> >>> > >> >>> On Tue, 19 Nov 2019 at 23:18, Dawid Wysakowicz < > >> [hidden email]> > >> >>> wrote: > >> >>> > >> >>>> Hi, > >> >>>> > >> >>>> I wanted to bring up the topic of primary key constraints that we > >> could > >> >>>> leverage for runtime optimizations. Please have a look at the > >> proposal > >> >>>> and I would especially draw your attention to the topic of > >> nullability > >> >>>> of columns that are part of a primary key. Some time in the future > we > >> >>>> will also be able to leverage it for upserting streaming tables. > >> >>>> > >> >>>> Here is the proposal: > >> >>>> > >> >>>> > >> >> > >> > https://cwiki.apache.org/confluence/display/FLINK/FLIP+87%3A+Primary+key+constraints+in+Table+API > >> >>>> Best, > >> >>>> > >> >>>> Dawid > >> >>>> > >> >>>> > >> >>>> > >> >> > >> > >> > |
Thank you for your comments!
I will start the vote then. Best, Dawid On 22/11/2019 07:30, Jark Wu wrote: > Hi Dawid, > > I don't have other concerns now. Thanks for the explanation. > > Best, > Jark > > On Thu, 21 Nov 2019 at 18:13, Kurt Young <[hidden email]> wrote: > >> Hi all, >> >> After some offline discussion with Dawid, I changed my mind. We still agree >> on that >> we should only support reading from tables with primary key, forbid writing >> to such table. >> >> But the enforce flag should be introduced with primary key, and probably >> should be >> `NOR ENFORCE` be default. The problem with making `ENFORCE` by default is >> when >> creating a table with primary key constraint. If we follow the SQL standard >> here, we should >> do the validity check during the creating operation, and this will >> introduce a really big >> burden here and might make supporting this feature not feasible. >> >> Best, >> Kurt >> >> >> On Thu, Nov 21, 2019 at 10:25 AM Kurt Young <[hidden email]> wrote: >> >>> Hi Dawid, >>> >>> Actually I'm not 100% sure about both choices: always enforce primary key >>> by Flink and not enforce at all. >>> >>> As you said, always enforce primary key is hard to achieve since Flink >>> doesn't >>> own data most of the time. But with some certain cases, it's actually >>> doable, like >>> writing a new table in a batch job, or if we have upsert flag supported >> in >>> message >>> queue. So it's hard to say Flink has no chance to enforce primary key >>> here. >>> >>> On the other hand, not enforce primary key indeed save us a lot of work, >>> but also >>> would cause confusing sometimes. Especially when the table is written and >>> read >>> by Flink only. For example, some user created a table with primary key >> not >>> enforced, >>> and they will use it as source and sink table. This situation is quite >>> confusing, because >>> Flink didn't enforce primary key when writing, so it's hard to say we can >>> rely on such >>> information during read. It looks we are contradictory with ourself. >>> >>> AFAIK, most databases will only do validity check only when writing to >>> table, and >>> simply trust such information during read. I think we could also follow >>> this way. Since >>> we are not sure yet what kind of behavior when writing to a table with >>> primary key, I would >>> suggest to defer this util we start to talk about the source and sink >>> behavior with primary >>> key, maybe as well as introducing update, delete flag. >>> >>> To summarize, my proposal would be: >>> 1. Add primary key but not introducing enforce flag now >>> 2. Table with primary key can't be used as sink, and we will simply trust >>> pk info during read. >>> >>> Best, >>> Kurt >>> >>> >>> On Wed, Nov 20, 2019 at 9:26 PM Dawid Wysakowicz <[hidden email] >>> >>> wrote: >>> >>>> Hi Kurt, >>>> >>>> You are right that this is included in the SQL standard. >>>> >>>> The problem is we are not a database and we do not own the data. There >>>> is no way for us to truly enforce the uniqueness constraint. We are >>>> always writing to some external systems that, moreover we are not an >>>> exclusive produce/consumer for that data. We cannot ensure that external >>>> systems will not corrupt the data. >>>> >>>> The way RDBM systems ensure uniqueness constraint is that they keep the >>>> index of all keys for a table that they check against for every write. >>>> This works only if the system that maintains the index is the only >>>> writer to that table. >>>> >>>> Moreover some of the systems also allow loosening this restriction e.g. >>>> >>>> DB2: >>>> >>>> >> https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.wn.doc/doc/c0061153.html >>>> HIVE(with a non standard syntax, they do it for the same reasons as >>>> listed above): >>>> >>>> >> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Constraints >>>> >>>> What do you think? >>>> >>>> Best, >>>> >>>> Dawid >>>> >>>> >>>> On 20/11/2019 13:51, Kurt Young wrote: >>>>> Hi all, >>>>> >>>>> Thanks Dawid for bringing this up, this is very important property of >>>> SQL >>>>> and >>>>> I'm big +1 to have this. >>>>> >>>>> But before the discussion going deeply, I want to first mention that >>>>> according to >>>>> SQL standard, any unique key constraint which including primary key, >>>> should >>>>> be >>>>> always be enforced. >>>>> >>>>> Please see: >>>>> >>>>> 4.18.3.1 Introduction to table constraints >>>>> >>>>> "NOTE 50 — A unique constraint is always enforced. A table check >>>> constraint >>>>> or >>>>> a referential constraint can either be enforced or not enforced." >>>>> >>>>> Best, >>>>> Kurt >>>>> >>>>> >>>>> On Wed, Nov 20, 2019 at 5:43 PM Dawid Wysakowicz < >>>> [hidden email]> >>>>> wrote: >>>>> >>>>>> Hey, >>>>>> >>>>>> After an offline discussion I updated the FLIP slightly. In the >> initial >>>>>> FLIP I put the OUT_OF_LINE definition at a wrong position. I moved it >>>>>> into the statement paranthesis. I also moved the primary key to the >>>>>> TableSchema class. >>>>>> >>>>>> I also extended sections about UNIQUE key support and how do the >>>> primary >>>>>> keys affect sources and sinks. >>>>>> >>>>>> Please have another look. If there are no further concerns, I would >>>> like >>>>>> to start the vote soon. >>>>>> >>>>>> Best, >>>>>> >>>>>> Dawid >>>>>> >>>>>> On 20/11/2019 04:32, Jark Wu wrote: >>>>>>> Hi Dawid, >>>>>>> >>>>>>> Thanks for driving this. Primary key is a very important and useful >>>>>> feature >>>>>>> to Flink Table API / SQL. >>>>>>> >>>>>>> I'm +1 to the general design. >>>>>>> >>>>>>> And have some thoughts as following: >>>>>>> >>>>>>> 1. > The design says Flink only support NOT ENFORCED mode. But the >> DDL >>>>>>> and KeyConstraint#primaryKey(..) can pass in ENFORCED mode. >>>>>>> What will we do when user specify an ENFORCED mode? Shall we >>>> forbid >>>>>>> it? >>>>>>> >>>>>>> 2. > "When creating a table, creating a primary key constraint will >>>> alter >>>>>>> the columns nullability." >>>>>>> I think we should force the columns to be declared NOT NULL, >>>> instead >>>>>> of >>>>>>> change the columns nullability implicitly. >>>>>>> Otherwise, it may be confused what is the nullbility of the >>>> primary >>>>>> key >>>>>>> columns. >>>>>>> >>>>>>> 3. > "Support for Unique key is not part of the FLIP. It is just >>>>>> mentioned >>>>>>> to show how can we extend the primary key concept with more >>>> constraints >>>>>> in >>>>>>> the future." >>>>>>> I think we can include Unique Key as part of the FLIP, as it is >>>>>> already >>>>>>> stated clearly. We can put the implemenation task of unique key in a >>>>>> lower >>>>>>> priority. >>>>>>> >>>>>>> 4. > Method for retrieving primary key constraint is in >>>> CatalogBaseTable. >>>>>>> In SQL standard, primary key is declared in line or out of line. >>>> If >>>>>> it >>>>>>> is out of line, it is still in the schema part, i.e. besides column >>>>>>> definitions and in the parentheses. >>>>>>> So I think maybe primary key should belong to `TableSchema` >>>> becuase >>>>>> it >>>>>>> is a part of schema. >>>>>>> >>>>>>> CREATE TABLE xxx ( >>>>>>> a int, >>>>>>> b varchar, >>>>>>> c bigint, >>>>>>> primary key (a, b) >>>>>>> ) with (...) >>>>>>> >>>>>>> >>>>>>> Best, >>>>>>> Jark >>>>>>> >>>>>>> >>>>>>> On Tue, 19 Nov 2019 at 23:18, Dawid Wysakowicz < >>>> [hidden email]> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> I wanted to bring up the topic of primary key constraints that we >>>> could >>>>>>>> leverage for runtime optimizations. Please have a look at the >>>> proposal >>>>>>>> and I would especially draw your attention to the topic of >>>> nullability >>>>>>>> of columns that are part of a primary key. Some time in the future >> we >>>>>>>> will also be able to leverage it for upserting streaming tables. >>>>>>>> >>>>>>>> Here is the proposal: >>>>>>>> >>>>>>>> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP+87%3A+Primary+key+constraints+in+Table+API >>>>>>>> Best, >>>>>>>> >>>>>>>> Dawid >>>>>>>> >>>>>>>> >>>>>>>> >>>> signature.asc (849 bytes) Download Attachment |
Free forum by Nabble | Edit this page |