Hi everybody,
while reviewing PR #2094 [1] I noticed that the field reference syntax for FieldAccessors is not compatible with the syntax supported for key definitions (ExpressionKeys) used in groupBy(), keyBy(), join().where().equalTo(), etc. FieldAccessors are only used for build-in aggregations in the DataStream API (sum(), min(), max(), ...). In particular I identified the following inconsistencies: - FieldAccessors allow to address array cells. ExpressionKeys treat arrays as AtomicTypes (Array TypeInfos do not extend CompositeType). Hence, it is not possible to address array cells. - ExpressionKeys do only support Integer keys for tuples. An atomic type can only be addressed with "*". FieldAccessors allow to address AtomicTypes with 0 in addtion to "*". - ExpressionKeys support to address fields of Java tuples with "f2" and Scala tuple fields with "_3". FieldAccessors do not support the "f" or "_" prefix. I would like to propose to adapt the syntax of both mechanisms (ideally, both should use the same code for validation). IMO, the ExpressionKey syntax much more widely used and is well designed. Therefore, I would adopt it for FieldAccessors as well. However, that would mean to restrict the syntax of the FieldAccessors and might break existing code. What do others think? [1] https://github.com/apache/flink/pull/2094 |
Hello,
Thanks Fabian for starting this discussion. I would just like to add a few thougths about why does the FieldAccessors even exist. One could say that we should instead just re-use ExpressionKeys for the aggregations, and we are done. As far as I can see, the way ExpressionKeys is often used is roughly to call computeLogicalKeyPositions on it, then call CompositeType.createComparator with these key positions, and then we can call extractKeys on the comparator to get the values of the fields of a record. However, the problem with this is that in the aggregations we also need to write fields and not just read them. So with the above procedure, we would need an "intractKey" method on the comparators that would be the opposite of extractKeys, i.e. which would set the key fields. Another difference between the FieldAccessor world and the ExpressionKeys world is that in the aggregations we need to deal with only one field. This means that enhancing the comparators with the exact opposite of extractKeys would be overkill, because we don't need to set multiple fields at a time. It certainly seems to be a good idea to unify these at least externally (from the side of the users), but I'm not sure what would be a neat way to also internally unify them. Best, Gábor 2016-10-26 20:54 GMT+02:00 Fabian Hueske <[hidden email]>: > Hi everybody, > > while reviewing PR #2094 [1] I noticed that the field reference syntax for > FieldAccessors is not compatible with the syntax supported for key > definitions (ExpressionKeys) used in groupBy(), keyBy(), > join().where().equalTo(), etc. > > FieldAccessors are only used for build-in aggregations in the DataStream > API (sum(), min(), max(), ...). > > In particular I identified the following inconsistencies: > > - FieldAccessors allow to address array cells. ExpressionKeys treat arrays > as AtomicTypes (Array TypeInfos do not extend CompositeType). Hence, it is > not possible to address array cells. > - ExpressionKeys do only support Integer keys for tuples. An atomic type > can only be addressed with "*". FieldAccessors allow to address AtomicTypes > with 0 in addtion to "*". > - ExpressionKeys support to address fields of Java tuples with "f2" and > Scala tuple fields with "_3". FieldAccessors do not support the "f" or "_" > prefix. > > I would like to propose to adapt the syntax of both mechanisms (ideally, > both should use the same code for validation). IMO, the ExpressionKey > syntax much more widely used and is well designed. Therefore, I would adopt > it for FieldAccessors as well. However, that would mean to restrict the > syntax of the FieldAccessors and might break existing code. > > What do others think? > > [1] https://github.com/apache/flink/pull/2094 |
IMO, FieldAccessors are certainly valuable and should exist.
I only want to make sure that there is a common syntax for field references. Maybe we can refactor the parsing code for field references in ExpressionKeys and expose it in static public methods that the FieldAccessors can call. Best, Fabian 2016-10-27 14:52 GMT+02:00 Gábor Gévay <[hidden email]>: > Hello, > > Thanks Fabian for starting this discussion. > > I would just like to add a few thougths about why does the > FieldAccessors even exist. One could say that we should instead just > re-use ExpressionKeys for the aggregations, and we are done. As far as > I can see, the way ExpressionKeys is often used is roughly to call > computeLogicalKeyPositions on it, then call > CompositeType.createComparator with these key positions, and then we > can call extractKeys on the comparator to get the values of the fields > of a record. > > However, the problem with this is that in the aggregations we also > need to write fields and not just read them. So with the above > procedure, we would need an "intractKey" method on the comparators > that would be the opposite of extractKeys, i.e. which would set the > key fields. > > Another difference between the FieldAccessor world and the > ExpressionKeys world is that in the aggregations we need to deal with > only one field. This means that enhancing the comparators with the > exact opposite of extractKeys would be overkill, because we don't need > to set multiple fields at a time. > > It certainly seems to be a good idea to unify these at least > externally (from the side of the users), but I'm not sure what would > be a neat way to also internally unify them. > > Best, > Gábor > > > 2016-10-26 20:54 GMT+02:00 Fabian Hueske <[hidden email]>: > > Hi everybody, > > > > while reviewing PR #2094 [1] I noticed that the field reference syntax > for > > FieldAccessors is not compatible with the syntax supported for key > > definitions (ExpressionKeys) used in groupBy(), keyBy(), > > join().where().equalTo(), etc. > > > > FieldAccessors are only used for build-in aggregations in the DataStream > > API (sum(), min(), max(), ...). > > > > In particular I identified the following inconsistencies: > > > > - FieldAccessors allow to address array cells. ExpressionKeys treat > arrays > > as AtomicTypes (Array TypeInfos do not extend CompositeType). Hence, it > is > > not possible to address array cells. > > - ExpressionKeys do only support Integer keys for tuples. An atomic type > > can only be addressed with "*". FieldAccessors allow to address > AtomicTypes > > with 0 in addtion to "*". > > - ExpressionKeys support to address fields of Java tuples with "f2" and > > Scala tuple fields with "_3". FieldAccessors do not support the "f" or > "_" > > prefix. > > > > I would like to propose to adapt the syntax of both mechanisms (ideally, > > both should use the same code for validation). IMO, the ExpressionKey > > syntax much more widely used and is well designed. Therefore, I would > adopt > > it for FieldAccessors as well. However, that would mean to restrict the > > syntax of the FieldAccessors and might break existing code. > > > > What do others think? > > > > [1] https://github.com/apache/flink/pull/2094 > |
Thanks for bringing up the issue, Fabian.
I am also for unifying the access patterns between the FieldAccessor and ExpressionKeys logic. In terms of Fabian's suggestion to refactor the ExpressionKeys parsing logic and rely on it in the FieldAccessors I think that is nice first step. It would be great if we could provide even tighter integration though. On Fri, Oct 28, 2016 at 6:17 PM, Fabian Hueske <[hidden email]> wrote: > IMO, FieldAccessors are certainly valuable and should exist. > I only want to make sure that there is a common syntax for field > references. > > Maybe we can refactor the parsing code for field references in > ExpressionKeys and expose it in static public methods that the > FieldAccessors can call. > > Best, > Fabian > > > > 2016-10-27 14:52 GMT+02:00 Gábor Gévay <[hidden email]>: > > > Hello, > > > > Thanks Fabian for starting this discussion. > > > > I would just like to add a few thougths about why does the > > FieldAccessors even exist. One could say that we should instead just > > re-use ExpressionKeys for the aggregations, and we are done. As far as > > I can see, the way ExpressionKeys is often used is roughly to call > > computeLogicalKeyPositions on it, then call > > CompositeType.createComparator with these key positions, and then we > > can call extractKeys on the comparator to get the values of the fields > > of a record. > > > > However, the problem with this is that in the aggregations we also > > need to write fields and not just read them. So with the above > > procedure, we would need an "intractKey" method on the comparators > > that would be the opposite of extractKeys, i.e. which would set the > > key fields. > > > > Another difference between the FieldAccessor world and the > > ExpressionKeys world is that in the aggregations we need to deal with > > only one field. This means that enhancing the comparators with the > > exact opposite of extractKeys would be overkill, because we don't need > > to set multiple fields at a time. > > > > It certainly seems to be a good idea to unify these at least > > externally (from the side of the users), but I'm not sure what would > > be a neat way to also internally unify them. > > > > Best, > > Gábor > > > > > > 2016-10-26 20:54 GMT+02:00 Fabian Hueske <[hidden email]>: > > > Hi everybody, > > > > > > while reviewing PR #2094 [1] I noticed that the field reference syntax > > for > > > FieldAccessors is not compatible with the syntax supported for key > > > definitions (ExpressionKeys) used in groupBy(), keyBy(), > > > join().where().equalTo(), etc. > > > > > > FieldAccessors are only used for build-in aggregations in the > DataStream > > > API (sum(), min(), max(), ...). > > > > > > In particular I identified the following inconsistencies: > > > > > > - FieldAccessors allow to address array cells. ExpressionKeys treat > > arrays > > > as AtomicTypes (Array TypeInfos do not extend CompositeType). Hence, it > > is > > > not possible to address array cells. > > > - ExpressionKeys do only support Integer keys for tuples. An atomic > type > > > can only be addressed with "*". FieldAccessors allow to address > > AtomicTypes > > > with 0 in addtion to "*". > > > - ExpressionKeys support to address fields of Java tuples with "f2" and > > > Scala tuple fields with "_3". FieldAccessors do not support the "f" or > > "_" > > > prefix. > > > > > > I would like to propose to adapt the syntax of both mechanisms > (ideally, > > > both should use the same code for validation). IMO, the ExpressionKey > > > syntax much more widely used and is well designed. Therefore, I would > > adopt > > > it for FieldAccessors as well. However, that would mean to restrict the > > > syntax of the FieldAccessors and might break existing code. > > > > > > What do others think? > > > > > > [1] https://github.com/apache/flink/pull/2094 > > > |
Just a friendly reminder that PR 2094 resolved some of the issue mentioned
here, but a bit remains in terms of fully consolidating the semantics. Merging the PR as soon as Travis comes in green. The reminder is documented in [1]. [1] https://issues.apache.org/jira/browse/FLINK-5156 On Thu, Nov 3, 2016 at 4:42 PM, Márton Balassi <[hidden email]> wrote: > Thanks for bringing up the issue, Fabian. > > I am also for unifying the access patterns between the FieldAccessor and > ExpressionKeys logic. > > In terms of Fabian's suggestion to refactor the ExpressionKeys parsing > logic and rely on it in the FieldAccessors I think that is nice first step. > It would be great if we could provide even tighter integration though. > > On Fri, Oct 28, 2016 at 6:17 PM, Fabian Hueske <[hidden email]> wrote: > >> IMO, FieldAccessors are certainly valuable and should exist. >> I only want to make sure that there is a common syntax for field >> references. >> >> Maybe we can refactor the parsing code for field references in >> ExpressionKeys and expose it in static public methods that the >> FieldAccessors can call. >> >> Best, >> Fabian >> >> >> >> 2016-10-27 14:52 GMT+02:00 Gábor Gévay <[hidden email]>: >> >> > Hello, >> > >> > Thanks Fabian for starting this discussion. >> > >> > I would just like to add a few thougths about why does the >> > FieldAccessors even exist. One could say that we should instead just >> > re-use ExpressionKeys for the aggregations, and we are done. As far as >> > I can see, the way ExpressionKeys is often used is roughly to call >> > computeLogicalKeyPositions on it, then call >> > CompositeType.createComparator with these key positions, and then we >> > can call extractKeys on the comparator to get the values of the fields >> > of a record. >> > >> > However, the problem with this is that in the aggregations we also >> > need to write fields and not just read them. So with the above >> > procedure, we would need an "intractKey" method on the comparators >> > that would be the opposite of extractKeys, i.e. which would set the >> > key fields. >> > >> > Another difference between the FieldAccessor world and the >> > ExpressionKeys world is that in the aggregations we need to deal with >> > only one field. This means that enhancing the comparators with the >> > exact opposite of extractKeys would be overkill, because we don't need >> > to set multiple fields at a time. >> > >> > It certainly seems to be a good idea to unify these at least >> > externally (from the side of the users), but I'm not sure what would >> > be a neat way to also internally unify them. >> > >> > Best, >> > Gábor >> > >> > >> > 2016-10-26 20:54 GMT+02:00 Fabian Hueske <[hidden email]>: >> > > Hi everybody, >> > > >> > > while reviewing PR #2094 [1] I noticed that the field reference syntax >> > for >> > > FieldAccessors is not compatible with the syntax supported for key >> > > definitions (ExpressionKeys) used in groupBy(), keyBy(), >> > > join().where().equalTo(), etc. >> > > >> > > FieldAccessors are only used for build-in aggregations in the >> DataStream >> > > API (sum(), min(), max(), ...). >> > > >> > > In particular I identified the following inconsistencies: >> > > >> > > - FieldAccessors allow to address array cells. ExpressionKeys treat >> > arrays >> > > as AtomicTypes (Array TypeInfos do not extend CompositeType). Hence, >> it >> > is >> > > not possible to address array cells. >> > > - ExpressionKeys do only support Integer keys for tuples. An atomic >> type >> > > can only be addressed with "*". FieldAccessors allow to address >> > AtomicTypes >> > > with 0 in addtion to "*". >> > > - ExpressionKeys support to address fields of Java tuples with "f2" >> and >> > > Scala tuple fields with "_3". FieldAccessors do not support the "f" or >> > "_" >> > > prefix. >> > > >> > > I would like to propose to adapt the syntax of both mechanisms >> (ideally, >> > > both should use the same code for validation). IMO, the ExpressionKey >> > > syntax much more widely used and is well designed. Therefore, I would >> > adopt >> > > it for FieldAccessors as well. However, that would mean to restrict >> the >> > > syntax of the FieldAccessors and might break existing code. >> > > >> > > What do others think? >> > > >> > > [1] https://github.com/apache/flink/pull/2094 >> > >> > > |
Free forum by Nabble | Edit this page |