Hi,
I have to decide whether to expose the implementation or hide it from the user, and I would like to hear some opinions about that.

The expression operations operate on DataSet[Row], where Row is basically a wrapper for an array of elements of different types. The expression API system keeps track of the names and types of these fields. Right now, when you have an operation like:

    // 'foo and 'bar are Scala symbols
    // they refer to fields named foo and
    // bar in the input data set
    val result = in.select('foo, 'bar)

the result is a DataSet[Row]. This means two things:

1. The user can theoretically do a map operation on this where he manually accesses row fields, as in:

    in.map { row => (row.getField(0).asInstanceOf[Int], row.getField(1).asInstanceOf[String]) }

2. I cannot easily look at the whole structure of a query, because queries are translated to DataSet operations one expression at a time, i.e.:

    val result = in1.join(in2).filter(...).select(...)

results in a join operation, followed by a filter operation, followed by a map operation. If the translation did not happen one operator at a time, we could combine all the operations into one join operation. This would mean having a custom optimiser component for the expression API and bypassing the optimiser component we have for normal operator data flows.

The question now is: should I expose it as is, i.e. let expression operations result in DataSet[Row], or should I hide it behind another type of DataSet (ExpressionDataSet) so that we can later change the implementation details and perform any magic we want behind the scenes?

Cheers,
Aljoscha
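To make point 2 concrete: if a whole query were kept as a logical plan instead of being translated operator by operator, a small rewrite step could fold the filter and the select into the join. The following is only a minimal sketch of that idea in plain Scala. Plan, Source, JoinNode, FilterNode, SelectNode and PlanFusion are hypothetical names, predicates are plain strings for brevity, and none of this is the actual expression API or the Flink optimiser.

```scala
// Hypothetical logical plan nodes (illustration only, not Flink classes).
sealed trait Plan
final case class Source(name: String) extends Plan
final case class JoinNode(left: Plan, right: Plan,
                          predicate: Option[String] = None,
                          output: Option[Seq[String]] = None) extends Plan
final case class FilterNode(input: Plan, predicate: String) extends Plan
final case class SelectNode(input: Plan, fields: Seq[String]) extends Plan

object PlanFusion {
  // Rewrites the plan bottom-up. A filter or select that sits directly on a
  // join is absorbed into that join; everything else is left in place.
  def fuse(plan: Plan): Plan = plan match {
    case FilterNode(in, pred) => fuse(in) match {
      case JoinNode(l, r, p, out) =>
        // Conjoin the new predicate with any predicate the join already has.
        JoinNode(l, r, Some(p.fold(pred)(old => s"($old) && ($pred)")), out)
      case other => FilterNode(other, pred)
    }
    case SelectNode(in, fields) => fuse(in) match {
      case JoinNode(l, r, p, _) => JoinNode(l, r, p, Some(fields))
      case other => SelectNode(other, fields)
    }
    case JoinNode(l, r, p, out) => JoinNode(fuse(l), fuse(r), p, out)
    case s: Source => s
  }
}
```

Under these assumptions, applying fuse to the plan for in1.join(in2).filter(...).select(...) gives a single JoinNode that carries both the predicate and the projected fields. A rewrite like this is only possible while the query is still a logical plan, which is the argument for not translating eagerly to DataSet operations.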
Hi,
Am I right that you basically fear that if you allow users to "manually" modify DataSet<Row>s, you are losing control of the types etc.?

I think that integrating the expression API into the existing API is nicer, because it gives users more flexibility. It should also lead to a lower overall complexity, right? I'm in favor of keeping things as simple as possible.

Robert

On Thu, Jan 29, 2015 at 4:47 PM, Aljoscha Krettek <[hidden email]> wrote:
> [...]
Yes, that's what I'm afraid of. I would also like to have the flexibility to change the underlying implementation in the future, if we want to.

And no, hiding the implementation would actually make things easier, implementation-wise.

On Fri, Jan 30, 2015 at 7:08 PM, Robert Metzger <[hidden email]> wrote:
> [...]
My first intuition is to not expose the row data type. If we add columnar execution later, there may never be a Row data type during runtime (cf. the paper on the HyPer runtime engine).

For these declarative operations, I think it is a big advantage to keep the underpinnings strictly separate so we can change the execution model.

Also, I think that explicit switches between the logical and physical abstraction (switching from class type to logical row type and vice versa) make things more transparent to the user. As an example: a filter in a logical query expression may be pushed down, whereas a filter defined as a UDF on a physical type is not pushed down.
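The pushdown example can be illustrated with a small sketch. It assumes a toy plan representation in plain Scala: Expr, Field, Lit, Gt, Op, Scan, JoinOp, ExprFilter, UdfFilter and Pushdown are all hypothetical names, not Flink classes. A filter given as an expression tree can be inspected, so the rewrite can see which fields it reads and move it below a join. A filter given as a UDF is an opaque function and stays where it is.

```scala
// Hypothetical expression tree for logical predicates (illustration only).
sealed trait Expr
final case class Field(name: String) extends Expr
final case class Lit(value: Int) extends Expr
final case class Gt(left: Expr, right: Expr) extends Expr

// Hypothetical operators; each one knows which field names it produces.
sealed trait Op { def fields: Set[String] }
final case class Scan(name: String, fields: Set[String]) extends Op
final case class JoinOp(left: Op, right: Op) extends Op {
  def fields: Set[String] = left.fields ++ right.fields
}
final case class ExprFilter(input: Op, predicate: Expr) extends Op {
  def fields: Set[String] = input.fields
}
final case class UdfFilter(input: Op, udf: Map[String, Int] => Boolean) extends Op {
  def fields: Set[String] = input.fields
}

object Pushdown {
  // Field names that a logical predicate reads.
  def referenced(e: Expr): Set[String] = e match {
    case Field(n) => Set(n)
    case Lit(_)   => Set.empty
    case Gt(l, r) => referenced(l) ++ referenced(r)
  }

  def push(op: Op): Op = op match {
    // A logical filter that only reads one join side moves below the join.
    case ExprFilter(JoinOp(l, r), p) if referenced(p).subsetOf(l.fields) =>
      JoinOp(push(ExprFilter(l, p)), push(r))
    case ExprFilter(JoinOp(l, r), p) if referenced(p).subsetOf(r.fields) =>
      JoinOp(push(l), push(ExprFilter(r, p)))
    case ExprFilter(in, p) => ExprFilter(push(in), p)
    // The UDF cannot be inspected, so the filter keeps its position.
    case UdfFilter(in, f) => UdfFilter(push(in), f)
    case JoinOp(l, r)     => JoinOp(push(l), push(r))
    case s: Scan          => s
  }
}
```

In this sketch, the only reason the UDF filter is not moved is that the plan has no way of knowing which fields the function touches. That is the transparency argument: the user can tell from the type of filter whether it is eligible for such rewrites.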
Yes, that's exactly my reasoning for wanting to hide it.
On Sat, Jan 31, 2015 at 10:32 AM, Stephan Ewen <[hidden email]> wrote:
> [...]
I am also +1 for hiding the internals.
Having conversion functions from and to DataSet sounds like the way to go for me.

2015-01-31 11:04 GMT+01:00 Aljoscha Krettek <[hidden email]>:
> [...]
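A minimal sketch of what hiding the internals behind conversion functions could look like. It uses a plain Scala Seq as a stand-in for DataSet so that it runs without Flink. ExpressionDataSet is the name floated earlier in the thread; its members here (fromDataSet, toDataSet, select with string field names) and the simplified Row are assumptions made for illustration, not an actual API.

```scala
// Simplified stand-in for the Row wrapper described in the thread.
final class Row(values: Array[Any]) {
  def getField(i: Int): Any = values(i)
}

// The constructor is private: rows are never handed out directly, so the
// internal representation can change without breaking user code.
final class ExpressionDataSet private (names: Seq[String], rows: Seq[Row]) {

  // Logical operation: project to the named fields, in the given order.
  def select(fields: String*): ExpressionDataSet = {
    val idx = fields.map(f => names.indexOf(f))
    require(!idx.contains(-1), "select refers to an unknown field")
    new ExpressionDataSet(fields, rows.map(r => new Row(idx.map(r.getField).toArray)))
  }

  // Explicit switch back to a typed (physical) collection.
  def toDataSet[T](convert: Seq[Any] => T): Seq[T] =
    rows.map(r => convert(names.indices.map(r.getField)))
}

object ExpressionDataSet {
  // Explicit switch from a typed collection of tuples or case classes
  // to the logical representation, naming the fields by position.
  def fromDataSet[T <: Product](data: Seq[T], names: String*): ExpressionDataSet = {
    require(data.forall(_.productArity == names.length), "one name per field is required")
    new ExpressionDataSet(names, data.map(p => new Row(p.productIterator.toArray)))
  }
}
```

A possible usage, under the same assumptions:

```scala
val in = ExpressionDataSet.fromDataSet(Seq((1, "a", 2.0), (2, "b", 3.0)), "foo", "bar", "baz")
val out = in.select("bar", "foo")
  .toDataSet(r => (r(0).asInstanceOf[String], r(1).asInstanceOf[Int]))
```

The casts now happen only at the explicit conversion boundary, instead of inside arbitrary map functions over DataSet[Row].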
If we want to have a tight integration with our existing API, we have to hide the results of the expressions behind a wrapper. This enables us to change the internal implementation at any time and to support future Flink API changes and features.

+1 for not directly exposing the results as a row DataSet.

On Tue, Feb 3, 2015 at 10:39 AM, Fabian Hueske <[hidden email]> wrote:
> [...]