Design Question in Expression API

Aljoscha Krettek-2
Hi,
I have to decide whether to expose the implementation or hide it from
the user and would like to hear some opinions about that.

The expression operations operate on DataSet[Row], where Row is
basically a wrapper for an array of elements of different types. The
expression API keeps track of the names and types of these
fields. Right now, when you have an operation like:

// 'foo and 'bar are Scala symbols
// they refer to fields named foo and
// bar in the input data set
val result = in.select('foo, 'bar)

the result is a DataSet[Row]. This means two things:

1. The user can theoretically do a map operation on this and manually
access row fields, as in:

in.map { row => (row.getField(0).asInstanceOf[Int],
row.getField(1).asInstanceOf[String]) }

2. I cannot easily look at the whole structure of a query, because
queries are translated to DataSet operations one expression at a time,
i.e.:

val result = in1.join(in2).filter(...).select(...)

results in a join operation, followed by a filter operation, followed
by a map operation. If the translation did not happen one operator
at a time, we could combine all the operations into one join
operation. This would mean having a custom optimiser component for the
expression API and bypassing the optimiser component we have for
normal operator data flows.
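
To make point 1 concrete, here is a minimal sketch in plain Scala. The Row class below is a hypothetical stand-in for the wrapper described above, not the actual Flink class, and a Seq stands in for the DataSet. It shows that positional access with casts compiles either way and only fails at runtime when the cast is wrong:

```scala
// Hypothetical stand-in for the Row wrapper described above (not Flink's class).
final class Row(private val fields: Array[Any]) {
  def getField(pos: Int): Any = fields(pos)
}

object RowAccessSketch {
  // Two rows with fields (foo: Int, bar: String); a Seq stands in for a DataSet.
  val rows: Seq[Row] = Seq(
    new Row(Array[Any](1, "a")),
    new Row(Array[Any](2, "b"))
  )

  // Manual access as in the map example: positions and casts are unchecked.
  def manual(in: Seq[Row]): Seq[(Int, String)] =
    in.map(row => (row.getField(0).asInstanceOf[Int],
                   row.getField(1).asInstanceOf[String]))

  // A wrong cast still compiles; it throws a ClassCastException when executed.
  def wrongCast(in: Seq[Row]): Seq[String] =
    in.map(row => row.getField(0).asInstanceOf[String])
}
```

Nothing in the types stops a user from writing wrongCast, which is the loss of control over field types that exposing DataSet[Row] would allow.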

The question now is: should I expose it as is, i.e. let expression
operations result in DataSet[Row], or should I hide it behind another
type of DataSet (ExpressionDataSet) so that we can later change the
implementation details and perform any magic we want behind the
scenes?
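
One way the hidden variant could look, as a rough sketch only: the wrapper records a logical plan instead of translating each call immediately, so the whole query is visible before anything is turned into operators. All names below (Plan, Source, Join, Filter, Select, PlanSketch, and this simplified ExpressionDataSet with string predicates) are assumptions made for illustration and are not the real API:

```scala
// Hypothetical logical plan nodes; predicates are plain strings to keep it short.
sealed trait Plan
final case class Source(name: String) extends Plan
final case class Join(left: Plan, right: Plan) extends Plan
final case class Filter(input: Plan, predicate: String) extends Plan
final case class Select(input: Plan, fields: List[String]) extends Plan

// The wrapper only records operations; nothing is translated eagerly.
final class ExpressionDataSet(val plan: Plan) {
  def join(other: ExpressionDataSet): ExpressionDataSet =
    new ExpressionDataSet(Join(plan, other.plan))
  def filter(predicate: String): ExpressionDataSet =
    new ExpressionDataSet(Filter(plan, predicate))
  def select(fields: String*): ExpressionDataSet =
    new ExpressionDataSet(Select(plan, fields.toList))
}

object PlanSketch {
  def source(name: String): ExpressionDataSet = new ExpressionDataSet(Source(name))

  // With the whole tree available, select(filter(join(...))) can be
  // recognised and emitted as one combined join operator.
  def translate(plan: Plan): List[String] = plan match {
    case Select(Filter(Join(l, r), p), fs) =>
      val cols = fs.mkString(",")
      (translate(l) ++ translate(r)) :+ s"join[filter=$p, project=$cols]"
    case Source(n)      => List(s"source($n)")
    case Join(l, r)     => (translate(l) ++ translate(r)) :+ "join"
    case Filter(in, p)  => translate(in) :+ s"filter($p)"
    case Select(in, fs) => translate(in) :+ ("map(" + fs.mkString(",") + ")")
  }
}
```

For a query such as PlanSketch.source("in1").join(PlanSketch.source("in2")).filter("a > 1").select("a", "b"), translate yields the two sources plus a single combined join entry rather than separate join, filter and map entries.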

Cheers,
Aljoscha

Re: Design Question in Expression API

Robert Metzger
Hi,

Am I right that you basically fear that if you allow users to
"manually" modify DataSet<Row>s, you lose control of the types
etc.?

I think that integrating the expression API into the existing API is nicer,
because it gives users more flexibility. It should also lead to a lower
overall complexity, right? I'm in favor of keeping things as simple as
possible.

Robert

In reply to Aljoscha Krettek's message of Thu, Jan 29, 2015 at 4:47 PM.

Re: Design Question in Expression API

Aljoscha Krettek-2
Yes, that's what I'm afraid of. I would also like to have the
flexibility to change the underlying implementation in the future, if
we want to. And no, hiding the implementation would actually make
things easier, implementation-wise.

In reply to Robert Metzger's message of Fri, Jan 30, 2015 at 7:08 PM.

Re: Design Question in Expression API

Stephan Ewen
My first intuition is to not expose the Row data type. If we add columnar
execution later, there may never be a Row data type during runtime (cf. the
paper on the HyPer runtime engine).

For these declarative operations, I think it is a big advantage to keep the
underpinnings strictly separate so we can change the execution model.

Also, I think that explicit switches between the logical and physical
abstraction (switching from class type to logical row type and vice versa)
make things more transparent to the user. As an example: a filter in a
logical query expression may be pushed down, whereas a filter defined as a
UDF on a physical type is not pushed down.
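
The pushdown example can be sketched in a few lines of plain Scala. This is a toy model with made-up names (Pred, FieldGreaterThan, Udf, Scan, JoinNode, FilterNode, Pushdown) and is not how the Flink optimiser is implemented. A logical predicate says which field it reads, so it can be moved below a join; a UDF is an opaque function, so the planner has to leave it where it is:

```scala
// Hypothetical predicate model.
sealed trait Pred
// A logical predicate exposes the field it reads.
final case class FieldGreaterThan(field: String, bound: Int) extends Pred
// A UDF is opaque: the planner cannot tell which fields it touches.
final case class Udf(f: Map[String, Any] => Boolean) extends Pred

// Hypothetical plan nodes.
sealed trait Node
final case class Scan(name: String, fields: Set[String]) extends Node
final case class JoinNode(left: Node, right: Node) extends Node
final case class FilterNode(input: Node, pred: Pred) extends Node

object Pushdown {
  def fieldsOf(n: Node): Set[String] = n match {
    case Scan(_, fs)       => fs
    case JoinNode(l, r)    => fieldsOf(l) ++ fieldsOf(r)
    case FilterNode(in, _) => fieldsOf(in)
  }

  // Move a filter below a join when its field comes from exactly one side.
  def optimise(n: Node): Node = n match {
    case FilterNode(JoinNode(l, r), p @ FieldGreaterThan(f, _))
        if fieldsOf(l)(f) && !fieldsOf(r)(f) =>
      JoinNode(optimise(FilterNode(l, p)), optimise(r))
    case FilterNode(JoinNode(l, r), p @ FieldGreaterThan(f, _))
        if fieldsOf(r)(f) && !fieldsOf(l)(f) =>
      JoinNode(optimise(l), optimise(FilterNode(r, p)))
    case FilterNode(in, p) => FilterNode(optimise(in), p) // includes all UDF filters
    case JoinNode(l, r)    => JoinNode(optimise(l), optimise(r))
    case s: Scan           => s
  }
}
```

In this model a FieldGreaterThan filter on a field of the left input ends up directly above the left Scan, while the same query with a Udf predicate comes back unchanged.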

Re: Design Question in Expression API

Aljoscha Krettek-2
Yes, that's exactly my reasoning for wanting to hide it.

In reply to Stephan Ewen's message of Sat, Jan 31, 2015 at 10:32 AM.

Re: Design Question in Expression API

Fabian Hueske-2
I am also +1 for hiding the internals.
Having conversion functions from and to DataSet sounds like the way to go
to me.
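
As an illustration of what such conversion functions might look like, here is a small sketch in plain Scala. Person, LogicalTable, toLogical and toPeople are hypothetical names chosen for this example, and plain collections stand in for DataSet. The point is that the switch between the typed (physical) side and the logical row side happens in two explicit places, where field names and types can be checked:

```scala
// Hypothetical user type on the physical side.
final case class Person(name: String, age: Int)

// Hypothetical logical side: field names plus untyped values.
final case class LogicalTable(fieldNames: Vector[String], rows: Vector[Vector[Any]])

object Conversions {
  // Physical -> logical: field names are taken from the case class fields.
  def toLogical(people: Seq[Person]): LogicalTable =
    LogicalTable(Vector("name", "age"),
                 people.map(p => Vector[Any](p.name, p.age)).toVector)

  // Logical -> physical: fields are looked up by name and type-checked,
  // so a mismatch is reported at the boundary instead of inside a later map.
  def toPeople(t: LogicalTable): Either[String, Seq[Person]] = {
    val nameIdx = t.fieldNames.indexOf("name")
    val ageIdx = t.fieldNames.indexOf("age")
    if (nameIdx < 0 || ageIdx < 0) Left("missing field 'name' or 'age'")
    else {
      val converted: Vector[Either[String, Person]] = t.rows.map { r =>
        (r(nameIdx), r(ageIdx)) match {
          case (n: String, a: Int) => Right(Person(n, a))
          case other               => Left(s"unexpected field types in row: $other")
        }
      }
      converted.collect { case Left(e) => e }.headOption match {
        case Some(e) => Left(e)
        case None    => Right(converted.collect { case Right(p) => p })
      }
    }
  }
}
```

A round trip through toLogical and toPeople returns the original values, and a table that lacks one of the expected fields is rejected with a Left rather than failing later with a cast error.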


In reply to Aljoscha Krettek's message of 2015-01-31 11:04 GMT+01:00.

Re: Design Question in Expression API

Max Michels
If we want to have a tight integration with our existing API we have
to hide the results of the expressions behind a wrapper. This enables
us to change the internal implementation at any time and support
future Flink API changes and features.

+1 for not directly exposing the results as a row DataSet.

In reply to Fabian Hueske's message of Tue, Feb 3, 2015 at 10:39 AM.