Flink ML - NaN Handling

Flink ML - NaN Handling

Stavros Kontopoulos
Hello guys,

Is there a story for this (might have been discussed earlier)? I see
differences between scikit-learn and numpy. Do we standardize on
scikit-learn?

PS. I am working on the preprocessing stuff.

Best,
Stavros

Re: Flink ML - NaN Handling

Till Rohrmann
Hi Stavros,

so far we've stuck mainly to scikit-learn in terms of semantics. Thus, I
would recommend following scikit-learn's approach to handling NaNs.

Cheers,
Till


Re: Flink ML - NaN Handling

Stavros Kontopoulos
Ok cool, thanks Till.


Re: Flink ML - NaN Handling

Stavros Kontopoulos
Btw, if we follow scikit-learn, I think we should also add an Imputer for
preparing the dataset, as described in the "Imputation of Missing Values"
section of http://scikit-learn.org/stable/modules/preprocessing.html.
What do you think? Should I add it as an issue on JIRA?
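
(To make this concrete, here is a rough sketch of column-wise mean
imputation in plain Scala. It only illustrates the semantics of sklearn's
Imputer(strategy='mean'); the names are made up and it is not wired into
the actual Flink ML Transformer API:)

// Sketch only: replace NaN entries by the mean of the non-NaN values of
// the same column, mirroring sklearn's mean-imputation strategy.
object ImputerSketch {

  // "fit": compute one mean per column, ignoring NaNs
  def fitColumnMeans(rows: Seq[Array[Double]]): Array[Double] = {
    val numCols = rows.head.length
    Array.tabulate(numCols) { col =>
      val observed = rows.map(_(col)).filterNot(_.isNaN)
      if (observed.isEmpty) 0.0 else observed.sum / observed.size
    }
  }

  // "transform": substitute the learned column means for NaN entries
  def transform(rows: Seq[Array[Double]], means: Array[Double]): Seq[Array[Double]] =
    rows.map(_.zipWithIndex.map { case (v, col) => if (v.isNaN) means(col) else v })

  def main(args: Array[String]): Unit = {
    val data = Seq(Array(1.0, -1.0, 2.0), Array(2.0, 0.0, Double.NaN))
    val imputed = transform(data, fitColumnMeans(data))
    imputed.foreach(r => println(r.mkString("[", ", ", "]")))
    // prints:
    // [1.0, -1.0, 2.0]
    // [2.0, 0.0, 2.0]   <- NaN replaced by the mean of its column
  }
}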

The question about NaNs also holds for data generated by one pipeline stage
and fed to the next. From what I see, in all cases we should throw an
exception.
For example, in scikit-learn:

>>> from sklearn import preprocessing
>>> X = [[ 1., -1.,  2.],
...      [ 2.,  0.,  float('NaN')]]
>>> preprocessing.normalize(X, norm='l2')
Traceback (most recent call last):
...
ValueError: Input contains NaN, infinity or a value too large for
dtype('float64').

I don't see such a check in Flink ML's code; my understanding is that NaNs
are simply propagated, correct?
For example, when I run the MinMaxScalerIT tests with a NaN in the data I
get a result like:
DenseVector(0.34528405956977387, 0.5, NaN)
...
which is reasonable given the implementation, but should it be allowed?
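
(For what it's worth, that propagation simply falls out of IEEE 754
arithmetic; a tiny standalone illustration of the (x - min) / (max - min)
rescaling, independent of the actual MinMaxScaler implementation:)

// Standalone sketch, not the real MinMaxScaler code: any arithmetic
// involving NaN yields NaN, so a NaN in the input survives the rescaling
// and shows up unchanged in the output vector.
object NaNPropagation {
  def rescale(x: Double, min: Double, max: Double): Double =
    (x - min) / (max - min) // maps [min, max] onto [0, 1]

  def main(args: Array[String]): Unit = {
    println(rescale(2.0, 1.0, 3.0))        // 0.5
    println(rescale(Double.NaN, 1.0, 3.0)) // NaN
    println(math.min(1.0, Double.NaN))     // NaN - even the column min/max is poisoned
  }
}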


Re: Flink ML - NaN Handling

Till Rohrmann
Hi Stavros,

your idea to add an imputer is really good. Please open a JIRA issue for
that.

You're right that failing fast is usually the better behaviour in the case
of an undefined value such as NaN or infinity. Thus, I think it makes sense
to define the valid value range for each component and fail if an incoming
value falls outside this range.
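
(As a sketch of what such a per-component range check could look like, with
the range modelled as a simple case class; nothing like this exists in the
current code base, the names are purely illustrative:)

// Each component declares the value range it accepts and validates
// incoming values against it, throwing on NaN, infinity or out-of-range
// values instead of silently propagating them.
final case class ValueRange(min: Double = Double.NegativeInfinity,
                            max: Double = Double.PositiveInfinity) {
  def contains(v: Double): Boolean =
    java.lang.Double.isFinite(v) && v >= min && v <= max
}

object RangeValidation {
  def validate(component: String, range: ValueRange, values: Array[Double]): Unit =
    values.find(v => !range.contains(v)).foreach { v =>
      throw new IllegalArgumentException(
        s"$component: value $v is outside the accepted range [${range.min}, ${range.max}]")
    }

  def main(args: Array[String]): Unit = {
    val finiteOnly = ValueRange() // accepts any finite value, rejects NaN and infinity
    validate("MinMaxScaler", finiteOnly, Array(1.0, -1.0, 2.0))       // ok
    validate("MinMaxScaler", finiteOnly, Array(2.0, 0.0, Double.NaN)) // throws
  }
}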

Cheers,
Till
