Flink on Tez Test stuck

Stephan Ewen
I have observed that a Flink-on-Tez test job stalls in two cases on the
Travis CI server.

https://travis-ci.org/StephanEwen/incubator-flink/jobs/62302207

It looks like a shuffle fetch is simply not continuing, but freezing. At
first glance, the stack traces suggest that this is actually a Tez issue
rather than a Flink issue (all threads are stuck in Tez methods), but one
cannot be sure.

Has anyone observed something similar before?

Re: Flink on Tez Test stuck

Aljoscha Krettek-2
I think I saw it once, yes, but dismissed it as a fluke.

On Wed, May 13, 2015 at 1:13 AM, Stephan Ewen <[hidden email]> wrote:

> I have observed that a Flink-on-Tez test job stalls in two cases on the
> Travis CI server.
>
> https://travis-ci.org/StephanEwen/incubator-flink/jobs/62302207
>
> It looks like a shuffle fetch is simply not continuing, but freezing. The
> stack traces suggest at a first glance that this is actually a Tez issue ,
> rather than a Flink issue (all threads stuck in Tez methods), but one
> cannot be sure.
>
> Anyone observed something similar before?

Re: Flink on Tez Test stuck

Robert Metzger
I have also seen this failure multiple times now.
This is another case of it: https://travis-ci.org/apache/flink/jobs/62767646

I think the Tez community is currently voting on a new release. Maybe we
should see if this one fixes the issue.
Otherwise we should ask on their list.

On Wed, May 13, 2015 at 9:35 AM, Aljoscha Krettek <[hidden email]>
wrote:

> I think I saw it once, yes. But dismissed it as a fluke.
>
> On Wed, May 13, 2015 at 1:13 AM, Stephan Ewen <[hidden email]> wrote:
> > I have observed that a Flink-on-Tez test job stalls in two cases on the
> > Travis CI server.
> >
> > https://travis-ci.org/StephanEwen/incubator-flink/jobs/62302207
> >
> > It looks like a shuffle fetch is simply not continuing, but freezing. The
> > stack traces suggest at a first glance that this is actually a Tez issue,
> > rather than a Flink issue (all threads stuck in Tez methods), but one
> > cannot be sure.
> >
> > Anyone observed something similar before?
>

Re: Flink on Tez Test stuck

Robert Metzger
Tez has just announced the availability of version 0.6.1.
Maybe that version is more stable. I've filed a JIRA issue to upgrade the
version: https://issues.apache.org/jira/browse/FLINK-2064

On Sun, May 17, 2015 at 12:04 PM, Robert Metzger <[hidden email]>
wrote:

> I saw this failure also multiple times now.
> This is another case of it:
> https://travis-ci.org/apache/flink/jobs/62767646
>
> I think the Tez community is currently voting on a new release. Maybe we
> should see if this one fixes the issue.
> Otherwise we should ask on their list.
>
> On Wed, May 13, 2015 at 9:35 AM, Aljoscha Krettek <[hidden email]>
> wrote:
>
>> I think I saw it once, yes. But dismissed it as a fluke.
>>
>> On Wed, May 13, 2015 at 1:13 AM, Stephan Ewen <[hidden email]> wrote:
>> > I have observed that a Flink-on-Tez test job stalls in two cases on the
>> > Travis CI server.
>> >
>> > https://travis-ci.org/StephanEwen/incubator-flink/jobs/62302207
>> >
>> > It looks like a shuffle fetch is simply not continuing, but freezing. The
>> > stack traces suggest at a first glance that this is actually a Tez issue,
>> > rather than a Flink issue (all threads stuck in Tez methods), but one
>> > cannot be sure.
>> >
>> > Anyone observed something similar before?
>>
>
>