Collision of task number values for the same task

classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Collision of task number values for the same task

aalexandrov
Hello,

I am analyzing the logs from a Flink batch job and am seeing the following
two lines:

2016-05-30 15:32:31,701 INFO  ...        - DataSource (at ${path}) (4/4)
(7efe8fcfe9c7c7e6cd4683e1b5c06a3a) switched from SCHEDULED to DEPLOYING
2016-05-30 15:32:31,701 INFO  ...        - DataSource (at ${path}) (4/4)
(e54bda3e413816b5d35046468cedbf86) switched from SCHEDULED to DEPLOYING

The lines are basically identical, except for the task ID. Both tasks then
transitioned to FINISHED.

I got the impression that the (x/y) part of the log reports the task number
/ total number of parallel tasks. Consequently, I was expecting that the x
component and the task path will functionally determine the task ID, except
for situations in which the first task fails.

Can somebody shed a light on the execution semantics of the scheduler which
will explain this behavior?

Cheers,
Alex
Reply | Threaded
Open this post in threaded view
|

Re: Collision of task number values for the same task

Ufuk Celebi-2
On Tue, May 31, 2016 at 11:53 AM, Alexander Alexandrov
<[hidden email]> wrote:
> Can somebody shed a light on the execution semantics of the scheduler which
> will explain this behavior?

The execution IDs are unique per execution attempt. Having two tasks
with the same subtask index running at the same time is unexpected.

Can you share the complete logs, please?

– Ufuk
Reply | Threaded
Open this post in threaded view
|

Re: Collision of task number values for the same task

aalexandrov
Sure, you can find them attached here (both jobmanager and taskmanager, the problem was observed in the jobmanager logs).

If needed I can also share the binary to reproduce the issue.

I think the problem is related to the fact that the input splits are lazily assigned to the task slots, and it seems that in case of 8 splits for 4 slots, we get each (x/y) combination twice.

Moreover, I am currently analyzing the structure of the log files, and it seems that the task ID is not reported consistently across the different messages [1,2,3]. This makes the implementation of an ETL job that extracts the statistics from the log and feed them into a database quite hard.

Would it be possible to push a fix which adds the task ID consistently across all messages in the 1.0.x line? If yes, I will open a JIRA and work on that this week.
I would like to get feedback from other people who are parsing jobmanager / taskamanager logs on that in order to avoid possible backwards compatibility with job analysis tools on the release line.

[1] https://github.com/apache/flink/blob/da23ee38e5b36ddf26a6a5a807efbbbcbfe1d517/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L370-L371
[2] https://github.com/apache/flink/blob/da23ee38e5b36ddf26a6a5a807efbbbcbfe1d517/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L991-L992

Regards,
A.


2016-05-31 12:01 GMT+02:00 Ufuk Celebi <[hidden email]>:
On Tue, May 31, 2016 at 11:53 AM, Alexander Alexandrov
<[hidden email]> wrote:
> Can somebody shed a light on the execution semantics of the scheduler which
> will explain this behavior?

The execution IDs are unique per execution attempt. Having two tasks
with the same subtask index running at the same time is unexpected.

Can you share the complete logs, please?

– Ufuk


flink-alexander-taskmanager-0-cow-11.log (221K) Download Attachment
flink-alexander-taskmanager-0-cow-11.log (221K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Collision of task number values for the same task

Stephan Ewen
It could be that

(a) The task failed and was restarted.

(b) The program has multiple steps (collect() print()), so that parts of
the graph get re-executed.

(c) You have two operators with the same name that become tasks with the
same name.

Do any of those explanations make sense in your setting?

Stephan


On Tue, May 31, 2016 at 12:48 PM, Alexander Alexandrov <
[hidden email]> wrote:

> Sure, you can find them attached here (both jobmanager and taskmanager,
> the problem was observed in the jobmanager logs).
>
> If needed I can also share the binary to reproduce the issue.
>
> I think the problem is related to the fact that the input splits are
> lazily assigned to the task slots, and it seems that in case of 8 splits
> for 4 slots, we get each (x/y) combination twice.
>
> Moreover, I am currently analyzing the structure of the log files, and it
> seems that the task ID is not reported consistently across the different
> messages [1,2,3]. This makes the implementation of an ETL job that extracts
> the statistics from the log and feed them into a database quite hard.
>
> Would it be possible to push a fix which adds the task ID consistently
> across all messages in the 1.0.x line? If yes, I will open a JIRA and work
> on that this week.
> I would like to get feedback from other people who are parsing jobmanager
> / taskamanager logs on that in order to avoid possible backwards
> compatibility with job analysis tools on the release line.
>
> [1]
> https://github.com/apache/flink/blob/da23ee38e5b36ddf26a6a5a807efbbbcbfe1d517/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L370-L371
> [2]
> https://github.com/apache/flink/blob/da23ee38e5b36ddf26a6a5a807efbbbcbfe1d517/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L991-L992
>
> Regards,
> A.
>
>
> 2016-05-31 12:01 GMT+02:00 Ufuk Celebi <[hidden email]>:
>
>> On Tue, May 31, 2016 at 11:53 AM, Alexander Alexandrov
>> <[hidden email]> wrote:
>> > Can somebody shed a light on the execution semantics of the scheduler
>> which
>> > will explain this behavior?
>>
>> The execution IDs are unique per execution attempt. Having two tasks
>> with the same subtask index running at the same time is unexpected.
>>
>> Can you share the complete logs, please?
>>
>> – Ufuk
>>
>
>
Reply | Threaded
Open this post in threaded view
|

Re: Collision of task number values for the same task

aalexandrov
> (c) You have two operators with the same name that become tasks with the
same name.

Actually it was a variation on that issue.
The problem was that I was reading a dataset X which was part of both the
dynamic and the static path of a Flink iteration. I guess the duplicates
duplicates these paths, so I had two DataSource tasks with the same
location.

Problem solved.

Regards,
A.

2016-05-31 13:54 GMT+02:00 Stephan Ewen <[hidden email]>:

> It could be that
>
> (a) The task failed and was restarted.
>
> (b) The program has multiple steps (collect() print()), so that parts of
> the graph get re-executed.
>
> (c) You have two operators with the same name that become tasks with the
> same name.
>
> Do any of those explanations make sense in your setting?
>
> Stephan
>
>
> On Tue, May 31, 2016 at 12:48 PM, Alexander Alexandrov <
> [hidden email]> wrote:
>
> > Sure, you can find them attached here (both jobmanager and taskmanager,
> > the problem was observed in the jobmanager logs).
> >
> > If needed I can also share the binary to reproduce the issue.
> >
> > I think the problem is related to the fact that the input splits are
> > lazily assigned to the task slots, and it seems that in case of 8 splits
> > for 4 slots, we get each (x/y) combination twice.
> >
> > Moreover, I am currently analyzing the structure of the log files, and it
> > seems that the task ID is not reported consistently across the different
> > messages [1,2,3]. This makes the implementation of an ETL job that
> extracts
> > the statistics from the log and feed them into a database quite hard.
> >
> > Would it be possible to push a fix which adds the task ID consistently
> > across all messages in the 1.0.x line? If yes, I will open a JIRA and
> work
> > on that this week.
> > I would like to get feedback from other people who are parsing jobmanager
> > / taskamanager logs on that in order to avoid possible backwards
> > compatibility with job analysis tools on the release line.
> >
> > [1]
> >
> https://github.com/apache/flink/blob/da23ee38e5b36ddf26a6a5a807efbbbcbfe1d517/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L370-L371
> > [2]
> >
> https://github.com/apache/flink/blob/da23ee38e5b36ddf26a6a5a807efbbbcbfe1d517/flink-runtime/src/main/java/org/apache/flink/runtime/executiongraph/Execution.java#L991-L992
> >
> > Regards,
> > A.
> >
> >
> > 2016-05-31 12:01 GMT+02:00 Ufuk Celebi <[hidden email]>:
> >
> >> On Tue, May 31, 2016 at 11:53 AM, Alexander Alexandrov
> >> <[hidden email]> wrote:
> >> > Can somebody shed a light on the execution semantics of the scheduler
> >> which
> >> > will explain this behavior?
> >>
> >> The execution IDs are unique per execution attempt. Having two tasks
> >> with the same subtask index running at the same time is unexpected.
> >>
> >> Can you share the complete logs, please?
> >>
> >> – Ufuk
> >>
> >
> >
>