Batch job getting stuck


Batch job getting stuck

Amit Jain
Hi,

We have created a batch job that merges a set of S3 directories in text
format with the old snapshot in Parquet format.

We run 50 such jobs daily and have found that a few random jobs get stuck
partway through. We have gone through the JobManager and TaskManager logs
but could not find any useful information there.

The important operators involved are: a read using TextInputFormat, a read
using HadoopInputFormat, a full outer join, and a write using our
BucketingSink code.

Please help resolve this issue.

Flink version 1.3.2, deployed in YARN containers.
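For reference, the merge described above amounts to a keyed full outer join
in which (by assumption; the post does not say which side wins on conflict)
the newer text-format records override the old Parquet snapshot. A minimal
Python sketch of that semantics, with hypothetical keys and values:

```python
def merge_snapshot(old: dict, delta: dict) -> dict:
    """Full-outer-join merge: keys from either input survive;
    for keys present in both, the delta record wins (assumed)."""
    merged = dict(old)    # start from the old snapshot
    merged.update(delta)  # delta rows override existing keys or add new ones
    return merged

old = {"k1": "v1", "k2": "v2"}
delta = {"k2": "v2-new", "k3": "v3"}
print(merge_snapshot(old, delta))  # {'k1': 'v1', 'k2': 'v2-new', 'k3': 'v3'}
```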

--
Thanks,
Amit

Re: Batch job getting stuck

Timo Walther-2
Hi Amit,

How is the memory consumption when the jobs get stuck? Is the Java GC
active? Are you using off-heap memory?

Regards,
Timo

On 2/12/18 at 10:10 AM, Amit Jain wrote:

> [...]


Re: Batch job getting stuck

Amit Jain
Hi Timo,

Yes, we are using off-heap memory. Our YARN containers are set to use ~23 GB
of memory with two slots per container, and the YARN heap cutoff ratio is
set to 0.6.
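For completeness, that setup corresponds roughly to the following
flink-conf.yaml fragment. This is a sketch using Flink 1.3.x configuration
keys; the values are inferred from the description above, not taken from the
actual cluster config:

```yaml
# Hypothetical flink-conf.yaml fragment matching the setup described (Flink 1.3.x keys)
taskmanager.numberOfTaskSlots: 2    # two slots per container
yarn.heap-cutoff-ratio: 0.6         # fraction of container memory withheld from the JVM heap
taskmanager.memory.off-heap: true   # allocate managed memory off-heap
```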

The jobs show normal memory usage; the problem here is not a temporary stall
but a permanent halt of the running jobs.

TaskManager log:

2018-02-08 16:55:31,007 INFO org.apache.flink.yarn.YarnTaskManagerRunner - JVM Options:
2018-02-08 16:55:31,007 INFO org.apache.flink.yarn.YarnTaskManagerRunner -     -Xms9370m
2018-02-08 16:55:31,007 INFO org.apache.flink.yarn.YarnTaskManagerRunner -     -Xmx9370m


GC runs and memory usage on one of the TaskManagers:

Garbage Collection

  Collector      Count     Time (ms)
  PS_Scavenge    22,673    702,544
  PS_MarkSweep   143       77,431

JVM Memory (Heap/Non-Heap)

  Type       Committed   Used      Maximum
  Heap       9.11 GB     6.23 GB   9.11 GB
  Non-Heap   1.73 GB     1.67 GB   -1 B
  Total      10.8 GB     7.90 GB   9.11 GB
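As a rough sanity check on those numbers (assuming the GC times are reported
in milliseconds, as in the Flink web UI), the total GC time works out like
this:

```python
# Rough GC-overhead check from the stats above (times assumed to be milliseconds)
scavenge_ms = 702_544    # PS_Scavenge total time
mark_sweep_ms = 77_431   # PS_MarkSweep total time

total_gc_s = (scavenge_ms + mark_sweep_ms) / 1000
print(f"Total GC time: {total_gc_s:.0f} s")  # → Total GC time: 780 s
```

Roughly 13 minutes of cumulative GC time is only alarming if the TaskManager
uptime is short; on its own it does not indicate a GC-bound stall, which is
consistent with the normal memory usage observed.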


--
Thanks,
Amit


On Mon, Feb 12, 2018 at 9:50 PM, Timo Walther <[hidden email]> wrote:

> [...]