Task manager processes crashing one after the other

Gyula Fóra-2
Hi guys,

For quite some time now, we have fairly frequently experienced task manager
crashes around the time new streaming jobs are deployed. We use the RocksDB
backend, so this might be related.

We tried changing the GC from G1 to CMS, but that didn't help.

Yesterday, for instance, 6 task managers crashed one after the other with
similar errors:

*** Error in `java': double free or corruption (!prev): 0x00007fac0414d760 ***
*** Error in `java': free(): invalid pointer: 0x00007f8dcc0026c0 ***
*** Error in `java': double free or corruption (!prev): 0x00007f15247f9a90 ***
...

Does anyone have any clue what might cause this or how to debug it?
This is a very critical issue :(

Cheers,
Gyula

Re: Task manager processes crashing one after the other

Till Rohrmann
Hi Gyula,

I haven't seen this problem before. Do you have the logs of the failed TMs,
so that we have some more context on what was going on?

Cheers,
Till

(In reply to Gyula Fóra's message of Thu, Aug 25, 2016 at 9:40 AM)

Re: Task manager processes crashing one after the other

Gyula Fóra
Hi,

Sure, I am sending the TM logs in private.

For now I have bumped the RocksDB version to 4.9.0; let's see if
that helps.

Cheers,
Gyula

(In reply to Till Rohrmann's message of Thu, Aug 25, 2016, 10:35)

Re: Task manager processes crashing one after the other

Gyula Fóra
Hi,
This seems to be a sneaky concurrency issue in our custom state backend
implementation.

I made some changes and will keep you posted.

Cheers,
Gyula

(In reply to Gyula Fóra's message of Thu, Aug 25, 2016, 10:54)

Re: Task manager processes crashing one after the other

Stephan Ewen
We saw some crashes in earlier versions when native handles in RocksDB
(even for config option objects) were manually and too eagerly released.

Maybe you have a similar issue here?
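
For readers following along, here is a minimal sketch of one way to guard against that kind of double release. It is not the code from Flink or from RocksDB; the names GuardedHandle and NativeResource are hypothetical stand-ins for whatever object owns the native memory. The idea is only that free() runs at most once and that any access after release fails with a Java exception instead of reaching freed memory.

```java
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Hypothetical wrapper around a native resource (for example an options
 * object backed by native memory). Release is idempotent, and access after
 * release throws instead of touching freed memory.
 */
public class GuardedHandle implements AutoCloseable {

    /** Stand-in for the real native resource (an assumption for this sketch). */
    public interface NativeResource {
        void free();
    }

    private final NativeResource resource;
    private final AtomicBoolean released = new AtomicBoolean(false);

    public GuardedHandle(NativeResource resource) {
        this.resource = resource;
    }

    /** Returns the resource, or fails fast if it has already been released. */
    public NativeResource get() {
        if (released.get()) {
            throw new IllegalStateException("native handle already released");
        }
        return resource;
    }

    /** Frees the resource at most once, even when called from several threads. */
    @Override
    public void close() {
        if (released.compareAndSet(false, true)) {
            resource.free();
        }
    }

    public static void main(String[] args) {
        int[] freeCount = {0};
        GuardedHandle handle = new GuardedHandle(() -> freeCount[0]++);
        handle.close();
        handle.close(); // second call is a no-op
        System.out.println("free() calls: " + freeCount[0]); // prints "free() calls: 1"
    }
}
```

Note that this only prevents a second free() and access after the release has completed. It does not stop one thread from closing the handle while another is still using the resource returned by get(); that would need a lock or reference counting around each use.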

(In reply to Gyula Fóra's message of Thu, Aug 25, 2016 at 6:27 PM)

Re: Task manager processes crashing one after the other

Gyula Fóra
Yes, it seems so; I remember the fix in Flink. Apparently I made a
mistake somewhere in our code :)

Thanks,
Gyula

(In reply to Stephan Ewen's message of Thu, Aug 25, 2016, 18:59)

Re: Task manager processes crashing one after the other

Gyula Fóra
Stephan,

I ported the fix for the concurrency issue from the Flink commit, so that
should be fine now. I ran some fail/restore tests and that specific issue
hasn't appeared again.

However, I now get many segfaults in the initializeForJob method where the
RocksDB instance is opened. Just for the record, this is exactly the same
code as we have in Flink now:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f12b018f51f, pid=12576, tid=139668190197504
#
# JRE version: Java(TM) SE Runtime Environment (8.0_60-b27) (build 1.8.0_60-b27)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.60-b23 mixed mode linux-amd64 )
# Problematic frame:
# C  [libc.so.6+0x7b51f]
...
Stack: [0x00007f0708ccf000,0x00007f0708dd0000],  sp=0x00007f0708dccd20,  free space=1015k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
C  [libc.so.6+0x7b51f]

Java frames: (J=compiled Java code, j=interpreted, Vv=VM code)
j  org.rocksdb.RocksDB.open(JLjava/lang/String;Ljava/util/List;I)Ljava/util/List;+0
j  org.rocksdb.RocksDB.open(Lorg/rocksdb/DBOptions;Ljava/lang/String;Ljava/util/List;Ljava/util/List;)Lorg/rocksdb/RocksDB;+23
j  com.king.rbea.backend.state.rocksdb.RocksDBStateBackend.initializeForJob...

This happens fairly frequently when the jobs are restarting after a
failure.

Cheers,
Gyula

(In reply to Gyula Fóra's message of Thu, Aug 25, 2016, 19:07)

Re: Task manager processes crashing one after the other

Gyula Fóra
Some additional info:

It doesn't seem to happen the first time I start the jobs or restore them
from a savepoint. It happens when jobs are failing over after a task manager
failure.

This could be caused by a non-empty RocksDB directory (one that was somehow
left in an inconsistent state), but that should not happen, as the
instanceDbPath is deleted before opening.
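
One way to rule that hypothesis in or out is to make the cleanup fail loudly. The following is a rough sketch only, not the actual backend code: the class CleanDbDir and the method deleteRecursively are hypothetical names, and it assumes the directory is reachable through java.nio.file. Unlike java.io.File.delete(), which just returns false, Files.delete throws an IOException when an entry cannot be removed, so a half-deleted directory would show up in the logs before the database is opened.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical helper: removes a directory tree and fails loudly on errors. */
public class CleanDbDir {

    /** Deletes dir and everything below it; does nothing if dir is missing. */
    public static void deleteRecursively(Path dir) throws IOException {
        if (Files.notExists(dir)) {
            return;
        }
        Files.walkFileTree(dir, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs)
                    throws IOException {
                Files.delete(file); // throws if the file cannot be removed
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path d, IOException exc)
                    throws IOException {
                if (exc != null) {
                    throw exc; // listing the directory failed
                }
                Files.delete(d); // directory is empty at this point
                return FileVisitResult.CONTINUE;
            }
        });
    }

    public static void main(String[] args) throws IOException {
        // Build a throwaway directory with one nested file, then remove it.
        Path dir = Files.createTempDirectory("instanceDbPath");
        Files.write(dir.resolve("CURRENT"), new byte[] {1});
        deleteRecursively(dir);
        System.out.println("exists after delete: " + Files.exists(dir)); // prints "exists after delete: false"
    }
}
```

If a check like this never fires and the segfaults continue, the leftover-directory explanation becomes less likely and the native open path itself is the better place to look.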

Gyula

(In reply to Gyula Fóra's message of Thu, Aug 25, 2016, 23:28)

Re: Task manager processes crashing one after the other

Robert Metzger
I experienced quite a similar issue with RocksDB on my cluster, also after
some retries (with Flink 1.1.4 RC3):

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f1611829f4e, pid=3545, tid=139732543575808
#
# JRE version: Java(TM) SE Runtime Environment (7.0_67-b01) (build 1.7.0_67-b01)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (24.65-b04 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# C  [ld-linux-x86-64.so.2+0x9f4e]  _dl_rtld_di_serinfo+0x86e
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /yarn/nm/usercache/robert/appcache/application_1481291289979_0024/container_1481291289979_0024_01_008775/hs_err_pid3545.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.sun.com/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.




(In reply to Gyula Fóra's message of Fri, Aug 26, 2016 at 9:32 AM)