Strange segfaults in Flink 1.2 (probably not RocksDB related)

Strange segfaults in Flink 1.2 (probably not RocksDB related)

Gyula Fóra
Hey All,

I am trying to port some of my streaming jobs from Flink 1.1.4 to 1.2, and I
have noticed some very strange segfaults. (I am running in test
environments, with the minicluster.)
It is a fairly complex job, so I won't go into details, but the interesting
part is that adding or removing a simple filter in the wrong place in the
topology (such as (e -> true), or anything else really) seems to cause
frequent segfaults during execution.

Basically the key part looks something like:

...
DataStream stream = source.map().setParallelism(1).uid("AssignFieldIds")
        .name("AssignFieldIds").startNewChain();
DataStream filtered = input1.filter(t -> true).setParallelism(1);
IterativeStream itStream = filtered.iterate(...);
...

A few notes before the actual error: replacing the filter with a map or
another chained transform also leads to this problem. If the filter is not
chained (or if I remove it), there is no error.

The error I get looks like this:
https://gist.github.com/gyfora/da713b99b1e85a681ff1a4182f001242

I wonder if anyone has seen something like this before, or has some ideas
on how to debug it. The simple workaround is to not chain the filter, but it's
still very strange.
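
For reference, taking the filter out of the chain is a one-line change; a rough
sketch of what I mean (the EventType generic is only a placeholder here, the
real job uses its own types):

// Hypothetical sketch: the same filter, but explicitly excluded from operator
// chaining so it runs as its own task; with this variant I see no segfault.
DataStream<EventType> filtered = input1
        .filter(t -> true)
        .setParallelism(1)
        .disableChaining();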

Regards,
Gyula

Re: Strange segfaults in Flink 1.2 (probably not RocksDB related)

Stephan Ewen
Hey!

I am actually a bit puzzled as to how these segfaults could occur, unless
through a native library or a JVM bug.

Can you check how it behaves when not using RocksDB, or when using a newer JVM
version?
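
Switching the backend for the test run should only take a couple of lines,
roughly like this (just a sketch; the checkpoint URI is a placeholder):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.runtime.state.memory.MemoryStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BackendSwitchSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // RocksDB-backed state (needs the flink-statebackend-rocksdb dependency):
        env.setStateBackend(new RocksDBStateBackend("file:///tmp/flink-checkpoints"));

        // ...or a heap-based backend for comparison:
        // env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));
        // env.setStateBackend(new MemoryStateBackend());

        // build the same topology and execute as before, e.g.
        // env.execute("segfault-repro");
    }
}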

Stephan


Re: Strange segfaults in Flink 1.2 (probably not RocksDB related)

Gyula Fóra
Hi,
I tried it with and without RocksDB, no difference. I also tried the
newest JDK version, and it still fails for that specific topology.

On the other hand, after I made some other changes in the topology (somewhere
else) I can't reproduce it anymore, so it seems to be a very specific
thing. To me it looks like a JVM problem; if it happens again I will let
you know.

Gyula


Re: Strange segfaults in Flink 1.2 (probably not RocksDB related)

Robert Metzger
I think I saw similar segfaults when rebuilding Flink while Flink daemons
were still running from the build-target directory. After the flink-dist jar
had been rebuilt and I tried to stop the daemons, they ran into
these segfaults.


Re: Strange segfaults in Flink 1.2 (probably not RocksDB related)

Stephan Ewen
@Robert, is this about "hot code replace" in the JVM?
