Yarn deployment takes long on some networks

Yarn deployment takes long on some networks

Gyula Fóra
Hi all!

Today we started noticing that deploying our jobs takes over 3 minutes when
deployed from some machines, and a normal few seconds when deployed from the
others.

Looking at the logs, it seems that in the slow case the client can't find the
job ID for a few minutes:

...
2017-11-21 15:23:00,880 DEBUG org.apache.flink.yarn.YarnJobManager - Job with ID 179d67bfab7c4c0b9f00ea772f6e4f0c not found in JobManager
2017-11-21 15:23:04,528 DEBUG org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x25eb8e005b7971b after 0ms
2017-11-21 15:23:04,636 DEBUG org.apache.hadoop.ipc.Client - IPC Client (937277082) connection to splat13.sto.midasplayer.com/172.26.87.155:8030 from splat sending #38
2017-11-21 15:23:04,636 DEBUG org.apache.hadoop.ipc.Client - IPC Client (937277082) connection to splat13.sto.midasplayer.com/172.26.87.155:8030 from splat got value #38
2017-11-21 15:23:04,651 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: allocate took 16ms
2017-11-21 15:23:05,880 DEBUG org.apache.flink.yarn.YarnJobManager - Job with ID 179d67bfab7c4c0b9f00ea772f6e4f0c not found in JobManager
2017-11-21 15:23:06,409 DEBUG akka.remote.RemoteWatcher - Sending Heartbeat to [akka.tcp://[hidden email]:56045]
2017-11-21 15:23:06,413 DEBUG akka.remote.RemoteWatcher - Received heartbeat rsp from [akka.tcp://[hidden email]:56045]
2017-11-21 15:23:07,665 DEBUG akka.serialization.Serialization(akka://flink) - Using serializer[akka.serialization.JavaSerializer] for message [org.apache.flink.runtime.clusterframework.messages.GetClusterStatusResponse]
2017-11-21 15:23:07,824 INFO  org.apache.flink.yarn.YarnJobManager - Submitting job 179d67bfab7c4c0b9f00ea772f6e4f0c (event-bifrost-log).
2017

Interestingly enough, nothing like this shows up when deploying from the other
servers.
We suspect there might be some strange network issue (which doesn't seem to
affect jar upload times) that interferes with Akka in some way.

Any idea how to debug this?
Thank you!

Gyula
Re: Yarn deployment takes long on some networks

Aljoscha Krettek-2
Hi Gyula,

Is there any news on this?

@Nico or @Gary, you have also worked with YARN recently. Do you have an idea of what could be going on?

Best,
Aljoscha
Re: Yarn deployment takes long on some networks

Gyula Fóra
Hi!
Sorry for not following up on this. It turned out that some ports were blocked
by some random firewall change, so there is no issue on Flink's side.

Gyula
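
For anyone who hits a similar symptom, a quick way to test the blocked-port explanation is to probe the relevant ports from both a fast and a slow client machine and compare the results. The following is only a minimal sketch, not something taken from this thread: the function names are hypothetical, and which host and ports to probe depends on your own cluster setup (8030 and 56045 are simply the ports that happen to appear in the log excerpt above).

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds.

    Hypothetical helper for illustration. A False result covers refused
    connections, timeouts (typical for silently dropping firewalls), and
    name resolution failures alike.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_ports(host: str, ports, timeout: float = 3.0) -> dict:
    """Probe each port on `host` and return a {port: reachable} mapping."""
    return {port: port_is_reachable(host, port, timeout) for port in ports}
```

Running something like `check_ports("jobmanager.example.com", [8030, 56045])` (placeholder host name) on each client machine and comparing the output should show whether a port that is open from one network is blocked from another. A plain TCP connect only shows that the port accepts connections; it says nothing about whether the Akka or YARN protocol on top of it works.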

On Fri, Mar 16, 2018, 17:41 Aljoscha Krettek <[hidden email]> wrote:

> Hi Gyula,
>
> Is there any news on this?
>
> @Nico or @Gary you recently also did stuff with YARN, do you maybe have an
> idea of what could be going on?
>
> Best,
> Aljoscha