Hi,
Can you tell me where I can find TaskManager logs. I can’t find them in logs folder? I don’t suppose I should run taskmanager.sh as well. Right? I’m using a OS X Yosemite. I’ll send you my ifconfig. lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384 options=3<RXCSUM,TXCSUM> inet6 ::1 prefixlen 128 inet 127.0.0.1 netmask 0xff000000 inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1 nd6 options=1<PERFORMNUD> gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280 stf0: flags=0<> mtu 1280 en0: flags=8823<UP,BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500 ether 60:03:08:a1:e0:f4 nd6 options=1<PERFORMNUD> media: autoselect (<unknown type>) status: inactive en1: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500 options=60<TSO4,TSO6> ether 72:00:02:32:14:d0 media: autoselect <full-duplex> status: inactive en2: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500 options=60<TSO4,TSO6> ether 72:00:02:32:14:d1 media: autoselect <full-duplex> status: inactive bridge0: flags=8822<BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500 options=63<RXCSUM,TXCSUM,TSO4,TSO6> ether 62:03:08:1a:fa:00 Configuration: id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0 maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200 root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0 ipfilter disabled flags 0x2 member: en1 flags=3<LEARNING,DISCOVER> ifmaxaddr 0 port 5 priority 0 path cost 0 member: en2 flags=3<LEARNING,DISCOVER> ifmaxaddr 0 port 6 priority 0 path cost 0 media: <unknown type> status: inactive p2p0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 2304 ether 02:03:08:a1:e0:f4 media: autoselect status: inactive awdl0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 1452 ether 06:56:3d:f6:60:08 nd6 options=1<PERFORMNUD> media: autoselect status: inactive ppp0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1500 inet 10.218.98.228 --> 10.64.64.64 netmask 0xff000000 utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380 inet6 fe80::b0d4:d4be:7e62:e730%utun0 prefixlen 64 scopeid 0xb inet6 fdd0:b291:7da7:9153:b0d4:d4be:7e62:e730 prefixlen 64 nd6 options=1<PERFORMNUD> > On Feb 26, 2015, at 10:48 PM, Stephan Ewen <[hidden email]> wrote: > > Hi Dulaj! > > Thanks for helping to debug. > > My guess is that you are seeing now the same thing between JobManager and > TaskManager as you saw before between JobManager and JobClient. I have a > patch pending that should help the issue (see > https://issues.apache.org/jira/browse/FLINK-1608), let's see if that solves > it. > > What seems not right is that the JobManager initially accepted the > TaskManager and later the communication. Can you paste the TaskManager log > as well? > > Also: There must be something fairly unique about your network > configuration, as it works on all other setups that we use (locally, cloud, > test servers, YARN, ...). Can you paste your ipconfig / ifconfig by any > chance? > > Greetings, > Stephan > > > > On Thu, Feb 26, 2015 at 4:33 PM, Dulaj Viduranga <[hidden email]> > wrote: > >> Hi, >> It’s great to help out. :) >> >> Setting 127.0.0.1 instead of “localhost” in >> jobmanager.rpc.address, helped to build the connection to the jobmanager. >> Apparently localhost resolving is different in webclient and the >> jobmanager. I think it’s good to set "jobmanager.rpc.address: 127.0.0.1" in >> future builds. >> But then I get this error when I tried to run examples. I don’t >> know if I should move this issue to another thread. If so please tell me. >> >> bin/flink run >> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar >> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt >> $FLINK_DIRECTORY/count >> >> >> 20:46:21,998 WARN org.apache.hadoop.util.NativeCodeLoader >> - Unable to load native-hadoop library for your platform... using >> builtin-java classes where applicable >> 02/26/2015 20:46:23 Job execution switched to status RUNNING. >> 02/26/2015 20:46:23 CHAIN DataSource (at >> getTextDataSet(WordCount.java:141) >> (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at >> main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) >> switched to SCHEDULED >> 02/26/2015 20:46:23 CHAIN DataSource (at >> getTextDataSet(WordCount.java:141) >> (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at >> main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) >> switched to DEPLOYING >> 02/26/2015 20:48:03 CHAIN DataSource (at >> getTextDataSet(WordCount.java:141) >> (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at >> main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) >> switched to FAILED >> akka.pattern.AskTimeoutException: Ask timed out on >> [Actor[akka://flink/user/taskmanager#-1628133761]] after [100000 ms] >> at >> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) >> at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) >> at >> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) >> at >> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) >> at >> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) >> at >> akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) >> at >> akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) >> at >> akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) >> at java.lang.Thread.run(Thread.java:745) >> >> 02/26/2015 20:48:03 Job execution switched to status FAILING. >> 02/26/2015 20:48:03 Reduce (SUM(1), at main(WordCount.java:72)(1/1) >> switched to CANCELED >> 02/26/2015 20:48:03 DataSink(CsvOutputFormat (path: >> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/count, >> delimiter: ))(1/1) switched to CANCELED >> 02/26/2015 20:48:03 Job execution switched to status FAILED. >> org.apache.flink.client.program.ProgramInvocationException: The program >> execution failed. >> at org.apache.flink.client.program.Client.run(Client.java:344) >> at org.apache.flink.client.program.Client.run(Client.java:306) >> at org.apache.flink.client.program.Client.run(Client.java:300) >> at >> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55) >> at >> org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:82) >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >> at >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >> at >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >> at java.lang.reflect.Method.invoke(Method.java:483) >> at >> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437) >> at >> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353) >> at org.apache.flink.client.program.Client.run(Client.java:250) >> at >> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:371) >> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:344) >> at >> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1087) >> at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114) >> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job >> execution failed. >> at >> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:284) >> at >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >> at >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >> at >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >> at >> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37) >> at >> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30) >> at >> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >> at >> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30) >> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >> at >> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:88) >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >> at akka.actor.ActorCell.invoke(ActorCell.scala:487) >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >> at akka.dispatch.Mailbox.run(Mailbox.scala:221) >> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >> at >> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >> at >> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >> at >> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >> at >> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >> Caused by: akka.pattern.AskTimeoutException: Ask timed out on >> [Actor[akka://flink/user/taskmanager#-1628133761]] after [100000 ms] >> at >> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) >> at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) >> at >> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) >> at >> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) >> at >> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) >> at >> akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) >> at >> akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) >> at >> akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) >> at java.lang.Thread.run(Thread.java:745) >> >> The exception above occurred while trying to run your command. >> >> >>> On Feb 26, 2015, at 12:46 AM, Stephan Ewen <[hidden email]> wrote: >>> >>> Addition: To check whether a port is reachable, I think the easiest thing >>> is to try and connect with a telnet client and see if the connection is >>> refused. >>> >>> On Wed, Feb 25, 2015 at 8:15 PM, Stephan Ewen <[hidden email]> wrote: >>> >>>> Okay, the problem seems to be that even though both the client and the >>>> jobmanager use "localhost" as the host name, they resolve this to >> different >>>> IP addresses: In one case 127.0.0.1 in the other case 10.216.177.146 >>>> >>>> Also, the 127.0.0.1 address cannot communicate to 10.216.177.146 >>>> apparently. >>>> >>>> Can you help us debug this by checking the following: >>>> >>>> - Can you try and set "jobmanager.rpc.address" to 127.0.0.1 and see if >>>> that solves it? >>>> - Can you try and set "jobmanager.rpc.address" to the other address >> (10.216.177.146 >>>> or so) and see if that solves it? >>>> - Can you do "start-cluster.sh", rather than "start-local.sh" and see >>>> whether the webfrontend displays that the TaskManager connects? >>>> - As a hard core test: Can you bring up the jobmanager, check where it >>>> connects (10.216.192.98:6123 or so) and see whether the port is >> reachable? >>>> >>>> We have recently updated how the Akka URLs are build, to work around a >>>> limitation in Akka. Seems that did not yet fully solve the issue. >>>> >>>> Thanks for helping us debug this, it is not the easiest immigration >>>> experience, but the outcome is probably extremely valuable for the >> project >>>> :-) >>>> >>>> Greetings, >>>> Stephan >>>> >>>> >>>> On Wed, Feb 25, 2015 at 4:03 PM, Dulaj Viduranga <[hidden email]> >>>> wrote: >>>> >>>>> Hi, >>>>> Sorry for the delay to reply on this issue. >>>>> the jobmanager.rpc.address is set to “localhost” already in conf.yaml. >>>>> This can’t be an issue because the job manager web interface works fine >>>>> which also runs on localhost >>>>> >>>>> bin/flink run <jar> doesn’t seem to work either. Let me send you my >>>>> command and the result in terminal. >>>>> >>>>> bin/flink run >>>>> >> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar >>>>> >> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt >>>>> $FLINK_DIRECTORY/count >>>>> >>>>> 20:32:16,442 WARN org.apache.hadoop.util.NativeCodeLoader >>>>> - Unable to load native-hadoop library for your platform... using >>>>> builtin-java classes where applicable >>>>> org.apache.flink.client.program.ProgramInvocationException: Could not >>>>> build up connection to JobManager. >>>>> at org.apache.flink.client.program.Client.run(Client.java:327) >>>>> at org.apache.flink.client.program.Client.run(Client.java:306) >>>>> at org.apache.flink.client.program.Client.run(Client.java:300) >>>>> at >>>>> >> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55) >>>>> at >>>>> >> org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:82) >>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>>> at >>>>> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >>>>> at >>>>> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>>> at java.lang.reflect.Method.invoke(Method.java:483) >>>>> at >>>>> >> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437) >>>>> at >>>>> >> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353) >>>>> at org.apache.flink.client.program.Client.run(Client.java:250) >>>>> at >>>>> >> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:371) >>>>> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:344) >>>>> at >>>>> >> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1087) >>>>> at >> org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114) >>>>> Caused by: java.io.IOException: JobManager at akka.tcp:// >>>>> flink@10.216.177.146:6123/user/jobmanager not reachable. Please make >>>>> sure that the JobManager is running and its port is reachable. >>>>> at >>>>> >> org.apache.flink.runtime.jobmanager.JobManager$.getJobManagerRemoteReference(JobManager.scala:897) >>>>> at >>>>> >> org.apache.flink.runtime.client.JobClient$.createJobClient(JobClient.scala:151) >>>>> at >>>>> >> org.apache.flink.runtime.client.JobClient$.createJobClientFromConfig(JobClient.scala:142) >>>>> at >>>>> >> org.apache.flink.runtime.client.JobClient$.startActorSystemAndActor(JobClient.scala:125) >>>>> at >>>>> >> org.apache.flink.runtime.client.JobClient.startActorSystemAndActor(JobClient.scala) >>>>> at org.apache.flink.client.program.Client.run(Client.java:322) >>>>> ... 15 more >>>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out >> after >>>>> [10000 milliseconds] >>>>> at >>>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) >>>>> at >>>>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) >>>>> at >>>>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) >>>>> at >>>>> >> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) >>>>> at scala.concurrent.Await$.result(package.scala:107) >>>>> at >>>>> >> org.apache.flink.runtime.jobmanager.JobManager$.getJobManagerRemoteReference(JobManager.scala:893) >>>>> ... 20 more >>>>> >>>>> The exception above occurred while trying to run your command. >>>>> >>>>> >>>>>> On Feb 25, 2015, at 1:29 AM, Stephan Ewen <[hidden email]> wrote: >>>>>> >>>>>> BTW: Does still work if you enter "localhost" for >>>>> "jobmanager.rpc.address" >>>>>> in your flink-conf.yaml ? >>>>>> >>>>>> On Tue, Feb 24, 2015 at 7:50 PM, Stephan Ewen <[hidden email]> >> wrote: >>>>>> >>>>>>> Hi! >>>>>>> >>>>>>> I think that this is a problem in the current master (probably in >> there >>>>>>> since a few days ago). I am fixing it... >>>>>>> >>>>>>> Thanks for reporting it! >>>>>>> >>>>>>> Stephan >>>>>>> >>>>>>> >>>>>>> On Tue, Feb 24, 2015 at 6:52 PM, Stephan Ewen <[hidden email]> >>>>> wrote: >>>>>>> >>>>>>>> Hi Dulaj! >>>>>>>> >>>>>>>> The log suggests that the JobManager binds itself to the IP >>>>>>>> address 10.216.192.98 and the WebClient runs at 127.0.0.1 >>>>>>>> >>>>>>>> The 127.0.0.1 actor system cannot connect to the 10.216.192.98. >>>>>>>> >>>>>>>> Let me verify whether this is a quirk of your particular setup, or a >>>>> bug >>>>>>>> recently introduces in the 0.9-SNAPSHOT. >>>>>>>> >>>>>>>> Does the command line work for you? ("bin/flink run <jar>") >>>>>>>> >>>>>>>> taskmanager.numberOfTaskSlots: -1 is also okay, this will mean that >>>>> the >>>>>>>> default of '1' is used. >>>>>>>> >>>>>>>> Greetings, >>>>>>>> Stephan >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Tue, Feb 24, 2015 at 5:18 PM, Dulaj Viduranga < >>>>> [hidden email]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Is taskmanager.numberOfTaskSlots: -1 normal? >>>>>>>>> >>>>>>>>>> On Feb 24, 2015, at 9:44 PM, Robert Metzger <[hidden email]> >>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Hi, >>>>>>>>>> I could not find the logfiles attached to your mails. I think the >>>>>>>>>> mailinglists are not accepting attachments. >>>>>>>>>> Can you put the logs on gist.github.com? >>>>>>>>>> >>>>>>>>>> The configuration values are documented here: >>>>>>>>>> http://flink.apache.org/docs/0.8/config.html >>>>>>>>>> For the webclient's port its called webclient.port >>>>>>>>>> >>>>>>>>>> On Tue, Feb 24, 2015 at 5:04 PM, Dulaj Viduranga < >>>>> [hidden email] >>>>>>>>>> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> I tried to kill the job manager manually in the terminal and >> start >>>>> it >>>>>>>>>>> again but no luck. Also could you tell me if it’s possible to >>>>> change >>>>>>>>>>> webclient’s port (8080) ? >>>>>>>>>>> >>>>>>>>>>>> On Feb 24, 2015, at 1:41 PM, Stephan Ewen <[hidden email]> >>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hey Dulaj! >>>>>>>>>>>> >>>>>>>>>>>> As a contributor, I would go against the latest version, which >> is >>>>>>>>>>>> 0.9-SNAPSHOT. >>>>>>>>>>>> >>>>>>>>>>>> It may be in your case that the JobManager actor is down, but >> the >>>>>>>>> process >>>>>>>>>>>> still lingers. (BTW: I have a patch pending that makes sure the >>>>>>>>> process >>>>>>>>>>>> disappears when the actor via down). >>>>>>>>>>>> >>>>>>>>>>>> Could you have a look at the log >>>>>>>>> "flink-<user>-jobmanager-<host>-.log" >>>>>>>>>>> and >>>>>>>>>>>> see if there are any errors logged? >>>>>>>>>>>> >>>>>>>>>>>> Greetings, >>>>>>>>>>>> Stephan >>>>>>>>>>>> Am 24.02.2015 06:29 schrieb "Dulaj Viduranga" < >>>>> [hidden email] >>>>>>>>>> : >>>>>>>>>>>> >>>>>>>>>>>>> The JobManager seems to run fine. I don't know. When I tried to >>>>> run >>>>>>>>>>>>> start-local.sh again, It shows the PID of the running >> JobManager >>>>> and >>>>>>>>>>> also >>>>>>>>>>>>> :8081 runs fine. I want to contribute to the project and I >> could >>>>>>>>> get a >>>>>>>>>>>>> little boost if I could see the capabilities of FLINK. :) >>>>>>>>>>>>> Will it be OK to use 0.8.1 as a developer? >>>>>>>>>>>>> >>>>>>>>>>>>> On Feb 24, 2015, at 04:15 AM, Stephan Ewen <[hidden email]> >>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Dulaj, >>>>>>>>>>>>> >>>>>>>>>>>>> That error message indicates that the JobManager is not >> running. >>>>>>>>> Are you >>>>>>>>>>>>> sure that the JobManager runs properly? Anything in the >>>>> JobManager >>>>>>>>> logs? >>>>>>>>>>>>> >>>>>>>>>>>>> BTW: The 0.9 branch is under heavy development / changes. That >> is >>>>>>>>> why it >>>>>>>>>>>>> may behave a bit different on different days right now. I would >>>>>>>>>>> recommend >>>>>>>>>>>>> to use the 0.8.1 release for a stable experience. >>>>>>>>>>>>> >>>>>>>>>>>>> Greetings, >>>>>>>>>>>>> Stephan >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Feb 23, 2015 at 7:39 PM, Robert Metzger < >>>>>>>>> [hidden email]> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Thank you for the quick reply. >>>>>>>>>>>>> >>>>>>>>>>>>> The log you've send is from the webclient. Can you also send >> the >>>>>>>>> log of >>>>>>>>>>> the >>>>>>>>>>>>> >>>>>>>>>>>>> JobManager? >>>>>>>>>>>>> >>>>>>>>>>>>> On Mon, Feb 23, 2015 at 7:28 PM, Dulaj Viduranga < >>>>>>>>> [hidden email]> >>>>>>>>>>>>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Yes. It seams it is not a problem with the arguments. I tried >>>>> two >>>>>>>>> days >>>>>>>>>>>>> >>>>>>>>>>>>> but >>>>>>>>>>>>> >>>>>>>>>>>>>> different error occurs. It seams the web client can’t connect >> to >>>>>>>>> the >>>>>>>>>>> job >>>>>>>>>>>>> >>>>>>>>>>>>>> manager although it is running >>>>>>>>>>>>> >>>>>>>>>>>>>> Right now, I can’t even get the webclient to run. >>>>>>>>>>>>> >>>>>>>>>>>>> ./bin/start-webclient.sh >>>>>>>>>>>>> >>>>>>>>>>>>>> executes fine but I cannot connect to localhost:8080 (even >> with >>>>>>>>> telnet >>>>>>>>>>> or >>>>>>>>>>>>> >>>>>>>>>>>>>> curl) >>>>>>>>>>>>> >>>>>>>>>>>>>> Here is the log for jobManager >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> 23:22:31,933 INFO >> org.apache.flink.client.web.WebInterfaceServer >>>>>>>>>>>>> >>>>>>>>>>>>>> - Setting up web frontend server, using web-root directory >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> 'jar: >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>> >> file:/Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/lib/flink-clients-0.9-SNAPSHOT.jar!/web-docs >>>>>>>>>>>>> '. >>>>>>>>>>>>> >>>>>>>>>>>>>> 23:22:31,934 INFO >> org.apache.flink.client.web.WebInterfaceServer >>>>>>>>>>>>> >>>>>>>>>>>>>> - Web frontend server will store temporary files in >>>>>>>>>>>>> >>>>>>>>>>>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T', uploaded >>>>> jobs >>>>>>>>> in >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T/webclient-jobs', >>>>>>>>>>>>> >>>>>>>>>>>>>> plan-json-dumps in >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T/webclient-plans'. >>>>>>>>>>>>> >>>>>>>>>>>>>> 23:22:31,934 INFO >> org.apache.flink.client.web.WebInterfaceServer >>>>>>>>>>>>> >>>>>>>>>>>>>> - Web-frontend will submit jobs to nephele job-manager on >>>>>>>>>>>>> >>>>>>>>>>>>> localhost, >>>>>>>>>>>>> >>>>>>>>>>>>>> port 6123. >>>>>>>>>>>>> >>>>>>>>>>>>>> 23:22:32,580 INFO akka.event.slf4j.Slf4jLogger >>>>>>>>>>>>> >>>>>>>>>>>>>> - Slf4jLogger started >>>>>>>>>>>>> >>>>>>>>>>>>>> 23:22:32,625 INFO Remoting >>>>>>>>>>>>> >>>>>>>>>>>>>> - Starting remoting >>>>>>>>>>>>> >>>>>>>>>>>>>> 23:22:32,838 INFO Remoting >>>>>>>>>>>>> >>>>>>>>>>>>>> - Remoting started; listening on addresses :[akka.tcp:// >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> JobsInfoServletActorSystem@127.0.0.1:51517] >>>>>>>>>>>>> >>>>>>>>>>>>>> 23:23:48,119 WARN Remoting >>>>>>>>>>>>> >>>>>>>>>>>>>> - Tried to associate with unreachable remote address >>>>> [akka.tcp:// >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> flink@10.218.98.169:6123]. Address is now gated for 5000 ms, >>>>> all >>>>>>>>>>>>> >>>>>>>>>>>>> messages >>>>>>>>>>>>> >>>>>>>>>>>>>> to this address will be delivered to dead letters. Reason: >>>>>>>>> Operation >>>>>>>>>>>>> >>>>>>>>>>>>> timed >>>>>>>>>>>>> >>>>>>>>>>>>>> out: /10.218.98.169:6123 >>>>>>>>>>>>> >>>>>>>>>>>>>> 23:23:48,124 ERROR org.apache.flink.client.WebFrontend >>>>>>>>>>>>> >>>>>>>>>>>>>> - Unexpected exception: Could not find job manager at >> specified >>>>>>>>>>>>> >>>>>>>>>>>>>> address akka.flink@10.218.98.169:6123/user/jobmanager'>tcp:// >>>>>>>>>>>>> flink@10.218.98.169:6123/user/jobmanager. >>>>>>>>>>>>> >>>>>>>>>>>>>> java.lang.RuntimeException: Could not find job manager at >>>>> specified >>>>>>>>>>>>> >>>>>>>>>>>>>> address akka.flink@10.218.98.169:6123/user/jobmanager'>tcp:// >>>>>>>>>>>>> flink@10.218.98.169:6123/user/jobmanager. >>>>>>>>>>>>> >>>>>>>>>>>>>> at >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>> >> org.apache.flink.client.web.JobsInfoServlet.<init>(JobsInfoServlet.java:82) >>>>>>>>>>>>> >>>>>>>>>>>>>> at >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>> >> org.apache.flink.client.web.WebInterfaceServer.<init>(WebInterfaceServer.java:158) >>>>>>>>>>>>> >>>>>>>>>>>>>> at >> org.apache.flink.client.WebFrontend.main(WebFrontend.java:74) >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>> On Feb 23, 2015, at 11:46 PM, Robert Metzger < >>>>> [hidden email] >>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>> >>>>>>>>>>>>>>> you said in the other email thread that the error only occurs >>>>> for >>>>>>>>>>>>> >>>>>>>>>>>>>>> Wordcount, not for Kmeans. >>>>>>>>>>>>> >>>>>>>>>>>>>>> Can you copy me the commands for both examples? >>>>>>>>>>>>> >>>>>>>>>>>>>>> I can not really believe that there is a difference between >> the >>>>>>>>> two >>>>>>>>>>>>> >>>>>>>>>>>>> jobs. >>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>> Can you also send us the contents of the jobmanager log file? >>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>> Best, >>>>>>>>>>>>> >>>>>>>>>>>>>>> Robert >>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Feb 23, 2015 at 6:04 PM, Dulaj Viduranga < >>>>>>>>>>> [hidden email] >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>> I’m getting "Could not build up connection to JobManager.” >>>>> When i >>>>>>>>>>>>> >>>>>>>>>>>>> tried >>>>>>>>>>>>> >>>>>>>>>>>>>> to >>>>>>>>>>>>> >>>>>>>>>>>>>>>> run the wordCount example. Can anyone help? >>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>>>> Dulaj >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>> >>>>> >>>> >> >> |
It depends on how you started Flink. If you started a local cluster, then
the TaskManager log is contained in the JobManager log we just don't see the respective log output in the snippet you posted. If you started a TaskManager independently, either by taskmanager.sh or by start-cluster.sh, then a file with the name format flink-<user>-taskmanager-<hostname>.log should be created in flink/log/. If the Flink directory is not shared by your cluster nodes, then you have to look on the machine on which you started the TaskManager. But since the JobManager binds to 127.0.0.1 I guess that you started a local cluster. Try whether you find some logging statements from the logger org.apache.flink.runtime.taskmanager.TaskManager in your log. Maybe you can upload the corresponding log file to [1] and post a link here. Greets, Till [1] https://gist.github.com/ On Thu, Feb 26, 2015 at 6:45 PM, Dulaj Viduranga <[hidden email]> wrote: > Hi, > Can you tell me where I can find TaskManager logs. I can’t find > them in logs folder? I don’t suppose I should run taskmanager.sh as well. > Right? > I’m using a OS X Yosemite. I’ll send you my ifconfig. > > lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384 > options=3<RXCSUM,TXCSUM> > inet6 ::1 prefixlen 128 > inet 127.0.0.1 netmask 0xff000000 > inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1 > nd6 options=1<PERFORMNUD> > gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280 > stf0: flags=0<> mtu 1280 > en0: flags=8823<UP,BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500 > ether 60:03:08:a1:e0:f4 > nd6 options=1<PERFORMNUD> > media: autoselect (<unknown type>) > status: inactive > en1: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu > 1500 > options=60<TSO4,TSO6> > ether 72:00:02:32:14:d0 > media: autoselect <full-duplex> > status: inactive > en2: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu > 1500 > options=60<TSO4,TSO6> > ether 72:00:02:32:14:d1 > media: autoselect <full-duplex> > status: inactive > bridge0: flags=8822<BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500 > options=63<RXCSUM,TXCSUM,TSO4,TSO6> > ether 62:03:08:1a:fa:00 > Configuration: > id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0 > maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200 > root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0 > ipfilter disabled flags 0x2 > member: en1 flags=3<LEARNING,DISCOVER> > ifmaxaddr 0 port 5 priority 0 path cost 0 > member: en2 flags=3<LEARNING,DISCOVER> > ifmaxaddr 0 port 6 priority 0 path cost 0 > media: <unknown type> > status: inactive > p2p0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 2304 > ether 02:03:08:a1:e0:f4 > media: autoselect > status: inactive > awdl0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 1452 > ether 06:56:3d:f6:60:08 > nd6 options=1<PERFORMNUD> > media: autoselect > status: inactive > ppp0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1500 > inet 10.218.98.228 --> 10.64.64.64 netmask 0xff000000 > utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380 > inet6 fe80::b0d4:d4be:7e62:e730%utun0 prefixlen 64 scopeid 0xb > inet6 fdd0:b291:7da7:9153:b0d4:d4be:7e62:e730 prefixlen 64 > nd6 options=1<PERFORMNUD> > > > > On Feb 26, 2015, at 10:48 PM, Stephan Ewen <[hidden email]> wrote: > > > > Hi Dulaj! > > > > Thanks for helping to debug. > > > > My guess is that you are seeing now the same thing between JobManager and > > TaskManager as you saw before between JobManager and JobClient. I have a > > patch pending that should help the issue (see > > https://issues.apache.org/jira/browse/FLINK-1608), let's see if that > solves > > it. > > > > What seems not right is that the JobManager initially accepted the > > TaskManager and later the communication. Can you paste the TaskManager > log > > as well? > > > > Also: There must be something fairly unique about your network > > configuration, as it works on all other setups that we use (locally, > cloud, > > test servers, YARN, ...). Can you paste your ipconfig / ifconfig by any > > chance? > > > > Greetings, > > Stephan > > > > > > > > On Thu, Feb 26, 2015 at 4:33 PM, Dulaj Viduranga <[hidden email]> > > wrote: > > > >> Hi, > >> It’s great to help out. :) > >> > >> Setting 127.0.0.1 instead of “localhost” in > >> jobmanager.rpc.address, helped to build the connection to the > jobmanager. > >> Apparently localhost resolving is different in webclient and the > >> jobmanager. I think it’s good to set "jobmanager.rpc.address: > 127.0.0.1" in > >> future builds. > >> But then I get this error when I tried to run examples. I don’t > >> know if I should move this issue to another thread. If so please tell > me. > >> > >> bin/flink run > >> > /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar > >> > /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt > >> $FLINK_DIRECTORY/count > >> > >> > >> 20:46:21,998 WARN org.apache.hadoop.util.NativeCodeLoader > >> - Unable to load native-hadoop library for your platform... using > >> builtin-java classes where applicable > >> 02/26/2015 20:46:23 Job execution switched to status RUNNING. > >> 02/26/2015 20:46:23 CHAIN DataSource (at > >> getTextDataSet(WordCount.java:141) > >> (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at > >> main(WordCount.java:69)) -> Combine(SUM(1), at > main(WordCount.java:72)(1/1) > >> switched to SCHEDULED > >> 02/26/2015 20:46:23 CHAIN DataSource (at > >> getTextDataSet(WordCount.java:141) > >> (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at > >> main(WordCount.java:69)) -> Combine(SUM(1), at > main(WordCount.java:72)(1/1) > >> switched to DEPLOYING > >> 02/26/2015 20:48:03 CHAIN DataSource (at > >> getTextDataSet(WordCount.java:141) > >> (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at > >> main(WordCount.java:69)) -> Combine(SUM(1), at > main(WordCount.java:72)(1/1) > >> switched to FAILED > >> akka.pattern.AskTimeoutException: Ask timed out on > >> [Actor[akka://flink/user/taskmanager#-1628133761]] after [100000 ms] > >> at > >> > akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) > >> at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) > >> at > >> > scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > >> at > >> > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) > >> at > >> > akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) > >> at > >> > akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) > >> at > >> > akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) > >> at > >> akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) > >> at java.lang.Thread.run(Thread.java:745) > >> > >> 02/26/2015 20:48:03 Job execution switched to status FAILING. > >> 02/26/2015 20:48:03 Reduce (SUM(1), at main(WordCount.java:72)(1/1) > >> switched to CANCELED > >> 02/26/2015 20:48:03 DataSink(CsvOutputFormat (path: > >> > /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/count, > >> delimiter: ))(1/1) switched to CANCELED > >> 02/26/2015 20:48:03 Job execution switched to status FAILED. > >> org.apache.flink.client.program.ProgramInvocationException: The program > >> execution failed. > >> at org.apache.flink.client.program.Client.run(Client.java:344) > >> at org.apache.flink.client.program.Client.run(Client.java:306) > >> at org.apache.flink.client.program.Client.run(Client.java:300) > >> at > >> > org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55) > >> at > >> > org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:82) > >> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > >> at > >> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > >> at > >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > >> at java.lang.reflect.Method.invoke(Method.java:483) > >> at > >> > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437) > >> at > >> > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353) > >> at org.apache.flink.client.program.Client.run(Client.java:250) > >> at > >> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:371) > >> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:344) > >> at > >> > org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1087) > >> at > org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114) > >> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job > >> execution failed. > >> at > >> > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:284) > >> at > >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > >> at > >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > >> at > >> > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > >> at > >> > org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37) > >> at > >> > org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30) > >> at > >> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > >> at > >> > org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30) > >> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > >> at > >> > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:88) > >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > >> at akka.actor.ActorCell.invoke(ActorCell.scala:487) > >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > >> at akka.dispatch.Mailbox.run(Mailbox.scala:221) > >> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > >> at > >> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > >> at > >> > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > >> at > >> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > >> at > >> > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > >> Caused by: akka.pattern.AskTimeoutException: Ask timed out on > >> [Actor[akka://flink/user/taskmanager#-1628133761]] after [100000 ms] > >> at > >> > akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) > >> at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) > >> at > >> > scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > >> at > >> > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) > >> at > >> > akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) > >> at > >> > akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) > >> at > >> > akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) > >> at > >> akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) > >> at java.lang.Thread.run(Thread.java:745) > >> > >> The exception above occurred while trying to run your command. > >> > >> > >>> On Feb 26, 2015, at 12:46 AM, Stephan Ewen <[hidden email]> wrote: > >>> > >>> Addition: To check whether a port is reachable, I think the easiest > thing > >>> is to try and connect with a telnet client and see if the connection is > >>> refused. > >>> > >>> On Wed, Feb 25, 2015 at 8:15 PM, Stephan Ewen <[hidden email]> > wrote: > >>> > >>>> Okay, the problem seems to be that even though both the client and the > >>>> jobmanager use "localhost" as the host name, they resolve this to > >> different > >>>> IP addresses: In one case 127.0.0.1 in the other case 10.216.177.146 > >>>> > >>>> Also, the 127.0.0.1 address cannot communicate to 10.216.177.146 > >>>> apparently. > >>>> > >>>> Can you help us debug this by checking the following: > >>>> > >>>> - Can you try and set "jobmanager.rpc.address" to 127.0.0.1 and see if > >>>> that solves it? > >>>> - Can you try and set "jobmanager.rpc.address" to the other address > >> (10.216.177.146 > >>>> or so) and see if that solves it? > >>>> - Can you do "start-cluster.sh", rather than "start-local.sh" and see > >>>> whether the webfrontend displays that the TaskManager connects? > >>>> - As a hard core test: Can you bring up the jobmanager, check where it > >>>> connects (10.216.192.98:6123 or so) and see whether the port is > >> reachable? > >>>> > >>>> We have recently updated how the Akka URLs are build, to work around a > >>>> limitation in Akka. Seems that did not yet fully solve the issue. > >>>> > >>>> Thanks for helping us debug this, it is not the easiest immigration > >>>> experience, but the outcome is probably extremely valuable for the > >> project > >>>> :-) > >>>> > >>>> Greetings, > >>>> Stephan > >>>> > >>>> > >>>> On Wed, Feb 25, 2015 at 4:03 PM, Dulaj Viduranga < > [hidden email]> > >>>> wrote: > >>>> > >>>>> Hi, > >>>>> Sorry for the delay to reply on this issue. > >>>>> the jobmanager.rpc.address is set to “localhost” already in > conf.yaml. > >>>>> This can’t be an issue because the job manager web interface works > fine > >>>>> which also runs on localhost > >>>>> > >>>>> bin/flink run <jar> doesn’t seem to work either. Let me send you my > >>>>> command and the result in terminal. > >>>>> > >>>>> bin/flink run > >>>>> > >> > /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar > >>>>> > >> > /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt > >>>>> $FLINK_DIRECTORY/count > >>>>> > >>>>> 20:32:16,442 WARN org.apache.hadoop.util.NativeCodeLoader > >>>>> - Unable to load native-hadoop library for your platform... > using > >>>>> builtin-java classes where applicable > >>>>> org.apache.flink.client.program.ProgramInvocationException: Could not > >>>>> build up connection to JobManager. > >>>>> at org.apache.flink.client.program.Client.run(Client.java:327) > >>>>> at org.apache.flink.client.program.Client.run(Client.java:306) > >>>>> at org.apache.flink.client.program.Client.run(Client.java:300) > >>>>> at > >>>>> > >> > org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55) > >>>>> at > >>>>> > >> > org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:82) > >>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > >>>>> at > >>>>> > >> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > >>>>> at > >>>>> > >> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > >>>>> at java.lang.reflect.Method.invoke(Method.java:483) > >>>>> at > >>>>> > >> > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437) > >>>>> at > >>>>> > >> > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353) > >>>>> at org.apache.flink.client.program.Client.run(Client.java:250) > >>>>> at > >>>>> > >> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:371) > >>>>> at > org.apache.flink.client.CliFrontend.run(CliFrontend.java:344) > >>>>> at > >>>>> > >> > org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1087) > >>>>> at > >> org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114) > >>>>> Caused by: java.io.IOException: JobManager at akka.tcp:// > >>>>> flink@10.216.177.146:6123/user/jobmanager not reachable. Please make > >>>>> sure that the JobManager is running and its port is reachable. > >>>>> at > >>>>> > >> > org.apache.flink.runtime.jobmanager.JobManager$.getJobManagerRemoteReference(JobManager.scala:897) > >>>>> at > >>>>> > >> > org.apache.flink.runtime.client.JobClient$.createJobClient(JobClient.scala:151) > >>>>> at > >>>>> > >> > org.apache.flink.runtime.client.JobClient$.createJobClientFromConfig(JobClient.scala:142) > >>>>> at > >>>>> > >> > org.apache.flink.runtime.client.JobClient$.startActorSystemAndActor(JobClient.scala:125) > >>>>> at > >>>>> > >> > org.apache.flink.runtime.client.JobClient.startActorSystemAndActor(JobClient.scala) > >>>>> at org.apache.flink.client.program.Client.run(Client.java:322) > >>>>> ... 15 more > >>>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out > >> after > >>>>> [10000 milliseconds] > >>>>> at > >>>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > >>>>> at > >>>>> > scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > >>>>> at > >>>>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > >>>>> at > >>>>> > >> > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > >>>>> at scala.concurrent.Await$.result(package.scala:107) > >>>>> at > >>>>> > >> > org.apache.flink.runtime.jobmanager.JobManager$.getJobManagerRemoteReference(JobManager.scala:893) > >>>>> ... 20 more > >>>>> > >>>>> The exception above occurred while trying to run your command. > >>>>> > >>>>> > >>>>>> On Feb 25, 2015, at 1:29 AM, Stephan Ewen <[hidden email]> wrote: > >>>>>> > >>>>>> BTW: Does still work if you enter "localhost" for > >>>>> "jobmanager.rpc.address" > >>>>>> in your flink-conf.yaml ? > >>>>>> > >>>>>> On Tue, Feb 24, 2015 at 7:50 PM, Stephan Ewen <[hidden email]> > >> wrote: > >>>>>> > >>>>>>> Hi! > >>>>>>> > >>>>>>> I think that this is a problem in the current master (probably in > >> there > >>>>>>> since a few days ago). I am fixing it... > >>>>>>> > >>>>>>> Thanks for reporting it! > >>>>>>> > >>>>>>> Stephan > >>>>>>> > >>>>>>> > >>>>>>> On Tue, Feb 24, 2015 at 6:52 PM, Stephan Ewen <[hidden email]> > >>>>> wrote: > >>>>>>> > >>>>>>>> Hi Dulaj! > >>>>>>>> > >>>>>>>> The log suggests that the JobManager binds itself to the IP > >>>>>>>> address 10.216.192.98 and the WebClient runs at 127.0.0.1 > >>>>>>>> > >>>>>>>> The 127.0.0.1 actor system cannot connect to the 10.216.192.98. > >>>>>>>> > >>>>>>>> Let me verify whether this is a quirk of your particular setup, > or a > >>>>> bug > >>>>>>>> recently introduces in the 0.9-SNAPSHOT. > >>>>>>>> > >>>>>>>> Does the command line work for you? ("bin/flink run <jar>") > >>>>>>>> > >>>>>>>> taskmanager.numberOfTaskSlots: -1 is also okay, this will mean > that > >>>>> the > >>>>>>>> default of '1' is used. > >>>>>>>> > >>>>>>>> Greetings, > >>>>>>>> Stephan > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>>> On Tue, Feb 24, 2015 at 5:18 PM, Dulaj Viduranga < > >>>>> [hidden email]> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> Is taskmanager.numberOfTaskSlots: -1 normal? > >>>>>>>>> > >>>>>>>>>> On Feb 24, 2015, at 9:44 PM, Robert Metzger < > [hidden email]> > >>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>> Hi, > >>>>>>>>>> I could not find the logfiles attached to your mails. I think > the > >>>>>>>>>> mailinglists are not accepting attachments. > >>>>>>>>>> Can you put the logs on gist.github.com? > >>>>>>>>>> > >>>>>>>>>> The configuration values are documented here: > >>>>>>>>>> http://flink.apache.org/docs/0.8/config.html > >>>>>>>>>> For the webclient's port its called webclient.port > >>>>>>>>>> > >>>>>>>>>> On Tue, Feb 24, 2015 at 5:04 PM, Dulaj Viduranga < > >>>>> [hidden email] > >>>>>>>>>> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> I tried to kill the job manager manually in the terminal and > >> start > >>>>> it > >>>>>>>>>>> again but no luck. Also could you tell me if it’s possible to > >>>>> change > >>>>>>>>>>> webclient’s port (8080) ? > >>>>>>>>>>> > >>>>>>>>>>>> On Feb 24, 2015, at 1:41 PM, Stephan Ewen <[hidden email]> > >>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> Hey Dulaj! > >>>>>>>>>>>> > >>>>>>>>>>>> As a contributor, I would go against the latest version, which > >> is > >>>>>>>>>>>> 0.9-SNAPSHOT. > >>>>>>>>>>>> > >>>>>>>>>>>> It may be in your case that the JobManager actor is down, but > >> the > >>>>>>>>> process > >>>>>>>>>>>> still lingers. (BTW: I have a patch pending that makes sure > the > >>>>>>>>> process > >>>>>>>>>>>> disappears when the actor via down). > >>>>>>>>>>>> > >>>>>>>>>>>> Could you have a look at the log > >>>>>>>>> "flink-<user>-jobmanager-<host>-.log" > >>>>>>>>>>> and > >>>>>>>>>>>> see if there are any errors logged? > >>>>>>>>>>>> > >>>>>>>>>>>> Greetings, > >>>>>>>>>>>> Stephan > >>>>>>>>>>>> Am 24.02.2015 06:29 schrieb "Dulaj Viduranga" < > >>>>> [hidden email] > >>>>>>>>>> : > >>>>>>>>>>>> > >>>>>>>>>>>>> The JobManager seems to run fine. I don't know. When I tried > to > >>>>> run > >>>>>>>>>>>>> start-local.sh again, It shows the PID of the running > >> JobManager > >>>>> and > >>>>>>>>>>> also > >>>>>>>>>>>>> :8081 runs fine. I want to contribute to the project and I > >> could > >>>>>>>>> get a > >>>>>>>>>>>>> little boost if I could see the capabilities of FLINK. :) > >>>>>>>>>>>>> Will it be OK to use 0.8.1 as a developer? > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Feb 24, 2015, at 04:15 AM, Stephan Ewen <[hidden email] > > > >>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> Hi Dulaj, > >>>>>>>>>>>>> > >>>>>>>>>>>>> That error message indicates that the JobManager is not > >> running. > >>>>>>>>> Are you > >>>>>>>>>>>>> sure that the JobManager runs properly? Anything in the > >>>>> JobManager > >>>>>>>>> logs? > >>>>>>>>>>>>> > >>>>>>>>>>>>> BTW: The 0.9 branch is under heavy development / changes. > That > >> is > >>>>>>>>> why it > >>>>>>>>>>>>> may behave a bit different on different days right now. I > would > >>>>>>>>>>> recommend > >>>>>>>>>>>>> to use the 0.8.1 release for a stable experience. > >>>>>>>>>>>>> > >>>>>>>>>>>>> Greetings, > >>>>>>>>>>>>> Stephan > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Mon, Feb 23, 2015 at 7:39 PM, Robert Metzger < > >>>>>>>>> [hidden email]> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>> Thank you for the quick reply. > >>>>>>>>>>>>> > >>>>>>>>>>>>> The log you've send is from the webclient. Can you also send > >> the > >>>>>>>>> log of > >>>>>>>>>>> the > >>>>>>>>>>>>> > >>>>>>>>>>>>> JobManager? > >>>>>>>>>>>>> > >>>>>>>>>>>>> On Mon, Feb 23, 2015 at 7:28 PM, Dulaj Viduranga < > >>>>>>>>> [hidden email]> > >>>>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Yes. It seams it is not a problem with the arguments. I > tried > >>>>> two > >>>>>>>>> days > >>>>>>>>>>>>> > >>>>>>>>>>>>> but > >>>>>>>>>>>>> > >>>>>>>>>>>>>> different error occurs. It seams the web client can’t > connect > >> to > >>>>>>>>> the > >>>>>>>>>>> job > >>>>>>>>>>>>> > >>>>>>>>>>>>>> manager although it is running > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Right now, I can’t even get the webclient to run. > >>>>>>>>>>>>> > >>>>>>>>>>>>> ./bin/start-webclient.sh > >>>>>>>>>>>>> > >>>>>>>>>>>>>> executes fine but I cannot connect to localhost:8080 (even > >> with > >>>>>>>>> telnet > >>>>>>>>>>> or > >>>>>>>>>>>>> > >>>>>>>>>>>>>> curl) > >>>>>>>>>>>>> > >>>>>>>>>>>>>> Here is the log for jobManager > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>> 23:22:31,933 INFO > >> org.apache.flink.client.web.WebInterfaceServer > >>>>>>>>>>>>> > >>>>>>>>>>>>>> - Setting up web frontend server, using web-root directory > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> 'jar: > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>> > >> > file:/Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/lib/flink-clients-0.9-SNAPSHOT.jar!/web-docs > >>>>>>>>>>>>> '. > >>>>>>>>>>>>> > >>>>>>>>>>>>>> 23:22:31,934 INFO > >> org.apache.flink.client.web.WebInterfaceServer > >>>>>>>>>>>>> > >>>>>>>>>>>>>> - Web frontend server will store temporary files in > >>>>>>>>>>>>> > >>>>>>>>>>>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T', uploaded > >>>>> jobs > >>>>>>>>> in > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T/webclient-jobs', > >>>>>>>>>>>>> > >>>>>>>>>>>>>> plan-json-dumps in > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T/webclient-plans'. > >>>>>>>>>>>>> > >>>>>>>>>>>>>> 23:22:31,934 INFO > >> org.apache.flink.client.web.WebInterfaceServer > >>>>>>>>>>>>> > >>>>>>>>>>>>>> - Web-frontend will submit jobs to nephele job-manager on > >>>>>>>>>>>>> > >>>>>>>>>>>>> localhost, > >>>>>>>>>>>>> > >>>>>>>>>>>>>> port 6123. > >>>>>>>>>>>>> > >>>>>>>>>>>>>> 23:22:32,580 INFO akka.event.slf4j.Slf4jLogger > >>>>>>>>>>>>> > >>>>>>>>>>>>>> - Slf4jLogger started > >>>>>>>>>>>>> > >>>>>>>>>>>>>> 23:22:32,625 INFO Remoting > >>>>>>>>>>>>> > >>>>>>>>>>>>>> - Starting remoting > >>>>>>>>>>>>> > >>>>>>>>>>>>>> 23:22:32,838 INFO Remoting > >>>>>>>>>>>>> > >>>>>>>>>>>>>> - Remoting started; listening on addresses :[akka.tcp:// > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>> JobsInfoServletActorSystem@127.0.0.1:51517] > >>>>>>>>>>>>> > >>>>>>>>>>>>>> 23:23:48,119 WARN Remoting > >>>>>>>>>>>>> > >>>>>>>>>>>>>> - Tried to associate with unreachable remote address > >>>>> [akka.tcp:// > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>> flink@10.218.98.169:6123]. Address is now gated for 5000 > ms, > >>>>> all > >>>>>>>>>>>>> > >>>>>>>>>>>>> messages > >>>>>>>>>>>>> > >>>>>>>>>>>>>> to this address will be delivered to dead letters. Reason: > >>>>>>>>> Operation > >>>>>>>>>>>>> > >>>>>>>>>>>>> timed > >>>>>>>>>>>>> > >>>>>>>>>>>>>> out: /10.218.98.169:6123 > >>>>>>>>>>>>> > >>>>>>>>>>>>>> 23:23:48,124 ERROR org.apache.flink.client.WebFrontend > >>>>>>>>>>>>> > >>>>>>>>>>>>>> - Unexpected exception: Could not find job manager at > >> specified > >>>>>>>>>>>>> > >>>>>>>>>>>>>> address akka.flink@10.218.98.169:6123/user/jobmanager > '>tcp:// > >>>>>>>>>>>>> flink@10.218.98.169:6123/user/jobmanager. > >>>>>>>>>>>>> > >>>>>>>>>>>>>> java.lang.RuntimeException: Could not find job manager at > >>>>> specified > >>>>>>>>>>>>> > >>>>>>>>>>>>>> address akka.flink@10.218.98.169:6123/user/jobmanager > '>tcp:// > >>>>>>>>>>>>> flink@10.218.98.169:6123/user/jobmanager. > >>>>>>>>>>>>> > >>>>>>>>>>>>>> at > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>> > >> > org.apache.flink.client.web.JobsInfoServlet.<init>(JobsInfoServlet.java:82) > >>>>>>>>>>>>> > >>>>>>>>>>>>>> at > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>> > >> > org.apache.flink.client.web.WebInterfaceServer.<init>(WebInterfaceServer.java:158) > >>>>>>>>>>>>> > >>>>>>>>>>>>>> at > >> org.apache.flink.client.WebFrontend.main(WebFrontend.java:74) > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Feb 23, 2015, at 11:46 PM, Robert Metzger < > >>>>> [hidden email] > >>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> Hi, > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> you said in the other email thread that the error only > occurs > >>>>> for > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> Wordcount, not for Kmeans. > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> Can you copy me the commands for both examples? > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> I can not really believe that there is a difference between > >> the > >>>>>>>>> two > >>>>>>>>>>>>> > >>>>>>>>>>>>> jobs. > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> Can you also send us the contents of the jobmanager log > file? > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> Best, > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> Robert > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> On Mon, Feb 23, 2015 at 6:04 PM, Dulaj Viduranga < > >>>>>>>>>>> [hidden email] > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> wrote: > >>>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>> I’m getting "Could not build up connection to JobManager.” > >>>>> When i > >>>>>>>>>>>>> > >>>>>>>>>>>>> tried > >>>>>>>>>>>>> > >>>>>>>>>>>>>> to > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>> run the wordCount example. Can anyone help? > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>>>> Dulaj > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>>>> > >>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>> > >>>>> > >>>>> > >>>> > >> > >> > > |
In reply to this post by Dulaj Viduranga
Hi,
I’m thinking I’m doing something wrong. After setting jobManager address to 127.0.0.1, I can run kmeans example (java -cp ../examples/flink-java-examples-0.9-SNAPSHOT-KMeans.jar org.apache.flink.examples.java.clustering.util.KMeansDataGenerator 500 10 0.08) But I can’t run word count example (bin/flink run ./examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar file:'///Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt' file:'///Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/count.txt’) I’m not sure whether I’m running it wrong > On Feb 26, 2015, at 9:03 PM, Dulaj Viduranga <[hidden email]> wrote: > > Hi, > It’s great to help out. :) > > Setting 127.0.0.1 instead of “localhost” in jobmanager.rpc.address, helped to build the connection to the jobmanager. Apparently localhost resolving is different in webclient and the jobmanager. I think it’s good to set "jobmanager.rpc.address: 127.0.0.1" in future builds. > But then I get this error when I tried to run examples. I don’t know if I should move this issue to another thread. If so please tell me. > > bin/flink run /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt $FLINK_DIRECTORY/count > > > 20:46:21,998 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable > 02/26/2015 20:46:23 Job execution switched to status RUNNING. > 02/26/2015 20:46:23 CHAIN DataSource (at getTextDataSet(WordCount.java:141) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to SCHEDULED > 02/26/2015 20:46:23 CHAIN DataSource (at getTextDataSet(WordCount.java:141) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to DEPLOYING > 02/26/2015 20:48:03 CHAIN DataSource (at getTextDataSet(WordCount.java:141) (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) switched to FAILED > akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/taskmanager#-1628133761]] after [100000 ms] > at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) > at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) > at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) > at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) > at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) > at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) > at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) > at java.lang.Thread.run(Thread.java:745) > > 02/26/2015 20:48:03 Job execution switched to status FAILING. > 02/26/2015 20:48:03 Reduce (SUM(1), at main(WordCount.java:72)(1/1) switched to CANCELED > 02/26/2015 20:48:03 DataSink(CsvOutputFormat (path: /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/count, delimiter: ))(1/1) switched to CANCELED > 02/26/2015 20:48:03 Job execution switched to status FAILED. > org.apache.flink.client.program.ProgramInvocationException: The program execution failed. > at org.apache.flink.client.program.Client.run(Client.java:344) > at org.apache.flink.client.program.Client.run(Client.java:306) > at org.apache.flink.client.program.Client.run(Client.java:300) > at org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55) > at org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:82) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:483) > at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437) > at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353) > at org.apache.flink.client.program.Client.run(Client.java:250) > at org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:371) > at org.apache.flink.client.CliFrontend.run(CliFrontend.java:344) > at org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1087) > at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114) > Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed. > at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:284) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37) > at org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30) > at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > at org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30) > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > at org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:88) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > at akka.dispatch.Mailbox.run(Mailbox.scala:221) > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > Caused by: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://flink/user/taskmanager#-1628133761]] after [100000 ms] > at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) > at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) > at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) > at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) > at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) > at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) > at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) > at java.lang.Thread.run(Thread.java:745) > > The exception above occurred while trying to run your command. > > >> On Feb 26, 2015, at 12:46 AM, Stephan Ewen <[hidden email]> wrote: >> >> Addition: To check whether a port is reachable, I think the easiest thing >> is to try and connect with a telnet client and see if the connection is >> refused. >> >> On Wed, Feb 25, 2015 at 8:15 PM, Stephan Ewen <[hidden email]> wrote: >> >>> Okay, the problem seems to be that even though both the client and the >>> jobmanager use "localhost" as the host name, they resolve this to different >>> IP addresses: In one case 127.0.0.1 in the other case 10.216.177.146 >>> >>> Also, the 127.0.0.1 address cannot communicate to 10.216.177.146 >>> apparently. >>> >>> Can you help us debug this by checking the following: >>> >>> - Can you try and set "jobmanager.rpc.address" to 127.0.0.1 and see if >>> that solves it? >>> - Can you try and set "jobmanager.rpc.address" to the other address (10.216.177.146 >>> or so) and see if that solves it? >>> - Can you do "start-cluster.sh", rather than "start-local.sh" and see >>> whether the webfrontend displays that the TaskManager connects? >>> - As a hard core test: Can you bring up the jobmanager, check where it >>> connects (10.216.192.98:6123 or so) and see whether the port is reachable? >>> >>> We have recently updated how the Akka URLs are build, to work around a >>> limitation in Akka. Seems that did not yet fully solve the issue. >>> >>> Thanks for helping us debug this, it is not the easiest immigration >>> experience, but the outcome is probably extremely valuable for the project >>> :-) >>> >>> Greetings, >>> Stephan >>> >>> >>> On Wed, Feb 25, 2015 at 4:03 PM, Dulaj Viduranga <[hidden email]> >>> wrote: >>> >>>> Hi, >>>> Sorry for the delay to reply on this issue. >>>> the jobmanager.rpc.address is set to “localhost” already in conf.yaml. >>>> This can’t be an issue because the job manager web interface works fine >>>> which also runs on localhost >>>> >>>> bin/flink run <jar> doesn’t seem to work either. Let me send you my >>>> command and the result in terminal. >>>> >>>> bin/flink run >>>> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar >>>> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt >>>> $FLINK_DIRECTORY/count >>>> >>>> 20:32:16,442 WARN org.apache.hadoop.util.NativeCodeLoader >>>> - Unable to load native-hadoop library for your platform... using >>>> builtin-java classes where applicable >>>> org.apache.flink.client.program.ProgramInvocationException: Could not >>>> build up connection to JobManager. >>>> at org.apache.flink.client.program.Client.run(Client.java:327) >>>> at org.apache.flink.client.program.Client.run(Client.java:306) >>>> at org.apache.flink.client.program.Client.run(Client.java:300) >>>> at >>>> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55) >>>> at >>>> org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:82) >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>> at >>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >>>> at >>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>> at java.lang.reflect.Method.invoke(Method.java:483) >>>> at >>>> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437) >>>> at >>>> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353) >>>> at org.apache.flink.client.program.Client.run(Client.java:250) >>>> at >>>> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:371) >>>> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:344) >>>> at >>>> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1087) >>>> at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114) >>>> Caused by: java.io.IOException: JobManager at akka.tcp:// >>>> flink@10.216.177.146:6123/user/jobmanager not reachable. Please make >>>> sure that the JobManager is running and its port is reachable. >>>> at >>>> org.apache.flink.runtime.jobmanager.JobManager$.getJobManagerRemoteReference(JobManager.scala:897) >>>> at >>>> org.apache.flink.runtime.client.JobClient$.createJobClient(JobClient.scala:151) >>>> at >>>> org.apache.flink.runtime.client.JobClient$.createJobClientFromConfig(JobClient.scala:142) >>>> at >>>> org.apache.flink.runtime.client.JobClient$.startActorSystemAndActor(JobClient.scala:125) >>>> at >>>> org.apache.flink.runtime.client.JobClient.startActorSystemAndActor(JobClient.scala) >>>> at org.apache.flink.client.program.Client.run(Client.java:322) >>>> ... 15 more >>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out after >>>> [10000 milliseconds] >>>> at >>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) >>>> at >>>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) >>>> at >>>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) >>>> at >>>> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) >>>> at scala.concurrent.Await$.result(package.scala:107) >>>> at >>>> org.apache.flink.runtime.jobmanager.JobManager$.getJobManagerRemoteReference(JobManager.scala:893) >>>> ... 20 more >>>> >>>> The exception above occurred while trying to run your command. >>>> >>>> >>>>> On Feb 25, 2015, at 1:29 AM, Stephan Ewen <[hidden email]> wrote: >>>>> >>>>> BTW: Does still work if you enter "localhost" for >>>> "jobmanager.rpc.address" >>>>> in your flink-conf.yaml ? >>>>> >>>>> On Tue, Feb 24, 2015 at 7:50 PM, Stephan Ewen <[hidden email]> wrote: >>>>> >>>>>> Hi! >>>>>> >>>>>> I think that this is a problem in the current master (probably in there >>>>>> since a few days ago). I am fixing it... >>>>>> >>>>>> Thanks for reporting it! >>>>>> >>>>>> Stephan >>>>>> >>>>>> >>>>>> On Tue, Feb 24, 2015 at 6:52 PM, Stephan Ewen <[hidden email]> >>>> wrote: >>>>>> >>>>>>> Hi Dulaj! >>>>>>> >>>>>>> The log suggests that the JobManager binds itself to the IP >>>>>>> address 10.216.192.98 and the WebClient runs at 127.0.0.1 >>>>>>> >>>>>>> The 127.0.0.1 actor system cannot connect to the 10.216.192.98. >>>>>>> >>>>>>> Let me verify whether this is a quirk of your particular setup, or a >>>> bug >>>>>>> recently introduces in the 0.9-SNAPSHOT. >>>>>>> >>>>>>> Does the command line work for you? ("bin/flink run <jar>") >>>>>>> >>>>>>> taskmanager.numberOfTaskSlots: -1 is also okay, this will mean that >>>> the >>>>>>> default of '1' is used. >>>>>>> >>>>>>> Greetings, >>>>>>> Stephan >>>>>>> >>>>>>> >>>>>>> >>>>>>> On Tue, Feb 24, 2015 at 5:18 PM, Dulaj Viduranga < >>>> [hidden email]> >>>>>>> wrote: >>>>>>> >>>>>>>> Is taskmanager.numberOfTaskSlots: -1 normal? >>>>>>>> >>>>>>>>> On Feb 24, 2015, at 9:44 PM, Robert Metzger <[hidden email]> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> I could not find the logfiles attached to your mails. I think the >>>>>>>>> mailinglists are not accepting attachments. >>>>>>>>> Can you put the logs on gist.github.com? >>>>>>>>> >>>>>>>>> The configuration values are documented here: >>>>>>>>> http://flink.apache.org/docs/0.8/config.html >>>>>>>>> For the webclient's port its called webclient.port >>>>>>>>> >>>>>>>>> On Tue, Feb 24, 2015 at 5:04 PM, Dulaj Viduranga < >>>> [hidden email] >>>>>>>>> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> I tried to kill the job manager manually in the terminal and start >>>> it >>>>>>>>>> again but no luck. Also could you tell me if it’s possible to >>>> change >>>>>>>>>> webclient’s port (8080) ? >>>>>>>>>> >>>>>>>>>>> On Feb 24, 2015, at 1:41 PM, Stephan Ewen <[hidden email]> >>>> wrote: >>>>>>>>>>> >>>>>>>>>>> Hey Dulaj! >>>>>>>>>>> >>>>>>>>>>> As a contributor, I would go against the latest version, which is >>>>>>>>>>> 0.9-SNAPSHOT. >>>>>>>>>>> >>>>>>>>>>> It may be in your case that the JobManager actor is down, but the >>>>>>>> process >>>>>>>>>>> still lingers. (BTW: I have a patch pending that makes sure the >>>>>>>> process >>>>>>>>>>> disappears when the actor via down). >>>>>>>>>>> >>>>>>>>>>> Could you have a look at the log >>>>>>>> "flink-<user>-jobmanager-<host>-.log" >>>>>>>>>> and >>>>>>>>>>> see if there are any errors logged? >>>>>>>>>>> >>>>>>>>>>> Greetings, >>>>>>>>>>> Stephan >>>>>>>>>>> Am 24.02.2015 06:29 schrieb "Dulaj Viduranga" < >>>> [hidden email] >>>>>>>>> : >>>>>>>>>>> >>>>>>>>>>>> The JobManager seems to run fine. I don't know. When I tried to >>>> run >>>>>>>>>>>> start-local.sh again, It shows the PID of the running JobManager >>>> and >>>>>>>>>> also >>>>>>>>>>>> :8081 runs fine. I want to contribute to the project and I could >>>>>>>> get a >>>>>>>>>>>> little boost if I could see the capabilities of FLINK. :) >>>>>>>>>>>> Will it be OK to use 0.8.1 as a developer? >>>>>>>>>>>> >>>>>>>>>>>> On Feb 24, 2015, at 04:15 AM, Stephan Ewen <[hidden email]> >>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Dulaj, >>>>>>>>>>>> >>>>>>>>>>>> That error message indicates that the JobManager is not running. >>>>>>>> Are you >>>>>>>>>>>> sure that the JobManager runs properly? Anything in the >>>> JobManager >>>>>>>> logs? >>>>>>>>>>>> >>>>>>>>>>>> BTW: The 0.9 branch is under heavy development / changes. That is >>>>>>>> why it >>>>>>>>>>>> may behave a bit different on different days right now. I would >>>>>>>>>> recommend >>>>>>>>>>>> to use the 0.8.1 release for a stable experience. >>>>>>>>>>>> >>>>>>>>>>>> Greetings, >>>>>>>>>>>> Stephan >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Feb 23, 2015 at 7:39 PM, Robert Metzger < >>>>>>>> [hidden email]> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Thank you for the quick reply. >>>>>>>>>>>> >>>>>>>>>>>> The log you've send is from the webclient. Can you also send the >>>>>>>> log of >>>>>>>>>> the >>>>>>>>>>>> >>>>>>>>>>>> JobManager? >>>>>>>>>>>> >>>>>>>>>>>> On Mon, Feb 23, 2015 at 7:28 PM, Dulaj Viduranga < >>>>>>>> [hidden email]> >>>>>>>>>>>> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Yes. It seams it is not a problem with the arguments. I tried >>>> two >>>>>>>> days >>>>>>>>>>>> >>>>>>>>>>>> but >>>>>>>>>>>> >>>>>>>>>>>>> different error occurs. It seams the web client can’t connect to >>>>>>>> the >>>>>>>>>> job >>>>>>>>>>>> >>>>>>>>>>>>> manager although it is running >>>>>>>>>>>> >>>>>>>>>>>>> Right now, I can’t even get the webclient to run. >>>>>>>>>>>> >>>>>>>>>>>> ./bin/start-webclient.sh >>>>>>>>>>>> >>>>>>>>>>>>> executes fine but I cannot connect to localhost:8080 (even with >>>>>>>> telnet >>>>>>>>>> or >>>>>>>>>>>> >>>>>>>>>>>>> curl) >>>>>>>>>>>> >>>>>>>>>>>>> Here is the log for jobManager >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> 23:22:31,933 INFO org.apache.flink.client.web.WebInterfaceServer >>>>>>>>>>>> >>>>>>>>>>>>> - Setting up web frontend server, using web-root directory >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> 'jar: >>>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>> file:/Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/lib/flink-clients-0.9-SNAPSHOT.jar!/web-docs >>>>>>>>>>>> '. >>>>>>>>>>>> >>>>>>>>>>>>> 23:22:31,934 INFO org.apache.flink.client.web.WebInterfaceServer >>>>>>>>>>>> >>>>>>>>>>>>> - Web frontend server will store temporary files in >>>>>>>>>>>> >>>>>>>>>>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T', uploaded >>>> jobs >>>>>>>> in >>>>>>>>>>>> >>>>>>>>>>>>> >>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T/webclient-jobs', >>>>>>>>>>>> >>>>>>>>>>>>> plan-json-dumps in >>>>>>>>>>>> >>>>>>>>>>>>> >>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T/webclient-plans'. >>>>>>>>>>>> >>>>>>>>>>>>> 23:22:31,934 INFO org.apache.flink.client.web.WebInterfaceServer >>>>>>>>>>>> >>>>>>>>>>>>> - Web-frontend will submit jobs to nephele job-manager on >>>>>>>>>>>> >>>>>>>>>>>> localhost, >>>>>>>>>>>> >>>>>>>>>>>>> port 6123. >>>>>>>>>>>> >>>>>>>>>>>>> 23:22:32,580 INFO akka.event.slf4j.Slf4jLogger >>>>>>>>>>>> >>>>>>>>>>>>> - Slf4jLogger started >>>>>>>>>>>> >>>>>>>>>>>>> 23:22:32,625 INFO Remoting >>>>>>>>>>>> >>>>>>>>>>>>> - Starting remoting >>>>>>>>>>>> >>>>>>>>>>>>> 23:22:32,838 INFO Remoting >>>>>>>>>>>> >>>>>>>>>>>>> - Remoting started; listening on addresses :[akka.tcp:// >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> JobsInfoServletActorSystem@127.0.0.1:51517] >>>>>>>>>>>> >>>>>>>>>>>>> 23:23:48,119 WARN Remoting >>>>>>>>>>>> >>>>>>>>>>>>> - Tried to associate with unreachable remote address >>>> [akka.tcp:// >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> flink@10.218.98.169:6123]. Address is now gated for 5000 ms, >>>> all >>>>>>>>>>>> >>>>>>>>>>>> messages >>>>>>>>>>>> >>>>>>>>>>>>> to this address will be delivered to dead letters. Reason: >>>>>>>> Operation >>>>>>>>>>>> >>>>>>>>>>>> timed >>>>>>>>>>>> >>>>>>>>>>>>> out: /10.218.98.169:6123 >>>>>>>>>>>> >>>>>>>>>>>>> 23:23:48,124 ERROR org.apache.flink.client.WebFrontend >>>>>>>>>>>> >>>>>>>>>>>>> - Unexpected exception: Could not find job manager at specified >>>>>>>>>>>> >>>>>>>>>>>>> address akka.flink@10.218.98.169:6123/user/jobmanager'>tcp:// >>>>>>>>>>>> flink@10.218.98.169:6123/user/jobmanager. >>>>>>>>>>>> >>>>>>>>>>>>> java.lang.RuntimeException: Could not find job manager at >>>> specified >>>>>>>>>>>> >>>>>>>>>>>>> address akka.flink@10.218.98.169:6123/user/jobmanager'>tcp:// >>>>>>>>>>>> flink@10.218.98.169:6123/user/jobmanager. >>>>>>>>>>>> >>>>>>>>>>>>> at >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>> org.apache.flink.client.web.JobsInfoServlet.<init>(JobsInfoServlet.java:82) >>>>>>>>>>>> >>>>>>>>>>>>> at >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>> org.apache.flink.client.web.WebInterfaceServer.<init>(WebInterfaceServer.java:158) >>>>>>>>>>>> >>>>>>>>>>>>> at org.apache.flink.client.WebFrontend.main(WebFrontend.java:74) >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> On Feb 23, 2015, at 11:46 PM, Robert Metzger < >>>> [hidden email] >>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> Hi, >>>>>>>>>>>> >>>>>>>>>>>>>> you said in the other email thread that the error only occurs >>>> for >>>>>>>>>>>> >>>>>>>>>>>>>> Wordcount, not for Kmeans. >>>>>>>>>>>> >>>>>>>>>>>>>> Can you copy me the commands for both examples? >>>>>>>>>>>> >>>>>>>>>>>>>> I can not really believe that there is a difference between the >>>>>>>> two >>>>>>>>>>>> >>>>>>>>>>>> jobs. >>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> Can you also send us the contents of the jobmanager log file? >>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> Best, >>>>>>>>>>>> >>>>>>>>>>>>>> Robert >>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> On Mon, Feb 23, 2015 at 6:04 PM, Dulaj Viduranga < >>>>>>>>>> [hidden email] >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>> I’m getting "Could not build up connection to JobManager.” >>>> When i >>>>>>>>>>>> >>>>>>>>>>>> tried >>>>>>>>>>>> >>>>>>>>>>>>> to >>>>>>>>>>>> >>>>>>>>>>>>>>> run the wordCount example. Can anyone help? >>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>>>> Dulaj >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>> >>>> >>> > |
In reply to this post by Till Rohrmann
Here is the taskmanager log when I tried taskmanager.sh start
flink-Vidura-taskmanager-localhost.log <https://gist.github.com/anonymous/aef5a0bf8722feee9b97#file-flink-vidura-taskmanager-localhost-log> > On Feb 27, 2015, at 4:12 PM, Till Rohrmann <[hidden email]> wrote: > > It depends on how you started Flink. If you started a local cluster, then > the TaskManager log is contained in the JobManager log we just don't see > the respective log output in the snippet you posted. If you started a > TaskManager independently, either by taskmanager.sh or by start-cluster.sh, > then a file with the name format flink-<user>-taskmanager-<hostname>.log > should be created in flink/log/. If the Flink directory is not shared by > your cluster nodes, then you have to look on the machine on which you > started the TaskManager. > > But since the JobManager binds to 127.0.0.1 I guess that you started a > local cluster. Try whether you find some logging statements from the > logger org.apache.flink.runtime.taskmanager.TaskManager in your log. Maybe > you can upload the corresponding log file to [1] and post a link here. > > Greets, > > Till > > [1] https://gist.github.com/ > > On Thu, Feb 26, 2015 at 6:45 PM, Dulaj Viduranga <[hidden email]> > wrote: > >> Hi, >> Can you tell me where I can find TaskManager logs. I can’t find >> them in logs folder? I don’t suppose I should run taskmanager.sh as well. >> Right? >> I’m using a OS X Yosemite. I’ll send you my ifconfig. >> >> lo0: flags=8049<UP,LOOPBACK,RUNNING,MULTICAST> mtu 16384 >> options=3<RXCSUM,TXCSUM> >> inet6 ::1 prefixlen 128 >> inet 127.0.0.1 netmask 0xff000000 >> inet6 fe80::1%lo0 prefixlen 64 scopeid 0x1 >> nd6 options=1<PERFORMNUD> >> gif0: flags=8010<POINTOPOINT,MULTICAST> mtu 1280 >> stf0: flags=0<> mtu 1280 >> en0: flags=8823<UP,BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500 >> ether 60:03:08:a1:e0:f4 >> nd6 options=1<PERFORMNUD> >> media: autoselect (<unknown type>) >> status: inactive >> en1: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu >> 1500 >> options=60<TSO4,TSO6> >> ether 72:00:02:32:14:d0 >> media: autoselect <full-duplex> >> status: inactive >> en2: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu >> 1500 >> options=60<TSO4,TSO6> >> ether 72:00:02:32:14:d1 >> media: autoselect <full-duplex> >> status: inactive >> bridge0: flags=8822<BROADCAST,SMART,SIMPLEX,MULTICAST> mtu 1500 >> options=63<RXCSUM,TXCSUM,TSO4,TSO6> >> ether 62:03:08:1a:fa:00 >> Configuration: >> id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0 >> maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200 >> root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0 >> ipfilter disabled flags 0x2 >> member: en1 flags=3<LEARNING,DISCOVER> >> ifmaxaddr 0 port 5 priority 0 path cost 0 >> member: en2 flags=3<LEARNING,DISCOVER> >> ifmaxaddr 0 port 6 priority 0 path cost 0 >> media: <unknown type> >> status: inactive >> p2p0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 2304 >> ether 02:03:08:a1:e0:f4 >> media: autoselect >> status: inactive >> awdl0: flags=8802<BROADCAST,SIMPLEX,MULTICAST> mtu 1452 >> ether 06:56:3d:f6:60:08 >> nd6 options=1<PERFORMNUD> >> media: autoselect >> status: inactive >> ppp0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1500 >> inet 10.218.98.228 --> 10.64.64.64 netmask 0xff000000 >> utun0: flags=8051<UP,POINTOPOINT,RUNNING,MULTICAST> mtu 1380 >> inet6 fe80::b0d4:d4be:7e62:e730%utun0 prefixlen 64 scopeid 0xb >> inet6 fdd0:b291:7da7:9153:b0d4:d4be:7e62:e730 prefixlen 64 >> nd6 options=1<PERFORMNUD> >> >> >>> On Feb 26, 2015, at 10:48 PM, Stephan Ewen <[hidden email]> wrote: >>> >>> Hi Dulaj! >>> >>> Thanks for helping to debug. >>> >>> My guess is that you are seeing now the same thing between JobManager and >>> TaskManager as you saw before between JobManager and JobClient. I have a >>> patch pending that should help the issue (see >>> https://issues.apache.org/jira/browse/FLINK-1608), let's see if that >> solves >>> it. >>> >>> What seems not right is that the JobManager initially accepted the >>> TaskManager and later the communication. Can you paste the TaskManager >> log >>> as well? >>> >>> Also: There must be something fairly unique about your network >>> configuration, as it works on all other setups that we use (locally, >> cloud, >>> test servers, YARN, ...). Can you paste your ipconfig / ifconfig by any >>> chance? >>> >>> Greetings, >>> Stephan >>> >>> >>> >>> On Thu, Feb 26, 2015 at 4:33 PM, Dulaj Viduranga <[hidden email]> >>> wrote: >>> >>>> Hi, >>>> It’s great to help out. :) >>>> >>>> Setting 127.0.0.1 instead of “localhost” in >>>> jobmanager.rpc.address, helped to build the connection to the >> jobmanager. >>>> Apparently localhost resolving is different in webclient and the >>>> jobmanager. I think it’s good to set "jobmanager.rpc.address: >> 127.0.0.1" in >>>> future builds. >>>> But then I get this error when I tried to run examples. I don’t >>>> know if I should move this issue to another thread. If so please tell >> me. >>>> >>>> bin/flink run >>>> >> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar >>>> >> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt >>>> $FLINK_DIRECTORY/count >>>> >>>> >>>> 20:46:21,998 WARN org.apache.hadoop.util.NativeCodeLoader >>>> - Unable to load native-hadoop library for your platform... using >>>> builtin-java classes where applicable >>>> 02/26/2015 20:46:23 Job execution switched to status RUNNING. >>>> 02/26/2015 20:46:23 CHAIN DataSource (at >>>> getTextDataSet(WordCount.java:141) >>>> (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at >>>> main(WordCount.java:69)) -> Combine(SUM(1), at >> main(WordCount.java:72)(1/1) >>>> switched to SCHEDULED >>>> 02/26/2015 20:46:23 CHAIN DataSource (at >>>> getTextDataSet(WordCount.java:141) >>>> (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at >>>> main(WordCount.java:69)) -> Combine(SUM(1), at >> main(WordCount.java:72)(1/1) >>>> switched to DEPLOYING >>>> 02/26/2015 20:48:03 CHAIN DataSource (at >>>> getTextDataSet(WordCount.java:141) >>>> (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at >>>> main(WordCount.java:69)) -> Combine(SUM(1), at >> main(WordCount.java:72)(1/1) >>>> switched to FAILED >>>> akka.pattern.AskTimeoutException: Ask timed out on >>>> [Actor[akka://flink/user/taskmanager#-1628133761]] after [100000 ms] >>>> at >>>> >> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) >>>> at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) >>>> at >>>> >> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) >>>> at >>>> >> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) >>>> at >>>> >> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) >>>> at >>>> >> akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) >>>> at >>>> >> akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) >>>> at >>>> akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) >>>> at java.lang.Thread.run(Thread.java:745) >>>> >>>> 02/26/2015 20:48:03 Job execution switched to status FAILING. >>>> 02/26/2015 20:48:03 Reduce (SUM(1), at main(WordCount.java:72)(1/1) >>>> switched to CANCELED >>>> 02/26/2015 20:48:03 DataSink(CsvOutputFormat (path: >>>> >> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/count, >>>> delimiter: ))(1/1) switched to CANCELED >>>> 02/26/2015 20:48:03 Job execution switched to status FAILED. >>>> org.apache.flink.client.program.ProgramInvocationException: The program >>>> execution failed. >>>> at org.apache.flink.client.program.Client.run(Client.java:344) >>>> at org.apache.flink.client.program.Client.run(Client.java:306) >>>> at org.apache.flink.client.program.Client.run(Client.java:300) >>>> at >>>> >> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55) >>>> at >>>> >> org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:82) >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>> at >>>> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >>>> at >>>> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>> at java.lang.reflect.Method.invoke(Method.java:483) >>>> at >>>> >> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437) >>>> at >>>> >> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353) >>>> at org.apache.flink.client.program.Client.run(Client.java:250) >>>> at >>>> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:371) >>>> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:344) >>>> at >>>> >> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1087) >>>> at >> org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114) >>>> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job >>>> execution failed. >>>> at >>>> >> org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:284) >>>> at >>>> >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) >>>> at >>>> >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) >>>> at >>>> >> scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) >>>> at >>>> >> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37) >>>> at >>>> >> org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30) >>>> at >>>> scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) >>>> at >>>> >> org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30) >>>> at akka.actor.Actor$class.aroundReceive(Actor.scala:465) >>>> at >>>> >> org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:88) >>>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) >>>> at akka.actor.ActorCell.invoke(ActorCell.scala:487) >>>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) >>>> at akka.dispatch.Mailbox.run(Mailbox.scala:221) >>>> at akka.dispatch.Mailbox.exec(Mailbox.scala:231) >>>> at >>>> scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) >>>> at >>>> >> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) >>>> at >>>> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) >>>> at >>>> >> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) >>>> Caused by: akka.pattern.AskTimeoutException: Ask timed out on >>>> [Actor[akka://flink/user/taskmanager#-1628133761]] after [100000 ms] >>>> at >>>> >> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) >>>> at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) >>>> at >>>> >> scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) >>>> at >>>> >> scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) >>>> at >>>> >> akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) >>>> at >>>> >> akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) >>>> at >>>> >> akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) >>>> at >>>> akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) >>>> at java.lang.Thread.run(Thread.java:745) >>>> >>>> The exception above occurred while trying to run your command. >>>> >>>> >>>>> On Feb 26, 2015, at 12:46 AM, Stephan Ewen <[hidden email]> wrote: >>>>> >>>>> Addition: To check whether a port is reachable, I think the easiest >> thing >>>>> is to try and connect with a telnet client and see if the connection is >>>>> refused. >>>>> >>>>> On Wed, Feb 25, 2015 at 8:15 PM, Stephan Ewen <[hidden email]> >> wrote: >>>>> >>>>>> Okay, the problem seems to be that even though both the client and the >>>>>> jobmanager use "localhost" as the host name, they resolve this to >>>> different >>>>>> IP addresses: In one case 127.0.0.1 in the other case 10.216.177.146 >>>>>> >>>>>> Also, the 127.0.0.1 address cannot communicate to 10.216.177.146 >>>>>> apparently. >>>>>> >>>>>> Can you help us debug this by checking the following: >>>>>> >>>>>> - Can you try and set "jobmanager.rpc.address" to 127.0.0.1 and see if >>>>>> that solves it? >>>>>> - Can you try and set "jobmanager.rpc.address" to the other address >>>> (10.216.177.146 >>>>>> or so) and see if that solves it? >>>>>> - Can you do "start-cluster.sh", rather than "start-local.sh" and see >>>>>> whether the webfrontend displays that the TaskManager connects? >>>>>> - As a hard core test: Can you bring up the jobmanager, check where it >>>>>> connects (10.216.192.98:6123 or so) and see whether the port is >>>> reachable? >>>>>> >>>>>> We have recently updated how the Akka URLs are build, to work around a >>>>>> limitation in Akka. Seems that did not yet fully solve the issue. >>>>>> >>>>>> Thanks for helping us debug this, it is not the easiest immigration >>>>>> experience, but the outcome is probably extremely valuable for the >>>> project >>>>>> :-) >>>>>> >>>>>> Greetings, >>>>>> Stephan >>>>>> >>>>>> >>>>>> On Wed, Feb 25, 2015 at 4:03 PM, Dulaj Viduranga < >> [hidden email]> >>>>>> wrote: >>>>>> >>>>>>> Hi, >>>>>>> Sorry for the delay to reply on this issue. >>>>>>> the jobmanager.rpc.address is set to “localhost” already in >> conf.yaml. >>>>>>> This can’t be an issue because the job manager web interface works >> fine >>>>>>> which also runs on localhost >>>>>>> >>>>>>> bin/flink run <jar> doesn’t seem to work either. Let me send you my >>>>>>> command and the result in terminal. >>>>>>> >>>>>>> bin/flink run >>>>>>> >>>> >> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar >>>>>>> >>>> >> /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt >>>>>>> $FLINK_DIRECTORY/count >>>>>>> >>>>>>> 20:32:16,442 WARN org.apache.hadoop.util.NativeCodeLoader >>>>>>> - Unable to load native-hadoop library for your platform... >> using >>>>>>> builtin-java classes where applicable >>>>>>> org.apache.flink.client.program.ProgramInvocationException: Could not >>>>>>> build up connection to JobManager. >>>>>>> at org.apache.flink.client.program.Client.run(Client.java:327) >>>>>>> at org.apache.flink.client.program.Client.run(Client.java:306) >>>>>>> at org.apache.flink.client.program.Client.run(Client.java:300) >>>>>>> at >>>>>>> >>>> >> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55) >>>>>>> at >>>>>>> >>>> >> org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:82) >>>>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) >>>>>>> at >>>>>>> >>>> >> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) >>>>>>> at >>>>>>> >>>> >> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) >>>>>>> at java.lang.reflect.Method.invoke(Method.java:483) >>>>>>> at >>>>>>> >>>> >> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437) >>>>>>> at >>>>>>> >>>> >> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353) >>>>>>> at org.apache.flink.client.program.Client.run(Client.java:250) >>>>>>> at >>>>>>> >>>> org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:371) >>>>>>> at >> org.apache.flink.client.CliFrontend.run(CliFrontend.java:344) >>>>>>> at >>>>>>> >>>> >> org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1087) >>>>>>> at >>>> org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114) >>>>>>> Caused by: java.io.IOException: JobManager at akka.tcp:// >>>>>>> flink@10.216.177.146:6123/user/jobmanager not reachable. Please make >>>>>>> sure that the JobManager is running and its port is reachable. >>>>>>> at >>>>>>> >>>> >> org.apache.flink.runtime.jobmanager.JobManager$.getJobManagerRemoteReference(JobManager.scala:897) >>>>>>> at >>>>>>> >>>> >> org.apache.flink.runtime.client.JobClient$.createJobClient(JobClient.scala:151) >>>>>>> at >>>>>>> >>>> >> org.apache.flink.runtime.client.JobClient$.createJobClientFromConfig(JobClient.scala:142) >>>>>>> at >>>>>>> >>>> >> org.apache.flink.runtime.client.JobClient$.startActorSystemAndActor(JobClient.scala:125) >>>>>>> at >>>>>>> >>>> >> org.apache.flink.runtime.client.JobClient.startActorSystemAndActor(JobClient.scala) >>>>>>> at org.apache.flink.client.program.Client.run(Client.java:322) >>>>>>> ... 15 more >>>>>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out >>>> after >>>>>>> [10000 milliseconds] >>>>>>> at >>>>>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) >>>>>>> at >>>>>>> >> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) >>>>>>> at >>>>>>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) >>>>>>> at >>>>>>> >>>> >> scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) >>>>>>> at scala.concurrent.Await$.result(package.scala:107) >>>>>>> at >>>>>>> >>>> >> org.apache.flink.runtime.jobmanager.JobManager$.getJobManagerRemoteReference(JobManager.scala:893) >>>>>>> ... 20 more >>>>>>> >>>>>>> The exception above occurred while trying to run your command. >>>>>>> >>>>>>> >>>>>>>> On Feb 25, 2015, at 1:29 AM, Stephan Ewen <[hidden email]> wrote: >>>>>>>> >>>>>>>> BTW: Does still work if you enter "localhost" for >>>>>>> "jobmanager.rpc.address" >>>>>>>> in your flink-conf.yaml ? >>>>>>>> >>>>>>>> On Tue, Feb 24, 2015 at 7:50 PM, Stephan Ewen <[hidden email]> >>>> wrote: >>>>>>>> >>>>>>>>> Hi! >>>>>>>>> >>>>>>>>> I think that this is a problem in the current master (probably in >>>> there >>>>>>>>> since a few days ago). I am fixing it... >>>>>>>>> >>>>>>>>> Thanks for reporting it! >>>>>>>>> >>>>>>>>> Stephan >>>>>>>>> >>>>>>>>> >>>>>>>>> On Tue, Feb 24, 2015 at 6:52 PM, Stephan Ewen <[hidden email]> >>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi Dulaj! >>>>>>>>>> >>>>>>>>>> The log suggests that the JobManager binds itself to the IP >>>>>>>>>> address 10.216.192.98 and the WebClient runs at 127.0.0.1 >>>>>>>>>> >>>>>>>>>> The 127.0.0.1 actor system cannot connect to the 10.216.192.98. >>>>>>>>>> >>>>>>>>>> Let me verify whether this is a quirk of your particular setup, >> or a >>>>>>> bug >>>>>>>>>> recently introduces in the 0.9-SNAPSHOT. >>>>>>>>>> >>>>>>>>>> Does the command line work for you? ("bin/flink run <jar>") >>>>>>>>>> >>>>>>>>>> taskmanager.numberOfTaskSlots: -1 is also okay, this will mean >> that >>>>>>> the >>>>>>>>>> default of '1' is used. >>>>>>>>>> >>>>>>>>>> Greetings, >>>>>>>>>> Stephan >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Tue, Feb 24, 2015 at 5:18 PM, Dulaj Viduranga < >>>>>>> [hidden email]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> Is taskmanager.numberOfTaskSlots: -1 normal? >>>>>>>>>>> >>>>>>>>>>>> On Feb 24, 2015, at 9:44 PM, Robert Metzger < >> [hidden email]> >>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi, >>>>>>>>>>>> I could not find the logfiles attached to your mails. I think >> the >>>>>>>>>>>> mailinglists are not accepting attachments. >>>>>>>>>>>> Can you put the logs on gist.github.com? >>>>>>>>>>>> >>>>>>>>>>>> The configuration values are documented here: >>>>>>>>>>>> http://flink.apache.org/docs/0.8/config.html >>>>>>>>>>>> For the webclient's port its called webclient.port >>>>>>>>>>>> >>>>>>>>>>>> On Tue, Feb 24, 2015 at 5:04 PM, Dulaj Viduranga < >>>>>>> [hidden email] >>>>>>>>>>>> >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> I tried to kill the job manager manually in the terminal and >>>> start >>>>>>> it >>>>>>>>>>>>> again but no luck. Also could you tell me if it’s possible to >>>>>>> change >>>>>>>>>>>>> webclient’s port (8080) ? >>>>>>>>>>>>> >>>>>>>>>>>>>> On Feb 24, 2015, at 1:41 PM, Stephan Ewen <[hidden email]> >>>>>>> wrote: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Hey Dulaj! >>>>>>>>>>>>>> >>>>>>>>>>>>>> As a contributor, I would go against the latest version, which >>>> is >>>>>>>>>>>>>> 0.9-SNAPSHOT. >>>>>>>>>>>>>> >>>>>>>>>>>>>> It may be in your case that the JobManager actor is down, but >>>> the >>>>>>>>>>> process >>>>>>>>>>>>>> still lingers. (BTW: I have a patch pending that makes sure >> the >>>>>>>>>>> process >>>>>>>>>>>>>> disappears when the actor via down). >>>>>>>>>>>>>> >>>>>>>>>>>>>> Could you have a look at the log >>>>>>>>>>> "flink-<user>-jobmanager-<host>-.log" >>>>>>>>>>>>> and >>>>>>>>>>>>>> see if there are any errors logged? >>>>>>>>>>>>>> >>>>>>>>>>>>>> Greetings, >>>>>>>>>>>>>> Stephan >>>>>>>>>>>>>> Am 24.02.2015 06:29 schrieb "Dulaj Viduranga" < >>>>>>> [hidden email] >>>>>>>>>>>> : >>>>>>>>>>>>>> >>>>>>>>>>>>>>> The JobManager seems to run fine. I don't know. When I tried >> to >>>>>>> run >>>>>>>>>>>>>>> start-local.sh again, It shows the PID of the running >>>> JobManager >>>>>>> and >>>>>>>>>>>>> also >>>>>>>>>>>>>>> :8081 runs fine. I want to contribute to the project and I >>>> could >>>>>>>>>>> get a >>>>>>>>>>>>>>> little boost if I could see the capabilities of FLINK. :) >>>>>>>>>>>>>>> Will it be OK to use 0.8.1 as a developer? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Feb 24, 2015, at 04:15 AM, Stephan Ewen <[hidden email] >>> >>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Hi Dulaj, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> That error message indicates that the JobManager is not >>>> running. >>>>>>>>>>> Are you >>>>>>>>>>>>>>> sure that the JobManager runs properly? Anything in the >>>>>>> JobManager >>>>>>>>>>> logs? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> BTW: The 0.9 branch is under heavy development / changes. >> That >>>> is >>>>>>>>>>> why it >>>>>>>>>>>>>>> may behave a bit different on different days right now. I >> would >>>>>>>>>>>>> recommend >>>>>>>>>>>>>>> to use the 0.8.1 release for a stable experience. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Greetings, >>>>>>>>>>>>>>> Stephan >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Feb 23, 2015 at 7:39 PM, Robert Metzger < >>>>>>>>>>> [hidden email]> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> Thank you for the quick reply. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> The log you've send is from the webclient. Can you also send >>>> the >>>>>>>>>>> log of >>>>>>>>>>>>> the >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> JobManager? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> On Mon, Feb 23, 2015 at 7:28 PM, Dulaj Viduranga < >>>>>>>>>>> [hidden email]> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Yes. It seams it is not a problem with the arguments. I >> tried >>>>>>> two >>>>>>>>>>> days >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> but >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> different error occurs. It seams the web client can’t >> connect >>>> to >>>>>>>>>>> the >>>>>>>>>>>>> job >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> manager although it is running >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Right now, I can’t even get the webclient to run. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> ./bin/start-webclient.sh >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> executes fine but I cannot connect to localhost:8080 (even >>>> with >>>>>>>>>>> telnet >>>>>>>>>>>>> or >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> curl) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> Here is the log for jobManager >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 23:22:31,933 INFO >>>> org.apache.flink.client.web.WebInterfaceServer >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - Setting up web frontend server, using web-root directory >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> 'jar: >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>> >>>> >> file:/Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/lib/flink-clients-0.9-SNAPSHOT.jar!/web-docs >>>>>>>>>>>>>>> '. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 23:22:31,934 INFO >>>> org.apache.flink.client.web.WebInterfaceServer >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - Web frontend server will store temporary files in >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T', uploaded >>>>>>> jobs >>>>>>>>>>> in >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T/webclient-jobs', >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> plan-json-dumps in >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T/webclient-plans'. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 23:22:31,934 INFO >>>> org.apache.flink.client.web.WebInterfaceServer >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - Web-frontend will submit jobs to nephele job-manager on >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> localhost, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> port 6123. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 23:22:32,580 INFO akka.event.slf4j.Slf4jLogger >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - Slf4jLogger started >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 23:22:32,625 INFO Remoting >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - Starting remoting >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 23:22:32,838 INFO Remoting >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - Remoting started; listening on addresses :[akka.tcp:// >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> JobsInfoServletActorSystem@127.0.0.1:51517] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 23:23:48,119 WARN Remoting >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - Tried to associate with unreachable remote address >>>>>>> [akka.tcp:// >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> flink@10.218.98.169:6123]. Address is now gated for 5000 >> ms, >>>>>>> all >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> messages >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> to this address will be delivered to dead letters. Reason: >>>>>>>>>>> Operation >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> timed >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> out: /10.218.98.169:6123 >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> 23:23:48,124 ERROR org.apache.flink.client.WebFrontend >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> - Unexpected exception: Could not find job manager at >>>> specified >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> address akka.flink@10.218.98.169:6123/user/jobmanager >> '>tcp:// >>>>>>>>>>>>>>> flink@10.218.98.169:6123/user/jobmanager. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> java.lang.RuntimeException: Could not find job manager at >>>>>>> specified >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> address akka.flink@10.218.98.169:6123/user/jobmanager >> '>tcp:// >>>>>>>>>>>>>>> flink@10.218.98.169:6123/user/jobmanager. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>> >>>> >> org.apache.flink.client.web.JobsInfoServlet.<init>(JobsInfoServlet.java:82) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> at >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>> >>>> >> org.apache.flink.client.web.WebInterfaceServer.<init>(WebInterfaceServer.java:158) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> at >>>> org.apache.flink.client.WebFrontend.main(WebFrontend.java:74) >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Feb 23, 2015, at 11:46 PM, Robert Metzger < >>>>>>> [hidden email] >>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Hi, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> you said in the other email thread that the error only >> occurs >>>>>>> for >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Wordcount, not for Kmeans. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Can you copy me the commands for both examples? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> I can not really believe that there is a difference between >>>> the >>>>>>>>>>> two >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> jobs. >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Can you also send us the contents of the jobmanager log >> file? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Best, >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> Robert >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> On Mon, Feb 23, 2015 at 6:04 PM, Dulaj Viduranga < >>>>>>>>>>>>> [hidden email] >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> wrote: >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> I’m getting "Could not build up connection to JobManager.” >>>>>>> When i >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> tried >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> to >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> run the wordCount example. Can anyone help? >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>>>> Dulaj >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>> >>>> >> >> |
Hi,
I found the fix for this issue and I'll create a pull request in the following day. |
Wow, great. Can you tell us what the issue was?
Am 02.03.2015 09:31 schrieb "Dulaj Viduranga" <[hidden email]>: > Hi, > I found the fix for this issue and I'll create a pull request in the > following day. > |
In reply to this post by Dulaj Viduranga
Calling:
java -cp ../examples/flink-java-examples-0.9-SNAPSHOT-KMeans.jar org.apache.flink.examples.java.clustering.util.KMeansDataGenerator 500 10 0.08 Will not connect to Flink. Its just running a standalone KMeans data generator, not KMeans. I would suspect that the KMeans example is not running as well. You can run the KMeans example like this: bin/flink run ./examples/flink-java-examples-0.9-SNAPSHOT-KMeans.jar. On Sat, Feb 28, 2015 at 5:47 AM, Dulaj Viduranga <[hidden email]> wrote: > Hi, > I’m thinking I’m doing something wrong. After setting jobManager address > to 127.0.0.1, I can run kmeans example (java -cp > ../examples/flink-java-examples-0.9-SNAPSHOT-KMeans.jar > org.apache.flink.examples.java.clustering.util.KMeansDataGenerator 500 10 > 0.08) > But I can’t run word count example (bin/flink run > ./examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar > file:'///Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt' > file:'///Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/count.txt’) > > I’m not sure whether I’m running it wrong > > > On Feb 26, 2015, at 9:03 PM, Dulaj Viduranga <[hidden email]> > wrote: > > > > Hi, > > It’s great to help out. :) > > > > Setting 127.0.0.1 instead of “localhost” in > jobmanager.rpc.address, helped to build the connection to the jobmanager. > Apparently localhost resolving is different in webclient and the > jobmanager. I think it’s good to set "jobmanager.rpc.address: 127.0.0.1" in > future builds. > > But then I get this error when I tried to run examples. I don’t > know if I should move this issue to another thread. If so please tell me. > > > > bin/flink run > /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar > /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt > $FLINK_DIRECTORY/count > > > > > > 20:46:21,998 WARN org.apache.hadoop.util.NativeCodeLoader > - Unable to load native-hadoop library for your platform... using > builtin-java classes where applicable > > 02/26/2015 20:46:23 Job execution switched to status RUNNING. > > 02/26/2015 20:46:23 CHAIN DataSource (at > getTextDataSet(WordCount.java:141) > (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at > main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) > switched to SCHEDULED > > 02/26/2015 20:46:23 CHAIN DataSource (at > getTextDataSet(WordCount.java:141) > (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at > main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) > switched to DEPLOYING > > 02/26/2015 20:48:03 CHAIN DataSource (at > getTextDataSet(WordCount.java:141) > (org.apache.flink.api.java.io.TextInputFormat)) -> FlatMap (FlatMap at > main(WordCount.java:69)) -> Combine(SUM(1), at main(WordCount.java:72)(1/1) > switched to FAILED > > akka.pattern.AskTimeoutException: Ask timed out on > [Actor[akka://flink/user/taskmanager#-1628133761]] after [100000 ms] > > at > akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) > > at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) > > at > scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > > at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) > > at > akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) > > at > akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) > > at > akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) > > at > akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) > > at java.lang.Thread.run(Thread.java:745) > > > > 02/26/2015 20:48:03 Job execution switched to status FAILING. > > 02/26/2015 20:48:03 Reduce (SUM(1), at main(WordCount.java:72)(1/1) > switched to CANCELED > > 02/26/2015 20:48:03 DataSink(CsvOutputFormat (path: > /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/count, > delimiter: ))(1/1) switched to CANCELED > > 02/26/2015 20:48:03 Job execution switched to status FAILED. > > org.apache.flink.client.program.ProgramInvocationException: The program > execution failed. > > at org.apache.flink.client.program.Client.run(Client.java:344) > > at org.apache.flink.client.program.Client.run(Client.java:306) > > at org.apache.flink.client.program.Client.run(Client.java:300) > > at > org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55) > > at > org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:82) > > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > > at java.lang.reflect.Method.invoke(Method.java:483) > > at > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437) > > at > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353) > > at org.apache.flink.client.program.Client.run(Client.java:250) > > at > org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:371) > > at org.apache.flink.client.CliFrontend.run(CliFrontend.java:344) > > at > org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1087) > > at org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114) > > Caused by: org.apache.flink.runtime.client.JobExecutionException: Job > execution failed. > > at > org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1.applyOrElse(JobManager.scala:284) > > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33) > > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33) > > at > scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25) > > at > org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:37) > > at > org.apache.flink.runtime.ActorLogMessages$$anon$1.apply(ActorLogMessages.scala:30) > > at > scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118) > > at > org.apache.flink.runtime.ActorLogMessages$$anon$1.applyOrElse(ActorLogMessages.scala:30) > > at akka.actor.Actor$class.aroundReceive(Actor.scala:465) > > at > org.apache.flink.runtime.jobmanager.JobManager.aroundReceive(JobManager.scala:88) > > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516) > > at akka.actor.ActorCell.invoke(ActorCell.scala:487) > > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:254) > > at akka.dispatch.Mailbox.run(Mailbox.scala:221) > > at akka.dispatch.Mailbox.exec(Mailbox.scala:231) > > at > scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > > at > scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > > at > scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > > at > scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > > Caused by: akka.pattern.AskTimeoutException: Ask timed out on > [Actor[akka://flink/user/taskmanager#-1628133761]] after [100000 ms] > > at > akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333) > > at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117) > > at > scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694) > > at > scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691) > > at > akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467) > > at > akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419) > > at > akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423) > > at > akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375) > > at java.lang.Thread.run(Thread.java:745) > > > > The exception above occurred while trying to run your command. > > > > > >> On Feb 26, 2015, at 12:46 AM, Stephan Ewen <[hidden email]> wrote: > >> > >> Addition: To check whether a port is reachable, I think the easiest > thing > >> is to try and connect with a telnet client and see if the connection is > >> refused. > >> > >> On Wed, Feb 25, 2015 at 8:15 PM, Stephan Ewen <[hidden email]> wrote: > >> > >>> Okay, the problem seems to be that even though both the client and the > >>> jobmanager use "localhost" as the host name, they resolve this to > different > >>> IP addresses: In one case 127.0.0.1 in the other case 10.216.177.146 > >>> > >>> Also, the 127.0.0.1 address cannot communicate to 10.216.177.146 > >>> apparently. > >>> > >>> Can you help us debug this by checking the following: > >>> > >>> - Can you try and set "jobmanager.rpc.address" to 127.0.0.1 and see if > >>> that solves it? > >>> - Can you try and set "jobmanager.rpc.address" to the other address > (10.216.177.146 > >>> or so) and see if that solves it? > >>> - Can you do "start-cluster.sh", rather than "start-local.sh" and see > >>> whether the webfrontend displays that the TaskManager connects? > >>> - As a hard core test: Can you bring up the jobmanager, check where it > >>> connects (10.216.192.98:6123 or so) and see whether the port is > reachable? > >>> > >>> We have recently updated how the Akka URLs are build, to work around a > >>> limitation in Akka. Seems that did not yet fully solve the issue. > >>> > >>> Thanks for helping us debug this, it is not the easiest immigration > >>> experience, but the outcome is probably extremely valuable for the > project > >>> :-) > >>> > >>> Greetings, > >>> Stephan > >>> > >>> > >>> On Wed, Feb 25, 2015 at 4:03 PM, Dulaj Viduranga <[hidden email] > > > >>> wrote: > >>> > >>>> Hi, > >>>> Sorry for the delay to reply on this issue. > >>>> the jobmanager.rpc.address is set to “localhost” already in conf.yaml. > >>>> This can’t be an issue because the job manager web interface works > fine > >>>> which also runs on localhost > >>>> > >>>> bin/flink run <jar> doesn’t seem to work either. Let me send you my > >>>> command and the result in terminal. > >>>> > >>>> bin/flink run > >>>> > /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/examples/flink-java-examples-0.9-SNAPSHOT-WordCount.jar > >>>> > /Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/hamlet.txt > >>>> $FLINK_DIRECTORY/count > >>>> > >>>> 20:32:16,442 WARN org.apache.hadoop.util.NativeCodeLoader > >>>> - Unable to load native-hadoop library for your platform... using > >>>> builtin-java classes where applicable > >>>> org.apache.flink.client.program.ProgramInvocationException: Could not > >>>> build up connection to JobManager. > >>>> at org.apache.flink.client.program.Client.run(Client.java:327) > >>>> at org.apache.flink.client.program.Client.run(Client.java:306) > >>>> at org.apache.flink.client.program.Client.run(Client.java:300) > >>>> at > >>>> > org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:55) > >>>> at > >>>> > org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:82) > >>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > >>>> at > >>>> > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > >>>> at > >>>> > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > >>>> at java.lang.reflect.Method.invoke(Method.java:483) > >>>> at > >>>> > org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:437) > >>>> at > >>>> > org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:353) > >>>> at org.apache.flink.client.program.Client.run(Client.java:250) > >>>> at > >>>> > org.apache.flink.client.CliFrontend.executeProgram(CliFrontend.java:371) > >>>> at org.apache.flink.client.CliFrontend.run(CliFrontend.java:344) > >>>> at > >>>> > org.apache.flink.client.CliFrontend.parseParameters(CliFrontend.java:1087) > >>>> at > org.apache.flink.client.CliFrontend.main(CliFrontend.java:1114) > >>>> Caused by: java.io.IOException: JobManager at akka.tcp:// > >>>> flink@10.216.177.146:6123/user/jobmanager not reachable. Please make > >>>> sure that the JobManager is running and its port is reachable. > >>>> at > >>>> > org.apache.flink.runtime.jobmanager.JobManager$.getJobManagerRemoteReference(JobManager.scala:897) > >>>> at > >>>> > org.apache.flink.runtime.client.JobClient$.createJobClient(JobClient.scala:151) > >>>> at > >>>> > org.apache.flink.runtime.client.JobClient$.createJobClientFromConfig(JobClient.scala:142) > >>>> at > >>>> > org.apache.flink.runtime.client.JobClient$.startActorSystemAndActor(JobClient.scala:125) > >>>> at > >>>> > org.apache.flink.runtime.client.JobClient.startActorSystemAndActor(JobClient.scala) > >>>> at org.apache.flink.client.program.Client.run(Client.java:322) > >>>> ... 15 more > >>>> Caused by: java.util.concurrent.TimeoutException: Futures timed out > after > >>>> [10000 milliseconds] > >>>> at > >>>> scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219) > >>>> at > >>>> scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223) > >>>> at > >>>> scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107) > >>>> at > >>>> > scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) > >>>> at scala.concurrent.Await$.result(package.scala:107) > >>>> at > >>>> > org.apache.flink.runtime.jobmanager.JobManager$.getJobManagerRemoteReference(JobManager.scala:893) > >>>> ... 20 more > >>>> > >>>> The exception above occurred while trying to run your command. > >>>> > >>>> > >>>>> On Feb 25, 2015, at 1:29 AM, Stephan Ewen <[hidden email]> wrote: > >>>>> > >>>>> BTW: Does still work if you enter "localhost" for > >>>> "jobmanager.rpc.address" > >>>>> in your flink-conf.yaml ? > >>>>> > >>>>> On Tue, Feb 24, 2015 at 7:50 PM, Stephan Ewen <[hidden email]> > wrote: > >>>>> > >>>>>> Hi! > >>>>>> > >>>>>> I think that this is a problem in the current master (probably in > there > >>>>>> since a few days ago). I am fixing it... > >>>>>> > >>>>>> Thanks for reporting it! > >>>>>> > >>>>>> Stephan > >>>>>> > >>>>>> > >>>>>> On Tue, Feb 24, 2015 at 6:52 PM, Stephan Ewen <[hidden email]> > >>>> wrote: > >>>>>> > >>>>>>> Hi Dulaj! > >>>>>>> > >>>>>>> The log suggests that the JobManager binds itself to the IP > >>>>>>> address 10.216.192.98 and the WebClient runs at 127.0.0.1 > >>>>>>> > >>>>>>> The 127.0.0.1 actor system cannot connect to the 10.216.192.98. > >>>>>>> > >>>>>>> Let me verify whether this is a quirk of your particular setup, or > a > >>>> bug > >>>>>>> recently introduces in the 0.9-SNAPSHOT. > >>>>>>> > >>>>>>> Does the command line work for you? ("bin/flink run <jar>") > >>>>>>> > >>>>>>> taskmanager.numberOfTaskSlots: -1 is also okay, this will mean > that > >>>> the > >>>>>>> default of '1' is used. > >>>>>>> > >>>>>>> Greetings, > >>>>>>> Stephan > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> On Tue, Feb 24, 2015 at 5:18 PM, Dulaj Viduranga < > >>>> [hidden email]> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> Is taskmanager.numberOfTaskSlots: -1 normal? > >>>>>>>> > >>>>>>>>> On Feb 24, 2015, at 9:44 PM, Robert Metzger <[hidden email] > > > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> Hi, > >>>>>>>>> I could not find the logfiles attached to your mails. I think the > >>>>>>>>> mailinglists are not accepting attachments. > >>>>>>>>> Can you put the logs on gist.github.com? > >>>>>>>>> > >>>>>>>>> The configuration values are documented here: > >>>>>>>>> http://flink.apache.org/docs/0.8/config.html > >>>>>>>>> For the webclient's port its called webclient.port > >>>>>>>>> > >>>>>>>>> On Tue, Feb 24, 2015 at 5:04 PM, Dulaj Viduranga < > >>>> [hidden email] > >>>>>>>>> > >>>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>>> I tried to kill the job manager manually in the terminal and > start > >>>> it > >>>>>>>>>> again but no luck. Also could you tell me if it’s possible to > >>>> change > >>>>>>>>>> webclient’s port (8080) ? > >>>>>>>>>> > >>>>>>>>>>> On Feb 24, 2015, at 1:41 PM, Stephan Ewen <[hidden email]> > >>>> wrote: > >>>>>>>>>>> > >>>>>>>>>>> Hey Dulaj! > >>>>>>>>>>> > >>>>>>>>>>> As a contributor, I would go against the latest version, which > is > >>>>>>>>>>> 0.9-SNAPSHOT. > >>>>>>>>>>> > >>>>>>>>>>> It may be in your case that the JobManager actor is down, but > the > >>>>>>>> process > >>>>>>>>>>> still lingers. (BTW: I have a patch pending that makes sure the > >>>>>>>> process > >>>>>>>>>>> disappears when the actor via down). > >>>>>>>>>>> > >>>>>>>>>>> Could you have a look at the log > >>>>>>>> "flink-<user>-jobmanager-<host>-.log" > >>>>>>>>>> and > >>>>>>>>>>> see if there are any errors logged? > >>>>>>>>>>> > >>>>>>>>>>> Greetings, > >>>>>>>>>>> Stephan > >>>>>>>>>>> Am 24.02.2015 06:29 schrieb "Dulaj Viduranga" < > >>>> [hidden email] > >>>>>>>>> : > >>>>>>>>>>> > >>>>>>>>>>>> The JobManager seems to run fine. I don't know. When I tried > to > >>>> run > >>>>>>>>>>>> start-local.sh again, It shows the PID of the running > JobManager > >>>> and > >>>>>>>>>> also > >>>>>>>>>>>> :8081 runs fine. I want to contribute to the project and I > could > >>>>>>>> get a > >>>>>>>>>>>> little boost if I could see the capabilities of FLINK. :) > >>>>>>>>>>>> Will it be OK to use 0.8.1 as a developer? > >>>>>>>>>>>> > >>>>>>>>>>>> On Feb 24, 2015, at 04:15 AM, Stephan Ewen <[hidden email]> > >>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> Hi Dulaj, > >>>>>>>>>>>> > >>>>>>>>>>>> That error message indicates that the JobManager is not > running. > >>>>>>>> Are you > >>>>>>>>>>>> sure that the JobManager runs properly? Anything in the > >>>> JobManager > >>>>>>>> logs? > >>>>>>>>>>>> > >>>>>>>>>>>> BTW: The 0.9 branch is under heavy development / changes. > That is > >>>>>>>> why it > >>>>>>>>>>>> may behave a bit different on different days right now. I > would > >>>>>>>>>> recommend > >>>>>>>>>>>> to use the 0.8.1 release for a stable experience. > >>>>>>>>>>>> > >>>>>>>>>>>> Greetings, > >>>>>>>>>>>> Stephan > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> On Mon, Feb 23, 2015 at 7:39 PM, Robert Metzger < > >>>>>>>> [hidden email]> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>> Thank you for the quick reply. > >>>>>>>>>>>> > >>>>>>>>>>>> The log you've send is from the webclient. Can you also send > the > >>>>>>>> log of > >>>>>>>>>> the > >>>>>>>>>>>> > >>>>>>>>>>>> JobManager? > >>>>>>>>>>>> > >>>>>>>>>>>> On Mon, Feb 23, 2015 at 7:28 PM, Dulaj Viduranga < > >>>>>>>> [hidden email]> > >>>>>>>>>>>> > >>>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>> Yes. It seams it is not a problem with the arguments. I tried > >>>> two > >>>>>>>> days > >>>>>>>>>>>> > >>>>>>>>>>>> but > >>>>>>>>>>>> > >>>>>>>>>>>>> different error occurs. It seams the web client can’t > connect to > >>>>>>>> the > >>>>>>>>>> job > >>>>>>>>>>>> > >>>>>>>>>>>>> manager although it is running > >>>>>>>>>>>> > >>>>>>>>>>>>> Right now, I can’t even get the webclient to run. > >>>>>>>>>>>> > >>>>>>>>>>>> ./bin/start-webclient.sh > >>>>>>>>>>>> > >>>>>>>>>>>>> executes fine but I cannot connect to localhost:8080 (even > with > >>>>>>>> telnet > >>>>>>>>>> or > >>>>>>>>>>>> > >>>>>>>>>>>>> curl) > >>>>>>>>>>>> > >>>>>>>>>>>>> Here is the log for jobManager > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> 23:22:31,933 INFO > org.apache.flink.client.web.WebInterfaceServer > >>>>>>>>>>>> > >>>>>>>>>>>>> - Setting up web frontend server, using web-root directory > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> 'jar: > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>> > file:/Users/Vidura/Documents/Development/flink/flink-dist/target/flink-0.9-SNAPSHOT-bin/flink-0.9-SNAPSHOT/lib/flink-clients-0.9-SNAPSHOT.jar!/web-docs > >>>>>>>>>>>> '. > >>>>>>>>>>>> > >>>>>>>>>>>>> 23:22:31,934 INFO > org.apache.flink.client.web.WebInterfaceServer > >>>>>>>>>>>> > >>>>>>>>>>>>> - Web frontend server will store temporary files in > >>>>>>>>>>>> > >>>>>>>>>>>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T', uploaded > >>>> jobs > >>>>>>>> in > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T/webclient-jobs', > >>>>>>>>>>>> > >>>>>>>>>>>>> plan-json-dumps in > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>> '/var/folders/3_/7gzbv7ks7q71lpm5d9hzrw2c0000gn/T/webclient-plans'. > >>>>>>>>>>>> > >>>>>>>>>>>>> 23:22:31,934 INFO > org.apache.flink.client.web.WebInterfaceServer > >>>>>>>>>>>> > >>>>>>>>>>>>> - Web-frontend will submit jobs to nephele job-manager on > >>>>>>>>>>>> > >>>>>>>>>>>> localhost, > >>>>>>>>>>>> > >>>>>>>>>>>>> port 6123. > >>>>>>>>>>>> > >>>>>>>>>>>>> 23:22:32,580 INFO akka.event.slf4j.Slf4jLogger > >>>>>>>>>>>> > >>>>>>>>>>>>> - Slf4jLogger started > >>>>>>>>>>>> > >>>>>>>>>>>>> 23:22:32,625 INFO Remoting > >>>>>>>>>>>> > >>>>>>>>>>>>> - Starting remoting > >>>>>>>>>>>> > >>>>>>>>>>>>> 23:22:32,838 INFO Remoting > >>>>>>>>>>>> > >>>>>>>>>>>>> - Remoting started; listening on addresses :[akka.tcp:// > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> JobsInfoServletActorSystem@127.0.0.1:51517] > >>>>>>>>>>>> > >>>>>>>>>>>>> 23:23:48,119 WARN Remoting > >>>>>>>>>>>> > >>>>>>>>>>>>> - Tried to associate with unreachable remote address > >>>> [akka.tcp:// > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> flink@10.218.98.169:6123]. Address is now gated for 5000 ms, > >>>> all > >>>>>>>>>>>> > >>>>>>>>>>>> messages > >>>>>>>>>>>> > >>>>>>>>>>>>> to this address will be delivered to dead letters. Reason: > >>>>>>>> Operation > >>>>>>>>>>>> > >>>>>>>>>>>> timed > >>>>>>>>>>>> > >>>>>>>>>>>>> out: /10.218.98.169:6123 > >>>>>>>>>>>> > >>>>>>>>>>>>> 23:23:48,124 ERROR org.apache.flink.client.WebFrontend > >>>>>>>>>>>> > >>>>>>>>>>>>> - Unexpected exception: Could not find job manager at > specified > >>>>>>>>>>>> > >>>>>>>>>>>>> address akka.flink@10.218.98.169:6123/user/jobmanager > '>tcp:// > >>>>>>>>>>>> flink@10.218.98.169:6123/user/jobmanager. > >>>>>>>>>>>> > >>>>>>>>>>>>> java.lang.RuntimeException: Could not find job manager at > >>>> specified > >>>>>>>>>>>> > >>>>>>>>>>>>> address akka.flink@10.218.98.169:6123/user/jobmanager > '>tcp:// > >>>>>>>>>>>> flink@10.218.98.169:6123/user/jobmanager. > >>>>>>>>>>>> > >>>>>>>>>>>>> at > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>> > org.apache.flink.client.web.JobsInfoServlet.<init>(JobsInfoServlet.java:82) > >>>>>>>>>>>> > >>>>>>>>>>>>> at > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>> > org.apache.flink.client.web.WebInterfaceServer.<init>(WebInterfaceServer.java:158) > >>>>>>>>>>>> > >>>>>>>>>>>>> at > org.apache.flink.client.WebFrontend.main(WebFrontend.java:74) > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>>> On Feb 23, 2015, at 11:46 PM, Robert Metzger < > >>>> [hidden email] > >>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>>> Hi, > >>>>>>>>>>>> > >>>>>>>>>>>>>> you said in the other email thread that the error only > occurs > >>>> for > >>>>>>>>>>>> > >>>>>>>>>>>>>> Wordcount, not for Kmeans. > >>>>>>>>>>>> > >>>>>>>>>>>>>> Can you copy me the commands for both examples? > >>>>>>>>>>>> > >>>>>>>>>>>>>> I can not really believe that there is a difference between > the > >>>>>>>> two > >>>>>>>>>>>> > >>>>>>>>>>>> jobs. > >>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>>> Can you also send us the contents of the jobmanager log > file? > >>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>>> Best, > >>>>>>>>>>>> > >>>>>>>>>>>>>> Robert > >>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>>> On Mon, Feb 23, 2015 at 6:04 PM, Dulaj Viduranga < > >>>>>>>>>> [hidden email] > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>>> wrote: > >>>>>>>>>>>> > >>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>>>> I’m getting "Could not build up connection to JobManager.” > >>>> When i > >>>>>>>>>>>> > >>>>>>>>>>>> tried > >>>>>>>>>>>> > >>>>>>>>>>>>> to > >>>>>>>>>>>> > >>>>>>>>>>>>>>> run the wordCount example. Can anyone help? > >>>>>>>>>>>> > >>>>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>>>> Dulaj > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>>>> > >>>>>>>>>> > >>>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>>> > >>>> > >>>> > >>> > > > > |
In reply to this post by Stephan Ewen
In some places of the code, "localhost" is hard coded. When it is resolved by the DNS, it is posible to be directed to a different IP other than 127.0.0.1 (like private range 10.0.0.0/8). I changed those places to 127.0.0.1 and it works like a charm.
But hard coding 127.0.0.1 is not a good option because when the jobmanager ip is changed, this becomes an issue again. I'm thinking of setting jobmanager ip from the config.yaml to these places. If you have a better idea on doing this with your experience, please let me know. Best. |
If I recall correctly, we only hardcode "localhost" in the local mini
cluster - do you think it is problematic there as well? Have you found any other places? On Mon, Mar 2, 2015 at 10:26 AM, Dulaj Viduranga <[hidden email]> wrote: > In some places of the code, "localhost" is hard coded. When it is resolved > by the DNS, it is posible to be directed to a different IP other than > 127.0.0.1 (like private range 10.0.0.0/8). I changed those places to > 127.0.0.1 and it works like a charm. > But hard coding 127.0.0.1 is not a good option because when the jobmanager > ip is changed, this becomes an issue again. I'm thinking of setting > jobmanager ip from the config.yaml to these places. > If you have a better idea on doing this with your experience, please let > me know. > > Best. > |
Hi,
I found many other places “localhost” is hard coded. I changed them in a better way I think. I made a pull request. Please review. b7da22a <https://github.com/viduranga/flink/commit/b7da22a562d3da5a9be2657308c0f82e4e2f80cd> > On Mar 4, 2015, at 8:17 PM, Stephan Ewen <[hidden email]> wrote: > > If I recall correctly, we only hardcode "localhost" in the local mini > cluster - do you think it is problematic there as well? > > Have you found any other places? > > On Mon, Mar 2, 2015 at 10:26 AM, Dulaj Viduranga <[hidden email]> > wrote: > >> In some places of the code, "localhost" is hard coded. When it is resolved >> by the DNS, it is posible to be directed to a different IP other than >> 127.0.0.1 (like private range 10.0.0.0/8). I changed those places to >> 127.0.0.1 and it works like a charm. >> But hard coding 127.0.0.1 is not a good option because when the jobmanager >> ip is changed, this becomes an issue again. I'm thinking of setting >> jobmanager ip from the config.yaml to these places. >> If you have a better idea on doing this with your experience, please let >> me know. >> >> Best. >> |
The every change in the commit b7da22a is not required but I thought they are appropriate.
> On Mar 5, 2015, at 8:11 AM, Dulaj Viduranga <[hidden email]> wrote: > > Hi, > I found many other places “localhost” is hard coded. I changed them in a better way I think. I made a pull request. Please review. b7da22a <https://github.com/viduranga/flink/commit/b7da22a562d3da5a9be2657308c0f82e4e2f80cd> > >> On Mar 4, 2015, at 8:17 PM, Stephan Ewen <[hidden email]> wrote: >> >> If I recall correctly, we only hardcode "localhost" in the local mini >> cluster - do you think it is problematic there as well? >> >> Have you found any other places? >> >> On Mon, Mar 2, 2015 at 10:26 AM, Dulaj Viduranga <[hidden email]> >> wrote: >> >>> In some places of the code, "localhost" is hard coded. When it is resolved >>> by the DNS, it is posible to be directed to a different IP other than >>> 127.0.0.1 (like private range 10.0.0.0/8). I changed those places to >>> 127.0.0.1 and it works like a charm. >>> But hard coding 127.0.0.1 is not a good option because when the jobmanager >>> ip is changed, this becomes an issue again. I'm thinking of setting >>> jobmanager ip from the config.yaml to these places. >>> If you have a better idea on doing this with your experience, please let >>> me know. >>> >>> Best. >>> > |
Hi Dulaj,
I looked through your commit and noticed that the JobClient might not be listening on the right network interface. Your commit seems to fix it. I just want to understand the problem properly and therefore I opened a branch with a small change. Could you try out whether this change would also fix your problem? You can find the code here [1]. Would be awesome if you checked it out and let it run on your cluster setting. Thanks a lot Dulaj! [1] https://github.com/tillrohrmann/flink/tree/fixLocalFlinkMiniClusterJobClient On Thu, Mar 5, 2015 at 4:21 AM, Dulaj Viduranga <[hidden email]> wrote: > The every change in the commit b7da22a is not required but I thought they > are appropriate. > > > On Mar 5, 2015, at 8:11 AM, Dulaj Viduranga <[hidden email]> > wrote: > > > > Hi, > > I found many other places “localhost” is hard coded. I changed them in a > better way I think. I made a pull request. Please review. b7da22a < > https://github.com/viduranga/flink/commit/b7da22a562d3da5a9be2657308c0f82e4e2f80cd > > > > > >> On Mar 4, 2015, at 8:17 PM, Stephan Ewen <[hidden email]> wrote: > >> > >> If I recall correctly, we only hardcode "localhost" in the local mini > >> cluster - do you think it is problematic there as well? > >> > >> Have you found any other places? > >> > >> On Mon, Mar 2, 2015 at 10:26 AM, Dulaj Viduranga <[hidden email]> > >> wrote: > >> > >>> In some places of the code, "localhost" is hard coded. When it is > resolved > >>> by the DNS, it is posible to be directed to a different IP other than > >>> 127.0.0.1 (like private range 10.0.0.0/8). I changed those places to > >>> 127.0.0.1 and it works like a charm. > >>> But hard coding 127.0.0.1 is not a good option because when the > jobmanager > >>> ip is changed, this becomes an issue again. I'm thinking of setting > >>> jobmanager ip from the config.yaml to these places. > >>> If you have a better idea on doing this with your experience, please > let > >>> me know. > >>> > >>> Best. > >>> > > > > |
Hi Till,
I’m sorry. It doesn’t seem to solve the problem. The taskmanager still tries a 10.0.0.0/8 IP. Best regards. > On Mar 5, 2015, at 1:00 PM, Till Rohrmann <[hidden email]> wrote: > > Hi Dulaj, > > I looked through your commit and noticed that the JobClient might not be > listening on the right network interface. Your commit seems to fix it. I > just want to understand the problem properly and therefore I opened a > branch with a small change. Could you try out whether this change would > also fix your problem? You can find the code here [1]. Would be awesome if > you checked it out and let it run on your cluster setting. Thanks a lot > Dulaj! > > [1] > https://github.com/tillrohrmann/flink/tree/fixLocalFlinkMiniClusterJobClient > > On Thu, Mar 5, 2015 at 4:21 AM, Dulaj Viduranga <[hidden email]> > wrote: > >> The every change in the commit b7da22a is not required but I thought they >> are appropriate. >> >>> On Mar 5, 2015, at 8:11 AM, Dulaj Viduranga <[hidden email]> >> wrote: >>> >>> Hi, >>> I found many other places “localhost” is hard coded. I changed them in a >> better way I think. I made a pull request. Please review. b7da22a < >> https://github.com/viduranga/flink/commit/b7da22a562d3da5a9be2657308c0f82e4e2f80cd >>> >>> >>>> On Mar 4, 2015, at 8:17 PM, Stephan Ewen <[hidden email]> wrote: >>>> >>>> If I recall correctly, we only hardcode "localhost" in the local mini >>>> cluster - do you think it is problematic there as well? >>>> >>>> Have you found any other places? >>>> >>>> On Mon, Mar 2, 2015 at 10:26 AM, Dulaj Viduranga <[hidden email]> >>>> wrote: >>>> >>>>> In some places of the code, "localhost" is hard coded. When it is >> resolved >>>>> by the DNS, it is posible to be directed to a different IP other than >>>>> 127.0.0.1 (like private range 10.0.0.0/8). I changed those places to >>>>> 127.0.0.1 and it works like a charm. >>>>> But hard coding 127.0.0.1 is not a good option because when the >> jobmanager >>>>> ip is changed, this becomes an issue again. I'm thinking of setting >>>>> jobmanager ip from the config.yaml to these places. >>>>> If you have a better idea on doing this with your experience, please >> let >>>>> me know. >>>>> >>>>> Best. >>>>> >>> >> >> |
How did you start the flink cluster? Using the start-local.sh, the
start-cluster.sh or starting the job manager and task managers individually using taskmanager.sh/jobmanager.sh. Could you maybe post the flink-conf.yaml file, you're using? With your changes, everything works, right? On Thu, Mar 5, 2015 at 8:55 AM, Dulaj Viduranga <[hidden email]> wrote: > Hi Till, > I’m sorry. It doesn’t seem to solve the problem. The taskmanager still > tries a 10.0.0.0/8 IP. > > Best regards. > > > On Mar 5, 2015, at 1:00 PM, Till Rohrmann <[hidden email]> > wrote: > > > > Hi Dulaj, > > > > I looked through your commit and noticed that the JobClient might not be > > listening on the right network interface. Your commit seems to fix it. I > > just want to understand the problem properly and therefore I opened a > > branch with a small change. Could you try out whether this change would > > also fix your problem? You can find the code here [1]. Would be awesome > if > > you checked it out and let it run on your cluster setting. Thanks a lot > > Dulaj! > > > > [1] > > > https://github.com/tillrohrmann/flink/tree/fixLocalFlinkMiniClusterJobClient > > > > On Thu, Mar 5, 2015 at 4:21 AM, Dulaj Viduranga <[hidden email]> > > wrote: > > > >> The every change in the commit b7da22a is not required but I thought > they > >> are appropriate. > >> > >>> On Mar 5, 2015, at 8:11 AM, Dulaj Viduranga <[hidden email]> > >> wrote: > >>> > >>> Hi, > >>> I found many other places “localhost” is hard coded. I changed them in > a > >> better way I think. I made a pull request. Please review. b7da22a < > >> > https://github.com/viduranga/flink/commit/b7da22a562d3da5a9be2657308c0f82e4e2f80cd > >>> > >>> > >>>> On Mar 4, 2015, at 8:17 PM, Stephan Ewen <[hidden email]> wrote: > >>>> > >>>> If I recall correctly, we only hardcode "localhost" in the local mini > >>>> cluster - do you think it is problematic there as well? > >>>> > >>>> Have you found any other places? > >>>> > >>>> On Mon, Mar 2, 2015 at 10:26 AM, Dulaj Viduranga < > [hidden email]> > >>>> wrote: > >>>> > >>>>> In some places of the code, "localhost" is hard coded. When it is > >> resolved > >>>>> by the DNS, it is posible to be directed to a different IP other > than > >>>>> 127.0.0.1 (like private range 10.0.0.0/8). I changed those places to > >>>>> 127.0.0.1 and it works like a charm. > >>>>> But hard coding 127.0.0.1 is not a good option because when the > >> jobmanager > >>>>> ip is changed, this becomes an issue again. I'm thinking of setting > >>>>> jobmanager ip from the config.yaml to these places. > >>>>> If you have a better idea on doing this with your experience, please > >> let > >>>>> me know. > >>>>> > >>>>> Best. > >>>>> > >>> > >> > >> > > |
Using start-locat.sh.
I’m using the original config yaml. I also tried changing jobmanager address in config to “127.0.0.1 but no luck. With my changes it works ok. The conf file follows. ################################################################################ # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file # distributed with this work for additional information # regarding copyright ownership. The ASF licenses this file # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. ################################################################################ #============================================================================== # Common #============================================================================== jobmanager.rpc.address: 127.0.0.1 jobmanager.rpc.port: 6123 jobmanager.heap.mb: 256 taskmanager.heap.mb: 512 taskmanager.numberOfTaskSlots: 1 parallelization.degree.default: 1 #============================================================================== # Web Frontend #============================================================================== # The port under which the web-based runtime monitor listens. # A value of -1 deactivates the web server. jobmanager.web.port: 8081 # The port uder which the standalone web client # (for job upload and submit) listens. webclient.port: 8080 #============================================================================== # Advanced #============================================================================== # The number of buffers for the network stack. # # taskmanager.network.numberOfBuffers: 2048 # Directories for temporary files. # # Add a delimited list for multiple directories, using the system directory # delimiter (colon ':' on unix) or a comma, e.g.: # /data1/tmp:/data2/tmp:/data3/tmp # # Note: Each directory entry is read from and written to by a different I/O # thread. You can include the same directory multiple times in order to create # multiple I/O threads against that directory. This is for example relevant for # high-throughput RAIDs. # # If not specified, the system-specific Java temporary directory (java.io.tmpdir # property) is taken. # # taskmanager.tmp.dirs: /tmp # Path to the Hadoop configuration directory. # # This configuration is used when writing into HDFS. Unless specified otherwise, # HDFS file creation will use HDFS default settings with respect to block-size, # replication factor, etc. # # You can also directly specify the paths to hdfs-default.xml and hdfs-site.xml # via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'. # # fs.hdfs.hadoopconf: /path/to/hadoop/conf/ > On Mar 5, 2015, at 2:03 PM, Till Rohrmann <[hidden email]> wrote: > > How did you start the flink cluster? Using the start-local.sh, the > start-cluster.sh or starting the job manager and task managers individually > using taskmanager.sh/jobmanager.sh. Could you maybe post the > flink-conf.yaml file, you're using? > > With your changes, everything works, right? > > On Thu, Mar 5, 2015 at 8:55 AM, Dulaj Viduranga <[hidden email]> > wrote: > >> Hi Till, >> I’m sorry. It doesn’t seem to solve the problem. The taskmanager still >> tries a 10.0.0.0/8 IP. >> >> Best regards. >> >>> On Mar 5, 2015, at 1:00 PM, Till Rohrmann <[hidden email]> >> wrote: >>> >>> Hi Dulaj, >>> >>> I looked through your commit and noticed that the JobClient might not be >>> listening on the right network interface. Your commit seems to fix it. I >>> just want to understand the problem properly and therefore I opened a >>> branch with a small change. Could you try out whether this change would >>> also fix your problem? You can find the code here [1]. Would be awesome >> if >>> you checked it out and let it run on your cluster setting. Thanks a lot >>> Dulaj! >>> >>> [1] >>> >> https://github.com/tillrohrmann/flink/tree/fixLocalFlinkMiniClusterJobClient >>> >>> On Thu, Mar 5, 2015 at 4:21 AM, Dulaj Viduranga <[hidden email]> >>> wrote: >>> >>>> The every change in the commit b7da22a is not required but I thought >> they >>>> are appropriate. >>>> >>>>> On Mar 5, 2015, at 8:11 AM, Dulaj Viduranga <[hidden email]> >>>> wrote: >>>>> >>>>> Hi, >>>>> I found many other places “localhost” is hard coded. I changed them in >> a >>>> better way I think. I made a pull request. Please review. b7da22a < >>>> >> https://github.com/viduranga/flink/commit/b7da22a562d3da5a9be2657308c0f82e4e2f80cd >>>>> >>>>> >>>>>> On Mar 4, 2015, at 8:17 PM, Stephan Ewen <[hidden email]> wrote: >>>>>> >>>>>> If I recall correctly, we only hardcode "localhost" in the local mini >>>>>> cluster - do you think it is problematic there as well? >>>>>> >>>>>> Have you found any other places? >>>>>> >>>>>> On Mon, Mar 2, 2015 at 10:26 AM, Dulaj Viduranga < >> [hidden email]> >>>>>> wrote: >>>>>> >>>>>>> In some places of the code, "localhost" is hard coded. When it is >>>> resolved >>>>>>> by the DNS, it is posible to be directed to a different IP other >> than >>>>>>> 127.0.0.1 (like private range 10.0.0.0/8). I changed those places to >>>>>>> 127.0.0.1 and it works like a charm. >>>>>>> But hard coding 127.0.0.1 is not a good option because when the >>>> jobmanager >>>>>>> ip is changed, this becomes an issue again. I'm thinking of setting >>>>>>> jobmanager ip from the config.yaml to these places. >>>>>>> If you have a better idea on doing this with your experience, please >>>> let >>>>>>> me know. >>>>>>> >>>>>>> Best. >>>>>>> >>>>> >>>> >>>> >> >> |
What does the jobmanager log says? I think Stephan added some more logging
output which helps us to debug this problem. On Thu, Mar 5, 2015 at 9:36 AM, Dulaj Viduranga <[hidden email]> wrote: > Using start-locat.sh. > I’m using the original config yaml. I also tried changing jobmanager > address in config to “127.0.0.1 but no luck. With my changes it works ok. > The conf file follows. > > > ################################################################################ > # Licensed to the Apache Software Foundation (ASF) under one > # or more contributor license agreements. See the NOTICE file > # distributed with this work for additional information > # regarding copyright ownership. The ASF licenses this file > # to you under the Apache License, Version 2.0 (the > # "License"); you may not use this file except in compliance > # with the License. You may obtain a copy of the License at > # > # http://www.apache.org/licenses/LICENSE-2.0 > # > # Unless required by applicable law or agreed to in writing, software > # distributed under the License is distributed on an "AS IS" BASIS, > # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. > # See the License for the specific language governing permissions and > # limitations under the License. > > ################################################################################ > > > > #============================================================================== > # Common > > #============================================================================== > > jobmanager.rpc.address: 127.0.0.1 > > jobmanager.rpc.port: 6123 > > jobmanager.heap.mb: 256 > > taskmanager.heap.mb: 512 > > taskmanager.numberOfTaskSlots: 1 > > parallelization.degree.default: 1 > > > #============================================================================== > # Web Frontend > > #============================================================================== > > # The port under which the web-based runtime monitor listens. > # A value of -1 deactivates the web server. > > jobmanager.web.port: 8081 > > # The port uder which the standalone web client > # (for job upload and submit) listens. > > webclient.port: 8080 > > > #============================================================================== > # Advanced > > #============================================================================== > > # The number of buffers for the network stack. > # > # taskmanager.network.numberOfBuffers: 2048 > > # Directories for temporary files. > # > # Add a delimited list for multiple directories, using the system directory > # delimiter (colon ':' on unix) or a comma, e.g.: > # /data1/tmp:/data2/tmp:/data3/tmp > # > # Note: Each directory entry is read from and written to by a different I/O > # thread. You can include the same directory multiple times in order to > create > # multiple I/O threads against that directory. This is for example > relevant for > # high-throughput RAIDs. > # > # If not specified, the system-specific Java temporary directory > (java.io.tmpdir > # property) is taken. > # > # taskmanager.tmp.dirs: /tmp > > # Path to the Hadoop configuration directory. > # > # This configuration is used when writing into HDFS. Unless specified > otherwise, > # HDFS file creation will use HDFS default settings with respect to > block-size, > # replication factor, etc. > # > # You can also directly specify the paths to hdfs-default.xml and > hdfs-site.xml > # via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'. > # > # fs.hdfs.hadoopconf: /path/to/hadoop/conf/ > > > > On Mar 5, 2015, at 2:03 PM, Till Rohrmann <[hidden email]> wrote: > > > > How did you start the flink cluster? Using the start-local.sh, the > > start-cluster.sh or starting the job manager and task managers > individually > > using taskmanager.sh/jobmanager.sh. Could you maybe post the > > flink-conf.yaml file, you're using? > > > > With your changes, everything works, right? > > > > On Thu, Mar 5, 2015 at 8:55 AM, Dulaj Viduranga <[hidden email]> > > wrote: > > > >> Hi Till, > >> I’m sorry. It doesn’t seem to solve the problem. The taskmanager still > >> tries a 10.0.0.0/8 IP. > >> > >> Best regards. > >> > >>> On Mar 5, 2015, at 1:00 PM, Till Rohrmann <[hidden email]> > >> wrote: > >>> > >>> Hi Dulaj, > >>> > >>> I looked through your commit and noticed that the JobClient might not > be > >>> listening on the right network interface. Your commit seems to fix it. > I > >>> just want to understand the problem properly and therefore I opened a > >>> branch with a small change. Could you try out whether this change would > >>> also fix your problem? You can find the code here [1]. Would be awesome > >> if > >>> you checked it out and let it run on your cluster setting. Thanks a lot > >>> Dulaj! > >>> > >>> [1] > >>> > >> > https://github.com/tillrohrmann/flink/tree/fixLocalFlinkMiniClusterJobClient > >>> > >>> On Thu, Mar 5, 2015 at 4:21 AM, Dulaj Viduranga <[hidden email]> > >>> wrote: > >>> > >>>> The every change in the commit b7da22a is not required but I thought > >> they > >>>> are appropriate. > >>>> > >>>>> On Mar 5, 2015, at 8:11 AM, Dulaj Viduranga <[hidden email]> > >>>> wrote: > >>>>> > >>>>> Hi, > >>>>> I found many other places “localhost” is hard coded. I changed them > in > >> a > >>>> better way I think. I made a pull request. Please review. b7da22a < > >>>> > >> > https://github.com/viduranga/flink/commit/b7da22a562d3da5a9be2657308c0f82e4e2f80cd > >>>>> > >>>>> > >>>>>> On Mar 4, 2015, at 8:17 PM, Stephan Ewen <[hidden email]> wrote: > >>>>>> > >>>>>> If I recall correctly, we only hardcode "localhost" in the local > mini > >>>>>> cluster - do you think it is problematic there as well? > >>>>>> > >>>>>> Have you found any other places? > >>>>>> > >>>>>> On Mon, Mar 2, 2015 at 10:26 AM, Dulaj Viduranga < > >> [hidden email]> > >>>>>> wrote: > >>>>>> > >>>>>>> In some places of the code, "localhost" is hard coded. When it is > >>>> resolved > >>>>>>> by the DNS, it is posible to be directed to a different IP other > >> than > >>>>>>> 127.0.0.1 (like private range 10.0.0.0/8). I changed those places > to > >>>>>>> 127.0.0.1 and it works like a charm. > >>>>>>> But hard coding 127.0.0.1 is not a good option because when the > >>>> jobmanager > >>>>>>> ip is changed, this becomes an issue again. I'm thinking of setting > >>>>>>> jobmanager ip from the config.yaml to these places. > >>>>>>> If you have a better idea on doing this with your experience, > please > >>>> let > >>>>>>> me know. > >>>>>>> > >>>>>>> Best. > >>>>>>> > >>>>> > >>>> > >>>> > >> > >> > > |
Hi,
This is the log with setting “localhost” flink-Vidura-jobmanager-localhost.log <https://gist.github.com/viduranga/e9d43521587697de3eb5#file-flink-vidura-jobmanager-localhost-log> And this is the log with setting “127.0.0.1” flink-Vidura-jobmanager-localhost.log <https://gist.github.com/viduranga/5af6b05f204e1f4b344f#file-flink-vidura-jobmanager-localhost-log> > On Mar 5, 2015, at 2:23 PM, Till Rohrmann <[hidden email]> wrote: > > What does the jobmanager log says? I think Stephan added some more logging > output which helps us to debug this problem. > > On Thu, Mar 5, 2015 at 9:36 AM, Dulaj Viduranga <[hidden email]> > wrote: > >> Using start-locat.sh. >> I’m using the original config yaml. I also tried changing jobmanager >> address in config to “127.0.0.1 but no luck. With my changes it works ok. >> The conf file follows. >> >> >> ################################################################################ >> # Licensed to the Apache Software Foundation (ASF) under one >> # or more contributor license agreements. See the NOTICE file >> # distributed with this work for additional information >> # regarding copyright ownership. The ASF licenses this file >> # to you under the Apache License, Version 2.0 (the >> # "License"); you may not use this file except in compliance >> # with the License. You may obtain a copy of the License at >> # >> # http://www.apache.org/licenses/LICENSE-2.0 >> # >> # Unless required by applicable law or agreed to in writing, software >> # distributed under the License is distributed on an "AS IS" BASIS, >> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. >> # See the License for the specific language governing permissions and >> # limitations under the License. >> >> ################################################################################ >> >> >> >> #============================================================================== >> # Common >> >> #============================================================================== >> >> jobmanager.rpc.address: 127.0.0.1 >> >> jobmanager.rpc.port: 6123 >> >> jobmanager.heap.mb: 256 >> >> taskmanager.heap.mb: 512 >> >> taskmanager.numberOfTaskSlots: 1 >> >> parallelization.degree.default: 1 >> >> >> #============================================================================== >> # Web Frontend >> >> #============================================================================== >> >> # The port under which the web-based runtime monitor listens. >> # A value of -1 deactivates the web server. >> >> jobmanager.web.port: 8081 >> >> # The port uder which the standalone web client >> # (for job upload and submit) listens. >> >> webclient.port: 8080 >> >> >> #============================================================================== >> # Advanced >> >> #============================================================================== >> >> # The number of buffers for the network stack. >> # >> # taskmanager.network.numberOfBuffers: 2048 >> >> # Directories for temporary files. >> # >> # Add a delimited list for multiple directories, using the system directory >> # delimiter (colon ':' on unix) or a comma, e.g.: >> # /data1/tmp:/data2/tmp:/data3/tmp >> # >> # Note: Each directory entry is read from and written to by a different I/O >> # thread. You can include the same directory multiple times in order to >> create >> # multiple I/O threads against that directory. This is for example >> relevant for >> # high-throughput RAIDs. >> # >> # If not specified, the system-specific Java temporary directory >> (java.io.tmpdir >> # property) is taken. >> # >> # taskmanager.tmp.dirs: /tmp >> >> # Path to the Hadoop configuration directory. >> # >> # This configuration is used when writing into HDFS. Unless specified >> otherwise, >> # HDFS file creation will use HDFS default settings with respect to >> block-size, >> # replication factor, etc. >> # >> # You can also directly specify the paths to hdfs-default.xml and >> hdfs-site.xml >> # via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'. >> # >> # fs.hdfs.hadoopconf: /path/to/hadoop/conf/ >> >> >>> On Mar 5, 2015, at 2:03 PM, Till Rohrmann <[hidden email]> wrote: >>> >>> How did you start the flink cluster? Using the start-local.sh, the >>> start-cluster.sh or starting the job manager and task managers >> individually >>> using taskmanager.sh/jobmanager.sh. Could you maybe post the >>> flink-conf.yaml file, you're using? >>> >>> With your changes, everything works, right? >>> >>> On Thu, Mar 5, 2015 at 8:55 AM, Dulaj Viduranga <[hidden email]> >>> wrote: >>> >>>> Hi Till, >>>> I’m sorry. It doesn’t seem to solve the problem. The taskmanager still >>>> tries a 10.0.0.0/8 IP. >>>> >>>> Best regards. >>>> >>>>> On Mar 5, 2015, at 1:00 PM, Till Rohrmann <[hidden email]> >>>> wrote: >>>>> >>>>> Hi Dulaj, >>>>> >>>>> I looked through your commit and noticed that the JobClient might not >> be >>>>> listening on the right network interface. Your commit seems to fix it. >> I >>>>> just want to understand the problem properly and therefore I opened a >>>>> branch with a small change. Could you try out whether this change would >>>>> also fix your problem? You can find the code here [1]. Would be awesome >>>> if >>>>> you checked it out and let it run on your cluster setting. Thanks a lot >>>>> Dulaj! >>>>> >>>>> [1] >>>>> >>>> >> https://github.com/tillrohrmann/flink/tree/fixLocalFlinkMiniClusterJobClient >>>>> >>>>> On Thu, Mar 5, 2015 at 4:21 AM, Dulaj Viduranga <[hidden email]> >>>>> wrote: >>>>> >>>>>> The every change in the commit b7da22a is not required but I thought >>>> they >>>>>> are appropriate. >>>>>> >>>>>>> On Mar 5, 2015, at 8:11 AM, Dulaj Viduranga <[hidden email]> >>>>>> wrote: >>>>>>> >>>>>>> Hi, >>>>>>> I found many other places “localhost” is hard coded. I changed them >> in >>>> a >>>>>> better way I think. I made a pull request. Please review. b7da22a < >>>>>> >>>> >> https://github.com/viduranga/flink/commit/b7da22a562d3da5a9be2657308c0f82e4e2f80cd >>>>>>> >>>>>>> >>>>>>>> On Mar 4, 2015, at 8:17 PM, Stephan Ewen <[hidden email]> wrote: >>>>>>>> >>>>>>>> If I recall correctly, we only hardcode "localhost" in the local >> mini >>>>>>>> cluster - do you think it is problematic there as well? >>>>>>>> >>>>>>>> Have you found any other places? >>>>>>>> >>>>>>>> On Mon, Mar 2, 2015 at 10:26 AM, Dulaj Viduranga < >>>> [hidden email]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> In some places of the code, "localhost" is hard coded. When it is >>>>>> resolved >>>>>>>>> by the DNS, it is posible to be directed to a different IP other >>>> than >>>>>>>>> 127.0.0.1 (like private range 10.0.0.0/8). I changed those places >> to >>>>>>>>> 127.0.0.1 and it works like a charm. >>>>>>>>> But hard coding 127.0.0.1 is not a good option because when the >>>>>> jobmanager >>>>>>>>> ip is changed, this becomes an issue again. I'm thinking of setting >>>>>>>>> jobmanager ip from the config.yaml to these places. >>>>>>>>> If you have a better idea on doing this with your experience, >> please >>>>>> let >>>>>>>>> me know. >>>>>>>>> >>>>>>>>> Best. >>>>>>>>> >>>>>>> >>>>>> >>>>>> >>>> >>>> >> >> |
Hi Dulaj!
Okay, the logs give us some insight. Both setups seem to look good in terms of TaskManager and JobManager startup. In one of the logs (127.0.0.1) you submit a job. The job fails because the TaskManager cannot grab the JAR file from the JobManager. I think the problem is that the BLOB server binds to 0.0.0.0 - it should bind to the same address as the JobManager actor system. That should definitely be changed... On Thu, Mar 5, 2015 at 10:08 AM, Dulaj Viduranga <[hidden email]> wrote: > Hi, > This is the log with setting “localhost” > flink-Vidura-jobmanager-localhost.log < > https://gist.github.com/viduranga/e9d43521587697de3eb5#file-flink-vidura-jobmanager-localhost-log > > > > And this is the log with setting “127.0.0.1” > flink-Vidura-jobmanager-localhost.log < > https://gist.github.com/viduranga/5af6b05f204e1f4b344f#file-flink-vidura-jobmanager-localhost-log > > > > > On Mar 5, 2015, at 2:23 PM, Till Rohrmann <[hidden email]> wrote: > > > > What does the jobmanager log says? I think Stephan added some more > logging > > output which helps us to debug this problem. > > > > On Thu, Mar 5, 2015 at 9:36 AM, Dulaj Viduranga <[hidden email]> > > wrote: > > > >> Using start-locat.sh. > >> I’m using the original config yaml. I also tried changing jobmanager > >> address in config to “127.0.0.1 but no luck. With my changes it works > ok. > >> The conf file follows. > >> > >> > >> > ################################################################################ > >> # Licensed to the Apache Software Foundation (ASF) under one > >> # or more contributor license agreements. See the NOTICE file > >> # distributed with this work for additional information > >> # regarding copyright ownership. The ASF licenses this file > >> # to you under the Apache License, Version 2.0 (the > >> # "License"); you may not use this file except in compliance > >> # with the License. You may obtain a copy of the License at > >> # > >> # http://www.apache.org/licenses/LICENSE-2.0 > >> # > >> # Unless required by applicable law or agreed to in writing, software > >> # distributed under the License is distributed on an "AS IS" BASIS, > >> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > implied. > >> # See the License for the specific language governing permissions and > >> # limitations under the License. > >> > >> > ################################################################################ > >> > >> > >> > >> > #============================================================================== > >> # Common > >> > >> > #============================================================================== > >> > >> jobmanager.rpc.address: 127.0.0.1 > >> > >> jobmanager.rpc.port: 6123 > >> > >> jobmanager.heap.mb: 256 > >> > >> taskmanager.heap.mb: 512 > >> > >> taskmanager.numberOfTaskSlots: 1 > >> > >> parallelization.degree.default: 1 > >> > >> > >> > #============================================================================== > >> # Web Frontend > >> > >> > #============================================================================== > >> > >> # The port under which the web-based runtime monitor listens. > >> # A value of -1 deactivates the web server. > >> > >> jobmanager.web.port: 8081 > >> > >> # The port uder which the standalone web client > >> # (for job upload and submit) listens. > >> > >> webclient.port: 8080 > >> > >> > >> > #============================================================================== > >> # Advanced > >> > >> > #============================================================================== > >> > >> # The number of buffers for the network stack. > >> # > >> # taskmanager.network.numberOfBuffers: 2048 > >> > >> # Directories for temporary files. > >> # > >> # Add a delimited list for multiple directories, using the system > directory > >> # delimiter (colon ':' on unix) or a comma, e.g.: > >> # /data1/tmp:/data2/tmp:/data3/tmp > >> # > >> # Note: Each directory entry is read from and written to by a different > I/O > >> # thread. You can include the same directory multiple times in order to > >> create > >> # multiple I/O threads against that directory. This is for example > >> relevant for > >> # high-throughput RAIDs. > >> # > >> # If not specified, the system-specific Java temporary directory > >> (java.io.tmpdir > >> # property) is taken. > >> # > >> # taskmanager.tmp.dirs: /tmp > >> > >> # Path to the Hadoop configuration directory. > >> # > >> # This configuration is used when writing into HDFS. Unless specified > >> otherwise, > >> # HDFS file creation will use HDFS default settings with respect to > >> block-size, > >> # replication factor, etc. > >> # > >> # You can also directly specify the paths to hdfs-default.xml and > >> hdfs-site.xml > >> # via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'. > >> # > >> # fs.hdfs.hadoopconf: /path/to/hadoop/conf/ > >> > >> > >>> On Mar 5, 2015, at 2:03 PM, Till Rohrmann <[hidden email]> > wrote: > >>> > >>> How did you start the flink cluster? Using the start-local.sh, the > >>> start-cluster.sh or starting the job manager and task managers > >> individually > >>> using taskmanager.sh/jobmanager.sh. Could you maybe post the > >>> flink-conf.yaml file, you're using? > >>> > >>> With your changes, everything works, right? > >>> > >>> On Thu, Mar 5, 2015 at 8:55 AM, Dulaj Viduranga <[hidden email]> > >>> wrote: > >>> > >>>> Hi Till, > >>>> I’m sorry. It doesn’t seem to solve the problem. The taskmanager still > >>>> tries a 10.0.0.0/8 IP. > >>>> > >>>> Best regards. > >>>> > >>>>> On Mar 5, 2015, at 1:00 PM, Till Rohrmann <[hidden email]> > >>>> wrote: > >>>>> > >>>>> Hi Dulaj, > >>>>> > >>>>> I looked through your commit and noticed that the JobClient might not > >> be > >>>>> listening on the right network interface. Your commit seems to fix > it. > >> I > >>>>> just want to understand the problem properly and therefore I opened a > >>>>> branch with a small change. Could you try out whether this change > would > >>>>> also fix your problem? You can find the code here [1]. Would be > awesome > >>>> if > >>>>> you checked it out and let it run on your cluster setting. Thanks a > lot > >>>>> Dulaj! > >>>>> > >>>>> [1] > >>>>> > >>>> > >> > https://github.com/tillrohrmann/flink/tree/fixLocalFlinkMiniClusterJobClient > >>>>> > >>>>> On Thu, Mar 5, 2015 at 4:21 AM, Dulaj Viduranga < > [hidden email]> > >>>>> wrote: > >>>>> > >>>>>> The every change in the commit b7da22a is not required but I thought > >>>> they > >>>>>> are appropriate. > >>>>>> > >>>>>>> On Mar 5, 2015, at 8:11 AM, Dulaj Viduranga <[hidden email]> > >>>>>> wrote: > >>>>>>> > >>>>>>> Hi, > >>>>>>> I found many other places “localhost” is hard coded. I changed them > >> in > >>>> a > >>>>>> better way I think. I made a pull request. Please review. b7da22a < > >>>>>> > >>>> > >> > https://github.com/viduranga/flink/commit/b7da22a562d3da5a9be2657308c0f82e4e2f80cd > >>>>>>> > >>>>>>> > >>>>>>>> On Mar 4, 2015, at 8:17 PM, Stephan Ewen <[hidden email]> > wrote: > >>>>>>>> > >>>>>>>> If I recall correctly, we only hardcode "localhost" in the local > >> mini > >>>>>>>> cluster - do you think it is problematic there as well? > >>>>>>>> > >>>>>>>> Have you found any other places? > >>>>>>>> > >>>>>>>> On Mon, Mar 2, 2015 at 10:26 AM, Dulaj Viduranga < > >>>> [hidden email]> > >>>>>>>> wrote: > >>>>>>>> > >>>>>>>>> In some places of the code, "localhost" is hard coded. When it is > >>>>>> resolved > >>>>>>>>> by the DNS, it is posible to be directed to a different IP other > >>>> than > >>>>>>>>> 127.0.0.1 (like private range 10.0.0.0/8). I changed those > places > >> to > >>>>>>>>> 127.0.0.1 and it works like a charm. > >>>>>>>>> But hard coding 127.0.0.1 is not a good option because when the > >>>>>> jobmanager > >>>>>>>>> ip is changed, this becomes an issue again. I'm thinking of > setting > >>>>>>>>> jobmanager ip from the config.yaml to these places. > >>>>>>>>> If you have a better idea on doing this with your experience, > >> please > >>>>>> let > >>>>>>>>> me know. > >>>>>>>>> > >>>>>>>>> Best. > >>>>>>>>> > >>>>>>> > >>>>>> > >>>>>> > >>>> > >>>> > >> > >> > > |
But can you explain why did my fix solved it?
> On Mar 5, 2015, at 5:50 PM, Stephan Ewen <[hidden email]> wrote: > > Hi Dulaj! > > Okay, the logs give us some insight. Both setups seem to look good in terms > of TaskManager and JobManager startup. > > In one of the logs (127.0.0.1) you submit a job. The job fails because the > TaskManager cannot grab the JAR file from the JobManager. > I think the problem is that the BLOB server binds to 0.0.0.0 - it should > bind to the same address as the JobManager actor system. > > That should definitely be changed... > > On Thu, Mar 5, 2015 at 10:08 AM, Dulaj Viduranga <[hidden email]> > wrote: > >> Hi, >> This is the log with setting “localhost” >> flink-Vidura-jobmanager-localhost.log < >> https://gist.github.com/viduranga/e9d43521587697de3eb5#file-flink-vidura-jobmanager-localhost-log >>> >> >> And this is the log with setting “127.0.0.1” >> flink-Vidura-jobmanager-localhost.log < >> https://gist.github.com/viduranga/5af6b05f204e1f4b344f#file-flink-vidura-jobmanager-localhost-log >>> >> >>> On Mar 5, 2015, at 2:23 PM, Till Rohrmann <[hidden email]> wrote: >>> >>> What does the jobmanager log says? I think Stephan added some more >> logging >>> output which helps us to debug this problem. >>> >>> On Thu, Mar 5, 2015 at 9:36 AM, Dulaj Viduranga <[hidden email]> >>> wrote: >>> >>>> Using start-locat.sh. >>>> I’m using the original config yaml. I also tried changing jobmanager >>>> address in config to “127.0.0.1 but no luck. With my changes it works >> ok. >>>> The conf file follows. >>>> >>>> >>>> >> ################################################################################ >>>> # Licensed to the Apache Software Foundation (ASF) under one >>>> # or more contributor license agreements. See the NOTICE file >>>> # distributed with this work for additional information >>>> # regarding copyright ownership. The ASF licenses this file >>>> # to you under the Apache License, Version 2.0 (the >>>> # "License"); you may not use this file except in compliance >>>> # with the License. You may obtain a copy of the License at >>>> # >>>> # http://www.apache.org/licenses/LICENSE-2.0 >>>> # >>>> # Unless required by applicable law or agreed to in writing, software >>>> # distributed under the License is distributed on an "AS IS" BASIS, >>>> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or >> implied. >>>> # See the License for the specific language governing permissions and >>>> # limitations under the License. >>>> >>>> >> ################################################################################ >>>> >>>> >>>> >>>> >> #============================================================================== >>>> # Common >>>> >>>> >> #============================================================================== >>>> >>>> jobmanager.rpc.address: 127.0.0.1 >>>> >>>> jobmanager.rpc.port: 6123 >>>> >>>> jobmanager.heap.mb: 256 >>>> >>>> taskmanager.heap.mb: 512 >>>> >>>> taskmanager.numberOfTaskSlots: 1 >>>> >>>> parallelization.degree.default: 1 >>>> >>>> >>>> >> #============================================================================== >>>> # Web Frontend >>>> >>>> >> #============================================================================== >>>> >>>> # The port under which the web-based runtime monitor listens. >>>> # A value of -1 deactivates the web server. >>>> >>>> jobmanager.web.port: 8081 >>>> >>>> # The port uder which the standalone web client >>>> # (for job upload and submit) listens. >>>> >>>> webclient.port: 8080 >>>> >>>> >>>> >> #============================================================================== >>>> # Advanced >>>> >>>> >> #============================================================================== >>>> >>>> # The number of buffers for the network stack. >>>> # >>>> # taskmanager.network.numberOfBuffers: 2048 >>>> >>>> # Directories for temporary files. >>>> # >>>> # Add a delimited list for multiple directories, using the system >> directory >>>> # delimiter (colon ':' on unix) or a comma, e.g.: >>>> # /data1/tmp:/data2/tmp:/data3/tmp >>>> # >>>> # Note: Each directory entry is read from and written to by a different >> I/O >>>> # thread. You can include the same directory multiple times in order to >>>> create >>>> # multiple I/O threads against that directory. This is for example >>>> relevant for >>>> # high-throughput RAIDs. >>>> # >>>> # If not specified, the system-specific Java temporary directory >>>> (java.io.tmpdir >>>> # property) is taken. >>>> # >>>> # taskmanager.tmp.dirs: /tmp >>>> >>>> # Path to the Hadoop configuration directory. >>>> # >>>> # This configuration is used when writing into HDFS. Unless specified >>>> otherwise, >>>> # HDFS file creation will use HDFS default settings with respect to >>>> block-size, >>>> # replication factor, etc. >>>> # >>>> # You can also directly specify the paths to hdfs-default.xml and >>>> hdfs-site.xml >>>> # via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'. >>>> # >>>> # fs.hdfs.hadoopconf: /path/to/hadoop/conf/ >>>> >>>> >>>>> On Mar 5, 2015, at 2:03 PM, Till Rohrmann <[hidden email]> >> wrote: >>>>> >>>>> How did you start the flink cluster? Using the start-local.sh, the >>>>> start-cluster.sh or starting the job manager and task managers >>>> individually >>>>> using taskmanager.sh/jobmanager.sh. Could you maybe post the >>>>> flink-conf.yaml file, you're using? >>>>> >>>>> With your changes, everything works, right? >>>>> >>>>> On Thu, Mar 5, 2015 at 8:55 AM, Dulaj Viduranga <[hidden email]> >>>>> wrote: >>>>> >>>>>> Hi Till, >>>>>> I’m sorry. It doesn’t seem to solve the problem. The taskmanager still >>>>>> tries a 10.0.0.0/8 IP. >>>>>> >>>>>> Best regards. >>>>>> >>>>>>> On Mar 5, 2015, at 1:00 PM, Till Rohrmann <[hidden email]> >>>>>> wrote: >>>>>>> >>>>>>> Hi Dulaj, >>>>>>> >>>>>>> I looked through your commit and noticed that the JobClient might not >>>> be >>>>>>> listening on the right network interface. Your commit seems to fix >> it. >>>> I >>>>>>> just want to understand the problem properly and therefore I opened a >>>>>>> branch with a small change. Could you try out whether this change >> would >>>>>>> also fix your problem? You can find the code here [1]. Would be >> awesome >>>>>> if >>>>>>> you checked it out and let it run on your cluster setting. Thanks a >> lot >>>>>>> Dulaj! >>>>>>> >>>>>>> [1] >>>>>>> >>>>>> >>>> >> https://github.com/tillrohrmann/flink/tree/fixLocalFlinkMiniClusterJobClient >>>>>>> >>>>>>> On Thu, Mar 5, 2015 at 4:21 AM, Dulaj Viduranga < >> [hidden email]> >>>>>>> wrote: >>>>>>> >>>>>>>> The every change in the commit b7da22a is not required but I thought >>>>>> they >>>>>>>> are appropriate. >>>>>>>> >>>>>>>>> On Mar 5, 2015, at 8:11 AM, Dulaj Viduranga <[hidden email]> >>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi, >>>>>>>>> I found many other places “localhost” is hard coded. I changed them >>>> in >>>>>> a >>>>>>>> better way I think. I made a pull request. Please review. b7da22a < >>>>>>>> >>>>>> >>>> >> https://github.com/viduranga/flink/commit/b7da22a562d3da5a9be2657308c0f82e4e2f80cd >>>>>>>>> >>>>>>>>> >>>>>>>>>> On Mar 4, 2015, at 8:17 PM, Stephan Ewen <[hidden email]> >> wrote: >>>>>>>>>> >>>>>>>>>> If I recall correctly, we only hardcode "localhost" in the local >>>> mini >>>>>>>>>> cluster - do you think it is problematic there as well? >>>>>>>>>> >>>>>>>>>> Have you found any other places? >>>>>>>>>> >>>>>>>>>> On Mon, Mar 2, 2015 at 10:26 AM, Dulaj Viduranga < >>>>>> [hidden email]> >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>>> In some places of the code, "localhost" is hard coded. When it is >>>>>>>> resolved >>>>>>>>>>> by the DNS, it is posible to be directed to a different IP other >>>>>> than >>>>>>>>>>> 127.0.0.1 (like private range 10.0.0.0/8). I changed those >> places >>>> to >>>>>>>>>>> 127.0.0.1 and it works like a charm. >>>>>>>>>>> But hard coding 127.0.0.1 is not a good option because when the >>>>>>>> jobmanager >>>>>>>>>>> ip is changed, this becomes an issue again. I'm thinking of >> setting >>>>>>>>>>> jobmanager ip from the config.yaml to these places. >>>>>>>>>>> If you have a better idea on doing this with your experience, >>>> please >>>>>>>> let >>>>>>>>>>> me know. >>>>>>>>>>> >>>>>>>>>>> Best. >>>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>> >>>>>> >>>> >>>> >> >> |
Could you submit a job when you set the job manager address to "localhost"?
I did not see any logging statements of received jobs. If you did, could you also send the logs of the client? The 0.0.0.0 to which the BlobServer binds works for me on my machine. I cannot remember that we had problems with that before. But I agree, we should set it to the network interface which the JobManager uses. I cannot explain why your fix solves the problem. It does not touch any of the JobClient/JobManager logic. I updated my local branch [1] with a fix for the BlobServer. Could you try it out again and send us the logs? Thanks a lot for your help Dulaj. On Thu, Mar 5, 2015 at 1:24 PM, Dulaj Viduranga <[hidden email]> wrote: > But can you explain why did my fix solved it? > > > On Mar 5, 2015, at 5:50 PM, Stephan Ewen <[hidden email]> wrote: > > > > Hi Dulaj! > > > > Okay, the logs give us some insight. Both setups seem to look good in > terms > > of TaskManager and JobManager startup. > > > > In one of the logs (127.0.0.1) you submit a job. The job fails because > the > > TaskManager cannot grab the JAR file from the JobManager. > > I think the problem is that the BLOB server binds to 0.0.0.0 - it should > > bind to the same address as the JobManager actor system. > > > > That should definitely be changed... > > > > On Thu, Mar 5, 2015 at 10:08 AM, Dulaj Viduranga <[hidden email]> > > wrote: > > > >> Hi, > >> This is the log with setting “localhost” > >> flink-Vidura-jobmanager-localhost.log < > >> > https://gist.github.com/viduranga/e9d43521587697de3eb5#file-flink-vidura-jobmanager-localhost-log > >>> > >> > >> And this is the log with setting “127.0.0.1” > >> flink-Vidura-jobmanager-localhost.log < > >> > https://gist.github.com/viduranga/5af6b05f204e1f4b344f#file-flink-vidura-jobmanager-localhost-log > >>> > >> > >>> On Mar 5, 2015, at 2:23 PM, Till Rohrmann <[hidden email]> > wrote: > >>> > >>> What does the jobmanager log says? I think Stephan added some more > >> logging > >>> output which helps us to debug this problem. > >>> > >>> On Thu, Mar 5, 2015 at 9:36 AM, Dulaj Viduranga <[hidden email]> > >>> wrote: > >>> > >>>> Using start-locat.sh. > >>>> I’m using the original config yaml. I also tried changing jobmanager > >>>> address in config to “127.0.0.1 but no luck. With my changes it works > >> ok. > >>>> The conf file follows. > >>>> > >>>> > >>>> > >> > ################################################################################ > >>>> # Licensed to the Apache Software Foundation (ASF) under one > >>>> # or more contributor license agreements. See the NOTICE file > >>>> # distributed with this work for additional information > >>>> # regarding copyright ownership. The ASF licenses this file > >>>> # to you under the Apache License, Version 2.0 (the > >>>> # "License"); you may not use this file except in compliance > >>>> # with the License. You may obtain a copy of the License at > >>>> # > >>>> # http://www.apache.org/licenses/LICENSE-2.0 > >>>> # > >>>> # Unless required by applicable law or agreed to in writing, software > >>>> # distributed under the License is distributed on an "AS IS" BASIS, > >>>> # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > >> implied. > >>>> # See the License for the specific language governing permissions and > >>>> # limitations under the License. > >>>> > >>>> > >> > ################################################################################ > >>>> > >>>> > >>>> > >>>> > >> > #============================================================================== > >>>> # Common > >>>> > >>>> > >> > #============================================================================== > >>>> > >>>> jobmanager.rpc.address: 127.0.0.1 > >>>> > >>>> jobmanager.rpc.port: 6123 > >>>> > >>>> jobmanager.heap.mb: 256 > >>>> > >>>> taskmanager.heap.mb: 512 > >>>> > >>>> taskmanager.numberOfTaskSlots: 1 > >>>> > >>>> parallelization.degree.default: 1 > >>>> > >>>> > >>>> > >> > #============================================================================== > >>>> # Web Frontend > >>>> > >>>> > >> > #============================================================================== > >>>> > >>>> # The port under which the web-based runtime monitor listens. > >>>> # A value of -1 deactivates the web server. > >>>> > >>>> jobmanager.web.port: 8081 > >>>> > >>>> # The port uder which the standalone web client > >>>> # (for job upload and submit) listens. > >>>> > >>>> webclient.port: 8080 > >>>> > >>>> > >>>> > >> > #============================================================================== > >>>> # Advanced > >>>> > >>>> > >> > #============================================================================== > >>>> > >>>> # The number of buffers for the network stack. > >>>> # > >>>> # taskmanager.network.numberOfBuffers: 2048 > >>>> > >>>> # Directories for temporary files. > >>>> # > >>>> # Add a delimited list for multiple directories, using the system > >> directory > >>>> # delimiter (colon ':' on unix) or a comma, e.g.: > >>>> # /data1/tmp:/data2/tmp:/data3/tmp > >>>> # > >>>> # Note: Each directory entry is read from and written to by a > different > >> I/O > >>>> # thread. You can include the same directory multiple times in order > to > >>>> create > >>>> # multiple I/O threads against that directory. This is for example > >>>> relevant for > >>>> # high-throughput RAIDs. > >>>> # > >>>> # If not specified, the system-specific Java temporary directory > >>>> (java.io.tmpdir > >>>> # property) is taken. > >>>> # > >>>> # taskmanager.tmp.dirs: /tmp > >>>> > >>>> # Path to the Hadoop configuration directory. > >>>> # > >>>> # This configuration is used when writing into HDFS. Unless specified > >>>> otherwise, > >>>> # HDFS file creation will use HDFS default settings with respect to > >>>> block-size, > >>>> # replication factor, etc. > >>>> # > >>>> # You can also directly specify the paths to hdfs-default.xml and > >>>> hdfs-site.xml > >>>> # via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'. > >>>> # > >>>> # fs.hdfs.hadoopconf: /path/to/hadoop/conf/ > >>>> > >>>> > >>>>> On Mar 5, 2015, at 2:03 PM, Till Rohrmann <[hidden email]> > >> wrote: > >>>>> > >>>>> How did you start the flink cluster? Using the start-local.sh, the > >>>>> start-cluster.sh or starting the job manager and task managers > >>>> individually > >>>>> using taskmanager.sh/jobmanager.sh. Could you maybe post the > >>>>> flink-conf.yaml file, you're using? > >>>>> > >>>>> With your changes, everything works, right? > >>>>> > >>>>> On Thu, Mar 5, 2015 at 8:55 AM, Dulaj Viduranga < > [hidden email]> > >>>>> wrote: > >>>>> > >>>>>> Hi Till, > >>>>>> I’m sorry. It doesn’t seem to solve the problem. The taskmanager > still > >>>>>> tries a 10.0.0.0/8 IP. > >>>>>> > >>>>>> Best regards. > >>>>>> > >>>>>>> On Mar 5, 2015, at 1:00 PM, Till Rohrmann <[hidden email] > > > >>>>>> wrote: > >>>>>>> > >>>>>>> Hi Dulaj, > >>>>>>> > >>>>>>> I looked through your commit and noticed that the JobClient might > not > >>>> be > >>>>>>> listening on the right network interface. Your commit seems to fix > >> it. > >>>> I > >>>>>>> just want to understand the problem properly and therefore I > opened a > >>>>>>> branch with a small change. Could you try out whether this change > >> would > >>>>>>> also fix your problem? You can find the code here [1]. Would be > >> awesome > >>>>>> if > >>>>>>> you checked it out and let it run on your cluster setting. Thanks a > >> lot > >>>>>>> Dulaj! > >>>>>>> > >>>>>>> [1] > >>>>>>> > >>>>>> > >>>> > >> > https://github.com/tillrohrmann/flink/tree/fixLocalFlinkMiniClusterJobClient > >>>>>>> > >>>>>>> On Thu, Mar 5, 2015 at 4:21 AM, Dulaj Viduranga < > >> [hidden email]> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> The every change in the commit b7da22a is not required but I > thought > >>>>>> they > >>>>>>>> are appropriate. > >>>>>>>> > >>>>>>>>> On Mar 5, 2015, at 8:11 AM, Dulaj Viduranga < > [hidden email]> > >>>>>>>> wrote: > >>>>>>>>> > >>>>>>>>> Hi, > >>>>>>>>> I found many other places “localhost” is hard coded. I changed > them > >>>> in > >>>>>> a > >>>>>>>> better way I think. I made a pull request. Please review. b7da22a > < > >>>>>>>> > >>>>>> > >>>> > >> > https://github.com/viduranga/flink/commit/b7da22a562d3da5a9be2657308c0f82e4e2f80cd > >>>>>>>>> > >>>>>>>>> > >>>>>>>>>> On Mar 4, 2015, at 8:17 PM, Stephan Ewen <[hidden email]> > >> wrote: > >>>>>>>>>> > >>>>>>>>>> If I recall correctly, we only hardcode "localhost" in the local > >>>> mini > >>>>>>>>>> cluster - do you think it is problematic there as well? > >>>>>>>>>> > >>>>>>>>>> Have you found any other places? > >>>>>>>>>> > >>>>>>>>>> On Mon, Mar 2, 2015 at 10:26 AM, Dulaj Viduranga < > >>>>>> [hidden email]> > >>>>>>>>>> wrote: > >>>>>>>>>> > >>>>>>>>>>> In some places of the code, "localhost" is hard coded. When it > is > >>>>>>>> resolved > >>>>>>>>>>> by the DNS, it is posible to be directed to a different IP > other > >>>>>> than > >>>>>>>>>>> 127.0.0.1 (like private range 10.0.0.0/8). I changed those > >> places > >>>> to > >>>>>>>>>>> 127.0.0.1 and it works like a charm. > >>>>>>>>>>> But hard coding 127.0.0.1 is not a good option because when the > >>>>>>>> jobmanager > >>>>>>>>>>> ip is changed, this becomes an issue again. I'm thinking of > >> setting > >>>>>>>>>>> jobmanager ip from the config.yaml to these places. > >>>>>>>>>>> If you have a better idea on doing this with your experience, > >>>> please > >>>>>>>> let > >>>>>>>>>>> me know. > >>>>>>>>>>> > >>>>>>>>>>> Best. > >>>>>>>>>>> > >>>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>> > >>>>>> > >>>> > >>>> > >> > >> > > |
Free forum by Nabble | Edit this page |