Hi,
I'm running the KMeans Java and Scala examples on two nodes. It works fine with very small files (3 MB), but when I try files of 30 MB or bigger the process never ends. After several hours, the DataChain process that is reading the input points is still working. I have tried before with much bigger files in the same environment and had no issue.

I have already tried:
- Checking that the process is not locked using all the CPU time.
- Formatting the datanodes.
- Compiling the latest version available on GitHub.
- Running in debug log mode, which doesn't give any additional information.

Could someone give me a hint about where to look? Thanks for your help!

Regards // Saludos // Mit Freundlichen Grüßen // Bien cordialement,
Pino
|
Have you looked at a jstack dump on one of the workers? That typically helps to find out where the processes are stuck.

-s

On 22.06.2014 13:32, "José Luis López Pino" <[hidden email]> wrote:
|
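Following Sebastian's suggestion: a dump is taken with `jstack <pid> > dump.txt`, and the stuck threads can then be picked out by filtering on the reported thread state. A minimal sketch of such a filter (the helper name is ours; the dump excerpt below mimics the standard HotSpot format):

```python
import re

def thread_states(jstack_dump: str) -> dict:
    """Map each thread name in a jstack dump to its reported state."""
    states = {}
    name = None
    for line in jstack_dump.splitlines():
        m = re.match(r'^"([^"]+)"', line)  # thread header: "name" prio=... tid=...
        if m:
            name = m.group(1)
        elif name and "java.lang.Thread.State:" in line:
            # keep only the state keyword, e.g. TIMED_WAITING
            states[name] = line.split("java.lang.Thread.State:")[1].strip().split()[0]
            name = None
    return states

sample = '''"DataSource (points)" prio=10 tid=0x1 nid=0x2 in Object.wait()
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
"main" prio=10 tid=0x3 nid=0x4 runnable
   java.lang.Thread.State: RUNNABLE
'''

stuck = {n: s for n, s in thread_states(sample).items()
         if s in ("BLOCKED", "WAITING", "TIMED_WAITING")}
print(stuck)  # {'DataSource (points)': 'TIMED_WAITING'}
```

This is only a triage aid; for the actual diagnosis below, reading the full stack of the waiting thread is what matters.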
It seems like the thread reading the points file is locked waiting for a buffer from the global buffer pool that never arrives. What could be causing this?

    java.lang.Thread.State: TIMED_WAITING (on object monitor)
        at java.lang.Object.wait(Native Method)
        - waiting on <0x6b985888> (a java.util.ArrayDeque)
        at eu.stratosphere.runtime.io.network.bufferprovider.LocalBufferPool.requestBuffer(LocalBufferPool.java:160)
        - locked <0x6b985888> (a java.util.ArrayDeque)
        at eu.stratosphere.runtime.io.network.bufferprovider.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:101)
        at eu.stratosphere.runtime.io.gates.InputGate.requestBufferBlocking(InputGate.java:333)
        at eu.stratosphere.runtime.io.channels.InputChannel.requestBufferBlocking(InputChannel.java:426)
        at eu.stratosphere.runtime.io.network.ChannelManager.dispatchFromOutputChannel(ChannelManager.java:441)
        at eu.stratosphere.runtime.io.channels.OutputChannel.sendBuffer(OutputChannel.java:74)
        at eu.stratosphere.runtime.io.gates.OutputGate.sendBuffer(OutputGate.java:49)
        at eu.stratosphere.runtime.io.api.BufferWriter.sendBuffer(BufferWriter.java:35)
        at eu.stratosphere.runtime.io.api.RecordWriter.emit(RecordWriter.java:96)
        at eu.stratosphere.pact.runtime.shipping.OutputCollector.collect(OutputCollector.java:82)
        at eu.stratosphere.pact.runtime.task.chaining.ChainedMapDriver.collect(ChainedMapDriver.java:71)
        at eu.stratosphere.pact.runtime.task.DataSourceTask.invoke(DataSourceTask.java:228)
        at eu.stratosphere.nephele.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:284)
        at java.lang.Thread.run(Thread.java:744)

Thanks for your help, Sebastian.

Regards // Saludos // Mit Freundlichen Grüßen // Bien cordialement,
Pino

On 22 June 2014 13:38, Sebastian Schelter <[hidden email]> wrote:
|
You could try to increase the number of buffers available to the network stack. That solved similar problems for me in the past.

-s

On 22.06.2014 13:48, "José Luis López Pino" <[hidden email]> wrote:
|
Workers waiting in "LocalBufferPool.requestBuffer()" is usually a sign of a distributed deadlock.

Can you send me some instructions on how to get the same input data you have (download URL? generator settings?) and the configuration parameters you are using (max iteration limit, k, ...) when calling the K-Means example? I would like to try it on our cluster.

Just out of curiosity, what hardware are you using? Is it the IBM Power cluster at TU Berlin?

Robert

On Sun, Jun 22, 2014 at 1:53 PM, Sebastian Schelter <[hidden email]> wrote:
|
Hi,

I'm using two instances of a VPS, with this input for the program:
- Iterations: 2
- Dimensions: 2 (3 for the Scala example program)
- Number of centers (k): 10

This is my current configuration for the network buffers (I think these are the default values):

    # Number of network buffers (used by each TaskManager)
    taskmanager.network.numberOfBuffers: 2048
    # Size of network buffers
    taskmanager.network.bufferSizeInBytes: 32768

Regards // Saludos // Mit Freundlichen Grüßen // Bien cordialement,
Pino

On 22 June 2014 14:19, Robert Metzger <[hidden email]> wrote:

Attachment: ford2.py (302 bytes)
|
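As an aside, the two values Pino quotes pin down exactly how much memory each TaskManager reserves for the network stack; the arithmetic is simple:

```python
number_of_buffers = 2048   # taskmanager.network.numberOfBuffers
buffer_size = 32768        # taskmanager.network.bufferSizeInBytes

total_bytes = number_of_buffers * buffer_size
print(total_bytes // (1024 * 1024), "MiB")  # 64 MiB
```

So with the quoted defaults, 64 MiB per TaskManager goes to network buffers, which is small next to typical heap sizes and consistent with these being default values.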
Thank you. What degree of parallelism are you using when submitting the job?
You can either set it with the "-p" argument or with env.setDegreeOfParallelism(). How much heap space do you assign to the TaskManagers?

On Sun, Jun 22, 2014 at 3:07 PM, José Luis López Pino <[hidden email]> wrote:
|
In reply to this post by Robert Metzger
There was a patch for deadlocks on broadcast variables a few days ago.
Can you try the current master branch (0.6-SNAPSHOT) and see if that solves your problem? |
I think Pino wrote that he is using the latest master.
I just finished running KMeans on a cluster with the following configuration:
- 2 nodes, 18 GB heap space each
- DOP = 32
- 29 MB input data, 10 centers, 15 iterations max.

I also reduced the heap space to 1 GB and both runs worked like a charm. I've added a TODO to my list to also test with more data.

On Sun, Jun 22, 2014 at 3:55 PM, Stephan Ewen <[hidden email]> wrote:
|
Yes, I pulled and compiled the latest master from GitHub.

Thank you for the test, Robert. I'll double-check the configuration of both nodes then; there must be something wrong. I've tried to execute the job with p = 1, 2 and 4.

Regards // Saludos // Mit Freundlichen Grüßen // Bien cordialement,
Pino

On 22 June 2014 16:43, Robert Metzger <[hidden email]> wrote:
|
Okay, let us know if you find the solution.

If you want, we can also do a short Google Hangouts session with screen sharing; maybe I'll spot something.

On Sun, Jun 22, 2014 at 5:09 PM, José Luis López Pino <[hidden email]> wrote:
|