I am running a Giraph application on EMR, using a cluster of 1 master and 10 slaves (CORE), all m3.2xlarge instances.
The application is a BFS that runs over the Spanish version of Wikipedia (the data was adapted, obviously, so that it can be loaded correctly into Giraph).
I ran the application as follows:
/home/hadoop/bin/yarn jar /home/hadoop/giraph/giraph.jar ar.edu.info.unlp.tesina.lectura.grafo.algoritmos.masivos.BusquedaDeCaminosNavegacionalesWikiquotesMasivo /tmp/vertices.txt 4 -@- 1 ar.edu.info.unlp.tesina.lectura.grafo.BusquedaDeCaminosNavegacionalesWikiquote -vif ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueInputFormat -vip /user/hduser/input/grafo-wikipedia.txt -vof ar.edu.info.unlp.tesina.vertice.estructuras.IdTextWithComplexValueOutputFormat -op /user/hduser/output/caminosNavegacionales -w 10 -yh 11500 -ca giraph.metrics.enable=true,giraph.useOutOfCoreMessages=true,giraph.isStaticGraph=true,giraph.numInputThreads=4,giraph.numOutputThreads=4
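For reference, my understanding of the memory- and threading-related pieces of that command (annotations mine):

-w 10                                # 10 workers, one per core node
-yh 11500                            # heap per YARN task, in MB; it becomes -Xmx11500M -Xms11500M
                                     # in the launched container (visible in the process dump below)
-ca giraph.metrics.enable=true       # per-superstep metrics in the worker logs
    giraph.useOutOfCoreMessages=true # spill incoming messages to disk instead of keeping them all in RAM
    giraph.isStaticGraph=true        # the graph topology is not mutated between supersteps
    giraph.numInputThreads=4         # threads loading the input splits
    giraph.numOutputThreads=4        # threads writing the output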
I can execute this application correctly with 3 supersteps, but if I run it with 4, one container is killed (its RAM usage goes beyond the limit) and the remaining workers fail as a consequence.
Looking at the Giraph ApplicationMaster logs, I see the following:
16/08/15 03:44:32 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-0-147.sa-east-1.compute.internal:9103
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: START_CONTAINER for Container container_1471231949464_0001_01_000005
16/08/15 03:44:32 INFO impl.ContainerManagementProtocolProxy: Opening proxy : ip-172-31-0-145.sa-east-1.compute.internal:9103
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000009
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000011
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000004
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000010
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000006
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000007
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000008
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000005
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000002
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000012
16/08/15 03:44:32 INFO impl.NMClientAsyncImpl: Processing Event EventType: QUERY_CONTAINER for Container container_1471231949464_0001_01_000003
16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: Got response from RM for container ask, completedCnt=1
16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000008, state=COMPLETE, exitStatus=143, diagnostics=Container [pid=4455,containerID=container_1471231949464_0001_01_000008] is running beyond physical memory limits. Current usage: 11.4 GB of 11.3 GB physical memory used; 12.6 GB of 56.3 GB virtual memory used. Killing container.
Dump of the process-tree for container_1471231949464_0001_01_000008 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 4459 4455 4455 4455 (java) 13568 5567 13419675648 2982187 java -Xmx11500M -Xms11500M -cp .:${CLASSPATH}:./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/* org.apache.giraph.yarn.GiraphYarnTask 1471231949464 1 8 1
|- 4455 2706 4455 4455 (bash) 0 0 115875840 807 /bin/bash -c java -Xmx11500M -Xms11500M -cp .:${CLASSPATH}:./*:$HADOOP_CONF_DIR:$HADOOP_COMMON_HOME/share/hadoop/common/*:$HADOOP_COMMON_HOME/share/hadoop/common/lib/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/*:$HADOOP_HDFS_HOME/share/hadoop/hdfs/lib/*:$HADOOP_YARN_HOME/share/hadoop/yarn/*:$HADOOP_YARN_HOME/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/*:$HADOOP_MAPRED_HOME/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:./*:/home/hadoop/conf:/home/hadoop/share/hadoop/common/*:/home/hadoop/share/hadoop/common/lib/*:/home/hadoop/share/hadoop/hdfs/*:/home/hadoop/share/hadoop/hdfs/lib/*:/home/hadoop/share/hadoop/yarn/*:/home/hadoop/share/hadoop/yarn/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/*:/home/hadoop/share/hadoop/mapreduce/*:/home/hadoop/share/hadoop/mapreduce/lib/*:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/lib/*:/usr/share/aws/emr/auxlib/* org.apache.giraph.yarn.GiraphYarnTask 1471231949464 1 8 1 1>/mnt/var/log/hadoop/userlogs/application_1471231949464_0001/container_1471231949464_0001_01_000008/task-8-stdout.log 2>/mnt/var/log/hadoop/userlogs/application_1471231949464_0001/container_1471231949464_0001_01_000008/task-8-stderr.log
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
16/08/15 03:46:53 INFO yarn.GiraphApplicationMaster: After completion of one conatiner. current status is: completedCount :1 containersToLaunch :11 successfulCount :0 failedCount :1
16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got response from RM for container ask, completedCnt=7
16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000002, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
org.apache.hadoop.util.Shell$ExitCodeException:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:501)
at org.apache.hadoop.util.Shell.run(Shell.java:418)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:655)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:200)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
16/08/15 03:46:55 INFO yarn.GiraphApplicationMaster: Got container status for containerID=container_1471231949464_0001_01_000012, state=COMPLETE, exitStatus=1, diagnostics=Exception from container-launch: org.apache.hadoop.util.Shell$ExitCodeException:
(the same Shell$ExitCodeException stack trace, ending in exit code 1, repeats for containers 000012, 000006 and 000007; I have omitted the repeats)
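Before going further: the numbers in the kill diagnostic for container 8 line up with the JVM flags visible in the process dump above (rough arithmetic on my part):

-Xmx11500M -Xms11500M  ->  ~11.2 GiB of Java heap, fully committed up front because of -Xms
physical limit         ->  11.3 GB for the entire process
headroom               ->  roughly 100 MB for everything living outside the heap
                           (metaspace/permgen, thread stacks, code cache, Netty/NIO direct buffers)

So the heap alone nearly exhausts the container's allowance, and any noticeable off-heap allocation pushes the resident size (the 11.4 GB YARN measured) over the limit.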
So there seems to be a memory problem in container 8, but the container's own logs suggest it is not really so. The last log lines of this container are the following:
16/08/15 03:46:52 INFO graph.ComputeCallable: call: Computation took 23.90834 secs for 10 partitions on superstep 3. Flushing started
16/08/15 03:46:52 INFO worker.BspServiceWorker: finishSuperstep: Waiting on all requests, superstep 3 Memory (free/total/max) = 4516.47M / 10619.50M / 10619.50M
16/08/15 03:46:52 INFO netty.NettyClient: logInfoAboutOpenRequests: Waiting interval of 15000 msecs, 1307 open requests, waiting for it to be <= 0, MBytes/sec received = 0.0029, MBytesReceived = 0.0678, ave received req MBytes = 0, secs waited = 23.332
MBytes/sec sent = 143.2912, MBytesSent = 3343.4141, ave sent req MBytes = 0.4999, secs waited = 23.332
16/08/15 03:46:52 INFO netty.NettyClient: logInfoAboutOpenRequests: 548 requests for taskId=10, 504 requests for taskId=0, 251 requests for taskId=5, 1 requests for taskId=4, 1 requests for taskId=7, 1 requests for taskId=8,
So, if I understand correctly, the container had 4516.47M free before the flush, and the moment it flushes it consumes all of the remaining memory and even a little more, so it goes over the limit and is killed. One caveat in my own reading: the 4516.47M is free Java heap, while the "11.4 GB of 11.3 GB" in the kill message is the resident memory of the whole process, so the two figures are not measuring the same thing. Also, judging by the diagnostics, the kill comes from YARN's NodeManager enforcing the container's physical memory limit, not from Giraph's ApplicationMaster.
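Doing the arithmetic on that last NettyClient line (my estimate, assuming the average is representative):

1307 open requests * ~0.5 MB per sent request (ave sent req MBytes = 0.4999)
  = ~650 MB of serialized messages still buffered for sending when the flush starts

As far as I understand Netty, those send buffers can live outside the Java heap (direct buffers), so they would not show up in the free-heap figure above, yet they do count toward YARN's physical memory limit, which would explain being killed with 4.5 GB of heap still free.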
I do not understand why the flush needs so much memory (admittedly, I do not fully understand what the flush does; as far as I can tell, it is the worker sending all the messages it buffered during the superstep to the other workers). And I do not understand why this happens, or how to correct the problem.
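In case it helps frame an answer, these are the two directions I am considering (hypothetical values, option names from Giraph's NettyClient as I understand them; I have not verified either on this cluster):

# 1) Throttle how many Netty requests a worker may keep in flight, so the flush
#    sends in smaller waves instead of buffering ~1300 requests at once:
-ca giraph.waitForRequestsConfirmation=true,giraph.maxNumberOfOpenRequests=500

# 2) Disable YARN's physical-memory check in yarn-site.xml, only to confirm the
#    diagnosis (a blunt instrument, not a real fix):
yarn.nodemanager.pmem-check-enabled=false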