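I must be fate's favorite: in my hands, Yarn's virtual memory check actually blew up. Yes, it blew up. This is only a test cluster, the data volume is small, and the memory I request at a time is small too, yet the virtual memory limit was still exceeded. Here is the case analysis.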

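Case replay

Here is what happened:

For project reasons, Yarn's original scheduler, Capacity Scheduler, was no longer a good fit, so I switched to a different one: Fair Scheduler. The result after configuring it is shown in the figure below:

This shows that my configuration is fine.

Now I submit the request to start my Flink cluster: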

./yarn-session.sh \
-n 3 \
-s 6 \
-jm 256 \
-tm 1024 \
-nm "flink on yarn" \
-d

Then the problem appeared:
[Screenshot: yarn-error]

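The message says that deploying the Flink cluster took longer than 60 s, and it asks us to check whether the requested resources are available in the Yarn cluster. In other words: something is wrong with the Yarn cluster, go find the cause yourself. The only way to find the cause is to read the log files, so I located the NodeManager log and searched it for relevant information: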

325.2 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used
2020-04-01 15:25:37,410 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 209.6 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used
2020-04-01 15:25:40,419 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.3 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:40,427 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 340.0 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:43,450 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.4 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:43,481 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 340.1 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:46,503 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.4 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:46,526 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 334.9 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:49,545 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.4 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:49,586 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 334.8 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:52,607 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24064 for container-id container_1585725830038_0003_02_000001: 336.6 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:52,640 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Memory usage of ProcessTree 24195 for container-id container_1585725830038_0002_02_000001: 334.9 MB of 1 GB physical memory used; 2.3 GB of 2.1 GB virtual memory used
2020-04-01 15:25:53,040 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/module/hadoop-2.7.2/data/tmp/nm-local-dir error, used space above threshold of 90.0%, removing from list of valid directories
2020-04-01 15:25:53,040 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/module/hadoop-2.7.2/logs/userlogs error, used space above threshold of 90.0%, removing from list of valid directories
2020-04-01 15:25:53,040 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) failed: 1/1 local-dirs are bad: /opt/module/hadoop-2.7.2/data/tmp/nm-local-dir; 1/1 log-dirs are bad: /opt/module/hadoop-2.7.2/logs/userlogs
2020-04-01 15:25:53,040 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed. 1/1 local-dirs are bad: /opt/module/hadoop-2.7.2/data/tmp/nm-local-dir; 1/1 log-dirs are bad: /opt/module/hadoop-2.7.2/logs/userlogs

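I found the following passage; to make it easier to read, here is a screenshot:
[Screenshot: yarn-error-oom]
It says that one of our containers, container_1585725830038_0003_02_000001, used 336.4 MB of its 1 GB of physical memory and 2.3 GB of its 2.1 GB of virtual memory. The notation I use here is: actual usage / limit.

The virtual memory figure is obviously wrong: the limit is only 2.1 GB, so how did usage reach 2.3 GB? Isn't that exactly an OOM?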
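To avoid eyeballing lines like the ones in the log above, the check can be scripted. The following is only a minimal sketch and not part of the original troubleshooting: the function name find_vmem_overruns and the regular expression are my own, and the pattern assumes the NodeManager line format quoted above, with sizes reported in KB, MB, or GB.

```python
import re

# Assumed line format (taken from the log excerpt above):
#   "... container-id <id>: <p> MB of <p> GB physical memory used;
#    <v> GB of <v> GB virtual memory used"
USAGE_RE = re.compile(
    r"container-id (?P<cid>container_\w+): .*?; "
    r"(?P<used>[\d.]+) (?P<used_unit>[KMG]B) of "
    r"(?P<limit>[\d.]+) (?P<limit_unit>[KMG]B) virtual memory used"
)

# Conversion factors to MB for the units this sketch understands.
UNIT_TO_MB = {"KB": 1 / 1024, "MB": 1.0, "GB": 1024.0}


def to_mb(value, unit):
    """Convert a size such as ("2.3", "GB") to megabytes."""
    return float(value) * UNIT_TO_MB[unit]


def find_vmem_overruns(log_text):
    """Return {container_id: (peak_used_mb, limit_mb)} for every container
    whose virtual memory usage exceeded its limit in at least one sample."""
    overruns = {}
    for match in USAGE_RE.finditer(log_text):
        used = to_mb(match["used"], match["used_unit"])
        limit = to_mb(match["limit"], match["limit_unit"])
        if used <= limit:
            continue
        cid = match["cid"]
        # Keep the worst (largest) sample seen for each container.
        if cid not in overruns or used > overruns[cid][0]:
            overruns[cid] = (used, limit)
    return overruns
```

Feeding it the text of the NodeManager log (for example open(path).read(), where path is wherever your log lives) returns one entry per offending container, which makes it easy to see whether the overrun is a single container or all of them.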

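Yarn's virtual memory

Yarn's official default configuration contains the following parameters. Note that the two vcores entries configure virtual CPU cores, not virtual memory; the ones that matter here are yarn.nodemanager.vmem-check-enabled and yarn.nodemanager.vmem-pmem-ratio: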

<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>true</value>
<description>Whether virtual memory limits will be enforced for containers.</description>
</property>

<property>
<name>yarn.nodemanager.vmem-pmem-ratio</name>
<value>2.1</value>
<description> Ratio between virtual memory to physical memory when setting memory limits for containers. Container allocations are expressed in terms of physical memory, and virtual memory usage is allowed to exceed this allocation by this ratio.</description>
</property>

<property>
<name>yarn.scheduler.minimum-allocation-vcores</name>
<value>1</value>
<description>The minimum allocation for every container request at the RM in terms of virtual CPU cores. Requests lower than this will be set to the value of this property. Additionally, a node manager that is configured to have fewer virtual cores than this value will be shut down by the resource manager.</description>
</property>

<property>
<name>yarn.scheduler.maximum-allocation-vcores</name>
<value>4</value>
<description> The maximum allocation for every container request at the RM in terms of virtual CPU cores. Requests higher than this will throw an InvalidResourceRequestException.</description>
</property>

<property>
<name>yarn.nodemanager.elastic-memory-control.enabled</name>
<value>false</value>
<description>Enable elastic memory control. This is a Linux only feature. When enabled, the node manager adds a listener to receive an event, if all the containers exceeded a limit. The limit is specified by yarn.nodemanager.resource.memory-mb. If this is not set, the limit is set based on the capabilities. See yarn.nodemanager.resource.detect-hardware-capabilities for details. The limit applies to the physical or virtual (rss+swap) memory depending on whether yarn.nodemanager.pmem-check-enabled or yarn.nodemanager.vmem-check-enabled is set.</description>
</property>
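The 2.1 GB limit in the log is not an independent setting: it is the container's physical memory allocation (1 GB here) multiplied by yarn.nodemanager.vmem-pmem-ratio (2.1 by default). The arithmetic can be sketched as follows; the function names are hypothetical helpers for illustration, not Yarn APIs.

```python
DEFAULT_VMEM_PMEM_RATIO = 2.1  # default of yarn.nodemanager.vmem-pmem-ratio


def vmem_limit_mb(container_pmem_mb, ratio=DEFAULT_VMEM_PMEM_RATIO):
    """Virtual memory limit for a container, in MB.

    container_pmem_mb is the physical memory allocated to the container.
    """
    return container_pmem_mb * ratio


def exceeds_vmem_limit(vmem_used_mb, container_pmem_mb,
                       ratio=DEFAULT_VMEM_PMEM_RATIO):
    """True if the observed virtual memory usage is above the limit."""
    return vmem_used_mb > vmem_limit_mb(container_pmem_mb, ratio)


if __name__ == "__main__":
    pmem_mb = 1024          # 1 GB container, as in the log
    used_mb = 2.3 * 1024    # 2.3 GB of virtual memory observed
    print(f"limit: {vmem_limit_mb(pmem_mb):.1f} MB")
    print(f"over limit: {exceeds_vmem_limit(used_mb, pmem_mb)}")
```

With the numbers from the log, a 1 GB container gets a 2150.4 MB (2.1 GB) virtual memory limit, so 2.3 GB of usage is over the line even though physical usage is only about a third of the allocation.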

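What is virtual memory

Roughly speaking, virtual memory is disk space commandeered to serve as memory. Strictly, though, that describes swap. The figure Yarn checks is the size of the container process tree's virtual address space, which also counts memory that has been reserved or mapped but never touched, so it can sit far above the physical memory actually in use (336.4 MB versus 2.3 GB in the log above).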

# Show the virtual memory usage of a process (replace <pid> with the process ID)
pmap -x <pid>
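If you only want the two totals rather than the full mapping list, Linux also exposes them in /proc/<pid>/status as VmSize (virtual) and VmRSS (resident). The sketch below reads those fields; the helper names are my own, and it assumes a Linux host.

```python
def parse_vm_fields(status_text):
    """Extract VmSize and VmRSS (both in kB) from the text of
    /proc/<pid>/status. Fields that are missing are simply omitted."""
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            # Lines look like "VmSize:\t 2411724 kB"; take the number.
            fields[key] = int(rest.split()[0])
    return fields


def read_vm_fields(pid):
    """Read VmSize/VmRSS for a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as status_file:
        return parse_vm_fields(status_file.read())
```

Calling read_vm_fields on the PID of a container's JVM shows the same gap as the log: a large VmSize next to a much smaller VmRSS.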

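Solution

The memory limit is too small, so we simply raise it. Whether that works, I can't say either; you only find the real cause by trying. But do you think the problem is solved at this point? I can tell you: not a chance. The main problem lies elsewhere, and this is only a small problem inside my big one. This OOM is solved. I'm off to deal with the big problem.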
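How much to raise it by can be estimated from the log. The sketch below assumes the setting being raised is yarn.nodemanager.vmem-pmem-ratio (the text above does not say which knob was changed); the function name and the 10% headroom are arbitrary choices for illustration.

```python
import math


def min_vmem_pmem_ratio(peak_vmem_mb, container_pmem_mb, headroom=0.10):
    """Smallest ratio, rounded up to one decimal place, that would have
    accommodated the observed peak virtual memory plus some headroom."""
    raw = peak_vmem_mb / container_pmem_mb * (1 + headroom)
    return math.ceil(raw * 10) / 10


if __name__ == "__main__":
    # 2.3 GB peak on a 1 GB container, as seen in the NodeManager log.
    print(min_vmem_pmem_ratio(2.3 * 1024, 1024))
```

For the 2.3 GB peak on a 1 GB container this suggests a ratio of about 2.6. The other route is to set yarn.nodemanager.vmem-check-enabled to false, which turns the check off altogether.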

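One more note: if you hit the timeout described above, the cause may be a Yarn OOM, but the actual cause has to be confirmed in the log files, specifically the ones under logs/userlogs/application_xxx. The meaning of life lies in exploring the unknown.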