一次奇怪GC的排查

一次Dse server full gc查找过程

  • 读文件的时候会发生full gc
  • 老年代还没有达到容量就发生full gc

G1 GC

G1 GC接收,原理可以看这篇,图挺多的.

请参考深入剖析JVM之G1收集器、及回收流程、与推荐用例

开启G1 GC 参数

CommandLine flags: -XX:+ExitOnOutOfMemoryError -XX:G1HeapRegionSize=16777216 -XX:G1HeapWastePercent=10 -XX:G1LogLevel=finest -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/octopus/oom/storage -XX:InitialHeapSize=19327352832 -XX:MaxHeapSize=19327352832 -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UnlockExperimentalVMOptions -XX:+UseCompressedClassPointers -XX:+UseCompressedOops -XX:+UseG1GC

最好再加上下面三个,但是如果使用java 11,会不支持

#jre 11 not support
add_jvm_arg "-XX:+PrintAdaptiveSizePolicy"
add_jvm_arg "-XX:+PrintReferenceGC"
add_jvm_arg "-XX:+PrintGCTimeStamps"

环境背景

实验对象是 octopus v4.1.7-5 版本 群集模式, storage server.

利用jstat 查内存情况

jstat  -gccause #gc原因
jstat  -gcutil #gc整体情况
jstat  -gcnew # 新生代使用情况
jstat  -gcold # 老年代使用情况

查看gc日志

试验组

设置Xmx 18GB,,读文件发送full gc. 年轻代总和一直是912MB.

这里有两个问题

  • 发生了full gc
  • g1自动调整的策略没有起效,可以看的日志. 新生代一直是912MB

    在对照组,堆增大之后,自动调整的策略就起效了.情况请看对照组

[root@node2 octopus]# grep Eden /tmp/dse.log-Survivors
   [Eden: 16.0M(912.0M)->0.0B(912.0M) Survivors: 0.0B->0.0B Heap: 16.0M(18.0G)->601.1K(18.0G)], [Metaspace: 3182K->3182K(1056768K)]
   [Eden: 912.0M(912.0M)->0.0B(800.0M) Survivors: 0.0B->112.0M Heap: 1244.2M(18.0G)->430.1M(18.0G)]
   [Eden: 800.0M(800.0M)->0.0B(784.0M) Survivors: 112.0M->128.0M Heap: 4479.7M(18.0G)->4459.7M(18.0G)]
   [Eden: 272.0M(784.0M)->0.0B(784.0M) Survivors: 128.0M->128.0M Heap: 6738.8M(18.0G)->6668.0M(18.0G)]
   [Eden: 784.0M(784.0M)->0.0B(784.0M) Survivors: 128.0M->128.0M Heap: 10.3G(18.0G)->10126.2M(18.0G)]
   [Eden: 784.0M(784.0M)->0.0B(784.0M) Survivors: 128.0M->128.0M Heap: 10.7G(18.0G)->10.3G(18.0G)]
   [Eden: 784.0M(784.0M)->0.0B(784.0M) Survivors: 128.0M->128.0M Heap: 11.1G(18.0G)->10.6G(18.0G)]
   [Eden: 784.0M(784.0M)->0.0B(784.0M) Survivors: 128.0M->128.0M Heap: 11.3G(18.0G)->10.9G(18.0G)]
   [Eden: 784.0M(784.0M)->0.0B(784.0M) Survivors: 128.0M->128.0M Heap: 11.6G(18.0G)->11.2G(18.0G)]
   [Eden: 784.0M(784.0M)->0.0B(784.0M) Survivors: 128.0M->128.0M Heap: 12.0G(18.0G)->11.6G(18.0G)]
   [Eden: 784.0M(784.0M)->0.0B(784.0M) Survivors: 128.0M->128.0M Heap: 11.5G(18.0G)->11.2G(18.0G)]
   [Eden: 784.0M(784.0M)->0.0B(784.0M) Survivors: 128.0M->128.0M Heap: 12.0G(18.0G)->11.6G(18.0G)]
...
   [Eden: 304.0M(784.0M)->0.0B(912.0M) Survivors: 128.0M->0.0B Heap: 13.2G(18.0G)->13.2G(18.0G)]
   [Eden: 48.0M(912.0M)->0.0B(912.0M) Survivors: 0.0B->0.0B Heap: 13.2G(18.0G)->13.2G(18.0G)]
   [Eden: 0.0B(912.0M)->0.0B(912.0M) Survivors: 0.0B->0.0B Heap: 13.2G(18.0G)->10147.6M(18.0G)], [Metaspace: 20886K->20886K(1069056K)]
   [Eden: 912.0M(912.0M)->0.0B(784.0M) Survivors: 0.0B->128.0M Heap: 10.8G(18.0G)->10.5G(18.0G)]
...
   [Eden: 784.0M(784.0M)->0.0B(784.0M) Survivors: 128.0M->128.0M Heap: 12.7G(18.0G)->12.8G(18.0G)]
   [Eden: 208.0M(784.0M)->0.0B(912.0M) Survivors: 128.0M->0.0B Heap: 13.0G(18.0G)->13.0G(18.0G)]
   [Eden: 0.0B(912.0M)->0.0B(912.0M) Survivors: 0.0B->0.0B Heap: 13.0G(18.0G)->10143.7M(18.0G)], [Metaspace: 20892K->20892K(1069056K)]
   [Eden: 912.0M(912.0M)->0.0B(784.0M) Survivors: 0.0B->128.0M Heap: 10.8G(18.0G)->10.4G(18.0G)]
   [Eden: 784.0M(784.0M)->0.0B(784.0M) Survivors: 128.0M->128.0M Heap: 11.2G(18.0G)->10.9G(18.0G)]

参照文章当jvm的eden区满了,进行回收时,s0区满了,此时eden区还有存活对象没复制完,会怎样?

  1. 非GC时,新生代的分布是这样,eden区和s0区有object,s1区为空;
  2. eden区满时,触发MinorGC,将eden区和s0区尚存活的object,复制到s1区,顺利完成后,s0区和s1区命名互换一下,返回情况1;
    • 2.1 MinorGC 中存活的object 如果超过tenuring threshold,会promote到老生代;
  3. 每次MinorGC后,都会计算一个合理的tenuring threshold和各年代区的size,以及适时地调整size
  4. 在2时,如果s1区提前满了,就是undesired的premature promotion,虽然从jvm角度是想通过方法3来避免的,但也无法保证;

  5. 老生代如果也满了(可能是由2.1或4导致),那就没什么好说了,直接Full GC了;

并且触发了多次to-space exhausted

[root@node2 octopus]# grep 'to-space exhausted' /tmp/dse.log-Survivors
77.221: [GC pause (G1 Evacuation Pause) (young) (initial-mark) (to-space exhausted), 0.0337199 secs]
78.450: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 0.0394262 secs]
78.877: [GC pause (G1 Evacuation Pause) (mixed) (to-space exhausted), 0.0080850 secs]
78.902: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 0.0022769 secs]
92.943: [GC pause (G1 Evacuation Pause) (young) (initial-mark) (to-space exhausted), 0.0343345 secs]
94.914: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 0.0244542 secs]
95.365: [GC pause (G1 Evacuation Pause) (young) (initial-mark) (to-space exhausted), 0.0064090 secs]
110.848: [GC pause (G1 Evacuation Pause) (young) (initial-mark) (to-space exhausted), 0.0347729 secs]
113.026: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 0.0246771 secs]
113.107: [GC pause (G1 Evacuation Pause) (young) (initial-mark) (to-space exhausted), 0.0060059 secs]
124.546: [GC pause (G1 Evacuation Pause) (young) (initial-mark) (to-space exhausted), 0.0326312 secs]
128.551: [GC pause (G1 Evacuation Pause) (young) (to-space exhausted), 0.0323352 secs]
128.796: [GC pause (G1 Evacuation Pause) (young) (initial-mark) (to-space exhausted), 0.0068281 secs]
136.617: [GC pause (G1 Evacuation Pause) (young) (initial-mark) (to-space exhausted), 0.0337861 secs]

参照文章JVM调优实战:G1中的to-space exhausted问题_阿杜的世界-CSDN博客

另外一点是 每次full gc 都在并发标记周期开始后.

95.371: [GC concurrent-root-region-scan-start]
95.372: [GC concurrent-root-region-scan-end, 0.0000127 secs]
95.372: [GC concurrent-mark-start]
95.372: [Full GC (Allocation Failure)  13G->10143M(18G), 0.1442809 secs]
   [Eden: 0.0B(912.0M)->0.0B(912.0M) Survivors: 0.0B->0.0B Heap: 13.0G(18.0G)->10143.7M(18.0G)], [Metaspace: 20892K->20892K(1069056K)]
 [Times: user=0.15 sys=0.00, real=0.15 secs]
95.516: [GC concurrent-mark-abort]

参照文章可能是最全面的G1学习笔记

GC concurrent-mark-start开始之后就发生了FULL GC,这说明针对老年代分区的回收速度比较慢,或者说对象过快得从新生代晋升到老年代,或者说是有很多大对象直接在老年代分配。针对上述原因,我们可能需要做的调整有:调大整个堆的大小、更快得触发并发回收周期、让更多的回收线程参与到垃圾收集的动作中。

但是由于设置过region size程序是没有在读文件过程中 分配过巨型对象.说明针对老年代的回收速度比较慢.

在G1GC 中老年代是通过mixed gc回收的.

Mixed GC指的不是一次GC,而是一个过程,这个过程从initial-mark开始.可以看到并发标记完之后,没有回收空间.以为回收收益达不到预值10%.

 68.242: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 1 regions, reclaimable: 16362912 bytes (0.08 %), threshold: 10.00 %]
 75.195: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 55 regions, reclaimable: 473957632 bytes (2.45 %), threshold: 10.00 %]
 80.499: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 72 regions, reclaimable: 743495848 bytes (3.85 %), threshold: 10.00 %]
 83.868: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 74 regions, reclaimable: 772788192 bytes (4.00 %), threshold: 10.00 %]
 87.226: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 85 regions, reclaimable: 948939880 bytes (4.91 %), threshold: 10.00 %]
 92.124: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 98 regions, reclaimable: 1183778416 bytes (6.12 %), threshold: 10.00 %]
 98.282: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 113 regions, reclaimable: 1426981920 bytes (7.38 %), threshold: 10.00 %]
 102.727: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 142 regions, reclaimable: 1896645496 bytes (9.81 %), threshold: 10.00 %]
 108.781: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old 

This shows that the concurrent marking cycle has completed but the Mixed GCs are not done because the reclaimable space is less than tolerable limit set for the waste with G1HeapWastePercent. Its default value is 10%. G1HeapWastePercent can be decreased to be around 2% so that mixed GCs can occur at a lower threshold.

参照文章Tuning G1GC For SOA–>Observations and Recommendations

但是我对照的测试,把对调整到22g,没有发生full gc.也有相同没有执行 mixed gc的日志.因此可以确定 是g1 根据G1HeapWastePercent判断是否启动mixed gc

于是我按照文章,设置了参数G1HeapWastePercent=2

再次跑了一次测试. 发现,确实起了效果. 读一次文件的过程只发生了一次to-space exhausted和一次full gc,并且多次读文件没有重复发生full gc. 很明显起了效果.

[root@node2 octopus]# grep 'to-space' /tmp/dse.log
 (to-space exhausted), 0.0317402 secs]
207.971: [SoftReference, 0 refs, 0.0000234 secs]207.971: [WeakReference, 0 refs, 0.0000038 secs]207.971: [FinalReference, 0 refs, 0.0000033 secs]207.971: [PhantomReference, 0 refs, 0 refs, 0.0000039 secs]207.971: [JNI Weak Reference, 0.0000070 secs] (to-space exhausted), 0.0097462 secs]
[root@node2 octopus]# grep 'Full' /tmp/dse.log
0.125: [Full GC (System.gc()) 0.169: [SoftReference, 0 refs, 0.0000273 secs]0.169: [WeakReference, 0 refs, 0.0000034 secs]0.169: [FinalReference, 64 refs, 0.0000112 secs]0.169: [PhantomReference, 0 refs, 0 refs, 0.0000034 secs]0.169: [JNI Weak Reference, 0.0000043 secs] 15M->601K(18G), 0.0603628 secs]
207.979: [Full GC (Allocation Failure) 208.028: [SoftReference, 0 refs, 0.0000299 secs]208.028: [WeakReference, 291 refs, 0.0000103 secs]208.028: [FinalReference, 387 refs, 0.0000247 secs]208.028: [PhantomReference, 0 refs, 100 refs, 0.0000059 secs]208.028: [JNI Weak Reference, 0.0000069 secs] 13G->10143M(18G), 0.2611047 secs]

[root@node2 octopus]# cat /tmp/dse.log |grep 'GC pause' |grep mixed
36.508: [GC pause (G1 Evacuation Pause) (mixed) 36.508: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 3655, predicted base time: 3.80 ms, remaining time: 196.20 ms, target pause time: 200.00 ms]
39.633: [GC pause (G1 Evacuation Pause) (mixed) 39.633: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 3722, predicted base time: 2.50 ms, remaining time: 197.50 ms, target pause time: 200.00 ms]
40.027: [GC pause (G1 Evacuation Pause) (mixed) 40.027: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 3720, predicted base time: 2.35 ms, remaining time: 197.65 ms, target pause time: 200.00 ms]
40.558: [GC pause (G1 Evacuation Pause) (mixed) 40.558: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 4018, predicted base time: 2.26 ms, remaining time: 197.74 ms, target pause time: 200.00 ms]
41.119: [GC pause (G1 Evacuation Pause) (mixed) 41.119: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 3898, predicted base time: 2.13 ms, remaining time: 197.87 ms, target pause time: 200.00 ms]
44.982: [GC pause (G1 Evacuation Pause) (mixed) 44.982: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 3689, predicted base time: 2.10 ms, remaining time: 197.90 ms, target pause time: 200.00 ms]
48.523: [GC pause (G1 Evacuation Pause) (mixed) 48.523: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 3523, predicted base time: 1.84 ms, remaining time: 198.16 ms, target pause time: 200.00 ms]

多次触发mixed gc 可以起效果. 由此看起来我们这个是老年代内存释放失败的问题.

上面说到日志里面有触发了多次to-space exhausted的记录

那我们是发生了转移失败(Evacuation Failure)吗?

对 survivors 或 promoted objects 进行GC时如果JVM的heap区不足就会发生提升失败(promotion failure). 堆内存不能继续扩充,因为已经达到最大值了. 当使用 -XX:+PrintGCDetails 时将会在GC日志中显示 to-space overflow (to-空间溢出).

应该不是,因此日志里面出现的是to-space exhausted而不是to-space overflow.不过有文章说它们都是一个东西.

You might have come across some of the terms “evacuation failure”, “to-space exhausted”, “to-space overflow” or maybe something like “promotion failure” during your GC travels. These terms all refer to the same thing, and the concept is similar in G1 GC as well.

那我们使用避免Evacuation Failure的一些调优去测试一下.

  • 增大堆(内存)大小

    可行,对照组使用一样的参数,只有xmx调大到22g,就没有发生full gc情况

  • 增大-XX:G1ReservePercent=n,默认是10.G1会预留一部分内存,制造一个假天花板,防止to-space的方式。

    目前确实发生的to-space exhausted,并且Tuning G1GC For SOA也是一样建议的.

    于是我将-XX:G1ReservePercent调整为15,测试 full gc情况没有改善. 每测试一次读文件都会增加4次full gc. (存疑)

  • 早点启动标记周期

    和我们情况不同,其实现在项目concurrent mark挺频繁.(存疑)

  • 增大并行标记的线程数,使用-XX:ConcGCThreads=n选项

    测试单独设置这个,效果不大.(存疑).

上面的文章Tuning G1GC For SOA–>Observations and Recommendations另外也讲到了 to-space exhausted的问题,可是我调整一下没啥改变,只有G1HeapWastePercent=2是有效修改.

另外也可以参考这篇文章Tips for Tuning the Garbage First Garbage Collector (infoq.com),里面有个观点是starting marking cycle too early

但是我也调整过-XX:InitiatingHeapOccupancyPercent为70%.没什么效果.甚至我直接调整到90.情况一样没有改善.

并且我也尝试过手动指定新生代大小,依然会发生full gc.

**到这我的结论是 mixed gc ,因为收益达不到预期,所以,一直没有回收空间. 但是另一方面,堆就已经分配不出来内存了. **

另外可以看下面来个内存分配失败的日志.于是直接触发了FGC.

  • 无法分配内存导致fullgc,这里是需要分配16MB,可能代表了4个block.因为程序里每个需要4MB保存
103.744: [GC concurrent-root-region-scan-start]
103.744: [GC concurrent-root-region-scan-end, 0.0000096 secs]
103.744: [GC concurrent-mark-start]
 103.744: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: allocation request failed, allocation request: 4194320 bytes]
 103.744: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 16777216 bytes, attempted expansion amount: 16777216 bytes]
 103.744: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap already fully expanded]
103.744: [Full GC (Allocation Failure) 103.793: [SoftReference, 0 refs, 0.0000302 secs]103.793: [WeakReference, 288 refs, 0.0000086 secs]103.793: [FinalReference, 485 refs, 0.0000460 secs]103.793: [PhantomReference, 0 refs, 100 refs, 0.0000048 secs]103.793: [JNI Weak Reference, 0.0000054 secs] 13G->10143M(18G), 0.2612791 secs]
   [Eden: 0.0B(912.0M)->0.0B(944.0M) Survivors: 0.0B->0.0B Heap: 13.2G(18.0G)->10143.6M(18.0G)], [Metaspace: 20900K->20900K(1069056K)]
 [Times: user=0.27 sys=0.00, real=0.26 secs]
104.006: [GC concurrent-mark-abort]
  • 在这之前还有类似的,这里需要分配4MB,恰好是读一个block需要的内存
159.229: [GC pause (G1 Evacuation Pause) (young) 159.229: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 3940, predicted base time: 1.79 ms, remaining time: 198.21 ms, target pause time: 200.00 ms]
 159.229: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 49 regions, survivors: 8 regions, predicted young region time: 26.65 ms]
 159.229: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 49 regions, survivors: 8 regions, old: 0 regions, predicted pause time: 28.44 ms, target pause time: 200.00 ms]
 159.244: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: region allocation request failed, allocation request: 4194320 bytes]
 159.244: [G1Ergonomics (Heap Sizing) expand the heap, requested expansion amount: 4194320 bytes, attempted expansion amount: 16777216 bytes]
 159.244: [G1Ergonomics (Heap Sizing) did not expand the heap, reason: heap already fully expanded]
159.246: [SoftReference, 0 refs, 0.0000314 secs]159.246: [WeakReference, 0 refs, 0.0000040 secs]159.246: [FinalReference, 82 refs, 0.0001580 secs]159.246: [PhantomReference, 0 refs, 0 refs, 0.0000040 secs]159.246: [JNI Weak Reference, 0.0000087 secs] 159.256: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 19025362944 bytes, allocation request: 0 bytes, threshold: 8697308760 bytes (45.00 %), source: end of GC]
 (to-space exhausted), 0.0277961 secs]

现在依然困惑的问题是:

堆实际使用的空间还远不到18GB.就无法分配内存了.

jvm没有动态调整新生代比例和老年代比例. 但是把堆调大,这个自动调整又有效果了.

是不是和【JVM】空间分配担保机制有关

1.谁进行空间担保?

  JVM使用分代收集算法,将堆内存划分为年轻代和老年代,两块内存分别采用不同的垃圾回收算法,空间担保指的是老年代进行空间分配担保

2.什么是空间分配担保?

    在发生Minor GC之前,虚拟机会检查老年代最大可用的连续空间是否大于新生代所有对象的总空间

    如果大于,则此次Minor GC是安全的

    如果小于,则虚拟机会查看HandlePromotionFailure设置值是否允许担保失败。如果HandlePromotionFailure=true,那么会继续检查老年代最大可用连续空间是否大于历次晋升到老年代的对象的平均大小,如果大于,则尝试进行一次Minor GC,但这次Minor GC依然是有风险的;如果小于或者HandlePromotionFailure=false,则改为进行一次Full GC。

3.为什么要进行空间担保?

    是因为新生代采用复制收集算法,假如大量对象在Minor GC后仍然存活(最极端情况为内存回收后新生代中所有对象均存活),而Survivor空间是比较小的,这时就需要老年代进行分配担保,把Survivor无法容纳的对象放到老年代。老年代要进行空间分配担保,前提是老年代得有足够空间来容纳这些对象,但一共有多少对象在内存回收后存活下来是不可预知的,因此只好取之前每次垃圾回收后晋升到老年代的对象大小的平均值作为参考。使用这个平均值与老年代剩余空间进行比较,来决定是否进行Full GC来让老年代腾出更多空间。

增加日志参数 “-XX:+G1PrintHeapRegions”

打印存活对象 age 情况

➜  ~ grep '\- age' dse.log-18g-region -A 1 -B 1 |head -n 3
Desired survivor size 67108864 bytes, new threshold 1 (max 15)
- age   1:   77343736 bytes,   77343736 total
 11.349: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 2051, predicted base time: 11.74 ms, remaining time: 188.26 ms, target pause time: 200.00 ms]


➜  ~ grep '\- age' dse.log-18g-region -A 1 -B 1 |tail  -n 3
Desired survivor size 67108864 bytes, new threshold 1 (max 15)
- age   1:  100687440 bytes,  100687440 total
 180.986: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 3805, predicted base time: 1.96 ms, remaining time: 198.04 ms, target pause time: 200.00 ms]

grep '\- age' dse.log-18g-region|grep -v 'age   1:'|wc -l
0

可以看到 ,存活的对象在新生代并没有存活很久. 还没有到达 15的晋升阀值. 那么就奇怪了,不是大对象分配,存活对象也没有达到晋升阀值.但是老年代确实也不断在增加.

我尝试 把region size 调整为 32MB.发现Full GC 更频繁了.

grep 'Full GC (Allocation Failure' /tmp/dse.log |wc -l
14

是不是 region size变大之后,region的数量变少了的影响. 那么我们现在面对的问题 是不是也是和region 数量有关呢?程序启动时,是有分配一些巨型对象的,这些H对象,占用了多少个region呢?

增加日志参数”-XX:+G1TraceEagerReclaimHumongousObjects”

分配巨型对象时的详细日志

 14.328: [G1Ergonomics (Concurrent Cycles) do not request concurrent cycle initiation, reason: still doing mixed collections, occupancy: 15082717184 bytes, allocation request: 17400888 bytes, threshold: 8697308760 bytes (45.00 %), source: concurrent humongous allocation]
 G1HR ALLOC(StartsH) 0x00000006c3000000 0x00000006c4000000
 G1HR ALLOC(ContinuesH) 0x00000006c4000000 0x00000006c4098438

占取的region数量大概是323

[root@node2 octopus]# cat h |head -n 3
Live humongous region 1 size 69539616 start 0x0000000341000000 length 5 with remset 1 code roots 0 is marked 0 reclaim candidate 1 type array 1
Live humongous region 6 size 69539616 start 0x0000000346000000 length 5 with remset 1 code roots 0 is marked 0 reclaim candidate 1 type array 1
Live humongous region 11 size 69539616 start 0x000000034b000000 length 5 with remset 1 code roots 0 is marked 0 reclaim candidate 1 type array 1
[root@node2 octopus]# cat h |tail -n 3
Live humongous region 901 size 17400888 start 0x00000006c5000000 length 2 with remset 1 code roots 0 is marked 0 reclaim candidate 1 type array 1
Live humongous region 903 size 17400888 start 0x00000006c7000000 length 2 with remset 1 code roots 0 is marked 0 reclaim candidate 1 type array 1
Live humongous region 905 size 17400888 start 0x00000006c9000000 length 2 with remset 1 code roots 0 is marked 0 reclaim candidate 1 type array 1
[root@node2 octopus]# cat h |wc -l
323

## 这里occupancy 也是说堆已经使用了18941476864
 219.826: [G1Ergonomics (Concurrent Cycles) request concurrent cycle initiation, reason: occupancy higher than threshold, occupancy: 18941476864 bytes, allocation request: 0 bytes, threshold: 8697308760 bytes (45.00 %), source: end of GC]
 G1HR #EndGC 93
 (to-space exhausted), 0.0405062 secs]
   [Parallel Time: 26.6 ms, GC Workers: 13]
   ....
   [Other: 13.8 ms]
      [Evacuation Failure: 12.3 ms]
         [Recalculate Used: 0.0 ms]
         [Remove Self Forwards: 12.2 ms]
         [Restore RemSet: 0.0 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 0.4 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.1 ms]
         [Parallel Redirty:  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0
          Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.0]
         [Redirtied Cards:  516  389  80  0  0  0  0  0  0  0  0  0  0
          Min: 0, Avg: 75.8, Max: 516, Diff: 516, Sum: 985]
      [Humongous Register: 0.3 ms]
         [Humongous Total: 323] #这里说了 巨型对象占据的region数量有 323个
         [Humongous Candidate: 323]
      [Humongous Reclaim: 0.5 ms]
         [Humongous Reclaimed: 0]
      [Free CSet: 0.0 ms]
         [Young Free CSet: 0.0 ms]
         [Non-Young Free CSet: 0.0 ms]
   [Eden: 784.0M(784.0M)->0.0B(784.0M) Survivors: 128.0M->128.0M Heap: 12.5G(18.0G)->12.8G(18.0G)] # 这里明明显示堆只使用12.5GB

关于这个问题网上的说法,升级jdk. java – Why have different descriptions of heap usage in G1GC logs – Stack Overflow

@Eugene I’m confused exactly how much space the heap uses. And I have solved this problem by upgrading openJDK to 8u232

使用 java-11 跑一下

也发生了full gc ,日志更加清晰.

[107.914s][info][gc,cpu         ] GC(118) User=0.02s Sys=0.00s Real=0.00s
[107.939s][info][gc,start       ] GC(119) Pause Young (Concurrent Start) (G1 Evacuation Pause) # 开始Young GC
[107.939s][info][gc,task        ] GC(119) Using 13 workers of 13 for evacuation
[107.940s][info][gc             ] GC(119) To-space exhausted
[107.940s][info][gc,phases      ] GC(119)   Pre Evacuate Collection Set: 0.2ms
[107.940s][info][gc,phases      ] GC(119)   Evacuate Collection Set: 0.7ms
[107.940s][info][gc,phases      ] GC(119)   Post Evacuate Collection Set: 0.6ms
[107.940s][info][gc,phases      ] GC(119)   Other: 0.2ms
[107.940s][info][gc,heap        ] GC(119) Eden regions: 4->0(57)
[107.940s][info][gc,heap        ] GC(119) Survivor regions: 0->0(8)
[107.940s][info][gc,heap        ] GC(119) Old regions: 301->305 # 这里老年代 region + 巨型对象regions 刚刚好是1152. 
[107.940s][info][gc,heap        ] GC(119) Humongous regions: 847->847
[107.940s][info][gc,metaspace   ] GC(119) Metaspace: 23563K->23563K(1071104K)
[107.940s][info][gc             ] GC(119) Pause Young (Concurrent Start) (G1 Evacuation Pause) 18093M->18093M(18432M) 1.732ms
[107.940s][info][gc,cpu         ] GC(119) User=0.00s Sys=0.00s Real=0.01s

[107.940s][info][gc,task        ] GC(120) Using 13 workers of 13 for full compaction
[107.941s][info][gc,start       ] GC(120) Pause Full (G1 Evacuation Pause) # 发生了full gc
[107.958s][info][gc,phases,start] GC(120) Phase 1: Mark live objects
[107.961s][info][gc,stringtable ] GC(120) Cleaned string and symbol table, strings: 8654 processed, 0 removed, symbols: 51477 processed, 0 removed
[107.961s][info][gc,phases      ] GC(120) Phase 1: Mark live objects 2.855ms
[107.961s][info][gc,phases,start] GC(120) Phase 2: Prepare for compaction
[107.967s][info][gc,phases      ] GC(120) Phase 2: Prepare for compaction 5.843ms
[107.967s][info][gc,phases,start] GC(120) Phase 3: Adjust pointers
[107.971s][info][gc,phases      ] GC(120) Phase 3: Adjust pointers 4.399ms
[107.971s][info][gc,phases,start] GC(120) Phase 4: Compact heap
[108.008s][info][gc,phases      ] GC(120) Phase 4: Compact heap 37.156ms
[108.011s][info][gc,heap        ] GC(120) Eden regions: 0->0(57)
[108.011s][info][gc,heap        ] GC(120) Survivor regions: 0->0(8)
[108.011s][info][gc,heap        ] GC(120) Old regions: 305->117
[108.011s][info][gc,heap        ] GC(120) Humongous regions: 847->847 # full gc 下 巨型对象占的region数量没有降下来.
[108.011s][info][gc,metaspace   ] GC(120) Metaspace: 23563K->23563K(1071104K)
[108.011s][info][gc             ] GC(120) Pause Full (G1 Evacuation Pause) 18093M->15024M(18432M) 70.173ms
[108.011s][info][gc,cpu         ] GC(120) User=0.75s Sys=0.00s Real=0.07s

看到了因为region 都 old regions 和Humongous regions用完了. 惊!

一开始做这个测试,用java-11也会发生full gc. 但是这周一,再次测试. 并没有发生full gc. 我还以为是自己误操作了. 这中间机器有重启过.我赶紧重启再试试.

结果没有变化,一样出现full gc.真的是玄学.

观察一下,程序启动之后,不读文件前的情况,Humongous regions:到了815.

[14.651s][info][gc,start       ] GC(31) Pause Young (Concurrent Start) (G1 Humongous Allocation)
[14.652s][info][gc,task        ] GC(31) Using 13 workers of 13 for evacuation
[14.653s][info][gc,phases      ] GC(31)   Pre Evacuate Collection Set: 0.2ms
[14.653s][info][gc,phases      ] GC(31)   Evacuate Collection Set: 0.7ms
[14.653s][info][gc,phases      ] GC(31)   Post Evacuate Collection Set: 0.3ms
[14.653s][info][gc,phases      ] GC(31)   Other: 0.2ms
[14.653s][info][gc,heap        ] GC(31) Eden regions: 6->0(137)
[14.653s][info][gc,heap        ] GC(31) Survivor regions: 1->1(19)
[14.653s][info][gc,heap        ] GC(31) Old regions: 72->72
[14.653s][info][gc,heap        ] GC(31) Humongous regions: 815->815
[14.653s][info][gc,metaspace   ] GC(31) Metaspace: 16605K->16605K(1064960K)
[14.653s][info][gc             ] GC(31) Pause Young (Concurrent Start) (G1 Humongous Allocation) 14233M->14137M(18432M) 1.460ms
[14.653s][info][gc,cpu         ] GC(31) User=0.01s Sys=0.00s Real=0.00s

然后读一下文件,达到847,然后一直不变. regions是16MB.

这里是明确可以看出来,To-space exhausted,这里是属于年轻代没有足够的region 用来保存存活的对象, 现在剩余的没有被使用的region = 1152-(49+8+234+847)=14.可能确实不够了. 可以看这个文章CDH6.3 HBase: G1 GC Tuning with JDK11

When you see to-space exhausted messages in GC logs, it means that G1 does not have enough memory for either survived or tenured objects, or for both, and the Java heap cannot be further expanded since it is already at its max.

[200.883s][info][gc,start       ] GC(51) Pause Young (Prepare Mixed) (G1 Evacuation Pause)
[200.883s][info][gc,task        ] GC(51) Using 13 workers of 13 for evacuation
[200.900s][info][gc             ] GC(51) To-space exhausted
[200.900s][info][gc,phases      ] GC(51)   Pre Evacuate Collection Set: 0.2ms
[200.900s][info][gc,phases      ] GC(51)   Evacuate Collection Set: 11.5ms
[200.900s][info][gc,phases      ] GC(51)   Post Evacuate Collection Set: 4.6ms
[200.900s][info][gc,phases      ] GC(51)   Other: 0.2ms
[200.900s][info][gc,heap        ] GC(51) Eden regions: 49->0(49)
[200.900s][info][gc,heap        ] GC(51) Survivor regions: 8->8(8)
[200.900s][info][gc,heap        ] GC(51) Old regions: 234->292
[200.900s][info][gc,heap        ] GC(51) Humongous regions: 847->847
[200.900s][info][gc,metaspace   ] GC(51) Metaspace: 23822K->23822K(1071104K)
[200.900s][info][gc             ] GC(51) Pause Young (Prepare Mixed) (G1 Evacuation Pause) 18157M->18301M(18432M) 16.636ms
[200.900s][info][gc,cpu         ] GC(51) User=0.18s Sys=0.00s Real=0.01s

在java11 ,设置"-XX:G1HeapWastePercent=2",让mixed gc 早点回收空间. 也不会出现Full gc. 可以看到老年代的regions 数量减少, 巨型对象的Region没有减少.于是剩余的region 给年轻代用.

[103.137s][info][gc,start       ] GC(124) Pause Young (Concurrent Start) (G1 Evacuation Pause)
[103.137s][info][gc,task        ] GC(124) Using 13 workers of 13 for evacuation
[103.168s][info][gc,phases      ] GC(124)   Pre Evacuate Collection Set: 0.3ms
[103.168s][info][gc,phases      ] GC(124)   Evacuate Collection Set: 30.9ms
[103.168s][info][gc,phases      ] GC(124)   Post Evacuate Collection Set: 0.3ms
[103.168s][info][gc,phases      ] GC(124)   Other: 0.3ms
[103.168s][info][gc,heap        ] GC(124) Eden regions: 49->0(49)
[103.168s][info][gc,heap        ] GC(124) Survivor regions: 8->8(8)
[103.168s][info][gc,heap        ] GC(124) Old regions: 180->211
[103.168s][info][gc,heap        ] GC(124) Humongous regions: 847->847
[103.168s][info][gc,metaspace   ] GC(124) Metaspace: 23769K->23769K(1071104K)
[103.168s][info][gc             ] GC(124) Pause Young (Concurrent Start) (G1 Evacuation Pause) 17310M->17014M(18432M) 31.744ms
[103.168s][info][gc,cpu         ] GC(124) User=0.40s Sys=0.00s Real=0.03s

结论是:巨型对象占用的region 数量比想象中多. 启动之后就占了815,开始读文件就涨到847. 总的region数量就1152.

其实java-11 启动的时候 年轻带分配的还挺大的,但是慢慢就变小了.

[14.746s][info][gc,start       ] GC(31) Pause Young (Concurrent Start) (GCLocker Initiated GC)
[14.746s][info][gc,task        ] GC(31) Using 13 workers of 13 for evacuation
[14.748s][info][gc,phases      ] GC(31)   Pre Evacuate Collection Set: 0.2ms
[14.748s][info][gc,phases      ] GC(31)   Evacuate Collection Set: 0.6ms
[14.748s][info][gc,phases      ] GC(31)   Post Evacuate Collection Set: 0.2ms
[14.748s][info][gc,phases      ] GC(31)   Other: 0.2ms
[14.748s][info][gc,heap        ] GC(31) Eden regions: 9->0(126)
[14.748s][info][gc,heap        ] GC(31) Survivor regions: 1->1(20)   #(126+20 )* 16MB = 2GB 左右
[14.748s][info][gc,heap        ] GC(31) Old regions: 72->72
[14.748s][info][gc,heap        ] GC(31) Humongous regions: 829->829
[14.748s][info][gc,metaspace   ] GC(31) Metaspace: 16610K->16610K(1064960K)
[14.748s][info][gc             ] GC(31) Pause Young (Concurrent Start) (GCLocker Initiated GC) 14504M->14368M(18432M) 1.317ms
[14.748s][info][gc,cpu         ] GC(31) User=0.00s Sys=0.00s Real=0.00s
[14.748s][info][gc             ] GC(32) Concurrent Cycle
[14.748s][info][gc,marking     ] GC(32) Concurrent Clear Claimed Marks
[14.748s][info][gc,marking     ] GC(32) Concurrent Clear Claimed Marks 0.015ms
[14.748s][info][gc,marking     ] GC(32) Concurrent Scan Root Regions
[14.758s][info][gc,marking     ] GC(32) Concurrent Scan Root Regions 10.401ms
[14.758s][info][gc,marking     ] GC(32) Concurrent Mark (14.758s)
[14.758s][info][gc,marking     ] GC(32) Concurrent Mark From Roots
[14.758s][info][gc,task        ] GC(32) Using 3 workers of 3 for marking
[14.770s][info][gc,marking     ] GC(32) Concurrent Mark From Roots 12.128ms
[14.770s][info][gc,marking     ] GC(32) Concurrent Preclean
[14.771s][info][gc,marking     ] GC(32) Concurrent Preclean 0.331ms
[14.771s][info][gc,marking     ] GC(32) Concurrent Mark (14.758s, 14.771s) 12.517ms
[14.773s][info][gc,start       ] GC(32) Pause Remark
[14.774s][info][gc,stringtable ] GC(32) Cleaned string and symbol table, strings: 5135 processed, 0 removed, symbols: 30087 processed, 0 removed
[14.774s][info][gc             ] GC(32) Pause Remark 14720M->14720M(18432M) 0.985ms
[14.774s][info][gc,cpu         ] GC(32) User=0.01s Sys=0.00s Real=0.00s
[14.774s][info][gc,marking     ] GC(32) Concurrent Rebuild Remembered Sets
[14.784s][info][gc,marking     ] GC(32) Concurrent Rebuild Remembered Sets 9.509ms
[14.784s][info][gc,start       ] GC(32) Pause Cleanup
[14.784s][info][gc             ] GC(32) Pause Cleanup 14752M->14752M(18432M) 0.189ms
[14.784s][info][gc,cpu         ] GC(32) User=0.00s Sys=0.00s Real=0.00s
[14.784s][info][gc,marking     ] GC(32) Concurrent Cleanup for Next Mark
[14.808s][info][gc,marking     ] GC(32) Concurrent Cleanup for Next Mark 24.023ms
[14.808s][info][gc             ] GC(32) Concurrent Cycle 60.487ms
[51.306s][info][gc,start       ] GC(33) Pause Young (Normal) (G1 Evacuation Pause)
[51.306s][info][gc,task        ] GC(33) Using 13 workers of 13 for evacuation
[51.372s][info][gc,phases      ] GC(33)   Pre Evacuate Collection Set: 0.2ms
[51.372s][info][gc,phases      ] GC(33)   Evacuate Collection Set: 65.2ms
[51.372s][info][gc,phases      ] GC(33)   Post Evacuate Collection Set: 0.5ms
[51.372s][info][gc,phases      ] GC(33)   Other: 0.3ms
[51.372s][info][gc,heap        ] GC(33) Eden regions: 126->0(49)
[51.372s][info][gc,heap        ] GC(33) Survivor regions: 1->16(16) # 看到eden的126用完之后,回收,然后把年轻代变少了.存活的对象survivor 写不下的都到了老年代.
[51.372s][info][gc,heap        ] GC(33) Old regions: 72->99
[51.372s][info][gc,heap        ] GC(33) Humongous regions: 847->847
[51.372s][info][gc,metaspace   ] GC(33) Metaspace: 23705K->23705K(1071104K)
[51.372s][info][gc             ] GC(33) Pause Young (Normal) (G1 Evacuation Pause) 16672M->15342M(18432M) 66.195ms
[51.372s][info][gc,cpu         ] GC(33) User=0.38s Sys=0.40s Real=0.07s

最终结果是老年代用的更多了,mixed gc 回收的收益又达不到预期的10%,因此也不回收空间. 于是G1的自动调节,给老年代分配了更多的空间.

这问题还是 Humongous regions数量太多.

对照组

设置22GB,不修改其他参数,读文件没有发生full gc. 并且明显看到 新生代自动调整了.

   [Eden: 16.0M(1088.0M)->0.0B(1088.0M) Survivors: 0.0B->0.0B Heap: 16.0M(21.5G)->601.1K(21.5G)], [Metaspace: 3182K->3182K(1056768K)]
   [Eden: 1088.0M(1088.0M)->0.0B(944.0M) Survivors: 0.0B->144.0M Heap: 2149.7M(21.5G)->1335.2M(21.5G)]
   [Eden: 880.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 7952.0M(21.5G)->7886.3M(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 10.5G(21.5G)->10238.2M(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 10.9G(21.5G)->10.4G(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 11.3G(21.5G)->10.7G(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 11.4G(21.5G)->10.9G(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 11.8G(21.5G)->11.3G(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 12.3G(21.5G)->11.8G(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 11.6G(21.5G)->11.1G(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 12.1G(21.5G)->11.6G(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 12.5G(21.5G)->12.0G(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 11.6G(21.5G)->11.2G(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 12.1G(21.5G)->11.7G(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 12.6G(21.5G)->12.1G(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(944.0M) Survivors: 144.0M->144.0M Heap: 11.7G(21.5G)->11.3G(21.5G)]
   [Eden: 944.0M(944.0M)->0.0B(1424.0M) Survivors: 144.0M->144.0M Heap: 12.2G(21.5G)->11.7G(21.5G)]
   [Eden: 1424.0M(1424.0M)->0.0B(1600.0M) Survivors: 144.0M->208.0M Heap: 13.1G(21.5G)->12.1G(21.5G)]
   [Eden: 1600.0M(1600.0M)->0.0B(1952.0M) Survivors: 208.0M->240.0M Heap: 12.4G(21.5G)->11.4G(21.5G)]
   [Eden: 1952.0M(1952.0M)->0.0B(2224.0M) Survivors: 240.0M->288.0M Heap: 13.3G(21.5G)->11.8G(21.5G)]
   [Eden: 2224.0M(2224.0M)->0.0B(1840.0M) Survivors: 288.0M->320.0M Heap: 14.0G(21.5G)->12.2G(21.5G)]
   [Eden: 1840.0M(1840.0M)->0.0B(2768.0M) Survivors: 320.0M->272.0M Heap: 12.7G(21.5G)->11.3G(21.5G)]
   [Eden: 2768.0M(2768.0M)->0.0B(2320.0M) Survivors: 272.0M->384.0M Heap: 14.0G(21.5G)->11.7G(21.5G)]
   [Eden: 2320.0M(2320.0M)->0.0B(2032.0M) Survivors: 384.0M->352.0M Heap: 13.9G(21.5G)->12.0G(21.5G)]
   [Eden: 2032.0M(2032.0M)->0.0B(2560.0M) Survivors: 352.0M->304.0M Heap: 13.0G(21.5G)->11.3G(21.5G)]
   [Eden: 2560.0M(2560.0M)->0.0B(2256.0M) Survivors: 304.0M->368.0M Heap: 13.8G(21.5G)->11.7G(21.5G)]
   [Eden: 2256.0M(2256.0M)->0.0B(1936.0M) Survivors: 368.0M->336.0M Heap: 13.9G(21.5G)->12.1G(21.5G)]
   [Eden: 1936.0M(1936.0M)->0.0B(2480.0M) Survivors: 336.0M->288.0M Heap: 13.0G(21.5G)->11.4G(21.5G)]

没有full gc 也没有触发mixed gc 回收空间.

[root@node2 octopus]# cat /tmp/dse.log-22g-soa |grep 'GC pause' |grep mixed |wc -l
0

[root@node2 octopus]# cat /tmp/dse.log-22g-soa |grep Mixed
 175.554: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 2 regions, reclaimable: 22810456 bytes (0.10 %), threshold: 10.00 %]
 178.803: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 54 regions, reclaimable: 440224032 bytes (1.91 %), threshold: 10.00 %]
 181.797: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 59 regions, reclaimable: 515677760 bytes (2.24 %), threshold: 10.00 %]
 183.525: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 67 regions, reclaimable: 654036576 bytes (2.84 %), threshold: 10.00 %]
 186.059: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 69 regions, reclaimable: 683362152 bytes (2.96 %), threshold: 10.00 %]
 190.912: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 75 regions, reclaimable: 825892960 bytes (3.58 %), threshold: 10.00 %]
 196.683: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 64 regions, reclaimable: 758671280 bytes (3.29 %), threshold: 10.00 %]
 202.352: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 63 regions, reclaimable: 758561896 bytes (3.29 %), threshold: 10.00 %]
 208.441: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 66 regions, reclaimable: 808763384 bytes (3.51 %), threshold: 10.00 %]
 212.869: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 71 regions, reclaimable: 888348096 bytes (3.85 %), threshold: 10.00 %]
 218.121: [G1Ergonomics (Mixed GCs) do not start mixed GCs, reason: reclaimable percentage not over threshold, candidate old regions: 76 regions, reclaimable: 959590656 bytes (4.16 %), threshold: 10.00 %]

这说明堆大了,G1自动调整起效果了. 对象基本在新生代分配,没有怎么晋升到老年代,young gc就满足了需求.因此也不会触发full gc.

可以看到mixed gc 没有触发回收的原因是,可以回收的收益达不到预期.

另外 如何看是否发生巨型对象分配.

恰好程序现在启动时,会分配一些巨型对象

程序启动时 巨型对象分配 尝试触发mixed gc
[root@node2 octopus]# cat /tmp/dse.log-22g-soa |grep mixed|head -n 1
 15.834: [G1Ergonomics (Concurrent Cycles) do not request concurrent cycle initiation, reason: still doing mixed collections, occupancy: 12062818304 bytes, allocation request: 17400888 bytes, threshold: 10380902400 bytes (45.00 %), source: concurrent humongous allocation]

再另外 jmap dump 和 程序实际占用的内存大小对不上

这是使用java 11 时的情况.

/usr/lib/jvm/java-11/bin/jmap  -dump:format=b,file=/data/all.dump  31821
/usr/lib/jvm/java-11/bin/jmap  -dump:live,format=b,file=/data/live.dump  31821

[root@node2 octopus]# ll -alh /data/*.dump
-rw------- 1 root root 11G May 20 09:34 /data/all.dump #全部对象的堆只有11GB大小
-rw------- 1 root root 10G May 20 09:35 /data/live.dump #存活对象的堆只有10GB大小
[root@node2 octopus]# grep Dump /tmp/dse.log
[46826.131s][info][gc,start       ] GC(150) Pause Full (Heap Dump Initiated GC)
[46826.157s][info][gc             ] GC(150) Pause Full (Heap Dump Initiated GC) 15099M->15024M(18432M) 25.916ms
# gc 日志里面显示,堆dump之后,内存占用还是15GB左右.

[root@node2 octopus]# /usr/lib/jvm/java-11/bin/jstat -gcold 31821 1000
   MC       MU      CCSC     CCSU       OC          OU       YGC    FGC    FGCT    CGC    CGCT     GCT
 25728.0  24667.5   2688.0   2437.7  17891328.0  15385109.5     93     6    0.360    96    0.075    2.605
 25728.0  24667.5   2688.0   2437.7  17891328.0  15385109.5     93     6    0.360    96    0.075    2.605
# 老年代使用量OU显示还是 使用了接近15GB内存

^C[root@node2 octopus]# /usr/lib/jvm/java-11/bin/jstat -gcnew 31821 1000
 S0C    S1C    S0U    S1U   TT MTT  DSS      EC       EU     YGC     YGCT
   0.0    0.0    0.0    0.0 15  15 65536.0 983040.0      0.0     93    2.170
   0.0    0.0    0.0    0.0 15  15 65536.0 983040.0      0.0     93    2.170
   0.0    0.0    0.0    0.0 15  15 65536.0 983040.0      0.0     93    2.170
#年轻代使用量EU,显示已经是0

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
  31821 root      20   0  159.4g  15.0g  25260 S   0.0 31.9   3:42.77 jsvc
#top也可以看到 程序实际使用了15GB. (15.0g - 25260)