记一次线上oom问题排查

Posted by Solace Blog on May 7, 2020
  • 帖子地址

  • 前提

    查看了应用日志,是凌晨就已经OOM了,但是直到早上八点多客服反馈登录不了才发现问题,当时在路上,于是让运维dump了堆栈和线程信息,重启了服务器。

    jvm配置:

    -Xmx2000M				最大堆大小
    -Xms2000M				初始堆大小
    -Xmn500M				年轻代大小
    -XX:PermSize=250M		持久代大小
    -XX:MaxPermSize=250M	持久代最大值
    -Xss256K				每个线程的堆栈大小
    -XX:-UseGCOverheadLimit
    -Djava.net.preferIPv4Stack=true
    -Djava.net.preferIPv4Addresses
    -XX:+DisableExplicitGC	关闭System.gc()
    -XX:SurvivorRatio=1		Eden区与Survivor区的大小比值
    -XX:+UseConcMarkSweepGC
    -XX:+UseParNewGC		设置年轻代为并行收集
    -XX:+CMSParallelRemarkEnabled
    -XX:+UseCMSCompactAtFullCollection
    -XX:CMSFullGCsBeforeCompaction=0
    -XX:+CMSClassUnloadingEnabled
    -XX:LargePageSizeInBytes=128M
    -XX:+UseFastAccessorMethods
    -XX:+UseCMSInitiatingOccupancyOnly
    -XX:CMSInitiatingOccupancyFraction=60
    -XX:SoftRefLRUPolicyMSPerMB=0
    -XX:+PrintClassHistogram
    -XX:+PrintGCDetails
    -XX:+PrintGCTimeStamps
    -XX:+PrintHeapAtGC
    -Xloggc:/data/apps/apache-tomcat-6.0.36/logs/gc.log
    -XX:+HeapDumpOnOutOfMemoryError 
    -XX:HeapDumpPath=/data/apps/apache-tomcat-6.0.36/logs/gc.hprof
    
  • jvm.heap

    2020-05-07 08:46:
    
    using parallel threads in the new generation.
    using thread-local object allocation.
    Concurrent Mark-Sweep GC
    
    Heap Configuration:
       MinHeapFreeRatio = 40
       MaxHeapFreeRatio = 70
       MaxHeapSize      = 2097152000 (2000.0MB)
       NewSize          = 524288000 (500.0MB)
       MaxNewSize       = 524288000 (500.0MB)
       OldSize          = 5439488 (5.1875MB)
       NewRatio         = 2
       SurvivorRatio    = 1
       PermSize         = 262144000 (250.0MB)
       MaxPermSize      = 262144000 (250.0MB)
    
    Heap Usage:
    New Generation (Eden + 1 Survivor Space):
       capacity = 349569024 (333.375MB)
       used     = 137905504 (131.51693725585938MB)
       free     = 211663520 (201.85806274414062MB)
       39.45014990801931% used
    Eden Space:
       capacity = 174850048 (166.75MB)
       used     = 129558040 (123.5561752319336MB)
       free     = 45292008 (43.193824768066406MB)
       74.09665681075478% used
    From Space:
       capacity = 174718976 (166.625MB)
       used     = 8347464 (7.960762023925781MB)
       free     = 166371512 (158.66423797607422MB)
       4.777651627262284% used
    To Space:
       capacity = 174718976 (166.625MB)
       used     = 0 (0.0MB)
       free     = 174718976 (166.625MB)
       0.0% used
    concurrent mark-sweep generation:
       capacity = 1572864000 (1500.0MB)
       used     = 532712344 (508.0340805053711MB)
       free     = 1040151656 (991.9659194946289MB)
       33.86893870035807% used
    Perm Generation:
       capacity = 262144000 (250.0MB)
       used     = 149215056 (142.3025665283203MB)
       free     = 112928944 (107.69743347167969MB)
       56.921026611328124% used
    2020-05-07 08:46:
    
    using thread-local object allocation.
    Parallel GC with 4 thread(s)
    
    Heap Configuration:
       MinHeapFreeRatio = 40
       MaxHeapFreeRatio = 70
       MaxHeapSize      = 1566572544 (1494.0MB)
       NewSize          = 1310720 (1.25MB)
       MaxNewSize       = 17592186044415 MB
       OldSize          = 5439488 (5.1875MB)
       NewRatio         = 2
       SurvivorRatio    = 8
       PermSize         = 21757952 (20.75MB)
       MaxPermSize      = 174063616 (166.0MB)
    
    Heap Usage:
    PS Young Generation
    Eden Space:
       capacity = 1966080 (1.875MB)
       used     = 628456 (0.5993423461914062MB)
       free     = 1337624 (1.2756576538085938MB)
       31.964925130208332% used
    From Space:
       capacity = 196608 (0.1875MB)
       used     = 98304 (0.09375MB)
       free     = 98304 (0.09375MB)
       50.0% used
    To Space:
       capacity = 196608 (0.1875MB)
       used     = 0 (0.0MB)
       free     = 196608 (0.1875MB)
       0.0% used
    PS Old Generation
       capacity = 65273856 (62.25MB)
       used     = 4938552 (4.709770202636719MB)
       free     = 60335304 (57.54022979736328MB)
       7.565895907850151% used
    PS Perm Generation
       capacity = 25296896 (24.125MB)
       used     = 25255776 (24.085784912109375MB)
       free     = 41120 (0.039215087890625MB)
       99.8374504128886% used	
    
  • 应用日志

  • jstack信息

    2020-05-07 08:47 cpu_busy: 1
    "scheduler-9" prio=10 tid=0x00007f962c6e6000 nid=0x3a0d waiting on condition [0x00007f962437f000]
       java.lang.Thread.State: TIMED_WAITING (parking)
    	at sun.misc.Unsafe.park(Native Method)
    	- parking to wait for  <0x0000000796111028> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
    	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2081)
    	at java.util.concurrent.DelayQueue.take(DelayQueue.java:193)
    	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:688)
    	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:681)
    	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    	at java.lang.Thread.run(Thread.java:679)
      
    "scheduler-8" prio=10 tid=0x00007f962cb95800 nid=0x3a0c waiting on condition [0x00007f96248d4000]
       java.lang.Thread.State: TIMED_WAITING (parking)
    	at sun.misc.Unsafe.park(Native Method)
    	- parking to wait for  <0x0000000796111028> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
    	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
    	at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2081)
    	at java.util.concurrent.DelayQueue.take(DelayQueue.java:193)
    	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:688)
    	at java.util.concurrent.ScheduledThreadPoolExecutor$DelayedWorkQueue.take(ScheduledThreadPoolExecutor.java:681)
    	at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1069)
    	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1131)
    	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    	at java.lang.Thread.run(Thread.java:679)
      
    "scheduler-7" prio=10 tid=0x00007f962c7c6000 nid=0x3a0b waiting on condition [0x00007f962468b000]
       java.lang.Thread.State: TIMED_WAITING (parking)
    	at sun.misc.Unsafe.park(Native Method)
      
      
      
    2020-05-07 08:47 cpu_busy: 1
    "VM Thread" prio=10 tid=0x00007f962c0fe800 nid=0x38e1 runnable 
      
    "Gang worker#0 (Parallel GC Threads)" prio=10 tid=0x00007f962c010800 nid=0x38dc runnable 
      
    "Gang worker#1 (Parallel GC Threads)" prio=10 tid=0x00007f962c012800 nid=0x38dd runnable 
      
    "Gang worker#2 (Parallel GC Threads)" prio=10 tid=0x00007f962c014800 nid=0x38de runnable 
      
    "Gang worker#3 (Parallel GC Threads)" prio=10 tid=0x00007f962c016000 nid=0x38df runnable 
      
    "Concurrent Mark-Sweep GC Thread" prio=10 tid=0x00007f962c0a5000 nid=0x38e0 runnable 
    "VM Periodic Task Thread" prio=10 tid=0x00007f962c258000 nid=0x39b4 waiting on condition 
      
    JNI global references: 1952
      
      
      
      
    2020-05-07 08:47 cpu_busy: 1
    "DefaultQuartzScheduler_QuartzSchedulerThread" prio=10 tid=0x00007f1c8c39c000 nid=0x6687 waiting on condition [0x00007f1c4b9f8000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
    	at java.lang.Thread.sleep(Native Method)
    	at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:275)
      
    "DefaultQuartzScheduler_Worker-10" prio=10 tid=0x00007f1c8c3a8000 nid=0x6686 in Object.wait() [0x00007f1c4baf9000]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
    	at java.lang.Object.wait(Native Method)
    	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
    	- locked <0x00000000a2a06000> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
      
    "DefaultQuartzScheduler_Worker-9" prio=10 tid=0x00007f1c8c3d1000 nid=0x6685 in Object.wait() [0x00007f1c4bbfa000]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
    	at java.lang.Object.wait(Native Method)
    	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
    	- locked <0x00000000a2a001a8> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
      
    "DefaultQuartzScheduler_Worker-8" prio=10 tid=0x00007f1c8c1e7800 nid=0x6684 in Object.wait() [0x00007f1c4bcfb000]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
    	at java.lang.Object.wait(Native Method)
    	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
    	- locked <0x00000000a2a04020> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
      
    "DefaultQuartzScheduler_Worker-7" prio=10 tid=0x00007f1c8c3a5000 nid=0x6683 in Object.wait() [0x00007f1c4bdfc000]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
    	at java.lang.Object.wait(Native Method)
    	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
    	- locked <0x00000000a2a00228> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
      
    "DefaultQuartzScheduler_Worker-6" prio=10 tid=0x00007f1c8c241000 nid=0x6682 in Object.wait() [0x00007f1c3bffe000]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
      
      
      
    2020-05-07 08:47 cpu_busy: 1
    "Attach Listener" daemon prio=10 tid=0x00007f1c68001000 nid=0x30d waiting on condition [0x0000000000000000]
       java.lang.Thread.State: RUNNABLE
      
    "DestroyJavaVM" prio=10 tid=0x00007f1c8c008000 nid=0x665b waiting on condition [0x0000000000000000]
       java.lang.Thread.State: RUNNABLE
      
    "DefaultQuartzScheduler_QuartzSchedulerThread" prio=10 tid=0x00007f1c8c39c000 nid=0x6687 waiting on condition [0x00007f1c4b9f8000]
       java.lang.Thread.State: TIMED_WAITING (sleeping)
    	at java.lang.Thread.sleep(Native Method)
    	at org.quartz.core.QuartzSchedulerThread.run(QuartzSchedulerThread.java:275)
      
    "DefaultQuartzScheduler_Worker-10" prio=10 tid=0x00007f1c8c3a8000 nid=0x6686 in Object.wait() [0x00007f1c4baf9000]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
    	at java.lang.Object.wait(Native Method)
    	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
    	- locked <0x00000000a2a06000> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
      
    "DefaultQuartzScheduler_Worker-9" prio=10 tid=0x00007f1c8c3d1000 nid=0x6685 in Object.wait() [0x00007f1c4bbfa000]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
    	at java.lang.Object.wait(Native Method)
    	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
    	- locked <0x00000000a2a001a8> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
      
    "DefaultQuartzScheduler_Worker-8" prio=10 tid=0x00007f1c8c1e7800 nid=0x6684 in Object.wait() [0x00007f1c4bcfb000]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
    	at java.lang.Object.wait(Native Method)
    	at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:523)
    	- locked <0x00000000a2a04020> (a org.quartz.simpl.SimpleThreadPool$WorkerThread)
      
    "DefaultQuartzScheduler_Worker-7" prio=10 tid=0x00007f1c8c3a5000 nid=0x6683 in Object.wait() [0x00007f1c4bdfc000]
       java.lang.Thread.State: TIMED_WAITING (on object monitor)
    
  • 用jhat -J-Xmx50m jvm.dump查看所有类的实例数


这是我第一次碰到OOM,我不理解的是为什么我看jvm.heap里面,堆内存使用率不是百分百,但请求依然是OOM呢?这些堆栈文件是运维用脚本生成的,必要的话我可以把脚本贴上来。

我会继续排查问题,希望哪位看到了也能帮我瞅瞅,非常感谢。


ps:

  • 发现日志的打印时间顺序有问题,不知道这个会不会可以找到什么线索?