记一次线上问题排查

Posted by Solace Blog on May 6, 2020

线上问题概述

某天下午,客服反馈项目中部分人的功能不能使用,部分人的可以正常使用。经排查,发现有一台服务器接受请求后不处理,dump了堆栈信息和线程的信息后,重启了服务然后一切正常。

排查问题思路

  • 线上系统架构采用的是nginx负载均衡到两个节点的服务器上,此次是其中一台服务器(92.4)出问题

  • 首先查看堆栈信息,没有发现oom报错。然后项目较老使用的是CMS(ConcurrentMarkSweep)垃圾收集器,下面是堆栈信息,可以看出JVM各个区也都正常

    Heap Configuration:
       MinHeapFreeRatio = 40
       MaxHeapFreeRatio = 70
       MaxHeapSize      = 2097152000 (2000.0MB)
       NewSize          = 524288000 (500.0MB)
       MaxNewSize       = 524288000 (500.0MB)
       OldSize          = 5439488 (5.1875MB)
       NewRatio         = 2
       SurvivorRatio    = 1
       PermSize         = 262144000 (250.0MB)
       MaxPermSize      = 262144000 (250.0MB)
      
    Heap Usage:
    New Generation (Eden + 1 Survivor Space):
       capacity = 349569024 (333.375MB)
       used     = 125871224 (120.04015350341797MB)
       free     = 223697800 (213.33484649658203MB)
       36.007545107886905% used
    Eden Space:
       capacity = 174850048 (166.75MB)
       used     = 16912984 (16.129478454589844MB)
       free     = 157937064 (150.62052154541016MB)
       9.672850647430192% used
    From Space:
       capacity = 174718976 (166.625MB)
       used     = 108958240 (103.91067504882812MB)
       free     = 65760736 (62.714324951171875MB)
       62.36199552817892% used
    To Space:
       capacity = 174718976 (166.625MB)
       used     = 0 (0.0MB)
       free     = 174718976 (166.625MB)
       0.0% used
    concurrent mark-sweep generation:
       capacity = 1572864000 (1500.0MB)
       used     = 604906696 (576.8839797973633MB)
       free     = 967957304 (923.1160202026367MB)
       38.45893198649089% used
    Perm Generation:
       capacity = 262144000 (250.0MB)
       used     = 174978296 (166.87230682373047MB)
       free     = 87165704 (83.12769317626953MB)
       66.7489227294922% used
    
  • 查看线程问题,发现有报错,信息如下,发现有个线程在链接数据库查询的时候报错了,线程显示在WAITTING,所以应该是查询数据库的时候遇到问题了,大概问题定位到是数据库连接出问题,也就能解释的通为什么另一个节点的tomcat是正常的。

    2020-04-29 16:48 cpu_busy: 59
    "http-8080-287" daemon prio=10 tid=0x00007fbf5b690800 nid=0x62c1 in Object.wait() [0x00007fbef9b51000]
       java.lang.Thread.State: WAITING (on object monitor)
            at java.lang.Object.wait(Native Method)
            - waiting on <0x00000007947ec8d8> (a com.mchange.v2.resourcepool.BasicResourcePool)
            at com.mchange.v2.resourcepool.BasicResourcePool.awaitAvailable(BasicResourcePool.java:1402)
            at com.mchange.v2.resourcepool.BasicResourcePool.prelimCheckoutResource(BasicResourcePool.java:594)
            - locked <0x00000007947ec8d8> (a com.mchange.v2.resourcepool.BasicResourcePool)
            at com.mchange.v2.resourcepool.BasicResourcePool.checkoutResource(BasicResourcePool.java:514)
            at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool.checkoutAndMarkConnectionInUse(C3P0PooledConnectionPool.java:707)
            at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool.checkoutPooledConnection(C3P0PooledConnectionPool.java:634)
            at com.mchange.v2.c3p0.impl.AbstractPoolBackedDataSource.getConnection(AbstractPoolBackedDataSource.java:128)
            at org.springframework.orm.hibernate3.LocalDataSourceConnectionProvider.getConnection(LocalDataSourceConnectionProvider.java:81)
            at org.hibernate.jdbc.ConnectionManager.openConnection(ConnectionManager.java:446)
            at org.hibernate.jdbc.ConnectionManager.getConnection(ConnectionManager.java:167)
            at org.hibernate.jdbc.AbstractBatcher.prepareQueryStatement(AbstractBatcher.java:161)
            at org.hibernate.loader.Loader.prepareQueryStatement(Loader.java:1700)
            at org.hibernate.loader.Loader.doQuery(Loader.java:801)
            at org.hibernate.loader.Loader.doQueryAndInitializeNonLazyCollections(Loader.java:274)
            at org.hibernate.loader.Loader.doList(Loader.java:2542)
            at org.hibernate.loader.Loader.listIgnoreQueryCache(Loader.java:2276)
            at org.hibernate.loader.Loader.list(Loader.java:2271)
            at org.hibernate.loader.custom.CustomLoader.list(CustomLoader.java:316)
            at org.hibernate.impl.SessionImpl.listCustomQuery(SessionImpl.java:1842)
            at org.hibernate.impl.AbstractSessionImpl.list(AbstractSessionImpl.java:165)
            at org.hibernate.impl.SQLQueryImpl.list(SQLQueryImpl.java:157)
            at com.gnetis.roadshow.dao.GnetRsConferenceDAO.queryConferenceTotalNumBySQL(GnetRsConferenceDAO.java:659)
            at com.gnetis.roadshow.conference.confManage.service.impl.ConferenceSearchManage.queryConferenceTotalNumBySQL(ConferenceSearchManage.java:770)
            at com.gnetis.roadshow.conference.confManage.service.impl.ConferenceSearchManage.selecteAcceptanceConference(ConferenceSearchManage.java:474)
            at sun.reflect.GeneratedMethodAccessor386.invoke(Unknown Source)
            at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
            at java.lang.reflect.Method.invoke(Method.java:616)
      
    
    • 首先进入DB的数据库连接数监控页面查看数据库连接是否正常,发现连接数有点大,线上使用了数据库连接池,每台机器最大连接池数为50,所以总的连接数最大是100,但是下面当时故障时间点的连接数是90个,猜想可能是92.4这台机器的连接数满了导致请求不能处理,然后进入mysql,使用show processlist查看运行的线程,只查看其中92.4连接的线程,发现92.4的连接很多而且很多的state是sleep而且time的时间很大,大概有十几天。

      我从中挑了一个,反推或者这个连接的时间,然后用时间去查日志,发现92.4里面当时有个定时任务在执行时报错了。通过代码和业务分析后发现,是由于疫情期间参会人数暴增,代码中参会人的属性是short类型,导致参数人强制类型转换后变成负数,导致定时任务报错。hibernate中旧版本有bug报错后没有释放连接,导致连接池被耗完,后面的请求获取不到连接一直在等待。

经验总结

  • 首先就是做好日志中错误信息的监控,对所有报错都及时处理
  • 增加数据库连接池的最大连接数,增加单台机器的数据库连接监控并及时告警