Recently I ran into a problem while testing an application: as soon as the load rose a little, the application's CPU usage maxed out. Since it is a multi-threaded application, it promptly ate up every CPU in the system.
I originally wanted to debug it with gdb, but gdb was no help: it hung right at attach and never came back. Then I thought of perf, and sure enough it turned up a performance hotspot. After fixing that hotspot, the CPU-saturation problem eased somewhat and became harder to trigger, but after running a while longer it would still show up. And now, whenever the CPU maxed out, not only did gdb not return, even perf record -a -g could no longer be stopped with Ctrl+C. A careful look at top showed sys at 100% and usr at almost 0%, in other words, everything was stuck inside the kernel, so no wonder perf was useless.
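For reference, a minimal sketch of the system-wide perf workflow meant here (the 30-second window is only an example used to bound the recording; perf record writes its samples to perf.data in the current directory, which perf report then reads):

perf record -a -g sleep 30     # sample all CPUs with call graphs for 30 seconds
perf report                    # browse the samples, hottest call chains first
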
When even a weapon like perf is out of action, what's left? In the end @coly reminded me: use echo t > /proc/sysrq-trigger to dump the kernel stack of every task. Good idea. I tried it, and dmesg filled up with output:

Apr 25 22:37:57 v3 kernel: sheep         R  running task        0 19157      1 0x10000080
Apr 25 22:37:57 v3 kernel: ffff8802b41c5418 0000000000000086 ffff88028761ecc0 ffffea0008b0f478
Apr 25 22:37:57 v3 kernel: ffffffff814ef33e ffff88046ff9b588 ffffea0008b0f478 ffffea0008b0f478
Apr 25 22:37:57 v3 kernel: ffff8803ec6ab0f8 ffff8802b41c5fd8 000000000000f4e8 ffff8803ec6ab100
Apr 25 22:37:57 v3 kernel: Call Trace:
Apr 25 22:37:57 v3 kernel: [] ? _spin_lock+0x1e/0x30
Apr 25 22:37:57 v3 kernel: [] ? __lru_cache_add+0x40/0x90
Apr 25 22:37:57 v3 kernel: [] __cond_resched+0x2a/0x40
Apr 25 22:37:57 v3 kernel: [] _cond_resched+0x30/0x40
Apr 25 22:37:57 v3 kernel: [] migrate_pages+0x9d/0x4b0
Apr 25 22:37:57 v3 kernel: [] ? compaction_alloc+0x0/0x3e0
Apr 25 22:37:57 v3 kernel: [] compact_zone+0x4f4/0x770
Apr 25 22:37:57 v3 kernel: [] compact_zone_order+0xa1/0xe0
Apr 25 22:37:57 v3 kernel: [] try_to_compact_pages+0x11c/0x190
Apr 25 22:37:57 v3 kernel: [] ? native_sched_clock+0x13/0x60
Apr 25 22:37:57 v3 kernel: [] __alloc_pages_nodemask+0x5f5/0x940
Apr 25 22:37:57 v3 kernel: [] alloc_pages_vma+0x9a/0x150
Apr 25 22:37:57 v3 kernel: [] do_huge_pmd_anonymous_page+0x145/0x370
Apr 25 22:37:57 v3 kernel: [] handle_mm_fault+0x25a/0x2b0
Apr 25 22:37:57 v3 kernel: [] ? mempool_alloc+0x63/0x140
Apr 25 22:37:57 v3 kernel: [] __get_user_pages+0x12a/0x430
Apr 25 22:37:57 v3 kernel: [] ? rwsem_down_read_failed+0x26/0x30
Apr 25 22:37:57 v3 kernel: [] get_user_pages+0x49/0x50
Apr 25 22:37:57 v3 kernel: [] get_user_pages_fast+0x157/0x1c0
......
Apr 25 22:37:57 v3 kernel: sheep         D ffff88047fc24900     0 19158      1 0x10000080
Apr 25 22:37:57 v3 kernel: ffff880357da9890 0000000000000086 0000000000000000 0000000000000003
Apr 25 22:37:57 v3 kernel: ffff8804700e6ea0 ffff880357da9878 ffffffff812518e2 ffff880357da0000
Apr 25 22:37:57 v3 kernel: ffff8803ec6aa6b8 ffff880357da9fd8 000000000000f4e8 ffff8803ec6aa6b8
Apr 25 22:37:57 v3 kernel: Call Trace:
Apr 25 22:37:57 v3 kernel: [] ? __make_request+0x122/0x5a0
Apr 25 22:37:57 v3 kernel: [] rwsem_down_failed_common+0x95/0x1d0
Apr 25 22:37:57 v3 kernel: [] ? mempool_alloc_slab+0x15/0x20
Apr 25 22:37:57 v3 kernel: [] ? mempool_alloc+0x63/0x140
Apr 25 22:37:57 v3 kernel: [] rwsem_down_read_failed+0x26/0x30
Apr 25 22:37:57 v3 kernel: [] call_rwsem_down_read_failed+0x14/0x30
Apr 25 22:37:57 v3 kernel: [] ? down_read+0x24/0x30
Apr 25 22:37:57 v3 kernel: [] get_user_pages_fast+0x124/0x1c0
......
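
(For the record, the whole capture is just two commands run as root; the output file name below is only an example, and it is worth saving the buffer promptly because dumping every task's stack can produce enough text to wrap the ring buffer.)

echo t > /proc/sysrq-trigger    # ask the kernel to log every task's state and kernel stack
dmesg > /tmp/sysrq-t.txt        # save the ring buffer before it wraps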

Seeing words like "lock" and "mutex" in there confirmed it really was stuck inside the kernel. The moment @coly saw compact_zone() he told me flat out that this was a hugepages problem and to try turning it off. So I disabled it with

echo never > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo never > /sys/kernel/mm/redhat_transparent_hugepage/defrag
echo no > /sys/kernel/mm/redhat_transparent_hugepage/khugepaged/defrag

With that turned off, the application was indeed completely back to normal. It turns out we had hit a similar problem on a hadoop cluster before (which is why alikernel now disables transparent hugepages by default), except that back then the CPU only rose a little, unlike this time, when it simply locked up.
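A side note: the redhat_transparent_hugepage paths above are RHEL-specific; on mainline kernels the same knobs live under /sys/kernel/mm/transparent_hugepage. Either way, reading the file back confirms the setting took effect, e.g.:

cat /sys/kernel/mm/redhat_transparent_hugepage/enabled    # the active choice is the one shown in brackets
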
So besides gdb, valgrind and perf, there is an even sharper tool: "echo t > /proc/sysrq-trigger". And @coly told me there is one sharper still: when even sysrq-trigger no longer works, you can look at /proc/{pid}/wchan, which holds the name of the kernel function that process is blocked in. It's the thing to check when every other method has failed.
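For example, a minimal sketch using pid 19158 from the dump above (any pid works the same way; wchan prints the symbol name without a trailing newline, hence the echo):

# show where every thread of pid 19158 is blocked in the kernel
for w in /proc/19158/task/*/wchan; do
    printf '%s: ' "$w"; cat "$w"; echo
done
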
Lesson learned.