runsisi's

technical notes

TCP 客户端卡住导致的接收队列报文积压

2019-12-29 runsisi#tcp/ip

遇到了一个使用 glance 命令行工具下载镜像卡住的问题,镜像下载一部分之后就莫名其妙的卡住了,通过分析之后发现了两个问题。

问题一

一个问题和 glance image-download 的选项 --progress 有关,不知为何,客户端机器的两个终端 SecureCRT 和 MobaXterm 往终端写时都存在问题,由于 glance 是一个 Python 单线程程序,导致写终端卡住:

# ps aux | grep image-download
root      10481  6.3  0.0 258404 29304 pts/8    S+   19:02   0:11 /usr/bin/python2 /usr/bin/glance image-download --file i1 5d02d508-44cd-429f-8c63-3e44f64f5c75 --progress
root      22708  0.0  0.0 112664   968 pts/7    S+   19:05   0:00 grep --color=auto image-download
# pstack 10481
#0  0x00007f3906e30840 in __write_nocancel () from /lib64/libc.so.6
#1  0x00007f3906dbcfb3 in _IO_new_file_write () from /lib64/libc.so.6
#2  0x00007f3906dbe41c in __GI__IO_do_write () from /lib64/libc.so.6
#3  0x00007f3906dbc810 in __GI__IO_file_sync () from /lib64/libc.so.6
#4  0x00007f3906db20a2 in fflush () from /lib64/libc.so.6
#5  0x00007f3907a984c1 in file_flush () from /lib64/libpython2.7.so.1.0
#6  0x00007f3907b1493c in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#7  0x00007f3907b1457d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#8  0x00007f3907b16efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#9  0x00007f3907aa0858 in function_call () from /lib64/libpython2.7.so.1.0
#10 0x00007f3907a7b9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#11 0x00007f3907a8a995 in instancemethod_call () from /lib64/libpython2.7.so.1.0
#12 0x00007f3907a7b9a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
#13 0x00007f3907ad1e03 in call_method () from /lib64/libpython2.7.so.1.0
#14 0x00007f3907b0fb01 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#15 0x00007f3907b16efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#16 0x00007f3907b143fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#17 0x00007f3907b1457d in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#18 0x00007f3907b16efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#19 0x00007f3907b143fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#20 0x00007f3907b16efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#21 0x00007f3907b143fc in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#22 0x00007f3907b16efd in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#23 0x00007f3907b17002 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#24 0x00007f3907b3043f in run_mod () from /lib64/libpython2.7.so.1.0
#25 0x00007f3907b315fe in PyRun_FileExFlags () from /lib64/libpython2.7.so.1.0
#26 0x00007f3907b32889 in PyRun_SimpleFileExFlags () from /lib64/libpython2.7.so.1.0
#27 0x00007f3907b43a3f in Py_Main () from /lib64/libpython2.7.so.1.0
#28 0x00007f3906d68c05 in __libc_start_main () from /lib64/libc.so.6
#29 0x000000000040071e in _start ()

直接导致 socket 数据接收卡住,直接表现就是发送端的发送报文积压:

# ss -ntp | sort -nk 3 | tail
...
ESTAB      0      3647520   ::ffff:192.168.63.245:8262                 ::ffff:192.168.63.250:64035               users:(("glance-api",pid=66095,fd=5))
# ss -ntpioem 'src ::ffff:192.168.63.245:8262 && dst ::ffff:192.168.63.250:64035'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      3647520                                   ::ffff:192.168.63.245:8262                                                 ::ffff:192.168.63.250:64035               users:(("glance-api",pid=66095,fd=5)) timer:(persist,32sec,0) uid:161 ino:3892928328 sk:ffff880b6aa1e300 <->
         skmem:(r0,rb1061488,t0,tb4194304,f3248,w3695440,o0,bl0) ts sack cubic wscale:7,7 rto:218 backoff:8 rtt:17.091/21.364 ato:40 mss:65468 cwnd:25 ssthresh:58 send 766.1Mbps lastsnd:78721 lastrcv:110933 lastack:23017 pacing_rate 1532.2Mbps rcv_space:43690
# ss -ntpioem 'src ::ffff:192.168.63.245:8262 && dst ::ffff:192.168.63.250:64035'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      3647520                                   ::ffff:192.168.63.245:8262                                                 ::ffff:192.168.63.250:64035               users:(("glance-api",pid=66095,fd=5)) timer:(persist,6.431ms,0) uid:161 ino:3892928328 sk:ffff880b6aa1e300 <->
         skmem:(r0,rb1061488,t0,tb4194304,f3248,w3695440,o0,bl0) ts sack cubic wscale:7,7 rto:218 backoff:8 rtt:17.091/21.364 ato:40 mss:65468 cwnd:25 ssthresh:58 send 766.1Mbps lastsnd:105081 lastrcv:137293 lastack:49377 pacing_rate 1532.2Mbps rcv_space:43690
# ss -ntpioem 'src ::ffff:192.168.63.245:8262 && dst ::ffff:192.168.63.250:64035'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      3647520                                   ::ffff:192.168.63.245:8262                                                 ::ffff:192.168.63.250:64035               users:(("glance-api",pid=66095,fd=5)) timer:(persist,1min44sec,0) uid:161 ino:3892928328 sk:ffff880b6aa1e300 <->
         skmem:(r0,rb1061488,t0,tb4194304,f3248,w3695440,o0,bl0) ts sack cubic wscale:7,7 rto:218 backoff:9 rtt:17.091/21.364 ato:40 mss:65468 cwnd:25 ssthresh:58 send 766.1Mbps lastsnd:118697 lastrcv:150909 lastack:6993 pacing_rate 1532.2Mbps rcv_space:43690

由于 TCP 没有接收窗口,导致发送端持续处于零窗口探测阶段:

probe0

当然,同时通过 ss 查看客户端的接受队列,也能发现存在大量的报文积压。

显然,简单的解决办法就是去掉 --progress 选项。

问题二

还一个问题看上去与网络有关,通过 tcpdump 进行抓包可以发现 glance-api 发往 Ceph OSD 的重传报文无法顺利送达 OSD:

rst

注意:如果客户端和服务端在同一台机器上,即使 IP 地址绑定在不同的网口上,但使用 tcpdump 时需要使用 -i 选项指定 lo 网口进行抓包。

这里最大的疑问在于,客户端的重传报文在服务端无法得到应答,但是最后服务端那个主动断链的 FIN 报文却在客户端能得到正确的响应,通过多次抓包得到的信息都是这样:

50763.c.pcap

50763.pcap

60393.c.pcap

60393.pcap

52826.c.pcap

52826.pcap

通过 ss 查看最后一个抓包所涉及的 socket 信息:

客户端(glance-api)

# ss -ntpioem 'src 192.168.66.246:52826'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      6022                                            192.168.66.246:52826                                                         192.168.87.4:6800                users:(("glance-api",pid=66079,fd=29)) timer:(on,1min1sec,11) uid:161 ino:612952361 sk:ffff88140c232d00 <->
         skmem:(r0,rb6291456,t0,tb87040,f3511,w8777,o0,bl0) ts sack cubic wscale:7,7 rto:120000 backoff:11 rtt:0.112/0.028 ato:40 mss:1448 cwnd:1 ssthresh:7 send 103.4Mbps lastsnd:58688 lastrcv:385026 lastack:385026 pacing_rate 2057.1Mbps unacked:1 retrans:1/11 lost:1 rcv_rtt:1.875 rcv_space:3071208
# ss -ntpioem 'src 192.168.66.246:52826'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      6922                                            192.168.66.246:52826                                                         192.168.87.4:6800                users:(("glance-api",pid=66079,fd=29)) timer:(on,1.863ms,11) uid:161 ino:612952361 sk:ffff88140c232d00 <->
         skmem:(r0,rb6291456,t0,tb87040,f2611,w9677,o0,bl0) ts sack cubic wscale:7,7 rto:120000 backoff:11 rtt:0.112/0.028 ato:40 mss:1448 cwnd:1 ssthresh:7 send 103.4Mbps lastsnd:118137 lastrcv:444475 lastack:444475 pacing_rate 2057.1Mbps unacked:1 retrans:1/11 lost:1 rcv_rtt:1.875 rcv_space:3071208
# ss -ntpioem 'src 192.168.66.246:52826'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      7147                                            192.168.66.246:52826                                                         192.168.87.4:6800                users:(("glance-api",pid=66079,fd=29)) timer:(on,1min46sec,12) uid:161 ino:612952361 sk:ffff88140c232d00 <->
         skmem:(r0,rb6291456,t0,tb87040,f2386,w9902,o0,bl0) ts sack cubic wscale:7,7 rto:120000 backoff:12 rtt:0.112/0.028 ato:40 mss:1448 cwnd:1 ssthresh:7 send 103.4Mbps lastsnd:13925 lastrcv:460583 lastack:460583 pacing_rate 2057.1Mbps unacked:1 retrans:1/12 lost:1 rcv_rtt:1.875 rcv_space:3071208
# ss -ntpioem 'src 192.168.66.246:52826'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      10372                                           192.168.66.246:52826                                                         192.168.87.4:6800                users:(("glance-api",pid=66079,fd=29)) timer:(on,13sec,13) uid:161 ino:612952361 sk:ffff88140c232d00 <->
         skmem:(r0,rb6291456,t0,tb87040,f3257,w13127,o0,bl0) ts sack cubic wscale:7,7 rto:120000 backoff:13 rtt:0.112/0.028 ato:40 mss:1448 cwnd:1 ssthresh:7 send 103.4Mbps lastsnd:106140 lastrcv:673118 lastack:673118 pacing_rate 2057.1Mbps unacked:1 retrans:1/13 lost:1 rcv_rtt:1.875 rcv_space:3071208
# ss -ntpioem 'src 192.168.66.246:52826'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      11047                                           192.168.66.246:52826                                                         192.168.87.4:6800                users:(("glance-api",pid=66079,fd=29)) timer:(on,1min24sec,14) uid:161 ino:612952361 sk:ffff88140c232d00 <->
         skmem:(r0,rb6291456,t0,tb87040,f2582,w13802,o0,bl0) ts sack cubic wscale:7,7 rto:120000 backoff:14 rtt:0.112/0.028 ato:40 mss:1448 cwnd:1 ssthresh:7 send 103.4Mbps lastsnd:35333 lastrcv:722631 lastack:722631 pacing_rate 2057.1Mbps unacked:1 retrans:1/14 lost:1 rcv_rtt:1.875 rcv_space:3071208
# ss -ntpioem 'src 192.168.66.246:52826'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      13897                                           192.168.66.246:52826                                                         192.168.87.4:6800                users:(("glance-api",pid=66079,fd=29)) timer:(on,15sec,15) uid:161 ino:612952361 sk:ffff88140c232d00 <->
         skmem:(r0,rb6291456,t0,tb87040,f3828,w16652,o0,bl0) ts sack cubic wscale:7,7 rto:120000 backoff:15 rtt:0.112/0.028 ato:40 mss:1448 cwnd:1 ssthresh:7 send 103.4Mbps lastsnd:104570 lastrcv:912188 lastack:912188 pacing_rate 2057.1Mbps unacked:1 retrans:1/15 lost:1 rcv_rtt:1.875 rcv_space:3071208

服务端(Ceph OSD)

# ss -ntpioem 'dst 192.168.66.246:52826'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      0                                                 192.168.87.4:6800                                                        192.168.66.246:52826               users:(("ceph-osd",pid=7501,fd=153)) uid:167 ino:3025637930 sk:ffff880fd1450000 <->
         skmem:(r0,rb369280,t0,tb4194304,f0,w0,o0,bl0) ts sack cubic wscale:7,7 rto:201 rtt:0.881/0.176 ato:40 mss:1448 cwnd:1549 ssthresh:1536 send 20367.3Mbps lastsnd:423659 lastrcv:423742 lastack:423658 retrans:0/723 rcv_rtt:35472 rcv_space:29341
# ss -ntpioem 'dst 192.168.66.246:52826'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      0                                                 192.168.87.4:6800                                                        192.168.66.246:52826               users:(("ceph-osd",pid=7501,fd=153)) uid:167 ino:3025637930 sk:ffff880fd1450000 <->
         skmem:(r0,rb369280,t0,tb4194304,f0,w0,o0,bl0) ts sack cubic wscale:7,7 rto:201 rtt:0.881/0.176 ato:40 mss:1448 cwnd:1549 ssthresh:1536 send 20367.3Mbps lastsnd:489181 lastrcv:489264 lastack:489180 retrans:0/723 rcv_rtt:35472 rcv_space:29341
# ss -ntpioem 'dst 192.168.66.246:52826'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      0                                                 192.168.87.4:6800                                                        192.168.66.246:52826               users:(("ceph-osd",pid=7501,fd=153)) uid:167 ino:3025637930 sk:ffff880fd1450000 <->
         skmem:(r0,rb369280,t0,tb4194304,f0,w0,o0,bl0) ts sack cubic wscale:7,7 rto:201 rtt:0.881/0.176 ato:40 mss:1448 cwnd:1549 ssthresh:1536 send 20367.3Mbps lastsnd:677224 lastrcv:677307 lastack:677223 retrans:0/723 rcv_rtt:35472 rcv_space:29341
# ss -ntpioem 'dst 192.168.66.246:52826'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      0                                                 192.168.87.4:6800                                                        192.168.66.246:52826               users:(("ceph-osd",pid=7501,fd=153)) uid:167 ino:3025637930 sk:ffff880fd1450000 <->
         skmem:(r0,rb369280,t0,tb4194304,f0,w0,o0,bl0) ts sack cubic wscale:7,7 rto:201 rtt:0.881/0.176 ato:40 mss:1448 cwnd:1549 ssthresh:1536 send 20367.3Mbps lastsnd:727560 lastrcv:727643 lastack:727559 retrans:0/723 rcv_rtt:35472 rcv_space:29341
# ss -ntpioem 'dst 192.168.66.246:52826'
State       Recv-Q Send-Q                                           Local Address:Port                                                          Peer Address:Port
ESTAB       0      0                                                 192.168.87.4:6800                                                        192.168.66.246:52826               users:(("ceph-osd",pid=7501,fd=153)) uid:167 ino:3025637930 sk:ffff880fd1450000 <->
         skmem:(r0,rb369280,t0,tb4194304,f0,w0,o0,bl0) ts sack cubic wscale:7,7 rto:201 rtt:0.881/0.176 ato:40 mss:1448 cwnd:1549 ssthresh:1536 send 20367.3Mbps lastsnd:918597 lastrcv:918680 lastack:918596 retrans:0/723 rcv_rtt:35472 rcv_space:29341

显然,客户端一直在尝试重传,但客户端始终未收到任何 TCP 报文。

这里仍是一个悬案,临时的规避手段就是设置 /proc/sys/net/ipv4/tcp_retries2 系统参数,或者 TCP_USER_TIMEOUT socket 选项。

参考资料

Manpage of PCAP-FILTER

https://www.tcpdump.org/manpages/pcap-filter.7.html

Linux TCPRTOMIN, TCPRTOMAX and the tcp_retries2 sysctl

https://pracucci.com/linux-tcp-rto-min-max-and-tcp-retries2.html

The “Out of socket memory” error

https://blog.tsunanet.net/2011/03/out-of-socket-memory.html

How to drop 10 million packets per second

https://blog.cloudflare.com/how-to-drop-10-million-packets/

Problem with target “NOTRACK”

https://github.com/ldx/python-iptables/issues/204

nf_conntrack: table full, dropping packet

https://morganwu277.github.io/2018/05/26/Solve-production-issue-of-nf-conntrack-table-full-dropping-packet/

Caveats about Linux connection tracking and high traffic servers

https://i-admin.blogspot.com/2014/02/caveats-about-linux-connection-tracking.html