I have recently been looking into a TCP problem: during operations such as batch-restarting OSDs, communication between OSDs, or between clients (qemu-kvm) and OSDs, can get stuck. With an ss command like the one below, you can see that the TCP send queues of multiple sockets have a large backlog of data that is never ACKed by the peer, until either the application layer reports an error and the socket is torn down and rebuilt, or IO stays completely blocked for a long time:
$ ss -ntp | sort -nk 3 | tail
From the ss output there are actually two distinct situations. In the first, packet transmission and reception have stalled completely, which can be inferred from fields such as backoff, segs_out, segs_in, and retrans. For example, here is the socket state on one OSD's side of a connection between two OSDs:
# ss -ntpioe 'src 2019:194:201:806::115:6834 && dst 2019:194:201:806::117:60331'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 99440 2019:194:201:806::115:6834 2019:194:201:806::117:60331 users:(("ceph-osd",pid=49490,fd=70)) timer:(on,25sec,5) uid:167 ino:1747269 sk:ffff88148ff0c600 <->
ts sack cubic wscale:7,7 rto:120000 backoff:8 rtt:102.963/115 ato:40 mss:1428 cwnd:1 ssthresh:6 bytes_acked:60470 bytes_received:268044 segs_out:143 segs_in:261 send 111.0Kbps lastsnd:238688 lastrcv:231106 lastack:231106 pacing_rate 2.4Mbps unacked:11 retrans:0/3 lost:8 sacked:3 rcv_rtt:1426 rcv_space:33703
# ss -ntpioe 'src 2019:194:201:806::115:6834 && dst 2019:194:201:806::117:60331'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 99440 2019:194:201:806::115:6834 2019:194:201:806::117:60331 users:(("ceph-osd",pid=49490,fd=70)) timer:(on,1min27sec,6) uid:167 ino:1747269 sk:ffff88148ff0c600 <->
ts sack cubic wscale:7,7 rto:120000 backoff:9 rtt:102.963/115 ato:40 mss:1428 cwnd:1 ssthresh:6 bytes_acked:60470 bytes_received:268044 segs_out:143 segs_in:261 send 111.0Kbps lastsnd:296951 lastrcv:289369 lastack:289369 pacing_rate 2.4Mbps unacked:11 retrans:0/3 lost:8 sacked:3 rcv_rtt:1426 rcv_space:33703
# ss -ntpioe 'src 2019:194:201:806::115:6834 && dst 2019:194:201:806::117:60331'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 99440 2019:194:201:806::115:6834 2019:194:201:806::117:60331 users:(("ceph-osd",pid=49490,fd=70)) timer:(on,9.380ms,6) uid:167 ino:1747269 sk:ffff88148ff0c600 <->
ts sack cubic wscale:7,7 rto:120000 backoff:9 rtt:102.963/115 ato:40 mss:1428 cwnd:1 ssthresh:6 bytes_acked:60470 bytes_received:268044 segs_out:143 segs_in:261 send 111.0Kbps lastsnd:375206 lastrcv:367624 lastack:367624 pacing_rate 2.4Mbps unacked:11 retrans:0/3 lost:8 sacked:3 rcv_rtt:1426 rcv_space:33703
# ss -ntpioe 'src 2019:194:201:806::115:6834 && dst 2019:194:201:806::117:60331'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 99440 2019:194:201:806::115:6834 2019:194:201:806::117:60331 users:(("ceph-osd",pid=49490,fd=70)) timer:(on,1min39sec,7) uid:167 ino:1747269 sk:ffff88148ff0c600 <->
ts sack cubic wscale:7,7 rto:120000 backoff:10 rtt:102.963/115 ato:40 mss:1428 cwnd:1 ssthresh:6 bytes_acked:60470 bytes_received:268044 segs_out:143 segs_in:261 send 111.0Kbps lastsnd:405749 lastrcv:398167 lastack:398167 pacing_rate 2.4Mbps unacked:11 retrans:0/3 lost:8 sacked:3 rcv_rtt:1426 rcv_space:33703
Note that the third field of the timer information shown by ss, e.g. timer:(on,1min25sec,0), as well as the first field of the retrans information, is the tp->retrans_out field of the kernel socket structure, which can differ hugely from the icsk->icsk_retransmits value printed in /proc/net/tcp(6).
In the second situation, packet transmission and reception have not stalled completely, but no application data is getting through; this can be inferred from fields such as backoff and bytes_acked, together with the absence of retrans. For example, here is the socket state on the OSD side of a connection between a qemu-kvm client and an OSD:
# ss -ntpioem 'src 2025:3406::1000:2004:6830 && dst 2025:3406::13:152:57432'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 89964 2025:3406::1000:2004:6830 2025:3406::13:152:57432 users:(("ceph-osd",pid=810581,fd=138)) timer:(on,1min49sec,0) uid:167 ino:1087570918 sk:ffff881929f8b480 <->
skmem:(r0,rb2563164,t0,tb46080,f3220,w95084,o0,bl0,d241) ts sack cubic wscale:7,7 rto:120000 backoff:108 rtt:0.503/0.836 ato:40 mss:1428 rcvmss:417 advmss:1428 cwnd:17 ssthresh:16 bytes_acked:2289515 bytes_received:5448690 segs_out:71993 segs_in:70918 send 386.1Mbps lastsnd:73611614 lastrcv:115 lastack:115 pacing_rate 817.0Mbps unacked:18 sacked:1 reordering:97 rcv_rtt:312308 rcv_space:399500
# ss -ntpioem 'src 2025:3406::1000:2004:6830 && dst 2025:3406::13:152:57432'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 89964 2025:3406::1000:2004:6830 2025:3406::13:152:57432 users:(("ceph-osd",pid=810581,fd=138)) timer:(on,7.910ms,0) uid:167 ino:1087570918 sk:ffff881929f8b480 <->
skmem:(r0,rb2563164,t0,tb46080,f3220,w95084,o0,bl0,d241) ts sack cubic wscale:7,7 rto:120000 backoff:108 rtt:0.503/0.836 ato:40 mss:1428 rcvmss:417 advmss:1428 cwnd:17 ssthresh:16 bytes_acked:2289515 bytes_received:5455965 segs_out:72090 segs_in:71015 send 386.1Mbps lastsnd:73713281 lastrcv:721 lastack:721 pacing_rate 817.0Mbps unacked:18 sacked:1 reordering:97 rcv_rtt:312308 rcv_space:399500
# ss -ntpioem 'src 2025:3406::1000:2004:6830 && dst 2025:3406::13:152:57432'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 89964 2025:3406::1000:2004:6830 2025:3406::13:152:57432 users:(("ceph-osd",pid=810581,fd=138)) timer:(on,1min25sec,0) uid:167 ino:1087570918 sk:ffff881929f8b480 <->
skmem:(r0,rb2563164,t0,tb46080,f3220,w95084,o0,bl0,d241) ts sack cubic wscale:7,7 rto:120000 backoff:109 rtt:0.503/0.836 ato:40 mss:1428 rcvmss:417 advmss:1428 cwnd:17 ssthresh:16 bytes_acked:2289515 bytes_received:5459040 segs_out:72131 segs_in:71056 send 386.1Mbps lastsnd:73755687 lastrcv:457 lastack:457 pacing_rate 817.0Mbps unacked:18 sacked:1 reordering:97 rcv_rtt:312308 rcv_space:399500
What the two situations have in common is that the retransmitted packets can never be sent out. By going through upstream TCP changes, I found a bug in one of the TCP SACK security patches that can cause TCP retransmission to fail, and the affected machines happened to be running a kernel with that bug. After applying the fix, the problem was resolved.
The problematic code
The problem lies in the following logic that the patch added to tcp_fragment():
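/* Refuse to split the skb (and therefore fail the retransmit that needed
 * the split) once the send queue occupies more than twice sk_sndbuf, and
 * account the event in the TCPWqueueTooBig counter.
 */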
if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
return -ENOMEM;
}
Printing the TCP counters also reveals signs of the problem:
$ nstat -z | grep TCPWqueueTooBig
TcpExtTCPWqueueTooBig 2 0.0
$ nstat -za | grep TCPWqueueTooBig
TcpExtTCPWqueueTooBig 1483185 0.0
Meanwhile, combining this with the ss output (sk_wmem_queued corresponds to the w field of skmem, and sk_sndbuf to the tb field):
(95084 >> 1) > 46080
Clearly, this satisfies the condition of the -ENOMEM branch above.
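For completeness, the upstream follow-up fixes ("tcp: refine memory limit test in tcp_fragment()" and "tcp: be more careful in tcp_fragment()", both listed in the references below) relax this check roughly as sketched here. This is a simplified reconstruction for kernels with a separate retransmit queue; the exact code, and the stable backports for older kernels, differ in detail:
/* tcp_sendmsg() can overshoot sk_wmem_queued by about one full-size skb,
 * so allow some extra room, skip the check for skbs still in the write
 * queue, and always let the first and last skb of the retransmit queue
 * be split.
 */
limit = sk->sk_sndbuf + 2 * SKB_TRUESIZE(GSO_MAX_SIZE);
if (unlikely((sk->sk_wmem_queued >> 1) > limit &&
             tcp_queue != TCP_FRAG_IN_WRITE_QUEUE &&
             skb != tcp_rtx_queue_head(sk) &&
             skb != tcp_rtx_queue_tail(sk))) {
        NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
        return -ENOMEM;
}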
Workaround
tcp_sendmsg() and tcp_sendpage() may each queue roughly one extra sk_buff (64KB) beyond the limit, which is enough for sk_wmem_queued to satisfy the condition that returns -ENOMEM (this matches the skmem information printed by the ss -m option above). Therefore, if the workaround has to be applied at the application layer, strictly speaking the correct approach is to set SO_SNDBUF to a value slightly larger than 64K (i.e. SKB_TRUESIZE(GSO_MAX_SIZE); note that the value passed with SO_SNDBUF is automatically doubled when it is stored into sk_sndbuf):
- SKB_TRUESIZE(x) can be worked out from the default /proc/sys/net/core/wmem_max (212992, which is 256 * SKB_TRUESIZE(256)) to be (x + 576).
- Therefore, SKB_TRUESIZE(GSO_MAX_SIZE) = SKB_TRUESIZE(64K) = (64K + 576).
- The maximum value accepted for SO_SNDBUF is controlled by /proc/sys/net/core/wmem_max, so without changing the system settings the resulting sk_sndbuf can be at most 212992 * 2 = 416K.
Because setting SO_SNDBUF locks down the automatic tuning of sk->sk_sndbuf (sk_stream_moderate_sndbuf, tcp_sndbuf_expand), the condition (sk->sk_wmem_queued >> 1) > sk->sk_sndbuf can no longer be pushed into the -ENOMEM path by a shrinking sk_sndbuf. This works around the problem indirectly, but it also disables the kernel's mechanism for controlling TCP memory usage.
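A minimal sketch of what such an application-level workaround could look like is shown below; the helper name and the exact value are illustrative assumptions, not the actual change made in Ceph or qemu-kvm:
#include <stdio.h>
#include <sys/socket.h>

/* SKB_TRUESIZE(GSO_MAX_SIZE) = 64K + 576, i.e. slightly larger than 64K;
 * the kernel doubles the requested value when storing it into sk->sk_sndbuf
 * and caps the request at /proc/sys/net/core/wmem_max.
 */
#define PINNED_SNDBUF (64 * 1024 + 576)

/* Hypothetical helper: pin SO_SNDBUF on an already-created socket so that
 * sk_sndbuf is no longer auto-tuned (in particular, never shrunk) by the
 * kernel.
 */
static int pin_sndbuf(int fd)
{
        int val = PINNED_SNDBUF;
        socklen_t len = sizeof(val);

        if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, sizeof(val)) < 0) {
                perror("setsockopt(SO_SNDBUF)");
                return -1;
        }

        /* getsockopt() reports the doubled value actually kept in sk_sndbuf. */
        if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &val, &len) == 0)
                fprintf(stderr, "sk_sndbuf is now %d bytes\n", val);

        return 0;
}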
References
TCP Selective Acknowledgment Options
https://tools.ietf.org/html/rfc2018
The TCP SACK panic
https://lwn.net/Articles/791409/
TCP SACK PANIC - Kernel vulnerabilities - CVE-2019-11477, CVE-2019-11478 & CVE-2019-11479
https://access.redhat.com/security/vulnerabilities/tcpsack
Network performance regressions from TCP SACK vulnerability fixes
Adventures in the TCP stack: Uncovering performance regressions in the TCP SACKs vulnerability fixes
tcp: tcp_fragment() should apply sane memory limits
https://github.com/torvalds/linux/commit/f070ef2ac66716357066b683fb0baf55f8191a2e
tcp: refine memory limit test in tcp_fragment()
https://github.com/torvalds/linux/commit/b6653b3629e5b88202be3c9abc44713973f5c4b4
tcp: be more careful in tcp_fragment()
https://github.com/torvalds/linux/commit/b617158dc096709d8600c53b6052144d12b89fab
SNMP counter
https://www.kernel.org/doc/html/latest/networking/snmp_counter.html
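Appendix: the Linux kernel's own description of its NewReno/SACK/FACK/ECN loss-recovery state machine, quoted from the comment block in net/ipv4/tcp_input.c: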
/* Linux NewReno/SACK/FACK/ECN state machine.
* --------------------------------------
*
* "Open" Normal state, no dubious events, fast path.
* "Disorder" In all the respects it is "Open",
* but requires a bit more attention. It is entered when
* we see some SACKs or dupacks. It is split of "Open"
* mainly to move some processing from fast path to slow one.
* "CWR" CWND was reduced due to some Congestion Notification event.
* It can be ECN, ICMP source quench, local device congestion.
* "Recovery" CWND was reduced, we are fast-retransmitting.
* "Loss" CWND was reduced due to RTO timeout or SACK reneging.
*
* tcp_fastretrans_alert() is entered:
* - each incoming ACK, if state is not "Open"
* - when arrived ACK is unusual, namely:
* * SACK
* * Duplicate ACK.
* * ECN ECE.
*
* Counting packets in flight is pretty simple.
*
* in_flight = packets_out - left_out + retrans_out
*
* packets_out is SND.NXT-SND.UNA counted in packets.
*
* retrans_out is number of retransmitted segments.
*
* left_out is number of segments left network, but not ACKed yet.
*
* left_out = sacked_out + lost_out
*
* sacked_out: Packets, which arrived to receiver out of order
* and hence not ACKed. With SACKs this number is simply
* amount of SACKed data. Even without SACKs
* it is easy to give pretty reliable estimate of this number,
* counting duplicate ACKs.
*
* lost_out: Packets lost by network. TCP has no explicit
* "loss notification" feedback from network (for now).
* It means that this number can be only _guessed_.
* Actually, it is the heuristics to predict lossage that
* distinguishes different algorithms.
*
* F.e. after RTO, when all the queue is considered as lost,
* lost_out = packets_out and in_flight = retrans_out.
*
* Essentially, we have now two algorithms counting
* lost packets.
*
* FACK: It is the simplest heuristics. As soon as we decided
* that something is lost, we decide that _all_ not SACKed
* packets until the most forward SACK are lost. I.e.
* lost_out = fackets_out - sacked_out and left_out = fackets_out.
* It is absolutely correct estimate, if network does not reorder
* packets. And it loses any connection to reality when reordering
* takes place. We use FACK by default until reordering
* is suspected on the path to this destination.
*
* NewReno: when Recovery is entered, we assume that one segment
* is lost (classic Reno). While we are in Recovery and
* a partial ACK arrives, we assume that one more packet
* is lost (NewReno). This heuristics are the same in NewReno
* and SACK.
*
* Imagine, that's all! Forget about all this shamanism about CWND inflation
* deflation etc. CWND is real congestion window, never inflated, changes
* only according to classic VJ rules.
*
* Really tricky (and requiring careful tuning) part of algorithm
* is hidden in functions tcp_time_to_recover() and tcp_xmit_retransmit_queue().
* The first determines the moment _when_ we should reduce CWND and,
* hence, slow down forward transmission. In fact, it determines the moment
* when we decide that hole is caused by loss, rather than by a reorder.
*
* tcp_xmit_retransmit_queue() decides, _what_ we should retransmit to fill
* holes, caused by lost packets.
*
* And the most logically complicated part of algorithm is undo
* heuristics. We detect false retransmits due to both too early
* fast retransmit (reordering) and underestimated RTO, analyzing
* timestamps and D-SACKs. When we detect that some segments were
* retransmitted by mistake and CWND reduction was wrong, we undo
* window reduction and abort recovery phase. This logic is hidden
* inside several functions named tcp_try_undo_<something>.
*/
Last modified: 2019-12-19