I was recently analyzing a TCP problem: during operations such as batch-restarting OSDs, communication between OSDs, or between clients (qemu-kvm) and OSDs, could get stuck. The following ss command showed that the TCP send queues of multiple sockets had a large backlog of segments that were never ACKed by the peer, until either the application layer hit an error and tore down and rebuilt the socket, or IO stalled completely for a long time:

$ ss -ntp | sort -nk 3 | tail

From the ss output there are actually two distinct cases. In the first, packet transmission and reception stall completely, which can be inferred from the backoff, segs_out, segs_in, and retrans fields. For example, here is the socket state on one side of a connection between two OSDs:

# ss -ntpioe 'src 2019:194:201:806::115:6834 && dst 2019:194:201:806::117:60331'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 99440 2019:194:201:806::115:6834 2019:194:201:806::117:60331 users:(("ceph-osd",pid=49490,fd=70)) timer:(on,25sec,5) uid:167 ino:1747269 sk:ffff88148ff0c600 <->
ts sack cubic wscale:7,7 rto:120000 backoff:8 rtt:102.963/115 ato:40 mss:1428 cwnd:1 ssthresh:6 bytes_acked:60470 bytes_received:268044 segs_out:143 segs_in:261 send 111.0Kbps lastsnd:238688 lastrcv:231106 lastack:231106 pacing_rate 2.4Mbps unacked:11 retrans:0/3 lost:8 sacked:3 rcv_rtt:1426 rcv_space:33703

# ss -ntpioe 'src 2019:194:201:806::115:6834 && dst 2019:194:201:806::117:60331'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 99440 2019:194:201:806::115:6834 2019:194:201:806::117:60331 users:(("ceph-osd",pid=49490,fd=70)) timer:(on,1min27sec,6) uid:167 ino:1747269 sk:ffff88148ff0c600 <->
ts sack cubic wscale:7,7 rto:120000 backoff:9 rtt:102.963/115 ato:40 mss:1428 cwnd:1 ssthresh:6 bytes_acked:60470 bytes_received:268044 segs_out:143 segs_in:261 send 111.0Kbps lastsnd:296951 lastrcv:289369 lastack:289369 pacing_rate 2.4Mbps unacked:11 retrans:0/3 lost:8 sacked:3 rcv_rtt:1426 rcv_space:33703

# ss -ntpioe 'src 2019:194:201:806::115:6834 && dst 2019:194:201:806::117:60331'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 99440 2019:194:201:806::115:6834 2019:194:201:806::117:60331 users:(("ceph-osd",pid=49490,fd=70)) timer:(on,9.380ms,6) uid:167 ino:1747269 sk:ffff88148ff0c600 <->
ts sack cubic wscale:7,7 rto:120000 backoff:9 rtt:102.963/115 ato:40 mss:1428 cwnd:1 ssthresh:6 bytes_acked:60470 bytes_received:268044 segs_out:143 segs_in:261 send 111.0Kbps lastsnd:375206 lastrcv:367624 lastack:367624 pacing_rate 2.4Mbps unacked:11 retrans:0/3 lost:8 sacked:3 rcv_rtt:1426 rcv_space:33703

# ss -ntpioe 'src 2019:194:201:806::115:6834 && dst 2019:194:201:806::117:60331'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 99440 2019:194:201:806::115:6834 2019:194:201:806::117:60331 users:(("ceph-osd",pid=49490,fd=70)) timer:(on,1min39sec,7) uid:167 ino:1747269 sk:ffff88148ff0c600 <->
ts sack cubic wscale:7,7 rto:120000 backoff:10 rtt:102.963/115 ato:40 mss:1428 cwnd:1 ssthresh:6 bytes_acked:60470 bytes_received:268044 segs_out:143 segs_in:261 send 111.0Kbps lastsnd:405749 lastrcv:398167 lastack:398167 pacing_rate 2.4Mbps unacked:11 retrans:0/3 lost:8 sacked:3 rcv_rtt:1426 rcv_space:33703

Note that the first field of the retrans info shown by ss is the tp->retrans_out field of the kernel socket structure, which can differ greatly from the icsk->icsk_retransmits counter printed in /proc/net/tcp(6) and shown as the third field of ss's timer info, e.g. timer:(on,1min25sec,0).

In the second case, packet transmission and reception are not completely stalled, but no application data gets through; this can be inferred from backoff, the unchanging bytes_acked, and the absence of a retrans counter. For example, here is the OSD-side socket state of a connection between a qemu-kvm client and an OSD:

# ss -ntpioem 'src 2025:3406::1000:2004:6830 && dst 2025:3406::13:152:57432'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 89964 2025:3406::1000:2004:6830 2025:3406::13:152:57432 users:(("ceph-osd",pid=810581,fd=138)) timer:(on,1min49sec,0) uid:167 ino:1087570918 sk:ffff881929f8b480 <->
skmem:(r0,rb2563164,t0,tb46080,f3220,w95084,o0,bl0,d241) ts sack cubic wscale:7,7 rto:120000 backoff:108 rtt:0.503/0.836 ato:40 mss:1428 rcvmss:417 advmss:1428 cwnd:17 ssthresh:16 bytes_acked:2289515 bytes_received:5448690 segs_out:71993 segs_in:70918 send 386.1Mbps lastsnd:73611614 lastrcv:115 lastack:115 pacing_rate 817.0Mbps unacked:18 sacked:1 reordering:97 rcv_rtt:312308 rcv_space:399500

# ss -ntpioem 'src 2025:3406::1000:2004:6830 && dst 2025:3406::13:152:57432'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 89964 2025:3406::1000:2004:6830 2025:3406::13:152:57432 users:(("ceph-osd",pid=810581,fd=138)) timer:(on,7.910ms,0) uid:167 ino:1087570918 sk:ffff881929f8b480 <->
skmem:(r0,rb2563164,t0,tb46080,f3220,w95084,o0,bl0,d241) ts sack cubic wscale:7,7 rto:120000 backoff:108 rtt:0.503/0.836 ato:40 mss:1428 rcvmss:417 advmss:1428 cwnd:17 ssthresh:16 bytes_acked:2289515 bytes_received:5455965 segs_out:72090 segs_in:71015 send 386.1Mbps lastsnd:73713281 lastrcv:721 lastack:721 pacing_rate 817.0Mbps unacked:18 sacked:1 reordering:97 rcv_rtt:312308 rcv_space:399500

# ss -ntpioem 'src 2025:3406::1000:2004:6830 && dst 2025:3406::13:152:57432'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 89964 2025:3406::1000:2004:6830 2025:3406::13:152:57432 users:(("ceph-osd",pid=810581,fd=138)) timer:(on,1min25sec,0) uid:167 ino:1087570918 sk:ffff881929f8b480 <->
skmem:(r0,rb2563164,t0,tb46080,f3220,w95084,o0,bl0,d241) ts sack cubic wscale:7,7 rto:120000 backoff:109 rtt:0.503/0.836 ato:40 mss:1428 rcvmss:417 advmss:1428 cwnd:17 ssthresh:16 bytes_acked:2289515 bytes_received:5459040 segs_out:72131 segs_in:71056 send 386.1Mbps lastsnd:73755687 lastrcv:457 lastack:457 pacing_rate 817.0Mbps unacked:18 sacked:1 reordering:97 rcv_rtt:312308 rcv_space:399500

What both cases have in common is that the retransmitted segments can never actually be sent. Tracing recent upstream TCP changes turned up a bug in which a TCP SACK security patch can cause TCP retransmission to fail, and the affected machines happened to be running a kernel with exactly that bug; applying the follow-up fix resolved the problem.

The problem code

The problem lies in the following logic newly added to the function tcp_fragment:

if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
	NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
	return -ENOMEM;
}

Printing the TCP counters also reveals a clue:

$ nstat -z | grep TCPWqueueTooBig
TcpExtTCPWqueueTooBig 2 0.0
$ nstat -za | grep TCPWqueueTooBig
TcpExtTCPWqueueTooBig 1483185 0.0

Combined with the ss output (sk_wmem_queued corresponds to the skmem w field, and sk_sndbuf to the skmem tb field):

(95084 >> 1) > 46080

this clearly satisfies the condition of the -ENOMEM branch above.

Workaround

tcp_sendmsg() and tcp_sendpage() may each queue one extra sk_buff (64 KB), which lets sk_wmem_queued satisfy the condition for returning -ENOMEM (consistent with the skmem info printed by the ss -m option above). Therefore, if the application layer needs to work around this, strictly speaking it should set SO_SNDBUF to a value slightly larger than 64 KB, i.e. SKB_TRUESIZE(GSO_MAX_SIZE) (the SO_SNDBUF value is automatically doubled when stored into sk_sndbuf):

  • SKB_TRUESIZE(x) can be worked out (from /proc/sys/net/core/wmem_max being 212992) to be (x + 576)
  • Therefore, SKB_TRUESIZE(GSO_MAX_SIZE) = SKB_TRUESIZE(64K) = (64K + 576)
  • The maximum value accepted when setting SO_SNDBUF is controlled by /proc/sys/net/core/wmem_max, i.e. without changing system settings, the final sk_sndbuf can be at most 212992 * 2 = 416K

Setting the SO_SNDBUF option locks sk->sk_sndbuf against the kernel's own adjustment (sk_stream_moderate_sndbuf and tcp_sndbuf_expand). This indirectly prevents the condition (sk->sk_wmem_queued >> 1) > sk->sk_sndbuf from being triggered by a shrinking sk_sndbuf and thus avoids the -ENOMEM path; the trade-off is that the kernel's TCP memory-control mechanism no longer applies to the socket.

References

TCP Selective Acknowledgment Options

https://tools.ietf.org/html/rfc2018

The TCP SACK panic

https://lwn.net/Articles/791409/

TCP SACK PANIC - Kernel vulnerabilities - CVE-2019-11477, CVE-2019-11478 & CVE-2019-11479

https://access.redhat.com/security/vulnerabilities/tcpsack

Network performance regressions from TCP SACK vulnerability fixes

https://databricks.com/blog/2019/08/01/network-performance-regressions-from-tcp-sack-vulnerability-fixes.html

Adventures in the TCP stack: Uncovering performance regressions in the TCP SACKs vulnerability fixes

https://databricks.com/blog/2019/09/16/adventures-in-the-tcp-stack-performance-regressions-vulnerability-fixes.html

tcp: tcp_fragment() should apply sane memory limits

https://github.com/torvalds/linux/commit/f070ef2ac66716357066b683fb0baf55f8191a2e

tcp: refine memory limit test in tcp_fragment()

https://github.com/torvalds/linux/commit/b6653b3629e5b88202be3c9abc44713973f5c4b4

tcp: be more careful in tcp_fragment()

https://github.com/torvalds/linux/commit/b617158dc096709d8600c53b6052144d12b89fab

SNMP counter

https://www.kernel.org/doc/html/latest/networking/snmp_counter.html

For reference, the Linux TCP state-machine comment from net/ipv4/tcp_input.c:

/* Linux NewReno/SACK/FACK/ECN state machine.
* --------------------------------------
*
* "Open" Normal state, no dubious events, fast path.
* "Disorder" In all the respects it is "Open",
* but requires a bit more attention. It is entered when
* we see some SACKs or dupacks. It is split of "Open"
* mainly to move some processing from fast path to slow one.
* "CWR" CWND was reduced due to some Congestion Notification event.
* It can be ECN, ICMP source quench, local device congestion.
* "Recovery" CWND was reduced, we are fast-retransmitting.
* "Loss" CWND was reduced due to RTO timeout or SACK reneging.
*
* tcp_fastretrans_alert() is entered:
* - each incoming ACK, if state is not "Open"
* - when arrived ACK is unusual, namely:
* * SACK
* * Duplicate ACK.
* * ECN ECE.
*
* Counting packets in flight is pretty simple.
*
* in_flight = packets_out - left_out + retrans_out
*
* packets_out is SND.NXT-SND.UNA counted in packets.
*
* retrans_out is number of retransmitted segments.
*
* left_out is number of segments left network, but not ACKed yet.
*
* left_out = sacked_out + lost_out
*
* sacked_out: Packets, which arrived to receiver out of order
* and hence not ACKed. With SACKs this number is simply
* amount of SACKed data. Even without SACKs
* it is easy to give pretty reliable estimate of this number,
* counting duplicate ACKs.
*
* lost_out: Packets lost by network. TCP has no explicit
* "loss notification" feedback from network (for now).
* It means that this number can be only _guessed_.
* Actually, it is the heuristics to predict lossage that
* distinguishes different algorithms.
*
* F.e. after RTO, when all the queue is considered as lost,
* lost_out = packets_out and in_flight = retrans_out.
*
* Essentially, we have now two algorithms counting
* lost packets.
*
* FACK: It is the simplest heuristics. As soon as we decided
* that something is lost, we decide that _all_ not SACKed
* packets until the most forward SACK are lost. I.e.
* lost_out = fackets_out - sacked_out and left_out = fackets_out.
* It is absolutely correct estimate, if network does not reorder
* packets. And it loses any connection to reality when reordering
* takes place. We use FACK by default until reordering
* is suspected on the path to this destination.
*
* NewReno: when Recovery is entered, we assume that one segment
* is lost (classic Reno). While we are in Recovery and
* a partial ACK arrives, we assume that one more packet
* is lost (NewReno). This heuristics are the same in NewReno
* and SACK.
*
* Imagine, that's all! Forget about all this shamanism about CWND inflation
* deflation etc. CWND is real congestion window, never inflated, changes
* only according to classic VJ rules.
*
* Really tricky (and requiring careful tuning) part of algorithm
* is hidden in functions tcp_time_to_recover() and tcp_xmit_retransmit_queue().
* The first determines the moment _when_ we should reduce CWND and,
* hence, slow down forward transmission. In fact, it determines the moment
* when we decide that hole is caused by loss, rather than by a reorder.
*
* tcp_xmit_retransmit_queue() decides, _what_ we should retransmit to fill
* holes, caused by lost packets.
*
* And the most logically complicated part of algorithm is undo
* heuristics. We detect false retransmits due to both too early
* fast retransmit (reordering) and underestimated RTO, analyzing
* timestamps and D-SACKs. When we detect that some segments were
* retransmitted by mistake and CWND reduction was wrong, we undo
* window reduction and abort recovery phase. This logic is hidden
* inside several functions named tcp_try_undo_<something>.
*/