I was recently analyzing a TCP problem: during operations such as batch-restarting OSDs, communication between OSDs, or between clients (qemu-kvm) and OSDs, could get stuck. The following ss command showed that the TCP send queues of multiple sockets had a large backlog of segments that were never ACKed by the peer, until either the application layer hit an error and tore down and rebuilt the socket, or IO stalled completely for a long time:

$ ss -ntp | sort -nk 3 | tail

From the ss output there are actually two distinct cases. In the first, packet transmission and reception stall completely, which can be inferred from the backoff, segs_out, segs_in, and retrans fields. For example, here is the socket state on one side of a connection between two OSDs:

# ss -ntpioe 'src 2019:194:201:806::115:6834 && dst 2019:194:201:806::117:60331'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 99440 2019:194:201:806::115:6834 2019:194:201:806::117:60331 users:(("ceph-osd",pid=49490,fd=70)) timer:(on,25sec,5) uid:167 ino:1747269 sk:ffff88148ff0c600 <->
ts sack cubic wscale:7,7 rto:120000 backoff:8 rtt:102.963/115 ato:40 mss:1428 cwnd:1 ssthresh:6 bytes_acked:60470 bytes_received:268044 segs_out:143 segs_in:261 send 111.0Kbps lastsnd:238688 lastrcv:231106 lastack:231106 pacing_rate 2.4Mbps unacked:11 retrans:0/3 lost:8 sacked:3 rcv_rtt:1426 rcv_space:33703

# ss -ntpioe 'src 2019:194:201:806::115:6834 && dst 2019:194:201:806::117:60331'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 99440 2019:194:201:806::115:6834 2019:194:201:806::117:60331 users:(("ceph-osd",pid=49490,fd=70)) timer:(on,1min27sec,6) uid:167 ino:1747269 sk:ffff88148ff0c600 <->
ts sack cubic wscale:7,7 rto:120000 backoff:9 rtt:102.963/115 ato:40 mss:1428 cwnd:1 ssthresh:6 bytes_acked:60470 bytes_received:268044 segs_out:143 segs_in:261 send 111.0Kbps lastsnd:296951 lastrcv:289369 lastack:289369 pacing_rate 2.4Mbps unacked:11 retrans:0/3 lost:8 sacked:3 rcv_rtt:1426 rcv_space:33703

# ss -ntpioe 'src 2019:194:201:806::115:6834 && dst 2019:194:201:806::117:60331'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 99440 2019:194:201:806::115:6834 2019:194:201:806::117:60331 users:(("ceph-osd",pid=49490,fd=70)) timer:(on,9.380ms,6) uid:167 ino:1747269 sk:ffff88148ff0c600 <->
ts sack cubic wscale:7,7 rto:120000 backoff:9 rtt:102.963/115 ato:40 mss:1428 cwnd:1 ssthresh:6 bytes_acked:60470 bytes_received:268044 segs_out:143 segs_in:261 send 111.0Kbps lastsnd:375206 lastrcv:367624 lastack:367624 pacing_rate 2.4Mbps unacked:11 retrans:0/3 lost:8 sacked:3 rcv_rtt:1426 rcv_space:33703

# ss -ntpioe 'src 2019:194:201:806::115:6834 && dst 2019:194:201:806::117:60331'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 99440 2019:194:201:806::115:6834 2019:194:201:806::117:60331 users:(("ceph-osd",pid=49490,fd=70)) timer:(on,1min39sec,7) uid:167 ino:1747269 sk:ffff88148ff0c600 <->
ts sack cubic wscale:7,7 rto:120000 backoff:10 rtt:102.963/115 ato:40 mss:1428 cwnd:1 ssthresh:6 bytes_acked:60470 bytes_received:268044 segs_out:143 segs_in:261 send 111.0Kbps lastsnd:405749 lastrcv:398167 lastack:398167 pacing_rate 2.4Mbps unacked:11 retrans:0/3 lost:8 sacked:3 rcv_rtt:1426 rcv_space:33703

Note that the first field of the retrans info shown by ss is the tp->retrans_out field of the kernel socket structure, which can differ greatly from the icsk->icsk_retransmits counter printed in /proc/net/tcp(6) and shown as the third field of ss's timer info, e.g. timer:(on,1min25sec,0).

In the second case, packet transmission and reception are not completely stalled, but no application data gets through; this can be inferred from backoff, the unchanging bytes_acked, and the absence of a retrans counter. For example, here is the OSD-side socket state of a connection between a qemu-kvm client and an OSD:

# ss -ntpioem 'src 2025:3406::1000:2004:6830 && dst 2025:3406::13:152:57432'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 89964 2025:3406::1000:2004:6830 2025:3406::13:152:57432 users:(("ceph-osd",pid=810581,fd=138)) timer:(on,1min49sec,0) uid:167 ino:1087570918 sk:ffff881929f8b480 <->
skmem:(r0,rb2563164,t0,tb46080,f3220,w95084,o0,bl0,d241) ts sack cubic wscale:7,7 rto:120000 backoff:108 rtt:0.503/0.836 ato:40 mss:1428 rcvmss:417 advmss:1428 cwnd:17 ssthresh:16 bytes_acked:2289515 bytes_received:5448690 segs_out:71993 segs_in:70918 send 386.1Mbps lastsnd:73611614 lastrcv:115 lastack:115 pacing_rate 817.0Mbps unacked:18 sacked:1 reordering:97 rcv_rtt:312308 rcv_space:399500

# ss -ntpioem 'src 2025:3406::1000:2004:6830 && dst 2025:3406::13:152:57432'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 89964 2025:3406::1000:2004:6830 2025:3406::13:152:57432 users:(("ceph-osd",pid=810581,fd=138)) timer:(on,7.910ms,0) uid:167 ino:1087570918 sk:ffff881929f8b480 <->
skmem:(r0,rb2563164,t0,tb46080,f3220,w95084,o0,bl0,d241) ts sack cubic wscale:7,7 rto:120000 backoff:108 rtt:0.503/0.836 ato:40 mss:1428 rcvmss:417 advmss:1428 cwnd:17 ssthresh:16 bytes_acked:2289515 bytes_received:5455965 segs_out:72090 segs_in:71015 send 386.1Mbps lastsnd:73713281 lastrcv:721 lastack:721 pacing_rate 817.0Mbps unacked:18 sacked:1 reordering:97 rcv_rtt:312308 rcv_space:399500

# ss -ntpioem 'src 2025:3406::1000:2004:6830 && dst 2025:3406::13:152:57432'
State Recv-Q Send-Q Local Address:Port Peer Address:Port
ESTAB 0 89964 2025:3406::1000:2004:6830 2025:3406::13:152:57432 users:(("ceph-osd",pid=810581,fd=138)) timer:(on,1min25sec,0) uid:167 ino:1087570918 sk:ffff881929f8b480 <->
skmem:(r0,rb2563164,t0,tb46080,f3220,w95084,o0,bl0,d241) ts sack cubic wscale:7,7 rto:120000 backoff:109 rtt:0.503/0.836 ato:40 mss:1428 rcvmss:417 advmss:1428 cwnd:17 ssthresh:16 bytes_acked:2289515 bytes_received:5459040 segs_out:72131 segs_in:71056 send 386.1Mbps lastsnd:73755687 lastrcv:457 lastack:457 pacing_rate 817.0Mbps unacked:18 sacked:1 reordering:97 rcv_rtt:312308 rcv_space:399500

What both cases have in common is that the retransmitted segments can never actually be sent. Tracing recent upstream TCP changes turned up a bug in which a TCP SACK security patch can cause TCP retransmission to fail, and the affected machines happened to be running a kernel with exactly that bug; applying the follow-up fix resolved the problem.

The problem code

The problem lies in the following logic newly added to the function tcp_fragment:

if (unlikely((sk->sk_wmem_queued >> 1) > sk->sk_sndbuf)) {
	NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPWQUEUETOOBIG);
	return -ENOMEM;
}

Printing the TCP counters also reveals a clue:

$ nstat -z | grep TCPWqueueTooBig
TcpExtTCPWqueueTooBig 2 0.0
$ nstat -za | grep TCPWqueueTooBig
TcpExtTCPWqueueTooBig 1483185 0.0

Combined with the ss output (sk_wmem_queued corresponds to the skmem w field, and sk_sndbuf to the skmem tb field):

(95084 >> 1) > 46080

this clearly satisfies the condition of the -ENOMEM branch above.

Workaround

tcp_sendmsg() and tcp_sendpage() may each queue one extra sk_buff (64 KB), which lets sk_wmem_queued satisfy the condition for returning -ENOMEM (consistent with the skmem info printed by the ss -m option above). Therefore, if the application layer needs to work around this, strictly speaking it should set SO_SNDBUF to a value slightly larger than 64 KB, i.e. SKB_TRUESIZE(GSO_MAX_SIZE) (the SO_SNDBUF value is automatically doubled when stored into sk_sndbuf):

  • SKB_TRUESIZE(x) can be worked out (from /proc/sys/net/core/wmem_max being 212992) to be (x + 576)
  • Therefore, SKB_TRUESIZE(GSO_MAX_SIZE) = SKB_TRUESIZE(64K) = (64K + 576)
  • The maximum value accepted when setting SO_SNDBUF is controlled by /proc/sys/net/core/wmem_max, i.e. without changing system settings, the final sk_sndbuf can be at most 212992 * 2 = 416K

Setting the SO_SNDBUF option locks sk->sk_sndbuf against the kernel's own adjustment (sk_stream_moderate_sndbuf and tcp_sndbuf_expand). This indirectly prevents the condition (sk->sk_wmem_queued >> 1) > sk->sk_sndbuf from being triggered by a shrinking sk_sndbuf and thus avoids the -ENOMEM path; the trade-off is that the kernel's TCP memory-control mechanism no longer applies to the socket.

References

TCP Selective Acknowledgment Options

https://tools.ietf.org/html/rfc2018

The TCP SACK panic

https://lwn.net/Articles/791409/

TCP SACK PANIC - Kernel vulnerabilities - CVE-2019-11477, CVE-2019-11478 & CVE-2019-11479

https://access.redhat.com/security/vulnerabilities/tcpsack

Network performance regressions from TCP SACK vulnerability fixes

https://databricks.com/blog/2019/08/01/network-performance-regressions-from-tcp-sack-vulnerability-fixes.html

Adventures in the TCP stack: Uncovering performance regressions in the TCP SACKs vulnerability fixes

https://databricks.com/blog/2019/09/16/adventures-in-the-tcp-stack-performance-regressions-vulnerability-fixes.html

tcp: tcp_fragment() should apply sane memory limits

https://github.com/torvalds/linux/commit/f070ef2ac66716357066b683fb0baf55f8191a2e

tcp: refine memory limit test in tcp_fragment()

https://github.com/torvalds/linux/commit/b6653b3629e5b88202be3c9abc44713973f5c4b4

tcp: be more careful in tcp_fragment()

https://github.com/torvalds/linux/commit/b617158dc096709d8600c53b6052144d12b89fab

SNMP counter

https://www.kernel.org/doc/html/latest/networking/snmp_counter.html

For reference, the Linux TCP state-machine comment from net/ipv4/tcp_input.c:

/* Linux NewReno/SACK/FACK/ECN state machine.
* --------------------------------------
*
* "Open" Normal state, no dubious events, fast path.
* "Disorder" In all the respects it is "Open",
* but requires a bit more attention. It is entered when
* we see some SACKs or dupacks. It is split of "Open"
* mainly to move some processing from fast path to slow one.
* "CWR" CWND was reduced due to some Congestion Notification event.
* It can be ECN, ICMP source quench, local device congestion.
* "Recovery" CWND was reduced, we are fast-retransmitting.
* "Loss" CWND was reduced due to RTO timeout or SACK reneging.
*
* tcp_fastretrans_alert() is entered:
* - each incoming ACK, if state is not "Open"
* - when arrived ACK is unusual, namely:
* * SACK
* * Duplicate ACK.
* * ECN ECE.
*
* Counting packets in flight is pretty simple.
*
* in_flight = packets_out - left_out + retrans_out
*
* packets_out is SND.NXT-SND.UNA counted in packets.
*
* retrans_out is number of retransmitted segments.
*
* left_out is number of segments left network, but not ACKed yet.
*
* left_out = sacked_out + lost_out
*
* sacked_out: Packets, which arrived to receiver out of order
* and hence not ACKed. With SACKs this number is simply
* amount of SACKed data. Even without SACKs
* it is easy to give pretty reliable estimate of this number,
* counting duplicate ACKs.
*
* lost_out: Packets lost by network. TCP has no explicit
* "loss notification" feedback from network (for now).
* It means that this number can be only _guessed_.
* Actually, it is the heuristics to predict lossage that
* distinguishes different algorithms.
*
* F.e. after RTO, when all the queue is considered as lost,
* lost_out = packets_out and in_flight = retrans_out.
*
* Essentially, we have now two algorithms counting
* lost packets.
*
* FACK: It is the simplest heuristics. As soon as we decided
* that something is lost, we decide that _all_ not SACKed
* packets until the most forward SACK are lost. I.e.
* lost_out = fackets_out - sacked_out and left_out = fackets_out.
* It is absolutely correct estimate, if network does not reorder
* packets. And it loses any connection to reality when reordering
* takes place. We use FACK by default until reordering
* is suspected on the path to this destination.
*
* NewReno: when Recovery is entered, we assume that one segment
* is lost (classic Reno). While we are in Recovery and
* a partial ACK arrives, we assume that one more packet
* is lost (NewReno). This heuristics are the same in NewReno
* and SACK.
*
* Imagine, that's all! Forget about all this shamanism about CWND inflation
* deflation etc. CWND is real congestion window, never inflated, changes
* only according to classic VJ rules.
*
* Really tricky (and requiring careful tuning) part of algorithm
* is hidden in functions tcp_time_to_recover() and tcp_xmit_retransmit_queue().
* The first determines the moment _when_ we should reduce CWND and,
* hence, slow down forward transmission. In fact, it determines the moment
* when we decide that hole is caused by loss, rather than by a reorder.
*
* tcp_xmit_retransmit_queue() decides, _what_ we should retransmit to fill
* holes, caused by lost packets.
*
* And the most logically complicated part of algorithm is undo
* heuristics. We detect false retransmits due to both too early
* fast retransmit (reordering) and underestimated RTO, analyzing
* timestamps and D-SACKs. When we detect that some segments were
* retransmitted by mistake and CWND reduction was wrong, we undo
* window reduction and abort recovery phase. This logic is hidden
* inside several functions named tcp_try_undo_<something>.
*/