Build environment
Kylin V10 aarch64
Building the dependencies
netavark
netavark is the network plugin that podman depends on. Its build requires protoc, so install that first:
$ wget https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-aarch_64.zip
$ unzip protoc-26.1-linux-aarch_64.zip
$ sudo cp bin/protoc /usr/bin/
$ git clone https://github.com/containers/netavark.git
$ cd netavark
$ make
$ sudo cp bin/netavark /usr/libexec/podman/
runc
$ git clone https://github.com/opencontainers/runc.git
$ cd runc
$ make
$ sudo cp runc /usr/bin/
To build a debug version:
$ make EXTRA_FLAGS='-gcflags="all=-N -l"'
runc doesn’t work with go1.22
https://github.com/opencontainers/runc/issues/4233
conmon
$ git clone https://github.com/containers/conmon.git
$ cd conmon
$ make
$ sudo cp bin/conmon /usr/libexec/podman/
pasta
$ git clone git://passt.top/passt
$ cd passt
Apply the following patch:
diff --git a/util.c b/util.c
index 849fa7f..82be401 100644
--- a/util.c
+++ b/util.c
@@ -516,7 +516,7 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
return __clone2(fn, stack_area + stack_size / 2, stack_size / 2,
flags, arg);
#else
- return clone(fn, stack_area + stack_size / 2, flags, arg);
+ return clone(fn, (char *)((uintptr_t)(stack_area + stack_size / 2 + 15) & (-16)), flags, arg);
#endif
}
$ make prefix=/usr
$ sudo cp passt /usr/bin/
$ sudo ln -sr /usr/bin/passt /usr/bin/pasta
pasta SIGBUS error on aarch64
https://bugs.passt.top/show_bug.cgi?id=85
podman
$ make clean
$ make podman
$ ls bin/
podman
Runtime errors
Error 1
$ podman run -it 192.168.1.71:5000/kylin-server-v10:latest
Error: could not find pasta, the network namespace can't be configured: exec: "pasta": executable file not found in $PATH
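podman 5.x switched its default user-mode network stack from slirp4netns to pasta. You can pass --network slirp4netns to select slirp4netns explicitly, provided, of course, that slirp4netns is installed on the system.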
Error 2
$ podman run --network slirp4netns -it 192.168.1.71:5000/kylin-server-v10:latest
open /run/runc/3804d1c3356acd3c9f0155ca047659c161e2b6e3e660569aa2ef76dddc3a0cc1/state.json: permission denied
Error: runc: stat /run/runc/3804d1c3356acd3c9f0155ca047659c161e2b6e3e660569aa2ef76dddc3a0cc1: permission denied: OCI permission denied
podman invokes conmon, conmon invokes runc, and runc fails. To help locate the problem, pass --debug to runc with --runtime-flag debug (--log-level trace additionally enables podman's own debug logging):
$ podman run --network slirp4netns --log-level trace --runtime-flag debug -it 192.168.1.71:5000/kylin-server-v10:latest
conmon's log output now goes to the systemd journal:
Apr 09 21:35:47 zstack-5 conmon[3153310]: conmon 22a38900b92df457377f <ndebug>: failed to write to /proc/self/oom_score_adj: Permission denied
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <ndebug>: addr{sun_family=AF_UNIX, sun_path=/var/tmp/conmon-term.1YMSL2}
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <ntrace>: calling runtime args: /usr/bin/runc --debug --log-format=json --log /run/user/1001/containers/vfs-containers/22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996/userdata/oci-log create --bundle /home/runsisi/.local/share/containers/storage/vfs-containers/22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996/userdata --pid-file /run/user/1001/containers/vfs-containers/22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996/userdata/pidfile --console-socket /var/tmp/conmon-term.1YMSL2 22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <ndebug>: addr{sun_family=AF_UNIX, sun_path=/proc/self/fd/13/attach}
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <ndebug>: terminal_ctrl_fd: 13
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <ndebug>: winsz read side: 16, winsz write side: 17
Apr 09 21:35:47 zstack-5 docker-runc[3153312]: {"level":"error","msg":"stat /run/runc/22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996: permission denied\n","time":"2024-04-09T21:35:47+08:00"}
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <nwarn>: runtime stderr: stat /run/runc/22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996: permission denied
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <error>: Failed to create container: exit status 1
Apr 09 21:35:47 zstack-5 docker-runc[3153319]: ERRO[0000] open /run/runc/22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996/state.json: permission denied
Note that the errors, surprisingly, come from docker-runc, even though we are supposedly running the runc we built by hand:
$ rpm -qf /usr/bin/runc
docker-engine-18.09.0-101.ky10.aarch64
Evidently, installing docker overwrote /usr/bin/runc, which is the first path in podman's default search list for runc:
// common here is the standalone containers/common git project
// common/pkg/config/default.go
func defaultEngineConfig() (*EngineConfig, error) {
    // ...
    c.OCIRuntimes = map[string][]string{
        "runc": {
            "/usr/bin/runc",
            "/usr/sbin/runc",
            "/usr/local/bin/runc",
            "/usr/local/sbin/runc",
            "/sbin/runc",
            "/bin/runc",
            "/usr/lib/cri-o-runc/sbin/runc",
            "/run/current-system/sw/bin/runc",
        },
        // ...
    }
    // ...
}
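The effect of such an ordered search list can be sketched in C. This is only an illustration, assuming podman settles on the first candidate that exists on disk; first_existing_path is a hypothetical helper and not part of podman or containers/common.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: return the first candidate that exists and is
 * executable, or NULL if none is. Assumption: the "runc" entry above is
 * resolved in a comparable first-match fashion. */
const char *first_existing_path(const char *const *candidates, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (access(candidates[i], X_OK) == 0)
            return candidates[i];
    }
    return NULL;
}
```

With /usr/bin/runc listed first, whatever binary sits at that path wins, including one installed by the docker-engine package.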
Error 3
$ podman run -it 192.168.1.71:5000/kylin-server-v10:latest
Error: pasta failed with exit code -1:
$ sudo coredumpctl
TIME PID UID GID SIG COREFILE EXE
Thu 2024-04-11 14:53:48 CST 1649827 1001 1001 7 present /usr/bin/passt
$ sudo coredumpctl debug 1649827
Signal: 7 (BUS)
Message: Process 1649827 (pasta) of user 1001 dumped core.
Stack trace of thread 1649827:
#0 0x0000aaadc596e590 ns_check (passt)
#1 0x0000fffd1395a1ec thread_start (libc.so.6)
Stack trace of thread 1649826:
#0 0x0000fffd1395a1c0 __clone (libc.so.6)
#1 0x0000aaadc596e8c4 pasta_open_ns (passt)
#2 0x0000aaadc596724c conf (passt)
#3 0x0000aaadc5963434 main (passt)
#4 0x0000fffd138a3fe0 __libc_start_main (libc.so.6)
#5 0x0000aaadc5963be4 $x (passt)
#6 0x0000aaadc5963be4 $x (passt)
$ ./pasta --trace
0.1690: Multiple interfaces with IPv6 routes, use -i to select one
0.1690: Couldn't pick external interface: disabling IPv6
Bus error
Tracing with gdb shows that it crashes in clone:
(gdb) bt
#0 do_clone (arg=0xffffffe2b548, flags=17681, stack_size=1048576, stack_area=0xffffffe2b568 "\370ͪ\252\252\252", fn=<optimized out>) at util.c:519
#1 open_in_ns (c=0xfffffff30050, path=path@entry=0xaaaaaaabf0b8 "/proc/net/tcp", flags=flags@entry=524288) at util.c:377
#2 0x0000aaaaaaaaa12c in fwd_scan_ports_init (c=c@entry=0xfffffff30050) at fwd.c:135
#3 0x0000aaaaaaaa709c in conf (c=c@entry=0xfffffff30050, argc=argc@entry=1, argv=argv@entry=0xffffffffec38) at conf.c:1768
#4 0x0000aaaaaaaa3434 in main (argc=1, argv=0xffffffffec38) at passt.c:269
The SIGBUS error, together with the value of stack_area + stack_size / 2 printed in gdb, confirms that the cause is a child stack address that is not 16-byte aligned:
diff --git a/util.c b/util.c
index 849fa7f..82be401 100644
--- a/util.c
+++ b/util.c
@@ -516,7 +516,7 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
return __clone2(fn, stack_area + stack_size / 2, stack_size / 2,
flags, arg);
#else
- return clone(fn, stack_area + stack_size / 2, flags, arg);
+ return clone(fn, (char *)((uintptr_t)(stack_area + stack_size / 2 + 15) & (-16)), flags, arg);
#endif
}
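The rounding used in the patch can be checked in isolation. The sketch below is a standalone illustration, not passt code; align_up_16 and is_aligned_16 are hypothetical names. It uses 64-bit integers in place of pointers so that the values from the gdb backtrace above (stack_area=0xffffffe2b568, stack_size=1048576) can be plugged in directly.

```c
#include <stdint.h>

/* Round addr up to the next multiple of 16: the same arithmetic as
 * (addr + 15) & (-16) in the patch. */
uint64_t align_up_16(uint64_t addr)
{
    return (addr + 15) & ~(uint64_t)15;
}

/* Non-zero if addr is a multiple of 16. */
int is_aligned_16(uint64_t addr)
{
    return (addr & 15) == 0;
}
```

For the backtrace values, stack_area + stack_size / 2 is 0xffffffeab568, which ends in 8 and so is not a multiple of 16; rounding up gives 0xffffffeab570. Rounding up by at most 15 bytes stays well inside the buffer, since the child stack starts at its midpoint.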
pasta SIGBUS error on aarch64
https://bugs.passt.top/show_bug.cgi?id=85
clone - create a child process
https://man7.org/linux/man-pages/man2/clone.2.html
It’s probably stack alignment.
How do I parse ARM64 assembly SIGBUS error?
https://stackoverflow.com/questions/72724797/how-do-i-parse-arm64-assembly-sigbus-error
Error 4
$ podman run -it 192.168.1.71:5000/kylin-server-v10
Error: pasta failed with exit code 1:
Couldn't set IPv4 route(s) in guest: Invalid argument
Add a debug option for pasta:
$ sudo vi /etc/containers/containers.conf
[network]
pasta_options = ["--trace"]
The --trace option is now passed to pasta:
$ podman run --log-level trace --rm -it 192.168.1.71:5000/kylin-server-v10
pasta arguments: --config-net --trace --dns-forward 169.254.0.1 -t none -u none -T none -U none --no-map-gw --quiet --netns /run/user/1001/netns/netns-33119720-8b69-2893-581f-60ebfb26291b
However, --debug conflicts with --quiet (specifying --trace implies --debug):
$ sudo journalctl -f -t pasta
Apr 11 15:56:08 zstack-5 pasta[1825753]: Either --debug or --quiet
This requires a change to the containers/common source code:
diff --git a/vendor/github.com/containers/common/libnetwork/pasta/pasta_linux.go b/vendor/github.com/containers/common/libnetwork/pasta/pasta_linux.go
index 4b31320b5..5f98e2491 100644
--- a/vendor/github.com/containers/common/libnetwork/pasta/pasta_linux.go
+++ b/vendor/github.com/containers/common/libnetwork/pasta/pasta_linux.go
@@ -146,7 +146,7 @@ func Setup2(opts *SetupOptions) (*SetupResult, error) {
}
// always pass --quiet to silence the info output from pasta
- cmdArgs = append(cmdArgs, "--quiet", "--netns", opts.Netns)
+ cmdArgs = append(cmdArgs, "--netns", opts.Netns)
logrus.Debugf("pasta arguments: %s", strings.Join(cmdArgs, " "))
Even in trace mode, however, pasta provides no additional useful information.
Testing directly by hand gives:
$ pasta --config-net
Couldn't set IPv4 route(s) in guest: Invalid argument
The error comes from the nl_do call made by nl_route_dup. The address and route information on the host is as follows:
$ ip -4 a
9: lxcbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
inet 10.0.3.1/24 brd 10.0.3.255 scope global lxcbr0
valid_lft forever preferred_lft forever
$ ip -4 r list table all
broadcast 10.0.3.0 dev lxcbr0 table local proto kernel scope link src 10.0.3.1 linkdown
local 10.0.3.1 dev lxcbr0 table local proto kernel scope host src 10.0.3.1
broadcast 10.0.3.255 dev lxcbr0 table local proto kernel scope link src 10.0.3.1 linkdown
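Printing the netlink messages (the contents of buf in nl_route_dup) and decoding the dump with the tools that pyroute2 provides shows that adding the routes of an interface in the NO-CARRIER state is what fails: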
$ git diff
diff --git a/netlink.c b/netlink.c
index 89c0641..2945923 100644
--- a/netlink.c
+++ b/netlink.c
@@ -503,6 +503,22 @@ int nl_route_set_def(int s, unsigned int ifi, sa_family_t af, const void *gw)
return nl_do(s, &req, RTM_NEWROUTE, NLM_F_CREATE | NLM_F_EXCL, len);
}
+static void hex_dump(char *buf, int buflen) {
+ for (int i = 0; i < buflen; i += 16) {
+ for (int j = 0; j < 16; j++) {
+ if (i + j < buflen) {
+ printf("%02x", (unsigned char)buf[i + j]);
+ if ((j != 15) && (i + j != buflen - 1)) {
+ printf(":");
+ }
+ } else {
+ break;
+ }
+ }
+ printf("\n");
+ }
+}
+
/**
* nl_route_dup() - Copy routes for given interface and address family
* @s_src: Netlink socket in source namespace
@@ -602,6 +618,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
if (status < 0)
return status;
+ hex_dump(buf, nlmsgs_size);
+
/* Routes might have dependencies between each other, and the kernel
* processes RTM_NEWROUTE messages sequentially. For n routes, we might
* need to send the requests up to n times to get all of them inserted.
Reading the Linux kernel source and probing with systemtap confirms this:
$ sudo stap -e 'probe kernel.function("fib_create_info") { printf("fc_flags = 0x%x\n", $cfg->fc_flags); }
probe kernel.function("fib_check_nh").return { printf("%s\n", $$return); }
probe kernel.function("fib_create_info").return { printf("%s\n", $$return); }'
The failure occurs in the following Linux kernel code:
// linux/net/ipv4/fib_semantics.c
// fib_create_info
    if (cfg->fc_flags & (RTNH_F_DEAD | RTNH_F_LINKDOWN)) {
        NL_SET_ERR_MSG(extack, "Invalid rtm_flags - can not contain DEAD or LINKDOWN");
        goto err_inval;
    }
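The root cause, however, is that the kernel does not correctly apply the filter that pasta's nl_route_dup sends when it requests the routes, so pasta receives route entries that carry RTNH_F_LINKDOWN:
// linux/net/ipv4/fib_frontend.c
// inet_dump_fib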
RTNH_F_LINKDOWN corresponds to the flags value 16 in the pyroute2 dump:
{
"pcap header": "None",
"link layer header": "None",
"message class": "<class 'pyroute2.netlink.rtnl.rtmsg.rtmsg'>",
"exception": null,
"data": {
"family": 2,
"dst_len": 32,
"src_len": 0,
"tos": 0,
"table": 255,
"proto": 2,
"scope": 253,
"type": 3,
"flags": 16, // RTNH_F_LINKDOWN
}
}
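The kernel check quoted earlier can be restated as a standalone predicate to see why a dumped route with flags 16 cannot be inserted again. This is a sketch only: the SK_ constants repeat the values of RTNH_F_DEAD and RTNH_F_LINKDOWN from linux/rtnetlink.h so that the block compiles without kernel headers, and flags_rejected is a hypothetical name.

```c
/* Values as defined in linux/rtnetlink.h, repeated here under different
 * names so the sketch is self-contained. */
#define SK_RTNH_F_DEAD      1
#define SK_RTNH_F_LINKDOWN  16

/* Mirrors the test in fib_create_info(): non-zero means the kernel would
 * refuse an RTM_NEWROUTE request carrying these flags. */
int flags_rejected(unsigned int rtm_flags)
{
    return (rtm_flags & (SK_RTNH_F_DEAD | SK_RTNH_F_LINKDOWN)) != 0;
}
```

Feeding it the flags value 16 from the dump yields a rejection, which matches the "Invalid argument" error that pasta reports.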
pasta sets the NETLINK_GET_STRICT_CHK socket option in nl_sock_init_do:
// passt/netlink.c
static int nl_sock_init_do(void *arg)
{
    /* ... */
#ifdef NETLINK_GET_STRICT_CHK
    if (setsockopt(*s, SOL_NETLINK, NETLINK_GET_STRICT_CHK, &y, sizeof(y)))
        debug("netlink: cannot set NETLINK_GET_STRICT_CHK on %i", *s);
#endif
    return 0;
}
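However, the implementation of inet_dump_fib differs substantially between the 4.19 kernel and current 6.x kernels. 6.x kernels support NETLINK_GET_STRICT_CHK, so the filter takes effect; 4.19 does not support this option, so the filter is not applied.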
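Whether a given kernel accepts the option can be probed directly instead of inferred from the version string. The following is a minimal sketch under a few assumptions: strict_chk_supported is a hypothetical helper, the fallback values 270 and 12 are the ones I believe the kernel headers use for SOL_NETLINK and NETLINK_GET_STRICT_CHK, and a successful setsockopt is taken as a hint, not proof, that route dump filtering works.

```c
#include <sys/socket.h>
#include <linux/netlink.h>
#include <unistd.h>

/* Fallbacks for older userspace headers (assumed values). */
#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif
#ifndef NETLINK_GET_STRICT_CHK
#define NETLINK_GET_STRICT_CHK 12
#endif

/* Returns 1 if the kernel accepts NETLINK_GET_STRICT_CHK, 0 if it rejects
 * the option, -1 if no netlink socket could be opened. */
int strict_chk_supported(void)
{
    int y = 1, ok;
    int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

    if (s < 0)
        return -1;
    ok = setsockopt(s, SOL_NETLINK, NETLINK_GET_STRICT_CHK,
                    &y, sizeof(y)) == 0;
    close(s);
    return ok;
}
```

On a kernel without the option, the call is expected to fail, which is the case pasta currently only logs at debug level.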
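Also note that the kernel does not support filtering on the .rtm.rtm_scope field, so although pasta sets that field in nl_route_dup when it requests the route entries, the kernel simply ignores it: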
// passt/netlink.c
// nl_route_dup
    struct req_t {
        struct nlmsghdr nlh;
        struct rtmsg rtm;
        struct rtattr rta;
        unsigned int ifi;
    } req = {
        .rtm.rtm_family = af,
        .rtm.rtm_table = RT_TABLE_MAIN, // 254
        .rtm.rtm_scope = RT_SCOPE_UNIVERSE, // 0
        .rtm.rtm_type = RTN_UNICAST, // 1
        .rta.rta_type = RTA_OIF,
        .rta.rta_len = RTA_LENGTH(sizeof(unsigned int)),
        .ifi = ifi_src,
    };
// linux/include/net/ip_fib.h
struct fib_dump_filter {
    u32 table_id;
    /* filter_set is an optimization that an entry is set */
    bool filter_set;
    bool dump_routes;
    bool dump_exceptions;
    unsigned char protocol;
    unsigned char rt_type;
    unsigned int flags;
    struct net_device *dev;
};
netlink: Add new socket option to enable strict checking on dumps
https://github.com/torvalds/linux/commit/89d35528d17d25819a755a2b52931e911baebc66 (v4.20-rc1+)
net: netlink: rename NETLINK_DUMP_STRICT_CHK -> NETLINK_GET_STRICT_CHK (v4.20+)
https://github.com/torvalds/linux/commit/d3e8869ec82645599e6497d6974593bf00f7b19b
Steps to reproduce (the problem exists on CentOS 8 and on Kylin V10 Desktop/Server, because their kernels are older than 4.20):
$ uname -r
4.18.0-348.el8.aarch64
$ sudo ip link add dummy0 type dummy
$ sudo ip link set dev dummy0 up
$ echo 0 | sudo tee /sys/class/net/dummy0/carrier
$ sudo ip addr add dev dummy0 172.16.13.13/16
$ sudo ip addr | grep dummy0 -A 3
4: dummy0: <NO-CARRIER,BROADCAST,NOARP,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
link/ether f6:b0:92:bd:83:71 brd ff:ff:ff:ff:ff:ff
inet 172.16.13.13/16 scope global dummy0
valid_lft forever preferred_lft forever
inet6 fe80::f4b0:92ff:febd:8371/64 scope link
valid_lft forever preferred_lft forever
$ ./pasta -4 --config-net
Couldn't set IPv4 route(s) in guest: Invalid argument
If the link is set to the up state instead, pasta copies route entries that it should not:
$ echo 1 | sudo tee /sys/class/net/dummy0/carrier
$ ip route
default via 10.0.3.1 dev enp0s19 proto dhcp metric 100
10.0.0.0/8 dev enp0s18 proto kernel scope link src 10.0.0.91
10.0.3.0/24 dev enp0s19 proto kernel scope link src 10.0.3.91 metric 100
172.16.0.0/16 dev dummy0 proto kernel scope link src 172.16.13.13
$ ip route list table main type unicast oif enp0s19
default via 10.0.3.1 proto dhcp metric 100
10.0.3.0/24 proto kernel scope link src 10.0.3.91 metric 100
$ ./pasta -4 --config-net
# ip route
default via 10.0.3.1 dev enp0s19 proto dhcp metric 100
10.0.0.0/8 dev enp0s19 proto kernel scope link
10.0.3.0/24 dev enp0s19 proto kernel scope link metric 100
172.16.0.0/16 dev enp0s19 proto kernel scope link
gdb-hexdump
https://github.com/runsisi/gdb-hexdump
Dump data
https://docs.pyroute2.org/debug.html
A simple workaround is to switch to slirp4netns for user-mode networking:
$ sudo vi /etc/containers/containers.conf
[network]
default_rootless_network_cmd = "slirp4netns"
Alternatively, delete the interfaces that are in the NO-CARRIER state, or remove the IP addresses from those interfaces.
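A proper fix requires pasta to be compatible with the 4.19 kernel, that is, to drop the entries that do not match the filter while copying the routes (front-end filtering in userspace versus back-end filtering in the kernel; the ip route list command in fact also filters in the front end). This is slightly more involved: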
diff --git a/netlink.c b/netlink.c
index 89c0641..ea80bc7 100644
--- a/netlink.c
+++ b/netlink.c
@@ -537,6 +537,10 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
char buf[NLBUFSIZ];
uint32_t seq;
unsigned i;
+ ssize_t nlmsgs_size2 = 0;
+ char buf2[NLBUFSIZ];
+ char *pos = buf2;
+ bool skip = false;
seq = nl_send(s_src, &req, RTM_GETROUTE, NLM_F_DUMP, sizeof(req));
@@ -554,8 +558,10 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
if (nh->nlmsg_type != RTM_NEWROUTE)
continue;
-
- dup_routes++;
+ if (rtm->rtm_table != RT_TABLE_MAIN)
+ continue;
+ if (rtm->rtm_type != RTN_UNICAST)
+ continue;
for (rta = RTM_RTA(rtm), na = RTM_PAYLOAD(nh); RTA_OK(rta, na);
rta = RTA_NEXT(rta, na)) {
@@ -564,6 +570,10 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
* the corresponding identifier in the target namespace.
*/
if (rta->rta_type == RTA_OIF) {
+ if (*(unsigned int *)RTA_DATA(rta) != ifi_src) {
+ skip = true;
+ break;
+ }
*(unsigned int *)RTA_DATA(rta) = ifi_dst;
} else if (rta->rta_type == RTA_MULTIPATH) {
struct rtnexthop *rtnh;
@@ -582,6 +592,17 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
rta->rta_type = RTA_UNSPEC;
}
}
+
+ if (skip) {
+ skip = false;
+ continue;
+ }
+
+ dup_routes++;
+
+ memcpy(pos, nh, NLMSG_ALIGN(nh->nlmsg_len));
+ pos += NLMSG_ALIGN(nh->nlmsg_len);
+ nlmsgs_size2 += NLMSG_ALIGN(nh->nlmsg_len);
}
if (!NLMSG_OK(nh, left)) {
@@ -610,7 +631,7 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
* to calculate dependencies: let the kernel do that.
*/
for (i = 0; i < dup_routes; i++) {
- for (nh = (struct nlmsghdr *)buf, left = nlmsgs_size;
+ for (nh = (struct nlmsghdr *)buf2, left = nlmsgs_size2;
NLMSG_OK(nh, left);
nh = NLMSG_NEXT(nh, left)) {
uint16_t flags = nh->nlmsg_flags;
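The idea behind the patch, filtering in userspace and keeping only the matching entries, can be sketched without any netlink parsing. The code below is an illustration under stated assumptions and not passt code: struct route_view is a hypothetical flattened view of one dumped route (the real code walks struct rtmsg and its rtattr list), and the SK_ constants repeat the values 254 and 1 noted in the request structure above.

```c
#include <stdbool.h>
#include <stddef.h>

#define SK_RT_TABLE_MAIN 254 /* value of RT_TABLE_MAIN */
#define SK_RTN_UNICAST   1   /* value of RTN_UNICAST */

/* Hypothetical, flattened view of one dumped route. */
struct route_view {
    unsigned char table; /* rtm_table */
    unsigned char type;  /* rtm_type */
    unsigned int oif;    /* value of the RTA_OIF attribute */
};

/* The three conditions the patch enforces in userspace. */
bool route_wanted(const struct route_view *r, unsigned int ifi_src)
{
    return r->table == SK_RT_TABLE_MAIN &&
           r->type == SK_RTN_UNICAST &&
           r->oif == ifi_src;
}

/* Compact the array in place so that only wanted routes remain at the
 * front; returns how many were kept. */
size_t filter_routes(struct route_view *routes, size_t n,
                     unsigned int ifi_src)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (route_wanted(&routes[i], ifi_src))
            routes[kept++] = routes[i];
    }
    return kept;
}
```

The patch copies the kept messages into a second buffer instead of compacting in place, but the selection logic is the same.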
pasta does not filter out unneeded routes on kernel < 4.20
https://bugs.passt.top/show_bug.cgi?id=86
Error 5
This error occurred only once, possibly because of a version mismatch; I did not trace it any further:
$ podman run -it 192.168.1.71:5000/kylin-server-v10:latest
Error: OCI runtime error: runc: container_linux.go:318: starting container process caused "requested action matches default action of filter"
$ podman run -it --security-opt=seccomp=unconfined 192.168.1.71:5000/kylin-server-v10:latest
Root/rootless Error: OCI runtime error: container_linux.go:349: starting container process caused “error adding seccomp rule for syscall socket: requested action matches default action of filter”
https://github.com/containers/podman/issues/8472
Error 6
buildah 1.34.1 fails when it invokes runc 1.1.0:
$ buildah build .
STEP 1/5: FROM 192.168.1.71:5000/kylin-server-v10:latest
STEP 2/5: COPY <<EOF /etc/yum.repos.d/xcube-ev.repo ([xcube-os-ext]...)
STEP 3/5: RUN <<EOF (#!/bin/bash -ex...)
error running container: from /usr/bin/runc creating container for [/bin/sh -c /bin/bash -ex /dev/pipes/buildahheredoc4181887380]: time="2024-04-15T15:17:58+08:00" level=error msg="runc create failed: invalid mount &{Source:/var/tmp/buildah2446329082/mnt/buildah-bind-target-10 Destination:/dev/pipes/buildahheredoc4181887380 Device:bind Flags:20480 ClearedFlags:1 PropagationFlags:[278528] Data:z,Z Relabel: RecAttr:<nil> Extensions:0 IDMapping:<nil>}: bind mounts cannot have any filesystem-specific options applied"
: exit status 1
ERRO[0003] did not get container create message from subprocess: EOF
Error: building at STEP "RUN <<EOF": while running runtime: exit status 1
The message "bind mounts cannot have any filesystem-specific options applied" is emitted by runc in checkBindOptions: the mount options are parsed in runc's parseMountOptions and then fail validation in checkBindOptions.
Build a debug version of buildah (with BUILDDEBUG=1 the debug information is less complete):
$ make GOGCFLAGS="all=-N -l"
buildah invokes runc, so for debugging runc needs a debug build as well:
$ make EXTRA_FLAGS='-gcflags="all=-N -l"'
Debugging buildah requires the cap_sys_admin capability; otherwise, because buildah re-executes itself, the debug info cannot be found:
$ sudo dlv exec /usr/bin/buildah -- build .
(dlv) target follow-exec -on
(dlv) config substitute-path github.com/opencontainers/runc /home/runsisi/build/runc
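The options z and Z that make runc's checkBindOptions fail are created by buildah in createNeededHeredocMountsForRun. runc's parseMountOptions handles only the mount options it recognizes; as the error message shows, z and Z end up in the mount's Data field: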
// buildah/imagebuildah/stage_executor.go
// createNeededHeredocMountsForRun
(dlv) p mountResult
[]github.com/opencontainers/runtime-spec/specs-go.Mount len: 1, cap: 1, [
{
Destination: "/dev/pipes/buildahheredoc2192141856",
Type: "bind",
Source: "/var/tmp/buildahheredoc2192141856",
Options: []string len: 4, cap: 4, [
"bind",
"rprivate",
"z",
"Z",
],
UIDMappings: []github.com/opencontainers/runtime-spec/specs-go.LinuxIDMapping len: 0, cap: 0, nil,
GIDMappings: []github.com/opencontainers/runtime-spec/specs-go.LinuxIDMapping len: 0, cap: 0, nil,
},
]
// runc/libcontainer/specconv/spec_linux.go
// CreateLibcontainerConfig
(dlv) p spec.Mounts
Sending output to pager...
[]github.com/opencontainers/runtime-spec/specs-go.Mount len: 12, cap: 17, [
{
Destination: "/dev/pipes/buildahheredoc1708672189",
Type: "bind",
Source: "/var/tmp/buildah2208349361/mnt/buildah-bind-target-10",
Options: []string len: 6, cap: 8, [
"bind",
"rprivate",
"z",
"Z",
"rw",
"rbind",
],
UIDMappings: []github.com/opencontainers/runtime-spec/specs-go.LinuxIDMapping len: 0, cap: 0, nil,
GIDMappings: []github.com/opencontainers/runtime-spec/specs-go.LinuxIDMapping len: 0, cap: 0, nil,
},
]
// runc/libcontainer/specconv/spec_linux.go
// parseMountOptions
(dlv) p options
[]string len: 6, cap: 8, [
"bind",
"rprivate",
"z",
"Z",
"rw",
"rbind",
]
// runc/libcontainer/configs/validate/validator.go
func checkBindOptions(m *configs.Mount) error {
    if !m.IsBind() {
        return nil
    }
    // We must reject bind-mounts that also have filesystem-specific mount
    // options, because the kernel will completely ignore these flags and we
    // cannot set them per-mountpoint.
    //
    // It should be noted that (due to how the kernel caches superblocks), data
    // options could also silently ignored for other filesystems even when
    // doing a fresh mount, but there is no real way to avoid this (and it
    // matches how everything else works). There have been proposals to make it
    // possible for userspace to detect this caching, but this wouldn't help
    // runc because the behaviour wouldn't even be desirable for most users.
    if m.Data != "" {
        return errors.New("bind mounts cannot have any filesystem-specific options applied")
    }
    return nil
}
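How z and Z turn into a non-empty Data string can be sketched as follows. This is not runc's implementation: known_flags is an illustrative subset of the option names that are treated as mount flags, and collect_mount_data is a hypothetical helper that joins everything else with commas, which is consistent with the Data:z,Z seen in the error message.

```c
#include <stddef.h>
#include <string.h>

/* Illustrative subset of option names treated as mount flags. */
static const char *const known_flags[] = {
    "bind", "rbind", "rprivate", "ro", "rw",
};

static int is_known_flag(const char *opt)
{
    for (size_t i = 0; i < sizeof(known_flags) / sizeof(known_flags[0]); i++) {
        if (strcmp(opt, known_flags[i]) == 0)
            return 1;
    }
    return 0;
}

/* Join every unrecognized option with commas into data (capacity cap).
 * Returns the number of unrecognized options, or -1 if data is too small. */
int collect_mount_data(const char *const *opts, size_t n,
                       char *data, size_t cap)
{
    int unknown = 0;
    size_t used = 0;

    if (cap == 0)
        return -1;
    data[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        size_t len;

        if (is_known_flag(opts[i]))
            continue;
        len = strlen(opts[i]);
        /* Room for an optional comma, the option and the final NUL. */
        if (used + (used ? 1 : 0) + len + 1 > cap)
            return -1;
        if (used)
            data[used++] = ',';
        memcpy(data + used, opts[i], len + 1);
        used += len;
        unknown++;
    }
    return unknown;
}
```

Applied to the six options printed by dlv above, the leftover string is "z,Z"; a bind mount with a non-empty data string is exactly what checkBindOptions rejects.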
The problem does not occur on Arch Linux because buildah there uses crun, not runc, as its default container runtime (this can be configured in /etc/containers/containers.conf):
❯ buildah build --log-level trace .
DEBU[0000] Running ["/usr/bin/crun" "create" "--bundle" "/var/tmp/buildah1329540379" "--pid-file" "/var/tmp/buildah1329540379/pid" "--no-new-keyring" "buildah-buildah1329540379"]
❯ man containers.conf
runtime=""
Default OCI specific runtime in runtimes that will be used by default. Must refer to a member of the runtimes table. Default runtime will be searched for on the system using the priority: "crun", "crun-vm", "runc", "kata".
The problem can therefore be worked around by installing the crun container runtime:
$ git clone https://github.com/containers/crun.git
$ cd crun
$ ./autogen.sh
$ ./configure --prefix /usr
$ make
$ make install-strip DESTDIR=$PWD/built
$ sudo cp built/usr/bin/crun /usr/bin/
Last modified on 2024-05-05