Building Podman from Source
Podman integrates well with Buildah and is also compatible with Docker, so I migrated my Docker environment to Podman.

Build environment

Kylin V10 aarch64

Building the dependencies

netavark

netavark is the network plugin that podman depends on.

$ wget https://github.com/protocolbuffers/protobuf/releases/download/v26.1/protoc-26.1-linux-aarch_64.zip
$ unzip protoc-26.1-linux-aarch_64.zip
$ sudo cp bin/protoc /usr/bin/
$ git clone https://github.com/containers/netavark.git
$ cd netavark
$ make
$ sudo cp bin/netavark /usr/libexec/podman/

runc

$ git clone https://github.com/opencontainers/runc.git
$ cd runc
$ make
$ sudo cp runc /usr/bin/

To build a debug version:

$ make EXTRA_FLAGS='-gcflags="all=-N -l"'

runc doesn’t work with go1.22
https://github.com/opencontainers/runc/issues/4233

conmon

$ git clone https://github.com/containers/conmon.git
$ cd conmon
$ make
$ sudo cp bin/conmon /usr/libexec/podman/

pasta

$ git clone git://passt.top/passt
$ cd passt

Apply the following patch:

diff --git a/util.c b/util.c
index 849fa7f..82be401 100644
--- a/util.c
+++ b/util.c
@@ -516,7 +516,7 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
        return __clone2(fn, stack_area + stack_size / 2, stack_size / 2,
                        flags, arg);
 #else
-       return clone(fn, stack_area + stack_size / 2, flags, arg);
+       return clone(fn, (char *)((uintptr_t)(stack_area + stack_size / 2 + 15) & (-16)), flags, arg);
 #endif
 }

$ make prefix=/usr
$ sudo cp passt /usr/bin/
$ sudo ln -sr /usr/bin/passt /usr/bin/pasta

pasta SIGBUS error on aarch64
https://bugs.passt.top/show_bug.cgi?id=85

podman

$ make clean
$ make podman

$ ls bin/
podman

Runtime errors

Error 1

$ podman run -it 192.168.1.71:5000/kylin-server-v10:latest
Error: could not find pasta, the network namespace can't be configured: exec: "pasta": executable file not found in $PATH

Podman 5.x switched its default user-mode network stack from slirp4netns to pasta. You can pass --network slirp4netns to select slirp4netns explicitly, provided slirp4netns is installed on the system.

Error 2

$ podman run --network slirp4netns -it 192.168.1.71:5000/kylin-server-v10:latest
open /run/runc/3804d1c3356acd3c9f0155ca047659c161e2b6e3e660569aa2ef76dddc3a0cc1/state.json: permission denied
Error: runc: stat /run/runc/3804d1c3356acd3c9f0155ca047659c161e2b6e3e660569aa2ef76dddc3a0cc1: permission denied: OCI permission denied

podman invokes conmon, and conmon invokes runc, which is where the failure occurs. Passing --debug to runc helps narrow it down (--log-level trace additionally enables podman's own debug logging):

$ podman run --network slirp4netns --log-level trace --runtime-flag debug -it 192.168.1.71:5000/kylin-server-v10:latest

conmon's log output then ends up in the systemd journal:

Apr 09 21:35:47 zstack-5 conmon[3153310]: conmon 22a38900b92df457377f <ndebug>: failed to write to /proc/self/oom_score_adj: Permission denied
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <ndebug>: addr{sun_family=AF_UNIX, sun_path=/var/tmp/conmon-term.1YMSL2}
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <ntrace>: calling runtime args: /usr/bin/runc --debug --log-format=json --log /run/user/1001/containers/vfs-containers/22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996/userdata/oci-log create --bundle /home/runsisi/.local/share/containers/storage/vfs-containers/22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996/userdata --pid-file /run/user/1001/containers/vfs-containers/22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996/userdata/pidfile --console-socket /var/tmp/conmon-term.1YMSL2 22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <ndebug>: addr{sun_family=AF_UNIX, sun_path=/proc/self/fd/13/attach}
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <ndebug>: terminal_ctrl_fd: 13
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <ndebug>: winsz read side: 16, winsz write side: 17
Apr 09 21:35:47 zstack-5 docker-runc[3153312]: {"level":"error","msg":"stat /run/runc/22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996: permission denied\n","time":"2024-04-09T21:35:47+08:00"}
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <nwarn>: runtime stderr: stat /run/runc/22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996: permission denied
Apr 09 21:35:47 zstack-5 conmon[3153311]: conmon 22a38900b92df457377f <error>: Failed to create container: exit status 1
Apr 09 21:35:47 zstack-5 docker-runc[3153319]: ERRO[0000] open /run/runc/22a38900b92df457377fc1248edfd217890646f8628b9d4a481910259bc02996/state.json: permission denied

Note that docker-runc shows up in the errors, even though the runc being run should be the one we built by hand:

$ rpm -qf /usr/bin/runc
docker-engine-18.09.0-101.ky10.aarch64

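Evidently, installing docker overwrote the hand-built /usr/bin/runc. That path is the first entry in podman's default runc search list:

// "common" here is containers/common, a separate git project
// common/pkg/config/default.go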

func defaultEngineConfig() (*EngineConfig, error) {
    c.OCIRuntimes = map[string][]string{
        "runc": {
            "/usr/bin/runc",
            "/usr/sbin/runc",
            "/usr/local/bin/runc",
            "/usr/local/sbin/runc",
            "/sbin/runc",
            "/bin/runc",
            "/usr/lib/cri-o-runc/sbin/runc",
            "/run/current-system/sw/bin/runc",
        },
    }
}
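
The candidates in such a list are normally tried in order, so whatever sits at /usr/bin/runc shadows every later entry. A minimal C sketch of a first-match lookup, assuming the runtime is simply the first candidate that exists and is executable; first_executable is a hypothetical helper for illustration, not podman's actual code:

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: return the index of the first path in paths that
 * exists and is executable by the caller, or -1 if none is. */
static int first_executable(const char *const *paths, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (access(paths[i], X_OK) == 0)
			return (int)i;
	}
	return -1;
}
```

Running rpm -qf (as above) on the path such a lookup returns shows which package actually owns the binary.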

Error 3

$ podman run -it 192.168.1.71:5000/kylin-server-v10:latest
Error: pasta failed with exit code -1:

$ sudo coredumpctl
TIME                            PID   UID   GID SIG COREFILE  EXE
Thu 2024-04-11 14:53:48 CST  1649827  1001  1001   7 present   /usr/bin/passt

$ sudo coredumpctl debug 1649827
        Signal: 7 (BUS)
       Message: Process 1649827 (pasta) of user 1001 dumped core.
                
                Stack trace of thread 1649827:
                #0  0x0000aaadc596e590 ns_check (passt)
                #1  0x0000fffd1395a1ec thread_start (libc.so.6)
                
                Stack trace of thread 1649826:
                #0  0x0000fffd1395a1c0 __clone (libc.so.6)
                #1  0x0000aaadc596e8c4 pasta_open_ns (passt)
                #2  0x0000aaadc596724c conf (passt)
                #3  0x0000aaadc5963434 main (passt)
                #4  0x0000fffd138a3fe0 __libc_start_main (libc.so.6)
                #5  0x0000aaadc5963be4 $x (passt)
                #6  0x0000aaadc5963be4 $x (passt)

$ ./pasta --trace
0.1690: Multiple interfaces with IPv6 routes, use -i to select one
0.1690: Couldn't pick external interface: disabling IPv6
Bus error

Tracing with gdb shows that it dies in clone:

(gdb) bt
#0  do_clone (arg=0xffffffe2b548, flags=17681, stack_size=1048576, stack_area=0xffffffe2b568 "\370ͪ\252\252\252", fn=<optimized out>) at util.c:519
#1  open_in_ns (c=0xfffffff30050, path=path@entry=0xaaaaaaabf0b8 "/proc/net/tcp", flags=flags@entry=524288) at util.c:377
#2  0x0000aaaaaaaaa12c in fwd_scan_ports_init (c=c@entry=0xfffffff30050) at fwd.c:135
#3  0x0000aaaaaaaa709c in conf (c=c@entry=0xfffffff30050, argc=argc@entry=1, argv=argv@entry=0xffffffffec38) at conf.c:1768
#4  0x0000aaaaaaaa3434 in main (argc=1, argv=0xffffffffec38) at passt.c:269

Given the SIGBUS, and after printing stack_area + stack_size / 2 in gdb, the cause is confirmed: the stack address passed to clone is not 16-byte aligned. The fix:

diff --git a/util.c b/util.c
index 849fa7f..82be401 100644
--- a/util.c
+++ b/util.c
@@ -516,7 +516,7 @@ int do_clone(int (*fn)(void *), char *stack_area, size_t stack_size, int flags,
        return __clone2(fn, stack_area + stack_size / 2, stack_size / 2,
                        flags, arg);
 #else
-       return clone(fn, stack_area + stack_size / 2, flags, arg);
+       return clone(fn, (char *)((uintptr_t)(stack_area + stack_size / 2 + 15) & (-16)), flags, arg);
 #endif
 }
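
The arithmetic in the patch can be checked in isolation. With the values from the gdb backtrace (stack_area = 0xffffffe2b568, stack_size = 1048576), the midpoint is 0xffffffeab568, which ends in 8 and is therefore not a multiple of 16. On AArch64 Linux the stack pointer has to be 16-byte aligned when it is used to access memory, which fits the SIGBUS. A minimal sketch of the rounding; align_up_16 and is_aligned_16 are illustrative names that do not exist in passt:

```c
#include <stdint.h>

/* Round addr up to the next multiple of 16; addresses that are already
 * 16-byte aligned are returned unchanged. */
static uintptr_t align_up_16(uintptr_t addr)
{
	return (addr + 15) & ~(uintptr_t)15;
}

/* Return nonzero if addr is a multiple of 16. */
static int is_aligned_16(uintptr_t addr)
{
	return (addr & 15) == 0;
}
```

For the address above, align_up_16 yields 0xffffffeab570. Rounding up moves the start of the child stack by at most 15 bytes, which stays well inside the allocated stack area because the patch starts from its midpoint.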

pasta SIGBUS error on aarch64
https://bugs.passt.top/show_bug.cgi?id=85

clone - create a child process
https://man7.org/linux/man-pages/man2/clone.2.html

It’s probably stack alignment.

How do I parse ARM64 assembly SIGBUS error?
https://stackoverflow.com/questions/72724797/how-do-i-parse-arm64-assembly-sigbus-error

Error 4

$ podman run -it 192.168.1.71:5000/kylin-server-v10
Error: pasta failed with exit code 1:
Couldn't set IPv4 route(s) in guest: Invalid argument

Enable pasta's debugging option:

$ sudo vi /etc/containers/containers.conf
[network]
pasta_options = ["--trace"]

The --trace option is now passed:

$ podman run --log-level trace --rm -it 192.168.1.71:5000/kylin-server-v10
pasta arguments: --config-net --trace --dns-forward 169.254.0.1 -t none -u none -T none -U none --no-map-gw --quiet --netns /run/user/1001/netns/netns-33119720-8b69-2893-581f-60ebfb26291b

However, --debug conflicts with --quiet (specifying --trace implies --debug):

$ sudo journalctl -f -t pasta
Apr 11 15:56:08 zstack-5 pasta[1825753]: Either --debug or --quiet

The containers/common source code has to be modified:

diff --git a/vendor/github.com/containers/common/libnetwork/pasta/pasta_linux.go b/vendor/github.com/containers/common/libnetwork/pasta/pasta_linux.go
index 4b31320b5..5f98e2491 100644
--- a/vendor/github.com/containers/common/libnetwork/pasta/pasta_linux.go
+++ b/vendor/github.com/containers/common/libnetwork/pasta/pasta_linux.go
@@ -146,7 +146,7 @@ func Setup2(opts *SetupOptions) (*SetupResult, error) {
        }
 
        // always pass --quiet to silence the info output from pasta
-       cmdArgs = append(cmdArgs, "--quiet", "--netns", opts.Netns)
+       cmdArgs = append(cmdArgs, "--netns", opts.Netns)
 
        logrus.Debugf("pasta arguments: %s", strings.Join(cmdArgs, " "))

Even in trace mode, however, pasta does not provide any more useful information.

Testing by hand directly:

$ pasta --config-net
Couldn't set IPv4 route(s) in guest: Invalid argument

The error comes from nl_do, which is called by nl_route_dup. The address and route information is as follows:

$ ip -4 a
9: lxcbr0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    inet 10.0.3.1/24 brd 10.0.3.255 scope global lxcbr0
       valid_lft forever preferred_lft forever

$ ip -4 r list table all
broadcast 10.0.3.0 dev lxcbr0 table local proto kernel scope link src 10.0.3.1 linkdown 
local 10.0.3.1 dev lxcbr0 table local proto kernel scope host src 10.0.3.1 
broadcast 10.0.3.255 dev lxcbr0 table local proto kernel scope link src 10.0.3.1 linkdown 

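Printing the netlink messages (the buf in nl_route_dup) and dumping them with the tool provided by pyroute2 for comparison shows that what fails is adding a route for an interface in NO-CARRIER state: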

$ git diff
diff --git a/netlink.c b/netlink.c
index 89c0641..2945923 100644
--- a/netlink.c
+++ b/netlink.c
@@ -503,6 +503,22 @@ int nl_route_set_def(int s, unsigned int ifi, sa_family_t af, const void *gw)
        return nl_do(s, &req, RTM_NEWROUTE, NLM_F_CREATE | NLM_F_EXCL, len);
 }
 
+static void hex_dump(char *buf, int buflen) {
+    for (int i = 0; i < buflen; i += 16) {
+        for (int j = 0; j < 16; j++) {
+            if (i + j < buflen) {
+                printf("%02x", (unsigned char)buf[i + j]);
+               if ((j != 15) && (i + j != buflen - 1)) {
+                   printf(":");
+                }
+            } else {
+                break;
+            }
+       }
+        printf("\n");
+    }
+}
+
 /**
  * nl_route_dup() - Copy routes for given interface and address family
  * @s_src:     Netlink socket in source namespace
@@ -602,6 +618,8 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
        if (status < 0)
                return status;
 
+       hex_dump(buf, nlmsgs_size);
+
        /* Routes might have dependencies between each other, and the kernel
         * processes RTM_NEWROUTE messages sequentially. For n routes, we might
         * need to send the requests up to n times to get all of them inserted.

Reading through the Linux kernel code and probing with systemtap confirms it:

$ sudo stap -e 'probe kernel.function("fib_create_info") { printf("fc_flags = 0x%x\n", $cfg->fc_flags); }
    probe kernel.function("fib_check_nh").return { printf("%s\n", $$return); }
    probe kernel.function("fib_create_info").return { printf("%s\n", $$return); }'

The failure happens in the following Linux kernel code:

// linux/net/ipv4/fib_semantics.c

fib_create_info
    if (cfg->fc_flags & (RTNH_F_DEAD | RTNH_F_LINKDOWN)) {
        NL_SET_ERR_MSG(extack, "Invalid rtm_flags - can not contain DEAD or LINKDOWN");
        goto err_inval;
    }

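The root cause, however, is that the kernel does not correctly apply the filter that pasta's nl_route_dup sends with its route dump request, so the dump returns route entries flagged RTNH_F_LINKDOWN: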

// linux/net/ipv4/fib_frontend.c

inet_dump_fib

RTNH_F_LINKDOWN corresponds to "flags": 16 in the pyroute2 dump:

{
    "pcap header": "None",
    "link layer header": "None",
    "message class": "<class 'pyroute2.netlink.rtnl.rtmsg.rtmsg'>",
    "exception": null,
    "data": {
        "family": 2,
        "dst_len": 32,
        "src_len": 0,
        "tos": 0,
        "table": 255,
        "proto": 2,
        "scope": 253,
        "type": 3,
        "flags": 16,   // RTNH_F_LINKDOWN
    }
}
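
The flags value can be matched against the fib_create_info check quoted earlier. A minimal sketch, assuming the flag values from the kernel's uapi rtnetlink.h (RTNH_F_DEAD = 1, RTNH_F_LINKDOWN = 16); route_flags_rejected is a hypothetical helper, not a kernel or passt function:

```c
/* Defined here only if <linux/rtnetlink.h> has not been included;
 * the values mirror include/uapi/linux/rtnetlink.h. */
#ifndef RTNH_F_DEAD
#define RTNH_F_DEAD 1
#endif
#ifndef RTNH_F_LINKDOWN
#define RTNH_F_LINKDOWN 16
#endif

/* Return nonzero if a route carrying these rtm_flags would hit the
 * "can not contain DEAD or LINKDOWN" check when sent as RTM_NEWROUTE. */
static int route_flags_rejected(unsigned int rtm_flags)
{
	return (rtm_flags & (RTNH_F_DEAD | RTNH_F_LINKDOWN)) != 0;
}
```

With the dumped value 16 the function returns nonzero, which is consistent with the "Invalid argument" that pasta reports.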

pasta sets the NETLINK_GET_STRICT_CHK socket option in nl_sock_init_do:

// passt/netlink.c

static int nl_sock_init_do(void *arg)
{
#ifdef NETLINK_GET_STRICT_CHK
	if (setsockopt(*s, SOL_NETLINK, NETLINK_GET_STRICT_CHK, &y, sizeof(y)))
		debug("netlink: cannot set NETLINK_GET_STRICT_CHK on %i", *s);
#endif
	return 0;
}

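However, the implementation of inet_dump_fib differs greatly between the 4.19 kernel and current 6.x kernels. 6.x kernels support NETLINK_GET_STRICT_CHK, so the filter takes effect, whereas 4.19 does not support this setting.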
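
Whether a running kernel accepts the option can be probed directly. A minimal sketch, assuming the mainline option number (NETLINK_GET_STRICT_CHK = 12, SOL_NETLINK = 270); strict_chk_supported is a hypothetical helper, and a successful setsockopt only shows that the option is accepted, not that every dump handler honours the filter (vendor kernels with backports may differ):

```c
#include <sys/socket.h>
#include <linux/netlink.h>
#include <unistd.h>

/* Fallbacks for older headers; values taken from the mainline uapi headers. */
#ifndef SOL_NETLINK
#define SOL_NETLINK 270
#endif
#ifndef NETLINK_GET_STRICT_CHK
#define NETLINK_GET_STRICT_CHK 12
#endif

/* Return 1 if the kernel accepts NETLINK_GET_STRICT_CHK, 0 if it rejects
 * the option, or -1 if no netlink socket could be opened. */
static int strict_chk_supported(void)
{
	int y = 1, ret;
	int s = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);

	if (s < 0)
		return -1;
	ret = setsockopt(s, SOL_NETLINK, NETLINK_GET_STRICT_CHK, &y, sizeof(y));
	close(s);
	return ret == 0;
}
```

As the excerpt above shows, pasta only emits a debug message when this setsockopt fails and then carries on, so on such kernels the failure goes unnoticed until the unfiltered routes are replayed.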

Note also that the kernel does not support filtering on the .rtm.rtm_scope field, so although pasta sets this field in nl_route_dup when fetching route entries, the kernel simply ignores it:

// passt/netlink.c

nl_route_dup
    struct req_t {
        struct nlmsghdr nlh;
        struct rtmsg rtm;
        struct rtattr rta;
        unsigned int ifi;
    } req = {
        .rtm.rtm_family   = af,
        .rtm.rtm_table    = RT_TABLE_MAIN,      // 254
        .rtm.rtm_scope    = RT_SCOPE_UNIVERSE,  // 0
        .rtm.rtm_type     = RTN_UNICAST,        // 1

        .rta.rta_type     = RTA_OIF,
        .rta.rta_len      = RTA_LENGTH(sizeof(unsigned int)),
        .ifi          = ifi_src,
    };


// linux/include/net/ip_fib.h

struct fib_dump_filter {
	u32			table_id;
	/* filter_set is an optimization that an entry is set */
	bool			filter_set;
	bool			dump_routes;
	bool			dump_exceptions;
	unsigned char		protocol;
	unsigned char		rt_type;
	unsigned int		flags;
	struct net_device	*dev;
};

netlink: Add new socket option to enable strict checking on dumps
https://github.com/torvalds/linux/commit/89d35528d17d25819a755a2b52931e911baebc66 (v4.20-rc1+)

net: netlink: rename NETLINK_DUMP_STRICT_CHK -> NETLINK_GET_STRICT_CHK (v4.20+)
https://github.com/torvalds/linux/commit/d3e8869ec82645599e6497d6974593bf00f7b19b

Steps to reproduce (CentOS 8 and Kylin V10 Desktop/Server are all affected, because their kernels are older than 4.20):

$ uname -r
4.18.0-348.el8.aarch64

$ sudo ip link add dummy0 type dummy
$ sudo ip link set dev dummy0 up
$ echo 0 | sudo tee /sys/class/net/dummy0/carrier
$ sudo ip addr add dev dummy0 172.16.13.13/16
$ sudo ip addr | grep dummy0 -A 3
4: dummy0: <NO-CARRIER,BROADCAST,NOARP,UP> mtu 1500 qdisc noqueue state DOWN group default qlen 1000
    link/ether f6:b0:92:bd:83:71 brd ff:ff:ff:ff:ff:ff
    inet 172.16.13.13/16 scope global dummy0
       valid_lft forever preferred_lft forever
    inet6 fe80::f4b0:92ff:febd:8371/64 scope link 
       valid_lft forever preferred_lft forever

$ ./pasta -4 --config-net
Couldn't set IPv4 route(s) in guest: Invalid argument

With the link set to up, unneeded route entries are copied instead:

$ echo 1 | sudo tee /sys/class/net/dummy0/carrier
$ ip route
default via 10.0.3.1 dev enp0s19 proto dhcp metric 100 
10.0.0.0/8 dev enp0s18 proto kernel scope link src 10.0.0.91 
10.0.3.0/24 dev enp0s19 proto kernel scope link src 10.0.3.91 metric 100 
172.16.0.0/16 dev dummy0 proto kernel scope link src 172.16.13.13
$ ip route list table main type unicast oif enp0s19
default via 10.0.3.1 proto dhcp metric 100 
10.0.3.0/24 proto kernel scope link src 10.0.3.91 metric 100

$ ./pasta -4 --config-net
# ip route
default via 10.0.3.1 dev enp0s19 proto dhcp metric 100 
10.0.0.0/8 dev enp0s19 proto kernel scope link 
10.0.3.0/24 dev enp0s19 proto kernel scope link metric 100 
172.16.0.0/16 dev enp0s19 proto kernel scope link

gdb-hexdump
https://github.com/runsisi/gdb-hexdump

Dump data
https://docs.pyroute2.org/debug.html

A simple workaround is to switch to slirp4netns for user-mode networking:

$ sudo vi /etc/containers/containers.conf
[network]
default_rootless_network_cmd = "slirp4netns"

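Alternatively, delete the interface that is in NO-CARRIER state, or remove the IP addresses from it.

A proper fix requires pasta to be compatible with the 4.19 kernel, that is, to drop entries that do not match the filter when copying routes (client-side versus kernel-side filtering; the ip route list command in fact also filters on the client side). This is slightly more involved: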

diff --git a/netlink.c b/netlink.c
index 89c0641..ea80bc7 100644
--- a/netlink.c
+++ b/netlink.c
@@ -537,6 +537,10 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
        char buf[NLBUFSIZ];
        uint32_t seq;
        unsigned i;
+       ssize_t nlmsgs_size2 = 0;
+       char buf2[NLBUFSIZ];
+       char *pos = buf2;
+       bool skip = false;
 
        seq = nl_send(s_src, &req, RTM_GETROUTE, NLM_F_DUMP, sizeof(req));
 
@@ -554,8 +558,10 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
 
                if (nh->nlmsg_type != RTM_NEWROUTE)
                        continue;
-
-               dup_routes++;
+               if (rtm->rtm_table != RT_TABLE_MAIN)
+                       continue;
+               if (rtm->rtm_type != RTN_UNICAST)
+                       continue;
 
                for (rta = RTM_RTA(rtm), na = RTM_PAYLOAD(nh); RTA_OK(rta, na);
                     rta = RTA_NEXT(rta, na)) {
@@ -564,6 +570,10 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
                         * the corresponding identifier in the target namespace.
                         */
                        if (rta->rta_type == RTA_OIF) {
+                               if (*(unsigned int *)RTA_DATA(rta) != ifi_src) {
+                                       skip = true;
+                                       break;
+                               }
                                *(unsigned int *)RTA_DATA(rta) = ifi_dst;
                        } else if (rta->rta_type == RTA_MULTIPATH) {
                                struct rtnexthop *rtnh;
@@ -582,6 +592,17 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
                                rta->rta_type = RTA_UNSPEC;
                        }
                }
+
+               if (skip) {
+                       skip = false;
+                       continue;
+               }
+
+               dup_routes++;
+
+               memcpy(pos, nh, NLMSG_ALIGN(nh->nlmsg_len));
+               pos += NLMSG_ALIGN(nh->nlmsg_len);
+               nlmsgs_size2 += NLMSG_ALIGN(nh->nlmsg_len);
        }
 
        if (!NLMSG_OK(nh, left)) {
@@ -610,7 +631,7 @@ int nl_route_dup(int s_src, unsigned int ifi_src,
         * to calculate dependencies: let the kernel do that.
         */
        for (i = 0; i < dup_routes; i++) {
-               for (nh = (struct nlmsghdr *)buf, left = nlmsgs_size;
+               for (nh = (struct nlmsghdr *)buf2, left = nlmsgs_size2;
                     NLMSG_OK(nh, left);
                     nh = NLMSG_NEXT(nh, left)) {
                        uint16_t flags = nh->nlmsg_flags;
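
The filtering added by the patch boils down to one predicate per dumped route. A minimal sketch of that predicate, assuming the caller has already extracted the RTA_OIF value (0 if the attribute is absent, since 0 is never a valid interface index); route_wanted is a hypothetical helper and not part of passt:

```c
#include <linux/rtnetlink.h>

/* Hypothetical helper: decide in userspace whether a dumped route should be
 * copied, applying the criteria the dump request asked the kernel for.
 * oif is the route's RTA_OIF value, or 0 if the attribute is absent. */
static int route_wanted(const struct rtmsg *rtm, unsigned int oif,
			unsigned int ifi_src)
{
	if (rtm->rtm_table != RT_TABLE_MAIN)
		return 0;
	if (rtm->rtm_type != RTN_UNICAST)
		return 0;
	if (oif != 0 && oif != ifi_src)
		return 0;
	return 1;
}
```

The route from the pyroute2 dump above (table 255, type 3) is rejected by the first test, so it would never reach the RTM_NEWROUTE replay loop.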

pasta does not filter out unneeded routes on kernel < 4.20
https://bugs.passt.top/show_bug.cgi?id=86

Error 5

This error showed up only once, possibly because of a version mismatch; I did not track it down:

$ podman run -it 192.168.1.71:5000/kylin-server-v10:latest
Error: OCI runtime error: runc: container_linux.go:318: starting container process caused "requested action matches default action of filter"

$ podman run -it --security-opt=seccomp=unconfined 192.168.1.71:5000/kylin-server-v10:latest

Root/rootless Error: OCI runtime error: container_linux.go:349: starting container process caused “error adding seccomp rule for syscall socket: requested action matches default action of filter”
https://github.com/containers/podman/issues/8472

Error 6

buildah 1.34.1 fails when it invokes runc 1.1.0:

$ buildah build .
STEP 1/5: FROM 192.168.1.71:5000/kylin-server-v10:latest
STEP 2/5: COPY <<EOF /etc/yum.repos.d/xcube-ev.repo ([xcube-os-ext]...)
STEP 3/5: RUN <<EOF (#!/bin/bash -ex...)
error running container: from /usr/bin/runc creating container for [/bin/sh -c /bin/bash -ex /dev/pipes/buildahheredoc4181887380]: time="2024-04-15T15:17:58+08:00" level=error msg="runc create failed: invalid mount &{Source:/var/tmp/buildah2446329082/mnt/buildah-bind-target-10 Destination:/dev/pipes/buildahheredoc4181887380 Device:bind Flags:20480 ClearedFlags:1 PropagationFlags:[278528] Data:z,Z Relabel: RecAttr:<nil> Extensions:0 IDMapping:<nil>}: bind mounts cannot have any filesystem-specific options applied"
: exit status 1
ERRO[0003] did not get container create message from subprocess: EOF 
Error: building at STEP "RUN <<EOF": while running runtime: exit status 1

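The error "bind mounts cannot have any filesystem-specific options applied" is reported by runc in checkBindOptions. The mount options are parsed in runc's parseMountOptions, and the validation in checkBindOptions then fails.

Build a debug version of buildah (building with BUILDDEBUG=1 yields less complete debug information):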

$ make GOGCFLAGS="all=-N -l"

buildah invokes runc, so for debugging runc needs a debug build as well:

$ make EXTRA_FLAGS='-gcflags="all=-N -l"'

Debugging buildah requires the cap_sys_admin capability (hence sudo below); otherwise buildah's reexec leaves the debugger unable to find debuginfo:

$ sudo dlv exec /usr/bin/buildah -- build .
(dlv) target follow-exec -on
(dlv) config substitute-path github.com/opencontainers/runc /home/runsisi/build/runc

The z and Z options that make runc's checkBindOptions fail are created by buildah in createNeededHeredocMountsForRun (runc's parseMountOptions handles the mount options it recognizes):

// buildah/imagebuildah/stage_executor.go
// createNeededHeredocMountsForRun
(dlv) p mountResult
[]github.com/opencontainers/runtime-spec/specs-go.Mount len: 1, cap: 1, [
    {
        Destination: "/dev/pipes/buildahheredoc2192141856",
        Type: "bind",
        Source: "/var/tmp/buildahheredoc2192141856",
        Options: []string len: 4, cap: 4, [
            "bind",
            "rprivate",
            "z",
            "Z",
        ],
        UIDMappings: []github.com/opencontainers/runtime-spec/specs-go.LinuxIDMapping len: 0, cap: 0, nil,
        GIDMappings: []github.com/opencontainers/runtime-spec/specs-go.LinuxIDMapping len: 0, cap: 0, nil,
    },
]

// runc/libcontainer/specconv/spec_linux.go
// CreateLibcontainerConfig
(dlv) p spec.Mounts
Sending output to pager...
[]github.com/opencontainers/runtime-spec/specs-go.Mount len: 12, cap: 17, [
    {
        Destination: "/dev/pipes/buildahheredoc1708672189",
        Type: "bind",
        Source: "/var/tmp/buildah2208349361/mnt/buildah-bind-target-10",
        Options: []string len: 6, cap: 8, [
            "bind",
            "rprivate",
            "z",
            "Z",
            "rw",
            "rbind",
        ],
        UIDMappings: []github.com/opencontainers/runtime-spec/specs-go.LinuxIDMapping len: 0, cap: 0, nil,
        GIDMappings: []github.com/opencontainers/runtime-spec/specs-go.LinuxIDMapping len: 0, cap: 0, nil,
    },
]

// runc/libcontainer/specconv/spec_linux.go
// parseMountOptions
(dlv) p options
[]string len: 6, cap: 8, [
        "bind",
        "rprivate",
        "z",
        "Z",
        "rw",
        "rbind",
]

// runc/libcontainer/configs/validate/validator.go

func checkBindOptions(m *configs.Mount) error {
	if !m.IsBind() {
		return nil
	}
	// We must reject bind-mounts that also have filesystem-specific mount
	// options, because the kernel will completely ignore these flags and we
	// cannot set them per-mountpoint.
	//
	// It should be noted that (due to how the kernel caches superblocks), data
	// options could also silently ignored for other filesystems even when
	// doing a fresh mount, but there is no real way to avoid this (and it
	// matches how everything else works). There have been proposals to make it
	// possible for userspace to detect this caching, but this wouldn't help
	// runc because the behaviour wouldn't even be desirable for most users.
	if m.Data != "" {
		return errors.New("bind mounts cannot have any filesystem-specific options applied")
	}
	return nil
}
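
Why z and Z trip this check can be shown with a simplified model of the option parsing: options recognized as mount or propagation flags are consumed, and everything else is joined into the data string. A minimal sketch under that assumption; known_flags is an illustrative subset and collect_data a hypothetical helper, neither taken from runc:

```c
#include <stddef.h>
#include <string.h>

/* Illustrative subset of option names treated as mount or propagation
 * flags; the real table in runc is much longer. */
static const char *const known_flags[] = {
	"bind", "rbind", "ro", "rw", "private", "rprivate", "shared", "rshared",
};

static int is_known_flag(const char *opt)
{
	for (size_t i = 0; i < sizeof(known_flags) / sizeof(known_flags[0]); i++) {
		if (strcmp(opt, known_flags[i]) == 0)
			return 1;
	}
	return 0;
}

/* Join every unrecognized option into data, comma-separated.
 * Returns 0 on success, -1 if data is too small. */
static int collect_data(const char *const *opts, size_t n, char *data, size_t size)
{
	size_t used = 0;

	if (size == 0)
		return -1;
	data[0] = '\0';
	for (size_t i = 0; i < n; i++) {
		size_t len;

		if (is_known_flag(opts[i]))
			continue;
		len = strlen(opts[i]);
		if (used + len + (used ? 1 : 0) + 1 > size)
			return -1;
		if (used)
			data[used++] = ',';
		memcpy(data + used, opts[i], len + 1);
		used += len;
	}
	return 0;
}
```

Fed the six options from the dlv output above, this produces "z,Z", matching the Data:z,Z field in the runc error message. A non-empty data string on a bind mount is exactly what checkBindOptions rejects.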

This problem does not occur on ArchLinux because buildah's default container runtime there is crun rather than runc (this can be configured in /etc/containers/containers.conf):

❯ buildah build --log-level trace .
DEBU[0000] Running ["/usr/bin/crun" "create" "--bundle" "/var/tmp/buildah1329540379" "--pid-file" "/var/tmp/buildah1329540379/pid" "--no-new-keyring" "buildah-buildah1329540379"]
❯ man containers.conf
runtime=""
    Default OCI specific runtime in runtimes that will be used by default. Must refer to a member of the runtimes table. Default runtime will be searched for on the system using the priority: "crun", "crun-vm", "runc", "kata".

So this problem can be worked around by installing the crun container runtime:

$ git clone https://github.com/containers/crun.git
$ cd crun
$ ./autogen.sh
$ ./configure --prefix /usr
$ make
$ make install-strip DESTDIR=$PWD/built
$ sudo cp built/usr/bin/crun /usr/bin/

Last modified on 2024-05-05