故障现象

磁盘空间不够导致 MON 主动退出，同时系统日志中存在大量如下信息：

~# grep 'segfault at' /var/log/messages
salt-minion[3478758]: segfault at 8 ip 00007faf203a74e2 sp 00007faf017f8bf0 error 4 in librbd.so.1.12.0[7faf202d6000+30f000]
salt-minion[3494202]: segfault at 104 ip 00007faf2032c051 sp 00007faf03ffdb68 error 4 in librbd.so.1.12.0[7faf202d6000+30f000]
salt-minion[3527668]: segfault at 3ff00000408 ip 00007faf203a74e2 sp 00007faf11b37bf0 error 4 in librbd.so.1.12.0[7faf202d6000+30f000]

查看系统生成的 coredump 文件占用了大量的空间：

~# sysctl -a | grep core_pattern
kernel.core_pattern = /var/core/core-%e-%p-%t
~# ls /var/core/
-rw-------. 1 root root 372604928 8月  15 19:58 core-salt-minion-2221669-1534334283
-rw-------. 1 root root 403374080 8月  15 20:46 core-salt-minion-2507909-1534337161
-rw-------. 1 root root 403456000 8月  15 20:46 core-salt-minion-2508761-1534337172
-rw-------. 1 root root 405602304 8月  15 20:52 core-salt-minion-2546697-1534337559
-rw-------. 1 root root 397918208 8月  15 20:52 core-salt-minion-2547466-1534337570
-rw-------. 1 root root 398032896 8月  15 21:13 core-salt-minion-2668416-1534338806
-rw-------. 1 root root 413097984 8月  15 21:30 core-salt-minion-2773468-1534339834

显然是 coredump 文件占用了磁盘空间，导致 MON 进程退出。

故障分析

由于有 coredump 文件，所以直接通过 gdb 进行定位。首先安装 ceph 的 debuginfo 包，这样我们才不仅能看到调用栈，而且能够查看代码，打印函数实参，甚至局部变量等，否则的话我们基本就只能分析汇编以及手工分析内存结构来推测变量了，像下面这样：

   0x00007fc5843a74a0 <+0>:     push   %r15
   0x00007fc5843a74a2 <+2>:     push   %r14
   0x00007fc5843a74a4 <+4>:     push   %r13
   0x00007fc5843a74a6 <+6>:     push   %r12
   0x00007fc5843a74a8 <+8>:     push   %rbp
   0x00007fc5843a74a9 <+9>:     push   %rbx
   0x00007fc5843a74aa <+10>:    sub    $0x248,%rsp
   0x00007fc5843a74b1 <+17>:    mov    %rdi,0x60(%rsp)
   0x00007fc5843a74b6 <+22>:    mov    %rsi,0x68(%rsp)
   0x00007fc5843a74bb <+27>:    mov    %rdx,0x40(%rsp)
   0x00007fc5843a74c0 <+32>:    mov    %rcx,0x10(%rsp)
   0x00007fc5843a74c5 <+37>:    mov    %fs:0x28,%rax
   0x00007fc5843a74ce <+46>:    mov    %rax,0x238(%rsp)
   0x00007fc5843a74d6 <+54>:    xor    %eax,%eax
   0x00007fc5843a74d8 <+56>:    callq  0x7fc584301fc0 <_ZN8librados5IoCtx3cctEv@plt>
   0x00007fc5843a74dd <+61>:    mov    %rax,0x58(%rsp)
   0x00007fc5843a74e2 <+66>:    mov    0x8(%rax),%rax
   0x00007fc5843a74e6 <+70>:    mov    $0x14,%edx
   0x00007fc5843a74eb <+75>:    mov    $0x10,%esi
   0x00007fc5843a74f0 <+80>:    lea    0x170(%rax),%rdi
   0x00007fc5843a74f7 <+87>:    mov    %rax,(%rsp)
   0x00007fc5843a74fb <+91>:    callq  0x7fc58432c030 <_ZN4ceph7logging12SubsystemMap13should_gatherEji>
=> 0x00007fc5843a7500 <+96>:    test   %al,%al

对照源代码来看执行逻辑不是说不可以，但一旦需要通过内存来分析变量值估计会让人疯掉，这时你需要随时编写代码来计算结构体中成员变量的偏移。

由于目标机器分区不合理，导致 ceph-debuginfo 包由于磁盘空间不足无法安装，在这种情况下，可以只拷贝部分必须的符号信息以及源代码至目标机器进行 gdb 调试。

准备工作做好之后，使用 gdb 加载 core 文件：

~# file core-salt-minion-89222-1534465288
core-salt-minion-89222-1534465288: ELF 64-bit LSB core file x86-64, version 1 (SYSV), SVR4-style, from '/usr/bin/python /usr/bin/salt-minion'
~# gdb /usr/bin/python core-salt-minion-89222-1534465288
(gdb) bt
#0  ceph::logging::SubsystemMap::should_gather (this=0x4ca0fb0, sub=<optimized out>, level=20) at /usr/src/debug/ceph-12.2.2/src/log/SubsystemMap.h:63
#1  0x00007f022c3a7500 in librbd::api::Image<librbd::ImageCtx>::status_list_snapshots (io_ctx=..., start=start@entry=64517, max=max@entry=1024,
    snapshots=snapshots@entry=0x7f021db37ed0) at /usr/src/debug/ceph-12.2.2/src/librbd/api/Image.cc:224
#2  0x00007f022c3146cc in rbd_status_list_snapshots (p=0x7f0218002400, start=64517, max=1024, c_snapshots=0x7f0218029740, size=0x3eb7808)
    at /usr/src/debug/ceph-12.2.2/src/librbd/librbd.cc:4360
#3  0x00007f023c357391 in __pyx_pw_3rbd_22StatusSnapshotIterator_8get_next_chunk () from /usr/lib64/python2.7/site-packages/rbd.so
#4  0x00007f025b1589a3 in PyObject_Call () from /lib64/libpython2.7.so.1.0
...
(gdb) f 0
#0  ceph::logging::SubsystemMap::should_gather (this=0x4ca0fb0, sub=<optimized out>, level=20) at /usr/src/debug/ceph-12.2.2/src/log/SubsystemMap.h:63
63          return level <= m_subsys[sub].gather_level ||
(gdb) p m_subsys
$2 = std::vector of length 8728005947280, capacity 8728005945854 = {Cannot access memory at address 0x10
(gdb) up
#1  0x00007f022c3a7500 in librbd::api::Image<librbd::ImageCtx>::status_list_snapshots (io_ctx=..., start=start@entry=64517, max=max@entry=1024,
    snapshots=snapshots@entry=0x7f021db37ed0) at /usr/src/debug/ceph-12.2.2/src/librbd/api/Image.cc:224
224       ldout(cct, 20) << "status_list_snapshots io_ctx=" << &io_ctx << dendl;
(gdb) p *io_ctx.io_ctx_impl->client->cct
$3 = {nref = {<std::__atomic_base<unsigned int>> = {_M_i = 0}, <No data fields>}, _conf = 0x91, _log = 0x49319e0, _module_type = 75167552, _init_flags = 0, _set_uid = 53300992,
  _set_gid = 0,
...

从上面打印出来的 m_subsys 和 cct 的内容实际上是可以看出问题的，m_subsys 是 ceph 日志记录子系统的数目，在进程上电时直接通过源代码里的宏加载，因此是一个不变值，当前系统总共也就不到 60 个子系统，而现在这个数组的长度 8728005947280 明显是一个非法值。而 cct 中的 nref 成员是一个引用计数，只有客户端 rados 实例销毁时才会导致 cct 的引用计数为 0 (https://github.com/ceph/ceph/pull/11655)，当然这里的值 0 是个随机值，并不说明引用计数已经到 0 了，可以通过另外一个 coredump 文件进行验证：

(gdb) p *io_ctx.io_ctx_impl->client->cct
warning: can't find linker symbol for virtual table for `librados::RadosClient' value
$1 = {nref = {<std::__atomic_base<unsigned int>> = {_M_i = 76697280}, <No data fields>}, _conf = 0x4924ec0, _log = 0x4c55ee0, _module_type = 80043744, _init_flags = 0,
  _set_uid = 1518214648, _set_gid = 32514,

在另一个 coredump 文件里有这样的信息：

gdb /bin/python core-salt-minion-3291358-1534344593
(gdb) bt
...
#6  0x00007faf315d6b71 in boost::throw_exception<boost::lock_error> (e=...) at /usr/src/debug/ceph-12.2.2/build/boost/include/boost/throw_exception.hpp:69
#7  0x00007faf315d728c in lock (this=<optimized out>) at /usr/src/debug/ceph-12.2.2/build/boost/include/boost/thread/pthread/mutex.hpp:119
#8  boost::unique_lock<boost::mutex>::lock (this=0x7faf01ff9710) at /usr/src/debug/ceph-12.2.2/build/boost/include/boost/thread/lock_types.hpp:346
#9  0x00007faf3161d811 in unique_lock (m_=..., this=0x7faf01ff9710) at /usr/src/debug/ceph-12.2.2/build/boost/include/boost/thread/lock_types.hpp:124
#10 lock_shared (this=0x3b41f60) at /usr/src/debug/ceph-12.2.2/build/boost/include/boost/thread/pthread/shared_mutex.hpp:188
#11 lock_shared (this=0x7faf01ff9700) at /usr/src/debug/ceph-12.2.2/src/common/shunique_lock.h:163
#12 shunique_lock (m=..., this=0x7faf01ff9700) at /usr/src/debug/ceph-12.2.2/src/common/shunique_lock.h:70
#13 Objecter::op_submit (this=0x3b41e80, op=0x7faee13e6110, ptid=0x0, ctx_budget=0x0) at /usr/src/debug/ceph-12.2.2/src/osdc/Objecter.cc:2310
...
(gdb) f 6
#6  0x00007faf315d6b71 in boost::throw_exception<boost::lock_error> (e=...) at /usr/src/debug/ceph-12.2.2/build/boost/include/boost/throw_exception.hpp:69
warning: Source file is more recent than executable.
69          throw enable_current_exception(enable_error_info(e));
(gdb) p e
$1 = (const boost::lock_error &) @0x7faf01ff9690: {<boost::thread_exception> = {<boost::system::system_error> = {<std::runtime_error> = {<No data fields>}, m_error_code = {
        m_val = 22, m_cat = 0x7faf1bfff440 <boost::system::system_category()::system_category_const>}, m_what = ""}, <No data fields>}, <No data fields>}

其中 boost 抛出的异常对象的错误码 m_val 是 EINVAL（22）。

结合以上各种信息，显然可以推断是访问已释放的内存导致进程挂掉，此时需要结合 ceph 提供的 api 以及上层的业务逻辑来做进一步分析。

源码分析

由于上层代码使用一个 rados handle 让多个线程共享使用，因此怀疑是线程安全的问题，因此首先分析 rados handle 是否是线程安全的。ceph 的 api 不是完全线程安全的，比如 rados_connect 接口，常见的应用（如 qemu，fio）都是一个 rados 对象对应一个 rbd 对象，即 rados 对象不会被多个 image 对象（多线程）访问。我们也可以用后面的测试代码简单测试一下 rados 对象的线程安全性。

既然线程安全引起问题的可能性不大，那再回到具体出问题的代码（简化版本）：

def thread_func():
    with cluster_handle.open_ioctx(pool_name) as ioctx:
        images = rbd_inst.status_list_snapshots(ioctx)
        trashs = rbd_inst.trash_list(ioctx)

def run_in_thread(func, args, timeout):
    interrupt = False

    countdown = timeout
    t = threading.Thread(func, args)
    t.daemon = True

    t.start()
    try:
        # poll for thread exit
        while t.is_alive():
            t.join(POLL_TIME_INCR)
            if timeout and t.is_alive():
                countdown = countdown - POLL_TIME_INCR
                if countdown <= 0:
                    raise KeyboardInterrupt
        t.join()
    except KeyboardInterrupt:
        interrupt = True

    if interrupt:
        t.retval = -1
    return t.retval

if __name__ == '__main__':
    handle = rados.Rados()
    handle.connect()
    run_in_thread(thread_func, args, 20)
    handle.shutdown()

实际上通过 gdb 分析所有的 coredump 文件可以发现所有的异常栈都包含 thread_func 线程函数调用的接口，而这里的 run_in_thread 对线程函数超时的处理非常值得怀疑，当 thread_func 调用的接口超时返回时，它所创建的线程并没有结束，直到主线程调用 handle.shutdown 直到进程退出前该线程仍然可以自由执行，显然会导致 ceph 内部实现出现访问已释放内存的问题。通过调小上层代码的超时时间（即代码中的 20），该问题很容易复现，生成的调用栈也与之前的 coredump 完全一致，同时结合出问题的时间点，那会正好是集群创建 rbd image 数目到 30000+ 的时候，相关的查询接口所需的时间增加，导致接口查询时间变长，从而直接导致了问题的出现。

不过需要注意的是，虽然问题的原因已经找到了，但是由于不存在终止线程的方法，一旦工作线程卡在外部库的接口中，这种问题总是会出现：（

测试代码

简单测试 rados 对象是否线程安全：

#include "rados/librados.hpp"
#include "rbd/librbd.hpp"
#include <string>
#include <vector>
#include <iostream>
#include <vector>
#include <thread>

using namespace std;
using namespace librados;
using namespace librbd;

int main(int argc, char **argv) {
    Rados client;
    int r = client.init2("client.admin", "ceph", 0);
    if (r < 0) {
        cout << "client.init2 failed: " << r << endl;
        return -1;
    }

    r = client.conf_read_file(nullptr);
    if (r < 0) {
        cout << "client.conf_read_file failed: " << r << endl;
        return -1;
    }

    r = client.connect();
    if (r < 0) {
        cout << "client.connect failed: " << r << endl;
        return -1;
    }

    int thread_count = 8;
    int iterate_count = 16;
    int create_or_remove = 1;
    if (argc > 1) {
        thread_count = std::stoi(argv[1]);
    }
    if (argc > 2) {
        iterate_count = std::stoi(argv[2]);
    }
    if (argc > 3) {
        create_or_remove = std::stoi(argv[3]);
    }

    // create worker threads
    std::vector<std::thread> threads;
    for (int i = 0; i < thread_count; i ++) {
        auto t = std::thread([&client, i, iterate_count, create_or_remove]() {
            IoCtx ioctx;
            int r = client.ioctx_create("rbd", ioctx);
            if (r < 0) {
                cout << "client.ioctx_create failed: " << r << endl;
                return -2;
            }

            RBD rbd;

            string name("thread");
            name += std::to_string(i);
            uint64_t size = 1024 * 1024 * 1024 * 10UL;
            int order = 22;

            if (create_or_remove & 1) {
                for (int j = 0; j < iterate_count; j++) {
                    cout << "--> thread: " << i << ", create iterate: " << j << endl;

                    rbd.create(ioctx, name.c_str(), size, &order);
                    vector<string> names;
                    rbd.list(ioctx, names);
                }
            }

            if (create_or_remove & 2) {
                for (int j = 0; j < iterate_count; j++) {
                    cout << "--> thread: " << i << ", remove iterate: " << j << endl;

                    vector<string> names;
                    rbd.list(ioctx, names);
                    rbd.remove(ioctx, name.c_str());
                }
            }
        });

        threads.push_back(std::move(t));
    }

    for (int i = 0; i < thread_count; i++) {
        threads[i].join();
    }

    cout << "main exit" << endl;

    return 0;
}

// cd ceph/src/test/
// g++ -o rados_threads -std=c++11 rados_threads.cc -lrados -lrbd -lpthread -ggdb3
//        -I../../build/include -I../include  -L../../build/lib

最后修改于 2019-01-15