ceph-mgr dead lock while loading mgr modules

构建 ceph master 分支并使用 vstart 环境创建本地集群时发现 mgr 无法上线,结合 mgr 日志解决了所有了 python 依赖的问题之后,问题仍然没有解决,深入分析之后发现是 mgr 在加载模块过程中死锁了,这里简单记录一下分析过程。

gdb 确认死锁现象

注意 16、25 号线程,两个线程一直在等锁(mutex 互斥锁 PyModuleRegistry::lock ):

(gdb) info threads
  Id   Target Id                                            Frame
  1    Thread 0x7ffff7fd1fc0 (LWP 945310) "ceph-mgr"        0x00007fffebf39d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  2    Thread 0x7fffe9c66700 (LWP 945467) "log"             0x00007fffebf39965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  4    Thread 0x7fffe8c64700 (LWP 945469) "msgr-worker-0"   0x00007fffeb014483 in epoll_wait () from /lib64/libc.so.6
  5    Thread 0x7fffe8463700 (LWP 945470) "msgr-worker-1"   0x00007fffeb014483 in epoll_wait () from /lib64/libc.so.6
  6    Thread 0x7fffe7c62700 (LWP 945471) "msgr-worker-2"   0x00007fffeb014483 in epoll_wait () from /lib64/libc.so.6
  10   Thread 0x7fffe9465700 (LWP 945475) "service"         0x00007fffebf39d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  11   Thread 0x7fffe7461700 (LWP 945476) "admin_socket"    0x00007fffeb00920d in poll () from /lib64/libc.so.6
  12   Thread 0x7fffe6c60700 (LWP 945477) "ceph_timer"      0x00007fffebf39d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  13   Thread 0x7fffe645f700 (LWP 945478) "fn_anonymous"    0x00007fffebf39965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  14   Thread 0x7fffe5c5e700 (LWP 945479) "signal_handler"  0x00007fffeb00920d in poll () from /lib64/libc.so.6
  15   Thread 0x7fffe545d700 (LWP 945480) "mgrsb-fin"       0x00007fffebf39965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 16   Thread 0x7fffe4c5c700 (LWP 945481) "ms_dispatch"     0x00007fffebf3c4ed in __lll_lock_wait () from /lib64/libpthread.so.0
  17   Thread 0x7fffe445b700 (LWP 945482) "ms_local"        0x00007fffebf39965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  18   Thread 0x7fffe3c5a700 (LWP 945483) "io_context_pool" 0x00007fffebf39965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  19   Thread 0x7fffe3459700 (LWP 945484) "io_context_pool" 0x00007fffebf39965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  20   Thread 0x7fffe2c58700 (LWP 945486) "safe_timer"      0x00007fffebf39d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  21   Thread 0x7fffe2457700 (LWP 945487) "safe_timer"      0x00007fffebf39965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  22   Thread 0x7fffe1c56700 (LWP 945488) "safe_timer"      0x00007fffebf39965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  23   Thread 0x7fffe1455700 (LWP 945489) "fn_anonymous"    0x00007fffebf39965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  24   Thread 0x7fffe0c54700 (LWP 945490) "flusher"         0x00007fffebf39d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  25   Thread 0x7fffe0453700 (LWP 945491) "safe_timer"      0x00007fffebf3c4ed in __lll_lock_wait () from /lib64/libpthread.so.0
  26   Thread 0x7fffcc1d3700 (LWP 947011) "ceph-mgr"        0x00007fffebf39965 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
...

这两个线程的调用栈信息如下:

(gdb) thread apply 16 bt

Thread 16 (Thread 0x7fffe4c5c700 (LWP 945481)):
#0  0x00007fffebf3c4ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffebf37dcb in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007fffebf37c98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x0000555555fb2b77 in ceph::mutex_debug_detail::mutex_debug_impl<false>::lock_impl (this=0x7fffffffd688) at /home/runsisi/build/master/src/common/mutex_debug.h:121
#4  0x0000555555fb23c2 in ceph::mutex_debug_detail::mutex_debug_impl<false>::lock (this=0x7fffffffd688, no_lockdep=false) at /home/runsisi/build/master/src/common/mutex_debug.h:185
#5  0x0000555555fb1cfb in std::lock_guard<ceph::mutex_debug_detail::mutex_debug_impl<false> >::lock_guard (this=0x7fffe4c593c0, __m=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_mutex.h:162
#6  0x00005555560f0ff8 in MgrStandby::ms_dispatch2 (this=0x7fffffffb460, m=...) at /home/runsisi/build/master/src/mgr/MgrStandby.cc:435
#7  0x00007fffeebe1e6d in Messenger::ms_deliver_dispatch (this=0x5555571d6900, m=...) at /home/runsisi/build/master/src/msg/Messenger.h:703
#8  0x00007fffeebe090a in DispatchQueue::entry (this=0x5555571d6c60) at /home/runsisi/build/master/src/msg/DispatchQueue.cc:201
#9  0x00007fffeed5ecec in DispatchQueue::DispatchThread::entry (this=0x5555571d6dd8) at /home/runsisi/build/master/src/msg/DispatchQueue.h:101
#10 0x00007fffee9c3796 in Thread::entry_wrapper (this=0x5555571d6dd8) at /home/runsisi/build/master/src/common/Thread.cc:91
#11 0x00007fffee9c3714 in Thread::_entry_func (arg=0x5555571d6dd8) at /home/runsisi/build/master/src/common/Thread.cc:75
#12 0x00007fffebf35dd5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fffeb013ead in clone () from /lib64/libc.so.6
(gdb) thread apply 25 bt

Thread 25 (Thread 0x7fffe0453700 (LWP 945491)):
#0  0x00007fffebf3c4ed in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00007fffebf37dcb in _L_lock_883 () from /lib64/libpthread.so.0
#2  0x00007fffebf37c98 in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00007fffee8d0baf in ceph::mutex_debug_detail::mutex_debug_impl<false>::lock_impl (this=0x7fffffffd688) at /home/runsisi/build/master/src/common/mutex_debug.h:121
#4  0x00007fffee8cee8e in ceph::mutex_debug_detail::mutex_debug_impl<false>::lock (this=0x7fffffffd688, no_lockdep=false) at /home/runsisi/build/master/src/common/mutex_debug.h:185
#5  0x00007fffee8cef6a in std::unique_lock<ceph::mutex_debug_detail::mutex_debug_impl<false> >::lock (this=0x7fffe04505a0) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_mutex.h:267
#6  0x00007fffee8cd15e in std::unique_lock<ceph::mutex_debug_detail::mutex_debug_impl<false> >::unique_lock (this=0x7fffe04505a0, __m=...) at /opt/rh/devtoolset-8/root/usr/include/c++/8/bits/std_mutex.h:197
#7  0x00007fffee9cfd18 in SafeTimer::timer_thread (this=0x7fffffffd800) at /home/runsisi/build/master/src/common/Timer.cc:76
#8  0x00007fffee9d1af6 in SafeTimerThread::entry (this=0x55555716cb70) at /home/runsisi/build/master/src/common/Timer.cc:32
#9  0x00007fffee9c3796 in Thread::entry_wrapper (this=0x55555716cb70) at /home/runsisi/build/master/src/common/Thread.cc:91
#10 0x00007fffee9c3714 in Thread::_entry_func (arg=0x55555716cb70) at /home/runsisi/build/master/src/common/Thread.cc:75
#11 0x00007fffebf35dd5 in start_thread () from /lib64/libpthread.so.0
#12 0x00007fffeb013ead in clone () from /lib64/libc.so.6

在等待同一把锁:

(gdb) thread apply 16 f 6

Thread 16 (Thread 0x7fffe4c5c700 (LWP 945481)):
#6  0x00005555560f0ff8 in MgrStandby::ms_dispatch2 (this=0x7fffffffb460, m=...) at /home/runsisi/build/master/src/mgr/MgrStandby.cc:435
435       std::lock_guard l(lock);
(gdb) p &lock
$10 = (ceph::mutex *) 0x7fffffffd688
(gdb) thread apply 25 f 7

Thread 25 (Thread 0x7fffe0453700 (LWP 945491)):
#7  0x00007fffee9cfd18 in SafeTimer::timer_thread (this=0x7fffffffd800) at /home/runsisi/build/master/src/common/Timer.cc:76
76        std::unique_lock l{lock};
(gdb) p &lock
$11 = (ceph::mutex *) 0x7fffffffd688

而这把锁当前被 1 号线程拿着(注意 locked_by 字段):

(gdb) p lock
$14 = {<ceph::mutex_debug_detail::mutex_debugging_base> = {group = "MgrStandby::lock", id = 37, lockdep = true, backtrace = false, nlock = 1, locked_by = {_M_thread = 140737353949120}}, m = {__data = {__lock = 2, __count = 0, __owner = 945310, __nusers = 1, __kind = 2, __spins = 0, __elision = 0, __list = {__prev = 0x0, __next = 0x0}}, __size = "\002\000\000\000\000\000\000\000\236l\016\000\001\000\000\000\002", '\000' <repeats 22 times>, __align = 2}, static recursive = false}
(gdb) p/x 140737353949120
$15 = 0x7ffff7fd1fc0

  1    Thread 0x7ffff7fd1fc0 (LWP 945310) "ceph-mgr"        0x00007fffebf39d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
* 16   Thread 0x7fffe4c5c700 (LWP 945481) "ms_dispatch"     0x00007fffebf3c4ed in __lll_lock_wait () from /lib64/libpthread.so.0
  25   Thread 0x7fffe0453700 (LWP 945491) "safe_timer"      0x00007fffebf3c4ed in __lll_lock_wait () from /lib64/libpthread.so.0

显然,1 号线程拿着这个锁,导致 16、25 号线程死等。但是,1 号线程为什么不释放这个锁?

(gdb) thread 1
[Switching to thread 1 (Thread 0x7ffff7fd1fc0 (LWP 945310))]
#0  0x00007fffebf39d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fffebf39d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fffed140ab8 in take_gil () from /lib64/libpython3.6m.so.1.0
#2  0x00007fffed140bf9 in PyEval_RestoreThread () from /lib64/libpython3.6m.so.1.0
#3  0x00007fffed1d5036 in PyGILState_Ensure () from /lib64/libpython3.6m.so.1.0
#4  0x00007fffa4074a05 in ?? () from /home/runsisi/.local/lib/python3.6/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-36m-x86_64-linux-gnu.so
#5  0x00007fffa4061b45 in ?? () from /home/runsisi/.local/lib/python3.6/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-36m-x86_64-linux-gnu.so
#6  0x00007fffa407627a in ?? () from /home/runsisi/.local/lib/python3.6/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-36m-x86_64-linux-gnu.so
#7  0x00007fffa4061f32 in ?? () from /home/runsisi/.local/lib/python3.6/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-36m-x86_64-linux-gnu.so
#8  0x00007fffa4063650 in PyInit_pypocketfft () from /home/runsisi/.local/lib/python3.6/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-36m-x86_64-linux-gnu.so
#9  0x00007fffed1d19b0 in _PyImport_LoadDynamicModuleWithSpec () from /lib64/libpython3.6m.so.1.0
...
#222 0x00007fffed165a76 in PyImport_Import () from /lib64/libpython3.6m.so.1.0
#223 0x00007fffed165bcb in PyImport_ImportModule () from /lib64/libpython3.6m.so.1.0
#224 0x000055555612089b in PyModule::load_subclass_of (this=0x555557135a50, base_class=0x55555643b6ea "MgrModule", py_class=0x555557135b08) at /home/runsisi/build/master/src/mgr/PyModule.cc:649
#225 0x000055555611d0d6 in PyModule::load (this=0x555557135a50, pMainThreadState=0x5555571d6000) at /home/runsisi/build/master/src/mgr/PyModule.cc:335
#226 0x0000555556127ef4 in PyModuleRegistry::init (this=0x7fffffffd8c8) at /home/runsisi/build/master/src/mgr/PyModuleRegistry.cc:86
#227 0x00005555560eddbd in MgrStandby::init (this=0x7fffffffb460) at /home/runsisi/build/master/src/mgr/MgrStandby.cc:184
#228 0x0000555555ebf09c in main (argc=4, argv=0x7fffffffdc88) at /home/runsisi/build/master/src/ceph_mgr.cc:71

从 1 号线程的调用栈来看,在加载 mgr 模块(diskprediction_local)的过程中调用 PyEval_RestoreThread 接口等待 GIL 锁。但是,1 号线程在调用 PyModule::load 加载 mgr 模块时,已经调用 PyEval_RestoreThread 接口拿到了 GIL 锁,PyEval_RestoreThread 接口文档明确提及在同一个线程中不能重复调用,否则会发生死锁(If the lock has been created, the current thread must not have acquired it, otherwise deadlock ensues),显然,16、25 两个线程等锁只是表象,真正的死锁发生在 1 号线程内部。

pybind11 扩展模块

从上面 1 号线程的调用栈可以知道,死锁发生在导入 Python C/C++ 扩展模块 scipy.fft 的过程中。

这里要注意一个问题,社区 master 分支每天都在构建、测试,为何没有出现死锁?这时候显然要关注 mgr 模块的 Python 依赖的版本,在我的环境中,所有的 Python 依赖都是通过 pip 安装的最新版本,再对比 ceph 代码中 requirements.txt 要求的版本,显然两者并不匹配,通过降级至 requirements.txt 要求的版本,死锁问题解决。

问题虽然解决,但是问题出在哪里?

通过分析 scipy 两个版本的差异,可以看到从(v1.4.0)版本开始引入了 scipy.fft 这个基于 pybind11 的 C/C++ 扩展模块,而我降级之前的版本是 1.5.2,显然 pybind11 的行为值得怀疑。

构建一个简单的 pybind11 扩展模块如下:

#include <pybind11/pybind11.h>

namespace py = pybind11;

namespace x {

int add(int i, int j) {
  return i + j;
}

PYBIND11_MODULE(x, m) {

  m.def("add", &add, "A function which adds two numbers");

}

}

并在另一个 mgr 模块(alerts,该模块在前面的加载过程中没有出现死锁)中 import 该 pybind11 扩展模块,不出所料,死锁出现了:

(gdb) bt
#0  0x00007fffebf39d12 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00007fffed140ab8 in PyCOND_TIMEDWAIT (cond=0x7fffed50c580 <gil_cond>, mut=0x7fffed50c540 <gil_mutex>, us=5000) at /usr/src/debug/Python-3.6.8/Python/condvar.h:103
#2  take_gil (tstate=tstate@entry=0x5555571d6900) at /usr/src/debug/Python-3.6.8/Python/ceval_gil.h:224
#3  0x00007fffed140bf9 in PyEval_RestoreThread (tstate=tstate@entry=0x5555571d6900) at /usr/src/debug/Python-3.6.8/Python/ceval.c:369
#4  0x00007fffed1d5036 in PyGILState_Ensure () at /usr/src/debug/Python-3.6.8/Python/pystate.c:895
#5  0x00007fffd4e8fc69 in pybind11::detail::get_internals()::gil_scoped_acquire_local::gil_scoped_acquire_local() (this=0x7fffffff7550) at /home/runsisi/build/pyleak/pybind11/include/pybind11/detail/internals.h:253
#6  0x00007fffd4e900b8 in pybind11::detail::get_internals () at /home/runsisi/build/pyleak/pybind11/include/pybind11/detail/internals.h:256
#7  0x00007fffd4e8c852 in x::PyInit_x () at /home/runsisi/build/pyleak/src/x.cc:11
#8  0x00007fffed1d19b0 in _PyImport_LoadDynamicModuleWithSpec (spec=spec@entry=0x7fffd76eb208, fp=fp@entry=0x0) at /usr/src/debug/Python-3.6.8/Python/importdl.c:159
...
#106 0x00007fffed165a76 in PyImport_Import (module_name=module_name@entry=0x7fffd9463228) at /usr/src/debug/Python-3.6.8/Python/import.c:1767
#107 0x00007fffed165bcb in PyImport_ImportModule (name=<optimized out>) at /usr/src/debug/Python-3.6.8/Python/import.c:1269
#108 0x000055555612089b in PyModule::load_subclass_of (this=0x5555571355f0, base_class=0x55555643b6ea "MgrModule", py_class=0x5555571356a8) at /home/runsisi/build/master/src/mgr/PyModule.cc:649
#109 0x000055555611d0d6 in PyModule::load (this=0x5555571355f0, pMainThreadState=0x5555571d6900) at /home/runsisi/build/master/src/mgr/PyModule.cc:335
#110 0x0000555556127ef4 in PyModuleRegistry::init (this=0x7fffffffd8c8) at /home/runsisi/build/master/src/mgr/PyModuleRegistry.cc:86
#111 0x00005555560eddbd in MgrStandby::init (this=0x7fffffffb460) at /home/runsisi/build/master/src/mgr/MgrStandby.cc:184
#112 0x0000555555ebf09c in main (argc=4, argv=0x7fffffffdc88) at /home/runsisi/build/master/src/ceph_mgr.cc:71

结合 pybind11 的宏定义:

#define PYBIND11_MODULE(name, variable)                                        \
    static void PYBIND11_CONCAT(pybind11_init_, name)(pybind11::module &);     \
    PYBIND11_PLUGIN_IMPL(name) {                                               \
        PYBIND11_CHECK_PYTHON_VERSION                                          \
        PYBIND11_ENSURE_INTERNALS_READY                                        \
        auto m = pybind11::module(PYBIND11_TOSTRING(name));                    \
        try {                                                                  \
            PYBIND11_CONCAT(pybind11_init_, name)(m);                          \
            return m.ptr();                                                    \
        } PYBIND11_CATCH_INIT_EXCEPTIONS                                       \
    }                                                                          \
    void PYBIND11_CONCAT(pybind11_init_, name)(pybind11::module &variable)

#define PYBIND11_PLUGIN_IMPL(name) \
    extern "C" PYBIND11_EXPORT PyObject *PyInit_##name();   \
    extern "C" PYBIND11_EXPORT PyObject *PyInit_##name()

#define PYBIND11_ENSURE_INTERNALS_READY \
    pybind11::detail::get_internals();

将前面定义的 pybind11 模块展开,得到如下的代码:

static void pybind11_init_x(pybind11::module &);                               \
extern "C" __attribute__ ((visibility("default"))) PyObject *PyInit_x();       \
extern "C" __attribute__ ((visibility("default"))) PyObject *PyInit_x() {      \
    {                                                                          \
        const char *compiled_ver = "3"                                         \
            "." "PY_MINOR_VERSION";                                            \
        const char *runtime_ver = Py_GetVersion();                             \
        size_t len = std::strlen(compiled_ver);                                \
        if (std::strncmp(runtime_ver, compiled_ver, len) != 0                  \
                || (runtime_ver[len] >= '0' && runtime_ver[len] <= '9')) {     \
            PyErr_Format(PyExc_ImportError,                                    \
                "Python version mismatch: module was compiled for Python %s, " \
                "but the interpreter version is incompatible: %s.",            \
                compiled_ver, runtime_ver);                                    \
            return nullptr;                                                    \
        }                                                                      \
    }                                                                          \
    pybind11::detail::get_internals();                                         \
    auto m = pybind11::module("x");                                            \
    try {                                                                      \
        pybind11_init_x(m);                                                    \
        return m.ptr();                                                        \
    } catch (pybind11::error_already_set &e) {                                 \
        PyErr_SetString(PyExc_ImportError, e.what());                          \
        return nullptr;                                                        \
    } catch (const std::exception &e) {                                        \
        PyErr_SetString(PyExc_ImportError, e.what());                          \
        return nullptr;                                                        \
    }                                                                          \
}                                                                              \
void pybind11_init_x(pybind11::module &m)
    m.def("add", &add, "A function which adds two numbers");

}

结合线程调用栈,正是 PyInit_x 初始化函数调用 pybind11::detail::get_internals(),并最终调用 PyEval_RestoreThread 导致死锁发生:

PYBIND11_NOINLINE inline internals &get_internals() {
    auto **&internals_pp = get_internals_pp();
    if (internals_pp && *internals_pp)
        return **internals_pp;

    // Ensure that the GIL is held since we will need to make Python calls.
    // Cannot use py::gil_scoped_acquire here since that constructor calls get_internals.
    struct gil_scoped_acquire_local {
        gil_scoped_acquire_local() : state (PyGILState_Ensure()) {}
        ~gil_scoped_acquire_local() { PyGILState_Release(state); }
        const PyGILState_STATE state;
    } gil;
    ...
}

Python Thread State & GIL(基于 Python 3.6.8)

接着来看一下 PyGILState_Ensure 的行为:

PyGILState_STATE
PyGILState_Ensure(void)
{
    int current;
    PyThreadState *tcur;

    tcur = (PyThreadState *)PyThread_get_key_value(autoTLSkey);
    current = PyThreadState_IsCurrent(tcur);

    if (current == 0) {
        PyEval_RestoreThread(tcur);
    }

    return current ? PyGILState_LOCKED : PyGILState_UNLOCKED;
}
static int
PyThreadState_IsCurrent(PyThreadState *tstate)
{
    /* Must be the tstate for this thread */
    assert(PyGILState_GetThisThreadState()==tstate);
    return tstate == GET_TSTATE();
}

显然,由于 current == 0,或者说 autoTLSkey TLS 中记录的 PyThreadState 实例(运行 Python 解释器的 OS 线程必须创建对应的 PyThreadState 实例)与当前全局 PyThreadState 实例 _PyThreadState_Current(通过 GET_TSTATE / SET_TSTATE 访问)不一致,导致 PyEval_RestoreThread 被调用。或者说,PyGILState_* 这些 API 基于一个简单的假设,如果当前 OS 线程 TLS 记录的 tstate(即 PyThreadState 实例)与全局 tstate 不一致,则表示当前 OS 线程没有拿到 GIL 锁,因此,PyGILState_Ensure 就会调用 PyEval_RestoreThread 去拿 GIL 锁。

线程 TLS 记录的 tstate 为何会出现与 _PyThreadState_Current 全局变量记录的 tstate 不一致?确实,一种情况是因为该线程确实没有拿 GIL 锁,但还一种情况就是我们这里分析的:在同一 OS 线程里使用多个 Python 解释器。

ceph mgr 进程内嵌的 Python 解释器及各 mgr 模块导入的初始化过程如下:

Py_InitializeEx(0);

// create gil and let the current tstate, i.e., the main tstate takes the gil
PyEval_InitThreads();

pMainThreadState = PyEval_SaveThread();

// load each mgr module in a separate newly created sub-interpreter
for (...) {
    {
        SafeThreadState sts(pMainThreadState);
        Gil gil(sts);   // main tstate (associated with the main interpreter) takes gil

        // create sub-interpreter and returns the sub-tstate associated with the sub-interpreter
        auto thread_state = Py_NewInterpreter();
        pMyThreadState.set(thread_state);
    }
    {
        Gil gil(pMyThreadState);    // sub-tstate (associated with the sub-interpreter) takes gil
        PyImport_ImportModule(...)  // imported mgr modules are associated with the current tstate, i.e., the sub-tstate
    }
}

其中 Gil 只是一个使用 C++ RAII 特性实现的 GIL 管理实例:

Gil::Gil(SafeThreadState &ts, bool new_thread) : pThreadState(ts)
{
  // Acquire the GIL, set the current thread state
  PyEval_RestoreThread(pThreadState.ts);
}

Gil::~Gil()
{
  // Release the GIL, reset the thread state to NULL
  PyEval_SaveThread();
}

显然,为每个 mgr 模块创建了一个 Python 子解释器(每个解释器默认创建并关联一个 PyThreadState 实例),然后为每个子解释器导入各自的 mgr 模块。

那么,主解释器又是什么时候创建并初始化的呢,答案是第一个调用的 Python 接口,Py_InitializeEx

Py_InitializeEx
    _Py_InitializeEx_Private
        interp = PyInterpreterState_New();
        tstate = PyThreadState_New(interp);
        _PyGILState_Init(interp, tstate);
            autoInterpreterState = interp;
            _PyGILState_NoteThreadState(tstate);
                autoTLSkey = PyThread_create_key();
                PyThread_set_key_value(autoTLSkey, (void *)tstate)

在这里,我们看到了线程 TLS 关联的 KV 键值对:autoTLSkey -> tstate,显然,当前线程的 autoTLSkey 与主解释器 tstate 关联,而通过阅读其它相关的代码,可以看到除非结束主解释器,否则这个关联关系不会改变。

至此,同一 OS 线程中多个 Python 解释器共存导致 TLS 记录的 tstate 与 _PyThreadState_Current 全局变量记录的 tstate 不一致的问题就容易解释了。

首先,看一下相关的 Python 接口实现:

PyEval_InitThreads(void)
{
    create_gil();
    take_gil(PyThreadState_GET());
}

PyThreadState *
PyThreadState_Swap(PyThreadState *newts)
{
    PyThreadState *oldts = GET_TSTATE();

    SET_TSTATE(newts);
    return oldts;
}

PyThreadState *
PyEval_SaveThread(void)
{
    PyThreadState *tstate = PyThreadState_Swap(NULL);
    drop_gil(tstate);
    return tstate;
}

void
PyEval_RestoreThread(PyThreadState *tstate)
{
    take_gil(tstate);
    PyThreadState_Swap(tstate);
}

如前所述,TLS 的记录值保持不变,但在切换子解释器时,相应的 _PyThreadState_Current 全局变量记录的 tstate 就会进行切换,如此一来,不一致就是必然。

因此,要解决 PyGILState_Ensure 与子解释器不兼容的问题,就需要每个解释器使用一个独立的 OS 线程。

其实,Python C API 文档是这么说的:

Also note that combining this functionality with PyGILState_*() APIs is delicate, because these APIs assume a bijection between Python thread states and OS-level threads, an assumption broken by the presence of sub-interpreters. It is highly recommended that you don’t switch sub-interpreters between a pair of matching PyGILState_Ensure() and PyGILState_Release() calls. Furthermore, extensions (such as ctypes) using these APIs to allow calling of Python code from non-Python created threads will probably be broken when using sub-interpreters.

通过前面的分析,应该能很容易理解这段话了。

规避方案

如果 Python、pybind11 都不能动,那么就只能在 ceph 侧进行规避了:一种方式就是降级 scipy 的版本;还一种,就是修改 mgr 模块,仅在真正使用时才 import scipy(此时 mgr 模块会有独立的 OS 线程承载),从而避免加载 mgr 模块过程中 import scipy 导致死锁,当然,还一种就是 mgr 模块加载之初就使用多线程,改动稍微麻烦,不过是治本的方法(至少在 Python 没有解决 PyGILState_* API 实现的情况下)。

后记

折腾了一圈之后,发现 ceph 社区其实已经发现这个问题了(参见后面的链接),汗颜,正式的解决方案可能只用静待社区的跟进了。

参考资料

Make the PyGILState API compatible with subinterpreters

https://bugs.python.org/issue15751

Python multi-thread multi-interpreter C API

https://stackoverflow.com/questions/26061298/python-multi-thread-multi-interpreter-c-api

A simple example for using Python sub interpreters and multiple threads.

https://github.com/runsisi/python-sub-interpreters-multiple-threads-example

Sub-interpreters: importing numpy causes hang

https://mail.python.org/pipermail/python-dev/2019-January/156095.html

mgr/diskprediction: diskprediction module fails to initialize with newer SciPy versions

https://tracker.ceph.com/issues/43447


最后修改于 2020-07-31