runsisi's

technical notes

Reading the qemu (rbd) source code

2019-03-01 runsisi #openstack #qemu

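The code read here is qemu v2.5.0. qemu evolves as quickly as ceph does: a quick look at the latest master branch shows the code has changed almost beyond recognition, although the basic ideas have not changed much.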

The qemu event framework (main loop, iothread) is built on glib's event loop mechanism:

aio_context_new
  g_source_new
  event_notifier_init
  aio_set_event_notifier
    aio_set_fd_handler
      g_source_add_poll
  ctx->notify_dummy_bh = aio_bh_new(ctx, notify_dummy_bh, NULL)
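
The pattern in the call tree above (an event notifier registered as an ordinary fd handler, so that any code can wake the loop) can be sketched without glib. What follows is a minimal model, assuming a POSIX pipe in place of the event notifier and plain poll() in place of the GSource machinery. The names loop_ctx, loop_add_fd, loop_notify and loop_poll are made up for this sketch and are not qemu APIs.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_HANDLERS 8

/* Hypothetical, simplified stand-in for an AioContext; not qemu's type. */
typedef void (*fd_handler_fn)(void *opaque);

struct loop_ctx {
    struct pollfd fds[MAX_HANDLERS];
    fd_handler_fn handlers[MAX_HANDLERS];
    void *opaque[MAX_HANDLERS];
    int nfds;
    int notify_rd, notify_wr;   /* pipe standing in for the event notifier */
};

/* Register a read handler for fd (the role aio_set_fd_handler plays). */
static int loop_add_fd(struct loop_ctx *ctx, int fd, fd_handler_fn fn, void *opaque)
{
    assert(fn != NULL);
    if (ctx->nfds == MAX_HANDLERS) {
        return -1;
    }
    ctx->fds[ctx->nfds].fd = fd;
    ctx->fds[ctx->nfds].events = POLLIN;
    ctx->fds[ctx->nfds].revents = 0;
    ctx->handlers[ctx->nfds] = fn;
    ctx->opaque[ctx->nfds] = opaque;
    ctx->nfds++;
    return 0;
}

/* Handler for the notifier itself: consume the pending wake-up bytes. */
static void notifier_drain(void *opaque)
{
    struct loop_ctx *ctx = opaque;
    char buf[16];
    ssize_t n = read(ctx->notify_rd, buf, sizeof(buf));
    (void)n;
}

/* Create the context and register the notifier like any other fd. */
static int loop_init(struct loop_ctx *ctx)
{
    int p[2];
    ctx->nfds = 0;
    if (pipe(p) < 0) {
        return -1;
    }
    ctx->notify_rd = p[0];
    ctx->notify_wr = p[1];
    return loop_add_fd(ctx, ctx->notify_rd, notifier_drain, ctx);
}

/* Wake the loop up from anywhere. */
static void loop_notify(struct loop_ctx *ctx)
{
    char c = 1;
    ssize_t n = write(ctx->notify_wr, &c, 1);
    (void)n;
}

/* One loop iteration: poll, then dispatch; returns the number of handlers run. */
static int loop_poll(struct loop_ctx *ctx, int timeout_ms)
{
    int ran = 0;
    if (poll(ctx->fds, (nfds_t)ctx->nfds, timeout_ms) <= 0) {
        return 0;
    }
    for (int i = 0; i < ctx->nfds; i++) {
        if (ctx->fds[i].revents & POLLIN) {
            ctx->handlers[i](ctx->opaque[i]);
            ran++;
        }
    }
    return ran;
}
```

Calling loop_notify() before loop_poll() makes the next iteration return immediately and run the drain handler, which is the whole point of having a notifier inside the poll set.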

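When qemu starts up, it reads the user-specified -drive options to create the hard disk drives, plus the default drives, including the cdrom (PS: the most important file in the qemu source is vl.c; it is called vl because qemu's earliest name was virtual linux and its command-line program was vl, not today's qemu):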

vl.c/main
  qemu_opts_foreach(qemu_find_opts("drive"), drive_init_func, ...)
    drive_new
      qemu_opt_get(all_opts, "cache")
      bdrv_parse_cache_flags
      blockdev_init
        extract_common_blockdev_options  // get the open flags, which are set from the values parsed by bdrv_parse_cache_flags
        blk_new_open
          blk_new_with_bs
            blk_new
            bdrv_new_root
              bdrv_new
          bdrv_open
            bdrv_open_inherit
              bdrv_open_child  // open the real backend device, e.g. an rbd image
                bdrv_open_inherit
                  bdrv_open_common
                    bs->drv = drv  // here drv is bdrv_rbd, and bs is the backend device's bs, i.e. the bs->file->bs below
                    drv->bdrv_file_open  // for an rbd drive, drv is bdrv_rbd; other common drivers include bdrv_file, bdrv_qcow2, etc.
                bdrv_attach_child
              bdrv_open_common
                bs->drv = drv  // here drv is bdrv_raw
                bs->file = file  // i.e. the backend device opened in bdrv_open_child

  default_drive(cdrom)
    drive_new
  default_drive(floppy)
  default_drive(sdcard)
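
The result of the bdrv_open chain above is a small two-node graph: a format node on top (bdrv_raw) whose file pointer refers to the protocol node underneath (bdrv_rbd). Here is a rough model of that open order. The struct and function names are invented for illustration, and the real BlockDriverState and BdrvChild carry far more state.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified models of a block driver and a block node. */
struct driver {
    const char *name;
};

struct node {
    const struct driver *drv;   /* driver that handles requests on this node */
    struct node *file;          /* underlying node, NULL for a protocol node */
};

static const struct driver drv_raw = { "raw" };
static const struct driver drv_rbd = { "rbd" };

/* Inner step (the role of bdrv_open_child): open the protocol node. */
static void open_child(struct node *child, const struct driver *protocol)
{
    assert(child != NULL && protocol != NULL);
    child->drv = protocol;
    child->file = NULL;
}

/* Outer step: the child is opened first, then the format driver is
 * attached on top and pointed at it. */
static void open_node(struct node *top, struct node *child,
                      const struct driver *format,
                      const struct driver *protocol)
{
    assert(top != NULL);
    open_child(child, protocol);
    top->drv = format;
    top->file = child;
}
```

With open_node(&top, &child, &drv_raw, &drv_rbd), top.drv is the raw driver and top.file->drv is the rbd driver, mirroring the two bs->drv assignments in the tree.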

The program's main loop:

vl.c/main
  main_loop
  do {
    main_loop_wait(false)
      os_host_main_loop_wait
        glib_pollfds_fill
        qemu_poll_ns
        glib_pollfds_poll
      qemu_clock_run_all_timers
  } while (!main_loop_should_exit())
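
Each iteration above blocks in poll for at most the time until the nearest timer, then runs whatever timers have expired. The sketch below models only that timer half, under some loud assumptions: there are no file descriptors at all, the clock is a virtual counter advanced by hand instead of a real poll timeout, and the loop exits once no timer is armed (qemu's real exit condition is main_loop_should_exit()). All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-shot timer; not qemu's QEMUTimer. */
struct timer {
    int64_t expire_ns;
    void (*cb)(void *opaque);
    void *opaque;
    int armed;
};

/* Time until the nearest armed timer, or -1 if none is armed. */
static int64_t next_deadline_ns(const struct timer *t, size_t n, int64_t now)
{
    int64_t best = -1;
    for (size_t i = 0; i < n; i++) {
        if (!t[i].armed) {
            continue;
        }
        int64_t d = t[i].expire_ns > now ? t[i].expire_ns - now : 0;
        if (best < 0 || d < best) {
            best = d;
        }
    }
    return best;
}

/* Fire every armed timer whose deadline has passed; returns how many fired. */
static int run_all_timers(struct timer *t, size_t n, int64_t now)
{
    int fired = 0;
    for (size_t i = 0; i < n; i++) {
        if (t[i].armed && t[i].expire_ns <= now) {
            t[i].armed = 0;
            t[i].cb(t[i].opaque);
            fired++;
        }
    }
    return fired;
}

/* Loop model: "sleep" until the next deadline, then run timers. */
static int main_loop_model(struct timer *t, size_t n, int64_t *now)
{
    int iterations = 0;
    int64_t timeout;
    assert(now != NULL);
    while ((timeout = next_deadline_ns(t, n, *now)) >= 0) {
        *now += timeout;              /* stands in for blocking in poll */
        run_all_timers(t, n, *now);
        iterations++;
    }
    return iterations;
}

/* Example callback: count how many times it was invoked. */
static void count_cb(void *opaque)
{
    (*(int *)opaque)++;
}
```

Two timers armed at 100 ns and 300 ns take two iterations of this model, with the virtual clock ending at 300.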

If an iothread option is given when qemu starts, the -object options are iterated before the -drive options, and the corresponding iothread is created:

vl.c/main
  qemu_opts_foreach(qemu_find_opts("object"), object_create, ...)
    object_add("iothread", ...)
      object_new("iothread")
      user_creatable_complete
        iothread_complete
          iothread->ctx = aio_context_new(...)
          qemu_thread_create

From then on, all disk-I/O-related operations run in the iothread's thread function, iothread_run:

iothread.c/iothread_run
  while (!iothread->stopping) {
    aio_poll(iothread->ctx, ...)
  }

Block driver registration:

rbd.c/bdrv_rbd_init
  bdrv_register(&bdrv_rbd)
    bdrv_setup_io_funcs
      /* Block drivers without coroutine functions need emulation */
      if (!bdrv->bdrv_co_readv) {
        bdrv->bdrv_co_readv = bdrv_co_readv_em;
        bdrv->bdrv_co_writev = bdrv_co_writev_em;

        /* bdrv_co_readv_em()/bdrv_co_writev_em() work in terms of aio, so if
         * the block driver lacks aio we need to emulate that too.
         */
        if (!bdrv->bdrv_aio_readv) { // not needed for rbd
          /* add AIO emulation layer */
          bdrv->bdrv_aio_readv = bdrv_aio_readv_em;
          bdrv->bdrv_aio_writev = bdrv_aio_writev_em;
        }
      }
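
The fallback logic above is just a function-pointer table being patched at registration time. A compact model of the read side follows (the write side would be symmetric). It assumes a made-up blk_driver struct with three entry points, and the two "native" functions at the bottom are placeholders that return a recognizable number rather than doing any I/O.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical driver table with three levels of read entry points. */
struct blk_driver {
    const char *name;
    int (*co_readv)(const struct blk_driver *d, int sector);   /* coroutine style */
    int (*aio_readv)(const struct blk_driver *d, int sector);  /* callback style  */
    int (*read)(const struct blk_driver *d, int sector);       /* synchronous     */
};

/* Emulated coroutine read: expressed in terms of the AIO entry point. */
static int co_readv_em(const struct blk_driver *d, int sector)
{
    assert(d->aio_readv != NULL);
    return d->aio_readv(d, sector);
}

/* Emulated AIO read: expressed in terms of the synchronous entry point. */
static int aio_readv_em(const struct blk_driver *d, int sector)
{
    assert(d->read != NULL);
    return d->read(d, sector);
}

/* Fill in whatever the driver did not provide, as done at registration. */
static void setup_io_funcs(struct blk_driver *d)
{
    if (!d->co_readv) {
        d->co_readv = co_readv_em;
        if (!d->aio_readv) {
            d->aio_readv = aio_readv_em;
        }
    }
}

/* Placeholder native entry points, for illustration only. */
static int rbd_like_aio_readv(const struct blk_driver *d, int sector)
{
    (void)d;
    return sector + 1000;
}

static int sync_only_read(const struct blk_driver *d, int sector)
{
    (void)d;
    return sector + 2000;
}
```

A driver that only supplies aio_readv (the rbd case) ends up with the emulated co_readv and keeps its own aio_readv, while a driver that only supplies read gets both emulation layers.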

I/O call chain:

blk_aio_writev
  bdrv_aio_writev
    bdrv_co_aio_rw_vector
      co = qemu_coroutine_create(bdrv_co_do_rw)
      qemu_coroutine_enter(co, acb)
      bdrv_co_maybe_schedule_bh(acb)

bdrv_co_do_rw
  bdrv_co_do_writev
    bdrv_co_do_pwritev
      bdrv_aligned_pwritev
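        drv->bdrv_co_writev // i.e. raw_co_writev; drv is the bdrv_raw driver, which follows from the handling in bdrv_open
          bdrv_co_writev(bs->file->bs, ...) // only here is the backend device's I/O actually invoked; the extra layer exists because
                                            // a backend that stores data in a non-raw format needs an intermediate translation step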
            bdrv_co_do_writev
              bdrv_co_do_pwritev
                bdrv_aligned_pwritev
                  drv->bdrv_co_writev // i.e. bdrv_co_writev_em; drv is the bdrv_rbd driver, which provides no bdrv_co_readv/bdrv_co_writev implementation,
                                      // so the emulated interfaces were filled in at registration; see bdrv_register(&bdrv_rbd) -> bdrv_setup_io_funcs
                    bdrv_co_io_em
                      qemu_rbd_aio_writev

The read path is the same as the write path, with write replaced by read.
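
To make the reason for the two-level dispatch concrete, here is a toy model in which a write enters the format node and is forwarded to bs->file. The raw-like driver passes the sector through unchanged, while a made-up "header format" driver shifts it by a fixed data_start offset before forwarding. That shift is only an illustration of the kind of translation a non-raw format performs; it is not how qcow2 or any real format maps sectors, and every name below is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bnode;

/* Hypothetical driver: one write entry point returning the sector it hit. */
struct bdriver {
    const char *name;
    int64_t (*co_writev)(struct bnode *bs, int64_t sector);
};

struct bnode {
    const struct bdriver *drv;
    struct bnode *file;      /* underlying node, NULL for the protocol node */
    int64_t data_start;      /* only used by the made-up header format */
    int64_t last_sector;     /* recorded by the protocol node for inspection */
};

/* Generic entry: dispatch to whatever driver the node has. */
static int64_t node_writev(struct bnode *bs, int64_t sector)
{
    return bs->drv->co_writev(bs, sector);
}

/* Raw-like format: pure pass-through to the underlying node. */
static int64_t rawlike_writev(struct bnode *bs, int64_t sector)
{
    assert(bs->file != NULL);
    return node_writev(bs->file, sector);
}

/* Made-up header format: guest sectors live after a fixed-size header. */
static int64_t hdrfmt_writev(struct bnode *bs, int64_t sector)
{
    assert(bs->file != NULL);
    return node_writev(bs->file, sector + bs->data_start);
}

/* Protocol node: where the backend I/O would actually be issued. */
static int64_t proto_writev(struct bnode *bs, int64_t sector)
{
    bs->last_sector = sector;
    return sector;
}

static const struct bdriver drv_rawlike = { "raw-like", rawlike_writev };
static const struct bdriver drv_hdrfmt  = { "hdr-format", hdrfmt_writev };
static const struct bdriver drv_proto   = { "protocol", proto_writev };
```

The caller always goes through node_writev() on the top node and never needs to know which format sits in between, which is what the nested bdrv_co_do_writev calls in the chain achieve.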

rbd I/O completion callback handling:

rbd_finish_aiocb
  qemu_bh_schedule
    aio_notify
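
The chain above hands the completion over to the event loop instead of finishing the request inside the storage library's callback. A simplified, single-threaded model of that hand-off follows. It assumes C11 atomics for the scheduled flag and a plain counter in place of the real notifier; the mini_* names, finish_cb and complete_request are invented for the sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical context: just counts how often the loop was woken up. */
struct mini_ctx {
    atomic_int notifications;
};

/* Hypothetical bottom half: deferred work to be run by the event loop. */
struct mini_bh {
    struct mini_ctx *ctx;
    void (*cb)(void *opaque);
    void *opaque;
    atomic_int scheduled;
};

/* Stand-in for waking the loop (the role aio_notify plays). */
static void mini_notify(struct mini_ctx *ctx)
{
    atomic_fetch_add(&ctx->notifications, 1);
}

/* Mark the bottom half as pending; wake the loop only on the 0 -> 1 edge. */
static void mini_bh_schedule(struct mini_bh *bh)
{
    assert(bh->cb != NULL);
    if (atomic_exchange(&bh->scheduled, 1) == 0) {
        mini_notify(bh->ctx);
    }
}

/* Loop side: run every pending bottom half once; returns how many ran. */
static int mini_bh_poll(struct mini_bh *bhs, size_t n)
{
    int ran = 0;
    for (size_t i = 0; i < n; i++) {
        if (atomic_exchange(&bhs[i].scheduled, 0)) {
            bhs[i].cb(bhs[i].opaque);
            ran++;
        }
    }
    return ran;
}

/* What a storage library's completion callback would do: defer, not finish. */
static void finish_cb(void *arg)
{
    mini_bh_schedule(arg);
}

/* The deferred work itself: mark the request as done. */
static void complete_request(void *opaque)
{
    *(int *)opaque = 1;
}
```

Calling finish_cb() twice before the loop runs produces a single notification and a single run of complete_request(), and the request is only marked done once mini_bh_poll() executes in the loop's context.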

References

The Main Event Loop

https://developer.gnome.org/glib/stable/glib-The-Main-Event-Loop.html

Understanding QEMU devices

https://www.qemu.org/2018/02/09/understanding-qemu-devices/

Improving the QEMU Event Loop

http://events17.linuxfoundation.org/sites/events/files/slides/Improving%20the%20QEMU%20Event%20Loop%20-%203.pdf

Towards Multi-threaded Device Emulation in QEMU

https://www.linux-kvm.org/images/a/a7/02x04-MultithreadedDevices.pdf

multiple-iothreads.txt

https://github.com/qemu/qemu/blob/master/docs/devel/multiple-iothreads.txt

Live Block Device Operations in QEMU

https://archive.fosdem.org/2018/schedule/event/vai_qemu_live_dev_operations/attachments/slides/2391/export/events/attachments/vai_qemu_live_dev_operations/slides/2391/Live_Block_Device_Operations_in_QEMU_FOSDEM2018.pdf