Contributing Device-Specific Error Reporting to OpenZFS

A kernel-to-userspace patch that replaces a vague zpool create error with one that names the exact device and pool causing the problem. Here’s how it works, from the ioctl layer to the formatted error message.

The problem

If you’ve managed ZFS pools with more than a handful of disks, you’ve almost certainly hit this error:

$ sudo zpool create tank mirror /dev/sda /dev/sdb /dev/sdc /dev/sdd
cannot create 'tank': one or more vdevs refer to the same device,
or one of the devices is part of an active md or lvm device

Which device? What pool? The error gives you nothing. In a 12-disk server you’re left checking each device one by one until you find the culprit.

I’d been working on a previous PR (#18184) improving zpool create error messages when Brian Behlendorf suggested a follow-up: pass device-specific error information from the kernel back to userspace, following the existing ZPOOL_CONFIG_LOAD_INFO pattern that zpool import already uses.

So I built it. The result is PR #18213:

The error message, before and after:

Before: cannot create 'tank': one or more vdevs refer to the same device
After:  cannot create 'tank': device '/dev/sdb1' is part of active pool 'rpool'

Why this is harder than it looks

The obvious approach would be: when zpool create fails, walk the vdev tree, find the device with the error, and report it. But there’s a timing problem in the kernel that makes this impossible.

When spa_create() fails, the error cleanup path calls vdev_close() on all vdevs. This function unconditionally resets vd->vdev_stat.vs_aux to VDEV_AUX_NONE on every device in the tree. By the time the error code reaches the ioctl handler, all evidence of which device failed and why has been wiped clean.

Key insight: the error information must be captured at the exact moment of failure, inside vdev_label_init(), before the cleanup path destroys it. And it must be stored somewhere that survives the cleanup: the spa_t struct, which represents the pool itself.

All that travels back through the ioctl is an integer errno like EBUSY. No context about which device, no pool name, nothing. The entire design challenge is getting two strings (a device path and a pool name) from a kernel function that runs during vdev initialization all the way back to the userspace zpool command.

Architecture: the data flow

The solution follows the same mechanism that zpool import already uses to return rich error information: an nvlist (ZFS’s key-value dictionary, like a JSON object) packed into the ioctl output buffer under a well-known key.

vdev_label_init()        detect conflict, read on-disk label
        |
        v  spa->spa_create_errlist (vdev path + pool name)
spa_create()             hand off errlist to the caller
        |
        v
zfs_ioc_pool_create()    wrap under ZPOOL_CONFIG_CREATE_INFO, put_nvlist()
        |
        v  ioctl boundary (kernel -> user)
zpool_create()           unpack and format the error message

Four touch points, each doing one small thing. Let’s walk through them.

Implementation

1. Capture the error at the moment of failure

This is the heart of the change. Inside vdev_label_init(), when vdev_inuse() returns true, we build an nvlist with the device path, then read the on-disk label to extract the pool name:

module/zfs/vdev_label.c:

/*
 * Determine if the vdev is in use.
 */
if (reason != VDEV_LABEL_REMOVE && reason != VDEV_LABEL_SPLIT &&
    vdev_inuse(vd, crtxg, reason, &spare_guid, &l2cache_guid)) {
        if (spa->spa_create_errlist == NULL) {
                nvlist_t *nv = fnvlist_alloc();
                nvlist_t *cfg;

                if (vd->vdev_path != NULL)
                        fnvlist_add_string(nv,
                            ZPOOL_CREATE_INFO_VDEV, vd->vdev_path);

                cfg = vdev_label_read_config(vd, -1ULL);
                if (cfg != NULL) {
                        const char *pname;
                        if (nvlist_lookup_string(cfg,
                            ZPOOL_CONFIG_POOL_NAME, &pname) == 0)
                                fnvlist_add_string(nv,
                                    ZPOOL_CREATE_INFO_POOL, pname);
                        nvlist_free(cfg);
                }

                spa->spa_create_errlist = nv;
        }
        return (SET_ERROR(EBUSY));
}

The NULL check on spa_create_errlist ensures we only record the first failing device. If there are multiple conflicts, the first one is what you need to fix anyway. fnvlist_alloc() and fnvlist_add_string() are the “fatal” nvlist functions that panic on allocation failure — appropriate here since we’re in a code path where memory should be available.

2. Hand the errlist to the caller

On error, spa_create() transfers ownership of the errlist via the new errinfo output parameter:

module/zfs/spa.c:

if (error != 0) {
        if (errinfo != NULL) {
                *errinfo = spa->spa_create_errlist;
                spa->spa_create_errlist = NULL;
        }
        spa_unload(spa);
        spa_deactivate(spa);
        spa_remove(spa);
        ...

Setting spa_create_errlist to NULL after the handoff prevents spa_deactivate() from freeing it — ownership transfers to the caller.

3. Wrap and pack into the ioctl output

The ioctl handler wraps the errlist under a ZPOOL_CONFIG_CREATE_INFO key, mirroring how zpool import uses ZPOOL_CONFIG_LOAD_INFO:

module/zfs/zfs_ioctl.c:

error = spa_create(zc->zc_name, config, props, zplprops, dcp,
    &errinfo);
if (errinfo != NULL) {
        nvlist_t *outnv = fnvlist_alloc();
        fnvlist_add_nvlist(outnv,
            ZPOOL_CONFIG_CREATE_INFO, errinfo);
        (void) put_nvlist(zc, outnv);
        nvlist_free(outnv);
        nvlist_free(errinfo);
}

put_nvlist() serializes the nvlist into the userspace buffer pointed to by zc->zc_nvlist_dst, copying it across the kernel/user boundary as part of the ioctl's output.

4. Unpack and format in userspace

In libzfs, after the ioctl fails, we unpack the buffer, extract the device and pool name, and format the error:

lib/libzfs/libzfs_pool.c:

nvlist_t *outnv = NULL;
if (zc.zc_nvlist_dst_size > 0 &&
    nvlist_unpack((void *)(uintptr_t)zc.zc_nvlist_dst,
    zc.zc_nvlist_dst_size, &outnv, 0) == 0 &&
    outnv != NULL) {
        nvlist_t *errinfo = NULL;
        if (nvlist_lookup_nvlist(outnv,
            ZPOOL_CONFIG_CREATE_INFO, &errinfo) == 0) {
                const char *vdev = NULL;
                const char *pname = NULL;
                (void) nvlist_lookup_string(errinfo,
                    ZPOOL_CREATE_INFO_VDEV, &vdev);
                (void) nvlist_lookup_string(errinfo,
                    ZPOOL_CREATE_INFO_POOL, &pname);
                if (vdev != NULL) {
                        if (pname != NULL)
                                zfs_error_aux(hdl,
                                    dgettext(TEXT_DOMAIN,
                                    "device '%s' is part of "
                                    "active pool '%s'"),
                                    vdev, pname);
                        else
                                zfs_error_aux(hdl,
                                    dgettext(TEXT_DOMAIN,
                                    "device '%s' is in use"),
                                    vdev);
                        ...
                }
        }
}

If both values are available, you get: device ‘/dev/sdb1’ is part of active pool ‘rpool’. If only the path is available (label can’t be read), you get: device ‘/dev/sdb1’ is in use. If no errinfo came back at all, the existing generic error handling kicks in unchanged.

What changed

File                                          Changes
module/zfs/vdev_label.c                       +23 -1
lib/libzfs/libzfs_pool.c                      +41
module/zfs/zfs_ioctl.c                        +12 -1
module/zfs/spa.c                              +10 -1
cmd/ztest.c                                   +5 -5
include/sys/fs/zfs.h                          +3
include/sys/spa.h                             +1 -1
include/sys/spa_impl.h                        +1
tests/.../zpool_create_errinfo_001_neg.ksh    +99
11 files total                                +195 -10

93 lines of feature code across 8 C files, plus a 99-line ZTS test. The cmd/ztest.c changes are mechanical — just adding a NULL parameter to each spa_create() call to match the new signature.

Testing

I tested on an Arch Linux VM running kernel 6.18.9-arch1-2 with ZFS built from source. The test environment used loopback devices, which is the standard approach in the ZFS Test Suite — the kernel code path is identical regardless of the underlying block device.

Duplicate device — device-specific error

$ truncate -s 128M /tmp/vdev1
$ sudo losetup /dev/loop10 /tmp/vdev1
$ sudo losetup /dev/loop12 /tmp/vdev1   # same backing file
$ sudo zpool create testpool1 mirror /dev/loop10 /dev/loop12
cannot create 'testpool1': device '/dev/loop12' is part of active pool 'testpool1'

Normal creation — no regression

$ truncate -s 128M /tmp/vdev1 /tmp/vdev2
$ sudo zpool create testpool1 mirror /tmp/vdev1 /tmp/vdev2
$ sudo zpool status testpool1
  pool: testpool1
 state: ONLINE
config:

        NAME            STATE     READ WRITE CKSUM
        testpool1       ONLINE       0     0     0
          mirror-0      ONLINE       0     0     0
            /tmp/vdev1  ONLINE       0     0     0
            /tmp/vdev2  ONLINE       0     0     0

ZTS test

A new negative test (zpool_create_errinfo_001_neg) creates two loopback devices backed by the same file and attempts a mirror pool creation. It verifies three things: the command fails, the error names the specific device, and the error mentions the active pool.

$ zfs-tests.sh -vx -t cli_root/zpool_create/zpool_create_errinfo_001_neg

Test: zpool_create_errinfo_001_neg (run as root) [00:00] [PASS]

Results Summary
PASS       1
Running Time:  00:00:00
Percent passed: 100.0%

CI checkstyle passes on all platforms (Ubuntu 22/24, Debian 12/13, CentOS Stream 9, AlmaLinux 8/10, FreeBSD 14). Clean build with no compiler warnings.

Design trade-offs

Only the first failing device is recorded. If multiple vdevs conflict, only the first one goes into spa_create_errlist. You need to fix the first problem before you can see the next one anyway, and it keeps the implementation simple.

The label is read twice. vdev_inuse() already reads the on-disk label and frees it before returning. We read it again with vdev_label_read_config() to extract the pool name. Modifying vdev_inuse() to optionally return the label would avoid this, but changing that function signature affects many callers — a much larger change for a follow-up.

The errlist field lives on spa_t permanently. It’s only used during spa_create(), but the field exists on every pool in memory. This costs 8 bytes per pool (one pointer, always NULL during normal operation) — negligible.

Only one error path is covered. The mechanism only fires for the vdev_inuse() EBUSY case inside vdev_label_init(). Other failures (open errors, size mismatches) still produce generic messages. The spa_create_errlist infrastructure is there for future extension.

What’s next

This is a focused first step. The spa_create_errlist mechanism could be extended to cover more error paths — vdev_open() failures, size mismatches, GUID conflicts. The infrastructure is in place; it just needs more callsites.

The PR is at openzfs/zfs #18213. Feedback welcome.

  • openzfs
  • zfs
  • kernel
  • c
  • linux
  • storage
  • open-source
  • nvlist
  • ioctl
