Nailgun agent unreliably identifies disks, causes inconsistency in volume manager

Bug #1477604 reported by Oleg S. Gelbukh
Affects: Fuel for OpenStack
Status: Fix Released
Importance: High
Assigned to: Alexander Gordeev
Milestone:

Bug Description

Nailgun agent uses the '/dev/disk/by-path/' entry as the ID for a disk. However, due to a known issue in udev [1] [2], this way of identifying a disk is unreliable. We have hit this multiple times, on both hardware and virtualized systems.

This issue leads to an inconsistent disk configuration on a node when it reboots from bootstrap into the operating system after deployment. This, in turn, can lead to issues with volume creation and representation [3 - see related bugs].

On the other hand, the disk configuration reported by nailgun agent contains an 'extra' parameter, which seems more reliable (it lists the '/dev/disk/by-id/' entries for the device).

I propose refactoring the volume manager to use the 'extra' field as a more reliable identifier for the disk.

[1] https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1193705
[2] https://bugs.launchpad.net/ubuntu/+source/udev/+bug/1166326
[3] https://blueprints.launchpad.net/fuel/+spec/volume-manager-refactoring
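
As an illustration of the proposal above, here is a minimal sketch of matching a stored volume record to a disk through 'extra' entries rather than through the by-path ID. It is not the Nailgun implementation: the function name find_disk_by_extra, the dict layout ('name', 'disk', 'extra') and the sample by-id strings are assumptions made for the example only.

```python
from typing import Any, Dict, Iterable, List, Optional

Disk = Dict[str, Any]


def find_disk_by_extra(disks: List[Disk], extra_ids: Iterable[str]) -> Optional[Disk]:
    """Return the first disk that shares at least one 'extra' entry with extra_ids.

    Assumption: each disk dict carries an 'extra' list holding its
    /dev/disk/by-id/ entries, as described in the bug report. The exact
    layout of the real node metadata may differ.
    """
    wanted = set(extra_ids)
    for disk in disks:
        # A disk without 'extra' (or with an empty one) can never match.
        if wanted & set(disk.get("extra") or []):
            return disk
    return None


# Hypothetical disk list reported after the reboot: the kernel names and
# by-path IDs have moved, but the by-id entries still follow the device.
disks_after_reboot = [
    {"name": "sdb", "disk": "disk/by-path/example-path-0",
     "extra": ["disk/by-id/example-disk-A"]},
    {"name": "sda", "disk": "disk/by-path/example-path-1",
     "extra": ["disk/by-id/example-disk-B"]},
]

# The volume record was saved for the disk known as example-disk-A.
matched = find_disk_by_extra(disks_after_reboot, ["disk/by-id/example-disk-A"])
```

The lookup uses a set intersection, so a disk still matches if only some of its by-id entries are reported, and a missing match returns None instead of raising.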

Oleg S. Gelbukh (gelbuhos) wrote :

An example of unreliable disk identification via '/dev/disk/by-path/':

  lrwxrwxrwx 1 root root 9 Jul 8 16:35 /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0 -> ../../sdb
  lrwxrwxrwx 1 root root 10 Jul 8 16:32 /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0-part1 -> ../../sdb1
  lrwxrwxrwx 1 root root 10 Jul 8 16:32 /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0-part2 -> ../../sdb2
  lrwxrwxrwx 1 root root 10 Jul 8 16:32 /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0-part3 -> ../../sda3
  lrwxrwxrwx 1 root root 10 Jul 8 16:32 /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0-part4 -> ../../sda4
  lrwxrwxrwx 1 root root 10 Jul 8 16:32 /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0-part5 -> ../../sda5

In this case, /dev/sda was initially identified by /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0, but after the reboot that path was taken over by /dev/sdb, while the -part3, -part4 and -part5 links still point to partitions on /dev/sda.
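
A mismatch like the one in the listing above can be detected mechanically. The following is a rough sketch, not part of nailgun agent: the helper names are hypothetical, and the parent-disk rule assumes sdX-style device names (it would not handle names such as nvme0n1p1).

```python
import re
from typing import Dict, List

PART_SUFFIX = re.compile(r"-part\d+$")


def parse_links(listing: str) -> Dict[str, str]:
    """Map each symlink path in 'ls -l' style output to its target device name."""
    links = {}
    for line in listing.splitlines():
        if " -> " not in line:
            continue
        left, target = line.rsplit(" -> ", 1)
        link = left.split()[-1]                      # last field before the arrow
        links[link] = target.strip().rsplit("/", 1)[-1]  # ../../sdb1 becomes sdb1
    return links


def parent_disk(device: str) -> str:
    """Strip the trailing partition number (assumes sdX naming)."""
    return re.sub(r"\d+$", "", device)


def find_inconsistent_partitions(links: Dict[str, str]) -> List[str]:
    """Return partition links whose target sits on a different disk than the base link."""
    problems = []
    for link, device in sorted(links.items()):
        if not PART_SUFFIX.search(link):
            continue
        base = PART_SUFFIX.sub("", link)
        if base in links and parent_disk(device) != links[base]:
            problems.append(link)
    return problems
```

Feeding the listing above to parse_links and then to find_inconsistent_partitions flags the -part3, -part4 and -part5 links, because they resolve to partitions of sda while the base link resolves to sdb.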

description: updated
tags: added: feature
Changed in fuel:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Fuel Python Team (fuel-python)
milestone: none → 7.0
tags: added: blocked-by-bp feature-image-based ibp
tags: added: module-nailgun
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/205559

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Oleg S. Gelbukh (gelbuhos)
status: Confirmed → In Progress
Changed in fuel:
assignee: Oleg S. Gelbukh (gelbuhos) → Aleksandr Gordeev (a-gordeev)
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/205559
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=2abdfc50f021164d0ba33e61882f590c379623b3
Submitter: Jenkins
Branch: master

commit 2abdfc50f021164d0ba33e61882f590c379623b3
Author: Oleg Gelbukh <email address hidden>
Date: Fri Jul 24 12:00:54 2015 +0000

    Identify disk for volume by 'extra' field

    Nailgun stores disk and volumes data separately: disk data in node's 'meta'
    json blob, and volumes data in attributes for the node. User can't access the
    attributes directly via API. Only node metadata is accessible.

    In combination with known udev issue that breaks identification of devices by
    PCI path [1], it can create a situation when disk identified differently in
    volumes and in disk dicts.

    This happens when during reboot from bootstrap to operating system after
    provisioning udev changes the by-path linking of disks. The system ends up in a
    situation like this:

      /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0 -> ../../sdb
      /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0-part1 -> ../../sdb1
      /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0-part2 -> ../../sdb2
      /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0-part3 -> ../../sda3
      /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0-part4 -> ../../sda4
      /dev/disk/by-path/pci-0000:00:01.1-scsi-0:0:0:0-part5 -> ../../sda5

    This, in turn, causes KeyError exception when user tries to restore the disk
    data on the node for reinstallation, or after moving to another environment.

    Change-Id: I26bc2fc3cc01c3356710aac98a69c3c1a653c741
    Closes-bug: 1477604

Changed in fuel:
status: In Progress → Fix Committed
tags: added: on-verification
Alexander Bochkarev (abochkarev) wrote :

Verified on 301 ISO.
VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "7.0"
  openstack_version: "2015.1.0-7.0"
  api: "1.0"
  build_number: "301"
  build_id: "301"
  nailgun_sha: "4162b0c15adb425b37608c787944d1983f543aa8"
  python-fuelclient_sha: "486bde57cda1badb68f915f66c61b544108606f3"
  fuel-agent_sha: "50e90af6e3d560e9085ff71d2950cfbcca91af67"
  fuel-nailgun-agent_sha: "d7027952870a35db8dc52f185bb1158cdd3d1ebd"
  astute_sha: "6c5b73f93e24cc781c809db9159927655ced5012"
  fuel-library_sha: "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd"
  fuel-ostf_sha: "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c"
  fuelmain_sha: "a65d453215edb0284a2e4761be7a156bb5627677"

Changed in fuel:
status: Fix Committed → Fix Released
tags: removed: on-verification