Partition deleted immediately after being created

Bug #1790159 reported by Daniel Badea
This bug affects 1 person
Affects    Status        Importance  Assigned to   Milestone
StarlingX  Fix Released  Medium      Daniel Badea

Bug Description

Brief Description
-----------------
When installing a multi-node system without dedicated storage, the step that adds a new partition on controller-0 fails because the partition is created and then immediately deleted. The failure is not 100% reproducible.

Severity
--------
Minor: running the command to create the partition again is successful.

Steps to Reproduce
------------------
On unlocked controller-0 run:

  system host-disk-partition-add controller-0 /dev/sdb 10

as soon as possible while the sysinv-agent service is still starting (watch /var/log/sysinv.log to time it).

Expected Behavior
------------------
Run:

  system host-disk-partition-list controller-0

The output should include the new partition.

Actual Behavior
----------------
The new partition is not listed.

Reproducibility
---------------
Intermittent. No frequency data.

System Configuration
--------------------
Multi-node system

Branch/Pull Time/Commit
-----------------------

Timestamp/Logs
--------------
Most likely there is a race between manage-partitions and the inventory agent's update, after which the conductor decides to delete the partition.

2018-08-21 15:57:22.503 121506 INFO manage-partitions [-] Executing command: 'udevadm settle -E /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0'
2018-08-21 15:57:22.506 121506 INFO manage-partitions [-] Executing command: 'parted -s /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0 unit mib mkpart primary 1 81921'
2018-08-21 15:57:22.621 121506 INFO manage-partitions [-] Executing command: 'sgdisk --typecode=1:ba5eba11-0000-1111-2222-000000000001 --change-name=1:LVM Physical Volume /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0'
2018-08-21 15:57:23.760 121506 INFO manage-partitions [-] Executing command: 'ls /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0-part1'
2018-08-21 15:57:24.064 121506 INFO manage-partitions [-] Executing command: 'ls /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0-part1'
2018-08-21 15:57:24.368 121506 INFO manage-partitions [-] Executing command: 'ls /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0-part1'
2018-08-21 15:57:24.371 121506 INFO manage-partitions [-] Executing command: 'sgdisk -i 1 /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0'
2018-08-21 15:57:24.477 121506 INFO manage-partitions [-] Executing command: 'udevadm settle -E /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0'
2018-08-21 15:57:24.482 121506 INFO manage-partitions [-] Executing pipe command: 'sgdisk -p /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0'
2018-08-21 15:57:24.588 121506 INFO manage-partitions [-] Executing command: 'blockdev --getss /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0'
...
2018-08-21 15:57:25.505 81085 INFO sysinv.conductor.manager [req-49c1f3b5-e467-444b-9947-054f1f9b05b6 None None] Deleting missing partition 8043c833-334e-4f49-910d-cbe4e03df4af - /dev/disk/by-path/pci-0000:00:1f.2-ata-2.0-part1

description: updated
Changed in starlingx:
assignee: nobody → Daniel Badea (daniel.badea)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/599002

Changed in starlingx:
status: New → In Progress
Ghada Khalil (gkhalil)
tags: added: stx.config
tags: added: stx.2018.10
Changed in starlingx:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/599002
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=ca5505bee9f0ad057078b4743e31f7fb8b801a29
Submitter: Zuul
Branch: master

commit ca5505bee9f0ad057078b4743e31f7fb8b801a29
Author: Daniel Badea <email address hidden>
Date: Fri Aug 31 14:37:49 2018 +0000

    partition deleted immediately after creation

    There is a race between agent_audit and applying partitions manifest:
    1. agent_audit starts to collect partition information for the current node
    2. at the same time there is a request to apply partitions manifest for a
       new configuration; exec. call for running puppet is not eventlet friendly
       and blocks the agent including the in-progress audit
    3. partitions are created and an update is sent back to conductor that sets
       new partition status "Ready"
    4. agent audit resumes execution and sends an update with the partition info
       status collected before partitions manifest ran (without the new partition)
    5. conductor doesn't find new partition in status update and removes it
       from the database

    Add lock around agent_audit() and config_apply_runtime_manifests() to
    prevent them from running both at the same time.

    Add lock in manage_partitions's run() and agent's _update_disk_partitions()
    to prevent agent from running in case partitions manifest is applied by
    an external trigger (not by sysinv-agent).

    Refactor common ipartition_update_by_ihost( ..., ipartition_get()) code.

    Closes-Bug: #1790159
    Change-Id: I0c730f9c249810d7eea5e3192c819c498fe30602
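
The race and the effect of the lock described in this commit message can be illustrated with a small simulation. This is only a sketch, not the sysinv code: it uses plain threads instead of eventlet, the function name simulate(), the sets disk and database, and the partition name "part1" are all hypothetical, and the audit stall is modelled with a short timed wait.

```python
import contextlib
import threading


def simulate(use_lock):
    """Return the conductor's partition set after one audit/apply overlap."""
    disk = set()        # partitions present on the node (hypothetical model)
    database = set()    # partitions the conductor believes exist
    # With use_lock=False a no-op context manager stands in for the lock.
    lock = threading.Lock() if use_lock else contextlib.nullcontext()
    snapshot_taken = threading.Event()
    manifest_done = threading.Event()

    def agent_audit():
        with lock:
            snapshot = set(disk)             # 1. collect partition info
            snapshot_taken.set()
            manifest_done.wait(timeout=0.2)  # 2. audit is stalled meanwhile
            # 4-5. the (possibly stale) update reaches the conductor, which
            # drops every partition that the update does not mention
            database.intersection_update(snapshot)

    def config_apply_runtime_manifests():
        snapshot_taken.wait()                # start only once the audit is underway
        with lock:
            disk.add("part1")                # 3. partition is created ...
            database.add("part1")            #    ... and reported "Ready"
        manifest_done.set()

    threads = [threading.Thread(target=agent_audit),
               threading.Thread(target=config_apply_runtime_manifests)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return database


print(simulate(use_lock=False))  # stale audit removes the new partition
print(simulate(use_lock=True))   # audit finishes first, partition survives
```

Without the lock the audit's stale snapshot arrives after the partition was reported, so the partition is removed. With the lock the manifest cannot start until the audit has sent its update, so the new partition stays in the database.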

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil) wrote :

This is still failing in Wind River labs; re-opening the bug

Changed in starlingx:
status: Fix Released → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to stx-config (master)

Fix proposed to branch: master
Review: https://review.openstack.org/602133

OpenStack Infra (hudson-openstack) wrote : Fix merged to stx-config (master)

Reviewed: https://review.openstack.org/602133
Committed: https://git.openstack.org/cgit/openstack/stx-config/commit/?id=d2dcb9882c1bcfe1c7eea1e1b0a45ef3fab633eb
Submitter: Zuul
Branch: master

commit d2dcb9882c1bcfe1c7eea1e1b0a45ef3fab633eb
Author: Daniel Badea <email address hidden>
Date: Wed Sep 12 14:36:22 2018 +0000

    apply runtime manifest deadlock waiting for management ip

    Fix for "partition deleted immediately after creation"
    adds mutex between config_apply_runtime_manifests()
    and agent_audit() however:
    1. config_apply_runtime_manifests is looping (max 300s)
       waiting for self._mgmt_ip to be set
    2. agent_audit() is setting self._mgmt_ip but can't run
       because config_apply_runtime_manifests() is running

    Move retry logic on self._mgmt_ip outside of
    config_apply_runtime_manifests() so agent_audit()
    can run.

    Change-Id: I3b1e2ebdaa684fa16e21662fb703dffffa70abe3
    Closes-Bug: #1790159
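
The shape of this second fix, waiting for the management IP before taking the lock rather than while holding it, might look roughly like the sketch below. It is an illustration under assumptions, not the actual sysinv-agent code: the Agent class, the helper _wait_for_mgmt_ip(), the timeout values, and the placeholder address are hypothetical, and threading.Lock stands in for whatever locking primitive the agent really uses.

```python
import threading
import time


class Agent:
    """Toy model of the agent (hypothetical, for illustration only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._mgmt_ip = None

    def agent_audit(self):
        # The audit is what sets the management IP; it needs the lock.
        with self._lock:
            self._mgmt_ip = "192.0.2.10"  # placeholder documentation address

    def _wait_for_mgmt_ip(self, timeout, interval=0.01):
        """Poll for the management IP WITHOUT holding the lock."""
        deadline = time.monotonic() + timeout
        while self._mgmt_ip is None:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
        return True

    def config_apply_runtime_manifests(self, timeout=1.0):
        # Retry logic sits outside the lock, so agent_audit() can still run.
        if not self._wait_for_mgmt_ip(timeout):
            return False
        with self._lock:
            return True  # the manifest would be applied here
```

If the polling loop were placed inside the "with self._lock" block instead, agent_audit() could never acquire the lock to set the address, and the apply call would spin until its timeout, which is the deadlock the commit message describes.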

Changed in starlingx:
status: In Progress → Fix Released
Ken Young (kenyis)
tags: added: stx.1.0
removed: stx.2018.10