STX-O Antelope | Failing to boot VMs by volume

Bug #2037463 reported by Gabriel Calixto de Paula
This bug affects 1 person
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Lucas de Ataides Barreto

Bug Description

Brief Description
---------------------------------------------

When trying to boot a VM from a volume using the STX-Openstack Antelope build, the VM fails to boot and reports an error.

Severity
---------------------------------------------

Major

Steps to Reproduce
---------------------------------------------

    Create a flavor
    Create a volume from tis-centos-guest
    Try to boot a VM from the volume

Expected Behavior
---------------------------------------------

VM should boot

Actual Behavior
---------------------------------------------

VM doesn't boot

Reproducibility
---------------------------------------------

Seen Once

System Configuration
---------------------------------------------

DX

Branch/Pull Time/Commit
-----------------------

STX + STX-openstack f/antelope
app_version: 1.0-1.stx.70-debian-stable-versioned
build_date: 2023-09-22 23:28:26 +0000

Last Pass
--------------------------------------------

Aug-30

Timestamp/Logs
---------------------------------------------

  Details
E | Field | Value |

E | OS-DCF:diskConfig | MANUAL |
E | OS-EXT-AZ:availability_zone | |
E | OS-EXT-SRV-ATTR:host | None |
E | OS-EXT-SRV-ATTR:hostname | tenant1-tis-centos-guest-vifs-2 |
E | OS-EXT-SRV-ATTR:hypervisor_hostname | None |
E | OS-EXT-SRV-ATTR:instance_name | instance-00000004 |
E | OS-EXT-SRV-ATTR:kernel_id | |
E | OS-EXT-SRV-ATTR:launch_index | 0 |
E | OS-EXT-SRV-ATTR:ramdisk_id | |
E | OS-EXT-SRV-ATTR:reservation_id | r-gyuz2pyw |
E | OS-EXT-SRV-ATTR:root_device_name | /dev/vda |
E | OS-EXT-SRV-ATTR:user_data | None |
E | OS-EXT-STS:power_state | NOSTATE |
E | OS-EXT-STS:task_state | None |
E | OS-EXT-STS:vm_state | error |
E | OS-SCH-HNT:scheduler_hints | None |
E | OS-SRV-USG:launched_at | None |
E | OS-SRV-USG:terminated_at | None |
E | accessIPv4 | |
E | accessIPv6 | |
E | access_ipv4 | |
E | access_ipv6 | |
E | addresses | |
E | adminPass | None |
E | admin_password | None |
E | attached_volumes | [{'device': None, 'id': 'ddad9c1f-554d-4755-ad21-a298ce445a16', 'volume_id': None, 'attachment_id': None, 'bdm_id': None, 'tag': None, 'delete_on_termination': False, 'name': None, 'location': None}] |
E | availability_zone | |
E | block_device_mapping | None |
E | block_device_mapping_v2 | None |
E | compute_host | None |
E | config_drive | |
E | created | 2023-09-26T14:20:58Z |
E | created_at | 2023-09-26T14:20:58Z |
E | description | None |
E | disk_config | MANUAL |
E | fault | {'code': 500, 'created': '2023-09-26T14:21:16Z', 'message': 'Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 78667fe8-629b-4640-88a5-28a4b6582f74.', 'details': 'Traceback (most recent call last):\n File "/var/lib/openstack/lib/python3.9/site-packages/nova/conductor/manager.py", line 705, in build_instances\n raise exception.MaxRetriesExceeded(reason=msg)\nnova.exception.MaxRetriesExceeded: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 78667fe8-629b-4640-88a5-28a4b6582f74.\n'} |
E

Alarms
---------------------------------------------

N/A

Test Activity
---------------------------------------------
Sanity

Workaround
---------------------------------------------

N/A

Changed in starlingx:
assignee: nobody → Lucas de Ataides Barreto (ldeataid)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (f/antelope)
Changed in starlingx:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (master)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on openstack-armada-app (f/antelope)

Change abandoned by "Lucas de Ataides Barreto <email address hidden>" on branch: f/antelope
Review: https://review.opendev.org/c/starlingx/openstack-armada-app/+/896817
Reason: wrong branch and gerrit is not allowing me to move it

OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (f/antelope)
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (f/antelope)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/897292
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/cba7b53fb1b686ad7557f1b05e48f366fda4ca72
Submitter: "Zuul (22348)"
Branch: f/antelope

commit cba7b53fb1b686ad7557f1b05e48f366fda4ca72
Author: Lucas de Ataides <email address hidden>
Date: Thu Sep 28 14:03:49 2023 -0300

    Allow VMs to be created via volumes

    After STX-Openstack upversioned to Antelope, we noticed that it was not
    possible to create VMs by volume, as they would be stuck on ERROR
    status. The first idea I had to solve this was to create a patch
    containing [1] and [2], because, as specified in [3], Nova now requires
    a service token in order to be able to manipulate Cinder volumes. This
    unfortunately did not solve the issue by itself, as now an error message
    showed up on the nova-conductor pods with the following (not full error
    message, only important part): "nova.exception.RescheduledException:
    Build of instance 2f32c7ea-1720-4f61-bce8-dbe970c40b0c was re-scheduled:
    Secret not found: no secret with matching uuid
    'a7f3ae2e-cee7-4f04-9402-a78047747654'". This UUID was not the same one
    present when issuing `virsh secret-list` on Cinder, Nova and Libvirt
    containers.

    It turns out openstack-helm and openstack-helm-infra have a Ceph UUID
    hardcoded in them, in Cinder [4], Nova [5] [6] and Libvirt [7] values.
    Changing these values to the UUID that libvirt was trying to find
    (a7f3ae2e-cee7-4f04-9402-a78047747654) solved the issue. However,
    hardcoded values are not good practice, and tracing where this UUID
    came from showed that it is defined by the platform's Ceph
    configuration in `/etc/ceph/ceph.conf`.

    What this change does is dynamically read this `/etc/ceph/ceph.conf`
    file to search for the UUID value, and use it to override the [4] [5]
    [6] and [7] values. It also adds the patch including the Nova service
    token configuration. The combination of these 2 changes allows VMs to be
    created by volumes.

    [1] https://opendev.org/openstack/openstack-helm/commit/91c8a5baf2cf2f0dddded57d88f00ea11dd4ff4a
    [2] https://opendev.org/openstack/openstack-helm/commit/7d39af25fddbf5fc67e15c92a9265f28567a214e
    [3] https://docs.openstack.org/releasenotes/cinder/2023.1.html#upgrade-notes
    [4] https://opendev.org/openstack/openstack-helm/src/branch/master/cinder/values.yaml#L942
    [5] https://opendev.org/openstack/openstack-helm/src/branch/master/nova/values.yaml#L594
    [6] https://opendev.org/openstack/openstack-helm/src/branch/master/nova/values.yaml#L1432
    [7] https://opendev.org/openstack/openstack-helm-infra/src/branch/master/libvirt/values.yaml#L100

    Test Plan:
    PASS: Build openstack-helm, python3-k8sapp-openstack and
          stx-openstack-helm-fluxcd
    PASS: Upload / apply / remove STX-Openstack
    PASS: Create a VM by an image
    PASS: Create a volume and launch a VM from it
    PASS: Create a VM using the `boot-from-volume` flag
    PASS: Delete a VM created by a volume

    Closes...
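
The commit message describes reading `/etc/ceph/ceph.conf` to find the cluster UUID. The following is only a minimal sketch of that idea, not the code from the actual change: the function name `read_ceph_fsid` is hypothetical, and it assumes the UUID is stored as the `fsid` option in the `[global]` section, which is the usual Ceph layout but is not stated in this report.

```python
import uuid
from typing import Optional


def read_ceph_fsid(conf_text: str) -> Optional[str]:
    """Return the validated fsid from the [global] section, or None.

    Hypothetical helper: assumes the cluster UUID is the 'fsid' option
    under [global]; the real plugin code may look different.
    """
    section = None
    for raw_line in conf_text.splitlines():
        # Drop trailing comments ('#' and ';' styles) and whitespace.
        line = raw_line.split("#", 1)[0].split(";", 1)[0].strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if section == "global" and "=" in line:
            key, value = line.split("=", 1)
            if key.strip() == "fsid":
                # uuid.UUID raises ValueError on a malformed value.
                return str(uuid.UUID(value.strip()))
    return None
```

On a controller, the text would come from something like `open("/etc/ceph/ceph.conf").read()`; the returned value is what would then be used to override the hardcoded chart values.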

tags: added: in-f-antelope
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/896818
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/eaa8b41cb0ce3d251903d7dfe3ece3b0a58a7a09
Submitter: "Zuul (22348)"
Branch: master

commit eaa8b41cb0ce3d251903d7dfe3ece3b0a58a7a09
Author: Lucas de Ataides <email address hidden>
Date: Thu Sep 28 14:03:49 2023 -0300

    Allow VMs to be created via volumes

    After STX-Openstack upversioned to Antelope, we noticed that it was not
    possible to create VMs by volume, as they would be stuck on ERROR
    status. The first proposed solution was to create a patch containing
    [1] and [2], because, as specified in [3], Nova now requires
    a service token in order to be able to manipulate Cinder volumes. This
    unfortunately did not solve the issue by itself, as now an error message
    showed up on the nova-conductor pods with the following (not full error
    message, only important part): "nova.exception.RescheduledException:
    Build of instance 2f32c7ea-1720-4f61-bce8-dbe970c40b0c was re-scheduled:
    Secret not found: no secret with matching uuid
    'a7f3ae2e-cee7-4f04-9402-a78047747654'". This UUID was not the same one
    present when issuing `virsh secret-list` on Cinder, Nova and Libvirt
    containers.

    It turns out openstack-helm and openstack-helm-infra have a Ceph UUID
    hardcoded in them, in Cinder [4], Nova [5] [6] and Libvirt [7] values.
    Changing these values to the UUID that libvirt was trying to find
    (a7f3ae2e-cee7-4f04-9402-a78047747654) solved the issue. However,
    hardcoded values are not good practice, and tracing where this UUID
    came from showed that it is defined by the platform's Ceph
    configuration in `/etc/ceph/ceph.conf`.

    This still leaves the question: why did this work on Ussuri and stop
    working on Antelope? First of all, the official Ceph documentation
    [8] [9] on using Ceph with OpenStack explains how to add the secret
    that stores the ceph admin keyring to libvirt. There, the secret UUID
    is generated "on the fly", and both docs mention the old hardcoded
    value (i.e., 457eb676-33da-42ec-9a8c-9293d545c337). This is why it
    used to work until our upversion to OpenStack Antelope/2023.1: the
    UUID itself does not really matter, as long as nova and libvirt have
    the same value for it. It is simply the UUID given to the libvirt
    secret that stores the ceph keyring [10], and the keyring ensures
    proper communication between our services and the platform ceph.

    What changed between Ussuri and Antelope (2023.1) is that there is now
    a specific method [11] that sets a default value (the Ceph cluster
    UUID) for this UUID when it is not specified in the driver
    configuration.

    What this change does is dynamically read this `/etc/ceph/ceph.conf`
    file to search for the UUID value, and use it to override the [4] [5]
    [6] and [7] values. It also adds the patch including the Nova service
    token configuration. The combination of these 2 changes allows VMs to be
    created by volume...
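
The failure mode described above can be illustrated with a small sketch. It is not the upstream method referenced as [11]; the function names `effective_secret_uuid` and `secret_is_defined` are hypothetical, and the only behaviour assumed is what the commit message states: when no secret UUID is configured for the driver, the Ceph cluster UUID is used as the default.

```python
from typing import Iterable, Optional


def effective_secret_uuid(configured: Optional[str], cluster_fsid: str) -> str:
    """Return the configured UUID, falling back to the cluster fsid."""
    return configured if configured else cluster_fsid


def secret_is_defined(wanted: str, defined: Iterable[str]) -> bool:
    """Check whether the wanted UUID is among the libvirt secret UUIDs."""
    wanted_norm = wanted.strip().lower()
    return any(wanted_norm == item.strip().lower() for item in defined)


# UUIDs taken from the report: the chart's hardcoded value and the
# platform's Ceph cluster UUID.
HARDCODED = "457eb676-33da-42ec-9a8c-9293d545c337"
CLUSTER_FSID = "a7f3ae2e-cee7-4f04-9402-a78047747654"

# libvirt only knows the hardcoded secret, but the driver now defaults
# to the cluster fsid, so the lookup fails ("Secret not found").
libvirt_secrets = [HARDCODED]
wanted = effective_secret_uuid(None, CLUSTER_FSID)
mismatch = not secret_is_defined(wanted, libvirt_secrets)

# Once every component is overridden with the cluster fsid, it matches.
fixed = secret_is_defined(wanted, [CLUSTER_FSID])
```

Under these assumptions `mismatch` is true and `fixed` is true, which mirrors the before and after states of the fix: the value matters less than every component agreeing on it.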

Changed in starlingx:
status: In Progress → Fix Released
Ghada Khalil (gkhalil)
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.9.0 stx.distro.openstack