Timeout during VM resize

Bug #1956229 reported by Alexandru Dimofte
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Lucas

Bug Description

Brief Description
-----------------
Our VM resize/migrate tests are failing

Severity
--------
Major: System/Feature is usable but degraded. When this happens during a test suite run, it may also affect the tests executed afterwards.

Steps to Reproduce
------------------
Try to resize a VM (a minimal CLI sketch is shown below).
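
For reference, a minimal CLI sketch of the failing flow, assuming an admin openrc is sourced; `<vm>` and `<new-flavor>` are placeholders:

```
# Resize the instance to a different flavor and wait for the operation to settle
openstack server resize --wait --flavor <new-flavor> <vm>

# The instance is expected to reach VERIFY_RESIZE; on affected systems it
# instead stays ACTIVE until the test framework times out
openstack server show -f value -c status <vm>
```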

Expected Behavior
------------------
The resize should complete and the VM should reach the VERIFY_RESIZE state.

Actual Behavior
----------------
Timeout during VM resize: the VM never reaches VERIFY_RESIZE and stays in ACTIVE.

Reproducibility
---------------
This happens very often; it is likely close to 100% reproducible.

System Configuration
--------------------
One node system, Two node system, Multi-node system, Dedicated storage

Branch/Pull Time/Commit
-----------------------
master

Last Pass
---------
-

Timestamp/Logs
--------------
The collected logs will be attached.

[2022-01-01 08:59:51,126] 54 DEBUG MainThread conftest.update_results:: ***Failure at test call: /localdisk/starlingx/s1_duplex/test/automated-pytest-suite/keywords/vm_helper.py:1994: uti
***Details: launch_instances = 'b07d5807-5d77-4461-a400-96c517a7dc24'
create_flavors_and_images = {'flavor1': '182c2525-7a39-4882-bdc1-6a0e577c5dc4', 'flavor2': 'a9f9be0f-0862-491f-aa40-496666297aa1', 'image': '769818a4-4812-4ceb-bef0-79ecb0580b73'}

    @mark.robotsanity
    def test_resize_instances(launch_instances, create_flavors_and_images):
> vm_helper.resize_vm(vm_id=launch_instances, flavor_id=create_flavors_and_images["flavor2"])

testcases/sanity/sanity_openstack/test_instance_from_image.py:155:.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _.
keywords/vm_helper.py:1829: in resize_vm
    timeout=300, con_ssh=con_ssh)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _.

vm_id = 'b07d5807-5d77-4461-a400-96c517a7dc24'
status = ['VERIFY_RESIZE', 'ERROR'], timeout = 300, check_interval = 3
fail_ok = False, con_ssh = None
auth_info = {'domain': 'Default', 'password': 'St4rlingX*', 'tenant': 'admin', 'user': 'admin'}

    def wait_for_vm_status(vm_id, status=VMStatus.ACTIVE,
                           timeout=VMTimeout.STATUS_CHANGE, check_interval=3,
                           fail_ok=False,
                           con_ssh=None, auth_info=Tenant.get('admin')):
        """
....
        Args:
            vm_id:
            status (list|str):
            timeout:
            check_interval:
            fail_ok (bool):
            con_ssh:
            auth_info:
....
        Returns: The Status of the vm_id depend on what Status it is looking for
....
        """
        end_time = time.time() + timeout
        if isinstance(status, str):
            status = [status]
....
        current_status = get_vm_values(vm_id, 'status', strict=True,
                                       con_ssh=con_ssh, auth_info=auth_info)[0]
        while time.time() < end_time:
            for expected_status in status:
                if current_status == expected_status:
                    LOG.info("VM status has reached {}".format(expected_status))
                    return expected_status
....
            time.sleep(check_interval)
            current_status = get_vm_values(vm_id, 'status', strict=True,
                                           con_ssh=con_ssh, auth_info=auth_info)[0]
....
        err_msg = "Timed out waiting for vm status: {}. Actual vm status: " \
                  "{}".format(status, current_status)
        if fail_ok:
            LOG.warning(err_msg)
            return None
        else:
> raise exceptions.VMTimeout(err_msg)
E utils.exceptions.VMTimeout: VM operation timed out.
E Details: Timed out waiting for vm status: ['VERIFY_RESIZE', 'ERROR']. Actual vm status: ACTIVE

keywords/vm_helper.py:1994: VMTimeout

Test Activity
-------------
Sanity

Workaround
----------
-

Revision history for this message
Alexandru Dimofte (adimofte) wrote :
Revision history for this message
Heitor Matsui (heitormatsui) wrote :

I observed this behavior yesterday as well. My comments are below:

The cold migrate that is failing logs errors like these on nova-compute:
https://paste.opendev.org/show/811898/

Because it fails, the instance is returned to the ACTIVE state, so it never reaches VERIFY_RESIZE, which causes the tests to fail.

Resize (not as part of cold migrate) seems to fail intermittently.
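
For anyone chasing the same failure, a rough sketch of how to pull those nova-compute errors from a running stx-openstack deployment; the `openstack` namespace and the pod/container names are assumptions and may differ per deployment:

```
# Locate the nova-compute pod for the source host, then scan its log for the
# resize/cold-migrate errors (pod and container names are deployment-specific)
kubectl -n openstack get pods -o wide | grep nova-compute
kubectl -n openstack logs <nova-compute-pod> -c nova-compute | grep -iE 'resize|migrat'
```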

Revision history for this message
Hugo Nicodemos Brito (hbrito) wrote :

I found this in the logs:

```
nova.exception.ResizeError: Resize error: not able to execute ssh command: Unexpected error while running command.
Command: ssh -o BatchMode=yes 192.168.206.3 mkdir -p /var/lib/nova/instances/b07d5807-5d77-4461-a400-96c517a7dc24
Exit code: 255
Stdout: ''
Stderr: '/root/.ssh/config: line 1: Bad configuration option: null
/root/.ssh/config: terminating, 1 bad configuration options'

```

The `/root/.ssh/config` in the nova-compute container is indeed created with `null` inside it.

Sometimes the resize passes, probably because it is to the same host, but a cold migrate command will always fail.
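
A sketch of how the bad configuration and the failing command can be confirmed from inside the container; the pod/container names, peer compute IP and instance UUID below are placeholders:

```
# On affected systems /root/.ssh/config contains the literal string "null"
kubectl -n openstack exec <nova-compute-pod> -c nova-compute -- cat /root/.ssh/config

# Re-run the command nova issues during resize/cold migrate; on affected systems
# it exits 255 with "Bad configuration option: null"
kubectl -n openstack exec <nova-compute-pod> -c nova-compute -- \
  ssh -o BatchMode=yes <peer-compute-ip> mkdir -p /var/lib/nova/instances/<instance-uuid>
```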

Ghada Khalil (gkhalil)
tags: added: stx.distro.openstack
Changed in starlingx:
importance: Undecided → Medium
Revision history for this message
Austin Sun (sunausti) wrote :

chant, please help triage this issue. Thanks.

Changed in starlingx:
assignee: nobody → YU CHENGDE (chant)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: stx.7.0 / medium; issue resulting in a yellow sanity on the main stx builds

Changed in starlingx:
status: New → Triaged
tags: added: stx.7.0
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I observed that the Simplex Bare-Metal and Simplex Virtual configurations are not affected by this issue; all other configurations are.

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per the sanity email for stx.6.0, this issue affects that branch as well, which is expected given that there is likely little difference in this area:
http://lists.starlingx.io/pipermail/starlingx-discuss/2022-January/012626.html

We will need to decide if this is serious enough to hold up stx.6.0

Revision history for this message
YU CHENGDE (chant) wrote :

When run manually, the nova resize process works normally.

/opt/platform/armada/22.02/stx-openstack/1.0-152-centos-stable-versioned

+--------------------------------------+-----------+-------+------+-----------+-------+-----------+
| ID | Name | RAM | Disk | Ephemeral | VCPUs | Is Public |
+--------------------------------------+-----------+-------+------+-----------+-------+-----------+
| 6e9f4a10-6f66-4583-a4df-e27f4f8ef91a | f2.small | 2048 | 20 | 0 | 1 | True |
| e4f4741d-b5df-43cc-b181-c376d8816275 | m1.small | 2048 | 20 | 0 | 1 | True |
+--------------------------------------+-----------+-------+------+-----------+-------+-----------+

nova boot --flavor=6e9f4a10-6f66-4583-a4df-e27f4f8ef91a --image=7a9a91bd-9c55-43ca-b53d-f32d61588343 --key-name=keypair-tenant1 --availability-zone=nova:controller-0 --poll --nic net-id=4abb1c82-8430-48e2-b856-1d5e36b19f60 admin-vm-2

+--------------------------------------+------------+--------+------------------------+--------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------+--------+------------------------+--------+----------+
| a42cdb91-005f-4cb0-9b19-941f8ddf7913 | admin-vm-2 | ACTIVE | network-1=192.168.0.42 | cirros | f2.small |
+--------------------------------------+------------+--------+------------------------+--------+----------+

openstack server resize --wait --flavor e4f4741d-b5df-43cc-b181-c376d8816275 a42cdb91-005f-4cb0-9b19-941f8ddf7913

 +--------------------------------------+------------+---------------+------------------------+--------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------+---------------+------------------------+--------+----------+
| a42cdb91-005f-4cb0-9b19-941f8ddf7913 | admin-vm-2 | VERIFY_RESIZE | network-1=192.168.0.42 | cirros | m1.small |
+--------------------------------------+------------+---------------+------------------------+--------+----------+

openstack server resize confirm a42cdb91-005f-4cb0-9b19-941f8ddf7913

+--------------------------------------+------------+--------+-------------------------+--------+----------+
| ID | Name | Status | Networks | Image | Flavor |
+--------------------------------------+------------+--------+-------------------------+--------+----------+
| 58ce95d1-1426-4321-a174-3d622798dece | admin-vm-2 | ACTIVE | network-1=192.168.0.164 | cirros | m1.small |
+--------------------------------------+------------+--------+-------------------------+--------+----------+

2022-01-12T10:08:20.837048666Z stdout F 2022-01-12 10:08:20,836.836 6 INFO nova_api_proxy.apps.acceptor [-] Forward to NFV "POST http://nova.openstack.svc.cluster.local/v2.1/34e2fd0c530341d69678fa17c40dbb6c/servers/a42cdb91-005f-4cb0-9b19-941f8ddf7913/action", action: (resize), val:({u'flavorRef': u'e4f4741d-b5df-43cc-b181-c376d8816275'})
2022-01-12T10:08:21.075301381Z stdout F 2022-01-12 10:08:21,0...


Revision history for this message
YU CHENGDE (chant) wrote :

I found this error log on the BM Duplex configuration:

2022-01-01T09:04:13.496319718Z stdout F 2022-01-01 09:04:13,494.494 6 INFO nova_api_proxy.apps.proxy [-] POST response body: ['{"metadata": {"sw:wrs:auto_recovery": "true", "hw:wrs:live_migration_timeout": "180", "hw:wrs:live_migration_max_downtime": "500"}}']

I will keep tracing.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Based on the community discussion today, it would be nice to fix this issue for stx.6.0 to reach a green sanity. We'd need a fix by early next week (Jan 17-18) to avoid impacting the release declaration plans (target: Jan 21).

tags: added: stx.6.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/822357
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/4e3157bf4c3271d2d01d72b1ea6bc5407dbc296a
Submitter: "Zuul (22348)"
Branch: master

commit 4e3157bf4c3271d2d01d72b1ea6bc5407dbc296a
Author: Lucas Cavalcante <email address hidden>
Date: Mon Dec 20 17:19:09 2021 -0300

    Fix nova-compute-ssh

    On stx-openstack, we run the nova-compute containers as user 0 (root) to
    get privileged access to some of the host resources. During the latest
    upversion of openstack-helm, we got in some commits that were
    incompatible with our usage of the root user since the keys for ssh
    access to a different compute were always placed under the 'nova' user's
    folder. This commit fixes that behavior while we don't merge a
    definitive fix on openstack-helm and go through a new upversion.

    Test Plan:

    PASS - nova-compute-ssh starting correctly after change
    the sshd->ssh parameter
    PASS - migrate/resize vm

    Closes-Bug: #1956229

    Signed-off-by: Lucas Cavalcante <email address hidden>
    Signed-off-by: Hugo Brito <email address hidden>
    Change-Id: Ic90e8e64670b8314b9a2f38b93a59361dcb7ecc9
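
With the chart change applied and stx-openstack re-applied, a minimal manual check along the lines of the test plan above could be (admin openrc sourced; `<vm>` is a placeholder instance UUID):

```
# Cold migrate exercises the compute-to-compute ssh path that previously failed
openstack server migrate --wait <vm>
openstack server show -f value -c status <vm>   # expect VERIFY_RESIZE

# Confirm to move the instance back to ACTIVE on the destination host
openstack server resize confirm <vm>
```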

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote (last edit):

Assigning to Lucas Cavalcante as he provided the fix above.
@Lucas, can you please confirm that this change is applicable to the r/stx.6.0 branch? If so, please cherry-pick it.
Note: The sanity reports from RC6.0 indicate that the same issue was seen in that release as well.

Changed in starlingx:
assignee: YU CHENGDE (chant) → Lucas (lcavalca)
tags: added: stx.cherrypickneeded
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I verified the fix for this issue by executing the sanity suite. On all virtual configurations (Simplex, Duplex, Standard and Standard External) it works perfectly.
However, I could not check it on any of the bare-metal servers because those configurations are affected by another LP bug: https://bugs.launchpad.net/starlingx/+bug/1958035
I believe we should merge this fix into the r/stx.6.0 branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (r/stx.6.0)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (r/stx.6.0)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/824887
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/06f9d133350dd4f4affb24e96aaeec6580a91371
Submitter: "Zuul (22348)"
Branch: r/stx.6.0

commit 06f9d133350dd4f4affb24e96aaeec6580a91371
Author: Lucas Cavalcante <email address hidden>
Date: Mon Dec 20 17:19:09 2021 -0300

    Fix nova-compute-ssh

    On stx-openstack, we run the nova-compute containers as user 0 (root) to
    get privileged access to some of the host resources. During the latest
    upversion of openstack-helm, we got in some commits that were
    incompatible with our usage of the root user since the keys for ssh
    access to a different compute were always placed under the 'nova' user's
    folder. This commit fixes that behavior while we don't merge a
    definitive fix on openstack-helm and go through a new upversion.

    Test Plan:

    PASS - nova-compute-ssh starting correctly after change
    the sshd->ssh parameter
    PASS - migrate/resize vm

    Closes-Bug: #1956229

    Signed-off-by: Lucas Cavalcante <email address hidden>
    Signed-off-by: Hugo Brito <email address hidden>
    Change-Id: Ic90e8e64670b8314b9a2f38b93a59361dcb7ecc9
    (cherry picked from commit 4e3157bf4c3271d2d01d72b1ea6bc5407dbc296a)

Ghada Khalil (gkhalil)
tags: added: in-r-stx60
removed: stx.cherrypickneeded
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

This issue has disappeared from RC.6.0 as well, so it is verified; I think we can close this bug.
