Timeout during VM resize

Bug #1956229 reported by Alexandru Dimofte
This bug affects 2 people
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Lucas

Bug Description

Brief Description
-----------------
Our VM resize/migrate tests are failing

Severity
--------
Major: System/Feature is usable but degraded. When running a test suite, this failure can affect the tests executed afterwards.

Steps to Reproduce
------------------
Try to resize a VM.

Expected Behavior
------------------
The resize should complete: the VM should transition to VERIFY_RESIZE status.

Actual Behavior
----------------
Timeout during VM resize

Reproducibility
---------------
This happens frequently, possibly 100% of the time.

System Configuration
--------------------
One node system, Two node system, Multi-node system, Dedicated storage

Branch/Pull Time/Commit
-----------------------
master

Last Pass
---------
-

Timestamp/Logs
--------------
The collected logs will be attached.

[2022-01-01 08:59:51,126] 54 DEBUG MainThread conftest.update_results:: ***Failure at test call: /localdisk/starlingx/s1_duplex/test/automated-pytest-suite/keywords/vm_helper.py:1994: uti
***Details: launch_instances = 'b07d5807-5d77-4461-a400-96c517a7dc24'
create_flavors_and_images = {'flavor1': '182c2525-7a39-4882-bdc1-6a0e577c5dc4', 'flavor2': 'a9f9be0f-0862-491f-aa40-496666297aa1', 'image': '769818a4-4812-4ceb-bef0-79ecb0580b73'}

    @mark.robotsanity
    def test_resize_instances(launch_instances, create_flavors_and_images):
> vm_helper.resize_vm(vm_id=launch_instances, flavor_id=create_flavors_and_images["flavor2"])

testcases/sanity/sanity_openstack/test_instance_from_image.py:155:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
keywords/vm_helper.py:1829: in resize_vm
    timeout=300, con_ssh=con_ssh)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

vm_id = 'b07d5807-5d77-4461-a400-96c517a7dc24'
status = ['VERIFY_RESIZE', 'ERROR'], timeout = 300, check_interval = 3
fail_ok = False, con_ssh = None
auth_info = {'domain': 'Default', 'password': 'St4rlingX*', 'tenant': 'admin', 'user': 'admin'}

    def wait_for_vm_status(vm_id, status=VMStatus.ACTIVE,
                           timeout=VMTimeout.STATUS_CHANGE, check_interval=3,
                           fail_ok=False,
                           con_ssh=None, auth_info=Tenant.get('admin')):
        """
....
        Args:
            vm_id:
            status (list|str):
            timeout:
            check_interval:
            fail_ok (bool):
            con_ssh:
            auth_info:
....
        Returns: The Status of the vm_id depend on what Status it is looking for
....
        """
        end_time = time.time() + timeout
        if isinstance(status, str):
            status = [status]
....
        current_status = get_vm_values(vm_id, 'status', strict=True,
                                       con_ssh=con_ssh, auth_info=auth_info)[0]
        while time.time() < end_time:
            for expected_status in status:
                if current_status == expected_status:
                    LOG.info("VM status has reached {}".format(expected_status))
                    return expected_status
....
            time.sleep(check_interval)
            current_status = get_vm_values(vm_id, 'status', strict=True,
                                           con_ssh=con_ssh, auth_info=auth_info)[0]
....
        err_msg = "Timed out waiting for vm status: {}. Actual vm status: " \
                  "{}".format(status, current_status)
        if fail_ok:
            LOG.warning(err_msg)
            return None
        else:
> raise exceptions.VMTimeout(err_msg)
E utils.exceptions.VMTimeout: VM operation timed out.
E Details: Timed out waiting for vm status: ['VERIFY_RESIZE', 'ERROR']. Actual vm status: ACTIVE

keywords/vm_helper.py:1994: VMTimeout

Test Activity
-------------
Sanity

Workaround
----------
-

Revision history for this message
Heitor Matsui (heitormatsui) wrote :

I observed this behavior too, yesterday. My comments below:

The cold migrate that is failing logs errors like these on nova-compute:
https://paste.opendev.org/show/811898/

Because it fails, the instance is returned to the ACTIVE state, so it never reaches VERIFY_RESIZE, which causes the tests to fail.

Resize (not as part of cold migrate) seems to fail intermittently.
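This failure mode follows directly from the wait loop shown in the traceback above: the helper polls the server status until it matches one of the expected values, so a resize that silently rolls back to ACTIVE can only surface as a timeout. A minimal sketch of that polling logic (the `get_status` callable here is a stand-in for the real `get_vm_values` helper, not the actual implementation):

```python
import time


def wait_for_status(get_status, expected, timeout=300, check_interval=3):
    """Poll get_status() until it returns one of `expected`.

    Mirrors the wait_for_vm_status logic in the traceback: a resize that
    fails and rolls the VM back to ACTIVE never matches
    ['VERIFY_RESIZE', 'ERROR'], so the loop can only end in a timeout.
    """
    if isinstance(expected, str):
        expected = [expected]
    end_time = time.time() + timeout
    current = get_status()
    while time.time() < end_time:
        if current in expected:
            return current
        time.sleep(check_interval)
        current = get_status()
    raise TimeoutError(
        "Timed out waiting for vm status: {}. Actual vm status: {}".format(
            expected, current))
```

With `get_status` always returning `'ACTIVE'`, this raises the same "Timed out waiting for vm status: ['VERIFY_RESIZE', 'ERROR']. Actual vm status: ACTIVE" message seen in the sanity failure.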

Revision history for this message
Hugo Nicodemos Brito (hbrito) wrote :

I found this in the logs:

```
nova.exception.ResizeError: Resize error: not able to execute ssh command: Unexpected error while running command.\nCommand: ssh -o BatchMode=yes 192.168.206.3 mkdir -p /var/lib/nova/instances/b07d5807-5d77-4461-a400-96c517a7dc24\nExit code: 255\nStdout: \'\'\nStderr: \'/root/.ssh/config: line 1: Bad configuration option: null\\r\\n/root/.ssh/config: terminating, 1 bad configuration options\\r\\n\'\n','controller-0',2);

```

The `/root/.ssh/config` in the nova-compute container is indeed created with `null` inside it.

Sometimes the resize passes, probably because it is a resize to the same host, but a cold migrate command will always fail.
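One plausible origin for that literal `null` (an assumption on my part; I have not traced it through the chart source): if the Helm value that feeds the generated ssh config is left unset, serializing the resulting None/nil produces the literal string `null`, which then lands in `/root/.ssh/config` and trips ssh's "Bad configuration option: null" error. A minimal illustration (the key name `ssh_config` is hypothetical; JSON is used here, and YAML serializes None the same way):

```python
import json

# Hypothetical key name; the real chart structure is not shown in this bug.
chart_values = {"ssh_config": None}  # value left unset in the values file

# Serializing the unset value yields the literal string "null" -- if that
# string is written into /root/.ssh/config, ssh rejects it as a bad option.
rendered = json.dumps(chart_values["ssh_config"])
print(rendered)  # null
```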

Ghada Khalil (gkhalil)
tags: added: stx.distro.openstack
Changed in starlingx:
importance: Undecided → Medium
Revision history for this message
Austin Sun (sunausti) wrote :

Chant, please help triage this issue. Thanks.

Changed in starlingx:
assignee: nobody → YU CHENGDE (chant)
Revision history for this message
Ghada Khalil (gkhalil) wrote :

screening: stx.7.0 / medium; issue resulting in a yellow sanity on the main stx builds

Changed in starlingx:
status: New → Triaged
tags: added: stx.7.0
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I observed that Simplex Bare-Metal and Simplex Virtual configurations are not affected by this issue, only the rest.

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per the sanity email for stx.6.0, this issue is affecting that branch as well, which is expected given there is likely not much difference in this area:
http://lists.starlingx.io/pipermail/starlingx-discuss/2022-January/012626.html

We will need to decide if this is serious enough to hold up stx.6.0.

Revision history for this message
YU CHENGDE (chant) wrote :

When run manually, the nova resize process works normally:

/opt/platform/armada/22.02/stx-openstack/1.0-152-centos-stable-versioned

+--------------------------------------+----------+------+------+-----------+-------+-----------+
| ID                                   | Name     | RAM  | Disk | Ephemeral | VCPUs | Is Public |
+--------------------------------------+----------+------+------+-----------+-------+-----------+
| 6e9f4a10-6f66-4583-a4df-e27f4f8ef91a | f2.small | 2048 | 20   | 0         | 1     | True      |
| e4f4741d-b5df-43cc-b181-c376d8816275 | m1.small | 2048 | 20   | 0         | 1     | True      |
+--------------------------------------+----------+------+------+-----------+-------+-----------+

nova boot --flavor=6e9f4a10-6f66-4583-a4df-e27f4f8ef91a --image=7a9a91bd-9c55-43ca-b53d-f32d61588343 --key-name=keypair-tenant1 --availability-zone=nova:controller-0 --poll --nic net-id=4abb1c82-8430-48e2-b856-1d5e36b19f60 admin-vm-2

+--------------------------------------+------------+--------+------------------------+--------+----------+
| ID                                   | Name       | Status | Networks               | Image  | Flavor   |
+--------------------------------------+------------+--------+------------------------+--------+----------+
| a42cdb91-005f-4cb0-9b19-941f8ddf7913 | admin-vm-2 | ACTIVE | network-1=192.168.0.42 | cirros | f2.small |
+--------------------------------------+------------+--------+------------------------+--------+----------+

openstack server resize --wait --flavor e4f4741d-b5df-43cc-b181-c376d8816275 a42cdb91-005f-4cb0-9b19-941f8ddf7913

+--------------------------------------+------------+---------------+------------------------+--------+----------+
| ID                                   | Name       | Status        | Networks               | Image  | Flavor   |
+--------------------------------------+------------+---------------+------------------------+--------+----------+
| a42cdb91-005f-4cb0-9b19-941f8ddf7913 | admin-vm-2 | VERIFY_RESIZE | network-1=192.168.0.42 | cirros | m1.small |
+--------------------------------------+------------+---------------+------------------------+--------+----------+

openstack server resize confirm a42cdb91-005f-4cb0-9b19-941f8ddf7913

+--------------------------------------+------------+--------+-------------------------+--------+----------+
| ID                                   | Name       | Status | Networks                | Image  | Flavor   |
+--------------------------------------+------------+--------+-------------------------+--------+----------+
| 58ce95d1-1426-4321-a174-3d622798dece | admin-vm-2 | ACTIVE | network-1=192.168.0.164 | cirros | m1.small |
+--------------------------------------+------------+--------+-------------------------+--------+----------+

2022-01-12T10:08:20.837048666Z stdout F 2022-01-12 10:08:20,836.836 6 INFO nova_api_proxy.apps.acceptor [-] Forward to NFV "POST http://nova.openstack.svc.cluster.local/v2.1/34e2fd0c530341d69678fa17c40dbb6c/servers/a42cdb91-005f-4cb0-9b19-941f8ddf7913/action", action: (resize), val:({u'flavorRef': u'e4f4741d-b5df-43cc-b181-c376d8816275'})
2022-01-12T10:08:21.075301381Z stdout F 2022-01-12 10:08:21,0...


Revision history for this message
YU CHENGDE (chant) wrote :

I found this error log on a "BM Duplex" system:

2022-01-01T09:04:13.496319718Z stdout F 2022-01-01 09:04:13,494.494 6 INFO nova_api_proxy.apps.proxy [-] POST response body: ['{"metadata": {"sw:wrs:auto_recovery": "true", "hw:wrs:live_migration_timeout": "180", "hw:wrs:live_migration_max_downtime": "500"}}']

Keep tracing.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

Based on the community discussion today, it would be nice to fix this issue for stx.6.0 to reach a green sanity. We'd need a fix by early next week (Jan 17-18) to avoid impacting the release declaration plans (target: Jan 21).

tags: added: stx.6.0
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (master)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/822357
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/4e3157bf4c3271d2d01d72b1ea6bc5407dbc296a
Submitter: "Zuul (22348)"
Branch: master

commit 4e3157bf4c3271d2d01d72b1ea6bc5407dbc296a
Author: Lucas Cavalcante <email address hidden>
Date: Mon Dec 20 17:19:09 2021 -0300

    Fix nova-compute-ssh

    On stx-openstack, we run the nova-compute containers as user 0 (root) to
    get privileged access to some of the host resources. During the latest
    upversion of openstack-helm, we got in some commits that were
    incompatible with our usage of the root user since the keys for ssh
    access to a different compute were always placed under the 'nova' user's
    folder. This commit fixes that behavior while we don't merge a
    definitive fix on openstack-helm and go through a new upversion.

    Test Plan:

    PASS - nova-compute-ssh starting correctly after change
    the sshd->ssh parameter
    PASS - migrate/resize vm

    Closes-Bug: #1956229

    Signed-off-by: Lucas Cavalcante <email address hidden>
    Signed-off-by: Hugo Brito <email address hidden>
    Change-Id: Ic90e8e64670b8314b9a2f38b93a59361dcb7ecc9

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Ghada Khalil (gkhalil) wrote (last edit):

Assigning to Lucas Cavalcante as he provided the fix above.
@Lucas, can you please confirm that this change is applicable to the r/stx.6.0 branch? If so, please cherry-pick it.
Note: The sanity reports from RC6.0 indicate that the same issue was seen in that release as well.

Changed in starlingx:
assignee: YU CHENGDE (chant) → Lucas (lcavalca)
tags: added: stx.cherrypickneeded
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

I verified the fix for this issue by executing the sanity suite. On all virtual configurations (Simplex, Duplex, Standard, and Standard External) it works perfectly.
However, I couldn't check it on any of the bare-metal servers because those configurations are affected by another LP bug: https://bugs.launchpad.net/starlingx/+bug/1958035
I believe we should merge this fix into the r/stx.6.0 branch.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to openstack-armada-app (r/stx.6.0)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to openstack-armada-app (r/stx.6.0)

Reviewed: https://review.opendev.org/c/starlingx/openstack-armada-app/+/824887
Committed: https://opendev.org/starlingx/openstack-armada-app/commit/06f9d133350dd4f4affb24e96aaeec6580a91371
Submitter: "Zuul (22348)"
Branch: r/stx.6.0

commit 06f9d133350dd4f4affb24e96aaeec6580a91371
Author: Lucas Cavalcante <email address hidden>
Date: Mon Dec 20 17:19:09 2021 -0300

    Fix nova-compute-ssh

    On stx-openstack, we run the nova-compute containers as user 0 (root) to
    get privileged access to some of the host resources. During the latest
    upversion of openstack-helm, we got in some commits that were
    incompatible with our usage of the root user since the keys for ssh
    access to a different compute were always placed under the 'nova' user's
    folder. This commit fixes that behavior while we don't merge a
    definitive fix on openstack-helm and go through a new upversion.

    Test Plan:

    PASS - nova-compute-ssh starting correctly after change
    the sshd->ssh parameter
    PASS - migrate/resize vm

    Closes-Bug: #1956229

    Signed-off-by: Lucas Cavalcante <email address hidden>
    Signed-off-by: Hugo Brito <email address hidden>
    Change-Id: Ic90e8e64670b8314b9a2f38b93a59361dcb7ecc9
    (cherry picked from commit 4e3157bf4c3271d2d01d72b1ea6bc5407dbc296a)

Ghada Khalil (gkhalil)
tags: added: in-r-stx60
removed: stx.cherrypickneeded
Revision history for this message
Alexandru Dimofte (adimofte) wrote :

This issue disappeared from RC6.0 as well. It is verified; I think we can close this bug.
