Intermittently the openstack server show command indicates that the server does not exist (in live migration tests)

Bug #1838676 reported by Wendy Mitchell
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Boxiang Zhu

Bug Description

Brief Description
-----------------
Intermittently the openstack server show command unexpectedly indicates that the server does not exist (in live migration tests)

Severity
--------
Major

Steps to Reproduce
------------------

TC-name: nova/test_migrate.py::test_migrate_stress[volume-1000]

1. This test case launches an instance (from a volume) and continuously live migrates the instance from compute-5 to compute-0.
(Note: the first migration attempt at 5:01 was a block migration, which is expected not to work.)

# sudo virsh list
 Id Name State
-----------------------------------
 11 instance-00000017 running

req-ebcac875-beae-42f1-9cb4-02280dd775e8 live-migration Aug. 1, 2019, 5:03 p.m. a500ef102c864bba94687c9d057a9e46 -
req-719e9c64-ff3d-46aa-b315-194230b7af5f live-migration Aug. 1, 2019, 5:02 p.m. a500ef102c864bba94687c9d057a9e46 -
req-13744855-944a-484c-a0e6-882402fe7e6e live-migration Aug. 1, 2019, 5:01 p.m. a500ef102c864bba94687c9d057a9e46 -
req-d074eb27-941e-4445-9688-561b48236ffb live-migration Aug. 1, 2019, 5:01 p.m. a500ef102c864bba94687c9d057a9e46 Error
req-94ffde51-b754-4f1c-b57a-e04c981acc53 Create Aug. 1, 2019, 5 p.m. 2b5a7c6495864c61a4a7477987c7e73a -

2. Each time a migration is performed, the same openstack command is run to show the server:
CLI 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show f404f44d-3131-4390-bc48-a35a6143002c' failed to execute. Output: No server with a name or ID of 'f404f44d-3131-4390-bc48-a35a6143002c' exists.

Name tenant1-volume-stress-1
ID f404f44d-3131-4390-bc48-a35a6143002c
Description -
Project ID cb61ced9022c453389b1297daa98214a
Status Active
Locked False
Availability Zone nova
Created Aug. 1, 2019, 5 p.m.
Age 7 minutes
Host compute-0
Specs
Flavor Name flavor-default-size2
Flavor ID d897ad67-44f0-443d-a5bb-b94d84d54eec
RAM 1GB
VCPUs 1 VCPU
Disk 2GB
IP Addresses
tenant1-net1
    172.16.1.230
tenant1-mgmt-net
    192.168.131.58

Security Groups
default

        ALLOW IPv4 to 0.0.0.0/0
        ALLOW IPv4 from default
        ALLOW IPv4 22/tcp from 0.0.0.0/0
        ALLOW IPv6 to ::/0
        ALLOW IPv4 icmp from 0.0.0.0/0
        ALLOW IPv6 from default

Metadata

Key Name keypair-tenant1
Image None

Volumes Attached
Attached To
    vol-tenant1-volume-stress-1-1 on /dev/vda
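
For context, the migrate-then-verify loop the test performs is roughly equivalent to the sketch below. The helper and the wait step are hypothetical; only the nova live-migration and openstack server show invocations mirror the commands shown in the logs.

    import subprocess

    SERVER_ID = "f404f44d-3131-4390-bc48-a35a6143002c"

    def run(cmd):
        """Run a CLI command and return its stdout; raise if it exits non-zero (sketch only)."""
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    for iteration in range(1000):
        # Trigger the live migration (same command the test framework logs show).
        run(["nova", "live-migration", SERVER_ID])
        # ... wait for the VM status to return to ACTIVE (omitted) ...
        # Verify the server is still visible; this is the call that intermittently
        # fails with "No server with a name or ID of '...' exists."
        run(["openstack", "server", "show", SERVER_ID])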

Expected Behavior
------------------
The openstack CLI command is able to show the server before and after each migration operation.

Actual Behavior
----------------
Intermittently the openstack server show command indicates that the server does not exist (in live migration tests)

Reproducibility
---------------
Intermittent
Can happen on the 1st, 3rd, or 13th migration iteration.

System Configuration
--------------------
storage lab (PV-1)

Branch/Pull Time/Commit
-----------------------
20190731T013000Z

Last Pass
---------

Timestamp/Logs
--------------
See the attached logs for the following migration request:
req-ebcac875-beae-42f1-9cb4-02280dd775e8 live-migration Aug. 1, 2019, 5:03 p.m. a500ef102c864bba94687c9d057a9e46 -

See also the kubectl output from that time.

Test Activity
-------------
Regression

description: updated
tags: added: stx.regression
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

This is blocking automated migration testing. (The test case attempts to live migrate the instance 1000 times but hits this issue before reaching 50 migrations.)

Frank Miller (sensfan22)
tags: added: stx.distro.openstack
Changed in starlingx:
assignee: nobody → yong hu (yhu6)
Revision history for this message
yong hu (yhu6) wrote :

@Wendy, in order to unblock the automated test, I advise adding a few retries in the test scripts, or waiting a bit longer (say, 2 minutes) after sending live-migration before checking.
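
A minimal sketch of such a retry wrapper in the test scripts might look like the following; the helper name, attempt count, and delay are assumptions, not the actual framework code.

    import subprocess
    import time

    def server_show_with_retry(server_id, attempts=6, delay=20):
        """Retry 'openstack server show' until it succeeds or the retries are exhausted."""
        for attempt in range(1, attempts + 1):
            result = subprocess.run(
                ["openstack", "server", "show", server_id],
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return result.stdout
            time.sleep(delay)  # give the migration time to settle before rechecking
        raise RuntimeError(
            "'openstack server show %s' still failing after %d attempts" % (server_id, attempts)
        )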

@Boxiang, please help check the internal live-migration procedure and measure the delay between when the live-migration command is triggered and when the final state is synced up.

Changed in starlingx:
importance: Undecided → Medium
assignee: yong hu (yhu6) → Boxiang Zhu (bxzhu-5355)
Revision history for this message
Boxiang Zhu (bxzhu-5355) wrote :

@Yong and @Wendy, I have checked all the attachments.

First of all, I found an API call [0] in controller-0_20190801.171420.tgz (/var/log/containers/nova-api-osapi-964c5987f-7jf96_openstack_nova-osapi-f2e6bd9da76dd74d9403d897db06389744fc54503c30d9669edddf2f31b7121b.log). This call should be the first migration (a block migration).

But I did not find any 404 status for this server f404f44d-3131-4390-bc48-a35a6143002c in the API log. Are there only 2 controllers in this cluster?

Lastly, all files in /var/log/containers/ of compute-0_20190801.171420.tgz are 0 bytes. Is that correct?

[0] {"log":"2019-08-01 17:01:01.068 1 INFO nova.osapi_compute.wsgi.server [req-d074eb27-941e-4445-9688-561b48236ffb a500ef102c864bba94687c9d057a9e46 5bde0d3c28554c769e85c4edb545d5c5 - default default] 128.224.151.198 \"POST /v2.1/5bde0d3c28554c769e85c4edb545d5c5/servers/f404f44d-3131-4390-bc48-a35a6143002c/action HTTP/1.1\" status: 202 len: 400 time:

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

In response to your question about the compute-0 logs: the container logs are links to the corresponding pod logs; please refer to the corresponding logs in the /var/log/pods folder.

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

In addition, the server should not exhibit this behavior since it is a "live" migration. A delay before issuing the server show command should not be required.

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Frank Miller (sensfan22) wrote :

Adding stx.2.0 tag as this issue did not exist in stx.1.0 but has started to appear sometime during the stx.2.0 release.

tags: added: stx.2.0
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
Revision history for this message
Bob Church (rchurch) wrote :

I was discussing this with Frank and Wendy; I saw a similar issue when running a sanity test on an AIO-SX.

I observed a test failure that reported the following:

CLI 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show 1820414d-cf2d-46c1-877c-3a98a1b18be3' failed to execute.

Output: No server with a name or ID of '1820414d-cf2d-46c1-877c-3a98a1b18be3' exists.

In this scenario, a reboot -f was performed on the host containing 5 VMs. During the period after the reboot, the Kubernetes OpenStack pods were starting and reaching a steady-state condition, and the 5 VMs were being restarted.

I suspect that in my scenario we were experiencing resource constraints on the system, probably CPU and/or disk, which resulted in inconsistent API results.

I suspect this type of LP falls into the performance-tuning pile of LPs. Resolving these will more than likely need to be done on H/W with additional instrumentation to identify bottlenecks: top, schedtop, iostat, etc.

Revision history for this message
Boxiang Zhu (bxzhu-5355) wrote :

@yong and @bob, as described by Bob, it looks like a bottleneck of physical resources such as CPU and/or disk. Can we set this issue to Invalid, or is there something else you need me to do for it? Thanks.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
yong hu (yhu6) wrote :

@Boxiang, could you help verify this issue after we get OpenStack Train?

When running the test, please keep watching system resources on the "worker" nodes on which the VMs are expected to be scheduled by Nova.

If there is still a potential resource bottleneck, we need data (such as per-core CPU rate, memory usage, and so on) to support planning the next step.
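
A lightweight sampling sketch that could gather that data on the worker nodes during the test run is shown below; it assumes psutil is available on the node, and the file name, interval, and duration are placeholders.

    import csv
    import time

    import psutil  # assumed to be available on the worker node

    def sample_resources(outfile="worker_usage.csv", interval=5, duration=3600):
        """Periodically record per-core CPU usage and memory usage to a CSV file."""
        with open(outfile, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["timestamp", "per_core_cpu_percent", "mem_percent"])
            deadline = time.time() + duration
            while time.time() < deadline:
                per_core = psutil.cpu_percent(interval=interval, percpu=True)
                mem = psutil.virtual_memory().percent
                writer.writerow([time.time(), per_core, mem])

    if __name__ == "__main__":
        sample_resources()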

Revision history for this message
Boxiang Zhu (bxzhu-5355) wrote :

@Yong, I'm sorry, but I have no resources (physical or virtual servers) with which to do the stress testing. Can @Wendy help us with this test?

Thanks :)

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

I have installed the same HW lab with a recent load (2019-12-01_20-00-00) and run the same automated test case. I now suspect new behavior related to LP-1841189.

The instance survived several live migrations but now fails the ping test after iteration 8
(originating host was compute-1 and target host was compute-6).

...
[2019-12-04 19:36:42,818] 164 INFO MainThread test_migrate.test_migrate_stress:: Live migration iter8
[2019-12-04 19:36:46,801] 1288 INFO MainThread vm_helper.live_migrate_vm:: Live migrating VM 358526fa-0f85-45e9-b246-63be77384e3d from compute-1 started.
[2019-12-04 19:36:46,802] 1289 INFO MainThread vm_helper.live_migrate_vm:: nova live-migration 358526fa-0f85-45e9-b246-63be77384e3d
[2019-12-04 19:36:48,847] 1300 INFO MainThread vm_helper.live_migrate_vm:: Waiting for VM status change to MIGRATING with best effort
[2019-12-04 19:36:51,184] 1761 INFO MainThread vm_helper.wait_for_vm_status:: VM status has reached MIGRATING
[2019-12-04 19:36:51,184] 1309 INFO MainThread vm_helper.live_migrate_vm:: Waiting for VM status change to original state ACTIVE
[2019-12-04 19:37:11,844] 1316 INFO MainThread vm_helper.live_migrate_vm:: Live migrate vm 358526fa-0f85-45e9-b246-63be77384e3d completed
[2019-12-04 19:37:13,951] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM 358526fa-0f85-45e9-b246-63be77384e3d successfully migrated from compute-1 to compute-6
[2019-12-04 19:37:13,951] 1484 INFO MainThread ssh.get_natbox_client:: Getting NatBox Client...
[2019-12-04 19:37:16,074] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'358526fa-0f85-45e9-b246-63be77384e3d': ['192.168.131.50']}
[2019-12-04 19:37:16,074] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.131.50 from host 128.224.186.181
[2019-12-04 19:37:28,281] 2563 INFO MainThread network_helper.ping_server:: retry in 3 seconds
[2019-12-04 19:37:31,284] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.131.50 failed.
[2019-12-04 19:37:31,284] 1926 INFO MainThread vm_helper._ping_vms:: Ping unsuccessful from vm (logged in via 128.224.186.181): {'192.168.131.50': 100}

Name tenant1-volume-stress-1
ID 358526fa-0f85-45e9-b246-63be77384e3d
    tenant1-net10 172.16.10.135
    tenant1-mgmt-net 192.168.131.50

tenant1-volume-stress-1:~# ip add
[ 3752.763497] -bash[1597]: HISTORY: PID=1597 UID=0 ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:77:6e:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.131.50/27 brd 192.168.131.63 scope global dynamic eth0
       valid_lft 82655sec preferred_lft 82655sec
    inet6 fe80::f816:3eff:fe77:6e55/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:1a:98:e7 brd ff:ff:ff:ff:ff:ff
    inet 172.16.10.13...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Closing this issue. The outstanding issue encountered relates to LP-1841189.

tags: removed: stx.retestneeded
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Triaged → Invalid