Intermittently the openstack server show command indicates that the server does not exist (in live migration tests)

Bug #1838676 reported by Wendy Mitchell
Affects: StarlingX
Status: Invalid
Importance: Medium
Assigned to: Boxiang Zhu

Bug Description

Brief Description
-----------------
Intermittently the openstack server show command unexpectedly indicates that the server does not exist (in live migration tests)

Severity
--------
Major

Steps to Reproduce
------------------

TC-name: nova/test_migrate.py::test_migrate_stress[volume-1000]

1. This test case launches an instance (from a volume) and continuously live migrates the instance from compute-5 to compute-0.
(Note: the first migration attempt at 5:01 was a block migration, which is expected not to work.)

# sudo virsh list
 Id Name State
-----------------------------------
 11 instance-00000017 running

req-ebcac875-beae-42f1-9cb4-02280dd775e8 live-migration Aug. 1, 2019, 5:03 p.m. a500ef102c864bba94687c9d057a9e46 -
req-719e9c64-ff3d-46aa-b315-194230b7af5f live-migration Aug. 1, 2019, 5:02 p.m. a500ef102c864bba94687c9d057a9e46 -
req-13744855-944a-484c-a0e6-882402fe7e6e live-migration Aug. 1, 2019, 5:01 p.m. a500ef102c864bba94687c9d057a9e46 -
req-d074eb27-941e-4445-9688-561b48236ffb live-migration Aug. 1, 2019, 5:01 p.m. a500ef102c864bba94687c9d057a9e46 Error
req-94ffde51-b754-4f1c-b57a-e04c981acc53 Create Aug. 1, 2019, 5 p.m. 2b5a7c6495864c61a4a7477987c7e73a -

2. Each time a migration is performed, the same openstack command is run to show the server:
CLI 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show f404f44d-3131-4390-bc48-a35a6143002c' failed to execute. Output: No server with a name or ID of 'f404f44d-3131-4390-bc48-a35a6143002c' exists.

Name tenant1-volume-stress-1
ID f404f44d-3131-4390-bc48-a35a6143002c
Description -
Project ID cb61ced9022c453389b1297daa98214a
Status Active
Locked False
Availability Zone nova
Created Aug. 1, 2019, 5 p.m.
Age 7 minutes
Host compute-0
Specs
Flavor Name flavor-default-size2
Flavor ID d897ad67-44f0-443d-a5bb-b94d84d54eec
RAM 1GB
VCPUs 1 VCPU
Disk 2GB
IP Addresses
tenant1-net1
    172.16.1.230
tenant1-mgmt-net
    192.168.131.58

Security Groups
default

        ALLOW IPv4 to 0.0.0.0/0
        ALLOW IPv4 from default
        ALLOW IPv4 22/tcp from 0.0.0.0/0
        ALLOW IPv6 to ::/0
        ALLOW IPv4 icmp from 0.0.0.0/0
        ALLOW IPv6 from default

Metadata

Key Name keypair-tenant1
Image None

Volumes Attached
Attached To
    vol-tenant1-volume-stress-1-1 on /dev/vda
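
For context, the migrate-then-verify loop the test performs is roughly equivalent to the sketch below. The helper and the wait step are hypothetical; only the nova live-migration and openstack server show invocations mirror the commands shown in the logs.

    import subprocess

    SERVER_ID = "f404f44d-3131-4390-bc48-a35a6143002c"

    def run(cmd):
        """Run a CLI command and return its stdout; raise if it exits non-zero (sketch only)."""
        return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

    for iteration in range(1000):
        # Trigger the live migration (same command the test framework logs show).
        run(["nova", "live-migration", SERVER_ID])
        # ... wait for the VM status to return to ACTIVE (omitted) ...
        # Verify the server is still visible; this is the call that intermittently
        # fails with "No server with a name or ID of '...' exists."
        run(["openstack", "server", "show", SERVER_ID])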

Expected Behavior
------------------
The openstack CLI command is able to show the server before and after each migration operation.

Actual Behavior
----------------
Intermittently the openstack server show command indicates that the server does not exist (in live migration tests)

Reproducibility
---------------
Intermittent
Can happen on the 1st, 3rd, or 13th migration iteration.

System Configuration
--------------------
storage lab (PV-1)

Branch/Pull Time/Commit
-----------------------
20190731T013000Z

Last Pass
---------

Timestamp/Logs
--------------
See the attached logs for the following migration request:
req-ebcac875-beae-42f1-9cb4-02280dd775e8 live-migration Aug. 1, 2019, 5:03 p.m. a500ef102c864bba94687c9d057a9e46 -

See also the kubectl output from that time.

Test Activity
-------------
Regression

description: updated
tags: added: stx.regression
Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

This is blocking automated migration testing. (The test case attempts to live migrate the instance 1000 times but hits this issue before reaching 50 migrations.)

Frank Miller (sensfan22)
tags: added: stx.distro.openstack
Changed in starlingx:
assignee: nobody → yong hu (yhu6)
Revision history for this message
yong hu (yhu6) wrote :

@Wendy, in order to unblock the automated test, I advise adding a few retries in the test scripts, or waiting a bit longer (say, 2 minutes) after sending live-migration before checking.
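
A minimal sketch of such a retry wrapper in the test scripts might look like the following; the helper name, attempt count, and delay are assumptions, not the actual framework code.

    import subprocess
    import time

    def server_show_with_retry(server_id, attempts=6, delay=20):
        """Retry 'openstack server show' until it succeeds or the retries are exhausted."""
        for attempt in range(1, attempts + 1):
            result = subprocess.run(
                ["openstack", "server", "show", server_id],
                capture_output=True, text=True,
            )
            if result.returncode == 0:
                return result.stdout
            time.sleep(delay)  # give the migration time to settle before rechecking
        raise RuntimeError(
            "'openstack server show %s' still failing after %d attempts" % (server_id, attempts)
        )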

@Boxiang, please help check the internal live-migration procedure and measure the delay between when the live-migration command is triggered and when the final state is synced up.

Changed in starlingx:
importance: Undecided → Medium
assignee: yong hu (yhu6) → Boxiang Zhu (bxzhu-5355)
Revision history for this message
Boxiang Zhu (bxzhu-5355) wrote :

@Yong and @Wendy, I have checked all the attachments.

First of all, I found an API call [0] in controller-0_20190801.171420.tgz (/var/log/containers/nova-api-osapi-964c5987f-7jf96_openstack_nova-osapi-f2e6bd9da76dd74d9403d897db06389744fc54503c30d9669edddf2f31b7121b.log). This call should be the first migration (a block migration).

But I did not find any 404 status for this server f404f44d-3131-4390-bc48-a35a6143002c in the API log. Are there only 2 controllers in this cluster?

Lastly, all files in /var/log/containers/ of compute-0_20190801.171420.tgz are 0 bytes. Is that correct?

[0] {"log":"2019-08-01 17:01:01.068 1 INFO nova.osapi_compute.wsgi.server [req-d074eb27-941e-4445-9688-561b48236ffb a500ef102c864bba94687c9d057a9e46 5bde0d3c28554c769e85c4edb545d5c5 - default default] 128.224.151.198 \"POST /v2.1/5bde0d3c28554c769e85c4edb545d5c5/servers/f404f44d-3131-4390-bc48-a35a6143002c/action HTTP/1.1\" status: 202 len: 400 time:

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

In response to your question about the compute-0 logs: the container logs are links to the corresponding pod logs; please refer to the corresponding logs in the /var/log/pods folder.

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

In addition, the server should not exhibit this behavior since it is a "live" migration. A delay before issuing the server show command should not be required.

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Frank Miller (sensfan22) wrote :

Adding stx.2.0 tag as this issue did not exist in stx.1.0 but has started to appear sometime during the stx.2.0 release.

tags: added: stx.2.0
Ghada Khalil (gkhalil)
Changed in starlingx:
status: New → Triaged
Revision history for this message
Bob Church (rchurch) wrote :

I was discussing this with Frank and Wendy; I saw a similar issue when running a sanity test on an AIO-SX.

I observed a test failure that reported the following:

CLI 'openstack --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-identity-api-version 3 --os-interface internal --os-region-name RegionOne server show 1820414d-cf2d-46c1-877c-3a98a1b18be3' failed to execute.

Output: No server with a name or ID of '1820414d-cf2d-46c1-877c-3a98a1b18be3' exists.

In this scenario, a reboot -f was performed on the host containing 5 VMs. During the period after the reboot, the Kubernetes OpenStack pods were starting and reaching a steady-state condition, and the 5 VMs were being restarted.

I suspect that in my scenario we were experiencing resource constraints on the system, probably CPU and/or disk, which resulted in inconsistent API results.

I suspect this type of LP falls into the performance-tuning pile of LPs. Resolving these will more than likely need to be done on H/W with additional instrumentation to identify bottlenecks: top, schedtop, iostat, etc.

Revision history for this message
Boxiang Zhu (bxzhu-5355) wrote :

@yong and @bob, as described by Bob, it looks like a bottleneck of physical resources such as CPU and/or disk. Can we set this issue to Invalid, or is there something else you need me to do for it? Thanks.

Revision history for this message
Ghada Khalil (gkhalil) wrote :

As per agreement with the community, moving all unresolved medium priority bugs from stx.2.0 to stx.3.0

tags: added: stx.3.0
removed: stx.2.0
Revision history for this message
yong hu (yhu6) wrote :

@Boxiang, could you help verify this issue after we get OpenStack Train?

When running the test, please keep watching system resources on the "worker" nodes on which the VMs are expected to be scheduled by Nova.

If there is still a potential resource bottleneck, we need data (such as per-core CPU rate, memory usage, and so on) to support planning the next step.
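
A lightweight sampling sketch that could gather that data on the worker nodes during the test run is shown below; it assumes psutil is available on the node, and the file name, interval, and duration are placeholders.

    import csv
    import time

    import psutil  # assumed to be available on the worker node

    def sample_resources(outfile="worker_usage.csv", interval=5, duration=3600):
        """Periodically record per-core CPU usage and memory usage to a CSV file."""
        with open(outfile, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["timestamp", "per_core_cpu_percent", "mem_percent"])
            deadline = time.time() + duration
            while time.time() < deadline:
                per_core = psutil.cpu_percent(interval=interval, percpu=True)
                mem = psutil.virtual_memory().percent
                writer.writerow([time.time(), per_core, mem])

    if __name__ == "__main__":
        sample_resources()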

Revision history for this message
Boxiang Zhu (bxzhu-5355) wrote :

@Yong, I'm sorry, but I have no resources (physical or virtual servers) with which to do the stress testing. Can @Wendy help us with this test?

Thanks :)

Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

I have installed the same HW lab with a recent load (2019-12-01_20-00-00) and run the same automated test case. I now suspect new behavior related to LP-1841189.

The instance survived several live migrations but now fails the ping test after iteration 8
(originating host was compute-1 and target host was compute-6).

...
[2019-12-04 19:36:42,818] 164 INFO MainThread test_migrate.test_migrate_stress:: Live migration iter8
[2019-12-04 19:36:46,801] 1288 INFO MainThread vm_helper.live_migrate_vm:: Live migrating VM 358526fa-0f85-45e9-b246-63be77384e3d from compute-1 started.
[2019-12-04 19:36:46,802] 1289 INFO MainThread vm_helper.live_migrate_vm:: nova live-migration 358526fa-0f85-45e9-b246-63be77384e3d
[2019-12-04 19:36:48,847] 1300 INFO MainThread vm_helper.live_migrate_vm:: Waiting for VM status change to MIGRATING with best effort
[2019-12-04 19:36:51,184] 1761 INFO MainThread vm_helper.wait_for_vm_status:: VM status has reached MIGRATING
[2019-12-04 19:36:51,184] 1309 INFO MainThread vm_helper.live_migrate_vm:: Waiting for VM status change to original state ACTIVE
[2019-12-04 19:37:11,844] 1316 INFO MainThread vm_helper.live_migrate_vm:: Live migrate vm 358526fa-0f85-45e9-b246-63be77384e3d completed
[2019-12-04 19:37:13,951] 1368 INFO MainThread vm_helper.live_migrate_vm:: VM 358526fa-0f85-45e9-b246-63be77384e3d successfully migrated from compute-1 to compute-6
[2019-12-04 19:37:13,951] 1484 INFO MainThread ssh.get_natbox_client:: Getting NatBox Client...
[2019-12-04 19:37:16,074] 1666 INFO MainThread network_helper._get_net_ips_for_vms:: IPs dict: {'358526fa-0f85-45e9-b246-63be77384e3d': ['192.168.131.50']}
[2019-12-04 19:37:16,074] 2525 INFO MainThread network_helper.ping_server:: Ping 192.168.131.50 from host 128.224.186.181
[2019-12-04 19:37:28,281] 2563 INFO MainThread network_helper.ping_server:: retry in 3 seconds
[2019-12-04 19:37:31,284] 2567 WARNING MainThread network_helper.ping_server:: Ping from 128.224.186.181 to 192.168.131.50 failed.
[2019-12-04 19:37:31,284] 1926 INFO MainThread vm_helper._ping_vms:: Ping unsuccessful from vm (logged in via 128.224.186.181): {'192.168.131.50': 100}

Name tenant1-volume-stress-1
ID 358526fa-0f85-45e9-b246-63be77384e3d
    tenant1-net10 172.16.10.135
    tenant1-mgmt-net 192.168.131.50

tenant1-volume-stress-1:~# ip add
[ 3752.763497] -bash[1597]: HISTORY: PID=1597 UID=0 ip add
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:77:6e:55 brd ff:ff:ff:ff:ff:ff
    inet 192.168.131.50/27 brd 192.168.131.63 scope global dynamic eth0
       valid_lft 82655sec preferred_lft 82655sec
    inet6 fe80::f816:3eff:fe77:6e55/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether fa:16:3e:1a:98:e7 brd ff:ff:ff:ff:ff:ff
    inet 172.16.10.13...


Revision history for this message
Wendy Mitchell (wmitchellwr) wrote :

Closing this issue. The outstanding issue encountered relates to LP-1841189.

tags: removed: stx.retestneeded
Ghada Khalil (gkhalil)
Changed in starlingx:
status: Triaged → Invalid