"HTTP exception thrown: Cannot 'os-migrateLive' instance while it is in task_state migrating" in gate-grenade-dsvm-neutron-multinode-live-migration-nv

Bug #1649586 reported by Matt Riedemann
This bug affects 1 person
Affects: devstack
Status: Fix Released
Importance: Medium
Assigned to: Pawel Koniszewski
Milestone: (none)

Bug Description

Seen here:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_02_15_500

2016-12-13 03:02:15.500 8323 INFO nova.api.openstack.wsgi [req-54d8956d-3289-418b-a05f-4f85910d83d8 tempest-LiveBlockMigrationTestJSON-1803989051 tempest-LiveBlockMigrationTestJSON-1803989051] HTTP exception thrown: Cannot 'os-migrateLive' instance af219925-a0a1-4e1c-92e0-ff6f510d7cd1 while it is in task_state migrating

This causes the job to fail.

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22HTTP%20exception%20thrown%3A%20Cannot%20'os-migrateLive'%20instance%5C%22%20AND%20message%3A%5C%22while%20it%20is%20in%20task_state%20migrating%5C%22%20AND%20tags%3A%5C%22screen-n-api.txt%5C%22&from=7d

292 hits in 7 days, all failures.

Matt Riedemann (mriedem)
summary: - "HTTP exception thrown: Cannot 'os-migrateLive' instance
- af219925-a0a1-4e1c-92e0-ff6f510d7cd1 while it is in task_state
- migrating" in gate-grenade-dsvm-neutron-multinode-live-migration-nv
+ "HTTP exception thrown: Cannot 'os-migrateLive' instance while it is in
+ task_state migrating" in gate-grenade-dsvm-neutron-multinode-live-
+ migration-nv
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

This doesn't really make sense because right before the POST to live migrate the server, there is a GET to show server details and the task_state is null:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/console.html#_2016-12-13_03_07_56_455490

Body: {"server": {"OS-EXT-STS:task_state": null, "addresses": {}, "links": [{"href": "http://10.14.206.189:8774/v2.1/servers/af219925-a0a1-4e1c-92e0-ff6f510d7cd1", "rel": "self"}, {"href": "http://10.14.206.189:8774/servers/af219925-a0a1-4e1c-92e0-ff6f510d7cd1", "rel": "bookmark"}], "image": {"id": "a475d1c8-5f9b-4552-9ecf-87533ec3e85e", "links": [{"href": "http://10.14.206.189:8774/images/a475d1c8-5f9b-4552-9ecf-87533ec3e85e", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active", "OS-EXT-SRV-ATTR:instance_name": "instance-00000014", "OS-SRV-USG:launched_at": "2016-12-13T03:01:14.000000", "flavor": {"id": "42", "links": [{"href": "http://10.14.206.189:8774/flavors/42", "rel": "bookmark"}]}, "id": "af219925-a0a1-4e1c-92e0-ff6f510d7cd1", "user_id": "9b95e78e3c244290b6ffd698de425a2e", "OS-DCF:diskConfig": "MANUAL", "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova", "metadata": {}, "status": "ACTIVE", "updated": "2016-12-13T03:01:15Z", "hostId": "da92b52d875d7401b5f39e89d04281ea08bd3f87b14598ae7fd43d1c", "OS-EXT-SRV-ATTR:host": "ubuntu-xenial-2-node-osic-cloud1-s3700-6130183", "OS-SRV-USG:terminated_at": null, "key_name": null, "OS-EXT-SRV-ATTR:hypervisor_hostname": "ubuntu-xenial-2-node-osic-cloud1-s3700-6130183", "name": "tempest-LiveBlockMigrationTestJSON-server-424389653", "created": "2016-12-13T03:01:12Z", "tenant_id": "1d976b4e688848f7ba264443a2d662c8", "os-extended-volumes:volumes_attached": [], "config_drive": ""}}

Revision history for this message
Matt Riedemann (mriedem) wrote :

It's also curious that there is no networking for this server (addresses field is empty).

Revision history for this message
Matt Riedemann (mriedem) wrote :

There is apparently no network configured:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-cpu.txt.gz#_2016-12-13_03_01_13_106

2016-12-13 03:01:13.106 9546 DEBUG nova.network.neutronv2.api [req-dab04fba-823a-4de7-845e-3e68e89cbb01 tempest-LiveBlockMigrationTestJSON-952175415 tempest-LiveBlockMigrationTestJSON-952175415] [instance: af219925-a0a1-4e1c-92e0-ff6f510d7cd1] No network configured allocate_for_instance /opt/stack/new/nova/nova/network/neutronv2/api.py:836

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is in the conductor logs:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-cond.txt.gz#_2016-12-13_03_02_15_582

2016-12-13 03:02:15.582 9702 WARNING nova.scheduler.utils [req-e0927ad1-3d21-42b8-a8ec-986f1ae67c65 tempest-LiveBlockMigrationTestJSON-1803989051 tempest-LiveBlockMigrationTestJSON-1803989051] Failed to compute_task_migrate_server: Timeout while checking if we can live migrate to host: ubuntu-xenial-2-node-osic-cloud1-s3700-6130183-336995
2016-12-13 03:02:15.583 9702 WARNING nova.scheduler.utils [req-e0927ad1-3d21-42b8-a8ec-986f1ae67c65 tempest-LiveBlockMigrationTestJSON-1803989051 tempest-LiveBlockMigrationTestJSON-1803989051] [instance: af219925-a0a1-4e1c-92e0-ff6f510d7cd1] Setting instance to ACTIVE state.

So my guess is we're failing to roll back the task_state in pre-live migration, and the error coming back to the API is confusing.

Revision history for this message
Matt Riedemann (mriedem) wrote :

We also have this messaging timeout in the n-api service:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_02_15_537

2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions [req-e0927ad1-3d21-42b8-a8ec-986f1ae67c65 tempest-LiveBlockMigrationTestJSON-1803989051 tempest-LiveBlockMigrationTestJSON-1803989051] Unexpected exception in API method
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions Traceback (most recent call last):
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/openstack/extensions.py", line 338, in wrapped
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return f(*args, **kwargs)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/validation/__init__.py", line 73, in wrapper
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return func(*args, **kwargs)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/validation/__init__.py", line 73, in wrapper
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return func(*args, **kwargs)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/validation/__init__.py", line 73, in wrapper
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return func(*args, **kwargs)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/openstack/compute/migrate_server.py", line 100, in _migrate_live
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions disk_over_commit, host, force, async)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 164, in inner
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return function(self, context, instance, *args, **kwargs)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 172, in _wrapped
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return fn(self, context, instance, *args, **kwargs)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 145, in inner
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return f(self, context, instance, *args, **kw)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 3768, in live_migrate
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions messaging_timeout)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions self.force_reraise()
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 19...

(traceback truncated)

Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm not really sure where the req-e0927ad1-3d21-42b8-a8ec-986f1ae67c65 request is coming from. As far as I can tell it's not coming from Tempest; the live migration request from Tempest is req-54d8956d-3289-418b-a05f-4f85910d83d8.

Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

I did some investigation; here is what I found. IMHO it fails because of rabbit issues. Let's analyze it with the instance with ID af219925-a0a1-4e1c-92e0-ff6f510d7cd1, using this grenade run: http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/

The first request for the VM happened at 03:01:15.451; the request ID is req-e0927ad1-3d21-42b8-a8ec-986f1ae67c65:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_01_15_451

This request ends up with a 500 error

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_02_15_537

But a few milliseconds before this error we can see another request to live migrate the same VM for the second time (which is not something the live migration test should do):

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_02_15_446

Look at the time, 03:02:15.446: it's roughly one minute after the first request. When calling urllib3 and creating the retry mechanism we do not override 'total retries'; we just create a Retry object:

https://github.com/openstack/tempest/blob/071e94c31ce2d1549ecb34df1fea7f5a6cd210a6/tempest/lib/common/http.py#L55
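
Roughly, the construction looks like this. This is a paraphrased sketch, not the actual Tempest code; the exact keyword arguments are assumptions inferred from the Retry(total=9, connect=None, read=None, redirect=5) repr that shows up in the console log:

    # Paraphrased sketch of tempest/lib/common/http.py after the linked commit;
    # the keyword arguments here are assumptions, not a verbatim copy.
    import urllib3

    retries = urllib3.util.Retry(redirect=5)      # 'total' is left at urllib3's default of 10
    http = urllib3.PoolManager(
        timeout=urllib3.util.Timeout(total=60),   # the hard 60 second timeout mentioned above
        retries=retries,
    )
    # With the urllib3 release used by this gate, a read timeout on the
    # os-migrateLive POST is treated as retryable (see the urllib3 warning in
    # the console log), so the same non-idempotent action gets sent twice.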

And here is the commit that changed Tempest's behavior so that it creates a PoolManager that times out after exactly 60 seconds:

https://github.com/openstack/tempest/commit/071e94c31ce2d1549ecb34df1fea7f5a6cd210a6

So, due to this timeout behavior and the fact that we do not change the retry logic when creating the Retry object, urllib3 retries the live migration call; that is why we see the 409 CONFLICT error and the failing test. This is also why we see request IDs that do not come from Tempest, as those requests are generated by urllib3's retries.

The real problem here is RabbitMQ and the fact that it causes some requests to time out, and this is what we should fix. We can find the same timeouts in q-agt in gate-grenade-dsvm-neutron-multinode-ubuntu-xenial, but that job is not vulnerable to this error:

http://logs.openstack.org/02/408002/2/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/0443a47/logs/subnode-2/old/screen-q-agt.txt.gz?level=TRACE

Revision history for this message
Matt Riedemann (mriedem) wrote :

Pawel, thanks for investigating this. It does seem odd, though, that we'd be randomly tripping up on Rabbit failures so frequently in this job; we hit this failure 360+ times in 7 days, so something in this particular job must really be tripping rabbit up.

Revision history for this message
Matt Riedemann (mriedem) wrote :

FWIW I had noticed the retries in the console log dump when this fails, e.g.:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/console.html#_2016-12-13_03_07_56_462741

2016-12-13 03:07:56.462741 | 2016-12-13 03:07:56.462 | 2016-12-13 03:02:15,413 19290 WARNING [urllib3.connectionpool] Retrying (Retry(total=9, connect=None, read=None, redirect=5)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='10.14.206.189', port=8774): Read timed out. (read timeout=60)",)': /v2.1/servers/af219925-a0a1-4e1c-92e0-ff6f510d7cd1/action

Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

Well, there are a lot of RPC calls in live migration, so it's more vulnerable to rabbit issues than any other action. Before nova responds to the live migration POST action there are at least 3 RPC calls: from the API to the conductor, from the conductor to the destination, and from the destination to the source (and it looks like the source node suffers the most). Once nova hits the compute task API it changes the state of the VM, so any timeout on this action will cause conflicts due to retries. When we look at when urllib3 starts counting its timeout and when the first RPC call occurs in nova, this does not surprise me: urllib3 will always time out about 1 second before the first RPC call in nova does.
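
To make the conflict mechanism concrete, here is a simplified sketch of the kind of task-state guard that turns the retried POST into a 409. This is an illustration only; the decorator name and exception type are assumptions, not nova's actual implementation:

    import functools

    class InstanceInvalidState(Exception):
        """Stand-in for the error the API layer maps to HTTP 409."""

    def check_task_state_is_none(fn):
        # Hypothetical guard: reject the action unless task_state is unset.
        @functools.wraps(fn)
        def inner(self, context, instance, *args, **kwargs):
            if instance.task_state is not None:
                # The first POST already set task_state='migrating', so the
                # urllib3 retry of the same POST ends up here and surfaces as
                # "Cannot 'os-migrateLive' instance ... while it is in
                # task_state migrating".
                raise InstanceInvalidState(
                    "Cannot 'os-migrateLive' instance %s while it is in "
                    "task_state %s" % (instance.uuid, instance.task_state))
            return fn(self, context, instance, *args, **kwargs)
        return inner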

Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

There is one thing I don't understand. There are a lot of errors saying the AMQP server is unreachable:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/8e63e9c/logs/subnode-2/old/screen-n-cpu.txt.gz#_2016-12-13_14_09_18_649

2016-12-13 14:09:14.551 12173 DEBUG oslo_messaging._drivers.amqpdriver [req-b71314d0-c5c2-44b3-927f-072a97636c47 - -] CAST unique_id: ede0ffc923d64e9a8502b1fe772c2dcd FANOUT topic 'scheduler' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:437
2016-12-13 14:09:18.649 12173 ERROR oslo.messaging._drivers.impl_rabbit [req-070105f7-302c-4ea2-ba6c-6ed1f21cea97 tempest-LiveBlockMigrationTestJSON-1846422298 tempest-LiveBlockMigrationTestJSON-1846422298] [f253dd7c-f063-4a70-8267-9551112a3939] AMQP server on 127.0.0.1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 22 seconds. Client port: None
2016-12-13 14:09:19.291 12173 DEBUG oslo_messaging._drivers.amqpdriver [-] CALL msg_id: f9b691fb9de14ce29b59b339e0c2f355 exchange 'nova' topic 'conductor' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:448

However, nova-compute still accepts requests through RPC; e.g., nova spawned a new instance at 14:10:46.907:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/8e63e9c/logs/subnode-2/old/screen-n-cpu.txt.gz#_2016-12-13_14_10_46_907

2016-12-13 14:10:46.907 12173 INFO nova.virt.libvirt.driver [-] [instance: 9c4a72aa-0540-470d-9446-969a7db638e7] Instance spawned successfully.
2016-12-13 14:10:46.907 12173 INFO nova.compute.manager [req-5f7a2bde-892a-46a3-910c-134443029483 tempest-LiveAutoBlockMigrationV225TestJSON-970444200 tempest-LiveAutoBlockMigrationV225TestJSON-970444200] [instance: 9c4a72aa-0540-470d-9446-969a7db638e7] Took 1.48 seconds to spawn the instance on the hypervisor.

But it never reconnected to the AMQP server after the error at 14:10:30.682:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/8e63e9c/logs/subnode-2/old/screen-n-cpu.txt.gz#_2016-12-13_14_10_30_682

2016-12-13 14:10:30.682 12173 ERROR oslo.messaging._drivers.impl_rabbit [req-070105f7-302c-4ea2-ba6c-6ed1f21cea97 tempest-LiveBlockMigrationTestJSON-1846422298 tempest-LiveBlockMigrationTestJSON-1846422298] [f253dd7c-f063-4a70-8267-9551112a3939] AMQP server on 127.0.0.1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 28 seconds. Client port: None

Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

I did some more investigation. However, no root cause has been discovered yet.

I found that the gate has been failing since the 1st of December:

http://graphite.openstack.org/render/?width=1613&height=914&_salt=1472475642.338&target=stats.zuul.pipeline.check.job.gate-grenade-dsvm-neutron-multinode-live-migration-nv.FAILURE&target=stats.zuul.pipeline.check.job.gate-grenade-dsvm-neutron-multinode-live-migration-nv.SUCCESS&from=-17days&lineWidth=10&colorList=red%2Cgreen
http://graphite.openstack.org/render/?width=1613&height=914&_salt=1472475642.338&target=stats.zuul.pipeline.check.job.gate-grenade-dsvm-neutron-multinode-live-migration-nv.FAILURE&from=-17days&lineWidth=10&colorList=red%2Cgreen

I found two changes from the 1st of December, one that ended up with SUCCESS and another that ended up with FAILURE:

SUCCESS - 4:07 PM - https://review.openstack.org/#/c/403855/

LOGS: http://logs.openstack.org/55/403855/1/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/c488037/

FAILURE - 7:35 PM - https://review.openstack.org/#/c/333272/

LOGS: http://logs.openstack.org/72/333272/21/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/65ce53c/

I did some analysis of the installed packages on subnode-2, as only subnode-2 has issues connecting to rabbit. The only difference I found is oslo.messaging: the gate succeeded with oslo.messaging==5.10.0 and failed with oslo.messaging==5.10.1. I tried to test whether oslo.messaging is the issue, but it looks like it is not:

https://review.openstack.org/#/c/411662/ - the LM grenade job failed regardless of the oslo.messaging version (5.10.0)
http://logs.openstack.org/62/411662/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/6d9c27a/

Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

I finally found the root cause. This patch forces grenade to set up cells v2:

https://github.com/openstack-dev/grenade/commit/ecfaa7a5a235a69a0ead8b53dd4b2807093c9cfa

After this commit the LM+grenade job started to fail every time. This is because grenade forces RABBIT_HOST to be localhost on the controller node:

https://github.com/openstack-dev/grenade/blob/b7ae1b46602e34d53b5489ba9e0a7cb73d6ea418/devstack.localrc.base#L21
https://github.com/openstack-dev/grenade/blob/b7ae1b46602e34d53b5489ba9e0a7cb73d6ea418/devstack.localrc.target#L21

So when nova configures the cells v2 deployment it uses localhost in the transport_url and saves that URL in the cell mapping. When subnode-2 wants to start the pre-live-migration steps on the newer node, it takes the cell_mapping from the database and uses a transport URL that points to localhost. Unfortunately rabbit is not running on that node, so the LM+grenade job fails every time when starting the pre-live-migration steps.
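
To illustrate the failure mode (the URL and credentials below are made-up placeholders, not values taken from this job):

    from urllib.parse import urlparse

    # What the cells v2 mapping effectively ends up storing when grenade pins
    # RABBIT_HOST to localhost (placeholder credentials/URL, an assumption):
    stored_transport_url = "rabbit://stackrabbit:secret@localhost:5672/"

    # The controller resolves 'localhost' to its own, working broker, but
    # subnode-2 reads the same cell mapping row and resolves 'localhost' to its
    # own 127.0.0.1:5672, where nothing is listening -- hence the repeated
    # "AMQP server on 127.0.0.1:5672 is unreachable: ECONNREFUSED" in its n-cpu
    # log. The stored URL needs a host that is reachable from every node, e.g.
    # the controller's IP.
    print(urlparse(stored_transport_url).hostname)  # -> 'localhost' on every node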

Changed in nova:
assignee: nobody → Pawel Koniszewski (pawel-koniszewski)
Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

Here I'm trying to fix this issue: https://review.openstack.org/#/c/412922/

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

This devstack change is also needed: https://review.openstack.org/#/c/413595/

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

After the devstack change merged yesterday, there are no hits in Logstash anymore. Maybe we should observe it a few hours longer, and then this report can be closed as resolved.

Revision history for this message
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: Pawel Koniszewski (pawel-koniszewski) → nobody
Matt Riedemann (mriedem)
no longer affects: nova
Changed in devstack:
status: New → Fix Released
importance: Undecided → Medium
assignee: nobody → Pawel Koniszewski (pawel-koniszewski)