"HTTP exception thrown: Cannot 'os-migrateLive' instance while it is in task_state migrating" in gate-grenade-dsvm-neutron-multinode-live-migration-nv

Bug #1649586 reported by Matt Riedemann
This bug affects 1 person
Affects: devstack
Status: Fix Released
Importance: Medium
Assigned to: Pawel Koniszewski
Milestone: (none)

Bug Description

Seen here:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_02_15_500

2016-12-13 03:02:15.500 8323 INFO nova.api.openstack.wsgi [req-54d8956d-3289-418b-a05f-4f85910d83d8 tempest-LiveBlockMigrationTestJSON-1803989051 tempest-LiveBlockMigrationTestJSON-1803989051] HTTP exception thrown: Cannot 'os-migrateLive' instance af219925-a0a1-4e1c-92e0-ff6f510d7cd1 while it is in task_state migrating

This causes the job to fail.

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22HTTP%20exception%20thrown%3A%20Cannot%20'os-migrateLive'%20instance%5C%22%20AND%20message%3A%5C%22while%20it%20is%20in%20task_state%20migrating%5C%22%20AND%20tags%3A%5C%22screen-n-api.txt%5C%22&from=7d

292 hits in 7 days, all failures.

Matt Riedemann (mriedem)
summary: - "HTTP exception thrown: Cannot 'os-migrateLive' instance
- af219925-a0a1-4e1c-92e0-ff6f510d7cd1 while it is in task_state
- migrating" in gate-grenade-dsvm-neutron-multinode-live-migration-nv
+ "HTTP exception thrown: Cannot 'os-migrateLive' instance while it is in
+ task_state migrating" in gate-grenade-dsvm-neutron-multinode-live-
+ migration-nv
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

This doesn't really make sense because right before the POST to live migrate the server, there is a GET to show server details and the task_state is null:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/console.html#_2016-12-13_03_07_56_455490

Body: {"server": {"OS-EXT-STS:task_state": null, "addresses": {}, "links": [{"href": "http://10.14.206.189:8774/v2.1/servers/af219925-a0a1-4e1c-92e0-ff6f510d7cd1", "rel": "self"}, {"href": "http://10.14.206.189:8774/servers/af219925-a0a1-4e1c-92e0-ff6f510d7cd1", "rel": "bookmark"}], "image": {"id": "a475d1c8-5f9b-4552-9ecf-87533ec3e85e", "links": [{"href": "http://10.14.206.189:8774/images/a475d1c8-5f9b-4552-9ecf-87533ec3e85e", "rel": "bookmark"}]}, "OS-EXT-STS:vm_state": "active", "OS-EXT-SRV-ATTR:instance_name": "instance-00000014", "OS-SRV-USG:launched_at": "2016-12-13T03:01:14.000000", "flavor": {"id": "42", "links": [{"href": "http://10.14.206.189:8774/flavors/42", "rel": "bookmark"}]}, "id": "af219925-a0a1-4e1c-92e0-ff6f510d7cd1", "user_id": "9b95e78e3c244290b6ffd698de425a2e", "OS-DCF:diskConfig": "MANUAL", "accessIPv4": "", "accessIPv6": "", "progress": 0, "OS-EXT-STS:power_state": 1, "OS-EXT-AZ:availability_zone": "nova", "metadata": {}, "status": "ACTIVE", "updated": "2016-12-13T03:01:15Z", "hostId": "da92b52d875d7401b5f39e89d04281ea08bd3f87b14598ae7fd43d1c", "OS-EXT-SRV-ATTR:host": "ubuntu-xenial-2-node-osic-cloud1-s3700-6130183", "OS-SRV-USG:terminated_at": null, "key_name": null, "OS-EXT-SRV-ATTR:hypervisor_hostname": "ubuntu-xenial-2-node-osic-cloud1-s3700-6130183", "name": "tempest-LiveBlockMigrationTestJSON-server-424389653", "created": "2016-12-13T03:01:12Z", "tenant_id": "1d976b4e688848f7ba264443a2d662c8", "os-extended-volumes:volumes_attached": [], "config_drive": ""}}

Revision history for this message
Matt Riedemann (mriedem) wrote :

It's also curious that there is no networking for this server (addresses field is empty).

Revision history for this message
Matt Riedemann (mriedem) wrote :

There is apparently no network configured:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-cpu.txt.gz#_2016-12-13_03_01_13_106

2016-12-13 03:01:13.106 9546 DEBUG nova.network.neutronv2.api [req-dab04fba-823a-4de7-845e-3e68e89cbb01 tempest-LiveBlockMigrationTestJSON-952175415 tempest-LiveBlockMigrationTestJSON-952175415] [instance: af219925-a0a1-4e1c-92e0-ff6f510d7cd1] No network configured allocate_for_instance /opt/stack/new/nova/nova/network/neutronv2/api.py:836

Revision history for this message
Matt Riedemann (mriedem) wrote :

This is in the conductor logs:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-cond.txt.gz#_2016-12-13_03_02_15_582

2016-12-13 03:02:15.582 9702 WARNING nova.scheduler.utils [req-e0927ad1-3d21-42b8-a8ec-986f1ae67c65 tempest-LiveBlockMigrationTestJSON-1803989051 tempest-LiveBlockMigrationTestJSON-1803989051] Failed to compute_task_migrate_server: Timeout while checking if we can live migrate to host: ubuntu-xenial-2-node-osic-cloud1-s3700-6130183-336995
2016-12-13 03:02:15.583 9702 WARNING nova.scheduler.utils [req-e0927ad1-3d21-42b8-a8ec-986f1ae67c65 tempest-LiveBlockMigrationTestJSON-1803989051 tempest-LiveBlockMigrationTestJSON-1803989051] [instance: af219925-a0a1-4e1c-92e0-ff6f510d7cd1] Setting instance to ACTIVE state.

So my guess is we're failing to roll back the task_state in pre-live migration, and the error coming back to the API is confusing.

Revision history for this message
Matt Riedemann (mriedem) wrote :

We also have this messaging timeout in the n-api service:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_02_15_537

2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions [req-e0927ad1-3d21-42b8-a8ec-986f1ae67c65 tempest-LiveBlockMigrationTestJSON-1803989051 tempest-LiveBlockMigrationTestJSON-1803989051] Unexpected exception in API method
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions Traceback (most recent call last):
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/openstack/extensions.py", line 338, in wrapped
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return f(*args, **kwargs)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/validation/__init__.py", line 73, in wrapper
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return func(*args, **kwargs)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/validation/__init__.py", line 73, in wrapper
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return func(*args, **kwargs)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/validation/__init__.py", line 73, in wrapper
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return func(*args, **kwargs)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/api/openstack/compute/migrate_server.py", line 100, in _migrate_live
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions disk_over_commit, host, force, async)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 164, in inner
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return function(self, context, instance, *args, **kwargs)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 172, in _wrapped
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return fn(self, context, instance, *args, **kwargs)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 145, in inner
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions return f(self, context, instance, *args, **kw)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/opt/stack/new/nova/nova/compute/api.py", line 3768, in live_migrate
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions messaging_timeout)
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions self.force_reraise()
2016-12-13 03:02:15.537 8322 ERROR nova.api.openstack.extensions File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 19...

(traceback truncated)

Revision history for this message
Matt Riedemann (mriedem) wrote :

I'm not really sure where the req-e0927ad1-3d21-42b8-a8ec-986f1ae67c65 request is coming from. As far as I can tell it's not coming from Tempest; the live migration request from Tempest is req-54d8956d-3289-418b-a05f-4f85910d83d8.

Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

I did some investigation; here is what I found. IMHO it fails because of rabbit issues. Let's analyze it with the instance with ID af219925-a0a1-4e1c-92e0-ff6f510d7cd1, using this grenade run: http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/

The first request for the VM happened at 03:01:15.451; the request ID is req-e0927ad1-3d21-42b8-a8ec-986f1ae67c65:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_01_15_451

This request ends up with a 500 error

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_02_15_537

But a few milliseconds before this error we can see another request to live migrate the same VM for the second time (which is not something the live migration test should do):

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/logs/new/screen-n-api.txt.gz#_2016-12-13_03_02_15_446

Look at the time, 03:02:15.446: it's roughly one minute after the first request. When calling urllib3 and creating the retry mechanism we do not override 'total retries'; we just create a Retry object:

https://github.com/openstack/tempest/blob/071e94c31ce2d1549ecb34df1fea7f5a6cd210a6/tempest/lib/common/http.py#L55
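
Roughly, the construction looks like this. This is a paraphrased sketch, not the actual Tempest code; the exact keyword arguments are assumptions inferred from the Retry(total=9, connect=None, read=None, redirect=5) repr that shows up in the console log:

    # Paraphrased sketch of tempest/lib/common/http.py after the linked commit;
    # the keyword arguments here are assumptions, not a verbatim copy.
    import urllib3

    retries = urllib3.util.Retry(redirect=5)      # 'total' is left at urllib3's default of 10
    http = urllib3.PoolManager(
        timeout=urllib3.util.Timeout(total=60),   # the hard 60 second timeout mentioned above
        retries=retries,
    )
    # With the urllib3 release used by this gate, a read timeout on the
    # os-migrateLive POST is treated as retryable (see the urllib3 warning in
    # the console log), so the same non-idempotent action gets sent twice.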

And here is the commit that changed Tempest's behavior so that it creates a PoolManager that times out after exactly 60 seconds:

https://github.com/openstack/tempest/commit/071e94c31ce2d1549ecb34df1fea7f5a6cd210a6

So, due to this timeout behavior and the fact that we do not change the retry logic when creating the Retry object, urllib3 retries the live migration call; that is why we see the 409 CONFLICT error and the failing test. This is also why we see request IDs that do not come from Tempest, as those requests are generated by urllib3's retries.

The real problem here is RabbitMQ and the fact that it causes some requests to time out, and this is what we should fix. We can find the same timeouts in q-agt in gate-grenade-dsvm-neutron-multinode-ubuntu-xenial, but that job is not vulnerable to this error:

http://logs.openstack.org/02/408002/2/check/gate-grenade-dsvm-neutron-multinode-ubuntu-xenial/0443a47/logs/subnode-2/old/screen-q-agt.txt.gz?level=TRACE

Revision history for this message
Matt Riedemann (mriedem) wrote :

Pawel, thanks for investigating this. It does seem odd, though, that we'd be randomly tripping up on Rabbit failures so frequently in this job; we hit this failure 360+ times in 7 days, so something in this particular job must really be tripping rabbit up.

Revision history for this message
Matt Riedemann (mriedem) wrote :

FWIW I had noticed the retries in the console log dump when this fails, e.g.:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/78de48c/console.html#_2016-12-13_03_07_56_462741

2016-12-13 03:07:56.462741 | 2016-12-13 03:07:56.462 | 2016-12-13 03:02:15,413 19290 WARNING [urllib3.connectionpool] Retrying (Retry(total=9, connect=None, read=None, redirect=5)) after connection broken by 'ReadTimeoutError("HTTPConnectionPool(host='10.14.206.189', port=8774): Read timed out. (read timeout=60)",)': /v2.1/servers/af219925-a0a1-4e1c-92e0-ff6f510d7cd1/action

Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

Well, there are a lot of RPC calls in live migration, so it's more vulnerable to rabbit issues than any other action. Before nova responds to the live migration POST action there are at least 3 RPC calls: from the API to the conductor, from the conductor to the destination, and from the destination to the source (and it looks like the source node suffers the most). Once nova hits the compute task API it changes the state of the VM, so any timeout on this action will cause conflicts due to retries. When we look at when urllib3 starts counting its timeout and when the first RPC call occurs in nova, this does not surprise me: urllib3 will always time out about 1 second before the first RPC call in nova does.
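
To make the conflict mechanism concrete, here is a simplified sketch of the kind of task-state guard that turns the retried POST into a 409. This is an illustration only; the decorator name and exception type are assumptions, not nova's actual implementation:

    import functools

    class InstanceInvalidState(Exception):
        """Stand-in for the error the API layer maps to HTTP 409."""

    def check_task_state_is_none(fn):
        # Hypothetical guard: reject the action unless task_state is unset.
        @functools.wraps(fn)
        def inner(self, context, instance, *args, **kwargs):
            if instance.task_state is not None:
                # The first POST already set task_state='migrating', so the
                # urllib3 retry of the same POST ends up here and surfaces as
                # "Cannot 'os-migrateLive' instance ... while it is in
                # task_state migrating".
                raise InstanceInvalidState(
                    "Cannot 'os-migrateLive' instance %s while it is in "
                    "task_state %s" % (instance.uuid, instance.task_state))
            return fn(self, context, instance, *args, **kwargs)
        return inner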

Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

There is one thing I don't understand. There are a lot of errors saying the AMQP server is unreachable:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/8e63e9c/logs/subnode-2/old/screen-n-cpu.txt.gz#_2016-12-13_14_09_18_649

2016-12-13 14:09:14.551 12173 DEBUG oslo_messaging._drivers.amqpdriver [req-b71314d0-c5c2-44b3-927f-072a97636c47 - -] CAST unique_id: ede0ffc923d64e9a8502b1fe772c2dcd FANOUT topic 'scheduler' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:437
2016-12-13 14:09:18.649 12173 ERROR oslo.messaging._drivers.impl_rabbit [req-070105f7-302c-4ea2-ba6c-6ed1f21cea97 tempest-LiveBlockMigrationTestJSON-1846422298 tempest-LiveBlockMigrationTestJSON-1846422298] [f253dd7c-f063-4a70-8267-9551112a3939] AMQP server on 127.0.0.1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 22 seconds. Client port: None
2016-12-13 14:09:19.291 12173 DEBUG oslo_messaging._drivers.amqpdriver [-] CALL msg_id: f9b691fb9de14ce29b59b339e0c2f355 exchange 'nova' topic 'conductor' _send /usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py:448

However, nova-compute still accepts requests through RPC; e.g., nova spawned a new instance at 14:10:46.907:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/8e63e9c/logs/subnode-2/old/screen-n-cpu.txt.gz#_2016-12-13_14_10_46_907

2016-12-13 14:10:46.907 12173 INFO nova.virt.libvirt.driver [-] [instance: 9c4a72aa-0540-470d-9446-969a7db638e7] Instance spawned successfully.
2016-12-13 14:10:46.907 12173 INFO nova.compute.manager [req-5f7a2bde-892a-46a3-910c-134443029483 tempest-LiveAutoBlockMigrationV225TestJSON-970444200 tempest-LiveAutoBlockMigrationV225TestJSON-970444200] [instance: 9c4a72aa-0540-470d-9446-969a7db638e7] Took 1.48 seconds to spawn the instance on the hypervisor.

But it never reconnected to the AMQP server after the error at 14:10:30.682:

http://logs.openstack.org/04/409904/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/8e63e9c/logs/subnode-2/old/screen-n-cpu.txt.gz#_2016-12-13_14_10_30_682

2016-12-13 14:10:30.682 12173 ERROR oslo.messaging._drivers.impl_rabbit [req-070105f7-302c-4ea2-ba6c-6ed1f21cea97 tempest-LiveBlockMigrationTestJSON-1846422298 tempest-LiveBlockMigrationTestJSON-1846422298] [f253dd7c-f063-4a70-8267-9551112a3939] AMQP server on 127.0.0.1:5672 is unreachable: [Errno 111] ECONNREFUSED. Trying again in 28 seconds. Client port: None

Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

I did some more investigation. However, no root cause has been discovered yet.

I found that the gate has been failing since the 1st of December:

http://graphite.openstack.org/render/?width=1613&height=914&_salt=1472475642.338&target=stats.zuul.pipeline.check.job.gate-grenade-dsvm-neutron-multinode-live-migration-nv.FAILURE&target=stats.zuul.pipeline.check.job.gate-grenade-dsvm-neutron-multinode-live-migration-nv.SUCCESS&from=-17days&lineWidth=10&colorList=red%2Cgreen
http://graphite.openstack.org/render/?width=1613&height=914&_salt=1472475642.338&target=stats.zuul.pipeline.check.job.gate-grenade-dsvm-neutron-multinode-live-migration-nv.FAILURE&from=-17days&lineWidth=10&colorList=red%2Cgreen

I found two changes from the 1st of December, one that ended up with SUCCESS and another that ended up with FAILURE:

SUCCESS - 4:07 PM - https://review.openstack.org/#/c/403855/

LOGS: http://logs.openstack.org/55/403855/1/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/c488037/

FAILURE - 7:35 PM - https://review.openstack.org/#/c/333272/

LOGS: http://logs.openstack.org/72/333272/21/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/65ce53c/

I did some analysis of the installed packages on subnode-2, as only subnode-2 has issues connecting to rabbit. The only difference I found is oslo.messaging: the gate succeeded with oslo.messaging==5.10.0 and failed with oslo.messaging==5.10.1. I tried to test whether oslo.messaging is the issue, but it looks like it is not:

https://review.openstack.org/#/c/411662/ - the LM grenade job failed regardless of the oslo.messaging version (5.10.0)
http://logs.openstack.org/62/411662/2/check/gate-grenade-dsvm-neutron-multinode-live-migration-nv/6d9c27a/

Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

I finally found the root cause. This patch forces grenade to set up cells v2:

https://github.com/openstack-dev/grenade/commit/ecfaa7a5a235a69a0ead8b53dd4b2807093c9cfa

After this commit the LM+grenade job started to fail every time. This is because grenade forces RABBIT_HOST to be localhost on the controller node:

https://github.com/openstack-dev/grenade/blob/b7ae1b46602e34d53b5489ba9e0a7cb73d6ea418/devstack.localrc.base#L21
https://github.com/openstack-dev/grenade/blob/b7ae1b46602e34d53b5489ba9e0a7cb73d6ea418/devstack.localrc.target#L21

So when nova configures the cells v2 deployment it uses localhost in the transport_url and saves that URL in the cell mapping. When subnode-2 wants to start the pre-live-migration steps on the newer node, it takes the cell_mapping from the database and uses a transport URL that points to localhost. Unfortunately rabbit is not running on that node, so the LM+grenade job fails every time when starting the pre-live-migration steps.
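
To illustrate the failure mode (the URL and credentials below are made-up placeholders, not values taken from this job):

    from urllib.parse import urlparse

    # What the cells v2 mapping effectively ends up storing when grenade pins
    # RABBIT_HOST to localhost (placeholder credentials/URL, an assumption):
    stored_transport_url = "rabbit://stackrabbit:secret@localhost:5672/"

    # The controller resolves 'localhost' to its own, working broker, but
    # subnode-2 reads the same cell mapping row and resolves 'localhost' to its
    # own 127.0.0.1:5672, where nothing is listening -- hence the repeated
    # "AMQP server on 127.0.0.1:5672 is unreachable: ECONNREFUSED" in its n-cpu
    # log. The stored URL needs a host that is reachable from every node, e.g.
    # the controller's IP.
    print(urlparse(stored_transport_url).hostname)  # -> 'localhost' on every node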

Changed in nova:
assignee: nobody → Pawel Koniszewski (pawel-koniszewski)
Revision history for this message
Pawel Koniszewski (pawel-koniszewski) wrote :

Here I'm trying to fix this issue: https://review.openstack.org/#/c/412922/

Changed in nova:
status: Confirmed → In Progress
Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

This devstack change is also needed: https://review.openstack.org/#/c/413595/

Revision history for this message
Markus Zoeller (markus_z) (mzoeller) wrote :

After the devstack change merged yesterday, there are no hits in Logstash anymore. Maybe we should observe it a few hours longer, and then this report can be closed as resolved.

Revision history for this message
Sean Dague (sdague) wrote :

There are no currently open reviews on this bug, changing
the status back to the previous state and unassigning. If
there are active reviews related to this bug, please include
links in comments.

Changed in nova:
status: In Progress → Confirmed
assignee: Pawel Koniszewski (pawel-koniszewski) → nobody
Matt Riedemann (mriedem)
no longer affects: nova
Changed in devstack:
status: New → Fix Released
importance: Undecided → Medium
assignee: nobody → Pawel Koniszewski (pawel-koniszewski)