ServerAddressesTestXML.test_list_server_addresses FAIL

Bug #1210483 reported by Attila Fazekas
This bug affects 10 people
Affects                    Status    Importance  Assigned to  Milestone
OpenStack Compute (nova)   Invalid   Undecided   Unassigned
neutron                    Invalid   Medium      Unassigned

Bug Description

2013-08-09 09:05:28.205 | ======================================================================
2013-08-09 09:05:28.220 | FAIL: tempest.api.compute.servers.test_server_addresses.ServerAddressesTestXML.test_list_server_addresses[gate,smoke]
2013-08-09 09:05:28.231 | ----------------------------------------------------------------------
2013-08-09 09:05:28.232 | _StringException: Traceback (most recent call last):
2013-08-09 09:05:28.232 | File "/opt/stack/new/tempest/tempest/api/compute/servers/test_server_addresses.py", line 56, in test_list_server_addresses
2013-08-09 09:05:28.232 | self.assertTrue(len(addresses) >= 1)
2013-08-09 09:05:28.233 | File "/usr/lib/python2.7/unittest/case.py", line 420, in assertTrue
2013-08-09 09:05:28.233 | raise self.failureException(msg)
2013-08-09 09:05:28.233 | AssertionError: False is not true

http://logs.openstack.org/80/38980/2/check/gate-tempest-devstack-vm-neutron/1e116fa/

Revision history for this message
Sean Dague (sdague) wrote :

This is another instance of neutron racing on allocating IP addresses, and our tests failing because the server is active before it has a valid IP. There may be other bugs related to this one.

Changed in neutron:
status: New → Confirmed
importance: Undecided → Critical
tags: added: l3-ipam-dhcp
Thierry Carrez (ttx)
Changed in neutron:
milestone: none → havana-3
Revision history for this message
yong sheng gong (gongysh) wrote :

IMHO:

There are two threads trying to update instance info_cache:
1. periodic task:
2013-08-09 09:01:43.965 30251 DEBUG nova.compute.manager [-] [instance: 2b92028f-fee4-43d3-b55a-80ced1dc3ff1] Updated the info_cache for instance _heal_instance_info_cache /opt/stack/new/nova/nova/compute/manager.py:4109

2. instance creation process:
    @refresh_cache
    def allocate_for_instance(self, context, instance, **kwargs):

2013-08-09 09:01:43.124 30251 DEBUG nova.network.neutronv2.api [-] allocate_for_instance() for ServerAddressesTestXML-instance1227692404 allocate_for_instance /opt/stack/new/nova/nova/network/neutronv2/api.py:169
2013-08-09 09:01:43.708 30251 DEBUG neutronclient.client [-] RESP:{'date': 'Fri, 09 Aug 2013 09:01:43 GMT', 'status': '200', 'content-length': '611', 'content-type': 'application/json; charset=UTF-8', 'content-location': 'http://127.0.0.1:9696/v2.0/ports.json?tenant_id=b4c67f546ea94bba8427b1157a2b1f42&device_id=2b92028f-fee4-43d3-b55a-80ced1dc3ff1'} {"ports": [{"status": "DOWN", "binding:host_id": "devstack-1376034070", "name": "", "admin_state_up": true, "network_id": "23071a9a-1432-4d75-805d-4fbbf4b2b967", "tenant_id": "b4c67f546ea94bba8427b1157a2b1f42", "binding:vif_type": "ovs", "device_owner": "compute:None", "binding:capabilities": {"port_filter": true}, "mac_address": "fa:16:3e:0b:57:ad", "fixed_ips": [{"subnet_id": "53159554-9944-4983-9f3d-ed17e53e3ca5", "ip_address": "10.1.0.13"}], "id": "368091ff-b7d7-4485-859b-1b4bb211778a", "security_groups": ["e959cf62-9a51-4645-b98b-ea2523627123"], "device_id": "2b92028f-fee4-43d3-b55a-80ced1dc3ff1"}]}

The API used by this tempest test relies on the info_cache updated by these two threads. The timing is not deterministic, but the info_cache eventually becomes consistent.

So for this smoke test, I suggest we wait for up to two cycles of CONF.heal_instance_info_cache_interval before asserting on the addresses.
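
A minimal sketch of that suggestion (not code from the bug): a poll helper a tempest test could call, assuming the list_addresses() call used by this test and an interval that mirrors nova's heal_instance_info_cache_interval; the helper name and defaults are illustrative.

    import time

    HEAL_INTERVAL = 60  # assumed to mirror nova's heal_instance_info_cache_interval


    def wait_for_server_addresses(servers_client, server_id,
                                  interval=HEAL_INTERVAL, cycles=2, poll=5):
        """Poll the addresses API for up to two heal cycles before failing."""
        deadline = time.time() + cycles * interval
        while time.time() < deadline:
            resp, addresses = servers_client.list_addresses(server_id)
            if addresses:
                return addresses
            time.sleep(poll)
        raise AssertionError("server %s never reported any addresses" % server_id)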

Hua Zhang (zhhuabj)
Changed in neutron:
assignee: nobody → Hua Zhang (zhhuabj)
Revision history for this message
Hua Zhang (zhhuabj) wrote :
Hua Zhang (zhhuabj)
affects: neutron → tempest
Changed in tempest:
milestone: havana-3 → none
Changed in tempest:
status: Confirmed → In Progress
Revision history for this message
Attila Fazekas (afazekas) wrote :

That does not sound good.

It looks like nova, when using neutron, gives end users stale information without any indication that the info is not fresh.
I wonder whether this information is refreshed before a floating-ip-association request is sent to neutron?

A VM can exist without a known address; how can I distinguish between just waiting for some internal process and genuinely not having an address?

IMHO the address should be updated before the VM reaches the ACTIVE status or at the first address query after it is ACTIVE.

Hua Zhang (zhhuabj)
Changed in nova:
assignee: nobody → Hua Zhang (zhhuabj)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/43279

Changed in nova:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/43279
Committed: http://github.com/openstack/nova/commit/c51f23834db599558098e38a7b4ccd7ef1c49a95
Submitter: Jenkins
Branch: master

commit c51f23834db599558098e38a7b4ccd7ef1c49a95
Author: zhhuabj <email address hidden>
Date: Thu Aug 22 20:11:08 2013 +0800

    Should finish allocating network before VM reaches ACTIVE

    This patch fixes failures, like the tempest one in this bug,
    caused by asynchronous network allocation. Nova uses
    NetworkInfoAsyncWrapper to allocate the network asynchronously,
    which allows it to start querying for network info before the
    info is actually needed. Nova also invokes the refresh_cache
    decorator to update the instance info_cache after executing the
    allocate_for_instance method.

    However, the nova REST API queries the info_cache directly, so
    it may return empty network info before the cache has been
    updated asynchronously. As a result a VM can appear to exist
    without a known address, and the caller cannot distinguish
    "still waiting on some internal process" from "this VM will
    never have an address". Network allocation should therefore
    finish before the VM reaches the ACTIVE status. See the bug
    description for more detail.

    BTW, almost all create-server-related unit tests already cover
    this case.

    Closes-Bug: #1210483
    Change-Id: Iab542df1be7d5c9661d717ec596d2a37b975169a
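
A minimal sketch of the idea behind this change (not the merged patch itself): block on the asynchronous allocation before the instance is reported ACTIVE. NetworkInfoAsyncWrapper.wait() is a real nova method; the surrounding function name and the set_active callback are illustrative only.

    def wait_for_network_then_activate(context, instance, network_info,
                                       set_active):
        # network_info is assumed to be nova's NetworkInfoAsyncWrapper; wait()
        # blocks until the background allocate_for_instance() greenthread has
        # finished (and, via @refresh_cache, updated the instance info_cache).
        network_info.wait(do_raise=True)
        # Only now is it safe to report the server ACTIVE, because
        # "GET /servers/<id>/ips" reads the info_cache directly.
        set_active(context, instance)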

Changed in nova:
status: In Progress → Fix Committed
Changed in nova:
milestone: none → havana-rc1
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-rc1 → 2013.2
Revision history for this message
David Kranz (david-kranz) wrote :

This is marked as Fix Released but it seems to have happened again recently here: https://review.openstack.org/#/c/50052/

Revision history for this message
Attila Fazekas (afazekas) wrote :

This is not a tempest bug.

The network info should be available when it is queried.
Now nova should fill the nw_info before reporting an ACTIVE state. Tempest only queries the addresses of an ACTIVE instance.

Changed in tempest:
status: In Progress → Incomplete
importance: Critical → Undecided
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

Looks like it starts showing up again on 11/9, unless my query is not restrictive enough:

http://bit.ly/19MyhEq

Revision history for this message
Matt Riedemann (mriedem) wrote :

So the original 'fix' for this bug was merged on 9/18, and David Kranz reported seeing it again on 10/22. This change, which touched the network info cache code (which has other bugs breaking the check/gate queues this week), was merged on 10/14, so maybe it is the culprit:

https://review.openstack.org/51401

Sean Dague (sdague)
Changed in neutron:
status: New → Triaged
importance: Undecided → Critical
Sean Dague (sdague)
no longer affects: tempest
Changed in nova:
status: Fix Released → Triaged
Changed in neutron:
status: Triaged → Confirmed
Changed in nova:
status: Triaged → Confirmed
Revision history for this message
Sean Dague (sdague) wrote :

This is really not a tempest issue, it's a neutron race from what I can figure out (possibly a nova interaction as well).

Revision history for this message
Hua Zhang (zhhuabj) wrote :

I believe the patch https://review.openstack.org/#/c/43279/ fixes the async network allocation problem on the nova side.
If the issue still exists, it may be because neutron could not provide network service at that time, so I'm not sure whether this patch (https://review.openstack.org/#/c/57509) fixes it.

Thierry Carrez (ttx)
Changed in neutron:
milestone: none → icehouse-1
Revision history for this message
Hua Zhang (zhhuabj) wrote :

The patch https://review.openstack.org/#/c/57509 was merged on Nov 21, 2013; we can watch whether the same failure has been reported since that day.

Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-1 → icehouse-2
Revision history for this message
Joe Gordon (jogo) wrote :

Is there anything in the nova or neutron logs that we can use as a fingerprint for this bug? If so, we can add it to elastic-recheck.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Changed in neutron:
assignee: nobody → Eugene Nikanorov (enikanorov)
Revision history for this message
Jay Pipes (jaypipes) wrote :

Looking at the log file that Matt R referenced above, I did some digging around and there's a traceback in there around not being able to set the network info cache:

2013-11-21 23:15:52.403 ERROR nova.network.api [req-b2745c05-f148-4a09-aa34-ffa8d7d9d1cd demo demo] [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] Failed storing info cache
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] Traceback (most recent call last):
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] File "/opt/stack/new/nova/nova/network/api.py", line 74, in update_instance_cache_with_nw_info
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] nw_info = api._get_instance_nw_info(context, instance)
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] File "/opt/stack/new/nova/nova/network/api.py", line 49, in wrapper
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] res = f(self, context, *args, **kwargs)
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] File "/opt/stack/new/nova/nova/network/neutronv2/api.py", line 456, in _get_instance_nw_info
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] nw_info = self._build_network_info_model(context, instance, networks)
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] File "/opt/stack/new/nova/nova/network/neutronv2/api.py", line 984, in _build_network_info_model
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] context, instance['uuid'])['info_cache']['network_info']
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] File "/opt/stack/new/nova/nova/conductor/api.py", line 74, in instance_get_by_uuid
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] columns_to_join)
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] File "/opt/stack/new/nova/nova/conductor/rpcapi.py", line 165, in instance_get_by_uuid
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] return cctxt.call(context, 'instance_get_by_uuid', **kwargs)
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] File "/opt/stack/new/nova/nova/rpcclient.py", line 85, in call
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] return self._invoke(self.proxy.call, ctxt, method, **kwargs)
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] File "/opt/stack/new/nova/nova/rpcclient.py", line 63, in _invoke
2013-11-21 23:15:52.403 4036 TRACE nova.network.api [instance: d3bd703c-a3f4-4e83-8ab0-a2f3e603ec95] return cast_or_call(ctxt, msg, **self.kwargs)...

Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-2 → icehouse-3
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

No occurrences for the last 3 weeks. Lowering to 'Medium'

Changed in neutron:
importance: Critical → Medium
tags: added: gate-failure
Revision history for this message
Alan Pevec (apevec) wrote :

> No occurrences for the last 3 weeks. Lowering to 'Medium'

Was there a fix on master? If so, please backport; it was seen in stable/havana: http://logs.openstack.org/14/67214/3/check/check-tempest-dsvm-neutron-pg-isolated/9287c4d/logs/testr_results.html.gz

tags: added: havana-backport-potential
Revision history for this message
Hua Zhang (zhhuabj) wrote : AUTO: Hua BJ Zhang is out of the office (returning 02/14/2014)

I am out of the office until 02/14/2014.

ZhangHua will be away for Spring Festival from Jan 29 to Feb 14; for anything urgent, reach me via 158-110-220-56. I wish everyone a happy holiday.

Note: This is an automated response to your message "[Bug 1210483] Re:
ServerAddressesTestXML.test_list_server_addresses FAIL" sent on 02/02/2014
7:01:15.

This is the only notification you will receive while this person is away.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

Thanks Matt,
I investigated the log you pointed to.

The summary:
Problem server_id: 78fe9cfd-1f0b-4d0a-8164-4b752e759246
2014-02-15 16:52:03.587: (tempest.log): received a response of "create server" API
2014-02-15 16:52:03.743: (n-api.log): from Neutron: {"ports": []}
2014-02-15 16:52:04.878: (n-api.log): from Neutron: {"ports": [{"status": "DOWN", "name": "", "allowed_address_pairs": [] ..
2014-02-15 16:52:06.088: (n-api.log): from Neutron: {"ports": [{"status": "DOWN", "name": "", "allowed_address_pairs": [] ..
2014-02-15 16:52:07.232: (n-api.log): from Neutron: {"ports": [{"status": "DOWN", "name": "", "allowed_address_pairs": [] ..
2014-02-15 16:52:08.348: (n-api.log): from Neutron: {"ports": [{"status": "DOWN", "name": "", "allowed_address_pairs": [] ..
2014-02-15 16:52:08.371: (tempest.log): received a response of server status "ACTIVE"
2014-02-15 16:52:08,502: (testr.log): The test failed because the response of "get ips" did not contain addresses info.

According to Russell's Nova meetup summary [1], server creation involves asynchronous interaction between Nova and Neutron.
This problem seems to be caused by that.

[1]: http://lists.openstack.org/pipermail/openstack-dev/2014-February/027370.html

Revision history for this message
Matt Riedemann (mriedem) wrote :

Ken'ichi, so should we skip the test if configured with neutron? There are 44 hits on this failure in the last 7 days:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwic2VsZi5hc3NlcnRUcnVlKGxlbihhZGRyZXNzZXMpID49IDEpXCIgQU5EIGZpbGVuYW1lOlwiY29uc29sZS5odG1sXCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjEzOTI3NDQ2NTE0NjB9

Looking at the build_name, it's definitely a neutron problem. If we're looking to fix the info cache races and VIF-ready state in Icehouse, then we should be able to remove the skip on the test once that is fixed, before the end of the release.
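
A minimal sketch of that kind of skip (not the content of review 74504), assuming tempest's usual config handle CONF.service_available.neutron; self.client and self.server stand in for the resources the real test sets up.

    import testtools

    from tempest.api.compute import base
    from tempest import config

    CONF = config.CONF


    class ServerAddressesTest(base.BaseV2ComputeTest):

        @testtools.skipIf(CONF.service_available.neutron,
                          "Addresses are not reliably populated in time "
                          "when neutron is used; see bug 1210483")
        def test_list_server_addresses(self):
            # self.client / self.server come from the real test's setup
            resp, addresses = self.client.list_addresses(self.server['id'])
            self.assertTrue(len(addresses) >= 1)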

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Matt Riedemann (mriedem) wrote :

I have a patch up in tempest to skip the tests for now if using neutron: https://review.openstack.org/#/c/74504/

Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

Matt,

I agree with you.
Your patch skips only two tests in the neutron gate.
That is reasonable to me.

Thierry Carrez (ttx)
Changed in neutron:
milestone: icehouse-3 → icehouse-rc1
Changed in neutron:
milestone: icehouse-rc1 → none
tags: added: icehouse-rc-potential
Thierry Carrez (ttx)
tags: added: icehouse-backport-potential
removed: icehouse-rc-potential
Revision history for this message
Sean Dague (sdague) wrote :

Was this addressed by the admin event extension?

Changed in nova:
status: Confirmed → Incomplete
assignee: Hua Zhang (zhhuabj) → nobody
Revision history for this message
Matt Riedemann (mriedem) wrote :

Probably, let's re-enable the test now.

Revision history for this message
Matt Riedemann (mriedem) wrote :

Tempest change to enable the tests again: https://review.openstack.org/#/c/128772/

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Apparently this bug is no longer valid for neutron. The test has been re-enabled, and no failures have been encountered with it.

Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → nobody
status: Confirmed → Invalid
Revision history for this message
Ken'ichi Ohmichi (oomichi) wrote :

The XML API tests have already been removed, so this bug can no longer occur.

Changed in nova:
status: Incomplete → Fix Committed
status: Fix Committed → Invalid