Nova resize/restart results in guest ending up in inconsistent state with Neutron

Bug #1323658 reported by Matt Riedemann on 2014-05-27
This bug affects 13 people
Affects | Importance | Assigned to
OpenStack Compute (nova) | Medium | Dariusz Smigiel
neutron | Medium | Dariusz Smigiel

Bug Description

Was looking at this when investigating bug 1310852, so that might be a duplicate of this, but the "Public network connectivity check failed" message doesn't show up in the logs for that bug, so opening a new one.

This is also maybe related to or regressions of bug 1194026 and/or bug 1253896.

The error in the console log:

2014-05-27 13:34:49.707 | 2014-05-27 13:33:24,369 Creating ssh connection to '172.24.4.110' as 'cirros' with public key authentication
2014-05-27 13:34:49.707 | 2014-05-27 13:33:24,491 Failed to establish authenticated ssh connection to cirros@172.24.4.110 ([Errno 111] Connection refused). Number attempts: 1. Retry after 2 seconds.
2014-05-27 13:34:49.707 | 2014-05-27 13:33:27,162 Failed to establish authenticated ssh connection to cirros@172.24.4.110 ([Errno 111] Connection refused). Number attempts: 2. Retry after 3 seconds.
2014-05-27 13:34:49.707 | 2014-05-27 13:33:32,049 starting thread (client mode): 0x9e9cf10L
2014-05-27 13:34:49.707 | 2014-05-27 13:33:32,050 EOF in transport thread
2014-05-27 13:34:49.707 | 2014-05-27 13:33:32,051 Public network connectivity check failed
2014-05-27 13:34:49.707 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops Traceback (most recent call last):
2014-05-27 13:34:49.707 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops File "tempest/scenario/test_network_advanced_server_ops.py", line 119, in _check_public_network_connectivity
2014-05-27 13:34:49.707 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops should_connect=should_connect)
2014-05-27 13:34:49.708 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops File "tempest/scenario/manager.py", line 779, in _check_vm_connectivity
2014-05-27 13:34:49.708 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops linux_client.validate_authentication()
2014-05-27 13:34:49.708 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops File "tempest/common/utils/linux/remote_client.py", line 53, in validate_authentication
2014-05-27 13:34:49.708 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops self.ssh_client.test_connection_auth()
2014-05-27 13:34:49.708 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops File "tempest/common/ssh.py", line 150, in test_connection_auth
2014-05-27 13:34:49.708 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops connection = self._get_ssh_connection()
2014-05-27 13:34:49.708 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops File "tempest/common/ssh.py", line 75, in _get_ssh_connection
2014-05-27 13:34:49.708 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops timeout=self.channel_timeout, pkey=self.pkey)
2014-05-27 13:34:49.708 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops File "/usr/local/lib/python2.7/dist-packages/paramiko/client.py", line 242, in connect
2014-05-27 13:34:49.709 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops t.start_client()
2014-05-27 13:34:49.709 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops File "/usr/local/lib/python2.7/dist-packages/paramiko/transport.py", line 346, in start_client
2014-05-27 13:34:49.709 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops raise e
2014-05-27 13:34:49.709 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops EOFError
2014-05-27 13:34:49.709 | 2014-05-27 13:33:32.051 10354 TRACE tempest.scenario.test_network_advanced_server_ops

Logstash query:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiUHVibGljIG5ldHdvcmsgY29ubmVjdGl2aXR5IGNoZWNrIGZhaWxlZFwiIEFORCB0YWdzOmNvbnNvbGUiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MDExOTk0MTI1Mzd9

664 hits in 7 days, check and gate, master and stable/icehouse branches.

Matt Riedemann (mriedem) on 2014-05-27
tags: added: gate-failure testing
Matt Riedemann (mriedem) wrote :

Hmm, "Public network connectivity check failed" shows up in both basic and advanced server ops scenario tests, very similar _check_public_network_connectivity methods (that should probably be consolidated).
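
A consolidated helper might look like the sketch below. This is purely illustrative (the function name, parameters, and error message format are assumptions, not the actual tempest API), but it shows how the two near-identical methods could collapse into one:

```python
# Hypothetical consolidated helper for the basic and advanced server ops
# scenario tests; the name and signature are illustrative, not tempest's API.
def check_public_network_connectivity(linux_client, ip, should_connect=True,
                                      msg=None):
    """Validate SSH reachability of a floating IP, raising a uniform error."""
    if not should_connect:
        return
    try:
        # validate_authentication() opens an SSH session and authenticates.
        linux_client.validate_authentication()
    except Exception as exc:
        message = 'Public network connectivity check failed to %s (%s)' % (ip, exc)
        if msg:
            message = '%s: %s' % (msg, message)
        raise AssertionError(message)
```

Both scenario tests could then call this one helper, so the "Public network connectivity check failed" fingerprint would come from a single code path.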

Matt Riedemann (mriedem) wrote :

This also shows up all over the tempest logs when this fails:

DEBUG paramiko.transport [-] EOF in transport thread _log /usr/local/lib/python2.7/dist-packages/paramiko/transport.py:1269

That looks like the same as Ubuntu test harness bug 1158224, so maybe just an underlying paramiko limitation in the version we use in the gate?

Kyle Mestery (mestery) wrote :

mriedem: How can we verify whether this is a paramiko problem?

Sean Dague (sdague) wrote :

Extremely doubtful that it's a paramiko issue. The EOF is normal for paramiko; it's just confusing.

This kind of bug actually has the same trace as bug 1253896.
It pretty much says: the network is not working, please have a look at what's going on.

As the failure frequency apparently spiked, I'm OK with having another catch-all bug.
If I find a better root cause, I'll amend the bug and the e-r query.

Kyle Mestery (mestery) wrote :

Salvatore, Armando and myself have been trying to root cause this one all week. It's re-creatable locally, and appears to be a timing issue, though we haven't tracked it down yet 100%. Hoping to do that today.

I don't know if these are red herrings or hallucinations, but I've noticed a few things.

The neutron jobs always fail on two tests: those which do stop/start and resize for an instance.
Reboot, rebuild, suspend, and pause never cause failures.

So a first attempt would be to conditionally skip these tests and see if the failures disappear or move to some other test. In the former case we'd need to investigate the server actions; in the latter case we'd have an indirect confirmation of timing issues.

I also noticed that the fingerprint is neutron specific, so I decided to check whether there were similar nova-network failures.
I found a few, although they seem to have started more recently than neutron's. The query below only checks for gate failures in the full job with the infamous signature of bug 1253896:

message:"SSHTimeout: Connection to the" AND message:"via SSH timed out" AND build_queue:"gate" AND build_name:"gate-tempest-dsvm-full"
http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiU1NIVGltZW91dDogQ29ubmVjdGlvbiB0byB0aGVcIiBBTkQgbWVzc2FnZTpcInZpYSBTU0ggdGltZWQgb3V0XCIgQU5EIGJ1aWxkX3F1ZXVlOlwiZ2F0ZVwiIEFORCBidWlsZF9uYW1lOlwiZ2F0ZS10ZW1wZXN0LWRzdm0tZnVsbFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDAyMDQ3ODU2NTg3LCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

I have been scavenging failures for instance console logs. Unfortunately none were found. However, extending the search to other nova-net jobs failing in the gate, I found one. In that console log the instance hung at startup, which would justify the crazy behaviour we're seeing in neutron failures:

http://logs.openstack.org/12/94812/6/gate/gate-grenade-dsvm-partial-ncpu/901f7dd/console.html#_2014-06-04_05_34_54_072

However, one data point has no significance at all.
Looking for similar failures beyond just the gate queue (and excluding those occurring on heat jobs, as they use Fedora images):
http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiU1NIVGltZW91dDogQ29ubmVjdGlvbiB0byB0aGVcIiBBTkQgbWVzc2FnZTpcInZpYSBTU0ggdGltZWQgb3V0XCIgQU5EIGJ1aWxkX3F1ZXVlOlwiZ2F0ZVwiIEFORCBidWlsZF9uYW1lOlwiZ2F0ZS10ZW1wZXN0LWRzdm0tZnVsbFwiIiwiZmllbGRzIjpbXSwib2Zmc2V0IjowLCJ0aW1lZnJhbWUiOiI2MDQ4MDAiLCJncmFwaG1vZGUiOiJjb3VudCIsInRpbWUiOnsidXNlcl9pbnRlcnZhbCI6MH0sInN0YW1wIjoxNDAyMDQ3ODU2NTg3LCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

I can find a 100% correlation between the timer error in the instance and job failures. Still, this is not enough to establish that these timer failures are actually the root cause we're looking for, but it's definitely a reason to have console logs printed for each timeout failure.

If we decide these failures are worth looking into, what could have started triggering them? Could it be another kernel update on the machines we use to run tests? (kern.log provides no hints.)

Matt Riedemann (mriedem) wrote :

Sean Dague has a patch up in Tempest to dump server console logs on timeout failures, but those are on server state changes (e.g. resize) rather than on this network scenario path.

https://review.openstack.org/#/c/97812/

Kyle Mestery (mestery) wrote :

It's worth noting that with the change here [1] (and its subsequent dependent change), I can reproduce this. That change should dump the console, but I never see it. An example is here [2]. It tries to dump the console, and I even see the API request in the nova logs, but I don't actually see a console dump. If someone can help interpret that nova console log dump in the paste, it would greatly help me.

Also, for me, if I run a fresh devstack, I can then recreate this every time. If I then re-run devstack (without unstacking), I can never seem to recreate this. Seems to happen readily on the first run, and then never again. A datapoint at least.

And one final note: armax has pushed this change [3], which will hopefully trigger and leave the VM around for us to debug. fungi is on watch to catch this and point armax and me at it once it happens over the weekend. So fingers crossed, we can maybe see what state the guest VM is in when this happens.

[1] https://review.openstack.org/#/c/95831/
[2] http://paste.openstack.org/show/83183/
[3] https://review.openstack.org/#/c/98483/

The tempest revert https://review.openstack.org/#/c/97245/ has pretty much nullified the impact on the gate.

This is good for the gate, but does not help us nail down the root cause.
"Some sort of timing issue" is the closest thing we have at the moment.

Here's a summary of the analysis so far:
- the elastic recheck query has a fingerprint that can be matched only by neutron jobs. SSH failures have been observed also in jobs running nova-network, but it's not clear whether there is the same failure mode.
- Failures occur only on start/stop and resize tests. The other tests in the network advanced server ops scenario always seem to pass.
- The failure has been observed in upstream CI, ODL CI, and VMware CI - with exactly the same failure mode. This probably rules out any issue in neutron's agents. (VMware CI does not even run the L3 agent)
- syslog reveals the VM gets an IP even after the reboot, even though it's not reachable through ssh.
  L2 and L3 agent logs for the same interval report no changes to security groups, NAT rules, or router interfaces.
- the ssh timeout occurs because of "connection refused" (111) rather than "no route to host" (113). This could be because:
  - instance booted but ssh service is disabled (waiting for console log output on tests)
  - ssh traffic being rejected at the host (iptables drop counters suggest this is not the case)
  - floating ip acting as a responder (nat rules for the floating IP are in place, so this should not be the case)
- no errors are seen in kernel logs.
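
The errno distinction in the summary above can be probed directly. The sketch below is illustrative only (not part of tempest), and the interpretations are the hypotheses from the list, not proven causes:

```python
import errno

# Map the TCP connect() errno to the most likely failure layer, per the
# analysis above. Interpretations are hypotheses, not confirmed root causes.
def classify_ssh_failure(err):
    if err == errno.ECONNREFUSED:   # 111 on Linux
        # Something answered with a TCP RST: the instance (or the floating IP
        # NAT) is reachable, but nothing is listening on port 22.
        return 'host reachable, sshd not listening'
    if err == errno.EHOSTUNREACH:   # 113 on Linux
        # No route / no ARP answer: L2/L3 connectivity to the guest is broken.
        return 'no route to host'
    return 'other failure (errno %d)' % err
```

Since the gate failures show errno 111 rather than 113, the path to the guest works at the network layer, which points the suspicion at the guest itself (ssh not started) or at NAT answering on the guest's behalf.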

Kyle Mestery (mestery) wrote :

Salvatore, from what I can tell, there is no console dump. See my analysis in #11. The patches I was using there should have dumped the console, and I see the API call, but the data returned appears to be empty. This is peculiar to me; I'm wondering if anyone else has input on this. To me, it appears the VM is never fully running.

Adding Nova as affected project as well, from what we've observed so far we cannot rule it out as a potential culprit in this whole saga.

Kyle Mestery (mestery) wrote :

I'm now able to reproduce this locally, and in each case (either restart or resize), the guest VM itself appears not to be booted. I've added code to dump the console of the guest during the failure, and in each case the call to Nova comes back with no data in the response body (see below). This leads me to believe this is a problem with the guest after the restart/resize:

2014-06-09 16:14:32.640 30408 TRACE tempest.scenario.test_network_advanced_server_ops AssertionError: Timed out waiting for 172.24.4.16 to become reachable
2014-06-09 16:14:32.640 30408 TRACE tempest.scenario.test_network_advanced_server_ops
2014-06-09 16:14:32.642 30408 DEBUG tempest.scenario.manager [-] Console output for 86e7e749-83ec-4e39-9a39-3042213569b5 _log_console_output tempest/scenario/manager.py:397
2014-06-09 16:14:32.642 30408 DEBUG novaclient.client [-]
REQ: curl -i 'http://173.39.225.77:8774/v2/54a724dc112c482e84975fc6c1e3014e/servers/86e7e749-83ec-4e39-9a39-3042213569b5/action' -X POST -H "X-Auth-Project-Id: TestNetworkAdvancedServerOps-1411494319" -H "User-Agent: python-novaclient" -H "Content-Type: application/json" -H "Accept: application/json" -H "X-Auth-Token: MIIRxAYJKoZIhvcNAQcCoIIRtTCCEbECAQExCTAHBgUrDgMCGjCCEBoGCSqGSIb3DQEHAaCCEAsEghAHeyJhY2Nlc3MiOiB7InRva2VuIjogeyJpc3N1ZWRfYXQiOiAiMjAxNC0wNi0wOVQxNjowNDoyNS41MTM2ODkiLCAiZXhwaXJlcyI6ICIyMDE0LTA2LTA5VDE3OjA0OjI1WiIsICJpZCI6ICJwbGFjZWhvbGRlciIsICJ0ZW5hbnQiOiB7ImRlc2NyaXB0aW9uIjogIlRlc3ROZXR3b3JrQWR2YW5jZWRTZXJ2ZXJPcHMtMTQxMTQ5NDMxOS1kZXNjIiwgImVuYWJsZWQiOiB0cnVlLCAiaWQiOiAiNTRhNzI0ZGMxMTJjNDgyZTg0OTc1ZmM2YzFlMzAxNGUiLCAibmFtZSI6ICJUZXN0TmV0d29ya0FkdmFuY2VkU2VydmVyT3BzLTE0MTE0OTQzMTkifX0sICJzZXJ2aWNlQ2F0YWxvZyI6IFt7ImVuZHBvaW50cyI6IFt7ImFkbWluVVJMIjogImh0dHA6Ly8xNzMuMzkuMjI1Ljc3Ojg3NzQvdjIvNTRhNzI0ZGMxMTJjNDgyZTg0OTc1ZmM2YzFlMzAxNGUiLCAicmVnaW9uIjogIlJlZ2lvbk9uZSIsICJpbnRlcm5hbFVSTCI6ICJodHRwOi8vMTczLjM5LjIyNS43Nzo4Nzc0L3YyLzU0YTcyNGRjMTEyYzQ4MmU4NDk3NWZjNmMxZTMwMTRlIiwgImlkIjogIjEyYWVlMDc2M2E2ZTRkMDFiMTc0ZTMyMWQ0NTIxZjRiIiwgInB1YmxpY1VSTCI6ICJodHRwOi8vMTczLjM5LjIyNS43Nzo4Nzc0L3YyLzU0YTcyNGRjMTEyYzQ4MmU4NDk3NWZjNmMxZTMwMTRlIn1dLCAiZW5kcG9pbnRzX2xpbmtzIjogW10sICJ0eXBlIjogImNvbXB1dGUiLCAibmFtZSI6ICJub3ZhIn0sIHsiZW5kcG9pbnRzIjogW3siYWRtaW5VUkwiOiAiaHR0cDovLzE3My4zOS4yMjUuNzc6OTY5Ni8iLCAicmVnaW9uIjogIlJlZ2lvbk9uZSIsICJpbnRlcm5hbFVSTCI6ICJodHRwOi8vMTczLjM5LjIyNS43Nzo5Njk2LyIsICJpZCI6ICIzOTJjY2Y3MzAwYzc0ZThhYjI5ZTA2YTUyZTFlOWEwYyIsICJwdWJsaWNVUkwiOiAiaHR0cDovLzE3My4zOS4yMjUuNzc6OTY5Ni8ifV0sICJlbmRwb2ludHNfbGlua3MiOiBbXSwgInR5cGUiOiAibmV0d29yayIsICJuYW1lIjogIm5ldXRyb24ifSwgeyJlbmRwb2ludHMiOiBbeyJhZG1pblVSTCI6ICJodHRwOi8vMTczLjM5LjIyNS43Nzo4Nzc2L3YyLzU0YTcyNGRjMTEyYzQ4MmU4NDk3NWZjNmMxZTMwMTRlIiwgInJlZ2lvbiI6ICJSZWdpb25PbmUiLCAiaW50ZXJuYWxVUkwiOiAiaHR0cDovLzE3My4zOS4yMjUuNzc6ODc3Ni92Mi81NGE3MjRkYzExMmM0ODJlODQ5NzVmYzZjMWUzMDE0ZSIsICJpZCI6ICIwY2ZkMTkyMjNmYzc0YTkzYjAy
ZDI5MTg3ODkxMzc1ZSIsICJwdWJsaWNVUkwiOiAiaHR0cDovLzE3My4zOS4yMjUuNzc6ODc3Ni92Mi81NGE3MjRkYzExMmM0ODJlODQ5NzVmYzZjMWUzMDE0ZSJ9XSwgImVuZHBvaW50c19saW5rcyI6IFtdLCAidHlwZSI6ICJ2b2x1bWV2MiIsICJuYW1lIjogImNpbmRlcnYyIn0sIHsiZW5kcG9pbnRzIjogW3siYWRtaW5VUkwiOiAiaHR0cDovLzE3My4zOS4yMjUuNzc6ODc3NC92MyIsICJyZWdpb24iOiAiUmVnaW9uT25lIiwgImludGVybmFsVVJMIjog...


Kyle Mestery (mestery) wrote :

BTW, my test environment has this information:

ubuntu@ubuntu-1204-1:~/devstack$ uname -a
Linux ubuntu-1204-1 3.2.0-63-virtual #95-Ubuntu SMP Thu May 15 23:24:31 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

And this is my local.conf:

[[local|localrc]] LOGFILE=stack.sh.log
SCREEN_LOGDIR=/opt/stack/data/log
LOG_COLOR=False
#OFFLINE=True
RECLONE=yes

# DOCKER
#VIRT_DRIVER=docker

disable_service n-net
enable_service q-svc
enable_service q-dhcp
enable_service q-l3
enable_service q-meta
enable_service q-vpn
enable_service q-metering
enable_service q-lbaas
enable_service q-fwaas
enable_service neutron
enable_service tempest

HOST_IP=<redacted>
#enable_service opendaylight

# ODL WITH ML2
Q_PLUGIN=ml2

enable_service q-agt

HOST_NAME=$(hostname)
SERVICE_HOST_NAME=${HOST_NAME}
SERVICE_HOST=<redacted>

VNCSERVER_PROXYCLIENT_ADDRESS=<redacted>
VNCSERVER_LISTEN=0.0.0.0

MYSQL_HOST=$SERVICE_HOST
RABBIT_HOST=$SERVICE_HOST
GLANCE_HOSTPORT=$SERVICE_HOST:9292
KEYSTONE_AUTH_HOST=$SERVICE_HOST
KEYSTONE_SERVICE_HOST=$SERVICE_HOST

MYSQL_PASSWORD=mysql
RABBIT_PASSWORD=rabbit
SERVICE_TOKEN=service
SERVICE_PASSWORD=admin
ADMIN_PASSWORD=admin

# For "ssh timeout" debugging
API_RATE_LIMIT=False
FORCE_CONFIG_DRIVE=always
TEMPEST_HTTP_IMAGE=http://127.0.0.1/
#CINDER_PERIODIC_INTERVAL=10
#CEILOMETER_PIPELINE_INTERVAL=15
Q_USE_DEBUG_COMMAND=True

[[post-config|/etc/neutron/plugins/ml2/ml2_conf.ini]]
[agent]
minimize_polling=True
ovsdb_monitor_respawn_interval=2

Attila Fazekas (afazekas) wrote :

It looks like two boot-up failure situations exist:
1. https://bugs.launchpad.net/cirros/+bug/1312199
The console output contains "MP-BIOS bug".

(an extra no_timer_check kernel parameter can skip the buggy test)

2. Empty console output.
It would be nice to have a 'virsh dump' from this situation.

Attila Fazekas (afazekas) wrote :

I did not find any evidence that neutron did something wrong here [1],
and I did not find any evidence that the instance started after os-stop and os-start.

[1] http://logs.openstack.org/78/98178/1/check/check-tempest-dsvm-neutron/fbea859/console.html.gz

Kyle Mestery (mestery) wrote :

Thanks for looking at the logs for this Attila. Your analysis is the same as mine: This bug is due to the guest VM not being started or running after a restart or resize operation.

Removed Neutron from affected, changed title of this bug based on analysis.

summary: - SSH EOFError - Public network connectivity check failed
+ Nova resize/restart results in guest ending up in inconsistent state
no longer affects: neutron
Phil Day (philip-day) wrote :

I wonder if what we're seeing here is the filesystem for the guest sometimes being corrupted, because Nova does a power-off for stop/resize without giving the guest a chance to shut down. (Maybe grasping at straws.)

I have a patch up for review that fixes this for stop if you want to grab it and see if it helps:
https://review.openstack.org/#/c/68942/

Hi Phil,

This is one of the theories I am trying to confirm.
Unfortunately the failing test suite did not report console log output, which is a shame.
Some user-driven reproductions of the bug revealed an empty console log - which might be consistent with your hypothesis.
However, using logstash I observed the VM always taking an IP after reboot, meaning that the file system was not corrupted. (This can be observed from syslog.txt.gz in the test logs.)

One might then point the finger at the l3 agent; however:
- l3 agent logs have no event at all between VM shutoff and subsequent destroy after failure
- the same failure was observed also on VMware minesweeper for NSX, which does not use the l3 agent

I think we should reason on what triggered the bug on the gate.
That was a simple patch for verifying a guest was active and reachable before doing further tests on it.
That patch, at the end of the day, did two things:
- ensure the VM went into active state, and the floating ip was wired
- add a bit of delay before performing the actual test

The difference is that before that patch, the shutoff action was performed most of the time even before the VM reached ACTIVE status.
I'm not sure if this might explain anything.
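
The first effect of that patch, waiting for the server to reach ACTIVE before acting on it, is essentially a poll loop like the sketch below (illustrative only; `show_server` stands in for whatever client call the test suite actually uses):

```python
import time

# Sketch of the "ensure the VM went into active state" wait the patch added
# before exercising the server. The show_server callable is hypothetical.
def wait_for_status(show_server, server_id, status='ACTIVE',
                    timeout=300, interval=1, sleep=time.sleep):
    deadline = time.time() + timeout
    while time.time() < deadline:
        body = show_server(server_id)
        if body['status'] == status:
            return body
        if body['status'] == 'ERROR':
            raise RuntimeError('server %s went to ERROR' % server_id)
        sleep(interval)
    raise RuntimeError('timed out waiting for %s to reach %s'
                       % (server_id, status))
```

The second effect, the extra delay this polling introduces, is the part that could plausibly have shifted a race window: without the wait, the shutoff often raced the boot itself.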

Kyle Mestery (mestery) wrote :

Now that we have console logs thanks to some tempest patches merging, we can see the guest doesn't even boot all the way. See the failed test here [1] for this patch [2].

[1] http://logs.openstack.org/92/96792/18/gate/gate-tempest-dsvm-neutron-pg-2/54420bc/console.html
[2] https://review.openstack.org/#/c/96792/

Joe Gordon (jogo) wrote :

looks like this is happening in neutron-full the most

Joe Gordon (jogo) on 2014-09-09
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Joe Gordon (jogo) on 2014-09-11
Changed in nova:
importance: High → Critical
milestone: none → juno-rc1
Joe Gordon (jogo) on 2014-09-18
Changed in neutron:
status: New → Confirmed
summary: Nova resize/restart results in guest ending up in inconsistent state
+ with Neutron
Ken'ichi Ohmichi (oomichi) wrote :

I investigated this problem based on the latest failure http://logs.openstack.org/59/120159/1/gate/gate-tempest-dsvm-neutron/be422ac/ , and Tempest could not get the server console log [1] again, as Kyle noted in comment #21.
In this case, scheduling of the server started at 18:57:10 and the server became ACTIVE at 18:57:21.

According to the nova-cpu log, the server creation seems successful, with the message "Instance spawned successfully.",
and there is nothing special in either libvirtd.txt.gz or kern_log.txt.gz in the above period.
I guessed this problem was a server creation failure, but I could not find evidence for that, nor the reason why we are facing this problem on the Neutron gate.

[1]: https://github.com/openstack/tempest/blob/master/tempest/scenario/manager.py#L705

Ken'ichi Ohmichi (oomichi) wrote :

One more point:
My previous investigation was based on a failure of test_server_connectivity_reboot.
In addition, the failure happened at the server creation, which calls self._setup_network_and_servers() before any actions.

Setting the target to RC1 for neutron too, to ensure neutron devs' eyes are on this bug as well.

Changed in neutron:
milestone: none → juno-rc1
importance: Undecided → Critical
Adam Gandelman (gandelman-a) wrote :

So I'm not sure whether this is the same bug or not, but we just started running the 'check-tempest-dsvm-ironic-parallel-nv' job in the Ironic pipeline yesterday, and we are seeing a percentage of failures in test_network_advanced_server_ops @ test_server_connectivity_reboot. In our case:

* Original server boots
* First ssh connectivity check passes (via floating ip)
* Server reboots, become active
* Second ssh connectivity check fails:
   - Connection is made to the server, but _get_ssh_connection() times out ~6 min. (!) later

The symptoms in our case look consistent across all failures, and it certainly smells like something wrong in the guest. Unfortunately, Ironic does not support getting the console log, so I'm in the dark there.

2014-09-23 19:30:55,769 29820 INFO [paramiko.transport] Authentication (publickey) successful!
2014-09-23 19:30:55,771 29820 INFO [tempest.common.ssh] ssh connection to cirros@172.24.4.39 successfuly created
2014-09-23 19:30:56,069 29820 INFO [tempest.common.rest_client] Request (TestNetworkAdvancedServerOps:test_server_connectivity_reboot): 202 POST http://127.0.0.1:8774/v2/403a6c74810c47ea9b109f7f07e14c90/servers/788eaed6-284c-4b95-bef0-b08251eef043/action 0.117s
2014-09-23 19:30:56,168 29820 INFO [tempest.common.rest_client] Request (TestNetworkAdvancedServerOps:test_server_connectivity_reboot): 200 GET http://127.0.0.1:8774/v2/403a6c74810c47ea9b109f7f07e14c90/servers/788eaed6-284c-4b95-bef0-b08251eef043 0.099s
2014-09-23 19:30:57,283 29820 INFO [tempest.common.rest_client] Request (TestNetworkAdvancedServerOps:test_server_connectivity_reboot): 200 GET http://127.0.0.1:8774/v2/403a6c74810c47ea9b109f7f07e14c90/servers/788eaed6-284c-4b95-bef0-b08251eef043 0.098s
2014-09-23 19:30:57,297 29820 INFO [tempest.common.waiters] State transition "REBOOT/rebooting" ==> "REBOOT/reboot_started" after 1 second wait
2014-09-23 19:30:58,402 29820 INFO [tempest.common.rest_client] Request (TestNetworkAdvancedServerOps:test_server_connectivity_reboot): 200 GET http://127.0.0.1:8774/v2/403a6c74810c47ea9b109f7f07e14c90/servers/788eaed6-284c-4b95-bef0-b08251eef043 0.104s
2014-09-23 19:30:59,516 29820 INFO [tempest.common.rest_client] Request (TestNetworkAdvancedServerOps:test_server_connectivity_reboot): 200 GET http://127.0.0.1:8774/v2/403a6c74810c47ea9b109f7f07e14c90/servers/788eaed6-284c-4b95-bef0-b08251eef043 0.096s
2014-09-23 19:30:59,531 29820 INFO [tempest.common.waiters] State transition "REBOOT/reboot_started" ==> "ACTIVE/None" after 3 second wait
2014-09-23 19:30:59,531 29820 INFO [tempest.scenario.manager] Tenant networks not configured to be reachable.
2014-09-23 19:30:59,531 29820 DEBUG [tempest.scenario.manager] checking network connections to IP 172.24.4.39 with user: cirros
2014-09-23 19:30:59,541 29820 INFO [tempest.common.ssh] Creating ssh connection to '172.24.4.39' as 'cirros' with public key authentication
2014-09-23 19:30:59,548 29820 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2014-09-23 19:36:39,266 29820 ERROR [tempest.scenario.manager] Public network connectivity check failed
2014-09-23 19:36:39.266 29820 TRACE tempest.scenario.man...


Michael Still (mikal) wrote :

Is anyone on the nova side working on this bug? If not, we might need to drop it from rc1.

(I'm not saying it's not important, just that I'm not sure we can justify holding rc1 for it if no one is working on it.)

Kyle Mestery (mestery) wrote :

I've heard from Salvatore that Aaron is working on this one, so assigning it to him for now unless I hear otherwise.

Changed in neutron:
assignee: nobody → Aaron Rosen (arosen)
Matt Riedemann (mriedem) wrote :

In the nova meeting today dansmith said that he worked with arosen on this, and arosen thinks it's something in tempest that's causing this, so it shouldn't be a problem in the real world; removing from rc1.

Changed in nova:
milestone: juno-rc1 → none

The issue started to show up in Icehouse recently (a week or two ago?) in a very painful way. It's very hard to merge anything into Neutron because of this, since the failure rate is very high. Nova also suffers a lot because it runs multiple neutron jobs. Other projects are affected, but not as hard (fewer neutron jobs per run => less chance to fail in the gate).

It would be beneficial to find out when exactly it started to fail in Icehouse, to check patches around that time in devstack/master and neutron/icehouse.

Matt Riedemann (mriedem) wrote :

Logstash only goes back 10 days, so 9/17 isn't when it showed up probably, it's just how far back we go. I've made that mistake before. :)

Matt, thanks for clearing up my ignorance! :)
That said, the issue only recently started to show up so frequently in the Icehouse branches, so there was probably some change in the stable branches that triggered it.

Kyle Mestery (mestery) wrote :

Similar to comment #30, removing from RC1 for Neutron.

Changed in neutron:
milestone: juno-rc1 → kilo-1
Aaron Rosen (arosen) wrote :

I don't think that neutron has anything to do with this bug, because nova never unplugs the vif (and connectivity was tested and working before the resize occurred).

Aaron Rosen (arosen) wrote :

For this failed run:

http://logs.openstack.org/06/123106/2/gate/gate-tempest-dsvm-neutron-full/1924e8c/logs/

After the resize the instance did come up and made the same metadata queries:

 2014-09-25 16:36:58.607 23696 INFO eventlet.wsgi.server [req-2695eb5f-5327-47f4-a541-29fb0abf8493 None] 10.100.0.2,<local> - - [25/Sep/2014 16:36:58] "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" 200 119 0.045631

 2014-09-25 16:37:28.402 23696 INFO eventlet.wsgi.server [req-2695eb5f-5327-47f4-a541-29fb0abf8493 None] 10.100.0.2,<local> - - [25/Sep/2014 16:37:28] "GET /2009-04-04/meta-data/placement/availability-zone HTTP/1.1" 200 119 0.035326

Aaron Rosen (arosen) on 2014-09-30
Changed in neutron:
assignee: Aaron Rosen (arosen) → nobody
Sean Dague (sdague) wrote :

I am currently pretty suspicious that we have 3 chains for the same IP still sitting around by the time we get to this test - http://logs.openstack.org/31/122931/2/check/check-tempest-dsvm-neutron-pg-2/b540280/console.html.gz#_2014-09-25_16_55_37_692

My current guess is that we are leaking rules because there are races in the rules constructor, and that after the guest comes down and back up something is making inbound rules not actually work.

Marios Andreou (marios-b) wrote :

My attempt at a summary so far:

mestery @ 2014-06-25 comment #6 [1] suggests "It's re-creatable locally, and appears to be a timing issue"

mestery @ 2014-06-06 comment #9 [2]: "can recreate with a fresh devstack" (but only on the first run); references the test consolidation patch at [3], which merged

mestery @ #13 [4] can reproduce it locally - from the trace can see failed tempest.scenario.test_network_advanced_server_ops

ihar @ #33 [5] suggests the root cause is something common to branches from the time this started to be seen in the gate.

The offending tests were skipped at [6] (merged Oct 1st), but I can't find a reference to that review here (even though the review does have a partial-bug link to here). I unskipped them, but on my devstack box:

./run_tempest.sh tempest.scenario.test_network_advanced_server_ops

runs 6 tests clean. So, does anyone have a reliable repro for this on a devstack box?

thanks, marios

[1] https://bugs.launchpad.net/neutron/+bug/1323658/comments/6
[2] https://bugs.launchpad.net/neutron/+bug/1323658/comments/9
[3] https://review.openstack.org/#/c/95831/
[4] https://bugs.launchpad.net/neutron/+bug/1323658/comments/13
[5] https://bugs.launchpad.net/neutron/+bug/1323658/comments/33
[6] https://review.openstack.org/#/c/125150/

Mehdi Abaakouk (sileht) wrote :

I guess I have hit this issue on a real running cluster.

I have done some investigation to find a difference between instances that lose their network interfaces after a hard reboot and those that don't.
When I say 'lose their network interfaces', I mean the network device is no longer present in the guest or in the libvirt configuration.

On a bugged instance:

We can see that networks is empty in 'nova list', but the interface is still attached:

    # nova list --name XXXXXXXXXXXX
    +--------------------------------------+---------------------+--------+------------+-------------+----------+
    | ID | Name | Status | Task State | Power State | Networks |
    +--------------------------------------+---------------------+--------+------------+-------------+----------+
    | 940be621-97a9-409a-8c31-7c0c9c8afbfd | XXXXXX | ACTIVE | - | Running | |
    +--------------------------------------+---------------------+--------+------------+-------------+----------+
    # nova interface-list 940be621-97a9-409a-8c31-7c0c9c8afbfd
    +------------+--------------------------------------+--------------------------------------+---------------------------------------+-------------------+
    | Port State | Port ID | Net ID | IP addresses | MAC Addr |
    +------------+--------------------------------------+--------------------------------------+---------------------------------------+-------------------+
    | ACTIVE | cdfb0e9f-d0cd-4ca7-9954-0fce25680905 | 07487477-8cbd-4d2a-b549-713a964ddb51 | XXXXXXXXXXXXXXXXX | fa:16:3e:fc:78:a3 |
    +------------+--------------------------------------+--------------------------------------+---------------------------------------+-------------------+

So I looked into the db, and it seems that something has emptied the "instance_info_caches.network_info" column for some of my instances:

    mysql> select * from instance_info_caches where id = 174;
    +---------------------+---------------------+------------+-----+--------------+--------------------------------------+---------+
    | created_at          | updated_at          | deleted_at | id  | network_info | instance_uuid                        | deleted |
    +---------------------+---------------------+------------+-----+--------------+--------------------------------------+---------+
    | 2014-09-22 13:59:11 | 2014-10-20 05:39:48 | NULL       | 174 | []           | 940be621-97a9-409a-8c31-7c0c9c8afbfd | 0       |
    +---------------------+---------------------+------------+-----+--------------+--------------------------------------+---------+
    1 row in set (0.00 sec)
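As a hedged illustration of the symptom (not nova code; the function name is hypothetical): the network_info column stores a JSON list of VIFs, so an affected instance is one whose cached value deserializes to an empty list, as in the row above.

```python
import json


def info_cache_is_empty(network_info_json):
    """Return True when a cached network_info value holds no VIF entries.

    instance_info_caches.network_info stores a JSON list of VIFs;
    an affected instance's column reads '[]'.
    """
    try:
        cached = json.loads(network_info_json) if network_info_json else []
    except ValueError:
        # Treat an unparseable cache as effectively empty.
        return True
    return len(cached) == 0


# The row above has network_info = '[]', so its cache is empty:
print(info_cache_is_empty("[]"))  # True
```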

Changed in nova:
assignee: nobody → yong sheng gong (gongysh)
status: Confirmed → In Progress
yong sheng gong (gongysh) wrote :

When I was at UnitedStack, we ran into this problem. I've tentatively submitted my patch; let's see if it mitigates or even fixes this. In addition, we disabled the heal instance info cache periodic task by setting heal_instance_info_cache_interval to -1.
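For reference, the workaround described here is a one-line nova.conf change. heal_instance_info_cache_interval is a real nova option (default 60 seconds); the value -1 is taken from the comment above, and the surrounding file content is illustrative only:

```ini
[DEFAULT]
# Disable the _heal_instance_info_cache periodic task, per the
# workaround described in this comment (default is 60 seconds).
heal_instance_info_cache_interval = -1
```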

Joe Gordon (jogo) wrote :

Not seeing this as much in the gate; not critical anymore.

Changed in nova:
importance: Critical → High
Wang Junqing (wangjunqing) wrote :

I am sorry for changing yong sheng gong to nobody.

Changed in nova:
assignee: yong sheng gong (gongysh) → nobody
Kyle Mestery (mestery) wrote :

Moving down to High and out of Kilo-1 until someone is assigned this bug.

Changed in neutron:
milestone: kilo-1 → none
importance: Critical → High
Changed in nova:
status: In Progress → Confirmed
Joe Gordon (jogo) wrote :

this has been the top gate bug for a long time. Lowering the priority because no one wants to work on it is backwards.

Changed in neutron:
importance: High → Critical
milestone: none → kilo-3
Changed in nova:
milestone: none → kilo-3
importance: High → Critical
Kyle Mestery (mestery) on 2015-02-24
Changed in neutron:
assignee: nobody → Kyle Mestery (mestery)
Changed in nova:
importance: Critical → Medium
importance: Medium → Critical
Matt Riedemann (mriedem) wrote :

Trying to read through this again (there are a lot of comments). For a while it sounded like this was happening more for reboot, and we've had a long-standing problem with reboot where we don't test soft reboot in tempest because we can't actually tell when a soft reboot happens or when it falls back to hard reboot; see bug 1014647. That's really more of a test sanity issue, though. Stepping back, it sounds like we are really just hearing about a problem bringing up a guest after stop/start (resize) and/or ssh'ing into that guest.

If the problem is ssh, and we think it's due to missing network_info in the instance_info_cache, as suggested in comments 41 and 42, we could add some diagnostic tracing to tempest on failure: get the network info from the instance and check for inconsistencies, e.g. comment 41, where no network is listed but interfaces are attached.
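That diagnostic could be as simple as the following sketch (function and parameter names are hypothetical, not tempest's API), cross-checking a server's reported networks against its attached interfaces:

```python
def diagnose_network_inconsistency(server_networks, interfaces):
    """Flag the inconsistency from comment 41: ports are attached on the
    neutron side, but the server's Networks field (fed by the instance
    info cache) is empty.

    server_networks: dict of network name -> list of IPs (as in `nova show`)
    interfaces: list of attached port dicts (as in `nova interface-list`)
    """
    if interfaces and not server_networks:
        return ("instance has %d attached port(s) but an empty network "
                "cache" % len(interfaces))
    return None
```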

There have been races in the neutronv2 API code in nova around refreshing the info cache; I'm wondering if there is something in the resize/reboot paths in the compute manager where we need to refresh the info_cache from neutron, and that refresh is missing today.

We could also look at restoring https://review.openstack.org/#/c/134689/, but changing it per Dan Smith's comments: we don't need to refresh everything, just the instance_info_cache. I can take a look at that.

Matt Riedemann (mriedem) wrote :

Also, per comment 40, we still have the two tests skipped in test_network_advanced_server_ops.py. We could push a WIP patch to unskip them and see how badly things fail in order to get a recreate, then push a nova change that tries to address the problem, using the Depends-On tag in the tempest revert so it only runs with the nova change, and see if it helps.

Matt Riedemann (mriedem) wrote :

When this fails in the gate today, it's happening on this test:

tempest.scenario.test_network_advanced_server_ops.TestNetworkAdvancedServerOps.test_server_connectivity_reboot

e.g.:

http://logs.openstack.org/97/152997/2/gate/gate-grenade-dsvm-ironic-sideways/229ce51/console.html#_2015-03-03_19_14_56_542

Matt Riedemann (mriedem) wrote :

One thing I'd be interested in: on initial server create with the libvirt driver, we wait for the callback from neutron on VIF plugging. Is that same code path used for resize/migrate/reboot? I'm wondering if, when things pass, we're just getting lucky that the ports are still attached/associated with the instance on the neutron side after a resize/migration. In the reboot case, it might be something completely different, like the guest not coming up or not being ready even though the ports are still attached after the reboot.

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.openstack.org/134689
Reason: This has had a -1 since November, so it's abandoned.

Matt Riedemann (mriedem) wrote :

If there is something in here related to vif plugging callbacks with neutron and we're not waiting long enough, we could add some logging in resize_instance and _post_live_migration like was added to _delete_instance here:

https://github.com/openstack/nova/commit/272aca6926f50b584dab77fb2d1c29fc8bdbdb9e

Matt Riedemann (mriedem) wrote :

This is also doing unnecessary work with neutron if we're resizing to the same host: the host hasn't changed in that case, so there is really nothing to update in the port:

http://git.openstack.org/cgit/openstack/nova/tree/nova/network/neutronv2/api.py#n1592
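The guard being suggested is straightforward. A minimal sketch, assuming a neutron-client-like object (`port_client.update_port` stands in for the real neutron client call; the structure is illustrative, not nova's actual code):

```python
def update_port_binding_for_instance(port_client, ports, host):
    """Rebind ports to `host`, skipping ports already bound there.

    On a resize to the same host, binding:host_id is unchanged, so
    there is nothing to tell neutron.
    """
    updated = []
    for port in ports:
        if port.get("binding:host_id") == host:
            # Same host: no-op, skip the neutron API call.
            continue
        port_client.update_port(port["id"],
                                {"port": {"binding:host_id": host}})
        updated.append(port["id"])
    return updated
```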

Matt Riedemann (mriedem) wrote :

This should probably also be INFO instead of WARNING:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/libvirt/driver.py#n681

Given it's basically expected and happens a lot in normal gate runs: http://goo.gl/R5Uesa

Matt Riedemann (mriedem) wrote :

Regarding waiting for a neutron event callback after we've migrated (resized): we don't wait for any events back from neutron, because vifs_already_plugged is set to True:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/libvirt/driver.py#n6358

This doesn't necessarily seem correct. The things that call unplug_vifs are _cleanup_resize:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/libvirt/driver.py#n957

which is called from confirm_migration:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/libvirt/driver.py#n6415

That only unplugs vifs if the host has changed, though. For the resize tests in the gate with tempest, the host is the same, so vifs wouldn't be unplugged on a resize to the same host, but it still seems like we have a bug waiting to happen here.

The other place is post live migration:

http://git.openstack.org/cgit/openstack/nova/tree/nova/virt/libvirt/driver.py#n5927

Talking with dansmith in IRC: apparently in finish_migration we pass vifs_already_plugged=True even if the host has changed, because neutron will not send the vif plugged event twice (or for a resize/migrate), even when the host changed. When the neutron callback code was first added and nova waited on that event, it stalled out in the gate because nova was waiting for an event from neutron that would never come. We should at least document something in the code that explains this, and have the neutron cores confirm and explain why.
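The policy described above can be sketched as a small decision function (the names here are hypothetical; in the real driver this is the vifs_already_plugged flag passed down when creating the new domain):

```python
def vif_events_to_wait_for(vifs, host_changed,
                           neutron_resends_plug_events=False):
    """Decide which network-vif-plugged events to wait for after a
    resize/migrate.

    Per the discussion above, neutron does not re-send the plug event
    for a resize/migrate even when the host changes, so waiting would
    stall until timeout. Only wait when neutron will actually send them.
    """
    if not (host_changed and neutron_resends_plug_events):
        # Equivalent to vifs_already_plugged=True: wait for nothing.
        return []
    return [("network-vif-plugged", vif["id"]) for vif in vifs]
```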

Related fix proposed to branch: master
Review: https://review.openstack.org/161934

Reviewed: https://review.openstack.org/161763
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=b09262414a26e849fff45699169781bb9924bddf
Submitter: Jenkins
Branch: master

commit b09262414a26e849fff45699169781bb9924bddf
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 5 08:12:49 2015 -0800

    neutron: add logging during nw info_cache refresh when port is gone

    When doing a resize/migrate/reboot, we want to make sure that the
    network info we're getting is current and nothing was lost in the
    process, so add some logging when what was in the instance_info_cache is
    not considered associated with the instance from neutron's point of view
    during a refresh.

    Related-Bug: #1323658

    Change-Id: I0a8d8dcabf0b2c8f0e4552178e8cb99113640b11

Reviewed: https://review.openstack.org/161933
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=cc8b8dc826f460104b30ceee67ebc1f8cc41e198
Submitter: Jenkins
Branch: master

commit cc8b8dc826f460104b30ceee67ebc1f8cc41e198
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 5 14:11:18 2015 -0800

    neutron: check for same host in _update_port_binding_for_instance

    If we're doing a resize to the same host, like all resize tests in the
    gate, the host hasn't changed so there is no point in updating the port
    binding host in neutron so add a check for that case.

    Related-Bug: #1323658

    Change-Id: Ieb5ade398da8c11b29a3fa83b01ecf14e5e1f5b7

Reviewed: https://review.openstack.org/161934
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e4e1c6a2f060a91899549b9643132abde27d033e
Submitter: Jenkins
Branch: master

commit e4e1c6a2f060a91899549b9643132abde27d033e
Author: Matt Riedemann <email address hidden>
Date: Thu Mar 5 14:44:00 2015 -0800

    libvirt: add comment for vifs_already_plugged=True in finish_migration

    Given we unplug VIFs in _cleanup_resize if the host has changed, it's
    confusing to pass vifs_already_plugged=True in finish_migration without
    some explanation of why we don't check to see if the host has changed
    before deciding what the value of vifs_already_plugged should be when
    creating the new domain and plugging VIFs again.

    After some discussion with Neutron cores, this is the explanation I've
    come up with, albeit not a great one, and indicates a gap in some of the
    nova/neutron integration story.

    Related-Bug: #1323658

    Change-Id: I982b6bd0650f29f34c213b6eebbe066f7bd65815

Kyle Mestery (mestery) on 2015-03-10
Changed in neutron:
assignee: Kyle Mestery (mestery) → nobody
Matt Riedemann (mriedem) wrote :

https://review.openstack.org/#/c/161768/ has been rechecked 8 times and hasn't hit the failures, so maybe this was already fixed/mitigated by something else since October.

Matt Riedemann (mriedem) wrote :

Dropping from critical since this isn't failing after several rechecks: https://review.openstack.org/#/c/161768/

Changed in nova:
milestone: kilo-3 → none
importance: Critical → Medium
Changed in neutron:
importance: Critical → Medium
milestone: kilo-3 → none
Matt Riedemann (mriedem) wrote :

Bug 1326183 could be related here, we might be hitting a weird sync issue with the _heal_instance_info_cache periodic task.

Changed in neutron:
assignee: nobody → Dariusz Smigiel (smigiel-dariusz)

Retested this with the latest devstack configuration. Based on the description in another bug (a duplicate of this one: https://bugs.launchpad.net/nova/+bug/1364588), I'm not able to reproduce the problem.

Resized about 20 times, on both running and stopped servers. Everything was OK every time: no "ERROR" state and no loss of connectivity.
Logstash doesn't show anything similar to this issue either.

Closing as works-for-me; the fix was probably already released.

If anyone has a different experience, please reopen with additional info.

Changed in neutron:
status: Confirmed → Fix Released
Changed in nova:
assignee: nobody → Dariusz Smigiel (smigiel-dariusz)
status: Confirmed → Fix Released
melanie witt (melwitt) wrote :

If this doesn't appear to be a problem anymore and we can't cite a patch or change that fixes it, I think it's more appropriate to mark it as Invalid rather than Fix Released. Please reopen if needed.

Changed in nova:
status: Fix Released → Invalid

@Melanie, OK. Thank you for the clarification.
I'll also change Neutron from Fix Released to Invalid.

Eh... I'm not able to do this for Neutron, so someone else with greater permissions will have to.
