Failed to establish authenticated ssh connection to cirros - error: [Errno 113] No route to host

Bug #1370496 reported by Matt Riedemann
This bug affects 4 people
Affects                   Status        Importance  Assigned to        Milestone
OpenStack Compute (nova)  Fix Released  Undecided   Augustina Ragwitz
neutron                   Expired       Undecided   Unassigned

Bug Description

Saw this in the gate today. I think it's separate from bug 1349617; at least the error is different and the logstash hits are different, though the root cause might be the same.

This is in both nova-network and neutron jobs.

http://logs.openstack.org/01/116001/3/gate/gate-grenade-dsvm-partial-ncpu/3dce5d7/logs/tempest.txt.gz#_2014-09-16_22_32_14_888

2014-09-16 22:32:14.888 6636 ERROR tempest.common.ssh [-] Failed to establish authenticated ssh connection to cirros@172.24.4.4 after 15 attempts
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh Traceback (most recent call last):
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh File "tempest/common/ssh.py", line 76, in _get_ssh_connection
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh timeout=self.channel_timeout, pkey=self.pkey)
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh File "/usr/local/lib/python2.7/dist-packages/paramiko/client.py", line 236, in connect
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh retry_on_signal(lambda: sock.connect(addr))
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh File "/usr/local/lib/python2.7/dist-packages/paramiko/util.py", line 278, in retry_on_signal
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh return function()
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh File "/usr/local/lib/python2.7/dist-packages/paramiko/client.py", line 236, in <lambda>
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh retry_on_signal(lambda: sock.connect(addr))
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh File "/usr/lib/python2.7/socket.py", line 224, in meth
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh return getattr(self._sock,name)(*args)
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh error: [Errno 113] No route to host
2014-09-16 22:32:14.888 6636 TRACE tempest.common.ssh
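For reference, the failure above comes out of a retry loop: the test repeatedly attempts a TCP connection to the guest and only raises once every attempt has failed. A minimal sketch of that pattern (illustrative only; the attempt count, timeout and helper name are assumptions, not tempest's actual code):

import socket
import time

import paramiko


def connect_with_retries(host, username, pkey, attempts=15, timeout=10):
    # Illustrative retry-until-connected loop; not tempest's real implementation.
    for attempt in range(1, attempts + 1):
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            # paramiko eventually calls socket.connect(), which is where
            # [Errno 113] No route to host (EHOSTUNREACH) is raised.
            client.connect(host, username=username, pkey=pkey,
                           timeout=timeout)
            return client
        except (socket.error, paramiko.SSHException):
            client.close()
            if attempt == attempts:
                raise  # surfaces as the traceback above
            time.sleep(1)

Errno 113 (EHOSTUNREACH) means the test node had no route to 172.24.4.4 at connect time, so the failure is below the SSH layer rather than an authentication problem.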

message:"_get_ssh_connection" AND message:"error: [Errno 113] No route to host" AND tags:"tempest.txt"

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiX2dldF9zc2hfY29ubmVjdGlvblwiIEFORCBtZXNzYWdlOlwiZXJyb3I6IFtFcnJubyAxMTNdIE5vIHJvdXRlIHRvIGhvc3RcIiBBTkQgdGFnczpcInRlbXBlc3QudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6ImN1c3RvbSIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJmcm9tIjoiMjAxNC0wOS0xMFQxMjo0ODowMyswMDowMCIsInRvIjoiMjAxNC0wOS0xN1QxMjo0ODowMyswMDowMCIsInVzZXJfaW50ZXJ2YWwiOiIwIn0sInN0YW1wIjoxNDEwOTU4MTU3MTY3LCJtb2RlIjoiIiwiYW5hbHl6ZV9maWVsZCI6IiJ9

63 hits in 10 days, in both the check and gate queues, all failures.

Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Sean Dague (sdague) wrote :

This needs deeper debugging before it can usefully be addressed.

Changed in nova:
status: New → Incomplete
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

The console log is garbled because it's all printed on one line (it's printed nicely for other scenario tests, but I guess this depends on the OfficialClientTest vs ScenarioTest base class, so it's probably not worth addressing since the former is being transitioned to the latter).

Anyway, the VM never boots correctly - at least in all the failure instances I've looked at.
These are the last lines of a console log:

Freeing unused kernel memory: 1188k freed
info: initramfs: up at 2.44

The failure has been observed at both the 2nd and 3rd steps, which means it does not depend on the snapshot process but rather on boot from volume. I have not seen the failure at the 1st step, but that does not automatically mean it can't happen there too.

This could be just a red herring, since this message is very common: http://logs.openstack.org/89/119589/24/check/check-tempest-dsvm-neutron-full/159af17/logs/screen-n-cpu.txt.gz#_2014-09-10_21_43_18_577

It seems, however, that the instance was stopped early, and this could explain the failed connection.

It looks like the volume does not get correctly attached to the instance; however, no anomaly is reported in the cinder logs:
http://logs.openstack.org/89/119589/24/check/check-tempest-dsvm-neutron-full/159af17/logs/screen-c-vol.txt.gz#_2014-09-10_21_42_44_822

Finally, the libvirt logs appear to report no errors. Perhaps something could be found in the qemu logs, which need to be searched for the Nova instance UUID.
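If it helps, a quick way to pull those entries is to grep the qemu logs on the compute node for the instance UUID. A rough sketch; the log path is the usual libvirt default and the UUID is a placeholder, both depend on the environment:

import glob

# Placeholder; substitute the failing instance's UUID from the nova logs.
INSTANCE_UUID = "00000000-0000-0000-0000-000000000000"

# /var/log/libvirt/qemu/ is the typical libvirt log location on devstack nodes.
for path in glob.glob("/var/log/libvirt/qemu/*.log"):
    with open(path) as log:
        for line in log:
            if INSTANCE_UUID in line:
                print("%s: %s" % (path, line.rstrip()))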

Revision history for this message
Sean Dague (sdague) wrote :

Currently one of the top gate hits

Changed in nova:
importance: High → Critical
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Changing to incomplete for neutron per Salvatore's analysis

Changed in neutron:
status: New → Incomplete
Changed in neutron:
assignee: nobody → Salvatore Orlando (salvatore-orlando)
tags: added: juno-rc-potential
Revision history for this message
Salvatore Orlando (salvatore-orlando) wrote :

Unfortunately I can't find any hint about the root cause of this bug, so I'll have to spend more time trying to reproduce it locally.
To me it seems like the VM is not able to boot from volume in some circumstances; it could be something strictly related to cirros.

Thierry Carrez (ttx)
tags: added: juno-backport-potential
removed: juno-rc-potential
Changed in neutron:
assignee: Salvatore Orlando (salvatore-orlando) → nobody
Revision history for this message
Nell Jerram (neil-jerram) wrote :

I'm seeing a failure that elastic-recheck thinks maps to this bug: http://logs.openstack.org/14/146914/1/check/check-tempest-dsvm-neutron-full/e1fee15/, for the very small Nova code change at https://review.openstack.org/#/c/146914/. If that's correct, it's a bit disheartening that this bug hasn't seen any change since last October.

My interpretation of the logs at http://logs.openstack.org/14/146914/1/check/check-tempest-dsvm-neutron-full/e1fee15/ is that the instance isn't getting its IPv4 address; the instance's console output includes:

Starting network...
udhcpc (v1.20.1) started
Sending discover...
Sending discover...
Sending discover...
No lease, failing
[...]
=== network info ===
if-info: lo,up,127.0.0.1,8,::1
if-info: eth0,up,,8,fe80::f816:3eff:fefb:eaa7

So, I wonder if anyone can throw light on two questions:

1. Assuming that DHCP _should_ be working in the failing test, and that the instance _should_ be getting its IPv4 address from DHCP, can anyone see why that is now not working?

2. Is elastic-recheck correct to map the test failure that I'm seeing to this bug?
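One way to tell this "no DHCP lease" symptom apart from the many other causes that end in an SSH failure is to scan the guest console output for the markers above. A rough sketch, where the strings checked are simply the ones from the console log quoted here, not an established elastic-recheck fingerprint:

import re


def missing_ipv4_lease(console_output):
    # True if udhcpc gave up and eth0 ended up with no IPv4 address.
    no_lease = "No lease, failing" in console_output
    # "if-info: eth0,up,,..." has an empty address field when eth0 got no
    # IPv4 address (only the link-local IPv6 address is listed).
    eth0_no_ipv4 = re.search(r"^if-info: eth0,up,,", console_output,
                             re.MULTILINE) is not None
    return no_lease and eth0_no_ipv4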

Revision history for this message
Robert Collins (lifeless) wrote :

@Neil: Boot from volume has worked if the instance console got that far. Failing to ssh in is a fairly blanket failure; lots of different actual causes end up with the same symptom.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for neutron because there has been no activity for 60 days.]

Changed in neutron:
status: Incomplete → Expired
Changed in nova:
importance: Critical → Undecided
Alan Pevec (apevec)
tags: removed: juno-backport-potential
Revision history for this message
Augustina Ragwitz (auggy) wrote :

This may have been fixed by: https://review.openstack.org/#/c/273042/

I checked the above logstash query and got zero results for the past 30 days.

Closing this as Fix Released. If this issue pops back up, please open a new bug.

Changed in nova:
status: Incomplete → Fix Released
assignee: nobody → Augustina Ragwitz (auggy)