Comment 10 for bug 1349617

melanie witt (melwitt) wrote : Re: test_volume_boot_pattern fails in grenade with "SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer"

Here is what I’ve gathered so far. I looked through a few failed builds and focused on one [0] that uses the metadata service rather than config drive as it gives more clues.

1. The messages about “userdata” in the guest console don’t seem related to the failure; they only stand out because the guest console is dumped to the logs only when the build fails. I think it always says "/run/cirros/datasource/data/user-data was not '#!' or executable" or "no userdata for datasource" when no “userdata” is being used, and none is. The ssh keys are part of the metadata in these tests, not the userdata portion of the metadata.

2. In the metadata service log [1], there are zero calls to e.g. "GET /2009-04-04/meta-data/user-data HTTP/1.1", which further suggests that userdata is not involved.

3. Ssh keys are added to the metadata in nova/api/metadata.py by nova itself, so it appears unlikely there is anything wrong there, or at least I didn’t see anything unusual. The key is created by a POST to nova [2] and nova creates the key. The key content then appears several times in the log messages of the metadata service (it seems fine, uncorrupted).

4. The error “SSHException: Error reading SSH protocol banner[Errno 104] Connection reset by peer” implies corruption of some kind, since communication otherwise doesn’t seem to be a problem (there’s a route). This seems consistent with an mtu that is too low and data getting truncated “occasionally.” In the log [3], the attempts to connect begin with connection refused (before sshd starts), then change to authentication failure (likely before the guest has tried to pull the key from the metadata service), then change to the ssh protocol banner read error. That sounds like the key was retrieved but is corrupted (truncated?).

5. A web search for the same error turned up others having problems with the mtu setting in the guest, where they could ping but not ssh with a key pair, for openstack [4] and cirros [5].
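To make the progression in item 4 easier to check across several failed builds, here is a minimal sketch that collapses ssh connection errors from a log into an ordered timeline of phases. The regex patterns and phase labels are my own assumptions based on the messages quoted above; the exact wording in console.html may differ, so adjust them before relying on the output.

```python
import re

# Hypothetical patterns and labels: they mirror the three stages described in
# item 4, but the exact log wording is an assumption.
PHASES = [
    ("sshd not listening yet", re.compile(r"connection refused", re.I)),
    ("key not installed yet", re.compile(r"authentication failed", re.I)),
    ("banner read failed", re.compile(r"error reading ssh protocol banner", re.I)),
]


def classify(line):
    """Return the phase label for a log line, or None if no pattern matches."""
    for label, pattern in PHASES:
        if pattern.search(line):
            return label
    return None


def phase_timeline(lines):
    """Collapse consecutive identical phases into a list of (phase, count)."""
    timeline = []
    for line in lines:
        label = classify(line)
        if label is None:
            continue  # skip lines unrelated to the ssh attempts
        if timeline and timeline[-1][0] == label:
            timeline[-1] = (label, timeline[-1][1] + 1)
        else:
            timeline.append((label, 1))
    return timeline
```

Feeding it the lines of a saved console log (for example `phase_timeline(open(path))`, where `path` is wherever the log was downloaded) would show whether every failed build goes through the same refused, auth failure, banner error sequence.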

Is it at all possible that there’s an issue with the mtu of the guest sometimes? It would explain the randomness and the protocol banner errors, if data is getting truncated sometimes. I’m not sure where to go from here; I didn’t think anything like this would show up in the guest kernel logs.
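One way the mtu question could be tested is to ping the guest with the "don't fragment" bit set at a few packet sizes and see which is the largest that gets through. The sketch below is only an assumption about how that might be done: it relies on the Linux iputils `ping` flags `-M do`, `-s`, `-c` and `-W`, it has to run from somewhere with a route to the guest, and the candidate mtu values are illustrative rather than taken from the failing builds.

```python
import subprocess

IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def icmp_payload_for_mtu(mtu):
    """Largest ICMP payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER - ICMP_HEADER
    if payload <= 0:
        raise ValueError("mtu too small for an ICMP echo: %d" % mtu)
    return payload


def ping_command(host, mtu):
    """Build an iputils ping command that forbids fragmentation (-M do)."""
    size = icmp_payload_for_mtu(mtu)
    return ["ping", "-M", "do", "-c", "1", "-W", "2", "-s", str(size), host]


def _run_quiet(cmd):
    """Run a command, discard its output, and return its exit code."""
    return subprocess.call(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )


def largest_passing_mtu(host, candidates=(1500, 1454, 1400), run=_run_quiet):
    """Return the largest candidate mtu whose ping succeeds, or None."""
    for mtu in sorted(candidates, reverse=True):
        if run(ping_command(host, mtu)) == 0:
            return mtu
    return None
```

If `largest_passing_mtu(guest_ip)` comes back smaller than the mtu configured inside the guest, that would support the truncation theory; if the full-size ping passes, the mtu is probably not the cause.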

[0] http://logs.openstack.org/38/115938/6/check/check-tempest-dsvm-neutron-pg-full-2/8833a83
[1] http://logs.openstack.org/38/115938/6/check/check-tempest-dsvm-neutron-pg-full-2/8833a83/logs/screen-q-meta.txt.gz
[2] http://logs.openstack.org/38/115938/6/check/check-tempest-dsvm-neutron-pg-full-2/8833a83/console.html#_2014-08-28_18_39_33_546
[3] http://logs.openstack.org/38/115938/6/check/check-tempest-dsvm-neutron-pg-full-2/8833a83/console.html#_2014-08-28_18_39_33_659
[4] https://ask.openstack.org/en/question/32958/unable-to-ssh-with-key-pair/
[5] https://bugs.launchpad.net/cirros/+bug/1301958