CI: tempest TestVolumeBootPattern tests fail due to not being able to ssh to the VM

Bug #1731063 reported by Alex Schultz on 2017-11-08
This bug affects 3 people
Affects: ovsdbapp (Importance: Undecided, Assigned to: Unassigned)
Affects: tripleo (Importance: Critical, Assigned to: Daniel Alvarez)

Bug Description

We're seeing occasional timeouts in the gate, with the failure being that tempest cannot SSH to the VM. It looks like the VM starts and tempest is eventually able to connect, but the attempt may be failing before the Nova keys can be configured.

http://logs.openstack.org/79/516979/2/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/13841da/logs/undercloud/home/zuul/tempest_output.log.txt.gz#_2017-11-08_21_19_54

2017-11-08 21:19:54 | Captured traceback:
2017-11-08 21:19:54 | ~~~~~~~~~~~~~~~~~~~
2017-11-08 21:19:54 | Traceback (most recent call last):
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
2017-11-08 21:19:54 | return f(self, *func_args, **func_kwargs)
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/scenario/test_volume_boot_pattern.py", line 112, in test_volume_boot_pattern
2017-11-08 21:19:54 | private_key=keypair['private_key'])
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 609, in create_timestamp
2017-11-08 21:19:54 | private_key=private_key)
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 356, in get_remote_client
2017-11-08 21:19:54 | linux_client.validate_authentication()
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 57, in wrapper
2017-11-08 21:19:54 | six.reraise(*original_exception)
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 30, in wrapper
2017-11-08 21:19:54 | return function(self, *args, **kwargs)
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 113, in validate_authentication
2017-11-08 21:19:54 | self.ssh_client.test_connection_auth()
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 207, in test_connection_auth
2017-11-08 21:19:54 | connection = self._get_ssh_connection()
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
2017-11-08 21:19:54 | password=self.password)
2017-11-08 21:19:54 | tempest.lib.exceptions.SSHTimeout: Connection to the 192.168.24.104 via SSH timed out.
2017-11-08 21:19:54 | User: cirros, Password: None

From the tempest logs we can see that there is a point at which a connection can be established, but key authentication fails.

2017-11-08 21:24:31 | 2017-11-08 21:18:07,581 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 ([Errno None] Unable to connect to port 22 on 192.168.24.104). Number attempts: 20. Retry after 21 seconds.
2017-11-08 21:24:31 | 2017-11-08 21:18:29,110 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 ([Errno None] Unable to connect to port 22 on 192.168.24.104). Number attempts: 21. Retry after 22 seconds.
2017-11-08 21:24:31 | 2017-11-08 21:18:51,668 18813 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2017-11-08 21:24:31 | 2017-11-08 21:18:51,781 18813 INFO [paramiko.transport] Authentication (publickey) failed.
2017-11-08 21:24:31 | 2017-11-08 21:18:51,909 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 (Authentication failed.). Number attempts: 22. Retry after 23 seconds.
2017-11-08 21:24:31 | 2017-11-08 21:19:15,443 18813 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2017-11-08 21:24:31 | 2017-11-08 21:19:15,562 18813 INFO [paramiko.transport] Authentication (publickey) failed.
2017-11-08 21:24:31 | 2017-11-08 21:19:15,684 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 (Authentication failed.). Number attempts: 23. Retry after 24 seconds.
2017-11-08 21:24:31 | 2017-11-08 21:19:40,213 18813 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2017-11-08 21:24:31 | 2017-11-08 21:19:40,346 18813 INFO [paramiko.transport] Authentication (publickey) failed.
2017-11-08 21:24:31 | 2017-11-08 21:19:40,468 18813 ERROR [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 after 23 attempts
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 107, in _get_ssh_connection
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh sock=proxy_chan)
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 380, in connect
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh look_for_keys, gss_auth, gss_kex, gss_deleg_creds, gss_host)
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 621, in _auth
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh raise saved_exception
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh AuthenticationException: Authentication failed.
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh
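The backoff pattern visible in these log lines (attempt N, then retry after N+1 seconds) can be sketched as a simplified, hypothetical helper. The real implementation lives in tempest/lib/common/ssh.py and drives paramiko directly; here the connect callable and sleep function are injected so the loop can be exercised without a network:

```python
class SSHTimeout(Exception):
    """Raised when no authenticated connection could be made."""


def connect_with_retries(connect, max_attempts=24, sleep=lambda s: None):
    """Try `connect()` repeatedly, backing off one extra second per attempt.

    `connect` is any callable that raises on failure (e.g. a wrapper around
    paramiko's SSHClient.connect); it is injected here purely to make the
    loop testable. `sleep` defaults to a no-op for the same reason.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except Exception:
            # Mirrors the log above: "Number attempts: N. Retry after N+1 seconds."
            if attempt < max_attempts:
                sleep(attempt + 1)
    raise SSHTimeout(
        "Connection via SSH timed out after %d attempts" % max_attempts)
```

With an always-failing connect this raises SSHTimeout after max_attempts tries, matching the "after 23 attempts" error in the log.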

summary: - CI: tempest fails to be able to ssh to the vm
+ CI: tempest TestVolumeBootPattern tests fail due to not being able to
+ ssh to the VM
Jiří Stránský (jistr) wrote :

Nova metadata API returns 404 for public key list in the broken job:

http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/subnode-2/var/log/containers/nova/nova-api-metadata.log.txt.gz#_2017-11-09_02_39_25_887

2017-11-09 02:39:25.887 7 INFO nova.metadata.wsgi.server [req-0facd91f-29e8-4799-ab24-a8b93ae5e8b4 - - - - -] 10.100.0.11,192.168.24.3 "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 404 len: 176 time: 0.0059540

while in successful jobs it returns 200 and proceeds to query the key too:

2017-11-09 12:03:29.352 8 INFO nova.metadata.wsgi.server [req-ae44dbac-18ab-4016-8c6a-ac5c1a3a686d - - - - -] 10.100.0.4,192.168.24.3 "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 200 len: 159 time: 0.0032320
2017-11-09 12:03:29.478 8 INFO nova.metadata.wsgi.server [req-ae44dbac-18ab-4016-8c6a-ac5c1a3a686d - - - - -] 10.100.0.4,192.168.24.3 "GET /2009-04-04/meta-data/public-keys/0/openssh-key HTTP/1.1" status: 200 len: 516 time: 0.0030150
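The broken and successful jobs differ only in whether GET /2009-04-04/meta-data/public-keys answers 200 or 404. As a rough sketch of the probe the guest performs, here is a hypothetical check run against a stub metadata server (the handler, paths, and key value are made up for illustration; a real guest talks to 169.254.169.254):

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Fake metadata tree: only the public-keys listing exists (a "successful job").
METADATA = {"/2009-04-04/meta-data/public-keys": "0=tempest-key"}


class FakeMetadataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = METADATA.get(self.path)
        if body is None:
            self.send_error(404)  # what the broken job saw for public-keys
            return
        data = body.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # keep output quiet


def fetch_status(base_url, path):
    """Return the HTTP status for a metadata path, as cloud-init would see it."""
    try:
        with urllib.request.urlopen(base_url + path) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def start_fake_server():
    server = HTTPServer(("127.0.0.1", 0), FakeMetadataHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, "http://127.0.0.1:%d" % server.server_port
```

A 404 on the public-keys listing means the guest never gets a key to install, which matches the publickey authentication failures above.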

Jiří Stránský (jistr) wrote :

The key got created even in the broken job though:

http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/subnode-2/var/log/containers/nova/nova-api.log.txt.gz#_2017-11-09_02_38_37_745

2017-11-09 02:38:37.745 18 INFO nova.api.openstack.requestlog [req-b5abbb4c-1752-42e0-996c-04e44f7299ba 5d0f711fd7774bf4bb1738dc66b47409 d4100d04632648d8b8b2a4a346bfac58 - default default] 192.168.24.3 "POST /v2.1/os-keypairs" status: 200 len: 2314 microversion: 2.1 time: 0.576968

wes hayutin (weshayutin) on 2017-11-09
tags: added: promo
tags: added: promotion-blocker
removed: promo
wes hayutin (weshayutin) wrote :

To see how this is impacting the tripleo gate navigate to

http://cistatus.tripleo.org:8000/ ---> "TripleO CI gate jobs" ----> "legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container"

Note the performance metrics:
In the gate we should be as close to 100% passing as possible.

This issue is causing random failures in the job; overall it's failing 20.9% of the time on tempest (mostly) and 4.5% on timeouts or other infra issues.

Oliver Walsh (owalsh) wrote :

We can see that the Instance object has the key_data when nova-compute boots it:
key_data='ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9H4ad2IV712wZm5wCgFQPGVEiF9GG1OooX3BXAvEKdhP1vYAq687kOiKV+RmK4qhYiCyp4GF/hKrL4eRmJlSS74AfkvU2w83nqYS97njw5J9mBSeZWzQ4KaHXpVgD7rKtRCoEtufK/WT79sdocSXstQshyWG2w1rAZoaGKOiS1wCHOvV8JkZ45BoVrgiS734bds0MrHxo4r+6nN5WMvAzoLnwJ1Njzyz2b/IWiQfRCljha9F2bLrA9iXTppR2E8HMr+pAH2yFWT/wF6kq8c8vMF/NEFoCuV0ZkIH7ZjLPRFr8I6DiaPeGFVu4Zx1W9PhQsdh3gtyOGswP0MhvYqyJ Generated-by-Nova'
key_name='tempest-TestVolumeBootPattern-1512872222'

And the ip is 10.100.0.4:
"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.100.0.4"}]

However the 404 for the public_key metadata is from 10.100.0.6:
http://logs.openstack.org/79/516979/2/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/13841da/logs/subnode-2/var/log/containers/nova/nova-api-metadata.log.txt.gz#_2017-11-08_21_14_37_476

Which is a different VM that doesn't have a key_pair:
http://logs.openstack.org/79/516979/2/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/13841da/logs/subnode-2/var/log/containers/nova/nova-compute.log.txt.gz#_2017-11-08_21_14_16_781

This looks more like a storage or network issue causing cloud-init to fail on the cirros VM, not a Nova metadata issue.

Alex Schultz (alex-schultz) wrote :

If it's storage: in scenario001 Glance/Cinder are Ceph-backed, while the other scenarios are not.

https://github.com/openstack/tripleo-heat-templates/blob/master/README.rst#service-testing-matrix

Alex Schultz (alex-schultz) wrote :

It should be noted that this is failing on both containerized and non-containerized scenario001 jobs.

Hi, FWIW here are a few things that look odd in the 13841da run:

logs/subnode-2/var/log/pacemaker/bundles/rabbitmq-bundle-0/pacemaker.log.txt.gz:
0619: pacemaker_remoted: warning: qb_ipcs_event_sendv: new_event_notification (13-32500-17): Broken pipe (32)
0620: pacemaker_remoted: warning: send_client_notify: Notification of client proxy-cib_rw-32500-23c89608/23c89608-4ae0-4282-889e-ee895be4abc5 failed

logs/subnode-2/var/log/extra/docker/docker_allinfo.log.txt.gz
0288: WARNING: bridge-nf-call-iptables is disabled
0289: WARNING: bridge-nf-call-ip6tables is disabled

logs/subnode-2/var/log/messages.txt.gz:
33606: Nov 8 21:08:48 centos-7-rax-iad-0000787869 systemd: Started Cluster Controlled openstack-cinder-backup. (this doesn't appear in other logs for legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container)

logs/subnode-2/pip2-freeze.txt.gz:
0132: python-cinderclient==3.2.0
0135: python-gflags==2.0

logs/subnode-2/var/log/config-data/neutron/etc/neutron/plugins/ml2/ml2_conf.ini.txt.gz:
0261: firewall_driver=openvswitch (other jobs have firewall_driver=iptables_hybrid)

logs/undercloud/var/log/neutron/dhcp-agent.log.txt.gz:
0514: Leaving behind already spawned process with pid 7401, root should kill it if it's still there (I can't): error: [Errno 2] ENOENT

logs/undercloud/var/log/neutron/openvswitch-agent.log.txt.gz:
1554: Leaving behind already spawned process with pid 2106, root should kill it if it's still there (I can't): error: [Errno 32] Broken pipe

logs/subnode<email address hidden>:
0547: Connection <0.12701.0> (192.168.24.3:59026 -> 192.168.24.15:5672) has a client-provided name: cinder-backup:124342:7dc58edd-5119-41ca-b2d7-444308b63a8c

Also it's probably not much, but logs/subnode-2/var/log/containers/neutron/neutron-openvswitch-agent.log.txt.gz does have those lines:

2543: 2017-11-08 21:14:13.356 113692 DEBUG neutron.agent.linux.openvswitch_firewall.firewall [req-8a760281-5495-43e0-957e-de04bc82ad1f - - - - -] Creating flow rules for port 66b0ab75-0cd5-4b4d-bb0c-af39371071cc that is port 15 in OVS add_flows_from_rules /usr/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py:1010
2544: 2017-11-08 21:14:13.356 113692 DEBUG neutron.agent.linux.openvswitch_firewall.firewall [req-8a760281-5495-43e0-957e-de04bc82ad1f - - - - -] RULGEN: Rules generated for flow {'ethertype': u'IPv6', 'direction': u'egress'} are [{'priority': 70, 'dl_type': 34525, 'actions': 'resubmit(,73)', 'reg_port': 15, 'table': 72}] add_flows_from_rules /usr/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py:1014
2545: 2017-11-08 21:14:13.357 113692 DEBUG neutron.agent.linux.openvswitch_firewall.firewall [req-8a760281-5495-43e0-957e-de04bc82ad1f - - - - -] RULGEN: Rules generated for flow {'ethertype': u'IPv4', 'direction': u'egress'} are [{'priority': 70, 'dl_type': 2048, 'actions': 'resubmit(,73)', 'reg_port': 15, 'table': 72}] add_flows_from_rules /usr/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py:1014
2546: 2017-11-08 21:14:13.357 113692 DEBUG neutron.agent.linux.openvswitch_firewall.firewall [req-8a760281-5495-43e0-957e-de04bc82ad1f - - - - -] RULGEN: Rules generated for flow {'ethertype': u'IPv4', 'direction': u'ingress', 'source_ip_prefix': '0.0.0.0/0', 'protocol': 1} are [{'dl_type': 2048, 'reg_port': 15, 'actions': 'output:15', 'priority': 70, 'nw_proto': 1, 'table': 82}] add_flows_from_rules /usr/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py:1014
2547: 2017-11-08 21:14:13.358 113692 DEBUG neutron.agent.linux.openvswitch_firewall.firewall [req-8a760281-5495-43e0-957e-de04bc82ad1f - - - - -] RULGEN: Rules generated for flow {'direction': u'ingress', 'protocol': 6, 'ethertype': u'IPv4', 'port_range_max': 22, 'source_ip_prefix': '0.0.0.0/0', 'port_range_min': 22} are [{'dl_type': 2048, 'reg_port': 15, 'nw_proto': 6, 'tcp_dst': '0x0016', 'table': 82, 'actions': 'output:15', 'priority': 70}] add_flows_from_rules /usr/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py:1014

which for some reason, are not present in other jobs, for example this one that succeeded recently: http://logs.openstack.org/78/516378/4/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/f30f641/logs/subnode-2/var/log/containers/neutron/neutron-openvswitch-agent.log.txt.gz

John Fulton (jfulton-org) wrote :

Does the job ping before it SSH's?

If we know the instance answered pings then we could target troubleshooting above that line (e.g. SSH key init problem with cloud init or security groups, etc).
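The layered troubleshooting suggested above (ICMP first, then the TCP port, then key auth) can be sketched as a small hypothetical helper that reports the first layer to fail; the check callables are placeholders for real probes:

```python
def first_failing_layer(checks):
    """Run ordered (name, check) pairs and return the name of the first
    layer whose check returns False, or None if every layer passes.

    `checks` might be e.g. [("icmp", ping_vm), ("tcp/22", port_open),
    ("ssh-auth", key_auth_works)] -- all hypothetical callables standing
    in for real connectivity probes.
    """
    for name, check in checks:
        if not check():
            return name
    return None
```

If ICMP passes but ssh-auth fails, the search narrows to key injection (cloud-init/metadata) or security groups rather than basic networking.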

John Fulton (jfulton-org) wrote :

Storage for the containerized sceario001 http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/ looks fine:

- Ceph installation finished without errors:
   http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/undercloud/var/log/mistral/ceph-install-workflow.log.txt.gz#_2017-11-09_02_14_26_016

- Ceph cluster is active+clean with 1 OSD in and up
   http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/subnode-2/var/log/extra/docker/containers/ceph-mon-centos-7-ovh-bhs1-0000790706/log/ceph/ceph.log.txt.gz#_2017-11-09_02_39_06_503430

- Glance looks normal and the RBD scheme registered.
   http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/subnode-2/var/log/containers/glance/api.log.txt.gz#_2017-11-09_02_37_47_046

- I know Cinder volume is working because it supported 3 instances (see below)

- Nova libvirt indicates an instance was running for 5 minutes using an RBD-backed cinder volume:
   file=rbd:volumes/volume-9e74c518-9772-4a9e-bb5d-727d362b2ba6:id=openstack:auth_supported=cephx\;none:mon_host=192.168.24.3\:6789,file.password-secret=virtio-disk0-secret0,format=raw
   http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/subnode-2/libvirt/qemu/instance-00000001.txt.gz

John Fulton (jfulton-org) wrote :

For both the containerized and non-containerized jobs, we have 3 instances, where instance 1 and instance 2 are both up for about 5 minutes but instance 3 runs for a minute or less and is then shut down [1][2].

If the failure is in SSH'ing into instance 3, do we know how many retries we get? Is it possible the SSH daemon is not yet listening when the SSH attempt fails, and that it would accept the connection if we waited a little?

[1]
non-containerized:
 http://logs.openstack.org/81/508881/2/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-puppet/f9e4b6b/logs/subnode-2/var/log/libvirt/qemu/instance-00000003.log.txt.gz

[2]
containerized:
 http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/subnode-2/libvirt/qemu/instance-00000003.txt.gz

Giulio Fidente (gfidente) wrote :

scenario001 (both containerized and not) is the only one deploying the ComputeCeilometerAgent and the only one *not* using NotificationDriver: noop as one of its deployment parameters.

I see some strange logs in the ceilometer compute logs [1]

Maybe Pradeep can help us find out more about the errors in the logs? Shall we add the NotificationDriver parameter? Is there a reason why we're deploying ComputeCeilometerAgent only in this job?

1. http://logs.openstack.org/79/516979/2/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/13841da/logs/subnode-2/var/log/containers/ceilometer/compute.log.txt.gz#_2017-11-08_21_17_08_630

Emilien Macchi (emilienm) wrote :

NotificationDriver is set to "noop" when we don't need notifications sent on RPC.
Scenario001 deploys Ceilometer and tests autoscaling, etc., so we NEED these notifications sent on RPC; otherwise Telemetry services don't know when OpenStack resources are created.

Pradeep Kilambi (pkilambi) wrote :

Yes, like Emilien said, we need notifications turned on for this job. The ComputeCeilometerAgent agent is only run here because scenario001 includes the telemetry services and the others don't. I don't see how that could cause a volume boot issue though. The log warning indicates it couldn't find the volume, and rightfully so?

John Fulton (jfulton-org) wrote :

arxcruz pointed out that this seems to be an issue with the metadata service not injecting the SSH key, and that's why it's timing out: the SSH service is started after the metadata service finishes setting up the environment.

It's similar to https://bugs.launchpad.net/tripleo/+bug/1702257

2017-11-08 21:19:54 | 2017-11-08 21:18:29,110 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 ([Errno None] Unable to connect to port 22 on 192.168.24.104). Number attempts: 21. Retry after 22 seconds.
2017-11-08 21:19:54 | 2017-11-08 21:18:51,668 18813 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2017-11-08 21:19:54 | 2017-11-08 21:18:51,781 18813 INFO [paramiko.transport] Authentication (publickey) failed.
2017-11-08 21:19:54 | 2017-11-08 21:18:51,909 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 (Authentication failed.). Number attempts: 22. Retry after 23 seconds.
2017-11-08 21:19:54 | 2017-11-08 21:19:15,443 18813 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2017-11-08 21:19:54 | 2017-11-08 21:19:15,562 18813 INFO [paramiko.transport] Authentication (publickey) failed.
2017-11-08 21:19:54 | 2017-11-08 21:19:15,684 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 (Authentication failed.). Number attempts: 23. Retry after 24 seconds.
2017-11-08 21:19:54 | 2017-11-08 21:19:40,213 18813 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2017-11-08 21:19:54 | 2017-11-08 21:19:40,346 18813 INFO [paramiko.transport] Authentication (publickey) failed.
2017-11-08 21:19:54 | 2017-11-08 21:19:40,468 18813 ERROR [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 after 23 attempts
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 107, in _get_ssh_connection
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh sock=proxy_chan)
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 380, in connect
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh look_for_keys, gss_auth, gss_kex, gss_deleg_creds, gss_host)
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 621, in _auth
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh raise saved_exception
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh AuthenticationException: Authentication failed.

yatin (yatinkarel) wrote :

Also, for future debugging, I think it would be good to log nova_console_output in case of failure here: https://github.com/openstack/tempest/blob/17.1.0/tempest/scenario/manager.py#L608-L609
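That suggestion amounts to wrapping the validation step so the instance console log is captured whenever it raises. A hypothetical sketch (the get_console_output and log callables stand in for the real servers client and test logger):

```python
import functools


def log_console_on_failure(get_console_output, log):
    """Decorator factory: on any exception from the wrapped call, fetch and
    log the nova console output for the server before re-raising.

    Both `get_console_output` and `log` are injected; in tempest they would
    come from the servers client and the test logger (hypothetical wiring).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(server_id, *args, **kwargs):
            try:
                return func(server_id, *args, **kwargs)
            except Exception:
                # Capture the guest's boot log so key-injection and
                # metadata failures are visible in the job artifacts.
                log("console output for %s:\n%s"
                    % (server_id, get_console_output(server_id)))
                raise
        return wrapper
    return decorator
```

The original exception still propagates, so the test outcome is unchanged; only the diagnostics improve.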

Alex Schultz (alex-schultz) wrote :

Not sure if this is the same thing, but I did see a failure in scenario003 with what appears to be a similar problem of not being able to SSH to the VM: http://logs.openstack.org/70/519770/2/gate/legacy-tripleo-ci-centos-7-scenario003-multinode-oooq-puppet/69e0c2c/logs/undercloud/home/zuul/tempest_output.log.txt.gz#_2017-11-15_22_38_40

John Trowbridge (trown) wrote :

I set up an environment on rdocloud which ran the toci script on top of pre-provisioned nodes (so almost exact reproducer of CI).

I then made a small script that would run tempest over and over and record the results. It ran 100 times consecutively without failing, so there is definitely an environmental component to this issue. Maybe we need to artificially create some memory or CPU pressure.

Reviewed: https://review.openstack.org/520280
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=17acd021402b93ebbb6f875a6a758d7abf8bd860
Submitter: Zuul
Branch: master

commit 17acd021402b93ebbb6f875a6a758d7abf8bd860
Author: Alex Schultz <email address hidden>
Date: Wed Nov 15 20:37:27 2017 -0700

    Debug configuration for neutron metadata agent

    The neutron metadata agent isn't currently configured for debug when the
    global debug flag is enabled. Additionally add a flag to just configure
    debug for the metadata agent itself.

    Change-Id: I7ef636b8ad4c6fe4f05970c41c05d522a0f0d892
    Related-Bug: #1731063

Alex Schultz (alex-schultz) wrote :

Looking at the logs further, there are errors in the l3-agent which are also now occurring in scenario003.

http://logs.openstack.org/81/520581/1/check/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/128fccc/logs/subnode-2/var/log/containers/neutron/neutron-l3-agent.log.txt.gz#_2017-11-16_15_11_32_149

2017-11-16 15:11:32.149 110790 ERROR ovsdbapp.backend.ovs_idl.command NotConnectedError: Cannot commit transaction DelPortCommand(if_exists=True, bridge=None, port=qg-15cf71cf-65). Not connected

2017-11-16 15:11:32.810 110790 ERROR neutron.agent.l3.router_info [-] Cannot commit transaction AddPortCommand(bridge=br-int, may_exist=False, port=qg-15cf71cf-65), DbSetCommand(table=Interface, col_values=(('type', 'internal'), ('external_ids', {'iface-id': u'15cf71cf-65f1-4e3e-b109-2a70c69cb50d', 'iface-status': 'active', 'attached-mac': u'fa:16:3e:72:ed:38'})), record=qg-15cf71cf-65). Not connected: NotConnectedError: Cannot commit transaction AddPortCommand(bridge=br-int, may_exist=False, port=qg-15cf71cf-65), DbSetCommand(table=Interface, col_values=(('type', 'internal'), ('external_ids', {'iface-id': u'15cf71cf-65f1-4e3e-b109-2a70c69cb50d', 'iface-status': 'active', 'attached-mac': u'fa:16:3e:72:ed:38'})), record=qg-15cf71cf-65). Not connected

Bogdan Dobrelya (bogdando) wrote :

So the tracebacks show that the connection to the OVSDB server is lost sporadically. Perhaps we should track down the signs of these events.

venkata anil (anil-venkata) wrote :

As we have a retry in the l3 agent, this port 'qg-4eb6eb60-17' was added successfully later [1], so I think the l3 agent is properly configuring the port.
[1] http://logs.openstack.org/56/519756/5/check/legacy-tripleo-ci-centos-7-scenario003-multinode-oooq-container/e2bd239/logs/subnode-2/var/log/containers/neutron/neutron-l3-agent.log.txt.gz#_2017-11-18_02_58_55_734

Reviewed: https://review.openstack.org/521191
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=5de608f0ebc9f1b790f20b078b81c1111896d31a
Submitter: Zuul
Branch: master

commit 5de608f0ebc9f1b790f20b078b81c1111896d31a
Author: Alex Schultz <email address hidden>
Date: Fri Nov 17 14:13:41 2017 -0700

    Add debug to neutron agents

    We currently do not enable debug for the l3, l2gw or dhcp agent if the
    debug flag is set to true. This fixes the templates to support enabling
    debug for these agents either via the global Debug setting or a specific
    debug setting.

    Change-Id: I0a44bfc27b306bfb08dab0656e3362503f07d6b3
    Related-Bug: #1731063

I don't think bug 1711463 is related; it's just the same symptom.

Maybe it's related to the new ovsdbapp release: https://review.openstack.org/520630 that includes a fix for timeout retries: https://review.openstack.org/520176

Disregard the links to possible regressing patches, I now see the bug was reported before they landed.

I looked at the Not connected error mentioned in comment #31. As Anil said, the commands are later successfully executed, so it probably doesn't result in the test failure.

The Not connected error has been raised since https://review.openstack.org/#/c/497612/6/ovsdbapp/backend/ovs_idl/connection.py, which is included in ovsdbapp 0.6.0+. It happens when queue_txn is called before .start() on the ovsdb connection. That connection is initialized when Backend.__init__ is executed, which in the Neutron context should happen when BaseOVS -> from_config -> api_factory -> NeutronOvsdbIdl.__init__ is executed. In the traceback we see that the error happens when replace_port is executed, a regular method on the OVSBridge class, which is a child of BaseOVS. So in theory, at the moment queue_txn is executed, the connection should already be initialized. But it clearly wasn't.

I don't think this error is the root cause of the failure tracked here. Anyway, I'm adding the ovsdbapp project to the list of affected projects, and hopefully Terry Wilson will be able to assist with the traceback once he is back from PTO next week.
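The race described above can be reduced to a toy model: before ovsdbapp 0.6.0, queueing a transaction prior to start() simply enqueued it for later processing, while from 0.6.0 on it raises NotConnectedError. These classes only mimic the queueing semantics; they are not ovsdbapp's real API:

```python
import queue


class NotConnectedError(Exception):
    pass


class ToyConnection:
    """Toy model of the ovsdbapp transaction queue (not the real classes)."""

    def __init__(self, strict):
        self.txns = queue.Queue()
        self.started = False
        # strict=True models ovsdbapp 0.6.0-0.8.0; False models the
        # pre-0.6.0 behavior (restored later by the queue_txn fix).
        self.strict = strict

    def start(self):
        self.started = True

    def queue_txn(self, txn):
        if self.strict and not self.started:
            # 0.6.0-0.8.0: fail fast instead of queueing
            raise NotConnectedError("Cannot commit transaction %s" % txn)
        # otherwise: enqueue; the run loop drains it once connected
        self.txns.put(txn)
```

In the strict mode the caller (here, the l3-agent) must handle the exception and retry, which is exactly the resync path that later turned out to matter.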

Alex Schultz (alex-schultz) wrote :

If there is a delay in the network coming up due to the NotConnected exception, could that impact the metadata fetching? I don't think we're seeing the NotConnected error on successful runs. That said, it may not be the cause, but at the moment we're really unsure why this is failing.

Changed in tripleo:
assignee: nobody → Daniel Alvarez (dalvarezs)
Emilien Macchi (emilienm) wrote :

FYI, Alex is working on a patch: https://review.openstack.org/#/c/523508

Reviewed: https://review.openstack.org/523691
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=8ffd90dbf3f817cc43ab42e36c82d021d2df04e6
Submitter: Zuul
Branch: master

commit 8ffd90dbf3f817cc43ab42e36c82d021d2df04e6
Author: Terry Wilson <email address hidden>
Date: Tue Nov 28 23:26:58 2017 -0600

    Don't throw NotConnectedError in queue_txn

    Before Ifc74a834ca1a7d914c7ca66d71b47694f9eef1e5, queueing a
    transaction before we had connected to an ovsdb-server would
    happily work. When the connection started, the run() method would
    pull the queued transaction off the queue and process it. Adding
    the exception added a race condition in existing code, so this
    patch removes it.

    Change-Id: I0b620b0040ef5f7cb99d2f27140f605e97f95512
    Related-Bug: #1731063

Daniel Alvarez (dalvarezs) wrote :

Ok, we made some progress. I'm updating this to the latest findings:

All jobs which failed had something in common: the error in ovsdbapp while trying to add the gateway port to the OVS bridge [1]. However, this behavior is only exhibited on master, so this is what happened before we bumped ovsdbapp to 0.8.0:

With the previous ovsdbapp version (and the one in Pike), this error was 'hidden' by ovsdbapp and the failed transaction was just queued. The l3-agent went ahead and processed the router normally after having installed all iptables rules (especially important: the rule which redirects traffic for 169.254.169.254 to haproxy). Eventually, ovsdbapp reconnects to ovsdb-server and the gateway port is added to br-int. At this point the VM may have already booted and fetched the metadata properly (the gateway is not involved for metadata).

With ovsdbapp 0.8.0, when the gateway router is added and this error occurs, it will not enqueue the transaction but will throw an exception instead. This exception is handled by neutron-l3-agent, which schedules the router for resync [2], and the gateway port is eventually added after the resync. At this point the router namespace should be there, and the iptables rules should be there as well, since they're created on the AFTER_UPDATE notification [3].

In this last scenario, it looks like when the VM boots and requests metadata from 169.254.169.254, the requests arrive at the router namespace but somehow don't get redirected to haproxy (possibly iptables mangle/nat rules missing due to resyncs?) and instead get routed through the default gateway. When running this job on RDO Cloud, metadata requests going through the default gateway will eventually hit the metadata server of the underlying cloud (RDO Cloud), and those are answered. This is why the console output reports that the VM has successfully fetched the instance-id [4], but this instance-id is i-000ca05e when it should be i-00000001, since it's the first instance booted on the hypervisor. The reason we're getting such a high instance id is that it's the actual id of the undercloud within RDO Cloud, since it's their metadata server that is responding to the instance requests.

We have tagged ovsdbapp 0.9.0 to include this fix [5] and bumped upper-constraints in requirements. We hope CI will be green soon with this, but we still need to work on a proper fix to prevent metadata requests from going out of the router namespace.

[1] https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset018-master/08ed7fb/subnode-2/var/log/containers/neutron/neutron-l3-agent.log.txt.gz#_2017-11-29_02_44_09_758
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L557
[3] https://github.com/openstack/neutron/blob/master/neutron/agent/metadata/driver.py#L289
[4] 2017-11-29 02:50:10 | checking http://169.254.169.254/2009-04-04/instance-id
    2017-11-29 02:50:10 | successful after 1/20 tries: up 5.34. iid=i-000ca05e
    https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset018-master/08ed7fb/undercloud/home/jenkins/tempest_out...
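The instance-id mismatch is mechanically checkable: nova's EC2-style ids have historically been 'i-' followed by the integer instance id in zero-padded hex, so i-000ca05e decodes to a far larger id than the first instance booted on a fresh hypervisor. A hypothetical helper, assuming that encoding:

```python
def ec2_id_to_int(ec2_id):
    """Decode an EC2-style id like 'i-000ca05e' to its integer instance id."""
    if not ec2_id.startswith("i-"):
        raise ValueError("not an EC2-style instance id: %r" % ec2_id)
    return int(ec2_id[2:], 16)


def int_to_ec2_id(num):
    """Encode an integer instance id as a zero-padded EC2-style id."""
    return "i-%08x" % num
```

A sanity check of "did my metadata answer come from the right cloud" is then just comparing the decoded id against the expected first-boot value.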


Alex Schultz (alex-schultz) wrote :

The ovsdbapp fix appears to have addressed our problems. Dropping the alert for now; I will close this when we switch scenario001/003 back to voting.

Changed in ovsdbapp:
status: New → Fix Released
tags: removed: alert

Reviewed: https://review.openstack.org/524763
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart/commit/?id=b4df3358e0cc7c895a938101e974494187d444c6
Submitter: Zuul
Branch: master

commit b4df3358e0cc7c895a938101e974494187d444c6
Author: Alex Schultz <email address hidden>
Date: Fri Dec 1 16:22:32 2017 -0700

    Move scenario001 and scenario003 back to the gate

    Change-Id: I4311ad402e78470192efb2227b523f482da0a928
    Depends-On: I3d5327f10a815b7dc89d7b3bd3aba0d05a609e23
    Related-Bug: #1731063

Reviewed: https://review.openstack.org/524761
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=72404a9da0a4e703f3cea710e9196c408490e4bd
Submitter: Zuul
Branch: master

commit 72404a9da0a4e703f3cea710e9196c408490e4bd
Author: Alex Schultz <email address hidden>
Date: Fri Dec 1 16:17:50 2017 -0700

    Move scenario001 and scenario003 back to the gate

    Change-Id: Ieb7f06f0d0fc85f4baf17a0edc90200d4b73e861
    Depends-On: I3d5327f10a815b7dc89d7b3bd3aba0d05a609e23
    Related-Bug: #1731063

Reviewed: https://review.openstack.org/524762
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=bde39b66a78cb884488dec8b53861c7f8f675942
Submitter: Zuul
Branch: master

commit bde39b66a78cb884488dec8b53861c7f8f675942
Author: Alex Schultz <email address hidden>
Date: Fri Dec 1 16:20:42 2017 -0700

    Move scenario001 and scenario003 back to the gate

    Change-Id: Icd0ec5395204178a891c2e923de43a88996b8b83
    Depends-On: I3d5327f10a815b7dc89d7b3bd3aba0d05a609e23
    Related-Bug: #1731063

Reviewed: https://review.openstack.org/524760
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=47fa3f4e5d4e4a8de321e9da32ce761e4f624ad6
Submitter: Zuul
Branch: master

commit 47fa3f4e5d4e4a8de321e9da32ce761e4f624ad6
Author: Alex Schultz <email address hidden>
Date: Fri Dec 1 16:15:29 2017 -0700

    Move scenario001 and scenario003 back to the gate

    Change-Id: I348afa362ca8241f9022898ee987443c77d821f4
    Depends-On: I3d5327f10a815b7dc89d7b3bd3aba0d05a609e23
    Related-Bug: #1731063

Alex Schultz (alex-schultz) wrote :

Moving this bug to Fix Released as we've re-enabled the scenario tests and things appear to be cleared up for tripleo.

Changed in tripleo:
status: Triaged → Fix Released

Change abandoned by Daniel Alvarez (<email address hidden>) on branch: master
Review: https://review.openstack.org/523712

