CI: tempest TestVolumeBootPattern tests fail due to not being able to ssh to the VM

Bug #1731063 reported by Alex Schultz on 2017-11-08
This bug affects 3 people
Affects: ovsdbapp (Importance: Undecided, Assigned to: Unassigned)
Affects: tripleo (Importance: Critical, Assigned to: Daniel Alvarez)

Bug Description

We're seeing occasional timeouts in the gate, with the failure being that tempest cannot SSH to the VM. It looks like the VM starts and tempest is eventually able to connect, but the attempt may be failing before the Nova keys can be configured.

http://logs.openstack.org/79/516979/2/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/13841da/logs/undercloud/home/zuul/tempest_output.log.txt.gz#_2017-11-08_21_19_54

2017-11-08 21:19:54 | Captured traceback:
2017-11-08 21:19:54 | ~~~~~~~~~~~~~~~~~~~
2017-11-08 21:19:54 | Traceback (most recent call last):
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/common/utils/__init__.py", line 89, in wrapper
2017-11-08 21:19:54 | return f(self, *func_args, **func_kwargs)
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/scenario/test_volume_boot_pattern.py", line 112, in test_volume_boot_pattern
2017-11-08 21:19:54 | private_key=keypair['private_key'])
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 609, in create_timestamp
2017-11-08 21:19:54 | private_key=private_key)
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/scenario/manager.py", line 356, in get_remote_client
2017-11-08 21:19:54 | linux_client.validate_authentication()
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 57, in wrapper
2017-11-08 21:19:54 | six.reraise(*original_exception)
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 30, in wrapper
2017-11-08 21:19:54 | return function(self, *args, **kwargs)
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/lib/common/utils/linux/remote_client.py", line 113, in validate_authentication
2017-11-08 21:19:54 | self.ssh_client.test_connection_auth()
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 207, in test_connection_auth
2017-11-08 21:19:54 | connection = self._get_ssh_connection()
2017-11-08 21:19:54 | File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 121, in _get_ssh_connection
2017-11-08 21:19:54 | password=self.password)
2017-11-08 21:19:54 | tempest.lib.exceptions.SSHTimeout: Connection to the 192.168.24.104 via SSH timed out.
2017-11-08 21:19:54 | User: cirros, Password: None

From the tempest logs we can see that there is a point at which a connection can be established, but key authentication fails.

2017-11-08 21:24:31 | 2017-11-08 21:18:07,581 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 ([Errno None] Unable to connect to port 22 on 192.168.24.104). Number attempts: 20. Retry after 21 seconds.
2017-11-08 21:24:31 | 2017-11-08 21:18:29,110 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 ([Errno None] Unable to connect to port 22 on 192.168.24.104). Number attempts: 21. Retry after 22 seconds.
2017-11-08 21:24:31 | 2017-11-08 21:18:51,668 18813 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2017-11-08 21:24:31 | 2017-11-08 21:18:51,781 18813 INFO [paramiko.transport] Authentication (publickey) failed.
2017-11-08 21:24:31 | 2017-11-08 21:18:51,909 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 (Authentication failed.). Number attempts: 22. Retry after 23 seconds.
2017-11-08 21:24:31 | 2017-11-08 21:19:15,443 18813 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2017-11-08 21:24:31 | 2017-11-08 21:19:15,562 18813 INFO [paramiko.transport] Authentication (publickey) failed.
2017-11-08 21:24:31 | 2017-11-08 21:19:15,684 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 (Authentication failed.). Number attempts: 23. Retry after 24 seconds.
2017-11-08 21:24:31 | 2017-11-08 21:19:40,213 18813 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2017-11-08 21:24:31 | 2017-11-08 21:19:40,346 18813 INFO [paramiko.transport] Authentication (publickey) failed.
2017-11-08 21:24:31 | 2017-11-08 21:19:40,468 18813 ERROR [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 after 23 attempts
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 107, in _get_ssh_connection
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh sock=proxy_chan)
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 380, in connect
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh look_for_keys, gss_auth, gss_kex, gss_deleg_creds, gss_host)
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 621, in _auth
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh raise saved_exception
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh AuthenticationException: Authentication failed.
2017-11-08 21:24:31 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh
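The backoff pattern visible in these log lines (attempt N, then retry after N+1 seconds) can be sketched as a simplified, hypothetical helper. The real implementation lives in tempest/lib/common/ssh.py and drives paramiko directly; here the connect callable and sleep function are injected so the loop can be exercised without a network:

```python
class SSHTimeout(Exception):
    """Raised when no authenticated connection could be made."""


def connect_with_retries(connect, max_attempts=24, sleep=lambda s: None):
    """Try `connect()` repeatedly, backing off one extra second per attempt.

    `connect` is any callable that raises on failure (e.g. a wrapper around
    paramiko's SSHClient.connect); it is injected here purely to make the
    loop testable. `sleep` defaults to a no-op for the same reason.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except Exception:
            # Mirrors the log above: "Number attempts: N. Retry after N+1 seconds."
            if attempt < max_attempts:
                sleep(attempt + 1)
    raise SSHTimeout(
        "Connection via SSH timed out after %d attempts" % max_attempts)
```

With an always-failing connect this raises SSHTimeout after max_attempts tries, matching the "after 23 attempts" error in the log.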

summary: - CI: tempest fails to be able to ssh to the vm
+ CI: tempest TestVolumeBootPattern tests fail due to not being able to
+ ssh to the VM
Jiří Stránský (jistr) wrote :

Nova metadata API returns 404 for public key list in the broken job:

http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/subnode-2/var/log/containers/nova/nova-api-metadata.log.txt.gz#_2017-11-09_02_39_25_887

2017-11-09 02:39:25.887 7 INFO nova.metadata.wsgi.server [req-0facd91f-29e8-4799-ab24-a8b93ae5e8b4 - - - - -] 10.100.0.11,192.168.24.3 "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 404 len: 176 time: 0.0059540

while in successful jobs it returns 200 and proceeds to query the key too:

2017-11-09 12:03:29.352 8 INFO nova.metadata.wsgi.server [req-ae44dbac-18ab-4016-8c6a-ac5c1a3a686d - - - - -] 10.100.0.4,192.168.24.3 "GET /2009-04-04/meta-data/public-keys HTTP/1.1" status: 200 len: 159 time: 0.0032320
2017-11-09 12:03:29.478 8 INFO nova.metadata.wsgi.server [req-ae44dbac-18ab-4016-8c6a-ac5c1a3a686d - - - - -] 10.100.0.4,192.168.24.3 "GET /2009-04-04/meta-data/public-keys/0/openssh-key HTTP/1.1" status: 200 len: 516 time: 0.0030150
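The broken and successful jobs differ only in whether GET /2009-04-04/meta-data/public-keys answers 200 or 404. As a rough sketch of the probe the guest performs, here is a hypothetical check run against a stub metadata server (the handler, paths, and key value are made up for illustration; a real guest talks to 169.254.169.254):

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Fake metadata tree: only the public-keys listing exists (a "successful job").
METADATA = {"/2009-04-04/meta-data/public-keys": "0=tempest-key"}


class FakeMetadataHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = METADATA.get(self.path)
        if body is None:
            self.send_error(404)  # what the broken job saw for public-keys
            return
        data = body.encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):
        pass  # keep output quiet


def fetch_status(base_url, path):
    """Return the HTTP status for a metadata path, as cloud-init would see it."""
    try:
        with urllib.request.urlopen(base_url + path) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def start_fake_server():
    server = HTTPServer(("127.0.0.1", 0), FakeMetadataHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server, "http://127.0.0.1:%d" % server.server_port
```

A 404 on the public-keys listing means the guest never gets a key to install, which matches the publickey authentication failures above.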

Jiří Stránský (jistr) wrote :

The key got created even in the broken job though:

http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/subnode-2/var/log/containers/nova/nova-api.log.txt.gz#_2017-11-09_02_38_37_745

2017-11-09 02:38:37.745 18 INFO nova.api.openstack.requestlog [req-b5abbb4c-1752-42e0-996c-04e44f7299ba 5d0f711fd7774bf4bb1738dc66b47409 d4100d04632648d8b8b2a4a346bfac58 - default default] 192.168.24.3 "POST /v2.1/os-keypairs" status: 200 len: 2314 microversion: 2.1 time: 0.576968

wes hayutin (weshayutin) on 2017-11-09
tags: added: promo
tags: added: promotion-blocker
removed: promo
wes hayutin (weshayutin) wrote :

To see how this is impacting the tripleo gate navigate to

http://cistatus.tripleo.org:8000/ ---> "TripleO CI gate jobs" ----> "legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container"

Note the performance metrics:
In the gate we should be as close to 100% passing as possible.

This issue is causing random failures in the job; overall it's failing 20.9% of the time on tempest (mostly) and 4.5% on timeouts or other infra issues.

Oliver Walsh (owalsh) wrote :

We can see that the Instance object has the key_data when nova-compute boots it:
key_data='ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC9H4ad2IV712wZm5wCgFQPGVEiF9GG1OooX3BXAvEKdhP1vYAq687kOiKV+RmK4qhYiCyp4GF/hKrL4eRmJlSS74AfkvU2w83nqYS97njw5J9mBSeZWzQ4KaHXpVgD7rKtRCoEtufK/WT79sdocSXstQshyWG2w1rAZoaGKOiS1wCHOvV8JkZ45BoVrgiS734bds0MrHxo4r+6nN5WMvAzoLnwJ1Njzyz2b/IWiQfRCljha9F2bLrA9iXTppR2E8HMr+pAH2yFWT/wF6kq8c8vMF/NEFoCuV0ZkIH7ZjLPRFr8I6DiaPeGFVu4Zx1W9PhQsdh3gtyOGswP0MhvYqyJ Generated-by-Nova'
key_name='tempest-TestVolumeBootPattern-1512872222'

And the ip is 10.100.0.4:
"ips": [{"meta": {}, "version": 4, "type": "fixed", "floating_ips": [], "address": "10.100.0.4"}]

However the 404 for the public_key metadata is from 10.100.0.6:
http://logs.openstack.org/79/516979/2/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/13841da/logs/subnode-2/var/log/containers/nova/nova-api-metadata.log.txt.gz#_2017-11-08_21_14_37_476

Which is a different VM that doesn't have a key_pair:
http://logs.openstack.org/79/516979/2/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/13841da/logs/subnode-2/var/log/containers/nova/nova-compute.log.txt.gz#_2017-11-08_21_14_16_781

This looks more like a storage or network issue causing cloud-init to fail on the cirros VM, not a Nova metadata issue.

Alex Schultz (alex-schultz) wrote :

If it's storage: in scenario001 Glance/Cinder are Ceph-backed, while the other scenarios are not.

https://github.com/openstack/tripleo-heat-templates/blob/master/README.rst#service-testing-matrix

Alex Schultz (alex-schultz) wrote :

It should be noted that this is failing on both containerized and non-containerized scenario001 jobs.

Hi, FWIW here are a few things that look odd in the 13841da run:

logs/subnode-2/var/log/pacemaker/bundles/rabbitmq-bundle-0/pacemaker.log.txt.gz:
0619: pacemaker_remoted: warning: qb_ipcs_event_sendv: new_event_notification (13-32500-17): Broken pipe (32)
0620: pacemaker_remoted: warning: send_client_notify: Notification of client proxy-cib_rw-32500-23c89608/23c89608-4ae0-4282-889e-ee895be4abc5 failed

logs/subnode-2/var/log/extra/docker/docker_allinfo.log.txt.gz
0288: WARNING: bridge-nf-call-iptables is disabled
0289: WARNING: bridge-nf-call-ip6tables is disabled

logs/subnode-2/var/log/messages.txt.gz:
33606: Nov 8 21:08:48 centos-7-rax-iad-0000787869 systemd: Started Cluster Controlled openstack-cinder-backup. (this doesn't appear in other logs for legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container)

logs/subnode-2/pip2-freeze.txt.gz:
0132: python-cinderclient==3.2.0
0135: python-gflags==2.0

logs/subnode-2/var/log/config-data/neutron/etc/neutron/plugins/ml2/ml2_conf.ini.txt.gz:
0261: firewall_driver=openvswitch (other jobs have firewall_driver=iptables_hybrid)

logs/undercloud/var/log/neutron/dhcp-agent.log.txt.gz:
0514: Leaving behind already spawned process with pid 7401, root should kill it if it's still there (I can't): error: [Errno 2] ENOENT

logs/undercloud/var/log/neutron/openvswitch-agent.log.txt.gz:
1554: Leaving behind already spawned process with pid 2106, root should kill it if it's still there (I can't): error: [Errno 32] Broken pipe

logs/subnode<email address hidden>:
0547: Connection <0.12701.0> (192.168.24.3:59026 -> 192.168.24.15:5672) has a client-provided name: cinder-backup:124342:7dc58edd-5119-41ca-b2d7-444308b63a8c

Also it's probably not much, but logs/subnode-2/var/log/containers/neutron/neutron-openvswitch-agent.log.txt.gz does have those lines:

2543: 2017-11-08 21:14:13.356 113692 DEBUG neutron.agent.linux.openvswitch_firewall.firewall [req-8a760281-5495-43e0-957e-de04bc82ad1f - - - - -] Creating flow rules for port 66b0ab75-0cd5-4b4d-bb0c-af39371071cc that is port 15 in OVS add_flows_from_rules /usr/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py:1010
2544: 2017-11-08 21:14:13.356 113692 DEBUG neutron.agent.linux.openvswitch_firewall.firewall [req-8a760281-5495-43e0-957e-de04bc82ad1f - - - - -] RULGEN: Rules generated for flow {'ethertype': u'IPv6', 'direction': u'egress'} are [{'priority': 70, 'dl_type': 34525, 'actions': 'resubmit(,73)', 'reg_port': 15, 'table': 72}] add_flows_from_rules /usr/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py:1014
2545: 2017-11-08 21:14:13.357 113692 DEBUG neutron.agent.linux.openvswitch_firewall.firewall [req-8a760281-5495-43e0-957e-de04bc82ad1f - - - - -] RULGEN: Rules generated for flow {'ethertype': u'IPv4', 'direction': u'egress'} are [{'priority': 70, 'dl_type': 2048, 'actions': 'resubmit(,73)', 'reg_port': 15, 'table': 72}] add_flows_from_rules /usr/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py:1014
2546: 2017-11-08 21:14:13.357 113692 DEBUG neutron.agent.linux.openvswitch_firewall.firewall [req-8a760281-5495-43e0-957e-de04bc82ad1f - - - - -] RULGEN: Rules generated for flow {'ethertype': u'IPv4', 'direction': u'ingress', 'source_ip_prefix': '0.0.0.0/0', 'protocol': 1} are [{'dl_type': 2048, 'reg_port': 15, 'actions': 'output:15', 'priority': 70, 'nw_proto': 1, 'table': 82}] add_flows_from_rules /usr/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py:1014
2547: 2017-11-08 21:14:13.358 113692 DEBUG neutron.agent.linux.openvswitch_firewall.firewall [req-8a760281-5495-43e0-957e-de04bc82ad1f - - - - -] RULGEN: Rules generated for flow {'direction': u'ingress', 'protocol': 6, 'ethertype': u'IPv4', 'port_range_max': 22, 'source_ip_prefix': '0.0.0.0/0', 'port_range_min': 22} are [{'dl_type': 2048, 'reg_port': 15, 'nw_proto': 6, 'tcp_dst': '0x0016', 'table': 82, 'actions': 'output:15', 'priority': 70}] add_flows_from_rules /usr/lib/python2.7/site-packages/neutron/agent/linux/openvswitch_firewall/firewall.py:1014

which for some reason, are not present in other jobs, for example this one that succeeded recently: http://logs.openstack.org/78/516378/4/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/f30f641/logs/subnode-2/var/log/containers/neutron/neutron-openvswitch-agent.log.txt.gz

John Fulton (jfulton-org) wrote :

Does the job ping before it SSH's?

If we know the instance answered pings then we could target troubleshooting above that line (e.g. SSH key init problem with cloud init or security groups, etc).
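The layered troubleshooting suggested above (ICMP first, then the TCP port, then key auth) can be sketched as a small hypothetical helper that reports the first layer to fail; the check callables are placeholders for real probes:

```python
def first_failing_layer(checks):
    """Run ordered (name, check) pairs and return the name of the first
    layer whose check returns False, or None if every layer passes.

    `checks` might be e.g. [("icmp", ping_vm), ("tcp/22", port_open),
    ("ssh-auth", key_auth_works)] -- all hypothetical callables standing
    in for real connectivity probes.
    """
    for name, check in checks:
        if not check():
            return name
    return None
```

If ICMP passes but ssh-auth fails, the search narrows to key injection (cloud-init/metadata) or security groups rather than basic networking.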

John Fulton (jfulton-org) wrote :

Storage for the containerized sceario001 http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/ looks fine:

- Ceph installation finished without errors:
   http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/undercloud/var/log/mistral/ceph-install-workflow.log.txt.gz#_2017-11-09_02_14_26_016

- Ceph cluster is active+clean with 1 OSD in and up
   http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/subnode-2/var/log/extra/docker/containers/ceph-mon-centos-7-ovh-bhs1-0000790706/log/ceph/ceph.log.txt.gz#_2017-11-09_02_39_06_503430

- Glance looks normal and the RBD scheme registered.
   http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/subnode-2/var/log/containers/glance/api.log.txt.gz#_2017-11-09_02_37_47_046

- I know Cinder volume is working because it supported 3 instances (see below)

- Nova libvirt indicates an instance was running for 5 minutes using an RBD-backed cinder volume:
   file=rbd:volumes/volume-9e74c518-9772-4a9e-bb5d-727d362b2ba6:id=openstack:auth_supported=cephx\;none:mon_host=192.168.24.3\:6789,file.password-secret=virtio-disk0-secret0,format=raw
   http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/subnode-2/libvirt/qemu/instance-00000001.txt.gz

John Fulton (jfulton-org) wrote :

For both the containerized and non-containerized jobs, we have 3 instances, where instance 1 and instance 2 are both up for about 5 minutes but instance 3 runs for a minute or less and is then shut down [1][2].

If the failure is in SSH'ing into instance 3, do we know how many retries we get? Is it possible the SSH daemon is not yet listening when the SSH attempt fails, and that it would accept the connection if we waited a little?

[1]
non-containerized:
 http://logs.openstack.org/81/508881/2/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-puppet/f9e4b6b/logs/subnode-2/var/log/libvirt/qemu/instance-00000003.log.txt.gz

[2]
containerized:
 http://logs.openstack.org/71/516771/1/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/276d17b/logs/subnode-2/libvirt/qemu/instance-00000003.txt.gz

Giulio Fidente (gfidente) wrote :

scenario001 (both containerized and not) is the only one deploying the ComputeCeilometerAgent and the only one *not* using NotificationDriver: noop as one of its deployment parameters.

I see some strange logs in the ceilometer compute logs [1]

Maybe Pradeep can help us find out more about the errors in the logs? Shall we add the NotificationDriver parameter? Is there a reason why we're deploying ComputeCeilometerAgent only in this job?

1. http://logs.openstack.org/79/516979/2/gate/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/13841da/logs/subnode-2/var/log/containers/ceilometer/compute.log.txt.gz#_2017-11-08_21_17_08_630

Emilien Macchi (emilienm) wrote :

NotificationDriver is set to "noop" when we don't need notifications sent on RPC.
Scenario001 deploys Ceilometer and tests autoscaling, etc., so we NEED these notifications sent on RPC; otherwise Telemetry services don't know when OpenStack resources are created.

Pradeep Kilambi (pkilambi) wrote :

Yes, like Emilien said, we need notifications turned on for this job. The ComputeCeilometerAgent agent is only run here because scenario001 includes the telemetry services and the others don't. I don't see how that could cause a volume boot issue though. The log warning indicates it couldn't find the volume, and rightfully so?

John Fulton (jfulton-org) wrote :

arxcruz pointed out that this seems to be an issue with the metadata service not injecting the SSH key, and that's why it's timing out: the SSH service is started after the metadata service finishes setting up the environment.

It's similar to https://bugs.launchpad.net/tripleo/+bug/1702257

2017-11-08 21:19:54 | 2017-11-08 21:18:29,110 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 ([Errno None] Unable to connect to port 22 on 192.168.24.104). Number attempts: 21. Retry after 22 seconds.
2017-11-08 21:19:54 | 2017-11-08 21:18:51,668 18813 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2017-11-08 21:19:54 | 2017-11-08 21:18:51,781 18813 INFO [paramiko.transport] Authentication (publickey) failed.
2017-11-08 21:19:54 | 2017-11-08 21:18:51,909 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 (Authentication failed.). Number attempts: 22. Retry after 23 seconds.
2017-11-08 21:19:54 | 2017-11-08 21:19:15,443 18813 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2017-11-08 21:19:54 | 2017-11-08 21:19:15,562 18813 INFO [paramiko.transport] Authentication (publickey) failed.
2017-11-08 21:19:54 | 2017-11-08 21:19:15,684 18813 WARNING [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 (Authentication failed.). Number attempts: 23. Retry after 24 seconds.
2017-11-08 21:19:54 | 2017-11-08 21:19:40,213 18813 INFO [paramiko.transport] Connected (version 2.0, client dropbear_2012.55)
2017-11-08 21:19:54 | 2017-11-08 21:19:40,346 18813 INFO [paramiko.transport] Authentication (publickey) failed.
2017-11-08 21:19:54 | 2017-11-08 21:19:40,468 18813 ERROR [tempest.lib.common.ssh] Failed to establish authenticated ssh connection to cirros@192.168.24.104 after 23 attempts
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh Traceback (most recent call last):
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/tempest/lib/common/ssh.py", line 107, in _get_ssh_connection
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh sock=proxy_chan)
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 380, in connect
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh look_for_keys, gss_auth, gss_kex, gss_deleg_creds, gss_host)
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh File "/usr/lib/python2.7/site-packages/paramiko/client.py", line 621, in _auth
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh raise saved_exception
2017-11-08 21:19:54 | 2017-11-08 21:19:40.468 18813 ERROR tempest.lib.common.ssh AuthenticationException: Authentication failed.

yatin (yatinkarel) wrote :

Also, for future debugging, I think it would be good to log nova_console_output in case of failure here: https://github.com/openstack/tempest/blob/17.1.0/tempest/scenario/manager.py#L608-L609
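That suggestion amounts to wrapping the validation step so the instance console log is captured whenever it raises. A hypothetical sketch (the get_console_output and log callables stand in for the real servers client and test logger):

```python
import functools


def log_console_on_failure(get_console_output, log):
    """Decorator factory: on any exception from the wrapped call, fetch and
    log the nova console output for the server before re-raising.

    Both `get_console_output` and `log` are injected; in tempest they would
    come from the servers client and the test logger (hypothetical wiring).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(server_id, *args, **kwargs):
            try:
                return func(server_id, *args, **kwargs)
            except Exception:
                # Capture the guest's boot log so key-injection and
                # metadata failures are visible in the job artifacts.
                log("console output for %s:\n%s"
                    % (server_id, get_console_output(server_id)))
                raise
        return wrapper
    return decorator
```

The original exception still propagates, so the test outcome is unchanged; only the diagnostics improve.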

Alex Schultz (alex-schultz) wrote :

Not sure if this is the same thing, but I did see a failure in scenario003 with what appears to be a similar problem of not being able to SSH to the VM: http://logs.openstack.org/70/519770/2/gate/legacy-tripleo-ci-centos-7-scenario003-multinode-oooq-puppet/69e0c2c/logs/undercloud/home/zuul/tempest_output.log.txt.gz#_2017-11-15_22_38_40

John Trowbridge (trown) wrote :

I set up an environment on rdocloud which ran the toci script on top of pre-provisioned nodes (so almost exact reproducer of CI).

I then made a small script that would run tempest over and over and record the results. It ran 100 times consecutively without failing, so there is definitely an environmental component to this issue. Maybe we need to artificially create some memory or CPU pressure.

Reviewed: https://review.openstack.org/520280
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=17acd021402b93ebbb6f875a6a758d7abf8bd860
Submitter: Zuul
Branch: master

commit 17acd021402b93ebbb6f875a6a758d7abf8bd860
Author: Alex Schultz <email address hidden>
Date: Wed Nov 15 20:37:27 2017 -0700

    Debug configuration for neutron metadata agent

    The neutron metadata agent isn't currently configured for debug when the
    global debug flag is enabled. Additionally add a flag to just configure
    debug for the metadata agent itself.

    Change-Id: I7ef636b8ad4c6fe4f05970c41c05d522a0f0d892
    Related-Bug: #1731063

Alex Schultz (alex-schultz) wrote :

Looking at the logs further, there are errors in the l3-agent which are also now occurring in scenario003.

http://logs.openstack.org/81/520581/1/check/legacy-tripleo-ci-centos-7-scenario001-multinode-oooq-container/128fccc/logs/subnode-2/var/log/containers/neutron/neutron-l3-agent.log.txt.gz#_2017-11-16_15_11_32_149

2017-11-16 15:11:32.149 110790 ERROR ovsdbapp.backend.ovs_idl.command NotConnectedError: Cannot commit transaction DelPortCommand(if_exists=True, bridge=None, port=qg-15cf71cf-65). Not connected

2017-11-16 15:11:32.810 110790 ERROR neutron.agent.l3.router_info [-] Cannot commit transaction AddPortCommand(bridge=br-int, may_exist=False, port=qg-15cf71cf-65), DbSetCommand(table=Interface, col_values=(('type', 'internal'), ('external_ids', {'iface-id': u'15cf71cf-65f1-4e3e-b109-2a70c69cb50d', 'iface-status': 'active', 'attached-mac': u'fa:16:3e:72:ed:38'})), record=qg-15cf71cf-65). Not connected: NotConnectedError: Cannot commit transaction AddPortCommand(bridge=br-int, may_exist=False, port=qg-15cf71cf-65), DbSetCommand(table=Interface, col_values=(('type', 'internal'), ('external_ids', {'iface-id': u'15cf71cf-65f1-4e3e-b109-2a70c69cb50d', 'iface-status': 'active', 'attached-mac': u'fa:16:3e:72:ed:38'})), record=qg-15cf71cf-65). Not connected

Bogdan Dobrelya (bogdando) wrote :

So the tracebacks show that the connection to the OVSDB server is lost sporadically. Perhaps we should track down the signs of these events.

venkata anil (anil-venkata) wrote :

As we have a retry in the l3 agent, this port 'qg-4eb6eb60-17' was added successfully later [1], so I think the l3 agent is properly configuring the port.
[1] http://logs.openstack.org/56/519756/5/check/legacy-tripleo-ci-centos-7-scenario003-multinode-oooq-container/e2bd239/logs/subnode-2/var/log/containers/neutron/neutron-l3-agent.log.txt.gz#_2017-11-18_02_58_55_734

Reviewed: https://review.openstack.org/521191
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=5de608f0ebc9f1b790f20b078b81c1111896d31a
Submitter: Zuul
Branch: master

commit 5de608f0ebc9f1b790f20b078b81c1111896d31a
Author: Alex Schultz <email address hidden>
Date: Fri Nov 17 14:13:41 2017 -0700

    Add debug to neutron agents

    We currently do not enable debug for the l3, l2gw or dhcp agent if the
    debug flag is set to true. This fixes the templates to support enabling
    debug for these agents either via the global Debug setting or a specific
    debug setting.

    Change-Id: I0a44bfc27b306bfb08dab0656e3362503f07d6b3
    Related-Bug: #1731063

I don't think bug 1711463 is related; it's just the same symptom.

Maybe it's related to the new ovsdbapp release: https://review.openstack.org/520630 that includes a fix for timeout retries: https://review.openstack.org/520176

Disregard the links to possible regressing patches, I now see the bug was reported before they landed.

I looked at the Not connected error mentioned in comment #31. As Anil said, the commands are later successfully executed, so it probably doesn't result in the test failure.

The Not connected error has been raised since https://review.openstack.org/#/c/497612/6/ovsdbapp/backend/ovs_idl/connection.py, which is included in ovsdbapp 0.6.0+. It happens when queue_txn is called before .start() on the ovsdb connection. That connection is initialized when Backend.__init__ is executed, which in the Neutron context should happen when BaseOVS -> from_config -> api_factory -> NeutronOvsdbIdl.__init__ is executed. In the traceback we see that the error happens when replace_port is executed, a regular method on the OVSBridge class, which is a child of BaseOVS. So in theory, at the moment queue_txn is executed, the connection should already be initialized. But it clearly wasn't.

I don't think this error is the root cause of the failure tracked here. Anyway, I'm adding the ovsdbapp project to the list of affected projects, and hopefully Terry Wilson will be able to assist with the traceback once he is back from PTO next week.
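The race described above can be reduced to a toy model: before ovsdbapp 0.6.0, queueing a transaction prior to start() simply enqueued it for later processing, while from 0.6.0 on it raises NotConnectedError. These classes only mimic the queueing semantics; they are not ovsdbapp's real API:

```python
import queue


class NotConnectedError(Exception):
    pass


class ToyConnection:
    """Toy model of the ovsdbapp transaction queue (not the real classes)."""

    def __init__(self, strict):
        self.txns = queue.Queue()
        self.started = False
        # strict=True models ovsdbapp 0.6.0-0.8.0; False models the
        # pre-0.6.0 behavior (restored later by the queue_txn fix).
        self.strict = strict

    def start(self):
        self.started = True

    def queue_txn(self, txn):
        if self.strict and not self.started:
            # 0.6.0-0.8.0: fail fast instead of queueing
            raise NotConnectedError("Cannot commit transaction %s" % txn)
        # otherwise: enqueue; the run loop drains it once connected
        self.txns.put(txn)
```

In the strict mode the caller (here, the l3-agent) must handle the exception and retry, which is exactly the resync path that later turned out to matter.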

Alex Schultz (alex-schultz) wrote :

If there is a delay in the network coming up due to the NotConnected exception, could that impact the metadata fetching? I don't think we're seeing the NotConnected error on successful runs. That said, it may not be the cause, but at the moment we're really unsure why this is failing.

Changed in tripleo:
assignee: nobody → Daniel Alvarez (dalvarezs)
Emilien Macchi (emilienm) wrote :

FYI, Alex is working on a patch: https://review.openstack.org/#/c/523508

Reviewed: https://review.openstack.org/523691
Committed: https://git.openstack.org/cgit/openstack/ovsdbapp/commit/?id=8ffd90dbf3f817cc43ab42e36c82d021d2df04e6
Submitter: Zuul
Branch: master

commit 8ffd90dbf3f817cc43ab42e36c82d021d2df04e6
Author: Terry Wilson <email address hidden>
Date: Tue Nov 28 23:26:58 2017 -0600

    Don't throw NotConnectedError in queue_txn

    Before Ifc74a834ca1a7d914c7ca66d71b47694f9eef1e5, queueing a
    transaction before we had connected to an ovsdb-server would
    happily work. When the connection started, the run() method would
    pull the queued transaction off the queue and process it. Adding
    the exception added a race condition in existing code, so this
    patch removes it.

    Change-Id: I0b620b0040ef5f7cb99d2f27140f605e97f95512
    Related-Bug: #1731063

Daniel Alvarez (dalvarezs) wrote :

Ok, we made some progress. I'm updating this to the latest findings:

All jobs which failed had something in common: the error in ovsdbapp while trying to add the gateway port to the OVS bridge [1]. However, this behavior is only exhibited on master, so this is what happened before we bumped ovsdbapp to 0.8.0:

With the previous ovsdbapp version (and the one in Pike), this error was 'hidden' by ovsdbapp and the failed transaction was just queued. The l3-agent went ahead and processed the router normally after having installed all iptables rules (especially important: the rule which redirects traffic for 169.254.169.254 to haproxy). Eventually, ovsdbapp reconnects to ovsdb-server and the gateway port is added to br-int. At this point the VM may have already booted and fetched the metadata properly (the gateway is not involved for metadata).

With ovsdbapp 0.8.0, when the gateway router is added and this error occurs, it will not enqueue the transaction but will throw an exception instead. This exception is handled by neutron-l3-agent, which schedules the router for resync [2], and the gateway port is eventually added after the resync. At this point the router namespace should be there, and the iptables rules should be there as well, since they're created on the AFTER_UPDATE notification [3].

In this last scenario, it looks like when the VM boots and requests metadata from 169.254.169.254, the requests arrive at the router namespace but somehow don't get redirected to haproxy (possibly iptables mangle/nat rules missing due to resyncs?) and instead get routed through the default gateway. When running this job on RDO Cloud, metadata requests going through the default gateway will eventually hit the metadata server of the underlying cloud (RDO Cloud), and those are answered. This is why the console output reports that the VM has successfully fetched the instance-id [4], but this instance-id is i-000ca05e when it should be i-00000001, since it's the first instance booted on the hypervisor. The reason we're getting such a high instance id is that it's the actual id of the undercloud within RDO Cloud, since it's their metadata server that is responding to the instance requests.

We have tagged ovsdbapp 0.9.0 to include this fix [5] and bumped upper-constraints in requirements. We hope CI will be green soon with this, but we still need to work on a proper fix to prevent metadata requests from going out of the router namespace.

[1] https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset018-master/08ed7fb/subnode-2/var/log/containers/neutron/neutron-l3-agent.log.txt.gz#_2017-11-29_02_44_09_758
[2] https://github.com/openstack/neutron/blob/master/neutron/agent/l3/agent.py#L557
[3] https://github.com/openstack/neutron/blob/master/neutron/agent/metadata/driver.py#L289
[4] 2017-11-29 02:50:10 | checking http://169.254.169.254/2009-04-04/instance-id
    2017-11-29 02:50:10 | successful after 1/20 tries: up 5.34. iid=i-000ca05e
    https://logs.rdoproject.org/openstack-periodic/periodic-tripleo-ci-centos-7-multinode-1ctlr-featureset018-master/08ed7fb/undercloud/home/jenkins/tempest_out...
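The instance-id mismatch is mechanically checkable: nova's EC2-style ids have historically been 'i-' followed by the integer instance id in zero-padded hex, so i-000ca05e decodes to a far larger id than the first instance booted on a fresh hypervisor. A hypothetical helper, assuming that encoding:

```python
def ec2_id_to_int(ec2_id):
    """Decode an EC2-style id like 'i-000ca05e' to its integer instance id."""
    if not ec2_id.startswith("i-"):
        raise ValueError("not an EC2-style instance id: %r" % ec2_id)
    return int(ec2_id[2:], 16)


def int_to_ec2_id(num):
    """Encode an integer instance id as a zero-padded EC2-style id."""
    return "i-%08x" % num
```

A sanity check of "did my metadata answer come from the right cloud" is then just comparing the decoded id against the expected first-boot value.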


Alex Schultz (alex-schultz) wrote :

The ovsdbapp fix appears to have addressed our problems. Dropping the alert for now; I will close this when we switch scenario001/003 back to voting.

Changed in ovsdbapp:
status: New → Fix Released
tags: removed: alert

Reviewed: https://review.openstack.org/524763
Committed: https://git.openstack.org/cgit/openstack/tripleo-quickstart/commit/?id=b4df3358e0cc7c895a938101e974494187d444c6
Submitter: Zuul
Branch: master

commit b4df3358e0cc7c895a938101e974494187d444c6
Author: Alex Schultz <email address hidden>
Date: Fri Dec 1 16:22:32 2017 -0700

    Move scenario001 and scenario003 back to the gate

    Change-Id: I4311ad402e78470192efb2227b523f482da0a928
    Depends-On: I3d5327f10a815b7dc89d7b3bd3aba0d05a609e23
    Related-Bug: #1731063

Reviewed: https://review.openstack.org/524761
Committed: https://git.openstack.org/cgit/openstack/tripleo-common/commit/?id=72404a9da0a4e703f3cea710e9196c408490e4bd
Submitter: Zuul
Branch: master

commit 72404a9da0a4e703f3cea710e9196c408490e4bd
Author: Alex Schultz <email address hidden>
Date: Fri Dec 1 16:17:50 2017 -0700

    Move scenario001 and scenario003 back to the gate

    Change-Id: Ieb7f06f0d0fc85f4baf17a0edc90200d4b73e861
    Depends-On: I3d5327f10a815b7dc89d7b3bd3aba0d05a609e23
    Related-Bug: #1731063

Reviewed: https://review.openstack.org/524762
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=bde39b66a78cb884488dec8b53861c7f8f675942
Submitter: Zuul
Branch: master

commit bde39b66a78cb884488dec8b53861c7f8f675942
Author: Alex Schultz <email address hidden>
Date: Fri Dec 1 16:20:42 2017 -0700

    Move scenario001 and scenario003 back to the gate

    Change-Id: Icd0ec5395204178a891c2e923de43a88996b8b83
    Depends-On: I3d5327f10a815b7dc89d7b3bd3aba0d05a609e23
    Related-Bug: #1731063

Reviewed: https://review.openstack.org/524760
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=47fa3f4e5d4e4a8de321e9da32ce761e4f624ad6
Submitter: Zuul
Branch: master

commit 47fa3f4e5d4e4a8de321e9da32ce761e4f624ad6
Author: Alex Schultz <email address hidden>
Date: Fri Dec 1 16:15:29 2017 -0700

    Move scenario001 and scenario003 back to the gate

    Change-Id: I348afa362ca8241f9022898ee987443c77d821f4
    Depends-On: I3d5327f10a815b7dc89d7b3bd3aba0d05a609e23
    Related-Bug: #1731063

Alex Schultz (alex-schultz) wrote :

Moving this bug to Fix Released as we've re-enabled the scenario tests and things appear to be cleared up for tripleo.

Changed in tripleo:
status: Triaged → Fix Released

Change abandoned by Daniel Alvarez (<email address hidden>) on branch: master
Review: https://review.openstack.org/523712

