Unstable grenade multinode job

Bug #1603268 reported by Armando Migliaccio
10
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Critical
Ihar Hrachyshka

Bug Description

Grafana is showing the gate-grenade-dsvm-neutron-multinode being unstable since July 12th [1,2]. The issue seems related to Grenade verify phase getting stalled [3]. Probably MTU related as the SSH times out.

2016-07-09 10:16:00.716 | ++ /opt/stack/new/grenade/projects/70_cinder/resources.sh:verify_noapi:177 : resource_get cinder cinder_server_ip
2016-07-09 10:16:00.717 | ++ /opt/stack/new/grenade/inc/plugin:resource_get:137 : local project=cinder
2016-07-09 10:16:00.718 | ++ /opt/stack/new/grenade/inc/plugin:resource_get:138 : local key=cinder_server_ip
2016-07-09 10:16:00.720 | +++ /opt/stack/new/grenade/inc/plugin:resource_get:139 : iniget /opt/stack/save/grenade_db.ini cinder cinder_server_ip
2016-07-09 10:16:00.733 | ++ /opt/stack/new/grenade/inc/plugin:resource_get:139 : local value=172.24.5.51
2016-07-09 10:16:00.734 | ++ /opt/stack/new/grenade/inc/plugin:resource_get:140 : echo 172.24.5.51
2016-07-09 10:16:00.737 | + /opt/stack/new/grenade/projects/70_cinder/resources.sh:verify_noapi:177 : local server_ip=172.24.5.51
2016-07-09 10:16:00.739 | + /opt/stack/new/grenade/projects/70_cinder/resources.sh:verify_noapi:178 : ping_check_public 172.24.5.51 30
2016-07-09 10:16:00.740 | + /opt/stack/new/grenade/functions:ping_check_public:56 : local ip=172.24.5.51
2016-07-09 10:16:00.741 | + /opt/stack/new/grenade/functions:ping_check_public:57 : local boot_timeout=30
2016-07-09 10:16:00.743 | + /opt/stack/new/grenade/functions:ping_check_public:58 : local expected=True
2016-07-09 10:16:00.745 | + /opt/stack/new/grenade/functions:ping_check_public:59 : local check_command=
2016-07-09 10:16:00.746 | + /opt/stack/new/grenade/functions:ping_check_public:60 : [[ True = \T\r\u\e ]]
2016-07-09 10:16:00.747 | + /opt/stack/new/grenade/functions:ping_check_public:61 : check_command='while ! ping -c1 -w1 172.24.5.51; do sleep 1; done'
2016-07-09 10:16:00.749 | + /opt/stack/new/grenade/functions:ping_check_public:65 : timeout 30 sh -c 'while ! ping -c1 -w1 172.24.5.51; do sleep 1; done'
2016-07-09 10:16:00.755 | PING 172.24.5.51 (172.24.5.51) 56(84) bytes of data.
2016-07-09 10:16:00.755 | 64 bytes from 172.24.5.51: icmp_seq=1 ttl=63 time=2.81 ms
2016-07-09 10:16:00.755 |
2016-07-09 10:16:00.755 | --- 172.24.5.51 ping statistics ---
2016-07-09 10:16:00.755 | 1 packets transmitted, 1 received, 0% packet loss, time 0ms
2016-07-09 10:16:00.755 | rtt min/avg/max/mdev = 2.811/2.811/2.811/0.000 ms
2016-07-09 10:16:00.757 | + /opt/stack/new/grenade/projects/70_cinder/resources.sh:verify_noapi:181 : timeout 30 ssh -v -o ConnectTimeout=10 -o UserKnownHostsFile=/dev/null -o StrictHostKeyChecking=no -i /opt/stack/save/cinder_key.pem cirros@172.24.5.51 sync
2016-07-09 10:16:00.761 | OpenSSH_6.6.1, OpenSSL 1.0.1f 6 Jan 2014
2016-07-09 10:16:00.762 | debug1: Reading configuration data /etc/ssh/ssh_config
2016-07-09 10:16:00.762 | debug1: /etc/ssh/ssh_config line 19: Applying options for *
2016-07-09 10:16:00.762 | debug1: Connecting to 172.24.5.51 [172.24.5.51] port 22.
2016-07-09 10:16:00.763 | debug1: fd 3 clearing O_NONBLOCK
2016-07-09 10:16:00.763 | debug1: Connection established.
2016-07-09 10:16:00.764 | debug1: identity file /opt/stack/save/cinder_key.pem type -1
2016-07-09 10:16:00.764 | debug1: identity file /opt/stack/save/cinder_key.pem-cert type -1
2016-07-09 10:16:00.764 | debug1: Enabling compatibility mode for protocol 2.0
2016-07-09 10:16:00.764 | debug1: Local version string SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.7
2016-07-09 10:16:00.770 | debug1: Remote protocol version 2.0, remote software version dropbear_2012.55
2016-07-09 10:16:00.770 | debug1: no match: dropbear_2012.55
2016-07-09 10:16:00.770 | debug1: SSH2_MSG_KEXINIT sent
2016-07-09 10:16:00.773 | debug1: SSH2_MSG_KEXINIT received
2016-07-09 10:16:00.773 | debug1: kex: server->client aes128-ctr hmac-md5 none
2016-07-09 10:16:00.773 | debug1: kex: client->server aes128-ctr hmac-md5 none
2016-07-09 10:16:00.774 | debug1: sending SSH2_MSG_KEXDH_INIT

==============================
2016-07-09 10:16:00.774 | debug1: expecting SSH2_MSG_KEXDH_REPLY
2016-07-09 10:16:30.760 | +
/opt/stack/new/grenade/projects/70_cinder/resources.sh:verify_noapi:1 : exit_trap
==============================

2016-07-09 10:16:30.762 | + /opt/stack/new/grenade/functions:exit_trap:103 : local r=124
2016-07-09 10:16:30.763 | 1 verify_noapi /opt/stack/new/grenade/projects/70_cinder/resources.sh
2016-07-09 10:16:30.763 | 173 verify /opt/stack/new/grenade/projects/70_cinder/resources.sh
2016-07-09 10:16:30.763 | 224 main /opt/stack/new/grenade/projects/70_cinder/resources.sh
2016-07-09 10:16:30.763 | Exit code: 124

[1] http://grafana.openstack.org/dashboard/db/neutron-failure-rate?panelId=5&fullscreen
[2] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%2270_cinder%2Fresources.sh%3Averify_noapi%3A1%20%3A%20%20%20exit_trap%5C%22
[3] http://logs.openstack.org/46/321146/2/gate/gate-grenade-dsvm-neutron-multinode/988f952/logs/grenade.sh.txt.gz#_2016-07-14_18_03_14_092

Tags: gate-failure
Changed in neutron:
status: New → Confirmed
tags: added: gate-failure
Changed in neutron:
importance: Undecided → Critical
description: updated
summary: - unstable grenade multinode
+ Unstable grenade multinode job
description: updated
description: updated
description: updated
Revision history for this message
Sean M. Collins (scollins) wrote :

Looking at the logs this appears to occur *after* Neutron is upgraded from stable/mitaka to HEAD, so there may be a regression that was introduced?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Indeed, interestingly though the MTU for br-ex set for the offending job is 1500, whereas it's 1450 for all the grenade and multi jobs. I wonder whether this was a ticking bomb...

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

1. gate-tempest-dsvm-neutron-dvr-multinode-full
2. gate-grenade-dsvm-neutron-dvr-multinode
3. gate-tempest-dsvm-neutron-multinode-full

Any job except the offending one.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Changed in neutron:
status: Confirmed → In Progress
assignee: nobody → Clark Boylan (cboylan)
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

The fix we currently look into as the probable relief is: https://review.openstack.org/#/c/343024/ + its deps.

Changed in neutron:
assignee: Clark Boylan (cboylan) → Ihar Hrachyshka (ihar-hrachyshka)
milestone: none → newton-3
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

We identified that the bug is not fixed yet. This is because devstack-gate still uses network_device_mtu, so mtu is not calculated properly for VXLAN network used to connect cinder_server1 resource; that makes some ssh packets being dropped.

The patch to switch to global_physnet_mtu is at: https://review.openstack.org/342975

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

What baffles me then is why doesn't the issue occur on the other jobs? Just sheer coincidence?

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Seems like grafana dashboard shows that the issue went away, and the job is now on par with other multinode jobs in terms of rate of failures in check queue. Closing the bug.

Changed in neutron:
status: In Progress → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.