sc000 upgrade getting stuck on NetworkDeployment

Bug #1772040 reported by Jiří Stránský
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
tripleo
Fix Released
High
Jiří Stránský

Bug Description

In sc000 overcloud upgrade job, the upgrade step ran for an hour before it got killed:

2018-05-18 12:38:06.244828 | primary | TASK [tripleo-upgrade : run overcloud major upgrade in each of the roles/hostgroups] ***
2018-05-18 12:38:06.285055 | primary | Friday 18 May 2018 12:38:06 +0000 (0:00:00.128) 0:17:21.847 ************
2018-05-18 13:36:40.456113 | primary | +(./toci_quickstart.sh:101): exit_value=143

#####################################################################

However, in package_update.log i can see that Ansible only ran for 10 minutes before getting stuck here:

2018-05-18 12:48:21,823 p=25658 u=mistral | TASK [Check-mode for Run deployment InstanceIdDeployment] **********************
2018-05-18 12:48:21,846 p=25658 u=mistral | skipping: [centos-7-inap-mtl01-0004085172] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2018-05-18 12:48:21,866 p=25658 u=mistral | TASK [Lookup deployment UUID] **************************************************
2018-05-18 12:48:22,008 p=25658 u=mistral | ok: [centos-7-inap-mtl01-0004085172] => {"ansible_facts": {"deployment_uuid": "13a7b14b-f8a7-4111-b8cb-32ec1299bb3d"}, "changed": false, "failed": false}
2018-05-18 12:48:22,029 p=25658 u=mistral | TASK [Render deployment file for NetworkDeployment] ****************************
2018-05-18 12:48:23,287 p=25658 u=mistral | changed: [centos-7-inap-mtl01-0004085172] => {"changed": true, "checksum": "8a0681ea339a48aaef58f4e8a9ee82be718329a5", "dest": "/var/lib/heat-config/tripleo-config-download/NetworkDeployment-13a7b14b-f8a7-4111-b8cb-32ec1299bb3d", "failed": false, "gid": 0, "group": "root", "md5sum": "f7b354dd15fa143d0384da7366bfc2f5", "mode": "0644", "owner": "root", "secontext": "system_u:object_r:var_lib_t:s0", "size": 8650, "src": "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1526647702.16-142551103574098/source", "state": "file", "uid": 0}
2018-05-18 12:48:23,310 p=25658 u=mistral | TASK [Check if deployed file exists for NetworkDeployment] *********************
2018-05-18 12:48:23,918 p=25658 u=mistral | ok: [centos-7-inap-mtl01-0004085172] => {"changed": false, "failed": false, "stat": {"exists": false}}
2018-05-18 12:48:23,939 p=25658 u=mistral | TASK [Check previous deployment rc for NetworkDeployment] **********************
2018-05-18 12:48:23,968 p=25658 u=mistral | skipping: [centos-7-inap-mtl01-0004085172] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2018-05-18 12:48:23,995 p=25658 u=mistral | TASK [Remove deployed file for NetworkDeployment when previous deployment failed] ***
2018-05-18 12:48:24,028 p=25658 u=mistral | skipping: [centos-7-inap-mtl01-0004085172] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2018-05-18 12:48:24,050 p=25658 u=mistral | TASK [Force remove deployed file for NetworkDeployment] ************************
2018-05-18 12:48:24,079 p=25658 u=mistral | skipping: [centos-7-inap-mtl01-0004085172] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}
2018-05-18 12:48:24,099 p=25658 u=mistral | TASK [Run deployment NetworkDeployment] ****************************************

#####################################################################

And in journal on the overcloud there seem to be issues with network from that point on:

May 18 12:48:24 centos-7-inap-mtl01-0004085172 sudo[132607]: tripleo-admin : TTY=pts/0 ; PWD=/home/tripleo-admin ; USER=root ; COMMAND=/bin/sh -c echo BECOME-SUCCESS-ejafdcinazmlurfyjpdrifvituqdmpyo; HEAT_SHELL_CONFIG=/var/lib/heat-config/tripleo-config-download/NetworkDeployment-13a7b14b-f8a7-4111-b8cb-32ec1299bb3d /usr/bin/python /home/tripleo-admin/.ansible/tmp/ansible-tmp-1526647704.2-18691295782855/command.py; rm -rf "/home/tripleo-admin/.ansible/tmp/ansible-tmp-1526647704.2-18691295782855/" > /dev/null 2>&1
May 18 12:48:24 centos-7-inap-mtl01-0004085172 ansible-command[132610]: Invoked with warn=True executable=None _uses_shell=True _raw_params=/usr/libexec/os-refresh-config/configure.d/55-heat-config
                                                                         exit $(jq .deploy_status_code /var/lib/heat-config/deployed/13a7b14b-f8a7-4111-b8cb-32ec1299bb3d.notify.json) removes=None creates=None chdir=None stdin=None
May 18 12:48:24 centos-7-inap-mtl01-0004085172 ovs-vsctl[132627]: ovs|00001|vsctl|INFO|Called as ovs-vsctl del-br br-ex
May 18 12:48:24 centos-7-inap-mtl01-0004085172 kernel: device br-ex left promiscuous mode
May 18 12:48:24 centos-7-inap-mtl01-0004085172 kernel: device vxlan_sys_4789 left promiscuous mode
May 18 12:48:24 centos-7-inap-mtl01-0004085172 kernel: device ovs-system left promiscuous mode
May 18 12:48:25 centos-7-inap-mtl01-0004085172 corosync[123189]: [TOTEM ] The network interface is down.
May 18 12:48:25 centos-7-inap-mtl01-0004085172 corosync[123189]: [TOTEM ] adding new UDPU member {192.168.24.11}
May 18 12:48:26 centos-7-inap-mtl01-0004085172 ntpd[9327]: Deleting interface #16 br-ex, 192.168.24.13#123, interface stats: received=0, sent=0, dropped=0, active_time=263 secs
May 18 12:48:26 centos-7-inap-mtl01-0004085172 ntpd[9327]: Deleting interface #15 br-ex, 192.168.24.7#123, interface stats: received=0, sent=0, dropped=0, active_time=263 secs
May 18 12:48:26 centos-7-inap-mtl01-0004085172 ntpd[9327]: Deleting interface #8 vxlan_sys_4789, fe80::a4fe:1dff:fe00:4f6d#123, interface stats: received=0, sent=0, dropped=0, active_time=2675 secs
May 18 12:48:26 centos-7-inap-mtl01-0004085172 ntpd[9327]: Deleting interface #7 br-ex, fe80::8427:bff:feea:bc4e#123, interface stats: received=0, sent=0, dropped=0, active_time=2675 secs
May 18 12:48:26 centos-7-inap-mtl01-0004085172 ntpd[9327]: Deleting interface #4 br-ex, 192.168.24.11#123, interface stats: received=0, sent=0, dropped=0, active_time=2675 secs
May 18 12:48:28 centos-7-inap-mtl01-0004085172 dockerd-current[9480]: time="2018-05-18T12:48:28.365359775Z" level=debug msg="Running health check for container 17ad03eb2fc01e447c99c5ba473965406f13bd5c495cda3e0a7df65fd1597652 ..."
May 18 12:48:28 centos-7-inap-mtl01-0004085172 dockerd-current[9480]: time="2018-05-18T12:48:28.368264345Z" level=debug msg="starting exec command 7cfa08a9c9e71417a0e4f5d06a88b107b4a0d86816f9b46575e65352a9cc8d06 in container 17ad03eb2fc01e447c99c5ba473965406f13bd5c495cda3e0a7df65fd1597652"
May 18 12:48:28 centos-7-inap-mtl01-0004085172 dockerd-current[9480]: time="2018-05-18T12:48:28.370045078Z" level=debug msg="attach: stdout: begin"
May 18 12:48:28 centos-7-inap-mtl01-0004085172 dockerd-current[9480]: time="2018-05-18T12:48:28.370194263Z" level=debug msg="attach: stderr: begin"
May 18 12:48:28 centos-7-inap-mtl01-0004085172 dockerd-current[9480]: time="2018-05-18T12:48:28.508910392Z" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"start-process\", Id:\"17ad03eb2fc01e447c99c5ba473965406f13bd5c495cda3e0a7df65fd1597652\", Status:0x0, Pid:\"7cfa08a9c9e71417a0e4f5d06a88b107b4a0d86816f9b46575e65352a9cc8d06\", Timestamp:(*timestamp.Timestamp)(0xc421eba7d0)}"
May 18 12:48:28 centos-7-inap-mtl01-0004085172 dockerd-current[9480]: time="2018-05-18T12:48:28.509054443Z" level=debug msg="libcontainerd: event unhandled: type:\"start-process\" id:\"17ad03eb2fc01e447c99c5ba473965406f13bd5c495cda3e0a7df65fd1597652\" pid:\"7cfa08a9c9e71417a0e4f5d06a88b107b4a0d86816f9b46575e65352a9cc8d06\" timestamp:<seconds:1526647708 nanos:508332080 > "
May 18 12:48:28 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.13)[132759]: ERROR: Unable to find nic or netmask.
May 18 12:48:28 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.7)[132760]: ERROR: Unable to find nic or netmask.
May 18 12:48:28 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.13)[132771]: ERROR: [findif] failed
May 18 12:48:28 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.7)[132772]: ERROR: [findif] failed
May 18 12:48:28 centos-7-inap-mtl01-0004085172 lrmd[123201]: notice: ip-192.168.24.13_monitor_10000:132714:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
May 18 12:48:28 centos-7-inap-mtl01-0004085172 lrmd[123201]: notice: ip-192.168.24.7_monitor_10000:132713:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
May 18 12:48:28 centos-7-inap-mtl01-0004085172 crmd[123204]: notice: centos-7-inap-mtl01-0004085172-ip-192.168.24.13_monitor_10000:30 [ ocf-exit-reason:Unable to find nic or netmask.\n ]
May 18 12:48:28 centos-7-inap-mtl01-0004085172 crmd[123204]: notice: centos-7-inap-mtl01-0004085172-ip-192.168.24.7_monitor_10000:29 [ ocf-exit-reason:Unable to find nic or netmask.\n ]
May 18 12:48:29 centos-7-inap-mtl01-0004085172 kernel: iptables dropped: IN=eth0 OUT= MAC=fa:16:3e:37:b2:4d:e8:04:62:1c:a6:80:08:00 SRC=39.135.17.40 DST=198.72.124.15 LEN=40 TOS=0x00 PREC=0x00 TTL=39 ID=58492 DF PROTO=TCP SPT=161 DPT=28392 WINDOW=0 RES=0x00 ACK RST URGP=0
May 18 12:48:38 centos-7-inap-mtl01-0004085172 dockerd-current[9480]: time="2018-05-18T12:48:38.538695466Z" level=debug msg="containerd: process exited" id=17ad03eb2fc01e447c99c5ba473965406f13bd5c495cda3e0a7df65fd1597652 pid=7cfa08a9c9e71417a0e4f5d06a88b107b4a0d86816f9b46575e65352a9cc8d06 status=1 systemPid=132696
May 18 12:48:38 centos-7-inap-mtl01-0004085172 dockerd-current[9480]: time="2018-05-18T12:48:38.539849970Z" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"exit\", Id:\"17ad03eb2fc01e447c99c5ba473965406f13bd5c495cda3e0a7df65fd1597652\", Status:0x1, Pid:\"7cfa08a9c9e71417a0e4f5d06a88b107b4a0d86816f9b46575e65352a9cc8d06\", Timestamp:(*timestamp.Timestamp)(0xc421ebab90)}"
May 18 12:48:38 centos-7-inap-mtl01-0004085172 dockerd-current[9480]: time="2018-05-18T12:48:38.540206880Z" level=debug msg="attach: stderr: end"
May 18 12:48:38 centos-7-inap-mtl01-0004085172 dockerd-current[9480]: time="2018-05-18T12:48:38.540245986Z" level=debug msg="attach: stdout: end"
May 18 12:48:38 centos-7-inap-mtl01-0004085172 dockerd-current[9480]: time="2018-05-18T12:48:38.540378589Z" level=debug msg="Health check for container 17ad03eb2fc01e447c99c5ba473965406f13bd5c495cda3e0a7df65fd1597652 done (exitCode=1)"
May 18 12:48:38 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.13)[132898]: ERROR: Unable to find nic or netmask.
May 18 12:48:38 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.7)[132899]: ERROR: Unable to find nic or netmask.
May 18 12:48:38 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.13)[132910]: ERROR: [findif] failed
May 18 12:48:38 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.7)[132911]: ERROR: [findif] failed
May 18 12:48:38 centos-7-inap-mtl01-0004085172 lrmd[123201]: notice: ip-192.168.24.13_monitor_10000:132852:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
May 18 12:48:38 centos-7-inap-mtl01-0004085172 lrmd[123201]: notice: ip-192.168.24.7_monitor_10000:132853:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
May 18 12:48:43 centos-7-inap-mtl01-0004085172 haproxy[125073]: Server keystone_public/centos-7-inap-mtl01-0004085172.internalapi.localdomain is DOWN, reason: Layer4 timeout, check duration: 2001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May 18 12:48:43 centos-7-inap-mtl01-0004085172 haproxy[125073]: proxy keystone_public has no server available!
May 18 12:48:44 centos-7-inap-mtl01-0004085172 haproxy[125073]: Server keystone_admin/centos-7-inap-mtl01-0004085172.ctlplane.localdomain is DOWN, reason: Layer4 timeout, check duration: 2001ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
May 18 12:48:44 centos-7-inap-mtl01-0004085172 haproxy[125073]: proxy keystone_admin has no server available!
May 18 12:48:46 centos-7-inap-mtl01-0004085172 os-collect-config[4045]: HTTPConnectionPool(host='192.168.24.1', port=8080): Max retries exceeded with url: /v1/AUTH_e6f8fb51f05d4913bc88f5c1007e8c88/ov-ntroller-4i3nvifbtabn-deployed-server-xwvxubp7c3qr/754fede3-7932-42fa-adf6-f5c69161a4a7?temp_url_sig=e2110c15b086458fb8c32e4cf58df8944bc3b8ae&temp_url_expires=2147483586 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x7fa128698f50>, 'Connection to 192.168.24.1 timed out. (connect timeout=10.0)'))
May 18 12:48:46 centos-7-inap-mtl01-0004085172 os-collect-config[4045]: Source [request] Unavailable.
May 18 12:48:49 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.7)[133047]: ERROR: Unable to find nic or netmask.
May 18 12:48:49 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.13)[133048]: ERROR: Unable to find nic or netmask.
May 18 12:48:49 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.13)[133059]: ERROR: [findif] failed
May 18 12:48:49 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.7)[133060]: ERROR: [findif] failed
May 18 12:48:49 centos-7-inap-mtl01-0004085172 lrmd[123201]: notice: ip-192.168.24.13_monitor_10000:133001:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
May 18 12:48:49 centos-7-inap-mtl01-0004085172 lrmd[123201]: notice: ip-192.168.24.7_monitor_10000:133002:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
May 18 12:48:54 centos-7-inap-mtl01-0004085172 dockerd-current[9480]: time="2018-05-18T12:48:54.328870686Z" level=debug msg="clean 10 unused exec commands"
May 18 12:48:59 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.7)[133187]: ERROR: Unable to find nic or netmask.
May 18 12:48:59 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.13)[133186]: ERROR: Unable to find nic or netmask.
May 18 12:48:59 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.7)[133198]: ERROR: [findif] failed
May 18 12:48:59 centos-7-inap-mtl01-0004085172 IPaddr2(ip-192.168.24.13)[133199]: ERROR: [findif] failed
May 18 12:48:59 centos-7-inap-mtl01-0004085172 lrmd[123201]: notice: ip-192.168.24.7_monitor_10000:133141:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
May 18 12:48:59 centos-7-inap-mtl01-0004085172 lrmd[123201]: notice: ip-192.168.24.13_monitor_10000:133140:stderr [ ocf-exit-reason:Unable to find nic or netmask. ]
May 18 12:48:59 centos-7-inap-mtl01-0004085172 kernel: iptables dropped: IN=eth0 OUT= MAC=fa:16:3e:37:b2:4d:e8:04:62:1c:a6:80:08:00 SRC=5.188.9.25 DST=198.72.124.15 LEN=40 TOS=0x00 PREC=0x00 TTL=247 ID=32945 PROTO=TCP SPT=56669 DPT=45848 WINDOW=1024 RES=0x00 SYN URGP=0

Tags: ci upgrade
Revision history for this message
Jiří Stránský (jistr) wrote :
description: updated
description: updated
Revision history for this message
Jiří Stránský (jistr) wrote :

The root cause could be that os-net-config is getting reexecuted during upgrade, and it tries to delete br-ex:

https://github.com/openstack/tripleo-heat-templates/blame/5bd9aca3e72001f8b935139ef4140e497305765b/ci/common/net-config-multinode-os-net-config.yaml#L76

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/569781

Changed in tripleo:
assignee: nobody → Jiří Stránský (jistr)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/569781
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=e9e2b4d104e51f3aafc0cb4ee285163da21ef8a2
Submitter: Zuul
Branch: master

commit e9e2b4d104e51f3aafc0cb4ee285163da21ef8a2
Author: Jiri Stransky <email address hidden>
Date: Mon May 21 13:26:05 2018 +0200

    Don't delete br-ex on upgrade in multinode jobs

    The hook for os-net-config in multinode NIC config templates contains
    a call to `ovs-vsctl del-br br-ex` to remove the bridge we got from
    nodepool and initialize network config. We need to avoid executing
    that again on upgrade, or the Ansible process running the upgrade gets
    stuck.

    Change-Id: Ie36342402426d74fd528e320d60adc951bf8c9ac
    Closes-Bug: #1772040

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 9.0.0.0b3

This issue was fixed in the openstack/tripleo-heat-templates 9.0.0.0b3 development milestone.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.