CI: rh1 compute nodes not spawning vms correctly

Bug #1649252 reported by Sagi (Sergey) Shnaidman
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: ocata-2

Bug Description

Currently all OVB jobs are failing in CI, mostly because nova is very slow to create instances.

Tags: ci
Changed in tripleo:
status: New → In Progress
importance: Undecided → Critical
milestone: none → ocata-2
Revision history for this message
Ben Nemec (bnemec) wrote :

I've found a few concerning things in the logs on the compute nodes (which appear to be the cause of these failures). I'm looking at compute 15.

nova-compute shows this error output for a failed instance:

2016-12-12 16:49:12.119 3417 ERROR nova.network.linux_net [req-8ce28da4-1b90-4dfa-970b-f4b52b7ffc5c ba119eef29ce49f5b8697f4d63948e3c b79291658f384b7ebbc9019b6349e5c9 - - -] Unable to execute ['ovs-vsctl', '--timeout=120', '--', '--if-exists', 'del-port', u'qvo439addb1-14', '--', 'add-port', 'br-int', u'qvo439addb1-14', '--', 'set', 'Interface', u'qvo439addb1-14', u'external-ids:iface-id=439addb1-14a6-47dc-89ce-332ede23cdfa', 'external-ids:iface-status=active', u'external-ids:attached-mac=fa:16:3e:44:ea:df', 'external-ids:vm-uuid=44acb3af-ecad-448a-ad91-eb54302b7e8e']. Exception: Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf ovs-vsctl --timeout=120 -- --if-exists del-port qvo439addb1-14 -- add-port br-int qvo439addb1-14 -- set Interface qvo439addb1-14 external-ids:iface-id=439addb1-14a6-47dc-89ce-332ede23cdfa external-ids:iface-status=active external-ids:attached-mac=fa:16:3e:44:ea:df external-ids:vm-uuid=44acb3af-ecad-448a-ad91-eb54302b7e8e
Exit code: 142
Stdout: u''
Stderr: u'2016-12-12T16:49:12Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n'
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [req-8ce28da4-1b90-4dfa-970b-f4b52b7ffc5c ba119eef29ce49f5b8697f4d63948e3c b79291658f384b7ebbc9019b6349e5c9 - - -] [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] Instance failed to spawn
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] Traceback (most recent call last):
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2218, in _build_resources
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] yield resources
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2064, in _build_and_run_instance
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] block_device_info=block_device_info)
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2780, in spawn
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] LOG.debug("Instance is running", instance=instance)
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4920, in _create_domain_and_network
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] with self._lxc_disk_handler(instance, instance.image_meta,
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manage...
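For reference, exit code 142 is 128 + 14, i.e. ovs-vsctl killed itself with SIGALRM after hitting its own --timeout=120, so the add-port for the vif never completed. A quick sanity check for whether the integration bridge has accumulated an excessive number of ports (which Derek confirms below: over 2000 on one compute node) would be something along these lines, run on the affected compute node; this is a sketch, not a command taken from the job logs:

# Total ports currently plugged into the integration bridge
sudo ovs-vsctl list-ports br-int | wc -l
# Just the qvo* veth ports that nova/neutron create for instance vifs (as in the failing command above)
sudo ovs-vsctl list-ports br-int | grep -c '^qvo'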


Revision history for this message
Ben Nemec (bnemec) wrote :

Basically all of our compute nodes just went down too:

+----+--------------------------------------+-------+----------+
| ID | Hypervisor hostname | State | Status |
+----+--------------------------------------+-------+----------+
| 1 | overcloud-novacompute-28.localdomain | down | enabled |
| 2 | overcloud-novacompute-29.localdomain | down | enabled |
| 3 | overcloud-novacompute-30.localdomain | down | enabled |
| 4 | overcloud-novacompute-26.localdomain | down | enabled |
| 5 | overcloud-novacompute-6.localdomain | down | enabled |
| 6 | overcloud-novacompute-27.localdomain | down | enabled |
| 7 | overcloud-novacompute-10.localdomain | down | enabled |
| 8 | overcloud-novacompute-31.localdomain | down | enabled |
| 9 | overcloud-novacompute-1.localdomain | down | disabled |
| 10 | overcloud-novacompute-9.localdomain | down | enabled |
| 11 | overcloud-novacompute-4.localdomain | down | enabled |
| 12 | overcloud-novacompute-14.localdomain | down | disabled |
| 13 | overcloud-novacompute-19.localdomain | down | enabled |
| 14 | overcloud-novacompute-5.localdomain | down | enabled |
| 15 | overcloud-novacompute-17.localdomain | down | disabled |
| 16 | overcloud-novacompute-22.localdomain | down | enabled |
| 17 | overcloud-novacompute-7.localdomain | down | enabled |
| 18 | overcloud-novacompute-8.localdomain | up | enabled |
| 19 | overcloud-novacompute-0.localdomain | down | enabled |
| 20 | overcloud-novacompute-3.localdomain | down | enabled |
| 21 | overcloud-novacompute-23.localdomain | down | enabled |
| 22 | overcloud-novacompute-21.localdomain | down | enabled |
| 23 | overcloud-novacompute-25.localdomain | down | enabled |
| 24 | overcloud-novacompute-32.localdomain | down | enabled |
| 25 | overcloud-novacompute-18.localdomain | down | enabled |
| 26 | overcloud-novacompute-20.localdomain | down | enabled |
| 27 | overcloud-novacompute-11.localdomain | down | enabled |
| 28 | overcloud-novacompute-15.localdomain | down | enabled |
| 29 | overcloud-novacompute-2.localdomain | down | disabled |
| 30 | overcloud-novacompute-12.localdomain | down | enabled |
| 31 | overcloud-novacompute-13.localdomain | down | enabled |
| 32 | overcloud-novacompute-16.localdomain | down | enabled |
| 33 | overcloud-novacompute-24.localdomain | down | enabled |
+----+--------------------------------------+-------+----------+
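(For reference, the listing above is the standard hypervisor listing against the overcloud's nova API; presumably something like:)

# List hypervisors with their up/down state and enabled/disabled status
nova hypervisor-list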

And I can't get on them anymore:

[bnemec@RedHat ~]$ ssh -i rh1id xxx.xxx.xxx.72
ssh_exchange_identification: Connection closed by remote host

I'm going to start bouncing them in ironic since there's really nothing more I can do to debug this.
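(For the record, "bouncing" here means a hard power cycle through ironic on the undercloud; with the ironic CLI of that era, something along these lines, where the node UUID/name is a placeholder:)

# Hard power-cycle an unreachable compute node via ironic
ironic node-set-power-state <node-uuid-or-name> reboot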

Revision history for this message
Ben Nemec (bnemec) wrote :

Okay, they all went down because someone had already started rebooting the entire environment. I want to note that it would have been preferable to do more limited reboots, first to make sure it actually fixed the problem, and second because now we've taken down all of our infra vms and will need to restart them manually. Not a huge deal, but it probably wasn't necessary. Those compute nodes weren't being scheduled to anyway.

summary: - CI: OVB jobs are failing in CI
+ CI: rh1 compute nodes not spawning vms correctly
Revision history for this message
Ben Nemec (bnemec) wrote :

The work to fix this is actually being tracked in https://etherpad.openstack.org/p/bug-1649252

Revision history for this message
Ben Nemec (bnemec) wrote :

Dropping the alert tag because testenvs now appear to be created successfully. It looks like we may have some other issues with the ovb jobs (not many have completed yet, so I'm not confident enough to call it a trend), but they don't appear to be related to this issue, so we should open a new bug for them.

tags: removed: alert
Revision history for this message
Emilien Macchi (emilienm) wrote :
tags: added: alert
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

There is not enough space on our undercloud (it is consumed by nodepool images), which causes ironic deployments to fail:

http://logs.openstack.org/43/407943/4/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/680096e/logs/undercloud/var/log/nova/nova-compute.txt.gz#_2016-12-13_01_17_25_959

2016-12-13 01:17:25.961 24965 ERROR nova.compute.manager [instance: fa3c962d-9cbd-4712-a6a2-4bb9012cbf96] InstanceDeployFailure: Failed to provision instance fa3c962d-9cbd-4712-a6a2-4bb9012cbf96: Failed to deploy. Error: Disk volume where '/var/lib/ironic/master_images/tmpU78MIq' is located doesn't have enough disk space. Required 5116 MiB, only 4302 MiB available space present.

http://logs.openstack.org/43/407943/4/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/680096e/logs/undercloud/var/log/ironic/ironic-conductor.txt.gz#_2016-12-13_01_12_33_251

2016-12-13 01:12:33.251 25629 ERROR ironic.conductor.manager [req-5048c3cb-6863-455c-9fc8-effd2137b2da 79b54a43398b431c8bac6439afdb8595 f2474735437e49b38b0225c8d603af18 - default -] Error in deploy of node cf0d9729-5740-47c9-a1ed-171e80f6a787: Disk volume where '/var/lib/ironic/master_images/tmpiizphO' is located doesn't have enough disk space. Required 5116 MiB, only 4330 MiB available space present.
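(A quick way to confirm this on the undercloud is to check the volume holding ironic's image cache; the path is taken from the errors above, the rest is a sketch:)

# Free space on the filesystem that holds ironic's master image cache
df -h /var/lib/ironic/master_images
# Largest consumers under /var/lib/ironic, to see whether it is the image cache or something else
sudo du -xsh /var/lib/ironic/* | sort -h | tail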

Revision history for this message
Ben Nemec (bnemec) wrote :

As I noted in my previous comment, this is a different problem, so we should open a new bug for it. See https://bugs.launchpad.net/tripleo/+bug/1649615

tags: removed: alert
Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
Derek Higgins (derekh) wrote :

> I want to note that it would have been preferable to do more limited reboots, first to make sure it actually fixed the problem, and second because now we've taken down all of our infra vms and will need to restart them manually.

Just a note to say we did do this: before rebooting all nodes, I did a trial reboot on 2 nodes. Services on each node took a long time to come back up because of ovs ports (in some cases over 2000 of them on one compute node) that had to be cleaned out. Once I saw this and found I couldn't reliably restart services on the compute nodes, I decided to reboot all of them and let the neutron port cleanup script deal with the excess ports.
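(For context, the cleanup being referred to is roughly of the following shape; an illustrative sketch, not the actual script. neutron-ovs-cleanup removes the stale neutron-created ports in one go; the loop below does the same manually for the qvo veth ports on br-int:)

# Drop stale neutron-created ports from the OVS bridges in one go
sudo neutron-ovs-cleanup

# Or manually: delete leftover qvo veth ports from br-int one by one
for p in $(sudo ovs-vsctl list-ports br-int | grep '^qvo'); do
    sudo ovs-vsctl --timeout=10 -- --if-exists del-port br-int "$p"
done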
