CI: rh1 compute nodes not spawning vms correctly

Bug #1649252 reported by Sagi (Sergey) Shnaidman
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: ocata-2

Bug Description

Currently all OVB jobs are failing in CI, mostly because nova is very slow to create instances.

Tags: ci
Changed in tripleo:
status: New → In Progress
importance: Undecided → Critical
milestone: none → ocata-2
Revision history for this message
Ben Nemec (bnemec) wrote :

I've found a few concerning things in the logs on the compute nodes (which appear to be the cause of these failures). I'm looking at compute 15.

nova-compute shows this error output for a failed instance:

2016-12-12 16:49:12.119 3417 ERROR nova.network.linux_net [req-8ce28da4-1b90-4dfa-970b-f4b52b7ffc5c ba119eef29ce49f5b8697f4d63948e3c b79291658f384b7ebbc9019b6349e5c9 - - -] Unable to execute ['ovs-vsctl', '--timeout=120', '--', '--if-exists', 'del-port', u'qvo439addb1-14', '--', 'add-port', 'br-int', u'qvo439addb1-14', '--', 'set', 'Interface', u'qvo439addb1-14', u'external-ids:iface-id=439addb1-14a6-47dc-89ce-332ede23cdfa', 'external-ids:iface-status=active', u'external-ids:attached-mac=fa:16:3e:44:ea:df', 'external-ids:vm-uuid=44acb3af-ecad-448a-ad91-eb54302b7e8e']. Exception: Unexpected error while running command.
Command: sudo nova-rootwrap /etc/nova/rootwrap.conf ovs-vsctl --timeout=120 -- --if-exists del-port qvo439addb1-14 -- add-port br-int qvo439addb1-14 -- set Interface qvo439addb1-14 external-ids:iface-id=439addb1-14a6-47dc-89ce-332ede23cdfa external-ids:iface-status=active external-ids:attached-mac=fa:16:3e:44:ea:df external-ids:vm-uuid=44acb3af-ecad-448a-ad91-eb54302b7e8e
Exit code: 142
Stdout: u''
Stderr: u'2016-12-12T16:49:12Z|00002|fatal_signal|WARN|terminating with signal 14 (Alarm clock)\n'
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [req-8ce28da4-1b90-4dfa-970b-f4b52b7ffc5c ba119eef29ce49f5b8697f4d63948e3c b79291658f384b7ebbc9019b6349e5c9 - - -] [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] Instance failed to spawn
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] Traceback (most recent call last):
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2218, in _build_resources
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] yield resources
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2064, in _build_and_run_instance
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] block_device_info=block_device_info)
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 2780, in spawn
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] LOG.debug("Instance is running", instance=instance)
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 4920, in _create_domain_and_network
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manager [instance: 44acb3af-ecad-448a-ad91-eb54302b7e8e] with self._lxc_disk_handler(instance, instance.image_meta,
2016-12-12 16:49:12.120 3417 ERROR nova.compute.manage...
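For reference, exit code 142 is 128 + 14, i.e. ovs-vsctl killed itself with SIGALRM after hitting its own --timeout=120, so the add-port for the vif never completed. A quick sanity check for whether the integration bridge has accumulated an excessive number of ports (which Derek confirms below: over 2000 on one compute node) would be something along these lines, run on the affected compute node; this is a sketch, not a command taken from the job logs:

# Total ports currently plugged into the integration bridge
sudo ovs-vsctl list-ports br-int | wc -l
# Just the qvo* veth ports that nova/neutron create for instance vifs (as in the failing command above)
sudo ovs-vsctl list-ports br-int | grep -c '^qvo'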


Revision history for this message
Ben Nemec (bnemec) wrote :

Basically all of our compute nodes just went down too:

+----+--------------------------------------+-------+----------+
| ID | Hypervisor hostname | State | Status |
+----+--------------------------------------+-------+----------+
| 1 | overcloud-novacompute-28.localdomain | down | enabled |
| 2 | overcloud-novacompute-29.localdomain | down | enabled |
| 3 | overcloud-novacompute-30.localdomain | down | enabled |
| 4 | overcloud-novacompute-26.localdomain | down | enabled |
| 5 | overcloud-novacompute-6.localdomain | down | enabled |
| 6 | overcloud-novacompute-27.localdomain | down | enabled |
| 7 | overcloud-novacompute-10.localdomain | down | enabled |
| 8 | overcloud-novacompute-31.localdomain | down | enabled |
| 9 | overcloud-novacompute-1.localdomain | down | disabled |
| 10 | overcloud-novacompute-9.localdomain | down | enabled |
| 11 | overcloud-novacompute-4.localdomain | down | enabled |
| 12 | overcloud-novacompute-14.localdomain | down | disabled |
| 13 | overcloud-novacompute-19.localdomain | down | enabled |
| 14 | overcloud-novacompute-5.localdomain | down | enabled |
| 15 | overcloud-novacompute-17.localdomain | down | disabled |
| 16 | overcloud-novacompute-22.localdomain | down | enabled |
| 17 | overcloud-novacompute-7.localdomain | down | enabled |
| 18 | overcloud-novacompute-8.localdomain | up | enabled |
| 19 | overcloud-novacompute-0.localdomain | down | enabled |
| 20 | overcloud-novacompute-3.localdomain | down | enabled |
| 21 | overcloud-novacompute-23.localdomain | down | enabled |
| 22 | overcloud-novacompute-21.localdomain | down | enabled |
| 23 | overcloud-novacompute-25.localdomain | down | enabled |
| 24 | overcloud-novacompute-32.localdomain | down | enabled |
| 25 | overcloud-novacompute-18.localdomain | down | enabled |
| 26 | overcloud-novacompute-20.localdomain | down | enabled |
| 27 | overcloud-novacompute-11.localdomain | down | enabled |
| 28 | overcloud-novacompute-15.localdomain | down | enabled |
| 29 | overcloud-novacompute-2.localdomain | down | disabled |
| 30 | overcloud-novacompute-12.localdomain | down | enabled |
| 31 | overcloud-novacompute-13.localdomain | down | enabled |
| 32 | overcloud-novacompute-16.localdomain | down | enabled |
| 33 | overcloud-novacompute-24.localdomain | down | enabled |
+----+--------------------------------------+-------+----------+
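(For reference, the listing above is the standard hypervisor listing against the overcloud's nova API; presumably something like:)

# List hypervisors with their up/down state and enabled/disabled status
nova hypervisor-list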

And I can't get on them anymore:

[bnemec@RedHat ~]$ ssh -i rh1id xxx.xxx.xxx.72
ssh_exchange_identification: Connection closed by remote host

I'm going to start bouncing them in ironic since there's really nothing more I can do to debug this.
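(For the record, "bouncing" here means a hard power cycle through ironic on the undercloud; with the ironic CLI of that era, something along these lines, where the node UUID/name is a placeholder:)

# Hard power-cycle an unreachable compute node via ironic
ironic node-set-power-state <node-uuid-or-name> reboot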

Revision history for this message
Ben Nemec (bnemec) wrote :

Okay, they all went down because someone had already started rebooting the entire environment. I want to note that it would have been preferable to do more limited reboots, first to make sure it actually fixed the problem, and second because now we've taken down all of our infra vms and will need to restart them manually. Not a huge deal, but it probably wasn't necessary. Those compute nodes weren't being scheduled to anyway.

summary: - CI: OVB jobs are failing in CI
+ CI: rh1 compute nodes not spawning vms correctly
Revision history for this message
Ben Nemec (bnemec) wrote :

The work to fix this is actually being tracked in https://etherpad.openstack.org/p/bug-1649252

Revision history for this message
Ben Nemec (bnemec) wrote :

Dropping the alert tag because testenvs now appear to be created successfully. It looks like we may have some other issues with the ovb jobs (not many have completed yet, so I'm not confident enough to call it a trend), but they don't appear to be related to this issue, so we should open a new bug for them.

tags: removed: alert
Revision history for this message
Emilien Macchi (emilienm) wrote :
tags: added: alert
Revision history for this message
Sagi (Sergey) Shnaidman (sshnaidm) wrote :

There is not enough space on our undercloud (it is consumed by nodepool images), which causes ironic deployments to fail:

http://logs.openstack.org/43/407943/4/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/680096e/logs/undercloud/var/log/nova/nova-compute.txt.gz#_2016-12-13_01_17_25_959

2016-12-13 01:17:25.961 24965 ERROR nova.compute.manager [instance: fa3c962d-9cbd-4712-a6a2-4bb9012cbf96] InstanceDeployFailure: Failed to provision instance fa3c962d-9cbd-4712-a6a2-4bb9012cbf96: Failed to deploy. Error: Disk volume where '/var/lib/ironic/master_images/tmpU78MIq' is located doesn't have enough disk space. Required 5116 MiB, only 4302 MiB available space present.

http://logs.openstack.org/43/407943/4/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/680096e/logs/undercloud/var/log/ironic/ironic-conductor.txt.gz#_2016-12-13_01_12_33_251

2016-12-13 01:12:33.251 25629 ERROR ironic.conductor.manager [req-5048c3cb-6863-455c-9fc8-effd2137b2da 79b54a43398b431c8bac6439afdb8595 f2474735437e49b38b0225c8d603af18 - default -] Error in deploy of node cf0d9729-5740-47c9-a1ed-171e80f6a787: Disk volume where '/var/lib/ironic/master_images/tmpiizphO' is located doesn't have enough disk space. Required 5116 MiB, only 4330 MiB available space present.
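(A quick way to confirm this on the undercloud is to check the volume holding ironic's image cache; the path is taken from the errors above, the rest is a sketch:)

# Free space on the filesystem that holds ironic's master image cache
df -h /var/lib/ironic/master_images
# Largest consumers under /var/lib/ironic, to see whether it is the image cache or something else
sudo du -xsh /var/lib/ironic/* | sort -h | tail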

Revision history for this message
Ben Nemec (bnemec) wrote :

As I noted in my previous comment, this is a different problem, so we should open a new bug for it. See https://bugs.launchpad.net/tripleo/+bug/1649615

tags: removed: alert
Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
Derek Higgins (derekh) wrote :

> I want to note that it would have been preferable to do more limited reboots, first to make sure it actually fixed the problem, and second because now we've taken down all of our infra vms and will need to restart them manually.

Just a note to say we did do this: before rebooting all nodes, I did a trial reboot on 2 nodes. Services on each node took a long time to come back up because of ovs ports (in some cases over 2000 of them on one compute node) that had to be cleaned out. Once I saw this and found I couldn't reliably restart services on the compute nodes, I decided to reboot all of them and let the neutron port cleanup script deal with the excess ports.
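(For context, the cleanup being referred to is roughly of the following shape; an illustrative sketch, not the actual script. neutron-ovs-cleanup removes the stale neutron-created ports in one go; the loop below does the same manually for the qvo veth ports on br-int:)

# Drop stale neutron-created ports from the OVS bridges in one go
sudo neutron-ovs-cleanup

# Or manually: delete leftover qvo veth ports from br-int one by one
for p in $(sudo ovs-vsctl list-ports br-int | grep '^qvo'); do
    sudo ovs-vsctl --timeout=10 -- --if-exists del-port br-int "$p"
done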
