Services not running that should be: nova-compute

Bug #1751859 reported by Jason Hobbs
This bug affects 2 people
Affects: OpenStack Nova Compute Charm
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

Two nova-compute units on my deployment ended up in blocked state with the error "Services not running that should be: nova-compute".

Each has a traceback in their nova-compute.log like this:
http://paste.ubuntu.com/p/KhtBMX6GYm/

bundle and overlay:
http://paste.ubuntu.com/p/cj8Dw4X9HS/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Chris Gregan (cgregan) wrote :

Escalated to Field High.

Revision history for this message
John George (jog) wrote :
Revision history for this message
John George (jog) wrote :
Revision history for this message
John George (jog) wrote :

Added the bundle and juju-crashdump from the latest recreate.

Ryan Beisner (1chb1n)
Changed in charm-nova-compute:
assignee: nobody → Alex Kavanagh (ajkavanagh)
importance: Undecided → High
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Which nova-compute units were failing please?

Please could you post the juju status output for the model?

On the two units with the error, please could you post the output of "ps ax | grep nova"? For comparison, please also post the output of "ps ax | grep nova" from another kvm unit that is running okay.
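
(A minimal sketch of the checks being requested, assuming the charm runs nova-compute as a systemd service -- the exact commands here are illustrative, not taken from the report:)

  ps ax | grep nova                  # is a nova-compute process present at all?
  systemctl status nova-compute      # what does systemd think the service state is?
  journalctl -u nova-compute -n 50   # recent service log lines, if any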

Thanks.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Actually, scratch juju status, as that's in the crashdump. Thanks.

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Do we have a way to know the system load of the metal hosting rabbitmq? I ask because these look like amqp messaging timeouts, which we know can happen with too much resource contention.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Okay, I've had a good trawl through the logs in the crash dump and the only errors I can find are amqp (rabbitmq) related timeout errors. However, the logs appear to show that rabbit is happy.

I'm wondering if we're looking at an overload situation, i.e. the timeouts occur because everything is slowed down due to load? Could that be the case? Have we got any load figures for the machines in question (the nova-lxd and nova-kvm units), which I'm guessing are the leaders?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1751859] Re: Services not running that should be: nova-compute

Alex, these failures come from a CI setup. If additional logging is
needed, I would suggest that it be added to the charm.

On Wed, Mar 21, 2018 at 7:14 AM, Alex Kavanagh
<email address hidden> wrote:
> Actually, scratch juju status, as that's in the crashdump. Thanks.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

The crashdump includes load information in /var/log/load - top and
iotop output is collected every 30 seconds.

On Wed, Mar 21, 2018 at 7:33 AM, Ryan Beisner
<email address hidden> wrote:
> Do we have a way to know the system load of the metal hosting rabbitmq?
> I ask because these look like amqp messaging timeouts, which we know can
> happen with too much resource contention.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Here are CPU utilization numbers from the crashdump. None of the systems ever exhausts its CPU resources:

http://paste.ubuntu.com/p/pqzSVKPydf/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

Answering my own question: /var/log/load on the baremetal units within the crashdump tarball contains load info. Please cross-reference amqp timeouts with load spikes.
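
(A rough sketch of one way to do that cross-reference, assuming the crashdump has been unpacked and keeps the usual per-machine var/log layout -- paths and file names are illustrative:)

  # timestamps (to the minute) of oslo messaging timeouts in the nova-compute logs
  grep -h "MessagingTimeout" */var/log/nova/nova-compute.log | cut -c1-16 | sort | uniq -c

  # load-average samples from the periodically collected top output
  grep -h "load average" */var/log/load/* | awk '{print $3, $(NF-2), $(NF-1), $NF}'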

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

storage i/o is also low; peak is 2 MB/s write. http://paste.ubuntu.com/p/bPYCsKpWRF/

Revision history for this message
Ryan Beisner (1chb1n) wrote :

@tinwood coincidentally, that /var/log/load/iotop.log file exposes running processes, so you may be able to derive the info you were seeking in comment #6.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Whoops, storage i/o sometimes goes much higher than 2 MB/s; I wrongly assumed it was all in KB/s on first look:

http://paste.ubuntu.com/p/pKZrXDsTGp/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Even so, given that these systems are running writeback bcache with nvme cache devices, 90 MB/s disk write should not be a bottleneck.

Ryan Beisner (1chb1n)
Changed in charm-nova-compute:
status: New → In Progress
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I've got a Jupyter notebook with a lot of stats that I'll publish to an instance in serverstack (internal IP) ... there are some very interesting things to look at around memory usage, disk IO, and rabbitmq message timeouts. I'll get something up in the next few days (and also attach a PDF to this bug report).

Revision history for this message
Ryan Beisner (1chb1n) wrote :

At the moment, our only line on this bug is one of system load.

Regardless of the math of the underlying hardware specs, there are numerous amqp timeouts across this deployment, and that is generally indicative of resource contention. It is not a surprise that one or more services tripped over a messaging timeout.

If this is a recurring issue, please do add more instances of juju crashdumps and accompanying bundle files, so that we can collate and cross-reference multiple events.

Alex has an interesting analysis forming, though it is not complete yet.

Thank you.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Please see the attached analysis I did on the Juju crash dump files. There were no tracebacks or other crashes in the logs, so I started staring at the log files. I then thought it might be useful to actually try to graph what's going on in the system.

I'd recommend at least skim-reading the document. There are lots of graphs, so it shouldn't take too long. The most interesting graph is at the top of page 24; it shows oslo message timeouts across machines against the rabbit resets.

The rabbit resets are associated with high load at the beginning of the run; this may just be rabbit being restarted (although that's a lot of restarts). Rabbit then settles down and doesn't error again. However, there are still oslo message timeouts across the nova-compute units.

The very regular nova-compute/neutron oslo message timeouts are probably associated with test scripts "doing things" to the OpenStack install. I don't know whether there are retries or not, but the timeouts are very regular and consistent if you look at the pattern in the aforementioned graph.

Is there, therefore, a poor-fit configuration of rabbitmq? Are queues filling up and not being emptied? I'm also going to look more closely at the nova-compute units that had errors to see if there's anything there.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Thanks for the analysis, that is an amazing document!

Is it too much to say that, given the rabbitmq errors occur both when there is (relatively) high load and when there is not, the source of the errors can't always be the relatively high load?

In the final analysis, I'm confused by this statement: "This is due to activity (i.e. testing) on the system." We're not running any tests on these systems; we're just deploying OpenStack.

Finally, the analysis seems to stem from the assumption that load and dropped RabbitMQ messages are somehow related to the "Services not running that should be: nova-compute" error. However, I don't see anything in the document that draws that conclusion from the data. What is the theory there? What messages are being dropped that would cause that? If it's due to transient message loss, why is this not recoverable?

What are the next steps here?

Revision history for this message
Ryan Beisner (1chb1n) wrote :

There isn't one thing that will resolve this, in our view. The next steps are for more people to analyze this data and to test theories on the various fronts of potential contention. We've solicited input from additional engineers.

Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

Just to build on Ryan's comment:

1. I didn't realise that the system was quiescent and not under test. It makes the regular amqp message failures on the nova/neutron units even more disturbing as they occur AFTER the initial load conditions of the charms/juju setting up the system.
2. It's very difficult to untangle what's going on in a complex, distributed, system. My comments in the document, unfortunately, show some of the confusion that I'm still having around what might be going on.
3. The next step is definitely, as Ryan indicates, to get together an analysis group that can plan a set of tests/scenarios that will home in on what the issue(s) might be. I'd like to be part of that group.

Broadly I think this will be:

a) Brainstorm some test scenarios with increasingly complex bundles.
b) Leave certain features 'off' (e.g. monitoring)
c) Perhaps collect more targeted data from RabbitMQ (see the sketch after this list)
d) Re-do the analysis, perhaps looking at different correlations.
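
(As a rough illustration of (c): queue depths and connection counts could be sampled periodically on the rabbitmq-server units and kept for later correlation with the oslo timeouts -- a sketch of the idea, not something that was run here:)

  while true; do
      date -u +"%Y-%m-%d %H:%M:%S"
      rabbitmqctl list_queues name messages consumers | sort -k2 -nr | head -n 20
      rabbitmqctl list_connections state | sort | uniq -c
      sleep 30
  done >> /var/log/rabbitmq-sample.log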

Finally, Jason, you're right in your comment: my initial focus was to try to answer the question: "Is this a load issue." This was indirectly because of the AMQP message timeouts and wondering what could be causing them.

I was trying to rule in/out whether it was load/cpu/memory/disk related. I don't think it's yet possible to say either way. We do have some possibly strange behaviour in the monitoring units: the elasticsearch machines (6, 15) are using all their memory and doing a lot of disk activity, and prometheus on machine 10 uses 50% of its CPU with ramping disk access that then abruptly stops -- but this may be normal for them. I don't know whether this could be affecting the other machines.

Revision history for this message
Chris Gregan (cgregan) wrote :

Field High SLA now requires that an estimated date for a fix be listed in the comments. Please provide this estimate for the open tasks.

Revision history for this message
Chris Gregan (cgregan) wrote :

There has been no update on this one in a while, and it is in danger of failing the High SLA requirement that "significant attention will be given to the issue". Please update.

Ryan Beisner (1chb1n)
Changed in charm-nova-compute:
status: In Progress → Incomplete
Revision history for this message
Ryan Beisner (1chb1n) wrote :

FYI, marking this "incomplete", given that we cannot reproduce it and it's not currently observed anywhere that we know of.

Changed in charm-nova-compute:
assignee: Alex Kavanagh (ajkavanagh) → nobody
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack nova-compute charm because there has been no activity for 60 days.]

Changed in charm-nova-compute:
status: Incomplete → Expired