ceph-osd is showing as fail

Bug #1931567 reported by Eric Desrochers
This bug affects 1 person
Affects         Status        Importance  Assigned to      Milestone
Canonical Juju  Fix Released  High        Ian Booth
2.9             Fix Released  High        Heather Lanigan

Bug Description

# juju status
Model Controller Cloud/Region Version SLA Timestamp
openstack <OBFUSCATED> <OBFUSCATED> 2.8.10 unsupported 04:27:28Z

# juju show-action-status 3783
actions:
- action: juju-run
completed at: n/a
id: "3783"
status: aborting
unit: ceph-osd/36

# controller logs:
https://pastebin.ubuntu.com/p/bYDvScGfzR/

# ceph-osd/36 logs
https://paste.ubuntu.com/p/pcVnSFzPgz/

Tags: seg sts
Eric Desrochers (slashd)
description: updated
Eric Desrochers (slashd)
description: updated
Eric Desrochers (slashd)
tags: added: seg sts
Ian Booth (wallyworld) wrote :

Looking at a database dump, the issue is that the parent operation consisted of 39 actions across 39 units: 38 of them are marked as completed and 1 is marked as aborting (the one we are looking at here, 3783).

An action only enters the aborting state if someone has run juju cancel-action. Juju would then kill the action and mark it as aborted, but that hasn't happened, so it's still in the aborting state. Sometimes the process hangs and juju forcibly kills it, but that hasn't happened either; perhaps the unit agent got shut down before it could.

The logs show the unit agent making an API call to fail the action (set its status to failed). However, the parent operation is marked as completed when it really isn't, because 1 action is not complete yet (still aborting), so juju gets confused.

Given the unit agent appears to be trying to set the action to failed, we can try setting the parent operation state back to running; this should allow things to progress:

db.operations.update({"_id" : "f7afc459-639a-44e6-8bf1-ad5928637772:3754"},{$set: { "status" : "running"}});

We will need to loosen how strictly juju checks for the expected state, so that in cases like this juju will mark the offending action as failed even if the parent thinks it is already complete.
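
To illustrate the difference between the strict and the loosened check, here is a minimal Python sketch. It is a toy model, not Juju's code: the Operation and Action classes, the StateError exception, the fail_action function, and its strict flag are all hypothetical names made up for this example. The only thing taken from the discussion above is the idea that a strict check refuses to fail an action whose parent operation is already marked completed, while a lenient check lets the action move to failed anyway.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    # Hypothetical stand-in for the parent operation record.
    status: str  # e.g. "running" or "completed"


@dataclass
class Action:
    # Hypothetical stand-in for one action belonging to the operation.
    status: str  # e.g. "aborting", "completed", "failed"


class StateError(Exception):
    """Raised when a transition is rejected by the strict check."""


def fail_action(op: Operation, action: Action, strict: bool = True) -> Action:
    """Mark an action as failed.

    With strict=True the call is rejected if the parent operation is
    already marked completed, which leaves the action in its old state.
    With strict=False the action is failed regardless of the parent.
    """
    if strict and op.status == "completed":
        raise StateError("parent operation is already marked completed")
    action.status = "failed"
    return action


if __name__ == "__main__":
    op = Operation(status="completed")
    stuck = Action(status="aborting")
    try:
        fail_action(op, stuck, strict=True)
    except StateError as err:
        print("strict check rejected the update:", err)
    fail_action(op, stuck, strict=False)
    print("lenient check result:", stuck.status)
```

In this model, setting the operation back to running (the manual workaround above) and passing strict=False (the loosened check) both let the action reach the failed state.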

Changed in juju:
milestone: none → 2.8.12
importance: Undecided → High
status: New → Triaged
John A Meinel (jameinel) wrote : Re: [Bug 1931567] Re: ceph-osd is showing as fail

This line:
juju.worker.uniter.remotestate watcher.go:565 got action change for
ceph-osd/36: ...

is very confusing to me. It shows 489 action ids in that entry.
Are we giving the remote state the list of all possible actions that it
might have wanted to run in the past, and then it infers which of those
haven't actually been run yet?

It just seems like a really big list for a 'change'. And it is failing on
the first entry in that list. But then again, if you're regularly running
actions, and one unit is unhappy, it could certainly get behind and then
have a long queue of actions that it should be running but can't get past
the first one.

On Thu, Jun 10, 2021 at 7:55 PM Ian Booth <email address hidden>
wrote:

> ** Also affects: juju/2.9
> Importance: Undecided
> Status: New
>
> ** Changed in: juju/2.9
> Milestone: None => 2.9.6
>
> ** Changed in: juju/2.9
> Importance: Undecided => High
>
> ** Changed in: juju/2.9
> Status: New => Triaged

Ian Booth (wallyworld) wrote :

Confirming that the fix in comment #1 addresses the symptoms reported in the bug. This was also applied to another site to fix the same issue.

If this happens again, diagnose it as follows: look at the failing action, grab its parent operation, and check the operation's completed task count and status. If the status is completed and the completed count is 1 less than the number of actions belonging to that operation (obtained by grepping a db dump for actions with that parent), then you can set the operation state back to running as per comment #1.
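
The check described above can be sketched in Python against records already extracted from a db dump. This is only an illustration under stated assumptions: the dictionary keys "operation" and "completed_count", the function name looks_stuck, and the sample IDs are hypothetical and are not claimed to match the real field names in Juju's database. Only "_id" and "status" appear in the update command from comment #1.

```python
def looks_stuck(operation, actions):
    """Return True if the operation matches the pattern from this bug.

    The pattern: the operation is marked completed, but its completed
    count is exactly 1 less than the number of actions that name it
    as their parent.
    """
    # Assumption: each action record carries its parent operation id
    # under the hypothetical key "operation".
    children = [a for a in actions if a["operation"] == operation["_id"]]
    return (
        operation["status"] == "completed"
        and operation["completed_count"] == len(children) - 1
    )


def unfinished_actions(operation, actions):
    """List the child actions that are not marked completed."""
    return [
        a
        for a in actions
        if a["operation"] == operation["_id"] and a["status"] != "completed"
    ]


if __name__ == "__main__":
    # Made-up data shaped like the case in this report: 39 actions,
    # 38 completed and 1 still aborting.
    op = {"_id": "example-model:1", "status": "completed", "completed_count": 38}
    acts = [
        {"_id": f"example-model:{i}", "operation": op["_id"], "status": "completed"}
        for i in range(2, 40)
    ]
    acts.append(
        {"_id": "example-model:40", "operation": op["_id"], "status": "aborting"}
    )
    print(looks_stuck(op, acts))
    print([a["_id"] for a in unfinished_actions(op, acts)])
```

If looks_stuck returns True, the operation is a candidate for the manual status reset from comment #1; the list from unfinished_actions shows which action is holding it up.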

Changed in juju:
assignee: nobody → Heather Lanigan (hmlanigan)
status: Triaged → In Progress
Heather Lanigan (hmlanigan) wrote :
Changed in juju:
status: In Progress → Won't Fix
Ian Booth (wallyworld) wrote :

We'll backport a fix for 2.8

https://github.com/juju/juju/pull/13483

Changed in juju:
milestone: 2.8.12 → 2.8.13
assignee: Heather Lanigan (hmlanigan) → Ian Booth (wallyworld)
status: Won't Fix → In Progress
Ian Booth (wallyworld)
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released