[2.9.8] openstack provider: "volume quota exceeded" errors are not surfaced in Juju status

Bug #1935964 reported by Dmitrii Shcherbakov
Affects: Canonical Juju
Status: Expired
Importance: High
Assigned to: Unassigned
Milestone: none

Bug Description

Deployed one of our functional test bundles (https://paste.ubuntu.com/p/xtBZnTp8PW/) on a fresh 2.9.8 controller and ran into a situation where unit agents are reported as "allocating" in the controller state, whereas local `juju_unit_status` invocations show the unit in the "running" state. At the same time, there are no unit agent processes running (the machine agent is fine).

Dependency engine report: https://paste.ubuntu.com/p/9KZtHsXHjS/
Goroutines: https://paste.ubuntu.com/p/M9w4tRMZ9n/
Full Juju status: https://paste.ubuntu.com/p/mDnxYhcKgf/ (note that some units are fine).
mongodump: https://private-fileshare.canonical.com/~dima/juju-dumps/2021-07-13-mongodump-1626178792

I was able to reproduce the same issue twice (did a controller redeployment just in case).

ubuntu@dmitriis-bastion:~$ juju status
Model Controller Cloud/Region Version SLA Timestamp
zaza-634b66967daa dmitriis-serverstack serverstack/serverstack 2.9.8 unsupported 11:47:11Z
# ...
ceph-osd/0* waiting allocating 8 10.5.1.251 agent initializing
ceph-osd/1 waiting allocating 9 10.5.3.223 agent initializing
ceph-osd/2 waiting allocating 10 10.5.2.5 agent initializing
# ...
Machine State DNS Inst id Series AZ Message
# ...
8 started 10.5.1.251 41acc793-3cf9-4148-a9eb-a6f22dfaef0f focal nova ACTIVE
9 started 10.5.3.223 b8cab859-23c3-4ae0-b9b9-d41cc2913f02 focal nova ACTIVE
10 started 10.5.2.5 1fe227cd-2cf7-4236-a439-1e8327c6a7a2 focal nova ACTIVE
# ...

ubuntu@juju-9b611f-zaza-634b66967daa-8:/var/log$ systemctl list-unit-files | grep juju
jujud-machine-8.service enabled enabled

ubuntu@juju-9b611f-zaza-634b66967daa-8:/var/log$ pgrep -af juju
2178 bash /etc/systemd/system/jujud-machine-8-exec-start.sh
2183 /var/lib/juju/tools/machine-8/jujud machine --data-dir /var/lib/juju --machine-id 8 --debug

ubuntu@juju-9b611f-zaza-634b66967daa-8:/var/log$ grep -RiP ceph-osd
juju/machine-8.log:2021-07-13 11:20:01 INFO juju.worker.deployer deployer.go:161 checking unit "ceph-osd/0"
juju/machine-8.log:2021-07-13 11:20:01 INFO juju.worker.deployer deployer.go:204 deploying unit "ceph-osd/0"
juju/machine-8.log:2021-07-13 11:20:02 INFO juju.worker.deployer unit_agent.go:121 creating new agent config for "ceph-osd/0"
juju/machine-8.log:2021-07-13 11:20:02 INFO juju.worker.deployer nested.go:354 starting workers for "ceph-osd/0"
juju/machine-8.log:2021-07-13 11:20:02 DEBUG juju.worker.leadership tracker.go:125 ceph-osd/0 making initial claim for ceph-osd leadership
juju/machine-8.log:2021-07-13 11:20:02 INFO juju.worker.leadership tracker.go:194 ceph-osd/0 promoted to leadership of ceph-osd
juju/machine-8.log:2021-07-13 11:20:02 INFO juju.agent.tools symlinks.go:20 ensure jujuc symlinks in /var/lib/juju/tools/unit-ceph-osd-0
juju/machine-8.log:2021-07-13 11:20:02 INFO juju.downloader download.go:111 downloading from cs:~openstack-charmers-next/ceph-osd-494
juju/machine-8.log:2021-07-13 11:20:02 INFO juju.downloader download.go:94 download complete ("cs:~openstack-charmers-next/ceph-osd-494")
juju/machine-8.log:2021-07-13 11:20:02 INFO juju.downloader download.go:174 download verified ("cs:~openstack-charmers-next/ceph-osd-494")
juju/unit-ceph-osd-0.log:2021-07-13 11:20:02 INFO juju unit_agent.go:253 Starting unit workers for "ceph-osd/0"
juju/unit-ceph-osd-0.log:2021-07-13 11:20:02 INFO juju.worker.apicaller connect.go:158 [a06a92] "unit-ceph-osd-0" successfully connected to "10.5.1.74:17070"
juju/unit-ceph-osd-0.log:2021-07-13 11:20:02 INFO juju.worker.apicaller connect.go:255 [a06a92] password changed for "unit-ceph-osd-0"
juju/unit-ceph-osd-0.log:2021-07-13 11:20:02 INFO juju.worker.apicaller connect.go:158 [a06a92] "unit-ceph-osd-0" successfully connected to "10.5.1.74:17070"
juju/unit-ceph-osd-0.log:2021-07-13 11:20:02 INFO juju.worker.uniter uniter.go:328 unit "ceph-osd/0" started
juju/unit-ceph-osd-0.log:2021-07-13 11:20:02 INFO juju.worker.uniter.charm bundles.go:79 downloading cs:~openstack-charmers-next/ceph-osd-494 from API server
juju/machine-lock.log:2021-07-13 11:20:02 === agent unit-ceph-osd-0 started ===
juju/machine-lock.log:2021-07-13 11:20:02 unit-ceph-osd-0: meterstatus (meter-status-changed), waited 0s, held 0s

ubuntu@juju-9b611f-zaza-634b66967daa-8:/var/log$ ls -al /var/lib/juju/agents/
total 16
drwxr-xr-x 4 root root 4096 Jul 13 11:20 .
drwxr-xr-x 6 root root 4096 Jul 13 11:20 ..
drwxr-xr-x 2 root root 4096 Jul 13 11:20 machine-8
drwxr-xr-x 4 root root 4096 Jul 13 11:20 unit-ceph-osd-0

ubuntu@juju-9b611f-zaza-634b66967daa-8:/var/log$ sudo ss -apx | grep juju
u_str LISTEN 0 4096 @jujud-machine-8 27612 * 0 users:(("jujud",pid=2183,fd=7))
u_str LISTEN 0 4096 /var/lib/juju/agents/unit-ceph-osd-0/914117156/s 28290 * 0 users:(("jujud",pid=2183,fd=14))
u_str LISTEN 0 4096 /var/lib/juju/agents/unit-ceph-osd-0/805790685/s 28365 * 0 users:(("jujud",pid=2183,fd=12))

ubuntu@juju-9b611f-zaza-634b66967daa-8:/var/log$ ps 2183
    PID TTY STAT TIME COMMAND
   2183 ? SLl 0:00 /var/lib/juju/tools/machine-8/jujud machine --data-dir /var/lib/juju --machine-id 8 --debug

ubuntu@juju-9b611f-zaza-634b66967daa-8:/var/log$ sudo lsof -p 2183
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
jujud 2183 root cwd DIR 252,1 4096 2 /
jujud 2183 root rtd DIR 252,1 4096 2 /
jujud 2183 root txt REG 252,1 155099136 258070 /var/lib/juju/tools/2.9.8-ubuntu-amd64/jujud
jujud 2183 root 0r CHR 1,3 0t0 6 /dev/null
jujud 2183 root 1w REG 252,1 16374 258201 /var/log/juju/machine-8.log
jujud 2183 root 2w REG 252,1 16374 258201 /var/log/juju/machine-8.log
jujud 2183 root 3u a_inode 0,14 0 10376 [eventpoll]
jujud 2183 root 4r FIFO 0,13 0t0 27489 pipe
jujud 2183 root 5w FIFO 0,13 0t0 27489 pipe
jujud 2183 root 6w REG 252,1 16374 258201 /var/log/juju/machine-8.log
jujud 2183 root 7u unix 0xffff8f9a391e2c00 0t0 27612 @jujud-machine-8 type=STREAM
jujud 2183 root 8u IPv4 27658 0t0 TCP juju-9b611f-zaza-634b66967daa-8.project.serverstack:44964->juju-05185a-controller-0.project.serverstack:17070 (ESTABLISHED)
jujud 2183 root 9u IPv4 27657 0t0 TCP juju-9b611f-zaza-634b66967daa-8.project.serverstack:44962->juju-05185a-controller-0.project.serverstack:17070 (ESTABLISHED)
jujud 2183 root 10w REG 252,1 1217 258199 /var/log/juju/unit-ceph-osd-0.log
jujud 2183 root 11u IPv4 28276 0t0 TCP juju-9b611f-zaza-634b66967daa-8.project.serverstack:44968->juju-05185a-controller-0.project.serverstack:17070 (ESTABLISHED)
jujud 2183 root 12u unix 0xffff8f9a391e2400 0t0 28365 /var/lib/juju/agents/unit-ceph-osd-0/805790685/s type=STREAM
jujud 2183 root 13u IPv4 28289 0t0 TCP juju-9b611f-zaza-634b66967daa-8.project.serverstack:44970->juju-05185a-controller-0.project.serverstack:17070 (ESTABLISHED)
jujud 2183 root 14u unix 0xffff8f9a391e3000 0t0 28290 /var/lib/juju/agents/unit-ceph-osd-0/914117156/s type=STREAM

ubuntu@juju-9b611f-zaza-634b66967daa-8:/var/log$ juju_unit_status
agent: machine-8
units:
  ceph-osd/0: running

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

From what I can see, unit agents are no longer started as separate processes and are instead run within the machine agent.

After enabling the TRACE log level, it looks like the resolver is blocked on waiting for storage.

juju model-config logging-config='<root>=TRACE;unit=TRACE;juju.state=TRACE;juju.worker.uniter.resolver=TRACE;'

root@juju-9b611f-zaza-634b66967daa-8:~# systemctl restart jujud-machine-8.service

https://paste.ubuntu.com/p/gb7FTXQ4Px/

2021-07-13 13:12:30 INFO juju.worker.uniter uniter.go:328 unit "ceph-osd/0" started
2021-07-13 13:12:30 DEBUG juju.worker.uniter runlistener.go:118 juju-run listener running
2021-07-13 13:12:30 INFO juju.worker.uniter uniter.go:346 hooks are retried false
2021-07-13 13:12:30 DEBUG juju.worker.uniter.remotestate watcher.go:511 got application change for ceph-osd/0
2021-07-13 13:12:30 DEBUG juju.worker.uniter.remotestate watcher.go:584 got address change for ceph-osd/0: ok=true, hashes=[20c16e5e8646abbbefa7f5cf61702bc22fb50fec602ee7b2992a3a1a4934b0e9]
2021-07-13 13:12:30 DEBUG juju.worker.uniter.remotestate watcher.go:613 got relations change for ceph-osd/0: ok=true
2021-07-13 13:12:30 DEBUG juju.worker.uniter.remotestate watcher.go:501 got unit change for ceph-osd/0
2021-07-13 13:12:30 DEBUG juju.worker.uniter.remotestate watcher.go:563 got trust config change for ceph-osd/0: ok=true, hashes=[7b6f0296d4721da093763e03dc8df64bd05de8b3e3bdecd8220fc10cf4b1171d]
2021-07-13 13:12:30 DEBUG juju.worker.uniter.remotestate watcher.go:595 got leader settings change for ceph-osd/0: ok=true
2021-07-13 13:12:30 DEBUG juju.worker.uniter.remotestate watcher.go:633 got update status interval change for ceph-osd/0: ok=true
2021-07-13 13:12:30 DEBUG juju.worker.uniter.remotestate watcher.go:623 got storage change for ceph-osd/0: [osd-devices/0] ok=true
2021-07-13 13:12:30 DEBUG juju.worker.uniter.remotestate watcher.go:574 got upgrade series change
2021-07-13 13:12:30 DEBUG juju.worker.uniter.remotestate watcher.go:724 no upgrade series in progress, reinitializing local upgrade series state
2021-07-13 13:12:30 DEBUG juju.worker.uniter.remotestate watcher.go:605 got action change for ceph-osd/0: [] ok=true
2021-07-13 13:12:30 DEBUG juju.worker.uniter.remotestate watcher.go:552 got config change for ceph-osd/0: ok=true, hashes=[94f0bb34b42ffa8222dae077ca74cc42b969701068774be908d6024961b1bdd1]
2021-07-13 13:12:30 TRACE juju.worker.uniter.resolver loop.go:190 charmdir: available=false opState: started=false stopped=false changing=false
2021-07-13 13:12:30 TRACE juju.worker.uniter.upgradeseries resolver.go:48 localState.Kind="run-hook", localState.UpgradeSeriesStatus="not started", remoteState.UpgradeSeriesStatus="not started"
2021-07-13 13:12:30 DEBUG juju.worker.uniter.storage resolver.go:174 next hook op for storage-osd-devices-0: {Kind:0 Life:alive Attached:false Location:}
2021-07-13 13:12:30 DEBUG juju.worker.uniter.storage resolver.go:138 still pending [storage-osd-devices-0]

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote (last edit ):

Indeed, the problem is related to storage volume allocation.

juju storage
Unit Storage id Type Size Status Message
ceph-osd/0 osd-devices/0 block pending invalid status (413): {"overLimit": {"code": 413, "message": "VolumeLimitExceeded: Maximum number of volumes allowed (10) exceeded for quota 'volumes'.", "retryAfter": "0"}}
ceph-osd/1 osd-devices/1 block pending invalid status (413): {"overLimit": {"code": 413, "message": "VolumeLimitExceeded: Maximum number of volumes allowed (10) exceeded for quota 'volumes'.", "retryAfter": "0"}}
ceph-osd/2 osd-devices/2 block pending invalid status (413): {"overLimit": {"code": 413, "message": "VolumeLimitExceeded: Maximum number of volumes allowed (10) exceeded for quota 'volumes'.", "retryAfter": "0"}}
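The 413 body Juju receives here is structured JSON, so the human-readable message is mechanically recoverable; a minimal sketch (plain Python — `extract_quota_message` is a hypothetical helper, not Juju code) of pulling it out for display in status:

```python
import json

def extract_quota_message(raw: str) -> str:
    """Pull the human-readable message out of an OpenStack overLimit
    error body (the JSON payload after "invalid status (413): ")."""
    body = json.loads(raw)
    # Fall back to the raw payload if the expected key is missing.
    return body.get("overLimit", {}).get("message", raw)

raw = """{"overLimit": {"code": 413, "message": "VolumeLimitExceeded: Maximum number of volumes allowed (10) exceeded for quota 'volumes'.", "retryAfter": "0"}}"""
print(extract_quota_message(raw))
# VolumeLimitExceeded: Maximum number of volumes allowed (10) exceeded for quota 'volumes'.
```

Surfacing that one sentence in `juju status` (rather than only in `juju storage`) would have made the root cause obvious from the start.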

I am going to rename the bug because there are two problems here:

* error reporting;
* there does not appear to be a way to retry storage volume provisioning.
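On the second point, a provisioner-side retry could in principle honour the `retryAfter` hint in the error body. Purely as an illustration (not Juju code — `QuotaExceeded` and `create_volume` are hypothetical stand-ins), a bounded retry loop might look like:

```python
import time

class QuotaExceeded(Exception):
    """Stand-in for the 413 overLimit error shown above."""

def provision_with_retry(create_volume, attempts=3, backoff=1.0):
    """Retry a quota-limited volume create a bounded number of times.

    A real implementation would use the server's retryAfter hint and
    surface the last error in status rather than sleeping blindly.
    """
    for attempt in range(1, attempts + 1):
        try:
            return create_volume()
        except QuotaExceeded:
            if attempt == attempts:
                raise  # out of attempts: propagate so the error is visible
            time.sleep(backoff * attempt)
```

Even without automatic retries, exposing a manual "retry provisioning" action once the operator frees quota would resolve the stuck-pending state seen above.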

summary: - [2.9.8] juju unit agents are reported as stuck in the allocating state
- (no processes started)
+ [2.9.8] openstack provider: "volume quota exceeded" errors are not
+ surfaced in Juju status
Ian Booth (wallyworld)
Changed in juju:
importance: Undecided → High
status: New → Triaged
milestone: none → 2.9.12
tags: added: openstack-provider
tags: added: reviewed
Changed in juju:
milestone: deferred release after release, 2.9.12 → 2.9.13 → … → 2.9.30, then to 2.9-next (Ian Booth), 3.0.1 (Harry Pidcock), 3.0.2, and 3.0.3
Revision history for this message
Juan M. Tirado (tiradojm) wrote :

Old issue. Reopen if needed.

Changed in juju:
milestone: 3.0.3 → none
status: Triaged → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for Canonical Juju because there has been no activity for 60 days.]

Changed in juju:
status: Incomplete → Expired