[2.7] ceph-osd stuck in "agent initializing"

Bug #1847128 reported by Peter Matulis
This bug affects 2 people
Affects: Canonical Juju
Status: Fix Released
Importance: Critical
Assigned to: Harry Pidcock
Milestone: 2.6.10

Bug Description

There appears to be a regression from 2.6.9. I can't use Ceph on AWS. User-facing symptom:

Unit Workload Agent Machine Public address Ports Message
ceph-mon/0* waiting idle 0 54.234.190.27 Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (3)
ceph-mon/1 waiting idle 1 18.212.147.104 Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (3)
ceph-mon/2 waiting idle 2 75.101.192.68 Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (3)
ceph-osd/0* waiting allocating 3 54.172.158.50 agent initializing
ceph-osd/1 waiting allocating 4 34.229.113.197 agent initializing
ceph-osd/2 waiting allocating 5 3.208.87.212 agent initializing

Details here:

https://paste.ubuntu.com/p/mWjS5wBQHx/

Looks similar to bug #1778033.
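
For anyone triaging: a quick first check is whether the EBS volumes were actually provisioned and attached to the stuck units. A minimal sketch, assuming the same model as in the status output above:

  $ juju status --storage
  $ juju storage --format yaml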

Revision history for this message
Harry Pidcock (hpidcock) wrote :

I tried your steps with a 2.7 build straight out of the develop tree with success.

Model Controller Cloud/Region Version SLA Timestamp
default hpidcock aws/us-east-1 2.7-beta1.1 unsupported 12:18:17+10:00

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 12.2.12 active 3 ceph-mon jujucharms 42 ubuntu
ceph-osd 12.2.12 active 3 ceph-osd jujucharms 291 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-mon/0* active idle 0 3.230.170.116 Unit is ready and clustered
ceph-mon/1 active idle 1 100.26.17.9 Unit is ready and clustered
ceph-mon/2 active idle 2 50.19.130.214 Unit is ready and clustered
ceph-osd/0 active idle 3 3.226.252.64 Unit is ready (2 OSD)
ceph-osd/1* active idle 4 34.204.42.232 Unit is ready (2 OSD)
ceph-osd/2 active executing 5 18.210.15.39 Unit is ready (2 OSD)

Machine State DNS Inst id Series AZ Message
0 started 3.230.170.116 i-07f143c04f0d970e2 bionic us-east-1a running
1 started 100.26.17.9 i-05ec51127fe7cd483 bionic us-east-1b running
2 started 50.19.130.214 i-04d941f270cc49bff bionic us-east-1c running
3 started 3.226.252.64 i-0e09265fc266dabc3 bionic us-east-1a running
4 started 34.204.42.232 i-0c3baba4c1390e313 bionic us-east-1c running
5 started 18.210.15.39 i-06b53f0edb5e3017a bionic us-east-1b running

Is it possible you are hitting AWS limits?
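
If AWS limits are suspected, one rough client-side check (assuming the aws CLI is configured for the same account and region as the deployment) is to count existing EBS volumes and look at the account's instance limit:

  $ aws ec2 describe-volumes --region us-east-1 --query 'length(Volumes[])'
  $ aws ec2 describe-account-attributes --region us-east-1 --attribute-names max-instances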

Changed in juju:
assignee: nobody → Harry Pidcock (hpidcock)
Harry Pidcock (hpidcock)
Changed in juju:
status: New → Incomplete
Revision history for this message
Peter Matulis (petermatulis) wrote :

Hi Harry,

I am able to consistently reproduce this issue. Alternating between 2.6.9 and 2.7 gives me working and non-working deployments, respectively. I'm tracking volumes on the cloud's dashboard, and after each kill-controller invocation every volume gets removed. The last 2.7 version I tried was 2.7-beta1+develop-af0c715.
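
For reference, that alternation can be done by switching the juju snap between channels before re-bootstrapping; the channel names below are an assumption and may need adjusting to whatever tracks are published:

  $ snap refresh --channel=2.6/stable juju   # known-good 2.6.9
  $ snap refresh --channel=edge juju         # 2.7-beta1+develop builds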

Changed in juju:
status: Incomplete → New
Revision history for this message
Tim McNamara (tim-clicks) wrote :

Running the edge snap, I encounter the same problem:

$ juju status --storage
Model Controller Cloud/Region Version SLA Timestamp
default aws-ceph aws/us-east-1 2.7-beta1 unsupported 15:57:48+13:00

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 12.2.12 waiting 3 ceph-mon jujucharms 42 ubuntu
ceph-osd waiting 0/3 ceph-osd jujucharms 291 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-mon/0* waiting idle 0 54.237.245.65 Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (3)
ceph-mon/1 waiting idle 1 18.234.240.196 Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (3)
ceph-mon/2 waiting idle 2 54.234.221.50 Monitor bootstrapped but waiting for number of OSDs to reach expected-osd-count (3)
ceph-osd/0* waiting allocating 3 54.198.124.115 agent initializing
ceph-osd/1 waiting allocating 4 52.70.40.7 agent initializing
ceph-osd/2 waiting allocating 5 3.93.31.77 agent initializing

Machine State DNS Inst id Series AZ Message
0 started 54.237.245.65 i-00834b2d7f515005d bionic us-east-1a running
1 started 18.234.240.196 i-0a23d4460f080e79f bionic us-east-1c running
2 started 54.234.221.50 i-061e8172cdb34f1a0 bionic us-east-1d running
3 started 54.198.124.115 i-0f94827ea68c59e17 bionic us-east-1a running
4 started 52.70.40.7 i-02cdda575d0576af5 bionic us-east-1c running
5 started 3.93.31.77 i-0d6a73cbf3e2d7765 bionic us-east-1d running

Storage Unit Storage id Type Pool Mountpoint Size Status Message
ceph-osd/0 osd-devices/0 block ebs 2.0GiB attached
ceph-osd/0 osd-devices/1 block ebs 2.0GiB attached
ceph-osd/0 osd-journals/2 block ebs 3.0GiB attached
ceph-osd/1 osd-devices/3 block ebs 2.0GiB attached
ceph-osd/1 osd-devices/4 block ebs 2.0GiB attached
ceph-osd/1 osd-journals/5 block ebs 3.0GiB attached
ceph-osd/2 osd-devices/6 block ebs 2.0GiB attached
ceph-osd/2 osd-devices/7 block ebs 2.0GiB attached
ceph-osd/2 osd-journals/8 block ebs 3.0GiB attached

To get there, I issued these commands:

  $ snap refresh --channel=edge juju
  $ snap info juju | grep ^installed:
  installed: 2.7-beta1+develop-2215fe6 (9219) 75MB classic
  $ /snap/bin/juju bootstrap --no-gui aws aws-ceph
  $ juju deploy -n 3 ceph-mon
  $ juju deploy -n 3 ceph-osd --storage osd-devices=2G,2 --storage osd-journals=3G,1
  $ juju relate ceph-mon ceph-osd
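
One way to dig into why the machine agents never get past "agent initializing" is to read the machine agent log on one of the OSD machines; the log path below is the standard Juju location and is an assumption about this environment rather than something taken from this report:

  $ juju ssh 3 -- sudo tail -n 100 /var/log/juju/machine-3.log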

Revision history for this message
Harry Pidcock (hpidcock) wrote :

Hey Peter,

Hmm, that's concerning. Can you please give me a bit more information so that I can replicate/diagnose this?

Can you please add `--logging-config "<root>=INFO;juju.worker.diskmanager=TRACE"` to your bootstrap command and then add the full debug-log here.
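
Put together with the reproduction steps above, the run would look roughly like this (controller and cloud names are placeholders; the logging-config value is the one requested above):

  $ juju bootstrap --no-gui aws aws-ceph \
      --logging-config "<root>=INFO;juju.worker.diskmanager=TRACE"
  $ juju deploy -n 3 ceph-mon
  $ juju deploy -n 3 ceph-osd --storage osd-devices=2G,2 --storage osd-journals=3G,1
  $ juju relate ceph-mon ceph-osd
  $ juju debug-log --replay > debug.log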

Additionally are you using cs:ceph-mon-42 and cs:ceph-osd-291?

Thank you

Harry Pidcock (hpidcock)
Changed in juju:
milestone: none → 2.6.10
importance: Undecided → Critical
status: New → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released