verify the number of osds deployed
Bug #1721817 reported by
Joe Talerico
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
tripleo | Fix Released | Wishlist | John Fulton | |
Bug Description
TripleO should verify the number of OSDs that are actually deployed before reporting that the deployment succeeded. I have seen cases where TripleO reports Deployment Complete even though the number of OSDs was much lower than I expected.
It should be quite simple to work out the expected count from the number of ceph-storage nodes and the OSD map, then look at ceph -s on the controller to determine whether the number of OSDs matches. <-- possibly a gross oversimplification.
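The check described above could be sketched roughly as follows. This is a minimal illustration, not TripleO's implementation: the helper names (expected_osd_count, reported_osd_counts, osds_match) are hypothetical, the expected count is assumed to be the number of storage nodes times the OSD devices per node, and the input is assumed to be the JSON printed by ceph -s --format json. Some Ceph releases nest the OSD counters one level deeper under osdmap, so the sketch accepts both layouts.

```python
import json


def expected_osd_count(storage_nodes: int, devices_per_node: int) -> int:
    """Expected OSDs, assuming every storage node has the same device list."""
    return storage_nodes * devices_per_node


def reported_osd_counts(status: dict) -> dict:
    """Extract the OSD counters from a parsed 'ceph -s --format json' document.

    Assumption: the counters live under status["osdmap"], or one level deeper
    under status["osdmap"]["osdmap"], depending on the Ceph release.
    """
    osdmap = status.get("osdmap", {})
    osdmap = osdmap.get("osdmap", osdmap)  # unwrap the nested layout if present
    return {key: int(osdmap[key])
            for key in ("num_osds", "num_up_osds", "num_in_osds")}


def osds_match(status_json: str, storage_nodes: int, devices_per_node: int) -> bool:
    """True when the number of OSDs that are up equals the expected count."""
    counts = reported_osd_counts(json.loads(status_json))
    return counts["num_up_osds"] == expected_osd_count(storage_nodes, devices_per_node)
```

Comparing against num_up_osds rather than num_osds is a design choice in this sketch: an OSD can exist in the map without being up, and the report is about OSDs that are actually usable.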
Changed in tripleo: | |
importance: | Undecided → Medium |
Changed in tripleo: | |
assignee: | nobody → John Fulton (jfulton-org) |
Changed in tripleo: | |
milestone: | queens-2 → queens-3 |
Changed in tripleo: | |
milestone: | queens-3 → queens-rc1 |
Changed in tripleo: | |
milestone: | queens-rc1 → rocky-1 |
Changed in tripleo: | |
milestone: | rocky-1 → rocky-2 |
Changed in tripleo: | |
milestone: | rocky-2 → rocky-3 |
Changed in tripleo: | |
milestone: | rocky-3 → rocky-rc1 |
Changed in tripleo: | |
milestone: | rocky-rc1 → stein-1 |
Changed in tripleo: | |
milestone: | stein-1 → stein-2 |
Changed in tripleo: | |
milestone: | stein-2 → stein-3 |
Changed in tripleo: | |
status: | Triaged → Won't Fix |
importance: | Medium → Wishlist |
milestone: | stein-3 → none |
Changed in tripleo: | |
status: | Fix Committed → Fix Released |
So one extreme is to declare the deployment complete when only 1/3 of the expected OSDs are there, and the other extreme is to require *all* OSDs. In a large-scale deployment, it is likely that some drive somewhere in the system has died, isn't seated properly, etc. In any sufficiently large cluster, probabilities dictate that some hardware somewhere is always offline.
So IMHO we should declare the deployment complete if some percentage of the OSDs show up, and make this percentage an adjustable parameter with 95% as the default value. In other words, we declare it complete if enough of the hardware becomes available that we can safely start to use the system and fix the missing OSDs later, as a day-2 activity.
For example, if we have a 3-OSD-node cluster with 24 drives per OSD node, this would be 72 OSDs, and 95% of that would be 68 OSDs (68.4, rounded down). Make sense?
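The threshold proposed in this comment can be sketched as below. This is an illustration only: the function names and the min_percent parameter are hypothetical, not an existing TripleO setting. Rounding down is an assumption made so that the result matches the comment's figure of 68.

```python
def required_osds(expected: int, min_percent: int = 95) -> int:
    """Minimum number of OSDs that must be up, rounded down.

    Integer arithmetic is used so the result does not depend on
    floating-point rounding.
    """
    if expected < 0:
        raise ValueError("expected must be non-negative")
    if not 0 <= min_percent <= 100:
        raise ValueError("min_percent must be between 0 and 100")
    return expected * min_percent // 100


def deployment_complete(expected: int, up: int, min_percent: int = 95) -> bool:
    """True when enough OSDs are up to start using the cluster."""
    return up >= required_osds(expected, min_percent)


if __name__ == "__main__":
    expected = 3 * 24  # 3 OSD nodes with 24 drives each
    print(required_osds(expected))            # 68
    print(deployment_complete(expected, 68))  # True
    print(deployment_complete(expected, 67))  # False
```

Note that 68 of 72 is about 94.4%, so a strict "at least 95%" check that rounds up would require 69 OSDs instead. Whichever rounding is chosen should be documented alongside the parameter.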