[astute] For some reason astute erased node before reinstall with partition preservation and caused data loss
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Won't Fix | High | Fuel Sustaining |
8.0.x | Won't Fix | High | Fuel Sustaining |
Mitaka | Won't Fix | High | Vladimir Kozhukalov |
Newton | Won't Fix | High | Fuel Sustaining |
Bug Description
Astute has a bug that causes unexpected node erasure.
We are upgrading from MOS 7.0 to MOS 9.1 using partition preservation.
Two of our nodes failed to reprovision and were immediately erased by Astute,
which caused data loss. At the very least, we lost the /var/lib/nova XFS partition.
The initial root cause is that Astute allows only a 4-minute timeout for a node reboot, but a hardware node usually takes ~10 minutes.
2016-10-17 12:15:18 WARNING [1041] Time detection (240 sec) for node reboot has expired
2016-10-17 12:15:18 WARNING [1041] Reboot command failed for nodes ["10"]. Check debug output for details
2016-10-17 12:15:18 DEBUG [1041] Task time summary: reboot_
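The 240-second window in the log above is shorter than the ~10 minutes the hardware needs. A minimal sketch of a reboot wait with a configurable timeout follows. It is written in Python purely for illustration and is not Astute's code; `wait_for_reboot`, `is_online`, and the default values are hypothetical names and numbers.

```python
import time
from typing import Callable


def wait_for_reboot(is_online: Callable[[], bool],
                    timeout: float = 900.0,
                    interval: float = 10.0,
                    clock: Callable[[], float] = time.monotonic,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the node reports online or `timeout` seconds pass.

    Returns True if the node came back, False if the wait expired.
    `is_online` is a hypothetical probe supplied by the caller.
    """
    deadline = clock() + timeout
    while True:
        if is_online():
            return True
        if clock() >= deadline:
            return False  # caller must treat the node state as unknown
        sleep(interval)
```

The `clock` and `sleep` parameters exist only so the wait can be exercised with a fake clock; with a 240-second limit, a node that needs 600 seconds is reported as failed, while a 900-second limit lets it through.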
Step-by-step:
1) Astute reboots the node and gives up waiting (4-minute timeout).
2) Step [1], which writes the system type "image", fails (because the node is still offline).
3) This later causes code block [2] to be executed after provisioning fails (an unrelated issue).
Finally, the node is erased. See log [3] :(
1 - https:/
2 - https:/
3 - http://
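The sequence above erases a node whose state was never confirmed after the reboot. One possible safer rule, sketched in Python under stated assumptions, is to erase only failed nodes that are known to have come back online. The `NodeState` record and `nodes_safe_to_erase` are hypothetical and do not reflect Astute's actual data model.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeState:
    uid: str
    reboot_confirmed: bool  # did the node come back within the timeout?
    provisioned: bool       # did provisioning finish successfully?


def nodes_safe_to_erase(nodes: List[NodeState]) -> List[str]:
    """Return uids of failed nodes whose post-reboot state is known.

    A node that never confirmed its reboot is skipped: its disks may
    still hold preserved data, so erasing it is not considered safe.
    """
    return [n.uid for n in nodes
            if not n.provisioned and n.reboot_confirmed]
```

Under this rule, a node like node 10 from the log (reboot wait expired, provisioning failed) would be left alone for an operator to inspect.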
Changed in fuel: | |
importance: | Undecided → High |
assignee: | nobody → Fuel Sustaining (fuel-sustaining-team) |
milestone: | none → 9.2 |
status: | New → Confirmed |
tags: | added: area-python |
Changed in fuel: | |
milestone: | 10.0 → 11.0 |
I would also mention here that this problem is not only about timeouts; first and foremost, it is about the robustness of the partition preservation mechanism. The keep_data=True flag, which can be set for partitions, has to mean that any operation that could lead to data loss is excluded for disks with such partitions.
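As a rough illustration of that requirement, the sketch below filters out every disk that carries at least one partition marked keep_data before any destructive step runs. The dictionary layout (`name`, `partitions`, `mount`) and the function name are assumptions made for this example, not Fuel's real provisioning schema.

```python
from typing import Dict, List


def disks_allowed_to_erase(disks: List[Dict]) -> List[str]:
    """Return names of disks that carry no partition marked keep_data."""
    allowed = []
    for disk in disks:
        partitions = disk.get("partitions", [])
        if any(p.get("keep_data", False) for p in partitions):
            continue  # never touch a disk holding preserved data
        allowed.append(disk["name"])
    return allowed
```

For example, a disk holding a /var/lib/nova partition with keep_data set would be excluded, while a disk with only unflagged partitions would remain eligible.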