landscape auto-upgrade is not cluster aware

Bug #1702807 reported by Drew Freiberger
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Landscape Charm
Won't Fix
Undecided
Unassigned
Landscape Server
New
Undecided
Unassigned
OpenStack HA Cluster Charm
Invalid
Undecided
Unassigned
landscape-client-charm
Won't Fix
Undecided
Unassigned

Bug Description

When investigating a recent issue on our percona-cluster database, we found that the three nodes of our HA cluster environment randomly got scheduled for patching at 14:06, 14:12, and 14:12. During the patching of the first node, the database shutdown for the percona-cluster patch upgrade scripts hung indefinitely, and the other two nodes patched (along with restarts) before the first node completed patching. This condition turned into losing all three instances of the database because the startup failed on the second and third nodes because the first node hung up.

I'd like to request a feature upgrade for landscape or it's client to detect if there's a peer hacluster or "crm" installation on the landscape-client-charm unit and determine some way to lock out cluster neighbors from updating at the same time. I realize in our particular scenario, the fault was on the percona/mysql instance, but the entire cluster outage was caused by "random offset" creating a race condition where the failed mysql instance couldn't be detected before the other nodes started upgrading, and two nodes upgraded at the same time, even through random selection.

Revision history for this message
James Page (james-page) wrote :

Marking hacluster task as Invalid as this is a landscape feature request.

Changed in charm-hacluster:
status: New → Invalid
Changed in landscape-charm:
status: New → Invalid
status: Invalid → Won't Fix
Changed in landscape-client-charm:
status: New → Won't Fix
Revision history for this message
Xav Paice (xavpaice) wrote :

Added Landscape server to this, as the current situation needs to be either resolved automatically or a suitable implementation method documented so we can prevent customers hitting this same problem again. I wonder if we had created different update schedules for different cluster nodes, if we'd have avoided the outage caused by updates in this case.

Revision history for this message
Tejeev Patel (tejeevpatel) wrote :

We are running upgrades manually from the NOC on specified nights but having to drag out and baby sit upgrades so we can avoid this issue. As we use Landscape to manage our cloud machines, it would be VERY helpful if it was cloud aware so we could leave it to do it's thing without handling the "automatic" upgrades manually.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.