landscape auto-upgrade is not cluster aware

Bug #1702807 reported by Drew Freiberger on 2017-07-07
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Landscape Charm
Landscape Server
OpenStack hacluster charm

Bug Description

When investigating a recent issue on our percona-cluster database, we found that the three nodes of our HA cluster environment randomly got scheduled for patching at 14:06, 14:12, and 14:12. During the patching of the first node, the database shutdown for the percona-cluster patch upgrade scripts hung indefinitely, and the other two nodes patched (along with restarts) before the first node completed patching. This condition turned into losing all three instances of the database because the startup failed on the second and third nodes because the first node hung up.

I'd like to request a feature upgrade for landscape or it's client to detect if there's a peer hacluster or "crm" installation on the landscape-client-charm unit and determine some way to lock out cluster neighbors from updating at the same time. I realize in our particular scenario, the fault was on the percona/mysql instance, but the entire cluster outage was caused by "random offset" creating a race condition where the failed mysql instance couldn't be detected before the other nodes started upgrading, and two nodes upgraded at the same time, even through random selection.

James Page (james-page) wrote :

Marking hacluster task as Invalid as this is a landscape feature request.

Changed in charm-hacluster:
status: New → Invalid
Changed in landscape-charm:
status: New → Invalid
status: Invalid → Won't Fix
Changed in landscape-client-charm:
status: New → Won't Fix
Xav Paice (xavpaice) wrote :

Added Landscape server to this, as the current situation needs to be either resolved automatically or a suitable implementation method documented so we can prevent customers hitting this same problem again. I wonder if we had created different update schedules for different cluster nodes, if we'd have avoided the outage caused by updates in this case.

Tejeev Patel (tejeevpatel) wrote :

We are running upgrades manually from the NOC on specified nights but having to drag out and baby sit upgrades so we can avoid this issue. As we use Landscape to manage our cloud machines, it would be VERY helpful if it was cloud aware so we could leave it to do it's thing without handling the "automatic" upgrades manually.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers