Landscape Server

landscape auto-upgrade is not cluster aware

Bug #1702807 reported by Drew Freiberger on 2017-07-07

This bug affects 3 people

	Status	Importance	Assigned to
Landscape Charm	Won't Fix	Undecided	Unassigned
Landscape Server	New	Undecided	Unassigned
OpenStack HA Cluster Charm	Invalid	Undecided	Unassigned
landscape-client-charm	Won't Fix	Undecided	Unassigned

Bug Description

When investigating a recent issue on our percona-cluster database, we found that the three nodes of our HA cluster environment randomly got scheduled for patching at 14:06, 14:12, and 14:12. During the patching of the first node, the database shutdown for the percona-cluster patch upgrade scripts hung indefinitely, and the other two nodes patched (along with restarts) before the first node completed patching. This condition turned into losing all three instances of the database because the startup failed on the second and third nodes because the first node hung up.

I'd like to request a feature upgrade for landscape or it's client to detect if there's a peer hacluster or "crm" installation on the landscape-client-charm unit and determine some way to lock out cluster neighbors from updating at the same time. I realize in our particular scenario, the fault was on the percona/mysql instance, but the entire cluster outage was caused by "random offset" creating a race condition where the failed mysql instance couldn't be detected before the other nodes started upgrading, and two nodes upgraded at the same time, even through random selection.

Tags:

Revision history for this message

James Page (james-page) wrote on 2017-09-21:

Marking hacluster task as Invalid as this is a landscape feature request.

Changed in charm-hacluster:
status:	New → Invalid

Uros Jovanovic (uros-jovanovic) on 2017-10-05

Changed in landscape-charm:
status:	New → Invalid
status:	Invalid → Won't Fix
Changed in landscape-client-charm:
status:	New → Won't Fix

Revision history for this message

Xav Paice (xavpaice) wrote on 2017-10-13:

Added Landscape server to this, as the current situation needs to be either resolved automatically or a suitable implementation method documented so we can prevent customers hitting this same problem again. I wonder if we had created different update schedules for different cluster nodes, if we'd have avoided the outage caused by updates in this case.

Revision history for this message

Tejeev Patel (tejeevpatel) wrote on 2018-10-12:

We are running upgrades manually from the NOC on specified nights but having to drag out and baby sit upgrades so we can avoid this issue. As we use Landscape to manage our cloud machines, it would be VERY helpful if it was cloud aware so we could leave it to do it's thing without handling the "automatic" upgrades manually.

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.