swift_replicator_health check needs to handle recovery case

Bug #1893975 reported by Paul Goins
This bug affects 2 people
Affects: OpenStack Swift Storage Charm
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

We're encountering alerts on a customer cloud because nodes have fully caught up on replication and gone quiet in the logs.

The current methodology of this check (note: I have not reviewed the sources) appears to be to scan syslog for the string "replicated" and to fire warning/critical alerts if few or no matches have been seen in the last 15 minutes. (The window is configurable.)
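
For illustration, a minimal sketch of a check along those lines (hypothetical; the path, window, and parsing are assumptions, since I have not reviewed the charm's actual check):

    #!/usr/bin/env python3
    # Hypothetical sketch of the check as described above: count recent
    # "replicated" lines from the object replicator in syslog and alert
    # when none are seen. Path, window, and log-tag matching are
    # assumptions, not the charm's actual implementation.
    import sys
    import time
    from datetime import datetime

    SYSLOG = '/var/log/syslog'
    WINDOW = 15 * 60  # seconds; the configurable look-back described above

    def recent_replicated_count(now=None):
        now = now or time.time()
        year = datetime.fromtimestamp(now).year
        count = 0
        with open(SYSLOG) as fh:
            for line in fh:
                if 'object-replicator' not in line or 'replicated' not in line:
                    continue
                # Classic syslog stamps omit the year, e.g. "Sep  9 19:52:13"
                try:
                    stamp = datetime.strptime(f'{year} {line[:15]}',
                                              '%Y %b %d %H:%M:%S')
                except ValueError:
                    continue
                if now - stamp.timestamp() <= WINDOW:
                    count += 1
        return count

    if __name__ == '__main__':
        n = recent_replicated_count()
        if n == 0:
            print('CRITICAL: no "replicated" lines in the last 15 minutes')
            sys.exit(2)
        print(f'OK: {n} "replicated" lines in the last 15 minutes')
        sys.exit(0)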

The problem is that once replication catches up, we stop seeing this message. We see a final "Object replication complete" message and then, unless additional replication becomes necessary, no further messages that would keep this alert from firing.

A more sophisticated check may be required: one that determines whether replication is actually ongoing and suppresses the warnings/alerts when the silence simply means replication has caught up. One possible approach is sketched below.
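
For example, the check could read the replicator's recon cache instead of grepping syslog: Swift's object replicator dumps stats to the recon cache at the end of each pass, including idle ones, so the last-pass timestamp keeps advancing even after replication has caught up. A sketch, where the path and threshold are assumptions rather than the charm's implementation:

    #!/usr/bin/env python3
    # Sketch of a recon-based replication check. The object replicator
    # writes stats to the recon cache after every pass, even an idle one,
    # so object_replication_last keeps advancing once replication has
    # caught up. Path and threshold are assumptions, not the charm's
    # implementation.
    import json
    import sys
    import time

    RECON = '/var/cache/swift/object.recon'
    MAX_AGE = 15 * 60  # seconds since the last completed pass before alerting

    def last_pass_age():
        with open(RECON) as fh:
            data = json.load(fh)
        # object_replication_last: epoch time the last pass finished
        last = data.get('object_replication_last')
        return None if last is None else time.time() - last

    if __name__ == '__main__':
        age = last_pass_age()
        if age is None:
            print('UNKNOWN: no completed replication pass recorded')
            sys.exit(3)
        if age > MAX_AGE:
            print(f'CRITICAL: last replication pass finished {age / 60:.0f}m ago')
            sys.exit(2)
        print(f'OK: last replication pass finished {age / 60:.1f}m ago')
        sys.exit(0)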

Revision history for this message
Drew Freiberger (afreiberger) wrote :

The default swift-object-replicator mechanism runs as a daemonized loop. Once a run completes at 100%, the run_pause configuration option determines how long to sleep before the next pass. Even so, there is always a replication loop that will at least audit each partition on the host to determine whether anything has changed since the last sync, and it will log a completion or an update within 5 minutes.

So the longest you should go without a "replicated" audit line in syslog from swift-object-replicator is 5 minutes plus $run_pause.

Here's the default from the Swift config documentation:

    run_pause = 30    # Time in seconds to wait between replication passes

https://docs.openstack.org/mitaka/config-reference/object-storage/object-server.html
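
Put together, a back-of-the-envelope bound for the alert window (illustrative only; the 5-minute pass bound is the observation above, not a hard limit):

    # One idle audit pass (observed to log within ~5 minutes) plus the
    # configured run_pause between passes. Values are illustrative.
    MAX_PASS_SECONDS = 5 * 60   # observed upper bound on an idle pass
    RUN_PAUSE = 30              # object-server.conf default shown above

    max_expected_silence = MAX_PASS_SECONDS + RUN_PAUSE  # 330 seconds
    print(f'Expect a "replicated" line at least every {max_expected_silence}s')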

When investigating the status of the node that sparked this bug, I found the replicator had not been functioning for 7+ days; it was hung attempting to select() on a no-longer-running child process.

The check can be configured off for deployments that don't run the replicator as a daemon, but that is not a charm-supported option, so I think this is an invalid bug.

Revision history for this message
Drew Freiberger (afreiberger) wrote :

Here's an strace and process info for the stuck process on swift.

The issue is a swift-replicator bug, not a charm health check bug:

https://pastebin.ubuntu.com/p/r3qjSmXZRN/

Revision history for this message
Drew Freiberger (afreiberger) wrote :

It seems this happens after the replicator comes across numerous objects for a given partition on the host that look like this in syslog:

Sep 9 19:52:13 host object-server: Unexpected file /srv/node/bcache16/objects/189491/001/b90cccb2719e1275dc2b78bf35205001/.1576225247.80782.data.6bEZcT: Invalid Timestamp value in filename '.1576225247.80782.data.6bEZcT'
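
The leading dot plus the six random trailing characters matches rsync's temporary-file naming, so these may be strays left behind by interrupted rsync transfers. A hedged sketch for locating such files (the pattern and paths are assumptions; verify anything it finds before acting on it):

    #!/usr/bin/env python3
    # List files matching rsync-style temp names such as
    # ".1576225247.80782.data.6bEZcT" under an object store root.
    # Pattern and default path are assumptions; this only reports,
    # it does not delete.
    import os
    import re
    import sys

    STRAY = re.compile(r'^\..*\.(data|ts|durable)\.[A-Za-z0-9]{6}$')

    def find_strays(objects_root):
        for root, _dirs, files in os.walk(objects_root):
            for name in files:
                if STRAY.match(name):
                    yield os.path.join(root, name)

    if __name__ == '__main__':
        root = sys.argv[1] if len(sys.argv) > 1 else '/srv/node'
        for path in find_strays(root):
            print(path)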
