a lot of workers restarting at cron.daily time - presumably raft sync()

Bug #1864496 reported by Junien F
This bug affects 2 people
Affects | Status | Importance | Assigned to | Milestone
Canonical Juju | Triaged | Low | Unassigned |

Bug Description

Hi,

cron.daily is a cron feature that allows one to easily run scripts once per day. Every day, on every single Ubuntu machine, it starts at 06:25 and runs the scripts present in /etc/cron.daily (see "grep daily /etc/crontab").

Among these scripts, "logrotate" is generally present. logrotate will rotate logs (duh!), which generally means compressing them, so IO and CPU usage tend to spike. On compute nodes with lots of VMs, this means that all VMs suddenly generate high IO and CPU load at the same time, so workloads tend to run slower than usual around this time.

And workloads running on VMs include our juju controllers. We're monitoring API request time, and also "juju deploy cs:ubuntu" duration, and they tend to alert us every day around that time (API request time is > 30s for 20 min).

While investigating this, I also noticed high churn on the controllers during that time (a spike in API requests, mostly "next", "life", "stop" and "relation"), which shouldn't happen since nothing is generating more calls than usual at these times. This is probably caused by manifold worker restarts. Indeed, out of 6231 workers, 2400 restarted during the last event (so this morning). To be precise, I counted restarts between 06:20 and 06:39.
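For illustration, a count like the one above boils down to tallying restart events whose timestamps fall inside the window. Below is a minimal Python sketch, assuming the restart events have already been extracted from the controller logs as (timestamp, worker name) pairs. The extraction step, the function name `count_restarts` and the sample data are all hypothetical; this is not the command that was actually run.

```python
from datetime import datetime, time

def count_restarts(events, start=time(6, 20), end=time(6, 39)):
    """Return (restart events, distinct workers) within [start, end], inclusive."""
    in_window = [(ts, name) for ts, name in events if start <= ts.time() <= end]
    return len(in_window), len({name for _, name in in_window})

# Hypothetical sample data, for illustration only.
sample = [
    (datetime(2020, 1, 1, 6, 19, 59), "worker-a"),  # just before the window
    (datetime(2020, 1, 1, 6, 25, 3), "worker-a"),
    (datetime(2020, 1, 1, 6, 31, 40), "worker-b"),
    (datetime(2020, 1, 1, 6, 31, 45), "worker-b"),  # same worker restarting again
    (datetime(2020, 1, 1, 6, 40, 0), "worker-c"),   # just after the window
]
total, distinct = count_restarts(sample)
print(f"{total} restarts across {distinct} workers")
```

Counting distinct workers separately from total events helps tell "many workers restarted once" apart from "a few workers restarted in a loop".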

I'm filing this bug to understand why so many workers are restarting, and how to prevent it.

Additional datapoint : we're seeing a lot of "juju.core.raftlease store.go:260 timeout" during these times https://pastebin.canonical.com/p/WfFdqfhcCv/

Junien F (axino) wrote :

Today's restarts :
machine 0 : 1 out of 4891
machine 1 : 6224 out of 10817 (mongodb primary)
machine 2 : 42 out of 4932

Junien F (axino) wrote :

Today's restarts : 0. No restart on any controller machine. :(

Junien F (axino) wrote :

API request duration stayed stable though.

Junien F (axino) wrote :

After discussions with mostly jam, thumper and babbageclunk, current theory is that because any raft write is followed by an fsync(), raft operations are very sensitive to IO load.
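To illustrate why a sync after every write makes an operation sensitive to IO load, here is a minimal Python sketch that times the same appends with and without a per-write fsync(). This is not Juju's raft code: the function name `timed_writes`, the record size and the write count are assumptions made purely for the example. On a disk that is busy with other work (log compression, for instance), each fsync() call has to wait its turn, so the synced variant slows down far more than the buffered one.

```python
import os
import tempfile
import time

def timed_writes(path, count, payload=b"x" * 128, sync_each=False):
    """Write `count` records to `path` and return the elapsed seconds."""
    started = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(count):
            f.write(payload)
            if sync_each:
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # block until the OS reports the data is on disk
    return time.perf_counter() - started

with tempfile.TemporaryDirectory() as tmp:
    buffered = timed_writes(os.path.join(tmp, "buffered.log"), 200)
    synced = timed_writes(os.path.join(tmp, "synced.log"), 200, sync_each=True)
    print(f"buffered: {buffered:.4f}s  fsync per write: {synced:.4f}s")
```

The absolute numbers depend entirely on the storage underneath, so the sketch prints timings rather than asserting any particular ratio.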

However, raft only stores application leadership (and perhaps one other thing, I forgot), for which we don't really care about losing a few seconds/minutes of history if all controllers restart simultaneously.

The juju team is investigating the possibility of getting rid of these fsync()s.

Changed in juju:
status: New → Triaged
importance: Undecided → High
Junien F (axino)
summary: - a lot of workers restarting at cron.daily time
+ a lot of workers restarting at cron.daily time - presumably raft sync()
Canonical Juju QA Bot (juju-qa-bot) wrote :

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance: High → Low
tags: added: expirebugs-bot