Canonical Juju

a lot of workers restarting at cron.daily time - presumably raft sync()

Bug #1864496 reported by Junien F on 2020-02-24

This bug affects 2 people

Affects         Status   Importance  Assigned to  Milestone
Canonical Juju  Triaged  Low         Unassigned

Bug Description

Hi,

cron.daily is a cron feature that allows one to easily run scripts once per day. Every day, on every single Ubuntu machine, it starts at 06:25 by default and runs the scripts present in /etc/cron.daily (see "grep daily /etc/crontab").

Among these scripts, "logrotate" is generally present. logrotate rotates logs (duh!), which generally means compressing them, so IO and CPU usage tend to spike. On compute nodes with lots of VMs, this means that all of a sudden every VM is generating high IO and CPU load at the same time, so workloads tend to run slower than usual around this time.

And workloads running on VMs include our juju controllers. We're monitoring API request time, and also "juju deploy cs:ubuntu" duration, and they tend to alert us every day around that time (API request time is > 30s for 20 min).

While investigating this, I also noticed high churn on the controllers during that time (a spike in API requests, mostly "next", "life", "stop" and "relation"), which shouldn't happen since nothing is generating more calls than usual at these times. This is probably caused by manifold worker restarts. Indeed, out of 6231 workers, 2400 restarted during the last event (so this morning). To be precise, I counted restarts between 06:20 and 06:39.
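For anyone wanting to reproduce the count, here is a minimal sketch of the windowing logic. It assumes the restart events have already been extracted from the controller logs as (timestamp, worker name) pairs. The extraction step, the function name and the worker names below are hypothetical and not taken from Juju.

```python
from datetime import datetime, time


def count_restarts_in_window(events, start, end):
    """Count restart events whose time of day falls within [start, end].

    events: iterable of (datetime, worker_name) pairs (assumed input shape).
    Returns (number of restarts, number of distinct workers restarted).
    """
    hits = [(ts, name) for ts, name in events if start <= ts.time() <= end]
    return len(hits), len({name for _, name in hits})


# Hypothetical sample data, only to show the call shape.
sample = [
    (datetime(2020, 2, 24, 6, 19, 59), "worker-a"),  # before the window
    (datetime(2020, 2, 24, 6, 20, 0), "worker-a"),
    (datetime(2020, 2, 24, 6, 25, 10), "worker-b"),
    (datetime(2020, 2, 24, 6, 39, 0), "worker-a"),
    (datetime(2020, 2, 24, 6, 39, 1), "worker-c"),   # after the window
]
restarts, workers = count_restarts_in_window(sample, time(6, 20), time(6, 39))
```

Reporting distinct workers alongside raw restarts helps tell "many workers restarted once" apart from "a few workers restarted repeatedly".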

I'm filing this bug to understand why so many workers are restarting, and how to prevent it.

Additional datapoint : we're seeing a lot of "juju.core.raftlease store.go:260 timeout" during these times https://pastebin.canonical.com/p/WfFdqfhcCv/

Junien F (axino) wrote on 2020-02-25:

Today's restarts :
machine 0 : 1 out of 4891
machine 1 : 6224 out of 10817 (mongodb primary)
machine 2 : 42 out of 4932
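Expressed as ratios, these numbers show how concentrated the restarts are on the mongodb primary. A small sketch of the arithmetic, using only the figures above (the dictionary layout is just for illustration):

```python
# (restarted, total) workers per controller machine, from the figures above.
restarts = {
    "machine 0": (1, 4891),
    "machine 1": (6224, 10817),  # mongodb primary
    "machine 2": (42, 4932),
}

# Fraction of workers restarted on each machine.
ratios = {name: restarted / total for name, (restarted, total) in restarts.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2%}")
```

That is roughly 58% of workers on machine 1, against under 1% on each of the other two machines.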

Junien F (axino) wrote on 2020-02-26:

Today's restarts : 0. No restart on any controller machine. :(

Junien F (axino) wrote on 2020-02-26:

API request duration stayed stable though.

Junien F (axino) wrote on 2020-03-09:

After discussions, mostly with jam, thumper and babbageclunk, the current theory is that because every raft write is followed by an fsync(), raft operations are very sensitive to IO load.
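To illustrate the general mechanism (this is not Juju's raft code, which is written in Go), here is a minimal sketch comparing small appends with and without an fsync() after each write. The function names and parameters are made up for the example. The gap between the two modes depends heavily on the storage backend and on concurrent IO load, and may be negligible if the temp directory sits on tmpfs.

```python
import os
import tempfile
import time


def total_write_time(n_writes: int, payload: bytes, sync: bool) -> float:
    """Append payload n_writes times to a temp file; return total seconds spent."""
    with tempfile.NamedTemporaryFile() as f:
        fd = f.fileno()
        start = time.perf_counter()
        for _ in range(n_writes):
            os.write(fd, payload)
            if sync:
                # Block until the kernel reports the data is on stable storage.
                os.fsync(fd)
        return time.perf_counter() - start


def compare(n_writes: int = 200, size: int = 256) -> dict:
    """Time the same workload in buffered mode and in fsync-per-write mode."""
    payload = b"x" * size
    return {
        "buffered": total_write_time(n_writes, payload, sync=False),
        "fsync": total_write_time(n_writes, payload, sync=True),
    }


if __name__ == "__main__":
    for mode, seconds in compare().items():
        print(f"{mode}: {seconds:.4f}s")
```

Running this once on an idle disk and once while something like a log compression job is active should give a rough feel for how much each synced write is delayed by competing IO.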

However, raft only stores application leadership (and perhaps one other thing, I forgot), and for that data we don't really care about losing a few seconds/minutes of history if all controllers restart simultaneously.

The juju team is investigating the possibility of getting rid of these fsync()s.

Richard Harding (rharding) on 2020-03-10

Changed in juju:
status:	New → Triaged
importance:	Undecided → High

Junien F (axino) on 2020-04-07

summary:

- a lot of workers restarting at cron.daily time
+ a lot of workers restarting at cron.daily time - presumably raft sync()

Canonical Juju QA Bot (juju-qa-bot) wrote on 2022-11-03:

This bug has not been updated in 2 years, so we're marking it Low importance. If you believe this is incorrect, please update the importance.

Changed in juju:
importance:	High → Low
tags:	added: expirebugs-bot
