[2.4.7] cannot record provisioning info for "dryspc": cannot set link-layer devices to machine: i/o timeout

Bug #1809029 reported by Vladimir Grevtsev
This bug affects 2 people
Affects: Canonical Juju
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Juju 2.4.7, Bionic, HA mode.

7 down pending bionic cannot record provisioning info for "dryspc": cannot set link-layer devices to machine "7": read tcp 10.220.40.128:38598->10.220.40.128:37017: i/o timeout

Juju crashdump attached.
Can you please help us find a root cause of this? This is impacting customer delivery now.

Revision history for this message
Vladimir Grevtsev (vlgrevtsev) wrote :

Controller syslog attached.

Vladimir Grevtsev (vlgrevtsev) wrote :

Subscribed field-crit as this is impacting ongoing customer delivery.

Vladimir Grevtsev (vlgrevtsev) wrote :

Reproduced again.

juju status: http://paste.ubuntu.com/p/KkD8v39gD6/
controller logs: https://drive.google.com/drive/folders/1ukg6_hQnNXHF_FZOPc9Oql1ApHFPJwLv?usp=sharing

This was reproduced after deploying two bundles on the same model. The first is a "small" bundle with 9 machines. The second was started after the first deployment had completely finished; it contains the same content as the first bundle plus some more machines.

"small" bundle: https://pastebin.canonical.com/p/wtYTf4V5C8/
"full" bundle: https://pastebin.canonical.com/p/sWKHwxkx5G/

Any advice?

Pedro Guimarães (pguimaraes) wrote :

This seems to be an issue with the mongodb instances syncing up during deployment. Can we instead run the deployment in standalone mode and, afterwards, deploy the Juju controller's copies for HA? That would avoid mongodb syncing during deployment.
Will that have any impact on Juju after deployment?

Richard Harding (rharding) wrote :

That's fine as you're bringing up a fresh deploy and the risk of the main controller going down is low.
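
For reference, the standalone-then-HA sequence discussed here could look roughly like the sketch below. The cloud, controller and bundle arguments are placeholders, not values from this report, and the two wrapper function names are made up for illustration; the only real commands used are `juju bootstrap`, `juju deploy` and `juju enable-ha`.

```shell
#!/bin/sh
# Sketch only: deploy against a single controller, add HA afterwards.
# Function names and arguments are illustrative assumptions.

# Step 1: bootstrap one controller and deploy the bundle while there is
# no mongodb replica set to keep in sync.
deploy_standalone() {
    cloud="$1"        # placeholder, e.g. the name of your MAAS cloud
    controller="$2"   # placeholder controller name
    bundle="$3"       # placeholder path to the bundle file

    juju bootstrap "$cloud" "$controller" || return 1
    juju deploy "$bundle" || return 1
}

# Step 2: run this only after `juju status` shows the deployment has
# settled; it asks Juju to add the extra controller machines for HA.
enable_ha_after_deploy() {
    juju enable-ha
}
```

Usage would be something like `deploy_standalone mycloud myctl ./bundle.yaml`, then waiting for the model to settle, then `enable_ha_after_deploy`. The wait is left manual on purpose, since how long a deployment takes to settle depends on the bundle.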

Tim Penhey (thumper) wrote :

It appears that the problem was resolved. Juju did eventually record the link-layer devices as far as we are able to determine.

Changed in juju:
status: New → Won't Fix
Vladimir Grevtsev (vlgrevtsev) wrote :

It has not been resolved, only worked around by not establishing Juju HA before deployment.

In short: this is still reproducible when deploying a big bundle with HA mode enabled.

Vladimir Grevtsev (vlgrevtsev) wrote :

The issue is still reproducible. However, after applying the following settings on all hypervisor nodes that host Juju controller VMs, the issue is gone: I have redeployed 3 times in a row on the same controllers with HA enabled and saw no errors.

# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
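
A minimal sketch for applying the same three settings to every bcache device and cache set on a hypervisor in one go, instead of per device. The `tune_bcache` function and its optional root-directory argument are hypothetical helpers added for illustration (the argument only exists so the loop can be tried against a scratch directory first); the sysfs file names are the ones from the commands above. As far as the kernel bcache documentation describes it, a `sequential_cutoff` of 0 turns off the sequential I/O bypass and congested thresholds of 0 turn off the congestion-based bypass, so more I/O goes through the cache. Run as root.

```shell
#!/bin/sh
# Sketch only: write 0 to the three bcache tunables wherever they exist.
# tune_bcache and its argument are illustrative, not part of any tool.
tune_bcache() {
    root="${1:-/sys}"   # default to the real sysfs tree

    for f in "$root"/block/bcache*/bcache/sequential_cutoff \
             "$root"/fs/bcache/*/congested_read_threshold_us \
             "$root"/fs/bcache/*/congested_write_threshold_us; do
        # A glob that matches nothing is left as a literal string; skip it.
        [ -e "$f" ] || continue
        echo 0 > "$f" || return 1
        # Read the value back so the operator can see what was applied.
        echo "$f = $(cat "$f")"
    done
}
```

Calling `tune_bcache` with no argument acts on `/sys`; calling it with a directory path acts on that tree instead.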

Richard Harding (rharding) wrote :

Thanks Vladimir. We saw that we were storming the mongodb in Juju during deploy and getting better disk performance out of the underlying machine helps with it. The Juju team is actively looking at mechanisms to do a better job throttling and avoiding some of the storming issues of a deploy like this.
