Canonical Juju

[2.4.7] cannot record provisioning info for "dryspc": cannot set link-layer devices to machine: i/o timeout

Bug #1809029 reported by Vladimir Grevtsev on 2018-12-18

This bug affects 2 people

Affects	Status	Importance	Assigned to	Milestone
Canonical Juju	Won't Fix	Undecided	Unassigned

Bug Description

Juju 2.4.7, Bionic, HA mode.

7 down pending bionic cannot record provisioning info for "dryspc": cannot set link-layer devices to machine "7": read tcp 10.220.40.128:38598->10.220.40.128:37017: i/o timeout

Juju crashdump attached.
Can you please help us find the root cause of this? This is impacting customer delivery now.

Vladimir Grevtsev (vlgrevtsev) wrote on 2018-12-18:

juju-crashdump-64ad0910-5642-49db-b941-ade6e256a0ef.tar.xz (16.4 KiB, application/x-tar)

Vladimir Grevtsev (vlgrevtsev) wrote on 2018-12-18:

controller-syslog (1000.7 KiB, text/plain)

Controller syslog attached.

Vladimir Grevtsev (vlgrevtsev) wrote on 2018-12-18:

Subscribed field-crit as this is impacting ongoing customer delivery.

Vladimir Grevtsev (vlgrevtsev) wrote on 2018-12-21:

Reproduced again.

juju status: http://paste.ubuntu.com/p/KkD8v39gD6/
controller logs: https://drive.google.com/drive/folders/1ukg6_hQnNXHF_FZOPc9Oql1ApHFPJwLv?usp=sharing

This was reproduced after deploying two bundles on the same model. The first is a "small" bundle with 9 machines. The second was started after the first deployment had completely finished; it contains everything in the first bundle plus some more machines.

"small" bundle: https://pastebin.canonical.com/p/wtYTf4V5C8/
"full" bundle: https://pastebin.canonical.com/p/sWKHwxkx5G/

Any advice?

Pedro Guimarães (pguimaraes) wrote on 2018-12-21:

This seems to be an issue with the mongodb instances syncing up during deployment. Can we instead run the deployment in standalone mode and add the extra Juju controllers for HA afterwards? That would avoid mongodb syncing during deployment.
Would that have any impact on Juju after deployment?
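
The ordering proposed here (deploy against a single controller, then add HA) could look roughly like the sketch below. It is only an illustration: the cloud name, controller name and bundle path are placeholders that do not come from this report, and the `run` helper and `DRY_RUN` variable are hypothetical additions so the commands can be previewed without touching a real controller.

```shell
#!/bin/sh
# Sketch only: deploy first, enable HA afterwards.
# Cloud, controller and bundle arguments are placeholders supplied by the caller.

# Print each command; execute it only when DRY_RUN is explicitly set to 0.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    fi
}

# Step 1: bootstrap a single (non-HA) controller and deploy the bundle.
deploy_standalone() {
    cloud="$1"
    controller="$2"
    bundle="$3"
    run juju bootstrap "$cloud" "$controller"
    run juju deploy "$bundle"
}

# Step 2: run this only once `juju status` shows the deployment has settled.
enable_ha_afterwards() {
    run juju enable-ha
}
```

With the default dry run, `deploy_standalone mycloud mycontroller ./bundle.yaml` only prints the commands it would run. The wait between the two steps is left manual on purpose, since the point of the workaround is that HA is added after the deployment has finished.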

Richard Harding (rharding) wrote on 2018-12-21:

That's fine as you're bringing up a fresh deploy and the risk of the main controller going down is low.

Tim Penhey (thumper) wrote on 2019-01-08:

It appears that the problem was resolved. Juju did eventually record the link-layer devices as far as we are able to determine.

Changed in juju:
status:	New → Won't Fix

Vladimir Grevtsev (vlgrevtsev) wrote on 2019-01-09:

It has not been resolved, only worked around by not enabling Juju HA before deployment.

In short: this is still reproducible when deploying a big bundle with HA mode enabled.

Vladimir Grevtsev (vlgrevtsev) wrote on 2019-01-16:

The issue is still reproducible. However, it went away after I applied the settings below on all hypervisor nodes that host the Juju controller VMs: I redeployed 3 times in a row on the same controllers with HA enabled and hit no errors.

# echo 0 > /sys/block/bcache0/bcache/sequential_cutoff
# echo 0 > /sys/fs/bcache/<cache set>/congested_read_threshold_us
# echo 0 > /sys/fs/bcache/<cache set>/congested_write_threshold_us
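
On a hypervisor with several bcache devices or cache sets, the same three writes could be applied in a loop, roughly as sketched below. This is an illustration under assumptions, not something taken from the affected environment: the function name and its optional sysfs-root argument are hypothetical (the argument exists only so the loop can be tried against a scratch directory), and the globs assume the standard bcache sysfs layout shown in the commands above.

```shell
#!/bin/sh
# Sketch only: write 0 to the three bcache settings listed above for every
# bcache device and cache set found under a sysfs root (default: /sys).

apply_bcache_tuning() {
    root="${1:-/sys}"
    for f in "$root"/block/bcache*/bcache/sequential_cutoff \
             "$root"/fs/bcache/*/congested_read_threshold_us \
             "$root"/fs/bcache/*/congested_write_threshold_us; do
        # An unmatched glob stays literal, so skip anything that is not a file.
        [ -f "$f" ] || continue
        echo 0 > "$f"
        echo "set $f to 0"
    done
}
```

Calling `apply_bcache_tuning` with no argument as root targets the live `/sys` tree. Values written to sysfs do not survive a reboot, so they would need to be reapplied after one.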

Richard Harding (rharding) wrote on 2019-01-17:

Thanks Vladimir. We saw that Juju was storming mongodb during the deploy, and getting better disk performance out of the underlying machine helps with that. The Juju team is actively looking at mechanisms to throttle better and avoid some of the storming issues of a deploy like this.
