error upgrading from 2.7 to 2.8: cannot get all link layer devices

Bug #1899536 reported by Laurent Sesquès
This bug affects 1 person
Affects: Canonical Juju
Status: Fix Released
Importance: High
Assigned to: Joseph Phillips

Bug Description

On a non-HA Juju controller with only the default model, managing machines on which many network devices are constantly being created and deleted (OpenStack Neutron), upgrading to 2.8.3 left the controller very slow. Even running juju status on the default model fails with:

ERROR could not fetch IP addresses and link layer devices: cannot get all link layer devices

https://pastebin.canonical.com/p/hXkDM32dTn/

The size of the linklayerdevices collection:
`db.linklayerdevices.find().explain("executionStats")`:
     "totalDocsExamined" : 795837

These are all the qvb*, tap*, qvo*, qbr* interfaces.
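
To see how a collection like this breaks down, device names can be grouped by prefix. The following is a minimal Python sketch, assuming the device names have already been exported to a plain list. The helper `count_by_prefix` and the sample names are hypothetical and are not taken from the dump.

```python
from collections import Counter

# Name prefixes of the Neutron-created virtual interfaces named in this report.
EPHEMERAL_PREFIXES = ("qvb", "tap", "qvo", "qbr")


def count_by_prefix(names, prefixes=EPHEMERAL_PREFIXES):
    """Count device names per known prefix; anything else is counted as 'other'."""
    counts = Counter()
    for name in names:
        for prefix in prefixes:
            if name.startswith(prefix):
                counts[prefix] += 1
                break
        else:
            # No known prefix matched this name.
            counts["other"] += 1
    return counts


# Hypothetical sample names, for illustration only.
sample = [
    "eth0",
    "qvb1a2b3c4d-5e",
    "tap1a2b3c4d-5e",
    "qbr1a2b3c4d-5e",
    "qvo1a2b3c4d-5e",
    "tap9f8e7d6c-5b",
]
print(dict(count_by_prefix(sample)))
```

A large count under the four prefixes relative to "other" would be consistent with the accumulation described in this report.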

We started investigating with manadart (to whom I provided a dump) and achilleasa. This bug is to capture further investigation.

tags: added: canonical-is-upgrades
description: updated
Pen Gale (pengale) wrote :

Triaged as high priority and dropped into juju 2.8.6 milestone. Conversations are currently ongoing. Please post updates to this bug!

Changed in juju:
status: New → Triaged
importance: Undecided → High
milestone: none → 2.8.6
John A Meinel (jameinel)
Changed in juju:
status: Triaged → In Progress
assignee: nobody → Joseph Phillips (manadart)
Joseph Phillips (manadart) wrote :

Adds an index to the collection for access by model and/or machine:
https://github.com/juju/juju/pull/12133

Joseph Phillips (manadart) wrote :

Prior releases had an issue where discovered devices were added to the collection, but not removed, which explains the accrual of these ephemeral virtual devices.

The 2.8 series addresses this issue in the logic for setting each machine's link-layer devices.

Where neither the provider nor the machine observes an interface, it is now deleted. So this should be a transient problem.
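
The rule described above can be sketched as plain set arithmetic. This is an illustration only, written in Python, and is not Juju's actual Go implementation. The function `stale_devices` and all device names are hypothetical.

```python
def stale_devices(stored, provider_seen, machine_seen):
    """Return stored device names that neither the provider nor the machine reports.

    A device is kept if at least one of the two sources still observes it.
    """
    observed = set(provider_seen) | set(machine_seen)
    return sorted(set(stored) - observed)


# Hypothetical data: two virtual devices left over from deleted ports.
stored = ["eth0", "br-ex", "tap1a2b3c4d-5e", "qvo1a2b3c4d-5e"]
provider_seen = ["eth0"]
machine_seen = ["eth0", "br-ex"]

for name in stale_devices(stored, provider_seen, machine_seen):
    print("would delete:", name)
```

Because a device survives when either source reports it, only interfaces that have disappeared from both views are removed. That matches the expectation that the backlog shrinks as each machine's devices are next set.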

We are experimenting with indexes on the collection. The access-by-machine case is sped up greatly, but we're seeing worse performance where the index is not very selective, such as by model in this case.
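
The selectivity point can be illustrated with a small calculation: for each value of a field, what fraction of the collection would an equality lookup on that value return? This is a rough Python sketch over in-memory dictionaries, not a MongoDB measurement. The field names `model-uuid` and `machine-id` and the helper `match_fractions` are assumptions made for the example.

```python
from collections import Counter


def match_fractions(docs, field):
    """Map each value of `field` to the fraction of docs an equality lookup returns."""
    total = len(docs)
    if total == 0:
        return {}
    counts = Counter(doc[field] for doc in docs)
    return {value: n / total for value, n in counts.items()}


# Hypothetical collection: one model, four machines, 1000 devices.
docs = [
    {"model-uuid": "default", "machine-id": str(i % 4)}
    for i in range(1000)
]

by_model = match_fractions(docs, "model-uuid")
by_machine = match_fractions(docs, "machine-id")
print("by model:", by_model)
print("by machine:", by_machine)
```

With a single model, a lookup by model matches every document, so an index on that field cannot narrow the scan and only adds index traversal on top of fetching everything. A lookup by machine returns a fraction of the collection, which is where an index helps.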

This is only used by "juju status" though, so once the machines strangle out the devices, this should resolve.

Joseph Phillips (manadart) wrote :
Ian Booth (wallyworld)
Changed in juju:
milestone: 2.8.6 → 2.8.7
Changed in juju:
status: In Progress → Fix Committed
Changed in juju:
status: Fix Committed → Fix Released