OpenStack Snap

cluster bootstrap fails after 10': juju.errors.JujuAPIError: watcher was stopped

Bug #2039126 reported by Andrea Ieri on 2023-10-12

This bug affects 3 people

Affects          Status       Importance  Assigned to  Milestone
Canonical Juju   Won't Fix    Undecided   Unassigned
OpenStack Snap   In Progress  High        Unassigned

Bug Description

Today I tried going through the sunbeam quickstart again on a VM on my local machine, and it failed after 10 minutes with `juju.errors.JujuAPIError: watcher was stopped`.

I am attaching the sunbeam debug log and a juju status in json format.

The instance has 8 cores and 24GB of RAM, and I can confirm no OOM killing took place.

I am not destroying my instance, so let me know if I can help by providing further debug data.

Andrea Ieri (aieri) wrote on 2023-10-12:

sunbeam-20231012-012600.747745.log (179.4 KiB, text/plain)

Andrea Ieri (aieri) wrote on 2023-10-12:

status.json (40.3 KiB, application/json)

Marian Gasparovic (marosg) wrote on 2023-10-27:

We are hitting this in SQA in each 2023.2/edge test run. Examples:

https://solutions.qa.canonical.com/testruns/73a11792-a75d-45a5-b58a-a8e635419fe4
https://solutions.qa.canonical.com/testruns/8f7ed93a-6a5e-4ca5-90de-4b66ae8e7041

tags:	added: cdo-qa

Guillaume Boutry (gboutry) on 2023-10-30

Changed in snap-openstack:
assignee:	nobody → Guillaume Boutry (gboutry)
importance:	Undecided → High
status:	New → In Progress

James Page (james-page) wrote on 2023-10-30:

Adding a bug task for Juju. We switched the openstack snap to use Juju 3.1 in edge; on machines with many configured network interfaces the controller bootstraps, but the controller app then stalls in waiting state.

Charms deploy OK, but then units sit in waiting state as well.

The same machine works fine with Juju 3.2.

Guillaume Boutry (gboutry) on 2023-10-31

Changed in snap-openstack:
assignee:	Guillaume Boutry (gboutry) → nobody

Joseph Phillips (manadart) wrote on 2023-11-30:

As described, Juju cannot come up on manually provisioned machines that have multiple local-cloud scoped IP addresses.

This is due to the lease sub-system waiting for the peer-grouper worker to broadcast API addresses.

In turn, this means that singular controller and model workers never start, and applications never get a leader unit.

3.2 uses Dqlite for leases and doesn't have the same issue. The peer-grouper worker will still keep logging errors, but the lease system doesn't rely on it and everything comes up as desired.

We do not intend to work on a fix for this in the earlier tracks.
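
To tell which side of that line a given Juju is on, its version string can be compared against 3.2. The following is a minimal sketch, assuming the usual `major.minor.patch-series-arch` shape printed by `juju version`. The helper name `uses_dqlite_leases` is hypothetical, and treating every release before 3.2 as potentially affected is an assumption drawn from this comment, not something verified per track.

```python
import re


def uses_dqlite_leases(version: str) -> bool:
    """Return True if the version string denotes Juju 3.2 or later.

    Assumption: only the leading "major.minor" matters; the patch level
    and any "-series-arch" suffix are ignored.
    """
    match = re.match(r"(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognised Juju version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (3, 2)
```

For example, passing the output of `juju version` to this helper and getting `False` would suggest the multi-address stall described here may apply.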

A work-around is to use single-NIC machines for controllers, or disable all NICs but one prior to bootstrap, restoring them after.
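
Before bootstrapping, it may help to see which interfaces would need disabling. The sketch below groups private addresses by interface from the JSON that `ip -j addr show` emits. It approximates "local-cloud scoped" as "private, non-loopback, non-link-local", which is an assumption and not Juju's own scope-detection code; the function names are hypothetical.

```python
import ipaddress
import json
import subprocess


def private_addresses_by_interface(ip_addr_json):
    """Map interface name -> list of private, non-loopback, non-link-local IPs.

    `ip_addr_json` is the parsed output of `ip -j addr show`: a list of
    links, each with an "ifname" and an "addr_info" list whose entries
    carry the address under "local".
    """
    result = {}
    for link in ip_addr_json:
        for info in link.get("addr_info", []):
            local = info.get("local")
            if local is None:
                continue
            addr = ipaddress.ip_address(local)
            # Assumption: this filter stands in for Juju's local-cloud scope.
            if addr.is_private and not addr.is_loopback and not addr.is_link_local:
                result.setdefault(link["ifname"], []).append(str(addr))
    return result


def read_ip_addr():
    """Run `ip -j addr show` and return its parsed JSON output."""
    completed = subprocess.run(
        ["ip", "-j", "addr", "show"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(completed.stdout)
```

Calling `private_addresses_by_interface(read_ip_addr())` on the target machine and finding more than one interface in the result would indicate the machine matches the multi-address case described above.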

Changed in juju:
status:	New → Won't Fix
