MAAS

A new installation of maas 3.5.0 fails to start the reverse proxy

Bug #2060172 reported by Jacopo Rota on 2024-04-04

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	MAAS	Fix Committed	High	Jacopo Rota	MAAS 3.6.0
	3.5	Fix Committed	High	Jacopo Rota	MAAS 3.5.0

Bug Description

In 3.5 I got a run where the snap has started but no UI was available. After checking in the maas-log, I found

```
Apr 04 03:50:31 maas-ha-1 maas-log[129195]: maas.service_monitor: [error] Service 'reverse_proxy' failed to restart: Pebble change 3 failed with an error: cannot perform the following tasks:#012- Stop service "http" (cannot stop service while starting)
```

Also, the pebble services seems to be running but still no http endpoints are available

```
# pebble services
Service Startup Current Since
agent disabled inactive -
apiserver enabled active today at 03:50 UTC
bind9 disabled active today at 03:50 UTC
dhcpd disabled inactive -
dhcpd6 disabled inactive -
http disabled active today at 03:50 UTC
ntp disabled active today at 03:50 UTC
proxy disabled active today at 03:50 UTC
rackd enabled active today at 03:50 UTC
regiond enabled active today at 03:50 UTC
syslog disabled active today at 03:50 UTC
temporal disabled active today at 03:50 UTC
temporal-worker disabled active today at 03:50 UTC
```

In the regiond and rackd logs there are plenty of
```
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: temporalio.activity: [warn] Completing activity as failed ({'activity_id': '1', 'activity_type': 'get-bootresourcefile-endpoints', 'attempt': 14, 'namespace': 'default', 'task_queue': 'region-internal', 'workflow_id': 'sync-b>
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: Traceback (most recent call last):
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: File "/snap/maas/x1/usr/lib/python3/dist-packages/temporalio/worker/_activity.py", line 439, in _run_activity
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: result = await impl.execute_activity(input)
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: File "/snap/maas/x1/usr/lib/python3/dist-packages/temporalio/worker/_activity.py", line 694, in execute_activity
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: return await input.fn(*input.args)
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: File "/snap/maas/x1/lib/python3.10/site-packages/maasserver/workflow/bootresource.py", line 108, in get_bootresourcefile_endpoints
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: raise ApplicationError(
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: temporalio.exceptions.ApplicationError: Could not retrieve the IP addresses of the region controller 'yfaqya' from the API. This activity will be retried until we have the IP for all the region controllers.
```

and

```
Apr 04 04:05:47 maas-ha-1 maas-rackd[129097]: provisioningserver.rpc.clusterservice: [info] Region not available: Connection was refused by other side: 111: Connection refused. (While requesting RPC info at http://10.0.1.23:5240/MAAS).
```

because the region is not exposing any http service actually

See original description

Related branches

~r00ta/maas:lp-2060172-3.5

Merged into maas:3.5

Jacopo Rota: Approve on 2024-04-09

~r00ta/maas:lp-2060172

Merged into maas:master

Alexsander de Souza: Approve on 2024-04-08

MAAS Lander: Approve on 2024-04-08

Jacopo Rota (r00ta) on 2024-04-04

description:

updated

Jacopo Rota (r00ta) on 2024-04-04

Changed in maas:
assignee:	nobody → Jacopo Rota (r00ta)

Revision history for this message

Anton Troyanov (troyanov) wrote on 2024-04-06:

It might happen because nginx process is started before the configuration file is being written on disk.

Changed in maas:
milestone:	3.5.0 → 3.6.0

Revision history for this message

Jacopo Rota (r00ta) wrote on 2024-04-08:

I noticed that when the bug hits the pid of nginx is always the same (I monitored it using `for x in {1..10000}; do cat /var/snap/maas/current/http/nginx.pid; sleep 0.5; done`).

On the other side, when it works properly the pid changes.

Looking at http://nginx.org/en/docs/control.html , in our case it's enough to send a HUP signal to reload the configuration.

Still we can't really understand what happened, but it must be a nasty race with NGINX and pebble (like https://trac.nginx.org/nginx/ticket/424).

Revision history for this message

Jacopo Rota (r00ta) wrote on 2024-04-08 (last edit on 2024-04-09):

Just to recap:

This bug is caused by a nasty race condition between pebble and NGINX.
1) Pebble starts NGINX and the other services
2) The region writes the nginx config and asks pebble to restart NGINX. Pebble returns 202 and will process the request asynchronously.
3) Pebble can't send the TERM signal to NGINX because its status is still "STARTING".
The error in pebble is "[change 3 "Stop service \"http\"" task] failed: cannot stop service while starting"
See https://github.com/canonical/pebble/blob/18c93cbc280de20c9e21d323fe3a8bbb340777a2/internals/overlord/servstate/handlers.go#L701
4) Then, NGINX keeps the old configuration and does not process any request.

However, they are suggesting to use the HUP signal to reload the configuration (see also http://nginx.org/en/docs/control.html), this is what this MP is doing and I haven't observed this issue so far.

MAAS Lander (maas-lander) on 2024-04-08

Changed in maas:
status:	Triaged → Fix Committed

Report a bug

This report contains Public information

Everyone can see this information.

You are

Subscribing...

Edit bug mail

Other bug subscribers

Remote bug watches

auto-trac.nginx.org #424 Edit

Bug watches keep track of this bug in other bug trackers.