A new installation of maas 3.5.0 fails to start the reverse proxy

Bug #2060172 reported by Jacopo Rota
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Fix Committed
High
Jacopo Rota
3.5
Fix Committed
High
Jacopo Rota

Bug Description

In 3.5 I got a run where the snap has started but no UI was available. After checking in the maas-log, I found

```
Apr 04 03:50:31 maas-ha-1 maas-log[129195]: maas.service_monitor: [error] Service 'reverse_proxy' failed to restart: Pebble change 3 failed with an error: cannot perform the following tasks:#012- Stop service "http" (cannot stop service while starting)
```

Also, the pebble services seems to be running but still no http endpoints are available

```
# pebble services
Service Startup Current Since
agent disabled inactive -
apiserver enabled active today at 03:50 UTC
bind9 disabled active today at 03:50 UTC
dhcpd disabled inactive -
dhcpd6 disabled inactive -
http disabled active today at 03:50 UTC
ntp disabled active today at 03:50 UTC
proxy disabled active today at 03:50 UTC
rackd enabled active today at 03:50 UTC
regiond enabled active today at 03:50 UTC
syslog disabled active today at 03:50 UTC
temporal disabled active today at 03:50 UTC
temporal-worker disabled active today at 03:50 UTC
```

In the regiond and rackd logs there are plenty of
```
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: temporalio.activity: [warn] Completing activity as failed ({'activity_id': '1', 'activity_type': 'get-bootresourcefile-endpoints', 'attempt': 14, 'namespace': 'default', 'task_queue': 'region-internal', 'workflow_id': 'sync-b>
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: Traceback (most recent call last):
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: File "/snap/maas/x1/usr/lib/python3/dist-packages/temporalio/worker/_activity.py", line 439, in _run_activity
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: result = await impl.execute_activity(input)
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: File "/snap/maas/x1/usr/lib/python3/dist-packages/temporalio/worker/_activity.py", line 694, in execute_activity
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: return await input.fn(*input.args)
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: File "/snap/maas/x1/lib/python3.10/site-packages/maasserver/workflow/bootresource.py", line 108, in get_bootresourcefile_endpoints
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: raise ApplicationError(
Apr 04 04:03:43 maas-ha-1 maas-regiond[129092]: temporalio.exceptions.ApplicationError: Could not retrieve the IP addresses of the region controller 'yfaqya' from the API. This activity will be retried until we have the IP for all the region controllers.
```

and

```
Apr 04 04:05:47 maas-ha-1 maas-rackd[129097]: provisioningserver.rpc.clusterservice: [info] Region not available: Connection was refused by other side: 111: Connection refused. (While requesting RPC info at http://10.0.1.23:5240/MAAS).
```

because the region is not exposing any http service actually

Related branches

Jacopo Rota (r00ta)
description: updated
Jacopo Rota (r00ta)
Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
Revision history for this message
Anton Troyanov (troyanov) wrote :

It might happen because nginx process is started before the configuration file is being written on disk.

Changed in maas:
milestone: 3.5.0 → 3.6.0
Revision history for this message
Jacopo Rota (r00ta) wrote :

I noticed that when the bug hits the pid of nginx is always the same (I monitored it using `for x in {1..10000}; do cat /var/snap/maas/current/http/nginx.pid; sleep 0.5; done`).

On the other side, when it works properly the pid changes.

Looking at http://nginx.org/en/docs/control.html , in our case it's enough to send a HUP signal to reload the configuration.

Still we can't really understand what happened, but it must be a nasty race with NGINX and pebble (like https://trac.nginx.org/nginx/ticket/424).

Revision history for this message
Jacopo Rota (r00ta) wrote (last edit ):

Just to recap:

This bug is caused by a nasty race condition between pebble and NGINX.
1) Pebble starts NGINX and the other services
2) The region writes the nginx config and asks pebble to restart NGINX. Pebble returns 202 and will process the request asynchronously.
3) Pebble can't send the TERM signal to NGINX because its status is still "STARTING".
The error in pebble is "[change 3 "Stop service \"http\"" task] failed: cannot stop service while starting"
See https://github.com/canonical/pebble/blob/18c93cbc280de20c9e21d323fe3a8bbb340777a2/internals/overlord/servstate/handlers.go#L701
4) Then, NGINX keeps the old configuration and does not process any request.

However, they are suggesting to use the HUP signal to reload the configuration (see also http://nginx.org/en/docs/control.html), this is what this MP is doing and I haven't observed this issue so far.

Changed in maas:
status: Triaged → Fix Committed
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.