[3.5] "crypto/rsa: verification error" while trying to verify candidate authority certificate "maas-ca"

Bug #2076910 reported by Peter Jose De Sousa
This bug affects 1 person
Affects   Status          Importance   Assigned to   Milestone
MAAS      Fix Committed   Critical     Jacopo Rota
3.5       Fix Released    Critical     Jacopo Rota

Bug Description

Hello

When setting up MAAS in a split rackd/region design, rackd crashes with the error:

Aug 13 13:36:55 HOSTNAME maas-agent[712774]: ERR Temporal client error error="failed reaching server: last connection error: connection error: desc = \"transport: authentication handshake failed: tls: failed to verify certificate: x509: certificate signed by unknown authority (possibly because of \\\"crypto/rsa: verification error\\\" while trying to verify candidate authority certificate \\\"maas-ca\\\")\""
Aug 13 13:36:55 HOSTNAME maas.pebble[705343]: 2024-08-13T12:36:55.598Z [pebble] Service "agent" stopped unexpectedly with code 1
Aug 13 13:36:55 HOSTNAME maas.pebble[705343]: 2024-08-13T12:36:55.598Z [pebble] Service "agent" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
Aug 13 13:36:56 HOSTNAME maas.pebble[705343]: 2024-08-13T12:36:56.142Z [pebble] Service "agent" starting: sh -c "exec systemd-cat -t maas-agent $SNAP/bin/run-maas-agent"

The issue is persistent across deployments. Steps to reproduce:

1. Deploy MAAS in HA: 3 region controllers, 2 rackds
2. Enable TLS on the three region controllers
3. Join the racks to the region
4. Enable DHCP on the rackd network
5. Attempt to PXE boot another machine

[Expected result]

Checking journalctl with sudo, the above error should be observed, with maas-agent crashing.

[workaround]:

On the racks:

rm /var/snap/maas/current/certificates/*
sudo snap restart maas

Thanks,
Peter

Peter Jose De Sousa (pjds) wrote :

subscribing field critical as this is blocking a handover, and affecting multiple projects.

description: updated
Peter Jose De Sousa (pjds) wrote :

downgrading to field high:

[workaround]:

On the racks:

rm /var/snap/maas/current/certificates/*
sudo snap restart maas

description: updated
Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 3.6.0
Changed in maas:
assignee: nobody → Anton Troyanov (troyanov)
status: Triaged → In Progress
Anton Troyanov (troyanov) wrote :

Hi Peter,

I made multiple attempts but failed to reproduce the issue.
Can you provide more detailed steps for a reproducer, or maybe give access to the environment where this error happens after setting up MAAS?

I think the error is caused by multiple attempts to install MAAS on the same machine without a proper cleanup in between.

The logic related to cluster certificate (used for mTLS) is the following:

1. When the MAAS Region Controller starts, it calls the _create_cluster_certificate_if_necessary method.
1.1 If there is no cert/key in the database/vault, MAAS generates a new pair.
2. If there is no cert/key under /var/snap/maas/current/certificates/ (for snap), it writes them there as well. We don't check whether the files on disk match the ones in the database. Hence, in a dirty environment, you might end up with files on disk that we don't overwrite with the DB values.
3. During MAAS Rack Controller init with `sudo maas init rack --maas-url http://$MAAS_IP:5240/MAAS --secret`, the cert/key/CA are encrypted (AES with the secret as the key) and transferred to the Rack over RPC. Once transferred, the files are written to disk at the same location.

Details can be found at: src/maasserver/start_up.py
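
The "write only if missing" behaviour from step 2 can be sketched in isolation. This is a minimal illustration, not the MAAS code: the helper `write_if_missing`, the file name `cluster.pem` and the certificate strings are all hypothetical.

```python
import tempfile
from pathlib import Path


def write_if_missing(cert_dir: Path, name: str, pem_from_db: str) -> bool:
    """Write pem_from_db to cert_dir/name only when the file is absent.

    Returns True if the file was written, False if an existing file was kept.
    An existing file is never compared against the database value.
    """
    target = cert_dir / name
    if target.exists():
        return False  # stale content, if any, is left untouched
    target.write_text(pem_from_db)
    return True


def demo_stale_file() -> str:
    """Return the on-disk content after a stale file meets a new DB value."""
    with tempfile.TemporaryDirectory() as tmp:
        cert_dir = Path(tmp)
        # Leftover from a previous installation (hypothetical content).
        (cert_dir / "cluster.pem").write_text("OLD CERT")
        write_if_missing(cert_dir, "cluster.pem", "NEW CERT FROM DB")
        return (cert_dir / "cluster.pem").read_text()
```

With a leftover file in place, `demo_stale_file()` still returns the old content, which is the mismatch a dirty environment would produce.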

Changed in maas:
status: In Progress → Invalid
status: Invalid → Incomplete
assignee: Anton Troyanov (troyanov) → nobody
Changed in maas:
status: Incomplete → Invalid
Jacopo Rota (r00ta)
Changed in maas:
assignee: nobody → Jacopo Rota (r00ta)
Jacopo Rota (r00ta) wrote :

I managed to reproduce this.

In short, we are generating the certificates in

```
@with_connection  # Needed by the following lock.
@synchronised(locks.startup)
@transactional
def inner_start_up(retry, master=False):
    ...
    do_stuff()
    generate_certificates_and_store_to_disk()
    do_other_stuff()
```

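but if for any reason `do_other_stuff` raises an exception, the transaction is not committed, so the newly generated certificates never reach the database while the files stay on disk. On the next attempt a new pair is generated and committed, but with the logic we have right now we do not replace the certificates if the files are already on disk, so the disk and the database end up out of sync.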
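
The failure mode can be illustrated outside MAAS with a small simulation. This is only a sketch under stated assumptions: an in-memory SQLite table stands in for the MAAS database, `cluster.pem` is a hypothetical file name, and the "certificate" is a plain string.

```python
import sqlite3
import tempfile
from pathlib import Path


def simulate_start_up(failing_attempts: int = 1) -> tuple:
    """Return (cert_in_db, cert_on_disk) after retrying a failing start-up."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE secret (name TEXT PRIMARY KEY, value TEXT)")
    db.commit()
    with tempfile.TemporaryDirectory() as tmp:
        cert_file = Path(tmp) / "cluster.pem"  # hypothetical file name
        attempt = 0
        while True:
            attempt += 1
            try:
                with db:  # commit on success, roll back on exception
                    row = db.execute(
                        "SELECT value FROM secret WHERE name = 'cert'"
                    ).fetchone()
                    if row is None:
                        cert = f"cert-from-attempt-{attempt}"
                        db.execute(
                            "INSERT INTO secret VALUES ('cert', ?)", (cert,)
                        )
                    else:
                        cert = row[0]
                    if not cert_file.exists():  # existing files are kept
                        cert_file.write_text(cert)
                    if attempt <= failing_attempts:
                        raise RuntimeError("a later start-up task failed")
            except RuntimeError:
                continue  # try again, like the retry loop in start_up()
            break
        in_db = db.execute(
            "SELECT value FROM secret WHERE name = 'cert'"
        ).fetchone()[0]
        on_disk = cert_file.read_text()
    db.close()
    return in_db, on_disk
```

With one failing attempt, the database ends up holding the certificate from the second attempt while the disk still holds the one from the first; with no failures the two agree.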

In order to reproduce this you can simply apply the following patch

```
diff --git a/src/maasserver/start_up.py b/src/maasserver/start_up.py
index f2a180869c..9b581dd805 100644
--- a/src/maasserver/start_up.py
+++ b/src/maasserver/start_up.py
@@ -200,6 +200,7 @@ def start_up(master=False):
     but this method uses database locking to ensure that the methods it calls
     internally are not run concurrently.
     """
+    retry = 0
     while True:
         try:
             # Since start_up now can be called multiple times in a process lifetime,
@@ -214,7 +215,8 @@ def start_up(master=False):
             # Execute other start-up tasks that must not run concurrently with
             # other invocations of themselves, across the whole of this MAAS
             # installation.
-            yield deferToDatabase(inner_start_up, master=master)
+            retry += 1
+            yield deferToDatabase(inner_start_up, retry=retry, master=master)
         except SystemExit:
             raise
         except KeyboardInterrupt:
@@ -243,13 +245,14 @@ def start_up(master=False):
             logger.error("Error during start-up.", exc_info=True)
             yield pause(3.0) # Wait 3 seconds before having another go.
         else:
+            print(f"BYEEEEEEEEEEEEEE")
             break

 @with_connection # Needed by the following lock.
 @synchronised(locks.startup)
 @transactional
-def inner_start_up(master=False):
+def inner_start_up(retry, master=False):
     """Startup jobs that must run serialized w.r.t. other starting servers."""
     # All commissioning and testing scripts are stored in the database. For
     # a commissioning ScriptSet to be created Scripts must exist first. Call
@@ -289,6 +292,12 @@ def inner_start_up(master=False):

         _create_cluster_certificate_if_necessary(client)

+        print(f"I created the certificates on the disk! retry {retry}")
+        if retry < 2:
+            print(f"But I'm going to crash now! retry {retry}")
+            raise Exception("BOOOOOM!")
+        print(f"MOVING FORWARD")
+
         ControllerInfo.objects.filter(node_id=node.id).update(
             vault_configured=bool(client)
         )
```

The fix here might be either to write the files to disk in the post_commit, once the transaction has been committed, or simply to improve the logic so that the certificates on disk are replaced. I'm taking care of this now.
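
Both ideas can be sketched together as follows. This is a hedged illustration, not the committed fix: the `Transaction` class is a toy stand-in for the real transaction and post-commit machinery, and `sync_file` and `cluster.pem` are hypothetical names.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


class Transaction:
    """Toy transaction that runs queued hooks only after a clean exit."""

    def __init__(self) -> None:
        self._post_commit: List[Callable[[], object]] = []

    def on_commit(self, hook: Callable[[], object]) -> None:
        self._post_commit.append(hook)

    def __enter__(self) -> "Transaction":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:  # "committed": now it is safe to touch the disk
            for hook in self._post_commit:
                hook()
        self._post_commit.clear()
        return False  # never swallow the exception


def sync_file(path: Path, content: str) -> bool:
    """Write content unless the file already matches; True when written."""
    if path.exists() and path.read_text() == content:
        return False
    path.write_text(content)
    return True


def demo_fix() -> List[Optional[str]]:
    """Return the on-disk content after a failed and a successful attempt."""
    seen: List[Optional[str]] = []
    with tempfile.TemporaryDirectory() as tmp:
        cert_file = Path(tmp) / "cluster.pem"  # hypothetical file name
        for attempt in (1, 2):
            cert = f"cert-{attempt}"
            try:
                with Transaction() as txn:
                    txn.on_commit(lambda c=cert: sync_file(cert_file, c))
                    if attempt == 1:
                        raise RuntimeError("a later start-up task failed")
            except RuntimeError:
                pass
            seen.append(cert_file.read_text() if cert_file.exists() else None)
    return seen
```

In this sketch nothing is written during the failed attempt, and the successful attempt writes the committed value, so the disk cannot get ahead of the database. Comparing content in `sync_file` additionally repairs a file left behind by an earlier installation.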

Changed in maas:
status: Invalid → Triaged
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed