RPC RegisterRackController can saturate all the database threads, causing region controllers to become unresponsive for minutes
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| MAAS | Fix Committed | High | Jacopo Rota | |
| 3.3 | Triaged | High | Jacopo Rota | |
| 3.4 | Fix Committed | High | Jacopo Rota | |
| 3.5 | Fix Released | High | Jacopo Rota | |
| 3.6 | Fix Committed | High | Jacopo Rota | |
| 3.7 | Fix Released | High | Jacopo Rota | |
Bug Description
Describe the bug:
By design, rack controllers should connect to every region worker. If a region has 8 workers, every rack controller should establish a connection to each worker (8 connections in total).
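As a rough illustration of the numbers involved, the baseline connection count can be worked out as below. This is only a sketch: the function name is hypothetical, and it assumes every region runs the same number of workers and that ephemeral connections are not counted.

```python
def baseline_rpc_connections(racks: int, regions: int, workers_per_region: int) -> int:
    """Baseline RPC connections when every rack connects to every region worker.

    Assumption: all regions run the same number of workers, and no extra
    (ephemeral) connections have been opened.
    """
    return racks * regions * workers_per_region


# One rack and one region with 8 workers, as in the description above.
single_rack = baseline_rpc_connections(racks=1, regions=1, workers_per_region=8)
```

With several racks and regions in HA the baseline grows multiplicatively, which is why a burst of registrations can reach every worker at once.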
In 3.3, logic to scale up RPC connections was added, so when the connections are busy the rack tries to establish new ephemeral connections. Still, this is known to be affected by a bug https:/
When the region handles the RegisterRackCon
```
)
```
and in particular
```
@synchronous
@with_connection
@synchronised(
@transactional
def register(
system_id=None,
hostname="",
interfaces=
url=None,
is_
version=None,
):
```
in some cases the transaction waits on other transactions to be committed, so the lock is not released.
We tracked down the problem to provisioningser
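The saturation mechanism described above can be reproduced in miniature with plain Python threads. This is not MAAS code: the pool size, the lock, and the function names are all hypothetical stand-ins, and `time.sleep` plays the role of a transaction that blocks while the lock is held. The sketch measures how long an unrelated task waits for a free thread while registrations queue on the same lock.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 4  # hypothetical number of database threads
registration_lock = threading.Lock()  # stands in for the lock taken by register()


def register(hold_seconds: float) -> None:
    """Take the shared lock, then block while still holding it."""
    with registration_lock:
        # Simulates a transaction waiting on other transactions to commit.
        time.sleep(hold_seconds)


def unrelated_query() -> float:
    """Any other database work; returns the time at which it started running."""
    return time.monotonic()


def measure_query_delay(registrations: int, hold_seconds: float) -> float:
    """Return how long an unrelated query waits for a free pool thread."""
    with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
        # Each registration occupies a pool thread, even while it only
        # waits for the lock, so the pool fills up with blocked callers.
        for _ in range(registrations):
            pool.submit(register, hold_seconds)
        submitted = time.monotonic()
        started = pool.submit(unrelated_query).result()
    return started - submitted
```

Under these assumptions, calling `measure_query_delay(2 * POOL_SIZE, 0.05)` should show the unrelated query waiting for several hold periods, whereas with zero registrations it starts almost immediately. The registrations run one at a time because of the lock, yet each one still pins a thread for its whole wait.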
Steps to reproduce:
Pick a couple of controllers with a lot of processes and install MAAS in HA. Add some debug statements and you should see the behaviour described above. In some environments the bug is triggered and the situation becomes very bad.
Expected behavior (what should have happened?):
MAAS should be simply up and running without downtime :)
Actual behavior (what actually happened?):
Every 10 to 15 minutes, the regions become unresponsive for a few minutes (roughly 2 to 3 minutes).
MAAS version and installation type (deb, snap):
From 3.3 on
MAAS setup (HA, single node, multiple regions/racks):
Host OS distro and version:
Ubuntu any version
Additional context:
Related branches
- Jacopo Rota: Approve
  Diff: 187 lines (+61/-77), 2 files modified
  src/provisioningserver/utils/ps.py (+12/-12)
  src/provisioningserver/utils/tests/test_ps.py (+49/-65)
- Jacopo Rota: Approve
  Diff: 187 lines (+61/-77), 2 files modified
  src/provisioningserver/utils/ps.py (+12/-12)
  src/provisioningserver/utils/tests/test_ps.py (+49/-65)
- Jacopo Rota: Approve
  Diff: 187 lines (+61/-77), 2 files modified
  src/provisioningserver/utils/ps.py (+12/-12)
  src/provisioningserver/utils/tests/test_ps.py (+49/-65)
- Jacopo Rota: Approve
  Diff: 187 lines (+61/-77), 2 files modified
  src/provisioningserver/utils/ps.py (+12/-12)
  src/provisioningserver/utils/tests/test_ps.py (+49/-65)
- Anton Troyanov: Approve
  Diff: 187 lines (+61/-77), 2 files modified
  src/provisioningserver/utils/ps.py (+12/-12)
  src/provisioningserver/utils/tests/test_ps.py (+49/-65)
- MAAS Maintainers: Pending requested
  Diff: 161 lines (+58/-9), 4 files modified
  src/maasserver/rpc/rackcontrollers.py (+4/-2)
  src/maasserver/rpc/tests/test_rackcontrollers.py (+31/-3)
  src/maasserver/utils/orm.py (+13/-4)
  src/maasserver/utils/tests/test_orm.py (+10/-0)
| Changed in maas: | |
| importance: | Undecided → High |
| assignee: | nobody → Jacopo Rota (r00ta) |
| milestone: | none → 3.8.x |
| description: | updated |
| description: | updated |
| Changed in maas: | |
| status: | Triaged → In Progress |
| Changed in maas: | |
| status: | In Progress → Fix Committed |
