Rack controller status flapping when "ClusterClient connection lost" messages in rackd.log

Bug #1998615 reported by Guillaume Penin
This bug affects 20 people
Affects   Status         Importance  Assigned to          Milestone
MAAS      Fix Committed  High        Christian Grabowski
3.2       New            Undecided   Unassigned
3.3       Fix Released   High        Unassigned
maas-ui   Triaged        High        Nick De Villiers

Bug Description

I've noticed many "ClusterClient connection lost" occurrences in rackd.log, on our remote "Rack controller" and also on our central "Region and rack controller", for example:

2022-12-02 16:25:33 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=37144, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=5251, flowInfo=0, scopeID=0))
2022-12-02 16:25:33 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=37152, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=5251, flowInfo=0, scopeID=0))
2022-12-02 16:25:53 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=50880, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=5251, flowInfo=0, scopeID=0))
2022-12-02 16:25:53 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=50882, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=5251, flowInfo=0, scopeID=0))
2022-12-02 16:25:53 provisioningserver.rpc.clusterservice: [info] Event-loop 'ugifudhi01:pid=2078' authenticated.
2022-12-02 16:25:53 provisioningserver.rpc.clusterservice: [info] Event-loop 'ugifudhi01:pid=2078' authenticated.
2022-12-02 16:25:54 provisioningserver.rpc.clusterservice: [info] Rack controller 'qk68h8' registered (via xxxxxxxxxx:pid=2078) with MAAS version 3.3.0~beta3-13010-g.f45817046.
2022-12-02 16:25:54 provisioningserver.rpc.clusterservice: [info] Rack controller 'qk68h8' registered (via xxxxxxxxxx:pid=2078) with MAAS version 3.3.0~beta3-13010-g.f45817046.

Attached is a screenshot of the web UI status during these messages.

A tcpdump capture only seems to show "regular" disconnections.

Also, it does not seem to affect MAAS usage.

Are these messages legitimate and harmless?

Regards,


Revision history for this message
Guillaume Penin (guillaume-penin) wrote :
Changed in maas:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 3.4.0
Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

These messages are harmless, but we recognize they are annoying and misleading. We should improve this.

PS: 3.3.0 is not released yet, so this is subject to change at any moment.

Revision history for this message
Thorsten Merten (thorsten-merten) wrote :

Via MAAS chat:

Installations of MAAS 3.3 (both fresh installs and upgrades) where the region+rack controllers are going up and down constantly. maas.log shows messages like this:

maas.rpc.rackcontrollers: [info] Existing rack controller 'xxx' running version 3.3.0-13159-g.1c22f7beb has connected to region 'yyy'

and rackd.log shows this:

rack controller 'mqrfwf' registered (via xxx:pid=1416) with MAAS version 3.3.0-13159-g.1c22f7beb
Both messages repeat constantly.

summary: - [MAAS 3.3] Many "ClusterClient connection lost" messages in rackd.log
+ Many "ClusterClient connection lost" messages in rackd.log
Revision history for this message
Jorge Merlino (jorge-merlino) wrote : Re: Many "ClusterClient connection lost" messages in rackd.log

Hi all,

I've been investigating this issue and I think it might be more than just a cosmetic problem. The logs get thousands of these connection established/lost messages because each connection lives around 10-15 minutes, and since there are several connections in the pool there is always some connection being created or destroyed.
I looked at a network dump and there is nothing strange there; the party that starts the connections ends them normally after a few minutes.
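A rough way to quantify that churn from the log itself (a sketch, assuming the snap log path shown later in this thread; adjust the path for deb installs) is to bucket the "connection lost" events per minute:

```
# Count "ClusterClient connection lost" events per minute in rackd.log.
# Path assumes a snap install; adjust for deb installs.
grep 'ClusterClient connection lost' /var/snap/maas/common/log/rackd.log \
  | awk '{print $1, substr($2, 1, 5)}' \
  | uniq -c
```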

tags: added: bug-council
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

MAAS uses connection pooling, and this behaviour of creating new connections is expected and harmless. We will reduce the log spam by changing the log level from INFO to DEBUG, where these messages belong.
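Until that change lands, a purely cosmetic workaround is to filter these messages out when following the log; a minimal sketch, assuming the snap log path:

```
# Hide the expected connection-pool churn while tailing rackd.log.
# This only filters the view; rackd still writes the messages at INFO.
tail -F /var/snap/maas/common/log/rackd.log \
  | grep -vE 'ClusterClient connection (lost|established)'
```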

tags: removed: bug-council
tags: added: easy-to-fix papercut
Revision history for this message
Bartosz Woronicz (mastier1) wrote (last edit ):

This seems more serious: it appears to cause frequent restarts of the DHCP agent, so dhcpd is very short-lived. In my HA setup this prevents machines from being commissioned.

Also, some machines disappear when they are commissioned at the same time.

MAAS version
installed: 3.3.2-13177-g.a73a6e2bd (27109) 138MB -

maas-01:~$ sudo maas status
bind9 RUNNING pid 272671, uptime 1 day, 6:17:34
dhcpd RUNNING pid 2288973, uptime 0:04:05
dhcpd6 STOPPED Not started
http RUNNING pid 272521, uptime 1 day, 6:17:35
ntp RUNNING pid 2228029, uptime 0:13:06
proxy RUNNING pid 1865584, uptime 1:06:39
rackd RUNNING pid 263370, uptime 1 day, 6:18:38
regiond RUNNING pid 263371, uptime 1 day, 6:18:38
syslog RUNNING pid 276172, uptime 1 day, 6:17:06
maas-02:~$ sudo maas status
bind9 RUNNING pid 1616842, uptime 1 day, 6:14:20
dhcpd RUNNING pid 1047478, uptime 0:00:47
dhcpd6 STOPPED Not started
http RUNNING pid 1620198, uptime 1 day, 6:13:32
ntp RUNNING pid 1622263, uptime 1 day, 6:12:48
proxy RUNNING pid 923083, uptime 1:02:36
rackd RUNNING pid 1616845, uptime 1 day, 6:14:20
regiond RUNNING pid 1616847, uptime 1 day, 6:14:20
syslog RUNNING pid 1622253, uptime 1 day, 6:12:48

Revision history for this message
Jorge Merlino (jorge-merlino) wrote (last edit ):

I finally managed to reproduce this and want to add that just changing the log level of the messages might not be enough. In the interval between a connection being lost and re-established, the Controllers page on the web interface shows that controller as degraded. This is confusing for users.

If this is normal, maybe we should not mark the controller as degraded in the web UI until something abnormal actually happens.

Thanks

Revision history for this message
Bartosz Woronicz (mastier1) wrote (last edit ):

The message about racks is very strange; this was not an issue with 3.2.7.

92% rackd available.

Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

The UI behavior needs to be looked at, see attached screenshot.

tags: added: bug-council
removed: easy-to-fix papercut
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Please have a look at the UI behavior.

summary: - Many "ClusterClient connection lost" messages in rackd.log
+ Rack controller status flapping when "ClusterClient connection lost"
+ messages in rackd.log
Changed in maas:
assignee: nobody → Christian Grabowski (cgrabowski)
importance: Medium → High
tags: removed: bug-council
Changed in maas:
status: Triaged → Fix Committed
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.0-beta1
Revision history for this message
Bartosz Woronicz (mastier1) wrote (last edit ):

I have seen this happening now in 3.2.8.

no longer affects: maas (Ubuntu)
Changed in maas-ui:
milestone: none → 3.4.x
importance: Undecided → High
status: New → Triaged
Revision history for this message
Christian (cboitel) wrote :

I am also affected, although I am currently running snap 3.3.4:

```
$ sudo snap list maas
Name Version Rev Tracking Publisher Notes
maas 3.3.4-13189-g.f88272d1e 28521 3.3/stable canonical✓ -
```

As stated earlier, this is not only a cosmetic bug: each time a connection lost is logged, dhcpd is restarted by rackd (I saw `supervisorctl restart dhcpd` in the process list).

So hiding the messages in the UI or changing the log level won't stop the dhcpd restarts.

Moreover, this can be highly frequent: I once got a restart every 10 seconds and had to stop all rackd instances and restart them in sequence to break out of this endless loop. I will look into the regiond logs to find out why the connection was closed.
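One way to check whether dhcpd really is being bounced on every reconnect (rather than the churn being purely cosmetic) is to watch its supervised uptime while the log flaps; a sketch assuming a snap install:

```
# If dhcpd is restarted on each reconnect, its uptime keeps resetting to ~0.
watch -n 5 'sudo maas status | grep -E "^dhcpd "'

# Separately, look for the restart command in the process list
# (the bracket trick stops grep from matching itself):
ps aux | grep '[s]upervisorctl restart dhcpd'
```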

Revision history for this message
Thorsten Merten (thorsten-merten) wrote (last edit ):

For maas-ui we need to

1. check what's wrong with that number and where it is coming from
2. think about a different way to display this

Changed in maas-ui:
assignee: nobody → Nick De Villiers (nickdv99)
Changed in maas:
status: Fix Committed → In Progress
Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Adam Vest (foxmulder2004) wrote :

Hello,

Just thought I'd report that I, too, am seeing this flapping behavior on my two rack controllers, both on MAAS 3.3.4. I'm adding my comment since this bug is reportedly fixed in the 3.3.x branch, yet it persists for me:
---
root # snap list maas
Name Version Rev Tracking Publisher Notes
maas 3.3.4-13189-g.f88272d1e 28521 3.3/stable canonical✓ -

root # tail -n1 /var/snap/maas/common/log/rackd.log
2023-09-25 22:36:25 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:192.168.1.200', port=43998, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:192.168.1.46', port=5252, flowInfo=0, scopeID=0))

root # date
Mon Sep 25 10:36:46 PM UTC 2023
---

It's worth noting, however, that I am NOT seeing the dhcpd service restart on every disconnect/reconnect - in my particular case, as far as I can see, there is no loss of functionality and it is purely cosmetic.

Revision history for this message
Sam Schmitt (samcat116) wrote :

I am also still seeing this behavior in MAAS 3.4.0rc2.

Revision history for this message
Björn Hinz (bhinz83) wrote :

Same here.
Version: 3.3.4 (snap install)

2023-11-14 15:06:25 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas01:pid=1959698' authenticated.
2023-11-14 15:06:25 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas01:pid=1959698' authenticated.
2023-11-14 15:06:25 provisioningserver.rpc.clusterservice: [info] Rack controller 'ad3c78' registered (via maas01:pid=1959698) with MAAS version 3.3.4-13189-g.f88272d1e.
2023-11-14 15:06:26 provisioningserver.rpc.clusterservice: [info] Rack controller 'ad3c78' registered (via maas01:pid=1959698) with MAAS version 3.3.4-13189-g.f88272d1e.
2023-11-14 15:06:55 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.15.2.3', port=41030, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.15.2.2', port=5250, flowInfo=0, scopeID=0))
2023-11-14 15:06:55 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.15.2.3', port=41036, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.15.2.2', port=5250, flowInfo=0, scopeID=0))
2023-11-14 15:06:55 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas01:pid=1959698' authenticated.
2023-11-14 15:06:55 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas01:pid=1959698' authenticated.
2023-11-14 15:06:55 provisioningserver.rpc.clusterservice: [info] Rack controller 'ad3c78' registered (via maas01:pid=1959698) with MAAS version 3.3.4-13189-g.f88272d1e.
2023-11-14 15:06:55 provisioningserver.rpc.clusterservice: [info] Rack controller 'ad3c78' registered (via maas01:pid=1959698) with MAAS version 3.3.4-13189-g.f88272d1e.
2023-11-14 15:07:05 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.14.3.3', port=37338, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.14.3.4', port=5253, flowInfo=0, scopeID=0))
2023-11-14 15:07:05 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.14.3.3', port=37348, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.14.3.4', port=5253, flowInfo=0, scopeID=0))
2023-11-14 15:07:25 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.14.3.3', port=34250, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.14.3.4', port=5253, flowInfo=0, scopeID=0))
2023-11-14 15:07:25 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.14.3.3', port=34252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.14.3.4', port=5253, flowInfo=0, scopeID=0))
2023-11-14 15:07:25 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas03:pid=1659391' authenticated.
2023-11-14 15:07:25 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas03:pid=1659391' authenticated.
2023-11-14 15:07:25 provisioningserver.rpc.clusterservice: [info] Rack controller 'ad3c78' registered (via maas03:pid=1659391) with MAAS version 3.3.4-13189-g.f88272d1e.
2023-11-14 15:07:26 provisioningserver.rpc.clusterservice: [info] Rack controller '...


Revision history for this message
Alan Baghumian (alanbach) wrote :

This is also happening to me with snap 3.3.5, but it never causes any enlisting or commissioning issues:

2023-11-20 09:22:34 provisioningserver.rackdservices.dhcp_probe_service: [info] Probe for external DHCP servers started on interfaces: enp1s0.
2023-11-20 09:22:44 provisioningserver.dhcp.detect: [info] External DHCP server(s) discovered on interface 'enp1s0': 10.1.9.6
2023-11-20 09:22:44 provisioningserver.rackdservices.dhcp_probe_service: [info] External DHCP probe complete.
2023-11-20 09:23:13 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=57782, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:23:14 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=57788, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:23:33 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=37494, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:23:34 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=37496, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:23:34 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas-region-1:pid=641111' authenticated.
2023-11-20 09:23:34 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas-region-1:pid=641111' authenticated.
2023-11-20 09:23:34 provisioningserver.rpc.clusterservice: [info] Rack controller 'crxdhh' registered (via maas-region-1:pid=641111) with MAAS version 3.3.5-13213-g.e2fb46e1a.
2023-11-20 09:23:34 provisioningserver.rpc.clusterservice: [info] Rack controller 'crxdhh' registered (via maas-region-1:pid=641111) with MAAS version 3.3.5-13213-g.e2fb46e1a.
2023-11-20 09:23:44 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=45290, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:23:44 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=45292, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:24:04 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=44316, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:24:04 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=44328, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:24:04 provisioningserver.rpc.clusterservice: [info] ...


Revision history for this message
Joao Andre Simioni (jasimioni) wrote (last edit ):

I'm also seeing this issue with MAAS 3.3.5 installed from debs. The setup is two controllers, each running region + rack.

Logs look the same, but the main issue for me is that the controllers show degraded status, as mentioned by Jorge in https://bugs.launchpad.net/maas/+bug/1998615/comments/8

It looks like the periodic monitoring action sometimes runs while the connections are being re-established, and if not all of them are ready yet the status is reported as degraded:

https://git.launchpad.net/maas/tree/src/maasserver/models/node.py#n6795

Also, I see that the maas-ui task is triaged and assigned. Would that also fix the API response (the `maas $PROFILE rack-controllers read` output)?
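A hedged way to answer that from the CLI is to poll the endpoint while rackd.log is flapping and watch whether the returned JSON changes; jq is only used here to pretty-print:

```
# Poll the rack-controllers endpoint every 30 s while the log flaps.
# $PROFILE is your logged-in MAAS CLI profile name.
while true; do
    date
    maas "$PROFILE" rack-controllers read | jq .
    sleep 30
done
```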

Revision history for this message
Leonardo Silva (leo-scs) wrote :

I am also still seeing this behavior in MAAS 3.4.0-14321-g.1027c7664:

2024-02-27 20:26:43 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=40336, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5251, flowInfo=0, scopeID=0))
2024-02-27 20:26:43 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=40328, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5251, flowInfo=0, scopeID=0))
2024-02-27 20:27:03 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=47854, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5251, flowInfo=0, scopeID=0))
2024-02-27 20:27:03 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=47848, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5251, flowInfo=0, scopeID=0))
2024-02-27 20:27:13 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=54178, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5250, flowInfo=0, scopeID=0))
2024-02-27 20:27:13 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=54182, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5250, flowInfo=0, scopeID=0))
2024-02-27 20:27:33 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=36414, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5250, flowInfo=0, scopeID=0))
2024-02-27 20:27:33 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=36422, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5250, flowInfo=0, scopeID=0))

The setup is 3 region controllers + 20 rack controllers.

Revision history for this message
Giovanni Tirloni (gtirloni) wrote :

I found this bug report while looking for a solution.

I experience this on a single all-in-one MAAS server running 3.4.0-14321-g.1027c7664.

The server has two interfaces. The errors happen twice with the IPv6 of one interface. Two minutes later, they happen again twice with the IPv6 address of the other interface. Then back to the first interface after two minutes.

When they happen, they repeat exactly 30 seconds later with the IP of the same interface. Two minutes later, the other IP, followed by another disconnect exactly 30 seconds after that. And so on.

I checked all log files and there are no errors or warnings before the disconnects happen.

After I restart regiond, the disconnects stop for a while and start again after ~15min in the environment I'm troubleshooting.

After enabling debug logs, I cannot see any errors before the disconnects happen either. There are only a bunch of successful SELECTs to the database and the service monitors checking things. I also don't see any processes dying.
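To confirm the 30-second spacing from the log itself, a quick sketch (assuming the snap log path) that prints the gap in seconds between consecutive "connection lost" events:

```
# Print the number of seconds between consecutive "connection lost" events.
grep 'ClusterClient connection lost' /var/snap/maas/common/log/rackd.log \
  | awk '{print $1" "$2}' \
  | while read -r ts; do date -d "$ts" +%s; done \
  | awk 'NR > 1 { print $1 - prev } { prev = $1 }'
```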
