Rack controller status flapping when "ClusterClient connection lost" messages in rackd.log

Bug #1998615 reported by Guillaume Penin
This bug affects 20 people
Affects   Status         Importance  Assigned to          Milestone
MAAS      Fix Committed  High        Christian Grabowski
3.2       New            Undecided   Unassigned
3.3       Fix Released   High        Unassigned
maas-ui   Triaged        High        Nick De Villiers

Bug Description

I've noticed many "ClusterClient connection lost" occurrences in rackd.log, on our remote "Rack controller" and also on our central "Region and rack controller", for example:

2022-12-02 16:25:33 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=37144, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=5251, flowInfo=0, scopeID=0))
2022-12-02 16:25:33 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=37152, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=5251, flowInfo=0, scopeID=0))
2022-12-02 16:25:53 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=50880, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=5251, flowInfo=0, scopeID=0))
2022-12-02 16:25:53 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=50882, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.x.x.x', port=5251, flowInfo=0, scopeID=0))
2022-12-02 16:25:53 provisioningserver.rpc.clusterservice: [info] Event-loop 'ugifudhi01:pid=2078' authenticated.
2022-12-02 16:25:53 provisioningserver.rpc.clusterservice: [info] Event-loop 'ugifudhi01:pid=2078' authenticated.
2022-12-02 16:25:54 provisioningserver.rpc.clusterservice: [info] Rack controller 'qk68h8' registered (via xxxxxxxxxx:pid=2078) with MAAS version 3.3.0~beta3-13010-g.f45817046.
2022-12-02 16:25:54 provisioningserver.rpc.clusterservice: [info] Rack controller 'qk68h8' registered (via xxxxxxxxxx:pid=2078) with MAAS version 3.3.0~beta3-13010-g.f45817046.

Attached is a screenshot of the web UI status during these messages.

A tcpdump capture only seems to show "regular" disconnections.

Also, it does not seem to affect MAAS usage.

Are these messages legitimate and harmless?

Regards,


Revision history for this message
Guillaume Penin (guillaume-penin) wrote :
Changed in maas:
status: New → Triaged
importance: Undecided → Medium
milestone: none → 3.4.0
Revision history for this message
Alexsander de Souza (alexsander-souza) wrote :

These messages are harmless, but we recognize they are annoying and misleading. We should improve this.

PS: 3.3.0 is not released yet, so this is subject to change at any moment.

Revision history for this message
Thorsten Merten (thorsten-merten) wrote :

Via MAAS chat:

Installations of MAAS 3.3 (both fresh installs and upgrades) where the region+rack controllers are going up and down constantly. maas.log shows messages like this:

maas.rpc.rackcontrollers: [info] Existing rack controller 'xxx' running version 3.3.0-13159-g.1c22f7beb has connected to region 'yyy'

and rackd.log shows this:

rack controller 'mqrfwf' registered (via xxx:pid=1416) with MAAS version 3.3.0-13159-g.1c22f7beb
Both messages repeat constantly.

summary: - [MAAS 3.3] Many "ClusterClient connection lost" messages in rackd.log
+ Many "ClusterClient connection lost" messages in rackd.log
Revision history for this message
Jorge Merlino (jorge-merlino) wrote : Re: Many "ClusterClient connection lost" messages in rackd.log

Hi all,

I've been investigating this issue and I think it might be more than just a cosmetic problem. The logs get thousands of these connection established/lost messages because each connection lives around 10-15 minutes, and since there are several connections in the pool there is always some connection being created or destroyed.
I looked at a network dump and there is nothing strange there; the party that starts the connections ends them normally after a few minutes.
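A rough way to quantify that churn from the log itself (a sketch, assuming the snap log path shown later in this thread; adjust the path for deb installs) is to bucket the "connection lost" events per minute:

```
# Count "ClusterClient connection lost" events per minute in rackd.log.
# Path assumes a snap install; adjust for deb installs.
grep 'ClusterClient connection lost' /var/snap/maas/common/log/rackd.log \
  | awk '{print $1, substr($2, 1, 5)}' \
  | uniq -c
```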

tags: added: bug-council
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

MAAS uses connection pooling, and this behaviour of creating new connections is expected and harmless. We will reduce the log spam by changing the log level from INFO to DEBUG, where these messages belong.
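Until that change lands, a purely cosmetic workaround is to filter these messages out when following the log; a minimal sketch, assuming the snap log path:

```
# Hide the expected connection-pool churn while tailing rackd.log.
# This only filters the view; rackd still writes the messages at INFO.
tail -F /var/snap/maas/common/log/rackd.log \
  | grep -vE 'ClusterClient connection (lost|established)'
```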

tags: removed: bug-council
tags: added: easy-to-fix papercut
Revision history for this message
Bartosz Woronicz (mastier1) wrote (last edit ):

This seems more serious: it appears to cause frequent restarts of the DHCP agent, so dhcpd is very short-lived. In my HA setup this prevents machines from being commissioned.

Also, some machines disappear when they are commissioned at the same time.

MAAS version
installed: 3.3.2-13177-g.a73a6e2bd (27109) 138MB -

maas-01:~$ sudo maas status
bind9 RUNNING pid 272671, uptime 1 day, 6:17:34
dhcpd RUNNING pid 2288973, uptime 0:04:05
dhcpd6 STOPPED Not started
http RUNNING pid 272521, uptime 1 day, 6:17:35
ntp RUNNING pid 2228029, uptime 0:13:06
proxy RUNNING pid 1865584, uptime 1:06:39
rackd RUNNING pid 263370, uptime 1 day, 6:18:38
regiond RUNNING pid 263371, uptime 1 day, 6:18:38
syslog RUNNING pid 276172, uptime 1 day, 6:17:06
maas-02:~$ sudo maas status
bind9 RUNNING pid 1616842, uptime 1 day, 6:14:20
dhcpd RUNNING pid 1047478, uptime 0:00:47
dhcpd6 STOPPED Not started
http RUNNING pid 1620198, uptime 1 day, 6:13:32
ntp RUNNING pid 1622263, uptime 1 day, 6:12:48
proxy RUNNING pid 923083, uptime 1:02:36
rackd RUNNING pid 1616845, uptime 1 day, 6:14:20
regiond RUNNING pid 1616847, uptime 1 day, 6:14:20
syslog RUNNING pid 1622253, uptime 1 day, 6:12:48

Revision history for this message
Jorge Merlino (jorge-merlino) wrote (last edit ):

I finally managed to reproduce this and want to add that just changing the log level of the messages might not be enough. In the interval between a connection being lost and re-established, the Controllers page on the web interface shows that controller as degraded. This is confusing for users.

If this is normal, maybe we should not mark the controller as degraded in the web UI until something abnormal actually happens.

Thanks

Revision history for this message
Bartosz Woronicz (mastier1) wrote (last edit ):

The message about racks is very strange; this was not an issue with 3.2.7.

92% rackd available.

Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

The UI behavior needs to be looked at, see attached screenshot.

tags: added: bug-council
removed: easy-to-fix papercut
Revision history for this message
Jerzy Husakowski (jhusakowski) wrote :

Please have a look at the UI behavior.

summary: - Many "ClusterClient connection lost" messages in rackd.log
+ Rack controller status flapping when "ClusterClient connection lost"
+ messages in rackd.log
Changed in maas:
assignee: nobody → Christian Grabowski (cgrabowski)
importance: Medium → High
tags: removed: bug-council
Changed in maas:
status: Triaged → Fix Committed
Alberto Donato (ack)
Changed in maas:
milestone: 3.4.0 → 3.4.0-beta1
Revision history for this message
Bartosz Woronicz (mastier1) wrote (last edit ):

I have seen this happening now in 3.2.8.

no longer affects: maas (Ubuntu)
Changed in maas-ui:
milestone: none → 3.4.x
importance: Undecided → High
status: New → Triaged
Revision history for this message
Christian (cboitel) wrote :

I am also affected, although I am currently running snap 3.3.4:

```
$ sudo snap list maas
Name Version Rev Tracking Publisher Notes
maas 3.3.4-13189-g.f88272d1e 28521 3.3/stable canonical✓ -
```

As stated earlier, this is not only a cosmetic bug: each time a connection lost is logged, dhcpd is restarted by rackd (I saw `supervisorctl restart dhcpd` in the process list).

So hiding the messages in the UI or changing the log level won't stop the dhcpd restarts.

Moreover, this can be highly frequent: I once got a restart every 10 seconds and had to stop all rackd instances and restart them in sequence to break out of this endless loop. I will look into the regiond logs to find out why the connection was closed.
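One way to check whether dhcpd really is being bounced on every reconnect (rather than the churn being purely cosmetic) is to watch its supervised uptime while the log flaps; a sketch assuming a snap install:

```
# If dhcpd is restarted on each reconnect, its uptime keeps resetting to ~0.
watch -n 5 'sudo maas status | grep -E "^dhcpd "'

# Separately, look for the restart command in the process list
# (the bracket trick stops grep from matching itself):
ps aux | grep '[s]upervisorctl restart dhcpd'
```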

Revision history for this message
Thorsten Merten (thorsten-merten) wrote (last edit ):

For maas-ui we need to

1. check what's wrong with that number and where it is coming from
2. think about a different way to display this

Changed in maas-ui:
assignee: nobody → Nick De Villiers (nickdv99)
Changed in maas:
status: Fix Committed → In Progress
Changed in maas:
status: In Progress → Fix Committed
Revision history for this message
Adam Vest (foxmulder2004) wrote :

Hello,

Just thought I'd report that I, too, am seeing this flapping behavior on my two rack controllers, both on MAAS 3.3.4. I'm adding my comment since this bug is reportedly fixed in the 3.3.x branch, yet it persists for me:
---
root # snap list maas
Name Version Rev Tracking Publisher Notes
maas 3.3.4-13189-g.f88272d1e 28521 3.3/stable canonical✓ -

root # tail -n1 /var/snap/maas/common/log/rackd.log
2023-09-25 22:36:25 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:192.168.1.200', port=43998, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:192.168.1.46', port=5252, flowInfo=0, scopeID=0))

root # date
Mon Sep 25 10:36:46 PM UTC 2023
---

It's worth noting, however, that I am NOT seeing the dhcpd service restart on every disconnect/reconnect - in my particular case, as far as I can see, there is no loss of functionality and it is purely cosmetic.

Revision history for this message
Sam Schmitt (samcat116) wrote :

I am also still seeing this behavior in MAAS 3.4.0rc2.

Revision history for this message
Björn Hinz (bhinz83) wrote :

Same here.
Version: 3.3.4 (snap install)

2023-11-14 15:06:25 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas01:pid=1959698' authenticated.
2023-11-14 15:06:25 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas01:pid=1959698' authenticated.
2023-11-14 15:06:25 provisioningserver.rpc.clusterservice: [info] Rack controller 'ad3c78' registered (via maas01:pid=1959698) with MAAS version 3.3.4-13189-g.f88272d1e.
2023-11-14 15:06:26 provisioningserver.rpc.clusterservice: [info] Rack controller 'ad3c78' registered (via maas01:pid=1959698) with MAAS version 3.3.4-13189-g.f88272d1e.
2023-11-14 15:06:55 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.15.2.3', port=41030, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.15.2.2', port=5250, flowInfo=0, scopeID=0))
2023-11-14 15:06:55 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.15.2.3', port=41036, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.15.2.2', port=5250, flowInfo=0, scopeID=0))
2023-11-14 15:06:55 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas01:pid=1959698' authenticated.
2023-11-14 15:06:55 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas01:pid=1959698' authenticated.
2023-11-14 15:06:55 provisioningserver.rpc.clusterservice: [info] Rack controller 'ad3c78' registered (via maas01:pid=1959698) with MAAS version 3.3.4-13189-g.f88272d1e.
2023-11-14 15:06:55 provisioningserver.rpc.clusterservice: [info] Rack controller 'ad3c78' registered (via maas01:pid=1959698) with MAAS version 3.3.4-13189-g.f88272d1e.
2023-11-14 15:07:05 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.14.3.3', port=37338, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.14.3.4', port=5253, flowInfo=0, scopeID=0))
2023-11-14 15:07:05 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.14.3.3', port=37348, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.14.3.4', port=5253, flowInfo=0, scopeID=0))
2023-11-14 15:07:25 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.14.3.3', port=34250, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.14.3.4', port=5253, flowInfo=0, scopeID=0))
2023-11-14 15:07:25 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.14.3.3', port=34252, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.14.3.4', port=5253, flowInfo=0, scopeID=0))
2023-11-14 15:07:25 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas03:pid=1659391' authenticated.
2023-11-14 15:07:25 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas03:pid=1659391' authenticated.
2023-11-14 15:07:25 provisioningserver.rpc.clusterservice: [info] Rack controller 'ad3c78' registered (via maas03:pid=1659391) with MAAS version 3.3.4-13189-g.f88272d1e.
2023-11-14 15:07:26 provisioningserver.rpc.clusterservice: [info] Rack controller '...


Revision history for this message
Alan Baghumian (alanbach) wrote :

This is also happening to me with snap 3.3.5, but it never causes any enlisting or commissioning issues:

2023-11-20 09:22:34 provisioningserver.rackdservices.dhcp_probe_service: [info] Probe for external DHCP servers started on interfaces: enp1s0.
2023-11-20 09:22:44 provisioningserver.dhcp.detect: [info] External DHCP server(s) discovered on interface 'enp1s0': 10.1.9.6
2023-11-20 09:22:44 provisioningserver.rackdservices.dhcp_probe_service: [info] External DHCP probe complete.
2023-11-20 09:23:13 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=57782, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:23:14 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=57788, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:23:33 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=37494, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:23:34 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=37496, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:23:34 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas-region-1:pid=641111' authenticated.
2023-11-20 09:23:34 provisioningserver.rpc.clusterservice: [info] Event-loop 'maas-region-1:pid=641111' authenticated.
2023-11-20 09:23:34 provisioningserver.rpc.clusterservice: [info] Rack controller 'crxdhh' registered (via maas-region-1:pid=641111) with MAAS version 3.3.5-13213-g.e2fb46e1a.
2023-11-20 09:23:34 provisioningserver.rpc.clusterservice: [info] Rack controller 'crxdhh' registered (via maas-region-1:pid=641111) with MAAS version 3.3.5-13213-g.e2fb46e1a.
2023-11-20 09:23:44 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=45290, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:23:44 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=45292, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:24:04 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=44316, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:24:04 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:10.1.9.12', port=44328, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:10.1.9.10', port=5252, flowInfo=0, scopeID=0))
2023-11-20 09:24:04 provisioningserver.rpc.clusterservice: [info] ...


Revision history for this message
Joao Andre Simioni (jasimioni) wrote (last edit ):

I'm also seeing this issue with MAAS 3.3.5 installed from debs. The setup is two controllers, each running region + rack.

Logs look the same, but the main issue for me is that the controllers show degraded status, as mentioned by Jorge in https://bugs.launchpad.net/maas/+bug/1998615/comments/8

It looks like the periodic monitoring action sometimes runs while the connections are being re-established, and if not all of them are ready yet the status is reported as degraded:

https://git.launchpad.net/maas/tree/src/maasserver/models/node.py#n6795

Also, I see that the maas-ui task is triaged and assigned. Would that also fix the API response (the `maas $PROFILE rack-controllers read` output)?
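A hedged way to answer that from the CLI is to poll the endpoint while rackd.log is flapping and watch whether the returned JSON changes; jq is only used here to pretty-print:

```
# Poll the rack-controllers endpoint every 30 s while the log flaps.
# $PROFILE is your logged-in MAAS CLI profile name.
while true; do
    date
    maas "$PROFILE" rack-controllers read | jq .
    sleep 30
done
```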

Revision history for this message
Leonardo Silva (leo-scs) wrote :

I am also still seeing this behavior in MAAS 3.4.0-14321-g.1027c7664:

2024-02-27 20:26:43 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=40336, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5251, flowInfo=0, scopeID=0))
2024-02-27 20:26:43 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=40328, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5251, flowInfo=0, scopeID=0))
2024-02-27 20:27:03 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=47854, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5251, flowInfo=0, scopeID=0))
2024-02-27 20:27:03 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=47848, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5251, flowInfo=0, scopeID=0))
2024-02-27 20:27:13 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=54178, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5250, flowInfo=0, scopeID=0))
2024-02-27 20:27:13 ClusterClient,client: [info] ClusterClient connection lost (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=54182, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5250, flowInfo=0, scopeID=0))
2024-02-27 20:27:33 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=36414, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5250, flowInfo=0, scopeID=0))
2024-02-27 20:27:33 Uninitialized: [info] ClusterClient connection established (HOST:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=36422, flowInfo=0, scopeID=0) PEER:IPv6Address(type='TCP', host='::ffff:x.x.x.x', port=5250, flowInfo=0, scopeID=0))

The setup is 3 region controllers + 20 rack controllers.

Revision history for this message
Giovanni Tirloni (gtirloni) wrote :

I found this bug report while looking for a solution.

I experience this on a single all-in-one MAAS server running 3.4.0-14321-g.1027c7664.

The server has two interfaces. The errors happen twice with the IPv6 of one interface. Two minutes later, they happen again twice with the IPv6 address of the other interface. Then back to the first interface after two minutes.

When they happen, they repeat exactly 30 seconds later with the IP of the same interface. Two minutes later, the other IP, followed by another disconnect exactly 30 seconds after that. And so on.

I checked all log files and there are no errors or warnings before the disconnects happen.

After I restart regiond, the disconnects stop for a while and start again after ~15min in the environment I'm troubleshooting.

After enabling debug logs, I cannot see any errors before the disconnects happen either. There are only a bunch of successful SELECTs to the database and the service monitors checking things. I also don't see any processes dying.
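To confirm the 30-second spacing from the log itself, a quick sketch (assuming the snap log path) that prints the gap in seconds between consecutive "connection lost" events:

```
# Print the number of seconds between consecutive "connection lost" events.
grep 'ClusterClient connection lost' /var/snap/maas/common/log/rackd.log \
  | awk '{print $1" "$2}' \
  | while read -r ts; do date -d "$ts" +%s; done \
  | awk 'NR > 1 { print $1 - prev } { prev = $1 }'
```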
