Cluster disconnecting periodically - [ClusterClient,client] Failed to refresh power state

Bug #1475445 reported by Larry Michel
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
Fix Released
Critical
Gavin Panella

Bug Description

We're seeing issues with cluster disconnecting periodically..

From regiond.log:

ubuntu@maas-trusty-back-may22:/var/log/maas$ grep "Unable to get RPC connection for cluster" regiond.log
2015-07-16 22:03:46 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
2015-07-16 22:07:14 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
2015-07-16 22:07:16 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
2015-07-16 22:07:18 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
2015-07-16 22:07:20 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
.. .. repeats multiple times
2015-07-16 22:19:20 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
2015-07-16 22:19:20 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
2015-07-16 22:19:20 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
2015-07-16 22:19:20 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
2015-07-16 22:19:20 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
2015-07-16 22:19:20 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
2015-07-16 22:19:20 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)
2015-07-16 22:19:20 [maasserver] ERROR: Unable to get RPC connection for cluster 'OIL Cluster' (037c960b-5b9f-4701-8366-eeda2c09d14e)

From clusterd.log:

2015-07-16 22:03:27+0000 [-] donphan: Power could not be turned off; timed out.
2015-07-16 22:03:45+0000 [ClusterClient,client] ClusterClient connection lost (HOST:IPv4Address(TCP, '127.0.0.1', 60788) PEER:IPv4Address(TCP, u'127.0.0.1', 5251))
2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Unhandled failure dispatching AMP command. This is probably a bug. Please ensure that this error is handled within application code or declared in the signature of the PowerQuery command. [maas-trusty-back-may22:pid=10787:cmd=PowerQuery:ask=17c04]
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Failed to refresh power state.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Failed to refresh power state.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Failed to refresh power state.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.
2015-07-16 22:03:27+0000 [-] donphan: Power could not be turned off; timed out.
2015-07-16 22:03:45+0000 [ClusterClient,client] ClusterClient connection lost (HOST:IPv4Address(TCP, '127.0.0.1', 60788) PEER:IPv4Address(TCP, u'127.0.0.1', 5251))
2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Unhandled failure dispatching AMP command. This is probably a bug. Please ensure that this error is handled within application code or declared in the signature of the PowerQuery command. [maas-trusty-back-may22:pid=10787:cmd=PowerQuery:ask=17c04]
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Failed to refresh power state.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Failed to refresh power state.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Failed to refresh power state.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:45+0000 [ClusterClient,client] Logging TFTP request failed.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.
        Traceback (most recent call last):
        Failure: twisted.internet.error.ConnectionDone: Connection was closed cleanly.

2015-07-16 22:03:53+0000 [TFTP (UDP)] Datagram received from ('10.245.0.152', 1024): <RRQDatagram(filename=pxelinux.0, mode=octet, options={'blksize': '1432', 'tsize': '0'})>

Tags: oil
Revision history for this message
Larry Michel (lmic) wrote :
Revision history for this message
Larry Michel (lmic) wrote :

We've recreated this issue this morning.

Revision history for this message
Larry Michel (lmic) wrote :

logs from this morning's recreate.

Changed in maas:
assignee: nobody → Gavin Panella (allenap)
importance: Undecided → Critical
Changed in maas:
status: New → Triaged
milestone: none → 1.9.0
Revision history for this message
Gavin Panella (allenap) wrote :

Can you check for stray regiond processes?

Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Larry Michel (lmic) wrote :
Download full text (12.2 KiB)

Gavin, here's a list of processes. Any specific command you'd like us to run to get more info on what's running?

ubuntu@maas-trusty-back-may22:~$ ps -efaux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 33776 2420 ? Ss Jun23 3:17 /sbin/init
root 236 0.0 0.0 19472 536 ? S Jun23 0:00 upstart-udev-bridge --daemon
root 267 0.0 0.0 49268 836 ? Ss Jun23 0:00 /lib/systemd/systemd-udevd --daemon
root 388 0.0 0.0 23416 584 ? Ss Jun23 0:03 rpcbind
message+ 398 0.0 0.0 39320 940 ? Ss Jun23 0:00 dbus-daemon --system --fork
root 516 0.0 0.0 43448 1332 ? Ss Jun23 0:00 /lib/systemd/systemd-logind
root 534 0.0 0.0 12784 668 ? Ss+ Jun23 0:00 /sbin/getty -8 38400 tty4
root 544 0.0 0.0 12784 668 pts/4 Ss+ Jun23 0:00 /sbin/getty -8 38400 tty2
root 545 0.0 0.0 12784 668 pts/7 Ss+ Jun23 0:00 /sbin/getty -8 38400 tty3
root 573 0.0 0.0 61364 1228 ? Ss Jun23 0:00 /usr/sbin/sshd -D
root 46681 0.0 0.0 105628 3396 ? Ss Jul28 0:00 \_ sshd: ubuntu [priv]
ubuntu 46738 0.0 0.0 105628 1312 ? S Jul28 0:00 | \_ sshd: ubuntu@pts/2
ubuntu 46739 0.0 0.0 22956 2020 pts/2 Ss+ Jul28 0:00 | \_ -bash LANG=en_GB.UTF-8 USER=ubuntu LOGNAME=ubuntu HOME=/home/ubuntu PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games MAIL=/var/mail/ubuntu SHELL=/bin/bash SSH_CLIENT=10.172.68.84 60285 22 SSH_CONNECTION=10.172.68.84 60285 10.245.0.10 22 SSH_TTY=/dev/pts/2 TERM=xterm
root 63532 0.0 0.0 105628 3728 ? Ss Jul30 0:00 \_ sshd: ubuntu [priv]
ubuntu 63707 0.0 0.0 105628 1436 ? S Jul30 0:00 | \_ sshd: ubuntu
root 17307 0.0 0.0 105628 4244 ? Ss 14:50 0:00 \_ sshd: ubuntu [priv]
ubuntu 17371 0.0 0.0 105628 1904 ? S 14:50 0:00 \_ sshd: ubuntu@pts/6
ubuntu 17372 0.0 0.0 22948 4108 pts/6 Ss 14:50 0:00 \_ -bash LANG=en_GB.UTF-8 USER=ubuntu LOGNAME=ubuntu HOME=/home/ubuntu PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games MAIL=/var/mail/ubuntu SHELL=/bin/bash SSH_CLIENT=10.172.68.84 34814 22 SSH_CONNECTION=10.172.68.84 34814 10.245.0.10 22 SSH_TTY=/dev/pts/6 TERM=xterm
ubuntu 20517 0.0 0.0 18580 1328 pts/6 R+ 14:54 0:00 \_ ps -efaux VIRTUALENVWRAPPER_SCRIPT=/usr/share/virtualenvwrapper/virtualenvwrapper.sh TERM=xterm-256color SHELL=/bin/bash SSH_CLIENT=10.172.68.84 34814 22 SSH_TTY=/dev/pts/6 USER=ubuntu LS_COLORS=rs=0:di=01;34:ln=01;36:mh=00:pi=40;33:so=01;35:do=01;35:bd=40;33;01:cd=40;33;01:or=40;31;01:su=37;41:sg=30;43:ca=30;41:tw=30;42:ow=34;42:st=37;44:ex=01;32:*.tar=01;31:*.tgz=01;31:*.arj=01;31:*.taz=01;31:*.lzh=01;31:*.lzma
root 588 0.0 0.0 23652 868 ? Ss Jun23 1:24 cron
daemon 590 0.0 0.0 19136 20 ? Ss Jun23 0:00 atd
root 652 1.3 0.2 10364748 22432 ? Ssl Jun23 756:04 tgtd
root 665 0.0 0.0 13124 264 ? ...

Changed in maas:
status: Incomplete → New
no longer affects: maas/1.9
Gavin Panella (allenap)
Changed in maas:
status: New → Triaged
Revision history for this message
Blake Rouse (blake-rouse) wrote :

The checking of power in 1.9 has been greatly improved to help with concurrency and transactions. This is not something that is able to be backported to 1.8.

no longer affects: maas/1.8
Changed in maas:
status: Triaged → Fix Committed
Changed in maas:
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.