"Unrecoverable error" in celery-region, service died, after upgrade to beta6
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
MAAS | New | Undecided | Unassigned | |
Bug Description
We upgraded to beta6 at around 2014-07-21 14:12:32 CDT (taken from apt's history.log):
Start-Date: 2014-07-21 14:12:32
Upgrade: python-
End-Date: 2014-07-21 14:14:28
We didn't see anything odd for a while, but then we noticed that some nodes were up and running while MAAS still listed them as just "ready", not allocated. Inspecting the services:
<root@courage>
maas-cluster-celery stop/waiting
maas-dhcp-server start/running, process 24203
maas-pserv start/running, process 24495
maas-region-celery start/running, process 26720
maas-txlongpoll start/running, process 26536
So one service died.
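A check like this could have caught the dead service earlier. The following is only a minimal sketch: it assumes the upstart-style `job goal/state, process PID` lines pasted above, and the names `parse_status` and `stopped_jobs` are hypothetical helpers, not part of MAAS.

```python
import re

# Matches lines such as:
#   maas-pserv start/running, process 24495
#   maas-cluster-celery stop/waiting
STATUS_RE = re.compile(
    r"^(?P<job>\S+)\s+(?P<goal>\w+)/(?P<state>[\w-]+)"
    r"(?:,\s+process\s+(?P<pid>\d+))?$"
)


def parse_status(text):
    """Return {job: (goal, state, pid_or_None)} for every status line found."""
    jobs = {}
    for line in text.splitlines():
        match = STATUS_RE.match(line.strip())
        if match:
            pid = match.group("pid")
            jobs[match.group("job")] = (
                match.group("goal"),
                match.group("state"),
                int(pid) if pid else None,
            )
    return jobs


def stopped_jobs(text):
    """Return the sorted names of jobs whose goal is 'stop'."""
    return sorted(
        job for job, (goal, _state, _pid) in parse_status(text).items()
        if goal == "stop"
    )


if __name__ == "__main__":
    sample = """\
maas-cluster-celery stop/waiting
maas-dhcp-server start/running, process 24203
maas-pserv start/running, process 24495
"""
    print(stopped_jobs(sample))
```

Lines that don't look like status output (the shell prompt, `(...)` markers) are skipped rather than treated as errors.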
celery.log shows some errors talking to rabbit at that time:
[2014-07-21 14:14:06,896: WARNING/
(backtrace)
error: [Errno 104] Connection reset by peer
[2014-07-21 14:14:06,914: ERROR/MainProcess] consumer: Cannot connect to amqp://
Trying again in 2.00 seconds...
[2014-07-21 14:14:08,924: ERROR/MainProcess] consumer: Cannot connect to amqp://
Trying again in 4.00 seconds...
Until it finally dies:
[2014-07-21 14:14:12,933: ERROR/MainProcess] Unrecoverable error: ValueError('I/O operation on closed epoll fd',)
Traceback (most recent call last):
File "/usr/lib/
self.
File "/usr/lib/
step.
File "/usr/lib/
return self.obj.start()
File "/usr/lib/
blueprint.
File "/usr/lib/
step.
File "/usr/lib/
c.connection = c.connect()
File "/usr/lib/
conn.
File "/usr/lib/
loop.
File "/usr/lib/
return self.add(fds, callback, READ | ERR, args)
File "/usr/lib/
self.
File "/usr/lib/
self.
ValueError: I/O operation on closed epoll fd
[2014-07-21 14:14:13,945: INFO/MainProcess] beat: Shutting down...
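For reference, the final ValueError is the kind Python raises when a file descriptor is registered on an epoll object that has already been closed. The sketch below only reproduces that error class in isolation. It does not show why the epoll object was closed in celery, and `closed_epoll_error` is a hypothetical name used for illustration. The exact message wording differs between Python versions.

```python
import os
import select


def closed_epoll_error():
    """Register an fd on a closed epoll object and return the error text.

    Returns None where epoll is unavailable (it is Linux-only).
    """
    if not hasattr(select, "epoll"):
        return None
    read_fd, write_fd = os.pipe()
    poller = select.epoll()
    poller.close()  # assumption: stands in for whatever closed it in the worker
    try:
        # Same kind of call the traceback ends in: add an fd for READ | ERR.
        poller.register(read_fd, select.EPOLLIN | select.EPOLLERR)
    except ValueError as exc:
        return str(exc)
    finally:
        os.close(read_fd)
        os.close(write_fd)
    return None


if __name__ == "__main__":
    print(closed_epoll_error())
```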
It stayed like that until we restarted it manually. It then went nuts trying to replay all the tasks that were pending:
[2014-07-21 18:12:14,360: WARNING/
[2014-07-21 18:12:14,362: INFO/MainProcess] Received task: provisioningser
[2014-07-21 18:12:14,363: INFO/MainProcess] Received task: provisioningser
[2014-07-21 18:12:14,364: INFO/MainProcess] Received task: provisioningser
[2014-07-21 18:12:14,364: INFO/MainProcess] Received task: provisioningser
[2014-07-21 18:12:14,365: INFO/MainProcess] Received task: provisioningser
(...)
There were lots of tasks, power offs, power ups, etc. In the end, the server was back to normal.
Logs attached. I just don't have the terminal output from this package upgrade, I'm afraid. /var/log/
tags: | added: landscape |
Hah, found the upgrade logs. It was driven by landscape via auto-upgrade profiles.
You can see in these logs that maas-cluster-celery was restarted and got a new PID, and that the others' PIDs match the status call I pasted at the beginning of this bug report:
maas-dhcp-server start/running, process 24203
(...)
maas-pserv start/running, process 24495
maas-cluster-celery start/running, process 24542
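The PID comparison can also be done mechanically. This is a rough sketch that assumes both snapshots use the `job start/running, process PID` format shown in this report; `running_pids` and `compare_snapshots` are hypothetical helper names.

```python
import re

RUNNING_RE = re.compile(r"^(\S+) start/running, process (\d+)$")


def running_pids(text):
    """Return {job: pid} for every 'start/running' line in the text."""
    pids = {}
    for line in text.splitlines():
        match = RUNNING_RE.match(line.strip())
        if match:
            pids[match.group(1)] = int(match.group(2))
    return pids


def compare_snapshots(earlier, later):
    """Classify each job running in `earlier` by what happened to it in `later`."""
    before, after = running_pids(earlier), running_pids(later)
    result = {}
    for job, pid in before.items():
        if job not in after:
            result[job] = "not running"
        elif after[job] == pid:
            result[job] = "same pid"
        else:
            result[job] = "pid changed"
    return result


if __name__ == "__main__":
    upgrade_log = """\
maas-dhcp-server start/running, process 24203
maas-pserv start/running, process 24495
maas-cluster-celery start/running, process 24542
"""
    status_output = """\
maas-cluster-celery stop/waiting
maas-dhcp-server start/running, process 24203
maas-pserv start/running, process 24495
"""
    for job, verdict in sorted(compare_snapshots(upgrade_log, status_output).items()):
        print(job, verdict)
```

Jobs that appear only in the later snapshot are ignored, since the upgrade log excerpt does not list every service.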