"Unrecoverable error" in celery-region, service died, after upgrade to beta6

Bug #1346701 reported by Andreas Hasenack
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MAAS
New
Undecided
Unassigned

Bug Description

We upgraded to beta6 at around 2014-07-21 14:12:32 CDT (taken from apt's history.log):

Start-Date: 2014-07-21 14:12:32
Upgrade: python-maas-provisioningserver:amd64 (1.6+bzr2535-0ubuntu1~beta5~ppa1, 1.6.0+bzr2539-0ubuntu1~beta6~ppa1), ...
End-Date: 2014-07-21 14:14:28

Didn't see anything odd for a while, but then we noticed that some nodes were up and running while maas had them just "ready", not allocated. Inspecting the service:
<root@courage>:/etc/init# for n in maas*conf; do status ${n%%.conf}; done
maas-cluster-celery stop/waiting
maas-dhcp-server start/running, process 24203
maas-pserv start/running, process 24495
maas-region-celery start/running, process 26720
maas-txlongpoll start/running, process 26536

So one service died.

celery.log shows some errors talking to rabbit at that time:
[2014-07-21 14:14:06,896: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
(backtrace)
error: [Errno 104] Connection reset by peer
[2014-07-21 14:14:06,914: ERROR/MainProcess] consumer: Cannot connect to amqp://maas_workers@10.0.7.8:5672//maas_workers: [Errno 111] Connection refused.
Trying again in 2.00 seconds...

[2014-07-21 14:14:08,924: ERROR/MainProcess] consumer: Cannot connect to amqp://maas_workers@10.0.7.8:5672//maas_workers: [Errno 111] Connection refused.
Trying again in 4.00 seconds...

Until it finally dies:
[2014-07-21 14:14:12,933: ERROR/MainProcess] Unrecoverable error: ValueError('I/O operation on closed epoll fd',)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/celery/worker/__init__.py", line 212, in start
    self.blueprint.start(self)
  File "/usr/lib/python2.7/dist-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/usr/lib/python2.7/dist-packages/celery/bootsteps.py", line 373, in start
    return self.obj.start()
  File "/usr/lib/python2.7/dist-packages/celery/worker/consumer.py", line 270, in start
    blueprint.start(self)
  File "/usr/lib/python2.7/dist-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/usr/lib/python2.7/dist-packages/celery/worker/consumer.py", line 467, in start
    c.connection = c.connect()
  File "/usr/lib/python2.7/dist-packages/celery/worker/consumer.py", line 369, in connect
    conn.transport.register_with_event_loop(conn.connection, self.hub)
  File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", line 124, in register_with_event_loop
    loop.add_reader(connection.sock, self.on_readable, connection, loop)
  File "/usr/lib/python2.7/dist-packages/kombu/async/hub.py", line 214, in add_reader
    return self.add(fds, callback, READ | ERR, args)
  File "/usr/lib/python2.7/dist-packages/kombu/async/hub.py", line 165, in add
    self.poller.register(fd, flags)
  File "/usr/lib/python2.7/dist-packages/kombu/utils/eventio.py", line 78, in register
    self._epoll.register(fd, events)
ValueError: I/O operation on closed epoll fd
[2014-07-21 14:14:13,945: INFO/MainProcess] beat: Shutting down...

It stayed like that until we started it back manually. And then it got nuts trying to replay all the tasks that were pending:
[2014-07-21 18:12:14,360: WARNING/MainProcess] celery@courage ready.
[2014-07-21 18:12:14,362: INFO/MainProcess] Received task: provisioningserver.tasks.refresh_secrets[067978dd-81b7-44df-bff2-a5005111086d]
[2014-07-21 18:12:14,363: INFO/MainProcess] Received task: provisioningserver.tasks.refresh_secrets[adf830fc-1622-406b-9a09-ef29649d1780]
[2014-07-21 18:12:14,364: INFO/MainProcess] Received task: provisioningserver.tasks.power_off[39317e4d-7e22-4b58-901a-b9cecf6bf351]
[2014-07-21 18:12:14,364: INFO/MainProcess] Received task: provisioningserver.tasks.remove_dhcp_host_map[7bb291db-d521-4814-90db-3edca47e0e77]
[2014-07-21 18:12:14,365: INFO/MainProcess] Received task: provisioningserver.tasks.power_off[812c16cd-d34f-4635-be6b-53ea0084f167]
(...)

There were lots of tasks, power offs, power ups, etc. In the end, the server was back to normal.

Logs attached. I just don't have the terminal output from this package upgrade I'm afraid. /var/log/apt/term.log just has a start and end line, and I don't know who did it.

Tags: landscape
Revision history for this message
Andreas Hasenack (ahasenack) wrote :
Revision history for this message
Andreas Hasenack (ahasenack) wrote :
Revision history for this message
Andreas Hasenack (ahasenack) wrote :
Revision history for this message
Andreas Hasenack (ahasenack) wrote :
description: updated
summary: - Unrecoverable error in celery-region, service died, after upgrade to
+ "Unrecoverable error" in celery-region, service died, after upgrade to
beta6
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Hah found the upgrade logs. It was driven by landscape via auto-upgrade profiles.

You can see in these logs that maas-cluster-celery was restarted and got a new PID, and the other's PIDs match the status call I pasted at the beginning of this bug report:

maas-dhcp-server start/running, process 24203
(...)
maas-pserv start/running, process 24495
maas-cluster-celery start/running, process 24542

Revision history for this message
Julian Edwards (julian-edwards) wrote :

I think this is a manifestation of a bug in the packaging where it restarts rabbit to change its password, and doesn't celery restarts afterwards (a race condition, basically).

bug 1300507

tags: added: landscape
Revision history for this message
neel (neel-basu-z) wrote :

I am getting a simi;ar Error with Ubuntu 14.04 Server installed from downloaded iso.

[2014-09-26 12:37:35,356: ERROR/Beat] beat: Connection error: timed out. Trying again in 32.0 seconds...
[2014-09-26 12:37:35,357: ERROR/MainProcess] consumer: Cannot connect to amqp://maas_workers@192.168.250.140:5672//maas_workers: timed out.
Trying again in 32.00 seconds...

[2014-09-26 12:38:11,403: ERROR/MainProcess] consumer: Cannot connect to amqp://maas_workers@192.168.250.140:5672//maas_workers: timed out.
Trying again in 32.00 seconds...

[2014-09-26 12:38:11,403: ERROR/Beat] beat: Connection error: timed out. Trying again in 32.0 seconds...

It all started once I started downloading pxe boot images with sudo -E maas-import-pxe-files

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.