MAAS

"Unrecoverable error" in celery-region, service died, after upgrade to beta6

Bug #1346701 reported by Andreas Hasenack on 2014-07-22

This bug report is a duplicate of: Bug #1300507: Rabbit password is reset on every upgrade which forces lockstep cluster restarts. Edit Remove

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	MAAS	New	Undecided	Unassigned

Bug Description

We upgraded to beta6 at around 2014-07-21 14:12:32 CDT (taken from apt's history.log):

Start-Date: 2014-07-21 14:12:32
Upgrade: python-maas-provisioningserver:amd64 (1.6+bzr2535-0ubuntu1~beta5~ppa1, 1.6.0+bzr2539-0ubuntu1~beta6~ppa1), ...
End-Date: 2014-07-21 14:14:28

Didn't see anything odd for a while, but then we noticed that some nodes were up and running while maas had them just "ready", not allocated. Inspecting the service:
<root@courage>:/etc/init# for n in maas*conf; do status ${n%%.conf}; done
maas-cluster-celery stop/waiting
maas-dhcp-server start/running, process 24203
maas-pserv start/running, process 24495
maas-region-celery start/running, process 26720
maas-txlongpoll start/running, process 26536

So one service died.

celery.log shows some errors talking to rabbit at that time:
[2014-07-21 14:14:06,896: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
(backtrace)
error: [Errno 104] Connection reset by peer
[2014-07-21 14:14:06,914: ERROR/MainProcess] consumer: Cannot connect to amqp://maas_workers@10.0.7.8:5672//maas_workers: [Errno 111] Connection refused.
Trying again in 2.00 seconds...

[2014-07-21 14:14:08,924: ERROR/MainProcess] consumer: Cannot connect to amqp://maas_workers@10.0.7.8:5672//maas_workers: [Errno 111] Connection refused.
Trying again in 4.00 seconds...

Until it finally dies:
[2014-07-21 14:14:12,933: ERROR/MainProcess] Unrecoverable error: ValueError('I/O operation on closed epoll fd',)
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/celery/worker/__init__.py", line 212, in start
    self.blueprint.start(self)
  File "/usr/lib/python2.7/dist-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/usr/lib/python2.7/dist-packages/celery/bootsteps.py", line 373, in start
    return self.obj.start()
  File "/usr/lib/python2.7/dist-packages/celery/worker/consumer.py", line 270, in start
    blueprint.start(self)
  File "/usr/lib/python2.7/dist-packages/celery/bootsteps.py", line 123, in start
    step.start(parent)
  File "/usr/lib/python2.7/dist-packages/celery/worker/consumer.py", line 467, in start
    c.connection = c.connect()
  File "/usr/lib/python2.7/dist-packages/celery/worker/consumer.py", line 369, in connect
    conn.transport.register_with_event_loop(conn.connection, self.hub)
  File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqp.py", line 124, in register_with_event_loop
    loop.add_reader(connection.sock, self.on_readable, connection, loop)
  File "/usr/lib/python2.7/dist-packages/kombu/async/hub.py", line 214, in add_reader
    return self.add(fds, callback, READ | ERR, args)
  File "/usr/lib/python2.7/dist-packages/kombu/async/hub.py", line 165, in add
    self.poller.register(fd, flags)
  File "/usr/lib/python2.7/dist-packages/kombu/utils/eventio.py", line 78, in register
    self._epoll.register(fd, events)
ValueError: I/O operation on closed epoll fd
[2014-07-21 14:14:13,945: INFO/MainProcess] beat: Shutting down...

It stayed like that until we started it back manually. And then it got nuts trying to replay all the tasks that were pending:
[2014-07-21 18:12:14,360: WARNING/MainProcess] celery@courage ready.
[2014-07-21 18:12:14,362: INFO/MainProcess] Received task: provisioningserver.tasks.refresh_secrets[067978dd-81b7-44df-bff2-a5005111086d]
[2014-07-21 18:12:14,363: INFO/MainProcess] Received task: provisioningserver.tasks.refresh_secrets[adf830fc-1622-406b-9a09-ef29649d1780]
[2014-07-21 18:12:14,364: INFO/MainProcess] Received task: provisioningserver.tasks.power_off[39317e4d-7e22-4b58-901a-b9cecf6bf351]
[2014-07-21 18:12:14,364: INFO/MainProcess] Received task: provisioningserver.tasks.remove_dhcp_host_map[7bb291db-d521-4814-90db-3edca47e0e77]
[2014-07-21 18:12:14,365: INFO/MainProcess] Received task: provisioningserver.tasks.power_off[812c16cd-d34f-4635-be6b-53ea0084f167]
(...)

There were lots of tasks, power offs, power ups, etc. In the end, the server was back to normal.

Logs attached. I just don't have the terminal output from this package upgrade I'm afraid. /var/log/apt/term.log just has a start and end line, and I don't know who did it.

See original description

Tags:

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2014-07-22:

celery.log Edit (28.2 KiB, text/plain)

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2014-07-22:

celery-region.log Edit (32.3 KiB, text/plain)

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2014-07-22:

maas.log Edit (15.4 KiB, text/plain)

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2014-07-22:

rabbit@courage.log Edit (13.7 KiB, text/plain)

description:	updated
summary:	- Unrecoverable error in celery-region, service died, after upgrade to + "Unrecoverable error" in celery-region, service died, after upgrade to beta6

Revision history for this message

Andreas Hasenack (ahasenack) wrote on 2014-07-22:

maas-b5tob6-upgrade.txt Edit (19.9 KiB, text/plain)

Hah found the upgrade logs. It was driven by landscape via auto-upgrade profiles.

You can see in these logs that maas-cluster-celery was restarted and got a new PID, and the other's PIDs match the status call I pasted at the beginning of this bug report:

maas-dhcp-server start/running, process 24203
(...)
maas-pserv start/running, process 24495
maas-cluster-celery start/running, process 24542

Revision history for this message

Julian Edwards (julian-edwards) wrote on 2014-07-22:

I think this is a manifestation of a bug in the packaging where it restarts rabbit to change its password, and doesn't celery restarts afterwards (a race condition, basically).

bug 1300507

Andreas Hasenack (ahasenack) on 2014-08-07

tags:

added: landscape

Revision history for this message

neel (neel-basu-z) wrote on 2014-09-26:

I am getting a simi;ar Error with Ubuntu 14.04 Server installed from downloaded iso.

[2014-09-26 12:37:35,356: ERROR/Beat] beat: Connection error: timed out. Trying again in 32.0 seconds...
[2014-09-26 12:37:35,357: ERROR/MainProcess] consumer: Cannot connect to amqp://maas_workers@192.168.250.140:5672//maas_workers: timed out.
Trying again in 32.00 seconds...

[2014-09-26 12:38:11,403: ERROR/MainProcess] consumer: Cannot connect to amqp://maas_workers@192.168.250.140:5672//maas_workers: timed out.
Trying again in 32.00 seconds...

[2014-09-26 12:38:11,403: ERROR/Beat] beat: Connection error: timed out. Trying again in 32.0 seconds...

It all started once I started downloading pxe boot images with sudo -E maas-import-pxe-files

Report a bug

This report contains Public information

Everyone can see this information.

Duplicate of bug #1300507 Remove

You are

Subscribing...

Edit bug mail

Other bug subscribers

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.