CeleryNodeMonitor: if a node is not available the master node should commit suicide

Bug #1409775 reported by Michele Simionato on 2015-01-12
Affects: OpenQuake Engine
Importance: Wishlist
Assigned to: Michele Simionato

Bug Description

Instead of stopping, the computation seems to continue on the controller node. Here is a log from a multi-day computation run by Graeme:

[2015-01-12 15:25:48,421 hazard job #520 - PROGRESS MainProcess/11291] ** compute_hazard_curves 66%
[2015-01-12 15:28:51,574 hazard job #519 - WARNING MainProcess/29518] (404, u"NOT_FOUND - no exchange 'reply.celeryd.pidbox' in vhost '/'", (50, 20), 'Channel.queue_bind')
[2015-01-12 15:28:51,575 hazard job #519 - CRITICAL MainProcess/29518] Cluster nodes not accessible: [u'mercury', u'dylan', u'marley', u'cobain']
[2015-01-12 15:28:52,145 hazard job #519 - WARNING MainProcess/29518] Revoking 16 tasks
Traceback (most recent call last):
  File "/usr/local/openquake/oq-engine/bin/oq-engine", line 574, in <module>
    main()
  File "/usr/local/openquake/oq-engine/bin/oq-engine", line 487, in main
    log_file, args.exports)
  File "/usr/local/openquake/oq-engine/openquake/engine/engine.py", line 361, in run_job
    run_calc(job, log_level, log_file, exports)
  File "/usr/local/openquake/oq-engine/openquake/engine/engine.py", line 180, in run_calc
    _do_run_calc(calculator, exports)
  File "/usr/local/openquake/oq-engine/openquake/engine/engine.py", line 217, in _do_run_calc
    calc.execute()
  File "/usr/local/openquake/oq-engine/openquake/engine/performance.py", line 45, in newmeth
    return method(self, *args)
  File "/usr/local/openquake/oq-engine/openquake/engine/calculators/hazard/general.py", line 154, in execute
    weight=attrgetter('weight'), key=attrgetter('trt_model_id'))
  File "/usr/local/openquake/oq-engine/openquake/engine/utils/tasks.py", line 131, in apply_reduce
    return starmap(task, task_args, logs.LOG.progress, name).reduce(agg, acc)
  File "/usr/local/openquake/oq-risklib/openquake/commonlib/parallel.py", line 288, in reduce
    agg_result = self.aggregate_result_set(agg_and_percent, acc)
  File "/usr/local/openquake/oq-engine/openquake/engine/utils/tasks.py", line 85, in aggregate_result_set
    for task_id, result_dict in rset.iter_native():
  File "/usr/lib/python2.7/dist-packages/celery/backends/amqp.py", line 216, in get_many
    r = self.drain_events(conn, consumer, timeout)
  File "/usr/lib/python2.7/dist-packages/celery/backends/amqp.py", line 186, in drain_events
    wait(timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/kombu/connection.py", line 175, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 238, in drain_events
    return connection.drain_events(**kwargs)
  File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 57, in drain_events
    return self.wait_multi(self.channels.values(), timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 63, in wait_multi
    chanmap.keys(), allowed_methods, timeout=timeout)
  File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 120, in _wait_multiple
    channel, method_sig, args, content = read_timeout(timeout)
  File "/usr/lib/python2.7/dist-packages/kombu/transport/pyamqplib.py", line 88, in read_timeout
    return self.method_reader.read_method()
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/method_framing.py", line 218, in read_method
    self._next_method()
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/method_framing.py", line 133, in _next_method
    frame_type, channel, payload = self.source.read_frame()
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/transport.py", line 149, in read_frame
    frame_type, channel, size = unpack('>BHI', self._read(7))
  File "/usr/lib/python2.7/dist-packages/amqplib/client_0_8/transport.py", line 261, in _read
    s = self.sock.recv(65536)
  File "/usr/local/openquake/oq-engine/openquake/engine/celery_node_monitor.py", line 56, in handle_signal
    raise cls(msg)
openquake.engine.celery_node_monitor.MasterKilled: The openquake master process was killed by the CeleryNodeMonitor because some node failed
[2015-01-12 15:34:11,801 hazard job #520 - PROGRESS MainProcess/11291] ** compute_hazard_curves 67%
[2015-01-12 15:42:22,495 hazard job #520 - PROGRESS MainProcess/11291] ** compute_hazard_curves 68%
[2015-01-12 15:47:19,776 hazard job #520 - PROGRESS MainProcess/11291] ** compute_hazard_curves 69%
[2015-01-12 15:51:50,274 hazard job #520 - PROGRESS MainProcess/11291] ** compute_hazard_curves 70%
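
The last frame of the traceback shows a signal handler (handle_signal) raising MasterKilled while the main thread is blocked in a socket read. The general pattern can be sketched as below. This is only an illustration under stated assumptions, not the actual celery_node_monitor.py code: the ping callable, the monitor and run_with_monitor helpers, the choice of SIGUSR1, and the polling interval are all hypothetical, and the sketch is Unix-only.

```python
import os
import signal
import threading
import time


class MasterKilled(Exception):
    """Raised in the main thread when the monitor reports a failed node."""


def handle_signal(signum, frame):
    # Python runs signal handlers in the main thread, so raising here
    # interrupts whatever the master is currently blocked on.
    raise MasterKilled("master process killed because some node failed")


def monitor(ping, interval, pid, stop):
    # Hypothetical polling loop: ping() returns the list of unreachable nodes.
    while not stop.is_set():
        if ping():
            os.kill(pid, signal.SIGUSR1)  # wake up the master process
            return
        stop.wait(interval)


def run_with_monitor(job, ping, interval=0.05):
    # Install the handler, start the monitor thread, then run the job.
    old_handler = signal.signal(signal.SIGUSR1, handle_signal)
    stop = threading.Event()
    thread = threading.Thread(
        target=monitor, args=(ping, interval, os.getpid(), stop), daemon=True)
    thread.start()
    try:
        return job()
    finally:
        stop.set()
        thread.join()
        signal.signal(signal.SIGUSR1, old_handler)
```

Note that the signal is delivered to one process id only. In this sketch, a second job running in a different process would not be interrupted unless it ran its own monitor.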

Changed in oq-engine:
status: New → Won't Fix
assignee: nobody → Michele Simionato (michele-simionato)
importance: Undecided → Wishlist