If rndc fails, you get an Internal Server Error page

Bug #1386488 reported by Julian Edwards
Affects: MAAS
Status: Fix Released
Importance: Critical
Assigned to: Julian Edwards
Milestone: 1.7.1

Bug Description

I inadvertently nobbled rndc so that it won't work, and when trying to start a node I get:

ERROR 2014-10-28 12:02:19,448 django.request Internal Server Error: /MAAS/nodes/node-c46c458c-d047-11e3-af93-e4115b13819f/view/
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_
response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/base.py", line 69, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/views/nodes.py", line 622, in dispatc
h
    return super(NodeView, self).dispatch(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/base.py", line 87, in dispatch
    return handler(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/edit.py", line 228, in post
    return super(BaseUpdateView, self).post(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/edit.py", line 171, in post
    return self.form_valid(form)
  File "/usr/lib/python2.7/dist-packages/django/views/generic/edit.py", line 147, in form_valid
    self.object = form.save()
  File "/usr/lib/python2.7/dist-packages/maasserver/forms.py", line 1003, in save
    message = action.execute(allow_redirect=allow_redirect)
  File "/usr/lib/python2.7/dist-packages/maasserver/node_action.py", line 205, in execute
    self.node.start_commissioning(self.user)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 976, in start_commissioning
    [self.system_id], user, user_data=commissioning_user_data)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 436, in start_nodes
    change_dns_zones({node.nodegroup for node in nodes})
  File "/usr/lib/python2.7/dist-packages/maasserver/utils/__init__.py", line 193, in call_with_lock
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/dns/config.py", line 85, in change_dns_zones
    zones=[zone], callback=zone_reload_subtask)
  File "/usr/lib/python2.7/dist-packages/celery/app/task.py", line 453, in delay
    return self.apply_async(args, kwargs)
  File "/usr/lib/python2.7/dist-packages/celery/app/task.py", line 547, in apply_async
    link=link, link_error=link_error, **options)
  File "/usr/lib/python2.7/dist-packages/celery/app/task.py", line 735, in apply
    request=request, propagate=throw)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 354, in eager_trace_task
    uuid, args, kwargs, request)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 253, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/provisioningserver/logger/utils.py", line 42, in wrapper
    func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/provisioningserver/tasks.py", line 54, in wrapper
    func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/provisioningserver/tasks.py", line 155, in write_dns_zone_config
    callback.delay()
  File "/usr/lib/python2.7/dist-packages/celery/canvas.py", line 168, in delay
    return self.apply_async(partial_args, partial_kwargs)
  File "/usr/lib/python2.7/dist-packages/celery/canvas.py", line 242, in apply_async
    return _apply(args, kwargs, **options)
  File "/usr/lib/python2.7/dist-packages/celery/app/task.py", line 547, in apply_async
    link=link, link_error=link_error, **options)
  File "/usr/lib/python2.7/dist-packages/celery/app/task.py", line 735, in apply
    request=request, propagate=throw)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 354, in eager_trace_task
    uuid, args, kwargs, request)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 253, in trace_task
    I, R, state, retval = on_error(task_request, exc, uuid)
  File "/usr/lib/python2.7/dist-packages/celery/app/trace.py", line 240, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/provisioningserver/logger/utils.py", line 42, in wrapper
    func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/provisioningserver/tasks.py", line 54, in wrapper
    func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/provisioningserver/tasks.py", line 85, in rndc_command
    execute_rndc_command(arguments)
  File "/usr/lib/python2.7/dist-packages/provisioningserver/dns/config.py", line 170, in execute_rndc_command
    call_and_check(rndc_cmd)
  File "/usr/lib/python2.7/dist-packages/provisioningserver/utils/shell.py", line 125, in call_and_check
    raise ExternalProcessError(process.returncode, command, output=stderr)
ExternalProcessError: Command `rndc -c /etc/bind/maas/rndc.conf.maas reload maas` returned non-zero exit status 1:
rndc: 'reload' failed: dynamic zone
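
For context, the last two frames show provisioningserver's call_and_check() wrapper converting rndc's non-zero exit into an ExternalProcessError. A minimal sketch of that step, with the import path assumed from the traceback:

from provisioningserver.utils.shell import call_and_check

rndc_cmd = ["rndc", "-c", "/etc/bind/maas/rndc.conf.maas", "reload", "maas"]

# call_and_check() runs the command and raises ExternalProcessError when it
# exits non-zero (here: "rndc: 'reload' failed: dynamic zone").  Because
# celery now runs in eager mode, that exception travels straight back into
# the Django request and surfaces as the Internal Server Error above.
call_and_check(rndc_cmd)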

Changed in maas:
status: New → Triaged
importance: Undecided → Critical
milestone: none → 1.7.0
Christian Reis (kiko) wrote :

Can you give us some more context as to why this is critical?

How did this happen?

Will an end-user run into this scenario during normal operation?

Is it a regression?

If you fix rndc, does MAAS go back to working?

Julian Edwards (julian-edwards) wrote :

It's critical because it's a crash, and it could happen at any time that rndc fails *for whatever reason*. In this case I literally just tried to start a commissioning run. (I'm not sure why we're even calling rndc for that, but that's a separate bug)

As soon as I fixed rndc, yes it works fine again.

I don't know if this is a regression; the error may have just never been caught properly before.

I think it's easy enough to fix (famous last works), so I'll sort it out shortly.

Julian Edwards (julian-edwards) wrote :

words*

Curse my fat fingers.

Julian Edwards (julian-edwards) wrote :

So my analysis so far shows that:
 - provisioningserver.dns.config.execute_rndc_command() uses provisioningserver.utils.shell.call_and_check(), which raises ExternalProcessError if rndc fails.
 - the rndc_command() task catches CalledProcessError, not ExternalProcessError
 - rndc_command() logs the CalledProcessError, and then simply re-raises it

So two things need to be fixed:
 1. rndc_command needs to catch ExternalProcessError
 2. Because we now use celery in eager mode (so we don't need the daemon), this exception propagates up to the webapp and crashes the request. The exception therefore needs to be caught at all the call sites for rndc_command() and dealt with properly, such as by rolling back the transaction and returning an error to clients (see the sketch below).
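
For illustration only, a minimal sketch of what point 2 could look like at one call site, assuming the import paths shown in the traceback; the wrapper name and logger are hypothetical and the real fix may differ:

import logging

from maasserver.dns.config import change_dns_zones
from provisioningserver.utils.shell import ExternalProcessError

maaslog = logging.getLogger("maas.dns")


def reload_dns_zones_or_log(nodegroups):
    """Hypothetical call-site wrapper: tolerate an rndc failure instead of
    letting ExternalProcessError crash the webapp thread."""
    try:
        change_dns_zones(nodegroups)
    except ExternalProcessError as error:
        # Log and report the failure; the caller can roll back and return
        # an error to the client rather than a 500 page.
        maaslog.error("Failed to reload DNS zones: %s", error)
        return False
    return True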

I'm calling this a regression caused by the removal of celery. In the past it was buggy too as it would have blindly continued even though the celery job failed, but it didn't crash the appserver thread.

Julian Edwards (julian-edwards) wrote :

And *mock surprise face* there is no unit test for rndc_command.

Julian Edwards (julian-edwards) wrote :

Oh my bad, there is... oops.

Julian Edwards (julian-edwards) wrote :

Amendment to the original analysis: ExternalProcessError inherits from CalledProcessError, so it does get caught. The remaining fix is to get the call sites to catch it.
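
A quick way to check the inheritance claim (the ExternalProcessError import path is assumed from the traceback):

from subprocess import CalledProcessError

from provisioningserver.utils.shell import ExternalProcessError

# ExternalProcessError subclasses CalledProcessError, so the existing
# `except CalledProcessError:` handler in the rndc_command task already
# catches it; the remaining gap is at the callers of rndc_command().
assert issubclass(ExternalProcessError, CalledProcessError)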

Christian Reis (kiko) wrote :

An unanswered question: will the end-user run into this in normal operation? Why would rndc be broken?

Punting to next pre-emptively, but could bring back if there's a clear scenario.

Changed in maas:
milestone: 1.7.0 → next
Julian Edwards (julian-edwards) wrote : Re: [Bug 1386488] Re: If rndc fails, you get an Internal Server Error page

On Wednesday 29 Oct 2014 11:20:28 you wrote:
> An unanswered question: will the end-user run into this in normal
> operation? Why would rndc be broken?

The only situation I can imagine it being a problem is if there's a misconfiguration somewhere, the Bind package is missing, or some other weird-ass problem going on.

I think "next" is probably OK here; we've lived with this since the MAAS epoch.

Changed in maas:
status: Triaged → Fix Committed
Christian Reis (kiko)
Changed in maas:
milestone: next → 1.7.1
Changed in maas:
assignee: nobody → Julian Edwards (julian-edwards)
Changed in maas:
status: Fix Committed → Fix Released