MAAS failed to respond to POST'd deploy request but still deployed node

Bug #1718016 reported by Jason Hobbs
This bug affects 1 person
Affects     Status    Importance  Assigned to  Milestone
MAAS        Invalid   Critical    Unassigned
MAAS 2.2    Invalid   Undecided   Unassigned

Bug Description

maas version: 2.2.2-6099-g8751f91-0ubuntu1~16.04.1
juju version: 2.2.4-0ubuntu1~16.04.1~juju1

=Description=
When using juju to deploy a bundle to MAAS, juju received an EOF/no response from the MAAS server when POST'ing to MAAS to deploy the node:

logsink.log:24cbf3cb-bad6-4d28-8377-4ffe01d2a25c: machine-0 2017-09-18 17:13:58 WARNING juju.provisioner provisioner_task.go:747 failed to start instance (unexpected: Post http://10.245.208.33/MAAS/api/2.0/machines/ab6sax/?op=deploy: EOF), retrying in 10s (10 more attempts)

MAAS did not log this POST request in regiond.log. However, the node started its deployment process and eventually reached the deployed state. Since juju got no response from MAAS when trying to deploy the node, juju considered the node to be in an error state and the deployment of the bundle failed.

Here's the summarized event log for the node: http://paste.ubuntu.com/25567525/

MAAS and juju logs will be attached. This is an HA setup with 3 region controllers and 2 rack controllers. There are quite a few "could not serialize access due to concurrent update" errors in the logs.

=Expected behavior=
MAAS should always return a response to the client.

=Reproduction steps=
I don't have a good way to reproduce this reliably. I saw similar behavior in bug 1717301 - it seems like they could have the same root cause.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
tags: added: foundations-engine
Changed in maas:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Why is this marked incomplete?

Changed in maas:
status: Incomplete → New
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Jason,

We don't have enough information to determine what could be causing your issues, but I've seen three different failures in your logs, on different machines:

1. http://pastebin.ubuntu.com/25567686/
2. http://pastebin.ubuntu.com/25567702/
3. http://pastebin.ubuntu.com/25567705/

1 and 2 do seem critical, but 3 does not. To limit the number of variables, please install the rack controller on the machine showing error 3. Installing the rack controller is harmless, as it won't be configured. It could be that the request is failing on the region controller where 3 is seen, or it could be failing where 1 and 2 are seen.

Also note that the deploy request seems to be happening before either 1 or 2 is seen, which leads me to believe that juju could be contacting the region showing 3, and that the logging there is masked because of the error in 3.

That said, we need some further information:

1. PostgreSQL logs from the exact time you are seeing the failures. (FWIW, you may need
2. The number of connections allowed by PostgreSQL (or even your PostgreSQL config).
3. The number of regiond processes per machine while you are seeing the issues.

Also, if you could please limit the logging to the exact time you are seeing the issues, that would be helpful, given that your logging is quite extensive (especially the juju logs).

Changed in maas:
milestone: none → 2.3.0
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Further info:

Juju:

2017-09-18 17:13:58 WARNING juju.provisioner provisioner_task.go:747 failed to start instance (unexpected: Post http://10.245.208.33/MAAS/api/2.0/machines/ab6sax/?op=deploy: EOF), retrying in 10s (10 more attempts)

MAAS 1:

2017-09-18 17:13:05 maasserver: [error] ################################ Exception: No available machine matches constraints: [('interfaces', ['0:space=4;amqp:space=4;data:space=4;db:space=4;hosted:space=4;registration:space=4;website:space=4']), ('tags', ['landscape']), ('agent_name', ['3eff5a1b-4c29-4365-85b2-e6d0a878b647']), ('zone', ['default'])] (resolved to "interfaces=0:space=4;amqp:space=4;data:space=4;db:space=4;hosted:space=4;registration:space=4;website:space=4 tags=landscape zone=default") ################################
2017-09-18 17:13:06 maasserver: [error] Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 132, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 181, in view_atomic_with_post_commit_savepoint
    return view_atomic(*args, **kwargs)
  File "/usr/lib/python3.5/contextlib.py", line 30, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 59, in __call__
    response = upcall(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 21, in inner_func
    response = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/piston3/resource.py", line 190, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/usr/lib/python3/dist-packages/piston3/resource.py", line 188, in __call__
    result = meth(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 298, in dispatch
    return function(self, request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/api/machines.py", line 1485, in allocate
    raise NodesNotAvailable(message)
maasserver.exceptions.NodesNotAvailable: No available machine matches constraints: [('interfaces', ['0:space=4;amqp:space=4;data:space=4;db:space=4;hosted:space=4;registration:space=4;website:space=4']), ('tags', ['landscape']), ('agent_name', ['3eff5a1b-4c29-4365-85b2-e6d0a878b647']), ('zone', ['default'])] (resolved to "interfaces=0:space=4;amqp:space=4;data:space=4;db:space=4;hosted:space=4;registration:space=4;website:space=4 tags=landscape zone=default")

MAAS 2 (this error is repeated constantly around the time of the failure):

2017-09-18 17:13:04 provisioningserver.utils.services: [info] observe-arp[eno2]: FileNotFoundError: [Errno 2] No such file or directory: b'/bin/maas-rack'

MAAS 3: No error around that time.

So judging by this info, two things to try out:

1. Ensure the issue is not related to the constraints error above. That error comes from allocation: the machine did not match the requested constraints, which is why that allocation request failed.

2. Investigate the missing maas-rack command.

Changed in maas:
importance: Undecided → Critical
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Andres also suggested we try source-based balancing rather than roundrobin. That should send all requests from a given client to the same server, which makes debugging much easier. Right now, requests from the same client can go to any server, which makes reading the logs very hard.

https://www.haproxy.com/blog/client-ip-persistence-or-source-ip-hash-load-balancing/
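
For reference, a minimal sketch of what that haproxy change might look like (the backend and server names, the .31/.32 addresses, and the exact layout are illustrative; 10.245.208.30 is one of the region controllers mentioned later in this thread, and 5240 is the regiond API port):

    backend maas-region
        balance source        # hash the client source IP instead of roundrobin
        hash-type consistent  # keep client-to-server mappings stable if a regiond drops out
        server region-1 10.245.208.30:5240 check
        server region-2 10.245.208.31:5240 check
        server region-3 10.245.208.32:5240 check

That way all of juju's requests should land in a single regiond.log instead of being spread across three.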

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this again today:

2017-09-19 18:33:21 WARNING juju.provisioner provisioner_task.go:747 failed to start instance (unexpected: Post http://10.245.208.33/MAAS/api/2.0/machines/egpntr/?op=deploy: EOF), retrying in 10s (10 more attempts)

We have source-based balancing on now instead of roundrobin; I'll grab the MAAS logs and attach them.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

In the latest logs, it looks like the juju requests are all going to the 10.245.208.30 host; that's nice!

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

postgresql logs - nothing of value: http://paste.ubuntu.com/25575140/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This happens with calls other than node deploy:

http://paste.ubuntu.com/25588250/

Changed in maas:
milestone: 2.3.0 → 2.3.0beta2
tags: added: internal
Changed in maas:
status: New → Triaged
Changed in maas:
milestone: 2.3.0beta2 → 2.3.0beta3
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I wonder if this is being caused by hacluster timing out on requests that are taking a long time to respond. The behavior where we see a request just die is what happens when hacluster times out.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

I would bet that is the case. Have you tried increasing the timeout on haproxy to see if that helps this issue?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1718016] Re: MAAS failed to respond to POST'd deploy request but still deployed node

Doh, I meant haproxy when I said hacluster :) No, we haven't tried that. It's set to 30 seconds right now. We can enable more logging in haproxy and keep trying to reproduce.
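
For what it's worth, a sketch of the kind of haproxy logging that could be enabled (the section layout and syslog target below are illustrative, not taken from this deployment):

    global
        log /dev/log local0

    defaults
        log global
        mode http
        option httplog    # one log line per request, including HTTP status and per-phase timers

With httplog enabled, requests that die at the 30-second mark should show up with a server-timeout termination state, which would confirm or rule out the haproxy-timeout theory.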

Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
Changed in maas:
status: Triaged → Incomplete
assignee: Blake Rouse (blake-rouse) → nobody
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Testing with the 90s timeout, we're not seeing EOFs anymore, so it looks like that was the culprit.
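
For anyone landing here with the same symptom, a sketch of the haproxy settings involved (only the 30s and 90s values come from this thread; the surrounding defaults section is illustrative):

    defaults
        timeout connect 5s
        timeout client  90s    # was 30s while the EOFs were being seen
        timeout server  90s    # typically the timeout that fires when a backend request is slow

Keeping the server timeout comfortably above MAAS's slowest API calls avoids haproxy closing the connection while regiond is still processing the request, which is what juju was seeing as EOF.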

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Changed in maas:
status: Incomplete → Invalid
Changed in maas:
milestone: 2.3.0beta3 → none