MAAS failed to respond to POST'd deploy request but still deployed node

Bug #1718016 reported by Jason Hobbs
This bug affects 1 person
Affects     Status    Importance  Assigned to  Milestone
MAAS        Invalid   Critical    Unassigned
MAAS 2.2    Invalid   Undecided   Unassigned

Bug Description

maas version: 2.2.2-6099-g8751f91-0ubuntu1~16.04.1
juju version: 2.2.4-0ubuntu1~16.04.1~juju1

=Description=
When using juju to deploy a bundle to MAAS, juju received an EOF/no response from the MAAS server when POST'ing to MAAS to deploy the node:

logsink.log:24cbf3cb-bad6-4d28-8377-4ffe01d2a25c: machine-0 2017-09-18 17:13:58 WARNING juju.provisioner provisioner_task.go:747 failed to start instance (unexpected: Post http://10.245.208.33/MAAS/api/2.0/machines/ab6sax/?op=deploy: EOF), retrying in 10s (10 more attempts)

MAAS did not log this POST request in regiond.log. However, the node started its deployment process and eventually reached the deployed state. Since juju got no response from MAAS when trying to deploy the node, juju considered the node to be in an error state and the deployment of the bundle failed.

Here's the summarized event log for the node: http://paste.ubuntu.com/25567525/

MAAS and juju logs will be attached. This is an HA setup with 3 region controllers and 2 rack controllers. There are quite a few "could not serialize access due to concurrent update" errors in the logs.

=Expected behavior=
MAAS should always return a response to the client.

=Reproduction steps=
I don't have a good way to reproduce this reliably. I saw similar behavior in bug 1717301 - it seems like they could have the same root cause.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
tags: added: foundations-engine
Changed in maas:
status: New → Incomplete
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Why is this marked incomplete?

Changed in maas:
status: Incomplete → New
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Hi Jason,

We don't have enough information to determine what could be causing your issues, but I've seen three different failures in your logs, on different machines:

1. http://pastebin.ubuntu.com/25567686/
2. http://pastebin.ubuntu.com/25567702/
3. http://pastebin.ubuntu.com/25567705/

1 and 2 do seem critical, but 3 does not. To limit the number of variables, please install the rack controller on the machine showing error 3. Installing the rack controller is harmless, as it won't be configured. It could be that the request is failing on the region controller where 3 is seen, or it could be failing where 1 and 2 are seen.

Also note that the deploy request seems to be happening before either 1 or 2 is seen, which leads me to believe that juju could be contacting the region showing 3, and that the logging there is masked because of the error in 3.

That said, we need some further information:

1. PostgreSQL logs from the exact time you are seeing the failures. (FWIW, you may need
2. The number of connections allowed by PostgreSQL (or even your PostgreSQL config).
3. The number of regiond processes per machine while you are seeing the issues.

Also, if you could please limit the logging to the exact time you are seeing the issues, that would be helpful, given that your logging is quite extensive (especially the juju logs).

Changed in maas:
milestone: none → 2.3.0
Revision history for this message
Andres Rodriguez (andreserl) wrote :

Further info:

Juju:

2017-09-18 17:13:58 WARNING juju.provisioner provisioner_task.go:747 failed to start instance (unexpected: Post http://10.245.208.33/MAAS/api/2.0/machines/ab6sax/?op=deploy: EOF), retrying in 10s (10 more attempts)

MAAS 1:

2017-09-18 17:13:05 maasserver: [error] ################################ Exception: No available machine matches constraints: [('interfaces', ['0:space=4;amqp:space=4;data:space=4;db:space=4;hosted:space=4;registration:space=4;website:space=4']), ('tags', ['landscape']), ('agent_name', ['3eff5a1b-4c29-4365-85b2-e6d0a878b647']), ('zone', ['default'])] (resolved to "interfaces=0:space=4;amqp:space=4;data:space=4;db:space=4;hosted:space=4;registration:space=4;website:space=4 tags=landscape zone=default") ################################
2017-09-18 17:13:06 maasserver: [error] Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/django/core/handlers/base.py", line 132, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/utils/views.py", line 181, in view_atomic_with_post_commit_savepoint
    return view_atomic(*args, **kwargs)
  File "/usr/lib/python3.5/contextlib.py", line 30, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 59, in __call__
    response = upcall(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/django/views/decorators/vary.py", line 21, in inner_func
    response = func(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/piston3/resource.py", line 190, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/usr/lib/python3/dist-packages/piston3/resource.py", line 188, in __call__
    result = meth(request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/api/support.py", line 298, in dispatch
    return function(self, request, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/maasserver/api/machines.py", line 1485, in allocate
    raise NodesNotAvailable(message)
maasserver.exceptions.NodesNotAvailable: No available machine matches constraints: [('interfaces', ['0:space=4;amqp:space=4;data:space=4;db:space=4;hosted:space=4;registration:space=4;website:space=4']), ('tags', ['landscape']), ('agent_name', ['3eff5a1b-4c29-4365-85b2-e6d0a878b647']), ('zone', ['default'])] (resolved to "interfaces=0:space=4;amqp:space=4;data:space=4;db:space=4;hosted:space=4;registration:space=4;website:space=4 tags=landscape zone=default")

MAAS 2 (this error is repeated constantly around the time of the failure):

2017-09-18 17:13:04 provisioningserver.utils.services: [info] observe-arp[eno2]: FileNotFoundError: [Errno 2] No such file or directory: b'/bin/maas-rack'

MAAS 3: No error around that time.

So judging by this info, two things to try out:

1. Ensure the issue is not related to the constraints error above. That error comes from allocation: the machine did not match the requested constraints, which is why that allocation request failed.

2. Investigate the missing maas-rack command.

Changed in maas:
importance: Undecided → Critical
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Andres also suggested we try source-based balancing rather than roundrobin. That should send all requests from a given client to the same server, which makes debugging much easier. Right now, requests from the same client can go to any server, which makes reading the logs very hard.

https://www.haproxy.com/blog/client-ip-persistence-or-source-ip-hash-load-balancing/
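
For reference, a minimal sketch of what that haproxy change might look like (the backend and server names, the .31/.32 addresses, and the exact layout are illustrative; 10.245.208.30 is one of the region controllers mentioned later in this thread, and 5240 is the regiond API port):

    backend maas-region
        balance source        # hash the client source IP instead of roundrobin
        hash-type consistent  # keep client-to-server mappings stable if a regiond drops out
        server region-1 10.245.208.30:5240 check
        server region-2 10.245.208.31:5240 check
        server region-3 10.245.208.32:5240 check

That way all of juju's requests should land in a single regiond.log instead of being spread across three.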

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

We hit this again today:

2017-09-19 18:33:21 WARNING juju.provisioner provisioner_task.go:747 failed to start instance (unexpected: Post http://10.245.208.33/MAAS/api/2.0/machines/egpntr/?op=deploy: EOF), retrying in 10s (10 more attempts)

We have source-based balancing on now instead of roundrobin; I'll grab the MAAS logs and attach them.

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

In the latest logs, it looks like the juju requests are all going to the 10.245.208.30 host; that's nice!

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

postgresql logs - nothing of value: http://paste.ubuntu.com/25575140/

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

This happens with calls other than node deploy:

http://paste.ubuntu.com/25588250/

Changed in maas:
milestone: 2.3.0 → 2.3.0beta2
tags: added: internal
Changed in maas:
status: New → Triaged
Changed in maas:
milestone: 2.3.0beta2 → 2.3.0beta3
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

I wonder if this is being caused by hacluster timing out on requests that are taking a long time to respond. The behavior where we see a request just die is what happens when hacluster times out.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

I would bet that is the case. Have you tried increasing the timeout on haproxy to see if that helps this issue?

Revision history for this message
Jason Hobbs (jason-hobbs) wrote : Re: [Bug 1718016] Re: MAAS failed to respond to POST'd deploy request but still deployed node

Doh, I meant haproxy when I said hacluster :) No, we haven't tried that. It's set to 30 seconds right now. We can enable more logging in haproxy and keep trying to reproduce.
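
For what it's worth, a sketch of the kind of haproxy logging that could be enabled (the section layout and syslog target below are illustrative, not taken from this deployment):

    global
        log /dev/log local0

    defaults
        log global
        mode http
        option httplog    # one log line per request, including HTTP status and per-phase timers

With httplog enabled, requests that die at the 30-second mark should show up with a server-timeout termination state, which would confirm or rule out the haproxy-timeout theory.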

Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
Changed in maas:
status: Triaged → Incomplete
assignee: Blake Rouse (blake-rouse) → nobody
Revision history for this message
Jason Hobbs (jason-hobbs) wrote :

Testing with the 90s timeout, we're not seeing EOFs anymore, so it looks like that was the culprit.
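
For anyone landing here with the same symptom, a sketch of the haproxy settings involved (only the 30s and 90s values come from this thread; the surrounding defaults section is illustrative):

    defaults
        timeout connect 5s
        timeout client  90s    # was 30s while the EOFs were being seen
        timeout server  90s    # typically the timeout that fires when a backend request is slow

Keeping the server timeout comfortably above MAAS's slowest API calls avoids haproxy closing the connection while regiond is still processing the request, which is what juju was seeing as EOF.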

Revision history for this message
Jason Hobbs (jason-hobbs) wrote :
Changed in maas:
status: Incomplete → Invalid
Changed in maas:
milestone: 2.3.0beta3 → none