1.8b4: 504 GATEWAY TIMEOUT (Unexpected exception: TimeoutError)

Bug #1449402 reported by Larry Michel
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MAAS
Invalid
Critical
Unassigned

Bug Description

We hit this issue with juju bootstrap:

From console.txt:
Bootstrapping environment "ci-oil-slave10"
Starting new instance for initial state server
Launching instance
WARNING no architecture was specified, acquiring an arbitrary node
WARNING no architecture was specified, acquiring an arbitrary node
WARNING no architecture was specified, acquiring an arbitrary node
Bootstrap failed, destroying environment
ERROR failed to bootstrap environment: cannot start bootstrap instance: gomaasapi: got error back from server: 504 GATEWAY TIMEOUT (Unexpected exception: TimeoutError. See /var/log/maas/regiond.log on the region server for more information.)
2015-04-27 14:33:50,651 [ERROR] oil_ci.juju.client: Calling "juju bootstrap" failed!
2015-04-27 14:33:50,651 [ERROR] oil_ci.cli: Deployment failed:
+ rc=1
+ echo 'Deployment returned: 1'
Deployment returned: 1

From apache log:

2015-04-27 14:26:49 [-] 127.0.0.1 - - [27/Apr/2015:14:26:48 +0000] "POST /MAAS/api/1.0/nodes/node-95167f62-12b6-11e4-9a15-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:27:19 [-] 127.0.0.1 - - [27/Apr/2015:14:27:18 +0000] "POST /MAAS/api/1.0/nodes/node-5959d5f6-6071-11e4-b050-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:27:49 [-] 127.0.0.1 - - [27/Apr/2015:14:27:48 +0000] "POST /MAAS/api/1.0/nodes/node-cb79abaa-3f58-11e4-b38a-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:28:19 [-] 127.0.0.1 - - [27/Apr/2015:14:28:18 +0000] "POST /MAAS/api/1.0/nodes/node-95167f62-12b6-11e4-9a15-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:28:49 [-] 127.0.0.1 - - [27/Apr/2015:14:28:49 +0000] "POST /MAAS/api/1.0/nodes/node-9d4e6a5a-c4cd-11e3-824b-00163efc5068/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:29:19 [-] 127.0.0.1 - - [27/Apr/2015:14:29:18 +0000] "POST /MAAS/api/1.0/nodes/node-5959d5f6-6071-11e4-b050-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:29:49 [-] 127.0.0.1 - - [27/Apr/2015:14:29:48 +0000] "POST /MAAS/api/1.0/nodes/node-95167f62-12b6-11e4-9a15-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:30:19 [-] 127.0.0.1 - - [27/Apr/2015:14:30:19 +0000] "POST /MAAS/api/1.0/nodes/node-cb79abaa-3f58-11e4-b38a-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:30:49 [-] 127.0.0.1 - - [27/Apr/2015:14:30:49 +0000] "POST /MAAS/api/1.0/nodes/node-5959d5f6-6071-11e4-b050-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:31:19 [-] 127.0.0.1 - - [27/Apr/2015:14:31:19 +0000] "POST /MAAS/api/1.0/nodes/node-95167f62-12b6-11e4-9a15-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:31:49 [-] 127.0.0.1 - - [27/Apr/2015:14:31:49 +0000] "POST /MAAS/api/1.0/nodes/node-cb79abaa-3f58-11e4-b38a-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:32:19 [-] 127.0.0.1 - - [27/Apr/2015:14:32:19 +0000] "POST /MAAS/api/1.0/nodes/node-5959d5f6-6071-11e4-b050-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:32:49 [-] 127.0.0.1 - - [27/Apr/2015:14:32:49 +0000] "POST /MAAS/api/1.0/nodes/node-95167f62-12b6-11e4-9a15-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:33:19 [-] 127.0.0.1 - - [27/Apr/2015:14:33:18 +0000] "POST /MAAS/api/1.0/nodes/node-cb79abaa-3f58-11e4-b38a-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"
2015-04-27 14:33:49 [-] 127.0.0.1 - - [27/Apr/2015:14:33:48 +0000] "POST /MAAS/api/1.0/nodes/node-5959d5f6-6071-11e4-b050-00163eca07b6/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"

Tags: oil
Larry Michel (lmic)
summary: - 1.8b4: 504 GATEWAY TIMEOUT (Unexpected exception: TimeoutError
+ 1.8b4: 504 GATEWAY TIMEOUT (Unexpected exception: TimeoutError)
Revision history for this message
Larry Michel (lmic) wrote :
Download full text (3.2 KiB)

Seeing this in 1.8rc3:

This is from regiond.log

015-06-19 09:43:37 [-] 127.0.0.1 - - [19/Jun/2015:09:43:37 +0000] "GET /MAAS/api/1.0/nodegroups/037c960b-5b9f-4701-8366-eeda2c09d14e/boot-images/ HTTP/1.1" 200 1407 "-" "Go 1.1 package http"
2015-06-19 09:43:38 [root] ERROR:
Traceback (most recent call last):
  File "/usr/lib/python2.7/dist-packages/django/core/handlers/base.py", line 112, in get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/support.py", line 52, in __call__
    response = upcall(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/views/decorators/vary.py", line 19, in inner_func
    response = func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/piston/resource.py", line 167, in __call__
    result = self.error_handler(e, request, meth, em_format)
  File "/usr/lib/python2.7/dist-packages/piston/resource.py", line 165, in __call__
    result = meth(request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/support.py", line 200, in dispatch
    return function(self, request, *args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/api/nodes.py", line 422, in start
    node.start(request.user, user_data=user_data)
  File "/usr/lib/python2.7/dist-packages/maasserver/utils/orm.py", line 399, in call_within_transaction
    return func_within_txn(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/django/db/transaction.py", line 339, in inner
    return func(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 1957, in start
    claims = self.claim_static_ip_addresses()
  File "/usr/lib/python2.7/dist-packages/maasserver/models/node.py", line 1770, in claim_static_ip_addresses
    alloc_type=alloc_type, requested_address=requested_address)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/macaddress.py", line 358, in claim_static_ips
    interface, alloc_type, requested_address, user=user)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/macaddress.py", line 247, in _allocate_static_address
    user=user)
  File "/usr/lib/python2.7/dist-packages/maasserver/models/staticipaddress.py", line 179, in allocate_new
    alloc_type, user, hostname=hostname).wait(30)
  File "/usr/lib/python2.7/dist-packages/crochet/_eventloop.py", line 217, in wait
    result = self._result(timeout)
  File "/usr/lib/python2.7/dist-packages/crochet/_eventloop.py", line 195, in _result
    raise TimeoutError()
TimeoutError
2015-06-19 09:43:38 [-] 127.0.0.1 - - [19/Jun/2015:09:43:37 +0000] "POST /MAAS/api/1.0/nodes/node-9f40d3ca-c4cd-11e3-8102-00163efc5068/?op=start HTTP/1.1" 504 108 "-" "Go 1.1 package http"

But signature looked different in apache.log file:

10.245.0.161 - - [19/Jun/2015:09:43:37 +0000] "GET /MAAS/api/1.0/nodegroups/037c960b-5b9f-4701-8366-eeda2c09d14e/boot-images/ HTTP/1.1" 200 1734 "-" "Go 1.1 package http" (0 secs)
10.245.0.161 - - [19/Jun/2015:09:43:06 +0000] "POST /MAAS/api/1.0/nodes/node-9f40d3ca-c4cd-11e3-8102-00163efc5068/?op=start HTTP/1.1" 504 346 "-" "Go 1.1 package http" (31 secs)
10.33.16.177 - - [19/Jun/...

Read more...

Raphaël Badin (rvb)
Changed in maas:
importance: Undecided → Critical
Revision history for this message
Raphaël Badin (rvb) wrote :

The timeout comes from _async_find_free_ip (in src/maasserver/models/staticipaddress.py) which means that _find_free_ip (run in a separate thread) took more than 30 seconds to complete. The only operation in there that can block or take a long time is the fetching of the list of already-taken IP addresses from the StaticIPAddress table. Looks like a locking problem but the (very conservative) 30s timeout could also indicate that what we're seeing here is a deadlock.

Revision history for this message
Raphaël Badin (rvb) wrote :

Another datapoint: this seems to be happening very frequently in OIL: over the course of 4 days (2015-06-18 06:45:32 - 2015-06-21 06:46:08), this problem happened 52 times.

Revision history for this message
Raphaël Badin (rvb) wrote :

Another cause for this bug might be that the threadpool is busy servicing other requests and that no thread becomes available during the 30s interval.

Changed in maas:
milestone: none → 1.8.1
no longer affects: maas/1.9
Changed in maas:
milestone: 1.8.1 → 1.9.0
Revision history for this message
Raphaël Badin (rvb) wrote :

I'm seeing similar (but unrelated) timeouts in OIL: http://paste.ubuntu.com/11757234/

I'd say this indicates thread starvation…

Revision history for this message
Raphaël Badin (rvb) wrote :

So, I've instrumented the web thread pool and the default thread pool in OIL to dump stats every 30 seconds and there is absolutely no indication of a thread starvation. Tasks don't seem to pile up at all.

Changed in maas:
status: New → Triaged
Revision history for this message
Blake Rouse (blake-rouse) wrote :

Is this still an issue you are hitting in OIL?

In MAAS 1.9 we have fixed many concurrency issues so this might be resolved in 1.9. Those fixes are not backportable to 1.8. I am going to mark this as incomplete, please change back to new if this is still an issue.

no longer affects: maas/1.8
Changed in maas:
status: Triaged → Incomplete
Revision history for this message
Larry Michel (lmic) wrote :

We are not seeing these timeouts in OIL from regiond.log which has entries as old as 8/25.

Revision history for this message
Blake Rouse (blake-rouse) wrote :

Thanks for the update I will set to Invalid. Please update if it occurs again.

Changed in maas:
status: Incomplete → Invalid
Changed in maas:
milestone: 1.9.0 → none
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.