retry failed API calls

Bug #1655716 reported by Francis Ginther on 2017-01-11
This bug affects 3 people

Affects: Autopilot Log Analyser, Importance: High, Assigned to: Francis Ginther
Affects: juju, Importance: Wishlist, Unassigned

Bug Description

I've seen a variety of similar juju bootstrap failures on top of a newton openstack cloud deployed by Landscape autopilot. The openstack cloud is using an HA configuration. Most of these are with juju 2.1beta3 but some are now using 2.1beta4. I'm still collecting results with 2.1beta4.

The bootstrap errors are not identical, but have a similar signature in that a call to an openstack resource (i.e. a nova API endpoint) returns an EOF. The equivalent openstack CLI commands work, although the time it takes for them to return can be 10+ seconds (which is unusual for these clouds).

The suspicion is that the failures are due to the openstack host units being under heavy load and simply not returning data quickly enough for juju.

Here are some examples of the failures (full logs are in the juju-bootstrap-error-logs.tar.gz attachment):

[4975.log (with beta4)]
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . ERROR juju.cmd.juju.commands bootstrap.go:1021 error cleaning up: destroying managed models: listing volumes: Get http://10.96.105.144:8776/v2/40532f47f66f462abe1fa72cddf76e75/volumes/detail: EOF
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . 01:47:25 ERROR cmd supercommand.go:458 failed to bootstrap model: refreshing addresses: failed to get list of server details
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: failed executing the request http://10.96.105.143:8774/v2.1/40532f47f66f462abe1fa72cddf76e75/servers/detail
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: Get http://10.96.105.143:8774/v2.1/40532f47f66f462abe1fa72cddf76e75/servers/detail: EOF
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . DEBUG cmd supercommand.go:459 (error details: [{github.com/juju/juju/cmd/juju/commands/bootstrap.go:556: failed to bootstrap model} {refreshing addresses: failed to get list of server details
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: failed executing the request http://10.96.105.143:8774/v2.1/40532f47f66f462abe1fa72cddf76e75/servers/detail
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: Get http://10.96.105.143:8774/v2.1/40532f47f66f462abe1fa72cddf76e75/servers/detail: EOF}])
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap ERROR
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap ERROR (869.951 secs)

[4961.log (beta3)]
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . 10:59:30 ERROR cmd supercommand.go:458 failed to bootstrap model: getting addresses: failed to get details for serverId: f00cef8b-85e8-456c-96e1-b3aace509d5b
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: failed executing the request http://10.96.101.100:8774/v2.1/e188045ca35548b1bbdd4404b9ff1edc/servers/f00cef8b-85e8-456c-96e1-b3aace509d5b
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: Get http://10.96.101.100:8774/v2.1/e188045ca35548b1bbdd4404b9ff1edc/servers/f00cef8b-85e8-456c-96e1-b3aace509d5b: EOF
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . 10:59:30 DEBUG cmd supercommand.go:459 (error details: [{github.com/juju/juju/cmd/juju/commands/bootstrap.go:556: failed to bootstrap model} {getting addresses: failed to get details for serverId: f00cef8b-85e8-456c-96e1-b3aace509d5b
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: failed executing the request http://10.96.101.100:8774/v2.1/e188045ca35548b1bbdd4404b9ff1edc/servers/f00cef8b-85e8-456c-96e1-b3aace509d5b
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: Get http://10.96.101.100:8774/v2.1/e188045ca35548b1bbdd4404b9ff1edc/servers/f00cef8b-85e8-456c-96e1-b3aace509d5b: EOF}])
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap ERROR
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap ERROR (842.456 secs)

[4953.log (beta3)]
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . 12:38:53 ERROR cmd supercommand.go:458 failed to bootstrap model: cannot start bootstrap instance: failed to get list of availability zones
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: failed executing the request http://10.96.125.141:8774/v2.1/3ce00359c6d042b4844f3cfffd792b60/os-availability-zone
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: Get http://10.96.125.141:8774/v2.1/3ce00359c6d042b4844f3cfffd792b60/os-availability-zone: EOF
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . 12:38:53 DEBUG cmd supercommand.go:459 (error details: [{github.com/juju/juju/cmd/juju/commands/bootstrap.go:556: failed to bootstrap model} {github.com/juju/juju/provider/common/bootstrap.go:47: } {github.com/juju/juju/provider/common/bootstrap.go:178: cannot start bootstrap instance} {failed to get list of availability zones
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: failed executing the request http://10.96.125.141:8774/v2.1/3ce00359c6d042b4844f3cfffd792b60/os-availability-zone
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: Get http://10.96.125.141:8774/v2.1/3ce00359c6d042b4844f3cfffd792b60/os-availability-zone: EOF}])
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap ERROR
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap ERROR (662.976 secs)

[4969.log (beta3)]
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . 10:44:57 ERROR cmd supercommand.go:458 failed to bootstrap model: cannot start bootstrap instance: cannot set up groups: failed to create a rule for the security group with id: 7c16777f-ae5d-4803-8d06-f8838804f0c0
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: failed executing the request http://10.96.103.163:9696/v2.0/security-group-rules
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: Post http://10.96.103.163:9696/v2.0/security-group-rules: EOF
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . 10:44:57 DEBUG cmd supercommand.go:459 (error details: [{github.com/juju/juju/cmd/juju/commands/bootstrap.go:556: failed to bootstrap model} {github.com/juju/juju/provider/common/bootstrap.go:47: } {github.com/juju/juju/provider/common/bootstrap.go:178: cannot start bootstrap instance} {github.com/juju/juju/provider/openstack/provider.go:1007: cannot set up groups} {github.com/juju/juju/provider/openstack/firewaller.go:434: } {failed to create a rule for the security group with id: 7c16777f-ae5d-4803-8d06-f8838804f0c0
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: failed executing the request http://10.96.103.163:9696/v2.0/security-group-rules
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap . caused by: Post http://10.96.103.163:9696/v2.0/security-group-rules: EOF}])
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap ERROR
INFO ui.openstack.test_region_juju_bootstrap.RegionBootstrapTest.test_juju_bootstrap ERROR (691.248 secs)

Francis Ginther (fginther) wrote :
Changed in autopilot-log-analyser:
status: New → In Progress
assignee: nobody → Francis Ginther (fginther)
Changed in landscape:
assignee: nobody → Francis Ginther (fginther)
status: New → Triaged
importance: Undecided → High
Changed in autopilot-log-analyser:
importance: Undecided → High
Changed in landscape:
milestone: none → 16.12
Anastasia (anastasia-macmood) wrote :

@Francis Ginther,
Please confirm whether you are still seeing these with 2.1b4.
It would also be nice to confirm whether the openstack host units are indeed under heavy load.

Changed in juju:
status: New → Incomplete
tags: added: eda
David Britton (davidpbritton) wrote :

We are seeing failures on 2.1b4

Changed in juju:
status: Incomplete → New
Francis Ginther (fginther) wrote :

@Anastasia, yes, the 4975.log example was with juju 2.1beta4.

I'm still attempting to gather more data on this with beta4 and some live debugging.

Francis Ginther (fginther) wrote :

I did some live debugging and found the EOFs appear to be the result of nova getting timeouts from keystone. One of the keystone units was co-located on the same physical machine as the juju bootstrap unit. We've discovered from lp:1655169 that there is intense memory pressure on the bootstrap unit, and it's possible that keystone is slow because of swap thrashing. The bootstrap unit was further co-located with mysql, which has high memory demands of its own. This isn't likely to help.

One thing I haven't quite figured out yet is why this occurs just after the autopilot cloud is deployed. At that point in time, there isn't enough memory usage to use swap yet (as determined by looking at landscape's memory graphs), so swap thrashing doesn't make sense. A couple of possible causes are the cloud not yet being ready (for example, the keystone units aren't ready to accept requests yet) or another performance issue impacting this (such as high disk usage).

Will keep trying to collect more data and isolate the problem.

Anastasia (anastasia-macmood) wrote :

It's possible that the performance was degraded due to the bug that we have discovered and fixed on 2.1 tip: https://bugs.launchpad.net/juju/+bug/1649719

PR with the fix: https://github.com/juju/juju/pull/6789

Changed in juju:
status: New → Incomplete
Andreas Hasenack (ahasenack) wrote :

Although not really surprising, it's not just bootstrap that can fail. Here is a case where kill-controller failed:
INFO lib.juju.environment . . All hosted models reclaimed, cleaning up controller machines
INFO lib.juju.environment . . WARNING discovered API versions: [{Version:{major:2 minor:0} Links:[{Href:http://10.96.21.108:8774/v2/ Rel:self}] Status:SUPPORTED} {Version:{major:2 minor:1} Links:[{Href:http://10.96.21.108:8774/v2.1/ Rel:self}] Status:CURRENT}]
INFO lib.juju.environment . . WARNING discovered API versions: [{Version:{major:2 minor:0} Links:[{Href:http://10.96.21.116:9696/v2.0 Rel:self}] Status:CURRENT}]
INFO lib.juju.environment . . ERROR destroying controller model: destroying storage: listing volumes: Get http://10.96.21.114:8776/v2/0e0b4ae2e6724f4ea42ebd018991d360/volumes/detail: EOF
INFO lib.juju.environment . ERROR
...
RuntimeError: Command ('juju', 'kill-controller', '-y', 'osa-controller') exited with status 1 ()

Changed in autopilot-log-analyser:
status: In Progress → Fix Released
Changed in landscape:
milestone: 16.12 → 17.01
Francis Ginther (fginther) wrote :

There is evidence that this bug is related to lp:1648677. In every example I have of the EOF problem occurring in a history of test runs, I also found lp:1648677. I haven't figured out what the actual connection is if there is one.

Francis Ginther (fginther) wrote :

More debugging today came up with some additional theories. Openstack uses mysql, which ends up getting deployed across three units. In our case, keystone and nova end up on the same physical machines. Mysql is deployed with a default setting for innodb_buffer_pool_size which lets it assume it can allocate up to 50% of memory for its internal buffer pool. This ends up being 50% of the host machine's memory, even though mysql is smooshed together with 6 other services.

I'm going to attempt to tune innodb_buffer_pool_size to a lower value (say 10%) and see what the outcome is.

Francis Ginther (fginther) wrote :

Discussion with the openstack team also raised the need to tune "worker-multiplier" on any API service. This charm config option sets the number of worker processes per service to 2x the number of CPUs by default. This will eat a lot of memory on a 12-CPU machine and create far too many workers on a smooshed deployment.

So far we have the percona config options to tune:
 - max-connections
 - innodb-buffer-pool-size

And on all API services:
 - worker-multiplier
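As a sketch, the tuning above could be expressed as a bundle overlay. The option names come from the percona-cluster and OpenStack API charms discussed here, but the exact values (a 10% buffer pool, a worker-multiplier of 1, a max-connections figure) are illustrative assumptions, not tested recommendations:

```yaml
# Hypothetical bundle overlay; values are illustrative, not tuned.
applications:
  mysql:
    options:
      max-connections: 2000
      # Cap the buffer pool well below the 50%-of-RAM default,
      # since the unit shares its host with several other services.
      innodb-buffer-pool-size: 10%
  nova-cloud-controller:
    options:
      # Default is 2x the CPU count, which creates far too many
      # workers on a smooshed deployment.
      worker-multiplier: 1
```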

Chad Smith (chad.smith) on 2017-02-10
Changed in landscape:
milestone: 17.01 → 17.02
Changed in autopilot-log-analyser:
status: Fix Released → Fix Committed

It seems like juju should retry API calls when they fail. I am going to re-open as New, but am convinced this is in the "wishlist" category.

summary: - bootstrap fails with various EOF responses from autopilot newton
- openstack deployment
+ EOF responses from newton openstack deployment
Changed in juju:
status: Incomplete → New
summary: - EOF responses from newton openstack deployment
+ retry failed API calls
Changed in juju:
status: New → Triaged
importance: Undecided → Wishlist
Francis Ginther (fginther) wrote :

This issue has been mitigated in landscape-server by setting the worker-multiplier to 1. It would still be nice to see API retries, but with this change we're no longer seeing the issue in our testing.

Changed in landscape:
status: Triaged → Incomplete
Chad Smith (chad.smith) on 2017-03-16
Changed in landscape:
milestone: 17.02 → 17.03
no longer affects: landscape
John A Meinel (jameinel) wrote :

I thought we already had code at the go-openstack (goose) level to retry requests. I know we implemented support for openstack telling us "come back again in 30s" sort of responses. It may be that there is another failure that we need to handle at a different level.

One problem with something like EOF is that you may not actually know whether the request succeeded without telling you so, or failed outright. So you might have resources that were allocated but never identified.

Francis Ginther (fginther) wrote :

@John,

> I thought we already had code at the go-openstack (goose) level to retry requests.
> I know we implemented support for openstack telling us "come back again in 30s" sort
> of responses. It may be that there is another failure that we need to handle at a
> different level.

Looking at the logsink.log it does look like juju retried the call to create the lxc several times.

> One problem with something like EOF, is you may not actually know whether the request
> succeeded but didn't tell you it did, or failed. So you might have resources that
> were allocated but you didn't get identified.

Indeed, as an outside observer, this looks like a difficult problem.

I'll note that this is very rare. I've only seen it twice across hundreds of deployments with 2.1.x juju, and it appears to require something impacting the state of the LXD service.

Pedro Guimarães (pguimaraes) wrote :

Hi, I am seeing the same situation on a customer deployment:

caused by: Post https://URL/compute/v2.1/SOME_UUID/servers: EOF

Full logs here: https://pastebin.canonical.com/p/4TtR2zm8wg/

Discussing with the Juju team, it seems EOF cases are out of scope for retries. However, I only see this issue during bootstrap; enable_ha and deploy seem to be fine. I can see those commands implement a 10x retry by default, which I think covers this issue with enough retries to get a VM. The bootstrap operation does not have that.

It is annoying because it forces me to relaunch the bootstrap command over and over until it gets it right.

If what stops us from implementing a retry for EOF is the fact that we don't know whether the request was eventually accepted or not, why don't we sleep for, say, 10s and then run a list? Juju machines always follow a particular naming scheme: juju-UUID-X, where X is the machine number.
