ec2 changes? rising failure rate in ec2 health checks

Bug #1345638 reported by Curtis Hovey
This bug affects 1 person
Affects     Status         Importance   Assigned to   Milestone
juju-core   Fix Released   Medium       Unassigned

Bug Description

CI's hourly health checks are seeing a 50% bootstrap failure rate. We need to understand why this is happening, and we may need to make changes to keep up with ec2. I am marking this bug critical while we do not know the cause. I hope we can lower the priority afterward.

2014-07-20 15:45:53 INFO juju.environs.bootstrap bootstrap.go:86 picked bootstrap tools version: 1.20.1
Launching instance
2014-07-20 15:45:56 INFO juju.utils http.go:59 hostname SSL verification enabled
2014-07-20 15:45:56 INFO juju.utils http.go:59 hostname SSL verification enabled
2014-07-20 15:45:56 INFO juju.utils http.go:59 hostname SSL verification enabled
2014-07-20 15:45:58 INFO juju.provider.ec2 ec2.go:643 started instance "i-ad695887" in "us-east-1a"
 - i-ad695887
Waiting for address
2014-07-20 15:45:59 ERROR juju.provider.common bootstrap.go:119 bootstrap failed: refreshing addresses: The service is unavailable. Please try again shortly. (Unavailable)
Stopping instance...
Bootstrap failed, destroying environment
2014-07-20 15:45:59 INFO juju.provider.common destroy.go:15 destroying environment "test-cloud-aws"
2014-07-20 15:46:00 ERROR juju.cmd supercommand.go:323 refreshing addresses: The service is unavailable. Please try again shortly. (Unavailable)
<class 'subprocess.CalledProcessError'>: Command '('juju', '--show-log', 'bootstrap', '-e', 'test-cloud-aws', '--constraints', 'mem=2G')' returned non-zero exit status 1
Build step 'Execute shell' marked build as failure

Tags: ec2-provider
Curtis Hovey (sinzui) wrote :

The failures are fast: "Took 12 sec on master". Juju isn't waiting for an IP address. There isn't enough time to intervene in the AWS console to add termination protection. I will try some manual runs with --debug to capture more information.

Curtis Hovey (sinzui) wrote :

STDERR was directed to a file to capture the failure without revealing private data to the console.
    https://pastebin.canonical.com/113904/

There isn't a lot of meaningful information. I see this:

2014-07-21 21:45:58 INFO juju.provider.ec2 ec2.go:643 started instance "i-bbc7f591" in "us-east-1a"
 - i-bbc7f591
2014-07-21 21:45:58 DEBUG juju.environs.bootstrap state.go:36 putting "provider-state" to bootstrap storage *ec2.ec2storage
Waiting for address
2014-07-21 21:45:59 ERROR juju.provider.common bootstrap.go:119 bootstrap failed: refreshing addresses: The service is unavailable. Please try again shortly. (Unavailable)
Stopping instance...
2014-07-21 21:45:59 INFO juju.cmd cmd.go:113 Bootstrap failed, destroying environment
2014-07-21 21:45:59 INFO juju.provider.common destroy.go:15 destroying environment "test-cloud-aws"
2014-07-21 21:46:00 ERROR juju.cmd supercommand.go:323 refreshing addresses: The service is unavailable. Please try again shortly. (Unavailable)

Andrew Wilkins (axwalk) wrote :

I've tried to reproduce this ~10 times, without any success. According to the AWS docs:
    "Unavailable The server is overloaded and can't handle the request."

So I expect this error is more likely to occur in a US region during US-friendly hours. I wonder if there's any relationship to us selecting a specific AZ? There's no difference between the failed/successful runs in that regard, though: they're all bootstrapping in us-east-1a.
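If "Unavailable" really does mean a briefly overloaded server, one option on the CI side is to retry the bootstrap with a growing delay instead of failing on the first error. The following is only a minimal sketch under that assumption, not Juju's or the CI scripts' actual behaviour: `run_with_retry`, `is_transient`, and the `run` callable (returning an exit status and the captured stderr text) are hypothetical names, and the attempt count and delays are arbitrary.

```python
import time

# Error-code suffix seen in the failing bootstrap logs in this report.
UNAVAILABLE_MARKER = "(Unavailable)"


def is_transient(stderr_text):
    """Return True if the output carries the EC2 'Unavailable' error code."""
    return UNAVAILABLE_MARKER in stderr_text


def run_with_retry(run, attempts=4, base_delay=5.0, sleep=time.sleep):
    """Call run() until it succeeds, fails non-transiently, or attempts run out.

    `run` is a hypothetical callable returning (exit_status, stderr_text),
    for example a wrapper around the `juju bootstrap` subprocess call.
    The delay doubles after each transient failure: 5s, 10s, 20s, ...
    """
    status, stderr_text = run()
    for attempt in range(1, attempts):
        if status == 0 or not is_transient(stderr_text):
            break  # success, or an error that retrying will not fix
        sleep(base_delay * 2 ** (attempt - 1))
        status, stderr_text = run()
    return status, stderr_text
```

A retry like this would only smooth over short bursts of the error; it does nothing if the service stays unavailable for longer than the total backoff time.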

Andrew Wilkins (axwalk) wrote :

To clarify, when I said "without any success" I mean it bootstrapped fine every time.

Curtis Hovey (sinzui) wrote :

The failures happen in spurts. I too could not reproduce the issue in 5 tries. Once the error is seen, it happens for a while. The backup-restore test failed 3 times in a few minutes, marking 1.20.0 a failure. I moved the test to HP and reset the count of failures.

I will try changing the region.

Curtis Hovey (sinzui) wrote :

The health checks look much better today. There were 4 failures in the last 24 hours. The 2 most recent errors were timeouts; juju could not provision the state-server in 30 minutes. The other 2 errors are the "service is unavailable" error.

I believe ec2 is fixing itself. If the problem was that the region/AZ didn't have instances available, then maybe this issue is about Juju making it clear that I should choose another region/AZ.

Andrew Wilkins (axwalk) wrote :

> I believe ec2 is fixing itself. If the problem was that the region/AZ didn't have instances available, then maybe this issue is about Juju making it clear that I should choose another region/AZ.

Juju does if ec2 tells it that's the case. I was just guessing that perhaps the server load is associated with a specific AZ, and ec2 isn't returning an error message to indicate that. But I'm just guessing.

Curtis Hovey (sinzui) wrote :

CI testing and cloud health checks are no longer seeing this problem. I have lowered the importance. I think there is something that needs fixing. At least Juju should explain what happened.
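As a rough illustration of what "explain what happened" could look like, here is a small sketch that turns the raw error text into a one-line hint. It is an assumption-laden example rather than anything Juju does: `explain_failure` is a hypothetical helper, the hint wording is made up, and the only grounded facts are the "(Unavailable)" code from the logs and the AWS description quoted earlier in this report.

```python
def explain_failure(stderr_text):
    """Return a one-line hint for a failed bootstrap, based on its stderr."""
    if "(Unavailable)" in stderr_text:
        # AWS describes this code as an overloaded server, so hint at retrying.
        return ("EC2 reported 'Unavailable' (the server is overloaded); "
                "this is probably transient, so retry later or try another region.")
    # Anything else is outside what this sketch knows about.
    return "Bootstrap failed for a reason this sketch does not recognise; see the log."
```

The same check could live in the CI harness, so a failed health check is labelled as a provider-side outage instead of a Juju regression.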

Changed in juju-core:
importance: Critical → Medium
milestone: 1.21-alpha1 → none
Curtis Hovey (sinzui)
no longer affects: juju-core/1.20
Curtis Hovey (sinzui)
Changed in juju-core:
status: Triaged → Fix Released