metadata service returns empty data with 200 OK

Bug #566792 reported by C de-Avillez on 2010-04-19
28
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Eucalyptus
Fix Released
Undecided
chris grzegorczyk
cloud-init (Ubuntu)
Undecided
Unassigned
Lucid
Undecided
Unassigned
Maverick
Undecided
Unassigned
eucalyptus (Ubuntu)
High
Dave Walker
Lucid
High
Dave Walker
Maverick
High
Dave Walker

Bug Description

Binary package hint: cloud-init

Release of Ubuntu: Lucid serverAMD64 daily 20100419.1/UEC AMD64 Lucid daily 20100419. Running on the test rig, on lucid-amd64-topo2.

Package Version:
Expected Results: no errors
Actual Results: see attached file.

### Ubuntu SRU Justification ###
IMPACT: The current package in Lucid has an intolerably high failure rate at starting instances, this seems to be due to some instances failing to get assigned a public ip address. This means that the meta service cannot function correctly.

BUG FIX: Platform, QA and upstream worked closely together to identify the issue and through iterations tested proposed fixes, in conjunction with using the QA uec-testing scripts. Once the correct fix was identified, it was committed to upstreams stable branch.

PATCH: New patch, attached below - 30th June 2010

TEST CASE: Fresh installation, start many instances and note the failure rate. This process can be aided with lp:uec-testing-scripts that allows multiple instances to be attempted to start.

REGRESSION POTENTIAL: Low. C de-Avillez has been undertaking multiple QA runs of the patch through it's iterations to the extent that it would be comparable to pre-declaring verification-passed , as he has been testing the same patch from a PPA (ppa:davewalker/uec-testing). Once this is Lucid proposed, i am still expecting the traditional verification process to happen.

C de-Avillez (hggdh2) wrote :
Scott Moser (smoser) wrote :

ugh.

I'm somewhat baffled on how this can happen.

WARNING:INSTANCE i-39AC0623: File "/usr/lib/python2.6/dist-packages/cloudinit/DataSourceEc2.py", line 64, in get_instance_id
WARNING:INSTANCE i-39AC0623: return(self.metadata['instance-id'])

That indicates that somehow, the 'metadata' member of self (DataSourceEc2.py) does not have an 'instance-id' field.

That field gets populated in cloudinit/DataSourceEc2.py's get_data, which does something like:

        try:
            if not self.wait_for_metadata_service():
                return False
            self.userdata_raw = boto_utils.get_instance_userdata(self.api_ver)
            self.metadata = boto_utils.get_instance_metadata(self.api_ver)
            return True

self.wait_for_metadata_service is an attempt to make sure the metadata service is up before calling boto's get_instance_userdata and get_instance_metadata methods to populate the members.

both of get_instance_userdata and get_instance_metadata also take precautions to wait on the metadata service. So, it appears that the metadata service was correctly parsed / crawled by boto, and that it did not have an 'instance-id' field .

Scott Moser (smoser) wrote :

before 'return True' in the above snippet, I added:

            import pprint
            print "%s\n%s\n%s" % ("==== begin userdata_raw ====", self.userdata_raw, "==== end userdata_raw ====")
            print "%s\n%s\n%s" % ("==== begin metadata ====", pprint.pprint(self.metadata), "==== end metadata ====")
            return True

We saw the error, and see in the console:
WARNING:INSTANCE i-489C08A5:==== begin userdata_raw ====
WARNING:INSTANCE i-489C08A5:
WARNING:INSTANCE i-489C08A5:==== end userdata_raw ====
WARNING:INSTANCE i-489C08A5:{}
WARNING:INSTANCE i-489C08A5:==== begin metadata ====
WARNING:INSTANCE i-489C08A5:None
WARNING:INSTANCE i-489C08A5:==== end metadata ====

As far as I can tell, that means the metadata service was up (ie, something was at port 80 169.254.169.254 and it returned valid http with empty data)

Scott Moser (smoser) wrote :

I modified 'retry_url' in the cloudinit/boto_utils.py to have:
        try:
            req = urllib2.Request(url)
            resp = urllib2.urlopen(req)
            content = resp.read()
            print "%s [%i]\n\t%s" % (url, resp.getcode(), content)
            return content

the result is, we still see failure, but have output like:
|http://169.254.169.254/2009-04-04/user-data [200]
|
|http://169.254.169.254/2009-04-04/meta-data/ [200]
|
|==== begin userdata_raw ====
|
|==== end userdata_raw ====
|==== begin metadata ====
|{}
|==== end metadata ====

So, we are in fact getting an http 200 (OK) response from the server for the http://169.254.169.254/2009-04-04/user-data and http://169.254.169.254/2009-04-04/meta-data/ URLs. I suspect sleeping there and trying again will make this work. I'll try that to verify.

Basically, the metadata service is up and responding but not populated yet.

Scott Moser (smoser) wrote :

I've opened a Eucalyptus (and Ubuntu/Eucalyptus task). This is really a eucalyptus bug. Ideally the instance would not be booted without the metadata service up and ready. That would be the ideal world. As seen in bug 308530, EC2 has an issue where if the instance looks too soon, the metadata service might not be present at all (no route to host, or failed tcp connection)

Eucalyptus seems to have a similar bug the metadata service responds with http succes (code '200') before it is actually populated.

Scott Moser (smoser) wrote :

I'm attaching a possible workaround patch to cloud-utils to work around this to this Eucalyptus bug.

Scott Moser (smoser) wrote :

Re-attaching. The previous had a intentional typo to test the error path left in. this wone i hope is better.

Scott Moser (smoser) wrote :

I've run some tests with this patch applied on the system that Carlos was running on. The patch seems to do what was intended, to block and wait for existance of metadata service with instance-id field or wait for 20 seconds.

The result is that the instances still fail, but now, rather than failing with 'consume userdata', they die like:

| waiting for md at http://169.254.169.254/2009-04-04/meta-data/instance-id (try 1/20)
| ...
| waiting for md at http://169.254.169.254/2009-04-04/meta-data/instance-id (try 19/20)
| Could not find data source
| Failed to get instance datainit: cloud-init main process (411) terminated with status 1^M
| mountall: Event failed
| init: plymouth-log main p

I'll also note that Dustin, and I have also seen failures of this type. It is the primary failure for "ssh isnt available" that we've run into.

Thierry Carrez (ttx) wrote :

Apparently a eucalyptus metadata service issue, as evidenced by previous comment.

Changed in cloud-init (Ubuntu):
status: New → Invalid
Changed in eucalyptus (Ubuntu):
importance: Undecided → High
status: New → Confirmed
Scott Moser (smoser) wrote :

I'm attaching a debug version of cloud-init. It has debug printfs showing when cloud-init-cfg is called, and also better info on 'sleeping' waiting for the metadata service.

tags: added: patch
summary: - UEC guests sometimes fail on consuming user data
+ UEC guests sometimes fail on consuming user data (metadata service isn't
+ ready)
tags: added: iso-testing
Thierry Carrez (ttx) wrote :

Assigning to Scott, pending input from the Eucalyptus team

Changed in eucalyptus (Ubuntu Lucid):
assignee: nobody → Scott Moser (smoser)
Scott Moser (smoser) wrote :

I've recreated this with a ttylinux based image, with modifications to access metadata service early in boot.
I'm attaching that instance here.

this can be registered with 'uec-register-tarball ttylinux-11.0-fastboot-amd64-0.07.tar.gz mybucket'
it will also boot in smaller than m1.small (ie, you can shrink m1.small to 1G fs and 96 M mem)

Scott Moser (smoser) wrote :

Well, as is often the case, user error added some confusion here.
My system (and possibly kirkland and ttx) have a second dhcp server available on the network. That dhcp server was responding, and sometimes its response was being accepted rather than that of the CC.

So, user error causes the '400' timeout, but the original debugging shown above showed empty data fields and a 200 response. As far as I can tell, that is still valid.

Additionally, I belive I've got the correct dnsmasq/dd-wrt configuration options to ignore the UEC nodes:
  dhcp-host=d0:0d:*:*:*:*,ignore

Scott Moser (smoser) wrote :

more testing has led me to the following:
a.) I cannot reproduce the 200 OK response with empty metadata that we see in the data center rig on my own hardware. That is still an issue.
b.) It appears that in all my tests the metadata service eventually *does* come up. On average in my tests (hundreds if not thousands), the metadata service is up after less than 10 seconds. However, in cases where it is not it doesn't seem to come up until something around 2 minutes.

The solution to 'b' from the instances perspective is to simply retry for much longer than we initially were (which was ~ 30 seconds).

this is quite problematic for any sort of automated testing, because you essentially cannot call a system "failed" for several minutes. Both my tests and mathias' tests give up after less than a minute of unreachability and call the instance failed.

We may have to redesign tets to accomodate this.

I will also say, though, that I've not seen a single case of "metadata service not available" on ec2 in the past 9 months. They seem to have fixed that issue that was present previously.

I plan on putting some better code in the "wait for metadata service", complete with waiting (indefinitely) for a non-empty value in instance-id (which would work around the 200-OK failure, and give us a chance to catch it an debug when it does fail).

Scott Moser (smoser) wrote :

I'm testing with a modified version of 20100427.1 (release) in the data center. In about 4 out of 50 cases, I see the 200 OK error. Running ngrep to watch data on the CC, I see things like:

########
T 172.19.1.42:58975 -> 169.254.169.254:80 [AP]
  GET /2009-04-04/meta-data/instance-id HTTP/1.1..Accept-Encoding: identity..
  Host: 169.254.169.254..Connection: close..User-Agent: Python-urllib/2.6....
##
T 169.254.169.254:80 -> 172.19.1.42:58975 [AP]
  HTTP/1.1 200 OK..Content-Length: 0..Content-Type: text/html....
####exit

Scott Moser (smoser) wrote :

I've been testing more, now with current lucid images, which have newer cloud-init.
'ngrep port 8773' on the CLC shows:
########
T 10.55.55.8:39968 -> 10.55.55.2:8773 [AP]
  GET /2009-04-04/meta-data/instance-id HTTP/1.1..Accept-Encoding: identity..
  Host: 169.254.169.254..Connection: close..User-Agent: Python-urllib/2.6....
##
T 10.55.55.2:8773 -> 10.55.55.8:39968 [AP]
  HTTP/1.1 200 OK..Content-Length: 0..Content-Type: text/html....

I'm pretty sure the empty string is coming from:
clc/modules/cluster-manager/src/main/java/edu/ucsb/eucalyptus/cloud/ws/VmMetadata.java: handle

it searches for a vminstance, and if it doens't find one, returns ""

Scott Moser (smoser) wrote :

I noticed that above, 10.55.55.8 is the address of the node controller. In cases where the correct response is found, that token is the public IP address of the node. That would explain why the search for a matching VM's IP would fail.

Scott Moser (smoser) on 2010-05-20
summary: - UEC guests sometimes fail on consuming user data (metadata service isn't
- ready)
+ metadata service returns empty data with 200 OK
Scott Moser (smoser) wrote :

OK. I think I've figured out what is causing this.
in 100% of the failed instances that I'm seeing today, the instance has its public ip identical to its private ip.

Ie, the following is euca-describe-instances output of a failed instance:
INSTANCE i-2F1A0573 emi-E58B1525 172.19.1.48 172.19.1.48 running uectest-k0 0 m1.small 2010-05-20T15:34:28.275Z UEC-TEST1 eki-2B241A2B

And this is from a successful instance:
INSTANCE i-37A30725 emi-E58B1525 10.55.55.187 172.19.1.34 running uectest-k0 0 m1.small 2010-05-20T15:34:06.874Z UEC-TEST1 eki-2B241A2B

Note, these instances were not launched with '--addressing private'.

$ euca-describe-addreses | sort
ADDRESS 10.55.55.100 i-4A34090E (eucalyptus)
ADDRESS 10.55.55.101 available (nobody)
ADDRESS 10.55.55.102 available (eucalyptus)
...
ADDRESS 10.55.55.238 nobody
ADDRESS 10.55.55.239 nobody
ADDRESS 10.55.55.240 nobody
$ euca-describe-addresses | grep nobody | wc -l
24
$ euca-describe-addresses | wc -l
141

It seems there should have been addresses available, euca-describe-availability-zones only shows 32 instances max.

# on the cloud-controller
cc-host$ cat /etc/eucalyptus/eucalyptus.local.conf
VNET_MODE="MANAGED-NOVLAN"
VNET_SUBNET="172.19.0.0"
VNET_NETMASK="256.255.0.0"
VNET_DNS="10.55.55.1"
VNET_ADDRSPERNET="32"
VNET_PUBLICIPS="10.55.55.100-10.55.55.240"
VNET_CLOUDIP="10.55.55.2"

The testcases (lp:~uec-testing-scripts-dev/uec-testing-scripts/trunk/) basically keep firing off instances 1 by 1 with euca-run-instances as long as 'euca-describe-availability-zones verbose' reports that there is space. Note, instances are ran *without* '--addressing private', but appear to be getting private addressing.

Scott Moser (smoser) wrote :

I'm attaching iptables-save output on CLC, CC, NC1 and NC2. I really don't think its of that much value due to the previous comment, but it was requested.

Daniel Nurmi (nurmi) wrote :

Scott, thank you for pursuing this issue. Indeed, the meta-data service keys off of the public IP of the VM, since that is the only source address that is translated when the NCs cannot directly contact the CLC (MANAGED modes). In order for the meta-data service to work, the VM needs to be assigned a public IP, and it looks like this operation is failing (indicated by the fact that the public/private IPs being reported are both from the private IP subnet).

Lets take a look at cc.log for an instance that is failing to get a public IP (look for the RunInstance that corresponds with the instance ID in question, and include a few minutes of log after the RunInstance). We're specifically looking to see if an AssignAddress() operation is ever called for the VM's private IP, or if no AssignAddress() is ever called for that VM. I think this will help us figure out why that VM is not getting a public IP assigned.

Thanks again, I think we're close to being able to get to the root cause of this.

p.s. I notice that your VNET_NETMASK is set to an invalid value, but the iptables output doesn't reflect this problem; can you confirm that your VNET_NETMASK is actually '255.255.0.0' and not '256.255.0.0' as was reported above? If it is, note that this may have unforeseen consequences.

Daniel Nurmi wrote:
> Scott, thank you for pursuing this issue. Indeed, the meta-data service
> keys off of the public IP of the VM, since that is the only source
> address that is translated when the NCs cannot directly contact the CLC
> (MANAGED modes). In order for the meta-data service to work, the VM
> needs to be assigned a public IP, and it looks like this operation is
> failing (indicated by the fact that the public/private IPs being
> reported are both from the private IP subnet).

That is interesting. Does that mean the meta-data service would not
work for instance started with "--addressing private"?

Scott Moser (smoser) wrote :

I'm attaching 2 files. failed-instance-cc.log (i-3812076B) and good-instance-cc.log (i-42A70769)

Unfortunately, I dont think they'll be of a whole lot of use, as the 'AssignAddress' calls are not tied to the instance id in the log, so I dont know how to tell which call was for which instance id. In the 'good-instance-cc.log', there was really only one instance, but in the failed version, there are lots of things happening (the test case is running).

Scott Moser (smoser) wrote :
Scott Moser (smoser) wrote :

On Thu, 20 May 2010, Etienne Goyer wrote:

> > (MANAGED modes). In order for the meta-data service to work, the VM
> > needs to be assigned a public IP, and it looks like this operation is
> > failing (indicated by the fact that the public/private IPs being
> > reported are both from the private IP subnet).
>
> That is interesting. Does that mean the meta-data service would not
> work for instance started with "--addressing private"?

I happen to from experience that it does work for addressing private. My
guess is that Eucalyptus especially takes care of the case where
'--addressing private' was used.

Scott Moser (smoser) on 2010-05-27
Changed in eucalyptus (Ubuntu Lucid):
milestone: none → lucid-updates
Mike Cook (mikewillcook) wrote :

Scott Moser wrote on 2010-05-20:
> > > (MANAGED modes). In order for the meta-data service to work, the VM
> > > needs to be assigned a public IP, and it looks like this operation is
> > > failing (indicated by the fact that the public/private IPs being
> > > reported are both from the private IP subnet).
> >
> > That is interesting. Does that mean the meta-data service would not
> > work for instance started with "--addressing private"?
>
> I happen to from experience that it does work for addressing private. My
> guess is that Eucalyptus especially takes care of the case where
> '--addressing private' was used.

Scott, were the CC & CLC the same host when it worked for you? I don't see
how it could ever work with "--addressing-private" if they're separate, and
Daniel's comment seems to confirm. Since it doesn't inject the key using the
non-MANAGED way when in MANAGED mode, it seems we need add a fix for
private IP instances.

It seems this specific bug may be around the fact that a host with a public IP
isn't getting it nat'd properly by the CC. But the issue of a private-only IP
instance not being able to get metadata may be separate (see bug 580184).

Scott Moser (smoser) wrote :

On Fri, 28 May 2010, Mike Cook wrote:
> > I happen to [know] from experience that it does work for addressing
> > private. My guess is that Eucalyptus especially takes care of the case
> > where '--addressing private' was used.
>
> Scott, were the CC & CLC the same host when it worked for you? I don't see
> how it could ever work with "--addressing-private" if they're separate, and
> Daniel's comment seems to confirm. Since it doesn't inject the key using the
> non-MANAGED way when in MANAGED mode, it seems we need add a fix for
> private IP instances.

They were the same, yes. So my comment may well be invalid.

David Futcher (bobbo) on 2010-06-08
tags: added: patch-forwarded-upstream
Scott Moser (smoser) on 2010-06-08
tags: removed: patch-forwarded-upstream
David Futcher (bobbo) on 2010-06-10
tags: added: patch-forwarded-upstream
Scott Moser (smoser) wrote :

@David,
  Why are you setting the 'patch-forwarded-upstream' tag on this bug, the patch has *not* been forwarded upstream. in fact, there is no patch or solution for the bug.

tags: removed: patch-forwarded-upstream
C de-Avillez (hggdh2) wrote :

Update on status: revision 1230 seems to have corrected the issue of IP addresses not being correctly set. Nevertheless, we still have metadata errors. Chris is looking into it.

tags: added: patch-refused
removed: patch
Changed in eucalyptus:
assignee: nobody → chris grzegorczyk (chris-grze)
Changed in eucalyptus (Ubuntu):
assignee: Scott Moser (smoser) → Dave Walker (davewalker)
Changed in eucalyptus (Ubuntu Lucid):
assignee: Scott Moser (smoser) → Dave Walker (davewalker)
C de-Avillez (hggdh2) wrote :

I am still running the 2,000 test. Right now I am at about 385 instances
started, with 10 failures. Looking at the failures I see:

i-442A07C, i-55F5097A, i-405C076B, i-42B708A5, i-3C0F0746, i-4C6D099C,
i-4B0D08E4, i-568308F5 -- did not get assigned a public IP.

I have uploaded the current view to lp:~hggdh2/+junk/uec-qa, under
lucid/r1235.

Sorry.

C de-Avillez (hggdh2) wrote :

I have just ran a 400-instance test with the latest patch (from Dave's PPA, 1.6.2-0ubuntu30.3.6).

I sent the following to the interested parties:

----email begins----
I just finished a 400-instances run. I am extremely glad to state I saw no occurrence of the metadata/public IP issue.

Of course, we are talking about a race, but I would say that we have drastically reduced (if not completely resolved) the incidence.

I favour we consider this fix done & publish it ASAP.
----email ends----

Changed in eucalyptus:
status: New → Fix Committed
Dave Walker (davewalker) wrote :
description: updated
tags: added: patch
removed: patch-refused

Accepted eucalyptus into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in eucalyptus (Ubuntu Lucid):
status: Confirmed → Fix Committed
tags: added: verification-needed
Dave Walker (davewalker) wrote :

@Colin, thanks for that.

@C de-Avillez: Can you re-run your tests against the package that is hitting lucid-proposed shortly, and update the verification tag as appropriate.

Thanks all!

C de-Avillez (hggdh2) wrote :

Tests re-run; in 800 instances we got a single metadata failure. After discussing the status with all in a teleconference, we decided to approve the changes so far:

* no regressions were detected;
* one failure in 800, on a extremely stressed UEC is *way better* than ~5 failures in 100.
* we need this fix published for stability.

So I am marking this verification-done; Eucalyptus is looking at the error, and will eventually propose an addendum to the patch, which we will integrate in another SRU.

tags: added: verification-done
removed: verification-needed
Changed in eucalyptus (Ubuntu):
status: Confirmed → Fix Committed
status: Fix Committed → Confirmed
Martin Pitt (pitti) wrote :

Making this a release blocker to avoid regressions in maverick.

Changed in eucalyptus (Ubuntu Maverick):
milestone: none → ubuntu-10.10-beta
status: Confirmed → Won't Fix
status: Won't Fix → Triaged
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package eucalyptus - 1.6.2-0ubuntu30.3

---------------
eucalyptus (1.6.2-0ubuntu30.3) lucid-proposed; urgency=low

  * debian/patches/lp566792-metadata-service.patch: Prevent invalid state
    transitions and fix address handling which mangled in-flight state
    changes. Cherry pick from upstream 1.6.2 branch (r1230..1232). (LP: #566792)
 -- Dave Walker (Daviey) <email address hidden> Wed, 30 Jun 2010 00:10:43 +0100

Changed in eucalyptus (Ubuntu Lucid):
status: Fix Committed → Fix Released
Thomy23 (thomas-stather) wrote :

Hi

Can it be that with this packages, the initscripts on the CC and CLC have changed from

eucalyptus-cloud (CLC)
eucalyptus-CC (CC)

to only "eucalyptus"? Because i installed the CC based on these packages and and "restart eucalyptus-cc" did nothing, whereas a "restart eucalyptus CLEAN=1" worked for me?

Greets Thomas

Added patch-forwarded-upstream since the upstream project uses launchpad for bugtracking

tags: added: patch-forwarded-upstream
removed: patch
papukaija (papukaija) on 2010-08-04
tags: added: lucid maverick patch
C de-Avillez (hggdh2) on 2010-08-04
tags: removed: patch

Sorry if this is the wrong place to post such a message, but I'm experiencing this issue on a production host and it's a significant issue. I have to reboot my machine regularly, and this bug is causing boot to take well over 30 minutes.

Is there a workaround I can use until the next stable release?
Can someone point me to the proper place to get support for this issue?

Thanks in advance

Scott Moser (smoser) wrote :

aaronbauman,
   This bug is believed to be fix-released in lucid (lucid-updates).
   I think its probably best if you open a new bug and describe the issue you're having.
   Please open the bug with 'ubuntu-bug eucalyptus' on the cluster controller that way related debug information will be collected.
   Thank you.

Dave Walker (davewalker) wrote :

I'm not seeing this behavior in Maverick, so marking fixed released for that task.

Changed in eucalyptus (Ubuntu Maverick):
status: Triaged → Fix Released
Changed in eucalyptus:
status: Fix Committed → Fix Released
To post a comment you must log in.