EC2 data source does not properly return the instance ID when cached data exists

Bug #1883907 reported by Robert Schweikert
This bug affects 1 person
Affects      Status    Importance   Assigned to   Milestone
cloud-init   Invalid   Undecided    Unassigned

Bug Description

When creating an AMI from a running instance without cleaning the cloud-init cache, an instance launched from the new AMI is not properly identified as a new instance, and none of the PER_INSTANCE tasks will be executed.

The problem is that the Ec2 data source will return the cached instance-id rather than the new instance ID.

Robert Schweikert (rjschwei) wrote :

The data source gets pickled to /var/lib/cloud/instance/obj.pkl and the information in this pickled object is read and taken at face value, i.e. the pickled object has the instance data from the "donor" instance built in. Therefore anything that accesses data extracted from "self.identity" is incorrect in the new instance.

It appears that the identity member of the object has to be set anew every time we run through the code. The idea seems to have been to save a call to the metadata server. However, there is no other way to validate the cache: a call to the metadata server for at least the instance ID (IID) is necessary to check whether we are still on the same instance.
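To illustrate the concern, here is a minimal sketch of a pickled cache that carries the donor's identity. The plain dict and the load_cache() helper are hypothetical stand-ins, not cloud-init's actual datasource object or API. The instance IDs are the ones from the logs in this report.

```python
import pickle

# Hypothetical stand-in for the pickled datasource state (a plain dict here,
# not cloud-init's actual Ec2 datasource object).
donor_state = {"identity": {"instanceId": "i-05e9c363543b1495a"}}
cache_blob = pickle.dumps(donor_state)


def load_cache(blob, live_instance_id=None):
    """Unpickle cached state; if a live ID is supplied, reject a mismatch."""
    state = pickle.loads(blob)
    if live_instance_id is None:
        # No validation: the cached identity is taken at face value.
        return state
    if state["identity"]["instanceId"] != live_instance_id:
        # The cache belongs to another instance (e.g. the donor).
        return None
    return state


# Without validation, the donor's identity leaks into the new instance.
stale = load_cache(cache_blob)

# With the live ID (assumed to come from the metadata server), it is rejected.
checked = load_cache(cache_blob, live_instance_id="i-06f5ab4826f90d11b")
```

In this sketch the only way to tell the two cases apart is to supply the live instance ID, which is the point made above.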

Robert Schweikert (rjschwei) wrote :

Further investigation shows that the data source eventually corrects itself and gets updated to the new instance data.

Initially in the log, with modified Ec2 data source to capture the identity:

{'accountId': '810320120389', 'architecture': 'x86_64', 'availabilityZone': 'us-east-1d', 'billingProducts': ['bp-6ca54005'], 'devpayProductCodes': None, 'marketplaceProductCodes': None, 'imageId': 'ami-0068cd63259e9f24c', 'instanceId': 'i-05e9c363543b1495a', 'instanceType': 't2.micro', 'kernelId': None, 'pendingTime': '2020-06-17T12:18:04Z', 'privateIp': '192.168.10.215', 'ramdiskId': None, 'region': 'us-east-1', 'version': '2017-09-30'}

This is the instance ID of the "donor" instance that was used to create the new AMI. Then later in the log:

{'accountId': '810320120389', 'architecture': 'x86_64', 'availabilityZone': 'us-east-1d', 'billingProducts': ['bp-6ca54005'], 'devpayProductCodes': None, 'marketplaceProductCodes': None, 'imageId': 'ami-032b9b77946912216', 'instanceId': 'i-06f5ab4826f90d11b', 'instanceType': 't2.micro', 'kernelId': None, 'pendingTime': '2020-06-17T13:46:17Z', 'privateIp': '192.168.10.222', 'ramdiskId': None, 'region': 'us-east-1', 'version': '2017-09-30'}

This is the data for the new instance, and it is correct.
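For reference, a small sketch that compares the two logged identity documents and lists the fields that changed. The changed_keys() helper is hypothetical, and the dicts are trimmed to a subset of the logged fields.

```python
# Subset of the two identity documents logged above.
donor = {
    "accountId": "810320120389",
    "imageId": "ami-0068cd63259e9f24c",
    "instanceId": "i-05e9c363543b1495a",
    "instanceType": "t2.micro",
    "pendingTime": "2020-06-17T12:18:04Z",
    "privateIp": "192.168.10.215",
    "region": "us-east-1",
}
new = {
    "accountId": "810320120389",
    "imageId": "ami-032b9b77946912216",
    "instanceId": "i-06f5ab4826f90d11b",
    "instanceType": "t2.micro",
    "pendingTime": "2020-06-17T13:46:17Z",
    "privateIp": "192.168.10.222",
    "region": "us-east-1",
}


def changed_keys(old, cur):
    """Return the sorted keys whose values differ between two documents."""
    return sorted(k for k in old.keys() | cur.keys() if old.get(k) != cur.get(k))


print(changed_keys(donor, new))
# → ['imageId', 'instanceId', 'pendingTime', 'privateIp']
```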

What is not yet clear is how this self correction interplays with the execution of user data scripts.

Ryan Harper (raharper) wrote :

Hi Robert,

Thanks for filing the issue. As you mentioned in your comment #2,
cloud-init detects that it is on a new instance during
cloud-init-local.service and removes/ignores the obj.pkl from
the previous instance.

In particular, when cloud-init attempts to restore from cache, it has
multiple checks it performs before using the cached object.

1) Does the runtime instance-id stored in
/run/cloud-init/.instance_id match what's in the cached object?

On new instances, this file does not *yet* exist this early in
first boot.

2) If existing=trust, we use the cache

On first boot, cloud-init-local.service runs first, and it specifies
existing=check

3) If the cached object has a 'check_instance_id' attribute, we call
the method, which verifies whether the cached value matches the
actual instance id.

EC2 datasource does not implement check_instance_id()
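The three checks above can be sketched roughly as follows. This is an illustration of the described order only, not cloud-init's actual code: should_use_cache(), read_run_iid(), and the CachedEc2 class are hypothetical names, and the no-argument check_instance_id() call is an assumption.

```python
from pathlib import Path


def read_run_iid(path):
    """Return the runtime instance-id from the given file, or None if absent."""
    p = Path(path)
    return p.read_text().strip() if p.exists() else None


def should_use_cache(cached_ds, existing, run_iid):
    """Decide whether a cached datasource object may be reused."""
    # 1) The runtime instance-id matches what the cached object holds.
    if run_iid is not None and run_iid == cached_ds.get_instance_id():
        return True
    # 2) The caller asked to trust whatever is cached.
    if existing == "trust":
        return True
    # 3) The datasource can verify its own instance id, if it implements that.
    checker = getattr(cached_ds, "check_instance_id", None)
    if callable(checker) and checker():
        return True
    return False


class CachedEc2:
    """Hypothetical cached object that lacks check_instance_id()."""

    def __init__(self, iid):
        self._iid = iid

    def get_instance_id(self):
        return self._iid


cached = CachedEc2("i-05e9c363543b1495a")
# First boot of a new instance: no runtime file yet, existing=check.
first_boot = should_use_cache(cached, "check", None)
```

With no runtime instance-id file, existing=check, and no check_instance_id() method, none of the three checks passes, so first_boot is False and the cached object is rejected.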

So, Ec2 will never use the cached object on a new instance during
cloud-init-local.service. After rejecting the cached object,
cloud-init-local will then call get_data() on the datasource; this
sets the _dirty_cache flag to True and then calls ._get_data(). For
Ec2, this crawls the IMDS network endpoint and sets the new
instance-id. Then stages._reflect_cur_instance() runs to
remove old symlinks and create the new instance directory; this is
what removes all of the previous instance's semaphore files, which
act as the per-instance locks used for instance idempotency.
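As a rough sketch of why a new instance directory causes per-instance tasks to run again, consider the following. The directory layout, the function names, and the "user_scripts" semaphore name are assumptions for illustration; this is not cloud-init's implementation. Here the old semaphores simply become unreachable once the symlink moves.

```python
import tempfile
from pathlib import Path


def reflect_instance(base, instance_id):
    """Point base/instance at base/instances/<id>, creating it if needed."""
    idir = base / "instances" / instance_id
    (idir / "sem").mkdir(parents=True, exist_ok=True)
    link = base / "instance"
    if link.is_symlink():
        link.unlink()  # drop the symlink to the previous instance
    link.symlink_to(idir, target_is_directory=True)
    return idir


def run_once_per_instance(base, name, task):
    """Run task unless a semaphore file for it exists for this instance."""
    sem = base / "instance" / "sem" / name
    if sem.exists():
        return False
    task()
    sem.touch()
    return True


ran = []
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    reflect_instance(base, "i-05e9c363543b1495a")
    first = run_once_per_instance(base, "user_scripts", lambda: ran.append("donor"))
    again = run_once_per_instance(base, "user_scripts", lambda: ran.append("donor"))
    # A new instance-id yields a fresh directory with no semaphores.
    reflect_instance(base, "i-06f5ab4826f90d11b")
    after_switch = run_once_per_instance(base, "user_scripts", lambda: ran.append("new"))
```

The task runs once for the donor ID, is skipped on the second call, and runs again after the instance ID changes.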

> What is not yet clear is how this self correction interplays with
> the execution of user data scripts.

All of the user-data scripts and config modules will run after
cloud-init-local has reconfigured the instance-id.

Ryan Harper (raharper) wrote :

I'm marking this invalid; if you find more information that would indicate that cloud-init is not booting like a new instance after capturing on Ec2, please reopen this bug and include updated information.

Changed in cloud-init:
status: New → Invalid
James Falcon (falcojr) wrote :