Sometimes the Instance Metadata API returns 404 due to an invalid InstanceID returned by _get_instance_and_tenant_id()

Bug #1836253 reported by william
This bug affects 2 people
Affects: neutron
Status: Won't Fix
Importance: Medium
Assigned to: Bence Romsics

Bug Description

Sometimes on instance initialization, the metadata step fails.

On metadata-agent.log there are lots of 404:
"GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 404 len: 297 time: 0.0771070

On nova-api.log we get 404 too:
"GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 404

After some debugging we found that the problem occurs when a new instance gets the same IP that was used by a deleted instance.
The problem is related to the cache implementation in the method "_get_ports_for_remote_address()" in "/neutron/agent/metadata/agent.py": it returns a port belonging to the deleted instance (with the same IP), which yields the wrong InstanceID. That InstanceID is then sent to nova-api, which fails because the instance no longer exists.
This problem only occurs with the cache enabled on the neutron metadata-agent.

Version: Queens

How to reproduce:
---
#!/bin/bash

computenodelist=(
  'computenode00.test.openstack.net'
  'computenode01.test.openstack.net'
  'computenode02.test.openstack.net'
  'computenode03.test.openstack.net'
)

validate_metadata() {
cat << EOF > /tmp/metadata
#!/bin/sh -x
if curl 192.168.10.2
then
 echo "ControllerNode00 - OK"
else
 echo "ControllerNode00 - ERROR"
fi
EOF

  #SUBNAME=$(date +%s)
  openstack server delete "${node}" 2>/dev/null
  source /root/admin-openrc
  openstack server create --image cirros --nic net-id=internal --flavor Cirros --security-group default --user-data /tmp/metadata --availability-zone nova:${node} --wait "${node}" &> /dev/null

  i=0
  until [ "$i" -gt 3 ] || openstack console log show "${node}" | grep -q "ControllerNode00"
  do
    i=$((i+1))
    sleep 1
  done
  if openstack console log show "${node}" | grep -q "ControllerNode00 - OK"; then
        echo "Metadata Servers OK: ${node}"
  else
        echo "Metadata Servers ERROR: ${node}"
  fi

  rm /tmp/metadata
}

for node in "${computenodelist[@]}"
do
  export node
  validate_metadata
done
echo -e "\n"
---

william (wfelipew)
description: updated
Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Revision history for this message
Bence Romsics (bence-romsics) wrote :

I'll take a look at this.

Changed in neutron:
assignee: nobody → Bence Romsics (bence-romsics)
Revision history for this message
Bence Romsics (bence-romsics) wrote :

I managed to reproduce the error in my environment (on the master branch).

The config needed in devstack:

/etc/neutron/metadata_agent.ini
[cache]
enabled = True
backend = dogpile.cache.memory

I simplified the script a bit:

#!/bin/bash

cat << EOF > /tmp/metadata
#!/bin/sh
if curl http://169.254.169.254/openstack
then
    echo "XXX metadata OK"
else
    echo "XXX metadata ERROR"
fi
EOF

openstack server delete vm0 2>/dev/null
openstack server create \
    --flavor cirros256 \
    --image cirros-0.4.0-x86_64-disk \
    --nic net-id=private,v4-fixed-ip=10.0.0.41 \
    --user-data /tmp/metadata \
    --wait \
    vm0 \
    >/dev/null
openstack server show vm0 -f value -c addresses

sleep 4

openstack console log show vm0 | grep "XXX metadata"

-- end of script --

10.0.0.41 is an arbitrary fixed-ip I selected from the range of the private network. The first run finds "XXX metadata OK". Starting from the second run (which re-uses the same IP), the following errors can be seen instead in the console log:

checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 3.26. request failed
failed 2/20: up 5.32. request failed
failed 3/20: up 7.45. request failed
failed 4/20: up 9.52. request failed
...

Revision history for this message
Bence Romsics (bence-romsics) wrote :

Hi William,

From the original change (https://review.opendev.org/70689) I realized that invalidation of this cache was originally designed to be done by time-to-live. In Queens and in master this config option controls the TTL I think (earlier it may have had a different config option name):

/etc/neutron/metadata_agent.ini
[cache]
# seconds
expiration_time = ...

On master I see the effect of the cache expiration. Have you tried tweaking this to your needs?

Cheers,
Bence

Revision history for this message
Bence Romsics (bence-romsics) wrote :

I don't know when William will read my previous comment, but overall what I found is this:

The cache of metadata-agent was designed to be invalidated by time-based expiry. That method has the reported kind of side effect if a client is too fast, which is not perfect, but it can usually be addressed by tweaking the cache TTL and/or waiting longer in the client.
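The interaction between the TTL and a reused IP can be sketched as follows. This is a minimal illustration using a plain dict with timestamps, standing in for dogpile.cache's expiration_time behaviour; it is not neutron's actual implementation, and `backend` here is just a hypothetical lookup source:

```python
import time

class TTLCache:
    """Toy time-based cache: an entry older than ttl seconds is refreshed."""
    def __init__(self, ttl):
        self.ttl = ttl
        self.data = {}

    def get_or_create(self, key, creator):
        entry = self.data.get(key)
        if entry is not None and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        value = creator()
        self.data[key] = (value, time.monotonic())
        return value

backend = {"10.0.0.41": "instance-OLD"}  # hypothetical port lookup source
cache = TTLCache(ttl=0.1)  # corresponds to [cache]/expiration_time, in seconds

assert cache.get_or_create("10.0.0.41", lambda: backend["10.0.0.41"]) == "instance-OLD"

backend["10.0.0.41"] = "instance-NEW"  # the IP is reused by a new instance

# Within the TTL the stale value is still served...
assert cache.get_or_create("10.0.0.41", lambda: backend["10.0.0.41"]) == "instance-OLD"

time.sleep(0.2)

# ...but once expiration_time has passed, the next lookup refreshes the entry.
assert cache.get_or_create("10.0.0.41", lambda: backend["10.0.0.41"]) == "instance-NEW"
```

This is why a shorter expiration_time, or a client that retries past the TTL window, hides the symptom even though the cache itself is never explicitly invalidated.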

A more correct cache invalidation is theoretically possible, but I think it is not feasible, because it would introduce cross-dependencies between metadata-agent and far-away parts of neutron.

Therefore I'm inclined to mark this bug report as Invalid (not a bug). Let me know please if I missed something here.

Changed in neutron:
status: Confirmed → Invalid
Changed in neutron:
status: Invalid → Won't Fix
Revision history for this message
Gustavo Randich (gustavo-randich) wrote :

In my case this metadata cache issue is causing failures in our terraform builds.

Is there any possibility of a fix that simply doesn't use an incorrect instance ID when requesting metadata?
