Sometimes the instance metadata API returns 404 due to an invalid InstanceID returned by _get_instance_and_tenant_id()

Bug #1836253 reported by william on 2019-07-11
This bug affects 2 people
Affects: neutron
Status: Won't Fix
Importance: Medium
Assigned to: Bence Romsics

Bug Description

Sometimes, during instance initialization, the metadata step fails.

In metadata-agent.log there are lots of 404s:
"GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 404 len: 297 time: 0.0771070

In nova-api.log we get 404 too:
"GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 404

After some debugging we found that the problem occurs when a new instance gets the same IP that was used by a deleted instance.
The problem is related to the cache implementation in the method "_get_ports_for_remote_address()" in "neutron/agent/metadata/agent.py": it returns a port belonging to the deleted instance (which had the same IP), so the wrong InstanceID is sent to nova-api, and the request fails because that InstanceID no longer exists.
This problem only occurs with the cache enabled in the neutron metadata-agent.
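
For illustration only, here is a minimal, self-contained sketch (not the actual neutron code; the dogpile.cache usage and all names here are mine) of how a purely TTL-based cache keyed by the remote IP keeps answering with the old instance until the entry expires:

---
#!/usr/bin/env python3
# Minimal sketch of the failure mode: a TTL-only cache keyed by the remote IP
# keeps returning the old instance until expiration_time passes, even though
# the instance behind that IP has already been replaced.
import time

from dogpile.cache import make_region

region = make_region().configure('dogpile.cache.memory', expiration_time=5)

# Hypothetical stand-in for the real per-IP port lookup.
ports_by_ip = {'10.0.0.41': 'instance-OLD'}


@region.cache_on_arguments()
def instance_for_ip(ip):
    return ports_by_ip[ip]


print(instance_for_ip('10.0.0.41'))        # instance-OLD (now cached)
ports_by_ip['10.0.0.41'] = 'instance-NEW'  # old instance deleted, new one reuses the IP
print(instance_for_ip('10.0.0.41'))        # still instance-OLD -> nova-api answers 404
time.sleep(6)
print(instance_for_ip('10.0.0.41'))        # instance-NEW once the TTL has expired
---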

Version: Queens

How to reproduce:
---
#!/bin/bash

computenodelist=(
  'computenode00.test.openstack.net'
  'computenode01.test.openstack.net'
  'computenode02.test.openstack.net'
  'computenode03.test.openstack.net'
)

validate_metadata(){
cat << EOF > /tmp/metadata
#!/bin/sh -x
if curl 192.168.10.2
then
 echo "ControllerNode00 - OK"
else
 echo "ControllerNode00 - ERROR"
fi
EOF

  #SUBNAME=$(date +%s)
  openstack server delete "${node}" 2>/dev/null
  source /root/admin-openrc
  openstack server create --image cirros --nic net-id=internal --flavor Cirros --security-group default --user-data /tmp/metadata --availability-zone nova:${node} --wait "${node}" &> /dev/null

  i=0
  until [ $i -gt 3 ] || openstack console log show "${node}" | grep -q "ControllerNode00"
  do
    i=$((i+1))
    sleep 1
  done
  openstack console log show "${node}" | grep -q "ControllerNode00 - OK"
  if [ $? == 0 ]; then
        echo "Metadata Servers OK: ${node}"
  else
        echo "Metadata Servers ERROR: ${node}"
  fi

  rm /tmp/metadata
}

for node in "${computenodelist[@]}"
do
  export node
  validate_metadata
done
echo -e "\n"
---

william (wfelipew) on 2019-07-11
description: updated
Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Bence Romsics (bence-romsics) wrote :

I'll take a look at this.

Changed in neutron:
assignee: nobody → Bence Romsics (bence-romsics)
Bence Romsics (bence-romsics) wrote :

I managed to reproduce the error in my environment (on the master branch).

The config needed in devstack:

/etc/neutron/metadata_agent.ini
[cache]
enabled = True
backend = dogpile.cache.memory

I simplified the script a bit:

#!/bin/bash

cat << EOF > /tmp/metadata
#!/bin/sh
if curl http://169.254.169.254/openstack
then
    echo "XXX metadata OK"
else
    echo "XXX metadata ERROR"
fi
EOF

openstack server delete vm0 2>/dev/null
openstack server create \
    --flavor cirros256 \
    --image cirros-0.4.0-x86_64-disk \
    --nic net-id=private,v4-fixed-ip=10.0.0.41 \
    --user-data /tmp/metadata \
    --wait \
    vm0 \
    >/dev/null
openstack server show vm0 -f value -c addresses

sleep 4

openstack console log show vm0 | grep "XXX metadata"

-- end of script --

10.0.0.41 is an arbitrary fixed IP I selected from the range of the private network. The first run finds "XXX metadata OK". Starting from the second run (which re-uses the same IP), the following errors can be seen instead in the console log:

checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 3.26. request failed
failed 2/20: up 5.32. request failed
failed 3/20: up 7.45. request failed
failed 4/20: up 9.52. request failed
...

Bence Romsics (bence-romsics) wrote :

Hi William,

From the original change (https://review.opendev.org/70689) I realized that invalidation of this cache was originally designed to be done by time-to-live. In Queens and in master this config option controls the TTL I think (earlier it may have had a different config option name):

/etc/neutron/metadata_agent.ini
[cache]
# seconds
expiration_time = ...
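
For example, with a short TTL (the value here is only illustrative; pick whatever suits your boot times) the stale entry is dropped quickly:

/etc/neutron/metadata_agent.ini
[cache]
enabled = True
backend = dogpile.cache.memory
# cached lookups older than this many seconds are refreshed on the next request
expiration_time = 5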

On master I see the effect of the cache expiration. Have you tried tweaking this to your needs?

Cheers,
Bence

Bence Romsics (bence-romsics) wrote :

I don't know when William will read my previous comment, but overall what I found is this:

The cache of the metadata-agent was designed to be invalidated by time-based expiry. That method has the reported kind of side effect if a client is too fast. This is not perfect, but it can usually be addressed by tweaking the cache TTL and/or waiting longer in the client.

A more correct cache invalidation is theoretically possible, but I think it is not feasible, because it would introduce cross-dependencies between metadata-agent and far-away parts of neutron.
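
To make that concrete, here is a rough sketch (the names and the hook are hypothetical; this is not how the agent is structured) of what event-driven invalidation would have to look like: something would have to tell the agent about port deletions so it can drop the stale entry instead of waiting for the TTL:

---
# Hypothetical sketch only: event-driven invalidation of the per-IP cache.
# The on_port_deleted() hook does not exist; wiring it up is exactly the
# cross-dependency mentioned above.
from dogpile.cache import make_region
from dogpile.cache.api import NO_VALUE

region = make_region().configure('dogpile.cache.memory', expiration_time=300)


def _key(ip):
    return 'ports-for-ip:%s' % ip


def cached_ports_for_ip(ip, lookup):
    # Return the cached lookup result for this IP, computing it on a miss.
    value = region.get(_key(ip))
    if value is NO_VALUE:
        value = lookup(ip)
        region.set(_key(ip), value)
    return value


def on_port_deleted(ip):
    # Would need to be called whenever a port is deleted elsewhere in neutron,
    # so the next metadata request re-queries instead of using the stale port.
    region.delete(_key(ip))
---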

Therefore I'm inclined to mark this bug report as Invalid (not a bug). Let me know please if I missed something here.

Changed in neutron:
status: Confirmed → Invalid
Changed in neutron:
status: Invalid → Won't Fix

In my case this metadata cache issue is causing failures in our Terraform builds.

Is there any possibility of a fix that simply doesn't use an incorrect instance ID when requesting metadata?
