Sometimes the instance metadata API returns 404 due to an invalid InstanceID returned by _get_instance_and_tenant_id()

Bug #1836253 reported by william on 2019-07-11
This bug affects 2 people
Affects: neutron
Status: Won't Fix
Importance: Medium
Assigned to: Bence Romsics

Bug Description

Sometimes, during instance initialization, the metadata step fails.

In metadata-agent.log there are lots of 404s:
"GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 404 len: 297 time: 0.0771070

In nova-api.log we get 404 too:
"GET /2009-04-04/meta-data/instance-id HTTP/1.1" status: 404

After some debugging we found that the problem occurs when a new instance gets the same IP that was used by a deleted instance.
The problem is related to the cache implementation in the method "_get_ports_for_remote_address()" in "neutron/agent/metadata/agent.py": it returns a port belonging to the deleted instance (which had the same IP), so the wrong InstanceID is sent to nova-api, and the request fails because that InstanceID no longer exists.
This problem only occurs with the cache enabled in the neutron metadata-agent.
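
For illustration only, here is a minimal, self-contained sketch (not the actual neutron code; the dogpile.cache usage and all names here are mine) of how a purely TTL-based cache keyed by the remote IP keeps answering with the old instance until the entry expires:

---
#!/usr/bin/env python3
# Minimal sketch of the failure mode: a TTL-only cache keyed by the remote IP
# keeps returning the old instance until expiration_time passes, even though
# the instance behind that IP has already been replaced.
import time

from dogpile.cache import make_region

region = make_region().configure('dogpile.cache.memory', expiration_time=5)

# Hypothetical stand-in for the real per-IP port lookup.
ports_by_ip = {'10.0.0.41': 'instance-OLD'}


@region.cache_on_arguments()
def instance_for_ip(ip):
    return ports_by_ip[ip]


print(instance_for_ip('10.0.0.41'))        # instance-OLD (now cached)
ports_by_ip['10.0.0.41'] = 'instance-NEW'  # old instance deleted, new one reuses the IP
print(instance_for_ip('10.0.0.41'))        # still instance-OLD -> nova-api answers 404
time.sleep(6)
print(instance_for_ip('10.0.0.41'))        # instance-NEW once the TTL has expired
---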

Version: Queens

How to reproduce:
---
#!/bin/bash

computenodelist=(
  'computenode00.test.openstack.net'
  'computenode01.test.openstack.net'
  'computenode02.test.openstack.net'
  'computenode03.test.openstack.net'
)

validate_metadata(){
cat << EOF > /tmp/metadata
#!/bin/sh -x
if curl 192.168.10.2
then
 echo "ControllerNode00 - OK"
else
 echo "ControllerNode00 - ERROR"
fi
EOF

  #SUBNAME=$(date +%s)
  openstack server delete "${node}" 2>/dev/null
  source /root/admin-openrc
  openstack server create --image cirros --nic net-id=internal --flavor Cirros --security-group default --user-data /tmp/metadata --availability-zone nova:${node} --wait "${node}" &> /dev/null

  i=0
  until [ $i -gt 3 ] || openstack console log show "${node}" | grep -q "ControllerNode00"
  do
    i=$((i+1))
    sleep 1
  done
  openstack console log show "${node}" | grep -q "ControllerNode00 - OK"
  if [ $? == 0 ]; then
        echo "Metadata Servers OK: ${node}"
  else
        echo "Metadata Servers ERROR: ${node}"
  fi

  rm /tmp/metadata
}

for node in "${computenodelist[@]}"
do
  export node
  validate_metadata
done
echo -e "\n"
---

william (wfelipew) on 2019-07-11
description: updated
Changed in neutron:
importance: Undecided → Medium
status: New → Confirmed
Bence Romsics (bence-romsics) wrote :

I'll take a look at this.

Changed in neutron:
assignee: nobody → Bence Romsics (bence-romsics)
Bence Romsics (bence-romsics) wrote :

I managed to reproduce the error in my environment (on the master branch).

The config needed in devstack:

/etc/neutron/metadata_agent.ini
[cache]
enabled = True
backend = dogpile.cache.memory

I simplified the script a bit:

#!/bin/bash

cat << EOF > /tmp/metadata
#!/bin/sh
if curl http://169.254.169.254/openstack
then
    echo "XXX metadata OK"
else
    echo "XXX metadata ERROR"
fi
EOF

openstack server delete vm0 2>/dev/null
openstack server create \
    --flavor cirros256 \
    --image cirros-0.4.0-x86_64-disk \
    --nic net-id=private,v4-fixed-ip=10.0.0.41 \
    --user-data /tmp/metadata \
    --wait \
    vm0 \
    >/dev/null
openstack server show vm0 -f value -c addresses

sleep 4

openstack console log show vm0 | grep "XXX metadata"

-- end of script --

10.0.0.41 is an arbitrary fixed IP I selected from the range of the private network. The first run finds "XXX metadata OK". Starting from the second run (which re-uses the same IP), the following errors can be seen instead in the console log:

checking http://169.254.169.254/2009-04-04/instance-id
failed 1/20: up 3.26. request failed
failed 2/20: up 5.32. request failed
failed 3/20: up 7.45. request failed
failed 4/20: up 9.52. request failed
...

Bence Romsics (bence-romsics) wrote :

Hi William,

From the original change (https://review.opendev.org/70689) I realized that invalidation of this cache was originally designed to be done by time-to-live. In Queens and in master this config option controls the TTL I think (earlier it may have had a different config option name):

/etc/neutron/metadata_agent.ini
[cache]
# seconds
expiration_time = ...
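
For example, with a short TTL (the value here is only illustrative; pick whatever suits your boot times) the stale entry is dropped quickly:

/etc/neutron/metadata_agent.ini
[cache]
enabled = True
backend = dogpile.cache.memory
# cached lookups older than this many seconds are refreshed on the next request
expiration_time = 5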

On master I see the effect of the cache expiration. Have you tried tweaking this to your needs?

Cheers,
Bence

Bence Romsics (bence-romsics) wrote :

I don't know when William will read my previous comment, but overall what I found is this:

The cache of the metadata-agent was designed to be invalidated by time-based expiry. That method has the reported kind of side effect if a client is too fast. This is not perfect, but it can usually be addressed by tweaking the cache TTL and/or waiting longer in the client.

A more correct cache invalidation is theoretically possible, but I think it is not feasible, because it would introduce cross-dependencies between metadata-agent and far-away parts of neutron.
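
To make that concrete, here is a rough sketch (the names and the hook are hypothetical; this is not how the agent is structured) of what event-driven invalidation would have to look like: something would have to tell the agent about port deletions so it can drop the stale entry instead of waiting for the TTL:

---
# Hypothetical sketch only: event-driven invalidation of the per-IP cache.
# The on_port_deleted() hook does not exist; wiring it up is exactly the
# cross-dependency mentioned above.
from dogpile.cache import make_region
from dogpile.cache.api import NO_VALUE

region = make_region().configure('dogpile.cache.memory', expiration_time=300)


def _key(ip):
    return 'ports-for-ip:%s' % ip


def cached_ports_for_ip(ip, lookup):
    # Return the cached lookup result for this IP, computing it on a miss.
    value = region.get(_key(ip))
    if value is NO_VALUE:
        value = lookup(ip)
        region.set(_key(ip), value)
    return value


def on_port_deleted(ip):
    # Would need to be called whenever a port is deleted elsewhere in neutron,
    # so the next metadata request re-queries instead of using the stale port.
    region.delete(_key(ip))
---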

Therefore I'm inclined to mark this bug report as Invalid (not a bug). Let me know please if I missed something here.

Changed in neutron:
status: Confirmed → Invalid
Changed in neutron:
status: Invalid → Won't Fix

In my case this metadata cache issue is causing failures in our Terraform builds.

Is there any possibility of a fix that simply doesn't use an incorrect instance ID when requesting metadata?
