Some instances can't connect to metadata due to ARP failure

Bug #719798 reported by Hyunsun Moon
This bug affects 1 person
Affects: OpenStack Compute (nova) · Status: Fix Released · Importance: High · Assigned to: Vish Ishaya

Bug Description

Instances that have a local route (for example, due to installing NetworkManager in Ubuntu) cannot contact the metadata server: they send out an ARP who-has for 169.254.169.254 and never get a response. This issue also affects Windows VMs. It can be worked around by giving the IP address to the host that is running nova-network, along the lines of:

ip addr add 169.254.169.254/32 scope link dev eth1

This causes the network host to respond to the ARP, so the metadata request succeeds. Nova should add this address automatically to avoid the failure.
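The check-and-add step nova could perform can be sketched in POSIX sh. This is only an illustration, not the actual fix that landed: the `needs_metadata_ip` helper and the eth1 interface name are made up for the example, and the command is printed rather than executed, since adding an address requires root.

```shell
# Decide, from `ip addr show dev <iface>` output, whether the metadata
# address still needs to be added (helper name is hypothetical).
needs_metadata_ip() {
    case "$1" in
        *"169.254.169.254/32"*) return 1 ;;  # already configured
        *) return 0 ;;                       # missing
    esac
}

# On a real host this would be: addrs=$(ip addr show dev eth1)
addrs="inet 10.0.0.1/24 scope global eth1"
if needs_metadata_ip "$addrs"; then
    # Run the printed command as root on the nova-network host:
    echo "ip addr add 169.254.169.254/32 scope link dev eth1"
fi
```

Because the check matches the existing address list first, running it on every nova-network start is safe.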

Example error messages are below:

The instance fails to reach the metadata server at launch, so it cannot complete its initial boot process, including sshd startup.

Logs from a ttylinux image.
=========================================

Lease of 10.0.0.5 obtained, lease time 120
starting DHCP for Ethernet interface eth0 [ OK ]
cloud-setup: checking http://169.254.169.254/2009-04-04/meta-data/instance-id
cloud-setup: failed 1/30: up 7.97. iid had 1.0
cloud-setup: failed 2/30: up 9.18. iid had 1.0
cloud-setup: failed 3/30: up 10.35. iid had 1.0
cloud-setup: failed 4/30: up 11.52. iid had 1.0
cloud-setup: failed 5/30: up 12.70. iid had 1.0
cloud-setup: failed 6/30: up 13.88. iid had 1.0
cloud-setup: failed 7/30: up 15.06. iid had 1.0
cloud-setup: failed 8/30: up 16.24. iid had 1.0
cloud-setup: failed 9/30: up 17.43. iid had 1.0
cloud-setup: failed 10/30: up 18.62. iid had 1.0
cloud-setup: failed 11/30: up 19.81. iid had 1.0
cloud-setup: failed 12/30: up 21.00. iid had 1.0
cloud-setup: failed 13/30: up 22.20. iid had 1.0
cloud-setup: failed 14/30: up 23.40. iid had 1.0
cloud-setup: failed 15/30: up 24.60. iid had 1.0
cloud-setup: failed 16/30: up 25.80. iid had 1.0
cloud-setup: failed 17/30: up 27.01. iid had 1.0
cloud-setup: failed 18/30: up 28.22. iid had 1.0
cloud-setup: failed 19/30: up 29.43. iid had 1.0
cloud-setup: failed 20/30: up 30.65. iid had 1.0
cloud-setup: failed 21/30: up 31.86. iid had 1.0
cloud-setup: failed 22/30: up 33.08. iid had 1.0
cloud-setup: failed 23/30: up 34.30. iid had 1.0
cloud-setup: failed 24/30: up 35.60. iid had 1.0
cloud-setup: failed 25/30: up 36.89. iid had 1.0
cloud-setup: failed 26/30: up 38.11. iid had 1.0
cloud-setup: failed 27/30: up 39.34. iid had 1.0
cloud-setup: failed 28/30: up 40.56. iid had 1.0
cloud-setup: failed 29/30: up 41.82. iid had 1.0
cloud-setup: failed 30/30: up 43.05. iid had 1.0
cloud-setup: after 30 fails, debugging
cloud-setup: running debug (30 tries reached)
############ debug start ##############
### /etc/rc.d/init.d/sshd start
stty: /dev/console
generating DSS host key [ OK ]
generating RSA host key [ OK ]
startup dropbear [ OK ]
### ifconfig -a
eth0 Link encap:Ethernet HWaddr 02:16:3E:57:D3:B5
=========================================

Logs from UEC image.
=========================================
init: plymouth-splash main process (263) terminated with status 2
init: plymouth main process (48) killed by SEGV signal
cloud-init running: Tue, 15 Feb 2011 09:55:54 +0000. up 30.11 seconds
consuming user data failed!
Traceback (most recent call last):
  File "/usr/bin/cloud-init", line 103, in <module>
    main()
  File "/usr/bin/cloud-init", line 60, in main
    cloud.consume_userdata,[],False)
  File "/usr/lib/python2.6/dist-packages/cloudinit/__init__.py", line 215, in sem_and_run
    if self.sem_has_run(semname,freq): return
  File "/usr/lib/python2.6/dist-packages/cloudinit/__init__.py", line 173, in sem_has_run
    semfile = self.sem_getpath(name,freq)
  File "/usr/lib/python2.6/dist-packages/cloudinit/__init__.py", line 167, in sem_getpath
    freqtok = self.datasource.get_instance_id()
  File "/usr/lib/python2.6/dist-packages/cloudinit/DataSourceEc2.py", line 65, in get_instance_id
    return(self.metadata['instance-id'])
KeyError: 'instance-id'
init: cloud-init main process (334) terminated with status 1
mountall: Event failed
mountall: Plymouth command failed
mountall: Plymouth command failed
mountall: Plymouth command failed
mountall: Plymouth command failed
mountall: Disconnected from Plymouth
init: plymouth-log main process (364) terminated with status 1
 * Starting AppArmor profiles [ OK ]
Traceback (most recent call last):
  File "/usr/bin/cloud-init-cfg", line 56, in <module>
    main()
  File "/usr/bin/cloud-init-cfg", line 43, in main
    cc = cloudinit.CloudConfig.CloudConfig(cfg_path)
  File "/usr/lib/python2.6/dist-packages/cloudinit/CloudConfig.py", line 42, in __init__
    self.cfg = self.get_config_obj(cfgfile)
  File "/usr/lib/python2.6/dist-packages/cloudinit/CloudConfig.py", line 53, in get_config_obj
    f=file(cfgfile)
IOError: [Errno 2] No such file or directory: '/var/lib/cloud/data/cloud-config.txt'
[the same cloud-init-cfg traceback is printed several more times, with output from concurrent processes interleaved]
landscape-client is not configured, please run landscape-config.

Revision history for this message
Thierry Carrez (ttx) wrote :

What network mode are you using? Modes other than VlanManager require specific routing for the metadata server to work.

Changed in nova:
status: New → Incomplete
Revision history for this message
Hyunsun Moon (hyunsun-moon) wrote :

It was default VLAN mode.

Revision history for this message
Wayne A. Walls (wayne-walls) wrote :

Greetings!

I've messed around quite a bit with the UEC images, and I found that adding an iptables NAT rule wherever your nova-api service runs fixes the boot problems. You'd want something like this...

iptables -t nat -A PREROUTING -d 169.254.169.254/32 -p tcp -m tcp --dport 80 -j DNAT --to-destination <NOVA-API-SERVER-IP>:8773

Thierry is right though, VlanManager /usually/ doesn't need this, but it's worth a shot. Lastly, to be honest, it looks like there is more going on with your UEC image than the metadata server not being reachable. If that were the only problem, you'd likely see something more along the lines of 'Cannot reach metadata server ... trying again 1/100.' The image will try 100 times to contact the metadata server; if it can't, it will sometimes continue booting, but in my experience it just loops :(

Give it a try, and let us know how it goes!

Cheers

Revision history for this message
Hyunsun Moon (hyunsun-moon) wrote :

I've already tried the iptables command and it didn't work for me.
The reason I need to access the metadata server is for a cloudpipe instance, to get 'autorun.sh' from the server.

Here's my 'iptables -L' result; "cloud02" is the hostname of the API server.
Anything wrong?

Chain INPUT (policy DROP)
target prot opt source destination
ACCEPT udp -- anywhere anywhere udp dpt:domain
ACCEPT tcp -- anywhere anywhere tcp dpt:domain
ACCEPT udp -- anywhere anywhere udp dpt:bootps
ACCEPT tcp -- anywhere anywhere tcp dpt:bootps
ACCEPT all -- anywhere anywhere
ACCEPT icmp -- anywhere anywhere icmp any
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:ftp
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:ssh
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:telnet
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:smtp
ACCEPT tcp -- anywhere anywhere state NEW tcp dpt:www
DROP all -- anywhere anywhere state INVALID
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
ACCEPT tcp -- anywhere cloud02 tcp dpt:ssh
ACCEPT udp -- anywhere anywhere udp dpt:ntp
nova_input all -- anywhere anywhere
ACCEPT icmp -- anywhere anywhere
REJECT tcp -- anywhere anywhere reject-with tcp-reset
REJECT all -- anywhere anywhere reject-with icmp-port-unreachable

Chain FORWARD (policy DROP)
target prot opt source destination
nova-local all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT all -- anywhere anywhere
ACCEPT udp -- anywhere 10.0.0.2 udp dpt:openvpn
ACCEPT all -- anywhere 192.168.122.0/24 state RELATED,ESTABLISHED
ACCEPT all -- 192.168.122.0/24 anywhere
ACCEPT all -- anywhere anywhere
REJECT all -- anywhere anywhere reject-with icmp-port-unreachable
REJECT all -- anywhere anywhere reject-with icmp-port-unreachable
DROP all -- anywhere anywhere state INVALID
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
TCPMSS tcp -- anywhere anywhere tcp flags:SYN,RST/SYN TCPMSS clamp to PMTU
nova_forward all -- anywhere anywhere

Chain OUTPUT (policy ACCEPT)
target prot opt source destination
nova-local all -- anywhere anywhere
DROP all -- anywhere anywhere state INVALID
ACCEPT all -- anywhere anywhere state RELATED,ESTABLISHED
nova_output all -- anywhere anywhere

Chain nova-fallback (1 references)
target prot opt source de...


Revision history for this message
Thierry Carrez (ttx) wrote :

I don't really know where this fails, but the plan is to simplify metadata access by serving metadata from the local compute node rather than from the API node. That should work in every network mode and not rely on a fancy routes/rules/bridges setup.

Revision history for this message
Vish Ishaya (vishvananda) wrote :

If this is a desktop image, you may have to give the 169.254 address to the network host:
something like:
ip addr add 169.254.169.254/32 scope link dev eth1
This will allow it to answer ARP for the address. The eth device you add the address to isn't particularly important; however, if you decide to add it to br100 you should probably use scope global instead of scope link, or the ordering of IP addresses can sometimes mess up DHCP.
If this is not a desktop image, then you may be having issues with your forwarding rules. Check:
iptables -L -n -v
for the 169.254 rule. Make sure that the rule has the proper IP for your API server and that the rule is actually being hit.

The ip addr add command really needs to be done automatically; I consider this a must-have for cactus. The workaround of adding it manually is too much to expect users to do.
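The two variants described above, side by side. Interface names eth1 and br100 are the examples from this thread; the commands are echoed here so the sketch is safe to run unprivileged — on a real host, execute them as root instead:

```shell
# Plain ethernet device on the network host: link scope works.
echo "ip addr add 169.254.169.254/32 scope link dev eth1"
# On the bridge, prefer global scope, or the ordering of IP
# addresses can mess up DHCP.
echo "ip addr add 169.254.169.254/32 scope global dev br100"
```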

Changed in nova:
status: Incomplete → Triaged
importance: Undecided → High
summary: - Instance fails to access metadata server
+ Some instances can't connect to metadata due to ARP failure
description: updated
Changed in nova:
assignee: nobody → Vish Ishaya (vishvananda)
milestone: none → cactus-gamma
status: Triaged → In Progress
Thierry Carrez (ttx)
Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: cactus-gamma → none
Thierry Carrez (ttx)
Changed in nova:
milestone: none → 2011.2
status: Fix Committed → Fix Released