can't get IP for some instance when I keep booting instances in a high rate

Bug #1185916 reported by li,chen on 2013-05-30
This bug affects 10 people
Affects: neutron
Importance: Undecided
Assigned to: Édouard Thuleau

Bug Description

I'm running an instance provisioning test on Grizzly.
I boot a new instance every 3 seconds and run the test for about 5 minutes.
I found that some instances never get an IP address from DHCP. Even after a soft reboot and a hard reboot from Horizon, there is still no DHCP reply.

I'm pretty sure it is not an installation issue, because all the other instances boot up with an IP address just fine. It is also not specific to a physical host: when I have three instances running on one host, two of them can work fine while the third does not.

All the quantum logs show it working fine, except ovs-vswitchd.log.
I searched that log for "tapa2bf7d8a-7f", which is the virtual port with the issue:

May 30 14:15:57|01711|bridge|INFO|created port tapa2bf7d8a-7f on bridge br-int
May 30 14:41:03|01729|netdev_linux|WARN|ethtool command ETHTOOL_GSET on network device tapa2bf7d8a-7f failed: No such device
May 30 14:41:03|01730|netdev_linux|INFO|ioctl(SIOCGIFHWADDR) on tapa2bf7d8a-7f device failed: No such device
May 30 14:41:03|01731|netdev|WARN|failed to get flags for network device tapa2bf7d8a-7f: No such device
May 30 14:41:03|01732|netdev|WARN|failed to retrieve MTU for network device tapa2bf7d8a-7f: No such device
May 30 14:41:03|01779|netdev|WARN|failed to get flags for network device tapa2bf7d8a-7f: No such device
May 30 14:41:04|01780|bridge|INFO|destroyed port tapa2bf7d8a-7f on bridge br-int
May 30 14:41:06|01782|bridge|INFO|created port tapa2bf7d8a-7f on bridge br-int
May 30 15:04:05|01784|netdev_linux|WARN|ethtool command ETHTOOL_GSET on network device tapa2bf7d8a-7f failed: No such device
May 30 15:04:05|01785|netdev_linux|INFO|ioctl(SIOCGIFHWADDR) on tapa2bf7d8a-7f device failed: No such device
May 30 15:04:05|01786|netdev|WARN|failed to get flags for network device tapa2bf7d8a-7f: No such device
May 30 15:04:05|01787|netdev|WARN|failed to retrieve MTU for network device tapa2bf7d8a-7f: No such device
May 30 15:04:05|01792|netdev|WARN|failed to get flags for network device tapa2bf7d8a-7f: No such device
May 30 15:04:06|01794|netdev|WARN|failed to get flags for network device tapa2bf7d8a-7f: No such device
May 30 15:04:06|01795|bridge|INFO|destroyed port tapa2bf7d8a-7f on bridge br-int
May 30 15:04:08|01797|bridge|INFO|created port tapa2bf7d8a-7f on bridge br-int
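
For anyone who wants to repeat the keyword search, here is a minimal Python sketch that reduces a log like the one above to the create/destroy events for a single port. It assumes the pipe-separated layout shown in the excerpt (timestamp, sequence number, module, level, message); the function names are hypothetical and not part of any Open vSwitch tooling.

```python
def port_events(lines, port):
    """Return (timestamp, level, message) for each log line that mentions `port`.

    Assumes the 'timestamp|seq|module|level|message' layout seen in the excerpt.
    """
    events = []
    for line in lines:
        if port not in line:
            continue
        parts = line.rstrip("\n").split("|", 4)
        if len(parts) != 5:
            continue  # not in the expected layout, skip it
        timestamp, _seq, _module, level, message = parts
        events.append((timestamp, level, message))
    return events


def lifecycle(events):
    """Keep only the bridge 'created port' / 'destroyed port' events."""
    return [
        (timestamp, message.split()[0])
        for timestamp, _level, message in events
        if message.startswith(("created port", "destroyed port"))
    ]


sample = [
    "May 30 14:15:57|01711|bridge|INFO|created port tapa2bf7d8a-7f on bridge br-int",
    "May 30 14:41:03|01731|netdev|WARN|failed to get flags for network device tapa2bf7d8a-7f: No such device",
    "May 30 14:41:04|01780|bridge|INFO|destroyed port tapa2bf7d8a-7f on bridge br-int",
    "May 30 14:41:06|01782|bridge|INFO|created port tapa2bf7d8a-7f on bridge br-int",
]

for timestamp, action in lifecycle(port_events(sample, "tapa2bf7d8a-7f")):
    print(timestamp, action)
```

On the sample lines this shows the port being created, destroyed, and created again, which is the pattern visible in the full excerpt.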

Does anyone know why this happens?

Édouard Thuleau (ethuleau) wrote :

When the load on the DHCP agent is too heavy (updating the dnsmasq host file and sending lease updates), its state report to the Neutron server is delayed. The Neutron server then considers the agent down and doesn't send it the port creation notification, so the dnsmasq host file isn't updated to serve that port's IP.
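
To make the timing concrete, here is a rough Python sketch of the liveness check described above. It is not the actual Neutron code: the function names, the agent name, and the numeric values for report_interval and agent_down_time are assumptions chosen only for illustration.

```python
from datetime import datetime, timedelta


def is_agent_down(last_heartbeat, now, agent_down_time):
    """True if the last state report is older than agent_down_time seconds."""
    return (now - last_heartbeat) > timedelta(seconds=agent_down_time)


def agents_to_notify(agents, now, agent_down_time):
    """Return the names of agents that still count as alive."""
    return [
        name
        for name, last_heartbeat in agents.items()
        if not is_agent_down(last_heartbeat, now, agent_down_time)
    ]


# Hypothetical values, for illustration only.
report_interval = 4.0   # seconds between state reports (assumed)
agent_down_time = 5.0   # seconds before the server marks an agent down (assumed)
overrun = 2.4           # extra delay caused by an overloaded agent (assumed)

now = datetime(2013, 8, 7, 13, 21, 46)
# The last report arrived one interval plus the overrun ago.
agents = {"dhcp-agent-1": now - timedelta(seconds=report_interval + overrun)}

print(agents_to_notify(agents, now, agent_down_time))  # agent is skipped
print(agents_to_notify(agents, now, 10.0))             # larger down time keeps it
```

With these assumed numbers the gap between reports (6.4 s) exceeds the down time (5 s), so the agent is skipped and never hears about the new port; a larger down time keeps it in the list.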

Do you have a line like this in the agent log file?
2013-08-07 13:21:46 WARNING [quantum.openstack.common.loopingcall] task run outlasted interval by 2.375859 sec

You can increase the 'report_interval' flag on the agent and the 'agent_down_time' flag on the Neutron server side.
This problem should be corrected with this bp: https://blueprints.launchpad.net/neutron/+spec/remove-dhcp-lease.
Meanwhile, I think we should add a warning log in the Neutron server code for the case where it cannot notify any DHCP agent of a port creation, and backport that to the Grizzly release.

What do you think?

tags: added: l3-ipam-dhcp
removed: ovs
Changed in neutron:
assignee: nobody → Édouard Thuleau (ethuleau)
status: New → In Progress
Robert Collins (lifeless) wrote :

I think it would be better to still notify the down agent, knowing that it may be overloaded or about to reconnect; and if the agent is behind, it should poll for freshness of any networks it has once it's caught up.
