Comment 4 for bug 1376958

Ryan Beisner (1chb1n) wrote:

I believe UOSCI may be seeing the symptoms of this, or something very similar.

Environment:
Tenant w/ 1 neutron router and 15 connected networks and subnets, with a somewhat high volume of short-lived instances (~100 to 200 built and torn down each day). Max concurrent instances in the tenant are generally 60 to 80, and they hit all networks roughly equally in rotation.

Observation:
After a month or so of nova booting and/or juju deploying around 200 instances per day, we eventually start to see 'no network' issues in 7 to 12% of instance boot attempts. Once the issue arises, it is persistent and debilitating, despite everything having worked flawlessly for the preceding month or more. We have seen this cycle twice now, and hit it again today.

Impact:
We end up seeing false failures in our deployment testing, which require a manual workaround once detected.

Workaround:
Delete the neutron nets and subnets (but not the router), then re-add the nets and subnets. Everything then hums along happily again until some unknown point in the future, when we start to experience 'nonet' bootstrap failures again - possibly coinciding with this bug.
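
For reference, a minimal sketch of that workaround using the neutron CLI, assuming a single affected network; the names net1, subnet1, router1 and the CIDR are hypothetical placeholders, and in practice this is repeated for each of the 15 nets:

    # detach the subnet from the router, then delete and recreate the net/subnet
    neutron router-interface-delete router1 subnet1
    neutron net-delete net1
    neutron net-create net1
    neutron subnet-create --name subnet1 net1 10.0.1.0/24
    neutron router-interface-add router1 subnet1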

The common symptom/indicator is present in the nova console-log output for the affected instances:
    cloud-init-nonet[134.04]: gave up waiting for a network device
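
A quick way to check a given instance for this indicator is to grep its console log (the instance UUID here is a placeholder):

    nova console-log <instance-uuid> | grep 'gave up waiting'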

From the tenant perspective, log inspection has yielded no other useful indicators to me. Granted, my host-level log inspection has not been thorough to date, as we generally need to resolve the issue ASAP to restore services.