VMs fail to receive DHCPOFFER messages
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| neutron | | Critical | Unassigned | |
Bug Description
message:
420 hits in 7 days, check and gate, all failures. Seems like this is probably a known issue already, so it could be a duplicate of another bug, but given that elastic-recheck didn't comment on my patch when this failed, I'm reporting a new bug and a new e-r query:
tags: | added: gate-failure |
Changed in neutron: | |
importance: | Undecided → Medium |
Changed in neutron: | |
importance: | Medium → High |
Eugene Nikanorov (enikanorov) wrote : | #1 |
Changed in neutron: | |
assignee: | nobody → Eugene Nikanorov (enikanorov) |
Eugene Nikanorov (enikanorov) wrote : | #2 |
The regression was introduced by https:/
iptables rules application should be performed at the end of router processing.
Eugene Nikanorov (enikanorov) wrote : | #3 |
Previous analysis was wrong. L3 nat routing works as expected.
Eugene Nikanorov (enikanorov) wrote : | #4 |
Apparently the issue is in the DHCP agent, where everything works as expected except that sometimes the green thread that serves a network by spawning dnsmasq is stuck on the dhcp-agent semaphore for too long (about a minute after the DHCP agent is notified about the network).
The VM is spawned during that time, has already sent its DHCPDISCOVER, and then gives up getting an IP, because no dnsmasq is available on the network yet.
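The contention described above can be sketched with a small, self-contained example. This is not neutron's actual agent code: it uses stdlib threading in place of eventlet green threads, and the network names and timings are made up, but it shows how a single agent-wide lock can delay serving a new network well past a DHCP client's retry window.

```python
import threading
import time

# One agent-wide lock serializes network processing (stand-in for the
# dhcp-agent semaphore mentioned in the comment above).
agent_lock = threading.Lock()
spawn_delay = {}  # network name -> seconds spent waiting for the lock


def serve_network(name, work_seconds):
    """Simulate the DHCP agent handling one network notification."""
    requested = time.monotonic()
    with agent_lock:
        # Record how long this network waited before being served.
        spawn_delay[name] = time.monotonic() - requested
        time.sleep(work_seconds)  # time spent "spawning dnsmasq"


# A slow network holds the lock while a newly created network waits behind it.
slow = threading.Thread(target=serve_network, args=("net-slow", 0.2))
slow.start()
time.sleep(0.05)  # make sure net-slow grabs the lock first
new = threading.Thread(target=serve_network, args=("net-new", 0.0))
new.start()
slow.join()
new.join()

print(spawn_delay["net-new"] >= 0.1)  # True: net-new was stuck behind net-slow
```

In the bug, the equivalent wait was on the order of a minute, long enough for the VM's DHCP client to time out before dnsmasq existed.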
The logstash query is too broad, and it paints a gloomier picture than reality: most of the failures captured are infra-caused or otherwise genuine ones.
There's an instance where the VM does not get the IP address at all:
Still going through the various failures...
Every recent trace I have looked at seems to have been caused by the lack of a DHCP address assigned to the VM. In multiple cases, what I can see is that the DHCPOFFER does not make it back to the VM:
The same job [1] experiences two failure modes, both of which I can relate to the lack of DHCP address assignment to the VM.
[1] http://
Changed in neutron: | |
status: | New → Confirmed |
And another one:
http://
Same failure mode: the DHCPOFFER does not make it back.
Changed in neutron: | |
importance: | High → Critical |
Fix proposed to branch: master
Review: https:/
Armando Migliaccio (armando-migliaccio) wrote : Re: test_server_connectivity_pause_unpause fails with "AssertionError: False is not true : Timed out waiting for 172.24.4.64 to become reachable" | #11 |
^^^ this is just to get access to the box's info
no longer affects: | tempest |
After closer inspection of this log:
And looking at the three DHCPDISCOVER that the VM sends:
DHCPDISCOVER(
I see that in all three cases an iptables rule drops the packet:
iptables dropped: IN=qbr34a975d7-ff OUT= PHYSIN=
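The kernel log line above is truncated in the bug report. As an illustration, the `KEY=value` fields such a line carries can be pulled apart to see which direction the dropped traffic was going; the complete line below is invented in the same format (interface, addresses, and ports are example values, not from the actual log):

```python
# Made-up but format-faithful "iptables dropped" kernel log line.
line = ("iptables dropped: IN=qbr34a975d7-ff OUT= PHYSIN=eth0 "
        "SRC=10.100.0.3 DST=10.100.0.5 PROTO=UDP SPT=67 DPT=68")

# Skip the "iptables dropped:" prefix, then split each KEY=value token.
fields = dict(kv.split("=", 1) for kv in line.split()[2:] if "=" in kv)

# Source port 67 / destination port 68 is server-to-client DHCP traffic,
# i.e. the DHCPOFFER on its way back to the VM.
print(fields["SPT"], fields["DPT"])  # 67 68
```

Seeing SPT=67/DPT=68 dropped is what points at the return path (server to client) being blocked, which matters for the discussion below about whether these drops are expected.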
Eugene Nikanorov (enikanorov) wrote : | #13 |
Yes, I've seen this as well, but I also see that in other cases, where DHCPOFFER is followed by DHCPREQUEST and DHCPACK, this message appears too, which makes me think it's expected behavior?
Yes. That said, I am not 100% sure how relevant these rules are. From what I understand, they drop packets that come from the client, which is not what we're seeing.
Um... in this instance I don't even see the DHCPDISCOVER making it to the server :(
Well, at least the failure is imputable to a much clearer error:
Related fix proposed to branch: master
Review: https:/
Armando Migliaccio (armando-migliaccio) wrote : Re: test_server_connectivity_pause_unpause fails with "AssertionError: False is not true : Timed out waiting for 172.24.4.64 to become reachable" | #19 |
I was looking at nova's dhcp code and I spotted this one:
https:/
I wonder if we need a similar tweak for Neutron, which I couldn't find.
And I wonder if this is related: https:/
After change https:/
This is the iptables-save for a faulty VM:
This is the iptables-save for a good VM:
Notice the rule, missing in [1], that allows the DHCP traffic from the server to go back to the client:
-s 10.100.0.3/32 -p udp -m udp --sport 67 --dport 68 -j RETURN
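A quick way to spot a gap like this is to diff the two `iptables-save` dumps. The snippet below is an illustrative helper, not part of neutron; the chain name and the second rule are invented to stand in for the real dumps, with the RETURN rule taken from the bug:

```python
# Rules for the "good" VM (chain name and DROP rule are example values).
good = """
-A neutron-openvswi-i34a975d7-f -s 10.100.0.3/32 -p udp -m udp --sport 67 --dport 68 -j RETURN
-A neutron-openvswi-i34a975d7-f -j DROP
""".strip().splitlines()

# Rules for the "faulty" VM: the DHCP server-to-client RETURN rule is gone,
# so the DHCPOFFER falls through to the DROP rule.
faulty = """
-A neutron-openvswi-i34a975d7-f -j DROP
""".strip().splitlines()

missing = [rule for rule in good if rule not in faulty]
for rule in missing:
    print(rule)
```

Running this prints exactly the `--sport 67 --dport 68 -j RETURN` rule, i.e. the allow rule whose absence keeps the DHCPOFFER from reaching the client.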
Thanks, Kevin, for helping me spot this.
Kevin Benton (kevinbenton) wrote : | #22 |
Russell snuck in a topic change here:
https:/
Changed in neutron: | |
assignee: | Eugene Nikanorov (enikanorov) → Kevin Benton (kevinbenton) |
Fix proposed to branch: master
Review: https:/
Changed in neutron: | |
status: | Confirmed → In Progress |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit a7b54d3c60ff6dd
Author: Kevin Benton <email address hidden>
Date: Tue Jan 13 08:05:19 2015 -0800
Fix topic for provider security group update
Commit 8098b6bd20bb125 changed
the topic for the provider security group update to a regular member
update. This resulted in the L2 agent not asking for the latest
security group rules after a DHCP port was created. If a regular
compute port was brought online and wired up by the L2 agent
before the DHCP port was created, the VM would never get its allow
rule to communicate with the DHCP server.
Co-
Closes-Bug: #1403291
Change-Id: I382f2e1390c9a3
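The failure mode described in the commit message can be sketched as a publish/subscribe mismatch. This is not neutron's RPC code, and the topic strings below are invented placeholders, but it shows why publishing a security-group update on the wrong topic means the L2 agent never refreshes its rules:

```python
from collections import defaultdict


class Bus:
    """Toy message bus standing in for the RPC notifier."""

    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, msg):
        for cb in self.subscribers[topic]:
            cb(msg)


bus = Bus()
refreshed = []

# The L2 agent listens for security-group updates (topic name is made up).
bus.subscribe("agent-notifier-security-group-update", refreshed.append)

# Buggy server side: the update goes out on the member-update topic instead,
# so the agent never hears about the new DHCP allow rule.
bus.publish("agent-notifier-member-update", "sg-rules-changed")
print(len(refreshed))  # 0: the agent never asks for the latest rules

# Fixed server side: correct topic, and the agent refreshes its rules.
bus.publish("agent-notifier-security-group-update", "sg-rules-changed")
print(len(refreshed))  # 1
```

This matches the race in the commit message: a compute port wired up before the DHCP port existed would keep its stale rule set forever, since the notification that should trigger a refresh was never delivered.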
Changed in neutron: | |
status: | In Progress → Fix Committed |
Reviewed: https:/
Committed: https:/
Submitter: Jenkins
Branch: master
commit bb195338e414492
Author: armando-migliaccio <email address hidden>
Date: Tue Jan 13 16:45:16 2015 -0800
Log iptables rules in a readable format
When troubleshooting issues, having to parse the \n mentally is kind of
difficult. Be nice to the user and have the newlines interpreted correctly.
It's fine if we waste some space in the logs; storage is cheap these days.
Related-bug: #1403291
Change-Id: Ia6c651ae0d17c0
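The readability problem the commit above fixes boils down to emitting the rules through `%r` (which escapes newlines into literal `\n`) versus `%s` (which lets them through). A minimal illustration, not the actual patch:

```python
# Two example iptables rules joined by real newlines.
rules = ["-A INPUT -p udp --dport 67 -j ACCEPT", "-A INPUT -j DROP"]

# Hard to read: %r escapes the newlines, so everything lands on one line
# with literal \n separators that have to be parsed mentally.
print("one-line: %r" % "\n".join(rules))

# Readable: %s keeps the newlines, so each rule gets its own line.
print("readable:\n%s" % "\n".join(rules))
```

The same distinction applies when passing the joined string to a logger: formatting with `%s` instead of `%r` is what makes each rule appear on its own line in the log.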
summary: |
- test_server_connectivity_pause_unpause fails with "AssertionError: False is not true : Timed out waiting for 172.24.4.64 to become reachable"
+ VMs fail to receive DHCPOFFER messages |
Changed in neutron: | |
milestone: | none → kilo-2 |
status: | Fix Committed → Fix Released |
Changed in neutron: | |
milestone: | kilo-2 → 2015.1.0 |
Matt Riedemann (mriedem) wrote : | #26 |
Joe Gordon (jogo) wrote : | #27 |
Top gate bug, this is not fixed.
Changed in neutron: | |
status: | Fix Released → New |
ratalevolamena (chris-techno1307) wrote : | #28 |
Hi guys, I'm new to debugging code but highly motivated to contribute to OpenStack.
So first, I'd like to ask you: how do we fix this bug?
Changed in neutron: | |
assignee: | Kevin Benton (kevinbenton) → ratalevolamena (chris-techno1307) |
assignee: | ratalevolamena (chris-techno1307) → nobody |
assignee: | nobody → ratalevolamena (chris-techno1307) |
Changed in neutron: | |
assignee: | ratalevolamena (chris-techno1307) → nobody |
Kevin Benton (kevinbenton) wrote : | #29 |
The problem disappeared before it could be narrowed down.
Changed in neutron: | |
status: | New → Incomplete |
status: | Incomplete → Fix Released |
Looking at the logs, I see that the L3 agent (VPN agent) doesn't apply the FIP-related iptables NAT commands in time; it applies them around two minutes after it receives the router info (and after the test has timed out).
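The two-minute gap above is the kind of thing that can be measured directly from log timestamps. A minimal sketch, with both timestamps invented for illustration (they are not from the actual agent log):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Example timestamps: when the L3 agent received the router info, and when
# it finally applied the FIP NAT rules (both values are made up).
received = datetime.strptime("2015-01-13 16:45:16.120", FMT)
applied = datetime.strptime("2015-01-13 16:47:20.587", FMT)

delay = (applied - received).total_seconds()
print(delay > 120)  # True: well past the test's connectivity timeout
```

A delay over the tempest connectivity timeout means the floating IP was unreachable for the whole test, producing the same "Timed out waiting ... to become reachable" failure.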
Continuing to analyze the logs.