VM's fail to receive DHCPOFFER messages

Bug #1403291 reported by Matt Riedemann on 2014-12-17
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Critical
Unassigned

Bug Description

http://logs.openstack.org/46/142246/1/check//check-tempest-dsvm-neutron-full/ff04c3e/console.html#_2014-12-16_23_45_56_966

message:"check_public_network_connectivity" AND message:"AssertionError: False is not true : Timed out waiting for" AND message:"to become reachable" AND tags:"tempest.txt"

420 hits in 7 days, check and gate, all failures. Seems like this is probably a known issue already so could be a duplicate of another bug, but given elastic-recheck didn't comment on my patch when this failed I'm reporting a new bug and a new e-r query:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiY2hlY2tfcHVibGljX25ldHdvcmtfY29ubmVjdGl2aXR5XCIgQU5EIG1lc3NhZ2U6XCJBc3NlcnRpb25FcnJvcjogRmFsc2UgaXMgbm90IHRydWUgOiBUaW1lZCBvdXQgd2FpdGluZyBmb3JcIiBBTkQgbWVzc2FnZTpcInRvIGJlY29tZSByZWFjaGFibGVcIiBBTkQgdGFnczpcInRlbXBlc3QudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MTg3ODcwOTM1OTIsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=

tags: added: gate-failure
Changed in neutron:
importance: Undecided → Medium
Changed in neutron:
importance: Medium → High
Eugene Nikanorov (enikanorov) wrote :

Looking at logs I see that L3 agent (vpn agent) doesn't apply FIP-related iptables NAT commands in time, it applies it around 2 minutes later after it receives router info (and after test has timed out)
Continuing to analyze the logs.

Changed in neutron:
assignee: nobody → Eugene Nikanorov (enikanorov)
Eugene Nikanorov (enikanorov) wrote :

The regression was introduced by https://review.openstack.org/#/c/140193/4

iptables rules application should be performed at the end of router processing.

Eugene Nikanorov (enikanorov) wrote :

Previous analysis was wrong. L3 nat routing works as expected.

Eugene Nikanorov (enikanorov) wrote :

Apparently the issue is in DHCP agent where every thing is worked as expected except that sometimes green thread that serves network by spawning dnsmasq is stuck on dhcp-agent semaphore for too long (about ~1 minute since dhcp agent is notified about the network)
VM is spawned during that time, already sent DHCPDISCOVER and then gave up getting IP, because no dnsmasq is available in the network yet.

The logstash query is too broad, and it depicts a picture that's gloomier than it actually is, most of failures captured are infra-caused or genuine ones.

There's an instance where the VM does not get the IP address at all:

http://logs.openstack.org/50/144250/4/check/check-tempest-dsvm-neutron-full/3a46100/logs/testr_results.html.gz

Still going through the various failures...

Every recent trace I have been looking at, it seems to have been caused by lack of DHCP address to be assigned to the VM. In multiple cases, what I can see is that the DHCPOFFER does not make back to the VM:

http://logs.openstack.org/51/146151/1/check/check-tempest-dsvm-neutron-pg-full/0ef9c2d/logs/syslog.txt.gz#_Jan__9_18_55_08

The same job [1] experiences two failure modes, both I can relate them to lack of DHCP address assignment to the VM.

[1] http://logs.openstack.org/51/146151/1/check/check-tempest-dsvm-neutron-pg-full

Changed in neutron:
status: New → Confirmed

And another one:

http://logs.openstack.org/77/144977/5/check/check-tempest-dsvm-neutron-full/0662271/logs/

Same failure mode: the dhcp offer does not make back.

Changed in neutron:
importance: High → Critical
no longer affects: tempest

After closer inspection to this log:

http://logs.openstack.org/77/144977/5/check/check-tempest-dsvm-neutron-full/0662271/logs/syslog.txt.gz

And looking at the three DHCPDISCOVER that the VM sends:

DHCPDISCOVER(tap3820512e-14) fa:16:3e:ec:3c:3d

I see that in all three cases an iptables rule drops the packet:

iptables dropped: IN=qbr34a975d7-ff OUT= PHYSIN=tap34a975d7-ff MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:ec:3c:3d:08:00 SRC=0.0.0.0 DST=255.255.255.255 LEN=309 TOS=0x00 PREC=0x00 TTL=64 ID=0 PROTO=UDP SPT=68 DPT=67 LEN=289

Eugene Nikanorov (enikanorov) wrote :

Yes, I've seen this as well, but i also see that in other cases, where DHCPOFFER is followed by DHCPREQUEST and DHCPACK, this message appears as well, which makes me think that its expected behavior?

Yes. That said I am not 100% how relevant these rules are. From what I understand they drop packets that come from the client, which is not what we're seeing.

Um...in this instance I don't even see the DHCPDISCOVER making to the server :(

After change https://review.openstack.org/147039 things suddenly became a lot clearer:

This is the iptables-save for a faulty VM:

[1] http://logs.openstack.org/39/147039/1/check/check-tempest-dsvm-neutron-pg-full/f0a079d/logs/screen-q-agt.txt.gz#_2015-01-14_02_17_04_806

This is the iptables-save for a good VM:

[2] http://logs.openstack.org/39/147039/1/check/check-tempest-dsvm-neutron-pg-full/f0a079d/logs/screen-q-agt.txt.gz#_2015-01-14_02_22_52_192

Notice the missing rule that allows the dhcp traffic from the server to go back to the client in [1]:

-s 10.100.0.3/32 -p udp -m udp --sport 67 --dport 68 -j RETURN

Thanks Kevin for helping me spotting this.

Kevin Benton (kevinbenton) wrote :
Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → Kevin Benton (kevinbenton)

Fix proposed to branch: master
Review: https://review.openstack.org/147085

Changed in neutron:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/147085
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a7b54d3c60ff6dd63273b573bcc5c572faab22a8
Submitter: Jenkins
Branch: master

commit a7b54d3c60ff6dd63273b573bcc5c572faab22a8
Author: Kevin Benton <email address hidden>
Date: Tue Jan 13 08:05:19 2015 -0800

    Fix topic for provider security group update

    Commit 8098b6bd20bb12520e438ef9125dc7b96664f712 accidentally changed
    the topic for the provider security group update to a regular member
    update. This resulted in the L2 agent not asking for the latest
    security group rules after a DHCP port was created. If a regular
    compute port was brought online and wired up by the L2 agent
    before the DHCP port was created, the VM would never get its allow
    rule to communicate with the DHCP server.

    Co-bug-squasher: Armando

    Closes-Bug: #1403291
    Change-Id: I382f2e1390c9a3ff558e152eeb76f0f827266cb4

Changed in neutron:
status: In Progress → Fix Committed

Reviewed: https://review.openstack.org/147039
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb195338e4144929c193a6b5bf0e8024d551c885
Submitter: Jenkins
Branch: master

commit bb195338e4144929c193a6b5bf0e8024d551c885
Author: armando-migliaccio <email address hidden>
Date: Tue Jan 13 16:45:16 2015 -0800

    Log iptables rules in a readable format

    When troubleshooting issues having to parse the \n mentally is kind of
    difficult. Be nice to the user and have the newlines interpreted correctly.
    It's fine if we waste some spaces in the logs, storage is cheap these days.

    Related-bug: #1403291

    Change-Id: Ia6c651ae0d17c0165e4fe833c523061a66f7d7d1

summary: - test_server_connectivity_pause_unpause fails with "AssertionError: False
- is not true : Timed out waiting for 172.24.4.64 to become reachable"
+ VM's fail to receive DHCPOFFER messages
Thierry Carrez (ttx) on 2015-02-05
Changed in neutron:
milestone: none → kilo-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx) on 2015-04-30
Changed in neutron:
milestone: kilo-2 → 2015.1.0
Joe Gordon (jogo) wrote :

Top gate bug, this is not fixed.

Changed in neutron:
status: Fix Released → New

hi guys, i'm new on debuging code but highly motivated to contribute on openstack.
So first, i'd like to ask you. How to fix this bug???

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → ratalevolamena (chris-techno1307)
assignee: ratalevolamena (chris-techno1307) → nobody
assignee: nobody → ratalevolamena (chris-techno1307)
Changed in neutron:
assignee: ratalevolamena (chris-techno1307) → nobody
Kevin Benton (kevinbenton) wrote :

The problem disappeared before it could be narrowed down.

Changed in neutron:
status: New → Incomplete
status: Incomplete → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers