VM's fail to receive DHCPOFFER messages

Bug #1403291 reported by Matt Riedemann
18
This bug affects 2 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Critical
Unassigned

Bug Description

http://logs.openstack.org/46/142246/1/check//check-tempest-dsvm-neutron-full/ff04c3e/console.html#_2014-12-16_23_45_56_966

message:"check_public_network_connectivity" AND message:"AssertionError: False is not true : Timed out waiting for" AND message:"to become reachable" AND tags:"tempest.txt"

420 hits in 7 days, check and gate, all failures. Seems like this is probably a known issue already so could be a duplicate of another bug, but given elastic-recheck didn't comment on my patch when this failed I'm reporting a new bug and a new e-r query:

http://logstash.openstack.org/#eyJzZWFyY2giOiJtZXNzYWdlOlwiY2hlY2tfcHVibGljX25ldHdvcmtfY29ubmVjdGl2aXR5XCIgQU5EIG1lc3NhZ2U6XCJBc3NlcnRpb25FcnJvcjogRmFsc2UgaXMgbm90IHRydWUgOiBUaW1lZCBvdXQgd2FpdGluZyBmb3JcIiBBTkQgbWVzc2FnZTpcInRvIGJlY29tZSByZWFjaGFibGVcIiBBTkQgdGFnczpcInRlbXBlc3QudHh0XCIiLCJmaWVsZHMiOltdLCJvZmZzZXQiOjAsInRpbWVmcmFtZSI6IjYwNDgwMCIsImdyYXBobW9kZSI6ImNvdW50IiwidGltZSI6eyJ1c2VyX2ludGVydmFsIjowfSwic3RhbXAiOjE0MTg3ODcwOTM1OTIsIm1vZGUiOiIiLCJhbmFseXplX2ZpZWxkIjoiIn0=

Tags: gate-failure
tags: added: gate-failure
Changed in neutron:
importance: Undecided → Medium
Changed in neutron:
importance: Medium → High
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Looking at logs I see that L3 agent (vpn agent) doesn't apply FIP-related iptables NAT commands in time, it applies it around 2 minutes later after it receives router info (and after test has timed out)
Continuing to analyze the logs.

Changed in neutron:
assignee: nobody → Eugene Nikanorov (enikanorov)
Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

The regression was introduced by https://review.openstack.org/#/c/140193/4

iptables rules application should be performed at the end of router processing.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Previous analysis was wrong. L3 nat routing works as expected.

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Apparently the issue is in DHCP agent where every thing is worked as expected except that sometimes green thread that serves network by spawning dnsmasq is stuck on dhcp-agent semaphore for too long (about ~1 minute since dhcp agent is notified about the network)
VM is spawned during that time, already sent DHCPDISCOVER and then gave up getting IP, because no dnsmasq is available in the network yet.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

The logstash query is too broad, and it depicts a picture that's gloomier than it actually is, most of failures captured are infra-caused or genuine ones.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

There's an instance where the VM does not get the IP address at all:

http://logs.openstack.org/50/144250/4/check/check-tempest-dsvm-neutron-full/3a46100/logs/testr_results.html.gz

Still going through the various failures...

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Every recent trace I have been looking at, it seems to have been caused by lack of DHCP address to be assigned to the VM. In multiple cases, what I can see is that the DHCPOFFER does not make back to the VM:

http://logs.openstack.org/51/146151/1/check/check-tempest-dsvm-neutron-pg-full/0ef9c2d/logs/syslog.txt.gz#_Jan__9_18_55_08

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

The same job [1] experiences two failure modes, both I can relate them to lack of DHCP address assignment to the VM.

[1] http://logs.openstack.org/51/146151/1/check/check-tempest-dsvm-neutron-pg-full

Changed in neutron:
status: New → Confirmed
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

And another one:

http://logs.openstack.org/77/144977/5/check/check-tempest-dsvm-neutron-full/0662271/logs/

Same failure mode: the dhcp offer does not make back.

Changed in neutron:
importance: High → Critical
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tempest (master)

Fix proposed to branch: master
Review: https://review.openstack.org/146256

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote : Re: test_server_connectivity_pause_unpause fails with "AssertionError: False is not true : Timed out waiting for 172.24.4.64 to become reachable"

^^^ this is just to get access to the box's info

no longer affects: tempest
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

After closer inspection to this log:

http://logs.openstack.org/77/144977/5/check/check-tempest-dsvm-neutron-full/0662271/logs/syslog.txt.gz

And looking at the three DHCPDISCOVER that the VM sends:

DHCPDISCOVER(tap3820512e-14) fa:16:3e:ec:3c:3d

I see that in all three cases an iptables rule drops the packet:

iptables dropped: IN=qbr34a975d7-ff OUT= PHYSIN=tap34a975d7-ff MAC=ff:ff:ff:ff:ff:ff:fa:16:3e:ec:3c:3d:08:00 SRC=0.0.0.0 DST=255.255.255.255 LEN=309 TOS=0x00 PREC=0x00 TTL=64 ID=0 PROTO=UDP SPT=68 DPT=67 LEN=289

Revision history for this message
Eugene Nikanorov (enikanorov) wrote :

Yes, I've seen this as well, but i also see that in other cases, where DHCPOFFER is followed by DHCPREQUEST and DHCPACK, this message appears as well, which makes me think that its expected behavior?

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Yes. That said I am not 100% how relevant these rules are. From what I understand they drop packets that come from the client, which is not what we're seeing.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

Um...in this instance I don't even see the DHCPDISCOVER making to the server :(

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/147039

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote : Re: test_server_connectivity_pause_unpause fails with "AssertionError: False is not true : Timed out waiting for 172.24.4.64 to become reachable"

I was looking at nova's dhcp code and I spotted this one:

https://github.com/openstack/nova/blob/master/nova/network/linux_net.py#L961

I wonder if we need a similar tweak for Neutron, which I couldn't find.

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

And I wonder if this is related: https://bugs.launchpad.net/neutron/+bug/1244589

Revision history for this message
Armando Migliaccio (armando-migliaccio) wrote :

After change https://review.openstack.org/147039 things suddenly became a lot clearer:

This is the iptables-save for a faulty VM:

[1] http://logs.openstack.org/39/147039/1/check/check-tempest-dsvm-neutron-pg-full/f0a079d/logs/screen-q-agt.txt.gz#_2015-01-14_02_17_04_806

This is the iptables-save for a good VM:

[2] http://logs.openstack.org/39/147039/1/check/check-tempest-dsvm-neutron-pg-full/f0a079d/logs/screen-q-agt.txt.gz#_2015-01-14_02_22_52_192

Notice the missing rule that allows the dhcp traffic from the server to go back to the client in [1]:

-s 10.100.0.3/32 -p udp -m udp --sport 67 --dport 68 -j RETURN

Thanks Kevin for helping me spotting this.

Revision history for this message
Kevin Benton (kevinbenton) wrote :
Changed in neutron:
assignee: Eugene Nikanorov (enikanorov) → Kevin Benton (kevinbenton)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/147085

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/147085
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a7b54d3c60ff6dd63273b573bcc5c572faab22a8
Submitter: Jenkins
Branch: master

commit a7b54d3c60ff6dd63273b573bcc5c572faab22a8
Author: Kevin Benton <email address hidden>
Date: Tue Jan 13 08:05:19 2015 -0800

    Fix topic for provider security group update

    Commit 8098b6bd20bb12520e438ef9125dc7b96664f712 accidentally changed
    the topic for the provider security group update to a regular member
    update. This resulted in the L2 agent not asking for the latest
    security group rules after a DHCP port was created. If a regular
    compute port was brought online and wired up by the L2 agent
    before the DHCP port was created, the VM would never get its allow
    rule to communicate with the DHCP server.

    Co-bug-squasher: Armando

    Closes-Bug: #1403291
    Change-Id: I382f2e1390c9a3ff558e152eeb76f0f827266cb4

Changed in neutron:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.openstack.org/147039
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=bb195338e4144929c193a6b5bf0e8024d551c885
Submitter: Jenkins
Branch: master

commit bb195338e4144929c193a6b5bf0e8024d551c885
Author: armando-migliaccio <email address hidden>
Date: Tue Jan 13 16:45:16 2015 -0800

    Log iptables rules in a readable format

    When troubleshooting issues having to parse the \n mentally is kind of
    difficult. Be nice to the user and have the newlines interpreted correctly.
    It's fine if we waste some spaces in the logs, storage is cheap these days.

    Related-bug: #1403291

    Change-Id: Ia6c651ae0d17c0165e4fe833c523061a66f7d7d1

summary: - test_server_connectivity_pause_unpause fails with "AssertionError: False
- is not true : Timed out waiting for 172.24.4.64 to become reachable"
+ VM's fail to receive DHCPOFFER messages
Thierry Carrez (ttx)
Changed in neutron:
milestone: none → kilo-2
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in neutron:
milestone: kilo-2 → 2015.1.0
Revision history for this message
Matt Riedemann (mriedem) wrote :
Revision history for this message
Joe Gordon (jogo) wrote :

Top gate bug, this is not fixed.

Changed in neutron:
status: Fix Released → New
Revision history for this message
ratalevolamena (chris-techno1307) wrote :

hi guys, i'm new on debuging code but highly motivated to contribute on openstack.
So first, i'd like to ask you. How to fix this bug???

Changed in neutron:
assignee: Kevin Benton (kevinbenton) → ratalevolamena (chris-techno1307)
assignee: ratalevolamena (chris-techno1307) → nobody
assignee: nobody → ratalevolamena (chris-techno1307)
Changed in neutron:
assignee: ratalevolamena (chris-techno1307) → nobody
Revision history for this message
Kevin Benton (kevinbenton) wrote :

The problem disappeared before it could be narrowed down.

Changed in neutron:
status: New → Incomplete
status: Incomplete → Fix Released
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.