arp fails to update, 3.13.0-{24,29}

Bug #1331150 reported by Dave Liebreich
This bug affects 4 people
Affects: linux (Ubuntu)
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

In an AWS environment managed by BOSH, we saw TCP connectivity problems between pairs of VMs.

VM A and VM B are up, running Ubuntu 14.04-based images, and in the same VPC subnet.

VM B is terminated, and another VM is created that reuses the IP address from VM B.

The new VM tries to make a TCP connection to VM A. In some cases, the connection fails.

Investigation reveals that VM A still has the old MAC address for that IP in its neighbor list. In some cases, running arp -d <IP Address> on VM A "fixes the problem". In some cases, running arping in gratuitous ARP mode on the new VM also fixes the problem.

The problem does not occur if VM A is running an Ubuntu 10.04-based image.

The problem seems to occur only on machines with lots of TCP connection traffic between them.

'arp -d' and arping on the new VM do not seem to "fix" the problem if there are processes on the VMs that are trying to create new TCP connections.
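
The check and the two workarounds described above could be sketched roughly as follows. This is only an illustration, assuming the iproute2 `ip` tool and the iputils `arping` are installed; the helper name `neigh_mac_for`, the interface `eth0`, and the address `10.0.0.5` are made up for the example and do not come from the report.

```shell
#!/bin/sh
# Print the link-layer (MAC) address recorded for an IP, reading
# `ip neigh show` style output from stdin. Prints nothing when the
# entry has no lladdr (for example a FAILED entry).
# (Hypothetical helper, not part of any tool.)
neigh_mac_for() {
    awk -v ip="$1" '$1 == ip {
        for (i = 2; i < NF; i++) if ($i == "lladdr") { print $(i + 1); exit }
    }'
}

# On VM A: show which MAC is currently cached for the reused IP.
#   ip neigh show | neigh_mac_for 10.0.0.5
#
# On VM A: drop the cached entry (needs root), similar in effect to `arp -d`.
#   ip neigh del 10.0.0.5 dev eth0
#
# On the new VM: announce its own MAC with gratuitous ARP (needs root).
#   arping -U -c 3 -I eth0 10.0.0.5
```

Comparing the printed MAC with the new VM's actual interface address shows whether VM A is still holding the entry left behind by the terminated VM.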

Revision history for this message
Dave Liebreich (dliebreich) wrote :

vcap@34fd733b-3ec0-4f6f-bdf7-24ab74b89b37:~$ cat /proc/version_signature
Ubuntu 3.13.0-24.47-generic 3.13.9
vcap@34fd733b-3ec0-4f6f-bdf7-24ab74b89b37:~$ lsb_release -rd
Description: Ubuntu 14.04 LTS
Release: 14.04
vcap@34fd733b-3ec0-4f6f-bdf7-24ab74b89b37:~$

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1331150

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Did this issue occur in a previous version of Ubuntu, or is this a new issue?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.15 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.15-utopic/

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
tags: added: kernel-key
Changed in linux (Ubuntu):
importance: Medium → Critical
status: Incomplete → Confirmed
tags: added: trusty
Revision history for this message
Dustin Kirkland  (kirkland) wrote :

Kamal suggests that this upstream kernel commit *might* be related to this issue:

http://kernel.ubuntu.com/git?p=ubuntu/linux.git;a=commitdiff;h=d85f62f554904c778bd4d0eb1eb2b629753771b2

Completely untested. Just FYI.

Revision history for this message
Dave Liebreich (dliebreich) wrote :

It may take a day or so to integrate a 3.15 kernel into our AMI. We'll have to modify our build process [0], then deploy it into one of the environments that is exhibiting the problem.

We will start on that track later today - please let us know if you need any additional info.

[0] https://github.com/cloudfoundry/bosh/tree/master/bosh-stemcell

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Possible fix that is in mainline as of 3.15:

 cc2f338 "batman-adv: fix local TT check for outgoing arp requests in DAT"

Will be included in upstream 3.13 soon.

Changed in linux (Ubuntu):
importance: Critical → High
Revision history for this message
Dave Liebreich (dliebreich) wrote :

Thanks for the quick response and offers of help to test. We have a workaround in-flight that we are focusing on, so someone from our org may pick up testing and merging the 3.13 + fix kernel tomorrow (at the earliest).

Thanks again.

Revision history for this message
Dave Liebreich (dliebreich) wrote :

That commit does not appear to have fixed our problem.

We took the 3.13.0-29 #53 kernel source, applied the patch above, and deployed it to our environment. We are seeing the same behavior with this patch.

We are currently setting up a parallel test environment, as well as building a precise image as a fallback plan.

We should have something later today where we can do more intensive investigation. Any suggestions as to where to look first?

Should we take a 3.15 mainline kernel?

Thanks

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

It would be good to test the latest mainline kernel, which is 3.16-rc1. That will tell us if it is already fixed upstream. If it is, we can perform a "Reverse" kernel bisect to identify the commit that fixes this.

The latest mainline kernel is available from:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-rc1-utopic/

Revision history for this message
Dave Liebreich (dliebreich) wrote :

We are seeing the same problem with the 3.15.0-6 #11 kernel (we used a utopic AMI and added our BOSH stemcell bits to it). Does that kernel include the fix listed above?

Sorry it took so long to get this up and running in our environment.

We do *not* see the problem on a precise system (3.2.0-64 #97).

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Dave. Were you able to test the latest mainline kernel mentioned in comment #9 to see if this bug is already fixed upstream?

If the bug is fixed in mainline, we can perform a "Reverse" bisect to find the commit that fixes it. Otherwise, we can perform a regular bisect to identify the commit that introduced the regression.
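
For readers unfamiliar with the "Reverse" bisect mentioned above, here is a rough sketch of how it can be set up with git's custom bisect terms. The helper name `start_reverse_bisect`, the term names `broken` and `fixed`, and the script `./arp-still-broken.sh` are assumptions for illustration; the thread does not say how the bisect would actually be driven.

```shell
#!/bin/sh
# Start a "reverse" bisect: search for the first commit where the bug is
# FIXED rather than the first commit where it appears.
# $1 = older revision known to be broken, $2 = newer revision known to be fixed.
# (Hypothetical helper; uses git's --term-old/--term-new options.)
start_reverse_bisect() {
    git bisect start --term-old=broken --term-new=fixed &&
    git bisect broken "$1" &&
    git bisect fixed "$2"
}

# Example use inside a kernel tree (revisions are placeholders):
#   start_reverse_bisect v3.13 v3.16-rc1
# Then, after building and testing each candidate, mark it by hand:
#   git bisect broken      # the ARP problem still reproduces
#   git bisect fixed       # the ARP problem is gone
# Or automate it with a script that exits 0 while the bug is still present
# (the "old" state) and 1 once it is fixed:
#   git bisect run ./arp-still-broken.sh
```

A regular bisect is the same procedure with the default `good` and `bad` terms, where the older revision is the good one.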

Revision history for this message
BOSH Eng (pivotal-bosh-eng) wrote :

Quick update: git bisect points at commit 2724680. It appears that the net.ipv4.neigh.default.gc_thresh1 setting is now used to skip pruning of neighbor table entries. Setting gc_thresh1=0 brings back the old (desired) behavior. We are continuing to investigate this problem.
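
The gc_thresh1 setting mentioned above can be inspected and changed roughly like this. It is a sketch only: the helper names and the `/etc/sysctl.d/60-neigh-gc.conf` file name are assumptions, and whether a value of 0 is appropriate for a given system is not something this thread settles.

```shell
#!/bin/sh
# procfs path behind the net.ipv4.neigh.default.gc_thresh1 sysctl.
GC_THRESH1_PATH=/proc/sys/net/ipv4/neigh/default/gc_thresh1

# Print the current gc_thresh1 value. An alternate file may be passed
# as $1 (hypothetical convenience for dry runs).
read_gc_thresh1() {
    cat "${1:-$GC_THRESH1_PATH}"
}

# Succeed only when the value is 0, the setting reported above to
# restore the old pruning behavior.
gc_thresh1_is_zero() {
    [ "$(read_gc_thresh1 "$1")" -eq 0 ]
}

# Apply at runtime (needs root):
#   sysctl -w net.ipv4.neigh.default.gc_thresh1=0
# Persist across reboots (file name is an assumption):
#   echo 'net.ipv4.neigh.default.gc_thresh1 = 0' > /etc/sysctl.d/60-neigh-gc.conf
```

Running `gc_thresh1_is_zero && echo "workaround active"` on an affected VM is a quick way to confirm whether the setting has been applied.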

```
git bisect start
# bad: [51f68c26bc65b2e885d01cab48856ddb7d514841] UBUNTU: Ubuntu-3.11.0-23.40
git bisect bad 51f68c26bc65b2e885d01cab48856ddb7d514841
# good: [19f949f52599ba7c3f67a5897ac6be14bfcb1200] Linux 3.8
git bisect good 19f949f52599ba7c3f67a5897ac6be14bfcb1200
# bad: [736a2dd2571ac56b11ed95a7814d838d5311be04] Merge tag 'virtio-next-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux
git bisect bad 736a2dd2571ac56b11ed95a7814d838d5311be04
# bad: [5bc7c33ca93a285dcfe7b7fd64970f6314440ad1] mtd: nand: reintroduce NAND_NO_READRDY as NAND_NEED_READRDY
git bisect bad 5bc7c33ca93a285dcfe7b7fd64970f6314440ad1
# bad: [7ed214ac2095f561a94335ca672b6c42a1ea40ff] Merge tag 'char-misc-3.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect bad 7ed214ac2095f561a94335ca672b6c42a1ea40ff
# bad: [a0b1c42951dd06ec83cc1bc2c9788131d9fefcd8] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect bad a0b1c42951dd06ec83cc1bc2c9788131d9fefcd8
# bad: [98d5fac2330779e6eea6431a90b44c7476260dcc] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next into for-davem
git bisect bad 98d5fac2330779e6eea6431a90b44c7476260dcc
# bad: [cdda88912d62f9603d27433338a18be83ef23ac1] net: avoid to hang up on sending due to sysctl configuration overflow.
git bisect bad cdda88912d62f9603d27433338a18be83ef23ac1
# bad: [580d9d081341aad5341884f9e6b070c01512e94c] bnx2x: correct memory release scheme
git bisect bad 580d9d081341aad5341884f9e6b070c01512e94c
# good: [1cc7a3a14fa60f31ca4ff69f0dd31f369e0a51c2] e1000e: Invalid Image CSUM bit changed for I217
git bisect good 1cc7a3a14fa60f31ca4ff69f0dd31f369e0a51c2
# good: [b27b28cb445975dc02d2e7d9437d23af76a51571] ipv6: Make ipv6_addr_is_XXX() return boolean.
git bisect good b27b28cb445975dc02d2e7d9437d23af76a51571
# good: [100204147be9a2b87be2b118959b12eeaf8ef5d0] Merge branch 'dsa'
git bisect good 100204147be9a2b87be2b118959b12eeaf8ef5d0
# good: [463d413cb7dcd5509bc01e1108c2e2dcf8104683] drivers/net: delete old x86 variant of the seeq8005 driver
git bisect good 463d413cb7dcd5509bc01e1108c2e2dcf8104683
# bad: [ba418fa357a7b3c9d477f4706c6c7c96ddbd1360] soreuseport: UDP/IPv4 implementation
git bisect bad ba418fa357a7b3c9d477f4706c6c7c96ddbd1360
# bad: [0cc8d8df9bb931f1d4ab376f59d8ab8a49f9d4d4] netfilter: Use IS_ERR_OR_NULL().
git bisect bad 0cc8d8df9bb931f1d4ab376f59d8ab8a49f9d4d4
# bad: [8fbcec241df21d1ba2aba09974ea9017832b69b0] net: Use IS_ERR_OR_NULL().
git bisect bad 8fbcec241df21d1ba2aba09974ea9017832b69b0
# bad: [2724680bceee94eac391552863771af105a7355c] neigh: Keep neighbour cache entries if number of them is small enough.
git bisect bad 2724680bceee94eac391552863771af105a7355c
# good: [360eb5da665566a110993c58ed2a63e98f6720bf] ipmr: fix sparse warning when testing origin or group
git bisect good 360eb5da665566a110993c58ed2a63e98f6720bf
# first bad commit: [2724680b...
```

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Thanks for the update. I just wanted to touch base and see if you would like me to build a test kernel with commit 2724680 reverted?

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired