Activity log for bug #1819786

Date Who What changed Old value New value Message
2019-03-12 21:53:21 Marc Hasson bug added bug
2019-03-12 21:53:21 Marc Hasson attachment added ip_vs refcount error report that kills performance with every --ops packet https://bugs.launchpad.net/bugs/1819786/+attachment/5245705/+files/kern_log_1804.txt
2019-03-12 21:54:47 Marc Hasson tags amd64 apport-bug bionic amd64 apport-bug bionic xenial
2019-03-12 22:00:12 Ubuntu Kernel Bot linux (Ubuntu): status New Confirmed
2019-03-15 00:13:09 Terry Rudd bug added subscriber Terry Rudd
2019-04-09 05:37:32 Kai-Heng Feng nominated for series Ubuntu Bionic
2019-04-09 05:37:32 Kai-Heng Feng bug task added linux (Ubuntu Bionic)
2019-04-09 05:38:06 Kai-Heng Feng description On our 16.04LTS (and earlier) systems we used the ipvsadm --ops UDP support (one-packet scheduling) to get a better distribution amongst our real servers behind the load-balancer for some small subset of applications. This has worked fine through the 4.4.0-xxx kernels. But when we started a program to upgrade systems to use the 4.15 series of kernels to take advantage of new facilities, the subset of systems which used the --ops option ran into problems. Everything else with the 4.15 kernels appeared to work well. This issue was reported in #1817247 against 16.04LTS with the HWE 4.15 kernel but has not received any acknowledgement after having been reported weeks ago. So we have moved on to confirm that a stock 18.04LTS system with the latest expected/standard 4.15 kernel has this issue as well, and we report that here. Perhaps this will get more attention. The issue appears to stem from the change in the ip_vs module from "atomic_*()" increment/decrement functions in the 4.4 kernel to "refcount_*()" functions in later kernels, including the 4.15 one we switched to. Unfortunately, the simple refcount_dec() function was inadequate: it puts out a time-consuming message, with the associated handling, whenever the refcount drops to zero, which is expected in the case of --ops support, which retains no state after packet delivery. I will upload an attachment with the sample messages that get put out at packet arrival rate, which destroys performance of course. This test VM reports the identical errors we see in our production servers, but at least throwing only a couple of test --ops packets at it doesn't crash/hang the 18.04 system as it did in the 16.04 VM reported earlier. And in production, with the far greater packet rates, our systems fail since the attached call backtrace *** appears on every packet!! *** This issue was apparently already recognized as an error and has been fixed in upstream kernels. 
This is a reference to the 4.17 version of the fix that we'd like to see incorporated into the next possible kernel maintenance release: https://github.com/torvalds/linux/commit/a050d345cef0dc6249263540da1e902bba617e43#diff-75923493f6e3f314b196a8223b0d6342 We have successfully used the livepatch facility to build a livepatch .ko with the above diffs on our 4.15.0-36 system and demonstrated the contrast in good/bad behavior with/without the livepatch module loaded. But we'd rather not have to build a version of livepatch.ko for each kernel maintenance release, such as the 4.15.0-46 kernel used here to demonstrate that the issue persists in the Ubuntu mainline distro. The problem is easy to generate, with only a couple of packets and a simple configuration. Here's a very basic test configuration (addresses rewritten/obscured) for 2 servers that worked on my test VM:
ipvsadm -A -f 100 -s rr --ops
ipvsadm -a -f 100 -r 10.129.131.227:0 -g -w 9999
ipvsadm -a -f 100 -r 10.129.131.228:0 -g -w 9999
iptables -t mangle -A PREROUTING -d 172.16.5.1/32 -j MARK --set-xmark 0x64/0xffffffff
ifconfig lo:0 172.16.5.1/32 up
Routing and addressing to achieve the above, or adaptation for one's own test environment, is left to the tester. I just added alias 10.129.131.x addresses on my "outbound" interface and a static route for 172.16.5.1 to my client system so the test packets arrived on the "inbound" interface. I set up routing and addresses on my 2-NIC test machine such that packets arrived on its eth1 NIC and were directed by ip_vs out the eth2. To test, all I did was throw a few UDP packets via traceroute at the address in the iptables/firewall mark rule, with the eth1 interface of the test system as the traceroute system's default gateway:
traceroute -m 2 172.16.5.1
Without the fix my test ip_vs system either hangs or puts out messages as per the attached. With our livepatch module using the above commit's contents, all is well. 
Both of the test ("real" as opposed to "virtual") servers configured above via ipvsadm get packets, and no errors are reported in the logs. Let me know of anything I can do to help accelerate addressing or understanding of this issue. It seems that incorporating the fix is fairly straightforward, and that going without it is a performance disaster for anyone using the --ops facility to any significant degree. Thanks! $ lsb_release -rd Description: Ubuntu 18.04.2 LTS Release: 18.04 $ uname -a Linux direct-18-04 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux ProblemType: Bug DistroRelease: Ubuntu 18.04 Package: linux-image-4.15.0-46-generic 4.15.0-46.49 ProcVersionSignature: Ubuntu 4.15.0-46.49-generic 4.15.18 Uname: Linux 4.15.0-46-generic x86_64 ApportVersion: 2.20.9-0ubuntu7.5 Architecture: amd64 AudioDevicesInUse: USER PID ACCESS COMMAND /dev/snd/controlC0: marc 1980 F.... pulseaudio CurrentDesktop: ubuntu:GNOME Date: Tue Mar 12 13:14:31 2019 InstallationDate: Installed on 2018-08-01 (222 days ago) InstallationMedia: Ubuntu 18.04 LTS "Bionic Beaver" - Release amd64 (20180426) Lsusb: Bus 001 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub Bus 001 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub MachineType: VMware, Inc. 
VMware Virtual Platform ProcEnviron: TERM=xterm-256color PATH=(custom, no user) XDG_RUNTIME_DIR=<set> LANG=en_US.UTF-8 SHELL=/bin/bash ProcFB: 0 svgadrmfb ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-46-generic root=UUID=103a235a-d768-45d6-9f14-c7414f72320d ro quiet splash nopti nospectre_v2 nospec_store_bypass_disable RelatedPackageVersions: linux-restricted-modules-4.15.0-46-generic N/A linux-backports-modules-4.15.0-46-generic N/A linux-firmware 1.173.3 RfKill: SourcePackage: linux UpgradeStatus: No upgrade log present (probably fresh install) dmi.bios.date: 05/20/2014 dmi.bios.vendor: Phoenix Technologies LTD dmi.bios.version: 6.00 dmi.board.name: 440BX Desktop Reference Platform dmi.board.vendor: Intel Corporation dmi.board.version: None dmi.chassis.asset.tag: No Asset Tag dmi.chassis.type: 1 dmi.chassis.vendor: No Enclosure dmi.chassis.version: N/A dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd05/20/2014:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A: dmi.product.name: VMware Virtual Platform dmi.product.version: None dmi.sys.vendor: VMware, Inc. === SRU Justification === [Impact] From the commit message: "Connections in One-packet scheduling mode (-o, --ops) are removed with refcnt=0 because they are not hashed in conn table." [Fix] From the commit message: "To avoid refcount_dec reporting this as error, change them to be removed with refcount_dec_if_one as all other connections." [Test] The bug reporter has a reproducer and confirmed this commit fixes the issue. [Regression Potential] Low. Fix for a specific use case, and it's in upstream for a while. === Original Bug Report === On our 16.04LTS (and earlier) systems we used the ipvsadm --ops UDP support (one-packet scheduling) to get a better distribution amongst our real servers behind the load-balancer for some small subset of applications. This has worked fine through the 4.4.0-xxx kernels. 
But when we started a program to upgrade systems to use the 4.15 series of kernels to take advantage of new facilities, the subset of systems which used the --ops option ran into problems. Everything else with the 4.15 kernels appeared to work well. This issue was reported in #1817247 against 16.04LTS with the HWE 4.15 kernel but has not received any acknowledgement after having been reported weeks ago. So we have moved on to confirm that a stock 18.04LTS system with the latest expected/standard 4.15 kernel has this issue as well, and we report that here. Perhaps this will get more attention. The issue appears to stem from the change in the ip_vs module from "atomic_*()" increment/decrement functions in the 4.4 kernel to "refcount_*()" functions in later kernels, including the 4.15 one we switched to. Unfortunately, the simple refcount_dec() function was inadequate: it puts out a time-consuming message, with the associated handling, whenever the refcount drops to zero, which is expected in the case of --ops support, which retains no state after packet delivery. I will upload an attachment with the sample messages that get put out at packet arrival rate, which destroys performance of course. This test VM reports the identical errors we see in our production servers, but at least throwing only a couple of test --ops packets at it doesn't crash/hang the 18.04 system as it did in the 16.04 VM reported earlier. And in production, with the far greater packet rates, our systems fail since the attached call backtrace *** appears on every packet!! *** This issue was apparently already recognized as an error and has been fixed in upstream kernels. 
This is a reference to the 4.17 version of the fix that we'd like to see incorporated into the next possible kernel maintenance release: https://github.com/torvalds/linux/commit/a050d345cef0dc6249263540da1e902bba617e43#diff-75923493f6e3f314b196a8223b0d6342 We have successfully used the livepatch facility to build a livepatch .ko with the above diffs on our 4.15.0-36 system and demonstrated the contrast in good/bad behavior with/without the livepatch module loaded. But we'd rather not have to build a version of livepatch.ko for each kernel maintenance release, such as the 4.15.0-46 kernel used here to demonstrate that the issue persists in the Ubuntu mainline distro. The problem is easy to generate, with only a couple of packets and a simple configuration. Here's a very basic test configuration (addresses rewritten/obscured) for 2 servers that worked on my test VM:
ipvsadm -A -f 100 -s rr --ops
ipvsadm -a -f 100 -r 10.129.131.227:0 -g -w 9999
ipvsadm -a -f 100 -r 10.129.131.228:0 -g -w 9999
iptables -t mangle -A PREROUTING -d 172.16.5.1/32 -j MARK --set-xmark 0x64/0xffffffff
ifconfig lo:0 172.16.5.1/32 up
Routing and addressing to achieve the above, or adaptation for one's own test environment, is left to the tester. I just added alias 10.129.131.x addresses on my "outbound" interface and a static route for 172.16.5.1 to my client system so the test packets arrived on the "inbound" interface. I set up routing and addresses on my 2-NIC test machine such that packets arrived on its eth1 NIC and were directed by ip_vs out the eth2. To test, all I did was throw a few UDP packets via traceroute at the address in the iptables/firewall mark rule, with the eth1 interface of the test system as the traceroute system's default gateway:
traceroute -m 2 172.16.5.1
Without the fix my test ip_vs system either hangs or puts out messages as per the attached. With our livepatch module using the above commit's contents, all is well. 
Both of the test ("real" as opposed to "virtual") servers configured above via ipvsadm get packets, and no errors are reported in the logs. Let me know of anything I can do to help accelerate addressing or understanding of this issue. It seems that incorporating the fix is fairly straightforward, and that going without it is a performance disaster for anyone using the --ops facility to any significant degree. Thanks! $ lsb_release -rd Description: Ubuntu 18.04.2 LTS Release: 18.04 $ uname -a Linux direct-18-04 4.15.0-46-generic #49-Ubuntu SMP Wed Feb 6 09:33:07 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux ProblemType: Bug DistroRelease: Ubuntu 18.04 Package: linux-image-4.15.0-46-generic 4.15.0-46.49 ProcVersionSignature: Ubuntu 4.15.0-46.49-generic 4.15.18 Uname: Linux 4.15.0-46-generic x86_64 ApportVersion: 2.20.9-0ubuntu7.5 Architecture: amd64 AudioDevicesInUse: USER PID ACCESS COMMAND /dev/snd/controlC0: marc 1980 F.... pulseaudio CurrentDesktop: ubuntu:GNOME Date: Tue Mar 12 13:14:31 2019 InstallationDate: Installed on 2018-08-01 (222 days ago) InstallationMedia: Ubuntu 18.04 LTS "Bionic Beaver" - Release amd64 (20180426) Lsusb: Bus 001 Device 003: ID 0e0f:0002 VMware, Inc. Virtual USB Hub Bus 001 Device 002: ID 0e0f:0003 VMware, Inc. Virtual Mouse Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub MachineType: VMware, Inc. 
VMware Virtual Platform ProcEnviron:  TERM=xterm-256color  PATH=(custom, no user)  XDG_RUNTIME_DIR=<set>  LANG=en_US.UTF-8  SHELL=/bin/bash ProcFB: 0 svgadrmfb ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-46-generic root=UUID=103a235a-d768-45d6-9f14-c7414f72320d ro quiet splash nopti nospectre_v2 nospec_store_bypass_disable RelatedPackageVersions:  linux-restricted-modules-4.15.0-46-generic N/A  linux-backports-modules-4.15.0-46-generic N/A  linux-firmware 1.173.3 RfKill: SourcePackage: linux UpgradeStatus: No upgrade log present (probably fresh install) dmi.bios.date: 05/20/2014 dmi.bios.vendor: Phoenix Technologies LTD dmi.bios.version: 6.00 dmi.board.name: 440BX Desktop Reference Platform dmi.board.vendor: Intel Corporation dmi.board.version: None dmi.chassis.asset.tag: No Asset Tag dmi.chassis.type: 1 dmi.chassis.vendor: No Enclosure dmi.chassis.version: N/A dmi.modalias: dmi:bvnPhoenixTechnologiesLTD:bvr6.00:bd05/20/2014:svnVMware,Inc.:pnVMwareVirtualPlatform:pvrNone:rvnIntelCorporation:rn440BXDesktopReferencePlatform:rvrNone:cvnNoEnclosure:ct1:cvrN/A: dmi.product.name: VMware Virtual Platform dmi.product.version: None dmi.sys.vendor: VMware, Inc.
2019-04-09 05:38:16 Kai-Heng Feng linux (Ubuntu): status Confirmed Fix Released
2019-04-15 03:32:19 Khaled El Mously linux (Ubuntu Bionic): status New Fix Committed
2019-04-29 16:04:05 Ubuntu Kernel Bot tags amd64 apport-bug bionic xenial amd64 apport-bug bionic verification-needed-bionic xenial
2019-04-29 23:25:27 Marc Hasson tags amd64 apport-bug bionic verification-needed-bionic xenial amd64 apport-bug bionic verification-done-bionic xenial
2019-05-14 19:00:51 Launchpad Janitor linux (Ubuntu Bionic): status Fix Committed Fix Released
2019-05-14 19:00:51 Launchpad Janitor cve linked 2017-5715
2019-05-14 19:00:51 Launchpad Janitor cve linked 2017-5753
2019-05-14 19:00:51 Launchpad Janitor cve linked 2017-5754
2019-05-14 19:00:51 Launchpad Janitor cve linked 2018-12126
2019-05-14 19:00:51 Launchpad Janitor cve linked 2018-12127
2019-05-14 19:00:51 Launchpad Janitor cve linked 2018-12130
2019-05-14 19:00:51 Launchpad Janitor cve linked 2018-16884
2019-05-14 19:00:51 Launchpad Janitor cve linked 2018-3620
2019-05-14 19:00:51 Launchpad Janitor cve linked 2018-3639
2019-05-14 19:00:51 Launchpad Janitor cve linked 2018-3646
2019-05-14 19:00:51 Launchpad Janitor cve linked 2019-3874
2019-05-14 19:00:51 Launchpad Janitor cve linked 2019-3882
2019-05-14 19:00:51 Launchpad Janitor cve linked 2019-9500
2019-05-14 19:00:51 Launchpad Janitor cve linked 2019-9503