I did some more investigation on our systems. It turns out that while memory does run low on them, the bottleneck is actually the CPU. I noticed a spike to 100% CPU usage every 30 minutes, and htop showed that the landscape-package-reporter process is responsible.
The landscape-client charm reports the packages on all 13 machines (12 lxd machines + the baremetal machine itself) at the same time every 30 minutes, causing a CPU overload that lasts about 2 minutes. During that window the neutron server responds too slowly to requests, which causes tempest to fail.
Sure enough, after removing the landscape-client charm, all tempest tests passed without errors. I also realize now that we only started seeing this bug frequently once we added landscape-client to all our OpenStack SKUs.
I'm moving this bug to landscape-client. I think it would be very helpful if the various checks that landscape runs were staggered across the different machines, rather than all running at the same time. I did find an option in the landscape UI to stagger the package updates, but I couldn't find a similar option for the package-reporter.
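To illustrate the kind of staggering I have in mind, here is a minimal sketch. It is not how landscape-client works today, and the function name `stagger_offset` and the machine names are made up for the example. The idea is to derive a stable per-machine offset within the 30-minute interval from a hash of the machine's name, so each machine keeps the same slot across runs without any coordination.

```python
import hashlib


def stagger_offset(machine_id: str, interval_seconds: int = 1800) -> int:
    """Return a stable offset in [0, interval_seconds) for this machine.

    Hypothetical helper: hashes the machine name so that every machine
    lands on its own (pseudo-random but repeatable) point in the interval.
    """
    digest = hashlib.sha256(machine_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:8], "big") % interval_seconds


# Assumed names for the 13 machines: the baremetal host plus 12 lxd containers.
machines = ["baremetal"] + [f"lxd-{i}" for i in range(12)]
offsets = {name: stagger_offset(name) for name in machines}

for name, offset in sorted(offsets.items(), key=lambda kv: kv[1]):
    print(f"{name}: start {offset // 60:02d}m{offset % 60:02d}s into each 30-minute window")
```

A hash-based offset does not guarantee that two machines never land close together, but with 13 machines spread over 1800 seconds it should make a simultaneous run on all of them very unlikely.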