Comment 4 for bug 533493

EAB (erwin-true) wrote :

I came across a recent HP white paper (PDF) on LVM snapshots:

http://h20000.www2.hp.com/bizsupport/TechSupport/CoreRedirect.jsp?redirectReason=DocIndexPDF&prodSeriesId=4296010&targetPage=http%3A%2F%2Fbizsupport2.austin.hp.com%2Fbc%2Fdocs%2Fsupport%2FSupportManual%2Fc02054539%2Fc02054539.pdf

Most interesting part:
"In very low system memory conditions, deletion of a single snapshot can hang indefinitely for memory to become available. Ensure that sufficient memory is available during deletion of a single snapshot that requires data to be copied to its predecessor. If the lvremove command hangs in these cases, increase the system memory or free some existing system memory to proceed with the snapshot deletion."

No further explanation is given.

Our host has 64GB RAM and two 6-core Intel CPUs.
We're using Munin to graph memory usage. The graphs are updated every 5 minutes, so we don't have exact usage figures for the moment the snapshot was removed.
At the moment the removal of the snapshot was initiated, the host used approximately 51GB RAM, 6GB buffers, 10GB unused and 3GB swap.
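
Since the Munin graphs are too coarse, one way to get numbers for the exact moment of the removal would be to sample /proc/meminfo at a short interval while lvremove runs. The following is only a rough sketch under that assumption; the function names (parse_meminfo, sample) and the set of watched fields are hypothetical choices of mine, not part of any existing tool.

```python
import re
import time

# Fields worth watching for this problem (an arbitrary selection).
WATCHED = ("MemFree", "Buffers", "Cached", "SwapFree")


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB)."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"([^:]+):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def sample(count, interval=1.0, path="/proc/meminfo"):
    """Take `count` samples, `interval` seconds apart, of the watched fields."""
    rows = []
    for _ in range(count):
        with open(path) as handle:
            info = parse_meminfo(handle.read())
        rows.append((time.time(), {key: info.get(key) for key in WATCHED}))
        time.sleep(interval)
    return rows
```

Starting something like sample(600) in a second terminal just before the lvremove would give one reading per second for ten minutes, which could then be compared with the time the command started to hang.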

I'm wondering whether this is related to some NUMA issues I have been researching over the last few weeks. It probably has nothing to do with this issue.

Some memory statistics:
# free -m
             total       used       free     shared    buffers     cached
Mem:         64549      64062        487          0      23579        780
-/+ buffers/cache:      39702      24847
Swap:         7627        377       7250

# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22
node 0 size: 32768 MB
node 0 free: 63 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23
node 1 size: 32758 MB
node 1 free: 437 MB
node distances:
node 0 1
  0: 10 20
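
To watch the free memory per NUMA node over time rather than reading it by eye, the "node N free" lines of numactl --hardware can be parsed with a few lines of Python. This is a minimal sketch; the function name node_free_mb is hypothetical, and the sample text is copied from the output above.

```python
import re


def node_free_mb(numactl_output):
    """Map NUMA node number to free memory in MB from `numactl --hardware` text."""
    free = {}
    for line in numactl_output.splitlines():
        match = re.match(r"node (\d+) free: (\d+) MB", line.strip())
        if match:
            free[int(match.group(1))] = int(match.group(2))
    return free


SAMPLE = """\
available: 2 nodes (0-1)
node 0 size: 32768 MB
node 0 free: 63 MB
node 1 size: 32758 MB
node 1 free: 437 MB
"""

print(node_free_mb(SAMPLE))  # {0: 63, 1: 437}
```

The two nodes together have 500 MB free, which is close to the 487 MB that free -m reports, but node 0 on its own is down to 63 MB.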

The host is swapping only a little right now, but every day it swaps out 4GB of RAM, even with vm.swappiness=0.
We run swapoff -a && swapon -a a couple of times every day.
It should not swap at all, but this seems to be an issue with multiple CPU sockets and processes not using the same NUMA node (CPU pinning). Hosts with multiple sockets (not cores) seem to swap out a lot more.

It is possible that the lvremove action concludes there is not enough RAM available and hangs indefinitely.

Hopefully someone can confirm some of this.