Comment 5 for bug 2002951

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

In the run https://zuul.opendev.org/t/openstack/build/a52064f718904b33a7b7136bf63a1458/logs the test test_attach_scsi_disk_with_config_drive was moved to the end of the execution and it still triggered an OOM during the test execution.

OOM happened at
Jan 18 15:55:13 np0032779705 kernel: Out of memory: Killed process 127344 (python) total-vm:5232640kB, anon-rss:4173792kB, file-rss:5404kB, shmem-rss:0kB, UID:1002 pgtables:8964kB oom_score_adj:0

dstat shows that there is a sudden increase in memory starting at 15:55:02:

Jan 18 15:55:00.704919 np0032779705 dstat.sh[85428]: 18-01 15:55:00| 28 12 59 1 0|6225M 142M 256M 990M| 569M 569M| 171M 21M|1453 458 |9600 10k|1.65 2.42 4.12|2.0 0 25|40.0 5343 |2297M 1998M| 31 494 0 132 8

Jan 18 15:55:01.702533 np0032779705 dstat.sh[85428]: 18-01 15:55:01| 14 0 85 0 1|6224M 143M 256M 991M| 79k 79k| 265k 148k|17.0 10.0 |2051 4493 |1.65 2.42 4.12|1.0 0 0| 0 0 |2297M 1998M| 31 492 0 101 10

Jan 18 15:55:02.703908 np0032779705 dstat.sh[85428]: 18-01 15:55:02| 14 2 84 0 0|6339M 118M 256M 902M| 39k 39k| 492k 15M|19.0 592 |4253 6209 |1.65 2.42 4.12|1.0 0 0|4.00 2178 |2306M 1989M| 31 492 0 101 10

Jan 18 15:55:03.704748 np0032779705 dstat.sh[85428]: 18-01 15:55:03| 10 11 79 0 0|6965M 128M 227M 300M|3700 3700 | 480k 91M|39.0 1237 |9253 5032 |1.68 2.41 4.11|3.0 0 1.0|2.00 23.0k|2399M 1896M| 31 492 0 95 10

Jan 18 15:55:04.712489 np0032779705 dstat.sh[85428]: 18-01 15:55:04| 10 14 74 1 0|7437M 125M 21M 55M| 34k 34k|6923k 203M| 371 1489 | 16k 7296 |1.68 2.41 4.11|3.0 0 5.0|30.8 50.8k|2605M 1690M| 31 492 0 95 10

Jan 18 15:55:05.711668 np0032779705 dstat.sh[85428]: 18-01 15:55:05| 10 19 67 4 0|7488M 103M 2560k 49M|1318 1318 | 50M 513M|1645 2102 | 12k 9149 |1.68 2.41 4.11|4.0 0 6.0| 623 131k|3139M 1156M| 31 492 0 90 10

Jan 18 15:55:06.702850 np0032779705 dstat.sh[85428]: 18-01 15:55:06| 12 17 54 16 0|7415M 126M 6088k 95M| 97k 97k| 100M 258M|4687 6623 | 21k 23k|1.68 2.41 4.11|1.0 2.0 434|10.2k 64.3k|3371M 924M| 31 492 0 91 10

Jan 18 15:55:07.705732 np0032779705 dstat.sh[85428]: 18-01 15:55:07| 5 9 75 10 0|7427M 120M 5272k 90M|7606 7606 | 65M 153M|3704 3302 | 15k 13k|1.68 2.41 4.11|4.0 0 2.0|13.4k 39.3k|3489M 806M| 31 492 0 90 10

Jan 18 15:55:08.703123 np0032779705 dstat.sh[85428]: 18-01 15:55:08| 5 10 76 9 0|7434M 121M 4320k 85M| 35k 35k| 60M 193M|3134 6494 | 16k 10k|1.62 2.39 4.09|3.0 0 0|14.2k 48.8k|3641M 654M| 31 492 0 90 10

Jan 18 15:55:09.704144 np0032779705 dstat.sh[85428]: 18-01 15:55:09| 8 12 56 24 0|7434M 122M 4120k 83M| 46k 46k| 116M 197M|13.6k 6780 | 30k 37k|1.62 2.39 4.09|4.0 1.0 0|27.9k 50.6k|3762M 533M| 31 492 0 90 10

Jan 18 15:55:10.701265 np0032779705 dstat.sh[85428]: 18-01 15:55:10| 5 10 75 9 1|7442M 122M 3784k 76M| 33k 33k| 63M 199M|3784 4171 | 16k 13k|1.62 2.39 4.09|1.0 1.0 1.0|15.7k 50.8k|3920M 375M| 31 492 0 90 10

Jan 18 15:55:11.702276 np0032779705 dstat.sh[85428]: 18-01 15:55:11| 5 10 76 8 0|7438M 123M 3532k 80M| 65k 65k| 63M 210M|3329 4500 | 15k 12k|1.62 2.39 4.09|3.0 0 9.0|13.7k 53.7k|4096M 199M| 31 492 0 90 10

Jan 18 15:55:12.703212 np0032779705 dstat.sh[85428]: 18-01 15:55:12| 5 10 76 7 0|7446M 123M 2928k 72M| 34k 34k| 57M 217M|2514 6590 | 18k 9070 |1.62 2.39 4.09|1.0 0 0|14.4k 54.8k|4273M 22M| 31 492 0 90 10

Jan 18 15:55:13.801361 np0032779705 dstat.sh[85428]: 18-01 15:55:13| 4 28 52 16 0|6279M 1336M 452k 31M|3520 3520 | 181M 63M|5191 4610 | 21k 16k|2.05 2.47 4.11|2.0 32 2.0|12.1k 16.2k|4177M 118M| 31 492 0 90 10

Jan 18 15:55:14.707471 np0032779705 dstat.sh[85428]: 18-01 15:55:14| 4 20 52 21 1|3177M 4376M 1636k 96M| 37k 37k| 96M 351M|4671 16.2k| 53k 29k|2.05 2.47 4.11|1.0 2.0 0|4718 89.7k|4290M 5145k| 31 492 0 90 10

So the problem lies between 15:55:02 and 15:55:13 (when the OOM hit).

The last line of the tempest log is just before this time window:

2023-01-18 15:55:00.337 127344 INFO tempest.lib.common.rest_client [req-21041963-dfcb-482e-88b8-b6538de0f45c req-21041963-dfcb-482e-88b8-b6538de0f45c ] Request (AttachSCSIVolumeTestJSON:test_attach_scsi_disk_with_config_drive): 200 GET https://213.32.79.144/image/v2/images/87972826-8f8b-4f50-a57e-4a26911a7051/file 3.454s

It downloads the content of the image successfully. So we are somewhere in this code when the memory is used up:

https://github.com/openstack/tempest/blob/6fa213cc0fcab744b46109e5f07cb58b0df4a314/tempest/api/compute/admin/test_volume.py#L52-L63

L52 is where we get the last log line and L63 should produce another log line but it could not as the process gets killed.