[train] ERROR root [ ] Image prepare failed: A process in the process pool was terminated abruptly while the future was running or pending, failing a few jobs

Bug #1895290 reported by Bhagyashri Shewale
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Bhagyashri Shewale

Bug Description

A few CentOS 8 (c8) Train jobs are failing with "ERROR root [ ] Image prepare failed: A process in the process pool was terminated abruptly while the future was running or pending".

Jobs affected:

periodic-tripleo-ci-centos-8-ovb-3ctlr_1comp-featureset035-train
periodic-tripleo-ci-centos-8-multinode-1ctlr-featureset010-train
periodic-tripleo-ci-centos-8-undercloud-containers-train

Error logs:

2020-09-11 11:51:18,146 143187 ERROR root [ ] Image prepare failed: A process in the process pool was terminated abruptly while the future was running or pending.
Traceback (most recent call last):
  File "/usr/bin/tripleo-container-image-prepare", line 138, in <module>
    lock=lock)
  File "/usr/lib/python3.6/site-packages/tripleo_common/image/kolla_builder.py", line 241, in container_images_prepare_multi
    uploader.upload()
  File "/usr/lib/python3.6/site-packages/tripleo_common/image/image_uploader.py", line 506, in upload
    uploader.run_tasks()
  File "/usr/lib/python3.6/site-packages/tripleo_common/image/image_uploader.py", line 2514, in run_tasks
    for result in p.map(upload_task, self.upload_tasks):
  File "/usr/lib64/python3.6/concurrent/futures/process.py", line 366, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
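
The traceback ends in concurrent.futures.process.BrokenProcessPool, which the standard library reports for pending futures when a pool worker dies without sending back a result. A minimal sketch that reproduces the same exception by having one worker SIGKILL itself, the way the kernel OOM killer would. upload_task and run_tasks are hypothetical stand-ins that only mirror the names in the traceback, not tripleo_common's code; the sketch assumes Python 3.7+ (for mp_context) and the Linux "fork" start method.

```python
import multiprocessing
import os
import signal
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool


def upload_task(n):
    """Hypothetical stand-in for one image upload task."""
    if n == 3:
        # Simulate the OOM killer: SIGKILL cannot be caught or handled, so the
        # worker exits without sending back a result or an exception.
        os.kill(os.getpid(), signal.SIGKILL)
    return n * 2


def run_tasks(tasks):
    """Consume p.map() like the traceback does; return (results, error)."""
    results = []
    # "fork" keeps the sketch self-contained on Linux (no re-import of __main__).
    ctx = multiprocessing.get_context("fork")
    try:
        with ProcessPoolExecutor(max_workers=2, mp_context=ctx) as p:
            for result in p.map(upload_task, tasks):
                results.append(result)
    except BrokenProcessPool as exc:
        # The pool marks its pending futures with this error once it notices the
        # dead worker; map() re-raises it at the first such future it reaches.
        return results, exc
    return results, None
```

Calling run_tasks(range(6)) returns the error instead of six results; how many results arrive before the failure depends on scheduling, but never includes task 3 or anything after it.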

Reference links:

[1]: https://logserver.rdoproject.org/54/29254/1/check/periodic-tripleo-ci-centos-8-multinode-1ctlr-featureset010-train/bee02d1/logs/undercloud/var/log/tripleo-container-image-prepare.log.txt.gz

[2]: https://logserver.rdoproject.org/54/29254/1/check/periodic-tripleo-ci-centos-8-multinode-1ctlr-featureset010-train/bee02d1/logs/undercloud/home/zuul/overcloud_deploy.log.txt.gz

[3]: https://logserver.rdoproject.org/54/29254/1/check/periodic-tripleo-ci-centos-8-multinode-1ctlr-featureset010-train/bee02d1/job-output.txt

Rabi Mishra (rabi) wrote:

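OOM killer. The kernel invoked the OOM killer at 11:50:54, about 24 seconds before the 11:51:18 image prepare failure; see the journal: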

https://logserver.rdoproject.org/54/29254/1/check/periodic-tripleo-ci-centos-8-multinode-1ctlr-featureset010-train/bee02d1/logs/undercloud/var/log/extra/journal.txt.gz

Sep 11 11:50:54 undercloud.localdomain kernel: httpd invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
Sep 11 11:50:54 undercloud.localdomain kernel: httpd cpuset=libpod-d3b833c5ac22b35fc5873f2fcdde1d1736aed0ed19159c1f95b482ea8e23cc86.scope mems_allowed=0
Sep 11 11:50:54 undercloud.localdomain kernel: CPU: 4 PID: 128877 Comm: httpd Not tainted 4.18.0-193.14.2.el8_2.x86_64 #1
Sep 11 11:50:54 undercloud.localdomain kernel: Hardware name: OpenStack Foundation OpenStack Nova, BIOS 1.11.0-2.el7 04/01/2014
Sep 11 11:50:54 undercloud.localdomain kernel: Call Trace:
Sep 11 11:50:54 undercloud.localdomain kernel: dump_stack+0x5c/0x80
Sep 11 11:50:54 undercloud.localdomain kernel: dump_header+0x6e/0x27a
Sep 11 11:50:54 undercloud.localdomain kernel: ? virtballoon_oom_notify+0x25/0x70 [virtio_balloon]
Sep 11 11:50:54 undercloud.localdomain kernel: oom_kill_process.cold.28+0xb/0x10
Sep 11 11:50:54 undercloud.localdomain kernel: out_of_memory+0x1ba/0x490
Sep 11 11:50:54 undercloud.localdomain kernel: __alloc_pages_slowpath+0xc40/0xd60
Sep 11 11:50:54 undercloud.localdomain kernel: __alloc_pages_nodemask+0x245/0x280
Sep 11 11:50:54 undercloud.localdomain kernel: filemap_fault+0x3d0/0x860
Sep 11 11:50:54 undercloud.localdomain kernel: ? alloc_set_pte+0x203/0x480
Sep 11 11:50:54 undercloud.localdomain kernel: ? filemap_map_pages+0x38d/0x3b0
Sep 11 11:50:54 undercloud.localdomain kernel: ext4_filemap_fault+0x2c/0x40 [ext4]
Sep 11 11:50:54 undercloud.localdomain kernel: __do_fault+0x38/0xc0
Sep 11 11:50:54 undercloud.localdomain kernel: do_fault+0x191/0x3c0
Sep 11 11:50:54 undercloud.localdomain kernel: __handle_mm_fault+0x441/0x6a0
Sep 11 11:50:54 undercloud.localdomain kernel: handle_mm_fault+0xda/0x200
Sep 11 11:50:54 undercloud.localdomain kernel: __do_page_fault+0x22d/0x4e0
Sep 11 11:50:54 undercloud.localdomain kernel: do_page_fault+0x32/0x110
Sep 11 11:50:54 undercloud.localdomain kernel: ? async_page_fault+0x8/0x30
Sep 11 11:50:54 undercloud.localdomain kernel: async_page_fault+0x1e/0x30
Sep 11 11:50:54 undercloud.localdomain kernel: RIP: 0033:0x7f48a2b5bf42
Sep 11 11:50:54 undercloud.localdomain kernel: Code: Bad RIP value.
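
To pull OOM events out of a journal like the one above, a small line scan is enough. A minimal sketch: find_oom_events and both regexes are hypothetical. The "invoked oom-killer" pattern matches the lines quoted here, while the "Killed process <pid> (<name>)" victim line is assumed from typical kernel output, since it does not appear in this excerpt.

```python
import re

# Matches e.g. "Sep 11 11:50:54 undercloud.localdomain kernel: httpd invoked oom-killer: ..."
INVOKED_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: (?P<comm>\S+) invoked oom-killer"
)
# Assumed victim line format, e.g. "... kernel: Killed process 1234 (python3) total-vm:..."
KILLED_RE = re.compile(
    r"^(?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d) \S+ kernel: .*Killed process (?P<pid>\d+) \((?P<comm>[^)]+)\)"
)


def find_oom_events(journal_text):
    """Return (kind, timestamp, process name) tuples for OOM-related kernel lines."""
    events = []
    for line in journal_text.splitlines():
        m = INVOKED_RE.match(line)
        if m:
            events.append(("invoked", m.group("ts"), m.group("comm")))
            continue
        m = KILLED_RE.match(line)
        if m:
            events.append(("killed", m.group("ts"), m.group("comm")))
    return events
```

The process that invokes the OOM killer (httpd above) is only the one whose allocation failed; the process actually killed may be a different one, which is why the victim line is worth matching separately.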

Alex Schultz (alex-schultz) wrote:

chandan kumar (chkumar246) wrote:

We can bump the value of configure_swap_size (configure_swap_size: 4096), which comes from https://opendev.org/openstack/openstack-zuul-jobs/src/branch/master/roles/configure-swap/defaults/main.yaml#L2, either via a Zuul job var or, better, by putting it in the base job.
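
To confirm on a node that a swap bump actually took effect, one can compare SwapTotal in /proc/meminfo with the requested size. A minimal sketch, assuming configure_swap_size is expressed in MiB; swap_total_mib and swap_is_at_least are hypothetical helpers, not part of the configure-swap role.

```python
def swap_total_mib(meminfo_text):
    """Return SwapTotal from /proc/meminfo-style text, rounded to MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Line format: "SwapTotal:       4194300 kB" (the kernel reports KiB).
            value_kib = int(line.split()[1])
            # mkswap keeps the first page for its header, so a 4096 MiB swap file
            # typically reports 4 KiB less than 4096 * 1024; round to absorb that.
            return round(value_kib / 1024)
    raise ValueError("no SwapTotal line found")


def swap_is_at_least(meminfo_text, configure_swap_size_mib=4096):
    """True if the node has at least the swap requested via configure_swap_size."""
    return swap_total_mib(meminfo_text) >= configure_swap_size_mib
```

On a live node, pass it the contents of /proc/meminfo, e.g. open("/proc/meminfo").read().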

chandan kumar (chkumar246) wrote:

We are also seeing this in the c8 Train multinode job: https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_b04/749645/51/check/tripleo-ci-centos-8-containers-multinode-train/b045598/logs/undercloud/home/zuul/overcloud_deploy.log

DEBUG:tripleo_common.image.image_uploader:Released lock on layer sha256:369aa5567f56b79c7001b7e292c42eb615f76b3132cfcf8c70d6c79c8c701078
DEBUG:tripleo_common.image.image_uploader:[/tripleotraincentos8/centos-binary-multipathd] Waiting for next job: 3 of 4 complete
DEBUG:urllib3.connectionpool:http://192.168.24.1:8787 "HEAD /v2/tripleotraincentos8/centos-binary-nova-libvirt/blobs/sha256:ec1681b6a383e4ecedbeddd5abc596f3de835aed6db39a735f62395c8edbff30 HTTP/1.1" 404 0
DEBUG:tripleo_common.image.image_export:[tripleotraincentos8/centos-binary-haproxy] Linking layers: /var/lib/image-serve/v2/tripleotraincentos8/centos-binary-haproxy/blobs/sha256:ec1681b6a383e4ecedbeddd5abc596f3de835aed6db39a735f62395c8edbff30.gz -> /var/lib/image-serve/v2/tripleotraincentos8/centos-binary-nova-libvirt/blobs/sha256:ec1681b6a383e4ecedbeddd5abc596f3de835aed6db39a735f62395c8edbff30.gz
DEBUG:tripleo_common.image.image_uploader:Released lock on layer sha256:ec1681b6a383e4ecedbeddd5abc596f3de835aed6db39a735f62395c8edbff30
DEBUG:tripleo_common.image.image_uploader:[/tripleotraincentos8/centos-binary-nova-libvirt] Waiting for next job: 4 of 6 complete
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/usr/lib64/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/usr/lib64/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "/usr/lib64/python3.6/concurrent/futures/process.py", line 295, in _queue_management_worker
    shutdown_worker()
  File "/usr/lib64/python3.6/concurrent/futures/process.py", line 253, in shutdown_worker
    call_queue.put_nowait(None)
  File "/usr/lib64/python3.6/multiprocessing/queues.py", line 129, in put_nowait
    return self.put(obj, False)
  File "/usr/lib64/python3.6/multiprocessing/queues.py", line 83, in put
    raise Full
queue.Full

ERROR:root:Image prepare failed: A process in the process pool was terminated abruptly while the future was running or pending.
Traceback (most recent call last):
  File "/usr/bin/tripleo-container-image-prepare", line 138, in <module>
    lock=lock)
  File "/usr/lib/python3.6/site-packages/tripleo_common/image/kolla_builder.py", line 241, in container_images_prepare_multi
    uploader.upload()
  File "/usr/lib/python3.6/site-packages/tripleo_common/image/image_uploader.py", line 506, in upload
    uploader.run_tasks()
  File "/usr/lib/python3.6/site-packages/tripleo_common/image/image_uploader.py", line 2514, in run_tasks
    for result in p.map(upload_task, self.upload_tasks):
  File "/usr/lib64/python3.6/concurrent/futures/process.py", line 366, in _chain_from_iterable_of_lists
    for element in iterable:
  File "/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
    yield fs.pop().result()
  File "/u...
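
The first traceback in this log ends in queue.Full from call_queue.put_nowait(None) in shutdown_worker(). In the Python 3.6 ProcessPoolExecutor, the call queue appears to be a bounded multiprocessing.Queue, and shutdown_worker() pushes a None sentinel per live worker without blocking, so a queue still full of pending calls when a worker dies makes that push fail. A minimal sketch of the underlying queue behavior; fill_and_overflow is a hypothetical helper, not CPython code.

```python
import multiprocessing
import queue


def fill_and_overflow(maxsize):
    """Push None sentinels with put_nowait() until the bounded queue refuses one.

    Returns how many sentinels were accepted before queue.Full was raised.
    """
    q = multiprocessing.Queue(maxsize=maxsize)
    accepted = 0
    try:
        for _ in range(maxsize + 1):
            # Like call_queue.put_nowait(None) in the traceback: this never
            # blocks, it raises queue.Full once maxsize items are queued.
            q.put_nowait(None)
            accepted += 1
    except queue.Full:
        return accepted
    finally:
        # Release the queue's feeder thread and pipe.
        q.close()
        q.join_thread()
    return accepted
```

Because put_nowait() never waits for a consumer, the management thread dies with this error instead of delivering the sentinels, while the main thread reports BrokenProcessPool for its pending futures, which matches the two tracebacks above.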

Bhagyashri Shewale (bhagyashri-shewale) wrote:

Michele Baldessari (michele) wrote:

Bhagyashri Shewale (bhagyashri-shewale) wrote:

Hi Michele Baldessari (michele),

I have pushed the patch: https://review.opendev.org/#/c/752187/ (Increase configure_swap_size to 4096)

Bhagyashri Shewale (bhagyashri-shewale) wrote:
Changed in tripleo:
assignee: nobody → Bhagyashri Shewale (bhagyashri-shewale)
status: Triaged → Fix Released