Containers: rabbitmq pod dead after lab left running for 3 days - jbd2 hang

Bug #1815541 reported by Frank Miller on 2019-02-12
This bug affects 1 person
Affects     Status    Importance   Assigned to      Milestone
StarlingX   Triaged   Low          Ghada Khalil

Bug Description

Title
-----
Containers: rabbitmq pod dead after lab left running for 3 days

Brief Description
-----------------
On a system running containers for 3-4 days, the openstack CLI failed with an Unknown Error (HTTP 504). Looking at the state of the containers, the following command indicated the rabbit pod was dead:

kubectl describe pod -n openstack <rabbit pod name>

Events:
  Type Reason Age From Message
  ---- ------ ---- ---- -------
  Normal Killing 4m25s (x880 over 29h) kubelet, controller-1 Killing container with id docker://rabbitmq:Container failed liveness probe.. Container will be killed and recreated.

A delete attempt hung as well. A force delete recovered the rabbit pods, but the system did not fully recover.
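
For reference, a rough sketch of the diagnostic and recovery steps described above (the pod name rabbitmq-0 is an assumption; use the actual name from "kubectl get pods"):

# Find the rabbit pod and inspect its liveness-probe events
kubectl -n openstack get pods | grep rabbit
kubectl -n openstack describe pod rabbitmq-0

# A normal delete hung; a force delete skips graceful termination and recovered the pod
kubectl -n openstack delete pod rabbitmq-0 --grace-period=0 --force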

Severity
--------
Critical. The system did not recover.

Steps to Reproduce
------------------
Configure StarlingX with the --kubernetes option. Leave the lab idle for several days.
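
One way to catch the failure while the lab sits idle is to poll pod health on a timer; a sketch below (the 10-minute interval and log file name are arbitrary):

# Record any non-Running pods with a timestamp so the first liveness failure
# can later be correlated with the kernel log
while true; do
    { date; kubectl -n openstack get pods | grep -v Running; } >> /tmp/pod-health.log
    sleep 600
done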

Expected Behavior
------------------
The rabbit pod should not fail. If it fails, it should automatically recover. Manual recovery should bring the system back up to a fully functioning state.

Actual Behavior
----------------
see description

Reproducibility
---------------
Unknown at this time.

System Configuration
--------------------
Dedicated storage

Branch/Pull Time/Commit
-----------------------
February 7th CENGN build of f/stein branch:

controller-0:~$ cat /etc/build.info
###
### StarlingX
### Release 19.01
###

OS="centos"
SW_VERSION="19.01"
BUILD_TARGET="Host Installer"
BUILD_TYPE="Formal"
BUILD_ID="f/stein"

JOB="STX_build_stein_master"
BUILD_BY="<email address hidden>"
BUILD_NUMBER="44"
BUILD_HOST="starlingx_mirror"
BUILD_DATE="2019-02-07 18:00:27 +0000"

Timestamp/Logs
--------------
n/a

Frank Miller (sensfan22) on 2019-02-12
Changed in starlingx:
importance: Undecided → High
Ghada Khalil (gkhalil) wrote :

Marking as release gating; issue related to container env.

Changed in starlingx:
assignee: nobody → Bob Church (rchurch)
status: New → Triaged
tags: added: stx.2019.05
Frank Miller (sensfan22) wrote :

This issue has been seen 3 times in the past 3 weeks on 2 different labs. It may be coincidence, but in at least 2 of the cases it occurred roughly 3 days after a new install, while the lab was relatively idle and all pods were healthy at the time.

Investigation to date indicates the kernel is hung.

Bart started the investigation into why the rabbit pod died and, at the time of the failure, saw a hung kernel thread:
"looks like a filesystem journal thread blocked for more than two minutes right when the first liveness probe failed:"
2019-02-10T11:39:26.644 controller-0 kernel: err [199214.587185] INFO: task jbd2/rbd1-8:59444 blocked for more than 120 seconds.
2019-02-10T11:39:26.644 controller-0 kernel: err [199214.587213] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
2019-02-10T11:39:26.644 controller-0 kernel: info [199214.587249] jbd2/rbd1-8 D ffff8e8b4289e4b0 0 59444 2 0x00000000
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587253] Call Trace:
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587260] [<ffffffffa22f3914>] ? blk_mq_run_hw_queue+0x14/0x20
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587263] [<ffffffffa27bb790>] ? bit_wait+0x50/0x50
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587265] [<ffffffffa27be399>] schedule+0x29/0x70
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587267] [<ffffffffa27bb171>] schedule_timeout+0x2a1/0x330
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587269] [<ffffffffa27bb790>] ? bit_wait+0x50/0x50
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587272] [<ffffffffa22eb57e>] ? blk_flush_plug_list+0xce/0x230
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587273] [<ffffffffa27bb790>] ? bit_wait+0x50/0x50
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587275] [<ffffffffa27bce7d>] io_schedule_timeout+0xad/0x130
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587276] [<ffffffffa27bcf18>] io_schedule+0x18/0x20
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587278] [<ffffffffa27bb7a1>] bit_wait_io+0x11/0x50
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587280] [<ffffffffa27bb2c7>] __wait_on_bit+0x67/0x90
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587283] [<ffffffffa2165b01>] wait_on_page_bit+0x81/0xa0
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587286] [<ffffffffa20a47d0>] ? wake_bit_function+0x40/0x40
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587288] [<ffffffffa2165c31>] __filemap_fdatawait_range+0x111/0x190
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587290] [<ffffffffa22e9a40>] ? submit_bio+0x70/0x150
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587293] [<ffffffffa221e387>] ? bio_alloc_bioset+0xd7/0x220
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587296] [<ffffffffa2165cc4>] filemap_fdatawait_range+0x14/0x30
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214.587297] [<ffffffffa2165d07>] filemap_fdatawait+0x27/0x30
2019-02-10T11:39:26.644 controller-0 kernel: warning [199214....
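
For anyone debugging a similar hang, a sketch of generic kernel-side checks (standard Linux facilities, not specific to this bug; run as root on the affected controller):

# Look for hung-task reports like the jbd2 one above
dmesg | grep -i "blocked for more than"

# The hung-task detector timeout (120 s by default) mentioned in the log
sysctl kernel.hung_task_timeout_secs

# Dump stacks of all uninterruptible (D-state) tasks into the kernel log
echo w > /proc/sysrq-trigger
dmesg | tail -n 200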


Frank Miller (sensfan22) wrote :

Bart tested a reboot of the controllers while the lab was in this state. The pods did recover after the reboot. This is a drastic workaround, but it does at least recover the pods and openstack containers.
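
A minimal post-reboot check, assuming the standard openstack namespace (the endpoint listing is just one example of an openstack CLI call that was returning HTTP 504 before the reboot):

# Confirm the rabbit pod restarted and the openstack CLI responds again
kubectl -n openstack get pods | grep rabbit
openstack endpoint list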

Frank Miller (sensfan22) on 2019-02-27
Changed in starlingx:
status: Triaged → Fix Released
status: Fix Released → Triaged
Frank Miller (sensfan22) on 2019-02-27
Changed in starlingx:
assignee: Bob Church (rchurch) → Ghada Khalil (gkhalil)
Ghada Khalil (gkhalil) on 2019-03-01
summary: - Containers: rabbitmq pod dead after lab left running for 3 days
+ Containers: rabbitmq pod dead after lab left running for 3 days - jbd2
+ hang

I have the same problem during installation when I run “system application-apply stx-openstack” on a virtual Duplex setup.

iso: master 2019-Mar-05
Attach: dmesg.log

Ken Young (kenyis) on 2019-04-05
tags: added: stx.2.0
removed: stx.2019.05
Ghada Khalil (gkhalil) wrote :

As per Cindy's comment in the duplicate bug, reducing the priority to Low and marking as not gating.
https://bugs.launchpad.net/starlingx/+bug/1814595/comments/18

Changed in starlingx:
importance: High → Low
tags: added: stx.distro.other
removed: stx.2.0