Containers: Kernel call trace Task jbd2 blocked
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
StarlingX | | Low | Jim Somerville |
Bug Description
Brief Description
-----------------
Kernel errors in repetitive migration operations
(Containerized STX load)
Linux version 3.10.0-
2019-01-
Severity
--------
Standard
Steps to Reproduce
------------------
Run instance launch and migrations (live migration, then cold migration).
Repeat over the course of several hours.
The instance was created here:
req-c320289f-
Live migrated here:
req-05faa0f4-
Cold migrated here:
req-64f7de0a-
Cold-Migrate-
Cold-Migrate complete for instance tenant2-volume-47 now enabled on host compute-1 waiting for confirmation
Cold migration confirmed here:
req-55876933-
Cold-Migrate-
Instance tenant2-volume-47 owned by tenant2 has failed on host compute-1
The instance then reboots a few times to try to recover:
req-7c055100-
req-14b5c560-
req-256f317b-
Expected Behavior
------------------
Migrations complete repeatedly with no kernel errors or hung-task call traces.
Actual Behavior
----------------
Kernel call trace for a hung task (see below).
See nfv-vim-events.log:
log-id = 636
event-id = instance-failed
event-type = action-event
event-context = admin
importance = high
entity = tenant=
reason_text = Instance tenant2-volume-47 owned by tenant2 has failed on host compute-1
additional_text =
timestamp = 2019-02-01 21:51:39.050201
controller-0 kern.log
[kern.log excerpt truncated in this report; see the attached kern.log for the full jbd2 hung-task call trace]
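For reference, a minimal sketch of how the blocked-task state can be captured on the affected host using standard kernel facilities (not part of the original report; requires sysrq to be enabled):

    # Show the hung-task detector timeout (120s by default on CentOS)
    cat /proc/sys/kernel/hung_task_timeout_secs

    # Dump stack traces of all uninterruptible (D-state) tasks to the kernel log
    echo 1 > /proc/sys/kernel/sysrq
    echo w > /proc/sysrq-trigger

    # Pull the jbd2 hung-task traces back out of the kernel log
    dmesg | grep -B 2 -A 25 "blocked for more than"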
Reproducibility
---------------
Took several hours to occur on the first set of migrations.
Reproduced within 2 hours on the second run.
System Configuration
--------------------
Standard
Branch/Pull Time/Commit
-----------------------
Master as of date: 2019-01-30_20-18-00
This was seen when running a containerized config using the Pike Docker images.
Timestamp/Logs
--------------
See kern.log attached, ~2019-02-
See also kern2.log attached, ~2019-02-
Wendy Mitchell (wmitchellwr) wrote (#1):
Wendy Mitchell (wmitchellwr) wrote (#2):
Wendy Mitchell (wmitchellwr) wrote (#3):
Ghada Khalil (gkhalil) wrote (#4):
Marking as release gating until further investigation.
The issue was reported during testing in a containerized environment. It was not seen w/o containers.
Changed in starlingx:
importance: Undecided → Medium
status: New → Triaged
description: updated
summary: Kernel call trace Task jbd2 blocked → Containers: Kernel call trace Task jbd2 blocked
tags: added: stx.2019.05 stx.containers stx.distro.other
Changed in starlingx:
assignee: nobody → Cindy Xie (xxie1)
Ovidiu Poncea (ovidiu.poncea) wrote (#5):
We also saw the issue during normal operation, yet we were unable to reproduce it consistently, so it's good that you are able to reproduce it. Sharing our data below.
Given the kernel error, and with a little help from Google, we narrowed it down to one of these two:
1. An issue caused by "Too many open files"; see https:/
2. Or possibly an issue in the RBD driver in the kernel, since no monitor reported slow operations and health was OK all the time (load on the cluster was light the whole time). Google gets two related hits: https:/
Regarding #1 above, to get more info, please add the section below to ceph.conf before doing the test:
[client]
log file = /var/log/
debug rbd = 20
debug librbd = 20
If it is the same issue that Red Hat was seeing, we should get messages like this in /var/log/
7f0b769e7700 -1 -- 192.168.
If not, at least you should get more info on what's happening.
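As a quick check for scenario #1 (file descriptor exhaustion), something along these lines could be run on the affected host once the debug logging above is enabled (the log path and process name below are assumptions, since the exact paths are truncated here):

    # Look for the fd-exhaustion signature in the RBD client log (path is an assumption)
    grep -i "too many open files" /var/log/ceph/*.log

    # Compare open file descriptors of the qemu process against its limit
    pid=$(pgrep -f qemu-kvm | head -n1)   # example process; adjust to the workload under test
    ls /proc/$pid/fd | wc -l
    grep "Max open files" /proc/$pid/limits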
Changed in starlingx:
assignee: Cindy Xie (xxie1) → Lin Shuicheng (shuicheng)
Frank Miller (sensfan22) wrote (#6):
Changed priority to High due to the # of labs that have seen this issue in the past week (at least 7). Also looks like this is a duplicate of https:/
Changed in starlingx:
importance: Medium → High
Jim Somerville (jsomervi) wrote (#7):
Possible candidate:
In the 862 kernel we have this patch which wasn't in 693:
commit bd166ef183c263c
Author: Jens Axboe <email address hidden>
Date: Tue Jan 17 06:03:22 2017 -0700
blk-mq-sched: add framework for MQ capable IO schedulers
This adds a set of hooks that intercepts the blk-mq path of
allocating/inserting/issuing/completing requests, allowing
us to develop a scheduler within that framework.
We reuse the existing elevator scheduler API on the registration
side, but augment that with the scheduler flagging support for
the blk-mq interface, and with a separate set of ops hooks for MQ
devices.
We split driver and scheduler tags, so we can run the scheduling
independently of device queue depth.
Signed-off-by: Jens Axboe <email address hidden>
Reviewed-by: Bart Van Assche <email address hidden>
Reviewed-by: Omar Sandoval <email address hidden>
Subsequently, it had to be fixed (not in 862 or 957):
commit 85bd6e61f34dffa
Author: Jianchao Wang <email address hidden>
Date: Wed Jan 30 17:01:56 2019 +0800
blk-mq: fix a hung issue when fsync
Florian reported a io hung issue when fsync(). It should be
triggered by following race condition.
(The original commit message illustrates the race with a two-column diagram, a "data + post flush" request racing against a separate flush request, which did not survive here. In short: on completion of the data request, req->end_io does not check RESTART, and the REQ_FSEQ_POSTFLUSH step then does nothing because the previous flush has not been completed, so the flush request is never dispatched and fsync() hangs.)
To fix this, the commit replaces the blk_mq_run_hw_queue call on the flush data end_io path with a scheduler restart.
Fixes: bd166ef1 (blk-mq-sched: add framework for MQ capable IO schedulers)
Reported-by: Florian Stecker <email address hidden>
Tested-by: Florian Stecker <email address hidden>
Signed-off-by: Jianchao Wang <email address hidden>
Signed-off-by: Jens Axboe <email address hidden>
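A quick way to check whether a given kernel source tree already carries the follow-up fix, sketched below (the source path is a placeholder; the subject lines come from the commits quoted above):

    cd /path/to/kernel-source   # placeholder: a checkout of the kernel under test

    # Present if the tree has the fix; prints nothing otherwise
    git log --oneline --grep='blk-mq: fix a hung issue when fsync'

    # Present if the tree has the blk-mq scheduler framework that introduced the race
    git log --oneline --grep='blk-mq-sched: add framework for MQ capable IO schedulers'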
Ghada Khalil (gkhalil) wrote (#8):
The issue has been recently reproduced in 3 labs - 2 running the CentOS 7.6 kernel. Jim Somerville is actively investigating; assigning the launchpad to him.
Changed in starlingx:
assignee: Lin Shuicheng (shuicheng) → Jim Somerville (jsomervi)
Changed in starlingx:
status: Triaged → In Progress
Juan Carlos Alonso (juancarlosa) wrote (#9):
DX and Standard 2+2 configurations faced problems with provisioning due to an existing issue related to the “rabbitmq” service pod.
The same issue affected the Stress and Stability execution. It seems that the “rabbitmq” pod had issues handling signals when there is a big load.
Jim Somerville (jsomervi) wrote (#10):
That's interesting about rabbitmq; do you have a link or bug number?
Erich Cordoba (ericho) wrote (#11):
I think Juan Carlos refers to this one: https:/
Also, in that LP it was mentioned that the issue happened in a duplex system before the "system application-apply"; it was assumed that the failures regarding rabbitmq were related to this kernel oops (could they be a symptom?).
However, it seems that the only constant is to just leave the system up for a couple of hours (or days) to reproduce the behaviour.
See this comment: https:/
It has a dmesg output attached that might be helpful.
Related fix proposed to branch: master
Review: https:/
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 77cbb985f294fbc
Author: Robert Church <email address hidden>
Date: Fri Mar 8 15:54:19 2019 -0500
For stability bump size of rabbitmq PV to 1Gi
The rabbitmq chart requests a 256Mi PV for operational storage. With
CentOS 7.5 and 7.6 kernels, a jbd2 kernel thread hang is observed after
a long soak period. Once this occurs, a host reboot is required to
recover access to the PV.
We have been able to reliably recreate this using the stock upstream
CentOS 7.6 kernel and the latest Ceph Jewel LTS (10.2.11) version using
fsstress. This is currently pointing to a race condition in the
filesystem code.
With a reliable test available for this, other scenarios to characterize
this have been performed including using different volume sizes and
using different ext4 filesystem formatting options.
We've been unable to cause the hang using a 1Gi PV over an extended soak
period so we'll update the stx-openstack manifest to request a 1Gi PV
until the root cause and fix has been addressed in the kernel.
Change-Id: Ia0e5b7ffb049c6
Related-Bug: #1814595
Signed-off-by: Robert Church <email address hidden>
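For reference, a rough sketch of the kind of fsstress-based reproduction described above; the pool/image names, sizes, and fsstress parameters here are illustrative, not the exact recipe that was used:

    # Create and map a small RBD image, mimicking a 256Mi PV (pool/image names illustrative)
    rbd create kube-rbd/jbd2-test --size 256
    dev=$(rbd map kube-rbd/jbd2-test)

    # Format and mount it as ext4, as the PV provisioner would
    mkfs.ext4 "$dev"
    mkdir -p /mnt/jbd2-test
    mount "$dev" /mnt/jbd2-test

    # Hammer the filesystem until the jbd2 hung-task trace shows up in kern.log
    # (fsstress ships with xfstests/ltp: -p processes, -n ops per process, -l 0 to loop indefinitely)
    fsstress -d /mnt/jbd2-test -p 10 -n 1000 -l 0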
Frank Miller (sensfan22) wrote (#14):
Dropping priority to Medium since the workaround (using a 1G volume for rabbitmq) prevents the issue. We still need to find a solution, and Jim continues to investigate the kernel.
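To confirm the workaround is in effect on a deployed system, the rabbitmq PVC size can be checked with something like the following (the namespace and name filter are assumptions):

    # The rabbitmq PVC should now report 1Gi capacity rather than 256Mi
    kubectl -n openstack get pvc | grep -i rabbit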
Changed in starlingx:
importance: High → Medium
Ghada Khalil (gkhalil) wrote (#15):
Downgraded to Medium given that a workaround has been pushed.
Jim Somerville (jsomervi) wrote (#16):
This issue occurs in out-of-the-box standard CentOS 7.6 and earlier; it has nothing to do with any of our kernel patches or other modifications.
I have reported it upstream. https:/
My sincere thanks to Bob Church for taking the time to put together a comprehensive recipe for reproducing the issue.
I also tested against the upstream 4.9 kernel from Linus' tree, using the same CentOS 7.6 user space (i.e. I just slid the newer kernel onto the same system), and the problem was gone.
Thus this really looks like it is specifically a CentOS kernel issue.
tags: added: stx.2.0 removed: stx.2019.05
tags: added: stx.retestneeded
tags: removed: stx.2.0
Ghada Khalil (gkhalil) wrote (#17):
Changing priority to Low based on Cindy's update that this should not be release gating. I've asked Cindy to add justification for tracking purposes as well.
Changed in starlingx:
importance: Medium → Low
Cindy Xie (xxie1) wrote (#18):
We reviewed this in the non-openstack-dist project meeting and agreed it is not gating stx.2.0, since Jim found it is a CentOS 7.6 upstream issue; a workaround is in place and an upstream bug has been opened.
Frank Miller (sensfan22) wrote (#19):
To add to Cindy's comment, the main reason this issue is no longer gating stx.2.0 is that the original issue reported in StarlingX has been avoided by using a different size for the PVs used by StarlingX (see: https:/
Since StarlingX is no longer using 256Mi-sized PVs, the software no longer hits this jbd2 issue. As such, this CentOS 7.6 upstream issue is no longer impacting StarlingX and this LP is not gating stx.2.0.
The general steps are as follows; they are then repeated for several hours.
Test Step 1: Clear local storage cache on compute-0
'rm -rf /var/lib/nova/instances/_base/*'
'sync; echo 3 > /proc/sys/vm/drop_caches'
Step 2: Create a flavor with 2 vcpus, dedicated cpu policy, and local_image storage
'nova --os-username 'admin' --os-password '<password>' --os-project-name admin --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne flavor-create local_image auto 1024 2 2'
Step 3: Tenant2 launches instance1
tenant2 boots the instance, e.g. from local_image on compute-0 (collect startup time info)
'nova --os-username 'tenant2' --os-password '<password>' --os-project-name tenant2 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name keypair-tenant2 --flavor <flavorid> --image <tis-centos-guest image> --nic net-id=28348793-baef-4987-ae23-0baf02036819 --nic net-id=9267348a-c940-41e4-8a57-ce1046946f42 --nic net-id=3de977f7-94d2-45ff-907d-977a715c7782 tenant2-local_image-1 --poll'
Send 'ping -c 3 <ip of mgmt network for tenant2>'
Step 4: Tenant1 launches instance2 (the observer instance)
Send 'nova --os-username 'tenant1' --os-password '<password>' --os-project-name tenant1 --os-auth-url http://keystone.openstack.svc.cluster.local/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne boot --key-name keypair-tenant1 --flavor <flavorid> --image <tis-centos-guest image> --nic net-id=f6ac6f3d-45e2-4e1f-88ad-ac10bbe8ccf2 --nic net-id=0d3b246c-f0b5-4ec4-bbd4-c76eb39d1b9b --nic net-id=3de977f7-94d2-45ff-907d-977a715c7782 tenant1-observer-2 --poll'
Send 'ping -c 3 <ip of mgmt network for tenant1>'
Step 5: Collect live migration KPI for the VM booted from local_image
Step 6: Set up traffic for the pair instance1 and instance2
Step 7: Cold migrate the instance and confirm the cold migration (collect KPI for each)
Step 8: Set up traffic for the pair of instances (the one launched in step 3 and the observer instance launched in step 4)
Step 9/10: Ping instance1 from the NatBox on a new thread (collect KPI for rebuild)
Test Step 11: Perform a rebuild on instance1 again and ensure it is reachable afterwards (ping the tenant2 mgmt network)
Teardown
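A rough sketch of the repeated migration cycle from the steps above, using the standard nova CLI (the instance name, sleep intervals, and pacing are illustrative; credentials and --os-* options as in the commands above):

    #!/bin/bash
    # Loop live migration, cold migration, and confirm for one instance until the hang appears.
    vm=tenant2-local_image-1          # illustrative instance name from step 3

    while true; do
        nova live-migration "$vm"     # live migrate; scheduler picks the target host
        sleep 120                     # allow the migration to settle
        nova migrate "$vm"            # cold migrate
        sleep 120
        nova resize-confirm "$vm"     # confirm the cold migration
        sleep 3600                    # pace the loop; the hang took hours to appear
    done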