radosgw coredump files generated while launching VMs

Bug #1830938 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Tingjie Chen
Milestone:

Bug Description

Brief Description
-----------------
core.radosgw files were generated. Around that timestamp, the system was booting VMs, and before that timestamp the alarm "Service group storage-monitoring-services warning; ceph-radosgw(enabled-active, )" was present in the alarm list.

Severity
--------
Major

Steps to Reproduce
------------------
As described above

TC-name:

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Two node system

Lab-name: WP_1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-05-28_17-05-57

Last Pass
---------
2019-05-27_09-53-12

Timestamp/Logs
--------------
Core file time stamps: 11:14 & 11:16
controller-1:/var/lib/systemd/coredump$ ls -l
total 1536
-rw-r----- 1 root root 774032 May 29 11:14 core.radosgw.0.3b71b9ac72b84eab9e56c4856c546181.1784442.1559128451000000.xz
-rw-r----- 1 root root 789196 May 29 11:16 core.radosgw.0.3b71b9ac72b84eab9e56c4856c546181.1811512.1559128572000000.xz

[2019-05-29 11:10:35,106] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-05-29 11:10:37,927] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+----------+----------------------------+
| e55882c9-508e-472c-b4c5-5c6a640453cc | 400.001 | Service group cloud-services warning; dbmon(enabled-standby, ) | service_domain=controller.service_group=cloud-services.host=controller-1 | minor | 2019-05-29T11:10:33.514372 |
| 40b45e1b-f7a4-4911-866f-c99d06071aa1 | 400.001 | Service group cloud-services warning; dbmon(enabled-active, ) | service_domain=controller.service_group=cloud-services.host=controller-0 | minor | 2019-05-29T11:10:30.549199 |
| d006d1c4-6bc4-45ad-afc0-e3138f624725 | 100.103 | Platform Memory threshold exceeded ; threshold 80.00%, actual 80.06% | host=controller-0.numa=node1 | major | 2019-05-29T11:09:09.512971 |
| 27b1b64f-78a8-42ad-8f27-ffc0264f880f | 400.001 | Service group storage-monitoring-services warning; ceph-radosgw(enabled-active, ) | service_domain=controller.service_group=storage-monitoring-services.host=controller-1 | minor | 2019-05-29T11:08:45.028211 |
| 8bef685d-7f07-4e32-9d16-dde371622402 | 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-05-29T10:59:39.663177 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+----------+----------------------------+
controller-0:~$

From 11:11 to 11:16 the system was booting up VMs.

[2019-05-29 11:18:34,170] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne service-list'
[2019-05-29 11:18:38,139] 387 DEBUG MainThread ssh.expect :: Output:
+-----+----------------------------+--------------+----------------+
| id | service_name | hostname | state |
+-----+----------------------------+--------------+----------------+
| 102 | barbican-api | controller-0 | enabled-active |
| 103 | barbican-keystone-listener | controller-0 | enabled-active |
| 104 | barbican-worker | controller-0 | enabled-active |
| 54 | ceph-manager | controller-0 | disabled |
| 110 | ceph-mon | controller-0 | enabled-active |
| 113 | ceph-osd | controller-0 | enabled-active |
| 67 | ceph-radosgw | controller-0 | disabled |

[2019-05-29 11:19:00,260] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-05-29 11:19:00,260] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

Test Activity
-------------
Sanity

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

 Marking as release gating for now until further investigation. This would impact object storage (swift) configurations.

description: updated
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.2.0 stx.storage
Changed in starlingx:
status: New → Triaged
assignee: nobody → Cindy Xie (xxie1)
Peng Peng (ppeng)
description: updated
Revision history for this message
yong hu (yhu6) wrote :

@peng, please provide the reproduction steps; we don't quite know what the "launch VMs" steps are.

In addition, it was a duplex deployment, wasn't it?

Changed in starlingx:
assignee: Cindy Xie (xxie1) → nobody
assignee: nobody → chen haochuan (martin1982)
Revision history for this message
chen haochuan (martin1982) wrote :

Duplicate of LP #1827268:
https://bugs.launchpad.net/starlingx/+bug/1827268

2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718851] Call Trace:
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718872] [<ffffffff95612b89>] schedule_preempt_disabled+0x39/0x90
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718887] [<ffffffff956109d5>] __mutex_lock_slowpath+0xd5/0x210
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718893] [<ffffffff9500ccd5>] ? unlazy_walk+0xb5/0x130
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718897] [<ffffffff9560fc07>] mutex_lock+0x17/0x30
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718904] [<ffffffff956045f1>] lookup_slow+0x33/0xa7
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718909] [<ffffffff9501053f>] link_path_walk+0x80f/0x8b0
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718916] [<ffffffff94f86de4>] ? filemap_fault+0x74/0x460
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718918] [<ffffffff9501074a>] path_lookupat+0x7a/0x8b0
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718948] [<ffffffffc0a6b911>] ? ext4_filemap_fault+0x41/0x50 [ext4]
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718950] [<ffffffff9501311f>] ? getname_flags+0x4f/0x1a0
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718952] [<ffffffff95010fab>] filename_lookup+0x2b/0xc0
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718954] [<ffffffff950142b7>] user_path_at_empty+0x67/0xc0
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718958] [<ffffffff94fb8267>] ? handle_mm_fault+0x557/0xc30
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718960] [<ffffffff95014321>] user_path_at+0x11/0x20
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718962] [<ffffffff95007373>] vfs_fstatat+0x63/0xc0
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718963] [<ffffffff9500772e>] SYSC_newstat+0x2e/0x60
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718966] [<ffffffff94ffe8ba>] ? __check_object_size+0x1ca/0x250
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718972] [<ffffffff94e940f4>] ? SyS_rt_sigprocmask+0xc4/0x100
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718973] [<ffffffff95007bee>] SyS_newstat+0xe/0x10
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718976] [<ffffffff95616fdb>] system_call_fastpath+0x22/0x27
2019-05-01T17:10:07.569 storage-2 kernel: err [ 721.719241] INFO: task install_banner_:15141 blocked for more than 120 seconds.

Revision history for this message
chen haochuan (martin1982) wrote :

controller-0:~$ gdb /usr/bin/radosgw core.radosgw.0.3b71b9ac72b84eab9e56c4856c546181.1811512.1559128572000000

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/bin/radosgw -n client.radosgw.gateway'.
Program terminated with signal 6, Aborted.
#0 0x00007fcdd1dbd207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install ceph-radosgw-13.2.2-0.el7.tis.26.x86_64
(gdb) bt
#0 0x00007fcdd1dbd207 in raise () from /lib64/libc.so.6
#1 0x00007fcdd1dbe8f8 in abort () from /lib64/libc.so.6
#2 0x00007fcdd26cc765 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3 0x00007fcdd26ca746 in ?? () from /lib64/libstdc++.so.6
#4 0x00007fcdd26ca773 in std::terminate() () from /lib64/libstdc++.so.6
#5 0x00007fcdd26ca993 in __cxa_throw () from /lib64/libstdc++.so.6
#6 0x00007fcddddb18bb in (anonymous namespace)::handle_oom(void* (*)(void*), void*, bool, bool) () from /lib64/libtcmalloc.so.4
#7 0x00007fcddddcfb83 in tcmalloc::allocate_full_cpp_throw_oom(unsigned long) () from /lib64/libtcmalloc.so.4
#8 0x0000557535b3d3d7 in RGWGC::process(int, int, bool, RGWGCIOManager&) ()
#9 0x0000557535b3df72 in RGWGC::process(bool) ()
#10 0x0000557535b4118f in RGWGC::GCWorker::entry() ()
#11 0x00007fcdd51a6b21 in Thread::entry_wrapper() () from /usr/lib64/ceph/libceph-common.so.0
#12 0x00007fcddd97ddd5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fcdd1e84ead in clone () from /lib64/libc.so.6

This should be this Ceph issue:
https://tracker.ceph.com/issues/23199

It was fixed by this PR:
https://github.com/ceph/ceph/pull/25430

The fix should be merged into StarlingX.

Revision history for this message
Cindy Xie (xxie1) wrote :

@Fang Liang, can you please review Martin's finding and confirm whether this is a Ceph upstream issue? Shall we cherry-pick the fixes to starlingx-staging?

Changed in starlingx:
assignee: chen haochuan (martin1982) → Liang Fang (liangfang)
Changed in starlingx:
assignee: Liang Fang (liangfang) → Tingjie Chen (silverhandy)
Revision history for this message
Tingjie Chen (silverhandy) wrote :

I have created PR https://github.com/starlingx-staging/stx-ceph/pull/34, a backport of https://github.com/ceph/ceph/pull/25430 as Martin mentioned.

--------------------------------------------
rgw: rgwgc:process coredump in some special case

GC processes obja, objb and objc in order, and the pool of objb has been deleted
(obja and objc are in the same pool, and that pool exists). RGW will coredump because
ctx->io_ctx_impl is an empty pointer during the delete of objc.

Cherry-picked from ceph:master 575a7900660c7ec02250aa58cd88b2e02962e135
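
For illustration only, below is a minimal, self-contained C++ sketch of the failure mode and the kind of guard the backported change adds: skipping a GC entry whose pool I/O context is empty instead of dereferencing it. The types and names (PoolIoCtx, GcEntry, process_gc, the obja/objb/objc entries) are hypothetical stand-ins, not the actual RGWGC code.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Stand-in for librados::IoCtx; a null pointer below plays the role of an
// empty ctx->io_ctx_impl for an object whose pool has been deleted.
struct PoolIoCtx {
    std::string pool_name;
    void remove(const std::string& oid) {
        std::cout << "removed " << oid << " from pool " << pool_name << "\n";
    }
};

struct GcEntry {
    std::string oid;
    std::shared_ptr<PoolIoCtx> ioctx;  // null when the object's pool no longer exists
};

// Process GC entries in order; the null check is the defensive step -- without
// it, the objb entry below would dereference an empty context and crash.
void process_gc(const std::vector<GcEntry>& entries) {
    for (const auto& e : entries) {
        if (!e.ioctx) {
            std::cerr << "skip " << e.oid << ": pool is gone\n";
            continue;
        }
        e.ioctx->remove(e.oid);
    }
}

int main() {
    auto pool_a = std::make_shared<PoolIoCtx>(PoolIoCtx{"pool-a"});
    // obja and objc live in an existing pool; objb's pool was deleted.
    std::vector<GcEntry> entries = {
        {"obja", pool_a}, {"objb", nullptr}, {"objc", pool_a}};
    process_gc(entries);
    return 0;
}

In the real code the analogous check applies to the RADOS I/O context looked up per GC entry; see the upstream PR for the exact change.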

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Tingjie Chen (silverhandy) wrote :

The PR https://github.com/starlingx-staging/stx-ceph/pull/34 has been merged; this LP can be set to Fix Released.

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Have not seen this issue recently.

tags: removed: stx.retestneeded