radosgw coredump files generated while launching VMs

Bug #1830938 reported by Peng Peng
Affects: StarlingX
Status: Fix Released
Importance: Medium
Assigned to: Tingjie Chen
Milestone:

Bug Description

Brief Description
-----------------
core.radosgw files were generated. Around that timestamp, the system was booting VMs, and before that timestamp the alarm "Service group storage-monitoring-services warning; ceph-radosgw(enabled-active, )" was present in the alarm list.

Severity
--------
Major

Steps to Reproduce
------------------
As described above

TC-name:

Expected Behavior
------------------

Actual Behavior
----------------

Reproducibility
---------------
Intermittent

System Configuration
--------------------
Two node system

Lab-name: WP_1-2

Branch/Pull Time/Commit
-----------------------
stx master as of 2019-05-28_17-05-57

Last Pass
---------
2019-05-27_09-53-12

Timestamp/Logs
--------------
Core file time stamps: 11:14 & 11:16
controller-1:/var/lib/systemd/coredump$ ls -l
total 1536
-rw-r----- 1 root root 774032 May 29 11:14 core.radosgw.0.3b71b9ac72b84eab9e56c4856c546181.1784442.1559128451000000.xz
-rw-r----- 1 root root 789196 May 29 11:16 core.radosgw.0.3b71b9ac72b84eab9e56c4856c546181.1811512.1559128572000000.xz

[2019-05-29 11:10:35,106] 262 DEBUG MainThread ssh.send :: Send 'fm --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne alarm-list --nowrap --uuid'
[2019-05-29 11:10:37,927] 387 DEBUG MainThread ssh.expect :: Output:
+--------------------------------------+----------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+----------+----------------------------+
| UUID | Alarm ID | Reason Text | Entity ID | Severity | Time Stamp |
+--------------------------------------+----------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+----------+----------------------------+
| e55882c9-508e-472c-b4c5-5c6a640453cc | 400.001 | Service group cloud-services warning; dbmon(enabled-standby, ) | service_domain=controller.service_group=cloud-services.host=controller-1 | minor | 2019-05-29T11:10:33.514372 |
| 40b45e1b-f7a4-4911-866f-c99d06071aa1 | 400.001 | Service group cloud-services warning; dbmon(enabled-active, ) | service_domain=controller.service_group=cloud-services.host=controller-0 | minor | 2019-05-29T11:10:30.549199 |
| d006d1c4-6bc4-45ad-afc0-e3138f624725 | 100.103 | Platform Memory threshold exceeded ; threshold 80.00%, actual 80.06% | host=controller-0.numa=node1 | major | 2019-05-29T11:09:09.512971 |
| 27b1b64f-78a8-42ad-8f27-ffc0264f880f | 400.001 | Service group storage-monitoring-services warning; ceph-radosgw(enabled-active, ) | service_domain=controller.service_group=storage-monitoring-services.host=controller-1 | minor | 2019-05-29T11:08:45.028211 |
| 8bef685d-7f07-4e32-9d16-dde371622402 | 200.010 | controller-0 access to board management module has failed. | host=controller-0 | warning | 2019-05-29T10:59:39.663177 |
+--------------------------------------+----------+-----------------------------------------------------------------------------------+---------------------------------------------------------------------------------------+----------+----------------------------+
controller-0:~$

From 11:11 to 11:16 the system was booting up VMs.

[2019-05-29 11:18:34,170] 262 DEBUG MainThread ssh.send :: Send 'system --os-username 'admin' --os-password 'Li69nux*' --os-project-name admin --os-auth-url http://192.168.204.2:5000/v3 --os-user-domain-name Default --os-project-domain-name Default --os-endpoint-type internalURL --os-region-name RegionOne service-list'
[2019-05-29 11:18:38,139] 387 DEBUG MainThread ssh.expect :: Output:
+-----+----------------------------+--------------+----------------+
| id | service_name | hostname | state |
+-----+----------------------------+--------------+----------------+
| 102 | barbican-api | controller-0 | enabled-active |
| 103 | barbican-keystone-listener | controller-0 | enabled-active |
| 104 | barbican-worker | controller-0 | enabled-active |
| 54 | ceph-manager | controller-0 | disabled |
| 110 | ceph-mon | controller-0 | enabled-active |
| 113 | ceph-osd | controller-0 | enabled-active |
| 67 | ceph-radosgw | controller-0 | disabled |

[2019-05-29 11:19:00,260] 139 INFO MainThread host_helper.reboot_hosts:: Rebooting active controller: controller-0
[2019-05-29 11:19:00,260] 262 DEBUG MainThread ssh.send :: Send 'sudo reboot -f'

Test Activity
-------------
Sanity

Numan Waheed (nwaheed)
tags: added: stx.retestneeded
Revision history for this message
Ghada Khalil (gkhalil) wrote :

 Marking as release gating for now until further investigation. This would impact object storage (swift) configurations.

description: updated
Changed in starlingx:
importance: Undecided → Medium
tags: added: stx.2.0 stx.storage
Changed in starlingx:
status: New → Triaged
assignee: nobody → Cindy Xie (xxie1)
Peng Peng (ppeng)
description: updated
Revision history for this message
yong hu (yhu6) wrote :

@peng, please provide the reproduction steps; we don't quite know what the "launch VMs" steps are.

In addition, it was a duplex deployment, wasn't it?

Changed in starlingx:
assignee: Cindy Xie (xxie1) → nobody
assignee: nobody → chen haochuan (martin1982)
Revision history for this message
chen haochuan (martin1982) wrote :

Duplicate of LP #1827268:
https://bugs.launchpad.net/starlingx/+bug/1827268

2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718851] Call Trace:
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718872] [<ffffffff95612b89>] schedule_preempt_disabled+0x39/0x90
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718887] [<ffffffff956109d5>] __mutex_lock_slowpath+0xd5/0x210
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718893] [<ffffffff9500ccd5>] ? unlazy_walk+0xb5/0x130
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718897] [<ffffffff9560fc07>] mutex_lock+0x17/0x30
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718904] [<ffffffff956045f1>] lookup_slow+0x33/0xa7
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718909] [<ffffffff9501053f>] link_path_walk+0x80f/0x8b0
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718916] [<ffffffff94f86de4>] ? filemap_fault+0x74/0x460
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718918] [<ffffffff9501074a>] path_lookupat+0x7a/0x8b0
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718948] [<ffffffffc0a6b911>] ? ext4_filemap_fault+0x41/0x50 [ext4]
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718950] [<ffffffff9501311f>] ? getname_flags+0x4f/0x1a0
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718952] [<ffffffff95010fab>] filename_lookup+0x2b/0xc0
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718954] [<ffffffff950142b7>] user_path_at_empty+0x67/0xc0
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718958] [<ffffffff94fb8267>] ? handle_mm_fault+0x557/0xc30
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718960] [<ffffffff95014321>] user_path_at+0x11/0x20
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718962] [<ffffffff95007373>] vfs_fstatat+0x63/0xc0
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718963] [<ffffffff9500772e>] SYSC_newstat+0x2e/0x60
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718966] [<ffffffff94ffe8ba>] ? __check_object_size+0x1ca/0x250
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718972] [<ffffffff94e940f4>] ? SyS_rt_sigprocmask+0xc4/0x100
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718973] [<ffffffff95007bee>] SyS_newstat+0xe/0x10
2019-05-01T17:08:07.552 storage-2 kernel: warning [ 601.718976] [<ffffffff95616fdb>] system_call_fastpath+0x22/0x27
2019-05-01T17:10:07.569 storage-2 kernel: err [ 721.719241] INFO: task install_banner_:15141 blocked for more than 120 seconds.

Revision history for this message
chen haochuan (martin1982) wrote :

controller-0:~$ gdb /usr/bin/radosgw core.radosgw.0.3b71b9ac72b84eab9e56c4856c546181.1811512.1559128572000000

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/bin/radosgw -n client.radosgw.gateway'.
Program terminated with signal 6, Aborted.
#0 0x00007fcdd1dbd207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install ceph-radosgw-13.2.2-0.el7.tis.26.x86_64
(gdb) bt
#0 0x00007fcdd1dbd207 in raise () from /lib64/libc.so.6
#1 0x00007fcdd1dbe8f8 in abort () from /lib64/libc.so.6
#2 0x00007fcdd26cc765 in __gnu_cxx::__verbose_terminate_handler() () from /lib64/libstdc++.so.6
#3 0x00007fcdd26ca746 in ?? () from /lib64/libstdc++.so.6
#4 0x00007fcdd26ca773 in std::terminate() () from /lib64/libstdc++.so.6
#5 0x00007fcdd26ca993 in __cxa_throw () from /lib64/libstdc++.so.6
#6 0x00007fcddddb18bb in (anonymous namespace)::handle_oom(void* (*)(void*), void*, bool, bool) () from /lib64/libtcmalloc.so.4
#7 0x00007fcddddcfb83 in tcmalloc::allocate_full_cpp_throw_oom(unsigned long) () from /lib64/libtcmalloc.so.4
#8 0x0000557535b3d3d7 in RGWGC::process(int, int, bool, RGWGCIOManager&) ()
#9 0x0000557535b3df72 in RGWGC::process(bool) ()
#10 0x0000557535b4118f in RGWGC::GCWorker::entry() ()
#11 0x00007fcdd51a6b21 in Thread::entry_wrapper() () from /usr/lib64/ceph/libceph-common.so.0
#12 0x00007fcddd97ddd5 in start_thread () from /lib64/libpthread.so.0
#13 0x00007fcdd1e84ead in clone () from /lib64/libc.so.6

This should be this Ceph issue:
https://tracker.ceph.com/issues/23199

It was fixed by this PR:
https://github.com/ceph/ceph/pull/25430

The fix should be merged into StarlingX.

Revision history for this message
Cindy Xie (xxie1) wrote :

@Fang Liang, can you please review Martin's finding and confirm whether this is a Ceph upstream issue? Shall we cherry-pick the fixes to starlingx-staging?

Changed in starlingx:
assignee: chen haochuan (martin1982) → Liang Fang (liangfang)
Changed in starlingx:
assignee: Liang Fang (liangfang) → Tingjie Chen (silverhandy)
Revision history for this message
Tingjie Chen (silverhandy) wrote :

I have created PR https://github.com/starlingx-staging/stx-ceph/pull/34, a backport of https://github.com/ceph/ceph/pull/25430 as Martin mentioned.

--------------------------------------------
rgw: rgwgc:process coredump in some special case

GC processes obja, objb and objc in order, and the pool of objb has been deleted
(obja and objc are in the same pool, and that pool exists). RGW will coredump because
ctx->io_ctx_impl is an empty pointer during the delete of objc.

Cherry-picked from ceph:master 575a7900660c7ec02250aa58cd88b2e02962e135
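
For illustration only, below is a minimal, self-contained C++ sketch of the failure mode and the kind of guard the backported change adds: skipping a GC entry whose pool I/O context is empty instead of dereferencing it. The types and names (PoolIoCtx, GcEntry, process_gc, the obja/objb/objc entries) are hypothetical stand-ins, not the actual RGWGC code.

#include <iostream>
#include <memory>
#include <string>
#include <vector>

// Stand-in for librados::IoCtx; a null pointer below plays the role of an
// empty ctx->io_ctx_impl for an object whose pool has been deleted.
struct PoolIoCtx {
    std::string pool_name;
    void remove(const std::string& oid) {
        std::cout << "removed " << oid << " from pool " << pool_name << "\n";
    }
};

struct GcEntry {
    std::string oid;
    std::shared_ptr<PoolIoCtx> ioctx;  // null when the object's pool no longer exists
};

// Process GC entries in order; the null check is the defensive step -- without
// it, the objb entry below would dereference an empty context and crash.
void process_gc(const std::vector<GcEntry>& entries) {
    for (const auto& e : entries) {
        if (!e.ioctx) {
            std::cerr << "skip " << e.oid << ": pool is gone\n";
            continue;
        }
        e.ioctx->remove(e.oid);
    }
}

int main() {
    auto pool_a = std::make_shared<PoolIoCtx>(PoolIoCtx{"pool-a"});
    // obja and objc live in an existing pool; objb's pool was deleted.
    std::vector<GcEntry> entries = {
        {"obja", pool_a}, {"objb", nullptr}, {"objc", pool_a}};
    process_gc(entries);
    return 0;
}

In the real code the analogous check applies to the RADOS I/O context looked up per GC entry; see the upstream PR for the exact change.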

Changed in starlingx:
status: Triaged → In Progress
Revision history for this message
Tingjie Chen (silverhandy) wrote :

The PR https://github.com/starlingx-staging/stx-ceph/pull/34 has been merged; this LP can be set to Fix Released.

Changed in starlingx:
status: In Progress → Fix Released
Revision history for this message
Peng Peng (ppeng) wrote :

Have not seen this issue recently.

tags: removed: stx.retestneeded