[luminous] ceph-mon crashes when it is elected leader (s390x)

Bug #1713032 reported by Andrew McLeod on 2017-08-25
This bug affects 1 person
Affects                   Importance  Assigned to
Ubuntu Cloud Archive      High        Unassigned
Ubuntu on IBM z Systems   High        bugproxy
ceph (Ubuntu)             High        Unassigned

Bug Description

ceph-mon - s390x - 1 of my 3 nodes decides it is the leader, then crashes:

Summary:

2017-08-25 10:30:49.764717 3ff9a7ff910 1 mon.juju-a9ec9d-1@0(electing).elector(105) init, last seen epoch 105
2017-08-25 10:30:55.288336 3ff9a7ff910 0 log_channel(cluster) log [INF] : mon.juju-a9ec9d-1@0 won leader election with quorum 0,1
2017-08-25 10:30:55.487872 3ff9a7ff910 0 log_channel(cluster) log [INF] : HEALTH_ERR; no osds; 1 mons down, quorum 0,1 juju-a9ec9d-1,juju-a9ec9d-0
2017-08-25 10:30:56.047020 3ff8bfff910 0 log_channel(cluster) log [INF] : monmap e1: 3 mons at {juju-a9ec9d-0=10.0.8.105:6789/0,juju-a9ec9d-1=10.0.8.84:6789/0,noname-b=10.0.8.179:6789/0}
2017-08-25 10:30:56.047050 3ff8bfff910 0 log_channel(cluster) log [INF] : pgmap 0 pgs: ; 0 bytes data, 0 kB used, 0 kB / 0 kB avail
2017-08-25 10:30:56.047073 3ff8bfff910 0 log_channel(cluster) log [DBG] : fsmap
2017-08-25 10:30:56.047122 3ff8bfff910 1 mon.juju-a9ec9d-1@0(leader).osd e0 create_pending setting backfillfull_ratio = 0.9
2017-08-25 10:30:56.047135 3ff8bfff910 1 mon.juju-a9ec9d-1@0(leader).osd e0 create_pending setting full_ratio = 0.95
2017-08-25 10:30:56.047137 3ff8bfff910 1 mon.juju-a9ec9d-1@0(leader).osd e0 create_pending setting nearfull_ratio = 0.85
2017-08-25 10:30:56.047288 3ff8bfff910 1 mon.juju-a9ec9d-1@0(leader).osd e0 encode_pending skipping prime_pg_temp; mapping job did not start
2017-08-25 10:30:56.051808 3ff8bfff910 -1 *** Caught signal (Aborted) **
 in thread 3ff8bfff910 thread_name:ms_dispatch

 ceph version 12.1.2 (b661348f156f148d764b998b65b90451f096cb27) luminous (rc)
 1: (()+0x9334b4) [0x2aa0f9334b4]
 2: [0x3ff8bff9b66]
 3: (gsignal()+0x30) [0x3ffa16381b8]
 4: (abort()+0x14e) [0x3ffa1639726]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x19c) [0x3ffa1a28e2c]
 6: (()+0xa6776) [0x3ffa1a26776]
 7: (()+0xa67d8) [0x3ffa1a267d8]
 8: (__cxa_rethrow()+0x64) [0x3ffa1a26adc]
 9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0xdc2) [0x2aa0f8b4d92]
 10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x5c4) [0x2aa0f739d44]
 11: (OSDMap::decode(ceph::buffer::list&)+0x44) [0x2aa0f73c434]
 12: (OSDMap::apply_incremental(OSDMap::Incremental const&)+0x1782) [0x2aa0f73dbe2]
 13: (OSDMonitor::encode_pending(std::shared_ptr<MonitorDBStore::Transaction>)+0x212) [0x2aa0f55cf3a]
 14: (PaxosService::propose_pending()+0x2be) [0x2aa0f5214d6]
 15: (PaxosService::_active()+0x2be) [0x2aa0f521bbe]
 16: (Context::complete(int)+0x1e) [0x2aa0f3f1d86]
 17: (void finish_contexts<Context>(CephContext*, std::__cxx11::list<Context*, std::allocator<Context*> >&, int)+0x212) [0x2aa0f3fb0ea]
 18: (Paxos::finish_round()+0x194) [0x2aa0f51937c]
 19: (Paxos::handle_last(boost::intrusive_ptr<MonOpRequest>)+0xfb2) [0x2aa0f51a97a]
 20: (Paxos::dispatch(boost::intrusive_ptr<MonOpRequest>)+0x2d4) [0x2aa0f51b2c4]
 21: (Monitor::dispatch_op(boost::intrusive_ptr<MonOpRequest>)+0xf20) [0x2aa0f3e86b8]
 22: (Monitor::_ms_dispatch(Message*)+0x64e) [0x2aa0f3e91ae]
 23: (Monitor::ms_dispatch(Message*)+0x34) [0x2aa0f41919c]
 24: (DispatchQueue::entry()+0xf0c) [0x2aa0f8df744]
 25: (DispatchQueue::DispatchThread::entry()+0x18) [0x2aa0f6ed828]
 26: (()+0x7934) [0x3ffa1e87934]
 27: (()+0xedd1a) [0x3ffa16edd1a]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

Full log:

https://pastebin.canonical.com/196718/

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: ceph 12.1.2-0ubuntu2~cloud0 [origin: Canonical]
ProcVersionSignature: Ubuntu 4.4.0-87.110-generic 4.4.73
Uname: Linux 4.4.0-87-generic s390x
NonfreeKernelModules: ebtable_broute vport_gre ip_gre gre ip_tunnel xt_CT xt_mac xt_physdev br_netfilter xt_set ip_set_hash_net ip_set nfnetlink xt_REDIRECT nf_nat_redirect nf_conntrack_ipv6 ip6table_mangle xt_nat xt_mark xt_connmark ip6table_raw iptable_raw xt_conntrack ipt_REJECT nf_reject_ipv4 ebtable_filter nbd openvswitch nf_defrag_ipv6 ebt_arp ebt_dnat ebt_ip scsi_transport_iscsi binfmt_misc veth ip6table_filter ip6_tables xt_CHECKSUM iptable_mangle xt_tcpudp ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_comment iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack zfs zunicode zcommon znvpair spl zavl zlib_deflate iptable_filter ip_tables ebt_snat ebtable_nat ebtables x_tables bridge 8021q garp mrp stp llc xfs libcrc32c dm_snapshot dm_bufio ghash_s390 prng aes_s390 des_s390 des_generic sha512_s390 sha256_s390 sha1_s390 qeth_l2 sha_common chsc_sch eadm_sch qeth ctcm ccwgroup fsm zfcp qdio scsi_transport_fc dasd_eckd_mod dasd_mod
ApportVersion: 2.20.1-0ubuntu2.10
Architecture: s390x
CrashDB:
 {
   "impl": "launchpad",
   "project": "cloud-archive",
   "bug_pattern_url": "http://people.canonical.com/~ubuntu-archive/bugpatterns/bugpatterns.xml",
 }
Date: Fri Aug 25 11:17:36 2017
SourcePackage: ceph
UpgradeStatus: No upgrade log present (probably fresh install)

Andrew McLeod (admcleod) wrote :
Corey Bryant (corey.bryant) wrote :

This may be specific to use with Ceph BlueStore.

Corey Bryant (corey.bryant) wrote :

Just talked with Andrew and this is not a BlueStore issue.

Ryan Beisner (1chb1n) on 2017-09-12
tags: added: uosci
Ryan Beisner (1chb1n) wrote :

Can you please add the deploy/config info, such as bundle and juju status? Is this openstack-on-lxd, or multi-lpar?

summary: - ceph-mon crashes when it is elected leader (s390x)
+ [luminous] ceph-mon crashes when it is elected leader (s390x)
Andrew McLeod (admcleod) wrote :

This was using the openstack-on-lxd instructions, but not the ZFS backend - it is using the directory backend.

James Page (james-page) wrote :

I got a failed bootstrap with a quick test - journalctl -xe -u ceph-mon revealed:

Oct 20 13:36:07 juju-e856f0-0 ceph-mon[13887]: error: monitor data filesystem reached concerning levels of available storage space (avail
Oct 20 13:36:07 juju-e856f0-0 ceph-mon[13887]: you may adjust 'mon data avail crit' to a lower value to make this go away (default: 5%)
Oct 20 13:36:07 juju-e856f0-0 systemd[1]: ceph-mon.service: Main process exited, code=exited, status=28/n/a
Oct 20 13:36:07 juju-e856f0-0 systemd[1]: ceph-mon.service: Unit entered failed state.
Oct 20 13:36:07 juju-e856f0-0 systemd[1]: ceph-mon.service: Failed with result 'exit-code'.
Oct 20 13:36:37 juju-e856f0-0 systemd[1]: ceph-mon.service: Service hold-off time over, scheduling restart.
Oct 20 13:36:37 juju-e856f0-0 systemd[1]: Stopped Ceph cluster monitor daemon.

Might be the same issue I guess?

James Page (james-page) wrote :

(trying again with more storage this time)

James Page (james-page) wrote :

Better:

$ juju status
Model Controller Cloud/Region Version SLA
default localhost-localhost localhost/localhost 2.2.4 unsupported

App Version Status Scale Charm Store Rev OS Notes
ceph-mon 10.2.7 active 3 ceph-mon jujucharms 13 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-mon/0 active idle 0 10.0.171.8 Unit is ready and clustered
ceph-mon/1* active idle 1 10.0.171.158 Unit is ready and clustered
ceph-mon/2 active idle 2 10.0.171.93 Unit is ready and clustered

Machine State DNS Inst id Series AZ Message
0 started 10.0.171.8 juju-2b4fd0-0 xenial Running
1 started 10.0.171.158 juju-2b4fd0-1 xenial Running
2 started 10.0.171.93 juju-2b4fd0-2 xenial Running

Relation provider Requirer Interface Type
ceph-mon:mon ceph-mon:mon ceph peer

James Page (james-page) wrote :

Note that the original bug report was on an older ceph version (rc)

James Page (james-page) wrote :

No - it's on a newer ceph version - retrying.

James Page (james-page) wrote :
Changed in ceph (Ubuntu):
status: New → Confirmed
Changed in cloud-archive:
status: New → Confirmed
importance: Undecided → High
Changed in ceph (Ubuntu):
importance: Undecided → High
status: Confirmed → Triaged
Changed in cloud-archive:
status: Confirmed → Triaged
James Page (james-page) wrote :

Confirmed on artful and xenial/pike; attached backtrace is from an artful install.

James Page (james-page) wrote :

To reproduce on an Ubuntu install on s390x:

sudo apt install juju
juju bootstrap localhost
juju deploy -n 3 --series artful --force ceph-mon

One of the three units will fail with this problem - the other two remain running, but won't form quorum.

Changed in ubuntu-z-systems:
status: New → Triaged
Changed in ubuntu-z-systems:
assignee: nobody → bugproxy (bugproxy)
Frank Heimes (frank-heimes) wrote :

Requested to reverse-proxy this ticket, because it looks like an s390x code issue.

bugproxy (bugproxy) on 2017-11-27
tags: added: architecture-s39064 bugnameltc-160996 severity-high targetmilestone-inin16043
Frank Heimes (frank-heimes) wrote :

Re-attaching the log from the bug description for external visibility.

------- Comment From <email address hidden> 2017-11-27 10:33 EDT-------
Does this problem exist on previous levels (i.e. is it a regression due to a package upgrade), or is this a new use case...?

Frank Heimes (frank-heimes) wrote :

This is not a regression. Furthermore, the following link suggests that s390x is not yet supported upstream (nor, it seems, is ppc64el):
https://download.ceph.com/debian-luminous/dists/xenial/
It would be good to see parity between all architectures.

An attempt to install ceph luminous on Ubuntu on s390x as per the description above reproduces the described situation.

bugproxy (bugproxy) on 2017-11-29
tags: added: severity-medium
removed: severity-high
tags: added: reverse-proxy-bugzilla
Changed in ubuntu-z-systems:
importance: Undecided → High
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2018-03-13 10:54 EDT-------
My team doesn't have experience with this. I recall some folks in Germany have file-system skills; asking the broader team might help.

Andrew McLeod (admcleod) wrote :

This same problem is still present in rocky, using ceph mimic.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-01-09 11:25 EDT-------
From the stack backtrace ...

8: (__cxa_rethrow()+0x64) [0x3ffa1a26adc]
9: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0xdc2) [0x2aa0f8b4d92]
10: (OSDMap::decode(ceph::buffer::list::iterator&)+0x5c4) [0x2aa0f739d44]

... it seems that in CrushWrapper::decode() an exception is thrown.

Looking at the (v12.1.2) code in ceph/src/crush/CrushWrapper.cc (https://github.com/ceph/ceph/blob/v12.1.2/src/crush/CrushWrapper.cc#L2097-L2105) ...

void CrushWrapper::decode(bufferlist::iterator& blp)
{
  create();

  __u32 magic;
  ::decode(magic, blp);
  if (magic != CRUSH_MAGIC)
    throw buffer::malformed_input("bad magic number");

... it looks like the exception is thrown in the MAGIC check.

So perhaps this could be a little/big endian problem when decoding the magic number?
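
To make the hypothesis concrete, a minimal sketch follows (an illustration, not the Ceph sources; it assumes CRUSH_MAGIC's value of 0x00010000 from the crush headers). If the four little-endian bytes of the magic were assembled in big-endian order, the comparison would fail, which in Ceph triggers the throw above:

// Hypothetical illustration of the suspected failure mode: the on-disk
// magic is stored little-endian, so assembling its bytes in big-endian
// order yields a value that fails the CRUSH_MAGIC comparison.
#include <cstdint>
#include <cstdio>

int main() {
  const uint32_t CRUSH_MAGIC = 0x00010000;           // value from crush.h
  unsigned char wire[4] = {0x00, 0x00, 0x01, 0x00};  // little-endian encoding
  // A big-endian read without a byte swap would assemble:
  uint32_t wrong = (uint32_t(wire[0]) << 24) | (uint32_t(wire[1]) << 16) |
                   (uint32_t(wire[2]) << 8)  |  uint32_t(wire[3]);
  std::printf("got 0x%08x, want 0x%08x -> %s\n", wrong, CRUSH_MAGIC,
              wrong == CRUSH_MAGIC ? "ok" : "bad magic number");
  return 0;
}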

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-01-10 08:20 EDT-------
I investigated the encode() / decode() of the magic number a bit more:

Ceph seems to use a little-endian data format internally. So decoding the magic number (https://github.com/ceph/ceph/blob/v12.1.2/src/crush/CrushWrapper.cc#L2097-L2105) converts from little to big endian on s390x.

For __u32 this is done in encoding.h: https://github.com/ceph/ceph/blob/v12.1.2/src/include/encoding.h#L109

The conversion is done in the cast operator (https://github.com/ceph/ceph/blob/v12.1.2/src/include/encoding.h#L104) of the "struct ceph_le" template class: https://github.com/ceph/ceph/blob/v12.1.2/src/include/byteorder.h#L70-L72

For the byte-swap operation from little to big endian, the mswab() function is used: https://github.com/ceph/ceph/blob/v12.1.2/src/include/byteorder.h#L54-L58

The magic number seems to be properly encoded: https://github.com/ceph/ceph/blob/v12.1.2/src/crush/CrushWrapper.cc#L1922-L1927
And decoded: https://github.com/ceph/ceph/blob/v12.1.2/src/crush/CrushWrapper.cc#L2097-L2105

So there is at least no obvious or trivial problem here.
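
For reference, the decode path described above condenses to a small self-contained sketch (a re-creation under the stated assumptions, not the actual Ceph sources; from_le32() here is a hypothetical stand-in for mswab()): the wire value is stored little-endian, and a swap is applied only on big-endian hosts such as s390x.

#include <cstdint>
#include <cstdio>
#include <cstring>

// Stand-in for Ceph's mswab(): swap only on big-endian hosts (e.g. s390x),
// identity on little-endian hosts.
static uint32_t from_le32(uint32_t v) {
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
  return __builtin_bswap32(v);   // s390x takes this path
#else
  return v;                      // x86-64 etc.: no swap needed
#endif
}

int main() {
  // The encoder stores the 32-bit value 0x00010000 in little-endian order.
  unsigned char wire[4] = {0x00, 0x00, 0x01, 0x00};
  uint32_t raw;
  std::memcpy(&raw, wire, sizeof raw);  // raw is now in host byte order
  uint32_t value = from_le32(raw);      // correct on either endianness
  std::printf("0x%08x\n", value);       // prints 0x00010000 everywhere
  return 0;
}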

Changed in ceph (Ubuntu):
status: Triaged → In Progress
Changed in ubuntu-z-systems:
status: Triaged → In Progress
Frank Heimes (frank-heimes) wrote :

We were asked about the package with the debug symbols for ceph-mon: ceph-mon-dbgsym.
Since I couldn't find the packages with the debug symbols in the 'normal' UCA:
deb http://ubuntu-cloud.archive.canonical.com/ubuntu bionic-updates/rocky main
and since there doesn't seem to be a ddeb archive for the UCA,
I tried downloading the ceph-mon-dbgsym package manually from here:
https://launchpad.net/ubuntu/cosmic/+package/ceph-mon-dbgsym
But when I try to install it, I get:
sudo apt install ./ceph-mon-dbgsym_13.2.1+dfsg1-0ubuntu2.18.10.1_s390x.ddeb
Reading package lists... Done
Building dependency tree
Reading state information... Done
Note, selecting 'ceph-mon-dbgsym' instead of './ceph-mon-dbgsym_13.2.1+dfsg1-0ubuntu2.18.10.1_s390x.ddeb'
Some packages could not be installed. This may mean that you have
requested an impossible situation or if you are using the unstable
distribution that some required packages have not yet been created
or been moved out of Incoming.
The following information may help to resolve the situation:

The following packages have unmet dependencies:
 ceph-mon-dbgsym : Depends: ceph-mon (= 13.2.1+dfsg1-0ubuntu2.18.10.1) but 13.2.1+dfsg1-0ubuntu2.18.10.1~cloud0 is to be installed
E: Unable to correct problems, you have held broken packages.

So the package with the debug symbols "ceph-mon-dbgsym_13.2.1+dfsg1-0ubuntu2.18.10.1_s390x.ddeb" requires a package "ceph-mon_13.2.1+dfsg1-0ubuntu2.18.10.1_s390x.deb".
But the UCA only provides a package "ceph-mon_13.2.1+dfsg1-0ubuntu2.18.10.1~cloud0_s390x.deb" - notice the suffix "~cloud0".

Any idea how to get ceph-mon installed with debug symbols?

Frank Heimes (frank-heimes) wrote :

Just noticed that all the packages here: https://launchpad.net/ubuntu/+source/ceph/13.2.0-0ubuntu1
don't have the suffix, hence one can pick all the needed ones from there.
But downloading and installing them manually is cumbersome and probably not the recommended way ...

Andrew McLeod (admcleod) wrote :

If you add the debug symbols repo as follows here:

https://wiki.ubuntu.com/Debug%20Symbol%20Packages#Manual_install_of_debug_packages

then the debug symbols should be available to install. It was mentioned that they may be available for luminous but not mimic; in that case, luminous can be used, as the bug is exactly the same in both versions.

------- Comment on attachment From <email address hidden> 2019-01-24 15:18 EDT-------

With the help of Frank Heimes I finally managed to build a setup for debugging with gdb.
The attached file shows the steps for establishing a gdb session to the failing ceph-mon process.

------- Comment on attachment From <email address hidden> 2019-01-24 15:35 EDT-------

This is the result of the first gdb session with the new setup:

The ceph-mon process (v12.2.8) does not crash in the magic check but instead in "decode_32_or_64_string_map()": https://github.com/ceph/ceph/blob/v12.2.8/src/crush/CrushWrapper.cc#L2515
More precisely somewhere in "::decode_nohead()": https://github.com/ceph/ceph/blob/v12.2.8/src/crush/CrushWrapper.cc#L2465

It is called with a suspicious value for "strlen=50331648" (48 MiB).
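
One observation on that value (not from the attachment, just arithmetic): 50331648 is 0x03000000, exactly the 32-bit byte swap of 3, which is the pattern a small little-endian length field would produce if read unswapped on a big-endian host. A two-line check:

// The suspicious strlen of 50331648 is the byte swap of the plausible
// length 3 (uses the GCC/Clang __builtin_bswap32 builtin).
#include <cstdint>
#include <cstdio>

int main() {
  std::printf("0x%08x\n", 50331648u);          // prints 0x03000000
  std::printf("%u\n", __builtin_bswap32(3u));  // prints 50331648
  return 0;
}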

Frank Heimes (frank-heimes) wrote :

Waiting for an update from IBM - setting to Incomplete for now.

Changed in ubuntu-z-systems:
status: In Progress → Incomplete
Frank Heimes (frank-heimes) wrote :

I got an update that work is in progress.
Latest status can be seen here: https://github.com/ceph/ceph/pull/30079

Changed in ubuntu-z-systems:
status: Incomplete → In Progress