Ceph cluster mds failed during cephfs usage

Bug #1649872 reported by bugproxy
Affects                    Status    Importance  Assigned to             Milestone
Ubuntu on IBM z Systems    Invalid   Undecided   Unassigned              -
ceph (Ubuntu)              Invalid   Low         Skipper Bug Screeners   -

Bug Description

Ceph cluster mds failed during cephfs usage

---uname output---
Linux testU 4.4.0-21-generic #37-Ubuntu SMP Mon Apr 18 18:31:26 UTC 2016 s390x s390x s390x GNU/Linux

---Additional Hardware Info---
System Z s390x LPAR

Machine Type = Ubuntu VM on s390x LPAR

---Debugger---
A debugger is not configured

---Steps to Reproduce---
On an s390x LPAR with 4 Ubuntu VMs:
VM 1 - ceph monitor, ceph mds
VM 2 - ceph monitor, ceph osd
VM 3 - ceph monitor, ceph osd
VM 4 - client for using cephfs
I installed the Ceph cluster on the first three VMs and used the fourth VM as a CephFS client. I mounted the CephFS share and tried to touch a file in the mount point:
root@testU:~# ceph osd pool ls
rbd
libvirt-pool
root@testU:~# ceph osd pool create cephfs1_data 32
pool 'cephfs1_data' created
root@testU:~# ceph osd pool create cephfs1_metadata 32
pool 'cephfs1_metadata' created
root@testU:~# ceph osd pool ls
rbd
libvirt-pool
cephfs1_data
cephfs1_metadata
root@testU:~# ceph fs new cephfs1 cephfs1_metadata cephfs1_data
new fs with metadata pool 5 and data pool 4
root@testU:~# ceph fs ls
name: cephfs1, metadata pool: cephfs1_metadata, data pools: [cephfs1_data ]

root@testU:~# ceph mds stat
e37: 1/1/1 up {2:0=mon1=up:active}
root@testU:~# ceph -s
    cluster 9f054e62-10e5-4b58-adb9-03d27a360bdc
     health HEALTH_OK
     monmap e1: 3 mons at {mon1=192.168.122.144:6789/0,osd1=192.168.122.233:6789/0,osd2=192.168.122.73:6789/0}
            election epoch 4058, quorum 0,1,2 osd2,mon1,osd1
      fsmap e37: 1/1/1 up {2:0=mon1=up:active}
     osdmap e62: 2 osds: 2 up, 2 in
            flags sortbitwise
      pgmap v2011: 256 pgs, 4 pools, 4109 MB data, 1318 objects
            12371 MB used, 18326 MB / 30698 MB avail
                 256 active+clean

root@testU:~# ceph auth get client.admin | grep key
exported keyring for client.admin
 key = AQCepkZY5wuMOxAA6KxPQjDJ17eoGZcGmCvS/g==
root@testU:~# mount -t ceph 192.168.122.144:6789:/ /mnt/cephfs -o name=admin,secret=AQCepkZY5wuMOxAA6KxPQjDJ17eoGZcGmCvS/g==,context="system_u:object_r:tmp_t:s0"
root@testU:~#
root@testU:~# mount |grep ceph
192.168.122.144:6789:/ on /mnt/cephfs type ceph (rw,relatime,name=admin,secret=<hidden>,acl)

root@testU:~# ls -l /mnt/cephfs/
total 0
root@testU:~# touch /mnt/cephfs/testfile
[ 759.865289] ceph: mds parse_reply err -5
[ 759.865293] ceph: mdsc_handle_reply got corrupt reply mds0(tid:2)
root@testU:~# ls -l /mnt/cephfs/
[ 764.600952] ceph: mds parse_reply err -5
[ 764.600955] ceph: mdsc_handle_reply got corrupt reply mds0(tid:5)
[ 764.601343] ceph: mds parse_reply err -5
[ 764.601345] ceph: mdsc_handle_reply got corrupt reply mds0(tid:6)
ls: reading directory '/mnt/cephfs/': Input/output error
total 0

Userspace tool common name: cephfs ceph

The userspace tool has the following bit modes: 64-bit

Userspace rpm: -

Userspace tool obtained from project website: na

-Attach ltrace and strace of userspace application.

Revision history for this message
bugproxy (bugproxy) wrote : dmesg.log from client VM

Default Comment by Bridge

tags: added: architecture-s39064 bugnameltc-149933 severity-high targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Skipper Bug Screeners (skipper-screen-team)
affects: ubuntu → ceph (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
assignee: nobody → Canonical Server Team (canonical-server)
Revision history for this message
bugproxy (bugproxy) wrote : dmesg.log from client VM with ceph package version 10.2.3

------- Comment (attachment only) From <email address hidden> 2016-12-19 05:20 EDT-------

Revision history for this message
James Page (james-page) wrote :

It would be useful to have log files for the ceph-mds daemon as well, please; those might give more of a clue as to why the reply is being corrupted.
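
For anyone gathering those logs, a minimal sketch of where they usually live and how to check the current verbosity on a Jewel/xenial MDS host (the daemon id "mon1" is an assumption taken from the ceph mds stat output above):

# default MDS log location on Ubuntu xenial with Ceph Jewel
ls -l /var/log/ceph/ceph-mds.*.log
# show the logging levels of the running daemon via its admin socket (run on the MDS host)
sudo ceph daemon mds.mon1 config show | grep debug_mds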

Changed in ceph (Ubuntu):
status: New → Incomplete
importance: Undecided → Low
Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment (attachment only) From <email address hidden> 2016-12-19 05:20 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : ceph-mds log

------- Comment on attachment From <email address hidden> 2017-01-09 08:10 EDT-------

I attached ceph-mds.log from /var/log/ceph. It does not seem to contain anything related to this issue...
In ceph.conf, I use the following settings for mds:
[mds]
  mds data = /var/lib/ceph/mds/mds.$id

[mds.mon2]
  host = mon2
  debug mds = 1
  debug mds balancer = 1
  debug mds log = 1
  debug mds migrator = 1
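
Note that debug mds = 1 is quite low; a sketch of one way to raise verbosity at runtime (no daemon restart needed) and capture the failure again, assuming the daemon id mon2 from the section above:

# bump MDS and messenger logging on the running daemon
ceph tell mds.mon2 injectargs '--debug_mds 20 --debug_ms 1'
# reproduce the failing touch/ls from the client, collect
# /var/log/ceph/ceph-mds.mon2.log, then lower the levels again
ceph tell mds.mon2 injectargs '--debug_mds 1 --debug_ms 0'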

Ryan Beisner (1chb1n)
tags: added: s390x uosci
Frank Heimes (fheimes)
Changed in ceph (Ubuntu):
status: Incomplete → Confirmed
tags: added: openstack-ibm
Changed in ubuntu-z-systems:
assignee: Canonical Server Team (canonical-server) → nobody
assignee: nobody → Ceph OpenStack Team (ceph-openstack-team)
James Page (james-page)
Changed in ubuntu-z-systems:
assignee: Ceph OpenStack Team (ceph-openstack-team) → nobody
Frank Heimes (fheimes)
Changed in ubuntu-z-systems:
status: New → Confirmed
Revision history for this message
James Page (james-page) wrote :

Unable to reproduce this problem on amd64; will attempt with an s390x.

Revision history for this message
James Page (james-page) wrote :

I'm unable to reproduce this on s390x either; steps I took to reproduce:

Bootstrap local LXD provider using ZFS backend.

juju deploy -n 3 ceph-mon
juju deploy -n 3 ceph-osd && juju config ceph-osd osd-devices=/srv/osd use-direct-io=False
juju deploy ceph-fs
juju add-relation ceph-mon ceph-osd
juju add-relation ceph-mon ceph-fs

I then mounted the cephfs created from the host system:

sudo mount -t ceph 10.189.59.86:6789:/ /mnt/cephfs -o name=admin,secret=AQDW91xZWcFvIhAAbWVm3x6xx1LBDgvW7RyP9g==

after which I was able to write files to /mnt/cephfs:

ubuntu@s4lpb:/mnt/cephfs$ ls -lrt
total 314880
-rw-r--r-- 1 root root 322437120 Jul 3 21:54 zesty-server-cloudimg-arm64.img
-rw-r--r-- 1 root root 0 Jul 5 10:36 a
-rw-r--r-- 1 root root 0 Jul 5 10:36 b

ubuntu@s4lpb:/mnt/cephfs$ df -h
Filesystem Size Used Avail Use% Mounted on
[...]
10.189.59.86:6789:/ 43G 6.1G 37G 15% /mnt/cephfs

Confirming kernel version:

$ uname -a
Linux s4lpb 4.4.0-67-generic #88-Ubuntu SMP Wed Mar 8 16:39:07 UTC 2017 s390x s390x s390x GNU/Linux
ubuntu@s4lpb:/mnt/cephfs$

Changed in ceph (Ubuntu):
status: Confirmed → Incomplete
Changed in ubuntu-z-systems:
status: Confirmed → Incomplete
Revision history for this message
James Page (james-page) wrote :

For completeness:

$ juju status
Model Controller Cloud/Region Version
default localhost lxd/localhost 2.0.2

App Version Status Scale Charm Store Rev OS Notes
ceph-fs 10.2.7 active 1 ceph-fs jujucharms 3 ubuntu
ceph-mon 10.2.7 active 3 ceph-mon jujucharms 9 ubuntu
ceph-osd 10.2.7 active 3 ceph-osd jujucharms 243 ubuntu

Unit Workload Agent Machine Public address Ports Message
ceph-fs/0* active idle 6 10.189.59.58 Unit is ready (1 MDS)
ceph-mon/0* active idle 0 10.189.59.86 Unit is ready and clustered
ceph-mon/1 active idle 1 10.189.59.156 Unit is ready and clustered
ceph-mon/2 active idle 2 10.189.59.246 Unit is ready and clustered
ceph-osd/0* active idle 3 10.189.59.93 Unit is ready (1 OSD)
ceph-osd/1 active idle 4 10.189.59.82 Unit is ready (1 OSD)
ceph-osd/2 active idle 5 10.189.59.204 Unit is ready (1 OSD)

Machine State DNS Inst id Series AZ
0 started 10.189.59.86 juju-0e3568-0 xenial
1 started 10.189.59.156 juju-0e3568-1 xenial
2 started 10.189.59.246 juju-0e3568-2 xenial
3 started 10.189.59.93 juju-0e3568-3 xenial
4 started 10.189.59.82 juju-0e3568-4 xenial
5 started 10.189.59.204 juju-0e3568-5 xenial
6 started 10.189.59.58 juju-0e3568-6 xenial

Relation Provides Consumes Type
mds ceph-fs ceph-mon regular
mon ceph-mon ceph-mon peer
mon ceph-mon ceph-osd regular

Revision history for this message
James Page (james-page) wrote :

Also attaching Juju bundle for those following.
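
The bundle itself is an attachment; a rough, untested sketch of what an equivalent bundle for the deploy commands above might look like (charm names and options as used in those commands, everything else assumed):

series: xenial
services:
  ceph-mon:
    charm: cs:ceph-mon
    num_units: 3
  ceph-osd:
    charm: cs:ceph-osd
    num_units: 3
    options:
      osd-devices: /srv/osd
      use-direct-io: false
  ceph-fs:
    charm: cs:ceph-fs
    num_units: 1
relations:
  - [ ceph-mon, ceph-osd ]
  - [ ceph-mon, ceph-fs ]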

Revision history for this message
James Page (james-page) wrote :

Looking at the original deployment:

2 osds: 2 up, 2 in

That might create issues; pools are created with a replica count (size) of 3 by default, which a two-OSD cluster cannot fully satisfy.
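
For a two-OSD test cluster, a sketch of how to check and lower the replica count on the CephFS pools (pool names taken from the reproduce steps above):

ceph osd pool get cephfs1_data size
ceph osd pool set cephfs1_data size 2
ceph osd pool set cephfs1_metadata size 2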

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-09-12 03:21 EDT-------
IBM bugzilla status -> closed; currently not reproducible. If this problem occurs again, a new bugzilla should be opened.

James Page (james-page)
Changed in ceph (Ubuntu):
status: Incomplete → Invalid
Changed in ubuntu-z-systems:
status: Incomplete → Invalid