[library] ceph osd: invalid (someone else's?) journal

Bug #1622989 reported by Nadezhda Kabanova
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Oleksiy Molchanov

Bug Description

Detailed description:
Environment: MOS 8, MU2
# fuel plugins
id | name | version | package_version
---|-----------------------------|---------|----------------
1 | elasticsearch_kibana | 0.9.0 | 4.0.0
2 | influxdb_grafana | 0.9.0 | 4.0.0
3 | lma_collector | 0.9.0 | 4.0.0
4 | lma_infrastructure_alerting | 0.9.0 | 4.0.0
5 | static_ntp_routing | 4.0.6 | 4.0.0
6 | zabbix-database | 4.0.1 | 4.0.0
7 | zabbix_monitoring | 2.6.25 | 4.0.0
8 | ivb5_plugin_vw | 1.0.27 | 4.0.0

# ceph -v
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
360 osds, 18 hosts.

In the ceph-osd logs we see errors: [library] ceph osd: invalid (someone else's?) journal

We had 20 OSDs per host and 4 disks (5 journal partitions per disk) for journaling.
After one OSD failed, we were not able to start it again until we recreated its journal.
We then rechecked the whole environment and discovered 7 problematic hosts where some OSDs share the same journal partition and are nevertheless still running.
So the wrong journal assignment/configuration made during deployment did not prevent the newly deployed OSDs from starting.
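A quick way to find the affected OSDs on a host (a minimal sketch, assuming the default log location /var/log/ceph/ceph-osd.*.log) is to grep the OSD logs for that message:

# grep -l "someone else" /var/log/ceph/ceph-osd.*.log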

All partitions were created properly, but some were not assigned during the Ceph deployment:
# ceph-disk list | grep journal
 /dev/sda4 ceph data, active, cluster ceph, osd.469, journal /dev/sdu3
 /dev/sdb3 ceph data, active, cluster ceph, osd.265, journal /dev/sdu4
 /dev/sdc3 ceph data, active, cluster ceph, osd.483, journal /dev/sdu4
 /dev/sdd3 ceph data, active, cluster ceph, osd.476, journal /dev/sdu5
 /dev/sde3 ceph data, active, cluster ceph, osd.399, journal /dev/sdu6
 /dev/sdf3 ceph data, active, cluster ceph, osd.498, journal /dev/sdu3
 /dev/sdg3 ceph data, active, cluster ceph, osd.428, journal /dev/sdv3
 /dev/sdh3 ceph data, active, cluster ceph, osd.508, journal /dev/sdu4
 /dev/sdi3 ceph data, active, cluster ceph, osd.504, journal /dev/sdu5
 /dev/sdj3 ceph data, active, cluster ceph, osd.183, journal /dev/sdv7
 /dev/sdk3 ceph data, active, cluster ceph, osd.453, journal /dev/sdv6
 /dev/sdl3 ceph data, active, cluster ceph, osd.396, journal /dev/sdv7
 /dev/sdm3 ceph data, active, cluster ceph, osd.349, journal /dev/sdw3
 /dev/sdn3 ceph data, active, cluster ceph, osd.419, journal /dev/sdw4
 /dev/sdo3 ceph data, active, cluster ceph, osd.206, journal /dev/sdw7
 /dev/sdp3 ceph data, active, cluster ceph, osd.436, journal /dev/sdw5
 /dev/sdq3 ceph data, active, cluster ceph, osd.401, journal /dev/sdw6
 /dev/sdr3 ceph data, active, cluster ceph, osd.463, journal /dev/sdw7
 /dev/sds3 ceph data, active, cluster ceph, osd.445, journal /dev/sdx3
 /dev/sdt3 ceph data, active, cluster ceph, osd.490, journal /dev/sdx4
 /dev/sdu3 ceph journal, for /dev/sdf3
 /dev/sdu4 ceph journal, for /dev/sdh3
 /dev/sdu5 ceph journal, for /dev/sdi3
 /dev/sdu6 ceph journal, for /dev/sde3
 /dev/sdu7 ceph journal
 /dev/sdv3 ceph journal, for /dev/sdg3
 /dev/sdv4 ceph journal
 /dev/sdv5 ceph journal
 /dev/sdv6 ceph journal, for /dev/sdk3
 /dev/sdv7 ceph journal, for /dev/sdl3
 /dev/sdw3 ceph journal, for /dev/sdm3
 /dev/sdw4 ceph journal, for /dev/sdn3
 /dev/sdw5 ceph journal, for /dev/sdp3
 /dev/sdw6 ceph journal, for /dev/sdq3
 /dev/sdw7 ceph journal, for /dev/sdr3
 /dev/sdx3 ceph journal, for /dev/sds3
 /dev/sdx4 ceph journal, for /dev/sdt3
 /dev/sdx5 ceph journal
 /dev/sdx6 ceph journal
 /dev/sdx7 ceph journal
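The journal partitions that were never assigned to any data partition can be listed directly from the same output (a sketch based on the ceph-disk listing above; unassigned journals lack the trailing "for /dev/sdXN" part):

# ceph-disk list | grep 'ceph journal' | grep -v 'for /dev/'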

# ls -l /var/lib/ceph/osd/ceph-*/journal
lrwxrwxrwx 1 root root 58 May 19 22:27 /var/lib/ceph/osd/ceph-183/journal -> /dev/disk/by-partuuid/0cb494ed-3c1a-4a28-a7b3-db04db5892e5
lrwxrwxrwx 1 root root 58 May 19 22:29 /var/lib/ceph/osd/ceph-206/journal -> /dev/disk/by-partuuid/8ffa1146-625d-4e76-966b-8f34087dd626
lrwxrwxrwx 1 root root 58 May 19 22:32 /var/lib/ceph/osd/ceph-265/journal -> /dev/disk/by-partuuid/a2a226b8-dceb-4c9c-9d50-0852f28af0c9
lrwxrwxrwx 1 root root 58 May 19 22:40 /var/lib/ceph/osd/ceph-349/journal -> /dev/disk/by-partuuid/2c12026d-23d7-4778-83ec-f25774610cac
lrwxrwxrwx 1 root root 58 May 19 22:41 /var/lib/ceph/osd/ceph-396/journal -> /dev/disk/by-partuuid/0cb494ed-3c1a-4a28-a7b3-db04db5892e5
lrwxrwxrwx 1 root root 58 May 19 22:42 /var/lib/ceph/osd/ceph-399/journal -> /dev/disk/by-partuuid/69bf249c-ae04-48eb-99cc-8731af7f3d81
lrwxrwxrwx 1 root root 58 May 19 22:42 /var/lib/ceph/osd/ceph-401/journal -> /dev/disk/by-partuuid/86de0de1-ccf3-4972-99f0-8fc705eb9ff8
lrwxrwxrwx 1 root root 58 May 19 22:43 /var/lib/ceph/osd/ceph-419/journal -> /dev/disk/by-partuuid/8765daaa-d358-4724-8c6c-7c0acc102ac6
lrwxrwxrwx 1 root root 58 May 19 22:43 /var/lib/ceph/osd/ceph-428/journal -> /dev/disk/by-partuuid/5586cc16-a802-43f8-83a2-f0453aa68cba
lrwxrwxrwx 1 root root 58 May 19 22:43 /var/lib/ceph/osd/ceph-436/journal -> /dev/disk/by-partuuid/48e5eca3-980c-4fab-b107-5f3f1033295d
lrwxrwxrwx 1 root root 58 May 19 22:43 /var/lib/ceph/osd/ceph-445/journal -> /dev/disk/by-partuuid/ed1a9a3f-626e-4787-8dd1-122c5deafa4b
lrwxrwxrwx 1 root root 58 May 19 22:44 /var/lib/ceph/osd/ceph-453/journal -> /dev/disk/by-partuuid/cef387f8-8280-4478-9878-bb7b045f3a83
lrwxrwxrwx 1 root root 58 May 19 22:44 /var/lib/ceph/osd/ceph-463/journal -> /dev/disk/by-partuuid/8ffa1146-625d-4e76-966b-8f34087dd626
lrwxrwxrwx 1 root root 58 May 19 22:44 /var/lib/ceph/osd/ceph-469/journal -> /dev/disk/by-partuuid/ec2d882a-07cd-4a9c-bc03-9f3c17dfa7fc
lrwxrwxrwx 1 root root 58 May 19 22:45 /var/lib/ceph/osd/ceph-476/journal -> /dev/disk/by-partuuid/eda4d843-5868-4227-8c49-4501727d0ee1
lrwxrwxrwx 1 root root 58 May 19 22:45 /var/lib/ceph/osd/ceph-483/journal -> /dev/disk/by-partuuid/a2a226b8-dceb-4c9c-9d50-0852f28af0c9
lrwxrwxrwx 1 root root 58 May 19 22:45 /var/lib/ceph/osd/ceph-490/journal -> /dev/disk/by-partuuid/d857168b-ce4b-4885-8aa5-ff7a820453ab
lrwxrwxrwx 1 root root 58 May 19 22:46 /var/lib/ceph/osd/ceph-498/journal -> /dev/disk/by-partuuid/ec2d882a-07cd-4a9c-bc03-9f3c17dfa7fc
lrwxrwxrwx 1 root root 58 May 19 22:46 /var/lib/ceph/osd/ceph-504/journal -> /dev/disk/by-partuuid/eda4d843-5868-4227-8c49-4501727d0ee1
lrwxrwxrwx 1 root root 58 May 19 22:46 /var/lib/ceph/osd/ceph-508/journal -> /dev/disk/by-partuuid/a2a226b8-dceb-4c9c-9d50-0852f28af0c9

# ls -l /var/lib/ceph/osd/ceph-*/journal | awk '{print $11}' | sort | uniq -c
      2 /dev/disk/by-partuuid/0cb494ed-3c1a-4a28-a7b3-db04db5892e5
      1 /dev/disk/by-partuuid/2c12026d-23d7-4778-83ec-f25774610cac
      1 /dev/disk/by-partuuid/48e5eca3-980c-4fab-b107-5f3f1033295d
      1 /dev/disk/by-partuuid/5586cc16-a802-43f8-83a2-f0453aa68cba
      1 /dev/disk/by-partuuid/69bf249c-ae04-48eb-99cc-8731af7f3d81
      1 /dev/disk/by-partuuid/86de0de1-ccf3-4972-99f0-8fc705eb9ff8
      1 /dev/disk/by-partuuid/8765daaa-d358-4724-8c6c-7c0acc102ac6
      2 /dev/disk/by-partuuid/8ffa1146-625d-4e76-966b-8f34087dd626
      3 /dev/disk/by-partuuid/a2a226b8-dceb-4c9c-9d50-0852f28af0c9
      1 /dev/disk/by-partuuid/cef387f8-8280-4478-9878-bb7b045f3a83
      1 /dev/disk/by-partuuid/d857168b-ce4b-4885-8aa5-ff7a820453ab
      2 /dev/disk/by-partuuid/ec2d882a-07cd-4a9c-bc03-9f3c17dfa7fc
      1 /dev/disk/by-partuuid/ed1a9a3f-626e-4787-8dd1-122c5deafa4b
      2 /dev/disk/by-partuuid/eda4d843-5868-4227-8c49-4501727d0ee1
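To show only the journal targets shared by more than one OSD, the same pipeline can be filtered on the count (a minimal sketch extending the command above):

# ls -l /var/lib/ceph/osd/ceph-*/journal | awk '{print $11}' | sort | uniq -c | awk '$1 > 1'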

Steps to reproduce:

I can only guess: try to deploy a ceph-osd node with more than 2 OSDs, stop the deployment when only one of them has started, then try to deploy the others.

Workaround:
Manually recreate the journals for the OSDs that use a duplicated/triplicated journal_uuid.
Once you know which partition you will use for a specific OSD's journal:
# ceph osd set noout
# cd /var/lib/ceph/osd/ceph-247/
# cp -rp journal_uuid journal_uuid_orig
# echo "bbaddcd9-d54f-4bd1-8c23-7f79f8485c8d" > journal_uuid
# rm journal
# ln -s /dev/disk/by-partuuid/bbaddcd9-d54f-4bd1-8c23-7f79f8485c8d ./journal
# stop ceph-osd id=247
# ceph-osd --mkjournal -i 247
# start ceph-osd id=247
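Once all affected OSDs on the host are back up and in, the flag set at the beginning would presumably be cleared again (not part of the original workaround, but implied by the earlier "ceph osd set noout" step):

# ceph osd unset noout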

Dmitry Klenov (dklenov)
tags: added: area-ceph
Changed in fuel:
assignee: nobody → MOS Ceph (mos-ceph)
status: New → Confirmed
Changed in fuel:
assignee: MOS Ceph (mos-ceph) → Oleksiy Molchanov (omolchanov)
Oleksiy Molchanov (omolchanov) wrote :

It seems to be a very specific case; I was not able to reproduce it on a small environment. I am marking this as Invalid. Please reopen as soon as it is reproduced again and you have a diagnostic snapshot.

Changed in fuel:
status: Confirmed → Invalid