[library] ceph osd: invalid (someone else's?) journal

Bug #1326146 reported by Frank J. Cameron
This bug affects 2 people
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: Medium
Assigned to: Dmitry Borodaenko
Milestone: 5.1

Bug Description

I have not been able to replicate this.

I have 3 OSD servers with 4 OSD disks and 1 Journal disk each; deployed with Fuel 5.0.

# ceph osd dump | fgrep out
osd.11 down out weight 0 up_from 0 up_thru 0 down_at 0 last_clean_interval [0,0) :/0 :/0 :/0 :/0 exists,new e68c819d-ad02-46c2-8eab-29e60cec3a85

# ps -ef | fgrep ceph
root 3438 1 0 Jun02 ? 00:01:37 /usr/bin/ceph-mon -i node-1 --pid-file /var/run/ceph/mon.node-1.pid -c /etc/ceph/ceph.conf
root 3791 1 0 Jun02 ? 00:08:26 /usr/bin/ceph-osd -i 0 --pid-file /var/run/ceph/osd.0.pid -c /etc/ceph/ceph.conf
root 4026 1 0 Jun02 ? 00:08:25 /usr/bin/ceph-osd -i 3 --pid-file /var/run/ceph/osd.3.pid -c /etc/ceph/ceph.conf
root 4322 1 0 Jun02 ? 00:08:15 /usr/bin/ceph-osd -i 10 --pid-file /var/run/ceph/osd.10.pid -c /etc/ceph/ceph.conf

# tail -5 /var/log/ceph/ceph-osd.11.log
2014-06-03 21:23:44.949786 7f6b9d1e57a0 0 filestore(/var/lib/ceph/osd/ceph-11) mount: enabling WRITEAHEAD journal mode: btrfs not detected
2014-06-03 21:23:44.958455 7f6b9d1e57a0 1 journal _open /var/lib/ceph/osd/ceph-11/journal fd 19: 10737418240 bytes, block size 4096 bytes, directio = 1, aio = 1
2014-06-03 21:23:44.958563 7f6b9d1e57a0 -1 journal FileJournal::open: ondisk fsid f69f1af8-e234-43c4-a240-3fe88838d92d doesn't match expected e68c819d-ad02-46c2-8eab-29e60cec3a85, invalid (someone else's?) journal
2014-06-03 21:23:44.958632 7f6b9d1e57a0 -1 filestore(/var/lib/ceph/osd/ceph-11) mount failed to open journal /var/lib/ceph/osd/ceph-11/journal: (22) Invalid argument
2014-06-03 21:23:44.960387 7f6b9d1e57a0 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-11: (22) Invalid argument
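
For context, the "expected" fsid in that error is the OSD's own fsid stored in its data directory, while the "ondisk" fsid is the one written into the journal header by whichever OSD formatted the partition first. A quick illustrative check, assuming the standard data directory layout:

# Illustrative only, assuming the standard OSD data directory layout.
# The "expected" fsid comes from the OSD's own fsid file; the "ondisk" fsid is
# whatever was written into the journal header when the partition was formatted.
with open('/var/lib/ceph/osd/ceph-11/fsid') as f:
    print(f.read().strip())  # should print e68c819d-ad02-46c2-8eab-29e60cec3a85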

# ls -l /var/lib/ceph/osd/ceph-*/journal
lrwxrwxrwx 1 root root 9 May 30 20:37 /var/lib/ceph/osd/ceph-0/journal -> /dev/sda4
lrwxrwxrwx 1 root root 9 May 30 20:42 /var/lib/ceph/osd/ceph-11/journal -> /dev/sda4
lrwxrwxrwx 1 root root 9 May 30 20:37 /var/lib/ceph/osd/ceph-3/journal -> /dev/sda5
lrwxrwxrwx 1 root root 9 May 30 20:37 /var/lib/ceph/osd/ceph-10/journal -> /dev/sda7

Two of the OSDs are pointed at the same journal partition.
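
A quick way to spot this condition programmatically (an illustrative sketch, not part of the deployment tooling) is to resolve each OSD's journal symlink and flag any target that appears more than once:

# Illustrative sketch: flag journal devices that more than one OSD points at.
import collections
import glob
import os

targets = collections.defaultdict(list)
for link in glob.glob('/var/lib/ceph/osd/ceph-*/journal'):
    targets[os.path.realpath(link)].append(link)

for device, links in sorted(targets.items()):
    if len(links) > 1:
        print('%s is used by %s' % (device, ', '.join(sorted(links))))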

Revision history for this message
Frank J. Cameron (fjc) wrote :

Redeployed my cluster and it happened again; one of the OSD nodes has the same journal partition linked to two of the OSD disks:

[root@node-1 ~]# ls -l /var/lib/ceph/osd/ceph-*/journal
lrwxrwxrwx 1 root root 9 Jun 4 16:34 /var/lib/ceph/osd/ceph-11/journal -> /dev/sda4
lrwxrwxrwx 1 root root 9 Jun 4 16:30 /var/lib/ceph/osd/ceph-2/journal -> /dev/sda4
lrwxrwxrwx 1 root root 9 Jun 4 16:30 /var/lib/ceph/osd/ceph-5/journal -> /dev/sda5
lrwxrwxrwx 1 root root 9 Jun 4 16:31 /var/lib/ceph/osd/ceph-10/journal -> /dev/sda7

From the puppet logs:

(/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) change from notrun to 0 failed: ceph-deploy osd activate node-1:/dev/sde4:/dev/sda4 returned 1 instead of one of [0]
(/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns) change from notrun to 0 failed: ceph-deploy osd prepare node-1:/dev/sde4:/dev/sda4 node-1:/dev/sdb4:/dev/sda5 returned 1 instead of one of [0]
ceph-deploy osd prepare node-1:/dev/sde4:/dev/sda4 node-1:/dev/sdb4:/dev/sda5 returned 1 instead of one of [0]
(/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns) change from notrun to 0 failed: ceph-deploy osd activate node-1:/dev/sdc4:/dev/sda4 node-1:/dev/sdd4:/dev/sda5 node-1:/dev/sde4:/dev/sda6 node-1:/dev/sdb4:/dev/sda7 returned 1 instead of one of [0]
ceph-deploy osd activate node-1:/dev/sdc4:/dev/sda4 node-1:/dev/sdd4:/dev/sda5 node-1:/dev/sde4:/dev/sda6 node-1:/dev/sdb4:/dev/sda7 returned 1 instead of one of [0]

Revision history for this message
Frank J. Cameron (fjc) wrote :

I've been digging through the puppet.log on the node with the failure. On the failing node puppet tries three times to deploy the ceph-osd role.

The first time with all four devices. It appears to me that the proper ceph commands are being invoked and "ceph-deploy osd prepare" succeeds, but "ceph-deploy osd activate" fails with an exception thrown from ceph_deploy/osd.py.

The second time with only two of the devices. It errors out when it tries to format the device that is already mounted.

The third time with only one of the devices. This is where the "(someone else's?) journal" comes from; the initial invocation used sda4 as the journal device for sdc4 and sda6 as the journal device for sde4, but the final invocation uses sda4 as the journal device for sde4.
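
One plausible reading of the journal reassignment (a hypothetical sketch of positional pairing, not the actual Fuel/puppet code) is that each retry re-pairs the remaining data disks with the journal partitions from the start of the list, so a partition that was already formatted for another OSD gets reused:

# Hypothetical illustration of positional pairing; not the actual Fuel code.
def pair_osds(data_disks, journal_parts):
    # Each remaining data disk is paired with the next journal partition in order.
    return ['%s:%s' % (d, j) for d, j in zip(data_disks, journal_parts)]

journals = ['/dev/sda4', '/dev/sda5', '/dev/sda6', '/dev/sda7']
# First attempt, all four data disks:
print(pair_osds(['/dev/sdc4', '/dev/sdd4', '/dev/sde4', '/dev/sdb4'], journals))
# Retry with only the disks that failed: /dev/sde4 now pairs with /dev/sda4,
# which was already formatted as the journal for /dev/sdc4.
print(pair_osds(['/dev/sde4', '/dev/sdb4'], journals))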

The real problem appears to be the failure of the initial "ceph-deploy osd activate" invocation:

***** Beginning deployment of node node-1 with role ceph-osd *****

osd_devices: /dev/sdc4:/dev/sda4 /dev/sdd4:/dev/sda5 /dev/sde4:/dev/sda6 /dev/sdb4:/dev/sda7
ceph_osd: /dev/sdc4:/dev/sda4/dev/sdd4:/dev/sda5/dev/sde4:/dev/sda6/dev/sdb4:/dev/sda7

(/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns)
Invoked (1.2.7): /usr/bin/ceph-deploy osd prepare node-1:/dev/sdc4:/dev/sda4 node-1:/dev/sdd4:/dev/sda5 node-1:/dev/sde4:/dev/sda6 node-1:/dev/sdb4:/dev/sda7
Preparing host node-1 disk /dev/sdc4 journal /dev/sda4 activate False
Preparing host node-1 disk /dev/sdd4 journal /dev/sda5 activate False
Preparing host node-1 disk /dev/sde4 journal /dev/sda6 activate False
Preparing host node-1 disk /dev/sdb4 journal /dev/sda7 activate False
executed successfully

(/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd activate]/returns)
Invoked (1.2.7): /usr/bin/ceph-deploy osd activate node-1:/dev/sdc4:/dev/sda4 node-1:/dev/sdd4:/dev/sda5 node-1:/dev/sde4:/dev/sda6 node-1:/dev/sdb4:/dev/sda7
Activating cluster ceph disks node-1:/dev/sdc4:/dev/sda4 node-1:/dev/sdd4:/dev/sda5 node-1:/dev/sde4:/dev/sda6 node-1:/dev/sdb4:/dev/sda7
Activating host node-1 disk /dev/sdc4
Activating host node-1 disk /dev/sdd4
Traceback (most recent call last):
 File "/usr/bin/ceph-deploy", line 21, in <module>
   sys.exit(main())
 File "/usr/lib/python2.6/site-packages/ceph_deploy/util/decorators.py", line 83, in newfunc
   return f(*a, **kw)
 File "/usr/lib/python2.6/site-packages/ceph_deploy/cli.py", line 150, in main
   return args.func(args)
 File "/usr/lib/python2.6/site-packages/ceph_deploy/osd.py", line 371, in osd
   activate(args, cfg)
 File "/usr/lib/python2.6/site-packages/ceph_deploy/osd.py", line 278, in activate
   cmd=cmd, ret=ret, out=out, err=err)
NameError: global name 'ret' is not defined
(/Stage[main]/Ceph/Service[ceph]) Dependency Exec[ceph-deploy osd activate] has failures: true

**** Beginning deployment of node node-1 with role ceph-osd *****

ceph_osd: /dev/sde4:/dev/sda4/dev/sdb4:/dev/sda5
osd_devices: /dev/sde4:/dev/sda4 /dev/sdb4:/dev/sda5

(/Stage[main]/Ceph::Osd/Exec[ceph-deploy osd prepare]/returns)
Invoked (1.2.7): /usr/bin/ceph-deploy osd prepare node-1:/dev/sde4:/dev/sda4 node-1:/dev/sdb4:/dev/sda5
Preparing cluster ceph disks node-1:/dev/sde4:/dev/sda4 node-1:/dev/sdb4:/dev/sda5
Pre...


Revision history for this message
Frank J. Cameron (fjc) wrote :

"NameError: global name 'ret' is not defined" is clearly a bug in the error handling in acticate() in ceph_deploy/osd.py that is masking the underlying err.

Looking at github, the activate() function was heavily refactored between versions 1.2.6 and 1.2.7:

https://github.com/ceph/ceph-deploy/blob/v1.2.6/ceph_deploy/osd.py
https://github.com/ceph/ceph-deploy/blob/v1.2.7/ceph_deploy/osd.py

The version of the function on my node matches the github code for 1.2.6 even though the installed rpm is version 1.2.7 (specifically ceph-deploy-1.2.7-0.mira.1.noarch).
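
One way to confirm the mismatch (an illustrative check, assuming the node can reach GitHub) is to diff the installed module against the v1.2.7 tag:

# Illustrative check, assuming the node can reach GitHub: diff the installed
# module against the v1.2.7 tag to see whether it actually matches 1.2.6-era code.
import difflib
import urllib2  # Python 2, as on the node

url = 'https://raw.githubusercontent.com/ceph/ceph-deploy/v1.2.7/ceph_deploy/osd.py'
upstream = urllib2.urlopen(url).read().splitlines()
with open('/usr/lib/python2.6/site-packages/ceph_deploy/osd.py') as f:
    installed = f.read().splitlines()

diff = list(difflib.unified_diff(upstream, installed, 'v1.2.7 (github)', 'installed', lineterm=''))
print('\n'.join(diff) if diff else 'installed file matches v1.2.7')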

Changed in fuel:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Dmitry Borodaenko (dborodaenko)
milestone: none → 5.1
Dmitry Ilyin (idv1985)
summary: - ceph osd: invalid (someone else's?) journal
+ [library] ceph osd: invalid (someone else's?) journal
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

Package repositories for 5.1 now have ceph-deploy 1.5.9 which shouldn't have this problem.

Changed in fuel:
status: Confirmed → Fix Committed