ceph-disk-prepare --zap-disk hang

Bug #1475247 reported by Adam Collard on 2015-07-16
This bug affects 5 people
Affects                            Importance  Assigned to
ceph (Juju Charms Collection)      Critical    Chris Glass
ceph (Ubuntu)                      High        Unassigned
  Trusty                           High        Chris Glass
  Utopic                           High        Unassigned
  Vivid                            High        Unassigned
  Wily                             High        Unassigned
ceph-osd (Juju Charms Collection)  Critical    Chris Glass

Bug Description

[Impact]
Disks with invalid metadata can cause hangs during cleaning, resulting in stuck deployments.

[Test Case]
Initialize a disk with invalid metadata using the '--zap-disk' option.

[Regression Potential]
Minimal; already in later Ubuntu releases.

[Original Bug Report]
During an Autopilot deployment on gMAAS, Juju hung while running a mon-relation-changed hook:

$ ps afxwww | grep -A 4 [m]on-relation-changed
  29118 ? S 0:03 \_ /usr/bin/python /var/lib/juju/agents/unit-ceph-1/charm/hooks/mon-relation-changed
  37996 ? S 0:00 \_ /bin/sh /usr/sbin/ceph-disk-prepare --fs-type xfs --zap-disk /dev/sdb
  37998 ? S 0:00 \_ /usr/bin/python /usr/sbin/ceph-disk prepare --fs-type xfs --zap-disk /dev/sdb
  38016 ? D 0:00 \_ /sbin/sgdisk --zap-all --clear --mbrtogpt -- /dev/sdb

This had been in this state for > 10m. The logs[1] from the unit in question showed that something was up with the partition tables on that disk.
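A minimal sketch of how the stuck process could be picked out of a listing like the one above: it scans `ps`-style lines for a STAT column starting with `D` (uninterruptible sleep). The function name `uninterruptible` is hypothetical, and the parsing assumes the five-column layout shown in this report (PID, TTY, STAT, TIME, COMMAND).

```python
def uninterruptible(ps_lines):
    """Return (pid, command) for every listed process in 'D' state.

    Assumes lines shaped like the `ps afxwww` output quoted in this
    report: PID, TTY, STAT, TIME, then the command (optionally
    prefixed with the `\\_ ` tree marker).
    """
    found = []
    for line in ps_lines:
        fields = line.split(None, 4)
        # Skip headers, blank lines, or anything without a numeric PID.
        if len(fields) < 5 or not fields[0].isdigit():
            continue
        pid, _tty, stat, _time, command = fields
        if command.startswith("\\_ "):
            command = command[3:]
        if stat.startswith("D"):
            found.append((int(pid), command))
    return found


sample = [
    r"  37998 ? S 0:00 \_ /usr/bin/python /usr/sbin/ceph-disk prepare --fs-type xfs --zap-disk /dev/sdb",
    r"  38016 ? D 0:00 \_ /sbin/sgdisk --zap-all --clear --mbrtogpt -- /dev/sdb",
]
print(uninterruptible(sample))
```

On the sample lines, only the sgdisk process is reported.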

I fixed this by hand using gdisk[2].

[1] https://pastebin.canonical.com/135426/
[2] http://paste.ubuntu.com/11887096/

Andreas Hasenack (ahasenack) wrote :

I had the same thing happen to a run of mine once:

2015-07-01 14:26:50 INFO mon-relation-changed ^GCaution: invalid backup GPT header, but valid main header; regenerating
2015-07-01 14:26:50 INFO mon-relation-changed backup header from main header.
2015-07-01 14:26:50 INFO mon-relation-changed
2015-07-01 14:26:50 INFO mon-relation-changed Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
2015-07-01 14:26:50 INFO mon-relation-changed on the recovery & transformation menu to examine the two tables.
2015-07-01 14:26:50 INFO mon-relation-changed
2015-07-01 14:26:50 INFO mon-relation-changed Warning! One or more CRCs don't match. You should repair the disk!
2015-07-01 14:26:50 INFO mon-relation-changed

  16841 ? Ssl 0:00 /var/lib/juju/tools/unit-ceph-1/jujud unit --data-dir /var/lib/juju --unit-name ceph/1 --debug
  36054 ? S 0:02 \_ /usr/bin/python /var/lib/juju/agents/unit-ceph-1/charm/hooks/mon-relation-changed
  38589 ? S 0:00 \_ /bin/sh /usr/sbin/ceph-disk-prepare --fs-type xfs --zap-disk /dev/sdb
  38592 ? S 0:00 \_ /usr/bin/python /usr/sbin/ceph-disk prepare --fs-type xfs --zap-disk /dev/sdb
  38601 ? D 0:00 \_ /sbin/sgdisk --zap-all --clear --mbrtogpt -- /dev/sdb

It was stuck like this for hours. Nothing interesting in dmesg.

Since I was debugging potential hardware disk issues on that node, I dismissed it.

Andreas Hasenack (ahasenack) wrote :

In another stuck install, looks like sgdisk eventually moves on:

2015-07-16 15:11:33 INFO mon-relation-changed ^MReading state information... 0%^M^MReading state information... 0%^M^MReading state information... Done
2015-07-16 15:11:33 INFO mon-relation-changed ^GCaution: invalid backup GPT header, but valid main header; regenerating
2015-07-16 15:11:33 INFO mon-relation-changed backup header from main header.
2015-07-16 15:11:33 INFO mon-relation-changed
2015-07-16 15:11:33 INFO mon-relation-changed Warning! Main and backup partition tables differ! Use the 'c' and 'e' options
2015-07-16 15:11:33 INFO mon-relation-changed on the recovery & transformation menu to examine the two tables.
2015-07-16 15:11:33 INFO mon-relation-changed
2015-07-16 15:11:33 INFO mon-relation-changed Warning! One or more CRCs don't match. You should repair the disk!
2015-07-16 15:11:33 INFO mon-relation-changed
2015-07-16 15:55:33 INFO mon-relation-changed ^G^G****************************************************************************
2015-07-16 15:55:33 INFO mon-relation-changed Caution: Found protective or hybrid MBR and corrupt GPT. Using GPT, but disk
2015-07-16 15:55:33 INFO mon-relation-changed verification and recovery are STRONGLY recommended.
2015-07-16 15:55:33 INFO mon-relation-changed ****************************************************************************
2015-07-16 15:55:33 INFO mon-relation-changed GPT data structures destroyed! You may now partition the disk using fdisk or
2015-07-16 15:55:33 INFO mon-relation-changed other utilities.
2015-07-16 15:55:33 INFO mon-relation-changed The operation has completed successfully.

Look at the delta between the "repair the disk" advice and the last line. That would account for sgdisk in D state doing stuff to the disk. What, I don't know.
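For reference, the delta can be worked out from the two timestamps in the log excerpt above. This is only a small sketch; the helper name `log_gap` is hypothetical, and the format string assumes the `YYYY-MM-DD HH:MM:SS` prefix used in the unit log.

```python
from datetime import datetime

# Timestamp prefix used by the unit log lines quoted above.
FMT = "%Y-%m-%d %H:%M:%S"


def log_gap(earlier, later):
    """Return the time elapsed between two log timestamps."""
    return datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)


# "repair the disk" warning vs. "The operation has completed successfully."
gap = log_gap("2015-07-16 15:11:33", "2015-07-16 15:55:33")
print(gap)  # 0:44:00
```

So sgdisk spent 44 minutes between the CRC warning and completion.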

This disk in particular, /dev/sdb, is a:
[ 3.913915] sd 0:0:1:0: [sdb] 5860533168 512-byte logical blocks: (3.00 TB/2.72 TiB)

Andreas Hasenack (ahasenack) wrote :

#server discussion:
<cholcombe> ok i see something
<cholcombe> andreas, http://www.rodsbooks.com/gdisk/sgdisk.html check out -o --clear
<cholcombe> andreas, so it looks like we need to sgdisk -Z /dev/sdb
<andreas> two steps
<cholcombe> let that finish and then do the next step
<cholcombe> right

Andreas Hasenack (ahasenack) wrote :

This is the original upstream bug, filed by SuSE: http://tracker.ceph.com/issues/11143

The fix is the same: split sgdisk into two calls.
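A rough sketch of what "two calls" could look like, using only the sgdisk options that appear in the process listings in this report (`--zap-all`, `--clear`, `--mbrtogpt`). The function names and the injectable `run` parameter are assumptions for illustration; this is not the upstream ceph-disk code, for which see the linked tracker issue. Running it against a real device destroys its partition tables.

```python
import subprocess


def zap_commands(dev):
    """Build the two sgdisk invocations, in the order they should run."""
    return [
        # Step 1: destroy the existing (possibly corrupt) GPT and MBR data.
        ["sgdisk", "--zap-all", "--", dev],
        # Step 2: only afterwards create a fresh, empty GPT.
        ["sgdisk", "--clear", "--mbrtogpt", "--", dev],
    ]


def zap(dev, run=subprocess.check_call):
    """Run both steps, stopping if the first one fails.

    `run` is injectable so the sequence can be inspected without
    touching a disk; the default raises CalledProcessError on failure.
    """
    for cmd in zap_commands(dev):
        run(cmd)


# Dry run: record the commands instead of executing them.
recorded = []
zap("/dev/sdX", run=recorded.append)
for cmd in recorded:
    print(" ".join(cmd))
```

Keeping the command list separate from the runner makes it easy to check the order of the two steps before pointing this at a real device.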

James Page (james-page) on 2015-07-17
Changed in ceph (Juju Charms Collection):
status: New → Invalid
status: Invalid → New
Changed in ceph (Ubuntu):
importance: Undecided → High
Changed in ceph (Ubuntu Wily):
status: New → Fix Released
Changed in ceph (Ubuntu Vivid):
status: New → Fix Released
importance: Undecided → High
Changed in ceph (Ubuntu Utopic):
status: New → Won't Fix
Changed in ceph (Ubuntu Trusty):
status: New → Triaged
importance: Undecided → High
Changed in ceph (Juju Charms Collection):
assignee: nobody → Chris Glass (tribaal)
Changed in ceph-osd (Juju Charms Collection):
assignee: nobody → Chris Glass (tribaal)
importance: Undecided → Critical
status: New → In Progress
Changed in ceph (Juju Charms Collection):
status: New → In Progress
importance: High → Critical
Changed in ceph-osd (Juju Charms Collection):
milestone: none → 15.07
Changed in ceph (Juju Charms Collection):
milestone: 15.10 → 15.07
Chris Glass (tribaal) on 2015-08-07
Changed in ceph (Ubuntu Trusty):
assignee: nobody → Chris Glass (tribaal)
status: Triaged → In Progress
Chris Glass (tribaal) on 2015-08-07
Changed in ceph-osd (Juju Charms Collection):
status: In Progress → Fix Committed
Changed in ceph (Juju Charms Collection):
status: In Progress → Fix Committed

I updated the upstream ticket, which was still open. The fix from SUSE went into master in April, and the changes were backported to the Firefly and Hammer releases.

Commit was https://github.com/ceph/ceph/commit/b73a236c576dc9f43348fa6a7c696c5d19513cca

James Page (james-page) on 2015-08-10
description: updated
James Page (james-page) on 2015-08-10
Changed in ceph (Juju Charms Collection):
status: Fix Committed → Fix Released
Changed in ceph-osd (Juju Charms Collection):
status: Fix Committed → Fix Released

Hello Adam, or anyone else affected,

Accepted ceph into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/0.80.10-0ubuntu1.14.04.2 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in ceph (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed
Andreas Hasenack (ahasenack) wrote :

I still see sgdisk hanging in D state. The command line is correct now:
   3642 ? Ssl 0:00 /var/lib/juju/tools/unit-ceph-1/jujud unit --data-dir /var/lib/juju --unit-name ceph/1 --debug
  23810 ? S 0:02 \_ /usr/bin/python /var/lib/juju/agents/unit-ceph-1/charm/hooks/mon-relation-changed
  27816 ? S 0:00 \_ /bin/sh /usr/sbin/ceph-disk-prepare --fs-type xfs --zap-disk /dev/sdb
  27818 ? S 0:00 \_ /usr/bin/python /usr/sbin/ceph-disk prepare --fs-type xfs --zap-disk /dev/sdb
  27833 ? D 0:00 \_ /sbin/sgdisk --zap-all -- /dev/sdb

The unit log is stuck in:
2015-09-02 18:57:54 INFO mon-relation-changed Warning! One or more CRCs don't match. You should repair the disk!
2015-09-02 18:57:54 INFO mon-relation-changed

James Page (james-page) wrote :

I think there are potentially two bugs here; the first being:

http://tracker.ceph.com/issues/11143

which is resolved by the fix in proposed (splitting the calls); I think the hanging sgdisk might be a different problem.

I'm going to mark this as verification-done for http://tracker.ceph.com/issues/11143; we'll need another bug for the hang problem

tags: added: verification-done
removed: verification-needed
James Page (james-page) wrote :

Reverting the verification-done for now - maybe this is all the same bug.

tags: added: verification-needed
removed: verification-done
James Page (james-page) wrote :

As this has not regressed functionality, marking verification-done - there may be other bugs, but there always are bugs...

tags: added: verification-done
removed: verification-needed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package ceph - 0.80.10-0ubuntu1.14.04.2

---------------
ceph (0.80.10-0ubuntu1.14.04.2) trusty; urgency=medium

  * Switch to two step 'zapping' of disks, ensuring that disks with invalid
    metadata don't cause hangs and are fully cleaned and initialized prior
    to use (LP: #1475247).

ceph (0.80.10-0ubuntu0.14.04.1) trusty; urgency=medium

  * New upstream stable point release (LP: #1477174):
    - d/ceph.install: Add manpage for ceph-disk.
    - d/ceph-common.install: Replace ceph_filestore_* with
      ceph-objectstore-tool.
    - d/control: Ensure ceph-test-dbg depends on ceph-test only.
    - d/p/fix-python-rados-memleak.patch: Dropped included upstream.

 -- Christopher Glass (Canonical) <email address hidden> Mon, 10 Aug 2015 11:00:44 +0100

Changed in ceph (Ubuntu Trusty):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for ceph has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Hello Adam, or anyone else affected,

Accepted ceph into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/ceph/0.80.10-0ubuntu1.14.04.3 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in ceph (Ubuntu Trusty):
status: Fix Released → Fix Committed
Changed in ceph (Ubuntu Vivid):
status: Fix Released → Fix Committed
tags: removed: verification-done
tags: added: verification-needed
James Page (james-page) on 2015-11-04
tags: added: verification-done
removed: verification-needed
Changed in ceph (Ubuntu Utopic):
importance: Undecided → High
James Page (james-page) wrote :

Marking Fix Released as this is in the version of ceph in updates for 14.04

Changed in ceph (Ubuntu Trusty):
status: Fix Committed → Fix Released