500 Error when getting /recon/diskusage

Bug #1249181 reported by WangChao
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack Object Storage (swift)
Fix Released
Undecided
Kun Huang

Bug Description

I yanked a disk from my system by mistake .Then I execute the command "curl -v http://proxy-ip:6000/recon/diskusage" and 500 reported.

It is unreasonable that the recon daemon crash when the only one disk removed while the others disks works fine.

Thanks.

Revision history for this message
Kun Huang (academicgareth) wrote :

Could you provide more traceback details in log? It seems os.lstat(<your-mount-point>) raise an un-excepted error

Revision history for this message
WangChao (chaowsh) wrote :

yes , it really the os.lstat() (in the method swift.common.utils.ismount(path)) raise the IO exception, and the method did not catch the errno.EIO exception.
In swift 1.9.0 , I did the same thing and it worked ok. But in Havana the recon daemon crash.

Revision history for this message
Kun Huang (academicgareth) wrote :

I submit a patch to catch all oserror. A reason is the newer implementation of os.path.ismount does this too.

Changed in swift:
assignee: nobody → Kun Huang (academicgareth)
Revision history for this message
Peter Portante (peter-a-portante) wrote :

Before we go changing things, can we find out what the error is that is being raised and why it is being raised? This smells like a configuration problem that will likely be hidden by catching all errors as proposed.

Revision history for this message
WangChao (chaowsh) wrote :

Below is the error log from both kernel and swift log. When I run the command "curl -v http://proxy-ip:6000/recon/diskusage" the 500 report. I am sure in swift-1.9.0 ,I can also get the recon/diskusage.

Nov 4 11:35:59 swift-node7 kernel: [1818200.382441] XFS (sdd): metadata I/O error: block 0x7470770d ("xlog_iodone") error 5 numblks 64
Nov 4 11:35:59 swift-node7 kernel: [1818200.382729] XFS (sdd): xfs_do_force_shutdown(0x2) called from line 1038 of file /build/buildd/linux-lts-quantal-3.5.0/fs/xfs/xfs_log.c. Return address = 0xffffffffa0265ac1
Nov 4 11:35:59 swift-node7 kernel: [1818200.382754] XFS (sdd): Log I/O Error Detected. Shutting down filesystem
Nov 4 11:35:59 swift-node7 kernel: [1818200.382762] XFS (sdd): xfs_log_force: error 5 returned.
Nov 4 11:35:59 swift-node7 kernel: [1818200.382768] XFS (sdd): xfs_do_force_shutdown(0x1) called from line 1098 of file /build/buildd/linux-lts-quantal-3.5.0/fs/xfs/xfs_buf.c. Return address = 0xffffffffa02115a4
Nov 4 11:35:59 swift-node7 kernel: [1818200.382977] XFS (sdd): Please umount the filesystem and rectify the problem(s)
Nov 4 11:35:59 swift-node7 kernel: [1818200.384131] sd 0:0:11:0: [sdd] Synchronizing SCSI cache

Nov 4 11:36:00 swift-node7 account-server ERROR __call__ error with REPLICATE /sdd/989/7bba9492c36a4163c0c8a3991c3c5625 : [Errno 5] Input/output error: '/srv/node/sdd'

Revision history for this message
Peter Portante (peter-a-portante) wrote :

Looks like you have a corrupted disk. Perhaps it was a good thing we did not fix this imount command?

Revision history for this message
Kun Huang (academicgareth) wrote :

@peter

If we don't catch EIO in utils.ismount, we have to catch this in places which use utils.ismount. For example in this case, we should catch EIO at https://github.com/openstack/swift/blob/master/swift/common/middleware/recon.py#L200. And we also have to catch this in more places. Compared with newer codes in os.path.ismount (http://hg.python.org/releasing/3.3.3/file/6e81c3a16d1c/Lib/posixpath.py#l211), ismount catch all OS error and seems be designed to return True or False as an easy check. So my understanding is that if there're something wrong with your disk, just give you a False. So the patch here(https://review.openstack.org/#/c/55991/) catch all OS error.

Revision history for this message
Peter Portante (peter-a-portante) wrote :

@kun

Wouldn't we want to have recon raise a red flag to say that there is a major problem with a configured device, rather than just reporting that the disk is simply not mounted?

Personally, the upstream python code is performing a disservice swallowing all errors and just reporting that it is not mounted.

In your case, the device IS mounted, but it is reporting errors. That seems to be the information recon should be catching to help report what is actually happening.

I will not prevent that patch from going forward, so if other swifters disagree, that is fine.

Revision history for this message
WangChao (chaowsh) wrote :

@peter
I agree with Huang Kun, I don't think it is a good idea that the recon crash just one disk failed while the rest all works fine.

Revision history for this message
Peter Portante (peter-a-portante) wrote :

So then we need to fix recon to catch errors checking disk availability, so it won't crash, flagging errors instead, rather than changing utils.ismount. Do you agree?

Revision history for this message
Kun Huang (academicgareth) wrote :

In this case catch the error in recon is ok, but that means a request to object server on that disk also raise errors because ismount doesn't catch it. So we have to catch this error in many places using ismount or mount_check. I don't think that's a good idea.

In another word, IMO, to return False means device is not mounted CORRECTLY. So catching all OS errors seems more reasonable.

Revision history for this message
Peter Portante (peter-a-portante) wrote :

@Kun, respectfully, I disagree. there is only one place in object server that checks mounts in master, so I don't think this is an issue. There are other places in account and container servers that make that same check today, but there is a series of patches proposed that will reduce those to one place as well.

Hiding the errors in ismount seems too expedient. Let's get our code to handle the errors properly.

Revision history for this message
WangChao (chaowsh) wrote :

@peter it is ok to fix recon to catch errors checking disk availability.

Changed in swift:
status: New → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/55991
Committed: http://github.com/openstack/swift/commit/fd4843f8e72227318b1e2ee5fbd2d43cc3732a1d
Submitter: Jenkins
Branch: master

commit fd4843f8e72227318b1e2ee5fbd2d43cc3732a1d
Author: Kun Huang <email address hidden>
Date: Wed Nov 13 19:20:16 2013 +0800

    catch OSError to prevent breaking request /recon/diskusage

    swift.common.utils.ismount maybe raise some OSError in some special
    cases; and the request against /recon/diskusage doesn't handle it
    before. This patch let output of mounted keyword is the error's message.

    Change-Id: I5d9018f580181e618a3fa072b7a760d41795d8eb
    Closes-Bug: #1249181

Changed in swift:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/ec)

Fix proposed to branch: feature/ec
Review: https://review.openstack.org/59766

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/ec)
Download full text (43.6 KiB)

Reviewed: https://review.openstack.org/59766
Committed: http://github.com/openstack/swift/commit/239f88a42b00a71a07860f953d00771c8aef4305
Submitter: Jenkins
Branch: feature/ec

commit 8a64bff2dc28b43b3ed4fa7b65da1a9ea29677cc
Author: Samuel Merritt <email address hidden>
Date: Wed Nov 27 17:23:59 2013 -0800

    Report transaction ID in failure exceptions

    This way, when something fails in Jenkins, you have some chance of
    searching the logs for the relevant transaction.

    Change-Id: I3cf606cb4963e32b5c6ac3deda08e73541b3ff7d

commit e0147e60d800fd67bc05bc4299c315f1761bd60b
Author: Peter Portante <email address hidden>
Date: Fri Nov 22 16:59:09 2013 -0500

    Add a unit test to verify proxy logging fields

    Also bring unit test coverage to 100% (well, at least every line is
    reported as "covered").

    Change-Id: I659d0c02008368897b1307a7a5c9aaba73b80588
    Signed-off-by: Peter Portante <email address hidden>

commit 87cd5598476d0835c526918a9e1f03fe2d698866
Author: Alex Gaynor <email address hidden>
Date: Sun Nov 24 20:24:45 2013 -0600

    Account for a platform difference in semaphores

    On OS X (and probably other Operating Systems) it isn't possible to
    introspect the value of a semaphore. Account for this by skipping a
    test about this.

    Change-Id: I97824f9fc4e36de4f7a62c8ce53865e6977dfdfe

commit 3c7c355120a3ebe5c3f47e62176cec8cab824143
Author: Peter Portante <email address hidden>
Date: Mon Nov 25 13:30:41 2013 -0500

    Use TCP_NODELAY for created sockets.

    Mark Seger at HP has been looking at small objects, 1 and 2 KB size,
    and with Rick Jones' help noticed that TCP protocol traces showed
    effects from the Nagel algorithm client-to-server and
    server-to-client.

    This patch just addresses our WSGI server responses, but does not
    address out-bound connections from the various servers.

    Change-Id: I11f86df1f56fba1c6ab6084dc1f580c395f072dc
    Signed-off-by: Peter Portante <email address hidden>

commit 39032c359f01a5e397fce2eb8326b961c9673607
Author: Darrell Bishop <email address hidden>
Date: Wed Nov 27 12:07:42 2013 -0500

    Add HTML reporting for test branch coverage.

    When including branch coverage results, also generate HTML reports into
    a "cover" subdirectory under the directory in which .unittests resides
    (i.e. known location at the top of the swift tree).

    Change-Id: I493d74f38755f7bf0d7043052585efb27840b238

commit 0ba071f27c009e1d028189e812f722e8583a07ee
Author: Darrell Bishop <email address hidden>
Date: Tue Nov 26 15:08:13 2013 -0500

    Fix bug in obj updater run_once().

    The "not" in front of the ismount() call got accidentally dropped in a
    recent change. This patch adds it back along with a few more tests.

    Note that this bug only showed up on an SAIO during probe tests because
    I used actually-mounted (virtual) "disks". So keep that in mind when
    building SAIOs for development/testing.

    Change-Id: Ia193f3c4b73203605954036863575c22ddab6b03

commit edc9f62ed6c537c4c112cf552310705b99fa08b8
Author: Peter Portante <peter.portante@redha...

Changed in swift:
milestone: none → 1.11.0
Thierry Carrez (ttx)
Changed in swift:
status: Fix Committed → Fix Released
Revision history for this message
WangChao (chaowsh) wrote :

There is another problem happens.

[{"device": "sdb", "avail": 32162545664, "mounted": true, "used": 33980416, "size": 32196526080}, {"device": "sdd", "avail": 32162545664, "mounted": true, "used": 33980416, "size": 32196526080}, {"device": "sdc", "avail": 32162312192, "mounted": true, "used": 34213888, "size": 32196526080}, {"device": "sde", "avail": "", "mounted": "[Errno 5] Input/output error: '/srv/node/sde'", "used": "", "size": ""}]

And then when I run "swift-recon -d", the error like below:

Traceback (most recent call last):
  File "/usr/local/bin/swift-recon", line 5, in <module>
    pkg_resources.run_script('swift==1.11.0', 'swift-recon')
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 499, in run_script
    self.require(requires)[0].run_script(script_name, ns)
  File "/usr/lib/python2.7/dist-packages/pkg_resources.py", line 1239, in run_script
    execfile(script_filename, namespace, namespace)
  File "/usr/local/lib/python2.7/dist-packages/swift-1.11.0-py2.7.egg/EGG-INFO/scripts/swift-recon", line 867, in <module>
    reconnoiter.main()
  File "/usr/local/lib/python2.7/dist-packages/swift-1.11.0-py2.7.egg/EGG-INFO/scripts/swift-recon", line 855, in main
    self.disk_usage(hosts, options.top, options.human_readable)
  File "/usr/local/lib/python2.7/dist-packages/swift-1.11.0-py2.7.egg/EGG-INFO/scripts/swift-recon", line 660, in disk_usage
    * 100.0
ValueError: could not convert string to float:

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.