swift-object-auditor keeps forking processes if /srv is not accessible

Bug #1375348 reported by Jay Bryant
This bug affects 1 person
Affects: OpenStack Object Storage (swift)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (not shown)

Bug Description

We have discovered that the swift-object-auditor keeps forking new processes when the /srv directory has the wrong permissions set. As a result, it behaves like a fork bomb, which is not desirable.

Sep 29 11:22:30 oc0644314035 object-auditor: Begin object audit "forever" mode (ZBF)
Sep 29 11:22:30 oc0644314035 object-auditor: Begin object audit "forever" mode (ALL)
Sep 29 11:22:30 oc0644314035 object-auditor: ERROR auditing: [Errno 13] Permission denied: '/srv/node':
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/swift/obj/auditor.py", line 335, in run_forever
    self.audit_loop(parent, zbo_fps, **kwargs)
  File "/usr/lib/python2.6/site-packages/swift/obj/auditor.py", line 274, in audit_loop
    zbf_pid = self.fork_child(zero_byte_fps=True, **kwargs)
  File "/usr/lib/python2.6/site-packages/swift/obj/auditor.py", line 260, in fork_child
    self.run_audit(**kwargs)
  File "/usr/lib/python2.6/site-packages/swift/obj/auditor.py", line 249, in run_audit
    worker.audit_all_objects(mode=mode, device_dirs=device_dirs)
  File "/usr/lib/python2.6/site-packages/swift/obj/auditor.py", line 91, in audit_all_objects
    for location in all_locs:
  File "/usr/lib/python2.6/site-packages/swift/obj/diskfile.py", line 412, in object_audit_location_generator
    device_dirs = listdir(devices)
  File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 2400, in listdir
    return os.listdir(path)
OSError: [Errno 13] Permission denied: '/srv/node'
Sep 29 11:22:30 oc0644314035 object-auditor: ERROR auditing: [Errno 13] Permission denied: '/srv/node':
Traceback (most recent call last):
  File "/usr/lib/python2.6/site-packages/swift/obj/auditor.py", line 335, in run_forever
    self.audit_loop(parent, zbo_fps, **kwargs)
  File "/usr/lib/python2.6/site-packages/swift/obj/auditor.py", line 278, in audit_loop
    pids.append(self.fork_child(**kwargs))
  File "/usr/lib/python2.6/site-packages/swift/obj/auditor.py", line 260, in fork_child
    self.run_audit(**kwargs)
  File "/usr/lib/python2.6/site-packages/swift/obj/auditor.py", line 249, in run_audit
    worker.audit_all_objects(mode=mode, device_dirs=device_dirs)
  File "/usr/lib/python2.6/site-packages/swift/obj/auditor.py", line 91, in audit_all_objects
    for location in all_locs:
  File "/usr/lib/python2.6/site-packages/swift/obj/diskfile.py", line 412, in object_audit_location_generator
    device_dirs = listdir(devices)
  File "/usr/lib/python2.6/site-packages/swift/common/utils.py", line 2400, in listdir
    return os.listdir(path)
OSError: [Errno 13] Permission denied: '/srv/node'
---------------------------------------------------------------------------
-bash-4.1# ps aux | grep swift | grep auditor
swift 23462 0.7 0.7 228040 14216 ? Ss 17:06 0:00 /usr/bin/python /usr/bin/swift-account-auditor /etc/swift/account-server.conf
swift 23524 0.9 0.7 228024 14228 ? Ss 17:07 0:00 /usr/bin/python /usr/bin/swift-container-auditor /etc/swift/container-server.conf
swift 23586 1.1 0.8 237788 16260 ? Ss 17:07 0:00 /usr/bin/python /usr/bin/swift-object-auditor /etc/swift/object-server.conf
swift 23598 0.0 0.7 238144 14096 ? S 17:07 0:00 /usr/bin/python /usr/bin/swift-object-auditor /etc/swift/object-server.conf
swift 23599 0.0 0.7 238144 14096 ? S 17:07 0:00 /usr/bin/python /usr/bin/swift-object-auditor /etc/swift/object-server.conf
-bash-4.1# ps aux | grep swift | grep auditor | wc -l
5
-bash-4.1# ps aux | grep swift | grep auditor | wc -l
9
-bash-4.1# ps aux | grep swift | grep auditor | wc -l
9
-bash-4.1# ps aux | grep swift | grep auditor
swift 23462 0.4 0.7 228040 14216 ? Ss 17:06 0:00 /usr/bin/python /usr/bin/swift-account-auditor /etc/swift/account-server.conf
swift 23524 0.5 0.7 228024 14228 ? Ss 17:07 0:00 /usr/bin/python /usr/bin/swift-container-auditor /etc/swift/container-server.conf
swift 23586 0.5 0.8 237788 16260 ? Ss 17:07 0:00 /usr/bin/python /usr/bin/swift-object-auditor /etc/swift/object-server.conf
swift 23598 0.0 0.7 238144 14276 ? S 17:07 0:00 /usr/bin/python /usr/bin/swift-object-auditor /etc/swift/object-server.conf
swift 23599 0.0 0.7 238144 14244 ? S 17:07 0:00 /usr/bin/python /usr/bin/swift-object-auditor /etc/swift/object-server.conf
swift 23676 0.0 0.7 238144 14044 ? S 17:07 0:00 /usr/bin/python /usr/bin/swift-object-auditor /etc/swift/object-server.conf
swift 23677 0.0 0.7 238144 14044 ? S 17:07 0:00 /usr/bin/python /usr/bin/swift-object-auditor /etc/swift/object-server.conf
swift 23678 0.0 0.7 238144 14052 ? S 17:07 0:00 /usr/bin/python /usr/bin/swift-object-auditor /etc/swift/object-server.conf
swift 23679 0.0 0.7 238144 14052 ? S 17:07 0:00 /usr/bin/python /usr/bin/swift-object-auditor /etc/swift/object-server.conf
-bash-4.1# ps aux | grep swift | grep auditor | wc -l
9
-bash-4.1# ps aux | grep swift | grep auditor | wc -l
9
-bash-4.1# ps aux | grep swift | grep auditor | wc -l
17
-bash-4.1#

I am working on narrowing down the source of the problem.

Revision history for this message
Jay Bryant (jsbryant) wrote :

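def fork_child(self, zero_byte_fps=False, **kwargs):
    """Child execution"""
    pid = os.fork()
    if pid:
        # Parent: hand the child's PID back to the caller.
        return pid
    else:
        # Child: restore default SIGTERM handling, then run the audit.
        signal.signal(signal.SIGTERM, signal.SIG_DFL)
        if zero_byte_fps:
            kwargs['zero_byte_fps'] = self.conf_zero_byte_fps
        self.run_audit(**kwargs)
        # Not reached if run_audit() raises.
        sys.exit()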

I think the problem is here. If self.run_audit(**kwargs) raises an exception, we never call sys.exit(), so the child never exits and we eventually end up with a bunch of stray processes. I am working on a way to handle this that also lets me log the issue, but I am having a problem with that at the moment.
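To make the growth in the ps output above easier to follow, here is a rough model of it. This is only a sketch and is not taken from the Swift source: the function name is hypothetical, and it assumes that every process reaching the audit loop forks two children (the ZBF and ALL audits seen in the log), then blocks waiting for them, while each child fails, skips sys.exit() and re-enters the loop itself.

```python
def modelled_auditor_counts(rounds, children_per_pass=2):
    """Model the number of object-auditor processes after each audit pass.

    Hypothetical helper, not part of Swift. Assumes each process that
    reaches the audit loop forks `children_per_pass` children and then
    waits on them, while each child fails and re-enters the loop.
    """
    total = 1    # the original swift-object-auditor parent
    newest = 1   # processes that will fork during the next pass
    counts = []
    for _ in range(rounds):
        newest *= children_per_pass   # each of them forks its children
        total += newest               # nobody exits, so the total only grows
        counts.append(total)
    return counts


if __name__ == "__main__":
    # Add 2 for the account and container auditors that the grep also matches.
    print([n + 2 for n in modelled_auditor_counts(3)])
```

Under those assumptions the model gives 3, 7 and 15 object-auditor processes, which with the two other auditors matches the 5, 9 and 17 reported by wc -l.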

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to swift (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/125746

OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (master)

Reviewed: https://review.openstack.org/125197
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=301a96f664d58b4ccad8e3cbf5d5a889cc76790f
Submitter: Jenkins
Branch: master

commit 301a96f664d58b4ccad8e3cbf5d5a889cc76790f
Author: Jay S. Bryant <email address hidden>
Date: Tue Sep 30 15:08:59 2014 -0500

    Ensure sys.exit called in fork_child after exception

    Currently, the fork_child() function in auditor.py does not
    handle the case where run_audit() encounters an exception
    properly.

    A simple case is where the /srv directory is set
    with permissions such that the 'swift' user cannot access it.
    Such a situation causes os.listdir() to raise an OSError
    exception. When this happens the fork_child() process does not
    run to completion and sys.exit() is not executed. The process
    that was forked off continues to run as a result. Execution goes
    back up to the audit_loop function which restarts the whole process. The
    end result is an increasing number of processes on the system
    until the parent is terminated. This can quickly exhaust the
    process descriptors on a system.

    This change wraps run_audit() in a try block and adds an
    exception handler that prints what exception was encountered.
    The sys.exit() was moved to a finally: block so that it will
    always be run, avoiding the creation of zombies.

    Change-Id: I89d7cd27112445893852e62df857c3d5262c27b3
    Closes-bug: 1375348
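The shape of the change described in this commit message can be sketched roughly as follows. This is a standalone illustration, not the merged patch: the split into run_child() and fork_child(), the injected run_audit callable, the logger argument and the log message text are all assumptions made so the example runs on its own.

```python
import logging
import os
import signal
import sys


def run_child(run_audit, logger):
    """Body of the forked child: always ends by raising SystemExit."""
    try:
        run_audit()
    except Exception as err:
        # Report the failure instead of letting it propagate back into
        # the caller's audit loop. The message text is illustrative.
        logger.error("ERROR: Unable to run auditing: %s", err)
    finally:
        # Runs whether or not run_audit() raised.
        sys.exit()


def fork_child(run_audit, logger):
    """Fork; the parent gets the child's PID, the child never returns."""
    pid = os.fork()
    if pid:
        return pid
    signal.signal(signal.SIGTERM, signal.SIG_DFL)
    run_child(run_audit, logger)
```

Because sys.exit() sits in the finally: block, the child terminates even when the audit fails with something like the OSError from os.listdir(), so it can no longer fall back into the parent's loop and fork again.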

Changed in swift:
status: New → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to swift (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/125864

Thierry Carrez (ttx)
Changed in swift:
milestone: none → 2.2.0-rc1
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/126371

OpenStack Infra (hudson-openstack) wrote : Fix proposed to swift (feature/ec)

Fix proposed to branch: feature/ec
Review: https://review.openstack.org/126595

OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (feature/ec)

Reviewed: https://review.openstack.org/126595
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=06800cbe446ce4c937a57b69517b55c3bba9b6e1
Submitter: Jenkins
Branch: feature/ec

commit 7528f2b22169e90fe8ddd19b7ef7d46ecff5d231
Author: Christian Schwede <email address hidden>
Date: Mon Oct 6 10:01:03 2014 +0000

    Fix minor typo

    Fixes minor typo in one method and adds missing parameter in other
    method. Only checked swift/container/reconciler.py for now.

    Change-Id: I5c648010f09b6e4b1fb0380bc97b266e680602f8

commit 94fd95ba30c72fbcb03367aaa8da407a408948d5
Author: OpenStack Proposal Bot <email address hidden>
Date: Sat Oct 4 06:07:47 2014 +0000

    Imported Translations from Transifex

    Change-Id: I31b5e6b0f2922150902e1bfa52144302ee0c7a8e

commit d6a827792619f3343af07fc2519f4253fbdc67f7
Author: John Dickinson <email address hidden>
Date: Fri Oct 3 10:17:00 2014 -0400

    updated AUTHORS and CHANGELOG for 2.2.0

    Change-Id: I6c0bc1570f6a48439de5a029a86f1b582f30f8a6

commit 5b2c27a5874c2b5b0a333e4955b03544f6a8119f
Author: Richard (Rick) Hawkins <email address hidden>
Date: Wed Oct 1 09:37:47 2014 -0400

    Fix metadata overall limits bug

    Currently metadata limits are checked on a per request basis. If
    multiple requests are sent within the per request limits, it is
    possible to exceed the overall limits. This patch adds an overall
    metadata check to ensure that multiple requests to add metadata to
    an account/container will check overall limits before adding
    the additional metadata.

    Change-Id: Ib9401a4ee05a9cb737939541bd9b84e8dc239c70
    Closes-Bug: 1365350

commit 301a96f664d58b4ccad8e3cbf5d5a889cc76790f
Author: Jay S. Bryant <email address hidden>
Date: Tue Sep 30 15:08:59 2014 -0500

    Ensure sys.exit called in fork_child after exception

    Currently, the fork_child() function in auditor.py does not
    handle the case where run_audit() encounters an exception
    properly.

    A simple case is where the /srv directory is set
    with permissions such that the 'swift' user cannot access it.
    Such a situation causes os.listdir() to raise an OSError
    exception. When this happens the fork_child() process does not
    run to completion and sys.exit() is not executed. The process
    that was forked off continues to run as a result. Execution goes
    back up to the audit_loop function which restarts the whole process. The
    end result is an increasing number of processes on the system
    until the parent is terminated. This can quickly exhaust the
    process descriptors on a system.

    This change wraps run_audit() in a try block and adds an
    exception handler that prints what exception was encountered.
    The sys.exit() was moved to a finally: block so that it will
    always be run, avoiding the creation of zombies.

    Change-Id: I89d7cd27112445893852e62df857c3d5262c27b3
    Closes-bug: 1375348

commit 6d49cc3092168de6d22378557b2c37ea4063beeb
Author: Samuel Merritt <email address hidden>
Date: Thu Oct 2 17:14:58 2014 -0400

    Fix ring-builder crash.

    If you adjust ...

Thierry Carrez (ttx)
Changed in swift:
milestone: 2.2.0-rc1 → 2.2.0
OpenStack Infra (hudson-openstack) wrote : Fix merged to swift (stable/icehouse)

Reviewed: https://review.openstack.org/126371
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=fbe00380b0617476d84d2fde394c6ad347689aac
Submitter: Jenkins
Branch: stable/icehouse

commit fbe00380b0617476d84d2fde394c6ad347689aac
Author: Jay S. Bryant <email address hidden>
Date: Tue Sep 30 15:08:59 2014 -0500

    Ensure sys.exit called in fork_child after exception

    Currently, the fork_child() function in auditor.py does not
    handle the case where run_audit() encounters an exception
    properly.

    A simple case is where the /srv directory is set
    with permissions such that the 'swift' user cannot access it.
    Such a situation causes os.listdir() to raise an OSError
    exception. When this happens the fork_child() process does not
    run to completion and sys.exit() is not executed. The process
    that was forked off continues to run as a result. Execution goes
    back up to the audit_loop function which restarts the whole process. The
    end result is an increasing number of processes on the system
    until the parent is terminated. This can quickly exhaust the
    process descriptors on a system.

    This change wraps run_audit() in a try block and adds an
    exception handler that prints what exception was encountered.
    The sys.exit() was moved to a finally: block so that it will
    always be run, avoiding the creation of zombies.

    Change-Id: I89d7cd27112445893852e62df857c3d5262c27b3
    Closes-bug: 1375348
    (cherry picked from commit 301a96f664d58b4ccad8e3cbf5d5a889cc76790f)

tags: added: in-stable-icehouse
OpenStack Infra (hudson-openstack) wrote : Related fix merged to swift (master)

Reviewed: https://review.openstack.org/125864
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=d895cea07f4f51b7d3e683344729a08e76f85f51
Submitter: Jenkins
Branch: master

commit d895cea07f4f51b7d3e683344729a08e76f85f51
Author: Jay S. Bryant <email address hidden>
Date: Thu Oct 2 21:13:30 2014 -0500

    Handle os.listdir failures in container-updater

    While investigating bug 1375348 I discovered the problem
    reported there was not limited to the object-auditor. The
    container-updater has similar bugs.

    This patch catches the unhandled exception that can be thrown by
    os.listdir.

    Change-Id: I7eed122bf6b663e6e7894ace136b6f4653db4985
    Related-bug: 1375348

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to swift (feature/ec)

Related fix proposed to branch: feature/ec
Review: https://review.openstack.org/133750

OpenStack Infra (hudson-openstack) wrote : Related fix merged to swift (feature/ec)

Reviewed: https://review.openstack.org/133750
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=01b77740103e0dbe69b1d25ac6bd51f12310cae6
Submitter: Jenkins
Branch: feature/ec

commit fdcd20f2b6496a9e857cb47ae8907938033be9df
Author: John Dickinson <email address hidden>
Date: Fri Nov 7 10:34:51 2014 +0100

    added docs on specs workflow to CONTRIBUTING.md

    Change-Id: Id83d1da2a7a594a07fc5332b918539b3728e101b

commit ecc946b4ffb09ca0a94998ef54a7af7d4c572aff
Author: Christian Schwede <email address hidden>
Date: Thu Nov 6 15:44:29 2014 +0100

    Rename Swiftbrowser in associated projects

    Let's use the full project name to avoid confusion with the recently added
    Swiftbrowser based on AngularJS.

    Change-Id: Ib07338268a1593bc2882908b49c1fb4a130ff43d

commit dff981a03eac34a4c776cfd9a1528f1c3824f29b
Author: Martin Geisler <email address hidden>
Date: Thu Nov 6 14:52:04 2014 +0100

    Add Swift Browser as an associated project

    This is a JavaScript based browser for Swift.

    Change-Id: I2e304d4a0623c715f8712a358fef5067abc8935b

commit f9bed74d1bba6a512becd057c3139c54c176c226
Author: Clay Gerrard <email address hidden>
Date: Wed Oct 29 15:59:45 2014 -0700

    Return 403 on unauthorized upload when over account quota

    If you try an unauthorized upload into a container that is over quota you get
    a 403 instead of a 413, but if you try to unauthorized upload when an
    *account* is over quota you can see the 413 even though the upload would have
    been rejected by the authorize callback. By wrapping the authorize callback
    associated with the incoming request we can make sure to only return our 413
    when the request would have been authorized otherwise.

    Drive by doc fixes thanks to acoles:

     * State that container_quotas should be after auth middleware in
       the class doc string.
     * Add note to proxy-server.conf.sample that account_quotas should
       be after auth middleware.

    The equivalent statements are already in place for each quota
    middleware.

    Doc-Impact

    Closes-Bug: #1387415
    Change-Id: I2a88b3ec79d35bfdd73ea6ad64e376b7c7af4ea6

commit cbc52a7a4ea32f7d84f54948aa9ebb5decd26813
Author: Christian Schwede <email address hidden>
Date: Thu Oct 23 16:23:20 2014 +0200

    Return verbose message if account quota exceeded

    This message is already used in the container quota middleware, so let's use it
    in the account middleware too.

    Change-Id: I136fe6102c28cc8ccc021555c42ec7b0be716444
    Closes-Bug: 1381875

commit 83030b921dd83a84a2e966c88156e64d30fb9c24
Author: Christian Schwede <email address hidden>
Date: Wed Oct 29 10:34:53 2014 +0000

    Update admin guide on handling drive failures

    Simply replacing a failed disk requires a very long time if the ring is not
    changed, because all data will be replicated to a single new disk. This extends
    the time to recover from missing replicas, and becomes even more important with
    bigger disks.

    This patch updates the doc to include a faster alternative by setting the weight
    of a fail...

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to swift (stable/juno)

Related fix proposed to branch: stable/juno
Review: https://review.openstack.org/134082

OpenStack Infra (hudson-openstack) wrote : Related fix merged to swift (master)

Reviewed: https://review.openstack.org/125746
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=1c9bc0b522bed333b04a46ed7bd2c66a4eb89860
Submitter: Jenkins
Branch: master

commit 1c9bc0b522bed333b04a46ed7bd2c66a4eb89860
Author: Jay S. Bryant <email address hidden>
Date: Thu Oct 2 14:10:04 2014 -0500

    Handle os.listdir failures in object-updater

    While investigating bug 1375348 I discovered the problem
    reported there was not limited to the object-auditor. The
    object-updater has similar bugs.

    This patch catches the unhandled exception that can be thrown
    by os.listdir if the self.devices directory is inaccessible.

    Change-Id: I6293b840916bb63cf9eebbc05068d9a3c871bdc3
    Related-bug: 1375348
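The kind of guard this commit message describes could look roughly like the sketch below. It is not the merged change: the function name list_devices, the logger argument, the log message and the choice to return an empty list on failure are assumptions for illustration.

```python
import logging
import os


def list_devices(devices_path, logger):
    """Return the entries under devices_path, or [] if it cannot be read.

    Hypothetical helper: logs the OSError (for example Errno 13,
    permission denied) instead of letting it escape to the caller.
    """
    try:
        return os.listdir(devices_path)
    except OSError as err:
        logger.error("ERROR: Unable to access %s: %s", devices_path, err)
        return []
```

A caller that iterates over the result then simply does no work for that pass when the devices directory is inaccessible, rather than dying with an unhandled exception.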

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to swift (feature/ec)

Related fix proposed to branch: feature/ec
Review: https://review.openstack.org/138165

OpenStack Infra (hudson-openstack) wrote : Related fix merged to swift (feature/ec)

Reviewed: https://review.openstack.org/138165
Committed: https://git.openstack.org/cgit/openstack/swift/commit/?id=0d3ebf09b94b41782b2c2a6bbcf255bf1203eca0
Submitter: Jenkins
Branch: feature/ec

commit 977d7c14daa38ab9c9d79bbf8b92371024b93fc8
Author: John Dickinson <email address hidden>
Date: Wed Nov 26 14:19:08 2014 -0800

    Fix tempfile bugs from commit 6978275

    Commit 6978275 changed xprofile middleware's usage of mktemp
    and moved to using tempfile. But it was clearly never tested,
    because the os.close() calls never worked. This patch updates
    that previous patch to use a context to open and close the file.

    Change-Id: I40ee42e8539551fd8e4dfb353f50146ab40a7847

commit dec97fc3ba2c71884f1c098e7d9cd1f709f74958
Author: OpenStack Proposal Bot <email address hidden>
Date: Wed Nov 26 06:13:29 2014 +0000

    Imported Translations from Transifex

    For more information about this automatic import see:
    https://wiki.openstack.org/wiki/Translations/Infrastructure

    Change-Id: Ibf319f7cc1b5036ad8031776cf2c6018fb8a0159

commit 01f6e860066640a2ba1406a23c93a72b34ec495e
Author: Clay Gerrard <email address hidden>
Date: Fri Nov 21 17:28:13 2014 -0800

    Add Expected Failure for ssync with sys-meta

    Sysmeta included with an object PUT persists with the PUT data - if an
    internal operation such as POST-as-copy during partial failure, or ssync
    with fast-POST (not supported), causes that data to be lost then the
    associated sysmeta will also be lost.

    Since object sys-meta persistence in the face of a POST when the
    original .data is unavailable requires fast-POST with .meta files the
    probetest that validates object sys-meta persistence of a POST when the
    most up-to-date copy of the object with sys-meta is unavailable
    configures an InternalClient with object_post_as_copy = false.

    This non-default configuration option is not supported by ssync and
    results in a loss of sys-meta very similar to the object sys-meta
    failure you would see with object_post_as_copy = true when the COPY part
    of the POST is unable to retrieve the most recently written object with
    sys-meta.

    Until we can fix the default POST behavior to make metadata updates
    without stomping on newer data file timestamps we should expect object
    sys-meta to be "very very best possible but not really guaranteed
    effort".

    Until we can fix ssync to replicate metadata updates without stomping on
    newer data file timestamps we should expect this test to fail.

    When ssync replication of fast-POST metadata update is fixed this test
    will fail signaling that the expected failure cruft should be removed,
    but other parts of ssync replication will still work and some other bugs
    can be fixed while we wait.

    Change-Id: Ifc5d49514de79b78f7715408e0fe0908357771d3

commit a8751ae557616cab1cafd98a338cad352526a262
Author: Cedric Dos Santos <email address hidden>
Date: Tue Nov 25 12:37:05 2014 +0100

    Correct misspelled words

    In some files I found misspelling words.

    bin/swift-reconciler-enqueue#l26
       prima...

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to swift (stable/juno)

Related fix proposed to branch: stable/juno
Review: https://review.openstack.org/146211

OpenStack Infra (hudson-openstack) wrote : Change abandoned on swift (stable/juno)

Change abandoned by Jeremy Stanley (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/134082
Reason: I'm abandoning this change in preparation for deleting the stable/juno branch, which is now at end of life.

OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Jeremy Stanley (<email address hidden>) on branch: stable/juno
Review: https://review.openstack.org/146211
Reason: I'm abandoning this change in preparation for deleting the stable/juno branch, which is now at end of life.
