Swift-storage dies if rsyslog is stopped

Bug #1683076 reported by Jill Rouleau on 2017-04-15
This bug affects 3 people
Affects                          Importance   Assigned to
OpenStack swift-storage charm    Undecided    Unassigned
Ubuntu Cloud Archive             Undecided    Unassigned
  (Declined for Juno by Corey Bryant; declined for Liberty by Corey Bryant)
  Icehouse                       Undecided    Unassigned
  Kilo                           Critical     Unassigned
  Mitaka                         Critical     Unassigned
  Newton                         Undecided    Unassigned
swift (Ubuntu)                   Undecided    Unassigned
  Trusty                         Critical     Unassigned
  Xenial                         Critical     Unassigned

Bug Description

Trusty, Mitaka, Juju 1.25.11

We have a cloud where swift replicators are constantly falling over on 2 nodes. This occurs whenever rsyslog restarts, as in

https://bugs.launchpad.net/swift/+bug/1094230
https://review.openstack.org/#/c/24871
https://bugs.python.org/issue15179

rsyslog restarts are unfortunately frequent right now, due to https://bugs.launchpad.net/juju-core/+bug/1683075

Nodes are Landscape-managed and up to date, but still exhibit the failure.

Running swift in verbose mode does not show much: https://pastebin.canonical.com/185609/

sosreports are being uploaded to https://private-fileshare.canonical.com/~jillr/sf00137831/

[Impact]

 * Stopping rsyslog causes the swift daemons to crash by overflowing the call stack. When rsyslog stops, the /dev/log socket becomes unavailable, so writing a log entry raises an exception. The swift logging code then attempts to log that error, which raises another exception, and so on until the stack overflows and the daemon crashes. While the daemons are down, object, container and account data cannot be replicated to other storage nodes, compromising the integrity of data written to the system.

 * The patch should be backported to the stable releases to ensure that the integrity of objects, accounts, and containers within Swift is not compromised by a failed logging subsystem.

 * The uploaded patches fix the bug by only writing an entry to the logging subsystem if the current call stack does not already include an attempt to write to the logs. If it does, the log message is dropped, avoiding the recursion.
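
As an illustration of the approach (a simplified sketch, not the actual Swift patch), the guard amounts to a thread-local "already logging" flag around the handler's emit path; the class and logger names below are made up for the example:

    import logging
    import logging.handlers
    import threading

    # Sketch only: a syslog handler that drops any record generated while an
    # earlier emit() on the same thread is still on the call stack, so a
    # failure inside the logging path can never recurse back into it.
    class NoRecursionSysLogHandler(logging.handlers.SysLogHandler):
        _local = threading.local()

        def emit(self, record):
            if getattr(self._local, 'emitting', False):
                return                      # drop the message instead of recursing
            self._local.emitting = True
            try:
                super(NoRecursionSysLogHandler, self).emit(record)
            finally:
                self._local.emitting = False

    # Hypothetical usage: route a logger through /dev/log via the guarded handler.
    logger = logging.getLogger('swift-sketch')
    logger.addHandler(NoRecursionSysLogHandler(address='/dev/log'))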

[Test Case]

 * Install swift storage cluster
 * Log into one of the swift storage nodes
 * Ensure the swift-{object,account,container}-replicator processes are running
 * Stop the rsyslog service
 * Wait a minute
 * Observe the swift-{object,account,container}-replicator processes are no longer running
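
The steps above can also be driven by a small script along these lines (a rough sketch; it assumes the storage node has the 'service' and 'pgrep' utilities and that it is run as root):

    import subprocess
    import time

    REPLICATORS = ['swift-object-replicator',
                   'swift-account-replicator',
                   'swift-container-replicator']

    def running(name):
        # pgrep exits 0 when at least one matching process exists
        return subprocess.call(['pgrep', '-f', name]) == 0

    assert all(running(p) for p in REPLICATORS), 'replicators are not running to begin with'
    subprocess.check_call(['service', 'rsyslog', 'stop'])
    time.sleep(60)
    still_up = [p for p in REPLICATORS if running(p)]
    print('replicators still running after rsyslog stop: %s' % (still_up or 'none'))

Without the fix, the final line reports that none of the replicators are still running.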

[Regression Potential]

 * This change affects the logging capabilities of the Swift code. Because logging is used throughout the code base, a regression could surface in almost any subsystem, ranging from lost log entries in the best case to crashing swift daemons in the worst case. The risk is mitigated by the fact that this patch has been included upstream for over a year and no regressions have been reported against it since.

[Other Info]

 * On Xenial, /dev/log is not provided by the rsyslog daemon, but the patch still applies: any persistent exception encountered when writing to /dev/log will overflow the call stack and crash the swift daemons.

Billy Olsen (billy-olsen) wrote :

This appears to be a bug in the Swift logging behavior, not in the charm itself.

There's an upstream commit at [0] which fixes the problem. In a nutshell, stopping rsyslogd on an upstart system removes the /dev/log socket, so the swift service's attempt to log a message fails; swift then tries to log that failure to syslog as well. This spirals into infinite recursion, overflowing the call stack and crashing the process.

[0] https://github.com/openstack/swift/commit/95efd3f9035ec4141e1b182516f040a59a3e5aa6
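
To make the failure concrete, here is a minimal sketch (not Swift code) of what a single log write to /dev/log looks like at the socket level; once rsyslogd has been stopped on an upstart system the socket path is gone and the write raises, and it was swift's attempt to log that error that recursed:

    import socket

    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.connect('/dev/log')    # fails (e.g. ENOENT) once rsyslogd has removed the socket
        sock.send(b'<14>swift-object-replicator: test message')
    except socket.error as exc:
        print('syslog write failed: %s' % exc)
    finally:
        sock.close()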

Changed in charm-swift-storage:
status: New → Invalid
Changed in swift (Ubuntu):
status: New → Invalid
tags: added: sts
Billy Olsen (billy-olsen) wrote :

This patch is for the xenial version of swift.

description: updated
Billy Olsen (billy-olsen) wrote :

This patch is for the mitaka version of swift in the trusty-mitaka Ubuntu Cloud Archive.

Changed in swift (Ubuntu Trusty):
importance: Undecided → Critical
Changed in swift (Ubuntu Xenial):
importance: Undecided → Critical
tags: added: sts-sru-needed
Billy Olsen (billy-olsen) wrote :

This patch is for the kilo version of swift for the trusty-kilo Ubuntu Cloud Archive.

Billy Olsen (billy-olsen) wrote :

This patch is for the icehouse version of swift in Trusty.

tags: added: ubuntu-sponsors
Andy Whitcroft (apw) wrote :

Confirmed this is already fixed in yakkety and later. Reviewed and sponsored for xenial and trusty.

Robie Basak (racb) wrote :

The patches look superfluously different between Trusty and Xenial, and there are no DEP-3 headers in the Trusty patch. Was this backported, and if so by whom? Are they both separate cherry-picks from upstream, or otherwise what has been done to make the patch for Trusty work?
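
For reference, a minimal DEP-3 header for such a patch might look like the following (field values are illustrative, not copied from the actual upload):

    Description: avoid infinite recursion when the /dev/log socket is unavailable
    Origin: backport, https://github.com/openstack/swift/commit/95efd3f9035ec4141e1b182516f040a59a3e5aa6
    Bug-Ubuntu: https://bugs.launchpad.net/bugs/1683076
    Last-Update: 2017-04-15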
