app servers not shutting down (again)

Bug #626577 reported by Steve McInerney
This bug report is a duplicate of:  Bug #669296: App servers die and hang. Edit Remove
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
Launchpad itself
Triaged
High
Unassigned

Bug Description

We appear to have a reversion on bug 307447.

Whereby app servers are told to shutdown and they fail to do so.
Have observed this on Staging and Edge so far.

The only recover is to manually issue kill, and eventually kill -9's to shut the service down.

Steve McInerney (spm)
tags: added: canonical-losa-lp
Revision history for this message
Robert Collins (lifeless) wrote :

Definitely high, possibly even critical. Does it occur after every shutdown, or only after the server is up for a bit? Can you please get a thread-dump - backtrace of all threads - next time.

Thanks.

Changed in launchpad-foundations:
status: New → Triaged
importance: Undecided → High
Revision history for this message
Steve McInerney (spm) wrote :

Not every, fortunately. eg. Consider we get edge updates every day, I've only seen 2 of these, across 4 edge services, in about a week.
As for length of time? It depends. Edge is daily, the app servers MUCH longer, but they do lock and trigger the (old, was working) stop/kill/kill -9 fandango.

Revision history for this message
Gary Poster (gary) wrote :

As Robert said, it's difficult to know where to start on this without a thread dump (https://dev.launchpad.net/Debugging/GDB).

Revision history for this message
Gary Poster (gary) wrote :
Changed in launchpad-foundations:
importance: High → Critical
Revision history for this message
Robert Collins (lifeless) wrote :

https://pastebin.canonical.com/37071/
https://pastebin.canonical.com/37072/

and also reported

11:17 < elmo> IOError: close() called during concurrent operation on the same file object.
11:17 < elmo> that looks interesting

Revision history for this message
Robert Collins (lifeless) wrote :

We have just had a degraded-service outage for 17 hours, which looks to be caused by an edge update not shutting down two servers properly, as a result of which the other two servers got saturated.

Revision history for this message
Robert Collins (lifeless) wrote :

The comment was logged to nohup.out.

Revision history for this message
Robert Collins (lifeless) wrote :

We really cannot depend on sharing *any* objects between threads other than those explicitly designed to be shared.

Revision history for this message
Robert Collins (lifeless) wrote :

error: uncaptured python exception, closing channel <zc.zservertracelog.tracelog.Channel connected 91.189.89.47:55783 at 0x2b001898eb90> (<class 'socket.error'>:[Errno 32] Broken pipe [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|handle_write|73] [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|_flush_some|187] [/usr/lib/python2.6/asyncore.py|send|352])
error: uncaptured python exception, closing channel <zc.zservertracelog.tracelog.Channel connected 91.189.89.47:46329 at 0x2b00212bc050> (<class 'socket.error'>:[Errno 32] Broken pipe [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|handle_write|73] [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|_flush_some|187] [/usr/lib/python2.6/asyncore.py|send|352])
Traceback (most recent call last):
  File "bin/run", line 26, in <module>
    canonical.launchpad.scripts.runlaunchpad.start_launchpad()
  File "/srv/edge.launchpad.net/edge/launchpad-rev-11522/lib/canonical/launchpad/scripts/runlaunchpad.py", line 281, in start_launchpad
    main(argv)
  File "/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.app.server-3.4.2-py2.6.egg/zope/app/server/main.py", line 62, in main
    run()
  File "/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.app.server-3.4.2-py2.6.egg/zope/app/server/main.py", line 83, in run
    asyncore.poll(30.0)
  File "/usr/lib/python2.6/asyncore.py", line 132, in poll
    r, w, e = select.select(r, w, e, timeout)
  File "/srv/edge.launchpad.net/edge/launchpad-rev-11522/lib/canonical/launchpad/webapp/sigusr2.py", line 20, in sigusr2_handler
    reopenFiles()
  File "/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/ZConfig-2.7.1-py2.6.egg/ZConfig/components/logger/loghandler.py", line 50, in reopenFiles
    h.reopen()
  File "/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/ZConfig-2.7.1-py2.6.egg/ZConfig/components/logger/loghandler.py", line 87, in reopen
    self.stream.close()
IOError: close() called during concurrent operation on the same file object.
Unhandled exception in thread started by
Error in sys.excepthook:

Original exception was:

Revision history for this message
Robert Collins (lifeless) wrote :
Download full text (3.4 KiB)

rror: uncaptured python exception, closing channel <zc.zservertracelog.tracelog.Channel connected 91.189.89.47:46668 at 0xf53c410> (<class 'socket.error'>:[Errno 32] Broken pipe [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|handle_write|73] [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|_flush_some|187] [/usr/lib/python2.6/asyncore.py|send|352])
error: uncaptured python exception, closing channel <zc.zservertracelog.tracelog.Channel connected 91.189.89.47:38663 at 0x2ba260edd690> (<class 'socket.error'>:[Errno 32] Broken pipe [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|handle_write|73] [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|_flush_some|187] [/usr/lib/python2.6/asyncore.py|send|352])
error: uncaptured python exception, closing channel <zc.zservertracelog.tracelog.Channel connected 91.189.89.47:36663 at 0x136a05d0> (<class 'socket.error'>:[Errno 32] Broken pipe [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|handle_write|73] [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|_flush_some|187] [/usr/lib/python2.6/asyncore.py|send|352])
error: uncaptured python exception, closing channel <zc.zservertracelog.tracelog.Channel connected 91.189.89.47:52182 at 0x1123d2d0> (<class 'socket.error'>:[Errno 32] Broken pipe [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|handle_write|73] [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|_flush_some|187] [/usr/lib/python2.6/asyncore.py|send|352])
error: uncaptured python exception, closing channel <zc.zservertracelog.tracelog.Channel connected 91.189.89.47:33861 at 0x2ba2628d6690> (<class 'socket.error'>:[Errno 32] Broken pipe [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|handle_write|73] [/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.server-3.6.1-py2.6.egg/zope/server/dualmodechannel.py|_flush_some|187] [/usr/lib/python2.6/asyncore.py|send|352])
Traceback (most recent call last):
  File "bin/run", line 26, in <module>
    canonical.launchpad.scripts.runlaunchpad.start_launchpad()
  File "/srv/edge.launchpad.net/edge/launchpad-rev-11522/lib/canonical/launchpad/scripts/runlaunchpad.py", line 281, in start_launchpad
    main(argv)
  File "/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.app.server-3.4.2-py2.6.egg/zope/app/server/main.py", line 62, in main
    run()
  File "/srv/edge.launchpad.net/edge/launchpad-rev-11522/eggs/zope.app.server-3.4.2-py2.6.egg/zope/app/server/main.py", line 83, in run
    asyncore.poll(30.0)
  File "/usr/lib/python2.6/asyncore.py", line 132, in poll
    r, w, e = select.select(r, w, e, timeout)
  File "/srv/edge.launchpad.net/edge/launchpad-rev-11522/lib/canonica...

Read more...

Revision history for this message
Gary Poster (gary) wrote :

The "critical" value has a particular meaning (https://wiki.canonical.com/Launchpad/PolicyandProcess/DefinitionofCriticalPolicy) which I do not think is appropriate. I am downgrading it to "high" but in progress.

The meaning of "high" has been diluted by a variety of causes IMO, which means that I understand the desire to call this critical. However, I don't think the answer is to call very-high priority bugs critical bugs, except possibly as a way of kicking.

Changed in launchpad-foundations:
status: Triaged → In Progress
importance: Critical → High
assignee: nobody → Gary Poster (gary)
Revision history for this message
Gary Poster (gary) wrote :

Switching attention to bug 636695

Changed in launchpad-foundations:
assignee: Gary Poster (gary) → nobody
status: In Progress → Triaged
Revision history for this message
Robert Collins (lifeless) wrote :

I'm duping this with another issue we're having - it -may- not be the same, we'll know once we've analysed and can undup then.

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.