nova-novncproxy process gets wedged, requiring kill -HUP

Bug #1715254 reported by Graham Burgess on 2017-09-05
This bug affects 2 people
Affects                                Importance  Assigned to
OpenStack nova-cloud-controller charm  Undecided   Unassigned
Ubuntu Cloud Archive                   Undecided   Unassigned
  Kilo                                 Medium      Seyeong Kim
  Mitaka                               Medium      Seyeong Kim
websockify (Ubuntu)                    Undecided   Unassigned
  Xenial                               Medium      Seyeong Kim

Bug Description

[Impact]

affected
- UCA Mitaka, Kilo
- Xenial

not affected
- UCA Icehouse
- Trusty
(the log symptoms are different: for example, there are no "Reaing zombies" messages; "Reaing" is a typo in the log message itself)

When there are many connections, or clients reconnect to the console frequently, the nova-novncproxy daemon gets stuck because websockify hangs.

[Test case]

1. Deploy OpenStack.
2. Create instances.
3. Open the console in a browser with an auto-refresh extension (set to 5 seconds).
4. After several hours, connections are rejected.
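
For anyone without a browser extension at hand, the reconnect loop in the steps above can be roughly approximated at the TCP level. This is only a sketch in Python 3: it opens and closes plain TCP connections and does not perform the WebSocket handshake or present a console token, so it is not equivalent to the browser test. The function name hammer_port and its parameters are made up for illustration.

```python
import socket
import time


def hammer_port(host, port, attempts, delay=5.0, timeout=2.0):
    """Open and close `attempts` TCP connections to host:port.

    Returns a tuple (ok, failed). A growing `failed` count over time
    would match the "connection rejected" symptom in step 4.
    """
    ok = 0
    failed = 0
    for i in range(attempts):
        try:
            conn = socket.create_connection((host, port), timeout=timeout)
            conn.close()
            ok += 1
        except OSError:
            # Covers refused connections, resets and timeouts.
            failed += 1
        # Wait between attempts, but not after the last one.
        if delay and i < attempts - 1:
            time.sleep(delay)
    return ok, failed


# Example (assumes the proxy listens on the default port 6080):
# print(hammer_port("127.0.0.1", 6080, attempts=10))
```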

[Regression Potential]

Components that use websockify, especially nova-novncproxy, will be affected by this patch. However, after upgrading and running the refresh test described above for 2 days without restarting any services, no hang occurred. I ran this test in a simple local environment, so behavior under different circumstances still needs to be considered.

[Others]

related commits

- https://github.com/novnc/websockify/pull/226
- https://github.com/novnc/websockify/pull/219

[Original Description]

Users reported they were unable to connect to instance consoles via either Horizon or direct URL. Upon investigation we found errors suggesting the address and port were in use:

2017-08-23 14:51:56.248 1355081 INFO nova.console.websocketproxy [-] WebSocket server settings:
2017-08-23 14:51:56.248 1355081 INFO nova.console.websocketproxy [-] - Listen on 0.0.0.0:6080
2017-08-23 14:51:56.248 1355081 INFO nova.console.websocketproxy [-] - Flash security policy server
2017-08-23 14:51:56.248 1355081 INFO nova.console.websocketproxy [-] - Web server (no directory listings). Web root: /usr/share/novnc
2017-08-23 14:51:56.248 1355081 INFO nova.console.websocketproxy [-] - No SSL/TLS support (no cert file)
2017-08-23 14:51:56.249 1355081 CRITICAL nova [-] error: [Errno 98] Address already in use
2017-08-23 14:51:56.249 1355081 ERROR nova Traceback (most recent call last):
2017-08-23 14:51:56.249 1355081 ERROR nova File "/usr/bin/nova-novncproxy", line 10, in <module>
2017-08-23 14:51:56.249 1355081 ERROR nova sys.exit(main())
2017-08-23 14:51:56.249 1355081 ERROR nova File "/usr/lib/python2.7/dist-packages/nova/cmd/novncproxy.py", line 41, in main
2017-08-23 14:51:56.249 1355081 ERROR nova port=CONF.vnc.novncproxy_port)
2017-08-23 14:51:56.249 1355081 ERROR nova File "/usr/lib/python2.7/dist-packages/nova/cmd/baseproxy.py", line 73, in proxy
2017-08-23 14:51:56.249 1355081 ERROR nova RequestHandlerClass=websocketproxy.NovaProxyRequestHandler
2017-08-23 14:51:56.249 1355081 ERROR nova File "/usr/lib/python2.7/dist-packages/websockify/websocket.py", line 909, in start_server
2017-08-23 14:51:56.249 1355081 ERROR nova tcp_keepintvl=self.tcp_keepintvl)
2017-08-23 14:51:56.249 1355081 ERROR nova File "/usr/lib/python2.7/dist-packages/websockify/websocket.py", line 698, in socket
2017-08-23 14:51:56.249 1355081 ERROR nova sock.bind(addrs[0][4])
2017-08-23 14:51:56.249 1355081 ERROR nova File "/usr/lib/python2.7/socket.py", line 224, in meth
2017-08-23 14:51:56.249 1355081 ERROR nova return getattr(self._sock,name)(*args)
2017-08-23 14:51:56.249 1355081 ERROR nova error: [Errno 98] Address already in use
2017-08-23 14:51:56.249 1355081 ERROR nova

This led us to the discovery of a stuck nova-novncproxy process after stopping the service. Once we sent a kill -HUP to that process, we were able to start nova-novncproxy and restore service to the users.
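
The "[Errno 98] Address already in use" condition from the traceback can be checked before restarting the service. The following is a minimal Python 3 sketch, not part of nova or websockify; port_in_use is a hypothetical helper name. It only reports whether something still holds the port, not which process holds it.

```python
import errno
import socket


def port_in_use(host, port):
    """Return True if binding host:port fails with EADDRINUSE."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # SO_REUSEADDR so that lingering TIME_WAIT sockets are not
    # mistaken for a live listener.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        sock.bind((host, port))
    except OSError as exc:
        if exc.errno == errno.EADDRINUSE:
            return True
        raise
    finally:
        sock.close()
    return False


# Example (6080 is the listen port shown in the log above):
# if port_in_use("0.0.0.0", 6080):
#     print("a stale process is probably still bound to 6080")
```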

This was not the first time we have had to restart nova-novncproxy services after users reported that they were unable to connect with VNC. This time, as well as at least 2 other times, we have seen the following errors in the nova-novncproxy.log during the time frame of the issue:

gaierror: [Errno -8] Servname not supported for ai_socktype

which seems to correspond to log entries for connection strings with an invalid port ('port': u'-1'). As well as a bunch of:

error: [Errno 104] Connection reset by peer
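
As an illustration of the invalid port value ('port': u'-1') mentioned above, a caller could reject such values before handing them to the socket layer. This is a hypothetical Python 3 helper, not code from nova or websockify, and it is not claimed to be related to the fix for the hang itself.

```python
def parse_port(value):
    """Convert `value` to a TCP port number or raise ValueError.

    Accepts ints and numeric strings; rejects anything outside 1..65535,
    including the '-1' seen in the connection info above.
    """
    try:
        port = int(value)
    except (TypeError, ValueError):
        raise ValueError("port is not an integer: %r" % (value,))
    if not 1 <= port <= 65535:
        raise ValueError("port out of range: %d" % port)
    return port
```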

Graham Burgess (stormmore) wrote :

Additional information

List of nova packages installed on nova-cloud-controller:

$ dpkg -l | grep nova
ii nova-api-os-compute 2:13.1.4-0ubuntu2~cloud0 all OpenStack Compute - OpenStack Compute API frontend
ii nova-cert 2:13.1.4-0ubuntu2~cloud0 all OpenStack Compute - certificate management
ii nova-common 2:13.1.4-0ubuntu2~cloud0 all OpenStack Compute - common files
ii nova-conductor 2:13.1.4-0ubuntu2~cloud0 all OpenStack Compute - conductor service
ii nova-consoleauth 2:13.1.4-0ubuntu2~cloud0 all OpenStack Compute - Console Authenticator
ii nova-novncproxy 2:13.1.4-0ubuntu2~cloud0 all OpenStack Compute - NoVNC proxy
ii nova-scheduler 2:13.1.4-0ubuntu2~cloud0 all OpenStack Compute - virtual machine scheduler
ii python-nova 2:13.1.4-0ubuntu2~cloud0 all OpenStack Compute Python libraries
ii python-novaclient 2:3.3.1-2ubuntu1~cloud0 all client library for OpenStack Compute API - Python 2.7

Keystone is configured for multiple domains (there are 2, in case that is pertinent), and its endpoints are not SSL:

$ openstack endpoint list --format csv -c "Service Name" -c "Service Type" -c "Interface" -c URL | grep keystone
"keystone","identity","internal","http://<ip>:5000/v3"
"keystone","identity","admin","http://<ip>:35357/v3"
"keystone","identity","public","http://<ip>:5000/v3"

affects: nova (Ubuntu) → charm-nova-cloud-controller
Jill Rouleau (jillrouleau) wrote :

This issue continues to reoccur on this cloud. From nova-novncproxy.log: https://pastebin.ubuntu.com/25667986/
It's necessary to kill -HUP all nova-novncproxy pids before init'ing the service again.

Trusty/Mitaka/17.02 charms.

James Page (james-page) on 2017-10-05
Changed in charm-nova-cloud-controller:
status: New → Invalid
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nova (Ubuntu):
status: New → Confirmed
Changed in websockify (Ubuntu):
status: New → Confirmed
James Page (james-page) on 2017-10-20
Changed in nova (Ubuntu):
importance: Undecided → Medium
Changed in websockify (Ubuntu):
importance: Undecided → Medium
Seyeong Kim (xtrusia) on 2017-10-24
Changed in websockify (Ubuntu):
assignee: nobody → Seyeong Kim (xtrusia)
Seyeong Kim (xtrusia) on 2017-10-24
no longer affects: nova (Ubuntu)
Seyeong Kim (xtrusia) wrote :
description: updated
Seyeong Kim (xtrusia) wrote :

The attachment "lp1715254-mitaka.debdiff" seems to be a debdiff. The ubuntu-sponsors team has been subscribed to the bug report so that they can review and hopefully sponsor the debdiff. If the attachment isn't a patch, please remove the "patch" flag from the attachment, remove the "patch" tag, and if you are member of the ~ubuntu-sponsors, unsubscribe the team.

[This is an automated message performed by a Launchpad user owned by ~brian-murray, for any issue please contact him.]

tags: added: patch
Corey Bryant (corey.bryant) wrote :

Thanks for the patches, Seyeong. Assuming those fix the problem, this only affects websockify < 0.8.0, i.e. releases prior to Yakkety/Newton.

Changed in cloud-archive:
status: New → Invalid
Changed in websockify (Ubuntu Trusty):
status: New → Triaged
Changed in websockify (Ubuntu Xenial):
status: New → Triaged
Changed in websockify (Ubuntu Trusty):
importance: Undecided → Medium
Changed in websockify (Ubuntu Xenial):
importance: Undecided → Medium
Changed in websockify (Ubuntu Trusty):
assignee: nobody → Seyeong Kim (xtrusia)
Changed in websockify (Ubuntu Xenial):
assignee: nobody → Seyeong Kim (xtrusia)
Changed in websockify (Ubuntu):
status: Confirmed → Invalid
assignee: Seyeong Kim (xtrusia) → nobody
importance: Medium → Undecided
Corey Bryant (corey.bryant) wrote :

I've uploaded Seyeong's xenial patch to the xenial review queue and it is awaiting SRU team review.
https://launchpad.net/ubuntu/xenial/+queue?queue_state=1&queue_text=

If you'd like to provide patches for trusty-kilo and trusty-icehouse I'd be happy to sponsor those as well.

Seyeong Kim (xtrusia) wrote :

Hello Corey,

I've uploaded a patch for Kilo.

I'm going to upload patches for Icehouse and Trusty

after testing them.

I'm testing them, but the logs are a little different.

I will keep posting.

Thanks

Seyeong Kim (xtrusia) on 2017-10-30
description: updated
Seyeong Kim (xtrusia) wrote :

Hello Corey,
I've tested Trusty and UCA Icehouse.
However, I couldn't reproduce this issue.
The log messages are different from those on Kilo, Mitaka, and Xenial.

There is no 'Reaing zombies, active child count is' message.
There are a lot of them on Kilo, Mitaka, and Xenial.

I saw jame's latest commit, which is a patch for multiprocessing,
but it does not seem to work on Trusty or UCA Icehouse (not 100% sure).

Hello Graham, or anyone else affected,

Accepted websockify into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/websockify/0.6.1+dfsg1-1ubuntu1 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in websockify (Ubuntu Xenial):
status: Triaged → Fix Committed
tags: added: verification-needed verification-needed-xenial
Seyeong Kim (xtrusia) wrote :

Hello,

I tested this proposed package and confirmed it is working fine.

For testing, I followed the steps in the [Test case] section.

1. juju deploy xenial.bundle
2. create network & subnet
3. juju config nova-cloud-controller console-access-protocol=novnc
4. create instance
5. refresh every 5 seconds in 2 browsers with the console URL for several hours

Thanks

ii websockify 0.6.1+dfsg1-1ubuntu1 amd64 WebSockets support for any application/server

tags: added: verification-done-xenial
removed: verification-needed-xenial

The verification of the Stable Release Update for websockify has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package websockify - 0.6.1+dfsg1-1ubuntu1

---------------
websockify (0.6.1+dfsg1-1ubuntu1) xenial; urgency=medium

  * Fix hanging nova-novncproxy and can't be restarted (LP: #1715254)
    - [PATCH] Make websockify respect SIGTERM
    - [PATCH] Remove additional signal calls in websockify that
      causes novnc to hang.

 -- Seyeong Kim <email address hidden> Mon, 23 Oct 2017 18:31:40 +0900
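
The first changelog item, "Make websockify respect SIGTERM", refers to the upstream pull requests linked in the bug description; the actual patch is not reproduced here. As a generic illustration of the idea only, a Python 3 server loop can honor SIGTERM by setting a flag in the signal handler and checking it on every iteration. GracefulServer and its methods are hypothetical names, not websockify's API.

```python
import os
import signal


class GracefulServer:
    """Toy server loop that exits cleanly when SIGTERM arrives."""

    def __init__(self):
        self.terminate = False

    def _on_sigterm(self, signum, frame):
        # Only record the request; the main loop does the actual exit.
        self.terminate = True

    def install(self):
        # Must be called from the main thread (a Python restriction).
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def serve(self, poll):
        """Call poll() repeatedly until SIGTERM has been received.

        Returns the number of completed iterations.
        """
        iterations = 0
        while not self.terminate:
            poll()
            iterations += 1
        return iterations
```

In a real proxy, poll() would be the accept/select step with a timeout, so that the flag is rechecked regularly even when no client connects.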

Changed in websockify (Ubuntu Xenial):
status: Fix Committed → Fix Released

Hello Graham, or anyone else affected,

Accepted websockify into mitaka-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:mitaka-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-mitaka-needed to verification-mitaka-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-mitaka-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-mitaka-needed
Seyeong Kim (xtrusia) wrote :

It does not seem to be in -proposed yet.

I'll test this when I can upgrade websockify.

Seyeong Kim (xtrusia) wrote :

Hello Corey,

I checked trusty-proposed/mitaka/main/binary-amd64/Packages,

but the websockify version there is

0.6.1+dfsg1-1~cloud1

which I think is the current version.

Could you check this?

Thanks

Corey Bryant (corey.bryant) wrote :

Hello Seyeong,

This is all set now. We had an issue with the cloud archive sync. Can you try again?

Thanks,
Corey

Seyeong Kim (xtrusia) wrote :

upgraded to 0.6.1+dfsg1-1ubuntu1~cloud0

tested same steps as above.

it works fine.

Thanks.

tags: added: verification-mitaka-done
removed: verification-mitaka-needed

The verification of the Stable Release Update for websockify has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

James Page (james-page) wrote :

This bug was fixed in the package websockify - 0.6.1+dfsg1-1ubuntu1~cloud0
---------------

 websockify (0.6.1+dfsg1-1ubuntu1~cloud0) trusty-mitaka; urgency=medium
 .
   * New update for the Ubuntu Cloud Archive.
 .
 websockify (0.6.1+dfsg1-1ubuntu1) xenial; urgency=medium
 .
   * Fix hanging nova-novncproxy and can't be restarted (LP: #1715254)
     - [PATCH] Make websockify respect SIGTERM
     - [PATCH] Remove additional signal calls in websockify that
       causes novnc to hang.

Seyeong Kim (xtrusia) on 2017-12-05
Changed in websockify (Ubuntu Trusty):
assignee: Seyeong Kim (xtrusia) → nobody
Seyeong Kim (xtrusia) on 2017-12-11
no longer affects: websockify (Ubuntu Trusty)
no longer affects: cloud-archive/icehouse

Hello Graham, or anyone else affected,

Accepted websockify into kilo-proposed. The package will build now and be available in the Ubuntu Cloud Archive in a few hours, and then in the -proposed repository.

Please help us by testing this new package. To enable the -proposed repository:

  sudo add-apt-repository cloud-archive:kilo-proposed
  sudo apt-get update

Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-kilo-needed to verification-kilo-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-kilo-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

tags: added: verification-kilo-needed
Seyeong Kim (xtrusia) wrote :

ii websockify 0.6.0+dfsg1-1~cloud2 amd64 WebSockets support for any application/server

reconnection test for several hours.

Thanks.

tags: added: verification-kilo-done
removed: verification-kilo-needed

The verification of the Stable Release Update for websockify has completed successfully and the package has now been released to -updates. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Corey Bryant (corey.bryant) wrote :

This bug was fixed in the package websockify - 0.6.0+dfsg1-1~cloud2
---------------

 websockify (0.6.0+dfsg1-1~cloud2) trusty-kilo; urgency=medium
 .
   * Fix hanging nova-novncproxy and can't be restarted (LP: #1715254)
     - [PATCH] Make websockify respect SIGTERM
     - [PATCH] Remove additional signal calls in websockify that
       causes novnc to hang.

tags: added: sts sts-sru-done verification-done
removed: verification-needed