Unable to connect to: ws://<maas IP>:/MAAS/ws

Bug #1484696 reported by Larry Michel on 2015-08-13
This bug affects 10 people
Affects            Importance  Assigned to
MAAS               High        Anthony Ochoa
  1.8              High        Blake Rouse
  1.9              High        Blake Rouse
  Trunk            High        Blake Rouse
apache2 (Ubuntu)   High        Dave Chiluk
  Trusty           High        Dave Chiluk

Bug Description

[Impact]

 * The MAAS web UI shows "Unable to connect to: ws://<maas IP>:/MAAS/ws" instead of listing the available machines.

 * This happens because the Apache proxy attempts to reuse a socket for an SSL connection to the MAAS web server listening on port 5240, which causes the MAAS server to return an error 500.

 * Other web services that use mod_proxy_wstunnel together with SSL should see similar failures.

 * The upload fixes the bug by backporting an upstream commit that resolves this. It functions by preventing re-use of the socket connection.
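
The failure mode described above can be illustrated with a toy model. This is a sketch only: `BackendConnection` and `Proxy` are hypothetical names, not Apache or MAAS code, and the model assumes that the backend stops accepting plain HTTP on a connection once that connection has been upgraded to WebSocket.

```python
from typing import List


class BackendConnection:
    """Hypothetical stand-in for one TCP connection to the backend."""

    def __init__(self) -> None:
        self.upgraded = False

    def send(self, request: bytes) -> str:
        if self.upgraded:
            # Assumption: the backend now expects WebSocket frames, so a new
            # HTTP handshake on this connection is treated as an error.
            return "500 Internal Server Error"
        if b"upgrade: websocket" in request.lower():
            self.upgraded = True
            return "101 Switching Protocols"
        return "200 OK"


class Proxy:
    """Hypothetical proxy that optionally pools backend connections."""

    def __init__(self, disable_reuse: bool) -> None:
        self.disable_reuse = disable_reuse
        self._idle: List[BackendConnection] = []

    def forward(self, request: bytes) -> str:
        if self._idle and not self.disable_reuse:
            conn = self._idle.pop()      # reuse a pooled connection
        else:
            conn = BackendConnection()   # always start from a fresh one
        reply = conn.send(request)
        if not self.disable_reuse:
            self._idle.append(conn)      # return the connection to the pool
        return reply


HANDSHAKE = b"GET /MAAS/ws HTTP/1.1\r\nUpgrade: websocket\r\nConnection: Upgrade\r\n\r\n"

reusing = Proxy(disable_reuse=False)
reusing_replies = [reusing.forward(HANDSHAKE) for _ in range(3)]

fixed = Proxy(disable_reuse=True)
fixed_replies = [fixed.forward(HANDSHAKE) for _ in range(3)]
```

In this model the pooling proxy succeeds on the first handshake and fails once the upgraded connection is handed out again, while the proxy with reuse disabled succeeds every time. The real proxy and backend are more involved; the model only shows why disabling reuse removes this class of failure.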

[Test Case]

 * Create a MAAS server and access the web UI repeatedly. Eventually the connections get reused and the error appears.

[Regression Potential]

 * There is a potential performance regression for non-ssl proxy work, but that should be minimal.

 * Patch has existed in vivid, wily, and xenial for a long while.

[Other Info]

 * Upstream commit: https://github.com/apache/httpd/commit/53038bd5b1e9f072460e6aeac2ae433c4854f2ad

--------------------Original case Description ----------------------------
I have a set of servers that were transferred over to another MAAS server, so their power credentials were no longer valid once they were back on our MAAS server. While trying to get them working, I saw that check power state hangs because the credentials are no longer valid. These checks would eventually time out and error out as expected.

However, after clicking on the Nodes view and/or trying to view nodes from a Zone, I would observe this error: "Unable to connect to: ws://<maas IP>:/MAAS/ws". The view would then start to load slowly, showing some nodes but not all of them. In one case it was stuck loading for more than 10 minutes, and it did not seem like it would load unless I reloaded from the browser.

I was initially doing this on Firefox. When I then started a session with Chromium after the problem was recreated, I got the same error followed by the nodes view being stuck while loading.

After a while, reloading the page from the browser would seem to work and everything would look OK.

I didn't see anything in the logs pointing to any issue with the cluster.


Larry Michel (lmic) on 2015-08-13
description: updated
description: updated
Larry Michel (lmic) wrote :

Here are the logs.

Larry Michel (lmic) wrote :

The scenario to recreate this may just be a coincidence and the recreates may be somewhat random. I do see the issue recreated just by clicking on Zone, then the link provided for servers without having checked power state prior to that.

But, one thing I did observe is that if I changed power settings while checking power state was going on, then my changes made during the check power state were lost after I navigated away from the window. In those cases, save changes looked like it completed successfully. If it didn't work, I should have gotten an error and I didn't see any error other than the check power state one.

Larry Michel (lmic) wrote :

I am seeing some strange behavior with the browser not updating. I think it's related to this, so I'm adding it to this bug. I started commissioning a system from the UI:

Aug 14 05:28:26 maas-trusty-back-may22 maas.node: [INFO] <nodename>: Status transition from READY to COMMISSIONING
Aug 14 05:28:26 maas-trusty-back-may22 maas.power: [INFO] Changing power state (on) of node: moline (node-1232a9c0-cfc9-11e3-a833-00163efc5068)
Aug 14 05:28:26 maas-trusty-back-may22 maas.node: [INFO] <nodename>: Commissioning started
Aug 14 05:28:28 maas-trusty-back-may22 maas.power: [INFO] Changed power state (on) of node: moline (node-1232a9c0-cfc9-11e3-a833-00163efc5068)

The logs show it is commissioning, but in the UI it is still in READY state with power on! If I try to commission it again from the UI, it fails as expected, but without the log there was no telling that commissioning had worked (attached screen capture).

I saw something similar with deleting a node where node was still there as if the delete hadn't worked, so when I tried to delete again, it would fail with message: "The delete action for 1 node failed with error: node-84864d62-e6a1-11e3-9754-00163eca07b6"

From another session, I couldn't find the node, so the delete had worked. The log showed 3 separate delete attempts and no error:

Aug 13 22:03:49 maas-trusty-back-may22 maas.node: [INFO] <deletenodename>: Deleting node
Aug 13 22:03:58 maas-trusty-back-may22 maas.node: [INFO] <deletenodename>: Deleting node
Aug 13 22:04:44 maas-trusty-back-may22 maas.node: [INFO] <deletenodename>: Deleting node

Gavin Panella (allenap) wrote :

The following from regiond.conf (repeated a lot) seems pertinent:

  Closing connection: <STATUSES=PROTOCOL_ERROR> (u'Invalid CSRF token.')
  Opening connection with IPv4Address(TCP, '10.172.68.84', 54762)

Gavin Panella (allenap) wrote :

In reverse order :)

Gavin Panella (allenap) wrote :

Verified with Andres: please connect to MAAS via port 5240 instead of the default (80). There seems to be a CSRF issue affecting all connections via Apache.

Andres Rodriguez (andreserl) wrote :

Marking this bug as Incomplete for the time being as I'm unable to reproduce anymore. Blake says this might just require a log out and logging back in through the webui!

Changed in maas:
status: New → Incomplete
Andreas Hasenack (ahasenack) wrote :

We are seeing it as well with 1.8.1:

192.168.24.45 - - [18/Aug/2015:16:28:41 +0000] "GET /MAAS/ws?csrftoken=5fERal3JfbELe3zJJEEc6uRobrilPsCS HTTP/1.1" 500 832 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0"
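
Entries like the one above can be pulled out of an access log mechanically. A minimal sketch, assuming the log uses the common/combined access-log layout seen in that line; `LINE_RE` and `failed_ws_requests` are names made up for this example.

```python
import re
from typing import Iterable, Iterator, Tuple

# Matches the leading part of a common/combined-format access-log line.
LINE_RE = re.compile(
    r'(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<target>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+)'
)


def failed_ws_requests(lines: Iterable[str]) -> Iterator[Tuple[str, str, int]]:
    """Yield (time, client, status) for /MAAS/ws requests answered with 5xx."""
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue  # not in the expected layout; skip it
        path = match.group("target").split("?", 1)[0]  # drop the query string
        status = int(match.group("status"))
        if path == "/MAAS/ws" and 500 <= status <= 599:
            yield match.group("time"), match.group("client"), status


SAMPLE = (
    '192.168.24.45 - - [18/Aug/2015:16:28:41 +0000] '
    '"GET /MAAS/ws?csrftoken=5fERal3JfbELe3zJJEEc6uRobrilPsCS HTTP/1.1" '
    '500 832 "-" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:40.0) '
    'Gecko/20100101 Firefox/40.0"'
)

hits = list(failed_ws_requests([SAMPLE]))
```

Running the generator over a whole log file (for example `failed_ws_requests(open(path))`, with `path` pointing at the access log) gives a quick view of how often the WebSocket endpoint returns a server error.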

tags: added: cisco landscape
Blake Rouse (blake-rouse) wrote :

Andreas,

That is showing a 500 error. A traceback should be in regiond.log. That seems like a different bug.

Changed in maas:
status: Incomplete → Confirmed
importance: Undecided → High
milestone: none → 1.9.0
Changed in maas:
status: Confirmed → Fix Committed
Changed in maas:
assignee: nobody → Blake Rouse (blake-rouse)
Larry Michel (lmic) wrote :

I have been testing the patch from the attached branch since last week and it seems to work so far. On the MAAS server without the patch, we're now hitting it on a regular basis.

Jason Hobbs (jason-hobbs) wrote :

I hit this on MAAS 1.8.2.

Jason Hobbs (jason-hobbs) wrote :

We're still hitting this on 1.8.2 - a fresh install of 1.8.2.

Jason Hobbs (jason-hobbs) wrote :

The error is still "Unable to connect to: ws://10.245.0.10:/MAAS/ws".

Blake Rouse (blake-rouse) wrote :

The issue is the extra ":" at the end. I will look into why it's being appended.
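
One way a dangling ":" can end up in a URL is unconditional concatenation of host and port when the port is empty. The sketch below is an illustration in Python only; the MAAS client code is not shown in this report, and `naive_ws_url` and `build_ws_url` are hypothetical helpers. Whether the colon itself contributes to the failure is not established here.

```python
def naive_ws_url(host: str, port: str, path: str = "/MAAS/ws") -> str:
    """Always appends ':' + port, even when port is empty."""
    return "ws://" + host + ":" + port + path


def build_ws_url(host: str, port: str = "", path: str = "/MAAS/ws") -> str:
    """Only appends ':' + port when a port was actually given."""
    netloc = f"{host}:{port}" if port else host
    return f"ws://{netloc}{path}"


with_colon = naive_ws_url("10.245.0.10", "")       # empty port, as on a default-port page
without_colon = build_ws_url("10.245.0.10", "")
explicit_port = build_ws_url("10.245.0.10", "5240")
```

With an empty port the naive version reproduces the URL shape from the error message, while the guarded version omits the colon.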

Changed in maas:
status: Fix Committed → Confirmed
summary: - Unable to connect to: ws://<maas IP>:/MAAS/ws after check power state
- slow responding on bad power credentials
+ Unable to connect to: ws://<maas IP>:/MAAS/ws
LaMont Jones (lamont) on 2015-09-08
Changed in maas:
status: Confirmed → Triaged
Changed in maas:
status: Triaged → In Progress
Changed in maas:
status: In Progress → Fix Committed
Jason Hobbs (jason-hobbs) wrote :

I just hit this in MAAS 1.8.3.

Larry Michel (lmic) wrote :

Also hitting this on 1.8.3 on both Firefox and Chromium, after not having seen it since upgrading from 1.8.2 last Tuesday.

nate (nate-sellers) wrote :

Hitting this on a fresh install of 1.8.3
ii maas 1.8.3+bzr4053-0ubuntu1~trusty1 all MAAS server all-in-one metapackage

nate (nate-sellers) wrote :

2015-11-17 11:16:17 [-] 127.0.0.1 - - [17/Nov/2015:18:16:16 +0000] "GET /MAAS/ HTTP/1.1" 200 1862 "http://172.16.232.10/MAAS/clusters/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11) AppleWebKit/601.1.56 (KHTML, like Gecko) Version/9.0 Safari/601.1.56"
2015-11-17 11:16:17 [HTTPChannel,5,127.0.0.1] Unhandled Error
 Traceback (most recent call last):
   File "/usr/lib/python2.7/dist-packages/twisted/python/context.py", line 81, in callWithContext
     return func(*args,**kw)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/posixbase.py", line 614, in _doReadOrWrite
     why = selectable.doRead()
   File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 215, in doRead
     return self._dataReceived(data)
   File "/usr/lib/python2.7/dist-packages/twisted/internet/tcp.py", line 221, in _dataReceived
     rval = self.protocol.dataReceived(data)
 --- <exception caught here> ---
   File "/usr/lib/python2.7/dist-packages/maasserver/websockets/websockets.py", line 419, in dataReceived
     self._parseFrames()
   File "/usr/lib/python2.7/dist-packages/maasserver/websockets/websockets.py", line 391, in _parseFrames
     for opcode, data, fin in _parseFrames(self._buffer):
   File "/usr/lib/python2.7/dist-packages/maasserver/websockets/websockets.py", line 190, in _parseFrames
     raise _WSException("Reserved flag in frame (%d)" % (header,))
 maasserver.websockets.websockets._WSException: Reserved flag in frame (71)
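
The value 71 in the exception can be unpacked by hand. A minimal sketch, assuming the RFC 6455 layout of the first frame byte (FIN bit, three reserved bits, 4-bit opcode); `decode_first_byte` is a name made up for this example and is not MAAS code.

```python
def decode_first_byte(header: int) -> dict:
    """Split the first byte of a WebSocket frame into its RFC 6455 fields."""
    return {
        "fin": bool(header & 0x80),
        "rsv1": bool(header & 0x40),
        "rsv2": bool(header & 0x20),
        "rsv3": bool(header & 0x10),
        "opcode": header & 0x0F,
        # Printable ASCII interpretation, in case the byte is really text.
        "ascii": chr(header) if 0x20 <= header < 0x7F else None,
    }


info = decode_first_byte(71)
reserved_set = info["rsv1"] or info["rsv2"] or info["rsv3"]
```

For 71 (0x47) the RSV1 bit is set, which matches the "Reserved flag" complaint. 71 is also the ASCII code for "G", which would be consistent with plain HTTP text such as a "GET" request arriving on a connection the server already treats as a WebSocket, although the traceback alone does not prove that.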

Ray Wang (raywang) wrote :

I also hit this bug. Not sure if it's relevant, but I saw errors in /var/log/apache2/error.log:

<snip>
[Fri Nov 27 09:15:26.672654 2015] [proxy_wstunnel:error] [pid 1709:tid 140536253982464] (104)Connection reset by peer: [client 10.231.112.10:55564] AH02441: error on sock - ap_pass_brigade
[Fri Nov 27 09:15:33.530217 2015] [proxy_wstunnel:error] [pid 1709:tid 140536203626240] (104)Connection reset by peer: [client 10.231.112.10:55565] AH02441: error on sock - ap_pass_brigade
</snip>
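
To see how often these errors occur, the error log can be tallied by module and error code. A rough sketch, assuming every line of interest has the same layout as the two entries above; `ERR_RE` and `summarize` are hypothetical names.

```python
import re
from collections import Counter
from typing import Iterable

# Layout assumed from the two entries above:
# [time] [module:level] [pid N:tid N] (errno)text: [client addr] AHnnnnn: message
ERR_RE = re.compile(
    r'\[(?P<time>[^\]]+)\] \[(?P<module>[^:\]]+):(?P<level>[^\]]+)\] '
    r'\[pid (?P<pid>\d+):tid (?P<tid>\d+)\] '
    r'(?:\((?P<errno>\d+)\)(?P<errtext>[^:]+): )?'
    r'\[client (?P<client>[^\]]+)\] (?P<code>AH\d+): (?P<msg>.*)'
)


def summarize(lines: Iterable[str]) -> Counter:
    """Count entries per (module, AH code, errno); errno is None if absent."""
    counts: Counter = Counter()
    for line in lines:
        match = ERR_RE.match(line.strip())
        if match is None:
            continue  # layout differs; ignore rather than guess
        errno = int(match.group("errno")) if match.group("errno") else None
        counts[(match.group("module"), match.group("code"), errno)] += 1
    return counts


SAMPLE = [
    "[Fri Nov 27 09:15:26.672654 2015] [proxy_wstunnel:error] [pid 1709:tid 140536253982464] "
    "(104)Connection reset by peer: [client 10.231.112.10:55564] AH02441: error on sock - ap_pass_brigade",
    "[Fri Nov 27 09:15:33.530217 2015] [proxy_wstunnel:error] [pid 1709:tid 140536203626240] "
    "(104)Connection reset by peer: [client 10.231.112.10:55565] AH02441: error on sock - ap_pass_brigade",
]

counts = summarize(SAMPLE)
```

Both sample entries land in the same bucket: module proxy_wstunnel, code AH02441, errno 104.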

Mike Pontillo (mpontillo) wrote :

Have you customized your Apache configuration at all? This appears to be an issue with that.

Can you try hitting http://<maas-ip-address>:5240/? This is the socket the MAAS server is actually running on; Apache on port 80 is just a proxy, and does not need to be used.

Larry Michel (lmic) wrote :

We haven't customized our Apache2 configuration at all, Mike. We hit this issue in our prod environment and were able to recreate it in a MAAS environment. http://<maas-ip-address>:5240 has been the workaround that we've been using since the bug was opened. We have not seen this issue with 1.9.

Larry Michel (lmic) wrote :

Correction/clarification: were able to recreate on maas server in a test environment.

Ray Wang (raywang) wrote :

I also saw this bug many times on MAAS 1.9 rc4. If it was fixed in MAAS 1.9.0, might this be a regression?

Changed in maas:
status: Fix Committed → Fix Released
Billy Olsen (billy-olsen) wrote :

Marked this as also affects apache2 to get the root cause fixed in apache. Upstream bug for this is https://bz.apache.org/bugzilla/show_bug.cgi?id=55890, which was fixed in 2.4.10.

From the looks of it, this only needs to be fixed in trusty at this time, since vivid already includes the updated code and the 2.2.x series on precise does not support mod_proxy_wstunnel:

wolsen@chaps:~/ubuntu/apache2$ rmadison apache2
 apache2 | 2.2.22-1ubuntu1 | precise | source, amd64, armel, armhf, i386, powerpc
 apache2 | 2.2.22-1ubuntu1.10 | precise-security | source, amd64, armel, armhf, i386, powerpc
 apache2 | 2.2.22-1ubuntu1.10 | precise-updates | source, amd64, armel, armhf, i386, powerpc
 apache2 | 2.4.7-1ubuntu4 | trusty | source, amd64, arm64, armhf, i386, powerpc, ppc64el
 apache2 | 2.4.7-1ubuntu4.5 | trusty-security | source, amd64, arm64, armhf, i386, powerpc, ppc64el
 apache2 | 2.4.7-1ubuntu4.8 | trusty-updates | source, amd64, arm64, armhf, i386, powerpc, ppc64el
 apache2 | 2.4.10-9ubuntu1 | vivid | source, amd64, arm64, armhf, i386, powerpc, ppc64el
 apache2 | 2.4.10-9ubuntu1.1 | vivid-security | source, amd64, arm64, armhf, i386, powerpc, ppc64el
 apache2 | 2.4.10-9ubuntu1.1 | vivid-updates | source, amd64, arm64, armhf, i386, powerpc, ppc64el
 apache2 | 2.4.12-2ubuntu2 | wily | source, amd64, arm64, armhf, i386, powerpc, ppc64el
 apache2 | 2.4.17-3ubuntu1 | xenial | source, amd64, arm64, armhf, i386, powerpc, ppc64el, s390x
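
The same conclusion can be reached by filtering the rmadison output programmatically. A small sketch using only the two facts stated above (the fix landed upstream in 2.4.10, and 2.2.x has no mod_proxy_wstunnel); treating every 2.4.x release as shipping the module is a simplifying assumption, and the helper names are hypothetical.

```python
from typing import List, Tuple

FIXED_IN = (2, 4, 10)        # upstream release carrying the fix, per the bug above
WSTUNNEL_SINCE = (2, 4, 0)   # assumption: treat all of 2.4.x as having the module


def upstream_version(deb_version: str) -> Tuple[int, ...]:
    """Extract the numeric upstream part of a Debian version string.

    Handles an optional epoch; assumes the upstream part is purely numeric,
    which holds for the versions listed above.
    """
    no_epoch = deb_version.split(":", 1)[-1]
    upstream = no_epoch.split("-", 1)[0]
    return tuple(int(part) for part in upstream.split("."))


def affected_suites(rmadison_output: str) -> List[str]:
    """Return suites whose apache2 has the module but predates the fix."""
    suites = []
    for line in rmadison_output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) < 3:
            continue  # prompt lines, blank lines, etc.
        _package, version, suite = fields[:3]
        if WSTUNNEL_SINCE <= upstream_version(version) < FIXED_IN:
            suites.append(suite)
    return suites


SAMPLE = """\
 apache2 | 2.2.22-1ubuntu1 | precise | source, amd64, armel, armhf, i386, powerpc
 apache2 | 2.4.7-1ubuntu4 | trusty | source, amd64, arm64, armhf, i386, powerpc, ppc64el
 apache2 | 2.4.7-1ubuntu4.8 | trusty-updates | source, amd64, arm64, armhf, i386, powerpc, ppc64el
 apache2 | 2.4.10-9ubuntu1 | vivid | source, amd64, arm64, armhf, i386, powerpc, ppc64el
 apache2 | 2.4.17-3ubuntu1 | xenial | source, amd64, arm64, armhf, i386, powerpc, ppc64el, s390x
"""

needs_fix = affected_suites(SAMPLE)
```

On this excerpt only the trusty suites are selected. The comparison looks at upstream versions only, so it cannot tell whether a distribution package already carries a backported patch.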

Dave Chiluk (chiluk) wrote :

I'll work on getting the upstream fix SRU'd.

Dave Chiluk (chiluk) wrote :

Xenial, and vivid already have the solution for this.

Here is the debdiff with the solution for trusty.

I also took the liberty of uploading a build with this debdiff here in the hopes that I might get some additional testing.
https://launchpad.net/~chiluk/+archive/ubuntu/1484696

Changed in apache2 (Ubuntu):
status: New → In Progress
assignee: nobody → Dave Chiluk (chiluk)
milestone: none → trusty-updates
Dave Chiluk (chiluk) on 2016-01-13
description: updated
Dave Chiluk (chiluk) on 2016-01-13
Changed in apache2 (Ubuntu):
importance: Undecided → High
Dave Chiluk (chiluk) wrote :

Tested that the apache fix appears to resolve the issue even after removing the workaround from wolsen.

Dave Chiluk (chiluk) wrote :

Updated dep3 header information per arges' review.

Dave Chiluk (chiluk) wrote :

@blake-rouse
After this fix is committed, the MAAS team should strongly reconsider their stance on disabling the Apache proxy on port 80.

Hello Larry, or anyone else affected,

Accepted apache2 into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/apache2/2.4.7-1ubuntu4.9 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested, and change the tag from verification-needed to verification-done. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in apache2 (Ubuntu Trusty):
status: New → Fix Committed
tags: added: verification-needed
Dave Chiluk (chiluk) wrote :

Verification complete. My test Apache is still up for maas after 2 days.

tags: added: verification-done
removed: verification-needed
Chris J Arges (arges) on 2016-01-27
Changed in apache2 (Ubuntu):
status: In Progress → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package apache2 - 2.4.7-1ubuntu4.9

---------------
apache2 (2.4.7-1ubuntu4.9) trusty; urgency=medium

  * Force disablereuse on for mod_proxy_wstunnel. Fixes "Unable to connect to:
    ws://<maas IP>:/MAAS/ws" errors with maas, and other proxy applications.
    https://bz.apache.org/bugzilla/show_bug.cgi?id=55890
    (LP: #1484696).

 -- Dave Chiluk <email address hidden> Wed, 13 Jan 2016 15:34:51 -0600

Changed in apache2 (Ubuntu Trusty):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for apache2 has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Dave Chiluk (chiluk) on 2016-01-27
Changed in apache2 (Ubuntu):
status: Fix Released → Invalid
Changed in apache2 (Ubuntu Trusty):
milestone: none → trusty-updates
Changed in apache2 (Ubuntu):
milestone: trusty-updates → none
Changed in apache2 (Ubuntu Trusty):
importance: Undecided → High
assignee: nobody → Dave Chiluk (chiluk)
no longer affects: maas/1.10
Changed in maas:
assignee: Blake Rouse (blake-rouse) → Anthony Ochoa (anthony1985)