nova-consoleauth doesn't share tokens in HA envs

Bug #1576218 reported by Sergey Arkhipov
This bug affects 23 people
Affects: Mirantis OpenStack (status tracked in 10.0.x)

Series  Status        Importance  Assigned to
10.0.x  Invalid       Medium      Nikita Karpin
9.x     Fix Released  Medium      Nikita Karpin

Bug Description

Detailed bug description:
Cannot connect to the VNC console (or the connection is unstable) after restarting the nova-compute service on all compute nodes. Horizon shows 'Failed to connect to server (code: 1006)', and the nova-novncproxy logs contain the following tracebacks:

InvalidToken: The token '598d6469-663c-4ada-8a7d-ce3acc75cdb7' is invalid or has expired

or

socket.error(last_err) error: timed out

Interestingly, I never managed to connect from the inline console on the Instance page, but I could connect using the standalone 'vnc_auto.html' page (the "Click here to show only console" link). That connection was unstable, though.

Steps to reproduce:
1. Try to connect to the VNC console page several times.

Expected results:
1. Connection to VNC is established without any problems

Actual result:
1. Got 'Failed to connect to server (code: 1006)' error

Workaround:
- Follow the "Click here to show only console" link. The standalone page almost always connects.
- Reload the page several times.

Description of the environment:
* 10 baremetal nodes:
   - CPU: 12 x 2.10 GHz
   - Disks: 2 drives (SSD - 80 GB, HDD - 931.5 GB), 1006.0 GB total
   - Memory: 2 x 16.0 GB, 32.0 GB total
   - NUMA topology: 1 NUMA node
* Node roles:
  - 1 ElasticSearch / Kibana node
  - 1 InfluxDB / Grafana node
  - 3 controllers (1 is offline because of disk problems)
  - 5 computes
* Details:
  - OS: Mitaka on Ubuntu 14.04
  - Compute: KVM
  - Neutron with VLAN segmentation
  - Ceph RBD for volumes (Cinder)
  - Ceph RadosGW for objects (Swift API)
  - Ceph RBD for ephemeral volumes (Nova)
  - Ceph RBD for images (Glance)
* MOS 8.0, build 227

Additional information:
Logs from controller and compute running VM '19908832-c817-4e20-a80f-2853d8a2ff42' are here:
http://mos-scale-share.mirantis.com/env14/28-04-2016-novnc-problems-logs.tar.xz

I've used VM with UUID '19908832-c817-4e20-a80f-2853d8a2ff42' for all tests.

Sergey Arkhipov (sarkhipov) wrote :
Changed in mos:
assignee: MOS Nova (mos-nova) → stgleb (gstepanov)
importance: Undecided → High
status: New → Confirmed
tags: added: area-nova
stgleb (gstepanov) wrote :

This behavior occurred even without running Rally tests.
It also behaves strangely, because after a few page reloads everything is OK.

stgleb (gstepanov) wrote :

It seems that I've found a root cause: the token used to authenticate the console is absent from memcache for some reason. I'm trying to figure out why that is and how it gets there after a few page refreshes.

stgleb (gstepanov) wrote :

Our assumption is that this happens because tokens for the VNC console are stored in memcache and are not shared properly between controller nodes. Each request for a VNC console first arrives at nova-api, which asks nova-compute to provide the console and generate a token. The token is then authenticated by nova-consoleauth, and that is where everything goes wrong: for some reason nova-consoleauth can't see anything in memcache under the token key and returns an error from the check_token method.

Changed in mos:
assignee: stgleb (gstepanov) → Timofey Durakov (tdurakov)
Jay Pipes (jaypipes) wrote :

This is definitely not a High priority bug, folks. a) There is a (simple) workaround for this b) it's not a data corruption or loss bug, and c) it's a transient issue since reloading the page a few times seems to fix the problem. Recommend setting this to Low priority and putting some additional log messages into the code base to detect specifically when check_token() is returning an error due to a memcache miss.

Changed in mos:
status: Confirmed → In Progress
Timofey Durakov (tdurakov) wrote :

The root cause of the issue is that the current MOS deployment uses a Python dict to store tokens, so they are not shared between controller nodes. With requests distributed round-robin, VNC is available in ~30% of attempts. Here is a PoC Ansible playbook that mitigates this problem: http://xsnippet.org/361719/ Setting priority to Low, as Jay proposed in comment #5.
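
To illustrate why a per-process dict leads to roughly one success in three, here is a minimal sketch. It is not Nova code: the ConsoleAuth class and success_fraction helper are hypothetical stand-ins, and it assumes three consoleauth workers where the worker that stores a token and the worker that checks it are chosen independently.

```python
import itertools
import uuid


class ConsoleAuth:
    """Toy stand-in for one nova-consoleauth worker (hypothetical, not Nova code)."""

    def __init__(self, store):
        self.store = store  # mapping: token -> instance UUID

    def authorize_console(self, token, instance_uuid):
        self.store[token] = instance_uuid

    def check_token(self, token):
        return token in self.store


def success_fraction(workers):
    """Fraction of (storing worker, checking worker) pairs where the check passes."""
    ok = total = 0
    for issuer, checker in itertools.product(workers, repeat=2):
        token = str(uuid.uuid4())
        issuer.authorize_console(token, "instance-uuid")
        if checker.check_token(token):
            ok += 1
        total += 1
    return ok / total


# Each worker keeps its own dict: only the worker that stored the token sees it.
local_workers = [ConsoleAuth({}) for _ in range(3)]

# All workers share one store (standing in for a shared memcached pool).
shared_store = {}
shared_workers = [ConsoleAuth(shared_store) for _ in range(3)]

print(success_fraction(local_workers))   # 3 of 9 combinations succeed
print(success_fraction(shared_workers))  # all 9 combinations succeed
```

With per-worker dicts only the pairs where the same worker stores and checks the token succeed, which matches the ~30% figure reported above for three controllers.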

Changed in mos:
importance: High → Low
summary: - Cannot connect to VNC after nova-compute restart
+ Nova console auth doesn't share tokens in ha envs
description: updated
Roman Podoliaka (rpodolyaka) wrote :

No longer fixing Low bugs in 9.0.

summary: - Nova console auth doesn't share tokens in ha envs
+ Nova console auth doesn't share tokens in HA envs
summary: - Nova console auth doesn't share tokens in HA envs
+ nova-consoleauth doesn't share tokens in HA envs
Changed in mos:
status: In Progress → Won't Fix
Changed in mos:
assignee: Timofey Durakov (tdurakov) → MOS Puppet Team (mos-puppet)
Timofey Durakov (tdurakov) wrote :

/etc/nova/nova.conf on the controllers should be configured as follows to use the memcached backend:
[cache]
enabled = true
backend = oslo_cache.memcache_pool
memcache_servers = {list of memcached servers to connect} # e.g. 192.168.0.6:11211, 192.168.0.7:11211, 192.168.0.7:11211
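
As a rough way to check a controller for the settings above, the following sketch parses nova.conf text with Python's standard configparser and reports which of the three [cache] options differ from what is listed. The function name check_cache_section and the sample file contents are hypothetical, and treating nova.conf as plain INI is a simplifying assumption (oslo.config is the authoritative parser).

```python
import configparser


def check_cache_section(conf_text):
    """Return a list of problems found in the [cache] section of nova.conf text."""
    # default_section is renamed so that [DEFAULT] values are not inherited
    # by [cache], which is configparser behaviour rather than nova behaviour.
    parser = configparser.ConfigParser(
        interpolation=None, strict=False, default_section="__unused__"
    )
    parser.read_string(conf_text)
    if not parser.has_section("cache"):
        return ["no [cache] section"]

    cache = parser["cache"]
    problems = []
    if cache.get("enabled", "false").strip().lower() != "true":
        problems.append("enabled is not true")
    if cache.get("backend", "").strip() != "oslo_cache.memcache_pool":
        problems.append("backend is not oslo_cache.memcache_pool")
    servers = [s.strip() for s in cache.get("memcache_servers", "").split(",")]
    if not any(servers):
        problems.append("memcache_servers is empty")
    return problems


# Hypothetical file contents, for illustration only.
sample = """
[DEFAULT]
debug = false

[cache]
enabled = true
backend = oslo_cache.memcache_pool
memcache_servers = 192.168.0.6:11211,192.168.0.7:11211
"""
print(check_cache_section(sample))  # an empty list means nothing was flagged
```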

Changed in mos:
assignee: MOS Puppet Team (mos-puppet) → Alexey Deryugin (velovec)
Alexey Deryugin (velovec) wrote :

Related upstream change is on review: https://review.openstack.org/#/c/325588/

Alexey Deryugin (velovec) wrote :
Justinas Balciunas (justinas-balciunas) wrote :

I suggest backporting this to the MOS9 updates branch too, because the workaround with the full-screen console and multiple reloads is a very poor one.

Justinas Balciunas (justinas-balciunas) wrote :

I can confirm that the fix proposed here: https://bugs.launchpad.net/mos/+bug/1576218/comments/8 works properly on MOS9.

Matthew Roark (mroark) wrote :

I agree with Justinas that this should be backported to MOS9 updates. Operations has been receiving customer reports related to this bug.

Ivan Kolodyazhny (e0ne) wrote :

I vote to have this bugfix in MOS9.x. HA is our reference architecture and this feature is useful.

Changed in mos:
importance: Low → Medium
status: Won't Fix → Confirmed
Ivan Berezovskiy (iberezovskiy) wrote :
Changed in mos:
milestone: 9.0 → 9.1
status: Confirmed → In Progress
Roman Rufanov (rrufanov)
tags: added: customer-found
Nikita Karpin (mkarpin) wrote :
tags: added: on-verification
Vladimir Jigulin (vjigulin) wrote :

Verified on 9.2 snapshot #801

tags: removed: on-verification
tags: added: on-verification
Sergey Novikov (snovikov) wrote :

Verified on 9.2 snapshot #822 (RC2)

tags: removed: on-verification
Aleksei Chekunov (achekunov) wrote :

The bug still exists in 9.2:
root@node-1:~# hiera memcached_servers
["192.168.0.9:11211", "192.168.0.15:11211", "192.168.0.16:11211"]
root@node-1:~# cat /etc/nova/nova.conf | grep memcached_servers
#memcached_servers = <None>
memcached_servers = 192.168.0.9:11211

Nikita Karpin (mkarpin) wrote :

The relevant option is memcache_servers, not memcached_servers. However, I have already checked, and it is also set to only the local memcached, which is incorrect.

It looks like a regression caused by https://github.com/openstack/fuel-library/commit/a529033fdcb36ccea8cf0cc76339816ed31418c7

I will fix this

Nikita Karpin (mkarpin) wrote :

As the Nova team told me, this bug does not affect Newton, so it should be fixed in Mitaka only.

Felipe Alfaro Solana (felipe-alfaro-gmail) wrote :

Yes, exactly. It seems change https://github.com/openstack/fuel-library/commit/a529033fdcb36ccea8cf0cc76339816ed31418c7 broke how noVNC is configured. Instead of using all three memcached servers, it configures Nova to only use the local one so, in the end, I'm back to broken noVNC consoles in HA mode.
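
One way to spot this regression on a controller is to compare the hiera server list with what nova.conf actually contains. The sketch below is an assumption-laden illustration: missing_cache_servers is a hypothetical helper, it assumes the hiera output shown earlier in this thread can be read as a JSON array, and it assumes the option lives in the [cache] section as in the configuration suggested above.

```python
import configparser
import json


def missing_cache_servers(hiera_output, conf_text):
    """Return servers from the hiera list that [cache] memcache_servers omits."""
    expected = json.loads(hiera_output)
    parser = configparser.ConfigParser(
        interpolation=None, strict=False, default_section="__unused__"
    )
    parser.read_string(conf_text)
    raw = parser.get("cache", "memcache_servers", fallback="")
    configured = {s.strip() for s in raw.split(",") if s.strip()}
    return [server for server in expected if server not in configured]


# Values taken from the outputs quoted earlier in this report.
hiera = '["192.168.0.9:11211", "192.168.0.15:11211", "192.168.0.16:11211"]'
conf = "[cache]\nmemcache_servers = 192.168.0.9:11211\n"

print(missing_cache_servers(hiera, conf))
# ['192.168.0.15:11211', '192.168.0.16:11211']
```

A non-empty result means Nova is pointed at fewer memcached servers than the environment provides, which is the symptom described here.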

Denis Meltsaykin (dmeltsaykin) wrote :

The fix for stable/mitaka is committed: https://review.openstack.org/#/c/436365/

tags: added: on-verification
TatyanaGladysheva (tgladysheva) wrote :

Verified on 9.2 + mu1 updates.

Steps to verify:
1. Deploy an HA environment with 3 controllers + 1 compute node
2. Enable HTTPS/TLS for Horizon with self-signed certificates
3. Launch a test VM and try to connect to its console

Actual results:
Before the fix:
'Failed to connect to server (code: 1006)' error is observed.

After the fix:
The VNC console is available after clicking the appropriate link.

tags: removed: on-verification