Mirantis OpenStack

nova-consoleauth doesn't share tokens in HA envs

Bug #1576218 reported by Sergey Arkhipov on 2016-04-28

This bug affects 23 people

Affects             Status                    Importance  Assigned to    Milestone
Mirantis OpenStack  Status tracked in 10.0.x
10.0.x              Invalid                   Medium      Nikita Karpin  Mirantis OpenStack 10.0
9.x                 Fix Released              Medium      Nikita Karpin  Mirantis OpenStack 9.2-mu-1

Bug Description

Detailed bug description:
Cannot connect to the VNC console (or the connection is unstable) after restarting the nova-compute service on all compute nodes. Horizon shows the message 'Failed to connect to server (code: 1006)', and the nova-novncproxy logs contain the following tracebacks:

InvalidToken: The token '598d6469-663c-4ada-8a7d-ce3acc75cdb7' is invalid or has expired

socket.error(last_err) error: timed out

Interestingly, I never managed to connect from the inline console on the Instance page, but I could connect using the standalone 'vnc_auto.html' page (the "Click here to show only console" link). That connection was unstable, though.

Steps to reproduce:
1. Try to connect to vnc console page several times.

Expected results:
1. Connection to VNC is established without any problems

Actual result:
1. Got 'Failed to connect to server (code: 1006)' error

Workaround:
- Follow the "Click here to show only console" link. The standalone page almost always connects.
- Reload the page several times.

Description of the environment:
* 10 baremetal nodes:
   - CPU: 12 x 2.10 GHz
   - Disks: 2 drives (SSD - 80 GB, HDD - 931.5 GB), 1006.0 GB total
   - Memory: 2 x 16.0 GB, 32.0 GB total
   - NUMA topology: 1 NUMA node
* Node roles:
  - 1 ElasticSearch / Kibana node
  - 1 InfluxDB / Grafana node
  - 3 controllers (1 was offline because of disk problems)
  - 5 computes
* Details:
  - OS: Mitaka on Ubuntu 14.04
  - Compute: KVM
  - Neutron with VLAN segmentation
  - Ceph RBD for volumes (Cinder)
  - Ceph RadosGW for objects (Swift API)
  - Ceph RBD for ephemeral volumes (Nova)
  - Ceph RBD for images (Glance)
* MOS 8.0, build 227

Additional information:
Logs from controller and compute running VM '19908832-c817-4e20-a80f-2853d8a2ff42' are here:
http://mos-scale-share.mirantis.com/env14/28-04-2016-novnc-problems-logs.tar.xz

I've used VM with UUID '19908832-c817-4e20-a80f-2853d8a2ff42' for all tests.

Sergey Arkhipov (sarkhipov) wrote on 2016-04-29:

Diagnostic snapshot: http://mos-scale-share.mirantis.com/env14/fuel-snapshot-2016-04-28_13-17-56.tar.xz

Roman Podoliaka (rpodolyaka) on 2016-05-04

Changed in mos:
assignee:	MOS Nova (mos-nova) → stgleb (gstepanov)
importance:	Undecided → High
status:	New → Confirmed
tags:	added: area-nova

stgleb (gstepanov) wrote on 2016-05-12:

This behavior occurred even without running Rally tests.
It is also strange that everything works after a few page reloads.

stgleb (gstepanov) wrote on 2016-05-12:

It seems that I've found the root cause: the token used to authenticate the console is absent from memcache for some reason. I'm trying to figure out why, and how it gets there after a few page refreshes.

stgleb (gstepanov) wrote on 2016-05-13:

Our assumption is that this happens because tokens for the VNC console are stored in memcache and are not shared properly between controller nodes. Each request for a VNC console first arrives at nova-api, which asks nova-compute to provide a console and generate a token. The token is then authenticated by nova-consoleauth, and that is where things go wrong: for some reason nova-consoleauth can't see anything in memcache for the token key and returns an error from the check_token method.

Roman Podoliaka (rpodolyaka) on 2016-05-19

Changed in mos:
assignee:	stgleb (gstepanov) → Timofey Durakov (tdurakov)

Jay Pipes (jaypipes) wrote on 2016-05-20:

This is definitely not a High priority bug, folks. a) There is a (simple) workaround for this b) it's not a data corruption or loss bug, and c) it's a transient issue since reloading the page a few times seems to fix the problem. Recommend setting this to Low priority and putting some additional log messages into the code base to detect specifically when check_token() is returning an error due to a memcache miss.
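
The extra logging suggested above could look roughly like the following. This is a minimal sketch, not nova's actual code: `TokenCache`, the logger name, and this standalone `check_token` are hypothetical stand-ins used only to show where a cache-miss message would go.

```python
import logging

LOG = logging.getLogger("consoleauth_sketch")  # hypothetical logger name


class TokenCache:
    """Hypothetical stand-in for the token backend (memcached or a local dict)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


def check_token(cache, token):
    """Return the console connection info for token, or None if it is unknown."""
    connect_info = cache.get(token)
    if connect_info is None:
        # Cache miss: the token expired or was stored in a different backend.
        LOG.warning("check_token: cache miss for token %s", token)
        return None
    return connect_info
```

With a message like this in the logs, a miss caused by an unshared backend becomes visible without having to reproduce the failure in the browser.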

Roman Podoliaka (rpodolyaka) on 2016-05-20

Changed in mos:
status:	Confirmed → In Progress

Timofey Durakov (tdurakov) wrote on 2016-05-23:

The root cause is that the current MOS deployment uses a Python dict to store tokens, so they are not shared between controller nodes. Because requests are balanced round-robin across the controllers, VNC is available in only ~30% of attempts. Here is a PoC Ansible playbook that mitigates the problem: http://xsnippet.org/361719/ Setting priority to Low, as Jay proposed in comment #5.
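
The effect of per-node token storage can be illustrated with a small simulation. This is a sketch under stated assumptions: three controllers, a hypothetical `ConsoleAuth` class, and plain dicts standing in for both the local store and a shared memcached pool. It is not nova's implementation.

```python
from itertools import product


class ConsoleAuth:
    """Hypothetical consoleauth worker that keeps tokens in the given store."""

    def __init__(self, store):
        self.store = store

    def authorize(self, token, info):
        self.store[token] = info

    def check(self, token):
        return self.store.get(token)


def success_rate(workers):
    """Fraction of (authorizing node, checking node) pairs where the check succeeds."""
    pairs = list(product(range(len(workers)), repeat=2))
    hits = 0
    for n, (i, j) in enumerate(pairs):
        token = f"token-{n}"
        workers[i].authorize(token, {"port": 5900})
        if workers[j].check(token) is not None:
            hits += 1
    return hits / len(pairs)


# One private dict per controller: tokens are invisible to the other nodes.
local = [ConsoleAuth({}) for _ in range(3)]
# One shared store (standing in for a memcached pool) used by all controllers.
shared_store = {}
shared = [ConsoleAuth(shared_store) for _ in range(3)]

local_rate = success_rate(local)
shared_rate = success_rate(shared)
print(f"local dicts: {local_rate:.0%}, shared store: {shared_rate:.0%}")
```

With three nodes and private dicts, a check succeeds only when it lands on the node that issued the token, which is one case in three and roughly matches the ~30% figure reported above.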

Changed in mos:
importance:	High → Low
summary:	- Cannot connect to VNC after nova-compute restart
	+ Nova console auth doesn't share tokens in ha envs
description:	updated

Roman Podoliaka (rpodolyaka) wrote on 2016-05-23:

No longer fixing Low bugs in 9.0.

summary:	- Nova console auth doesn't share tokens in ha envs
	+ Nova console auth doesn't share tokens in HA envs
summary:	- Nova console auth doesn't share tokens in HA envs
	+ nova-consoleauth doesn't share tokens in HA envs
Changed in mos:
status:	In Progress → Won't Fix

Timofey Durakov (tdurakov) on 2016-05-23

Changed in mos:
assignee:	Timofey Durakov (tdurakov) → MOS Puppet Team (mos-puppet)

Timofey Durakov (tdurakov) wrote on 2016-05-23:

On the controllers, /etc/nova/nova.conf should be configured to use the memcached backend properly:
[cache]
enabled = true
backend = oslo_cache.memcache_pool
# e.g. 192.168.0.6:11211, 192.168.0.7:11211, 192.168.0.7:11211
memcache_servers = {list of memcached servers to connect}
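
A quick way to check a controller's configuration against the settings above is sketched below. The function name `check_cache_section` is hypothetical, the file is parsed with Python's standard `configparser` rather than oslo.config, and the rule "at least two distinct servers" is an assumption chosen for an HA setup, not something nova enforces.

```python
import configparser


def check_cache_section(conf_text):
    """Return a list of problems found in the [cache] section of conf_text."""
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    if not parser.has_section("cache"):
        return ["missing [cache] section"]
    cache = parser["cache"]
    problems = []
    if cache.get("enabled", "false").strip().lower() != "true":
        problems.append("enabled is not true")
    if cache.get("backend", "").strip() != "oslo_cache.memcache_pool":
        problems.append("backend is not oslo_cache.memcache_pool")
    raw_servers = cache.get("memcache_servers", "")
    servers = {s.strip() for s in raw_servers.split(",") if s.strip()}
    if len(servers) < 2:
        # Assumption: an HA deployment should list more than one memcached server.
        problems.append("memcache_servers lists fewer than two distinct servers")
    return problems


sample = """
[cache]
enabled = true
backend = oslo_cache.memcache_pool
memcache_servers = 192.168.0.6:11211,192.168.0.7:11211
"""
print(check_cache_section(sample))
```

On a real controller the text would come from reading /etc/nova/nova.conf. An empty list means the three settings look as described above.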

Ivan Berezovskiy (iberezovskiy) on 2016-06-03

Changed in mos:
assignee:	MOS Puppet Team (mos-puppet) → Alexey Deryugin (velovec)

Alexey Deryugin (velovec) wrote on 2016-06-04:

Related upstream change is on review: https://review.openstack.org/#/c/325588/

Alexey Deryugin (velovec) wrote on 2016-06-23:

#10

Fix is merged: https://review.openstack.org/#/c/325590/

Justinas Balciunas (justinas-balciunas) wrote on 2016-07-06:

#11

I suggest backporting this to the MOS9 updates branch too, because the workaround with the full-screen console and multiple reloads is a very poor one.

Justinas Balciunas (justinas-balciunas) wrote on 2016-07-06:

#12

I can confirm that the fix proposed here: https://bugs.launchpad.net/mos/+bug/1576218/comments/8 works properly on MOS9.

Matthew Roark (mroark) wrote on 2016-07-25:

#13

I agree with Justinas that this should be backported to MOS9 updates. Operations has been receiving customer reports related to this bug.

Ivan Kolodyazhny (e0ne) wrote on 2016-08-19:

#14

I vote for having this bugfix in MOS9.x. HA is our reference architecture, and this feature is useful.

Changed in mos:
importance:	Low → Medium
status:	Won't Fix → Confirmed

Ivan Berezovskiy (iberezovskiy) wrote on 2016-08-19:

#15

Patch for 9.1 - https://review.openstack.org/#/c/357672/

Changed in mos:
milestone:	9.0 → 9.1
status:	Confirmed → In Progress

Roman Rufanov (rrufanov) on 2016-10-18

tags:

added: customer-found

Nikita Karpin (mkarpin) wrote on 2016-12-29:

#16

fix - https://review.openstack.org/#/c/415013/

Vladimir Jigulin (vjigulin) on 2017-01-25

tags:

added: on-verification

Vladimir Jigulin (vjigulin) wrote on 2017-01-31:

#17

Verified on 9.2 snapshot #801

tags:

removed: on-verification

Sergey Novikov (snovikov) on 2017-01-31

tags:

added: on-verification

Sergey Novikov (snovikov) wrote on 2017-02-02:

#18

Verified on 9.2 snapshot #822 (RC2)

tags:

removed: on-verification

Aleksei Chekunov (achekunov) wrote on 2017-02-14:

#19

The bug still exists in 9.2:
root@node-1:~# hiera memcached_servers
["192.168.0.9:11211", "192.168.0.15:11211", "192.168.0.16:11211"]
root@node-1:~# cat /etc/nova/nova.conf | grep memcached_servers
#memcached_servers = <None>
memcached_servers = 192.168.0.9:11211

Nikita Karpin (mkarpin) wrote on 2017-02-14:

#20

The option you need is memcache_servers, not memcached_servers. However, I have already checked, and it is also set to the local memcached only, which is incorrect.

This looks like a regression caused by https://github.com/openstack/fuel-library/commit/a529033fdcb36ccea8cf0cc76339816ed31418c7

I will fix this.
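
The manual check from comment #19 can be automated along these lines. This is a sketch under assumptions: the helper `missing_servers` is hypothetical, the hiera output is assumed to be parseable as a JSON array (as the list printed in comment #19 is), and the `[cache]` section in the sample is assumed, since the thread does not show which section the grepped line came from.

```python
import configparser
import json


def missing_servers(hiera_output, conf_text, option="memcache_servers"):
    """Return servers from the hiera list that no `option` value in the config includes."""
    expected = set(json.loads(hiera_output))
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(conf_text)
    configured = set()
    # Look at [DEFAULT] and every other section, since the option may live in either.
    for section in [parser.default_section, *parser.sections()]:
        value = parser[section].get(option, "")
        configured.update(s.strip() for s in value.split(",") if s.strip())
    return sorted(expected - configured)


hiera_output = '["192.168.0.9:11211", "192.168.0.15:11211", "192.168.0.16:11211"]'
conf_text = "[cache]\nmemcache_servers = 192.168.0.9:11211\n"  # section name assumed
print(missing_servers(hiera_output, conf_text))
```

A non-empty result means Nova is configured with fewer memcached servers than the deployment provides, which is the regression described above. Passing `option="memcached_servers"` checks the other spelling instead.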

Nikita Karpin (mkarpin) wrote on 2017-02-21:

#21

According to the Nova team, this bug does not affect Newton, so it should be fixed in Mitaka only.

Felipe Alfaro Solana (felipe-alfaro-gmail) wrote on 2017-02-22:

#22

Yes, exactly. It seems change https://github.com/openstack/fuel-library/commit/a529033fdcb36ccea8cf0cc76339816ed31418c7 broke how noVNC is configured. Instead of using all three memcached servers, it configures Nova to only use the local one so, in the end, I'm back to broken noVNC consoles in HA mode.

Denis Meltsaykin (dmeltsaykin) wrote on 2017-03-06:

#23

The fix for stable/mitaka is committed: https://review.openstack.org/#/c/436365/

TatyanaGladysheva (tgladysheva) on 2017-03-13

tags:

added: on-verification

TatyanaGladysheva (tgladysheva) wrote on 2017-03-13:

#24

Verified on 9.2 + mu1 updates.

Steps to verify:
1. Deploy an HA environment with 3 controllers + 1 compute node
2. Deploy with HTTPS/TLS enabled for Horizon and self-signed certificates
3. Launch a test VM and try to connect to its console

Actual results:
Before the fix:
'Failed to connect to server (code: 1006)' error is observed.

After the fix:
The VNC console is available after clicking the appropriate link.

tags:

removed: on-verification
