multinode deploy results in intermittent authentication failures

Bug #1490778 reported by Steven Dake
Affects  Status   Importance  Assigned to  Milestone
kolla    Invalid  Critical    Steven Dake

Bug Description

Running glance image-list intermittently fails with either an HTTP 500 error or a "not authenticated" error; at other times it works. All services are affected (nova, etc.) EXCEPT keystone.

Running openstack endpoint list in a loop succeeds every time, while glance image-list fails about 30% of the time.
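
The loop comparison above can be made repeatable with a small script. This is a minimal sketch, not the method used in this report: the helper name failure_rate and the run count are hypothetical, and it assumes the OpenStack credentials are already exported in the environment and that a non-zero exit status means the call failed.

```python
import subprocess


def failure_rate(cmd, runs=20):
    """Run cmd `runs` times and return the fraction of non-zero exits."""
    failures = 0
    for _ in range(runs):
        # Output is discarded; only the exit status is inspected.
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures / runs


# Example usage (requires the clients to be installed and configured):
#   failure_rate(["openstack", "endpoint", "list"])
#   failure_rate(["glance", "image-list"])
```

Comparing the two numbers side by side shows whether the failures are specific to services that validate tokens through keystone.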

Since glance is the first service that would use keystone in a typical deployment, it is the easiest to debug. This bug report will detail the debugging I am going through.

Revision history for this message
Steven Dake (sdake) wrote :

[sdake@minime-03 ~]$ docker exec keystone tail -20 /var/log/keystone/keystone.log
2015-08-31 23:38:52.842 18 INFO keystone.common.wsgi [-] POST http://192.168.1.148:35357/v3/auth/tokens
2015-08-31 23:38:55.874 12 INFO keystone.common.wsgi [-] POST http://broked.selfip.net:5000/v3/auth/tokens
2015-08-31 23:38:58.858 14 INFO keystone.common.wsgi [-] POST http://broked.selfip.net:5000/v3/auth/tokens
2015-08-31 23:38:58.930 20 INFO keystone.common.wsgi [-] GET http://192.168.1.148:35357/
2015-08-31 23:38:58.991 18 WARNING keystone.middleware.core [-] RBAC: Invalid token
2015-08-31 23:38:58.991 18 WARNING keystone.common.wsgi [-] The request you have made requires authentication.
2015-08-31 23:38:58.995 19 INFO keystone.common.wsgi [-] POST http://192.168.1.148:35357/v3/auth/tokens
2015-08-31 23:38:59.088 21 INFO keystone.common.wsgi [-] POST http://192.168.1.148:35357/v3/auth/tokens
2015-08-31 23:39:02.065 12 INFO keystone.common.wsgi [-] POST http://broked.selfip.net:5000/v3/auth/tokens
2015-08-31 23:39:02.143 19 INFO keystone.common.wsgi [-] GET http://192.168.1.148:35357/v3/auth/tokens
2015-08-31 23:39:02.145 19 WARNING keystone.common.wsgi [-] Could not find token: 783549cfdaf3470cbbc7867af5091552
2015-08-31 23:39:04.926 15 INFO keystone.common.wsgi [-] POST http://broked.selfip.net:5000/v3/auth/tokens
2015-08-31 23:39:05.071 17 INFO keystone.common.wsgi [-] GET http://192.168.1.148:35357/
2015-08-31 23:39:05.184 18 WARNING keystone.middleware.core [-] RBAC: Invalid token
2015-08-31 23:39:05.184 18 WARNING keystone.common.wsgi [-] The request you have made requires authentication.
2015-08-31 23:39:05.188 20 INFO keystone.common.wsgi [-] POST http://192.168.1.148:35357/v3/auth/tokens
2015-08-31 23:43:27.198 13 INFO keystone.common.wsgi [-] POST http://broked.selfip.net:5000/v3/auth/tokens
2015-08-31 23:43:27.286 18 INFO keystone.common.wsgi [-] POST http://192.168.1.148:35357/v3/auth/tokens
2015-08-31 23:43:27.392 21 INFO keystone.common.wsgi [-] GET http://192.168.1.148:35357/
2015-08-31 23:43:27.460 17 INFO keystone.common.wsgi [-] POST http://192.168.1.148:35357/v3/auth/tokens
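
To see how often token validation fails in a log like the one above, the WARNING lines can be tallied per module. The following is a rough sketch; the regular expression is inferred from the line layout in this excerpt only (date, time, process number, level, module, request context, message) and may not match other keystone log configurations. The name count_warnings is hypothetical.

```python
import re
from collections import Counter

# Layout assumed from the excerpt:
# "<date> <time> <pid> <LEVEL> <module> [<context>] <message>"
LOG_LINE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d+) "
    r"(?P<pid>\d+) (?P<level>[A-Z]+) (?P<module>\S+) "
    r"\[[^\]]*\] (?P<message>.*)$"
)


def count_warnings(lines):
    """Return a Counter mapping module name to its number of WARNING lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group("level") == "WARNING":
            counts[match.group("module")] += 1
    return counts
```

Feeding it the output of the docker exec command shown above (one line per list element) gives a quick count of "Invalid token" and "Could not find token" style warnings.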

Changed in kolla:
assignee: nobody → Steven Dake (sdake)
importance: Undecided → Critical
status: New → Triaged
milestone: none → liberty-3
summary: - multinode deploy results in authentication failures
+ multinode deploy results in intermittent authentication failures
description: updated
Steven Dake (sdake)
description: updated
Revision history for this message
Steven Dake (sdake) wrote :

[sdake@MINIME-ONE ~]$ docker logs keepalived
Starting Healthcheck child process, pid=11
Initializing ipvs 2.6
Starting VRRP child process, pid=12
Registering Kernel netlink reflector
Registering Kernel netlink command channel
Registering Kernel netlink reflector
Registering Kernel netlink command channel
Registering gratuitous ARP shared channel
Opening file '/etc/keepalived/keepalived.conf'.
Opening file '/etc/keepalived/keepalived.conf'.
VRRP Error : Priority not valid !
Configuration is using : 5473 Bytes
             must be between 1 & 255. reconfigure !
             Using default value : 100

Configuration is using : 62138 Bytes
------< Global definitions >------
 Router ID = minime-one
 Smtp server connection timeout = 30
 Email notification from = root@minime-one
 VRRP IPv4 mcast group = 224.0.0.18
 VRRP IPv6 mcast group = 224.0.0.18
 SNMP Trap disabled
------< VRRP Topology >------
 VRRP Instance = Floating
   Want State = MASTER
   Runing on device = em1
   Virtual Router ID = 51
   Priority = 100
   Advert interval = 1sec
   Tracked scripts = 1
     check_alive weight 0
   Virtual IP = 1
     192.168.1.148/32 dev em1 scope global
------< VRRP Scripts >------
 VRRP Script = check_alive
   Command = /check_alive.sh
   Interval = 2 sec
   Timeout = 0 sec
   Weight = 0
   Rise = 10
   Fall = 2
   Status = INIT
Using LinkWatch kernel netlink reflector...
------< Global definitions >------
 Router ID = minime-one
 Smtp server connection timeout = 30
 Email notification from = root@minime-one
 VRRP IPv4 mcast group = 224.0.0.18
 VRRP IPv6 mcast group = 224.0.0.18
 SNMP Trap disabled
------< SSL definitions >------
 Using autogen SSL context
Using LinkWatch kernel netlink reflector...
VRRP_Instance(Floating) Now in FAULT state
VRRP_Script(check_alive) succeeded
Kernel is reporting: interface em1 UP
VRRP_Instance(Floating) Transition to MASTER STATE
VRRP_Instance(Floating) Entering MASTER STATE
Netlink: filter function error
Netlink: filter function error
Netlink: filter function error
Netlink: filter function error
Netlink: filter function error
Netlink: filter function error
<repeated>

Revision history for this message
Steven Dake (sdake) wrote :

The internets suggested sending a SIGHUP to keepalived to resolve the netlink filter function error, because keepalived does not support hot-plugging of interfaces. I stopped and then restarted keepalived on all nodes and received this on the master node:

 Using autogen SSL context
Using LinkWatch kernel netlink reflector...
VRRP_Script(check_alive) succeeded
VRRP_Instance(Floating) Transition to MASTER STATE
VRRP_Instance(Floating) Entering MASTER STATE
VRRP_Instance(Floating) Received lower prio advert, forcing new election
VRRP_Instance(Floating) Received lower prio advert, forcing new election
VRRP_Instance(Floating) Received lower prio advert, forcing new election
VRRP_Instance(Floating) Received lower prio advert, forcing new election

Revision history for this message
Steven Dake (sdake) wrote :

inc0 confirmed this problem exists for him.

Changed in kolla:
status: Triaged → Confirmed
Revision history for this message
Steven Dake (sdake) wrote :

Deployed Ubuntu from source packaging; same result.

keystone endpoint list run repeatedly works
glance image-list run repeatedly does not work

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to kolla (master)

Fix proposed to branch: master
Review: https://review.openstack.org/219261

Changed in kolla:
status: Confirmed → In Progress
Revision history for this message
Sam Yaple (s8m) wrote :

In this case, the failures were caused by incorrect time on the servers. I suggest we close this as Invalid or use it as a docs reference.
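
The thread does not say how the incorrect clocks were found. As one possible check, the sketch below compares epoch timestamps sampled from each node at roughly the same moment (for example, by running date +%s on every host) and flags nodes that drift from the median. The function name clock_skew, the host names, and the one-second tolerance are all assumptions for illustration.

```python
import statistics


def clock_skew(timestamps, tolerance=1.0):
    """Summarise clock disagreement across nodes.

    timestamps maps host name to epoch seconds, all sampled at nearly
    the same moment. Returns (max_skew, suspects), where max_skew is the
    spread between the fastest and slowest clock and suspects lists the
    hosts whose offset from the median exceeds tolerance (in seconds).
    """
    reference = statistics.median(timestamps.values())
    suspects = sorted(
        host
        for host, ts in timestamps.items()
        if abs(ts - reference) > tolerance
    )
    max_skew = max(timestamps.values()) - min(timestamps.values())
    return max_skew, suspects


# Hypothetical readings: node-3 is five minutes ahead of the others.
readings = {"node-1": 1000.0, "node-2": 1000.2, "node-3": 1300.0}
max_skew, suspects = clock_skew(readings)
```

Using the median as the reference means a single badly drifted node does not shift the baseline for the others.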

Steven Dake (sdake)
Changed in kolla:
status: In Progress → Won't Fix
status: Won't Fix → Invalid
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on kolla (master)

Change abandoned by Steven Dake (<email address hidden>) on branch: master
Review: https://review.openstack.org/219261
