HA cluster with ceilometer deployment failed with Could not evaluate: Authentication failed with response code 504

Bug #1338608 reported by Andrey Sledzinskiy
This bug affects 2 people
Affects              Status     Importance  Assigned to                Milestone
Fuel for OpenStack   Confirmed  Critical    Fuel Library (Deprecated)
6.0.x                Confirmed  Critical    Fuel Library (Deprecated)

Bug Description

{

    "build_id": "2014-07-06_00-31-14",
    "mirantis": "yes",
    "build_number": "102",
    "ostf_sha": "09b6bccf7d476771ac859bb3c76c9ebec9da9e1f",
    "nailgun_sha": "217ae694e487211fc8e352e4a45c0ef66644e1d8",
    "production": "docker",
    "api": "1.0",
    "fuelmain_sha": "cd72d39823b87aa5b7506a6e4f7d6ab0ed32de7b",
    "astute_sha": "644d279970df3daa5f5a2d2ccf8b4d22d53386ff",
    "release": "5.0.1",
    "fuellib_sha": "869acab37a78d018a0806e6fc6b76aabb2cdf5f0"

}

Steps:
1. Create the following cluster: Ubuntu, HA, Nova flat, Ceilometer, 3 controllers, 1 compute, 1 mongo node
2. Deploy cluster

Actual result - deployment failed on controller slave-01 (node-1) with puppet errors:
Could not run: execution expired
(/Stage[main]/Osnailyfacter::Cluster_ha/Nova_floating_range[10.108.31.128-10.108.31.254]) Could not evaluate: Authentication failed with response code 504

Logs are attached

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :
no longer affects: fuel/5.1.x
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

The actual problem is here: it looks like a puppet timeout. This is related to nova-network, not ceilometer. The logs show that nova-api is functioning, haproxy is working, and keystone is issuing tokens. I'm not sure where the problem may be; we need a reproducer or more logs. Keeping it as New for a while.

2014-07-06T02:10:24.562529+00:00 err: Could not run: execution expired
2014-07-06T02:10:24.562865+00:00 err: /usr/lib/ruby/1.8/timeout.rb:64:in `rbuf_fill'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/1.8/net/protocol.rb:134:in `rbuf_fill'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/1.8/net/protocol.rb:116:in `readuntil'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/1.8/net/protocol.rb:126:in `readline'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/1.8/net/http.rb:2028:in `read_status_line'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/1.8/net/http.rb:2017:in `read_new'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/1.8/net/http.rb:1051:in `request'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/1.8/net/http.rb:845:in `post'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/vendor_ruby/openstack/connection.rb:305:in `initialize'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/vendor_ruby/openstack/connection.rb:261:in `new'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/vendor_ruby/openstack/connection.rb:261:in `init'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/vendor_ruby/openstack/compute/connection.rb:12:in `initialize'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/vendor_ruby/openstack/connection.rb:66:in `new'
2014-07-06T02:10:24.563714+00:00 err: /usr/lib/ruby/vendor_ruby/openstack/connection.rb:66:in `create'
2014-07-06T02:10:24.563714+00:00 err: /etc/puppet/modules/nova/lib/puppet/provider/nova_floating_range/nova_manage.rb:68:in `connect'
2014-07-06T02:10:24.563991+00:00 err: /etc/puppet/modules/nova/lib/puppet/provider/nova_floating_range/nova_manage.rb:49:in `operate_range'
2014-07-06T02:10:24.563991+00:00 err: /etc/puppet/modules/nova/lib/puppet/provider/nova_floating_range/nova_manage.rb:15:in `exists?'
2014-07-06T02:10:24.563991+00:00 err: /usr/lib/ruby/vendor_ruby/puppet/property/ensure.rb:81:in `retrieve'
2014-07-06T02:10:24.563991+00:00 err: /usr/lib/ruby/vendor_ruby/puppet/type.rb:1035:in `retrieve'
2014-07-06T02:10:24.563991+00:00 err: /usr/lib/ruby/vendor_ruby/puppet/type.rb:1063:in `retrieve_resource'
2014-07-06T02:10:24.563991+00:00 err: /usr/lib/ruby/vendor_ruby/puppet/transaction/resource_harness.rb:223:in `from_resource'
2014-07-06T02:10:24.564997+00:00 err: /usr/lib/ruby/vendor_ruby/puppet/transaction/resource_harness.rb:17:in `evaluate'
2014-07-06T02:10:24.565799+00:00 err: /usr/lib/ruby/vendor_ruby/puppet/transaction.rb:174:in `apply'
2014-07-06T02:10:24.565799+00:00 err: /usr/lib/ruby/vendor_ruby/puppet/transaction.rb:187:in `eval_resource'
2014-07-06T02:10:24.565799+00:00 err: /usr/lib/ruby/vendor_ruby/puppet/transaction.rb:117:in `call'
2014-07-06T02:10:24.565799+00:00 err: /usr/lib/ruby/vendor_ruby/puppet/transaction.rb:117:in `evaluate'
2014-07-06T02:10:24.565799+00:00 err: /usr/lib/ruby/vendor_ruby/puppet/util.rb:327:in `thinma...

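To make the trace easier to read: the nova_floating_range provider ends up doing a plain Net::HTTP POST to Keystone through the vendored openstack gem, and "execution expired" is Ruby's read timeout firing in rbuf_fill while it waits for a response that never comes. A minimal standalone sketch of that failure mode (the endpoint address and credentials below are hypothetical):

require 'net/http'

# Hypothetical Keystone public endpoint behind the haproxy VIP; the real
# address comes from the auth options passed to the provider.
http = Net::HTTP.new('10.108.31.2', 5000)
http.open_timeout = 10  # TCP connect timeout
http.read_timeout = 60  # no byte within 60 s => timeout raised in rbuf_fill

begin
  # The provider has to obtain a token before it can check whether the
  # floating range exists; this is the POST /v2.0/tokens seen in haproxy.
  response = http.post('/v2.0/tokens',
                       '{"auth": {"passwordCredentials": ' \
                       '{"username": "nova", "password": "secret"}}}',
                       'Content-Type' => 'application/json')
  # If every Keystone backend is down, haproxy itself answers 504, which
  # the openstack gem reports as "Authentication failed with response
  # code 504".
  puts response.code
rescue Timeout::Error => e
  # On Ruby 1.8 a stalled read raises Timeout::Error ("execution expired"),
  # matching the puppet-agent log above; newer Rubies raise Net::ReadTimeout.
  warn "execution expired: #{e.message}"
end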

Revision history for this message
Andrey Sledzinskiy (asledzinskiy) wrote :

The problem wasn't reproduced in today's system test run. I'll keep checking during the week and will provide an environment if the bug occurs again.

Changed in fuel:
status: New → Incomplete
Changed in fuel:
milestone: 5.0.1 → 5.0.2
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :

api: '1.0'
astute_sha: 9a74b788be9a7c5682f1c52a892df36e4766ce3f
build_id: 2014-07-16_21-15-18
build_number: '133'
fuellib_sha: 2d1e1369c13bc9771e9473086cb064d257a21fc2
fuelmain_sha: 069686abb90f458f67cfcb4018cacc19971e4b4d
mirantis: 'yes'
nailgun_sha: 1d08d6f80b6514085dd8c0af4d437ef5d37e2802
ostf_sha: 09b6bccf7d476771ac859bb3c76c9ebec9da9e1f
production: docker
release: 5.0.1

Reproduced on ISO #133 (5.0.1):

http://jenkins-product.srt.mirantis.net:8080/view/5.0_swarm/job/5.0_fuelmain.system_test.centos.services_ha/43/testReport/(root)/deploy_ceilometer_ha_with_mongo/deploy_ceilometer_ha_with_mongo/

http://paste.openstack.org/show/86960/

Changed in fuel:
status: Incomplete → New
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :
Revision history for this message
Bogdan Dobrelya (bogdando) wrote :

From the logs provided in #5 for the swarm job in #4:
http://pastebin.com/eCwj9NmE

The issue is related to network connectivity problems in the given environment.

Changed in fuel:
status: New → Invalid
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Reproduced on system test 'deploy_ha':
http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.ubuntu.thread_5/151/

Same situation in the puppet-agent.log (node-1, 'slave-02_controller'):
(/Stage[main]/Osnailyfacter::Cluster_ha/Nova_floating_range[10.108.6.128-10.108.6.254]) Could not evaluate: Authentication failed with response code 504

Same situation in rabbitmq.log with connectivity errors:
2014-09-05T00:23:03.526999+01:00 notice: {amqp_error,connection_forced,
2014-09-05T00:23:03.526999+01:00 notice: "broker forced connection closure with reason 'shutdown'",none}
2014-09-05T00:23:13.534687+01:00 notice: =INFO REPORT==== 4-Sep-2014::23:23:12 ===
2014-09-05T00:23:13.534812+01:00 notice: Halting Erlang VM

Deployment of node-5 'slave-05_compute' failed before it even started (there is no puppet-agent.log on the node)

Also, while I was examining the logs, the Fuel UI showed that node-3 and node-5 went offline for a few minutes. In reality, node-3 and node-5 were working the whole time (I had VNC sessions open to them via virt-manager, and those sessions weren't broken).

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Fuel DevOps (fuel-devops)
status: Invalid → Confirmed
importance: High → Medium
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

srv17:
http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.ubuntu.thread_4/155/
The same connectivity problems between nodes

srv18
http://jenkins-product.srt.mirantis.net:8080/view/0_master_swarm/job/master_fuelmain.system_test.ubuntu.thread_3/155/
The same problem: slave-05 and slave-06 went offline during the test.

Also, after the snapshot 'error_ceph_ha_restart' was reverted, slave-01 was lost by Fuel for a few minutes:

[root@nailgun ~]# fuel node
id | status | name | cluster | ip | mac | roles | pending_roles | online
---|--------|------------------------------|---------|-------------|-------------------|----------------------|---------------|-------
1 | ready | slave-01_controller_ceph-osd | 1 | 10.108.10.3 | 64:3a:93:52:e6:58 | ceph-osd, controller | | False
3 | ready | slave-04_compute_ceph-osd | 1 | 10.108.10.5 | 64:65:38:57:6f:4d | ceph-osd, compute | | True
4 | ready | slave-02_controller_ceph-osd | 1 | 10.108.10.6 | 64:b4:14:db:da:3b | ceph-osd, controller | | True
2 | ready | slave-03_controller_ceph-osd | 1 | 10.108.10.4 | 64:df:cb:b6:9a:24 | ceph-osd, controller | | True
5 | ready | slave-05_compute_ceph-osd | 1 | 10.108.10.7 | 64:db:35:22:df:ce | ceph-osd, compute | | False
6 | ready | slave-06_ceph-osd | 1 | 10.108.10.8 | 64:ca:0d:fa:7f:79 | ceph-osd | | False

But node-1 was online at that moment:
root@node-1:~# uptime
 04:25:04 up 23 min, 1 user, load average: 0.51, 0.63, 0.79

Changed in fuel:
milestone: 5.0.2 → 5.0.3
Revision history for this message
Igor Shishkin (teran) wrote :

It's a bad idea to play ping-pong with tasks :)
As I can see from Vova's comment, the bug is related to nova, so please get one of the library engineers to check what's wrong there.

Changed in fuel:
assignee: Fuel DevOps (fuel-devops) → Fuel QA Team (fuel-qa)
Revision history for this message
Dennis Dmitriev (ddmitriev) wrote :

Reproduced on CI: http://jenkins-product.srt.mirantis.net:8080/view/5.1_swarm/job/5.1_fuelmain.system_test.centos.thread_5/28/

==============================
Wed Oct 22 08:18:47 +0000 2014 /Stage[main]/Osnailyfacter::Cluster_ha/Nova_floating_range[10.108.9.128-10.108.9.254] (info): Starting to evaluate the resource
2014-10-22 09:19:47 ERR (/Stage[main]/Osnailyfacter::Cluster_ha/Nova_floating_range[10.108.9.128-10.108.9.254]) Could not evaluate: Authentication failed with response code 504
==============================

It looks like this issue is connected with keystone on node-1:

===============================
2014-10-22 08:18:47.023 9498 DEBUG keystone.middleware.core [-] Auth token not in the request header. Will not build auth context. process_request /usr/lib/python2.6/site-packages/keystone/middleware/core.py:271
===============================

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Fuel Library Team (fuel-library)
Revision history for this message
Igor Shishkin (teran) wrote :
Revision history for this message
Aleksandra Fedorova (bookwar) wrote :
Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

The commit diff between community ISO #181 (passed BVT) and #182 (failed BVT) indicates this is not a regression; it must be an intermittent bug:

fuel-library:
If2a67b15beb6371979f1638d2c49ef935bfbed50 Add timeout for anaconda yum conf
I68a15aef4ead33b3f89527951b08e37d927e3f56 Add glance.store.http.Store to glance known_stores

fuel-web:
Ib43c4f97102229f38eef88a6a631f466e4b7817f Fix Healthcheck table header
Ia85278402799d7d19e0c0aee975cc986de5879f9 Fix KeystoneClient issue when token was expired
Ic6dc24a133478bf89dfdc8f07f3fb2cfe8fdcb52 Add more validation for Plugin entity
Iade29275b9f569f05313633c10128be325df0c71 Added error server response handling on Healthcheck tab

fuel-main:
I8e6a52f1d75f6a6f5bae1d9202deeb31f8674b9e Fix selenium tests after changes in UI

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

There are no keystone or nova logs in the snapshot from comment #12, even though ps output indicates that both were running at the time the snapshot was generated. Based on haproxy.log, I can infer that at the time the failure occurred, keystone wasn't running, or at least wasn't responding to haproxy:

2014-10-28T16:10:56.011090+00:00 err: (/Stage[main]/Osnailyfacter::Cluster_ha/Nova_floating_range[10.108.36.128-10.108.36.254]) Could not evaluate: Authentication failed with response code 504

2014-10-28T16:09:57.895115+00:00 alert: Server keystone-1/node-3 is DOWN, reason: Layer4 timeout, check duration: 2001ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2014-10-28T16:09:57.961115+00:00 alert: Server keystone-1/node-4 is DOWN, reason: Layer4 timeout, check duration: 2000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2014-10-28T16:09:58.093883+00:00 alert: Server keystone-2/node-3 is DOWN, reason: Layer4 timeout, check duration: 2000ms. 2 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2014-10-28T16:09:58.160867+00:00 alert: Server keystone-2/node-4 is DOWN, reason: Layer4 timeout, check duration: 2000ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
2014-10-28T16:10:55.999073+00:00 info: 10.108.37.3:33993 [28/Oct/2014:16:09:55.991] keystone-1 keystone-1/node-1 0/0/1/-1/60004 504 194 - - sH-- 1/0/0/0/0 0/0 "POST /v2.0/tokens HTTP/1.1"
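
A quick decoding of that last haproxy line may help (these are standard haproxy HTTP log fields, nothing environment-specific):

keystone-1 keystone-1/node-1   frontend and backend/server that handled the request
0/0/1/-1/60004                 the Tq/Tw/Tc/Tr/Tt timers: request read in 0 ms, not
                               queued, server connect took 1 ms, response headers
                               never arrived (Tr = -1), session closed after 60004 ms
504 194                        status and bytes returned to the client
sH--                           's': server-side timeout expired while haproxy was
                               'H': waiting for the response headers

So haproxy did connect to keystone on node-1, waited out its ~60 s server timeout without getting a reply, and synthesized the 504 itself; the openstack gem then surfaced that as "Authentication failed with response code 504".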

Revision history for this message
Dmitry Borodaenko (angdraug) wrote :

The Keystone CLI works fine right around the time Nova received the 504 from haproxy:
2014-10-28T16:09:54.782488+00:00 debug: Executing '/usr/bin/keystone --os-token Z2DLBH1m --os-endpoint http://10.108.37.3:35357/v2.0/ user-role-add --user-id 56ca03e4f7f94b25912557a4532ca44b --tenant-id f54f41d5019b47b7a57c292a5cdda2f3 --role-id 3e78ef938f644ebab29ed42e48948c15'
2014-10-28T16:10:56.024305+00:00 debug: Executing '/usr/bin/keystone --os-token Z2DLBH1m --os-endpoint http://10.108.37.3:35357/v2.0/ tenant-list'
