Controller node offline after successful deployment

Bug #1489315 reported by Maksym Strukov
This bug affects 1 person
Affects: Fuel for OpenStack
Status: In Progress
Importance: High
Assigned to: Aleksandr Didenko
Milestone: (not shown)

Bug Description

Steps (https://mirantis.testrail.com/index.php?/cases/view/261869)
1. Create new environment
2. Choose Neutron, VLAN
3. Uncheck "Cinder LVM over iSCSI for volumes" on Settings tab
4. Add 1 controller
5. Add 1 compute
6. Verify networks
7. Deploy the environment
8. Verify networks
9. Run OSTF

Actual:
All tests failed with message: Keystone client is not available
AuthorizationFailure: Authorization Failed: Unable to establish connection to http://10.109.17.2:5000/v2.0/tokens

Snapshot: https://drive.google.com/a/mirantis.com/file/d/0B1yfbgZlRKfxY0lKRTJTekFfZDQ/view?usp=sharing

10. Reset env
11. Deploy the environment

Actual:
Several hours after deployment, the controller node was marked "Offline".

Snapshot: https://drive.google.com/a/mirantis.com/file/d/0B1yfbgZlRKfxZ0FCV3VHeDA2ajQ/view?usp=sharing

Env:
{"build_id": "2015-08-25_09-19-20", "build_number": "229", "release_versions": {"2015.1.0-7.0": {"VERSION": {"build_id": "2015-08-25_09-19-20", "build_number": "229", "api": "1.0", "fuel-library_sha": "f1443266fa7a32eaf8ccc131677389079339bb43", "nailgun_sha": "87a569c06ecd478cae523bd437d4e0af8014dd5b", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "e01693992d7a0304d926b922b43f3b747c35964c", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "4c2ab9d6c623d345086c6e2874d1df81fd96a942", "production": "docker", "python-fuelclient_sha": "fc7b63aa6900fe3b2c183108ba6a13e868bc0472", "astute_sha": "53c86cba593ddbac776ce5a3360240274c20738c", "fuel-ostf_sha": "3ad03d076c46347691cc3480dd19d34e37b73df4", "release": "7.0", "fuelmain_sha": "28d4bfcff7a0fb1b37504dbcac4998789df17935"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "f1443266fa7a32eaf8ccc131677389079339bb43", "nailgun_sha": "87a569c06ecd478cae523bd437d4e0af8014dd5b", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "e01693992d7a0304d926b922b43f3b747c35964c", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "4c2ab9d6c623d345086c6e2874d1df81fd96a942", "production": "docker", "python-fuelclient_sha": "fc7b63aa6900fe3b2c183108ba6a13e868bc0472", "astute_sha": "53c86cba593ddbac776ce5a3360240274c20738c", "fuel-ostf_sha": "3ad03d076c46347691cc3480dd19d34e37b73df4", "release": "7.0", "fuelmain_sha": "28d4bfcff7a0fb1b37504dbcac4998789df17935"}

Changed in fuel:
milestone: none → 7.0
assignee: nobody → MOS Keystone (mos-keystone)
importance: Undecided → Critical
status: New → Confirmed
Boris Bobrov (bbobrov) wrote :

For some reason keystone is not running at all.

Boris Bobrov (bbobrov) wrote :

Yeah, for some reason keystone was not running at all. I restarted apache with `service apache2 restart` and keystone started working. However, the node is still marked as offline.

I also noticed that /var/log/messages is missing; I have no idea why.

I am reassigning the bug to fuel-library; maybe they can tell why apache was not serving keystone.

Changed in fuel:
assignee: MOS Keystone (mos-keystone) → nobody
Maksym Strukov (unbelll)
Changed in fuel:
assignee: nobody → Fuel Library Team (fuel-library)
Bogdan Dobrelya (bogdando) wrote :

Trying to reproduce and do an RCA.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Bogdan Dobrelya (bogdando)
Bogdan Dobrelya (bogdando) wrote :

On my attempt, supervisord.log reported the node as gone away in the middle of deployment.

Bogdan Dobrelya (bogdando) wrote :

I'm not sure I reproduced the same result, but here are the logs anyway.

Bogdan Dobrelya (bogdando) wrote :

Correction: the node went away in the middle of provisioning, not deployment.

Bogdan Dobrelya (bogdando) wrote :

On the second try, the deployment passed OK, but the OSTF error was related to TLS:
SSLError: [Errno 1] _ssl.c:492: error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed

Changed in fuel:
assignee: Bogdan Dobrelya (bogdando) → Fuel Library Team (fuel-library)
Stanislaw Bogatkin (sbogatkin) wrote :

The problem in the original environment wasn't SSL. When I tried it, OSTF could not run because of incorrect proxy settings. The code that sets the proxy skipped it because one controller was a false negative: it was marked as offline in nailgun but was actually online.

Bogdan Dobrelya (bogdando) wrote :

If you think I reproduced two new bugs instead, let me know and I will submit them.

Bogdan Dobrelya (bogdando) wrote :

@Stanislaw, do you think I reproduced https://bugs.launchpad.net/fuel/+bug/1429807 instead?

Bogdan Dobrelya (bogdando) wrote :

Also note that I was able to reproduce the same SSL error with OSTF one hour after I deployed the env.

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Aleksandr Didenko (adidenko)
Stanislaw Bogatkin (sbogatkin) wrote :

@Bogdan, I think this bug looks like a duplicate of 1429807. Your bug looks like a new one to me.

Aleksandr Didenko (adidenko) wrote :

Analyzing snapshots from the original bug description:

1) All tests failed with message: Keystone client is not available

Deployment ended at 19:37:34 :

2015-08-26T19:37:34 info: [669] Casting message to Nailgun: {"method"=>"deploy_resp", "args"=>{"task_uuid"=>"2402f497-edcc-46fb-96af-e3c5a460c2a8", "status"=>"ready", "progress"=>100}}

According to the haproxy log, the keystone-apache logs, and the ps and netstat output, there were no problems with keystone, apache, or api-proxy:

[node-1.test.domain.local] out: tcp6 0 0 :::8888 :::* LISTEN 4004/apache2
[node-1.test.domain.local] out: tcp6 0 0 :::35357 :::* LISTEN 4004/apache2
[node-1.test.domain.local] out: tcp6 0 0 :::5000 :::* LISTEN 4004/apache2

[node-1.test.domain.local] out: root 17789 0.0 0.1 96736 4028 ? Ss 19:23 0:00 /usr/sbin/apache2 -k start
[node-1.test.domain.local] out: keystone 4000 0.0 2.9 552636 103924 ? Sl 19:27 0:03 \_ keystone-admin -k start
[node-1.test.domain.local] out: keystone 4001 0.0 2.8 550348 101348 ? Sl 19:27 0:03 \_ keystone-main -k start

2015-08-26T19:58:13.669374+00:00 info: 10.109.17.4:49668 [26/Aug/2015:19:58:13.659] keystone-2 keystone-2/node-1 0/0/0/8/9 200 1392 - - ---- 21/0/0/0/0 0/0 "GET /v2.0/users HTTP/1.1"
2015-08-26T19:58:13.788386+00:00 info: 10.109.17.4:49672 [26/Aug/2015:19:58:13.779] keystone-2 keystone-2/node-1 0/0/0/8/9 200 506 - - ---- 21/0/0/0/0 0/0 "GET /v2.0/tenants HTTP/1.1"

I also see no attempts to connect to keystone in the haproxy logs around the time of the first OSTF failures. So it really looks like a network connection issue between the Fuel master node and the controller node.
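
One way to test the network hypothesis is to probe the Keystone port directly from the Fuel master node. The following is a minimal sketch, assuming Python 3 is available there. The helper name `can_connect` is hypothetical and not part of Fuel or OSTF; the address and port in the usage comment come from the OSTF error in the bug description.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection performs the full TCP handshake, so success means
        # something is listening and reachable from this machine.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False


# Example (address taken from the OSTF error message):
#   can_connect("10.109.17.2", 5000)
```

A False result from the master node while netstat on the controller shows apache2 listening on port 5000 would be consistent with a connectivity problem rather than a Keystone outage.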

2) Several hours after deployment, the controller node was marked "Offline".

From nailgun/assassind.log:

2015-08-26 23:44:27.582 WARNING [7f29bcdf3700] (assassind) Node 'con (90:2d)' has gone away

Last record from nailgun-agent on node-5:

2015-08-26T23:42:26.048227+00:00 notice: I, [2015-08-26T23:42:26.039625 #27046] INFO -- : API URL is http://10.109.15.2:8000/api

Corresponding record from CRON.log:

2015-08-26T23:42:01.810129+00:00 info: (root) CMD (flock -w 0 -o /var/lock/nailgun-agent.lock -c "/usr/bin/nailgun-agent 2>&1 | tee -a /var/log/nailgun-agent.log | /usr/bin/logger -t nailgun-agent")

As we can see in commands/ps.txt, nailgun-agent has been stuck in the R state since then:

http://paste.openstack.org/raw/429793/

This is why cron could not run nailgun-agent again, so the node's information in nailgun was no longer updated.
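
The cron entry wraps the agent in `flock -w 0`, which gives up immediately if the lock is already held. The sketch below illustrates that behaviour with Python's `fcntl` module; it is an illustration only, not the agent's code. The helper name `try_lock` is hypothetical, and the example assumes a Linux host.

```python
import fcntl


def try_lock(path):
    """Try to take an exclusive lock without waiting, similar to `flock -w 0`.

    Returns the open file object that holds the lock, or None if the lock
    is already held through another open of the same file.
    """
    handle = open(path, "a")
    try:
        # LOCK_NB makes the call fail at once instead of blocking.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        handle.close()
        return None
    return handle
```

While the first holder keeps its file open, every later `try_lock` on the same path returns None. In the same way, each new cron run exits without starting the agent as long as the stuck run keeps /var/lock/nailgun-agent.lock locked.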

@Stanislaw, it's not a duplicate or regression of 1429807

Aleksandr Didenko (adidenko) wrote :

I was not able to reproduce any of the issues listed in this bug on ISO #242. I also think a stuck nailgun-agent is a rare problem, so I'm lowering the priority to High.

Changed in fuel:
assignee: Aleksandr Didenko (adidenko) → Vladimir Sharshov (vsharshov)
importance: Critical → High
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-nailgun-agent (master)

Fix proposed to branch: master
Review: https://review.openstack.org/217799

Changed in fuel:
assignee: Vladimir Sharshov (vsharshov) → Aleksandr Didenko (adidenko)
status: Confirmed → In Progress
Aleksandr Didenko (adidenko) wrote :

From pkg list on the problem env:

ii ohai 6.14.0-2

While it should be:

ii ohai 6.14.0-2~u14.04+mos1

There are also a lot of segfaults in the nailgun-agent log. So I'm marking it as a duplicate of https://bugs.launchpad.net/fuel/+bug/1488844

Alexander Gordeev (a-gordeev) wrote :

Marking as duplicate of https://bugs.launchpad.net/fuel/+bug/1488844.

Due to broken apt preferences, most of the ruby packages were installed from the official Ubuntu repos.
Moreover, this could affect any other packages and cause a Keystone outage as well.

E.g.:
> [node-5.test.domain.local] out: ii ohai 6.14.0-2 all Detects data about your operating system and reports it in JSON

Expected ohai package version is 6.14.0-2~u14.04+mos1
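
A check like the one above can be scripted against `dpkg -l` output. The following is a minimal sketch, not part of Fuel. The function name `non_mos_packages`, the `expected_names` argument, and the assumption that MOS builds always carry a `+mos` marker in the version string are illustrative and based only on the ohai example in this thread. Input lines are assumed to be raw `dpkg -l` lines, so the `[node-5...] out:` prefix used in the snapshot must be stripped first.

```python
def non_mos_packages(dpkg_lines, expected_names, marker="+mos"):
    """Return (name, version) pairs for packages that should come from the
    MOS repo but whose installed version lacks the marker."""
    suspects = []
    for line in dpkg_lines:
        fields = line.split()
        # Installed packages are listed with the "ii" status in `dpkg -l`.
        if len(fields) >= 3 and fields[0] == "ii":
            name, version = fields[1], fields[2]
            if name in expected_names and marker not in version:
                suspects.append((name, version))
    return suspects
```

Note that in Debian version ordering a `~` suffix sorts lower than the same version without it, so 6.14.0-2 compares as newer than 6.14.0-2~u14.04+mos1. That is consistent with the Ubuntu package being picked once the apt preferences were broken.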

The ISO on which this bug was reproduced still contains bug 1488844, which was fixed a few days after the ISO was built.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-nailgun-agent (master)

Change abandoned by Aleksandr Didenko (<email address hidden>) on branch: master
Review: https://review.openstack.org/217799
Reason: No longer needed
