Node(s) going "Offline"

Bug #1380786 reported by Aleksandr Shaposhnikov
This bug affects 3 people
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Fuel Python (Deprecated)

Bug Description

For some reason, after successful provisioning a node can go offline in the Fuel/MOS master node UI. It is still accessible over the PXE network and reachable from all nodes without any problems.

Even nova service-list shows that everything is fine.
| nova-compute | node-6.domain.tld | nova | enabled | up | 2014-10-13T20:34:53.000000 | -

Snapshot attached. Whoever looks at the snapshot should focus on the last successful deployment (the first one was not successful; after that the environment was reset and redeployed using the same cluster id).

/api/version
{"build_id": "2014-10-13_00-01-06", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "27", "auth_required": true, "api": "1.0", "nailgun_sha": "88a94a11426d356540722593af1603e5089d442c", "production": "docker", "fuelmain_sha": "431350ba204146f815f0e51dd47bf44569ae1f6d", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1.1", "release_versions": {"2014.1.1-5.1": {"VERSION": {"build_id": "2014-10-13_00-01-06", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "27", "api": "1.0", "nailgun_sha": "88a94a11426d356540722593af1603e5089d442c", "production": "docker", "fuelmain_sha": "431350ba204146f815f0e51dd47bf44569ae1f6d", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1.1", "fuellib_sha": "46ad455514614ec2600314ac80191e0539ddfc04"}}}, "fuellib_sha": "46ad455514614ec2600314ac80191e0539ddfc04"}

Tags: scale
Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :
description: updated
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
milestone: none → 6.0
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Also, there is the following message in the nailgun agent log at the same time as assassind says that node 6 is offline:
http://paste.openstack.org/show/120824/
from assassind log
2014-10-13 17:50:49.644 INFO [7f9e22036700] (assassind) Running Assassind...
2014-10-13 21:21:35.504 INFO [7f9e22036700] (notification) Notification: topic: error message: Node 'compute_6' has gone away

Łukasz Oleś (loles)
Changed in fuel:
status: Confirmed → Triaged
Revision history for this message
Łukasz Oleś (loles) wrote :

I logged in. It looks like nailgun-agent hangs for some reason.

tags: added: scale
Revision history for this message
Tomasz 'Zen' Napierala (tzn) wrote :

This is high priority and needs to be addressed as soon as possible.

Revision history for this message
Łukasz Oleś (loles) wrote :

It may be connected with a bug in Cinder and a nonexistent iSCSI disk. nailgun-agent tries to get disk stats but cannot connect to it.
It's just a guess; logs from nailgun-agent are not included in the snapshot.
We need to reproduce it again.
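
A minimal sketch of the kind of guard that would avoid such a hang, assuming an lsblk-style probe (the real nailgun-agent is Ruby, so this is only an illustration of the idea, not its code):

# Illustration only, not nailgun-agent code: bound a blocking disk probe with
# a timeout so that an unreachable iSCSI target degrades the report instead
# of hanging the whole agent run.
import subprocess

def list_block_devices(timeout_sec: int = 10) -> str:
    """Run a disk listing with a hard timeout instead of blocking forever."""
    try:
        result = subprocess.run(
            ["lsblk", "-b", "-o", "NAME,SIZE,TYPE"],
            capture_output=True, text=True, timeout=timeout_sec, check=True,
        )
        return result.stdout
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # A hung or failing device probe should not stall the agent.
        return ""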

Revision history for this message
Dima Shulyak (dshulyak) wrote :

You can find the agent logs in the rsyslog directory:
10.20.0.2/var/log/docker-logs/remote/node-6.domain.tld

As for the bug, I found that messages were delivered without any delays and in a normal state;
after each message a 200 response was returned.

The 'network unreachable' message is at 21:04, but the node only goes offline at 21:21.
I was considering possible races between assassind and the API, but given the 3-minute timeout, and that the
message appears in the log every 1 minute, I don't know where the problem is :)
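
For context, the check being reasoned about here boils down to comparing the node's last heartbeat against a timeout. A minimal sketch of that logic, with illustrative names and the 3-minute/1-minute figures taken from the comment above (not the actual assassind code):

# Sketch of a keepalive check, assumptions only: a node is marked offline
# when its last report is older than the keepalive timeout.
from datetime import datetime, timedelta

KEEPALIVE_TIMEOUT = timedelta(minutes=3)  # timeout mentioned in the comment
REPORT_INTERVAL = timedelta(minutes=1)    # agent reports roughly every minute

def is_offline(last_seen: datetime, now: datetime) -> bool:
    """True if the node's last heartbeat is older than the keepalive timeout."""
    return now - last_seen > KEEPALIVE_TIMEOUT

# With these numbers a single delayed report should not flip a node to
# offline; only an agent that stops reporting altogether (e.g. one stuck on a
# blocking call) would.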

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Aleksandr, were you perhaps performing any additional actions on these nodes?
I don't really see any code paths that could lead to this situation.

Changed in fuel:
status: Triaged → Incomplete
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Guys,
let's do some more troubleshooting and log analysis. I've heard about this issue from a few sources already. Let's do our best to analyze the code, and maybe write a few tests. We might be able to find the root cause of the issue.
Again, let's do our best...

Changed in fuel:
status: Incomplete → New
Revision history for this message
Dima Shulyak (dshulyak) wrote :

I would ask for another source of logs; I really don't see any pattern that could lead to a temporary false-positive offline node.

It is possible with nodes that were stopped/reset.

Revision history for this message
Tomasz 'Zen' Napierala (tzn) wrote :

We need to reproduce this bug and get our hands on the lab or the logs.

Revision history for this message
Łukasz Oleś (loles) wrote :

See my comment #7.
If I'm right there, we will not see it again because Cinder was fixed.

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Looks like there are at least two issues related to offline nodes. One with Cinder and one with interface numbers: https://bugs.launchpad.net/fuel/+bug/1394466
I'm closing this request. Let's create new tickets for new issues with offline nodes.

Changed in fuel:
status: New → Invalid
Revision history for this message
Denis Klepikov (dklepikov) wrote :

In the attachment is lsof output from the hung processes.
root@node-9:~# ps axwu | grep nail
root 16872 0.0 0.0 4408 608 ? Ss 08:46 0:00 /bin/sh -c flock -w 0 -o /var/lock/agent.lock -c "/opt/nailgun/bin/agent 2>&1 | tee -a /var/log/nailgun-agent.log | /usr/bin/logger -t nailgun-agent"
root 16873 0.0 0.0 7152 596 ? S 08:46 0:00 flock -w 0 -o /var/lock/agent.lock -c /opt/nailgun/bin/agent 2>&1 | tee -a /var/log/nailgun-agent.log | /usr/bin/logger -t nailgun-agent
root 16875 0.0 0.0 4408 600 ? S 08:46 0:00 /bin/sh -c /opt/nailgun/bin/agent 2>&1 | tee -a /var/log/nailgun-agent.log | /usr/bin/logger -t nailgun-agent
root 16876 0.2 0.1 90148 18344 ? Sl 08:46 0:02 ruby /opt/nailgun/bin/agent
root 16877 0.0 0.0 7172 712 ? S 08:46 0:00 tee -a /var/log/nailgun-agent.log
root 16878 0.0 0.0 7164 688 ? S 08:46 0:00 /usr/bin/logger -t nailgun-agent
root 23772 0.0 0.0 9364 876 pts/0 S+ 09:01 0:00 grep --color=auto nail
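
One detail visible in this listing: the agent is wrapped in "flock -w 0 /var/lock/agent.lock", so while one run holds the lock, any later run gives up immediately rather than queueing. If a run hangs, the node therefore simply stops reporting. A rough Python equivalent of that wrapper, for illustration only (not Fuel code):

# Hypothetical equivalent of the "flock -w 0" wrapper seen above.
import fcntl
import sys

def run_exclusive(lock_path: str = "/var/lock/agent.lock") -> None:
    lock_file = open(lock_path, "w")
    try:
        # LOCK_NB mirrors "flock -w 0": do not wait if the lock is taken.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit(0)  # a previous (possibly hung) run still holds the lock
    try:
        pass  # the actual agent work would run here
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()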

Revision history for this message
Łukasz Oleś (loles) wrote :

OK, it looks like the agent hangs when there are problems with disks and it cannot list them. I will create a new bug for this.
