Node(s) going "Offline"

Bug #1380786 reported by Aleksandr Shaposhnikov
This bug affects 3 people
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Fuel Python (Deprecated)

Bug Description

For some reason, after successful provisioning a node can go offline in the Fuel/MOS master node UI. It is still accessible over the PXE network and reachable from all nodes without any problems.

Even nova service-list shows that everything is fine.
| nova-compute | node-6.domain.tld | nova | enabled | up | 2014-10-13T20:34:53.000000 | -

Snapshot attached. Whoever looks at the snapshot should focus on the last successful deployment (the first one was not successful; after that the environment was reset and redeployed using the same cluster id).

/api/version
{"build_id": "2014-10-13_00-01-06", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "27", "auth_required": true, "api": "1.0", "nailgun_sha": "88a94a11426d356540722593af1603e5089d442c", "production": "docker", "fuelmain_sha": "431350ba204146f815f0e51dd47bf44569ae1f6d", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1.1", "release_versions": {"2014.1.1-5.1": {"VERSION": {"build_id": "2014-10-13_00-01-06", "ostf_sha": "64cb59c681658a7a55cc2c09d079072a41beb346", "build_number": "27", "api": "1.0", "nailgun_sha": "88a94a11426d356540722593af1603e5089d442c", "production": "docker", "fuelmain_sha": "431350ba204146f815f0e51dd47bf44569ae1f6d", "astute_sha": "f5fbd89d1e0e1f22ef9ab2af26da5ffbfbf24b13", "feature_groups": ["mirantis"], "release": "5.1.1", "fuellib_sha": "46ad455514614ec2600314ac80191e0539ddfc04"}}}, "fuellib_sha": "46ad455514614ec2600314ac80191e0539ddfc04"}

Tags: scale
Revision history for this message
Aleksandr Shaposhnikov (alashai8) wrote :
description: updated
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Changed in fuel:
assignee: nobody → Fuel Python Team (fuel-python)
milestone: none → 6.0
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Also, there is the following message in the nailgun agent log at the same time as assassind says that node 6 is offline:
http://paste.openstack.org/show/120824/
from assassind log
2014-10-13 17:50:49.644 INFO [7f9e22036700] (assassind) Running Assassind...
2014-10-13 21:21:35.504 INFO [7f9e22036700] (notification) Notification: topic: error message: Node 'compute_6' has gone away

Łukasz Oleś (loles)
Changed in fuel:
status: Confirmed → Triaged
Revision history for this message
Łukasz Oleś (loles) wrote :

I logged in. It looks like nailgun-agent hangs for some reason.

tags: added: scale
Revision history for this message
Tomasz 'Zen' Napierala (tzn) wrote :

This is high priority and needs to be addressed as soon as possible.

Revision history for this message
Łukasz Oleś (loles) wrote :

It may be connected with a bug in Cinder and a nonexistent iSCSI disk. nailgun-agent tries to get disk stats but cannot connect to it.
It's just a guess; logs from nailgun-agent are not included in the snapshot.
We need to reproduce it again.
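
A minimal sketch of the kind of guard that would avoid such a hang, assuming an lsblk-style probe (the real nailgun-agent is Ruby, so this is only an illustration of the idea, not its code):

# Illustration only, not nailgun-agent code: bound a blocking disk probe with
# a timeout so that an unreachable iSCSI target degrades the report instead
# of hanging the whole agent run.
import subprocess

def list_block_devices(timeout_sec: int = 10) -> str:
    """Run a disk listing with a hard timeout instead of blocking forever."""
    try:
        result = subprocess.run(
            ["lsblk", "-b", "-o", "NAME,SIZE,TYPE"],
            capture_output=True, text=True, timeout=timeout_sec, check=True,
        )
        return result.stdout
    except (subprocess.TimeoutExpired, subprocess.CalledProcessError):
        # A hung or failing device probe should not stall the agent.
        return ""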

Revision history for this message
Dima Shulyak (dshulyak) wrote :

You can find the agent logs in the rsyslog directory:
10.20.0.2/var/log/docker-logs/remote/node-6.domain.tld

As for the bug, I found that messages were delivered without any delays and in a normal state;
after each message a 200 response was returned.

The 'network unreachable' message is at 21:04, but the node only goes offline at 21:21.
I was considering possible races between assassind and the API, but given the 3-minute timeout, and that the
message appears in the log every 1 minute, I don't know where the problem is :)
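
For context, the check being reasoned about here boils down to comparing the node's last heartbeat against a timeout. A minimal sketch of that logic, with illustrative names and the 3-minute/1-minute figures taken from the comment above (not the actual assassind code):

# Sketch of a keepalive check, assumptions only: a node is marked offline
# when its last report is older than the keepalive timeout.
from datetime import datetime, timedelta

KEEPALIVE_TIMEOUT = timedelta(minutes=3)  # timeout mentioned in the comment
REPORT_INTERVAL = timedelta(minutes=1)    # agent reports roughly every minute

def is_offline(last_seen: datetime, now: datetime) -> bool:
    """True if the node's last heartbeat is older than the keepalive timeout."""
    return now - last_seen > KEEPALIVE_TIMEOUT

# With these numbers a single delayed report should not flip a node to
# offline; only an agent that stops reporting altogether (e.g. one stuck on a
# blocking call) would.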

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Aleksandr, were you perhaps performing any additional actions on these nodes?
I don't really see any code paths that could lead to this situation.

Changed in fuel:
status: Triaged → Incomplete
Revision history for this message
Mike Scherbakov (mihgen) wrote :

Guys,
let's do some more troubleshooting and log analysis. I've heard about this issue from a few sources already. Let's do our best to analyze the code, and maybe write a few tests. We might be able to find the root cause of the issue.
Again, let's do our best...

Changed in fuel:
status: Incomplete → New
Revision history for this message
Dima Shulyak (dshulyak) wrote :

I would ask for another source of logs; I really don't see any pattern that could lead to a temporary false-positive offline node.

It is possible with nodes that were stopped/reset.

Revision history for this message
Tomasz 'Zen' Napierala (tzn) wrote :

We need to reproduce this bug and get our hands on the lab or the logs.

Revision history for this message
Łukasz Oleś (loles) wrote :

See my comment #7.
If I'm right there, we will not see it again because Cinder was fixed.

Revision history for this message
Dmitry Pyzhov (dpyzhov) wrote :

Looks like there are at least two issues related to offline nodes. One with Cinder and one with interface numbers: https://bugs.launchpad.net/fuel/+bug/1394466
I'm closing this request. Let's create new tickets for new issues with offline nodes.

Changed in fuel:
status: New → Invalid
Revision history for this message
Denis Klepikov (dklepikov) wrote :

In the attachment is lsof output from the hung processes.
root@node-9:~# ps axwu | grep nail
root 16872 0.0 0.0 4408 608 ? Ss 08:46 0:00 /bin/sh -c flock -w 0 -o /var/lock/agent.lock -c "/opt/nailgun/bin/agent 2>&1 | tee -a /var/log/nailgun-agent.log | /usr/bin/logger -t nailgun-agent"
root 16873 0.0 0.0 7152 596 ? S 08:46 0:00 flock -w 0 -o /var/lock/agent.lock -c /opt/nailgun/bin/agent 2>&1 | tee -a /var/log/nailgun-agent.log | /usr/bin/logger -t nailgun-agent
root 16875 0.0 0.0 4408 600 ? S 08:46 0:00 /bin/sh -c /opt/nailgun/bin/agent 2>&1 | tee -a /var/log/nailgun-agent.log | /usr/bin/logger -t nailgun-agent
root 16876 0.2 0.1 90148 18344 ? Sl 08:46 0:02 ruby /opt/nailgun/bin/agent
root 16877 0.0 0.0 7172 712 ? S 08:46 0:00 tee -a /var/log/nailgun-agent.log
root 16878 0.0 0.0 7164 688 ? S 08:46 0:00 /usr/bin/logger -t nailgun-agent
root 23772 0.0 0.0 9364 876 pts/0 S+ 09:01 0:00 grep --color=auto nail
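
One detail visible in this listing: the agent is wrapped in "flock -w 0 /var/lock/agent.lock", so while one run holds the lock, any later run gives up immediately rather than queueing. If a run hangs, the node therefore simply stops reporting. A rough Python equivalent of that wrapper, for illustration only (not Fuel code):

# Hypothetical equivalent of the "flock -w 0" wrapper seen above.
import fcntl
import sys

def run_exclusive(lock_path: str = "/var/lock/agent.lock") -> None:
    lock_file = open(lock_path, "w")
    try:
        # LOCK_NB mirrors "flock -w 0": do not wait if the lock is taken.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit(0)  # a previous (possibly hung) run still holds the lock
    try:
        pass  # the actual agent work would run here
    finally:
        fcntl.flock(lock_file, fcntl.LOCK_UN)
        lock_file.close()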

Revision history for this message
Łukasz Oleś (loles) wrote :

OK, it looks like the agent hangs when there are problems with disks and it cannot list them. I will create a new bug for this.
