nailgun agent with multipath stops working
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Fuel for OpenStack | | Critical | Fuel Python (Deprecated) | |
Bug Description
I have a Fuel 5.1 installation on CentOS.
The customer is using NetApp storage with iSCSI and multipath.
The very first time we try to create a volume from an image, cinder attaches an iSCSI volume to the controller, copies the image into it and unmounts the volume. Sometimes members of the multipath device don't disappear from the OS, and this causes the nailgun agent to crash.
Ohai has a filesystem plugin that executes a set of "blkid" commands, and if the volume is no longer connected to the host, those commands never finish.
I'd suggest removing that plugin from the ohai plugin directory. I did this on a working environment and everything continued working. AFAIK we don't use the info from this plugin for our purposes.
Roman Prykhodchenko (romcheg) wrote : | #1 |
Changed in fuel: | |
status: | New → Incomplete |
importance: | Undecided → Medium |
milestone: | none → 6.1 |
Andrey Grebennikov (agrebennikov) wrote : | #2 |
The agent works as follows: it finds the server it needs to connect to, sleeps for a random period of time, executes ohai, parses the data from its output, sends the info to the server and exits. In my case the agent hangs on the ohai step. Ohai in turn loads all the plugins it has; one of them is "filesystem.rb", which executes a set of commands that collect data about all filesystems present at the moment. The process "blkid -s TYPE" hangs and the agent's process never finishes its job, so the node appears as "Offline" in the Fuel UI.
To resolve the current issue, I had to restart the multipathd process so that it releases all non-existing devices; after that it becomes possible to kill those blkid processes as well.
Once I removed the plugin from the ohai directory, the agent started to operate properly.
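As an alternative to removing the plugin, the caller could stop waiting for a command that does not return. The following is only a minimal sketch of that idea, not the agent's or ohai's actual code: the helper name `run_with_deadline` and its parameters are hypothetical. Note that a process in uninterruptible sleep cannot be killed until the kernel releases it, so the sketch only lets the caller move on; it does not clean up the stuck process.

```ruby
# Hypothetical helper (not part of ohai or the nailgun agent).
# Runs argv and returns its stdout as a String, or nil if the command
# produced no end-of-output within `limit` seconds.
def run_with_deadline(argv, limit)
  reader, writer = IO.pipe
  pid = Process.spawn(*argv, out: writer, err: File::NULL)
  writer.close # keep only the child's copy of the write end open
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + limit
  output = +""

  loop do
    remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
    break if remaining <= 0
    # Wait for output, but never longer than the time left.
    break unless IO.select([reader], nil, nil, remaining)
    begin
      output << reader.read_nonblock(4096)
    rescue IO::WaitReadable
      next
    rescue EOFError
      # The child closed stdout: treat the command as finished.
      reader.close
      Process.detach(pid) # reap in the background, never block here
      return output
    end
  end

  # Deadline passed. SIGKILL is only acted on once the process leaves
  # uninterruptible sleep, so we do not wait for it to die.
  begin
    Process.kill("KILL", pid)
  rescue Errno::ESRCH
    # The process is already gone.
  end
  Process.detach(pid)
  reader.close
  nil
end
```

A call such as `run_with_deadline(["blkid", "-s", "TYPE"], 10)` would then return `nil` instead of blocking the whole run; the 10-second limit is an arbitrary illustrative value.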
Roman Prykhodchenko (romcheg) wrote : | #3 |
@Andrey: Thank you for the quick update.
Changed in fuel: | |
assignee: | nobody → Roman Prykhodchenko (romcheg) |
status: | Incomplete → Confirmed |
Michael Polenchuk (mpolenchuk) wrote : | #4 |
With EMC + multipath we see the same shi^W uninterruptible sleep (I/O) on blkid -s TYPE (and UUID/LABEL).
>> ohai/plugins/
...
# Gather more filesystem types via libuuid, even devices that's aren't mounted
popen4("blkid -s TYPE") do |pid, stdin, stdout, stderr|
...
end
# Gather device UUIDs via libuuid
popen4("blkid -s UUID") do |pid, stdin, stdout, stderr|
...
end
# Gather device labels via libuuid
popen4("blkid -s LABEL") do |pid, stdin, stdout, stderr|
...
end
Commenting out the above code works around the issue.
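To check whether a node is affected, one can look for blkid processes in state "D" (uninterruptible disk sleep) under /proc. This is a rough sketch assuming a Linux /proc layout; the helper names `parse_stat` and `stuck_pids` are made up for illustration and are not part of any Fuel tool.

```ruby
# Parses one /proc/<pid>/stat line into [pid, comm, state].
# The command name is wrapped in parentheses and may itself contain
# spaces or parentheses, so we cut at the last ")".
def parse_stat(line)
  open_idx = line.index("(")
  close_idx = line.rindex(")")
  return nil unless open_idx && close_idx && close_idx > open_idx
  pid = line[0...open_idx].to_i
  comm = line[(open_idx + 1)...close_idx]
  state = line[(close_idx + 1)..-1].split.first
  [pid, comm, state]
end

# Returns the PIDs of processes named `name` that are in state "D".
# proc_root is a parameter only so the function can be tried on a copy.
def stuck_pids(name, proc_root = "/proc")
  Dir.glob(File.join(proc_root, "[0-9]*", "stat")).map do |path|
    begin
      pid, comm, state = parse_stat(File.read(path))
    rescue SystemCallError
      next # the process exited between listing and reading
    end
    pid if comm == name && state == "D"
  end.compact
end
```

On an affected node, `stuck_pids("blkid").size` would give the number of hung blkid processes.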
Changed in fuel: | |
milestone: | 6.1 → 7.0 |
Vladimir Sharshov (vsharshov) wrote : | #5 |
Workaround: https:/
Can be moved to 8.0.
Changed in fuel: | |
assignee: | Roman Prykhodchenko (romcheg) → Fuel Python Team (fuel-python) |
status: | Confirmed → Won't Fix |
Changed in fuel: | |
milestone: | 7.0 → 8.0 |
status: | Won't Fix → Confirmed |
no longer affects: | fuel/8.0.x |
tags: | added: area-python |
Changed in fuel: | |
milestone: | 8.0 → 9.0 |
Alexander Kislitsky (akislitsky) wrote : | #6 |
We passed SCF in 8.0. Moving the bug to 9.0.
tags: | added: module-volumes |
Changed in fuel: | |
status: | Confirmed → Fix Committed |
Andrew Woodward (xarses) wrote : | #7 |
What resolved this?
Nastya Urlapova (aurlapova) wrote : | #8 |
Moved to 10.0, because I couldn't find a review for the fix and Andrey G. did provide additional information.
We should recheck it in the new release.
Changed in fuel: | |
status: | Fix Committed → Confirmed |
milestone: | 9.0 → 10.0 |
Alexander Gordeev (a-gordeev) wrote : | #9 |
Looks like a duplicate of https:/
Dmitry Pyzhov (dpyzhov) wrote : | #10 |
Marking as Fix Committed. QA team, please verify that the issue is not reproducible any more.
Changed in fuel: | |
status: | Confirmed → Fix Committed |
Anatolii Neliubin (aneliubin) wrote : | #11 |
Still experiencing the same issue on MOS 9.1 + EMC + multipath.
Anatolii Neliubin (aneliubin) wrote : | #12 |
Is there any possibility to backport it to previous versions of MOS? Customers are experiencing the same issue on MOS 7.0.
Anatolii Neliubin (aneliubin) wrote : | #13 |
Team, this bug seems to be critical, since there are about 30000 "blkid -s" processes consuming CPU and memory resources on a compute node.
Changed in fuel: | |
importance: | Medium → Critical |
milestone: | 10.0 → 7.0-mu-7 |
Denis Meltsaykin (dmeltsaykin) wrote : | #14 |
Anatolii, please don't change the original milestone of the bug; this makes reporting inefficient and adds a lot of confusion.
Changed in fuel: | |
milestone: | 7.0-mu-7 → 10.x-updates |
Denis Kostryukov (dkostryukov) wrote : | #15 |
Denis, one more customer has the same issue on MOS 7.0.
He has more than 27000 "blkid -s TYPE" processes on his node.
Is it possible to create a backport for MOS 7.0?
tags: | added: customer-found |
@Andrey: Could you please provide more debug information?