Network verification fails on deployed environment after a few days of load simulation: 'net_probe#get_probing_info failed: #<Class:0x007f7be01a4d30>: execution expired'
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Fuel for OpenStack | Fix Released | High | Dima Shulyak |
6.1.x | Won't Fix | High | Fuel Python (Deprecated) |
7.0.x | Fix Released | High | Dima Shulyak |
Bug Description
Fuel version info (6.1 build #521 RC1): http://
Network verification fails on an environment that was under load (Rally benchmark test) for a few days:
Verification failed.
Method verify_networks. 78f04583-
Here are errors from mcollective logs (node-28):
E, [2015-06-
E, [2015-06-
Steps to reproduce:
1. Deploy env on bare-metal servers: Ubuntu, Ceph, NeutronVlan, 3 controllers, 2 computes
2. Simulate load on the cloud using Rally (for example, run a Nova-related test for 3 days)
3. Run network verification from Fuel UI
Expected result:
- verification passed
Actual:
- verification fails
I found that processes created by some of the previous network checks are still alive on node-28:
root 7880 0.3 0.1 618896 23160 ? Ssl Jun10 30:52 ruby /usr/sbin/
root 53617 0.0 0.0 4440 652 ? S Jun15 0:00 \_ sh -c "/usr/bin/
root 53619 0.0 0.0 4440 652 ? S Jun15 0:00 | \_ sh -c urlaccesscheck check 'http://
uel-infra.
root 53622 0.0 0.1 72924 18380 ? S Jun15 0:00 | \_ /usr/bin/python /usr/bin/
iary http://
root 38790 0.0 0.1 618896 20848 ? Sl 11:54 0:00 \_ ruby /usr/sbin/
root 38794 0.0 0.1 90160 20780 ? S 11:54 0:00 \_ /usr/bin/python /usr/bin/
root 38803 0.0 0.0 19144 4284 ? S 11:54 0:00 \_ tcpdump -i eth1 -w /var/run/
root 38804 0.0 0.0 19144 4280 ? S 11:54 0:00 \_ tcpdump -i eth1 -w /var/run/
root 38806 0.0 0.0 19144 4288 ? S 11:54 0:00 \_ tcpdump -i eth0 -w /var/run/
root 38807 0.0 0.0 19144 4280 ? S 11:54 0:00 \_ tcpdump -i eth0 -w /var/run/
After I killed them (the Python ones), network verification passed.
The diagnostic snapshot doesn't contain remote logs due to bug #1465262, so I'm attaching logs (mcollective and net_probe) from node-28.
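The manual check above (look for long-lived leftover processes, then kill the Python ones) can be sketched as a small script. This is only an illustration, not part of Fuel: the function names `list_processes` and `find_stale_pids`, the one-hour threshold, and the use of the `etimes` column of procps `ps` (elapsed time in seconds, assumed to be available on the node) are all assumptions.

```python
import re
import subprocess


def list_processes():
    """Return raw `ps` lines as 'PID ELAPSED_SECONDS COMMAND' (assumes procps with `etimes`)."""
    out = subprocess.check_output(["ps", "-eo", "pid=,etimes=,args="])
    return out.decode("utf-8", "replace").splitlines()


def find_stale_pids(ps_lines, pattern, max_age_seconds):
    """Return PIDs whose command line matches `pattern` and that are older than the limit."""
    matcher = re.compile(pattern)
    stale = []
    for line in ps_lines:
        parts = line.split(None, 2)  # pid, elapsed seconds, full command line
        if len(parts) < 3:
            continue
        pid_text, etimes_text, args = parts
        if not (pid_text.isdigit() and etimes_text.isdigit()):
            continue
        if matcher.search(args) and int(etimes_text) > max_age_seconds:
            stale.append(int(pid_text))
    return stale


if __name__ == "__main__":
    # Hypothetical threshold: anything matching net_probe.py and older than one hour.
    for pid in find_stale_pids(list_processes(), r"net_probe\.py", 3600):
        print("stale process:", pid)
```

The script only reports candidates; whether to kill them is left to the operator, since a check that is still legitimately running would also match the pattern if the threshold is set too low.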
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Dima Shulyak (dshulyak)
status: Confirmed → In Progress
Changed in fuel:
importance: Medium → High
tags: added: 6.1-mu-1
Changed in fuel:
status: In Progress → Fix Committed
tags: added: customer-found
tags: added: long-haul-testing
This situation can happen if mcollective is interrupted by a timeout or any other unexpected crash.
In that case we try to stop all net_probe.py processes on the next run (there are even two such mechanisms):
1. https://github.com/stackforge/fuel-astute/blob/master/mcagents/net_probe.rb#L205
2. https://github.com/stackforge/fuel-astute/blob/master/mcagents/net_probe.rb#L102
We need to investigate further to find out what happened.
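To illustrate how a timeout in the parent can leave child processes behind, and one general way to avoid it, here is a minimal sketch. It is not the fuel-astute implementation (the agent linked above is written in Ruby); the function name `run_with_timeout` is hypothetical. The idea is to start the command in its own session so that, on timeout, the whole process group can be killed, including grandchildren such as `tcpdump`.

```python
import os
import signal
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run `cmd` in a new session; on timeout kill its whole process group.

    Returns the exit code, or None if the command had to be killed.
    """
    # start_new_session=True makes the child a session and group leader,
    # so its process group id equals its pid.
    proc = subprocess.Popen(cmd, start_new_session=True)
    try:
        return proc.wait(timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        # Kill the child and everything it spawned, then reap it.
        os.killpg(proc.pid, signal.SIGKILL)
        proc.wait()
        return None
```

With only the direct child killed (or nothing killed at all), anything it spawned keeps running after the caller reports "execution expired", which matches the leftover processes seen on node-28. This sketch is POSIX-only and requires Python 3.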