Network verification fails on deployed environment after few days of load simulation: 'net_probe#get_probing_info failed: #<Class:0x007f7be01a4d30>: execution expired'

Bug #1466020 reported by Artem Panchenko
This bug affects 2 people
Affects                    Status        Importance  Assigned to               Milestone
Fuel for OpenStack         Fix Released  High        Dima Shulyak
6.1.x                      Won't Fix     High        Fuel Python (Deprecated)
7.0.x                      Fix Released  High        Dima Shulyak

Bug Description

Fuel version info (6.1 build #521 RC1): http://paste.openstack.org/show/277715/

Network verification fails on an environment which was under load (Rally benchmark tests) for a few days:

Verification failed.
Method verify_networks. 78f04583-0afc-40aa-997b-fb0b3f7ddab4: MCollective call failed in agent 'net_probe', method 'get_probing_info', failed nodes: ID: 28 - Reason: execution expired . Inspect Astute logs for the details

Here are errors from mcollective logs (node-28):

E, [2015-06-16T09:32:00.206795 #7880] ERROR -- : agent.rb:111:in `rescue in handlemsg' net_probe#get_probing_info failed: #<Class:0x007f7be01a4d30>: execution expired
E, [2015-06-16T09:32:00.206890 #7880] ERROR -- : agent.rb:112:in `rescue in handlemsg' /usr/share/mcollective/plugins/mcollective/agent/net_probe.rb:229:in `stop_frame_listeners'
        /usr/share/mcollective/plugins/mcollective/agent/net_probe.rb:188:in `get_probing_info'
        /usr/share/mcollective/plugins/mcollective/agent/net_probe.rb:54:in `block in <class:Net_probe>'
        /usr/lib/ruby/vendor_ruby/mcollective/rpc/agent.rb:88:in `handlemsg'
        /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:126:in `block (2 levels) in dispatch'
        /usr/lib/ruby/1.9.1/timeout.rb:69:in `timeout'
        /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:125:in `block in dispatch'

Steps to reproduce:

1. Deploy env on bare-metal servers: Ubuntu, Ceph, NeutronVlan, 3 controllers, 2 computes
2. Simulate load on the cloud using Rally (for example, run some Nova-related tests for 3 days)
3. Run network verification from Fuel UI

Expected result:

- verification passes

Actual:

- verification fails

I found that processes created by some of the previous network checks are still alive on node-28:

root 7880 0.3 0.1 618896 23160 ? Ssl Jun10 30:52 ruby /usr/sbin/mcollectived --config /etc/mcollective/server.cfg
root 53617 0.0 0.0 4440 652 ? S Jun15 0:00 \_ sh -c "/usr/bin/ruby1.9.1" "/tmp/systemu_node_28_mirantis_com_1_7880_0_9649850401906507_1/program"
root 53619 0.0 0.0 4440 652 ? S Jun15 0:00 | \_ sh -c urlaccesscheck check 'http://172.16.39.130:8080/2014.2.2-6.1/ubuntu/auxiliary' 'http://mirror.fuel-infra.org/mos/ubuntu/' 'http://archive.ubuntu.com/ubuntu/' 'http://172.16.39.130:8080/2014.2.2-6.1/ubuntu/x86_64'
root 53622 0.0 0.1 72924 18380 ? S Jun15 0:00 | \_ /usr/bin/python /usr/bin/urlaccesscheck check http://172.16.39.130:8080/2014.2.2-6.1/ubuntu/auxiliary http://mirror.fuel-infra.org/mos/ubuntu/ http://archive.ubuntu.com/ubuntu/ http://172.16.39.130:8080/2014.2.2-6.1/ubuntu/x86_64
root 38790 0.0 0.1 618896 20848 ? Sl 11:54 0:00 \_ ruby /usr/sbin/mcollectived --config /etc/mcollective/server.cfg
root 38794 0.0 0.1 90160 20780 ? S 11:54 0:00 \_ /usr/bin/python /usr/bin/net_probe.py -c /tmp/net_probe20150616-7880-15w3xg1
root 38803 0.0 0.0 19144 4284 ? S 11:54 0:00 \_ tcpdump -i eth1 -w /var/run/pcap_dir/eth1.pcap udp and dst port 31337
root 38804 0.0 0.0 19144 4280 ? S 11:54 0:00 \_ tcpdump -i eth1 -w /var/run/pcap_dir/vlan_eth1.pcap vlan and udp and dst port 31337
root 38806 0.0 0.0 19144 4288 ? S 11:54 0:00 \_ tcpdump -i eth0 -w /var/run/pcap_dir/eth0.pcap udp and dst port 31337
root 38807 0.0 0.0 19144 4280 ? S 11:54 0:00 \_ tcpdump -i eth0 -w /var/run/pcap_dir/vlan_eth0.pcap vlan and udp and dst port 31337

After I killed them (the Python ones), network verification passed.
The diagnostic snapshot doesn't contain remote logs due to bug #1465262, so I'm attaching the logs (mcollective and net_probe) from node-28.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
summary: Network verification fails on deployed environment after few days of
- load simulating: 'net_probe#get_probing_info failed:
+ load simulation: 'net_probe#get_probing_info failed:
#<Class:0x007f7be01a4d30>: execution expired'
Revision history for this message
Dima Shulyak (dshulyak) wrote :

This situation can happen if mcollective is interrupted by a timeout or any other unexpected crash.
In that case we try to stop all net_probe.py processes on the next run (there are even two such mechanisms):

1. https://github.com/stackforge/fuel-astute/blob/master/mcagents/net_probe.rb#L205
2. https://github.com/stackforge/fuel-astute/blob/master/mcagents/net_probe.rb#L102

We need to investigate further what happened.
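
For illustration only, a minimal Python sketch of that cleanup idea (the real mechanisms live in the Ruby agent linked above); the pid-file directory matches the one mentioned later in this bug, everything else is an assumption:

    import os
    import signal

    PIDFILE_DIR = '/var/run/net_probe'  # net_probe.py records its listener pids here

    def stop_stale_listeners():
        """Send SIGINT to every net_probe.py listener recorded in a pid file."""
        if not os.path.isdir(PIDFILE_DIR):
            return
        for name in os.listdir(PIDFILE_DIR):
            path = os.path.join(PIDFILE_DIR, name)
            try:
                with open(path) as f:
                    pid = int(f.read().strip())
                os.kill(pid, signal.SIGINT)  # ask the listener to stop and flush its results
            except (ValueError, OSError):
                # stale pid file or the process is already gone; drop the file
                try:
                    os.unlink(path)
                except OSError:
                    pass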

Changed in fuel:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Dima Shulyak (dshulyak) wrote :

Setting Medium priority because it has been reproduced only once. If it fails again, let's increase the priority.

Changed in fuel:
milestone: 6.1 → 7.0
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Dima Shulyak (dshulyak)
status: Confirmed → In Progress
Dima Shulyak (dshulyak)
Changed in fuel:
importance: Medium → High
Revision history for this message
Andrey Maximov (maximov) wrote :
tags: added: 6.1-mu-1
Revision history for this message
Dima Shulyak (dshulyak) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/197511

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/6.1)

Reviewed: https://review.openstack.org/197511
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=16d89478dc2833dfceebb7be124e7a69f1a1fe75
Submitter: Jenkins
Branch: stable/6.1

commit 16d89478dc2833dfceebb7be124e7a69f1a1fe75
Author: Dmitry Shulyak <email address hidden>
Date: Thu Jun 18 11:57:50 2015 +0300

    Kill hung frame listeners on each verify networks event

    The network checker relies on the interrupt signal for its own
    internal logic, so after the interrupt is received it will
    do some parsing work and won't finish immediately.

    But before starting frame listeners in a new run, the old frame
    listeners may not have been interrupted because mcollective failed
    due to a timeout or some other unforeseen problem.
    So we need to completely shut down the frame listeners with kill -9.

    Tested:
      Spawned a bunch of: /usr/bin/python /usr/bin/net_probe.py -c listen.conf
      Executed network verification from the UI: all old listeners were killed,
      verification succeeded.
      Also executed the same test with the pid files removed from the /var/run/net_probe folder.

    Change-Id: I7aa48d00b3f86827ba9860a3e3797ed5bd0eb713
    Closes-Bug: 1466020
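
A rough Python sketch of the kill -9 escalation described above (the actual change is in the Ruby agent; the grace period and function name here are assumptions):

    import os
    import signal
    import time

    def force_stop(pid, grace=10):
        """SIGINT a listener, then SIGKILL it if it is still alive after `grace` seconds."""
        try:
            os.kill(pid, signal.SIGINT)      # give it a chance to finish parsing and exit
        except OSError:
            return                           # already gone
        deadline = time.time() + grace
        while time.time() < deadline:
            try:
                os.kill(pid, 0)              # signal 0 only probes; raises OSError once the pid is gone
            except OSError:
                return
            time.sleep(0.5)
        try:
            os.kill(pid, signal.SIGKILL)     # never exited: shut it down completely
        except OSError:
            pass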

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Fixes will be reverted

Changed in fuel:
status: Fix Committed → Triaged
Revision history for this message
Dima Shulyak (dshulyak) wrote :

Caught a race condition on BVT. The fix needs to take such cases into account:
https://bugs.launchpad.net/fuel/+bug/1470729

Revision history for this message
Sam Stoelinga (sammiestoel) wrote :

Just had the same issue on a new Fuel install without any existing environment deployed. I had to manually kill a network verification task, and it seems that caused the Python net_probe.py processes to still be alive on some of the nodes.

Ryan Moe (rmoe)
tags: added: customer-found
Revision history for this message
Alex Schultz (alex-schultz) wrote :

So I think we've traced it down to the following...

'execution expired' means the script never terminates. As part of the net_probe mcollective task, we call stop_frame_listeners[0], which attempts to send a SIGINT to the running processes[1]. Unfortunately, if the SIGINT does not terminate the running processes, it will enter an infinite loop[2]. Going over to the SIGINT which is sent to the net_probe code from nailgun, we can see where the INT is handled as part of the _run function in the Listener class[3]. We then trace the function execution and see that it loops through all the current listeners (forked tcpdump processes) and attempts to terminate them[4]. The issue seems to be that the tcpdump process is not exiting, or is itself hanging. I believe this to be related to the lack of internet connectivity on the hosts being checked. I found a post on Server Fault[5] that points to DNS being an issue, and since we are not skipping DNS resolution as part of our tcpdump command-line options[6], I believe this is what is causing it to hang (see the sketch after the reference list below).

[0] https://github.com/stackforge/fuel-astute/blob/master/mcagents/net_probe.rb#L188
[1] https://github.com/stackforge/fuel-astute/blob/master/mcagents/net_probe.rb#L209-L216
[2] https://github.com/stackforge/fuel-astute/blob/master/mcagents/net_probe.rb#L218-L230
[3] https://github.com/stackforge/fuel-web/blob/master/network_checker/network_checker/net_check/api.py#L467
[4] https://github.com/stackforge/fuel-web/blob/master/network_checker/network_checker/net_check/api.py#L472-L480
[5] http://serverfault.com/questions/697854/tcpdump-freezes-and-not-capturing-properly-without-internet-connection
[6] https://github.com/stackforge/fuel-web/blob/master/network_checker/network_checker/net_check/api.py#L539-L542
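
To make that hypothesis concrete, here is a hedged Python sketch: start tcpdump with -n so it never blocks on reverse-DNS lookups, and stop it with a bounded wait instead of looping forever. The filter mirrors the process listing above; the grace period and function names are illustrative, not the project's actual code.

    import signal
    import subprocess
    import time

    def start_capture(iface, pcap_path):
        """Start tcpdump with name resolution disabled (-n)."""
        cmd = ['tcpdump', '-n', '-i', iface, '-w', pcap_path,
               'udp', 'and', 'dst', 'port', '31337']
        return subprocess.Popen(cmd)

    def stop_capture(proc, grace=5):
        """SIGINT the capture, wait up to `grace` seconds, then kill it."""
        proc.send_signal(signal.SIGINT)      # tcpdump flushes the pcap file on SIGINT
        deadline = time.time() + grace
        while proc.poll() is None and time.time() < deadline:
            time.sleep(0.2)
        if proc.poll() is None:              # still hanging: do not wait forever
            proc.kill()
            proc.wait()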

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/206196

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Alex Schultz (alex-schultz)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.openstack.org/206196
Reason: I'm not able to properly recreate the issue to validate that this will solve it. I think it's something with the tcpdump process hanging but without being able to reproduce it I cannot confirm this as a fix.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Based on some testing I can't confirm the DNS issue, but the error message still points to the tcpdump process not terminating properly. I'm setting the status back so someone else can have a look and see if they can figure it out, since I can't reproduce the issue.

tags: added: tricky
Revision history for this message
Dima Shulyak (dshulyak) wrote :

I think the issue described by @Alex is different from the one originally reported, because the first one was resolved by manually killing the workers and the original error was caused by a timeout, while the second one was reproducible on every run. @Alex, can you please attach a snapshot or the network checker logs from the environment where this issue was found?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/211510

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-web (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/211511

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/211567

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/211569

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/211569
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=c2819dd7136721118a1a92b130035ce5485c88b8
Submitter: Jenkins
Branch: master

commit c2819dd7136721118a1a92b130035ce5485c88b8
Author: Dmitry Shulyak <email address hidden>
Date: Tue Aug 11 15:28:58 2015 +0300

    Add collect_timeout and cleanup hung listeners properly

    netprobe will be started with a *collect_timeout* parameter
    that is lower than the net_probe.rb agent timeout.

    Hung frame listeners will be matched correctly and cleaned up with
    a termination signal.

    Change-Id: I7dc2adf6aa1d199e2b8fe58718aad5e56f85b9d1
    Partial-Bug: 1466020

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/211510
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=7c4526ce1c436b3805964fd6cd4d9a661c50f5c2
Submitter: Jenkins
Branch: master

commit 7c4526ce1c436b3805964fd6cd4d9a661c50f5c2
Author: Dmitry Shulyak <email address hidden>
Date: Tue Aug 11 14:10:36 2015 +0300

    Introduce a timeout for collecting information from pcap files

    A timeout for the collecting procedure will ensure that net_probe.py
    exits before the timeout is raised in net_probe.rb.

    Change-Id: I69e53809bca717919998dc06b4c9315c5642d184
    Closes-Bug: 1466020
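
A minimal sketch of the idea, assuming a collect_timeout value handed down from the agent: bound the pcap-collection step with an alarm so the Python side always exits before the Ruby agent's own timeout fires. The function names below are illustrative, not the actual net_probe.py code.

    import signal

    class CollectTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise CollectTimeout()

    def collect_with_timeout(collect, collect_timeout):
        """Run collect() but give up after collect_timeout seconds.

        collect_timeout is expected to be lower than the net_probe.rb
        agent timeout, so the agent never reports 'execution expired'.
        """
        signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(collect_timeout)
        try:
            return collect()
        except CollectTimeout:
            return None          # return a partial/empty result instead of hanging
        finally:
            signal.alarm(0)      # always clear the pending alarm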

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/211567
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=4cea3178f947815a2df8f554b91a53693e978c94
Submitter: Jenkins
Branch: master

commit 4cea3178f947815a2df8f554b91a53693e978c94
Author: Dmitry Shulyak <email address hidden>
Date: Tue Aug 11 15:58:02 2015 +0300

    Don't read all packets from the pcap file at once

    scapy.utils.PcapReader provides an iterator that reads a packet
    only when it is called.
    Using this approach limits the memory footprint to 19 MB.

    Change-Id: I5e26c461ccc074800ae359ee510e09c527d1ddac
    Partial-Bug: 1466020
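
For reference, the difference in approach: scapy's rdpcap loads the whole capture into memory, while scapy.utils.PcapReader yields packets one at a time. A small sketch (the file path, filter and helper name are only illustrative; the port matches the probes in the process listing above):

    from scapy.layers.inet import UDP
    from scapy.utils import PcapReader

    def count_udp_probes(pcap_path, port=31337):
        """Stream packets from a pcap file instead of loading them all at once."""
        count = 0
        reader = PcapReader(pcap_path)   # iterator: only one packet in memory at a time
        try:
            for pkt in reader:
                if UDP in pkt and pkt[UDP].dport == port:
                    count += 1
        finally:
            reader.close()
        return count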

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Dmitry Shulyak (<email address hidden>) on branch: master
Review: https://review.openstack.org/211511

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

verified for 7.0 {"build_id": "301", "build_number": "301", "release_versions": {"2015.1.0-7.0": {"VERSION": {"build_id": "301", "build_number": "301", "api": "1.0", "fuel-library_sha": "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd", "nailgun_sha": "4162b0c15adb425b37608c787944d1983f543aa8", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "50e90af6e3d560e9085ff71d2950cfbcca91af67", "production": "docker", "python-fuelclient_sha": "486bde57cda1badb68f915f66c61b544108606f3", "astute_sha": "6c5b73f93e24cc781c809db9159927655ced5012", "fuel-ostf_sha": "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c", "release": "7.0", "fuelmain_sha": "a65d453215edb0284a2e4761be7a156bb5627677"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd", "nailgun_sha": "4162b0c15adb425b37608c787944d1983f543aa8", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "50e90af6e3d560e9085ff71d2950cfbcca91af67", "production": "docker", "python-fuelclient_sha": "486bde57cda1badb68f915f66c61b544108606f3", "astute_sha": "6c5b73f93e24cc781c809db9159927655ced5012", "fuel-ostf_sha": "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c", "release": "7.0", "fuelmain_sha": "a65d453215edb0284a2e4761be7a156bb5627677"}

Steps to verify:
Deploy an HA env on VMs, leave it under load for 2 days (use Rally for this), run network verification several times, and check the running processes on the slaves.

Changed in fuel:
status: Fix Committed → Fix Released
tags: added: long-haul-testing
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Closing this as Won't Fix for 6.1, as there is no clear solution and the impact is not too big.
