Network verification fails on deployed environment after few days of load simulation: 'net_probe#get_probing_info failed: #<Class:0x007f7be01a4d30>: execution expired'

Bug #1466020 reported by Artem Panchenko
This bug affects 2 people
Affects                    Status        Importance  Assigned to               Milestone
Fuel for OpenStack         Fix Released  High        Dima Shulyak
6.1.x                      Won't Fix     High        Fuel Python (Deprecated)
7.0.x                      Fix Released  High        Dima Shulyak

Bug Description

Fuel version info (6.1 build #521 RC1): http://paste.openstack.org/show/277715/

Network verification fails on an environment which was under load (Rally benchmark tests) for a few days:

Verification failed.
Method verify_networks. 78f04583-0afc-40aa-997b-fb0b3f7ddab4: MCollective call failed in agent 'net_probe', method 'get_probing_info', failed nodes: ID: 28 - Reason: execution expired . Inspect Astute logs for the details

Here are errors from mcollective logs (node-28):

E, [2015-06-16T09:32:00.206795 #7880] ERROR -- : agent.rb:111:in `rescue in handlemsg' net_probe#get_probing_info failed: #<Class:0x007f7be01a4d30>: execution expired
E, [2015-06-16T09:32:00.206890 #7880] ERROR -- : agent.rb:112:in `rescue in handlemsg' /usr/share/mcollective/plugins/mcollective/agent/net_probe.rb:229:in `stop_frame_listeners'
        /usr/share/mcollective/plugins/mcollective/agent/net_probe.rb:188:in `get_probing_info'
        /usr/share/mcollective/plugins/mcollective/agent/net_probe.rb:54:in `block in <class:Net_probe>'
        /usr/lib/ruby/vendor_ruby/mcollective/rpc/agent.rb:88:in `handlemsg'
        /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:126:in `block (2 levels) in dispatch'
        /usr/lib/ruby/1.9.1/timeout.rb:69:in `timeout'
        /usr/lib/ruby/vendor_ruby/mcollective/agents.rb:125:in `block in dispatch'

Steps to reproduce:

1. Deploy env on bare-metal servers: Ubuntu, Ceph, NeutronVlan, 3 controllers, 2 computes
2. Simulate load on the cloud using Rally (for example, run some Nova-related tests for 3 days)
3. Run network verification from Fuel UI

Expected result:

- verification passes

Actual:

- verification fails

I found that processes created by some of the previous network checks are still alive on node-28:

root 7880 0.3 0.1 618896 23160 ? Ssl Jun10 30:52 ruby /usr/sbin/mcollectived --config /etc/mcollective/server.cfg
root 53617 0.0 0.0 4440 652 ? S Jun15 0:00 \_ sh -c "/usr/bin/ruby1.9.1" "/tmp/systemu_node_28_mirantis_com_1_7880_0_9649850401906507_1/program"
root 53619 0.0 0.0 4440 652 ? S Jun15 0:00 | \_ sh -c urlaccesscheck check 'http://172.16.39.130:8080/2014.2.2-6.1/ubuntu/auxiliary' 'http://mirror.fuel-infra.org/mos/ubuntu/' 'http://archive.ubuntu.com/ubuntu/' 'http://172.16.39.130:8080/2014.2.2-6.1/ubuntu/x86_64'
root 53622 0.0 0.1 72924 18380 ? S Jun15 0:00 | \_ /usr/bin/python /usr/bin/urlaccesscheck check http://172.16.39.130:8080/2014.2.2-6.1/ubuntu/auxiliary http://mirror.fuel-infra.org/mos/ubuntu/ http://archive.ubuntu.com/ubuntu/ http://172.16.39.130:8080/2014.2.2-6.1/ubuntu/x86_64
root 38790 0.0 0.1 618896 20848 ? Sl 11:54 0:00 \_ ruby /usr/sbin/mcollectived --config /etc/mcollective/server.cfg
root 38794 0.0 0.1 90160 20780 ? S 11:54 0:00 \_ /usr/bin/python /usr/bin/net_probe.py -c /tmp/net_probe20150616-7880-15w3xg1
root 38803 0.0 0.0 19144 4284 ? S 11:54 0:00 \_ tcpdump -i eth1 -w /var/run/pcap_dir/eth1.pcap udp and dst port 31337
root 38804 0.0 0.0 19144 4280 ? S 11:54 0:00 \_ tcpdump -i eth1 -w /var/run/pcap_dir/vlan_eth1.pcap vlan and udp and dst port 31337
root 38806 0.0 0.0 19144 4288 ? S 11:54 0:00 \_ tcpdump -i eth0 -w /var/run/pcap_dir/eth0.pcap udp and dst port 31337
root 38807 0.0 0.0 19144 4280 ? S 11:54 0:00 \_ tcpdump -i eth0 -w /var/run/pcap_dir/vlan_eth0.pcap vlan and udp and dst port 31337

After I killed them (the Python ones), network verification passed.
The diagnostic snapshot doesn't contain remote logs due to bug #1465262, so I'm attaching the logs (mcollective and net_probe) from node-28.

Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
Revision history for this message
Artem Panchenko (apanchenko-8) wrote :
summary: Network verification fails on deployed environment after few days of
- load simulating: 'net_probe#get_probing_info failed:
+ load simulation: 'net_probe#get_probing_info failed:
#<Class:0x007f7be01a4d30>: execution expired'
Revision history for this message
Dima Shulyak (dshulyak) wrote :

This situation can happen if mcollective is interrupted by a timeout or any other unexpected crash.
In that case we try to stop all net_probe.py processes on the next run (there are even two such mechanisms):

1. https://github.com/stackforge/fuel-astute/blob/master/mcagents/net_probe.rb#L205
2. https://github.com/stackforge/fuel-astute/blob/master/mcagents/net_probe.rb#L102

We need to investigate further what happened.
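
For illustration only, a minimal Python sketch of that cleanup idea (the real mechanisms live in the Ruby agent linked above); the pid-file directory matches the one mentioned later in this bug, everything else is an assumption:

    import os
    import signal

    PIDFILE_DIR = '/var/run/net_probe'  # net_probe.py records its listener pids here

    def stop_stale_listeners():
        """Send SIGINT to every net_probe.py listener recorded in a pid file."""
        if not os.path.isdir(PIDFILE_DIR):
            return
        for name in os.listdir(PIDFILE_DIR):
            path = os.path.join(PIDFILE_DIR, name)
            try:
                with open(path) as f:
                    pid = int(f.read().strip())
                os.kill(pid, signal.SIGINT)  # ask the listener to stop and flush its results
            except (ValueError, OSError):
                # stale pid file or the process is already gone; drop the file
                try:
                    os.unlink(path)
                except OSError:
                    pass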

Changed in fuel:
status: New → Confirmed
importance: Undecided → Medium
Revision history for this message
Dima Shulyak (dshulyak) wrote :

Setting Medium priority because it has been reproduced only once. If it fails again, let's increase the priority.

Changed in fuel:
milestone: 6.1 → 7.0
Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Dima Shulyak (dshulyak)
status: Confirmed → In Progress
Dima Shulyak (dshulyak)
Changed in fuel:
importance: Medium → High
Revision history for this message
Andrey Maximov (maximov) wrote :
tags: added: 6.1-mu-1
Revision history for this message
Dima Shulyak (dshulyak) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (stable/6.1)

Fix proposed to branch: stable/6.1
Review: https://review.openstack.org/197511

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (stable/6.1)

Reviewed: https://review.openstack.org/197511
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=16d89478dc2833dfceebb7be124e7a69f1a1fe75
Submitter: Jenkins
Branch: stable/6.1

commit 16d89478dc2833dfceebb7be124e7a69f1a1fe75
Author: Dmitry Shulyak <email address hidden>
Date: Thu Jun 18 11:57:50 2015 +0300

    Kill hung frame listeners on each verify networks event

    The network checker relies on the interrupt signal for its own
    internal logic, so after the interrupt is received it will
    do some parsing work and won't finish immediately.

    But before starting frame listeners in a new run, the old frame
    listeners may not have been interrupted because mcollective failed
    due to a timeout or some other unforeseen problem.
    So we need to completely shut down the frame listeners with kill -9.

    Tested:
      Spawned a bunch of: /usr/bin/python /usr/bin/net_probe.py -c listen.conf
      Executed network verification from the UI: all old listeners were killed,
      verification succeeded.
      Also executed the same test with the pid files removed from the /var/run/net_probe folder.

    Change-Id: I7aa48d00b3f86827ba9860a3e3797ed5bd0eb713
    Closes-Bug: 1466020
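
A rough Python sketch of the kill -9 escalation described above (the actual change is in the Ruby agent; the grace period and function name here are assumptions):

    import os
    import signal
    import time

    def force_stop(pid, grace=10):
        """SIGINT a listener, then SIGKILL it if it is still alive after `grace` seconds."""
        try:
            os.kill(pid, signal.SIGINT)      # give it a chance to finish parsing and exit
        except OSError:
            return                           # already gone
        deadline = time.time() + grace
        while time.time() < deadline:
            try:
                os.kill(pid, 0)              # signal 0 only probes; raises OSError once the pid is gone
            except OSError:
                return
            time.sleep(0.5)
        try:
            os.kill(pid, signal.SIGKILL)     # never exited: shut it down completely
        except OSError:
            pass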

Revision history for this message
Dima Shulyak (dshulyak) wrote :

Fixes will be reverted

Changed in fuel:
status: Fix Committed → Triaged
Revision history for this message
Dima Shulyak (dshulyak) wrote :

Caught a race condition on BVT. The fix needs to take such cases into account:
https://bugs.launchpad.net/fuel/+bug/1470729

Revision history for this message
Sam Stoelinga (sammiestoel) wrote :

Just had the same issue on a new Fuel install without any existing environment deployed. I had to manually kill a network verification task, and it seems that caused the Python net_probe.py processes to still be alive on some of the nodes.

Ryan Moe (rmoe)
tags: added: customer-found
Revision history for this message
Alex Schultz (alex-schultz) wrote :

So I think we've traced it down to the following...

'execution expired' means the script never terminates. As part of the net_probe mcollective task, we call stop_frame_listeners[0], which attempts to send a SIGINT to the running processes[1]. Unfortunately, if the SIGINT does not terminate the running processes, it will enter an infinite loop[2]. Going over to the SIGINT which is sent to the net_probe code from nailgun, we can see where the INT is handled as part of the _run function in the Listener class[3]. We then trace the function execution and see that it loops through all the current listeners (forked tcpdump processes) and attempts to terminate them[4]. The issue seems to be that the tcpdump process is not exiting, or is itself hanging. I believe this to be related to the lack of internet connectivity on the hosts being checked. I found a post on Server Fault[5] that points to DNS being an issue, and since we are not skipping DNS resolution as part of our tcpdump command-line options[6], I believe this is what is causing it to hang (see the sketch after the reference list below).

[0] https://github.com/stackforge/fuel-astute/blob/master/mcagents/net_probe.rb#L188
[1] https://github.com/stackforge/fuel-astute/blob/master/mcagents/net_probe.rb#L209-L216
[2] https://github.com/stackforge/fuel-astute/blob/master/mcagents/net_probe.rb#L218-L230
[3] https://github.com/stackforge/fuel-web/blob/master/network_checker/network_checker/net_check/api.py#L467
[4] https://github.com/stackforge/fuel-web/blob/master/network_checker/network_checker/net_check/api.py#L472-L480
[5] http://serverfault.com/questions/697854/tcpdump-freezes-and-not-capturing-properly-without-internet-connection
[6] https://github.com/stackforge/fuel-web/blob/master/network_checker/network_checker/net_check/api.py#L539-L542
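
To make that hypothesis concrete, here is a hedged Python sketch: start tcpdump with -n so it never blocks on reverse-DNS lookups, and stop it with a bounded wait instead of looping forever. The filter mirrors the process listing above; the grace period and function names are illustrative, not the project's actual code.

    import signal
    import subprocess
    import time

    def start_capture(iface, pcap_path):
        """Start tcpdump with name resolution disabled (-n)."""
        cmd = ['tcpdump', '-n', '-i', iface, '-w', pcap_path,
               'udp', 'and', 'dst', 'port', '31337']
        return subprocess.Popen(cmd)

    def stop_capture(proc, grace=5):
        """SIGINT the capture, wait up to `grace` seconds, then kill it."""
        proc.send_signal(signal.SIGINT)      # tcpdump flushes the pcap file on SIGINT
        deadline = time.time() + grace
        while proc.poll() is None and time.time() < deadline:
            time.sleep(0.2)
        if proc.poll() is None:              # still hanging: do not wait forever
            proc.kill()
            proc.wait()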

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/206196

Changed in fuel:
assignee: Fuel Python Team (fuel-python) → Alex Schultz (alex-schultz)
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Alex Schultz (<email address hidden>) on branch: master
Review: https://review.openstack.org/206196
Reason: I'm not able to properly recreate the issue to validate that this will solve it. I think it's something with the tcpdump process hanging but without being able to reproduce it I cannot confirm this as a fix.

Revision history for this message
Alex Schultz (alex-schultz) wrote :

Based on some testing I can't confirm the DNS issue, but the error message still points to the tcpdump process not terminating properly. I'm setting the status back so someone else can have a look and see if they can figure it out, since I can't reproduce the issue.

tags: added: tricky
Revision history for this message
Dima Shulyak (dshulyak) wrote :

I think the issue described by @Alex is different from the one originally reported, because the first one was resolved by manually killing the workers and the original error was caused by a timeout, while the second one was reproducible on every run. @Alex, can you please attach a snapshot or the network checker logs from the environment where this issue was found?

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/211510

Changed in fuel:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-web (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/211511

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-web (master)

Fix proposed to branch: master
Review: https://review.openstack.org/211567

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-astute (master)

Fix proposed to branch: master
Review: https://review.openstack.org/211569

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/211569
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=c2819dd7136721118a1a92b130035ce5485c88b8
Submitter: Jenkins
Branch: master

commit c2819dd7136721118a1a92b130035ce5485c88b8
Author: Dmitry Shulyak <email address hidden>
Date: Tue Aug 11 15:28:58 2015 +0300

    Add collect_timeout and cleanup hung listeners properly

    netprobe will be started with a *collect_timeout* parameter
    that is lower than the net_probe.rb agent timeout.

    Hung frame listeners will be matched correctly and cleaned up with
    a termination signal.

    Change-Id: I7dc2adf6aa1d199e2b8fe58718aad5e56f85b9d1
    Partial-Bug: 1466020

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-web (master)

Reviewed: https://review.openstack.org/211510
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=7c4526ce1c436b3805964fd6cd4d9a661c50f5c2
Submitter: Jenkins
Branch: master

commit 7c4526ce1c436b3805964fd6cd4d9a661c50f5c2
Author: Dmitry Shulyak <email address hidden>
Date: Tue Aug 11 14:10:36 2015 +0300

    Introduce a timeout for collecting information from pcap files

    A timeout for the collecting procedure will ensure that net_probe.py
    exits before the timeout is raised in net_probe.rb.

    Change-Id: I69e53809bca717919998dc06b4c9315c5642d184
    Closes-Bug: 1466020
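
A minimal sketch of the idea, assuming a collect_timeout value handed down from the agent: bound the pcap-collection step with an alarm so the Python side always exits before the Ruby agent's own timeout fires. The function names below are illustrative, not the actual net_probe.py code.

    import signal

    class CollectTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise CollectTimeout()

    def collect_with_timeout(collect, collect_timeout):
        """Run collect() but give up after collect_timeout seconds.

        collect_timeout is expected to be lower than the net_probe.rb
        agent timeout, so the agent never reports 'execution expired'.
        """
        signal.signal(signal.SIGALRM, _on_alarm)
        signal.alarm(collect_timeout)
        try:
            return collect()
        except CollectTimeout:
            return None          # return a partial/empty result instead of hanging
        finally:
            signal.alarm(0)      # always clear the pending alarm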

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.openstack.org/211567
Committed: https://git.openstack.org/cgit/stackforge/fuel-web/commit/?id=4cea3178f947815a2df8f554b91a53693e978c94
Submitter: Jenkins
Branch: master

commit 4cea3178f947815a2df8f554b91a53693e978c94
Author: Dmitry Shulyak <email address hidden>
Date: Tue Aug 11 15:58:02 2015 +0300

    Don't read all packets from the pcap file at once

    scapy.utils.PcapReader provides an iterator that reads a packet
    only when it is called.
    Using this approach limits the memory footprint to 19 MB.

    Change-Id: I5e26c461ccc074800ae359ee510e09c527d1ddac
    Partial-Bug: 1466020
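
For reference, the difference in approach: scapy's rdpcap loads the whole capture into memory, while scapy.utils.PcapReader yields packets one at a time. A small sketch (the file path, filter and helper name are only illustrative; the port matches the probes in the process listing above):

    from scapy.layers.inet import UDP
    from scapy.utils import PcapReader

    def count_udp_probes(pcap_path, port=31337):
        """Stream packets from a pcap file instead of loading them all at once."""
        count = 0
        reader = PcapReader(pcap_path)   # iterator: only one packet in memory at a time
        try:
            for pkt in reader:
                if UDP in pkt and pkt[UDP].dport == port:
                    count += 1
        finally:
            reader.close()
        return count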

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-web (master)

Change abandoned by Dmitry Shulyak (<email address hidden>) on branch: master
Review: https://review.openstack.org/211511

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

verified for 7.0 {"build_id": "301", "build_number": "301", "release_versions": {"2015.1.0-7.0": {"VERSION": {"build_id": "301", "build_number": "301", "api": "1.0", "fuel-library_sha": "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd", "nailgun_sha": "4162b0c15adb425b37608c787944d1983f543aa8", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "50e90af6e3d560e9085ff71d2950cfbcca91af67", "production": "docker", "python-fuelclient_sha": "486bde57cda1badb68f915f66c61b544108606f3", "astute_sha": "6c5b73f93e24cc781c809db9159927655ced5012", "fuel-ostf_sha": "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c", "release": "7.0", "fuelmain_sha": "a65d453215edb0284a2e4761be7a156bb5627677"}}}, "auth_required": true, "api": "1.0", "fuel-library_sha": "5d50055aeca1dd0dc53b43825dc4c8f7780be9dd", "nailgun_sha": "4162b0c15adb425b37608c787944d1983f543aa8", "feature_groups": ["mirantis"], "fuel-nailgun-agent_sha": "d7027952870a35db8dc52f185bb1158cdd3d1ebd", "openstack_version": "2015.1.0-7.0", "fuel-agent_sha": "50e90af6e3d560e9085ff71d2950cfbcca91af67", "production": "docker", "python-fuelclient_sha": "486bde57cda1badb68f915f66c61b544108606f3", "astute_sha": "6c5b73f93e24cc781c809db9159927655ced5012", "fuel-ostf_sha": "2cd967dccd66cfc3a0abd6af9f31e5b4d150a11c", "release": "7.0", "fuelmain_sha": "a65d453215edb0284a2e4761be7a156bb5627677"}

Steps to verify:
Deploy an HA env on VMs, leave it under load for 2 days (use Rally for this), run network verification several times, and check the running processes on the slaves.

Changed in fuel:
status: Fix Committed → Fix Released
tags: added: long-haul-testing
Revision history for this message
Denis Meltsaykin (dmeltsaykin) wrote :

Closing this as Won't Fix for 6.1, as there is no clear solution and the impact is not too big.
