neutron-keepalived-state-change fails with "AssertionError: do not call blocking functions from the mainloop"

Bug #1680183 reported by Ihar Hrachyshka
34
This bug affects 8 people
Affects Status Importance Assigned to Milestone
neutron
Fix Released
Low
Unassigned

Bug Description

17:39:17.802 6173 CRITICAL neutron [-] AssertionError: do not call blocking functions from the mainloop
17:39:17.802 6173 ERROR neutron Traceback (most recent call last):
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/.tox/dsvm-functional/bin/neutron-keepalived-state-change", line 10, in <module>
17:39:17.802 6173 ERROR neutron sys.exit(main())
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/neutron/cmd/keepalived_state_change.py", line 19, in main
17:39:17.802 6173 ERROR neutron keepalived_state_change.main()
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/neutron/agent/l3/keepalived_state_change.py", line 157, in main
17:39:17.802 6173 ERROR neutron cfg.CONF.monitor_cidr).start()
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/neutron/agent/linux/daemon.py", line 249, in start
17:39:17.802 6173 ERROR neutron self.run()
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/neutron/agent/l3/keepalived_state_change.py", line 70, in run
17:39:17.802 6173 ERROR neutron for iterable in self.monitor:
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/neutron/agent/linux/async_process.py", line 256, in _iter_queue
17:39:17.802 6173 ERROR neutron yield queue.get(block=block)
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/.tox/dsvm-functional/lib/python2.7/site-packages/eventlet/queue.py", line 313, in get
17:39:17.802 6173 ERROR neutron return waiter.wait()
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/.tox/dsvm-functional/lib/python2.7/site-packages/eventlet/queue.py", line 141, in wait
17:39:17.802 6173 ERROR neutron return get_hub().switch()
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/.tox/dsvm-functional/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
17:39:17.802 6173 ERROR neutron return self.greenlet.switch()
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/.tox/dsvm-functional/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 346, in run
17:39:17.802 6173 ERROR neutron self.wait(sleep_time)
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/.tox/dsvm-functional/lib/python2.7/site-packages/eventlet/hubs/poll.py", line 85, in wait
17:39:17.802 6173 ERROR neutron presult = self.do_poll(seconds)
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/.tox/dsvm-functional/lib/python2.7/site-packages/eventlet/hubs/epolls.py", line 62, in do_poll
17:39:17.802 6173 ERROR neutron return self.poll.poll(seconds)
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/neutron/agent/l3/keepalived_state_change.py", line 134, in handle_sigterm
17:39:17.802 6173 ERROR neutron self._kill_monitor()
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/neutron/agent/l3/keepalived_state_change.py", line 131, in _kill_monitor
17:39:17.802 6173 ERROR neutron run_as_root=True)
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/neutron/agent/linux/utils.py", line 221, in kill_process
17:39:17.802 6173 ERROR neutron execute(['kill', '-%d' % signal, pid], run_as_root=run_as_root)
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/neutron/agent/linux/utils.py", line 155, in execute
17:39:17.802 6173 ERROR neutron greenthread.sleep(0)
17:39:17.802 6173 ERROR neutron File "/opt/stack/neutron/.tox/dsvm-functional/lib/python2.7/site-packages/eventlet/greenthread.py", line 31, in sleep
17:39:17.802 6173 ERROR neutron assert hub.greenlet is not current, 'do not call blocking functions from the mainloop'
17:39:17.802 6173 ERROR neutron AssertionError: do not call blocking functions from the mainloop
17:39:17.802 6173 ERROR neutron

This is what I see when running fullstack l3ha tests, once I enable syslog logging for the helper process.

tags: added: l3-ha
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote : Re: neutron-keepalived-state-change fails with "AssertionError: do not call blocking functions from the mainloop" in functional tests

Actually, this comes from functional tests.

summary: neutron-keepalived-state-change fails with "AssertionError: do not call
- blocking functions from the mainloop"
+ blocking functions from the mainloop" in functional tests
tags: added: functional-tests
tags: removed: functional-tests
summary: neutron-keepalived-state-change fails with "AssertionError: do not call
- blocking functions from the mainloop" in functional tests
+ blocking functions from the mainloop"
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

No, it happens in fullstack too. I suspect it's more than just tests.

Changed in neutron:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Ihar Hrachyshka (ihar-hrachyshka)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/453838

Changed in neutron:
status: Confirmed → In Progress
Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

Probably doesn't affect data plane, just an ugly log message (that is not seen without syslog logging enabled).

Changed in neutron:
importance: High → Medium
importance: Medium → Low
Revision history for this message
Jakub Libosvar (libosvar) wrote :

We had issues with syslog in the past where they didn't play nice with multiple workers (API, RPC) due to opening a socket too early - before fork. Just leaving this here if it could be related.

Revision history for this message
Daniel Alvarez (dalvarezs) wrote :

I addressed it at some point here [0] but later in this patch I removed it because I couldn't reproduce it anymore. Your solution of calling time.sleep instead is better and I didn't realize, thanks for handling this Ihar :)

[0] https://review.openstack.org/#/c/411968/3/neutron/agent/l3/keepalived_state_change.py@139

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.openstack.org/453838
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=757476983ca0832099f04b9f029617ab270de9fe
Submitter: Jenkins
Branch: master

commit 757476983ca0832099f04b9f029617ab270de9fe
Author: Ihar Hrachyshka <email address hidden>
Date: Wed Apr 5 19:07:46 2017 +0000

    execute: don't call greenthread.sleep directly

    The module may be used in scope that is not using eventlet (f.e. from
    inside neutron-keepalived-state-change), in which case a call to
    greenthread.sleep from the main thread will result in AssertionError.

    Instead of that, call time.sleep that will redirect to greenthread.sleep
    in eventlet environment.

    This reflects what we do in oslo.concurrency execute implementation.

    Change-Id: Id73c0e5015feca2c559da4a68a70aeaaf5cec2a7
    Closes-Bug: #1680183

Changed in neutron:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/460751

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :
Download full text (8.8 KiB)

We still hit the issue, though the trace is a bit different now:

http://logs.openstack.org/38/284738/69/check/gate-neutron-dsvm-fullstack-ubuntu-xenial/2e022c5/logs/syslog.txt.gz

Apr 28 17:24:20 ubuntu-xenial-rax-ord-8648308 neutron-keepalived-state-change[21615]: 2017-04-28 17:24:20.423 21615 CRITICAL neutron [-] AssertionError: do not call blocking functions from the mainloop
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron Traceback (most recent call last):
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron File "/opt/stack/new/neutron/.tox/dsvm-fullstack/bin/neutron-keepalived-state-change", line 10, in <module>
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron sys.exit(main())
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron File "/opt/stack/new/neutron/neutron/cmd/keepalived_state_change.py", line 19, in main
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron keepalived_state_change.main()
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron File "/opt/stack/new/neutron/neutron/agent/l3/keepalived_state_change.py", line 156, in main
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron cfg.CONF.monitor_cidr).start()
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron File "/opt/stack/new/neutron/neutron/agent/linux/daemon.py", line 253, in start
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron self.run()
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron File "/opt/stack/new/neutron/neutron/agent/l3/keepalived_state_change.py", line 69, in run
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron for iterable in self.monitor:
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron File "/opt/stack/new/neutron/neutron/agent/linux/async_process.py", line 261, in _iter_queue
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron yield queue.get(block=block)
                                                                                      2017-04-28 17:24:20.423 21615 ERROR neutron File "/opt/stack/new/neutron/.tox/dsvm-fullstack/local/lib/python2.7/site-packages/eventlet/queue.py", line 313, in get
                                                                                      2017-04-28 17:24:...

Read more...

Changed in neutron:
status: Fix Released → Confirmed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/neutron 11.0.0.0b2

This issue was fixed in the openstack/neutron 11.0.0.0b2 development milestone.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (stable/ocata)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: stable/ocata
Review: https://review.openstack.org/460751
Reason: Not of help for the issue in question.

Revision history for this message
Miguel Angel Ajo (mangelajo) wrote :

The issue is not only testing, also hit by customers.

RHBZ: https://bugzilla.redhat.com/show_bug.cgi?id=1502282

Revision history for this message
John T Johnson (jt.johnson) wrote :
Download full text (5.5 KiB)

Was this "Fix" ever merged for production release? I"m running Openstack-Ansible Pike (16.0.1) and I'm still getting this error. Based on your number / release cycle, this error has probably been floating around for 2 years. Embarrassing.

2017-11-06 15:33:54.427 6079 DEBUG neutron.agent.linux.utils [-] Running command: ['sudo', '/openstack/venvs/neutron-16.0.1/bin/neutron-rootwrap', '/etc/neutron/rootwrap.conf', 'kill', '-9', '6080'] create_process /openstack/venvs/neutron-16.0.1/lib/python2.7/site-packages/neutron/agent/linux/utils.py:92
2017-11-06 15:33:54.437 6079 CRITICAL neutron [-] Unhandled error: AssertionError: do not call blocking functions from the mainloop
2017-11-06 15:33:54.437 6079 ERROR neutron Traceback (most recent call last):
2017-11-06 15:33:54.437 6079 ERROR neutron File "/openstack/venvs/neutron-16.0.1/bin/neutron-keepalived-state-change", line 11, in <module>
2017-11-06 15:33:54.437 6079 ERROR neutron sys.exit(main())
2017-11-06 15:33:54.437 6079 ERROR neutron File "/openstack/venvs/neutron-16.0.1/lib/python2.7/site-packages/neutron/cmd/keepalived_state_change.py", line 19, in main
2017-11-06 15:33:54.437 6079 ERROR neutron keepalived_state_change.main()
2017-11-06 15:33:54.437 6079 ERROR neutron File "/openstack/venvs/neutron-16.0.1/lib/python2.7/site-packages/neutron/agent/l3/keepalived_state_change.py", line 156, in main
2017-11-06 15:33:54.437 6079 ERROR neutron cfg.CONF.monitor_cidr).start()
2017-11-06 15:33:54.437 6079 ERROR neutron File "/openstack/venvs/neutron-16.0.1/lib/python2.7/site-packages/neutron/agent/linux/daemon.py", line 253, in start
2017-11-06 15:33:54.437 6079 ERROR neutron self.run()
2017-11-06 15:33:54.437 6079 ERROR neutron File "/openstack/venvs/neutron-16.0.1/lib/python2.7/site-packages/neutron/agent/l3/keepalived_state_change.py", line 69, in run
2017-11-06 15:33:54.437 6079 ERROR neutron for iterable in self.monitor:
2017-11-06 15:33:54.437 6079 ERROR neutron File "/openstack/venvs/neutron-16.0.1/lib/python2.7/site-packages/neutron/agent/linux/async_process.py", line 261, in _iter_queue
2017-11-06 15:33:54.437 6079 ERROR neutron yield queue.get(block=block)
2017-11-06 15:33:54.437 6079 ERROR neutron File "/openstack/venvs/neutron-16.0.1/lib/python2.7/site-packages/eventlet/queue.py", line 313, in get
2017-11-06 15:33:54.437 6079 ERROR neutron return waiter.wait()
2017-11-06 15:33:54.437 6079 ERROR neutron File "/openstack/venvs/neutron-16.0.1/lib/python2.7/site-packages/eventlet/queue.py", line 141, in wait
2017-11-06 15:33:54.437 6079 ERROR neutron return get_hub().switch()
2017-11-06 15:33:54.437 6079 ERROR neutron File "/openstack/venvs/neutron-16.0.1/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 294, in switch
2017-11-06 15:33:54.437 6079 ERROR neutron return self.greenlet.switch()
2017-11-06 15:33:54.437 6079 ERROR neutron File "/openstack/venvs/neutron-16.0.1/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 346, in run
2017-11-06 15:33:54.437 6079 ERROR neutron self.wait(sleep_time)
2017-11-06 15:33:54.437 6079 ERROR neutron File "/openstack/venvs/neutron-16.0.1/lib/python2.7/site-packages/eventlet/hubs/...

Read more...

Changed in neutron:
assignee: Ihar Hrachyshka (ihar-hrachyshka) → nobody
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/660611

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/660722

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/661257

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/661257
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=6ea01444dd75d2d15658c808e052143deaf66543
Submitter: Zuul
Branch: master

commit 6ea01444dd75d2d15658c808e052143deaf66543
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Fri May 24 09:32:41 2019 +0000

    Add Timer context manager class

    This class creates a context that:
    * Triggers a timeout exception if the timeout is set.
    * Returns the time elapsed since the context was initialized.
    * Returns the time spent in the context once it's closed.

    The timeout exception can be suppressed; when the time expires, the
    context finishes without rising TimerTimeout.

    This class can be used to execute blocking methods and finish them if
    there is no return. For example, in the followup patch [1], this class
    is used in the method "neutron.agent.linux.ip_lib.ip_monitor" to control
    how much time the blocking function "IPRSocket.get" is being called. By
    using this strategy the method "ip_monitor" can be finished gracefully
    using an external event signal in the main loop.

    [1] https://review.opendev.org/#/c/660611/

    Change-Id: I1f33535b201d49b875437bcc3397fcb465118064
    Related-Bug: #1680183

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/660611
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=a477c31a23049407b010e2eadbc93d0a2ed42e9d
Submitter: Zuul
Branch: master

commit a477c31a23049407b010e2eadbc93d0a2ed42e9d
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Tue May 21 10:48:42 2019 +0000

    Add ip_monitor command implemented using Pyroute2

    This method allows to track any IP address change in a
    namespace. In future patches, this method will replace
    the current IP monitor used in the keepalived_state_change
    daemon. The current implementation relays in a spawned shell,
    executed in root mode, and the output of this shell,
    conveniently parsed.

    If the passed namespace is not None, this new method must
    be executed in privileged mode (root user), but cannot use
    privsep because is a blocking function and can exhaust the
    number of working threads.

    This function should be executed in a parallel thread, returning
    the data using the eventlet queue. Pyroute does not implement yet
    a non blocking method to retrieve the command output or to know if
    the buffer has data. This method, spawned in a greenthread, must be
    stopped by killing this thread.

    An example of how to use it can be found in the functional tests
    implemented in this patch.

    Change-Id: I86e4487035d60e1b52e951dd3cd50d6bb54f388b
    Related-Bug: #1680183

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/660722
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=3437572906e38ec05caf141e634025941d887c50
Submitter: Zuul
Branch: master

commit 3437572906e38ec05caf141e634025941d887c50
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed May 22 14:07:20 2019 +0000

    Replace "ip monitor" command with Pyroute2 implementation

    Use the "ip monitor" tool implemented with Pyroute2 library in
    the neutron-keepalived-state-change monitor.

    Change-Id: I932b62a8e0fa1a2f51bbde44134272f0b31b5c76
    Related-Bug: #1680183

Revision history for this message
Rodolfo Alonso (rodolfo-alonso-hernandez) wrote :

Hello:

The reported problem should be solved now with the previous patch:
https://review.opendev.org/#/c/660722/

Now we don't have another process but a thread monitoring the IPs on the namespace selected. This thread is stopped gracefully when the SIGTERM signal is received in the main thread.

Regards.

Changed in neutron:
status: Confirmed → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (stable/train)

Related fix proposed to branch: stable/train
Review: https://review.opendev.org/717804

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (stable/train)

Reviewed: https://review.opendev.org/717804
Committed: https://git.openstack.org/cgit/openstack/neutron/commit/?id=478e20a61f7356739b7e01efb955c83bc512f428
Submitter: Zuul
Branch: stable/train

commit 478e20a61f7356739b7e01efb955c83bc512f428
Author: Rodolfo Alonso Hernandez <email address hidden>
Date: Wed May 22 14:07:20 2019 +0000

    Replace "ip monitor" command with Pyroute2 implementation

    Use the "ip monitor" tool implemented with Pyroute2 library in
    the neutron-keepalived-state-change monitor.

    Conflicts:
        neutron/agent/l3/ha_router.py
        neutron/agent/l3/keepalived_state_change.py
        neutron/tests/functional/agent/l3/test_keepalived_state_change.py

    This patch also includes https://review.opendev.org/#/c/718754/,
    adding to the ProcessManager call the namespace of the router. This
    will enforce the new process to be execute on the namespace context.

    This patch also modifies the main thread of "keepalived_state_change".
    To lock the main thread execution, allowing the other two threads
    (IP monitor and queue reader) to continue their work, a thread.join()
    was used. In Python 3, when the SIGTERM signal is received and both
    threads are finished, the main thread ends the wait and exits. No
    process was left behind when the monitor was stopped.

    But in Python 2.7, the queue reader thread was never finished. When
    the monitor was stopped, the thread was still being executed. As
    detected in the CI, that causes a memory exhaustion. Instead of
    trying to join the thread, a trivial loop is used insted.

    Change-Id: I932b62a8e0fa1a2f51bbde44134272f0b31b5c76
    Related-Bug: #1680183
    (cherry picked from commit 3437572906e38ec05caf141e634025941d887c50)

tags: added: in-stable-train
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.