openstack overcloud node provide --all-manageable timing out and failing periodic rhel-8-ovb-3ctlr_1comp-featureset001-master

Bug #1853652 reported by wes hayutin on 2019-11-22
This bug affects 2 people

Affects: tripleo
Importance: Critical
Assigned to: Unassigned

Bug Description

https://review.rdoproject.org/zuul/builds?job_name=tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001

2019-11-22 10:32:41.001110 | primary | TASK [overcloud-prep-images : Prepare the overcloud images for deploy] *********
2019-11-22 10:32:41.019882 | primary | Friday 22 November 2019 10:32:41 -0500 (0:00:01.752) 0:00:16.259 *******
2019-11-22 18:30:45.206602 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/openstack/tripleo-ci/playbooks/tripleo-ci/run-v3.yaml@master]
2019-11-22 18:30:45.207778 | POST-RUN START: [trusted : review.rdoproject.org/config/playbooks/tripleo-rdo-base/post-ovb-delete.yaml@master]
2019-11-22 18:30:47.358969 |

2019-11-22 10:33:30 | + sudo tcpdump -i any port 67 or port 68 or port 69 or port 623 -w /home/zuul/overcloud_prep_images_introspection_debug.pcap.log
2019-11-22 10:33:30 | tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
2019-11-22 10:33:31 | ++ ps --ppid 133485 -o pid=
2019-11-22 10:33:31 | + pidof_tcpdump=' 133500'
2019-11-22 10:33:31 | + openstack overcloud node introspect --all-manageable
2019-11-22 10:33:33 | Waiting for messages on queue 'tripleo' with no timeout.
2019-11-22 10:36:03 | Waiting for introspection to finish...
2019-11-22 10:36:03 | Introspection of node e24e9a4c-0129-44d4-ac82-bf96d0366c56 completed. Status:SUCCESS. Errors:None
2019-11-22 10:36:03 | Introspection of node e1396805-90ef-45bd-ac58-1cd29e5abc7a completed. Status:SUCCESS. Errors:None
2019-11-22 10:36:03 | Introspection of node e44ce917-b170-4478-912d-d3ba4b1b14ec completed. Status:SUCCESS. Errors:None
2019-11-22 10:36:03 | Introspection of node 2d2e743b-fcf4-4e26-a6a0-26d80e79a928 completed. Status:SUCCESS. Errors:None
2019-11-22 10:36:03 | Successfully introspected 4 node(s).
2019-11-22 10:36:03 |
2019-11-22 10:36:03 | Introspection completed.
2019-11-22 10:36:03 | + openstack overcloud node provide --all-manageable
2019-11-22 10:36:05 | Waiting for messages on queue 'tripleo' with no timeout.

The tcpdump capture is available here:
http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/undercloud/home/zuul/overcloud_prep_images_introspection_debug.pcap.log.txt.gz
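
If useful, the capture can be read back with tcpdump to check whether any DHCP/TFTP (ports 67/68/69) or IPMI (port 623) traffic actually reached the undercloud, e.g.:

  sudo tcpdump -nn -r overcloud_prep_images_introspection_debug.pcap.log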

baremetal console:
http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/baremetal_1-console.log

http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/baremetal_2-console.log

http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/baremetal_3-console.log

BMC console:
http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/bmc-console.log

summary: - tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001 timing out and failing
+ openstack overcloud node provide --all-manageable timing out and failing
+ in both centos7 and rhel8 jobs.
Alex Schultz (alex-schultz) wrote :

From the mistral engine log

2019-11-22 11:01:12.161 7 INFO workflow_trace [-] Workflow 'tripleo.baremetal.v1.provide_manageable_nodes' [RUNNING -> ERROR, msg=Failure caused by error in tasks: provide_manageable

  provide_manageable [task_ex_id=8d502f0f-4a10-47d8-aa9a-f73efa7dfbbb] -> Failure caused by error in tasks: send_message

  send_message [task_ex_id=675016f2-b20b-4c40-8663-952fba16f876] -> Workflow failed due to message status. Status:FAILED Message:({'result': 'Node e24e9a4c-0129-44d4-ac82-bf96d0366c56 did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e44ce917-b170-4478-912d-d3ba4b1b14ec did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e1396805-90ef-45bd-ac58-1cd29e5abc7a did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node 2d2e743b-fcf4-4e26-a6a0-26d80e79a928 did not reach state "available", the state is "clean wait", error: None'})
    [wf_ex_id=14d04eff-4bd6-4d27-9072-67cf1d9bf433, idx=0]: Workflow failed due to message status. Status:FAILED Message:({'result': 'Node e24e9a4c-0129-44d4-ac82-bf96d0366c56 did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e44ce917-b170-4478-912d-d3ba4b1b14ec did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e1396805-90ef-45bd-ac58-1cd29e5abc7a did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node 2d2e743b-fcf4-4e26-a6a0-26d80e79a928 did not reach state "available", the state is "clean wait", error: None'})

    [wf_ex_id=02b49a5a-fe5c-4af8-995a-1438f185cd7c, idx=0]: Failure caused by error in tasks: send_message

  send_message [task_ex_id=675016f2-b20b-4c40-8663-952fba16f876] -> Workflow failed due to message status. Status:FAILED Message:({'result': 'Node e24e9a4c-0129-44d4-ac82-bf96d0366c56 did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e44ce917-b170-4478-912d-d3ba4b1b14ec did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e1396805-90ef-45bd-ac58-1cd29e5abc7a did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node 2d2e743b-fcf4-4e26-a6a0-26d80e79a928 did not reach state "available", the state is "clean wait", error: None'})
    [wf_ex_id=14d04eff-4bd6-4d27-9072-67cf1d9bf433, idx=0]: Workflow failed due to message status. Status:FAILED Message:({'result': 'Node e24e9a4c-0129-44d4-ac82-bf96d0366c56 did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e44ce917-b170-4478-912d-d3ba4b1b14ec did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e1396805-90ef-45bd-ac58-1cd29e5abc7a did not reach state "available", the state is "clean wait", error: None'...

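As a side note, the failed workflow execution can also be inspected directly on the undercloud with the mistral client (assuming python-mistralclient is installed there), e.g.:

  openstack workflow execution list
  openstack workflow execution show <execution-id>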

wes hayutin (weshayutin) wrote :

2019-11-22 11:07:01.437 ERROR /var/log/containers/ironic/ironic-conductor.log: 8 ERROR ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Node 2d2e743b-fcf4-4e26-a6a0-26d80e79a928 moved to provision state "clean failed" from state "clean wait"; target provision state is "available"
2019-11-22 11:07:01.569 ERROR /var/log/containers/ironic/ironic-conductor.log: 8 ERROR ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Node e1396805-90ef-45bd-ac58-1cd29e5abc7a moved to provision state "clean failed" from state "clean wait"; target provision state is "available"
2019-11-22 11:07:01.620 ERROR /var/log/containers/ironic/ironic-conductor.log: 8 ERROR ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Node e24e9a4c-0129-44d4-ac82-bf96d0366c56 moved to provision state "clean failed" from state "clean wait"; target provision state is "available"
2019-11-22 11:07:01.667 ERROR /var/log/containers/ironic/ironic-conductor.log: 8 ERROR ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Node e44ce917-b170-4478-912d-d3ba4b1b14ec moved to provision state "clean failed" from state "clean wait"; target provision state is "available"

http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/undercloud/var/log/extra/errors.txt.txt.gz
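
For reference, the provision state and last error on each node can also be checked directly on the undercloud, e.g. (UUIDs as in this run):

  openstack baremetal node list
  openstack baremetal node show e24e9a4c-0129-44d4-ac82-bf96d0366c56 -f value -c provision_state -c last_error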

wes hayutin (weshayutin) wrote :

2019-11-22 11:07:01.481 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Node e1396805-90ef-45bd-ac58-1cd29e5abc7a successfully reserved for node state check (took 0.03 seconds) reserve_node /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:277
2019-11-22 11:07:01.509 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Attempting to get shared lock on node e44ce917-b170-4478-912d-d3ba4b1b14ec (for power state sync) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:222
2019-11-22 11:07:01.511 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Attempting to get shared lock on node e1396805-90ef-45bd-ac58-1cd29e5abc7a (for power state sync) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:222
2019-11-22 11:07:01.516 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Attempting to get shared lock on node e24e9a4c-0129-44d4-ac82-bf96d0366c56 (for power state sync) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:222
2019-11-22 11:07:01.520 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Attempting to get shared lock on node 2d2e743b-fcf4-4e26-a6a0-26d80e79a928 (for power state sync) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:222
2019-11-22 11:07:01.540 8 DEBUG ironic.common.states [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Exiting old state 'clean wait' in response to event 'fail' on_exit /usr/lib/python3.6/site-packages/ironic/common/states.py:294
2019-11-22 11:07:01.540 8 DEBUG ironic.common.states [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Entering new state 'clean failed' in response to event 'fail' on_enter /usr/lib/python3.6/site-packages/ironic/common/states.py:300
2019-11-22 11:07:01.565 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Successfully released shared lock for power state sync on node e1396805-90ef-45bd-ac58-1cd29e5abc7a (lock was held 0.05 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:356
2019-11-22 11:07:01.567 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Successfully released shared lock for power state sync on node e44ce917-b170-4478-912d-d3ba4b1b14ec (lock was held 0.06 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:356
2019-11-22 11:07:01.569 8 ERROR ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Node e1396805-90ef-45bd-ac58-1cd29e5abc7a moved to provision state "clean failed" from state "clean wait"; target provision state is "available"

Bob Fournier (bfournie) wrote :
Marios Andreou (marios-b) wrote :

  @Bob thanks for checking. You can't see the logs in the job from comment #6 because that is a check job, not a periodic one, and the rhel8 check jobs are currently broken by another bug (https://bugs.launchpad.net/tripleo/+bug/1854062). Incidentally, that bug is blocked on the fix for this one: this rhel8 fs1 job is the one remaining job we need passing to get a promotion, and that promotion is what will fix the check jobs in bug 1854062.

    You'll find the ironic logs if you check the periodic version of the job; see the latest run list at https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master

    Ignoring the 'SKIPPED' results, you can see a block of TIMED_OUT runs starting from 2019-11-26T16:28:45. Adding some runs with the example "clean failed" trace below:

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/fc9dabe/logs/undercloud/var/log/containers/ironic/ironic-conductor.log.txt.gz

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/12609c1//logs/undercloud/var/log/containers/ironic/ironic-conductor.log.1.gz

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/6093de8//logs/undercloud/var/log/containers/ironic/ironic-conductor.log.txt.gz

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/8b3e5bc/logs/undercloud/var/log/containers/ironic/ironic-conductor.log.1.gz

summary: openstack overcloud node provide --all-manageable timing out and failing
- in both centos7 and rhel8 jobs.
+ periodic rhel-8-ovb-3ctlr_1comp-featureset001-master
Marios Andreou (marios-b) wrote :

updated the title since we are now focusing on rhel8 periodic fs1 specifically

Bob Fournier (bfournie) wrote :

Thanks Marios. Yeah it looks like Ironic marks the node as "clean failed" after not getting a callback for 30 minutes, which is the clean_callback_timeout:
conductor.clean_callback_timeout = 1800

Do we get any console logs indicating what's going on on the node?
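
For context, that option lives in the [conductor] section of ironic.conf on the undercloud (with the containerized undercloud it is likely under something like /var/lib/config-data/puppet-generated/ironic/etc/ironic/ironic.conf; the exact path is an assumption):

  [conductor]
  # seconds to wait for the ramdisk callback during cleaning before the node
  # is moved to "clean failed"; 1800 (30 minutes) is the default
  clean_callback_timeout = 1800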

Harald Jensås (harald-jensas) wrote :

I made a blooper in diskimage-builder a week ago. Can we try a job that depends on this: https://review.opendev.org/696873 ? (It's worth a try ...)
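
For anyone following along, a test change on review.rdoproject.org can pull that fix into the job by adding a Zuul Depends-On footer to its commit message, e.g.:

  Depends-On: https://review.opendev.org/696873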

Bob Fournier (bfournie) wrote :

We're seeing neutron errors in
http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/fc9dabe/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz

which are preventing PXE boot from happening.

2019-11-26 09:09:07.716 71147 DEBUG oslo.privsep.daemon [-] privsep: Exception during request[140498753610560]: Network interface tap4caf13c7-3f not found in namespace qdhcp-e0002d0c-7228-427b-bc93-fa2ad7164fd7. _process_cmd /usr/lib/python3.6/site-packages/oslo_privsep/daemon.py:454
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 449, in _process_cmd
    ret = func(*f_args, **f_kwargs)
  File "/usr/lib/python3.6/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 53, in sync_inner
    return input_func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 247, in _wrap
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 406, in get_link_attributes
    link = _run_iproute_link("get", device, namespace)[0]
  File "/usr/lib/python3.6/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 254, in _run_iproute_link
    idx = get_link_id(device, namespace)
  File "/usr/lib/python3.6/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 247, in get_link_id
    raise NetworkInterfaceNotFound(device=device, namespace=namespace)
neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network interface tap4caf13c7-3f not found in namespace qdhcp-e0002d0c-7228-427b-bc93-fa2ad7164fd7.
2019-11-26 09:09:07.719 71147 DEBUG oslo.privsep.daemon [-] privsep: reply[140498753610560]: (5, 'neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound', ('Network interface tap4caf13c7-3f not found in namespace qdhcp-e0002d0c-7228-427b-bc93-fa2ad7164fd7.',)) _call_back /usr/lib/python3.6/site-packages/oslo_privsep/daemon.py:475

Followed by:

2019-11-26 09:10:08.416 63269 ERROR neutron.agent.linux.dhcp [req-772c1abd-e370-4d16-b709-0c1e15c672e6 - - - - -] Failed to start DHCP process for network e0002d0c-7228-427b-bc93-fa2ad7164fd7: neutron.common.utils.WaitTimeout: Timed out after 60 seconds
2019-11-26 09:10:08.428 63269 DEBUG neutron.agent.linux.utils [req-772c1abd-e370-4d16-b709-0c1e15c672e6 - - - - -] Unable to access /var/lib/neutron/external/pids/e0002d0c-7228-427b-bc93-fa2ad7164fd7.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2019-11-26 09:10:08.429 63269 DEBUG neutron.agent.linux.utils [req-772c1abd-e370-4d16-b709-0c1e15c672e6 - - - - -] Unable to access /var/lib/neutron/external/pids/e0002d0c-7228-427b-bc93-fa2ad7164fd7.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261

Would like to have someone from neutron team take a look.
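
In case it helps whoever picks this up, a rough sketch of things worth checking on the undercloud when this happens (on the containerized undercloud these may need to run inside, or be pointed at, the neutron DHCP agent container):

  sudo ip netns list
  sudo ip netns exec qdhcp-e0002d0c-7228-427b-bc93-fa2ad7164fd7 ip addr
  ps -ef | grep dnsmasq
  sudo ls -l /var/lib/neutron/dhcp/e0002d0c-7228-427b-bc93-fa2ad7164fd7/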

Harald Jensås (harald-jensas) wrote :

Is it not neutron's DHCP agent that fails to start the DHCP process for the ctlplane network on the undercloud?

http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz

2019-11-22 11:13:10.155 122439 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/055eac10-2b47-41fd-9b9e-7a932124bf92/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2019-11-22 11:13:11.156 122439 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/055eac10-2b47-41fd-9b9e-7a932124bf92/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2019-11-22 11:13:12.086 122439 ERROR neutron.agent.linux.dhcp [-] Failed to start DHCP process for network 055eac10-2b47-41fd-9b9e-7a932124bf92: neutron.common.utils.WaitTimeout: Timed out after 60 seconds

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/fc9dabe/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz

2019-11-26 09:50:49.128 63269 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/e0002d0c-7228-427b-bc93-fa2ad7164fd7/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2019-11-26 09:50:50.130 63269 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/e0002d0c-7228-427b-bc93-fa2ad7164fd7/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2019-11-26 09:50:50.856 63269 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2019-11-26 09:50:50.857 63269 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: held 0.001s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2019-11-26 09:50:51.132 63269 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/e0002d0c-7228-427b-bc93-fa2ad7164fd7/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2019-11-26 09:50:52.038 63269 ERROR neutron.agent.linux.dhcp [-] Failed to start DHCP process for network e0002d0c-7228-427b-bc93-fa2ad7164fd7: neutron.common.utils.WaitTimeout: Timed out after 60 seconds

Marios Andreou (marios-b) wrote :

thanks hjensas & bfournie:

posted that just now: https://review.rdoproject.org/r/#/c/23919/
with Depends-On: https://review.opendev.org/696873
let's see how that goes

Slawek Kaplonski (slaweq) wrote :

The error in the DHCP agent seems to happen due to a problem with getting the pid value from a file. See logs like:

2019-11-22 10:33:39.360 122439 DEBUG neutron.agent.linux.utils [req-a2222313-7fc2-4c99-ab27-5effb69d3cab - - - - -] Unable to access /var/lib/neutron/dhcp/055eac10-2b47-41fd-9b9e-7a932124bf92/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261

I just sent a small patch to neutron to log exactly what the reason for the IOError there is, so maybe we will learn more about why this happens. The patch is here: https://review.opendev.org/697035
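
For illustration only (this is not the actual patch), the change is roughly of this shape in neutron/agent/linux/utils.py, where the IOError is currently swallowed with just the "Unable to access" debug line:

    try:
        with open(filename) as f:
            return converter(f.read()) if converter else f.read()
    except IOError as error:
        # log the underlying errno/message, not just the file name
        LOG.debug('Unable to access %(filename)s; error: %(error)s',
                  {'filename': filename, 'error': error})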

Marios Andreou (marios-b) wrote :

looks like the test at https://review.rdoproject.org/r/#/c/23919/ failed... now re-running with the debug patch from comment #15; you can check the logs there if it helps, @slaweq

as per IRC just now with chkumar|ruck, we are seriously considering removing this fs1 job from the master rhel8 promotion criteria so we can get a much needed and overdue promotion, and restoring it immediately once we do.
