openstack overcloud node provide --all-manageable timing out and failing periodic rhel-8-ovb-3ctlr_1comp-featureset001-master

Bug #1853652 reported by wes hayutin
This bug affects 2 people
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: Cédric Jeanneret

Bug Description

https://review.rdoproject.org/zuul/builds?job_name=tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001

2019-11-22 10:32:41.001110 | primary | TASK [overcloud-prep-images : Prepare the overcloud images for deploy] *********
2019-11-22 10:32:41.019882 | primary | Friday 22 November 2019 10:32:41 -0500 (0:00:01.752) 0:00:16.259 *******
2019-11-22 18:30:45.206602 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/openstack/tripleo-ci/playbooks/tripleo-ci/run-v3.yaml@master]
2019-11-22 18:30:45.207778 | POST-RUN START: [trusted : review.rdoproject.org/config/playbooks/tripleo-rdo-base/post-ovb-delete.yaml@master]
2019-11-22 18:30:47.358969 |

2019-11-22 10:33:30 | + sudo tcpdump -i any port 67 or port 68 or port 69 or port 623 -w /home/zuul/overcloud_prep_images_introspection_debug.pcap.log
2019-11-22 10:33:30 | tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 262144 bytes
2019-11-22 10:33:31 | ++ ps --ppid 133485 -o pid=
2019-11-22 10:33:31 | + pidof_tcpdump=' 133500'
2019-11-22 10:33:31 | + openstack overcloud node introspect --all-manageable
2019-11-22 10:33:33 | Waiting for messages on queue 'tripleo' with no timeout.
2019-11-22 10:36:03 | Waiting for introspection to finish...
2019-11-22 10:36:03 | Introspection of node e24e9a4c-0129-44d4-ac82-bf96d0366c56 completed. Status:SUCCESS. Errors:None
2019-11-22 10:36:03 | Introspection of node e1396805-90ef-45bd-ac58-1cd29e5abc7a completed. Status:SUCCESS. Errors:None
2019-11-22 10:36:03 | Introspection of node e44ce917-b170-4478-912d-d3ba4b1b14ec completed. Status:SUCCESS. Errors:None
2019-11-22 10:36:03 | Introspection of node 2d2e743b-fcf4-4e26-a6a0-26d80e79a928 completed. Status:SUCCESS. Errors:None
2019-11-22 10:36:03 | Successfully introspected 4 node(s).
2019-11-22 10:36:03 |
2019-11-22 10:36:03 | Introspection completed.
2019-11-22 10:36:03 | + openstack overcloud node provide --all-manageable
2019-11-22 10:36:05 | Waiting for messages on queue 'tripleo' with no timeout.

The tcpdump capture is available here:
http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/undercloud/home/zuul/overcloud_prep_images_introspection_debug.pcap.log.txt.gz
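
For anyone looking at that capture offline, a quick way to read it back with tcpdump (a sketch; the filename is the one written by the job above, adjust the path to wherever you downloaded it):

  # show only DHCP traffic (ports 67/68); -nn skips host/port name resolution
  tcpdump -nn -r overcloud_prep_images_introspection_debug.pcap.log 'port 67 or port 68'
  # TFTP (69) and IPMI (623) traffic from the same capture
  tcpdump -nn -r overcloud_prep_images_introspection_debug.pcap.log 'port 69 or port 623'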

Baremetal consoles:
http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/baremetal_1-console.log

http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/baremetal_2-console.log

http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/baremetal_3-console.log

BMC console:
http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/bmc-console.log

Revision history for this message
wes hayutin (weshayutin) wrote : Re: openstack overcloud node provide --all-manageable timing out and failing in both centos7 and rhel8 jobs.
summary: - tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001 timing out and failing
+ openstack overcloud node provide --all-manageable timing out and failing
+ in both centos7 and rhel8 jobs.
Revision history for this message
Alex Schultz (alex-schultz) wrote :

From the mistral engine log

2019-11-22 11:01:12.161 7 INFO workflow_trace [-] Workflow 'tripleo.baremetal.v1.provide_manageable_nodes' [RUNNING -> ERROR, msg=Failure caused by error in tasks: provide_manageable

  provide_manageable [task_ex_id=8d502f0f-4a10-47d8-aa9a-f73efa7dfbbb] -> Failure caused by error in tasks: send_message

  send_message [task_ex_id=675016f2-b20b-4c40-8663-952fba16f876] -> Workflow failed due to message status. Status:FAILED Message:({'result': 'Node e24e9a4c-0129-44d4-ac82-bf96d0366c56 did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e44ce917-b170-4478-912d-d3ba4b1b14ec did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e1396805-90ef-45bd-ac58-1cd29e5abc7a did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node 2d2e743b-fcf4-4e26-a6a0-26d80e79a928 did not reach state "available", the state is "clean wait", error: None'})
    [wf_ex_id=14d04eff-4bd6-4d27-9072-67cf1d9bf433, idx=0]: Workflow failed due to message status. Status:FAILED Message:({'result': 'Node e24e9a4c-0129-44d4-ac82-bf96d0366c56 did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e44ce917-b170-4478-912d-d3ba4b1b14ec did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e1396805-90ef-45bd-ac58-1cd29e5abc7a did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node 2d2e743b-fcf4-4e26-a6a0-26d80e79a928 did not reach state "available", the state is "clean wait", error: None'})

    [wf_ex_id=02b49a5a-fe5c-4af8-995a-1438f185cd7c, idx=0]: Failure caused by error in tasks: send_message

  send_message [task_ex_id=675016f2-b20b-4c40-8663-952fba16f876] -> Workflow failed due to message status. Status:FAILED Message:({'result': 'Node e24e9a4c-0129-44d4-ac82-bf96d0366c56 did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e44ce917-b170-4478-912d-d3ba4b1b14ec did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e1396805-90ef-45bd-ac58-1cd29e5abc7a did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node 2d2e743b-fcf4-4e26-a6a0-26d80e79a928 did not reach state "available", the state is "clean wait", error: None'})
    [wf_ex_id=14d04eff-4bd6-4d27-9072-67cf1d9bf433, idx=0]: Workflow failed due to message status. Status:FAILED Message:({'result': 'Node e24e9a4c-0129-44d4-ac82-bf96d0366c56 did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e44ce917-b170-4478-912d-d3ba4b1b14ec did not reach state "available", the state is "clean wait", error: None'}, {'result': 'Node e1396805-90ef-45bd-ac58-1cd29e5abc7a did not reach state "available", the state is "clean wait", error: None'...


Revision history for this message
wes hayutin (weshayutin) wrote :

2019-11-22 11:07:01.437 ERROR /var/log/containers/ironic/ironic-conductor.log: 8 ERROR ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Node 2d2e743b-fcf4-4e26-a6a0-26d80e79a928 moved to provision state "clean failed" from state "clean wait"; target provision state is "available"
2019-11-22 11:07:01.569 ERROR /var/log/containers/ironic/ironic-conductor.log: 8 ERROR ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Node e1396805-90ef-45bd-ac58-1cd29e5abc7a moved to provision state "clean failed" from state "clean wait"; target provision state is "available"
2019-11-22 11:07:01.620 ERROR /var/log/containers/ironic/ironic-conductor.log: 8 ERROR ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Node e24e9a4c-0129-44d4-ac82-bf96d0366c56 moved to provision state "clean failed" from state "clean wait"; target provision state is "available"
2019-11-22 11:07:01.667 ERROR /var/log/containers/ironic/ironic-conductor.log: 8 ERROR ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Node e44ce917-b170-4478-912d-d3ba4b1b14ec moved to provision state "clean failed" from state "clean wait"; target provision state is "available"

http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/undercloud/var/log/extra/errors.txt.txt.gz

Revision history for this message
wes hayutin (weshayutin) wrote :

2019-11-22 11:07:01.481 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Node e1396805-90ef-45bd-ac58-1cd29e5abc7a successfully reserved for node state check (took 0.03 seconds) reserve_node /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:277
2019-11-22 11:07:01.509 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Attempting to get shared lock on node e44ce917-b170-4478-912d-d3ba4b1b14ec (for power state sync) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:222
2019-11-22 11:07:01.511 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Attempting to get shared lock on node e1396805-90ef-45bd-ac58-1cd29e5abc7a (for power state sync) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:222
2019-11-22 11:07:01.516 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Attempting to get shared lock on node e24e9a4c-0129-44d4-ac82-bf96d0366c56 (for power state sync) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:222
2019-11-22 11:07:01.520 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Attempting to get shared lock on node 2d2e743b-fcf4-4e26-a6a0-26d80e79a928 (for power state sync) __init__ /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:222
2019-11-22 11:07:01.540 8 DEBUG ironic.common.states [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Exiting old state 'clean wait' in response to event 'fail' on_exit /usr/lib/python3.6/site-packages/ironic/common/states.py:294
2019-11-22 11:07:01.540 8 DEBUG ironic.common.states [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Entering new state 'clean failed' in response to event 'fail' on_enter /usr/lib/python3.6/site-packages/ironic/common/states.py:300
2019-11-22 11:07:01.565 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Successfully released shared lock for power state sync on node e1396805-90ef-45bd-ac58-1cd29e5abc7a (lock was held 0.05 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:356
2019-11-22 11:07:01.567 8 DEBUG ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Successfully released shared lock for power state sync on node e44ce917-b170-4478-912d-d3ba4b1b14ec (lock was held 0.06 sec) release_resources /usr/lib/python3.6/site-packages/ironic/conductor/task_manager.py:356
2019-11-22 11:07:01.569 8 ERROR ironic.conductor.task_manager [req-c6c68f98-012b-44bd-9c0a-43e7cd74b266 - - - - -] Node e1396805-90ef-45bd-ac58-1cd29e5abc7a moved to provision state "clean failed" from state "clean wait"; target provision state is "available"
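
For anyone digging into this on a held node, the provision state and cleaning error of a node can be checked with the standard baremetal CLI, roughly (the UUID below is just one of the nodes from the log above):

  # list all nodes with their current provision states
  openstack baremetal node list
  # show why a given node failed cleaning
  openstack baremetal node show e1396805-90ef-45bd-ac58-1cd29e5abc7a -f value -c provision_state -c last_error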

Revision history for this message
Marios Andreou (marios-b) wrote :
Revision history for this message
Bob Fournier (bfournie) wrote :
Revision history for this message
Marios Andreou (marios-b) wrote :

  @Bob thanks for checking. You can't see logs in the job from comment #6 because of another bug, since that is a check job rather than a periodic one. The rhel8 check jobs are broken by https://bugs.launchpad.net/tripleo/+bug/1854062, which in turn is blocked by the fix for this one (the rhel8 fs1 job we are looking at here is the one remaining job needed for a promotion, and that promotion is what will fix the check jobs in bug 1854062).

    You'll find the ironic logs if you check the periodic version of the job; see the latest run list at https://review.rdoproject.org/zuul/builds?job_name=periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master

    Ignoring the 'SKIPPED' entries, you can see a block of TIMED_OUT results from 2019-11-26T16:28:45 ... adding some runs with the example "clean failed" trace below:

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/fc9dabe/logs/undercloud/var/log/containers/ironic/ironic-conductor.log.txt.gz

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/12609c1//logs/undercloud/var/log/containers/ironic/ironic-conductor.log.1.gz

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/6093de8//logs/undercloud/var/log/containers/ironic/ironic-conductor.log.txt.gz

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/8b3e5bc/logs/undercloud/var/log/containers/ironic/ironic-conductor.log.1.gz

summary: openstack overcloud node provide --all-manageable timing out and failing
- in both centos7 and rhel8 jobs.
+ periodic rhel-8-ovb-3ctlr_1comp-featureset001-master
Revision history for this message
Marios Andreou (marios-b) wrote :

Updated the title since we are now focusing specifically on the rhel8 periodic fs1 job.

Revision history for this message
Bob Fournier (bfournie) wrote :

Thanks Marios. Yeah it looks like Ironic marks the node as "clean failed" after not getting a callback for 30 minutes, which is the clean_callback_timeout:
conductor.clean_callback_timeout = 1800

Do we get any console logs indicating what's going on on the node?
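
(For reference, that option lives under [conductor] in ironic.conf; a rough way to double-check the effective value on the undercloud, assuming the usual TripleO puppet-generated config path:)

  sudo grep clean_callback_timeout /var/lib/config-data/puppet-generated/ironic/etc/ironic/ironic.conf
  # or, if crudini happens to be installed:
  sudo crudini --get /var/lib/config-data/puppet-generated/ironic/etc/ironic/ironic.conf conductor clean_callback_timeout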

Revision history for this message
Marios Andreou (marios-b) wrote :
Revision history for this message
Harald Jensås (harald-jensas) wrote :

I made a blooper in diskimage-builder a week ago. Can we try a job that depends on this: https://review.opendev.org/696873 ? (It's worth a try ...)
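
(For reference, pulling that change into a test job is just a matter of adding a Gerrit Depends-On footer to the test review's commit message, roughly:)

  git commit --amend
  # then add this footer line to the commit message so Zuul fetches the change:
  #   Depends-On: https://review.opendev.org/696873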

Revision history for this message
Bob Fournier (bfournie) wrote :

We're seeing neutron errors in
http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/fc9dabe/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz

which are preventing PXE boot from happening.

2019-11-26 09:09:07.716 71147 DEBUG oslo.privsep.daemon [-] privsep: Exception during request[140498753610560]: Network interface tap4caf13c7-3f not found in namespace qdhcp-e0002d0c-7228-427b-bc93-fa2ad7164fd7. _process_cmd /usr/lib/python3.6/site-packages/oslo_privsep/daemon.py:454
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/oslo_privsep/daemon.py", line 449, in _process_cmd
    ret = func(*f_args, **f_kwargs)
  File "/usr/lib/python3.6/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 53, in sync_inner
    return input_func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/oslo_privsep/priv_context.py", line 247, in _wrap
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 406, in get_link_attributes
    link = _run_iproute_link("get", device, namespace)[0]
  File "/usr/lib/python3.6/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 254, in _run_iproute_link
    idx = get_link_id(device, namespace)
  File "/usr/lib/python3.6/site-packages/neutron/privileged/agent/linux/ip_lib.py", line 247, in get_link_id
    raise NetworkInterfaceNotFound(device=device, namespace=namespace)
neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound: Network interface tap4caf13c7-3f not found in namespace qdhcp-e0002d0c-7228-427b-bc93-fa2ad7164fd7.
2019-11-26 09:09:07.719 71147 DEBUG oslo.privsep.daemon [-] privsep: reply[140498753610560]: (5, 'neutron.privileged.agent.linux.ip_lib.NetworkInterfaceNotFound', ('Network interface tap4caf13c7-3f not found in namespace qdhcp-e0002d0c-7228-427b-bc93-fa2ad7164fd7.',)) _call_back /usr/lib/python3.6/site-packages/oslo_privsep/daemon.py:475

Followed by:

2019-11-26 09:10:08.416 63269 ERROR neutron.agent.linux.dhcp [req-772c1abd-e370-4d16-b709-0c1e15c672e6 - - - - -] Failed to start DHCP process for network e0002d0c-7228-427b-bc93-fa2ad7164fd7: neutron.common.utils.WaitTimeout: Timed out after 60 seconds
2019-11-26 09:10:08.428 63269 DEBUG neutron.agent.linux.utils [req-772c1abd-e370-4d16-b709-0c1e15c672e6 - - - - -] Unable to access /var/lib/neutron/external/pids/e0002d0c-7228-427b-bc93-fa2ad7164fd7.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2019-11-26 09:10:08.429 63269 DEBUG neutron.agent.linux.utils [req-772c1abd-e370-4d16-b709-0c1e15c672e6 - - - - -] Unable to access /var/lib/neutron/external/pids/e0002d0c-7228-427b-bc93-fa2ad7164fd7.pid.haproxy get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261

We'd like to have someone from the neutron team take a look.
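
(In the meantime, a rough check on a held node of whether the DHCP namespace and the tap device from the traceback actually exist:)

  # list the DHCP namespaces on the undercloud
  sudo ip netns list | grep qdhcp
  # look for the tap device inside the namespace named in the traceback
  sudo ip netns exec qdhcp-e0002d0c-7228-427b-bc93-fa2ad7164fd7 ip link show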

Revision history for this message
Harald Jensås (harald-jensas) wrote :

Isn't it neutron's DHCP agent that does not start for the ctlplane network on the undercloud?

http://logs.rdoproject.org/28/695528/1/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/690ffa2/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz

2019-11-22 11:13:10.155 122439 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/055eac10-2b47-41fd-9b9e-7a932124bf92/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2019-11-22 11:13:11.156 122439 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/055eac10-2b47-41fd-9b9e-7a932124bf92/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2019-11-22 11:13:12.086 122439 ERROR neutron.agent.linux.dhcp [-] Failed to start DHCP process for network 055eac10-2b47-41fd-9b9e-7a932124bf92: neutron.common.utils.WaitTimeout: Timed out after 60 seconds

http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/fc9dabe/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.txt.gz

2019-11-26 09:50:49.128 63269 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/e0002d0c-7228-427b-bc93-fa2ad7164fd7/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2019-11-26 09:50:50.130 63269 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/e0002d0c-7228-427b-bc93-fa2ad7164fd7/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2019-11-26 09:50:50.856 63269 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" acquired by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: waited 0.000s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:327
2019-11-26 09:50:50.857 63269 DEBUG oslo_concurrency.lockutils [-] Lock "_check_child_processes" released by "neutron.agent.linux.external_process.ProcessMonitor._check_child_processes" :: held 0.001s inner /usr/lib/python3.6/site-packages/oslo_concurrency/lockutils.py:339
2019-11-26 09:50:51.132 63269 DEBUG neutron.agent.linux.utils [-] Unable to access /var/lib/neutron/dhcp/e0002d0c-7228-427b-bc93-fa2ad7164fd7/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261
2019-11-26 09:50:52.038 63269 ERROR neutron.agent.linux.dhcp [-] Failed to start DHCP process for network e0002d0c-7228-427b-bc93-fa2ad7164fd7: neutron.common.utils.WaitTimeout: Timed out after 60 seconds
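
(A quick sanity check on a held node would be whether dnsmasq ever spawned for that network and wrote the pid file the agent keeps retrying to read - a sketch, with the network UUID taken from the log above:)

  # is any dnsmasq serving the ctlplane network at all?
  ps aux | grep '[d]nsmasq'
  # the pid file the agent is waiting for
  sudo ls -l /var/lib/neutron/dhcp/e0002d0c-7228-427b-bc93-fa2ad7164fd7/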

Revision history for this message
Marios Andreou (marios-b) wrote :

thanks hjensas & bfournie:

posted that just now: https://review.rdoproject.org/r/#/c/23919/
with Depends-On: https://review.opendev.org/696873
let's see how that goes

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

The error in the DHCP agent seems to happen due to a problem with getting the pid value from the file. See logs like:

2019-11-22 10:33:39.360 122439 DEBUG neutron.agent.linux.utils [req-a2222313-7fc2-4c99-ab27-5effb69d3cab - - - - -] Unable to access /var/lib/neutron/dhcp/055eac10-2b47-41fd-9b9e-7a932124bf92/pid get_value_from_file /usr/lib/python3.6/site-packages/neutron/agent/linux/utils.py:261

I just sent a small patch to neutron to log exactly what the reason for the IOError is there, so maybe we will learn more about why this happens. Patch is here: https://review.opendev.org/697035

Revision history for this message
Marios Andreou (marios-b) wrote :

Looks like the test at https://review.rdoproject.org/r/#/c/23919/ failed... now re-running with the debug patch from comment #15; you can check the logs there if it helps, @slaweq.

As per IRC just now with chkumar|ruck, we are seriously considering removing this fs1 job from the master rhel8 promotion criteria so we can get a much needed and overdue promotion, and restoring it immediately once we do.

Revision history for this message
chandan kumar (chkumar246) wrote :

The current job is failing at "Get image expected checksum", and the master promotion job is blocked as RDO DLRN master is not yet consistent: http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/ad73dd9/job-output.txt

We are waiting for next run.

Revision history for this message
Marios Andreou (marios-b) wrote :

still seeing a timeout in the latest run

        * http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/083861e/job-output.txt

        * 2019-12-10 20:54:11.920706 | primary | TASK [overcloud-prep-images : Prepare the overcloud images for deploy] *********
2019-12-10 20:54:11.937916 | primary | Tuesday 10 December 2019 20:54:11 -0500 (0:00:01.796) 0:00:16.273 ******
2019-12-11 05:05:42.171970 | RUN END RESULT_TIMED_OUT: [untrusted : opendev.org/openstack/tripleo-ci/playbooks/tripleo-ci/run-v3.yaml@master]

with the neutron dhcp agent error 2019-12-10 21:34:48.492 61922 ERROR neutron.agent.linux.dhcp [-] Failed to start DHCP process for network c2a9ce2c-cd89-46fc-8f9b-623848405039: neutron.common.utils.WaitTimeout: Timed out after 60 seconds @ http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/083861e/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.1.gz

Revision history for this message
Marios Andreou (marios-b) wrote :

The debug patch in https://bugs.launchpad.net/tripleo/+bug/1853652/comments/15 merged, but it isn't showing in the latest logs (e.g. from https://bugs.launchpad.net/tripleo/+bug/1853652/comments/19) because we need a promotion for that to happen :/

I just sanity checked because I asked Slawek to look at it now that his patch has merged. However, the current rhel8 tripleo-ci-testing [1] (what is used in the promotion jobs, i.e. the candidate for promotion) still has:

        * openstack-neutron-15.1.0-0.20191209102301.8375f4c.el8.noarch.rpm

And you can see from [2] that "8375f4c" is a bit older than the one we need [3].

I just hit recheck at [4]; let's see how it goes, the image checksum issue should be gone now.

[1] https://trunk.rdoproject.org/rhel8-master/tripleo-ci-testing/
[2] https://github.com/openstack/neutron/commits/master?after=4051e0b19dc9f318c2e0dd7c60eaa2c46536ad03+34
[3] https://github.com/openstack/neutron/commit/12852b3e68f7840205c1185297e8c242ea44e182
[4] https://review.rdoproject.org/r/#/c/23919/
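
(A quick way to verify which neutron build a node actually carries, and whether it already contains the debug patch - a sketch; the long hash is the commit from [3] and the short one comes from the package NVR above:)

  rpm -q openstack-neutron
  # in a neutron git checkout: is the debug patch an ancestor of the packaged commit?
  git merge-base --is-ancestor 12852b3e68f7840205c1185297e8c242ea44e182 8375f4c && echo included || echo missing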

Revision history for this message
Marios Andreou (marios-b) wrote :

Well... to clarify, we don't need a promotion... more like we need a new enough build of neutron to be promoted to tripleo-ci-testing, so it would happen if we wait long enough, but I'm hoping https://review.rdoproject.org/r/#/c/23919/ will give us logs (we've been unlucky there so far :/).

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

I'm checking it today on a held node, and when I try to start the dnsmasq process from the neutron_dhcp container, I see an error like:

gru 13 10:13:15 undercloud.localdomain systemd[257153]: dhcp_dnsmasq.service: Failed to execute command: Permission denied
gru 13 10:13:15 undercloud.localdomain systemd[257153]: dhcp_dnsmasq.service: Failed at step EXEC spawning /var/lib/neutron/dhcp_dnsmasq/sync: Permission denied
gru 13 10:13:15 undercloud.localdomain systemd[1]: dhcp_dnsmasq.service: Main process exited, code=exited, status=203/EXEC
gru 13 10:13:15 undercloud.localdomain systemd[1]: dhcp_dnsmasq.service: Failed with result 'exit-code'.
gru 13 10:13:15 undercloud.localdomain systemd[1]: Failed to start Tripleo dhcp_dnsmasq sync service.
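
(For anyone reproducing this, the failing unit and any matching SELinux denials can be inspected roughly like this on the undercloud; the unit name is taken from the journal output above:)

  # last output of the sidecar unit that failed to exec the wrapper
  sudo journalctl -u dhcp_dnsmasq.service --no-pager | tail -n 50
  # recent AVC denials, to see whether the EXEC failure is SELinux
  sudo ausearch -m AVC -ts recent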

Revision history for this message
Slawek Kaplonski (slaweq) wrote :

When I disabled SELinux on the test node and then restarted the neutron_dhcp container, everything worked fine. So it seems it's some missing SELinux rule.
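
(For the record, the test was roughly the following - temporary only, just to confirm the theory:)

  # switch SELinux to Permissive and restart the container, then re-check the DHCP agent
  sudo setenforce 0
  sudo podman restart neutron_dhcp
  getenforce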

Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Hello,

So I've been able to check some recent logs[1] and yes, we do have "some" AVCs in there.

Doing a "grep -v dbus" on the file content shows this kind of things:
/var/log/audit/audit.log.1:type=AVC msg=audit(1576476091.688:19443): avc: denied { execute } for pid=117705 comm="(sync)" name="sync" dev="vda1" ino=152835984 scontext=system_u:system_r:init_t:s0 tcontext=system_u:object_r:container_file_t:s0 tclass=file permissive=0

IIRC the way side-cars are started has changed lately, and of course nobody checked selinux logs, as usual....

@Slawek, would you be able to provide a full audit.log on a SELinux Permissive node with this deploy? It's located in /var/log/audit/audit.log (accessible only as root).

[1] http://logs.rdoproject.org/20/698620/6/openstack-check/tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001/418799c/logs/undercloud/var/log/extra/denials.txt.gz
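
(A quick way to see what rule that AVC translates to - not the final openstack-selinux fix, just a sketch with audit2allow; the module name below is arbitrary:)

  # turn the logged denials into a candidate type-enforcement rule
  sudo grep 'denied' /var/log/audit/audit.log | audit2allow -m local_dnsmasq_sync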

Revision history for this message
chandan kumar (chkumar246) wrote :
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

@Chandan: nope, it's not permissive: "permissive=0"

Revision history for this message
chandan kumar (chkumar246) wrote :
Revision history for this message
chandan kumar (chkumar246) wrote :

Trying out here https://review.rdoproject.org/r/24189 with selinux permissive

Revision history for this message
chandan kumar (chkumar246) wrote :
Revision history for this message
chandan kumar (chkumar246) wrote :
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Hello there,

I've just pushed a change against openstack-selinux here:
https://github.com/redhat-openstack/openstack-selinux/pull/52

It should solve this issue.

Small note: please ensure the openstack-selinux package is the latest available; I've seen a lot of "old" AVCs in the last logs :/.
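
E.g. a quick check on the undercloud:

  rpm -q openstack-selinux
  # compare against what the enabled repos currently offer
  sudo dnf list openstack-selinux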

Cheers,

C.

Changed in tripleo:
assignee: nobody → Cédric Jeanneret (cjeanner)
status: Triaged → In Progress
Revision history for this message
Cédric Jeanneret (cjeanner) wrote :

Upstream merged. So we should be fine as soon as we get a new openstack-selinux package.
In parallel, Chandan is pushing a patch[1] that will set RHEL jobs to Permissive. We should therefore be safe on both sides.

[1] https://review.opendev.org/699170

Changed in tripleo:
status: In Progress → Fix Committed
Revision history for this message
Marios Andreou (marios-b) wrote :

09:45 < marios|ruck> chkumar|rover: lets see i rechecked that https://review.rdoproject.org/r/#/c/23919/ based on https://bugs.launchpad.net/tripleo/+bug/1853652/comments/32

Changed in tripleo:
status: Fix Committed → Fix Released
Revision history for this message
Marios Andreou (marios-b) wrote :

We're still hitting the timeout issue and it still seems to be the neutron DHCP agent; please @slaweq can you have another look?

Latest periodic run at [1] times out at

            * 2019-12-19 20:59:19 | + openstack overcloud node provide --all-manageable
              2019-12-19 20:59:22 | Waiting for messages on queue 'tripleo' with no timeout.
              ...

Looking in the neutron dhcp agent logs [2] I still see an error like:

            * 2019-12-19 21:03:55.966 65530 DEBUG neutron.agent.linux.dhcp [-] Spawning DHCP process for network 0e7a1494-706e-409a-b0e5-4a4cfbc20ca2 failed; Error: Exit code: 1; Stdin: ; Stdout: ; Stderr: + exec$
            * 2019-12-19 21:03:59.042 65530 DEBUG neutron.agent.linux.dhcp [-] Spawning DHCP process for network 0e7a1494-706e-409a-b0e5-4a4cfbc20ca2 failed; Error: Exit code: 1; Stdin: ; Stdout: ; Stderr: + exec$

SELinux is now Permissive (see [3]); can you spot anything more in that agent log?

[1] http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/1e3ab91/logs/undercloud/home/zuul/overcloud_prep_images.log.txt.gz
[2] http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/1e3ab91/logs/undercloud/var/log/containers/neutron/dhcp-agent.log.1.gz
[3] http://logs.rdoproject.org/openstack-periodic-master/opendev.org/openstack/tripleo-ci/master/periodic-tripleo-ci-rhel-8-ovb-3ctlr_1comp-featureset001-master/1e3ab91/logs/undercloud/var/log/extra/selinux.txt.gz
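
(A few quick checks for a held Permissive node, to see whether the spawn failure is still the exec denial on the sidecar wrapper from Slawek's earlier comment or something new - a sketch, paths and unit name taken from that comment:)

  sudo systemctl status 'dhcp_dnsmasq*' --no-pager
  # the wrapper systemd tries to exec, with its SELinux context
  sudo ls -lZ /var/lib/neutron/dhcp_dnsmasq/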

Changed in tripleo:
status: Fix Released → In Progress
Revision history for this message
wes hayutin (weshayutin) wrote :

Another bug is now getting in the way of this one... put this on hold for now

Changed in tripleo:
milestone: ussuri-1 → ussuri-2
Revision history for this message
wes hayutin (weshayutin) wrote :
Changed in tripleo:
status: In Progress → Fix Released