cover job started to fail with Killed

Bug #2065821 reported by Ihar Hrachyshka
Affects: neutron
Status: In Progress
Importance: Critical
Assigned to: Brian Haley

Bug Description

Pipeline here: https://zuul.opendev.org/t/openstack/builds?job_name=openstack-tox-cover&project=openstack/neutron

First failure is May 14: https://zuul.opendev.org/t/openstack/build/6899085a449248ed8b017eb4e9f231ab

In the logs, the failure looks like this:

2024-05-14 16:33:32.050334 | ubuntu-jammy | Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
2024-05-14 16:33:32.050424 | ubuntu-jammy | declare_namespace(pkg)
2024-05-14 16:33:32.050451 | ubuntu-jammy | /home/zuul/src/opendev.org/openstack/neutron/.tox/cover/lib/python3.10/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('repoze')`.
2024-05-14 16:33:32.050472 | ubuntu-jammy | Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
2024-05-14 16:33:32.050490 | ubuntu-jammy | declare_namespace(pkg)
2024-05-14 16:33:32.050516 | ubuntu-jammy | /home/zuul/src/opendev.org/openstack/neutron/.tox/cover/lib/python3.10/site-packages/pkg_resources/__init__.py:2832: DeprecationWarning: Deprecated call to `pkg_resources.declare_namespace('repoze')`.
2024-05-14 16:59:58.794881 | ubuntu-jammy | Implementing implicit namespace packages (as specified in PEP 420) is preferred to `pkg_resources.declare_namespace`. See https://setuptools.pypa.io/en/latest/references/keywords.html#keyword-namespace-packages
2024-05-14 16:59:58.796083 | ubuntu-jammy | declare_namespace(pkg)
2024-05-14 16:59:58.796171 | ubuntu-jammy | Killed
2024-05-14 17:03:29.030113 | ubuntu-jammy | Ran 20812 tests in 1777.707s
2024-05-14 17:03:29.174365 | ubuntu-jammy | FAILED (id=0, failures=1, skips=1701)

Could it be that the job no longer has enough memory and gets OOM killed?

I've compared the package versions between an older good run and a newer bad run, and I only see these bumped: sqlalchemy 1.4.51 -> 2.0.29 and alembic 1.9.4 -> 1.13.1.
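
A minimal sketch of that comparison, assuming the installed-package listings (pip freeze style) from a good and a bad run have been saved locally; the file names here are hypothetical:

# Diff two saved "pip freeze" style listings to spot version bumps.
# good.txt / bad.txt are hypothetical local copies of the package lists
# from an older passing run and a newer failing run.
def parse_freeze(path):
    versions = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if "==" in line:
                name, version = line.split("==", 1)
                versions[name.lower()] = version
    return versions

good = parse_freeze("good.txt")
bad = parse_freeze("bad.txt")

for name in sorted(good.keys() & bad.keys()):
    if good[name] != bad[name]:
        print(f"{name}: {good[name]} -> {bad[name]}")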

Different runs have different unit tests reported as failed (all failed runs claim a single test case failed).

Examples of different failed tests:

https://storage.bhs.cloud.ovh.net/v1/AUTH_dcaab5e32b234d56b626f72581e3644c/zuul_opendev_logs_218/917262/3/check/openstack-tox-cover/2180cc3/testr_results.html - neutron.tests.unit.services.network_segment_range.test_plugin.TestNetworkSegmentRange.test_delete_network_segment_range_failed_with_segment_referenced

https://9b86ab5bbc6be76c9905-30f46d6ec556e6b2dd47ea35fedbb1ac.ssl.cf5.rackcdn.com/919699/4/check/openstack-tox-cover/ce9baa9/testr_results.html - neutron.tests.unit.services.ovn_l3.test_plugin.OVNL3ExtrarouteTests.test_floatingip_update_different_port_owner_as_admin

https://6eed35a50c35f284b4d2-bf433abff5f8b85f7f80257b72ac6f67.ssl.cf2.rackcdn.com/919632/1/check/openstack-tox-cover/3b1c5fa/testr_results.html - neutron.tests.unit.services.placement_report.test_plugin.PlacementReportPluginTestCases.test__sync_placement_state_legacy

I suspect the specific unit test cases are not relevant - the test runner process dies for some reason, and whatever test it was running at that moment gets reported as failed.

Tags: gate-failure
tags: added: gate-failure
Changed in neutron:
status: New → Triaged
importance: Undecided → Critical
Revision history for this message
Bence Romsics (bence-romsics) wrote :
Revision history for this message
Brian Haley (brian-haley) wrote :

So it doesn't look like the job exceeded any time limit, meaning it's not getting killed by Zuul.

There isn't much of anything in the logs except "Killed", which isn't helpful, but it looks OOM-related.

Failure in testr log shows:

...test_router_remove_interface_returns_200testtools.testresult.real._StringException

Logstash only shows Neutron hits for the "Killed" string; the Octavia ones can be ignored since that service logs a message when it kills the provider driver.

The first occurrence according to logstash is here:

https://review.opendev.org/c/openstack/neutron/+/919602/1

May 14 2:28 PM

The other changes seem unrelated to each other, and as Bence noted, it's not the same test triggering the failure each time.

The only obvious thing that merged shortly before that is the requirements change for sqlalchemy 2.0.29:

https://review.opendev.org/c/openstack/requirements/+/879743

Promoted May 14 9:07 AM

I don't see anything in more recent versions that mentions memory issues:

https://docs.sqlalchemy.org/en/20/changelog/changelog_20.html#change-2.0.30

But these are unit tests, which shouldn't involve a running mysqld :-/

There weren't a whole lot of requirements bumps in that timeframe either:

https://review.opendev.org/q/project:openstack/requirements

testrepository is the only other one, bumped to 0.0.21; the previous version, 0.0.20, was from 2014, which is interesting, but it's not getting installed.

Regarding the version of coverage, it was 7.4.4 (2024-03-14) - again, not exactly recent, and nothing in a newer version's changelog implicates any memory issues. But I guess we could start pinning things to see.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/920150

Revision history for this message
Brian Haley (brian-haley) wrote :

Reverting the sqlalchemy bump makes the issue go away in my test patch.

Will try and narrow things down.

Changed in neutron:
assignee: nobody → Brian Haley (brian-haley)
Revision history for this message
Brian Haley (brian-haley) wrote :

So it was the oom-killer. I was able to get infra to hold the node on one of the failed jobs; here are some snippets from syslog:

May 28 21:55:17 np0037615281 kernel: [ 1900.367241] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-1000.slice/session-1.scope,task=/home/zuul/src/,pid=5226,uid=1000
May 28 21:55:17 np0037615281 kernel: [ 1900.367282] Out of memory: Killed process 5226 (/home/zuul/src/) total-vm:1051852kB, anon-rss:935592kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:1952kB oom_score_adj:0
May 28 21:55:17 np0037615281 systemd[1]: session-1.scope: A process of this unit has been killed by the OOM killer.

And the pid is this, which isn't very specific:

May 28 21:55:17 np0037615281 kernel: [ 1900.367219] [ 5226] 1000 5226 262963 233898 1998848 0 0 /home/zuul/src/

Since there were 8 of these, I'll assume it was the cover job running concurrently on 8 vCPUs.

There isn't much more info on the held system.
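
As a rough back-of-the-envelope check (assuming a standard ~8 GB CI node, which is an assumption, and that all 8 workers grow to roughly the size of the killed one):

anon_rss_kb = 935_592   # anon-rss of the killed worker, from the syslog line above
workers = 8             # one test worker per vCPU

per_worker_gib = anon_rss_kb / (1024 * 1024)
total_gib = per_worker_gib * workers
print(f"~{per_worker_gib:.2f} GiB per worker, ~{total_gib:.1f} GiB across {workers} workers")
# -> ~0.89 GiB per worker, ~7.1 GiB total, which would plausibly exhaust
#    an 8 GB node once the rest of the system is counted.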

Revision history for this message
Ihar Hrachyshka (ihar-hrachyshka) wrote :

@Brian do we have an estimate of the time added to the job if we reduce the concurrency to e.g. 4? This could be a short-term path forward. I think we should still check with the sqlalchemy author (Mike) to see if he has any ideas why the sqlalchemy version bump could blow up memory usage for the cover job. (I also wonder if any other projects noticed a memory use increase after the sqla bump. Maybe the infra folks have some insights.)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/920766

Changed in neutron:
status: Triaged → In Progress
Revision history for this message
Brian Haley (brian-haley) wrote :

So just based on my test patch, the cover job typically takes around 36 minutes. With --concurrency 4 it takes between 46 minutes and 1:18, but that would probably never be noticed since other jobs run longer. The job will get killed at 1:20 without increasing its timeout, so we might have to deal with that.

I was going to send an email to the list asking for advice and/or Mike's help. It might be that I can just run it locally while gathering stats and see something; I know when I ran it locally things got very slow. Don't know why I didn't do that earlier.
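
One way to gather those stats locally - a rough sketch assuming psutil is available; the command-line filter is a guess and may need adjusting to match the actual worker processes:

import time
import psutil

def sample_workers(match=("stestr", "subunit", "coverage")):
    # Print RSS for anything that looks like a test worker, plus a running total.
    total = 0
    for proc in psutil.process_iter(["pid", "cmdline", "memory_info"]):
        try:
            cmdline = " ".join(proc.info["cmdline"] or [])
            if any(token in cmdline for token in match):
                rss = proc.info["memory_info"].rss
                total += rss
                print(f"  pid={proc.info['pid']} rss={rss / 2**20:.0f} MiB")
        except (psutil.NoSuchProcess, psutil.AccessDenied):
            continue
    print(f"total: {total / 2**30:.2f} GiB")

while True:  # sample every 30 seconds while the cover run is in progress
    sample_workers()
    time.sleep(30)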

I will also propose a change to add --concurrency 4 as a temporary workaround until we can figure this out.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/920766
Committed: https://opendev.org/openstack/neutron/commit/5a1e9826c26a6e4d2d078fcedff3a3f696b537a6
Submitter: "Zuul (22348)"
Branch: master

commit 5a1e9826c26a6e4d2d078fcedff3a3f696b537a6
Author: Brian Haley <email address hidden>
Date: Wed May 29 13:41:04 2024 -0400

    Temporarily add '--concurrency 4' to coverage job

    Add '--concurrency 4' when running the coverage job, as it
    helps work around an OOM killer issue we are seeing in
    the gate. It typically only adds between 10 and 30
    minutes to the job. Can revert once the issue is resolved.

    Partial-bug: #2065821
    Change-Id: I33aea3a3c884877573d4d45c6e07955bb177c766

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to neutron (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/neutron/+/921420

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to neutron (master)

Reviewed: https://review.opendev.org/c/openstack/neutron/+/921420
Committed: https://opendev.org/openstack/neutron/commit/eaca8a360fa6e6960549865184b8aae38fab6610
Submitter: "Zuul (22348)"
Branch: master

commit eaca8a360fa6e6960549865184b8aae38fab6610
Author: yatinkarel <email address hidden>
Date: Tue May 28 15:35:41 2024 +0530

    Add some swap to tox-cover job

    As mentioned in the related bug, we are seeing
    high memory usage in the job since the sqlalchemy
    bump to 2.0, and this leads to gate instability.
    Adding some swap to handle this until the issue
    is fixed.

    Also reverts the concurrency reduction change[1]
    that was done for this issue as that shouldn't be
    needed now.

    [1] https://review.opendev.org/c/openstack/neutron/+/920766
    Related-Bug: #2065821

    Change-Id: I33c4869c1d5230c0c8cc2f0dd4f6d071600bf220

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/c/openstack/neutron/+/921309
Committed: https://opendev.org/openstack/neutron/commit/11027e3e1ef9a58d5b2faa575a3764bd33cd2a08
Submitter: "Zuul (22348)"
Branch: master

commit 11027e3e1ef9a58d5b2faa575a3764bd33cd2a08
Author: Brian Haley <email address hidden>
Date: Tue Jun 4 12:23:27 2024 -0400

    Fix regex lines in zuul.d/* files

    Commit 260c968118934 broke the gate by causing jobs
    to not get run when it added RE2 compatibility for
    irrelevant-files. Digging found that RE2 doesn't
    support negative lookahead (and won't ever), so it's
    impossible to replace the previous pcre filter with a
    similar RE2 filter.

    Instead of reverting to the original filter, which
    is considered obsolete by zuul, this patch fixes the
    issue by explicitly listing all files under zuul.d/
    except the one that we actually want to trigger the
    jobs: zuul.d/project.yaml.

    Listing all the files in the directory for every job
    is not ideal, and we may revisit it later, or perhaps
    even reconsider the extensive use of irrelevant-files
    in the neutron tree. This will have to wait for when
    the gate is in better shape though.

    [0] https://github.com/google/re2/issues/156

    Related-bug: #2065821
    Change-Id: I3bba89ac14414c6b7d375072ae92d2e0b5497736
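
For illustration of the RE2 limitation mentioned above: Python's re module accepts a negative lookahead of the kind that had to be dropped, while RE2 rejects lookahead assertions entirely (see [0] in the commit message). The pattern below is only an example of the construct, not the exact filter from the neutron tree:

import re

# "Everything under zuul.d/ except project.yaml" via negative lookahead.
pattern = re.compile(r"^zuul\.d/(?!project\.yaml).*$")

print(bool(pattern.match("zuul.d/jobs.yaml")))     # True  - would be treated as irrelevant
print(bool(pattern.match("zuul.d/project.yaml")))  # False - changes here would still trigger jobs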
