Spike in TestNovaMigrationsMySQL.test_walk_versions/test_innodb_tables failures since April 1 2019 on limestone-regionone

Bug #1823251 reported by Matt Riedemann on 2019-04-04
8
This bug affects 2 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
High
Unassigned

Bug Description

Seen here:

http://logs.openstack.org/13/621313/21/check/openstack-tox-py36/ba2c848/job-output.txt.gz#_2019-04-03_22_54_01_907907

The test times out and the underlying issue seems to be an InterfaceError from pymysql:

http://paste.openstack.org/show/748901/

The thread switch issue could be related to bug 1793364.

The funny thing about this is it started spiking since April 1 when this merged:

https://github.com/openstack/nova/commit/e5269f2fbb773953266fccbfd9df3f75e351eeb9

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.InterfaceError%3A%20(pymysql.err.InterfaceError)%20(0%2C%20'')%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

And it predominantly shows up in the limestone-regionone nodes, I'm not sure why.

The change above that merged on 4/1 also removed skipsdist=True from nova's tox.ini so I'm wondering if there is something on the bionic images in those limestone nodes that causes weird issues.

I noticed in the job that failed that tox is 3.8.4 and in another job that passed tox was 3.8.1, but looking at:

https://tox.readthedocs.io/en/latest/changelog.html

I'm not seeing anything obvious about a behavior change for python 3.6 jobs.

Matt Riedemann (mriedem) wrote :

This would also help to explain why we've been seeing a spike in unit test job timeouts for nova recently because these tests hit that thread switch and then timeout.

summary: Spike in TestNovaMigrationsMySQL.test_walk_versions/test_innodb_tables
- failures since April 1 2019
+ failures since April 1 2019 on limestone-regionone
Matt Riedemann (mriedem) wrote :

Since this is failing on more than just nova, I wonder if something more global changed on 4/1 that is causing this on limestone nodes.

Matt Riedemann (mriedem) on 2019-04-04
Changed in nova:
importance: Medium → High

Reviewed: https://review.opendev.org/662434
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=15b9e81d37d1adaf5a13be68d12c1544405488e3
Submitter: Zuul
Branch: master

commit 15b9e81d37d1adaf5a13be68d12c1544405488e3
Author: Balazs Gibizer <email address hidden>
Date: Fri May 31 13:41:11 2019 +0200

    Simplfy test setup for TestNovaMigrations* tests

    The NoDBTestCase base class does a lot of env setup that is not needed
    for these tests. As we have issues with these tests (see bug 1823251)
    this patch tries to decrease the complexity of them to help the
    troubleshooting.

    Change-Id: I46fafa470b9d378d96c0f364e9b487742bf58cb2
    Related-Bug: #1823251

Related fix proposed to branch: master
Review: https://review.opendev.org/682661

Related fix proposed to branch: master
Review: https://review.opendev.org/683009

Related fix proposed to branch: master
Review: https://review.opendev.org/683011

Reviewed: https://review.opendev.org/683009
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aebebe713a50d1057d5dfdfa4f13be6cc3ce625b
Submitter: Zuul
Branch: master

commit aebebe713a50d1057d5dfdfa4f13be6cc3ce625b
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 18 19:55:14 2019 -0400

    Temporarily skip TestNovaMigrationsMySQL

    There are several changes approved for Train feature freeze
    which are still not merged yet nearly a week later because
    of gate failures with bug 1823251. This change skips those
    tests temporarily so we can land some code in time for Train
    RC1 but this change should be reverted prior to RC1 after
    all of those other changes are merged. Note that those
    changes (vpmems and pcpu series) don't really have anything
    to do with data model schema changes at this point so skipping
    this test for now won't drop any coverage for what is going in,
    and we aren't merging any other data model schema changes at
    this point before RC1. Again, this is a temporary measure and
    should be reverted within a few days.

    Change-Id: I0564c613f30d321de33a1bd71bfe48c37e72ff80
    Related-Bug: #1823251

Reviewed: https://review.opendev.org/683011
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4f261f98e19d28fa29ff6a4d62a6a3a8a4114575
Submitter: Zuul
Branch: master

commit 4f261f98e19d28fa29ff6a4d62a6a3a8a4114575
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 18 20:06:04 2019 -0400

    Revert "Temporarily skip TestNovaMigrationsMySQL"

    This reverts commit aebebe713a50d1057d5dfdfa4f13be6cc3ce625b.

    We said we would since the skip was temporary.

    Change-Id: I5f7f5779735e158688992ccc948e7305bae1c9c4
    Related-Bug: #1823251

Reviewed: https://review.opendev.org/718629
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=316fef22c699a8dc5fc13ed8d779369043167874
Submitter: Zuul
Branch: master

commit 316fef22c699a8dc5fc13ed8d779369043167874
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 18 19:55:14 2019 -0400

    Temporarily skip TestNovaMigrationsMySQL

    We are hitting bug 1823251 more frequently again close to the Ussuri
    Feature Freeze. So let's temporary turn of the unstable patches. We
    don't have patches in the queue with DB model changes so loosing this
    coverage is not super critical.

    Change-Id: I51039c4c5817add873a289234ab79569d2520cb3
    Related-Bug: #1823251

Change abandoned by Stephen Finucane (<email address hidden>) on branch: master
Review: https://review.opendev.org/682661

Download full text (5.0 KiB)

Finally I managed to reproduce the same issue locally.

Reproduction steps:

1) create IO. I did started coping files in a loop.
$ for i in {1..1000}; do cp loadfile loadfile3; done &

2) create load on mysqld.
$ mysqlslap -a -u openstack_citest -p --iterations 1000 --number-of-queries 8000 --number-char-cols 20 --number-int-cols 20 --concurrency 80 --password=openstack_citest &

3) run the nova unit test suite
$ tox -e py38

With enough load the test times out but due to the load the timeout time is pretty random as the signal handler also gets delayed due to the same load.

[...]
      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/connections.py", line 683, in _read_bytes
    data = self._rfile.read(num_bytes)

      File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/eventlet/greenio/base.py", line 371, in recv_into
    return self._recv_loop(self.fd.recv_into, 0, buffer, nbytes, flags)

[...]

     File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/eventlet/hubs/epolls.py", line 31, in do_poll
    return self.poll.poll(seconds)

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
    raise TimeoutException()

The nova db migration unit test sets timeout to these tests specifically: https://github.com/openstack/nova/blob/5d910b6695fc67517b6fe08194ed2754a182f9a9/nova/tests/unit/db/test_migrations.py#L101

But interrestingly the api db functional migration tests do not https://github.com/openstack/nova/blob/7dcc4cfea7da2ef1163e0a65618cb784f5159c6d/nova/tests/functional/db/api/test_migrations.py

Now that I can reporduce this I can play with the timeout values to see what is reasonable.

Full primary stack trace:

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 593, in do_execute
    cursor.execute(statement, parameters)

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/cursors.py", line 163, in execute
    result = self._query(query)

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/cursors.py", line 321, in _query
    conn.query(q)

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/connections.py", line 505, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/connections.py", line 724, in _read_query_result
    result.read()

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/connections.py", line 1069, in read
    first_packet = self.connection._read_packet()

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/connections.py", line 646, in _read_packet
    packet_header = self._r...

Read more...

Decreasing the scaling factor in https://github.com/openstack/nova/blob/master/nova/tests/unit/db/test_migrations.py#L101 makes the test to fail faster so I guess we just need to increase it and see if we are still in the time budget of the whole unit test run when on a slow node.

My tests shows a 75% decrease of test runtime after the patches: https://review.opendev.org/q/topic:%22bp%252Fcompact-db-migrations-wallaby%22+(status:open%20OR%20status:merged)

So the final solution is compacting the migrations. Until that merges we need to increase the timeout.

Increasing the timeout did not help so now I try to execute the db migration test in serial https://review.opendev.org/c/openstack/nova/+/767590

https://review.opendev.org/c/openstack/nova/+/767590 has been merged and it seems it does help, but does not fully remove the timeout. We need to merge https://review.opendev.org/q/topic:%22bp%252Fcompact-db-migrations-wallaby%22+(status:open%20OR%20status:merged) to solve this

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers