Spike in TestNovaMigrationsMySQL.test_walk_versions/test_innodb_tables failures since April 1 2019 on limestone-regionone

Bug #1823251 reported by Matt Riedemann
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
OpenStack Compute (nova)
Expired
High
Unassigned

Bug Description

Seen here:

http://logs.openstack.org/13/621313/21/check/openstack-tox-py36/ba2c848/job-output.txt.gz#_2019-04-03_22_54_01_907907

The test times out and the underlying issue seems to be an InterfaceError from pymysql:

http://paste.openstack.org/show/748901/

The thread switch issue could be related to bug 1793364.

The funny thing about this is it started spiking since April 1 when this merged:

https://github.com/openstack/nova/commit/e5269f2fbb773953266fccbfd9df3f75e351eeb9

http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22sqlalchemy.exc.InterfaceError%3A%20(pymysql.err.InterfaceError)%20(0%2C%20'')%5C%22%20AND%20tags%3A%5C%22console%5C%22&from=7d

And it predominantly shows up in the limestone-regionone nodes, I'm not sure why.

The change above that merged on 4/1 also removed skipsdist=True from nova's tox.ini so I'm wondering if there is something on the bionic images in those limestone nodes that causes weird issues.

I noticed in the job that failed that tox is 3.8.4 and in another job that passed tox was 3.8.1, but looking at:

https://tox.readthedocs.io/en/latest/changelog.html

I'm not seeing anything obvious about a behavior change for python 3.6 jobs.

Revision history for this message
Matt Riedemann (mriedem) wrote :

This would also help to explain why we've been seeing a spike in unit test job timeouts for nova recently because these tests hit that thread switch and then timeout.

summary: Spike in TestNovaMigrationsMySQL.test_walk_versions/test_innodb_tables
- failures since April 1 2019
+ failures since April 1 2019 on limestone-regionone
Revision history for this message
Matt Riedemann (mriedem) wrote :

Since this is failing on more than just nova, I wonder if something more global changed on 4/1 that is causing this on limestone nodes.

Matt Riedemann (mriedem)
Changed in nova:
importance: Medium → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/662434

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/662434
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=15b9e81d37d1adaf5a13be68d12c1544405488e3
Submitter: Zuul
Branch: master

commit 15b9e81d37d1adaf5a13be68d12c1544405488e3
Author: Balazs Gibizer <email address hidden>
Date: Fri May 31 13:41:11 2019 +0200

    Simplfy test setup for TestNovaMigrations* tests

    The NoDBTestCase base class does a lot of env setup that is not needed
    for these tests. As we have issues with these tests (see bug 1823251)
    this patch tries to decrease the complexity of them to help the
    troubleshooting.

    Change-Id: I46fafa470b9d378d96c0f364e9b487742bf58cb2
    Related-Bug: #1823251

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/678051

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/682661

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/683009

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/683011

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/683009
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aebebe713a50d1057d5dfdfa4f13be6cc3ce625b
Submitter: Zuul
Branch: master

commit aebebe713a50d1057d5dfdfa4f13be6cc3ce625b
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 18 19:55:14 2019 -0400

    Temporarily skip TestNovaMigrationsMySQL

    There are several changes approved for Train feature freeze
    which are still not merged yet nearly a week later because
    of gate failures with bug 1823251. This change skips those
    tests temporarily so we can land some code in time for Train
    RC1 but this change should be reverted prior to RC1 after
    all of those other changes are merged. Note that those
    changes (vpmems and pcpu series) don't really have anything
    to do with data model schema changes at this point so skipping
    this test for now won't drop any coverage for what is going in,
    and we aren't merging any other data model schema changes at
    this point before RC1. Again, this is a temporary measure and
    should be reverted within a few days.

    Change-Id: I0564c613f30d321de33a1bd71bfe48c37e72ff80
    Related-Bug: #1823251

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Reviewed: https://review.opendev.org/683011
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=4f261f98e19d28fa29ff6a4d62a6a3a8a4114575
Submitter: Zuul
Branch: master

commit 4f261f98e19d28fa29ff6a4d62a6a3a8a4114575
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 18 20:06:04 2019 -0400

    Revert "Temporarily skip TestNovaMigrationsMySQL"

    This reverts commit aebebe713a50d1057d5dfdfa4f13be6cc3ce625b.

    We said we would since the skip was temporary.

    Change-Id: I5f7f5779735e158688992ccc948e7305bae1c9c4
    Related-Bug: #1823251

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/718629

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.opendev.org/718629
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=316fef22c699a8dc5fc13ed8d779369043167874
Submitter: Zuul
Branch: master

commit 316fef22c699a8dc5fc13ed8d779369043167874
Author: Matt Riedemann <email address hidden>
Date: Wed Sep 18 19:55:14 2019 -0400

    Temporarily skip TestNovaMigrationsMySQL

    We are hitting bug 1823251 more frequently again close to the Ussuri
    Feature Freeze. So let's temporary turn of the unstable patches. We
    don't have patches in the queue with DB model changes so loosing this
    coverage is not super critical.

    Change-Id: I51039c4c5817add873a289234ab79569d2520cb3
    Related-Bug: #1823251

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Stephen Finucane (<email address hidden>) on branch: master
Review: https://review.opendev.org/682661

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :
Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :
Download full text (5.0 KiB)

Finally I managed to reproduce the same issue locally.

Reproduction steps:

1) create IO. I did started coping files in a loop.
$ for i in {1..1000}; do cp loadfile loadfile3; done &

2) create load on mysqld.
$ mysqlslap -a -u openstack_citest -p --iterations 1000 --number-of-queries 8000 --number-char-cols 20 --number-int-cols 20 --concurrency 80 --password=openstack_citest &

3) run the nova unit test suite
$ tox -e py38

With enough load the test times out but due to the load the timeout time is pretty random as the signal handler also gets delayed due to the same load.

[...]
      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/connections.py", line 683, in _read_bytes
    data = self._rfile.read(num_bytes)

      File "/usr/lib/python3.8/socket.py", line 669, in readinto
    return self._sock.recv_into(b)

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/eventlet/greenio/base.py", line 371, in recv_into
    return self._recv_loop(self.fd.recv_into, 0, buffer, nbytes, flags)

[...]

     File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/eventlet/hubs/epolls.py", line 31, in do_poll
    return self.poll.poll(seconds)

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/fixtures/_fixtures/timeout.py", line 52, in signal_handler
    raise TimeoutException()

The nova db migration unit test sets timeout to these tests specifically: https://github.com/openstack/nova/blob/5d910b6695fc67517b6fe08194ed2754a182f9a9/nova/tests/unit/db/test_migrations.py#L101

But interrestingly the api db functional migration tests do not https://github.com/openstack/nova/blob/7dcc4cfea7da2ef1163e0a65618cb784f5159c6d/nova/tests/functional/db/api/test_migrations.py

Now that I can reporduce this I can play with the timeout values to see what is reasonable.

Full primary stack trace:

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1276, in _execute_context
    self.dialect.do_execute(

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 593, in do_execute
    cursor.execute(statement, parameters)

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/cursors.py", line 163, in execute
    result = self._query(query)

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/cursors.py", line 321, in _query
    conn.query(q)

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/connections.py", line 505, in query
    self._affected_rows = self._read_query_result(unbuffered=unbuffered)

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/connections.py", line 724, in _read_query_result
    result.read()

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/connections.py", line 1069, in read
    first_packet = self.connection._read_packet()

      File "/root/rtox/nova/py38-walk/.tox/py38/lib/python3.8/site-packages/pymysql/connections.py", line 646, in _read_packet
    packet_header = self._r...

Read more...

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Decreasing the scaling factor in https://github.com/openstack/nova/blob/master/nova/tests/unit/db/test_migrations.py#L101 makes the test to fail faster so I guess we just need to increase it and see if we are still in the time budget of the whole unit test run when on a slow node.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

My tests shows a 75% decrease of test runtime after the patches: https://review.opendev.org/q/topic:%22bp%252Fcompact-db-migrations-wallaby%22+(status:open%20OR%20status:merged)

So the final solution is compacting the migrations. Until that merges we need to increase the timeout.

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

Increasing the timeout did not help so now I try to execute the db migration test in serial https://review.opendev.org/c/openstack/nova/+/767590

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :

https://review.opendev.org/c/openstack/nova/+/767590 has been merged and it seems it does help, but does not fully remove the timeout. We need to merge https://review.opendev.org/q/topic:%22bp%252Fcompact-db-migrations-wallaby%22+(status:open%20OR%20status:merged) to solve this

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Stephen Finucane <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/678051

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by "Stephen Finucane <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/774889
Reason: I'm tackling this as part of the alebmic series, so this shouldn't be necessary now

Revision history for this message
Balazs Gibizer (balazs-gibizer) wrote :
Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers