I dug into this more about a month ago and unfortunately came up empty-handed. Going to dump some info I gathered at the time on a DNM patch [1] here for the record.

"Looking at the mysql error log:

https://zuul.opendev.org/t/openstack/build/833a46b05c9641b9b22b3ee7f394e80b/log/logs/mysql/error.txt.gz

I see lots of errors [2] that I think must be why we get the cell timeouts:

...
2020-01-09T00:11:53.330894Z 157 [Note] Aborted connection 157 to db: 'keystone' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:12:19.047684Z 158 [Note] Aborted connection 158 to db: 'placement' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:13:09.703587Z 161 [Note] Aborted connection 161 to db: 'glance' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:13:09.703775Z 162 [Note] Aborted connection 162 to db: 'glance' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:41.215651Z 171 [Note] Aborted connection 171 to db: 'nova_api' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:44.841855Z 173 [Note] Aborted connection 173 to db: 'nova_cell0' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:44.842886Z 174 [Note] Aborted connection 174 to db: 'nova_cell1' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:44.958218Z 172 [Note] Aborted connection 172 to db: 'nova_api' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:48.749196Z 178 [Note] Aborted connection 178 to db: 'nova_cell1' user: 'root' host: 'localhost' (Got an error reading communication packets)
2020-01-09T00:14:48.749216Z 179 [Note] Aborted connection 179 to db: 'nova_cell1' user: 'root' host: 'localhost' (Got an error reading communication packets)
...
I note that all services are getting the same error, not just nova: cinder, neutron, placement, glance, keystone, and nova are all affected."

"For this failure, trying to correlate the cell timeout with interesting things from the mysql and dstat logging:

screen-n-sch.txt [3]:

Jan 16 18:12:50.807350 ubuntu-bionic-ovh-bhs1-0013905596 nova-scheduler[17316]: WARNING nova.context [None req-d3eeda8c-74d5-451c-8262-e80fa1e45c0f tempest-ServersTestManualDisk-965972951 tempest-ServersTestManualDisk-965972951] Timed out waiting for response from cell 948fa69c-ada4-488c-9808-83dd645d7069

mysql error.txt [4]:

2020-01-16T18:12:41.934112Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 4279ms. The settings might not be optimal. (flushed=4 and evicted=0, during the time.)

screen-dstat.txt [5]:

Jan 16 18:12:40.399103 ubuntu-bionic-ovh-bhs1-0013905596 dstat.sh[1006]: 16-01 18:12:40| 8 1 66 25 0|4785M 168M 438M 2024M| 46B 46B| 0 8008k| 0 144 |2371 4119 |2.27 2.00 2.20| 0 2.0 2.0| 0 0 |uwsgi 164902.5%3585B 62k|python2 1036 179k 457B 0%|mysqld 517M|4096k 8188M| 29 334 0 739 0"

The latest logstash query shows all hits turning up on OVH nodes.

[1] https://review.opendev.org/701478
[2] http://paste.openstack.org/show/788349
[3] http://paste.openstack.org/show/788507
[4] http://paste.openstack.org/show/788505
[5] http://paste.openstack.org/show/788506
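For anyone repeating this analysis, the "all services are affected, not just nova" observation can be checked mechanically rather than by eyeballing the error log. This is not part of the patch above, just a minimal sketch; the function name and sample lines are hypothetical, and it only assumes the standard mysqld "Aborted connection" note format shown in the excerpt:

```python
import re
from collections import Counter

# Matches the mysqld note format seen in the error log excerpt, capturing
# the database name, e.g.:
#   [Note] Aborted connection 157 to db: 'keystone' user: 'root' ...
ABORTED = re.compile(r"\[Note\] Aborted connection \d+ to db: '([^']+)'")

def tally_aborted(lines):
    """Count aborted-connection notes per database name."""
    return Counter(m.group(1) for line in lines
                   if (m := ABORTED.search(line)))

# Two lines lifted from the excerpt above, as sample input.
sample = [
    "2020-01-09T00:11:53.330894Z 157 [Note] Aborted connection 157 to db: "
    "'keystone' user: 'root' host: 'localhost' (Got an error reading "
    "communication packets)",
    "2020-01-09T00:14:44.841855Z 173 [Note] Aborted connection 173 to db: "
    "'nova_cell0' user: 'root' host: 'localhost' (Got an error reading "
    "communication packets)",
]
print(tally_aborted(sample))  # Counter({'keystone': 1, 'nova_cell0': 1})
```

Running this over the full error.txt would show whether the aborts really spread evenly across keystone, placement, glance, and the nova databases, or cluster on one schema.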
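The correlation step above (cell timeout at 18:12:50, page_cleaner stall at 18:12:41, dstat spike at 18:12:40) was done by hand; a small windowed lookup makes it repeatable across many failed builds. A sketch under the assumption that the logs have already been parsed into (timestamp, message) pairs; `nearby` and the sample events are illustrative, not from the actual tooling:

```python
from datetime import datetime, timedelta

def nearby(events, when, window=timedelta(seconds=30)):
    """Return (timestamp, message) pairs within +/- window of `when`."""
    return [(t, msg) for t, msg in events if abs(t - when) <= window]

# The cell-timeout time taken from the nova-scheduler warning above.
timeout_at = datetime.fromisoformat("2020-01-16T18:12:50")

# Parsed mysql error-log events (one real, one deliberately far away).
mysql_events = [
    (datetime.fromisoformat("2020-01-16T18:12:41"),
     "InnoDB: page_cleaner: 1000ms intended loop took 4279ms."),
    (datetime.fromisoformat("2020-01-16T17:50:00"), "unrelated note"),
]

for t, msg in nearby(mysql_events, timeout_at):
    print(t, msg)  # only the page_cleaner stall falls in the window
```

The same window can be run against the dstat samples to pick out the I/O-wait spike without scrolling through the whole file.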