tripleo-ci: nonha jobs failing with Unable to establish connection to https://192.0.2.2:13004/v1/a90407df1e7f4f80a38a1b1671ced2ff/stacks/overcloud/f9f6f712-8e89-4ea9-a34b-6084dc74b5c1

Bug #1616144 reported by James Slagle
Affects: tripleo
Status: Fix Released
Importance: Critical
Assigned to: James Slagle

Bug Description

Example failed job:

http://logs.openstack.org/53/359153/1/check-tripleo/gate-tripleo-ci-centos-7-ovb-nonha/4dd74c5/

The failure is intermittent, but when it does occur it only seems to affect nonha jobs, which use SSL.

Tags: ci
James Slagle (james-slagle) wrote :

Error from heat api log:

2016-08-23 15:20:39.957 2146 INFO eventlet.wsgi.server [req-fbf416e4-d33e-4747-8728-e8bbeb6ddb0d 0d9b950c7fdb4032b59f687e9c305eb2 a90407df1e7f4f80a38a1b1671ced2ff - default default] Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 477, in handle_one_response
    write(b''.join(towrite))
  File "/usr/lib/python2.7/site-packages/eventlet/wsgi.py", line 426, in write
    _writelines(towrite)
  File "/usr/lib64/python2.7/socket.py", line 334, in writelines
    self.flush()
  File "/usr/lib64/python2.7/socket.py", line 303, in flush
    self._sock.sendall(view[write_offset:write_offset+buffer_size])
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 377, in sendall
    tail = self.send(data, flags)
  File "/usr/lib/python2.7/site-packages/eventlet/greenio/base.py", line 359, in send
    total_sent += fd.send(data[total_sent:], flags)
error: [Errno 104] Connection reset by peer

Error that shows up on the console during overcloud deployment:

2016-08-23 15:16:00.119339 | 2016-08-23 15:15:29 [3]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119371 | 2016-08-23 15:15:29 [48]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119404 | 2016-08-23 15:15:32 [10]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119436 | 2016-08-23 15:15:33 [31]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119468 | 2016-08-23 15:15:35 [5]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119501 | 2016-08-23 15:15:37 [47]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119533 | 2016-08-23 15:15:38 [13]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119577 | 2016-08-23 15:15:41 [23]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119611 | 2016-08-23 15:15:43 [60]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119644 | 2016-08-23 15:15:44 [29]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119676 | 2016-08-23 15:15:46 [40]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119709 | 2016-08-23 15:15:47 [39]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119745 | 2016-08-23 15:15:50 [0]: CREATE_IN_PROGRESS state changed
2016-08-23 15:16:00.119779 | 2016-08-23 15:15:53 [42]: CREATE_IN_PROGRESS state changed
2016-08-23 15:17:01.453629 | Unable to establish connection to https://192.0.2.2:13004/v1/a90407df1e7f4f80a38a1b1671ced2ff/stacks/overcloud/f9f6f712-8e89-4ea9-a34b-6084dc74b5c1
2016-08-23 15:17:02.414953 | #################
2016-08-23 15:17:02.415065 | tripleo.sh -- Overcloud create - FAILED!
2016-08-23 15:17:02.415098 | #################

Changed in tripleo:
importance: Undecided → Critical
milestone: none → newton-3
status: New → Triaged
tags: added: alert
James Slagle (james-slagle) wrote :

This error happens while tripleoclient is polling for events from the overcloud stack.

I can reproduce this error pretty easily locally by deploying with an SSL undercloud that has 6GB RAM and 2 vCPUs. If I don't enable swap, something gets OOM killed. If I do enable swap, swap gets used (< 1GB) and then I hit this error.

The stack keeps deploying but the client has been killed, so the job fails. My investigation so far has only pointed out that it's the swap allocation that is delaying things enough to cause the client to fail in this way.

We do not see this error in the ha job even though it deploys more nodes. As of now, my only suspicion is that the overhead of the initial SSL connections is causing the error.
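
To illustrate the failure mode described above, where one failed connection during event polling ends the client while the stack itself keeps deploying, here is a minimal sketch of a polling loop that tolerates transient connection errors. It is not tripleoclient code and not the fix applied for this bug. The names poll_until_done, fetch_events and is_done are hypothetical, and the sketch assumes Python 3 and that the caller's fetch function raises the built-in ConnectionError when the API cannot be reached.

```python
import time


def poll_until_done(fetch_events, is_done, max_consecutive_failures=5,
                    delay=2.0, sleep=time.sleep):
    """Poll fetch_events() until is_done(event) is true for some event.

    fetch_events and is_done are hypothetical callables supplied by the
    caller. A ConnectionError is retried; the poll only gives up after
    max_consecutive_failures failures in a row.
    """
    failures = 0
    while True:
        try:
            events = fetch_events()
        except ConnectionError:
            failures += 1
            if failures >= max_consecutive_failures:
                raise
            # Wait a little longer after each consecutive failure.
            sleep(delay * failures)
            continue
        failures = 0  # a successful call resets the failure counter
        for event in events:
            if is_done(event):
                return event
        sleep(delay)
```

With a loop like this, a single connection failure on a slow, swapping undercloud would delay the poll instead of ending it. Whether retrying is appropriate for a real client is a separate question.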

James Slagle (james-slagle) wrote :

Increased the undercloud flavor on rh1 to 8GB RAM and 4 vCPUs.

Patch to update the ovb script:
https://review.openstack.org/360756

Changed in tripleo:
assignee: nobody → James Slagle (james-slagle)
status: Triaged → In Progress
tags: removed: alert
tags: added: ci
Changed in tripleo:
status: In Progress → Fix Released
Zane Bitter (zaneb) wrote :

Note there's a Heat bug tracking the increased memory use: bug 1626675.
