py35 dsvm job failing with RemoteDisconnected error

Bug #1698355 reported by Rabi Mishra
Affects             Status        Importance  Assigned to      Milestone
OpenStack Heat      Invalid       Critical    Unassigned
neutron             Fix Released  Critical    Akihiro Motoki
oslo.serialization  Fix Released  Critical    Ihar Hrachyshka

Bug Description

traceback:

2017-06-16 10:24:47.339195 | 2017-06-16 10:24:47.338 |
2017-06-16 10:24:47.340517 | 2017-06-16 10:24:47.340 | heat_integrationtests.scenario.test_autoscaling_lbv2.AutoscalingLoadBalancerv2Test.test_autoscaling_loadbalancer_neutron
2017-06-16 10:24:47.342125 | 2017-06-16 10:24:47.341 | ------------------------------------------------------------------------------------------------------------------------
2017-06-16 10:24:47.343471 | 2017-06-16 10:24:47.343 |
2017-06-16 10:24:47.344919 | 2017-06-16 10:24:47.344 | Captured traceback:
2017-06-16 10:24:47.346272 | 2017-06-16 10:24:47.346 | ~~~~~~~~~~~~~~~~~~~
2017-06-16 10:24:47.347614 | 2017-06-16 10:24:47.347 | b'Traceback (most recent call last):'
2017-06-16 10:24:47.348873 | 2017-06-16 10:24:47.348 | b' File "/opt/stack/new/heat/heat_integrationtests/common/test.py", line 376, in _stack_delete'
2017-06-16 10:24:47.350049 | 2017-06-16 10:24:47.349 | b' success_on_not_found=True)'
2017-06-16 10:24:47.351627 | 2017-06-16 10:24:47.351 | b' File "/opt/stack/new/heat/heat_integrationtests/common/test.py", line 357, in _wait_for_stack_status'
2017-06-16 10:24:47.352791 | 2017-06-16 10:24:47.352 | b' fail_regexp):'
2017-06-16 10:24:47.353977 | 2017-06-16 10:24:47.353 | b' File "/opt/stack/new/heat/heat_integrationtests/common/test.py", line 321, in _verify_status'
2017-06-16 10:24:47.355411 | 2017-06-16 10:24:47.355 | b' stack_status_reason=stack.stack_status_reason)'
2017-06-16 10:24:47.356920 | 2017-06-16 10:24:47.356 | b"heat_integrationtests.common.exceptions.StackBuildErrorException: Stack AutoscalingLoadBalancerv2Test-1133164547 is in DELETE_FAILED status due to 'Resource DELETE failed: ConnectFailure: resources.sec_group: Unable to establish connection to http://10.1.43.45:9696/v2.0/security-group-rules/8d33f0cf-d473-455a-8fe2-978c64af5e0d: ('Connection aborted.', RemoteDisconnected('Remote end closed connection without response',))'"
2017-06-16 10:24:47.358227 | 2017-06-16 10:24:47.357 | b''

http://logs.openstack.org/65/473765/1/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/e07f32f/console.html#_2017-06-16_10_24_47_356920

heat engine log:

http://logs.openstack.org/65/473765/1/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/e07f32f/logs/screen-h-eng.txt.gz?level=INFO#_Jun_16_10_24_22_312023

In the same job, nova is failing to connect to neutron with the same error:

http://logs.openstack.org/65/473765/1/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/e07f32f/logs/screen-n-api.txt.gz?level=ERROR#_Jun_16_10_24_22_633655

It seems to happen mostly for security-group, port, and floating-IP requests.

Not sure if this is a neutronclient/urllib3 issue (I see a new openstacksdk release [1]) or something specific to changes merged recently to neutron [2].

[1] https://github.com/openstack/requirements/commit/1b30d517efd442867888359e4619d822f13a3cf2

[2] https://review.openstack.org/#/q/topic:bp/push-notifications

Rabi Mishra (rabi)
Changed in heat:
importance: Undecided → Critical
Rabi Mishra (rabi) wrote :

Looks like a child is killed and spawned again at the time of the error.

time: 0.3141711
Jun 16 11:18:39.592744 ubuntu-xenial-osic-cloud1-s3500-9357622 neutron-server[25622]: INFO oslo_service.service [-] Child 25715 killed by signal 9
Jun 16 11:18:39.606119 ubuntu-xenial-osic-cloud1-s3500-9357622 neutron-server[25622]: DEBUG oslo_service.service [-] Started child 542 {{(pid=25622) _start_child /usr/local/lib/python3.5/dist-packages/oslo_service/service.py:513}}
Jun 16 11:18:39.673921 ubuntu-xenial-osic-cloud1-s3500-9357622 neutron-server[25622]: DEBUG neutron_lib.callbacks.manager [-] Notify callbacks [] for process, after_init {{(pid=542) _notify_loop /usr/local/lib/python3.5/dist-packages/neutron_lib/callbacks/manager.py:167}}
Jun 16 11:18:39.688446 ubuntu-xenial-osic-cloud1-s3500-9357622 neutron-server[25622]: INFO neutron.wsgi [-] (542) wsgi starting up on http://0.0.0.0:9696
Jun 16

http://logs.openstack.org/04/461904/5/gate/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/beeca10/logs/screen-q-svc.txt.gz#_Jun_16_11_18_39_592744

Probably an OOM issue?

Akihiro Motoki (amotoki) wrote :

The link in comment #1 is from a different job.

The error message from the same job is below. Either way, we see "killed by signal 9".

Jun 16 10:24:22.872990 ubuntu-xenial-osic-cloud1-s3500-9356859 neutron-server[25764]: INFO oslo_service.service [-] Child 25855 killed by signal 9
Jun 16 10:24:23.025308 ubuntu-xenial-osic-cloud1-s3500-9356859 neutron-server[25764]: INFO neutron.wsgi [-] (783) wsgi starting up on http://0.0.0.0:9696

http://logs.openstack.org/65/473765/1/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/e07f32f/logs/screen-q-svc.txt.gz?level=INFO#_Jun_16_10_24_22_872990

Akihiro Motoki (amotoki) wrote :

It's an OOM issue.

Jun 16 10:24:22 ubuntu-xenial-osic-cloud1-s3500-9356859 kernel: Out of memory: Kill process 25855 (neutron-server) score 539 or sacrifice child
Jun 16 10:24:22 ubuntu-xenial-osic-cloud1-s3500-9356859 kernel: Killed process 25855 (neutron-server) total-vm:9168668kB, anon-rss:5334496kB, file-rss:2320kB

http://logs.openstack.org/65/473765/1/check/gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial/e07f32f/logs/syslog.txt.gz#_Jun_16_10_24_22
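
To put the kernel numbers above in perspective, here is a minimal sketch that parses the "Killed process" syslog line and converts the kB figures to GiB. The regular expression and the parse_oom_kill helper are illustrative assumptions written for this one log line, not part of any OpenStack tooling; other kernel versions format this message differently.

```python
import re

# Hypothetical pattern matching the "Killed process" line quoted in this bug.
KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<name>[^)]+)\) "
    r"total-vm:(?P<total_vm>\d+)kB, anon-rss:(?P<anon_rss>\d+)kB, "
    r"file-rss:(?P<file_rss>\d+)kB"
)


def parse_oom_kill(line):
    """Return pid, process name and memory figures in GiB, or None."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    info = {"pid": int(match.group("pid")), "name": match.group("name")}
    for key in ("total_vm", "anon_rss", "file_rss"):
        # The kernel reports kB; 1 GiB = 1024 * 1024 kB.
        info[key + "_gib"] = int(match.group(key)) / (1024 ** 2)
    return info


line = (
    "Jun 16 10:24:22 ubuntu-xenial-osic-cloud1-s3500-9356859 kernel: "
    "Killed process 25855 (neutron-server) total-vm:9168668kB, "
    "anon-rss:5334496kB, file-rss:2320kB"
)
info = parse_oom_kill(line)
print("pid %(pid)d (%(name)s): anon-rss %(anon_rss_gib).2f GiB, "
      "total-vm %(total_vm_gib).2f GiB" % info)
```

The killed pid matches the child that oslo_service reported as "killed by signal 9", and a single neutron-server worker was holding roughly 5 GiB of anonymous memory when the kernel killed it.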

OpenStack Infra (hudson-openstack) wrote : Fix proposed to neutron (master)

Fix proposed to branch: master
Review: https://review.openstack.org/474997

Changed in neutron:
assignee: nobody → Akihiro Motoki (amotoki)
status: New → In Progress
Akihiro Motoki (amotoki)
tags: added: gate-failure
Changed in neutron:
importance: Undecided → Critical
Akihiro Motoki (amotoki) wrote :

IRC discussion analyzing when the failure started:
http://eavesdrop.openstack.org/irclogs/%23openstack-neutron/%23openstack-neutron.2017-06-16.log.html#t2017-06-16T14:49:52

According to logstash, the status of gate-heat-dsvm-functional-convg-mysql-lbaasv2-py35-ubuntu-xenial is:
- The last success started 2017-06-15 06:38:32
- The first failure started 2017-06-15 09:09:14
- After that all heat py35 dsvm jobs failed.

A patch merged between these two times is:

commit c507656044fd6c19c1f3ead535772437f5a9cebc
Merge: 920c2f6 af52d49
Author: Jenkins <email address hidden>
Date: Thu Jun 15 07:37:38 2017 +0000

    Merge "Integrate Security Groups OVO"

The failure rate since then is 100%, so this merge is the prime suspect.
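
The narrowing step above can be sketched as a simple time-window filter. This is only an illustration: the merges list holds the one merge quoted in this comment, the utc and suspects helpers are hypothetical names, and it assumes the logstash timestamps are in UTC like the commit date.

```python
from datetime import datetime, timezone


def utc(text):
    """Parse 'YYYY-MM-DD HH:MM:SS' as a UTC timestamp (assumed timezone)."""
    parsed = datetime.strptime(text, "%Y-%m-%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


last_success_start = utc("2017-06-15 06:38:32")
first_failure_start = utc("2017-06-15 09:09:14")

# (commit id, subject, merge time); only the merge quoted above is listed.
merges = [
    (
        "c507656044fd6c19c1f3ead535772437f5a9cebc",
        'Merge "Integrate Security Groups OVO"',
        utc("2017-06-15 07:37:38"),
    ),
]


def suspects(merges, start, end):
    """Return merges that landed strictly between start and end."""
    return [m for m in merges if start < m[2] < end]


for commit, subject, merged_at in suspects(
        merges, last_success_start, first_failure_start):
    print(commit[:7], merged_at.isoformat(), subject)
```

Job start times are only a rough bound, since a job may check out the tree some time after it starts, so a window like this narrows the list of candidates rather than proving the cause.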

Ihar Hrachyshka (ihar-hrachyshka) wrote :

Neutron fix: https://review.openstack.org/#/c/474575/

Also, the oslo.serialization fix for the behavior that resulted in memory exhaustion: https://review.openstack.org/#/c/475052/

Changed in oslo.serialization:
importance: Undecided → Critical
status: New → Confirmed
Changed in heat:
status: New → Confirmed
Changed in oslo.serialization:
assignee: nobody → Ihar Hrachyshka (ihar-hrachyshka)
OpenStack Infra (hudson-openstack) wrote : Change abandoned on neutron (master)

Change abandoned by Ihar Hrachyshka (<email address hidden>) on branch: master
Review: https://review.openstack.org/474997
Reason: We land https://review.openstack.org/#/c/474575/ instead.

Akihiro Motoki (amotoki) wrote :

https://review.openstack.org/#/c/474575/ has been merged. Marking this as Fix Released in Neutron.

Changed in neutron:
status: In Progress → Fix Released
Rabi Mishra (rabi)
Changed in heat:
status: Confirmed → Invalid
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.serialization 2.19.0

This issue was fixed in the openstack/oslo.serialization 2.19.0 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.serialization (stable/ocata)

Fix proposed to branch: stable/ocata
Review: https://review.openstack.org/500781

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.serialization (stable/ocata)

Reviewed: https://review.openstack.org/500781
Committed: https://git.openstack.org/cgit/openstack/oslo.serialization/commit/?id=be828c35d84f8f10b07f11aa8bbf195a44e730f8
Submitter: Jenkins
Branch: stable/ocata

commit be828c35d84f8f10b07f11aa8bbf195a44e730f8
Author: Ihar Hrachyshka <email address hidden>
Date: Fri Jun 16 11:43:21 2017 -0700

    Don't iterate through addresses in netaddr.IPNetwork

    Currently, to_primitive tries to iterate through all addresses in the
    network, because the type doesn't have a special handling that would
    short circuit it, but also has __iter__. This may be detrimental to
    performance, up to the point of node crash due to memory exhaustion if
    the passed network range is too large (think of 0.0.0.0/0 or even
    2001::/64). This behavior also makes it impossible to restore the
    original data format (CIDR).

    This patch short circuits the iteration by handling the IPNetwork type
    as a special case, same as we do for IPAddress.

    Change-Id: I6aecd2d057d282a655ff9e4918c164253142b188
    Closes-Bug: #1698355
    (cherry picked from commit 38ac21b523f23f802557d94b527821bc84deaa16)
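
The failure mode described in the commit message can be sketched as follows. This is not the oslo.serialization code: to_primitive here is a simplified stand-in, the special_case_networks flag is a hypothetical switch added only to show both behaviors, and the standard-library ipaddress module is used in place of netaddr so the example is self-contained. Both libraries expose network types that define __iter__ over every address in the range.

```python
import ipaddress

_NETWORK_TYPES = (ipaddress.IPv4Network, ipaddress.IPv6Network)
_SCALARS = (str, int, float, bool, type(None))


def to_primitive(value, special_case_networks=True):
    """Simplified stand-in for a recursive 'convert to JSON-able' helper."""
    if isinstance(value, _SCALARS):
        return value
    if special_case_networks and isinstance(value, _NETWORK_TYPES):
        # Short circuit: keep the CIDR string instead of expanding it.
        return str(value)
    if isinstance(value, dict):
        return {k: to_primitive(v, special_case_networks)
                for k, v in value.items()}
    if hasattr(value, "__iter__"):
        # Without the special case, a network lands here and every
        # address in the range is materialized into a list.
        return [to_primitive(v, special_case_networks) for v in value]
    return str(value)


small = ipaddress.ip_network("192.0.2.0/30")
print(to_primitive(small, special_case_networks=False))
print(to_primitive(small))

huge = ipaddress.ip_network("0.0.0.0/0")
print(huge.num_addresses)            # size of the list the naive path would build
print(to_primitive({"cidr": huge}))  # the special case returns immediately
```

With the special case disabled, even the /30 network loses its CIDR form and becomes a list of four addresses; for 0.0.0.0/0 the same path would try to build a list of 2**32 entries, which is consistent with the memory exhaustion seen in the neutron-server worker.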

tags: added: in-stable-ocata
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.serialization 2.16.1

This issue was fixed in the openstack/oslo.serialization 2.16.1 release.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to oslo.serialization (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/508386

OpenStack Infra (hudson-openstack) wrote : Fix merged to oslo.serialization (stable/newton)

Reviewed: https://review.openstack.org/508386
Committed: https://git.openstack.org/cgit/openstack/oslo.serialization/commit/?id=e85a6706ae8a0e607e87028110170daa5043f237
Submitter: Jenkins
Branch: stable/newton

commit e85a6706ae8a0e607e87028110170daa5043f237
Author: Ihar Hrachyshka <email address hidden>
Date: Fri Jun 16 11:43:21 2017 -0700

    Don't iterate through addresses in netaddr.IPNetwork

    Currently, to_primitive tries to iterate through all addresses in the
    network, because the type doesn't have a special handling that would
    short circuit it, but also has __iter__. This may be detrimental to
    performance, up to the point of node crash due to memory exhaustion if
    the passed network range is too large (think of 0.0.0.0/0 or even
    2001::/64). This behavior also makes it impossible to restore the
    original data format (CIDR).

    This patch short circuits the iteration by handling the IPNetwork type
    as a special case, same as we do for IPAddress.

    Change-Id: I6aecd2d057d282a655ff9e4918c164253142b188
    Closes-Bug: #1698355
    (cherry picked from commit 38ac21b523f23f802557d94b527821bc84deaa16)
    (cherry picked from commit be828c35d84f8f10b07f11aa8bbf195a44e730f8)

tags: added: in-stable-newton
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/oslo.serialization 2.13.2

This issue was fixed in the openstack/oslo.serialization 2.13.2 release.

Changed in oslo.serialization:
status: Confirmed → Fix Released