[8.0] [CI tests] Test 8.0.fuel-library.pkgs.ubuntu.neutron_vlan_ha fails due to timeout

Bug #1635563 reported by Rodion Tikunov
Affects: Fuel for OpenStack
Status: Fix Committed
Importance: High
Assigned to: Rodion Tikunov
Milestone: (none)

Bug Description

Detailed bug description:
On the MOS 8.0 fuel-library branch, the 8.0.fuel-library.pkgs.ubuntu.neutron_vlan_ha test fails with the message "TimeoutError: Waiting timed out".
Latest green job: https://ci.fuel-infra.org/job/8.0.fuel-library.pkgs.ubuntu.neutron_vlan_ha/1056/console

Additional information:
https://ci.fuel-infra.org/job/8.0.fuel-library.pkgs.ubuntu.neutron_vlan_ha/

Tags: non-release
Changed in fuel:
status: New → Confirmed
Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Fuel CI (fuel-ci)
description: updated
Roman Vyalov (r0mikiam) wrote :

The tests are failing with the error "TimeoutError: Waiting timed out".
It's not related to the Jenkins job timeout.
I don't think this bug is related to the CI team.

Changed in fuel:
status: Confirmed → New
assignee: Fuel CI (fuel-ci) → Fuel QA Team (fuel-qa)
Rodion Tikunov (rtikunov) wrote :

I talked with the QA team about this, and they said they didn't change any timeouts.
Jobs fail after "2016-10-25 13:04:58,465 - INFO fuel_web_client.py:844 -- Get ID of a last created cluster". It seems the cluster was created improperly or didn't return its ID after creation.

Please check this before re-assigning to Fuel QA.

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Roman Vyalov (r0mikiam)
Roman Vyalov (r0mikiam)
Changed in fuel:
assignee: Roman Vyalov (r0mikiam) → nobody
assignee: nobody → Fuel CI (fuel-ci)
Changed in fuel:
status: New → Confirmed
Dmitry Kaigarodеsev (dkaiharodsev) wrote :

We did a run with the timeout doubled on the Jenkins job side, and it still fails:
https://ci.fuel-infra.org/job/8.0.fuel-library.pkgs.ubuntu.neutron_vlan_ha/1090/console
fuel-qa folks, please investigate this issue by checking the snapshot.

Changed in fuel:
assignee: Fuel CI (fuel-ci) → Fuel QA Team (fuel-qa)
Dmitry Kaigarodеsev (dkaiharodsev) wrote :

After discussion with fuel-qa, we decided to pass this bug to the Fuel Sustaining team for investigation.

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Fuel Sustaining (fuel-sustaining-team)
Rodion Tikunov (rtikunov) wrote :

The job failed because the nginx container doesn't restart properly. Error [1] occurs in [0]:
2016-10-27 13:11:53,515 - DEBUG ssh_client.py:719 -- timeout 5 dockerctl check nginx
 execution results: Exit code: 124

It seems that the OOM killer doesn't allow the nginx container to start; see [2], taken from /var/log/messages in the snapshot.

[0] https://ci.fuel-infra.org/job/8.0.fuel-library.pkgs.ubuntu.neutron_vlan_ha/1099/artifact/logs/1099/sys_test.log
[1] http://paste.openstack.org/show/587241/
[2] http://paste.openstack.org/show/587242/
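
For reading the log line above: exit code 124 is what GNU coreutils `timeout` returns when the wrapped command is still running at the deadline, so "timeout 5 dockerctl check nginx" most likely hung rather than reported a failure itself. A minimal Python sketch of that mapping follows; the helper name is hypothetical and is not part of fuel-qa.

```python
# Exit statuses documented for GNU coreutils `timeout`.
# describe_exit_code is a hypothetical helper, not fuel-qa code.
TIMEOUT_EXIT_CODES = {
    124: "command timed out",
    125: "the timeout command itself failed",
    126: "command was found but could not be invoked",
    127: "command was not found",
    137: "command (or timeout itself) was sent SIGKILL (128 + 9)",
}


def describe_exit_code(code: int) -> str:
    """Explain the exit status of a command run as `timeout N cmd`."""
    if code == 0:
        return "command succeeded within the time limit"
    # Any other status is passed through from the wrapped command.
    return TIMEOUT_EXIT_CODES.get(
        code, f"command exited on its own with status {code}"
    )


if __name__ == "__main__":
    print(describe_exit_code(124))
```

So the check did not say "nginx is broken"; it never finished within 5 seconds.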

Rodion Tikunov (rtikunov) wrote :
Changed in fuel:
assignee: Fuel Sustaining (fuel-sustaining-team) → Fuel CI (fuel-ci)
Dmitry Kaigarodеsev (dkaiharodsev) wrote :

Forwarding the bug to 'fuel-qa', since our team does not contribute to the fuel-qa or fuel-devops code.

Changed in fuel:
assignee: Fuel CI (fuel-ci) → Fuel QA Team (fuel-qa)
Vadim Rovachev (vrovachev) wrote :

Dear colleagues, to fix this, the following line needs to be added:
export ADMIN_NODE_MEMORY=4096
before running the test in the job:
https://ci.fuel-infra.org/job/8.0.fuel-library.pkgs.ubuntu.neutron_vlan_ha/

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → Fuel CI (fuel-ci)
Dmitry Kaigarodеsev (dkaiharodsev) wrote :
Changed in fuel:
assignee: Fuel CI (fuel-ci) → Fuel QA Team (fuel-qa)
Rodion Tikunov (rtikunov) wrote :

The memory on the master node didn't change.
From the snapshot logs (fuel-snapshot-2016-10-31_13-12-18/nailgun.test.domain.local/var/log/dmesg): System RAM: 3071MB
From the master node:
[root@nailgun ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:           2848        1362         107          62        1378        1201
Swap:          3071         379        2692

Rodion Tikunov (rtikunov) wrote :

It seems that the node was created from an old snapshot:
$ grep 'System RAM:' fuel-snapshot-2016-10-31_13-12-18/nailgun.test.domain.local/var/log/messages
Oct 28 13:33:45 nailgun kernel: Reserving 161MB of memory at 688MB for crashkernel (System RAM: 3071MB)
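
The same check can be scripted against a snapshot. This is only a sketch: the function names and the 64 MB tolerance are assumptions made for illustration, and the only input format it relies on is the "System RAM: ...MB" kernel message quoted above.

```python
import re

# Matches the "System RAM: 3071MB" fragment of the kernel message.
SYSTEM_RAM_RE = re.compile(r"System RAM:\s*(\d+)\s*MB")


def system_ram_mb(log_line):
    """Return the RAM size in MB reported in a kernel log line, or None."""
    match = SYSTEM_RAM_RE.search(log_line)
    return int(match.group(1)) if match else None


def memory_setting_applied(log_line, expected_mb, tolerance_mb=64):
    """True if the reported RAM is within tolerance_mb of expected_mb.

    tolerance_mb is a hypothetical allowance for the kernel reporting
    slightly less than the configured size (3071MB for a 3072MB VM).
    """
    actual = system_ram_mb(log_line)
    return actual is not None and actual >= expected_mb - tolerance_mb


if __name__ == "__main__":
    line = ("Oct 28 13:33:45 nailgun kernel: Reserving 161MB of memory "
            "at 688MB for crashkernel (System RAM: 3071MB)")
    print(system_ram_mb(line), memory_setting_applied(line, 4096))
```

With the line above, the node reports about 3 GB, so ADMIN_NODE_MEMORY=4096 had not taken effect on that environment.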

Alexander Kurenyshev (akurenyshev) wrote :

Guys, from the latest run [0]:

[root@nailgun ~]# systemctl status docker.service
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2016-10-31 12:30:57 UTC; 2h 35min ago
     Docs: http://docs.docker.com
 Main PID: 17389 (code=killed, signal=PIPE)

Oct 31 12:29:09 nailgun.test.domain.local docker[17389]: 2016/10/31 12:29:09 http: response.WriteHeader on hijacked connection
Oct 31 12:29:09 nailgun.test.domain.local docker[17389]: time="2016-10-31T12:29:09.794995108Z" level=info msg="GET /v1.20/exec/64860ef6afa7b1a115f93f3eb6cdc3aa38955f4244a4e4b48b6bb1232c8856e7/json"
Oct 31 12:29:10 nailgun.test.domain.local docker[17389]: time="2016-10-31T12:29:10.417839907Z" level=info msg="GET /v1.20/containers/fuel-core-8.0-nginx/json"
Oct 31 12:29:10 nailgun.test.domain.local docker[17389]: time="2016-10-31T12:29:10.430942699Z" level=info msg="POST /v1.20/containers/3104067baa920e8ded4c180ded5b849efae334f682d8a5d91bbb74560541428d/exec"
Oct 31 12:29:10 nailgun.test.domain.local docker[17389]: time="2016-10-31T12:29:10.431415899Z" level=info msg="POST /v1.20/exec/d072a6c6d6334b4fb522384dadfe1ece16313aa42280d7c56eea3bfd541aff8f/start"
Oct 31 12:29:10 nailgun.test.domain.local docker[17389]: 2016/10/31 12:29:10 http: response.WriteHeader on hijacked connection
Oct 31 12:29:10 nailgun.test.domain.local docker[17389]: time="2016-10-31T12:29:10.481166049Z" level=info msg="GET /v1.20/exec/d072a6c6d6334b4fb522384dadfe1ece16313aa42280d7c56eea3bfd541aff8f/json"
Oct 31 12:30:00 nailgun.test.domain.local systemd[1]: docker.service changed dead -> running
Oct 31 12:30:01 nailgun.test.domain.local systemd[1]: docker.service changed dead -> running
Oct 31 12:30:36 nailgun.test.domain.local systemd[1]: docker.service changed dead -> running

[root@nailgun ~]# grep -riP "killed process|Out of memory" /var/log/
[root@nailgun ~]#

It looks like docker was killed for some reason, but the OOM killer didn't do it.
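
The "Main PID" line in the status output above supports this as well: the kernel OOM killer terminates processes with SIGKILL, while docker here died from SIGPIPE. A small Python sketch of that check is below; the function names are hypothetical, and the regex assumes the "(code=..., signal=...)" format shown in the systemctl output.

```python
import re

# Matches e.g. "Main PID: 17389 (code=killed, signal=PIPE)".
MAIN_PID_RE = re.compile(
    r"Main PID:\s*(\d+)\s*\(code=(\w+), signal=(\w+)\)"
)


def killing_signal(status_text):
    """Return the signal name that killed the main PID, or None."""
    match = MAIN_PID_RE.search(status_text)
    if match and match.group(2) == "killed":
        return match.group(3)
    return None


def consistent_with_oom_kill(status_text):
    """The OOM killer uses SIGKILL, which systemd reports as signal=KILL."""
    return killing_signal(status_text) == "KILL"


if __name__ == "__main__":
    status = " Main PID: 17389 (code=killed, signal=PIPE)"
    print(killing_signal(status), consistent_with_oom_kill(status))
```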

I found a related docker issue:
https://github.com/docker/docker/issues/7087

And of course increasing the timeout couldn't help here.
So this bug is outside the fuel-qa area. Somebody from the dev team should take a look at docker.

[0] https://ci.fuel-infra.org/job/8.0.fuel-library.pkgs.ubuntu.neutron_vlan_ha/1109/consoleFull

Changed in fuel:
assignee: Fuel QA Team (fuel-qa) → MOS Maintenance (mos-maintenance)
Changed in fuel:
assignee: MOS Maintenance (mos-maintenance) → Rodion Tikunov (rtikunov)
Rodion Tikunov (rtikunov) wrote :

Reproduced on a master node with systemd-219.
A few seconds after "systemctl restart systemd-journald", the docker daemon failed.
It seems that we hit the bug https://rhn.redhat.com/errata/RHBA-2016-0536.html

The mos-linux team's advice is to update docker to version 1.10.

Workaround: add "--log-driver=syslog" to the OPTIONS variable in /etc/sysconfig/docker. After that, however, "docker logs" doesn't show container logs.
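
A sketch of how the workaround could be applied from a script. It assumes the file contains a single quoted "OPTIONS='...'" line, which may not match every installation; the function name and the sample content are hypothetical. It only transforms text, so writing the result back to /etc/sysconfig/docker and restarting docker are left out.

```python
import re

# Matches a quoted assignment such as OPTIONS='--selinux-enabled'.
OPTIONS_RE = re.compile(r"""^(OPTIONS=)(['"])(.*)\2[ \t]*$""", re.MULTILINE)


def add_log_driver(config_text, driver="syslog"):
    """Return config_text with --log-driver=<driver> added to OPTIONS.

    Assumes the sysconfig file uses a quoted OPTIONS=... line; if none
    is found, a new OPTIONS line is appended instead.
    """
    flag = f"--log-driver={driver}"

    def patch(match):
        options = match.group(3)
        if "--log-driver" in options:
            return match.group(0)  # already configured, leave untouched
        quote = match.group(2)
        return f"{match.group(1)}{quote}{(options + ' ' + flag).strip()}{quote}"

    new_text, count = OPTIONS_RE.subn(patch, config_text)
    if count == 0:
        separator = "" if not config_text or config_text.endswith("\n") else "\n"
        new_text = f"{config_text}{separator}OPTIONS='{flag}'\n"
    return new_text


if __name__ == "__main__":
    sample = "# hypothetical sample content\nOPTIONS='--selinux-enabled'\n"
    print(add_log_driver(sample), end="")
```

The function leaves an existing --log-driver setting alone, so running it twice does not duplicate the flag.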

Rodion Tikunov (rtikunov) wrote :

Assigned to mos-linux as we need the updated Docker.

Changed in fuel:
assignee: Rodion Tikunov (rtikunov) → MOS Linux (mos-linux)
Changed in fuel:
assignee: MOS Linux (mos-linux) → Rodion Tikunov (rtikunov)
Rodion Tikunov (rtikunov) wrote :

Packages with the new docker have been uploaded to http://pkg-updates.fuel-infra.org/

Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable/8.0)

Fix proposed to branch: stable/8.0
Review: https://review.openstack.org/398415

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable/8.0)

Reviewed: https://review.openstack.org/398415
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=2916c8c0f2ae8f6795327f55990a8c58eeade7a0
Submitter: Jenkins
Branch: stable/8.0

commit 2916c8c0f2ae8f6795327f55990a8c58eeade7a0
Author: Rodion Tikunov <email address hidden>
Date: Wed Nov 16 17:18:08 2016 +0300

    Added updating docker containers to update_fuel

    systemd update may lead to situation that docker and docker containers
    can be suspended.
    Now update_fuel function stops containers, deletes all images, stops
    docker daemon before update. Then starts docker and loads new images
    after update.

    Change-Id: If8adaf02d9a4c8a4b9ce23aa24001cef1bac8878
    Closes-bug: #1635563
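
The ordering described in the commit message can be sketched as follows. This is not the code from the fuel-qa change: the generic docker, systemctl and yum commands stand in for whatever update_fuel actually runs, and the function names and the image archive path are hypothetical.

```python
def docker_update_steps(image_archive, update_command="yum -y update"):
    """Return the shell commands in the order the commit message describes.

    The commands are generic stand-ins (an assumption), not the ones
    used by update_fuel; image_archive is a hypothetical path.
    """
    return [
        "docker stop $(docker ps -q)",        # stop running containers
        "docker rmi -f $(docker images -q)",  # delete all images
        "systemctl stop docker",              # stop the daemon before updating
        update_command,                       # package update (may bring in systemd)
        "systemctl start docker",             # start the daemon again
        f"docker load -i {image_archive}",    # load the new images
    ]


def run_steps(steps, run):
    """Feed each step, in order, to a caller-supplied run(command) callable."""
    for step in steps:
        run(step)


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    run_steps(docker_update_steps("/tmp/fuel-images.tar"), print)
```

The point of the ordering is that docker is already stopped when the update restarts system services, so the daemon cannot be killed or suspended halfway through.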

OpenStack Infra (hudson-openstack) wrote : Fix proposed to fuel-qa (stable-mu/8.0)

Fix proposed to branch: stable-mu/8.0
Review: https://review.openstack.org/399119

OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-qa (stable-mu/8.0)

Reviewed: https://review.openstack.org/399119
Committed: https://git.openstack.org/cgit/openstack/fuel-qa/commit/?id=1abafec33edaf31698ae4d4cf6fc8a20192b04ac
Submitter: Jenkins
Branch: stable-mu/8.0

commit 1abafec33edaf31698ae4d4cf6fc8a20192b04ac
Author: Rodion Tikunov <email address hidden>
Date: Wed Nov 16 17:18:08 2016 +0300

    Added updating docker containers to update_fuel

    systemd update may lead to situation that docker and docker containers
    can be suspended.
    Now update_fuel function stops containers, deletes all images, stops
    docker daemon before update. Then starts docker and loads new images
    after update.

    Change-Id: If8adaf02d9a4c8a4b9ce23aa24001cef1bac8878
    Closes-bug: #1635563
    (cherry picked from commit 2916c8c0f2ae8f6795327f55990a8c58eeade7a0)

Changed in fuel:
status: In Progress → Fix Committed
tags: added: non-release