[devops] Fuel master node overload causes system tests failure

Bug #1482130 reported by Artem Panchenko
This bug affects 1 person
Affects: Fuel for OpenStack
Status: Invalid
Importance: High
Assigned to: Fuel DevOps

Bug Description

Fuel version info (7.0 build #139): http://paste.openstack.org/show/411091/

System tests failed on CI because they received a 504 error from Nginx during cluster deployment:

http://paste.openstack.org/show/411094/

2015/08/06 02:59:53 [error] 642#0: *575 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.109.0.1, server: localhost, request: "GET /api/tasks/2 HTTP/1.1", upstream: "uwsgi://10.109.0.2:8001", host: "10.109.0.2:8000"
...
10.109.0.1 - - [06/Aug/2015:03:00:04 +0000] "GET /api/tasks/2 HTTP/1.1" 504 176 "-" "Python-urllib/2.7"
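When triaging similar failures, the 504 responses can be pulled out of the Nginx access log with a short script. The following is a minimal sketch, assuming the access log uses the same layout as the line above; the names `ACCESS_LINE` and `find_gateway_timeouts` are hypothetical and not part of Fuel or the system tests.

```python
import re

# Assumed layout, taken from the access log line quoted in this report:
# client - - [timestamp] "METHOD path protocol" status size ...
ACCESS_LINE = re.compile(
    r'^(?P<client>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) '
)


def find_gateway_timeouts(lines):
    """Return (timestamp, method, path) for every request answered with 504."""
    hits = []
    for line in lines:
        match = ACCESS_LINE.match(line)
        if match and match.group("status") == "504":
            hits.append(
                (match.group("time"), match.group("method"), match.group("path"))
            )
    return hits
```

Passing an open log file (or any iterable of lines) to `find_gateway_timeouts` gives the timestamps to compare against the monitoring data.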

According to the monitoring logs, the VM for the Fuel master node was overloaded.

ATOP logs from CI node:

http://paste.openstack.org/show/411088/

ATOP logs from Fuel master node VM:

http://paste.openstack.org/show/411090/

It looks like there was a disk performance problem (high iowait values) on the CI node 'mc2n7-srt'.

Artem Panchenko (apanchenko-8) wrote :
Changed in fuel:
status: New → Confirmed
Igor Shishkin (teran) wrote :

@Artem, what are your expectations?
The load is generated by the tests, so the proper solution is probably to investigate and improve the tests in that direction (for example, by adding retries).
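A minimal sketch of what such a retry could look like in a test helper, assuming the tests poll the task endpoint over HTTP with urllib (the client string in the log is Python-urllib). The helper names, retry count, and delays are hypothetical and are not taken from the actual system tests. Only 504 responses are retried; any other error is raised immediately.

```python
import time
import urllib.error
import urllib.request


def fetch_with_retries(fetch, retries=3, delay=5.0, sleep=time.sleep):
    """Call fetch() up to `retries` times (retries >= 1), retrying only on HTTP 504."""
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return fetch()
        except urllib.error.HTTPError as err:
            if err.code != 504:
                raise  # other HTTP errors are not retried
            last_error = err
            if attempt < retries:
                sleep(delay * attempt)  # linear back-off between attempts
    raise last_error


def get_task(url="http://10.109.0.2:8000/api/tasks/2", timeout=60):
    """Fetch the task resource; the URL is the one seen in the Nginx log."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()
```

A test would then call `fetch_with_retries(get_task)` instead of calling `get_task()` directly.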

Artem Panchenko (apanchenko-8) wrote :

@Igor,

I have observed this issue only once, and it was caused by CI node overload. Usually we get similar problems on CI because the server is overloaded by VMs spawned manually without a debug session, but in this case the only VMs running were those created by the tests. So I assumed the overload was caused by performance degradation on the mc2n7-srt server. If there have been no other complaints about that server, please mark this bug as incomplete/invalid.

As for the tests, I don't think it's fine to retry operations that failed due to hardware overload. In such cases we need to either figure out what produced the high load in the product or increase server performance (e.g. ask IT to add RAM or replace an old, slow hard drive).

Igor Shishkin (teran) wrote :

Moving to 8.0 since no real work has started on this and it doesn't block the 7.0 release.

Changed in fuel:
milestone: 7.0 → 8.0
Mateusz Matuszkowiak (mmatuszkowiak) wrote :

I agree with Artem that we can set the bug to Incomplete until it is reproduced again. At the moment, deeper investigation is impossible because that VM and its logs no longer exist.

Changed in fuel:
status: Confirmed → Incomplete
Dmitry Pyzhov (dpyzhov)
tags: added: area-devops
Igor Shishkin (teran) wrote :

Setting to Invalid according to policy.

Changed in fuel:
status: Incomplete → Invalid