Haproxy glance-api is DOWN, after all ceph nodes removal/addition and redeploy

Bug #1459743 reported by Tatyanka
This bug affects 1 person
Affects                    Status          Importance   Assigned to     Milestone
Fuel for OpenStack         Fix Committed   High         Max Yatsenko
Fuel for OpenStack 7.0.x   Fix Released    High         Max Yatsenko

Bug Description

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  api: "1.0"
  build_number: "472"
  build_id: "2015-05-26_23-21-01"
  nailgun_sha: "f737675091bd1903aace0e36812e855ce47dfec7"
  python-fuelclient_sha: "e19f1b65792f84c4a18b5a9473f85ef3ba172fce"
  astute_sha: "0bd72c72369e743376864e8e8dabfe873d40450a"
  fuel-library_sha: "8cfeca1a86179ebed1e4e03b2133b49c27350f6f"
  fuel-ostf_sha: "87819878bc0ca572900e1f6933d9b99e666d6f62"
  fuelmain_sha: "13b3e9cf074ba1cf1ae06509c55fbab613c73f4e"

Steps to Reproduce:
OS: Ubuntu
Neutron VLAN
1. Deploy 1 controller and 2 compute + ceph nodes, with ceph used for all storage
2. When the cluster is ready, delete the 2 computes with ceph (before deleting the ceph nodes you need to delete their OSDs manually: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/), add 2 new ceph nodes, and redeploy the env
3. When the cluster is ready, add 1 compute and 2 controllers and redeploy the env
4. When the cluster is ready, run OSTF

Actual result:
OSTF tests for images fail with a 503 error.
The haproxy backend for glance-api is down:
root@node-11:~# echo "show stat" | nc -U /var/lib/haproxy/stats | awk -F',' '{if (NR == 1 || $1 ~/glance/){printf "%14s %14s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n", $1,$2,$8,$13,$14,$15,$18,$22,$23,$25,$28}}'
      # pxname svname stot ereq econ eresp status chkfail chkdown downtime iid
    glance-api FRONTEND 0 0 OPEN 10
    glance-api node-5 0 0 0 DOWN 4 2 15843 10
    glance-api node-6 0 0 0 DOWN 4 2 15733 10
    glance-api node-11 0 0 0 DOWN 4 2 11191 10
    glance-api BACKEND 0 0 0 DOWN 2 11181 10
glance-registry FRONTEND 0 0 OPEN 12
glance-registry node-5 0 0 0 UP 1 1 778 12
glance-registry node-6 0 0 0 UP 1 1 778 12
glance-registry node-11 0 0 0 UP 2 1 412 12
glance-registry BACKEND 0 0 0 UP 1 410 12

Expected:
After redeployment of the env, all services work correctly.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Revision history for this message
Dmitry Ilyin (idv1985) wrote :

Glance APIs are running on all controllers and are listening on the network, but they don't respond. Otherwise the processes look normal.

Restarting glance-api helps.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

So far it seems the test case broke the ceph cluster, so glance-api talks to the old ceph cluster. From my point of view this bug is invalid, as you cannot expect the cluster to work after you completely destroy your ceph cluster.

Changed in fuel:
status: New → Invalid
status: Invalid → Incomplete
status: Incomplete → Won't Fix
no longer affects: fuel/6.1.x
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

We can add some warnings and workarounds for this test case, but I think it is a Won't Fix for 6.1.

summary: - Haproxy glance-api is DOWN, after remove/addiotinal nodes and redeploy
+ Haproxy glance-api is DOWN, after all ceph nodes removal/addition and
+ redeploy
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Why do you say I destroyed the whole cluster? I added 2 new ceph nodes and redeployed (btw, when the post-deployment task upload_cirros.rb ran, glance was OK, and we can see in the astute logs that we received the image list there after the new ceph nodes were added), but then it stopped working. So if glance is trying to talk to the old ceph cluster, why did we successfully get the image list in the post-deployment task upload_cirros.rb (for node-11)?

Changed in fuel:
status: Won't Fix → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Glance (mos-glance)
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

Together with mfedosin and ativelkov we managed to find the reason why glance-api could end up in a non-responding state: librbd + eventlet coexistence (see the Cinder problem of the same kind and the proposed solution https://review.openstack.org/#/c/175555/).

Most likely, as the very last request that glance-api received was registry's image PUT at 2015-05-28T13:10:23, the scenario was as follows:

* the client requests image creation
* the driver's Store.add() is called
* the blocking call happens here: https://github.com/openstack/glance_store/blob/master/glance_store/_drivers/rbd.py#L351
  Store._create() becomes synchronous by nature due to the librbd + eventlet issues and the direct call to rbd.RBD():
  https://github.com/openstack/glance_store/blob/master/glance_store/_drivers/rbd.py#L265

That is just a hint (though it is definitely a glance_store rbd driver bug), so the problem has to be investigated further.
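
For readers unfamiliar with the pitfall described above, here is a minimal self-contained sketch (not glance_store code; it only assumes stock eventlet is installed) of how a single call that eventlet cannot monkey-patch, such as a C-level librbd/librados call, freezes every green thread in the same glance-api worker:

import eventlet
eventlet.monkey_patch()

from eventlet import patcher

# Stand-in for a librbd/librados call: a plain blocking function that
# eventlet cannot make cooperative (here, the unpatched time.sleep).
blocking_call = patcher.original('time').sleep

def store_add():
    blocking_call(10)            # the whole worker is stuck here
    print("image written")

def api_request(n):
    print("request %d answered" % n)

eventlet.spawn(store_add)
for i in range(3):
    eventlet.spawn(api_request, i)

# None of the three "requests" (and, in the real service, none of haproxy's
# health checks) is answered until the blocking call returns ~10 seconds later.
eventlet.sleep(12)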

Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

We also need some information about the state of the environment between steps 2 and 3 of the reproduction instructions:

....
2. When the cluster is ready, delete the 2 computes with ceph (before deleting the ceph nodes you need to delete their OSDs manually: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/), add 2 new ceph nodes, and redeploy the env

Wanted: glance image-list output and OSTF test run results

3. When the cluster is ready, add 1 compute and 2 controllers and redeploy the env
 ...

Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

And some comments on ceph.log would be most welcome:

(grep '210097fd-3d94-448f-b957-f1a01829aecd' node-11.test.domain.local/var/log/ceph/ceph.log)

2015-05-28 13:14:24.122289 osd.2 10.109.4.9:6800/6645 102 : [WRN] slow request 240.588014 seconds old, received at 2015-05-28 13:10:23.534173: osd_op(client.5014.0:1 210097fd-3d94-448f-b957-f1a01829aecd.rbd [stat] 3.ab8ad092 ack+read e58) v4 currently waiting for pg to exist locally
2015-05-28 13:18:24.171729 osd.2 10.109.4.9:6800/6645 103 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 480.637476 secs
2015-05-28 13:18:24.171754 osd.2 10.109.4.9:6800/6645 104 : [WRN] slow request 480.637476 seconds old, received at 2015-05-28 13:10:23.534173: osd_op(client.5014.0:1 210097fd-3d94-448f-b957-f1a01829aecd.rbd [stat] 3.ab8ad092 ack+read e58) v4 currently waiting for pg to exist locally
2015-05-28 13:26:24.272224 osd.2 10.109.4.9:6800/6645 105 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 960.737951 secs
2015-05-28 13:26:24.272248 osd.2 10.109.4.9:6800/6645 106 : [WRN] slow request 960.737951 seconds old, received at 2015-05-28 13:10:23.534173: osd_op(client.5014.0:1 210097fd-3d94-448f-b957-f1a01829aecd.rbd [stat] 3.ab8ad092 ack+read e58) v4 currently waiting for pg to exist locally

Why do we have requests that are 960 seconds old?

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

So, additional info for step 2:
2. When the cluster is ready, delete the 2 computes with ceph (before deleting the ceph nodes you need to delete their OSDs manually: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/), add 2 new ceph nodes, and redeploy the env

1. list images - works
2. try to upload an image - 503 error
3. list images again - 503 error
http://paste.openstack.org/show/245158/

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

root@node-22:~# ceph osd tree
# id weight type name up/down reweight
-4 0.09998 host node-26
0 0.04999 osd.0 up 1
3 0.04999 osd.3 up 1
-5 0.09998 host node-25
1 0.04999 osd.1 up 1
4 0.04999 osd.4 up 1
2 0 osd.2 down 0
root@node-22:~

ruhe (ruhe)
Changed in fuel:
assignee: MOS Glance (mos-glance) → Inessa Vasilevskaya (ivasilevskaya)
Revision history for this message
Mike Fedosin (mfedosin) wrote :

I made an upstream commit that fixes the issues with glance lockups on ceph: https://review.openstack.org/#/c/186780/
The reasons why this specific ceph cluster doesn't respond have to be investigated separately.

ruhe (ruhe)
tags: added: ceph glance
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

What I see on the environment provided is that an important detail was omitted from the bug description:

If you try to run any OSTF test that requires a healthy ceph environment after step 2 (destruction of the 2 computes, addition of 2 new nodes, and env redeploy), you will receive:

"There are no compute nodes" failure.

So I see the bug as follows:

If you brutally destroy the ceph cluster and try to atone for your sins by creating 2 new nodes and redeploying the env, the 2 new nodes won't be automatically picked up as computes. The first request that requires glance's interaction with ceph (like a PUT) will cause glance to hang, due to the reasons described in detail above (the cinder-like eventlet + librbd problem) and due to the absence of a rados timeout by default.

My proposal is to mark the problem as invalid for 6.1 and schedule the glance_store bug fix for 7.0.

Changed in fuel:
status: Confirmed → Invalid
Changed in fuel:
status: Invalid → Won't Fix
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

sample release notes for 6.1:

Problem: glance-api stops responding (HTTP 503) after destruction of all ceph compute nodes followed by their recreation and environment redeployment.

Description and reproducibility:
Setup: mos 6.1, ceph as backend, 1 controller, 2 computes.
Destroy ceph computes, create 2 new compute nodes, recreate environment, DO NOT restart glance-api.

Because librbd calls from glance_store's rbd driver are not patched by the eventlet library, and because rados_connect_timeout is absent, any call to ceph with badly configured computes results in a request that blocks forever.

Workaround: glance-api service restart.

tags: added: release-notes
tags: added: release-notes-done
removed: release-notes
Revision history for this message
Mike Fedosin (mfedosin) wrote :
Revision history for this message
Mike Fedosin (mfedosin) wrote :

A simple workaround for this is to set a ceph config parameter in ceph.conf - 'rgw op thread suicide timeout' (http://ceph.com/docs/master/radosgw/config-ref/).

I think 30 seconds is a reasonable value for it. This will fully prevent glance hangups during connection to rados.

Revision history for this message
Mike Fedosin (mfedosin) wrote :

Btw, it will also help to avoid the similar issue in Cinder.

Revision history for this message
Max Yatsenko (myatsenko) wrote :

There is a CR that contains a fix for the bug: https://review.openstack.org/#/c/214745/

Revision history for this message
Max Yatsenko (myatsenko) wrote :

A patch that can fix the bug correctly was merged upstream: https://review.openstack.org/#/c/200554/

It will be backported from upstream and then configured in an additional CR.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/glance_store (openstack-ci/fuel-7.0/2015.1.0)

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Mike Fedosin <email address hidden>
Review: https://review.fuel-infra.org/10896

Changed in fuel:
assignee: Mike Fedosin (mfedosin) → Max Yatsenko (myatsenko)
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/glance_store (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/10896
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: fa4886af27de6cec50db2ea19b0099a1c8fdf7af
Author: Mike Fedosin <email address hidden>
Date: Fri Aug 28 15:02:22 2015

Prevent glance-api hangups during connection to rbd

This code adds new config option for rados connect timeout,
which tells how much time glance-api has to wait before
close the connection.

Setting 'rados_connect_timeout' <= 0 means no timeout.

DocImpact:
    new config option 'rados_connect_timout' (default=0)

Closes-Bug: #1459743
Cherry-picked from https://review.openstack.org/#/c/200554/

Change-Id: I0e2f5a6630cfae5d3be215d1cbae6fdca5c6419d
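
For illustration, here is a rough sketch of the behaviour this option controls (hedged: this is not the actual glance_store driver code; it assumes the python-rados bindings and a readable /etc/ceph/ceph.conf, and the function name is made up for the example):

import rados

def open_ceph_connection(conf='/etc/ceph/ceph.conf', user='glance',
                         connect_timeout=30):
    client = rados.Rados(conffile=conf, rados_id=user)
    if connect_timeout > 0:
        # New behaviour: give up after connect_timeout seconds instead of
        # leaving the glance-api worker blocked forever.
        client.connect(timeout=connect_timeout)
    else:
        # Old behaviour (rados_connect_timeout <= 0): no timeout, so the
        # call may hang indefinitely when the cluster is unhealthy.
        client.connect()
    return client

With such a timeout, the first image operation against a broken cluster returns an error instead of leaving the API process stuck, which is the hangup that haproxy's checks were reporting as DOWN.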

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Denis Egorenko (degorenko) wrote :

The fix to fuel-library is not merged yet.

Revision history for this message
Denis Egorenko (degorenko) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/218449
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=d209be85da0e725a2d18839974d1a9816170eb18
Submitter: Jenkins
Branch: master

commit d209be85da0e725a2d18839974d1a9816170eb18
Author: Maksym Yatsenko <email address hidden>
Date: Fri Aug 28 22:54:49 2015 +0300

     Configure rados_connect_timeout parameter in glance config

     This patch adds new config option for rados connect timeout,
    which tells how much time glance-api has to wait before
    close the connection.

    Setting 'rados_connect_timeout' <= 0 means no timeout.

    Upstream change-id: I2352bdec2a3339785eb38249afe385c5e496457c

    Closes-Bug:#1459743

    Change-Id: Ie4435a6bae56e8e6c1192ab66c83aa72a6b76305

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Maksym Yatsenko (<email address hidden>) on branch: master
Review: https://review.openstack.org/214745

tags: added: on-verification
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Verified on #298, but it seems that we have a new bug: https://bugs.launchpad.net/mos/+bug/1498331

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

It seems that the bug still persists:

root@node-1:~# echo "show stat" | nc -U /var/lib/haproxy/stats | awk -F',' '{if (NR == 1 || $1 ~/glance/){printf "%14s %14s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n", $1,$2,$8,$13,$14,$15,$18,$22,$23,$25,$28}}'
      # pxname svname stot ereq econ eresp status chkfail chkdown downtime iid
    glance-api FRONTEND 148 0 OPEN 11
    glance-api node-7 4 0 2 DOWN 7 3 160772 11
    glance-api node-6 71 0 1 DOWN 12 4 157341 11
    glance-api node-1 67 0 1 DOWN 4 1 154986 11
    glance-api BACKEND 148 6 4 DOWN 3 154244 11
glance-registry FRONTEND 143 0 OPEN 13
glance-registry node-7 47 0 0 DOWN 4 2 52367 13
glance-registry node-6 48 0 0 UP 1 1 3092 13
glance-registry node-1 48 0 0 UP 1 0 0 13
glance-registry BACKEND 143 0 0 UP 0 0 13

Changed in fuel:
status: Fix Released → Confirmed
Revision history for this message
Max Yatsenko (myatsenko) wrote :

On node-1 (controller), the "glance-api" service was not running; I started "glance-api" and it works now:

# glance image-list
+--------------------------------------+-----------------------------------+-------------+------------------+------------+--------+
| ID | Name | Disk Format | Container Format | Size | Status |
+--------------------------------------+-----------------------------------+-------------+------------------+------------+--------+
| 1df94068-7fd5-4fe6-ab3f-0248d0065107 | cool | qcow2 | bare | 1073741824 | saving |
| 7c489ff2-7973-47b6-be7d-0d6c20cf350f | cool | qcow2 | bare | 1073741824 | saving |
| b4b095ec-4e63-4697-9b45-3d56296a90d2 | cool | qcow2 | bare | 10485760 | saving |
| 9f322ebd-27ad-420f-9d19-bf9ced16c282 | cool | qcow2 | bare | 10485760 | saving |
| e0c6dca6-cc4f-4d42-849e-c670fcf257e8 | cool | qcow2 | bare | 10485760 | saving |
| d4df1358-2c2c-4635-b015-9ef5ca151075 | ostf_test-image_glance-1266579261 | raw | bare | | saving |
| f792a13f-d601-4029-b9db-b23825190371 | ostf_test-image_glance-836406336 | raw | bare | 1024 | saving |
| 8bcc2f22-f003-41d7-95c0-64f20ba0b265 | TestVM | qcow2 | bare | 13287936 | active |
+--------------------------------------+-----------------------------------+-------------+------------------+------------+--------+

On node-6 (controller), "glance-api" was running and it is OK:

root@node-6:~# glance image-list
+--------------------------------------+-----------------------------------+-------------+------------------+------------+--------+
| ID | Name | Disk Format | Container Format | Size | Status |
+--------------------------------------+-----------------------------------+-------------+------------------+------------+--------+
| 1df94068-7fd5-4fe6-ab3f-0248d0065107 | cool | qcow2 | bare | 1073741824 | saving |
| 7c489ff2-7973-47b6-be7d-0d6c20cf350f | cool | qcow2 | bare | 1073741824 | saving |
| b4b095ec-4e63-4697-9b45-3d56296a90d2 | cool | qcow2 | bare | 10485760 | saving |
| 9f322ebd-27ad-420f-9d19-bf9ced16c282 | cool | qcow2 | bare | 10485760 | saving |
| e0c6dca6-cc4f-4d42-849e-c670fcf257e8 | cool | qcow2 | bare | 10485760 | saving |
| d4df1358-2c2c-4635-b015-9ef5ca151075 | ostf_test-image_glance-1266579261 | raw | bare | | saving |
| f792a13f-d601-4029-b9db-b23825190371 | ostf_test-image_glance-836406336 | raw | bare | 1024 | saving |
| 8bcc2f22-f003-41d7-95c0-64f20ba0b265 | TestVM |...


Revision history for this message
Max Yatsenko (myatsenko) wrote :

The environment was examined together with Oleksiy Molchanov once more. The bug was not confirmed.

Changed in fuel:
status: Confirmed → Fix Released
Changed in fuel:
status: Fix Released → Fix Committed