Haproxy glance-api is DOWN, after all ceph nodes removal/addition and redeploy

Bug #1459743 reported by Tatyanka
This bug affects 1 person
Affects                    Status          Importance   Assigned to     Milestone
Fuel for OpenStack         Fix Committed   High         Max Yatsenko
Fuel for OpenStack 7.0.x   Fix Released    High         Max Yatsenko

Bug Description

VERSION:
  feature_groups:
    - mirantis
  production: "docker"
  release: "6.1"
  openstack_version: "2014.2.2-6.1"
  api: "1.0"
  build_number: "472"
  build_id: "2015-05-26_23-21-01"
  nailgun_sha: "f737675091bd1903aace0e36812e855ce47dfec7"
  python-fuelclient_sha: "e19f1b65792f84c4a18b5a9473f85ef3ba172fce"
  astute_sha: "0bd72c72369e743376864e8e8dabfe873d40450a"
  fuel-library_sha: "8cfeca1a86179ebed1e4e03b2133b49c27350f6f"
  fuel-ostf_sha: "87819878bc0ca572900e1f6933d9b99e666d6f62"
  fuelmain_sha: "13b3e9cf074ba1cf1ae06509c55fbab613c73f4e"

Steps to Reproduce:
OS: Ubuntu
Neutron VLAN
1. Deploy 1 controller and 2 compute + ceph nodes, with ceph used for all storage
2. When the cluster is ready, delete the 2 computes with ceph (before deleting the ceph nodes you need to delete their OSDs manually: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/), add 2 new ceph nodes, and redeploy the env
3. When the cluster is ready, add 1 compute and 2 controllers and redeploy the env
4. When the cluster is ready, run OSTF

Actual result:
OSTF tests for images fail with a 503 error.
The haproxy backend for glance-api is down:
root@node-11:~# echo "show stat" | nc -U /var/lib/haproxy/stats | awk -F',' '{if (NR == 1 || $1 ~/glance/){printf "%14s %14s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n", $1,$2,$8,$13,$14,$15,$18,$22,$23,$25,$28}}'
      # pxname svname stot ereq econ eresp status chkfail chkdown downtime iid
    glance-api FRONTEND 0 0 OPEN 10
    glance-api node-5 0 0 0 DOWN 4 2 15843 10
    glance-api node-6 0 0 0 DOWN 4 2 15733 10
    glance-api node-11 0 0 0 DOWN 4 2 11191 10
    glance-api BACKEND 0 0 0 DOWN 2 11181 10
glance-registry FRONTEND 0 0 OPEN 12
glance-registry node-5 0 0 0 UP 1 1 778 12
glance-registry node-6 0 0 0 UP 1 1 778 12
glance-registry node-11 0 0 0 UP 2 1 412 12
glance-registry BACKEND 0 0 0 UP 1 410 12

Expected:
After redeployment of the env, all services work correctly.

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :
Revision history for this message
Dmitry Ilyin (idv1985) wrote :

Glance APIs are running on all controllers and are listening on the network, but they don't respond. Otherwise the processes look normal.

Restarting glance-api helps.

Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

So far it seems the test case broke the ceph cluster, so glance-api talks to the old ceph cluster. From my point of view this bug is invalid, as you cannot expect the cluster to work after you completely destroy your ceph cluster.

Changed in fuel:
status: New → Invalid
status: Invalid → Incomplete
status: Incomplete → Won't Fix
no longer affects: fuel/6.1.x
Revision history for this message
Vladimir Kuklin (vkuklin) wrote :

We can add some warnings and workarounds for this test case, but I think it is a Won't Fix for 6.1.

summary: - Haproxy glance-api is DOWN, after remove/addiotinal nodes and redeploy
+ Haproxy glance-api is DOWN, after all ceph nodes removal/addition and
+ redeploy
Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

Why do you say I destroyed the whole cluster? I added 2 new ceph nodes and redeployed (btw, when the post-deployment task upload_cirros.rb ran, glance was OK, and we can see in the astute logs that we received the image list there after the new ceph nodes were added), but then it stopped working. So if glance is trying to talk to the old ceph cluster, why did we successfully get the image list in the post-deployment task upload_cirros.rb (for node-11)?

Changed in fuel:
status: Won't Fix → Confirmed
Changed in fuel:
assignee: Fuel Library Team (fuel-library) → MOS Glance (mos-glance)
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

Together with mfedosin and ativelkov we managed to find the reason why glance-api could end up in a non-responding state: librbd + eventlet coexistence (see the Cinder problem of the same kind and the proposed solution https://review.openstack.org/#/c/175555/).

Most likely, as the very last request that glance-api received was registry's image PUT at 2015-05-28T13:10:23, the scenario was as follows:

* the client requests image creation
* the driver's Store.add() is called
* the blocking call happens here: https://github.com/openstack/glance_store/blob/master/glance_store/_drivers/rbd.py#L351
  Store._create() becomes synchronous by nature due to the librbd + eventlet issues and the direct call to rbd.RBD():
  https://github.com/openstack/glance_store/blob/master/glance_store/_drivers/rbd.py#L265

That is just a hint (though it is definitely a glance_store rbd driver bug), so the problem has to be investigated further.
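
For readers unfamiliar with the pitfall described above, here is a minimal self-contained sketch (not glance_store code; it only assumes stock eventlet is installed) of how a single call that eventlet cannot monkey-patch, such as a C-level librbd/librados call, freezes every green thread in the same glance-api worker:

import eventlet
eventlet.monkey_patch()

from eventlet import patcher

# Stand-in for a librbd/librados call: a plain blocking function that
# eventlet cannot make cooperative (here, the unpatched time.sleep).
blocking_call = patcher.original('time').sleep

def store_add():
    blocking_call(10)            # the whole worker is stuck here
    print("image written")

def api_request(n):
    print("request %d answered" % n)

eventlet.spawn(store_add)
for i in range(3):
    eventlet.spawn(api_request, i)

# None of the three "requests" (and, in the real service, none of haproxy's
# health checks) is answered until the blocking call returns ~10 seconds later.
eventlet.sleep(12)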

Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

We also need some information about the state of the environment between steps 2 and 3 of the reproduction instructions:

....
2. When the cluster is ready, delete the 2 computes with ceph (before deleting the ceph nodes you need to delete their OSDs manually: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/), add 2 new ceph nodes, and redeploy the env

Wanted: glance image-list output and OSTF test run results

3. When the cluster is ready, add 1 compute and 2 controllers and redeploy the env
 ...

Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

And some comments on ceph.log would be most welcome:

(grep '210097fd-3d94-448f-b957-f1a01829aecd' node-11.test.domain.local/var/log/ceph/ceph.log)

2015-05-28 13:14:24.122289 osd.2 10.109.4.9:6800/6645 102 : [WRN] slow request 240.588014 seconds old, received at 2015-05-28 13:10:23.534173: osd_op(client.5014.0:1 210097fd-3d94-448f-b957-f1a01829aecd.rbd [stat] 3.ab8ad092 ack+read e58) v4 currently waiting for pg to exist locally
2015-05-28 13:18:24.171729 osd.2 10.109.4.9:6800/6645 103 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 480.637476 secs
2015-05-28 13:18:24.171754 osd.2 10.109.4.9:6800/6645 104 : [WRN] slow request 480.637476 seconds old, received at 2015-05-28 13:10:23.534173: osd_op(client.5014.0:1 210097fd-3d94-448f-b957-f1a01829aecd.rbd [stat] 3.ab8ad092 ack+read e58) v4 currently waiting for pg to exist locally
2015-05-28 13:26:24.272224 osd.2 10.109.4.9:6800/6645 105 : [WRN] 1 slow requests, 1 included below; oldest blocked for > 960.737951 secs
2015-05-28 13:26:24.272248 osd.2 10.109.4.9:6800/6645 106 : [WRN] slow request 960.737951 seconds old, received at 2015-05-28 13:10:23.534173: osd_op(client.5014.0:1 210097fd-3d94-448f-b957-f1a01829aecd.rbd [stat] 3.ab8ad092 ack+read e58) v4 currently waiting for pg to exist locally

Why do we have requests that are 960 seconds old?

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

So, additional info for step 2:
2. When the cluster is ready, delete the 2 computes with ceph (before deleting the ceph nodes you need to delete their OSDs manually: http://ceph.com/docs/master/rados/operations/add-or-rm-osds/), add 2 new ceph nodes, and redeploy the env

1. list images - works
2. try to upload an image - 503 error
3. list images again - 503 error
http://paste.openstack.org/show/245158/

Revision history for this message
Tatyanka (tatyana-leontovich) wrote :

root@node-22:~# ceph osd tree
# id weight type name up/down reweight
-4 0.09998 host node-26
0 0.04999 osd.0 up 1
3 0.04999 osd.3 up 1
-5 0.09998 host node-25
1 0.04999 osd.1 up 1
4 0.04999 osd.4 up 1
2 0 osd.2 down 0
root@node-22:~

ruhe (ruhe)
Changed in fuel:
assignee: MOS Glance (mos-glance) → Inessa Vasilevskaya (ivasilevskaya)
Revision history for this message
Mike Fedosin (mfedosin) wrote :

I made an upstream commit that fixes the issues with glance lockups on ceph: https://review.openstack.org/#/c/186780/
The reasons why this specific ceph cluster doesn't respond have to be investigated separately.

ruhe (ruhe)
tags: added: ceph glance
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

What I see on the environment provided is that an important detail was omitted from the bug description:

If you try to run any OSTF test that requires a healthy ceph environment after step 2 (destruction of the 2 computes, addition of 2 new nodes, and env redeploy), you will receive:

"There are no compute nodes" failure.

So I see the bug as follows:

If you brutally destroy the ceph cluster and try to atone for your sins by creating 2 new nodes and redeploying the env, the 2 new nodes won't be automatically picked up as computes. The first request that requires glance's interaction with ceph (like a PUT) will cause glance to hang, due to the reasons described in detail above (the cinder-like eventlet + librbd problem) and due to the absence of a rados timeout by default.

My proposal is to mark the problem as invalid for 6.1 and schedule the glance_store bug fix for 7.0.

Changed in fuel:
status: Confirmed → Invalid
Changed in fuel:
status: Invalid → Won't Fix
Revision history for this message
Inessa Vasilevskaya (ivasilevskaya) wrote :

sample release notes for 6.1:

Problem: glance-api stops responding (HTTP 503) after destruction of all ceph compute nodes followed by their recreation and environment redeployment.

Description and reproducibility:
Setup: mos 6.1, ceph as backend, 1 controller, 2 computes.
Destroy ceph computes, create 2 new compute nodes, recreate environment, DO NOT restart glance-api.

Because librbd calls from glance_store's rbd driver are not patched by the eventlet library, and because rados_connect_timeout is absent, any call to ceph with badly configured computes results in a request that blocks forever.

Workaround: glance-api service restart.

tags: added: release-notes
tags: added: release-notes-done
removed: release-notes
Revision history for this message
Mike Fedosin (mfedosin) wrote :
Revision history for this message
Mike Fedosin (mfedosin) wrote :

A simple workaround for this is to set a ceph config parameter in ceph.conf - 'rgw op thread suicide timeout' (http://ceph.com/docs/master/radosgw/config-ref/).

I think 30 seconds is a reasonable value for it. This will fully prevent glance hangups during connection to rados.

Revision history for this message
Mike Fedosin (mfedosin) wrote :

Btw, it will also help to avoid the similar issue in Cinder.

Revision history for this message
Max Yatsenko (myatsenko) wrote :

There is a CR that contains a fix for the bug: https://review.openstack.org/#/c/214745/

Revision history for this message
Max Yatsenko (myatsenko) wrote :

A patch that can fix the bug correctly was merged upstream: https://review.openstack.org/#/c/200554/

It will be backported from upstream and then configured in an additional CR.

Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix proposed to openstack/glance_store (openstack-ci/fuel-7.0/2015.1.0)

Fix proposed to branch: openstack-ci/fuel-7.0/2015.1.0
Change author: Mike Fedosin <email address hidden>
Review: https://review.fuel-infra.org/10896

Changed in fuel:
assignee: Mike Fedosin (mfedosin) → Max Yatsenko (myatsenko)
Revision history for this message
Fuel Devops McRobotson (fuel-devops-robot) wrote : Fix merged to openstack/glance_store (openstack-ci/fuel-7.0/2015.1.0)

Reviewed: https://review.fuel-infra.org/10896
Submitter: mos-infra-ci <>
Branch: openstack-ci/fuel-7.0/2015.1.0

Commit: fa4886af27de6cec50db2ea19b0099a1c8fdf7af
Author: Mike Fedosin <email address hidden>
Date: Fri Aug 28 15:02:22 2015

Prevent glance-api hangups during connection to rbd

This code adds new config option for rados connect timeout,
which tells how much time glance-api has to wait before
close the connection.

Setting 'rados_connect_timeout' <= 0 means no timeout.

DocImpact:
    new config option 'rados_connect_timout' (default=0)

Closes-Bug: #1459743
Cherry-picked from https://review.openstack.org/#/c/200554/

Change-Id: I0e2f5a6630cfae5d3be215d1cbae6fdca5c6419d
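
For illustration, here is a rough sketch of the behaviour this option controls (hedged: this is not the actual glance_store driver code; it assumes the python-rados bindings and a readable /etc/ceph/ceph.conf, and the function name is made up for the example):

import rados

def open_ceph_connection(conf='/etc/ceph/ceph.conf', user='glance',
                         connect_timeout=30):
    client = rados.Rados(conffile=conf, rados_id=user)
    if connect_timeout > 0:
        # New behaviour: give up after connect_timeout seconds instead of
        # leaving the glance-api worker blocked forever.
        client.connect(timeout=connect_timeout)
    else:
        # Old behaviour (rados_connect_timeout <= 0): no timeout, so the
        # call may hang indefinitely when the cluster is unhealthy.
        client.connect()
    return client

With such a timeout, the first image operation against a broken cluster returns an error instead of leaving the API process stuck, which is the hangup that haproxy's checks were reporting as DOWN.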

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
Denis Egorenko (degorenko) wrote :

The fix to fuel-library is not merged yet.

Revision history for this message
Denis Egorenko (degorenko) wrote :
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to fuel-library (master)

Reviewed: https://review.openstack.org/218449
Committed: https://git.openstack.org/cgit/stackforge/fuel-library/commit/?id=d209be85da0e725a2d18839974d1a9816170eb18
Submitter: Jenkins
Branch: master

commit d209be85da0e725a2d18839974d1a9816170eb18
Author: Maksym Yatsenko <email address hidden>
Date: Fri Aug 28 22:54:49 2015 +0300

     Configure rados_connect_timeout parameter in glance config

     This patch adds new config option for rados connect timeout,
    which tells how much time glance-api has to wait before
    close the connection.

    Setting 'rados_connect_timeout' <= 0 means no timeout.

    Upstream change-id: I2352bdec2a3339785eb38249afe385c5e496457c

    Closes-Bug:#1459743

    Change-Id: Ie4435a6bae56e8e6c1192ab66c83aa72a6b76305

Changed in fuel:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on fuel-library (master)

Change abandoned by Maksym Yatsenko (<email address hidden>) on branch: master
Review: https://review.openstack.org/214745

tags: added: on-verification
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

Verified on #298, but it seems that we have a new bug: https://bugs.launchpad.net/mos/+bug/1498331

tags: removed: on-verification
Changed in fuel:
status: Fix Committed → Fix Released
Revision history for this message
Oleksiy Molchanov (omolchanov) wrote :

It seems that the bug still persists:

root@node-1:~# echo "show stat" | nc -U /var/lib/haproxy/stats | awk -F',' '{if (NR == 1 || $1 ~/glance/){printf "%14s %14s %10s %10s %10s %10s %10s %10s %10s %10s %10s\n", $1,$2,$8,$13,$14,$15,$18,$22,$23,$25,$28}}'
      # pxname svname stot ereq econ eresp status chkfail chkdown downtime iid
    glance-api FRONTEND 148 0 OPEN 11
    glance-api node-7 4 0 2 DOWN 7 3 160772 11
    glance-api node-6 71 0 1 DOWN 12 4 157341 11
    glance-api node-1 67 0 1 DOWN 4 1 154986 11
    glance-api BACKEND 148 6 4 DOWN 3 154244 11
glance-registry FRONTEND 143 0 OPEN 13
glance-registry node-7 47 0 0 DOWN 4 2 52367 13
glance-registry node-6 48 0 0 UP 1 1 3092 13
glance-registry node-1 48 0 0 UP 1 0 0 13
glance-registry BACKEND 143 0 0 UP 0 0 13

Changed in fuel:
status: Fix Released → Confirmed
Revision history for this message
Max Yatsenko (myatsenko) wrote :

On node-1 (controller), the "glance-api" service was not running; I started "glance-api" and it works now:

# glance image-list
+--------------------------------------+-----------------------------------+-------------+------------------+------------+--------+
| ID | Name | Disk Format | Container Format | Size | Status |
+--------------------------------------+-----------------------------------+-------------+------------------+------------+--------+
| 1df94068-7fd5-4fe6-ab3f-0248d0065107 | cool | qcow2 | bare | 1073741824 | saving |
| 7c489ff2-7973-47b6-be7d-0d6c20cf350f | cool | qcow2 | bare | 1073741824 | saving |
| b4b095ec-4e63-4697-9b45-3d56296a90d2 | cool | qcow2 | bare | 10485760 | saving |
| 9f322ebd-27ad-420f-9d19-bf9ced16c282 | cool | qcow2 | bare | 10485760 | saving |
| e0c6dca6-cc4f-4d42-849e-c670fcf257e8 | cool | qcow2 | bare | 10485760 | saving |
| d4df1358-2c2c-4635-b015-9ef5ca151075 | ostf_test-image_glance-1266579261 | raw | bare | | saving |
| f792a13f-d601-4029-b9db-b23825190371 | ostf_test-image_glance-836406336 | raw | bare | 1024 | saving |
| 8bcc2f22-f003-41d7-95c0-64f20ba0b265 | TestVM | qcow2 | bare | 13287936 | active |
+--------------------------------------+-----------------------------------+-------------+------------------+------------+--------+

On node-6 (controller), "glance-api" was running and it is OK:

root@node-6:~# glance image-list
+--------------------------------------+-----------------------------------+-------------+------------------+------------+--------+
| ID | Name | Disk Format | Container Format | Size | Status |
+--------------------------------------+-----------------------------------+-------------+------------------+------------+--------+
| 1df94068-7fd5-4fe6-ab3f-0248d0065107 | cool | qcow2 | bare | 1073741824 | saving |
| 7c489ff2-7973-47b6-be7d-0d6c20cf350f | cool | qcow2 | bare | 1073741824 | saving |
| b4b095ec-4e63-4697-9b45-3d56296a90d2 | cool | qcow2 | bare | 10485760 | saving |
| 9f322ebd-27ad-420f-9d19-bf9ced16c282 | cool | qcow2 | bare | 10485760 | saving |
| e0c6dca6-cc4f-4d42-849e-c670fcf257e8 | cool | qcow2 | bare | 10485760 | saving |
| d4df1358-2c2c-4635-b015-9ef5ca151075 | ostf_test-image_glance-1266579261 | raw | bare | | saving |
| f792a13f-d601-4029-b9db-b23825190371 | ostf_test-image_glance-836406336 | raw | bare | 1024 | saving |
| 8bcc2f22-f003-41d7-95c0-64f20ba0b265 | TestVM |...


Revision history for this message
Max Yatsenko (myatsenko) wrote :

The environment was examined together with Oleksiy Molchanov once more. The bug was not confirmed.

Changed in fuel:
status: Confirmed → Fix Released
Changed in fuel:
status: Fix Released → Fix Committed