RadosGW region map is corrupted

Bug #1287166 reported by Vadim Rovachev on 2014-03-03
This bug affects 4 people
Affects: Fuel for OpenStack
Importance: High
Assigned to: Dmitry Borodaenko

Bug Description

{"build_id": "2014-02-28_01-17-30", "mirantis": "yes", "build_number": "225", "nailgun_sha": "12a7e7a99557f2bc302f0806ad3beef02e94b974", "ostf_sha": "ceb3ea8c2c0da27306b30b9936f27dbc5044d2c6", "fuelmain_sha": "ba019bf15a9597a154e7c1d6ecc840614d21414c", "astute_sha": "f15f5615249c59c826ea05d26707f062c88db32a", "release": "4.1", "fuellib_sha": "61d3a150402da3ce1160836c8d659f6d9d1f9640"}

Install configuration: Ubuntu (Controller; Compute+Ceph-OSD), KVM, NovaNetwork, Install Savanna.

root@node-9:~# export OS_PASSWORD=admin
root@node-9:~# export OS_AUTH_URL=http://10.20.0.3:5000/v2.0/
root@node-9:~# export OS_USERNAME=admin
root@node-9:~# export OS_TENANT_NAME=admin
root@node-9:~# swift list
Account GET failed: http://172.18.92.100:6780/swift/v1?format=json 500 Internal Server Error [first 60 chars of response] <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><he

root@node-9:~# telnet 172.18.92.100 6780
Trying 172.18.92.100...
Connected to 172.18.92.100.
Escape character is '^]'.
^]
telnet>
Connection closed by foreign host.

root@node-9:~# nova endpoints
.
.
.
+-------------+------------------------------------+
| swift | Value |
+-------------+------------------------------------+
| adminURL | http://172.18.92.100:6780/swift/v1 |
| region | RegionOne |
| publicURL | http://172.18.92.100:6780/swift/v1 |
| internalURL | http://172.18.92.100:6780/swift/v1 |
| id | cbfb74b5bb324d5986b668936e2131a6 |
+-------------+------------------------------------+

Vadim Rovachev (vrovachev) wrote :
Bogdan Dobrelya (bogdando) wrote :

Fuel configures Swift only in the HA deployment case, with 3 or more controllers. Am I right?

Changed in fuel:
status: New → Incomplete
assignee: nobody → Fuel Library Team (fuel-library)
Dmitry Borodaenko (angdraug) wrote :

See storage settings in the attached bundle:

storage:
  ephemeral_ceph: false
  objects_ceph: true
  volumes_ceph: false
  images_ceph: true
  osd_pool_size: "1"
  volumes_lvm: true

This environment has RadosGW enabled for Swift API, which does not require 3 controllers.

However, only one of the nodes has the ceph-osd role. We're supposed to have a pre-deployment check that prevents you from proceeding with deployment if you have fewer than 2 OSD nodes (or fewer than the Ceph replication factor, whichever is greater). It looks like Vadim was able to circumvent that check by setting the Ceph replication factor to 1, which is a problem. It's OK to set the replication factor to 1, but Ceph still needs 2 OSD nodes in such a configuration.

Dmitry Borodaenko (angdraug) wrote :

Naily log:

2014-03-03T11:58:46 info: [8345] f8040b68-cd1a-4a53-87e2-c13ded4273e7: Finish restarting radosgw on controller nodes

RadosGW log on node-9:

2014-03-03 11:58:52.303928 7f6df1bcc780 -1 ERROR: region map does not specify master region
2014-03-03 11:58:52.305820 7f6df1bcc780 -1 Couldn't init storage provider (RADOS)

Looks like "radosgw-admin region-map update" didn't result in a usable region map.
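
To check whether another environment hit the same failure, one option is to scan its RadosGW log for the two messages quoted above. This is only a minimal sketch: the function name is hypothetical, and the matched strings are copied verbatim from the log excerpt in this comment.

```python
# Error messages copied from the radosgw log excerpt in this comment.
REGION_MAP_ERRORS = (
    "region map does not specify master region",
    "Couldn't init storage provider (RADOS)",
)


def find_region_map_errors(log_lines):
    """Return the log lines that contain one of the known error messages."""
    return [
        line.rstrip("\n")
        for line in log_lines
        if any(message in line for message in REGION_MAP_ERRORS)
    ]
```

It accepts any iterable of lines, so an open log file can be passed directly; the log path depends on the deployment and is not assumed here.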

Dmitry Borodaenko (angdraug) wrote :

Related bug about radosgw region map:
https://bugs.launchpad.net/fuel/+bug/1275999

Dmitry Borodaenko (angdraug) wrote :

I couldn't reproduce this in the following configuration: 1x controller, 1x compute + ceph-osd, CentOS, Neutron/GRE, storage settings:

storage:
  images_ceph: true
  osd_pool_size: "1"
  objects_ceph: true
  volumes_ceph: true
  ephemeral_ceph: true
  volumes_lvm: false

I was wrong about 2 OSDs: Ceph cluster is healthy and operational with just 1 OSD. RadosGW region map was created successfully (no ERROR lines in radosgw.log about it), swift CLI works fine:

[root@node-5 ~]# swift post test
[root@node-5 ~]# swift list
test

Whatever the reason for RadosGW throwing error 500, it looks related to the region map, but it has nothing to do with the number of deployed OSDs. I think this bug should stay "Incomplete" until we have more information on how to reproduce it.

summary: - Swift does not work
+ RadosGW throws 500 error

I don't know if this is important, but I first reproduced this bug in an environment with Nova Network. If I reproduce it again, I'll write about it.

Dmitry Borodaenko (angdraug) wrote :

No, I tried the same configuration with nova-network, radosgw is still running fine after deployment.

Dmitry Borodaenko (angdraug) wrote :

Seems to be the same problem:
https://bugs.launchpad.net/fuel/+bug/1291140

Changed in fuel:
importance: Undecided → High
milestone: none → 5.0
status: Incomplete → Confirmed
Dmitry Borodaenko (angdraug) wrote :

Primary suspect is the second "radosgw restart" after "radosgw-admin region-map update" in https://review.openstack.org/75389 (most recent version of the same code is in https://review.openstack.org/78914).

Dmitry Borodaenko (angdraug) wrote :

After disabling restart_radosgw in astute, I am no longer able to reproduce https://bugs.launchpad.net/fuel/+bug/1275999, at least in a non-HA environment. There must have been some other problem that prevented radosgw from working. We still need restart for https://bugs.launchpad.net/fuel/+bug/1261966, but I'm no longer convinced region-map update is necessary.

summary: - RadosGW throws 500 error
+ RadosGW region map is corrupted
Dmitry Borodaenko (angdraug) wrote :

Please try to reproduce this problem after applying this fix to Astute:
https://review.openstack.org/80689

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vadim Rovachev (vrovachev)
Vadim Rovachev (vrovachev) wrote :

Excellent. I'll wait until this fix is in the ISO.

Andrew Lazarev (alazarev) wrote :

Workaround:

To solve the problem on an installed environment, run the following commands:

radosgw-admin region-map update
service ceph-radosgw start
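
If the workaround has to be applied on several nodes, the two commands can be wrapped in a small script. The sketch below is an illustration, not part of Fuel: the function name and the injectable `run` parameter are hypothetical, and the service name is taken from this comment as-is (it may differ on other distributions).

```python
import subprocess

# Commands copied from the workaround in this comment, in the order given.
WORKAROUND_COMMANDS = [
    ["radosgw-admin", "region-map", "update"],
    ["service", "ceph-radosgw", "start"],
]


def apply_workaround(run=subprocess.check_call):
    """Run each workaround command in order.

    With the default runner, a non-zero exit status raises
    subprocess.CalledProcessError, so later commands are skipped.
    Pass a different callable (for example, print) for a dry run.
    """
    for command in WORKAROUND_COMMANDS:
        run(command)
```

Calling apply_workaround(run=print) only prints the commands, which is a cheap way to review them before running anything as root.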

Dmitry Borodaenko (angdraug) wrote :

On an environment where this bug occurred, re-running "radosgw-admin region-map update" by hand fixed the problem: the radosgw service was able to start. This is one more indirect confirmation that the root cause is related to running this command post-deployment in astute.

Changed in fuel:
assignee: Vadim Rovachev (vrovachev) → Dmitry Borodaenko (dborodaenko)
Changed in fuel:
status: Confirmed → In Progress

Reviewed: https://review.openstack.org/80689
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=1f159fb92378f362f1e7bbc10cbaa689122013f4
Submitter: Jenkins
Branch: master

commit 1f159fb92378f362f1e7bbc10cbaa689122013f4
Author: Dmitry Borodaenko <email address hidden>
Date: Fri Mar 14 13:18:31 2014 -0700

    do not create radosgw region map in post-deploy

    RadosGW region map may intermittently get corrupted when it is created
    during post-deployment, leading to RadosGW being unable to start.

    Change-Id: I9587b983ad50d74d22eb157b5d98a6f709e58b23
    Related-bug: #1287166

Changed in fuel:
status: In Progress → Fix Committed
tags: added: backports-4.1.1
Changed in fuel:
status: Fix Committed → In Progress
Changed in fuel:
milestone: 5.0 → 4.1.1

Related fix proposed to branch: stable/4.1
Review: https://review.openstack.org/85105

Reviewed: https://review.openstack.org/85105
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=55df06b2e84fa5d71a1cc0e78dbccab5db29d968
Submitter: Jenkins
Branch: stable/4.1

commit 55df06b2e84fa5d71a1cc0e78dbccab5db29d968
Author: Dmitry Borodaenko <email address hidden>
Date: Mon Mar 31 11:33:09 2014 -0700

    do not create radosgw region map in post-deploy

    RadosGW region map may intermittently get corrupted when it is created
    during post-deployment, leading to RadosGW being unable to start.

    Change-Id: I9587b983ad50d74d22eb157b5d98a6f709e58b23
    Related-bug: #1287166

Changed in fuel:
status: In Progress → Fix Committed

verified on {"build_id": "2014-04-03_04-17-26", "mirantis": "yes", "build_number": "266", "nailgun_sha": "7a05e365240ab27c492b20585ef8ac8557102cc0", "ostf_sha": "de0222fed646525d248dc6892eeceab139d5c469", "fuelmain_sha": "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b", "astute_sha": "5bcacc84cdaee3b31f1178d92a1c0681dc6ce520", "release": "4.1", "fuellib_sha": "52e7f57695f33bafa5d84d524d77f1bc3a2289b2"}

Changed in fuel:
status: Fix Committed → Fix Released
Mike Scherbakov (mihgen) on 2014-05-08
tags: added: release-notes
Meg McRoberts (dreidellhasa) wrote :

Added to the "Other Limitations" list in 5.0 Release Notes

Meg McRoberts (dreidellhasa) wrote :

Included as "Other Limitation" in 5.0.1 Release Notes.
