RadosGW region map is corrupted

Bug #1287166 reported by Vadim Rovachev
This bug affects 4 people
Affects: Fuel for OpenStack
Status: Fix Released
Importance: High
Assigned to: Dmitry Borodaenko
Milestone:

Bug Description

{"build_id": "2014-02-28_01-17-30", "mirantis": "yes", "build_number": "225", "nailgun_sha": "12a7e7a99557f2bc302f0806ad3beef02e94b974", "ostf_sha": "ceb3ea8c2c0da27306b30b9936f27dbc5044d2c6", "fuelmain_sha": "ba019bf15a9597a154e7c1d6ecc840614d21414c", "astute_sha": "f15f5615249c59c826ea05d26707f062c88db32a", "release": "4.1", "fuellib_sha": "61d3a150402da3ce1160836c8d659f6d9d1f9640"}

Install configuration: Ubuntu (Controller; Compute+Ceph-OSD), KVM, NovaNetwork, Install Savanna.

root@node-9:~# export OS_PASSWORD=admin
root@node-9:~# export OS_AUTH_URL=http://10.20.0.3:5000/v2.0/
root@node-9:~# export OS_USERNAME=admin
root@node-9:~# export OS_TENANT_NAME=admin
root@node-9:~# swift list
Account GET failed: http://172.18.92.100:6780/swift/v1?format=json 500 Internal Server Error [first 60 chars of response] <!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><he

root@node-9:~# telnet 172.18.92.100 6780
Trying 172.18.92.100...
Connected to 172.18.92.100.
Escape character is '^]'.
^]
telnet>
Connection closed by foreign host.

root@node-9:~# nova endpoints
.
.
.
+-------------+------------------------------------+
| swift       | Value                              |
+-------------+------------------------------------+
| adminURL    | http://172.18.92.100:6780/swift/v1 |
| region      | RegionOne                          |
| publicURL   | http://172.18.92.100:6780/swift/v1 |
| internalURL | http://172.18.92.100:6780/swift/v1 |
| id          | cbfb74b5bb324d5986b668936e2131a6   |
+-------------+------------------------------------+

Vadim Rovachev (vrovachev) wrote :
Bogdan Dobrelya (bogdando) wrote :

Fuel configures Swift only for the HA deployment case with 3 or more controllers. Am I right?

Changed in fuel:
status: New → Incomplete
assignee: nobody → Fuel Library Team (fuel-library)
Dmitry Borodaenko (angdraug) wrote :

See storage settings in the attached bundle:

storage:
  ephemeral_ceph: false
  objects_ceph: true
  volumes_ceph: false
  images_ceph: true
  osd_pool_size: "1"
  volumes_lvm: true

This environment has RadosGW enabled for Swift API, which does not require 3 controllers.

However, only one of the nodes has the ceph-osd role. We're supposed to have a pre-deployment check that prevents you from proceeding with deployment if you have fewer than 2 OSD nodes (or fewer than the Ceph replication factor, whichever is greater). Looks like Vadim was able to circumvent that by setting the Ceph replication factor to 1, which is a problem. It's OK to set the replication factor to 1, but Ceph still needs 2 OSD nodes in such a configuration.
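The pre-deployment rule described in this comment can be sketched as follows. This is only an illustration of the stated rule, not Fuel's actual validation code: the function names and the `floor` parameter are hypothetical, and `osd_pool_size` is taken as a string because that is how it appears in the storage settings. A later comment in this thread reports a healthy cluster with a single OSD, which is why the minimum of 2 is left adjustable.

```python
def required_osd_nodes(osd_pool_size, floor=2):
    """Return the minimum number of ceph-osd nodes under the stated rule.

    osd_pool_size is the Ceph replication factor, given as a string such
    as "1" (as in the storage settings) or as an int.
    floor is the hypothetical hard minimum; the comment above assumes 2.
    """
    return max(int(osd_pool_size), floor)


def check_osd_nodes(osd_node_count, osd_pool_size, floor=2):
    """Return an error message if there are too few OSD nodes, else None."""
    needed = required_osd_nodes(osd_pool_size, floor)
    if osd_node_count < needed:
        return "need at least %d ceph-osd nodes, found %d" % (
            needed,
            osd_node_count,
        )
    return None
```

With the settings from this environment (one ceph-osd node, `osd_pool_size: "1"`), `check_osd_nodes(1, "1")` returns an error message, while `check_osd_nodes(1, "1", floor=1)` returns `None`.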

Dmitry Borodaenko (angdraug) wrote :

Naily log:

2014-03-03T11:58:46 info: [8345] f8040b68-cd1a-4a53-87e2-c13ded4273e7: Finish restarting radosgw on controller nodes

RadosGW log on node-9:

2014-03-03 11:58:52.303928 7f6df1bcc780 -1 ERROR: region map does not specify master region
2014-03-03 11:58:52.305820 7f6df1bcc780 -1 Couldn't init storage provider (RADOS)

Looks like "radosgw-admin region-map update" didn't result in a usable region map.
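To check whether another node hit the same failure, the RadosGW log can be scanned for the two messages quoted above. The following is a minimal sketch: the function name is hypothetical, and the caller has to supply the log contents, since the log file location is not stated in this report.

```python
# Error messages copied verbatim from the RadosGW log lines quoted above.
KNOWN_ERRORS = (
    "region map does not specify master region",
    "Couldn't init storage provider (RADOS)",
)


def find_region_map_errors(log_text):
    """Return the log lines that contain one of the known error messages."""
    return [
        line
        for line in log_text.splitlines()
        if any(message in line for message in KNOWN_ERRORS)
    ]
```

An empty result only means these two messages are absent; it does not prove that the region map is healthy.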

Dmitry Borodaenko (angdraug) wrote :

Related bug about radosgw region map:
https://bugs.launchpad.net/fuel/+bug/1275999

Dmitry Borodaenko (angdraug) wrote :

I couldn't reproduce this in the following configuration: 1x controller, 1x compute + ceph-osd, CentOS, Neutron/GRE, storage settings:

storage:
  images_ceph: true
  osd_pool_size: "1"
  objects_ceph: true
  volumes_ceph: true
  ephemeral_ceph: true
  volumes_lvm: false

I was wrong about 2 OSDs: Ceph cluster is healthy and operational with just 1 OSD. RadosGW region map was created successfully (no ERROR lines in radosgw.log about it), swift CLI works fine:

[root@node-5 ~]# swift post test
[root@node-5 ~]# swift list
test

Whatever caused RadosGW to return error 500, it appears to be related to the region map, but it has nothing to do with the number of deployed OSDs. I think this bug should stay "Incomplete" until we have more information on how to reproduce it.

summary: - Swift does not work
+ RadosGW throws 500 error
Vadim Rovachev (vrovachev) wrote : Re: RadosGW throws 500 error

I don't know if this is important, but I first reproduced this bug in an environment with Nova Network. If I reproduce it again, I'll write about it.

Dmitry Borodaenko (angdraug) wrote :

No, I tried the same configuration with nova-network; radosgw is still running fine after deployment.

Dmitry Borodaenko (angdraug) wrote :

Seems to be the same problem:
https://bugs.launchpad.net/fuel/+bug/1291140

Changed in fuel:
importance: Undecided → High
milestone: none → 5.0
status: Incomplete → Confirmed
Dmitry Borodaenko (angdraug) wrote :

Primary suspect is the second "radosgw restart" after "radosgw-admin region-map update" in https://review.openstack.org/75389 (most recent version of the same code is in https://review.openstack.org/78914).

Dmitry Borodaenko (angdraug) wrote :

After disabling restart_radosgw in astute, I am no longer able to reproduce https://bugs.launchpad.net/fuel/+bug/1275999, at least in a non-HA environment. There must have been some other problem that prevented radosgw from working. We still need restart for https://bugs.launchpad.net/fuel/+bug/1261966, but I'm no longer convinced region-map update is necessary.

summary: - RadosGW throws 500 error
+ RadosGW region map is corrupted
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/80689

Dmitry Borodaenko (angdraug) wrote :

Please try to reproduce this problem after applying this fix to Astute:
https://review.openstack.org/80689

Changed in fuel:
assignee: Fuel Library Team (fuel-library) → Vadim Rovachev (vrovachev)
Vadim Rovachev (vrovachev) wrote :

Excellent. I'll wait until this fix is in the ISO.

Andrew Lazarev (alazarev) wrote :

Workaround:

To fix the problem on an already installed environment, run the following commands:

radosgw-admin region-map update
service ceph-radosgw start
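If the workaround has to be applied on several nodes, the two commands can be wrapped so that the service is only started when the region map update succeeded. This is a rough sketch under assumptions: the function name and the `run` and `dry_run` parameters are hypothetical, and the commands are exactly the two listed above.

```python
import subprocess

# The two workaround commands from this comment, in the order given.
WORKAROUND_COMMANDS = (
    ("radosgw-admin", "region-map", "update"),
    ("service", "ceph-radosgw", "start"),
)


def apply_workaround(run=subprocess.check_call, dry_run=False):
    """Run the workaround commands in order and return the list of them.

    With the default runner, subprocess.check_call raises
    CalledProcessError on a non-zero exit status, so the service start
    is skipped if the region map update fails.
    With dry_run=True nothing is executed and the commands are only
    returned for inspection.
    """
    executed = []
    for command in WORKAROUND_COMMANDS:
        if not dry_run:
            run(list(command))
        executed.append(list(command))
    return executed
```

Calling `apply_workaround(dry_run=True)` returns the command list without touching the system, which is a cheap way to review what would run.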

Dmitry Borodaenko (angdraug) wrote :

On an environment where this bug occurred, re-running "radosgw-admin region-map update" by hand fixed the problem: the radosgw service was able to start. This is one more indirect confirmation that the root cause is related to running this command post-deployment in astute.
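One way to confirm that the re-created region map is usable is to check that it names a master region, since that is what the RadosGW error message complains about. The sketch below assumes that the region map can be dumped as JSON (for example with "radosgw-admin region-map get") and that the JSON has a top-level "master_region" field. Both are assumptions about the tool's output that are not shown in this report, and the function name is hypothetical.

```python
import json


def master_region_of(region_map_json):
    """Return the master region name, or None if it is missing or empty.

    region_map_json is the region map as a JSON string. The top-level
    "master_region" key is an assumption about the dump format.
    """
    try:
        region_map = json.loads(region_map_json)
    except ValueError:
        # Not valid JSON at all, so treat it as an unusable region map.
        return None
    if not isinstance(region_map, dict):
        return None
    master = region_map.get("master_region")
    if isinstance(master, str) and master:
        return master
    return None
```

A `None` result would correspond to the "region map does not specify master region" condition, under the assumptions stated above.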

Changed in fuel:
assignee: Vadim Rovachev (vrovachev) → Dmitry Borodaenko (dborodaenko)
Changed in fuel:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (master)

Reviewed: https://review.openstack.org/80689
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=1f159fb92378f362f1e7bbc10cbaa689122013f4
Submitter: Jenkins
Branch: master

commit 1f159fb92378f362f1e7bbc10cbaa689122013f4
Author: Dmitry Borodaenko <email address hidden>
Date: Fri Mar 14 13:18:31 2014 -0700

    do not create radosgw region map in post-deploy

    RadosGW region map may intermittently get corrupted when it is created
    during post-deployment, leading to RadosGW being unable to start.

    Change-Id: I9587b983ad50d74d22eb157b5d98a6f709e58b23
    Related-bug: #1287166

Changed in fuel:
status: In Progress → Fix Committed
tags: added: backports-4.1.1
Changed in fuel:
status: Fix Committed → In Progress
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to fuel-astute (stable/4.1)

Related fix proposed to branch: stable/4.1
Review: https://review.openstack.org/84203

Changed in fuel:
milestone: 5.0 → 4.1.1
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: stable/4.1
Review: https://review.openstack.org/85105

OpenStack Infra (hudson-openstack) wrote : Related fix merged to fuel-astute (stable/4.1)

Reviewed: https://review.openstack.org/85105
Committed: https://git.openstack.org/cgit/stackforge/fuel-astute/commit/?id=55df06b2e84fa5d71a1cc0e78dbccab5db29d968
Submitter: Jenkins
Branch: stable/4.1

commit 55df06b2e84fa5d71a1cc0e78dbccab5db29d968
Author: Dmitry Borodaenko <email address hidden>
Date: Mon Mar 31 11:33:09 2014 -0700

    do not create radosgw region map in post-deploy

    RadosGW region map may intermittently get corrupted when it is created
    during post-deployment, leading to RadosGW being unable to start.

    Change-Id: I9587b983ad50d74d22eb157b5d98a6f709e58b23
    Related-bug: #1287166

Changed in fuel:
status: In Progress → Fix Committed
Andrey Sledzinskiy (asledzinskiy) wrote :

verified on {"build_id": "2014-04-03_04-17-26", "mirantis": "yes", "build_number": "266", "nailgun_sha": "7a05e365240ab27c492b20585ef8ac8557102cc0", "ostf_sha": "de0222fed646525d248dc6892eeceab139d5c469", "fuelmain_sha": "16637e2ea0ae6fe9a773aceb9d76c6e3a75f6c3b", "astute_sha": "5bcacc84cdaee3b31f1178d92a1c0681dc6ce520", "release": "4.1", "fuellib_sha": "52e7f57695f33bafa5d84d524d77f1bc3a2289b2"}

Changed in fuel:
status: Fix Committed → Fix Released
Mike Scherbakov (mihgen)
tags: added: release-notes
Meg McRoberts (dreidellhasa) wrote :

Added to the "Other Limitations" list in 5.0 Release Notes

Meg McRoberts (dreidellhasa) wrote :

Included as "Other Limitation" in 5.0.1 Release Notes.
