David Hadas (the author of the swift_hash_path_prefix patch, I believe) kindly asked me to open this bug report, so here we go.
We've been able to reproduce account-server errors where the server fails to locate a DB when using Swift 1.8.0 in combination with the recently introduced setting swift_hash_path_prefix (see https://bugs.launchpad.net/swift/+bug/1157454).
The issue goes away if swift_hash_path_prefix is NOT present (i.e. the way it was before 1.8.0). We tested quite a few times, both with and without swift_hash_path_prefix, and we can reproduce the issue consistently.
The error log:
Apr 24 18:35:47 swift-002 account-server ERROR __call__ error with PUT /7e0cfcfc-7c6e-41a4-adc8-e4173147bd2a/464510/AUTH_aeedb7bdcb8846599e2b6b87bb8a947f/dispersion_e888be22b1a24f218cc77cdcda1762be : #012Traceback (most recent call last):#012 File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 333, in __call__#012 res = method(req)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1558, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 520, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 112, in PUT#012 req.headers['x-bytes-used'])#012 File "/usr/lib/python2.7/dist-packages/swift/common/db.py", line 1431, in put_container#012 raise DatabaseConnectionError(self.db_file, "DB doesn't exist")#012DatabaseConnectionError: DB connection error (/srv/node/7e0cfcfc-7c6e-41a4-adc8-e4173147bd2a/accounts/464510/adc/e2cfc6be58be71d5ad6111364fff0adc/e2cfc6be58be71d5ad6111364fff0adc.db, 0):#012DB doesn't exist
We get this kind of error periodically (perhaps triggered by the replicators/updaters?).
The way we reproduce it:
1. Start with an empty cluster
2. Use swift-dispersion-populate to populate the cluster
root@swift-proxy-01:~# swift-dispersion-populate
Created 5242 containers for dispersion reporting, 1m, 0 retries
Created 5242 objects for dispersion reporting, 41s, 0 retries
# dispersion config, redacted to remove sensitive info
[dispersion]
auth_url = http://test-host:5000/v2.0/
auth_version = 2.0
auth_user = tenant:user
auth_key = secret
swift_dir = /etc/swift
dispersion_coverage = 1
retries = 5
concurrency = 25
dump_json = no
We sometimes get errors on the storage nodes while running populate, perhaps concurrency related.
3. Have a look at the error log on the storage nodes; we start getting messages like these:
Apr 24 19:08:04 swift-001 account-server ERROR __call__ error with PUT /d3829495-9f10-4075-a558-f99fb665cfe2/464510/AUTH_aeedb7bdcb8846599e2b6b87bb8a947f/dispersion_e39093c2f161418a93dbf981ef839325 : #012Traceback (most recent call last):#012 File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 333, in __call__#012 res = method(req)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1558, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 520, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 112, in PUT#012 req.headers['x-bytes-used'])#012 File "/usr/lib/python2.7/dist-packages/swift/common/db.py", line 1431, in put_container#012 raise DatabaseConnectionError(self.db_file, "DB doesn't exist")#012DatabaseConnectionError: DB connection error (/srv/node/d3829495-9f10-4075-a558-f99fb665cfe2/accounts/464510/adc/e2cfc6be58be71d5ad6111364fff0adc/e2cfc6be58be71d5ad6111364fff0adc.db, 0):#012DB doesn't exist
Apr 24 19:08:04 swift-001 account-server ERROR __call__ error with PUT /d3829495-9f10-4075-a558-f99fb665cfe2/464510/AUTH_aeedb7bdcb8846599e2b6b87bb8a947f/dispersion_4a651c8ea0384cbc8e0b206acc89c5a3 : #012Traceback (most recent call last):#012 File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 333, in __call__#012 res = method(req)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 1558, in wrapped#012 return func(*a, **kw)#012 File "/usr/lib/python2.7/dist-packages/swift/common/utils.py", line 520, in _timing_stats#012 resp = func(ctrl, *args, **kwargs)#012 File "/usr/lib/python2.7/dist-packages/swift/account/server.py", line 112, in PUT#012 req.headers['x-bytes-used'])#012 File "/usr/lib/python2.7/dist-packages/swift/common/db.py", line 1431, in put_container#012 raise DatabaseConnectionError(self.db_file, "DB doesn't exist")#012DatabaseConnectionError: DB connection error (/srv/node/d3829495-9f10-4075-a558-f99fb665cfe2/accounts/464510/adc/e2cfc6be58be71d5ad6111364fff0adc/e2cfc6be58be71d5ad6111364fff0adc.db, 0):#012DB doesn't exist
We waited long enough for the replicators/updaters to do their job, but the issue is always there, with the account server periodically logging the same error.
Looking for the database file that the account server can't find reveals that the file is on another device and in another partition (523880 rather than 464510):
root@swift-001:/srv/node# find|grep e2cfc6be58be71d5ad6111364fff0adc.db
./07cc80bf-b033-4b31-87c5-49ae9aced24c/accounts/523880/adc/e2cfc6be58be71d5ad6111364fff0adc/e2cfc6be58be71d5ad6111364fff0adc.db
./07cc80bf-b033-4b31-87c5-49ae9aced24c/accounts/523880/adc/e2cfc6be58be71d5ad6111364fff0adc/e2cfc6be58be71d5ad6111364fff0adc.db.pending
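For what it's worth, the partition in the failing path looks consistent with the hash directory name. The sketch below assumes that Swift derives a partition from the top bits of the first four bytes of the MD5 hash, and it assumes a partition power of 19. That value is my inference from the numbers above, not something read from our ring file, and the helper name partition_for is made up for illustration.

```python
import struct

def partition_for(hex_hash, part_power):
    """Return the top `part_power` bits of the first 4 hash bytes (big-endian)."""
    first_four = bytes.fromhex(hex_hash)[:4]
    return struct.unpack(">I", first_four)[0] >> (32 - part_power)

# Hash directory name taken from the error log and the find output above.
DB_HASH = "e2cfc6be58be71d5ad6111364fff0adc"
PART_POWER = 19  # assumption: inferred from the paths, not taken from the ring

print(partition_for(DB_HASH, PART_POWER))  # prints 464510
```

Under those assumptions the hash maps to 464510, the partition the account server is looking in, and not to 523880, where the file actually lives. That suggests whatever created the DB under 523880 computed the partition from a different hash, although I haven't confirmed this in the code.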
That's pretty much all we do to consistently reproduce the issue. Removing swift_hash_path_prefix works around the issue for us. YMMV.
Test cluster related information:
# swift.conf
[swift-hash]
swift_hash_path_suffix = 0d85cf6346c7086d
swift_hash_path_prefix = 64543f1f5108d509
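To illustrate why the prefix matters, here is a minimal sketch of how I understand the path hash to be built: an MD5 over the prefix, a slash, the account name, and the suffix. That layout is an assumption on my side rather than something checked against the 1.8.0 source, and account_hash is a hypothetical helper, not Swift's own function.

```python
from hashlib import md5

# Values copied from the swift.conf above and from the error log.
PREFIX = b"64543f1f5108d509"
SUFFIX = b"0d85cf6346c7086d"
ACCOUNT = "AUTH_aeedb7bdcb8846599e2b6b87bb8a947f"

def account_hash(account, prefix=b"", suffix=b""):
    # Assumed layout: md5(prefix + "/" + account + suffix)
    return md5(prefix + b"/" + account.encode("utf-8") + suffix).hexdigest()

with_prefix = account_hash(ACCOUNT, PREFIX, SUFFIX)
suffix_only = account_hash(ACCOUNT, suffix=SUFFIX)

print("prefix + suffix:", with_prefix)
print("suffix only:    ", suffix_only)
```

The two digests differ, so any component that hashes without the prefix would end up with a different hash and most likely a different partition. Comparing the output with the hash directory name in the log might show which variant each component used.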
OpenStack Swift 1.8.0 from Ubuntu Cloud Archive (Grizzly)
root@swift-002:/srv/node# apt-cache policy python-swift
python-swift:
Installed: 1.8.0-0ubuntu1~cloud0
Candidate: 1.8.0-0ubuntu1~cloud0
Version table:
*** 1.8.0-0ubuntu1~cloud0 0
500 http://ubuntu-cloud.archive.canonical.com/ubuntu/ precise-updates/grizzly/main amd64 Packages
100 /var/lib/dpkg/status
Ubuntu 12.04.2 amd64
Test cluster with 2 storage nodes (more than 20GB RAM each, multiple cores, 10+ SATA disks each, running obj/cont/acct servers), 2 proxy nodes (virtualized, big enough) and one load balancer.
Feel free to ask me for any other info you may require to debug the issue.
Alright, silly questions time:
Are you running swift-dispersion-(populate|report) from 1.8.0 as well?
From which machine are you running swift-dispersion-report? One of the proxies?
Are the rings synchronized across all the machines?
Are both the prefix and the suffix in /etc/swift/swift.conf on the machine running dispersion report?
Sorry if these questions seem sort of basic; I'm just trying to make sure the setup is sane before digging into the code.
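If it helps, a quick way to answer the last question is to compare the [swift-hash] section across machines. The sketch below uses Python 3's configparser on the file contents passed in as text. The function name hash_settings and the two sample configs are made up for illustration, with the values taken from the swift.conf in the report.

```python
import configparser

def hash_settings(conf_text):
    """Return (prefix, suffix) from a swift.conf body; '' if an option is missing."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    prefix = parser.get("swift-hash", "swift_hash_path_prefix", fallback="")
    suffix = parser.get("swift-hash", "swift_hash_path_suffix", fallback="")
    return prefix, suffix

# Hypothetical contents of swift.conf on two machines.
storage_conf = """
[swift-hash]
swift_hash_path_suffix = 0d85cf6346c7086d
swift_hash_path_prefix = 64543f1f5108d509
"""
proxy_conf = """
[swift-hash]
swift_hash_path_suffix = 0d85cf6346c7086d
"""

if hash_settings(storage_conf) != hash_settings(proxy_conf):
    print("swift-hash settings differ between the two machines")
```

In practice you would read /etc/swift/swift.conf on each node and feed its contents to the function. Any mismatch in either value would be worth ruling out first.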