rsync for time to time fails due to exit code 24, due to wrong file list that rsync creates for itself

Bug #1338660 reported by Denis M.
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
OpenStack DBaaS (Trove)
Fix Released
Medium
Amrith Kumar

Bug Description

2014-07-07 15:53:50.246 DEBUG trove.openstack.common.processutils [req-dc5ac406-8e2c-43c9-a6e9-181a2554a565 b4d3011f2de44550849b19756e951dd6 4c4c8487e59f4e66997cae8630f2fb80] Result was 0 from (pid=916) execute /home/ubuntu/trove/trove/openstack/common/processutils.py:194
2014-07-07 15:53:50.390 DEBUG trove.guestagent.volume [req-dc5ac406-8e2c-43c9-a6e9-181a2554a565 b4d3011f2de44550849b19756e951dd6 4c4c8487e59f4e66997cae8630f2fb80] Mounting volume. Device path:/dev/vdb, mount_point:/mnt/volume, volume_type:ext3, mount options:defaults,noatime from (pid=916) mount /home/ubuntu/trove/trove/guestagent/volume.py:162
2014-07-07 15:53:55.473 DEBUG trove.openstack.common.processutils [req-dc5ac406-8e2c-43c9-a6e9-181a2554a565 b4d3011f2de44550849b19756e951dd6 4c4c8487e59f4e66997cae8630f2fb80] Running cmd (subprocess): [ ' s u d o ' , ' r s y n c ' , ' - - s a f e - l i n k s ' , ' - - p e r m s ' , ' - - r e c u r s i v e ' , ' - - o w n e r ' , ' - - g r o u p ' , ' - - x a t t r s ' , ' - - s p a r s e ' , ' / v a r / l i b / c a s s a n d r a / ' , ' / m n t / v o l u m e ' ] from (pid=916) execute /home/ubuntu/trove/trove/openstack/common/processutils.py:160
2014-07-07 15:53:58.687 DEBUG trove.openstack.common.periodic_task [req-dc5ac406-8e2c-43c9-a6e9-181a2554a565 b4d3011f2de44550849b19756e951dd6 4c4c8487e59f4e66997cae8630f2fb80] Skipping Manager.update_status, 3 ticks left until next run from (pid=916) run_periodic_tasks /home/ubuntu/trove/trove/openstack/common/periodic_task.py:100
2014-07-07 15:54:08.627 DEBUG trove.openstack.common.periodic_task [req-dc5ac406-8e2c-43c9-a6e9-181a2554a565 b4d3011f2de44550849b19756e951dd6 4c4c8487e59f4e66997cae8630f2fb80] Skipping Manager.update_status, 2 ticks left until next run from (pid=916) run_periodic_tasks /home/ubuntu/trove/trove/openstack/common/periodic_task.py:100
2014-07-07 15:54:10.879 DEBUG trove.openstack.common.processutils [req-dc5ac406-8e2c-43c9-a6e9-181a2554a565 b4d3011f2de44550849b19756e951dd6 4c4c8487e59f4e66997cae8630f2fb80] Result was 24 from (pid=916) execute /home/ubuntu/trove/trove/openstack/common/processutils.py:194
2014-07-07 15:54:11.020 ERROR trove.openstack.common.rpc.amqp [req-dc5ac406-8e2c-43c9-a6e9-181a2554a565 b4d3011f2de44550849b19756e951dd6 4c4c8487e59f4e66997cae8630f2fb80] Exception during message handling
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp Traceback (most recent call last):
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp File "/home/ubuntu/trove/trove/openstack/common/rpc/amqp.py", line 440, in _process_data
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp **args)
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp File "/home/ubuntu/trove/trove/openstack/common/rpc/dispatcher.py", line 172, in dispatch
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp result = getattr(proxyobj, method)(ctxt, **kwargs)
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp File "/home/ubuntu/trove/trove/guestagent/datastore/cassandra/manager.py", line 82, in prepare
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp device.migrate_data(mount_point)
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp File "/home/ubuntu/trove/trove/guestagent/volume.py", line 45, in migrate_data
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp "--sparse", mysql_base, TMP_MOUNT_POINT)
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp File "/home/ubuntu/trove/trove/openstack/common/processutils.py", line 200, in execute
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp cmd=' '.join(cmd))
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp ProcessExecutionError: Unexpected error while running command.
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp Command: sudo rsync --safe-links --perms --recursive --owner --group --xattrs --sparse /var/lib/cassandra/ /mnt/volume
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp Exit code: 24
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp Stdout: ''
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp Stderr: 'file has vanished: "/var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-tmp-jb-1-CompressionInfo.db"\nfile has vanished: "/var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-tmp-jb-1-Data.db"\nfile has vanished: "/var/lib/cassandra/data/system/schema_columnfamilies/system-schema_columnfamilies-tmp-jb-1-Index.db"\nrsync warning: some files vanished before they could be transferred (code 24) at main.c(1070) [sender=3.0.9]\n'
2014-07-07 15:54:11.020 TRACE trove.openstack.common.rpc.amqp
2014-07-07 15:54:18.625 DEBUG trove.openstack.common.periodic_task [req-dc5ac406-8e2c-43c9-a6e9-181a2554a565 b4d3011f2de44550849b19756e951dd6 4c4c8487e59f4e66997cae8630f2fb80] Skipping Manager.update_status, 1 ticks left until next run from (pid=916) run_periodic_tasks /home/ubuntu/trove/trove/openstack/common/periodic_task.py:100
2014-07-07 15:54:28.632 DEBUG trove.openstack.common.periodic_task [req-dc5ac406-8e2c-43c9-a6e9-181a2554a565 b4d3011f2de44550849b19756e951dd6 4c4c8487e59f4e66997cae8630f2fb80] Running periodic task Manager.update_status from (pid=916) run_periodic_tasks /home/ubuntu/trove/trove/openstack/common/periodic_task.py:106
2014-07-07 15:54:28.665 INFO trove.guestagent.datastore.service [req-dc5ac406-8e2c-43c9-a6e9-181a2554a565 b4d3011f2de44550849b19756e951dd6 4c4c8487e59f4e66997cae8630f2fb80] DB server is not installed or is in restart mode, so for now we'll skip determining the status of DB on this box.

Trove-guestagent should expect exit codes: 0 and 24 for rsync

Revision history for this message
Denis M. (dmakogon) wrote :
Changed in trove:
assignee: nobody → Denis M. (dmakogon)
status: New → In Progress
status: In Progress → Confirmed
Revision history for this message
Denis M. (dmakogon) wrote :

I was able to reproduce this bug 2-3 times. Simple fix is to extend a list of expected exit codes for that exec which invokes rsync

Revision history for this message
Amrith Kumar (amrith) wrote :

per discussion with Denis, I'm reassigning to myself.

Changed in trove:
assignee: Denis M. (dmakogon) → Amrith (amrith)
Revision history for this message
Amrith Kumar (amrith) wrote :

The code in prepare()

https://github.com/openstack/trove/blob/master/trove/guestagent/datastore/cassandra/manager.py#L80

invokes migrate_data() which is the thing that is making the rsync call.

Unfortunately, at this time the DB is up and running. Not very nice.

Other datastores (MongoDB and MySQL) carefully stop the DB before migrating data. I'm going to do the same kind of thing for Cassandra and submit some code.

The notion of just ignoring the error (24) strikes me as sleazy. It may work but I don't want to have to take a bath each time someone launches a Cassandra instance.

Changed in trove:
status: Confirmed → Triaged
importance: Undecided → Medium
milestone: none → juno-2
Revision history for this message
Nikhil Manchanda (slicknik) wrote :

I don't think we should be migrating any data from the data directory while the datastore (in this case Cassandra) is running. Following the mongoDB and mysql example, we should stop the db before we do any data migration.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to trove (master)

Fix proposed to branch: master
Review: https://review.openstack.org/106127

Changed in trove:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to trove (master)

Reviewed: https://review.openstack.org/106127
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=e527ad8f2ef45be7afd97e9af7c01eb105a6197e
Submitter: Jenkins
Branch: master

commit e527ad8f2ef45be7afd97e9af7c01eb105a6197e
Author: Amrith Kumar <email address hidden>
Date: Thu Jul 10 13:10:27 2014 -0400

    Stop cassandra during configuration and volume migration

    The cassandra service is left running after installation.
    The next operation(s) are to rewrite the configuration, and
    migrate the data onto a volume (if provided). It is not a
    good thing for the database to be running during the latter,
    and a restart is required after the former.

    Failing to stop the database during the data migration can
    result in an rsync error 24 because files that were present
    when rsync made its list may be missing by the time rsync
    gets around to trying to sync that file. This can lead rsync
    to not make a 'true' copy of the source on the destination.

    The only way around this is to stop the database, something
    that is already done by both MongoDB and MySQL in their
    guest agents.

    Change-Id: I74ecc5c12d43eccc6295488e344bc0e75fe8116d
    Closes-bug: #1338660

Changed in trove:
status: In Progress → Fix Committed
Changed in trove:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in trove:
milestone: juno-2 → 2014.2
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.