Spurious abort message after successful pvmove

Bug #591475 reported by Jeffrey Baker on 2010-06-08
This bug affects 2 people
Affects: lvm2; Status: Fix Released; Importance: Low
Affects: lvm2 (Ubuntu); Importance: Medium; Assigned to: Unassigned

Bug Description

Binary package hint: lvm2

lvm2 2.02.39-0ubuntu11 on Karmic. pvmove sometimes issues a spurious error message at the end of a successful move. To wit:

# pvmove -i 60 -v /dev/sde1
    Finding volume group "homedirs"
    Archiving volume group "homedirs" metadata (seqno 17).
    Creating logical volume pvmove0
    Moving 25599 extents of logical volume homedirs/homedirs_lv
    Found volume group "homedirs"
    Updating volume group metadata
    Creating volume group backup "/etc/lvm/backup/homedirs" (seqno 18).
    Found volume group "homedirs"
    Found volume group "homedirs"
    Suspending homedirs-homedirs_lv (254:1) with device flush
    Found volume group "homedirs"
    Creating homedirs-pvmove0
    Loading homedirs-pvmove0 table
    Resuming homedirs-pvmove0 (254:2)
    Found volume group "homedirs"
    Loading homedirs-pvmove0 table
    Suppressed homedirs-pvmove0 identical table reload.
    Loading homedirs-homedirs_lv table
    Resuming homedirs-homedirs_lv (254:1)
    Checking progress every 60 seconds
  /dev/sde1: Moved: 1.4%
[... time passes ...]
  /dev/sde1: Moved: 99.6%
  ABORTING: Can't find mirror LV in homedirs for /dev/sde1
# pvs
  PV VG Fmt Attr PSize PFree
  /dev/sdb1 lvm2 -- 100.00G 100.00G
  /dev/sdc1 lvm2 -- 100.00G 100.00G
  /dev/sdd1 lvm2 -- 100.00G 100.00G
  /dev/sde1 homedirs lvm2 a- 100.00G 100.00G
  /dev/sdf1 homedirs lvm2 a- 100.00G 0
  /dev/sdq1 homedirs lvm2 a- 499.99G 100.01G
# vgreduce -a homedirs
  Removed "/dev/sde1" from volume group "homedirs"
  Physical volume "/dev/sdf1" still in use
  Physical volume "/dev/sdq1" still in use

Despite the scary-looking abort message, the source PV is 100% free and can be removed from the VG.
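
One way to double-check that the source PV really is empty before running vgreduce is to compare the PSize and PFree columns of the pvs listing above. The following is a minimal sketch, not part of lvm2: pv_is_free is a hypothetical helper name, and it assumes the default column layout shown in this report, where PSize and PFree are the last two fields. Because pvs rounds the displayed sizes, treat a match as a strong hint rather than proof.

```shell
# Hypothetical helper (not an lvm2 command): reads `pvs` output on stdin and
# succeeds only if the named PV's last two columns (PSize and PFree) match.
# Assumes the default pvs column layout shown in this report.
pv_is_free() {
    awk -v pv="$1" '
        $1 == pv { found = 1; same = ($(NF-1) == $NF) }
        END      { exit !(found && same) }
    '
}
```

As root, it could be used as `pvs | pv_is_free /dev/sde1 && echo "source PV is empty"`.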

Architecture: i386
DistroRelease: Ubuntu 9.10
Ec2AMI: ami-bb709dd2
Ec2AMIManifest: ubuntu-images-us/ubuntu-karmic-9.10-i386-server-20100121.manifest.xml
Ec2AvailabilityZone: us-east-1c
Ec2InstanceType: m1.small
Ec2Kernel: aki-5f15f636
Ec2Ramdisk: ari-d5709dbc
Package: lvm2 2.02.39-0ubuntu11
PackageArchitecture: i386
ProcEnviron:
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.31-302.7-ec2
Tags: ec2-images
Uname: Linux 2.6.31-302-ec2 i686
UserGroups:

tags: added: apport-collected
description: updated
Changed in lvm2 (Ubuntu):
status: New → Confirmed
Changed in lvm2 (Ubuntu):
status: Confirmed → Fix Committed
status: Fix Committed → Confirmed
Clint Byrum (clint-fewbar) wrote :

This is a race condition in the way pvmove relies on polldaemon to show the percentage complete.

Because a race condition causes the issue, it's going to be extremely difficult to reproduce reliably, though I believe it becomes more likely with volume groups that have a large amount of metadata and many physical volumes, since the steps are then more likely to take longer.

The patch is just a workaround. The proper way to fix this bug is to use a signal that informs the polling daemon when the pvmove has been a success.

Created attachment 422647
A crude bug fix that would prevent the erroneous error message.

Description of problem:

pvmove relies on polldaemon.c:_wait_for_single_lv() to read the percentage complete on the mirror that is used to do the pvmove. However, the mirror goes away sometimes while this program is running, presumably in between init_full_scan_done(0) and locking the volume group. This would appear to be a race condition, so it only happens sometimes.

When the problem occurs, a user gets something like this printed out:

  /dev/sde1: Moved: 99.6%
  ABORTING: Can't find mirror LV in homedirs for /dev/sde1

This is very confusing, as the user may think that the pvmove operation failed.

Version-Release number of selected component (if applicable):

2.02.54, code appears similar in 2.02.67

How reproducible:

As this is a race condition, it does not always happen. However, users have reported it happening frequently enough to cause alarm, as the error message suggests that the move failed.

Steps to reproduce:

Assuming /dev/sdb has two equal-sized partitions of at least 10G:

Setup:
pvcreate /dev/sdb1
pvcreate /dev/sdb2
vgcreate test /dev/sdb1 /dev/sdb2
lvcreate -L 9G -n t1 test /dev/sdb1

Then repeat these in an alternating manner:

pvmove -i1 /dev/sdb1
pvmove -i1 /dev/sdb2

It may take many iterations to reproduce the race, or it may never reproduce it, as other factors may be necessary to make it more likely (such as many more physical volumes).
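
The alternating steps above could be automated roughly as follows. This is only a sketch under stated assumptions: repro_loop and the PVMOVE override variable are hypothetical names introduced here, the setup commands above are assumed to have been run already, and the loop simply looks for the word ABORTING in pvmove's combined output. It needs root, and as noted it may never hit the race.

```shell
# Hypothetical reproduction helper. PVMOVE may be overridden for a dry run.
PVMOVE=${PVMOVE:-pvmove}

# repro_loop MAX SRC DST: alternate pvmove between two PVs until the spurious
# message shows up or MAX iterations have been tried.
repro_loop() {
    max=$1; src=$2; dst=$3
    i=1
    while [ "$i" -le "$max" ]; do
        # Capture stdout and stderr; ignore the exit status and inspect the text.
        out=$("$PVMOVE" -i1 "$src" 2>&1) || true
        if printf '%s\n' "$out" | grep -q 'ABORTING'; then
            echo "spurious abort on iteration $i ($src)"
            return 0
        fi
        # The data now lives on the other PV, so swap the roles.
        tmp=$src; src=$dst; dst=$tmp
        i=$((i + 1))
    done
    echo "not reproduced in $max iterations"
    return 1
}
```

For example, `repro_loop 100 /dev/sdb1 /dev/sdb2` would try up to 100 moves back and forth between the two partitions.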

Actual results:

Expected results:

I would expect that if the pvmove completes successfully, pvmove would report that fact rather than abort.

Additional info:

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release. Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release. This request is not yet committed for
inclusion.

Jeffrey Baker (jwbaker) wrote :

Good to know the error is in fact harmless.

Comment on attachment 422647
A crude bug fix that would prevent the erroneous error message.

Fix mime type on attachment.

C de-Avillez (hggdh2) wrote :

Marking Triaged/Medium per input from Clint

Changed in lvm2 (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Triaged

Fix in upstream lvm 2.02.82.

Clint Byrum (clint-fewbar) wrote :

According to the bugzilla, this was just recently fixed upstream in v2.0.82.

Clint Byrum (clint-fewbar) wrote :

Make that, 2.02.82

tags: added: patch

I didn't see any 'ABORT' messages in the pvmove regression test output. Marking verified in the latest rpms.

2.6.32-128.el6.x86_64

lvm2-2.02.83-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-libs-2.02.83-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
lvm2-cluster-2.02.83-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
udev-147-2.35.el6 BUILT: Wed Mar 30 07:32:05 CDT 2011
device-mapper-1.02.62-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-libs-1.02.62-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-1.02.62-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
device-mapper-event-libs-1.02.62-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011
cmirror-2.02.83-3.el6 BUILT: Fri Mar 18 09:31:10 CDT 2011

Clint Byrum (clint-fewbar) wrote :

Since Red Hat Bugzilla updates are disabled, just posting here that this has been fixed upstream in version 2.02.82. That is included in Debian unstable now, so this bug should be fixed the next time lvm2 is merged.

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2011-0772.html

Changed in lvm2:
importance: Unknown → Low
status: Unknown → Fix Released