2.6.32-47 kernel update on 10.04 breaks software RAID (+ LVM)

Bug #1190295 reported by bl8n8r on 2013-06-12
This bug affects 2 people
Affects: linux (Ubuntu)
Importance: Medium
Assigned to: Steve Conklin

Bug Description

I have been running 10.04 LTS on 8 similar AMD Opteron x86_64 servers for several years. The servers have been kept up to date with patches as they come out and have been running 2.6.x kernels. Each server has some form of Linux software RAID running on it as well as a 3Ware hardware RAID card using SATA disks. Software RAID is configured as RAID1 on all but one server, which runs software RAID10. All servers had software RAID configured to use a single partition on each disk of type 0xFD (Linux RAID autodetect). All servers were configured with LVM on top of /dev/md0.

In the past year, mysterious problems have been happening with software RAID after applying system patches. Upon reboot, the server is unable to mount LVM partitions on Linux software RAID, and boot is interrupted with "Continue to wait; or Press S to skip mounting or M for manual recovery", requiring intervention from an operator.

Upon pressing 'M' and logging in as root, the LVM slices on the software RAID partition are not mounted and sometimes appear to be missing from LVM. Oftentimes pvs, vgs and lvs will complain about "leaking memory". Germane to the issue, LVM will sometimes show the problem partitions as "Active", while at other times during the same login they will simply be gone. With LVM and /dev/md0 unstable, there is no way to discern the true state of the partitions in question. Starting the system from alternate boot media such as a CD-ROM or USB drive sometimes shows the software RAID and LVM in the proper state, which leads to suspicion of a kernel update on the afflicted system. Historically and subjectively, best practice in this instance seems to be booting from live media, starting the array in degraded mode, and backing up the array.
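
A minimal sketch of the live-media approach described above, assuming a two-disk RAID1 where one member is known to be good. The helper names `run` and `start_degraded`, the `DRY_RUN` variable, and the example device and volume group names are illustrative and not part of the original report. By default the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch: bring up one half of a RAID1 mirror read-only from live media so
# the LVM volumes on it can be backed up. All names are placeholders.

# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "$*"
    fi
}

start_degraded() {
    md_dev=$1      # array device, e.g. /dev/md0
    member=$2      # the single member partition to start from, e.g. /dev/sdb1
    vg=$3          # volume group on top of the array, e.g. vgssd

    # --run starts the array even though a member is missing;
    # --readonly starts it read-only rather than read-write.
    run mdadm --assemble --run --readonly "$md_dev" "$member" || return 1
    # Activate the logical volumes so they can be mounted for backup.
    run vgchange -ay "$vg" || return 1
}
```

For example, `start_degraded /dev/md0 /dev/sdb1 vgssd` prints the two commands; with `DRY_RUN=0` and root privileges it would execute them. The logical volumes could then be mounted with `mount -o ro` and copied off before a second disk is re-added.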

bl8n8r (bl8n8r-gmail) wrote :

Prior to rebooting the system, these were the contents of /var/cache/apt/archives showing which patches had just been applied. This system had been running and rebooting for months prior to applying these patches.

[11:42 06/12/13]
[root@usb-live /mnt/slash]
# dir var/cache/apt/archives/
total 46260
-rw-r--r-- 1 root root 51876 Dec 4 2007 libdbi0_0.8.2-3_amd64.deb
-rw-r--r-- 1 root root 60626 Nov 6 2009 libnet1_1.1.4-2_amd64.deb
-rw-r--r-- 1 root root 342662 Sep 6 2010 syslog-ng_3.1.2-1~lucid1_amd64.deb
-rw-r--r-- 1 root root 276106 May 25 2011 language-selector-common_0.5.8+langfixes~lucid2_all.deb
-rw-r--r-- 1 root root 1501242 Oct 7 2011 postfix_2.8.5-2~build0.10.04_amd64.deb
-rw-r--r-- 1 root root 5138 May 7 04:33 linux-headers-server_2.6.32.47.54_amd64.deb
-rw-r--r-- 1 root root 5140 May 7 04:33 linux-server_2.6.32.47.54_amd64.deb
-rw-r--r-- 1 root root 5144 May 7 04:33 linux-image-server_2.6.32.47.54_amd64.deb
-rw-r--r-- 1 root root 10175388 May 7 04:34 linux-headers-2.6.32-47_2.6.32-47.109_all.deb
-rw-r--r-- 1 root root 31861814 May 7 04:34 linux-image-2.6.32-47-server_2.6.32-47.109_amd64.deb
-rw-r--r-- 1 root root 837616 May 7 04:34 linux-headers-2.6.32-47-server_2.6.32-47.109_amd64.deb
-rw-r--r-- 1 root root 335916 May 23 21:04 dhcp3-common_3.1.3-2ubuntu3.5_amd64.deb
-rw-r--r-- 1 root root 275594 May 23 21:04 dhcp3-client_3.1.3-2ubuntu3.5_amd64.deb
-rw-r--r-- 1 root root 417662 May 29 13:03 libgnutls26_2.8.5-2ubuntu0.4_amd64.deb
-rw-r--r-- 1 root root 43626 Jun 5 12:08 libxext6_2%3a1.1.1-2ubuntu0.1_amd64.deb
-rw-r--r-- 1 root root 43186 Jun 5 12:08 libxcb1_1.5-2ubuntu0.1_amd64.deb
-rw-r--r-- 1 root root 233946 Jun 5 12:08 libx11-data_2%3a1.3.2-1ubuntu3.1_all.deb
-rw-r--r-- 1 root root 844966 Jun 5 12:08 libx11-6_2%3a1.3.2-1ubuntu3.1_amd64.deb
drwxr-xr-x 2 root root 4096 Jun 12 10:08 partial
drwxr-xr-x 3 root root 4096 Jun 12 10:08 .
drwxr-xr-x 3 root root 4096 Jun 12 10:11 ..

bl8n8r (bl8n8r-gmail) wrote :

I was expecting the RAID system to blow up so I saved the scroll-back of the terminal. This is the state of the disk subsystem and RAID info prior to rebooting. Of particular note, /dev/md0p1 is marked as Linux Raid Autodetect but it's supposed to be an LVM partition!

$ ssh root@newbox
Linux newbox 3.0.0-32-server #51~lucid1-Ubuntu SMP Fri Mar 22 17:53:04 UTC 2013 x86_64 GNU/Linux
Ubuntu 10.04.4 LTS

Welcome to the Ubuntu Server!
 * Documentation: http://www.ubuntu.com/server/doc

  System information as of Wed Jun 12 10:08:23 CDT 2013

  System load: 0.15 Processes: 171
  Usage of /home: 0.8% of 3.99GB Users logged in: 0
  Memory usage: 0% IP address for eth0: 10.140.136.41
  Swap usage: 0%

  Graph this data and manage this system at https://landscape.canonical.com/

18 packages can be updated.
11 updates are security updates.

New release 'precise' available.
Run 'do-release-upgrade' to upgrade to it.

Last login: Wed Jun 12 10:07:11 2013

# screen -x
[detached from 1306.tty1.newbox]

# fdisk -l

Disk /dev/sda: 60.0 GB, 60022480896 bytes
32 heads, 32 sectors/track, 114483 cylinders
Units = cylinders of 1024 * 512 = 524288 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000cb60c

   Device Boot Start End Blocks Id System
/dev/sda1 1 114484 58615640 fd Linux raid autodetect

Disk /dev/sdb: 60.0 GB, 60022480896 bytes
32 heads, 32 sectors/track, 114483 cylinders
Units = cylinders of 1024 * 512 = 524288 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000da964

   Device Boot Start End Blocks Id System
/dev/sdb1 1 114484 58615640 fd Linux raid autodetect

Disk /dev/sdc: 1000.0 GB, 999989182464 bytes
255 heads, 63 sectors/track, 121575 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000c3b74

   Device Boot Start End Blocks Id System
/dev/sdc1 * 1 262 2104483+ 83 Linux
/dev/sdc2 263 2352 16787925 82 Linux swap / Solaris
/dev/sdc3 2353 121575 957658747+ 8e Linux LVM

Disk /dev/md0: 60.0 GB, 60022325248 bytes
32 heads, 32 sectors/track, 114483 cylinders
Units = cylinders of 1024 * 512 = 524288 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000cb60c

    Device Boot Start End Blocks Id System
/dev/md0p1 1 114484 58615640 fd Linux raid autodetect

# pvs
  PV VG Fmt Attr PSize PFree
  /dev/md0p1 vgssd lvm2 a- 55.90g 20.90g
  /dev/sdc3 vgsata lvm2 a- 913.29g 849.29g

# lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  home vgsata -wi-ao 4.00g ...

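
Whether an LVM physical volume sits on the md device itself or on a partition inside it (such as /dev/md0p1 here) can be checked mechanically. A small sketch, assuming the output of `pvs --noheadings -o pv_name,vg_name` is piped in; the function name `md_backed_pvs` and the labels `partition` and `whole-device` are made up for illustration.

```shell
#!/bin/sh
# Sketch: list physical volumes that live on md devices and say whether each
# one is a partition of the array (md0p1) or the array itself (md0).
# Reads "PV VG" pairs, one per line, on standard input.
md_backed_pvs() {
    awk '$1 ~ /^\/dev\/md/ {
        # A trailing "p<digits>" marks a partition inside the md device.
        kind = ($1 ~ /p[0-9]+$/) ? "partition" : "whole-device"
        print $1, $2, kind
    }'
}
```

Typical use would be `pvs --noheadings -o pv_name,vg_name | md_backed_pvs`, run once before and once after a reboot to see whether the reported PV location changes.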

affects: partman-base (Ubuntu) → mdadm (Ubuntu)
bl8n8r (bl8n8r-gmail) wrote :

This is the state of the disk subsystem after rebooting from 10.04 live media and bringing up /dev/md0 on one disk (to make backups) and then re-adding the second disk to rebuild the array. Notice /dev/md0p1 is gone and pvs says it's using /dev/md0 now!

[08:06 06/13/13]
[root@usb-live /mnt/slash]
# mdadm --readwrite /dev/md0

[08:07 06/13/13]
[root@usb-live /mnt/slash]
# mdadm --add /dev/md0 /dev/sdc1
mdadm: re-added /dev/sdc1

[08:07 06/13/13]
[root@usb-live /mnt/slash]
# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid1 sdc1[2] sdb1[0]
      58615552 blocks [2/1] [U_]
      [>....................] recovery = 0.3% (198592/58615552) finish=14.7min speed=66197K/sec

unused devices: <none>

[08:07 06/13/13]
[root@usb-live /mnt/slash]
# pvs
  PV VG Fmt Attr PSize PFree
  /dev/md0 vgssd lvm2 a- 55.90g 20.90g
  /dev/sda3 vgsata lvm2 a- 913.29g 824.29g

[08:07 06/13/13]
[root@usb-live /mnt/slash]
# lvs
  LV VG Attr LSize Origin Snap% Move Log Copy% Convert
  home vgsata -wi-a- 4.00g
  mysql.bak vgsata -wi-a- 10.00g
  sysbak vgsata -wi-a- 50.00g
  usr vgsata -wi-ao 5.00g
  var vgsata -wi-ao 20.00g
  mysqldump vgssd -wi-a- 10.00g
  usr vgssd -wi-ao 5.00g
  var vgssd -wi-ao 20.00g

[08:07 06/13/13]
[root@usb-live /mnt/slash]
# vgs
  VG #PV #LV #SN Attr VSize VFree
  vgsata 1 5 0 wz--n- 913.29g 824.29g
  vgssd 1 3 0 wz--n- 55.90g 20.90g

[08:07 06/13/13]
[root@usb-live /mnt/slash]
# mount
aufs on / type aufs (rw)
none on /proc type proc (rw,noexec,nosuid,nodev)
none on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /dev type devtmpfs (rw,mode=0755)
none on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
/dev/sdd1 on /cdrom type vfat (rw,relatime,fmask=0022,dmask=0022,codepage=cp437,iocharset=iso8859-1,shortname=mixed,errors=remount-ro)
/dev/loop0 on /rofs type squashfs (ro,noatime)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
none on /dev/shm type tmpfs (rw,nosuid,nodev)
tmpfs on /tmp type tmpfs (rw,nosuid,nodev)
none on /var/run type tmpfs (rw,nosuid,mode=0755)
none on /var/lock type tmpfs (rw,noexec,nosuid,nodev)
none on /lib/init/rw type tmpfs (rw,nosuid,mode=0755)
/dev/sda1 on /mnt/slash type xfs (rw)
/dev/mapper/vgssd-var on /mnt/slash/var type ext4 (ro)
/dev/mapper/vgsata-usr on /mnt/usr type xfs (rw)
/dev/mapper/vgsata-var on /mnt/var type xfs (rw)
/dev/mapper/vgssd-usr on /mnt/slash/usr type ext4 (ro)

[08:07 06/13/13]
[root@usb-live /mnt/slash]
# fdisk -l

Disk /dev/sda: 1000.0 GB, 999989182464 bytes
255 heads, 63 sectors/track, 121575 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / ...

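
The `[2/1] [U_]` status in the /proc/mdstat output above marks an array running with a missing member. A rough sketch for flagging such arrays, assuming the mdstat layout shown in this report; `degraded_arrays` is an illustrative name.

```shell
#!/bin/sh
# Sketch: print every md array whose member map contains "_" (a missing
# or failed member). Reads /proc/mdstat-style text on standard input.
degraded_arrays() {
    awk '
        /^md/ { name = $1 }                    # array header, e.g. "md0 : active raid1 ..."
        /blocks/ && match($0, /\[[U_]+\]/) {   # status line carrying the member map
            map = substr($0, RSTART, RLENGTH)
            if (index(map, "_")) print name, map
        }
    '
}
```

Run as `degraded_arrays < /proc/mdstat`. For the mdstat output shown above it would print `md0 [U_]`; a healthy `[UU]` array produces no output.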

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mdadm (Ubuntu):
status: New → Confirmed
A1an (alan-b) wrote :

On my system:

$ uname -svrmo
Linux 2.6.32-46-generic #108-Ubuntu SMP Thu Apr 11 15:56:25 UTC 2013 x86_64 GNU/Linux

$ lsb_release -a
Distributor ID: Ubuntu
Description: Ubuntu 10.04.4 LTS
Release: 10.04
Codename: lucid

The RAID 1 arrays have been broken since kernel 2.6.32-47 ("Continue to wait; or Press S to skip mounting or M for manual recovery").
After booting back into 2.6.32-46, the md devices are correctly set up and the system boots normally.

PS: still present with 2.6.32-48

Alan
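
Since 2.6.32-46 still boots correctly here, one possible stopgap (an assumption, not something suggested in this thread) is to put the kernel metapackages on hold so that routine upgrades do not pull in a newer kernel. A hedged sketch: `hold_lines` is an illustrative helper, and the package names must match the flavour actually installed (for example `linux-generic` or `linux-server`).

```shell
#!/bin/sh
# Sketch: emit "package hold" lines in the format that
# `dpkg --set-selections` reads from standard input.
hold_lines() {
    for pkg in "$@"; do
        printf '%s hold\n' "$pkg"
    done
}
```

As root, `hold_lines linux-server linux-image-server linux-headers-server | dpkg --set-selections` would mark those packages as held; `dpkg --get-selections | grep hold` shows the result. The older kernel still has to be selected in the GRUB menu at boot.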

summary: - latest 10.04 kernel update breaks software RAID + LVM
+ 2.6.32-46 kernel update on 10.04 breaks software RAID (+ LVM)
summary: - 2.6.32-46 kernel update on 10.04 breaks software RAID (+ LVM)
+ 2.6.32-47 kernel update on 10.04 breaks software RAID (+ LVM)

bl8n8r, thank you for taking the time to report this bug and helping to make Ubuntu better. Please execute the following command in a terminal, as it will automatically gather debugging information:
apport-collect BUGNUMBER
When reporting bugs in the future please use apport by using 'ubuntu-bug' and the name of the package affected. You can learn more about this functionality at https://wiki.ubuntu.com/ReportingBugs.

affects: mdadm (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: needs-kernel-logs needs-upstream-testing regression-update
Changed in linux (Ubuntu):
importance: Undecided → Medium
A1an (alan-b) wrote :

The array that does not start anymore has the following config:
level=raid1 metadata=1.2 num-devices=2

While a second array with metadata=1.0 seems to be recognized and started correctly.
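
To compare superblock versions across arrays, the `metadata=` field can be pulled out of ARRAY lines such as those printed by `mdadm --detail --scan`. A small sketch; `array_metadata` is an illustrative name, and ARRAY lines without a `metadata=` field are reported as `unknown`.

```shell
#!/bin/sh
# Sketch: print "<device> <metadata version>" for each ARRAY line read
# on standard input.
array_metadata() {
    awk '$1 == "ARRAY" {
        meta = "unknown"
        for (i = 3; i <= NF; i++)
            if ($i ~ /^metadata=/) meta = substr($i, 10)   # text after "metadata="
        print $2, meta
    }'
}
```

Typical use would be `mdadm --detail --scan | array_metadata`, which makes it easy to see which arrays use 1.2 and which use 1.0.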

A1an (alan-b) wrote :

Just updated to 2.6.32-49; the issue is still present and I had to switch back to 2.6.32-46, which still works fine.

A1an, if you have a bug in Ubuntu, the Ubuntu Kernel team, Ubuntu Bug Control team, and Ubuntu Bug Squad would like you to please file a new report by executing the following in a terminal:
ubuntu-bug linux

For more on this, please see the Ubuntu Kernel team article:
https://wiki.ubuntu.com/KernelTeam/KernelTeamBugPolicies#Filing_Kernel_Bug_reports

the Ubuntu Bug Control team and Ubuntu Bug Squad team article:
https://wiki.ubuntu.com/Bugs/BestPractices#X.2BAC8-Reporting.Focus_on_One_Issue

and Ubuntu Community article:
https://help.ubuntu.com/community/ReportingBugs#Bug_reporting_etiquette

When opening up the new report, please feel free to subscribe me to it.

Please note, not filing a new report would delay your problem being addressed as quickly as possible.

Thank you for your understanding.

A1an (alan-b) wrote :

I see, and I'll try to file the new report as soon as possible, even though I think it has the same cause as this one.

I've had a look at the kernel log and see that only two patches affected the md component between 2.6.32-46 (working kernel) and 2.6.32-47 (broken kernel, reported in this bug):

https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-2.6.32.y&id=c28f366a6ef9b6e14e069e7d750c32d73544444e
https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/commit/?h=linux-2.6.32.y&id=372994e9fd5cadbacdfcc8724b590193d136c947

A notification to the author might help solve this even quicker.

Steve Conklin (sconklin) on 2013-08-06
Changed in linux (Ubuntu):
assignee: nobody → Steve Conklin (sconklin)
Steve Conklin (sconklin) wrote :

It's possible that the patches which were linked are not related. They're not in the delta between the two versions which are listed.

Here are some questions to try to narrow down the field of potential issues.

1. You mention that this occurs on various upgrades. Could you confirm that these would be various different kernel versions over time, and not ONLY on the 46 to 47 update?

2. Have you seen this when doing any other reboots (not after system updates)?
(such as might occur if this was a boot race, which could show up only because the first boot after an update performs additional operations)

3. When this occurs you mention that you see the 'M' for manual recovery, do you also see the MD "degraded raid" prompt and if so how do you respond?

4. On taking the 'M' option, the LVM slices served from the RAID are missing. Can you provide the following information for the missing LVs (backed by md0):

  A) are the device links in /dev/<vgname>/<lvname> present?
  B) are the LVs listed in the output of 'lvs' and what is their state?
  C) are the volumes present in the 'dmsetup ls' output?
  D) are the PVs which are backed by md0 present in 'pvs' and what is their state?
  E) what is the actual state of md0 as shown in 'cat /proc/mdstat'?
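
The checks in A) through E) can be gathered in one pass from the recovery shell. A minimal sketch; `collect_state` is an illustrative name and the volume group is passed in as an argument. Commands that are not installed are reported rather than treated as fatal.

```shell
#!/bin/sh
# Sketch: run the five diagnostic commands in order and label each output.
collect_state() {
    vg=$1   # volume group backed by md0, e.g. vgssd
    for cmd in "ls -l /dev/$vg" "lvs" "dmsetup ls" "pvs" "cat /proc/mdstat"; do
        echo "== $cmd =="
        # Word splitting is intentional: each entry is a simple command line.
        set -- $cmd
        if command -v "$1" >/dev/null 2>&1; then
            "$@" 2>&1 || echo "(exit status $?)"
        else
            echo "($1 not available)"
        fi
    done
}
```

Something like `collect_state vgssd > /tmp/raid-lvm-state.txt` (the output path is arbitrary) produces a single file that can be attached to the bug report.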

A1an (alan-b) wrote :

@penalvch New bug filed as requested: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1209423

PS: I did not realize the kernel versions (using a dot for the last digits) are different from the Ubuntu ones (using a -). I'll also answer the latest questions by sconklin on the new bug report.

bl8n8r (bl8n8r-gmail) wrote :

:: could you confirm that these would be various different kernel versions over time, and not ONLY on the 46 to 47 update?

That's probable. IIRC, my RAID1 MD devices have been blowing up inexplicably over the past year, and I tend to stay quite current on patches, applying them each week regardless.

:: Have you seen this when doing any other reboots (not after system updates)?

No, it really seems related to kernel updates. The system I reported on had been pulled out of production for hardware updates, so it was powered off. In the time it took to replace hardware and test the new system, new kernel patches had come out, so I suspected things were going to break as soon as I applied the kernel patches.

:: do you also see the MD "degraded raid" prompt and if so how do you respond?

No, I only ever see the "Continue to wait; or Press S to skip mounting or M for manual recovery". Nothing about MD degraded.

:: are the device links in /dev/<vgname>/<lvname> present?
:: are the LVs listed in the output of 'lvs' and what is their state?
:: are the PVs which are backed by md0 present in 'pvs' and what is their state?

Don't remember about /dev/<vgname>. lvs, vgs and pvs all were sporadic in output, sometimes complaining of leaked memory, sometimes displaying my LVs and PVs. It was unstable.

:: are the volumes present in the 'dmsetup ls' output?

Never used 'dmsetup ls'.

:: what is the actual state of md0 as shown in 'cat /proc/mdstat'?

In the unstable state, after rebooting with the patched kernel and RAID+LVM borked, sometimes mdstat would say something about the device not existing, and then issuing it again would show "/dev/md_d0", which is not the correct MD device. Sometimes I would have to issue "mdadm -S /dev/md_d0" and then "mdadm --examine --scan" to restart it. I have never seen Linux software RAID fail this badly. Hopefully it's fixed soon, as I know from experience MD has historically been bulletproof. I have never had such problems before.
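
A stray device such as md_d0 can be picked out of /proc/mdstat before deciding what to stop. A hedged sketch; `stray_md_devices` is an illustrative name, and it assumes that any array named `md_d<N>` is unwanted, which holds for the setup described here but not for systems that deliberately use partitionable arrays.

```shell
#!/bin/sh
# Sketch: print the /dev path of every md_d<N> array listed in
# /proc/mdstat-style text read on standard input.
stray_md_devices() {
    awk '/^md_d[0-9]+ / { print "/dev/" $1 }'
}
```

Run as `stray_md_devices < /proc/mdstat`. Each device it prints could then be stopped with `mdadm --stop`. Note that `mdadm --examine --scan` only prints ARRAY lines; `mdadm --assemble --scan` is the command that actually reassembles arrays, and it is suggested here as an assumption rather than as the exact sequence used above.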
