partitions and file system data need erase block alignment

Bug #626907 reported by Arnd Bergmann
This bug affects 2 people
Affects: Linaro Image Tools
Status: Fix Released
Importance: Medium
Assigned to: Loïc Minier

Bug Description

The linaro-media-create script currently does not attempt to align partitions to erase blocks on flash media. This may be devastating to both performance and life expectancy of the media.

Instead, the script goes through a lot of effort to align the partition on logical cylinders, which is completely pointless on anything that does not run old versions of MS-DOS.

I'd recommend aligning all partitions to full megabytes using "sfdisk" instead of "fdisk".

Similarly, the mkfs.msdos tool should use hidden sectors to align the start of the ms-dos file system on an erase block boundary, and use a cluster size that is as large as possible without increasing the file system size too much.

When creating a btrfs root file system, the nodatasum and ssd mount options should be used to improve performance. We might add a warning when a user tries to create an ext3/ext4 file system on flash, since there is no way to make that perform or not wear out the flash.

We should probably allow using logfs as the root file system besides btrfs, since that is optimized a lot better for flash and consumes less storage space when using compression.

Note that even though flash vendors claim to have hardware wear levelling to make up for this, that is utter nonsense: even on industrial-grade CompactFlash cards it is basically useless, and you have to do it in software.

Loïc Minier (lool)
Changed in linaro-image-tools:
importance: Undecided → Medium
Ken Werner (kwerner) wrote :

We just looked into the source and it appears that l-m-c currently calls sfdisk (linaro_media_create/partitions.py:create_partitions) as follows:

sfdisk -D -H <heads> -S <sectors> ',9,<fatid>,*\n,,,-' <device>

sfdisk reads lines of the form:
 <start> <size> <id> <bootable> <c,h,s> <c,h,s>
where each line fills one partition descriptor. So, the above command creates two partitions:
part1 (vfat):
  ,9,%s,*
  || |  |
  || |  +--- bootable
  || +------ id (set to fat)
  |+-------- size (in cylinders)
  +--------- start
part2 (ext):
  ,,,-
  ||||
  |||+--- non bootable
  ||+---- id (defaults to linux)
  |+----- size (everything)
  +------ start

Which means that the second partition is cylinder aligned but we want it to be 4-meg aligned.

Loïc Minier (lool) wrote :

So the main constraint on changing the way we create partition tables is, I think, old versions of the OMAP boot ROM.

Basically on OMAP, the boot ROM will try to load MLO from the first partition, and there are some limitations on where the partition can be located, the filesystem type and things like that. After this point, we control MLO, which is currently x-loader loading u-boot loading a Linux kernel, and we can fix any bug in x-loader/u-boot/linux with a new partition layout.

The only doubt I have is for older OMAP boot ROMs, they might not be able to load MLO anymore. But probably we don't need to worry too much about these anymore. In the worst case, we could introduce a different partitioning logic for these older boards.

NB: we currently don't have x-loader for IGEP and Overo, but this is pending.

I think parted has some alignment features now which will make this stuff really easy, but it might require LBA partitions; hopefully that's not a problem these days. Since we use parted for most other operations, I'm inclined to say we should ditch sfdisk and use parted everywhere. I guess the python parted bindings support at least as much as the command-line tool.

Loïc Minier (lool)
Changed in linaro-image-tools:
status: New → Triaged
Arnd Bergmann (arnd-arndb) wrote :

It would be vital to find out exactly what the requirements of the boot ROM are. Is this documented anywhere?
Most importantly, does it require the partitions to start on a cylinder boundary like MS-DOS, or does it require the first partition to start on sector 63?

In the first case, we can change the default geometry from 63S/255H to 32S/128H, which would naturally align all partitions to 2 MB. In the second case, we can ignore all geometry calculations and simply start the first partition on LBA sector 63, and all other partitions on an LBA sector that is a multiple of 8192.
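The arithmetic behind the two options can be checked quickly. This is only a sketch, assuming 512-byte sectors; the helper names are made up for illustration and are not part of linaro-media-create.

```python
SECTOR_BYTES = 512          # assumed sector size
MIB = 1024 * 1024

def cylinder_bytes(heads, sectors_per_track):
    """Size in bytes of one logical cylinder for a given (fake) geometry."""
    return heads * sectors_per_track * SECTOR_BYTES

def is_aligned(lba, align_sectors):
    """True if an LBA sector number is a multiple of align_sectors."""
    return lba % align_sectors == 0

# 32 sectors x 128 heads: cylinders are exactly 2 MiB, so any
# cylinder-aligned partition is automatically 2 MiB aligned.
print(cylinder_bytes(128, 32) / MIB)

# 63 sectors x 255 heads: cylinders are 8225280 bytes, not a power of two.
print(cylinder_bytes(255, 63) / MIB)

# A multiple of 8192 sectors is a multiple of 4 MiB.
print(8192 * SECTOR_BYTES // MIB)
```

With the 63/255 geometry a cylinder boundary only coincides with a 4 MiB boundary by accident, which is why the second option drops cylinder alignment altogether.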

Are there also requirements in the boot ROM regarding the layout of the FAT file system on the first partition?

Peter Maydell (pmaydell) wrote :

There is a section describing exactly how the boot ROM supposedly finds and reads the FAT filesystem and the MLO file in the OMAP3 TRM (section 25.4.7.6.4 File System Handling).

Loïc Minier (lool) wrote :

Problem is that aside from the TRM documentation, some older versions of the OMAP3s would have bugs or limitations (documented or not) in handling of the partition table and fs.

For instance:
http://processors.wiki.ti.com/index.php/SD/MMC_format_for_OMAP3_boot
or:
http://code.google.com/p/beagleboard/wiki/LinuxBootDiskFormat
require setting geometry to 255 heads and 63 sectors.

I think we can't live forever with these limitations, especially with lack of clear documentation of which chips are affected. Also, it's not entirely clear whether the limitations came from x-loader or from the boot ROM.

Let us move to some modern partition layout with decent alignment, and if that causes any issue we will learn from affected people what the issues are, and we can decide whether we create a special partition layout for these boards. (We have different partitioning for mx51 and omap already, we could have an old-omap or something.)

Arnd Bergmann (arnd-arndb) wrote :

"setting geometry to 255 heads and 63 sectors" does not mean anything unless you also align the partitions to full cylinders, because the geometry is not stored anywhere but (in Linux) is calculated from the end of the last partition if there is an existing partition table.

The easiest way to align the partitions in an unconstrained environment would be to assume 63/255 (or whatever the driver reports) and still put the partitions on a 4MB boundary, not aligned to cylinder boundaries.

I would also recommend putting the bootable FAT partition at the end of the medium instead of the beginning, so that the file system in the first partition can take advantage of the special area that is typically located in the first 8 MB. This is likely to cause other problems with firmware that tries to make sense of the C/H/S data in the partition table.

Dave Martin (dave-martin-arm) wrote :

I haven't looked at the TRM so far, but my experience is that the boot FAT partition must start at sector 63 and must probably be (n*255-1)*63 sectors in size, where n is an integer >=5. In addition, only FAT32 appears to work. Furthermore, the "geometry" described by the partition table must be 255x63. I haven't managed to get any other configuration booting on Beagle xM.

I have succeeded in booting a card matching the above constraints, and where the FATs and data area are 4MB aligned (SD card style). So it looks like the FAT fs handling in the boot firmware is sane, but the partition handling is not, at least for some boards.

It would be quite simple to modify mkdosfs to allow creation of filesystems with different internal alignment as above, but it doesn't support this natively at present -- instead, I had to build an initial fs and then tweak it in a hex editor. However, this is probably not very useful, since the boot partition is pretty much write-only for us (and not written very often).

I wasn't successful so far in moving the FAT partition to another location, but it's probably worth persevering due to the obvious value of freeing up the first 8MB for real uses.

Dave Martin (dave-martin-arm) wrote :

@Ken

Can we call sfdisk with -S 63 -H 255 -uS -L, and work out the desired sector offsets in l-m-c? -- sfdisk then writes sensible partition data even for non-cylinder-aligned sectors:

cat <<EOF | sfdisk -S 63 -H 255 -uS -L <device>
63,$(((5*255-1)*63)),0xc,*
$(((5*255*63+8191)/8192*8192))
EOF

I've been doing this kind of thing manually for some time, and it seems to produce the desired results.
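For reference, the shell arithmetic in the heredoc above works out as follows. This is a sketch of how l-m-c might compute the same offsets; `omap_layout` and its parameters are hypothetical names, and 512-byte sectors are assumed.

```python
ALIGN = 8192            # 4 MiB expressed in 512-byte sectors
CYLINDER = 255 * 63     # sectors per logical cylinder in the 255/63 geometry

def omap_layout(n_cylinders=5, first_start=63):
    """Return (start1, size1, start2) in sectors.

    The first partition starts at sector 63 and ends on a cylinder
    boundary; the second partition starts on the next 4 MiB boundary.
    """
    size1 = (n_cylinders * 255 - 1) * 63
    end1_exclusive = first_start + size1            # == n_cylinders * CYLINDER
    start2 = (end1_exclusive + ALIGN - 1) // ALIGN * ALIGN
    return first_start, size1, start2

start1, size1, start2 = omap_layout()
print(start1, size1, start2)
print("unused sectors between partitions:", start2 - (start1 + size1))
```

For n = 5 the first partition is 80262 sectors long and the second starts at sector 81920, leaving a gap of 1595 sectors between them.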

Nicolas Pitre (npitre) wrote : Re: [Bug 626907] Re: partitions and file system data need erase block alignment

On Mon, 7 Feb 2011, Dave Martin wrote:

> It would be quite simple to modify mkdosfs to allow creation of
> filesystems with different internal alignment as above, but it doesn't
> support this natively at present -- instead, I had to build an initial
> fs and then tweak it in a hex editor. However, this is probably not
> very useful, since the boot partition is pretty much write-only for us
> (and not written very often).

Right. What would be useful is not to align the first partition, but
rather to tweak that first partition's start or length so the second
partition does get properly aligned.

Arnd Bergmann (arnd-arndb) wrote :

@Dave:
I think using geometry 63/255 is the best option, since that is the only legal thing to do for >8GB cards according to a lot of software interpreting it. Starting the first partition at sector 63 can be done with any partition layout the way you describe. However, for the end of the first partition, I would advocate aligning it to 4MB so we don't waste any space before the start of the second partition and I don't see how that could break. That would be a slight modification of your method (untested):

cat <<EOF | sfdisk -S 63 -H 255 -uS -L <device>
63,$((($SIZE_MB + 3) / 4 * 4 * 2048 - 1)),0xc,*
,,,
EOF
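Since the snippet is marked untested, it may be worth checking where the second partition actually lands for a given start/size pair. A small sketch, with hypothetical helper names and 512-byte sectors assumed:

```python
ALIGN_MB = 4
SECTORS_PER_MB = 2048   # 512-byte sectors

def next_free_sector(start, size):
    """First sector after a partition given its start and size in sectors."""
    return start + size

def aligned_boot_size(size_mb, start=63):
    """Size in sectors so that the partition ends right before a 4 MB boundary."""
    boundary = (size_mb + ALIGN_MB - 1) // ALIGN_MB * ALIGN_MB * SECTORS_PER_MB
    return boundary - start

size = aligned_boot_size(64)
print(size, next_free_sector(63, size))

# Subtracting 1 instead of the start offset overshoots the boundary by 62.
print(next_free_sector(63, 64 * SECTORS_PER_MB - 1))
```

With a start of 63, the size has to be the rounded boundary minus 63 for the following partition to begin exactly on the boundary. For a 64 MB boot partition that gives 131009 sectors, which is the size shown in the expert-mode fdisk listing later in this thread.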

For the first partition, another straightforward optimization will be to use 32KB clusters, as they are normally used on SD cards.
Obviously these should be naturally aligned, but even when they are not, we do much better than we do today for both reading and writing. The 4MB alignment for the first partition doesn't matter too much with FAT.

@Nicolas:
The sfdisk command that Dave posted does align the second partition to 4 MB correctly, while keeping the first partition cylinder aligned, which should always work.

Dave Martin (dave-martin-arm) wrote :

@Arnd,

[...]

> you describe. However, for the end of the first partition, I
> would advocate aligning it to 4MB so we don't waste any space
> before the start of the second partition and I don't see how
> that could break. That would be a slight modification of your
> method (untested):
>
> cat <<EOF | sfdisk -S 63 -H 255 -uS -L <device>
> 63,$((($SIZE_MB + 3) / 4 * 4 * 2048 - 1)),0xc,*
> ,,,
> EOF

That's a sensible idea--- I tried exactly that at one point, but it failed if my memory serves me correctly.

Definitely worth someone trying again, though. It's silly to leave a hole between the partitions if we don't have to.

Cheers
---Dave

Loïc Minier (lool) wrote :

The linked branch seems to create properly aligned partitions for me

When I'm trying to dd a file created with --image-file (which calls sfdisk -H/-S etc.), it seems good when I look at it with fdisk, however if I dd it to a MMC and boot that, the partitions are aligned at the right LBA offsets but the CHS information seems incorrect; does someone understand why that's the case?

Loïc Minier (lool)
Changed in linaro-image-tools:
assignee: nobody → Loïc Minier (lool)
status: Triaged → In Progress
Arnd Bergmann (arnd-arndb) wrote :

Can you be more specific what you mean by "seems good" and "seems incorrect", respectively?

It should always have geometry 63/255, as we discussed above, but this makes the
partitions not aligned to cylinders when they are aligned to multiples of 4 MB.

E.g. a factory-formatted SD card should report something like:

$ sudo fdisk -u /dev/mmcblk0

WARNING: DOS-compatible mode is deprecated. It's strongly recommended to
         switch off the mode (command 'c').

Command (m for help): p

Disk /dev/mmcblk0: 15.8 GB, 15823011840 bytes
255 heads, 63 sectors/track, 1923 cylinders, total 30904320 sectors
Units = sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000

        Device Boot Start End Blocks Id System
/dev/mmcblk0p1 8192 30904319 15448064 c W95 FAT32 (LBA)

Command (m for help): x

Expert command (m for help): p

Disk /dev/mmcblk0: 255 heads, 63 sectors, 1923 cylinders

Nr AF Hd Sec Cyl Hd Sec Cyl Start Size ID
 1 00 130 3 0 254 63 1023 8192 30896128 0c
 2 00 0 0 0 0 0 0 0 0 00
 3 00 0 0 0 0 0 0 0 0 00
 4 00 0 0 0 0 0 0 0 0 00

A partition starting at LBA sector 8192 in a 63/255 layout should correctly start
at cylinder 0, head 130, sector 3. A partition table with a FAT32 partition
starting at sector 63 and another partition starting 64 MB into the device
looks like:

Disk /dev/mmcblk0: 255 heads, 63 sectors, 1923 cylinders

Nr AF Hd Sec Cyl Hd Sec Cyl Start Size ID
 1 00 1 1 0 40 32 8 63 131009 0c
Partition 1 does not end on cylinder boundary.
 2 00 40 33 8 254 63 1023 131072 30773248 83
 3 00 0 0 0 0 0 0 0 0 00
 4 00 0 0 0 0 0 0 0 0 00
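The C/H/S values in both listings follow from the usual LBA-to-CHS conversion with a 255-head, 63-sector geometry. A minimal sketch (the function name is illustrative; note that fdisk prints the columns in Hd, Sec, Cyl order):

```python
def lba_to_chs(lba, heads=255, sectors=63):
    """Convert an LBA sector number to (cylinder, head, sector).

    Heads and cylinders count from 0, sectors count from 1.
    """
    cylinder, rest = divmod(lba, heads * sectors)
    head, sector0 = divmod(rest, sectors)
    return cylinder, head, sector0 + 1

for lba in (8192, 63, 63 + 131009 - 1, 131072):
    cylinder, head, sector = lba_to_chs(lba)
    print(f"LBA {lba}: Hd {head} Sec {sector} Cyl {cylinder}")
```

LBA 8192 comes out as cylinder 0, head 130, sector 3, and the boundary between the two partitions in the second listing (LBA 131071 and 131072) comes out as 8/40/32 and 8/40/33.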

Loïc Minier (lool) wrote :

Checking the x-loader history, I found this commit from April 2010:

--- a/fs/fat/fat.c
+++ b/fs/fat/fat.c
@@ -145,13 +145,11 @@ fat_register_device(block_dev_desc_t *dev_desc, int part_n
                        return -1;
                }
 #else
- /* FIXME we need to determine the start block of the
- * partition where the DOS FS resides. This can be done
- * by using the get_partition_info routine. For this
- * purpose the libpart must be included.
- */
- part_offset=63;
- //part_offset=0;
+ part_offset = buffer[DOS_PART_TBL_OFFSET+8] |
+ buffer[DOS_PART_TBL_OFFSET+9] <<8 |
+ buffer[DOS_PART_TBL_OFFSET+10]<<16 |
+ buffer[DOS_PART_TBL_OFFSET+11]<<24;
+
                cur_part = 1;
 #endif
        }

that's probably the limitation for x-loader at least; not sure about the ROM.
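What the added lines do is reassemble the little-endian 32-bit start LBA stored at byte offset 8 of the first partition entry. A Python sketch of the same read follows; the value 0x1be for DOS_PART_TBL_OFFSET is an assumption here (it is the standard MBR partition table offset), and the function name is made up.

```python
DOS_PART_TBL_OFFSET = 0x1BE   # assumed value: standard MBR table offset (446)

def first_partition_start(sector0):
    """Start LBA of the first partition, read from the raw MBR sector."""
    off = DOS_PART_TBL_OFFSET + 8
    # Same byte-by-byte assembly as in the x-loader patch.
    by_shifts = (sector0[off]
                 | sector0[off + 1] << 8
                 | sector0[off + 2] << 16
                 | sector0[off + 3] << 24)
    # Equivalent, more idiomatic form.
    assert by_shifts == int.from_bytes(sector0[off:off + 4], "little")
    return by_shifts

# Build a fake MBR whose first partition starts at sector 8192.
mbr = bytearray(512)
mbr[DOS_PART_TBL_OFFSET + 8:DOS_PART_TBL_OFFSET + 12] = (8192).to_bytes(4, "little")
print(first_partition_start(bytes(mbr)))
```

Before that commit the offset was hard-coded to 63, so an x-loader built from older sources would only find the FAT filesystem if the first partition started at sector 63.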

Dave Martin (dave-martin-arm) wrote :

@Loïc

> The linked branch seems to create properly aligned partitions for me
>
> When I'm trying to dd a file created with --image-file (which calls sfdisk -H/-S etc.), it seems good when I look at it with fdisk,
> however if I dd it to a MMC and boot that, the partitions are aligned at the right LBA offsets but the CHS information seems
> incorrect; does someone understand why that's the case?

The geometry is not stored anywhere -- for devices which have a real declared geometry (i.e., real disks) the disk's declared geometry may be discoverable using ioctls etc., but otherwise fdisk attempts to guess the geometry from the partition table. However, if there is no geometry which makes all the partitions look cylinder-aligned, fdisk's guess may well be different from what was specified when partitioning.

Can you include a hex dump of the partition table? This is the only way to determine whether it's correct in this kind of case.

$ dd bs=446 skip=1 if=image.bin | dd bs=64 count=1 | hexdump -vC

However, the CHS fields are not much used--- if X-loader just assumes sector 63 for the start of the first partition, rather than reading this information, then their contents may be totally irrelevant.

(Given the overwhelming simplicity of the partition table structure, it's rather tempting to abandon all these complex tools and generate it directly...)

Dave Martin (dave-martin-arm) wrote :

@Arnd

> It should always have geometry 63/255, as we discussed above, but this makes the
> partitions not aligned to cylinders when they are aligned to multiples of 4 MB.

IIUC, we absolutely don't care if some of the partitions are not cylinder aligned, because there is no such thing as a cylinder.

We may care about specific partitions being aligned in a certain specific way (i.e., the boot partition on OMAP) but that's really a separate problem -- I view that as bootloader bug-compatibility rather than a general alignment requirement. Such a requirement should be addressed by putting the partition in the right place at sector granularity directly -- partitioners shouldn't be expected to do this magically without complaining, since they are in no way designed to cope with this kind of situation.

Loïc Minier (lool) wrote :

On Fri, Feb 11, 2011, Dave Martin wrote:
> The geometry is not stored anywhere -- for devices which have a real
> declared geometry (i.e., real disks) the disk's declared geometry may be
> discoverable using ioctls etc., but otherwise fdisk attempts to guess
> the geometry from the partition table. However, if there is no geometry
> which makes all the partitions look cylinder-aligned, fdisk's guess may
> well be different from what was specified when partitioning.

 Right; I just don't understand why it comes to different conclusions on
 geometry on a real device versus on a file; I think fdisk assumes
 63/255 on files.

> Can you include a hex dump of the partition table? This is the only way
> to determine whether it's correct in this kind of case.
>
> $ dd bs=446 skip=1 if=image.bin | dd bs=64 count=1 | hexdump -vC

 here's one:

00000000 80 05 05 01 0c fe 3f 07 00 40 00 00 08 b6 01 00 |......?..@......|
00000010 00 28 21 08 83 2a a0 0a 00 00 02 00 00 00 7e 00 |.(!..*........~.|
00000020 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000030 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
00000040

 I actually think I should be starting at +4MiB rather than +8MiB
 (for some reason I had decided I wouldn't use the first "cylinder")
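The dump can be decoded by hand, but a few lines of Python make the cross-check between the CHS and LBA fields less error-prone. This is a sketch assuming the standard 16-byte MBR entry layout and a 255x63 geometry; the helper names are made up.

```python
import struct

# The two non-empty 16-byte entries from the hex dump above.
TABLE_HEX = """
80 05 05 01 0c fe 3f 07 00 40 00 00 08 b6 01 00
00 28 21 08 83 2a a0 0a 00 00 02 00 00 00 7e 00
"""

def unpack_chs(raw3):
    """Decode the packed 3-byte CHS field into (cylinder, head, sector)."""
    head, sec_cyl, cyl_low = raw3
    cylinder = ((sec_cyl & 0xC0) << 2) | cyl_low  # top two bits live in byte 1
    return cylinder, head, sec_cyl & 0x3F

def chs_to_lba(chs, heads=255, sectors=63):
    """Convert (cylinder, head, sector) to an LBA; 255x63 geometry is assumed."""
    cylinder, head, sector = chs
    return (cylinder * heads + head) * sectors + (sector - 1)

def decode_entry(raw16):
    """Split one MBR partition entry into its fields."""
    status, first, ptype, last, start, count = struct.unpack("<B3sB3sII", raw16)
    return {
        "bootable": status == 0x80,
        "type": ptype,
        "start": start,
        "count": count,
        "chs_first": unpack_chs(first),
        "chs_last": unpack_chs(last),
    }

table = bytes.fromhex("".join(TABLE_HEX.split()))
entries = [decode_entry(table[i:i + 16]) for i in range(0, len(table), 16)]

for e in entries:
    # The CHS fields are consistent if they map back to the LBA fields.
    first_ok = chs_to_lba(e["chs_first"]) == e["start"]
    last_ok = chs_to_lba(e["chs_last"]) == e["start"] + e["count"] - 1
    print(hex(e["type"]), e["start"], e["count"], first_ok, last_ok)
```

Decoded this way, the FAT32 (0x0c) partition starts at sector 16384 (8 MiB) with 112136 sectors, and the Linux (0x83) partition starts at sector 131072 (64 MiB) with 8257536 sectors. In both entries the CHS fields map back to the LBA fields under 255x63.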

> However, the CHS fields are not much used--- if X-loader just
> assumes sector 63 for the start of the first partition, rather than
> reading this information, then their contents may be totally irrelevant

Loïc Minier (lool) wrote :

On Fri, Feb 11, 2011, Dave Martin wrote:
> We may care about specific partitions being aligned in a certain
> specific way (i.e., the boot partition on OMAP) but that's really a
> separate problem -- I view that as bootloader bug-compatibility rather
> than a general alignment requirement. Such a requirement should be
> addressed by putting the partition in the right place at sector
> granularity directly -- partitioners shouldn't be expected to do this
> magically without complaining, since they are in no way designed to cope
> with this kind of situation.

 Will it cause issues with e.g. accessing the vfat from windows?

 My understanding is that even if modern tools and Linux look at LBA
 offsets, they still have to write something in the CHS fields and
 follow constraints of these fields. I always thought that one of the
 constraints was that partitions couldn't share the same cylinder, but
 apparently that's not a problem at all!

 I think I'll throw all the CHS calculations away, and just add a compat
 flag to create the vfat boot part at +63s for older x-loader, unless
 someone reports this breaks some OS or use case

--
Loïc Minier

Arnd Bergmann (arnd-arndb) wrote :

@Dave: I absolutely agree that we shouldn't align partitions to cylinders. We should however ensure that the c/h/s data can be interpreted in a way that matches the LBA number, and in that case I think we should just assume 63/255 geometry to find the right c/h/s values to put in the table.

I think the only thing we really care about is that any LBA numbers larger than 8GB are at the magic C/H/S 1023/255/63 position, which some software uses as an indication to ignore all C/H/S values and rely on LBA exclusively.
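The 8GB figure is the point where a 10-bit cylinder number runs out under the 255/63 geometry. A sketch of the clamping rule follows; the function name is made up, and the head value 254 in the marker tuple follows the expert-mode fdisk listing earlier in this thread (Hd 254, Sec 63, Cyl 1023).

```python
HEADS, SECTORS, SECTOR_BYTES = 255, 63, 512

# First LBA that no longer fits in a 10-bit cylinder number.
CHS_LIMIT_SECTORS = 1024 * HEADS * SECTORS

def chs_or_magic(lba):
    """Return (cylinder, head, sector) for an LBA, or the 'use LBA' marker."""
    if lba >= CHS_LIMIT_SECTORS:
        # Not representable: write the maximum tuple instead of wrapping.
        return 1023, HEADS - 1, SECTORS
    cylinder, rest = divmod(lba, HEADS * SECTORS)
    head, sector0 = divmod(rest, SECTORS)
    return cylinder, head, sector0 + 1

print(CHS_LIMIT_SECTORS * SECTOR_BYTES)   # addressable bytes via C/H/S
print(chs_or_magic(8192))                 # below the limit
print(chs_or_magic(30904319))             # last sector of the 15.8 GB card
```

The limit works out to 8422686720 bytes, so on the 15.8 GB card above any partition ending in the second half of the device gets the marker tuple as its end position.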

Dave Martin (dave-martin-arm) wrote :

> We should however ensure that the c/h/s data can be interpreted in a way that matches the LBA number

agreed - note that tools generally seem to do it right, even if the numbers come out strangely when the partitions are listed.

@Loic - is that partitioning supposed to be for OMAP? For that platform, we really mustn't 4MB align the FAT partition because X-loader expects it to start at sector 63. Other boards presumably don't have this problem though (I hope!)

In Loïc's example above, the CHS is right with respect to the sector numbers, assuming 63x255 geometry.

> I think the only thing we really care about is that any LBA numbers larger than 8GB are at the magic C/H/S 1023/255/63 position,

agreed - again, my experience suggests that partitioners such as sfdisk DTRT here.

Dave Martin (dave-martin-arm) wrote : RE: [Bug 626907] Re: partitions and file system data need erase block alignment

> On Fri, Feb 11, 2011, Dave Martin wrote:
> > We may care about specific partitions being aligned in a certain
> > specific way (i.e., the boot partition on OMAP) but that's really a
> > separate problem -- I view that as bootloader bug-compatibility rather
> > than a general alignment requirement. Such a requirement should be
> > addressed by putting the partition in the right place at sector
> > granularity directly -- partitioners shouldn't be expected to do this
> > magically without complaining, since they are in no way designed to
> > cope with this kind of situation.
>
> Will it cause issues with e.g. accessing the vfat from windows?

Good question-- try it and see?

In any case, I think we need to cylinder-align the FAT partition at least for OMAP ... and this partition is rarely read and even more rarely written, so performance is not much of an issue. So maybe there's no problem with this.

Note that when the partition type field is 0xC this anyway warns the OS that the partition needs LBA and the CHS fields are/may be junk. So really, any non-broken OS ought to cope... (no guarantees though!)

>
>
> My understanding is that even if modern tools and Linux look
> at LBA offsets, they still have to write something in the
> CHS fields and follow constraints of these fields. I always

Well, in principle you have to write the CHS fields with "sensible" values in case some software tries to interpret them. But it seems that little or no software tries to interpret them nowadays, with the possible exception of partitioning tools. And with sectors past the 1024-cylinder limit, you can't write sensible values anyway. Some old tools used to simply mask off the high bits instead of setting the cylinder field to 1023, which would have led to catastrophe if OSes actually used that information when mounting

Loïc Minier (lool)
Changed in linaro-image-tools:
status: In Progress → Fix Committed
Loïc Minier (lool)
Changed in linaro-image-tools:
status: Fix Committed → Fix Released