From: Tejun Heo on 7 Mar 2010 22:50

Hello, guys.

It looks like the transition to ATA 4k drives will be quite painful and
we aren't really ready although these drives are already selling
widely. I've written up a summary document on the issue to clarify
things, as it's getting more and more confusing, and to develop some
consensus. It's also on the linux ata wiki.

  http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues

I've cc'd the people I can think of off the top of my head but I have
surely missed some who would be interested. Please feel free to add
cc's or forward the message to other MLs. Especially, I don't know
much about partitioners so the details there are pretty shallow and
could be plain wrong. It would be great if someone who knows more
about this stuff could chime in.

Thanks.

=== Document follows ===

ATA 4 KiB sector issues

Background
==========

Up until recently, all ATA hard drives have been organized in 512 byte
sectors. For example, my 500 GB (about 466 GiB) hard drive consists of
976773168 512 byte sectors numbered from 0 to 976773167. This is how
a drive communicates with the driver. When the operating system wants
to read 32 KiB of data at the 1 MiB position, the driver asks the drive
to read 64 sectors from LBA (logical block address, i.e. sector number)
2048.

Because each sector should be addressable, readable and writable
individually, the physical medium is also organized in sectors of the
same size. In addition to the area to store the actual data, each
sector requires extra space for bookkeeping - inter-sector space to
enable locating and addressing each sector and ECC data to detect and
correct inevitable raw data errors.

As the densities and capacities of hard drives keep growing, stronger
ECC becomes necessary to guarantee an acceptable level of data
integrity, increasing the space overhead.
In addition, in most applications, hard drives are now accessed in
units of at least 8 sectors or 4096 bytes and maintaining 512 byte
granularity has become somewhat meaningless.

This reached a point where enlarging the sector size to 4096 bytes
would yield measurably more usable space given the same raw data
storage size and hard drive manufacturers are transitioning to 4 KiB
sectors.

Anandtech has a good article which illustrates the background and
issues with pretty diagrams[1].


Physical vs. Logical
====================

Because the 512 byte sector size has been around for a very long time
and up to ATA/ATAPI-7 the sector size was fixed at 512 bytes, the
sector size assumption is scattered across all the layers -
controllers or bridge chips snooping commands, BIOSes, boot code,
drivers, partitioners and system utilities - which makes it very
difficult to change the sector size from 512 bytes without massively
breaking backward compatibility.

As a workaround, the concept of logical sector size was introduced.
The physical medium is organized in 4 KiB sectors but the firmware on
the drive presents it as if the drive were composed of 512 byte
sectors, making the drive behave as before. If the driver asks the
hard drive to read 64 sectors from LBA 2048, the firmware translates
the request and reads 8 4 KiB sectors from hardware sector 256. As a
result, the hard drive now has two sector sizes - the physical one
which the physical media is actually organized in, and the logical one
which the firmware presents to the outside world.

A straightforward example mapping between physical sector and LBA
would be

  LBA = 8 * phys_sect


Alignment problem on 4 KiB physical / 512 logical drives
========================================================

This workaround keeps older hardware and software working while
allowing the drive to use a larger sector size internally. However, the
discrepancy between physical and logical sector sizes creates an
alignment issue.
For example, if the driver wants to read 7 sectors from LBA 2047, the
firmware has to read hardware sectors 255 and 256 (LBA 2040 to 2055)
and trim the leading 7*512 bytes and the trailing 2*512 bytes.

For reads, this isn't an issue as drives read in larger chunks anyway
but for writes, the drive has to do read-modify-write to achieve the
requested action. It has to first read hardware sectors 255 and 256,
update the requested parts and then write back those sectors, which
can cause significant performance degradation[2].

The problem is aggravated by the way DOS partitions[3] have been laid
out traditionally. For reasons dating back more than two decades,
they are laid out according to something called disk geometry, which
nowadays consists of arbitrary values with a number of restrictions
for backward compatibility accumulated over the years. The end result
is that until recently (most Linux variants and up to Windows XP) the
first partition ends up on sector 63 and later ones on cylinder
boundaries where each cylinder usually is composed of 255 * 63
sectors.

Most modern filesystems generate accesses which are 4 KiB aligned
relative to the start of the partition they are in. If a drive maps
4 KiB physical sectors to 512 byte logical sectors from LBA 0, the
filesystem in the first partition will always be misaligned and
filesystems in later partitions are likely to be misaligned too.


Solving the alignment problem on 4 KiB physical / 512 logical drives
====================================================================

There are multiple ways which attempt to solve the problem.

S-1. Yet another workaround from the firmware - offset-by-one.

 Yet another workaround which can be done by the firmware is to
 offset the physical to logical mapping by one logical sector such
 that LBA 63 ends up on a physical sector boundary, which aligns the
 first partition to physical sectors without requiring any software
 update.
 The example mapping between phys_sect and LBA becomes

   LBA = 8 * phys_sect - 1

 The leading 512 bytes of phys_sect 0 are not used and LBA 0 starts
 right after that point. phys_sect 1 maps to LBA 7 and phys_sect 8 to
 LBA 63, making LBA 63 aligned on a hardware sector.

 Although this aligns only the first partition, for many use cases,
 especially the ones involving older software, this workaround was
 deemed useful and some recent drives with 4 KiB physical sectors are
 equipped with a dip switch to turn offset-by-one mapping on or off.

S-2. The proper solution.

 Correct alignments for all partitions can't be achieved by the
 firmware alone. The system utilities should be informed about the
 alignment requirements and align partitions accordingly.

 The above firmware workaround complicates the situation because the
 two different configurations require different offsets to achieve
 the correct alignments. ATA/ATAPI-8 specifies a way for a drive to
 export the physical and logical sector sizes and the LBA offset
 which is aligned to the physical sectors.

 In Linux, these parameters are exported via the following sysfs
 nodes.

   physical sector size        : /sys/block/sdX/queue/physical_block_size
   logical sector size         : /sys/block/sdX/queue/logical_block_size
   alignment offset            : /sys/block/sdX/alignment_offset

 Let the physical sector size be PSS, logical sector size LSS and
 alignment offset AOFF. The system software should place partitions
 such that the starting LBAs of all partitions are aligned on

   (n * PSS + AOFF) / LSS

 For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512
 and AOFF 3584 and with n of 7 the above becomes,

   (7 * 4096 + 3584) / 512 == 63

 making sector 63 an aligned LBA where the first partition can be
 put, but without the offset-by-one mapping, AOFF is zero and LBA 63
 is not aligned.
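
 The alignment rule above can be checked with a few lines of Python.
 This is only a minimal sketch: the function names are made up for
 illustration, AOFF is assumed to be a multiple of LSS, and
 read_sysfs_params() simply reads the three sysfs nodes listed above
 for a device name supplied by the caller.

```python
from pathlib import Path


def is_aligned(lba, pss, lss, aoff):
    """True if lba can be written as (n * PSS + AOFF) / LSS for some n >= 0."""
    byte_pos = lba * lss
    return byte_pos >= aoff and (byte_pos - aoff) % pss == 0


def aligned_lbas(count, pss, lss, aoff):
    """First `count` aligned LBAs, (n * PSS + AOFF) / LSS for n = 0, 1, ...

    Assumes AOFF is a multiple of LSS, as in the examples in the text.
    """
    return [(n * pss + aoff) // lss for n in range(count)]


def read_sysfs_params(dev):
    """Hypothetical helper: read (PSS, LSS, AOFF) for e.g. dev="sda"."""
    base = Path("/sys/block") / dev
    pss = int((base / "queue/physical_block_size").read_text())
    lss = int((base / "queue/logical_block_size").read_text())
    aoff = int((base / "alignment_offset").read_text())
    return pss, lss, aoff


if __name__ == "__main__":
    # Offset-by-one drive: LBA 63 is aligned (n = 7).
    print(is_aligned(63, 4096, 512, 3584))   # True
    # Same drive without offset-by-one: LBA 63 is not aligned.
    print(is_aligned(63, 4096, 512, 0))      # False
```

 On a real system, is_aligned(start_lba, *read_sysfs_params("sda"))
 would tell whether a given partition start is aligned, provided the
 drive reports its parameters honestly.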

 With the above new alignment requirement in place, it becomes
 difficult to honor the legacy one - first partition on sector 63 and
 all other partitions on cylinder boundaries (255 * 63 sectors) - as
 the two alignment requirements contradict each other. This might be
 worked around by adjusting how LBA and CHS addresses are mapped but
 the disk geometry parameters are hard coded everywhere and there is
 no reliable way to communicate custom geometry parameters.


Complications
=============

Unfortunately, there are complications.

C-1. The standard is not and won't be followed as-is.

 Some of the existing BIOSes and/or drivers can't cope with drives
 which report a 4 KiB physical sector size. To work around this, some
 drive models lie and report their physical sector size as 512 bytes
 when the actual configuration is 4 KiB without offsetting.

 This nullifies the provisions for alignment in the ATA standard but
 results in the correct alignment for Windows Vista and 7. OS
 behaviors are described further below.

 For these drives, which are likely to continue to be shipped for the
 foreseeable future, traditional LBA 63 and cylinder based aligning
 results in misalignment.

C-2. Windows XP depends on the traditional partition layout.

 Windows XP makes use of the CHS start/end addresses in the partition
 table and gets confused if partitions are not laid out
 traditionally. This means that XP can't be installed into a
 partition prepared by later versions of Windows[4]. This isn't a
 big problem for Windows because in most cases the later version is
 replacing the older one, not the other way around.

 Unfortunately, the situation is more complex for Linux because Linux
 is often co-installed with various versions of Windows and XP is
 still quite popular.
 This means that when a Linux partitioner is
 used to prepare a partition which may be used by Windows, the
 partitioner might have to consider which version of Windows is going
 to be used and whether to lay out the partitions for correct
 alignment or for compatibility with older versions of Windows.

C-3. The 2 TiB barrier and the possibility for 4 KiB logical sector size.

 The DOS partition format uses 32 bits each for the starting LBA and
 the number of sectors and, reportedly, 32 bit Windows XP shares the
 limitation. With 32 bit addressing and 512 byte logical sector
 size, the maximum addressable sector + 1 is at

   2^32 * 2^9 == 2^41 == 2 TiB

 The DOS partition format allows a partition to reach beyond 2 TiB as
 long as the starting LBA is under 2 TiB; however, both Windows XP
 and the Linux kernel (at least up to v2.6.33) refuse such
 partition configurations.

 With the right combination of host controller, BIOS and driver, this
 barrier can be overcome by enlarging the logical sector size to 4
 KiB, which will push the barrier out to 16 TiB. On the right
 configuration, Windows XP is reportedly able to address beyond the 2
 TiB barrier with a DOS partition and 4 KiB logical sector size.
 Linux kernels up to v2.6.33 don't work under such configurations but
 a patch to make it work is pending[5].

 This might also be beneficial for operating systems which don't
 suffer from this limitation. A different partition format - GPT[6]
 - should be used beyond 2^32 sectors, which could harm compatibility
 with older BIOSes or other operating systems which don't recognize
 the new format.

 As mentioned previously, the 512 byte sector assumption has been
 there for a very long time and changing it is likely to cause various
 compatibility problems at many different layers from hardware up to
 the system utilities.
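
 The arithmetic behind the barrier can be spelled out as follows. A
 small sketch only; mbr_limit_bytes() is a made-up name and the 32 bit
 field width is the one described above for the DOS partition format.

```python
TIB = 1 << 40  # bytes in one TiB


def mbr_limit_bytes(logical_sector_size, lba_bits=32):
    """Bytes addressable with an lba_bits wide sector number."""
    return (1 << lba_bits) * logical_sector_size


if __name__ == "__main__":
    print(mbr_limit_bytes(512) // TIB)    # 2   (2^32 * 2^9  == 2^41)
    print(mbr_limit_bytes(4096) // TIB)   # 16  (2^32 * 2^12 == 2^44)
```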


Windows
=======

As hard drive vendors aim for performance and compatibility in modern
Windows environments, it is worthwhile to investigate how Windows
partitions drives with different alignment requirements. Up until
Windows XP, it followed the traditional layout - the first partition on
LBA 63 and the others on cylinder boundaries where a cylinder is
defined as 255 tracks with 63 sectors each.

Windows Vista and 7 align partitions differently. As the two behave
similarly, only 7's behavior is shown here. These partition tables
were created by the Windows 7 RC installer on blank disks.

W-1. 512 byte physical and logical sector drive.

 ST FIRST  T  LAST   LBA      NBLKS
 80 202100 07 df130c 00080000 00200300
 00 df140c 07 feffff 00280300 00689e12
 00 000000 00 000000 00000000 00000000
 00 000000 00 000000 00000000 00000000

 Part0:        FIRST   C    0  H   32  S   33  : 2048          (63 sec/trk)
               LAST    C   12  H  223  S   19  : 206847        (255 heads/cyl)
               LBA     2048 + 204800 = 206848

 Part1:        FIRST   C   12  H  223  S   20  : 206848
               LAST    C 1023  H  254  S   63  : E
               LBA     206848 + 312371200 = 312578048

 Both aligned at (2048 * n).  Part 1 not aligned to cylinder.

W-2. 4 KiB physical and 512 byte logical sector drive without offset-by-one.

 ST FIRST  T  LAST   LBA      NBLKS
 80 202100 07 df130c 00080000 00200300
 00 df140c 07 feffff 00280300 00b83f25
 00 000000 00 000000 00000000 00000000
 00 000000 00 000000 00000000 00000000

 Part0:        FIRST   C    0  H   32  S   33  : 2048          (63 sec/trk)
               LAST    C   12  H  223  S   19  : 206847        (255 heads/cyl)
               LBA     2048 + 204800 = 206848

 Part1:        FIRST   C   12  H  223  S   20  : 206848
               LAST    C 1023  H  254  S   63  : E
               LBA     206848 + 624932864 = 625139712

 Both aligned at (2048 * n).  Part 1 not aligned to cylinder.

W-3. 4 KiB physical and 512 byte logical sector drive with offset-by-one.

 ST FIRST  T  LAST   LBA      NBLKS
 80 202800 07 df130c 07080000 f91f0300
 00 df1b0c 07 feffff 07280300 f9376d74
 00 000000 00 000000 00000000 00000000
 00 000000 00 000000 00000000 00000000

 Part0:        FIRST   C    0  H   32  S   40  : 2055          (63 sec/trk)
               LAST    C   12  H  223  S   19  : 206847        (255 heads/cyl)
               LBA     2055 + 204793 = 206848

 Part1:        FIRST   C   12  H  223  S   27  : 206855
               LAST    C 1023  H  254  S   63  : E
               LBA     206855 + 1953314809 = 1953521664

 Both aligned at (2048 * n + 7).  Part 1 not aligned to cylinder.

The partitioner seems to be using 1 MiB as the basic alignment unit and
offsetting from there if explicitly requested by the drive, and there
is no difference between the handling of 512 byte and 4 KiB drives,
which explains why C-1 works for hard drive vendors.

In all cases, the partitioner ignores both the first partition on LBA
63 and the others on cylinder boundary requirements while still using
the same 255*63 cylinder size. Also, note that in W-3, both part 0
and 1 end up with an odd number of sectors. It seems that they simply
decided to completely break away from the traditional layout, which is
understandable given that there really isn't one good solution which
can cover all the cases and that the default larger alignment benefits
earlier SSDs.

Windows Vista basically shows the same behavior. Vista was tested by
creating two partitions using the management tool. Test data is
available at [7].

 *-alignment_offset    : alignment_offset reported by Linux kernel
 *-fdisk               : fdisk -l output
 *-fdisk-u             : fdisk -lu output
 *-hdparm              : hdparm -I output
 *-mbr                 : dump of mbr
 *-part                : decoded partition table from mbr

Please note that hdparm is misreporting the alignment offset. It
should be reporting 512 instead of 256 for offset-by-one drives.


So, what now for Linux?
=======================

The situation is not easy. Considering all the factors, the only
workable solution looks like doing what Windows is doing.
Hard drive and SSD vendors are focusing on compatibility and
performance on recent Windows releases and are happy to do things
which break the standard defined mechanism as shown by C-1, so
departing from what Windows does would be unnecessarily painful.

Unfortunately, while Windows can assume that newer releases won't
share the hard drive with older releases including Windows XP, Linux
distros can't do that. There will be many installations where a
modern Linux distro shares a hard drive with older releases of
Windows. At this point, I can't see a silver bullet solution.

Partitioners maybe should only align partitions which will be used by
Linux and default to the traditional layout for others while allowing
explicit override. I think Windows XP wouldn't have a problem with
differently aligned partitions as long as it doesn't actually use them
but I haven't tested it.

Reportedly, commonly used partitioners aren't ready to handle drives
larger than 2 TiB in any configuration and alignment isn't done
properly for drives with 4 KiB physical sectors. 4 KiB logical sector
support is broken in both the kernel and partitioners. (need more
details and probably a whole section on partitioner behaviors)

Unfortunately, the transition to 4 KiB sector size, physical only or
logical too, is looking fairly ugly. Hopefully, a reasonable solution
can be reached in the not too distant future but even with all the
software side updated, it looks like it's gonna cause a significant
amount of confusion and frustration.
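
For illustration, the "do what Windows does" placement observed in W-1
to W-3 (1 MiB unit plus the drive's alignment offset) could be sketched
like this. This is an assumption-laden sketch, not a description of any
existing partitioner: next_aligned_start() is a made-up name and the
1 MiB unit is inferred from the partition tables above.

```python
MIB = 1 << 20  # assumed basic alignment unit, in bytes


def next_aligned_start(min_lba, lss=512, aoff=0, unit=MIB):
    """Smallest LBA >= min_lba of the form (n * unit + aoff) / lss.

    aoff is the alignment offset in bytes and is assumed to be a
    multiple of lss.
    """
    min_byte = min_lba * lss
    # Ceiling of (min_byte - aoff) / unit, clamped at zero.
    n = max(0, -((aoff - min_byte) // unit))
    return (n * unit + aoff) // lss


if __name__ == "__main__":
    # No offset (W-1, W-2): partitions start at 2048 * n.
    print(next_aligned_start(2048))                  # 2048
    # Offset-by-one (W-3): partitions start at 2048 * n + 7.
    print(next_aligned_start(2048, aoff=3584))       # 2055
    print(next_aligned_start(206848, aoff=3584))     # 206855
```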


[1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
[2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
[3] http://en.wikipedia.org/wiki/Master_boot_record
[4] http://support.microsoft.com/kb/931760
[5] http://thread.gmane.org/gmane.linux.kernel/953981
[6] http://en.wikipedia.org/wiki/GUID_Partition_Table
[7] http://userweb.kernel.org/~tj/partalign/

* Mar 04 2009
        Initial draft, Tejun Heo <tj(a)kernel.org>
* Mar 08 2009
        Updated according to comments from Daniel Taylor
        <Daniel.Taylor(a)wdc.com>.  Other minor updates.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Greg Freemyer on 8 Mar 2010 00:40 cc'ing Martin Petersen since I believe he is one of the most knowledgeable kernel hackers on this topic and has been working the issue for the last year. On Sun, Mar 7, 2010 at 10:48 PM, Tejun Heo <tj(a)kernel.org> wrote: > Hello, guys. > > It looks like transition to ATA 4k drives will be quite painful and we > aren't really ready although these drives are already selling widely. > I've written up a summary document on the issue to clarify stuff as > it's getting more and more confusing and develop some consensus. �It's > also on the linux ata wiki. > > �http://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues > > I've cc'd people whom I can think of off the top of my head but I > surely have missed some people who would have been interested. �Please > feel free to add cc's or forward the message to other MLs. > Especially, I don't know much about partitioners so the details there > are pretty shallow and could be plain wrong. �It would be great if > someone who knows more about this stuff can chime in. > > Thanks. > > === Document follows === > > ATA 4 KiB sector issues > > Background > ========== > > Up until recently, all ATA hard drives have been organized in 512 byte > sectors. �For example, my 500 GB or 477 GiB hard drive is organized of > 976773168 512 byte sectors numbered from 0 to 976773167. �This is how > a drive communicates with the driver. �When the operating system wants > to read 32 KiB of data at 1 MiB position, the driver asks the drive to > read 64 sectors from LBA (Logical block address, sector number) 2048. > > Because each sector should be addressable, readable and writable > individually, the physical medium also is organized in the same sized > sectors. �In addition to the area to store the actual data, each > sector requires extra space for book keeping - inter-sector space to > enable locating and addressing each sector and ECC data to detect and > correct inevitable raw data errors. 
> > As the densities and capacities of hard drives keep growing, stronger > ECC becomes necessary to guarantee acceptable level of data integrity > increasing the space overhead. �In addition, in most applications, > hard drives are now accessed in units of at least 8 sectors or 4096 > bytes and maintaining 512 byte granularity has become somewhat > meaningless. > > This reached a point where enlarging the sector size to 4096 bytes > would yield measurably more usable space given the same raw data > storage size and hard drive manufacturers are transitioning to 4 KiB > sectors. > > Anandtech has a good article which illustrates the background and > issues with pretty diagrams[1]. > > > Physical vs. Logical > ==================== > > Because the 512 byte sector size has been around for a very long time > and upto ATA/ATAPI-7 the sector size was fixed at 512 bytes, the > sector size assumption is scattered across all the layers - > controllers or bridge chips snooping commands, BIOSs, boot codes, > drivers, partitioners and system utilities, which makes it very > difficult to change the sector size from 512 byte without breaking > backward compatibility massively. > > As a workaround, the concept of logical sector size was introduced. > The physical medium is organized in 4 KiB sectors but the firmware on > the drive will present it as if the drive is composed of 512 byte > sectors thus making the drive behave as before, so if the driver asks > the hard drive to read 64 sectors from LBA 2048, the firmware will > translate it and read 8 4 KiB sectors from hardware sector 256. �As a > result, the hard drive now has two sector sizes - the physical one > which the physical media is actually organized in, and the logical one > which the firmware presents to the outside world. 
> > A straight forward example mapping between physical sector and LBA > would be > > �LBA = 8 * phys_sect > > > Alignment problem on 4 KiB physical / 512 logical drives > ======================================================= > > This workaround keeps older hardware and software working while > allowing the drive to use larger sector size internally. �However, the > discrepancy between physical and logical sector sizes creates an > alignment issue. �For example, if the driver wants to read 7 sectors > from LBA 2047, the firmware has to read hardware sector 255 and 256 > and trim leading 7*512 bytes and tailing 512 bytes. > > For reads, this isn't an issue as drives read in larger chunks anyway > but for writes, the drive has to do read-modify-write to achieve the > requested action. �It has to first read hardware sector 255 and 256, > update requested parts and then write back those sectors which can > cause significant performance degradation[2]. > > The problem is aggravated by the way DOS partitions[3] have been laid > out traditionally. �For reasons dating back more than two decades, > they are laid out considering something called disk geometry which > nowadays are arbitrary values with a number of restrictions for > backward compatibility accumulated over the years. �The end result is > that until recently (most Linux variants and upto Windows XP) the > first partition ends up on sector 63 and later ones on cylinder > boundaries where each cylinder usually is composed of 255 * 63 > sectors. > > Most modern filesystems generate 4 KiB aligned accesses from the > partition it is in. �If a drive maps 4 KiB physical sectors to 512 > byte logical sectors from LBA0, the filesystem in the first partition > will always be misaligned and filesystems in later partitions are > likely to be misaligned too. 
> > > Solving the alignment problem on 4 KiB physical / 512 logical drives > ==================================================================== > > There are multiple ways which attempt to solve the problem. > > S-1. Yet another workaround from the firmware - offset-by-one. > > �Yet another workaround which can be done by the firmware is to > �offset physical to logical mapping by one logical sector such that > �LBA 63 ends up on physical sector boundary, which aligns the first > �partition to physical sectors without requiring any software update. > �The example mapping between phys_sector and LBA becomes > > � �LBA = 8 * phys_sect - 1 > > �The leading 512 bytes from phys_sect 0 is not used and LBA 0 starts > �from after that point. �phys_sect 1 maps to LBA 7 and phys_sect 8 to > �63, making LBA 63 aligned on hardware sector. > > �Although this aligns only the first partition, for many use cases, > �especially the ones involving older software, this workaround was > �deemed useful and some recent drives with 4 KiB physical sectors are > �equipped with a dip switch to turn on or off offset-by-one mapping. > > S-2. The proper solution. > > �Correct alignments for all partitions can't be achieved by the > �firmware alone. �The system utilities should be informed about the > �alignment requirements and align partitions accordingly. > > �The above firmware workaround complicates the situation because the > �two different configurations require different offsets to achieve > �the correct alignments. �ATA/ATAPI-8 specifies a way for a drive to > �export the physical and logical sector sizes and the LBA offset > �which is aligned to the physical sectors. > > �In Linux, these parameters are exported via the following sysfs > �nodes. 
> > � �physical sector size � � � �: /sys/block/sdX/queue/physical_block_size > � �logical sector size � � � � : /sys/block/sdX/queue/logical_block_size > � �alignment offset � � � � � �: /sys/block/sdX/alignment_offset > > �Let the physical sector size be PSS, logical sector size LSS and > �alignment offset AOFF. �The system software should place partitions > �such that the starting LBAs of all partitions are aligned on > > � �(n * PSS + AOFF) / LSS > > �For 4 KiB physical sector offset-by-one drives, PSS is 4096, LSS 512 > �and AOFF 3584 and with n of 7 the above becomes, > > � �(7 * 4096 + 3584) / 512 == 63 > > �making sector 63 an aligned LBA where the first partition can be > �put, but without the offset-by-one mapping, AOFF is zero and LBA 63 > �is not aligned. > > �With the above new alignment requirement in place, it becomes > �difficult to honor the legacy one - first partition on sector 63 and > �all other partitions on cylinder boundary (255 * 63 sectors) - as > �the two alignment requirements contradict each other. �This might be > �worked around by adjusting how LBA and CHS addresses are mapped but > �the disk geometry parameters are hard coded everywhere and there is > �no reliable way to communicate custom geometry parameters. > > > Complications > ============= > > Unfortunately, there are complications. > > C-1. The standard is not and won't be followed as-is. > > �Some of the existing BIOSs and/or drivers can't cope with drives > �which report 4 KiB physical sector size. �To work around this, some > �drive models lie that its physical sector size is 512 bytes when the > �actual configuration is 4 KiB without offsetting. > > �This nullifies the provisions for alignment in the ATA standard but > �results in the correct alignment for Windows Vista and 7. �OS > �behaviors will be described further later. 
> > �For these drives, which are likely to continue to be shipped for the > �foreseeable future, traditional LBA 63 and cylinder based aligning > �results in misalignment. > > C-2. Windows XP depends on the traditional partition layout. > > �Windows XP makes use of the CHS start/end addresses in the partition > �table and gets confused if partitions are not laid out > �traditionally. �This means that XP can't be installed into a > �partition prepared by later versions of Windows[4]. �This isn't a > �big problem for Windows because in most cases the later version is > �replacing the older one, not the other way around. > > �Unfortunately, the situation is more complex for Linux because Linux > �is often co-installed with various versions of Windows and XP is > �still quite popular. �This means that when a Linux partitioner is > �used to prepare a partition which may be used by Windows, the > �partitioner might have to consider which version of Windows is going > �to be used and whether to align the partitions for the correct > �alignment or compatibility with older versions of Windows. > > C-3. The 2 TiB barrier and the possibility for 4 KiB logical sector size. > > �The DOS partition format uses 32 bit for the starting LBA and the > �number of sectors and, reportedly, 32 bit Windows XP shares the > �limitation. �With 32 bit addressing and 512 byte logical sector > �size, the maximum addressable sector + 1 is at > > � �2^32 * 2^9 == 2^41 == 2 TiB > > �The DOS partition format allows a partition to reach beyond 2 TiB as > �long as the starting LBA is under 2 TiB; however, both Windows XP > �and and the Linux kernel (at least upto v2.6.33) refuse such > �partition configurations. > > �With the right combination of host controller, BIOS and driver, this > �barrier can be overcome by enlarging the logical sector size to 4 > �KiB, which will push the barrier out to 16 TiB. 
�On the right > �configuration, Windows XP is reportedly able to address beyond the 2 > �TiB barrier with a DOS partition and 4 KiB logical sector size. > �Linux kernel upto v2.6.33 doesn't work under such configurations but > �a patch to make it work is pending[5]. > > �This might also be beneficial for operating systems which don't > �suffer from this limitation. �A different partition format - GPT[6] > �- should be used beyond 2^32 sectors, which could harm compatibility > �with older BIOSs or other operating systems which don't recognize > �the new format. > > �As mentioned previously, 512 byte sector assumption has been there > �for a very long time and changing it is likely to cause various > �compatibility problems at many different layers from hardware up to > �the system utilities. > > > Windows > ======= > > As hard drive vendors aim for performance and compatibility in modern > Windows environments, it is worthwhile to investigate how Windows > partitions with different alignment requirements. �Up until Windows > XP, it followed the traditional layout - the first partition on LBA 63 > and the others on cylinder boundaries where a cylinder is defined as > 255 tracks with 63 sectors each. > > Windows Vista and 7 align partitions differently. �As the two behave > similarly, only 7's behavior is shown here. �These partition tables > are created by Windows 7 RC installer on blank disks. > > W-1. 512 byte physical and logical sector drive. 
> > �ST FIRST �T �LAST � LBA � � �NBLKS > �80 202100 07 df130c 00080000 00200300 > �00 df140c 07 feffff 00280300 00689e12 > �00 000000 00 000000 00000000 00000000 > �00 000000 00 000000 00000000 00000000 > > �Part0: � � � �FIRST � C � �0 �H � 32 �S � 33 �: 2048 � � � � �(63 sec/trk) > � � � � � � � �LAST � �C � 12 �H �223 �S � 19 �: 206847 � � � �(255 heads/cyl) > � � � � � � � �LBA � � 2048 + 204800 = 206848 > > �Part1: � � � �FIRST � C � 12 �H �223 �S � 20 �: 206848 > � � � � � � � �LAST � �C 1023 �H �254 �S � 63 �: E > � � � � � � � �LBA � � 206848 + 312371200 = 312578048 > > �Both aligned at (2048 * n). �Part 1 not aligned to cylinder. > > W-2. 4 KiB physical and 512 byte logical sector drive without offset-by-one. > > �ST FIRST �T �LAST � LBA � � �NBLKS > �80 202100 07 df130c 00080000 00200300 > �00 df140c 07 feffff 00280300 00b83f25 > �00 000000 00 000000 00000000 00000000 > �00 000000 00 000000 00000000 00000000 > > �Part0: � � � �FIRST � C � �0 �H � 32 �S � 33 �: 2048 � � � � �(63 sec/trk) > � � � � � � � �LAST � �C � 12 �H �223 �S � 19 �: 206847 � � � �(255 heads/cyl) > � � � � � � � �LBA � � 2048 + 204800 = 206848 > > �Part1: � � � �FIRST � C � 12 �H �223 �S � 20 �: 206848 > � � � � � � � �LAST � �C 1023 �H �254 �S � 63 �: E > � � � � � � � �LBA � � 206848 + 624932864 = 625139712 > > �Both aligned at (2048 * n). �Part 1 not aligned to cylinder. > > W-3. 4 KiB physical and 512 byte logical sector drive with offset-by-one. 
> > �ST FIRST �T �LAST � LBA � � �NBLKS > �80 202800 07 df130c 07080000 f91f0300 > �00 df1b0c 07 feffff 07280300 f9376d74 > �00 000000 00 000000 00000000 00000000 > �00 000000 00 000000 00000000 00000000 > > �Part0: � � � �FIRST � C � �0 �H � 32 �S � 40 �: 2055 � � � � �(63 sec/trk) > � � � � � � � �LAST � �C � 12 �H �223 �S � 19 �: 206847 � � � �(255 heads/cyl) > � � � � � � � �LBA � � 2055 + 204793 = 206848 > > �Part1: � � � �FIRST � C � 12 �H �223 �S � 27 �: 206855 > � � � � � � � �LAST � �C 1023 �H �254 �S � 63 �: E > � � � � � � � �LBA � � 206855 + 1953314809 = 1953521664 > > �Both aligned at (2048 * n + 7). �Part 1 not aligned to cylinder. > > The partitioner seems to be using 1M as the basic alignment unit and > offsetting from there if explicitly requested by the drive and there > is no difference between handling of 512 byte and 4 KiB drives, which > explains why C-1 works for hard drive vendors. > > In all cases, the partitioner ignores both the first partition on LBA > 63 and the others on cylinder boundary requirements while still using > the same 255*63 cylinder size. �Also, note that in W-3, both part 0 > and 1 end up with odd number of sectors. �It seems that they simply > decided to completely break away from the traditional layout, which is > understandable given that there really isn't one good solution which > can cover all the cases and that the default larger alignment benefits > earlier SSDs. > > Windows Vista basically shows the same behavior. �Vista was tested by > creating two partitions using the management tool. �Test data is > available at [7]. > > �*-alignment_offset � �: alignment_offset reported by Linux kernel > �*-fdisk � � � � � � � : fdisk -l output > �*-fdisk-u � � � � � � : fdisk -lu output > �*-hdparm � � � � � � �: hdparm -I output > �*-mbr � � � � � � � � : dump of mbr > �*-part � � � � � � � �: decoded partition table from mbr > > Please note that hdparm is misreporting the alignment offset. 
> It should be reporting 512 instead of 256 for offset-by-one drives.
>
>
> So, what now for Linux?
> =======================
>
> The situation is not easy.  Considering all the factors, the only
> workable solution looks like doing what Windows is doing.  Hard drive
> and SSD vendors are focusing on compatibility and performance on
> recent Windows releases and are happy to do things which break the
> standard defined mechanism as shown by C-1, so departing from what
> Windows does would be unnecessarily painful.
>
> Unfortunately, while Windows can assume that newer releases won't
> share the hard drive with older releases including Windows XP, Linux
> distros can't do that.  There will be many installations where a
> modern Linux distro shares a hard drive with older releases of
> Windows.  At this point, I can't see a silver bullet solution.
>
> Partitioners maybe should only align partitions which will be used by
> Linux and default to the traditional layout for others while allowing
> explicit override.  I think Windows XP wouldn't have a problem with
> differently aligned partitions as long as it doesn't actually use them
> but haven't tested it.
>
> Reportedly, commonly used partitioners aren't ready to handle drives
> larger than 2 TiB in any configuration and alignment isn't done
> properly for drives with 4 KiB physical sectors.  4 KiB logical sector
> support is broken in both the kernel and partitioners.  (need more
> details and probably a whole section on partitioner behaviors)
>
> Unfortunately, the transition to 4 KiB sector size, physical only or
> logical too, is looking fairly ugly.  Hopefully, a reasonable solution
> can be reached in the not too distant future but even with all the
> software side updated, it looks like it's gonna cause a significant
> amount of confusion and frustration.
>
>
> [1] http://www.anandtech.com/storage/showdoc.aspx?i=3691
> [2] http://www.osnews.com/story/22872/Linux_Not_Fully_Prepared_for_4096-Byte_Sector_Hard_Drives
> [3] http://en.wikipedia.org/wiki/Master_boot_record
> [4] http://support.microsoft.com/kb/931760
> [5] http://thread.gmane.org/gmane.linux.kernel/953981
> [6] http://en.wikipedia.org/wiki/GUID_Partition_Table
> [7] http://userweb.kernel.org/~tj/partalign/
>
> * Mar 04 2009
>        Initial draft, Tejun Heo <tj(a)kernel.org>
> * Mar 08 2009
>        Updated according to comments from Daniel Taylor
>        <Daniel.Taylor(a)wdc.com>.  Other minor updates.
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ide" in
> the body of a message to majordomo(a)vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

--
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
Preservation and Forensic processing of Exchange Repositories White Paper -
<http://www.norcrossgroup.com/forms/whitepapers/tng_whitepaper_fpe.html>

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
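For readers following the quoted partition dumps, here is a minimal Python sketch of how one 16-byte MBR entry can be decoded. It assumes the field order shown in the dump header (ST, FIRST, T, LAST, LBA, NBLKS), little-endian LBA and NBLKS fields, and the 255 heads / 63 sectors geometry printed next to the dumps. The helper names (`unpack_chs`, `chs_to_lba`, `decode_entry`) are made up for illustration and do not come from any tool mentioned in the thread.

```python
import struct

HEADS_PER_CYL = 255    # geometry printed next to the dumps
SECTORS_PER_TRK = 63


def unpack_chs(raw):
    """Split a 3-byte CHS field into (cylinder, head, sector)."""
    head = raw[0]
    sector = raw[1] & 0x3F                       # low 6 bits of byte 1
    cylinder = ((raw[1] & 0xC0) << 2) | raw[2]   # top 2 bits of byte 1 + byte 2
    return cylinder, head, sector


def chs_to_lba(cylinder, head, sector):
    """Classic CHS to LBA conversion; sectors are numbered from 1."""
    return (cylinder * HEADS_PER_CYL + head) * SECTORS_PER_TRK + sector - 1


def decode_entry(hex_text):
    """Decode one partition entry written as hex, as in the dumps."""
    raw = bytes.fromhex(hex_text.replace(" ", ""))
    status, first, ptype, last, lba, nblks = struct.unpack("<B3sB3sII", raw)
    return {
        "status": status,
        "type": ptype,
        "first": unpack_chs(first),
        "last": unpack_chs(last),
        "lba": lba,
        "nblks": nblks,
    }


# First entry of the W-3 dump (offset-by-one drive).
part0 = decode_entry("80 202800 07 df130c 07080000 f91f0300")
print(part0["first"], part0["lba"], part0["nblks"])  # (0, 32, 40) 2055 204793
print(chs_to_lba(*part0["first"]))                   # 2055
```

The CHS start converts to the same LBA 2055 that the LBA field holds, which matches the "Part0" lines of the decoded W-3 table.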
From: James Bottomley on 8 Mar 2010 02:10

Just a quick note:

The 2TB size for msdos partitions is a problem independent of the 4k
sector issue.  Traditional 512 byte sector drives are now available in
those sizes.  It looks like we're going to have to move to a new
partitioning label to solve this.

There's actually another barrier at 8 or 16TB, which is where a 4k
logical sector filesystem tops out using 32 bit block offsets (it's 8TB
if the fs hasn't been proof checked against sign extension problems).

However, for 4k sectors, the main issues which have shown up in testing
by others (mostly Martin) are

     1. In native 4k mode, we work perfectly fine.  *However*, most
        BIOSs can't boot native 4k drives.
     2. Even if the BIOS can boot native 4k, our own boot loaders seem
        to be hard coded for 512 byte sectors in several places.
     3. If we run in the 512 byte sector emulation mode, we end up with
        the partition alignment problems you allude to.
     4. The alignment problem is made more complex by drives that make
        use of the offset exponent feature (what you refer to as offset
        by one) ... fortunately very few of these have been seen in the
        wild and we're hopeful they can be shot before they breed.
     5. I'm really, really sorry to have to mention it, but it looks
        like uefi is going to be the only way we can boot non-msdos
        partitioned devices with native 4k sectors.

So the bottom line seems to be that if you want the device as a non boot
disk, use native 4k sectors and a non-msdos partition label.  If you
want to boot from the drive and your bios won't boot 4k natively,
partition everything using the 512 emulation and try to align the
partitions correctly.  If your bios/uefi will boot 4k natively, just use
it and whatever partition label the bios/uefi supports.

Martin can fill in the pieces I've left out.

James
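The 8 or 16TB barrier mentioned above is plain arithmetic on 32 bit block offsets, and a short Python sketch makes it concrete. It assumes 4 KiB filesystem blocks and treats the "sign extension" case as losing one bit of the offset; the function name `max_fs_bytes` is hypothetical, and the sizes come out in binary units (TiB).

```python
TIB = 2 ** 40


def max_fs_bytes(block_size, offset_bits=32, signed=False):
    """Largest byte range reachable with fixed-width block offsets.

    If the offset is effectively signed, one bit is lost to the sign.
    """
    usable_bits = offset_bits - 1 if signed else offset_bits
    return (2 ** usable_bits) * block_size


print(max_fs_bytes(4096) // TIB)               # 16
print(max_fs_bytes(4096, signed=True) // TIB)  # 8
```

So a filesystem that is clean with respect to sign extension reaches 16 TiB with 4 KiB blocks, and one that is not stops at 8 TiB.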
From: H. Peter Anvin on 8 Mar 2010 03:00

On 03/07/2010 11:00 PM, James Bottomley wrote:
> Just a quick note:
>
> The 2TB size for msdos partitions is a problem independent of the 4k
> sector issue.  Traditional 512 byte sector drives are now available in
> those sizes.  It looks like we're going to have to move to a new
> partitioning label to solve this.
>
> There's actually another barrier at 8 or 16TB, which is where a 4k
> logical sector filesystem tops out using 32 bit block offsets (it's 8TB
> if the fs hasn't been proof checked against sign extension problems).
>
> However, for 4k sectors, the main issues which have shown up in testing
> by others (mostly Martin) are
>
>      1. In native 4k mode, we work perfectly fine.  *However*, most
>         BIOSs can't boot native 4k drives.
>      2. Even if the BIOS can boot native 4k, our own boot loaders seem
>         to be hard coded for 512 byte sectors in several places.
>      3. If we run in the 512 byte sector emulation mode, we end up with
>         the partition alignment problems you allude to.
>      4. The alignment problem is made more complex by drives that make
>         use of the offset exponent feature (what you refer to as offset
>         by one) ... fortunately very few of these have been seen in the
>         wild and we're hopeful they can be shot before they breed.
>      5. I'm really, really sorry to have to mention it, but it looks
>         like uefi is going to be the only way we can boot non-msdos
>         partitioned devices with native 4k sectors.
>
> So the bottom line seems to be that if you want the device as a non boot
> disk, use native 4k sectors and a non-msdos partition label.  If you
> want to boot from the drive and your bios won't boot 4k natively,
> partition everything using the 512 emulation and try to align the
> partitions correctly.  If your bios/uefi will boot 4k natively, just use
> it and whatever partition label the bios/uefi supports.
>
> Martin can fill in the pieces I've left out.
>

I would very much like a reference for a platform which has firmware
which can successfully boot from 4K-logical media.  It would be very
useful for bootloader testing.

Aligning partitions is something we should have done long ago.  It
affects RAID and many flash drives just as much or more than
4K-sectored disks.

Legacy BIOS doesn't care at all how the disk is partitioned, so as long
as the BIOS can read the disk at all the rest is up to the bootloader.
Of course, since there hasn't been the opportunity to test, bootloaders
generally don't handle it correctly (early versions of Syslinux
supported any sector size, but that bitrotted, and for the lack of
testing I eventually ended up hard-coding the number.  Now I'd like to
get it working properly.)

As far as partitioning... I believe we should be using GPT partition
tables where possible.  Even on non-EFI systems, it's simply a much
better partition table format.

	-hpa
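On the alignment point, a minimal Python sketch of what "aligned" means here may help. It assumes 512 byte logical and 4 KiB physical sectors, a 1 MiB (2048 sector) alignment unit as seen in the Windows 7 test results of the quoted document, and that an offset-by-one drive has its first physical boundary at LBA 7, which is what the "2048 * n + 7" result in W-3 suggests. The parameter `first_aligned_lba` and both function names are hypothetical.

```python
def is_aligned(start_lba, logical=512, physical=4096, first_aligned_lba=0):
    """True if start_lba falls on a physical sector boundary."""
    lbas_per_phys = physical // logical
    return (start_lba - first_aligned_lba) % lbas_per_phys == 0


def align_up(start_lba, unit_lbas=2048, first_aligned_lba=0):
    """Round start_lba up to the next unit boundary, shifted by the offset."""
    shifted = start_lba - first_aligned_lba
    units = -(-shifted // unit_lbas)   # ceiling division
    return units * unit_lbas + first_aligned_lba


print(is_aligned(63))                         # False: traditional first partition
print(is_aligned(63, first_aligned_lba=7))    # True on an offset-by-one drive
print(align_up(63))                           # 2048
print(align_up(63, first_aligned_lba=7))      # 2055
```

The last two results reproduce the partition starts from the W-2 and W-3 dumps in the quoted document.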
From: H. Peter Anvin on 8 Mar 2010 03:00

On 03/07/2010 11:00 PM, James Bottomley wrote:
>
> The 2TB size for msdos partitions is a problem independent of the 4k
> sector issue.  Traditional 512 byte sector drives are now available in
> those sizes.  It looks like we're going to have to move to a new
> partitioning label to solve this.
>
> There's actually another barrier at 8 or 16TB, which is where a 4k
> logical sector filesystem tops out using 32 bit block offsets (it's 8TB
> if the fs hasn't been proof checked against sign extension problems).
>

The limit for the MS-DOS partition tables is 2^32 sectors.  The patch
that Daniel posted was for a Linux kernel internal limit that set the
limit to 2 TB.

	-hpa
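To put numbers on the 2^32 sector limit, here is a small Python sketch. It assumes the start LBA and sector count in an MS-DOS partition entry are each stored as 32 bit little-endian values, and it only does the arithmetic; whether firmware or a partitioner copes with a given sector size is a separate question. The function names are hypothetical.

```python
import struct

TIB = 2 ** 40
MAX_SECTORS = 2 ** 32   # the limit stated above


def msdos_limit_bytes(logical_sector_size):
    """Capacity addressable by 2^32 sectors of the given logical size."""
    return MAX_SECTORS * logical_sector_size


def fits_in_entry(value):
    """True if value can be stored in a 32 bit little-endian field."""
    try:
        struct.pack("<I", value)
        return True
    except struct.error:
        return False


print(msdos_limit_bytes(512) // TIB)    # 2
print(msdos_limit_bytes(4096) // TIB)   # 16
print(fits_in_entry(2 ** 32 - 1))       # True
print(fits_in_entry(2 ** 32))           # False
```

With 512 byte logical sectors this gives the familiar 2 TiB figure; the same table format on 4 KiB logical sectors would reach 16 TiB.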