From: Stefan Richter on 4 Aug 2010 05:10

(adding Cc: linux-scsi)

Nigel Cunningham wrote:
> I've just given hibernation a go under 2.6.35, and at first I thought
> there was some sort of hang in freezing processes. The computer sat
> there for ages, apparently doing nothing. Switched from TuxOnIce to
> swsusp to see if it was specific to my code but no - the problem was
> there too. I used the nifty new kdb support to get a backtrace, which was:
>
> get_swap_page_of_type
> discard_swap_cluster
> blk_dev_issue_discard
> wait_for_completion
>
> Adding a printk in discard_swap_cluster gives the following:
>
> [  46.758330] Discarding 256 pages from bdev 800003 beginning at page 640377.
> [  47.003363] Discarding 256 pages from bdev 800003 beginning at page 640633.
> [  47.246514] Discarding 256 pages from bdev 800003 beginning at page 640889.
>
> ...
>
> [ 221.877465] Discarding 256 pages from bdev 800003 beginning at page 826745.
> [ 222.121284] Discarding 256 pages from bdev 800003 beginning at page 827001.
> [ 222.365908] Discarding 256 pages from bdev 800003 beginning at page 827257.
> [ 222.610311] Discarding 256 pages from bdev 800003 beginning at page 827513.
>
> So allocating 4GB of swap on my SSD now takes 176 seconds instead of
> virtually no time at all. (This code is completely unchanged from 2.6.34.)
>
> I have a couple of questions:
>
> 1) As far as I can see, there haven't been any changes in mm/swapfile.c
> that would cause this slowdown, so something in the block layer has
> (from my point of view) regressed. Is this a known issue?

Perhaps ATA TRIM is enabled for this SSD in 2.6.35 but not in 2.6.34? Or
the discard code has been changed to issue many moderately sized ATA
TRIMs instead of a single huge one, and the single huge TRIM was much
faster on your particular SSD?

> 2) Why are we calling discard_swap_cluster anyway? The swap was unused
> and we're allocating it. I could understand calling it when freeing
> swap, but when allocating?
At the moment the administrator creates swap space, the kernel can assume
that he no longer has any use for the data that may previously have
existed there. Hence it instructs the SSD's flash translation layer to
return all of these blocks to the list of unused logical blocks, which
then no longer have to be read and backed up whenever another logical
block within the same erase block is written.

However, I am surprised that this is done every time (?) when preparing
for hibernation.
--
Stefan Richter
-=====-==-=- =--- --=--
http://arcgraph.de/sr/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nigel Cunningham on 4 Aug 2010 05:20

Hi.

On 04/08/10 18:59, Stefan Richter wrote:
> (adding Cc: linux-scsi)
>
> Nigel Cunningham wrote:
>> I've just given hibernation a go under 2.6.35, and at first I thought
>> there was some sort of hang in freezing processes. The computer sat
>> there for ages, apparently doing nothing. Switched from TuxOnIce to
>> swsusp to see if it was specific to my code but no - the problem was
>> there too. I used the nifty new kdb support to get a backtrace, which was:
>>
>> get_swap_page_of_type
>> discard_swap_cluster
>> blk_dev_issue_discard
>> wait_for_completion
>>
>> Adding a printk in discard_swap_cluster gives the following:
>>
>> [  46.758330] Discarding 256 pages from bdev 800003 beginning at page 640377.
>> [  47.003363] Discarding 256 pages from bdev 800003 beginning at page 640633.
>> [  47.246514] Discarding 256 pages from bdev 800003 beginning at page 640889.
>>
>> ...
>>
>> [ 221.877465] Discarding 256 pages from bdev 800003 beginning at page 826745.
>> [ 222.121284] Discarding 256 pages from bdev 800003 beginning at page 827001.
>> [ 222.365908] Discarding 256 pages from bdev 800003 beginning at page 827257.
>> [ 222.610311] Discarding 256 pages from bdev 800003 beginning at page 827513.
>>
>> So allocating 4GB of swap on my SSD now takes 176 seconds instead of
>> virtually no time at all. (This code is completely unchanged from 2.6.34.)
>>
>> I have a couple of questions:
>>
>> 1) As far as I can see, there haven't been any changes in mm/swapfile.c
>> that would cause this slowdown, so something in the block layer has
>> (from my point of view) regressed. Is this a known issue?
>
> Perhaps ATA TRIM is enabled for this SSD in 2.6.35 but not in 2.6.34?
> Or the discard code has been changed to issue many moderately sized ATA
> TRIMs instead of a single huge one, and the single huge TRIM was much
> faster on your particular SSD?

Mmmm. Wonder how I tell. Something in dmesg or hdparm -I?
ata3.00: ATA-8: ARSSD56GBP, 1916, max UDMA/133
ata3.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
ata3.00: configured for UDMA/133
scsi 2:0:0:0: Direct-Access     ATA      ARSSD56GBP       1916 PQ: 0 ANSI: 5
sd 2:0:0:0: Attached scsi generic sg1 type 0
sd 2:0:0:0: [sda] 500118192 512-byte logical blocks: (256 GB/238 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 sda3 sda4
sd 2:0:0:0: [sda] Attached SCSI disk

/dev/sda:

ATA device, with non-removable media
        Model Number:       ARSSD56GBP
        Serial Number:      DC2210200F1B40015
        Firmware Revision:  1916
Standards:
        Supported: 8 7 6 5
        Likely used: 8
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors:  500118192
        Logical  Sector size:                   512 bytes
        Physical Sector size:                   512 bytes
        device size with M = 1024*1024:      244198 MBytes
        device size with M = 1000*1000:      256060 MBytes (256 GB)
        cache/buffer size  = unknown
        Nominal Media Rotation Rate: Solid State Device
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 1   Current = 1
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    DOWNLOAD_MICROCODE
                SET_MAX security extension
           *    48-bit Address feature set
           *    Device Configuration Overlay feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART self-test
           *    General Purpose Logging feature set
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    Native Command Queueing (NCQ)
           *    Phy event counters
           *    DMA Setup Auto-Activate optimization
                Device-initiated interface power management
           *    Software settings preservation
           *    Data Set Management TRIM supported
Security:
                supported
        not     enabled
        not     locked
                frozen
        not     expired: security count
        not     supported: enhanced erase
Checksum: correct

>> 2) Why are we calling discard_swap_cluster anyway? The swap was unused
>> and we're allocating it. I could understand calling it when freeing
>> swap, but when allocating?
>
> At the moment the administrator creates swap space, the kernel can
> assume that he no longer has any use for the data that may previously
> have existed there. Hence it instructs the SSD's flash translation
> layer to return all of these blocks to the list of unused logical
> blocks, which then no longer have to be read and backed up whenever
> another logical block within the same erase block is written.
>
> However, I am surprised that this is done every time (?) when preparing
> for hibernation.

It's not hibernation per se. The discard code is called from a few
places in swapfile.c in (afaict from a quick scan) both the swap
allocation and free paths.

Regards,

Nigel
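To answer the "how do I tell" question above: the `hdparm -I` listing itself carries the answer, in the "Data Set Management TRIM supported" line. A small sketch of a check that reads that output on stdin (device name in the comment is an example; requires root in real use):

```shell
# check_trim: read `hdparm -I` output on stdin and report whether the
# drive advertises ATA TRIM ("Data Set Management TRIM supported").
check_trim() {
    if grep -qi 'TRIM supported'; then
        echo "TRIM supported"
    else
        echo "no TRIM"
    fi
}

# Typical use (needs root):
#   hdparm -I /dev/sda | check_trim
```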
From: Mark Lord on 4 Aug 2010 09:00

Looks to me like more and more things are using the block discard
functionality, and as predicted it is slowing things down enormously.

The problem is that we still only discard tiny bits (a single range
still??) per TRIM command, rather than batching larger ranges and larger
numbers of ranges into single TRIM commands.

That's a very poor implementation, especially when things start enabling
it by default, e.g. the swap code, mke2fs, etc.

Ugh.
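For context on the batching Mark describes: an ATA DATA SET MANAGEMENT (TRIM) command carries a data payload of 8-byte little-endian range entries, each holding a 48-bit starting LBA in the low bits and a 16-bit sector count in the high bits, with up to 64 entries per 512-byte payload block. A sketch of that packing; the helper names are illustrative, not kernel API:

```c
/* Sketch: pack discard ranges into ATA DATA SET MANAGEMENT (TRIM)
 * range entries.  Each entry is 8 bytes: bits 47:0 = starting LBA,
 * bits 63:48 = sector count.  Batching means filling many entries
 * (up to 64 per 512-byte payload block) before issuing one command,
 * instead of one TRIM per small range. */
#include <stdint.h>
#include <stddef.h>

#define TRIM_LBA_MASK   ((1ULL << 48) - 1)
#define TRIM_MAX_COUNT  0xFFFFULL          /* max sectors per entry */

/* Encode one (lba, nsectors) range as a single TRIM entry. */
static uint64_t trim_entry(uint64_t lba, uint16_t nsectors)
{
    return (lba & TRIM_LBA_MASK) | ((uint64_t)nsectors << 48);
}

/* Pack a possibly large range into up to `max` entries, splitting
 * counts above 65535 sectors.  Returns the number of entries used. */
static size_t trim_pack(uint64_t buf[], size_t max,
                        uint64_t lba, uint64_t nsectors)
{
    size_t n = 0;
    while (nsectors > 0 && n < max) {
        uint64_t chunk = nsectors > TRIM_MAX_COUNT ? TRIM_MAX_COUNT
                                                   : nsectors;
        buf[n++] = trim_entry(lba, (uint16_t)chunk);
        lba += chunk;
        nsectors -= chunk;
    }
    return n;
}
```

With 64 entries of up to 65535 sectors each, a single TRIM can cover roughly 2 GiB of 512-byte sectors, which is why one command per 256-page cluster, as in the trace above, is so much slower than batching.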
From: Nigel Cunningham on 4 Aug 2010 17:30

Hi.

On 04/08/10 22:44, Mark Lord wrote:
> Looks to me like more and more things are using the block discard
> functionality, and as predicted it is slowing things down enormously.
>
> The problem is that we still only discard tiny bits (a single range
> still??) per TRIM command, rather than batching larger ranges and
> larger numbers of ranges into single TRIM commands.
>
> That's a very poor implementation, especially when things start
> enabling it by default, e.g. the swap code, mke2fs, etc.
>
> Ugh.

I was hoping for a nice quick and simple answer. Since I haven't got
one, I'll try to find time to do a git bisect. I think I'll also look at
the swap code more carefully and see if it's doing the sensible thing. I
can't (at the moment) see the logic behind calling discard when
allocating swap. Discarding at freeing time makes much more sense to me.

Regards,

Nigel
From: Christoph Hellwig on 13 Aug 2010 08:00
On Fri, Aug 06, 2010 at 03:07:25PM -0700, Hugh Dickins wrote:
> If REQ_SOFTBARRIER means that the device is still free to reorder a
> write, which was issued after discard completion was reported, before
> the discard (so later discarding the data written), then certainly I
> agree with Christoph (now Cc'ed) that the REQ_HARDBARRIER is
> unavoidable there; but if not, then it's not needed for the swap case.
> I hope to gain a little more enlightenment on such barriers shortly.

REQ_SOFTBARRIER is indeed purely a reordering barrier inside the block
elevator.

> What does seem over the top to me, is for mm/swapfile.c's
> blkdev_issue_discard()s to be asking for both BLKDEV_IFL_WAIT and
> BLKDEV_IFL_BARRIER: those swap discards were originally written just
> to use barriers, without needing to wait for completion in there. I'd
> be interested to hear if cutting out the BLKDEV_IFL_WAITs makes the
> swap discards behave acceptably again for you - but understand that
> you won't have a chance to try that until later next week.

That does indeed look incorrect to me. Any kind of explicit wait usually
means the caller provides the ordering. Getting rid of BLKDEV_IFL_BARRIER
in the swap code ASAP would indeed be beneficial, given that we are
trying to get rid of hard barriers completely soon.

Auditing the existing blkdev_issue_discard callers in filesystems is
high on my todo list.