From: Stefan Richter on 4 Aug 2010 05:10

(adding Cc: linux-scsi)

Nigel Cunningham wrote:
> I've just given hibernation a go under 2.6.35, and at first I thought
> there was some sort of hang in freezing processes. The computer sat
> there for ages, apparently doing nothing. Switched from TuxOnIce to
> swsusp to see if it was specific to my code but no - the problem was
> there too. I used the nifty new kdb support to get a backtrace, which was:
>
> get_swap_page_of_type
> discard_swap_cluster
> blk_dev_issue_discard
> wait_for_completion
>
> Adding a printk in discard_swap_cluster gives the following:
>
> [  46.758330] Discarding 256 pages from bdev 800003 beginning at page 640377.
> [  47.003363] Discarding 256 pages from bdev 800003 beginning at page 640633.
> [  47.246514] Discarding 256 pages from bdev 800003 beginning at page 640889.
>
> ...
>
> [ 221.877465] Discarding 256 pages from bdev 800003 beginning at page 826745.
> [ 222.121284] Discarding 256 pages from bdev 800003 beginning at page 827001.
> [ 222.365908] Discarding 256 pages from bdev 800003 beginning at page 827257.
> [ 222.610311] Discarding 256 pages from bdev 800003 beginning at page 827513.
>
> So allocating 4GB of swap on my SSD now takes 176 seconds instead of
> virtually no time at all. (This code is completely unchanged from 2.6.34.)
>
> I have a couple of questions:
>
> 1) As far as I can see, there haven't been any changes in mm/swapfile.c
> that would cause this slowdown, so something in the block layer has
> (from my point of view) regressed. Is this a known issue?

Perhaps ATA TRIM is enabled for this SSD in 2.6.35 but not in 2.6.34? Or
the discard code has been changed to issue many moderately sized ATA
TRIMs instead of a single huge one, and the single huge TRIM was much
faster on your particular SSD?

> 2) Why are we calling discard_swap_cluster anyway? The swap was unused
> and we're allocating it. I could understand calling it when freeing
> swap, but when allocating?
At the moment the administrator creates swap space, the kernel can assume
that he no longer has any use for the data that may previously have
existed there. Hence it instructs the SSD's flash translation layer to
return all of these blocks to the list of unused logical blocks, which
then no longer have to be read and backed up whenever another logical
block within the same erase block is written.

However, I am surprised that this is done every time (?) when preparing
for hibernation.
--
Stefan Richter
-=====-==-=- =--- --=--
http://arcgraph.de/sr/
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
From: Nigel Cunningham on 4 Aug 2010 05:20

Hi.

On 04/08/10 18:59, Stefan Richter wrote:
> (adding Cc: linux-scsi)
>
> Nigel Cunningham wrote:
>> I've just given hibernation a go under 2.6.35, and at first I thought
>> there was some sort of hang in freezing processes. The computer sat
>> there for ages, apparently doing nothing. Switched from TuxOnIce to
>> swsusp to see if it was specific to my code but no - the problem was
>> there too. I used the nifty new kdb support to get a backtrace, which was:
>>
>> get_swap_page_of_type
>> discard_swap_cluster
>> blk_dev_issue_discard
>> wait_for_completion
>>
>> Adding a printk in discard_swap_cluster gives the following:
>>
>> [  46.758330] Discarding 256 pages from bdev 800003 beginning at page 640377.
>> [  47.003363] Discarding 256 pages from bdev 800003 beginning at page 640633.
>> [  47.246514] Discarding 256 pages from bdev 800003 beginning at page 640889.
>>
>> ...
>>
>> [ 221.877465] Discarding 256 pages from bdev 800003 beginning at page 826745.
>> [ 222.121284] Discarding 256 pages from bdev 800003 beginning at page 827001.
>> [ 222.365908] Discarding 256 pages from bdev 800003 beginning at page 827257.
>> [ 222.610311] Discarding 256 pages from bdev 800003 beginning at page 827513.
>>
>> So allocating 4GB of swap on my SSD now takes 176 seconds instead of
>> virtually no time at all. (This code is completely unchanged from 2.6.34.)
>>
>> I have a couple of questions:
>>
>> 1) As far as I can see, there haven't been any changes in mm/swapfile.c
>> that would cause this slowdown, so something in the block layer has
>> (from my point of view) regressed. Is this a known issue?
>
> Perhaps ATA TRIM is enabled for this SSD in 2.6.35 but not in 2.6.34?
> Or the discard code has been changed to issue many moderately sized ATA
> TRIMs instead of a single huge one, and the single huge TRIM was much
> faster on your particular SSD?

Mmmm. Wonder how I tell. Something in dmesg or hdparm -I?
ata3.00: ATA-8: ARSSD56GBP, 1916, max UDMA/133
ata3.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
ata3.00: configured for UDMA/133
scsi 2:0:0:0: Direct-Access     ATA      ARSSD56GBP       1916 PQ: 0 ANSI: 5
sd 2:0:0:0: Attached scsi generic sg1 type 0
sd 2:0:0:0: [sda] 500118192 512-byte logical blocks: (256 GB/238 GiB)
sd 2:0:0:0: [sda] Write Protect is off
sd 2:0:0:0: [sda] Mode Sense: 00 3a 00 00
sd 2:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
 sda: sda1 sda2 sda3 sda4
sd 2:0:0:0: [sda] Attached SCSI disk

/dev/sda:

ATA device, with non-removable media
        Model Number:       ARSSD56GBP
        Serial Number:      DC2210200F1B40015
        Firmware Revision:  1916
Standards:
        Supported: 8 7 6 5
        Likely used: 8
Configuration:
        Logical         max     current
        cylinders       16383   16383
        heads           16      16
        sectors/track   63      63
        --
        CHS current addressable sectors:   16514064
        LBA    user addressable sectors:  268435455
        LBA48  user addressable sectors:  500118192
        Logical  Sector size:                   512 bytes
        Physical Sector size:                   512 bytes
        device size with M = 1024*1024:      244198 MBytes
        device size with M = 1000*1000:      256060 MBytes (256 GB)
        cache/buffer size  = unknown
        Nominal Media Rotation Rate: Solid State Device
Capabilities:
        LBA, IORDY(can be disabled)
        Queue depth: 32
        Standby timer values: spec'd by Standard, no device specific minimum
        R/W multiple sector transfer: Max = 1   Current = 1
        DMA: mdma0 mdma1 mdma2 udma0 udma1 udma2 udma3 udma4 udma5 *udma6
             Cycle time: min=120ns recommended=120ns
        PIO: pio0 pio1 pio2 pio3 pio4
             Cycle time: no flow control=120ns  IORDY flow control=120ns
Commands/features:
        Enabled Supported:
           *    SMART feature set
                Security Mode feature set
           *    Power Management feature set
           *    Write cache
           *    Look-ahead
           *    Host Protected Area feature set
           *    WRITE_BUFFER command
           *    READ_BUFFER command
           *    DOWNLOAD_MICROCODE
                SET_MAX security extension
           *    48-bit Address feature set
           *    Device Configuration Overlay feature set
           *    Mandatory FLUSH_CACHE
           *    FLUSH_CACHE_EXT
           *    SMART self-test
           *    General Purpose Logging feature set
           *    Gen1 signaling speed (1.5Gb/s)
           *    Gen2 signaling speed (3.0Gb/s)
           *    Native Command Queueing (NCQ)
           *    Phy event counters
           *    DMA Setup Auto-Activate optimization
                Device-initiated interface power management
           *    Software settings preservation
           *    Data Set Management TRIM supported
Security:
                supported
        not     enabled
        not     locked
                frozen
        not     expired: security count
        not     supported: enhanced erase
Checksum: correct

>> 2) Why are we calling discard_swap_cluster anyway? The swap was unused
>> and we're allocating it. I could understand calling it when freeing
>> swap, but when allocating?
>
> At the moment the administrator creates swap space, the kernel can
> assume that he no longer has any use for the data that may previously
> have existed there. Hence it instructs the SSD's flash translation
> layer to return all of these blocks to the list of unused logical
> blocks, which then no longer have to be read and backed up whenever
> another logical block within the same erase block is written.
>
> However, I am surprised that this is done every time (?) when preparing
> for hibernation.

It's not hibernation per se. The discard code is called from a few
places in swapfile.c in (afaict from a quick scan) both the swap
allocation and free paths.

Regards,

Nigel
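To answer the "how do I tell" question above: the `hdparm -I` listing itself carries the answer, in the "Data Set Management TRIM supported" line. A small sketch of a check that reads that output on stdin (device name in the comment is an example; requires root in real use):

```shell
# check_trim: read `hdparm -I` output on stdin and report whether the
# drive advertises ATA TRIM ("Data Set Management TRIM supported").
check_trim() {
    if grep -qi 'TRIM supported'; then
        echo "TRIM supported"
    else
        echo "no TRIM"
    fi
}

# Typical use (needs root):
#   hdparm -I /dev/sda | check_trim
```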
From: Mark Lord on 4 Aug 2010 09:00

Looks to me like more and more things are using the block discard
functionality, and as predicted it is slowing things down enormously.

The problem is that we still only discard tiny bits (a single range
still??) per TRIM command, rather than batching larger ranges and larger
numbers of ranges into single TRIM commands.

That's a very poor implementation, especially when things start enabling
it by default, e.g. the swap code, mke2fs, etc.

Ugh.
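For context on the batching Mark describes: an ATA DATA SET MANAGEMENT (TRIM) command carries a data payload of 8-byte little-endian range entries, each holding a 48-bit starting LBA in the low bits and a 16-bit sector count in the high bits, with up to 64 entries per 512-byte payload block. A sketch of that packing; the helper names are illustrative, not kernel API:

```c
/* Sketch: pack discard ranges into ATA DATA SET MANAGEMENT (TRIM)
 * range entries.  Each entry is 8 bytes: bits 47:0 = starting LBA,
 * bits 63:48 = sector count.  Batching means filling many entries
 * (up to 64 per 512-byte payload block) before issuing one command,
 * instead of one TRIM per small range. */
#include <stdint.h>
#include <stddef.h>

#define TRIM_LBA_MASK   ((1ULL << 48) - 1)
#define TRIM_MAX_COUNT  0xFFFFULL          /* max sectors per entry */

/* Encode one (lba, nsectors) range as a single TRIM entry. */
static uint64_t trim_entry(uint64_t lba, uint16_t nsectors)
{
    return (lba & TRIM_LBA_MASK) | ((uint64_t)nsectors << 48);
}

/* Pack a possibly large range into up to `max` entries, splitting
 * counts above 65535 sectors.  Returns the number of entries used. */
static size_t trim_pack(uint64_t buf[], size_t max,
                        uint64_t lba, uint64_t nsectors)
{
    size_t n = 0;
    while (nsectors > 0 && n < max) {
        uint64_t chunk = nsectors > TRIM_MAX_COUNT ? TRIM_MAX_COUNT
                                                   : nsectors;
        buf[n++] = trim_entry(lba, (uint16_t)chunk);
        lba += chunk;
        nsectors -= chunk;
    }
    return n;
}
```

With 64 entries of up to 65535 sectors each, a single TRIM can cover roughly 2 GiB of 512-byte sectors, which is why one command per 256-page cluster, as in the trace above, is so much slower than batching.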
From: Nigel Cunningham on 4 Aug 2010 17:30

Hi.

On 04/08/10 22:44, Mark Lord wrote:
> Looks to me like more and more things are using the block discard
> functionality, and as predicted it is slowing things down enormously.
>
> The problem is that we still only discard tiny bits (a single range
> still??) per TRIM command, rather than batching larger ranges and
> larger numbers of ranges into single TRIM commands.
>
> That's a very poor implementation, especially when things start
> enabling it by default, e.g. the swap code, mke2fs, etc.
>
> Ugh.

I was hoping for a nice quick and simple answer. Since I haven't got
one, I'll try to find time to do a git bisect. I think I'll also look at
the swap code more carefully and see if it's doing the sensible thing. I
can't (at the moment) see the logic behind calling discard when
allocating swap. Discarding at freeing time makes much more sense to me.

Regards,

Nigel
From: Christoph Hellwig on 13 Aug 2010 08:00
On Fri, Aug 06, 2010 at 03:07:25PM -0700, Hugh Dickins wrote:
> If REQ_SOFTBARRIER means that the device is still free to reorder a
> write, which was issued after discard completion was reported, before
> the discard (so later discarding the data written), then certainly I
> agree with Christoph (now Cc'ed) that the REQ_HARDBARRIER is
> unavoidable there; but if not, then it's not needed for the swap case.
> I hope to gain a little more enlightenment on such barriers shortly.

REQ_SOFTBARRIER is indeed purely a reordering barrier inside the block
elevator.

> What does seem over the top to me, is for mm/swapfile.c's
> blkdev_issue_discard()s to be asking for both BLKDEV_IFL_WAIT and
> BLKDEV_IFL_BARRIER: those swap discards were originally written just
> to use barriers, without needing to wait for completion in there. I'd
> be interested to hear if cutting out the BLKDEV_IFL_WAITs makes the
> swap discards behave acceptably again for you - but understand that
> you won't have a chance to try that until later next week.

That does indeed look incorrect to me. Any kind of explicit wait usually
means the caller provides the ordering. Getting rid of BLKDEV_IFL_BARRIER
in the swap code ASAP would indeed be beneficial, given that we are
trying to get rid of hard barriers completely soon.

Auditing the existing blkdev_issue_discard callers in filesystems is
high on my todo list.