ocfs2: Zero the tail cluster when extending past i_size v2


From: Tao Ma on 5 Jul 2010 00:00

Hi Joel,

On 07/04/2010 05:32 AM, Joel Becker wrote:
> Here's the second version of my corruption fix. It fixes two
> bugs:
>
> 1) i_size can obviously be at a place that is a hole, so don't BUG on
> that.
> 2) Fix an off-by-one when checking whether the write position is within
> the tail allocation.
>
> This version passes my tail corruption test as well as the kernel
> compile that exposed the two bugs above.
>
> Joel
>
> ---------------------------------------------------------------
>
> ocfs2's allocation unit is the cluster. This can be larger than a block
> or even a memory page. This means that a file may have many blocks in
> its last extent that are beyond the block containing i_size.
>
> When ocfs2 grows a file, it zeros the entire cluster in order to ensure
> future i_size growth will see cleared blocks. Unfortunately,
> block_write_full_page() drops the pages past i_size. This means that
> ocfs2 is actually leaking garbage data into the tail end of that last
> cluster.
>
> We adjust ocfs2_write_begin_nolock() and ocfs2_extend_file() to detect
> when a write or truncate is past i_size. If there is any existing
> allocation between the block containing the current i_size and the
> location of the write or truncate, zeros will be written to that
> allocation.
>
> This is only for sparse filesystems. Non-sparse filesystems already get
> this via ocfs2_extend_no_holes().
>
> Signed-off-by: Joel Becker <joel.becker(a)oracle.com>
> ---
> fs/ocfs2/aops.c | 22 ++++----
> fs/ocfs2/file.c | 154 +++++++++++++++++++++++++++++++++++++++++++++++++------
> fs/ocfs2/file.h | 2 +
> 3 files changed, 150 insertions(+), 28 deletions(-)
>
<snip>
> diff --git a/fs/ocfs2/file.c b/fs/ocfs2/file.c
> index 6a13ea6..7fca78d 100644
> --- a/fs/ocfs2/file.c
> +++ b/fs/ocfs2/file.c
> @@ -848,6 +848,137 @@ out:
> return ret;
> }
>
> +/*
> + * This function is a helper for ocfs2_zero_tail(). It calculates
> + * what blocks need zeroing and does any CoW necessary.
> + */
> +static int ocfs2_zero_tail_prepare(struct inode *inode,
> + struct buffer_head *di_bh,
> + loff_t pos, u64 *start_blkno,
> + u64 *blocks)
> +{
> + int rc = 0;
> + struct ocfs2_super *osb = OCFS2_SB(inode->i_sb);
> + u32 tail_cpos, pos_cpos, p_cpos;
> + u64 tail_blkno, pos_blkno, blocks_to_zero;
> + unsigned int num_clusters = 0;
> + unsigned int ext_flags = 0;
> +
> + /*
> + * The block containing i_size has already been zeroed, so our tail
> + * block is the first block after i_size. The block containing
> + * pos will be zeroed. So we only need to do anything if
> + * tail_blkno is before pos_blkno.
> + */
> + tail_blkno = (i_size_read(inode) >> inode->i_sb->s_blocksize_bits) + 1;
> + pos_blkno = pos >> inode->i_sb->s_blocksize_bits;
> + mlog(0, "tail_blkno = %llu, pos_blkno = %llu\n",
> + (unsigned long long)tail_blkno, (unsigned long long)pos_blkno);
> + if (pos_blkno <= tail_blkno)
> + goto out;
> + blocks_to_zero = pos_blkno - tail_blkno;
> +
> + /*
> + * If tail_blkno is in the cluster past i_size, we don't need
> + * to touch the cluster containing i_size at all.
> + */
> + tail_cpos = i_size_read(inode) >> osb->s_clustersize_bits;
> + if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno) > tail_cpos)
> + tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
> + tail_blkno);
Can we always set tail_cpos in one line?
tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)?
tail_cpos is either the same cluster as i_size or the next cluster, and
both cases work for tail_blkno, I guess?
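
To make the question concrete, here is a minimal userspace sketch that compares the two-step form from the patch with the one-line form. It is not kernel code: the 4 KiB block and 32 KiB cluster sizes are hypothetical, and blocks_to_clusters() is a stand-in that assumes ocfs2_blocks_to_clusters() truncates (a plain right shift). Under that assumption the two forms agree for every i_size tried.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry for illustration: 4 KiB blocks, 32 KiB clusters. */
enum { BLOCKSIZE_BITS = 12, CLUSTERSIZE_BITS = 15 };

/* Stand-in for ocfs2_blocks_to_clusters(); assumed to truncate. */
static uint32_t blocks_to_clusters(uint64_t blkno)
{
	assert(CLUSTERSIZE_BITS >= BLOCKSIZE_BITS);
	return (uint32_t)(blkno >> (CLUSTERSIZE_BITS - BLOCKSIZE_BITS));
}

/* First block after the block containing i_size, as in the patch. */
static uint64_t tail_blkno_of(uint64_t i_size)
{
	return (i_size >> BLOCKSIZE_BITS) + 1;
}

/* tail_cpos as the patch computes it: start at i_size's cluster, then bump. */
static uint32_t tail_cpos_patch(uint64_t i_size)
{
	uint32_t tail_cpos = (uint32_t)(i_size >> CLUSTERSIZE_BITS);
	uint64_t tail_blkno = tail_blkno_of(i_size);

	if (blocks_to_clusters(tail_blkno) > tail_cpos)
		tail_cpos = blocks_to_clusters(tail_blkno);
	return tail_cpos;
}

/* tail_cpos as suggested in the review: derive it from tail_blkno directly. */
static uint32_t tail_cpos_one_line(uint64_t i_size)
{
	return blocks_to_clusters(tail_blkno_of(i_size));
}

/* Returns 1 if both forms agree for every i_size in [0, limit). */
static int tail_cpos_forms_agree(uint64_t limit)
{
	uint64_t i_size;

	for (i_size = 0; i_size < limit; i_size++)
		if (tail_cpos_patch(i_size) != tail_cpos_one_line(i_size))
			return 0;
	return 1;
}
```

The agreement follows from the shift being monotonic: the cluster of block b + 1 is never smaller than the cluster of block b, so the "take the larger one" step always ends up with the cluster of tail_blkno.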
> +
> + rc = ocfs2_get_clusters(inode, tail_cpos, &p_cpos, &num_clusters,
> + &ext_flags);
> + if (rc) {
> + mlog_errno(rc);
> + goto out;
> + }
> +
> + /* Is there a cluster to zero? */
> + if (!p_cpos)
> + goto out;
For an unwritten extent, do we also need to clear the pages? If so, the
solution is incomplete when we have two unwritten extents, one containing
i_size and one past i_size. Here we only clear the pages for the first
unwritten extent and leave the second one untouched.
> +
> + pos_cpos = pos >> osb->s_clustersize_bits;
> + mlog(0, "tail_cpos = %u, num_clusters = %u, pos_cpos = %u, tail_blkno = %llu, pos_blkno = %llu\n",
> + (unsigned int)tail_cpos, (unsigned int)num_clusters,
> + (unsigned int)pos_cpos, (unsigned long long)tail_blkno,
> + (unsigned long long)pos_blkno);
The code from here to the CoW call is a bit hard to understand. In the
'if' branch num_clusters is set for CoW, and in the 'else' branch
blocks_to_zero is set, so it isn't easy for the reader to tell why the two
branches set different values. How about my code below? I think it looks
more straightforward.
> + if ((tail_cpos + num_clusters) > pos_cpos) {
> + num_clusters = pos_cpos - tail_cpos;
> + if (pos_blkno >
> + ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
> + num_clusters += 1;
> + } else {
> + blocks_to_zero =
> + ocfs2_clusters_to_blocks(inode->i_sb,
> + tail_cpos + num_clusters);
> + blocks_to_zero -= tail_blkno;
> + }
> +
> + /* Now CoW the clusters we're about to zero */
> + if (ext_flags & OCFS2_EXT_REFCOUNTED) {
> + rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
> + num_clusters, UINT_MAX);
> + if (rc) {
> + mlog_errno(rc);
> + goto out;
> + }
> + }
	/* Decrease blocks_to_zero if there is some hole after extent */
	if (tail_cpos + num_clusters <= pos_cpos) {
		blocks_to_zero =
			ocfs2_clusters_to_blocks(inode->i_sb,
						 tail_cpos + num_clusters);
		blocks_to_zero -= tail_blkno;
	}

	/* Now CoW if we have some refcounted clusters. */
	if (ext_flags & OCFS2_EXT_REFCOUNTED) {
		/*
		 * We add one more cluster here since it will be
		 * written shortly and if the pos_blkno isn't aligned
		 * to the cluster size, we have to zero the blocks
		 * before it.
		 */
		if (tail_cpos + num_clusters > pos_cpos)
			num_clusters = pos_cpos - tail_cpos + 1;

		rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
					num_clusters, UINT_MAX);
		if (rc) {
			mlog_errno(rc);
			goto out;
		}
	}
> +
> + *start_blkno = tail_blkno;
> + *blocks = blocks_to_zero;
> + mlog(0, "start_blkno = %llu, blocks = %llu\n",
> + (unsigned long long)(*start_blkno),
> + (unsigned long long)(*blocks));
> +
> +out:
> + return rc;
> +}

Regards,
Tao
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo(a)vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/

From: Joel Becker on 6 Jul 2010 03:30

On Mon, Jul 05, 2010 at 11:51:44AM +0800, Tao Ma wrote:
> >+ /*
> >+ * If tail_blkno is in the cluster past i_size, we don't need
> >+ * to touch the cluster containing i_size at all.
> >+ */
> >+ tail_cpos = i_size_read(inode) >> osb->s_clustersize_bits;
> >+ if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno) > tail_cpos)
> >+ tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
> >+ tail_blkno);
> Can we always set tail_cpos in one line?
> tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)?
> tail_cpos is either the same cluster as i_size or the next cluster
> and both works for tail_blkno I guess?

I had the same thought on Friday, but the current version passes
testing and I was wary of changing that.

> >+ /* Is there a cluster to zero? */
> >+ if (!p_cpos)
> >+ goto out;
> For unwritten extent, we also need to clear the pages? If yes, the
> solution doesn't complete if we have 2 unwritten extent, one
> contains i_size while one passes i_size. Here we only clear the
> pages for the 1st unwritten extent and leave the 2nd one untouched.

We probably don't need to zero unwritten extents. We cannot
have an extent past i_size, can we?

> From here to the call of CoW is a bit hard to understand. In 'if',
> num_clusters is set for CoW and in 'else', blocks_to_zero is set. So
> it isn't easy for the reader to tell why these 2 clauses are setting
> different values. So how about my code below? It looks more
> straightforward I think.
> >+ if ((tail_cpos + num_clusters) > pos_cpos) {
> >+ num_clusters = pos_cpos - tail_cpos;
> >+ if (pos_blkno >
> >+ ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
> >+ num_clusters += 1;
> >+ } else {
> >+ blocks_to_zero =
> >+ ocfs2_clusters_to_blocks(inode->i_sb,
> >+ tail_cpos + num_clusters);
> >+ blocks_to_zero -= tail_blkno;
> >+ }
> >+
> >+ /* Now CoW the clusters we're about to zero */
> >+ if (ext_flags & OCFS2_EXT_REFCOUNTED) {
> >+ rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
> >+ num_clusters, UINT_MAX);
> >+ if (rc) {
> >+ mlog_errno(rc);
> >+ goto out;
> >+ }
> >+ }
> /* Decrease blocks_to_zero if there is some hole after extent */
> if (tail_cpos + num_clusters <= pos_cpos) {
> blocks_to_zero =
> ocfs2_clusters_to_blocks(inode->i_sb,
> tail_cpos + num_clusters);
> blocks_to_zero -= tail_blkno;
> }

Not a bad split-out here.
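
The clamp in the quoted snippet can be checked in isolation. What follows is a minimal userspace sketch, not the kernel code: the geometry of 8 blocks per cluster is hypothetical, the two conversion helpers are stand-ins assumed to be plain shifts, and tail_cpos is derived from tail_blkno for brevity. It shows blocks_to_zero being cut back to the end of the extent when a hole lies between the extent and the write position.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry: 8 blocks per cluster (e.g. 4 KiB blocks, 32 KiB clusters). */
enum { BLOCKS_PER_CLUSTER_BITS = 3 };

/* Stand-in for ocfs2_blocks_to_clusters(); assumed to truncate. */
static uint32_t blocks_to_clusters(uint64_t blkno)
{
	return (uint32_t)(blkno >> BLOCKS_PER_CLUSTER_BITS);
}

/* Stand-in for ocfs2_clusters_to_blocks(): first block of a cluster. */
static uint64_t clusters_to_blocks(uint32_t cpos)
{
	return (uint64_t)cpos << BLOCKS_PER_CLUSTER_BITS;
}

/*
 * Number of blocks to zero, starting at tail_blkno.
 * num_clusters is the length of the allocated extent that starts at the
 * cluster containing tail_blkno (what a lookup would report).
 */
static uint64_t blocks_to_zero(uint64_t tail_blkno, uint64_t pos_blkno,
			       uint32_t num_clusters)
{
	uint32_t tail_cpos = blocks_to_clusters(tail_blkno);
	uint32_t pos_cpos = blocks_to_clusters(pos_blkno);
	uint64_t blocks;

	assert(num_clusters >= 1);

	/* Nothing lies between the tail and the write position. */
	if (pos_blkno <= tail_blkno)
		return 0;

	/* Default: zero everything up to, but not including, pos_blkno. */
	blocks = pos_blkno - tail_blkno;

	/* The extent ends at or before pos's cluster: stop at the extent end. */
	if (tail_cpos + num_clusters <= pos_cpos)
		blocks = clusters_to_blocks(tail_cpos + num_clusters) - tail_blkno;

	return blocks;
}
```

With a two-cluster extent covering blocks 0 to 15 and tail_blkno = 3, a write at block 12 zeroes 9 blocks (3 to 11), while a write at block 40 zeroes only 13 blocks (3 to 15) because blocks 16 to 39 are a hole.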

> /* Now CoW if we have some refcounted clusters. */
> if (ext_flags & OCFS2_EXT_REFCOUNTED) {
> /*
> * We add one more cluster here since it will be
> * written shortly and if the pos_blkno isn't aligned
> * to the cluster size, we have to zero the blocks
> * before it.
> */
> if (tail_cpos + num_clusters > pos_cpos)
> num_clusters = pos_cpos - tail_cpos + 1;

But you dropped the check for pos_blkno alignment.
Unconditionally adding the +1 doesn't seem like a good idea.
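
The difference between the two variants shows up only when pos_blkno sits exactly on a cluster boundary. The following is a minimal userspace sketch under assumed conditions (8 blocks per cluster, shift-based stand-ins for the ocfs2 conversion helpers, hypothetical function names); it models just the num_clusters arithmetic that feeds ocfs2_refcount_cow(), not the CoW itself.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical geometry: 8 blocks per cluster. */
enum { BLOCKS_PER_CLUSTER_BITS = 3 };

/* Stand-in for ocfs2_blocks_to_clusters(); assumed to truncate. */
static uint32_t blocks_to_clusters(uint64_t blkno)
{
	return (uint32_t)(blkno >> BLOCKS_PER_CLUSTER_BITS);
}

/* Stand-in for ocfs2_clusters_to_blocks(): first block of a cluster. */
static uint64_t clusters_to_blocks(uint32_t cpos)
{
	return (uint64_t)cpos << BLOCKS_PER_CLUSTER_BITS;
}

/* Patch: include pos's cluster only if pos_blkno is not cluster-aligned. */
static uint32_t cow_clusters_patch(uint32_t tail_cpos, uint32_t num_clusters,
				   uint64_t pos_blkno)
{
	uint32_t pos_cpos = blocks_to_clusters(pos_blkno);

	assert(pos_cpos >= tail_cpos);
	if (tail_cpos + num_clusters > pos_cpos) {
		num_clusters = pos_cpos - tail_cpos;
		if (pos_blkno > clusters_to_blocks(pos_cpos))
			num_clusters += 1;
	}
	return num_clusters;
}

/* Review proposal: always include pos's cluster when the extent reaches it. */
static uint32_t cow_clusters_proposed(uint32_t tail_cpos, uint32_t num_clusters,
				      uint64_t pos_blkno)
{
	uint32_t pos_cpos = blocks_to_clusters(pos_blkno);

	assert(pos_cpos >= tail_cpos);
	if (tail_cpos + num_clusters > pos_cpos)
		num_clusters = pos_cpos - tail_cpos + 1;
	return num_clusters;
}
```

For a four-cluster extent starting at cluster 0, a write at block 16 (the first block of cluster 2) gives 2 clusters with the patch and 3 with the proposal; a write at block 17 gives 3 with both.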

Joel

--

"Where are my angels?
Where's my golden one?
And where is my hope
Now that my heroes are gone?"

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.becker(a)oracle.com
Phone: (650) 506-8127

From: Tao Ma on 6 Jul 2010 04:00

Hi Joel,

On 07/06/2010 03:17 PM, Joel Becker wrote:
> On Mon, Jul 05, 2010 at 11:51:44AM +0800, Tao Ma wrote:
>>> + /*
>>> + * If tail_blkno is in the cluster past i_size, we don't need
>>> + * to touch the cluster containing i_size at all.
>>> + */
>>> + tail_cpos = i_size_read(inode) >> osb->s_clustersize_bits;
>>> + if (ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno) > tail_cpos)
>>> + tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb,
>>> + tail_blkno);
>> Can we always set tail_cpos in one line?
>> tail_cpos = ocfs2_blocks_to_clusters(inode->i_sb, tail_blkno)?
>> tail_cpos is either the same cluster as i_size or the next cluster
>> and both works for tail_blkno I guess?
>
> I had the same thought on Friday, but the current version passes
> testing and I was wary of changing that.
OK, as you wish.
>
>>> + /* Is there a cluster to zero? */
>>> + if (!p_cpos)
>>> + goto out;
>> For unwritten extent, we also need to clear the pages? If yes, the
>> solution doesn't complete if we have 2 unwritten extent, one
>> contains i_size while one passes i_size. Here we only clear the
>> pages for the 1st unwritten extent and leave the 2nd one untouched.
>
> We probably don't need to zero unwritten extents. We cannot
> have an extent past i_size, can we?
We can. AFAICS, ocfs2_change_file_space will allocate unwritten extents
and doesn't change i_size.
>
>> From here to the call of CoW is a bit hard to understand. In 'if',
>> num_clusters is set for CoW and in 'else', blocks_to_zero is set. So
>> it isn't easy for the reader to tell why these 2 clauses are setting
>> different values. So how about my code below? It looks more
>> straightforward I think.
>>> + if ((tail_cpos + num_clusters) > pos_cpos) {
>>> + num_clusters = pos_cpos - tail_cpos;
>>> + if (pos_blkno >
>>> + ocfs2_clusters_to_blocks(inode->i_sb, pos_cpos))
>>> + num_clusters += 1;
>>> + } else {
>>> + blocks_to_zero =
>>> + ocfs2_clusters_to_blocks(inode->i_sb,
>>> + tail_cpos + num_clusters);
>>> + blocks_to_zero -= tail_blkno;
>>> + }
>>> +
>>> + /* Now CoW the clusters we're about to zero */
>>> + if (ext_flags & OCFS2_EXT_REFCOUNTED) {
>>> + rc = ocfs2_refcount_cow(inode, di_bh, tail_cpos,
>>> + num_clusters, UINT_MAX);
>>> + if (rc) {
>>> + mlog_errno(rc);
>>> + goto out;
>>> + }
>>> + }
>> /* Decrease blocks_to_zero if there is some hole after extent */
>> if (tail_cpos + num_clusters <= pos_cpos) {
>> blocks_to_zero =
>> ocfs2_clusters_to_blocks(inode->i_sb,
>> tail_cpos + num_clusters);
>> blocks_to_zero -= tail_blkno;
>> }
>
> Not a bad split-out here.
>
>> /* Now CoW if we have some refcounted clusters. */
>> if (ext_flags & OCFS2_EXT_REFCOUNTED) {
>> /*
>> * We add one more cluster here since it will be
>> * written shortly and if the pos_blkno isn't aligned
>> * to the cluster size, we have to zero the blocks
>> * before it.
>> */
>> if (tail_cpos + num_clusters > pos_cpos)
>> num_clusters = pos_cpos - tail_cpos + 1;
>
> But you dropped the check for pos_blkno alignment.
> Unconditionally adding the +1 doesn't seem like a good idea.
You can add it as you wish.
I just thought that you add one extra cluster if pos_blkno isn't
aligned, so as to zero the blocks in [pos_cpos_start_block, pos_blkno).
But as I said in the comments, you will soon write pos_blkno (it also
needs to be CoWed since it is within this refcounted extent), so if we
CoW it out now, maybe we have a chance to avoid calling ocfs2_refcount_cow
later.

Regards,
Tao

From: Joel Becker on 6 Jul 2010 08:10

On Tue, Jul 06, 2010 at 03:54:58PM +0800, Tao Ma wrote:
> On 07/06/2010 03:17 PM, Joel Becker wrote:
> >>>+ /* Is there a cluster to zero? */
> >>>+ if (!p_cpos)
> >>>+ goto out;
> >>For unwritten extent, we also need to clear the pages? If yes, the
> >>solution doesn't complete if we have 2 unwritten extent, one
> >>contains i_size while one passes i_size. Here we only clear the
> >>pages for the 1st unwritten extent and leave the 2nd one untouched.
> >
> > We probably don't need to zero unwritten extents. We cannot
> >have an extent past i_size, can we?
> we can. AFAICS, ocfs2_change_file_space will allocate unwritten
> extents and doesn't change i_size.

Oh, you're right. We need to walk the entire extent range
between i_size and pos and figure out what needs CoW. This needs to
happen no matter what.
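
As a rough illustration of such a walk, here is a minimal userspace sketch. It is not the kernel implementation: the extent array, struct extent, the EXT_* flags and clusters_needing_cow() are all hypothetical stand-ins (real code would presumably look up each extent in turn, for example with ocfs2_get_clusters()). It visits every extent that overlaps the cluster range between i_size and pos, skips holes, and counts the refcounted clusters that would need CoW.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flags, modelled on unwritten and refcounted extent flags. */
enum { EXT_UNWRITTEN = 0x1, EXT_REFCOUNTED = 0x2 };

/* Hypothetical in-memory extent record: [cpos, cpos + clusters). */
struct extent {
	uint32_t cpos;
	uint32_t clusters;
	unsigned int flags;
};

/* Sample layout: clusters 5-7 are a hole; 8-9 are unwritten, past i_size. */
static const struct extent sample_extents[] = {
	{  0, 2, 0 },
	{  2, 3, EXT_REFCOUNTED },
	{  8, 2, EXT_UNWRITTEN },
	{ 10, 4, EXT_REFCOUNTED },
};

/*
 * Count the clusters in [start, end) that belong to refcounted extents.
 * Every extent overlapping the range is examined, not just the first one.
 */
static uint32_t clusters_needing_cow(const struct extent *exts, size_t n,
				     uint32_t start, uint32_t end)
{
	uint32_t total = 0;
	size_t i;

	assert(start <= end);
	for (i = 0; i < n; i++) {
		uint32_t lo = exts[i].cpos;
		uint32_t hi = exts[i].cpos + exts[i].clusters;

		/* Clamp the extent to the requested range. */
		if (lo < start)
			lo = start;
		if (hi > end)
			hi = end;
		if (lo >= hi)
			continue;	/* no overlap with the range */

		if (exts[i].flags & EXT_REFCOUNTED)
			total += hi - lo;
	}
	return total;
}
```

For the sample layout, the range [1, 12) touches all four extents and yields 5 clusters to CoW: clusters 2 to 4 from the second extent and clusters 10 and 11 from the fourth.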

> > But you dropped the check for pos_blkno alignment.
> >Unconditionally adding the +1 doesn't seem like a good idea.
> You can add it as you wish.
> I just thought that you add one more extra cluster if pos_blkno
> isn't aligned so as to zero blocks in [pos_cpos_start_block,
> pos_blkno).
> But as I said in the comments, you will soon write pos_blkno (it also
> needs to be CoW since it is within this refcounted extent), so if we
> can CoW it out now, maybe we have a chance to not call
> ocfs2_refcount_cow later.

I'd much rather let the write handle its own contiguousness. If
we get lucky, that CoW melds with our CoW. If we don't get lucky, isn't
it better to have the newly changed area be fully contiguous rather than
have the first extent of it not be and then the remaining extents be?

Joel

--

Life's Little Instruction Book #3

"Watch a sunrise at least once a year."

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.becker(a)oracle.com
Phone: (650) 506-8127

From: Joel Becker on 6 Jul 2010 15:00

On Mon, Jul 05, 2010 at 11:51:44AM +0800, Tao Ma wrote:
> On 07/04/2010 05:32 AM, Joel Becker wrote:
> From here to the call of CoW is a bit hard to understand. In 'if',
> num_clusters is set for CoW and in 'else', blocks_to_zero is set. So
> it isn't easy for the reader to tell why these 2 clauses are setting
> different values. So how about my code below? It looks more
> straightforward I think.

I took your cleanup mostly.

Joel

--

"The cynics are right nine times out of ten."
- H. L. Mencken

Joel Becker
Consulting Software Developer
Oracle
E-mail: joel.becker(a)oracle.com
Phone: (650) 506-8127
