From: Rick Sherm on 16 Apr 2010 13:10

Hello,

I'm trying to measure the performance gain from using splice. For now I'm trying to copy a 1G file using splice. (In the real scenario, the driver will DMA the data into a buffer which is mmap'd. The app will then write the newly-DMA'd data to disk while some other thread crunches the same buffer. The buffer is guaranteed not to be modified. To avoid copying I was thinking of: splice IN mmap'd-buffer -> pipe, and splice OUT pipe -> file.)

PS - I've inlined some sloppy code that I cooked up.

Case 1) read from input_file and write (O_DIRECT, so no buffer cache is involved - but it doesn't work) to dest_file. We can talk about the buffer cache later.

(csh#) time ./splice_to_splice
0.004u 1.451s 0:02.16 67.1% 0+0k 2097152+2097152io 0pf+0w

#define KILO_BYTE (1024)
#define PIPE_SIZE (64 * KILO_BYTE)

int filedes[2];

pipe(filedes);

fd_from = open(filename_from, (O_RDWR | O_LARGEFILE | O_DIRECT), 0777);
fd_to   = open(filename_to, (O_WRONLY | O_CREAT | O_LARGEFILE | O_DIRECT), 0777);

/* 1G file */
to_write = 2048 * 512 * KILO_BYTE;

while (to_write) {
        ret = splice(fd_from, &from_offset, filedes[1], NULL, PIPE_SIZE,
                     SPLICE_F_MORE | SPLICE_F_MOVE);
        if (ret < 0) {
                printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
                goto error;
        } else {
                ret = splice(filedes[0], NULL, fd_to, &to_offset,
                             PIPE_SIZE /* should be ret, but ... */,
                             SPLICE_F_MORE | SPLICE_F_MOVE);
                if (ret < 0) {
                        printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
                        goto error;
                }
                to_write -= ret;
        }
}

Case 2) directly reading and writing:

Case 2.1) copy 64K blocks

(csh#) time ./file_to_file 64
0.015u 1.066s 0:04.04 26.4% 0+0k 2097152+2097152io 0pf+0w

#define KILO_BYTE (1024)
#define MEGA_BYTE (1024 * (KILO_BYTE))
#define BUFF_SIZE (64 * MEGA_BYTE)

posix_memalign((void **)&buff, 4096, BUFF_SIZE);

fd_from = open(filename_from, (O_RDWR | O_LARGEFILE | O_DIRECT), 0777);
fd_to   = open(filename_to, (O_WRONLY | O_CREAT | O_LARGEFILE | O_DIRECT), 0777);

/* 1G file == 2048 * 512K blocks */
to_write  = 2048 * 512 * KILO_BYTE;
copy_size = cmd_line_input * KILO_BYTE; /* controlled from the command line */

while (to_write) {
        ret = read(fd_from, buff, copy_size);
        if (ret != copy_size) {
                printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
                goto error;
        } else {
                ret = write(fd_to, buff, copy_size);
                if (ret != copy_size) {
                        printf("Error: LINE:%d ret:%d\n", __LINE__, ret);
                        goto error;
                }
                to_write -= ret;
        }
}

Case 2.2) copy 512K blocks

(csh#) time ./file_to_file 512
0.004u 0.306s 0:01.86 16.1% 0+0k 2097152+2097152io 0pf+0w

Case 2.3) copy 1M blocks

(csh#) time ./file_to_file 1024
0.000u 0.240s 0:01.88 12.7% 0+0k 2097152+2097152io 0pf+0w

Questions:

Q1) When using splice, why is the CPU consumption greater than with read/write (case 2.1)? What does this mean?

Q2) How do I confirm that memory bandwidth consumption does not spike when using splice in this case? By this I mean the (node) cpu<->mem path. The DMA-in/DMA-out will happen - you can't escape that, and the IOH bus will be utilized - but I want to keep the cpu(node)-mem path free (well, minimize unnecessary copies).

Q3) When using splice, even though the destination file is opened in O_DIRECT mode, the data gets cached. I verified it using vmstat.

r  b  swpd  free     buff    cache
1  0  0     9358820  116576  2100904

./splice_to_splice

r  b  swpd  free     buff    cache
2  0  0     7228908  116576  4198164

I see the same caching issue even if I vmsplice buffers (simple malloc'd iovs) to a pipe and then splice the pipe to a file. The speed is still an issue with vmsplice too.

Q4) Also, using splice, you can only transfer 64K worth of data (PIPE_BUFFERS*PAGE_SIZE) at a time, correct? But using stock read/write, I can go up to a 1MB buffer. After that I don't see any gain, but the reduction in system/CPU time is still significant.

I would appreciate any pointers.

thanks
Rick
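(For what it's worth, the "real scenario" described above - an mmap'd, already-DMA'd buffer going to a file - would use vmsplice() for the first hop rather than splice(). Below is a minimal sketch of that flow, not code from this thread: the buffer is just an anonymous mmap filled with memset, the file name and sizes are invented, and error handling is minimal. Note that it drains exactly the byte count each call reports instead of assuming full PIPE_SIZE transfers.)

/*
 * Sketch: gift an already-filled, mmap'd buffer into a pipe with
 * vmsplice(), then splice() the pipe contents out to a file.
 * Names and sizes are illustrative only.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/uio.h>
#include <unistd.h>

#define CHUNK (64 * 1024)        /* roughly one pipe-full per iteration */

int main(void)
{
        size_t len = 64 * 1024 * 1024;          /* pretend DMA buffer: 64MB */
        char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        int pfd[2];
        int fd = open("dest_file", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        loff_t off = 0;
        size_t done = 0;

        if (buf == MAP_FAILED || fd < 0 || pipe(pfd) < 0) {
                perror("setup");
                return 1;
        }
        memset(buf, 'x', len);                  /* stands in for DMA'd data */

        while (done < len) {
                struct iovec iov = {
                        .iov_base = buf + done,
                        .iov_len  = (len - done) < CHUNK ? (len - done) : CHUNK,
                };
                /* Hand the user pages to the pipe without copying. */
                ssize_t in = vmsplice(pfd[1], &iov, 1, 0);
                if (in < 0) {
                        perror("vmsplice");
                        return 1;
                }
                /* Drain exactly what went in; splice may return short counts. */
                ssize_t left = in;
                while (left > 0) {
                        ssize_t out = splice(pfd[0], NULL, fd, &off, left,
                                             SPLICE_F_MORE | SPLICE_F_MOVE);
                        if (out <= 0) {
                                perror("splice");
                                return 1;
                        }
                        left -= out;
                }
                done += in;
        }
        return 0;
}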
From: Rick Sherm on 23 Apr 2010 16:00

Hello Jens,

--- On Fri, 4/23/10, Jens Axboe <jens.axboe(a)oracle.com> wrote:
> I still have patches pending for this, making the pipe
> buffer count settable from user space:
>
> http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=24547ac4d97bebb58caf9ce58bd507a95c812a3f
>
> Let me know if you want to give it a spin on a recent
> kernel, and I'll update it.
>

I think we need to adjust 'PIPE_BUFFERS' in default_file_splice_read() as well, correct?

> Jens Axboe

Thanks
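(The pipe-size work Jens refers to ended up in mainline later as the F_GETPIPE_SZ/F_SETPIPE_SZ fcntls. A minimal sketch of how a splice user would request a bigger pipe on a kernel that has them - the requested size below is arbitrary:)

/*
 * Sketch only: ask for a bigger pipe before splicing, assuming a kernel
 * that supports F_SETPIPE_SZ.  The kernel may round the size or refuse,
 * so use the capacity fcntl() reports back, not the requested value.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        int pfd[2];

        if (pipe(pfd) < 0) {
                perror("pipe");
                return 1;
        }

        /* Request 1MB of pipe buffers instead of the default 64K. */
        int sz = fcntl(pfd[1], F_SETPIPE_SZ, 1024 * 1024);
        if (sz < 0) {
                perror("F_SETPIPE_SZ");   /* older kernel: stuck with the default */
                sz = fcntl(pfd[1], F_GETPIPE_SZ);
        }
        printf("pipe capacity is now %d bytes\n", sz);

        /* ...splice() in chunks of up to 'sz' bytes per call... */
        close(pfd[0]);
        close(pfd[1]);
        return 0;
}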
From: Rick Sherm on 23 Apr 2010 12:10

Hello Jens - any assistance/pointers on 1) and 2) below will be great. I'm willing to test out any sample patch.

Steve,

--- On Wed, 4/21/10, Steven J. Magnani <steve(a)digidescorp.com> wrote:
> Hi Rick,
>
> On Fri, 2010-04-16 at 10:02 -0700, Rick Sherm wrote:
> > Q3) When using splice, even though the destination
> > file is opened in O_DIRECT mode, the data gets cached. I
> > verified it using vmstat.
> >
> > r b swpd  free     buff    cache
> > 1 0 0     9358820  116576  2100904
> >
> > ./splice_to_splice
> >
> > r b swpd  free     buff    cache
> > 2 0 0     7228908  116576  4198164
> >
> > I see the same caching issue even if I vmsplice
> > buffers (simple malloc'd iovs) to a pipe and then splice the
> > pipe to a file. The speed is still an issue with vmsplice too.
> >
>
> One thing is that O_DIRECT is a hint; not all filesystems
> bypass the cache. I'm pretty sure ext2 does, and I know fat doesn't.
>
> Another variable is whether (and how) your filesystem
> implements the splice_write file operation. The generic one (pipe_to_file)
> in fs/splice.c copies data to pagecache. The default one goes
> out to vfs_write() and might stand more of a chance of honoring
> O_DIRECT.
>

True. I guess I should have looked harder. It's xfs, and xfs's file_ops point to 'generic_file_splice_read[write]'. Last time I had to fdatasync and then fadvise to mimic O_DIRECT.

> > Q4) Also, using splice, you can only transfer 64K
> > worth of data (PIPE_BUFFERS*PAGE_SIZE) at a time, correct? But
> > using stock read/write, I can go up to a 1MB buffer. After that
> > I don't see any gain. But still the reduction in system/cpu
> > time is significant.
>
> I'm not a splicing expert but I did spend some time recently trying to
> improve FTP reception by splicing from a TCP socket to a
> file. I found that while splicing avoids copying packets to userland,
> that gain is more than offset by a large increase in calls into the
> storage stack. It's especially bad with TCP sockets because a typical
> packet has, say, 1460 bytes of data. Since splicing works on PIPE_BUFFERS
> pages at a time, and packet pages are only about 35% utilized, each
> cycle to userland I could only move 23 KiB of data at most. Some
> similar effect may be in play in your case.
>

Agreed, increasing the number of calls will offset the benefit. But what if:

1) We were to increase PIPE_BUFFERS from '16' to '64' or 'some value'?
   What are the implications in the other parts of the kernel?

2) There was a way to find out if the DMA-out/in from the initial buffers that were passed is complete, so that we are free to recycle them? A callback would be helpful. Obviously, the user-space app will have to manage its buffers, but at least we are guaranteed that the buffers can be recycled (in other words, no worrying about modifying in-flight data that is being DMA'd).

> Regards,
> Steven J. Magnani

regards
++Rick
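(For reference, the fdatasync-plus-fadvise workaround Rick mentions looks roughly like the sketch below. It only approximates O_DIRECT: POSIX_FADV_DONTNEED is advisory, and pages must be clean before they can be dropped, hence the fdatasync first. The helper name and call frequency are made up.)

/*
 * Rough sketch: after splicing a batch of data to 'fd', push it to disk
 * and then ask the kernel to drop the now-clean pagecache pages.
 * Call periodically, e.g. every few hundred MB of spliced output.
 */
#include <sys/types.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int drop_written_cache(int fd, off_t start, off_t len)
{
        if (fdatasync(fd) < 0) {          /* make the pages clean first */
                perror("fdatasync");
                return -1;
        }
        /* Advisory: clean pages in [start, start+len) may now be freed. */
        int err = posix_fadvise(fd, start, len, POSIX_FADV_DONTNEED);
        if (err) {
                fprintf(stderr, "posix_fadvise: %d\n", err);
                return -1;
        }
        return 0;
}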
From: Steven J. Magnani on 23 Apr 2010 13:00

On Fri, 2010-04-23 at 09:07 -0700, Rick Sherm wrote:
> Hello Jens - any assistance/pointers on 1) and 2) below
> will be great. I'm willing to test out any sample patch.

Recent mail from him has come from jens.axboe(a)oracle.com, I cc'd it.

[...]

> Agreed, increasing the number of calls will offset the benefit. But what if:
>
> 1) We were to increase PIPE_BUFFERS from '16' to '64' or 'some value'?
>    What are the implications in the other parts of the kernel?

This came up recently; one problem is that there are a couple of kernel
functions having up to 3 stack-based arrays of dimension PIPE_BUFFERS, so
the stack cost of increasing PIPE_BUFFERS can be quite high. I've
thought it might be nice if there was some mechanism for userland apps
to be able to request larger PIPE_BUFFERS values, but I haven't pursued
this line of thought to see if it's practical.

> 2) There was a way to find out if the DMA-out/in from the initial buffers
>    that were passed is complete, so that we are free to recycle them? A
>    callback would be helpful. Obviously, the user-space app will have to
>    manage its buffers, but at least we are guaranteed that the buffers can
>    be recycled (in other words, no worrying about modifying in-flight data
>    that is being DMA'd).

It's a neat idea, but it would probably be much easier (and less
invasive) to try this sort of pipelining in userland using a ring buffer
or ping-pong approach. I'm actually in the middle of something like this
with FTP, where I will have a reader thread that puts data from the
network into a ring buffer, from which a writer thread moves it to a
file.

------------------------------------------------------------------------
 Steven J. Magnani               "I claim this network for MARS!
 www.digidescorp.com              Earthling, return my space modulator!"

#include <standard.disclaimer>
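(A bare-bones illustration of the ring-buffer/ping-pong scheme Steven describes - not his actual FTP code. The slot count, sizes, and the fake "produce" step are invented; the point is that a slot is only handed back to the producer after the writer is completely done with it, which is the same guarantee Rick wants for his DMA buffers.)

/*
 * Sketch: a reader thread fills slots while a writer thread drains
 * them.  A single mutex/condvar pair and a per-slot fill count act as
 * the ownership handshake.  Error handling is kept minimal.
 */
#include <pthread.h>
#include <stdio.h>
#include <string.h>

#define NSLOTS  2                /* ping-pong: two slots         */
#define SLOTSZ  (512 * 1024)     /* 512K per slot                */
#define NCHUNKS 100              /* fake amount of data to move  */

static char   slots[NSLOTS][SLOTSZ];
static size_t fill[NSLOTS];      /* bytes valid in each slot, 0 = free */
static int    done;              /* producer finished            */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

static void *reader(void *arg)
{
        for (int n = 0; n < NCHUNKS; n++) {
                int i = n % NSLOTS;

                pthread_mutex_lock(&lock);
                while (fill[i] != 0)                 /* wait until writer drained it */
                        pthread_cond_wait(&cond, &lock);
                pthread_mutex_unlock(&lock);

                /* Fill outside the lock: stands in for recv()/DMA completion. */
                memset(slots[i], 'a' + (n % 26), SLOTSZ);

                pthread_mutex_lock(&lock);
                fill[i] = SLOTSZ;
                pthread_cond_broadcast(&cond);
                pthread_mutex_unlock(&lock);
        }
        pthread_mutex_lock(&lock);
        done = 1;
        pthread_cond_broadcast(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

static void *writer(void *arg)
{
        FILE *out = fopen("dest_file", "w");

        for (int n = 0; ; n++) {
                int i = n % NSLOTS;
                size_t len;

                pthread_mutex_lock(&lock);
                while (fill[i] == 0 && !done)
                        pthread_cond_wait(&cond, &lock);
                if (fill[i] == 0 && done) {          /* nothing left to write */
                        pthread_mutex_unlock(&lock);
                        break;
                }
                len = fill[i];
                pthread_mutex_unlock(&lock);

                fwrite(slots[i], 1, len, out);       /* write outside the lock */

                pthread_mutex_lock(&lock);
                fill[i] = 0;                         /* slot is recyclable again */
                pthread_cond_broadcast(&cond);
                pthread_mutex_unlock(&lock);
        }
        fclose(out);
        return NULL;
}

int main(void)
{
        pthread_t r, w;

        pthread_create(&r, NULL, reader, NULL);
        pthread_create(&w, NULL, writer, NULL);
        pthread_join(r, NULL);
        pthread_join(w, NULL);
        return 0;
}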
From: Jens Axboe on 23 Apr 2010 13:10
On Fri, Apr 23 2010, Steven J. Magnani wrote:
> On Fri, 2010-04-23 at 09:07 -0700, Rick Sherm wrote:
> > Hello Jens - any assistance/pointers on 1) and 2) below
> > will be great. I'm willing to test out any sample patch.
>
> Recent mail from him has come from jens.axboe(a)oracle.com, I cc'd it.

Goes to the same inbox in the end, so no difference :-)

[...]

> > Agreed, increasing the number of calls will offset the benefit. But what if:
> >
> > 1) We were to increase PIPE_BUFFERS from '16' to '64' or 'some value'?
> >    What are the implications in the other parts of the kernel?
>
> This came up recently; one problem is that there are a couple of kernel
> functions having up to 3 stack-based arrays of dimension PIPE_BUFFERS, so
> the stack cost of increasing PIPE_BUFFERS can be quite high. I've
> thought it might be nice if there was some mechanism for userland apps
> to be able to request larger PIPE_BUFFERS values, but I haven't pursued
> this line of thought to see if it's practical.

I still have patches pending for this, making the pipe buffer count
settable from user space:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commit;h=24547ac4d97bebb58caf9ce58bd507a95c812a3f

Let me know if you want to give it a spin on a recent kernel, and I'll
update it.

> > 2) There was a way to find out if the DMA-out/in from the initial buffers
> >    that were passed is complete, so that we are free to recycle them? A
> >    callback would be helpful. Obviously, the user-space app will have to
> >    manage its buffers, but at least we are guaranteed that the buffers can
> >    be recycled (in other words, no worrying about modifying in-flight data
> >    that is being DMA'd).
>
> It's a neat idea, but it would probably be much easier (and less
> invasive) to try this sort of pipelining in userland using a ring buffer
> or ping-pong approach. I'm actually in the middle of something like this
> with FTP, where I will have a reader thread that puts data from the
> network into a ring buffer, from which a writer thread moves it to a
> file.

See vmsplice.c from the splice test tools:

http://brick.kernel.dk/snaps/splice-git-latest.tar.gz

--
Jens Axboe
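(A caveat worth keeping in mind here, as a general note rather than anything from that tarball: pages handed to a pipe with vmsplice() may still be referenced by the pipe until they have been spliced out the other end, so a buffer must not be overwritten until the pipe has drained it. A rough sketch of that discipline with two rotating buffers follows; file names, sizes, and the fake fill step are invented.)

/*
 * Sketch: rotate through two buffers and fully drain the pipe before a
 * buffer comes around for reuse, so the pipe no longer references its
 * pages when it is overwritten with the next chunk.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

#define NBUFS 2
#define BUFSZ (64 * 1024)        /* one pipe-full per buffer */

int main(void)
{
        char *bufs[NBUFS];
        int pfd[2], fd, n, idx = 0;

        for (n = 0; n < NBUFS; n++)
                posix_memalign((void **)&bufs[n], 4096, BUFSZ);
        pipe(pfd);
        fd = open("dest_file", O_WRONLY | O_CREAT | O_TRUNC, 0644);

        for (n = 0; n < 16; n++) {               /* move 16 fake chunks */
                struct iovec iov = { .iov_base = bufs[idx], .iov_len = BUFSZ };
                ssize_t in, out, left;

                /* Generate the next chunk in the buffer we are allowed to touch. */
                memset(bufs[idx], '0' + (n % 10), BUFSZ);

                in = vmsplice(pfd[1], &iov, 1, 0);
                if (in < 0) { perror("vmsplice"); return 1; }

                /* Drain everything before this buffer's turn comes again. */
                for (left = in; left > 0; left -= out) {
                        out = splice(pfd[0], NULL, fd, NULL, left,
                                     SPLICE_F_MORE | SPLICE_F_MOVE);
                        if (out <= 0) { perror("splice"); return 1; }
                }
                idx = (idx + 1) % NBUFS;         /* rotate to the other buffer */
        }
        close(fd);
        return 0;
}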