Date:	Thu, 18 Jan 2001 08:36:55 -0800
From:	Larry McVoy <lm@bitmover.com>
To:	Russell Leighton <leighton@imake.com>
Subject: Re: Is sendfile all that sexy?

On Thu, Jan 18, 2001 at 06:04:17AM -0500, Russell Leighton wrote:
> "copy this fd to that one, and optimize that if you can"
> ... isn't this Larry M's "splice" (http://www.bitmover.com/lm/papers/splice.ps)?

Not really.  It's not clear to me that people really understood what I was
getting at in that and I've had some coffee and BK 2.0 is just about ready
to ship (shameless plug :-) so I'll give it another go.

The goal of splice is to avoid both data copies and virtual memory completely.
My SGI experience taught me that once you remove the data copy problem, the
next problem becomes setting up and tearing down the virtual mappings to the
data.  Linux is quite a bit lighter than IRIX but that doesn't remove this
issue, it just moves the point on the spectrum where the setup/teardown
becomes a problem.

Another goal of splice was to be general enough to allow data to flow from
any place to any place.  The idea was to have a good model and then iterate
over all the possible endpoints; I can think of files, sockets, and virtual
address spaces right off the top of my head; devices are a subset of files,
as will become apparent.

A final goal was to be able to handle caching vs. non-caching.
Sometimes one of the endpoints is a cache, such as the file system cache.
Sometimes you want data to stay in the cache and sometimes you want to
bypass it completely.  The model had to handle this.

OK, so the issues are
    - avoid copying
    - avoid virtual memory as much as possible
    - allow data flow to/from non-aligned, non-page-sized objects
    - handle caching or non-caching

This leads pretty naturally to some observations about the shape of the design:

    - the basic unit of data is a physical page, or part of one.  That's
      physical page, not a virtual address which points to a physical page.
    - since we may be coming from sockets, where the payload is buried in
      the middle of a page, there needs to be a vector of pages and a
      vector of { pageno, offset, len } that goes along with the first
      vector.  There are two vectors because you could have multiple payloads
      in a single page, i.e., there is not a 1:1 mapping between pages and
      payloads.
    - The page vector needs some flags, which handle caching.  I had just
      two flags, the "LOAN" flag and the "GIFT" flag.

In my mind, this was enough that everyone should "get it" at this point, but
that's me being lazy.

So how would this all work?  The first thing is that we are now dealing
in vectors of physical pages.  That's key - if you look at an OS, it
spends a lot of time with data going into a physical page, then being
translated to a virtual page, being copied to another virtual page, and
then being translated back to a physical page so that it can be sent to
a different device.  That's the basic FTP loop.

So you go "hey, just always talk physical pages and you avoid a lot of this
wasted time".  Now is a good time to observe that splice() is a library
interface.  The kernel level interfaces I called pull() and push().  The
idea was that you could do

	vectors = 0;

	do {
		vectors = pull(from_fd, vectors);
	} while (splice_size(vectors) < BIG_ENOUGH_SIZE);
	push(to_fd, vectors);

The idea was that you maintained a pointer to the vectors.  The pointer is
a "cookie": you can't really dereference it in user space, at least not all
of it, but the kernel doesn't want to maintain this stuff, it wants you to
do that.  So you start pulling and then you push what you got.  And you,
being the user-land process, never look at the data; in fact, you can't.
You have a pointer to a data structure which describes the data, but you
can't look at the data itself.

A couple of interesting things: 
    - this design allows for multiplexing.  You could pull from multiple devices
      and then push to one.  The interface needs a little tweaking for that to
      be meaningful; we can steal from pipe semantics.  We need to be able to
      say how much to pull, so we add a length.
    - there is no reason that you couldn't have an fd which was open to 
      /proc/self/my_address_space and you could essentially do an mmap()
      by seeking to where you want the mapping and doing a push to it.
      This is a fairly important point: it allows for end-to-end data flow.
      There are lots of nasty issues with non-page-sized chunks in the vector;
      what you do there depends on the semantics you want.

So what about the caching?  That's the loan/gift distinction.  The deal is that
these pages have reference counts and when the reference count goes to zero,
somebody has to free them.  So the page vector needs a free_page() function
pointer and if the pages are a loan, you call that function pointer when you
are done with them.   In other words, if the file system cache loaned you 
the pages, you do a call back to let the file system know you are done with
them.  If the pages were a gift, then the function pointer is null and you
have to manage them.  You can put the normal decrement_and_free() function
in there and when you get done with them you call that and the pages go back
to the free list.  You can also "free" them into your private page pool, etc.
The point is that if the end point which is being pulled() from wants the
pages cached, it "loans" them; if it doesn't, it "gifts" them.  Sockets as
a "from" end point would always gift; files as a "from" end point would
typically loan.

So, there's the set of ideas.  I'm ashamed to admit that I don't really know
how close kiobufs are to this.  I am interested in hearing what you all think,
but especially what the people think who have been playing around with kiobufs
and sendfile.
Larry McVoy            	 lm at bitmover.com           http://www.bitmover.com/lm 
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
Please read the FAQ at http://www.tux.org/lkml/