
From:   Richard Guy Briggs <rgb@conscoop.ottawa.on.ca>
Subject: IPSEC transport mode w/2.2.x kernels and large packets
To:     linux-ipsec@clinet.fi (linux ipsec)
Date:   Fri, 6 Aug 1999 20:15:39 -0400 (EDT)

-----BEGIN PGP SIGNED MESSAGE-----

Some of you may have seen parts of this recently in correspondence
with me.  To those, I apologise that you have to see it again here;
please correct me if I have misrepresented what you said.  To
everyone else, this is a summary of the mess.


Due to very popular demand, here is a status report on the situation
with oversize packets in the 2.2.x kernels with FreeS/WAN.

Until recently (0.92) FreeS/WAN (a free IPSEC {un-}implementation for
Linux <www.xs4all.nl/~freeswan/>) didn't handle packets close to the MTU
or with DF set properly.  That was (mostly) shaken down by a pile of
testing with John S. Denker (thanks JSD).  This all happened under
2.0.36.  If the packet was smaller than the MTU, we simply sent it on
its way by calling dev_queue_xmit() with the underlying physical
device as a parameter.  If it was larger, we sent it to ip_fragment(),
which also accepts a device parameter.  In both cases, this allows us
to change the outgoing device from the virtual ipsec[0-3] device to
the underlying physical device, whatever it might be.
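
As a rough illustration of that 2.0.x send decision, here is a
userspace sketch.  It is not the FreeS/WAN code: struct xmit_dev,
struct xmit_skb and xmit_choose_path() are invented stand-ins for the
kernel's device, sk_buff and the dev_queue_xmit()/ip_fragment() calls,
and treating a packet exactly equal to the MTU as "fits" is an
assumption.

```c
/* Hypothetical, simplified stand-ins for the kernel structures. */
struct xmit_dev {
    const char *name;   /* e.g. "ipsec0" or "eth0" */
    unsigned int mtu;   /* device MTU in bytes */
};

struct xmit_skb {
    struct xmit_dev *dev;   /* device that currently owns the packet */
    unsigned int len;       /* total IP packet length in bytes */
};

enum xmit_path {
    XMIT_DIRECT,    /* stands in for dev_queue_xmit() */
    XMIT_FRAGMENT   /* stands in for ip_fragment() */
};

/*
 * Hand the packet over from the virtual device to the physical device,
 * then pick the send path based on the physical device's MTU.
 */
enum xmit_path xmit_choose_path(struct xmit_skb *skb, struct xmit_dev *physdev)
{
    skb->dev = physdev;             /* change ownership of the packet */

    if (skb->len <= physdev->mtu)   /* assumption: equal to MTU still fits */
        return XMIT_DIRECT;

    return XMIT_FRAGMENT;
}
```

In both branches the packet leaves owned by the physical device, which
is the point of passing the device as a parameter.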

For 2.2.x, we had planned to get away from all this silly virtual
device stuff and route stealing and routing table mucking about.  We
wanted to move things up the stack to avoid having to deal with the
link layer stuff, removing and re-assembling the hard_header, among
other things.

This didn't happen because we were handed a quick and dirty patch to
do 2.2.x kernels by Marc Boucher of Zero Knowledge Systems (Thanks
very much Marc!).  It used the same idea as the 2.0.x kernels,
changing a couple of calls and their number of arguments to make it
compile and seem to work.

It worked immediately, both as a module and statically linked.

Large packet handling was not checked at the time and was tripped over
afterwards.  When large packets hit a 2.2.x kernel with ipsec and were
passed to ip_fragment(), the machine oopsed, sometimes hard.  The
output function passed to ip_fragment() was skb->dst->output().  Once
it was changed to neigh_compat_output(), which had replaced
dev_queue_xmit() in the move from 2.0.x to 2.2.x, the problem went
away *for me* and got worse for others.  This started my quest to
find out which function should be used for output, and led into cflow
diagrams...

There have been reports of severe crashes under the 2.2 kernels with
large packets.  I was seeing fairly regular crashes under 2.2.7 which
would start with a small oops and keep rolling to larger and larger
oopses until they finally exhausted themselves.  I upgraded to 2.2.10
and those problems largely went away, with only the occasional crash.
Others have reported exactly the opposite behaviour with relative
stability under 2.2.7 and crashing consistently under 2.2.10.  There
have been many reports on this list and linux-kernel about
instability, locking up, tcp stalling and such for many of the 2.2.x
series kernels, so we are not certain it is anything we are doing.
There was a report on the list that 2.2.11pre4 is more stable.  I am
trying that, but still see some unexplained behaviour, which might be
due to the 2.0.36 machine at the other end, which for some reason is
not able to re-assemble the fragments.  Investigation pending.

At the same time, I gave a FreeS/WAN presentation at the Ottawa Linux
Symposium.  Describing the problem in discussions with a number of
other developers elucidated some of the issues surrounding transport
mode IPSEC.  It confirmed that we need to move our processing up the
stack, before things are fragmented.  In particular, Alan Cox
suggested setting the MTU to the maximum allowed, 0xfff0, then
fragmenting after encryption to the MTU of the physical device.
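
To get a feel for what fragmenting after encryption means in numbers,
here is a small sketch of the fragment-count arithmetic.  It assumes a
20-byte IP header with no options and ignores any IPSEC overhead;
frag_count() and FRAG_IP_HDR_LEN are names made up for this example.

```c
#define FRAG_IP_HDR_LEN 20u   /* assumption: IP header without options */

/*
 * Number of IP fragments needed to carry a packet of total_len bytes
 * (header included) over a link with the given MTU.  Every fragment
 * except the last must carry a multiple of 8 payload bytes.
 * Returns 0 if the MTU is too small to carry any payload.
 */
unsigned int frag_count(unsigned int total_len, unsigned int mtu)
{
    unsigned int payload, per_frag;

    if (mtu < FRAG_IP_HDR_LEN + 8u)
        return 0;                   /* cannot carry even 8 payload bytes */
    if (total_len <= mtu)
        return 1;                   /* fits as is, no fragmentation */

    payload = total_len - FRAG_IP_HDR_LEN;
    per_frag = (mtu - FRAG_IP_HDR_LEN) & ~7u;   /* round down to 8 bytes */

    return (payload + per_frag - 1u) / per_frag;   /* ceiling division */
}
```

For example, frag_count(0xfff0, 1500) describes a maximum-size packet
leaving through a 1500-byte physical MTU.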

This seems to work under the 2.0.x kernels since before I call
dev_queue_xmit() or ip_fragment(), I can change the ownership of the
skb from the virtual device to the physical device.  In the process of
testing all this, I started getting some skb's that were not getting
skb_copy_expand()ed properly.  For some reason,
skb->truesize-sizeof(sk_buff) was not quite right.  I have changed it
to be the same as that used in the 2.2.x kernels by Marc Boucher and I
am starting to see some 'cannot get free page' messages, which I am
not certain are from that change or the maximum mtu change.  It only
seems to happen with connections directly to the gateway, and not
connections *through* the gateway that use ipsec.

Under 2.2.x kernels, this is not so simple, since even if I change the
ownership of the skb, skb->dev, the skb still has a dst cache entry
pointer, skb->dst, which still assumes the old device, skb->dst->dev.
How do we change the dst entry?  We can't do another routing table
lookup since that is unreliable and prone to route stealing, besides
we are trying to move away from that method.  Do we blow it away?  It
would then need to do a routing lookup.  How do we tell it to use the
new device without corrupting the dst cache?  The only way I could
come up with was to patch net/ipv4/ip_output.c:ip_fragment() so that
if skb->dst->dev, the device pointed to in the dst cache pointer, and
skb->dev, the device pointed to by the skb, were different, it would
take its MTU information from skb->dev->mtu, rather than where it does
now, from skb->dst->pmtu.
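
The MTU choice in that patch can be sketched as follows.  This is a
userspace illustration only, not the actual change to
net/ipv4/ip_output.c: struct frag_dev, struct frag_dst, struct
frag_skb and frag_pick_mtu() are simplified, invented stand-ins that
carry just the fields mentioned above.

```c
/* Hypothetical, simplified stand-ins for the kernel structures. */
struct frag_dev {
    unsigned int mtu;           /* device MTU */
};

struct frag_dst {
    struct frag_dev *dev;       /* device assumed by the dst cache entry */
    unsigned int pmtu;          /* path MTU stored in the dst cache entry */
};

struct frag_skb {
    struct frag_dev *dev;       /* device that currently owns the packet */
    struct frag_dst *dst;       /* dst cache entry pointer */
};

/*
 * Pick the MTU to fragment to.  If the skb has been handed to a device
 * other than the one the dst entry assumes, trust the skb's device;
 * otherwise keep using the path MTU from the dst entry.
 */
unsigned int frag_pick_mtu(const struct frag_skb *skb)
{
    if (skb->dst->dev != skb->dev)
        return skb->dev->mtu;

    return skb->dst->pmtu;
}
```

The dst entry itself is never modified here, so the dst cache is left
alone.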

In his first response, Alexey Kuznetsov suggested I was passing an
invalid packet to the output function skb->dst->output.  This is quite
possible as I am changing the ownership of the packet, i.e. skb->dev,
which does not agree with skb->dst->dev, amongst other things.

Alexey also thinks:
"I think, you will not able to implement transport mode IPsec TCP
without significant changes in TCP itself."

I am not so sure I agree, based on some interesting ideas that have
come up in discussion around the Ottawa Linux Symposium.

Later he concludes that our processing should be done before
ip_build_xmit(), which echoes some of our previous instincts.  He also
suggests that our eroute lookup should happen when the dst stuff does
now, and suggests that we should store a per-eroute pmtu.

Steve Whitehouse (DECnet under Linux) said:
"Ah, this sounds like you are encapsulating packets in another
protocol before sending them, like ip_gre.c for example. I have come
across the same problem (I'd like to see DECnet in ip_gre) but I have
no solution at the moment. One way may be to have an array of dst
pointers in an skb, each one refering to a layer of encapsulation,
plus an index to indicate the current layer. I may suggest this to
Alexey, he may well have a better idea."
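
One way to picture Steve's suggestion is the sketch below.  It is
purely illustrative: nothing like this exists in the kernel, and
LAYER_MAX, struct layer_dst, struct layered_skb and the three helper
functions are all invented for this example.

```c
#include <stddef.h>

#define LAYER_MAX 4   /* arbitrary limit chosen for this sketch */

/* Hypothetical per-layer routing information. */
struct layer_dst {
    const char *devname;    /* output device for this layer */
    unsigned int pmtu;      /* path MTU for this layer */
};

/* Hypothetical skb carrying one dst pointer per encapsulation layer. */
struct layered_skb {
    struct layer_dst *dst[LAYER_MAX];
    int ndst;   /* number of layers recorded */
    int cur;    /* index of the layer currently being processed */
};

/* Record the dst for one more layer; returns 0, or -1 if full. */
int layered_push(struct layered_skb *skb, struct layer_dst *dst)
{
    if (skb->ndst >= LAYER_MAX)
        return -1;
    skb->dst[skb->ndst++] = dst;
    return 0;
}

/* dst of the layer currently being processed, or NULL if none. */
struct layer_dst *layered_current(const struct layered_skb *skb)
{
    if (skb->cur < 0 || skb->cur >= skb->ndst)
        return NULL;
    return skb->dst[skb->cur];
}

/* Move on to the next layer; returns 0, or -1 if there is none. */
int layered_next(struct layered_skb *skb)
{
    if (skb->cur + 1 >= skb->ndst)
        return -1;
    skb->cur++;
    return 0;
}
```

With an ipsec layer pushed first and the physical device second, the
encapsulating code would use layered_current() for its own MTU, then
call layered_next() before handing the packet down.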

Where does that leave us/me?  Testing.  Alan Cox has just announced
2.2.7-pre5.

Any comments, hints and suggestions welcome.  Patches welcome only
from non-Americans outside the USA.  Thanks for reading this far...

	slainte mhath, RGB
- -- 
The first Ottawa Linux Symposium was a huge success! <ottawalinuxsymposium.org>
This SunRayce was a wet one!  DroughtRelief_99? -- <www.sunrayce.com/sunrayce/>
Richard Guy Briggs -- PGP key available                Auto-Free Ottawa! Canada
<http://www.conscoop.ottawa.on.ca/rgb/>                   </www.flora.org/afo/>
Prevent Internet Wiretapping!       --      FreeS/WAN:<www.xs4all.nl/~freeswan>
Thanks for voting Green! -- <green.ca>          Marillion:<www.marillion.co.uk>

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQCVAwUBN6t6qd+sBuIhFagtAQEK3wQAjwOprnYLwFj3dwZojWVhD7iDpwvJ2ikQ
R1G+e7uJpe6J+FPH8xtimrrFN1Vehp3TC8s33XzRAVExP35GPOEDYEZsOgW3XiWJ
qFvTzWIPXXb85Aa4t3DUVti3shAt7XwGaYAXle21nyq6o6H7Di3+vI+JLZ33tn5e
TPPza2Fnhk0=
=pQkP
-----END PGP SIGNATURE-----
