Date:   Wed, 21 Feb 2001 02:42:03 -0500
From:   Richard Guy Briggs <rgb@conscoop.ottawa.on.ca>
To:     Linux Ipsec mailing list <linux-ipsec@freeswan.org>,
Subject: FreeS/WAN redesign thoughts (KLIPS, IPSEC)

-----BEGIN PGP SIGNED MESSAGE-----

Here is a third edition of the FreeS/WAN redesign plans.  Please pick
it apart.  Some glaring errors have been fixed.  A thousand pardons
for the potentially multiple posting; I feel like a complete boob for
forgetting to include a subject line...




FreeS/WAN IPSEC -- KLIPS2 DESIGN THOUGHTS
=========================================

Wed Feb 21 02:17:58 EST 2001

This document was originally written 2.5 weeks after OLS2000, inspired
by a meeting with Rusty and Marc in Montreal in November 1999 and two
meetings at OLS2000.

The current reference kernel version is 2.4.0.

The idea is to redesign KLIPS (the kernel parts of FreeS/WAN) for two
purposes: to avoid all the 'stoopid routing tricks' (TM) to which we
have had to resort over the last 2+ years, by disassociating any ipsec
devices from physical devices; and to add a proper SPDB (Security
Policy Database) to do proper incoming IPSEC policy checks.  We are
hoping to use existing pattern-matching tools rather than invent our
own.  NetFilter appears to have all the pattern-matching capabilities
we need, but is limited in other ways.

There is also significant interest in enabling FreeS/WAN to
communicate with routing daemons and to do load sharing and
failover:

	http://www.quintillion.com/fdis/moat/ipsec+routing/

This is an exploratory document.  Please comment, particularly if I have
missed or mis-understood something, to the linux-ipsec,
netfilter-devel or netdev lists.

The basic architecture of NetFilter is:

       --->[1]--->(ROUTE)--->[3]--->[4]--->     where:
                     |            ^             [1] NF_IP_PRE_ROUTING
                     |            |             [2] NF_IP_LOCAL_IN
                     |         (ROUTE)          [3] NF_IP_FORWARD
                     v            |             [4] NF_IP_POST_ROUTING
                    [2]          [5]            [5] NF_IP_LOCAL_OUT
                     |            ^             
                     |            |             
                     v            |             

The basic path through the kernel as it concerns IPSEC for the three
types of packets is as follows:
IN:
	NIC
	sanity check
	NF_IP_PRE_ROUTING
	route-in
	ip-options processing
	defragment
	NF_IP_LOCAL_IN
	layer3demux
	application

FORWARD:
	NIC
	sanity check
	NF_IP_PRE_ROUTING
	routing-in
	ip-options processing
	ttl decrement and check
	NF_IP_FORWARD
	fragment
	NF_IP_POST_ROUTING
	output()
	NIC

OUT:
	application
	layer3mux
	NF_IP_LOCAL_OUT
	route-out
	NF_IP_POST_ROUTING
	output()
	NIC
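As a rough illustration (not FreeS/WAN or kernel code), the three paths can be encoded as the sequence of NetFilter hooks each one crosses. The hook numbers use the same values as the NF_IP_* constants in 2.4's <linux/netfilter_ipv4.h>; enum pkt_path, path_hooks() and path_visits() are hypothetical helpers for this sketch only.

```c
#include <assert.h>
#include <stddef.h>

/* Same values as the NF_IP_* hook numbers in 2.4's <linux/netfilter_ipv4.h>. */
enum nf_ip_hook {
    NF_IP_PRE_ROUTING  = 0,
    NF_IP_LOCAL_IN     = 1,
    NF_IP_FORWARD      = 2,
    NF_IP_LOCAL_OUT    = 3,
    NF_IP_POST_ROUTING = 4
};

/* Hypothetical: the three packet types listed above. */
enum pkt_path { PATH_IN, PATH_FORWARD, PATH_OUT };

/* Writes the hooks crossed on 'path', in order, into 'hooks' (room for 3)
 * and returns how many were written. */
size_t path_hooks(enum pkt_path path, enum nf_ip_hook hooks[3])
{
    assert(hooks != NULL);
    switch (path) {
    case PATH_IN:        /* NIC -> ... -> application */
        hooks[0] = NF_IP_PRE_ROUTING;
        hooks[1] = NF_IP_LOCAL_IN;
        return 2;
    case PATH_FORWARD:   /* NIC -> ... -> NIC */
        hooks[0] = NF_IP_PRE_ROUTING;
        hooks[1] = NF_IP_FORWARD;
        hooks[2] = NF_IP_POST_ROUTING;
        return 3;
    case PATH_OUT:       /* application -> ... -> NIC */
        hooks[0] = NF_IP_LOCAL_OUT;
        hooks[1] = NF_IP_POST_ROUTING;
        return 2;
    }
    return 0;
}

/* Returns 1 if a packet on 'path' crosses 'hook', else 0. */
int path_visits(enum pkt_path path, enum nf_ip_hook hook)
{
    enum nf_ip_hook hooks[3];
    size_t n = path_hooks(path, hooks);

    for (size_t i = 0; i < n; i++)
        if (hooks[i] == hook)
            return 1;
    return 0;
}
```

Every packet arriving from a NIC crosses NF_IP_PRE_ROUTING and every packet leaving through one crosses NF_IP_POST_ROUTING; only forwarded packets cross both.  That is what makes those two hooks natural bracketing points for incoming and outgoing IPSEC checks.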

Destination NAT (port forwarding) is applied in NF_IP_PRE_ROUTING and
NF_IP_LOCAL_OUT, and Source NAT (masquerading) is applied in
NF_IP_POST_ROUTING.  Filtering is applied in NF_IP_LOCAL_IN,
NF_IP_FORWARD and NF_IP_LOCAL_OUT.

Hook processing order would generally be:
NF_IP_PRI_IPSEC_IN?
NF_IP_PRI_CONNTRACK
NF_IP_PRI_IPSEC_IN?
NF_IP_PRI_MANGLE
NF_IP_PRI_NAT_DST
NF_IP_PRI_FILTER
NF_IP_PRI_NAT_SRC
NF_IP_PRI_IPSEC_OUT
Not all modules are present at each hook.  I am still uncertain whether
IPSEC_IN should be before or after CONNTRACK.  Any comments?
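NetFilter calls the functions registered at a hook in ascending order of the priority field in struct nf_hook_ops.  The sketch below sorts the list above using the values the existing modules use in 2.4 (CONNTRACK -200, MANGLE -150, NAT_DST -100, FILTER 0, NAT_SRC 100).  The IPSEC values are hypothetical placeholders, chosen only to show the two candidate positions for IPSEC_IN and a last position for IPSEC_OUT; struct hook_entry, order_hooks() and position_of() are illustrative names.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Priorities of the existing modules, as in 2.4's enum nf_ip_hook_priorities. */
#define PRI_CONNTRACK (-200)
#define PRI_MANGLE    (-150)
#define PRI_NAT_DST   (-100)
#define PRI_FILTER       0
#define PRI_NAT_SRC    100

/* Hypothetical IPSEC priorities; the numbers are placeholders only. */
#define PRI_IPSEC_IN_BEFORE_CT (-250)
#define PRI_IPSEC_IN_AFTER_CT  (-175)
#define PRI_IPSEC_OUT            200

struct hook_entry {
    const char *name;
    int priority;   /* lower values are called first at a given hook */
};

static int by_priority(const void *a, const void *b)
{
    const struct hook_entry *x = a, *y = b;
    return (x->priority > y->priority) - (x->priority < y->priority);
}

/* Sorts entries into the order NetFilter would call them. */
void order_hooks(struct hook_entry *hooks, size_t n)
{
    assert(hooks != NULL || n == 0);
    if (n > 1)
        qsort(hooks, n, sizeof *hooks, by_priority);
}

/* Returns the call position of 'name', or -1 if it is not registered. */
int position_of(const struct hook_entry *hooks, size_t n, const char *name)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(hooks[i].name, name) == 0)
            return (int)i;
    return -1;
}
```

Registering IPSEC_IN with PRI_IPSEC_IN_BEFORE_CT puts it first; switching it to PRI_IPSEC_IN_AFTER_CT moves it to second place, just behind CONNTRACK.  The open question above comes down to choosing that one number.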

- -----------

There is more than one possible approach.  The following list is not
exhaustive.  So far, the first approach is much better thought out and
is preferred.

    --- 1 ---

Treat incoming IPSEC encapsulation as a layer 3 protocol and decapsulate
it at the Layer 3 demultiplexer.

An incoming packet starts off with a sanity check.  It then goes through
all the NF_IP_PRE_ROUTING hooks, starting with the SPDB check.  Since
it is a fresh ESP or AH packet, it will not have any nfmarks, and unless
policy says its outer IP header should already have been processed by
another SG along the way, no policy will apply, so it is let through.

The rest of the NF_IP_PRE_ROUTING hooks may cause it to
be DNATed and defragmented.  It then goes through routing which thinks
it is a local packet, deals with any outer header IP options, then
defragmentation and NF_IP_LOCAL_IN filter (allow ESP,AH) before getting
to ipsec_rcv() where the outer bundle is authenticated and decrypted and
nfmarked to indicate what decapsulation happened before being passed
back to netif_rx().  The next IP header is now visible.  The packet now
gets re-injected at the beginning.  It goes through the incoming sanity
check again, getting checked at NF_IP_PRE_ROUTING for policy using
previously set nfmark from decryption.  It may again be DNATed and
defragmented.  Routing looks at the now-visible next IP header and
routes it locally or via the forward hook.
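A toy model of this re-injection loop, under heavy simplifying assumptions: a packet is an array of IP protocol numbers (innermost first), each loop iteration stands for one pass from netif_rx() through NF_IP_PRE_ROUTING, routing, NF_IP_LOCAL_IN and ipsec_rcv(), and nfmark is treated as a bitmask with hypothetical MARK_ESP and MARK_AH bits.  DNAT, defragmentation, the forward hook and a real SPDB lookup are left out; struct pkt and receive() are illustrative only.

```c
#include <assert.h>
#include <stdint.h>

/* IP protocol numbers (IANA): TCP = 6, ESP = 50, AH = 51. */
enum proto_num { P_TCP = 6, P_ESP = 50, P_AH = 51 };

/* Hypothetical nfmark bits recording which decapsulations happened. */
#define MARK_ESP 0x1u
#define MARK_AH  0x2u

/* Toy packet: layers[0] is the innermost payload, layers[top] the
 * protocol currently visible after the outermost IP header. */
struct pkt {
    enum proto_num layers[4];
    int top;
    uint32_t nfmark;
};

enum verdict { V_ACCEPT, V_DROP };

/* Each loop iteration models one pass from netif_rx() onwards.
 * 'required' is the set of mark bits policy demands on the cleartext. */
enum verdict receive(struct pkt *p, uint32_t required)
{
    for (;;) {
        assert(p->top >= 0 && p->top < 4);
        enum proto_num proto = p->layers[p->top];

        if (proto != P_ESP && proto != P_AH)
            /* Next header is cleartext: SPDB check at NF_IP_PRE_ROUTING
             * using the nfmark set during earlier decapsulation. */
            return (p->nfmark & required) == required ? V_ACCEPT : V_DROP;

        /* Fresh ESP/AH: passes PRE_ROUTING and LOCAL_IN, reaches
         * ipsec_rcv(), which decapsulates and records it in nfmark. */
        p->nfmark |= (proto == P_ESP) ? MARK_ESP : MARK_AH;
        if (p->top == 0)
            return V_DROP;   /* nothing inside the outer bundle */
        p->top--;            /* next IP header now visible; re-inject */
    }
}
```

A packet that arrives in cleartext where policy requires ESP reaches the policy check with no ESP bit set in nfmark and is dropped.  An AH-in-ESP packet takes two trips round the loop before its payload is checked.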

If it is a local packet, IP options and defragmentation are processed.
NF_IP_LOCAL_IN then gets to check filtering policy for other L3
protocols.  If it is the endpoint for multiple bundles, it is sent back
to netif_rx(), having exposed the next IP header.

If it is not a local packet, routing has selected a route, potentially
through an existing virtual IPSEC device, one per connection, not per
physical I/F.  IP options and TTL are processed before being filtered at
NF_IP_FORWARD, fragmented, then sent to NF_IP_POST_ROUTING.

If it is a locally generated packet, it would go through normal
filtering at NF_IP_LOCAL_OUT, then go through routing, then go to
NF_IP_POST_ROUTING.

At NF_IP_POST_ROUTING, an IPSEC matching module would make a decision
about the fate of the packet.  It would have several possible targets.
ACCEPT would allow the packet through with no processing.  ENCRYPT would
steal the packet.  If the SA(s) do not exist, it would send an ACQUIRE
up to all listening key management daemons and stash the last copy of
the packet, waiting for the appropriate SA(s).  Once the SA(s) are
available, it encrypts the packet, sets nfmark to indicate what
processing happened, and re-injects the packet at NF_IP_LOCAL_OUT
(since the packet now appears to originate from this host).  The packet
would then be routed and sent back to NF_IP_POST_ROUTING.  If no new
nfmark is generated, the IPSEC module would ACCEPT it.  DROP would drop
the packet if previous attempts at opportunistic encryption failed and
the default policy was to block non-IPSEC packets.
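The NF_IP_POST_ROUTING target logic can be sketched as a single decision function.  Everything here is hypothetical: the SADB lookup is reduced to a boolean, re-injection at NF_IP_LOCAL_OUT to a counter, and "stash the last copy of the packet" is interpreted as keeping only the most recent waiting packet.  A real implementation would also free the displaced packet and would probably rate-limit ACQUIREs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum policy { POL_ACCEPT, POL_ENCRYPT, POL_DROP };
enum nf_verdict { VERDICT_ACCEPT, VERDICT_DROP, VERDICT_STOLEN };

#define MARK_ENCRYPTED 0x1u   /* hypothetical nfmark bit set after encryption */

struct packet { int id; uint32_t nfmark; };

/* Hypothetical per-policy state standing in for the SADB and key daemons. */
struct sa_state {
    bool sa_available;     /* result of the SADB lookup */
    struct packet *stash;  /* most recent packet waiting for the SA(s) */
    int acquires_sent;     /* ACQUIRE messages sent to key daemons */
    int reinjected;        /* packets re-injected at NF_IP_LOCAL_OUT */
};

/* Encrypt, mark and hand back to NF_IP_LOCAL_OUT (modelled by a counter). */
static void encrypt_and_reinject(struct packet *pkt, struct sa_state *st)
{
    pkt->nfmark |= MARK_ENCRYPTED;
    st->reinjected++;
}

enum nf_verdict ipsec_post_routing(enum policy pol, struct packet *pkt,
                                   struct sa_state *st)
{
    assert(pkt != NULL && st != NULL);
    if (pkt->nfmark & MARK_ENCRYPTED)
        return VERDICT_ACCEPT;        /* second pass: already processed */

    switch (pol) {
    case POL_ACCEPT:
        return VERDICT_ACCEPT;
    case POL_DROP:
        return VERDICT_DROP;
    case POL_ENCRYPT:
        if (!st->sa_available) {
            st->acquires_sent++;      /* ask key management for SA(s) */
            st->stash = pkt;          /* replaces any earlier stashed packet */
        } else {
            encrypt_and_reinject(pkt, st);
        }
        return VERDICT_STOLEN;        /* either way the packet is stolen */
    }
    return VERDICT_DROP;
}

/* Called when key management installs the missing SA(s). */
void ipsec_sa_installed(struct sa_state *st)
{
    st->sa_available = true;
    if (st->stash != NULL) {
        encrypt_and_reinject(st->stash, st);
        st->stash = NULL;
    }
}
```

The nfmark check at the top is what stops the re-injected packet from being stolen a second time when it comes back through NF_IP_POST_ROUTING.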

A packet routed through an optional IPSEC virtual I/F simply gets
assigned a specific source address and has the nfmark preloaded.  Does
this sound correct?


The way that nfmark is used is rather vague.  It is presently only 32
bits.  Ideally, I would like to be able to indicate exactly which SAs
were processed on the way in, which would most easily be represented as
up to 4 SAs (AH, ESP, IPCOMP, IPIP), each having an 8-bit protocol
field (absolute minimum of 2 bits), a 32-bit destination address field
(for IPv4; IPv6 would need 128 bits) and a 32-bit SPI.  That is 4 x 72 =
288 bits for IPv4 and a potential maximum of 4 x 168 = 672 bits for
IPv6.  Some way of mapping up to 672 bits onto the 32 bits available
would be required to use this.  A lookup table could be used
to map nfmarks to SAIDs, not the SAs themselves, since the SAs could
disappear at any time the tdb table is not locked.  It should be able to
represent a bundle of SAs where one SA could be used in more than one
bundle.  There could also be more than one right answer for the incoming
SPDB.  I have an idea of how to accomplish this: extend nfmark by
converting it into a list of nfmark structures, each containing a
pointer to the next item on the list, a cookie identifying the specific
netfilter function that owns the data, and a pointer to a data
structure.
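The bit counts and the proposed list can be written down as C types.  struct said follows the fields described above (IPv4 destination only, for brevity); struct said_bundle, struct nfmark_ext and nfmark_ext_find() are hypothetical names for the list of nfmark structures, and any owner cookie values are arbitrary.  The bundle stores SAIDs rather than pointers to SAs, since the SAs could disappear whenever the tdb table is unlocked.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bits needed for one SAID: 8-bit protocol + destination + 32-bit SPI. */
#define SAID_BITS(addr_bits) (8 + (addr_bits) + 32)

/* A bundle of up to 4 SAs (AH, ESP, IPCOMP, IPIP), checked at compile time. */
_Static_assert(4 * SAID_BITS(32) == 288, "IPv4 bundle");
_Static_assert(4 * SAID_BITS(128) == 672, "IPv6 bundle");

/* One SAID as described above (IPv4 destination only). */
struct said {
    uint8_t  proto;   /* IP protocol number of the SA */
    uint32_t dst;     /* destination address */
    uint32_t spi;     /* security parameter index */
};

/* Hypothetical: what the IPSEC owner's data pointer would refer to. */
struct said_bundle {
    size_t n;              /* number of SAIDs in use, at most 4 */
    struct said said[4];
};

/* Hypothetical: one item in the proposed list replacing the flat nfmark. */
struct nfmark_ext {
    struct nfmark_ext *next;  /* next item on the list, or NULL */
    uint32_t owner;           /* cookie of the owning netfilter function */
    void *data;               /* owner-specific data */
};

/* Returns the first item belonging to 'owner', or NULL if there is none. */
struct nfmark_ext *nfmark_ext_find(struct nfmark_ext *head, uint32_t owner)
{
    for (; head != NULL; head = head->next)
        if (head->owner == owner)
            return head;
    return NULL;
}
```

Each netfilter function only looks for its own cookie, so several modules can hang data off the same packet without agreeing on a shared layout for a single 32-bit value.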

nfmark may not be the right tool for this.  Another possible
solution is to add a member to the struct sk_buff to point to this
information.  This has the benefit of not depending on anyone else, but
the drawback of needing to patch a header file *and recompile the
entire kernel*.

The SADB would be managed via the PF_KEYv2 socket I/F.

The SPDB would be managed via a combination of PF_KEYv2 socket I/F
extensions and iptables.  A separate NetFilter table called 'ipsec'
(as opposed to 'filter' or 'nat') would have the first hook at
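NF_IP_PRE_ROUTING and the last hook at NF_IP_POST_ROUTING.  (iptables
itself configures its tables with getsockopt()/setsockopt() on a raw
AF_INET socket rather than through the AF_NETLINK socket family.)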
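In 2.4 each iptables table declares the hooks it attaches to as a valid_hooks bitmask indexed by hook number; the existing 'filter' table, for example, sets the LOCAL_IN, FORWARD and LOCAL_OUT bits.  Reading "first hook at NF_IP_PRE_ROUTING, last at NF_IP_POST_ROUTING" as attaching at all five hooks, a hypothetical 'ipsec' table would simply set every bit.  IPSEC_VALID_HOOKS and table_has_hook() are illustrative names, not existing kernel symbols.

```c
#include <assert.h>

/* Hook numbers as in 2.4's <linux/netfilter_ipv4.h>. */
#define NF_IP_PRE_ROUTING  0
#define NF_IP_LOCAL_IN     1
#define NF_IP_FORWARD      2
#define NF_IP_LOCAL_OUT    3
#define NF_IP_POST_ROUTING 4
#define NF_IP_NUMHOOKS     5

/* The bitmask the existing 'filter' table registers with. */
#define FILTER_VALID_HOOKS ((1 << NF_IP_LOCAL_IN) | (1 << NF_IP_FORWARD) | \
                            (1 << NF_IP_LOCAL_OUT))

/* Hypothetical 'ipsec' table: every hook from PRE_ROUTING to POST_ROUTING. */
#define IPSEC_VALID_HOOKS ((1u << NF_IP_NUMHOOKS) - 1)

/* Returns 1 if a table with 'valid_hooks' is attached at 'hook'. */
int table_has_hook(unsigned int valid_hooks, int hook)
{
    assert(hook >= 0 && hook < NF_IP_NUMHOOKS);
    return (int)((valid_hooks >> hook) & 1u);
}
```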

- -----------


     --- 2 ---

Treat incoming IPSEC encapsulation as an enhancement of the layer 2
protocol and decapsulate it at the NF_IP_PRE_ROUTING hook.  This option
is less favourable as it stands since it involves creating our own SPDB
engine.

An incoming packet starts off with a sanity check.  It then goes through
the NF_IP_PRE_ROUTING match hook for IPSEC, which would be the first in
priority, matching every single packet to force it through a policy
check.  If it was an ESP or AH packet with a local destination address,
it would then be sent to ipsec_rcv() and the first bundle
would be processed, keeping state until that bundle is completely
processed.  At this point the incoming SPDB would be checked to ensure
that the proper policy had been applied to it.  If there is another
bundle inside with an ESP or AH header, that bundle is processed,
storing the new and old state.  This SPDB check would not be
iptables-based since we have already gone through the match and target
hooks and would have too much state to store in nfmark.  The result of
the SPDB check would be ACCEPT or DROP (It could also be STOLEN or
QUEUEd at this point for opportunistic encryption).
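A minimal model of this nested processing, assuming each bundle can be summarized by the SPI of its outer SA and that a hypothetical SPDB entry simply lists the SPI required at each nesting level.  Each processed bundle is pushed onto a state stack, keeping the old state alongside the new, and the whole stack is checked after every bundle.  Only ACCEPT and DROP are produced; STOLEN and QUEUE are listed but not modelled.  All type and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NEST 4

enum spdb_verdict { SPDB_ACCEPT, SPDB_DROP, SPDB_STOLEN, SPDB_QUEUE };

/* Hypothetical: one processed bundle, summarized by its outer SA's SPI. */
struct bundle_state { unsigned int spi; };

/* State kept while nested bundles are unwrapped; older entries stay. */
struct rx_state {
    struct bundle_state stack[MAX_NEST];
    size_t depth;
};

/* Hypothetical SPDB entry: the SPI required at each nesting level. */
struct spdb {
    unsigned int allowed[MAX_NEST];
    size_t levels;
};

/* Checks every bundle processed so far against the SPDB entry. */
static int spdb_permits(const struct spdb *db, const struct rx_state *st)
{
    if (st->depth > db->levels)
        return 0;
    for (size_t i = 0; i < st->depth; i++)
        if (db->allowed[i] != st->stack[i].spi)
            return 0;
    return 1;
}

/* 'spis' lists the bundles from outermost to innermost. */
enum spdb_verdict pre_routing_ipsec(const unsigned int *spis, size_t n,
                                    const struct spdb *db, struct rx_state *st)
{
    assert(db->levels <= MAX_NEST);
    st->depth = 0;
    for (size_t i = 0; i < n; i++) {
        if (st->depth == MAX_NEST)
            return SPDB_DROP;                    /* nested too deeply */
        st->stack[st->depth++].spi = spis[i];    /* new state on top of old */
        if (!spdb_permits(db, st))
            return SPDB_DROP;
    }
    /* No further ESP/AH inside: all required levels must have been seen. */
    return st->depth == db->levels ? SPDB_ACCEPT : SPDB_DROP;
}
```

Because this check runs after the iptables match and target hooks and needs the whole stack, it illustrates why the text expects this SPDB to live outside iptables and nfmark.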

The SADB and SPDB entries would be managed via the extended PF_KEYv2
socket I/F.

The rest of the NF_IP_PRE_ROUTING hooks may cause it to
be DNATed and defragmented.  It then gets routed.

For local packets, inner IP options and defragmentation are processed.
NF_IP_LOCAL_IN then gets to check filtering policy for layer 3
protocols.

For non-local packets, IP options and TTL are processed before being
filtered at NF_IP_FORWARD, then fragmented.  Packets then go through
the NF_IP_POST_ROUTING hooks, potentially for SNAT, after which the
last hook would force all packets to go through the IPSEC outgoing
processing module.  Here outgoing policy would be checked, again not
necessarily by iptables.  A result could be ACCEPT, DROP or STOLEN.
With STOLEN, encryption and authentication would be applied as
available and the result re-injected at NF_IP_LOCAL_OUT, since it would
now have a local source address, potentially a different destination
address, and would need to be re-routed.  A mechanism would need to be
used here to prevent recursion.

- ------------------

If there are any other directions we should be considering, please
suggest...


	slainte mhath, RGB
- -- 
Richard Guy Briggs -- PGP key available            Auto-Free Ottawa! Canada
<www.conscoop.ottawa.on.ca/rgb/>                       <www.flora.org/afo/>
Prevent Internet Wiretapping!        --        FreeS/WAN:<www.freeswan.org>
Thanks for voting Green! -- <green.ca>      Marillion:<www.marillion.co.uk>

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQCVAwUBOpNxS9+sBuIhFagtAQE8UAP/WF4OwXopq7HhJPSuK5a8XyiZSUJpQcbC
IHefyFMFzswQAJDAu4JrRIWevwHPWTrm5PZ7zsALkQM0WwbcRCz8uueItcg2sKmS
aMfp1dbbMlmgPk1HTwIDBaeHOEIf8yyyy4S6W0gIyb8x4mdI4nx0zbEbNPXkjG/H
gB9G69Fod+M=
=wdwQ
-----END PGP SIGNATURE-----