From:	 Patrick Mochel <mochelp@infinity.powertie.org>
To:	 linux-kernel@vger.kernel.org
Subject: [RFC] New Driver Model for 2.5
Date:	 Wed, 17 Oct 2001 16:52:29 -0700 (PDT)


One July afternoon, while hacking on the pm_dev layer for the purpose of
system-wide power management support, I decided that I was quite tired of
trying to make this layer look like a tree and feel like a tree, but not
have any real integration with the actual device drivers.

I had read the accounts of what the goals were for 2.5. And, after some
conversations with Linus and the (gasp) ACPI guys, I realized that I had a
good chunk of the infrastructural code written; it was a matter of working
out a few crucial details and massaging it in nicely.

I have had the chance this week (after moving and vacationing) to update
the (read: write some) documentation for it. I will not go into details,
and will let the document speak for itself. 

With any luck, this should go into the early stages of 2.5, and allow a
significant cleanup of many drivers. Such a model will also allow for neat
tricks like full device power management support, and Plug N Play
capabilities.


In order to support the new driver model, I have written a small in-memory
filesystem, called ddfs, to export a unified interface to userland. It is
mentioned in the doc, and is pretty self-explanatory. More information
will be available soon.


There is code available for the model and ddfs at:

http://kernel.org/pub/linux/kernel/people/mochel/device/

but there are some fairly large caveats concerning it. 

First, I feel comfortable with the device layer code and the ddfs
code. The PCI code, though, is still a work in progress; I am still working
out some of the finer details concerning it.

Next is the environment under which I developed it all. It was on an ia32
box, with only PCI support, and using ACPI. The latter didn't have too
much of an effect on the development, but there are a few items explicitly
inspired by it.

I am hoping that both the PCI code and the structure in general can be
further improved based on the input of the driver maintainers.


This model is not final, and may be way off from what most people actually
want. It has gotten a tentative blessing from all those who have seen it,
though they number but a few. It's definitely not the only solution...

That said, enjoy; and have at it.

	-pat


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The (New) Linux Kernel Driver Model

Version 0.01 

17 October 2001 


Overview
~~~~~~~~

This driver model is a unification of the disparate driver models currently 
in the kernel. It is intended to augment the bus-specific drivers for bridges 
and devices by consolidating a set of data and operations into globally 
accessible data structures. 

Current driver models implement some sort of tree-like structure (sometimes 
just a list) for the devices they control. But, there is no linkage between 
the different bus types. 

A common data structure can provide this linkage with little overhead: when a 
bus driver discovers a particular device, it can insert it into the global 
tree as well as its local tree. In fact, the local tree becomes just a subset 
of the global tree. 
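
As a sketch of what that might look like in practice - device_register() and 
pci_bus_devices are hypothetical names for this illustration, and the 
embedded struct device is described under "Downstream Access" below:

	struct pci_dev *pdev;

	pdev = kmalloc(sizeof(*pdev), GFP_KERNEL);
	if (!pdev)
		return -ENOMEM;
	memset(pdev, 0, sizeof(*pdev));

	/* Fill in the common fields from bus-specific probing. */
	strcpy(pdev->device.name, "Intel EEPro 100");
	strcpy(pdev->device.bus_id, "00:04.00");

	/* One call inserts the device into the global tree... */
	device_register(&pdev->device);

	/* ...and the bus's local list becomes a subset of that tree. */
	list_add_tail(&pdev->device.bus_list, &pci_bus_devices);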

Common data fields can also be moved out of the local bus models into the 
global model. Some of the manipulation of these fields can also be 
consolidated. Most likely, manipulation functions will become a set 
of helper functions, which the bus drivers wrap to add any bus-specific 
handling.

The common device and bridge interface currently reflects the goals of the 
modern PC: namely the ability to do seamless Plug and Play, power management, 
and hot plug. (The model dictated by Intel and Microsoft (read: ACPI) assures 
us that any device in the system may fit any of these criteria.)

In reality, not every bus will be able to support such operations. But, most 
buses will support a majority of those operations, and all future buses will. 
In other words, a bus that doesn't support an operation is the exception, 
instead of the other way around.


Drivers
~~~~~~~

The callbacks for bridges and devices are intended to be singular for a 
particular type of bus. For each type of bus that has support compiled into 
the kernel, there should be one statically allocated structure with the 
appropriate callbacks that all devices (or bridges) of that type share. 

Each bus layer should implement the callbacks for these drivers. It then 
forwards the calls on to the device-specific callbacks. This means that 
device-specific drivers must still implement callbacks for each operation. 
But, they are not called from the top-level driver layer.

This does add another layer of indirection for calling one of these functions, 
but the benefits are believed to outweigh the slowdown.

First, it prevents device-specific drivers from having to know about the 
global device layer. This speeds up integration incredibly. It also 
allows drivers to be more portable across kernel versions. Note that the 
former was intentional, the latter is an added bonus. 

Second, this added indirection allows the bus to perform any additional logic 
necessary for its child devices. A bus layer may add additional information to 
the call, or translate it into something meaningful for its children. 

This could be done in the driver, but if it happens for every object of a 
particular type, it is best done at a higher level. 
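
As an illustration, the PCI layer's implementation of the global suspend 
callback might look something like this. (The helper names are assumptions 
made for the sketch, not the posted code; to_pci_dev() is shown under 
"Downstream Access" below.)

	static int pci_device_suspend(struct device *dev, u32 state)
	{
		struct pci_dev *pci_dev = to_pci_dev(dev);

		/* Bus-specific logic, e.g. saving PCI config space. */
		pci_save_config(pci_dev);

		/* Forward the call on to the device-specific driver. */
		if (pci_dev->driver && pci_dev->driver->suspend)
			return pci_dev->driver->suspend(pci_dev, state);
		return 0;
	}

	/* The one statically allocated driver structure for all PCI devices. */
	static struct device_driver pci_device_driver = {
		suspend:	pci_device_suspend,
		/* ...the remaining callbacks, implemented similarly... */
	};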

Recap
~~~~~

Instances of devices and bridges are allocated dynamically as the system 
discovers their existence. Their fields describe the individual object. 
Drivers - in the global sense - are statically allocated and singular for a 
particular type of bus. They describe a set of operations that every type of 
bus could implement, the implementation following the bus's semantics.


Downstream Access
~~~~~~~~~~~~~~~~~

Common data fields have been moved out of individual bus layers into a common 
data structure. But, these fields must still be accessed by the bus layers, 
and sometimes by the device-specific drivers. 

Other bus layers are encouraged to do what has been done for the PCI layer. 
struct pci_dev now looks like this:

struct pci_dev {
	...

	struct device device;
};

Note first that the struct device is embedded directly, rather than referenced 
by pointer. This means only one allocation on device discovery. Note also that 
it is at the _end_ of struct pci_dev. This is to make people think about what 
they're doing when switching between the bus driver and the global driver, and 
to guard against mindless casts between the two.
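
Because struct device sits at the end of struct pci_dev, converting between 
the two takes explicit pointer arithmetic rather than a straight cast. A 
hypothetical helper, in the spirit of the list_entry() macro, might look like:

	/* Illustrative only; the posted code may name or do this differently. */
	#define to_pci_dev(devp) \
		((struct pci_dev *)((char *)(devp) - offsetof(struct pci_dev, device)))

The PCI layer would use this wherever a global-layer callback hands it a bare 
struct device pointer.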

The PCI bus layer freely accesses the fields of struct device. It knows about 
the structure of struct pci_dev, and it should know the structure of struct 
device. PCI drivers that have been converted generally do not touch the fields 
of struct device. More precisely, device-specific drivers should not touch the 
fields of struct device unless there is a strong, compelling reason to do so.

This abstraction prevents unnecessary pain during transitional phases. If the 
name of a field changes or a field is removed, then every downstream driver 
will break. On the other hand, if only the bus layer (and not the device 
layer) accesses struct device, it is only the bus layers that need to change.


User Interface
~~~~~~~~~~~~~~

By virtue of having a complete hierarchical view of all the devices in the 
system, exporting a complete hierarchical view to userspace becomes relatively 
easy. Whenever a device is inserted into the tree, a file or directory can be 
created for it.

In this model, a directory is created for each bridge and each device. When it 
is created, it is populated with a set of default files, first at the global 
layer, then at the bus layer. The device layer may then add its own files. 

These files export data about the driver and can be used to modify behavior of 
the driver or even device.

For example, at the global layer, a file named 'status' is created for each 
device. When read, it reports to the user the name of the device, its bus ID, 
its current power state, and the name of the driver it's using. 

Writing to this file gives control over the device. For example, writing 
"suspend 3" to this file places the device into power state 3. Basically, by 
writing to this file, the user has access to the operations defined in 
struct device_driver. 
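
A sketch of what the global layer's write callback for 'status' might look 
like (the parsing and function names here are assumptions made for the 
example):

	static ssize_t status_write(struct device *dev, const char *buf,
				    size_t count)
	{
		/* "suspend <n>" forwards to the driver's suspend operation. */
		if (!strncmp(buf, "suspend ", 8) && dev->driver &&
		    dev->driver->suspend)
			dev->driver->suspend(dev,
					     simple_strtoul(buf + 8, NULL, 0));
		return count;
	}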

The PCI layer also adds default files. For devices, it adds a "resource" file 
and a "wake" file. The former reports the BAR information for the device; the 
latter reports the wake capabilities of the device. 

The device layer could also add files for device-specific data reporting and 
control. 

The dentry for the device's directory is kept in struct device. It also keeps 
a linked list of all the files in the directory, with pointers to their read 
and write callbacks. This allows the driver layer to maintain full control of 
its destiny. If it desires to override the default behavior of a file, or 
simply remove it, it can easily do so. (It is assumed that the files added 
upstream will always be a known quantity.)
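
One plausible shape for those per-file entries (an illustrative guess, not 
the actual ddfs structures):

	struct device_file {	/* hypothetical name */
		struct list_head	node;	/* entry in the device's 'files' list */
		struct dentry		*dentry;
		ssize_t (*read) (struct device *dev, char *buf,
				 size_t count, loff_t *off);
		ssize_t (*write)(struct device *dev, const char *buf,
				 size_t count, loff_t *off);
	};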

These features were initially implemented using procfs. However, after one 
conversation with Linus, a new filesystem - ddfs - was created to implement 
these features. It is an in-memory filesystem, based heavily on ramfs, 
though it uses procfs as inspiration for its callback functionality.


Device Structures
~~~~~~~~~~~~~~~~~

struct device {
	struct list_head 	bus_list;
	struct io_bus   	*parent;
	struct io_bus   	*subordinate;

	char    		name[DEVICE_NAME_SIZE];
	char    		bus_id[BUS_ID_SIZE];

	struct dentry   	*dentry;
	struct list_head        files;

	struct 	semaphore       lock;

	struct device_driver 	*driver;
	void            	*driver_data;
	void    		*platform_data;

	u32             	current_state;
	unsigned char 		*saved_state;
};

bus_list: 
	List of all devices on a particular bus; i.e. the device's siblings.

parent:
	The parent bridge for the device.

subordinate:
	If the device is a bridge itself, this points to the struct io_bus that is
	created for it.

name:
	Human readable (descriptive) name of device. E.g. "Intel EEPro 100"

bus_id:
	Parsable (yet ASCII) bus id. E.g. "00:04.00" (PCI Bus 0, Device 4, Function
	0). It is necessary to have a searchable bus id for each device; making it
	ASCII allows us to use it for its directory name without translating it.

dentry:
	Pointer to driver's ddfs directory.

files:
	Linked list of all the files that a driver has in its ddfs directory.

lock:
	Driver specific lock.

driver:
	Pointer to a struct device_driver, the common operations for each device. See
	next section.

driver_data:
	Private data for the driver.
	Much like the PCI implementation of this field, this allows device-specific
	drivers to keep a pointer to device-specific data. (See the sketch after
	this list.)

platform_data:
	Data that the platform (firmware) provides about the device. 
	For example, the ACPI BIOS or EFI may have additional information about the
	device that is not directly mappable to any existing kernel data structure. 
	It also allows the platform driver (e.g. ACPI) to pass information to a 
	driver without the driver having explicit knowledge of (atrocities like) 
	ACPI. 


current_state:
	Current power state of the device. For PCI and other modern devices, this is
	0-3, though it's not necessarily limited to those values.

saved_state:
	Pointer to driver-specific set of saved state. 
	Having it here allows modules to be unloaded on system suspend and reloaded
	on resume while maintaining state across transitions.
	It also allows generic drivers to maintain state across system state
	transitions. 
	(I've implemented a generic PCI driver for devices that don't have a
	device-specific driver. Instead of managing some vector of saved state
	for each device the generic driver supports, it can simply store it here.)
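
A quick illustration of how a device-specific driver might use driver_data 
(all names below are invented for the example):

	struct eepro100_priv {
		long	ioaddr;
		/* ...other device-specific state... */
	};

	static int eepro100_probe(struct device *dev)
	{
		struct eepro100_priv *priv;

		priv = kmalloc(sizeof(*priv), GFP_KERNEL);
		if (!priv)
			return -ENOMEM;
		dev->driver_data = priv;	/* private to this driver */
		return 0;
	}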



struct device_driver {
        int     (*probe)        (struct device *dev);
        int     (*remove)       (struct device *dev);

        int     (*init)		(struct device *dev);
        int     (*shutdown)	(struct device *dev);

        int     (*save_state)   (struct device *dev, u32 state);
        int     (*restore_state)(struct device *dev);

        int     (*suspend)      (struct device *dev, u32 state);
        int     (*resume)       (struct device *dev);
};


probe:
	Check for device existence and associate driver with it. 

remove:
	Dissociate driver from device. Releases the device so that it can be used by
	another driver. Also, if it is a hotplug device (hotplug PCI, CardBus), an 
	ejection event could take place here.

init:
	Initialise the device - allocate resources, irqs, etc. 

shutdown:
	"De-initialise" the device - release resources, free memory, etc.

save_state:
	Save current device state before entering suspend state.

restore_state:
	Restore device state, after coming back from suspend state.

suspend:
	Physically enter suspend state.

resume:
	Physically leave suspend state and re-initialise hardware.


Initially, the probe/remove sequence followed the PCI semantics exactly, but 
it has since been broken up into a four-stage process: probe(), remove(), 
init(), and shutdown().

While it's not entirely necessary in all environments, breaking them up so 
each routine does only one thing makes sense. 

Hot-pluggable devices may also benefit from this model, especially ones that 
can be subjected to surprise removals - only the remove function would be 
called, and the driver could easily know whether there was still hardware 
there to shut down. 

Drivers controlling failing or buggy hardware can also benefit, by allowing 
the user to trigger removal of the driver from userspace without trying to 
shut down the device. 

In each case that remove() is called without a shutdown(), it's important to 
note that resources will still need to be freed; it's only the hardware that 
cannot be assumed to be present.
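
A skeleton of how the four stages might divide the work (the bodies below 
are illustrative assumptions, not the posted code):

	static int sample_probe(struct device *dev)
	{
		/* Verify this is hardware we drive, and claim it. */
		return 0;
	}

	static int sample_init(struct device *dev)
	{
		/* Allocate resources, request irqs, etc. */
		return 0;
	}

	static int sample_shutdown(struct device *dev)
	{
		/* Quiesce the hardware and release its resources. */
		return 0;
	}

	static int sample_remove(struct device *dev)
	{
		/* Free software state; the hardware may already be gone. */
		return 0;
	}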


Suspend/resume transitions are broken into four stages as well to provide 
graceful recovery from a failed suspend attempt, and to ensure that state gets 
stored in a non-volatile location before the system (and its devices) are 
suspended.

When a suspend transition is triggered, the device tree is walked first to 
save the state of all the devices in the system. Once this is complete, the 
saved state, now residing in memory, can be written to some non-volatile 
location, like a disk partition or network location. 

The device tree is then walked again to suspend all of the devices. This 
guarantees that the device controlling the location to write the state is 
still powered on while you have a snapshot of the system state. 

If a device is in a critical I/O transaction, or for some other reason cannot 
stand to be suspended, it can notify the kernel by failing in the save-state 
step. At this point, state can either be restored or dropped for all the 
devices that had already been touched, and execution may resume. No devices 
will have been powered off at this point, making it much easier to recover.
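
A rough sketch of that sequence, using hypothetical tree-walk helpers:

	int suspend_system(u32 state)
	{
		int error;

		/* Pass 1: every device saves its state. Nothing is powered
		 * down yet, so a failure here is cheap to unwind. */
		error = device_tree_walk(do_save_state, state);
		if (error) {
			device_tree_walk(do_restore_state, 0);
			return error;
		}

		/* The saved image goes somewhere non-volatile: a disk
		 * partition, a network location, etc. */
		write_saved_image();

		/* Pass 2: walk the tree again and actually suspend. */
		return device_tree_walk(do_suspend, state);
	}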

The resume transition is broken up into two steps mainly to stress the 
singularity of each step: resume() powers on the device and reinitialises it; 
restore_state() restores the device and bus-specific registers of the device.
resume() will happen with interrupts disabled; restore_state() with them 
enabled.


Bus Structures
~~~~~~~~~~~~~~

struct io_bus {
	struct	list_head 	node;
	struct 	io_bus 		*parent;
	struct 	list_head 	children;
	struct 	list_head 	devices;

	struct 	list_head 	bus_list;

	struct 	device 		*self;
	struct 	dentry 		*dentry;
	struct 	list_head 	files;

	char    name[DEVICE_NAME_SIZE];
	char    bus_id[BUS_ID_SIZE];

	struct  bus_driver	*driver;
};

node:
	Bus's node in sibling list (its parent's list of child buses).

parent:
	Pointer to parent bridge.

children:
	List of subordinate buses. 
	This corresponds to the 'node' field in each child bus.

devices:
	List of devices on the bus this bridge controls.
	This field corresponds to the 'bus_list' field in each child device.

bus_list:
	Each type of bus keeps a list of all bridges that it finds. This is the 
	bridge's entry in that list.

self:
	Pointer to the struct device for this bridge.

dentry:
	Every bus also gets a ddfs directory in which to add files, as well as
	child device directories. Actually, every bridge will have two directories -
	one for the bridge device, and one for its subordinate bus.

files:
	Each bus also gets a list of the files that are in the ddfs directory, for
	the same reasons as the devices - to have explicit control over the behavior
	and easy access to each file that any higher layers may have added.

name:
	Human readable ASCII name of bus.

bus_id:
	Machine readable (though ASCII) description of position on parent bus.

driver:
	Pointer to operations for bus.


struct bus_driver {
	char    name[16];
	struct  list_head node;
	int     (*scan)         (struct io_bus*);
	int     (*rescan)       (struct io_bus*);
	int     (*add_device)   (struct io_bus*, char*);
	int     (*remove_device)(struct io_bus*, struct device*);
	int     (*add_bus)      (struct io_bus*, char*);
	int     (*remove_bus)   (struct io_bus*, struct io_bus*);
};

name:
	ASCII name of bus.

node:
	List of buses of this type in system.

scan:
	Search the bus for devices. This is meant to be done only once - when the 
	bridge is initially discovered.

rescan:
	Search the bus again and look for changes. I.e. check for device insertion or 
	removal.

add_device:
	Trigger a device insertion at a particular location.

remove_device:
	Trigger the removal of a particular device.

add_bus:
	Trigger insertion of a new bridge device (and child bus) at a particular
	location on the bus.

remove_bus:
	Remove a particular bridge and subordinate bus.
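
For concreteness, a bus type's singular bus_driver might be set up like this 
(a minimal sketch; the stub and its body are assumptions):

	static int sample_scan(struct io_bus *bus)
	{
		/* Walk the bus once, registering a struct device for each
		 * device found. */
		return 0;
	}

	static struct bus_driver sample_bus_driver = {
		name:	"sample",
		scan:	sample_scan,
		/* rescan, add_device, etc., filled in as the bus supports
		 * them. */
	};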





