Date: Sun, 29 Oct 2000 15:15:57 +0100 (CET)
From: Andreas Gruenbacher <ag@bestbits.at>
To: "Stephen C. Tweedie" <sct@redhat.com>,
Subject: [PROPOSAL2] Extended Attributes

Hi again,

There were some good arguments for adding a few more features to the EA
interface. This new proposal reflects some of the discussion.

I still decline to support forks/streams through the EA interface. IMHO
that's just the wrong way to go. This doesn't preclude an EA
implementation on top of streams, of course.

The interface described here also doesn't include Stephen's idea to
allow an ordered list of EA's under the same name. In addition to the
append and prepend operations Stephen suggested, a whole range of other
operations (get/delete/... by index, etc.) might make sense, and stuff
like that could well be added. However, it would complicate the
semantics even further. I'd really like to learn more about the
requirements for that. Stephen, do you have any good pointers?

We have also been discussing how to support different EA namespaces.
Stephen's approach was to use an integer namespace id to specify the
namespace, while my approach was to use a textual prefix to the EA name.
While those approaches are semantically equivalent, I have been
convinced that an integer specifier is easier to handle in the kernel.
Still, I believe in textual names at the user interface. I think the
id's should be translated from/into textual names in a userspace library
before presenting them to users.

One of the issues raised was that it's important to be able to
manipulate multiple EA's at once. The reason for this was to reduce
system call overhead. Another idea was to allow manipulation of multiple
EA's in an atomic way. If I recall correctly, an even stronger semantic
requirement of manipulating multiple EA's in a transactional way was
also suggested.

Another issue was that the current proposal at <http://acl.bestbits.at>
has a race condition between GET and SET operations. This is also
addressed here.
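
The translation between integer namespace ids and textual names in a
userspace library could look roughly like the sketch below. The ids, the
names "user" and "system", and the helpers ea_ns_to_id() and
ea_ns_to_name() are all hypothetical; the proposal does not define any
of them.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical ids and names; the proposal does not define any namespaces. */
struct ea_ns {
    int id;
    const char *name;
};

static const struct ea_ns ea_ns_table[] = {
    { 1, "user" },
    { 2, "system" },
};

#define EA_NS_COUNT (sizeof ea_ns_table / sizeof ea_ns_table[0])

/* Textual name -> integer id, or -1 if the name is unknown. */
static int ea_ns_to_id(const char *name)
{
    size_t i;

    assert(name != NULL);
    for (i = 0; i < EA_NS_COUNT; i++)
        if (strcmp(ea_ns_table[i].name, name) == 0)
            return ea_ns_table[i].id;
    return -1;
}

/* Integer id -> textual name, or NULL if the id is unknown. */
static const char *ea_ns_to_name(int id)
{
    size_t i;

    for (i = 0; i < EA_NS_COUNT; i++)
        if (ea_ns_table[i].id == id)
            return ea_ns_table[i].name;
    return NULL;
}
```

With a table like this, only the library needs to know the numeric ids;
applications would pass and receive textual names.
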

Meanwhile I did a little background reading on NFSv4, since some
complained that the ACL implementation was too limited. I will not
address that here, but the NFSv4 spec gave me a new idea of how the EA
interface could support all of the above in a clean, simple and
extensible way.

NFSv4 supports compound operations, in which multiple requests are
packed into a single RPC. A similar approach might also make sense for
the EA interface.

Note that the interface proposed here is comparable to Tru64's property
lists interface (although it goes beyond that). The Tru64 proplist(4)
manual page is here:

<http://www.tru64unix.compaq.com/faqs/publications/base_doc/DOCUMENTATION/V50_HTML/MAN/MAN4/0200____.HTM>

I could imagine the system call(s) to be implemented like this:

    int sys_ext_attr_file(char *path, int namespace, int flags,
                          struct ea_request *request, size_t request_len,
                          int *results, size_t result_size);

    int sys_ext_attr_fd(int fd, int namespace, int flags,
                        struct ea_request *request, size_t request_len,
                        int *results, size_t result_size);

(This doesn't actually work as system calls as is, because there are too
many parameters.)

Multiple EA operations are marshalled into the request buffer; after the
system call, the results buffer contains the results. Operations are
encoded in the request buffer as variable-size records with this
structure:

    struct ea_request {
            int operation;
            /* additional operation specific fields */
    };

Results just consist of one integer status code per operation.

Operation could be one of:

    EA_REQ_LIST           List the names of all EA's defined for this
                          inode.
    EA_REQ_GET            Get the value of an EA.
    EA_REQ_GETSIZE        Get the buffer size required for storing the
                          value of an EA.
    EA_REQ_SET            Set the value of an EA to a new value.
    EA_REQ_REMOVE         Remove an EA.
    EA_REQ_VERIFY         Compare the current EA value with the value
                          passed.
    EA_REQ_GET_COOKIE     Get a cookie that corresponds to the current
                          EA state.
    EA_REQ_VERIFY_COOKIE  Compare an inode's current cookie with the
                          cookie passed.

(more on the last three below)

For some requests/results, additional parameters are needed:

    struct ea_req_list {
            int operation;          /* = EA_REQ_LIST */
            size_t buffer_size;
            struct ea_entry *entries;
            size_t offset;
    };

    struct ea_req_get {
            int operation;          /* = EA_REQ_GET */
            int op_flags;
            size_t *buffer_size;
            char *buffer;
            unsigned short name_len;
            char name[];            /* size padded to machine word size */
    };

    struct ea_req_set {
            int operation;          /* = EA_REQ_SET */
            int op_flags;
            size_t value_len;
            char *value;
            unsigned short name_len;
            char name[];            /* size padded to machine word size */
    };

    struct ea_req_compare {
            int operation;          /* = EA_REQ_VERIFY */
            int op_flags;
            size_t value_len;
            char *value;
            unsigned short name_len;
            char name[];            /* size padded to machine word size */
    };

    struct ea_req_get_cookie {
            int operation;          /* = EA_REQ_GET_COOKIE */
            size_t *buffer_size;
            char *buffer;
    };

    struct ea_req_compare_cookie {
            int operation;          /* = EA_REQ_VERIFY_COOKIE */
            size_t value_len;
            char *value;
    };

The EA_REQ_LIST operation can pass attribute names as variable length
records. With an integer namespace identifier the previous
"name1\0name2\0name3\0\0" format isn't suitable anymore, so this format
can be used instead:

    struct ea_entry {
            int namespace;
            unsigned short name_len;
            char name[];            /* size padded to machine word size */
    };

Names are still zero terminated strings.

With this approach, multiple EA operations can be implemented without
too much system call overhead. Of course, implementing this is much more
complicated than the previous proposal. The marshalling/buffer
management/etc. would ideally be handled by a library, instead of
dealing with that in each application separately.

The default semantics would be to process the requests in sequence,
aborting at the first request that fails. The system call itself could
return the number of requests processed successfully.
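
A minimal sketch of how a library might marshal name-carrying requests
into one buffer, and of how the default sequential semantics could
behave. Everything in it is an assumption made for illustration: the
operation code values, the simplified struct ea_req_hdr (the buffer and
value pointers of the structures above are left out), and the helpers
ea_pad(), ea_marshal(), ea_process() and demo_handler(). The handler
merely stands in for the filesystem.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical values; the proposal names the operations but not their codes. */
enum { EA_REQ_GET = 1, EA_REQ_SET = 2, EA_REQ_REMOVE = 3 };

/* Simplified fixed header of a name-carrying request record. */
struct ea_req_hdr {
    int operation;
    int op_flags;
    unsigned short name_len;    /* length of the name without its '\0' */
    /* the zero terminated name follows, padded to machine word size */
};

/* Round n up to a multiple of the machine word size. */
static size_t ea_pad(size_t n)
{
    return (n + sizeof(long) - 1) & ~(sizeof(long) - 1);
}

/* Append one record at offset *used; returns 0, or -1 if it does not fit. */
static int ea_marshal(char *buf, size_t cap, size_t *used,
                      int operation, int op_flags, const char *name)
{
    struct ea_req_hdr hdr;
    size_t name_len = strlen(name);
    size_t rec_len = ea_pad(sizeof hdr + name_len + 1);

    assert(*used <= cap);
    if (name_len > 0xffff || rec_len > cap - *used)
        return -1;
    memset(&hdr, 0, sizeof hdr);
    hdr.operation = operation;
    hdr.op_flags = op_flags;
    hdr.name_len = (unsigned short)name_len;
    memset(buf + *used, 0, rec_len);            /* zero the padding and '\0' */
    memcpy(buf + *used, &hdr, sizeof hdr);
    memcpy(buf + *used + sizeof hdr, name, name_len);
    *used += rec_len;
    return 0;
}

/* Stand-in for the filesystem: 0 on success, nonzero status on failure. */
typedef int (*ea_handler)(const struct ea_req_hdr *hdr, const char *name);

/* Demo handler: pretends that the attribute called "missing" does not exist. */
static int demo_handler(const struct ea_req_hdr *hdr, const char *name)
{
    (void)hdr;
    return strcmp(name, "missing") == 0 ? -1 : 0;
}

/*
 * Process the records in sequence and stop at the first failure.
 * results[] receives one status code per processed request; the return
 * value is the number of requests that succeeded.
 */
static int ea_process(const char *buf, size_t len, int *results,
                      size_t max_results, ea_handler handle)
{
    size_t off = 0;
    int done = 0;

    while (off + sizeof(struct ea_req_hdr) <= len &&
           (size_t)done < max_results) {
        struct ea_req_hdr hdr;
        size_t rec_len;

        memcpy(&hdr, buf + off, sizeof hdr);
        rec_len = ea_pad(sizeof hdr + hdr.name_len + 1);
        if (rec_len > len - off)
            break;                              /* truncated record */
        results[done] = handle(&hdr, buf + off + sizeof hdr);
        if (results[done] != 0)
            break;                              /* abort at first failure */
        done++;
        off += rec_len;
    }
    return done;
}
```

Marshalling GET "a", REMOVE "missing" and SET "b" and passing the buffer
to ea_process() with demo_handler() would process the first request,
record the failure status of the second, and never reach the third.
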

In the flags parameter to the system call, users could request
additional restrictions that might or might not be supported by the
implementation, like the following:

    EA_FLAG_ISOLATED  The requests are processed without other processes
                      seeing any intermediate steps.
    EA_FLAG_ATOMIC    Either all of the requests are processed or none
                      are.
    EA_FLAG_SYNC      The EA's are guaranteed to be persistent on disk
                      when the system call returns.

An implementation would be free to use EA_FLAG_ISOLATED or EA_FLAG_SYNC
semantics even though the corresponding flags were not set. As
EA_FLAG_ATOMIC is a very strong requirement, most current filesystems
probably wouldn't support it.

The op_flags member of individual operations could include:

    EA_OP_FLAG_CREATE  The operation only succeeds if the EA doesn't
                       exist already.
    EA_OP_FLAG_EXISTS  The operation only succeeds if the EA exists
                       already.

About the EA_REQ_VERIFY, EA_REQ_GET_COOKIE and EA_REQ_VERIFY_COOKIE
operations:

The problem with a simple GET followed by a SET at some later point in
time is that another process might in the meantime also have manipulated
the very same EA. For relative changes to an EA, that's bad. Arbitrary
interleavings of GET/SET operations lead to unpredictable results. The
two approaches to get around that I'm currently aware of are: either
check that the previous value hasn't changed, or check that a magic
cookie (some sort of version tag) hasn't changed in the meantime.

For the comparison approach, one correct implementation would be an
atomic sequence of the two operations [EA_REQ_VERIFY, EA_REQ_SET]
(EA_FLAG_ISOLATED), resulting in an atomic test-and-set operation:
EA_REQ_VERIFY would compare the value retrieved in the EA_REQ_GET
operation with the current EA value and abort if the previous value has
changed in the meantime. This operation is expensive if the EA value
gets big.

The other approach would be to retrieve a magic cookie together with the
original value (this could be a simple integer).
Instead of comparing the values, the previous and current cookies are
compared. The cookie associated with an inode doesn't have to be related
to the individual attribute; it must only be guaranteed that the cookie
changes when that EA changes. Operation sequences:
[EA_REQ_GET_COOKIE, EA_REQ_GET] (no flags required), and at some later
point in time: [EA_REQ_VERIFY_COOKIE, EA_REQ_SET] (EA_FLAG_ISOLATED).

I don't know whether any protocols support the value comparison approach
but not the cookie approach. AFAIK NFSv4 supports neither, but a verify
operation can be followed by a set operation in a single RPC request, so
at least the time window for inconsistencies gets minimized.

For local filesystems, the cookie approach seems pretty easy to
implement. The i_version field that is present in each in-memory inode
can directly be used as the cookie. Cookies don't need to be stored on
the filesystem.

The cookie could also be used for the EA_REQ_LIST operation to retrieve
very long lists of EA names across multiple system calls reliably.

Regards,
Andreas.

------------------------------------------------------------------------
Andreas Gruenbacher, a.gruenbacher@computer.org
Contact information: http://www.bestbits.at/~ag/