For a while now I've been kicking around the idea of implementing a Data System, a sort of hybrid between traditional file systems and databases. Such a system would attempt to combine the best of both worlds with some of the functionality that XML is supposed to provide (a universal and extensible way of representing data), but more efficiently. Some of the features I'm envisioning are listed below.
The overall concept is to never store something without a set of metadata that describes what is being stored, while still providing the generality of traditional file systems. This generality can be provided in several ways, two of which are: (1) allow data to be stored as "miscellaneous" or "binary", thus emulating an FS for that piece of data, and (2) provide a means of extending the range of data types that can be stored. I plan to implement both methods.
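To make the two approaches concrete, here is a minimal sketch in C of a type registry that starts with a catch-all "binary" type and can be extended with new types. Every name in it (`ds_type`, `ds_registry`, `ds_registry_add`, and so on) is hypothetical and chosen only for illustration; it is not a settled design.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DS_MAX_TYPES 32   /* arbitrary limit for this sketch */

/* Hypothetical type descriptor: every stored item would carry one. */
typedef struct {
    const char *name;   /* type name; the caller keeps the string alive */
    unsigned    id;     /* index into the registry */
} ds_type;

typedef struct {
    ds_type types[DS_MAX_TYPES];
    size_t  count;
} ds_registry;

/* Start with the built-in catch-all "binary" type (id 0), method 1. */
void ds_registry_init(ds_registry *r)
{
    assert(r != NULL);
    r->types[0].name = "binary";
    r->types[0].id = 0;
    r->count = 1;
}

/* Extend the range of storable types, method 2.
 * Returns the new type id, or -1 if the registry is full or the
 * name is already registered. */
int ds_registry_add(ds_registry *r, const char *name)
{
    assert(r != NULL && name != NULL);
    if (r->count >= DS_MAX_TYPES)
        return -1;
    for (size_t i = 0; i < r->count; i++)
        if (strcmp(r->types[i].name, name) == 0)
            return -1;
    r->types[r->count].name = name;
    r->types[r->count].id = (unsigned)r->count;
    return (int)r->count++;
}

/* Look up a type by name; unknown names fall back to "binary". */
const ds_type *ds_registry_lookup(const ds_registry *r, const char *name)
{
    assert(r != NULL && name != NULL);
    for (size_t i = 0; i < r->count; i++)
        if (strcmp(r->types[i].name, name) == 0)
            return &r->types[i];
    return &r->types[0];
}
```

The fallback in `ds_registry_lookup` is the part that emulates a plain FS: anything the system does not recognise is still storable, just with the least descriptive metadata.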
Generally, a data system (DS) is a collection of data spaces and the storage media allocated to those data spaces. A DS is the data equivalent of a traditional FS. Data spaces are discussed further below.
The Data System merges the concepts of file systems and databases while extending both. Files and directories provide file system support, but also store information regarding the contents of both. Tables and records provide database functionality. These are a few of the features that will be directly provided by the DS.
All dataspaces can be represented by a URI. The content visible within various namespaces, and to various users, is configurable via a mechanism similar to an ACL and subject to the conditions below.
The URIs used here are similar to the traditional concept, with some minor changes. Generally a URI is something of the form [protocol]//[user[:pass]@][host]/[path]. In this case the '/' between the host and path parts is replaced by a different character depending on the information to be accessed. The table below lists the possibilities.
| Symbol | Usage |
|---|---|
| / | Host-centric space |
| ~ | User/Group space |
| $ | Configuration, Settings, Environment... |
| # | System File space |
| ^ | Process space |
|  | Package space |
Any URI that starts with one of the dataspace separators listed above is assumed to have a host part of "localhost". A URI that begins with an empty host part (i.e. "///") is also assumed to be localhost. Specifying a host of "localhost" produces the expected result. A URI that doesn't start with a protocol, '//', or any of the separators is assumed to be relative to the current location within the current dataspace.
The dataspace separators can be nested. For instance, a group could have a number of members. An individual member would be at "~group~user". Likewise, a user can have a configuration space: "~user$".
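The defaulting rules above can be sketched as a small parser in C. This is only an illustration under several assumptions: the names `ds_uri` and `ds_parse_uri` are made up, a protocol is taken to be a run of alphanumerics (optionally followed by ':') before "//", credentials (`user[:pass]@`) are not split out of the host, and the package-space symbol is left out because the table does not give one. Nested separators such as "~group~user" are not unpacked; everything after the first separator is returned in `rest`.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Dataspace separators from the table (package space omitted). */
#define DS_SEPARATORS "/~$#^"

typedef struct {
    char        protocol[16]; /* empty string if none was given */
    char        host[64];     /* "localhost" when omitted; empty if relative */
    char        space;        /* first dataspace separator, 0 if relative */
    const char *rest;         /* text after the first separator */
    int         relative;     /* 1 if relative to the current location */
} ds_uri;

/* Copy at most cap-1 bytes and always NUL-terminate. */
static void copy_span(char *dst, size_t cap, const char *src, size_t len)
{
    if (len >= cap)
        len = cap - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Returns 0 on success, -1 if nothing follows the host part. */
int ds_parse_uri(const char *uri, ds_uri *out)
{
    assert(uri != NULL && out != NULL);
    memset(out, 0, sizeof *out);
    out->rest = uri;

    /* Optional protocol, then "//". */
    const char *p = uri;
    while (isalnum((unsigned char)*p))
        p++;
    const char *proto_end = p;
    if (*p == ':')
        p++;

    if (p[0] == '/' && p[1] == '/') {
        copy_span(out->protocol, sizeof out->protocol,
                  uri, (size_t)(proto_end - uri));
        p += 2;
        size_t hostlen = strcspn(p, DS_SEPARATORS);
        if (hostlen == 0)
            strcpy(out->host, "localhost");   /* empty host, e.g. "///" */
        else
            copy_span(out->host, sizeof out->host, p, hostlen);
        p += hostlen;
        if (*p == '\0')
            return -1;
        out->space = *p;
        out->rest = p + 1;
        return 0;
    }

    /* Leading separator: host defaults to localhost. */
    if (*uri != '\0' && strchr(DS_SEPARATORS, *uri) != NULL) {
        strcpy(out->host, "localhost");
        out->space = *uri;
        out->rest = uri + 1;
        return 0;
    }

    /* Anything else is relative to the current location. */
    out->relative = 1;
    return 0;
}
```

With this sketch, "~user$" parses to host "localhost", space '~', and rest "user$", leaving the nested '$' for a later pass.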
The default dataspace for a particular host. Generally this holds content that is public or not associated with any other sub-dataspace. Of course, the content actually available here is subject to authentication (if any) and configuration.
Refers to the user dataspace for "user" on host "host". If "user" is a group name, then this dataspace is the content for that group. The content that is visible in this namespace is dependent upon authentication and system configuration. For someone who can authenticate as "user" it could be equivalent to the user's home directory, the data that the user is permitted to see on the host, or some combination of the two. For someone who can't authenticate as the user this could be material that the user has chosen to make public. The user could also use something like an ACL to create views for specific people (/host~user~friend/path).
Contains configuration settings relevant to a particular entity. At the host level this space contains configuration settings and other things relevant to the configuration of the host. Similarly, at the user/group level it holds settings and environment variables, and for the process and package spaces it contains various configuration settings.
The URI schemes shown above include a [path] element. The path element is a series of components separated by the path separator, which is generally '/'. However, a different character can be used to reference a particular item by index instead of name. For example, the path 'a/b/c' refers to the item 'c' in 'a/b'. If item 'c' is the third item in 'a/b' then it could also be referenced by 'a/b^3'. Because every item is assigned an ID number, it can also be accessed by ID.
| Path Separators | |
|---|---|
| / | Record Name/Generic |
| \ | Record Index |
| : | Record ID |
| / | Field Name |
| \ | Field Index |
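As a rough illustration of how such paths might be split, the C sketch below tokenizes a path into components and records whether each one is referenced by name, index, or ID. It assumes the record separators exactly as listed in the table ('/' for name, '\' for index, ':' for ID); since the field rows reuse the same characters, it does not try to tell records and fields apart. The names `ds_component` and `ds_split_path` are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef enum { DS_BY_NAME, DS_BY_INDEX, DS_BY_ID } ds_ref_kind;

typedef struct {
    ds_ref_kind   kind;
    char          name[32];  /* set when kind == DS_BY_NAME */
    unsigned long number;    /* set when kind is DS_BY_INDEX or DS_BY_ID */
} ds_component;

/* Split a relative path into at most `max` components.
 * Returns the number of components, or -1 on malformed input
 * (empty component, trailing separator, non-numeric index/ID,
 * over-long name, or too many components). */
int ds_split_path(const char *path, ds_component *out, int max)
{
    assert(path != NULL && out != NULL && max > 0);
    int n = 0;
    ds_ref_kind kind = DS_BY_NAME;  /* the first component has no separator */
    const char *p = path;

    while (*p != '\0') {
        size_t len = strcspn(p, "/\\:");
        if (len == 0 || n == max)
            return -1;

        out[n].kind = kind;
        out[n].name[0] = '\0';
        out[n].number = 0;

        if (kind == DS_BY_NAME) {
            if (len >= sizeof out[n].name)
                return -1;
            memcpy(out[n].name, p, len);
            out[n].name[len] = '\0';
        } else {
            char *end;
            if (!isdigit((unsigned char)*p))
                return -1;
            out[n].number = strtoul(p, &end, 10);
            if (end != p + len)
                return -1;          /* trailing non-digits */
        }
        n++;
        p += len;
        if (*p == '\0')
            break;

        /* The separator decides how the NEXT component is interpreted. */
        kind = (*p == '/') ? DS_BY_NAME
             : (*p == '\\') ? DS_BY_INDEX
             : DS_BY_ID;
        p++;
        if (*p == '\0')
            return -1;              /* trailing separator */
    }
    return n;
}
```

Under these assumptions, `a/b\3` yields the names "a" and "b" followed by index 3, and `a/b:42` yields the same two names followed by ID 42.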
Data spaces are a method of isolating and/or grouping particular collections of data independent of storage media or structure. In a sense this is very similar to the concept of a database, but I feel data space is a more apt description since a data space could contain things that are traditionally considered to be databases (tables, etc.).
The two main types of data spaces are centered around entities, both aggregate (groups) and non-aggregate (users). There is a third set of special data spaces that deal with system configurations. For ownership purposes they are associated with the root/administrative user/group, but are not considered a true user/group data space since they are tightly bound to specific hardware.
This is a data space dedicated to a particular user. It consists of both volatile and non-volatile storage. Volatile storage is intended for short-term storage; it's essentially a per-user temp directory. It could also be used as a Windows-style clipboard. Non-volatile storage, obviously, is permanent.
Every user's dataspace contains all of the per-process dataspaces associated with the processes owned by the user. All processes owned by a user also have access to the user's data space.
Groups are considered to be entities just like users. Consequently each group can have its own dataspace.
Each process has its own volatile data space. This can be used as a cache/temp-dir or for making data available to the "outside world", such as configuration or status information. It can also be used to access process-specific information and statistics (process ID, priority, anything `top` normally provides).
This data space is heavily dependent on a package management system. The package data space consists of the per-package data spaces of all packages installed on the system.
Each installed package has its own non-volatile data space that can be used for storing package-specific information. This data space is accessible to every process/executable registered in the package.
Shared dataspaces are those that encompass one or more of the above dataspace types, or exist "between" the above types. These can be volatile or non-volatile.
Data objects can be owned by either a user or a group. There is no group ownership in the classic Unix style. The functionality afforded by the traditional Unix group/other permissions is handled by Access Control Lists.
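A minimal sketch of what this could look like in C follows. It assumes that an object has exactly one owner (a user or a group), that the owner always has full access, and that each ACL entry grants a bitmask of rights to one entity. Group membership is deliberately not resolved here. All names (`ds_entity`, `ds_acl_entry`, `ds_may_access`) and the two rights shown are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { DS_ENTITY_USER, DS_ENTITY_GROUP } ds_entity_kind;

/* Users and groups are both entities; only the kind differs. */
typedef struct {
    ds_entity_kind kind;
    unsigned       id;
} ds_entity;

/* Hypothetical rights; a real system would likely have more. */
enum { DS_READ = 1, DS_WRITE = 2 };

typedef struct {
    ds_entity who;      /* entity this entry applies to */
    unsigned  allowed;  /* bitmask of DS_READ / DS_WRITE */
} ds_acl_entry;

typedef struct {
    ds_entity           owner;    /* a single owner: user OR group */
    const ds_acl_entry *acl;      /* replaces Unix group/other bits */
    size_t              acl_len;
} ds_object;

static int same_entity(ds_entity a, ds_entity b)
{
    return a.kind == b.kind && a.id == b.id;
}

/* Returns 1 if `who` may perform every right in `wanted`, else 0.
 * Assumption: the owner has full access; everyone else needs a
 * matching ACL entry, and the first matching entry decides. */
int ds_may_access(const ds_object *obj, ds_entity who, unsigned wanted)
{
    assert(obj != NULL);
    if (same_entity(obj->owner, who))
        return 1;
    for (size_t i = 0; i < obj->acl_len; i++)
        if (same_entity(obj->acl[i].who, who))
            return (obj->acl[i].allowed & wanted) == wanted;
    return 0;
}
```

Because there is only one owner field, anything the Unix group and other bits would express has to be written as ACL entries, which is the trade-off described above.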
Tables are made of records, and records are made of fields. In this Data System a given table can only hold a single kind of record. Tables that can hold more than one kind of record are called Multi-Tables (for lack of a better name). Tables that can hold any kind of record are called directories. (Or folders? No... I need to come up with something more data-oriented.)
struct
{