MetaKit - FAQ #2

The following list contains questions that have been asked about MetaKit, along with other issues that did not fit anywhere else in the documentation.

What does rollback do and how do I use it?

In concept, MetaKit only makes permanent changes to a data file when the "Commit" function is called. It may write information to the file more often, but these writes have no lasting effect until they are committed. This mechanism will be familiar to anyone using transaction-based database systems. Rolling a transaction back consists of undoing all changes made since the last commit, or since the database was originally opened. In this first version of MetaKit, a call to "Rollback" also has the side effect of releasing all buffers. You will have to reconnect all views after a rollback, since existing views will no longer be attached to the storage object. More flexible approaches are being investigated for a future version.
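
The commit/rollback behaviour described above can be illustrated with a small toy model. This is not MetaKit code: the class `ToyStore` and all of its members are hypothetical names, and the sketch only models the idea that changes stay provisional until they are committed.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical toy model of commit/rollback semantics; not the MetaKit API.
class ToyStore {
public:
    // Changes go to the working copy only.
    void Set(const std::string& key, int value) { working_[key] = value; }

    // Returns the working value; the key must exist.
    int Get(const std::string& key) const {
        std::map<std::string, int>::const_iterator it = working_.find(key);
        assert(it != working_.end());
        return it->second;
    }

    bool Has(const std::string& key) const { return working_.count(key) != 0; }

    // Make all changes since the last commit permanent.
    void Commit() { committed_ = working_; }

    // Discard all changes since the last commit (or since creation).
    void Rollback() { working_ = committed_; }

private:
    std::map<std::string, int> committed_;  // last committed state
    std::map<std::string, int> working_;    // state seen by the application
};
```

In this model, setting a value, committing, changing it again and then rolling back leaves the value as it was at the commit. The sketch does not model the release of buffers or the need to reconnect views.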

How do I change the structure of an existing data file?

To change the structure, you must perform the following steps:

1 - Load all data into memory (either by using LoadFromStream or by deleting the storage object after the view has been defined).
2 - Define a new storage format on file (simply delete the original file and create a storage object with the new description).
3 - Store the loaded view in the new storage object (using Set).
4 - Commit the changes.

This requires just a few lines of code.

How can MetaKit automatically choose 1/2/4-byte integers?

Each integer field starts out with 0 bytes per entry when created. Then - depending on the values stored in the field - the field is converted to use 1, 2, or 4 bytes per entry as needed (in all the records of the view). One-byte fields can store signed chars and two-byte fields can store short integers; all other values cause the field to use the long 4-byte per entry format. Note that fields currently only increase in size, i.e. once you store a long value in a field it will always use longs (for all records), even if smaller values are subsequently stored in that field.
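
The widening rule can be sketched as follows. This is an illustration under stated assumptions, not MetaKit's implementation: `BytesNeeded` and `AdaptiveColumn` are hypothetical names, the sketch assumes an 8-bit char and a 16-bit short, and it treats every stored value (including zero) as needing at least one byte, which the text above does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Smallest of 1, 2 or 4 bytes that can hold a signed value
// (assumes 8-bit chars and 16-bit shorts).
inline int BytesNeeded(std::int32_t v) {
    if (v >= -128 && v <= 127) return 1;
    if (v >= -32768 && v <= 32767) return 2;
    return 4;
}

// Hypothetical column that tracks its per-entry width; not MetaKit's code.
class AdaptiveColumn {
public:
    AdaptiveColumn() : width_(0) {}

    void Append(std::int32_t v) {
        values_.push_back(v);
        int w = BytesNeeded(v);
        assert(w == 1 || w == 2 || w == 4);
        if (w > width_) width_ = w;  // widen only, never shrink
    }

    // Bytes per entry, applied to every record in the column.
    int Width() const { return width_; }

    // Storage the column would occupy at its current width.
    std::size_t StorageBytes() const {
        return values_.size() * static_cast<std::size_t>(width_);
    }

private:
    std::vector<std::int32_t> values_;  // kept as 32-bit here for simplicity
    int width_;                         // 0 until a value is stored
};
```

Appending 100, then 1000, then 100000 moves the width from 1 to 2 to 4 bytes, and a later small value leaves it at 4.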

What is the overhead of an unused field?

Integer fields which are never used add no overhead at all, regardless of the number of records. This is a consequence of the adaptive integer field width implemented in MetaKit. Unused string fields currently use one byte per record (i.e. an empty string).

What are the worst-case memory requirements?

MetaKit implements on-demand loading and only loads structural data (not the data values) when opening a file. Whenever information is accessed, the corresponding segment is loaded from file - once - and then kept in memory. With the current implementation, this means that in the worst case all data will be made resident. Furthermore, format conversions require a second copy of the data, so for now twice the total amount of data may be in memory in the worst case. This is temporary; future versions will use virtual memory and will be optimized for a high locality of reference.

How can I transport a data structure over the network?

Every view (this includes structured views) can be placed in a storage object (using the Set function). Storage objects need not be associated with a file (i.e. the file pointer passed to the constructor may be null). To transport a view structure without using intermediate files, you can make it part of a storage object, and then use the members SaveToStream and LoadFromStream to stream the entire structure over a regular I/O stream. The Winsock-based client/server examples CatSend and CatRecv demonstrate this.

Can I store a tree structure using MetaKit?

Yes. You have to distinguish between the data which represents the tree and the structure of the tree nodes. The DisCat example program stores a directory tree using MetaKit, in a view with one entry per node. Larger tree structures require more entries, but they do not affect the structure of the view itself. In the case of DisCat, child nodes contain the index of their parent node, but other schemes could be used.
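
The parent-index scheme can be sketched with a plain array of rows. This is not the DisCat source: `NodeRow`, `PathOf` and `ChildrenOf` are hypothetical names, and a `std::vector` stands in for a view.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One row per tree node; 'parent' is the row index of the parent, -1 for the root.
struct NodeRow {
    std::string name;
    int parent;
};

// Rebuild the full path of a node by following parent indexes up to the root.
inline std::string PathOf(const std::vector<NodeRow>& rows, int index) {
    std::string path;
    while (index >= 0) {
        assert(static_cast<std::size_t>(index) < rows.size());
        const NodeRow& row = rows[static_cast<std::size_t>(index)];
        path = path.empty() ? row.name : row.name + "/" + path;
        index = row.parent;
    }
    return path;
}

// Collect the row indexes of all direct children of a node.
inline std::vector<int> ChildrenOf(const std::vector<NodeRow>& rows, int index) {
    std::vector<int> result;
    for (std::size_t i = 0; i < rows.size(); ++i)
        if (rows[i].parent == index) result.push_back(static_cast<int>(i));
    return result;
}
```

Adding nodes only adds rows; the row layout (a name plus a parent index) stays the same however deep the tree becomes.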

How do I reclaim all unused free space from a data file?

To pack data files to their minimum size, you can load the entire data file into memory, recreate a fresh file of the same structure, and save the contents again. This is very similar to the process of converting the structure of a data file - only this time no changes are made. Newly created data files do not contain unused (i.e. reclaimable) space.

Will MetaKit handle I/O errors gracefully?

Yes. The mechanisms used to save data to file are based on "stable storage" principles. This is another way of saying that a data file -either- has its valid original state -or- contains the newly committed state. There are no intermediate states, even when system failures occur. Exceptions raised during file I/O are passed up to the caller of MetaKit functions. To proceed after such an error, use the rollback mechanism to make sure that all buffered changes are undone, and to synchronize the application state with the state of the data file. In the current version, some allocated memory will leak on platforms which do not support automatic C++ stack unwinding after exceptions.

Is there a way to store large amounts of text in this version?

Yes. In fact, the on-demand loading properties of MetaKit allow huge data files to be opened very quickly, since stored data is only loaded upon actual use. The main consideration is that string fields should not grow excessively large - this can be taken care of by storing each section of text in a separate sub-view. You should not just store the different text sections in adjacent rows of the same view, but use sub-views with only one row for each entry instead (or, to rephrase this: define a repeating field for large texts instead of a regular string field, then use a single repetition for each section). In this way the total size of texts stored in each view will not accumulate and on-demand loading will perform as expected. In some cases, you may find it easier to use sub-views with individual lines of text, which also works quite well.

How should I design my database for maximum performance?

The basic guideline to achieve high performance is "locality of reference". In MetaKit, this translates to accessing as few of the fields as possible and avoiding code which iterates over many fields - if you can write your loops in such a way that the inner loop iterates over records (as opposed to over fields), you will get dramatic performance gains. This is definitely not the conventional approach and may even seem to defy all logic - we're used to fetching an entire record/object and then using all its information. That habit is based on the assumption that each record is stored contiguously in memory and on disk. Well... in MetaKit, they're not. The current version does not yet exploit the full potential of this approach - searching and sorting are still far from optimal in performance.
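
The loop ordering recommended above can be sketched with a column-wise layout. This is an illustration only: `ColumnTable` and its fields are hypothetical, and plain vectors stand in for however MetaKit stores each field internally.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical column-wise table: each field is its own contiguous array.
struct ColumnTable {
    std::vector<int> price;
    std::vector<int> quantity;
    std::vector<int> discount;
};

// Sum one field over all records: touches a single contiguous array.
inline long SumField(const std::vector<int>& column) {
    long total = 0;
    for (std::size_t row = 0; row < column.size(); ++row)
        total += column[row];
    return total;
}

// Field-by-field processing: the outer loop picks a field,
// the inner loop (inside SumField) runs over the records of that field.
inline long SumAllFields(const ColumnTable& t) {
    assert(t.price.size() == t.quantity.size());
    assert(t.price.size() == t.discount.size());
    const std::vector<int>* fields[] = { &t.price, &t.quantity, &t.discount };
    long total = 0;
    for (std::size_t f = 0; f < 3; ++f)
        total += SumField(*fields[f]);
    return total;
}
```

A query that needs only one field, such as `SumField(t.price)`, never touches the other two arrays, which is the access pattern the guideline favours.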

Why doesn't MetaKit maintain indexes like every other DBMS?

First of all: MetaKit is not pretending to be a full-scale DBMS. The data model of MetaKit is closer to the classical data structures of programming than to the table structures used in relational databases. The concept of a primary index is replaced by the equivalent "sorted container". Secondary indexes can be realized as arrays of (indirect) row-index values. This is possible because views are indexed collections - i.e. arrays - allowing this very efficient (both in space and time) alternative. The sort and select operations on views work in precisely this way: they simply set up an array of integers which is used internally to locate rows in the underlying view.
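
The idea of an index as an array of row numbers can be sketched with the standard library. This is not MetaKit's internal code: `Person`, `SortedByAge`, `SelectOlderThan` and `RowAt` are hypothetical names, and a `std::vector` stands in for the underlying view.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Person {
    std::string name;
    int age;
};

// Orders row numbers by the 'age' of the rows they refer to.
struct ByAge {
    const std::vector<Person>* rows;
    bool operator()(std::size_t a, std::size_t b) const {
        return (*rows)[a].age < (*rows)[b].age;
    }
};

// A sorted "index": an array of row numbers; the rows themselves stay put.
inline std::vector<std::size_t> SortedByAge(const std::vector<Person>& rows) {
    std::vector<std::size_t> order(rows.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = i;
    ByAge cmp = { &rows };
    std::stable_sort(order.begin(), order.end(), cmp);
    return order;
}

// A selection is the same idea: keep only the row numbers that match.
inline std::vector<std::size_t> SelectOlderThan(const std::vector<Person>& rows,
                                                int age) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < rows.size(); ++i)
        if (rows[i].age > age) hits.push_back(i);
    return hits;
}

// Locate a row in the underlying collection through the index array.
inline const Person& RowAt(const std::vector<Person>& rows,
                           const std::vector<std::size_t>& order,
                           std::size_t pos) {
    assert(pos < order.size() && order[pos] < rows.size());
    return rows[order[pos]];
}
```

Several such index arrays can coexist over the same rows, each costing one integer per row.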

How can MetaKit be used to store millions of small objects?

Technically speaking, you could create views with millions of rows (except in 16-bit address spaces), but such views would be dreadfully slow. The only practical way to store millions of objects right now is to structure your information in nested views with hundreds, or perhaps thousands, of entries each. In the current implementation, a 2- or 3-level view structure can easily handle millions of small objects. If you had to store a large collection of files in a hierarchical file system, you would probably do very much the same thing - create a few directory levels, perhaps segmented on the first letter of the file name (or the day of the month), to reduce the number of files stored in any single directory. The DisCat example program illustrates this strategy: the disk catalogs are segmented as a list of directories, each with a list of all the files in it. This structure easily deals with file systems of just about any size, including CD-ROMs.
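
A two-level layout of this kind can be sketched as follows. This is not the DisCat source: `FileRow`, `DirRow` and the helper functions are hypothetical names, and nested vectors stand in for nested views.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical rows for a two-level catalog.
struct FileRow {
    std::string name;
    long size;
};

struct DirRow {
    std::string path;
    std::vector<FileRow> files;  // nested view: the files in this directory
};

// Append a file to the nested view of one directory.
inline void AddFile(std::vector<DirRow>& dirs, std::size_t dir,
                    const std::string& name, long size) {
    assert(dir < dirs.size());
    FileRow row;
    row.name = name;
    row.size = size;
    dirs[dir].files.push_back(row);
}

// Total number of small objects across all nested views.
inline std::size_t CountFiles(const std::vector<DirRow>& dirs) {
    std::size_t total = 0;
    for (std::size_t d = 0; d < dirs.size(); ++d)
        total += dirs[d].files.size();
    return total;
}

// Look up a file: scan the directory list, then one short file list.
// Returns a null pointer when nothing matches.
inline const FileRow* FindFile(const std::vector<DirRow>& dirs,
                               const std::string& path,
                               const std::string& name) {
    for (std::size_t d = 0; d < dirs.size(); ++d) {
        if (dirs[d].path != path) continue;
        const std::vector<FileRow>& files = dirs[d].files;
        for (std::size_t f = 0; f < files.size(); ++f)
            if (files[f].name == name) return &files[f];
    }
    return 0;
}
```

With a thousand directories of a thousand files each, this layout holds a million objects while no single list grows beyond a thousand entries.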

I have records/objects of varying structure, how do I store these?

Evidently, the first rule is to store substantially different records as separate views with their own structure. But you have a number of extra options at hand: if several fields of the same type are not needed in each record, you could consider creating an indexed sub-view for all those fields instead, in effect replacing each set of fields with a view containing a varying set of records. Also, since MetaKit only requires 1 byte for each unused string, and has zero overhead for integer fields which are not used at all in a view, the extra space used by simply declaring all fields may be much lower than expected.

My objects contain pointers, how does MetaKit deal with this?

MetaKit doesn't use pointer persistence. The objects managed by MetaKit are the rows which are present as entries in views. These "run-time" objects are similar - but not equivalent - to C++ objects. You can build and manipulate views and nest them to any level, but they do not contain pointers or other data which cannot be represented on file. The nearest analogy to a pointer is a "cursor", which is simply a <view, index> pair. Cursors cannot be stored, but of course indexes can, and cursors are easily reconstructed from a view and an index value. View indexing - and the use of cursors - is very efficient in MetaKit, and both are very easy to use thanks to the operator overloading of C++.
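
The <view, index> idea can be sketched as a small class template. This is not MetaKit's cursor class: `Cursor` and its members are hypothetical, and a `std::vector` stands in for a view.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical cursor: a <view, index> pair.
template <typename Row>
class Cursor {
public:
    Cursor(std::vector<Row>& view, std::size_t index)
        : view_(&view), index_(index) {}

    // Dereference like a pointer; the index must be within the view.
    Row& operator*() const {
        assert(index_ < view_->size());
        return (*view_)[index_];
    }

    // Step to the next row.
    Cursor& operator++() {
        ++index_;
        return *this;
    }

    // The index is plain data and is the part that can be stored.
    std::size_t Index() const { return index_; }

private:
    std::vector<Row>* view_;  // run-time only; cannot be written to file
    std::size_t index_;
};
```

Storing `Index()` alongside the data and later constructing a new `Cursor` from the view and that saved value corresponds to the reconstruction described above.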
For questions or comments, please contact Jean-Claude Wippler @ Meta Four Software.