MetaKit - FAQ #2
The following list contains questions which have been asked about MetaKit
and other issues which did not fit anywhere else in the documentation:
What does rollback do and how do I use it?
In concept, MetaKit only makes permanent changes to a data file when the
"Commit" function is called. It may write information to file
more often, but these actions have no lasting effect until actually committed.
This mechanism will be familiar to anyone using transaction-based database
systems. Rolling a transaction back consists of undoing all changes since
the last commit, or since the database was originally opened. In this first
version of MetaKit, a call to "Rollback" also has the side effect
of releasing all buffers. You will have to reconnect all views after rollback,
since existing views will no longer be attached to the storage object. More
flexible approaches are being investigated for a future version.
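The commit/rollback semantics described above can be illustrated with a minimal sketch in plain C++. It does not use the MetaKit API and says nothing about how MetaKit implements transactions; the class name TransactionalInts and all of its members are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical in-memory model of commit/rollback (not MetaKit code).
// Changes go to a working copy and only become permanent on Commit().
class TransactionalInts {
public:
    void Append(int value) { working_.push_back(value); }

    // Make all pending changes permanent.
    void Commit() { committed_ = working_; }

    // Undo all changes since the last commit (or since creation).
    void Rollback() { working_ = committed_; }

    std::size_t Size() const { return working_.size(); }

    int At(std::size_t row) const {
        assert(row < working_.size());
        return working_[row];
    }

private:
    std::vector<int> committed_;  // state as of the last commit
    std::vector<int> working_;    // state including uncommitted changes
};
```

Appending a value, committing, appending a second value, and then rolling back leaves only the first value in place.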
How do I change the structure of an existing data file?
To change the structure, you must perform the following steps:
1 - Load all data into memory (using either LoadFromStream or by deleting the
storage object after the view has been defined).
2 - Define a new storage format on file (simply delete the original file and
create a storage object with the new description).
3 - Store the loaded view in the new storage object (using Set).
4 - Commit the changes.
This requires just a few lines of code.
How can MetaKit automatically choose 1/2/4-byte integers?
Each integer field starts out with 0 bytes per entry when created. Then
- depending on the values stored in the field - the field is converted to
use 1, 2, or 4 bytes per entry as needed (in all the records of the view).
One-byte fields can store signed chars and two-byte fields store short integers;
all other values cause the field to use the long 4-bytes-per-entry format.
Note that fields currently only increase in size, i.e. once you store a
long value in a field it will always use longs (for all records) even if
smaller values are subsequently stored in that field.
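The widening rule can be sketched in plain C++ as follows. This is an illustration only, not MetaKit's implementation: the names BytesNeeded and AdaptiveIntColumn are hypothetical, and the sketch assumes that a field holding only zeros counts as unused and needs 0 bytes per entry.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Bytes per entry needed to hold `value`.
// Assumption: zero is the default value and needs no storage at all.
inline int BytesNeeded(std::int32_t value) {
    if (value == 0) return 0;
    if (value >= std::numeric_limits<std::int8_t>::min() &&
        value <= std::numeric_limits<std::int8_t>::max()) return 1;
    if (value >= std::numeric_limits<std::int16_t>::min() &&
        value <= std::numeric_limits<std::int16_t>::max()) return 2;
    return 4;
}

// Hypothetical integer column whose entry width only ever increases.
class AdaptiveIntColumn {
public:
    void Append(std::int32_t value) {
        values_.push_back(value);
        Widen(value);
    }

    void Set(std::size_t row, std::int32_t value) {
        assert(row < values_.size());
        values_[row] = value;
        Widen(value);  // never shrinks, even if `value` is small
    }

    int BytesPerEntry() const { return width_; }

    // Space a packed representation would need for all rows.
    std::size_t StorageBytes() const {
        return values_.size() * static_cast<std::size_t>(width_);
    }

private:
    void Widen(std::int32_t value) {
        width_ = std::max(width_, BytesNeeded(value));
    }

    std::vector<std::int32_t> values_;  // logical values; not actually packed here
    int width_ = 0;                     // 0, 1, 2 or 4 bytes per entry
};
```

Storing 70000 in any row moves the whole column to 4 bytes per entry, and overwriting that row with 1 afterwards leaves the width at 4.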
What is the overhead of an unused field?
Integer fields which are never used add no overhead at all, regardless of
the number of records. This is a consequence of the adaptive integer field
width implemented in MetaKit. Unused string fields currently use one byte
per record (i.e. an empty string).
What are the worst-case memory requirements?
MetaKit implements on-demand loading and only loads structural data (not
the data values) when opening a file. Whenever information is accessed,
a corresponding segment is loaded from file - once - and then kept in memory.
With the current implementation, this means that in the worst case all data
will be made resident. Furthermore, format conversions require a second
copy of the data, so for now twice the total amount of data may be in memory
in the worst case. This is temporary: future versions will use virtual memory
and will be optimized for a high locality of reference.
How can I transport a data structure over the network?
Every view (this includes structured views) can be placed in a storage object
(using the Set function). Storage objects need not be associated with a file
(i.e. the file pointer passed to the constructor may be null). To transport
a view structure without using intermediate files, you can make it part
of a storage object, and then use the members SaveToStream and LoadFromStream
to stream the entire structure over a regular I/O stream. The Winsock-based
client/server examples CatSend and CatRecv demonstrate this.
Can I store a tree structure using MetaKit?
Yes. You have to make a distinction between the data which represents the tree
and the structure of tree nodes. The DisCat example program stores a directory
tree using MetaKit, using a view with one entry per node. Larger tree structures
require more entries but they do not affect the structure of the view itself.
In the case of DisCat, each child node contains the index of its parent node,
but other schemes could be used.
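The parent-index scheme can be sketched in plain C++, with a std::vector standing in for a view. This is not the DisCat source: the Node struct, the use of -1 to mark the root, and the helper names FullPath and ChildrenOf are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One entry per tree node; the tree shape lives in the data, not in the
// structure of the container. parent == -1 marks the root (an assumption).
struct Node {
    std::string name;
    int parent;
};

// Walk up the parent links to build the full path of a node.
std::string FullPath(const std::vector<Node>& nodes, int index) {
    assert(index >= 0 && static_cast<std::size_t>(index) < nodes.size());
    std::string path = nodes[index].name;
    for (int p = nodes[index].parent; p >= 0; p = nodes[p].parent)
        path = nodes[p].name + "/" + path;
    return path;
}

// Collect the row indexes of all direct children of `parent`.
std::vector<int> ChildrenOf(const std::vector<Node>& nodes, int parent) {
    std::vector<int> children;
    for (std::size_t i = 0; i < nodes.size(); ++i)
        if (nodes[i].parent == parent)
            children.push_back(static_cast<int>(i));
    return children;
}
```

Adding more directories or files only adds rows; the two-field row layout stays the same however deep the tree becomes.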
How do I reclaim all unused free space from a data file?
To pack data files to their minimum size, you can load the entire data file
into memory, recreate a fresh file of the same structure, and save the contents
again. This is very similar to the process of converting the structure of
a data file - only this time no changes are made. Newly created data files
do not contain unused (i.e. reclaimable) space.
Will MetaKit handle I/O errors gracefully?
Yes. The mechanisms used to save data on file are based on "stable
storage" principles. This is another way of saying that -either- a
data file has a valid original state -or- it contains the newly committed
state. There are no intermediate states, even when system failures occur.
Exceptions during all file I/O will be passed up to the caller of MetaKit
functions. To proceed after such errors you should use the rollback mechanism
to make sure that all buffered changes are undone, and to synchronize the
application state with the state of the data file. In the current version,
some leaks in allocated memory will occur on platforms which do not support
automatic C++ stack unwinding after exceptions.
Is there a way to store large amounts of text in this version?
Yes. In fact, the on-demand loading properties of MetaKit allow huge data
files to be opened very quickly, since stored data is only loaded upon actual
use. The main consideration is that string fields should not grow excessively
large - this can be taken care of by storing each section of text in a separate
sub-view. You should not just store the different text sections in adjacent
rows of the same view, but use sub-views with only one row for each entry
instead (or, to rephrase this: define a repeating field for large texts
instead of a regular string field, then use a single repetition for each
section). In this way the total size of texts stored in each view will not
accumulate and on-demand loading will perform as expected. In some cases,
you may find it easier to use sub-views with individual lines of text, which
also works quite well.
How should I design my database for maximum performance?
The basic guideline to achieve high performance is "locality of reference".
In MetaKit, this translates to accessing as few of the fields as possible
and avoiding code which iterates over many fields - if you can write your
loops in such a way that the inner loop iterates over records (as opposed
to over fields) you will get dramatic performance gains. This is definitely
not the conventional approach and may even seem to defy all logic - we're
used to fetching an entire record/object and then using all its information.
This is based on the assumption that each record is stored contiguously
in memory and on disk. Well... in MetaKit, it is not. The current version
does not yet exploit the full potential of this approach - searching and
sorting are still far from optimal in performance.
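The guideline can be illustrated with a small plain-C++ sketch. It assumes a column-wise layout (one array per field), which the text above implies but does not spell out; FileTable and TotalSize are hypothetical names and none of this is MetaKit code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical column-wise table: one array per field, all of equal length.
struct FileTable {
    std::vector<std::string> name;
    std::vector<long> size;
    std::vector<long> date;

    std::size_t RowCount() const {
        assert(name.size() == size.size() && size.size() == date.size());
        return size.size();
    }
};

// The loop runs over the records of ONE field, so only the `size` column
// is touched; the `name` and `date` columns are never read.
long TotalSize(const FileTable& table) {
    long total = 0;
    for (std::size_t row = 0; row < table.RowCount(); ++row)
        total += table.size[row];
    return total;
}
```

With on-demand loading, a loop of this shape would only need the one column it reads to be resident, whereas fetching whole records would pull in every field.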
Why doesn't MetaKit maintain indexes like every other DBMS?
First of all: MetaKit is not pretending to be a full-scale DBMS. The data
model of MetaKit is closer to the classical data structures of programming
than the table structures used in relational databases. The concept of a
primary index is replaced by the equivalent "sorted container".
Secondary indexes can be realized as arrays of (indirect) row-index values.
This is possible because views are indexed collections - i.e. arrays - allowing
this very efficient (both in space and time) alternative. The sort and select
operations on views work in precisely this way: they simply set up an array
of integers which is used internally to locate rows in the underlying
view.
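The idea of an index as an array of row numbers can be sketched in plain C++. The functions SortedIndex and SelectRows are hypothetical and only mimic the principle; they are not MetaKit's sort and select operations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Build a "sorted view" of a column as an array of row indexes.
// The underlying column is left untouched.
std::vector<int> SortedIndex(const std::vector<std::string>& column) {
    std::vector<int> order(column.size());
    std::iota(order.begin(), order.end(), 0);  // 0, 1, 2, ...
    std::stable_sort(order.begin(), order.end(),
                     [&column](int a, int b) { return column[a] < column[b]; });
    return order;
}

// Build a "selection" as the row indexes whose value lies in [low, high].
std::vector<int> SelectRows(const std::vector<int>& column, int low, int high) {
    assert(low <= high);
    std::vector<int> rows;
    for (std::size_t i = 0; i < column.size(); ++i)
        if (column[i] >= low && column[i] <= high)
            rows.push_back(static_cast<int>(i));
    return rows;
}
```

Each such index costs one integer per row, and several of them can coexist over the same underlying data.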
How can MetaKit be used to store millions of small objects?
Technically speaking, you could create views with millions of rows (except
in 16-bit address spaces), but such views would be dreadfully slow. The
only practical way to store millions of objects right now is to structure
your information in nested views with hundreds, or perhaps thousands, of
entries. In the current implementation, a 2- or 3-level view structure can
easily handle millions of small objects. If you had to store a large collection
of files in a hierarchical file system, you would probably do very much
the same thing - create a few directory levels, perhaps segmented on the
first letter of the file name (or the day of the month), to reduce the number
of files stored in any single directory. The DisCat example program is an
example of this strategy. The disk catalogs are segmented as a list of directories,
each with a list of all files in them - this structure easily deals with
just about any size file system, including CD-ROMs.
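The segmenting idea can be sketched as a two-level structure in plain C++, using the first-letter split mentioned above. The class SegmentedNames is hypothetical and uses standard containers rather than nested MetaKit views.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical two-level container: an outer level with one bucket per
// first byte of the name, and an inner list of names in each bucket.
class SegmentedNames {
public:
    void Insert(const std::string& name) {
        buckets_[BucketOf(name)].push_back(name);
    }

    // Only the one bucket matching the first letter is searched.
    bool Contains(const std::string& name) const {
        const std::vector<std::string>& bucket = buckets_[BucketOf(name)];
        return std::find(bucket.begin(), bucket.end(), name) != bucket.end();
    }

    std::size_t BucketSize(char first) const {
        return buckets_[static_cast<unsigned char>(first)].size();
    }

private:
    static std::size_t BucketOf(const std::string& name) {
        assert(!name.empty());  // this sketch requires non-empty names
        return static_cast<unsigned char>(name[0]);
    }

    std::array<std::vector<std::string>, 256> buckets_;
};
```

A third level (for example on the second letter) follows the same pattern if the inner lists still grow too long.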
I have records/objects of varying structure. How do I store these?
Evidently, the first rule is to store substantially different records as
separate views with their own structure. But you have a number of extra
options at hand: if several fields of the same type are not needed in each
record, you could consider creating an indexed sub-view for all those fields
instead, in effect replacing each set of fields with a view containing a
varying set of records. Also, since MetaKit only requires 1 byte for each
unused string, and has zero overhead for integer fields which are not used
at all in a view, the extra space used by simply declaring all fields may
be much lower than expected.
My objects contain pointers. How does MetaKit deal with this?
MetaKit doesn't use pointer persistence. The objects managed by MetaKit
are the rows which are present as entries in views. These "run-time"
objects are similar - but not equivalent - to C++ objects. You can build
and manipulate views and nest them to any level, but they do not contain
pointers or other data which cannot be represented on file. The nearest
analogy to a pointer is a "cursor", which is simply a <view,
index> pair. Cursors cannot be stored, but of course indexes can, and
cursors are easily reconstructed from a view and an index value. View indexing
- and the use of cursors - is very efficient in MetaKit, and both are very easy
to use thanks to the operator overloading of C++.
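The cursor concept can be sketched in plain C++ as a <view, index> pair with overloaded operators. The View and Cursor types and the MakeCursor helper below are hypothetical stand-ins, not MetaKit's own classes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a view: an indexed collection of rows.
struct View {
    std::vector<std::string> rows;
};

// A cursor is just a <view, index> pair. It is a run-time object only;
// what can be stored on file is the index.
struct Cursor {
    View* view;
    int index;

    std::string& operator*() const {
        assert(index >= 0 && static_cast<std::size_t>(index) < view->rows.size());
        return view->rows[index];
    }

    Cursor& operator++() {
        ++index;
        return *this;
    }
};

// Rebuild a cursor from a view and a previously stored index value.
Cursor MakeCursor(View& view, int storedIndex) {
    return Cursor{&view, storedIndex};
}
```

Where an object would otherwise hold a pointer to another object, it can hold the row index instead and rebuild the cursor after loading.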
For questions or comments, please contact Jean-Claude
Wippler @ Meta Four Software.