A new constant database

Many applications require a key,value database, but the data itself is (mostly) constant. General purpose hash databases are inefficient for this purpose as they waste a lot of space and often also require more complex queries. The most well-known constant database is from DJB, but it is a number of limitations.The biggest issues of DJB's implementation concern the large overhead (24 Bytes / record) and the inefficient hash function. Bob Jenkins has a comparison.

To work around this, I started to design a new constant database format. As a building block I used the CHM algorithm for a minimal perfect hash function. An implementation for that already existed in nbperf(1). I decided to use a separate offset table as it provides a natural encoding of the entry size and makes it easier to allow multiple keys for the same value.

To minimize the storage I decided against handling key matching in the library. The pay load for many of the intended uses already contains the key, so storing it separately would waste space. The perfect hashing ensures that only a single entry has to be matched, so the overhead is generally small, too.

The result is quite impressive. For the /etc/service database, the cdb version is 304KB, the original db(3) database 2.1MB (4.2MB file size, but sparse). The run time of services_mkdb dropped from 2.2s before to 0.24s afterwards. For terminfo, the cdb version is 1.3MB compared to 2.1MB for db(3) (2.8MB file size, but sparse). The run time of tic -x dropped from 1.5s to 0.4s.

Code size is quite small as well, the reader itself is 5.2KB code and the writer 15KB. Compiled for AMD64, the reader is 1.2KB and the writer 3.6KB.