www.sonnenberger.org Blog: Netbsd

Clang and NetBSD/ARM

2014-06-06T13:14:22Z

Full support for NetBSD/ARM committed.

Following the recent flag day to move (e)ARM from EHABI to normal Itanium-style unwinding as well as the import of recent LLVM/Clang, the EABI variants of the ARM ports are now fully supported with Clang. In short, just setting MKLLVM=yes and HAVE_LLVM=yes for build.sh will switch to over to the BSD licensed toolchain. Testing especially on ARMv4 is very welcome.

Martin Husemann is stitll working on the networking support for the Cubietruck, but once that is done we plan to setup a bunch of Cubietruck boards for bulk building pkgsrc. Stay tuned.

I'm investigating MIPS now, but that is a bit more tricky.

Bug fix update for cvs2fossil

2014-05-02T21:11:10Z

A minor update fixing the handling of path names with spaces as well as delayed vendor branches.

The new snapshot of cvs2fossil fixes two issues seen in the NetBSD repositories:

The vendor branch is sometimes lagging a few seconds behind the 1.1 revision, so apply some fudge.

Path names with spaces were not escaped. This broke NetBSD src after the LLVM import.

Both changes have been in use for the NetBSD conversions for a while.

Status of NetBSD and LLVM

2012-03-15T15:57:59Z

NetBSD/AMD64 has been supported by LLVM and Clang for quite some time now. There are a few regressions in the ATF tests compared to GCC, but they don't look serious.
Recent work for NetBSD/i386 involved fixing a number of nasty little ABI bugs, where NetBSD and other ELF systems differ in the details. It is now comparable to the status of AMD64.

I've started on getting support for LLVM and Clang as system compiler in NetBSD in 2010. The reach-over frame work was committed last February. Unlike GCC and PCC, I haven't imported the source code yet. It would add a lot of space in the repository and working copies as well as increase the overhead of keeping the copy in NetBSD synchronized with upstream trunk. As such interested parties have to run the "checkout" target in src/external/bsd/llvm to get a fresh copy from svn as well as re-run the target whenever the in-tree version changes. After that, all that is needed is setting MKLLVM=yes and HAVE_LLVM=yes to build the system with Clang.

In my own ATF runs I currently have 8 failures for the Clang world. 5 of the 8 failures are also seen in the ATF runs of the GCC world in the same Virtualbox environment. The remaining cases are as follows:

lib/libc/ssp/t_ssp's raw test: difference in the stack-protector handling of GCC and Clang. Details in the LLVM bug.
lib/libpthread/t_cond's cond_timedwait_race test doesn't trigger with Clang. Not analyzed in detail yet.
lib/semaphore/pthread/t_sem_pth's destroy test fails differently. Not analyzed in detail yet.

For i386, there is one problem in libm, where expf(3) seems to give wrong results, at least that is the reason why the sinhf regression test fails. I haven't looked at this further.

I haven't run any benchmarks yet, so no numbers on code size, execution speed or even build time. The LLVM build is a debug build, so it is slower than necessary.

I'm also running irregular pkgsrc builds with Clang. There are a number of common issues:

Clang defaults to C99 and is more picky due to that. It implements the C99 inline semantics, which differ from the traditional GCC behavior.
Clang is a lot stricter as C++ compiler, so a bunch of broken things won't pass, which work silently by GCC due to pruning of dead code.
CMSG_SPACE and friends are sometimes used as part of a union. NetBSD uses a tricky function internally, so this is not an Integer Constant Expression. Clang rejects using a Variable Length Array as part of a union, while GCC happily accepts that. This is about to be fixed for NetBSD/current.
Clang doesn't support nested functions. They should be avoided in general, since they require an executable stack. GCC by default doesn't even warn about them.

Help in cutting down the number of trivial build failures would definitely be appreciated. At the moment, almost 1000 packages fail for various reasons, many of them can be fixed in a bunch of minutes.

Update: The correct option is HAVE_LLVM and not HAVE_CLANG.

Git repositories on github

2011-07-28T00:33:57Z

Checkout https://github.com/jsonn/src and https://github.com/jsonn/pkgsrc!

After some initial issues the git conversion seems to be stable now. Both src and https://github.com/jsonn/pkgsrc can be found on github.

Fun fact about git: pushing trunk from src alone and the other refs with push -all afterwards requires 50% more space than doing the push -all in first place.

One issue remains. I can't do incremental exports from fossil to git, because git can't find objects it has imported earlier. I'm not the first person to experience this, but it seems no one cares enough to fix it.

Fossil repositories moved and updated, git in progress.

2011-07-25T10:09:01Z

New server with better connection for the fossil repositories, a mirror and initial addition of git support

The fossil repositories (src and pkgsrc) have moved to a new server, so at least upstream bandwidth no longer is a problem. For cloning, it is still recommented to fetch the database files directly from ftp.NetBSD.org. Zafer Aydoğan also provides mirrors at src and pkgsrc.

The src repository has been rebuild from scratch after many clean ups to the branching. It should be in a pretty good shape now. Special thanks to S.P. Zeidler for the assistance in messing up cvs.NetBSD.org.

At this point, I believe the conversion to be stable and don't plan any more repository changes. If you run across inconsistencies in the conversion, please mail me though.

I've also started to integrate the git export. This is still a lot more overhead than necessary. The incremental fast-import in git doesn't work, so it has to write all all blobs and commits on every pass, adding another 20min or so to the conversion rounds. The result can be found on github. I'm interested in feedback here to decide asking github for more space to add src as well.
Update: there seems to be a small bug with the conversion of file deletes, so the git repo will be nuked and rebuild soonish
Update 2: fixed and repushed

May update for cvs2fossil

2011-05-11T23:31:33Z

Bug fix release for cvs2fossil to get stable conversion for pkgsrc

The new snapshot of cvs2fossil fixes three important issues:

Vendor branches do not fall under the normal branch point rules. This effectively moved the real cut-off point for pkgsrc on every import, making the converted repository grow extremely fast.
The expansion of $Log$ is atomic and the commit message itself no subject of keyword substitution. This didn't affect the NetBSD repository, but evidently crashed the conversion of OpenBSD's src module. The rewritten expansion code is much, much simpler now.
Executable permissions are preserved.

I've used this and also changed the conversion to create the newer sub-second timestamp format. To fully exploit the bug fixes, the two NetBSD repositories (pkgsrc and src) have been recreated some scratch. I've decided to skip the top-level directory now as well, so src/Makefile is now just Makefile and src/bin/cat/cat.c can be found in bin/cat/cat.c.

Minor update for cvs2fossil

2011-03-07T22:24:04Z

Better diagnostic for Attic conflicts and support for commitid

There is another snapshot of cvs2fossil. This snapshot fixes two small issues:

Though most of the support for "commitid" was in place, I had forgotten to add it to the actual keyword list.
The error message if a file was both in the Attic and outside was a bit mysterious, so it now detects this explicitly.

Bug fix for cvs2fossil

2011-02-17T01:54:23Z

A small bug fix for cvs2fossil to handle "cvs delete" on branches correctly.

I'm proud to release a new snapshot of cvs2fossil. This fixes an annoying little bug reported from the TCL/TK community.

If the first commit of a branch is in state "dead", this can be the result of two different things:

The file could have been added after the branch creation and this is the first change.
The file could have existed at the branch creation time and was explicitly deleted.

In the first case, the time of the commit equals the branch point, in the second case, it doesn't. The old version didn't check this and handled the second case like the first one.

Update on fossil conversion

2011-01-22T20:16:58Z

In the two month since the last update, the Fossil conversion utility has seen quite a number of improvements.
A public mirror of the repository conversion is provided.

Since the initial release of the Fossil converter, a lot has changed:

The code should work out of the box on NetBSD and Linux, minimal changes for other Unix platforms should be enough.
Vendor branch are imported as separate branch. No merge data with trunk is provided and this only works for true vendor branches.
First one computes file diffs as hints for the fourth phase. This exploits multi-threading as well as the normal per-file parallelism.
The fourth phase will create deltas for the manifests in most cases as well. Delta creating and compression in this case uses multi-threading.
A bug in the branch tagging was fixed.
If a file was initally added a branch, the 1.1 revision should be skipped. This was fixed.

The code can be found here.

The processing time is still around 5h for src, but I'm currently running the machine with less memory. On my laptop with SSD, it needs 3h for a full conversion.

I am providing a public mirror of the conversion. It is updated around 3-4 times a day. Please avoid cloning directly from that site as I am a bit bandwidth constrained. You can fetch the repository directly from ftp.netbsd.org (which is much faster) and pull changes from my server afterwards. Please also note that the leaf versions are sometimes not completely stable due to incomplete rsync of large commits or if a branch was created without a commit between runs. When I see such an instance, I will close the wrong leaf.

Fossil conversion

2010-11-14T01:31:10Z

Initial release of cvs to fossil conversion routines

Three weeks ago I wrote about the fossil tests. Quite a few things have changed inside fossil and I have been working on reimplementing the Python parts of conversion tools in C as well improving the performance.

The code doesn't have any fancy build system and a few (Net)BSD features are used, so don't expect it to work out of the box on anything else. Most RCS files should work out-of-the-box. A few limitations are known and not handled automatically:

Commit time inversions, e.g. time going backwards from one commit to the other. The only instance where this is not a bug in the repository is for 1.1.1.1, which can legitimately be older than 1.1. This results in the revisions being picked up in the wrong order.
Vendor branches are ignored after trunk conversion has been done. Normally 1.1 is dropped in favor of 1.1.1.1 and the various vendor versions are pulled into trunk as long as they are older than 1.2. Support to do a bare import of a given vendor branch will be added at a later point, e.g. to convert them into a standalone branch.
Tags are completely ignored.
Branches are considered as vendor branches, if they are using 1.1.1 as revision for any file in the tree.
No automatic fixup of missing vendor branches or default branches. The output of 99-warnings can be used with cvs rtag or rcs to fix those up.
Branches must have a consistent parent relation, currently no diagnostic is given if a branch is dangling due to depending on a vendor branch.
Branch time computation is somewhat simplistic. No diagnostic yet for files added incorrectly after the initial branching or if commits happened between the global branch time and the branch point of a specific file. No support for replaying of commits to fix such cases either.

I've been slowly fixing up various issues exposed by this tool in the NetBSD repository. Processing needs around 5h on an AMD Opteron 1389 from src in CVS form to the Fossil repository. The majority of the time is spend in the commit building step, which is primarily IO bound and a potential place for further investigation. Ultimate goal is to get a bit exact import of all major branches, which seems to (almost) the case now.

Fossil and NetBSD

2010-10-23T22:20:21Z

Large scale repositories create interesting issues for version control systems. How does fossil cope?

The tech-repository has been silent for a long time. Initial conversions from CVS to git and Mercurial highlighted various issues with the conversion tools. They also provided some insight into scalability issues with the existing major two. In July, bsdtalk run an interesting interview with D. Richard Hipp about fossil.

Fossil has some interesting properties. One of them is the license, making it easy to integrate into the base system without too much discussion about legal issues. Another one is that it is pretty self-contained. This brings up the most important question, how does fossil scale?

To answer that question, it was important to get a repository into it first. The first try used the Git import tool for fossil on the original conversion from CVS to Git I created in 2008. This resulted in some bug fixes for the stat page in fossil to be 64bit clean, but also provided the insight that the branch handling in svn->git was messed up and that there is at least one major issue in fossil in terms of repo size.

Based on the known issues of the cvs2svn tool and the problems other conversion tools have, I started to hack up some conversion logic myself. Testing it on a private part of the NetBSD repository, that contains e.g. the master version of pkg-vulnerabilities, resulted in improvements for the delta caching in fossil. This file makes up about 1/3 about all revisions in that sub-repository and itself has over 3000 (CVS) revisions. At the time fossil was using plain recursion to handle delta chains and in some code paths using two stack frames for one entry in the delta chain. This resulted in crashes of fossil due the default stack size limit in NetBSD. This was fixed in smallstack branch of fossil by using tail recursion if possible and otherwise moving the necessary unwind data to the heap.

With that issue fixed, it was possible to import both pkgsrc and src into fossil. Many operations like checkout are speedy. I haven't done much benchmarking yet, but things like "fossil status" or the built-in web UI were fast enough. One very major factor that impacts clone and initial import was that the "rebuild" step which (re)creates the internal meta data cache took 10h on a fast machine.

The main issue with rebuild is that it has to parse all the manifests (the description of a change set). For pkgsrc and src, this is dominated by the list of files in revision with the associated SHA1 hash. A typical working copy of pkgsrc has over 60000 files, so the resulting manifest is in the area of Megabytes. Every single manifest has to be hashed (with both SHA1 and MD5) at least once and parsed. A longer discussion with Richard resulted in two different ideas to be tried. The first idea was to cluster the file list into smaller parts referenced by hash. If one part wasn't changed compared to the parent revision, it could be reused. This cut the size of the temporary database used for the import from 200GB to between 5GB and 8GB, depending on the number of clusters. The obvious down side is the much increase in the number of artifacts in the resulting fossil database. The second idea, which can now be found in the experimental branch, was to make the manifest itself kind of a patch by encoding only the list of changed files relative to a baseline. This approach resulted in similar savings for the intermediate database. Initial testing showed that the rebuild time went down from 10h to 1h -- much better, but still not the expected range.

A combination of some optimizations like using a faster SHA1 implementation and tweaking the manifest parser resulted in at least 10% improvement. An additional change to the manifest cache resulted in bring down the time to 10min. That's still not optimal, but for something normally only done during clone, it is not a major blocking point.

At this point the major road block for fossil is the run time of "fossil clone", In terms of IO, it isn't too bad. 36MB sent to the server and 530MB received is acceptable for the repository size. The problem is that the operation takes 90min on a fast machine cloning from localhost and only 25min are actually spend for the process.

The insane AMD64 segmentation issue

2010-05-04T14:29:05Z

When AMD introduced Long Mode aka the 64bit extension, they retired the segmentation implementation to switch to a flat memory model. The entire concept of segmentation? No! The FS and GS register still work somewhat like before, just not entirely. Why is it important? Modern Linux binaries require support for Thread Local Storage (TLS) and that is using FS and GS.

The x86 platform has two layers of virtualisation. The lower layer is the paging and translates logical to physical addresses. The upper layer turns a pointer and a base register into a logical address. The base register is called a segment register and either implied or specified by an instruction prefix. This started as a cute hack to support 1MB of memory with 16bit registers in the 8086 and turned more complex with the introduction of Protected Mode with the 80286. In Protected Mode, the value of the segment register is actually a descriptor value used to index the Global Descriptor Table or in some cases the per-process equivalent. This table contains a base and limit as well as type and access information. The CPU enforces the limit on access, e.g. the pointer value has be smaller than the limit. The logical address is the sum of the pointer and the base.

This changes dramatically when Long Mode is activated. Most segment registers are plainly ignored, the only exception are FS and GS. In many operating systems they are used to either point to Thread Local Storage (userland) or CPU local storage (kernel). To provide an alternative, two MSRs were introduced to contain the 64bit base and no limit checks are performed. This has some issues. First of all, some applications actually depended on the limit check. More importantly, it adds some fun issues for 32bit compatibility. There is no way besides writing to the segment register to set the limit. Setting the segment register also modifies the corresponding MSRs. It is not possible to load 64bit base addresses using descriptors. In short: the correct way to load FS/GS depends on whether the target code is running in 32bit compatibility mode or not. For 64bit mode it has to load fs/gs first, if the value changed. Next, the MSRs are updated to get the value of interest. For 32bit mode, the code doesn't have to bother with the MSRs and can just load the fs/gs register directly.

The initial patch to support sysarch on amd64 and TLS for Linux binaries does the MSR manipulations in cpu_switchto, e.g. on context change and in the manipulation functions on change. This means that the wrmsr instruction is not part of the normal system call path as opposed doing it on every interrupt return or system call exit as some other operating systems do. It is left for future work to

How to divide fast

2010-03-19T17:33:13Z

The division is one of the less often used integer operations and by a huge margin the slowest. Often the divisor is semi-constant and this allows using much cheaper operations.

The computation of a / b or a %b is one of the less common operations in many programs. Nevertheless it is part of many time critical operations. Many hash tables for example have a prime number as size and require the computation of the remainder. The division instructions are generally also the slowest integer operations CPUs offer and by a far margin.

The price of the division makes it attractive to check for cheaper alternatives. For constants, GCC and other compilers will replace unsigned divisions with simpler code based on an old paper. This doesn't help with numbers that are variable, but invariant at runtime. For resizable hash tables the resizing is a rare operation and the hot path (hashing) could benefit greatly from using this optimisations.

This leads to the new functions in sys/bitops.h: fast_divide32, fast_remainder32 and fast_divide32_prepare. The third function computes the constants needed for the other two. Benchmarking with the services(5) lookup code using CDB shows a 10% improvement for getservbyport. This is partially a result of the better scheduling as multiplications can run in parallel with other instructions and utilize the pipelines better.

A new constant database

2010-03-04T16:09:54Z

A space-efficient alternative to db and cdb

Many applications require a key,value database, but the data itself is (mostly) constant. General purpose hash databases are inefficient for this purpose as they waste a lot of space and often also require more complex queries. The most well-known constant database is from DJB, but it is a number of limitations.The biggest issues of DJB's implementation concern the large overhead (24 Bytes / record) and the inefficient hash function. Bob Jenkins has a comparison.

To work around this, I started to design a new constant database format. As a building block I used the CHM algorithm for a minimal perfect hash function. An implementation for that already existed in nbperf(1). I decided to use a separate offset table as it provides a natural encoding of the entry size and makes it easier to allow multiple keys for the same value.

To minimize the storage I decided against handling key matching in the library. The pay load for many of the intended uses already contains the key, so storing it separately would waste space. The perfect hashing ensures that only a single entry has to be matched, so the overhead is generally small, too.

The result is quite impressive. For the /etc/service database, the cdb version is 304KB, the original db(3) database 2.1MB (4.2MB file size, but sparse). The run time of services_mkdb dropped from 2.2s before to 0.24s afterwards. For terminfo, the cdb version is 1.3MB compared to 2.1MB for db(3) (2.8MB file size, but sparse). The run time of tic -x dropped from 1.5s to 0.4s.

Code size is quite small as well, the reader itself is 5.2KB code and the writer 15KB. Compiled for AMD64, the reader is 1.2KB and the writer 3.6KB.

Dynamic mbuf cluster limit

2010-02-11T17:25:48Z

The hard-coded limit of 1024 mbuf clusters in the x86 GENERIC kernel is gone. The new limit is based on the available memory.

The NetBSD kernel uses special memory buffers (mbufs) for network operations. Storage for large packets is allocated as clusters, typically 2KB in size.

Since forever, the limit for the number of clusters was static. Depending on the architecture and the presence of the GATEWAY option, the kernel used at most 4MB for the various forms of network IO. This limit was easy to exhaust, e.g. by starting a Bittorrent client with file descriptor limit of 1024.

This limit exists to avoid exhausting the system memory by a remote attack. It couldn't be raised on most architectures, because the kernel reserved a fixed amount of virtual address space at boot time.

On some architectures, this is completely unnecessary, because memory pools as used for mbuf clusters, but also other common data structures, are using a special direct mapping. This means that any given physical memory address can be easily converted into a virtual address -- without having to modify the page tables.

Other architectures can just grow the kernel memory map or have a huge reservation for the normal kernel memory allocator. AMD64 is such an architecture. By default, the kernel reserves up to 1GB for internal allocations, so it can just allocation the address space for mbuf clusters from the same range.

Over all, removing the special kernel submap was an easy exercise, only two architectures needed special care. Both i386 and the ARM family lack direct mappings and share address space between kernel and userland. On i386, the kernel is allowed to use only 512MB, including mapping device memory and the like. An additional limit was therefore needed on this architectures.

As a result, kern.mbuf.nmbclusters can now be increased at run time with sysctl(8). Some limits are enforced to prevent resource starvation. Basically, at most 1/4 of all memory can be used for mbuf clusters. Performance concerns were raised for architectures without direct mapping due to higher lock contention on the kernel memory map, but the normal pool cache makes that a rare event.