The insane AMD64 segmentation issue

The x86 platform has two layers of virtualisation. The lower layer is paging, which translates linear addresses to physical addresses. The upper layer is segmentation, which turns a pointer and a base register into a linear address. The base register is called a segment register and is either implied by the instruction or specified by an instruction prefix. This started as a cute hack to support 1MB of memory with 16-bit registers in the 8086 and became more complex with the introduction of Protected Mode in the 80286. In Protected Mode, the value of the segment register is actually a selector used to index the Global Descriptor Table or, in some cases, the per-process equivalent (the Local Descriptor Table). Each table entry contains a base and a limit as well as type and access information. The CPU enforces the limit on every access, i.e. the pointer value must not exceed the limit. The linear address is the sum of the pointer and the base.
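The two translation schemes described above can be sketched in plain C. This is only a user-space model under simplifying assumptions: the descriptor holds nothing but base and limit, a failed check returns false where a real CPU would raise a fault, and the names real_mode_address, protected_mode_address and struct descriptor are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Real mode (8086): the segment value is shifted left by four bits and
 * added to the 16-bit offset, which yields a 20-bit (1MB) address. */
static uint32_t real_mode_address(uint16_t segment, uint16_t offset)
{
	return ((uint32_t)segment << 4) + offset;
}

/* Simplified descriptor (hypothetical layout): base and limit only,
 * the type and access fields are left out. */
struct descriptor {
	uint32_t base;
	uint32_t limit;		/* highest valid offset */
};

/* Protected Mode: the selector indexes the descriptor table, the offset
 * is checked against the limit, and the base is added.
 * Returns false where a real CPU would fault. */
static bool protected_mode_address(const struct descriptor *table,
    size_t entries, uint16_t selector, uint32_t offset, uint32_t *out)
{
	/* The low three bits of a selector are the table indicator
	 * and the requested privilege level, not part of the index. */
	size_t index = selector >> 3;

	assert(out != NULL);
	if (index >= entries)
		return false;
	if (offset > table[index].limit)
		return false;
	*out = table[index].base + offset;
	return true;
}
```

With a table whose entry 1 has base 0x1000 and limit 0xfff, selector 0x08 and offset 0x10 give the address 0x1010, while offset 0x1000 is rejected by the limit check.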

This changes dramatically when Long Mode is activated. Most segment registers are simply ignored; the only exceptions are FS and GS. In many operating systems they are used to point either to Thread Local Storage (userland) or to CPU-local storage (kernel). As an alternative to descriptors, two MSRs were introduced to hold the 64-bit base, and no limit checks are performed in 64-bit mode. This has some issues. First of all, some applications actually depended on the limit check. More importantly, it adds some fun issues for 32-bit compatibility. The only way to set the limit is to write to the segment register, but setting the segment register also overwrites the corresponding MSR. It is not possible to load 64-bit base addresses using descriptors. In short: the correct way to load FS/GS depends on whether the target code is running in 32-bit compatibility mode or not. For 64-bit mode, the kernel has to load FS/GS first, if the selector changed, and only then update the MSRs to the value of interest. For 32-bit mode, the kernel doesn't have to bother with the MSRs and can just load the FS/GS register directly.
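The ordering constraint is easier to see in a toy model. The sketch below is not kernel code and does not touch real registers: struct cpu_model, model_load_fs, model_wrmsr and the two switch functions are hypothetical names, and the assumption that the 64-bit case uses a descriptor with base 0 is mine. Only the MSR number 0xc0000100 for the FS base is architectural.

```c
#include <assert.h>
#include <stdint.h>

#define MSR_FSBASE	0xc0000100u	/* architectural MSR for the FS base */

/* Toy model of the FS state (hypothetical, for illustration only). */
struct cpu_model {
	uint16_t fs_selector;
	uint64_t fs_base;	/* the value MSR_FSBASE would read back */
};

/* Models a segment register load: the base is replaced by the
 * 32-bit base taken from the descriptor, clobbering the MSR value. */
static void model_load_fs(struct cpu_model *cpu, uint16_t selector,
    uint32_t descriptor_base)
{
	cpu->fs_selector = selector;
	cpu->fs_base = descriptor_base;
}

/* Models wrmsr for the FS base: sets the full 64-bit base and leaves
 * the selector alone. */
static void model_wrmsr(struct cpu_model *cpu, uint32_t msr, uint64_t value)
{
	assert(msr == MSR_FSBASE);
	cpu->fs_base = value;
}

/* 64-bit target: selector first (only if it changed), MSR second. */
static void switch_fs_64(struct cpu_model *cpu, uint16_t selector,
    uint64_t base)
{
	if (cpu->fs_selector != selector)
		model_load_fs(cpu, selector, 0);	/* assumed base 0 */
	model_wrmsr(cpu, MSR_FSBASE, base);
}

/* 32-bit target: the segment load alone sets base and limit. */
static void switch_fs_32(struct cpu_model *cpu, uint16_t selector,
    uint32_t descriptor_base)
{
	model_load_fs(cpu, selector, descriptor_base);
}
```

Swapping the two steps in switch_fs_64 would leave the base at whatever the descriptor provides, which is why the MSR write has to come last.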

The initial patch to support sysarch on amd64 and TLS for Linux binaries does the MSR manipulation in cpu_switchto, i.e. on context switch, and in the manipulation functions when the value changes. This means that the wrmsr instruction is not part of the normal system call path, as opposed to doing it on every interrupt return or system call exit as some other operating systems do. It is left for future work to