RISC-V H-Extension: The Only Hypervisor Spec You Can Read at Lunch

Here is a thing about RISC-V hypervisors that nobody tells you. The hypervisor extension is twenty CSRs. Not twenty new instructions. Not a separate root privilege level. Not a microcode blob you have to reverse. Twenty configuration registers. You read the spec in an afternoon and then you know the whole thing. Try that with VMX.

Every time I tell someone the H-extension is simple, they assume I mean "simple compared to x86" which is like saying your apartment is clean compared to a landfill. No. I mean it is genuinely, architecturally simple. The spec chapter is forty pages and half of that is register diagrams.

Let me walk you through it from the bottom up. If you know RISC-V page tables already, skip the first section. If you don't, I am going to explain them in a way that makes the two-stage stuff easy.

What You Need To Know First

RISC-V has three privilege levels. M-mode is the firmware, the thing that runs before anything else. S-mode is your kernel. U-mode is your userspace. When U-mode needs to do something it cannot, it traps to S-mode. When S-mode needs to do something it cannot, it traps to M-mode. This is a stack, btw.

A hypervisor lets you run an entire OS inside another OS. The outer OS (or a thin shim) becomes the hypervisor. The inner OS is the "guest." The guest thinks it owns the hardware. The hypervisor decides what the guest actually gets.

The H-extension adds two virtual privilege levels: VS-mode (where the guest kernel runs) and VU-mode (where the guest's user processes run). To the guest, VS-mode looks exactly like S-mode. Same CSRs, same trap mechanism, same page tables. The hypervisor lives in HS-mode, which is S-mode with access to the H CSRs.

A CSR (control and status register) is how you talk to the CPU. Page table pointer goes in satp. Interrupt handler address goes in stvec. Trap cause goes in scause. The H-extension adds new CSRs for the hypervisor's page table pointer, fault information, and the bits that control what traps to you.

A trap is the CPU saving state and jumping to a handler. sret goes back. One bit in hstatus (SPV, the previous virtualization mode) changes whether sret returns to a guest or returns to a normal S-mode process. That bit is literally all that separates "return to kernel" from "return to guest kernel."

The Page Table You Already Know (Sv39)

Before we get to two-stage, we need to understand what one-stage looks like. RISC-V uses a radix-tree page table called Sv39. The virtual address is 39 bits, broken into four pieces:

  38      30 29      21 20      12 11      0
+----------+----------+----------+----------+
|  VPN[2]  |  VPN[1]  |  VPN[0]  |  offset  |
+----------+----------+----------+----------+
  9 bits     9 bits     9 bits     12 bits
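
The field boundaries in the diagram translate directly into shifts and masks. A minimal sketch in Python (the function name and the sample address are made up for illustration):

```python
def split_sv39_va(va):
    """Split a 39-bit Sv39 virtual address into VPN[2], VPN[1], VPN[0] and offset."""
    return {
        "vpn2": (va >> 30) & 0x1FF,   # bits 38..30, 9 bits
        "vpn1": (va >> 21) & 0x1FF,   # bits 29..21, 9 bits
        "vpn0": (va >> 12) & 0x1FF,   # bits 20..12, 9 bits
        "offset": va & 0xFFF,         # bits 11..0, 12 bits
    }


fields = split_sv39_va(0x40201ABC)
print(fields)
```

Each VPN field is an index into one level of the tree, which is why each table has 512 entries.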

The MMU starts at the page table root, pointed to by satp.ppn shifted left 12 bits (because page tables are page-aligned). It indexes into level 2 using VPN[2], reads a PTE (Page Table Entry, 8 bytes). If the PTE has the V bit set and the R, W, or X bits set, it is a leaf, stop here, you found your physical page. If not, the PPN field points to the next level and you repeat with VPN[1], then VPN[0].
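
The walk above can be modelled in a few lines of Python. This is a sketch, not hardware: `mem` is a hypothetical dict standing in for physical memory, the helper names are invented, and the permission and A/D checks are left out so the tree traversal stays visible.

```python
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3
PPN_MASK = (1 << 44) - 1


def make_pte(ppn, flags):
    """Build a PTE from a physical page number and flag bits."""
    return (ppn << 10) | flags


def sv39_walk(mem, root_ppn, va):
    """Translate va, or return None where the MMU would raise a page fault."""
    table = root_ppn << 12                      # satp.ppn shifted left 12 bits
    for level in (2, 1, 0):
        vpn = (va >> (12 + 9 * level)) & 0x1FF
        pte = mem.get(table + vpn * 8, 0)       # PTEs are 8 bytes each
        if not pte & PTE_V:
            return None
        if pte & PTE_W and not pte & PTE_R:
            return None                         # W without R is a reserved encoding
        ppn = (pte >> 10) & PPN_MASK
        if pte & (PTE_R | PTE_X):               # leaf: translation ends here
            low_bits = 12 + 9 * level           # 12, 21 or 30 bits come from the VA
            if (ppn << 12) & ((1 << low_bits) - 1):
                return None                     # misaligned superpage
            return (ppn << 12) | (va & ((1 << low_bits) - 1))
        table = ppn << 12                       # pointer to the next level
    return None                                 # level 0 entry was not a leaf


# Hypothetical memory image: root table at 0x8000_0000, one 4 KiB mapping.
mem = {
    0x8000_0000 + 8: make_pte(0x80001, PTE_V),
    0x8000_1000 + 8: make_pte(0x80002, PTE_V),
    0x8000_2000 + 8: make_pte(0x12345, PTE_V | PTE_R | PTE_W),
}
phys = sv39_walk(mem, 0x80000, 0x4020_1ABC)
print(hex(phys))
```

A leaf found at level 2 or level 1 is a 1 GiB or 2 MiB superpage, which is why the number of bits taken from the virtual address depends on the level.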

A PTE on RV64 looks like this:

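bits    63   62-61   60-54   53-10        9-8   7  6  5  4  3  2  1  0
field   N    PBMT    RSV     PPN[2..0]    RSW   D  A  G  U  X  W  R  V

(N and PBMT only mean something if Svnapot and Svpbmt are implemented. Otherwise they are reserved, and reserved bits must be zero.)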

V is the valid bit. R, W, X are permission bits (read, write, execute). U means user-accessible. G means global (all address spaces). A and D are the accessed and dirty flags, marking the first access and the first write respectively. PPN is the physical page number. Who sets A and D depends on the implementation: some hardware updates them itself, other hardware raises a page fault and leaves it to software. So if you write a PTE without setting A and the hardware does not do A-bit updates, the MMU will fault on first access so software can set it.
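
Decoding a raw PTE is a good debugging habit. A small sketch, with a made-up function name and sample value, that pulls out the fields listed above:

```python
FLAG_NAMES = "VRWXUGAD"          # bit 0 (V) through bit 7 (D)
PPN_MASK = (1 << 44) - 1         # PPN occupies bits 53..10


def decode_pte(pte):
    """Return the PPN, the RSW bits, the set flags, and whether this is a leaf."""
    flags = "".join(name for bit, name in enumerate(FLAG_NAMES) if (pte >> bit) & 1)
    return {
        "ppn": (pte >> 10) & PPN_MASK,
        "rsw": (pte >> 8) & 0x3,
        "flags": flags,
        "leaf": bool(pte & 1) and bool(pte & 0b1010),   # valid, and R or X set
    }


decoded = decode_pte((0x12345 << 10) | 0xCF)
print(decoded)
```

A valid entry with none of R, W, X set is a pointer to the next level, so `leaf` comes back False for it.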

That is Sv39. Standard issue. Your kernel already implements this. Good. Now here is where it gets interesting, so pay attention!

Two-Stage Translation: The Actual Algorithm

In a virtualized system, there are two levels of page table. Stage one maps guest virtual addresses to guest physical addresses, using vsatp instead of satp. Stage two maps guest physical addresses to machine physical addresses, using hgatp. The guest manages stage one. The hypervisor manages stage two.
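
For a feel of what the hypervisor actually writes, here is a sketch of packing an hgatp value. The field positions (MODE in bits 63..60, VMID in bits 57..44, PPN in bits 43..0, mode 8 for Sv39x4) follow my reading of the privileged spec; the function name and sample values are made up, and real hardware may implement fewer VMID bits than the 14 assumed here.

```python
HGATP_MODE_SV39X4 = 8            # assumed mode encoding for Sv39x4
ROOT_ALIGN = 16 * 1024           # the Sv39x4 root table is 16 KiB and 16 KiB-aligned


def make_hgatp(root_pa, vmid, mode=HGATP_MODE_SV39X4):
    """Pack a stage-two root address and VMID into an hgatp value."""
    if root_pa % ROOT_ALIGN:
        raise ValueError("stage-two root must be 16 KiB aligned")
    if not 0 <= vmid < (1 << 14):
        raise ValueError("VMID does not fit in 14 bits")
    return (mode << 60) | (vmid << 44) | (root_pa >> 12)


hgatp = make_hgatp(0x8000_4000, vmid=3)
print(hex(hgatp))
```

The VMID plays the same role for stage two that the ASID plays in satp: it tags TLB entries so switching guests does not require a full flush.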

The MMU does not stop after stage one. It walks stage one, gets a GPA, then immediately walks stage two to get the final machine address. The entire operation is a single hardware walk. The hypervisor never gets involved during translation.

Here is what the hardware actually does, in steps:

  1. Read vsatp to get the stage-one root. The guest set this up the same way it would set satp on bare metal.
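  2. Walk the stage-one page table using the guest virtual address, exactly as Sv39 specifies. One catch: every stage-one table address, starting with the root in vsatp, is itself a guest physical address, so the hardware runs each stage-one PTE fetch through stage two before it can read the entry. The leaf PTE contains a PPN. Combined with the page offset, this is a guest physical address (GPA). (The guest thinks this is real memory.)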
  3. Read hgatp to get the stage-two root. The hypervisor set this up. The guest does not know hgatp exists.
  4. Walk the stage-two page table using the GPA from step 2 as the input address. The result is a machine physical address. This is where the data actually lives in RAM.
  5. Merge the permissions from both PTEs. If either PTE says "not readable," the access fails. Stage two is a ceiling.
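
The steps above can be sketched as a software model. Everything here is illustrative: `mem` is a hypothetical dict of machine memory, the addresses are invented, only 4 KiB leaves are handled, and A/D and U-bit checks are skipped. Two details come from the privileged spec rather than from the list: stage-one table addresses are GPAs, so each PTE fetch goes through stage two first, and the Sv39x4 root index is 11 bits wide.

```python
PTE_V, PTE_R, PTE_W, PTE_X = 1, 2, 4, 8
PPN_MASK = (1 << 44) - 1


def pte(pa, flags):
    """Build a PTE pointing at page-aligned address pa."""
    return ((pa >> 12) << 10) | flags


def walk(mem, root, addr, top_bits, fetch_xlate=lambda pa: pa):
    """Three-level radix walk. Returns (output address, permissions) or None.
    top_bits is 9 for Sv39 and 11 for Sv39x4; fetch_xlate maps a table
    address to a machine address before the PTE is read."""
    table = root
    for level in (2, 1, 0):
        width = top_bits if level == 2 else 9
        index = (addr >> (12 + 9 * level)) & ((1 << width) - 1)
        pte_pa = fetch_xlate(table + index * 8)
        if pte_pa is None:
            return None                           # stage two failed on a PTE fetch
        entry = mem.get(pte_pa, 0)
        if not entry & PTE_V:
            return None
        base = ((entry >> 10) & PPN_MASK) << 12
        if entry & (PTE_R | PTE_X):
            if level != 0:
                return None                       # superpages omitted in this sketch
            return base | (addr & 0xFFF), entry & (PTE_R | PTE_W | PTE_X)
        table = base
    return None


def stage2(mem, hgatp_root, gpa):
    return walk(mem, hgatp_root, gpa, top_bits=11)


def two_stage(mem, vsatp_root_gpa, hgatp_root, gva):
    """GVA -> (machine address, merged permissions), or None on any fault."""
    def fetch(gpa):
        hit = stage2(mem, hgatp_root, gpa)
        return hit[0] if hit else None
    s1 = walk(mem, vsatp_root_gpa, gva, top_bits=9, fetch_xlate=fetch)
    if s1 is None:
        return None
    gpa, perms1 = s1
    s2 = stage2(mem, hgatp_root, gpa)
    if s2 is None:
        return None
    spa, perms2 = s2
    return spa, perms1 & perms2                   # stage two is a ceiling


mem = {}
# Stage two, built by the hypervisor in machine memory: GPA page n -> 0x9000_n000.
mem[0x8000_0000] = pte(0x8000_4000, PTE_V)
mem[0x8000_4000] = pte(0x8000_5000, PTE_V)
for n in (1, 2, 3):                               # pages holding the guest's tables
    mem[0x8000_5000 + 8 * n] = pte(0x9000_0000 + (n << 12), PTE_V | PTE_R | PTE_W)
mem[0x8000_5000 + 8 * 4] = pte(0x9000_4000, PTE_V | PTE_R)   # data page, read-only
# Stage one, built by the guest: entries hold GPAs, stored where stage two puts them.
mem[0x9000_1008] = pte(0x2000, PTE_V)
mem[0x9000_2008] = pte(0x3000, PTE_V)
mem[0x9000_3008] = pte(0x4000, PTE_V | PTE_R | PTE_W)

result = two_stage(mem, 0x1000, 0x8000_0000, 0x4020_1ABC)
print(result)
```

In this example the guest mapped the page read-write, the hypervisor mapped it read-only, and the merged result is read-only. Counting the reads also shows why the TLB matters: three stage-one fetches, each needing three stage-two fetches, plus the final stage-two walk, is 15 memory accesses for one translation.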

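The stage-two page table has a slightly different shape. Sv39x4 uses 8-byte PTEs in the same format as Sv39, with one rule on top: stage two treats every access as a user-level access, so the U bit must be set in its leaf PTEs. What changes is the input width. The GPA is 41 bits rather than 39, so the top-level index is 11 bits instead of 9 and the root table is 16 KiB (four times the usual size, hence "x4") and must be 16 KiB-aligned. But the walking algorithm is the same: radix tree, three levels, index by VPN[2..0] of the GPA.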

The permission merge works like this:

Stage-1 PTE   Stage-2 PTE   Result
R             R             R
R             -             fault
W             W             W
W             -             fault
X             X             X
X             -             fault
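
The table boils down to a bitwise AND. A minimal sketch, with made-up bit encodings and function names:

```python
R, W, X = 1, 2, 4   # hypothetical encodings, chosen for this example only


def merge_perms(stage1, stage2):
    """Effective permissions are whatever both stages allow."""
    return stage1 & stage2


def check_access(stage1, stage2, wanted):
    """Return 'ok', or which stage denies the access."""
    if (stage1 & wanted) != wanted:
        return "stage one denies"
    if (stage2 & wanted) != wanted:
        return "stage two denies"
    return "ok"


print(merge_perms(R | W, R))          # read-write guest mapping, read-only stage two
print(check_access(R | W, R, W))
```

Knowing which stage said no matters because the two cases raise different exceptions.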

Stage two never adds permissions. It can only remove them. If stage one says read-write and stage two says read-only, a guest write faults. That is not negotiable. But pay attention to who gets the fault. A stage-one failure is an ordinary page fault, and if you delegated it through hedeleg (you should), it goes straight to the guest's trap handler. The guest handles it internally (or panics), and the hypervisor never even knows.

A stage-two failure is different. The guest cannot fix a table it does not know exists, so the CPU raises a guest-page fault, which cannot be delegated to the guest and lands in the hypervisor (or in M-mode first, if the firmware did not delegate it). Now the hypervisor can inspect the GPA, figure out what went wrong, and either fix the stage-two mapping and resume, or inject a fault back into the guest.

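That brings us to the fault codes. When a translation fails, the cause register tells you why (scause in the hypervisor, or the guest's own view of scause, which is really vscause, when the fault was delegated):

  • 12, 13, or 15 → stage-one fault (instruction, load, store). The guest's page table is wrong. The guest needs to handle this.
  • 20, 21, or 23 → guest-page fault (stage two), in the same order. Your hgatp mapping is wrong. You fix it.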
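
A hypervisor trap handler usually starts by sorting the cause code into one of these buckets. A sketch of that dispatch, with an invented function name; the cause numbers are the ones defined in the privileged spec, which also covers store/AMO faults (15 and 23) and the virtual-instruction exception (22):

```python
PAGE_FAULTS = {12: "instruction", 13: "load", 15: "store/AMO"}
GUEST_PAGE_FAULTS = {20: "instruction", 21: "load", 23: "store/AMO"}
VIRTUAL_INSTRUCTION = 22


def classify_cause(cause):
    """Map an exception cause code to (stage, access kind)."""
    if cause in PAGE_FAULTS:
        return ("stage one", PAGE_FAULTS[cause])
    if cause in GUEST_PAGE_FAULTS:
        return ("stage two", GUEST_PAGE_FAULTS[cause])
    if cause == VIRTUAL_INSTRUCTION:
        return ("virtual instruction", None)
    return ("other", None)


print(classify_cause(21))
```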

If you have ever debugged a hypervisor without these codes, you know why they matter. On x86, a VM exit for an EPT violation tells you "something went wrong with the nested page tables," but the correlation between the guest address and the fault is not always straightforward. Here, htval gets written with the GPA that caused the fault, shifted right by two bits (the low two bits normally match the faulting address in stval). That GPA is the most useful piece of information you will get at 3am when your pet hypervisor panics.
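
One detail worth a sketch: per the privileged spec, htval holds the faulting GPA shifted right by two bits, so the handler has to rebuild the full address. The version below assumes the common case of an ordinary load or store, where the low two bits match the faulting virtual address in stval. The function name and values are made up.

```python
def faulting_gpa(htval, stval):
    """Rebuild the guest physical address from the two trap value CSRs."""
    return (htval << 2) | (stval & 0x3)


# Hypothetical trap: guest touched GVA 0x4020_1ABE, which stage one mapped to GPA 0x4ABE.
gpa = faulting_gpa(htval=0x4ABC >> 2, stval=0x4020_1ABE)
print(hex(gpa))
```

An implementation is also allowed to write zero to htval, so a robust handler treats zero as "no address supplied" and falls back to walking the guest's tables itself.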

The Merged PTE and the TLB

Walking two page tables on every memory access would be expensive. The TLB caches the result of the combined walk. A two-stage TLB entry contains the guest virtual address, the final machine physical address, and the merged permissions. Next access to the same GVA hits the TLB and never touches either page table.

This means TLB invalidation is more complex. If you change a stage-two mapping, you need to flush entries that might have cached the old translation. The hardware provides hfence.gvma for this. It takes an optional GPA (shifted right by two bits) and an optional VMID (the guest identifier stored in hgatp) to narrow the flush. If you call it without arguments, it flushes every two-stage TLB entry for every guest on the current hart. Other harts need their own fence, which in practice means an IPI or an SBI remote-fence call.
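
To make the narrowing concrete, here is a toy TLB model. It is an assumption-heavy sketch: real TLBs are free to flush more than asked, the class and method names are invented, and the entry is assumed to remember its GPA page so a flush by GPA can find it.

```python
class TwoStageTlb:
    """Toy model: entries keyed by (vmid, guest virtual page)."""

    def __init__(self):
        self.entries = {}

    def fill(self, vmid, gva_page, gpa_page, spa_page, perms):
        self.entries[(vmid, gva_page)] = (gpa_page, spa_page, perms)

    def lookup(self, vmid, gva_page):
        return self.entries.get((vmid, gva_page))

    def hfence_gvma(self, gpa_page=None, vmid=None):
        """Drop entries matching the optional GPA page and optional VMID."""
        def matches(key, value):
            return ((vmid is None or key[0] == vmid)
                    and (gpa_page is None or value[0] == gpa_page))
        self.entries = {k: v for k, v in self.entries.items() if not matches(k, v)}


tlb = TwoStageTlb()
tlb.fill(vmid=1, gva_page=0x40201, gpa_page=0x4, spa_page=0x90004, perms="r")
tlb.fill(vmid=1, gva_page=0x40202, gpa_page=0x5, spa_page=0x90005, perms="rw")
tlb.fill(vmid=2, gva_page=0x40201, gpa_page=0x4, spa_page=0xA0004, perms="rw")

tlb.hfence_gvma(gpa_page=0x4, vmid=1)   # only guest 1's mapping of GPA page 4 goes
print(sorted(tlb.entries))
```

Passing both arguments is the cheap path when you change one stage-two PTE. Passing neither is the sledgehammer.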

If a guest calls sfence.vma inside itself, the hardware flushes only the stage-one entries for that specific guest. The hypervisor does not get involved. The hardware knows which guest is running because the virtualization mode is baked into the pipeline.

You would think this is obvious, but on some older virtualization architectures, every guest TLB flush had to trap to the hypervisor so the hypervisor could emulate the flush. Here, it is automatic. The guest flushes, the TLB drains, nobody traps. This is the difference between "virtualization support" and "virtualization support that someone actually thought about." Smirk.

hvip and Interrupt Injection

The hypervisor can inject interrupts into a guest by writing to hvip (Hypervisor Virtual Interrupt Pending). This is a CSR that mirrors the guest's sip register. Set a bit in hvip and the guest sees the corresponding interrupt pending when it reads sip. No actual interrupt line toggling, no wire-level signaling.
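
The mirroring has one wrinkle: the VS-level bits in hvip sit one position higher than the bits the guest sees in sip. A sketch of that mapping, assuming the bit positions from the privileged spec (VSSIP at 2, VSTIP at 6, VSEIP at 10) and that the interrupts are delegated through hideleg. It models only the hvip contribution and ignores hardware sources that get ORed in. Names are illustrative.

```python
HVIP_VSSIP = 1 << 2     # virtual supervisor software interrupt
HVIP_VSTIP = 1 << 6     # virtual supervisor timer interrupt
HVIP_VSEIP = 1 << 10    # virtual supervisor external interrupt
VS_BITS = HVIP_VSSIP | HVIP_VSTIP | HVIP_VSEIP


def guest_sip(hvip, hideleg=VS_BITS):
    """What the guest reads from sip, given hvip and the delegation mask."""
    return (hvip & hideleg & VS_BITS) >> 1


sip = guest_sip(HVIP_VSTIP)   # inject a timer interrupt
print(bin(sip))
```

The guest sees bit 5 set, which is exactly where STIP lives in a bare-metal sip, so its interrupt handler needs no changes.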

The hgeip CSR handles the reverse direction. External interrupts from devices configured as guest-directed show up here as pending bits, one per guest interrupt file. hgeie (guest external interrupt enable) controls which of those bits raise a guest external interrupt in the hypervisor, which can then forward it by setting the appropriate hvip bit. For the guest that is currently running, hstatus.VGEIN can route its hgeip bit straight to the guest's external-interrupt-pending bit, with no forwarding at all.

Timers come in two parts. If Sstc is implemented, the guest programs its own stimecmp (backed by vstimecmp in hardware) and the hypervisor stays out of the way. Either way, htimedelta gives each guest its own time base by adding a 64-bit offset to the time CSR. You write the offset once in htimedelta and every rdtime the guest issues returns (time + offset) transparently, with no trap to the hypervisor.
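
The arithmetic is 64-bit wraparound addition, which makes negative offsets work for free. A minimal sketch with made-up function names:

```python
MASK64 = (1 << 64) - 1


def guest_time(time, htimedelta):
    """What the guest reads from the time CSR."""
    return (time + htimedelta) & MASK64


def delta_for(host_time_now, guest_time_wanted):
    """htimedelta value that makes the guest read guest_time_wanted right now."""
    return (guest_time_wanted - host_time_now) & MASK64


delta = delta_for(host_time_now=1_000_000, guest_time_wanted=0)   # guest boots at t=0
print(guest_time(1_000_500, delta))
```

A guest that was paused or migrated gets a new delta so its clock picks up where it left off.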

DMA and the IOMMU

If you pass a physical device through to a guest, the device does DMA using guest physical addresses. The IOMMU translates these using the same stage-two page tables the MMU uses for CPU accesses (its per-device context carries a root pointer in the hgatp format). The guest tells the device "write to GPA 0x1000," the device fires off a DMA write, the IOMMU translates 0x1000 through those tables, and the data lands at the correct machine address.

IOMMUs have existed for decades. What is nice is that the IOMMU uses the exact same page table format as the CPU MMU. One set of stage-two tables serves both CPU and device accesses. You don't maintain separate translation structures for DMA.

hlv, hsv, and Touching Guest Memory

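When the hypervisor needs to read or write guest memory, it has two options. Option one: walk the guest's stage-one tables in software, push the result through your own stage-two tables, map the page into your address space, do the access, unmap. (You cannot just load the guest's vsatp into satp, because the guest's tables are full of GPAs, not machine addresses.) This is slow, it pollutes the TLB, and it is ugly.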

Option two: use hlv (hypervisor load virtual) and hsv (hypervisor store virtual). These instructions take a guest virtual address and do the full two-stage walk in hardware. No context switch. No TLB pollution on the hypervisor's own page tables. You hand it a GVA, it returns the value.

If you are implementing a para-virtualized device or need to patch guest memory for any reason, hlv/hsv are the difference between a clean implementation and a fragile mess of temporary page table swaps.

henvcfg and Hiding Hardware

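henvcfg controls which optional features are visible to a guest. Want to keep a VM from programming its own timer compare? Clear henvcfg.STCE and the guest's stimecmp accesses trap. The same register gates things like Svpbmt page attributes (PBMTE) and the cache-block instructions (CBIE, CBCFE, CBZE). The FPU is handled elsewhere: set sstatus.FS to Off before entering the guest and every floating-point instruction raises an illegal-instruction exception, which the hypervisor can handle itself or let the guest take.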

Not everything has a switch in henvcfg, though. AIA state is gated through the state-enable CSRs (hstateen0, if Smstateen is implemented), and there is no bit that hands the hypervisor extension itself to a guest. A guest that touches an H CSR gets a virtual-instruction exception, so nested virtualization means emulating those accesses.

What Traps When

The hstatus register has control bits that determine what guest operations trap to the hypervisor:

  • hstatus.VTVM → guest accesses to satp trap, and so do guest sfence.vma instructions. If you are doing shadow page tables (unusual on RISC-V but possible), you need this.
  • hstatus.VTW → guest wfi traps if it does not complete within a bounded time. Useful if you want to implement a different idle policy.
  • hstatus.VTSR → guest sret traps. Lets you intercept guest returns for accounting or emulation.
  • hstatus.VGEIN → not a trap control, but it lives here too: it selects which guest interrupt file is active for interrupt delivery.
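
As a reference for the fields above, here is a sketch that packs and unpacks them. The bit positions (SPV at 7, VGEIN at 17..12, VTVM at 20, VTW at 21, VTSR at 22) follow my reading of the hstatus layout in the privileged spec; the helper names are invented.

```python
# name -> (shift, width); an illustrative subset of hstatus
HSTATUS_FIELDS = {
    "SPV": (7, 1),
    "VGEIN": (12, 6),
    "VTVM": (20, 1),
    "VTW": (21, 1),
    "VTSR": (22, 1),
}


def set_field(value, name, field_value):
    """Return value with the named field replaced."""
    shift, width = HSTATUS_FIELDS[name]
    mask = ((1 << width) - 1) << shift
    return (value & ~mask) | ((field_value << shift) & mask)


def get_fields(value):
    """Decode every known field from an hstatus value."""
    return {name: (value >> shift) & ((1 << width) - 1)
            for name, (shift, width) in HSTATUS_FIELDS.items()}


hstatus = set_field(0, "VTVM", 1)        # trap guest satp and sfence.vma
hstatus = set_field(hstatus, "VGEIN", 3) # use guest interrupt file 3
hstatus = set_field(hstatus, "SPV", 1)   # next sret enters the guest
print(get_fields(hstatus))
```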

When a guest tries to access a CSR it should not, scause says "virtual instruction" (cause code 22) and stval contains the encoding of the offending instruction (or zero, if the implementation does not supply it, in which case you fetch it from guest memory yourself). The CSR number sits in the top 12 bits of that encoding. For guest-page faults, htinst plays a similar role: it often contains a transformed copy of the load or store that trapped, so you can emulate an MMIO access without fetching and decoding anything from guest memory.
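
Once you have the 32-bit encoding of the trapping instruction, pulling out the CSR number is a shift and a mask. A sketch using the standard Zicsr encoding (opcode 0x73, CSR number in bits 31..20); the function name and the sample instruction are illustrative.

```python
CSR_OPS = {1: "csrrw", 2: "csrrs", 3: "csrrc", 5: "csrrwi", 6: "csrrsi", 7: "csrrci"}


def decode_csr_access(insn):
    """Decode a CSR instruction, or return None if insn is not one."""
    if (insn & 0x7F) != 0x73:                # SYSTEM opcode
        return None
    funct3 = (insn >> 12) & 0x7
    if funct3 not in CSR_OPS:                # ecall, sret, wfi, hfence and friends
        return None
    return {
        "op": CSR_OPS[funct3],
        "csr": (insn >> 20) & 0xFFF,
        "rd": (insn >> 7) & 0x1F,
        "rs1_or_uimm": (insn >> 15) & 0x1F,
    }


# csrrs a0, hgatp, x0 (that is, "csrr a0, hgatp"): a guest peeking at CSR 0x680
decoded = decode_csr_access(0x68002573)
print(decoded)
```

From here the hypervisor can decide per CSR number whether to emulate the access or inject an illegal-instruction exception into the guest.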

The One About Performance

People ask about trap latency constantly. Guest to hypervisor to guest is two CSR reads, some decoding, maybe an hgatp switch, and sret. Roughly the same cost as a normal S-mode trap. The slow hypervisors are slow because of what they do during the trap. Device emulation in software is expensive. Scheduling decisions add latency. But the raw trap cost? Poof.

The hard parts are still hard. Device emulation, memory ballooning, dirty page tracking for live migration, scheduling policy, all yours. But the CPU virtualization path is clean, auditable, and does not require understanding a microcode blob to debug. That is valuable in a way you only appreciate after you have stared at a VMX crash dump at 2am.

What You Should Actually Remember

Two-stage translation is the core of everything. Stage one is the guest's page table, pointed to by vsatp. Stage two is the hypervisor's page table, pointed to by hgatp. The MMU walks both in a single operation. Stage-two permissions are a ceiling on stage one. Fault codes 12/13/15 are stage one, 20/21/23 are stage two.

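The rest is quality-of-life. htimedelta saves timer traps. hlv/hsv saves context switches. henvcfg saves emulating features you would rather hide. hfence.gvma and hfence.vvma let the hypervisor flush exactly the stage-two or stage-one entries it changed, and the guest's own sfence.vma keeps its TLB coherent without hypervisor intervention.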

Go read the spec. The H-extension chapter in the privileged architecture manual is about forty pages. Most of it is register layout diagrams. You can read it in a single afternoon.

Still Curious?

For the curious, here is some short trivia as a takeaway.

1. Can you nest hypervisors? Not in hardware. The base H-extension has no bit that hands the H-extension to a guest, so two-stage translation never becomes three-stage. A hypervisor running inside a VM is trap-and-emulate: every H CSR access and hypervisor instruction it executes raises a virtual-instruction exception, and the host hypervisor emulates it, typically by collapsing the guest hypervisor's stage-two tables and its own into one shadow table. It works, but every extra level multiplies the emulation cost. I am not aware of anyone shipping a three-level nested hypervisor in production.

2. How does the TLB actually merge two PTEs? The spec says the merged permissions are the AND of both stages. But what about the A and D bits? If stage one sets A but stage two does not have it set yet, does the hardware walk stage two just to set the A bit? Or does it fault so software can update stage two? Different implementations do different things. The spec allows both: an implementation can fault and leave the update to software, or update the bits in hardware (the Svadu extension, enabled per stage).

3. What happens when GPA space exceeds VA space? Sv39 has 39-bit virtual addresses, but Sv39x4 (the stage-two format) takes a 41-bit GPA. The two extra bits do not get an extra level. They widen the root index from 9 to 11 bits, which is why the stage-two root table is 16 KiB instead of 4 KiB. A GPA with any bit set above bit 40 simply takes a guest-page fault. If the guest needs more, the hypervisor moves to Sv48x4 (50-bit GPAs) on hardware that supports it.

4. How expensive is a full world switch? The article says individual traps are cheap. True. But scheduling between guests requires saving and restoring hgatp, htimedelta, henvcfg, hcounteren, hvip, the guest's own supervisor CSRs (vsstatus, vsatp, vstvec, and friends), and the entire guest register state. A full world switch costs noticeably more than a normal process context switch, and how much more depends on the implementation and on how lazily you save state. The question is whether you can amortize that cost over longer scheduling quanta, or whether your workload demands frequent switching.

5. How do MSI interrupts reach the right vCPU? When a passthrough device fires a Message Signaled Interrupt, the PCIe write lands somewhere in physical address space. The AIA (Advanced Interrupt Architecture) maps that write to an interrupt file, which delivers the interrupt to a specific vCPU. But who sets up that mapping? The hypervisor. For each passthrough device, the hypervisor must allocate an interrupt file, configure the IOMMU to route the MSI to the right GPA, and program the AIA to deliver it to the target vCPU. Miss any step and the interrupt vanishes into a black hole.
