Here is a thing about RISC-V hypervisors that nobody tells you. The hypervisor extension is twenty CSRs. Not twenty new instructions. Not a separate root privilege level. Not a microcode blob you have to reverse. Twenty configuration registers. You read the spec in an afternoon and then you know the whole thing. Try that with VMX.
Every time I tell someone the H-extension is simple, they assume I mean "simple compared to x86" which is like saying your apartment is clean compared to a landfill. No. I mean it is genuinely, architecturally simple. The spec chapter is forty pages and half of that is register diagrams.
Let me walk you through it from the bottom up. If you know RISC-V page tables already, skip the first section. If you don't, I am going to explain them in a way that makes the two-stage stuff easy.
What You Need To Know First
RISC-V has three privilege levels. M-mode is the firmware, the thing that runs before anything else. S-mode is your kernel. U-mode is your userspace. When U-mode needs to do something it cannot, it traps to S-mode. When S-mode needs to do something it cannot, it traps to M-mode. This is a stack, btw.
A hypervisor lets you run an entire OS inside another OS. The outer OS (or a thin shim) becomes the hypervisor. The inner OS is the "guest." The guest thinks it owns the hardware. The hypervisor decides what the guest actually gets.
The H-extension adds two virtual privilege levels: VS-mode (where the guest kernel runs) and VU-mode (where the guest's user processes run). To the guest, VS-mode looks exactly like S-mode. Same CSRs, same trap mechanism, same page tables. The hypervisor lives in HS-mode, which is S-mode with access to the H CSRs.
A CSR is how you talk to the CPU. Page table pointer goes in satp. Interrupt handler address goes in stvec. Trap cause goes in scause. The H-extension adds new CSRs for the hypervisor's page table pointer, fault information, and the bits that control what traps to you.
A trap is the CPU saving state and jumping to a handler. sret goes back. One bit in hstatus changes whether sret returns to a guest or returns to a normal S-mode process. That bit is literally all that separates "return to kernel" from "return to guest kernel."
The Page Table You Already Know (Sv39)
Before we get to two-stage, we need to understand what one-stage looks like. RISC-V uses a radix-tree page table called Sv39. The virtual address is 39 bits, broken into four pieces:
 38      30 29      21 20      12 11       0
+----------+----------+----------+----------+
|  VPN[2]  |  VPN[1]  |  VPN[0]  |  offset  |
+----------+----------+----------+----------+
   9 bits     9 bits     9 bits    12 bits
The MMU starts at the page table root, pointed to by satp.ppn shifted left 12 bits (because page tables are page-aligned). It indexes into level 2 using VPN[2], reads a PTE (Page Table Entry, 8 bytes). If the PTE has the V bit set and the R, W, or X bits set, it is a leaf, stop here, you found your physical page. If not, the PPN field points to the next level and you repeat with VPN[1], then VPN[0].
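The walk described above can be sketched in a few lines. This is a toy model, not how any real MMU is written: `mem` is a hypothetical dict standing in for physical memory, `make_pte` and `sv39_walk` are names made up for this example, and permission, A/D, and superpage-alignment checks are left out.

```python
PAGE_SHIFT = 12
PTE_SIZE = 8
PTE_V, PTE_R, PTE_W, PTE_X = 1 << 0, 1 << 1, 1 << 2, 1 << 3


def make_pte(ppn, flags):
    """Pack a PPN and flag bits into a 64-bit PTE (PPN starts at bit 10)."""
    return (ppn << 10) | flags


def sv39_walk(mem, root_ppn, va):
    """Translate va using a toy memory (dict: physical address -> PTE).

    Returns the physical address, or None where hardware would page-fault.
    """
    table = root_ppn << PAGE_SHIFT
    for level in (2, 1, 0):
        vpn = (va >> (PAGE_SHIFT + 9 * level)) & 0x1FF
        pte = mem.get(table + vpn * PTE_SIZE, 0)
        if not pte & PTE_V:
            return None  # invalid entry
        ppn = (pte >> 10) & ((1 << 44) - 1)
        if pte & (PTE_R | PTE_W | PTE_X):
            # Leaf. At level > 0 this is a superpage, so more low bits of
            # the virtual address pass straight through.
            page_bits = PAGE_SHIFT + 9 * level
            base = (ppn << PAGE_SHIFT) & ~((1 << page_bits) - 1)
            return base | (va & ((1 << page_bits) - 1))
        table = ppn << PAGE_SHIFT  # pointer to the next level
    return None  # ran out of levels without finding a leaf
```

Build three tables with `make_pte`, point the last one at a leaf, and `sv39_walk` returns the leaf's page base plus the 12-bit offset.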
A PTE on RV64 looks like this:
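bit     63   62-61   60-54   53-10       9-8   7  6  5  4  3  2  1  0
field   N    PBMT    RSV     PPN[2..0]   RSW   D  A  G  U  X  W  R  V
N and PBMT belong to the Svnapot and Svpbmt extensions. Without those extensions they are reserved and must be zero, like bits 60-54.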
V is the valid bit. R, W, X are permission bits (read, write, execute). U means user-accessible. G means global (all address spaces). A and D are the accessed and dirty flags, set on first access and first write respectively. PPN is the physical page number. Who sets A and D depends on the implementation: some hardware updates them itself, and the rest faults so software can do it. So if you write a PTE without setting A and the hardware does not update A bits on its own, the MMU will fault on first access so software can set it.
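To make the bit positions concrete, here is a small decoder for the low flag bits and the PPN. The function name `decode_pte` and the dict layout are made up for this example. It assumes the standard RV64 layout with flags in bits 7:0, RSW in bits 9:8, and a 44-bit PPN starting at bit 10.

```python
FLAG_NAMES = "VRWXUGAD"  # bit 0 (V) through bit 7 (D)


def decode_pte(pte):
    """Split a 64-bit Sv39 PTE into named flags, RSW, and PPN."""
    flags = {name: bool((pte >> bit) & 1) for bit, name in enumerate(FLAG_NAMES)}
    return {
        "flags": flags,
        "rsw": (pte >> 8) & 0x3,
        "ppn": (pte >> 10) & ((1 << 44) - 1),
        # A valid PTE with any of R/W/X set is a leaf; otherwise it is a
        # pointer to the next table level.
        "is_leaf": flags["V"] and (flags["R"] or flags["W"] or flags["X"]),
    }
```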
That is Sv39. Standard issue. Your kernel already implements this. Good. Now here is where it gets interesting, so pay attention!
Two-Stage Translation: The Actual Algorithm
In a virtualized system, there are two levels of page table. Stage one maps guest virtual addresses to guest physical addresses, using vsatp instead of satp. Stage two maps guest physical addresses to machine physical addresses, using hgatp. The guest manages stage one. The hypervisor manages stage two.
The MMU does not stop after stage one. It walks stage one, gets a GPA, then immediately walks stage two to get the final machine address. The entire operation is a single hardware walk. The hypervisor never gets involved during translation.
Here is what the hardware actually does, in steps:
1. Read vsatp to get the stage-one root. The guest set this up the same way it would set satp on bare metal.
2. Walk the stage-one page table using the guest virtual address, exactly as Sv39 specifies. The result is a PTE that contains a PPN. This PPN is a guest physical address (GPA). (The guest thinks this is real memory.) The stage-one tables themselves also live at guest physical addresses, so every PTE fetch during this walk goes through stage two as well.
3. Read hgatp to get the stage-two root. The hypervisor set this up. The guest does not know hgatp exists.
4. Walk the stage-two page table using the GPA from step 2 as the input address. The result is a machine physical address. This is where the data actually lives in RAM.
5. Merge the permissions from both PTEs. If either PTE says "not readable," the access fails. Stage two is a ceiling.
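The steps above can be sketched as a toy model. To keep it short it uses page-granular dicts instead of real radix tables, and it does not model the stage-two translation of the stage-one table fetches. The names `translate` and `two_stage`, and the permission strings like "rw", are inventions for this example.

```python
PAGE = 4096


def translate(page_map, addr):
    """One stage. page_map maps page number -> (target page number, perms)."""
    entry = page_map.get(addr // PAGE)
    if entry is None:
        return None
    target_page, perms = entry
    return target_page * PAGE + addr % PAGE, perms


def two_stage(stage1, stage2, gva, access):
    """Translate a guest virtual address; access is 'r', 'w' or 'x'.

    Returns ("ok", machine address) or (fault kind, faulting address).
    """
    s1 = translate(stage1, gva)
    if s1 is None or access not in s1[1]:
        return ("page-fault", gva)        # stage one said no
    gpa, _ = s1
    s2 = translate(stage2, gpa)
    if s2 is None or access not in s2[1]:
        return ("guest-page-fault", gpa)  # stage two said no
    return ("ok", s2[0])
```

With `stage1 = {0x10: (0x80, "rw")}` and `stage2 = {0x80: (0x1234, "r")}`, a read of GVA 0x10ABC succeeds, while a write passes stage one and then fails at stage two with the GPA as the faulting address.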
The stage-two page table has a slightly different shape. Sv39x4 uses the same 8-byte PTEs and 4 KiB pages as Sv39, but the GPA it accepts is 41 bits wide rather than 39. The two extra bits go into the top-level index, so the root table has 2048 entries instead of 512, occupies 16 KiB, and must be 16 KiB-aligned. Otherwise the walking algorithm is the same: radix tree, three levels, index by VPN[2..0] of the GPA.
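Here is a sketch of how a GPA gets split for a stage-two walk. It assumes the Sv39x4 layout as I read it in the privileged spec: a 41-bit GPA whose top-level index is 11 bits wide, with the lower two indices and the offset unchanged from Sv39. `split_sv39x4` is a name made up for this example.

```python
def split_sv39x4(gpa):
    """Split a guest physical address into Sv39x4 walk indices."""
    if gpa >> 41:
        # Assumption from the spec text: bits above bit 40 must be zero,
        # otherwise hardware raises a guest-page fault before walking.
        raise ValueError("GPA wider than 41 bits")
    return {
        "vpn2": (gpa >> 30) & 0x7FF,  # 11 bits, so 2048 root entries
        "vpn1": (gpa >> 21) & 0x1FF,
        "vpn0": (gpa >> 12) & 0x1FF,
        "offset": gpa & 0xFFF,
    }


ROOT_TABLE_BYTES = 2048 * 8  # 16 KiB, four times an Sv39 root table
```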
The permission merge works like this:
Stage-1 PTE   Stage-2 PTE   Result
R             R             R
R             -             fault
W             W             W
W             -             fault
X             X             X
X             -             fault
Stage two never adds permissions. It can only remove them. If stage one says read-write and stage two says read-only, the write faults. That is not negotiable. But pay attention to where the fault goes. A stage-two violation raises a guest-page fault, and guest-page faults cannot be delegated to the guest: they trap to the hypervisor in HS-mode. The guest does not see anything unless the hypervisor decides to inject a fault.
Stage-one faults are the opposite case. If the hypervisor has delegated ordinary page faults through hedeleg, a bad stage-one mapping traps straight into the guest's own handler, the guest handles the fault internally (or panics), and the hypervisor never even knows. When a guest-page fault does land in the hypervisor, it can inspect the GPA, figure out what went wrong, and fix the stage-two mapping or inject a fault back into the guest.
That brings us to the fault codes. When a translation fails, scause tells you why:
- 12, 13, or 15 → stage-one page fault (instruction fetch, load, store/AMO). The guest's page table is wrong. The guest needs to handle this.
- 20, 21, or 23 → guest-page fault (stage two), in the same order. Your hgatp mapping is wrong. You fix it.
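A trap handler's first job is sorting these codes. Below is a minimal classifier using the cause numbers from the privileged spec. Note that store/AMO faults have their own codes, 15 for stage one and 23 for stage two. The function name and the return strings are made up for this example.

```python
PAGE_FAULT = {12: "instruction", 13: "load", 15: "store/AMO"}
GUEST_PAGE_FAULT = {20: "instruction", 21: "load", 23: "store/AMO"}
VIRTUAL_INSTRUCTION = 22
INTERRUPT_BIT = 1 << 63  # top bit of scause on RV64


def classify_trap(scause):
    """Return (who should look at it, description) for an RV64 scause."""
    if scause & INTERRUPT_BIT:
        return ("n/a", "interrupt, not an exception")
    if scause in PAGE_FAULT:
        return ("guest", f"stage-one {PAGE_FAULT[scause]} page fault")
    if scause in GUEST_PAGE_FAULT:
        return ("hypervisor", f"stage-two {GUEST_PAGE_FAULT[scause]} guest-page fault")
    if scause == VIRTUAL_INSTRUCTION:
        return ("hypervisor", "virtual instruction")
    return ("unknown", f"cause {scause}")
```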
If you have ever debugged a hypervisor without these codes, you know why they matter. On x86, a VM exit for an EPT violation tells you "something went wrong with the nested page tables," but the correlation between the guest address and the fault is not always straightforward. Here, htval gets written with the GPA that caused the fault, shifted right by two bits, so shift it left by two to get the address back. That GPA is the most useful piece of information you will get at 3am when your pet hypervisor panics.
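One detail worth a sketch: the spec stores the faulting GPA in htval shifted right by two bits. For an ordinary explicit load or store, the two missing low bits match the low bits of the guest virtual address in stval. The helper below assumes that case only. For faults raised during an implicit page-table fetch, the low bits have to be taken as unknown. `fault_gpa` is a name invented here.

```python
def fault_gpa(htval, stval):
    """Rebuild the faulting guest physical address after a guest-page fault.

    Assumes an explicit memory access, where GPA bits 1:0 equal the low
    two bits of the faulting virtual address reported in stval.
    """
    return (htval << 2) | (stval & 0x3)
```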
The Merged PTE and the TLB
Walking two page tables on every memory access would be expensive. The TLB caches the result of the combined walk. A two-stage TLB entry contains the guest virtual address, the final machine physical address, and the merged permissions. Next access to the same GVA hits the TLB and never touches either page table.
This means TLB invalidation is more complex. If you change a stage-two mapping, you need to flush entries that might have cached the old translation. The hardware provides hfence.gvma for this. It takes an optional GPA and an optional VMID (the same ID you put in hgatp) to narrow the flush. If you issue it with both operands set to x0, it flushes every two-stage TLB entry on the current hart. Other harts need their own fence, which usually means an IPI or an SBI remote-fence call.
If a guest executes sfence.vma, the hardware flushes only the stage-one entries for that specific guest. The hypervisor does not get involved. The hardware knows which guest is running because the virtualization mode and the current VMID are part of the hart's state.
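The scoping rules are easier to see with a toy TLB. This models the minimum set of entries each fence has to remove on one hart. Real hardware is allowed to flush more than asked, and real TLB entries do not necessarily record the intermediate GPA. `TlbEntry`, `hfence_gvma`, and `guest_sfence_vma` here are Python stand-ins, not the instructions themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TlbEntry:
    vmid: int      # which guest the entry belongs to
    gva_page: int  # guest virtual page
    gpa_page: int  # guest physical page it went through
    mpa_page: int  # final machine page


def hfence_gvma(tlb, gpa_page=None, vmid=None):
    """Return the entries that survive a hypervisor stage-two fence.

    An entry is flushed only if it matches every filter that was given,
    so calling with no filters flushes everything.
    """
    return {
        e for e in tlb
        if (vmid is not None and e.vmid != vmid)
        or (gpa_page is not None and e.gpa_page != gpa_page)
    }


def guest_sfence_vma(tlb, current_vmid):
    """A guest's own sfence.vma only touches the running guest's entries."""
    return {e for e in tlb if e.vmid != current_vmid}
```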
You would think this is obvious. But on some older virtualization architectures, every guest TLB flush had to trap to the hypervisor so the hypervisor could emulate the flush. Here, it is automatic. The guest flushes, the TLB drains, nobody traps. This is the difference between "virtualization support" and "virtualization support that someone actually thought about."
hvip and Interrupt Injection
The hypervisor can inject interrupts into a guest by writing to hvip (Hypervisor Virtual Interrupt Pending). This is a CSR that mirrors the guest's sip register. Set a bit in hvip and the guest sees the corresponding interrupt pending when it reads sip. No actual interrupt line toggling, no wire-level signaling.
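The mirroring has one quirk worth seeing in code: the VS-level bits in hvip sit one position above the bits the guest reads in sip. The sketch below assumes the usual setup where hideleg delegates all three VS-level interrupts to the guest. It also ignores the other sources that can raise the guest's external and timer bits. The constant and function names are made up here, but the bit positions are the ones from the privileged spec.

```python
HVIP_VSSIP = 1 << 2   # virtual supervisor software interrupt
HVIP_VSTIP = 1 << 6   # virtual supervisor timer interrupt
HVIP_VSEIP = 1 << 10  # virtual supervisor external interrupt

SIP_SSIP, SIP_STIP, SIP_SEIP = 1 << 1, 1 << 5, 1 << 9


def guest_sip_view(hvip):
    """What the guest reads back in sip for a given hvip value."""
    injectable = HVIP_VSSIP | HVIP_VSTIP | HVIP_VSEIP
    return (hvip & injectable) >> 1
```

So writing `HVIP_VSTIP` into hvip shows up to the guest as a plain pending supervisor timer interrupt.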
The hgeip CSR handles the reverse direction. External interrupts from devices configured as guest-directed show up here, and the hypervisor can forward them by setting the appropriate hvip bit. hgeie (guest external interrupt enable) controls which of those guest external interrupt lines also raise an interrupt for the hypervisor itself.
Guest timers are handled either by the guest's own stimecmp (if Sstc is implemented, where the guest's accesses land in vstimecmp) or by the hypervisor emulating the timer. Either way, htimedelta takes care of the time base. htimedelta adds a 64-bit offset to the time CSR. Each guest gets its own time base without a single trap. You write the offset once in htimedelta and every rdtime the guest issues returns (time + offset) transparently.
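The arithmetic is plain 64-bit wraparound addition, which means a "negative" offset is just a large unsigned value. A small sketch, with helper names invented for this example:

```python
MASK64 = (1 << 64) - 1


def guest_time(time, htimedelta):
    """What rdtime returns inside the guest: time + htimedelta, mod 2^64."""
    return (time + htimedelta) & MASK64


def delta_for(host_time_now, guest_time_now):
    """The htimedelta value that makes the guest read guest_time_now."""
    return (guest_time_now - host_time_now) & MASK64
```

For example, to boot a guest whose clock starts at zero while the host counter already reads 1,000,000, write `delta_for(1_000_000, 0)` once and leave it alone.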
DMA and the IOMMU
If you pass a physical device through to a guest, the device does DMA using guest physical addresses. The IOMMU translates these using the same hgatp page tables the MMU uses for CPU accesses. The guest tells the device "write to GPA 0x1000," the device fires off a DMA write, the IOMMU translates 0x1000 through hgatp, and the data lands at the correct machine address.
IOMMUs have existed for decades. What is nice is that the IOMMU uses the exact same page table format as the CPU MMU. One set of hgatp tables serves both CPU and device accesses. You don't maintain separate translation structures for DMA.
hlv, hsv, and Touching Guest Memory
When the hypervisor needs to read or write guest memory, it has two options. Option one: switch satp to the guest's vsatp, do the access, switch back. This is a context switch, it pollutes the TLB, and it is ugly.
Option two: use hlv (hypervisor load virtual) and hsv (hypervisor store virtual). These instructions take a guest virtual address and do the full two-stage walk in hardware. No context switch. No TLB pollution on the hypervisor's own page tables. You hand it a GVA, it returns the value.
If you are implementing a para-virtualized device or need to patch guest memory for any reason, hlv/hsv are the difference between a clean implementation and a fragile mess of temporary page table swaps.
henvcfg and Hiding Hardware
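henvcfg controls which optional features are visible to a guest. Each feature gets an enable field: STCE gates the Sstc stimecmp register, PBMTE gates Svpbmt page attributes, and the CBZE, CBCFE, and CBIE fields gate the cache-block instructions. Clear the field and the guest cannot use the feature. For the instruction and CSR cases the attempt traps, and the hypervisor can either emulate it or tell the guest the feature does not exist.
Not everything lives in henvcfg. Want to hide the FPU from a specific VM? That goes through sstatus.FS: while a guest runs, the hypervisor's own sstatus.FS still applies alongside the guest's vsstatus.FS, so setting it to Off makes every guest floating-point instruction trap. And the hypervisor extension itself has no enable bit for guests at all. A guest that touches H-extension CSRs or instructions takes a virtual-instruction trap, which is what nested virtualization is built on.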
What Traps When
The hstatus register has control bits that determine what guest operations trap to the hypervisor:
- hstatus.vtvm → guest accesses to satp (and guest sfence.vma) trap. If you are doing shadow page tables (unusual on RISC-V but possible), you need this.
- hstatus.vtw → guest wfi traps. Useful if you want to implement a different idle policy.
- hstatus.vtsr → guest sret traps. Lets you intercept guest returns for accounting or emulation.
- hstatus.vgein → selects which guest interface is active for interrupt delivery.
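For reference, here is where those fields sit in the register, along with SPV, the bit that decides whether sret enters a guest. The bit positions are taken from the privileged spec as I understand it. The helper names and the dict keys are made up for this example.

```python
HSTATUS_SPV = 1 << 7    # sret returns into the guest (V=1) when set
HSTATUS_VTVM = 1 << 20  # trap guest satp accesses and sfence.vma
HSTATUS_VTW = 1 << 21   # trap guest wfi
HSTATUS_VTSR = 1 << 22  # trap guest sret
VGEIN_SHIFT, VGEIN_MASK = 12, 0x3F  # 6-bit field at bits 17:12


def with_vgein(hstatus, n):
    """Return hstatus with the VGEIN field replaced by n."""
    if not 0 <= n <= VGEIN_MASK:
        raise ValueError("VGEIN is a 6-bit field")
    return (hstatus & ~(VGEIN_MASK << VGEIN_SHIFT)) | (n << VGEIN_SHIFT)


def describe(hstatus):
    """Decode the trap-control bits into something readable."""
    return {
        "sret_enters_guest": bool(hstatus & HSTATUS_SPV),
        "trap_satp": bool(hstatus & HSTATUS_VTVM),
        "trap_wfi": bool(hstatus & HSTATUS_VTW),
        "trap_sret": bool(hstatus & HSTATUS_VTSR),
        "vgein": (hstatus >> VGEIN_SHIFT) & VGEIN_MASK,
    }
```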
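When a guest tries to access a CSR it should not, scause says "virtual instruction" (cause code 22) and stval holds the encoding of the offending instruction. Implementations are allowed to write zero there instead, in which case you fetch the instruction from guest memory yourself. The CSR number is bits 31:20 of that encoding. htinst plays a similar role for guest-page faults, where it can hold a transformed copy of the trapping load or store. When the hardware fills these in, it tells you exactly what instruction the guest was executing and which CSR it was trying to touch.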
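If what the hardware hands you is the raw instruction encoding, pulling the CSR number out is one shift and one mask. A sketch, assuming a standard 32-bit Zicsr encoding in the SYSTEM opcode. `csr_access` is a name invented here.

```python
SYSTEM_OPCODE = 0x73


def csr_access(insn):
    """Return (csr number, funct3) for a CSR instruction, else None."""
    if (insn & 0x7F) != SYSTEM_OPCODE:
        return None
    funct3 = (insn >> 12) & 0x7
    if funct3 in (0, 4):
        # funct3 0 is ecall/ebreak/sret/wfi and the fences; funct3 4 is
        # hlv/hsv. Neither is a CSR access.
        return None
    return (insn >> 20) & 0xFFF, funct3
```

For example, `csrrw x0, satp, a0` encodes as 0x18051073, and the function returns CSR 0x180 (satp) with funct3 1.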
The One About Performance
People ask about trap latency constantly. Guest to hypervisor to guest is two CSR reads, some decoding, maybe an hgatp switch, and sret. Same cost as a normal S-mode trap. The slow hypervisors are slow because of what they do during the trap. Device emulation in software is expensive. Scheduling decisions add latency. But the raw trap cost? Pooofff.
The hard parts are still hard. Device emulation, memory ballooning, dirty page tracking for live migration, scheduling policy, all yours. But the CPU virtualization path is clean, auditable, and does not require understanding a microcode blob to debug. That is valuable in a way you only appreciate after you have stared at a VMX crash dump at 2am.
What You Should Actually Remember
Two-stage translation is the core of everything. Stage one is the guest's page table, pointed to by vsatp. Stage two is the hypervisor's page table, pointed to by hgatp. The MMU walks both in a single operation. Stage-two permissions are a ceiling on stage one. Fault codes 12/13/15 are stage one, 20/21/23 are stage two.
The rest is quality-of-life. htimedelta saves timer traps. hlv/hsv saves context switches. henvcfg hides optional features per guest. hfence.gvma and hfence.vvma let the hypervisor flush exactly what it changed, and the guest's own sfence.vma keeps its side of the TLB coherent without hypervisor intervention.
Go read the spec. The H-extension chapter in the privileged architecture manual is about forty pages. Most of it is register layout diagrams. You can read it in a single afternoon.
Still Curious?
For the curious readers, here are a few open questions to take away.
1. Can you nest hypervisors?
The ratified H-extension has no enable bit that hands the extension itself to a guest. A guest hypervisor's accesses to H-extension CSRs and instructions raise virtual-instruction exceptions, and the host hypervisor emulates them. That means you can have a hypervisor running inside a VM. What happens to two-stage translation when there are three stages? The hardware only ever walks two, so the host has to collapse the extra level into shadow stage-two tables, and every H-extension CSR access traps and has to be emulated. Nobody has shipped a three-level nested hypervisor in production.
2. How does the TLB actually merge two PTEs? The spec says the merged permissions are the AND of both stages. But what about the A and D bits? If stage one sets A but stage two does not have it set yet, does the hardware walk stage two just to set the A bit? Or does it cache the merged entry with A=0 and fault on the next access to get stage two updated? Different implementations do different things. The spec is intentionally vague here.
3. What happens when GPA space exceeds VA space? Sv39 has 39-bit virtual addresses. Sv39x4 (the stage-two format) accepts 41-bit GPAs, two bits more than a plain Sv39 table can index. The extra two bits are absorbed by widening the root index from 9 to 11 bits, which is why the stage-two root table is 16 KiB instead of 4 KiB. And how does the hardware handle a GPA with bits set above bit 40? It does not even start the walk. It raises a guest-page fault.
4. How expensive is a full world switch?
This article says individual traps are cheap. True. But scheduling between guests requires saving and restoring hgatp, htimedelta, henvcfg, hcounteren, hvip, and the entire guest register state. A full world switch is easily 10x the cost of a normal process context switch. The question is whether you can amortize that cost over longer scheduling quanta, or whether your workload demands frequent switching.
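The amortization question is simple arithmetic. The function below computes the fraction of CPU time lost to switching for a given quantum. The numbers in the usage note are hypothetical, picked only to show the shape of the trade-off, not measurements from any real hypervisor.

```python
def switch_overhead(switch_cost_us, quantum_us):
    """Fraction of wall time spent world-switching rather than running.

    Each scheduling round costs one switch plus one quantum of guest time.
    """
    return switch_cost_us / (switch_cost_us + quantum_us)
```

With a made-up 10 µs world switch, a 90 µs quantum loses 10% of the CPU to switching, and a 990 µs quantum loses 1%.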
5. How do MSI interrupts reach the right vCPU? When a passthrough device fires a Message Signaled Interrupt, the PCIe write lands somewhere in physical address space. The AIA (Advanced Interrupt Architecture) maps that write to an interrupt file, which delivers the interrupt to a specific vCPU. But who sets up that mapping? The hypervisor. For each passthrough device, the hypervisor must allocate an interrupt file, configure the IOMMU to route the MSI to the right GPA, and program the AIA to deliver it to the target vCPU. Miss any step and the interrupt vanishes into a black hole.