Process vs thread vs coroutine — stack layout, switch cost
Processes, Threads, and Coroutines: Stop Trusting the Sandwich Layout and Understand TLB Flush
There’s a brutal truth I just had to face: I’ve been misunderstanding Process and Thread for years.
Not in a “I don’t know anything” kind of way. I can still fork() like a champ, and I can pthread_create with my eyes closed. But when a (virtual) Staff Engineer asked me a few questions about how the stack is actually laid out in memory for multiple threads, I stood there gaping like a fish out of water.
So I had to dig deep and start over. This post is my journey “from misconception to clarity.” If you’re carrying around that textbook metaphor of “Process is a house, Thread is a person living in it,” congratulations: you’re just like I was. The metaphor gets you maybe 90% of the way. The remaining 10% is what determines whether your code runs like a gazelle or a three-legged turtle.
1. Redefining the Terms (Because I Was Wrong)
Before we go deeper, we need to clean up the definitions. In the past, I used to explain things like this:
“A Process is a program that got started, like a kitchen. A Thread is an instruction of the Process. Multi-threading is true simultaneous execution, while multitasking is just faking it.”
Sounds smooth. But it’s dead wrong.
After getting schooled, here is the corrected version:
Process is a running entity. It’s not the .exe file sitting on your hard drive; it’s a bunch of resources allocated by the OS: its own private virtual address space, file handles, sockets… To use the metaphor, a Process is the house with its own distinct property deed.
Thread is not an “instruction.” A Thread is the unit of execution, the worker inside that house. The Thread is the thing that actually fetches and runs the instructions. A Process without any Threads is just an empty house — furniture everywhere, but nobody there to use it.
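Processor/Core is the stove. At any given moment, one stove (Core) can only cook one dish (one Thread), at least if we set aside SMT/Hyper-Threading, where one physical core presents itself as two logical ones. If you want to cook multiple dishes truly at the same time, you need multiple stoves (Multi-core).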
Multi-tasking is the OS’s ability to rotate Threads on and off the stove. It creates the illusion that everything is running at once. True parallelism only happens when you have multiple Cores and multiple Threads running simultaneously on those Cores. Don’t confuse these two again.
Coroutine is not a “lightweight Thread.” It is a special function that knows how to pause itself. It runs on top of a single Thread and voluntarily yields execution to another Coroutine. It’s like putting rice on the stove, realizing it’s not boiling yet, taking the pot off to read the newspaper, and putting it back on later. The OS is completely unaware of this.
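The house-versus-worker distinction can be checked directly. Here is a minimal sketch, assuming a POSIX system where `os.fork` is available; the names `counter`, `bump`, `change_seen_after_thread`, and `change_seen_after_fork` are made up for this illustration. A Thread writes into the same address space as its creator, while a forked child Process writes into its own private copy.

```python
import os
import threading

counter = {"value": 0}  # lives in this process's address space


def bump():
    counter["value"] += 1


def change_seen_after_thread():
    """Run bump() in a second thread; return the change the caller observes."""
    before = counter["value"]
    t = threading.Thread(target=bump)
    t.start()
    t.join()
    return counter["value"] - before


def change_seen_after_fork():
    """Run bump() in a forked child process; return the change the parent observes."""
    before = counter["value"]
    pid = os.fork()
    if pid == 0:
        bump()          # modifies the child's private copy only
        os._exit(0)     # leave the child immediately, skipping cleanup handlers
    os.waitpid(pid, 0)  # parent waits for the child to finish
    return counter["value"] - before


if __name__ == "__main__":
    print("change after fork:  ", change_seen_after_fork())    # 0
    print("change after thread:", change_seen_after_thread())  # 1
```

The worker in the same house moved the furniture. The worker in the other house moved furniture too, but in a house with a different deed.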
2. The Sandwich Memory Layout Is a Lie
This was the biggest shock to my system.
2.1. The College Illusion
Back in school, I was taught that a Process’s memory layout looks like a neat sandwich stacked from top to bottom:
[ High Address ]
Stack (grows downwards)
|
v
Heap (grows upwards)
Data
Text (Code)
[ Low Address ]
This picture is correct — when the Process has exactly one main Thread.
The problem is, the moment you call pthread_create or std::thread, that beautiful layout collapses.
2.2. The Reality of a Multi-threaded Process
When you create a second Thread, its stack can’t just be shoved right underneath the Main Thread’s stack. If it were, and Thread 1 overflowed its stack, it would crash straight into Thread 2. Game over.
What actually happens (on Linux with glibc, at least):
The threading library, not the Kernel, calls mmap() from user space to reserve a new chunk of virtual memory, and then hands it to the Kernel through clone() as the Stack for Thread 2. The chunk lands in the mmap region of the address space, usually somewhere between the Heap and the Main Stack. ASLR randomizes where that region starts on each run.
Create Thread 3? Another mmap call, another separate mapping. It often ends up near the previous one, but nothing guarantees that.
The result: The actual layout of a multi-threaded Process looks less like a sandwich and more like a scattered tree.
[ High Address ]
...
0x7fffff000000 <-- Main Thread Stack (8MB, grows down)
... <-- (Guard space / gap)
0x7fffe8000000 <-- Thread 2 Stack (allocated via mmap)
0x7fffe0000000 <-- Thread 3 Stack (allocated via mmap elsewhere)
...
Heap and Memory Mappings (.so libraries)
...
Text (Code)
[ Low Address ]
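What this means:
Thread stacks are separate mappings, not slices of one contiguous stack area.
Where each one lands is up to mmap, so treat them as puzzle pieces rather than layers.
Each Stack has a Guard Page at its low end (a virtual memory page with read/write permissions stripped), so an overflow hits the guard before it reaches whatever is mapped below.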
2.3. Guard Pages – Protection That Kills Everyone
I used to think a Guard Page was a soft “manhole cover.” I thought if the stack touched it, the OS would politely notify the thread and let it keep running.
Nope.
A Guard Page is a virtual memory page flagged with PROT_NONE (no read, no write, no execute). When a thread reads or writes an address inside it (typically because deep recursion or a big local array pushed the stack past its limit), the CPU immediately triggers a Page Fault.
The Kernel looks at this illegal access, sees a violation, and sends the SIGSEGV signal (Segmentation Fault). The default action for SIGSEGV is to terminate the entire Process, not just the offending Thread, so unless a handler has been installed, everyone goes down together.
So don’t write sloppy code. One thread stepping out of line can bring down the whole house.
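To make the geometry concrete, here is a small model of one thread stack with its guard page. It is only arithmetic, not a real allocation: the 4096-byte page size, the single guard page counted inside the region, and the names `ThreadStack` and `classify` are all assumptions for illustration, and real threading libraries differ in the details.

```python
from dataclasses import dataclass

PAGE = 4096  # assumed page size in bytes


@dataclass(frozen=True)
class ThreadStack:
    base: int          # lowest address of the mapped region
    size: int          # total bytes in the region, guard included (model assumption)
    guard: int = PAGE  # bytes of PROT_NONE guard at the low end

    @property
    def top(self):
        # The stack pointer starts here and moves toward lower addresses.
        return self.base + self.size

    def classify(self, addr):
        if self.base <= addr < self.base + self.guard:
            return "guard"    # any access here faults, and the kernel sends SIGSEGV
        if self.base + self.guard <= addr < self.top:
            return "stack"    # normal, usable stack memory
        return "outside"      # belongs to some other mapping, or to nothing


if __name__ == "__main__":
    # Hypothetical base address, only for illustration.
    stack = ThreadStack(base=0x7FFFE8000000, size=8 * 1024 * 1024)
    print(stack.classify(stack.top - 8))   # stack
    print(stack.classify(stack.base + 16)) # guard
```

The useful takeaway from the model: the guard sits at the low end because that is the direction the stack grows, so it is the first thing an overflowing thread walks into.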
3. Switching Cost – It’s Not Just “Running Upstairs”
Let’s revisit my old metaphor:
“Switching a Process is like moving houses from A to B, heavy and slow. Switching a Thread is like running from the kitchen to the dining room, much lighter.”
The idea is right, but the technical root cause is the fascinating part.
3.1. What is a Context Switch?
Whenever the CPU moves from running Thread A to Thread B, it performs a Context Switch:
Save the entire state of Thread A to memory (registers like RAX, RBX, RSP, RIP, etc.).
Load the state of Thread B from memory into the CPU.
Jump to the next instruction of Thread B.
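The register save and restore by itself is cheap, on the order of tens to a few hundred CPU cycles. A real Kernel-level switch also pays for entering the Kernel, running the scheduler, and refilling caches afterward, so measured end-to-end costs are usually quoted in microseconds, meaning thousands of cycles. The exact number depends heavily on hardware and workload.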
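The save, load, jump sequence above can be sketched as a toy model. This is not how a kernel is written; `Cpu`, `Thread`, and `context_switch` are hypothetical names, and the register set is cut down to the four mentioned above.

```python
from dataclasses import dataclass, field


@dataclass
class Thread:
    name: str
    saved: dict = field(default_factory=dict)  # register snapshot kept in memory


@dataclass
class Cpu:
    regs: dict = field(
        default_factory=lambda: {"rax": 0, "rbx": 0, "rsp": 0, "rip": 0}
    )


def context_switch(cpu, old, new):
    old.saved = dict(cpu.regs)  # 1. save the outgoing thread's registers
    cpu.regs = dict(new.saved)  # 2. load the incoming thread's registers
    return cpu.regs["rip"]      # 3. execution continues at its saved instruction pointer


if __name__ == "__main__":
    a = Thread("A")
    b = Thread("B", saved={"rax": 7, "rbx": 0, "rsp": 0x2000, "rip": 0x401000})
    cpu = Cpu(regs={"rax": 1, "rbx": 2, "rsp": 0x1000, "rip": 0x400500})
    print(hex(context_switch(cpu, a, b)))  # 0x401000
    print(hex(context_switch(cpu, b, a)))  # 0x400500
```

Switching back to A lands exactly where A left off, which is the whole point: from the thread's perspective, nothing happened.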
3.2. The Real Difference Lies in One Register: CR3
This is the “golden key” that distinguishes Threads from Processes.
CR3 is the register that holds the physical address of the top-level Page Table for the current Process (on x86-64). It’s the root of the map that translates “virtual addresses” to “physical addresses” in RAM.
When switching from Thread A to Thread B within the same Process: Both share the exact same virtual address space. They share the same Page Table. The CR3 register does not change.
When switching to a Thread in a different Process: The Kernel must write a new value into CR3 to point to the new Process’s Page Table.
The price of writing to CR3: TLB Flush.
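TLB (Translation Lookaside Buffer) is a super-fast cache inside the CPU that stores recent address translations. When you change the Page Table (by writing to CR3), the cached translations belong to the old address space, so in the classic case the TLB is flushed clean. Entries marked global (typically Kernel mappings) survive, and modern x86 CPUs with PCID can tag entries per address space and skip part of the flush, but the old Process’s warm translations are still of no use to the new one.
After a TLB Flush, the first access to each page misses the TLB, and the CPU has to “walk” through the 4- or 5-level Page Table hierarchy again. One walk costs somewhere from tens to a few hundred cycles depending on what is cached, and a working set of many pages turns that into hundreds to thousands of extra CPU cycles to rebuild the TLB.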
Quick comparison:
| Action | CR3 Changed? | TLB Flush? | Relative Cost |
|---|---|---|---|
| Switch Thread (same Process) | No | No | Low |
| Switch Process | Yes | Yes | High |
So my “moving houses” metaphor really boils down to TLB Flush. Threads living in the same house don’t need to “remap their virtual address.”
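Here is a toy simulation of that comparison. It models the simplest case only: a full flush on every CR3 change, with no PCID and no global pages. `ToyCpu` and `walks_after_switch` are hypothetical names, and the page count of 64 is arbitrary.

```python
class ToyCpu:
    """A CPU reduced to three things: CR3, a TLB, and a page-walk counter."""

    def __init__(self):
        self.cr3 = None
        self.tlb = {}
        self.page_walks = 0

    def switch_to(self, page_table):
        # Same page table (same process): CR3 is untouched and the TLB stays warm.
        if page_table != self.cr3:
            self.cr3 = page_table
            self.tlb.clear()  # model assumption: every CR3 write flushes everything

    def access(self, vpage):
        if vpage not in self.tlb:
            self.page_walks += 1                 # TLB miss: walk the page tables
            self.tlb[vpage] = (self.cr3, vpage)  # stand-in for a physical frame


def walks_after_switch(next_page_table, pages=64):
    """Warm the TLB under page table 'P1', switch, touch the same pages, count walks."""
    cpu = ToyCpu()
    cpu.switch_to("P1")
    for p in range(pages):
        cpu.access(p)
    cpu.page_walks = 0  # only count what happens after the switch
    cpu.switch_to(next_page_table)
    for p in range(pages):
        cpu.access(p)
    return cpu.page_walks


if __name__ == "__main__":
    print("thread switch, same process:", walks_after_switch("P1"))  # 0
    print("process switch:             ", walks_after_switch("P2"))  # 64
```

Multiply the second number by the cost of one page walk and you get the hidden bill that the register save and restore never shows you.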
4. Coroutines – Dozing Off on the Sofa
Coroutines save even more.
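Thread Context Switch: Requires Kernel Mode entry, saving and restoring registers, and, if the next Thread belongs to a different Process, writing to CR3.
Coroutine Switch: Happens entirely in User Space. How it works depends on the flavor. Stackful runtimes (e.g., Go’s goroutines) give each Coroutine its own small stack and switch by swapping the Stack Pointer (RSP) and the Instruction Pointer (RIP) with a few lines of assembly. Stackless ones (e.g., Python asyncio, C++20 coroutines) never touch RSP: they save the function’s local state in an object and resume it later with an ordinary function call.
Coroutines don’t have their own private address space, and they don’t get a Kernel-allocated stack. Their state lives in a data structure the runtime manages, usually on the Heap. When a stackful Coroutine yields, the runtime points RSP at the next Coroutine’s stack; when a stackless one yields, it simply returns to the scheduler and leaves its saved frame behind.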
The result: Switching a Coroutine is as cheap as pressing a button on the TV remote, and the Kernel sitting in the living room has no idea it happened.
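A user-space switch can be sketched with plain Python generators, which keep their paused state in a heap-allocated frame object. This is a minimal round-robin scheduler, not how any particular runtime is implemented; `worker` and `run_round_robin` are names invented for the example.

```python
import threading
from collections import deque


def worker(name, steps, log):
    for i in range(steps):
        log.append((name, i, threading.get_ident()))
        yield  # voluntarily hand control back to the scheduler


def run_round_robin(coroutines):
    ready = deque(coroutines)
    while ready:
        co = ready.popleft()
        try:
            next(co)          # resume the coroutine until its next yield
        except StopIteration:
            continue          # finished: do not put it back in the queue
        ready.append(co)


if __name__ == "__main__":
    log = []
    run_round_robin([worker("A", 2, log), worker("B", 2, log)])
    print([(name, step) for name, step, _ in log])
    # [('A', 0), ('B', 0), ('A', 1), ('B', 1)]
    print(len({tid for _, _, tid in log}))  # 1: everything ran on one thread
```

Two tasks interleave, yet only one thread ID ever shows up in the log. No system call was involved in any of the switches.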
5. Summary & Lessons Learned
After being humbled by a virtual Staff Engineer, I’ve learned a few things:
Don’t memorize the textbook “Stack -> Heap” layout. When you’re coding a heavily multi-threaded app, remember that thread stacks are scattered across mmap regions, not neatly lined up.
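The number of Threads is rarely the whole performance story; how often you cross address spaces matters too. Switching between Processes means a CR3 write and a cold TLB. If your app lags when switching tabs, Process context switches are one suspect, but measure before you blame them.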
Coroutines are king for I/O. If you have 10,000 network connections, don’t spawn 10,000 Threads. Spin up a few Threads and pack them with Coroutines.
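As a rough illustration of that last point, here is a sketch with asyncio where `asyncio.sleep` stands in for waiting on a socket. There is no real networking here, and `handle_connection`, `serve`, and `run` are hypothetical names for this example.

```python
import asyncio
import threading


async def handle_connection(conn_id, seen_threads):
    await asyncio.sleep(0.01)  # stand-in for waiting on network I/O
    seen_threads.add(threading.get_ident())
    return conn_id


async def serve(n):
    seen_threads = set()
    results = await asyncio.gather(
        *(handle_connection(i, seen_threads) for i in range(n))
    )
    return len(results), len(seen_threads)


def run(n=10_000):
    """Return (connections handled, distinct OS threads used)."""
    return asyncio.run(serve(n))


if __name__ == "__main__":
    print(run())  # (10000, 1)
```

Ten thousand "connections", one OS thread. The waits overlap instead of queueing up, which is why the whole thing finishes in roughly the time of a single sleep plus scheduling overhead rather than 10,000 of them.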
Hopefully, this post gives you a more realistic view of these seemingly basic concepts. And if your head still hurts, welcome to the club — I just took some ibuprofen myself.