ARM Architecture and Fundamentals

Honestly, if you have been writing C/C++ on a PC, you have probably never cared how a processor actually turns a line of code into electrical signals: on x86 the hardware sits behind thick layers of abstraction, and the compiler and operating system shield you from almost every low-level detail. Once you step into the embedded world, though, and especially once you face ARM Cortex-M series MCUs, this knowledge is no longer a bonus; it is a prerequisite for writing correct code. I have seen too many people jump straight into STM32 without being able to explain processor modes or the exception vector table, only to stare blankly at the registers when a HardFault occurs.

Developers in other languages, such as Python or Java, basically don't need to worry about this—the virtual machine or interpreter has already abstracted the hardware cleanly. But C/C++ is different; their design philosophy is "close to the metal," with only a thin layer of abstraction between the machine code generated by the compiler and your source code. As the ARM architecture is the absolute mainstream in the embedded field today, understanding its architecture is equivalent to understanding what exactly happens to every line of C code you write on the chip. The connection to C++ is even stronger—object layout, cache-friendly design, and exception handling overhead are all directly tied to ARM's hardware characteristics.

In this tutorial, we will dissect the ARM processor from an architectural perspective, figuring out its memory architecture, instruction sets, register files, exception mechanisms, and processor modes. This is not to teach you to write assembly, but to give you a clear mental model of what is happening at the underlying level when you write C/C++—when you use volatile on a register, you know why; when you debug a HardFault caused by a stack overflow, you can locate the problem quickly.

Learning Objectives
After completing this chapter, you will be able to:
[ ] Distinguish between von Neumann architecture, Harvard architecture, and Modified Harvard architecture.
[ ] Explain the differences and use cases for ARM, Thumb, and Thumb-2 instruction sets.
[ ] Identify the roles of registers R0-R15 and the AAPCS calling convention.
[ ] Describe the Cortex-M exception vector table structure and the push/pop (stacking/unstacking) mechanism.
[ ] Understand Thread/Handler modes and privilege level divisions.

Environment

This content is theoretical but closely tied to actual hardware. All code examples can be verified using an ARM toolchain.

text

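Platform: ARM Cortex-M3/M4 (representative chips: STM32F1/F4 series)
Toolchain: GCC ARM Embedded (arm-none-eabi-gcc) >= 10.x
           or STM32CubeIDE / PlatformIO (same toolchain underneath)
Standard: -std=c11 (C parts) / -std=c++17 (C++ comparison parts)
Hardware: no development board is needed to follow along; an STM32F103 or STM32F407 for cross-checking is a plus
Reference architecture: ARMv7-M (Cortex-M3/M4), with occasional comparisons to ARMv7-A (Cortex-A series)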

Step 1 — Understanding How the Processor Accesses Memory

The first thing we need to discuss is the processor's memory architecture, specifically how the CPU interacts with memory. This might seem like a basic topic, but it directly dictates many everyday phenomena—such as why code runs faster on some chips than others, or why DMA always requires specific address region configurations.

Von Neumann Architecture — One Bus to Rule Them All

The core characteristic of the Von Neumann architecture is that instructions and data share the same bus and the same memory space. The CPU accesses memory via a single address bus; whether you are fetching code or reading data, it travels along the same path. You can imagine this as a single-lane road: instructions and data line up to pass through one by one, unable to travel side by side. The benefit is hardware simplicity, since one bus and one memory are enough, which reduces cost. Most general-purpose computers present this model to the programmer, and small ARM cores such as the ARM7TDMI and Cortex-M0 use a single shared bus as well.

The downside is also obvious: because instructions and data crowd the same bus, the CPU cannot fetch instructions and read/write data simultaneously. In practice, this means limited performance—you want to execute an addition and write the result back to memory at the same time? Sorry, the bus is busy fetching the next instruction, so you must wait. This is known as the "Von Neumann bottleneck."

Harvard Architecture — Two Buses, Each Doing Its Own Thing

The Harvard architecture takes a different path: instructions and data each have their own bus and memory space. It is essentially turning a single-lane road into a dual-lane road—fetching instructions and reading/writing data can happen simultaneously, theoretically doubling throughput. Most DSP chips and many modern MCUs adopt a pure Harvard architecture or a variant thereof.
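As a back-of-the-envelope check on the "theoretically doubling" claim, the sketch below counts bus cycles under a deliberately idealized model: every instruction fetch and every data access costs exactly one bus cycle, with no wait states, caches, or pipeline effects. The function names are made up for this illustration.

```c
#include <stdint.h>

/* Idealized model: one bus cycle per instruction fetch or data access. */

/* Shared bus (Von Neumann): fetches and data accesses are serialized. */
uint32_t cycles_shared_bus(uint32_t fetches, uint32_t data_accesses)
{
    return fetches + data_accesses;
}

/* Separate buses (Harvard): the two streams overlap, so the longer one dominates. */
uint32_t cycles_split_bus(uint32_t fetches, uint32_t data_accesses)
{
    return (fetches > data_accesses) ? fetches : data_accesses;
}
```

With 1000 fetches and 1000 data accesses the model gives 2000 cycles versus 1000, the 2x best case. With 1000 fetches and only 300 data accesses it gives 1300 versus 1000, so the real gain depends on how memory-heavy the code is.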

However, a pure Harvard architecture isn't a silver bullet. If your program needs self-modifying code (rare in embedded systems), or if you want to use a block of memory as both code and data space, the hardware is less flexible—you would need to design an extra mechanism to allow the two buses to access each other's memory spaces.

Modified Harvard Architecture — The Practical Choice for ARM

In reality, chips rarely go to either extreme. ARM Cortex-M3/M4 processors employ what is known as a Modified Harvard Architecture. You can understand it this way: from a software perspective, the address space is unified (like Von Neumann), but from a hardware perspective, instruction fetching and data access can occur in parallel (like Harvard).

Specifically, Cortex-M3/M4 has three sets of AHB-Lite buses: the I-Code bus is dedicated to fetching instructions from the Code region (0x00000000–0x1FFFFFFF, where Flash is mapped); the D-Code bus handles data access in the Code region (like loading constants from Flash); and the System bus handles access to SRAM and peripheral regions. I-Code and D-Code can operate in parallel, so code in Flash and constant data in Flash can be accessed simultaneously, significantly improving execution efficiency.

If you look at the memory map for the STM32F407, you will find that the 512MB space from address 0x00000000 to 0x1FFFFFFF is marked as the Code region, while 0x20000000 onwards is the SRAM region. ARM officially recommends that D-Code has higher priority than I-Code during bus arbitration—because if data access is blocked, the processor cannot proceed, whereas instruction prefetching can afford to wait a bit.

⚠️ Watch Out Although Cortex-M has multiple buses, they are not truly "completely parallel": if I-Code and D-Code access Flash simultaneously, both requests still end up being arbitrated by the Flash interface. On the STM32F1, the Flash interface has only a small prefetch buffer and no cache, so once wait states kick in, the benefit of bus parallelism is greatly diminished; the STM32F4, by contrast, has a 128-bit wide Flash interface and the Adaptive Real-Time (ART) Accelerator with instruction and data caches, and the difference is very noticeable. Don't forget to check the Flash interface when selecting a chip.
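The bus routing described above can be summarized as a small lookup. This is a simplified sketch of the Cortex-M3/M4 address decoding, not vendor code: the enum and function names are hypothetical, and it ignores details such as bit-banding and the fact that instruction fetches from the private peripheral region are not permitted.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum { BUS_ICODE, BUS_DCODE, BUS_SYSTEM, BUS_PPB } Bus;

/* Which Cortex-M3/M4 bus interface carries an access to `address`? */
Bus bus_for_access(uint32_t address, bool is_instruction_fetch)
{
    if (address <= 0x1FFFFFFFu) {
        /* Code region: fetches use I-Code, loads and stores use D-Code. */
        return is_instruction_fetch ? BUS_ICODE : BUS_DCODE;
    }
    if (address >= 0xE0000000u && address <= 0xE00FFFFFu) {
        /* Private Peripheral Bus: NVIC, SysTick, SCB, debug components. */
        return BUS_PPB;
    }
    /* SRAM, peripherals, and external memory all go through the System bus. */
    return BUS_SYSTEM;
}
```

For example, a literal-pool load from 0x08000100 (Flash) travels over D-Code, while a write to a GPIO register at 0x40020014 uses the System bus, so neither competes with instruction fetches on I-Code.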

Step 2 — Understanding How ARM Instruction Sets Are Encoded

With the memory architecture cleared up, let's look at ARM's instruction sets. This directly impacts the size and execution efficiency of the code you generate, which is especially critical on resource-constrained MCUs.

ARM Instruction Set (32-bit) — High Expressiveness but Large Volume

ARM's earliest instruction set (A32) uses 32-bit fixed-length encoding, with each instruction occupying 4 bytes. The encoding space is ample, allowing for rich operations—conditional execution, inline barrel shifter shifts, multi-register transfers (LDM/STM), and other advanced features. The benefit of 32-bit instructions is high expressiveness; a single instruction can do a lot, raising the performance ceiling. The cost is also obvious—code volume is large, and on small MCUs with only a few dozen KB of Flash, this overhead cannot be ignored.

Thumb Instruction Set (16-bit) — Small Volume but Limited Functionality

To address code density, ARM introduced the Thumb instruction set (T16) in the ARMv4T architecture, compressing most common instructions into 16-bit encodings. The cost is the loss of some advanced features: most instructions in Thumb state no longer support conditional execution, and the use of the barrel shifter is restricted. In exchange, code size shrinks by about 30%, which is a lifesaver for applications with tight Flash space.

Thumb-2 — The Default Choice for Cortex-M

Cortex-M3/M4 uses the Thumb-2 instruction set, a hybrid encoding scheme: 16-bit and 32-bit instructions are intermixed. The compiler automatically selects the most appropriate encoding width for each instruction—simple operations use 16 bits, while complex operations (like loading large immediate values, division, etc.) use 32 bits. This way, you get near-complete functionality comparable to the pure ARM instruction set while maintaining code density close to pure Thumb.
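To make the mixed 16/32-bit encoding concrete: a Thumb-2 decoder can tell the width of an instruction from the top five bits of its first halfword. The helper below is a minimal sketch of that rule (the function name is invented for this example); it is handy when reading a raw hex dump by hand.

```c
#include <stdint.h>

/* Size in bytes of the Thumb instruction whose first halfword is `hw1`. */
unsigned thumb_instruction_size(uint16_t hw1)
{
    unsigned top5 = (unsigned)hw1 >> 11;   /* bits [15:11] */

    /* 0b11101, 0b11110 and 0b11111 introduce a 32-bit encoding;
       every other prefix is a complete 16-bit instruction. */
    return (top5 >= 0x1Du) ? 4u : 2u;
}
```

For instance, 0x4770 (bx lr) and 0x4408 (add r0, r1) are 16-bit instructions, while a first halfword of 0xF04F (mov.w with an immediate) or 0xFB90 (sdiv) announces a 32-bit instruction.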

One point is particularly worth noting: Cortex-M series processors only support the Thumb instruction set, not the traditional 32-bit ARM instruction set. Therefore, all code you run on Cortex-M, whether compiled from C or hand-written assembly, must be Thumb encoded. The compiler targets Thumb for these cores, so you don't need to worry about it in most cases. But if you are writing inline assembly or startup files, you must remember this; otherwise you will be rewarded with a fault. On Cortex-M, an attempt to execute in ARM state raises a UsageFault (INVSTATE), which escalates to a HardFault if the UsageFault handler is not enabled.

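#include <stdint.h>

/// @brief A simple Thumb function example
/// On Cortex-M, every function is compiled to Thumb encoding
int add_values(int a, int b)
{
    return a + b;
}

/// @brief Inline assembly example: read the Main Stack Pointer (MSP) in Thumb mode
/// Note: in real projects, prefer the CMSIS __get_MSP() function
uint32_t read_msp(void)
{
    uint32_t msp_value;
    // MRS names MSP explicitly; "mov %0, sp" would return PSP when a thread runs on the process stack
    __asm__ volatile("mrs %0, msp" : "=r"(msp_value));
    return msp_value;
}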

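⚠️ Warning Cortex-M cannot execute 32-bit ARM (A32) encodings at all, so the build must target Thumb. With -mcpu=cortex-m3 or -mcpu=cortex-m4, GCC rejects -marm at compile time (typically with "target CPU does not support ARM mode"), so the usual ways to reach the fault at run time are a hand-written vector table entry or function pointer whose bit 0 is clear, or a prebuilt library compiled for ARM state. If you hit a UsageFault with INVSTATE set (or a HardFault right after a branch), check -mthumb and -mcpu in your compiler flags first, then the low bit of the branch target.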
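On Cortex-M, the instruction set state travels in bit 0 of every interworking branch target (BX, BLX, a POP into PC, and each vector table entry): 1 selects Thumb, and the core fetches from the address with that bit cleared. The toolchain sets the bit for function symbols automatically; it only bites when addresses are built by hand. A minimal sketch with hypothetical helper names:

```c
#include <stdbool.h>
#include <stdint.h>

/* True if a branch target or vector table entry selects Thumb state. */
bool is_thumb_target(uint32_t target)
{
    return (target & 1u) != 0u;
}

/* Address the core actually fetches from: bit 0 is a state flag, not part of the address. */
uint32_t fetch_address(uint32_t target)
{
    return target & ~UINT32_C(1);
}
```

A handler located at 0x08000194 must therefore appear in the vector table as 0x08000195. Storing the even value is a classic cause of an immediate fault on the first interrupt.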

Step 3 — Understanding the Processor's "Workbench": Register File

If the instruction set is the processor's "language," then registers are its "workbench." When the CPU performs calculations, data is first moved into registers, operations occur between registers, and finally, the results are written back to memory. Understanding the division of labor among registers is fundamental to understanding how ARM operates.

General-Purpose Registers R0–R15

The ARMv7-M architecture defines sixteen 32-bit general-purpose registers, numbered R0 through R15. Each has a specific role; not all registers can be used freely for any purpose.

R0–R3 are argument and return value registers. According to the AAPCS (Procedure Call Standard for the Arm Architecture), the first four word-sized arguments of a function call are passed in R0–R3, and the return value comes back in R0 (a 64-bit return value uses R0 and R1 together). You can think of these as the "express lane" for function calls: if a C function takes no more than four such arguments, passing them requires no stack access at all, which makes the call very fast. If you write a function with five arguments, however, the fifth must be passed on the stack, adding a memory access.

R4–R11 are callee-saved registers. A function may use R4–R11 freely, but it must restore their original values before returning—meaning the caller can safely assume these registers will not be corrupted by the function call. Compilers typically allocate these registers to local variables, especially loop counters and frequently accessed pointers where data lifetimes span across function calls. If you see a bunch of PUSH {R4-R7, LR} instructions at the beginning of a function while debugging, that is the compiler saving the callee-saved registers it intends to use.

R12 (IP) is the Intra-Procedure-Call scratch register. The name is long, but the purpose is simple—the linker uses it as an intermediary when handling long jumps (where the target address exceeds the range of the jump instruction encoding). You will rarely touch this directly when writing C code.

R13 (SP) is the Stack Pointer, pointing to the top of the current stack. Cortex-M has two banked stack pointers, the Main Stack Pointer (MSP) and the Process Stack Pointer (PSP), and the SPSEL bit of the CONTROL register selects which one Thread mode uses. Bare-metal applications typically use only the MSP. When an RTOS is running, exception handlers use the MSP while threads use the PSP, which isolates the interrupt stack from the thread stacks. This design is ingenious: a thread that overflows its own stack does not eat into the space reserved for interrupt handling (although, without an MPU, it can still corrupt whatever memory lies next to its stack).

R14 (LR) is the Link Register, which stores the return address for a function call. When the BL (Branch with Link) instruction is executed, the return address is automatically stored in LR. The beauty of this is that for leaf functions (functions that do not call other functions), there is no need to push the return address to the stack; it is already sitting in LR, saving a memory write. However, if your function calls another function, the value in LR will be overwritten, so the compiler pushes LR to the stack at the beginning of the function to save it.

R15 (PC) is the Program Counter, pointing to the instruction currently being executed. Reading the PC on ARM usually yields the current instruction address plus 4 (due to pipeline prefetching), while writing to the PC is effectively performing a jump.

/// @brief Shows how the AAPCS calling convention affects register usage
/// The first 4 arguments travel in R0-R3; the 5th is passed on the stack

int fast_path(int a, int b, int c, int d)
{
    // a -> R0, b -> R1, c -> R2, d -> R3
    // Everything is passed in registers; no stack traffic
    return a + b + c + d;
}

int slow_path(int a, int b, int c, int d, int e)
{
    // a -> R0, b -> R1, c -> R2, d -> R3
    // e -> passed on the stack, costing one extra memory read
    return a + b + c + d + e;
}

Let's use arm-none-eabi-objdump -d to disassemble and examine the differences:

text

; fast_path: everything happens in registers
fast_path:
    add   r0, r0, r1    ; a + b -> R0
    add   r0, r0, r2    ; + c
    add   r0, r0, r3    ; + d
    bx    lr            ; return

; slow_path: the fifth argument is read from the stack
slow_path:
    add   r0, r0, r1
    add   r0, r0, r2
    add   r0, r0, r3
    ldr   r3, [sp]      ; load the fifth argument from the stack
    add   r0, r0, r3
    bx    lr

We can see that slow_path has one extra ldr instruction. That load, together with the matching store the caller needs to place the argument on the stack, is the cost of passing a fifth parameter.
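The register assignment itself follows a simple rule that can be simulated on any host. The sketch below models only the integer part of the AAPCS rules (4-byte and 8-byte scalar arguments, no floating-point registers, no structs); the function name is hypothetical and the model is a study aid, not a replacement for the specification.

```c
#include <stddef.h>

/*
 * Returns the first core register (0..3 for r0..r3) that holds argument
 * `index`, or -1 if that argument is passed on the stack.
 * sizes[i] is the size of argument i in bytes: 4 (int, pointer) or 8 (long long).
 */
int register_for_argument(const int *sizes, size_t count, size_t index)
{
    int next_reg = 0;                        /* next free core register */

    for (size_t i = 0; i < count; ++i) {
        int words = sizes[i] / 4;
        int assigned = -1;

        if (sizes[i] == 8) {
            next_reg = (next_reg + 1) & ~1;  /* 64-bit values start at an even register */
        }
        if (next_reg + words <= 4) {
            assigned = next_reg;
            next_reg += words;
        } else {
            next_reg = 4;                    /* once the stack is used, no back-filling */
        }
        if (i == index) {
            return assigned;
        }
    }
    return -1;                               /* index out of range */
}
```

For five int arguments the model puts the fourth in r3 and the fifth on the stack. For f(int, long long, int) it puts the 64-bit value in r2:r3, leaves r1 unused, and sends the third argument to the stack, which shows that argument order can matter as much as argument count.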

⚠️ Warning Don't try to "save parameters" by stuffing a bunch of unrelated variables into a struct and passing a pointer. The struct pointer itself occupies a register slot, and accessing through a pointer adds a layer of dereferencing overhead. A reasonable design rule is: hot path functions should take no more than four basic type parameters (the size of int/float); only consider passing a struct pointer if there are more.

Program Status Registers — The xPSR Trio

ARM processors store status information in the Program Status Register. On Cortex-M, this is split into three sub-registers collectively known as xPSR.

APSR (Application PSR) holds the result flags of arithmetic and logic operations: N (Negative), Z (Zero), C (Carry), V (oVerflow), and Q (Saturation). The first four are the condition code flags we are familiar with; C code like if (a > b) compiles down to checks against these flags.

EPSR (Execution PSR) contains the Thumb state bit (T-bit) and the If-Then flag. The T-bit on Cortex-M is always 1 (because only Thumb mode is supported), so we rarely need to manipulate it manually.

IPSR (Interrupt PSR) holds the exception number of the currently executing exception. IPSR is 0 in Thread mode; inside a handler it holds the exception number, which for external interrupts is the IRQ number plus 16 (for example, 3 means HardFault and 16 means IRQ0). This is particularly useful when debugging HardFaults: reading IPSR lets us confirm which exception context we are currently in.

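/// @brief Understanding C comparisons through the xPSR condition flags
/// The compiler turns a conditional into a test of the N/Z/C/V flags
int max_value(int a, int b)
{
    // After compilation: CMP R0, R1, then a test of the APSR flags
    if (a > b) {
        return a;  // GT condition (signed): Z == 0 and N == V
    }
    return b;
}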
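When a debugger or a fault handler hands you a raw xPSR value, the three sub-registers can be pulled apart with a few shifts. The sketch below uses the ARMv7-M bit positions (flags in bits 31..27, T-bit in bit 24, exception number in bits 8..0); the struct and function names are invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool     n, z, c, v, q;      /* APSR flags, bits 31..27 */
    bool     thumb;              /* EPSR T-bit, bit 24 */
    uint16_t exception_number;   /* IPSR, bits 8..0 (0 means Thread mode) */
} XpsrFields;

XpsrFields decode_xpsr(uint32_t xpsr)
{
    XpsrFields f;

    f.n = ((xpsr >> 31) & 1u) != 0u;
    f.z = ((xpsr >> 30) & 1u) != 0u;
    f.c = ((xpsr >> 29) & 1u) != 0u;
    f.v = ((xpsr >> 28) & 1u) != 0u;
    f.q = ((xpsr >> 27) & 1u) != 0u;
    f.thumb = ((xpsr >> 24) & 1u) != 0u;
    f.exception_number = (uint16_t)(xpsr & 0x1FFu);
    return f;
}
```

Decoding 0x61000003, for instance, yields Z and C set, the T-bit set, and exception number 3, meaning the value was captured inside the HardFault handler.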

Step 4 – Understanding Processor "Modes"

ARM processors operate in different "modes," each with distinct privilege levels and accessible resources. This section serves as the foundation for understanding the security model and exception handling.

The Cortex-M Simplified Model: Thread and Handler

Cortex-M collapses the seven processor modes of classic ARM into just two: Thread mode (for normal application code) and Handler mode (for interrupt service routines and other exception handlers). Privilege is a separate axis: Thread mode can run either privileged or unprivileged, while Handler mode is always privileged.

After power-on reset, the processor defaults to Thread mode + privileged level. If we do not explicitly drop privileges (by writing to the CONTROL register), the entire program runs in a privileged state—this is common in bare-metal development, but it implies our code can "legally" do anything, including writing to the wrong registers and causing peripheral malfunctions. In scenarios running an RTOS, the RTOS typically drops permissions to the unprivileged level when creating user threads. This way, even if a thread goes astray, it cannot directly manipulate critical hardware registers.

Handler mode is always privileged—interrupt handling code requires full hardware access, which is a hard requirement. When an exception or interrupt occurs, the processor automatically switches from Thread to Handler mode, and switches back automatically when handling is complete.

⚠️ Warning If we accidentally drop to the unprivileged level in Thread mode, we cannot elevate privileges back within that mode—only Handler mode, triggered by exceptions/interrupts, can manipulate the CONTROL register to raise privileges. Therefore, if we intend to use unprivileged mode, we must trigger a system call via the SVC (Supervisor Call) instruction to perform privileged operations, rather than manipulating hardware registers directly in unprivileged mode.
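The mode, privilege, and stack rules above boil down to two CONTROL bits plus IPSR. The sketch below decodes them using the ARMv7-M bit assignments (nPRIV is bit 0, SPSEL is bit 1); the struct and function names are hypothetical. On a target, the two inputs would come from the CMSIS functions __get_CONTROL() and __get_IPSR().

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool handler_mode;   /* true: Handler mode, false: Thread mode */
    bool privileged;
    bool uses_psp;       /* true: SP is PSP, false: SP is MSP */
} ExecutionState;

ExecutionState decode_state(uint32_t control, uint32_t ipsr)
{
    ExecutionState s;

    /* A nonzero exception number means we are inside a handler. */
    s.handler_mode = (ipsr & 0x1FFu) != 0u;
    /* Handler mode is always privileged; Thread mode obeys nPRIV (bit 0). */
    s.privileged = s.handler_mode || (control & 1u) == 0u;
    /* Handler mode always uses MSP; Thread mode obeys SPSEL (bit 1). */
    s.uses_psp = !s.handler_mode && (control & 2u) != 0u;
    return s;
}
```

A typical RTOS user thread shows up as CONTROL = 3 with IPSR = 0 (unprivileged Thread mode on PSP). The same CONTROL value observed with IPSR = 15 means the SysTick handler is running: privileged, on MSP.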

Step 5 – Tracing the Interrupt Handling Flow via the Vector Table

Now that we have the basics of processor modes and registers, let's connect the dots and see exactly what the ARM processor does when an exception or interrupt occurs.

Exceptions Are Not Just Interrupts

In ARM terminology, an "Exception" is a broader concept than an "Interrupt." Interrupts are just one type of exception. Others include: Reset, NMI (Non-Maskable Interrupt), HardFault, Memory Management Fault, Bus Fault, Usage Fault, SVCall, PendSV, and SysTick. They share the same handling mechanism but differ in priority.

The Vector Table – The "Phonebook" of Exception Handling

When an exception occurs, the processor needs to know where the corresponding handler function is located. ARM's solution is the Vector Table—an array of function pointers stored in memory, where each exception type corresponds to an entry.

On Cortex-M, the vector table defaults to starting at address 0x00000000 (this can be relocated via the VTOR register). The first entry is not a function pointer, but the value of the initial Stack Pointer (MSP)—this is a clever design where the processor automatically loads this value into the SP (Stack Pointer) upon reset, requiring no extra initialization code. Starting from the second entry, the Reset Handler, NMI Handler, HardFault Handler, and others are stored in sequence.

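#include <stdint.h>

/// @brief Sketch of the Cortex-M vector table structure
typedef void (*ExceptionHandler)(void);

/// @brief Layout of the 16 system entries (ARMv7-M); device interrupt vectors follow
typedef struct {
    uint32_t         kInitialStackPointer;  // 0: initial MSP value
    ExceptionHandler reset_handler;         // 1: Reset
    ExceptionHandler nmi_handler;           // 2: Non-maskable interrupt
    ExceptionHandler hardfault_handler;     // 3: HardFault
    ExceptionHandler memmanage_handler;     // 4: Memory management fault
    ExceptionHandler busfault_handler;      // 5: Bus fault
    ExceptionHandler usagefault_handler;    // 6: Usage fault
    uint32_t         reserved_7_10[4];      // 7-10: reserved
    ExceptionHandler svcall_handler;        // 11: Supervisor call (SVC)
    ExceptionHandler debugmon_handler;      // 12: Debug monitor
    uint32_t         reserved_13;           // 13: reserved
    ExceptionHandler pendsv_handler;        // 14: Pendable service request
    ExceptionHandler systick_handler;       // 15: System tick timer
    // External interrupt (IRQ) vectors start here ...
} VectorTable;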
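Because every entry is one 32-bit word, locating a vector is plain arithmetic: the entry for exception number n sits at table base + 4 * n, and external interrupt k is exception number 16 + k. A small sketch, with function names that are illustrative only:

```c
#include <stdint.h>

enum { FIRST_IRQ_EXCEPTION = 16 };   /* exception number of IRQ0 */

/* Exception number of external interrupt `irq_number` (IRQ0, IRQ1, ...). */
uint32_t exception_number_for_irq(uint32_t irq_number)
{
    return irq_number + FIRST_IRQ_EXCEPTION;
}

/* Address of the vector table word for `exception_number`. */
uint32_t vector_entry_address(uint32_t table_base, uint32_t exception_number)
{
    return table_base + 4u * exception_number;
}
```

With the table at 0x00000000, the Reset vector is at 0x04, HardFault at 0x0C, and SysTick at 0x3C. After relocating the table to RAM at 0x20000000 through VTOR, the entry for IRQ0 lives at 0x20000040.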

Exception Stacking—The "Context" Automatically Saved by the Processor

When an exception occurs, the Cortex-M processor automatically saves eight registers on the current stack: R0, R1, R2, R3, R12, LR, the return address (the PC of the interrupted code), and xPSR. This operation is called "stacking" and is performed entirely by hardware; we do not need to write any code to save the context by hand. On entry the processor also loads LR with a special EXC_RETURN value; when the handler returns by branching to that value, the processor automatically restores the eight registers from the stack ("unstacking").

This design means that our interrupt service routine is essentially a standard C function: the hardware saves exactly the registers that AAPCS allows a callee to clobber (R0–R3, R12, LR), and the compiler preserves R4–R11 as it would for any function. We do not need special modifiers like __irq (a practice from the ARM7TDMI era), and the compiler does not need to generate a special prologue or epilogue. Compared to the ARM7TDMI days, when we had to write assembly to save and restore registers ourselves, the Cortex-M approach is remarkably clean.

However, there is a pitfall to watch out for. All exception handlers share the main stack, so its size must cover the deepest chain of nested interrupts. If the stack overflows, the first symptom is usually silent corruption of whatever data sits below it; once the stack pointer runs into an invalid or protected address, the stacking operation itself faults, and because handling that fault requires stacking too, the failure quickly escalates to a HardFault or even a lockup. Reasonable stack size planning is therefore crucial in Cortex-M development. As a rough starting point, reserve at least 512 bytes for the main stack and, if running an RTOS, at least 256 bytes for each thread stack, then verify against measured usage.
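The stacked frame is also the main clue when debugging a fault: the word at offset 24 is the address of the code that was interrupted. The sketch below models the basic (non-FPU) frame as a C struct; the type and function names are made up here, and on a real target the frame pointer would be obtained from MSP or PSP in a short assembly shim before calling into C.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Basic exception frame, lowest address first (what SP points to on handler entry). */
typedef struct {
    uint32_t r0, r1, r2, r3, r12;
    uint32_t lr;     /* LR of the interrupted code */
    uint32_t pc;     /* return address: where the interrupted code resumes */
    uint32_t xpsr;
} BasicExceptionFrame;

static_assert(sizeof(BasicExceptionFrame) == 32, "basic frame is 8 words");
static_assert(offsetof(BasicExceptionFrame, pc) == 24, "PC is the 7th word");

/* EXC_RETURN bit 2 tells which stack received the frame: 0 = MSP, 1 = PSP. */
bool frame_is_on_psp(uint32_t exc_return)
{
    return (exc_return & 4u) != 0u;
}

/* Address of the instruction stream that was interrupted. */
uint32_t interrupted_pc(const BasicExceptionFrame *frame)
{
    return frame->pc;
}
```

In a HardFault handler, an EXC_RETURN of 0xFFFFFFFD in LR means the frame is on the process stack, while 0xFFFFFFF9 means it is on the main stack. Feeding the stacked PC to arm-none-eabi-addr2line then points to the offending source line.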

Interrupt Priority—Who Goes First

The ARM Cortex-M supports configurable interrupt priorities. Each interrupt source has a priority register where a smaller numerical value indicates a higher priority. The Cortex-M3 supports up to 256 priority levels (8-bit width), but in actual implementations, most chips only use the upper 4 bits. This means the number of priority levels actually available to us might only be 16 (this is the case for STM32F1/F4).

Priority grouping splits the 8-bit priority register into two parts: the high bits are the "preemption priority," and the low bits are the "subpriority." An interrupt with a higher preemption priority can interrupt a lower priority interrupt that is currently being handled (nested interrupts), while the subpriority only determines which interrupt is processed first when they have the same preemption priority. CMSIS provides NVIC_SetPriorityGrouping() and NVIC_SetPriority() to configure these settings. If we are just getting started, using the default grouping of 4 bits for preemption priority and 0 bits for subpriority is sufficient; we can worry about fine-tuning later when necessary.
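The split between preemption priority and subpriority is easier to see as arithmetic. The sketch below packs the two fields into the 8-bit priority register value for a given PRIGROUP setting and number of implemented bits. It follows the same logic as the CMSIS helper NVIC_EncodePriority() but is written independently for host-side experimentation; the function names are hypothetical.

```c
#include <stdint.h>

/* Number of bits used for the preemption (group) priority. */
unsigned preempt_bits(unsigned prigroup, unsigned implemented_bits)
{
    unsigned wanted = 7u - (prigroup & 7u);

    return (wanted > implemented_bits) ? implemented_bits : wanted;
}

/* 8-bit priority register value; implemented bits are left-aligned. */
uint8_t encode_priority(unsigned prigroup, unsigned implemented_bits,
                        unsigned preempt, unsigned sub)
{
    unsigned pbits = preempt_bits(prigroup, implemented_bits);
    unsigned sbits = implemented_bits - pbits;
    unsigned value = ((preempt & ((1u << pbits) - 1u)) << sbits)
                   | (sub & ((1u << sbits) - 1u));

    return (uint8_t)(value << (8u - implemented_bits));
}
```

On a chip with 4 implemented bits, PRIGROUP = 3 gives 4 preemption bits and no subpriority, so preemption priority 5 becomes register value 0x50. With PRIGROUP = 5 the split is 2 + 2, and (preempt 1, sub 3) becomes 0x70.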

Step 6 – Connecting This Knowledge to Writing C Code

At this point, we have reviewed the core concepts of the ARM architecture. We might ask: "I write C/C++ code, not assembly, so how does this knowledge manifest in actual programming?" Let's outline a few direct connections.

Calling Conventions and Function Design

As mentioned earlier, AAPCS specifies that the first four arguments are passed in R0–R3. The direct consequence for C function design: when we control the function signature, we should keep frequently called "hot path" functions to four or fewer word-sized parameters and avoid passing large structures by value, which gives the compiler the most room to keep everything in registers.

`volatile` and Register Access

In embedded programming, the volatile keyword is almost everywhere—every pointer mapping to a hardware register must be marked volatile. The reason is that compiler optimizations assume memory values will not "change on their own," but hardware register values can be modified by external events (DMA transfer completion, peripheral status changes) at any time. volatile tells the compiler, "Every time, actually read from this address and do not cache the value."

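#include <stdint.h>

/// @brief Typical memory-mapped register access pattern
/// volatile guarantees that every access really reads or writes the hardware
#define GPIOA_ODR_ADDRESS ((volatile uint32_t*)0x40020014U)

void set_gpio_pin(int pin)
{
    // Without volatile, the compiler may treat repeated writes to the same address as redundant and optimize them away
    // Note: |= is still a read-modify-write sequence; volatile does not make it atomic
    *GPIOA_ODR_ADDRESS |= (1U << pin);
}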
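One way to keep such accesses testable off-target is to pass the register as a volatile pointer instead of hard-coding the address, so the same helpers can be exercised against an ordinary variable on the host. This is a sketch of that idea with made-up helper names, not an established API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Set the bits in `mask`; each call performs one real read and one real write. */
void reg_set_bits(volatile uint32_t *reg, uint32_t mask)
{
    *reg |= mask;
}

/* Clear the bits in `mask`. */
void reg_clear_bits(volatile uint32_t *reg, uint32_t mask)
{
    *reg &= ~mask;
}

/* True if every bit in `mask` is currently set. */
bool reg_test_bits(const volatile uint32_t *reg, uint32_t mask)
{
    return (*reg & mask) == mask;
}
```

On the target, the first argument would be a register address such as GPIOA_ODR_ADDRESS; in a host unit test it can simply be the address of a local volatile uint32_t standing in for the register.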

Stack Usage and Memory Layout Awareness

With the ARM stacking mechanism and dual-stack design understood, we now have a solid basis for planning memory usage. In bare-metal applications, we must ensure the linker script allocates sufficient space for the stack. In RTOS applications, we need to allocate a reasonable stack size for each thread. A rule of thumb is to start with 256 bytes for simple threads without floating-point operations, and 512 to 1024 bytes for threads involving floating-point math or deep function call chains. If the Cortex-M4 FPU is enabled and the interrupted code has used it, exception stacking additionally reserves space for 16 floating-point registers (S0–S15), the FPSCR, and one reserved word that keeps the frame 8-byte aligned. That is an extra 72 bytes, growing each frame from 32 to 104 bytes, and it cannot be ignored.
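To turn these numbers into a stack budget, it helps to compute the hardware frame size explicitly. The sketch below assumes the ARMv7-M frame layout (8 words for the basic frame, 18 more when floating-point context is stacked) and assumes 8-byte stack alignment on exception entry is enabled, in which case the hardware inserts one padding word whenever SP was only 4-byte aligned. The names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

enum {
    BASIC_FRAME_BYTES  = 8 * 4,    /* R0-R3, R12, LR, PC, xPSR */
    FP_EXTENSION_BYTES = 18 * 4    /* S0-S15, FPSCR, one reserved word */
};

/* Bytes pushed by hardware for one exception entry. */
uint32_t exception_frame_bytes(bool fp_context_active, uint32_t sp_before_entry)
{
    uint32_t bytes = BASIC_FRAME_BYTES;

    if (fp_context_active) {
        bytes += FP_EXTENSION_BYTES;
    }
    if ((sp_before_entry & 4u) != 0u) {
        bytes += 4u;               /* padding word to reach 8-byte alignment */
    }
    return bytes;
}

/* Pessimistic hardware stacking cost for `nesting_levels` nested exceptions. */
uint32_t worst_case_stacking_bytes(uint32_t nesting_levels, bool fp_context_active)
{
    /* Assume the padding word is needed at every level. */
    return nesting_levels * exception_frame_bytes(fp_context_active, 4u);
}
```

Three levels of nesting with floating-point context come to 3 * 108 = 324 bytes of hardware stacking alone, before counting a single local variable in the handlers. That is why a 512-byte main stack is a floor rather than a comfortable default.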

C++ Connections

If you are coming from the C++ section of this tutorial, the connection between these low-level details and C++ is much more significant than you might imagine. ARM hardware characteristics directly influence many C++ design decisions.

Cache-Friendly Design and Data Locality

ARM processors (especially the Cortex-A series) feature multi-level caches. Understanding the size of a cache line (typically 32 or 64 bytes) and how lines are filled directly impacts C++ data structure design. Grouping the fields touched on the hot path at the beginning of a structure, pushing cold data to the end, and using alignas to control alignment can noticeably improve performance. We only need to establish this awareness during the C tutorial phase; the C++ chapters will explore it in depth later on.

cpp

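#include <cstdint>

// Unfriendly layout: hot and cold data interleaved
struct BadSensorData {
    uint32_t timestamp;   // hot
    char name[32];        // cold: pushes the remaining hot fields into the next cache line
    float value;          // hot
    int calibration_id;   // cold
    float raw_value;      // hot
};

// Friendly layout: the hot data sits in the first 12 bytes, so one cache line covers it
struct GoodSensorData {
    uint32_t timestamp;   // hot
    float value;          // hot
    float raw_value;      // hot
    // --- hot fields end here; everything below is cold ---
    char name[32];        // cold
    int calibration_id;   // cold
};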
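The effect of the reordering can be checked with offsetof rather than guessed. The sketch below restates the two layouts in C and counts how many cache lines the hot fields span, assuming 32-byte cache lines, a line-aligned object, and 4-byte int and float; the helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u   /* assumption: 32-byte cache lines */

struct BadSensorData {
    uint32_t timestamp; char name[32]; float value; int calibration_id; float raw_value;
};

struct GoodSensorData {
    uint32_t timestamp; float value; float raw_value; char name[32]; int calibration_id;
};

/* Number of cache lines covered by bytes [first_byte, last_byte] of a line-aligned object. */
size_t lines_spanned(size_t first_byte, size_t last_byte)
{
    return last_byte / CACHE_LINE_BYTES - first_byte / CACHE_LINE_BYTES + 1u;
}

/* Hot fields run from timestamp through the last byte of raw_value in both layouts. */
size_t bad_hot_lines(void)
{
    return lines_spanned(offsetof(struct BadSensorData, timestamp),
                         offsetof(struct BadSensorData, raw_value) + sizeof(float) - 1u);
}

size_t good_hot_lines(void)
{
    return lines_spanned(offsetof(struct GoodSensorData, timestamp),
                         offsetof(struct GoodSensorData, raw_value) + sizeof(float) - 1u);
}
```

Under those assumptions the interleaved layout spreads its hot fields over bytes 0 to 47, which is two cache lines, while the reordered layout keeps them in bytes 0 to 11, which is one.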

Memory Layout and ABI of C++ Objects

The memory layout of C++ objects on ARM platforms follows the C++ ABI for the Arm Architecture (a variant of the Itanium C++ ABI) on top of the AAPCS: ordinary member variables are laid out in declaration order, a polymorphic class carries a virtual function table pointer (vptr) at the beginning of the object, and multiple inheritance may introduce several vptrs. These layout details matter for serialization, network transmission, and interaction with C code. If we write an object-oriented driver framework in C++ on Cortex-M, knowing where the vptr sits and how big it is (4 bytes on a 32-bit ARM) lets us work out the exact size of a driver object.

Overhead of Exception Handling

On embedded ARM platforms, the runtime overhead of the C++ exception handling mechanism (try/catch/throw) requires serious consideration. Exception handling tables and unwinding information significantly increase binary size, and the stack unwinding process during exception throwing involves extensive memory operations. On Cortex-M devices where both Flash and RAM are constrained, many teams choose to add -fno-exceptions at compile time to completely disable C++ exceptions, using error codes instead to handle errors. This isn't "not C++ enough," but rather a reasonable trade-off regarding resources.

`constexpr` and Compile-Time Calculation

Many operations that would otherwise build or consult lookup tables at runtime (CRC calculations, bit manipulation mask generation) can be evaluated by constexpr functions at compile time. The results land in Flash as constants, so no startup code or RAM is spent building them and no cycles are spent recomputing them. On low-end chips like the Cortex-M0/M0+ that lack even a hardware divider, the value of compile-time calculation is particularly prominent.

Exercises

We leave the following exercises for you to tinker with—hands-on research, coding, and hardware verification are the true path to learning.

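/// @brief Exercise 1: Read the IPSR register
/// Use GCC inline assembly to read the Cortex-M IPSR register
/// Explain how the value differs between normal execution and an interrupt service routine
/// Hint: IPSR is part of xPSR and can be read with the MRS instruction
uint32_t exercise_read_ipsr(void)
{
    // Exercise: read IPSR with inline assembly
    return 0;
}

/// @brief Exercise 2: Trigger and debug a HardFault
/// Write to an invalid address to deliberately trigger a HardFault
/// Then read the stacked register values inside the HardFault handler
/// and locate the address of the instruction that caused the exception
/// Hint: the HardFault handler can receive the stack frame pointer as an argument
void exercise_trigger_hardfault(void)
{
    // Exercise: write to an invalid address to trigger a HardFault
}

/// @brief Exercise 3: Analyze AAPCS argument passing
/// Write two functions: one taking 4 int arguments, the other taking 6
/// Disassemble with arm-none-eabi-objdump -d and compare the call sequences
/// Find out how the compiler assigns R4-R11 to local variables
int exercise_aapcs_4(int a, int b, int c, int d)
{
    // Exercise: add local variables and function calls to make the disassembly more interesting
    return 0;
}

int exercise_aapcs_6(int a, int b, int c, int d, int e, int f)
{
    // Exercise: same as above; compare the disassembly
    return 0;
}

/// @brief Exercise 4 (advanced): Vector table relocation
/// Read a Cortex-M startup file (such as startup_stm32f407xx.s)
/// Draw the complete vector table layout
/// Then modify the linker script to relocate the vector table into RAM
/// and change interrupt vectors dynamically at runtime (a basic skill for bootloader development)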
