Bare-Metal STM32 Programming on ARM Cortex-M: Hands-On Firmware Development and Debugging Tutorial

Bare-metal programming on STM32 microcontrollers gives you direct control over ARM Cortex-M hardware without relying on abstraction layers. This approach builds deep understanding of embedded systems fundamentals and prepares you for scenarios where performance, determinism, or code size constraints matter.

Most developers start with STM32 using the HAL or LL libraries provided by STMicroelectronics. These layers simplify development but add overhead and hide what is actually happening at the hardware level. Bare-metal programming removes these abstractions, forcing you to understand registers, memory layout, and peripheral configuration directly. This knowledge transfers across all ARM Cortex-M devices and makes you a more capable embedded systems engineer.

ARM Cortex-M Architecture Fundamentals

ARM Cortex-M processors are designed specifically for embedded applications. They combine low power consumption with real-time determinism, making them ideal for microcontroller use cases. The Cortex-M family includes M0, M0+, M3, M4, and M7 variants, each offering different performance and feature sets. Understanding the architecture fundamentals helps you write efficient bare-metal code.

The Cortex-M uses the Thumb instruction set, which provides dense 16-bit and 32-bit instructions. This reduces code size compared to traditional ARM instruction sets. Memory protection units are available on some variants, providing hardware-enforced isolation between privileged and unprivileged code. The Nested Vectored Interrupt Controller, built into all Cortex-M cores, handles interrupt priority and latency deterministically.

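Processor modes in Cortex-M include Thread mode and Handler mode. Thread mode runs normal application code, while Handler mode executes exception and interrupt service routines. The stack pointer is banked into two registers: the Main Stack Pointer (MSP), which is used after reset and always in Handler mode, and the Process Stack Pointer (PSP), which Thread mode can optionally select through the CONTROL register. Privilege level is a separate setting: Handler mode is always privileged, while Thread mode can run privileged or unprivileged. Understanding these modes matters for writing robust bare-metal firmware.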
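
The Thread mode settings described above live in the core's CONTROL register, where bit 0 selects the privilege level and bit 1 selects the stack pointer. The following is a minimal sketch that decodes a raw CONTROL value so the two settings can be inspected separately. The type and function names are illustrative assumptions; on target, the raw value would typically come from the CMSIS __get_CONTROL() intrinsic.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in the Cortex-M CONTROL register */
#define CONTROL_NPRIV (1u << 0) /* 0 = privileged Thread mode, 1 = unprivileged */
#define CONTROL_SPSEL (1u << 1) /* 0 = Thread mode uses MSP, 1 = Thread mode uses PSP */

/* Hypothetical helper type: the two Thread mode settings, decoded */
typedef struct {
    bool thread_unprivileged;
    bool thread_uses_psp;
} ControlState;

/* Decode a raw CONTROL register value into its two Thread mode settings */
ControlState decode_control(uint32_t control)
{
    ControlState s;
    s.thread_unprivileged = (control & CONTROL_NPRIV) != 0u;
    s.thread_uses_psp     = (control & CONTROL_SPSEL) != 0u;
    return s;
}
```

After reset CONTROL reads as 0, which decodes to privileged Thread mode running on the MSP.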

The System Control Block contains configuration and status registers. It controls features like sleep modes, fault handling, and system priority grouping. The SysTick timer, integrated into the core, provides a consistent time base across all Cortex-M implementations. This is useful for creating simple real-time scheduling without relying on peripheral-specific timers.
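
As a rough illustration of using SysTick as a time base, the sketch below computes the reload value for a desired tick rate. The counter is 24 bits wide and counts down from the reload value to zero, so one period lasts reload + 1 clock cycles. The function name is an assumption for this example, and the sketch assumes SysTick is clocked directly from the core clock.

```c
#include <stdbool.h>
#include <stdint.h>

#define SYSTICK_MAX_RELOAD 0x00FFFFFFu /* the reload register is 24 bits wide */

/* Computes the SysTick reload value for a given core clock and tick rate.
 * Returns false when the requested rate cannot be represented. */
bool systick_reload_for(uint32_t core_clock_hz, uint32_t tick_hz, uint32_t *reload_out)
{
    if (tick_hz == 0u || core_clock_hz < tick_hz) {
        return false;
    }
    uint32_t cycles_per_tick = core_clock_hz / tick_hz;
    if (cycles_per_tick - 1u > SYSTICK_MAX_RELOAD) {
        return false; /* period too long for a 24-bit counter */
    }
    *reload_out = cycles_per_tick - 1u;
    return true;
}
```

With a 72 MHz core clock and a 1 kHz tick this yields 71999. On target, the result would be written to the SysTick reload register, or the cycle count passed to the CMSIS SysTick_Config() helper.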

Toolchain Setup for Bare-Metal Development

Building bare-metal firmware requires a cross-compilation toolchain targeting ARM Cortex-M. The GNU Arm Embedded Toolchain provides GCC-based compilers specifically configured for ARM architectures. This toolchain includes the C and C++ compilers, assembler, linker, and debugger needed for embedded development.

You need several components for a complete development environment. The compiler generates ARM Thumb instructions from your C or C++ source code. The linker arranges code and data into the correct memory regions defined by your microcontroller. The debugger connects to your hardware probe and allows stepping through code while inspecting registers and memory.

Build systems like Make or CMake automate the compilation process. For bare-metal projects, you typically write a custom linker script that specifies memory regions and section placement. The script defines the starting address of flash memory, the RAM boundaries, and where the stack is placed. Understanding linker scripts is essential because the default layouts assume a hosted environment with an operating system.

In an industrial motor control application using an STM32F4 series microcontroller, the linker script must properly allocate flash regions for bootloader, main firmware, and configuration data while ensuring stack placement avoids collision with heap usage. The memory map typically defines flash starting at 0x08000000 with bootloader at the base, application code following, and SRAM starting at 0x20000000 with stack growing downward from the top of RAM.

ENTRY(Reset_Handler)

MEMORY
{
  BOOTLOADER (rx)  : ORIGIN = 0x08000000, LENGTH = 16K
  APPLICATION (rx) : ORIGIN = 0x08004000, LENGTH = 112K
  RAM (rwx)        : ORIGIN = 0x20000000, LENGTH = 20K
}

_estack = ORIGIN(RAM) + LENGTH(RAM);

SECTIONS
{
  .isr_vector :
  {
    KEEP(*(.isr_vector))
  } > BOOTLOADER

  .text :
  {
    *(.text*)
    . = ALIGN(4);
  } > APPLICATION

  .rodata :
  {
    *(.rodata*)
    . = ALIGN(4);
  } > APPLICATION

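  _sidata = LOADADDR(.data); /* Flash load address of .data, for the startup copy loop */

  .data :
  {
    . = ALIGN(4);
    _sdata = .;
    *(.data*)
    . = ALIGN(4);
    _edata = .;
  } > RAM AT > APPLICATION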

  .bss :
  {
    _sbss = .;
    *(.bss*)
    *(COMMON)
    . = ALIGN(4);
    _ebss = .;
  } > RAM

  ._stack :
  {
    . = ALIGN(8);
    . = . + 0x400; /* Stack size 1KB */
    . = ALIGN(8);
  } > RAM
}

Execute the code with caution.
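
The _sdata, _edata, _sbss, and _ebss symbols in the script exist so that startup code can prepare RAM before main() runs: initialized variables are copied from flash, and zero-initialized variables are cleared. The sketch below shows that step as a function taking the ranges as parameters, which lets it be exercised on a host machine. The function name is an assumption; on target, a Reset_Handler would pass the linker symbols, including the flash load address of .data, which scripts commonly export via LOADADDR(.data) under a name such as _sidata.

```c
#include <stdint.h>

/* Copies .data from its load address in flash into RAM, then zeroes .bss.
 * All ranges are treated as arrays of 32-bit words, relying on the
 * ALIGN(4) directives in the linker script. */
void init_ram_sections(const uint32_t *data_load,
                       uint32_t *data_start, uint32_t *data_end,
                       uint32_t *bss_start, uint32_t *bss_end)
{
    const uint32_t *src = data_load;
    for (uint32_t *dst = data_start; dst < data_end; ) {
        *dst++ = *src++;
    }
    for (uint32_t *dst = bss_start; dst < bss_end; ) {
        *dst++ = 0u;
    }
}
```

Skipping this step is a common source of confusing bugs: global variables appear to hold random values because nothing has initialized RAM.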

Flash programming tools transfer your compiled binary to the microcontroller. ST-Link is the debug and flash tool commonly used with STM32 development boards. Open-source alternatives like OpenOCD work with various debug probes including ST-Link and J-Link. These tools communicate with the debug port on your microcontroller to program flash memory and control execution.

Memory Map and Register Access

ARM Cortex-M microcontrollers use memory-mapped I/O for peripheral access. This means peripheral registers appear as addresses in the memory space. Writing to specific memory locations configures peripherals, and reading from them retrieves status or data. This approach eliminates special I/O instructions and makes peripheral access consistent with normal memory operations.

The STM32 memory map divides the address space into regions. Code resides in flash memory starting at a base address. SRAM provides volatile storage for variables and stack space. Peripheral registers occupy specific regions, with each peripheral having a base address from which its registers are accessed. Understanding the memory map helps you locate registers and understand how different components interact.

Direct register access in C uses pointer dereferencing. You define a pointer to the register address, then read or write through that pointer. For clarity and type safety, many developers define structures that mirror the register layout of each peripheral. This allows accessing fields by name rather than raw address offsets. The volatile qualifier tells the compiler that every access has a side effect, which prevents it from caching, reordering, or removing reads and writes in ways that would break hardware interaction.
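
The structure-overlay approach can be sketched as follows. The peripheral here is entirely hypothetical (names, offsets, and the enable bit are made up for illustration); the point is the pattern of volatile fields, an explicit reserved member for gaps in the register map, and compile-time checks that the struct offsets match the documented ones.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical peripheral: three registers with a 4-byte gap before DATA */
typedef struct {
    volatile uint32_t STATUS;   /* offset 0x00 */
    volatile uint32_t CONTROL;  /* offset 0x04 */
    uint32_t          reserved; /* offset 0x08, not used by the hardware */
    volatile uint32_t DATA;     /* offset 0x0C */
} Demo_TypeDef;

/* Compile-time checks that the layout matches the (made-up) register map */
_Static_assert(offsetof(Demo_TypeDef, CONTROL) == 0x04, "CONTROL offset");
_Static_assert(offsetof(Demo_TypeDef, DATA) == 0x0C, "DATA offset");

/* Sets the hypothetical enable bit (bit 0) without disturbing other bits */
void demo_start(Demo_TypeDef *dev)
{
    dev->CONTROL |= 1u;
}

/* Reads the data register by field name instead of by raw offset */
uint32_t demo_read(const Demo_TypeDef *dev)
{
    return dev->DATA;
}
```

On target, the struct pointer would be created by casting the peripheral base address from the reference manual. Passing the pointer as a parameter also makes the driver testable against an ordinary struct in RAM.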

Consider a medical device that reads sensor data through an ADC peripheral at address 0x40012000, where reading the data register retrieves the most recent conversion result while writing to control register 0x40012004 configures sampling parameters. Direct pointer access to these registers eliminates function call overhead and ensures precise timing required for synchronizing with external medical instrumentation.

#include <stdint.h>

// Hypothetical memory addresses for the ADC described above
#define ADC_DATA_REGISTER  (0x40012000UL)
#define ADC_CTRL_REGISTER  (0x40012004UL)

// Define volatile pointers for direct register access
// volatile ensures the compiler does not optimize away reads/writes
volatile uint32_t * const adc_data_reg = (volatile uint32_t *) ADC_DATA_REGISTER;
volatile uint32_t * const adc_ctrl_reg = (volatile uint32_t *) ADC_CTRL_REGISTER;

// Function to write control configuration
void adc_configure(uint32_t config_value) {
    // Writing to the control register via volatile pointer
    *adc_ctrl_reg = config_value;
}

// Function to read ADC data
uint32_t adc_read(void) {
    // Reading from the data register via volatile pointer
    return *adc_data_reg;
}

// Example usage
int main(void) {
    // Write configuration (e.g., start conversion)
    uint32_t config = 0x01; // Example: Enable bit
    adc_configure(config);

    // Read data
    uint32_t sensor_value = adc_read();

    return 0;
}

Bit manipulation is a constant activity in bare-metal programming. Setting, clearing, and testing bits requires understanding bitwise operations in C. Common patterns include setting a bit using OR, clearing a bit using AND with complement, and testing bits using AND and comparison. Some Cortex-M cores include bit-banding regions that allow atomic bit manipulation, though not all STM32 devices expose this feature to developers.
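
Where bit-banding is present (it is an optional feature of Cortex-M3 and Cortex-M4 cores), each bit in the first megabyte of the peripheral region is mirrored by a full word in an alias region, and the alias address follows a fixed formula: alias base + (byte offset × 32) + (bit number × 4). The sketch below computes that address. The function name is an assumption, and the register address used in the note afterwards is only an example input.

```c
#include <stdint.h>

#define PERIPH_BASE_ADDR     0x40000000u /* start of the peripheral bit-band region */
#define PERIPH_BB_ALIAS_ADDR 0x42000000u /* start of the peripheral alias region */

/* Returns the alias word address for one bit of a peripheral register.
 * reg_addr must lie inside the 1 MB bit-band region; bit is 0..31. */
uint32_t periph_bitband_alias(uint32_t reg_addr, uint32_t bit)
{
    uint32_t byte_offset = reg_addr - PERIPH_BASE_ADDR;
    return PERIPH_BB_ALIAS_ADDR + (byte_offset * 32u) + (bit * 4u);
}
```

For a register at 0x40020014 and bit 5, the function returns 0x42400294. Writing 1 or 0 to that word changes only the one bit, without a read-modify-write sequence in software.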

Industrial automation controllers frequently need to manipulate individual bits in status registers to acknowledge alarms, clear error flags, or set specific control signals without affecting other bits in the same register. For example, clearing an overcurrent flag in a motor driver status register while preserving the temperature warning and running status requires a masked write operation.

#include <stdio.h>

/**
 * Bit manipulation patterns for status registers.
 * Reference: Standard bitwise operations.
 */

// Pattern: Setting a bit (Use OR)
// Sets the bit at position 'pos' to 1
void set_bit(unsigned int *reg, unsigned int pos) {
    *reg |= (1U << pos);
}

// Pattern: Clearing a bit (Use AND with NOT)
// Sets the bit at position 'pos' to 0
void clear_bit(unsigned int *reg, unsigned int pos) {
    *reg &= ~(1U << pos);
}

// Pattern: Toggling a bit (Use XOR)
// Flips the bit at position 'pos'
void toggle_bit(unsigned int *reg, unsigned int pos) {
    *reg ^= (1U << pos);
}

// Pattern: Testing a bit (Use AND and Shift)
// Returns 1 if bit is set, 0 if clear
int test_bit(unsigned int reg, unsigned int pos) {
    return (reg >> pos) & 1U;
}

int main(void) {
    unsigned int status_register = 0x00; // Initialize register

    // Set bit 2 (register becomes 0b0100)
    set_bit(&status_register, 2);
    printf("Set bit 2: 0x%X\n", status_register);

    // Set bit 5
    set_bit(&status_register, 5);
    printf("Set bit 5: 0x%X\n", status_register);

    // Test bit 2
    if (test_bit(status_register, 2)) {
        printf("Bit 2 is set.\n");
    }

    // Clear bit 2
    clear_bit(&status_register, 2);
    printf("Cleared bit 2: 0x%X\n", status_register);

    // Toggle bit 5
    toggle_bit(&status_register, 5);
    printf("Toggled bit 5: 0x%X\n", status_register);

    return 0;
}

GPIO Configuration and Control

General Purpose Input/Output pins are the most basic peripheral but fundamental to most applications. Each GPIO port has multiple pins, and each pin can be configured as input, output, alternate function, or analog mode. Configuration involves setting mode bits, output type, speed, and pull-up or pull-down resistors.

As output, GPIO pins can drive LEDs, relays, or other digital inputs. The output data register (ODR) holds the logical state of each pin. STM32 GPIO ports also provide a bit set/reset register (BSRR) that sets or clears individual pins with a single write, avoiding a read-modify-write cycle on ODR. This is important when multiple pins share the same port and you want to change one pin without affecting others, particularly when an interrupt handler may write to the same port.
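
The atomic set and clear operation is usually done through the port's bit set/reset register (BSRR): writing 1 to one of bits 0 to 15 sets the corresponding pin, writing 1 to one of bits 16 to 31 resets it, and zero bits are ignored. A minimal sketch of building and writing such a value follows. The function names are assumptions, and the register is passed as a pointer so the code can be tried against a plain variable.

```c
#include <stdint.h>

/* Bits 0..15 of BSRR set pins; bits 16..31 reset pins; zero bits do nothing */
static inline uint32_t bsrr_set_mask(uint16_t pins)
{
    return (uint32_t)pins;
}

static inline uint32_t bsrr_reset_mask(uint16_t pins)
{
    return (uint32_t)pins << 16;
}

/* Sets and resets the requested pins with one write, leaving all other pins unchanged */
void gpio_write_pins(volatile uint32_t *bsrr, uint16_t set_pins, uint16_t reset_pins)
{
    *bsrr = bsrr_set_mask(set_pins) | bsrr_reset_mask(reset_pins);
}
```

Because the hardware applies the change in a single store, an interrupt that fires between instructions cannot undo or corrupt the update the way it can with a read-modify-write on the output data register.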

In a commercial LED lighting system, the firmware must configure multiple GPIO pins on port A as push-pull outputs driving PWM signals for RGB LED channels while ensuring that unrelated pins on the same port remain in their configured input states for sensor monitoring. The configuration requires setting the mode register, output type, and speed for each pin individually.

#include <stdint.h>

typedef struct {
    volatile uint32_t MODER;
    volatile uint32_t OTYPER;
    volatile uint32_t OSPEEDR;
    volatile uint32_t PUPDR;
    volatile uint32_t IDR;
    volatile uint32_t ODR;
} GPIO_TypeDef;

/**
 * @brief Configures multiple GPIO pins as push-pull outputs.
 * @param port Pointer to the GPIO port structure.
 * @param pins Bitmask of pins to configure (e.g., 0x0001 for Pin 0).
 * @param speed Output speed: 0=Low, 1=Medium, 2=High, 3=Very High.
 */
void GPIO_ConfigurePushPullOutput(GPIO_TypeDef* port, uint16_t pins, uint8_t speed) {
    uint32_t tmp;
    uint8_t i;

    // Set Mode to General Purpose Output (01)
    tmp = port->MODER;
    for (i = 0; i < 16; i++) {
        if ((pins & (1 << i)) != 0) {
            tmp &= ~(3UL << (i * 2));
            tmp |= (1UL << (i * 2));
        }
    }
    port->MODER = tmp;

    // Set Output Type to Push-Pull (0)
    port->OTYPER &= ~pins;

    // Set Output Speed
    tmp = port->OSPEEDR;
    for (i = 0; i < 16; i++) {
        if ((pins & (1 << i)) != 0) {
            tmp &= ~(3UL << (i * 2));
            tmp |= ((uint32_t)(speed & 0x3) << (i * 2)); // cast first: shifting a signed int into bit 31 is undefined
        }
    }
    port->OSPEEDR = tmp;
}

Input mode reads external signals such as buttons or sensors. The input data register reflects the current state of each pin. Input pins can be configured with pull-up or pull-down resistors to provide a defined state when no external signal is present. Reading the state of a button requires handling mechanical bounce, either through hardware debouncing or software techniques like reading multiple times with delays or using timer-based approaches.
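
One way to realize the timer-based approach is a small state machine that is fed one raw sample per timer tick and only accepts a new level after several consecutive samples agree. The sketch below is one such design under stated assumptions: the type and function names are invented for this example, and the caller is expected to invoke the update function at a fixed rate, for instance from a 1 ms timer interrupt.

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool    stable_state;   /* last confirmed level */
    uint8_t mismatch_count; /* consecutive samples that differ from stable_state */
    uint8_t threshold;      /* consecutive samples required to accept a change */
} Debouncer;

void debouncer_init(Debouncer *d, bool initial_state, uint8_t threshold)
{
    d->stable_state = initial_state;
    d->mismatch_count = 0u;
    d->threshold = threshold;
}

/* Feed one raw sample per tick; returns the debounced state */
bool debouncer_update(Debouncer *d, bool raw)
{
    if (raw == d->stable_state) {
        d->mismatch_count = 0u; /* any agreeing sample restarts the count */
    } else if (++d->mismatch_count >= d->threshold) {
        d->stable_state = raw;  /* change confirmed */
        d->mismatch_count = 0u;
    }
    return d->stable_state;
}
```

Unlike a blocking delay, this version never stalls the main loop, and the threshold multiplied by the tick period gives the debounce window directly.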

An industrial control panel uses emergency stop buttons that must be read reliably despite contact bounce and electrical noise from nearby high-power switching equipment. The button input is configured with a pull-up resistor and the firmware implements debouncing by reading the pin multiple times over a 20-millisecond window to confirm the stable state before processing the stop command.

#include <stdint.h>
#include <stdbool.h>

// External hardware abstraction functions (to be implemented per platform)
// delay_ms: busy-wait or timer-based delay in milliseconds
extern void delay_ms(uint32_t ms);
// read_pin: reads the logic level of a specific pin (true = HIGH, false = LOW)
extern bool read_pin(uint8_t pin);

/**
 * Reads a button state with software debouncing.
 * 
 * This function implements software debouncing by sampling the button state,
 * waiting for a specific delay period (allowing mechanical bounce to settle),
 * and sampling the state a second time to confirm stability.
 * 
 * @param pin The input pin number where the button is connected.
 * @return bool The debounced state of the button.
 */
bool read_button_debounced(uint8_t pin) {
    // First sample
    bool state_1 = read_pin(pin);

    // Timer-based delay sampling to filter mechanical contact bounce
    // Typical debounce duration for mechanical switches is 10ms to 50ms
    delay_ms(20);

    // Second sample
    bool state_2 = read_pin(pin);

    // If both samples match, the state is considered stable (debounced)
    if (state_1 == state_2) {
        return state_1;
    }

    // If the samples differ, the contact is still bouncing. This sketch falls
    // back to the first sample; a safety-critical input should instead retry
    // or keep the last confirmed state, depending on application requirements.
    return state_1;
}

Alternate function mode connects pins to peripherals such as UART, SPI, I2C, or timers. Each pin supports multiple alternate functions, and you must select the appropriate one based on your configuration. This mode requires understanding both the GPIO multiplexing and the peripheral configuration itself. Complex routing decisions may be necessary when multiple peripherals require the same pins.
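
On STM32 families that use alternate function registers (the STM32F4 among them), each pin has a 4-bit selector: pins 0 to 7 live in the low register (AFRL) and pins 8 to 15 in the high register (AFRH). The sketch below updates the selector for one pin. The function name is an assumption, and the two registers are modeled as a two-element array so the logic can be checked off-target.

```c
#include <stdint.h>

/* Selects alternate function 'af' (0..15) for 'pin' (0..15).
 * afr[0] models AFRL (pins 0..7) and afr[1] models AFRH (pins 8..15). */
void gpio_select_alternate(volatile uint32_t afr[2], uint8_t pin, uint8_t af)
{
    pin &= 0x0Fu;                          /* keep the pin index in range */
    uint32_t index = pin / 8u;             /* which register holds this pin */
    uint32_t shift = (pin % 8u) * 4u;      /* 4 selector bits per pin */

    uint32_t value = afr[index];
    value &= ~(0xFu << shift);             /* clear the old selection */
    value |= (uint32_t)(af & 0xFu) << shift;
    afr[index] = value;
}
```

The pin's mode field must also be set to alternate function for the selection to take effect, and the correct function number for a given pin and peripheral comes from the device datasheet.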

Interrupt Handling and NVIC Configuration

Interrupts allow the processor to respond to external events or peripheral status changes without constant polling. The Nested Vectored Interrupt Controller manages interrupt sources, priorities, and preemption on Cortex-M devices. Proper interrupt configuration is essential for real-time responsiveness and system reliability.

Each interrupt source has an associated interrupt service routine, a function that executes when the interrupt occurs. The vector table contains addresses for all exception and interrupt handlers. In bare-metal code, you typically define the vector table in assembly or using compiler-specific attributes, ensuring it is placed at the correct address in flash memory. The initial stack pointer and reset handler address are the first entries in this table.

Automotive safety systems require a properly constructed vector table placed at the flash memory start address to ensure the processor initializes correctly on power-up and can immediately respond to critical events such as airbag deployment or brake system faults. The table must contain the initial stack pointer value at offset 0x00, the reset handler address at 0x04, and addresses for all enabled exception and interrupt handlers.

/* Vector Table Definition for Cortex-M */
/* Reference: ARMv7-M Architecture Reference Manual */

/* External symbol for the top of the stack, defined in the linker script as _estack.
   Only its address is meaningful; it is never read as a variable. */
extern unsigned int _estack;

/* Forward declarations of handlers */
void Reset_Handler(void);
void NMI_Handler(void);
void HardFault_Handler(void);
void MemManage_Handler(void);
void BusFault_Handler(void);
void UsageFault_Handler(void);

/* Default weak handler implementation */
void Default_Handler(void) {
    while(1) {}
}

/* Assign weak aliases to default handler if not implemented */
void NMI_Handler(void) __attribute__((weak, alias("Default_Handler")));
void HardFault_Handler(void) __attribute__((weak, alias("Default_Handler")));
void MemManage_Handler(void) __attribute__((weak, alias("Default_Handler")));
void BusFault_Handler(void) __attribute__((weak, alias("Default_Handler")));
void UsageFault_Handler(void) __attribute__((weak, alias("Default_Handler")));

/* Vector table placed at start of flash (.isr_vector section) */
__attribute__((used, section(".isr_vector")))
void (* const g_pfnVectors[])(void) = {
    (void (*)(void))(&_estack),                            /* Initial Stack Pointer */
    Reset_Handler,                                         /* Reset Handler */
    NMI_Handler,                                           /* NMI Handler */
    HardFault_Handler,                                     /* Hard Fault Handler */
    MemManage_Handler,                                     /* MPU Fault Handler */
    BusFault_Handler,                                      /* Bus Fault Handler */
    UsageFault_Handler                                     /* Usage Fault Handler */
};

Interrupt priority determines which interrupts can preempt others. The NVIC supports multiple priority levels, allowing you to specify which events are most time-critical. Lower numerical values indicate higher priority. Priority grouping controls how preempt priority and subpriority are distributed among the available bits. Understanding priority grouping helps prevent priority inversion and ensures deterministic response times.
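
To make the grouping concrete, the sketch below models how a pre-emption priority and a subpriority are packed into one priority value for a given PRIGROUP setting. It follows the same arithmetic as the CMSIS NVIC_EncodePriority() helper, written out as a standalone function so it can be run on a host. The function name is an assumption, and the constant assumes the 4 priority bits implemented on STM32 Cortex-M3, M4, and M7 parts.

```c
#include <stdint.h>

#define PRIO_BITS 4u /* assumption: 4 implemented priority bits */

/* Packs (preempt, sub) into a priority value for the given PRIGROUP (0..7) */
uint32_t encode_priority(uint32_t prigroup, uint32_t preempt, uint32_t sub)
{
    prigroup &= 7u;
    /* Bits above the PRIGROUP split point are pre-emption bits, capped at PRIO_BITS */
    uint32_t preempt_bits = ((7u - prigroup) > PRIO_BITS) ? PRIO_BITS : (7u - prigroup);
    /* The remaining implemented bits, if any, are subpriority bits */
    uint32_t sub_bits = ((prigroup + PRIO_BITS) < 7u) ? 0u : (prigroup + PRIO_BITS - 7u);

    return ((preempt & ((1u << preempt_bits) - 1u)) << sub_bits)
         | (sub & ((1u << sub_bits) - 1u));
}
```

With PRIGROUP set to 5 the four bits split two and two, so pre-emption 1 with subpriority 1 encodes to 5. With PRIGROUP set to 3 all four bits are pre-emption bits and any subpriority is discarded.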

A real-time data acquisition system must prioritize high-speed ADC conversion interrupts above communication UART interrupts to prevent sample loss during data streaming, while still allowing USB enumeration to complete during initialization. This requires configuring NVIC with proper priority grouping to allocate separate preempt and subpriority bits, then setting ADC interrupts at priority 0 and UART interrupts at priority 5.

#include "stm32f4xx.h"  // Adjust based on your specific MCU

/**
 * NVIC Priority Configuration
 * 
 * ARM Cortex-M NVIC allows configuring priority grouping to split
 * the priority register into pre-emption priority and subpriority.
 * 
 * Pre-emption priority: Determines if one interrupt can preempt another
 * Subpriority: Determines order when multiple interrupts have same pre-emption priority
 * 
 * Reference: ARM Cortex-M Technical Reference Manual
 */

void NVIC_Config_PriorityGroup(void)
{
    // Configure Priority Grouping
    // PRIGROUP[2:0] field in SCB->AIRCR, for the 4 priority bits STM32 implements:
    // 0b111 (7): 0 bits pre-emption, 4 bits subpriority
    // 0b110 (6): 1 bit  pre-emption, 3 bits subpriority
    // 0b101 (5): 2 bits pre-emption, 2 bits subpriority
    // 0b100 (4): 3 bits pre-emption, 1 bit  subpriority
    // 0b011 (3): 4 bits pre-emption, 0 bits subpriority

    // Set to 4 bits pre-emption priority, 0 bits subpriority.
    // The CMSIS helper writes PRIGROUP together with the required
    // VECTKEY (0x5FA) and preserves the other AIRCR bits.
    NVIC_SetPriorityGrouping(3U);
}

void NVIC_Assign_Priorities(void)
{
    // Assign priority levels to different interrupt sources
    // With PriorityGroup_4: value is entirely pre-emption priority (0-15)
    // Lower value = Higher priority
    
    // Critical timing interrupt - Highest priority
    NVIC_SetPriority(TIM2_IRQn, 0);
    
    // High priority external interrupt
    NVIC_SetPriority(EXTI0_IRQn, 1);
    
    // Medium priority communication interrupt
    NVIC_SetPriority(USART1_IRQn, 2);
    
    // Low priority external interrupt
    NVIC_SetPriority(EXTI1_IRQn, 5);
    
    // Background task interrupt - Lowest priority
    NVIC_SetPriority(TIM3_IRQn, 15);
}

void NVIC_Enable_Interrupts(void)
{
    // Enable the configured interrupts
    NVIC_EnableIRQ(TIM2_IRQn);
    NVIC_EnableIRQ(EXTI0_IRQn);
    NVIC_EnableIRQ(USART1_IRQn);
    NVIC_EnableIRQ(EXTI1_IRQn);
    NVIC_EnableIRQ(TIM3_IRQn);
}

void NVIC_Complete_Configuration(void)
{
    // Step 1: Set priority grouping
    NVIC_Config_PriorityGroup();
    
    // Step 2: Assign priorities to interrupt sources
    NVIC_Assign_Priorities();
    
    // Step 3: Enable interrupts
    NVIC_Enable_Interrupts();
}

/**
 * Example with subpriority usage
 */
void NVIC_Config_With_Subpriority(void)
{
    // Set Priority Grouping: PRIGROUP = 5 gives 2 bits pre-emption, 2 bits subpriority
    NVIC_SetPriorityGrouping(5U);
    uint32_t group = NVIC_GetPriorityGrouping();

    // NVIC_SetPriority() expects a single priority value (0-15 here), so
    // NVIC_EncodePriority() is used to pack (pre-emption, subpriority)
    // according to the active grouping

    // TIM2 - Highest pre-emption priority (0), highest subpriority (0)
    NVIC_SetPriority(TIM2_IRQn, NVIC_EncodePriority(group, 0, 0));

    // TIM3 - Same pre-emption as TIM2, lower subpriority
    NVIC_SetPriority(TIM3_IRQn, NVIC_EncodePriority(group, 0, 1));

    // EXTI0 - Medium pre-emption priority (1), highest subpriority (0)
    NVIC_SetPriority(EXTI0_IRQn, NVIC_EncodePriority(group, 1, 0));

    // USART1 - Medium pre-emption priority (1), lower subpriority (1)
    NVIC_SetPriority(USART1_IRQn, NVIC_EncodePriority(group, 1, 1));

    // EXTI1 - Lowest pre-emption priority (3)
    NVIC_SetPriority(EXTI1_IRQn, NVIC_EncodePriority(group, 3, 0));

    // Enable all interrupts
    NVIC_EnableIRQ(TIM2_IRQn);
    NVIC_EnableIRQ(TIM3_IRQn);
    NVIC_EnableIRQ(EXTI0_IRQn);
    NVIC_EnableIRQ(USART1_IRQn);
    NVIC_EnableIRQ(EXTI1_IRQn);
}

Enabling and disabling interrupts at the peripheral level and NVIC level provides fine-grained control. Some peripherals have multiple interrupt sources that can be individually enabled or disabled. The NVIC controls whether the interrupt signal reaches the processor at all. Critical sections often temporarily disable interrupts to protect shared data, but this must be done carefully to avoid missing time-critical events.

Timer Configuration and PWM Generation

Timers provide precise timing control for tasks such as periodic interrupts, pulse width modulation, input capture, and output compare. STM32 devices include several timer types with different capabilities. Basic timers provide simple time base functionality, general-purpose timers add input capture and output compare, and advanced timers support motor control and complementary outputs.

Configuring a timer involves setting the clock source, prescaler, and auto-reload value. The prescaler divides the input clock to achieve the desired timer frequency. The auto-reload value determines when the timer counter rolls over, setting the period. The combination of prescaler and period controls the timer tick resolution and overall timing range. Understanding the clock tree is crucial because timers can be sourced from various clocks with different frequencies.

Industrial conveyor belt control systems use timers configured to generate periodic interrupts every 10 milliseconds for closed-loop speed control based on encoder feedback. The timer clock is sourced from the 72 MHz system clock and divided by 72 (a prescaler register value of 71) to achieve a 1 MHz counting frequency. The auto-reload value is set to 9999 so that the counter spans 10000 ticks per period, producing the precise 10-millisecond timing interval.

#include <stdint.h>

typedef struct {
    uint32_t prescaler;
    uint32_t auto_reload_value;
} TimerConfig;

/**
 * Calculates Prescaler and Auto-Reload values for periodic interrupts.
 * 
 * Formula: Interval = (Prescaler + 1) * (Auto_Reload + 1) / Clock_Frequency
 */
TimerConfig calculate_timer_values(uint32_t clock_freq_hz, float interval_seconds) {
    TimerConfig config = {0};
    
    if (interval_seconds <= 0.0f) {
        return config;
    }

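    // Set prescaler to achieve a 1 microsecond tick (1000000 Hz);
    // clocks slower than 1 MHz run the timer undivided
    uint32_t prescaler = 0;
    if (clock_freq_hz >= 1000000) {
        prescaler = (clock_freq_hz / 1000000) - 1;
    }

    // Calculate actual tick frequency based on integer prescaler
    uint32_t tick_freq = clock_freq_hz / (prescaler + 1);

    // Calculate Auto Reload Register value
    // Total_Ticks = Interval * Tick_Frequency, rounded to the nearest tick
    // Note: most STM32 timers have 16-bit PSC and ARR registers, so the
    // caller should check that both results fit before writing them
    uint32_t total_ticks = (uint32_t)(interval_seconds * (float)tick_freq + 0.5f);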
    
    if (total_ticks > 0) {
        config.auto_reload_value = total_ticks - 1;
    }
    
    config.prescaler = prescaler;
    return config;
}

int main(void) {
    uint32_t clock = 72000000; // 72 MHz
    float interval = 0.010f;   // 10 ms, matching the control-loop period above

    TimerConfig cfg = calculate_timer_values(clock, interval);
    // Use cfg to configure hardware timer registers (expected: PSC = 71, ARR = 9999)
    (void)cfg;

    return 0;
}

Pulse width modulation generates variable duty cycle signals for controlling motors, LEDs, or power converters. In PWM mode, the timer counts to the auto-reload value and resets, creating a fixed period. A capture/compare register determines when the output changes state, setting the duty cycle. The relationship between the compare value and auto-reload value controls the duty cycle percentage.

Brushless DC motor controllers in electric vehicle powertrain systems generate three-phase PWM signals to control motor speed and torque with precise timing relationships between the phases. The advanced timer channels are configured in complementary PWM mode with dead-time insertion to prevent shoot-through, and the compare values are dynamically adjusted based on throttle input and motor speed feedback.

#include <stdint.h>

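// Simplified, hypothetical register block for illustration only. Real STM32
// timers place these registers at different offsets (with others in between)
// and also need the output-compare mode and channel-enable bits configured
// before a PWM signal appears on a pin.
typedef struct {
    volatile uint32_t CR1;
    volatile uint32_t PSC;
    volatile uint32_t ARR;
    volatile uint32_t CCR1;
} Timer_TypeDef;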

void Timer_PWM_Init(Timer_TypeDef *timer, uint32_t clk_freq, uint32_t pwm_freq) {
    // Formula: PSC = (System_Clock / (PWM_Freq * Resolution)) - 1
    // Using a fixed resolution of 1000 counts for this example
    uint32_t period_counts = 1000; 
    
    timer->PSC = (clk_freq / (pwm_freq * period_counts)) - 1;
    timer->ARR = period_counts - 1;
    
    // Enable Timer (CEN bit)
    timer->CR1 |= (1 << 0);
}

void Set_Duty_Cycle(Timer_TypeDef *timer, float duty_percent) {
    // Clamp duty cycle between 0 and 100
    if (duty_percent > 100.0f) duty_percent = 100.0f;
    if (duty_percent < 0.0f) duty_percent = 0.0f;
    
    // Formula: CCR = (Duty_Percent / 100) * (ARR + 1)
    timer->CCR1 = (uint32_t)((duty_percent / 100.0f) * (timer->ARR + 1));
}

Input capture measures external signal timing by recording the timer counter value when a transition occurs on a pin. This is useful for measuring pulse widths, frequencies, or decoding protocols. Output compare generates output transitions when the counter matches a compare value. These modes require configuring the timer, enabling the appropriate interrupts or DMA requests, and handling the events in software or through automatic hardware actions.
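
As an illustration of the input capture arithmetic, the sketch below derives a signal's period and frequency from two consecutive capture values. It assumes a free-running 16-bit counter (auto-reload at 0xFFFF) and at most one counter overflow between the two edges; the function names are invented for this example.

```c
#include <stdint.h>

/* Ticks elapsed between two captures of a free-running 16-bit counter.
 * The cast back to uint16_t performs the subtraction modulo 65536,
 * which gives the right answer even if the counter wrapped once. */
uint32_t capture_period_ticks(uint16_t first, uint16_t second)
{
    return (uint16_t)(second - first);
}

/* Signal frequency in Hz, given the timer tick rate; returns 0 if no time elapsed */
uint32_t capture_frequency_hz(uint32_t timer_tick_hz, uint16_t first, uint16_t second)
{
    uint32_t period = capture_period_ticks(first, second);
    return (period == 0u) ? 0u : (timer_tick_hz / period);
}
```

With a 1 MHz tick, captures of 65000 and 464 are 1000 ticks apart despite the wrap, which corresponds to a 1 kHz input signal. Signals slower than one full counter period need overflow counting, which this sketch does not attempt.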

Debugging Techniques and Tools

Debugging bare-metal firmware requires different approaches compared to application-level debugging. You often lack operating system support, print output may be unavailable early in development, and timing constraints complicate traditional step-through debugging. Using the right tools and techniques makes debugging manageable and efficient.

Hardware debug probes connect to the microcontroller debug port and allow real-time inspection and control. The Serial Wire Debug interface provides two-wire access to core registers, memory, and breakpoint capabilities. Debuggers like GDB, integrated with tools like OpenOCD or vendor-specific servers, enable setting breakpoints, single-stepping, and examining memory and registers without requiring code modifications.

Logging through semihosting or ITM channels provides printf-style debugging without a dedicated UART. Semihosting routes file I/O requests, including console output, through the debug probe to the host computer. Each request halts the core while the debugger services it, so semihosting is slow and unsuitable for timing-sensitive code. The Instrumentation Trace Macrocell, available on Cortex-M3 and higher cores, streams trace output without stopping the processor. These techniques help when UART-based logging is impractical or unavailable.

Logic analyzers and oscilloscopes complement software debugging by showing actual signal behavior. When timing-sensitive code misbehaves or peripherals fail to communicate, capturing the actual waveforms reveals problems that are not visible through register inspection alone. Protocol decoders help interpret I2C, SPI, UART, and other communication protocols without manual bit-counting.

Memory corruption is a common challenge in embedded systems. Watchpoints trigger execution stops when specific memory locations are accessed or modified. Stack usage analysis helps ensure you are not overflowing stack regions. Memory protection units, when available, can detect illegal accesses and provide controlled fault handling. Static analysis tools and runtime checking libraries identify buffer overflows and unsafe pointer usage before they cause subtle failures.
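
A simple form of runtime stack usage analysis is stack painting: fill the stack region with a known pattern at startup, then later count how many words still hold the pattern. The sketch below shows the idea on a plain array. The function names and the fill pattern are assumptions for this example, and it relies on the stack growing downward, so untouched words sit at the lowest addresses.

```c
#include <stddef.h>
#include <stdint.h>

#define STACK_FILL_PATTERN 0xDEADBEEFu /* arbitrary marker value */

/* Fills the stack region with the marker; stack_bottom is the lowest address */
void stack_paint(uint32_t *stack_bottom, size_t words)
{
    for (size_t i = 0; i < words; i++) {
        stack_bottom[i] = STACK_FILL_PATTERN;
    }
}

/* Counts marker words from the lowest address upward; these were never used */
size_t stack_unused_words(const uint32_t *stack_bottom, size_t words)
{
    size_t unused = 0;
    while (unused < words && stack_bottom[unused] == STACK_FILL_PATTERN) {
        unused++;
    }
    return unused;
}
```

On target, painting has to happen early in startup and must only touch memory below the current stack pointer. The result is a high-water mark, so it reflects only the code paths exercised so far, and a result of zero unused words suggests the stack has likely overflowed.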

Progressive Project Examples

Start with a simple LED blink project to verify your toolchain and basic understanding. Configure a GPIO pin as output, toggle it in a loop with delay, and confirm the LED flashes at the expected rate. This project validates linker script placement, clock configuration, register access, and flash programming. Once working, you can replace the software delay with a timer-based interrupt for more precise timing.

Next, build a UART communication example to transmit and receive data. Configure the USART peripheral for the desired baud rate, enable transmit and receive interrupts or use polling, and implement simple echo functionality. This project teaches peripheral clocking, alternate function GPIO configuration, interrupt handling, and basic protocol considerations. Expanding this to use DMA for transfers introduces more advanced concepts without application-level overhead.

A building automation system uses UART communication to exchange configuration and status messages between a central controller and distributed sensor modules operating at 115200 baud rate. The firmware configures USART2 with the correct clock divisor for the baud rate, enables the receive interrupt to process incoming commands without blocking, and implements a circular buffer to manage incoming data while other system tasks continue executing.

#include <stdint.h>
#include <stdbool.h>

// Circular buffer structure
typedef struct {
    uint8_t* data;
    uint16_t size;
    volatile uint16_t head;
    volatile uint16_t tail;
} CircularBuffer;

// UART register structure (adjust for specific hardware)
typedef struct {
    volatile uint32_t baud_rate;
    volatile uint32_t control;
    volatile uint32_t status;
    volatile uint32_t data;
    volatile uint32_t int_enable;
} UART_Registers;

// Control register bits
#define UART_ENABLE       (1 << 0)
#define UART_TX_ENABLE     (1 << 1)
#define UART_RX_ENABLE     (1 << 2)
#define UART_RX_INT_ENABLE (1 << 3)

// Status register bits
#define UART_RX_READY      (1 << 0)
#define UART_TX_READY      (1 << 1)

// Global UART register pointer and circular buffer
static UART_Registers* uart;
static CircularBuffer rx_buffer;

// Initialize circular buffer
void buffer_init(CircularBuffer* buf, uint8_t* data, uint16_t size) {
    buf->data = data;
    buf->size = size;
    buf->head = 0;
    buf->tail = 0;
}

// Check if buffer is empty
bool buffer_is_empty(CircularBuffer* buf) {
    return buf->head == buf->tail;
}

// Check if buffer is full
bool buffer_is_full(CircularBuffer* buf) {
    return ((buf->head + 1) % buf->size) == buf->tail;
}

// Write byte to buffer
bool buffer_write(CircularBuffer* buf, uint8_t byte) {
    uint16_t next_head = (buf->head + 1) % buf->size;
    
    if (next_head == buf->tail) {
        return false;
    }
    
    buf->data[buf->head] = byte;
    buf->head = next_head;
    return true;
}

// Read byte from buffer
bool buffer_read(CircularBuffer* buf, uint8_t* byte) {
    if (buf->head == buf->tail) {
        return false;
    }
    
    *byte = buf->data[buf->tail];
    buf->tail = (buf->tail + 1) % buf->size;
    return true;
}

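// Calculate UART baud rate divisor, rounded to the nearest integer
// Baud_Divisor = Clock_Frequency / (16 * Baud_Rate)
// Plain integer division would truncate and increase the baud rate error.
uint32_t uart_calculate_baud_divisor(uint32_t clock_freq, uint32_t baud_rate) {
    uint32_t denominator = 16U * baud_rate;
    return (clock_freq + denominator / 2U) / denominator;
}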

// Initialize UART with specified baud rate and buffer
void uart_init(UART_Registers* uart_ptr, uint32_t clock_freq, 
               uint32_t baud_rate, uint8_t* buffer_data, uint16_t buffer_size) {
    uart = uart_ptr;
    
    // Calculate and set baud rate divisor
    uint32_t divisor = uart_calculate_baud_divisor(clock_freq, baud_rate);
    uart->baud_rate = divisor;
    
    // Initialize circular buffer
    buffer_init(&rx_buffer, buffer_data, buffer_size);
    
    // Enable UART, TX, RX, and RX interrupt
    uart->control = UART_ENABLE | UART_TX_ENABLE | UART_RX_ENABLE | UART_RX_INT_ENABLE;
    uart->int_enable = UART_RX_INT_ENABLE;
}

// UART receive interrupt handler
void uart_rx_isr(void) {
    if (uart->status & UART_RX_READY) {
        uint8_t data = (uint8_t)uart->data;
        // If the buffer is full the byte is dropped; the result is ignored on purpose
        (void)buffer_write(&rx_buffer, data);
    }
}

// Read byte from UART receive buffer
bool uart_read_byte(uint8_t* byte) {
    return buffer_read(&rx_buffer, byte);
}

// Write byte to UART (polling mode)
void uart_write_byte(uint8_t byte) {
    while (!(uart->status & UART_TX_READY)) {
        // Wait until TX ready
    }
    uart->data = byte;
}

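
The listing above models the baud rate register as a plain integer divisor. On STM32 USARTs with 16x oversampling, BRR instead holds USARTDIV as a 12-bit mantissa and a 4-bit fraction, which works out to writing the peripheral clock divided by the baud rate. A small sketch of that calculation, with hypothetical function names and an assumed 36 MHz APB1 clock for USART2:

```c
#include <stdint.h>

// BRR value for 16x oversampling: USARTDIV * 16, rounded to nearest.
// Bits 15:4 are the mantissa and bits 3:0 the fraction of USARTDIV.
uint32_t usart_brr(uint32_t pclk_hz, uint32_t baud) {
    return (pclk_hz + baud / 2U) / baud;
}

// Baud rate that a given BRR value actually produces (truncated to an integer)
uint32_t usart_actual_baud(uint32_t pclk_hz, uint32_t brr) {
    return pclk_hz / brr;
}
```

For a 36 MHz clock and 115200 baud this gives BRR = 0x139, that is a mantissa of 19 and a fraction of 9/16 (USARTDIV = 19.5625), which keeps the baud rate error below 0.2 percent.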

Develop a sensor interface project that uses I2C or SPI to communicate with an external device. Read temperature, pressure, or accelerometer data and send it out over UART or show it on an LCD. This project involves multiple peripherals, timing requirements, protocol understanding, and error handling. Driving the bus peripheral directly through its registers, rather than through library functions, reinforces your understanding of bus timing and data formatting.
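
Two recurring pieces of such a driver are a bounded wait on a status flag, so that a stuck bus does not hang the firmware, and the conversion of raw bytes into engineering units. The sketch below illustrates both. The data format is hypothetical (12-bit two's complement, left-justified across two bytes, 0.0625 degrees Celsius per count); take the real format from your sensor's datasheet. The function names are likewise illustrative.

```c
#include <stdint.h>
#include <stdbool.h>

// Poll a status register until all bits in 'mask' are set.
// Gives up after 'max_polls' reads and returns false, so the caller can recover.
bool wait_for_flag(const volatile uint32_t *status, uint32_t mask,
                   uint32_t max_polls) {
    while (max_polls-- > 0U) {
        if ((*status & mask) == mask) {
            return true;
        }
    }
    return false;
}

// Convert two raw bytes to thousandths of a degree Celsius.
// Assumed format: 12-bit two's complement in the upper bits, low 4 bits unused.
int32_t raw_to_millicelsius(uint8_t msb, uint8_t lsb) {
    int16_t raw = (int16_t)(uint16_t)(((uint16_t)msb << 8) | lsb);
    int32_t counts = raw / 16;       // drop the unused low bits, keep the sign
    return (counts * 625) / 10;      // 0.0625 degC is 62.5 millidegrees per count
}
```

Integer arithmetic is used throughout because many Cortex-M parts, including the M0 and M3, have no hardware floating-point unit.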

Finally, create a closed-loop control system using PWM and input capture. Control the speed of a motor with a PWM output, using feedback from an encoder or other sensor. This combines timer configuration, interrupt handling, real-time processing, and feedback algorithms. The project demonstrates practical embedded systems design that mirrors industrial applications. Adding safety features such as watchdog timers and fault detection prepares you for production firmware development.
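
The feedback algorithm could take many forms; one common choice is a proportional-integral (PI) controller. The following is a minimal fixed-point sketch under stated assumptions: gains are in Q8 format, the output is clamped to the timer's compare range, and the integral is frozen while the output is saturated. PiController and pi_update are hypothetical names, and the gains must be tuned for the actual motor.

```c
#include <stdint.h>

typedef struct {
    int32_t kp_q8;     // proportional gain in Q8 fixed point (256 == 1.0)
    int32_t ki_q8;     // integral gain per control step in Q8
    int32_t integral;  // accumulated error
    int32_t out_max;   // largest compare value, e.g. the timer auto-reload value
} PiController;

// One control step: returns the PWM compare value clamped to 0..out_max.
// 'setpoint' and 'measured' share a unit, e.g. encoder counts per period.
uint32_t pi_update(PiController *pi, int32_t setpoint, int32_t measured) {
    int32_t error = setpoint - measured;
    int32_t integral = pi->integral + error;
    int32_t out = (pi->kp_q8 * error + pi->ki_q8 * integral) / 256;

    if (out > pi->out_max) {
        return (uint32_t)pi->out_max;  // saturated high: integral not updated
    }
    if (out < 0) {
        return 0U;                     // saturated low: integral not updated
    }
    pi->integral = integral;           // commit only when unsaturated (anti-windup)
    return (uint32_t)out;
}
```

On the target, this function would typically run at a fixed rate from a timer interrupt, with the measured speed derived from input capture and the result written to the PWM channel's capture/compare register.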

