Notes on debugging Rust microcontroller stack usage

A few days ago I was doing some refactoring of my galileo-osnma project. This is a Rust library that implements the Galileo OSNMA (open service navigation message authentication) system. The library includes a demo that runs on a Longan nano GD32VF103 RISC-V microcontroller board. The purpose of this demo is to show that this library can run on small microcontrollers. My refactoring was in principle a simple thing: I was mainly organizing the repository as a Cargo workspace, and unifying the library and some supporting tools into the same crate. However, after the refactor, users reported that the Longan nano software was broken. It would hang after processing some messages. This post is a collection of notes about how I investigated the issue, which turned out to be related to stack usage.

First I checked that I was able to reproduce the issue, either by feeding data from the Galmon public feed (which can be fetched using nc 86.82.68.237 10000) or by feeding data from my own GNSS receiver with Galmon’s ubxtool. Indeed the software was broken. At some point it stopped sending the READY message over the UART, which is how this demo implements flow control. I made a recording of a few minutes of Galmon public feed data in order to test repeatably, and I could confirm that the software from before the refactor obtained successful authentication, while the software after the refactor hung before obtaining authentication and before reaching the end of the file. This left me wondering which of my changes could have caused the breakage.

I realized that one of the things that I changed was the compiler options. Before the refactor, the Longan nano was in a standalone crate with the following configuration.

[profile.release]
opt-level = "z"
lto = true

After the refactor, because profiles need to be defined at workspace level, I had the following:

[profile.release]
codegen-units = 1
lto = true
panic = "abort"

[profile.embedded]
inherits = "release"
opt-level = "z" # Optimize for size.

To build the Longan nano software I used the embedded profile. I noticed that this caused the software to be built with codegen-units = 1 instead of the default. I played with other values of codegen-units in the embedded profile and found that codegen-units = 1 caused the software to crash eventually, while other values of codegen-units worked (later we will see that with codegen-units = 2 the software still crashed after some hours, but with codegen-units = 16 it seemed to run well forever).
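As a stopgap, the working value can be pinned in the workspace profile. This is a sketch of the embedded profile with the codegen-units override added (mirroring the profile shown above):

```toml
[profile.embedded]
inherits = "release"
opt-level = "z"     # Optimize for size.
codegen-units = 16  # Workaround: codegen-units = 1 makes the demo crash.
```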

At this point I was puzzled: what could be making codegen-units = 1 break the software? The value of codegen-units defines how many independent code generation units are used in compilation. Lower values of codegen-units cause longer build times but can enable additional opportunities for compiler optimizations. The behaviour of the resulting code shouldn’t change, however.

One of the things I suspected was that perhaps the software was running out of stack space. The GD32VF103 only has 32 KiB of SRAM. This is an important limitation which I have needed to take into account in the past. The full data that the galileo-osnma library would need to handle to account for all the satellites in the Galileo constellation is around 37 KiB. Because this doesn’t fit in SRAM, the Longan nano software uses a reduced-memory configuration that only considers some data for up to 12 satellites in view, which takes around 9 KiB. However, why would using codegen-units = 1 increase the stack usage? The other suspicion I had was that perhaps a compiler bug was triggered by codegen-units = 1.

Initially I didn’t plan to spend more time debugging this, since using codegen-units = 16 was a good workaround. However, the curiosity to understand what was happening won. First I tried to check whether the software built with codegen-units = 1 was panicking. For that, I made a custom panic handler that turned on the red LED on the board. The LED didn’t turn on when the software crashed. Additionally, by adding a couple more print statements on the UART, I found that the crash happened during a call to the Osnma::feed_osnma function, which is responsible for processing a single piece of OSNMA data taken from an INAV page. This function does a lot of work, so the crash could be anywhere in the library.
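A panic handler along those lines takes only a few lines of code. The sketch below is illustrative rather than the exact code I used: turn_on_red_led is a hypothetical helper standing in for the longan-nano HAL GPIO calls that drive the LED pin.

```rust
use core::panic::PanicInfo;

// Hypothetical helper: in the real code this would use the longan-nano HAL
// to drive the GPIO pin connected to the red LED.
fn turn_on_red_led() {}

// In a no_std binary, this function gets called on any panic.
#[panic_handler]
fn on_panic(_info: &PanicInfo) -> ! {
    turn_on_red_led();
    // Spin forever so that the LED stays on and the crash is visible.
    loop {}
}
```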

Intrigued by why a Rust program could be crashing without panicking, I decided to debug using JTAG. I had never used JTAG with this Longan nano board before, but I found that an Altera USB Blaster clone that I had for the Hermes-Lite 2.0 worked fine and allowed me to debug with gdb over JTAG.

With gdb, the first thing I could see is that when the software crashed, it was stuck in an abort function, which is just an endless loop.

08000240 <abort>:
8000240: 0000006f j 0x8000240 <abort>

In the backtrace I could see that this had been called by a _start_trap function.

080001c0 <_start_trap>:
80001c0: 7139 addi sp, sp, -0x40
80001c2: c006 sw ra, 0x0(sp)
80001c4: c216 sw t0, 0x4(sp)
80001c6: c41a sw t1, 0x8(sp)
80001c8: c61e sw t2, 0xc(sp)
80001ca: c872 sw t3, 0x10(sp)
80001cc: ca76 sw t4, 0x14(sp)
80001ce: cc7a sw t5, 0x18(sp)
80001d0: ce7e sw t6, 0x1c(sp)
80001d2: d02a sw a0, 0x20(sp)
80001d4: d22e sw a1, 0x24(sp)
80001d6: d432 sw a2, 0x28(sp)
80001d8: d636 sw a3, 0x2c(sp)
80001da: d83a sw a4, 0x30(sp)
80001dc: da3e sw a5, 0x34(sp)
80001de: dc42 sw a6, 0x38(sp)
80001e0: de46 sw a7, 0x3c(sp)
80001e2: 00010533 add a0, sp, zero
80001e6: fa7ff0ef jal 0x800018c <_start_trap_rust>
80001ea: 4082 lw ra, 0x0(sp)
80001ec: 4292 lw t0, 0x4(sp)
80001ee: 4322 lw t1, 0x8(sp)
80001f0: 43b2 lw t2, 0xc(sp)
80001f2: 4e42 lw t3, 0x10(sp)
80001f4: 4ed2 lw t4, 0x14(sp)
80001f6: 4f62 lw t5, 0x18(sp)
80001f8: 4ff2 lw t6, 0x1c(sp)
80001fa: 5502 lw a0, 0x20(sp)
80001fc: 5592 lw a1, 0x24(sp)
80001fe: 5622 lw a2, 0x28(sp)
8000200: 56b2 lw a3, 0x2c(sp)
8000202: 5742 lw a4, 0x30(sp)
8000204: 57d2 lw a5, 0x34(sp)
8000206: 5862 lw a6, 0x38(sp)
8000208: 58f2 lw a7, 0x3c(sp)
800020a: 6121 addi sp, sp, 0x40
800020c: 30200073 mret

This function is the trap handler that gets run when the RISC-V hart (hardware thread, which is the formal RISC-V term for a CPU core) traps. I set a hardware breakpoint at the beginning of _start_trap with hbreak *0x80001c0 and let the software crash again. This is the state of the registers at the breakpoint.

(gdb) info r
ra 0xffffffff 0xffffffff
sp 0x1ffffce0 0x1ffffce0
gp 0x20000800 0x20000800
tp 0x0 0x0
t0 0x800c000 134266880
t1 0xffffffff -1
t2 0xffffffff -1
fp 0xffffffff 0xffffffff
s1 0xffffffff -1
a0 0x0 0
a1 0xffffffff -1
a2 0xffffffff -1
a3 0xffffffff -1
a4 0x0 0
a5 0x0 0
a6 0x0 0
a7 0x0 0
s2 0xffffffff -1
s3 0xffffffff -1
s4 0xffffffff -1
s5 0xffffffff -1
s6 0xffffffff -1
s7 0xffffffff -1
s8 0xffffffff -1
s9 0xffffffff -1
s10 0xffffffff -1
s11 0xffffffff -1
t3 0x0 0
t4 0xffffffff -1
t5 0x0 0
t6 0xffffffff -1
pc 0x80001c0 0x80001c0 <_start_trap>

There is more relevant information in some of the machine CSRs. The mstatus register contains information about the hart’s state. In this case there is nothing of interest there.

(gdb) info r $mstatus
mstatus 0x1800 SD:0 VM:00 MXR:0 PUM:0 MPRV:0 XS:0
FS:0 MPP:3 HPP:0 SPP:0 MPIE:0 HPIE:0 SPIE:0 UPIE:0 MIE:0
HIE:0 SIE:0 UIE:0

The mtvec register contains the trap vector base address. It indicates the address of the trap handler.

(gdb) info r $mtvec
mtvec 0x80001c3 134218179

Trap handling can be done in direct mode, in which there is a single handler for all traps, or in vectored mode, in which the handler address is computed in terms of the trap cause. The two LSBs of mtvec are supposed to be 2'b00 for direct mode and 2'b01 for vectored mode, with the other two possible values being reserved. Here the value is 2'b11, which is reserved. I think that this is a silicon bug in the GD32VF103, which probably has these two bits hardcoded to 1, even though I believe that trap handling is working in direct mode. The 30 MSBs of the register contain the 30 MSBs of the trap handler base address (which in direct mode is just the trap handler address). The base address is supposed to be aligned to 4 bytes. We see that the trap handler address 0x80001c0 matches the _start_trap function.
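The decoding of mtvec can be double-checked with a quick host-side calculation (plain Rust, unrelated to the firmware):

```rust
fn main() {
    let mtvec: u32 = 0x80001c3; // value read with gdb
    let base = mtvec & !0b11;   // 30 MSBs: trap handler base address
    let mode = mtvec & 0b11;    // 2 LSBs: 0b00 direct, 0b01 vectored
    assert_eq!(base, 0x80001c0); // matches _start_trap
    assert_eq!(mode, 0b11);      // a reserved value, hence the suspected silicon bug
}
```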

The mcause register contains a code indicating the event that caused the trap. The MSB is set if the trap was caused by an interrupt. The remaining bits indicate the exception code that caused the trap.

(gdb) info r $mcause
mcause 0x30000001 805306369

Here we also have what looks like a silicon bug, because the leading nibble is 0x3. According to the RISC-V specification, for interrupt = 0, all the cause codes greater than or equal to 64 are reserved. I believe that the correct value of mcause in this case would be 0x00000001, which means instruction access fault.
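The same kind of host-side check splits mcause into its interrupt bit and exception code:

```rust
fn main() {
    let mcause: u32 = 0x30000001; // value read with gdb
    let interrupt = mcause >> 31;    // MSB: set if the trap is an interrupt
    let code = mcause & 0x7fff_ffff; // remaining bits: exception code
    assert_eq!(interrupt, 0);        // an exception, not an interrupt
    assert!(code >= 64);             // reserved range, hence the suspected bug
}
```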

The mepc register contains the program counter value at the moment when the trap happened. Its LSB is always zero, because instructions in RISC-V (with the compressed instructions extension) are aligned to 2 bytes.

(gdb) info r $mepc
mepc 0xfffffffe -2

Here we see that the mepc register contains the address 0xfffffffe. This address is outside the memory map of the GD32VF103 (see Section 2.4 in the datasheet), so it makes sense that an instruction access fault was generated and we ended up in the trap handler.

What has happened here? The registers contain some clues. First of all, there are many registers with the value 0xffffffff, which looks suspicious. The stack pointer is 0x1ffffce0. This is pointing outside the SRAM, which is mapped to 0x20000000 - 0x20007FFF. In fact we can see how this executable is supposed to be using the SRAM by inspecting some symbols in the ELF file:

$ rust-objdump -x \
target/riscv32imac-unknown-none-elf/embedded/\
osnma-longan-nano | grep -e "__[se]\(bss\|stack\)"
20000000 g .bss 00000000 __sbss
20000020 g .bss 00000000 __ebss
20000020 g .stack 00000000 __estack
20008000 g .stack 00000000 __sstack

We see that this program only needs 32 bytes of BSS, so the remaining SRAM space is allocated to the stack, which grows downwards starting at the end of the SRAM.

Finally, the register t0 contains another clue. Its value is 0x0800c000, which is an address into the flash where the program is contained (the flash is mapped to 0x08000000 - 0x08020000). If we go to that part of the code we find that it corresponds to the ret instruction in a function that performs arithmetic for the p256 elliptic curve cryptography.

0800bda6 <p256::arithmetic::field::field_impl::sub_inner::hd30cc84231e2f12e>:
[...]
800bff8: 00002297 auipc t0, 0x2
800bffc: 3e6282e7 jalr t0, 0x3e6(t0) <OUTLINED_FUNCTION_11>
800c000: 8082 ret

Before this function returns, an outlined function is run. This is the code for that function.

0800e3de <OUTLINED_FUNCTION_11>:
800e3de: 40f6 lw ra, 0x5c(sp)
800e3e0: 4466 lw s0, 0x58(sp)
800e3e2: 44d6 lw s1, 0x54(sp)
800e3e4: 4946 lw s2, 0x50(sp)
800e3e6: 49b6 lw s3, 0x4c(sp)
800e3e8: 4a26 lw s4, 0x48(sp)
800e3ea: 4a96 lw s5, 0x44(sp)
800e3ec: 4b06 lw s6, 0x40(sp)
800e3ee: 5bf2 lw s7, 0x3c(sp)
800e3f0: 5c62 lw s8, 0x38(sp)
800e3f2: 5cd2 lw s9, 0x34(sp)
800e3f4: 5d42 lw s10, 0x30(sp)
800e3f6: 5db2 lw s11, 0x2c(sp)
800e3f8: 6125 addi sp, sp, 0x60
800e3fa: 8282 jr t0

The goal of this outlined function is to reduce the code size. It contains a very common routine that restores the return address register and all the saved registers. Many functions will need to perform these operations before returning, so by putting them in an outlined function, code repetition is reduced.

The way that the main function jumps into the outlined function is interesting. Because the outlined function is going to load the return address register ra with the return address needed by the main function, another mechanism is needed to return from the outlined function. This is implemented by the jalr t0, 0x3e6(t0) instruction, which jumps to t0 + 0x3e6 and stores the program counter value for the next instruction (which is 0x800c000, corresponding to the ret instruction) into the t0 register. In this way, the outlined function can perform its work, taking care not to clobber t0, and then use jr t0 to return to the main function.
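The constants involved can be replayed on the host to confirm this mechanism (addresses taken from the disassembly above):

```rust
fn main() {
    let auipc_pc: u32 = 0x0800_bff8; // address of `auipc t0, 0x2`
    let t0 = auipc_pc + (0x2 << 12); // auipc: t0 = pc + (imm << 12)
    let jalr_pc = auipc_pc + 4;      // address of `jalr t0, 0x3e6(t0)`
    let target = t0 + 0x3e6;         // jump target of the jalr
    let link = jalr_pc + 4;          // value that jalr writes back into t0
    assert_eq!(target, 0x0800_e3de); // OUTLINED_FUNCTION_11
    assert_eq!(link, 0x0800_c000);   // the `ret` instruction
}
```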

The value 0x800c000 that we have found in t0 at the trap handler breakpoint is the telltale sign of this mechanism. Now we understand that the first instruction of the outlined function has loaded 0xffffffff into ra (and also all the saved registers). Therefore, the ret instruction of the main function is trying to jump to 0xffffffff, which is an illegal instruction address because it is not aligned and because it is outside of the CPU address map. This is why the trap happens. The RISC-V specification defines a trap cause for instruction address misaligned, so I think this, rather than instruction access fault, should have been the cause of the trap in mcause.

We have noticed that the stack pointer contains the value 0x1ffffce0 at the beginning of the trap handler. Taking into account that the outlined function has performed addi sp, sp, 0x60, this means that the stack pointer was 0x1ffffc80 at the beginning of the outlined function. This is 896 bytes below the start of the SRAM, so we see that the program has run out of stack space and it is doomed to crash one way or another.
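A quick check of the stack-pointer arithmetic:

```rust
fn main() {
    let sp_at_trap: u32 = 0x1fff_fce0;   // sp at the trap handler breakpoint
    let sp_outlined = sp_at_trap - 0x60; // undo the `addi sp, sp, 0x60`
    let sram_start: u32 = 0x2000_0000;
    // How far below the SRAM the stack had grown.
    assert_eq!(sram_start - sp_outlined, 896);
}
```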

All the loads in the outlined function, as well as some other loads in the main function, are targeting the area immediately below the SRAM. In the address map, the area 0x1FFFF810 - 0x1FFFFFFF is shown as code (reserved). For some reason, loads from this area are returning 0xffffffff. That is the reason why most of the registers have this value at the beginning of the trap handler. I don’t know all the details of the RISC-V specification, but I think that it would be better that these loads generate a load access fault trap instead of returning a hardcoded all-ones constant.

Now that we have understood all the details about how the software crashes, the next question is why the stack usage differs depending on the codegen-units value, and what we can do about it. The first thing I checked was putting a breakpoint in the main function with hbreak main (note that this does not place the breakpoint at the first instruction of main, but rather at the first instruction after the preamble of main, where the stack pointer has already been decremented to reserve stack for main) and printing the value of the stack pointer. I got the following:

  • With codegen-units = 16, the stack pointer is 0x200032c0, which means that there are around 12.65 KiB of SRAM free.
  • With codegen-units = 1, the stack pointer is 0x20001160, which means that there are around 4.31 KiB of SRAM free.

It makes sense that these 4.3 KiB are not enough to run the relatively complex calculations required by the elliptic curve cryptography, and so we run out of stack space when codegen-units = 1. The question now is why we have a difference in the stack usage at the main function of 8.34 KiB depending on how we compile the program.
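These free-space figures follow from the stack pointer and the end of .bss (__ebss = 0x20000020, from the symbol listing above):

```rust
fn main() {
    let ebss: u32 = 0x2000_0020;        // the stack may grow down to here
    let free_cg16 = 0x2000_32c0 - ebss; // sp at main with codegen-units = 16
    let free_cg1 = 0x2000_1160 - ebss;  // sp at main with codegen-units = 1
    assert_eq!(free_cg16, 12960);       // ~12.65 KiB
    assert_eq!(free_cg1, 4416);         // ~4.31 KiB
    assert_eq!(free_cg16 - free_cg1, 8544); // ~8.34 KiB difference
}
```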

First I read the code that the program executes before the main function is called, to understand the stack usage up to this point. I saw that the stack pointer was first initialized to 0x20008000 and only 16 bytes of stack were reserved later on, so at the start of the main function the stack size is only 16 bytes. I verified this with gdb. Therefore, in order to understand the stack usage of main, we only need to look at how the stack pointer is decremented in the preamble of main.

With codegen-units = 1, this is the beginning of the main function.

0800354e <main>:
800354e: 0000b297 auipc t0, 0xb
8003552: 120282e7 jalr t0, 0x120(t0) <OUTLINED_FUNCTION_35>
8003556: 651d lui a0, 0x7
8003558: d9050513 addi a0, a0, -0x270
800355c: 40a10133 sub sp, sp, a0
[...]

The outlined function that is being called is basically the opposite of the previous outlined function we saw. It saves the return address register and all the saved registers to the stack. Interestingly, it decrements the stack pointer by 256 bytes, which is more than what is necessary to store these 13 registers. I don’t know why this is done like this, but clearly it is then taken into account when reserving more stack in the main function. Perhaps the idea here is that functions that need slightly less than 256 bytes of stack can simply use this reservation and not touch the stack pointer in the main function.

0800e66e <OUTLINED_FUNCTION_35>:
800e66e: 7111 addi sp, sp, -0x100
800e670: df86 sw ra, 0xfc(sp)
800e672: dda2 sw s0, 0xf8(sp)
800e674: dba6 sw s1, 0xf4(sp)
800e676: d9ca sw s2, 0xf0(sp)
800e678: d7ce sw s3, 0xec(sp)
800e67a: d5d2 sw s4, 0xe8(sp)
800e67c: d3d6 sw s5, 0xe4(sp)
800e67e: d1da sw s6, 0xe0(sp)
800e680: cfde sw s7, 0xdc(sp)
800e682: cde2 sw s8, 0xd8(sp)
800e684: cbe6 sw s9, 0xd4(sp)
800e686: c9ea sw s10, 0xd0(sp)
800e688: c7ee sw s11, 0xcc(sp)
800e68a: 8282 jr t0

In any case, in the main function we see that the value (0x7 << 12) - 0x270 is loaded into a0 and then a0 is subtracted from the stack pointer. This means that overall the main function is decrementing the stack pointer by (0x7 << 12) - 0x270 + 0x100 = 28304 bytes.

In comparison, with codegen-units = 16, the preamble of the main function looks like this.

08009736 <main>:
8009736: 00006297 auipc t0, 0x6
800973a: bf4282e7 jalr t0, -0x40c(t0) <OUTLINED_FUNCTION_46>
800973e: 6515 lui a0, 0x5
8009740: c3050513 addi a0, a0, -0x3d0
8009744: 40a10133 sub sp, sp, a0
[...]

The outlined function is identical to the one above, except that it has a different name and it sits at a different address. Therefore, in this case the main function is decrementing the stack pointer by (0x5 << 12) - 0x3d0 + 0x100 = 19760 bytes. As we already knew, there is a difference of 8544 bytes in the stack usage of main depending on how we compile.
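The two frame sizes can be recomputed from the immediates in the disassembly, and their difference matches the 8.34 KiB gap measured earlier with gdb:

```rust
fn main() {
    // Total sp decrement: 0x100 from the outlined prologue plus the sub in main.
    let frame_cg1: u32 = (0x7 << 12) - 0x270 + 0x100;
    let frame_cg16: u32 = (0x5 << 12) - 0x3d0 + 0x100;
    assert_eq!(frame_cg1, 28304);
    assert_eq!(frame_cg16, 19760);
    assert_eq!(frame_cg1 - frame_cg16, 8544); // 8544 bytes ≈ 8.34 KiB
}
```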

To investigate this difference, I looked at the LLVM IR for the program. This can be obtained by building with

cargo rustc -p osnma-longan-nano \
--target riscv32imac-unknown-none-elf \
--profile embedded -- --emit=llvm-ir

The LLVM IR is very verbose, so here I will only put the relevant details. With codegen-units = 1, the main function contains the following large stack allocations.

define dso_local void @main() unnamed_addr #14 !dbg !23641 {
[…]
%osnma.i = alloca [8536 x i8], align 8
[…]
%interface = alloca [8792 x i8], align 8

With codegen-units = 16, the main function only contains the %interface allocation. The %osnma.i allocation is missing.

define dso_local void @main() unnamed_addr #43 !dbg !60675 {
[…]
%interface = alloca [8792 x i8], align 8

The presence of %osnma.i in the codegen-units = 1 LLVM IR is what causes most of the stack usage difference (there are another 8 bytes that aren’t worth investigating). The relevant part of the Rust code to understand these allocations is the following.

struct Board {
tx: serial::Tx<USART0>,
rx: serial::Rx<USART0>,
rx_buffer: [u8; 256],
}

struct OsnmaInterface {
osnma: Osnma<SmallStorage>,
board: Board,
}

impl OsnmaInterface {
fn new(board: Board) -> OsnmaInterface {
let pubkey = VerifyingKey::from_sec1_bytes(&OSNMA_PUBKEY).unwrap();
let pubkey = PublicKey::from_p256(pubkey, OSNMA_PUBKEY_ID).force_valid();
let osnma =
Osnma::<SmallStorage>::from_merkle_tree(OSNMA_MERKLE_TREE_ROOT, Some(pubkey), false);
OsnmaInterface { osnma, board }
}
[...]
}

#[entry]
fn main() -> ! {
let board = Board::take();
let mut interface = OsnmaInterface::new(board);

loop {
interface.spin();
}
}

Basically, the program contains an OsnmaInterface struct that has a 256-byte buffer for reading UART data (the serial::Tx and serial::Rx are zero-sized types) and an Osnma object that contains all the data required by the galileo-osnma library. The Board object is constructed by initializing the UART and zero-initializing the rx_buffer, and then the Osnma object is constructed with one of the constructors offered by the library. Both objects are put together in an OsnmaInterface object.

Because osnma is moved into interface, we would expect that it gets constructed directly into the allocation for interface, instead of being constructed somewhere else and then copied over. This is indeed what happens with codegen-units = 16, but not with codegen-units = 1. The thing we need to understand here is that Rust’s move system relies heavily on LLVM’s ability to optimize out unnecessary temporary allocations and copies. I will illustrate this with a simple example, which you can also see in Godbolt’s Compiler Explorer.

Consider the following code:

#![no_std]

struct A {
_data: [u8; 64]
}

struct B {
_data: [u8; 32]
}

pub struct Both {
_a: A,
_b: B,
}

#[unsafe(no_mangle)]
pub fn construct() -> Both {
Both {
_a: A { _data: [0; 64] },
_b: B { _data: [0xff; 32] },
}
}

When construct is called, the allocation for the Both that is returned has already been reserved by the caller, so the only thing that construct should do is two memsets to initialize the arrays in this allocation to their required values. This is indeed what happens when we build with -C opt-level=z -C codegen-units=1.

construct:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
mv s0, a0
li a2, 64
li a1, 0
call memset
addi a0, s0, 64
li a1, 255
li a2, 32
call memset
lw ra, 12(sp)
lw s0, 8(sp)
addi sp, sp, 16
ret

The LLVM IR is basically two calls to memset as we would expect.

define dso_local void @construct(ptr dead_on_unwind noalias noundef writable writeonly sret([96 x i8]) align 1 captures(none) dereferenceable(96) initializes((0, 96)) %_0) unnamed_addr {
start:
tail call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(64) %_0, i8 0, i32 64, i1 false)
%0 = getelementptr inbounds nuw i8, ptr %_0, i32 64
tail call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(32) %0, i8 -1, i32 32, i1 false)
ret void
}

declare void @llvm.memset.p0.i32(ptr writeonly captures(none), i8, i32, i1 immarg) #1

However the Rust MIR looks quite different. This is the relevant part corresponding to the construct function.

fn construct() -> Both {
let mut _0: Both;
let mut _1: A;
let mut _2: [u8; 64];
let mut _3: B;
let mut _4: [u8; 32];

bb0: {
StorageLive(_1);
StorageLive(_2);
_2 = [const 0_u8; 64];
_1 = A { _data: move _2 };
StorageDead(_2);
StorageLive(_3);
StorageLive(_4);
_4 = [const u8::MAX; 32];
_3 = B { _data: move _4 };
StorageDead(_4);
_0 = Both { _a: move _1, _b: move _3 };
StorageDead(_3);
StorageDead(_1);
return;
}
}

We see that there are temporaries for everything. Even A and B are constructed by first putting the array into a temporary and then moving the array into the struct.

If we look at the initial LLVM IR before any optimization passes are done, we see that it closely mirrors the MIR. We have memset() to initialize the arrays to their corresponding values, and memcpy() to move things. There are 192 bytes of temporaries allocated on the stack just for a function that is only supposed to initialize caller-allocated memory.

define dso_local void @construct(ptr dead_on_unwind noalias noundef writable sret([96 x i8]) align 1 captures(address) dereferenceable(96) %0) unnamed_addr {
%2 = alloca [32 x i8], align 1
%3 = alloca [32 x i8], align 1
%4 = alloca [64 x i8], align 1
%5 = alloca [64 x i8], align 1
call void @llvm.lifetime.start.p0(i64 64, ptr %5)
call void @llvm.lifetime.start.p0(i64 64, ptr %4)
call void @llvm.memset.p0.i32(ptr align 1 %4, i8 0, i32 64, i1 false)
call void @llvm.memcpy.p0.p0.i32(ptr align 1 %5, ptr align 1 %4, i32 64, i1 false)
call void @llvm.lifetime.end.p0(i64 64, ptr %4)
call void @llvm.lifetime.start.p0(i64 32, ptr %3)
call void @llvm.lifetime.start.p0(i64 32, ptr %2)
call void @llvm.memset.p0.i32(ptr align 1 %2, i8 -1, i32 32, i1 false)
call void @llvm.memcpy.p0.p0.i32(ptr align 1 %3, ptr align 1 %2, i32 32, i1 false)
call void @llvm.lifetime.end.p0(i64 32, ptr %2)
call void @llvm.memcpy.p0.p0.i32(ptr align 1 %0, ptr align 1 %5, i32 64, i1 false)
%6 = getelementptr inbounds i8, ptr %0, i32 64
call void @llvm.memcpy.p0.p0.i32(ptr align 1 %6, ptr align 1 %3, i32 32, i1 false)
call void @llvm.lifetime.end.p0(i64 32, ptr %3)
call void @llvm.lifetime.end.p0(i64 64, ptr %5)
ret void
}

Things look quite similar for a few optimization passes, until we reach a MemCpyOptPass that realizes that a memset followed by a single memcpy or a chain of multiple memcpy‘s can be replaced by a direct memset to the destination of the last memcpy in the chain. This optimization gives the following. Note that we now have two memset‘s and no memcpy‘s, but the temporary allocations haven’t been optimized out yet.

define dso_local void @construct(ptr dead_on_unwind noalias noundef writable writeonly sret([96 x i8]) align 1 captures(none) dereferenceable(96) %0) unnamed_addr {
%2 = alloca [32 x i8], align 1
%3 = alloca [32 x i8], align 1
%4 = alloca [64 x i8], align 1
%5 = alloca [64 x i8], align 1
call void @llvm.lifetime.start.p0(i64 64, ptr nonnull %5)
call void @llvm.lifetime.start.p0(i64 64, ptr nonnull %4)
call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(64) %0, i8 0, i32 64, i1 false)
call void @llvm.lifetime.end.p0(i64 64, ptr nonnull %4)
call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %3)
call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %2)
%6 = getelementptr inbounds nuw i8, ptr %0, i32 64
call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(32) %6, i8 -1, i32 32, i1 false)
call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %2)
call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %3)
call void @llvm.lifetime.end.p0(i64 64, ptr nonnull %5)
ret void
}

The next InstCombinePass realizes that the temporary allocations are unused and removes them. The IR now looks very similar to the final IR.

define dso_local void @construct(ptr dead_on_unwind noalias noundef writable writeonly sret([96 x i8]) align 1 captures(none) dereferenceable(96) %0) unnamed_addr {
call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(64) %0, i8 0, i32 64, i1 false)
%2 = getelementptr inbounds nuw i8, ptr %0, i32 64
call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(32) %2, i8 -1, i32 32, i1 false)
ret void
}

In this very simple example everything works as expected and we get the assembly code that we wanted. In much more complex cases, LLVM optimization passes might not be able to optimize out all the moves emitted by the Rust compiler. This is what is happening in the codegen-units = 1 case with the Longan nano software. Since the software is much more complex than this simple example, I haven’t investigated what is preventing LLVM from optimizing out the %osnma.i temporary allocation.

The final question is: what can we do to improve this? One issue is missed opportunities for move optimizations. This could be fixed by future improvements in the LLVM optimizer. However, the main issue here is that the Rust compiler is inlining a lot of initialization code into the main function. Besides a few other variables which are small, the data that the program needs on the stack to run is the interface object, which takes 8792 bytes. However, we have seen that even in the codegen-units = 16 case, the main function needs 19760 bytes of stack, which is a lot.

In the LLVM IR we can see that besides the interface allocation there are a few other large allocations with sizes around 1 to 3 KiB. These have less obvious names, some of which include sroa, which stands for “scalar replacement of aggregates”, an LLVM optimization. My understanding of this situation is that we are getting allocations for temporaries that are needed in the initialization of the Osnma object, such as for instance temporaries used to load the ECDSA public key.

In an ideal world, we could build Osnma using compile-time const evaluation, since this Osnma instance only depends on const‘s containing the ECDSA public key and Merkle tree root that are generated by the build.rs script. However, const evaluation in Rust is somewhat limited (for good reasons) and none of the elliptic curve cryptography functions that are used here are const. Besides, compile-time initialization wouldn’t be realistic anyway. More realistic software would read this cryptographic material from somewhere in flash, to allow the material to be updated without updating the software. Such software would still need to run all of this initialization at runtime.

In any case, there is a simple way to improve this program. If initialization takes up a lot of stack space, then it shouldn’t be inlined into the main function. In this way, that stack space can be freed at the end of the initialization, recovering stack space to be used by the program loop. This is what I’ve done in this software. It now looks like this.

#[inline(never)]
fn new_interface() -> OsnmaInterface {
OsnmaInterface::new(Board::take())
}

#[entry]
fn main() -> ! {
let mut interface = new_interface();
loop {
interface.spin();
}
}

With this change, and building with codegen-units = 1, the new_interface function looks like this.

08002f92 <osnma_longan_nano::new_interface::h0b1ed4267d09a8c1>:
8002f92: 0000b297 auipc t0, 0xb
8002f96: 610282e7 jalr t0, 0x610(t0) <OUTLINED_FUNCTION_30>
8002f9a: 6595 lui a1, 0x5
8002f9c: af058593 addi a1, a1, -0x510
8002fa0: 40b10133 sub sp, sp, a1
[...]

The outlined function is saving the return address register and the first 8 saved registers.

0800e5a2 <OUTLINED_FUNCTION_30>:
800e5a2: 7111 addi sp, sp, -0x100
800e5a4: df86 sw ra, 0xfc(sp)
800e5a6: dda2 sw s0, 0xf8(sp)
800e5a8: dba6 sw s1, 0xf4(sp)
800e5aa: d9ca sw s2, 0xf0(sp)
800e5ac: d7ce sw s3, 0xec(sp)
800e5ae: d5d2 sw s4, 0xe8(sp)
800e5b0: d3d6 sw s5, 0xe4(sp)
800e5b2: d1da sw s6, 0xe0(sp)
800e5b4: cfde sw s7, 0xdc(sp)
800e5b6: 8282 jr t0

The stack usage of new_interface is (0x5 << 12) - 0x510 + 0x100 = 19440 bytes. So we see that as a bonus the move optimization for osnma is now working even with codegen-units = 1.

The main function starts like this.

08003bc2 <main>:
8003bc2: 0000b297 auipc t0, 0xb
8003bc6: 9e0282e7 jalr t0, -0x620(t0) <OUTLINED_FUNCTION_30>
8003bca: cde2 sw s8, 0xd8(sp)
8003bcc: cbe6 sw s9, 0xd4(sp)
8003bce: c9ea sw s10, 0xd0(sp)
8003bd0: c7ee sw s11, 0xcc(sp)
8003bd2: 6509 lui a0, 0x2
8003bd4: 24050513 addi a0, a0, 0x240
8003bd8: 40a10133 sub sp, sp, a0
8003bdc: 6509 lui a0, 0x2
8003bde: 2a050513 addi a0, a0, 0x2a0
8003be2: 00a10db3 add s11, sp, a0
8003be6: 40014437 lui s0, 0x40014
8003bea: 6909 lui s2, 0x2
8003bec: 04810993 addi s3, sp, 0x48
8003bf0: 12c90513 addi a0, s2, 0x12c
8003bf4: 7ff98493 addi s1, s3, 0x7ff
8003bf8: 954e add a0, a0, s3
8003bfa: cc2a sw a0, 0x18(sp)
8003bfc: 67d48c93 addi s9, s1, 0x67d
8003c00: 00a8 addi a0, sp, 0x48
8003c02: fffff097 auipc ra, 0xfffff
8003c06: 390080e7 jalr 0x390(ra) <osnma_longan_nano::new_interface::h0b1ed4267d09a8c1>
[...]

It calls the same outlined function. The stack space that it needs is (0x2 << 12) + 0x240 + 0x100 = 9024 bytes. Taking into account that interface is using 8792 bytes, that leaves only 232 bytes used by other variables, which is excellent. With this change, the main function has 23.1 KiB of free stack space, so we don’t risk running out of stack during the program loop.
