Notes on debugging Rust microcontroller stack usage

A few days ago I was doing some refactoring of my galileo-osnma project. This is a Rust library that implements the Galileo OSNMA (open service navigation message authentication) system. The library includes a demo that runs on a Longan nano GD32VF103 RISC-V microcontroller board. The purpose of this demo is to show that this library can run on small microcontrollers. My refactoring was in principle a simple thing: I was mainly organizing the repository as a Cargo workspace, and unifying the library and some supporting tools into the same crate. However, after the refactor, users reported that the Longan nano software was broken. It would hang after processing some messages. This post is a collection of notes about how I investigated the issue, which turned out to be related to stack usage.

First I checked that I was able to reproduce the issue, either by feeding data from the Galmon public feed (which can be fetched using nc 86.82.68.237 10000) or by feeding data from my own GNSS receiver with Galmon’s ubxtool. Indeed the software was broken. At some point it stopped sending the READY message over the UART, which is how this demo implements flow control. I made a recording of a few minutes of Galmon public feed data in order to test repeatably, and I could confirm that the software from before the refactor obtained successful authentication, while the software after the refactor hung before obtaining authentication and before reaching the end of the file. This left me wondering which of my changes could have caused the breakage.

I realized that one of the things that I changed was the compiler options. Before the refactor, the Longan nano was in a standalone crate with the following configuration.

[profile.release]
opt-level = "z"
lto = true

After the refactor, because profiles need to be defined at workspace level, I had the following:

[profile.release]
codegen-units = 1
lto = true
panic = "abort"

[profile.embedded]
inherits = "release"
opt-level = "z" # Optimize for size.

To build the Longan nano software I used the embedded profile. I noticed that this caused the software to be built with codegen-units = 1 instead of the default. I played with other values of codegen-units in the embedded profile and found that codegen-units = 1 caused the software to crash eventually, while other values of codegen-units worked (later we will see that with codegen-units = 2 the software still crashed after some hours, but with codegen-units = 16 it seemed to run well forever).
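As a stopgap, the working value can be pinned in the workspace profile. This is a sketch of the embedded profile with the codegen-units override added (mirroring the profile shown above):

```toml
[profile.embedded]
inherits = "release"
opt-level = "z"     # Optimize for size.
codegen-units = 16  # Workaround: codegen-units = 1 makes the demo crash.
```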

At this point I was puzzled: what could be making codegen-units = 1 break the software? The value of codegen-units defines how many independent code generation units are used in compilation. Lower values of codegen-units cause longer build times but can enable additional opportunities for compiler optimizations. The behaviour of the resulting code shouldn’t change, however.

One of the things I suspected was that perhaps the software was running out of stack space. The GD32VF103 only has 32 KiB of SRAM. This is an important limitation which I have needed to take into account in the past. The full data that the galileo-osnma library would need to handle to account for all the satellites in the Galileo constellation is around 37 KiB. Because this doesn’t fit in SRAM, the Longan nano software uses a reduced-memory configuration that only considers some data for up to 12 satellites in view, which takes around 9 KiB. However, why would using codegen-units = 1 increase the stack usage? The other suspicion I had was that perhaps a compiler bug was triggered by codegen-units = 1.

Initially I didn’t plan to spend more time debugging this, since using codegen-units = 16 was a good workaround. However, the curiosity to understand what was happening won. First I tried to check whether the software built with codegen-units = 1 was panicking. For that, I made a custom panic handler that turned on the red LED on the board. The LED didn’t turn on when the software crashed. Additionally, by adding a couple more print statements on the UART, I found that the crash happened during a call to the Osnma::feed_osnma function, which is responsible for processing a single piece of OSNMA data taken from an INAV page. This function does a lot of work, so the crash could be anywhere in the library.
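A panic handler along those lines takes only a few lines of code. The sketch below is illustrative rather than the exact code I used: turn_on_red_led is a hypothetical helper standing in for the longan-nano HAL GPIO calls that drive the LED pin.

```rust
use core::panic::PanicInfo;

// Hypothetical helper: in the real code this would use the longan-nano HAL
// to drive the GPIO pin connected to the red LED.
fn turn_on_red_led() {}

// In a no_std binary, this function gets called on any panic.
#[panic_handler]
fn on_panic(_info: &PanicInfo) -> ! {
    turn_on_red_led();
    // Spin forever so that the LED stays on and the crash is visible.
    loop {}
}
```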

Intrigued by why a Rust program could be crashing without panicking, I decided to debug using JTAG. I had never used JTAG with this Longan nano board before, but I found that an Altera USB Blaster clone that I had for the Hermes-Lite 2.0 worked fine and allowed me to debug with gdb over JTAG.

With gdb, the first thing I could see is that when the software crashed, it was stuck in an abort function, which is just an endless loop.

08000240 <abort>:
8000240: 0000006f j 0x8000240 <abort>

In the backtrace I could see that this had been called by a _start_trap function.

080001c0 <_start_trap>:
80001c0: 7139 addi sp, sp, -0x40
80001c2: c006 sw ra, 0x0(sp)
80001c4: c216 sw t0, 0x4(sp)
80001c6: c41a sw t1, 0x8(sp)
80001c8: c61e sw t2, 0xc(sp)
80001ca: c872 sw t3, 0x10(sp)
80001cc: ca76 sw t4, 0x14(sp)
80001ce: cc7a sw t5, 0x18(sp)
80001d0: ce7e sw t6, 0x1c(sp)
80001d2: d02a sw a0, 0x20(sp)
80001d4: d22e sw a1, 0x24(sp)
80001d6: d432 sw a2, 0x28(sp)
80001d8: d636 sw a3, 0x2c(sp)
80001da: d83a sw a4, 0x30(sp)
80001dc: da3e sw a5, 0x34(sp)
80001de: dc42 sw a6, 0x38(sp)
80001e0: de46 sw a7, 0x3c(sp)
80001e2: 00010533 add a0, sp, zero
80001e6: fa7ff0ef jal 0x800018c <_start_trap_rust>
80001ea: 4082 lw ra, 0x0(sp)
80001ec: 4292 lw t0, 0x4(sp)
80001ee: 4322 lw t1, 0x8(sp)
80001f0: 43b2 lw t2, 0xc(sp)
80001f2: 4e42 lw t3, 0x10(sp)
80001f4: 4ed2 lw t4, 0x14(sp)
80001f6: 4f62 lw t5, 0x18(sp)
80001f8: 4ff2 lw t6, 0x1c(sp)
80001fa: 5502 lw a0, 0x20(sp)
80001fc: 5592 lw a1, 0x24(sp)
80001fe: 5622 lw a2, 0x28(sp)
8000200: 56b2 lw a3, 0x2c(sp)
8000202: 5742 lw a4, 0x30(sp)
8000204: 57d2 lw a5, 0x34(sp)
8000206: 5862 lw a6, 0x38(sp)
8000208: 58f2 lw a7, 0x3c(sp)
800020a: 6121 addi sp, sp, 0x40
800020c: 30200073 mret

This function is the trap handler that gets run when the RISC-V hart (hardware thread, which is the formal RISC-V term for a CPU core) traps. I set a hardware breakpoint at the beginning of _start_trap with hbreak *0x80001c0 and let the software crash again. This is the state of the registers at the breakpoint.

(gdb) info r
ra 0xffffffff 0xffffffff
sp 0x1ffffce0 0x1ffffce0
gp 0x20000800 0x20000800
tp 0x0 0x0
t0 0x800c000 134266880
t1 0xffffffff -1
t2 0xffffffff -1
fp 0xffffffff 0xffffffff
s1 0xffffffff -1
a0 0x0 0
a1 0xffffffff -1
a2 0xffffffff -1
a3 0xffffffff -1
a4 0x0 0
a5 0x0 0
a6 0x0 0
a7 0x0 0
s2 0xffffffff -1
s3 0xffffffff -1
s4 0xffffffff -1
s5 0xffffffff -1
s6 0xffffffff -1
s7 0xffffffff -1
s8 0xffffffff -1
s9 0xffffffff -1
s10 0xffffffff -1
s11 0xffffffff -1
t3 0x0 0
t4 0xffffffff -1
t5 0x0 0
t6 0xffffffff -1
pc 0x80001c0 0x80001c0 <_start_trap>

There is more relevant information in some of the machine CSRs. The mstatus register contains information about the hart’s state. In this case there is nothing of interest there.

(gdb) info r $mstatus
mstatus 0x1800 SD:0 VM:00 MXR:0 PUM:0 MPRV:0 XS:0
FS:0 MPP:3 HPP:0 SPP:0 MPIE:0 HPIE:0 SPIE:0 UPIE:0 MIE:0
HIE:0 SIE:0 UIE:0

The mtvec register contains the trap vector base address. It indicates the address of the trap handler.

(gdb) info r $mtvec
mtvec 0x80001c3 134218179

Trap handling can be done in direct mode, in which there is a single handler for all traps, or in vectored mode, in which the handler address is computed in terms of the trap cause. The two LSBs of mtvec are supposed to be 2'b00 for direct mode and 2'b01 for vectored mode, with the other two possible values being reserved. Here the value is 2'b11, which is reserved. I think that this is a silicon bug in the GD32VF103, which probably has these two bits hardcoded to 1, even though I believe that trap handling is working in direct mode. The 30 MSBs of the register contain the 30 MSBs of the trap handler base address (which in direct mode is just the trap handler address). The base address is supposed to be aligned to 4 bytes. We see that the trap handler address 0x80001c0 matches the _start_trap function.
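The decoding of mtvec can be double-checked with a quick host-side calculation (plain Rust, unrelated to the firmware):

```rust
fn main() {
    let mtvec: u32 = 0x80001c3; // value read with gdb
    let base = mtvec & !0b11;   // 30 MSBs: trap handler base address
    let mode = mtvec & 0b11;    // 2 LSBs: 0b00 direct, 0b01 vectored
    assert_eq!(base, 0x80001c0); // matches _start_trap
    assert_eq!(mode, 0b11);      // a reserved value, hence the suspected silicon bug
}
```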

The mcause register contains a code indicating the event that caused the trap. The MSB is set if the trap was caused by an interrupt. The remaining bits indicate the exception code that caused the trap.

(gdb) info r $mcause
mcause 0x30000001 805306369

Here we also have what looks like a silicon bug, because the leading nibble is 0x3. According to the RISC-V specification, for interrupt = 0, all the cause codes greater than or equal to 64 are reserved. I believe that the correct value of mcause in this case would be 0x00000001, which means instruction access fault.
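The same kind of host-side check splits mcause into its interrupt bit and exception code:

```rust
fn main() {
    let mcause: u32 = 0x30000001; // value read with gdb
    let interrupt = mcause >> 31;    // MSB: set if the trap is an interrupt
    let code = mcause & 0x7fff_ffff; // remaining bits: exception code
    assert_eq!(interrupt, 0);        // an exception, not an interrupt
    assert!(code >= 64);             // reserved range, hence the suspected bug
}
```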

The mepc register contains the program counter value at the moment when the trap happened. Its LSB is always zero, because instructions in RISC-V (with the compressed instructions extension) are aligned to 2 bytes.

(gdb) info r $mepc
mepc 0xfffffffe -2

Here we see that the mepc register contains the address 0xfffffffe. This address is outside the memory map of the GD32VF103 (see Section 2.4 in the datasheet), so it makes sense that an instruction access fault was generated and we ended up in the trap handler.

What has happened here? The registers contain some clues. First of all, there are many registers with the value 0xffffffff, which looks suspicious. The stack pointer is 0x1ffffce0. This is pointing outside the SRAM, which is mapped to 0x20000000 - 0x20007FFF. In fact we can see how this executable is supposed to be using the SRAM by inspecting some symbols in the ELF file:

$ rust-objdump -x \
target/riscv32imac-unknown-none-elf/embedded/\
osnma-longan-nano | grep -e "__[se]\(bss\|stack\)"
20000000 g .bss 00000000 __sbss
20000020 g .bss 00000000 __ebss
20000020 g .stack 00000000 __estack
20008000 g .stack 00000000 __sstack

We see that this program only needs 32 bytes of BSS, so the remaining SRAM space is allocated to the stack, which grows downwards starting at the end of the SRAM.

Finally, the register t0 contains another clue. Its value is 0x0800c000, which is an address into the flash where the program is contained (the flash is mapped to 0x08000000 - 0x08020000). If we go to that part of the code we find that it corresponds to the ret instruction in a function that performs arithmetic for the p256 elliptic curve cryptography.

0800bda6 <p256::arithmetic::field::field_impl::sub_inner::hd30cc84231e2f12e>:
[...]
800bff8: 00002297 auipc t0, 0x2
800bffc: 3e6282e7 jalr t0, 0x3e6(t0) <OUTLINED_FUNCTION_11>
800c000: 8082 ret

Before this function returns, an outlined function is run. This is the code for that function.

0800e3de <OUTLINED_FUNCTION_11>:
800e3de: 40f6 lw ra, 0x5c(sp)
800e3e0: 4466 lw s0, 0x58(sp)
800e3e2: 44d6 lw s1, 0x54(sp)
800e3e4: 4946 lw s2, 0x50(sp)
800e3e6: 49b6 lw s3, 0x4c(sp)
800e3e8: 4a26 lw s4, 0x48(sp)
800e3ea: 4a96 lw s5, 0x44(sp)
800e3ec: 4b06 lw s6, 0x40(sp)
800e3ee: 5bf2 lw s7, 0x3c(sp)
800e3f0: 5c62 lw s8, 0x38(sp)
800e3f2: 5cd2 lw s9, 0x34(sp)
800e3f4: 5d42 lw s10, 0x30(sp)
800e3f6: 5db2 lw s11, 0x2c(sp)
800e3f8: 6125 addi sp, sp, 0x60
800e3fa: 8282 jr t0

The goal of this outlined function is to reduce the code size. It contains a very common routine that restores the return address register and all the saved registers. Many functions will need to perform these operations before returning, so by putting them in an outlined function, code repetition is reduced.

The way that the main function jumps into the outlined function is interesting. Because the outlined function is going to load the return address register ra with the return address needed by the main function, another mechanism is needed to return from the outlined function. This is implemented by the jalr t0, 0x3e6(t0) instruction, which jumps to t0 + 0x3e6 and stores the program counter value for the next instruction (which is 0x800c000, corresponding to the ret instruction) into the t0 register. In this way, the outlined function can perform its work, taking care not to clobber t0, and then use jr t0 to return to the main function.
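The constants involved can be replayed on the host to confirm this mechanism (addresses taken from the disassembly above):

```rust
fn main() {
    let auipc_pc: u32 = 0x0800_bff8; // address of `auipc t0, 0x2`
    let t0 = auipc_pc + (0x2 << 12); // auipc: t0 = pc + (imm << 12)
    let jalr_pc = auipc_pc + 4;      // address of `jalr t0, 0x3e6(t0)`
    let target = t0 + 0x3e6;         // jump target of the jalr
    let link = jalr_pc + 4;          // value that jalr writes back into t0
    assert_eq!(target, 0x0800_e3de); // OUTLINED_FUNCTION_11
    assert_eq!(link, 0x0800_c000);   // the `ret` instruction
}
```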

The value 0x800c000 that we have found in t0 at the trap handler breakpoint is the telltale sign of this mechanism. Now we understand that the first instruction of the outlined function has loaded 0xffffffff into ra (and also all the saved registers). Therefore, the ret instruction of the main function is trying to jump to 0xffffffff, which is an illegal instruction address because it is not aligned and because it is outside of the CPU address map. This is why the trap happens. The RISC-V specification defines a trap cause for instruction address misaligned, so I think this, rather than instruction access fault, should have been the cause of the trap in mcause.

We have noticed that the stack pointer contains the value 0x1ffffce0 at the beginning of the trap handler. Taking into account that the outlined function has performed addi sp, sp, 0x60, this means that the stack pointer was 0x1ffffc80 at the beginning of the outlined function. This is 896 bytes below the start of the SRAM, so we see that the program has run out of stack space and it is doomed to crash one way or another.
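A quick check of the stack-pointer arithmetic:

```rust
fn main() {
    let sp_at_trap: u32 = 0x1fff_fce0;   // sp at the trap handler breakpoint
    let sp_outlined = sp_at_trap - 0x60; // undo the `addi sp, sp, 0x60`
    let sram_start: u32 = 0x2000_0000;
    // How far below the SRAM the stack had grown.
    assert_eq!(sram_start - sp_outlined, 896);
}
```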

All the loads in the outlined function, as well as some other loads in the main function, are targeting the area immediately below the SRAM. In the address map, the area 0x1FFFF810 - 0x1FFFFFFF is shown as code (reserved). For some reason, loads from this area are returning 0xffffffff. That is the reason why most of the registers have this value at the beginning of the trap handler. I don’t know all the details of the RISC-V specification, but I think that it would be better that these loads generate a load access fault trap instead of returning a hardcoded all-ones constant.

Now that we have understood all the details about how the software crashes, the next question is why the stack usage differs depending on the codegen-units value, and what we can do about it. The first thing I checked was putting a breakpoint in the main function with hbreak main (note that this does not place the breakpoint at the first instruction of main, but rather at the first instruction after the preamble of main, where the stack pointer has already been decremented to reserve stack for main) and printing the value of the stack pointer. I got the following:

  • With codegen-units = 16, the stack pointer is 0x200032c0, which means that there are around 12.65 KiB of SRAM free.
  • With codegen-units = 1, the stack pointer is 0x20001160, which means that there are around 4.31 KiB of SRAM free.

It makes sense that these 4.3 KiB are not enough to run the relatively complex calculations required by the elliptic curve cryptography, and so we run out of stack space when codegen-units = 1. The question now is why we have a difference in the stack usage at the main function of 8.34 KiB depending on how we compile the program.
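These free-space figures follow from the stack pointer and the end of .bss (__ebss = 0x20000020, from the symbol listing above):

```rust
fn main() {
    let ebss: u32 = 0x2000_0020;        // the stack may grow down to here
    let free_cg16 = 0x2000_32c0 - ebss; // sp at main with codegen-units = 16
    let free_cg1 = 0x2000_1160 - ebss;  // sp at main with codegen-units = 1
    assert_eq!(free_cg16, 12960);       // ~12.65 KiB
    assert_eq!(free_cg1, 4416);         // ~4.31 KiB
    assert_eq!(free_cg16 - free_cg1, 8544); // ~8.34 KiB difference
}
```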

First I read the code that the program executes before the main function is called, to understand the stack usage up to this point. I saw that the stack pointer was first initialized to 0x20008000 and only 16 bytes of stack were reserved later on, so at the start of the main function the stack size is only 16 bytes. I verified this with gdb. Therefore, in order to understand the stack usage of main, we only need to look at how the stack pointer is decremented in the preamble of main.

With codegen-units = 1, this is the beginning of the main function.

0800354e <main>:
800354e: 0000b297 auipc t0, 0xb
8003552: 120282e7 jalr t0, 0x120(t0) <OUTLINED_FUNCTION_35>
8003556: 651d lui a0, 0x7
8003558: d9050513 addi a0, a0, -0x270
800355c: 40a10133 sub sp, sp, a0
[...]

The outlined function that is being called is basically the opposite of the previous outlined function we saw. It saves the return address register and all the saved registers to the stack. Interestingly, it decrements the stack pointer by 256 bytes, which is more than what is necessary to store these 13 registers. I don’t know why this is done like this, but clearly it is then taken into account when reserving more stack in the main function. Perhaps the idea here is that functions that need slightly less than 256 bytes of stack can simply use this reservation and not touch the stack pointer in the main function.

0800e66e <OUTLINED_FUNCTION_35>:
800e66e: 7111 addi sp, sp, -0x100
800e670: df86 sw ra, 0xfc(sp)
800e672: dda2 sw s0, 0xf8(sp)
800e674: dba6 sw s1, 0xf4(sp)
800e676: d9ca sw s2, 0xf0(sp)
800e678: d7ce sw s3, 0xec(sp)
800e67a: d5d2 sw s4, 0xe8(sp)
800e67c: d3d6 sw s5, 0xe4(sp)
800e67e: d1da sw s6, 0xe0(sp)
800e680: cfde sw s7, 0xdc(sp)
800e682: cde2 sw s8, 0xd8(sp)
800e684: cbe6 sw s9, 0xd4(sp)
800e686: c9ea sw s10, 0xd0(sp)
800e688: c7ee sw s11, 0xcc(sp)
800e68a: 8282 jr t0

In any case, in the main function we see that the value (0x7 << 12) - 0x270 is loaded into a0 and then a0 is subtracted from the stack pointer. This means that overall the main function is decrementing the stack pointer by (0x7 << 12) - 0x270 + 0x100 = 28304 bytes.

In comparison, with codegen-units = 16, the preamble of the main function looks like this.

08009736 <main>:
8009736: 00006297 auipc t0, 0x6
800973a: bf4282e7 jalr t0, -0x40c(t0) <OUTLINED_FUNCTION_46>
800973e: 6515 lui a0, 0x5
8009740: c3050513 addi a0, a0, -0x3d0
8009744: 40a10133 sub sp, sp, a0
[...]

The outlined function is identical to the one above, except that it has a different name and it sits at a different address. Therefore, in this case the main function is decrementing the stack pointer by (0x5 << 12) - 0x3d0 + 0x100 = 19760 bytes. As we already knew, there is a difference of 8544 bytes in the stack usage of main depending on how we compile.
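The two frame sizes can be recomputed from the immediates in the disassembly, and their difference matches the 8.34 KiB gap measured earlier with gdb:

```rust
fn main() {
    // Total sp decrement: 0x100 from the outlined prologue plus the sub in main.
    let frame_cg1: u32 = (0x7 << 12) - 0x270 + 0x100;
    let frame_cg16: u32 = (0x5 << 12) - 0x3d0 + 0x100;
    assert_eq!(frame_cg1, 28304);
    assert_eq!(frame_cg16, 19760);
    assert_eq!(frame_cg1 - frame_cg16, 8544); // 8544 bytes ≈ 8.34 KiB
}
```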

To investigate this difference, I looked at the LLVM IR for the program. This can be obtained by building with

cargo rustc -p osnma-longan-nano \
--target riscv32imac-unknown-none-elf \
--profile embedded -- --emit=llvm-ir

The LLVM IR is very verbose, so here I will only put the relevant details. With codegen-units = 1, the main function contains the following large stack allocations.

define dso_local void @main() unnamed_addr #14 !dbg !23641 {
[…]
%osnma.i = alloca [8536 x i8], align 8
[…]
%interface = alloca [8792 x i8], align 8

With codegen-units = 16, the main function only contains the %interface allocation. The %osnma.i allocation is missing.

define dso_local void @main() unnamed_addr #43 !dbg !60675 {
[…]
%interface = alloca [8792 x i8], align 8

The presence of %osnma.i in the codegen-units = 1 LLVM IR is what causes most of the stack usage difference (there are another 8 bytes that aren’t worth investigating). The relevant part of the Rust code to understand these allocations is the following.

struct Board {
tx: serial::Tx<USART0>,
rx: serial::Rx<USART0>,
rx_buffer: [u8; 256],
}

struct OsnmaInterface {
osnma: Osnma<SmallStorage>,
board: Board,
}

impl OsnmaInterface {
fn new(board: Board) -> OsnmaInterface {
let pubkey = VerifyingKey::from_sec1_bytes(&OSNMA_PUBKEY).unwrap();
let pubkey = PublicKey::from_p256(pubkey, OSNMA_PUBKEY_ID).force_valid();
let osnma =
Osnma::<SmallStorage>::from_merkle_tree(OSNMA_MERKLE_TREE_ROOT, Some(pubkey), false);
OsnmaInterface { osnma, board }
}
[...]
}

#[entry]
fn main() -> ! {
let board = Board::take();
let mut interface = OsnmaInterface::new(board);

loop {
interface.spin();
}
}

Basically, the program contains an OsnmaInterface struct that has a 256-byte buffer for reading UART data (the serial::Tx and serial::Rx are zero-sized types) and an Osnma object that contains all the data required by the galileo-osnma library. The Board object is constructed by initializing the UART and zero-initializing the rx_buffer, and then the Osnma object is constructed with one of the constructors offered by the library. Both objects are put together in an OsnmaInterface object.

Because osnma is moved into interface, we would expect that it gets constructed directly into the allocation for interface, instead of being constructed somewhere else and then copied over. This is indeed what happens with codegen-units = 16, but not with codegen-units = 1. The thing we need to understand here is that Rust’s move system relies heavily on LLVM’s ability to optimize out unnecessary temporary allocations and copies. I will illustrate this with a simple example, which you can also see in Godbolt’s Compiler Explorer.

Consider the following code:

#![no_std]

struct A {
_data: [u8; 64]
}

struct B {
_data: [u8; 32]
}

pub struct Both {
_a: A,
_b: B,
}

#[unsafe(no_mangle)]
pub fn construct() -> Both {
Both {
_a: A { _data: [0; 64] },
_b: B { _data: [0xff; 32] },
}
}

When construct is called, the allocation for the Both that is returned has already been reserved by the caller, so the only thing that construct should do is two memsets to initialize the arrays in this allocation to their required values. This is indeed what happens when we build with -C opt-level=z -C codegen-units=1.

construct:
addi sp, sp, -16
sw ra, 12(sp)
sw s0, 8(sp)
mv s0, a0
li a2, 64
li a1, 0
call memset
addi a0, s0, 64
li a1, 255
li a2, 32
call memset
lw ra, 12(sp)
lw s0, 8(sp)
addi sp, sp, 16
ret

The LLVM IR is basically two calls to memset as we would expect.

define dso_local void @construct(ptr dead_on_unwind noalias noundef writable writeonly sret([96 x i8]) align 1 captures(none) dereferenceable(96) initializes((0, 96)) %_0) unnamed_addr {
start:
tail call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(64) %_0, i8 0, i32 64, i1 false)
%0 = getelementptr inbounds nuw i8, ptr %_0, i32 64
tail call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(32) %0, i8 -1, i32 32, i1 false)
ret void
}

declare void @llvm.memset.p0.i32(ptr writeonly captures(none), i8, i32, i1 immarg) #1

However the Rust MIR looks quite different. This is the relevant part corresponding to the construct function.

fn construct() -> Both {
let mut _0: Both;
let mut _1: A;
let mut _2: [u8; 64];
let mut _3: B;
let mut _4: [u8; 32];

bb0: {
StorageLive(_1);
StorageLive(_2);
_2 = [const 0_u8; 64];
_1 = A { _data: move _2 };
StorageDead(_2);
StorageLive(_3);
StorageLive(_4);
_4 = [const u8::MAX; 32];
_3 = B { _data: move _4 };
StorageDead(_4);
_0 = Both { _a: move _1, _b: move _3 };
StorageDead(_3);
StorageDead(_1);
return;
}
}

We see that there are temporaries for everything. Even A and B are constructed by first putting the array into a temporary and then moving the array into the struct.

If we look at the initial LLVM IR before any optimization passes are done, we see that it closely mirrors the MIR. We have memset() to initialize the arrays to their corresponding values, and memcpy() to move things. There are 192 bytes of temporaries allocated on the stack just for a function that is only supposed to initialize caller-allocated memory.

define dso_local void @construct(ptr dead_on_unwind noalias noundef writable sret([96 x i8]) align 1 captures(address) dereferenceable(96) %0) unnamed_addr {
%2 = alloca [32 x i8], align 1
%3 = alloca [32 x i8], align 1
%4 = alloca [64 x i8], align 1
%5 = alloca [64 x i8], align 1
call void @llvm.lifetime.start.p0(i64 64, ptr %5)
call void @llvm.lifetime.start.p0(i64 64, ptr %4)
call void @llvm.memset.p0.i32(ptr align 1 %4, i8 0, i32 64, i1 false)
call void @llvm.memcpy.p0.p0.i32(ptr align 1 %5, ptr align 1 %4, i32 64, i1 false)
call void @llvm.lifetime.end.p0(i64 64, ptr %4)
call void @llvm.lifetime.start.p0(i64 32, ptr %3)
call void @llvm.lifetime.start.p0(i64 32, ptr %2)
call void @llvm.memset.p0.i32(ptr align 1 %2, i8 -1, i32 32, i1 false)
call void @llvm.memcpy.p0.p0.i32(ptr align 1 %3, ptr align 1 %2, i32 32, i1 false)
call void @llvm.lifetime.end.p0(i64 32, ptr %2)
call void @llvm.memcpy.p0.p0.i32(ptr align 1 %0, ptr align 1 %5, i32 64, i1 false)
%6 = getelementptr inbounds i8, ptr %0, i32 64
call void @llvm.memcpy.p0.p0.i32(ptr align 1 %6, ptr align 1 %3, i32 32, i1 false)
call void @llvm.lifetime.end.p0(i64 32, ptr %3)
call void @llvm.lifetime.end.p0(i64 64, ptr %5)
ret void
}

Things look quite similar for a few optimization passes, until we reach a MemCpyOptPass that realizes that a memset followed by a single memcpy or a chain of multiple memcpy‘s can be replaced by a direct memset to the destination of the last memcpy in the chain. This optimization gives the following. Note that we now have two memset‘s and no memcpy‘s, but the temporary allocations haven’t been optimized out yet.

define dso_local void @construct(ptr dead_on_unwind noalias noundef writable writeonly sret([96 x i8]) align 1 captures(none) dereferenceable(96) %0) unnamed_addr {
%2 = alloca [32 x i8], align 1
%3 = alloca [32 x i8], align 1
%4 = alloca [64 x i8], align 1
%5 = alloca [64 x i8], align 1
call void @llvm.lifetime.start.p0(i64 64, ptr nonnull %5)
call void @llvm.lifetime.start.p0(i64 64, ptr nonnull %4)
call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(64) %0, i8 0, i32 64, i1 false)
call void @llvm.lifetime.end.p0(i64 64, ptr nonnull %4)
call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %3)
call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %2)
%6 = getelementptr inbounds nuw i8, ptr %0, i32 64
call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(32) %6, i8 -1, i32 32, i1 false)
call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %2)
call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %3)
call void @llvm.lifetime.end.p0(i64 64, ptr nonnull %5)
ret void
}

The next InstCombinePass realizes that the temporary allocations are unused and removes them. The IR now looks very similar to the final IR.

define dso_local void @construct(ptr dead_on_unwind noalias noundef writable writeonly sret([96 x i8]) align 1 captures(none) dereferenceable(96) %0) unnamed_addr {
call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(64) %0, i8 0, i32 64, i1 false)
%2 = getelementptr inbounds nuw i8, ptr %0, i32 64
call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(32) %2, i8 -1, i32 32, i1 false)
ret void
}

In this very simple example everything works as expected and we get the assembly code that we wanted. In much more complex cases, LLVM optimization passes might not be able to optimize out all the moves emitted by the Rust compiler. This is what is happening in the codegen-units = 1 case with the Longan nano software. Since the software is much more complex than this simple example, I haven’t investigated what is preventing LLVM from optimizing out the %osnma.i temporary allocation.

The final question is: what can we do to improve this? One issue is missed opportunities for move optimizations. This could be fixed by future improvements in the LLVM optimizer. However, the main issue here is that the Rust compiler is inlining a lot of initialization code into the main function. Besides a few other variables which are small, the data that the program needs on the stack to run is the interface object, which takes 8792 bytes. However, we have seen that even in the codegen-units = 16 case, the main function needs 19760 bytes of stack, which is a lot.

In the LLVM IR we can see that besides the interface allocation there are a few other large allocations with sizes around 1 to 3 KiB. These have less obvious names, some of which include sroa, which stands for “scalar replacement of aggregates”, an LLVM optimization. My understanding of this situation is that we are getting allocations for temporaries that are needed in the initialization of the Osnma object, such as for instance temporaries used to load the ECDSA public key.

In an ideal world, we could build Osnma using compile-time const evaluation, since this Osnma instance only depends on const‘s containing the ECDSA public key and Merkle tree root that are generated by the build.rs script. However, const evaluation in Rust is somewhat limited (for good reasons) and none of the elliptic curve cryptography functions that are used here are const. Besides, compile-time initialization wouldn’t be realistic anyway. More realistic software would read this cryptographic material from somewhere in flash, to allow the material to be updated without updating the software. Such software would still need to run all of this initialization at runtime.

In any case, there is a simple way to improve this program. If initialization takes up a lot of stack space, then it shouldn’t be inlined into the main function. In this way, that stack space can be freed at the end of the initialization, recovering stack space to be used by the program loop. This is what I’ve done in this software. It now looks like this.

#[inline(never)]
fn new_interface() -> OsnmaInterface {
OsnmaInterface::new(Board::take())
}

#[entry]
fn main() -> ! {
let mut interface = new_interface();
loop {
interface.spin();
}
}

With this change, and building with codegen-units = 1, the new_interface function looks like this.

08002f92 <osnma_longan_nano::new_interface::h0b1ed4267d09a8c1>:
8002f92: 0000b297 auipc t0, 0xb
8002f96: 610282e7 jalr t0, 0x610(t0) <OUTLINED_FUNCTION_30>
8002f9a: 6595 lui a1, 0x5
8002f9c: af058593 addi a1, a1, -0x510
8002fa0: 40b10133 sub sp, sp, a1
[...]

The outlined function is saving the return address register and the first 8 saved registers.

0800e5a2 <OUTLINED_FUNCTION_30>:
800e5a2: 7111 addi sp, sp, -0x100
800e5a4: df86 sw ra, 0xfc(sp)
800e5a6: dda2 sw s0, 0xf8(sp)
800e5a8: dba6 sw s1, 0xf4(sp)
800e5aa: d9ca sw s2, 0xf0(sp)
800e5ac: d7ce sw s3, 0xec(sp)
800e5ae: d5d2 sw s4, 0xe8(sp)
800e5b0: d3d6 sw s5, 0xe4(sp)
800e5b2: d1da sw s6, 0xe0(sp)
800e5b4: cfde sw s7, 0xdc(sp)
800e5b6: 8282 jr t0

The stack usage of new_interface is (0x5 << 12) - 0x510 + 0x100 = 19440 bytes. So we see that as a bonus the move optimization for osnma is now working even with codegen-units = 1.

The main function starts like this.

08003bc2 <main>:
8003bc2: 0000b297 auipc t0, 0xb
8003bc6: 9e0282e7 jalr t0, -0x620(t0) <OUTLINED_FUNCTION_30>
8003bca: cde2 sw s8, 0xd8(sp)
8003bcc: cbe6 sw s9, 0xd4(sp)
8003bce: c9ea sw s10, 0xd0(sp)
8003bd0: c7ee sw s11, 0xcc(sp)
8003bd2: 6509 lui a0, 0x2
8003bd4: 24050513 addi a0, a0, 0x240
8003bd8: 40a10133 sub sp, sp, a0
8003bdc: 6509 lui a0, 0x2
8003bde: 2a050513 addi a0, a0, 0x2a0
8003be2: 00a10db3 add s11, sp, a0
8003be6: 40014437 lui s0, 0x40014
8003bea: 6909 lui s2, 0x2
8003bec: 04810993 addi s3, sp, 0x48
8003bf0: 12c90513 addi a0, s2, 0x12c
8003bf4: 7ff98493 addi s1, s3, 0x7ff
8003bf8: 954e add a0, a0, s3
8003bfa: cc2a sw a0, 0x18(sp)
8003bfc: 67d48c93 addi s9, s1, 0x67d
8003c00: 00a8 addi a0, sp, 0x48
8003c02: fffff097 auipc ra, 0xfffff
8003c06: 390080e7 jalr 0x390(ra) <osnma_longan_nano::new_interface::h0b1ed4267d09a8c1>
[...]

It calls the same outlined function. The stack space that it needs is (0x2 << 12) + 0x240 + 0x100 = 9024 bytes. Taking into account that interface is using 8792 bytes, that leaves only 232 bytes used by other variables, which is excellent. With this change, the main function has 23.1 KiB of free stack space, so we don’t risk running out of stack during the program loop.
