• Tooling for CSP

    CSP is the Cubesat Space Protocol, a network protocol developed by Aalborg University that is commonly used in cubesats, in particular those using GOMspace hardware. Initially the protocol allowed different nodes on a satellite to exchange packets over a CAN bus, but it eventually grew into a protocol that spans a network composed of nodes in the satellite and in the ground station, communicating over different physical layers, including RF links.

    Recently I have been working on a project that involves CSP. To measure network performance and debug network issues, I have written some tooling in Rust, as well as a Wireshark dissector in Lua. The Rust tooling is an implementation from scratch and doesn’t use libcsp. I have now open sourced these tools in a csp-tools repository and a csp-tools Rust crate. In this post I showcase how the tools work.

    The repository includes the following CLI tools.

    • cspdump. A tool similar to tcpdump. It receives CSP packets from a ZMQ socket or a CAN interface and writes them to a PCAP file.
    • csp-iperf. A tool similar to iperf. It sends CSP packets through a ZMQ socket or a CAN interface, expects these packets to be echoed back by a ping service, and measures throughput, round-trip times and lost packets.
    • csp-ping-server. A tool that implements a ping service. It can be used in combination with csp-iperf to perform network performance measurements.

    These tools all support the following physical layers or interfaces to send and receive CSP packets:

    • CAN interface through SocketCAN in Linux. This uses CFP (CSP Fragmentation Protocol) as defined in CSP to fragment CSP packets over CAN. A 29-bit CAN extended identifier is constructed from several header fields, including the source and destination addresses, so that CAN bus arbitration implements priorities based on CSP addresses (see the sketch after this list).
    • ZMQ PUB and SUB socket. ZMQ is the usual way in which CSP applications are connected through a virtual network in Linux machines. An application such as the libcsp zmqproxy acts as a message broker. It receives CSP packets from applications on a PUB socket, and sends them to applications on a SUB socket. When sent over ZMQ, CSP packets are prepended with the CSP via address, which is the address of the next hop. In this way, applications can subscribe to the via addresses that they need to handle.
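
    As an illustration of how the CAN identifier is built, here is a quick Python sketch. The field layout (5-bit source and destination addresses, a fragment-begin flag, an 8-bit remaining-fragments counter and a 10-bit identifier) is my reading of CFP as used by CSP 1.x, so treat it as an assumption and check the libcsp sources for the authoritative definition.

    # Sketch of a CSP 1.x CFP CAN ID (field layout is an assumption):
    # src (5) | dst (5) | begin (1) | remain (8) | identifier (10) = 29 bits
    def cfp_can_id(src, dst, begin, remain, ident):
        assert src < 32 and dst < 32 and remain < 256 and ident < 1024
        return (src << 24) | (dst << 19) | (begin << 18) | (remain << 10) | ident

    def cfp_parse(can_id):
        return {"src": (can_id >> 24) & 0x1f, "dst": (can_id >> 19) & 0x1f,
                "begin": (can_id >> 18) & 1, "remain": (can_id >> 10) & 0xff,
                "ident": can_id & 0x3ff}

    # With src = 11 the top byte of the ID is 0x0b, and the remaining-fragments
    # counter steps by 4 in the second-to-last byte, matching what we see later
    # in Wireshark.
    assert hex(cfp_can_id(11, 10, 1, 25, 0)) == '0xb546400'

    With this layout, frames with lower source addresses win bus arbitration, which is one way in which CSP addresses can map to CAN priorities.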

    csp-iperf for CSP over ZMQ

    As a first showcase of all of these tools, let’s use csp-iperf together with the csp_server_client example from libcsp. After building libcsp v1.0 as indicated in the repository and installing csp-tools with cargo install csp-tools, we first start zmqproxy by running

    ./zmqproxy

    in the libcsp build/ directory. We also run the following in the build/ directory to launch the server example using ZMQ and CSP address 10.

    ./csp_server_client -a 10 -z localhost

    We start cspdump by running

    cspdump --pcap-file /tmp/csp.pcap

    Now we can initiate csp-iperf as follows:

    csp-iperf --src-addr 11 --dest-addr 10 \
    --packet-size 200 --tx-rate 20e3

    This tells csp-iperf to use a source CSP address of 11, a destination CSP address of 10, a packet size of 200 bytes, and to throttle its transmission to 20 kbps. csp-iperf has a colour output that is easy to read at a glance.

    We can see how the requested 20 kbps data rate is achieved by sending 12-13 packets per second, that all the packets sent are received back except one at the beginning, and we also get a measurement of the minimum, average and maximum round-trip times.

    In Wireshark we can see that by default csp-iperf sends CSP packets with CRC to the ping service CSP port 1. The csp_server_client example bounces these packets back to csp-iperf. The source port number used to send the ping packets is chosen from the range 16-63, and it increments with each packet. All this can be customized with command line arguments. We can also see that the payload of the ping packets contains a timestamp (used for RTT measurement), a sequence number (used to count lost packets), and dummy data.

    csp-iperf in Wireshark

    If we try to increase the data rate, say to 100 kbps, we see that there are many lost packets. The RX rate is only around 32 kbps.

    In the csp_server_client example we can see the following output. The cause is that this application, unlike other CSP implementations, creates a CSP connection object for each ping packet received on a different source port, and it quickly runs out of free connections.

    1768906802.263721 SERVICE: Ping received
    1768906802.293222 No free connections, max 10
    1768906802.293227 No more connections available
    1768906802.309237 No free connections, max 10
    1768906802.309242 No more connections available
    1768906802.313820 SERVICE: Ping received
    1768906802.341239 No free connections, max 10
    1768906802.341244 No more connections available
    1768906802.357223 No free connections, max 10
    1768906802.357228 No more connections available

    To work around this, we can tell csp-iperf to use a fixed source port, such as 32. With these settings I can reach a rate of a few Mbps on my machine, but packets are occasionally lost.

    The data rate that this kind of test can support on my machine goes as high as 320 Mbps, but in these conditions we will have more frequent packet loss, and a rather high RTT of a few tens of ms.

    As a replacement for the csp_server_client example, there is the tool csp-ping-server, which is more performant and more flexible. It is run as follows.

    csp-ping-server --addr 10

    By using the csp-ping-server, I can usually sustain 300 Mbps rate on my machine without any packet loss, although this depends heavily on how well the CPUs are used.

    Asymmetric throughput tests

    A useful feature of csp-ping-server is the ability to use a reply packet size which is different from the request packet size. This can be used to perform a throughput test with asymmetric rates, which can be useful in situations such as the following:

    • When testing an RF connection between the satellite and ground where uplink and downlink have different rates.
    • When the requests and replies go through the same CAN bus and the CAN bus is the bottleneck. Since CAN is half-duplex, the requests and replies share the same bus throughput, and using symmetric rates can oversubscribe the CAN bus.

    The reply packet size is implemented by truncating the request packet payload or extending it with padding when needed. In order to preserve the timestamps and sequence numbers used for the RTT and lost-packet measurements, the reply packet size needs to be at least 24 bytes, or 20 bytes if CRC is disabled in the sender. If only timestamps are needed, the reply size can be 16 bytes, or 12 bytes with no CRC. If neither timestamps nor sequence numbers are needed, the reply size can be made as small as 4 bytes with no CRC, in which case csp-ping-server replies with packets that consist only of the 32-bit CSP header.
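
    These thresholds are consistent with a payload that starts with an 8-byte timestamp followed by an 8-byte sequence number, on top of the 4-byte CSP header and the optional 4-byte CRC. This layout is an inference from the sizes above rather than a documented csp-tools format, but under that assumption the arithmetic works out as follows:

    # Minimum reply sizes implied by a hypothetical payload layout of an
    # 8-byte timestamp plus an 8-byte sequence number (inferred from the
    # thresholds above; not the documented csp-tools wire format).
    HEADER, CRC, TIMESTAMP, SEQUENCE = 4, 4, 8, 8

    def min_reply_size(crc, timestamps, sequence_numbers):
        size = HEADER + (CRC if crc else 0)
        size += TIMESTAMP if timestamps else 0
        size += SEQUENCE if sequence_numbers else 0
        return size

    assert min_reply_size(True, True, True) == 24    # RTT and lost packets
    assert min_reply_size(False, True, True) == 20   # same, no CRC
    assert min_reply_size(True, True, False) == 16   # timestamps only
    assert min_reply_size(False, True, False) == 12  # timestamps only, no CRC
    assert min_reply_size(False, False, False) == 4  # bare 32-bit CSP header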

    As an example of this feature, we can run the following to reply with 20-byte packets:

    csp-ping-server --addr 10 --reply-size 20

    With a request packet size of 200 bytes in csp-iperf, this generates a 10:1 TX/RX ratio. Here we can see this working with 1 Mbps TX and 100 kbps RX.

    If we want to test the RX rate with a low TX rate, we can tell csp-ping-server to use a reply size of 200 bytes and tell csp-iperf to use a packet size of 20 bytes. For this kind of test, csp-iperf provides some convenience arguments to adjust the TX rate according to the desired RX rate. We can tell csp-iperf to expect a reply size of 200 bytes and to adjust the TX rate for an RX rate of 1 Mbps, and it will figure out that the TX rate needs to be 100 kbps.
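
    The arithmetic behind this convenience option is simple, since every request elicits exactly one reply. A minimal sketch:

    # TX rate needed to achieve a desired RX rate, given the request and
    # reply packet sizes (one reply per request):
    def tx_rate_for_rx_rate(rx_rate, request_size, reply_size):
        return rx_rate * request_size / reply_size

    print(tx_rate_for_rx_rate(1e6, 20, 200))  # 100000.0: 100 kbps TX for 1 Mbps RX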

    CSP over CAN

    So far all of these examples have used CSP over ZMQ, but it is also possible to use these tools with CAN. To show this, I will use a virtual CAN interface in Linux. This vcan interface can be set up as follows.

    sudo modprobe vcan
    sudo ip link add dev vcan0 type vcan
    sudo ip link set vcan0 mtu 16
    sudo ip link set up vcan0

    Unlike real CAN interfaces, vcan interfaces do not have a bitrate, so their throughput is limited by how fast the software can run. CSP over CAN generally uses CAN 2.0, which has a maximum bitrate of 1 Mbps. Because CAN 2.0 has a maximum payload size of 8 bytes, CAN has a massive overhead. A frame carrying 8 bytes of data actually needs 131 bits, and on top of that there can be bit stuffing depending on the frame contents. This calculator shows that the throughput of 1 Mbps CAN can range between 415.58 kbps and 488.55 kbps depending on bit stuffing. The CSP throughput is slightly lower than this because CFP also needs to send a 16-bit length field in the data of the first CAN frame of each CSP packet.
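
    The 131-bit figure can be derived from the field sizes of a CAN 2.0B extended frame. Together with the worst-case stuff-bit count implied by the 415.58 kbps number (23 bits, which is my back-calculation rather than a figure from the calculator), it reproduces the throughput range quoted above:

    # Bits in a CAN 2.0B extended frame carrying 8 data bytes (no stuffing):
    # SOF + base ID + SRR + IDE + extended ID + RTR + r1/r0 + DLC + data
    # + CRC + CRC delimiter + ACK slot + ACK delimiter + EOF + interframe space
    frame_bits = 1 + 11 + 1 + 1 + 18 + 1 + 2 + 4 + 64 + 15 + 1 + 1 + 1 + 7 + 3  # 131

    best = 64 / frame_bits          # ~0.4885 -> 488.55 kbps on a 1 Mbps bus
    worst = 64 / (frame_bits + 23)  # ~0.4156 -> 415.58 kbps with 23 stuff bits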

    We can run csp-ping-server with a CAN interface like so.

    csp-ping-server --addr 10 --can vcan0

    Here we show a 200 kbps csp-iperf test over the vcan interface. This is close to the maximum symmetric data rate that we can get on a 1 Mbps CAN interface.

    We can use canbusload from can-utils to monitor the CAN interface utilization.

    canbusload vcan0@1000000
    [...]
    vcan0@1M 6500 1025000 404000 0 102%
    [...]

    By default canbusload assumes worst-case bit stuffing when computing the bus utilization, so it is possible to see a utilization above 100% even in real CAN interfaces. We can use the -e option in canbusload to use an exact bit stuffing calculation. This gives the actual bus load caused by these packets.

    canbusload -e vcan0@1000000
    [...]
    vcan0@1M 6500 865221 404000 0 86%
    [...]

    For low-level debugging of CAN networks, cspdump has an option that writes both CAN frames and the defragmented CSP packets to the same PCAP file. We can use this as follows.

    cspdump --can vcan0 --include-can-frames \
    --pcap-file /tmp/csp.pcap

    In Wireshark we can see the CAN frames that contain CFP fragments, and the defragmented CSP packet afterwards. It is possible to read some fields directly from the CAN ID in hex. For instance, the first byte (0x0b) is the source CSP address. The second to last byte counts down the number of remaining fragments (0x64, 0x60, 0x5c, …, 0x04, 0x00), but it counts in units of 4 because the next field is 10 bits long.

    Wireshark dissector for RDP packets

    RDP is the Reliable Datagram Protocol defined in RFC 908. CSP uses a variant of this protocol to transmit data reliably in a sequence-controlled way. The Wireshark dissector can parse CSP RDP packets. Here I show an example produced by two test applications that use libcsp to send and receive a short message using RDP.

    The RDP connection starts with a SYN packet that establishes the connection options and the initiator’s initial sequence number. This packet is shown here.

    CSP RDP SYN shown in Wireshark

    The responder sends a SYN,ACK packet that establishes the connection and sets its initial sequence number. The initiator sends an empty ACK, and then it sends a data packet, which is shown here. Note that the RDP header is actually placed after the data.

    CSP RDP data packet

    Finally, the initiator sends an ACK,RST packet, since it has no more data to send. The responder replies with an ACK,RST packet that confirms the reception of the data packet and closes the connection.

  • V16 beacon full uplink conversation

    In my previous post I decoded a transmission from a V16 beacon. The V16 beacon is a device that has mandatorily replaced warning triangles in Spain in 2026. It contains a strobe light and an NB-IoT modem that sends its GNSS geolocation using the cellular network. It is said that the beacon first transmits its geolocation 100 seconds after it has been powered on, and then it transmits it again every 100 seconds. In that post I recorded one of those transmissions, done after the beacon had been powered on for a few minutes, and I decoded it by hand. I showed that the transmission contains a control plane service request NAS message that embeds a 158 byte encrypted message, which presumably contains the geolocation and other beacon data.

    In that post I couldn’t show how the beacon connects to the cellular network and sets up the EPS security context used to encrypt the message, since that would have happened some minutes before I made the recording. I have now made a recording that contains both the NB-IoT uplink and the corresponding NB-IoT downlink and starts before the V16 beacon is switched on. In this post I show the contents of the uplink recording.

    Recording setup

    In the previous post I determined that the beacon transmits on 832.3 MHz, which is the NB-IoT uplink in the B20 band for Orange. The corresponding downlink frequency is 791.3 MHz. Luckily, the FDD duplex spacing of the B20 band is only 41 MHz, so a 61.44 Msps SDR can be used to record both channels simultaneously. In this case I have used a USRP B205mini.

    A problem when recording the B20 band with devices based on the AD936x chips is that the LO frequency is around 800 MHz, so the third harmonic falls around 2.4 GHz. Because the LO is a square wave, it is quite common for WiFi and other 2.4 GHz signals to show up in the B20 recording. When recording a single channel, some clever adjustment of the LO frequency can move the 2.4 GHz interference away from the channel of interest. However, when recording these two NB-IoT channels, this is more difficult.

    To improve the quality of the recording, I made a ground plane antenna similar to the one that I made for recording DME signals, and I added a coax stub to this antenna to serve as a notch filter for 2.4 GHz. I selected 818 MHz as the LO frequency. With all this, there is little to no interference in the downlink channel, and only some occasional weaker WiFi signals in the uplink channel (which is not really a problem, because the V16 beacon uplink is very strong).
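
    For reference, here is a ballpark figure for the length of such a stub, assuming a quarter-wave open stub and a solid-polyethylene coax velocity factor of 0.66 (both are assumptions for illustration, not construction details from the build):

    # Approximate quarter-wave open-stub length for a 2.45 GHz notch,
    # assuming a coax velocity factor of 0.66:
    c = 299792458.0
    stub_length = 0.66 * c / (4 * 2.45e9)
    print(f"{stub_length * 1e3:.1f} mm")  # ~20.2 mm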

    I also set up the beacon differently compared to the previous recording. In the previous recording I placed the beacon on a window that faces the nearest cell tower. I placed the beacon on a window rather than testing indoors because I assume that the beacon needs to have a direct view of the sky to get a good GNSS fix. For this recording, I set the antenna and SDR on that window, to get good SNR on the downlink signal. To prevent the V16 beacon from saturating the receiver, I placed the beacon on the window on the opposite side of the house. This is about 12 meters away from the receiving antenna, with an almost line-of-sight path. The attenuation is high enough that the V16 beacon does not saturate the receiver when it is set to a gain that is reasonable for the NB-IoT downlink signal.

    I used a GNU Radio flowgraph to receive IQ data at 61.44 Msps from the USRP, downconvert each of the two channels to baseband, downsample them to 320 ksps, and record to disk. I recorded in GNU Radio metadata format in order to have rx_time tags from the USRP. I set the USRP to the PC clock (which was synchronized with NTP) without using a PPS signal, so the recording timestamps should be accurate to within a few ms.
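
    The channelization amounts to a frequency shift and a decimation by 192 for each channel. As a rough offline equivalent of the flowgraph (a sketch; the actual filter design and metadata handling are not reproduced here):

    import numpy as np
    from scipy.signal import resample_poly

    def channelize(iq, fs, f_lo, f_channel, out_rate=320e3):
        # Shift the channel of interest to baseband and decimate
        # (61.44 Msps / 320 ksps = 192)
        n = np.arange(iq.size)
        shifted = iq * np.exp(-2j * np.pi * (f_channel - f_lo) * n / fs)
        return resample_poly(shifted, up=1, down=int(round(fs / out_rate)))

    # uplink = channelize(iq, 61.44e6, 818e6, 832.3e6)
    # downlink = channelize(iq, 61.44e6, 818e6, 791.3e6)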

    I started the recording at 11:44:19 UTC, and then I waited until 11:45:00 UTC to turn on the V16 beacon. I timed this using my watch, which I had synchronized by hand to the PC clock, so the switch on time of the V16 beacon should be accurate to within one or two seconds.

    Waterfall analysis

    By looking at the waterfall of the recording in Inspectrum we can already get some good information about how the beacon works. The beacon first transmits 40 seconds after being switched on. There is an NPRACH followed by a bunch of NPUSCH transmissions. Below we will see that this corresponds to the initial connection to the cell network.

    First transmission of the V16 beacon

    After 22 seconds we see another transmission, which only contains two NPUSCH. We will see that what happens here is that the eNB sends the beacon an RRC connection release because of inactivity. The beacon is just acknowledging that message.

    Second transmission of the V16 beacon

    Roughly 100 seconds after the first transmission, there is another transmission. Its waterfall looks quite similar to the waterfall of the recording I used in the previous post, because this is the same kind of transmission of a control plane service request NAS message. We can also note that the NPRACH sequence used here is different from (and longer than) the one used in the first transmission.

    Third transmission of the V16 beacon

    After 22 seconds there is another short transmission of two NPUSCH that corresponds to the acknowledgement of the RRC connection release because of inactivity.

    This process repeats approximately every 100 seconds. There is another transmission that looks very similar to this third transmission (it is not identical because the eNB sometimes schedules things slightly differently), and 22 seconds later there is the acknowledgement of the RRC connection release. The recording ends here after 342 seconds, but I assume that the beacon would have continued sending control plane service request NAS messages every 100 seconds and being timed out 22 seconds later.

    Decoding

    To decode the NPUSCH transmissions in the uplink I have used the same code and procedures as in the previous post, so I will not repeat the details here. I will only mention the things that are different or interesting.

    The first thing to note is that the physical cell ID is different. I quickly realized this when I saw that the Hadamard vector \(w_u(n)\) used in the NPUSCH format 2 DMRS was different. This depends on \(u = N^{\mathrm{Ncell}}_{\mathrm{ID}}\ \operatorname{mod} 16\). Whereas in the previous recording \(N^{\mathrm{Ncell}}_{\mathrm{ID}} = 135\), I found that in this recording \(N^{\mathrm{Ncell}}_{\mathrm{ID}} = 261\). The two recordings were made 1.5 months apart, and I don’t think that the cell IDs have been renumbered. I suspect that since the beacon was placed in a different window, it decided to connect to a different eNB. I have not looked at the downlink recording yet, but I am somewhat concerned that it might contain the downlink signal of \(N^{\mathrm{Ncell}}_{\mathrm{ID}} = 135\) with much higher SNR than that of \(N^{\mathrm{Ncell}}_{\mathrm{ID}} = 261\), which is the cell that I would need to see what the beacon is receiving on the downlink.

    Another difference is that the first NPUSCH format 1 transmission in this recording is only one resource unit long. In the previous post, the first NPUSCH transmission was 6 resource units long. We saw that it was composed of 2 repetitions of a 3 resource unit transport block. Using MCS 1, this gave a transport block size of 88 bits, which is appropriate for the NB-IoT RRC connection request message. Here it turns out that the transmission uses MCS 6. With one resource unit this also gives a transport block size of 88 bits. However, the coding rate for MCS 6 is higher than 1/3, so the rate matching algorithm does not repeat any Turbo codeword bits. This means that we cannot use rate matching repetitions to find the RNTI. Therefore, to find the RNTI we attempt to decode the transport block and check the CRC. The CRC check will only be successful for the correct RNTI. As in the previous post, we use the scrambling sequence from an NPUSCH format 2 transmission to narrow down the set of RNTI candidates.

    Since the UE is assigned a different RNTI every time that it transmits NPRACH after having been released from the RRC because of inactivity, we need to repeat this search each time.
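
    To give a flavour of the candidate narrowing, here is a sketch. It generates the TS 36.211 Section 7.2 pseudo-random (Gold) sequence for each candidate RNTI and compares it against the scrambling bits recovered from an NPUSCH format 2 transmission (whose payload is known). The c_init formula is my transcription of the NPUSCH scrambling initialization, mirroring the LTE PUSCH formula, and should be checked against TS 36.211; the real search also involves the rate matching and CRC checks described above.

    import numpy as np

    def gold_sequence(c_init, length, nc=1600):
        # Pseudo-random sequence generator from TS 36.211 Section 7.2
        total = nc + length + 31
        x1 = np.zeros(total, dtype=np.uint8)
        x2 = np.zeros(total, dtype=np.uint8)
        x1[0] = 1
        x2[:31] = [(c_init >> j) & 1 for j in range(31)]
        for n in range(total - 31):
            x1[n + 31] = x1[n + 3] ^ x1[n]
            x2[n + 31] = x2[n + 3] ^ x2[n + 2] ^ x2[n + 1] ^ x2[n]
        return x1[nc:nc + length] ^ x2[nc:nc + length]

    def npusch_c_init(rnti, ns, nf, ncell_id):
        # Scrambling initialization (my transcription; check TS 36.211 10.1.3.1)
        return (rnti << 14) | ((nf % 2) << 13) | ((ns // 2) << 9) | ncell_id

    def rnti_candidates(scrambling_bits, ns, nf, ncell_id=261):
        # scrambling_bits: bits recovered from an NPUSCH format 2 transmission
        # by exploiting its known payload (slow but simple exhaustive search)
        return [rnti for rnti in range(1 << 16)
                if np.array_equal(
                    gold_sequence(npusch_c_init(rnti, ns, nf, ncell_id),
                                  scrambling_bits.size),
                    scrambling_bits)]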

    Once we have obtained synchronization to the cell frame and subframe structure, we can propagate this synchronization by counting samples in the IQ recording. This count occasionally slips by one subframe when propagating over long intervals, due to the sampling frequency offset (a slip of one 1 ms subframe over 300 seconds corresponds to an offset of only about 3.3 ppm). However, it is easy to notice and correct the slip, because many scrambling sequences and DMRS sequences depend on the slot number.

    Decoding is just a matter of selecting the start subframe, duration and occupied subcarriers for each transmission by looking at the Inspectrum waterfall, doing a bit of guesswork to find the MCS, and finding the RNTI after each NPRACH.

    MAC layer PDUs

    In this section I analyse the decoded MAC PDUs in Wireshark. The first PDU is the initial RRC connection request transmitted after the first NPRACH. This RRC connection request is similar to the one we had seen in the previous post, but there are some differences. First, the UE identity uses a random value rather than an S-TMSI. This is because the UE has not been assigned a TMSI by the network yet. Recall that the UE identity in this message is only used for contention resolution. The establishment cause is mo-Signalling rather than mo-Data. This makes sense, because at this moment the V16 beacon wants to register to the network rather than to send any data. Finally, the cqi-NPDCCH-r14 field contains a value of candidateRep-C (in the previous post it contained “no measurements”). The values of this field are defined in Table 9.1.22.15-1 in TS 36.133. The value of candidateRep-C given here means that the UE believes that it needs 4 NPDCCH repetitions to achieve a decoding error rate lower than 1%. Also, the data volume report indicates that the UE only has between 68 and 91 bytes to send.

    Initial RRC connection request

    The next PDU that the UE transmits is an RRC connection setup complete message that contains a NAS PDU with an attach request NAS message and a PDN connectivity request ESM message. This is the beacon registering to the network. In addition, the RRC connection setup complete contains signal quality measurements of the serving cell: an NRSRP of around -95 dBm and an NRSRQ of -11 dB.

    Attach request

    The dissection of the NAS PDU is given here. We can see that it contains the IMSI of the V16 beacon (214032698247044) and that it states that the UE supports the control plane CIoT EPS optimization.

    Non-Access-Stratum (NAS)PDU
    0000 .... = Security header type: Plain NAS message, not security protected (0)
    .... 0111 = Protocol discriminator: EPS mobility management messages (0x7)
    NAS EPS Mobility Management Message Type: Attach request (0x41)
    0... .... = Type of security context flag (TSC): Native security context (for KSIasme or KSIamf)
    .111 .... = NAS key set identifier: No key is available (7)
    .... 0... = Spare bit(s): 0x00
    .... .001 = EPS attach type: EPS attach (1)
    EPS mobile identity
    Length: 8
    .... 1... = Odd/even indication: Odd number of identity digits
    .... .001 = Type of identity: IMSI (1)
    IMSI: 214032698247044
    [Association IMSI: 214032698247044]
    Mobile Country Code (MCC): Spain (214)
    Mobile Network Code (MNC): France Telecom España, SA (03)
    UE network capability
    Length: 7
    1... .... = EEA0: Supported
    .1.. .... = 128-EEA1: Supported
    ..1. .... = 128-EEA2: Supported
    ...1 .... = 128-EEA3: Supported
    .... 0... = EEA4: Not supported
    .... .0.. = EEA5: Not supported
    .... ..0. = EEA6: Not supported
    .... ...0 = EEA7: Not supported
    1... .... = EIA0: Supported
    .1.. .... = 128-EIA1: Supported
    ..1. .... = 128-EIA2: Supported
    ...1 .... = 128-EIA3: Supported
    .... 0... = EIA4: Not supported
    .... .0.. = EIA5: Not supported
    .... ..0. = EIA6: Not supported
    .... ...0 = EPS-UPIP: Not supported
    0... .... = UEA0: Not supported
    .0.. .... = UEA1: Not supported
    ..0. .... = UEA2: Not supported
    ...0 .... = UEA3: Not supported
    .... 0... = UEA4: Not supported
    .... .0.. = UEA5: Not supported
    .... ..0. = UEA6: Not supported
    .... ...0 = UEA7: Not supported
    0... .... = UCS2 support (UCS2): The UE has a preference for the default alphabet
    .0.. .... = UMTS integrity algorithm UIA1: Not supported
    ..0. .... = UMTS integrity algorithm UIA2: Not supported
    ...0 .... = UMTS integrity algorithm UIA3: Not supported
    .... 0... = UMTS integrity algorithm UIA4: Not supported
    .... .0.. = UMTS integrity algorithm UIA5: Not supported
    .... ..0. = UMTS integrity algorithm UIA6: Not supported
    .... ...0 = UMTS integrity algorithm UIA7: Not supported
    0... .... = ProSe direct discovery: Not supported
    .0.. .... = ProSe: Not supported
    ..0. .... = H.245 After SRVCC Handover: Not supported
    ...0 .... = Access class control for CSFB: Not supported
    .... 1... = LTE Positioning Protocol: Supported
    .... .1.. = Location services (LCS) notification mechanisms: Supported
    .... ..0. = SRVCC from E-UTRAN to cdma2000 1xCS: Not supported
    .... ...0 = Notification procedure: Not supported
    1... .... = Extended protocol configuration options: Supported
    .0.. .... = Header compression for control plane CIoT EPS optimization: Not supported
    ..0. .... = EMM-REGISTERED w/o PDN connectivity: Not supported
    ...0 .... = S1-U data transfer: Not supported
    .... 0... = User plane CIoT EPS optimization: Not supported
    .... .1.. = Control plane CIoT EPS optimization: Supported
    .... ..0. = ProSe UE-to-network relay: Not supported
    .... ...0 = ProSe direct communication: Not supported
    0... .... = Signalling for a maximum number of 15 EPS bearer contexts: Not supported
    .0.. .... = Service gap control: Not supported
    ..0. .... = N1 mode for 3GPP access: Not supported
    ...0 .... = Dual connectivity with NR: Not supported
    .... 1... = Control plane data backoff: Supported
    .... .1.. = Restriction on use of enhanced coverage: Supported
    .... ..0. = V2X communication over PC5: Not supported
    .... ...0 = Multiple DRB: Not supported
    ESM message container
    Length: 39
    ESM message container contents: 0201d0117b00208080211001010010810600000000830600000000000d00001000000a00001600
    0000 .... = EPS bearer identity: No EPS bearer identity assigned (0)
    .... 0010 = Protocol discriminator: EPS session management messages (0x2)
    Procedure transaction identity: 1
    NAS EPS session management messages: PDN connectivity request (0xd0)
    0001 .... = PDN type: IPv4 (1)
    .... 0001 = Request type: Initial request (1)
    Extended protocol configuration options
    Element ID: 0x7b
    Length: 32
    [Link direction: MS to network (0)]
    1... .... = Extension: True
    .... .000 = Configuration Protocol: PPP for use with IP PDP type or IP PDN type (0)
    Protocol or Container ID: Internet Protocol Control Protocol (0x8021)
    Length: 0x10 (16)
    PPP IP Control Protocol
    Code: Configuration Request (1)
    Identifier: 1 (0x01)
    Length: 16
    Options: (12 bytes), Primary DNS Server IP Address, Secondary DNS Server IP Address
    Primary DNS Server IP Address
    Type: Primary DNS Server IP Address (129)
    Length: 6
    Primary DNS Address: 0.0.0.0
    Secondary DNS Server IP Address
    Type: Secondary DNS Server IP Address (131)
    Length: 6
    Secondary DNS Address: 0.0.0.0
    Protocol or Container ID: DNS Server IPv4 Address Request (0x000d)
    Length: 0x00 (0)
    Protocol or Container ID: IPv4 Link MTU Request (0x0010)
    Length: 0x00 (0)
    Protocol or Container ID: IP address allocation via NAS signalling (0x000a)
    Length: 0x00 (0)
    Protocol or Container ID: APN rate control support indicator (0x0016)
    Length: 0x00 (0)
    Additional update type
    1111 .... = Element ID: 0xf-
    .... 01.. = Preferred CIoT network behaviour: Control plane CIoT EPS optimization (1)
    .... ..0. = SAF: Keeping the NAS signalling connection is not required after the completion of the tracking area updating procedure
    .... ...1 = AUTV: SMS only
    MS network feature support
    1100 .... = Element ID: 0xc-
    .... 000. = Spare bit(s): 0
    .... ...1 = Extended periodic timers: MS supports the extended periodic timer in this domain
    GPRS Timer 2 - T3324 value
    Element ID: 0x6a
    Length: 1
    GPRS Timer: 1 min
    001. .... = Unit: value is incremented in multiples of 1 minute (1)
    ...0 0001 = Timer value: 1
    GPRS Timer 3 - T3412 extended value
    Element ID: 0x5e
    Length: 1
    GPRS Timer: 80 hours
    010. .... = Unit: value is incremented in multiples of 10 hours (2)
    ...0 1000 = Timer value: 8

    The next message is a UE capability information message. It is fragmented into two AM PDUs, since the first uplink grant was not large enough to transmit this message. We see that the UE is category NB2. This page has a comparison table between categories NB1 and NB2. The UE supports the bands B8, B20, and B28. In Spain, B8 is used for GSM and UMTS, and for LTE by Movistar, but there is no NB-IoT (indeed there is a GSM carrier in the left guardband where NB-IoT could be placed). B28 is used for LTE and 5G, and there is no NB-IoT. So this leaves B20 as the only available band for this UE. It is actually the only band in Spain that has NB-IoT. The UE capability information message is typically sent as a reply to a UE capability enquiry message, so we can assume that that is what the UE saw in the downlink.

    UE capability information message

    The second AM fragment of the capability information message is retransmitted twice. We had already seen this quirk in the previous recording, where some AM packets were retransmitted, presumably because the AM ACK took a while to be received.

    The next message is an authentication response NAS PDU. We infer that the network has sent an authentication request message. The authentication response proves that the UE has the SIM with the right key, and that the network also has this key. As part of this authentication process, keys for an EPS security context that allows encryption and integrity protection of NAS messages are derived.

    Authentication response message

    The next message is also fragmented into two AM PDUs. It is a NAS message that is “integrity protected and ciphered with new EPS security context”. We know (see Table 9.3.1 in TS 24.301) that the only message that can carry this value in the security header type is the security mode complete NAS message. We infer that this message has been sent as a reply to a security mode command NAS message sent by the network. My post using srsRAN as an example for LTE MAC layer analysis shows how this exchange works.

    Encrypted security mode complete message

    However, we notice that the encrypted security mode complete message here is quite long. The NAS PDU has 91 bytes. In comparison, the security mode complete NAS PDU in the srsRAN example has only 8 bytes. What causes the difference? To guess this we need to look at Table 8.2.21.1 in TS 24.301, which shows the structure of the security mode complete message.

    Security mode complete message structure, taken from TS 24.301

    The mandatory fields in the message are short (they account for the last 2 bytes in the 8-byte srsRAN example). However, there are some optional fields. Three of these are sent as replies to optional requests included in the security mode command message: IMEISV request, UE radio capability ID request, and UE coarse location information request. The replayed NAS message container is described in Section 9.9.3.51 in TS 24.301. This section says that “The purpose of the Replayed NAS message container IE is to, during an ongoing attach or tracking area updating procedure, re-send the ATTACH REQUEST or TRACKING AREA UPDATE REQUEST message with which the UE had initiated the procedure, if the MME has included a \(\mathrm{HASH}_{\mathrm{MME}}\) in the SECURITY MODE COMMAND message and the \(\mathrm{HASH}_{\mathrm{MME}}\) is different from the hash value locally calculated at the UE as described in 3GPP TS 33.401.”

    The attach request NAS PDU in this recording had 69 bytes, which would leave 22 bytes for the mandatory fields and the other optional fields. So I suspect that this is what is happening here, and that the replayed attach request is responsible for most of the size of the security mode complete message.

    The next message is an encrypted NAS message. The size of the NAS PDU is 13 bytes. Judging by the moment at which it happens and by its size, I think that this is an attach complete message. The attach complete in the srsRAN example also has 13 bytes, and it happens at this point, after the UE receives an attach accept message. In that example, the attach accept carries an activate default EPS bearer context request ESM message that activates EPS bearer 5, and the attach complete carries an activate default EPS bearer context accept ESM message that confirms this activation. I guess that the same happens here.

    Encrypted attach complete message

    After this message, the UE sends an AM ACK in SRB 3, so we infer that the eNB has sent something in the DL-DCCH-NB, but that message does not have a reply from the UE. At this point, there are no transmissions until 21 seconds later the UE sends another AM ACK in SRB 3. As I mentioned above, I think that this is an ACK to an RRC connection release message in which the UE is told to disconnect because of inactivity.

    So far we have seen that 40 seconds after being switched on, the V16 beacon transmits to attach itself to the NB-IoT network, which includes authenticating and establishing an EPS security context that is used to encrypt NAS messages, and establishing an IPv4 PDN connection. There is no indication that any GNSS geolocation data or other beacon-specific metadata has been transmitted at this point.

    Roughly 100 seconds after the attach, the V16 beacon sends NPRACH again and performs a conversation to send a control plane service request message. This conversation is pretty much the same as the one I analysed in the previous post, so I will not give all the details here. I will only point out the details that are interesting.

    The RRC connection request is similar to the initial one. However, we see that rather than using a random value as the UE identity for contention resolution, now the S-TMSI is used, presumably because the UE was assigned an S-TMSI as part of the attach accept message. The S-TMSI remains valid while the beacon is powered on, and we can see that it uses the same S-TMSI in the following RRC connection requests.

    Second RRC connection request

    This gives an eavesdropper a way to track V16 beacons by listening to the NB-IoT uplink. Assuming that the only UEs that are transmitting are V16 beacons that behave like this one, or that other UEs can be ignored by how they behave, the eavesdropper can use the following facts:

    • The beacon transmits the IMSI as part of the attach request
    • The first RRC reconnection happens roughly 100 seconds after the attach request, and contains the S-TMSI. Assuming that not many beacons are active, this gives the eavesdropper a way to associate the IMSI and the S-TMSI by this timing
    • All future RRC connection requests use the same S-TMSI while the beacon is powered on
    • Within an RRC connection, all the NPUSCH transmissions done by the same UE can be tied to the S-TMSI in the RRC connection request because they share the RNTI

    Another thing we notice in this RRC connection request is that the establishment cause is now mo-Data instead of mo-Signalling. This makes sense, because the beacon now has user data to transmit.

    Note that the beacon timing does not match what the high-level descriptions of V16 beacons say. It is said that a V16 beacon first transmits its position data 100 seconds after power on, and then it retransmits every 100 seconds. However here we see that this beacon attaches to the network 40 seconds after power on, sends its first position data 40 + 100 = 140 seconds after power on, and then it sends the position data again every 100 seconds.

    The RRC connection setup complete message is used to send a control plane service request NAS message as in the previous post. There are a couple of curious things. First, I have no idea why Wireshark is attempting to dissect the ESM message container contents field, which is encrypted. It doesn’t do so with the PCAP from the previous post or with the control plane service request message that comes 100 seconds later in this PCAP. Because the field is encrypted, Wireshark gets nonsensical values and gives a warning.

    Control plane service request message

    The second remark is that the RRC connection setup complete message includes NRSRP and NRSRQ measurements, as in the first RRC connection setup complete message. I don’t know why these weren’t present in the previous post. Maybe something is configured differently in the SIB-NB of the two cells, since in the previous post the UE also sent “no measurements” in the cqi-NPDCCH-r14 field of the RRC connection request.

    Another thing worth noting is that the sequence number for this NAS PDU is 2. The encrypted attach complete message was sent with sequence number 1, and the encrypted security mode complete message was sent with sequence number 0. Every time that the beacon now wakes up to send a control plane service request, the sequence number is incremented by one. In the previous post the control plane service request message had a sequence number of 5. This means that this was the 4th time that the beacon transmitted its position data, so it had been powered on for about 440 seconds.

    After the control plane service request message is sent, just as in the previous post, we see a partial retransmission because the AM ACK is not received in time. Then the UE sends an AM ACK to acknowledge the reply to the control plane service request, and it does not transmit again until it needs to ACK the RRC connection release.

    The following Wireshark listing shows a summary of all the MAC PDUs in the recording.

    Full conversation of V16 beacon uplink

    What is inside the control plane service request encrypted data?

    Since the ESM message inside the control plane service request NAS message is encrypted, we do not know exactly what it contains. However, we can make an educated guess. While doing more research about the V16 beacons, I found a vulnerability report written by Luis Miranda Acebedo a few weeks ago. In this report, Luis indicates that he used physical access to a V16 beacon and took advantage of a debug serial port to obtain logs. The logs contain AT commands exchanged with the beacon NB-IoT modem and packet payload contents. These indicate that the beacon data, which includes GNSS geolocation, IMEI, a timestamp, and other parameters, is sent in plaintext in a UDP message to IP address 172.31.46.149 port 50000.

    Luis’ report is based on studying a beacon of the brand Help Flash IoT. This model is different from the one I’m using here, and it uses the Vodafone network instead of the Orange network, so we cannot directly extrapolate Luis’ findings. However, we can imagine that perhaps Orange and Vodafone have set up their V16 beacon network infrastructure similarly.

    The control plane CIoT EPS optimization supports sending IP and non-IP user data. The NAS message sent by the UE arrives at the MME, and the MME forwards the user data to its corresponding destination: IP user data is sent to the P-GW for the corresponding PDN through the S-GW; non-IP user data is sent directly to an SCEF (see for instance the diagram in Sharetechnote). Here we have seen that the UE included in the attach request a PDN connectivity request for an IPv4 PDN. I believe that this means that it must be using the control plane CIoT EPS optimization to send IPv4 data to this PDN, rather than non-IP data.

    It is quite reasonable to think that the IPv4 user data is a UDP packet, as in the beacon that uses the Vodafone network. Since we have seen that there is only one control plane service request message sent each time, and that presumably a response message is received in the DL-DCCH-NB (we can see the corresponding AM ACK), this rules out more complex protocols with handshaking at the IP level. It is also quite likely that the data in this UDP packet is unencrypted, and that the system relies on the encryption and integrity given by the control plane service request NAS message.

    Luis points out in his report that the fact that the UDP packet data is unencrypted and unauthenticated is a security problem. I do not agree. If the PDN where this packet goes is a trusted private PDN, then sending an unencrypted packet is probably fine from a security point of view. The data is encrypted and integrity protected in the control plane service request NAS message by using the EPS security context. This provides a secure tunnel between the UE and the MME that has been established by authenticating both endpoints using the keys in the UE SIM. After the UDP packet reaches the MME, it is then transported only inside a trusted private network. In particular, the data is protected while it crosses the radio network, which is where an attacker could attempt to read or modify the data.

    There is probably still a potential attack vector, though. If the SIM (or eSIM) in a V16 beacon is moved to an NB-IoT modem connected to a computer, or if the software that runs on the beacon is modified, an attacker could connect to the cellular network in the same way as a regular V16 beacon and use the same procedure to send a UDP packet that contains fake data, such as false geolocation data and/or the IMEI or serial number of another beacon. This would probably inject a fake report successfully into the system. It is possible that the system performs some kind of deep packet inspection that allows it to reject some of this fake data, but I don’t think it is this sophisticated. For instance, it could check that the IMEI and other data identifying the beacon that appear inside the UDP message match the IMSI that was used to establish the network connection (recall that the IMSI has been validated as part of the UE authentication). However, I don’t think that the IMSI is available at the receiver of the UDP packet. The only solution that I see to this attack vector is to ensure that all V16 beacons are completely tamper proof. Otherwise, even if the UDP payload was encrypted and authenticated, it would be possible to modify the beacon software to inject fake data before the encryption and authentication happen. Preventing any kind of tampering is probably too expensive to be feasible. Also keep in mind that if any abuse is detected, the network operator can disable the SIM that is being used, so future authentication attempts using that SIM will be rejected by the network.

    Summary

    In this post we have decoded a recording of the NB-IoT uplink of a V16 beacon from the moment it is turned on and begins transmitting. We see that 40 seconds after power on, the beacon attaches to the NB-IoT network and obtains PDN connectivity. This involves authenticating against the network and obtaining keys for an EPS security context used to encrypt subsequent NAS messages. The beacon does not send any GNSS geolocation data at this point. After 100 seconds, the beacon reconnects to the RRC and sends a control plane service request message that embeds the GNSS geolocation and other beacon metadata, encrypted with the EPS security context. This is done again every 100 seconds. The eNB releases the beacon from the RRC because of inactivity roughly 20 seconds after each of these transmissions. This requires the beacon to send ACKs both at the HARQ level and at the AM level.

    Code and data

    I have published the recording used in this post in the dataset “Recording of the NB-IoT uplink of a V16 beacon” in Zenodo. The remaining materials are in my LTE repository. This includes the Jupyter notebook used for decoding, the GNU Radio flowgraph used to resample the recording to 1.92 Msps, and the PCAP file containing the decoded MAC PDUs.

  • Decoding a V16 beacon

    The V16 beacon is a car warning beacon that will mandatorily replace the warning triangles in Spain starting in 2026. In the event of an emergency, this beacon can be magnetically attached to the roof of the car and switched on. It has a bright LED strobe light and a connection to the cellular network, which it uses to send its GNSS position to the DGT 3.0 cloud network (for readers outside of Spain, the Spanish DGT is roughly the equivalent of the US DMV). The main point of these beacons is that placing warning triangles far enough from a vehicle can be dangerous, while this beacon can be placed without leaving the car.

    There has been some criticism surrounding the V16 beacons and their mandatory usage starting in January 2026, both for economic and rollout-roadmap reasons, and also for purely technical reasons. The strobe light is so bright that you shouldn’t look at it directly while standing next to the beacon (which makes it tricky to pick it up and switch it off), but I have heard that it is not so easy to see in daylight from several hundred meters away.

    The GNSS geolocation and cellular network service is also somewhat questionable. I purchased a V16 beacon from the brand NK connected (certificate number LCOE 2024070678G1), for no reason other than the fact that it was sold in a common supermarket chain. The instructions in the box directed me to the website validatuv16.com for testing it. On this website you can register the serial number or IMEI of your beacon and your email. Then you switch on the beacon. After 100 seconds the beacon should send a message to the DGT network, and then periodically every 100 seconds. This test service is somehow subscribed to the DGT network, and it sends you an email that contains the message data (GNSS position and battery status) when the DGT network receives it. This is great, but there is no test mode or anything that declares that you are using the beacon just for testing purposes. They only say that you should not leave the beacon on for much longer than what it takes you to receive the email, to avoid the test being mistaken for a real emergency. The fact that the test procedure for this system is literally the same as the emergency procedure is a red flag for me. Additionally, this beacon only includes cellular data service for 12 years, and it is not clear what happens after that.

    Technical shortcomings aside, my main interest is how the RF connection to the DGT network works. The beacon I bought has a logo in the box saying that it uses the Orange cellular network. When I tested it, after receiving the confirmation email from the test service, I used a Pluto SDR running Maia SDR and quickly found that the beacon was transmitting NB-IoT on 832.3 MHz. I made a recording of one of the periodic transmissions. In this post I analyse and decode the recording.

    NB-IoT (Narrowband IoT) is a variant of LTE that is intended for low-power low-data-rate IoT devices. It uses a dedicated carrier that is 180 kHz wide (12 subcarriers with the usual LTE 15 kHz subcarrier spacing). This carrier can be placed either inside the regular subcarrier grid of an LTE carrier, in the guardband next to an LTE carrier (which is how it is commonly deployed in Spain), or standalone. In my series of posts about the LTE downlink, I used a recording of the three 10 MHz LTE downlink carriers in the B20 band in Spain made in 2022. A waterfall of this recording is shown here. The three carriers are, in order of increasing frequency, Orange, Vodafone and Movistar. Each of them has an NB-IoT downlink carrier in the lower guardband. The Orange NB-IoT downlink is at 791.3 MHz, so my beacon transmits in the corresponding uplink frequency, which is 832.3 MHz.

    LTE downlink cells in B20 band

    The waterfall of the V16 beacon transmission that I recorded is shown here. The whole transmission is composed of multiple bursts, and it takes slightly less than one second.

    Waterfall of a V16 beacon transmission

    NB-IoT uplink fundamentals

    NB-IoT has many commonalities with LTE, but there are also large differences, both because everything needs to be crammed into just 12 subcarriers and because of changes to better suit IoT use cases (these usually involve simplifications, or robustness in bad signal conditions). The main reference for the NB-IoT physical layer is Section 10 in TS 36.211. The resource element grid for NB-IoT is the same as for LTE: 15 kHz subcarrier spacing, 0.5 ms slots composed of 7 symbols, with the first of them having a slightly longer cyclic prefix, 1 ms subframes, and 10 ms radio frames. Both in the uplink and downlink, the NB-IoT subcarriers are shifted up by 7.5 kHz, as in the LTE uplink, in order to avoid having a subcarrier at DC.

    There are only two physical channels in the NB-IoT uplink: the NPUSCH, which is the NB-IoT equivalent of the LTE PUSCH, and is used to transmit user data, and the NPRACH, which is the NB-IoT equivalent of the LTE PRACH and is used for random access of a UE at the start of the cell connection procedure. There is no channel corresponding to the LTE PUCCH, which sends uplink control information. As we will see below, control information is heavily simplified and sent in the NPUSCH. There are two NPUSCH formats: format 1, which sends data, and format 2, which sends control information.

    The first transmission in the recording is an NPRACH transmission. The NPRACH is based on a frequency hopping sequence. I will not analyse it in this post. The remaining transmissions are all NPUSCH.

    The NPUSCH physical layer is somewhat similar to the PUSCH in the sense that it uses SC-FDMA to improve the waveform PAPR (peak-to-average power ratio) and that some symbols in the middle of the slot are dedicated to a DMRS (demodulation reference signal). The main difference with the PUSCH is that an NPUSCH transmission can occupy 1, 3, 6 or 12 subcarriers. To allow this, NB-IoT introduces the concept of a resource unit, which is what determines which resource elements are allocated to a transport block. For a 15 kHz subcarrier spacing (NB-IoT also supports 3.75 kHz subcarrier spacing, but this is not used in this recording) and frame structure type 1 (which is used in FDD, as in this recording, and in IoT NTN TDD), there are the following resource unit shapes:

    • For NPUSCH format 1:
      • 1 subcarrier x 16 slots
      • 3 subcarriers x 8 slots
      • 6 subcarriers x 4 slots
      • 12 subcarriers x 2 slots
    • For NPUSCH format 2:
      • 1 subcarrier x 4 slots

    The lowest subcarrier of an allocation needs to be divisible by the number of subcarriers in the allocation. Therefore, for 3 subcarriers, the possible allocations are {-6,-5,-4}, {-3, -2, -1}, {0, 1, 2}, {3, 4, 5}, and for 6 subcarriers the possible allocations are {-6, -5, -4, -3, -2, -1}, and {0, 1, 2, 3, 4, 5}.
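
    Enumerating the allowed allocations takes just a couple of lines of Python (using the centred -6..5 subcarrier indices from above):

    # Valid NPUSCH subcarrier allocations: the lowest index must be divisible
    # by the allocation size (12 subcarriers is the single full allocation).
    for n in (1, 3, 6):
        allocs = [list(range(k, k + n)) for k in range(-6, 7 - n) if k % n == 0]
        print(n, allocs)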

    Single-tone NPUSCH and phase rotation

    NPUSCH transmissions allocated to several subcarriers, commonly referred to as multi-tone transmissions, use SC-FDMA as the LTE PUSCH does. NPUSCH transmissions allocated to one subcarrier are referred to as single-tone transmissions. In the 3GPP documentation they are also described as SC-FDMA, even though they are a special case in many sections. They are best described as single-carrier \(\pi/2\)-BPSK or \(\pi/4\)-QPSK modulation whose symbol duration matches the OFDM grid (so the first symbol in every 7 is slightly longer). The precise definition is in Section 10.1.5 in TS 36.211. The complex baseband time-domain waveform for a symbol is\[s_{k,l}(t) = a_{k^{(-)},l}\cdot e^{i\phi_{k,l}} \cdot e^{2\pi i(k + 1/2)\Delta f(t - N_{\mathrm{CP},l}T_s)},\qquad 0 \leq t < (N_{\mathrm{CP},l} + N)T_s.\]Here \(k \in \{-6, -5, \ldots, 5\}\) denotes the subcarrier that is used, \(l\) is the slot number, \(a_{k^{(-)},l}\) is the complex BPSK or QPSK symbol, \(\phi_{k,l}\) is a phase rotation term that is defined below, and the exponential on the right-hand side is the carrier wave of the \(k\)-th OFDM subcarrier (recall that subcarriers are shifted up in frequency by 1/2 of the subcarrier spacing). The term \(\Delta f\) is the subcarrier spacing, \(N_{\mathrm{CP},l}T_s\) is the cyclic prefix duration, and \(NT_s = 1/\Delta f\) is the useful symbol duration.

    The phase rotation term \(\phi_{k,l}\) is vaguely related to the 5G NR phase compensation, which I explained in this post. The key remark for why this phase rotation is needed is that if we consider the above formula without this term, then we don’t get a phase-continuous carrier wave. The issue is that the OFDM subcarrier does not advance an integer number of cycles over the symbol duration. More precisely, the phase of this exponential at the end of the symbol minus its phase at the beginning of the symbol is \(2\pi(k + 1/2)\Delta f(N_{\mathrm{CP},l} + N)T_s\), which is not an integer multiple of \(2\pi\).

    The phase rotation term is defined as\[\phi_{k,l} = \rho \cdot (\tilde{l} \ \operatorname{mod} 2) + \varphi_k(\tilde{l}).\]The index \(\tilde{l}\) counts over all the symbols in an NPUSCH transmission (the details of what exactly counts as a “transmission” are in the TS). The function \(\varphi_k\) is defined recursively as \(\varphi_k(0) = 0\) and\[\varphi_k(\tilde{l}) = \varphi_k(\tilde{l}-1) + 2\pi(k + 1/2)\Delta f(N_{\mathrm{CP},l} + N)T_s\]for \(\tilde{l} > 0\). To motivate this definition, consider the term\[\psi_{k,l}(t) = e^{i\varphi_k(\tilde{l})}\cdot e^{2\pi i(k + 1/2)\Delta f(t - N_{\mathrm{CP},l}T_s)}.\]This is equal to\[e^{i\varphi_k(\tilde{l} - 1)}\cdot e^{2\pi i(k + 1/2)\Delta f(t + NT_s)}.\]At \(t = 0\), this has the phase \(\varphi_k(\tilde{l} - 1) + 2\pi (k + 1/2)\Delta f N T_s\). If we compute \(\psi_{k,l-1}((N_{\mathrm{CP},l-1} + N)T_s)\), we also get the same phase. Therefore, we see that the expression \(\psi_{k,l}(t)\) defines a phase-continuous carrier wave piecewise in each of the OFDM symbols, due to how \(\varphi_k(\tilde{l})\) was defined.

    The term \(\rho\) is defined as \(\pi/2\) for a BPSK modulation and \(\pi/4\) for a QPSK modulation. Therefore, \(\rho \cdot (\tilde{l}\ \operatorname{mod} 2)\) implements either \(\pi/2\)-BPSK or \(\pi/4\)-QPSK by introducing a phase rotation of \(\rho\) to odd symbols. The goal of using \(\pi/2\)-BPSK and \(\pi/4\)-QPSK instead of BPSK and QPSK is to improve the PAPR by avoiding zero crossings.

    Therefore, we have seen that the single-tone NPUSCH modulation can be described in simpler terms as \(\pi/2\)-BPSK or \(\pi/4\)-QPSK with a rectangular pulse shape and a phase-continuous carrier. The more convoluted way in which it is defined in the TS is actually the practical way in which it needs to be treated when using OFDM for modulation or demodulation of this single-tone waveform.

    As a final note regarding the resetting of \(\tilde{l}\): for a subcarrier spacing of 15 kHz, \(\varphi_k(28)\) is an integer multiple of \(2\pi\), since the accumulated duration of 28 useful symbols and cyclic prefixes is 2 ms, which gives an integer when multiplied by \(\Delta f/2\). This means both that exactly when the TS says that \(\tilde{l}\) is reset does not have much practical importance, and that the phase rotation terms can be precomputed in a table of 28 entries for each of the 12 subcarriers.
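
    Here is a short Python sketch that precomputes this 28-entry phase rotation table for a given subcarrier. It is a straightforward transcription of the definitions above, not the code I actually used for decoding:

    import numpy as np

    DF = 15e3                # subcarrier spacing
    TS = 1 / (15e3 * 2048)   # LTE basic time unit T_s
    N = 2048                 # useful symbol duration in units of T_s
    NCP = [160, 144, 144, 144, 144, 144, 144]  # cyclic prefixes of a slot

    def phi_table(k, rho):
        # phi_{k,l} for the first 28 symbols of a transmission; the table
        # repeats with period 28, as explained above
        table = []
        varphi = 0.0
        for ltilde in range(28):
            if ltilde > 0:
                varphi += 2 * np.pi * (k + 0.5) * DF * (NCP[ltilde % 7] + N) * TS
            table.append((rho * (ltilde % 2) + varphi) % (2 * np.pi))
        return np.array(table)

    # pi/4-QPSK on subcarrier k = 1, as in the first transmission analysed below
    phis = phi_table(1, np.pi / 4)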

    Initial recording analysis and synchronization

    The waterfall of this recording, excluding the NPRACH transmission at the beginning, is shown again here. We can clearly see that most NPUSCH transmissions are single-tone. Their spectrum looks like a sinc function, due to the rectangular pulse shaping (which is ultimately caused by the OFDM modulation). However the transmitter has a filter that cuts off the sinc sidelobes outside the 180 kHz bandwidth of the cell, so the sidelobes do not extend forever and the spectrum is technically different depending on which subcarrier is occupied.

    Waterfall of NPUSCH transmissions

    The 5th burst is a multi-tone transmission using 3 subcarriers. The 7th burst is a multi-tone transmission using 6 subcarriers. The last burst is a very brief transmission that uses the 12 subcarriers.

    It is easy to perform a coarse frequency synchronization by reading off the centre frequencies of the single-tone transmissions in the waterfall. However, to make use of this technique we need to know which subcarrier index is used for each transmission. To find this, we can take advantage of the fact that the nulls in the sinc spectrum of the transmissions roughly correspond to the subcarrier spacing (roughly rather than exactly because the symbol duration is slightly longer than \(1/\Delta f\) due to the cyclic prefix). Therefore, we can count subcarriers by counting nulls. In this way we see that the second and third transmission use the outermost subcarriers, which are -6 and 5 respectively. The first transmission uses subcarrier 1.

    Finer frequency synchronization can be obtained by hand by looking at the time-domain waveform of a single-tone transmission that has been downconverted to baseband. This can also be done after demodulation, by looking at how the phase of the symbols rotates over time.

    The following plot shows the first 3 ms of the first transmission after being downconverted to baseband by shifting it down in frequency by \((1+1/2)\Delta f\) plus the carrier frequency offset.

    We see that the modulation is \(\pi/4\)-QPSK and that the phase does not rotate visibly, because the carrier frequency offset has already been determined quite accurately. The fact that this is a single-tone \(\pi/4\)-QPSK modulation can also be used to measure the carrier frequency. By computing the 8-th power of the complex baseband time-domain data, we remove the modulation and obtain a CW carrier whose frequency is 8 times that of the residual frequency error.
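
    As a rough sketch of this idea, assuming the transmission is available as a complex baseband array x sampled at rate fs, the residual frequency error can be read off the strongest line in the spectrum of the 8-th power:

    import numpy as np

    def cfo_pi4qpsk(x, fs):
        """Estimate the residual frequency error of a single-tone pi/4-QPSK
        signal: the 8-th power wipes off the modulation, leaving a CW line
        at 8 times the frequency error."""
        x8 = x ** 8
        spectrum = np.fft.fft(x8 * np.hanning(x8.size))
        freqs = np.fft.fftfreq(x8.size, 1 / fs)
        return freqs[np.argmax(np.abs(spectrum))] / 8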

    We can zoom in time to the beginning of this transmission to find the sample where the first symbol begins, which we need to perform demodulation.

    Technically speaking, we could demodulate these single-tone transmissions by downconverting them to baseband and accumulating each symbol. However, since we still need OFDM demodulation for multi-tone transmissions, we will use OFDM demodulation for all the NPUSCH transmissions. We perform OFDM demodulation as usual for LTE, by taking into account that the first symbol in each 7-symbol slot has a slightly longer cyclic prefix. To perform OFDM demodulation in a simple way, it is convenient to choose a sample rate such that the useful symbol duration and the cyclic prefix durations are all an integer number of samples. The lowest sample rate for which this happens is 1.92 Msps. I made the recording at 640 ksps, which is more than enough to capture the NB-IoT 180 kHz bandwidth. Therefore, to process the recording I have first used GNU Radio to interpolate the recording by a factor of 3 to obtain 1.92 Msps.
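
    A minimal sketch of this OFDM demodulation at 1.92 Msps is shown below. The useful symbol is 128 samples, and the cyclic prefixes are 10 samples for the first symbol of each slot and 9 samples for the rest (the standard LTE values scaled down by 16); x is assumed to be the recording aligned to the start of a slot.

    import numpy as np

    N_FFT = 128          # useful symbol length in samples at 1.92 Msps
    CP = [10] + [9] * 6  # cyclic prefixes of the 7 symbols in a slot

    def ofdm_demodulate(x, num_slots):
        """Return OFDM symbols indexed as [symbol, subcarrier], with the
        subcarriers in fftshift order (DC at index N_FFT // 2)."""
        symbols = []
        pos = 0
        for _ in range(num_slots):
            for cp in CP:
                pos += cp  # skip the cyclic prefix
                symbols.append(np.fft.fftshift(np.fft.fft(x[pos:pos+N_FFT])))
                pos += N_FFT
        return np.array(symbols)

    # the 12 NB-IoT subcarriers (k = -6, ..., 5) are the central FFT bins:
    # nbiot = ofdm_demodulate(x, num_slots)[:, N_FFT//2 - 6 : N_FFT//2 + 6]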

    After performing OFDM demodulation, we need to apply the phase rotation terms \(\phi_{k,l}\) that have been introduced above. It is critical that this phase rotation is implemented correctly. An error here will show up as an artificial carrier frequency offset, which will make us tune to the wrong frequency, causing other problems (such as inter-carrier interference). To verify that my phase rotation implementation is correct, I have used several methods to check that the carrier frequency offset has been cancelled out accurately without relying on this phase rotation. If we ensure this and we get a good constellation after applying the phase rotation to the OFDM demodulated symbols, this means that the phase rotation has been implemented correctly.

    The first method is to look at the amplitude in each subcarrier after OFDM demodulation. For a single-tone transmission all the power should be in a single subcarrier. A carrier frequency error causes inter-subcarrier interference and makes some power leak to the adjacent subcarriers.

    The second method is a frequency estimator that works only with intra-symbol information. It multiplies, in the time domain, the second half of the useful symbol by the complex conjugate of the first half of the useful symbol, integrates over time, and uses the phase of the result to estimate frequency. For single-tone transmissions this gives a frequency estimator that has an aliasing range of \(\pm \Delta f\). Since the nominal subcarrier frequency is \((k + 1/2)\Delta f\), we expect to get an estimate of \(\Delta f /2\) for even \(k\) and \(-\Delta f / 2\) for odd \(k\).
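
    A sketch of this estimator, acting on the useful part of one OFDM symbol in the time domain (with the cyclic prefix already removed):

    import numpy as np

    def intra_symbol_freq(sym, fs):
        """Correlate the second half of the useful symbol against the first
        half; the halves are T/2 apart, so the correlation phase equals
        2*pi*f*T/2, giving an aliasing range of +/- 1/T."""
        n = sym.size // 2
        corr = np.sum(sym[n:2*n] * np.conj(sym[:n]))
        T = sym.size / fs  # useful symbol duration (equal to 1/delta_f)
        return np.angle(corr) / (np.pi * T)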

    The first single-tone transmission uses subcarrier \(k = 1\), so we expect to get a frequency estimate of -7500 Hz. The estimate for each symbol in the transmission is plotted here.

    The values have an average of -7479 Hz, so it would seem that there is still a small frequency error of 21 Hz. However what happens is that the pulse shape is not perfectly symmetric in frequency, because the transmit filter lets through more sinc sidelobes on one side of the subcarrier than on the other. This causes the estimator to be biased. We will see how transmissions that use \(k = -6\) have a larger bias, due to a higher frequency asymmetry.

    A useful side-effect of the intra-symbol frequency estimator is that it also serves to fine tune the symbol timing offset. If there is some timing offset, it will cause inter-symbol interference. The estimator will see the interference and the frequency estimates will split into a group above the nominal frequency and a group below the nominal frequency, since the estimate depends on the values of the adjacent symbols. By tweaking the symbol timing offset to minimize the variance of the intra-symbol frequency estimates we can fine tune the symbol time alignment.
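
    This tuning can be done with a simple search. In the following sketch, freq_estimates is a hypothetical helper that demodulates the transmission with a given starting sample offset and returns the per-symbol intra-symbol frequency estimates:

    import numpy as np

    def best_timing(x, candidate_offsets, freq_estimates):
        """Pick the symbol timing offset (in samples) that minimizes the
        variance of the intra-symbol frequency estimates."""
        return min(candidate_offsets,
                   key=lambda off: np.var(freq_estimates(x, off)))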

    Finally, another method of verifying the carrier frequency offset is the analysis of the time-domain \(\pi/4\)-QPSK waveform that has been mentioned above.

    At this point we can perform OFDM demodulation of the first NPUSCH transmission and apply the phase rotation to verify that we obtain a clean QPSK constellation. The constellation for this transmission is shown here. The symbols in orange correspond to the DMRS, which is a pseudorandom BPSK sequence that has already been wiped off in this plot. The details of the DMRS are given below. The DMRS has been used to perform equalization. For each transmission, a constant phase and amplitude are determined using the DMRS and used for equalization of the data and DMRS symbols.

    Another plot that is useful is the phase of the symbols over time. By looking at this plot we can tell whether there is any residual frequency error.

    Single-tone NPUSCH format 1 DMRS

    The DMRS for single-tone NPUSCH is defined in Section 10.1.4.1.1 in TS 36.211. It is constructed differently depending on the NPUSCH format, but both formats use the same underlying pseudorandom sequence. This sequence is defined as\[\overline{r}_u(n) = \frac{1}{\sqrt{2}}(1 + i)(1 - 2c(n)) w_u(n\ \operatorname{mod} 16),\qquad 0 \leq n < M_{\mathrm{rep}}^{\mathrm{NPUSCH}}N_{\mathrm{slots}}^{\mathrm{UL}}N_{\mathrm{RU}}.\]The sequence \(c(n)\) is the usual LTE pseudorandom sequence initialized with \(c_{\mathrm{init}}=35\). I find it rather curious that a constant is used here, since most LTE pseudorandom sequences depend on the cell ID and slot number to reduce inter-cell interference.

    The index \(n\) counts the slots used by the transmission. The length of what counts as a “transmission” is given by \(M_{\mathrm{rep}}^{\mathrm{NPUSCH}}N_{\mathrm{slots}}^{\mathrm{UL}}N_{\mathrm{RU}}\). The number \(N_{\mathrm{slots}}^{\mathrm{UL}}\) denotes the number of slots in a resource unit, which is 16 for single-tone NPUSCH format 1 and 4 for NPUSCH format 2. The number \(N_{\mathrm{RU}}\) is the number of resource units used by the NPUSCH. For format 1 this depends indirectly on the transport block length and the coding rate. For format 2, \(N_{\mathrm{RU}} = 1\). Finally, unlike LTE, NB-IoT supports sending multiple repetitions of the NPUSCH upfront (possibly using different redundancy versions) to improve decoding reliability. The number \(M_{\mathrm{rep}}^{\mathrm{NPUSCH}}\) denotes the number of NPUSCH repetitions. We will see examples of NPUSCH repetition below.

    The sequence \(w_u(n)\) is a 16-element vector composed of 1’s and -1’s defined in Table 10.1.4.1.1-1 in TS 36.211. There are 16 possible choices for this sequence depending on the parameter \(u\), which is defined as \(u = N_{\mathrm{ID}}^{\mathrm{Ncell}}\ \operatorname{mod} 16\), unless group hopping is enabled, which is possible to use in NPUSCH format 1. The 16 possible vectors are chosen from a Hadamard orthonormal basis.
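
    For reference, here is a sketch of how the base sequence \(\overline{r}_u(n)\) can be generated. The lte_prs function implements the usual LTE Gold sequence from Section 7.2 in TS 36.211, and w_table stands for Table 10.1.4.1.1-1, which is not reproduced here.

    import numpy as np

    def lte_prs(c_init, length):
        """LTE pseudorandom (Gold) sequence c(n) from TS 36.211 Section 7.2."""
        Nc = 1600
        x1 = np.zeros(Nc + length + 31, dtype=np.uint8)
        x2 = np.zeros(Nc + length + 31, dtype=np.uint8)
        x1[0] = 1
        x2[:31] = [(c_init >> j) & 1 for j in range(31)]
        for n in range(Nc + length):
            x1[n + 31] = (x1[n + 3] + x1[n]) % 2
            x2[n + 31] = (x2[n + 3] + x2[n + 2] + x2[n + 1] + x2[n]) % 2
        return (x1[Nc:Nc + length] + x2[Nc:Nc + length]) % 2

    def dmrs_base_sequence(u, length, w_table):
        """Base DMRS sequence; w_table[u] is the 16-element vector of
        1's and -1's for the chosen u."""
        c = lte_prs(35, length)
        n = np.arange(length)
        return (1 + 1j) / np.sqrt(2) * (1 - 2 * c) * np.asarray(w_table[u])[n % 16]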

    In NPUSCH format 1, the middle symbol in each slot is allocated to the DMRS. The DMRS sequence is defined directly as the sequence \(\overline{r}_u(n)\) given above. Since we have just started the signal analysis, we don’t know the physical cell ID \(N_{\mathrm{ID}}^{\mathrm{Ncell}}\). However we can first perform wipe-off of the DMRS ignoring the term \(w_u(n\ \operatorname{mod} 16)\). In our case we see that what remains is an alternating sequence 1, -1, 1, -1, …. This means that group hopping is not used and that \(u = 1\), since \(w_1(n)\) is a vector of alternating 1’s and -1’s. Now that we have determined \(u\), we can wipe off the full DMRS sequence and obtain the constellation plot shown above.

    NPUSCH format 2 DMRS

    At this point we would like to decode the first NPUSCH format 1 transmission. However we are not yet in a position to do so. We have basically the same challenges that I explained in my post about the LTE PUSCH. The NPUSCH is scrambled with a pseudorandom sequence that depends on the RNTI, the cell ID, and the slot number in which the transmission begins. Without these values we don’t have much chance to decode the NPUSCH, because there are also other parameters that we need to guess, such as the transport block size and the coding rate. In my post about the PUSCH I took advantage of PUCCH formats 2, 2a and 2b, which are used to send CQI to the eNB, to find these values. In NB-IoT there is no direct equivalent of these PUCCH formats, because the UE does not send CQI feedback. However, similar ideas can be applied to the NPUSCH format 2, which is used to send HARQ acknowledgements.

    First we will study the DMRS of NPUSCH format 2 and use it to find the cell ID and the radio frame synchronization. In NPUSCH format 2, the 3 symbols in the middle of each slot are allocated to the DMRS. The DMRS sequence is defined as\[r_u(3n + m) = \overline{w}(m) \overline{r}_u(n),\qquad m=0,1,2.\]Here \(n\) counts the slots in the transmission, and \(m\) counts each of the three symbols in the slot. The sequence \(\overline{w}(m)\) is the same as the overlay code for PUCCH formats 1 and 1a. It is defined as \(\overline{w}(m) = e^{2\pi i Mm/3}\). The number \(M\) is computed for each slot as\[M = \left(\sum_{j=0}^7 c(8n_s + j) \cdot 2^j\right) \operatorname{mod} 3,\]where the pseudorandom sequence \(c\) is initialized with \(c_{\mathrm{init}} = N_{\mathrm{ID}}^{\mathrm{Ncell}}\). Therefore, for each slot, an overlay code of three 3rd roots of unity is chosen pseudorandomly based on the cell ID.
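
    Reusing the lte_prs function sketched above, the sequence of values of \(M\) for a given cell ID can be computed as follows:

    def overlay_M(cell_id, num_slots=20):
        """M for slots n_s = 0, ..., num_slots - 1. The sequence has period
        20, since n_s is the slot number within a radio frame."""
        c = lte_prs(cell_id, 8 * num_slots)
        return [sum(int(c[8 * ns + j]) << j for j in range(8)) % 3
                for ns in range(num_slots)]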

    Since the overlay code \(\overline{w}(m)\) contains 3rd roots of unity, from looking at a constellation plot it is quite easy to distinguish single-tone NPUSCH format 1 from NPUSCH format 2. In this recording, the second NPUSCH transmission uses format 2 and subcarrier \(k = -6\). Here is a plot of the time domain waveform after being downconverted to baseband. In this plot I am using a slightly different carrier frequency offset than in the plot for the first NPUSCH transmission. I have found that in this recording each transmission requires a small tweak of a few tens of Hz to the carrier frequency offset, and sometimes also a few samples to the symbol timing offset.

    It is quite apparent that the modulation is \(\pi/2\)-BPSK. With a more careful look we can even see the 3rd roots of unity used by the DMRS. Another thing of interest in this plot is that the pulse shape is much uglier than for the previous transmission. The reason is that this transmission uses the lowest subcarrier. Therefore the transmitter filtering has a highly asymmetric effect on the sinc sidelobes of the pulse.

    In the amplitude of the subcarriers after OFDM demodulation we can see that the power is concentrated in a single subcarrier, which is a check that the frequency error is small.

    However the intra-symbol frequency estimator gives an average value of 7281 Hz, suggesting that there is a frequency error of 219 Hz, which is reasonably large. This error is caused by a bias in the estimator due to the pulse shape distortion.

    Since we know how to compute the sequence \(\overline{r}_u(n)\) and the three DMRS symbols in each slot are the same except for the \(\overline{w}(m)\) term, we can wipe off \(\overline{r}_u(n)\) and then find the sequence of values \(M\) that generates the \(\overline{w}(m)\) term in each slot. This sequence is the following.

    [0, 1, 2, 2, 0, 2, 1, 0, 2, 2, 2, 2, 2, 2, 0, 2,
    1, 2, 1, 2, 0, 1, 2, 2, 0, 2, 1, 0, 2, 2, 2, 2]

    With this information we can now wipe off the full DMRS. After doing this, we get the following constellation and phase versus time plots.

    Now we can work with the sequence of values of \(M\) that we have obtained to find the cell ID and the radio frame synchronization. For each cell ID, the sequence of values of \(M\) depends only on the slot number \(n_s\). Therefore, it is a 20-element sequence that repeats periodically. Because this NPUSCH transmission is longer than 20 slots, we have more than 20 values for \(M\), but we can check that the sequence obtained above has period 20, so we only use the first 20 elements. The cell ID is an integer between 0 and 503. For each cell ID we generate the corresponding 20-element sequence of values of \(M\) and find whether the first 20 elements of the sequence we found match this 20-element sequence by applying a circular shift. Each match that we obtain gives a candidate cell ID and slot number \(n_s\), which is obtained from the circular shift used in the match. By running this brute force algorithm we obtain just one candidate: \(N_{\mathrm{ID}}^{\mathrm{Ncell}} = 135\) and \(n_s = 12\). Note that indeed \(N_{\mathrm{ID}}^{\mathrm{Ncell}}\ \operatorname{mod} 16 = 1\) as we had found with the Hadamard vector \(w_u(n)\).
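
    This brute force search is cheap enough to write down directly. A sketch, using the overlay_M function from above and the first 20 observed values of \(M\):

    def find_cell_id(m_observed):
        """All (cell_id, n_s) pairs whose 20-slot M sequence matches the
        observed values under some circular shift."""
        candidates = []
        for cell_id in range(504):
            m = overlay_M(cell_id)
            for shift in range(20):
                if all(m[(shift + j) % 20] == m_observed[j] for j in range(20)):
                    candidates.append((cell_id, shift))
        return candidates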

    Obtaining the RNTI from NPUSCH format 2

    The scrambling sequence used for the data in both NPUSCH formats is initialized with\[c_\mathrm{init} = n_{\mathrm{RNTI}}\cdot 2^{14} + (n_f\ \operatorname{mod} 2) \cdot 2^{13} + \lfloor n_s / 2 \rfloor \cdot 2^9 + N_{\mathrm{ID}}^{\mathrm{Ncell}}.\]This is pretty similar to the LTE PUSCH scrambling sequence, except that the codeword index \(q\) has been replaced by the frame number \(n_f\) modulo 2. This is because NB-IoT does not support two-codeword transmission, but the downlink structure is organized in terms of even and odd frames, so a UE listening to the downlink obtains synchronization to the frame number modulo 2. Since the frame number modulo 2 only has 2 possibilities, we could easily brute force it, but we still need to find the RNTI, which is a 16-bit number. We will use the NPUSCH format 2 transmission to help us.

    The NPUSCH format 2 is very simple. It only carries a HARQ acknowledgement, which is a single bit of information: 1 for a positive acknowledgement and 0 for a negative acknowledgement. This bit is encoded into 16 bits by using a repetition code (see Section 6.3.3 in TS 36.212). These 16 bits get mapped into an NPUSCH format 2 resource unit, which has a duration of 4 slots. Each slot has 4 \(\pi/2\)-BPSK data symbols, because the 3 symbols in the middle of the 7-symbol slot are used by the DMRS. This means that if we take the 16 data bits in an NPUSCH format 2 transmission and descramble them, we should find 16 repetitions of the same value if the scrambling sequence was correct.
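
    A sketch of this check, reusing the lte_prs function from above. Here bits is a numpy array with the 16 hard-decision bits of one NPUSCH format 2 repetition, and the search simply tries all \(2^{17}\) combinations (slow but simple; a real implementation would exploit the linearity of the scrambling generator):

    import numpy as np

    def rnti_candidates(bits, ns, cell_id):
        """Brute force the (n_RNTI, n_f mod 2) values for which descrambling
        yields 16 repetitions of the same bit."""
        found = []
        for nf_mod2 in range(2):
            for n_rnti in range(1 << 16):
                c_init = ((n_rnti << 14) + (nf_mod2 << 13)
                          + ((ns // 2) << 9) + cell_id)
                d = (bits + lte_prs(c_init, 16)) % 2
                if d.min() == d.max():  # all zeros or all ones
                    found.append((n_rnti, nf_mod2))
        return found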

    The NPUSCH format 2 transmission that we are analysing is composed of 8 repetitions of the NPUSCH format 2. We take the first repetition and do a brute force search for each possible value of \(n_{\mathrm{RNTI}}\) and \(n_f\ \operatorname{mod} 2\), noting a candidate whenever we obtain 16 repetitions of the same value after descrambling. In this way, we get the following candidates:

    • \(n_{\mathrm{RNTI}} = 53958, n_f\ \operatorname{mod} 2 = 1\)
    • \(n_{\mathrm{RNTI}} = 55023, n_f\ \operatorname{mod} 2 = 0\)
    • \(n_{\mathrm{RNTI}} = 55957, n_f\ \operatorname{mod} 2 = 1\)
    • \(n_{\mathrm{RNTI}} = 57020, n_f\ \operatorname{mod} 2 = 0\)

    Here we have a similar challenge to what happened with the PUCCH format 2 Reed-Muller code. The function that maps the 17-bit vector \((n_{\mathrm{RNTI}}, n_f\ \operatorname{mod} 2)\) into the 16-bit NPUSCH format 2 scrambling vector is an affine function. We are requiring that after descrambling we get 16 copies of the same value. This is a system of 15 independent linear equations. Therefore, in the best case (which is when the corresponding affine maps are non-degenerate), we can only expect the space of solutions of our brute force search to be an affine subspace of dimension 2. This is actually what happens, so we get 4 solutions.

    Using more repetitions of the same NPUSCH format 2 transmission will not help us. The scrambling sequence is generated independently for each repetition, using the \(n_f\ \operatorname{mod} 2\) and \(\lfloor n_s / 2\rfloor\) corresponding to the start of each repetition. However a moment’s thought leads us to conclude that we would get exactly the same solutions (with \(n_f\) appropriately adjusted for the fact that the repetitions may not happen in the same radio frame). The issue is that if we add to the correct \((n_{\mathrm{RNTI}}, n_f\ \operatorname{mod} 2)\) vector a vector such that the corresponding scrambling sequence is all zeros or all ones, we also obtain another candidate solution. For the same reason, using other NPUSCH format 2 transmissions that happen at a later time will not help either. We could heuristically decide that the HARQ acknowledgement is more likely to be 1 for a positive acknowledgement than 0. This would still leave two candidate solutions.

    Obtaining the RNTI from NPUSCH format 1

    Now that we have narrowed down the candidates for the RNTI and frame number modulo 2 to only 4 options, we can attempt to use the first NPUSCH format 1 transmission to select only one out of the 4 options. The key idea is something I also used with the LTE PUSCH. Initial transmissions usually have a low coding rate, so the rate matching algorithm repeats some of the bits of the Turbo codeword. We can attempt to descramble the NPUSCH transmission and then check if all the rate matching repetitions have the same value. To do this we need the following parameters, which affect the rate matching algorithm:

    • The transport block size.
    • The redundancy version index.
    • The number of resource units \(N_{\mathrm{RU}}\) occupied by the NPUSCH transmission. What we see as a single transmission might be multiple repetitions of the same transmission.

    To find the transport block size, we need the following table from TS 36.213. The transport block size is found in terms of two indices: \(I_{\mathrm{RU}}\), which indicates the number of resource units \(N_{\mathrm{RU}}\) used by the NPUSCH transmission, and \(I_{\mathrm{TBS}}\), which is related to the MCS.

    NPUSCH transport block size table, taken from TS 36.213

    The number of resource units \(N_{\mathrm{RU}}\) that can be used for an NPUSCH transmission, and the corresponding index \(I_{\mathrm{RU}}\), are given in this table in TS 36.213.

    Number of NPUSCH resource units table, taken from TS 36.213

    The index \(I_{\mathrm{RU}}\) is transmitted in the scheduling grant in the NPDCCH, but we don’t have access to this information, since we only have a recording of the uplink.

    For single-tone NPUSCH, the MCS that are supported are given by the following table. The modulation order \(Q_m\) indicates \(\pi/2\)-BPSK for \(Q_m = 1\) and \(\pi/4\)-QPSK for \(Q_m = 2\). The MCS index is transmitted in the scheduling grant, and this also determines the \(I_{\mathrm{TBS}}\) index.

    MCS table for single-tone NPUSCH, taken from TS 36.213

    It is also helpful to know that the number of repetitions of an NPUSCH transmission can be 1, 2, 4, 8, 16, 32, 64, or 128.

    The first NPUSCH format 1 transmission occupies 48 subframes, which is 96 slots. Since a single-tone NPUSCH format 1 resource unit has 16 slots, this means that we have 6 resource units in total. Therefore, we could either have one repetition of a 6-resource unit transmission, or 2 repetitions of a 3-resource unit transmission.

    We know that the modulation is \(\pi/4\)-QPSK. Since we expect a low coding rate, our candidate for MCS is 2, which corresponds to a TBS index of 1. With this MCS, a 3-resource unit transmission would have a transport block size of 88 bits. A 6-resource unit transmission would have a transport block size of 208 bits.

    In the LTE PUSCH post I used the fact that the first PUSCH transmission that happens after PRACH is an RRCConnectionRequest message, which is almost 56 bits long. Looking at the table above, we would see that with MCS 2 this only needs 2 resource units. However, as we will see, the NB-IoT RRC connection request message is slightly longer than the LTE message, which is why the next larger transport block size, which is 88 bits, is needed.

    For the first transmission of a transport block, redundancy version 0 is used. Therefore, we use a transport block size of 88 bits, a redundancy version of 0, and a size of 3 resource units for our test. When generating the scrambling sequence, we need to account for the time that has passed between the NPUSCH transmission in which we found \(n_s = 12\) and this transmission, which we see starts at \(n_s = 8\). This test reduces the candidate set to just one candidate: \(n_{\mathrm{RNTI}} = 53958\) and \(n_f\ \operatorname{mod} 2 = 0\) (where \(n_f\) refers to this NPUSCH transmission).

    Decoding NPUSCH format 1

    We can now decode the first NPUSCH format 1 transmission. The way in which the NPUSCH format 1 transport block is encoded is essentially identical to the LTE PUSCH, except that there are no CQI and RI bits embedded in the transmission, which greatly simplifies things. I have basically used the same code to perform channel decoding.

    The following plot shows the LLRs after inverting the rate matching algorithm. The LLRs around ±200 correspond to bits that have been transmitted twice.

    Since there are no bit errors, the Turbo decoder is able to obtain a correct decode in just one iteration.

    At this point the CRC-24 of the transport block can be checked to verify that decoding is successful, and the MAC PDU is written to a PCAP file for analysis with Wireshark. The following screenshot (click on it to view it in full size) shows the PDU dissected in Wireshark. It is a CCCH message that contains an rrcConnectionRequest-r13 message. We can see that the RRC Connection Request in NB-IoT is slightly different and longer than in LTE. Besides giving the TMSI and the establishment cause, the UE also gives some details about its physical layer support. In this case we can see that the establishment cause is mo-Data (mobile originating data), which makes sense, because the V16 beacon wants to send some data to the network. The UE supports multi-tone and multi-carrier operation, and early contention resolution, but it does not report a measurement of CQI for the NPDCCH.

    rrcConnectionRequest-r13 message in Wireshark

    Another novelty in this MAC PDU is the Data Volume and Headroom Report control element. This is an exclusive feature of NB-IoT. It is described in Section 6.1.3.10 in TS 36.321. Using this element, the UE is already reporting to the eNB that it has between 172 and 234 bytes of data to send.

    Because of the additional information compared to the LTE RRC connection request, the NB-IoT RRC connection request needs 71 bits, so there are 17 spare bits at the end of the ASN.1 message to fill up the 88-bit transport block.

    This NPUSCH format 1 transmission has two repetitions using 3 resource units each. The second repetition uses redundancy version 2. In NB-IoT, the redundancy version index can only be 0 or 2, and it alternates with each repetition of a transmission. We can descramble the second repetition and perform inverse rate matching with redundancy version 2, obtaining the following LLRs. Performing Turbo decoding of this repetition we obtain exactly the same RRC connection request PDU.

    Decoding the remaining NPUSCH transmissions

    The following plot gives an overview of the NPUSCH transmissions in the recording. It shows the average power in each subframe and subcarrier, starting with the first NPUSCH transmission. As we will see below, all the transmissions in subcarrier -6 are NPUSCH format 2 that contain a positive HARQ acknowledgement. The rest of the transmissions are NPUSCH format 1.

    We have already seen that the second transmission is 8 repetitions of NPUSCH format 2. Since we now know the RNTI and frame number modulo 2, we can descramble the transmission and find that it is a positive HARQ acknowledgement (which is encoded as the value -1).

    The third transmission is a rather long NPUSCH format 1 transmission. It lasts 80 subframes and occupies the highest subcarrier. In the plot of the symbol phases versus time we can see that there is some frequency drift, because the phases look like a parabola. This could be equalized out using the DMRS, but here I am using a constant phase equalization for all the duration of each transmission.

    Since this transmission occupies 10 resource units, it could either be 2 repetitions of a 5 resource unit NPUSCH, or a single 10 resource unit NPUSCH. We also need to find the MCS to perform decoding. We can perform trial and error with all the options until we obtain a successful decode for one of them. It turns out that this is a 10 resource unit transmission using MCS 10. This is the largest NPUSCH allocation in terms of resource units, and the highest MCS for single-tone NPUSCH, so the corresponding transport block size of 1736 bits is the largest that can be transmitted with single-tone.

    The following plot shows the LLRs after inverting the rate matching algorithm. Since the coding rate is high, many bits in the Turbo codeword have been punctured by rate matching, especially in the last two thirds of the codeword.

    It makes sense that the UE got this rather large uplink grant, since in the Data Volume and Headroom Report it said that it probably had enough data to fill it, and in fact it does. Looking at the decoded MAC PDU in Wireshark, we see that it is an UL-DCCH message in SRB 3 that contains an rrcConnectionSetupComplete-r13 message. We do not see the messages sent on the downlink by the eNB, but from what we see on the uplink and the knowledge of how a regular LTE connection works (see my post about the LTE MAC layer), we infer that this message is a response to an RRC connection setup message sent by the eNB.

    rrcConnectionSetupComplete-r13 message in Wireshark

    As in LTE, the RRC connection setup complete message embeds a NAS PDU. In addition, it also indicates the UE S-TMSI. The NAS PDU contains a control plane service request message. This message forms part of a mechanism called control plane CIoT EPS optimization. I will describe this in more detail below, but the main idea is that it is a way for an NB-IoT UE to send some data to the network without having to go through the usual hoops to establish a user plane data connection. The control plane service request message contains a 158 byte encrypted message, which presumably contains the V16 beacon GNSS position and other data.

    The NAS PDU also contains an EPS bearer context status. I haven’t expanded this in Wireshark because it doesn’t fit in the screen. It simply says that the EBI(5) bearer context is active and the remaining bearer contexts are inactive.

    The next transmission is a brief NPUSCH format 1 transmission using subcarrier 5. It occupies 8 subframes, which is just one single-tone resource unit.

    As before, the MCS turns out to be 10, so the transport block size is 144 bits. It looks like the eNB now knows that the UE has a relatively good channel, and therefore it is scheduling it with a high MCS.

    The decoded MAC PDU can be seen here in Wireshark. We see that it contains the first AM fragment of an RLC SDU in SRB 3. Wireshark warns us that the sequence number of this AM PDU, which is zero, is repeated. The RRC connection setup complete message in the previous transmission also used this AM sequence number. We can even see that the AM data in this PDU is the same as the beginning of the RRC connection setup complete message. Therefore, this PDU is a retransmission of the RRC connection setup complete message.

    First fragment of retransmission of RRC connection setup complete

    We can infer that the UE has received a HARQ acknowledgement for its transmission of the RRC connection setup complete message. Otherwise I think that it would be trying to send that same transport block with redundancy version 2. However, it has not yet received an AM ACK from the RLC layer, and the AM PDU was sent with the polling flag enabled. Therefore it makes sense that an RLC retransmission is initiated if an ACK is not received. Nevertheless it seems that the retransmit timer is configured too short here. The retransmission is done 96 ms after the original transmission, but the original transmission took 80 ms of air time to send. So this retransmission is really happening too soon.

    This MAC PDU also has a short BSR control element that says that the UE has between 172 and 200 bytes of data to send. This makes sense. Since the UE got a small uplink grant, it could only send 11 bytes of the large RRC connection setup complete message.

    The next transmission is the first multi-tone NPUSCH format 1 transmission that we see in the recording. It occupies subcarriers 3, 4, 5. Decoding multi-tone NPUSCH format 1 has the following differences compared to single-tone NPUSCH format 1:

    • The phase rotation is not applied.
    • Multi-tone NPUSCH format 1 uses SC-FDMA in the same way as the LTE PUSCH. After OFDM demodulation and equalization, an IFFT needs to be applied to data symbols to invert the SC-FDMA precoding (the DMRS does not use SC-FDMA).
    • The DMRS is constructed differently. The details are given in a section below.
    • Like in the LTE PUSCH, the transport block is interleaved so that it is mapped first in time and then in frequency. This means that consecutive symbols of the codeword get mapped to resource elements that are adjacent in time. In NPUSCH this interleaving is done independently in each resource unit.

    Because multi-tone transmissions are more susceptible to symbol timing offset (which causes time-domain inter-symbol interference in the SC-FDMA modulation), in these transmissions we have to tweak the timing offset by looking at whether there is a phase versus frequency slope in the DMRS.
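
    A sketch of how this timing offset can be estimated from the DMRS of one OFDM symbol: after wiping off the known DMRS values, a delay \(\tau\) shows up as a phase ramp \(e^{-2\pi i k \Delta f \tau}\) across the subcarriers \(k\), so \(\tau\) can be read from the phase difference between adjacent subcarriers.

    import numpy as np

    def timing_from_dmrs(dmrs_rx, dmrs_ref, delta_f=15e3):
        """Estimate the timing offset (in seconds) from the phase-versus-
        frequency slope of the channel estimate on contiguous subcarriers."""
        h = dmrs_rx * np.conj(dmrs_ref)  # channel estimate per subcarrier
        slope = np.angle(np.sum(h[1:] * np.conj(h[:-1])))
        return -slope / (2 * np.pi * delta_f)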

    The following plots show the constellation after demodulation (including inversion of the SC-FDMA precoding) of the multi-tone transmission. In the plot of phase versus time, the x-axis corresponds to the OFDM symbol number, so symbols transmitted in different subcarriers at the same time are shown on the same x-axis value.

    This transmission occupies only 4 subframes, which corresponds to a single 3-tone resource unit. Multi-tone NPUSCH can use QPSK and 16QAM. For QPSK the allowed range of \(I_{\mathrm{TBS}}\) indices is between 0 and 13, and it corresponds directly to the MCS. For 16QAM (which is not used in this recording), the allowed range of \(I_{\mathrm{TBS}}\) is between 14 and 21 and it is indicated using a dedicated 16QAM MCS field in the DCI. In this case MCS 13, which is the highest for QPSK, is used. This gives a transport block size of 224 bits.

    The decoded MAC PDU dissected in Wireshark is shown here. This is the second AM fragment of the RRC connection setup complete retransmission. The UE is able to send the next 23 bytes of the message in this fragment.

    Second fragment of retransmission of RRC connection setup complete

    The next transmission happens some time later. It is an NPUSCH format 2 transmission using the lowest subcarrier. It occupies 4 subframes, so it is actually two repetitions. Decoding it we see that it contains a positive HARQ acknowledgement. The constellation and phase versus time for this transmission are shown here.

    Next we have a multi-tone NPUSCH format 1 transmission that uses subcarriers 0, 1, 2, 3, 4, 5. The transmission occupies 12 subframes, which is 6 resource units of 6-tone NPUSCH.

    This transmission also uses the highest MCS for QPSK, MCS 13. It is a single repetition of a 6-resource unit transport block, rather than two repetitions of a 3-resource unit transport block. The corresponding transport block size is 1544 bits. Dissecting the MAC PDU in Wireshark we see that it only contains padding. What has happened here? Since the UE had reported in a short BSR that it had data to transmit, it makes sense that it got this relatively large uplink grant to finish transmitting the rest of the data. However, we have seen that the UE has received something on the downlink before this transmission, because it has sent a positive HARQ acknowledgement. We infer that what the UE has received is the AM ACK for the original RRC connection setup complete message. Therefore, it does not need to retransmit this message anymore, and it does not have anything to send in its uplink grant, so it fills it with padding.

    Padding-only MAC PDU

    The next transmission happens some time later and is an NPUSCH format 2 transmission in the lowest subcarrier. As usual, this is two repetitions of a positive HARQ acknowledgement.

    The last NPUSCH transmission is a multi-tone NPUSCH format 1 transmission that uses all 12 subcarriers and lasts only one subframe, which corresponds to one resource unit. With 12 subcarriers in use, getting the symbol timing offset right is more critical. Even changing it by one sample (at 1.92 Msps) gives noticeably worse results, although I haven’t needed to implement a fractional delay in the OFDM equalization to perform sub-sample timing offset correction. The constellation and phases for this transmission are shown here.

    The MCS for this transmission is also 13. As we have seen earlier, this gives a transport block size of 224 bits for one resource unit. The MAC PDU is just an AM ACK. We infer that what has happened is that the network has sent a reply to the control plane service request that the UE sent in the RRC connection setup complete message. That reply has been acknowledged at the HARQ level in the previous transmission, and is now acknowledged at the RLC layer in this MAC PDU.

    AM ACK to reply of control plane service request

    In summary, we see that, ignoring the spurious AM retransmission, the conversation in this recording is quite simple:

    1. The UE sends PRACH
    2. After a reply by the eNB (which we infer), the UE sends an RRC connection request message indicating its S-TMSI and NB-IoT capabilities
    3. We infer that the eNB sends an RRC connection setup message that configures the RRC layer
    4. The UE sends a positive HARQ acknowledgement as a reply to the RRC connection setup message
    5. The UE sends an RRC connection setup complete message. This message includes a control plane service request message for the NAS, which is used by the control plane CIoT EPS optimization (described below).
    6. We infer that the eNB sends an AM ACK for the RRC connection setup complete message
    7. The UE sends a positive HARQ acknowledgement for this AM ACK
    8. We infer that the eNB sends a reply to the control plane service request
    9. The UE sends a positive HARQ acknowledgement for this reply
    10. The UE sends an AM ACK for this reply

    The list of MAC PDUs shown in Wireshark also provides a good summary of the conversation.

    List of MAC PDUs in Wireshark

    Multi-tone NPUSCH DMRS

    The multi-tone NPUSCH format 1 DMRS occupies the central symbol in each slot. The DMRS sequence is constructed as\[r_u(n) = e^{i\alpha n} e^{i\phi_u(n)\pi/4},\qquad 0 \leq n < N_{\mathrm{sc}}^{\mathrm{RU}},\]where \(N_{\mathrm{sc}}^{\mathrm{RU}}\) denotes the number of subcarriers per resource unit, so \(n\) indexes the subcarriers occupied by the NPUSCH. Note that the DMRS sequence does not depend on the slot number \(n_s\), so it is always the same.

    The function \(\phi_u(n)\) takes the values \(\pm 1, \pm 3\), so the \(e^{i\phi_u(n)\pi/4}\) term corresponds to symbols drawn from a QPSK constellation. This function is defined in terms of tables that depend on the number of subcarriers allocated to the NPUSCH and on the index \(u\). Unless overridden by higher layers, the index \(u\) is chosen as:

    • \(u = N_{\mathrm{ID}}^{\mathrm{Ncell}} \mod 12\) for 3 subcarriers
    • \(u = N_{\mathrm{ID}}^{\mathrm{Ncell}} \mod 14\) for 6 subcarriers
    • \(u = N_{\mathrm{ID}}^{\mathrm{Ncell}} \mod 30\) for 12 subcarriers

    The term \(e^{i\alpha n}\) represents a cyclic shift that can be set by higher layers to a value of the form \(\alpha = 2\pi n_{\mathrm{cs}} / 12\), with integer \(n_{\mathrm{cs}}\), but it defaults to \(\alpha = 0\).
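
    Putting these pieces together, here is a sketch of generating this DMRS for one slot, where phi_u is the appropriate row of the \(\phi_u(n)\) tables in TS 36.211 (not reproduced here):

    import numpy as np

    def multitone_dmrs(phi_u, n_cs=0):
        """DMRS sequence r_u(n) for one slot; phi_u has one entry per
        allocated subcarrier, with values in {-3, -1, 1, 3}."""
        n = np.arange(len(phi_u))
        alpha = 2 * np.pi * n_cs / 12
        return np.exp(1j * alpha * n) * np.exp(1j * np.asarray(phi_u) * np.pi / 4)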

    In this recording, we can see that the index \(u\) is chosen in terms of the cell ID as indicated here, and \(\alpha = 0\). With this knowledge we can generate and wipe off the DMRS in multi-tone NPUSCH transmissions.

    Control plane CIoT EPS optimization

    The control plane CIoT EPS optimization is part of some features introduced in the LTE network mainly to support NB-IoT use cases. They are intended to optimize small data transfers by removing some setup steps that are needed in regular LTE connections. In particular the control plane CIoT EPS optimization allows sending user data or SMS encapsulated in a NAS message. Therefore, data can be sent as soon as an RRC connection has been established, by encapsulating it in the RRC connection setup complete message. This is important for battery powered IoT sensors, which need to wake up periodically and send a small amount of data.

    The CIoT EPS optimizations are introduced in Section 4.10 in TS 23.401. The functional description for data transport in the control plane CIoT EPS optimization is given in Section 5.3.4B in that TS. This section includes a flow for mobile originated data transport in control plane CIoT EPS optimization with P-GW connectivity, which is shown here. I believe this is what is being used by the V16 beacon. However this section does not explicitly refer to the control plane service request NAS message. It only says “The UE establishes a RRC connection or sends the RRCEarlyDataRequest message as defined in TS 36.300 and sends as part of it an integrity protected NAS PDU. The NAS PDU carries the EPS Bearer ID and encrypted Uplink Data.”

    Mobile Originated Data Transport in Control Plane CIoT EPS Optimisation
    with P-GW connectivity, taken from TS 23.401

    The procedure for using a control plane service request message for the control plane CIoT EPS optimization is defined in Section 5.6.1.2.2 in TS 24.301. Section 5.6.1.1 describes the cases in which a service request procedure is invoked by the UE. The case applicable here is “b) the UE, in EMM-IDLE mode, has pending user data to be sent;”. In Section 6.6.4 there are more details about the transport of user data via the control plane procedure. This section says that an ESM data transport message is used to encapsulate user data. If the UE is in EMM-IDLE mode, it shall initiate the procedure by sending the ESM data transport message included in a control plane service request message. It seems that this is what is happening here. The control plane service request includes an encrypted ESM message, which is probably this ESM data transport message.

    The format of the control plane service request message is given in Section 8.2.33 in TS 24.301. Except for some missing optional fields, this matches what we see in Wireshark.

    Control plane service request message format, taken from TS 24.301

    The ESM message container information element is defined in Section 9.9.3.15. It also matches what we see in Wireshark.


    ESM message container format, taken from TS 24.301

    The ESM data transport message is defined in Section 8.3.25.

    ESM data transport message format, taken from TS 24.301

    We cannot see this message structure in Wireshark because the ESM message container contents are encrypted. Section 4.4.5 in TS 24.301 says that “The UE shall partially cipher the CONTROL PLANE SERVICE REQUEST message by ciphering the value part of the ESM message container IE or the value part of the NAS message container, using the ciphering algorithm of the current EPS security context.”

    Conclusions

    In this post I have shown how NB-IoT uplink transmissions can be decoded without access to the scheduling grants in the downlink control information, by using NPUSCH format 2 transmissions to find the RNTI, the cell ID, and the synchronization to the two-radio-frame structure of NB-IoT.

    I have decoded all the NPUSCH transmissions done by a V16 beacon in a short burst some time after it had been powered on and had already obtained its GNSS position. These NPUSCH transmissions establish an RRC connection and send a user data message to the network by using the control plane service request message, as defined by the control plane CIoT EPS optimization. This allows the beacon to send a small amount of data and receive a reply without having to fully establish a network connection.

    The user data is included in an ESM data transport message, which is encrypted with an EPS security context. Presumably the V16 beacon has authenticated to the network and set up this security context the first time that it transmitted after being powered on. In fact, we can see that the NAS PDU containing the control plane service request message has a sequence number of 5. This sequence number is used as part of the EPS security context IV, so this shows that the UE and the network have already exchanged a few NAS messages in this security context. It would be interesting to make a recording that contains both the NB-IoT downlink and the uplink from the moment the beacon is powered on, to be able to see how the full conversation works.

    Something interesting about these beacons is that they use an NB-IoT network to send the data. Rather than screaming an emergency message into the void without having any feedback, with the hopes that someone gets the message, they use the cellular network for reliable delivery. In fact we have seen in this recording how acknowledgements and retransmissions at different layers of the protocol stack work.

    Having feedback in emergency beacons can be quite important. For instance, EPIRBs really scream an emergency message into the void, hoping that it gets picked up by LEO and MEO satellite transponders and decoded at a groundstation. The Galileo system is going to great efforts to provide a SAR return link for these beacons through the navigation message in the Galileo open service broadcast signal. The return link can be used to inform a user that their distress call has been received and processed by the mission control center and responders.

    Since V16 beacons use the cellular network, we get a return link for free. However, the beacon does not have a user interface to communicate anything to the user. This seems a missed opportunity to communicate basic information, such as lack of cellular network connectivity, lack of GNSS signal, and successful delivery of the emergency message. Some LEDs or a small status LCD would be an obvious option, but because the beacon strobe light is so bright, this does not seem a good idea. Perhaps this status information could be communicated with audio signals, or maybe even over BLE, since most users will also have a smartphone at hand during an emergency and BLE is also ubiquitous in cars these days. These beacons have an MSRP of 30 to 40€, so it seems that these small additions could still fit into the manufacturing costs.

    Code and data

    The code and data used in this post can be found in this repository. This contains the SigMF recording of the beacon transmission (sigmf-data, sigmf-meta), the Jupyter notebook used for decoding, the PCAP containing the decoded MAC PDUs, and the GNU Radio flowgraph used for resampling the IQ data.

  • Notes on debugging Rust microcontroller stack usage

    A few days ago I was doing some refactoring of my galileo-osnma project. This is a Rust library that implements the Galileo OSNMA (open service navigation message authentication) system. The library includes a demo that runs in a Longan nano GD32VF103 RISC-V microcontroller board. The purpose of this demo is to show that this library can run on small microcontrollers. My refactoring was in principle a simple thing: I was mainly organizing the repository as a Cargo workspace, and unifying the library and some supporting tools into the same crate. However, after the refactor, users reported that the Longan nano software was broken. It would hang after processing some messages. This post is a collection of notes about how I investigated the issue, which turned out to be related to stack usage.

    First I checked that I was able to reproduce the issue either by feeding data from the Galmon public feed (which can be fetched using nc 86.82.68.237 10000) or by feeding data from my own GNSS receiver with Galmon’s ubxtool. Indeed the software was broken. At some point it stopped sending the READY message over the UART, which is how this demo implements flow control. I made a recording of a few minutes of Galmon public feed data in order to test repeatably, and I could confirm that the software from before the refactor obtained successful authentication, while the software after the refactor hung before obtaining authentication and before reaching the end of the file. This left me wondering which of my changes could have caused the breakage.

    I realized that one of the things that I changed was the compiler options. Before the refactor, the Longan nano was in a standalone crate with the following configuration.

    [profile.release]
    opt-level = "z"
    lto = true

    After the refactor, because profiles need to be defined at workspace level, I had the following:

    [profile.release]
    codegen-units = 1
    lto = true
    panic = "abort"

    [profile.embedded]
    inherits = "release"
    opt-level = "z" # Optimize for size.

    To build the Longan nano software I used the embedded profile. I noticed that this caused the software to be built with codegen-units = 1 instead of the default. I played with other values of codegen-units in the embedded profile and found that codegen-units = 1 caused the software to crash eventually, while other values of codegen-units worked (later I discovered that with codegen-units = 2 the software still crashed after some hours, but with codegen-units = 16 it seemed to run well forever).

    At this point I was puzzled: what could be making codegen-units = 1 break the software? The value of codegen-units defines how many independent code generation units are used in compilation. Lower values of codegen-units cause longer build times but can enable additional opportunities for compiler optimizations. The behaviour of the resulting code shouldn’t change, however.

    One of the things I suspected was that perhaps the software was running out of stack space. The GD32VF103 only has 32 KiB of SRAM. This is an important limitation which I have needed to take into account in the past. The full data that the galileo-osnma library would need to handle to account for all the satellites in the Galileo constellation is around 37 KiB. Because this doesn’t fit in SRAM, the Longan nano software uses a reduced-memory configuration that only considers some data for up to 12 satellites in view, which takes around 9 KiB. However, why would using codegen-units = 1 increase the stack usage? The other suspicion I had was that perhaps a compiler bug was triggered by codegen-units = 1.

    Initially I didn’t plan to spend more time debugging this, since using codegen-units = 16 was a good workaround. However the curiosity to understand what was happening won. First I tried to check if the software built with codegen-units = 1 was panicking. For that, I made a custom panic handler that turned on the red LED on the board. I saw that the LED didn’t turn on when the software crashed. Additionally, by adding a couple more print statements on the UART, I could find that the crash happened during a call to the Osnma::feed_osnma function, which is responsible for processing a single piece of OSNMA data taken from an INAV page. This function does a lot of work, so the crash could be anywhere in the library.

    Intrigued by why a Rust program could be crashing without panicking, I decided to debug using JTAG. I have never had to use JTAG with this Longan nano board, but I found that an Altera USB Blaster clone that I had for the Hermes-Lite 2.0 worked fine and allowed me to debug with gdb over JTAG.

    With gdb, the first thing I could see is that when the software crashed, it was running an abort function which is just an endless loop.

    08000240 <abort>:
    8000240: 0000006f j 0x8000240

    In the backtrace I could see that this had been called by a _start_trap function.

    080001c0 <_start_trap>:
    80001c0: 7139 addi sp, sp, -0x40
    80001c2: c006 sw ra, 0x0(sp)
    80001c4: c216 sw t0, 0x4(sp)
    80001c6: c41a sw t1, 0x8(sp)
    80001c8: c61e sw t2, 0xc(sp)
    80001ca: c872 sw t3, 0x10(sp)
    80001cc: ca76 sw t4, 0x14(sp)
    80001ce: cc7a sw t5, 0x18(sp)
    80001d0: ce7e sw t6, 0x1c(sp)
    80001d2: d02a sw a0, 0x20(sp)
    80001d4: d22e sw a1, 0x24(sp)
    80001d6: d432 sw a2, 0x28(sp)
    80001d8: d636 sw a3, 0x2c(sp)
    80001da: d83a sw a4, 0x30(sp)
    80001dc: da3e sw a5, 0x34(sp)
    80001de: dc42 sw a6, 0x38(sp)
    80001e0: de46 sw a7, 0x3c(sp)
    80001e2: 00010533 add a0, sp, zero
    80001e6: fa7ff0ef jal 0x800018c
    80001ea: 4082 lw ra, 0x0(sp)
    80001ec: 4292 lw t0, 0x4(sp)
    80001ee: 4322 lw t1, 0x8(sp)
    80001f0: 43b2 lw t2, 0xc(sp)
    80001f2: 4e42 lw t3, 0x10(sp)
    80001f4: 4ed2 lw t4, 0x14(sp)
    80001f6: 4f62 lw t5, 0x18(sp)
    80001f8: 4ff2 lw t6, 0x1c(sp)
    80001fa: 5502 lw a0, 0x20(sp)
    80001fc: 5592 lw a1, 0x24(sp)
    80001fe: 5622 lw a2, 0x28(sp)
    8000200: 56b2 lw a3, 0x2c(sp)
    8000202: 5742 lw a4, 0x30(sp)
    8000204: 57d2 lw a5, 0x34(sp)
    8000206: 5862 lw a6, 0x38(sp)
    8000208: 58f2 lw a7, 0x3c(sp)
    800020a: 6121 addi sp, sp, 0x40
    800020c: 30200073 mret

    This function is the trap handler that gets run when the RISC-V hart (hardware thread, which is how RISC-V formally refers to CPU cores) traps. I set a hardware breakpoint at the beginning of _start_trap with hbreak *0x80001c0 and let the software crash again. This is the state of the registers at the breakpoint.

    (gdb) info r
    ra 0xffffffff 0xffffffff
    sp 0x1ffffce0 0x1ffffce0
    gp 0x20000800 0x20000800
    tp 0x0 0x0
    t0 0x800c000 134266880
    t1 0xffffffff -1
    t2 0xffffffff -1
    fp 0xffffffff 0xffffffff
    s1 0xffffffff -1
    a0 0x0 0
    a1 0xffffffff -1
    a2 0xffffffff -1
    a3 0xffffffff -1
    a4 0x0 0
    a5 0x0 0
    a6 0x0 0
    a7 0x0 0
    s2 0xffffffff -1
    s3 0xffffffff -1
    s4 0xffffffff -1
    s5 0xffffffff -1
    s6 0xffffffff -1
    s7 0xffffffff -1
    s8 0xffffffff -1
    s9 0xffffffff -1
    s10 0xffffffff -1
    s11 0xffffffff -1
    t3 0x0 0
    t4 0xffffffff -1
    t5 0x0 0
    t6 0xffffffff -1
    pc 0x80001c0 0x80001c0

    There is more relevant information in some of the machine CSRs. The mstatus register contains information about the hart’s state. In this case there is nothing of interest there.

    (gdb) info r $mstatus
    mstatus 0x1800 SD:0 VM:00 MXR:0 PUM:0 MPRV:0 XS:0
    FS:0 MPP:3 HPP:0 SPP:0 MPIE:0 HPIE:0 SPIE:0 UPIE:0 MIE:0
    HIE:0 SIE:0 UIE:0

    The mtvec register contains the trap vector base address. It indicates the address of the trap handler.

    (gdb) info r $mtvec
    mtvec 0x80001c3 134218179

    Trap handling can be done in direct mode, in which there is a single handler for all traps, or in vectored mode, in which the handler address is computed in terms of the trap cause. The two LSBs of mtvec are supposed to be 2'b00 for direct mode and 2'b01 for vectored mode, with the other two possible values being reserved. Here the value is 2'b11, which is reserved. I think that this is a silicon bug in the GD32VF103, which probably has these two bits hardcoded to 1, even though I believe that trap handling is working in direct mode. The 30 MSBs of the register contain the 30 MSBs of the trap handler base address (which in direct mode is just the trap handler address). The base address is supposed to be aligned to 4 bytes. We see that the trap handler address 0x80001c0 matches the _start_trap function.

    The mcause register contains a code indicating the event that caused the trap. The MSB is set if the trap was caused by an interrupt. The remaining bits indicate the exception code that caused the trap.

    (gdb) info r $mcause
    mcause 0x30000001 805306369

    Here we also have what looks like a silicon bug, because the leading nibble is 0x3. According to the RISC-V specification, for interrupt = 0, all the cause codes greater than or equal to 64 are reserved. I believe that the correct value of mcause in this case would be 0x00000001, which means instruction access fault.

    The mepc register contains the program counter value at the moment when the trap happened. Its LSB is always zero, because instructions in RISC-V (with the compressed instructions extension) are aligned to 2 bytes.

    (gdb) info r $mepc
    mepc 0xfffffffe -2

    Here we see that the mepc register contains the address 0xfffffffe. This address is outside the memory map of the GD32VF103 (see Section 2.4 in the datasheet), so it makes sense that an instruction access fault was generated and we ended up in the trap handler.

    What has happened here? The registers contain some clues. First of all, there are many registers with the value 0xffffffff, which looks suspicious. The stack pointer is 0x1ffffce0. This is pointing outside the SRAM, which is mapped to 0x20000000 - 0x20007FFF. In fact we can see how this executable is supposed to be using the SRAM by inspecting some symbols in the ELF file:

    $ rust-objdump -x \
    target/riscv32imac-unknown-none-elf/embedded/\
    osnma-longan-nano | grep -e "__[se]\(bss\|stack\)"
    20000000 g .bss 00000000 __sbss
    20000020 g .bss 00000000 __ebss
    20000020 g .stack 00000000 __estack
    20008000 g .stack 00000000 __sstack

    We see that this program only needs 32 bytes of BSS, so the remaining SRAM space is allocated to the stack, which grows downwards starting at the end of the SRAM.

    Finally, the register t0 contains another clue. Its value is 0x0800c000, which is an address into the flash where the program is contained (the flash is mapped to 0x08000000 - 0x08020000). If we go to that part of the code we find that it corresponds to the ret instruction in a function that performs arithmetic for the p256 elliptic curve cryptography.

    0800bda6 :
    [...]
    800bff8: 00002297 auipc t0, 0x2
    800bffc: 3e6282e7 jalr t0, 0x3e6(t0)
    800c000: 8082 ret

    Before this function returns, an outlined function is run. This is the code for that function.

    0800e3de :
    800e3de: 40f6 lw ra, 0x5c(sp)
    800e3e0: 4466 lw s0, 0x58(sp)
    800e3e2: 44d6 lw s1, 0x54(sp)
    800e3e4: 4946 lw s2, 0x50(sp)
    800e3e6: 49b6 lw s3, 0x4c(sp)
    800e3e8: 4a26 lw s4, 0x48(sp)
    800e3ea: 4a96 lw s5, 0x44(sp)
    800e3ec: 4b06 lw s6, 0x40(sp)
    800e3ee: 5bf2 lw s7, 0x3c(sp)
    800e3f0: 5c62 lw s8, 0x38(sp)
    800e3f2: 5cd2 lw s9, 0x34(sp)
    800e3f4: 5d42 lw s10, 0x30(sp)
    800e3f6: 5db2 lw s11, 0x2c(sp)
    800e3f8: 6125 addi sp, sp, 0x60
    800e3fa: 8282 jr t0

    The goal of this outlined function is to reduce the code size. It contains a very common routine that restores the return address register and all the saved registers. Many functions will need to perform these operations before returning, so by putting them in an outlined function, code repetition is reduced.

    The way that the main function jumps into the outlined function is interesting. Because the outlined function is going to load the return address register ra with the return address needed by the main function, another mechanism is needed to return from the outlined function. This is implemented by the jalr t0, 0x3e6(t0) instruction, which jumps to t0 + 0x3e6 and stores the program counter value for the next instruction (which is 0x800c000, corresponding to the ret instruction) into the t0 register. In this way, the outlined function can perform its work, taking care not to clobber t0, and then use jr t0 to return to the main function.

The value 0x800c000 that we have found in t0 at the trap handler breakpoint is the telltale sign of this mechanism. Now we understand that the first instruction of the outlined function has loaded 0xffffffff into ra (and also all the saved registers). Therefore, the ret instruction of the main function is trying to jump to 0xffffffff, which is an illegal instruction address because it is not aligned and because it is outside of the CPU address map. Note that mepc contains 0xfffffffe rather than 0xffffffff because jalr (and hence ret) clears the least significant bit of the target address. The RISC-V specification defines a trap cause for instruction address misaligned, so I think this, rather than instruction access fault, should have been the cause of the trap in mcause.

We have noticed that the stack pointer contains the value 0x1ffffce0 at the beginning of the trap handler. Taking into account that the outlined function has performed addi sp, sp, 0x60, this means that the stack pointer was 0x1ffffc80 at the beginning of the outlined function. This is 896 bytes below the start of the SRAM, so we see that the program has run out of stack space and it is doomed to crash one way or another.

All the loads in the outlined function, as well as some other loads in the main function, are targeting the area immediately below the SRAM. In the address map, the area 0x1FFFF810 - 0x1FFFFFFF is shown as code (reserved). For some reason, loads from this area are returning 0xffffffff. That is the reason why most of the registers have this value at the beginning of the trap handler. I don’t know all the details of the RISC-V specification, but I think that it would be better if these loads generated a load access fault trap instead of returning a hardcoded all-ones constant.

Now that we have understood all the details of how the software crashes, the next question is why the stack usage differs depending on the codegen-units value, and what we can do about it. The first thing I checked was putting a breakpoint in the main function with hbreak main (note that this does not place the breakpoint at the first instruction of main, but rather at the first instruction after the prologue of main, where the stack pointer has already been decremented to reserve stack for main) and printing the value of the stack pointer (a sketch of the gdb commands is shown after the list below). I got the following:

    • With codegen-units = 16, the stack pointer is 0x200032c0, which means that there are around 12.65 KiB of SRAM free.
    • With codegen-units = 1, the stack pointer is 0x20001160, which means that there are around 4.31 KiB of SRAM free.
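For reference, these measurements can be taken with a gdb session along these lines (hbreak places a hardware breakpoint, which is needed because the program runs from flash; the printed register values are omitted here):

(gdb) hbreak main
(gdb) continue
(gdb) info r $sp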

    It makes sense that these 4.3 KiB are not enough to run the relatively complex calculations required by the elliptic curve cryptography, and so we run out of stack space when codegen-units = 1. The question now is why we have a difference in the stack usage at the main function of 8.34 KiB depending on how we compile the program.

First I read the code that the program executes until the main function is called, to understand the stack usage up to this point. I saw that the stack pointer was first initialized to 0x20008000 and only 16 bytes of stack were reserved later on, so at the start of the main function the stack size is only 16 bytes. I verified this with gdb. Therefore, in order to understand the stack usage of main, we only need to look at how the stack pointer is decremented in the prologue of main.

    With codegen-units = 1, this is the beginning of the main function.

0800354e :
    800354e: 0000b297 auipc t0, 0xb
    8003552: 120282e7 jalr t0, 0x120(t0)
    8003556: 651d lui a0, 0x7
    8003558: d9050513 addi a0, a0, -0x270
    800355c: 40a10133 sub sp, sp, a0
    [...]

The outlined function that is being called is basically the opposite of the previous outlined function we saw. It saves the return address register and all the saved registers to the stack. Interestingly, it decrements the stack pointer by 256 bytes, which is more than the 52 bytes necessary to store these 13 registers. I don’t know why this is done like this, but clearly it is then taken into account when reserving more stack in the main function. Perhaps the idea is that functions that need slightly less than 256 bytes of stack can simply use this reservation and not touch the stack pointer themselves.

    0800e66e :
    800e66e: 7111 addi sp, sp, -0x100
    800e670: df86 sw ra, 0xfc(sp)
    800e672: dda2 sw s0, 0xf8(sp)
    800e674: dba6 sw s1, 0xf4(sp)
    800e676: d9ca sw s2, 0xf0(sp)
    800e678: d7ce sw s3, 0xec(sp)
    800e67a: d5d2 sw s4, 0xe8(sp)
    800e67c: d3d6 sw s5, 0xe4(sp)
    800e67e: d1da sw s6, 0xe0(sp)
    800e680: cfde sw s7, 0xdc(sp)
    800e682: cde2 sw s8, 0xd8(sp)
    800e684: cbe6 sw s9, 0xd4(sp)
    800e686: c9ea sw s10, 0xd0(sp)
    800e688: c7ee sw s11, 0xcc(sp)
    800e68a: 8282 jr t0

    In any case, in the main function we see that the value (0x7 << 12) - 0x270 is loaded into a0 and then a0 is subtracted from the stack pointer. This means that overall the main function is decrementing the stack pointer by (0x7 << 12) - 0x270 + 0x100 = 28304 bytes.

In comparison, with codegen-units = 16, the prologue of the main function looks like this.

08009736 :
    8009736: 00006297 auipc t0, 0x6
    800973a: bf4282e7 jalr t0, -0x40c(t0)
    800973e: 6515 lui a0, 0x5
    8009740: c3050513 addi a0, a0, -0x3d0
    8009744: 40a10133 sub sp, sp, a0
    [...]

    The outlined function is identical to the one above, except that it has a different name and it sits at a different address. Therefore, in this case the main function is decrementing the stack pointer by (0x5 << 12) - 0x3d0 + 0x100 = 19760 bytes. As we already knew, there is a difference of 8544 bytes in the stack usage of main depending on how we compile.

    To investigate this difference, I looked at the LLVM IR for the program. This can be obtained by building with

    cargo rustc -p osnma-longan-nano \
    --target riscv32imac-unknown-none-elf \
    --profile embedded -- --emit=llvm-ir
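As a side note, rustc can also dump the MIR in the same way by passing --emit=mir instead of --emit=llvm-ir.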

    The LLVM IR is very verbose, so here I will only put the relevant details. With codegen-units = 1, the main function contains the following large stack allocations.

    define dso_local void @main() unnamed_addr #14 !dbg !23641 {
    […]
    %osnma.i = alloca [8536 x i8], align 8
    […]
    %interface = alloca [8792 x i8], align 8

    With codegen-units = 16, the main function only contains the %interface allocation. The %osnma.i is missing.

    define dso_local void @main() unnamed_addr #43 !dbg !60675 {
    […]
    %interface = alloca [8792 x i8], align 8

The presence of %osnma.i in the codegen-units = 1 LLVM IR is what causes most of the stack usage difference (there are another 8 bytes that are not worth investigating). The relevant part of the Rust code to understand these allocations is the following.

struct Board {
    tx: serial::Tx,
    rx: serial::Rx,
    rx_buffer: [u8; 256],
}

struct OsnmaInterface {
    osnma: Osnma,
    board: Board,
}

impl OsnmaInterface {
    fn new(board: Board) -> OsnmaInterface {
        let pubkey = VerifyingKey::from_sec1_bytes(&OSNMA_PUBKEY).unwrap();
        let pubkey = PublicKey::from_p256(pubkey, OSNMA_PUBKEY_ID).force_valid();
        let osnma =
            Osnma::from_merkle_tree(OSNMA_MERKLE_TREE_ROOT, Some(pubkey), false);
        OsnmaInterface { osnma, board }
    }
    [...]
}

#[entry]
fn main() -> ! {
    let board = Board::take();
    let mut interface = OsnmaInterface::new(board);

    loop {
        interface.spin();
    }
}

    Basically, the program contains an OsnmaInterface struct that has a 256-byte buffer for reading UART data (the serial::Tx and serial::Rx are zero-sized types) and an Osnma object that contains all the data required by the galileo-osnma library. The Board object is constructed by initializing the UART and zero-initializing the rx_buffer, and then the Osnma object is constructed with one of the constructors offered by the library. Both objects are put together in an OsnmaInterface object.

Because osnma is moved into interface, we would expect that it gets constructed directly into the allocation for interface, instead of being constructed somewhere else and then copied over. This is indeed what happens with codegen-units = 16, but not with codegen-units = 1. The thing we need to understand here is that Rust’s move semantics rely heavily on LLVM’s ability to optimize out unnecessary temporary allocations and copies. I will illustrate this with a simple example, which you can also see in Godbolt’s Compiler Explorer.

    Consider the following code:

#![no_std]

struct A {
    _data: [u8; 64],
}

struct B {
    _data: [u8; 32],
}

pub struct Both {
    _a: A,
    _b: B,
}

#[unsafe(no_mangle)]
pub fn construct() -> Both {
    Both {
        _a: A { _data: [0; 64] },
        _b: B { _data: [0xff; 32] },
    }
}

    When construct is called, the allocation for the Both that is returned has been already reserved by the caller, so the only thing that construct should do is two memsets to initialize the arrays in this allocation to their required values. This is indeed what happens when we build with -C opt-level=z -C codegen-units=1.

    construct:
    addi sp, sp, -16
    sw ra, 12(sp)
    sw s0, 8(sp)
    mv s0, a0
    li a2, 64
    li a1, 0
    call memset
    addi a0, s0, 64
    li a1, 255
    li a2, 32
    call memset
    lw ra, 12(sp)
    lw s0, 8(sp)
    addi sp, sp, 16
    ret

    The LLVM IR is basically two calls to memset as we would expect.

    define dso_local void @construct(ptr dead_on_unwind noalias noundef writable writeonly sret([96 x i8]) align 1 captures(none) dereferenceable(96) initializes((0, 96)) %_0) unnamed_addr {
    start:
    tail call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(64) %_0, i8 0, i32 64, i1 false)
    %0 = getelementptr inbounds nuw i8, ptr %_0, i32 64
    tail call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(32) %0, i8 -1, i32 32, i1 false)
    ret void
    }

    declare void @llvm.memset.p0.i32(ptr writeonly captures(none), i8, i32, i1 immarg) #1

However, the Rust MIR looks quite different. This is the relevant part, corresponding to the construct function.

fn construct() -> Both {
    let mut _0: Both;
    let mut _1: A;
    let mut _2: [u8; 64];
    let mut _3: B;
    let mut _4: [u8; 32];

    bb0: {
        StorageLive(_1);
        StorageLive(_2);
        _2 = [const 0_u8; 64];
        _1 = A { _data: move _2 };
        StorageDead(_2);
        StorageLive(_3);
        StorageLive(_4);
        _4 = [const u8::MAX; 32];
        _3 = B { _data: move _4 };
        StorageDead(_4);
        _0 = Both { _a: move _1, _b: move _3 };
        StorageDead(_3);
        StorageDead(_1);
        return;
    }
}

    We see that there are temporaries for everything. Even A and B are constructed by first putting the array into a temporary and then moving the array into the struct.

If we look at the initial LLVM IR before any optimization passes are done, we see that it closely mirrors the MIR. We have memset() to initialize the arrays to their corresponding values, and memcpy() to move things. There are 192 bytes of temporaries allocated on the stack just for a function that is only supposed to initialize caller-allocated memory.

    define dso_local void @construct(ptr dead_on_unwind noalias noundef writable sret([96 x i8]) align 1 captures(address) dereferenceable(96) %0) unnamed_addr {
    %2 = alloca [32 x i8], align 1
    %3 = alloca [32 x i8], align 1
    %4 = alloca [64 x i8], align 1
    %5 = alloca [64 x i8], align 1
    call void @llvm.lifetime.start.p0(i64 64, ptr %5)
    call void @llvm.lifetime.start.p0(i64 64, ptr %4)
    call void @llvm.memset.p0.i32(ptr align 1 %4, i8 0, i32 64, i1 false)
    call void @llvm.memcpy.p0.p0.i32(ptr align 1 %5, ptr align 1 %4, i32 64, i1 false)
    call void @llvm.lifetime.end.p0(i64 64, ptr %4)
    call void @llvm.lifetime.start.p0(i64 32, ptr %3)
    call void @llvm.lifetime.start.p0(i64 32, ptr %2)
    call void @llvm.memset.p0.i32(ptr align 1 %2, i8 -1, i32 32, i1 false)
    call void @llvm.memcpy.p0.p0.i32(ptr align 1 %3, ptr align 1 %2, i32 32, i1 false)
    call void @llvm.lifetime.end.p0(i64 32, ptr %2)
    call void @llvm.memcpy.p0.p0.i32(ptr align 1 %0, ptr align 1 %5, i32 64, i1 false)
    %6 = getelementptr inbounds i8, ptr %0, i32 64
    call void @llvm.memcpy.p0.p0.i32(ptr align 1 %6, ptr align 1 %3, i32 32, i1 false)
    call void @llvm.lifetime.end.p0(i64 32, ptr %3)
    call void @llvm.lifetime.end.p0(i64 64, ptr %5)
    ret void
    }

Things look quite similar for a few optimization passes, until we reach a MemCpyOptPass that realizes that a memset followed by a single memcpy or a chain of multiple memcpys can be replaced by a direct memset to the destination of the last memcpy in the chain. This optimization gives the following. Note that we now have two memsets and no memcpys, but the temporary allocations haven’t been optimized out yet.

    define dso_local void @construct(ptr dead_on_unwind noalias noundef writable writeonly sret([96 x i8]) align 1 captures(none) dereferenceable(96) %0) unnamed_addr {
    %2 = alloca [32 x i8], align 1
    %3 = alloca [32 x i8], align 1
    %4 = alloca [64 x i8], align 1
    %5 = alloca [64 x i8], align 1
    call void @llvm.lifetime.start.p0(i64 64, ptr nonnull %5)
    call void @llvm.lifetime.start.p0(i64 64, ptr nonnull %4)
    call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(64) %0, i8 0, i32 64, i1 false)
    call void @llvm.lifetime.end.p0(i64 64, ptr nonnull %4)
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %3)
    call void @llvm.lifetime.start.p0(i64 32, ptr nonnull %2)
    %6 = getelementptr inbounds nuw i8, ptr %0, i32 64
    call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(32) %6, i8 -1, i32 32, i1 false)
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %2)
    call void @llvm.lifetime.end.p0(i64 32, ptr nonnull %3)
    call void @llvm.lifetime.end.p0(i64 64, ptr nonnull %5)
    ret void
    }

    The next InstCombinePass realizes that the temporary allocations are unused and removes them. The IR now looks very similar to the final IR.

    define dso_local void @construct(ptr dead_on_unwind noalias noundef writable writeonly sret([96 x i8]) align 1 captures(none) dereferenceable(96) %0) unnamed_addr {
    call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(64) %0, i8 0, i32 64, i1 false)
    %2 = getelementptr inbounds nuw i8, ptr %0, i32 64
    call void @llvm.memset.p0.i32(ptr noundef nonnull align 1 dereferenceable(32) %2, i8 -1, i32 32, i1 false)
    ret void
    }

    In this very simple example everything works as expected and we get the assembly code that we wanted. In much more complex cases, LLVM optimization passes might not be able to optimize out all the moves emitted by the Rust compiler. This is what is happening in the codegen-units = 1 case with the Longan nano software. Since the software is much more complex than this simple example, I haven’t investigated what is preventing LLVM from optimizing out the %osnma.i temporary allocation.

The final question is: what can we do to improve this? One issue is missed opportunities for move optimizations. This could be fixed by future improvements in the LLVM optimizer. However, the main issue here is that the Rust compiler is inlining a lot of initialization code into the main function. Besides a few other variables which are small, the data that the program needs on the stack to run is the interface object, which takes 8792 bytes. However, we have seen that even in the codegen-units = 16 case, the main function needs 19760 bytes of stack, which is a lot.

In the LLVM IR we can see that besides the interface allocation there are a few other large allocations with sizes around 1 to 3 KiB. These have less obvious names, some of which include sroa, which stands for “scalar replacement of aggregates”, an LLVM optimization. My understanding of this situation is that we are getting allocations for temporaries that are needed in the initialization of the Osnma object, such as temporaries used to load the ECDSA public key.

In an ideal world, we could build Osnma using compile-time const evaluation, since this Osnma instance only depends on const items containing the ECDSA public key and Merkle tree root that are generated by the build.rs script. However, const evaluation in Rust is somewhat limited (for good reasons), and none of the elliptic curve cryptography functions that are used here are const. Besides, compile-time initialization wouldn’t be realistic anyway. More realistic software would read this cryptographic material from somewhere in flash, to allow the material to be updated without updating the software. Such software would still need to run all of this initialization at runtime.

In any case, there is a simple way to improve this program. If initialization takes up a lot of stack space, then it shouldn’t be inlined into the main function. In this way, the stack space used by the initialization can be freed when it finishes, leaving it available for the program loop. This is what I’ve done in this software. It now looks like this.

#[inline(never)]
fn new_interface() -> OsnmaInterface {
    OsnmaInterface::new(Board::take())
}

#[entry]
fn main() -> ! {
    let mut interface = new_interface();
    loop {
        interface.spin();
    }
}

    With this change, and building with codegen-units = 1, the new_interface function looks like this.

    08002f92 :
    8002f92: 0000b297 auipc t0, 0xb
    8002f96: 610282e7 jalr t0, 0x610(t0)
    8002f9a: 6595 lui a1, 0x5
    8002f9c: af058593 addi a1, a1, -0x510
    8002fa0: 40b10133 sub sp, sp, a1
    [...]

    The outlined function is saving the return address register and the first 8 saved registers.

    0800e5a2 :
    800e5a2: 7111 addi sp, sp, -0x100
    800e5a4: df86 sw ra, 0xfc(sp)
    800e5a6: dda2 sw s0, 0xf8(sp)
    800e5a8: dba6 sw s1, 0xf4(sp)
    800e5aa: d9ca sw s2, 0xf0(sp)
    800e5ac: d7ce sw s3, 0xec(sp)
    800e5ae: d5d2 sw s4, 0xe8(sp)
    800e5b0: d3d6 sw s5, 0xe4(sp)
    800e5b2: d1da sw s6, 0xe0(sp)
    800e5b4: cfde sw s7, 0xdc(sp)
    800e5b6: 8282 jr t0

    The stack usage of new_interface is (0x5 << 12) - 0x510 + 0x100 = 19440 bytes. So we see that as a bonus the move optimization for osnma is now working even with codegen-units = 1.

    The main function starts like this.

08003bc2 :
    8003bc2: 0000b297 auipc t0, 0xb
    8003bc6: 9e0282e7 jalr t0, -0x620(t0)
    8003bca: cde2 sw s8, 0xd8(sp)
    8003bcc: cbe6 sw s9, 0xd4(sp)
    8003bce: c9ea sw s10, 0xd0(sp)
    8003bd0: c7ee sw s11, 0xcc(sp)
    8003bd2: 6509 lui a0, 0x2
    8003bd4: 24050513 addi a0, a0, 0x240
    8003bd8: 40a10133 sub sp, sp, a0
    8003bdc: 6509 lui a0, 0x2
    8003bde: 2a050513 addi a0, a0, 0x2a0
    8003be2: 00a10db3 add s11, sp, a0
    8003be6: 40014437 lui s0, 0x40014
    8003bea: 6909 lui s2, 0x2
    8003bec: 04810993 addi s3, sp, 0x48
    8003bf0: 12c90513 addi a0, s2, 0x12c
    8003bf4: 7ff98493 addi s1, s3, 0x7ff
    8003bf8: 954e add a0, a0, s3
    8003bfa: cc2a sw a0, 0x18(sp)
    8003bfc: 67d48c93 addi s9, s1, 0x67d
    8003c00: 00a8 addi a0, sp, 0x48
    8003c02: fffff097 auipc ra, 0xfffff
    8003c06: 390080e7 jalr 0x390(ra)
    [...]

    It calls the same outlined function. The stack space that it needs is (0x2 << 12) + 0x240 + 0x100 = 9024 bytes. Taking into account that interface is using 8792 bytes, that leaves only 232 bytes used by other variables, which is excellent. With this change, the main function has 23.1 KiB of free stack space, so we don’t risk running out of stack during the program loop.

  • First analysis of the Lunar GNSS Receiver Experiment data

The Lunar GNSS Receiver Experiment (LuGRE) is a NASA and Italian Space Agency payload that flew on the Firefly Blue Ghost Mission 1 lunar lander. An overview of this experiment can be found in this presentation. The payload contains a Qascom GPS/Galileo L1 + L5 receiver capable of both real time positioning and raw IQ recording, and a 16 dBi high-gain antenna that was pointed towards Earth. For decades the GNSS community has been talking about using GNSS in the lunar environment, and LuGRE has been the first payload to actually demonstrate this concept.

The LuGRE payload ran a total of 25 times over the Blue Ghost mission duration, starting with a commissioning run on 2025-01-15, a few hours after launch, and ending with a series of 9 runs on the lunar surface between 2025-03-03 (the day after the lunar landing) and 2025-03-16 (end of mission after lunar sunset). Back on October 15, the experiment data was published on Zenodo in the dataset Lunar GNSS Receiver Experiment (LuGRE) Mission Data. This dataset includes some short raw IQ recordings, as well as output from the real time GNSS receiver (raw observables, PVT, ephemeris and acquisition data). Since I have some professional background implementing high-sensitivity GNSS acquisition algorithms and I find this experiment quite interesting, I decided to do some data analysis, mainly of the raw IQ data.

The initial results of the experiment were presented on September 11 at the ION GNSS+ 2025 conference in a talk titled Initial Results of the Lunar GNSS Receiver Experiment (LuGRE). This talk is only available to registered ION attendees. As I don’t want to resort to my network to scrounge some ION paper credits (which is how proceedings are usually downloaded from the ION website), I haven’t seen anything about this talk besides the abstract. It is quite possible that some of what I mention here was already presented in this talk. This is what we get as a society for not doing science in a completely public and open way. On the other hand, it is interesting that this also makes my analysis less likely to be biased: I have just downloaded some raw IQ data and started looking at it with only basic context about the instrument that produced it and how it was run.

For a first analysis, I have implemented a high-sensitivity acquisition algorithm for the GPS L1 C/A signal in CUDA. I have run this on the 20 L1 IQ recordings that are available (except for one that is too short, as explained below). In this post I present the algorithm and the results.

    Acquisition algorithm

    Most of the parameters of the high-sensitivity acquisition algorithm are configurable. Here I will describe it with the nominal configuration intended to process the L1 recordings from LuGRE. The algorithm uses a coherent integration length of \(T = 20\, \mathrm{ms}\), matching the symbol duration of the GPS L1 C/A signal. This coherent integration is realized by performing a linear correlation with a replica with a duration of \(T\) and a signal window with a duration of \(2T\). This signal window is guaranteed to contain a full symbol. The linear correlation is implemented with a circular correlation done using FFTs. The FFT size is \(2 T f_s\), where \(f_s\) is the sample rate (8 Msps in the case of most LuGRE L1 recordings), and the replica of duration \(T\) is zero padded to a duration of \(2T\). Only the first \(T f_s\) samples of the IFFT are taken into account, since the full symbol contained in the signal window is guaranteed to start within the first \(T f_s\) samples. Note that the FFT size has only 2 and 5 as prime factors, so it can be implemented reasonably efficiently. An alternative implementation could choose an FFT size which is slightly larger than \(T f_s\) and zero pad both the signal and replica.

    Doppler search is implemented by applying a circular shift to the FFT of the replica before performing the multiplication of the FFT of the signal and the complex conjugate of the replica. However, a circular shift by one position implements a frequency shift of \(1/(2T)\). For the nominal configuration we want a Doppler step of \(1/(4T)\). This is implemented by always working with two copies of the replica: one at baseband and one with an additional frequency shift of \(1/(4T)\).
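As an illustration of these two ideas (linear correlation done with a zero-padded circular correlation, and Doppler search done by circularly shifting the replica FFT), here is a minimal NumPy sketch. This is not the gnss-dsp implementation; it just spells out the same math, and the replica generation is left as a hypothetical prn_replica function.

import numpy as np

fs = 8_000_000   # sample rate in Hz
T = 20e-3        # coherent integration length in seconds
N = int(T * fs)  # samples per replica (160000)

def coherent_caf(window, replica_fft, doppler_bin):
    # window: 2*N complex samples (2T of signal, guaranteed to
    # contain a full symbol).
    # replica_fft: FFT of the replica zero-padded to length 2*N.
    # doppler_bin: integer multiple of 1/(2T) = 25 Hz to search.
    window_fft = np.fft.fft(window)
    # A circular shift of the replica FFT by one position is a
    # frequency shift of 1/(2T).
    shifted = np.roll(replica_fft, doppler_bin)
    corr = np.fft.ifft(window_fft * np.conj(shifted))
    # The full symbol starts within the first N samples, so only
    # those are kept.
    return corr[:N]

# replica = prn_replica(prn, fs)            # hypothetical; N samples
# replica_fft = np.fft.fft(replica, 2 * N)  # zero-pads to 2*N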

    Non-coherent integrations are done by working with signal windows that are advanced in steps of \(T\). Since the signal window duration is \(2T\), these windows overlap by 50%. A total of \(m = 14\) non-coherent integrations are used. This is because the total length of signal needed is \((m+1)T\), and most of the LuGRE IQ recordings have a duration of 300 ms (there is a single 200 ms recording which I’m ignoring, and a few longer recordings for which I’m only using the first 300 ms).

    In order to avoid smearing of the correlation peak due to code Doppler, when the non-coherent integrations are added, a time shift of a few samples is applied to the integrations. This time shift is computed as the accumulated code Doppler at this integration, obtained from the carrier Doppler bin that is being computed, rounded to an integer number of samples. Note that code Doppler is ignored within each coherent integration of duration \(T\), since the replica is always generated with zero code Doppler. Non-zero code Doppler is only taken into account when stacking multiple integrations. Even though this time shift is only a few samples, there are edge effects to take into account. At the beginning and end of each integration, the time shift can require us to fetch samples from either the previous or the next integration.
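A sketch of the non-coherent accumulation with the code Doppler correction, under the same assumptions (note that np.roll wraps around, while the real implementation handles the edges by fetching samples from the previous or next integration):

import numpy as np

fs = 8_000_000       # sample rate in Hz
T = 20e-3            # coherent integration length in seconds
L1_FREQ = 1575.42e6  # GPS L1 carrier frequency in Hz

def noncoherent_sum(coherent, doppler_hz):
    # coherent: (m, N) array of complex correlator outputs, one row
    # per signal window; windows are advanced in steps of T, so row
    # i starts at time i*T.
    # doppler_hz: carrier Doppler of the bin being accumulated.
    m, n = coherent.shape
    acc = np.zeros(n)
    for i in range(m):
        # Accumulated code Doppler at integration i, in samples.
        # The code rate scales by doppler_hz / L1_FREQ relative to
        # the nominal rate, so the correlation peak drifts by
        # i * T * doppler_hz / L1_FREQ seconds.
        shift = round(i * T * doppler_hz / L1_FREQ * fs)
        acc += np.abs(np.roll(coherent[i], shift)) ** 2
    return acc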

A rule of thumb for acquisition sensitivity that is mentioned for instance in the Springer Handbook of Global Navigation Satellite Systems is that acquisition sensitivity, in terms of CN0 in dB·Hz, is given by\[13\,\mathrm{dB} - 10 \log_{10} T - 5 \log_{10} m.\]In our case this gives 24.3 dB·Hz. This rule of thumb is good as an initial guess, but it needs to be refined by taking into account the size of the search space and the presence of interference from other GNSS signals.
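Explicitly, with \(T = 0.02\,\mathrm{s}\) and \(m = 14\),\[13 - 10 \log_{10}(0.02) - 5 \log_{10}(14) \approx 13 + 17.0 - 5.7 = 24.3.\]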

    CUDA implementation

Initially I developed a prototype in a Jupyter notebook using CuPy to validate the algorithm. After that, I implemented the same algorithm in Rust using cudarc, cuFFT, and custom CUDA kernels in C++ that are compiled just-in-time for the required acquisition parameters. I have made a new Rust crate called gnss-dsp with the intention of adding more GNSS signal processing algorithms to it in the future. The crate uses PyO3 and Maturin for Python bindings, and it is also available as a gnss-dsp Python package. The Python package contains some convenient Python CLI scripts that run the acquisition and generate plots with Matplotlib.

The CUDA implementation of the algorithm works as follows. First, \((m+1) T f_s\) signal samples are copied to the GPU. The FFTs of the \(m\) signal windows are computed with a single cuFFT call that performs FFTs with 50% overlap. The results are stored in the GPU so that they can be reused to acquire different PRNs.

    The acquisition works iteratively on multiple PRNs. How many PRNs can be used in a single call is limited by host memory, since the results (which are a full Doppler-delay map of \(T f_s\) samples in the delay dimension and \(4 T \Delta f\) Doppler bins in the Doppler dimension, where \(\Delta f\) is the total Doppler range to cover) for all the PRNs in each call are stored in an array in host memory.
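To get a feel for the sizes involved (assuming single-precision values, which is an assumption on my part): with \(f_s = 8\) Msps, \(T = 20\) ms and \(\Delta f = 20\) kHz, the Doppler-delay map for a single PRN has \(T f_s \cdot 4 T \Delta f = 160000 \cdot 1600\) entries, which takes about 1 GiB.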

For each PRN, a baseband replica of length \(T f_s\) samples is computed. This baseband replica is then shifted in frequency to generate the other \(k - 1\) replicas, where \(k\) is the Doppler oversampling factor, which gives a Doppler step of \(1/(2kT)\). By default, \(k = 2\), so only one additional replica is generated. The FFT of all these \(k\) replicas is then computed with cuFFT. In order to make circular shifting of the replica more efficient, two repetitions of each replica are stored consecutively in GPU memory. This is achieved by performing a device-to-device memory copy after the FFT is finished.

Circular shifts that are used to perform Doppler shifts by multiples of \(1/(2T)\) are evaluated in Doppler blocks, which compute \(r\) of these circular shifts at once. Since intermediate results for all the \(kr\) Doppler bins in the Doppler block are stored in GPU memory, the size of a Doppler block is limited by GPU memory. For each Doppler block, a signal_replica_products CUDA kernel is launched to compute the products of all the \(m\) signal window FFTs with the complex conjugates of all the \(k\) replica FFTs for all the \(r\) circular shifts in the Doppler block. Because two consecutive copies of each replica FFT are stored in GPU memory, the circular shift is implemented simply as an address offset when reading the replica. After the signal_replica_products kernel finishes, a cuFFT call is done to compute the IFFTs for all \(mkr\) products.

    The final step is the non-coherent accumulations. An offset in samples (as a floating point number) per non-coherent integration is computed in terms of the carrier Doppler. The same value is used for all the Doppler bins in the Doppler block, since their difference in accumulated code Doppler is quite small for typical Doppler block sizes. A noncoherent_accumulate CUDA kernel is called to compute the non-coherent accumulations. For each of the \(kr\) outputs in this Doppler block, and for each of the \(m\) integrations in each output, the kernel computes the complex magnitude squared of the IFFTs and adds each of the \(m\) integrations, writing the result of the accumulation to the results array in GPU memory.

    After all the Doppler blocks corresponding to the required Doppler range are computed, the acquisition algorithm performs a GPU to host memory copy to copy the results for this PRN to the host array that contains the results for all the PRNs in the call. It then proceeds with the remaining PRNs in the list of PRNs to acquire.

    A simple acquisition benchmark example is included in gnss-dsp. It acquires 4 PRNs in a single call using a Doppler range of -10 kHz to +10 kHz and a Doppler block size of 8 circular shifts. This takes about 4.9 seconds to run on an RTX 3070 and uses 2.9 GiB of GPU memory.

The following figure shows profiling results for the acquisition benchmark example in nsys-ui. We can see the compute for each of the four PRNs, followed in each case by the device-to-host copy of the results. I don’t know why the last device-to-host copies take much longer, since all the copies have the same size. I have seen varying levels of throughput for these copies in multiple runs of the same benchmark.

    Acquisition benchmark profile in nsys-ui.

    The beginning of the acquisition benchmark looks like this. First all the signal samples are copied to the GPU. Then the signal FFTs are computed by two calls to cuFFT kernels. When computing FFTs of size 320000, there is always one kernel to perform the factor 512 of the decomposition of the FFT size using radix 8 and another one to perform the factor 625 using radix 5. Then there is some time in which the host is computing the replicas. These are then copied to the GPU and their FFTs are computed, followed by the device to device copy used to duplicate the FFTs in GPU memory. Then the loop over Doppler blocks begins. This is always formed by the signal_replica_products kernel, the two kernels implementing the IFFTs, and the noncoherent_accumulate kernel.

    Detail of beginning of acquisition benchmark profile

    Acquisition results processing

    For each PRN, the acquisition algorithm returns an array corresponding to the cross ambiguity function (CAF) of the signal and the replica generated for that PRN. In the Doppler dimension the array has a resolution of \(1/(2kT)\) (by default \(k = 2\)) and it covers the specified Doppler range. In the time dimension the array has a resolution of one sample and it covers the delay range between zero and \(T\).

The array is searched for the entry with the maximum value, which determines the Doppler and delay corresponding to the acquisition result. By default this searches over the whole array, but it is possible to limit the search along the time dimension to a particular range, which is useful in some cases that will be presented below.

    The mean and the standard deviation of the array are computed. Acquisition SNR is measured by subtracting the mean and normalizing by dividing by the standard deviation. Note that in the case of a strong signal, the mean is positively biased by the correlation peak. However, this algorithm is mostly intended for weaker signals, for which this bias is negligible.
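As a sketch, the SNR computation amounts to the following (the use of \(10 \log_{10}\) to convert the normalized peak to dB is my assumption about the convention, since the CAF contains power-like magnitude-squared accumulations):

import numpy as np

def acquisition_snr_db(caf):
    # caf: Doppler x delay array of non-coherent accumulations.
    normalized = (caf - caf.mean()) / caf.std()
    return 10 * np.log10(normalized.max())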

    The following plot shows the CAF for a simulated signal with 24 dB·Hz of CN0. The plot has been obtained by running

    gnss-acquisition-simulation --plots-dir /tmp/plots --cn0 24

The plots show the normalized SNR in linear units, but the title of the plot gives values in dB. In this case the delay of the simulated signal is zero seconds. However, for weak signals the correlation peak can appear at another multiple of the code period close to the symbol edge (19 ms in this case) because of the random contribution of noise, particularly if the simulated signal has few symbol changes.

    In this simulation the signal is only G01 plus AWGN. Other PRNs will also show what appears to be a small correlation peak at a random location, but this is just because of the random contribution of AWGN, because the search space is rather large (20 ms times 20 kHz in this case).

    The following plot shows the acquisition SNRs obtained for all the PRNs in this simulation. We can see that PRNs that are not present have an SNR around 9 to 10 dB. G01, which is the only one present, has an SNR around 13 dB, so it can be detected reliably. Note that in this case the delay and Doppler of G01 are perfectly aligned with one of the CAF bins that are computed, so there are no correlation losses because of residual delay or frequency error.

A problem with the GPS L1 C/A code is that the cross correlation between different PRNs is relatively high, especially if we also consider frequency offsets which are integer multiples of 1 kHz. The consequence of this for high-sensitivity acquisition algorithms is that if there is a relatively strong signal from one PRN, there will be false acquisitions for PRNs which are not present, due to this cross correlation. The Doppler frequency of these false acquisitions is typically close to the Doppler frequency of the strong PRN plus an integer multiple of 1 kHz.
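To put a number on the zero-Doppler case, the following throwaway NumPy script (not part of gnss-dsp) computes the circular cross-correlation between the C/A codes of G01 and G02, generating the codes with the G2 tap pairs from the GPS interface specification. The maximum cross-correlation of these Gold codes is 65/1023, which is only about 23.9 dB below the autocorrelation peak.

import numpy as np

def ca_code(taps):
    # One period (1023 chips) of a GPS C/A code as +1/-1 values.
    # taps: pair of G2 output taps ((2, 6) for G01, (3, 7) for G02).
    g1 = [1] * 10
    g2 = [1] * 10
    chips = np.empty(1023)
    for i in range(1023):
        out = g1[9] ^ g2[taps[0] - 1] ^ g2[taps[1] - 1]
        chips[i] = 1 - 2 * out
        fb1 = g1[2] ^ g1[9]  # G1 polynomial 1 + x^3 + x^10
        # G2 polynomial 1 + x^2 + x^3 + x^6 + x^8 + x^9 + x^10
        fb2 = g2[1] ^ g2[2] ^ g2[5] ^ g2[7] ^ g2[8] ^ g2[9]
        g1 = [fb1] + g1[:9]
        g2 = [fb2] + g2[:9]
    return chips

c1 = ca_code((2, 6))  # G01
c2 = ca_code((3, 7))  # G02
xc = np.fft.ifft(np.fft.fft(c1) * np.conj(np.fft.fft(c2))).real / 1023
print(20 * np.log10(np.max(np.abs(xc))))  # about -23.9 dB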

    A demonstration of this effect is obtained by performing the same simulation but with a CN0 of 40 dB·Hz for G01. The acquisition SNRs for PRNs that are not present are slightly higher than before.

    If we plot the acquisition Doppler modulo 1 kHz, we can see that most PRNs have values that are close to 0, which is the Doppler frequency of G01. This means that the correlation peaks found by the acquisition algorithm are actually caused by cross correlation with the G01 signal.

    This kind of plot is quite useful to judge if cross correlation effects are happening. In the LuGRE data we will see that this does not happen, since we don’t have signals that are strong enough.

    Results obtained with the LuGRE data

    I have run the acquisition algorithm on all the LuGRE L1 recordings (except IQS_L1_20250119_045211_200MS_T_OP5_0, which is only 200 ms long) using a Doppler search range from -25 kHz to 25 kHz. The characteristics of each recording are listed in an OPTABLE.csv file included in the dataset.

    Here I show only the recordings in which some GPS L1 C/A signals were detected by this acquisition.

    IQS_L1_20250116_013117_400MS_C_OP2_0

    OP2 was a commissioning run done at 25.25 Earth radii distance. Two reasonably strong signals are detected.

    However, when we look at the CAF of these two PRNs, we see that there is a huge spread in Doppler. I investigate this in more detail in a section below.

    IQS_L1_20250203_091340_400MS_T_OP17_0

    OP17 was done during the transfer orbit at a distance of 44.91 Earth radii. G11, G31, and possibly G18 are detected.

    In the CAF of the two stronger signals we can see a Doppler spread effect.

    IQS_L1_20250207_232901_600MS_T_OP21_0

    OP21 was done during the transfer orbit at 32.63 Earth radii distance. Only G24 is detected.

    In the CAF we see the usual frequency spread.

    IQS_L1_20250212_034755_400MS_T_OP22_0

    OP22 was done during the transfer orbit at a distance of 52.92 Earth radii. Only G31 is detected. Its CAF shows frequency spread.

    IQS_L1_20250214_045853_400MS_L_OP23_0

    OP23 was done in lunar orbit at a distance of 61.19 Earth radii. This is an interesting dataset. Looking at the plot of SNRs it is not obvious that there are any detected signals, although the SNRs are all slightly higher than usual. The plot of Doppler modulo 1 kHz shows all the PRNs except one inside a band.

    What happens in this recording is that G21 is present, although there is some Doppler spread.

Other PRNs show a horizontal band in the CAF plot. This is probably caused by interference, and it also happens in other datasets. There is weak CW interference around -420 kHz and -400 kHz in this and other datasets, so perhaps this is what causes these bands.

    IQS_L1_20250224_120449_300MS_L_OP32_0

    OP32 was done in lunar orbit at a distance of 58.93 Earth radii. Only G06 is detected, with a rather weak signal.

    The CAF for G06 doesn’t show any frequency spread, but maybe that is just because the signal is weak and the spread is below the noise floor.

    IQS_L1_20250304_070323_300MS_S_OP40_0

    OP40 was done from the lunar surface at a distance of 56.25 Earth radii. G25 is detected with a reasonably strong signal, and perhaps G18 is present with a weak signal.

In the CAF of G25 we do not see any frequency spreading, and this signal is strong enough that the spreading would show up if it were present.

    IQS_L1_20250314_124717_500MS_S_OP74_0

    OP74 was done from the lunar surface at a distance of 61.9 Earth radii. Only G31 is detected, but it has a rather strong signal. There is some Doppler spread in the CAF, but it is narrower than in other recordings.

    IQS_L1_20250315_130727_2000MS_S_OP76_0

    OP76 was done from the lunar surface at a distance of 62.2 Earth radii. Unlike most other recordings, it is 2 seconds long. G31 and G32 are detected. Their CAFs do not show any Doppler spread.

    IQS_L1_20250316_191504_300MS_S_OP77_1

OP77_1 was done from the lunar surface at a distance of 62.44 Earth radii. The acquisition SNR shows weak signals for G06 and G14. The signals actually have large Doppler spread in their CAFs, which lowers the effective SNR. Unlike in other recordings, the spread patterns of the two signals are noticeably different.

    IQS_L1_20250316_220402_300MS_S_OP78_0

OP78_0 was done from the lunar surface at a distance of 62.45 Earth radii. G12 has a strong signal, with some Doppler spread.

    IQS_L1_20250316_221128_300MS_S_OP78_1

    OP78_1 was done just 10 minutes after OP78_0, so the results are similar. G12 is present with a strong signal. In this case the Doppler spread is much larger than in OP78_0.

    Doppler spread analysis

    We have seen that in most of the recordings the CAFs show a large amount of Doppler spread. To investigate this, I have processed OP78_1 again with some specific parameters to make a video that shows how the CAF evolves over time. These parameters are:

    • Coherent integration length: 10 ms
    • Only one non-coherent integration
    • Doppler oversampling: 8 (gives a frequency resolution of 6.25 Hz)
    • Rather than finding the maximum CAF peak over a window of 20 ms (twice the coherent integration length), the CAF is restricted to the first 1 ms window
    • The file is processed in steps of 1 ms to obtain each frame of the video

    These processing parameters give us a view of how the CAF evolves over time in 1 ms steps. I have chosen OP78_1 for this analysis because it has a relatively strong signal with large Doppler spread, and because the receiver was “stationary” on the lunar surface.

    The video of the CAF of G12 in OP78_1 is shown here.

We can see three effects. First, the Doppler frequency moves down at a roughly constant rate throughout the recording. Second, we can sometimes see how the CAF peak splits into two peaks at a ±100 Hz offset from where they should be. This effect is fully expected: it happens when the 10 ms coherent integration contains a symbol change. Third, there is significant fading in the signal. When the CAF peak splits, the power is divided equally between the two peaks, but we can also see large drops in signal power at other moments.

For comparison, this is how a simulated signal with a constant Doppler of 3 kHz and 40 dB·Hz CN0 looks when processed with the same parameters.

The conclusion is that the Doppler frequency of the signal is changing throughout the recording because the receiver clock (its frequency reference) is drifting. A change of 500 Hz over 280 ms is too much to be explained by line-of-sight dynamics, especially since I have picked a recording where the receiver is on the lunar surface. This change corresponds to approximately 300 ppb, so the drift is on the order of 1 ppm/s. This is quite a lot.
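For reference, the arithmetic is\[\frac{500\,\mathrm{Hz}}{1575.42\,\mathrm{MHz}} \approx 3.2 \cdot 10^{-7}, \qquad \frac{3.2 \cdot 10^{-7}}{0.28\,\mathrm{s}} \approx 1.1\,\mathrm{ppm/s}.\]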

    If I had to guess the cause of this problem, I would say that it is a thermal effect that happens when the receiver starts working. I am quite surprised that most of the recordings have this problem. This issue should have been detected and fixed while testing the instrument on ground before launch.

I haven’t looked into how this problem impacts the real time positioning data. My guess is that if this is a thermal effect, it will only last for the first few minutes until things settle. Since each real time positioning session lasts a few hours, the impact would be minimal.

Since the IQ recordings are very short, they are completely affected by this issue even if it is a transient (except for OP40_0 and OP76_0, which do not present this problem at all). This is a pity. It is not that the data is completely bad, but a drift on the order of 1.5 kHz/s is too large for typical GNSS receiver algorithms (it corresponds to an acceleration of 29 g), so the drift would need to be estimated and corrected before trying to use these recordings to test algorithm performance.
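The conversion between Doppler rate and equivalent line-of-sight acceleration is\[a = \dot{f}\,\frac{c}{f_{L1}} = 1500\,\mathrm{Hz/s} \cdot \frac{3 \cdot 10^{8}\,\mathrm{m/s}}{1575.42\,\mathrm{MHz}} \approx 286\,\mathrm{m/s^2} \approx 29\,g.\]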

    Analysis of OP76_0

    Since OP76_0 is 2 seconds long and it does not have the frequency drift problem, it is without any doubt the best recording in this dataset. I have made a video of the CAF of G31 in this recording using the same processing parameters as in the previous section.

    Since this video spans 2 seconds of data, we can see how the code delay and Doppler slowly change throughout the recording. The fading is also quite noticeable. I am somewhat impressed that this relatively strong signal has so much fading. Perhaps there is a contribution of multipath due to a reflection on the lunar surface. However, the landing site of Blue Ghost Mission 1 has a latitude of 18.6º, which means that the Earth would be quite high in the sky as seen from the Moon. This should make ground reflections less likely.

    Conclusions

    In this post I have presented a CUDA high-sensitivity acquisition algorithm for GPS L1 C/A that can also be repurposed to compute the evolution of the CAF over time (in the form of a Delay-Doppler map, which is commonly used in GNSS reflectometry). I have run this over the LuGRE L1 recordings and observed that out of 19 recordings, signals can only be detected in 12 of them. The number of satellites in each of these 12 recordings ranges between 1 and 3.

    There is a receiver frequency drift problem in most of the recordings. This affects the sensitivity of the acquisition algorithm, so it is possible that some weaker signals have not been detected because of this.

I have not looked into the geometry of the detected satellites. I expect that those that have strong signals have the receiver in the main beam of their antenna. This means that, as seen from the receiver, the satellite is illuminating the Earth from “behind”, but there is still direct line of sight to it. Looking at the geometry in Figure 5 in this paper, we see that the GPS Block IIF antenna is designed to illuminate up to 23º off nadir, while the edge of the Earth is at 13.8º off nadir. Since the fraction of the area of the sphere that is occupied by a spherical cap defined by a spherical cone with aperture \(2\varphi\) is \((1- \cos \varphi)/2\), we see that, taking into account blockage by the Earth, the fraction of the celestial sphere illuminated by a single satellite is\[\frac{\cos 13.8º - \cos 23º}{2} \approx 0.0343.\]This means that, assuming that the GPS satellite positions are randomly and uniformly distributed over the celestial sphere (which is not really true, but it is useful to obtain a ballpark approximation) and that the constellation has 32 satellites, on average only 1.1 satellites illuminate a given point that is far enough from Earth with their main beam. This intuition roughly matches what we have seen in this dataset, and it limits the feasibility of using GNSS for navigation in the lunar environment.

    Code and data

The acquisition algorithm used in this post is implemented in the gnss-dsp Rust/Python package. There is an additional repository called lugre that contains the scripts that automate the data processing. The data was published in the Zenodo dataset Lunar GNSS Receiver Experiment (LuGRE) Mission Data by the LuGRE team.

