Back in May I proposed an erasure FEC scheme for SSDV. The SSDV protocol is used in amateur radio to transmit JPEG files split into packets, in such a way that losing some packets only causes the loss of pieces of the image, instead of a completely corrupted file. My erasure FEC augments the usual SSDV packets with additional FEC packets. Any set of \(k\) received packets is sufficient to recover the full image, where \(k\) is the number of packets in the original image. An almost limitless number of distinct FEC packets can be generated on the fly as required.
I have now written a Rust implementation of this erasure FEC scheme, which I have called ssdv-fec. This implementation has small microcontrollers in mind. It is no_std
(it uses neither the Rust standard library nor libc), does not perform any dynamic memory allocations, and works in-place as much as possible to reduce the memory footprint. As an example use case of this implementation, it is bundled as a static library with a C-like API for ARM Cortex-M4 microcontrollers. This might be used in the AMSAT-DL ERMINAZ PocketQube mission, and it is suitable for other small satellites. There is also a simple CLI application to perform encoding and decoding on a PC.
I have updated the Jupyter notebook that I made in the original post. The notebook had a demo of the FEC scheme written in Python. Now I have added encoding and decoding using the CLI application to this notebook. Using the ssdv-fec
CLI application is quite simple (see the README), so it can be used in combination with the ssdv
CLI application for encoding and decoding from and to a JPEG file. Note that the Python prototype and this new Rust implementation are not interoperable, for reasons described below.
For details about how the FEC scheme works you can refer to the original post. It is basically a Reed-Solomon code over \(GF(2^{16})\) used as an erasure FEC. To make the implementation suitable for microcontrollers, I have made some decisions about which algorithms to use for the mathematical calculations. I will describe these here.
The finite field \(GF(2^{16})\) is realized as an extension of degree two over \(GF(2^8)\). The field \(GF(2^8)\) is implemented as usual, with lookup tables for the exponential and logarithm functions. In this way, elements of \(GF(2^{16})\) are formed by pairs of elements of \(GF(2^8)\), and the multiplication and division can be written using relatively simple formulas with operations on \(GF(2^8)\).
More in detail, \(GF(2^8)\) is realized as the quotient\[GF(2)[x]/(x^8 + x^4 + x^3 + x^2 + 1).\]This choice of a primitive polynomial of degree 8 over \(GF(2)\) is very common. The element \(x\) is primitive, so the exponential function \(j \mapsto x^j\) and the logarithm \(x^j \mapsto j\) can be tabulated. These two tables occupy 256 bytes each. Multiplication is performed by using these tables to reduce it to an addition of exponents. The case where one of the factors is zero is treated separately, since the logarithm of zero is not defined. An element of \(GF(2^8)\) is encoded in a byte by writing it as a polynomial of degree at most 7 in \(x\) and storing the leading term of this polynomial in the MSB and the constant term in the LSB. This is all pretty standard, and it is for example how Phil Karn’s implementation of the CCSDS Reed-Solomon (255, 223) code works.
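As an illustration, here is a minimal sketch in Rust of this kind of table-based \(GF(2^8)\) arithmetic. It is not the actual ssdv-fec code (the names and structure are made up for the example); the constant 0x11d below encodes the polynomial \(x^8 + x^4 + x^3 + x^2 + 1\).

```rust
/// Build the tables exp[j] = x^j and log[x^j] = j for GF(2^8) with the
/// reduction polynomial x^8 + x^4 + x^3 + x^2 + 1 (0x11d).
fn build_tables() -> ([u8; 256], [u8; 256]) {
    let mut exp = [0u8; 256];
    let mut log = [0u8; 256];
    let mut v: u16 = 1;
    for j in 0..255usize {
        exp[j] = v as u8;
        log[v as usize] = j as u8;
        v <<= 1; // multiply by x
        if (v & 0x100) != 0 {
            v ^= 0x11d; // reduce modulo the degree 8 polynomial
        }
    }
    (exp, log)
}

/// Multiply in GF(2^8) by adding exponents modulo 255.
/// Zero is handled separately because log(0) is undefined.
fn gf256_mul(a: u8, b: u8, exp: &[u8; 256], log: &[u8; 256]) -> u8 {
    if a == 0 || b == 0 {
        return 0;
    }
    let s = (usize::from(log[usize::from(a)]) + usize::from(log[usize::from(b)])) % 255;
    exp[s]
}

fn main() {
    let (exp, log) = build_tables();
    // x * x^7 = x^8, which reduces to x^4 + x^3 + x^2 + 1 = 0x1d.
    assert_eq!(gf256_mul(0x02, 0x80, &exp, &log), 0x1d);
    println!("ok");
}
```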
The field \(GF(2^{16})\) is realized as the quotient\[GF(2^8)[y]/(y^2 + x^3y + 1).\]Here \(x\) still denotes the same primitive element of \(GF(2^8)\) as above. I have selected the polynomial \(y^2 + x^3y + 1\) because \(k = 3\) is the smallest \(k \geq 0\) for which the polynomial \(y^2 + x^ky + 1\) is irreducible over \(GF(2^8)\). The fact that only one term of this polynomial is different from one simplifies the multiplication and division formulas. Each element of \(GF(2^{16})\) is a degree one polynomial \(ay + b\), where \(a, b \in GF(2^8)\). Each of the elements \(a\) and \(b\) is stored in its own byte. Addition of elements of \(GF(2^{16})\) is performed by adding the coefficients of their corresponding degree one polynomials. Since this addition is addition on \(GF(2^8)\), it amounts to the XOR of two bytes, which is very fast.
To compute the formula for multiplication, note that\[(ay + b)(cy + d) = acy^2 + (ad + bc)y + bd \equiv (ad+bc+x^3ac) y + bd + ac\]modulo \(y^2 + x^3y + 1\). Therefore, multiplication only needs 5 products in \(GF(2^8)\) and some additions. For simplicity, my implementation writes this formula exactly as such: the field \(GF(2^8)\) is implemented as a Rust type for which a multiplication is defined. This has the disadvantage that each multiplication evaluates its own logarithms, and some of these evaluations are repeated. A more optimized implementation would calculate the logarithms of \(a, b, c, d\) only once, but it would need to handle separately the cases when some of these are zero. Perhaps the Rust compiler is smart enough to remove the repeated logarithm evaluations when the straightforward formula is used, and to figure out that the logarithm of \(x^3\) is simply 3. I haven’t checked how much of this it is able to optimize out.
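In code, the representation and the multiplication formula look roughly like the following sketch. Again, this is not the ssdv-fec API (the type and function names are made up), and the \(GF(2^8)\) multiplication is written in a compact bitwise form just to keep the example self-contained; the real implementation uses the log/exp tables described above.

```rust
/// Multiplication in GF(2^8) modulo x^8 + x^4 + x^3 + x^2 + 1, written
/// bitwise here for brevity instead of using the log/exp tables.
fn gf256_mul(mut a: u8, mut b: u8) -> u8 {
    let mut r = 0u8;
    while b != 0 {
        if (b & 1) != 0 {
            r ^= a;
        }
        let carry = (a & 0x80) != 0;
        a <<= 1;
        if carry {
            a ^= 0x1d; // reduce x^8 to x^4 + x^3 + x^2 + 1
        }
        b >>= 1;
    }
    r
}

/// An element a*y + b of GF(2^16), with a, b in GF(2^8) stored in one byte each.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Gf65536 {
    a: u8,
    b: u8,
}

const X3: u8 = 0x08; // the element x^3 of GF(2^8)

/// Addition is coefficient-wise, i.e. the XOR of the two byte pairs.
fn add(u: Gf65536, v: Gf65536) -> Gf65536 {
    Gf65536 { a: u.a ^ v.a, b: u.b ^ v.b }
}

/// (a y + b)(c y + d) = (ad + bc + x^3 ac) y + (bd + ac) mod y^2 + x^3 y + 1.
/// Five GF(2^8) products: ad, bc, ac, x^3 * ac and bd.
fn mul(u: Gf65536, v: Gf65536) -> Gf65536 {
    let (a, b, c, d) = (u.a, u.b, v.a, v.b);
    let ac = gf256_mul(a, c);
    Gf65536 {
        a: gf256_mul(a, d) ^ gf256_mul(b, c) ^ gf256_mul(X3, ac),
        b: gf256_mul(b, d) ^ ac,
    }
}

fn main() {
    // y * y = x^3 y + 1, since y^2 + x^3 y + 1 = 0 in this field.
    let y = Gf65536 { a: 1, b: 0 };
    assert_eq!(mul(y, y), Gf65536 { a: X3, b: 1 });
    // Addition is just the XOR of the coefficients.
    assert_eq!(add(y, y), Gf65536 { a: 0, b: 0 });
    println!("ok");
}
```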
Division is slightly more tricky to calculate. The formula follows from solving\[(cy + d)(ey + f) \equiv ay + b \mod y^2 + x^3 y + 1 \]for the unknowns \(e, f \in GF(2^8)\). Expanding the product as above, this gives a 2×2 linear system, which can be solved with Cramer’s rule. This gives\[\begin{split}e &= \frac{ad + bc}{\Delta},\\ f &= \frac{b(d + x^3c) + ac}{\Delta},\end{split}\]where\[\Delta = c^2 + x^3cd + d^2.\]Note that the irreducibility of \(y^2+x^3y+1\) over \(GF(2^8)\) implies that \(\Delta \neq 0\) unless \(c = d = 0\). This division formula requires the evaluation of 10 multiplications/divisions over \(GF(2^8)\), and some additions.
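Explicitly, expanding the left-hand side and matching the coefficient of \(y\) and the constant term gives the system\[\begin{split}(d + x^3c)e + cf &= a,\\ ce + df &= b,\end{split}\]whose determinant is precisely \(\Delta\).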
This implementation of \(GF(2^{16})\) is good for memory-constrained systems, because it only requires 512 bytes of tables, but it is still reasonably fast. An implementation that uses tables of exponentials and logarithms in \(GF(2^{16})\) is faster, but it requires 128 KiB of memory for each of the two tables. Even using Zech logarithms, which only need one 128 KiB table, is prohibitive in systems with little memory. In fact, the implementation of \(GF(2^{16})\) as an extension of degree two over \(GF(2^8)\) can also be interesting for large computers with plenty of memory, because the L1 cache on many CPUs is only 32 KiB, so the cache misses caused by the 128 KiB tables can make the implementation based on exponentials and logarithms in \(GF(2^{16})\) slower than the one described here.
This implementation of the arithmetic in \(GF(2^{16})\) deviates from the one I proposed in my original post, which was the usual quotient construction \(GF(2)[z]/(p(z))\) for \(p(z)\) an irreducible polynomial of degree 16 over \(GF(2)\). Since the construction of \(GF(2^{16})\) as a degree two extension is better for memory-constrained systems, I have chosen it for the “production quality” Rust code. However, since there is no fast and simple way to convert between the two representations of field elements, the Rust implementation and my earlier Python prototype using the galois library are not interoperable. My recommendation is that other implementations of this FEC scheme follow the construction used in the Rust implementation, so that they can be interoperable.
The implementations of the fields \(GF(2^8)\) and \(GF(2^{16})\) as described here are exposed in the public API of ssdv-fec, so these can also be used in other Rust applications and libraries.
Another clever idea in the Rust implementation is how the linear system for polynomial interpolation is solved. This needs to be done both for encoding and decoding. The problem can be stated generically as follows: given a polynomial \(p \in GF(2^{16})[z]\) of degree at most \(m\) and its values at \(m + 1\) distinct points \(z_0, \ldots, z_m \in GF(2^{16})\), compute the values \(p(z)\) at other points \(z \in GF(2^{16})\). The way I presented this in the original post was conceptually simple. The linear system that gives the coefficients of \(p\) in terms of \(p(z_0), \ldots, p(z_m)\) has a single solution, and in fact the matrix of this system is an invertible Vandermonde matrix. Therefore, we can compute the coefficients of \(p\) and then evaluate it at the required points \(z\).
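Writing \(p(z) = c_0 + c_1 z + \cdots + c_m z^m\), this system is\[\begin{pmatrix}1 & z_0 & \cdots & z_0^m\\ 1 & z_1 & \cdots & z_1^m\\ \vdots & \vdots & & \vdots\\ 1 & z_m & \cdots & z_m^m\end{pmatrix}\begin{pmatrix}c_0\\ c_1\\ \vdots\\ c_m\end{pmatrix} = \begin{pmatrix}p(z_0)\\ p(z_1)\\ \vdots\\ p(z_m)\end{pmatrix},\]and the Vandermonde matrix is invertible because the points \(z_j\) are distinct.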
A naïve implementation of this idea solves the system by Gaussian elimination. This has the disadvantage that the Vandermonde matrix needs to be stored somewhere in order to perform the row operations that convert it to the identity matrix. If we want the implementation to have a minimal memory footprint, we would rather not store this matrix, which only plays an auxiliary role.
Luckily there is an alternative way to approach this problem that does not require storing a matrix. The polynomial \(p\) is the Lagrange polynomial, and there are some explicit formulas for it. If \(z\) is not equal to any of \(z_0,\ldots,z_m\), we have\[p(z) = l(z) \sum_{j=0}^m \frac{w_j p(z_j)}{z - z_j},\]where\[l(z) = \prod_{j=0}^m (z - z_j),\]and\[w_j = \prod_{0 \leq k \leq m,\ k \neq j} (z_j-z_k)^{-1}.\] This formula gives a way of calculating \(p(z)\) without using any other memory besides that used to store the terms \(p(z_0),\ldots,p(z_m)\) and their corresponding points \(z_0,\ldots,z_m\).
Since the formula above needs to be evaluated for many different values of \(z\) during encoding or decoding, the input data \(p(z_0),\ldots,p(z_m)\) is modified in-place, replacing each \(p(z_j)\) by \(w_j p(z_j)\) after computing the product \(w_j\). This makes each evaluation of \(p(z)\) using the formula faster, because the weights \(w_j\) do not need to be recomputed.
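The sketch below illustrates this strategy. It is not the ssdv-fec code: the names are made up, and it uses f64 arithmetic instead of \(GF(2^{16})\) so that the example is self-contained, but the structure is the same. The weights \(w_j\) are folded into the samples in place, and evaluating \(p(z)\) afterwards only needs the stored points and samples.

```rust
/// Replace each sample p(z_j) by w_j * p(z_j) in place, where
/// w_j is the product of (z_j - z_k)^{-1} over all k != j.
fn fold_weights(zs: &[f64], ps: &mut [f64]) {
    for j in 0..zs.len() {
        let mut w = 1.0;
        for k in 0..zs.len() {
            if k != j {
                w /= zs[j] - zs[k];
            }
        }
        ps[j] *= w;
    }
}

/// Evaluate p(z) = l(z) * sum_j w_j p(z_j) / (z - z_j), assuming that z is
/// not one of the z_j and that `ps` already holds the products w_j * p(z_j).
fn evaluate(zs: &[f64], ps: &[f64], z: f64) -> f64 {
    let l: f64 = zs.iter().map(|&zj| z - zj).product();
    let s: f64 = zs.iter().zip(ps).map(|(&zj, &pj)| pj / (z - zj)).sum();
    l * s
}

fn main() {
    // p(z) = z^2 + 1 sampled at z = 1, 2, 3; evaluate the interpolated p at z = 4.
    let zs = [1.0, 2.0, 3.0];
    let mut ps = [2.0, 5.0, 10.0];
    fold_weights(&zs, &mut ps);
    assert!((evaluate(&zs, &ps, 4.0) - 17.0).abs() < 1e-9);
    println!("ok");
}
```

In \(GF(2^{16})\) the subtractions become XORs and the divisions use the field inverse, but the memory usage is the same: only the points and the rescaled samples are stored.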