free-outernet gets LDPC decoding

In my previous post I talked about some small updates made by George Hopkins to my free-outernet project. In fact, George has been reverse engineering the ondd binary quite in depth and he has been able to reverse engineer the LDPC code which is used for file FEC. This solves a long-standing issue of free-outernet. Formerly, LDPC decoding was not implemented, so to recover a file successfully all the file blocks had to be received correctly. Now, with LDPC decoding the file can be recovered even if some of the file blocks are lost. Thus, the performance of free-outernet in this aspect should now be the same as the performance of the closed source ondd binary included in the official Outernet receiver. Many thanks to George, as this is a substantial improvement of free-outernet. Here I describe the latest changes made by George in free-outernet.

As we suspected, the LDPC code used for file FEC in Outernet is, essentially, the code described in RFC5170. This RFC describes an LDPC code with a pseudorandomly generated parity-check matrix to be used as an erasure FEC. Generating the matrix pseudorandomly has the advantage that it is possible to use any values of \(n\) and \(k\) as appropriate.

Recall that an \((n,k)\) linear code over some field \(F\) is a \(k\)-dimensional subspace \(C\) of the vector space \(F^n\). One way to write this code is in terms of a parity-check matrix, which is an \(m \times n\) matrix \(H\) such that \(C = \ker M\). Usually, \(H\) is chosen to have maximum rank, so that \(m = n-k\). For the RFC5170 code, the parity-check matrix has the form\[H = (L | R),\] where \(L\) is \((n-k) \times k\) and \(R\) is \((n-k) \times (n-k)\). The matrix \(L\) is generated pseudorandomly and \(R\) is either\[R=\begin{pmatrix}1 & 0 & 0 & 0 & \cdots & 0\\1 & 1 & 0 & 0 & \cdots & 0\\0 & 1 & 1 & 0 & \cdots & 0\\\vdots & \cdots &\ddots & \ddots & \cdots & \vdots\\0 & 0 &\cdots & 1 & 1 & 0\\0 & 0 & \cdots & 0 & 1 & 1\end{pmatrix}\]for the so called staircase scheme, or\[R=\begin{pmatrix}1 & 0 & 0 & 0 & \cdots & 0\\1 & 1 & 0 & 0 & \cdots & 0\\1 & 1 & 1 & 0 & \cdots & 0\\\vdots & \cdots &\ddots & \ddots & \cdots & \vdots\\1 & 1 &\cdots & 1 & 1 & 0\\1 & 1 & \cdots & 1 & 1 & 1\end{pmatrix}\]for the triangle scheme. I don’t know yet which one of these two schemes is the one used in Outernet, since the matrix \(R\) is not written explicitly, but hidden inside the decoding function, so I will have to study that code.

The matrix \(L\) is a sparse matrix. A few of its entries are one, while the rest are zero. It is generated using a two step procedure. A PRNG is used so that the distribution of ones within the matrix looks uniform and random-like, but so that the generation of the matrix is repeatable by using the same PRNG algorithm and seed. The first step depends on a parameter N1. In Outernet, it seems that N1=2 always (its value is indicated for each particular file in the file announcement). In this step, the matrix \(L\) starts with all its entries set to zero and in each column we put N1 ones. The positions of the ones are chosen pseudorandomly. The second step takes the matrix produced in the first step and ensures that in every row there are at least 2 ones. For each row where this condition is not satisfied, either 1 or 2 ones are put accordingly to get 2 ones in this row. Again, the positions of the ones are chosen pseudorandomly. Thus the matrix \(L\) is a sparse matrix such that most of its columns have N1 ones and all its rows have at least 2 ones. The exact algorithm to generate this matrix is given in the RFC.

The discussion of the generation of the matrix \(L\) leads us to the PRNG algorithm. Clearly, the PRNG must be standardized to be able to construct the same parity-check matrix in all the implementations. The RFC includes a description of the PRNG to be used. It is the Lehmer or Park-Miller generator. This algorithm depends on \(M\), which is usually a prime or a power of a prime; \(A\), which is an element of high multiplicative order modulo \(M\); and a seed \(X_0\) which is coprime with \(M\). A sequence of pseudorandom numbers \(\{X_n\}\) is produced by\[X_{n+1} = AX_n \mod M.\]A common choice is to put \(M = 2^{31}-1\), which is a Mersenne prime, and \(A = 7^5\). This choice is known as MINSTD and is the one used in RFC5170. The outputs of this PRNG are integers between \(1\) and \(M-1\). Thus, in applications, the output has to be adjusted to the desired range. The RFC includes this remark:

This PRNG produces, natively, a 31-bit value between 1 and 0x7FFFFFFE (2^^31-2) inclusive. Since it is desired to scale the pseudo-random number between 0 and maxv-1 inclusive, one must keep the most significant bits of the value returned by the PRNG (the least significant bits are known to be less random, and modulo-based
solutions should be avoided [PTVF92]). The following algorithm MUST be used:
[…]
scaled_value = (unsigned long) ((double)maxv * (double)raw_value / (double)0x7FFFFFFF);

However, George has discovered that the implementation by Outernet ignores this remark completely and uses modulo instead of division (so scaled_value = raw_value % maxv). This difference is what made my attempts at reverse engineering the LDPC code fail. I was playing with a Python implementation of the RFC code to see if the parity-check bytes matched. However, I was using division as per the RFC. As one can imagine, using modulo instead produces a completely different parity-check matrix. The code written by George to generate the parity-check matrix is almost identical to the code I was using. This is remarkable, given that George wrote his code from reverse-engineering the ondd binary and I wrote mine as a Python translation of the C code in the RFC. If one uses the same scaling method in both codes (either division or modulo), both generate the same parity-check matrix.

From a formal point of view, decoding this kind of LDPC codes is very easy, because they are used only as erasure codes. Assume we receive a codeword \(c = (c_1,\ldots,c_n) \in F^n\), perhaps with some erasures. Let \(r = (r_1,\ldots,r_n)\) be the received codeword, where \(r_j = c_j\) if the \(j\)-th symbol was received correctly or \(r_j\) is an unknown \(x_j\) if the \(j\)-th symbol was not received. The equation \(Hr = 0\) is a non-homogeneous linear system for the unknowns \(x_j\). Note that this system is compatible. If the system has a single solution, the solution allows us to recover the codeword \(c\). Then it is trivial to obtain the original message, as these codes are used as systematic codes. If the system has several solutions, then there were too many erasures and the message can not be recovered. Mathematically, this is the best that can be done: whether the erasures are recoverable or not only depends on whether the system \(Hr=0\) has only a single solution or several solutions. Note that this linear system for the unknowns \(x_j\) is small, especially when there are few erasures. Therefore, any algorithm to solve linear systems can be used here. The decoding function looks very different from this mathematical procedure, but it probably does the same thing. I will have to study this code in the future.

I haven’t talked about interleaving yet. Again this is buried inside the decoding function, but I’m confident that I got interleaving right when I tried to reverse engineer the LDPC code. Recall that both files and LDPC parity-check symbols are sent as chunks of 242 bytes, or 1936 bits. Probably the best way talk about interleaving in this case is to avoid talking about interleaving at all by using some mathematical formalities. So far, I have not said over which field these LDPC codes are supposed to be defined. The reason is that it is not really important. Perhaps the most natural thing would be to think of these as codes over \(GF(2)\), and thus working on the bit level. Many times, we work with bytes instead of bits, so it’s perhaps convenient to think of these codes as codes over \(GF(2^8)\). Now one immediately notices that we are not using the field structure of \(GF(2^8)\). In fact, while I presented these codes as codes over a field \(F\), it is perfectly possible to define them as codes over a vector field \(V\). The same discussion applies without much changes. For Outernet, we can think of the LDPC codes as codes over \(GF(2^{1936})\), which is indeed a field, but since we don’t care about its field structure, we think of it as the vector space \(GF(2)^{1936}\).

Each 242 byte chunk is understood as an element of \(GF(2)^{1936}\) in the natural way. Since the last chunk of a file may be shorter than 242 bytes, it is padded at the end (probably with 0xff bytes according to this line in George’s code). The whole file forms a single LDPC codeword. Therefore, the parameter \(k\) is computed as \(k = \lceil s/242 \rceil\), where \(s\) is the filesize in bytes. The parameter \(n\) is adjusted to have a code rate close to \(5/6\), so \(n = \lceil 6k/5 \rceil\). As you can imagine, this produces different \(n\) and \(k\) parameters for each file, so the LDPC code used for each file is completely different. This is what makes the pseudorandom generation a good idea for this application. The remaining parameters for the LDPC code are N1, the “usual” number of ones per column in the matrix \(L\), and the seed for the PRNG. It seems that N1 is always 2 and the seed is always 1000.

This is all there is to the LDPC code used for file FEC in Outernet. George has discovered that there is another erasure FEC used for fragmentation at the OP level. This is used to recover an LDP packet even if some of its fragments are lost. It seems that this functionality is only used for file announcements, which are the only LDP packets large enough to be fragmented. George has implemented decoding for this OP FEC also.

The erasure code used for OP fragmentation is a code by Luigi Rizzo (see especially his paper Effective Erasure Codes for Reliable Computer Communication Protocols). Apparently this code has become quite popular since its creation in 1997 and there are several implementations out there. free-outernet is using the zfec library, which an implementation of this code with a C, Python and Haskell API.

Recall that the OP header includes 3 bytes related to fragmentation. The first byte is either 0x3c or 0xc3 depending on whether this is the last fragment or not, the second byte is the number of this fragment (starting with 0) and the third byte is the fragment number of the last fragment. The parity-check symbols are sent using a similar scheme. The first byte is 0x69 to indicate that this is a packet of parity-check symbols. Similarly, the second and third bytes are the number of this FEC fragment and the number of the last FEC fragment. This code works over \(GF(2^8)\), so each of the bytes of each fragment is treated independently (the first byte of each fragment and FEC fragment constitute a codeword and so on). For file announcements it seems that \((n,k)\) is either \((9,6)\) or \((10,7)\), so I would say that \(n\) is computed as \(n=k+3\), where \(k\) is simply the number of fragments that are needed to send the LDP packet. Note that this code can always correct up to \(n-k\) erasures.

The new version of free-outernet implements decoding for both file FEC and OP FEC, so it is possible to recover now more files from the recordings that were made when this project started. From this set of KISS recordings, the files outernet-f4gmu.kiss and outernet-primesh.kiss had packet loss and no files could be recovered. Now the results are pretty good:

$ ./free-outernet.py -k /tmp/outernet-kiss/outernet-f4gmu.kiss  | grep "recons"
[File service] File reconstructed: opaks/f37c-Balochistan,_Pakistan.html.tbz2
Unable to reconstruct file opaks/f675-2022_Winter_Olympics.html.tbz2
Unable to reconstruct file opaks/fa5a-Aliya_Mustafina.html.tbz2
[File service] File reconstructed: opaks/2a6c-gfs.2016101418_gfs.t18z.pgrb2.1p00.f048.grib2.tbz2
$ ./free-outernet.py -k /tmp/outernet-kiss/outernet-primesh.kiss  | grep "recons"
[File service] File reconstructed: opaks/adc2-gfs.2016101518_gfs.t18z.pgrb2.1p00.f000.grib2.tbz2
[File service] File reconstructed: opaks/e598-gfs.2016101518_gfs.t18z.pgrb2.1p00.f072.grib2.tbz2
[File service] File reconstructed: opaks/15c1-Baar_Baar_Dekho.html.tbz2
[File service] File reconstructed: opaks/8a2c-messages-0.html.tbz2
[File service] File reconstructed: opaks/1885-Alison_Brie.html.tbz2
[File service] File reconstructed: opaks/2661-Ballers.html.tbz2
[File service] File reconstructed: opaks/01bc-352096178380949903 - iz1cqn.JPG.tbz2
[File service] File reconstructed: opaks/030c-352254227731137254 - To Get to Mars, Head for the Moon.pdf.tbz2
[File service] File reconstructed: opaks/046d-352374092422559609 - Sir_Arthur_Conan_Doyle-Escandalo_en_Bohemia.txt.tbz2

From outernet-f4gmu.kiss 2 out of 4 files can be reconstructed and in outernet-primesh.kiss all 9 files can be reconstructed.

4 comments

Leave a comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.