Introduction

On October 20th, 2025, Celestia tasked zkSecurity with auditing its high-performance data availability codec, RSEMA1D (Reed-Solomon Evans-Mohnblatt-Angeris 1D). The specific code to review was shared via a private GitHub repository. The audit lasted one week with two consultants.

The RSEMA1D codec is based on the ZODA paper, specifically implementing the Hadamard variant from Appendix D with a single column, which served as the primary theoretical foundation for this audit. Additionally, the Celestia team shared a specification as well as an internal Notion document that provided a high-level overview of the codec.

Scope

The scope of this audit covered the entire RSEMA1D repository at commit 65cc9d4c3887936393763d32e0cf4e665d9d2705. All files and components present in that commit were considered in-scope. Dependencies outside this repository (in particular, Celestia’s Leopard Reed–Solomon codec) were not formally in scope, although we occasionally referred to them for context.

Strategic Recommendations

In addition to the findings discussed later in this report, we have the following strategic recommendations:

Audit of underlying Reed-Solomon implementation. This report assumes that Celestia’s Leopard Reed-Solomon GF16 codec was safely implemented, particularly the core functions New, Encode, Reconstruct, and GF16Mul. While we reviewed the RSEMA1D layer built on top of this codec, the correctness and security of the entire system fundamentally depends on the underlying Reed-Solomon implementation. We recommend a dedicated audit of the Leopard library, if one has not already been performed, to ensure its cryptographic primitives are sound and properly implemented.

Review of system integration. The scope of this audit was limited to the RSEMA1D codec library itself. However, the security of a data availability system depends not only on the codec’s correctness but also on how it is integrated and used throughout the broader Celestia stack. We recommend an audit of the integration layer to verify that the codec is being invoked correctly, that data flows between components are secure, and that the overall system maintains the intended security properties when all components interact together.

Overview

In this section we give an overview of data availability, the original ZODA construction, and some detail of the construction implemented by Celestia.

What is Data Availability?

Data availability (DA) refers to the guarantee that a blob of data (e.g. all transaction data in a block of a rollup protocol) has been published and is accessible to a network. Without DA, if even a small portion of the data were to be withheld, honest nodes wouldn’t be able to recover the full original data. In protocols like rollups, where block data is produced off-chain, the lack of a decentralized way to ensure that the rollup data is available (for a certain period of time) could prevent users from verifying and trusting the protocol, or leave them unable to recover from a protocol halt because they cannot download the data needed to reconstruct the latest state. In traditional blockchains, full nodes trivially ensure data availability by downloading 100% of the block data themselves. However, this approach does not scale well: requiring every node to fetch every block in its entirety quickly becomes a throughput bottleneck.

The modern alternative to downloading every block is known as Data Availability Sampling (DAS). In a nutshell, DAS lets nodes probabilistically verify that a block’s full data is available by downloading random samples of it rather than the entire block. If enough nodes have performed enough sampling, we are convinced that the data is available and recoverable if needed, even if some pieces of it go missing.

This probabilistic verification becomes possible through a specific way of coding the block’s data. Before a block is broadcast, its data is processed with erasure coding to introduce redundancy. Specifically, many modern DA layers use a one or two-dimensional Reed-Solomon encoding approach: the block’s data is arranged in a matrix and then extended with parity chunks for each row/column. This means the block is effectively duplicated with error-correction data in horizontal/vertical directions. For this extended data matrix, even if some parts are missing, any chunk can be recovered from the redundant pieces. In other words, as long as a sufficient subset of pieces is available, the entire original block can be reconstructed.
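As a toy illustration of erasure coding (over a small prime field rather than the binary fields used in practice, with function names of our own choosing), the following sketch encodes k = 2 data symbols into 4 shares by polynomial evaluation and then recovers both data symbols even after both have been lost:

```go
package main

import "fmt"

const p = 65537 // a small prime field; Leopard's actual field GF(2^16) is binary

func mod(a int64) int64 { return ((a % p) + p) % p }

// modInv computes a^(p-2) mod p via Fermat's little theorem (square-and-multiply).
func modInv(a int64) int64 {
	res, base, e := int64(1), mod(a), int64(p-2)
	for e > 0 {
		if e&1 == 1 {
			res = mod(res * base)
		}
		base = mod(base * base)
		e >>= 1
	}
	return res
}

// lagrangeEval interpolates the unique degree < len(xs) polynomial through the
// points (xs[i], ys[i]) and evaluates it at x.
func lagrangeEval(xs, ys []int64, x int64) int64 {
	var acc int64
	for i := range xs {
		num, den := int64(1), int64(1)
		for j := range xs {
			if j == i {
				continue
			}
			num = mod(num * mod(x-xs[j]))
			den = mod(den * mod(xs[i]-xs[j]))
		}
		acc = mod(acc + mod(ys[i]*mod(num*modInv(den))))
	}
	return acc
}

func main() {
	data := []int64{42, 7} // k = 2 data symbols, placed at x = 1, 2
	// Extend: evaluate the unique line through (1, 42), (2, 7) at x = 3, 4.
	x3 := lagrangeEval([]int64{1, 2}, data, 3)
	x4 := lagrangeEval([]int64{1, 2}, data, 4)
	// Both original data symbols are lost; any k = 2 surviving shares recover them.
	fmt.Println(lagrangeEval([]int64{3, 4}, []int64{x3, x4}, 1)) // 42
	fmt.Println(lagrangeEval([]int64{3, 4}, []int64{x3, x4}, 2)) // 7
}
```

Any 2 of the 4 shares determine the line, hence the data: this "any k of n" property is exactly what makes random sampling meaningful.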

DA schemes typically include commitment and proof systems for the encoded data. The block producer publishes a commitment to the entire extended matrix (for example, a KZG or Merkle commitment). Then, for any chunk a node wants to sample, the producer (or any full node) can provide a short proof that this chunk is part of the committed block. This proof allows the node to verify the integrity of that chunk without having to download the whole block.

In short, DA protocols are typically a combination of redundant encoding and succinct proofs, which allow nodes to sample just a few random pieces of data and gain confidence in a blob’s availability as if they had downloaded the whole blob.

The ZODA Protocol

ZODA, or Zero-Overhead Data Availability, is a trust-minimized and highly-optimized protocol for providing data availability. Its main contribution is the use of the Ligero check to produce a secure DA scheme based on hash functions.

Note that rsema1d, the protocol implemented by Celestia, is not a direct implementation of the full 2D ZODA protocol, but rather corresponds to the Hadamard variant described in Appendix D of the ZODA paper, simplified to use a single column. This means Celestia only encodes vertically (along columns) rather than applying a full 2D tensor code. We still provide an overview of the full ZODA protocol here in order to introduce the necessary context.

Let’s first consider how data is typically encoded in DA protocols. Typically, an encoder encodes data expressed as a matrix over both the rows and the columns. This is often referred to as a tensor code, as it is equivalent to applying the tensor product of two separate codes. As an example, a tensor encoding of some data matrix $\tilde{X} \in \mathbb{F}^{n \times n}$ would be as follows:

$G \tilde{X} G'^T$

Where $G$ and $G'$ are the generator matrices for the two respective codes.

Here, ZODA applies a random factor to part of the encoding. Specifically, the final tensor encoding looks like the following:

$Z = G \tilde{X} D_r G'^T$

Where $D_r$ is a diagonal matrix composed of random values $r$. The resulting matrix $Z$ is an $m \times m'$ matrix.

(Note that there are multiple variants to how the random factor can be applied in the Appendix of the ZODA paper, but the core idea is the same.)

The protocol is essentially split into three algorithms: encoding, sampling, and decoding, which we introduce in more detail below.

Encoding. To encode the original data matrix $\tilde{X}$, the encoder follows these steps:

  1. Compute an encoding of the columns $X = G\tilde{X}$ and commit to its rows:

image

  2. Receive a uniformly sampled random vector $r$ and construct the diagonal matrix $D_r$ from its entries.

  3. Multiply the columns of the data matrix by the random values and compute the encoding of the resulting rows $Y = \tilde{X} D_r G'^T$, then commit to its columns:

image

  4. Compute the complete encoding $Z = G \tilde{X} D_r G'^T = GY$ and commit to it:

image

Sampling. Other nodes can now “sample” by retrieving random rows and columns from the committed encoding:

  1. Sample a set $S$ of rows of $X$ from the encoder, written as $X_S$.

  2. Sample a set $S'$ of columns of $Y$ from the encoder, written as $Y_{S'}$. Denote each column as $y_j$ for every $j \in S'$.

  3. Check $X_i D_r g_j^T = (G y_j)_i$ for every $i \in S$ and $j \in S'$:

image

  4. Verify that $Z_{ij} = g_i^T y_j$ for each $i \in S$ and $j \in S'$, where $g_i$ is the $i$-th row of $G$.

The 3rd step is the main check: it ensures that the columns of $X$ are the correct encodings of the columns of an original matrix $\tilde{X}$. This is similar to the check done in the Ligero paper, which verifies that many vectors are codewords using the fact that if a random linear combination of multiple vectors is a codeword, then with high probability they must all be codewords as well.

Similarly, to check that each vector in $X$ (i.e. each column of $X$) is a codeword, we can check whether a random linear combination of all columns in $X$ is a codeword or not. Taking a single row $i$ in $X$ as $X_i$, we can see that $X_i D_r$ results in a scaled vector $[X_i[0] \cdot r_0, X_i[1] \cdot r_1, \ldots, X_i[n-1] \cdot r_{n-1}]$. Then, we compute an inner product of this scaled vector with the $j$-th column of $G'^T$ (i.e. $g_j^T$) to get a scalar value $X_i D_r g_j^T$.

(Note that if $G'$ is a generator matrix for a systematic code (i.e. $G'^T = [I \mid P]$), the first $n$ columns of $G'^T$ will form the identity matrix. This means that for any $j < n$, $g_j^T$ will turn all but the $j$-th entry of $X_i D_r$ to zero (i.e. $[0, 0, \ldots, X_i[j] \cdot r_j, \ldots, 0]$), thus failing to capture a random linear combination of all columns of $X$, as is required for the Ligero check. Consequently, when using a systematic code for $G'$, we need to sample columns $j \geq n$ to avoid using the identity part.)

Since the RHS is $Y$ multiplied by the generator matrix $G$, we can be certain that every column of $GY$ is a valid codeword. Therefore, we can compute the $j$-th column of $GY$ and compare its $i$-th entry to $X_i D_r g_j^T$.

Whether this check is sufficient to ensure that all columns of $X$ are valid encodings of $\tilde{X}$ is a probabilistic question that depends on the properties of the code. We direct the reader to Section 3.2 of the ZODA paper for a detailed discussion on this topic.

Decoding. The decoder can reconstruct the original data by first running the sampling check and then sampling enough rows or columns to decode. More formally, the decoder performs either of the following options:

  1. Sample $|S| > m - d$ rows and decode $G_S \tilde{X} = X_S$ via an erasure decoding algorithm for $G$.

  2. Sample $|S'| > m' - d'$ columns and decode $G'_{S'} D_r \tilde{X}^T = (Y^T)_{S'}$ via an erasure decoding algorithm for $G'$.

The Adversarial Reconstruction Property of ZODA. Another guarantee that ZODA provides is the ability to reconstruct the original data even under adversarial conditions. Specifically, when the decoder requests rows or columns from the encoder, the encoder may give the decoder rows and columns of their choosing. As long as the decoder has run the sampling check beforehand, however, ZODA guarantees that the decoder can still decode the original data. We direct the reader to Section 3.2.4 of the ZODA paper for a detailed discussion on this topic.

RSEMA1D Fields

In this section, we will give a brief overview of the fields used in the RSEMA1D protocol and the conversions between these fields and the encoding format. The protocol uses two binary fields: $GF(2^{16})$ and $GF(2^{128})$.

Leopard Interleaved Format. The Reed-Solomon encoding is computed over $GF(2^{16})$, i.e. the generator matrix $G$, the input data matrix $\tilde{X}$, and the encoding $X = G\tilde{X}$ contain elements from $GF(2^{16})$. The Leopard codec library (which is used to compute the encoding) processes input data in an interleaved $GF(2^{16})$ format as shown below:

image

The above diagram shows a single row containing 64 bytes, or 32 $GF(2^{16})$ symbols. The bytes indexed 0-31 contain the low bytes $l_i$ and the bytes indexed 32-63 contain the high bytes $h_i$ of the 32 $GF(2^{16})$ symbols. Thus a symbol $s_i$ from $GF(2^{16})$ in the little-endian format is computed as:

$s_i = (l_i \parallel h_i), \quad \text{for } i \in [0, 32)$

This conversion is illustrated in the following diagram.

image
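A minimal sketch of this conversion (function names are ours, not the Leopard library’s):

```go
package main

import "fmt"

// deinterleave converts one 64-byte Leopard-format row into 32 GF(2^16)
// symbols: bytes 0-31 hold the low bytes, bytes 32-63 the high bytes.
func deinterleave(row [64]byte) [32]uint16 {
	var syms [32]uint16
	for i := 0; i < 32; i++ {
		syms[i] = uint16(row[i]) | uint16(row[32+i])<<8
	}
	return syms
}

// interleave is the inverse conversion, back to the Leopard row layout.
func interleave(syms [32]uint16) [64]byte {
	var row [64]byte
	for i, s := range syms {
		row[i] = byte(s)         // low byte
		row[32+i] = byte(s >> 8) // high byte
	}
	return row
}

func main() {
	var row [64]byte
	row[0], row[32] = 0x34, 0x12 // first symbol: low byte 0x34, high byte 0x12
	syms := deinterleave(row)
	fmt.Printf("s_0 = %#04x\n", syms[0]) // 0x1234
	fmt.Println(interleave(syms) == row) // round-trip: true
}
```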

Extension Field. For computing the random linear combination, the random coefficients are sampled from the extension field $GF(2^{128})$. In the implementation, a single element from $GF(2^{128})$ is represented as a vector of 8 elements of $GF(2^{16})$.

Suppose we want to compute a random linear combination of the columns of the input data matrix $\tilde{X}$, which contains 32 $GF(2^{16})$ symbols in a single row. We sample a random vector $g_r$ containing 32 $GF(2^{128})$ symbols, and the random linear combination is then computed as the matrix-vector product $\tilde{X}_r = \tilde{X} g_r$. If the number of rows of $\tilde{X}$ is $K$, then $\tilde{X}_r$ will be a vector of $K$ elements from $GF(2^{128})$. Let $\tilde{a}_i$ denote the element at index $i$ in the vector $\tilde{X}_r$. We can represent each $\tilde{a}_i \in GF(2^{128})$ using 8 elements from $GF(2^{16})$, i.e.

$\tilde{a}_i = \sum_{j=0}^{7} \tilde{a}_{i,j} \cdot \beta_j$

where $\tilde{a}_{i,j}$ are elements from $GF(2^{16})$ and the $\beta_j$ form a basis of $GF(2^{128})$ over $GF(2^{16})$. This is illustrated in the following figure:

image

This gives us the following two representations for X~r:

  1. As a vector with elements $\tilde{a}_i$ from $GF(2^{128})$
  2. As a matrix with elements $\tilde{a}_{i,j}$ from $GF(2^{16})$
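Assuming a little-endian byte layout within each limb (the actual in-memory layout is an implementation detail we do not pin down here), the two representations can be converted as follows:

```go
package main

import "fmt"

// toLimbs splits a 16-byte GF(2^128) element into its 8 GF(2^16)
// coordinates a_{i,0}..a_{i,7} with respect to the basis beta_0..beta_7.
// (Little-endian byte order within each limb is an assumption of this sketch.)
func toLimbs(e [16]byte) [8]uint16 {
	var limbs [8]uint16
	for j := 0; j < 8; j++ {
		limbs[j] = uint16(e[2*j]) | uint16(e[2*j+1])<<8
	}
	return limbs
}

// fromLimbs packs 8 GF(2^16) coordinates back into a 16-byte element.
func fromLimbs(limbs [8]uint16) [16]byte {
	var e [16]byte
	for j, l := range limbs {
		e[2*j] = byte(l)
		e[2*j+1] = byte(l >> 8)
	}
	return e
}

func main() {
	var e [16]byte
	e[0], e[1] = 0xcd, 0xab // limb 0 = 0xabcd
	limbs := toLimbs(e)
	fmt.Printf("a_{i,0} = %#04x\n", limbs[0])
	fmt.Println(fromLimbs(limbs) == e) // round-trip: true
}
```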

Encoding an extension field vector. Now we want to extend the random linear combination $\tilde{X}_r$ to compute the encoding $X_r = G\tilde{X}_r$. We can represent the vector $\tilde{X}_r$ as the coefficients of a polynomial $\tilde{X}_r(x)$ over $GF(2^{128})$, i.e.

$\tilde{X}_r(x) = \sum_{i=0}^{K-1} \tilde{a}_i \cdot L_i(x)$

where $L_i(x)$ are some basis polynomials. Let us further decompose $\tilde{X}_r(x)$ using the decomposition of $\tilde{a}_i$:

$\tilde{X}_r(x) = \sum_{i=0}^{K-1} \left( \sum_{j=0}^{7} \tilde{a}_{i,j} \cdot \beta_j \right) \cdot L_i(x)$

Rearranging the summations, we get:

$\tilde{X}_r(x) = \sum_{j=0}^{7} \left( \sum_{i=0}^{K-1} \tilde{a}_{i,j} \cdot L_i(x) \right) \cdot \beta_j = \sum_{j=0}^{7} \tilde{A}_j(x) \cdot \beta_j$

where $\tilde{A}_j(x) = \sum_{i=0}^{K-1} \tilde{a}_{i,j} \cdot L_i(x)$ is a polynomial whose coefficients are the $j$-th column of $\tilde{X}_r$ in its matrix representation, as shown in the following figure.

image

The Reed-Solomon encoding of $\tilde{X}_r$ is just the evaluations of the polynomial $\tilde{X}_r(x)$ over an extended domain in $GF(2^{16})$. This is equivalent to encoding each of the columns $\tilde{A}_j$ and then interpreting each row of the encoding, which contains 8 $GF(2^{16})$ elements, as a single element from $GF(2^{128})$. The entire encoding process is illustrated in the following diagram.

image
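This equivalence rests only on linearity, so it can be illustrated over the integers instead of binary fields (a sketch with hypothetical names, using integer polynomial evaluation in place of the Reed-Solomon encoder):

```go
package main

import "fmt"

// evalPoly evaluates a polynomial with integer coefficients at x (Horner's rule).
func evalPoly(coeffs []int64, x int64) int64 {
	var y int64
	for i := len(coeffs) - 1; i >= 0; i-- {
		y = y*x + coeffs[i]
	}
	return y
}

// limbIdentityHolds checks the rearrangement identity used above: evaluating
// the polynomial whose coefficients are the packed values a_i = sum_j
// limbs[i][j]*B^j equals recombining the per-limb-column evaluations,
// i.e. X_r(x) = sum_j A_j(x) * B^j (with B standing in for the basis).
func limbIdentityHolds(limbs [][]int64, x, B int64) bool {
	nLimbs := len(limbs[0])
	packed := make([]int64, len(limbs))
	for i, row := range limbs {
		for j := nLimbs - 1; j >= 0; j-- {
			packed[i] = packed[i]*B + row[j]
		}
	}
	direct := evalPoly(packed, x) // encode the packed column directly ...
	var recombined, pow int64 = 0, 1
	for j := 0; j < nLimbs; j++ { // ... vs. encode each limb column separately
		col := make([]int64, len(limbs))
		for i := range limbs {
			col[i] = limbs[i][j]
		}
		recombined += evalPoly(col, x) * pow
		pow *= B
	}
	return direct == recombined
}

func main() {
	limbs := [][]int64{{1, 2, 3}, {4, 5, 6}} // limbs[i][j] = a_{i,j}
	fmt.Println(limbIdentityHolds(limbs, 7, 1<<16)) // true
}
```

In the actual codec the same argument goes through because the Reed-Solomon encoding is $GF(2^{16})$-linear and the $\beta_j$ are fixed basis elements.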

In the above diagram, we assume extension by $N$, i.e. the encoding has $N + K$ rows. The columns of the encoding $X_r$ are denoted by $A_j$, which are extensions of the columns $\tilde{A}_j$ of the data $\tilde{X}_r$. The elements $a_{i,j}$ of the encoding $X_r$ are symbols from $GF(2^{16})$, and each row contains 8 $GF(2^{16})$ symbols which can be packed into a single element of $GF(2^{128})$ using the same packing equation:

$a_i = \sum_{j=0}^{7} a_{i,j} \cdot \beta_j$

Here, even though we have shown that we can compute $a_i \in GF(2^{128})$ from the 8 $GF(2^{16})$ symbols of a row, the implementation never computes $a_i$ explicitly, since a symbol of $GF(2^{128})$ is represented as a vector of 8 $GF(2^{16})$ symbols.

Note that before extending $\tilde{X}_r$, we convert the rows of $GF(2^{16})$ symbols into the Leopard interleaved format described earlier. Similarly, after extending $\tilde{X}_r$, we convert back from the Leopard interleaved format to rows of $GF(2^{16})$ symbols. We omit this conversion from the diagram for simplicity.

RSEMA1D Codec

In this section we review the ZODA-inspired protocol implemented by Celestia.

Note that the 1D variant implemented by Celestia is essentially the Hadamard variant specified in Appendix D of the ZODA paper, but with a single column. This simplification means that instead of applying a full 2D tensor code (encoding both rows and columns), the protocol only encodes the columns vertically using Reed-Solomon encoding.

Encoding Algorithm. The encoding algorithm uses Reed-Solomon encoding over $\mathbb{F}_{2^{16}}$ to vertically extend the input data matrix, i.e. the generator matrix $G$ and the input data matrix $\tilde{X}$ contain elements from $\mathbb{F}_{2^{16}}$. The generator matrix $G$ is in systematic form, i.e. $G = \begin{bmatrix} I \\ A \end{bmatrix}$, where $I$ is the identity matrix and $A$ is the matrix generating the parity rows.

We outline a simplified description of the protocol:

  1. Compute the encoding of the columns $X = G\tilde{X}$
  2. Commit to the encoding $X$ using a Merkle tree where each leaf of the Merkle tree is the hash of a row of the encoding $X$, and let rootX represent the root of this tree
  3. Derive, using Fiat-Shamir, a uniformly random vector $g_r$ which contains elements from the extension field $\mathbb{F}_{2^{128}}$
  4. Compute the random linear combination of the columns $\tilde{X}_r = \tilde{X} g_r$
  5. Compute the encoding of the random linear combination as $X_r = G\tilde{X}_r$
  6. Commit to the random encoding $X_r$ using a Merkle tree, and let rootXr represent the root of this tree
  7. Commit to both roots as SHA-256(rootX || rootXr)

commitment

Sampling Algorithm. To sample rows of the committed encoding, a node/verifier that knows $\tilde{X}_r$ requests Merkle proofs for a set of rows and verifies them in the following way:

  1. Recompute the extended RLC vector rlcExtended (i.e. $G\tilde{X}_r$)
  2. Verify each Row opening using the known root commitment rowRoot
  3. Rederive the random vector gr using Fiat-Shamir
  4. Compute the random linear combination (using gr) of each Row and check that they match the expected entry in rlcExtended
  5. Combine rowRoot and the known rlcRoot to check the main commitment.

Standalone Proof. A node can directly reconstruct a row of the original data given a Merkle proof of the expected result in rlcExtended, assuming that a proof of data availability exists (i.e. at least one honest node successfully sampled the encoding via the previous algorithm). The verification steps go as follows:

  1. Verify the Row opening using the known root commitment rowRoot
  2. Rederive the random vector gr using Fiat-Shamir
  3. Compute the random linear combination for the row
  4. Ensure that the result correctly opens at the expected path in the RLC Merkle tree (given the opening proof)
  5. Combine both roots to recompute the commitment and check it matches the public one.

Note on Commitments. Merkle tree commitments are used throughout the scheme: leaves are SHA-256 hashes (with prefix 0x00) of the rows of a matrix, and the tree is laid out in breadth-first order in a contiguous array. Internal nodes use the prefix 0x01 when hashing their children.

commitments
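A sketch of this commitment structure (hashing and domain-separation prefixes as described above; the padding behavior for a non-power-of-two number of rows and all function names are assumptions of ours):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// leafHash hashes a row with the 0x00 domain-separation prefix.
func leafHash(row []byte) [32]byte {
	return sha256.Sum256(append([]byte{0x00}, row...))
}

// nodeHash hashes two children with the 0x01 domain-separation prefix.
func nodeHash(l, r [32]byte) [32]byte {
	buf := append([]byte{0x01}, l[:]...)
	return sha256.Sum256(append(buf, r[:]...))
}

// merkleRoot builds the tree bottom-up over a power-of-two number of rows.
// (How rsema1d handles odd-sized levels is not specified here.)
func merkleRoot(rows [][]byte) [32]byte {
	level := make([][32]byte, len(rows))
	for i, row := range rows {
		level[i] = leafHash(row)
	}
	for len(level) > 1 {
		next := make([][32]byte, len(level)/2)
		for i := range next {
			next[i] = nodeHash(level[2*i], level[2*i+1])
		}
		level = next
	}
	return level[0]
}

func main() {
	rootX := merkleRoot([][]byte{{1}, {2}, {3}, {4}})  // tree over encoding rows
	rootXr := merkleRoot([][]byte{{9}, {8}, {7}, {6}}) // tree over RLC encoding rows
	// The final commitment binds both trees: SHA-256(rootX || rootXr).
	commitment := sha256.Sum256(append(rootX[:], rootXr[:]...))
	fmt.Printf("commitment: %x\n", commitment[:8])
}
```

The distinct leaf and node prefixes prevent an internal node from being reinterpreted as a leaf (a classic second-preimage issue in prefix-free Merkle trees).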