Introduction

On October 20th, 2025, Celestia tasked zkSecurity with auditing its high-performance data availability codec, RSEMA1D (Reed-Solomon Evans-Mohnblatt-Angeris 1D). The specific code to review was shared via a private GitHub repository. The audit lasted one week with two consultants.

The RSEMA1D codec is based on the ZODA paper, specifically implementing the Hadamard variant from Appendix D with a single column, which served as the primary theoretical foundation for this audit. Additionally, the Celestia team shared a specification as well as an internal Notion document that provided a high-level overview of the codec.

Scope

The scope of this audit covered the entire RSEMA1D repository at commit 65cc9d4c3887936393763d32e0cf4e665d9d2705. All files and components present in that commit were considered in-scope. Dependencies outside this repository (in particular, Celestia’s Leopard Reed–Solomon codec) were not formally in scope, although we occasionally referred to them for context.

Strategic Recommendations

In addition to the findings discussed later in this report, we have the following strategic recommendations:

Audit of underlying Reed-Solomon implementation. This report assumes that Celestia’s Leopard Reed-Solomon GF16 codec was safely implemented, particularly the core functions New, Encode, Reconstruct, and GF16Mul. While we reviewed the RSEMA1D layer built on top of this codec, the correctness and security of the entire system fundamentally depends on the underlying Reed-Solomon implementation. We recommend a dedicated audit of the Leopard library, if one has not already been performed, to ensure its cryptographic primitives are sound and properly implemented.

Review of system integration. The scope of this audit was limited to the RSEMA1D codec library itself. However, the security of a data availability system depends not only on the codec’s correctness but also on how it is integrated and used throughout the broader Celestia stack. We recommend an audit of the integration layer to verify that the codec is being invoked correctly, that data flows between components are secure, and that the overall system maintains the intended security properties when all components interact together.

Overview

In this section we give an overview of data availability, the original ZODA construction, and some detail of the construction implemented by Celestia.

What is Data Availability?

Data availability (DA) refers to the guarantee that a blob of data (e.g. all transaction data in a block of a rollup protocol) has been published and is accessible to a network. Without DA, if even a small portion of the data were to be withheld, honest nodes wouldn’t be able to recover the full original data. In protocols like rollups, where block data is produced off-chain, the lack of a decentralized way to ensure that the rollup data is available (for a certain period of time) could prevent users from verifying and trusting the protocol, or leave them unable to recover from a protocol halt because they cannot download the data needed to reconstruct the latest state. In traditional blockchains, full nodes trivially ensure data availability by downloading 100% of the block data themselves. However, this approach does not scale well: requiring every node to fetch every block in its entirety quickly becomes a throughput bottleneck.

The modern alternative to downloading every block is known as Data Availability Sampling (DAS). In a nutshell, DAS lets nodes probabilistically verify that a block’s full data is available by downloading random samples of it rather than the entire block. If enough nodes have performed enough sampling, we are convinced that the data is available and recoverable if needed, even if some pieces of it go missing.

This probabilistic verification becomes possible through a specific way of coding the block’s data. Before a block is broadcast, its data is processed with erasure coding to introduce redundancy. Specifically, many modern DA layers use a one or two-dimensional Reed-Solomon encoding approach: the block’s data is arranged in a matrix and then extended with parity chunks for each row/column. This means the block is effectively duplicated with error-correction data in horizontal/vertical directions. For this extended data matrix, even if some parts are missing, any chunk can be recovered from the redundant pieces. In other words, as long as a sufficient subset of pieces is available, the entire original block can be reconstructed.
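As a toy illustration of erasure coding (over a small prime field rather than the binary fields used in practice, with function names of our own choosing), the following sketch encodes k = 2 data symbols into 4 shares by polynomial evaluation and then recovers both data symbols even after both have been lost:

```go
package main

import "fmt"

const p = 65537 // a small prime field; Leopard's actual field GF(2^16) is binary

func mod(a int64) int64 { return ((a % p) + p) % p }

// modInv computes a^(p-2) mod p via Fermat's little theorem (square-and-multiply).
func modInv(a int64) int64 {
	res, base, e := int64(1), mod(a), int64(p-2)
	for e > 0 {
		if e&1 == 1 {
			res = mod(res * base)
		}
		base = mod(base * base)
		e >>= 1
	}
	return res
}

// lagrangeEval interpolates the unique degree < len(xs) polynomial through the
// points (xs[i], ys[i]) and evaluates it at x.
func lagrangeEval(xs, ys []int64, x int64) int64 {
	var acc int64
	for i := range xs {
		num, den := int64(1), int64(1)
		for j := range xs {
			if j == i {
				continue
			}
			num = mod(num * mod(x-xs[j]))
			den = mod(den * mod(xs[i]-xs[j]))
		}
		acc = mod(acc + mod(ys[i]*mod(num*modInv(den))))
	}
	return acc
}

func main() {
	data := []int64{42, 7} // k = 2 data symbols, placed at x = 1, 2
	// Extend: evaluate the unique line through (1, 42), (2, 7) at x = 3, 4.
	x3 := lagrangeEval([]int64{1, 2}, data, 3)
	x4 := lagrangeEval([]int64{1, 2}, data, 4)
	// Both original data symbols are lost; any k = 2 surviving shares recover them.
	fmt.Println(lagrangeEval([]int64{3, 4}, []int64{x3, x4}, 1)) // 42
	fmt.Println(lagrangeEval([]int64{3, 4}, []int64{x3, x4}, 2)) // 7
}
```

Any 2 of the 4 shares determine the line, hence the data: this "any k of n" property is exactly what makes random sampling meaningful.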

DA schemes typically include commitment and proof systems for the encoded data. The block producer publishes a commitment to the entire extended matrix (for example, a KZG or Merkle commitment). Then, for any chunk a node wants to sample, the producer (or any full node) can provide a short proof that this chunk is part of the committed block. This proof allows the node to verify the integrity of that chunk without having to download the whole block.

In short, DA protocols are typically a combination of redundant encoding and succinct proofs, which allow nodes to sample just a few random pieces of data and gain confidence in a blob’s availability as if they had downloaded the whole blob.

The ZODA Protocol

ZODA, or Zero-Overhead Data Availability, is a trust-minimized and highly-optimized protocol for providing data availability. Its main contribution is the use of the Ligero check to produce a secure DA scheme based on hash functions.

Note that rsema1d, the protocol implemented by Celestia, is not a direct implementation of the full 2D ZODA protocol, but rather corresponds to the Hadamard variant described in Appendix D of the ZODA paper, simplified to use a single column. This means Celestia only encodes vertically (along columns) rather than applying a full 2D tensor code. We still provide an overview of the full ZODA protocol here in order to introduce the necessary context.

Let’s first consider how data is typically encoded in DA protocols. Typically, an encoder encodes data expressed as a matrix over both the rows and the columns. This is often referred to as a tensor code, as it is equivalent to applying the tensor product of two separate codes. As an example, a tensor encoding of some data matrix $\tilde{X} \in \mathbb{F}^{n \times n}$ would be as follows:

$G \tilde{X} G'^T$

Where $G$ and $G'$ are the generator matrices for the two respective codes.

Here, ZODA applies a random factor to part of the encoding. Specifically, the final tensor encoding looks like the following:

$Z = G \tilde{X} D_r G'^T$

Where $D_r$ is a diagonal matrix composed of random values $r$. The resulting matrix $Z$ is an $m \times m'$ matrix.

(Note that there are multiple variants to how the random factor can be applied in the Appendix of the ZODA paper, but the core idea is the same.)

The protocol is essentially split into three algorithms: encoding, sampling, and decoding, which we introduce in more detail below.

Encoding. To encode the original data matrix $\tilde{X}$, the encoder follows these steps:

  1. Compute an encoding of the columns $X = G\tilde{X}$ and commit to its rows:

image

  2. Receive a uniformly sampled random vector $r$ and construct the diagonal matrix $D_r$ from its entries.

  3. Multiply the columns of the data matrix by the random values and compute the encoding of the resulting rows $Y = \tilde{X} D_r G'^T$, then commit to its columns:

image

  4. Compute the complete encoding $Z = G \tilde{X} D_r G'^T = GY$ and commit to it:

image

Sampling. Other nodes can now “sample” by retrieving random rows and columns from the committed encoding:

  1. Sample a set $S$ of rows of $X$ from the encoder, written as $X_S$.

  2. Sample a set $S'$ of columns of $Y$ from the encoder, written as $Y_{S'}$. Denote each column as $y_j$ for every $j \in S'$.

  3. Check $X_i D_r g_j^T = (G y_j)_i$ for every $i \in S$ and $j \in S'$:

image

  4. Verify that $Z_{ij} = g_i^T y_j$ for each $i \in S$ and $j \in S'$, where $g_i$ is the $i$-th row of $G$.

The 3rd step is the main check: it ensures that the columns of $X$ are the correct encodings of the columns of an original matrix $\tilde{X}$. This is similar to the check done in the Ligero paper, which verifies that many vectors are codewords using the fact that if a random linear combination of multiple vectors is a codeword, then with high probability they must all be codewords as well.

Similarly, to check that each vector in $X$ (i.e. each column of $X$) is a codeword, we can check whether a random linear combination of all columns in $X$ is a codeword or not. Taking a single row $i$ in $X$ as $X_i$, we can see that $X_i D_r$ results in a scaled vector $[X_i[0] \cdot r_0, X_i[1] \cdot r_1, \ldots, X_i[n-1] \cdot r_{n-1}]$. Then, we compute an inner product of this scaled vector with the $j$-th column of $G'^T$ (i.e. $g_j^T$) to get a scalar value $X_i D_r g_j^T$.

(Note that if $G'$ is a generator matrix for a systematic code (i.e. $G'^T = [I \mid P]$), the first $n$ columns of $G'^T$ will form the identity matrix. This means that for any $j < n$, $g_j^T$ will turn all but the $j$-th entry of $X_i D_r$ to zero (i.e. $[0, 0, \ldots, X_i[j] \cdot r_j, \ldots, 0]$), thus failing to capture a random linear combination of all columns of $X$, as is required for the Ligero check. Consequently, when using a systematic code for $G'$, we need to sample columns $j \geq n$ to avoid using the identity part.)

Since the RHS is $Y$ multiplied by the generator matrix $G$, we can be certain that every column of $GY$ is a valid codeword. Therefore, we can compute the $j$-th column of $GY$ and compare its $i$-th entry to $X_i D_r g_j^T$.

Whether this check is sufficient to ensure that all columns of $X$ are valid encodings of $\tilde{X}$ is a probabilistic question that depends on the properties of the code. We direct the reader to Section 3.2 of the ZODA paper for a detailed discussion on this topic.

Decoding. The decoder can reconstruct the original data by first running the sampling check and then sampling enough rows or columns to decode. More formally, the decoder performs either of the following options:

  1. Sample $|S| > m - d$ rows and decode $G_S \tilde{X} = X_S$ via an erasure decoding algorithm for $G$.

  2. Sample $|S'| > m' - d'$ columns and decode $G'_{S'} D_r \tilde{X}^T = (Y^T)_{S'}$ via an erasure decoding algorithm for $G'$.

The Adversarial Reconstruction Property of ZODA. Another guarantee that ZODA provides is the ability to reconstruct the original data even under adversarial conditions. Specifically, when the decoder requests rows or columns from the encoder, the encoder may give the decoder rows and columns of their choosing. As long as the decoder has run the sampling check beforehand, however, ZODA guarantees that the decoder can still decode the original data. We direct the reader to Section 3.2.4 of the ZODA paper for a detailed discussion on this topic.

RSEMA1D Fields

In this section, we will give a brief overview of the fields used in the RSEMA1D protocol and the conversions between these fields and the encoding format. The protocol uses two binary fields: $GF(2^{16})$ and $GF(2^{128})$.

Leopard Interleaved Format. The Reed-Solomon encoding is computed over $GF(2^{16})$, i.e. the generator matrix $G$, the input data matrix $\tilde{X}$, and the encoding $X = G\tilde{X}$ contain elements from $GF(2^{16})$. The Leopard codec library (which is used to compute the encoding) processes input data in an interleaved $GF(2^{16})$ format as shown below:

image

The above diagram shows a single row containing 64 bytes, or 32 $GF(2^{16})$ symbols. The bytes indexed 0-31 contain the low bytes $l_i$ and the bytes indexed 32-63 contain the high bytes $h_i$ of the 32 $GF(2^{16})$ symbols. Thus a symbol $s_i$ from $GF(2^{16})$ in the little-endian format is computed as:

$s_i = (l_i \parallel h_i), \quad \text{for } i \in [0, 32)$

This conversion is illustrated in the following diagram.

image
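A minimal sketch of this conversion (function names are ours, not the Leopard library’s):

```go
package main

import "fmt"

// deinterleave converts one 64-byte Leopard-format row into 32 GF(2^16)
// symbols: bytes 0-31 hold the low bytes, bytes 32-63 the high bytes.
func deinterleave(row [64]byte) [32]uint16 {
	var syms [32]uint16
	for i := 0; i < 32; i++ {
		syms[i] = uint16(row[i]) | uint16(row[32+i])<<8
	}
	return syms
}

// interleave is the inverse conversion, back to the Leopard row layout.
func interleave(syms [32]uint16) [64]byte {
	var row [64]byte
	for i, s := range syms {
		row[i] = byte(s)         // low byte
		row[32+i] = byte(s >> 8) // high byte
	}
	return row
}

func main() {
	var row [64]byte
	row[0], row[32] = 0x34, 0x12 // first symbol: low byte 0x34, high byte 0x12
	syms := deinterleave(row)
	fmt.Printf("s_0 = %#04x\n", syms[0]) // 0x1234
	fmt.Println(interleave(syms) == row) // round-trip: true
}
```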

Extension Field. For computing the random linear combination, the random coefficients are sampled from the extension field $GF(2^{128})$. In the implementation, a single element from $GF(2^{128})$ is represented as a vector of 8 elements of $GF(2^{16})$.

Suppose we want to compute a random linear combination of the columns of the input data matrix $\tilde{X}$, which contains 32 $GF(2^{16})$ symbols in a single row. We sample a random vector $g_r$ containing 32 $GF(2^{128})$ symbols, and the random linear combination is then computed as the matrix-vector product $\tilde{X}_r = \tilde{X} g_r$. If the number of rows of $\tilde{X}$ is $K$, then $\tilde{X}_r$ will be a vector of $K$ elements from $GF(2^{128})$. Let $\tilde{a}_i$ denote the element at index $i$ in the vector $\tilde{X}_r$. We can represent each $\tilde{a}_i \in GF(2^{128})$ using 8 elements from $GF(2^{16})$, i.e.

$\tilde{a}_i = \sum_{j=0}^{7} \tilde{a}_{i,j} \cdot \beta_j$

where $\tilde{a}_{i,j}$ are elements from $GF(2^{16})$ and the $\beta_j$ form a basis of $GF(2^{128})$ over $GF(2^{16})$. This is illustrated in the following figure:

image

This gives us the following two representations for X~r:

  1. As a vector with elements $\tilde{a}_i$ from $GF(2^{128})$
  2. As a matrix with elements $\tilde{a}_{i,j}$ from $GF(2^{16})$
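Assuming a little-endian byte layout within each limb (the actual in-memory layout is an implementation detail we do not pin down here), the two representations can be converted as follows:

```go
package main

import "fmt"

// toLimbs splits a 16-byte GF(2^128) element into its 8 GF(2^16)
// coordinates a_{i,0}..a_{i,7} with respect to the basis beta_0..beta_7.
// (Little-endian byte order within each limb is an assumption of this sketch.)
func toLimbs(e [16]byte) [8]uint16 {
	var limbs [8]uint16
	for j := 0; j < 8; j++ {
		limbs[j] = uint16(e[2*j]) | uint16(e[2*j+1])<<8
	}
	return limbs
}

// fromLimbs packs 8 GF(2^16) coordinates back into a 16-byte element.
func fromLimbs(limbs [8]uint16) [16]byte {
	var e [16]byte
	for j, l := range limbs {
		e[2*j] = byte(l)
		e[2*j+1] = byte(l >> 8)
	}
	return e
}

func main() {
	var e [16]byte
	e[0], e[1] = 0xcd, 0xab // limb 0 = 0xabcd
	limbs := toLimbs(e)
	fmt.Printf("a_{i,0} = %#04x\n", limbs[0])
	fmt.Println(fromLimbs(limbs) == e) // round-trip: true
}
```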

Encoding an extension field vector. Now we want to extend the random linear combination $\tilde{X}_r$ to compute the encoding $X_r = G\tilde{X}_r$. We can represent the vector $\tilde{X}_r$ as the coefficients of a polynomial $\tilde{X}_r(x)$ over $GF(2^{128})$, i.e.

$\tilde{X}_r(x) = \sum_{i=0}^{K-1} \tilde{a}_i \cdot L_i(x)$

where $L_i(x)$ are some basis polynomials. Let us further decompose $\tilde{X}_r(x)$ using the decomposition of $\tilde{a}_i$:

$\tilde{X}_r(x) = \sum_{i=0}^{K-1} \left( \sum_{j=0}^{7} \tilde{a}_{i,j} \cdot \beta_j \right) \cdot L_i(x)$

Rearranging the summations, we get:

$\tilde{X}_r(x) = \sum_{j=0}^{7} \left( \sum_{i=0}^{K-1} \tilde{a}_{i,j} \cdot L_i(x) \right) \cdot \beta_j = \sum_{j=0}^{7} \tilde{A}_j(x) \cdot \beta_j$

where $\tilde{A}_j(x) = \sum_{i=0}^{K-1} \tilde{a}_{i,j} \cdot L_i(x)$ is a polynomial whose coefficients are the $j$-th column of $\tilde{X}_r$ in its matrix representation, as shown in the following figure.

image

The Reed-Solomon encoding of $\tilde{X}_r$ is just the evaluations of the polynomial $\tilde{X}_r(x)$ over an extended domain in $GF(2^{16})$. This is equivalent to encoding each of the columns $\tilde{A}_j$ and then interpreting each row of the encoding, which contains 8 $GF(2^{16})$ elements, as a single element from $GF(2^{128})$. The entire encoding process is illustrated in the following diagram.

image
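This equivalence rests only on linearity, so it can be illustrated over the integers instead of binary fields (a sketch with hypothetical names, using integer polynomial evaluation in place of the Reed-Solomon encoder):

```go
package main

import "fmt"

// evalPoly evaluates a polynomial with integer coefficients at x (Horner's rule).
func evalPoly(coeffs []int64, x int64) int64 {
	var y int64
	for i := len(coeffs) - 1; i >= 0; i-- {
		y = y*x + coeffs[i]
	}
	return y
}

// limbIdentityHolds checks the rearrangement identity used above: evaluating
// the polynomial whose coefficients are the packed values a_i = sum_j
// limbs[i][j]*B^j equals recombining the per-limb-column evaluations,
// i.e. X_r(x) = sum_j A_j(x) * B^j (with B standing in for the basis).
func limbIdentityHolds(limbs [][]int64, x, B int64) bool {
	nLimbs := len(limbs[0])
	packed := make([]int64, len(limbs))
	for i, row := range limbs {
		for j := nLimbs - 1; j >= 0; j-- {
			packed[i] = packed[i]*B + row[j]
		}
	}
	direct := evalPoly(packed, x) // encode the packed column directly ...
	var recombined, pow int64 = 0, 1
	for j := 0; j < nLimbs; j++ { // ... vs. encode each limb column separately
		col := make([]int64, len(limbs))
		for i := range limbs {
			col[i] = limbs[i][j]
		}
		recombined += evalPoly(col, x) * pow
		pow *= B
	}
	return direct == recombined
}

func main() {
	limbs := [][]int64{{1, 2, 3}, {4, 5, 6}} // limbs[i][j] = a_{i,j}
	fmt.Println(limbIdentityHolds(limbs, 7, 1<<16)) // true
}
```

In the actual codec the same argument goes through because the Reed-Solomon encoding is $GF(2^{16})$-linear and the $\beta_j$ are fixed basis elements.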

In the above diagram, we assume extension by $N$, i.e. the encoding has $N + K$ rows. The columns of the encoding $X_r$ are denoted by $A_j$, which are extensions of the columns $\tilde{A}_j$ of the data $\tilde{X}_r$. The elements $a_{i,j}$ of the encoding $X_r$ are symbols from $GF(2^{16})$, and each row contains 8 $GF(2^{16})$ symbols which can be packed into a single element of $GF(2^{128})$ using the same packing equation:

$a_i = \sum_{j=0}^{7} a_{i,j} \cdot \beta_j$

Here, even though we have shown that we can compute $a_i \in GF(2^{128})$ from the 8 $GF(2^{16})$ symbols of a row, the implementation never computes $a_i$ explicitly, since a symbol of $GF(2^{128})$ is represented as a vector of 8 $GF(2^{16})$ symbols.

Note that before extending $\tilde{X}_r$, we convert the rows of $GF(2^{16})$ symbols into the Leopard interleaved format described earlier. Similarly, after extending $\tilde{X}_r$, we convert back from the Leopard interleaved format to rows of $GF(2^{16})$ symbols. We omit this conversion from the diagram for simplicity.

RSEMA1D Codec

In this section we review the ZODA-inspired protocol implemented by Celestia.

Note that the 1D variant implemented by Celestia is essentially the Hadamard variant specified in Appendix D of the ZODA paper, but with a single column. This simplification means that instead of applying a full 2D tensor code (encoding both rows and columns), the protocol only encodes the columns vertically using Reed-Solomon encoding.

Encoding Algorithm. The encoding algorithm uses Reed-Solomon encoding over $\mathbb{F}_{2^{16}}$ to vertically extend the input data matrix, i.e. the generator matrix $G$ and the input data matrix $\tilde{X}$ contain elements from $\mathbb{F}_{2^{16}}$. The generator matrix $G$ is in systematic form, i.e. $G = \begin{bmatrix} I \\ A \end{bmatrix}$, where $I$ is the identity matrix and $A$ is the matrix generating the parity rows.

We outline a simplified description of the protocol:

  1. Compute the encoding of the columns $X = G\tilde{X}$
  2. Commit to the encoding $X$ using a Merkle tree where each leaf of the Merkle tree is the hash of a row of the encoding $X$, and let rootX represent the root of this tree
  3. Derive, using Fiat-Shamir, a uniformly random vector $g_r$ which contains elements from the extension field $\mathbb{F}_{2^{128}}$
  4. Compute the random linear combination of the columns $\tilde{X}_r = \tilde{X} g_r$
  5. Compute the encoding of the random linear combination as $X_r = G\tilde{X}_r$
  6. Commit to the random encoding $X_r$ using a Merkle tree, and let rootXr represent the root of this tree
  7. Commit to both roots as SHA-256(rootX || rootXr)

commitment

Sampling Algorithm. To sample rows of the committed encoding, a node/verifier that knows $\tilde{X}_r$ requests Merkle proofs for a set of rows and verifies them in the following way:

  1. Recompute the extended RLC vector rlcExtended (i.e. $G\tilde{X}_r$)
  2. Verify each Row opening using the known root commitment rowRoot
  3. Rederive the random vector gr using Fiat-Shamir
  4. Compute the random linear combination (using gr) of each Row and check that they match the expected entry in rlcExtended
  5. Combine rowRoot and the known rlcRoot to check the main commitment.

Standalone Proof. A node can directly reconstruct a row of the original data given a Merkle proof of the expected result in rlcExtended, assuming that a proof of data availability exists (i.e. at least one honest node successfully sampled the encoding via the previous algorithm). The verification steps go as follows:

  1. Verify the Row opening using the known root commitment rowRoot
  2. Rederive the random vector gr using Fiat-Shamir
  3. Compute the random linear combination for the row
  4. Ensure that the result correctly opens at the expected path in the RLC Merkle tree (given the opening proof)
  5. Combine both roots to recompute the commitment and check it matches the public one.

Note on Commitments. Merkle tree commitments are used throughout the scheme: leaves are SHA-256 hashes (with prefix 0x00) of the rows of a matrix, and the tree is laid out in breadth-first order in a contiguous array. Internal nodes use the prefix 0x01 when hashing their children.

commitments
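A sketch of this commitment structure (hashing and domain-separation prefixes as described above; the padding behavior for a non-power-of-two number of rows and all function names are assumptions of ours):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// leafHash hashes a row with the 0x00 domain-separation prefix.
func leafHash(row []byte) [32]byte {
	return sha256.Sum256(append([]byte{0x00}, row...))
}

// nodeHash hashes two children with the 0x01 domain-separation prefix.
func nodeHash(l, r [32]byte) [32]byte {
	buf := append([]byte{0x01}, l[:]...)
	return sha256.Sum256(append(buf, r[:]...))
}

// merkleRoot builds the tree bottom-up over a power-of-two number of rows.
// (How rsema1d handles odd-sized levels is not specified here.)
func merkleRoot(rows [][]byte) [32]byte {
	level := make([][32]byte, len(rows))
	for i, row := range rows {
		level[i] = leafHash(row)
	}
	for len(level) > 1 {
		next := make([][32]byte, len(level)/2)
		for i := range next {
			next[i] = nodeHash(level[2*i], level[2*i+1])
		}
		level = next
	}
	return level[0]
}

func main() {
	rootX := merkleRoot([][]byte{{1}, {2}, {3}, {4}})  // tree over encoding rows
	rootXr := merkleRoot([][]byte{{9}, {8}, {7}, {6}}) // tree over RLC encoding rows
	// The final commitment binds both trees: SHA-256(rootX || rootXr).
	commitment := sha256.Sum256(append(rootX[:], rootXr[:]...))
	fmt.Printf("commitment: %x\n", commitment[:8])
}
```

The distinct leaf and node prefixes prevent an internal node from being reinterpreted as a leaf (a classic second-preimage issue in prefix-free Merkle trees).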