Seqair — General Design
These are cross-cutting design rules that apply to all modules in the Seqair crate. They establish conventions for error handling, type safety, and platform portability.
Sources: This file contains seqair-specific design rules with no direct upstream spec counterpart. Named constants (FLAG bits, CIGAR op codes) are derived from [SAM1] §1.4 and §4.2. See references.md.
Error handling
Seqair aims to provide helpful and complete error messages at all layers of the system.
Error types MUST implement
std::error::Error using
thiserror and capture
necessary context and provide
causes.
Every failure mode MUST be a
distinct enum variant with typed
fields — never a catch-all
String or
reason: String. Callers
MUST be able to pattern-match on
failure modes without inspecting
display text. When a single enum
variant is constructed at multiple
call sites with different format
strings, split it into separate
variants. Paths, byte arrays, and
counts MUST be stored in typed
fields (e.g. PathBuf,
[u8; 4],
usize), not
interpolated into strings.
Type safety
BAM files are full of magic numbers: flag bits (0x4 = unmapped, 0x10 = reverse strand), CIGAR operation codes (0 = M, 1 = I, ...), format constants. Using raw literals is error-prone and hard to read.
Magic numbers (bit flags, format
constants, operation codes) MUST be
defined as named constants, not used
as raw literals. BAM flags MUST use
named constants (e.g.,
FLAG_UNMAPPED instead
of 0x4). CIGAR
operation types MUST use named
constants (e.g.,
CIGAR_M instead of
0).
BAM flags MUST have a
BamFlags newtype
wrapping u16 with named
predicate methods (e.g.,
is_unmapped(),
is_reverse()). The
public API (RecordRef)
MUST expose
BamFlags rather than
raw u16 where
appropriate.
CIGAR operations MUST have a
CigarOpType enum with
variants for all 9 SAM-spec
operations (M, I, D, N, S, H, P, =,
X). The enum MUST provide
consumes_ref() and
consumes_query()
methods. Invalid operation codes
MUST be represented as
None via
from_bam(u8) ->
Option<CigarOpType>.
API design
All public enums (including error
types) that may gain variants in
future versions MUST be annotated
with #[non_exhaustive].
Adding a variant to a non-exhaustive
enum is not a semver-breaking
change, allowing new variants and
error conditions to be introduced in
minor releases.
Only intentionally public types
should be exported. Internal
implementation modules (compression
codecs, container parsers, encoding
details) MUST be
pub(crate) or
#[doc(hidden)] so that
downstream crates do not depend on
internal structure. The public API
contract is defined by explicit
pub use re-exports in
each top-level module file.
Arithmetic safety
All integer arithmetic on values
derived from untrusted input (file
data, parsed fields) MUST use
checked or saturating operations.
Default wrapping arithmetic
(+, -,
*) MUST NOT be used on
untrusted values. The
clippy::arithmetic_side_effects
lint should be enabled.
Platform portability
Every platform-specific optimization (SIMD, architecture-specific intrinsics) MUST have a scalar fallback that produces identical results on all platforms. The optimized and fallback paths MUST be tested with property-based tests that verify equivalence for arbitrary inputs.
Unified Reader
seqair reads BAM, bgzf-compressed SAM, and CRAM files through a format-agnostic reader interface that auto-detects the input format and dispatches to the appropriate parser, while presenting a uniform API to the rest of the codebase.
Sources: The unified reader design is seqair-specific. Format magic bytes come from [SAM1] §4.2 (BAM:
BAM\1), [SAM1] §4.1 (BGZF:1f 8b), and [CRAM3] §6 (CRAM:CRAM). Sort order detection uses theSOtag from [SAM1] §1.3 "The header section" (@HDline). See references.md.
Goals
-
Backward compatibility:
existing code that calls
IndexedBamReader::open()continues to work unchanged. -
Auto-detection:
IndexedReader::open(path)inspects the file to determine BAM vs SAM vs CRAM and opens the appropriate reader. -
Uniform downstream API:
all three formats populate the same
RecordStoreviafetch_into(). The pileup engine, call pipeline, and all downstream code are format-agnostic. -
Forkable: all reader
types support
fork()for thread-safe parallel processing with shared immutable state.
Format detection
IndexedReader::open(path)
MUST auto-detect the file format by
inspecting magic bytes:
-
Bytes
1f 8b(gzip magic) → verify BGZF structure (extra field withBCsubfield). If not BGZF, reject with error: "file is gzip-compressed but not BGZF; usebgzipinstead ofgzip." If BGZF, decompress the first block:-
Starts with
BAM\x01→ BAM format -
Starts with
@(0x40) → bgzf-compressed SAM - Otherwise → reject with error naming supported formats
-
Starts with
-
Bytes
43 52 41 4d(CRAM) → CRAM format -
Byte
@(0x40) at position 0 → uncompressed SAM. Reject with error: "uncompressed SAM files cannot be indexed; compress withbgzip file.samthen index withsamtools index file.sam.gz." - Otherwise → return an error listing supported formats (BAM, bgzf-compressed SAM, CRAM)
After format detection, the reader MUST locate the appropriate index file:
-
BAM:
.bai,.bam.bai, or.csi(CSI supports references > 512 Mbp that BAI cannot) -
SAM:
.tbior.csi(tabix or CSI index) -
CRAM:
.crai(CRAM index)
If the format is detected but no
matching index is found, the error
MUST name the expected index
extension and suggest the tool to
create it (samtools index
for BAM/CRAM,
samtools index or
tabix for SAM).
Reader enum
The unified reader MUST be an enum
dispatching to format-specific
readers, not a trait object. This
avoids dynamic dispatch overhead in
the hot path and keeps the type
concrete for fork().
enum IndexedReader {
coverage.html Bam(IndexedBamReader),
Sam(IndexedSamReader),
Cram(IndexedCramReader),
}
The unified reader MUST expose:
-
header() -> &BamHeader— all formats produce the same header type (target names, lengths, tid lookup). The SAM and CRAM parsers convert their native header representations toBamHeader. -
fetch_into(tid, start, end, store) -> Result<usize>— populates aRecordStorewith records overlapping the region, identically to the BAM path. -
fork() -> Result<Self>— creates a lightweight copy sharing immutable state. -
shared() -> &Arc<_>— access to shared state forArc::ptr_eqtesting.
For a given BAM file and its
SAM/CRAM representations of the same
data, fetch_into for
the same region MUST produce records
with equivalent logical content:
same positions, flags, sequences,
qualities, and CIGAR operations. Aux
tags MUST contain the same set of
tag names and values, but tag
ordering and integer type codes
(e.g., c vs
i for small values) MAY
differ between formats because BAM
writers choose specific integer
widths that SAM text cannot
preserve.
RecordStore integration
Currently
RecordStore::push_raw()
takes raw BAM bytes. For SAM and
CRAM, records arrive as parsed
fields rather than BAM binary.
The RecordStore MUST provide
push_fields() (or equivalent)
that accepts pre-parsed record fields:
- pos, end_pos, flags, mapq, seq_len, matching_bases, indel_bases (fixed fields)
- qname bytes
- CIGAR as packed BAM-format u32 ops (SAM/CRAM parsers convert to this representation)
-
sequence as
&[Base](already the enum type stored in the bases slab — no conversion needed) - quality bytes
- aux tag bytes in BAM binary format (SAM/CRAM parsers serialize tags to this format)
This avoids a wasteful round-trip through BAM binary encoding.
push_fields() and
push_raw() MUST produce
identical
SlimRecord fixed fields
and identical name/bases/cigar/qual
slab contents for the same logical
record. Aux tag slab contents MAY
differ in integer type codes (a
SAM-derived i:42 may
serialize as c while
the original BAM used
C), since the original
BAM writer's type choice is not
recoverable from SAM text. Tests
MUST verify fixed-field and
non-aux-slab equivalence by pushing
the same record both ways.
Shared header type
SAM headers are text (the
@HD, @SQ,
@RG, @PG,
@CO lines). CRAM
containers embed SAM header text.
BamHeader already
stores
header_text: String and
parses @SQ lines for
targets.
BamHeader MUST gain a
from_sam_text(text: &str)
constructor that parses SAM header text
directly, without requiring BGZF/BAM
framing. This is used by both the SAM reader
(header lines at start of file) and the CRAM
reader (SAM header block in file header
container).
Sort order validation
Indexed random access assumes
coordinate-sorted data. If the
@HD header line
contains SO:unsorted or
SO:queryname, the
reader MUST return an error
explaining that indexed region
queries require coordinate-sorted
input. SO:coordinate or
absent SO (common in
older files) MUST be accepted.
Fork semantics per format
BAM fork: shares
Arc<BamShared>
(index + header), opens fresh
File handle. (Already
implemented.)
SAM fork: shares
Arc holding parsed
tabix/CSI index + header, opens
fresh File handle for
BGZF reading.
CRAM fork: shares
Arc holding CRAI index
+ header, opens fresh
File handle for
container reading. Each fork MUST
have its own FASTA reader (via
fasta_reader.fork())
for thread-safe reference lookups.
Reference caching (r[cram.perf.reference_caching]) MUST be per-fork, not shared.
Readers: alignment + reference bundle
The Readers struct
bundles an
IndexedReader
(alignment) with an
IndexedFastaReader
(reference) in a single type. This
eliminates the need for separate
FASTA path passing — CRAM can access
the reference it needs for sequence
reconstruction, and all formats have
uniform open/fork semantics.
pub struct Readers {
alignment: IndexedReader,
fasta: IndexedFastaReader,
}
Readers::open(alignment_path,
fasta_path)
MUST auto-detect the alignment
format (via
r[unified.detect_format]), open the appropriate reader, and
open the FASTA reader. For CRAM, the
fasta_path is passed to
IndexedCramReader::open()
for sequence reconstruction. For
BAM/SAM, the FASTA reader is opened
but not used by the alignment
reader.
Readers::fork() MUST
fork both the alignment reader and
the FASTA reader, returning a new
Readers with
independent I/O handles but shared
immutable state (indices, headers).
The CRAM fork gets its own FASTA
reader via
IndexedFastaReader::fork().
Readers MUST expose:
-
header() -> &BamHeader— delegates to the alignment reader's header. -
fetch_into(tid, start, end, store) -> Result<usize>— delegates to the alignment reader. -
fasta() -> &IndexedFastaReaderandfasta_mut() -> &mut IndexedFastaReader— direct access for callers that need reference sequences independently of the alignment reader (e.g., the call pipeline's segment fetching). -
alignment() -> &IndexedReaderandalignment_mut() -> &mut IndexedReader— direct access when needed.
IndexedReader::open(path)
MUST continue to work for BAM and
SAM files without a FASTA path. CRAM
detection in
IndexedReader::open()
MUST return an error explaining that
CRAM requires a reference and
suggesting
Readers::open()
instead. This preserves backward
compatibility for code that only
needs BAM/SAM.
API surface
See
r[io.non_exhaustive_enums]
and
r[io.minimal_public_api]
in general.md for the general
rules. For this module,
IndexedReader,
FormatDetectionError, and
ReaderError are the primary
enums subject to
r[io.non_exhaustive_enums].
BGZF Block Reader
BGZF (Blocked Gzip Format) is the compression layer used by BAM and other HTS (High-Throughput Sequencing) formats. Unlike plain gzip, which compresses an entire file as one stream, BGZF splits the data into independent blocks of up to 64 KiB uncompressed. Each block is a self-contained gzip member, meaning any block can be decompressed without reading the ones before it. This enables random access: a BAM index can point directly to a specific block offset in the compressed file, and the reader can decompress just that block to find the data it needs.
A BGZF file is simply a concatenation of these gzip blocks, ending with a special empty EOF marker block. Each block has a standard gzip header with an extra field that encodes the block's total compressed size (BSIZE), which the reader uses to locate where the next block begins.
Sources: All rules in this file derive from [SAM1] §4.1 "The BGZF compression format", §4.1 "Random access", and §4.1 "End-of-file marker". See references.md.
Block structure
[SAM1] §4.1 "The BGZF compression format" — block layout, extra field, BSIZE
A BGZF block MUST begin with the
gzip magic bytes
1f 8b 08 04 (gzip,
DEFLATE, FEXTRA flag set).
The extra field MUST contain a
BC subfield (SI1=0x42,
SI2=0x43) with SLEN=2 whose value is
the total block size minus one
(BSIZE). The total compressed block
size is BSIZE + 1.
Each block's DEFLATE payload MUST be decompressed independently. The last 8 bytes of the block are the gzip footer: a CRC32 checksum (4 bytes) followed by the uncompressed size ISIZE (4 bytes), both little-endian. Decompression MUST produce exactly ISIZE bytes.
BGZF blocks MUST NOT exceed 65536 bytes uncompressed. If the ISIZE footer field claims a larger value, the reader MUST return an error rather than allocating an unbounded buffer.
An EOF marker block has ISIZE=0. When encountered, the reader MUST signal end-of-stream.
Virtual offsets
[SAM1] §4.1 "Random access" — virtual file offset definition
Because BGZF blocks are at known compressed
file offsets and have known uncompressed
sizes, any byte in the uncompressed stream
can be addressed with a
virtual offset: a packed
64-bit value that combines "which block" and
"where within that block." BAM index files
(.bai) store virtual offsets to
point at specific records.
A virtual offset is a 64-bit value where the upper 48 bits encode the compressed block offset in the file and the lower 16 bits encode the byte offset within the uncompressed block.
The reader MUST support seeking to an arbitrary virtual offset by seeking the underlying file to the block offset, decompressing that block, and advancing to the within-block offset.
Reading
BAM records are variable-length and can span block boundaries (a record may start near the end of one block and continue into the next). The reader must handle this transparently.
The reader MUST support reading an exact number of bytes, transparently crossing block boundaries when the requested data spans multiple blocks.
The reader MUST support partial reads (up to N bytes) returning the actual count, returning 0 at EOF.
Decompression SHOULD use the
libdeflater crate (Rust
bindings to the libdeflate C
library) for performance parity with
htslib's libdeflate usage.
libdeflate provides
hardware-accelerated DEFLATE
decompression and CRC32 computation.
Integrity
[SAM1] §4.1 "The BGZF compression format" — gzip footer CRC32 and ISIZE fields
Each gzip block includes a CRC32 checksum of the uncompressed data. Verifying this catches silent data corruption from disk errors, network glitches on cluster storage, or truncated writes.
After decompression, the reader MUST
verify the CRC32 checksum of the
decompressed data against the
expected CRC32 stored in the gzip
footer (4 bytes before ISIZE). A
mismatch MUST return a
ChecksumMismatch error.
Performance
The reader SHOULD fast-path standard BGZF headers where XLEN=6 and the BC subfield is at the fixed offset (bytes 12–17 of the 18-byte header). All BAM files produced by samtools, htslib, and Picard use this layout. This avoids allocating an extra-fields buffer for the common case. Non-standard layouts MUST fall back to searching the extra fields for the BC subfield.
When resizing buffers that will be
immediately and fully overwritten
(by read_exact or
decompression), the reader SHOULD
use an uninitialized resize to avoid
redundant zero-filling. This applies
to the decompressed block buffer
(~64 KB per block) and the
compressed data buffer.
The reader MUST track the current block's compressed file offset for virtual offset calculation. The offset MUST be queried from the stream position before reading each new block, and set directly on seeks.
CIGAR Operations
A CIGAR string describes how a read aligns to the reference genome. It's a sequence of operations like "50M2I48M" meaning "50 bases match, 2 bases inserted in the read, 48 bases match." Each operation has a type and a length.
In BAM's binary format, CIGAR operations are
packed as u32 values: the lower 4 bits
encode the operation type (0–8) and the
upper 28 bits encode the length. A 150bp
read with a simple alignment has just one
operation (150M), while reads
spanning splice junctions or containing
indels can have 10+ operations.
The key concept is which operations consume the reference and/or query (read) sequences:
- Consumes reference: the operation advances the position on the reference genome
- Consumes query: the operation advances the position within the read
This determines how to translate between reference positions and read positions — the core question the pileup engine asks thousands of times per second.
Sources: [SAM1] §1.4 "The alignment section: mandatory fields" — CIGAR operations table with op codes, BAM numeric encodings, and "consumes query"/"consumes reference" flags. §4.2 "The BAM format" — packed u32 CIGAR encoding (
op_len << 4 | op). See references.md.
Operations
[SAM1] §1.4 "The alignment section: mandatory fields" — CIGAR operations table (M=0, I=1, D=2, N=3, S=4, H=5, P=6, ==7, X=8)
The following CIGAR operations MUST be supported:
- M (0): alignment match (consumes query and reference)
- I (1): insertion (consumes query only)
- D (2): deletion (consumes reference only)
- N (3): reference skip (consumes reference only)
- S (4): soft clip (consumes query only)
- H (5): hard clip (consumes neither)
- P (6): padding (consumes neither)
- = (7): sequence match (consumes query and reference)
- X (8): sequence mismatch (consumes query and reference)
Matches and indels
These aggregate statistics are used downstream for read quality metrics (e.g. fraction of matching bases, indel rate).
calc_matches_indels(cigar)
MUST compute the total number of
matching bases (sum of M operation
lengths) and the total number of
indel bases (sum of I and D
operation lengths) from a packed
CIGAR array.
CIGAR index
For the pileup engine, the critical
operation is: "at reference position X, what
is the corresponding position in this read?"
This requires walking the CIGAR to
accumulate reference and query offsets. The
CigarIndex precomputes this to
avoid re-walking for every position.
A CigarIndex MUST be
constructable from a record's
CIGAR and start position. It
precomputes cumulative reference and
query offsets per operation.
CigarIndex::qpos_at(ref_pos)
MUST return
Some(query_pos) when
the reference position falls within
a query-consuming operation (M, =,
X, I at the boundary), or
None when the position
falls within a deletion, reference
skip, or outside the alignment.
qpos_at MUST return
None for positions
outside the alignment's
reference span. This applies to all
mapping variants, including the
linear fast-path. Positions before
the alignment start or at/after the
alignment end MUST return
None.
CompactOp::ref_start is
stored as i32. Since
BAM positions are defined as
i32 values (max 2^31 -
1), this is sufficient. The
construction of
CompactOp MUST assert
(via debug_assert!)
that the computed reference position
fits in i32.
qpos_at MUST produce
identical results to htslib's
Alignment::qpos() for
every reference position within the
alignment span. This is verified by
comparison tests.
Seqair-Types — Phred Quality Score
Sources: Phred quality scores are defined in [SAM1] section 1.4 (QUAL field, Phred+33 encoding). The
Phrednewtype wraps the mathematical definition:Q = -10 log10(P). See references.md.
Construction
Phred scores MUST be non-negative.
from_phred MUST accept
only u8 values
(0..=255), making negative inputs
unrepresentable at the type level.
Integer conversion
as_int MUST clamp the
result to the range [0, 99]. Values
less than or equal to zero
(including negative results from
floating-point arithmetic) MUST
return 0. NaN MUST return 0.
BAM Reader
The indexed BAM reader is the main entry point for loading aligned reads from a BAM file. It ties together three layers: BGZF decompression (reading compressed blocks), BAI index lookup (finding which blocks contain reads for a region), and record decoding (parsing the binary BAM records into usable structs).
The reader's primary operation is
fetch_into: given a genomic
region like
chr19:6,105,700–6,105,800, it
queries the index, bulk-reads the relevant
compressed data into memory (see
region_buf.md), decompresses
it, decodes the records, and stores them in
a RecordStore for downstream
processing by the pileup engine.
Sources: [SAM1] §4 "The BAM Format Specification" generally — BGZF, BAM header, record layout, and BAI index together define the reader's behaviour. The forking design and ChunkCache are seqair-specific optimisations with no upstream spec counterpart. See references.md.
Opening
The reader MUST open a BAM file by
path, parse its header, and locate
the corresponding
.bai index file (trying
both
<path>.bai and
<path_without_ext>.bai).
The reader MUST expose the parsed
BamHeader for contig
name/tid/length lookups.
Region fetching
fetch_into(region, store)
MUST clear the store, query the BAI
index for chunks overlapping the
region, seek through those chunks
via BGZF, decode all records, and
add records that actually overlap
the region to the store
(RecordStore).
A record overlaps a region if
record.pos <=
region.end
AND
record.end_pos >=
region.start. Records outside the region that
appear in index chunks MUST be
filtered out. This is necessary
because the BAM index is approximate
— chunks may contain records
slightly outside the queried region.
Records in the store MUST be in
coordinate-sorted order (by pos,
then by end_pos as tiebreaker) after
fetch_into, matching
the order a sorted BAM file
provides.
Edge cases
Unmapped reads (flag 0x4) MUST be
excluded from the store during
fetch_into.
htslib's pileup engine rejects
them in bam_plp_push,
so including them would inflate
depth counts.
Secondary (0x100) and supplementary (0x800) reads MUST be included in the store by default. The caller's pileup filter is responsible for excluding them if desired, matching htslib's behavior.
Forking (thread-safe shared state)
Rastair processes regions in parallel via
rayon. Each worker thread needs its own
IndexedBamReader because file
handles have mutable seek state. Without
forking, every thread independently opens
the BAM file, re-reads the entire BAI index
from disk, and re-parses the BAM header —
wasting both I/O bandwidth and memory. On
HPC clusters with NFS/Lustre, redundant
index reads are especially costly.
The fork pattern separates the reader into
shared immutable data
(BamIndex,
BamHeader) and per-thread
mutable state (file handles, decompression
buffers). A prototype reader is opened once
on the main thread; worker threads call
fork() to get a lightweight
copy that shares the index and header via
Arc but owns fresh file
handles.
Shared state
Fork operation
fork() MUST produce a
new reader that:
-
Shares the same
ArcholdingBamIndexandBamHeaderas the source reader (no re-parsing, no additional memory). -
Opens a fresh raw
Filehandle for bulkRegionBufreads. NoBgzfReaderis needed —fetch_intoonly usesRegionBuf, and theBgzfReaderused duringopen()for header parsing is not retained. -
Allocates its own
ChunkCache(starts empty, populated on first region fetch for a tid).
A forked reader MUST produce
identical
fetch_into results as
independently opening the same BAM
file. The fork is purely an
optimization — it MUST NOT change
observable behavior.
Forked readers MUST be fully independent for mutable operations. Seeking or fetching on one reader MUST NOT affect any other reader (original or forked). Each reader owns its own file descriptors and buffer state.
Edge cases
Arc::ptr_eq on the
shared state of the original and
forked reader MUST return
true, confirming they
share the same allocation rather
than having made a deep copy.
Multiple forks from the same
prototype MUST be safe to use
concurrently from different threads.
Since the shared state is immutable
and each fork owns its file handles,
no synchronization is needed beyond
the Arc reference
count.
Coordinate validation
fetch_into MUST
validate that tid fits
in i32 and that
start/end
fit in i64 before using
them. If any value exceeds the
signed maximum,
fetch_into MUST return
a
BamError::CoordinateOverflow
rather than silently truncating via
as casts.
Error propagation
fetch_into (and
ChunkCache::load) MUST
propagate I/O and decompression
errors from the BGZF layer. Only
BgzfError::UnexpectedEof
(signalling the end of a
chunk's data) should silently
break the read loop. All other
errors — CRC failures, truncated
blocks, decompression failures,
invalid magic, et.c — MUST be
returned to the caller.
BAM Index (BAI/CSI)
A BAM file can be hundreds of gigabytes.
Without an index, finding all reads
overlapping a small genomic region (e.g.
chr19:6,105,700–6,105,800) would require
decompressing and scanning the entire file.
The BAM index (.bai file)
solves this by recording, for each genomic
region, which compressed byte ranges in the
BAM file contain relevant reads.
The index uses a
binning scheme: the genome
is divided into hierarchical bins of
decreasing size (the whole reference → 512
Mbp → 64 Mbp → ... → 16 KiB leaf bins). Each
bin stores a list of
chunks — pairs of virtual
offsets (see bgzf.md) marking
the start and end of compressed byte ranges
in the BAM file that contain reads falling
into that bin. A
linear index provides an
additional optimization: for each 16 KiB
window, it stores the minimum virtual offset
of any read starting in that window,
allowing the reader to skip chunks that are
entirely before the query start.
Sources: [SAM1] §5 "Indexing BAM" — binning algorithm; §5.1.1 "Basic binning index" — bin hierarchy and sizes; §5.2 "The BAI index format for BAM files" — binary format; §5.3 "C source code for computing bin number and overlapping bins" —
reg2bin/reg2bins. [CSI] — for CSI format. See references.md.
BAI format
[SAM1] §5.2 "The BAI index format for BAM files" — magic, n_ref, n_bin, chunk virtual offsets, n_intv linear index
A BAI file MUST begin with the magic
bytes BAI\1 (0x42,
0x41, 0x49, 0x01). The reader MUST
reject files that do not match.
The reader MUST parse the BAI binary format: n_ref references, each containing bins with chunks (pairs of virtual offsets) and a linear index of 16 KiB window offsets.
When decompressing BGZF blocks (e.g. for tabix), the reader MUST validate that bsize is large enough for the block header and footer before indexing into the block. A bsize that is too small MUST return an error rather than panicking.
Given a region (tid, start, end),
the index MUST return a list of BGZF
chunk pairs
(begin, end) that
contain all reads overlapping the
region. The query MUST use the
binning scheme to identify candidate
bins and the linear index to skip
chunks that cannot contain reads
starting before the query start.
The BAI binning scheme uses
reg2bin(beg, end) where
bin 0 spans the whole reference,
bins 1–8 span 64 Mbp each, and leaf
bins span 16 KiB. The implementation
MUST correctly compute bins for a
region and enumerate all overlapping
bins at every level.
Bin 0
[SAM1] §5.1.1 "Basic binning index" — bin 0 spans 512 Mbp, bins 1–8 span 64 Mbp
Bin 0 is the root of the binning hierarchy,
covering the entire reference sequence (0 to
2^29). Each alignment is placed in the
smallest bin that fully contains its
[pos, end) interval. An
alignment is placed in bin 0 when its
interval straddles a level-1 boundary
(multiples of 2^26 = 64 Mbp). This can
happen to short reads (e.g. 150 bp) that
cross positions like 67,108,864 or
134,217,728. See
SAM spec §5.3.
In practice, bin 0 contains very few reads (only those near the ~3–4 level-1 boundaries per chromosome), but its chunks may be located at a very different file offset from the rest of the query's chunks. In a 43 GB BAM, this can mean an extra seek of several GB — a significant cost on network storage.
The query MUST always include bin 0 in the candidate bins for any region query. Omitting it would silently drop reads that straddle 64 Mbp boundaries.
The index SHOULD provide a
query_split method that
separates distant (level 0–2) chunks
from nearby (level 3–5) chunks in
the result, returning a
QueryChunks struct with
nearby and
distant fields. This
allows callers to handle them
differently (e.g. caching, separate
I/O). The original
query method MUST
remain available for callers that
don't need the separation.
Higher-level bins and ChunkCache
The same distant-chunk problem affects bins at levels 1 (64 Mbp) and 2 (8 Mbp). For a query in chr5:100K–200K, the index includes bin 1 (covering chr5:0–64M) whose chunks span ~2 GB of the BAM file, and bin 9 (covering chr5:0–8M) whose chunks span ~0.27 GB. These chunks are identical for every region query within the same reference and contain many records that are outside the query region (~15% of decoded records are discarded by the position filter).
The query_split method
MUST separate chunks into
nearby (levels 3–5,
loaded via RegionBuf per query) and
distant (levels 0–2,
suitable for caching). The threshold
(level ≤ 2) is based on profiling
data showing that levels 0–2
contribute >90% of the file span
in typical region queries.
The reader SHOULD maintain a
ChunkCache that loads
ALL records from bins at levels 0–2
(covering ≥8 Mbp) once per reference
sequence (tid) per thread. The
BamIndex::distant_chunks
method returns all level 0–2 chunks
for a tid, unfiltered by region, for
use when populating the cache. Per
query, matching records (overlapping
the query region) are injected from
the cache into the
RecordStore without any
additional I/O. The cache MUST be
invalidated when switching to a
different reference (tid). This
eliminates expensive seeks for
distant-bin chunks on every region
query.
CSI format
[CSI] — min_shift, depth, l_aux header; per-bin loffset; parameterised reg2bin/reg2bins
CSI is an alternative index format that supports larger genomes and arbitrary bin sizes. It uses the same conceptual approach (bins + chunks + linear index) but with a configurable minimum shift and depth.
CSI index support MAY be added as a
follow-up. The same
region_query interface
MUST be used for both BAI and CSI
indices.
BAM Header
A BAM file begins with a header that
describes the reference genome and records
metadata about how the alignments were
produced. The header contains two parts: a
SAM-format text section (with
@HD, @SQ,
@RG, @PG lines)
and a binary reference sequence table. The
reference table maps each contig
(chromosome) to a numeric
target ID (tid) used
throughout the rest of the file — every
aligned read stores the tid of its reference
contig rather than the name string.
Sources: [SAM1] §4.2 "The BAM format" (BAM magic, l_text, n_ref, reference sequence table); §1.3 "The header section" (SAM header line types @HD/@SQ/@RG/@PG). See references.md.
Parsing
[SAM1] §4.2 "The BAM format" — magic, l_text, text, n_ref, l_name, name, l_ref fields
A BAM file MUST begin with the magic
bytes BAM\1 (0x42,
0x41, 0x4d, 0x01) after BGZF
decompression. The reader MUST
reject files that do not start with
this magic.
After the magic, a 4-byte
little-endian
l_text gives the length
of the SAM header text. The reader
MUST parse this text to extract
@HD, @SQ,
@RG, and
@PG header lines.
Header length fields
(l_text,
n_ref,
l_name) are stored as
i32 in the BAM format.
The reader MUST validate that these
values are non-negative before using
them as sizes. Negative values from
corrupt files MUST return a
BamHeaderError rather
than causing a massive allocation
via
i32 as usize wrapping.
After the header text, a 4-byte
little-endian
n_ref gives the number
of reference sequences. Each
reference is encoded as a 4-byte
l_name (including NUL),
the name bytes, and a 4-byte
l_ref (sequence
length). The reader MUST parse all
reference sequences.
Contig lookup
The rest of the BAM file and its index use tids (integer IDs) to refer to contigs. The caller typically has a contig name (e.g. "chr19") from user input or a BED file and needs the tid for index queries. Conversely, when reporting results, the tid needs to be mapped back to a name.
The header MUST support O(1) lookup from contig name to target ID (tid) and O(1) lookup from tid to contig name and length.
The header MUST provide access to the list of all target names in tid order.
The header MUST provide the length of each target sequence by tid.
BAM Record
Each aligned read in a BAM file is stored as a record: a binary structure encoding the read's position on the reference genome, its DNA sequence, quality scores, alignment details (CIGAR), and optional auxiliary tags. Records are variable-length — a 150 bp read with few tags is ~300 bytes, while a long read with many tags can be kilobytes.
The record's binary layout starts with 32 fixed bytes (position, flags, lengths, etc.), followed by variable-length fields in order: read name (qname), CIGAR operations, 4-bit packed sequence, quality scores, and auxiliary tags.
Sources: [SAM1] §4.2 "The BAM format" — binary record layout (refID, pos, l_read_name, mapq, bin, n_cigar_op, flag, l_seq, cigar, seq, qual, auxiliary); §4.2.4 "SEQ and QUAL encoding" — 4-bit sequence encoding; §4.2.5 "Auxiliary data encoding" — tag types A/c/C/s/S/i/I/f/Z/H/B; §1.4 "The alignment section: mandatory fields" — FLAG bit definitions. See references.md.
Decoding
[SAM1] §4.2 "The BAM format" — block_size, fixed 32-byte header, variable-length field order
A BAM record MUST be decoded from its binary representation: 32 fixed bytes followed by variable-length qname, CIGAR, sequence, quality, and auxiliary data. The 4-byte block_size prefix is read by the caller.
BAM records have a 2 MiB size cap. The reader MUST reject records whose block_size exceeds this limit rather than allocating an unbounded buffer from untrusted input.
Variable-length field offset calculations (var_start, cigar_end, seq_end, qual_end) MUST use checked arithmetic to prevent wrapping on malformed input. Overflow MUST return a decode error.
The decoder MUST extract: tid (i32), pos (i32→i64), mapq (u8), flags (u16), seq_len (u32), n_cigar_ops (u16), qname (NUL-terminated, stored without NUL), packed sequence (4-bit), quality scores, CIGAR operations (packed u32), and raw auxiliary data.
The decoder MUST precompute the
reference end position from the
CIGAR. Reference-consuming
operations are: M(0), D(2), N(3),
=(7), X(8). The end position is
pos + ref_consumed - 1
(inclusive).
When a read has zero
reference-consuming CIGAR operations
(e.g., pure soft-clip or
insertion-only), the reference span
is zero. In this case
end_pos MUST equal
pos (matching
htslib's
bam_cigar2rlen which
returns 0, placing the read at
exactly one position). The
implementation MUST NOT compute
pos - 1 for such reads.
Sequence encoding
[SAM1] §4.2.4 "SEQ and QUAL encoding" — 4-bit nibble encoding,
=ACMGRSVTWYHKDBNlookup table
BAM stores DNA sequences in a compact 4-bit encoding: two bases per byte, high nibble first. This halves storage compared to ASCII but requires decoding before use.
BAM encodes sequences in 4-bit
nibbles: two bases per byte (high
nibble first). The standard lookup
table maps 0→=,
1→A, 2→C,
4→G, 8→T,
15→N, and other values
to IUPAC ambiguity codes.
The record MUST support decoding a single base at a given read position.
The decoder MAY decode the full
4-bit sequence into
Base enum values at
construction time using
SIMD-accelerated bulk decoding (see
base_decode.md).
Single-base access via
seq_at(idx, pos)
returns the pre-decoded
Base value directly.
Flag access
[SAM1] §1.4 "The alignment section: mandatory fields" — FLAG bit table (0x4 unmapped, 0x10 reverse strand, 0x40 first in template, 0x80 second in template)
Each record has a 16-bit flags field encoding properties like strand orientation, pairing status, and mapping quality. These are checked frequently during pileup construction and filtering.
is_reverse() MUST
return true when the 0x10 bit is
set.
is_first_in_template()
MUST return true when the 0x40 bit
is set.
is_second_in_template()
MUST return true when the 0x80 bit
is set.
is_unmapped() MUST
return true when the 0x4 bit is set.
Auxiliary tags
[SAM1] §4.2.5 "Auxiliary data encoding" — tag byte layout, type codes A/c/C/s/S/i/I/f/Z/H/B
BAM records can carry optional key-value
tags (e.g. XR:Z:CT for bismark
strand, MD:Z:... for mismatch
string). Tags are stored as a flat byte
array at the end of the record, each
prefixed by a 2-byte name and a 1-byte type
code.
The record MUST support looking up auxiliary tags by their 2-byte name. Tag types A, c, C, s, S, i, I, f, d, Z, H, and B (typed array) MUST be supported.
B-type (array) auxiliary tags are
not yet parsed into
AuxValue.
find_tag returns
None for array tags.
The tag bytes MUST be correctly
skipped so that all subsequent tags
in the record remain accessible.
The record MUST provide access to raw auxiliary data bytes for efficient filtering without full tag parsing.
Sequence Codec
BAM encodes DNA sequences in a compact 4-bit format: two bases per byte, high nibble first. A 150bp read takes 75 bytes instead of 150. The 16 possible nibble values map to the 4 standard bases (A=1, C=2, G=4, T=8), N (15), and IUPAC ambiguity codes for the remaining values.
Decoding these sequences is on the hot path: during pileup construction, the engine needs the base at each read's query position for every reference position. With ~30 reads at ~3 billion positions, that's ~90 billion base lookups for a whole-genome BAM. Encoding is needed when rewriting BAM files with modified sequences.
Sources: [SAM1] §4.2.4 "SEQ and QUAL encoding" — 4-bit encoding table
=ACMGRSVTWYHKDBN → [0,15], high-nibble-first packing. The SIMD decode path and pair-table optimisation are seqair-specific. See references.md.
Decoding (4-bit → ASCII)
The scalar decoder MUST use a
16-entry lookup table mapping 4-bit
codes to ASCII bases
(=ACMGRSVTWYHKDBN). It
MUST support decoding full sequences
(all bases) and single bases at
arbitrary positions.
For bulk decoding, a 256-entry pair
table (DECODE_PAIR[byte] → [high_base,
low_base]) SHOULD be used to decode two
bases per byte lookup, avoiding
repeated shift-and-mask operations.
On platforms with SIMD support
(SSSE3 on x86_64, NEON on aarch64),
bulk sequence decoding MUST use SIMD
table-lookup instructions (pshufb
/ vqtbl1q_u8) to decode
32 bases per iteration from 16
packed bytes. A scalar tail MUST
handle the remaining bytes.
Runtime dispatch MUST select the fastest available decoder: SSSE3 on x86_64 (with feature detection), NEON on aarch64 (always available), scalar fallback on all other platforms.
Encoding (ASCII → 4-bit)
The scalar encoder MUST use a
256-entry lookup table mapping ASCII
bytes to 4-bit codes. Unknown bases
MUST map to 15 (N). Two
bases MUST be packed per byte (high
nibble first).
Correctness
The SIMD decoder MUST produce byte-identical output to the scalar decoder for all inputs. This MUST be verified by property-based tests covering arbitrary sequence lengths and byte patterns, including SIMD boundary lengths (0, 1, 15, 16, 31, 32, 33, 63, 64, 65, 128).
Base-typed Sequence Decoding
Sources: [SAM1] §4.2.4 "SEQ and QUAL encoding" — 4-bit nibble-to-base mapping (
=ACMGRSVTWYHKDBN). TheBaseenum, SIMD acceleration, and direct-to-Basedecode table are seqair-specific design choices; the upstream spec defines only the nibble codes. See references.md.
Background
BAM sequences are stored as 4-bit nibbles
(two bases per byte). The standard decode
table maps nibbles to ASCII characters
(=ACMGRSVTWYHKDBN). Downstream,
the ASCII bytes must be converted to
Base enum values (A, C, G, T,
Unknown) at every pileup position — billions
of times for whole-genome data.
By decoding directly into
Base values at record
construction time, we:
-
Eliminate the per-position
Base::from(u8)conversion (a LUT lookup per read per position) -
Get compile-time type safety — a
Baseis guaranteed to be one of 5 valid values, not an arbitrary byte -
Collapse IUPAC ambiguity codes (M, R, W,
S, Y, K, etc.) and
=toBase::Unknownat decode time, instead of at each use site
Decode table
A 16-entry lookup table MUST map
4-bit BAM nibble codes directly to
Base discriminant
values: 1→A(65), 2→C(67), 4→G(71),
8→T(84), and all other codes (0, 3,
5, 6, 7, 9–14, 15) → Unknown(78).
Table invariant
Every entry in
DECODE_BASE_TYPED (16
entries) and
DECODE_PAIR_TYPED (256
× 2 = 512 bytes) MUST be a valid
Base discriminant:
A(65), C(67), G(71), T(84), or
Unknown(78). This MUST be verified
by an exhaustive test covering all
entries.
Decoding function
decode_bases(packed_seq,
seq_len)
MUST decode a 4-bit packed BAM
sequence into a
Vec<Base>. The
function SHOULD use the
SIMD-accelerated decode path
(SSSE3/NEON) with a Base-valued
lookup table for identical
throughput to the ASCII decoder.
Storage in RecordStore
The RecordStore MUST
store decoded bases in a dedicated
slab (Vec<Base>)
separate from the data slab. The
data slab continues to hold cigar,
qual, and aux data. The
seq_at method MUST
return Base instead of
u8.
PileupAlignment
PileupAlignment.base
MUST be of type Base,
not u8. This eliminates
the
Base::from(u8)
conversion in the metrics
accumulation hot loop.
ASCII-to-Base batch conversion
Reference sequences (FASTA) and SAM text
sequences arrive as ASCII bytes, not 4-bit
packed nibbles. These need the same one-time
batch conversion to Base that
BAM sequences get via
decode_bases, but using an
ASCII-aware path instead of the nibble
decode table.
Base::from_ascii_vec(Vec<u8>)
-> Vec<Base>
MUST convert a vector of ASCII bytes
to Base values in-place
(reusing the allocation). Maps
A/a→A, C/c→C, G/g→G, T/t→T, all
other bytes→Unknown. The function
MUST NOT allocate a new vector — it
operates on the input's buffer
and transmutes the result. This is
safe because Base is
repr(u8) and the
function only writes valid
Base discriminant bytes
(65, 67, 71, 78, 84).
The ASCII batch converter MUST use
SIMD acceleration (SSSE3 on x86_64,
NEON on aarch64) with a scalar
fallback, following the same
dispatch pattern as
decode_bases. This is
the required path for all bulk
u8→Base conversions outside BAM
4-bit decoding: FASTA reference
sequences, SAM text sequences, and
any other ASCII-encoded base data.
The SIMD ASCII converter MUST
produce identical output to applying
Base::from(u8)
element-wise. This MUST be verified
by property-based tests covering
arbitrary byte values and lengths
including SIMD boundary lengths (0,
1, 15, 16, 31, 32, 33, 63, 64, 65,
128).
FromStr validation
FromStr for Base MUST
accept only a single character A, C,
G, T, or N (case-insensitive) after
trimming whitespace. Multi-character
input (after trimming) MUST return
Err(BaseError::MultipleChars). Any single character that is not
A/C/G/T/N MUST return
Err(BaseError::InvalidBase(byte))
where byte is the first
byte of the trimmed input. Empty
input MUST return
Err(BaseError::Empty).
Error display
BaseError variants MUST
display the actual invalid value.
InvalidBase(byte)
SHOULD format as
"Invalid base:
0x{byte:02x}". Empty SHOULD format
as "Empty".
CRAM Reader
A CRAM reader for region-based random access. CRAM is a reference-based, column-oriented compression format that achieves significantly smaller file sizes than BAM (typically 40-60% of BAM size) at the cost of decoding complexity and reference dependency.
Sources: [CRAM3] throughout — file structure, containers, slices, blocks, variable-length integers, bit stream, compression header, data series encodings, record decode order, read features, CRAI index. [CRAMcodecs] — rANS 4×8 (§2), rANS Nx16 (§3), arithmetic coder (§4), tok3 (§5), fqzcomp (§6). See references.md.
Context
CRAM stores alignment data as differences
against a reference genome rather than full
sequences. Where BAM stores
ACGTACGT for a read, CRAM
stores "matches reference at this position,
except substitution at offset 5". This
reference-based approach, combined with
column-oriented storage and sophisticated
entropy coding, makes CRAM the most compact
alignment format.
Rastair needs CRAM support because many large-scale sequencing projects store data exclusively in CRAM format.
Goals and scope
The reader MUST support CRAM v3.0
and v3.1. CRAM v2.0 and v2.1 MAY be
supported but are not required —
they are rarely encountered in
practice. The magic bytes
CRAM followed by major
version 3 MUST be accepted; major
version 2 SHOULD produce a warning;
other versions MUST be rejected.
The reader is read-only. CRAM writing is out of scope — rastair writes BAM (via htslib) for output.
CRAM decoding requires access to the reference genome (FASTA). The reader MUST accept a FASTA reader (the one being built on the other branch) for sequence lookups. Embedded reference slices (where the reference is stored in the CRAM container itself) MUST also be supported as a fallback.
Note: htslib supports automatic reference
download via REF_PATH/REF_CACHE
environment variables and the EBI reference
server. This is out of scope for seqair, but
when the FASTA reader is missing or the
required contig is absent, the error message
SHOULD mention REF_PATH/REF_CACHE
as an alternative for users familiar with
the htslib workflow.
File structure
[CRAM3] §6 "File definition" — CRAM magic, major/minor version, file ID
File definition
The file MUST start with
CRAM (bytes
43 52 41 4d), followed
by major version (1 byte) and minor
version (1 byte), followed by a
20-byte file ID.
The first container after the file definition MUST be the file header container. It contains a single block (content_type=FILE_HEADER) whose data is:
-
header_text_length(i32, fixed-width little-endian): byte length of the SAM header text -
header_text(UTF-8 bytes): the SAM header text (@HD,@SQ,@RG, etc.)
The header text is parsed via
BamHeader::from_sam_text() to
populate target names, lengths, and tid
mappings. Note: the i32 length prefix is
part of the block's
decompressed data, not the block
header itself.
The file MUST end with an EOF
container. The EOF container
produced by samtools has
length=15 (not 0) and
num_records=0. The
reader MUST detect the EOF container
as any container where
num_records=0 and
length <= 15. The
reader SHOULD verify the EOF marker
but MUST NOT fail if it's
missing (truncated files should
produce a warning, not a hard
error).
Containers
[CRAM3] §7 "Container header structure" — length (fixed i32), itf8 fields, landmarks, CRC32
Each container header contains:
-
length(i32, fixed-width little-endian): total byte length of all blocks in the container. This is the only fixed-width field in the header; all subsequent fields use variable-length encoding. -
reference_sequence_id(itf8): reference ID, or -1 (unmapped), or -2 (multi-ref) -
alignment_start(itf8): 1-based start position on reference -
alignment_span(itf8): number of reference bases covered -
num_records(itf8): number of alignment records -
record_counter(ltf8): global 0-based record index (v3.0+) -
bases(ltf8): total bases in container (v3.0+) -
num_blocks(itf8): block count -
landmarks(itf8 array): count (itf8) followed by that many itf8 values; byte offsets from container data start to each slice header -
crc32(4 bytes, fixed-width little-endian): CRC32 over all preceding header bytes (fromlengththrough end oflandmarks)
During fetch_into, the
reader MUST skip containers whose
[alignment_start,
alignment_start +
alignment_span)
range does not overlap the query
region. The CRAI index provides byte
offsets for this.
Slices
[CRAM3] §8.4 "Slice header block" — slice header fields, multi-ref slices, embedded reference, reference_md5
Each slice header contains:
-
reference_sequence_id(itf8): ref ID for this slice, -1 unmapped, -2 multi-ref -
alignment_start,alignment_span(itf8): genomic range covered -
num_records(itf8): records in this slice -
record_counter(ltf8): global record counter -
num_blocks(itf8): data blocks in this slice -
block_content_ids(itf8 array): content IDs of the blocks -
embedded_reference(itf8): -1 if no embedded reference, else block content ID -
reference_md5(16 bytes): MD5 of the reference segment (all zeros if N/A) - optional tags (BAM-format aux bytes)
When
reference_sequence_id ==
-2, the slice contains records from
multiple references. The
RI (reference ID) data
series MUST be decoded per-record to
determine which reference each
record aligns to. Multi-ref slices
are common in coordinate-sorted CRAM
files at chromosome boundaries and
for reads with unmapped mates.
When
embedded_reference >=
0, the corresponding external data
block contains the reference bases
for this slice. The reader MUST use
these instead of the FASTA reader.
This enables CRAM files to be
self-contained (produced by
samtools view
--output-fmt-option
embed_ref
or cramtools).
Blocks
[CRAM3] §8 "Block structure" — method, content_type, content_id, compressed_size, uncompressed_size, data, CRC32 (v3.0+)
Each block contains:
-
method(byte): compression codec (see codecs section) -
content_type(byte): 0=FILE_HEADER, 1=COMPRESSION_HEADER, 2=SLICE_HEADER, 4=EXTERNAL, 5=CORE -
content_id(itf8): identifies which data series this block holds -
compressed_size,uncompressed_size(itf8) -
data(bytes): compressed payload -
crc32(4 bytes): CRC32 of the block (v3.0+)
CRC32 MUST be verified on every block in v3.0+. The CRC32 covers all bytes from the start of the block (method byte) through the end of the compressed data, excluding the 4 CRC32 bytes themselves. Mismatches MUST produce an error with the block's content_type and content_id for diagnostics.
Variable-length integers
[CRAM3] §2.1 "Logical data types" — ITF8 (up to 32 bits) and LTF8 (up to 64 bits) variable-length integer encodings
ITF8 (up to 32 bits) MUST be decoded as:
-
0xxxxxxx: 1 byte, value = bits 6..0 (7 bits) -
10xxxxxx + 1 byte: 2 bytes, value =(b0 & 0x3F) << 8 | b1(14 bits) -
110xxxxx + 2 bytes: 3 bytes, value =(b0 & 0x1F) << 16 | b1 << 8 | b2(21 bits) -
1110xxxx + 3 bytes: 4 bytes, value =(b0 & 0x0F) << 24 | b1 << 16 | b2 << 8 | b3(28 bits) -
1111xxxx + 4 bytes: 5 bytes, value =(b0 & 0x0F) << 28 | b1 << 20 | b2 << 12 | b3 << 4 | (b4 & 0x0F)(32 bits)
Note: in the 5-byte form, only the low 4 bits of byte 4 are used. The high 4 bits of byte 4 are unused/ignored.
The decoded value is an unsigned u32. For
fields that can be negative (e.g.,
reference_sequence_id where
-1=unmapped, -2=multi-ref), the u32 MUST be
reinterpreted as signed i32 via two's
complement (value as i32).
LTF8 (up to 64 bits) MUST be decoded
with the same leading-bit pattern
extended to 9 bytes, with
0xFF prefix indicating
8 continuation bytes.
Bit stream reading
[CRAM3] §2.2 "Reading and writing bits in a bit stream" — MSB-first bit ordering
The core data block (content_type=5) is read as a bit stream. Bits are read MSB-first (most significant bit first) within each byte:
- Bit offset 0 is the most significant bit (0x80)
- Bit offset 7 is the least significant bit (0x01)
- Reading N bits shifts them out from the current position, crossing byte boundaries as needed
Encodings that read from the core (Huffman,
Beta, Subexp, Gamma) all use this bit
stream. Encodings that read from external
blocks (External, ByteArrayStop) use
byte-level access to the specific external
block identified by content_id.
The core bit stream is shared across all records in a slice — fields are interleaved sequentially. Reading fields in the wrong order corrupts all subsequent data.
Compression header
[CRAM3] §8.3 "Compression header block" — preservation map, data series encoding map, tag encoding map
The compression header is a block (content_type=1) at the start of each container. It defines how all data in the container's slices is encoded.
Preservation map
[CRAM3] §8.3 "Preservation map" — RN, AP, RR, SM (substitution matrix), TD (tag dictionary)
The preservation map MUST be decoded as a series of key-value pairs:
-
RN(bool): whether read names are preserved (if false, names are generated) -
AP(bool): whether alignment positions are delta-coded (true) or absolute -
RR(bool): whether the reference is required for decoding -
SM(5 bytes): substitution matrix (seer[cram.compression.substitution_matrix]) -
TD(bytes): tag dictionary — null-separated lists of 3-byte entries (2-char tag name + 1-char BAM type)
The substitution matrix is 5 bytes, one per reference base in order: A, C, G, T, N. Each byte encodes the ordering of substitution alternatives using 2-bit codes packed MSB-first:
byte = (alt0 << 6) | (alt1
<< 4) | (alt2 << 2) |
alt3
where each 2-bit value maps to a base: 0=A,
1=C, 2=G, 3=T. When decoding a substitution
(BS data series value = code), the read base
is \alt[code] from this
ordering.
Example: for reference base C, the possible
substitutions are A, G, T (not C itself). If
the substitution byte is
0b_00_10_11_01 (0x2D):
- code 0 → bits
00→ A - code 1 → bits
10→ G - code 2 → bits
11→ T -
code 3 → bits
01→ C (unused for ref=C, but the slot exists)
For non-N reference bases, only codes 0-2 are meaningful (3 alternatives, excluding the reference base itself). Code 3 exists in the byte but is never emitted by conforming CRAM writers — the decoder MAY treat it as an error or map it to the remaining unused base. For reference base N, all 4 codes are meaningful (A, C, G, T).
Data series encoding map
[CRAM3] §8.3 "Data series encodings" — two-byte data series codes and their types (BF, CF, RI, RL, AP, RG, RN, MF, NS, NP, TS, NF, FN, FC, FP, MQ, BA, QS, BS, IN, SC, DL, RS, PD, HC, BB, QQ, TL)
The data series encoding map MUST be decoded as key-value pairs where:
- Key: 2-byte data series code
- Value: encoding descriptor (encoding ID + parameters)
Complete data series list:
| Code | Name | Type | Description |
|---|---|---|---|
BF |
BAM flags | int | BAM FLAG field (u16) |
CF |
CRAM flags | int | CRAM-specific flags |
RI |
Ref ID | int | Reference sequence ID (multi-ref slices only) |
RL |
Read length | int | Number of bases |
AP |
Alignment pos | int | Position (delta or absolute per preservation map) |
RG |
Read group | int | Index into @RG header entries |
RN |
Read name | byte array | QNAME |
MF |
Mate flags | int | 0x1=mate reverse, 0x2=mate unmapped |
NS |
Next segment ref | int | Mate reference ID |
NP |
Next mate pos | int | Mate position |
TS |
Template size | int | Insert size |
NF |
Next fragment | int | Record-skip to next fragment (attached mates) |
FN |
Feature count | int | Number of read features |
FC |
Feature code | byte | Feature type character |
FP |
Feature position | int | Position within read (delta from previous feature) |
MQ |
Mapping quality | int | MAPQ |
BA |
Base | byte | Single base |
QS |
Quality score | byte | Single quality value |
BS |
Substitution code | byte | Index into substitution matrix |
IN |
Insertion | byte array | Inserted bases |
SC |
Soft clip | byte array | Soft-clipped bases |
DL |
Deletion length | int | Number of deleted reference bases |
RS |
Ref skip | int | Reference skip length (intron) |
PD |
Padding | int | Padding length |
HC |
Hard clip | int | Hard clip length |
BB |
Bases block | byte array | Stretch of bases |
QQ |
Quality block | byte array | Stretch of quality scores |
TL |
Tag line | int | Index into tag dictionary |
Tag encoding map
[CRAM3] §8.3 "Tag encodings" — ITF8 key as
(char1 << 16) + (char2 << 8) + type_char, BAM-format tag values
The tag encoding map uses ITF8 keys
computed as
(char1 << 16) + (char2
<< 8) + type_char. Each key maps to an encoding
descriptor for that tag type. Tag
values are stored as raw BAM-format
bytes (without the 3-byte tag+type
prefix).
Base modification tags (MM/ML) use
Z and B:C types
respectively and can be very large on long
reads (thousands of bytes). The decoder MUST
handle arbitrary-length tag values.
Encoding types
[CRAM3] §13 "Encodings" — NULL (0), EXTERNAL (1), HUFFMAN (3), BYTE_ARRAY_LEN (4), BYTE_ARRAY_STOP (5), BETA (6), SUBEXP (7), GAMMA (9)
Encoding ID 0 (NULL): no data. Used for absent data series.
Encoding ID 1 (EXTERNAL): reads raw
bytes from an external data block
identified by
content_id. The most
common encoding — used for quality
scores, bases, tags, etc.
Encoding ID 3 (HUFFMAN): canonical Huffman coding. Parameters: alphabet (itf8 array) and bit lengths (itf8 array). Reads from the core data bit stream.
Canonical Huffman construction: sort by (bit_length, symbol_value), assign codes starting from 0 with incrementing and left-shifting when bit length increases. A single-symbol Huffman code has bit length 0 and consumes 0 bits from the stream — this is common for fields where all records have the same value (e.g., all flags identical).
Encoding ID 4 (BYTE_ARRAY_LEN): a length encoding followed by a value encoding. The length is decoded first (from any encoding), then that many bytes are decoded using the value encoding.
Encoding ID 5 (BYTE_ARRAY_STOP): reads bytes from an external block until a stop byte is encountered. Used for read names (stop=0x00) and variable-length strings.
Encoding ID 6 (BETA): fixed-width integer. Parameters: offset and number of bits. Reads from core bit stream.
Encoding ID 7 (SUBEXP):
sub-exponential code. Parameters:
offset (itf8) and
K (itf8). Reads from
core bit stream. Decode procedure:
-
Read unary: count consecutive
1bits until a0bit is read. Call this countn. -
If
n == 0: readKbits from the stream as the value. -
If
n > 0: readn + K - 1bits from the stream asremainder. The value is(1 << (n + K - 1)) - (1 << K) + remainder. -
The final decoded value is
value - offset.
Encoding ID 9 (GAMMA): Elias gamma
code with offset. Parameters:
offset (itf8). Reads
from core bit stream. Decode
procedure:
-
Read unary: count consecutive
0bits until a1bit is read. Call this countn. -
If
n == 0: the value is 0. -
If
n > 0: readnmore bits from the stream assuffix. The value is(1 << n) | suffix - 1. -
The final decoded value is
value - offset.
Compression codecs
[CRAM3] §14 "External compression methods" — RAW (0), GZIP (1), BZIP2 (2), LZMA (3), rANS4x8 (4), rANS Nx16 (5), arithmetic coder (6), fqzcomp (7), tok3 (8) [CRAMcodecs] §2 "rANS 4x8" — frequency table RLE, order-0/order-1 modes, chunk-based order-1 interleaving [CRAMcodecs] §3 "rANS Nx16" — flags byte (ORDER, N32, STRIPE, NoSize, CAT, RLE, PACK) [CRAMcodecs] §5 "Name tokenisation codec" — tok3
Method 0 (RAW): uncompressed data.
Method 1 (GZIP): deflate compression. MUST be supported — this is the most common codec.
Method 2 (BZIP2): bzip2 compression. MUST be supported for v3.0 compatibility.
Method 3 (LZMA): LZMA compression. MUST be supported for v3.0 compatibility.
Method 4 (rANS 4×8): rANS entropy coder with 4-way interleaved states and 8-bit renormalization (L=2^23). Supports order-0 (context-free) and order-1 (previous-symbol context) modes. MUST be supported — this is the default codec in CRAM v3.0 writers including samtools.
The frequency table header uses a compact
run-length encoding: after reading a symbol
and its frequency, if the next symbol is
prev_sym + 1, a run length byte
follows, then that many consecutive symbols'
frequencies. The terminator is a NUL (0x00)
byte. Order-1 nests this same structure: the
outer loop reads context symbols (with the
same run-length scheme), and for each
context reads an order-0 frequency table.
Order-1 interleaving is chunk-based, not
round-robin: the output is split into 4
equal segments, each decoded by one state
sequentially. Remainder bytes (when
len % 4 != 0) are handled by
state 3.
Method 5 (rANS Nx16): v3.1 codec. rANS with N-way interleaving (N=4 or 32), 16-bit renormalization (L=2^15), and optional transforms controlled by an 8-bit flags byte: ORDER(1), N32(4), STRIPE(8), NoSize(16), CAT(32), RLE(64), PACK(128). Transforms are applied in order: decode entropy → unRLE → unPack. MUST be supported for v3.1 files — samtools uses this for most data blocks in v3.1 output.
Method 6 (arithmetic coder): v3.1 adaptive arithmetic coder. SHOULD be supported.
Method 7 (fqzcomp): v3.1 quality-score-specific compressor. SHOULD be supported.
Method 8 (tok3): v3.1 read-name
tokeniser. MUST be supported for
v3.1 files — samtools uses tok3 for
read name blocks by default in v3.1
output. Without tok3 support, v3.1
CRAM files produced by
samtools view -C cannot
be read.
Unknown codec methods MUST produce a
clear error naming the method ID and
suggesting conversion to BAM (samtools view -b).
Record decoding
[CRAM3] §10 "Record structure" — CRAM record layout and field decode order
Decoding order
[CRAM3] §10.1 "CRAM record" — strict field decode order (BF, CF, RI, RL, AP, RG, QS, RN, mate info, TL, FN/FC/FP, MQ, per-base QS)
Records within a slice MUST be decoded in a strict field order, because the core bit stream is sequential. Reading fields out of order corrupts all subsequent data. The order is:
BF(BAM flags)CF(CRAM flags)-
RI(reference ID — only if slice is multi-ref, i.e., ref_seq_id == -2) RL(read length)-
AP(alignment position — delta or absolute per preservation map) RG(read group)-
QS(quality scores — only if CRAM flags bit 0x1 is clear, meaning quality stored per-record not per-base) -
RN(read name — only if preservation mapRN=true) -
Mate information:
-
If CRAM flags bit 0x2 set
(detached):
MF,NS,NP,TS -
If CRAM flags bit 0x4 set (mate
downstream):
NF
-
If CRAM flags bit 0x2 set
(detached):
TL(tag line index)-
FN(feature count), then for each feature:FC,FP, feature-specific data MQ(mapping quality)-
QSper-base (only if CRAM flags bit 0x1 is set — quality as array)
Getting this order wrong is the most common implementation bug in CRAM readers.
Per-record data series
BF (BAM flags) and
CF (CRAM flags) MUST be
decoded per record. BAM flags map
directly to the u16 flags in
SlimRecord. CRAM flags
are separate and encode:
- Bit 0x1: quality scores stored as array (vs per-record)
- Bit 0x2: detached (mate info stored inline)
- Bit 0x4: has mate downstream in same slice
- Bit 0x8: sequence not stored (unknown)
AP (alignment
position): if the preservation map
AP=true (delta mode),
each record's position is a
delta from the previous
record's position within the
slice. The reader MUST maintain a
running position accumulator and
convert to absolute 0-based
coordinates (CRAM positions are
1-based). Delta-coded positions can
be negative (valid for supplementary
alignments in coordinate-sorted
CRAM).
RL (read length): the
number of bases in the read. Used to
allocate sequence and quality
arrays. Long reads (PacBio/ONT) can
be 10,000-1,000,000+ bases — buffer
pre-allocation MUST NOT assume
short-read sizes.
RG (read group): an
index into the
@RG entries in the
header. Stored as itf8. The reader
MUST convert to a
RG:Z:<id> aux tag
with the read group ID string.
RN (read name): if
preservation map
RN=true, names are
stored. If RN=false,
names are auto-generated. Rastair
uses qnames for mate-pair dedup, so
RN=false CRAM files
MUST produce a warning that dedup
will not work correctly. Generated
names SHOULD be deterministic (e.g.,
cram_<record_counter>).
Read feature decoding
[CRAM3] §10.6 "Mapped reads" and §10.6.1 "Read feature codes" — feature codes X/I/i/D/S/B/b/Q/q/N/H/P and their data series
For each record,
FN gives the number of
read features. For each feature:
-
Decode
FC(feature code byte) -
Decode
FP(position within read, delta-coded from the previous feature's position within the same record — first feature's FP is absolute within the read, starting at 1) - Decode feature-specific data:
| FC | Char | Data series | Description |
|---|---|---|---|
X |
0x58 | BS |
Substitution:
base =
substitution_matrix[ref_base][code]
|
I |
0x49 | IN | Insertion: multi-base insertion |
i |
0x69 | BA | Single-base insertion |
D |
0x44 | DL | Deletion: skip DL reference bases |
S |
0x53 | SC | Soft clip: bases not aligned to reference |
B |
0x42 | BA + QS | Single base replacement + quality |
b |
0x62 | BB + QQ | Stretch of bases + qualities |
Q |
0x51 | QS | Quality score at position |
q |
0x71 | Stretch of quality scores | |
N |
0x4e | RS | Reference skip (intron): skip RS reference bases |
H |
0x48 | HC | Hard clip: record HC bases clipped |
P |
0x50 | PD | Padding: PD padding bases |
Sequence reconstruction
[CRAM3] §10.6 "Mapped reads" — reference-based sequence reconstruction from read features; §11 "Reference sequences" — reference MD5 verification
Read sequences are NOT stored directly. They MUST be reconstructed from the reference and read features:
-
Fetch the reference sequence for
[alignment_start, alignment_start + alignment_span)from the FASTA reader (or embedded reference block). -
Initialize the read's bases array
with
RLentries. -
Walk the reference and read in parallel,
applying features in order:
- Between features: copy reference bases to read bases (these are the matching segments).
-
X(substitution): look upsubstitution_matrix[ref_base][BS_code]for the read base. -
I/i(insertion): insert bases fromIN/BAinto the read (these don't consume reference). -
D(deletion): skipDLreference bases (these don't consume read). -
S(soft clip): insert bases fromSCinto the read (no reference consumption). -
B(base+quality): replace base and quality at the feature position. -
N(ref skip): skipRSreference bases. -
H(hard clip): record length only (no bases in read or reference).
-
Convert reconstructed bases to
Baseenum values for the bases slab.
Quality scores are decoded from
QS (per-base) or
QQ (stretch) data
series, depending on CRAM flags bit
0x1. If CRAM flags bit 0x8 is set
(sequence unknown), quality is
unavailable (all 0xFF). Quality
length MUST equal read length.
When
quality_as_array is
false (CRAM flags bit 0x1 unset),
per-position quality overrides are
encoded as Q (0x51)
read features. Each
Q feature carries a
single quality score from the
QS data series at a
specific read position. The decoder
MUST apply these quality scores to
qual_buf at the
feature's read position
(1-based FP, converted
to 0-based index). Positions without
a Q feature retain the
default quality (0xFF). This is
distinct from the bulk
QS decode path used
when
quality_as_array is
true.
The q (0x71) read
feature carries a stretch of quality
scores from the QQ data
series, starting at the
feature's read position. The
decoder MUST write the decoded
quality bytes into
qual_buf starting at
the feature's 0-based read
position. The number of quality
scores in the block is determined by
the QQ encoding. Like
Q, this feature only
appears when
quality_as_array is
false.
CRAM does not store CIGAR explicitly. The CIGAR MUST be reconstructed from the read features:
- Stretches of matching reference → M ops (or =/X if distinguishing is needed, but M is standard)
-
X(substitution) → part of M ops (substitutions are still alignment matches) -
I/i→ I ops D→ D opsS→ S opsN→ N opsH→ H opsP→ P ops
The reconstructed CIGAR MUST be packed into BAM u32 format for the data slab.
Mate information
[CRAM3] §10.4 "Mate records" — detached (CF bit 0x2) vs attached (NF) mate encoding
When CRAM flags bit 0x2 is set
(detached), mate information is
stored inline: MF (mate
flags), NS (next ref
ID), NP (next mate
position), TS (template
size). These fields are NOT needed
for RecordStore (which doesn't
store mate info), but the decoder
MUST read and discard them to
advance the data stream correctly.
When bit 0x2 is clear, the mate is
"attached" —
NF (next fragment
distance) gives the record-skip to
the mate within the same slice. The
decoder MUST read
NF but does not need to
resolve the linkage for RecordStore
purposes.
Auxiliary tags
[CRAM3] §10.5 "Auxiliary tags" — TL tag line index, tag dictionary, tag encoding map
TL (tag line index)
selects which tag combination this
record has from the tag dictionary
in the preservation map. For each
tag in the combination, the tag
encoding map provides the encoding
for its value. Tag values MUST be
decoded and serialized to BAM binary
aux format for storage in the data
slab.
The RG data series is
separate from aux tags. If a read
group is present, the reader MUST
emit an
RG:Z:<id> aux tag
using the read group ID from the
header's
@RG entries. This tag
MUST be included in the aux slab
alongside dictionary-decoded tags.
CRAM index (.crai)
[CRAM3] §12 "Indexing" — CRAI format: gzip-compressed TSV, 6 fields per line (ref_id, alignment_start, alignment_span, container_offset, slice_offset, slice_size)
The .crai file is
gzip-compressed TAB-delimited text.
Each line has 6 fields:
- Reference sequence ID (int)
- Alignment start (int, 1-based matching the container header convention, 0 for unmapped)
- Alignment span (int, 0 for unmapped)
- Container byte offset in file (int, 0-based byte offset from start of file)
- Slice byte offset relative to container data start (int)
- Slice size in bytes (int)
Region queries MUST find all index
entries where
[start, start+span)
overlaps the query region for the
given reference ID. The reader then
seeks to the container byte offset,
skips to the slice offset, and
decodes the slice.
CRAI entries with
alignment_span == 0
(but
alignment_start > 0)
indicate unknown extent — this
occurs when samtools writes CRAM
with embedded references or when the
span is not computed. These entries
MUST be included in query results
whenever
entry.alignment_start <
query_end, since the slice may contain
records at any position past the
start. Treating span=0 as
"zero-width" would
silently drop all records in that
slice.
Multi-ref slices produce multiple index entries (one per reference they span). A query for one reference may hit a multi-ref slice that also contains records from other references. The reader MUST filter records by reference ID after decoding.
Unmapped records (ref ID = -1) have
start=0, span=0. They are only
returned if explicitly queried. For
fetch_into, unmapped
records MUST be skipped (same as
BAM).
Edge cases
If the reference MD5 in the slice
header does not match the MD5 of the
FASTA sequence for that region, the
reader MUST return an error. The MD5
is computed over the uppercase
reference sequence for the exact
range
[alignment_start,
alignment_start +
alignment_span), with no newlines or other
formatting. If the range extends
beyond the reference length, this is
a file/reference mismatch error.
If the FASTA reader cannot provide
the reference sequence for a
slice's region (e.g., contig
not in FASTA), the reader MUST
return an error unless the slice has
an embedded reference. The error
SHOULD mention
REF_PATH/REF_CACHE
as alternatives.
When RN=false in the
preservation map, read names are not
available. The reader MUST log a
warning (once per slice). Records
without read names are stored with
empty qnames. The pileup
engine's overlapping-pair dedup
MUST skip records with empty or
* qnames, since mate
matching requires real qnames.
When CRAM flags bit 0x8 is set, the
read's sequence is unknown
(* in SAM).
seq_len MUST be 0 and
no bases are pushed to the bases
slab.
Slices with 0 records MUST be silently skipped.
Delta-coded positions can produce negative values if records are not strictly sorted. The decoder MUST handle this gracefully (negative deltas are valid for supplementary alignments within a coordinate-sorted CRAM).
Unmapped reads (flag 0x4) may appear
in the last containers of the file,
in slices with
reference_sequence_id =
-1. Their bases are stored in the
BA data series (no
reference reconstruction).
fetch_into MUST skip
them, same as BAM.
PacBio and ONT CRAM files contain
reads of 10,000-1,000,000+ bases.
Read feature lists can have
thousands of entries, and
quality/sequence arrays are
correspondingly large. Buffer
pre-allocation MUST scale with
RL, not assume
short-read sizes.
Query coordinates passed to
fetch_into are u64, but
internal CRAM positions are i32/i64.
When converting
u64::MAX to i64 for
overlap filtering, the cast wraps to
-1, which silently filters out all
records. The reader MUST clamp
coordinates:
end.min(i64::MAX as u64) as
i64.
During sequence reconstruction, if
the reference sequence is shorter
than the positions accessed (i.e.,
ref_offset + ref_pos >=
reference_seq.len()), the reader MUST log a warning on
the first occurrence per
reconstruction and fall back to
b'N'. This
may indicate a reference/CRAM
mismatch but can legitimately occur
near chromosome ends. The warning
MUST be emitted at most once per
record to avoid flooding logs.
The rANS 4x8 frequency table uses
run-length encoding where a symbol
counter sym: u8 is
incremented for each run element.
When sym == 255,
sym += 1 overflows. The
reader MUST use wrapping arithmetic
(wrapping_add(1)) for
this counter — the overflow is
harmless because the loop terminates
before using the overflowed value.
Symbol variables in rANS frequency
table run-length reading MUST use
u8 type, making
overflow past 255 impossible by
construction. After a run-length
loop, sym may have
wrapped to 0 via
wrapping_add(1) when
the previous value was 255. The code
MUST NOT use the post-loop
sym value as a table
index if it has wrapped to 0. The
run-length loop MUST break when
sym would exceed 255.
The rANS
state_step function
computes
f * (s >> bits) + (s
& mask) - g. For valid frequency tables
produced by conforming CRAM writers,
g <= f * (s >> bits) +
(s & mask)
always holds, so the subtraction
does not underflow. The
implementation MUST use
wrapping_sub for
robustness against malformed data in
release mode, with a
debug_assert! to catch
violations during development.
Frequency normalization
(normalize_frequencies)
sums all 256 frequency entries and
doubles the sum in a loop. Both the
sum and the doubling MUST use
checked arithmetic to detect
overflow from malformed frequency
tables. Overflow MUST produce an
error, not silent wraparound.
The uint7 variable-length integer decoding loop MUST be bounded to at most 5 iterations. A u32 can hold at most 32 bits; 5 iterations of 7-bit groups produce 35 bits, which is the maximum meaningful input. More than 5 continuation bytes indicate malformed data and MUST produce an error.
Length fields decoded from ITF8
(e.g., num_content_ids,
num_blocks,
alignment_span) may be
negative when interpreted as i32.
Before using such values as
usize for allocation or
iteration, the reader MUST validate
they are non-negative via
i32::try_from or
equivalent. Negative values MUST
produce an error, not wrap to huge
usize values causing
OOM.
The
TokenReader::get() and
get_mut() methods MUST
return the same reader for every
TokenType variant. In
particular, DZLen MUST
map to dz_len_reader in
both methods. A mismatch causes the
dup-copy path (which uses
get()) to read from the
wrong stream.
The name_count field in
tok3 headers comes from untrusted
data. Allocation based on
name_count MUST be
bounded to a reasonable limit (e.g.,
10,000,000 names). The per-name
token vector MUST be grown
dynamically rather than
pre-allocated to a fixed size, since
the number of token positions varies
per name.
Performance considerations
CRAM decoding is expected to be 2-4× slower than BAM due to codec complexity and reference reconstruction. The I/O savings (40-60% smaller files) often compensate on network filesystems.
CRAM random access is slice-granular, not record-granular. A region query may decompress an entire slice (typically 10,000 records) to extract a few records at the edges. This is inherently coarser than BAM's record-level seeking.
Reference sequence lookups happen per-slice (to reconstruct all records in the slice). The reader SHOULD cache the most recently used reference region to avoid repeated FASTA lookups for consecutive slices on the same contig. This cache MUST be per-fork, not shared across threads.
rANS and arithmetic decoders have higher per-byte CPU cost than zlib. The reader SHOULD pre-allocate decode buffers and reuse them across blocks within a slice.
FASTA Reader
The indexed FASTA reader provides random
access to reference genome sequences. It
reads .fai index files to
locate sequence data within plain or
bgzip-compressed FASTA files, enabling
efficient region queries without scanning
the entire file.
Rastair fetches reference sequences once per
segment (~100 KB) for each worker thread.
The forkable design shares the parsed index
across threads while giving each thread its
own file handle, matching the
IndexedBamReader pattern.
Sources: The FASTA format and FAI index are not covered by any hts-specs document. The FAI format is defined by samtools (
samtools faidx); the byte offset formula and field layout are described in thesamtools faidxman page and implemented in [htslib]. The GZI index format is a htslib convention for bgzip-compressed random-access files, implemented in [htslib]bgzf.c. BGZF decompression follows [SAM1] §4.1. The forking and buffer-reuse design are seqair-specific. See references.md.
FAI Index
The FAI index (.fai file) is a
tab-delimited text file with one line per
sequence. It enables O(1) lookup of any base
position within the FASTA file by recording
where each sequence's data starts and how
lines are formatted.
Parsing
The FAI parser MUST read a
.fai file and produce
an index with one entry per
sequence. Each entry MUST contain:
sequence name (String),
length (number of bases,
u64), byte offset of
the first base in the file
(u64), bases per line
(u64), and bytes per
line including line terminator
(u64). The parser MUST
NOT trim whitespace from field
values — tabs are the only
delimiter, and sequence names may
contain spaces.
Each line MUST have exactly 5 tab-separated fields: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH. Lines with fewer or more fields MUST produce an error identifying the malformed line. Empty lines MUST be skipped.
The parser MUST reject entries where
linewidth < linebases
(line terminator would be negative),
linebases == 0 (would
cause division by zero in offset
calculations), or
length == 0 (empty
sequences are not useful for region
queries). Note:
linewidth == linebases
is valid — it means lines have no
trailing newline character, which
occurs when a sequence is stored on
a single line without a mid-sequence
newline.
The index MUST support O(1) lookup of a sequence entry by name. Duplicate sequence names MUST produce an error at parse time.
Location
For a FASTA file at path
P, the reader MUST look
for the index at
P.fai (i.e., appending
.fai to the full path
including any
.gz extension). This
matches samtools/htslib convention:
ref.fa →
ref.fa.fai,
ref.fa.gz →
ref.fa.gz.fai.
When the FAI index file is not
found, the reader MUST produce a
clear error message that names the
expected index path and suggests
running
samtools faidx
<fasta_path>
to create it. The reader MUST NOT
attempt to build the index itself.
Rationale: auto-indexing hides user
errors (wrong FASTA passed
silently), fails on read-only shared
reference directories (common on HPC
clusters), introduces
non-deterministic file creation and
race conditions, and takes 10–30s
for a human genome which makes the
tool appear to hang with no
feedback.
Byte offset calculation
To find the byte offset of base at
0-based position
pos within a sequence,
the reader MUST compute:
offset + (pos / linebases) *
linewidth + (pos %
linebases), where offset,
linebases, and
linewidth come from the
FAI entry. This accounts for newline
characters inserted at line
boundaries.
The offset formula is correct for all valid
positions including those on the last line
of a sequence, even when the last line is
shorter than linebases. This
works because integer division gives the
correct line number regardless of last-line
length. The raw read range
[byte_offset(start),
byte_offset(last_pos) + 1)
may include newline bytes from line
boundaries; these are stripped after
reading.
Sequence fetching
fetch_seq(name, start,
stop)
MUST use 0-based half-open
coordinates
[start, stop). A
request for
(name, 0, 4) returns
the first 4 bases of the named
sequence.
The reader MUST verify that
start < stop
(zero-length and reversed ranges are
invalid) and
stop <=
sequence_length. Out-of-bounds requests MUST
produce an error naming the
sequence, the requested range, and
the actual sequence length.
Returned sequence bytes MUST be uppercased. FASTA files may contain lowercase bases (indicating soft-masking), but Seqair always works with uppercase bases.
When reading from the file, the
reader MUST strip line terminators
(LF \n and CR
\r) from the raw bytes.
The returned sequence MUST contain
only base characters.
When the requested sequence name is not in the index, the reader MUST produce an error listing the requested name. It SHOULD include available sequence names if the index is small (fewer than 20 entries) to help diagnose typos.
Raw byte return type
fetch_seq and
fetch_seq_into MUST
return raw uppercase ASCII bytes
(Vec<u8> /
&mut Vec<u8>), not Base values.
CRAM reference MD5 verification
computes checksums on the exact byte
sequence from the FASTA file —
converting to Base and
back is lossy because IUPAC
ambiguity codes (M, R, S, etc.)
collapse to Unknown.
Any u8→Base conversion MUST happen
downstream, after CRAM verification
is complete.
Buffer reuse
fetch_seq_into(name, start,
stop, buf)
MUST accept a caller-provided
&mut Vec<u8>
buffer, clear it, and write the
fetched sequence into it. This
avoids per-call allocation when
fetching many segments. The
fetch_seq convenience
method MAY allocate a new
Vec<u8>
internally.
Plain FASTA reading
For uncompressed FASTA files, the reader MUST seek to the computed byte offset and read the required byte range in a single I/O operation, then strip newlines and uppercase the result.
The raw read size MUST account for
embedded newlines. For a fetch of
n bases starting at
position pos, the
number of raw bytes to read is:
end_byte_offset -
start_byte_offset
where both offsets are computed via
fasta.index.offset_calculation.
Bgzip FASTA reading
The reader MUST auto-detect bgzip
compression by reading the first 18
bytes of the file and checking for:
(1) gzip magic bytes
1f 8b, (2) DEFLATE
compression method 08,
(3) FEXTRA flag 04, and
(4) the BC subfield
identifier at bytes 12–13. All four
checks MUST pass to identify a file
as BGZF. Plain gzip files with the
FEXTRA flag but without the BC
subfield MUST NOT be misidentified
as BGZF.
For bgzip FASTA files, a GZI index
(.gzi) MUST be present
alongside the FASTA. The GZI index
maps compressed block offsets to
uncompressed byte offsets, enabling
random access into the compressed
data.
When the GZI index is not found for
a bgzip FASTA, the reader MUST
produce a clear error naming the
expected GZI path and suggesting
samtools faidx
<fasta_path>
(which creates both
.fai and
.gzi).
GZI index format
The GZI parser MUST read a binary
file containing: an 8-byte
little-endian entry count
(u64), followed by that
many pairs of 8-byte little-endian
values:
(compressed_offset: u64,
uncompressed_offset: u64). Each pair records where a BGZF
block starts in the compressed file
and what uncompressed byte offset
that corresponds to. The entry count
multiplication MUST use checked
arithmetic to prevent overflow on
corrupted files.
GZI entries MUST be sorted in
strictly increasing order by
uncompressed_offset.
The parser MUST reject files with
unsorted or duplicate entries, as
translate() relies on
binary search.
Given an uncompressed byte offset,
the GZI index MUST be able to find
the BGZF block that contains it:
find the last entry whose
uncompressed_offset <=
target, then the compressed block starts
at that entry's
compressed_offset, and
the within-block offset is
target -
uncompressed_offset.
The within-block offset MUST fit in
u16 (0–65535), matching
the BGZF specification's
maximum uncompressed block size of
65,536 bytes. If the computed
within-block offset exceeds
u16::MAX, the reader
MUST return an error indicating the
GZI index may be corrupt or missing
entries. The reader MUST NOT
silently truncate the offset.
Decompression
The reader MUST decompress BGZF blocks using the same libdeflater-based decompression as the BAM reader, including CRC32 verification.
For a fetch spanning multiple BGZF blocks, the reader MUST read and decompress blocks sequentially from the starting block until all requested bytes are obtained. Since FASTA fetches are contiguous byte ranges (unlike BAM which has disjoint index chunks), a simple sequential scan suffices.
Forking
fork() MUST produce a
new reader that shares the same
Arc as the source (no
re-parsing) and opens a fresh
File handle. Each fork
MUST own its own read buffer.
A forked reader MUST produce byte-identical results to independently opening the same FASTA file.
Forked readers MUST be fully independent for mutable operations. Seeking on one MUST NOT affect any other.
Error handling
All FASTA-specific errors MUST
contain context (file paths,
sequence names, positions)
sufficient to diagnose the problem
without a debugger. The
implementation MUST NOT use
panic!,
unwrap(),
expect(), or
unreachable!() — all
error paths MUST propagate errors
via Result.
SAM Reader
A bgzf-compressed, indexed SAM reader for
region-based random access. Plain
(uncompressed) SAM files cannot be indexed
and are NOT supported — users must compress
with bgzip first.
Sources: [SAM1] §1 "The SAM Format Specification" — SAM text format, mandatory fields, FLAG bits, CIGAR string, header lines (@HD/@SQ/@RG/@PG); §1.3 "The header section"; §1.4 "The alignment section: mandatory fields". [TABIX] — TBI index format and SAM column configuration. [SAM1] §4.1 — BGZF decompression (shared with BAM). The coordinate conversion (1-based → 0-based) and aux-to-BAM-binary serialization are seqair-specific. See references.md.
Context
SAM is the text representation of the same
data BAM encodes in binary. A
bgzf-compressed SAM file
(.sam.gz) wraps SAM text in
BGZF blocks, enabling random access via
tabix (.tbi) or CSI
(.csi) indexes. htslib reads
these transparently; seqair needs to support
them for backward compatibility after
removing the htslib reading path.
The key difference from BAM: records are text lines that must be parsed field-by-field, and coordinates are 1-based (BAM is 0-based). The BGZF and index layers are shared with BAM.
Opening
The SAM reader MUST open a
bgzf-compressed SAM file, parse its
header, and locate a tabix
(.tbi) or CSI
(.csi) index file. If
neither index exists, return an
error suggesting
samtools index or
tabix -p sam.
If the file does not start with BGZF
magic (1f 8b with
BC subfield), the
reader MUST reject it with an error
explaining that plain SAM files must
be bgzf-compressed for indexed
access (bgzip file.sam).
Header parsing
The SAM header consists of all lines
starting with @ at the
beginning of the file. The reader
MUST:
-
Read lines from the BGZF stream until a
line does NOT start with
@. - Record the BGZF virtual offset of the first alignment line (for seeking past the header).
-
Parse
@SQlines to extract target names (SNtag) and lengths (LNtag). -
Construct a
BamHeaderfrom the parsed text viaBamHeader::from_sam_text().
Lines starting with @ that
appear after the first alignment line are
NOT header continuation — they are malformed
alignment lines and MUST be treated as parse
errors.
If no @SQ lines are
present, the reader MUST return an
error. Without
@SQ lines, target
name→tid mapping is impossible and
fetch_into cannot work.
The full header text (all
@ lines joined with
newlines) MUST be stored in
BamHeader::header_text
for downstream use (e.g.,
constructing htslib
Header for BAM
writing).
Alignment line parsing
[SAM1] §1.4 "The alignment section: mandatory fields" — 11 mandatory fields: QNAME, FLAG, RNAME, POS (1-based), MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL; QUAL Phred+33 encoding
Each alignment line MUST be split on TAB characters into the 11 mandatory fields plus optional tag fields. The parser MUST handle:
| Field | SAM type | Conversion to internal representation |
|---|---|---|
| QNAME | string | stored as bytes in name slab |
| FLAG | integer | u16 (identical to BAM flags) |
| RNAME | string |
converted to tid via
header's
name_to_tid map
|
| POS | 1-based integer | subtract 1 for 0-based internal pos (i64) |
| MAPQ | integer | u8 (255 means unavailable) |
| CIGAR | string | parsed to packed BAM CIGAR ops (u32 per op: `len << 4 |
| RNEXT | string |
= means same as
RNAME, * means
unavailable — not stored in
RecordStore
|
| PNEXT | integer | not stored in RecordStore (mate info) |
| TLEN | integer | not stored in RecordStore (template length) |
| SEQ | string |
each ASCII char decoded to
Base enum
|
| QUAL | string |
each char minus 33 for Phred
score; * means all
0xFF
|
The parser MUST validate each field and produce clear error messages for malformed records: non-integer POS, FLAG > 65535, negative MAPQ, etc. Real-world SAM files from non-standard tools may contain hand-edited or corrupted records.
SAM POS is 1-based; the internal representation is 0-based (matching BAM). The parser MUST subtract 1 from POS. A POS of 0 in SAM means unmapped — after subtraction this becomes -1 (as i64). Unmapped records are filtered by FLAG 0x4 before reaching this point, so -1 positions should not appear in the RecordStore.
The CIGAR string MUST be parsed into
the same packed u32 format used by
BAM: each operation is
(length << 4) |
op_code, where op_codes are M=0, I=1, D=2,
N=3, S=4, H=5, P=6, =7, X=8. The
string * means CIGAR
unavailable (0 operations, and
end_pos = pos).
SEQ characters MUST be decoded to
Base values: A→A, C→C,
G→G, T→T, all others (N, IUPAC
codes)→Unknown. The string
* means sequence
unavailable (length 0).
Note: the SAM character = means
"identical to the reference base at this
position." This is an intentional lossy
conversion to Unknown —
matching BAM's behavior, which also stores
= as a distinct 4-bit code but
does not resolve it against the reference.
Some tools (Illumina DRAGEN) use
= extensively. If exact
reference-matching base recovery is needed
in the future, the = case would
need reference lookup during parsing.
QUAL characters MUST have 33
subtracted (Phred+33 encoding). The
string * means quality
unavailable; all quality bytes MUST
be set to 0xFF (255). QUAL length
MUST equal SEQ length when both are
present.
Optional fields after the 11th
column follow the format
TAG:TYPE:VALUE. They
MUST be serialized to BAM binary aux
format for storage in the data slab:
-
A(char):[tag0, tag1, 'A', char] -
i(integer):[tag0, tag1, type, bytes...]where type is the smallest fitting BAM integer type:-
[-128, 127] →
c(int8) -
[128, 255] →
C(uint8) -
[-32768, -129] or [256, 32767] →
s(int16) -
[32768, 65535] →
S(uint16) -
[-2147483648, -32769] or [65536,
2147483647] →
i(int32) -
[2147483648, 4294967295] →
I(uint32)
-
[-128, 127] →
-
f(float):[tag0, tag1, 'f', le_f32_bytes] -
Z(string):[tag0, tag1, 'Z', string_bytes, 0x00] -
H(hex):[tag0, tag1, 'H', hex_bytes, 0x00] -
B(array):[tag0, tag1, 'B', subtype, count_le32, values...]
Note: the integer type selected here may
differ from the type used by the original
BAM writer, since SAM text loses the
specific width information. This is an
inherent limitation — see
r[unified.push_fields_equivalence].
Malformed aux tag values MUST return
an error, not silently default to
zero. For integer (i)
typed tags, if the value cannot be
parsed as an integer, the reader
MUST return a
SamRecordError::InvalidAuxValue
error.
SAM aux integer values (type
i) are serialized into
the smallest BAM integer type that
fits, using the ranges defined in
r[sam.record.aux_tags]. Values that exceed the u32 range
(i.e.,
val > u32::MAX or
val < i32::MIN when
interpreted as i64) cannot be
represented in any BAM integer type
and MUST return a
SamRecordError::AuxIntOutOfRange
error instead of silently
truncating.
Region fetching
fetch_into(tid, start, end,
store)
MUST:
- Query the tabix/CSI index for BGZF virtual offset ranges overlapping the region.
-
Use
RegionBuf(or equivalent bulk read) to load compressed bytes. - Decompress BGZF blocks to text.
-
Split text into lines (on
\n). - Parse each alignment line.
-
Filter: skip lines where RNAME's
tid doesn't match, or where the
record doesn't overlap
[start, end]. - Skip unmapped reads (FLAG 0x4).
- Push passing records into the RecordStore.
The overlap filter uses half-open
intervals (0-based).
end_pos MUST be
computed from the parsed CIGAR. When
CIGAR is
* (unavailable),
end_pos = pos (point
record).
A record overlaps
[start, end) iff
pos < end && end_pos
> start. Records where
pos == end or
end_pos == start do NOT
overlap and MUST be filtered out.
Records MUST be added to the store in coordinate-sorted order. Since bgzf-compressed indexed SAM files are coordinate-sorted by construction (indexing requires it), the natural parse order is correct.
Tabix / CSI index
[TABIX] — TBI magic
TBI\x01, header fields, n_ref, bin/chunk/linear-index structure, format=1 for SAM [CSI] — CSI magicCSI\x01, min_shift, depth, l_aux, per-bin loffset
The reader MUST support tabix
(.tbi) indexes. Tabix
is a generic index for
bgzf-compressed, TAB-delimited
files. For SAM, the format code is
1 (not
0/generic). The column
configuration (seq_col=3, beg_col=4,
end_col=0) is implicit for format=1
and may be ignored by some
implementations; the reader SHOULD
NOT reject a tabix file based on
stored column values when format=1
is present.
Tabix files are BGZF-compressed and MUST be decompressed before parsing.
The reader SHOULD also support CSI
(.csi) indexes, which
generalize the binning scheme with
configurable
min_shift and
depth parameters. CSI
supports references longer than 512
Mbp (BAI's limit). CSI files
are BGZF-compressed and MUST be
decompressed before parsing.
Not yet implemented
— see r[csi.*] rules in
tabix_index.md.
Index file lookup order:
<path>.tbi,
<path>.bai. If
neither exists, return a clear
error. CSI (.csi)
lookup should be added when CSI
support is implemented.
Edge cases
When SEQ is *, the
record has no sequence data.
seq_len MUST be 0 and
no bases are pushed to the bases
slab. This is valid per the SAM spec
for secondary alignments that omit
sequence.
When QUAL is *, all
quality bytes MUST be 0xFF. This
matches BAM's representation of
unavailable quality.
BAM encodes CIGARs with >65535
operations using a
CG:B:I aux tag and a
placeholder CIGAR. SAM has no such
limitation — the CIGAR string can be
arbitrarily long. The parser MUST
handle CIGARs of any length.
Blank lines (empty or whitespace-only) between alignment records MUST be silently skipped.
RNAME of * means
unmapped. Combined with FLAG 0x4,
these records MUST be filtered out
by fetch_into (same as
BAM's unmapped filtering).
RNAME * without FLAG
0x4 is malformed but SHOULD be
treated as unmapped.
A single SAM alignment line may span
multiple BGZF blocks. The reader
MUST buffer partial lines across
block boundaries and only parse
complete lines (terminated by
\n). The parser MUST
also strip \r at line
boundaries to handle
\r\n (Windows-style)
line endings that may appear in
files from mixed-platform pipelines.
Integer parsing
Integer parsing functions
(parse_u8,
parse_u16,
parse_u32) MUST reject
empty input by returning
None. An empty byte
slice is not a valid integer
representation and MUST NOT silently
produce zero.
Performance considerations
Like BAM, the SAM reader SHOULD use
RegionBuf-style bulk
I/O: read all compressed bytes for
the region in one large I/O
operation, then decompress and parse
from memory. This is critical for
NFS/Lustre performance.
Text parsing is inherently slower than BAM binary decoding. The SAM reader is expected to be 2-5× slower than BAM for the same data due to text parsing overhead and larger compressed size. This is acceptable since SAM is not the recommended input format. Implementation guidance:
-
Use
memchrfor fast TAB and newline scanning. - Avoid allocations during parsing (reuse line buffers).
-
Parse integers without going through
String(direct byte-to-int conversion). - Decode SEQ characters to Base with a lookup table (not per-char branching).
Tabix and CSI Index
Tabix (.tbi) and CSI
(.csi) are generic indexes for
bgzf-compressed, coordinate-sorted,
TAB-delimited files. They enable
region-based random access via BGZF virtual
offsets. seqair needs these for
bgzf-compressed SAM files.
Sources: [TABIX] — full TBI format specification (magic, header, bins, chunks, linear index, n_no_coor, reg2bin/reg2bins). [CSI] — full CSI format specification (magic, min_shift, depth, l_aux, per-bin loffset, parameterised reg2bin/reg2bins). [SAM1] §5.3 — reg2bin/reg2bins C code for BAI-compatible binning parameters (min_shift=14, depth=5). See references.md.
Context
BAM uses BAI indexes, which are a specialized form of the binning scheme. Tabix and CSI generalize this:
-
Tabix
(
.tbi): fixed parameters (min_shift=14, depth=5), identical binning to BAI. Includes column configuration for the text format. -
CSI
(
.csi): configurable min_shift and depth, supporting references longer than 512 Mbp.
Both use the same bin/chunk/virtual-offset query mechanism as BAI.
Implementation strategy:
Tabix with format=SAM uses identical binning
parameters to BAI (min_shift=14, depth=5).
The index data after the tabix-specific
header is structurally identical to BAI
(bins, chunks, linear index). The initial
implementation SHOULD reuse the existing
BamIndex parsing and query code
by reading past the tabix header and
delegating to the BAI parser. CSI support
(parameterized binning, per-bin
loff, no linear index) requires
a separate parser and is deferred.
File compression
Tabix (.tbi) files are
BGZF-compressed. The parser MUST
decompress via BGZF before parsing
the binary format. Reading raw bytes
without decompression produces
garbage.
CSI (.csi) files are
BGZF-compressed. Same decompression
requirement as tabix.
Tabix format
[TABIX] — TBI format table: magic, n_ref, format, col_seq, col_beg, col_end, meta, skip, l_nm, names; bin/chunk/interval lists; n_no_coor
Tabix files start with magic bytes
TBI\x01 (4 bytes),
after BGZF decompression.
After magic:
-
n_ref(i32): number of reference sequences -
format(i32): 0=generic, 1=SAM, 2=VCF. For SAM files, the format code is 1 (not 0). -
col_seq(i32): column for sequence name (1-based; 3 for SAM) -
col_beg(i32): column for begin position (1-based; 4 for SAM) -
col_end(i32): column for end position (0 = computed from begin) -
meta(i32): comment/header prefix character (64 =@for SAM) -
skip(i32): number of header lines to skip (0 for SAM; headers detected bymetaprefix) -
l_nm(i32): length of concatenated sequence names -
names(bytes): null-terminated sequence names, concatenated
Note: when format=1 (SAM),
htslib hard-codes the column configuration
internally. The stored
col_seq/col_beg/col_end values may be
ignored by some implementations. The reader
SHOULD NOT reject a tabix file based on
stored column values when format=1.
Since tabix uses identical binning
parameters to BAI (min_shift=14,
depth=5), the index data after the
tabix header MUST be parsed by
reusing
BamIndex::from_reader()
(or equivalent). This avoids
duplicating the
bin/chunk/linear-index parsing code.
The tabix header is read and
validated first, then the remaining
bytes are handed to the BAI parser.
After the header, the index data follows the same structure as BAI:
-
For each reference:
-
n_bin(i32): number of bins -
For each bin:
bin_id(u32),n_chunk(i32), chunks as virtual offset pairs -
n_intv(i32): number of linear index entries - Linear index: virtual offsets at 16 kbp intervals
-
After all reference data, an optional 8-byte
n_no_coor field (u64) may be
present (number of unmapped reads with no
coordinate). Its presence is detected by
whether 8 more bytes remain.
Bin ID 37450 is a pseudo-bin that stores metadata (virtual offset range of all records for a reference). Its chunks MUST NOT be included in query results. The reader MUST recognize and skip this bin during region queries.
Region queries use the same
reg2bins algorithm as
BAI (with min_shift=14, depth=5).
The bin offset for level
l (counting from root,
l=0=root) is
((1 << 3*l) - 1) /
7, and the bin ID for position
p at level
l is:
((1 << 3*l) - 1) / 7 + (p
>> (min_shift + 3*(depth -
l)))
For BAI/tabix with min_shift=14, depth=5:
- Level 0 (root): offset=0, shift=29. Bin 0 covers the entire reference.
- Level 5 (leaf): offset=4681, shift=14. Each bin covers 16 kbp.
CSI format
[CSI] — CSI format table: magic
CSI\x01, min_shift, depth, l_aux, n_ref; per-bin loffset replacing linear index; parameterised reg2bin/reg2bins/bin_limit functions
CSI files start with magic bytes
CSI\x01 (4 bytes),
after BGZF decompression.
After magic:
-
min_shift(i32): minimum shift for binning (BAI/tabix use 14) -
depth(i32): number of binning levels (BAI/tabix use 5) -
l_aux(i32): length of auxiliary data -
aux(bytes): auxiliary data (for SAM: same column configuration as tabix header)
Note: CSI support (r[csi.*] requirements below) is deferred. The initial implementation supports tabix only, which covers the common case of bgzf-compressed indexed SAM files. CSI adds support for references > 512 Mbp and is needed for some plant/polyploid BAM files.
CSI generalizes the BAI binning
scheme with configurable
min_shift and
depth. The bin ID
formula is the same as tabix/BAI but
parameterized:
bin_id(level, pos) = ((1 <<
3*level) - 1) / 7 + (pos >>
(min_shift + 3*(depth - level)))
where level counts from root
(0) to leaf (depth).
The reg2bins function MUST be
parameterized by min_shift and depth rather
than using a hard-coded levels array. The
existing BAM
reg2bins implementation
hard-codes
[(1,26), (9,23), (73,20), (585,17),
(4681,14)]
— this must be refactored to compute these
values dynamically.
CSI bins have a different structure from BAI/tabix bins:
bin_id(u32)-
loff(u64): minimum virtual offset for this bin — replaces BAI's linear index n_chunk(i32)- chunks: virtual offset pairs
The loff field per-bin replaces
the separate linear index that BAI/tabix
use. During queries, chunks with
end <= loff of the leftmost
overlapping bin SHOULD be skipped (same
optimization as BAI's linear index).
CSI does NOT have a separate linear
index array. The per-bin
loff field serves the
same purpose. The parser MUST NOT
attempt to read a linear index after
bins.
The CSI pseudo-bin ID is
((1 << (depth*3 + 3)) -
1) / 7. For depth=5 this equals 37450
(same as BAI). Its chunks MUST NOT
be included in query results.
CSI queries produce the same output
as BAI/tabix queries: a list of
virtual offset chunk ranges to scan.
RegionBuf::load() can
be reused unchanged since it
operates on
Vec<Chunk>.
Shared query infrastructure
Edge cases
References longer than 512 Mbp require CSI (BAI/tabix with min_shift=14, depth=5 cannot represent them). The reader MUST support CSI indexes for BAM files too (not just SAM), since some BAM files from plant/polyploid genomes use CSI.
A reference with no records has no bins and no linear index entries. Queries for such references MUST return an empty chunk list (not an error).
Tabix stores sequence names in the
index header. These MUST match the
names in the SAM header's
@SQ lines. A mismatch
SHOULD produce a warning (the index
may be stale). CSI does NOT store
sequence names — this check applies
only to tabix.
Record Store
Sources: The slab-based RecordStore design is seqair-specific. The data it stores is defined by [SAM1] §4.2 (BAM record fields, CIGAR packed u32 format, quality Phred encoding, aux tag binary format). See references.md.
Background
When loading a region from any alignment
format (BAM, SAM.gz, or CRAM), the reader
decodes hundreds to thousands of records.
Each record has fixed-size fields (position,
flags, quality) and variable-length fields
(read name, CIGAR, sequence, quality scores,
auxiliary tags). The original design wrapped
each record in
Rc<BamRecord> with 5
separate heap allocations for the
variable-length fields, plus the Rc
allocation itself — 6 allocations per
record.
The record store replaces this with a slab-based design: all variable-length data is packed into contiguous buffers (slabs), and fixed fields are stored in a compact struct that references the slabs by offset. This eliminates all per-record heap allocation.
Layout
The store contains four vectors:
-
Record table
(
Vec<SlimRecord>): compact fixed-size structs with alignment fields and offsets into the slabs. -
Name slab
(
Vec<u8>): all read names (qnames) packed sequentially. Separated because qnames are only accessed during dedup mate detection — a linear scan of a compact, contiguous buffer. -
Bases slab
(
Vec<Base>): decoded base sequences stored asBaseenum values (A/C/G/T/Unknown). For BAM, decoded from 4-bit nibbles via SIMD at push time. For SAM/CRAM, provided directly as&[Base]by the reader. -
Data slab
(
Vec<u8>): packed per-record as[cigar | qual | aux]— CIGAR in BAM packed u32 format, quality as raw Phred bytes, aux tags in BAM binary format.
Record table: [SlimRecord₀] [SlimRecord₁] [SlimRecord₂] ...
Name slab: [qname₀|qname₁|qname₂|...]
Bases slab: [bases₀|bases₁|bases₂|...]
Data slab: [cigar₀|qual₀|aux₀|cigar₁|qual₁|aux₁|...]
Capacity estimation
On construction, the store MUST
pre-allocate all three vectors with
estimated capacities to avoid
reallocation during
push_raw. Estimates
SHOULD be based on the number of
compressed bytes loaded for the
region: a typical BAM compression
ratio of ~3:1, an average record
size of ~400 bytes, ~25 bytes per
read name, and ~375 bytes per record
of non-name data. The store MUST
accept a byte-count hint for this
purpose.
Decoding into slabs
push_raw(raw_bytes)
MUST decode a BAM record's
fixed fields and append its
variable-length data directly into
the slabs, without allocating
intermediate
Box<[u8]> or
Vec<u8> per
field. The 4-bit packed sequence
MUST be decoded (via SIMD when
available) to Base enum
values in the bases slab.
Slab offsets (name_off, bases_off,
data_off) MUST be checked for u32
overflow before storing. If any slab
exceeds u32::MAX bytes,
the store MUST return an error
rather than silently truncating the
offset.
push_fields(...) MUST
accept pre-parsed record fields for
SAM and CRAM readers: pos, end_pos,
flags, mapq, matching_bases,
indel_bases, qname bytes, CIGAR as
packed BAM-format u32 ops, sequence
as &[Base], quality
bytes, and aux tag bytes in BAM
binary format. This avoids encoding
to BAM binary only to immediately
decode it again. SAM and CRAM
parsers convert their native
representations to these types
before pushing.
Field access
The store MUST provide methods to
access each variable-length field
for a given record index:
qname(idx),
cigar(idx),
seq(idx),
seq_at(idx, pos),
qual(idx),
aux(idx). These return
borrowed slices into the slabs.
Region lifecycle
Clearing the store for a new region MUST retain the allocated capacity of all four vectors, so that subsequent regions reuse the same memory without reallocation.
All records in the store MUST remain valid and accessible until the store is cleared.
Integration
The pileup engine MUST NOT use
Rc for record sharing.
PileupAlignment carries
pre-extracted flat fields (base,
qual, mapq, flags, etc.), and
ActiveRecord stores a
record index into the store. The
store is either owned by or borrowed
by the engine for the duration of
iteration.
Region Buffer
Sources: The RegionBuf design is seqair-specific (bulk I/O optimisation for cluster storage). BGZF block decompression follows [SAM1] §4.1. Virtual offsets are defined in [SAM1] §4.1 "Random access". Index chunks come from [SAM1] §5.2 (BAI) or [TABIX] (TBI). See references.md.
Background
When reading a BGZF-compressed file (BAM or
bgzf-compressed SAM), the reader doesn't
decompress the entire file. Instead, it uses
an index (.bai for BAM,
.tbi for SAM.gz) to find which
compressed byte ranges — called
chunks — contain reads
overlapping the requested genomic region.
Each chunk is a pair of virtual offsets (see
bgzf.md) that point to a start
and end position in the compressed file.
A typical region query returns a handful of chunks. The standard approach is to seek to each chunk's start, then read and decompress BGZF blocks one at a time until reaching the chunk's end. Each block read involves a header read (~18 bytes) followed by a compressed data read (~20–40 KB), then decompression.
On local SSDs this works fine — seek + read latency is microseconds. On cluster storage (NFS, Lustre, GPFS), each I/O operation can take milliseconds due to network round-trips. A region with 200 BGZF blocks means 200+ network round-trips, which dominates processing time.
Approach
RegionBuf solves this by
pre-fetching all compressed bytes for a
region into RAM in one shot, then decompressing from the in-memory
buffer. The I/O pattern changes from
"hundreds of small reads" to "one large
sequential read" per region. Since Rastair
processes BAM segments in parallel (one
thread per segment), each thread creates its
own RegionBuf with no
contention.
The flow:
- Query the BAM index → get a list of chunks (compressed byte ranges)
- Merge overlapping/adjacent chunks into larger contiguous ranges
-
Read each merged range with one
seek+readinto aVec<u8> - Decompress BGZF blocks from the in-memory buffer (zero network I/O)
Chunk merging
The BAM index may return multiple chunks whose compressed byte ranges overlap or are adjacent. Reading them separately would cause redundant I/O and seeking.
Before reading, index chunks MUST be merged into non-overlapping byte ranges sorted by file offset. Overlapping or adjacent chunks MUST be coalesced into a single range. The end of each range MUST extend past the last chunk's end block offset by at least one maximum BGZF block size (64 KiB) to ensure the final block is fully included in the buffer.
RegionBuf MUST NOT
include bin 0 chunks in its bulk
read. Bin 0 chunks are at distant
file offsets and contain very few
reads; including them would cause an
expensive additional seek. Bin 0
records are handled separately via
the bin 0 cache (see
bam_index.md).
Loading
RegionBuf::load MUST
accept a seekable reader and a slice
of index chunks. It MUST merge
chunks, then perform one
seek +
read_exact per merged
range to load all compressed bytes
into a contiguous in-memory buffer.
On typical BAM files this results in
a single read of a few hundred KB to
a few MB per region.
When given zero chunks,
load MUST return an
empty buffer that immediately
signals EOF on any read.
Seeking
After loading, the caller iterates through the original (unmerged) chunks. For each chunk, it seeks to the chunk's start virtual offset within the pre-loaded buffer and reads records until reaching the chunk's end.
The region buffer MUST support seeking to a virtual offset within the pre-loaded data. Since disjoint ranges are concatenated, the buffer MUST maintain a range map that translates file offsets to buffer positions. A seek to a file offset that falls within any loaded range MUST succeed; a seek to an offset in a gap between ranges or before/after the loaded data MUST return an error.
Block decompression
The in-memory buffer contains raw BGZF-compressed bytes, exactly as they appear on disk. Decompression follows the same block-by-block process as the file-based reader, but reads from RAM instead of the filesystem.
Block decompression from the in-memory buffer MUST follow the same BGZF format rules as the file-based reader: magic validation, BSIZE extraction, DEFLATE decompression, and CRC32 verification.
The fast-path header parsing (XLEN=6, BC at fixed offset) MUST be used when applicable, with fallback to searching extra fields.
Reading
The region buffer MUST support
reading an exact number of bytes,
transparently crossing block
boundaries, with the same API as
BgzfReader::read_exact_into.
The Drop implementation
MUST never panic. Gap-size
calculations between ranges MUST use
saturating arithmetic to handle
overlapping or malformed range
boundaries safely.
The region buffer MUST track virtual offsets correctly: the block offset corresponds to the file position of the current block (not the buffer position), so that virtual offset comparisons with the BAM index chunk boundaries work correctly.
Integration
Both
IndexedBamReader::fetch_into
and
IndexedSamReader::fetch_into
MUST use RegionBuf to
bulk-read compressed data before
record decoding. The file-based
BgzfReader is retained
for header parsing and other
sequential access at file open time.
CRAM does NOT use
RegionBuf. CRAM
containers are not BGZF-compressed —
the reader seeks to container byte
offsets directly via the CRAI index
and reads container data with plain
File::read_exact.
Block-level decompression (gzip,
rANS, etc.) happens after reading
the container into memory.
Overlapping Pair Deduplication
In paired-end sequencing, two reads (mates) are generated from opposite ends of the same DNA fragment. When the fragment is shorter than twice the read length, the mates overlap — both reads cover the same genomic positions in the middle. Without deduplication, these overlapping positions would be counted twice in the pileup, inflating the apparent coverage and biasing variant allele frequencies.
Sources: Seqair-specific design with no upstream spec counterpart. Mate detection relies on QNAME matching as defined in [SAM1] §1.4. The first-in-template / second-in-template distinction uses FLAG bits 0x40 and 0x80 defined in [SAM1] §1.4. See references.md.
For example, with 150bp reads from a 200bp fragment, positions 50–150 are covered by both mates. At each of these positions, the pileup would show two observations from the same molecule. Deduplication removes one mate at each overlapping position, keeping the observation count honest.
Opt-in behavior
Overlapping pair deduplication MUST
be opt-in via
set_dedup_overlapping(). When not enabled, the pileup
engine MUST yield all alignments
without deduplication.
Mate detection
Mates are identified by sharing the same read name (qname). The engine pre-computes a mate lookup table when deduplication is enabled, so the per-position dedup check is a fast hash lookup rather than a string comparison.
Mates are detected by matching
qnames. When
set_dedup_overlapping()
is called, the engine MUST build a
mate index that maps each record to
its mate's index (if both are
in the arena). Records with no mate
in the arena are unaffected.
Only the first two records sharing a qname are treated as mates. If three or more records share the same qname (supplementary alignments, etc.), only the first pair is linked; additional records are treated as unpaired for dedup purposes.
Resolution rule
When both mates are present at a position, one must be removed. The rule depends on whether the mates agree on the base call:
When both mates show the same base at a position, the first encountered (by arena order) is kept and the other is removed.
When mates show different bases at a position, the first-in-template read is kept and the second-in-template read is removed. This preserves the primary observation.
Scope
Deduplication MUST be applied per-position, not globally. A read that is removed at one overlapping position MUST still appear at positions where its mate is not active. Both mates contribute normally outside their overlap region.
Non-interference
Deduplication MUST be applied after the read filter and after qpos computation. A read that was excluded by the filter MUST NOT participate in dedup. A read that has no qpos at a position (deletion/skip) MUST NOT participate in dedup at that position.
Deduplication MUST be applied before max_depth truncation. The depth after dedup may be below max_depth even if the raw depth exceeded it.
Pileup Engine
A pileup is a column-by-column view of all reads aligned to a reference region. At each reference position, the pileup shows which reads are present and what base each read has at that position. This is the fundamental data structure for variant calling: to determine whether position X has a mutation, you look at the pileup column at X and count how many reads show the reference base vs. an alternate base.
The pileup engine takes a set of BAM records
(stored in a RecordStore) and
iterates over reference positions from left
to right, maintaining an "active set" of
reads that overlap the current position. At
each position, it yields a
PileupColumn containing the
aligned reads with their query positions
(the position within each read that
corresponds to this reference position).
Sources: The pileup algorithm is seqair-specific design, verified for correctness against [htslib]
bam_plp_auto. FLAG bits referenced here (0x4 unmapped, 0x10 reverse strand) are defined in [SAM1] §1.4. CIGAR consume semantics (consumes query/reference) are from [SAM1] §1.4. See references.md.
Column iteration
The engine MUST iterate over every
reference position from the region
start to the region end (inclusive),
yielding a
PileupColumn at each
position where at least one read is
active.
At each position, the active set
contains all records whose alignment
span
[pos, end_pos] includes
the current position. Records MUST
enter the active set when the
current position reaches their
pos and MUST be evicted
when the current position exceeds
their end_pos.
Each PileupColumn MUST
provide:
- the reference position
- the number of active alignments (depth)
-
access to each alignment with its query
position (
qpos) and aRecordRef
Filtering
The engine MUST support a per-read filter function that is evaluated once per record when it enters the active set (NOT re-evaluated at each position). Records that fail the filter MUST be excluded from all subsequent columns.
The engine MUST support a maximum
depth setting. When more records
pass the filter than
max_depth allows,
excess records MUST be dropped
(preferring records already in the
active set).
Max depth MUST be enforced
per-position, not globally at
arena-load time. A read that is
rejected at one high-coverage
position may still be included at an
adjacent lower-coverage position.
This matches htslib's behavior
where MAXCNT is checked
against the current column's
depth.
Query position
The query position (qpos) maps
a reference position back into a read's
coordinate system. For example, if a read
starts at reference position 100 and has
CIGAR 50M, then at reference
position 125, qpos = 25. For
complex CIGARs with insertions or deletions,
the mapping requires walking the CIGAR
operations (see cigar.md).
For each alignment in a column, the
engine MUST provide
qpos: the position
within the read's query
sequence that aligns to the current
reference position. This MUST be
computed using the
CigarIndex.
When the current reference position falls within a deletion or reference skip in the read's CIGAR, that alignment MUST be excluded from the column (no qpos exists).
Edge cases
Positions with zero active reads MUST be skipped (no column yielded). When there is a gap between reads, the engine MUST jump to the next read's start position rather than iterating through empty positions one by one.
Reads with the unmapped flag (0x4)
MUST be excluded from the pileup.
htslib's
bam_plp_push rejects
unmapped reads before they enter the
buffer.
Reads with zero reference-consuming
CIGAR operations (pure soft-clip,
insertion-only) have
end_pos == pos and MUST
appear in exactly one pileup column
at their pos. They MUST
NOT be skipped or cause off-by-one
errors.
When a reference position falls within the soft-clipped portion of a read's alignment, the read is NOT active at that position (soft clips do not consume reference). The read only becomes active at its first reference-consuming position.
Compatibility
For any BAM file and region, the
pileup engine MUST produce identical
columns to htslib's
bam_plp_auto-based
pileup: same positions, same depth
at each position, and same
qpos values per
alignment. This is verified by
comparison tests.
Performance
The pileup engine is the innermost loop of Rastair's processing pipeline: for a typical 30× whole-genome BAM, it processes ~3 billion reference positions, each with ~30 aligned reads. These rules specify performance-sensitive behaviors to avoid unnecessary overhead in this hot path.
Sources: Seqair-specific performance rules with no upstream spec counterpart. Implementation strategies are guided by profiling results and cross-checked against [htslib] behaviour. See references.md.
Allocation avoidance
The pileup engine allocates a
Vec<PileupAlignment>
per column using
Vec::with_capacity(active.len()). Reusing a buffer via clone was
measured to be slower than fresh
allocation due to the copy cost of
entries. The allocator efficiently
reuses recently-freed blocks of the
same size.
When a record enters the pileup
active set, the engine MUST retrieve
the record data once and cache what
it needs (cigar, flags, strand,
mapq, etc.) in the
ActiveRecord — NOT
perform repeated store lookups for
the same record in the hot loop.
Since BAM records arrive in
coordinate-sorted order from
fetch_into, the pileup
engine MUST NOT sort record indices.
It MUST iterate records in arena
order directly.
Building a
CigarIndex MUST NOT
clone the CIGAR bytes via
.to_vec(). The CIGAR
bytes live in the arena slab for the
lifetime of the region; the
CigarIndex MUST borrow
or reference them by offset, not own
a copy.
Matches and indels counts MUST be
computed once per record during
decode and stored in
BamRecord, NOT
recomputed from CIGAR at every
pileup position.
For records with more than 4 CIGAR
operations,
qpos_at SHOULD use
binary search on
ref_start instead of
linear scan. For 1–4 ops, linear
scan is acceptable.
The RecordStore MUST support a
capacity hint so that the first
region's allocation can be
pre-sized based on an estimate
(e.g., from the BAI index chunk
sizes), avoiding repeated
reallocation during the first
fetch_into. This is
provided by
RecordStore::with_byte_hint().
References
The seqair spec is based on the following upstream specifications and reference implementations.
Format specifications
-
[SAM1] Sequence Alignment/Map Format Specification, The SAM/BAM Format Specification Working Group, last modified 2025-08-10. https://samtools.github.io/hts-specs/SAMv1.pdf Covers: SAM text format (§1), mandatory alignment fields and FLAG bits (§1.4), BAM binary encoding (§4 "The BAM Format Specification"), BGZF block compression (§4.1 "The BGZF compression format"), random access and virtual offsets (§4.1 "Random access"), EOF marker (§4.1 "End-of-file marker"), BAM record layout (§4.2 "The BAM format"), BIN field calculation (§4.2.1), SEQ/QUAL encoding (§4.2.4), auxiliary data encoding (§4.2.5), BAI index algorithm (§5 "Indexing BAM"), basic binning index (§5.1.1), BAI index format (§5.2 "The BAI index format for BAM files"),
reg2bin/reg2binsC code (§5.3). -
[CRAM3] CRAM format specification (version 3.1), samtools-devel@lists.sourceforge.net, last modified 2025-06-04. https://samtools.github.io/hts-specs/CRAMv3.pdf Covers: data types and ITF8/LTF8 variable-length integers (§2), bit stream reading (§2.2), encoding types (§3 "Encodings"), checksums (§4), file structure overview (§5), file definition and magic (§6), container header structure (§7), CRAM header container (§7.1), block structure (§8), block content types (§8.1), compression header block (§8.3) including preservation map, data series encodings, and substitution matrix, slice header block (§8.4), core data block (§8.5), external data blocks (§8.6), EOF container (§9), record structure and CRAM record decode order (§10), BAM and CRAM flags (§10.1), positional data and AP delta coding (§10.2), read names (§10.3), mate records (§10.4), auxiliary tags (§10.5), mapped reads and read features (§10.6), substitution codes (§10.6.1), unmapped reads (§10.7), reference sequences (§11), CRAM index / CRAI format (§12), encoding codec IDs (§13).
-
[CRAMcodecs] CRAM codec specification (version 3.1), samtools-devel@lists.sourceforge.net, last modified 2025-03-24. https://samtools.github.io/hts-specs/CRAMcodecs.pdf Covers: rANS 4×8 codec with order-0 and order-1 frequency tables and interleaving (§2), rANS Nx16 codec with transforms (RLE, PACK, STRIPE, CAT, N32) (§3), range/arithmetic coder (§4), name tokeniser tok3 (§5), fqzcomp quality codec (§6).
-
[TABIX] The Tabix index file format, Heng Li. https://samtools.github.io/hts-specs/tabix.pdf Covers: TBI magic, header fields (n_ref, format, col_seq, col_beg, col_end, meta, skip, l_nm, names), bin/chunk/linear-index structure identical to BAI,
n_no_coortrailing field,reg2bin/reg2binsC functions. -
[CSI] Coordinate Sorted Index (CSI) format specification, SAM/BAM Format Specification Working Group, last modified 2020-05-25. https://samtools.github.io/hts-specs/CSIv1.pdf Covers: CSI magic, header fields (min_shift, depth, l_aux), per-bin
loffsetfield replacing BAI's linear index, parameterisedreg2bin/reg2bins/bin_limitfunctions.
Reference implementations
-
[htslib] htslib, Wellcome Sanger Institute. https://github.com/samtools/htslib The canonical C implementation of SAM/BAM/CRAM/BGZF/BAI. Used to verify correctness of seqair's output: pileup positions, depths, qpos values, CRAM record fields, and flag handling are all cross-checked against htslib.
-
[noodles] noodles, Michael Macias. https://github.com/zaeleus/noodles A pure-Rust implementation of genomics formats. Referenced for Rust-idiomatic design patterns in handling BAM/SAM/CRAM records.