Seqair — General Design

These are cross-cutting design rules that apply to all modules in the Seqair crate. They establish conventions for error handling, type safety, and platform portability.

Sources: This file contains seqair-specific design rules with no direct upstream spec counterpart. Named constants (FLAG bits, CIGAR op codes) are derived from [SAM1] §1.4 and §4.2. See references.md.

Error handling

Seqair aims to provide helpful and complete error messages at all layers of the system.

io.errors

bgzf.rs:32 +4 bgzf.rs:33

Error types MUST implement std::error::Error using thiserror and capture necessary context and provide causes.

io.errors.typed_variants

Every failure mode MUST be a distinct enum variant with typed fields — never a catch-all String or reason: String. Callers MUST be able to pattern-match on failure modes without inspecting display text. When a single enum variant is constructed at multiple call sites with different format strings, split it into separate variants. Paths, byte arrays, and counts MUST be stored in typed fields (e.g. PathBuf, [u8; 4], usize), not interpolated into strings.

Type safety

BAM files are full of magic numbers: flag bits (0x4 = unmapped, 0x10 = reverse strand), CIGAR operation codes (0 = M, 1 = I, ...), format constants. Using raw literals is error-prone and hard to read.

io.named_constants

cigar.rs:8 +1 bgzf.rs:90 +1

Magic numbers (bit flags, format constants, operation codes) MUST be defined as named constants, not used as raw literals. BAM flags MUST use named constants (e.g., FLAG_UNMAPPED instead of 0x4). CIGAR operation types MUST use named constants (e.g., CIGAR_M instead of 0).

io.typed_flags

flags.rs:6 bgzf.rs:108

BAM flags MUST have a BamFlags newtype wrapping u16 with named predicate methods (e.g., is_unmapped(), is_reverse()). The public API (RecordRef) MUST expose BamFlags rather than raw u16 where appropriate.

io.typed_cigar_ops

cigar.rs:19 cigar.rs:138 +1

CIGAR operations MUST have a CigarOpType enum with variants for all 9 SAM-spec operations (M, I, D, N, S, H, P, =, X). The enum MUST provide consumes_ref() and consumes_query() methods. Invalid operation codes MUST be represented as None via from_bam(u8) -> Option<CigarOpType>.

API design

io.non_exhaustive_enums

reader.rs:20

All public enums (including error types) that may gain variants in future versions MUST be annotated with #[non_exhaustive]. Adding a variant to a non-exhaustive enum is not a semver-breaking change, allowing new variants and error conditions to be introduced in minor releases.

io.minimal_public_api

bam.rs:40 +2

Only intentionally public types should be exported. Internal implementation modules (compression codecs, container parsers, encoding details) MUST be pub(crate) or #[doc(hidden)] so that downstream crates do not depend on internal structure. The public API contract is defined by explicit pub use re-exports in each top-level module file.

Arithmetic safety

io.checked_arithmetic

All integer arithmetic on values derived from untrusted input (file data, parsed fields) MUST use checked or saturating operations. Default wrapping arithmetic (+, -, *) MUST NOT be used on untrusted values. The clippy::arithmetic_side_effects lint should be enabled.

Platform portability

io.platform_optimizations

seq.rs:10 seq_codec.rs:58

Every platform-specific optimization (SIMD, architecture-specific intrinsics) MUST have a scalar fallback that produces identical results on all platforms. The optimized and fallback paths MUST be tested with property-based tests that verify equivalence for arbitrary inputs.

Unified Reader

seqair reads BAM, bgzf-compressed SAM, and CRAM files through a format-agnostic reader interface that auto-detects the input format and dispatches to the appropriate parser, while presenting a uniform API to the rest of the codebase.

Sources: The unified reader design is seqair-specific. Format magic bytes come from [SAM1] §4.2 (BAM: BAM\1), [SAM1] §4.1 (BGZF: 1f 8b), and [CRAM3] §6 (CRAM: CRAM). Sort order detection uses the SO tag from [SAM1] §1.3 "The header section" (@HD line). See references.md.

Goals

Backward compatibility: existing code that calls IndexedBamReader::open() continues to work unchanged.
Auto-detection: IndexedReader::open(path) inspects the file to determine BAM vs SAM vs CRAM and opens the appropriate reader.
Uniform downstream API: all three formats populate the same RecordStore via fetch_into(). The pileup engine, call pipeline, and all downstream code are format-agnostic.
Forkable: all reader types support fork() for thread-safe parallel processing with shared immutable state.

Format detection

unified.detect_format

reader.rs:326 reader.rs:400 +4

IndexedReader::open(path) MUST auto-detect the file format by inspecting magic bytes:

Bytes 1f 8b (gzip magic) → verify BGZF structure (extra field with BC subfield). If not BGZF, reject with error: "file is gzip-compressed but not BGZF; use bgzip instead of gzip." If BGZF, decompress the first block:
- Starts with BAM\x01 → BAM format
- Starts with @ (0x40) → bgzf-compressed SAM
- Otherwise → reject with error naming supported formats
Bytes 43 52 41 4d (CRAM) → CRAM format
Byte @ (0x40) at position 0 → uncompressed SAM. Reject with error: "uncompressed SAM files cannot be indexed; compress with bgzip file.sam then index with samtools index file.sam.gz."
Otherwise → return an error listing supported formats (BAM, bgzf-compressed SAM, CRAM)

unified.detect_index

reader.rs:418 +1 unified_reader.rs:45 +1

After format detection, the reader MUST locate the appropriate index file:

BAM: .bai, .bam.bai, or .csi (CSI supports references > 512 Mbp that BAI cannot)
SAM: .tbi or .csi (tabix or CSI index)
CRAM: .crai (CRAM index)

unified.detect_error

reader.rs:419 +1 unified_reader.rs:63 +1

If the format is detected but no matching index is found, the error MUST name the expected index extension and suggest the tool to create it (samtools index for BAM/CRAM, samtools index or tabix for SAM).

Reader enum

unified.reader_enum

reader.rs:104 unified_reader.rs:44 +1

The unified reader MUST be an enum dispatching to format-specific readers, not a trait object. This avoids dynamic dispatch overhead in the hot path and keeps the type concrete for fork().

enum IndexedReader {
coverage.html    Bam(IndexedBamReader),
    Sam(IndexedSamReader),
    Cram(IndexedCramReader),
}

unified.reader_api

reader.rs:122 unified_reader.rs:86

The unified reader MUST expose:

header() -> &BamHeader — all formats produce the same header type (target names, lengths, tid lookup). The SAM and CRAM parsers convert their native header representations to BamHeader.
fetch_into(tid, start, end, store) -> Result<usize> — populates a RecordStore with records overlapping the region, identically to the BAM path.
fork() -> Result<Self> — creates a lightweight copy sharing immutable state.
shared() -> &Arc<_> — access to shared state for Arc::ptr_eq testing.

unified.fetch_equivalence

reader.rs:171 compare_cram_parsing.rs:206 +5

For a given BAM file and its SAM/CRAM representations of the same data, fetch_into for the same region MUST produce records with equivalent logical content: same positions, flags, sequences, qualities, and CIGAR operations. Aux tags MUST contain the same set of tag names and values, but tag ordering and integer type codes (e.g., c vs i for small values) MAY differ between formats because BAM writers choose specific integer widths that SAM text cannot preserve.

RecordStore integration

unified.record_store_push

record_store.rs:166 record_store.rs:126

Currently RecordStore::push_raw() takes raw BAM bytes. For SAM and CRAM, records arrive as parsed fields rather than BAM binary.

The RecordStore MUST provide push_fields() (or equivalent) that accepts pre-parsed record fields:

pos, end_pos, flags, mapq, seq_len, matching_bases, indel_bases (fixed fields)
qname bytes
CIGAR as packed BAM-format u32 ops (SAM/CRAM parsers convert to this representation)
sequence as &[Base] (already the enum type stored in the bases slab — no conversion needed)
quality bytes
aux tag bytes in BAM binary format (SAM/CRAM parsers serialize tags to this format)

This avoids a wasteful round-trip through BAM binary encoding.

unified.push_fields_equivalence

record_store.rs:168 record_store.rs:125 +1

push_fields() and push_raw() MUST produce identical SlimRecord fixed fields and identical name/bases/cigar/qual slab contents for the same logical record. Aux tag slab contents MAY differ in integer type codes (a SAM-derived i:42 may serialize as c while the original BAM used C), since the original BAM writer's type choice is not recoverable from SAM text. Tests MUST verify fixed-field and non-aux-slab equivalence by pushing the same record both ways.

Shared header type

unified.header_from_sam_text

header.rs:141 header.rs:281

SAM headers are text (the @HD, @SQ, @RG, @PG, @CO lines). CRAM containers embed SAM header text. BamHeader already stores header_text: String and parses @SQ lines for targets.

BamHeader MUST gain a from_sam_text(text: &str) constructor that parses SAM header text directly, without requiring BGZF/BAM framing. This is used by both the SAM reader (header lines at start of file) and the CRAM reader (SAM header block in file header container).

Sort order validation

unified.sort_order

header.rs:223 +3 header.rs:329 +5

Indexed random access assumes coordinate-sorted data. If the @HD header line contains SO:unsorted or SO:queryname, the reader MUST return an error explaining that indexed region queries require coordinate-sorted input. SO:coordinate or absent SO (common in older files) MUST be accepted.

Fork semantics per format

unified.fork_bam

reader.rs:186 unified_reader.rs:128

BAM fork: shares Arc<BamShared> (index + header), opens fresh File handle. (Already implemented.)

unified.fork_sam

reader.rs:187 unified_reader.rs:129

SAM fork: shares Arc holding parsed tabix/CSI index + header, opens fresh File handle for BGZF reading.

unified.fork_cram

reader.rs:188 compare_cram_with_htslib.rs:221 +1

CRAM fork: shares Arc holding CRAI index + header, opens fresh File handle for container reading. Each fork MUST have its own FASTA reader (via fasta_reader.fork()) for thread-safe reference lookups. Reference caching (r[cram.perf.reference_caching]) MUST be per-fork, not shared.

Readers: alignment + reference bundle

unified.readers_struct

reader.rs:203 unified_reader.rs:167

The Readers struct bundles an IndexedReader (alignment) with an IndexedFastaReader (reference) in a single type. This eliminates the need for separate FASTA path passing — CRAM can access the reference it needs for sequence reconstruction, and all formats have uniform open/fork semantics.

pub struct Readers {
    alignment: IndexedReader,
    fasta: IndexedFastaReader,
}

unified.readers_open

reader.rs:222 unified_reader.rs:168 +2

Readers::open(alignment_path, fasta_path) MUST auto-detect the alignment format (via r[unified.detect_format]), open the appropriate reader, and open the FASTA reader. For CRAM, the fasta_path is passed to IndexedCramReader::open() for sequence reconstruction. For BAM/SAM, the FASTA reader is opened but not used by the alignment reader.

unified.readers_fork

reader.rs:232 compare_cram_with_htslib.rs:220 +1

Readers::fork() MUST fork both the alignment reader and the FASTA reader, returning a new Readers with independent I/O handles but shared immutable state (indices, headers). The CRAM fork gets its own FASTA reader via IndexedFastaReader::fork().

unified.readers_accessors

reader.rs:239 unified_reader.rs:169

Readers MUST expose:

header() -> &BamHeader — delegates to the alignment reader's header.
fetch_into(tid, start, end, store) -> Result<usize> — delegates to the alignment reader.
fasta() -> &IndexedFastaReader and fasta_mut() -> &mut IndexedFastaReader — direct access for callers that need reference sequences independently of the alignment reader (e.g., the call pipeline's segment fetching).
alignment() -> &IndexedReader and alignment_mut() -> &mut IndexedReader — direct access when needed.

unified.readers_backward_compat

reader.rs:129 unified_reader.rs:154

IndexedReader::open(path) MUST continue to work for BAM and SAM files without a FASTA path. CRAM detection in IndexedReader::open() MUST return an error explaining that CRAM requires a reference and suggesting Readers::open() instead. This preserves backward compatibility for code that only needs BAM/SAM.

API surface

See r[io.non_exhaustive_enums] and r[io.minimal_public_api] in general.md for the general rules. For this module, IndexedReader, FormatDetectionError, and ReaderError are the primary enums subject to r[io.non_exhaustive_enums].

BGZF Block Reader

BGZF (Blocked Gzip Format) is the compression layer used by BAM and other HTS (High-Throughput Sequencing) formats. Unlike plain gzip, which compresses an entire file as one stream, BGZF splits the data into independent blocks of up to 64 KiB uncompressed. Each block is a self-contained gzip member, meaning any block can be decompressed without reading the ones before it. This enables random access: a BAM index can point directly to a specific block offset in the compressed file, and the reader can decompress just that block to find the data it needs.

A BGZF file is simply a concatenation of these gzip blocks, ending with a special empty EOF marker block. Each block has a standard gzip header with an extra field that encodes the block's total compressed size (BSIZE), which the reader uses to locate where the next block begins.

Sources: All rules in this file derive from [SAM1] §4.1 "The BGZF compression format", §4.1 "Random access", and §4.1 "End-of-file marker". See references.md.

Block structure

[SAM1] §4.1 "The BGZF compression format" — block layout, extra field, BSIZE

bgzf.magic

bgzf.rs:165 compare_header.rs:20

A BGZF block MUST begin with the gzip magic bytes 1f 8b 08 04 (gzip, DEFLATE, FEXTRA flag set).

bgzf.bsize

bgzf.rs:166 compare_header.rs:21

The extra field MUST contain a BC subfield (SI1=0x42, SI2=0x43) with SLEN=2 whose value is the total block size minus one (BSIZE). The total compressed block size is BSIZE + 1.

bgzf.decompression

bgzf.rs:167 compare_bgzf_crate.rs:49 +2

Each block's DEFLATE payload MUST be decompressed independently. The last 8 bytes of the block are the gzip footer: a CRC32 checksum (4 bytes) followed by the uncompressed size ISIZE (4 bytes), both little-endian. Decompression MUST produce exactly ISIZE bytes.

bgzf.max_block_size

bgzf.rs:263 bgzf.rs:446

BGZF blocks MUST NOT exceed 65536 bytes uncompressed. If the ISIZE footer field claims a larger value, the reader MUST return an error rather than allocating an unbounded buffer.

bgzf.eof

bgzf.rs:168 compare_records.rs:74

An EOF marker block has ISIZE=0. When encountered, the reader MUST signal end-of-stream.

Virtual offsets

[SAM1] §4.1 "Random access" — virtual file offset definition

Because BGZF blocks are at known compressed file offsets and have known uncompressed sizes, any byte in the uncompressed stream can be addressed with a virtual offset: a packed 64-bit value that combines "which block" and "where within that block." BAM index files (.bai) store virtual offsets to point at specific records.

bgzf.virtual_offset

bgzf.rs:10 bgzf.rs:6 +2

A virtual offset is a 64-bit value where the upper 48 bits encode the compressed block offset in the file and the lower 16 bits encode the byte offset within the uncompressed block.

bgzf.seek

bgzf.rs:145 compare_records.rs:72

The reader MUST support seeking to an arbitrary virtual offset by seeking the underlying file to the block offset, decompressing that block, and advancing to the within-block offset.

Reading

BAM records are variable-length and can span block boundaries (a record may start near the end of one block and continue into the next). The reader must handle this transparently.

bgzf.read_exact

bgzf.rs:315 compare_header.rs:23

The reader MUST support reading an exact number of bytes, transparently crossing block boundaries when the requested data spans multiple blocks.

bgzf.read_partial

bgzf.rs:341 compare_records.rs:75

The reader MUST support partial reads (up to N bytes) returning the actual count, returning 0 at EOF.

bgzf.libdeflate

bgzf.rs:72 compare_header.rs:24

Decompression SHOULD use the libdeflater crate (Rust bindings to the libdeflate C library) for performance parity with htslib's libdeflate usage. libdeflate provides hardware-accelerated DEFLATE decompression and CRC32 computation.

Integrity

[SAM1] §4.1 "The BGZF compression format" — gzip footer CRC32 and ISIZE fields

Each gzip block includes a CRC32 checksum of the uncompressed data. Verifying this catches silent data corruption from disk errors, network glitches on cluster storage, or truncated writes.

bgzf.crc32

bgzf.rs:169 bgzf.rs:55 +3

After decompression, the reader MUST verify the CRC32 checksum of the decompressed data against the expected CRC32 stored in the gzip footer (4 bytes before ISIZE). A mismatch MUST return a ChecksumMismatch error.

Performance

bgzf.fast_header

bgzf.rs:172 bgzf.rs:56

The reader SHOULD fast-path standard BGZF headers where XLEN=6 and the BC subfield is at the fixed offset (bytes 12–17 of the 18-byte header). All BAM files produced by samtools, htslib, and Picard use this layout. This avoids allocating an extra-fields buffer for the common case. Non-standard layouts MUST fall back to searching the extra fields for the BC subfield.

bgzf.resize_uninit

bgzf.rs:171 bgzf.rs:57

When resizing buffers that will be immediately and fully overwritten (by read_exact or decompression), the reader SHOULD use an uninitialized resize to avoid redundant zero-filling. This applies to the decompressed block buffer (~64 KB per block) and the compressed data buffer.

bgzf.block_offset_tracking

bgzf.rs:81 +1 bgzf.rs:58

The reader MUST track the current block's compressed file offset for virtual offset calculation. The offset MUST be queried from the stream position before reading each new block, and set directly on seeks.

CIGAR Operations

A CIGAR string describes how a read aligns to the reference genome. It's a sequence of operations like "50M2I48M" meaning "50 bases match, 2 bases inserted in the read, 48 bases match." Each operation has a type and a length.

In BAM's binary format, CIGAR operations are packed as u32 values: the lower 4 bits encode the operation type (0–8) and the upper 28 bits encode the length. A 150bp read with a simple alignment has just one operation (150M), while reads spanning splice junctions or containing indels can have 10+ operations.

The key concept is which operations consume the reference and/or query (read) sequences:

Consumes reference: the operation advances the position on the reference genome
Consumes query: the operation advances the position within the read

This determines how to translate between reference positions and read positions — the core question the pileup engine asks thousands of times per second.

Sources: [SAM1] §1.4 "The alignment section: mandatory fields" — CIGAR operations table with op codes, BAM numeric encodings, and "consumes query"/"consumes reference" flags. §4.2 "The BAM format" — packed u32 CIGAR encoding (op_len << 4 | op). See references.md.

Operations

[SAM1] §1.4 "The alignment section: mandatory fields" — CIGAR operations table (M=0, I=1, D=2, N=3, S=4, H=5, P=6, ==7, X=8)

cigar.operations

cigar.rs:7 cigar.rs:112

The following CIGAR operations MUST be supported:

M (0): alignment match (consumes query and reference)
I (1): insertion (consumes query only)
D (2): deletion (consumes reference only)
N (3): reference skip (consumes reference only)
S (4): soft clip (consumes query only)
H (5): hard clip (consumes neither)
P (6): padding (consumes neither)
= (7): sequence match (consumes query and reference)
X (8): sequence mismatch (consumes query and reference)

Matches and indels

These aggregate statistics are used downstream for read quality metrics (e.g. fraction of matching bases, indel rate).

cigar.matches_indels

cigar.rs:75 cigar.rs:11 +3

calc_matches_indels(cigar) MUST compute the total number of matching bases (sum of M operation lengths) and the total number of indel bases (sum of I and D operation lengths) from a packed CIGAR array.

CIGAR index

For the pileup engine, the critical operation is: "at reference position X, what is the corresponding position in this read?" This requires walking the CIGAR to accumulate reference and query offsets. The CigarIndex precomputes this to avoid re-walking for every position.

cigar.index

cigar.rs:96 cigar.rs:49

A CigarIndex MUST be constructable from a record's CIGAR and start position. It precomputes cumulative reference and query offsets per operation.

cigar.qpos_at

cigar.rs:144 cigar.rs:50 +8

CigarIndex::qpos_at(ref_pos) MUST return Some(query_pos) when the reference position falls within a query-consuming operation (M, =, X, I at the boundary), or None when the position falls within a deletion, reference skip, or outside the alignment.

cigar.qpos_bounds

cigar.rs:254 cigar.rs:401

qpos_at MUST return None for positions outside the alignment's reference span. This applies to all mapping variants, including the linear fast-path. Positions before the alignment start or at/after the alignment end MUST return None.

cigar.compact_op_position_invariant

cigar.rs:387

CompactOp::ref_start is stored as i32. Since BAM positions are defined as i32 values (max 2^31 - 1), this is sufficient. The construction of CompactOp MUST assert (via debug_assert!) that the computed reference position fits in i32.

cigar.qpos_accuracy

cigar.rs:145 cigar.rs:244 +2

qpos_at MUST produce identical results to htslib's Alignment::qpos() for every reference position within the alignment span. This is verified by comparison tests.

Seqair-Types — Phred Quality Score

Sources: Phred quality scores are defined in [SAM1] section 1.4 (QUAL field, Phred+33 encoding). The Phred newtype wraps the mathematical definition: Q = -10 log10(P). See references.md.

Construction

types.phred.non_negative

phred.rs:49 phred.rs:123 +1

Phred scores MUST be non-negative. from_phred MUST accept only u8 values (0..=255), making negative inputs unrepresentable at the type level.

Integer conversion

types.phred.as_int_clamp

phred.rs:34 phred.rs:104 +2

as_int MUST clamp the result to the range [0, 99]. Values less than or equal to zero (including negative results from floating-point arithmetic) MUST return 0. NaN MUST return 0.

BAM Reader

The indexed BAM reader is the main entry point for loading aligned reads from a BAM file. It ties together three layers: BGZF decompression (reading compressed blocks), BAI index lookup (finding which blocks contain reads for a region), and record decoding (parsing the binary BAM records into usable structs).

The reader's primary operation is fetch_into: given a genomic region like chr19:6,105,700–6,105,800, it queries the index, bulk-reads the relevant compressed data into memory (see region_buf.md), decompresses it, decodes the records, and stores them in a RecordStore for downstream processing by the pileup engine.

Sources: [SAM1] §4 "The BAM Format Specification" generally — BGZF, BAM header, record layout, and BAI index together define the reader's behaviour. The forking design and ChunkCache are seqair-specific optimisations with no upstream spec counterpart. See references.md.

Opening

bam.reader.open

reader.rs:120 compare_header.rs:18

The reader MUST open a BAM file by path, parse its header, and locate the corresponding .bai index file (trying both <path>.bai and <path_without_ext>.bai).

bam.reader.header_access

reader.rs:121 compare_header.rs:19

The reader MUST expose the parsed BamHeader for contig name/tid/length lookups.

Region fetching

bam.reader.fetch_into+2

reader.rs:181 compare_bam_with_htslib.rs:59 +1

fetch_into(region, store) MUST clear the store, query the BAI index for chunks overlapping the region, seek through those chunks via BGZF, decode all records, and add records that actually overlap the region to the store (RecordStore).

bam.reader.overlap_filter

reader.rs:182 compare_records.rs:60

A record overlaps a region if record.pos <= region.end AND record.end_pos >= region.start. Records outside the region that appear in index chunks MUST be filtered out. This is necessary because the BAM index is approximate — chunks may contain records slightly outside the queried region.

bam.reader.sorted_order+2

reader.rs:183 compare_records.rs:61

Records in the store MUST be in coordinate-sorted order (by pos, then by end_pos as tiebreaker) after fetch_into, matching the order a sorted BAM file provides.

Edge cases

bam.reader.unmapped_skipped+2

reader.rs:252 compare_records.rs:62

Unmapped reads (flag 0x4) MUST be excluded from the store during fetch_into. htslib's pileup engine rejects them in bam_plp_push, so including them would inflate depth counts.

bam.reader.secondary_supplementary_included+2

reader.rs:184 pileup.rs:253

Secondary (0x100) and supplementary (0x800) reads MUST be included in the store by default. The caller's pileup filter is responsible for excluding them if desired, matching htslib's behavior.

Forking (thread-safe shared state)

Rastair processes regions in parallel via rayon. Each worker thread needs its own IndexedBamReader because file handles have mutable seek state. Without forking, every thread independently opens the BAM file, re-reads the entire BAI index from disk, and re-parses the BAM header — wasting both I/O bandwidth and memory. On HPC clusters with NFS/Lustre, redundant index reads are especially costly.

The fork pattern separates the reader into shared immutable data (BamIndex, BamHeader) and per-thread mutable state (file handles, decompression buffers). A prototype reader is opened once on the main thread; worker threads call fork() to get a lightweight copy that shares the index and header via Arc but owns fresh file handles.

Shared state

bam.reader.shared_state

reader.rs:107 fork.rs:93

The BAM index and header MUST be stored behind Arc so that forked readers share a single copy. These structures are read-only after initial parsing and are accessed concurrently without synchronization.

Fork operation

bam.reader.fork

reader.rs:156 fork.rs:28 +1

fork() MUST produce a new reader that:

Shares the same Arc holding BamIndex and BamHeader as the source reader (no re-parsing, no additional memory).
Opens a fresh raw File handle for bulk RegionBuf reads. No BgzfReader is needed — fetch_into only uses RegionBuf, and the BgzfReader used during open() for header parsing is not retained.
Allocates its own ChunkCache (starts empty, populated on first region fetch for a tid).

bam.reader.fork_equivalence

reader.rs:158 fork.rs:29 +1

A forked reader MUST produce identical fetch_into results as independently opening the same BAM file. The fork is purely an optimization — it MUST NOT change observable behavior.

bam.reader.fork_independence

reader.rs:157 fork.rs:106

Forked readers MUST be fully independent for mutable operations. Seeking or fetching on one reader MUST NOT affect any other reader (original or forked). Each reader owns its own file descriptors and buffer state.

Edge cases

bam.reader.fork_arc_identity

reader.rs:172 fork.rs:68 +1

Arc::ptr_eq on the shared state of the original and forked reader MUST return true, confirming they share the same allocation rather than having made a deep copy.

bam.reader.fork_concurrent

reader.rs:159 fork.rs:137 +1

Multiple forks from the same prototype MUST be safe to use concurrently from different threads. Since the shared state is immutable and each fork owns its file handles, no synchronization is needed beyond the Arc reference count.

Coordinate validation

bam.reader.coordinate_overflow

reader.rs:82 +1 reader.rs:493 +3

fetch_into MUST validate that tid fits in i32 and that start/end fit in i64 before using them. If any value exceeds the signed maximum, fetch_into MUST return a BamError::CoordinateOverflow rather than silently truncating via as casts.

Error propagation

bam.reader.propagate_errors

reader.rs:241 +1

fetch_into (and ChunkCache::load) MUST propagate I/O and decompression errors from the BGZF layer. Only BgzfError::UnexpectedEof (signalling the end of a chunk's data) should silently break the read loop. All other errors — CRC failures, truncated blocks, decompression failures, invalid magic, et.c — MUST be returned to the caller.

BAM Index (BAI/CSI)

A BAM file can be hundreds of gigabytes. Without an index, finding all reads overlapping a small genomic region (e.g. chr19:6,105,700–6,105,800) would require decompressing and scanning the entire file. The BAM index (.bai file) solves this by recording, for each genomic region, which compressed byte ranges in the BAM file contain relevant reads.

The index uses a binning scheme: the genome is divided into hierarchical bins of decreasing size (the whole reference → 512 Mbp → 64 Mbp → ... → 16 KiB leaf bins). Each bin stores a list of chunks — pairs of virtual offsets (see bgzf.md) marking the start and end of compressed byte ranges in the BAM file that contain reads falling into that bin. A linear index provides an additional optimization: for each 16 KiB window, it stores the minimum virtual offset of any read starting in that window, allowing the reader to skip chunks that are entirely before the query start.

Sources: [SAM1] §5 "Indexing BAM" — binning algorithm; §5.1.1 "Basic binning index" — bin hierarchy and sizes; §5.2 "The BAI index format for BAM files" — binary format; §5.3 "C source code for computing bin number and overlapping bins" — reg2bin/reg2bins. [CSI] — for CSI format. See references.md.

BAI format

[SAM1] §5.2 "The BAI index format for BAM files" — magic, n_ref, n_bin, chunk virtual offsets, n_intv linear index

bam.index.bai_magic

index.rs:85 compare_records.rs:68

A BAI file MUST begin with the magic bytes BAI\1 (0x42, 0x41, 0x49, 0x01). The reader MUST reject files that do not match.

bam.index.bai_parse

index.rs:86 compare_records.rs:69

The reader MUST parse the BAI binary format: n_ref references, each containing bins with chunks (pairs of virtual offsets) and a linear index of 16 KiB window offsets.

bam.index.bgzf_block_validation

index.rs:323 index.rs:653

When decompressing BGZF blocks (e.g. for tabix), the reader MUST validate that bsize is large enough for the block header and footer before indexing into the block. A bsize that is too small MUST return an error rather than panicking.

bam.index.region_query

index.rs:175 compare_records.rs:70

Given a region (tid, start, end), the index MUST return a list of BGZF chunk pairs (begin, end) that contain all reads overlapping the region. The query MUST use the binning scheme to identify candidate bins and the linear index to skip chunks that cannot contain reads starting before the query start.

bam.index.bin_calculation

index.rs:369 compare_records.rs:71

The BAI binning scheme uses reg2bin(beg, end) where bin 0 spans the whole reference, bins 1–8 span 64 Mbp each, and leaf bins span 16 KiB. The implementation MUST correctly compute bins for a region and enumerate all overlapping bins at every level.

Bin 0

[SAM1] §5.1.1 "Basic binning index" — bin 0 spans 512 Mbp, bins 1–8 span 64 Mbp

Bin 0 is the root of the binning hierarchy, covering the entire reference sequence (0 to 2^29). Each alignment is placed in the smallest bin that fully contains its [pos, end) interval. An alignment is placed in bin 0 when its interval straddles a level-1 boundary (multiples of 2^26 = 64 Mbp). This can happen to short reads (e.g. 150 bp) that cross positions like 67,108,864 or 134,217,728. See SAM spec §5.3.

In practice, bin 0 contains very few reads (only those near the ~3–4 level-1 boundaries per chromosome), but its chunks may be located at a very different file offset from the rest of the query's chunks. In a 43 GB BAM, this can mean an extra seek of several GB — a significant cost on network storage.

bam.index.bin0_required

index.rs:212 index.rs:501

The query MUST always include bin 0 in the candidate bins for any region query. Omitting it would silently drop reads that straddle 64 Mbp boundaries.

bam.index.bin0_separate+2

index.rs:66 +1 index.rs:481

The index SHOULD provide a query_split method that separates distant (level 0–2) chunks from nearby (level 3–5) chunks in the result, returning a QueryChunks struct with nearby and distant fields. This allows callers to handle them differently (e.g. caching, separate I/O). The original query method MUST remain available for callers that don't need the separation.

Higher-level bins and ChunkCache

The same distant-chunk problem affects bins at levels 1 (64 Mbp) and 2 (8 Mbp). For a query in chr5:100K–200K, the index includes bin 1 (covering chr5:0–64M) whose chunks span ~2 GB of the BAM file, and bin 9 (covering chr5:0–8M) whose chunks span ~0.27 GB. These chunks are identical for every region query within the same reference and contain many records that are outside the query region (~15% of decoded records are discarded by the position filter).

bam.index.chunk_separation+2

index.rs:67 +1 index.rs:482

The query_split method MUST separate chunks into nearby (levels 3–5, loaded via RegionBuf per query) and distant (levels 0–2, suitable for caching). The threshold (level ≤ 2) is based on profiling data showing that levels 0–2 contribute >90% of the file span in typical region queries.

bam.index.chunk_cache

index.rs:155 +2 reader.rs:441

The reader SHOULD maintain a ChunkCache that loads ALL records from bins at levels 0–2 (covering ≥8 Mbp) once per reference sequence (tid) per thread. The BamIndex::distant_chunks method returns all level 0–2 chunks for a tid, unfiltered by region, for use when populating the cache. Per query, matching records (overlapping the query region) are injected from the cache into the RecordStore without any additional I/O. The cache MUST be invalidated when switching to a different reference (tid). This eliminates expensive seeks for distant-bin chunks on every region query.

CSI format

[CSI] — min_shift, depth, l_aux header; per-bin loffset; parameterised reg2bin/reg2bins

CSI is an alternative index format that supports larger genomes and arbitrary bin sizes. It uses the same conceptual approach (bins + chunks + linear index) but with a configurable minimum shift and depth.

bam.index.csi_support

CSI index support MAY be added as a follow-up. The same region_query interface MUST be used for both BAI and CSI indices.

BAM Header

A BAM file begins with a header that describes the reference genome and records metadata about how the alignments were produced. The header contains two parts: a SAM-format text section (with @HD, @SQ, @RG, @PG lines) and a binary reference sequence table. The reference table maps each contig (chromosome) to a numeric target ID (tid) used throughout the rest of the file — every aligned read stores the tid of its reference contig rather than the name string.

Sources: [SAM1] §4.2 "The BAM format" (BAM magic, l_text, n_ref, reference sequence table); §1.3 "The header section" (SAM header line types @HD/@SQ/@RG/@PG). See references.md.

Parsing

[SAM1] §4.2 "The BAM format" — magic, l_text, text, n_ref, l_name, name, l_ref fields

bam.header.magic

header.rs:73 compare_header.rs:16

A BAM file MUST begin with the magic bytes BAM\1 (0x42, 0x41, 0x4d, 0x01) after BGZF decompression. The reader MUST reject files that do not start with this magic.

bam.header.text

header.rs:74 compare_header.rs:17

After the magic, a 4-byte little-endian l_text gives the length of the SAM header text. The reader MUST parse this text to extract @HD, @SQ, @RG, and @PG header lines.

bam.header.non_negative_lengths

header.rs:86 header.rs:404

Header length fields (l_text, n_ref, l_name) are stored as i32 in the BAM format. The reader MUST validate that these values are non-negative before using them as sizes. Negative values from corrupt files MUST return a BamHeaderError rather than causing a massive allocation via i32 as usize wrapping.

bam.header.references

header.rs:75 compare_header.rs:15

After the header text, a 4-byte little-endian n_ref gives the number of reference sequences. Each reference is encoded as a 4-byte l_name (including NUL), the name bytes, and a 4-byte l_ref (sequence length). The reader MUST parse all reference sequences.

Contig lookup

The rest of the BAM file and its index use tids (integer IDs) to refer to contigs. The caller typically has a contig name (e.g. "chr19") from user input or a BED file and needs the tid for index queries. Conversely, when reporting results, the tid needs to be mapped back to a name.

bam.header.tid_lookup

header.rs:200 compare_header.rs:14

The header MUST support O(1) lookup from contig name to target ID (tid) and O(1) lookup from tid to contig name and length.

bam.header.target_names

header.rs:214 compare_header.rs:12

The header MUST provide access to the list of all target names in tid order.

bam.header.target_len

header.rs:209 compare_header.rs:13

The header MUST provide the length of each target sequence by tid.

BAM Record

Each aligned read in a BAM file is stored as a record: a binary structure encoding the read's position on the reference genome, its DNA sequence, quality scores, alignment details (CIGAR), and optional auxiliary tags. Records are variable-length — a 150 bp read with few tags is ~300 bytes, while a long read with many tags can be kilobytes.

The record's binary layout starts with 32 fixed bytes (position, flags, lengths, etc.), followed by variable-length fields in order: read name (qname), CIGAR operations, 4-bit packed sequence, quality scores, and auxiliary tags.

Sources: [SAM1] §4.2 "The BAM format" — binary record layout (refID, pos, l_read_name, mapq, bin, n_cigar_op, flag, l_seq, cigar, seq, qual, auxiliary); §4.2.4 "SEQ and QUAL encoding" — 4-bit sequence encoding; §4.2.5 "Auxiliary data encoding" — tag types A/c/C/s/S/i/I/f/Z/H/B; §1.4 "The alignment section: mandatory fields" — FLAG bit definitions. See references.md.

Decoding

[SAM1] §4.2 "The BAM format" — block_size, fixed 32-byte header, variable-length field order

bam.record.decode

record.rs:16 compare_bam_with_htslib.rs:60 +1

A BAM record MUST be decoded from its binary representation: 32 fixed bytes followed by variable-length qname, CIGAR, sequence, quality, and auxiliary data. The 4-byte block_size prefix is read by the caller.

bam.record.max_size

region_buf.rs:435 region_buf.rs:565

BAM records have a 2 MiB size cap. The reader MUST reject records whose block_size exceeds this limit rather than allocating an unbounded buffer from untrusted input.

bam.record.checked_offsets

record.rs:203 record.rs:393 +1

Variable-length field offset calculations (var_start, cigar_end, seq_end, qual_end) MUST use checked arithmetic to prevent wrapping on malformed input. Overflow MUST return a decode error.

bam.record.fields

record.rs:15 compare_bam_with_htslib.rs:61 +2

The decoder MUST extract: tid (i32), pos (i32→i64), mapq (u8), flags (u16), seq_len (u32), n_cigar_ops (u16), qname (NUL-terminated, stored without NUL), packed sequence (4-bit), quality scores, CIGAR operations (packed u32), and raw auxiliary data.

bam.record.end_pos

record.rs:129 compare_bam_with_htslib.rs:83 +1

The decoder MUST precompute the reference end position from the CIGAR. Reference-consuming operations are: M(0), D(2), N(3), =(7), X(8). The end position is pos + ref_consumed - 1 (inclusive).

bam.record.zero_refspan

record.rs:130 pileup.rs:270

When a read has zero reference-consuming CIGAR operations (e.g., pure soft-clip or insertion-only), the reference span is zero. In this case end_pos MUST equal pos (matching htslib's bam_cigar2rlen which returns 0, placing the read at exactly one position). The implementation MUST NOT compute pos - 1 for such reads.

Sequence encoding

[SAM1] §4.2.4 "SEQ and QUAL encoding" — 4-bit nibble encoding, =ACMGRSVTWYHKDBN lookup table

BAM stores DNA sequences in a compact 4-bit encoding: two bases per byte, high nibble first. This halves storage compared to ASCII but requires decoding before use.

bam.record.seq_4bit

record.rs:75 compare_bam_with_htslib.rs:84 +1

BAM encodes sequences in 4-bit nibbles: two bases per byte (high nibble first). The standard lookup table maps 0→=, 1→A, 2→C, 4→G, 8→T, 15→N, and other values to IUPAC ambiguity codes.

bam.record.seq_at

record.rs:109 compare_records.rs:130

The record MUST support decoding a single base at a given read position.

bam.record.seq_at_simd+2

record.rs:76 compare_records.rs:131

The decoder MAY decode the full 4-bit sequence into Base enum values at construction time using SIMD-accelerated bulk decoding (see base_decode.md). Single-base access via seq_at(idx, pos) returns the pre-decoded Base value directly.

Flag access

[SAM1] §1.4 "The alignment section: mandatory fields" — FLAG bit table (0x4 unmapped, 0x10 reverse strand, 0x40 first in template, 0x80 second in template)

Each record has a 16-bit flags field encoding properties like strand orientation, pairing status, and mapping quality. These are checked frequently during pileup construction and filtering.

bam.record.flag_reverse

record.rs:89 compare_records.rs:157

is_reverse() MUST return true when the 0x10 bit is set.

bam.record.flag_first

record.rs:94 compare_records.rs:158

is_first_in_template() MUST return true when the 0x40 bit is set.

bam.record.flag_second

record.rs:99 compare_records.rs:159

is_second_in_template() MUST return true when the 0x80 bit is set.

bam.record.flag_unmapped

record.rs:104 compare_records.rs:160

is_unmapped() MUST return true when the 0x4 bit is set.

Auxiliary tags

[SAM1] §4.2.5 "Auxiliary data encoding" — tag byte layout, type codes A/c/C/s/S/i/I/f/Z/H/B

BAM records can carry optional key-value tags (e.g. XR:Z:CT for bismark strand, MD:Z:... for mismatch string). Tags are stored as a flat byte array at the end of the record, each prefixed by a 2-byte name and a 1-byte type code.

bam.record.aux_parse

record.rs:114 compare_records.rs:192

The record MUST support looking up auxiliary tags by their 2-byte name. Tag types A, c, C, s, S, i, I, f, d, Z, H, and B (typed array) MUST be supported.

bam.record.aux_array_unsupported

B-type (array) auxiliary tags are not yet parsed into AuxValue. find_tag returns None for array tags. The tag bytes MUST be correctly skipped so that all subsequent tags in the record remain accessible.

bam.record.raw_aux

record.rs:259 compare_records.rs:193

The record MUST provide access to raw auxiliary data bytes for efficient filtering without full tag parsing.

Sequence Codec

BAM encodes DNA sequences in a compact 4-bit format: two bases per byte, high nibble first. A 150bp read takes 75 bytes instead of 150. The 16 possible nibble values map to the 4 standard bases (A=1, C=2, G=4, T=8), N (15), and IUPAC ambiguity codes for the remaining values.

Decoding these sequences is on the hot path: during pileup construction, the engine needs the base at each read's query position for every reference position. With ~30 reads at ~3 billion positions, that's ~90 billion base lookups for a whole-genome BAM. Encoding is needed when rewriting BAM files with modified sequences.

Sources: [SAM1] §4.2.4 "SEQ and QUAL encoding" — 4-bit encoding table =ACMGRSVTWYHKDBN → [0,15], high-nibble-first packing. The SIMD decode path and pair-table optimisation are seqair-specific. See references.md.

Decoding (4-bit → ASCII)

seq.decode_scalar

seq.rs:4 seq_codec.rs:8 +2

The scalar decoder MUST use a 16-entry lookup table mapping 4-bit codes to ASCII bases (=ACMGRSVTWYHKDBN). It MUST support decoding full sequences (all bases) and single bases at arbitrary positions.

seq.decode_pair_table

seq.rs:5 seq_codec.rs:37

For bulk decoding, a 256-entry pair table (DECODE_PAIR[byte] → [high_base, low_base]) SHOULD be used to decode two bases per byte lookup, avoiding repeated shift-and-mask operations.

seq.decode_simd

seq.rs:6 seq_codec.rs:56

On platforms with SIMD support (SSSE3 on x86_64, NEON on aarch64), bulk sequence decoding MUST use SIMD table-lookup instructions (pshufb / vqtbl1q_u8) to decode 32 bases per iteration from 16 packed bytes. A scalar tail MUST handle the remaining bytes.

seq.decode_dispatch

seq.rs:7 seq_codec.rs:57

Runtime dispatch MUST select the fastest available decoder: SSSE3 on x86_64 (with feature detection), NEON on aarch64 (always available), scalar fallback on all other platforms.

Encoding (ASCII → 4-bit)

seq.encode_scalar

seq.rs:8 seq_codec.rs:112 +2

The scalar encoder MUST use a 256-entry lookup table mapping ASCII bytes to 4-bit codes. Unknown bases MUST map to 15 (N). Two bases MUST be packed per byte (high nibble first).

Correctness

seq.simd_scalar_equivalence

seq.rs:9 seq.rs:504 +4

The SIMD decoder MUST produce byte-identical output to the scalar decoder for all inputs. This MUST be verified by property-based tests covering arbitrary sequence lengths and byte patterns, including SIMD boundary lengths (0, 1, 15, 16, 31, 32, 33, 63, 64, 65, 128).

Base-typed Sequence Decoding

Sources: [SAM1] §4.2.4 "SEQ and QUAL encoding" — 4-bit nibble-to-base mapping (=ACMGRSVTWYHKDBN). The Base enum, SIMD acceleration, and direct-to-Base decode table are seqair-specific design choices; the upstream spec defines only the nibble codes. See references.md.

Background

BAM sequences are stored as 4-bit nibbles (two bases per byte). The standard decode table maps nibbles to ASCII characters (=ACMGRSVTWYHKDBN). Downstream, the ASCII bytes must be converted to Base enum values (A, C, G, T, Unknown) at every pileup position — billions of times for whole-genome data.

By decoding directly into Base values at record construction time, we:

Eliminate the per-position Base::from(u8) conversion (a LUT lookup per read per position)
Get compile-time type safety — a Base is guaranteed to be one of 5 valid values, not an arbitrary byte
Collapse IUPAC ambiguity codes (M, R, W, S, Y, K, etc.) and = to Base::Unknown at decode time, instead of at each use site

Decode table

base_decode.table

seq.rs:108 base_decode.rs:6 +2

A 16-entry lookup table MUST map 4-bit BAM nibble codes directly to Base discriminant values: 1→A(65), 2→C(67), 4→G(71), 8→T(84), and all other codes (0, 3, 5, 6, 7, 9–14, 15) → Unknown(78).

Table invariant

base_decode.table_invariant

seq.rs:109 seq.rs:474 +1

Every entry in DECODE_BASE_TYPED (16 entries) and DECODE_PAIR_TYPED (256 × 2 = 512 bytes) MUST be a valid Base discriminant: A(65), C(67), G(71), T(84), or Unknown(78). This MUST be verified by an exhaustive test covering all entries.

Decoding function

base_decode.decode

seq.rs:150 base_decode.rs:7 +2

decode_bases(packed_seq, seq_len) MUST decode a 4-bit packed BAM sequence into a Vec<Base>. The function SHOULD use the SIMD-accelerated decode path (SSSE3/NEON) with a Base-valued lookup table for identical throughput to the ASCII decoder.

Storage in RecordStore

base_decode.slab

record_store.rs:63 base_decode.rs:47

The RecordStore MUST store decoded bases in a dedicated slab (Vec<Base>) separate from the data slab. The data slab continues to hold cigar, qual, and aux data. The seq_at method MUST return Base instead of u8.

PileupAlignment

base_decode.alignment

pileup.rs:96 base_decode.rs:60

PileupAlignment.base MUST be of type Base, not u8. This eliminates the Base::from(u8) conversion in the metrics accumulation hot loop.

ASCII-to-Base batch conversion

Reference sequences (FASTA) and SAM text sequences arrive as ASCII bytes, not 4-bit packed nibbles. These need the same one-time batch conversion to Base that BAM sequences get via decode_bases, but using an ASCII-aware path instead of the nibble decode table.

base_decode.ascii_batch

base.rs:76 base.rs:573 +2

Base::from_ascii_vec(Vec<u8>) -> Vec<Base> MUST convert a vector of ASCII bytes to Base values in-place (reusing the allocation). Maps A/a→A, C/c→C, G/g→G, T/t→T, all other bytes→Unknown. The function MUST NOT allocate a new vector — it operates on the input's buffer and transmutes the result. This is safe because Base is repr(u8) and the function only writes valid Base discriminant bytes (65, 67, 71, 78, 84).

base_decode.ascii_simd

base.rs:77 base.rs:536 +1

The ASCII batch converter MUST use SIMD acceleration (SSSE3 on x86_64, NEON on aarch64) with a scalar fallback, following the same dispatch pattern as decode_bases. This is the required path for all bulk u8→Base conversions outside BAM 4-bit decoding: FASTA reference sequences, SAM text sequences, and any other ASCII-encoded base data.

base_decode.ascii_scalar_equivalence

base.rs:114 base.rs:537 +2

The SIMD ASCII converter MUST produce identical output to applying Base::from(u8) element-wise. This MUST be verified by property-based tests covering arbitrary byte values and lengths including SIMD boundary lengths (0, 1, 15, 16, 31, 32, 33, 63, 64, 65, 128).

FromStr validation

types.base.from_str_validation+2

base.rs:393 base.rs:629 +2

FromStr for Base MUST accept only a single character A, C, G, T, or N (case-insensitive) after trimming whitespace. Multi-character input (after trimming) MUST return Err(BaseError::MultipleChars). Any single character that is not A/C/G/T/N MUST return Err(BaseError::InvalidBase(byte)) where byte is the first byte of the trimmed input. Empty input MUST return Err(BaseError::Empty).

Error display

types.base.error_display

base.rs:483 base.rs:663

BaseError variants MUST display the actual invalid value. InvalidBase(byte) SHOULD format as "Invalid base: 0x{byte:02x}". Empty SHOULD format as "Empty".

CRAM Reader

A CRAM reader for region-based random access. CRAM is a reference-based, column-oriented compression format that achieves significantly smaller file sizes than BAM (typically 40-60% of BAM size) at the cost of decoding complexity and reference dependency.

Sources: [CRAM3] throughout — file structure, containers, slices, blocks, variable-length integers, bit stream, compression header, data series encodings, record decode order, read features, CRAI index. [CRAMcodecs] — rANS 4×8 (§2), rANS Nx16 (§3), arithmetic coder (§4), tok3 (§5), fqzcomp (§6). See references.md.

Context

CRAM stores alignment data as differences against a reference genome rather than full sequences. Where BAM stores ACGTACGT for a read, CRAM stores "matches reference at this position, except substitution at offset 5". This reference-based approach, combined with column-oriented storage and sophisticated entropy coding, makes CRAM the most compact alignment format.

Rastair needs CRAM support because many large-scale sequencing projects store data exclusively in CRAM format.

Goals and scope

cram.scope.versions

reader.rs:230 compare_cram_with_htslib.rs:110

The reader MUST support CRAM v3.0 and v3.1. CRAM v2.0 and v2.1 MAY be supported but are not required — they are rarely encountered in practice. The magic bytes CRAM followed by major version 3 MUST be accepted; major version 2 SHOULD produce a warning; other versions MUST be rejected.

The reader is read-only. CRAM writing is out of scope — rastair writes BAM (via htslib) for output.

cram.scope.reference_required

reader.rs:246 compare_cram_parsing.rs:145

CRAM decoding requires access to the reference genome (FASTA). The reader MUST accept a FASTA reader (the one being built on the other branch) for sequence lookups. Embedded reference slices (where the reference is stored in the CRAM container itself) MUST also be supported as a fallback.

Note: htslib supports automatic reference download via REF_PATH/REF_CACHE environment variables and the EBI reference server. This is out of scope for seqair, but when the FASTA reader is missing or the required contig is absent, the error message SHOULD mention REF_PATH/REF_CACHE as an alternative for users familiar with the htslib workflow.

File structure

[CRAM3] §6 "File definition" — CRAM magic, major/minor version, file ID

File definition

cram.file.magic

reader.rs:223 compare_cram_parsing.rs:21

The file MUST start with CRAM (bytes 43 52 41 4d), followed by major version (1 byte) and minor version (1 byte), followed by a 20-byte file ID.

cram.file.header_container

reader.rs:418 container.rs:245 +2

The first container after the file definition MUST be the file header container. It contains a single block (content_type=FILE_HEADER) whose data is:

header_text_length (i32, fixed-width little-endian): byte length of the SAM header text
header_text (UTF-8 bytes): the SAM header text (@HD, @SQ, @RG, etc.)

The header text is parsed via BamHeader::from_sam_text() to populate target names, lengths, and tid mappings. Note: the i32 length prefix is part of the block's decompressed data, not the block header itself.

cram.file.eof

container.rs:123 container.rs:207

The file MUST end with an EOF container. The EOF container produced by samtools has length=15 (not 0) and num_records=0. The reader MUST detect the EOF container as any container where num_records=0 and length <= 15. The reader SHOULD verify the EOF marker but MUST NOT fail if it's missing (truncated files should produce a warning, not a hard error).

Containers

[CRAM3] §7 "Container header structure" — length (fixed i32), itf8 fields, landmarks, CRC32

cram.container.header

container.rs:34 container.rs:157 +2

Each container header contains:

length (i32, fixed-width little-endian): total byte length of all blocks in the container. This is the only fixed-width field in the header; all subsequent fields use variable-length encoding.
reference_sequence_id (itf8): reference ID, or -1 (unmapped), or -2 (multi-ref)
alignment_start (itf8): 1-based start position on reference
alignment_span (itf8): number of reference bases covered
num_records (itf8): number of alignment records
record_counter (ltf8): global 0-based record index (v3.0+)
bases (ltf8): total bases in container (v3.0+)
num_blocks (itf8): block count
landmarks (itf8 array): count (itf8) followed by that many itf8 values; byte offsets from container data start to each slice header
crc32 (4 bytes, fixed-width little-endian): CRC32 over all preceding header bytes (from length through end of landmarks)

cram.container.region_skip

reader.rs:290 compare_cram_with_htslib.rs:111

During fetch_into, the reader MUST skip containers whose [alignment_start, alignment_start + alignment_span) range does not overlap the query region. The CRAI index provides byte offsets for this.

Slices

[CRAM3] §8.4 "Slice header block" — slice header fields, multi-ref slices, embedded reference, reference_md5

cram.slice.header

slice.rs:31 compare_cram_with_htslib.rs:107

Each slice header contains:

reference_sequence_id (itf8): ref ID for this slice, -1 unmapped, -2 multi-ref
alignment_start, alignment_span (itf8): genomic range covered
num_records (itf8): records in this slice
record_counter (ltf8): global record counter
num_blocks (itf8): data blocks in this slice
block_content_ids (itf8 array): content IDs of the blocks
embedded_reference (itf8): -1 if no embedded reference, else block content ID
reference_md5 (16 bytes): MD5 of the reference segment (all zeros if N/A)
optional tags (BAM-format aux bytes)

cram.slice.multi_ref

slice.rs:249 compare_cram_with_htslib.rs:108

When reference_sequence_id == -2, the slice contains records from multiple references. The RI (reference ID) data series MUST be decoded per-record to determine which reference each record aligns to. Multi-ref slices are common in coordinate-sorted CRAM files at chromosome boundaries and for reads with unmapped mates.

cram.slice.embedded_ref

slice.rs:161 compare_cram_parsing.rs:205

When embedded_reference >= 0, the corresponding external data block contains the reference bases for this slice. The reader MUST use these instead of the FASTA reader. This enables CRAM files to be self-contained (produced by samtools view --output-fmt-option embed_ref or cramtools).

Blocks

[CRAM3] §8 "Block structure" — method, content_type, content_id, compressed_size, uncompressed_size, data, CRC32 (v3.0+)

cram.block.structure

block.rs:39 block.rs:201

Each block contains:

method (byte): compression codec (see codecs section)
content_type (byte): 0=FILE_HEADER, 1=COMPRESSION_HEADER, 2=SLICE_HEADER, 4=EXTERNAL, 5=CORE
content_id (itf8): identifies which data series this block holds
compressed_size, uncompressed_size (itf8)
data (bytes): compressed payload
crc32 (4 bytes): CRC32 of the block (v3.0+)

cram.block.crc32

block.rs:40 block.rs:314

CRC32 MUST be verified on every block in v3.0+. The CRC32 covers all bytes from the start of the block (method byte) through the end of the compressed data, excluding the 4 CRC32 bytes themselves. Mismatches MUST produce an error with the block's content_type and content_id for diagnostics.

Variable-length integers

[CRAM3] §2.1 "Logical data types" — ITF8 (up to 32 bits) and LTF8 (up to 64 bits) variable-length integer encodings

cram.itf8

varint.rs:6 varint.rs:286

ITF8 (up to 32 bits) MUST be decoded as:

0xxxxxxx: 1 byte, value = bits 6..0 (7 bits)
10xxxxxx + 1 byte: 2 bytes, value = (b0 & 0x3F) << 8 | b1 (14 bits)
110xxxxx + 2 bytes: 3 bytes, value = (b0 & 0x1F) << 16 | b1 << 8 | b2 (21 bits)
1110xxxx + 3 bytes: 4 bytes, value = (b0 & 0x0F) << 24 | b1 << 16 | b2 << 8 | b3 (28 bits)
1111xxxx + 4 bytes: 5 bytes, value = (b0 & 0x0F) << 28 | b1 << 20 | b2 << 12 | b3 << 4 | (b4 & 0x0F) (32 bits)

Note: in the 5-byte form, only the low 4 bits of byte 4 are used. The high 4 bits of byte 4 are unused/ignored.

The decoded value is an unsigned u32. For fields that can be negative (e.g., reference_sequence_id where -1=unmapped, -2=multi-ref), the u32 MUST be reinterpreted as signed i32 via two's complement (value as i32).

cram.ltf8

varint.rs:94 varint.rs:388

LTF8 (up to 64 bits) MUST be decoded with the same leading-bit pattern extended to 9 bytes, with 0xFF prefix indicating 8 continuation bytes.

Bit stream reading

[CRAM3] §2.2 "Reading and writing bits in a bit stream" — MSB-first bit ordering

cram.bitstream

bitstream.rs:4 bitstream.rs:64

The core data block (content_type=5) is read as a bit stream. Bits are read MSB-first (most significant bit first) within each byte:

Bit offset 0 is the most significant bit (0x80)
Bit offset 7 is the least significant bit (0x01)
Reading N bits shifts them out from the current position, crossing byte boundaries as needed

Encodings that read from the core (Huffman, Beta, Subexp, Gamma) all use this bit stream. Encodings that read from external blocks (External, ByteArrayStop) use byte-level access to the specific external block identified by content_id.

The core bit stream is shared across all records in a slice — fields are interleaved sequentially. Reading fields in the wrong order corrupts all subsequent data.

Compression header

[CRAM3] §8.3 "Compression header block" — preservation map, data series encoding map, tag encoding map

The compression header is a block (content_type=1) at the start of each container. It defines how all data in the container's slices is encoded.

Preservation map

[CRAM3] §8.3 "Preservation map" — RN, AP, RR, SM (substitution matrix), TD (tag dictionary)

cram.compression.preservation

compression_header.rs:136 compare_cram_parsing.rs:140

The preservation map MUST be decoded as a series of key-value pairs:

RN (bool): whether read names are preserved (if false, names are generated)
AP (bool): whether alignment positions are delta-coded (true) or absolute
RR (bool): whether the reference is required for decoding
SM (5 bytes): substitution matrix (see r[cram.compression.substitution_matrix])
TD (bytes): tag dictionary — null-separated lists of 3-byte entries (2-char tag name + 1-char BAM type)

cram.compression.substitution_matrix

compression_header.rs:55 compare_cram_parsing.rs:273

The substitution matrix is 5 bytes, one per reference base in order: A, C, G, T, N. Each byte encodes the ordering of substitution alternatives using 2-bit codes packed MSB-first:

byte = (alt0 << 6) | (alt1 << 4) | (alt2 << 2) | alt3

where each 2-bit value maps to a base: 0=A, 1=C, 2=G, 3=T. When decoding a substitution (BS data series value = code), the read base is \alt[code] from this ordering.

Example: for reference base C, the possible substitutions are A, G, T (not C itself). If the substitution byte is 0b_00_10_11_01 (0x2D):

code 0 → bits 00 → A
code 1 → bits 10 → G
code 2 → bits 11 → T
code 3 → bits 01 → C (unused for ref=C, but the slot exists)

For non-N reference bases, only codes 0-2 are meaningful (3 alternatives, excluding the reference base itself). Code 3 exists in the byte but is never emitted by conforming CRAM writers — the decoder MAY treat it as an error or map it to the remaining unused base. For reference base N, all 4 codes are meaningful (A, C, G, T).

Data series encoding map

[CRAM3] §8.3 "Data series encodings" — two-byte data series codes and their types (BF, CF, RI, RL, AP, RG, RN, MF, NS, NP, TS, NF, FN, FC, FP, MQ, BA, QS, BS, IN, SC, DL, RS, PD, HC, BB, QQ, TL)

cram.compression.ds_encodings

compression_header.rs:241 compare_cram_parsing.rs:141

The data series encoding map MUST be decoded as key-value pairs where:

Key: 2-byte data series code
Value: encoding descriptor (encoding ID + parameters)

Complete data series list:

Code	Name	Type	Description
`BF`	BAM flags	int	BAM FLAG field (u16)
`CF`	CRAM flags	int	CRAM-specific flags
`RI`	Ref ID	int	Reference sequence ID (multi-ref slices only)
`RL`	Read length	int	Number of bases
`AP`	Alignment pos	int	Position (delta or absolute per preservation map)
`RG`	Read group	int	Index into @RG header entries
`RN`	Read name	byte array	QNAME
`MF`	Mate flags	int	0x1=mate reverse, 0x2=mate unmapped
`NS`	Next segment ref	int	Mate reference ID
`NP`	Next mate pos	int	Mate position
`TS`	Template size	int	Insert size
`NF`	Next fragment	int	Record-skip to next fragment (attached mates)
`FN`	Feature count	int	Number of read features
`FC`	Feature code	byte	Feature type character
`FP`	Feature position	int	Position within read (delta from previous feature)
`MQ`	Mapping quality	int	MAPQ
`BA`	Base	byte	Single base
`QS`	Quality score	byte	Single quality value
`BS`	Substitution code	byte	Index into substitution matrix
`IN`	Insertion	byte array	Inserted bases
`SC`	Soft clip	byte array	Soft-clipped bases
`DL`	Deletion length	int	Number of deleted reference bases
`RS`	Ref skip	int	Reference skip length (intron)
`PD`	Padding	int	Padding length
`HC`	Hard clip	int	Hard clip length
`BB`	Bases block	byte array	Stretch of bases
`QQ`	Quality block	byte array	Stretch of quality scores
`TL`	Tag line	int	Index into tag dictionary

Tag encoding map

[CRAM3] §8.3 "Tag encodings" — ITF8 key as (char1 << 16) + (char2 << 8) + type_char, BAM-format tag values

cram.compression.tag_encodings

compression_header.rs:348 compare_cram_parsing.rs:142

The tag encoding map uses ITF8 keys computed as (char1 << 16) + (char2 << 8) + type_char. Each key maps to an encoding descriptor for that tag type. Tag values are stored as raw BAM-format bytes (without the 3-byte tag+type prefix).

Base modification tags (MM/ML) use Z and B:C types respectively and can be very large on long reads (thousands of bytes). The decoder MUST handle arbitrary-length tag values.

Encoding types

[CRAM3] §13 "Encodings" — NULL (0), EXTERNAL (1), HUFFMAN (3), BYTE_ARRAY_LEN (4), BYTE_ARRAY_STOP (5), BETA (6), SUBEXP (7), GAMMA (9)

cram.encoding.null

encoding.rs:212 encoding.rs:621

Encoding ID 0 (NULL): no data. Used for absent data series.

cram.encoding.external

encoding.rs:214 +1 encoding.rs:576

Encoding ID 1 (EXTERNAL): reads raw bytes from an external data block identified by content_id. The most common encoding — used for quality scores, bases, tags, etc.

cram.encoding.huffman

encoding.rs:221 encoding.rs:491

Encoding ID 3 (HUFFMAN): canonical Huffman coding. Parameters: alphabet (itf8 array) and bit lengths (itf8 array). Reads from the core data bit stream.

Canonical Huffman construction: sort by (bit_length, symbol_value), assign codes starting from 0 with incrementing and left-shifting when bit length increases. A single-symbol Huffman code has bit length 0 and consumes 0 bits from the stream — this is common for fields where all records have the same value (e.g., all flags identical).

cram.encoding.byte_array_len

encoding.rs:369 encoding.rs:605

Encoding ID 4 (BYTE_ARRAY_LEN): a length encoding followed by a value encoding. The length is decoded first (from any encoding), then that many bytes are decoded using the value encoding.

cram.encoding.byte_array_stop

encoding.rs:378 encoding.rs:595

Encoding ID 5 (BYTE_ARRAY_STOP): reads bytes from an external block until a stop byte is encountered. Used for read names (stop=0x00) and variable-length strings.

cram.encoding.beta

encoding.rs:225 encoding.rs:559

Encoding ID 6 (BETA): fixed-width integer. Parameters: offset and number of bits. Reads from core bit stream.

cram.encoding.subexp

encoding.rs:233 compare_cram_parsing.rs:144

Encoding ID 7 (SUBEXP): sub-exponential code. Parameters: offset (itf8) and K (itf8). Reads from core bit stream. Decode procedure:

Read unary: count consecutive 1 bits until a 0 bit is read. Call this count n.
If n == 0: read K bits from the stream as the value.
If n > 0: read n + K - 1 bits from the stream as remainder. The value is (1 << (n + K - 1)) - (1 << K) + remainder.
The final decoded value is value - offset.

cram.encoding.gamma

encoding.rs:239 compare_cram_parsing.rs:143

Encoding ID 9 (GAMMA): Elias gamma code with offset. Parameters: offset (itf8). Reads from core bit stream. Decode procedure:

Read unary: count consecutive 0 bits until a 1 bit is read. Call this count n.
If n == 0: the value is 0.
If n > 0: read n more bits from the stream as suffix. The value is (1 << n) | suffix - 1.
The final decoded value is value - offset.

Compression codecs

[CRAM3] §14 "External compression methods" — RAW (0), GZIP (1), BZIP2 (2), LZMA (3), rANS4x8 (4), rANS Nx16 (5), arithmetic coder (6), fqzcomp (7), tok3 (8) [CRAMcodecs] §2 "rANS 4x8" — frequency table RLE, order-0/order-1 modes, chunk-based order-1 interleaving [CRAMcodecs] §3 "rANS Nx16" — flags byte (ORDER, N32, STRIPE, NoSize, CAT, RLE, PACK) [CRAMcodecs] §5 "Name tokenisation codec" — tok3

cram.codec.raw

block.rs:121 block.rs:202

Method 0 (RAW): uncompressed data.

cram.codec.gzip

block.rs:123 block.rs:231

Method 1 (GZIP): deflate compression. MUST be supported — this is the most common codec.

cram.codec.bzip2

block.rs:132 block.rs:263

Method 2 (BZIP2): bzip2 compression. MUST be supported for v3.0 compatibility.

cram.codec.lzma

block.rs:143 block.rs:289

Method 3 (LZMA): LZMA compression. MUST be supported for v3.0 compatibility.

cram.codec.rans4x8

block.rs:153 +1 rans.rs:276 +1

Method 4 (rANS 4×8): rANS entropy coder with 4-way interleaved states and 8-bit renormalization (L=2^23). Supports order-0 (context-free) and order-1 (previous-symbol context) modes. MUST be supported — this is the default codec in CRAM v3.0 writers including samtools.

The frequency table header uses a compact run-length encoding: after reading a symbol and its frequency, if the next symbol is prev_sym + 1, a run length byte follows, then that many consecutive symbols' frequencies. The terminator is a NUL (0x00) byte. Order-1 nests this same structure: the outer loop reads context symbols (with the same run-length scheme), and for each context reads an order-0 frequency table.

Order-1 interleaving is chunk-based, not round-robin: the output is split into 4 equal segments, each decoded by one state sequentially. Remainder bytes (when len % 4 != 0) are handled by state 3.

cram.codec.rans_nx16

block.rs:155 +1 rans_nx16.rs:589

Method 5 (rANS Nx16): v3.1 codec. rANS with N-way interleaving (N=4 or 32), 16-bit renormalization (L=2^15), and optional transforms controlled by an 8-bit flags byte: ORDER(1), N32(4), STRIPE(8), NoSize(16), CAT(32), RLE(64), PACK(128). Transforms are applied in order: decode entropy → unRLE → unPack. MUST be supported for v3.1 files — samtools uses this for most data blocks in v3.1 output.

cram.codec.arith

Method 6 (arithmetic coder): v3.1 adaptive arithmetic coder. SHOULD be supported.

cram.codec.fqzcomp

Method 7 (fqzcomp): v3.1 quality-score-specific compressor. SHOULD be supported.

cram.codec.tok3

block.rs:157 +1 tok3.rs:464

Method 8 (tok3): v3.1 read-name tokeniser. MUST be supported for v3.1 files — samtools uses tok3 for read name blocks by default in v3.1 output. Without tok3 support, v3.1 CRAM files produced by samtools view -C cannot be read.

cram.codec.unknown

block.rs:159 block.rs:329

Unknown codec methods MUST produce a clear error naming the method ID and suggesting conversion to BAM (samtools view -b).

Record decoding

[CRAM3] §10 "Record structure" — CRAM record layout and field decode order

Decoding order

[CRAM3] §10.1 "CRAM record" — strict field decode order (BF, CF, RI, RL, AP, RG, QS, RN, mate info, TL, FN/FC/FP, MQ, per-base QS)

cram.record.decode_order

slice.rs:216 compare_cram_parsing.rs:195 +2

Records within a slice MUST be decoded in a strict field order, because the core bit stream is sequential. Reading fields out of order corrupts all subsequent data. The order is:

BF (BAM flags)
CF (CRAM flags)
RI (reference ID — only if slice is multi-ref, i.e., ref_seq_id == -2)
RL (read length)
AP (alignment position — delta or absolute per preservation map)
RG (read group)
QS (quality scores — only if CRAM flags bit 0x1 is clear, meaning quality stored per-record not per-base)
RN (read name — only if preservation map RN=true)
Mate information:
- If CRAM flags bit 0x2 set (detached): MF, NS, NP, TS
- If CRAM flags bit 0x4 set (mate downstream): NF
TL (tag line index)
FN (feature count), then for each feature: FC, FP, feature-specific data
MQ (mapping quality)
QS per-base (only if CRAM flags bit 0x1 is set — quality as array)

Getting this order wrong is the most common implementation bug in CRAM readers.

Per-record data series

cram.record.flags

slice.rs:238 compare_cram_with_htslib.rs:102 +1

BF (BAM flags) and CF (CRAM flags) MUST be decoded per record. BAM flags map directly to the u16 flags in SlimRecord. CRAM flags are separate and encode:

Bit 0x1: quality scores stored as array (vs per-record)
Bit 0x2: detached (mate info stored inline)
Bit 0x4: has mate downstream in same slice
Bit 0x8: sequence not stored (unknown)

cram.record.position

slice.rs:257 compare_cram_with_htslib.rs:103 +1

AP (alignment position): if the preservation map AP=true (delta mode), each record's position is a delta from the previous record's position within the slice. The reader MUST maintain a running position accumulator and convert to absolute 0-based coordinates (CRAM positions are 1-based). Delta-coded positions can be negative (valid for supplementary alignments in coordinate-sorted CRAM).

cram.record.read_length

slice.rs:253 compare_cram_with_htslib.rs:105

RL (read length): the number of bases in the read. Used to allocate sequence and quality arrays. Long reads (PacBio/ONT) can be 10,000-1,000,000+ bases — buffer pre-allocation MUST NOT assume short-read sizes.

cram.record.read_group

slice.rs:271 compare_cram_with_htslib.rs:170

RG (read group): an index into the @RG entries in the header. Stored as itf8. The reader MUST convert to a RG:Z:<id> aux tag with the read group ID string.

cram.record.read_name

slice.rs:275 compare_cram_with_htslib.rs:106

RN (read name): if preservation map RN=true, names are stored. If RN=false, names are auto-generated. Rastair uses qnames for mate-pair dedup, so RN=false CRAM files MUST produce a warning that dedup will not work correctly. Generated names SHOULD be deterministic (e.g., cram_<record_counter>).

Read feature decoding

[CRAM3] §10.6 "Mapped reads" and §10.6.1 "Read feature codes" — feature codes X/I/i/D/S/B/b/Q/q/N/H/P and their data series

cram.record.features

slice.rs:430 compare_cram_with_htslib.rs:109

For each record, FN gives the number of read features. For each feature:

Decode FC (feature code byte)
Decode FP (position within read, delta-coded from the previous feature's position within the same record — first feature's FP is absolute within the read, starting at 1)
Decode feature-specific data:

FC	Char	Data series	Description
`X`	0x58	BS	Substitution: `base = substitution_matrix[ref_base][code]`
`I`	0x49	IN	Insertion: multi-base insertion
`i`	0x69	BA	Single-base insertion
`D`	0x44	DL	Deletion: skip DL reference bases
`S`	0x53	SC	Soft clip: bases not aligned to reference
`B`	0x42	BA + QS	Single base replacement + quality
`b`	0x62	BB + QQ	Stretch of bases + qualities
`Q`	0x51	QS	Quality score at position
`q`	0x71	QQ	Stretch of quality scores
`N`	0x4e	RS	Reference skip (intron): skip RS reference bases
`H`	0x48	HC	Hard clip: record HC bases clipped
`P`	0x50	PD	Padding: PD padding bases

Sequence reconstruction

[CRAM3] §10.6 "Mapped reads" — reference-based sequence reconstruction from read features; §11 "Reference sequences" — reference MD5 verification

cram.record.sequence

slice.rs:431 compare_cram_parsing.rs:196 +1

Read sequences are NOT stored directly. They MUST be reconstructed from the reference and read features:

Fetch the reference sequence for [alignment_start, alignment_start + alignment_span) from the FASTA reader (or embedded reference block).
Initialize the read's bases array with RL entries.
Walk the reference and read in parallel, applying features in order:
- Between features: copy reference bases to read bases (these are the matching segments).
- X (substitution): look up substitution_matrix[ref_base][BS_code] for the read base.
- I/i (insertion): insert bases from IN/BA into the read (these don't consume reference).
- D (deletion): skip DL reference bases (these don't consume read).
- S (soft clip): insert bases from SC into the read (no reference consumption).
- B (base+quality): replace base and quality at the feature position.
- N (ref skip): skip RS reference bases.
- H (hard clip): record length only (no bases in read or reference).
Convert reconstructed bases to Base enum values for the bases slab.

cram.record.quality

slice.rs:359 compare_cram_with_htslib.rs:145

Quality scores are decoded from QS (per-base) or QQ (stretch) data series, depending on CRAM flags bit 0x1. If CRAM flags bit 0x8 is set (sequence unknown), quality is unavailable (all 0xFF). Quality length MUST equal read length.

cram.feature.quality_score

When quality_as_array is false (CRAM flags bit 0x1 unset), per-position quality overrides are encoded as Q (0x51) read features. Each Q feature carries a single quality score from the QS data series at a specific read position. The decoder MUST apply these quality scores to qual_buf at the feature's read position (1-based FP, converted to 0-based index). Positions without a Q feature retain the default quality (0xFF). This is distinct from the bulk QS decode path used when quality_as_array is true.

cram.feature.quality_block

The q (0x71) read feature carries a stretch of quality scores from the QQ data series, starting at the feature's read position. The decoder MUST write the decoded quality bytes into qual_buf starting at the feature's 0-based read position. The number of quality scores in the block is determined by the QQ encoding. Like Q, this feature only appears when quality_as_array is false.

cram.record.cigar_reconstruction

slice.rs:432 compare_cram_parsing.rs:197 +1

CRAM does not store CIGAR explicitly. The CIGAR MUST be reconstructed from the read features:

Stretches of matching reference → M ops (or =/X if distinguishing is needed, but M is standard)
X (substitution) → part of M ops (substitutions are still alignment matches)
I/i → I ops
D → D ops
S → S ops
N → N ops
H → H ops
P → P ops

The reconstructed CIGAR MUST be packed into BAM u32 format for the data slab.

Mate information

[CRAM3] §10.4 "Mate records" — detached (CF bit 0x2) vs attached (NF) mate encoding

cram.record.mate_detached

slice.rs:280 compare_cram_with_htslib.rs:173

When CRAM flags bit 0x2 is set (detached), mate information is stored inline: MF (mate flags), NS (next ref ID), NP (next mate position), TS (template size). These fields are NOT needed for RecordStore (which doesn't store mate info), but the decoder MUST read and discard them to advance the data stream correctly.

cram.record.mate_attached

slice.rs:281 compare_cram_with_htslib.rs:174

When bit 0x2 is clear, the mate is "attached" — NF (next fragment distance) gives the record-skip to the mate within the same slice. The decoder MUST read NF but does not need to resolve the linkage for RecordStore purposes.

Auxiliary tags

[CRAM3] §10.5 "Auxiliary tags" — TL tag line index, tag dictionary, tag encoding map

cram.record.aux_tags

slice.rs:298 compare_cram_with_htslib.rs:171

TL (tag line index) selects which tag combination this record has from the tag dictionary in the preservation map. For each tag in the combination, the tag encoding map provides the encoding for its value. Tag values MUST be decoded and serialized to BAM binary aux format for storage in the data slab.

cram.record.rg_tag

slice.rs:317 compare_cram_with_htslib.rs:172

The RG data series is separate from aux tags. If a read group is present, the reader MUST emit an RG:Z:<id> aux tag using the read group ID from the header's @RG entries. This tag MUST be included in the aux slab alongside dictionary-decoded tags.

CRAM index (.crai)

[CRAM3] §12 "Indexing" — CRAI format: gzip-compressed TSV, 6 fields per line (ref_id, alignment_start, alignment_span, container_offset, slice_offset, slice_size)

cram.index.parse

index.rs:39 index.rs:149 +1

The .crai file is gzip-compressed TAB-delimited text. Each line has 6 fields:

Reference sequence ID (int)
Alignment start (int, 1-based matching the container header convention, 0 for unmapped)
Alignment span (int, 0 for unmapped)
Container byte offset in file (int, 0-based byte offset from start of file)
Slice byte offset relative to container data start (int)
Slice size in bytes (int)

cram.index.query

index.rs:73 index.rs:164 +1

Region queries MUST find all index entries where [start, start+span) overlaps the query region for the given reference ID. The reader then seeks to the container byte offset, skips to the slice offset, and decodes the slice.

cram.index.zero_span

index.rs:87 index.rs:203

CRAI entries with alignment_span == 0 (but alignment_start > 0) indicate unknown extent — this occurs when samtools writes CRAM with embedded references or when the span is not computed. These entries MUST be included in query results whenever entry.alignment_start < query_end, since the slice may contain records at any position past the start. Treating span=0 as "zero-width" would silently drop all records in that slice.

cram.index.multi_ref_slices

slice.rs:371 compare_cram_parsing.rs:118

Multi-ref slices produce multiple index entries (one per reference they span). A query for one reference may hit a multi-ref slice that also contains records from other references. The reader MUST filter records by reference ID after decoding.

cram.index.unmapped

index.rs:74 +1 index.rs:179

Unmapped records (ref ID = -1) have start=0, span=0. They are only returned if explicitly queried. For fetch_into, unmapped records MUST be skipped (same as BAM).

Edge cases

cram.edge.reference_mismatch

slice.rs:118 slice.rs:763

If the reference MD5 in the slice header does not match the MD5 of the FASTA sequence for that region, the reader MUST return an error. The MD5 is computed over the uppercase reference sequence for the exact range [alignment_start, alignment_start + alignment_span), with no newlines or other formatting. If the range extends beyond the reference length, this is a file/reference mismatch error.

cram.edge.missing_reference

reader.rs:379 reader.rs:499

If the FASTA reader cannot provide the reference sequence for a slice's region (e.g., contig not in FASTA), the reader MUST return an error unless the slice has an embedded reference. The error SHOULD mention REF_PATH/REF_CACHE as alternatives.

cram.edge.unknown_read_names

pileup.rs:172 +1

When RN=false in the preservation map, read names are not available. The reader MUST log a warning (once per slice). Records without read names are stored with empty qnames. The pileup engine's overlapping-pair dedup MUST skip records with empty or * qnames, since mate matching requires real qnames.

cram.edge.seq_unknown

slice.rs:402 compare_cram_parsing.rs:199

When CRAM flags bit 0x8 is set, the read's sequence is unknown (* in SAM). seq_len MUST be 0 and no bases are pushed to the bases slab.

cram.edge.empty_slice

slice.rs:113 compare_cram_parsing.rs:198

Slices with 0 records MUST be silently skipped.

cram.edge.position_overflow

slice.rs:267 compare_cram_with_htslib.rs:175

Delta-coded positions can produce negative values if records are not strictly sorted. The decoder MUST handle this gracefully (negative deltas are valid for supplementary alignments within a coordinate-sorted CRAM).

cram.edge.unmapped_reads

slice.rs:419 compare_cram_with_htslib.rs:176

Unmapped reads (flag 0x4) may appear in the last containers of the file, in slices with reference_sequence_id = -1. Their bases are stored in the BA data series (no reference reconstruction). fetch_into MUST skip them, same as BAM.

cram.edge.long_reads

slice.rs:336 compare_cram_parsing.rs:201

PacBio and ONT CRAM files contain reads of 10,000-1,000,000+ bases. Read feature lists can have thousands of entries, and quality/sequence arrays are correspondingly large. Buffer pre-allocation MUST scale with RL, not assume short-read sizes.

cram.edge.coordinate_clamp

reader.rs:394 compare_cram_parsing.rs:200

Query coordinates passed to fetch_into are u64, but internal CRAM positions are i32/i64. When converting u64::MAX to i64 for overlap filtering, the cast wraps to -1, which silently filters out all records. The reader MUST clamp coordinates: end.min(i64::MAX as u64) as i64.

cram.slice.ref_bounds_warning+2

slice.rs:433 slice.rs:850

During sequence reconstruction, if the reference sequence is shorter than the positions accessed (i.e., ref_offset + ref_pos >= reference_seq.len()), the reader MUST log a warning on the first occurrence per reconstruction and fall back to b'N'. This may indicate a reference/CRAM mismatch but can legitimately occur near chromosome ends. The warning MUST be emitted at most once per record to avoid flooding logs.

cram.edge.rans_sym_overflow

rans.rs:145 rans.rs:277

The rANS 4x8 frequency table uses run-length encoding where a symbol counter sym: u8 is incremented for each run element. When sym == 255, sym += 1 overflows. The reader MUST use wrapping arithmetic (wrapping_add(1)) for this counter — the overflow is harmless because the loop terminates before using the overflowed value.

cram.codec.rans_sym_bounded+2

rans.rs:146

Symbol variables in rANS frequency table run-length reading MUST use u8 type, making overflow past 255 impossible by construction. After a run-length loop, sym may have wrapped to 0 via wrapping_add(1) when the previous value was 255. The code MUST NOT use the post-loop sym value as a table index if it has wrapped to 0. The run-length loop MUST break when sym would exceed 255.

cram.codec.state_step_safety

rans_nx16.rs:144

The rANS state_step function computes f * (s >> bits) + (s & mask) - g. For valid frequency tables produced by conforming CRAM writers, g <= f * (s >> bits) + (s & mask) always holds, so the subtraction does not underflow. The implementation MUST use wrapping_sub for robustness against malformed data in release mode, with a debug_assert! to catch violations during development.

cram.codec.normalize_checked

rans_nx16.rs:238 rans_nx16.rs:752

Frequency normalization (normalize_frequencies) sums all 256 frequency entries and doubles the sum in a loop. Both the sum and the doubling MUST use checked arithmetic to detect overflow from malformed frequency tables. Overflow MUST produce an error, not silent wraparound.

cram.codec.uint7_bounded

rans_nx16.rs:96 +1 rans_nx16.rs:735

The uint7 variable-length integer decoding loop MUST be bounded to at most 5 iterations. A u32 can hold at most 32 bits; 5 iterations of 7-bit groups produce 35 bits, which is the maximum meaningful input. More than 5 continuation bytes indicate malformed data and MUST produce an error.

cram.slice.validated_lengths

slice.rs:42 +1 slice.rs:874

Length fields decoded from ITF8 (e.g., num_content_ids, num_blocks, alignment_span) may be negative when interpreted as i32. Before using such values as usize for allocation or iteration, the reader MUST validate they are non-negative via i32::try_from or equivalent. Negative values MUST produce an error, not wrap to huge usize values causing OOM.

cram.tok3.dz_len_reader

tok3.rs:143 tok3.rs:640

The TokenReader::get() and get_mut() methods MUST return the same reader for every TokenType variant. In particular, DZLen MUST map to dz_len_reader in both methods. A mismatch causes the dup-copy path (which uses get()) to read from the wrong stream.

cram.tok3.name_count_limit

tok3.rs:21 tok3.rs:655

The name_count field in tok3 headers comes from untrusted data. Allocation based on name_count MUST be bounded to a reasonable limit (e.g., 10,000,000 names). The per-name token vector MUST be grown dynamically rather than pre-allocated to a fixed size, since the number of token positions varies per name.

Performance considerations

CRAM decoding is expected to be 2-4× slower than BAM due to codec complexity and reference reconstruction. The I/O savings (40-60% smaller files) often compensate on network filesystems.

cram.perf.slice_granularity

reader.rs:291 compare_cram_parsing.rs:202

CRAM random access is slice-granular, not record-granular. A region query may decompress an entire slice (typically 10,000 records) to extract a few records at the edges. This is inherently coarser than BAM's record-level seeking.

cram.perf.reference_caching

reader.rs:292 compare_cram_parsing.rs:203

Reference sequence lookups happen per-slice (to reconstruct all records in the slice). The reader SHOULD cache the most recently used reference region to avoid repeated FASTA lookups for consecutive slices on the same contig. This cache MUST be per-fork, not shared across threads.

cram.perf.codec_overhead

reader.rs:293 compare_cram_parsing.rs:204

rANS and arithmetic decoders have higher per-byte CPU cost than zlib. The reader SHOULD pre-allocate decode buffers and reuse them across blocks within a slice.

FASTA Reader

The indexed FASTA reader provides random access to reference genome sequences. It reads .fai index files to locate sequence data within plain or bgzip-compressed FASTA files, enabling efficient region queries without scanning the entire file.

Rastair fetches reference sequences once per segment (~100 KB) for each worker thread. The forkable design shares the parsed index across threads while giving each thread its own file handle, matching the IndexedBamReader pattern.

Sources: The FASTA format and FAI index are not covered by any hts-specs document. The FAI format is defined by samtools (samtools faidx); the byte offset formula and field layout are described in the samtools faidx man page and implemented in [htslib]. The GZI index format is a htslib convention for bgzip-compressed random-access files, implemented in [htslib] bgzf.c. BGZF decompression follows [SAM1] §4.1. The forking and buffer-reuse design are seqair-specific. See references.md.

FAI Index

The FAI index (.fai file) is a tab-delimited text file with one line per sequence. It enables O(1) lookup of any base position within the FASTA file by recording where each sequence's data starts and how lines are formatted.

Parsing

fasta.index.parse

index.rs:64 index.rs:197

The FAI parser MUST read a .fai file and produce an index with one entry per sequence. Each entry MUST contain: sequence name (String), length (number of bases, u64), byte offset of the first base in the file (u64), bases per line (u64), and bytes per line including line terminator (u64). The parser MUST NOT trim whitespace from field values — tabs are the only delimiter, and sequence names may contain spaces.

fasta.index.fields

index.rs:89 index.rs:224 +3

Each line MUST have exactly 5 tab-separated fields: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH. Lines with fewer or more fields MUST produce an error identifying the malformed line. Empty lines MUST be skipped.

fasta.index.validation

index.rs:116 index.rs:302 +2

The parser MUST reject entries where linewidth < linebases (line terminator would be negative), linebases == 0 (would cause division by zero in offset calculations), or length == 0 (empty sequences are not useful for region queries). Note: linewidth == linebases is valid — it means lines have no trailing newline character, which occurs when a sequence is stored on a single line without a mid-sequence newline.

fasta.index.name_lookup

index.rs:65 index.rs:198 +2

The index MUST support O(1) lookup of a sequence entry by name. Duplicate sequence names MUST produce an error at parse time.

Location

fasta.index.location

reader.rs:275 reader.rs:335 +1

For a FASTA file at path P, the reader MUST look for the index at P.fai (i.e., appending .fai to the full path including any .gz extension). This matches samtools/htslib convention: ref.fa → ref.fa.fai, ref.fa.gz → ref.fa.gz.fai.

fasta.index.missing

reader.rs:105 reader.rs:436

When the FAI index file is not found, the reader MUST produce a clear error message that names the expected index path and suggests running samtools faidx <fasta_path> to create it. The reader MUST NOT attempt to build the index itself. Rationale: auto-indexing hides user errors (wrong FASTA passed silently), fails on read-only shared reference directories (common on HPC clusters), introduces non-deterministic file creation and race conditions, and takes 10–30s for a human genome which makes the tool appear to hang with no feedback.

Byte offset calculation

fasta.index.offset_calculation

index.rs:58 fasta.rs:17

To find the byte offset of base at 0-based position pos within a sequence, the reader MUST compute: offset + (pos / linebases) * linewidth + (pos % linebases), where offset, linebases, and linewidth come from the FAI entry. This accounts for newline characters inserted at line boundaries.

The offset formula is correct for all valid positions including those on the last line of a sequence, even when the last line is shorter than linebases. This works because integer division gives the correct line number regardless of last-line length. The raw read range [byte_offset(start), byte_offset(last_pos) + 1) may include newline bytes from line boundaries; these are stripped after reading.

Sequence fetching

fasta.fetch.coordinates

reader.rs:84 fasta.rs:194

fetch_seq(name, start, stop) MUST use 0-based half-open coordinates [start, stop). A request for (name, 0, 4) returns the first 4 bases of the named sequence.

fasta.fetch.bounds_check

reader.rs:207 reader.rs:405 +3

The reader MUST verify that start < stop (zero-length and reversed ranges are invalid) and stop <= sequence_length. Out-of-bounds requests MUST produce an error naming the sequence, the requested range, and the actual sequence length.

fasta.fetch.uppercase

reader.rs:263 fasta.rs:193

Returned sequence bytes MUST be uppercased. FASTA files may contain lowercase bases (indicating soft-masking), but Seqair always works with uppercase bases.

fasta.fetch.newline_stripping

reader.rs:262 fasta.rs:192

When reading from the file, the reader MUST strip line terminators (LF \n and CR \r) from the raw bytes. The returned sequence MUST contain only base characters.

fasta.fetch.unknown_sequence

reader.rs:196 reader.rs:417 +1

When the requested sequence name is not in the index, the reader MUST produce an error listing the requested name. It SHOULD include available sequence names if the index is small (fewer than 20 entries) to help diagnose typos.

Raw byte return type

fasta.fetch.raw_bytes

reader.rs:176 +1 reader.rs:346 +1

fetch_seq and fetch_seq_into MUST return raw uppercase ASCII bytes (Vec<u8> / &mut Vec<u8>), not Base values. CRAM reference MD5 verification computes checksums on the exact byte sequence from the FASTA file — converting to Base and back is lossy because IUPAC ambiguity codes (M, R, S, etc.) collapse to Unknown. Any u8→Base conversion MUST happen downstream, after CRAM verification is complete.

Buffer reuse

fasta.fetch.buffer_reuse

reader.rs:177 +1 fasta.rs:319 +1

fetch_seq_into(name, start, stop, buf) MUST accept a caller-provided &mut Vec<u8> buffer, clear it, and write the fetched sequence into it. This avoids per-call allocation when fetching many segments. The fetch_seq convenience method MAY allocate a new Vec<u8> internally.

Plain FASTA reading

fasta.plain.read

reader.rs:230 fasta.rs:191

For uncompressed FASTA files, the reader MUST seek to the computed byte offset and read the required byte range in a single I/O operation, then strip newlines and uppercase the result.

fasta.plain.read_size

reader.rs:231 reader.rs:347 +2

The raw read size MUST account for embedded newlines. For a fetch of n bases starting at position pos, the number of raw bytes to read is: end_byte_offset - start_byte_offset where both offsets are computed via fasta.index.offset_calculation.

Bgzip FASTA reading

fasta.bgzf.detect

reader.rs:114 +1 compare_fasta_with_htslib.rs:29

The reader MUST auto-detect bgzip compression by reading the first 18 bytes of the file and checking for: (1) gzip magic bytes 1f 8b, (2) DEFLATE compression method 08, (3) FEXTRA flag 04, and (4) the BC subfield identifier at bytes 12–13. All four checks MUST pass to identify a file as BGZF. Plain gzip files with the FEXTRA flag but without the BC subfield MUST NOT be misidentified as BGZF.

fasta.bgzf.gzi_required

reader.rs:117 compare_fasta_with_htslib.rs:30 +1

For bgzip FASTA files, a GZI index (.gzi) MUST be present alongside the FASTA. The GZI index maps compressed block offsets to uncompressed byte offsets, enabling random access into the compressed data.

fasta.bgzf.gzi_missing

reader.rs:120 compare_fasta_with_htslib.rs:179

When the GZI index is not found for a bgzip FASTA, the reader MUST produce a clear error naming the expected GZI path and suggesting samtools faidx <fasta_path> (which creates both .fai and .gzi).

GZI index format

fasta.gzi.parse

gzi.rs:46 gzi.rs:180 +4

The GZI parser MUST read a binary file containing: an 8-byte little-endian entry count (u64), followed by that many pairs of 8-byte little-endian values: (compressed_offset: u64, uncompressed_offset: u64). Each pair records where a BGZF block starts in the compressed file and what uncompressed byte offset that corresponds to. The entry count multiplication MUST use checked arithmetic to prevent overflow on corrupted files.

fasta.gzi.sorted

gzi.rs:97 gzi.rs:283 +1

GZI entries MUST be sorted in strictly increasing order by uncompressed_offset. The parser MUST reject files with unsorted or duplicate entries, as translate() relies on binary search.

fasta.gzi.translate

gzi.rs:120 fasta.rs:115

Given an uncompressed byte offset, the GZI index MUST be able to find the BGZF block that contains it: find the last entry whose uncompressed_offset <= target, then the compressed block starts at that entry's compressed_offset, and the within-block offset is target - uncompressed_offset.

fasta.gzi.within_block_bounds

gzi.rs:144 gzi.rs:266

The within-block offset MUST fit in u16 (0–65535), matching the BGZF specification's maximum uncompressed block size of 65,536 bytes. If the computed within-block offset exceeds u16::MAX, the reader MUST return an error indicating the GZI index may be corrupt or missing entries. The reader MUST NOT silently truncate the offset.

Decompression

fasta.bgzf.decompress

reader.rs:233 fasta.rs:264

The reader MUST decompress BGZF blocks using the same libdeflater-based decompression as the BAM reader, including CRC32 verification.

fasta.bgzf.sequential_read

reader.rs:232 fasta.rs:265

For a fetch spanning multiple BGZF blocks, the reader MUST read and decompress blocks sequentially from the starting block until all requested bytes are obtained. Since FASTA fetches are contiguous byte ranges (unlike BAM which has disjoint index chunks), a simple sequential scan suffices.

Forking

fasta.fork.shared_state

reader.rs:71 reader.rs:457 +2

The FAI index and GZI index (if applicable) MUST be stored behind Arc so forked readers share a single parsed copy. These structures are read-only after initial parsing.

fasta.fork.operation

reader.rs:147 reader.rs:456 +1

fork() MUST produce a new reader that shares the same Arc as the source (no re-parsing) and opens a fresh File handle. Each fork MUST own its own read buffer.

fasta.fork.equivalence

reader.rs:455 fasta.rs:266

A forked reader MUST produce byte-identical results to independently opening the same FASTA file.

fasta.fork.independence

reader.rs:148 reader.rs:483 +1

Forked readers MUST be fully independent for mutable operations. Seeking on one MUST NOT affect any other.

Error handling

fasta.errors

reader.rs:19 reader.rs:406 +4

All FASTA-specific errors MUST contain context (file paths, sequence names, positions) sufficient to diagnose the problem without a debugger. The implementation MUST NOT use panic!, unwrap(), expect(), or unreachable!() — all error paths MUST propagate errors via Result.

SAM Reader

A bgzf-compressed, indexed SAM reader for region-based random access. Plain (uncompressed) SAM files cannot be indexed and are NOT supported — users must compress with bgzip first.

Sources: [SAM1] §1 "The SAM Format Specification" — SAM text format, mandatory fields, FLAG bits, CIGAR string, header lines (@HD/@SQ/@RG/@PG); §1.3 "The header section"; §1.4 "The alignment section: mandatory fields". [TABIX] — TBI index format and SAM column configuration. [SAM1] §4.1 — BGZF decompression (shared with BAM). The coordinate conversion (1-based → 0-based) and aux-to-BAM-binary serialization are seqair-specific. See references.md.

Context

SAM is the text representation of the same data BAM encodes in binary. A bgzf-compressed SAM file (.sam.gz) wraps SAM text in BGZF blocks, enabling random access via tabix (.tbi) or CSI (.csi) indexes. htslib reads these transparently; seqair needs to support them for backward compatibility after removing the htslib reading path.

The key difference from BAM: records are text lines that must be parsed field-by-field, and coordinates are 1-based (BAM is 0-based). The BGZF and index layers are shared with BAM.

Opening

sam.reader.open

reader.rs:147 sam_gz.rs:60

The SAM reader MUST open a bgzf-compressed SAM file, parse its header, and locate a tabix (.tbi) or CSI (.csi) index file. If neither index exists, return an error suggesting samtools index or tabix -p sam.

sam.reader.reject_uncompressed

reader.rs:150 reader.rs:865

If the file does not start with BGZF magic (1f 8b with BC subfield), the reader MUST reject it with an error explaining that plain SAM files must be bgzf-compressed for indexed access (bgzip file.sam).

Header parsing

sam.header.parse

reader.rs:598 compare_sam_with_htslib.rs:260 +1

The SAM header consists of all lines starting with @ at the beginning of the file. The reader MUST:

Read lines from the BGZF stream until a line does NOT start with @.
Record the BGZF virtual offset of the first alignment line (for seeking past the header).
Parse @SQ lines to extract target names (SN tag) and lengths (LN tag).
Construct a BamHeader from the parsed text via BamHeader::from_sam_text().

Lines starting with @ that appear after the first alignment line are NOT header continuation — they are malformed alignment lines and MUST be treated as parse errors.

sam.header.sq_required

reader.rs:599 compare_sam_with_htslib.rs:261 +1

If no @SQ lines are present, the reader MUST return an error. Without @SQ lines, target name→tid mapping is impossible and fetch_into cannot work.

sam.header.preserve_full_text

reader.rs:600 sam_gz.rs:162

The full header text (all @ lines joined with newlines) MUST be stored in BamHeader::header_text for downstream use (e.g., constructing htslib Header for BAM writing).

Alignment line parsing

[SAM1] §1.4 "The alignment section: mandatory fields" — 11 mandatory fields: QNAME, FLAG, RNAME, POS (1-based), MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL; QUAL Phred+33 encoding

sam.record.parse

reader.rs:305 compare_sam_with_htslib.rs:83 +1

Each alignment line MUST be split on TAB characters into the 11 mandatory fields plus optional tag fields. The parser MUST handle:

Field	SAM type	Conversion to internal representation
QNAME	string	stored as bytes in name slab
FLAG	integer	u16 (identical to BAM flags)
RNAME	string	converted to tid via header's `name_to_tid` map
POS	1-based integer	subtract 1 for 0-based internal pos (i64)
MAPQ	integer	u8 (255 means unavailable)
CIGAR	string	parsed to packed BAM CIGAR ops (u32 per op: `len << 4
RNEXT	string	`=` means same as RNAME, `*` means unavailable — not stored in RecordStore
PNEXT	integer	not stored in RecordStore (mate info)
TLEN	integer	not stored in RecordStore (template length)
SEQ	string	each ASCII char decoded to `Base` enum
QUAL	string	each char minus 33 for Phred score; `*` means all 0xFF

The parser MUST validate each field and produce clear error messages for malformed records: non-integer POS, FLAG > 65535, negative MAPQ, etc. Real-world SAM files from non-standard tools may contain hand-edited or corrupted records.

sam.record.coordinate_conversion

reader.rs:351 compare_sam_with_htslib.rs:84 +1

SAM POS is 1-based; the internal representation is 0-based (matching BAM). The parser MUST subtract 1 from POS. A POS of 0 in SAM means unmapped — after subtraction this becomes -1 (as i64). Unmapped records are filtered by FLAG 0x4 before reaching this point, so -1 positions should not appear in the RecordStore.

sam.record.cigar_parse

reader.rs:428 compare_sam_with_htslib.rs:109 +2

The CIGAR string MUST be parsed into the same packed u32 format used by BAM: each operation is (length << 4) | op_code, where op_codes are M=0, I=1, D=2, N=3, S=4, H=5, P=6, =7, X=8. The string * means CIGAR unavailable (0 operations, and end_pos = pos).

sam.record.seq_decode

reader.rs:379 compare_sam_with_htslib.rs:110 +3

SEQ characters MUST be decoded to Base values: A→A, C→C, G→G, T→T, all others (N, IUPAC codes)→Unknown. The string * means sequence unavailable (length 0).

Note: the SAM character = means "identical to the reference base at this position." This is an intentional lossy conversion to Unknown — matching BAM's behavior, which also stores = as a distinct 4-bit code but does not resolve it against the reference. Some tools (Illumina DRAGEN) use = extensively. If exact reference-matching base recovery is needed in the future, the = case would need reference lookup during parsing.

sam.record.qual_decode

reader.rs:388 compare_sam_with_htslib.rs:111 +3

QUAL characters MUST have 33 subtracted (Phred+33 encoding). The string * means quality unavailable; all quality bytes MUST be set to 0xFF (255). QUAL length MUST equal SEQ length when both are present.

sam.record.aux_tags

reader.rs:468 compare_sam_with_htslib.rs:151 +1

Optional fields after the 11th column follow the format TAG:TYPE:VALUE. They MUST be serialized to BAM binary aux format for storage in the data slab:

A (char): [tag0, tag1, 'A', char]
i (integer): [tag0, tag1, type, bytes...] where type is the smallest fitting BAM integer type:
- [-128, 127] → c (int8)
- [128, 255] → C (uint8)
- [-32768, -129] or [256, 32767] → s (int16)
- [32768, 65535] → S (uint16)
- [-2147483648, -32769] or [65536, 2147483647] → i (int32)
- [2147483648, 4294967295] → I (uint32)
f (float): [tag0, tag1, 'f', le_f32_bytes]
Z (string): [tag0, tag1, 'Z', string_bytes, 0x00]
H (hex): [tag0, tag1, 'H', hex_bytes, 0x00]
B (array): [tag0, tag1, 'B', subtype, count_le32, values...]

Note: the integer type selected here may differ from the type used by the original BAM writer, since SAM text loses the specific width information. This is an inherent limitation — see r[unified.push_fields_equivalence].

sam.record.aux_parse_strict

reader.rs:469 reader.rs:947

Malformed aux tag values MUST return an error, not silently default to zero. For integer (i) typed tags, if the value cannot be parsed as an integer, the reader MUST return a SamRecordError::InvalidAuxValue error.

sam.record.aux_int_range+2

reader.rs:570 reader.rs:980 +1

SAM aux integer values (type i) are serialized into the smallest BAM integer type that fits, using the ranges defined in r[sam.record.aux_tags]. Values that exceed the u32 range (i.e., val > u32::MAX or val < i32::MIN when interpreted as i64) cannot be represented in any BAM integer type and MUST return a SamRecordError::AuxIntOutOfRange error instead of silently truncating.

Region fetching

sam.reader.fetch_into

reader.rs:195 compare_sam_with_htslib.rs:82 +2

fetch_into(tid, start, end, store) MUST:

Query the tabix/CSI index for BGZF virtual offset ranges overlapping the region.
Use RegionBuf (or equivalent bulk read) to load compressed bytes.
Decompress BGZF blocks to text.
Split text into lines (on \n).
Parse each alignment line.
Filter: skip lines where RNAME's tid doesn't match, or where the record doesn't overlap [start, end].
Skip unmapped reads (FLAG 0x4).
Push passing records into the RecordStore.

sam.reader.overlap_filter+2

reader.rs:370 sam_gz.rs:62

The overlap filter uses half-open intervals (0-based). end_pos MUST be computed from the parsed CIGAR. When CIGAR is * (unavailable), end_pos = pos (point record).

sam.reader.overlap_halfopen

reader.rs:371 reader.rs:908 +2

A record overlaps [start, end) iff pos < end && end_pos > start. Records where pos == end or end_pos == start do NOT overlap and MUST be filtered out.

sam.reader.sorted_order

reader.rs:196 sam_edge_cases.rs:180

Records MUST be added to the store in coordinate-sorted order. Since bgzf-compressed indexed SAM files are coordinate-sorted by construction (indexing requires it), the natural parse order is correct.

Tabix / CSI index

[TABIX] — TBI magic TBI\x01, header fields, n_ref, bin/chunk/linear-index structure, format=1 for SAM [CSI] — CSI magic CSI\x01, min_shift, depth, l_aux, per-bin loffset

sam.index.tabix

reader.rs:624 sam_edge_cases.rs:223

The reader MUST support tabix (.tbi) indexes. Tabix is a generic index for bgzf-compressed, TAB-delimited files. For SAM, the format code is 1 (not 0/generic). The column configuration (seq_col=3, beg_col=4, end_col=0) is implicit for format=1 and may be ignored by some implementations; the reader SHOULD NOT reject a tabix file based on stored column values when format=1 is present.

Tabix files are BGZF-compressed and MUST be decompressed before parsing.

sam.index.csi+2

The reader SHOULD also support CSI (.csi) indexes, which generalize the binning scheme with configurable min_shift and depth parameters. CSI supports references longer than 512 Mbp (BAI's limit). CSI files are BGZF-compressed and MUST be decompressed before parsing. Not yet implemented — see r[csi.*] rules in tabix_index.md.

sam.index.locate+2

reader.rs:627 sam_edge_cases.rs:222

Index file lookup order: <path>.tbi, <path>.bai. If neither exists, return a clear error. CSI (.csi) lookup should be added when CSI support is implemented.

Edge cases

sam.edge.missing_seq

reader.rs:380 sam_edge_cases.rs:102

When SEQ is *, the record has no sequence data. seq_len MUST be 0 and no bases are pushed to the bases slab. This is valid per the SAM spec for secondary alignments that omit sequence.

sam.edge.missing_qual

reader.rs:389 sam_edge_cases.rs:127

When QUAL is *, all quality bytes MUST be 0xFF. This matches BAM's representation of unavailable quality.

sam.edge.long_cigar

reader.rs:429 sam_proptests.rs:59

BAM encodes CIGARs with >65535 operations using a CG:B:I aux tag and a placeholder CIGAR. SAM has no such limitation — the CIGAR string can be arbitrarily long. The parser MUST handle CIGARs of any length.

sam.edge.empty_lines

reader.rs:250 sam_edge_cases.rs:39

Blank lines (empty or whitespace-only) between alignment records MUST be silently skipped.

sam.edge.rname_star

reader.rs:334 sam_edge_cases.rs:65

RNAME of * means unmapped. Combined with FLAG 0x4, these records MUST be filtered out by fetch_into (same as BAM's unmapped filtering). RNAME * without FLAG 0x4 is malformed but SHOULD be treated as unmapped.

sam.edge.line_spanning_blocks

reader.rs:601 sam_edge_cases.rs:242

A single SAM alignment line may span multiple BGZF blocks. The reader MUST buffer partial lines across block boundaries and only parse complete lines (terminated by \n). The parser MUST also strip \r at line boundaries to handle \r\n (Windows-style) line endings that may appear in files from mixed-platform pipelines.

Integer parsing

sam.record.parse_int_nonempty

reader.rs:679 reader.rs:962 +2

Integer parsing functions (parse_u8, parse_u16, parse_u32) MUST reject empty input by returning None. An empty byte slice is not a valid integer representation and MUST NOT silently produce zero.

Performance considerations

sam.perf.bulk_read

reader.rs:198 sam_gz.rs:63

Like BAM, the SAM reader SHOULD use RegionBuf-style bulk I/O: read all compressed bytes for the region in one large I/O operation, then decompress and parse from memory. This is critical for NFS/Lustre performance.

Text parsing is inherently slower than BAM binary decoding. The SAM reader is expected to be 2-5× slower than BAM for the same data due to text parsing overhead and larger compressed size. This is acceptable since SAM is not the recommended input format. Implementation guidance:

Use memchr for fast TAB and newline scanning.
Avoid allocations during parsing (reuse line buffers).
Parse integers without going through String (direct byte-to-int conversion).
Decode SEQ characters to Base with a lookup table (not per-char branching).

Tabix and CSI Index

Tabix (.tbi) and CSI (.csi) are generic indexes for bgzf-compressed, coordinate-sorted, TAB-delimited files. They enable region-based random access via BGZF virtual offsets. seqair needs these for bgzf-compressed SAM files.

Sources: [TABIX] — full TBI format specification (magic, header, bins, chunks, linear index, n_no_coor, reg2bin/reg2bins). [CSI] — full CSI format specification (magic, min_shift, depth, l_aux, per-bin loffset, parameterised reg2bin/reg2bins). [SAM1] §5.3 — reg2bin/reg2bins C code for BAI-compatible binning parameters (min_shift=14, depth=5). See references.md.

Context

BAM uses BAI indexes, which are a specialized form of the binning scheme. Tabix and CSI generalize this:

Tabix (.tbi): fixed parameters (min_shift=14, depth=5), identical binning to BAI. Includes column configuration for the text format.
CSI (.csi): configurable min_shift and depth, supporting references longer than 512 Mbp.

Both use the same bin/chunk/virtual-offset query mechanism as BAI.

Implementation strategy: Tabix with format=SAM uses identical binning parameters to BAI (min_shift=14, depth=5). The index data after the tabix-specific header is structurally identical to BAI (bins, chunks, linear index). The initial implementation SHOULD reuse the existing BamIndex parsing and query code by reading past the tabix header and delegating to the BAI parser. CSI support (parameterized binning, per-bin loff, no linear index) requires a separate parser and is deferred.

File compression

tabix.compression

index.rs:107 tabix_index.rs:91

Tabix (.tbi) files are BGZF-compressed. The parser MUST decompress via BGZF before parsing the binary format. Reading raw bytes without decompression produces garbage.

csi.compression

CSI (.csi) files are BGZF-compressed. Same decompression requirement as tabix.

Tabix format

[TABIX] — TBI format table: magic, n_ref, format, col_seq, col_beg, col_end, meta, skip, l_nm, names; bin/chunk/interval lists; n_no_coor

tabix.magic

index.rs:103 index.rs:591 +1

Tabix files start with magic bytes TBI\x01 (4 bytes), after BGZF decompression.

tabix.header

index.rs:104 tabix_index.rs:38

After magic:

n_ref (i32): number of reference sequences
format (i32): 0=generic, 1=SAM, 2=VCF. For SAM files, the format code is 1 (not 0).
col_seq (i32): column for sequence name (1-based; 3 for SAM)
col_beg (i32): column for begin position (1-based; 4 for SAM)
col_end (i32): column for end position (0 = computed from begin)
meta (i32): comment/header prefix character (64 = @ for SAM)
skip (i32): number of header lines to skip (0 for SAM; headers detected by meta prefix)
l_nm (i32): length of concatenated sequence names
names (bytes): null-terminated sequence names, concatenated

Note: when format=1 (SAM), htslib hard-codes the column configuration internally. The stored col_seq/col_beg/col_end values may be ignored by some implementations. The reader SHOULD NOT reject a tabix file based on stored column values when format=1.

tabix.bai_reuse

index.rs:105 tabix_index.rs:39

Since tabix uses identical binning parameters to BAI (min_shift=14, depth=5), the index data after the tabix header MUST be parsed by reusing BamIndex::from_reader() (or equivalent). This avoids duplicating the bin/chunk/linear-index parsing code. The tabix header is read and validated first, then the remaining bytes are handed to the BAI parser.

tabix.index_data

index.rs:106 tabix_index.rs:40

After the header, the index data follows the same structure as BAI:

For each reference:
- n_bin (i32): number of bins
- For each bin: bin_id (u32), n_chunk (i32), chunks as virtual offset pairs
- n_intv (i32): number of linear index entries
- Linear index: virtual offsets at 16 kbp intervals

After all reference data, an optional 8-byte n_no_coor field (u64) may be present (number of unmapped reads with no coordinate). Its presence is detected by whether 8 more bytes remain.

tabix.pseudo_bin

index.rs:178 tabix_index.rs:58

Bin ID 37450 is a pseudo-bin that stores metadata (virtual offset range of all records for a reference). Its chunks MUST NOT be included in query results. The reader MUST recognize and skip this bin during region queries.

tabix.query

index.rs:177 tabix_index.rs:57

Region queries use the same reg2bins algorithm as BAI (with min_shift=14, depth=5). The bin offset for level l (counting from root, l=0=root) is ((1 << 3*l) - 1) / 7, and the bin ID for position p at level l is:

((1 << 3*l) - 1) / 7 + (p >> (min_shift + 3*(depth - l)))

For BAI/tabix with min_shift=14, depth=5:

Level 0 (root): offset=0, shift=29. Bin 0 covers the entire reference.
Level 5 (leaf): offset=4681, shift=14. Each bin covers 16 kbp.

CSI format

[CSI] — CSI format table: magic CSI\x01, min_shift, depth, l_aux, n_ref; per-bin loffset replacing linear index; parameterised reg2bin/reg2bins/bin_limit functions

csi.magic

CSI files start with magic bytes CSI\x01 (4 bytes), after BGZF decompression.

min_shift (i32): minimum shift for binning (BAI/tabix use 14)
depth (i32): number of binning levels (BAI/tabix use 5)
l_aux (i32): length of auxiliary data
aux (bytes): auxiliary data (for SAM: same column configuration as tabix header)

Note: CSI support (r[csi.*] requirements below) is deferred. The initial implementation supports tabix only, which covers the common case of bgzf-compressed indexed SAM files. CSI adds support for references > 512 Mbp and is needed for some plant/polyploid BAM files.

csi.binning

CSI generalizes the BAI binning scheme with configurable min_shift and depth. The bin ID formula is the same as tabix/BAI but parameterized:

bin_id(level, pos) = ((1 << 3*level) - 1) / 7 + (pos >> (min_shift + 3*(depth - level)))

where level counts from root (0) to leaf (depth).

The reg2bins function MUST be parameterized by min_shift and depth rather than using a hard-coded levels array. The existing BAM reg2bins implementation hard-codes [(1,26), (9,23), (73,20), (585,17), (4681,14)] — this must be refactored to compute these values dynamically.

csi.bin_structure

CSI bins have a different structure from BAI/tabix bins:

bin_id (u32)
loff (u64): minimum virtual offset for this bin — replaces BAI's linear index
n_chunk (i32)
chunks: virtual offset pairs

The loff field per-bin replaces the separate linear index that BAI/tabix use. During queries, chunks with end <= loff of the leftmost overlapping bin SHOULD be skipped (same optimization as BAI's linear index).

csi.no_linear_index

CSI does NOT have a separate linear index array. The per-bin loff field serves the same purpose. The parser MUST NOT attempt to read a linear index after bins.

csi.pseudo_bin

The CSI pseudo-bin ID is ((1 << (depth*3 + 3)) - 1) / 7. For depth=5 this equals 37450 (same as BAI). Its chunks MUST NOT be included in query results.

csi.query

CSI queries produce the same output as BAI/tabix queries: a list of virtual offset chunk ranges to scan. RegionBuf::load() can be reused unchanged since it operates on Vec<Chunk>.

Shared query infrastructure

index.shared_query

index.rs:108 unified_reader.rs:87

The existing BamIndex::query() returns Vec<Chunk> with begin/end virtual offsets. The tabix and CSI parsers MUST produce the same Vec<Chunk> output so that RegionBuf::load() can be reused unchanged. The binning parameters differ but the output format is identical.

index.shared_types

index.rs:109 unified_reader.rs:88

Chunk and VirtualOffset types from the BAM index module MUST be reused by tabix/CSI indexes.

Edge cases

index.edge.large_refs

References longer than 512 Mbp require CSI (BAI/tabix with min_shift=14, depth=5 cannot represent them). The reader MUST support CSI indexes for BAM files too (not just SAM), since some BAM files from plant/polyploid genomes use CSI.

index.edge.no_records

index.rs:215 index.rs:532

A reference with no records has no bins and no linear index entries. Queries for such references MUST return an empty chunk list (not an error).

index.edge.name_mismatch

Tabix stores sequence names in the index header. These MUST match the names in the SAM header's @SQ lines. A mismatch SHOULD produce a warning (the index may be stale). CSI does NOT store sequence names — this check applies only to tabix.

Record Store

Sources: The slab-based RecordStore design is seqair-specific. The data it stores is defined by [SAM1] §4.2 (BAM record fields, CIGAR packed u32 format, quality Phred encoding, aux tag binary format). See references.md.

Background

When loading a region from any alignment format (BAM, SAM.gz, or CRAM), the reader decodes hundreds to thousands of records. Each record has fixed-size fields (position, flags, quality) and variable-length fields (read name, CIGAR, sequence, quality scores, auxiliary tags). The original design wrapped each record in Rc<BamRecord> with 5 separate heap allocations for the variable-length fields, plus the Rc allocation itself — 6 allocations per record.

The record store replaces this with a slab-based design: all variable-length data is packed into contiguous buffers (slabs), and fixed fields are stored in a compact struct that references the slabs by offset. This eliminates all per-record heap allocation.

Layout

The store contains four vectors:

Record table (Vec<SlimRecord>): compact fixed-size structs with alignment fields and offsets into the slabs.
Name slab (Vec<u8>): all read names (qnames) packed sequentially. Separated because qnames are only accessed during dedup mate detection — a linear scan of a compact, contiguous buffer.
Bases slab (Vec<Base>): decoded base sequences stored as Base enum values (A/C/G/T/Unknown). For BAM, decoded from 4-bit nibbles via SIMD at push time. For SAM/CRAM, provided directly as &[Base] by the reader.
Data slab (Vec<u8>): packed per-record as [cigar | qual | aux] — CIGAR in BAM packed u32 format, quality as raw Phred bytes, aux tags in BAM binary format.

Record table:  [SlimRecord₀] [SlimRecord₁] [SlimRecord₂] ...
Name slab:     [qname₀|qname₁|qname₂|...]
Bases slab:    [bases₀|bases₁|bases₂|...]
Data slab:     [cigar₀|qual₀|aux₀|cigar₁|qual₁|aux₁|...]

Capacity estimation

record_store.capacity

record_store.rs:61 record_store.rs:76

On construction, the store MUST pre-allocate all three vectors with estimated capacities to avoid reallocation during push_raw. Estimates SHOULD be based on the number of compressed bytes loaded for the region: a typical BAM compression ratio of ~3:1, an average record size of ~400 bytes, ~25 bytes per read name, and ~375 bytes per record of non-name data. The store MUST accept a byte-count hint for this purpose.

Decoding into slabs

record_store.push_raw+2

record_store.rs:17 +1 record_store.rs:6 +1

push_raw(raw_bytes) MUST decode a BAM record's fixed fields and append its variable-length data directly into the slabs, without allocating intermediate Box<[u8]> or Vec<u8> per field. The 4-bit packed sequence MUST be decoded (via SIMD when available) to Base enum values in the bases slab.

record_store.checked_offsets

record_store.rs:118 +2 record_store.rs:374

Slab offsets (name_off, bases_off, data_off) MUST be checked for u32 overflow before storing. If any slab exceeds u32::MAX bytes, the store MUST return an error rather than silently truncating the offset.

record_store.push_fields

record_store.rs:167 record_store.rs:124

push_fields(...) MUST accept pre-parsed record fields for SAM and CRAM readers: pos, end_pos, flags, mapq, matching_bases, indel_bases, qname bytes, CIGAR as packed BAM-format u32 ops, sequence as &[Base], quality bytes, and aux tag bytes in BAM binary format. This avoids encoding to BAM binary only to immediately decode it again. SAM and CRAM parsers convert their native representations to these types before pushing.

Field access

record_store.field_access

record_store.rs:58 compare_bam_with_htslib.rs:85 +2

The store MUST provide methods to access each variable-length field for a given record index: qname(idx), cigar(idx), seq(idx), seq_at(idx, pos), qual(idx), aux(idx). These return borrowed slices into the slabs.

Region lifecycle

record_store.clear+2

record_store.rs:59 record_store.rs:54

Clearing the store for a new region MUST retain the allocated capacity of all four vectors, so that subsequent regions reuse the same memory without reallocation.

record_store.region_scoped

record_store.rs:60 record_store.rs:102

All records in the store MUST remain valid and accessible until the store is cleared.

Integration

record_store.no_rc

record_store.rs:62 record_store.rs:101

The pileup engine MUST NOT use Rc for record sharing. PileupAlignment carries pre-extracted flat fields (base, qual, mapq, flags, etc.), and ActiveRecord stores a record index into the store. The store is either owned by or borrowed by the engine for the duration of iteration.

Region Buffer

Sources: The RegionBuf design is seqair-specific (bulk I/O optimisation for cluster storage). BGZF block decompression follows [SAM1] §4.1. Virtual offsets are defined in [SAM1] §4.1 "Random access". Index chunks come from [SAM1] §5.2 (BAI) or [TABIX] (TBI). See references.md.

Background

When reading a BGZF-compressed file (BAM or bgzf-compressed SAM), the reader doesn't decompress the entire file. Instead, it uses an index (.bai for BAM, .tbi for SAM.gz) to find which compressed byte ranges — called chunks — contain reads overlapping the requested genomic region. Each chunk is a pair of virtual offsets (see bgzf.md) that point to a start and end position in the compressed file.

A typical region query returns a handful of chunks. The standard approach is to seek to each chunk's start, then read and decompress BGZF blocks one at a time until reaching the chunk's end. Each block read involves a header read (~18 bytes) followed by a compressed data read (~20–40 KB), then decompression.

On local SSDs this works fine — seek + read latency is microseconds. On cluster storage (NFS, Lustre, GPFS), each I/O operation can take milliseconds due to network round-trips. A region with 200 BGZF blocks means 200+ network round-trips, which dominates processing time.

Approach

RegionBuf solves this by pre-fetching all compressed bytes for a region into RAM in one shot, then decompressing from the in-memory buffer. The I/O pattern changes from "hundreds of small reads" to "one large sequential read" per region. Since Rastair processes BAM segments in parallel (one thread per segment), each thread creates its own RegionBuf with no contention.

The flow:

Query the BAM index → get a list of chunks (compressed byte ranges)
Merge overlapping/adjacent chunks into larger contiguous ranges
Read each merged range with one seek + read into a Vec<u8>
Decompress BGZF blocks from the in-memory buffer (zero network I/O)

Chunk merging

The BAM index may return multiple chunks whose compressed byte ranges overlap or are adjacent. Reading them separately would cause redundant I/O and seeking.

region_buf.merge_chunks

region_buf.rs:507 region_buf.rs:24

Before reading, index chunks MUST be merged into non-overlapping byte ranges sorted by file offset. Overlapping or adjacent chunks MUST be coalesced into a single range. The end of each range MUST extend past the last chunk's end block offset by at least one maximum BGZF block size (64 KiB) to ensure the final block is fully included in the buffer.

region_buf.no_bin0

reader.rs:186 region_buf.rs:83

RegionBuf MUST NOT include bin 0 chunks in its bulk read. Bin 0 chunks are at distant file offsets and contain very few reads; including them would cause an expensive additional seek. Bin 0 records are handled separately via the bin 0 cache (see bam_index.md).

Loading

region_buf.load

region_buf.rs:75 region_buf.rs:39

RegionBuf::load MUST accept a seekable reader and a slice of index chunks. It MUST merge chunks, then perform one seek + read_exact per merged range to load all compressed bytes into a contiguous in-memory buffer. On typical BAM files this results in a single read of a few hundred KB to a few MB per region.

region_buf.empty

region_buf.rs:76 region_buf.rs:14

When given zero chunks, load MUST return an empty buffer that immediately signals EOF on any read.

Seeking

After loading, the caller iterates through the original (unmerged) chunks. For each chunk, it seeks to the chunk's start virtual offset within the pre-loaded buffer and reads records until reaching the chunk's end.

region_buf.seek_virtual

region_buf.rs:165 region_buf.rs:64

The region buffer MUST support seeking to a virtual offset within the pre-loaded data. Since disjoint ranges are concatenated, the buffer MUST maintain a range map that translates file offsets to buffer positions. A seek to a file offset that falls within any loaded range MUST succeed; a seek to an offset in a gap between ranges or before/after the loaded data MUST return an error.

Block decompression

The in-memory buffer contains raw BGZF-compressed bytes, exactly as they appear on disk. Decompression follows the same block-by-block process as the file-based reader, but reads from RAM instead of the filesystem.

region_buf.decompress

region_buf.rs:220 compare_bgzf_crate.rs:51 +2

Block decompression from the in-memory buffer MUST follow the same BGZF format rules as the file-based reader: magic validation, BSIZE extraction, DEFLATE decompression, and CRC32 verification.

region_buf.fast_header

region_buf.rs:221 region_buf.rs:43

The fast-path header parsing (XLEN=6, BC at fixed offset) MUST be used when applicable, with fallback to searching extra fields.

Reading

region_buf.read_exact

region_buf.rs:364 region_buf.rs:41

The region buffer MUST support reading an exact number of bytes, transparently crossing block boundaries, with the same API as BgzfReader::read_exact_into.

region_buf.drop_no_panic

region_buf.rs:460 region_buf.rs:541

The Drop implementation MUST never panic. Gap-size calculations between ranges MUST use saturating arithmetic to handle overlapping or malformed range boundaries safely.

region_buf.virtual_offset

region_buf.rs:215 region_buf.rs:42

The region buffer MUST track virtual offsets correctly: the block offset corresponds to the file position of the current block (not the buffer position), so that virtual offset comparisons with the BAM index chunk boundaries work correctly.

Integration

region_buf.fetch_into+2

reader.rs:185 region_buf.rs:82

Both IndexedBamReader::fetch_into and IndexedSamReader::fetch_into MUST use RegionBuf to bulk-read compressed data before record decoding. The file-based BgzfReader is retained for header parsing and other sequential access at file open time.

region_buf.not_cram

reader.rs:289 compare_cram_with_htslib.rs:113

CRAM does NOT use RegionBuf. CRAM containers are not BGZF-compressed — the reader seeks to container byte offsets directly via the CRAI index and reads container data with plain File::read_exact. Block-level decompression (gzip, rANS, etc.) happens after reading the container into memory.

Overlapping Pair Deduplication

In paired-end sequencing, two reads (mates) are generated from opposite ends of the same DNA fragment. When the fragment is shorter than twice the read length, the mates overlap — both reads cover the same genomic positions in the middle. Without deduplication, these overlapping positions would be counted twice in the pileup, inflating the apparent coverage and biasing variant allele frequencies.

Sources: Seqair-specific design with no upstream spec counterpart. Mate detection relies on QNAME matching as defined in [SAM1] §1.4. The first-in-template / second-in-template distinction uses FLAG bits 0x40 and 0x80 defined in [SAM1] §1.4. See references.md.

For example, with 150bp reads from a 200bp fragment, positions 50–150 are covered by both mates. At each of these positions, the pileup would show two observations from the same molecule. Deduplication removes one mate at each overlapping position, keeping the observation count honest.

Opt-in behavior

dedup.opt_in

pileup.rs:160 dedup.rs:40 +1

Overlapping pair deduplication MUST be opt-in via set_dedup_overlapping(). When not enabled, the pileup engine MUST yield all alignments without deduplication.

Mate detection

Mates are identified by sharing the same read name (qname). The engine pre-computes a mate lookup table when deduplication is enabled, so the per-position dedup check is a fast hash lookup rather than a string comparison.

dedup.mate_detection

pileup.rs:161 dedup.rs:70 +1

Mates are detected by matching qnames. When set_dedup_overlapping() is called, the engine MUST build a mate index that maps each record to its mate's index (if both are in the arena). Records with no mate in the arena are unaffected.

dedup.mate_pairs_only

pileup.rs:162 dedup.rs:105

Only the first two records sharing a qname are treated as mates. If three or more records share the same qname (supplementary alignments, etc.), only the first pair is linked; additional records are treated as unpaired for dedup purposes.

Resolution rule

When both mates are present at a position, one must be removed. The rule depends on whether the mates agree on the base call:

dedup.resolution_same_base

pileup.rs:360 dedup.rs:123 +1

When both mates show the same base at a position, the first encountered (by arena order) is kept and the other is removed.

dedup.resolution_different_base

pileup.rs:361 dedup.rs:143 +1

When mates show different bases at a position, the first-in-template read is kept and the second-in-template read is removed. This preserves the primary observation.

Scope

dedup.per_position

pileup.rs:315 dedup.rs:179 +2

Deduplication MUST be applied per-position, not globally. A read that is removed at one overlapping position MUST still appear at positions where its mate is not active. Both mates contribute normally outside their overlap region.

Non-interference

dedup.filter_independent

pileup.rs:316 dedup.rs:233

Deduplication MUST be applied after the read filter and after qpos computation. A read that was excluded by the filter MUST NOT participate in dedup. A read that has no qpos at a position (deletion/skip) MUST NOT participate in dedup at that position.

dedup.max_depth_independent

pileup.rs:317 dedup.rs:255

Deduplication MUST be applied before max_depth truncation. The depth after dedup may be below max_depth even if the raw depth exceeded it.

Pileup Engine

A pileup is a column-by-column view of all reads aligned to a reference region. At each reference position, the pileup shows which reads are present and what base each read has at that position. This is the fundamental data structure for variant calling: to determine whether position X has a mutation, you look at the pileup column at X and count how many reads show the reference base vs. an alternate base.

The pileup engine takes a set of BAM records (stored in a RecordStore) and iterates over reference positions from left to right, maintaining an "active set" of reads that overlap the current position. At each position, it yields a PileupColumn containing the aligned reads with their query positions (the position within each read that corresponds to this reference position).

Sources: The pileup algorithm is seqair-specific design, verified for correctness against [htslib] bam_plp_auto. FLAG bits referenced here (0x4 unmapped, 0x10 reverse strand) are defined in [SAM1] §1.4. CIGAR consume semantics (consumes query/reference) are from [SAM1] §1.4. See references.md.

Column iteration

pileup.position_iteration

pileup.rs:219 compare_bam_with_htslib.rs:158 +2

The engine MUST iterate over every reference position from the region start to the region end (inclusive), yielding a PileupColumn at each position where at least one read is active.

pileup.active_set

pileup.rs:34 compare_pileup.rs:84 +1

At each position, the active set contains all records whose alignment span [pos, end_pos] includes the current position. Records MUST enter the active set when the current position reaches their pos and MUST be evicted when the current position exceeds their end_pos.

pileup.column_contents

pileup.rs:73 compare_bam_with_htslib.rs:159 +2

Each PileupColumn MUST provide:

the reference position
the number of active alignments (depth)
access to each alignment with its query position (qpos) and a RecordRef

Filtering

pileup.read_filter

pileup.rs:154 pileup.rs:74 +1

The engine MUST support a per-read filter function that is evaluated once per record when it enters the active set (NOT re-evaluated at each position). Records that fail the filter MUST be excluded from all subsequent columns.

pileup.max_depth

pileup.rs:148 pileup.rs:120

The engine MUST support a maximum depth setting. When more records pass the filter than max_depth allows, excess records MUST be dropped (preferring records already in the active set).

pileup.max_depth_per_position

pileup.rs:149 +1 pileup.rs:137

Max depth MUST be enforced per-position, not globally at arena-load time. A read that is rejected at one high-coverage position may still be included at an adjacent lower-coverage position. This matches htslib's behavior where MAXCNT is checked against the current column's depth.

Query position

The query position (qpos) maps a reference position back into a read's coordinate system. For example, if a read starts at reference position 100 and has CIGAR 50M, then at reference position 125, qpos = 25. For complex CIGARs with insertions or deletions, the mapping requires walking the CIGAR operations (see cigar.md).

pileup.qpos

pileup.rs:95 compare_bam_with_htslib.rs:160 +3

For each alignment in a column, the engine MUST provide qpos: the position within the read's query sequence that aligns to the current reference position. This MUST be computed using the CigarIndex.

pileup.qpos_none

pileup.rs:293 compare_pileup.rs:120

When the current reference position falls within a deletion or reference skip in the read's CIGAR, that alignment MUST be excluded from the column (no qpos exists).

Edge cases

pileup.empty_positions_skipped

pileup.rs:281 pileup.rs:184

Positions with zero active reads MUST be skipped (no column yielded). When there is a gap between reads, the engine MUST jump to the next read's start position rather than iterating through empty positions one by one.

pileup.unmapped_excluded

pileup.rs:255 pileup.rs:203

Reads with the unmapped flag (0x4) MUST be excluded from the pileup. htslib's bam_plp_push rejects unmapped reads before they enter the buffer.

pileup.zero_refspan_reads

pileup.rs:35 pileup.rs:218

Reads with zero reference-consuming CIGAR operations (pure soft-clip, insertion-only) have end_pos == pos and MUST appear in exactly one pileup column at their pos. They MUST NOT be skipped or cause off-by-one errors.

pileup.soft_clip_at_position

pileup.rs:36 pileup.rs:234

When a reference position falls within the soft-clipped portion of a read's alignment, the read is NOT active at that position (soft clips do not consume reference). The read only becomes active at its first reference-consuming position.

Compatibility

pileup.htslib_compat

pileup.rs:74 compare_bam_with_htslib.rs:157 +2

For any BAM file and region, the pileup engine MUST produce identical columns to htslib's bam_plp_auto-based pileup: same positions, same depth at each position, and same qpos values per alignment. This is verified by comparison tests.

Performance

The pileup engine is the innermost loop of Rastair's processing pipeline: for a typical 30× whole-genome BAM, it processes ~3 billion reference positions, each with ~30 aligned reads. These rules specify performance-sensitive behaviors to avoid unnecessary overhead in this hot path.

Sources: Seqair-specific performance rules with no upstream spec counterpart. Implementation strategies are guided by profiling results and cross-checked against [htslib] behaviour. See references.md.

Allocation avoidance

perf.reuse_alignment_vec+2

pileup.rs:40 perf.rs:15

The pileup engine allocates a Vec<PileupAlignment> per column using Vec::with_capacity(active.len()). Reusing a buffer via clone was measured to be slower than fresh allocation due to the copy cost of entries. The allocator efficiently reuses recently-freed blocks of the same size.

perf.avoid_redundant_arena_get+2

pileup.rs:38 perf.rs:71

When a record enters the pileup active set, the engine MUST retrieve the record data once and cache what it needs (cigar, flags, strand, mapq, etc.) in the ActiveRecord — NOT perform repeated store lookups for the same record in the hot loop.

perf.no_sorted_indices

pileup.rs:37 perf.rs:46

Since BAM records arrive in coordinate-sorted order from fetch_into, the pileup engine MUST NOT sort record indices. It MUST iterate records in arena order directly.

perf.cigar_no_to_vec

pileup.rs:39 perf.rs:96

Building a CigarIndex MUST NOT clone the CIGAR bytes via .to_vec(). The CIGAR bytes live in the arena slab for the lifetime of the region; the CigarIndex MUST borrow or reference them by offset, not own a copy.

perf.precompute_matches_indels

record.rs:26 perf.rs:131 +1

Matches and indels counts MUST be computed once per record during decode and stored in BamRecord, NOT recomputed from CIGAR at every pileup position.

perf.cigar_binary_search

cigar.rs:100 perf.rs:184 +1

For records with more than 4 CIGAR operations, qpos_at SHOULD use binary search on ref_start instead of linear scan. For 1–4 ops, linear scan is acceptable.

perf.arena_capacity_hint+2

record_store.rs:84 perf.rs:267

The RecordStore MUST support a capacity hint so that the first region's allocation can be pre-sized based on an estimate (e.g., from the BAI index chunk sizes), avoiding repeated reallocation during the first fetch_into. This is provided by RecordStore::with_byte_hint().

References

The seqair spec is based on the following upstream specifications and reference implementations.

Format specifications

[SAM1] Sequence Alignment/Map Format Specification, The SAM/BAM Format Specification Working Group, last modified 2025-08-10. https://samtools.github.io/hts-specs/SAMv1.pdf Covers: SAM text format (§1), mandatory alignment fields and FLAG bits (§1.4), BAM binary encoding (§4 "The BAM Format Specification"), BGZF block compression (§4.1 "The BGZF compression format"), random access and virtual offsets (§4.1 "Random access"), EOF marker (§4.1 "End-of-file marker"), BAM record layout (§4.2 "The BAM format"), BIN field calculation (§4.2.1), SEQ/QUAL encoding (§4.2.4), auxiliary data encoding (§4.2.5), BAI index algorithm (§5 "Indexing BAM"), basic binning index (§5.1.1), BAI index format (§5.2 "The BAI index format for BAM files"), reg2bin/reg2bins C code (§5.3).
[CRAM3] CRAM format specification (version 3.1), samtools-devel@lists.sourceforge.net, last modified 2025-06-04. https://samtools.github.io/hts-specs/CRAMv3.pdf Covers: data types and ITF8/LTF8 variable-length integers (§2), bit stream reading (§2.2), encoding types (§3 "Encodings"), checksums (§4), file structure overview (§5), file definition and magic (§6), container header structure (§7), CRAM header container (§7.1), block structure (§8), block content types (§8.1), compression header block (§8.3) including preservation map, data series encodings, and substitution matrix, slice header block (§8.4), core data block (§8.5), external data blocks (§8.6), EOF container (§9), record structure and CRAM record decode order (§10), BAM and CRAM flags (§10.1), positional data and AP delta coding (§10.2), read names (§10.3), mate records (§10.4), auxiliary tags (§10.5), mapped reads and read features (§10.6), substitution codes (§10.6.1), unmapped reads (§10.7), reference sequences (§11), CRAM index / CRAI format (§12), encoding codec IDs (§13).
[CRAMcodecs] CRAM codec specification (version 3.1), samtools-devel@lists.sourceforge.net, last modified 2025-03-24. https://samtools.github.io/hts-specs/CRAMcodecs.pdf Covers: rANS 4×8 codec with order-0 and order-1 frequency tables and interleaving (§2), rANS Nx16 codec with transforms (RLE, PACK, STRIPE, CAT, N32) (§3), range/arithmetic coder (§4), name tokeniser tok3 (§5), fqzcomp quality codec (§6).
[TABIX] The Tabix index file format, Heng Li. https://samtools.github.io/hts-specs/tabix.pdf Covers: TBI magic, header fields (n_ref, format, col_seq, col_beg, col_end, meta, skip, l_nm, names), bin/chunk/linear-index structure identical to BAI, n_no_coor trailing field, reg2bin/reg2bins C functions.
[CSI] Coordinate Sorted Index (CSI) format specification, SAM/BAM Format Specification Working Group, last modified 2020-05-25. https://samtools.github.io/hts-specs/CSIv1.pdf Covers: CSI magic, header fields (min_shift, depth, l_aux), per-bin loffset field replacing BAI's linear index, parameterised reg2bin/reg2bins/bin_limit functions.

Reference implementations

[htslib] htslib, Wellcome Sanger Institute. https://github.com/samtools/htslib The canonical C implementation of SAM/BAM/CRAM/BGZF/BAI. Used to verify correctness of seqair's output: pileup positions, depths, qpos values, CRAM record fields, and flag handling are all cross-checked against htslib.
[noodles] noodles, Michael Macias. https://github.com/zaeleus/noodles A pure-Rust implementation of genomics formats. Referenced for Rust-idiomatic design patterns in handling BAM/SAM/CRAM records.