Seqair — General Design

These are cross-cutting design rules that apply to all modules in the Seqair crate. They establish conventions for error handling, type safety, and platform portability.

Sources: This file contains seqair-specific design rules with no direct upstream spec counterpart. Named constants (FLAG bits, CIGAR op codes) are derived from [SAM1] §1.4 and §4.2. See references.md.

Error handling

Seqair aims to provide helpful and complete error messages at all layers of the system.

bgzf.rs:32 +4 bgzf.rs:33

Error types MUST implement std::error::Error using thiserror and capture necessary context and provide causes.

Every failure mode MUST be a distinct enum variant with typed fields — never a catch-all String or reason: String. Callers MUST be able to pattern-match on failure modes without inspecting display text. When a single enum variant is constructed at multiple call sites with different format strings, split it into separate variants. Paths, byte arrays, and counts MUST be stored in typed fields (e.g. PathBuf, [u8; 4], usize), not interpolated into strings.

Type safety

BAM files are full of magic numbers: flag bits (0x4 = unmapped, 0x10 = reverse strand), CIGAR operation codes (0 = M, 1 = I, ...), format constants. Using raw literals is error-prone and hard to read.

cigar.rs:8 +1 bgzf.rs:90 +1

Magic numbers (bit flags, format constants, operation codes) MUST be defined as named constants, not used as raw literals. BAM flags MUST use named constants (e.g., FLAG_UNMAPPED instead of 0x4). CIGAR operation types MUST use named constants (e.g., CIGAR_M instead of 0).

flags.rs:6 bgzf.rs:108

BAM flags MUST have a BamFlags newtype wrapping u16 with named predicate methods (e.g., is_unmapped(), is_reverse()). The public API (RecordRef) MUST expose BamFlags rather than raw u16 where appropriate.

cigar.rs:19 cigar.rs:138 +1

CIGAR operations MUST have a CigarOpType enum with variants for all 9 SAM-spec operations (M, I, D, N, S, H, P, =, X). The enum MUST provide consumes_ref() and consumes_query() methods. Invalid operation codes MUST be represented as None via from_bam(u8) -> Option<CigarOpType>.

API design

All public enums (including error types) that may gain variants in future versions MUST be annotated with #[non_exhaustive]. Adding a variant to a non-exhaustive enum is not a semver-breaking change, allowing new variants and error conditions to be introduced in minor releases.

bam.rs:40 +2

Only intentionally public types should be exported. Internal implementation modules (compression codecs, container parsers, encoding details) MUST be pub(crate) or #[doc(hidden)] so that downstream crates do not depend on internal structure. The public API contract is defined by explicit pub use re-exports in each top-level module file.

Arithmetic safety

All integer arithmetic on values derived from untrusted input (file data, parsed fields) MUST use checked or saturating operations. Default wrapping arithmetic (+, -, *) MUST NOT be used on untrusted values. The clippy::arithmetic_side_effects lint should be enabled.

Platform portability

seq.rs:10 seq_codec.rs:58

Every platform-specific optimization (SIMD, architecture-specific intrinsics) MUST have a scalar fallback that produces identical results on all platforms. The optimized and fallback paths MUST be tested with property-based tests that verify equivalence for arbitrary inputs.

Unified Reader

seqair reads BAM, bgzf-compressed SAM, and CRAM files through a format-agnostic reader interface that auto-detects the input format and dispatches to the appropriate parser, while presenting a uniform API to the rest of the codebase.

Sources: The unified reader design is seqair-specific. Format magic bytes come from [SAM1] §4.2 (BAM: BAM\1), [SAM1] §4.1 (BGZF: 1f 8b), and [CRAM3] §6 (CRAM: CRAM). Sort order detection uses the SO tag from [SAM1] §1.3 "The header section" (@HD line). See references.md.

Goals

  1. Backward compatibility: existing code that calls IndexedBamReader::open() continues to work unchanged.
  2. Auto-detection: IndexedReader::open(path) inspects the file to determine BAM vs SAM vs CRAM and opens the appropriate reader.
  3. Uniform downstream API: all three formats populate the same RecordStore via fetch_into(). The pileup engine, call pipeline, and all downstream code are format-agnostic.
  4. Forkable: all reader types support fork() for thread-safe parallel processing with shared immutable state.

Format detection

reader.rs:326 reader.rs:400 +4

IndexedReader::open(path) MUST auto-detect the file format by inspecting magic bytes:

  • Bytes 1f 8b (gzip magic) → verify BGZF structure (extra field with BC subfield). If not BGZF, reject with error: "file is gzip-compressed but not BGZF; use bgzip instead of gzip." If BGZF, decompress the first block:
    • Starts with BAM\x01 → BAM format
    • Starts with @ (0x40) → bgzf-compressed SAM
    • Otherwise → reject with error naming supported formats
  • Bytes 43 52 41 4d (CRAM) → CRAM format
  • Byte @ (0x40) at position 0 → uncompressed SAM. Reject with error: "uncompressed SAM files cannot be indexed; compress with bgzip file.sam then index with samtools index file.sam.gz."
  • Otherwise → return an error listing supported formats (BAM, bgzf-compressed SAM, CRAM)
reader.rs:418 +1 unified_reader.rs:45 +1

After format detection, the reader MUST locate the appropriate index file:

  • BAM: .bai, .bam.bai, or .csi (CSI supports references > 512 Mbp that BAI cannot)
  • SAM: .tbi or .csi (tabix or CSI index)
  • CRAM: .crai (CRAM index)
reader.rs:419 +1 unified_reader.rs:63 +1

If the format is detected but no matching index is found, the error MUST name the expected index extension and suggest the tool to create it (samtools index for BAM/CRAM, samtools index or tabix for SAM).

Reader enum

reader.rs:104 unified_reader.rs:44 +1

The unified reader MUST be an enum dispatching to format-specific readers, not a trait object. This avoids dynamic dispatch overhead in the hot path and keeps the type concrete for fork().

enum IndexedReader {
coverage.html    Bam(IndexedBamReader),
    Sam(IndexedSamReader),
    Cram(IndexedCramReader),
}
reader.rs:122 unified_reader.rs:86

The unified reader MUST expose:

  • header() -> &BamHeader — all formats produce the same header type (target names, lengths, tid lookup). The SAM and CRAM parsers convert their native header representations to BamHeader.
  • fetch_into(tid, start, end, store) -> Result<usize> — populates a RecordStore with records overlapping the region, identically to the BAM path.
  • fork() -> Result<Self> — creates a lightweight copy sharing immutable state.
  • shared() -> &Arc<_> — access to shared state for Arc::ptr_eq testing.
reader.rs:171 compare_cram_parsing.rs:206 +5

For a given BAM file and its SAM/CRAM representations of the same data, fetch_into for the same region MUST produce records with equivalent logical content: same positions, flags, sequences, qualities, and CIGAR operations. Aux tags MUST contain the same set of tag names and values, but tag ordering and integer type codes (e.g., c vs i for small values) MAY differ between formats because BAM writers choose specific integer widths that SAM text cannot preserve.

RecordStore integration

record_store.rs:166 record_store.rs:126

Currently RecordStore::push_raw() takes raw BAM bytes. For SAM and CRAM, records arrive as parsed fields rather than BAM binary.

The RecordStore MUST provide push_fields() (or equivalent) that accepts pre-parsed record fields:

  • pos, end_pos, flags, mapq, seq_len, matching_bases, indel_bases (fixed fields)
  • qname bytes
  • CIGAR as packed BAM-format u32 ops (SAM/CRAM parsers convert to this representation)
  • sequence as &[Base] (already the enum type stored in the bases slab — no conversion needed)
  • quality bytes
  • aux tag bytes in BAM binary format (SAM/CRAM parsers serialize tags to this format)

This avoids a wasteful round-trip through BAM binary encoding.

record_store.rs:168 record_store.rs:125 +1

push_fields() and push_raw() MUST produce identical SlimRecord fixed fields and identical name/bases/cigar/qual slab contents for the same logical record. Aux tag slab contents MAY differ in integer type codes (a SAM-derived i:42 may serialize as c while the original BAM used C), since the original BAM writer's type choice is not recoverable from SAM text. Tests MUST verify fixed-field and non-aux-slab equivalence by pushing the same record both ways.

Shared header type

header.rs:141 header.rs:281

SAM headers are text (the @HD, @SQ, @RG, @PG, @CO lines). CRAM containers embed SAM header text. BamHeader already stores header_text: String and parses @SQ lines for targets.

BamHeader MUST gain a from_sam_text(text: &str) constructor that parses SAM header text directly, without requiring BGZF/BAM framing. This is used by both the SAM reader (header lines at start of file) and the CRAM reader (SAM header block in file header container).

Sort order validation

header.rs:223 +3 header.rs:329 +5

Indexed random access assumes coordinate-sorted data. If the @HD header line contains SO:unsorted or SO:queryname, the reader MUST return an error explaining that indexed region queries require coordinate-sorted input. SO:coordinate or absent SO (common in older files) MUST be accepted.

Fork semantics per format

reader.rs:186 unified_reader.rs:128

BAM fork: shares Arc<BamShared> (index + header), opens fresh File handle. (Already implemented.)

reader.rs:187 unified_reader.rs:129

SAM fork: shares Arc holding parsed tabix/CSI index + header, opens fresh File handle for BGZF reading.

reader.rs:188 compare_cram_with_htslib.rs:221 +1

CRAM fork: shares Arc holding CRAI index + header, opens fresh File handle for container reading. Each fork MUST have its own FASTA reader (via fasta_reader.fork()) for thread-safe reference lookups. Reference caching (r[cram.perf.reference_caching]) MUST be per-fork, not shared.

Readers: alignment + reference bundle

reader.rs:203 unified_reader.rs:167

The Readers struct bundles an IndexedReader (alignment) with an IndexedFastaReader (reference) in a single type. This eliminates the need for separate FASTA path passing — CRAM can access the reference it needs for sequence reconstruction, and all formats have uniform open/fork semantics.

pub struct Readers {
    alignment: IndexedReader,
    fasta: IndexedFastaReader,
}
reader.rs:222 unified_reader.rs:168 +2

Readers::open(alignment_path, fasta_path) MUST auto-detect the alignment format (via r[unified.detect_format]), open the appropriate reader, and open the FASTA reader. For CRAM, the fasta_path is passed to IndexedCramReader::open() for sequence reconstruction. For BAM/SAM, the FASTA reader is opened but not used by the alignment reader.

reader.rs:232 compare_cram_with_htslib.rs:220 +1

Readers::fork() MUST fork both the alignment reader and the FASTA reader, returning a new Readers with independent I/O handles but shared immutable state (indices, headers). The CRAM fork gets its own FASTA reader via IndexedFastaReader::fork().

reader.rs:239 unified_reader.rs:169

Readers MUST expose:

  • header() -> &BamHeader — delegates to the alignment reader's header.
  • fetch_into(tid, start, end, store) -> Result<usize> — delegates to the alignment reader.
  • fasta() -> &IndexedFastaReader and fasta_mut() -> &mut IndexedFastaReader — direct access for callers that need reference sequences independently of the alignment reader (e.g., the call pipeline's segment fetching).
  • alignment() -> &IndexedReader and alignment_mut() -> &mut IndexedReader — direct access when needed.
reader.rs:129 unified_reader.rs:154

IndexedReader::open(path) MUST continue to work for BAM and SAM files without a FASTA path. CRAM detection in IndexedReader::open() MUST return an error explaining that CRAM requires a reference and suggesting Readers::open() instead. This preserves backward compatibility for code that only needs BAM/SAM.

API surface

See r[io.non_exhaustive_enums] and r[io.minimal_public_api] in general.md for the general rules. For this module, IndexedReader, FormatDetectionError, and ReaderError are the primary enums subject to r[io.non_exhaustive_enums].

BGZF Block Reader

BGZF (Blocked Gzip Format) is the compression layer used by BAM and other HTS (High-Throughput Sequencing) formats. Unlike plain gzip, which compresses an entire file as one stream, BGZF splits the data into independent blocks of up to 64 KiB uncompressed. Each block is a self-contained gzip member, meaning any block can be decompressed without reading the ones before it. This enables random access: a BAM index can point directly to a specific block offset in the compressed file, and the reader can decompress just that block to find the data it needs.

A BGZF file is simply a concatenation of these gzip blocks, ending with a special empty EOF marker block. Each block has a standard gzip header with an extra field that encodes the block's total compressed size (BSIZE), which the reader uses to locate where the next block begins.

Sources: All rules in this file derive from [SAM1] §4.1 "The BGZF compression format", §4.1 "Random access", and §4.1 "End-of-file marker". See references.md.

Block structure

[SAM1] §4.1 "The BGZF compression format" — block layout, extra field, BSIZE

bgzf.rs:165 compare_header.rs:20

A BGZF block MUST begin with the gzip magic bytes 1f 8b 08 04 (gzip, DEFLATE, FEXTRA flag set).

bgzf.rs:166 compare_header.rs:21

The extra field MUST contain a BC subfield (SI1=0x42, SI2=0x43) with SLEN=2 whose value is the total block size minus one (BSIZE). The total compressed block size is BSIZE + 1.

bgzf.rs:167 compare_bgzf_crate.rs:49 +2

Each block's DEFLATE payload MUST be decompressed independently. The last 8 bytes of the block are the gzip footer: a CRC32 checksum (4 bytes) followed by the uncompressed size ISIZE (4 bytes), both little-endian. Decompression MUST produce exactly ISIZE bytes.

bgzf.rs:263 bgzf.rs:446

BGZF blocks MUST NOT exceed 65536 bytes uncompressed. If the ISIZE footer field claims a larger value, the reader MUST return an error rather than allocating an unbounded buffer.

bgzf.rs:168 compare_records.rs:74

An EOF marker block has ISIZE=0. When encountered, the reader MUST signal end-of-stream.

Virtual offsets

[SAM1] §4.1 "Random access" — virtual file offset definition

Because BGZF blocks are at known compressed file offsets and have known uncompressed sizes, any byte in the uncompressed stream can be addressed with a virtual offset: a packed 64-bit value that combines "which block" and "where within that block." BAM index files (.bai) store virtual offsets to point at specific records.

bgzf.rs:10 bgzf.rs:6 +2

A virtual offset is a 64-bit value where the upper 48 bits encode the compressed block offset in the file and the lower 16 bits encode the byte offset within the uncompressed block.

bgzf.rs:145 compare_records.rs:72

The reader MUST support seeking to an arbitrary virtual offset by seeking the underlying file to the block offset, decompressing that block, and advancing to the within-block offset.

Reading

BAM records are variable-length and can span block boundaries (a record may start near the end of one block and continue into the next). The reader must handle this transparently.

bgzf.rs:315 compare_header.rs:23

The reader MUST support reading an exact number of bytes, transparently crossing block boundaries when the requested data spans multiple blocks.

bgzf.rs:341 compare_records.rs:75

The reader MUST support partial reads (up to N bytes) returning the actual count, returning 0 at EOF.

bgzf.rs:72 compare_header.rs:24

Decompression SHOULD use the libdeflater crate (Rust bindings to the libdeflate C library) for performance parity with htslib's libdeflate usage. libdeflate provides hardware-accelerated DEFLATE decompression and CRC32 computation.

Integrity

[SAM1] §4.1 "The BGZF compression format" — gzip footer CRC32 and ISIZE fields

Each gzip block includes a CRC32 checksum of the uncompressed data. Verifying this catches silent data corruption from disk errors, network glitches on cluster storage, or truncated writes.

bgzf.rs:169 bgzf.rs:55 +3

After decompression, the reader MUST verify the CRC32 checksum of the decompressed data against the expected CRC32 stored in the gzip footer (4 bytes before ISIZE). A mismatch MUST return a ChecksumMismatch error.

Performance

bgzf.rs:172 bgzf.rs:56

The reader SHOULD fast-path standard BGZF headers where XLEN=6 and the BC subfield is at the fixed offset (bytes 12–17 of the 18-byte header). All BAM files produced by samtools, htslib, and Picard use this layout. This avoids allocating an extra-fields buffer for the common case. Non-standard layouts MUST fall back to searching the extra fields for the BC subfield.

bgzf.rs:171 bgzf.rs:57

When resizing buffers that will be immediately and fully overwritten (by read_exact or decompression), the reader SHOULD use an uninitialized resize to avoid redundant zero-filling. This applies to the decompressed block buffer (~64 KB per block) and the compressed data buffer.

bgzf.rs:81 +1 bgzf.rs:58

The reader MUST track the current block's compressed file offset for virtual offset calculation. The offset MUST be queried from the stream position before reading each new block, and set directly on seeks.

CIGAR Operations

A CIGAR string describes how a read aligns to the reference genome. It's a sequence of operations like "50M2I48M" meaning "50 bases match, 2 bases inserted in the read, 48 bases match." Each operation has a type and a length.

In BAM's binary format, CIGAR operations are packed as u32 values: the lower 4 bits encode the operation type (0–8) and the upper 28 bits encode the length. A 150bp read with a simple alignment has just one operation (150M), while reads spanning splice junctions or containing indels can have 10+ operations.

The key concept is which operations consume the reference and/or query (read) sequences:

  • Consumes reference: the operation advances the position on the reference genome
  • Consumes query: the operation advances the position within the read

This determines how to translate between reference positions and read positions — the core question the pileup engine asks thousands of times per second.

Sources: [SAM1] §1.4 "The alignment section: mandatory fields" — CIGAR operations table with op codes, BAM numeric encodings, and "consumes query"/"consumes reference" flags. §4.2 "The BAM format" — packed u32 CIGAR encoding (op_len << 4 | op). See references.md.

Operations

[SAM1] §1.4 "The alignment section: mandatory fields" — CIGAR operations table (M=0, I=1, D=2, N=3, S=4, H=5, P=6, ==7, X=8)

cigar.rs:7 cigar.rs:112

The following CIGAR operations MUST be supported:

  • M (0): alignment match (consumes query and reference)
  • I (1): insertion (consumes query only)
  • D (2): deletion (consumes reference only)
  • N (3): reference skip (consumes reference only)
  • S (4): soft clip (consumes query only)
  • H (5): hard clip (consumes neither)
  • P (6): padding (consumes neither)
  • = (7): sequence match (consumes query and reference)
  • X (8): sequence mismatch (consumes query and reference)

Matches and indels

These aggregate statistics are used downstream for read quality metrics (e.g. fraction of matching bases, indel rate).

cigar.rs:75 cigar.rs:11 +3

calc_matches_indels(cigar) MUST compute the total number of matching bases (sum of M operation lengths) and the total number of indel bases (sum of I and D operation lengths) from a packed CIGAR array.

CIGAR index

For the pileup engine, the critical operation is: "at reference position X, what is the corresponding position in this read?" This requires walking the CIGAR to accumulate reference and query offsets. The CigarIndex precomputes this to avoid re-walking for every position.

cigar.rs:96 cigar.rs:49

A CigarIndex MUST be constructable from a record's CIGAR and start position. It precomputes cumulative reference and query offsets per operation.

cigar.rs:144 cigar.rs:50 +8

CigarIndex::qpos_at(ref_pos) MUST return Some(query_pos) when the reference position falls within a query-consuming operation (M, =, X, I at the boundary), or None when the position falls within a deletion, reference skip, or outside the alignment.

cigar.rs:254 cigar.rs:401

qpos_at MUST return None for positions outside the alignment's reference span. This applies to all mapping variants, including the linear fast-path. Positions before the alignment start or at/after the alignment end MUST return None.

CompactOp::ref_start is stored as i32. Since BAM positions are defined as i32 values (max 2^31 - 1), this is sufficient. The construction of CompactOp MUST assert (via debug_assert!) that the computed reference position fits in i32.

cigar.rs:145 cigar.rs:244 +2

qpos_at MUST produce identical results to htslib's Alignment::qpos() for every reference position within the alignment span. This is verified by comparison tests.

Seqair-Types — Phred Quality Score

Sources: Phred quality scores are defined in [SAM1] section 1.4 (QUAL field, Phred+33 encoding). The Phred newtype wraps the mathematical definition: Q = -10 log10(P). See references.md.

Construction

phred.rs:49 phred.rs:123 +1

Phred scores MUST be non-negative. from_phred MUST accept only u8 values (0..=255), making negative inputs unrepresentable at the type level.

Integer conversion

phred.rs:34 phred.rs:104 +2

as_int MUST clamp the result to the range [0, 99]. Values less than or equal to zero (including negative results from floating-point arithmetic) MUST return 0. NaN MUST return 0.

BAM Reader

The indexed BAM reader is the main entry point for loading aligned reads from a BAM file. It ties together three layers: BGZF decompression (reading compressed blocks), BAI index lookup (finding which blocks contain reads for a region), and record decoding (parsing the binary BAM records into usable structs).

The reader's primary operation is fetch_into: given a genomic region like chr19:6,105,700–6,105,800, it queries the index, bulk-reads the relevant compressed data into memory (see region_buf.md), decompresses it, decodes the records, and stores them in a RecordStore for downstream processing by the pileup engine.

Sources: [SAM1] §4 "The BAM Format Specification" generally — BGZF, BAM header, record layout, and BAI index together define the reader's behaviour. The forking design and ChunkCache are seqair-specific optimisations with no upstream spec counterpart. See references.md.

Opening

reader.rs:120 compare_header.rs:18

The reader MUST open a BAM file by path, parse its header, and locate the corresponding .bai index file (trying both <path>.bai and <path_without_ext>.bai).

reader.rs:121 compare_header.rs:19

The reader MUST expose the parsed BamHeader for contig name/tid/length lookups.

Region fetching

reader.rs:181 compare_bam_with_htslib.rs:59 +1

fetch_into(region, store) MUST clear the store, query the BAI index for chunks overlapping the region, seek through those chunks via BGZF, decode all records, and add records that actually overlap the region to the store (RecordStore).

reader.rs:182 compare_records.rs:60

A record overlaps a region if record.pos <= region.end AND record.end_pos >= region.start. Records outside the region that appear in index chunks MUST be filtered out. This is necessary because the BAM index is approximate — chunks may contain records slightly outside the queried region.

reader.rs:183 compare_records.rs:61

Records in the store MUST be in coordinate-sorted order (by pos, then by end_pos as tiebreaker) after fetch_into, matching the order a sorted BAM file provides.

Edge cases

reader.rs:252 compare_records.rs:62

Unmapped reads (flag 0x4) MUST be excluded from the store during fetch_into. htslib's pileup engine rejects them in bam_plp_push, so including them would inflate depth counts.

reader.rs:184 pileup.rs:253

Secondary (0x100) and supplementary (0x800) reads MUST be included in the store by default. The caller's pileup filter is responsible for excluding them if desired, matching htslib's behavior.

Forking (thread-safe shared state)

Rastair processes regions in parallel via rayon. Each worker thread needs its own IndexedBamReader because file handles have mutable seek state. Without forking, every thread independently opens the BAM file, re-reads the entire BAI index from disk, and re-parses the BAM header — wasting both I/O bandwidth and memory. On HPC clusters with NFS/Lustre, redundant index reads are especially costly.

The fork pattern separates the reader into shared immutable data (BamIndex, BamHeader) and per-thread mutable state (file handles, decompression buffers). A prototype reader is opened once on the main thread; worker threads call fork() to get a lightweight copy that shares the index and header via Arc but owns fresh file handles.

Shared state

reader.rs:107 fork.rs:93

The BAM index and header MUST be stored behind Arc so that forked readers share a single copy. These structures are read-only after initial parsing and are accessed concurrently without synchronization.

Fork operation

reader.rs:156 fork.rs:28 +1

fork() MUST produce a new reader that:

  1. Shares the same Arc holding BamIndex and BamHeader as the source reader (no re-parsing, no additional memory).
  2. Opens a fresh raw File handle for bulk RegionBuf reads. No BgzfReader is needed — fetch_into only uses RegionBuf, and the BgzfReader used during open() for header parsing is not retained.
  3. Allocates its own ChunkCache (starts empty, populated on first region fetch for a tid).
reader.rs:158 fork.rs:29 +1

A forked reader MUST produce identical fetch_into results as independently opening the same BAM file. The fork is purely an optimization — it MUST NOT change observable behavior.

reader.rs:157 fork.rs:106

Forked readers MUST be fully independent for mutable operations. Seeking or fetching on one reader MUST NOT affect any other reader (original or forked). Each reader owns its own file descriptors and buffer state.

Edge cases

reader.rs:172 fork.rs:68 +1

Arc::ptr_eq on the shared state of the original and forked reader MUST return true, confirming they share the same allocation rather than having made a deep copy.

reader.rs:159 fork.rs:137 +1

Multiple forks from the same prototype MUST be safe to use concurrently from different threads. Since the shared state is immutable and each fork owns its file handles, no synchronization is needed beyond the Arc reference count.

Coordinate validation

reader.rs:82 +1 reader.rs:493 +3

fetch_into MUST validate that tid fits in i32 and that start/end fit in i64 before using them. If any value exceeds the signed maximum, fetch_into MUST return a BamError::CoordinateOverflow rather than silently truncating via as casts.

Error propagation

reader.rs:241 +1

fetch_into (and ChunkCache::load) MUST propagate I/O and decompression errors from the BGZF layer. Only BgzfError::UnexpectedEof (signalling the end of a chunk's data) should silently break the read loop. All other errors — CRC failures, truncated blocks, decompression failures, invalid magic, et.c — MUST be returned to the caller.

BAM Index (BAI/CSI)

A BAM file can be hundreds of gigabytes. Without an index, finding all reads overlapping a small genomic region (e.g. chr19:6,105,700–6,105,800) would require decompressing and scanning the entire file. The BAM index (.bai file) solves this by recording, for each genomic region, which compressed byte ranges in the BAM file contain relevant reads.

The index uses a binning scheme: the genome is divided into hierarchical bins of decreasing size (the whole reference → 512 Mbp → 64 Mbp → ... → 16 KiB leaf bins). Each bin stores a list of chunks — pairs of virtual offsets (see bgzf.md) marking the start and end of compressed byte ranges in the BAM file that contain reads falling into that bin. A linear index provides an additional optimization: for each 16 KiB window, it stores the minimum virtual offset of any read starting in that window, allowing the reader to skip chunks that are entirely before the query start.

Sources: [SAM1] §5 "Indexing BAM" — binning algorithm; §5.1.1 "Basic binning index" — bin hierarchy and sizes; §5.2 "The BAI index format for BAM files" — binary format; §5.3 "C source code for computing bin number and overlapping bins" — reg2bin/reg2bins. [CSI] — for CSI format. See references.md.

BAI format

[SAM1] §5.2 "The BAI index format for BAM files" — magic, n_ref, n_bin, chunk virtual offsets, n_intv linear index

index.rs:85 compare_records.rs:68

A BAI file MUST begin with the magic bytes BAI\1 (0x42, 0x41, 0x49, 0x01). The reader MUST reject files that do not match.

index.rs:86 compare_records.rs:69

The reader MUST parse the BAI binary format: n_ref references, each containing bins with chunks (pairs of virtual offsets) and a linear index of 16 KiB window offsets.

index.rs:323 index.rs:653

When decompressing BGZF blocks (e.g. for tabix), the reader MUST validate that bsize is large enough for the block header and footer before indexing into the block. A bsize that is too small MUST return an error rather than panicking.

index.rs:175 compare_records.rs:70

Given a region (tid, start, end), the index MUST return a list of BGZF chunk pairs (begin, end) that contain all reads overlapping the region. The query MUST use the binning scheme to identify candidate bins and the linear index to skip chunks that cannot contain reads starting before the query start.

index.rs:369 compare_records.rs:71

The BAI binning scheme uses reg2bin(beg, end) where bin 0 spans the whole reference, bins 1–8 span 64 Mbp each, and leaf bins span 16 KiB. The implementation MUST correctly compute bins for a region and enumerate all overlapping bins at every level.

Bin 0

[SAM1] §5.1.1 "Basic binning index" — bin 0 spans 512 Mbp, bins 1–8 span 64 Mbp

Bin 0 is the root of the binning hierarchy, covering the entire reference sequence (0 to 2^29). Each alignment is placed in the smallest bin that fully contains its [pos, end) interval. An alignment is placed in bin 0 when its interval straddles a level-1 boundary (multiples of 2^26 = 64 Mbp). This can happen to short reads (e.g. 150 bp) that cross positions like 67,108,864 or 134,217,728. See SAM spec §5.3.

In practice, bin 0 contains very few reads (only those near the ~3–4 level-1 boundaries per chromosome), but its chunks may be located at a very different file offset from the rest of the query's chunks. In a 43 GB BAM, this can mean an extra seek of several GB — a significant cost on network storage.

index.rs:212 index.rs:501

The query MUST always include bin 0 in the candidate bins for any region query. Omitting it would silently drop reads that straddle 64 Mbp boundaries.

index.rs:66 +1 index.rs:481

The index SHOULD provide a query_split method that separates distant (level 0–2) chunks from nearby (level 3–5) chunks in the result, returning a QueryChunks struct with nearby and distant fields. This allows callers to handle them differently (e.g. caching, separate I/O). The original query method MUST remain available for callers that don't need the separation.

Higher-level bins and ChunkCache

The same distant-chunk problem affects bins at levels 1 (64 Mbp) and 2 (8 Mbp). For a query in chr5:100K–200K, the index includes bin 1 (covering chr5:0–64M) whose chunks span ~2 GB of the BAM file, and bin 9 (covering chr5:0–8M) whose chunks span ~0.27 GB. These chunks are identical for every region query within the same reference and contain many records that are outside the query region (~15% of decoded records are discarded by the position filter).

index.rs:67 +1 index.rs:482

The query_split method MUST separate chunks into nearby (levels 3–5, loaded via RegionBuf per query) and distant (levels 0–2, suitable for caching). The threshold (level ≤ 2) is based on profiling data showing that levels 0–2 contribute >90% of the file span in typical region queries.

index.rs:155 +2 reader.rs:441

The reader SHOULD maintain a ChunkCache that loads ALL records from bins at levels 0–2 (covering ≥8 Mbp) once per reference sequence (tid) per thread. The BamIndex::distant_chunks method returns all level 0–2 chunks for a tid, unfiltered by region, for use when populating the cache. Per query, matching records (overlapping the query region) are injected from the cache into the RecordStore without any additional I/O. The cache MUST be invalidated when switching to a different reference (tid). This eliminates expensive seeks for distant-bin chunks on every region query.

CSI format

[CSI] — min_shift, depth, l_aux header; per-bin loffset; parameterised reg2bin/reg2bins

CSI is an alternative index format that supports larger genomes and arbitrary bin sizes. It uses the same conceptual approach (bins + chunks + linear index) but with a configurable minimum shift and depth.

CSI index support MAY be added as a follow-up. The same region_query interface MUST be used for both BAI and CSI indices.

BAM Header

A BAM file begins with a header that describes the reference genome and records metadata about how the alignments were produced. The header contains two parts: a SAM-format text section (with @HD, @SQ, @RG, @PG lines) and a binary reference sequence table. The reference table maps each contig (chromosome) to a numeric target ID (tid) used throughout the rest of the file — every aligned read stores the tid of its reference contig rather than the name string.

Sources: [SAM1] §4.2 "The BAM format" (BAM magic, l_text, n_ref, reference sequence table); §1.3 "The header section" (SAM header line types @HD/@SQ/@RG/@PG). See references.md.

Parsing

[SAM1] §4.2 "The BAM format" — magic, l_text, text, n_ref, l_name, name, l_ref fields

header.rs:73 compare_header.rs:16

A BAM file MUST begin with the magic bytes BAM\1 (0x42, 0x41, 0x4d, 0x01) after BGZF decompression. The reader MUST reject files that do not start with this magic.

header.rs:74 compare_header.rs:17

After the magic, a 4-byte little-endian l_text gives the length of the SAM header text. The reader MUST parse this text to extract @HD, @SQ, @RG, and @PG header lines.

header.rs:86 header.rs:404

Header length fields (l_text, n_ref, l_name) are stored as i32 in the BAM format. The reader MUST validate that these values are non-negative before using them as sizes. Negative values from corrupt files MUST return a BamHeaderError rather than causing a massive allocation via i32 as usize wrapping.

header.rs:75 compare_header.rs:15

After the header text, a 4-byte little-endian n_ref gives the number of reference sequences. Each reference is encoded as a 4-byte l_name (including NUL), the name bytes, and a 4-byte l_ref (sequence length). The reader MUST parse all reference sequences.

Contig lookup

The rest of the BAM file and its index use tids (integer IDs) to refer to contigs. The caller typically has a contig name (e.g. "chr19") from user input or a BED file and needs the tid for index queries. Conversely, when reporting results, the tid needs to be mapped back to a name.

header.rs:200 compare_header.rs:14

The header MUST support O(1) lookup from contig name to target ID (tid) and O(1) lookup from tid to contig name and length.

header.rs:214 compare_header.rs:12

The header MUST provide access to the list of all target names in tid order.

header.rs:209 compare_header.rs:13

The header MUST provide the length of each target sequence by tid.

BAM Record

Each aligned read in a BAM file is stored as a record: a binary structure encoding the read's position on the reference genome, its DNA sequence, quality scores, alignment details (CIGAR), and optional auxiliary tags. Records are variable-length — a 150 bp read with few tags is ~300 bytes, while a long read with many tags can be kilobytes.

The record's binary layout starts with 32 fixed bytes (position, flags, lengths, etc.), followed by variable-length fields in order: read name (qname), CIGAR operations, 4-bit packed sequence, quality scores, and auxiliary tags.

Sources: [SAM1] §4.2 "The BAM format" — binary record layout (refID, pos, l_read_name, mapq, bin, n_cigar_op, flag, l_seq, cigar, seq, qual, auxiliary); §4.2.4 "SEQ and QUAL encoding" — 4-bit sequence encoding; §4.2.5 "Auxiliary data encoding" — tag types A/c/C/s/S/i/I/f/Z/H/B; §1.4 "The alignment section: mandatory fields" — FLAG bit definitions. See references.md.

Decoding

[SAM1] §4.2 "The BAM format" — block_size, fixed 32-byte header, variable-length field order

record.rs:16 compare_bam_with_htslib.rs:60 +1

A BAM record MUST be decoded from its binary representation: 32 fixed bytes followed by variable-length qname, CIGAR, sequence, quality, and auxiliary data. The 4-byte block_size prefix is read by the caller.

region_buf.rs:435 region_buf.rs:565

BAM records have a 2 MiB size cap. The reader MUST reject records whose block_size exceeds this limit rather than allocating an unbounded buffer from untrusted input.

record.rs:203 record.rs:393 +1

Variable-length field offset calculations (var_start, cigar_end, seq_end, qual_end) MUST use checked arithmetic to prevent wrapping on malformed input. Overflow MUST return a decode error.

record.rs:15 compare_bam_with_htslib.rs:61 +2

The decoder MUST extract: tid (i32), pos (i32→i64), mapq (u8), flags (u16), seq_len (u32), n_cigar_ops (u16), qname (NUL-terminated, stored without NUL), packed sequence (4-bit), quality scores, CIGAR operations (packed u32), and raw auxiliary data.

record.rs:129 compare_bam_with_htslib.rs:83 +1

The decoder MUST precompute the reference end position from the CIGAR. Reference-consuming operations are: M(0), D(2), N(3), =(7), X(8). The end position is pos + ref_consumed - 1 (inclusive).

record.rs:130 pileup.rs:270

When a read has zero reference-consuming CIGAR operations (e.g., pure soft-clip or insertion-only), the reference span is zero. In this case end_pos MUST equal pos (matching htslib's bam_cigar2rlen which returns 0, placing the read at exactly one position). The implementation MUST NOT compute pos - 1 for such reads.

Sequence encoding

[SAM1] §4.2.4 "SEQ and QUAL encoding" — 4-bit nibble encoding, =ACMGRSVTWYHKDBN lookup table

BAM stores DNA sequences in a compact 4-bit encoding: two bases per byte, high nibble first. This halves storage compared to ASCII but requires decoding before use.

record.rs:75 compare_bam_with_htslib.rs:84 +1

BAM encodes sequences in 4-bit nibbles: two bases per byte (high nibble first). The standard lookup table maps 0→=, 1→A, 2→C, 4→G, 8→T, 15→N, and other values to IUPAC ambiguity codes.

record.rs:109 compare_records.rs:130

The record MUST support decoding a single base at a given read position.

record.rs:76 compare_records.rs:131

The decoder MAY decode the full 4-bit sequence into Base enum values at construction time using SIMD-accelerated bulk decoding (see base_decode.md). Single-base access via seq_at(idx, pos) returns the pre-decoded Base value directly.

Flag access

[SAM1] §1.4 "The alignment section: mandatory fields" — FLAG bit table (0x4 unmapped, 0x10 reverse strand, 0x40 first in template, 0x80 second in template)

Each record has a 16-bit flags field encoding properties like strand orientation, pairing status, and mapping quality. These are checked frequently during pileup construction and filtering.

record.rs:89 compare_records.rs:157

is_reverse() MUST return true when the 0x10 bit is set.

record.rs:94 compare_records.rs:158

is_first_in_template() MUST return true when the 0x40 bit is set.

record.rs:99 compare_records.rs:159

is_second_in_template() MUST return true when the 0x80 bit is set.

record.rs:104 compare_records.rs:160

is_unmapped() MUST return true when the 0x4 bit is set.

Auxiliary tags

[SAM1] §4.2.5 "Auxiliary data encoding" — tag byte layout, type codes A/c/C/s/S/i/I/f/Z/H/B

BAM records can carry optional key-value tags (e.g. XR:Z:CT for bismark strand, MD:Z:... for mismatch string). Tags are stored as a flat byte array at the end of the record, each prefixed by a 2-byte name and a 1-byte type code.

record.rs:114 compare_records.rs:192

The record MUST support looking up auxiliary tags by their 2-byte name. Tag types A, c, C, s, S, i, I, f, d, Z, H, and B (typed array) MUST be supported.

B-type (array) auxiliary tags are not yet parsed into AuxValue. find_tag returns None for array tags. The tag bytes MUST be correctly skipped so that all subsequent tags in the record remain accessible.

record.rs:259 compare_records.rs:193

The record MUST provide access to raw auxiliary data bytes for efficient filtering without full tag parsing.

Sequence Codec

BAM encodes DNA sequences in a compact 4-bit format: two bases per byte, high nibble first. A 150bp read takes 75 bytes instead of 150. The 16 possible nibble values map to the 4 standard bases (A=1, C=2, G=4, T=8), N (15), and IUPAC ambiguity codes for the remaining values.

Decoding these sequences is on the hot path: during pileup construction, the engine needs the base at each read's query position for every reference position. With ~30 reads at ~3 billion positions, that's ~90 billion base lookups for a whole-genome BAM. Encoding is needed when rewriting BAM files with modified sequences.

Sources: [SAM1] §4.2.4 "SEQ and QUAL encoding" — 4-bit encoding table =ACMGRSVTWYHKDBN → [0,15], high-nibble-first packing. The SIMD decode path and pair-table optimisation are seqair-specific. See references.md.

Decoding (4-bit → ASCII)

seq.rs:4 seq_codec.rs:8 +2

The scalar decoder MUST use a 16-entry lookup table mapping 4-bit codes to ASCII bases (=ACMGRSVTWYHKDBN). It MUST support decoding full sequences (all bases) and single bases at arbitrary positions.

seq.rs:5 seq_codec.rs:37

For bulk decoding, a 256-entry pair table (DECODE_PAIR[byte] → [high_base, low_base]) SHOULD be used to decode two bases per byte lookup, avoiding repeated shift-and-mask operations.

seq.rs:6 seq_codec.rs:56

On platforms with SIMD support (SSSE3 on x86_64, NEON on aarch64), bulk sequence decoding MUST use SIMD table-lookup instructions (pshufb / vqtbl1q_u8) to decode 32 bases per iteration from 16 packed bytes. A scalar tail MUST handle the remaining bytes.

seq.rs:7 seq_codec.rs:57

Runtime dispatch MUST select the fastest available decoder: SSSE3 on x86_64 (with feature detection), NEON on aarch64 (always available), scalar fallback on all other platforms.

Encoding (ASCII → 4-bit)

seq.rs:8 seq_codec.rs:112 +2

The scalar encoder MUST use a 256-entry lookup table mapping ASCII bytes to 4-bit codes. Unknown bases MUST map to 15 (N). Two bases MUST be packed per byte (high nibble first).

Correctness

seq.rs:9 seq.rs:504 +4

The SIMD decoder MUST produce byte-identical output to the scalar decoder for all inputs. This MUST be verified by property-based tests covering arbitrary sequence lengths and byte patterns, including SIMD boundary lengths (0, 1, 15, 16, 31, 32, 33, 63, 64, 65, 128).

Base-typed Sequence Decoding

Sources: [SAM1] §4.2.4 "SEQ and QUAL encoding" — 4-bit nibble-to-base mapping (=ACMGRSVTWYHKDBN). The Base enum, SIMD acceleration, and direct-to-Base decode table are seqair-specific design choices; the upstream spec defines only the nibble codes. See references.md.

Background

BAM sequences are stored as 4-bit nibbles (two bases per byte). The standard decode table maps nibbles to ASCII characters (=ACMGRSVTWYHKDBN). Downstream, the ASCII bytes must be converted to Base enum values (A, C, G, T, Unknown) at every pileup position — billions of times for whole-genome data.

By decoding directly into Base values at record construction time, we:

  1. Eliminate the per-position Base::from(u8) conversion (a LUT lookup per read per position)
  2. Get compile-time type safety — a Base is guaranteed to be one of 5 valid values, not an arbitrary byte
  3. Collapse IUPAC ambiguity codes (M, R, W, S, Y, K, etc.) and = to Base::Unknown at decode time, instead of at each use site

Decode table

seq.rs:108 base_decode.rs:6 +2

A 16-entry lookup table MUST map 4-bit BAM nibble codes directly to Base discriminant values: 1→A(65), 2→C(67), 4→G(71), 8→T(84), and all other codes (0, 3, 5, 6, 7, 9–14, 15) → Unknown(78).

Table invariant

seq.rs:109 seq.rs:474 +1

Every entry in DECODE_BASE_TYPED (16 entries) and DECODE_PAIR_TYPED (256 × 2 = 512 bytes) MUST be a valid Base discriminant: A(65), C(67), G(71), T(84), or Unknown(78). This MUST be verified by an exhaustive test covering all entries.

Decoding function

seq.rs:150 base_decode.rs:7 +2

decode_bases(packed_seq, seq_len) MUST decode a 4-bit packed BAM sequence into a Vec<Base>. The function SHOULD use the SIMD-accelerated decode path (SSSE3/NEON) with a Base-valued lookup table for identical throughput to the ASCII decoder.

Storage in RecordStore

record_store.rs:63 base_decode.rs:47

The RecordStore MUST store decoded bases in a dedicated slab (Vec<Base>) separate from the data slab. The data slab continues to hold cigar, qual, and aux data. The seq_at method MUST return Base instead of u8.

PileupAlignment

pileup.rs:96 base_decode.rs:60

PileupAlignment.base MUST be of type Base, not u8. This eliminates the Base::from(u8) conversion in the metrics accumulation hot loop.

ASCII-to-Base batch conversion

Reference sequences (FASTA) and SAM text sequences arrive as ASCII bytes, not 4-bit packed nibbles. These need the same one-time batch conversion to Base that BAM sequences get via decode_bases, but using an ASCII-aware path instead of the nibble decode table.

base.rs:76 base.rs:573 +2

Base::from_ascii_vec(Vec<u8>) -> Vec<Base> MUST convert a vector of ASCII bytes to Base values in-place (reusing the allocation). Maps A/a→A, C/c→C, G/g→G, T/t→T, all other bytes→Unknown. The function MUST NOT allocate a new vector — it operates on the input's buffer and transmutes the result. This is safe because Base is repr(u8) and the function only writes valid Base discriminant bytes (65, 67, 71, 78, 84).

base.rs:77 base.rs:536 +1

The ASCII batch converter MUST use SIMD acceleration (SSSE3 on x86_64, NEON on aarch64) with a scalar fallback, following the same dispatch pattern as decode_bases. This is the required path for all bulk u8→Base conversions outside BAM 4-bit decoding: FASTA reference sequences, SAM text sequences, and any other ASCII-encoded base data.

base.rs:114 base.rs:537 +2

The SIMD ASCII converter MUST produce identical output to applying Base::from(u8) element-wise. This MUST be verified by property-based tests covering arbitrary byte values and lengths including SIMD boundary lengths (0, 1, 15, 16, 31, 32, 33, 63, 64, 65, 128).

FromStr validation

base.rs:393 base.rs:629 +2

FromStr for Base MUST accept only a single character A, C, G, T, or N (case-insensitive) after trimming whitespace. Multi-character input (after trimming) MUST return Err(BaseError::MultipleChars). Any single character that is not A/C/G/T/N MUST return Err(BaseError::InvalidBase(byte)) where byte is the first byte of the trimmed input. Empty input MUST return Err(BaseError::Empty).

Error display

base.rs:483 base.rs:663

BaseError variants MUST display the actual invalid value. InvalidBase(byte) SHOULD format as "Invalid base: 0x{byte:02x}". Empty SHOULD format as "Empty".

CRAM Reader

A CRAM reader for region-based random access. CRAM is a reference-based, column-oriented compression format that achieves significantly smaller file sizes than BAM (typically 40-60% of BAM size) at the cost of decoding complexity and reference dependency.

Sources: [CRAM3] throughout — file structure, containers, slices, blocks, variable-length integers, bit stream, compression header, data series encodings, record decode order, read features, CRAI index. [CRAMcodecs] — rANS 4×8 (§2), rANS Nx16 (§3), arithmetic coder (§4), tok3 (§5), fqzcomp (§6). See references.md.

Context

CRAM stores alignment data as differences against a reference genome rather than full sequences. Where BAM stores ACGTACGT for a read, CRAM stores "matches reference at this position, except substitution at offset 5". This reference-based approach, combined with column-oriented storage and sophisticated entropy coding, makes CRAM the most compact alignment format.

Rastair needs CRAM support because many large-scale sequencing projects store data exclusively in CRAM format.

Goals and scope

reader.rs:230 compare_cram_with_htslib.rs:110

The reader MUST support CRAM v3.0 and v3.1. CRAM v2.0 and v2.1 MAY be supported but are not required — they are rarely encountered in practice. The magic bytes CRAM followed by major version 3 MUST be accepted; major version 2 SHOULD produce a warning; other versions MUST be rejected.

The reader is read-only. CRAM writing is out of scope — rastair writes BAM (via htslib) for output.

reader.rs:246 compare_cram_parsing.rs:145

CRAM decoding requires access to the reference genome (FASTA). The reader MUST accept a FASTA reader (the one being built on the other branch) for sequence lookups. Embedded reference slices (where the reference is stored in the CRAM container itself) MUST also be supported as a fallback.

Note: htslib supports automatic reference download via REF_PATH/REF_CACHE environment variables and the EBI reference server. This is out of scope for seqair, but when the FASTA reader is missing or the required contig is absent, the error message SHOULD mention REF_PATH/REF_CACHE as an alternative for users familiar with the htslib workflow.

File structure

[CRAM3] §6 "File definition" — CRAM magic, major/minor version, file ID

File definition

reader.rs:223 compare_cram_parsing.rs:21

The file MUST start with CRAM (bytes 43 52 41 4d), followed by major version (1 byte) and minor version (1 byte), followed by a 20-byte file ID.

reader.rs:418 container.rs:245 +2

The first container after the file definition MUST be the file header container. It contains a single block (content_type=FILE_HEADER) whose data is:

  • header_text_length (i32, fixed-width little-endian): byte length of the SAM header text
  • header_text (UTF-8 bytes): the SAM header text (@HD, @SQ, @RG, etc.)

The header text is parsed via BamHeader::from_sam_text() to populate target names, lengths, and tid mappings. Note: the i32 length prefix is part of the block's decompressed data, not the block header itself.

container.rs:123 container.rs:207

The file MUST end with an EOF container. The EOF container produced by samtools has length=15 (not 0) and num_records=0. The reader MUST detect the EOF container as any container where num_records=0 and length <= 15. The reader SHOULD verify the EOF marker but MUST NOT fail if it's missing (truncated files should produce a warning, not a hard error).

Containers

[CRAM3] §7 "Container header structure" — length (fixed i32), itf8 fields, landmarks, CRC32

container.rs:34 container.rs:157 +2

Each container header contains:

  • length (i32, fixed-width little-endian): total byte length of all blocks in the container. This is the only fixed-width field in the header; all subsequent fields use variable-length encoding.
  • reference_sequence_id (itf8): reference ID, or -1 (unmapped), or -2 (multi-ref)
  • alignment_start (itf8): 1-based start position on reference
  • alignment_span (itf8): number of reference bases covered
  • num_records (itf8): number of alignment records
  • record_counter (ltf8): global 0-based record index (v3.0+)
  • bases (ltf8): total bases in container (v3.0+)
  • num_blocks (itf8): block count
  • landmarks (itf8 array): count (itf8) followed by that many itf8 values; byte offsets from container data start to each slice header
  • crc32 (4 bytes, fixed-width little-endian): CRC32 over all preceding header bytes (from length through end of landmarks)
reader.rs:290 compare_cram_with_htslib.rs:111

During fetch_into, the reader MUST skip containers whose [alignment_start, alignment_start + alignment_span) range does not overlap the query region. The CRAI index provides byte offsets for this.

Slices

[CRAM3] §8.4 "Slice header block" — slice header fields, multi-ref slices, embedded reference, reference_md5

slice.rs:31 compare_cram_with_htslib.rs:107

Each slice header contains:

  • reference_sequence_id (itf8): ref ID for this slice, -1 unmapped, -2 multi-ref
  • alignment_start, alignment_span (itf8): genomic range covered
  • num_records (itf8): records in this slice
  • record_counter (ltf8): global record counter
  • num_blocks (itf8): data blocks in this slice
  • block_content_ids (itf8 array): content IDs of the blocks
  • embedded_reference (itf8): -1 if no embedded reference, else block content ID
  • reference_md5 (16 bytes): MD5 of the reference segment (all zeros if N/A)
  • optional tags (BAM-format aux bytes)
slice.rs:249 compare_cram_with_htslib.rs:108

When reference_sequence_id == -2, the slice contains records from multiple references. The RI (reference ID) data series MUST be decoded per-record to determine which reference each record aligns to. Multi-ref slices are common in coordinate-sorted CRAM files at chromosome boundaries and for reads with unmapped mates.

slice.rs:161 compare_cram_parsing.rs:205

When embedded_reference >= 0, the corresponding external data block contains the reference bases for this slice. The reader MUST use these instead of the FASTA reader. This enables CRAM files to be self-contained (produced by samtools view --output-fmt-option embed_ref or cramtools).

Blocks

[CRAM3] §8 "Block structure" — method, content_type, content_id, compressed_size, uncompressed_size, data, CRC32 (v3.0+)

block.rs:39 block.rs:201

Each block contains:

  • method (byte): compression codec (see codecs section)
  • content_type (byte): 0=FILE_HEADER, 1=COMPRESSION_HEADER, 2=SLICE_HEADER, 4=EXTERNAL, 5=CORE
  • content_id (itf8): identifies which data series this block holds
  • compressed_size, uncompressed_size (itf8)
  • data (bytes): compressed payload
  • crc32 (4 bytes): CRC32 of the block (v3.0+)
block.rs:40 block.rs:314

CRC32 MUST be verified on every block in v3.0+. The CRC32 covers all bytes from the start of the block (method byte) through the end of the compressed data, excluding the 4 CRC32 bytes themselves. Mismatches MUST produce an error with the block's content_type and content_id for diagnostics.

Variable-length integers

[CRAM3] §2.1 "Logical data types" — ITF8 (up to 32 bits) and LTF8 (up to 64 bits) variable-length integer encodings

varint.rs:6 varint.rs:286

ITF8 (up to 32 bits) MUST be decoded as:

  • 0xxxxxxx: 1 byte, value = bits 6..0 (7 bits)
  • 10xxxxxx + 1 byte: 2 bytes, value = (b0 & 0x3F) << 8 | b1 (14 bits)
  • 110xxxxx + 2 bytes: 3 bytes, value = (b0 & 0x1F) << 16 | b1 << 8 | b2 (21 bits)
  • 1110xxxx + 3 bytes: 4 bytes, value = (b0 & 0x0F) << 24 | b1 << 16 | b2 << 8 | b3 (28 bits)
  • 1111xxxx + 4 bytes: 5 bytes, value = (b0 & 0x0F) << 28 | b1 << 20 | b2 << 12 | b3 << 4 | (b4 & 0x0F) (32 bits)

Note: in the 5-byte form, only the low 4 bits of byte 4 are used. The high 4 bits of byte 4 are unused/ignored.

The decoded value is an unsigned u32. For fields that can be negative (e.g., reference_sequence_id where -1=unmapped, -2=multi-ref), the u32 MUST be reinterpreted as signed i32 via two's complement (value as i32).

varint.rs:94 varint.rs:388

LTF8 (up to 64 bits) MUST be decoded with the same leading-bit pattern extended to 9 bytes, with 0xFF prefix indicating 8 continuation bytes.

Bit stream reading

[CRAM3] §2.2 "Reading and writing bits in a bit stream" — MSB-first bit ordering

bitstream.rs:4 bitstream.rs:64

The core data block (content_type=5) is read as a bit stream. Bits are read MSB-first (most significant bit first) within each byte:

  • Bit offset 0 is the most significant bit (0x80)
  • Bit offset 7 is the least significant bit (0x01)
  • Reading N bits shifts them out from the current position, crossing byte boundaries as needed

Encodings that read from the core (Huffman, Beta, Subexp, Gamma) all use this bit stream. Encodings that read from external blocks (External, ByteArrayStop) use byte-level access to the specific external block identified by content_id.

The core bit stream is shared across all records in a slice — fields are interleaved sequentially. Reading fields in the wrong order corrupts all subsequent data.

Compression header

[CRAM3] §8.3 "Compression header block" — preservation map, data series encoding map, tag encoding map

The compression header is a block (content_type=1) at the start of each container. It defines how all data in the container's slices is encoded.

Preservation map

[CRAM3] §8.3 "Preservation map" — RN, AP, RR, SM (substitution matrix), TD (tag dictionary)

compression_header.rs:136 compare_cram_parsing.rs:140

The preservation map MUST be decoded as a series of key-value pairs:

  • RN (bool): whether read names are preserved (if false, names are generated)
  • AP (bool): whether alignment positions are delta-coded (true) or absolute
  • RR (bool): whether the reference is required for decoding
  • SM (5 bytes): substitution matrix (see r[cram.compression.substitution_matrix])
  • TD (bytes): tag dictionary — null-separated lists of 3-byte entries (2-char tag name + 1-char BAM type)
compression_header.rs:55 compare_cram_parsing.rs:273

The substitution matrix is 5 bytes, one per reference base in order: A, C, G, T, N. Each byte encodes the ordering of substitution alternatives using 2-bit codes packed MSB-first:

byte = (alt0 << 6) | (alt1 << 4) | (alt2 << 2) | alt3

where each 2-bit value maps to a base: 0=A, 1=C, 2=G, 3=T. When decoding a substitution (BS data series value = code), the read base is \alt[code] from this ordering.

Example: for reference base C, the possible substitutions are A, G, T (not C itself). If the substitution byte is 0b_00_10_11_01 (0x2D):

  • code 0 → bits 00 → A
  • code 1 → bits 10 → G
  • code 2 → bits 11 → T
  • code 3 → bits 01 → C (unused for ref=C, but the slot exists)

For non-N reference bases, only codes 0-2 are meaningful (3 alternatives, excluding the reference base itself). Code 3 exists in the byte but is never emitted by conforming CRAM writers — the decoder MAY treat it as an error or map it to the remaining unused base. For reference base N, all 4 codes are meaningful (A, C, G, T).

Data series encoding map

[CRAM3] §8.3 "Data series encodings" — two-byte data series codes and their types (BF, CF, RI, RL, AP, RG, RN, MF, NS, NP, TS, NF, FN, FC, FP, MQ, BA, QS, BS, IN, SC, DL, RS, PD, HC, BB, QQ, TL)

compression_header.rs:241 compare_cram_parsing.rs:141

The data series encoding map MUST be decoded as key-value pairs where:

  • Key: 2-byte data series code
  • Value: encoding descriptor (encoding ID + parameters)

Complete data series list:

Code Name Type Description
BF BAM flags int BAM FLAG field (u16)
CF CRAM flags int CRAM-specific flags
RI Ref ID int Reference sequence ID (multi-ref slices only)
RL Read length int Number of bases
AP Alignment pos int Position (delta or absolute per preservation map)
RG Read group int Index into @RG header entries
RN Read name byte array QNAME
MF Mate flags int 0x1=mate reverse, 0x2=mate unmapped
NS Next segment ref int Mate reference ID
NP Next mate pos int Mate position
TS Template size int Insert size
NF Next fragment int Record-skip to next fragment (attached mates)
FN Feature count int Number of read features
FC Feature code byte Feature type character
FP Feature position int Position within read (delta from previous feature)
MQ Mapping quality int MAPQ
BA Base byte Single base
QS Quality score byte Single quality value
BS Substitution code byte Index into substitution matrix
IN Insertion byte array Inserted bases
SC Soft clip byte array Soft-clipped bases
DL Deletion length int Number of deleted reference bases
RS Ref skip int Reference skip length (intron)
PD Padding int Padding length
HC Hard clip int Hard clip length
BB Bases block byte array Stretch of bases
QQ Quality block byte array Stretch of quality scores
TL Tag line int Index into tag dictionary

Tag encoding map

[CRAM3] §8.3 "Tag encodings" — ITF8 key as (char1 << 16) + (char2 << 8) + type_char, BAM-format tag values

compression_header.rs:348 compare_cram_parsing.rs:142

The tag encoding map uses ITF8 keys computed as (char1 << 16) + (char2 << 8) + type_char. Each key maps to an encoding descriptor for that tag type. Tag values are stored as raw BAM-format bytes (without the 3-byte tag+type prefix).

Base modification tags (MM/ML) use Z and B:C types respectively and can be very large on long reads (thousands of bytes). The decoder MUST handle arbitrary-length tag values.

Encoding types

[CRAM3] §13 "Encodings" — NULL (0), EXTERNAL (1), HUFFMAN (3), BYTE_ARRAY_LEN (4), BYTE_ARRAY_STOP (5), BETA (6), SUBEXP (7), GAMMA (9)

encoding.rs:212 encoding.rs:621

Encoding ID 0 (NULL): no data. Used for absent data series.

encoding.rs:214 +1 encoding.rs:576

Encoding ID 1 (EXTERNAL): reads raw bytes from an external data block identified by content_id. The most common encoding — used for quality scores, bases, tags, etc.

encoding.rs:221 encoding.rs:491

Encoding ID 3 (HUFFMAN): canonical Huffman coding. Parameters: alphabet (itf8 array) and bit lengths (itf8 array). Reads from the core data bit stream.

Canonical Huffman construction: sort by (bit_length, symbol_value), assign codes starting from 0 with incrementing and left-shifting when bit length increases. A single-symbol Huffman code has bit length 0 and consumes 0 bits from the stream — this is common for fields where all records have the same value (e.g., all flags identical).

encoding.rs:369 encoding.rs:605

Encoding ID 4 (BYTE_ARRAY_LEN): a length encoding followed by a value encoding. The length is decoded first (from any encoding), then that many bytes are decoded using the value encoding.

encoding.rs:378 encoding.rs:595

Encoding ID 5 (BYTE_ARRAY_STOP): reads bytes from an external block until a stop byte is encountered. Used for read names (stop=0x00) and variable-length strings.

encoding.rs:225 encoding.rs:559

Encoding ID 6 (BETA): fixed-width integer. Parameters: offset and number of bits. Reads from core bit stream.

encoding.rs:233 compare_cram_parsing.rs:144

Encoding ID 7 (SUBEXP): sub-exponential code. Parameters: offset (itf8) and K (itf8). Reads from core bit stream. Decode procedure:

  1. Read unary: count consecutive 1 bits until a 0 bit is read. Call this count n.
  2. If n == 0: read K bits from the stream as the value.
  3. If n > 0: read n + K - 1 bits from the stream as remainder. The value is (1 << (n + K - 1)) - (1 << K) + remainder.
  4. The final decoded value is value - offset.
encoding.rs:239 compare_cram_parsing.rs:143

Encoding ID 9 (GAMMA): Elias gamma code with offset. Parameters: offset (itf8). Reads from core bit stream. Decode procedure:

  1. Read unary: count consecutive 0 bits until a 1 bit is read. Call this count n.
  2. If n == 0: the value is 0.
  3. If n > 0: read n more bits from the stream as suffix. The value is (1 << n) | suffix - 1.
  4. The final decoded value is value - offset.

Compression codecs

[CRAM3] §14 "External compression methods" — RAW (0), GZIP (1), BZIP2 (2), LZMA (3), rANS4x8 (4), rANS Nx16 (5), arithmetic coder (6), fqzcomp (7), tok3 (8) [CRAMcodecs] §2 "rANS 4x8" — frequency table RLE, order-0/order-1 modes, chunk-based order-1 interleaving [CRAMcodecs] §3 "rANS Nx16" — flags byte (ORDER, N32, STRIPE, NoSize, CAT, RLE, PACK) [CRAMcodecs] §5 "Name tokenisation codec" — tok3

block.rs:121 block.rs:202

Method 0 (RAW): uncompressed data.

block.rs:123 block.rs:231

Method 1 (GZIP): deflate compression. MUST be supported — this is the most common codec.

block.rs:132 block.rs:263

Method 2 (BZIP2): bzip2 compression. MUST be supported for v3.0 compatibility.

block.rs:143 block.rs:289

Method 3 (LZMA): LZMA compression. MUST be supported for v3.0 compatibility.

block.rs:153 +1 rans.rs:276 +1

Method 4 (rANS 4×8): rANS entropy coder with 4-way interleaved states and 8-bit renormalization (L=2^23). Supports order-0 (context-free) and order-1 (previous-symbol context) modes. MUST be supported — this is the default codec in CRAM v3.0 writers including samtools.

The frequency table header uses a compact run-length encoding: after reading a symbol and its frequency, if the next symbol is prev_sym + 1, a run length byte follows, then that many consecutive symbols' frequencies. The terminator is a NUL (0x00) byte. Order-1 nests this same structure: the outer loop reads context symbols (with the same run-length scheme), and for each context reads an order-0 frequency table.

Order-1 interleaving is chunk-based, not round-robin: the output is split into 4 equal segments, each decoded by one state sequentially. Remainder bytes (when len % 4 != 0) are handled by state 3.

block.rs:155 +1 rans_nx16.rs:589

Method 5 (rANS Nx16): v3.1 codec. rANS with N-way interleaving (N=4 or 32), 16-bit renormalization (L=2^15), and optional transforms controlled by an 8-bit flags byte: ORDER(1), N32(4), STRIPE(8), NoSize(16), CAT(32), RLE(64), PACK(128). Transforms are applied in order: decode entropy → unRLE → unPack. MUST be supported for v3.1 files — samtools uses this for most data blocks in v3.1 output.

Method 6 (arithmetic coder): v3.1 adaptive arithmetic coder. SHOULD be supported.

Method 7 (fqzcomp): v3.1 quality-score-specific compressor. SHOULD be supported.

block.rs:157 +1 tok3.rs:464

Method 8 (tok3): v3.1 read-name tokeniser. MUST be supported for v3.1 files — samtools uses tok3 for read name blocks by default in v3.1 output. Without tok3 support, v3.1 CRAM files produced by samtools view -C cannot be read.

block.rs:159 block.rs:329

Unknown codec methods MUST produce a clear error naming the method ID and suggesting conversion to BAM (samtools view -b).

Record decoding

[CRAM3] §10 "Record structure" — CRAM record layout and field decode order

Decoding order

[CRAM3] §10.1 "CRAM record" — strict field decode order (BF, CF, RI, RL, AP, RG, QS, RN, mate info, TL, FN/FC/FP, MQ, per-base QS)

slice.rs:216 compare_cram_parsing.rs:195 +2

Records within a slice MUST be decoded in a strict field order, because the core bit stream is sequential. Reading fields out of order corrupts all subsequent data. The order is:

  1. BF (BAM flags)
  2. CF (CRAM flags)
  3. RI (reference ID — only if slice is multi-ref, i.e., ref_seq_id == -2)
  4. RL (read length)
  5. AP (alignment position — delta or absolute per preservation map)
  6. RG (read group)
  7. QS (quality scores — only if CRAM flags bit 0x1 is clear, meaning quality stored per-record not per-base)
  8. RN (read name — only if preservation map RN=true)
  9. Mate information:
    • If CRAM flags bit 0x2 set (detached): MF, NS, NP, TS
    • If CRAM flags bit 0x4 set (mate downstream): NF
  10. TL (tag line index)
  11. FN (feature count), then for each feature: FC, FP, feature-specific data
  12. MQ (mapping quality)
  13. QS per-base (only if CRAM flags bit 0x1 is set — quality as array)

Getting this order wrong is the most common implementation bug in CRAM readers.

Per-record data series

slice.rs:238 compare_cram_with_htslib.rs:102 +1

BF (BAM flags) and CF (CRAM flags) MUST be decoded per record. BAM flags map directly to the u16 flags in SlimRecord. CRAM flags are separate and encode:

  • Bit 0x1: quality scores stored as array (vs per-record)
  • Bit 0x2: detached (mate info stored inline)
  • Bit 0x4: has mate downstream in same slice
  • Bit 0x8: sequence not stored (unknown)
slice.rs:257 compare_cram_with_htslib.rs:103 +1

AP (alignment position): if the preservation map AP=true (delta mode), each record's position is a delta from the previous record's position within the slice. The reader MUST maintain a running position accumulator and convert to absolute 0-based coordinates (CRAM positions are 1-based). Delta-coded positions can be negative (valid for supplementary alignments in coordinate-sorted CRAM).

slice.rs:253 compare_cram_with_htslib.rs:105

RL (read length): the number of bases in the read. Used to allocate sequence and quality arrays. Long reads (PacBio/ONT) can be 10,000-1,000,000+ bases — buffer pre-allocation MUST NOT assume short-read sizes.

slice.rs:271 compare_cram_with_htslib.rs:170

RG (read group): an index into the @RG entries in the header. Stored as itf8. The reader MUST convert to a RG:Z:<id> aux tag with the read group ID string.

slice.rs:275 compare_cram_with_htslib.rs:106

RN (read name): if preservation map RN=true, names are stored. If RN=false, names are auto-generated. Rastair uses qnames for mate-pair dedup, so RN=false CRAM files MUST produce a warning that dedup will not work correctly. Generated names SHOULD be deterministic (e.g., cram_<record_counter>).

Read feature decoding

[CRAM3] §10.6 "Mapped reads" and §10.6.1 "Read feature codes" — feature codes X/I/i/D/S/B/b/Q/q/N/H/P and their data series

slice.rs:430 compare_cram_with_htslib.rs:109

For each record, FN gives the number of read features. For each feature:

  1. Decode FC (feature code byte)
  2. Decode FP (position within read, delta-coded from the previous feature's position within the same record — first feature's FP is absolute within the read, starting at 1)
  3. Decode feature-specific data:
FC Char Data series Description
X 0x58 BS Substitution: base = substitution_matrix[ref_base][code]
I 0x49 IN Insertion: multi-base insertion
i 0x69 BA Single-base insertion
D 0x44 DL Deletion: skip DL reference bases
S 0x53 SC Soft clip: bases not aligned to reference
B 0x42 BA + QS Single base replacement + quality
b 0x62 BB + QQ Stretch of bases + qualities
Q 0x51 QS Quality score at position
q 0x71 QQ Stretch of quality scores
N 0x4e RS Reference skip (intron): skip RS reference bases
H 0x48 HC Hard clip: record HC bases clipped
P 0x50 PD Padding: PD padding bases

Sequence reconstruction

[CRAM3] §10.6 "Mapped reads" — reference-based sequence reconstruction from read features; §11 "Reference sequences" — reference MD5 verification

slice.rs:431 compare_cram_parsing.rs:196 +1

Read sequences are NOT stored directly. They MUST be reconstructed from the reference and read features:

  1. Fetch the reference sequence for [alignment_start, alignment_start + alignment_span) from the FASTA reader (or embedded reference block).
  2. Initialize the read's bases array with RL entries.
  3. Walk the reference and read in parallel, applying features in order:
    • Between features: copy reference bases to read bases (these are the matching segments).
    • X (substitution): look up substitution_matrix[ref_base][BS_code] for the read base.
    • I/i (insertion): insert bases from IN/BA into the read (these don't consume reference).
    • D (deletion): skip DL reference bases (these don't consume read).
    • S (soft clip): insert bases from SC into the read (no reference consumption).
    • B (base+quality): replace base and quality at the feature position.
    • N (ref skip): skip RS reference bases.
    • H (hard clip): record length only (no bases in read or reference).
  4. Convert reconstructed bases to Base enum values for the bases slab.
slice.rs:359 compare_cram_with_htslib.rs:145

Quality scores are decoded from QS (per-base) or QQ (stretch) data series, depending on CRAM flags bit 0x1. If CRAM flags bit 0x8 is set (sequence unknown), quality is unavailable (all 0xFF). Quality length MUST equal read length.

When quality_as_array is false (CRAM flags bit 0x1 unset), per-position quality overrides are encoded as Q (0x51) read features. Each Q feature carries a single quality score from the QS data series at a specific read position. The decoder MUST apply these quality scores to qual_buf at the feature's read position (1-based FP, converted to 0-based index). Positions without a Q feature retain the default quality (0xFF). This is distinct from the bulk QS decode path used when quality_as_array is true.

The q (0x71) read feature carries a stretch of quality scores from the QQ data series, starting at the feature's read position. The decoder MUST write the decoded quality bytes into qual_buf starting at the feature's 0-based read position. The number of quality scores in the block is determined by the QQ encoding. Like Q, this feature only appears when quality_as_array is false.

slice.rs:432 compare_cram_parsing.rs:197 +1

CRAM does not store CIGAR explicitly. The CIGAR MUST be reconstructed from the read features:

  • Stretches of matching reference → M ops (or =/X if distinguishing is needed, but M is standard)
  • X (substitution) → part of M ops (substitutions are still alignment matches)
  • I/i → I ops
  • D → D ops
  • S → S ops
  • N → N ops
  • H → H ops
  • P → P ops

The reconstructed CIGAR MUST be packed into BAM u32 format for the data slab.

Mate information

[CRAM3] §10.4 "Mate records" — detached (CF bit 0x2) vs attached (NF) mate encoding

slice.rs:280 compare_cram_with_htslib.rs:173

When CRAM flags bit 0x2 is set (detached), mate information is stored inline: MF (mate flags), NS (next ref ID), NP (next mate position), TS (template size). These fields are NOT needed for RecordStore (which doesn't store mate info), but the decoder MUST read and discard them to advance the data stream correctly.

slice.rs:281 compare_cram_with_htslib.rs:174

When bit 0x2 is clear, the mate is "attached" — NF (next fragment distance) gives the record-skip to the mate within the same slice. The decoder MUST read NF but does not need to resolve the linkage for RecordStore purposes.

Auxiliary tags

[CRAM3] §10.5 "Auxiliary tags" — TL tag line index, tag dictionary, tag encoding map

slice.rs:298 compare_cram_with_htslib.rs:171

TL (tag line index) selects which tag combination this record has from the tag dictionary in the preservation map. For each tag in the combination, the tag encoding map provides the encoding for its value. Tag values MUST be decoded and serialized to BAM binary aux format for storage in the data slab.

slice.rs:317 compare_cram_with_htslib.rs:172

The RG data series is separate from aux tags. If a read group is present, the reader MUST emit an RG:Z:<id> aux tag using the read group ID from the header's @RG entries. This tag MUST be included in the aux slab alongside dictionary-decoded tags.

CRAM index (.crai)

[CRAM3] §12 "Indexing" — CRAI format: gzip-compressed TSV, 6 fields per line (ref_id, alignment_start, alignment_span, container_offset, slice_offset, slice_size)

index.rs:39 index.rs:149 +1

The .crai file is gzip-compressed TAB-delimited text. Each line has 6 fields:

  1. Reference sequence ID (int)
  2. Alignment start (int, 1-based matching the container header convention, 0 for unmapped)
  3. Alignment span (int, 0 for unmapped)
  4. Container byte offset in file (int, 0-based byte offset from start of file)
  5. Slice byte offset relative to container data start (int)
  6. Slice size in bytes (int)
index.rs:73 index.rs:164 +1

Region queries MUST find all index entries where [start, start+span) overlaps the query region for the given reference ID. The reader then seeks to the container byte offset, skips to the slice offset, and decodes the slice.

index.rs:87 index.rs:203

CRAI entries with alignment_span == 0 (but alignment_start > 0) indicate unknown extent — this occurs when samtools writes CRAM with embedded references or when the span is not computed. These entries MUST be included in query results whenever entry.alignment_start < query_end, since the slice may contain records at any position past the start. Treating span=0 as "zero-width" would silently drop all records in that slice.

slice.rs:371 compare_cram_parsing.rs:118

Multi-ref slices produce multiple index entries (one per reference they span). A query for one reference may hit a multi-ref slice that also contains records from other references. The reader MUST filter records by reference ID after decoding.

index.rs:74 +1 index.rs:179

Unmapped records (ref ID = -1) have start=0, span=0. They are only returned if explicitly queried. For fetch_into, unmapped records MUST be skipped (same as BAM).

Edge cases

slice.rs:118 slice.rs:763

If the reference MD5 in the slice header does not match the MD5 of the FASTA sequence for that region, the reader MUST return an error. The MD5 is computed over the uppercase reference sequence for the exact range [alignment_start, alignment_start + alignment_span), with no newlines or other formatting. If the range extends beyond the reference length, this is a file/reference mismatch error.

reader.rs:379 reader.rs:499

If the FASTA reader cannot provide the reference sequence for a slice's region (e.g., contig not in FASTA), the reader MUST return an error unless the slice has an embedded reference. The error SHOULD mention REF_PATH/REF_CACHE as alternatives.

pileup.rs:172 +1

When RN=false in the preservation map, read names are not available. The reader MUST log a warning (once per slice). Records without read names are stored with empty qnames. The pileup engine's overlapping-pair dedup MUST skip records with empty or * qnames, since mate matching requires real qnames.

slice.rs:402 compare_cram_parsing.rs:199

When CRAM flags bit 0x8 is set, the read's sequence is unknown (* in SAM). seq_len MUST be 0 and no bases are pushed to the bases slab.

slice.rs:113 compare_cram_parsing.rs:198

Slices with 0 records MUST be silently skipped.

slice.rs:267 compare_cram_with_htslib.rs:175

Delta-coded positions can produce negative values if records are not strictly sorted. The decoder MUST handle this gracefully (negative deltas are valid for supplementary alignments within a coordinate-sorted CRAM).

slice.rs:419 compare_cram_with_htslib.rs:176

Unmapped reads (flag 0x4) may appear in the last containers of the file, in slices with reference_sequence_id = -1. Their bases are stored in the BA data series (no reference reconstruction). fetch_into MUST skip them, same as BAM.

slice.rs:336 compare_cram_parsing.rs:201

PacBio and ONT CRAM files contain reads of 10,000-1,000,000+ bases. Read feature lists can have thousands of entries, and quality/sequence arrays are correspondingly large. Buffer pre-allocation MUST scale with RL, not assume short-read sizes.

reader.rs:394 compare_cram_parsing.rs:200

Query coordinates passed to fetch_into are u64, but internal CRAM positions are i32/i64. When converting u64::MAX to i64 for overlap filtering, the cast wraps to -1, which silently filters out all records. The reader MUST clamp coordinates: end.min(i64::MAX as u64) as i64.

slice.rs:433 slice.rs:850

During sequence reconstruction, if the reference sequence is shorter than the positions accessed (i.e., ref_offset + ref_pos >= reference_seq.len()), the reader MUST log a warning on the first occurrence per reconstruction and fall back to b'N'. This may indicate a reference/CRAM mismatch but can legitimately occur near chromosome ends. The warning MUST be emitted at most once per record to avoid flooding logs.

rans.rs:145 rans.rs:277

The rANS 4x8 frequency table uses run-length encoding where a symbol counter sym: u8 is incremented for each run element. When sym == 255, sym += 1 overflows. The reader MUST use wrapping arithmetic (wrapping_add(1)) for this counter — the overflow is harmless because the loop terminates before using the overflowed value.

Symbol variables in rANS frequency table run-length reading MUST use u8 type, making overflow past 255 impossible by construction. After a run-length loop, sym may have wrapped to 0 via wrapping_add(1) when the previous value was 255. The code MUST NOT use the post-loop sym value as a table index if it has wrapped to 0. The run-length loop MUST break when sym would exceed 255.

rans_nx16.rs:144

The rANS state_step function computes f * (s >> bits) + (s & mask) - g. For valid frequency tables produced by conforming CRAM writers, g <= f * (s >> bits) + (s & mask) always holds, so the subtraction does not underflow. The implementation MUST use wrapping_sub for robustness against malformed data in release mode, with a debug_assert! to catch violations during development.

rans_nx16.rs:238 rans_nx16.rs:752

Frequency normalization (normalize_frequencies) sums all 256 frequency entries and doubles the sum in a loop. Both the sum and the doubling MUST use checked arithmetic to detect overflow from malformed frequency tables. Overflow MUST produce an error, not silent wraparound.

rans_nx16.rs:96 +1 rans_nx16.rs:735

The uint7 variable-length integer decoding loop MUST be bounded to at most 5 iterations. A u32 can hold at most 32 bits; 5 iterations of 7-bit groups produce 35 bits, which is the maximum meaningful input. More than 5 continuation bytes indicate malformed data and MUST produce an error.

slice.rs:42 +1 slice.rs:874

Length fields decoded from ITF8 (e.g., num_content_ids, num_blocks, alignment_span) may be negative when interpreted as i32. Before using such values as usize for allocation or iteration, the reader MUST validate they are non-negative via i32::try_from or equivalent. Negative values MUST produce an error, not wrap to huge usize values causing OOM.

tok3.rs:143 tok3.rs:640

The TokenReader::get() and get_mut() methods MUST return the same reader for every TokenType variant. In particular, DZLen MUST map to dz_len_reader in both methods. A mismatch causes the dup-copy path (which uses get()) to read from the wrong stream.

tok3.rs:21 tok3.rs:655

The name_count field in tok3 headers comes from untrusted data. Allocation based on name_count MUST be bounded to a reasonable limit (e.g., 10,000,000 names). The per-name token vector MUST be grown dynamically rather than pre-allocated to a fixed size, since the number of token positions varies per name.

Performance considerations

CRAM decoding is expected to be 2-4× slower than BAM due to codec complexity and reference reconstruction. The I/O savings (40-60% smaller files) often compensate on network filesystems.

reader.rs:291 compare_cram_parsing.rs:202

CRAM random access is slice-granular, not record-granular. A region query may decompress an entire slice (typically 10,000 records) to extract a few records at the edges. This is inherently coarser than BAM's record-level seeking.

reader.rs:292 compare_cram_parsing.rs:203

Reference sequence lookups happen per-slice (to reconstruct all records in the slice). The reader SHOULD cache the most recently used reference region to avoid repeated FASTA lookups for consecutive slices on the same contig. This cache MUST be per-fork, not shared across threads.

reader.rs:293 compare_cram_parsing.rs:204

rANS and arithmetic decoders have higher per-byte CPU cost than zlib. The reader SHOULD pre-allocate decode buffers and reuse them across blocks within a slice.

FASTA Reader

The indexed FASTA reader provides random access to reference genome sequences. It reads .fai index files to locate sequence data within plain or bgzip-compressed FASTA files, enabling efficient region queries without scanning the entire file.

Rastair fetches reference sequences once per segment (~100 KB) for each worker thread. The forkable design shares the parsed index across threads while giving each thread its own file handle, matching the IndexedBamReader pattern.

Sources: The FASTA format and FAI index are not covered by any hts-specs document. The FAI format is defined by samtools (samtools faidx); the byte offset formula and field layout are described in the samtools faidx man page and implemented in [htslib]. The GZI index format is a htslib convention for bgzip-compressed random-access files, implemented in [htslib] bgzf.c. BGZF decompression follows [SAM1] §4.1. The forking and buffer-reuse design are seqair-specific. See references.md.

FAI Index

The FAI index (.fai file) is a tab-delimited text file with one line per sequence. It enables O(1) lookup of any base position within the FASTA file by recording where each sequence's data starts and how lines are formatted.

Parsing

index.rs:64 index.rs:197

The FAI parser MUST read a .fai file and produce an index with one entry per sequence. Each entry MUST contain: sequence name (String), length (number of bases, u64), byte offset of the first base in the file (u64), bases per line (u64), and bytes per line including line terminator (u64). The parser MUST NOT trim whitespace from field values — tabs are the only delimiter, and sequence names may contain spaces.

index.rs:89 index.rs:224 +3

Each line MUST have exactly 5 tab-separated fields: NAME, LENGTH, OFFSET, LINEBASES, LINEWIDTH. Lines with fewer or more fields MUST produce an error identifying the malformed line. Empty lines MUST be skipped.

index.rs:116 index.rs:302 +2

The parser MUST reject entries where linewidth < linebases (line terminator would be negative), linebases == 0 (would cause division by zero in offset calculations), or length == 0 (empty sequences are not useful for region queries). Note: linewidth == linebases is valid — it means lines have no trailing newline character, which occurs when a sequence is stored on a single line without a mid-sequence newline.

index.rs:65 index.rs:198 +2

The index MUST support O(1) lookup of a sequence entry by name. Duplicate sequence names MUST produce an error at parse time.

Location

reader.rs:275 reader.rs:335 +1

For a FASTA file at path P, the reader MUST look for the index at P.fai (i.e., appending .fai to the full path including any .gz extension). This matches samtools/htslib convention: ref.faref.fa.fai, ref.fa.gzref.fa.gz.fai.

reader.rs:105 reader.rs:436

When the FAI index file is not found, the reader MUST produce a clear error message that names the expected index path and suggests running samtools faidx <fasta_path> to create it. The reader MUST NOT attempt to build the index itself. Rationale: auto-indexing hides user errors (wrong FASTA passed silently), fails on read-only shared reference directories (common on HPC clusters), introduces non-deterministic file creation and race conditions, and takes 10–30s for a human genome which makes the tool appear to hang with no feedback.

Byte offset calculation

index.rs:58 fasta.rs:17

To find the byte offset of base at 0-based position pos within a sequence, the reader MUST compute: offset + (pos / linebases) * linewidth + (pos % linebases), where offset, linebases, and linewidth come from the FAI entry. This accounts for newline characters inserted at line boundaries.

The offset formula is correct for all valid positions including those on the last line of a sequence, even when the last line is shorter than linebases. This works because integer division gives the correct line number regardless of last-line length. The raw read range [byte_offset(start), byte_offset(last_pos) + 1) may include newline bytes from line boundaries; these are stripped after reading.

Sequence fetching

reader.rs:84 fasta.rs:194

fetch_seq(name, start, stop) MUST use 0-based half-open coordinates [start, stop). A request for (name, 0, 4) returns the first 4 bases of the named sequence.

reader.rs:207 reader.rs:405 +3

The reader MUST verify that start < stop (zero-length and reversed ranges are invalid) and stop <= sequence_length. Out-of-bounds requests MUST produce an error naming the sequence, the requested range, and the actual sequence length.

reader.rs:263 fasta.rs:193

Returned sequence bytes MUST be uppercased. FASTA files may contain lowercase bases (indicating soft-masking), but Seqair always works with uppercase bases.

reader.rs:262 fasta.rs:192

When reading from the file, the reader MUST strip line terminators (LF \n and CR \r) from the raw bytes. The returned sequence MUST contain only base characters.

reader.rs:196 reader.rs:417 +1

When the requested sequence name is not in the index, the reader MUST produce an error listing the requested name. It SHOULD include available sequence names if the index is small (fewer than 20 entries) to help diagnose typos.

Raw byte return type

reader.rs:176 +1 reader.rs:346 +1

fetch_seq and fetch_seq_into MUST return raw uppercase ASCII bytes (Vec<u8> / &mut Vec<u8>), not Base values. CRAM reference MD5 verification computes checksums on the exact byte sequence from the FASTA file — converting to Base and back is lossy because IUPAC ambiguity codes (M, R, S, etc.) collapse to Unknown. Any u8→Base conversion MUST happen downstream, after CRAM verification is complete.

Buffer reuse

reader.rs:177 +1 fasta.rs:319 +1

fetch_seq_into(name, start, stop, buf) MUST accept a caller-provided &mut Vec<u8> buffer, clear it, and write the fetched sequence into it. This avoids per-call allocation when fetching many segments. The fetch_seq convenience method MAY allocate a new Vec<u8> internally.

Plain FASTA reading

reader.rs:230 fasta.rs:191

For uncompressed FASTA files, the reader MUST seek to the computed byte offset and read the required byte range in a single I/O operation, then strip newlines and uppercase the result.

reader.rs:231 reader.rs:347 +2

The raw read size MUST account for embedded newlines. For a fetch of n bases starting at position pos, the number of raw bytes to read is: end_byte_offset - start_byte_offset where both offsets are computed via fasta.index.offset_calculation.

Bgzip FASTA reading

reader.rs:114 +1 compare_fasta_with_htslib.rs:29

The reader MUST auto-detect bgzip compression by reading the first 18 bytes of the file and checking for: (1) gzip magic bytes 1f 8b, (2) DEFLATE compression method 08, (3) FEXTRA flag 04, and (4) the BC subfield identifier at bytes 12–13. All four checks MUST pass to identify a file as BGZF. Plain gzip files with the FEXTRA flag but without the BC subfield MUST NOT be misidentified as BGZF.

reader.rs:117 compare_fasta_with_htslib.rs:30 +1

For bgzip FASTA files, a GZI index (.gzi) MUST be present alongside the FASTA. The GZI index maps compressed block offsets to uncompressed byte offsets, enabling random access into the compressed data.

reader.rs:120 compare_fasta_with_htslib.rs:179

When the GZI index is not found for a bgzip FASTA, the reader MUST produce a clear error naming the expected GZI path and suggesting samtools faidx <fasta_path> (which creates both .fai and .gzi).

GZI index format

gzi.rs:46 gzi.rs:180 +4

The GZI parser MUST read a binary file containing: an 8-byte little-endian entry count (u64), followed by that many pairs of 8-byte little-endian values: (compressed_offset: u64, uncompressed_offset: u64). Each pair records where a BGZF block starts in the compressed file and what uncompressed byte offset that corresponds to. The entry count multiplication MUST use checked arithmetic to prevent overflow on corrupted files.

gzi.rs:97 gzi.rs:283 +1

GZI entries MUST be sorted in strictly increasing order by uncompressed_offset. The parser MUST reject files with unsorted or duplicate entries, as translate() relies on binary search.

gzi.rs:120 fasta.rs:115

Given an uncompressed byte offset, the GZI index MUST be able to find the BGZF block that contains it: find the last entry whose uncompressed_offset <= target, then the compressed block starts at that entry's compressed_offset, and the within-block offset is target - uncompressed_offset.

gzi.rs:144 gzi.rs:266

The within-block offset MUST fit in u16 (0–65535), matching the BGZF specification's maximum uncompressed block size of 65,536 bytes. If the computed within-block offset exceeds u16::MAX, the reader MUST return an error indicating the GZI index may be corrupt or missing entries. The reader MUST NOT silently truncate the offset.

Decompression

reader.rs:233 fasta.rs:264

The reader MUST decompress BGZF blocks using the same libdeflater-based decompression as the BAM reader, including CRC32 verification.

reader.rs:232 fasta.rs:265

For a fetch spanning multiple BGZF blocks, the reader MUST read and decompress blocks sequentially from the starting block until all requested bytes are obtained. Since FASTA fetches are contiguous byte ranges (unlike BAM which has disjoint index chunks), a simple sequential scan suffices.

Forking

reader.rs:71 reader.rs:457 +2

The FAI index and GZI index (if applicable) MUST be stored behind Arc so forked readers share a single parsed copy. These structures are read-only after initial parsing.

reader.rs:147 reader.rs:456 +1

fork() MUST produce a new reader that shares the same Arc as the source (no re-parsing) and opens a fresh File handle. Each fork MUST own its own read buffer.

reader.rs:455 fasta.rs:266

A forked reader MUST produce byte-identical results to independently opening the same FASTA file.

reader.rs:148 reader.rs:483 +1

Forked readers MUST be fully independent for mutable operations. Seeking on one MUST NOT affect any other.

Error handling

reader.rs:19 reader.rs:406 +4

All FASTA-specific errors MUST contain context (file paths, sequence names, positions) sufficient to diagnose the problem without a debugger. The implementation MUST NOT use panic!, unwrap(), expect(), or unreachable!() — all error paths MUST propagate errors via Result.

SAM Reader

A bgzf-compressed, indexed SAM reader for region-based random access. Plain (uncompressed) SAM files cannot be indexed and are NOT supported — users must compress with bgzip first.

Sources: [SAM1] §1 "The SAM Format Specification" — SAM text format, mandatory fields, FLAG bits, CIGAR string, header lines (@HD/@SQ/@RG/@PG); §1.3 "The header section"; §1.4 "The alignment section: mandatory fields". [TABIX] — TBI index format and SAM column configuration. [SAM1] §4.1 — BGZF decompression (shared with BAM). The coordinate conversion (1-based → 0-based) and aux-to-BAM-binary serialization are seqair-specific. See references.md.

Context

SAM is the text representation of the same data BAM encodes in binary. A bgzf-compressed SAM file (.sam.gz) wraps SAM text in BGZF blocks, enabling random access via tabix (.tbi) or CSI (.csi) indexes. htslib reads these transparently; seqair needs to support them for backward compatibility after removing the htslib reading path.

The key difference from BAM: records are text lines that must be parsed field-by-field, and coordinates are 1-based (BAM is 0-based). The BGZF and index layers are shared with BAM.

Opening

reader.rs:147 sam_gz.rs:60

The SAM reader MUST open a bgzf-compressed SAM file, parse its header, and locate a tabix (.tbi) or CSI (.csi) index file. If neither index exists, return an error suggesting samtools index or tabix -p sam.

reader.rs:150 reader.rs:865

If the file does not start with BGZF magic (1f 8b with BC subfield), the reader MUST reject it with an error explaining that plain SAM files must be bgzf-compressed for indexed access (bgzip file.sam).

Header parsing

reader.rs:598 compare_sam_with_htslib.rs:260 +1

The SAM header consists of all lines starting with @ at the beginning of the file. The reader MUST:

  1. Read lines from the BGZF stream until a line does NOT start with @.
  2. Record the BGZF virtual offset of the first alignment line (for seeking past the header).
  3. Parse @SQ lines to extract target names (SN tag) and lengths (LN tag).
  4. Construct a BamHeader from the parsed text via BamHeader::from_sam_text().

Lines starting with @ that appear after the first alignment line are NOT header continuation — they are malformed alignment lines and MUST be treated as parse errors.

reader.rs:599 compare_sam_with_htslib.rs:261 +1

If no @SQ lines are present, the reader MUST return an error. Without @SQ lines, target name→tid mapping is impossible and fetch_into cannot work.

reader.rs:600 sam_gz.rs:162

The full header text (all @ lines joined with newlines) MUST be stored in BamHeader::header_text for downstream use (e.g., constructing htslib Header for BAM writing).

Alignment line parsing

[SAM1] §1.4 "The alignment section: mandatory fields" — 11 mandatory fields: QNAME, FLAG, RNAME, POS (1-based), MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL; QUAL Phred+33 encoding

reader.rs:305 compare_sam_with_htslib.rs:83 +1

Each alignment line MUST be split on TAB characters into the 11 mandatory fields plus optional tag fields. The parser MUST handle:

Field SAM type Conversion to internal representation
QNAME string stored as bytes in name slab
FLAG integer u16 (identical to BAM flags)
RNAME string converted to tid via header's name_to_tid map
POS 1-based integer subtract 1 for 0-based internal pos (i64)
MAPQ integer u8 (255 means unavailable)
CIGAR string parsed to packed BAM CIGAR ops (u32 per op: `len << 4
RNEXT string = means same as RNAME, * means unavailable — not stored in RecordStore
PNEXT integer not stored in RecordStore (mate info)
TLEN integer not stored in RecordStore (template length)
SEQ string each ASCII char decoded to Base enum
QUAL string each char minus 33 for Phred score; * means all 0xFF

The parser MUST validate each field and produce clear error messages for malformed records: non-integer POS, FLAG > 65535, negative MAPQ, etc. Real-world SAM files from non-standard tools may contain hand-edited or corrupted records.

reader.rs:351 compare_sam_with_htslib.rs:84 +1

SAM POS is 1-based; the internal representation is 0-based (matching BAM). The parser MUST subtract 1 from POS. A POS of 0 in SAM means unmapped — after subtraction this becomes -1 (as i64). Unmapped records are filtered by FLAG 0x4 before reaching this point, so -1 positions should not appear in the RecordStore.

reader.rs:428 compare_sam_with_htslib.rs:109 +2

The CIGAR string MUST be parsed into the same packed u32 format used by BAM: each operation is (length << 4) | op_code, where op_codes are M=0, I=1, D=2, N=3, S=4, H=5, P=6, =7, X=8. The string * means CIGAR unavailable (0 operations, and end_pos = pos).

reader.rs:379 compare_sam_with_htslib.rs:110 +3

SEQ characters MUST be decoded to Base values: A→A, C→C, G→G, T→T, all others (N, IUPAC codes)→Unknown. The string * means sequence unavailable (length 0).

Note: the SAM character = means "identical to the reference base at this position." This is an intentional lossy conversion to Unknown — matching BAM's behavior, which also stores = as a distinct 4-bit code but does not resolve it against the reference. Some tools (Illumina DRAGEN) use = extensively. If exact reference-matching base recovery is needed in the future, the = case would need reference lookup during parsing.

reader.rs:388 compare_sam_with_htslib.rs:111 +3

QUAL characters MUST have 33 subtracted (Phred+33 encoding). The string * means quality unavailable; all quality bytes MUST be set to 0xFF (255). QUAL length MUST equal SEQ length when both are present.

reader.rs:468 compare_sam_with_htslib.rs:151 +1

Optional fields after the 11th column follow the format TAG:TYPE:VALUE. They MUST be serialized to BAM binary aux format for storage in the data slab:

  • A (char): [tag0, tag1, 'A', char]
  • i (integer): [tag0, tag1, type, bytes...] where type is the smallest fitting BAM integer type:
    • [-128, 127] → c (int8)
    • [128, 255] → C (uint8)
    • [-32768, -129] or [256, 32767] → s (int16)
    • [32768, 65535] → S (uint16)
    • [-2147483648, -32769] or [65536, 2147483647] → i (int32)
    • [2147483648, 4294967295] → I (uint32)
  • f (float): [tag0, tag1, 'f', le_f32_bytes]
  • Z (string): [tag0, tag1, 'Z', string_bytes, 0x00]
  • H (hex): [tag0, tag1, 'H', hex_bytes, 0x00]
  • B (array): [tag0, tag1, 'B', subtype, count_le32, values...]

Note: the integer type selected here may differ from the type used by the original BAM writer, since SAM text loses the specific width information. This is an inherent limitation — see r[unified.push_fields_equivalence].

reader.rs:469 reader.rs:947

Malformed aux tag values MUST return an error, not silently default to zero. For integer (i) typed tags, if the value cannot be parsed as an integer, the reader MUST return a SamRecordError::InvalidAuxValue error.

reader.rs:570 reader.rs:980 +1

SAM aux integer values (type i) are serialized into the smallest BAM integer type that fits, using the ranges defined in r[sam.record.aux_tags]. Values that exceed the u32 range (i.e., val > u32::MAX or val < i32::MIN when interpreted as i64) cannot be represented in any BAM integer type and MUST return a SamRecordError::AuxIntOutOfRange error instead of silently truncating.

Region fetching

reader.rs:195 compare_sam_with_htslib.rs:82 +2

fetch_into(tid, start, end, store) MUST:

  1. Query the tabix/CSI index for BGZF virtual offset ranges overlapping the region.
  2. Use RegionBuf (or equivalent bulk read) to load compressed bytes.
  3. Decompress BGZF blocks to text.
  4. Split text into lines (on \n).
  5. Parse each alignment line.
  6. Filter: skip lines where RNAME's tid doesn't match, or where the record doesn't overlap [start, end].
  7. Skip unmapped reads (FLAG 0x4).
  8. Push passing records into the RecordStore.
reader.rs:370 sam_gz.rs:62

The overlap filter uses half-open intervals (0-based). end_pos MUST be computed from the parsed CIGAR. When CIGAR is * (unavailable), end_pos = pos (point record).

reader.rs:371 reader.rs:908 +2

A record overlaps [start, end) iff pos < end && end_pos > start. Records where pos == end or end_pos == start do NOT overlap and MUST be filtered out.

reader.rs:196 sam_edge_cases.rs:180

Records MUST be added to the store in coordinate-sorted order. Since bgzf-compressed indexed SAM files are coordinate-sorted by construction (indexing requires it), the natural parse order is correct.

Tabix / CSI index

[TABIX] — TBI magic TBI\x01, header fields, n_ref, bin/chunk/linear-index structure, format=1 for SAM [CSI] — CSI magic CSI\x01, min_shift, depth, l_aux, per-bin loffset

reader.rs:624 sam_edge_cases.rs:223

The reader MUST support tabix (.tbi) indexes. Tabix is a generic index for bgzf-compressed, TAB-delimited files. For SAM, the format code is 1 (not 0/generic). The column configuration (seq_col=3, beg_col=4, end_col=0) is implicit for format=1 and may be ignored by some implementations; the reader SHOULD NOT reject a tabix file based on stored column values when format=1 is present.

Tabix files are BGZF-compressed and MUST be decompressed before parsing.

The reader SHOULD also support CSI (.csi) indexes, which generalize the binning scheme with configurable min_shift and depth parameters. CSI supports references longer than 512 Mbp (BAI's limit). CSI files are BGZF-compressed and MUST be decompressed before parsing. Not yet implemented — see r[csi.*] rules in tabix_index.md.

reader.rs:627 sam_edge_cases.rs:222

Index file lookup order: <path>.tbi, <path>.bai. If neither exists, return a clear error. CSI (.csi) lookup should be added when CSI support is implemented.

Edge cases

reader.rs:380 sam_edge_cases.rs:102

When SEQ is *, the record has no sequence data. seq_len MUST be 0 and no bases are pushed to the bases slab. This is valid per the SAM spec for secondary alignments that omit sequence.

reader.rs:389 sam_edge_cases.rs:127

When QUAL is *, all quality bytes MUST be 0xFF. This matches BAM's representation of unavailable quality.

reader.rs:429 sam_proptests.rs:59

BAM encodes CIGARs with >65535 operations using a CG:B:I aux tag and a placeholder CIGAR. SAM has no such limitation — the CIGAR string can be arbitrarily long. The parser MUST handle CIGARs of any length.

reader.rs:250 sam_edge_cases.rs:39

Blank lines (empty or whitespace-only) between alignment records MUST be silently skipped.

reader.rs:334 sam_edge_cases.rs:65

RNAME of * means unmapped. Combined with FLAG 0x4, these records MUST be filtered out by fetch_into (same as BAM's unmapped filtering). RNAME * without FLAG 0x4 is malformed but SHOULD be treated as unmapped.

reader.rs:601 sam_edge_cases.rs:242

A single SAM alignment line may span multiple BGZF blocks. The reader MUST buffer partial lines across block boundaries and only parse complete lines (terminated by \n). The parser MUST also strip \r at line boundaries to handle \r\n (Windows-style) line endings that may appear in files from mixed-platform pipelines.

Integer parsing

reader.rs:679 reader.rs:962 +2

Integer parsing functions (parse_u8, parse_u16, parse_u32) MUST reject empty input by returning None. An empty byte slice is not a valid integer representation and MUST NOT silently produce zero.

Performance considerations

reader.rs:198 sam_gz.rs:63

Like BAM, the SAM reader SHOULD use RegionBuf-style bulk I/O: read all compressed bytes for the region in one large I/O operation, then decompress and parse from memory. This is critical for NFS/Lustre performance.

Text parsing is inherently slower than BAM binary decoding. The SAM reader is expected to be 2-5× slower than BAM for the same data due to text parsing overhead and larger compressed size. This is acceptable since SAM is not the recommended input format. Implementation guidance:

  • Use memchr for fast TAB and newline scanning.
  • Avoid allocations during parsing (reuse line buffers).
  • Parse integers without going through String (direct byte-to-int conversion).
  • Decode SEQ characters to Base with a lookup table (not per-char branching).

Tabix and CSI Index

Tabix (.tbi) and CSI (.csi) are generic indexes for bgzf-compressed, coordinate-sorted, TAB-delimited files. They enable region-based random access via BGZF virtual offsets. seqair needs these for bgzf-compressed SAM files.

Sources: [TABIX] — full TBI format specification (magic, header, bins, chunks, linear index, n_no_coor, reg2bin/reg2bins). [CSI] — full CSI format specification (magic, min_shift, depth, l_aux, per-bin loffset, parameterised reg2bin/reg2bins). [SAM1] §5.3 — reg2bin/reg2bins C code for BAI-compatible binning parameters (min_shift=14, depth=5). See references.md.

Context

BAM uses BAI indexes, which are a specialized form of the binning scheme. Tabix and CSI generalize this:

  • Tabix (.tbi): fixed parameters (min_shift=14, depth=5), identical binning to BAI. Includes column configuration for the text format.
  • CSI (.csi): configurable min_shift and depth, supporting references longer than 512 Mbp.

Both use the same bin/chunk/virtual-offset query mechanism as BAI.

Implementation strategy: Tabix with format=SAM uses identical binning parameters to BAI (min_shift=14, depth=5). The index data after the tabix-specific header is structurally identical to BAI (bins, chunks, linear index). The initial implementation SHOULD reuse the existing BamIndex parsing and query code by reading past the tabix header and delegating to the BAI parser. CSI support (parameterized binning, per-bin loff, no linear index) requires a separate parser and is deferred.

File compression

index.rs:107 tabix_index.rs:91

Tabix (.tbi) files are BGZF-compressed. The parser MUST decompress via BGZF before parsing the binary format. Reading raw bytes without decompression produces garbage.

CSI (.csi) files are BGZF-compressed. Same decompression requirement as tabix.

Tabix format

[TABIX] — TBI format table: magic, n_ref, format, col_seq, col_beg, col_end, meta, skip, l_nm, names; bin/chunk/interval lists; n_no_coor

index.rs:103 index.rs:591 +1

Tabix files start with magic bytes TBI\x01 (4 bytes), after BGZF decompression.

index.rs:104 tabix_index.rs:38

After magic:

  • n_ref (i32): number of reference sequences
  • format (i32): 0=generic, 1=SAM, 2=VCF. For SAM files, the format code is 1 (not 0).
  • col_seq (i32): column for sequence name (1-based; 3 for SAM)
  • col_beg (i32): column for begin position (1-based; 4 for SAM)
  • col_end (i32): column for end position (0 = computed from begin)
  • meta (i32): comment/header prefix character (64 = @ for SAM)
  • skip (i32): number of header lines to skip (0 for SAM; headers detected by meta prefix)
  • l_nm (i32): length of concatenated sequence names
  • names (bytes): null-terminated sequence names, concatenated

Note: when format=1 (SAM), htslib hard-codes the column configuration internally. The stored col_seq/col_beg/col_end values may be ignored by some implementations. The reader SHOULD NOT reject a tabix file based on stored column values when format=1.

index.rs:105 tabix_index.rs:39

Since tabix uses identical binning parameters to BAI (min_shift=14, depth=5), the index data after the tabix header MUST be parsed by reusing BamIndex::from_reader() (or equivalent). This avoids duplicating the bin/chunk/linear-index parsing code. The tabix header is read and validated first, then the remaining bytes are handed to the BAI parser.

index.rs:106 tabix_index.rs:40

After the header, the index data follows the same structure as BAI:

  • For each reference:
    • n_bin (i32): number of bins
    • For each bin: bin_id (u32), n_chunk (i32), chunks as virtual offset pairs
    • n_intv (i32): number of linear index entries
    • Linear index: virtual offsets at 16 kbp intervals

After all reference data, an optional 8-byte n_no_coor field (u64) may be present (number of unmapped reads with no coordinate). Its presence is detected by whether 8 more bytes remain.

index.rs:178 tabix_index.rs:58

Bin ID 37450 is a pseudo-bin that stores metadata (virtual offset range of all records for a reference). Its chunks MUST NOT be included in query results. The reader MUST recognize and skip this bin during region queries.

index.rs:177 tabix_index.rs:57

Region queries use the same reg2bins algorithm as BAI (with min_shift=14, depth=5). The bin offset for level l (counting from root, l=0=root) is ((1 << 3*l) - 1) / 7, and the bin ID for position p at level l is:

((1 << 3*l) - 1) / 7 + (p >> (min_shift + 3*(depth - l)))

For BAI/tabix with min_shift=14, depth=5:

  • Level 0 (root): offset=0, shift=29. Bin 0 covers the entire reference.
  • Level 5 (leaf): offset=4681, shift=14. Each bin covers 16 kbp.

CSI format

[CSI] — CSI format table: magic CSI\x01, min_shift, depth, l_aux, n_ref; per-bin loffset replacing linear index; parameterised reg2bin/reg2bins/bin_limit functions

CSI files start with magic bytes CSI\x01 (4 bytes), after BGZF decompression.

After magic:

  • min_shift (i32): minimum shift for binning (BAI/tabix use 14)
  • depth (i32): number of binning levels (BAI/tabix use 5)
  • l_aux (i32): length of auxiliary data
  • aux (bytes): auxiliary data (for SAM: same column configuration as tabix header)

Note: CSI support (r[csi.*] requirements below) is deferred. The initial implementation supports tabix only, which covers the common case of bgzf-compressed indexed SAM files. CSI adds support for references > 512 Mbp and is needed for some plant/polyploid BAM files.

CSI generalizes the BAI binning scheme with configurable min_shift and depth. The bin ID formula is the same as tabix/BAI but parameterized:

bin_id(level, pos) = ((1 << 3*level) - 1) / 7 + (pos >> (min_shift + 3*(depth - level)))

where level counts from root (0) to leaf (depth).

The reg2bins function MUST be parameterized by min_shift and depth rather than using a hard-coded levels array. The existing BAM reg2bins implementation hard-codes [(1,26), (9,23), (73,20), (585,17), (4681,14)] — this must be refactored to compute these values dynamically.

CSI bins have a different structure from BAI/tabix bins:

  • bin_id (u32)
  • loff (u64): minimum virtual offset for this bin — replaces BAI's linear index
  • n_chunk (i32)
  • chunks: virtual offset pairs

The loff field per-bin replaces the separate linear index that BAI/tabix use. During queries, chunks with end <= loff of the leftmost overlapping bin SHOULD be skipped (same optimization as BAI's linear index).

CSI does NOT have a separate linear index array. The per-bin loff field serves the same purpose. The parser MUST NOT attempt to read a linear index after bins.

The CSI pseudo-bin ID is ((1 << (depth*3 + 3)) - 1) / 7. For depth=5 this equals 37450 (same as BAI). Its chunks MUST NOT be included in query results.

CSI queries produce the same output as BAI/tabix queries: a list of virtual offset chunk ranges to scan. RegionBuf::load() can be reused unchanged since it operates on Vec<Chunk>.

Shared query infrastructure

index.rs:108 unified_reader.rs:87

The existing BamIndex::query() returns Vec<Chunk> with begin/end virtual offsets. The tabix and CSI parsers MUST produce the same Vec<Chunk> output so that RegionBuf::load() can be reused unchanged. The binning parameters differ but the output format is identical.

index.rs:109 unified_reader.rs:88

Chunk and VirtualOffset types from the BAM index module MUST be reused by tabix/CSI indexes.

Edge cases

References longer than 512 Mbp require CSI (BAI/tabix with min_shift=14, depth=5 cannot represent them). The reader MUST support CSI indexes for BAM files too (not just SAM), since some BAM files from plant/polyploid genomes use CSI.

index.rs:215 index.rs:532

A reference with no records has no bins and no linear index entries. Queries for such references MUST return an empty chunk list (not an error).

Tabix stores sequence names in the index header. These MUST match the names in the SAM header's @SQ lines. A mismatch SHOULD produce a warning (the index may be stale). CSI does NOT store sequence names — this check applies only to tabix.

Record Store

Sources: The slab-based RecordStore design is seqair-specific. The data it stores is defined by [SAM1] §4.2 (BAM record fields, CIGAR packed u32 format, quality Phred encoding, aux tag binary format). See references.md.

Background

When loading a region from any alignment format (BAM, SAM.gz, or CRAM), the reader decodes hundreds to thousands of records. Each record has fixed-size fields (position, flags, quality) and variable-length fields (read name, CIGAR, sequence, quality scores, auxiliary tags). The original design wrapped each record in Rc<BamRecord> with 5 separate heap allocations for the variable-length fields, plus the Rc allocation itself — 6 allocations per record.

The record store replaces this with a slab-based design: all variable-length data is packed into contiguous buffers (slabs), and fixed fields are stored in a compact struct that references the slabs by offset. This eliminates all per-record heap allocation.

Layout

The store contains four vectors:

  • Record table (Vec<SlimRecord>): compact fixed-size structs with alignment fields and offsets into the slabs.
  • Name slab (Vec<u8>): all read names (qnames) packed sequentially. Separated because qnames are only accessed during dedup mate detection — a linear scan of a compact, contiguous buffer.
  • Bases slab (Vec<Base>): decoded base sequences stored as Base enum values (A/C/G/T/Unknown). For BAM, decoded from 4-bit nibbles via SIMD at push time. For SAM/CRAM, provided directly as &[Base] by the reader.
  • Data slab (Vec<u8>): packed per-record as [cigar | qual | aux] — CIGAR in BAM packed u32 format, quality as raw Phred bytes, aux tags in BAM binary format.
Record table:  [SlimRecord₀] [SlimRecord₁] [SlimRecord₂] ...
Name slab:     [qname₀|qname₁|qname₂|...]
Bases slab:    [bases₀|bases₁|bases₂|...]
Data slab:     [cigar₀|qual₀|aux₀|cigar₁|qual₁|aux₁|...]

Capacity estimation

record_store.rs:61 record_store.rs:76

On construction, the store MUST pre-allocate all three vectors with estimated capacities to avoid reallocation during push_raw. Estimates SHOULD be based on the number of compressed bytes loaded for the region: a typical BAM compression ratio of ~3:1, an average record size of ~400 bytes, ~25 bytes per read name, and ~375 bytes per record of non-name data. The store MUST accept a byte-count hint for this purpose.

Decoding into slabs

record_store.rs:17 +1 record_store.rs:6 +1

push_raw(raw_bytes) MUST decode a BAM record's fixed fields and append its variable-length data directly into the slabs, without allocating intermediate Box<[u8]> or Vec<u8> per field. The 4-bit packed sequence MUST be decoded (via SIMD when available) to Base enum values in the bases slab.

record_store.rs:118 +2 record_store.rs:374

Slab offsets (name_off, bases_off, data_off) MUST be checked for u32 overflow before storing. If any slab exceeds u32::MAX bytes, the store MUST return an error rather than silently truncating the offset.

record_store.rs:167 record_store.rs:124

push_fields(...) MUST accept pre-parsed record fields for SAM and CRAM readers: pos, end_pos, flags, mapq, matching_bases, indel_bases, qname bytes, CIGAR as packed BAM-format u32 ops, sequence as &[Base], quality bytes, and aux tag bytes in BAM binary format. This avoids encoding to BAM binary only to immediately decode it again. SAM and CRAM parsers convert their native representations to these types before pushing.

Field access

record_store.rs:58 compare_bam_with_htslib.rs:85 +2

The store MUST provide methods to access each variable-length field for a given record index: qname(idx), cigar(idx), seq(idx), seq_at(idx, pos), qual(idx), aux(idx). These return borrowed slices into the slabs.

Region lifecycle

record_store.rs:59 record_store.rs:54

Clearing the store for a new region MUST retain the allocated capacity of all four vectors, so that subsequent regions reuse the same memory without reallocation.

record_store.rs:60 record_store.rs:102

All records in the store MUST remain valid and accessible until the store is cleared.

Integration

record_store.rs:62 record_store.rs:101

The pileup engine MUST NOT use Rc for record sharing. PileupAlignment carries pre-extracted flat fields (base, qual, mapq, flags, etc.), and ActiveRecord stores a record index into the store. The store is either owned by or borrowed by the engine for the duration of iteration.

Region Buffer

Sources: The RegionBuf design is seqair-specific (bulk I/O optimisation for cluster storage). BGZF block decompression follows [SAM1] §4.1. Virtual offsets are defined in [SAM1] §4.1 "Random access". Index chunks come from [SAM1] §5.2 (BAI) or [TABIX] (TBI). See references.md.

Background

When reading a BGZF-compressed file (BAM or bgzf-compressed SAM), the reader doesn't decompress the entire file. Instead, it uses an index (.bai for BAM, .tbi for SAM.gz) to find which compressed byte ranges — called chunks — contain reads overlapping the requested genomic region. Each chunk is a pair of virtual offsets (see bgzf.md) that point to a start and end position in the compressed file.

A typical region query returns a handful of chunks. The standard approach is to seek to each chunk's start, then read and decompress BGZF blocks one at a time until reaching the chunk's end. Each block read involves a header read (~18 bytes) followed by a compressed data read (~20–40 KB), then decompression.

On local SSDs this works fine — seek + read latency is microseconds. On cluster storage (NFS, Lustre, GPFS), each I/O operation can take milliseconds due to network round-trips. A region with 200 BGZF blocks means 200+ network round-trips, which dominates processing time.

Approach

RegionBuf solves this by pre-fetching all compressed bytes for a region into RAM in one shot, then decompressing from the in-memory buffer. The I/O pattern changes from "hundreds of small reads" to "one large sequential read" per region. Since Rastair processes BAM segments in parallel (one thread per segment), each thread creates its own RegionBuf with no contention.

The flow:

  1. Query the BAM index → get a list of chunks (compressed byte ranges)
  2. Merge overlapping/adjacent chunks into larger contiguous ranges
  3. Read each merged range with one seek + read into a Vec<u8>
  4. Decompress BGZF blocks from the in-memory buffer (zero network I/O)

Chunk merging

The BAM index may return multiple chunks whose compressed byte ranges overlap or are adjacent. Reading them separately would cause redundant I/O and seeking.

region_buf.rs:507 region_buf.rs:24

Before reading, index chunks MUST be merged into non-overlapping byte ranges sorted by file offset. Overlapping or adjacent chunks MUST be coalesced into a single range. The end of each range MUST extend past the last chunk's end block offset by at least one maximum BGZF block size (64 KiB) to ensure the final block is fully included in the buffer.

reader.rs:186 region_buf.rs:83

RegionBuf MUST NOT include bin 0 chunks in its bulk read. Bin 0 chunks are at distant file offsets and contain very few reads; including them would cause an expensive additional seek. Bin 0 records are handled separately via the bin 0 cache (see bam_index.md).

Loading

region_buf.rs:75 region_buf.rs:39

RegionBuf::load MUST accept a seekable reader and a slice of index chunks. It MUST merge chunks, then perform one seek + read_exact per merged range to load all compressed bytes into a contiguous in-memory buffer. On typical BAM files this results in a single read of a few hundred KB to a few MB per region.

region_buf.rs:76 region_buf.rs:14

When given zero chunks, load MUST return an empty buffer that immediately signals EOF on any read.

Seeking

After loading, the caller iterates through the original (unmerged) chunks. For each chunk, it seeks to the chunk's start virtual offset within the pre-loaded buffer and reads records until reaching the chunk's end.

region_buf.rs:165 region_buf.rs:64

The region buffer MUST support seeking to a virtual offset within the pre-loaded data. Since disjoint ranges are concatenated, the buffer MUST maintain a range map that translates file offsets to buffer positions. A seek to a file offset that falls within any loaded range MUST succeed; a seek to an offset in a gap between ranges or before/after the loaded data MUST return an error.

Block decompression

The in-memory buffer contains raw BGZF-compressed bytes, exactly as they appear on disk. Decompression follows the same block-by-block process as the file-based reader, but reads from RAM instead of the filesystem.

region_buf.rs:220 compare_bgzf_crate.rs:51 +2

Block decompression from the in-memory buffer MUST follow the same BGZF format rules as the file-based reader: magic validation, BSIZE extraction, DEFLATE decompression, and CRC32 verification.

region_buf.rs:221 region_buf.rs:43

The fast-path header parsing (XLEN=6, BC at fixed offset) MUST be used when applicable, with fallback to searching extra fields.

Reading

region_buf.rs:364 region_buf.rs:41

The region buffer MUST support reading an exact number of bytes, transparently crossing block boundaries, with the same API as BgzfReader::read_exact_into.

region_buf.rs:460 region_buf.rs:541

The Drop implementation MUST never panic. Gap-size calculations between ranges MUST use saturating arithmetic to handle overlapping or malformed range boundaries safely.

region_buf.rs:215 region_buf.rs:42

The region buffer MUST track virtual offsets correctly: the block offset corresponds to the file position of the current block (not the buffer position), so that virtual offset comparisons with the BAM index chunk boundaries work correctly.

Integration

reader.rs:185 region_buf.rs:82

Both IndexedBamReader::fetch_into and IndexedSamReader::fetch_into MUST use RegionBuf to bulk-read compressed data before record decoding. The file-based BgzfReader is retained for header parsing and other sequential access at file open time.

reader.rs:289 compare_cram_with_htslib.rs:113

CRAM does NOT use RegionBuf. CRAM containers are not BGZF-compressed — the reader seeks to container byte offsets directly via the CRAI index and reads container data with plain File::read_exact. Block-level decompression (gzip, rANS, etc.) happens after reading the container into memory.

Overlapping Pair Deduplication

In paired-end sequencing, two reads (mates) are generated from opposite ends of the same DNA fragment. When the fragment is shorter than twice the read length, the mates overlap — both reads cover the same genomic positions in the middle. Without deduplication, these overlapping positions would be counted twice in the pileup, inflating the apparent coverage and biasing variant allele frequencies.

Sources: Seqair-specific design with no upstream spec counterpart. Mate detection relies on QNAME matching as defined in [SAM1] §1.4. The first-in-template / second-in-template distinction uses FLAG bits 0x40 and 0x80 defined in [SAM1] §1.4. See references.md.

For example, with 150bp reads from a 200bp fragment, positions 50–150 are covered by both mates. At each of these positions, the pileup would show two observations from the same molecule. Deduplication removes one mate at each overlapping position, keeping the observation count honest.

Opt-in behavior

pileup.rs:160 dedup.rs:40 +1

Overlapping pair deduplication MUST be opt-in via set_dedup_overlapping(). When not enabled, the pileup engine MUST yield all alignments without deduplication.

Mate detection

Mates are identified by sharing the same read name (qname). The engine pre-computes a mate lookup table when deduplication is enabled, so the per-position dedup check is a fast hash lookup rather than a string comparison.

pileup.rs:161 dedup.rs:70 +1

Mates are detected by matching qnames. When set_dedup_overlapping() is called, the engine MUST build a mate index that maps each record to its mate's index (if both are in the arena). Records with no mate in the arena are unaffected.

pileup.rs:162 dedup.rs:105

Only the first two records sharing a qname are treated as mates. If three or more records share the same qname (supplementary alignments, etc.), only the first pair is linked; additional records are treated as unpaired for dedup purposes.

Resolution rule

When both mates are present at a position, one must be removed. The rule depends on whether the mates agree on the base call:

pileup.rs:360 dedup.rs:123 +1

When both mates show the same base at a position, the first encountered (by arena order) is kept and the other is removed.

pileup.rs:361 dedup.rs:143 +1

When mates show different bases at a position, the first-in-template read is kept and the second-in-template read is removed. This preserves the primary observation.

Scope

pileup.rs:315 dedup.rs:179 +2

Deduplication MUST be applied per-position, not globally. A read that is removed at one overlapping position MUST still appear at positions where its mate is not active. Both mates contribute normally outside their overlap region.

Non-interference

pileup.rs:316 dedup.rs:233

Deduplication MUST be applied after the read filter and after qpos computation. A read that was excluded by the filter MUST NOT participate in dedup. A read that has no qpos at a position (deletion/skip) MUST NOT participate in dedup at that position.

pileup.rs:317 dedup.rs:255

Deduplication MUST be applied before max_depth truncation. The depth after dedup may be below max_depth even if the raw depth exceeded it.

Pileup Engine

A pileup is a column-by-column view of all reads aligned to a reference region. At each reference position, the pileup shows which reads are present and what base each read has at that position. This is the fundamental data structure for variant calling: to determine whether position X has a mutation, you look at the pileup column at X and count how many reads show the reference base vs. an alternate base.

The pileup engine takes a set of BAM records (stored in a RecordStore) and iterates over reference positions from left to right, maintaining an "active set" of reads that overlap the current position. At each position, it yields a PileupColumn containing the aligned reads with their query positions (the position within each read that corresponds to this reference position).

Sources: The pileup algorithm is seqair-specific design, verified for correctness against [htslib] bam_plp_auto. FLAG bits referenced here (0x4 unmapped, 0x10 reverse strand) are defined in [SAM1] §1.4. CIGAR consume semantics (consumes query/reference) are from [SAM1] §1.4. See references.md.

Column iteration

pileup.rs:219 compare_bam_with_htslib.rs:158 +2

The engine MUST iterate over every reference position from the region start to the region end (inclusive), yielding a PileupColumn at each position where at least one read is active.

pileup.rs:34 compare_pileup.rs:84 +1

At each position, the active set contains all records whose alignment span [pos, end_pos] includes the current position. Records MUST enter the active set when the current position reaches their pos and MUST be evicted when the current position exceeds their end_pos.

pileup.rs:73 compare_bam_with_htslib.rs:159 +2

Each PileupColumn MUST provide:

  • the reference position
  • the number of active alignments (depth)
  • access to each alignment with its query position (qpos) and a RecordRef

Filtering

pileup.rs:154 pileup.rs:74 +1

The engine MUST support a per-read filter function that is evaluated once per record when it enters the active set (NOT re-evaluated at each position). Records that fail the filter MUST be excluded from all subsequent columns.

pileup.rs:148 pileup.rs:120

The engine MUST support a maximum depth setting. When more records pass the filter than max_depth allows, excess records MUST be dropped (preferring records already in the active set).

pileup.rs:149 +1 pileup.rs:137

Max depth MUST be enforced per-position, not globally at arena-load time. A read that is rejected at one high-coverage position may still be included at an adjacent lower-coverage position. This matches htslib's behavior where MAXCNT is checked against the current column's depth.

Query position

The query position (qpos) maps a reference position back into a read's coordinate system. For example, if a read starts at reference position 100 and has CIGAR 50M, then at reference position 125, qpos = 25. For complex CIGARs with insertions or deletions, the mapping requires walking the CIGAR operations (see cigar.md).

pileup.rs:95 compare_bam_with_htslib.rs:160 +3

For each alignment in a column, the engine MUST provide qpos: the position within the read's query sequence that aligns to the current reference position. This MUST be computed using the CigarIndex.

pileup.rs:293 compare_pileup.rs:120

When the current reference position falls within a deletion or reference skip in the read's CIGAR, that alignment MUST be excluded from the column (no qpos exists).

Edge cases

pileup.rs:281 pileup.rs:184

Positions with zero active reads MUST be skipped (no column yielded). When there is a gap between reads, the engine MUST jump to the next read's start position rather than iterating through empty positions one by one.

pileup.rs:255 pileup.rs:203

Reads with the unmapped flag (0x4) MUST be excluded from the pileup. htslib's bam_plp_push rejects unmapped reads before they enter the buffer.

pileup.rs:35 pileup.rs:218

Reads with zero reference-consuming CIGAR operations (pure soft-clip, insertion-only) have end_pos == pos and MUST appear in exactly one pileup column at their pos. They MUST NOT be skipped or cause off-by-one errors.

pileup.rs:36 pileup.rs:234

When a reference position falls within the soft-clipped portion of a read's alignment, the read is NOT active at that position (soft clips do not consume reference). The read only becomes active at its first reference-consuming position.

Compatibility

pileup.rs:74 compare_bam_with_htslib.rs:157 +2

For any BAM file and region, the pileup engine MUST produce identical columns to htslib's bam_plp_auto-based pileup: same positions, same depth at each position, and same qpos values per alignment. This is verified by comparison tests.

Performance

The pileup engine is the innermost loop of Rastair's processing pipeline: for a typical 30× whole-genome BAM, it processes ~3 billion reference positions, each with ~30 aligned reads. These rules specify performance-sensitive behaviors to avoid unnecessary overhead in this hot path.

Sources: Seqair-specific performance rules with no upstream spec counterpart. Implementation strategies are guided by profiling results and cross-checked against [htslib] behaviour. See references.md.

Allocation avoidance

pileup.rs:40 perf.rs:15

The pileup engine allocates a Vec<PileupAlignment> per column using Vec::with_capacity(active.len()). Reusing a buffer via clone was measured to be slower than fresh allocation due to the copy cost of entries. The allocator efficiently reuses recently-freed blocks of the same size.

pileup.rs:38 perf.rs:71

When a record enters the pileup active set, the engine MUST retrieve the record data once and cache what it needs (cigar, flags, strand, mapq, etc.) in the ActiveRecord — NOT perform repeated store lookups for the same record in the hot loop.

pileup.rs:37 perf.rs:46

Since BAM records arrive in coordinate-sorted order from fetch_into, the pileup engine MUST NOT sort record indices. It MUST iterate records in arena order directly.

pileup.rs:39 perf.rs:96

Building a CigarIndex MUST NOT clone the CIGAR bytes via .to_vec(). The CIGAR bytes live in the arena slab for the lifetime of the region; the CigarIndex MUST borrow or reference them by offset, not own a copy.

record.rs:26 perf.rs:131 +1

Matches and indels counts MUST be computed once per record during decode and stored in BamRecord, NOT recomputed from CIGAR at every pileup position.

record_store.rs:84 perf.rs:267

The RecordStore MUST support a capacity hint so that the first region's allocation can be pre-sized based on an estimate (e.g., from the BAI index chunk sizes), avoiding repeated reallocation during the first fetch_into. This is provided by RecordStore::with_byte_hint().

References

The seqair spec is based on the following upstream specifications and reference implementations.

Format specifications

  • [SAM1] Sequence Alignment/Map Format Specification, The SAM/BAM Format Specification Working Group, last modified 2025-08-10. https://samtools.github.io/hts-specs/SAMv1.pdf Covers: SAM text format (§1), mandatory alignment fields and FLAG bits (§1.4), BAM binary encoding (§4 "The BAM Format Specification"), BGZF block compression (§4.1 "The BGZF compression format"), random access and virtual offsets (§4.1 "Random access"), EOF marker (§4.1 "End-of-file marker"), BAM record layout (§4.2 "The BAM format"), BIN field calculation (§4.2.1), SEQ/QUAL encoding (§4.2.4), auxiliary data encoding (§4.2.5), BAI index algorithm (§5 "Indexing BAM"), basic binning index (§5.1.1), BAI index format (§5.2 "The BAI index format for BAM files"), reg2bin/reg2bins C code (§5.3).

  • [CRAM3] CRAM format specification (version 3.1), samtools-devel@lists.sourceforge.net, last modified 2025-06-04. https://samtools.github.io/hts-specs/CRAMv3.pdf Covers: data types and ITF8/LTF8 variable-length integers (§2), bit stream reading (§2.2), encoding types (§3 "Encodings"), checksums (§4), file structure overview (§5), file definition and magic (§6), container header structure (§7), CRAM header container (§7.1), block structure (§8), block content types (§8.1), compression header block (§8.3) including preservation map, data series encodings, and substitution matrix, slice header block (§8.4), core data block (§8.5), external data blocks (§8.6), EOF container (§9), record structure and CRAM record decode order (§10), BAM and CRAM flags (§10.1), positional data and AP delta coding (§10.2), read names (§10.3), mate records (§10.4), auxiliary tags (§10.5), mapped reads and read features (§10.6), substitution codes (§10.6.1), unmapped reads (§10.7), reference sequences (§11), CRAM index / CRAI format (§12), encoding codec IDs (§13).

  • [CRAMcodecs] CRAM codec specification (version 3.1), samtools-devel@lists.sourceforge.net, last modified 2025-03-24. https://samtools.github.io/hts-specs/CRAMcodecs.pdf Covers: rANS 4×8 codec with order-0 and order-1 frequency tables and interleaving (§2), rANS Nx16 codec with transforms (RLE, PACK, STRIPE, CAT, N32) (§3), range/arithmetic coder (§4), name tokeniser tok3 (§5), fqzcomp quality codec (§6).

  • [TABIX] The Tabix index file format, Heng Li. https://samtools.github.io/hts-specs/tabix.pdf Covers: TBI magic, header fields (n_ref, format, col_seq, col_beg, col_end, meta, skip, l_nm, names), bin/chunk/linear-index structure identical to BAI, n_no_coor trailing field, reg2bin/reg2bins C functions.

  • [CSI] Coordinate Sorted Index (CSI) format specification, SAM/BAM Format Specification Working Group, last modified 2020-05-25. https://samtools.github.io/hts-specs/CSIv1.pdf Covers: CSI magic, header fields (min_shift, depth, l_aux), per-bin loffset field replacing BAI's linear index, parameterised reg2bin/reg2bins/bin_limit functions.

Reference implementations

  • [htslib] htslib, Wellcome Sanger Institute. https://github.com/samtools/htslib The canonical C implementation of SAM/BAM/CRAM/BGZF/BAI. Used to verify correctness of seqair's output: pileup positions, depths, qpos values, CRAM record fields, and flag handling are all cross-checked against htslib.

  • [noodles] noodles, Michael Macias. https://github.com/zaeleus/noodles A pure-Rust implementation of genomics formats. Referenced for Rust-idiomatic design patterns in handling BAM/SAM/CRAM records.