molprint

molprint is a high-performance molecular fingerprint computation and similarity search library written in Rust.

The goal is to be fast enough for large-scale virtual screening — targeting 5–10× faster than RDKit for bulk workflows — while remaining accurate enough to match RDKit bit-for-bit on standard benchmarks.

What it does

Parses SMILES strings into molecular graphs
Computes MACCS-166 structural key fingerprints (100% RDKit-accurate on ChEMBL 10k)
Computes Morgan/ECFP circular fingerprints at configurable radius and bit width (512–4096)
Calculates Tanimoto, Dice, and Cosine similarity using POPCNT on u64 word arrays
Runs parallel threshold and top-k screening via Rayon
Reads and writes FPS, SDF, and SMILES file formats (FPS is chemfp-compatible)

Benchmarks

Measured on Apple M-series, Rust 1.94, --release.

Operation	Performance
Tanimoto (2048-bit)	36 ns
Morgan ECFP4 batch	~700k mol/s
MACCS-166 batch	~535k mol/s
Screening 100k compounds	826 µs / query

Quick example

#![allow(unused)]
fn main() {
use molprint_core::smiles::parse_smiles;
use molprint_fp::{morgan::Morgan, maccs::Maccs166, traits::Fingerprinter};
use molprint_search::metrics::tanimoto;

let mol_a = parse_smiles("c1ccccc1").unwrap();   // benzene
let mol_b = parse_smiles("c1ccncc1").unwrap();   // pyridine

let fp = Morgan::new(2, 2048);
let sim = tanimoto(&fp.fingerprint(&mol_a), &fp.fingerprint(&mol_b));
println!("{:.3}", sim); // 0.600
}

Workspace layout

Crate	Role
`molprint-core`	SMILES parser, molecular graph, ring perception, SMARTS
`molprint-fp`	Morgan and MACCS fingerprint algorithms
`molprint-search`	Similarity metrics and parallel screening
`molprint-io`	FPS, SDF, SMILES file I/O
`molprint-cli`	End-user CLI tool

Crates form a strict dependency chain: molprint-core → molprint-fp → molprint-search + molprint-io → molprint-cli. No cycles, no dev-dependency shortcuts across this chain.

License

MIT

Getting Started

molprint ships as both a CLI tool and a Rust library. Choose the approach that fits your workflow.

Install the CLI

cargo install --git https://github.com/mariusrueve/molprint molprint-cli

This installs the molprint binary to ~/.cargo/bin/. Make sure that directory is on your PATH.

Use as a library

Add the relevant crates to your Cargo.toml:

[dependencies]
molprint-core = { git = "https://github.com/mariusrueve/molprint" }
molprint-fp   = { git = "https://github.com/mariusrueve/molprint" }
molprint-search = { git = "https://github.com/mariusrueve/molprint" }

You only need the crates relevant to your use case. If you just want to parse SMILES, molprint-core is sufficient. For fingerprints, add molprint-fp. For similarity search, add molprint-search.

Build from source

git clone https://github.com/mariusrueve/molprint
cd molprint
cargo build --release

The --release flag is important for performance — debug builds can be 10–20× slower for fingerprint batch operations.

Run tests

cargo test --workspace

Run benchmarks

cargo bench

Benchmarks use Criterion and output HTML reports to target/criterion/.

CLI Usage

The molprint CLI has two subcommands: fp for computing fingerprints, and search for similarity search.

`molprint fp` — compute fingerprints

Reads a molecule file, computes fingerprints, and writes a chemfp-compatible FPS file.

molprint fp --input molecules.smi --output molecules.fps

Options

Flag	Default	Description
`--input` / `-i`	required	Input file (`.smi`, `.sdf`, `.sdf.gz`)
`--output` / `-o`	required	Output FPS file path
`--fp-type` / `-t`	`morgan`	Fingerprint type: `morgan` or `maccs`
`--radius` / `-r`	`2`	Morgan radius (ignored for MACCS)
`--nbits` / `-n`	`2048`	Bit width for Morgan (ignored for MACCS)

Examples

# Default: Morgan ECFP4, 2048 bits
molprint fp --input molecules.smi --output molecules.fps

# MACCS-166 structural keys
molprint fp --input molecules.smi --fp-type maccs --output maccs.fps

# Morgan ECFP6, 4096 bits, from SDF
molprint fp --input molecules.sdf --radius 3 --nbits 4096 --output ecfp6.fps

# Gzip-compressed SDF (auto-detected by extension)
molprint fp --input chembl.sdf.gz --output chembl.fps

Input formats

File format is detected from the extension:

.smi — SMILES file, one molecule per line. Format: SMILES<tab>ID or bare SMILES. If no ID is given, a sequential mol1, mol2, ... ID is assigned.
.sdf — MDL V2000 SDF. The molecule name (first line of the mol block) is used as the ID.
.sdf.gz — gzip-compressed SDF, read transparently via flate2.

Output format

The output is a chemfp-compatible FPS v1 file:

#FPS1
#type=Morgan nbits=2048
#radius=2
#software=molprint/0.1.0
<hex_fingerprint>\t<id>
<hex_fingerprint>\t<id>
...

Fingerprints are encoded as lowercase hexadecimal, least-significant byte first (chemfp convention).

Progress output

The CLI prints a progress summary to stderr when done:

Processed 10000 molecules in 0.14s (71428 mol/s)

`molprint search` — similarity search

Searches a fingerprint database for molecules similar to a query.

molprint search --query "c1ccccc1" --db molecules.fps --threshold 0.4

Options

Flag	Default	Description
`--query` / `-q`	required	Query as a SMILES string
`--db` / `-d`	required	FPS database file path
`--threshold`	`0.7`	Tanimoto threshold for filtering
`--top-k` / `-k`	`10`	Return top-k results

How threshold and top-k interact

If --top-k is non-zero (the default is 10), a top-k search is performed and then results below --threshold are filtered out.
To get all results above a threshold without a hard top-k cap, set --top-k 0.

Examples

# All compounds with Tanimoto >= 0.4 to benzene (no top-k cap)
molprint search --query "c1ccccc1" --db molecules.fps --threshold 0.4 --top-k 0

# Top 20 most similar, filtered to Tanimoto >= 0.7
molprint search --query "c1ccccc1" --db molecules.fps --top-k 20

# Top 5 most similar (no threshold filter)
molprint search --query "c1ccccc1" --db molecules.fps --top-k 5 --threshold 0.0

Output format

Tab-separated id<tab>similarity, sorted by descending score:

CHEMBL123    0.8500
CHEMBL456    0.7800
CHEMBL789    0.7200

The fingerprint type (morgan/maccs) and parameters (radius, nbits) are read from the FPS header automatically, so the query fingerprint is computed with the same settings as the database.

Library Usage

This page walks through using molprint as a Rust library, from parsing a single SMILES string to running a parallel similarity search over thousands of compounds.

Parsing SMILES

#![allow(unused)]
fn main() {
use molprint_core::smiles::parse_smiles;

let mol = parse_smiles("CC(=O)Oc1ccccc1C(=O)O").unwrap(); // aspirin
println!("{} atoms, {} bonds", mol.node_count(), mol.edge_count());
// 13 atoms, 13 bonds
}

parse_smiles returns a Result<MolGraph, ParseError>. In library code, propagate the error rather than unwrapping.

Inspecting atoms

#![allow(unused)]
fn main() {
use molprint_core::smiles::parse_smiles;
use molprint_core::mol::graph::MolGraphExt;
use petgraph::graph::NodeIndex;

let mol = parse_smiles("CCO").unwrap(); // ethanol

for i in 0..mol.node_count() {
    let idx = NodeIndex::new(i);
    let atom = mol.atom(idx);
    println!(
        "atom {}: {:?} charge={} h_count={} in_ring={}",
        i, atom.element, atom.charge, atom.h_count, atom.in_ring
    );
}
}

Computing fingerprints

Morgan (ECFP)

#![allow(unused)]
fn main() {
use molprint_core::smiles::parse_smiles;
use molprint_fp::{morgan::Morgan, traits::Fingerprinter};

let mol = parse_smiles("c1ccccc1").unwrap();

// ECFP4 (radius=2, 2048 bits)
let morgan = Morgan::new(2, 2048);
let fp = morgan.fingerprint(&mol);

println!("{} bits set out of 2048", fp.count_ones());
println!("hex: {}", fp.to_hex());
}

Morgan::new(radius, nbits) where:

radius = 1 → ECFP2
radius = 2 → ECFP4 (default)
radius = 3 → ECFP6

MACCS-166

#![allow(unused)]
fn main() {
use molprint_core::smiles::parse_smiles;
use molprint_fp::{maccs::Maccs166, traits::Fingerprinter};

let mol = parse_smiles("c1ccccc1").unwrap();
let maccs = Maccs166::new();
let fp = maccs.fingerprint(&mol);

println!("{} bits set out of 166", fp.count_ones());
}

MACCS-166 always produces 167-bit fingerprints (bit 0 is always 0; bits 1–166 correspond to the 166 structural keys).

The `Fingerprinter` trait

Both Morgan and Maccs166 implement Fingerprinter:

#![allow(unused)]
fn main() {
pub trait Fingerprinter {
    fn fingerprint(&self, mol: &MolGraph) -> FingerprintBits;
    fn name(&self) -> &str;
    fn nbits(&self) -> usize;
}
}

You can use Box<dyn Fingerprinter> to select the algorithm at runtime.

Computing similarity

#![allow(unused)]
fn main() {
use molprint_core::smiles::parse_smiles;
use molprint_fp::{morgan::Morgan, traits::Fingerprinter};
use molprint_search::metrics::{tanimoto, dice, cosine};

let benzene = parse_smiles("c1ccccc1").unwrap();
let pyridine = parse_smiles("c1ccncc1").unwrap();

let morgan = Morgan::new(2, 2048);
let fp_a = morgan.fingerprint(&benzene);
let fp_b = morgan.fingerprint(&pyridine);

println!("Tanimoto: {:.3}", tanimoto(&fp_a, &fp_b));
println!("Dice:     {:.3}", dice(&fp_a, &fp_b));
println!("Cosine:   {:.3}", cosine(&fp_a, &fp_b));
}

All three metrics return f64 in [0.0, 1.0]. Tanimoto is the standard in cheminformatics.

Parallel screening

#![allow(unused)]
fn main() {
use molprint_core::smiles::parse_smiles;
use molprint_fp::{morgan::Morgan, traits::Fingerprinter};
use molprint_search::screen::{threshold_search, top_k_search};

let morgan = Morgan::new(2, 2048);

// Build a database (in practice, loaded from an FPS file)
let smiles_list = vec!["CCO", "CCCO", "c1ccccc1", "CC(=O)O", "c1ccncc1"];
let db_fps: Vec<_> = smiles_list.iter()
    .map(|smi| morgan.fingerprint(&parse_smiles(smi).unwrap()))
    .collect();

// Query
let query = parse_smiles("c1ccccc1").unwrap();
let query_fp = morgan.fingerprint(&query);

// Threshold search: all compounds with Tanimoto >= 0.5
let hits = threshold_search(&query_fp, &db_fps, 0.5);
for hit in &hits {
    println!("index={} similarity={:.3}", hit.index, hit.similarity);
}

// Top-k search: 3 most similar
let top3 = top_k_search(&query_fp, &db_fps, 3);
}

Both functions use Rayon for parallelism. They return Vec<SearchHit> sorted by descending similarity.

Reading and writing FPS files

#![allow(unused)]
fn main() {
use std::fs::File;
use std::io::BufWriter;
use molprint_io::fps::{FpsWriter, FpsReader};

// Write
let out = File::create("output.fps").unwrap();
let mut writer = FpsWriter::new(BufWriter::new(out), 2048, "Morgan").unwrap();
writer.write_fingerprint("mol1", &fp).unwrap();

// Read
let f = File::open("output.fps").unwrap();
let mut reader = FpsReader::new(f);
let nbits = reader.read_header().unwrap();

for item in &mut reader {
    let (id, fp) = item.unwrap();
    println!("{}: {} bits set", id, fp.count_ones());
}
}

Reading SMILES files

#![allow(unused)]
fn main() {
use std::fs::File;
use molprint_io::smiles_file::SmilesFileReader;

let f = File::open("molecules.smi").unwrap();
for (id, mol) in SmilesFileReader::new(f) {
    println!("{}: {} atoms", id, mol.node_count());
}
}

Reading SDF files

#![allow(unused)]
fn main() {
use molprint_io::sdf::open_sdf;

for result in open_sdf("molecules.sdf").unwrap() {
    let record = result.unwrap();
    println!("{}: {} atoms", record.name, record.mol.node_count());
    // record.properties: HashMap<String, String>
}
}

open_sdf automatically handles .sdf.gz based on the file extension.

SMARTS substructure matching

#![allow(unused)]
fn main() {
use molprint_core::smarts;
use molprint_core::smiles::parse_smiles;

let carbonyl = smarts::compile("C=O").unwrap();
let aspirin = parse_smiles("CC(=O)Oc1ccccc1C(=O)O").unwrap();

if smarts::has_match(&aspirin, &carbonyl) {
    println!("aspirin contains a carbonyl group");
}

// Count all matches (returns one mapping per unique anchor atom)
let matches = smarts::subgraph_match(&aspirin, &carbonyl);
println!("{} carbonyl groups found", matches.len());
}

Architecture

This section describes the internal design of molprint — how each component works and why it was built the way it was.

Crate map

molprint-core
    mol/atom.rs      — Element enum, Atom struct
    mol/bond.rs      — BondType enum
    mol/graph.rs     — MolGraph (petgraph UnGraph wrapper) + MolGraphExt trait
    smiles/lexer.rs  — tokenizer
    smiles/parser.rs — token stream → MolGraph
    ring.rs          — SSSR via BFS-based Horton algorithm
    arom.rs          — aromaticity perception
    smarts/          — SMARTS query language, VF2-style matching
        ast.rs
        lexer.rs
        matcher.rs

molprint-fp
    bitvec.rs        — FingerprintBits (Vec<u64> bit vector)
    traits.rs        — Fingerprinter trait
    morgan.rs        — Morgan/ECFP iterative hashing
    maccs.rs         — MACCS-166 structural keys

molprint-search
    metrics.rs       — Tanimoto, Dice, Cosine
    screen.rs        — threshold_search, top_k_search (Rayon)

molprint-io
    smiles_file.rs   — streaming SMILES line reader
    sdf.rs           — streaming SDF V2000 parser (plain + gzip)
    fps.rs           — chemfp FPS read/write

molprint-cli
    main.rs          — clap CLI: fp + search subcommands

Data flow

For fingerprint computation:

SMILES/SDF file
    → molprint-io (SmilesFileReader / SdfReader)
    → parse_smiles → MolGraph
    → Fingerprinter::fingerprint → FingerprintBits
    → FpsWriter → .fps file

For similarity search:

.fps file → FpsReader → Vec<FingerprintBits>
query SMILES → parse_smiles → MolGraph → FingerprintBits
threshold_search / top_k_search (Rayon) → Vec<SearchHit>

Design principles

Zero unsafe code in library crates — relying on petgraph and the standard library for all unsafe operations.
No unwrap in library code — all errors propagate through Result using thiserror.
Deterministic fingerprints — hash functions use fixed seeds; no random state that varies between process launches.
Separation of concerns — the molecular graph knows nothing about fingerprints; fingerprints know nothing about I/O.

Overview

molprint is structured as a Cargo workspace with five crates arranged in a strict dependency chain.

Dependency order

molprint-core
    ↓
molprint-fp
    ↓          ↓
molprint-search  molprint-io
         ↓       ↓
         molprint-cli

This ordering enforces clean separation: the core graph library has no knowledge of fingerprints, fingerprint algorithms have no knowledge of file formats, and the CLI is the only place where everything comes together.

molprint-core

The foundation of the library. Responsible for:

Representing molecules as graphs (MolGraph = petgraph UnGraph<Atom, BondType>)
Parsing SMILES strings into those graphs
Computing ring membership (SSSR)
Perceiving aromaticity
Matching SMARTS patterns against molecule graphs

Nothing in molprint-core is fingerprint-specific.

molprint-fp

Implements fingerprint algorithms on top of MolGraph. Contains:

FingerprintBits — a word-aligned bit vector backed by Vec<u64>
Fingerprinter trait — the common interface all algorithms implement
Morgan — iterative neighborhood hashing (Morgan/ECFP)
Maccs166 — 166 manually implemented structural key tests

molprint-search

Similarity metrics and parallel screening. Depends only on molprint-fp (for FingerprintBits). Contains:

tanimoto, dice, cosine — bit-vector similarity metrics using POPCNT
threshold_search, top_k_search — Rayon-parallelized screening functions

molprint-io

File format support. Depends on molprint-core (for MolGraph and parse_smiles) and molprint-fp (for FingerprintBits). Contains:

SmilesFileReader — streaming iterator over (id, MolGraph) pairs from SMILES files
SdfReader — streaming iterator over SDF records (plain and gzip)
FpsWriter / FpsReader — chemfp-compatible FPS format

molprint-cli

The end-user binary. Ties everything together using clap for argument parsing. Has two subcommands:

fp — read molecules, compute fingerprints, write FPS
search — read an FPS database, compute a query fingerprint, run parallel search

Testing strategy

Unit tests live alongside the code they test, in #[cfg(test)] modules
Integration tests live in tests/ at the workspace root (cli_integration.rs)
A fuzz target (fuzz/fuzz_targets/fuzz_smiles.rs) exercises the SMILES parser with arbitrary input
Fingerprint accuracy is validated by crates/molprint-fp/tests/validate_against_rdkit.rs, which compares MACCS and Morgan output against RDKit on a ChEMBL subset

Molecular Graph

The molecular graph is the central data structure in molprint. Every algorithm — ring perception, fingerprinting, SMARTS matching — operates on it.

Type definition

#![allow(unused)]
fn main() {
// mol/graph.rs
pub type MolGraph = UnGraph<Atom, BondType>;
}

MolGraph is a type alias for petgraph's UnGraph<Atom, BondType>, an undirected graph where each node carries an Atom and each edge carries a BondType. Hydrogen atoms are not stored as nodes in the graph (they are implicit); instead, Atom::h_count records the total hydrogen count.

The `Atom` struct

#![allow(unused)]
fn main() {
pub struct Atom {
    pub element: Element,    // atomic number-backed enum
    pub charge: i8,          // formal charge
    pub isotope: Option<u16>,
    pub aromatic: bool,
    pub h_count: u8,         // total H count (implicit + explicit)
    pub explicit_h: Option<u8>, // Some if bracket atom with H spec
    pub map_num: u16,        // atom-map number (for reaction SMILES)
    pub in_ring: bool,       // set by ring perception
    pub ring_sizes: Vec<u8>, // one entry per SSSR ring containing this atom
}
}

h_count is computed after the full graph is constructed by assign_implicit_hydrogens(). For bracket atoms (e.g., [NH3]), the H count is explicit; for organic subset atoms (e.g., N in CC(N)C), it is inferred from the atom's default valence minus its current bond order sum.

Bond types

#![allow(unused)]
fn main() {
pub enum BondType {
    Single,
    Double,
    Triple,
    Aromatic,
}
}

Each BondType has a valence_contribution() method that returns its integer bond order (Single=1, Double=2, Triple=3, Aromatic=1). The aromatic contribution of 1 is intentional: the additional electron from the pi system is accounted for separately in the implicit H calculation.

The `MolGraphExt` trait

petgraph's UnGraph is a generic graph; it doesn't know about chemistry. The MolGraphExt trait adds chemistry-aware methods:

#![allow(unused)]
fn main() {
pub trait MolGraphExt {
    fn num_atoms(&self) -> usize;
    fn num_bonds(&self) -> usize;
    fn atom(&self, idx: NodeIndex) -> &Atom;
    fn atom_mut(&mut self, idx: NodeIndex) -> &mut Atom;
    fn bond_between(&self, a: NodeIndex, b: NodeIndex) -> Option<BondType>;
    fn heavy_neighbors(&self, idx: NodeIndex) -> Vec<NodeIndex>;
    fn degree(&self, idx: NodeIndex) -> usize;
    fn compute_implicit_h(&self, idx: NodeIndex) -> u8;
    fn assign_implicit_hydrogens(&mut self);
}
}

compute_implicit_h is the most subtle method. It handles three cases:

Bracket atom with H spec (e.g., [NH3]): return the explicit count directly.
Non-organic-subset element (e.g., [Fe]): return 0. These atoms get no implicit Hs.
Organic subset atom: find the smallest standard valence ≥ current bond order sum; return valence − bond_sum − |charge|. For aromatic atoms, an extra +1 is added to the effective bond sum to account for the pi-electron contribution.

Implicit hydrogen calculation for aromatic atoms

This was the single most difficult piece to get right. The logic is:

effective_bond_sum = bond_sum + 1   (add 1 for the pi electron)
h_count = valence - effective_bond_sum

But this +1 must only be applied when the atom's valence is not already fully saturated by its sigma bonds. For example, trimethylamine-oxide's nitrogen in an aromatic context is a different case than a pyridine nitrogen. The code checks whether valence >= bond_sum before deciding to apply the adjustment. If the valence is already fully consumed by heavy-atom bonds, no pi adjustment is added, and 0 implicit Hs are assigned.

Node and edge indexing

petgraph assigns NodeIndex and EdgeIndex values sequentially as nodes and edges are added. These indices are stable within a single MolGraph lifetime (no removals happen in molprint). The SMILES parser, ring perception, and all fingerprint algorithms use NodeIndex::new(i) to iterate over atoms.

SMILES Parser

The SMILES parser converts a SMILES string into a MolGraph. It is split into a lexer (tokenizer) and a parser (graph builder), following the standard compiler front-end pattern.

Lexer (`smiles/lexer.rs`)

The lexer reads the SMILES string character by character and produces a stream of Token values.

#![allow(unused)]
fn main() {
pub enum Token {
    Atom { element: Element, aromatic: bool },
    BracketAtom { isotope, element, aromatic, h_count, charge, map_num, chirality },
    Bond(BondType),
    BondStereo(char),     // '/' or '\'
    RingClosure(u8),      // ring closure digit or %nn
    OpenBranch,           // '('
    CloseBranch,          // ')'
    Dot,                  // '.'
}
}

Key tokenization rules:

Organic subset atoms (B, C, N, O, P, S, F, Cl, Br, I) and their lowercase aromatic counterparts (c, n, o, s, p) are emitted as Token::Atom.
Bracket atoms ([NH3], [2H], [Fe+2], [13C@@H]) are parsed as Token::BracketAtom with all fields decoded inline. Two-character element symbols (Cl, Br, Se, etc.) require look-ahead.
Ring closures are %NN for two-digit ring numbers (e.g., %10) or single digits (0–9).
Stereo tokens (/, \) are consumed but stored only as BondStereo; stereo is not used by fingerprinting algorithms.

Parser (`smiles/parser.rs`)

The parser takes the token stream and builds a MolGraph incrementally.

State maintained during parsing:

#![allow(unused)]
fn main() {
let mut current: Option<NodeIndex> = None;
let mut branch_stack: Vec<NodeIndex> = Vec::new();
let mut pending_bond: Option<BondType> = None;
let mut ring_map: HashMap<u8, (NodeIndex, Option<BondType>)> = HashMap::new();
}

The token dispatch:

Token	Action
`Atom` / `BracketAtom`	Add node; connect to `current` with `pending_bond` or default bond; set `current`
`Bond`	Store in `pending_bond`
`BondStereo`	Ignored (no pending_bond change)
`OpenBranch`	Push `current` onto `branch_stack`
`CloseBranch`	Pop `branch_stack`; restore `current`; clear `pending_bond`
`RingClosure`	If ring_id seen before: add closure edge; else store in ring_map
`Dot`	Set `current = None` (start new fragment)

Default bond inference

When no explicit bond token precedes an atom, the parser infers the bond type:

If both the previous atom and the new atom are aromatic → BondType::Aromatic
Otherwise → BondType::Single

Ring closure bond type

When a ring closure digit appears twice, the bond type is resolved in this priority order:

The pending_bond at the closing digit (if set explicitly)
The bond type stored when the ring was opened (if set explicitly)
Inferred from the aromaticity of the two ring-closure atoms

Error handling

The parser returns Result<MolGraph, ParseError> where ParseError covers:

Lexer(LexerError) — invalid characters in the SMILES string
UnmatchedRing(u8) — a ring closure digit that was never closed
UnmatchedBranch — mismatched ( or )
Empty — empty input string
DanglingBond — a bond token with no subsequent atom

Post-processing

After the token loop, the parser runs two passes:

graph.assign_implicit_hydrogens() — computes h_count for every atom
assign_ring_info(&mut graph) — runs SSSR and sets in_ring and ring_sizes on atoms

These are called unconditionally, so every MolGraph produced by parse_smiles has fully-populated ring and hydrogen information.

Public API

#![allow(unused)]
fn main() {
// smiles/mod.rs
pub use parser::parse as parse_smiles;
}

Usage:

#![allow(unused)]
fn main() {
use molprint_core::smiles::parse_smiles;

let mol = parse_smiles("CC(=O)O").unwrap(); // acetic acid
}

Fuzz testing

The SMILES lexer and parser are covered by a libFuzzer fuzz target at fuzz/fuzz_targets/fuzz_smiles.rs. The target feeds arbitrary byte strings into parse_smiles and checks that it never panics — only returns Ok or Err. This uncovered several edge cases during development, particularly around multi-digit ring closures and unusual bracket atom encodings.

Ring Perception

Ring perception identifies which atoms belong to rings, what size those rings are, and produces the Smallest Set of Smallest Rings (SSSR). This information is used directly by the MACCS fingerprint (many keys are ring-related) and by the Morgan fingerprint (atom invariants include in_ring).

Why SSSR?

The SSSR is the minimal basis for all rings in a molecule. For a molecule with e edges, n nodes, and c connected components, the circuit rank (number of independent rings) is:

r = e − n + c

For benzene: 6 − 6 + 1 = 1. For naphthalene: 11 − 10 + 1 = 2. The SSSR contains exactly r rings.

Alternative approaches (DFS cycle enumeration, relevant cycles) enumerate more rings. For naphthalene, they would find 3 rings (the two 6-membered rings plus the 10-membered outer ring). The SSSR approach matches RDKit's behavior, which is important for MACCS accuracy.

Algorithm

The implementation uses a BFS-based approach sometimes called the Horton algorithm:

Step 1: Generate candidates

For each edge (u, v):

Temporarily remove the edge
Find the shortest path from u to v using BFS (ignoring the removed edge)
If a path exists, the path + the removed edge forms a ring candidate

This produces at most one candidate per edge. Candidates are sorted by ring size (smallest first).

Step 2: Select a basis

Greedily select candidates to form a linearly independent set in GF(2) (the field with two elements). Each ring is represented as a bit vector over edges (1 if the ring contains that edge, 0 otherwise). A candidate is added to the SSSR if its edge vector is linearly independent from all previously selected rings, determined by Gaussian elimination over GF(2).

Selection stops when r rings have been chosen.

Implementation details

#![allow(unused)]
fn main() {
pub fn find_sssr(mol: &MolGraph) -> Vec<Vec<NodeIndex>> {
    let target_rings = e.saturating_sub(n) + components;
    // ... generate candidates via BFS ...
    // ... select basis via GF(2) Gaussian elimination ...
}
}

The GF(2) basis is maintained in row-echelon form. reduce_to_basis XORs the candidate vector against basis vectors whose pivot column is set in the candidate. If the result is non-zero, the candidate is independent and is added.

Ring info assignment

After find_sssr, the assign_ring_info function marks atoms:

#![allow(unused)]
fn main() {
pub fn assign_ring_info(mol: &mut MolGraph) {
    for ring in &rings {
        let ring_size = ring.len() as u8;
        for &atom_idx in ring {
            mol[atom_idx].in_ring = true;
            mol[atom_idx].ring_sizes.push(ring_size);
        }
    }
}
}

Junction atoms (atoms shared between two fused rings, like those in naphthalene) get multiple entries in ring_sizes. This allows MACCS keys to correctly identify fused ring systems.

Spiro atom handling

Spiro compounds (two rings sharing exactly one atom) were an early bug source. A naive ring detection approach that excludes atoms already in one ring would fail to detect the second ring in a spiro compound. The BFS candidate generation correctly handles this because it operates on edges, not atoms — the spiro center is visited in candidates for both rings independently.

The test test_spiro_compound covers C1CCC2(CC1)CCCC2 (spiro[5.5]undecane) and asserts find_sssr returns exactly 2 rings.

Connected components

The circuit rank formula requires knowing the number of connected components. Components are counted by a simple BFS that starts from each unvisited node:

#![allow(unused)]
fn main() {
fn count_components(mol: &MolGraph) -> usize {
    let mut visited = vec![false; mol.node_count()];
    let mut count = 0;
    for start in mol.node_indices() {
        if !visited[start.index()] {
            count += 1;
            // BFS from start...
        }
    }
    count
}
}

For disconnected SMILES like CC.OO, components = 2, and the circuit rank is 0 (no rings), which is correct.

SMARTS Engine

The SMARTS engine supports substructure queries against MolGraph. It is used internally by the MACCS-166 fingerprint for structural key evaluation and is also available as a public API.

Supported features

Atom primitives (inside […]):

Primitive	Meaning
`#n`	Atomic number (e.g., `[#6]` = carbon)
`C`, `N`, `O`, …	Aliphatic element (organic subset)
`c`, `n`, `o`, …	Aromatic element
`a` / `A`	Any aromatic / any aliphatic atom
`Hn`	Hydrogen count (`H` = 1, `H2` = 2)
`+n` / `-n`	Formal charge
`R`	In any ring; `R0` = not in ring
`Dn`	Heavy-atom degree
`*`	Any atom

Logical operators: ! (NOT), & (high-priority AND), ; (low-priority AND), , (OR).

Bond primitives: - = # : ~ (any), and unspecified (matches single or aromatic).

Not supported in v1: recursive SMARTS $(), stereo @/@@, component-level grouping.

Architecture

The engine has three layers:

AST (smarts/ast.rs) — data structures for atom and bond expressions
Lexer/Parser (smarts/lexer.rs) — parses a SMARTS string into a SmartsPattern
Matcher (smarts/matcher.rs) — VF2-style subgraph isomorphism

AST

#![allow(unused)]
fn main() {
pub enum AtomPrimitive {
    AtomicNum(u8),
    Aromatic(bool),
    AnyAromatic,
    AnyAliphatic,
    HCount(u8),
    Charge(i8),
    InRing,
    NotInRing,
    RingSize(u8),
    Degree(u8),
    Any,
}

pub enum AtomExpr {
    Primitive(AtomPrimitive),
    Not(Box<AtomExpr>),
    And(Box<AtomExpr>, Box<AtomExpr>),
    Or(Box<AtomExpr>, Box<AtomExpr>),
}
}

SmartsBond stores the bond constraint (specific type, any, or default). SmartsPattern is a graph of SmartsAtom nodes and SmartsBond edges.

Matching

The matcher uses a recursive VF2-style algorithm. It maintains a partial mapping from pattern atoms to molecule atoms and extends it one atom at a time.

#![allow(unused)]
fn main() {
pub fn has_match(mol: &MolGraph, pattern: &SmartsPattern) -> bool;
pub fn subgraph_match(mol: &MolGraph, pattern: &SmartsPattern) -> Vec<Vec<(usize, usize)>>;
}

has_match returns early on the first valid mapping. subgraph_match collects all mappings.

The atom compatibility check evaluates the AtomExpr tree recursively:

#![allow(unused)]
fn main() {
fn atom_matches(mol_atom: &Atom, expr: &AtomExpr) -> bool {
    match expr {
        AtomExpr::Primitive(p) => primitive_matches(mol_atom, p),
        AtomExpr::Not(inner) => !atom_matches(mol_atom, inner),
        AtomExpr::And(a, b) => atom_matches(mol_atom, a) && atom_matches(mol_atom, b),
        AtomExpr::Or(a, b) => atom_matches(mol_atom, a) || atom_matches(mol_atom, b),
    }
}
}

Usage

#![allow(unused)]
fn main() {
use molprint_core::smarts;
use molprint_core::smiles::parse_smiles;

// Carbon bonded to nitrogen (any bond type)
let pattern = smarts::compile("[#6]~[#7]").unwrap();
let pyridine = parse_smiles("c1ccncc1").unwrap();
assert!(smarts::has_match(&pyridine, &pattern));

// Hydroxyl group
let oh = smarts::compile("[OH]").unwrap();
let ethanol = parse_smiles("CCO").unwrap();
assert!(smarts::has_match(&ethanol, &oh));

// Count amine nitrogens
let amine = smarts::compile("[NH2]").unwrap();
let diamine = parse_smiles("NCCN").unwrap();
let matches = smarts::subgraph_match(&diamine, &amine);
assert_eq!(matches.len(), 2);
}

Compile errors

smarts::compile returns Result<SmartsPattern, SmartsError>:

#![allow(unused)]
fn main() {
pub enum SmartsError {
    UnexpectedChar(char, usize),
    UnterminatedBracket(usize),
    UnmatchedRing(u8),
    UnmatchedParen,
    Empty,
    InvalidAtomicNum(String),
}
}

Fingerprints

molprint implements two fingerprint types: Morgan (circular/ECFP) and MACCS-166 structural keys.

`FingerprintBits`

All fingerprints are represented as FingerprintBits, a word-aligned bit vector:

#![allow(unused)]
fn main() {
pub struct FingerprintBits {
    words: Vec<u64>,
    nbits: usize,
}
}

Operations are implemented at the word level using Rust's u64::count_ones() (which compiles to a single POPCNT instruction on x86/ARM). Bitwise AND and OR iterate over word pairs, making Tanimoto computation efficient for any bit width.

The hex encoding used in FPS files stores bytes least-significant first, matching chemfp's convention.

The `Fingerprinter` trait

#![allow(unused)]
fn main() {
pub trait Fingerprinter {
    fn fingerprint(&self, mol: &MolGraph) -> FingerprintBits;
    fn name(&self) -> &str;
    fn nbits(&self) -> usize;
}
}

Both Morgan and Maccs166 implement this trait. The CLI uses Box<dyn Fingerprinter> to select the algorithm at runtime.

Morgan / ECFP

Morgan fingerprints encode circular atom environments. For each atom, an environment of radius r includes everything within r bonds.

Atom invariants (radius 0)

Each atom gets an initial hash from its local properties:

#![allow(unused)]
fn main() {
fn atom_invariant(&self, mol: &MolGraph, idx: NodeIndex) -> u32 {
    let atom = mol.atom(idx);
    let packed = ((atom.element.atomic_number() as u64) << 48)
        | ((mol.degree(idx) as u64) << 32)
        | ((atom.h_count as u64) << 16)
        | (((atom.charge as i64 + 128) as u64) << 8)
        | (atom.in_ring as u64);
    hash_u64(packed) as u32
}
}

The properties encoded are: atomic number, heavy-atom degree, hydrogen count, formal charge, ring membership. This matches the standard ECFP invariants.

Iteration

At each radius step, each atom's hash is updated based on its neighbors:

#![allow(unused)]
fn main() {
fn iterate(&self, mol: &MolGraph, prev_hashes: &[u32], _radius: u32) -> Vec<u32> {
    for i in 0..n {
        let mut neighbors: Vec<(u8, u32)> = mol.edges(idx)
            .map(|e| (bond_order, prev_hashes[neighbor]))
            .collect();
        neighbors.sort_unstable(); // deterministic ordering
        new_hashes[i] = combine_hashes(prev_hashes[i] as u64, &neighbors);
    }
}
}

Sorting neighbors before hashing is critical for canonical fingerprints — without it, the result would depend on the order atoms were added to the graph.

Hash function

The hash uses a FNV-1a-inspired 64-bit hash for atom invariants, and boost-style hash_combine for neighborhood accumulation:

#![allow(unused)]
fn main() {
fn hash_combine(h: u64, val: u64) -> u64 {
    h ^ val.wrapping_add(0x9e3779b97f4a7c15)
         .wrapping_add(h << 6)
         .wrapping_add(h >> 2)
}
}

Important: AHasher::default() uses a randomized seed and cannot be used here. All hashing must be deterministic across process launches.

Bit folding

All hashes from radius 0 through radius are collected into a HashSet<u32>, then each is folded into the bit vector:

#![allow(unused)]
fn main() {
for h in all_hashes {
    fp.set(h as usize % self.nbits);
}
}

Using a HashSet means each unique environment contributes at most one bit position, even if it appears at multiple atoms. This is equivalent to RDKit's default behavior.

Configurations

Name	`Morgan::new(r, bits)`
ECFP2	`Morgan::new(1, 2048)`
ECFP4	`Morgan::new(2, 2048)`
ECFP6	`Morgan::new(3, 2048)`

MACCS-166

MACCS-166 is a fixed set of 166 structural keys defined as patterns (originally as SMARTS queries in RDKit). Each key tests for the presence of a specific structural feature.

The keys are implemented manually in maccs.rs rather than via runtime SMARTS evaluation. This provides:

Exact control over counting semantics (uniquify=True, min_count=1 matching RDKit)
Better performance (no SMARTS compilation overhead per fingerprint)
Direct bit-level RDKit compatibility

Implementation structure

Each bit is computed by a dedicated test. The tests fall into several categories:

Element presence: "does the molecule contain a sulfur atom?"
Ring tests: "is there a 6-membered ring?", "is there an aromatic ring?"
Functional group patterns: carbonyl, hydroxyl, amine, nitro, etc.
Chain patterns: specific bond sequences, e.g., N-C=O (amide)
Count-based keys: "more than 2 ring systems", etc.

A helper function has_simple_path supports keys that require checking for specific atom sequences:

#![allow(unused)]
fn main() {
fn has_simple_path(
    mol: &MolGraph,
    start: NodeIndex,
    length: usize,
    target_pred: &dyn Fn(NodeIndex) -> bool,
) -> bool
}

For molecules with ≤ 64 atoms (the vast majority), this uses a single u64 bitmask for the visited set, avoiding heap allocation.

Accuracy

MACCS-166 achieves 100% accuracy on the ChEMBL 10k validation set against RDKit. Getting there required careful attention to:

uniquify=True: RDKit's default MACCS implementation uses uniquify=True in its SMARTS matching, meaning each atom in the molecule is only used once per match. Some keys set the bit if the count is ≥ some threshold.
Aromatic implicit H: aromatic nitrogen (pyridine vs. pyrrole) has different H counts, and several MACCS keys test for [nH] specifically.
Bit 0: always 0 (undefined in the original MACCS spec). The bitvector has 167 bits, with bit 0 unused.

Fingerprinter implementation

#![allow(unused)]
fn main() {
impl Fingerprinter for Maccs166 {
    fn nbits(&self) -> usize { 167 }
    fn name(&self) -> &str { "MACCS166" }
    fn fingerprint(&self, mol: &MolGraph) -> FingerprintBits {
        let mut fp = FingerprintBits::new(167);
        // bit 1: isotope
        // bit 2: ...
        // ...
        fp
    }
}
}

Similarity & Screening

The molprint-search crate provides similarity metrics and parallel screening functions.

Similarity metrics

All three metrics operate on FingerprintBits and use POPCNT for efficiency.

Tanimoto (Jaccard)

The standard similarity metric in cheminformatics:

T(A, B) = |A ∩ B| / |A ∪ B| = popcount(A & B) / popcount(A | B)

#![allow(unused)]
fn main() {
pub fn tanimoto(a: &FingerprintBits, b: &FingerprintBits) -> f64 {
    let intersection = a.and(b).count_ones() as f64;
    let union = a.or(b).count_ones() as f64;
    if union == 0.0 { 0.0 } else { intersection / union }
}
}

Returns 0.0 for two all-zero fingerprints (rather than NaN). Returns 1.0 when the fingerprints are identical.

A common threshold for "similar" compounds is Tanimoto ≥ 0.7 for ECFP4.

Dice

D(A, B) = 2 * |A ∩ B| / (|A| + |B|)

Dice is more lenient than Tanimoto for compounds of different sizes, because the denominator uses the sum of individual set sizes rather than the union.

Cosine

C(A, B) = |A ∩ B| / sqrt(|A| * |B|)

Cosine similarity treats the fingerprint as a vector and measures the angle between them. It is less commonly used in cheminformatics than Tanimoto but useful for some applications.

Performance

At 36 ns per Tanimoto comparison (2048-bit), the bottleneck is memory bandwidth, not computation. The u64 word loop uses Rust's u64::count_ones() which compiles to a single POPCNT instruction. For future work, explicit SIMD (AVX2 or NEON) could batch multiple fingerprint comparisons per loop iteration.

Parallel screening

`threshold_search`

Returns all database fingerprints with Tanimoto ≥ threshold, sorted by descending similarity:

#![allow(unused)]
fn main() {
pub fn threshold_search(
    query: &FingerprintBits,
    database: &[FingerprintBits],
    threshold: f64,
) -> Vec<SearchHit>
}

Uses Rayon's par_iter() for parallel evaluation. Each thread processes a chunk of the database independently, then results are merged and sorted.

`top_k_search`

Returns the k most similar fingerprints:

#![allow(unused)]
fn main() {
pub fn top_k_search(
    query: &FingerprintBits,
    database: &[FingerprintBits],
    k: usize,
) -> Vec<SearchHit>
}

Computes similarity for every database entry (in parallel), sorts, then truncates to k. For very large databases where k is much smaller than the total size, a heap-based algorithm would be more efficient — this is the simplest correct implementation.

`SearchHit`

#![allow(unused)]
fn main() {
pub struct SearchHit {
    pub index: usize,      // index into the database slice
    pub similarity: f64,
}
}

The index into the original database slice lets the caller look up the molecule ID.

Screening performance

On Apple M-series, screening a 100k compound database takes ~826 µs per query (121M fingerprint comparisons/second). This is primarily memory-bound. At 2048 bits = 256 bytes per fingerprint, a 100k database is 25 MB — fits in L3 cache on most modern CPUs, which explains why Rayon parallelism helps.

For databases that fit in RAM (up to ~10M compounds at 2048-bit), molprint keeps everything in a Vec<FingerprintBits> without indexing structures. For larger scales, pre-filtering using popcount bounds (the Tanimoto lower bound from bit count differences) would reduce the number of full comparisons needed.

Usage example

#![allow(unused)]
fn main() {
use molprint_search::screen::{threshold_search, top_k_search};

// database is Vec<FingerprintBits>, loaded from an FPS file
// query_fp is FingerprintBits for the query molecule

// All compounds with Tanimoto >= 0.6
let hits = threshold_search(&query_fp, &database, 0.6);
for hit in &hits {
    println!("index={} sim={:.4}", hit.index, hit.similarity);
}

// Top 10 most similar
let top10 = top_k_search(&query_fp, &database, 10);
}

File I/O

The molprint-io crate handles reading molecules from SMILES and SDF files, and reading/writing fingerprints in the FPS format.

SMILES files (`smiles_file.rs`)

SmilesFileReader is an iterator over (id, MolGraph) pairs from a SMILES file:

#![allow(unused)]
fn main() {
use std::fs::File;
use molprint_io::smiles_file::SmilesFileReader;

let f = File::open("molecules.smi").unwrap();
for (id, mol) in SmilesFileReader::new(f) {
    println!("{}: {} atoms", id, mol.node_count());
}
}

Input format

Accepts two formats per line:

SMILES<tab>ID — SMILES and ID separated by a tab
SMILES — bare SMILES; a sequential ID (mol1, mol2, …) is assigned

Lines starting with # are skipped as comments. Empty lines are skipped.

Error handling

Lines where the SMILES fails to parse are silently skipped in the iterator (consistent with how large SMILES files from external sources are handled — one bad record shouldn't abort processing 10 million molecules). The CLI logs skipped records to stderr.

SDF files (`sdf.rs`)

open_sdf returns a streaming iterator over SDF records:

#![allow(unused)]
fn main() {
use molprint_io::sdf::open_sdf;

for result in open_sdf("molecules.sdf").unwrap() {
    match result {
        Ok(record) => {
            println!("{}: {} atoms", record.name, record.mol.node_count());
            if let Some(mw) = record.properties.get("MW") {
                println!("  MW = {}", mw);
            }
        }
        Err(e) => eprintln!("Skipping record: {}", e),
    }
}
}

SDF record

#![allow(unused)]
fn main() {
pub struct SdfRecord {
    pub name: String,                          // first line of the mol block
    pub mol: MolGraph,                         // parsed molecular graph
    pub properties: HashMap<String, String>,   // > <key> / value pairs
}
}

Gzip support

open_sdf detects .sdf.gz by file extension and transparently wraps the file in a flate2::read::GzDecoder. No separate API is needed:

#![allow(unused)]
fn main() {
// Both work the same way
open_sdf("molecules.sdf")
open_sdf("molecules.sdf.gz")
}

V2000 parsing

The parser handles MDL V2000 format mol blocks. The atom table and bond table are read line by line. Atom coordinates (x, y, z) are parsed but discarded — molprint is a 2D/topological library and doesn't use 3D coordinates.

Stereo bond wedge/dash indicators in the bond table are parsed and ignored. Properties outside the mol block (the > <name> / value pairs before $$$$) are collected into the properties map.

FPS files (`fps.rs`)

The FPS format is chemfp's standard fingerprint exchange format. molprint implements a subset that is read/write compatible with chemfp.

Writing

#![allow(unused)]
fn main() {
use molprint_io::fps::FpsWriter;
use std::io::BufWriter;

let out = File::create("output.fps").unwrap();
let mut writer = FpsWriter::new(BufWriter::new(out), 2048, "Morgan").unwrap();
writer.write_meta("radius", "2").unwrap();
writer.write_fingerprint("mol1", &fp).unwrap();
}

The writer outputs the FPS v1 header on construction:

#FPS1
#type=Morgan nbits=2048
#software=molprint/0.1.0

Additional metadata lines can be written with write_meta before any fingerprint records. The CLI uses this to store the Morgan radius so that search can reconstruct the exact fingerprinter from the database file.

Each fingerprint record is: <hex>\t<id>\n.

Reading

#![allow(unused)]
fn main() {
use molprint_io::fps::FpsReader;

let mut reader = FpsReader::new(File::open("output.fps").unwrap());
let nbits = reader.read_header().unwrap(); // reads header, returns nbits

// Access metadata from header
let radius = reader.meta("radius"); // Option<&str>
let fp_type = reader.fp_type();     // &str

for item in &mut reader {
    let (id, fp) = item.unwrap();
}
}

read_header parses all #key=value comment lines at the top of the file. The reader handles the case where nbits is not in the header (infers from hex string length). If read_header is not called before iterating, it is called automatically on the first next() call.

FPS format specification

#FPS1                                  ← magic header
#type=<name> nbits=<n>                 ← fingerprint type
#<key>=<value>                         ← optional metadata
<hex_fingerprint>\t<molecule_id>       ← one record per line

Hexadecimal encoding: bytes are written least-significant-first (little-endian), matching chemfp. For a 64-bit fingerprint with only bit 0 set, the hex is 0100000000000000 (8 bytes = 16 hex chars).

Design Decisions

This page explains the key architectural choices made in molprint and the reasoning behind them.

petgraph vs. a custom graph

The molecular graph uses petgraph's UnGraph rather than a custom adjacency list or matrix.

Why petgraph: Writing a correct, fast, well-tested graph data structure is a large project in itself. petgraph is mature, widely used in the Rust ecosystem, and provides exactly the operations we need: node/edge iteration, neighbor traversal, edge lookup by endpoint pair, and NodeIndex stability. The BFS in ring perception, the VF2 matching in SMARTS, and the neighborhood iteration in Morgan all map cleanly onto petgraph's API.

The cost: petgraph's UnGraph stores adjacency as a list of (source, target, edge) triples, which is not the most cache-friendly layout for large-scale operations. But for molecules (typically < 100 atoms), the in-memory layout is small enough to fit in cache regardless, and the performance difference vs. a custom flat structure is negligible.

The alternative: A flat array of (u8, u8, BondType) triples indexed by a CSR (compressed sparse row) adjacency list would be more cache-friendly for bulk operations. This would matter most for the SMARTS matcher traversing large molecules, but for typical drug-like molecules the difference is not measurable.

Manual MACCS-166 vs. runtime SMARTS

MACCS-166 keys are defined in RDKit as SMARTS patterns. The obvious implementation is to compile those 166 SMARTS patterns at startup and evaluate each one per molecule.

molprint instead implements the 166 keys as hand-written Rust code.

Why manual: Three reasons.

Exact RDKit semantics: RDKit's MACCS implementation uses specific counting logic (uniquify=True, min_count thresholds per key) that is not the same as simply checking has_match. The counting semantics are easier to express precisely in code than to replicate through SMARTS configuration.
Performance: Compiling 166 SMARTS patterns and running VF2 matching for each molecule adds significant constant overhead per molecule. Manual code eliminates the pattern compilation step and allows simpler, more direct logic for common tests (element presence, ring sizes) without the general-purpose SMARTS machinery.
Accuracy as first-class concern: Because the keys are hand-written, each one can be individually tested against RDKit output. The validation test suite (validate_against_rdkit.rs) checks all 166 bits on a ChEMBL 10k subset. Catching a semantic error in a hand-written key is straightforward; tracking down why a SMARTS-based implementation differs from RDKit's behavior would be harder.

`Vec<u64>` bitvector

Fingerprints are stored as Vec<u64> with a fixed bit count. The alternatives were:

[u64; N] const generic array: Would require const generic parameters through the entire codebase and cannot represent runtime-configurable bit widths (Morgan supports 512–4096 bits).
bitvec crate: A well-designed library, but adds a dependency and has more API surface than needed. The custom FingerprintBits has exactly the operations needed (set, get, count_ones, and/xor, hex encode/decode) with no extra overhead.
Vec<u8> or Vec<u32>: u64 matches the natural word size of modern CPUs, and u64::count_ones() compiles to a single POPCNT instruction. Using u8 or u32 would require more iterations per Tanimoto calculation.

Rayon for parallelism

The screening functions use Rayon's par_iter() rather than manual thread management or async.

Why Rayon: threshold_search over a database of fingerprints is embarrassingly parallel — each comparison is independent. Rayon's work-stealing thread pool handles load balancing automatically, and the API change from iter() to par_iter() is minimal. Manual std::thread management would require partitioning the database into chunks, handling the results from each thread, and managing thread lifetimes — all complexity that Rayon handles for free.

Why not async: Similarity search is CPU-bound, not I/O-bound. Async would add complexity without benefiting throughput.

`thiserror` for error types

All public error types use thiserror::Error derive macros.

Why thiserror: It generates std::error::Error and Display implementations from the error type definition, which would otherwise be boilerplate. The derive-based approach enforces consistent error message formatting and makes it easy to chain errors from lower-level crates (e.g., LexerError wraps into ParseError with #[from]).

Alternative: anyhow is common in applications but is designed for error propagation rather than typed error handling. Library code should use typed errors so callers can match on specific variants; anyhow boxes the error, losing the type.

Deterministic fingerprints

Morgan's hash functions use fixed constants (0x9e3779b97f4a7c15 from the golden ratio, and a fixed FNV-1a seed 0xcbf29ce484222325). This is explicitly documented in the code with a comment: AHasher::default() uses a randomized seed and must NOT be used here.

Why this matters: If the hash function is randomized, the same SMILES produces a different fingerprint in each process invocation. This makes it impossible to compare fingerprints computed in different runs (e.g., a database built yesterday vs. a query today). Deterministic fingerprints are a correctness requirement, not just a performance preference.

Implicit-only hydrogen representation

Hydrogen atoms are not stored as nodes in MolGraph. They are counted in Atom::h_count but not represented as graph nodes.

Why: Most cheminformatics algorithms (ring perception, fingerprinting, SMARTS matching) operate on the heavy-atom graph. Storing explicit H nodes would double the graph size for typical organic molecules, slow down neighbor iteration, and require special-casing Hs in every algorithm.

The exception: Bracket atoms like [2H] (deuterium) can appear in SMILES and do get added as nodes, because they carry isotope information that may be structurally meaningful. The assign_implicit_hydrogens function counts these explicit H neighbors and includes them in h_count.

Learnings

These are honest technical insights from building molprint — things that took longer than expected, bugs that required careful thought to fix, and places where the problem was harder than it looked.

RDKit MACCS semantics: uniquify=True and min_count

The MACCS keys look straightforward on paper — 166 SMARTS patterns, one bit per pattern. The bit is set if the pattern has at least one match in the molecule.

But "at least one match" hides complexity. RDKit's GetMACCSKeysFingerprint uses uniquify=True in its SMARTS matching, which means each atom can only be used once per match. For some keys, the bit is set only if there are at least N matches (some keys test for "more than 1 ring" or "more than 2 atoms of type X").

The practical consequence: if you implement MACCS as "compile pattern, call has_match, set bit", you'll get the wrong answer for about 10–15 keys on any reasonably complex molecule. The correct implementation requires knowing the intended count semantics for each key, which comes from reading the RDKit source code for MACCSkeys.py directly.

Example: MACCS key 166 (?) is always 0 in RDKit's implementation. Key 0 is also always 0 (it's a placeholder). Several keys in the 140–150 range test for the presence of multiple ring systems or specific ring counts, not just "does this pattern appear at all".

Running the validation suite (validate_against_rdkit.rs) against a ChEMBL subset was the only reliable way to catch all of these edge cases.

Aromatic implicit hydrogen calculation

Computing implicit H counts for aromatic atoms is the most subtle part of the SMILES parser.

Consider pyridine (c1ccncc1): the nitrogen is aromatic. It has two aromatic bonds (bond order sum = 2 from the sigma bonds). Its normal valence is 3. So naively: 3 - 2 = 1 H. But pyridine nitrogen has zero hydrogens.

The issue: aromatic bonds each contribute 1 to the sigma bond sum, but the pi system takes up the remaining valence capacity. The standard formula adds +1 to the effective bond sum to account for the pi electron: effective_sum = sigma_sum + 1 = 3, then h = 3 - 3 = 0. Correct.

Now consider pyrrole nitrogen ([nH]): it contributes 2 pi electrons to the aromatic system (it's the nitrogen with the lone pair). Its sigma bonds are 2 (two ring C-N bonds). Effective sum = 3, valence = 3, h = 0. But we wrote [nH] explicitly — the bracket forces H=1.

The problem emerges for organic-subset aromatic nitrogen without explicit bracket notation. In practice, most aromatic nitrogens in the organic subset that need an H are written as [nH] in valid SMILES. But the implicit H formula still needs to be correct for cases like n in a ring where no H is present.

The final rule: apply the +1 pi adjustment only when the atom has available valence remaining after its sigma bonds. If the sigma bonds already consume all valence capacity (bond_sum == some_valence), don't add +1 — the atom is already fully bonded and gets 0 implicit H. This handles the trimethylamine-oxide case and similar edge cases.

Spiro atom ring-bond exclusion bugs

The first SSSR implementation had a subtle bug with spiro compounds. A spiro compound has two rings sharing exactly one atom (the spiro center). The SSSR should report two rings.

An earlier version of the candidate selection excluded edges whose endpoints were already both "in a ring". The reasoning was: if both atoms are already in a ring, the edge is redundant. This is wrong for spiro compounds — the spiro center is in one ring, but the edges connecting it to the second ring are not redundant.

The fix was to generate candidates based purely on BFS reachability (for each edge, can you get from one endpoint to the other without using that edge?) and rely entirely on the GF(2) independence test to filter out duplicates. This is slower in theory (more candidates) but correct for all cases including spiro, fused, and bridged ring systems.

The specific test that caught this: C1CCC2(CC1)CCCC2 (spiro[5.5]undecane) should have 2 rings. The broken version reported 1.

SSSR vs. DFS cycle enumeration

An early temptation was to use DFS to find all simple cycles and then deduplicate. This is conceptually simple but produces too many rings for fused systems.

For naphthalene (c1ccc2ccccc2c1), DFS enumeration finds 3 simple cycles: the two 6-membered rings and the 10-membered outer ring. The SSSR finds 2 (just the two 6-membered rings). RDKit uses SSSR, so molprint must match.

The GF(2) basis selection elegantly handles this: the 10-membered outer ring of naphthalene is linearly dependent on the two 6-membered rings in the edge-vector basis, so it is automatically excluded.

Hash function must be deterministic

During early development, the Morgan fingerprint implementation used HashMap for collecting environment hashes. HashMap in Rust uses AHasher by default since Rust 1.36, and AHasher::default() uses a process-level random seed for hash-flooding resistance.

This made fingerprints non-deterministic across processes. A database built in one run couldn't be searched in another because the bit positions of all hash-folded environments were different each time.

The fix was to replace all uses of HashSet<u32> (for deduplication) with BTreeSet or, better, to use a custom hash function with a fixed seed. The hash functions in morgan.rs now use FNV-1a constants and the golden-ratio-based combine from boost::hash_combine — both fully deterministic.

The rule: never use HashMap/HashSet/AHasher in any code path that contributes to fingerprint output. Only use them for intermediate data structures where the output ordering doesn't matter (and then only where you've verified it doesn't affect the final bit vector).

FPS byte ordering: least-significant byte first

The FPS format stores fingerprints as hexadecimal with bytes in little-endian order (least-significant byte first). This is chemfp's convention, and getting it wrong means fingerprints written by molprint can't be read by chemfp, and vice versa.

For a 64-bit fingerprint with only bit 0 set:

The internal u64 value is 0x0000000000000001
The little-endian byte representation is 01 00 00 00 00 00 00 00
The hex encoding is 0100000000000000

A naive implementation might write the hex of the u64 as a big-endian integer: 0000000000000001. That's wrong for FPS interoperability.

The to_hex and from_hex methods in FingerprintBits iterate over bytes in little-endian order explicitly to get this right.

The `pending_bond = None` on CloseBranch

The SMILES parser has a subtlety: when a CloseBranch token is encountered, the pending_bond must be cleared. Consider CC(=O)C:

C — atom 0
C — atom 1, bonded to atom 0
( — push atom 1 onto branch stack
= — set pending_bond = Double
O — atom 2, bonded to atom 1 with Double bond
) — pop branch stack, restore current = atom 1, clear pending_bond
C — atom 3, bonded to atom 1 with Single bond (not Double!)

Without clearing pending_bond on ), the final carbon would incorrectly receive a double bond. This was caught by the acetic acid test (CC(=O)O).