molprint
molprint is a high-performance molecular fingerprint computation and similarity search library written in Rust.
The goal is to be fast enough for large-scale virtual screening — targeting 5–10× faster than RDKit for bulk workflows — while remaining accurate enough to match RDKit bit-for-bit on standard benchmarks.
What it does
- Parses SMILES strings into molecular graphs
- Computes MACCS-166 structural key fingerprints (100% RDKit-accurate on ChEMBL 10k)
- Computes Morgan/ECFP circular fingerprints at configurable radius and bit width (512–4096)
- Calculates Tanimoto, Dice, and Cosine similarity using POPCNT on
u64word arrays - Runs parallel threshold and top-k screening via Rayon
- Reads and writes FPS, SDF, and SMILES file formats (FPS is chemfp-compatible)
Benchmarks
Measured on Apple M-series, Rust 1.94, --release.
| Operation | Performance |
|---|---|
| Tanimoto (2048-bit) | 36 ns |
| Morgan ECFP4 batch | ~700k mol/s |
| MACCS-166 batch | ~535k mol/s |
| Screening 100k compounds | 826 µs / query |
Quick example
#![allow(unused)] fn main() { use molprint_core::smiles::parse_smiles; use molprint_fp::{morgan::Morgan, maccs::Maccs166, traits::Fingerprinter}; use molprint_search::metrics::tanimoto; let mol_a = parse_smiles("c1ccccc1").unwrap(); // benzene let mol_b = parse_smiles("c1ccncc1").unwrap(); // pyridine let fp = Morgan::new(2, 2048); let sim = tanimoto(&fp.fingerprint(&mol_a), &fp.fingerprint(&mol_b)); println!("{:.3}", sim); // 0.600 }
Workspace layout
| Crate | Role |
|---|---|
molprint-core | SMILES parser, molecular graph, ring perception, SMARTS |
molprint-fp | Morgan and MACCS fingerprint algorithms |
molprint-search | Similarity metrics and parallel screening |
molprint-io | FPS, SDF, SMILES file I/O |
molprint-cli | End-user CLI tool |
Crates form a strict dependency chain: molprint-core → molprint-fp → molprint-search + molprint-io → molprint-cli. No cycles, no dev-dependency shortcuts across this chain.
License
MIT
Getting Started
molprint ships as both a CLI tool and a Rust library. Choose the approach that fits your workflow.
Install the CLI
cargo install --git https://github.com/mariusrueve/molprint molprint-cli
This installs the molprint binary to ~/.cargo/bin/. Make sure that directory is on your PATH.
Use as a library
Add the relevant crates to your Cargo.toml:
[dependencies]
molprint-core = { git = "https://github.com/mariusrueve/molprint" }
molprint-fp = { git = "https://github.com/mariusrueve/molprint" }
molprint-search = { git = "https://github.com/mariusrueve/molprint" }
You only need the crates relevant to your use case. If you just want to parse SMILES, molprint-core is sufficient. For fingerprints, add molprint-fp. For similarity search, add molprint-search.
Build from source
git clone https://github.com/mariusrueve/molprint
cd molprint
cargo build --release
The --release flag is important for performance — debug builds can be 10–20× slower for fingerprint batch operations.
Run tests
cargo test --workspace
Run benchmarks
cargo bench
Benchmarks use Criterion and output HTML reports to target/criterion/.
CLI Usage
The molprint CLI has two subcommands: fp for computing fingerprints, and search for similarity search.
molprint fp — compute fingerprints
Reads a molecule file, computes fingerprints, and writes a chemfp-compatible FPS file.
molprint fp --input molecules.smi --output molecules.fps
Options
| Flag | Default | Description |
|---|---|---|
--input / -i | required | Input file (.smi, .sdf, .sdf.gz) |
--output / -o | required | Output FPS file path |
--fp-type / -t | morgan | Fingerprint type: morgan or maccs |
--radius / -r | 2 | Morgan radius (ignored for MACCS) |
--nbits / -n | 2048 | Bit width for Morgan (ignored for MACCS) |
Examples
# Default: Morgan ECFP4, 2048 bits
molprint fp --input molecules.smi --output molecules.fps
# MACCS-166 structural keys
molprint fp --input molecules.smi --fp-type maccs --output maccs.fps
# Morgan ECFP6, 4096 bits, from SDF
molprint fp --input molecules.sdf --radius 3 --nbits 4096 --output ecfp6.fps
# Gzip-compressed SDF (auto-detected by extension)
molprint fp --input chembl.sdf.gz --output chembl.fps
Input formats
File format is detected from the extension:
.smi— SMILES file, one molecule per line. Format:SMILES<tab>IDor bareSMILES. If no ID is given, a sequentialmol1,mol2, ... ID is assigned..sdf— MDL V2000 SDF. The molecule name (first line of the mol block) is used as the ID..sdf.gz— gzip-compressed SDF, read transparently via flate2.
Output format
The output is a chemfp-compatible FPS v1 file:
#FPS1
#type=Morgan nbits=2048
#radius=2
#software=molprint/0.1.0
<hex_fingerprint>\t<id>
<hex_fingerprint>\t<id>
...
Fingerprints are encoded as lowercase hexadecimal, least-significant byte first (chemfp convention).
Progress output
The CLI prints a progress summary to stderr when done:
Processed 10000 molecules in 0.14s (71428 mol/s)
molprint search — similarity search
Searches a fingerprint database for molecules similar to a query.
molprint search --query "c1ccccc1" --db molecules.fps --threshold 0.4
Options
| Flag | Default | Description |
|---|---|---|
--query / -q | required | Query as a SMILES string |
--db / -d | required | FPS database file path |
--threshold | 0.7 | Tanimoto threshold for filtering |
--top-k / -k | 10 | Return top-k results |
How threshold and top-k interact
- If
--top-kis non-zero (the default is 10), a top-k search is performed and then results below--thresholdare filtered out. - To get all results above a threshold without a hard top-k cap, set
--top-k 0.
Examples
# All compounds with Tanimoto >= 0.4 to benzene (no top-k cap)
molprint search --query "c1ccccc1" --db molecules.fps --threshold 0.4 --top-k 0
# Top 20 most similar, filtered to Tanimoto >= 0.7
molprint search --query "c1ccccc1" --db molecules.fps --top-k 20
# Top 5 most similar (no threshold filter)
molprint search --query "c1ccccc1" --db molecules.fps --top-k 5 --threshold 0.0
Output format
Tab-separated id<tab>similarity, sorted by descending score:
CHEMBL123 0.8500
CHEMBL456 0.7800
CHEMBL789 0.7200
The fingerprint type (morgan/maccs) and parameters (radius, nbits) are read from the FPS header automatically, so the query fingerprint is computed with the same settings as the database.
Library Usage
This page walks through using molprint as a Rust library, from parsing a single SMILES string to running a parallel similarity search over thousands of compounds.
Parsing SMILES
#![allow(unused)] fn main() { use molprint_core::smiles::parse_smiles; let mol = parse_smiles("CC(=O)Oc1ccccc1C(=O)O").unwrap(); // aspirin println!("{} atoms, {} bonds", mol.node_count(), mol.edge_count()); // 13 atoms, 13 bonds }
parse_smiles returns a Result<MolGraph, ParseError>. In library code, propagate the error rather than unwrapping.
Inspecting atoms
#![allow(unused)] fn main() { use molprint_core::smiles::parse_smiles; use molprint_core::mol::graph::MolGraphExt; use petgraph::graph::NodeIndex; let mol = parse_smiles("CCO").unwrap(); // ethanol for i in 0..mol.node_count() { let idx = NodeIndex::new(i); let atom = mol.atom(idx); println!( "atom {}: {:?} charge={} h_count={} in_ring={}", i, atom.element, atom.charge, atom.h_count, atom.in_ring ); } }
Computing fingerprints
Morgan (ECFP)
#![allow(unused)] fn main() { use molprint_core::smiles::parse_smiles; use molprint_fp::{morgan::Morgan, traits::Fingerprinter}; let mol = parse_smiles("c1ccccc1").unwrap(); // ECFP4 (radius=2, 2048 bits) let morgan = Morgan::new(2, 2048); let fp = morgan.fingerprint(&mol); println!("{} bits set out of 2048", fp.count_ones()); println!("hex: {}", fp.to_hex()); }
Morgan::new(radius, nbits) where:
radius = 1→ ECFP2radius = 2→ ECFP4 (default)radius = 3→ ECFP6
MACCS-166
#![allow(unused)] fn main() { use molprint_core::smiles::parse_smiles; use molprint_fp::{maccs::Maccs166, traits::Fingerprinter}; let mol = parse_smiles("c1ccccc1").unwrap(); let maccs = Maccs166::new(); let fp = maccs.fingerprint(&mol); println!("{} bits set out of 166", fp.count_ones()); }
MACCS-166 always produces 167-bit fingerprints (bit 0 is always 0; bits 1–166 correspond to the 166 structural keys).
The Fingerprinter trait
Both Morgan and Maccs166 implement Fingerprinter:
#![allow(unused)] fn main() { pub trait Fingerprinter { fn fingerprint(&self, mol: &MolGraph) -> FingerprintBits; fn name(&self) -> &str; fn nbits(&self) -> usize; } }
You can use Box<dyn Fingerprinter> to select the algorithm at runtime.
Computing similarity
#![allow(unused)] fn main() { use molprint_core::smiles::parse_smiles; use molprint_fp::{morgan::Morgan, traits::Fingerprinter}; use molprint_search::metrics::{tanimoto, dice, cosine}; let benzene = parse_smiles("c1ccccc1").unwrap(); let pyridine = parse_smiles("c1ccncc1").unwrap(); let morgan = Morgan::new(2, 2048); let fp_a = morgan.fingerprint(&benzene); let fp_b = morgan.fingerprint(&pyridine); println!("Tanimoto: {:.3}", tanimoto(&fp_a, &fp_b)); println!("Dice: {:.3}", dice(&fp_a, &fp_b)); println!("Cosine: {:.3}", cosine(&fp_a, &fp_b)); }
All three metrics return f64 in [0.0, 1.0]. Tanimoto is the standard in cheminformatics.
Parallel screening
#![allow(unused)] fn main() { use molprint_core::smiles::parse_smiles; use molprint_fp::{morgan::Morgan, traits::Fingerprinter}; use molprint_search::screen::{threshold_search, top_k_search}; let morgan = Morgan::new(2, 2048); // Build a database (in practice, loaded from an FPS file) let smiles_list = vec!["CCO", "CCCO", "c1ccccc1", "CC(=O)O", "c1ccncc1"]; let db_fps: Vec<_> = smiles_list.iter() .map(|smi| morgan.fingerprint(&parse_smiles(smi).unwrap())) .collect(); // Query let query = parse_smiles("c1ccccc1").unwrap(); let query_fp = morgan.fingerprint(&query); // Threshold search: all compounds with Tanimoto >= 0.5 let hits = threshold_search(&query_fp, &db_fps, 0.5); for hit in &hits { println!("index={} similarity={:.3}", hit.index, hit.similarity); } // Top-k search: 3 most similar let top3 = top_k_search(&query_fp, &db_fps, 3); }
Both functions use Rayon for parallelism. They return Vec<SearchHit> sorted by descending similarity.
Reading and writing FPS files
#![allow(unused)] fn main() { use std::fs::File; use std::io::BufWriter; use molprint_io::fps::{FpsWriter, FpsReader}; // Write let out = File::create("output.fps").unwrap(); let mut writer = FpsWriter::new(BufWriter::new(out), 2048, "Morgan").unwrap(); writer.write_fingerprint("mol1", &fp).unwrap(); // Read let f = File::open("output.fps").unwrap(); let mut reader = FpsReader::new(f); let nbits = reader.read_header().unwrap(); for item in &mut reader { let (id, fp) = item.unwrap(); println!("{}: {} bits set", id, fp.count_ones()); } }
Reading SMILES files
#![allow(unused)] fn main() { use std::fs::File; use molprint_io::smiles_file::SmilesFileReader; let f = File::open("molecules.smi").unwrap(); for (id, mol) in SmilesFileReader::new(f) { println!("{}: {} atoms", id, mol.node_count()); } }
Reading SDF files
#![allow(unused)] fn main() { use molprint_io::sdf::open_sdf; for result in open_sdf("molecules.sdf").unwrap() { let record = result.unwrap(); println!("{}: {} atoms", record.name, record.mol.node_count()); // record.properties: HashMap<String, String> } }
open_sdf automatically handles .sdf.gz based on the file extension.
SMARTS substructure matching
#![allow(unused)] fn main() { use molprint_core::smarts; use molprint_core::smiles::parse_smiles; let carbonyl = smarts::compile("C=O").unwrap(); let aspirin = parse_smiles("CC(=O)Oc1ccccc1C(=O)O").unwrap(); if smarts::has_match(&aspirin, &carbonyl) { println!("aspirin contains a carbonyl group"); } // Count all matches (returns one mapping per unique anchor atom) let matches = smarts::subgraph_match(&aspirin, &carbonyl); println!("{} carbonyl groups found", matches.len()); }
Architecture
This section describes the internal design of molprint — how each component works and why it was built the way it was.
Crate map
molprint-core
mol/atom.rs — Element enum, Atom struct
mol/bond.rs — BondType enum
mol/graph.rs — MolGraph (petgraph UnGraph wrapper) + MolGraphExt trait
smiles/lexer.rs — tokenizer
smiles/parser.rs — token stream → MolGraph
ring.rs — SSSR via BFS-based Horton algorithm
arom.rs — aromaticity perception
smarts/ — SMARTS query language, VF2-style matching
ast.rs
lexer.rs
matcher.rs
molprint-fp
bitvec.rs — FingerprintBits (Vec<u64> bit vector)
traits.rs — Fingerprinter trait
morgan.rs — Morgan/ECFP iterative hashing
maccs.rs — MACCS-166 structural keys
molprint-search
metrics.rs — Tanimoto, Dice, Cosine
screen.rs — threshold_search, top_k_search (Rayon)
molprint-io
smiles_file.rs — streaming SMILES line reader
sdf.rs — streaming SDF V2000 parser (plain + gzip)
fps.rs — chemfp FPS read/write
molprint-cli
main.rs — clap CLI: fp + search subcommands
Data flow
For fingerprint computation:
SMILES/SDF file
→ molprint-io (SmilesFileReader / SdfReader)
→ parse_smiles → MolGraph
→ Fingerprinter::fingerprint → FingerprintBits
→ FpsWriter → .fps file
For similarity search:
.fps file → FpsReader → Vec<FingerprintBits>
query SMILES → parse_smiles → MolGraph → FingerprintBits
threshold_search / top_k_search (Rayon) → Vec<SearchHit>
Design principles
- Zero unsafe code in library crates — relying on petgraph and the standard library for all unsafe operations.
- No unwrap in library code — all errors propagate through
Resultusingthiserror. - Deterministic fingerprints — hash functions use fixed seeds; no random state that varies between process launches.
- Separation of concerns — the molecular graph knows nothing about fingerprints; fingerprints know nothing about I/O.
Overview
molprint is structured as a Cargo workspace with five crates arranged in a strict dependency chain.
Dependency order
molprint-core
↓
molprint-fp
↓ ↓
molprint-search molprint-io
↓ ↓
molprint-cli
This ordering enforces clean separation: the core graph library has no knowledge of fingerprints, fingerprint algorithms have no knowledge of file formats, and the CLI is the only place where everything comes together.
molprint-core
The foundation of the library. Responsible for:
- Representing molecules as graphs (
MolGraph= petgraphUnGraph<Atom, BondType>) - Parsing SMILES strings into those graphs
- Computing ring membership (SSSR)
- Perceiving aromaticity
- Matching SMARTS patterns against molecule graphs
Nothing in molprint-core is fingerprint-specific.
molprint-fp
Implements fingerprint algorithms on top of MolGraph. Contains:
FingerprintBits— a word-aligned bit vector backed byVec<u64>Fingerprintertrait — the common interface all algorithms implementMorgan— iterative neighborhood hashing (Morgan/ECFP)Maccs166— 166 manually implemented structural key tests
molprint-search
Similarity metrics and parallel screening. Depends only on molprint-fp (for FingerprintBits). Contains:
tanimoto,dice,cosine— bit-vector similarity metrics using POPCNTthreshold_search,top_k_search— Rayon-parallelized screening functions
molprint-io
File format support. Depends on molprint-core (for MolGraph and parse_smiles) and molprint-fp (for FingerprintBits). Contains:
SmilesFileReader— streaming iterator over(id, MolGraph)pairs from SMILES filesSdfReader— streaming iterator over SDF records (plain and gzip)FpsWriter/FpsReader— chemfp-compatible FPS format
molprint-cli
The end-user binary. Ties everything together using clap for argument parsing. Has two subcommands:
fp— read molecules, compute fingerprints, write FPSsearch— read an FPS database, compute a query fingerprint, run parallel search
Testing strategy
- Unit tests live alongside the code they test, in
#[cfg(test)]modules - Integration tests live in
tests/at the workspace root (cli_integration.rs) - A fuzz target (
fuzz/fuzz_targets/fuzz_smiles.rs) exercises the SMILES parser with arbitrary input - Fingerprint accuracy is validated by
crates/molprint-fp/tests/validate_against_rdkit.rs, which compares MACCS and Morgan output against RDKit on a ChEMBL subset
Molecular Graph
The molecular graph is the central data structure in molprint. Every algorithm — ring perception, fingerprinting, SMARTS matching — operates on it.
Type definition
#![allow(unused)] fn main() { // mol/graph.rs pub type MolGraph = UnGraph<Atom, BondType>; }
MolGraph is a type alias for petgraph's UnGraph<Atom, BondType>, an undirected graph where each node carries an Atom and each edge carries a BondType. Hydrogen atoms are not stored as nodes in the graph (they are implicit); instead, Atom::h_count records the total hydrogen count.
The Atom struct
#![allow(unused)] fn main() { pub struct Atom { pub element: Element, // atomic number-backed enum pub charge: i8, // formal charge pub isotope: Option<u16>, pub aromatic: bool, pub h_count: u8, // total H count (implicit + explicit) pub explicit_h: Option<u8>, // Some if bracket atom with H spec pub map_num: u16, // atom-map number (for reaction SMILES) pub in_ring: bool, // set by ring perception pub ring_sizes: Vec<u8>, // one entry per SSSR ring containing this atom } }
h_count is computed after the full graph is constructed by assign_implicit_hydrogens(). For bracket atoms (e.g., [NH3]), the H count is explicit; for organic subset atoms (e.g., N in CC(N)C), it is inferred from the atom's default valence minus its current bond order sum.
Bond types
#![allow(unused)] fn main() { pub enum BondType { Single, Double, Triple, Aromatic, } }
Each BondType has a valence_contribution() method that returns its integer bond order (Single=1, Double=2, Triple=3, Aromatic=1). The aromatic contribution of 1 is intentional: the additional electron from the pi system is accounted for separately in the implicit H calculation.
The MolGraphExt trait
petgraph's UnGraph is a generic graph; it doesn't know about chemistry. The MolGraphExt trait adds chemistry-aware methods:
#![allow(unused)] fn main() { pub trait MolGraphExt { fn num_atoms(&self) -> usize; fn num_bonds(&self) -> usize; fn atom(&self, idx: NodeIndex) -> &Atom; fn atom_mut(&mut self, idx: NodeIndex) -> &mut Atom; fn bond_between(&self, a: NodeIndex, b: NodeIndex) -> Option<BondType>; fn heavy_neighbors(&self, idx: NodeIndex) -> Vec<NodeIndex>; fn degree(&self, idx: NodeIndex) -> usize; fn compute_implicit_h(&self, idx: NodeIndex) -> u8; fn assign_implicit_hydrogens(&mut self); } }
compute_implicit_h is the most subtle method. It handles three cases:
- Bracket atom with H spec (e.g.,
[NH3]): return the explicit count directly. - Non-organic-subset element (e.g.,
[Fe]): return 0. These atoms get no implicit Hs. - Organic subset atom: find the smallest standard valence ≥ current bond order sum; return
valence − bond_sum − |charge|. For aromatic atoms, an extra +1 is added to the effective bond sum to account for the pi-electron contribution.
Implicit hydrogen calculation for aromatic atoms
This was the single most difficult piece to get right. The logic is:
effective_bond_sum = bond_sum + 1 (add 1 for the pi electron)
h_count = valence - effective_bond_sum
But this +1 must only be applied when the atom's valence is not already fully saturated by its sigma bonds. For example, trimethylamine-oxide's nitrogen in an aromatic context is a different case than a pyridine nitrogen. The code checks whether valence >= bond_sum before deciding to apply the adjustment. If the valence is already fully consumed by heavy-atom bonds, no pi adjustment is added, and 0 implicit Hs are assigned.
Node and edge indexing
petgraph assigns NodeIndex and EdgeIndex values sequentially as nodes and edges are added. These indices are stable within a single MolGraph lifetime (no removals happen in molprint). The SMILES parser, ring perception, and all fingerprint algorithms use NodeIndex::new(i) to iterate over atoms.
SMILES Parser
The SMILES parser converts a SMILES string into a MolGraph. It is split into a lexer (tokenizer) and a parser (graph builder), following the standard compiler front-end pattern.
Lexer (smiles/lexer.rs)
The lexer reads the SMILES string character by character and produces a stream of Token values.
#![allow(unused)] fn main() { pub enum Token { Atom { element: Element, aromatic: bool }, BracketAtom { isotope, element, aromatic, h_count, charge, map_num, chirality }, Bond(BondType), BondStereo(char), // '/' or '\' RingClosure(u8), // ring closure digit or %nn OpenBranch, // '(' CloseBranch, // ')' Dot, // '.' } }
Key tokenization rules:
- Organic subset atoms (
B,C,N,O,P,S,F,Cl,Br,I) and their lowercase aromatic counterparts (c,n,o,s,p) are emitted asToken::Atom. - Bracket atoms (
[NH3],[2H],[Fe+2],[13C@@H]) are parsed asToken::BracketAtomwith all fields decoded inline. Two-character element symbols (Cl,Br,Se, etc.) require look-ahead. - Ring closures are
%NNfor two-digit ring numbers (e.g.,%10) or single digits (0–9). - Stereo tokens (
/,\) are consumed but stored only asBondStereo; stereo is not used by fingerprinting algorithms.
Parser (smiles/parser.rs)
The parser takes the token stream and builds a MolGraph incrementally.
State maintained during parsing:
#![allow(unused)] fn main() { let mut current: Option<NodeIndex> = None; let mut branch_stack: Vec<NodeIndex> = Vec::new(); let mut pending_bond: Option<BondType> = None; let mut ring_map: HashMap<u8, (NodeIndex, Option<BondType>)> = HashMap::new(); }
The token dispatch:
| Token | Action |
|---|---|
Atom / BracketAtom | Add node; connect to current with pending_bond or default bond; set current |
Bond | Store in pending_bond |
BondStereo | Ignored (no pending_bond change) |
OpenBranch | Push current onto branch_stack |
CloseBranch | Pop branch_stack; restore current; clear pending_bond |
RingClosure | If ring_id seen before: add closure edge; else store in ring_map |
Dot | Set current = None (start new fragment) |
Default bond inference
When no explicit bond token precedes an atom, the parser infers the bond type:
- If both the previous atom and the new atom are aromatic →
BondType::Aromatic - Otherwise →
BondType::Single
Ring closure bond type
When a ring closure digit appears twice, the bond type is resolved in this priority order:
- The
pending_bondat the closing digit (if set explicitly) - The bond type stored when the ring was opened (if set explicitly)
- Inferred from the aromaticity of the two ring-closure atoms
Error handling
The parser returns Result<MolGraph, ParseError> where ParseError covers:
Lexer(LexerError)— invalid characters in the SMILES stringUnmatchedRing(u8)— a ring closure digit that was never closedUnmatchedBranch— mismatched(or)Empty— empty input stringDanglingBond— a bond token with no subsequent atom
Post-processing
After the token loop, the parser runs two passes:
graph.assign_implicit_hydrogens()— computesh_countfor every atomassign_ring_info(&mut graph)— runs SSSR and setsin_ringandring_sizeson atoms
These are called unconditionally, so every MolGraph produced by parse_smiles has fully-populated ring and hydrogen information.
Public API
#![allow(unused)] fn main() { // smiles/mod.rs pub use parser::parse as parse_smiles; }
Usage:
#![allow(unused)] fn main() { use molprint_core::smiles::parse_smiles; let mol = parse_smiles("CC(=O)O").unwrap(); // acetic acid }
Fuzz testing
The SMILES lexer and parser are covered by a libFuzzer fuzz target at fuzz/fuzz_targets/fuzz_smiles.rs. The target feeds arbitrary byte strings into parse_smiles and checks that it never panics — only returns Ok or Err. This uncovered several edge cases during development, particularly around multi-digit ring closures and unusual bracket atom encodings.
Ring Perception
Ring perception identifies which atoms belong to rings, what size those rings are, and produces the Smallest Set of Smallest Rings (SSSR). This information is used directly by the MACCS fingerprint (many keys are ring-related) and by the Morgan fingerprint (atom invariants include in_ring).
Why SSSR?
The SSSR is the minimal basis for all rings in a molecule. For a molecule with e edges, n nodes, and c connected components, the circuit rank (number of independent rings) is:
r = e − n + c
For benzene: 6 − 6 + 1 = 1. For naphthalene: 11 − 10 + 1 = 2. The SSSR contains exactly r rings.
Alternative approaches (DFS cycle enumeration, relevant cycles) enumerate more rings. For naphthalene, they would find 3 rings (the two 6-membered rings plus the 10-membered outer ring). The SSSR approach matches RDKit's behavior, which is important for MACCS accuracy.
Algorithm
The implementation uses a BFS-based approach sometimes called the Horton algorithm:
Step 1: Generate candidates
For each edge (u, v):
- Temporarily remove the edge
- Find the shortest path from
utovusing BFS (ignoring the removed edge) - If a path exists, the path + the removed edge forms a ring candidate
This produces at most one candidate per edge. Candidates are sorted by ring size (smallest first).
Step 2: Select a basis
Greedily select candidates to form a linearly independent set in GF(2) (the field with two elements). Each ring is represented as a bit vector over edges (1 if the ring contains that edge, 0 otherwise). A candidate is added to the SSSR if its edge vector is linearly independent from all previously selected rings, determined by Gaussian elimination over GF(2).
Selection stops when r rings have been chosen.
Implementation details
#![allow(unused)] fn main() { pub fn find_sssr(mol: &MolGraph) -> Vec<Vec<NodeIndex>> { let target_rings = e.saturating_sub(n) + components; // ... generate candidates via BFS ... // ... select basis via GF(2) Gaussian elimination ... } }
The GF(2) basis is maintained in row-echelon form. reduce_to_basis XORs the candidate vector against basis vectors whose pivot column is set in the candidate. If the result is non-zero, the candidate is independent and is added.
Ring info assignment
After find_sssr, the assign_ring_info function marks atoms:
#![allow(unused)] fn main() { pub fn assign_ring_info(mol: &mut MolGraph) { for ring in &rings { let ring_size = ring.len() as u8; for &atom_idx in ring { mol[atom_idx].in_ring = true; mol[atom_idx].ring_sizes.push(ring_size); } } } }
Junction atoms (atoms shared between two fused rings, like those in naphthalene) get multiple entries in ring_sizes. This allows MACCS keys to correctly identify fused ring systems.
Spiro atom handling
Spiro compounds (two rings sharing exactly one atom) were an early bug source. A naive ring detection approach that excludes atoms already in one ring would fail to detect the second ring in a spiro compound. The BFS candidate generation correctly handles this because it operates on edges, not atoms — the spiro center is visited in candidates for both rings independently.
The test test_spiro_compound covers C1CCC2(CC1)CCCC2 (spiro[5.5]undecane) and asserts find_sssr returns exactly 2 rings.
Connected components
The circuit rank formula requires knowing the number of connected components. Components are counted by a simple BFS that starts from each unvisited node:
#![allow(unused)] fn main() { fn count_components(mol: &MolGraph) -> usize { let mut visited = vec![false; mol.node_count()]; let mut count = 0; for start in mol.node_indices() { if !visited[start.index()] { count += 1; // BFS from start... } } count } }
For disconnected SMILES like CC.OO, components = 2, and the circuit rank is 0 (no rings), which is correct.
SMARTS Engine
The SMARTS engine supports substructure queries against MolGraph. It is used internally by the MACCS-166 fingerprint for structural key evaluation and is also available as a public API.
Supported features
Atom primitives (inside […]):
| Primitive | Meaning |
|---|---|
#n | Atomic number (e.g., [#6] = carbon) |
C, N, O, … | Aliphatic element (organic subset) |
c, n, o, … | Aromatic element |
a / A | Any aromatic / any aliphatic atom |
Hn | Hydrogen count (H = 1, H2 = 2) |
+n / -n | Formal charge |
R | In any ring; R0 = not in ring |
Dn | Heavy-atom degree |
* | Any atom |
Logical operators: ! (NOT), & (high-priority AND), ; (low-priority AND), , (OR).
Bond primitives: - = # : ~ (any), and unspecified (matches single or aromatic).
Not supported in v1: recursive SMARTS $(), stereo @/@@, component-level grouping.
Architecture
The engine has three layers:
- AST (
smarts/ast.rs) — data structures for atom and bond expressions - Lexer/Parser (
smarts/lexer.rs) — parses a SMARTS string into aSmartsPattern - Matcher (
smarts/matcher.rs) — VF2-style subgraph isomorphism
AST
#![allow(unused)] fn main() { pub enum AtomPrimitive { AtomicNum(u8), Aromatic(bool), AnyAromatic, AnyAliphatic, HCount(u8), Charge(i8), InRing, NotInRing, RingSize(u8), Degree(u8), Any, } pub enum AtomExpr { Primitive(AtomPrimitive), Not(Box<AtomExpr>), And(Box<AtomExpr>, Box<AtomExpr>), Or(Box<AtomExpr>, Box<AtomExpr>), } }
SmartsBond stores the bond constraint (specific type, any, or default). SmartsPattern is a graph of SmartsAtom nodes and SmartsBond edges.
Matching
The matcher uses a recursive VF2-style algorithm. It maintains a partial mapping from pattern atoms to molecule atoms and extends it one atom at a time.
#![allow(unused)] fn main() { pub fn has_match(mol: &MolGraph, pattern: &SmartsPattern) -> bool; pub fn subgraph_match(mol: &MolGraph, pattern: &SmartsPattern) -> Vec<Vec<(usize, usize)>>; }
has_match returns early on the first valid mapping. subgraph_match collects all mappings.
The atom compatibility check evaluates the AtomExpr tree recursively:
#![allow(unused)] fn main() { fn atom_matches(mol_atom: &Atom, expr: &AtomExpr) -> bool { match expr { AtomExpr::Primitive(p) => primitive_matches(mol_atom, p), AtomExpr::Not(inner) => !atom_matches(mol_atom, inner), AtomExpr::And(a, b) => atom_matches(mol_atom, a) && atom_matches(mol_atom, b), AtomExpr::Or(a, b) => atom_matches(mol_atom, a) || atom_matches(mol_atom, b), } } }
Usage
#![allow(unused)] fn main() { use molprint_core::smarts; use molprint_core::smiles::parse_smiles; // Carbon bonded to nitrogen (any bond type) let pattern = smarts::compile("[#6]~[#7]").unwrap(); let pyridine = parse_smiles("c1ccncc1").unwrap(); assert!(smarts::has_match(&pyridine, &pattern)); // Hydroxyl group let oh = smarts::compile("[OH]").unwrap(); let ethanol = parse_smiles("CCO").unwrap(); assert!(smarts::has_match(ðanol, &oh)); // Count amine nitrogens let amine = smarts::compile("[NH2]").unwrap(); let diamine = parse_smiles("NCCN").unwrap(); let matches = smarts::subgraph_match(&diamine, &amine); assert_eq!(matches.len(), 2); }
Compile errors
smarts::compile returns Result<SmartsPattern, SmartsError>:
#![allow(unused)] fn main() { pub enum SmartsError { UnexpectedChar(char, usize), UnterminatedBracket(usize), UnmatchedRing(u8), UnmatchedParen, Empty, InvalidAtomicNum(String), } }
Fingerprints
molprint implements two fingerprint types: Morgan (circular/ECFP) and MACCS-166 structural keys.
FingerprintBits
All fingerprints are represented as FingerprintBits, a word-aligned bit vector:
#![allow(unused)] fn main() { pub struct FingerprintBits { words: Vec<u64>, nbits: usize, } }
Operations are implemented at the word level using Rust's u64::count_ones() (which compiles to a single POPCNT instruction on x86/ARM). Bitwise AND and OR iterate over word pairs, making Tanimoto computation efficient for any bit width.
The hex encoding used in FPS files stores bytes least-significant first, matching chemfp's convention.
The Fingerprinter trait
#![allow(unused)] fn main() { pub trait Fingerprinter { fn fingerprint(&self, mol: &MolGraph) -> FingerprintBits; fn name(&self) -> &str; fn nbits(&self) -> usize; } }
Both Morgan and Maccs166 implement this trait. The CLI uses Box<dyn Fingerprinter> to select the algorithm at runtime.
Morgan / ECFP
Morgan fingerprints encode circular atom environments. For each atom, an environment of radius r includes everything within r bonds.
Atom invariants (radius 0)
Each atom gets an initial hash from its local properties:
#![allow(unused)] fn main() { fn atom_invariant(&self, mol: &MolGraph, idx: NodeIndex) -> u32 { let atom = mol.atom(idx); let packed = ((atom.element.atomic_number() as u64) << 48) | ((mol.degree(idx) as u64) << 32) | ((atom.h_count as u64) << 16) | (((atom.charge as i64 + 128) as u64) << 8) | (atom.in_ring as u64); hash_u64(packed) as u32 } }
The properties encoded are: atomic number, heavy-atom degree, hydrogen count, formal charge, ring membership. This matches the standard ECFP invariants.
Iteration
At each radius step, each atom's hash is updated based on its neighbors:
#![allow(unused)] fn main() { fn iterate(&self, mol: &MolGraph, prev_hashes: &[u32], _radius: u32) -> Vec<u32> { for i in 0..n { let mut neighbors: Vec<(u8, u32)> = mol.edges(idx) .map(|e| (bond_order, prev_hashes[neighbor])) .collect(); neighbors.sort_unstable(); // deterministic ordering new_hashes[i] = combine_hashes(prev_hashes[i] as u64, &neighbors); } } }
Sorting neighbors before hashing is critical for canonical fingerprints — without it, the result would depend on the order atoms were added to the graph.
Hash function
The hash uses a FNV-1a-inspired 64-bit hash for atom invariants, and boost-style hash_combine for neighborhood accumulation:
#![allow(unused)] fn main() { fn hash_combine(h: u64, val: u64) -> u64 { h ^ val.wrapping_add(0x9e3779b97f4a7c15) .wrapping_add(h << 6) .wrapping_add(h >> 2) } }
Important: AHasher::default() uses a randomized seed and cannot be used here. All hashing must be deterministic across process launches.
Bit folding
All hashes from radius 0 through radius are collected into a HashSet<u32>, then each is folded into the bit vector:
#![allow(unused)] fn main() { for h in all_hashes { fp.set(h as usize % self.nbits); } }
Using a HashSet means each unique environment contributes at most one bit position, even if it appears at multiple atoms. This is equivalent to RDKit's default behavior.
Configurations
| Name | Morgan::new(r, bits) |
|---|---|
| ECFP2 | Morgan::new(1, 2048) |
| ECFP4 | Morgan::new(2, 2048) |
| ECFP6 | Morgan::new(3, 2048) |
MACCS-166
MACCS-166 is a fixed set of 166 structural keys defined as patterns (originally as SMARTS queries in RDKit). Each key tests for the presence of a specific structural feature.
The keys are implemented manually in maccs.rs rather than via runtime SMARTS evaluation. This provides:
- Exact control over counting semantics (
uniquify=True, min_count=1matching RDKit) - Better performance (no SMARTS compilation overhead per fingerprint)
- Direct bit-level RDKit compatibility
Implementation structure
Each bit is computed by a dedicated test. The tests fall into several categories:
- Element presence: "does the molecule contain a sulfur atom?"
- Ring tests: "is there a 6-membered ring?", "is there an aromatic ring?"
- Functional group patterns: carbonyl, hydroxyl, amine, nitro, etc.
- Chain patterns: specific bond sequences, e.g., N-C=O (amide)
- Count-based keys: "more than 2 ring systems", etc.
A helper function has_simple_path supports keys that require checking for specific atom sequences:
#![allow(unused)] fn main() { fn has_simple_path( mol: &MolGraph, start: NodeIndex, length: usize, target_pred: &dyn Fn(NodeIndex) -> bool, ) -> bool }
For molecules with ≤ 64 atoms (the vast majority), this uses a single u64 bitmask for the visited set, avoiding heap allocation.
Accuracy
MACCS-166 achieves 100% accuracy on the ChEMBL 10k validation set against RDKit. Getting there required careful attention to:
- uniquify=True: RDKit's default MACCS implementation uses
uniquify=Truein its SMARTS matching, meaning each atom in the molecule is only used once per match. Some keys set the bit if the count is ≥ some threshold. - Aromatic implicit H: aromatic nitrogen (pyridine vs. pyrrole) has different H counts, and several MACCS keys test for
[nH]specifically. - Bit 0: always 0 (undefined in the original MACCS spec). The bitvector has 167 bits, with bit 0 unused.
Fingerprinter implementation
#![allow(unused)] fn main() { impl Fingerprinter for Maccs166 { fn nbits(&self) -> usize { 167 } fn name(&self) -> &str { "MACCS166" } fn fingerprint(&self, mol: &MolGraph) -> FingerprintBits { let mut fp = FingerprintBits::new(167); // bit 1: isotope // bit 2: ... // ... fp } } }
Similarity & Screening
The molprint-search crate provides similarity metrics and parallel screening functions.
Similarity metrics
All three metrics operate on FingerprintBits and use POPCNT for efficiency.
Tanimoto (Jaccard)
The standard similarity metric in cheminformatics:
T(A, B) = |A ∩ B| / |A ∪ B| = popcount(A & B) / popcount(A | B)
#![allow(unused)] fn main() { pub fn tanimoto(a: &FingerprintBits, b: &FingerprintBits) -> f64 { let intersection = a.and(b).count_ones() as f64; let union = a.or(b).count_ones() as f64; if union == 0.0 { 0.0 } else { intersection / union } } }
Returns 0.0 for two all-zero fingerprints (rather than NaN). Returns 1.0 when the fingerprints are identical.
A common threshold for "similar" compounds is Tanimoto ≥ 0.7 for ECFP4.
Dice
D(A, B) = 2 * |A ∩ B| / (|A| + |B|)
Dice is more lenient than Tanimoto for compounds of different sizes, because the denominator uses the sum of individual set sizes rather than the union.
Cosine
C(A, B) = |A ∩ B| / sqrt(|A| * |B|)
Cosine similarity treats the fingerprint as a vector and measures the angle between them. It is less commonly used in cheminformatics than Tanimoto but useful for some applications.
Performance
At 36 ns per Tanimoto comparison (2048-bit), the bottleneck is memory bandwidth, not computation. The u64 word loop uses Rust's u64::count_ones() which compiles to a single POPCNT instruction. For future work, explicit SIMD (AVX2 or NEON) could batch multiple fingerprint comparisons per loop iteration.
Parallel screening
threshold_search
Returns all database fingerprints with Tanimoto ≥ threshold, sorted by descending similarity:
#![allow(unused)] fn main() { pub fn threshold_search( query: &FingerprintBits, database: &[FingerprintBits], threshold: f64, ) -> Vec<SearchHit> }
Uses Rayon's par_iter() for parallel evaluation. Each thread processes a chunk of the database independently, then results are merged and sorted.
top_k_search
Returns the k most similar fingerprints:
#![allow(unused)] fn main() { pub fn top_k_search( query: &FingerprintBits, database: &[FingerprintBits], k: usize, ) -> Vec<SearchHit> }
Computes similarity for every database entry (in parallel), sorts, then truncates to k. For very large databases where k is much smaller than the total size, a heap-based algorithm would be more efficient — this is the simplest correct implementation.
SearchHit
#![allow(unused)] fn main() { pub struct SearchHit { pub index: usize, // index into the database slice pub similarity: f64, } }
The index into the original database slice lets the caller look up the molecule ID.
Screening performance
On Apple M-series, screening a 100k compound database takes ~826 µs per query (121M fingerprint comparisons/second). This is primarily memory-bound. At 2048 bits = 256 bytes per fingerprint, a 100k database is 25 MB — fits in L3 cache on most modern CPUs, which explains why Rayon parallelism helps.
For databases that fit in RAM (up to ~10M compounds at 2048-bit), molprint keeps everything in a Vec<FingerprintBits> without indexing structures. For larger scales, pre-filtering using popcount bounds (the Tanimoto lower bound from bit count differences) would reduce the number of full comparisons needed.
Usage example
#![allow(unused)] fn main() { use molprint_search::screen::{threshold_search, top_k_search}; // database is Vec<FingerprintBits>, loaded from an FPS file // query_fp is FingerprintBits for the query molecule // All compounds with Tanimoto >= 0.6 let hits = threshold_search(&query_fp, &database, 0.6); for hit in &hits { println!("index={} sim={:.4}", hit.index, hit.similarity); } // Top 10 most similar let top10 = top_k_search(&query_fp, &database, 10); }
File I/O
The molprint-io crate handles reading molecules from SMILES and SDF files, and reading/writing fingerprints in the FPS format.
SMILES files (smiles_file.rs)
SmilesFileReader is an iterator over (id, MolGraph) pairs from a SMILES file:
#![allow(unused)] fn main() { use std::fs::File; use molprint_io::smiles_file::SmilesFileReader; let f = File::open("molecules.smi").unwrap(); for (id, mol) in SmilesFileReader::new(f) { println!("{}: {} atoms", id, mol.node_count()); } }
Input format
Accepts two formats per line:
SMILES<tab>ID— SMILES and ID separated by a tabSMILES— bare SMILES; a sequential ID (mol1,mol2, …) is assigned
Lines starting with # are skipped as comments. Empty lines are skipped.
Error handling
Lines where the SMILES fails to parse are silently skipped in the iterator (consistent with how large SMILES files from external sources are handled — one bad record shouldn't abort processing 10 million molecules). The CLI logs skipped records to stderr.
SDF files (sdf.rs)
open_sdf returns a streaming iterator over SDF records:
#![allow(unused)] fn main() { use molprint_io::sdf::open_sdf; for result in open_sdf("molecules.sdf").unwrap() { match result { Ok(record) => { println!("{}: {} atoms", record.name, record.mol.node_count()); if let Some(mw) = record.properties.get("MW") { println!(" MW = {}", mw); } } Err(e) => eprintln!("Skipping record: {}", e), } } }
SDF record
#![allow(unused)] fn main() { pub struct SdfRecord { pub name: String, // first line of the mol block pub mol: MolGraph, // parsed molecular graph pub properties: HashMap<String, String>, // > <key> / value pairs } }
Gzip support
open_sdf detects .sdf.gz by file extension and transparently wraps the file in a flate2::read::GzDecoder. No separate API is needed:
#![allow(unused)] fn main() { // Both work the same way open_sdf("molecules.sdf") open_sdf("molecules.sdf.gz") }
V2000 parsing
The parser handles MDL V2000 format mol blocks. The atom table and bond table are read line by line. Atom coordinates (x, y, z) are parsed but discarded — molprint is a 2D/topological library and doesn't use 3D coordinates.
Stereo bond wedge/dash indicators in the bond table are parsed and ignored. Properties outside the mol block (the > <name> / value pairs before $$$$) are collected into the properties map.
FPS files (fps.rs)
The FPS format is chemfp's standard fingerprint exchange format. molprint implements a subset that is read/write compatible with chemfp.
Writing
#![allow(unused)] fn main() { use molprint_io::fps::FpsWriter; use std::io::BufWriter; let out = File::create("output.fps").unwrap(); let mut writer = FpsWriter::new(BufWriter::new(out), 2048, "Morgan").unwrap(); writer.write_meta("radius", "2").unwrap(); writer.write_fingerprint("mol1", &fp).unwrap(); }
The writer outputs the FPS v1 header on construction:
#FPS1
#type=Morgan nbits=2048
#software=molprint/0.1.0
Additional metadata lines can be written with write_meta before any fingerprint records. The CLI uses this to store the Morgan radius so that search can reconstruct the exact fingerprinter from the database file.
Each fingerprint record is: <hex>\t<id>\n.
Reading
#![allow(unused)] fn main() { use molprint_io::fps::FpsReader; let mut reader = FpsReader::new(File::open("output.fps").unwrap()); let nbits = reader.read_header().unwrap(); // reads header, returns nbits // Access metadata from header let radius = reader.meta("radius"); // Option<&str> let fp_type = reader.fp_type(); // &str for item in &mut reader { let (id, fp) = item.unwrap(); } }
read_header parses all #key=value comment lines at the top of the file. The reader handles the case where nbits is not in the header (infers from hex string length). If read_header is not called before iterating, it is called automatically on the first next() call.
FPS format specification
#FPS1 ← magic header
#type=<name> nbits=<n> ← fingerprint type
#<key>=<value> ← optional metadata
<hex_fingerprint>\t<molecule_id> ← one record per line
Hexadecimal encoding: bytes are written least-significant-first (little-endian), matching chemfp. For a 64-bit fingerprint with only bit 0 set, the hex is 0100000000000000 (8 bytes = 16 hex chars).
Design Decisions
This page explains the key architectural choices made in molprint and the reasoning behind them.
petgraph vs. a custom graph
The molecular graph uses petgraph's UnGraph rather than a custom adjacency list or matrix.
Why petgraph: Writing a correct, fast, well-tested graph data structure is a large project in itself. petgraph is mature, widely used in the Rust ecosystem, and provides exactly the operations we need: node/edge iteration, neighbor traversal, edge lookup by endpoint pair, and NodeIndex stability. The BFS in ring perception, the VF2 matching in SMARTS, and the neighborhood iteration in Morgan all map cleanly onto petgraph's API.
The cost: petgraph's UnGraph stores adjacency as a list of (source, target, edge) triples, which is not the most cache-friendly layout for large-scale operations. But for molecules (typically < 100 atoms), the in-memory layout is small enough to fit in cache regardless, and the performance difference vs. a custom flat structure is negligible.
The alternative: A flat array of (u8, u8, BondType) triples indexed by a CSR (compressed sparse row) adjacency list would be more cache-friendly for bulk operations. This would matter most for the SMARTS matcher traversing large molecules, but for typical drug-like molecules the difference is not measurable.
Manual MACCS-166 vs. runtime SMARTS
MACCS-166 keys are defined in RDKit as SMARTS patterns. The obvious implementation is to compile those 166 SMARTS patterns at startup and evaluate each one per molecule.
molprint instead implements the 166 keys as hand-written Rust code.
Why manual: Three reasons.
-
Exact RDKit semantics: RDKit's MACCS implementation uses specific counting logic (
uniquify=True,min_countthresholds per key) that is not the same as simply checkinghas_match. The counting semantics are easier to express precisely in code than to replicate through SMARTS configuration. -
Performance: Compiling 166 SMARTS patterns and running VF2 matching for each molecule adds significant constant overhead per molecule. Manual code eliminates the pattern compilation step and allows simpler, more direct logic for common tests (element presence, ring sizes) without the general-purpose SMARTS machinery.
-
Accuracy as first-class concern: Because the keys are hand-written, each one can be individually tested against RDKit output. The validation test suite (
validate_against_rdkit.rs) checks all 166 bits on a ChEMBL 10k subset. Catching a semantic error in a hand-written key is straightforward; tracking down why a SMARTS-based implementation differs from RDKit's behavior would be harder.
Vec<u64> bitvector
Fingerprints are stored as Vec<u64> with a fixed bit count. The alternatives were:
[u64; N]const generic array: Would require const generic parameters through the entire codebase and cannot represent runtime-configurable bit widths (Morgan supports 512–4096 bits).bitveccrate: A well-designed library, but adds a dependency and has more API surface than needed. The customFingerprintBitshas exactly the operations needed (set, get, count_ones, and/xor, hex encode/decode) with no extra overhead.Vec<u8>orVec<u32>:u64matches the natural word size of modern CPUs, andu64::count_ones()compiles to a singlePOPCNTinstruction. Usingu8oru32would require more iterations per Tanimoto calculation.
Rayon for parallelism
The screening functions use Rayon's par_iter() rather than manual thread management or async.
Why Rayon: threshold_search over a database of fingerprints is embarrassingly parallel — each comparison is independent. Rayon's work-stealing thread pool handles load balancing automatically, and the API change from iter() to par_iter() is minimal. Manual std::thread management would require partitioning the database into chunks, handling the results from each thread, and managing thread lifetimes — all complexity that Rayon handles for free.
Why not async: Similarity search is CPU-bound, not I/O-bound. Async would add complexity without benefiting throughput.
thiserror for error types
All public error types use thiserror::Error derive macros.
Why thiserror: It generates std::error::Error and Display implementations from the error type definition, which would otherwise be boilerplate. The derive-based approach enforces consistent error message formatting and makes it easy to chain errors from lower-level crates (e.g., LexerError wraps into ParseError with #[from]).
Alternative: anyhow is common in applications but is designed for error propagation rather than typed error handling. Library code should use typed errors so callers can match on specific variants; anyhow boxes the error, losing the type.
Deterministic fingerprints
Morgan's hash functions use fixed constants (0x9e3779b97f4a7c15 from the golden ratio, and a fixed FNV-1a seed 0xcbf29ce484222325). This is explicitly documented in the code with a comment: AHasher::default() uses a randomized seed and must NOT be used here.
Why this matters: If the hash function is randomized, the same SMILES produces a different fingerprint in each process invocation. This makes it impossible to compare fingerprints computed in different runs (e.g., a database built yesterday vs. a query today). Deterministic fingerprints are a correctness requirement, not just a performance preference.
Implicit-only hydrogen representation
Hydrogen atoms are not stored as nodes in MolGraph. They are counted in Atom::h_count but not represented as graph nodes.
Why: Most cheminformatics algorithms (ring perception, fingerprinting, SMARTS matching) operate on the heavy-atom graph. Storing explicit H nodes would double the graph size for typical organic molecules, slow down neighbor iteration, and require special-casing Hs in every algorithm.
The exception: Bracket atoms like [2H] (deuterium) can appear in SMILES and do get added as nodes, because they carry isotope information that may be structurally meaningful. The assign_implicit_hydrogens function counts these explicit H neighbors and includes them in h_count.
Learnings
These are honest technical insights from building molprint — things that took longer than expected, bugs that required careful thought to fix, and places where the problem was harder than it looked.
RDKit MACCS semantics: uniquify=True and min_count
The MACCS keys look straightforward on paper — 166 SMARTS patterns, one bit per pattern. The bit is set if the pattern has at least one match in the molecule.
But "at least one match" hides complexity. RDKit's GetMACCSKeysFingerprint uses uniquify=True in its SMARTS matching, which means each atom can only be used once per match. For some keys, the bit is set only if there are at least N matches (some keys test for "more than 1 ring" or "more than 2 atoms of type X").
The practical consequence: if you implement MACCS as "compile pattern, call has_match, set bit", you'll get the wrong answer for about 10–15 keys on any reasonably complex molecule. The correct implementation requires knowing the intended count semantics for each key, which comes from reading the RDKit source code for MACCSkeys.py directly.
Example: MACCS key 166 (?) is always 0 in RDKit's implementation. Key 0 is also always 0 (it's a placeholder). Several keys in the 140–150 range test for the presence of multiple ring systems or specific ring counts, not just "does this pattern appear at all".
Running the validation suite (validate_against_rdkit.rs) against a ChEMBL subset was the only reliable way to catch all of these edge cases.
Aromatic implicit hydrogen calculation
Computing implicit H counts for aromatic atoms is the most subtle part of the SMILES parser.
Consider pyridine (c1ccncc1): the nitrogen is aromatic. It has two aromatic bonds (bond order sum = 2 from the sigma bonds). Its normal valence is 3. So naively: 3 - 2 = 1 H. But pyridine nitrogen has zero hydrogens.
The issue: aromatic bonds each contribute 1 to the sigma bond sum, but the pi system takes up the remaining valence capacity. The standard formula adds +1 to the effective bond sum to account for the pi electron: effective_sum = sigma_sum + 1 = 3, then h = 3 - 3 = 0. Correct.
Now consider pyrrole nitrogen ([nH]): it contributes 2 pi electrons to the aromatic system (it's the nitrogen with the lone pair). Its sigma bonds are 2 (two ring C-N bonds). Effective sum = 3, valence = 3, h = 0. But we wrote [nH] explicitly — the bracket forces H=1.
The problem emerges for organic-subset aromatic nitrogen without explicit bracket notation. In practice, most aromatic nitrogens in the organic subset that need an H are written as [nH] in valid SMILES. But the implicit H formula still needs to be correct for cases like n in a ring where no H is present.
The final rule: apply the +1 pi adjustment only when the atom has available valence remaining after its sigma bonds. If the sigma bonds already consume all valence capacity (bond_sum == some_valence), don't add +1 — the atom is already fully bonded and gets 0 implicit H. This handles the trimethylamine-oxide case and similar edge cases.
Spiro atom ring-bond exclusion bugs
The first SSSR implementation had a subtle bug with spiro compounds. A spiro compound has two rings sharing exactly one atom (the spiro center). The SSSR should report two rings.
An earlier version of the candidate selection excluded edges whose endpoints were already both "in a ring". The reasoning was: if both atoms are already in a ring, the edge is redundant. This is wrong for spiro compounds — the spiro center is in one ring, but the edges connecting it to the second ring are not redundant.
The fix was to generate candidates based purely on BFS reachability (for each edge, can you get from one endpoint to the other without using that edge?) and rely entirely on the GF(2) independence test to filter out duplicates. This is slower in theory (more candidates) but correct for all cases including spiro, fused, and bridged ring systems.
The specific test that caught this: C1CCC2(CC1)CCCC2 (spiro[5.5]undecane) should have 2 rings. The broken version reported 1.
SSSR vs. DFS cycle enumeration
An early temptation was to use DFS to find all simple cycles and then deduplicate. This is conceptually simple but produces too many rings for fused systems.
For naphthalene (c1ccc2ccccc2c1), DFS enumeration finds 3 simple cycles: the two 6-membered rings and the 10-membered outer ring. The SSSR finds 2 (just the two 6-membered rings). RDKit uses SSSR, so molprint must match.
The GF(2) basis selection elegantly handles this: the 10-membered outer ring of naphthalene is linearly dependent on the two 6-membered rings in the edge-vector basis, so it is automatically excluded.
Hash function must be deterministic
During early development, the Morgan fingerprint implementation used HashMap for collecting environment hashes. HashMap in Rust uses AHasher by default since Rust 1.36, and AHasher::default() uses a process-level random seed for hash-flooding resistance.
This made fingerprints non-deterministic across processes. A database built in one run couldn't be searched in another because the bit positions of all hash-folded environments were different each time.
The fix was to replace all uses of HashSet<u32> (for deduplication) with BTreeSet or, better, to use a custom hash function with a fixed seed. The hash functions in morgan.rs now use FNV-1a constants and the golden-ratio-based combine from boost::hash_combine — both fully deterministic.
The rule: never use HashMap/HashSet/AHasher in any code path that contributes to fingerprint output. Only use them for intermediate data structures where the output ordering doesn't matter (and then only where you've verified it doesn't affect the final bit vector).
FPS byte ordering: least-significant byte first
The FPS format stores fingerprints as hexadecimal with bytes in little-endian order (least-significant byte first). This is chemfp's convention, and getting it wrong means fingerprints written by molprint can't be read by chemfp, and vice versa.
For a 64-bit fingerprint with only bit 0 set:
- The internal
u64value is0x0000000000000001 - The little-endian byte representation is
01 00 00 00 00 00 00 00 - The hex encoding is
0100000000000000
A naive implementation might write the hex of the u64 as a big-endian integer: 0000000000000001. That's wrong for FPS interoperability.
The to_hex and from_hex methods in FingerprintBits iterate over bytes in little-endian order explicitly to get this right.
The pending_bond = None on CloseBranch
The SMILES parser has a subtlety: when a CloseBranch token is encountered, the pending_bond must be cleared. Consider CC(=O)C:
C— atom 0C— atom 1, bonded to atom 0(— push atom 1 onto branch stack=— set pending_bond = DoubleO— atom 2, bonded to atom 1 with Double bond)— pop branch stack, restore current = atom 1, clear pending_bondC— atom 3, bonded to atom 1 with Single bond (not Double!)
Without clearing pending_bond on ), the final carbon would incorrectly receive a double bond. This was caught by the acetic acid test (CC(=O)O).