Architecture
This section describes the internal design of molprint — how each component works and why it was built the way it was.
Crate map
molprint-core
  mol/atom.rs — Element enum, Atom struct
  mol/bond.rs — BondType enum
  mol/graph.rs — MolGraph (petgraph UnGraph wrapper) + MolGraphExt trait
  smiles/lexer.rs — tokenizer
  smiles/parser.rs — token stream → MolGraph
  ring.rs — SSSR via BFS-based Horton algorithm
  arom.rs — aromaticity perception
  smarts/ — SMARTS query language, VF2-style matching
    ast.rs
    lexer.rs
    matcher.rs
molprint-fp
  bitvec.rs — FingerprintBits (Vec<u64> bit vector)
  traits.rs — Fingerprinter trait
  morgan.rs — Morgan/ECFP iterative hashing
  maccs.rs — MACCS-166 structural keys
molprint-search
  metrics.rs — Tanimoto, Dice, Cosine
  screen.rs — threshold_search, top_k_search (Rayon)
molprint-io
  smiles_file.rs — streaming SMILES line reader
  sdf.rs — streaming SDF V2000 parser (plain + gzip)
  fps.rs — chemfp FPS read/write
molprint-cli
  main.rs — clap CLI: fp + search subcommands
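To make the bitvec.rs entry concrete, here is a minimal sketch of a bit vector packed into 64-bit words. Only the name FingerprintBits and the Vec<u64> backing store come from the crate map; the field names and the methods new, set, get, and count_ones are assumptions for illustration and may not match molprint's actual API.

```rust
/// Fixed-length bit vector packed into 64-bit words.
/// Hypothetical sketch: field and method names are illustrative only.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FingerprintBits {
    words: Vec<u64>,
    n_bits: usize,
}

impl FingerprintBits {
    /// Create an all-zero fingerprint with `n_bits` bits.
    pub fn new(n_bits: usize) -> Self {
        // Round up so a partial final word still gets storage.
        let n_words = (n_bits + 63) / 64;
        Self { words: vec![0; n_words], n_bits }
    }

    /// Set bit `i`. Out-of-range indices wrap modulo the length (folding).
    pub fn set(&mut self, i: usize) {
        if self.n_bits == 0 {
            return; // nothing to set in an empty fingerprint
        }
        let i = i % self.n_bits;
        self.words[i / 64] |= 1u64 << (i % 64);
    }

    /// Return whether bit `i` is set; out-of-range indices read as false.
    pub fn get(&self, i: usize) -> bool {
        i < self.n_bits && (self.words[i / 64] >> (i % 64)) & 1 == 1
    }

    /// Number of set bits, computed one word at a time.
    pub fn count_ones(&self) -> u32 {
        self.words.iter().map(|w| w.count_ones()).sum()
    }
}
```

Packing into u64 words means population counts and bitwise AND/OR run one word at a time, which is what the similarity metrics in molprint-search need.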
Data flow
For fingerprint computation:
SMILES/SDF file
→ molprint-io (SmilesFileReader / SdfReader)
→ parse_smiles → MolGraph
→ Fingerprinter::fingerprint → FingerprintBits
→ FpsWriter → .fps file
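The fingerprint pipeline above can be sketched with standard-library types only. Everything here is a stand-in: MolGraph is reduced to a list of atomic numbers, FingerprintBits to a plain Vec<u64>, the trait signature is a guess based on the Fingerprinter::fingerprint step, and ElementFp, write_fps_line, and run are hypothetical names. The record layout (hex-encoded bytes, a tab, then the identifier) assumes the chemfp FPS convention; header lines are omitted.

```rust
use std::io::{self, Write};

/// Stand-in for molprint's MolGraph: just a list of atomic numbers.
pub struct MolGraph {
    pub atomic_numbers: Vec<u8>,
}

/// Stand-in for FingerprintBits: bit i lives in word i / 64.
pub type FingerprintBits = Vec<u64>;

/// Assumed shape of the Fingerprinter trait (the signature is a guess).
pub trait Fingerprinter {
    fn fingerprint(&self, mol: &MolGraph) -> FingerprintBits;
}

/// Toy fingerprinter: one bit per distinct element. Not Morgan or MACCS.
pub struct ElementFp {
    /// Number of 64-bit words; must be greater than zero.
    pub n_words: usize,
}

impl Fingerprinter for ElementFp {
    fn fingerprint(&self, mol: &MolGraph) -> FingerprintBits {
        let mut bits = vec![0u64; self.n_words];
        let n_bits = self.n_words * 64;
        for &z in &mol.atomic_numbers {
            let i = z as usize % n_bits;
            bits[i / 64] |= 1u64 << (i % 64);
        }
        bits
    }
}

/// Write one FPS-style record: hex-encoded bytes, a tab, then the identifier.
pub fn write_fps_line<W: Write>(out: &mut W, fp: &FingerprintBits, id: &str) -> io::Result<()> {
    for word in fp {
        // Little-endian bytes, so bit 0 is the lowest bit of the first byte.
        for byte in word.to_le_bytes() {
            write!(out, "{:02x}", byte)?;
        }
    }
    writeln!(out, "\t{}", id)
}

/// Fingerprint a batch and write it out; works with any Fingerprinter.
pub fn run<F: Fingerprinter, W: Write>(
    fper: &F,
    mols: &[(String, MolGraph)],
    out: &mut W,
) -> io::Result<()> {
    for (id, mol) in mols {
        write_fps_line(out, &fper.fingerprint(mol), id)?;
    }
    Ok(())
}
```

Because run is generic over the trait, swapping one fingerprint type for another does not touch the I/O code.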
For similarity search:
.fps file → FpsReader → Vec<FingerprintBits>
query SMILES → parse_smiles → MolGraph → FingerprintBits
threshold_search / top_k_search (Rayon) → Vec<SearchHit>
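A sequential sketch of the search step follows. The function names threshold_search and top_k_search and the type name SearchHit come from the flow above; the signatures, the SearchHit fields, and the use of raw u64 slices in place of FingerprintBits are assumptions. molprint runs these scans with Rayon, whereas this version uses ordinary iterators so that it needs no external crates.

```rust
/// Hypothetical hit record; the field names are assumptions.
#[derive(Debug, Clone, PartialEq)]
pub struct SearchHit {
    pub index: usize,
    pub score: f64,
}

/// Tanimoto similarity: |A AND B| / |A OR B| over set bits.
/// Assumes both fingerprints have the same number of words.
pub fn tanimoto(a: &[u64], b: &[u64]) -> f64 {
    let inter: u32 = a.iter().zip(b).map(|(x, y)| (x & y).count_ones()).sum();
    let union: u32 = a.iter().zip(b).map(|(x, y)| (x | y).count_ones()).sum();
    // Two empty fingerprints score 0.0 here; that is a convention, not a rule.
    if union == 0 { 0.0 } else { f64::from(inter) / f64::from(union) }
}

/// Return every database entry whose score is at least `threshold`.
pub fn threshold_search(query: &[u64], db: &[Vec<u64>], threshold: f64) -> Vec<SearchHit> {
    db.iter()
        .enumerate()
        .map(|(index, fp)| SearchHit { index, score: tanimoto(query, fp) })
        .filter(|hit| hit.score >= threshold)
        .collect()
}

/// Return the `k` best-scoring entries, highest score first.
pub fn top_k_search(query: &[u64], db: &[Vec<u64>], k: usize) -> Vec<SearchHit> {
    let mut hits = threshold_search(query, db, 0.0);
    // total_cmp gives a total order on f64; swapping a and b sorts descending.
    hits.sort_by(|a, b| b.score.total_cmp(&a.score));
    hits.truncate(k);
    hits
}
```

Each score depends only on the query and one database entry, so the scan has no shared mutable state and parallelizes naturally across entries.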
Design principles
- Zero unsafe code in library crates — any unsafe operations are confined to petgraph and the standard library.
- No unwrap in library code — all errors propagate through Result using thiserror.
- Deterministic fingerprints — hash functions use fixed seeds; no random state that varies between process launches.
- Separation of concerns — the molecular graph knows nothing about fingerprints; fingerprints know nothing about I/O.
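Two of these principles are easy to illustrate together: errors returned as Result values instead of panics, and hashing that gives the same answer in every process. The sketch below is an assumption-laden illustration, not molprint's code. FpError and bit_index are hypothetical names, the hand-written Display and Error impls stand in for what thiserror would derive, and FNV-1a is used only as a well-known hash with fixed constants; the document does not say which hash molprint uses.

```rust
use std::fmt;

/// Hypothetical error type. With thiserror, the Display and Error impls
/// below would be generated by a derive instead of written by hand.
#[derive(Debug, PartialEq)]
pub enum FpError {
    ZeroLength,
}

impl fmt::Display for FpError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FpError::ZeroLength => write!(f, "fingerprint length must be non-zero"),
        }
    }
}

impl std::error::Error for FpError {}

/// FNV-1a, 64-bit. The constants are fixed, so the result is identical
/// across runs, unlike a hasher seeded from std's RandomState.
pub fn fnv1a(bytes: &[u8]) -> u64 {
    let mut hash: u64 = 0xcbf29ce484222325; // FNV offset basis
    for &b in bytes {
        hash ^= u64::from(b);
        hash = hash.wrapping_mul(0x100000001b3); // FNV prime
    }
    hash
}

/// Map a feature to a bit index, returning an error instead of panicking.
pub fn bit_index(feature: &[u8], n_bits: usize) -> Result<usize, FpError> {
    if n_bits == 0 {
        return Err(FpError::ZeroLength);
    }
    Ok((fnv1a(feature) % n_bits as u64) as usize)
}
```

A caller can propagate the failure with the ? operator, and a fingerprint written to an .fps file in one run stays comparable with one computed in a later run.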