SMILES Parser
The SMILES parser converts a SMILES string into a MolGraph. It is split into a lexer (tokenizer) and a parser (graph builder), following the standard compiler front-end pattern.
Lexer (smiles/lexer.rs)
The lexer reads the SMILES string character by character and produces a stream of Token values.
#![allow(unused)] fn main() { pub enum Token { Atom { element: Element, aromatic: bool }, BracketAtom { isotope, element, aromatic, h_count, charge, map_num, chirality }, Bond(BondType), BondStereo(char), // '/' or '\' RingClosure(u8), // ring closure digit or %nn OpenBranch, // '(' CloseBranch, // ')' Dot, // '.' } }
Key tokenization rules:
- Organic subset atoms (
B,C,N,O,P,S,F,Cl,Br,I) and their lowercase aromatic counterparts (c,n,o,s,p) are emitted asToken::Atom. - Bracket atoms (
[NH3],[2H],[Fe+2],[13C@@H]) are parsed asToken::BracketAtomwith all fields decoded inline. Two-character element symbols (Cl,Br,Se, etc.) require look-ahead. - Ring closures are
%NNfor two-digit ring numbers (e.g.,%10) or single digits (0–9). - Stereo tokens (
/,\) are consumed but stored only asBondStereo; stereo is not used by fingerprinting algorithms.
Parser (smiles/parser.rs)
The parser takes the token stream and builds a MolGraph incrementally.
State maintained during parsing:
#![allow(unused)] fn main() { let mut current: Option<NodeIndex> = None; let mut branch_stack: Vec<NodeIndex> = Vec::new(); let mut pending_bond: Option<BondType> = None; let mut ring_map: HashMap<u8, (NodeIndex, Option<BondType>)> = HashMap::new(); }
The token dispatch:
| Token | Action |
|---|---|
Atom / BracketAtom | Add node; connect to current with pending_bond or default bond; set current |
Bond | Store in pending_bond |
BondStereo | Ignored (no pending_bond change) |
OpenBranch | Push current onto branch_stack |
CloseBranch | Pop branch_stack; restore current; clear pending_bond |
RingClosure | If ring_id seen before: add closure edge; else store in ring_map |
Dot | Set current = None (start new fragment) |
Default bond inference
When no explicit bond token precedes an atom, the parser infers the bond type:
- If both the previous atom and the new atom are aromatic →
BondType::Aromatic - Otherwise →
BondType::Single
Ring closure bond type
When a ring closure digit appears twice, the bond type is resolved in this priority order:
- The
pending_bondat the closing digit (if set explicitly) - The bond type stored when the ring was opened (if set explicitly)
- Inferred from the aromaticity of the two ring-closure atoms
Error handling
The parser returns Result<MolGraph, ParseError> where ParseError covers:
Lexer(LexerError)— invalid characters in the SMILES stringUnmatchedRing(u8)— a ring closure digit that was never closedUnmatchedBranch— mismatched(or)Empty— empty input stringDanglingBond— a bond token with no subsequent atom
Post-processing
After the token loop, the parser runs two passes:
graph.assign_implicit_hydrogens()— computesh_countfor every atomassign_ring_info(&mut graph)— runs SSSR and setsin_ringandring_sizeson atoms
These are called unconditionally, so every MolGraph produced by parse_smiles has fully-populated ring and hydrogen information.
Public API
#![allow(unused)] fn main() { // smiles/mod.rs pub use parser::parse as parse_smiles; }
Usage:
#![allow(unused)] fn main() { use molprint_core::smiles::parse_smiles; let mol = parse_smiles("CC(=O)O").unwrap(); // acetic acid }
Fuzz testing
The SMILES lexer and parser are covered by a libFuzzer fuzz target at fuzz/fuzz_targets/fuzz_smiles.rs. The target feeds arbitrary byte strings into parse_smiles and checks that it never panics — only returns Ok or Err. This uncovered several edge cases during development, particularly around multi-digit ring closures and unusual bracket atom encodings.