Similarity & Screening

The molprint-search crate provides similarity metrics and parallel screening functions.

Similarity metrics

All three metrics operate on FingerprintBits and use POPCNT for efficiency.

Tanimoto (Jaccard)

The standard similarity metric in cheminformatics:

T(A, B) = |A ∩ B| / |A ∪ B| = popcount(A & B) / popcount(A | B)

#![allow(unused)]
fn main() {
pub fn tanimoto(a: &FingerprintBits, b: &FingerprintBits) -> f64 {
    let intersection = a.and(b).count_ones() as f64;
    let union = a.or(b).count_ones() as f64;
    if union == 0.0 { 0.0 } else { intersection / union }
}
}

Returns 0.0 for two all-zero fingerprints (rather than NaN). Returns 1.0 when the fingerprints are identical.

A common threshold for "similar" compounds is Tanimoto ≥ 0.7 for ECFP4.

Dice

D(A, B) = 2 * |A ∩ B| / (|A| + |B|)

Dice is more lenient than Tanimoto for compounds of different sizes, because the denominator uses the sum of individual set sizes rather than the union.

Cosine

C(A, B) = |A ∩ B| / sqrt(|A| * |B|)

Cosine similarity treats the fingerprint as a vector and measures the angle between them. It is less commonly used in cheminformatics than Tanimoto but useful for some applications.

At 36 ns per Tanimoto comparison (2048-bit), the bottleneck is memory bandwidth, not computation. The u64 word loop uses Rust's u64::count_ones() which compiles to a single POPCNT instruction. For future work, explicit SIMD (AVX2 or NEON) could batch multiple fingerprint comparisons per loop iteration.

Parallel screening

`threshold_search`

Returns all database fingerprints with Tanimoto ≥ threshold, sorted by descending similarity:

#![allow(unused)]
fn main() {
pub fn threshold_search(
    query: &FingerprintBits,
    database: &[FingerprintBits],
    threshold: f64,
) -> Vec<SearchHit>
}

Uses Rayon's par_iter() for parallel evaluation. Each thread processes a chunk of the database independently, then results are merged and sorted.

`top_k_search`

Returns the k most similar fingerprints:

#![allow(unused)]
fn main() {
pub fn top_k_search(
    query: &FingerprintBits,
    database: &[FingerprintBits],
    k: usize,
) -> Vec<SearchHit>
}

Computes similarity for every database entry (in parallel), sorts, then truncates to k. For very large databases where k is much smaller than the total size, a heap-based algorithm would be more efficient — this is the simplest correct implementation.

`SearchHit`

#![allow(unused)]
fn main() {
pub struct SearchHit {
    pub index: usize,      // index into the database slice
    pub similarity: f64,
}
}

The index into the original database slice lets the caller look up the molecule ID.

Screening performance

On Apple M-series, screening a 100k compound database takes ~826 µs per query (121M fingerprint comparisons/second). This is primarily memory-bound. At 2048 bits = 256 bytes per fingerprint, a 100k database is 25 MB — fits in L3 cache on most modern CPUs, which explains why Rayon parallelism helps.

For databases that fit in RAM (up to ~10M compounds at 2048-bit), molprint keeps everything in a Vec<FingerprintBits> without indexing structures. For larger scales, pre-filtering using popcount bounds (the Tanimoto lower bound from bit count differences) would reduce the number of full comparisons needed.

Usage example

#![allow(unused)]
fn main() {
use molprint_search::screen::{threshold_search, top_k_search};

// database is Vec<FingerprintBits>, loaded from an FPS file
// query_fp is FingerprintBits for the query molecule

// All compounds with Tanimoto >= 0.6
let hits = threshold_search(&query_fp, &database, 0.6);
for hit in &hits {
    println!("index={} sim={:.4}", hit.index, hit.similarity);
}

// Top 10 most similar
let top10 = top_k_search(&query_fp, &database, 10);
}