From Cybersecurity to Chemistry: Applying anomaly detection to expand a fragment collection

In drug discovery, the quality and diversity of our screening libraries are paramount. In fragment-based drug discovery, a common task is enhancing an existing fragment collection. Our Associate Director of In Silico Discovery, Angelo Pugliese, talks through a recently completed project to do just that: a data-driven approach to expanding our ~1,000-fragment in-house library from an ~8,000-compound collection. Instead of relying only on similarity cutoffs, we added AI/ML into the mix.

Step 1: Finding Novelty with AI/ML

Our primary goal was to identify fragments that are different from our existing collection. This is, by definition, an anomaly detection task. By framing our existing library as "normal", we could search for anomalous new fragments that sit outside this distribution. We used the Isolation Forest algorithm, an ML method suited to this high-dimensional task. Its core principle is that anomalies are few and different, and therefore easier to isolate. The model builds hundreds of random decision trees. At each node, a tree randomly picks a feature (a single bit in the fingerprint) and splits the data. The anomaly score for a compound is derived from its average path length across all trees, that is, how many splits it takes to isolate that compound in its own leaf node. Novel molecules are isolated quickly (short path length, high anomaly score), while common structures require many more splits.
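The path-length idea can be illustrated with a minimal, pure-Python sketch. This is not the production workflow: the toy 8-bit fingerprints, the function names and the parameter values are all assumptions made for illustration, and a real project would more likely use an established implementation on full-length fingerprints. The score follows the standard Isolation Forest form, 2 raised to the power of minus the mean path length divided by a normalizing constant.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Normalizer: expected path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(rows, rng, depth, max_depth):
    """Grow one isolation tree by splitting on randomly chosen fingerprint bits."""
    n_bits = len(rows[0])
    # Only bits that are on in some rows and off in others can separate anything.
    varying = [b for b in range(n_bits) if 0 < sum(r[b] for r in rows) < len(rows)]
    if depth >= max_depth or not varying:
        return {"size": len(rows)}  # leaf node
    bit = rng.choice(varying)
    return {
        "bit": bit,
        "off": build_tree([r for r in rows if r[bit] == 0], rng, depth + 1, max_depth),
        "on": build_tree([r for r in rows if r[bit] == 1], rng, depth + 1, max_depth),
    }

def path_length(tree, fp, depth=0):
    """Number of splits needed to reach a leaf, plus a correction for unsplit leaves."""
    if "bit" not in tree:
        return depth + c_factor(tree["size"])
    branch = "on" if fp[tree["bit"]] else "off"
    return path_length(tree[branch], fp, depth + 1)

def fit_forest(library, n_trees=100, sample_size=64, seed=0):
    """Build n_trees isolation trees, each on a random subsample of the library."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(library))
    max_depth = math.ceil(math.log2(sample_size))
    forest = [build_tree(rng.sample(library, sample_size), rng, 0, max_depth)
              for _ in range(n_trees)]
    return forest, sample_size

def anomaly_score(forest, fp, sample_size):
    """Score in (0, 1]; values closer to 1 mean the fingerprint is isolated quickly."""
    mean_path = sum(path_length(t, fp) for t in forest) / len(forest)
    return 2.0 ** (-mean_path / c_factor(sample_size))

# Toy "existing library": bits 0-3 are usually on, bits 4-7 are usually off.
data_rng = random.Random(42)
library = [[int(data_rng.random() < 0.9) for _ in range(4)] +
           [int(data_rng.random() < 0.1) for _ in range(4)] for _ in range(200)]

forest, n = fit_forest(library)
score_familiar = anomaly_score(forest, [1, 1, 1, 1, 0, 0, 0, 0], n)
score_novel = anomaly_score(forest, [0, 0, 0, 0, 1, 1, 1, 1], n)
print(f"familiar: {score_familiar:.2f}  novel: {score_novel:.2f}")
```

One detail the sketch makes visible: a tree can only split on bits that vary within the training data, so novelty is judged through rare or unusual combinations of bits the existing library already contains.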

One thing Isolation Forest does not mitigate is size bias. A larger fragment from the 8K library will almost certainly have more atoms and more bonds, so its fingerprint will have many more "on" bits than the average fingerprint in our training set. The bigger the fragment, the more diverse it looks, simply because more of its bits are set. The result is that larger fragments appear more "anomalous" relative to our current library and are more likely to be selected. We addressed this with a hard post-filter (MW ≤ 300 Da) to enforce the chemical rules we care about. This acts as a "guardrail" on the ML model's output. It can certainly be done in a different way. For instance, cosine similarity mitigates the size effect by normalizing the shared bits by the geometric mean of the two fingerprints' bit counts. Even better is the Tversky index, which can be tuned to measure substructure containment specifically, effectively asking: how much of the small molecule is present in the large one? While swapping in these metrics for the anomaly detection model is a valid strategy, our approach of applying knowledge-based, absolute property filters in a later step provided a robust, simple and transparent workflow to achieve what we wanted.
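To make the size effect concrete, here is a small sketch comparing three similarity measures on a toy pair of fingerprints, represented as sets of "on" bit indices. The bit sets and function names are illustrative assumptions; the large fragment is constructed to contain every bit of the small one.

```python
import math

def tanimoto(a, b):
    """Shared bits divided by the union of bits (Jaccard on bit sets)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def cosine(a, b):
    """Shared bits divided by the geometric mean of the two bit counts."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

def tversky(a, b, alpha=1.0, beta=0.0):
    """Asymmetric similarity; alpha=1, beta=0 measures how much of `a` is in `b`."""
    shared = len(a & b)
    denom = shared + alpha * len(a - b) + beta * len(b - a)
    return shared / denom if denom else 0.0

small = {1, 2, 3, 4}        # 4 bits on
large = set(range(1, 13))   # 12 bits on, including all 4 bits of `small`

sim_tanimoto = tanimoto(small, large)  # 4 / 12, about 0.33
sim_cosine = cosine(small, large)      # 4 / sqrt(48), about 0.58
sim_tversky = tversky(small, large)    # 4 / 4 = 1.0
print(sim_tanimoto, sim_cosine, sim_tversky)
```

Tanimoto penalizes the large fragment for its eight extra bits, cosine softens that penalty, and Tversky with these weights ignores it entirely and reports full containment.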

Step 2: Filtering for Useful Diversity

After identifying nearly 1,700 novel candidates with our ML model, we refined the set in two stages. First, we enriched for fragments whose properties (such as MW, LogP, TPSA and Fsp3) occupied underrepresented regions of our current library's property space. Next, we applied our guardrail filters, enforcing hard cutoffs of MW ≤ 300 Da and LogP ≤ 3.0. This dual-filter approach yielded our final set of 436 high-quality fragments. The UMAP projection of the fragment library chemical space is described in the caption below.
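The two refinement stages can be sketched as follows. The MW and LogP cutoffs come from the text; everything else is an assumption for illustration, including the use of Fsp3 alone, the four equal-width bins, the 20% occupancy threshold and the made-up fragment values. The article does not specify how underrepresented regions were actually identified.

```python
from collections import Counter
from dataclasses import dataclass

MW_MAX = 300.0    # Da, guardrail from the text
LOGP_MAX = 3.0    # guardrail from the text
N_BINS = 4        # hypothetical: Fsp3 range [0, 1] cut into 4 equal bins

@dataclass(frozen=True)
class Fragment:
    name: str
    mw: float
    logp: float
    fsp3: float

def fsp3_bin(value):
    """Map an Fsp3 value in [0, 1] to a bin index 0..N_BINS-1."""
    return min(int(value * N_BINS), N_BINS - 1)

def underrepresented_bins(library, max_share=0.2):
    """Bins holding less than max_share of the existing library (assumed criterion)."""
    counts = Counter(fsp3_bin(f.fsp3) for f in library)
    return {b for b in range(N_BINS) if counts.get(b, 0) / len(library) < max_share}

def passes_guardrails(f):
    """Hard cutoffs applied after the enrichment stage."""
    return f.mw <= MW_MAX and f.logp <= LOGP_MAX

# Made-up existing library, skewed towards flat (low Fsp3) fragments.
library = [Fragment(f"lib{i}", 200.0, 1.5, v) for i, v in
           enumerate([0.1, 0.15, 0.2, 0.2, 0.3, 0.3, 0.4, 0.45, 0.6, 0.1])]

# Made-up ML-selected candidates.
candidates = [
    Fragment("cand_a", 180.0, 1.2, 0.80),  # sparse region, passes guardrails
    Fragment("cand_b", 340.0, 2.0, 0.90),  # sparse region, too heavy
    Fragment("cand_c", 210.0, 3.6, 0.70),  # sparse region, too lipophilic
    Fragment("cand_d", 150.0, 0.5, 0.20),  # region already well covered
    Fragment("cand_e", 300.0, 3.0, 0.55),  # sparse region, exactly on both cutoffs
]

sparse = underrepresented_bins(library)
enriched = [c for c in candidates if fsp3_bin(c.fsp3) in sparse]
selected = [c for c in enriched if passes_guardrails(c)]
print([c.name for c in selected])
```

Keeping the guardrails as a separate, final step means the cutoffs stay visible and easy to audit regardless of how the earlier stages are implemented.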

UMAP Projection of Fragment Library Chemical Space. The 2D visualization was generated from 2048-bit Morgan fingerprints (Jaccard metric), where each point represents a single fragment and proximity indicates structural similarity. The plot overlays our original library (~1,000 fragments, blue circles) with the intermediate ML-selected novel candidates (~1,700 fragments, pink crosses) and the final curated set of 436 complementary fragments (red markers). The spatial separation of the red markers demonstrates that the final selection populates diverse chemical space not covered by the original library.

Step 3: Planning for Success with Fast Follow-ups

Finally, to accelerate future projects, we prepared for SAR exploration. For every fragment in our newly expanded library (1,400+), we searched the remaining 8K library for close analogues. Each fragment in the library is now annotated with a pre-identified list of its closest analogues, giving our project teams immediate starting points the moment a hit is found.
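The annotation step amounts to a nearest-neighbour search per fragment. A minimal sketch, assuming Tanimoto similarity on fingerprint bit sets, a top-k list and a minimum similarity threshold; the metric, the values of k and the threshold, and the toy identifiers are all assumptions, as the article does not state how "close analogue" was defined.

```python
import heapq

def tanimoto(a, b):
    """Shared bits divided by the union of bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def annotate_analogues(library, pool, k=2, min_sim=0.5):
    """For each library fragment, list up to k pool compounds with similarity >= min_sim."""
    annotations = {}
    for frag_id, frag_fp in library.items():
        scored = [(tanimoto(frag_fp, fp), pool_id) for pool_id, fp in pool.items()]
        close = [(sim, pool_id) for sim, pool_id in scored if sim >= min_sim]
        # nlargest returns the best matches first, sorted by similarity.
        annotations[frag_id] = [(pool_id, sim) for sim, pool_id in heapq.nlargest(k, close)]
    return annotations

# Toy fingerprints as sets of "on" bit indices (hypothetical identifiers).
library = {"F1": {1, 2, 3, 4}, "F2": {10, 11, 12}}
pool = {
    "P1": {1, 2, 3, 5},
    "P2": {1, 2, 3, 4, 6},
    "P3": {10, 11, 13},
    "P4": {20, 21},
}

analogues = annotate_analogues(library, pool)
for frag_id, hits in analogues.items():
    print(frag_id, hits)
```

Because the lists are computed once and stored with the library, looking up follow-up compounds for a new hit becomes a dictionary lookup rather than a fresh search.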

This blended workflow, combining machine learning for unbiased novelty detection with strategic, rule-based filtering, gives us confidence that our expanded library is not just larger, but carries more chemical and property diversity. It’s a good example of how we can use computational tools to tackle complex data challenges and make smarter, data-driven decisions.

Find out more about BioAscent's computational chemistry expertise here.
