Exploring the chemical compound space

The chemical compound space is vast—so vast that numbers used to describe it have taken on a life of their own. A frequently cited figure comes from Bohacek et al. (1996) [1], who estimated there could be on the order of 10^60 drug-like small molecules built from light atoms such as C, N, O, and H. Over time this figure has become something of a meme—not a precise measurement, but an illustration of immensity and combinatorial explosion in molecular design [2].

To put it in perspective: the universe itself is around 4.35 × 10^17 seconds old [3]. A machine that had evaluated one molecule every second since the Big Bang would have covered fewer than one part in 10^42 of the 10^60 candidates. Computation is also a finite physical resource: Lloyd estimates that the entire universe, treated as a computer, could have performed at most on the order of 10^120 elementary operations in its lifetime [4], and any realistic machine has access to only a minuscule share of that budget. Clearly, brute-force enumeration is impossible.
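The scale argument can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this post (10^60 molecules, 4.35 × 10^17 seconds); the evaluation rate of one molecule per nanosecond is an assumption chosen purely for illustration, and the function names are hypothetical.

```python
# Back-of-the-envelope check using the figures quoted in the text.
N_MOLECULES = 10**60           # order-of-magnitude estimate from Bohacek et al. [1]
AGE_OF_UNIVERSE_S = 4.35e17    # age of the universe in seconds [3]
RATE_PER_S = 1e9               # ASSUMPTION: one molecule evaluated per nanosecond


def molecules_covered(rate_per_s, seconds=AGE_OF_UNIVERSE_S):
    """Molecules evaluated at a constant rate over the given time span."""
    return rate_per_s * seconds


def required_rate(n_molecules=N_MOLECULES, seconds=AGE_OF_UNIVERSE_S):
    """Rate (molecules per second) needed to cover n_molecules in the time span."""
    return n_molecules / seconds


covered = molecules_covered(RATE_PER_S)
fraction = covered / N_MOLECULES

print(f"covered:     {covered:.2e} molecules")
print(f"fraction:    {fraction:.2e} of the space")
print(f"rate needed: {required_rate():.2e} molecules per second")
```

Even at the assumed nanosecond rate, the covered fraction stays dozens of orders of magnitude below one, and the rate needed for full coverage is far beyond any conceivable hardware.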

A ChatGPT-generated image of the chemical compound space

Why Machine Learning Matters

Because the chemical universe is effectively infinite for practical purposes, we need intelligent shortcuts. This is where machine learning (ML) has emerged as a transformative tool for accelerating molecular and materials discovery. ML gives us several ways of finding the needle in the haystack:

  • Generative models (variational autoencoders, GANs, diffusion models) can propose new molecules by learning the structure of chemical space [5].

  • Graph neural networks (GNNs) capture molecular topology and are state-of-the-art for predicting quantum chemical and biological properties [6].

  • Active learning and Bayesian optimization guide experiments by prioritizing which compounds to test next, maximizing information gained per calculation [7].
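To make the graph neural network idea from the list above a little more concrete, here is a minimal sketch of a single message-passing step on a toy molecular graph. It is only an illustration: real models such as the one in [6] use learned message and update functions, whereas this version uses fixed sums with no trainable weights. The graph (the heavy atoms of ethanol, C-C-O), the one-hot features, and all function names are assumptions made for the example.

```python
# Toy molecular graph: heavy atoms of ethanol, C-C-O (hydrogens omitted).
# ASSUMPTION: each atom carries a one-hot feature vector over [C, O].
atoms = {0: [1.0, 0.0], 1: [1.0, 0.0], 2: [0.0, 1.0]}
bonds = [(0, 1), (1, 2)]


def build_adjacency(bonds, n_atoms):
    """Map each atom index to the list of atoms it is bonded to."""
    adj = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)
    return adj


def message_passing_step(features, adj):
    """Each atom adds the sum of its neighbors' features to its own features."""
    updated = {}
    for atom, feat in features.items():
        message = [0.0] * len(feat)
        for nb in adj[atom]:
            message = [m + x for m, x in zip(message, features[nb])]
        updated[atom] = [f + m for f, m in zip(feat, message)]
    return updated


def readout(features):
    """Sum-pool the atom features into one molecule-level vector."""
    dim = len(next(iter(features.values())))
    return [sum(f[i] for f in features.values()) for i in range(dim)]


adj = build_adjacency(bonds, len(atoms))
h1 = message_passing_step(atoms, adj)
print(h1)           # per-atom features after one step
print(readout(h1))  # molecule-level vector
```

After one step, each atom's vector reflects its bonded neighbors as well as itself, which is the sense in which such models capture molecular topology. A trained network would feed the pooled vector into a learned predictor for the property of interest.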

These methods don’t replace physics or chemistry; they allow us to navigate the chemical universe more strategically, focusing resources where discovery is most likely. Some estimates suggest that ML could shorten the typical R&D timeline for new drugs and materials from about 20 years to about 5 [8].
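The idea of focusing resources where discovery is most likely can be sketched as a small selection loop. The example below is a toy heuristic in the spirit of active learning and Bayesian optimization, not an actual Gaussian-process optimizer: it scores each untested candidate by the value of its nearest tested neighbor plus a bonus for being far from known data, then evaluates the top-scoring one. The one-dimensional descriptor, the `expensive_property` oracle, and the `kappa` parameter are all hypothetical stand-ins.

```python
import math


def expensive_property(x):
    """HYPOTHETICAL stand-in for a costly simulation or experiment."""
    return math.exp(-((x - 0.7) ** 2) / 0.02)


def acquisition(x, labeled, kappa=1.0):
    """Nearest-neighbor prediction plus a bonus for distance from known data."""
    dist, pred = min((abs(x - xl), yl) for xl, yl in labeled.items())
    return pred + kappa * dist


def active_search(pool, oracle, budget, kappa=1.0):
    """Spend `budget` oracle calls on the candidates with the highest score."""
    # Seed with the two ends of the descriptor range.
    labeled = {x: oracle(x) for x in (pool[0], pool[-1])}
    for _ in range(budget):
        remaining = [x for x in pool if x not in labeled]
        if not remaining:
            break
        pick = max(remaining, key=lambda x: acquisition(x, labeled, kappa))
        labeled[pick] = oracle(pick)
    return labeled


pool = [i / 100 for i in range(101)]  # 101 hypothetical candidates
labeled = active_search(pool, expensive_property, budget=10)
best_x, best_y = max(labeled.items(), key=lambda item: item[1])

print(f"evaluated {len(labeled)} of {len(pool)} candidates")
print(f"best candidate: x={best_x:.2f}, property={best_y:.3f}")
```

The loop evaluates only a small part of the pool yet still lands near the peak of the hidden property, because each new measurement steers the next choice. Practical systems replace the nearest-neighbor guess with a trained surrogate model that provides calibrated uncertainty estimates.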

 

Looking Ahead

Human history has often been defined by the materials we learned to master—the Stone Age, the Bronze Age, the Iron Age. Today we are in the Silicon Age, where we can offload vast computation to algorithms and AI. With this, we may soon discover new classes of materials: room-temperature superconductors that could make nuclear fusion viable, or catalysts capable of splitting stubborn molecules like CO₂ and NO₂ into harmless components.

Throughout my career—first as a PhD student, then as a postdoctoral researcher, and now as a research scientist—I have worked on several of these challenges. What gives me optimism is not just the scale of chemical space, but the fact that with AI, we finally have a map to explore it.

 

References & Further Reading

  1. Bohacek RS, McMartin C, Guida WC. The art and practice of structure-based drug design: A molecular modeling perspective. Med. Res. Rev. 1996.

  2. “How many small molecules could be out there?” Jena University blog, 2025.

  3. Age of the Universe. Wikipedia.

  4. Lloyd S. Computational Capacity of the Universe. arXiv:quant-ph/0110141.

  5. Reymond J-L. The Chemical Space Project. Acc. Chem. Res. 2015.

  6. Gilmer J et al. Neural Message Passing for Quantum Chemistry. ICML 2017.

  7. Settles B. Active Learning Literature Survey. University of Wisconsin, 2009.

  8. Roche. AI and machine learning: Revolutionising drug discovery and transforming patient care.