Winners: 2024 Outstanding Early Career Research Award of Digital Discovery
Daniel Probst and Jan Weinreich
Paper: Learning on compressed molecular representations
In computational settings, molecules are often represented as graphs or as string-encoded graphs in various formats, such as SMILES. While the former is commonly used as input for graph algorithms and graph neural networks, the latter has become relevant as input for natural language processing-based methods, most prominently transformers. Such deep learning techniques have the advantage of scaling extremely well with data set size compared to other machine learning approaches. However, in the case of string-based inputs, other methods require an initial conversion of the string into a numerical vector that represents the structure and composition of a given molecule. A common way to achieve this is through graph algorithms called molecular fingerprints, such as the extended connectivity fingerprints. Although methods that use these molecular fingerprints can achieve impressive performance, they introduce information loss: once converted, a fingerprint generally cannot be mapped back to the input molecule.
An alternative is to compute the information distance between two strings and base a machine learning algorithm on this metric. While the information distance itself is not computable, the compression distance is easy to compute: it approximates the Kolmogorov complexity underlying the information distance with the output length of a lossless compression algorithm, such as DEFLATE, which is used by the program gzip. The feasibility of such an approach has been shown by successfully classifying natural language or image data.
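As a rough illustration of this idea (a minimal sketch, not the implementation used in the paper), the normalised compression distance between two strings can be computed with Python's standard gzip module. The function names `compressed_len` and `ncd`, the space used as a separator, and the example SMILES strings are choices made here for illustration only.

```python
import gzip


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalised compression distance between two strings.

    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed length. Joining x and y with a
    space is an assumption of this sketch.
    """
    cx = compressed_len(x)
    cy = compressed_len(y)
    cxy = compressed_len(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    aspirin = "CC(=O)OC1=CC=CC=C1C(=O)O"
    caffeine = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"
    print("aspirin vs aspirin: ", ncd(aspirin, aspirin))
    print("aspirin vs caffeine:", ncd(aspirin, caffeine))
```

A string compared with itself should give a smaller distance than two unrelated strings, because the compressor can reuse patterns from the first string when encoding the second. For strings as short as these, the fixed gzip header and trailer make up a noticeable share of the compressed length, so the absolute values should be read with care.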
In our research, we extended this approach to multimodal regression problems in biology and chemistry. When predicting binding affinities of protein-ligand pairs, our approach, which we named MolZip and which is based on compressing concatenated ligand SMILES and protein amino acid sequences, performed better than 9 out of 10 graph neural network methods. On molecular property prediction benchmarks, MolZip performed similarly to established methods.
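To make the regression setting more concrete, the following sketch shows one way a compression distance over concatenated ligand and protein strings could drive a prediction. It is an assumption-laden toy and not MolZip itself: the k-nearest-neighbour averaging, the helper names `pair_string` and `knn_predict`, the space separators, and all sequences and target values in the usage section are hypothetical placeholders rather than real measurements.

```python
import gzip


def compressed_len(text: str) -> int:
    """Length in bytes of the gzip-compressed UTF-8 encoding of `text`."""
    return len(gzip.compress(text.encode("utf-8")))


def ncd(x: str, y: str) -> float:
    """Normalised compression distance (strings joined by a space)."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + " " + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def pair_string(smiles: str, protein_seq: str) -> str:
    """Concatenate a ligand SMILES and a protein sequence into one string.

    The separator is an assumption of this sketch.
    """
    return smiles + " " + protein_seq


def knn_predict(query: str, train: list[tuple[str, float]], k: int = 3) -> float:
    """Average the target values of the k training strings closest to `query`.

    Plain unweighted k-nearest-neighbour regression, assumed here
    for illustration.
    """
    by_distance = sorted((ncd(query, s), value) for s, value in train)
    nearest = by_distance[:k]
    return sum(value for _, value in nearest) / len(nearest)


if __name__ == "__main__":
    # Hypothetical placeholder data: sequences and values are made up.
    seq_a = "MKTAYIAKQRQISFVKSHFSRQ"
    seq_b = "GSHMLEDPVDAFKLLNKELGRR"
    train = [
        (pair_string("CCO", seq_a), 5.0),
        (pair_string("c1ccccc1", seq_b), 6.5),
        (pair_string("CC(=O)OC1=CC=CC=C1C(=O)O", seq_a), 7.2),
    ]
    query = pair_string("CCN", seq_a)
    print(knn_predict(query, train, k=2))
```

The appeal of this kind of design is that it needs no featurisation or training step beyond storing the strings; all the work happens in the compressor at prediction time.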
Our work provides an interesting insight into the relation between information and string representations of molecules. Furthermore, it is fascinating that a method based, at its core, on the length of compressed strings is capable of performing as well as or better than established and conceptually complex methodologies. It is a reminder to keep our research focus broad and to consider whether the benchmarks we use to evaluate our models are well-suited to the task.
About the winners

Daniel received his BSc in Computer Science at the Bern University of Applied Sciences in 2013 and his MSc in Bioinformatics and Computational Biology at the University of Bern in 2016. In 2020, he received his PhD in Chemistry and Molecular Sciences for his thesis “Scalable Methods for the Exploration and Visualisation of Large Chemical Spaces” from the University of Bern under the supervision of Prof. Jean-Louis Reymond.
After two years as a permanent research staff member at IBM Research in the team of Teodoro Laino, working on machine learning for biocatalysis, he joined the group of Prof. Pierre Vandergheynst at EPFL as a postdoctoral researcher, working on machine learning methodologies with applications to small molecules and protein dynamics.
In 2024, Daniel started as a tenure-track assistant professor at Wageningen University and Research, where he works at the intersection of biology, chemistry, and computer science with a focus on sustainability.

Jan is a researcher specialising in machine learning for chemistry and materials science. His work focuses on developing efficient, scalable algorithms for molecular property prediction and chemical space exploration. He is also the co-founder of Chembricks AI, a company building AI platforms for materials discovery and chemical process optimisation.