Conference Agenda


 
 
Session Overview
Date: Wednesday, 15/Nov/2023
9:00am - 10:15am CZE1: ELIXIR Czech Republic 1
Location: Chamber Hall
Session Chair: Jiri Damborsky
 
9:00am - 9:15am

Welcome

Jiří Vondrášek

Institute of Organic Chemistry and Biochemistry, Czech Academy of Sciences

Welcome to ELIXIR Czech Meeting



9:15am - 9:35am

A PDB-wide assignment of apo & holo relationships based on individual protein-ligand interactions

Marian Novotny, Christos Feidakis, Radoslav Krivak, David Hoksza

Charles University, Czech Republic

From studying protein dynamics to unveiling cryptic binding sites or assessing the effectiveness of ligand binding site prediction software, access to several snapshots of a protein is needed. Availability of both bound (holo) and unbound (apo) forms of a protein is paramount for making meaningful comparisons and drawing robust conclusions. The few existing resources that provide access to such data are restricted either in terms of protein coverage or in the number of provided structure pairs, which does not always reflect the conformational variance represented by the structures deposited in the Protein Data Bank (PDB).

Here we present a previously designed application (AHoJ, Apo-Holo Juxtaposition) and use it to perform an extensive search for apo-holo pairs for each individual protein-ligand interaction across the PDB (~500,000 small molecule interactions, excluding interactions with peptides and nucleic acids). We assemble the results of this search into a database that can be used to train and evaluate predictors, discover potentially druggable proteins, and reveal associations that can confirm existing hypotheses or expose protein- and ligand-specific relationships, such as order-to-disorder transitions, that were previously obscured by intermittent or partial data.



9:35am - 9:55am

Binding residue prediction with protein language models: Does the structure matter?

David Hoksza, Hamza Gamouh, Marian Novotný

Charles University, Czech Republic

The accurate prediction of protein-ligand binding sites is crucial for understanding protein interactions, especially in the context of biotechnology and drug discovery. There are two main approaches to tackle this challenge: one relies on the sequence of the protein (sequence-based methods), while the other relies on the three-dimensional structure of the protein (structure-based methods).

In this talk, we will discuss a novel approach that combines the strengths of both approaches to advance the state-of-the-art in this field. Our hybrid model merges two cutting-edge deep learning techniques: protein language models (pLMs) from the sequence-based approach and Graph Neural Networks (GNNs) from the structure-based approach. Specifically, we create a residue-level Graph Attention Network (GAT) model using the 3D protein structure and incorporate pre-trained pLM embeddings as node features. This integration allows our model to capture both the sequential information embedded in the protein sequence and the structural relationships within the protein.

Our model performs well compared to existing methods on a benchmark dataset, covering various ligands and ligand types. Ablation studies highlight the importance of the graph attention mechanism, particularly in densely connected graphs. Furthermore, we illustrate that as we employ more intricate pLMs to represent node features, the relative impact of the GNN architecture diminishes. This finding suggests that, to some extent, the structural information necessary for accurate binding site prediction is inherently encoded within the pLMs themselves.
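As a rough illustration of the architecture described above, the following sketch builds a residue-level contact graph from C-alpha coordinates and applies one single-head graph attention layer to per-residue feature vectors. It is not the authors' implementation: the 8 Å cutoff, the two-dimensional stand-in "embeddings", and the untrained weights are all assumptions made for the example, and a real model would use pre-trained pLM embeddings and learned parameters.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def contact_graph(ca_coords, cutoff=8.0):
    """Neighbour lists (self included) for residues whose C-alpha atoms
    lie within `cutoff` angstroms. The cutoff value is an assumption."""
    n = len(ca_coords)
    return [[j for j in range(n)
             if math.dist(ca_coords[i], ca_coords[j]) <= cutoff]
            for i in range(n)]

def gat_layer(features, neighbours, weight, attn):
    """One single-head graph attention layer:
    e_ij = LeakyReLU(attn . [W h_i || W h_j]), softmax over neighbours j,
    output h_i' = sum_j alpha_ij * W h_j."""
    projected = [[sum(w * x for w, x in zip(row, h)) for row in weight]
                 for h in features]
    out = []
    for i, nbrs in enumerate(neighbours):
        logits = [leaky_relu(sum(a * v for a, v in
                                 zip(attn, projected[i] + projected[j])))
                  for j in nbrs]
        peak = max(logits)  # subtract the maximum for numerical stability
        exps = [math.exp(l - peak) for l in logits]
        total = sum(exps)
        alphas = [e / total for e in exps]
        out.append([sum(a * projected[j][k] for a, j in zip(alphas, nbrs))
                    for k in range(len(weight))])
    return out

# Toy input: three residues, the third far away from the first two.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # stand-ins for pLM embeddings
W = [[1.0, 0.0], [0.0, 1.0]]       # untrained (identity) projection
a = [0.5, -0.5, 0.5, -0.5]         # untrained attention vector

neighbours = contact_graph(coords)
updated = gat_layer(embeddings, neighbours, W, a)
print(neighbours)
print(updated)
```

The isolated third residue only attends to itself, so its features pass through unchanged, while the first two residues receive attention-weighted mixtures of each other's features.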



9:55am - 10:15am

New ways of protein family visualization in AlphaFold era

Radka Svobodová1, Karel Berka2, Ivana Hutařová Vařeková2, Tomáš Raček1, Ondřej Schindler1

1CEITEC and NCBR, Masaryk University Brno, Kamenice 5, 625 00 Brno, Czech Republic; 2Department of Physical Chemistry, Faculty of Science, Palacký University, tř. 17. listopadu 12, 771 46 Olomouc, Czech Republic

Thanks to advanced structural biology approaches, more than 200,000 experimentally determined protein structures are available in the Protein Data Bank. Based on this data, more than 200,000,000 protein structures were generated using artificial intelligence algorithms and are available in AlphaFoldDB. This data has greatly expanded the possibilities of studying protein families, their anatomy, variability, common features, and evolutionary conservation. Visualizing different aspects of protein family structures provides important information for their analysis and research.

In this contribution, we present new methodologies for visualizing protein families and their properties: 1D diagrams of protein families generated using the OverProt tool [1], 2D diagrams produced by the 2DProts application [2], and the mapping of properties and annotations of protein families onto these diagrams. The properties include, e.g., partial atomic charges calculated using the software tools ACC II [3] and αCharges [4].

1. A. Midlik, I. Hutařová Vařeková, J. Hutař, A. Chareshneu, K. Berka, R. Svobodová, Bioinf., 38(14), (2022), 3648-3650.
2. I. Hutařová Vařeková, J. Hutař, A. Midlik, V. Horský, E. Hladká, R. Svobodová, K. Berka, Bioinf., 37(23), (2021), 4599-4601.
3. T. Raček, O. Schindler, D. Toušek, V. Horský, K. Berka, J. Koča, R. Svobodová, Nucl. Acids Res., 48(W1), (2020), W591-W596.
4. O. Schindler, K. Berka, A. Cantara, A. Křenek, D. Tichý, T. Raček, R. Svobodová, Nucl. Acids Res., 51(W1), (2023), W11-W16.

The Core Facility Biological Data Management and Analysis of CEITEC Masaryk University, supported by the ELIXIR CZ research infrastructure (MEYS Grant No: LM2023055), is gratefully acknowledged for obtaining the scientific data presented in this contribution.

 
10:15am - 10:45am CB0: Coffee break
Location: Forum Hall Foyer 3
10:45am - 12:00pm CZE2: ELIXIR Czech Republic 2
Location: Chamber Hall
Session Chair: Radka Svobodová
 
10:45am - 11:05am

Dynamics from AlphaFold - Elastic network approach

Vojtech Spiwok

University of Chemistry and Technology, Prague, Czech Republic

AlphaFold 2 has significantly changed the way structures of proteins and protein-protein complexes are predicted. This tool also provides residue-residue distance probability profiles as its output. Using the laws of thermodynamics, it is possible to infer protein dynamics from these profiles. We will present our results on the application of an AlphaFold-based elastic network model to find flexible or conformationally variable elements in proteins, for example, activation loops in protein kinases, flexible segments in G protein-coupled receptors, and protein structures not known at the time AlphaFold was trained.
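One way to picture how a distance probability profile can be turned into an elastic-network spring is Boltzmann inversion of a harmonic potential: if the distance between two residues is distributed approximately as a Gaussian with variance σ², the corresponding force constant is k = k_BT / σ². The sketch below applies this to a binned profile. It is an illustration under that assumption only, not the speaker's actual procedure; the temperature, the bin centres, and the toy probabilities are hypothetical.

```python
K_B_T = 2.494  # kJ/mol at roughly 300 K (assumed temperature)

def spring_from_profile(bin_centers, probabilities):
    """Return (mean distance, force constant) for one residue pair.

    Assumes a harmonic potential, so that the variance of the distance
    distribution and the force constant are related by k = kT / variance.
    With distances in angstroms, k is in kJ/mol/A^2.
    """
    total = sum(probabilities)
    probs = [p / total for p in probabilities]  # normalise the profile
    mean = sum(d * p for d, p in zip(bin_centers, probs))
    variance = sum(p * (d - mean) ** 2 for d, p in zip(bin_centers, probs))
    if variance == 0:
        raise ValueError("profile has zero width; force constant undefined")
    return mean, K_B_T / variance

# Toy profiles over three distance bins (angstroms).
centers = [7.0, 8.0, 9.0]
narrow = [0.1, 0.8, 0.1]  # sharply peaked: rigid pair
broad = [0.3, 0.4, 0.3]   # spread out: flexible pair

mean_narrow, k_narrow = spring_from_profile(centers, narrow)
mean_broad, k_broad = spring_from_profile(centers, broad)
print(mean_narrow, k_narrow)
print(mean_broad, k_broad)
```

Under this reading, a sharply peaked profile yields a stiff spring and a broad profile a soft one, which is how flexible elements would stand out in the resulting network.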

The work is supported by ELIXIR CZ and E-Infra (Ministry of Education, Youth and Sports of the Czech Republic: LM2018140, LM2018131) and Czech Science Foundation (22-29667S).



11:05am - 11:25am

FireProt and FireProt-ASR – Web Tools for Computational Protein Stabilisation

David Bednar1,2, Milos Musil1,2,3, Rayyan Tariq Khan1,2, Jan Stourac1,2, Andrej Jezik1,3, Jana Horackova1, Simeon Borko1,2,3, Petr Kabourek1,2, Jiri Damborsky1,2

1Masaryk University, Czech Republic; 2St. Anne's University Hospital Brno, Czech Republic; 3University of Technology, Brno, Czech Republic

Thermostable proteins are crucial in biomedicine and biotechnology, but designing them has been challenging, typically yielding limited improvements through single-point mutations. FireProt 2.0 builds upon its predecessor, offering several innovative strategies for protein stabilisation and, through AlphaFold integration and ProMod3 modelling, the ability to start from a sequence input alone. It introduces multiple-point designs with minimized antagonistic effects between mutations. Moreover, users can customize calculations, perform saturation mutagenesis for the selection of single-point mutations, or design multiple-point mutants in automated mode. Evolution-based strategies predict stabilizing mutations from back-to-consensus analysis and ancestral sequence reconstruction. FireProt 2.0 significantly reduces calculation time and improves user experience. It is freely accessible at https://loschmidt.chemi.muni.cz/fireprot/.
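The back-to-consensus idea mentioned above can be sketched in a few lines: for each position of the target protein, find the most frequent residue among aligned homologs and propose a mutation wherever the target differs from a sufficiently dominant consensus. This is a generic illustration, not FireProt's implementation; the function name, the 0.5 frequency threshold, and the toy alignment are assumptions.

```python
from collections import Counter

def back_to_consensus(target, alignment, min_freq=0.5):
    """Suggest mutations that revert the target to the alignment consensus.

    `target` and every sequence in `alignment` are aligned strings of equal
    length, with '-' for gaps. Positions are numbered on the ungapped target.
    `min_freq` is a hypothetical threshold on the consensus frequency.
    """
    suggestions = []
    position = 0
    for col, residue in enumerate(target):
        if residue == "-":
            continue  # gap in the target: no residue to mutate
        position += 1
        column = [seq[col] for seq in alignment if seq[col] != "-"]
        if not column:
            continue
        consensus, count = Counter(column).most_common(1)[0]
        if consensus != residue and count / len(column) >= min_freq:
            suggestions.append(f"{residue}{position}{consensus}")
    return suggestions

# Toy alignment of four homologs and a target sequence.
homologs = ["MKVLA", "MKILA", "MRILA", "MKILG"]
target = "MKVLG"
suggested = back_to_consensus(target, homologs)
print(suggested)  # ['V3I', 'G5A']
```

In practice such candidates would still need filtering, for example by predicted stability change, before being combined into a multiple-point design.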

Ancestral Sequence Reconstruction (ASR) infers ancestral protein sequences, aiding the discovery of highly stable, versatile, and productive proteins. FireProtASR simplifies the reconstruction process with a user-friendly web server. It automates the whole ASR workflow: searching for homologous protein sequences, building a multiple sequence alignment, constructing and rooting the phylogenetic tree, and reconstructing the ancestral sequences, including ancestral gaps. The tool is freely available at https://loschmidt.chemi.muni.cz/fireprotasr/.



11:25am - 11:45am

Annotation, validation, refinement, and modeling of nucleic acid structures.

Jiri Cerny, Paulina Bozikova, Barbora Schramlova, Bohdan Schneider

Institute of Biotechnology of the Czech Academy of Sciences, Czech Republic

The DNATCO web server, available at https://dnatco.datmos.org [1], provides intuitive annotation, validation, modeling and refinement of nucleic acids, employing the universal structural alphabet of nucleic acids for the assignment of DNA and RNA backbone conformations [2]. Recent improvements to the freely accessible DNATCO web server will be presented.
Further, we will report on the progress of the Nucleic Acid Valence Geometry Working Group [3], whose task is to define and implement a uniform dictionary of nucleic acid valence geometry parameters for use in modeling, refinement and validation systems, together with initiatives dealing with the standardization of nucleic acid base-pairing patterns.

1. Černý et al., Acta Cryst. D 2020, 76(9), 52-64, 805-813.
2. Černý et al., Nucleic Acids Research 2020, 48(11), 6367-6381.
3. Schneider et al., IUCr Newsletter 2020, 28, 4



11:45am - 12:00pm

Are kuravirus capsid diameters quantized? The first all-atom genome tracing method for double-stranded DNA viruses

Samuel Coulbourn Flores1, Michal Malý2, Dominik Hrebík3, Pavel Plevka3, Jiří Černý2

1Swedish University of Agricultural Sciences, Sweden; 2Institute of Biotechnology of the Czech Academy of Sciences; 3Central European Institute of Technology, Brno, Czech Republic

The revolution in Cryo-Electron Microscopy has resulted in unprecedented power to resolve large macromolecular complexes including viruses. Many methods exist to explain density corresponding to proteins, and thus entire protein capsids have been solved at the all-atom level. However, methods for nucleic acids lag behind, and no all-atom viral double-stranded DNA genomes have been published at all. Here we present a method which exploits the spiral winding patterns of DNA in icosahedral capsids. The method quickly generates shells of DNA wound in user-specified, idealized spherical or cylindrical spirals. For transition regions, the method allows guided semiflexible fitting. For the kuravirus SU10, our method explains most of the density in a semiautomated fashion. The results suggest rules for DNA turns in the end caps under which two discrete parameters determine the capsid inner diameter. We suggest that other kuraviruses may follow the same winding scheme, producing a discrete rather than continuous spectrum of capsid inner diameters. Our software may be used to explain the published density maps of other double-stranded DNA viruses and uncover their genome packaging principles.

 
12:00pm - 1:00pm L1: Lunch and get together of the Czech and International sections
Location: Forum Hall Foyer 3
1:00pm - 1:30pm Intro: Introductory session
Location: Chamber Hall
Session Chair: Bohdan Schneider
 
1:00pm - 1:10pm

Purpose and highlights of the conference

Christine Orengo

University College London, United Kingdom

Purpose and highlights of the conference 3D-BioInfo | ICSB 3D-SIG | ELIXIR Czech Republic
Community Meeting in Structural Bioinformatics



1:10pm - 1:20pm

ELIXIR welcome

Elixir Rep I

Elixir

ELIXIR Welcome



1:20pm - 1:50pm

Welcome and organizational notes

Bohdan Schneider

Institute of Biotechnology of the Czech Academy of Sciences, Czech Republic

Welcome and organizational notes

 
1:30pm - 3:30pm S1: Activity 1 - To develop the infrastructure for FAIR structural and functional annotations
Location: Chamber Hall
Session Chair: Sameer Velankar
 
1:30pm - 1:45pm

PDBe-KB in 2023: New data pipelines and improved functionality

Mihaly Varadi, Grisell Diaz Leines, Sri D. Appasamy, Joseph I.J. Ellaway, Roshan I. Kunnakkattu, Preeti Choudhary, Sreenath S. Nair, Stephen Anyango, Sameer Velankar

Protein Data Bank in Europe, EMBL-EBI, Hinxton, United Kingdom

The Protein Data Bank in Europe - Knowledge Base (PDBe-KB) represents an innovative open, collaborative consortium that integrates and enriches 3D-structure data combined with functional annotations to invigorate foundational and translational research. A standout component of PDBe-KB's endeavours is the 3D-Beacons Network, which offers seamless access to both experimental and computational models of protein structures. As a cornerstone initiative, PDBe-KB is a flagship within Activity 1 of the ELIXIR 3D-BioInfo Community.

PDBe-KB has a three-fold mission: to establish a robust FAIR infrastructure that ensures effective use of macromolecular structures and their annotations; to enrich these structures with pertinent structural and functional annotations, anchoring them within their biological context; and to develop state-of-the-art tools that improve data access and visualisation.

Last year the PDBe team brought significant advancements to PDBe-KB, from a more streamlined PDBe-KB website which enhances user accessibility, to the roll-out of an updated PISA version, offering intricate details on macromolecular interaction interfaces and the introduction of a refined structural superposition and conformation clustering pipeline. This tool runs weekly through the entire PDB archive, pinpointing unique conformations for protein segments. We also designed a novel methodology to generate unique, persistent identifiers for macromolecular assemblies in the PDB, considering both super- and sub-complex compositions. Looking at small molecules, we pioneered a data process to reconstruct covalently linked compounds (CLCs) throughout the PDB archive, providing complete ligands instead of discrete CCD components. Finally, in the context of the 3D-Beacons Network, we created an enhanced user interface and added advanced search functionalities.

Through these improvements, the PDBe-KB data resource continues its commitment to delivering a comprehensive data infrastructure, driving the next wave of scientific discoveries in macromolecular structures.



1:45pm - 2:15pm

Computational Enzymology in 3D: Modules and Mechanisms

Janet Thornton, AJM Ribeiro, Ioannis Riziotis, JD Tyzack, Neera Borkakoti, Roman Laskowski

European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge CB10 1SD, UK

Enzymes catalyse most of the chemical reactions which are essential for life. They are powerful catalysts that have evolved over millions of years to perform the functions in an organism that are necessary for survival. Using structural data and computational biology we seek to understand and predict how enzymes work and how they evolve to perform new enzyme functions. In this talk I will present our ongoing work on identifying catalytic modules and predicting mechanisms.



2:15pm - 2:30pm

The Evolution of Local Energetic Frustration in Protein Families and Superfamilies

Maria I. Freiberger1, Victoria Ruiz-Serra2, Miriam Poley2, Marco Ludaic2, Camila Pontes2, Miguel Romero2, Cesar Ramirez-Sarmiento3, Marcelo Marti1, Peter Wolynes4, R. Gonzalo Parra2, Alfonso Valencia2

1Buenos Aires University, Argentina; 2Barcelona Supercomputing Center, Spain; 3Pontificia Universidad Católica de Chile, Chile; 4Pontificia Universidad Católica de Chile, Chile

Protein families evolve and diversify by accumulating sequence variations that translate into changes in the folding landscapes and the structure and dynamics of the native state of their members. These changes are constrained by the features of the folding energy landscape as well as the proteins’ molecular functions.

Natural proteins fold by minimizing the energetic conflicts of those interactions that are present in their native states. Although the free energy is globally minimized, not all native interactions are energetically optimized. In fact, up to 10% of the interactions in the native state of proteins are in conflict with their local environment, 40% are energetically optimized and the rest are neutral. These conflicting (frustrated) signals have been linked with different functional aspects such as protein-protein interactions, allosterism and catalytic activity.

We have developed FrustraEvo, a tool that measures structural local frustration conservation patterns within protein families as a proxy to define residues that are important either for stability or function and relate them to their sequence variability signatures. We compare evolutionarily related protein families, with a common ancestor, to detect differential frustration patterns that have emerged as a consequence of their divergence and link them with subfamily functional properties. For specific cases, we have analyzed ancestral reconstructed proteins to understand how frustration dynamically changed as a function of evolutionary time to give insights into protein functional diversification. Finally, we have used reverse folding and protein language model machine learning methods to generate protein sequences in the context of the studied families to explore the biophysical limits of their sequence spaces.

Facilitated by the latest protein structure prediction techniques, we have implemented a biophysically inspired strategy to provide novel insights into protein sequence-structure-function relationships as well as into their evolutionary history. Our methodology can offer a novel way to annotate stability and function related residues at the level of protein families.



2:30pm - 2:45pm

Finding structure specific entity types in literature

Melanie Vollmar1, Santosh Tirunagari2, Sameer Velankar1

1PDBe, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK; 2Europe PMC, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, UK

Proteins and their structure are crucial for the understanding of diseases, processes and pathways within cells and organisms. Experimental structures and/or predicted models are complemented by results from biochemical and/or biophysical assays and inferred details from homologs to create the knowledge about a particular protein. This knowledge is then extensively discussed in scientific literature to paint a diverse picture, with all its controversial and accepted evidence, about a specific protein. Scientific literature as a vast source of knowledge is, however, unstructured in nature and hence only accessible to humans. Additionally, the speed with which publications are being produced makes it a challenge for a researcher to stay informed on all new evidence. In order to enrich structures (predicted or experimental) with functional information from the literature, a human biocurator has to read the literature and manually extract the details to provide annotations to a structure.

Here we present a deep learning tool that automatically identifies protein structure specific entity types in scientific literature. The current system identifies these entity types with a precision of 90%, recall of 92% and F1 score of 91%. The identified key terms describing a protein’s structure and functional residues can be highlighted for biocurators to help them in the annotation process. They also provide a starting point for future work on identifying relationships between the terms to create a structure specific ontology. A large-language model could then be employed to reason over the literature using identified entities and relationships as additional input. Combined, this will create an automatic pipeline to extract residue-level functional information from literature to enrich the annotations of protein structures and sequences.
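The three reported metrics are related: the F1 score is the harmonic mean of precision and recall. The short sketch below checks that the stated precision (90%) and recall (92%) are consistent with the stated F1 score; the function name is ours and only the two input values come from the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f1 = f1_score(0.90, 0.92)
print(round(f1, 2))  # 0.91
```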

Acknowledgements: Deborah Harrus, David Armstrong, Genevieve Evans, Deepti Gupta, Marcelo Querino Lima Afonso and Romana Gaborova who all spent hours diligently annotating the initial publication set for training the first version of the autoannotator.



2:45pm - 3:00pm

FiTMuSiC: Leveraging structural and (co)evolutionary data for protein fitness prediction.

Matsvei Tsishyn, Gabriel Cia Beriain, Pauline Hermans, Jean Marc Kwasigroch, Marianne Rooman, Fabrizio Pucci

Université Libre de Bruxelles, Belgium

Accurately predicting how mutations impact the fitness of a protein is of major interest for the interpretation of genetic variants and thus for the understanding of genetic diseases. Here we introduce FiTMuSiC, a simple model that combines structural, evolutionary and coevolutionary features to predict the fitness of protein variants. Although our model only has a few parameters, it outperforms deep-learning models when benchmarked on deep-mutagenesis scanning data. Moreover, unlike multi-parameter machine learning models, our simple approach allows us to biologically interpret the phenomenon underlying the predicted fitness and thus to deepen our understanding of variant pathogenicity. We showcase the application of FiTMuSiC on hydroxymethylbilane synthase, which was one of the targets in the last round of the Critical Assessment of Genome Interpretation (CAGI) in which our method was one of the best. FiTMuSiC is freely available for academic use at babylone.3bio.ulb.ac.be/FiTMuSiC.



3:00pm - 3:15pm

The MOKCa database 2023.

Biniam T Haile1, Adnan Cinar1, Christopher J Richardson2, Frances M G Pearl1

1University of Sussex, Brighton, United Kingdom; 2The Institute of Cancer Research, London, United Kingdom

The MOKCa database http://strubiol.icr.ac.uk/extra/MOKCa (Mutations, Oncogenes and Knowledge in Cancer) was developed to structurally and functionally annotate, and where possible predict, the phenotypic consequences of disease-associated mutations in proteins implicated in cancer. The initial database focused on protein kinases, but has now been extended to include all the proteins from the human genome that are mutated in cancer.

We are currently updating MOKCa with somatic mutation data from the COSMIC database, with missense, nonsense and inframe indel mutations being mapped to their position on the canonical UniProt sequence. Protein level annotations include Gene Ontology (GO) assignments, Pfam domains and Prosite patterns.

Each mutation is described by its alteration to the protein sequence, e.g. BRAF V600E, with different genetic changes that result in the same mutation presented together at the protein level. Each cancer phenotype in which this mutation has been recorded is also presented on the protein overview page. Functional annotations for each mutation include whether it is annotated or predicted to be pathogenic, whether it is a gain of function (GOF) or loss of function mutation, and whether it is likely to affect post-translational modifications, including phosphorylation, glycosylation, and ubiquitination.
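The grouping of different genetic changes under one protein-level mutation can be sketched as follows. This is an illustration only and does not reflect MOKCa's schema or code: the record layout and function names are hypothetical, the example records are illustrative rather than taken from the database, and the parser handles only simple single-residue substitutions written in one-letter code.

```python
import re
from collections import defaultdict

# Matches substitutions such as "V600E": reference residue, position, new residue.
PROTEIN_CHANGE = re.compile(r"^([A-Z])(\d+)([A-Z])$")

def parse_protein_change(change):
    """Split a substitution like 'V600E' into ('V', 600, 'E')."""
    match = PROTEIN_CHANGE.match(change)
    if match is None:
        raise ValueError(f"unsupported protein change: {change!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt

def group_by_protein_change(records):
    """Collect coding-DNA changes that lead to the same protein-level mutation.

    `records` is an iterable of (gene, cdna_change, protein_change) tuples.
    """
    grouped = defaultdict(list)
    for gene, cdna_change, protein_change in records:
        parse_protein_change(protein_change)  # validate the notation
        grouped[(gene, protein_change)].append(cdna_change)
    return dict(grouped)

# Illustrative records: two nucleotide changes producing the same substitution.
records = [
    ("BRAF", "c.1799T>A", "V600E"),
    ("BRAF", "c.1799_1800delinsAA", "V600E"),
    ("BRAF", "c.1798G>A", "V600M"),
]
grouped = group_by_protein_change(records)
for (gene, change), cdna_changes in grouped.items():
    print(gene, change, cdna_changes)
```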

The amino acid sequence for each protein has been scanned, and then aligned against the Protein Data Bank (PDB) and AlphaFold, to map the mutation onto the experimentally determined human protein structure or, when none is available, the AlphaFold model. A 3D-cartoon image is generated of the highest homology structure in which the equivalent to the affected residue was structurally defined, with that residue highlighted. Hyperlinks are provided to launch an interactive session with the JSmol viewing applet, in which the structure can be examined in 3D. The structural impact of missense mutations is being calculated using the SAAP algorithm.

The web interface for MOKCa can be searched by gene name or by UniProt accession code. Users can also browse the database. We have also predefined subsets of proteins that are important in cancer, including protein kinases, oncogenes and tumour suppressors, proteins involved in the DNA damage response (DDR) and those proteins that are current targets of chemotherapy and personalised cancer medicine regimes (drug targets).



3:15pm - 3:30pm

Developing Training Materials for Structural Biology

Paulyna Magana

EMBL-EBI, United Kingdom

Structural biology is a rapidly evolving field with a wide range of applications, including drug discovery, materials science, and biomedicine. To meet the growing demand for skilled structural biologists, there is a need for high-quality training materials that incorporate the latest advances in the field, such as AlphaFold.

Developing effective training materials for structural biology is challenging due to the complex and interdisciplinary nature of the field. However, it is essential to ensure that everyone has the opportunity to learn about and use AlphaFold, a groundbreaking deep learning algorithm that is revolutionising the way we predict protein structure.

I am working with DeepMind to develop comprehensive training materials on AlphaFold. These materials will be designed for a variety of audiences, including undergraduate and graduate students, postdoctoral fellows, and early-career scientists. They will cover all aspects of AlphaFold, from the underlying theory to practical applications.

The training materials will be delivered through a variety of formats, including online courses, workshops, and tutorials. They will incorporate engaging and innovative teaching methods, such as hands-on exercises, case studies, and interactive simulations.

 
3:30pm - 4:00pm CB1: Coffee break
Location: Forum Hall Foyer 3
4:00pm - 6:00pm S2: Activity 2 - To create open resources for sharing, integrating and benchmarking software tools for modelling the proteome in 3D
Location: Chamber Hall
Session Chair: Shoshana Wodak
 
4:00pm - 4:30pm

An atlas of protein homo-oligomerization across domains of life

Hugo Schweke

Weizmann Institute of Science, Israel

Protein structures are essential to understand cellular processes in molecular detail. While advances in AI revealed the tertiary structure of proteins at scale, their quaternary structure remains mostly unknown. Here, we describe a scalable strategy based on AlphaFold2 to predict homo-oligomeric assemblies across four proteomes spanning the tree of life. We find that 50% of archaeal, 45% of bacterial, and 20% of eukaryotic proteomes form homomers. Our predictions accurately capture protein homo-oligomerization, recapitulate megadalton complexes, and unveil hundreds of novel homo-oligomer types. Analyzing these datasets reveals coiled-coil regions as major enablers of quaternary structure evolution in Eukaryotes. Integrating these structures with omics data shows that a majority of known protein complexes are symmetric. Finally, these datasets provide a structural context for interpreting disease mutations, which we find enriched at interfaces. Our strategy is applicable to any organism and provides a comprehensive view of homo-oligomerization in proteomes, protein networks, and disease.



4:30pm - 5:00pm

Datasets and models for modeling of antibody-antigen complexes

Dina Schneidman

The Hebrew University of Jerusalem, Israel

Antibody repertoires are highly diverse, enabling responses to a wide range of pathogens. While sequencing of an individual's antibody repertoires is becoming common, identifying the antigens they recognize requires costly low-throughput experiments. Even when the antigen is known, epitope mapping is still challenging: experimental approaches are costly and low-throughput, and computational ones are not sufficiently accurate. Recently, deep learning models, pioneered by AlphaFold2, have revolutionized Structural Biology by predicting highly accurate structures of proteins and protein complexes. However, they rely on evolutionary information that is not available for antibody-antigen interactions. Traditional computational epitope mapping is based on a two-step approach: structure modeling (folding) of the antibodies, followed by docking of the predicted antibody structure to the corresponding antigen. The problem with this sequential approach is that the folding step does not consider the structural changes of the antibody upon antigen binding and the docking step is inaccurate because the antibody is considered rigid. We develop a deep learning end-to-end fold&dock model that simultaneously performs antibody folding and docking tasks, given an antibody sequence and its corresponding antigen structure. For training and testing the model, we extract experimental structures of antibodies and antibody-antigen complexes from the SAbDab database and split them into train and test sets according to antibody sequence identity. The fold&dock model produces the 3D coordinates of the entire antibody-antigen complex, including the side chains. An accurate model is detected among the Top-5 and Top-100 predictions for 28% and 70% of the test set, respectively. In addition to mining antibody repertoires, the method can be used in antibody-based drug and vaccine design.
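The Top-5 and Top-100 figures are success rates: the fraction of test cases for which at least one accurate model appears among the first k ranked predictions. A minimal sketch of that calculation is given below; the function name and the toy data are hypothetical, and what counts as an "accurate" model is left to the caller.

```python
def top_k_success(ranked_hits, k):
    """Fraction of targets with at least one accurate model in the top k.

    `ranked_hits` holds one list per target, with booleans in ranked order
    (True marks a prediction judged accurate).
    """
    successes = sum(1 for flags in ranked_hits if any(flags[:k]))
    return successes / len(ranked_hits)

# Toy data: four targets, three ranked predictions each.
toy = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]
print(top_k_success(toy, 1))  # 0.25
print(top_k_success(toy, 3))  # 0.75
```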



5:00pm - 5:15pm

Discriminating physiological from non-physiological interfaces in structures of protein complexes: a community-wide study

Emmanuel Levy

Elixir Activity II community

Reliably scoring and ranking candidate models of protein complexes and assigning their oligomeric state from the structure of the crystal lattice represent outstanding challenges. A community-wide effort was launched to tackle these challenges. The latest resources on protein complexes and interfaces were exploited to derive a benchmark dataset consisting of 1677 homodimer protein crystal structures, including a balanced mix of physiological and non-physiological complexes. The non-physiological complexes in the benchmark were selected to bury a similar or larger interface area than their physiological counterparts, making it more difficult for scoring functions to differentiate between them. Next, 252 functions for scoring protein-protein interfaces previously developed by 13 groups were collected and evaluated for their ability to discriminate between physiological and non-physiological complexes. A simple consensus score generated using the best performing score of each of the 13 groups, and a cross-validated Random Forest (RF) classifier were created. Both approaches showed excellent performance, with an area under the Receiver Operating Characteristic (ROC) curve of 0.93 and 0.94, respectively, outperforming individual scores developed by different groups. Additionally, AlphaFold2 engines recalled the physiological dimers with significantly higher accuracy than the non-physiological set, lending support to the reliability of our benchmark dataset annotations. Optimizing the combined power of interface scoring functions and evaluating it on challenging benchmark datasets appears to be a promising strategy.
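To make the evaluation concrete, the sketch below combines several scoring functions into a simple consensus and measures its area under the ROC curve, computed as the probability that a randomly chosen physiological complex scores higher than a randomly chosen non-physiological one. The way the consensus is formed here (averaging min-max normalised scores) is an assumption for illustration, since the abstract does not specify it, and the scores and labels are toy values, not benchmark data.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count half)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

def consensus(score_table):
    """Average min-max normalised scores over scoring functions.

    `score_table` has one row per scoring function and one column per complex.
    """
    normalised = []
    for row in score_table:
        low, high = min(row), max(row)
        span = (high - low) or 1.0  # avoid division by zero for flat rows
        normalised.append([(value - low) / span for value in row])
    return [sum(column) / len(score_table) for column in zip(*normalised)]

# Toy benchmark: 1 = physiological, 0 = non-physiological.
labels = [1, 1, 1, 0, 0, 0]
score_table = [
    [0.9, 0.4, 0.8, 0.5, 0.2, 0.1],
    [0.7, 0.9, 0.3, 0.2, 0.4, 0.1],
    [5.0, 4.0, 4.5, 1.0, 4.2, 2.0],
]

individual_aucs = [roc_auc(row, labels) for row in score_table]
consensus_auc = roc_auc(consensus(score_table), labels)
print(individual_aucs)
print(consensus_auc)
```

In this toy case each scoring function misranks a different complex, so the averaged score separates the two classes better than any single function, which mirrors the qualitative finding reported above.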



5:15pm - 5:30pm

Explaining Conformational Diversity in Protein Families through Molecular Motions

Valentin Lombard1, Sergei Grudinin2, Elodie Laine1,3

1Sorbonne University, France; 2Université Grenoble Alpes, CNRS, France; 3Institut universitaire de France (IUF)

In contrast to protein structure prediction, conformational flexibility modeling lacks high-quality data for training and benchmarking methods. As part of the Elixir 3D-BioInfo Community effort, we developed DANCE (Dimensionality ANalysis for protein Conformational Exploration), an automated and efficient pipeline for recapitulating the conformational diversity observed in a set of input protein 3D structures. DANCE generates protein- or protein family-specific conformational ensembles. It estimates the amount and complexity of their conformational diversity by extracting linear motions. We applied it to all experimentally resolved protein 3D structures. We investigated how the ensembles transform upon family expansion. We identified a set of 12 conformational ensembles that could serve as a reference benchmark for the community. They display a wide variety of motion amplitudes and complexity. Some represent proteins whose structures and dynamics have been extensively studied, like calmodulin, while others are archetypal examples of allostery or of ligand-associated opening-closing motions. Several proteins are of therapeutic interest. We assess the ability of classical manifold learning techniques to reconstruct conformational states from this benchmark set.
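Extracting linear motions from an ensemble is commonly done with principal component analysis of the superimposed coordinates. The sketch below finds the dominant motion and the fraction of positional variance it explains, using power iteration on the coordinate covariance matrix. It is a generic illustration and not DANCE's code: it assumes the conformations are already superimposed, and the function name and toy ensemble are hypothetical.

```python
import math

def principal_motion(ensemble, iterations=200):
    """Return (unit vector of the dominant linear motion, variance fraction).

    `ensemble` is a list of conformations, each a flat list of coordinates,
    assumed to be superimposed already.
    """
    n, dim = len(ensemble), len(ensemble[0])
    mean = [sum(conf[i] for conf in ensemble) / n for i in range(dim)]
    centred = [[conf[i] - mean[i] for i in range(dim)] for conf in ensemble]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(dim)]
           for i in range(dim)]
    total_variance = sum(cov[i][i] for i in range(dim))

    # Power iteration converges to the eigenvector with the largest eigenvalue.
    vector = [1.0 / math.sqrt(dim)] * dim
    for _ in range(iterations):
        product = [sum(cov[i][j] * vector[j] for j in range(dim))
                   for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]

    eigenvalue = sum(vector[i] * sum(cov[i][j] * vector[j] for j in range(dim))
                     for i in range(dim))
    return vector, eigenvalue / total_variance

# Toy ensemble: two atoms in 2D, flattened as [x1, y1, x2, y2].
# Atom 2 slides along x with a small wobble in y; atom 1 stays fixed.
ensemble = [
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 2.0, -0.1],
    [0.0, 0.0, 3.0, 0.1],
    [0.0, 0.0, 4.0, -0.1],
]
motion, explained = principal_motion(ensemble)
print(motion)
print(explained)
```

A single component that explains nearly all of the variance indicates a simple, essentially one-dimensional motion, whereas variance spread over many components points to more complex conformational diversity.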



5:30pm - 5:45pm

Systematic identification and characterisation of domain movements in proteins from low-dimensional representations of conformational ensembles

Sergei Grudinin1, Steven Hayward2

1LJK CNRS, Grenoble, France; 2University of East Anglia, UK

Identifying rigid parts in dynamic systems has numerous biological applications, from protein domain detection [1] to automatic coarse-graining in molecular simulations [2, 3]. Various solutions have been proposed for this problem, such as using graph theory [4-5], examining the covariance patterns in elastic network models [6], employing machine-learning-based approaches [7], or developing specific techniques based on knowledge of multiple protein conformations [8]. In the Elixir implementation study, we modified existing strategies for dealing with multi-conformer proteins from the same family and created new techniques for analyzing the collected structural ensembles. I will present the methods and results from our analysis and show low-dimensional representations of protein motions.

[1] S. Jones, M. Stewart, A. Michie, M. B. Swindells, C. Orengo, and J. M. Thornton, “Domain assignment for protein structures using a consensus approach: characterization and analysis,” Protein Science, vol. 7, no. 2, pp. 233–242, 1998.

[2] Z. Zhang, L. Lu, W. G. Noid, V. Krishna, J. Pfaendtner, and G. A. Voth, “A systematic methodology for defining coarse-grained sites in large biomolecules,” Biophysical journal, vol. 95, no. 11, pp. 5073–5083, 2008.

[3] S. Kmiecik, D. Gront, M. Kolinski, L. Wieteska, A. E. Dawid, and A. Kolinski, “Coarse-grained protein models and their applications,” Chemical reviews, vol. 116, no. 14, pp. 7898–7936, 2016.

[4] J. Sim, J. Sim, E. Park, and J. Lee, “Method for identification of rigid domains and hinge residues in proteins based on exhaustive enumeration,” Proteins: Structure, Function, and Bioinformatics, vol. 83, no. 6, pp. 1054–1067, 2015.

[5] D. J. Jacobs, A. J. Rader, L. A. Kuhn, and M. F. Thorpe, “Protein flexibility predictions using graph theory,” Proteins: Structure, Function, and Bioinformatics, vol. 44, no. 2, pp. 150–165, 2001.

[6] O. Keskin, S. R. Durell, I. Bahar, R. L. Jernigan, and D. G. Covell, “Relating molecular flexibility to function: a case study of tubulin,” Biophysical journal, vol. 83, no. 2, pp. 663–680, 2002.

[7] J. Wells, A. Hawkins-Hooker, N. Bordin, C. Orengo, and B. Paige, “Chainsaw: protein domain segmentation with fully convolutional neural networks,” bioRxiv, pp. 2023–07, 2023.

[8] S. Hayward and H. J. Berendsen, “Systematic analysis of domain motions in proteins from conformational change: new results on citrate synthase and T4 lysozyme,” Proteins: Structure, Function, and Bioinformatics, vol. 30, no. 2, pp. 144–154, 1998.



5:45pm - 6:00pm

FAIR workflow to chart and characterize the conformational landscape of native proteins. A combined work of ELIXIR 3D-BioInfo structural community and the BioExcel Centre of Excellence for Computational Biomolecular Research

Adam Hospital Gasch

IRB Barcelona, Spain

The Implementation Study “Building on PDBe-KB to chart and characterize the conformational landscape of native proteins” from the ELIXIR 3D-BioInfo structural community aims to create an infrastructure to chart the experimentally sampled conformational diversity of native proteins by exploiting data from the PDB, augmented with results of state-of-the-art computational tools. The project was designed from scratch to integrate these tools into FAIR workflows: Findable, Accessible, Interoperable and Reproducible/Reusable. The BioExcel Center of Excellence for Computational Biomolecular Research has designed and developed a software library for interoperable and reproducible biomolecular simulation workflows, the BioExcel Building Blocks (BioBB). The combination of the BioBB library and the ELIXIR 3D-BioInfo Implementation Study tools is producing a set of FAIR workflows that are Findable from standard search engines and registries (bio.tools, BioSchemas, WorkflowHub), Accessible from standard repositories (GitHub, BioConda), Interoperable across different workflow languages (Jupyter Notebooks, Galaxy, CWL), and Reproducible/Reusable thanks to software packaging systems (Pip, BioConda, BioContainers).

 
6:00pm - 9:00pmPosters: Posters
Location: Forum Hall Foyer 3
 

Isolation and functional analysis of phage‐displayed antibody fragments targeting the staphylococcal superantigen‐like proteins

Ida Alanko1, Rebecca Sandberg1, Eeva‐Christine Brockmann2, Carla J. C. de Haas3, Jos A. G. van Strijp3, Urpo Lamminmäki2, Outi M. H. Salo-Ahen1

1Faculty of Sciences and Engineering, Pharmaceutical Sciences Laboratory (Pharmacy) & Structural Bioinformatics Laboratory (Biochemistry) Turku, Åbo Akademi University, Turku, Finland; 2Department of Life Technologies, University of Turku, Turku, Finland; 3Department of Medical Microbiology, University Medical Center Utrecht, Utrecht University, Utrecht, The Netherlands

Staphylococcus aureus produces an arsenal of virulence factors that manipulate the immune system, helping the bacteria avoid phagocytosis. In this study we investigate two of these evasion molecules, the staphylococcal superantigen-like proteins 1 and 5 (SSL1 and SSL5). Both SSLs inhibit vital host immune processes and contribute to S. aureus immune evasion, e.g. by inhibiting matrix metalloproteinase 9 (MMP9), thus limiting chemokine potentiation and neutrophil migration. The aim of this study was to isolate single-chain variable fragment (scFv) antibodies from synthetic antibody phage libraries that can recognize SSL1 and SSL5 and could block the interaction between the SSLs and their respective human targets.

The scFv antibodies were selected after three rounds of panning against SSL1 and SSL5, and their binding activity to SSL1 and SSL5 was studied using a time-resolved fluorescence-based immunoassay. We obtained altogether 27 unique clones displaying binding activity to SSL1 and SSL5. The capability of the scFvs to inhibit the SSLs’ function was tested in various immunoassays, including an MMP9 enzymatic activity assay. We showed that ten scFvs inhibited SSL1 or SSL5 in a concentration-dependent manner. Some antibodies fully restored MMP9 activity after incubation with scFv-bound SSLs.

Finally, the structure of the best inhibiting scFv was modeled and used to create putative scFv‐SSL‐complex models by protein–protein docking. The complex models were subjected to a 100‐ns molecular dynamics simulation to assess the possible binding mode of the antibody.

We have demonstrated that phage display can be used to isolate antibodies that recognize and inhibit virulence factors. The antibodies found here could serve as a basis for developing antivirulence factors against S. aureus infections to help restore the immune system’s capacity and enable more efficient clearance of the bacteria.



Detailed analysis of a thermostable protein-DNA complex: the case of Sac7d as a prototype for protein-DNA interaction

Elena Álvarez1,2, Stéphane Téletchéa1, Simon Huet2, Bernard Offmann1

1Nantes Université, US2B, CNRS, UMR6286 & Affilogic F-44000 Nantes, France; 2Affilogic SAS, F-44000 Nantes, France

Sac7d is a 7 kDa protein belonging to the class of small chromosomal proteins from the archaeon Sulfolobus acidocaldarius. Sac7d was discovered in 1974 in the geysers of Yellowstone National Park and has been studied extensively since then for its remarkable stability over wide pH and temperature ranges. Sac7d binds to the DNA minor groove and raises the DNA melting temperature, thus protecting DNA under these extreme conditions.
In this study, we analyzed the Sac7d-DNA complex using 1 µs molecular dynamics simulations to determine which residues contribute most to DNA binding. The interaction energy of the interface was decomposed using Molecular Mechanics Generalized Born Surface Area (MM/GBSA). We determined that more than 10 residues are critical for DNA recognition. The individual contribution of each residue to the binding interface was in agreement with previously documented results. We provide a novel in-depth focus on the DNA energetics as a consequence of its tethering to Sac7d.



Method for analysis of protein-DNA interactions using probability density maps

Daniel Berdár, Bohdan Schneider, Lada Biedermannová

Institute of Biotechnology, Czech Academy of Sciences, Czech Republic

A plethora of experimental and computational tools are used to improve the understanding of biomolecular structure-function relationships. At the intersection of theoretical and experimental approaches stand data analysis studies that use a large pool of structures available within structural databases, such as the PDB, to show patterns in large ensembles of macromolecular structures.

Here we plan to demonstrate how, using NtCs for the local conformational description of nucleic acids, we can construct probability density maps for DNA fragments. Maps are constructed for relevant combinations of DNA building blocks and selected interacting atoms or groups. These maps can be further examined or superposed to elucidate guiding patterns of interaction at biomolecular interfaces.



Prediction of DNA hydration based on data mining of crystallographic structures

Lada Biedermannová, Bohdan Schneider

Institute of Biotechnology, Czech Republic

Water is a critical factor in stabilizing DNA structure and mediating its interactions. In our study, we harness crystallographic data to establish average hydration patterns around biomolecules, including proteins [1,2,3] and nucleic acids [4,5,6]. Our recent focus has been on exploring DNA hydration as a function of its conformation and sequence.
To gain a more comprehensive understanding, we employed a multi-step approach to determine water probability densities around dinucleotide fragments. Beginning with DNA crystal structures containing water molecules, we conducted an extensive analysis of DNA dinucleotides within an ensemble of 2,727 non-redundant DNA chains, encompassing 41,853 dinucleotides and the associated 316,265 first-shell water molecules [6]. We classified dinucleotides based on their 16 sequences and the previously defined structural classes, known as nucleotide conformers (NtCs). From the DNA structures in the data set, we extracted all dinucleotides with associated water molecules. Subsequently, all waters linked to dinucleotides of a specific NtC/sequence combination were transferred to a reference dinucleotide. Finally, we computed water probability density distributions through Fourier averaging, separately for waters associated with the base and the sugar-phosphate atoms. Peaks in the hydration densities are referred to as Hydration Sites (HSs), and unveil the intricate interplay between base and sugar-phosphate hydration with respect to the sequence and conformation of DNA. The identified hydrated dinucleotide building blocks allowed us to subsequently calculate DNA hydration by determining the probability of water density distributions.
In this poster, we present an overview of our findings and discuss the potential applications of hydrated DNA building blocks for predicting DNA hydration. Our data and predictions are readily accessible for browsing and visualization at the website watlas.datmos.org/watna.
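
The last steps of the procedure above (pooling waters on a reference dinucleotide, building a density and locating its peaks) can be sketched as follows. This is a deliberately simplified stand-in: it bins water positions into cubic voxels instead of the Fourier averaging used in the study, and the voxel spacing, the `min_count` cut-off and the function names are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import product


def density_grid(waters, spacing=0.5):
    """Count waters per cubic voxel.

    waters: iterable of (x, y, z) positions in Å, already transferred to the
    frame of the reference dinucleotide.
    """
    counts = Counter()
    for x, y, z in waters:
        counts[(int(x // spacing), int(y // spacing), int(z // spacing))] += 1
    return counts


def hydration_sites(counts, min_count=3):
    """Voxels with at least min_count waters that no neighbouring voxel exceeds."""
    offsets = [d for d in product((-1, 0, 1), repeat=3) if any(d)]
    peaks = []
    for voxel, count in counts.items():
        if count < min_count:
            continue
        neighbours = (tuple(v + d for v, d in zip(voxel, off)) for off in offsets)
        # .get() avoids inserting new keys into the Counter while iterating.
        if all(counts.get(nb, 0) <= count for nb in neighbours):
            peaks.append((voxel, count))
    return sorted(peaks, key=lambda item: -item[1])


# Toy input: a cluster of four waters, one water next to it, one isolated water.
toy_waters = [(1.1, 1.2, 1.3), (1.2, 1.1, 1.4), (1.3, 1.3, 1.1), (1.4, 1.2, 1.2),
              (1.6, 1.2, 1.2), (5.2, 5.2, 5.2)]
print(hydration_sites(density_grid(toy_waters)))
```

Dividing each voxel count by the number of contributing dinucleotides would turn the counts into an occupancy-like probability, which is closer in spirit to the densities described in the abstract.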

[1] Biedermannová L. & Schneider B.: Structure of the ordered hydration of amino acids in proteins: analysis of crystal structures. Acta Crystallographica D71, 2192-2202 (2015).
[2] Černý J., Schneider B. & Biedermannová L.: WatAA: Atlas of Protein Hydration. Exploring synergies between data mining and ab initio calculations. Phys. Chem. Chem. Phys. 19, 17094 (2017).
[3] Biedermannová L. & Schneider B.: Hydration of proteins and nucleic acids: Advances in experiment and theory. A review. Biochimica et Biophysica Acta - General Subjects 1860, 1821-1835 (2016).
[4] Schneider B. & Berman H.M.: Hydration of the DNA Bases Is Local. Biophysical J. 69, 2661-2669 (1995).
[5] Schneider B., Patel K. & Berman H.M.: Hydration of the Phosphate Group in Double-Helical DNA. Biophysical J. 75, 2422-2434 (1998).
[6] Biedermannová L., Černý, J., Malý, M., Nekardová, M., Schneider, B.: Prediction of DNA hydration from knowledge-based hydrated building blocks. Acta Cryst. D78, 1032-1045 (2022).



Conformational study of small carbon rings in ligands

Gabriela Bučeková1, Viktoriia Doshchenko1, Radka Svobodová1,2

1National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Kamenice 753/5, 625 00 Brno, Czech Republic; 2CEITEC - Central European Institute of Technology, Masaryk University, Kamenice 753/5, 625 00 Brno, Czech Republic

The Protein Data Bank (PDB), the primary repository for structures of proteins and other biomacromolecules, provides access to a large body of structural data. However, the large number of deposited structures may lead to the presence of errors in the data. Early validation approaches primarily focused on the geometric properties of standard biomacromolecular residues, leading to the release of PDB validation reports. Later, these reports were extended to ligand validation, but some validation aspects are still not covered.

This study focuses on one of these overlooked areas of validation. We examined the conformations of basic rings found within small molecules in the PDB, including cyclopentane, cyclohexane and benzene. Inaccurate determination of ring conformation within ligands can significantly influence the geometric properties of a macromolecule or molecular fragments in the surrounding area. Our analysis extends to investigating the underlying reasons for selecting energetically unfavourable conformations of rings.



Predicting the effects of mutations on protein-protein interactions via self-supervised machine learning

Anton Bushuiev1, Roman Bushuiev1,4, Anatolii Filkin1, Petr Kouba1,2, Marketa Gabrielova1, Jiri Sedlar1, Tomas Pluskal4, Jiri Damborsky2,3, Stanislav Mazurenko2,3, Josef Sivic1

1Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague; 2Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University; 3International Clinical Research Center, St. Anne's University Hospital Brno; 4Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences

The design of protein-protein interactions (PPIs) is critical for advancing biomedical research and therapeutic development. However, current machine-learning approaches for scoring protein binder variants have only limited applicability in practical scenarios. We find that the major bottleneck of existing tools is their poor generalization due to the reliance on scarce mutational libraries. Here, we propose PPIformer – a novel approach to designing PPIs for higher binding affinity via self-supervised geometric deep learning. We train a 3D equivariant transformer neural network to reconstruct masked protein-protein interfaces from a newly assembled dataset of interactions comprehensively mined from the Protein Data Bank. Our results demonstrate that by solving self-supervision objectives, PPIformer learns the distributions of stabilizing interface mutations, enabling the scoring and design of improved protein-protein interaction variants. We highlight our findings through the case study on engineering staphylokinase for higher thrombolytic activity via enhanced binding affinity towards plasmin.



Mol*VS: Visualization and Interpretation of Cell Imaging Data alongside Macromolecular Structure Data and Biological Annotations

Aliaksei Chareshneu1, Adam Midlik1, Alessio Cantara1, Crina-Maria Ionescu1, Alexander Rose2, Vladimír Horský1, Radka Svobodová1,3, Karel Berka4, David Sehnal1,3

1National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Brno, 602 00, Czech Republic; 2Mol* Consortium - San Diego, CA, USA; 3CEITEC - Central European Institute of Technology, Masaryk University, Brno, 625 00, Czech Republic; 4Department of Physical Chemistry, Faculty of Science, Palacký University Olomouc, Olomouc, 779 00, Czech Republic

Segmentation plays a crucial role in interpreting biological imaging data. As automated segmentation tools have advanced, public repositories for imaging data have expanded their capabilities to include sharing and visualizing segmentations. This evolution has led to a growing demand for interactive web-based visualization of 3D volume segmentations. To address this challenge we have created Mol* Volumes and Segmentations (Mol*VS), which enables the interactive, web-based visualization of cellular imaging data, complemented by macromolecular data and biological annotations. Mol*VS seamlessly integrates with Mol* Viewer, a platform already adopted by several public repositories. Notably, Mol*VS provides access to all EMDB and EMPIAR entries featuring segmentation datasets, supporting the visualization of data generated across a broad spectrum of electron and light microscopy experiments. Mol*VS is an open-source solution, freely accessible at https://molstarvolseg.ncbr.muni.cz/.



Application and analysis of protein language models for predicting membrane interacting peptide regions

Máté Csepi1, Gábor Erdős2, Zsuzsanna Dosztányi2, Tamás Hegedűs1,3

1Semmelweis University, Hungary; 2Eötvös Loránd University, Hungary; 3Eötvös Loránd Research Network, Hungary

Interaction between proteins and lipids plays a crucial role in numerous cellular processes. As in protein-protein interactions, the peptide segments involved may be intrinsically disordered regions (IDRs), whose dynamics decrease upon lipid binding, which may even include gaining secondary structure. We previously collected proteins with lipid-interacting IDRs based on experimental data and named the interacting segments membrane molecular recognition features (MemMoRFs; https://memmorf.hegelab.org) [1]. Here we aimed to use this dataset to establish a predictor that avoids the tedious experiments needed to identify membrane-interacting IDRs. Since the available data set is small for conventional machine learning methods [2], we employed protein language models (pLMs), including T5 and Ankh. We used logistic regression to select important features of pLM embeddings. The analysis of these feature subsets indicates that MemMoRF properties are encoded in more features in Ankh than in T5 pLM embeddings. This most likely contributes to the generally better performance of Ankh when all features are used in predictions. However, we demonstrate that subsets of features can decrease the noise and thus increase the performance of per-residue embedding-based neural networks for MemMoRF prediction. Our best performing model exhibits AUC, MCC, and F1 of 0.885, 0.655, and 0.827, respectively. We used this model to predict MemMoRFs in the human proteome; both the predictions and the predictor are publicly available at https://plmmorf.hegelab.org.
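
For readers less familiar with the reported metrics, the sketch below shows how MCC and F1 follow from per-residue binary labels (1 = MemMoRF residue, 0 = other). It is a generic illustration with made-up labels and hypothetical function names, not the evaluation code behind the numbers above.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary per-residue labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall; 0.0 if there are no positives at all."""
    tp, fp, fn, _ = confusion(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; 0.0 when any marginal count is zero."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Illustrative labels for eight residues (not real data).
truth = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 1, 0, 0, 0, 1, 0, 1]
print(f1_score(truth, predicted), mcc(truth, predicted))
```

Unlike F1, MCC also rewards correctly predicted negatives, which matters when MemMoRF residues are a small minority of a proteome.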

1. G. Csizmadia, G. Erdős, H. Tordai, R. Padányi, S. Tosatto, Z. Dosztányi, T. Hegedűs (2020). The MemMoRF database for recognizing disordered protein regions interacting with cellular membranes. Nucleic Acids Research, 49, D355–D360. https://doi.org/10.1093/nar/gkaa954

2. S. Basu, T. Hegedűs, L. Kurgan (2023). CoMemMoRFPred: sequence-based prediction of MemMoRFs by combining predictors of intrinsic disorder, MoRFs and disordered lipid-binding regions. Journal of Molecular Biology. https://doi.org/10.1016/j.jmb.2023.168272

Support from grants NKFI-127961 and NKFI-137610 to T.H. and from National Academy of Scientist Education fellowships to M.C. and T.H. is acknowledged. We thank the colleagues of the Governmental Information-Technology Development Agency for computational resources on the Komondor GPU cluster.



SpikePro: a webserver to predict the fitness of SARS-CoV-2 variants

Gabriel Cia Beriain1, Jean Kwasigroch1, Marianne Rooman1,2, Fabrizio Pucci1,2

1Computational Biology and Bioinformatics, Université Libre de Bruxelles, Brussels, Belgium; 2Interuniversity Institute of Bioinformatics in Brussels, Brussels, Belgium

Since the original SARS-CoV-2 Wuhan lineage, successive waves of variants have evolved and spread across the globe. Despite many efforts during the last two years, accurately predicting the fitness of new variants is still a challenge which, if solved, would enable earlier activation of COVID-19 plans long before reaching the peak of new epidemic waves. To that end, we recently developed SpikePro, a structure-based computational model that can predict the viral fitness of a variant from its spike protein sequence quickly and accurately. The model is based on the predicted effect of the variant on the stability of the spike protein as well as on its binding affinity for the angiotensin-converting enzyme 2 (ACE2) and for a set of neutralizing antibodies. Using these predicted values, computed using in-house prediction tools, the model estimates the virus transmissibility, infectivity, immune escape and basic reproduction rate (R0). We show that SpikePro predictions are in very good agreement with epidemiological and clinical data of the main circulating SARS-CoV-2 strains, including the recent Omicron subvariants XBB.1.5 and EG.5.1. In summary, SpikePro contributes to genomic surveillance and viral evolution programs by assessing the fitness of newly emerging SARS-CoV-2 variants. It is freely available through an easy-to-use webserver as well as a command-line tool.



Extension of the SUGRES coarse-grained model of polysaccharides to heparin

Annemarie Danielsson, Sergey A. Samsonov, Adam Liwo, Adam K. Sieradzan

Department of Theoretical Chemistry, Faculty of Chemistry, University of Gdansk, Poland

The glycosaminoglycan heparin (HP) is an unbranched periodic polysaccharide composed of negatively charged disaccharide units and involved in key biological processes, including anticoagulation, angiogenesis, and inflammation. The considerable size and flexibility of naturally occurring HP, as well as the predominantly electrostatic nature of its interaction with proteins, render it a particularly difficult target in all-atom molecular dynamics (MD) simulations of the molecular mechanisms underlying the biologically relevant multiscale processes. Coarse-grained approaches are therefore a promising way to model HP-containing molecular systems.

We have extended the coarse-grained SUGRES-1P model of polysaccharides [1] to HP and modified the interaction energy function to account for a shift of the interaction centres and to enable a direct modification of the electrostatic energy term weight. The implemented parameters were previously obtained using all-atom MD simulations [1,2] with the GLYCAM06 force field [3]. With this modification, we were able to apply the SUGRES-1P force field in microsecond-long MD simulations of free HP oligosaccharides ranging from degree of polymerization 6 to 68. The modelled HP chains exhibited remarkable similarity to experimentally determined HP molecules [4,5] in terms of their global structural characteristics. A comprehensive analysis of the constituent energy term weights and ion concentration, represented by the Debye-Hückel parameter κ, indicates that long HP chains are characterized by coiled conformations governed predominantly by electrostatic interactions established between the charged residues.
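
To show how the Debye-Hückel parameter enters an electrostatic term, the sketch below evaluates the textbook screened Coulomb energy between two point charges. It is not the SUGRES-1P energy function, whose exact form and weights are not given in this abstract; the function name, the dielectric constant of 80 and the conversion factor of 332.06 kcal·Å/(mol·e²) are assumptions taken from common practice.

```python
import math

# Converts e^2/Å to kcal/mol; a commonly used value, assumed here.
COULOMB_CONSTANT = 332.06


def screened_coulomb(q_i, q_j, r, kappa, dielectric=80.0):
    """Debye-Hückel pair energy in kcal/mol.

    q_i, q_j: charges in units of e; r: distance in Å;
    kappa: inverse Debye length in 1/Å (kappa = 0 recovers plain Coulomb).
    """
    if r <= 0.0:
        raise ValueError("distance must be positive")
    return COULOMB_CONSTANT * q_i * q_j * math.exp(-kappa * r) / (dielectric * r)


# Repulsion between two like unit charges 10 Å apart, without and with screening.
print(screened_coulomb(-1.0, -1.0, 10.0, kappa=0.0))
print(screened_coulomb(-1.0, -1.0, 10.0, kappa=0.1))
```

A larger κ (higher ionic strength) damps the repulsion between the charged groups more quickly with distance, which is why chain conformations dominated by electrostatics are sensitive to this parameter.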

We integrated the SUGRES-1P model into the coarse-grained UNICORN model [6,7], enabling microsecond-scale MD simulations of HP interactions with single- and multi-domain proteins. This achievement represents a significant milestone, as it is the first time a "bottom-up" physics-based approach has been used for coarse-grained modelling of HP chains, while maintaining compatibility with other biomolecule classes within the UNICORN modelling package.

References

1. E. Lubecka, A. Liwo, J. Chem. Phys., 147, (2017), 115101.

2. S. A. Samsonov, E. A. Lubecka, K. K. Bojarski, R. Ganzynkowicz, A. Liwo, Biopol. 110, (2010), e23269.

3. K. N. Kirschner, A. B. Yongye, S. M. Tschampel, J. Gonzalez-Outeirino, C. R. Daniels, B. L. Foley, R. J. Woods, J. Comp. Chem. 29, (2008), 622-655.

4. S. Khan, J. Gor, B. Mulloy, S. J. Perkins, J. Mol. Biol. 395, (2010), 504-521.

5. G. Pavlov, S. Finet, K. Tatarenko, E. Korneeva, C. Ebel, Eur. Biophys. 32, (2003), 437-449.

6. A. Danielsson, S. A. Samsonov, A. Liwo, A. K. Sieradzan, J. Chem. Theory Comput. 19, (2023), 6023-6036.

7. A. Liwo, C. Czaplewski, A. K. Sieradzan, et al., Prog. Mol. Biol. Transl. Sci. 170, (2020), 73-122.

This work was supported by the Polish National Science Centre, grants UMO-2018/31/G/ST4/00246, UMO-2021/40/Q/ST4/00035 and UMO-2017/27/B/ST4/00926. Calculations were conducted using the cluster at the Faculty of Chemistry, University of Gdańsk.



Mapping and characterization of the human missense variation universe using AlphaFold 3D models

Elbert I. Timothy, Suhail A. Islam, Gordon Hanna, Michael J.E. Sternberg, Alessia David

Imperial College London, United Kingdom

In the last few years, we have witnessed major developments in both protein structure prediction, as exemplified by AlphaFold 3D models [1], and, most recently, in language-based models for the prediction of genetic variants, including the recently released AlphaMissense [2]. However, these predictors do not provide insights into the structural basis of the phenotypic effect.

Three-dimensional protein structures allow us to perform an atom-based analysis of the consequence of an amino acid substitution and models generated with the deep learning algorithm, AlphaFold, provide a unique opportunity for atom-based analysis of human missense variants. Our group has developed the Missense3D portal to provide structure-based interpretation of missense variants [3]. Here we present the results obtained with a new pipeline that we have developed to automatically identify accurately modelled amino acid regions that can be used for variant characterization with our in-house Missense3D algorithm.

The recommended AlphaFold pLDDT threshold for an accurately modelled residue is >=70. When using this threshold for the query residue, the accuracy of the atom-based predictions calculated using Missense3D on AlphaFold models was 0.66, MCC 0.36, TPR/FPR 5.1. We then compared these results to those obtained with lower pLDDT scores, and after introduction of the PAE matrix score, on a benchmark dataset of 10,085 human proteins harbouring 84,827 missense variants.

We show that, when the model accuracy of the environment surrounding the query residue (E-plDDT-5Å) is considered, an E-plDDT-5Å >=60 provides accuracy, MCC and TPR/FPR similar to those obtained using the pLDDT threshold >=70 for the query residue alone, but increases the number of residues for which an atom-based analysis can be performed.

We applied this new E-plDDT-5Å >=60 threshold to a total of 8,965,659 residues corresponding to 16,325 reviewed human UniProt sequences of lengths ≤ 2,700 amino acids. When using this threshold, 6,169,173 human residues (68.8% of the proteome) are modelled with sufficient quality to allow an atom-based analysis of the query residue and its surrounding environment. At the variant level, confident predictions could be obtained for 4,405,910 (65.9%) out of 6,700,719 unique missense variants mined from the UniProt homo_sapiens_variation.txt database.
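
The environment-based filter can be sketched as follows. The abstract does not give the exact definition of E-plDDT-5Å, so this example assumes it is the mean pLDDT of all residues (query included) whose representative atom lies within 5 Å of the query residue; the function names and the toy coordinates are likewise hypothetical.

```python
import math


def environment_plddt(coords, plddt, query, radius=5.0):
    """Mean pLDDT over residues within `radius` Å of the query residue.

    coords: one representative (x, y, z) per residue, in Å;
    plddt: per-residue pLDDT values; query: index of the query residue.
    The query residue itself is included in the average (an assumption).
    """
    centre = coords[query]
    values = [p for xyz, p in zip(coords, plddt) if math.dist(xyz, centre) <= radius]
    return sum(values) / len(values)


def analysable_residues(coords, plddt, threshold=60.0, radius=5.0):
    """Indices of residues whose environment score reaches the threshold."""
    return [i for i in range(len(coords))
            if environment_plddt(coords, plddt, i, radius) >= threshold]


# Toy chain of four residues along the x axis (illustrative values only).
toy_coords = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
toy_plddt = [90.0, 65.0, 40.0, 30.0]
print(analysable_residues(toy_coords, toy_plddt))
```

In the toy chain, the second residue has a pLDDT of 65 and would fail a per-residue cut-off of 70, yet its environment score passes the cut-off of 60; this is the kind of residue that the environment-based threshold adds to the analysable set.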

In conclusion, AlphaFold 3D models offer a unique opportunity to understand the consequences of amino acid substitutions on protein structure, thus complementing existing evolutionary-based methods.

1. Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al.: Highly accurate protein structure prediction with AlphaFold. Nature, 596, (2021), 583-589.

2. Cheng J, Novati G, Pan J, Bycroft C, Žemgulytė A, Applebaum T, Pritzel A, Wong LH, Zielinski M, Sargeant T, et al.: Accurate proteome-wide missense variant effect prediction with AlphaMissense. Science, 381, (2023), eadg7492.

3. Ittisoponpisan S, Islam SA, Khanna T, Alhuzimi E, David A, Sternberg MJE: Can Predicted Protein 3D Structures Provide Reliable Insights into whether Missense Variants Are Disease Associated? J. Mol. Biol., 431, (2019), 2197–2212.



Computational resources for analyzing transmembrane protein structures and topology

Laszlo Dobson1,2, Gabor Tusnady1

1Research Centre for Natural Sciences, Magyar Tudosok Korutja, 1117-Budapest, Hungary; 2European Molecular Biology Laboratory, Meyerhofstrasse 1, 69117-Heidelberg, Germany

The Membrane Protein Bioinformatics Research Group hosts and maintains several widely used resources related to transmembrane protein structures.

In 2021, AlphaFold2 (AF2) opened new frontiers for almost all fields of structural biology and provided 3D structures for almost all known protein sequences. In the Transmembrane AlphaFold database (TmAlphaFold database, https://tmalphafold.ttk.hu/) we use a simple geometry-based method to visualize the likeliest position of the membrane plane using AF2 structures as a source. In addition, we calculate several parameters to evaluate the location of the protein in the membrane. This also allows the TmAlphaFold database to show whether the predicted 3D structure is realistic or not.

We also overhauled several other popular resources and combined them in the UNIfied database of TransMembrane Proteins (UniTmp, https://www.unitmp.org/). UniTmp is a comprehensive and freely accessible resource of transmembrane protein structural information at different levels, from localization of protein segments, through the topology of the protein, to the membrane-embedded 3D structure. We not only annotated tens of thousands of new structures and experiments, but also developed a new system that can serve these resources in parallel. UniTmp is a unified platform that merges the TOPDB (Topology Data Bank of Transmembrane Proteins), TOPDOM (database of conservatively located domains and motifs in proteins), PDBTM (Protein Data Bank of Transmembrane Proteins), and HTP (Human Transmembrane Proteome) databases and provides interoperability between them.

In the near future we plan to integrate more databases and web servers into the UniTmp framework, so that researchers will be able to find all membrane protein-related information in one place.



On modelling signalling amyloid motifs

Witold Dyrka1, Jakub Gałązka1, Marlena Gąsior-Głogowska1, Krzysztof Pysz1, Monika Szefczyk2, Natalia Szulc1,3

1Katedra Inżynierii Biomedycznej, Wydział Podstawowych Problemów Techniki, Politechnika Wrocławska, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland; 2Katedra Chemii Bioorganicznej, Wydział Chemiczny, Politechnika Wrocławska, Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland; 3Katedra Fizyki i Biofizyki, Wydział Biotechnologii i Nauk o Żywności, Uniwersytet Przyrodniczy we Wrocławiu, Norwida 25, 50-375 Wrocław, Poland

Amyloid motifs are very short domains (typically around 25 amino acids) that facilitate protein aggregation into a polymeric fibrillary structure or the amyloid fold. Apart from the involvement in pathological amyloidoses, the amyloid fold is vital for many physiological functions including oligomerization and signal transduction. Despite considerable sequential diversity and partial similarity to intrinsically disordered regions, known functional amyloid motifs typically assume the beta-arch conformation and aggregate into beta-solenoids through stacking of the beta strands. Identification of functional amyloids in sequence databases is difficult. Existing computational tools for assessing propensity to form amyloids typically focus on the amyloidogenic hotspots or short peptides forming amyloid fibrils in vitro. These tools are not calibrated for meaningful searches in entire genomes as it is estimated that most proteins contain amyloidogenic hotspots [1]. An alternative approach is to model particular families of amyloidogenic sequences, which can be a viable option for already identified motifs (e.g. Pfam profiles HET-s_218-289 and RHIM). However, as amyloidogenic motifs are relatively short and quite diverse internally, traditional tools based on k-mers or profile Hidden Markov Models do not offer enough statistical power for more generalised genome-wide searches.

Signalling amyloid motifs work in pairs (or triplets), where one motif triggers the other to assume the amyloid fold in a prion-like manner. To date, the most successful searches relied on identification of larger domains associated with the amyloid motifs combined with filtering based on genomic proximity and sequential similarity of potentially cooperating sequences. This led to the identification of around a thousand signalling amyloids representing more than a dozen families associated with the Nod-like receptors (NLRs, innate immune system proteins) in filamentous fungi, bacteria and archaea [2, 3]. These datasets have been used to develop machine learning models capable of accounting for non-local dependencies resulting from the spatial fold. Our generalised model of several families of signalling amyloid motifs, based on probabilistic context-free grammars (git.e-science.pl/wdyrka/pcfg-cm), demonstrated good specificity and sensitivity, enabling searches in collections of thousands of NLR-associated proteins [1]. This facilitated the discovery of new motif families in sac fungi and made it possible to explore the signalling amyloidosome of filamentous basidiomycota [4]. Currently, we are developing deep learning methods aimed at achieving higher specificity, enabling genome-wide searches without the need for prefiltering (github.com/jakub-galazka/asmscan-lstm, github.com/chrispysz/amylotool-console).

In some cases, even relatively small high-quality alignments of several dozen sequences provide enough information to predict structural models of amyloid structures made of several copies of a motif. For example, using AlphaFold 2, we obtained a plausible model of a HET-s-related motif [2, 4]. However, the main issue with applying general protein structure prediction protocols to amyloidogenic peptides is their insensitivity to single point mutations that affect formal charges, even though such mutations were shown experimentally to alter the aggregation propensity of amyloids [5, 6]. Despite this limitation, AlphaFold-style predictions of a potential fold of the motif can be a useful step preceding more fine-tuned investigation with other methods, such as molecular dynamics [6].

The short length of typical amyloidogenic peptides means that they can be synthesised relatively easily, although the process has its peculiarities due to their high aggregation propensity [1, 4-6]. The use of synthetic peptides enables quite rapid experimental verification of the modelling in vitro. For example, infrared spectroscopy can be used to establish the structural content and rigidity reflecting the aggregation stage, often in conjunction with an imaging method such as atomic force microscopy. In addition, the kinetics of aggregation is typically evaluated through the Thioflavin T assay [4]. As properties of functional amyloids are often very sensitive to even slight changes in environmental conditions such as temperature, pH or the presence of certain ions, we propose that the in vitro validation of the aggregation process be performed over a grid of conditions.

The ultimate goal is predicting and modelling their heterotypic interactions, such as cross-seeding and hetero-aggregation, with the hope of deciphering potential triggers of amyloidosis [8]. However, as of now, data on cross-interactions of pathological amyloids are too scarce and sparse for training robust machine learning models. One possibility is to fall back on threading-based methods [9]. At the same time, the growing number of evolutionarily coupled pairs of signalling amyloid motifs also opens perspectives for machine learning approaches. So far, the identification of NLR-related interacting pairs has relied on genomic proximity and sequential similarity of candidate motifs. However, in some cases at least one of these conditions may not be met [4], which necessitates a direct prediction of the interaction. Experimental evidence for the presence of so-called gate-keeper residues of the aggregation process [5, 10] suggests that such prediction is a viable option at least for functional amyloids. It can be expected that concerted use of modelling and experimental methods could pave the way for decoding interaction networks of signalling amyloids within and between genomes, for better understanding their interactions with other proteins (e.g. beta-solenoid tandem repeats), and even for their use in biocomputing.

1. W. Dyrka, M. Gąsior-Głogowska, M. Szefczyk, N. Szulc, BMC Bioinformatics, 22, (2021), 222.
2. A. Daskalov, W. Dyrka, S. Saupe, Sci Rep, 5, (2015), 12494.
3. W. Dyrka, V. Coustou, A. Daskalov, A. Lends, et al., J Mol Biol, 432, (2020), 6005.
4. J.W. Wojciechowski, E. Tekoglu, M. Gąsior-Głogowska, V. Coustou, et al., PLoS Comput Biol, 18, (2022), e1010787.
5. N. Szulc, M. Gąsior-Głogowska, J.W. Wojciechowski, M. Szefczyk, et al., Int J Mol Sci, 22, (2021), 5127.
6. N. Szulc, M.E. Gąsior-Głogowska, P. Żyłka, M. Szefczyk, et al., ssrn.com/abstract=4521809.
7. J. Jumper, R. Evans, A. Pritzel, T. Green, et al., Nature, 596, (2021), 583.
8. R.P. Friedland, M.R. Chapman, PLoS Pathog, 13, (2017), e1006654.
9. J.W. Wojciechowski, W. Szczurek, N. Szulc, M. Szefczyk, et al., bioRxiv, 2022.07.07.499150.
10. A. Daskalov, D. Martinez, V. Coustou, N. El Mammeri, et al., Proc Natl Acad Sci USA, 118, (2020), e2014085118.



AHoJ DB: A PDB-wide assignment of apo & holo relationships based on individual protein-ligand interactions

Christos Feidakis1, Radoslav Krivak2, David Hoksza2, Marian Novotny1

1Charles University, Faculty of Science, Department of Cell Biology, Czech Republic; 2Charles University, Faculty of Science, Department of Software Engineering, Czech Republic

A repertoire of scenarios in structural biology requires access to multiple snapshots of a protein. From studying protein dynamics to unveiling cryptic binding sites, from assessing the effectiveness of ligand binding site prediction software to building datasets for training such machine learning predictors, a single protein structure is rarely sufficient to capture or explain the variability of a protein.

The availability of both bound (holo) and unbound (apo) forms of a protein structure is essential for making meaningful comparisons and drawing robust conclusions. The few existing resources that provide access to such data are limited either in terms of protein coverage or in the number of structure pairs provided, which does not always reflect the conformational variance represented by the structures deposited in the Protein Data Bank (PDB).

Here, we use a previously designed application (AHoJ, Apo-Holo Juxtaposition) to perform an extensive search for apo-holo pairs for each individual protein-ligand interaction across the PDB (excluding interactions with peptides and nucleic acids). We assemble the results of ~500,000 small molecule interactions into a database that can be used to train and evaluate predictors, discover potentially druggable proteins, and reveal associations that can confirm existing hypotheses or expose protein- and ligand-specific relationships like order-to-disorder transitions that were previously obscured by intermittent or partial data, or discover specific binding properties of individual ligands.



Toward a database of genetically encoded non-canonical amino acids

Carolina F. Rodrigues1,2, Antoine Daina3,4, Marta A.S. Perez3,4, Vincent Zoete3,4, Bohdan Schneider1, Gustavo Fuertes1

1Institute of Biotechnology of the Czech Academy of Sciences, Czech Republic; 2Instituto Politécnico de Setúbal, Escola Superior de Tecnologia de Barreiro, Portugal; 3University of Lausanne, Switzerland; 4SIB Swiss Institute of Bioinformatics Lausanne, Switzerland

SwissSideChain [1] is the only database specifically devoted to non-natural sidechains and displays general properties (physical, structural, and molecular data) for 210 amino acids in both L- and D-configuration. Additionally, the provided modeling tools permit the straightforward insertion of these unnatural residues into proteins in silico. However, it is not immediately obvious which specific properties make these unnatural amino acids useful, or how to prepare proteins made of unnatural building blocks. Among the available methods, genetic code expansion (GCE) enables the ribosome-mediated introduction of a large number of non-canonical amino acids (ncAA) into polypeptides at virtually any target position(s). GCE is based on codon reassignment via orthogonal pairs composed of an engineered aminoacyl-tRNA synthetase (aaRS) and its cognate tRNA. We therefore present our initial efforts to update SwissSideChain with new genetically encoded ncAA in order to make the resource more practically useful. By including key data, potential applications, and sequence information on aaRS/tRNA, we expect to guide researchers in obtaining tailor-made proteins of interest carrying ncAA.



Valinomycin interactions with water

Jindřich Hašek1, Michal Dušek2, Ivana Císařová3, Jiří Dybal4, Tereza Skálová1, Jarmila Dušková1, Jan Dohnálek1

1Institute of Biotechnology, Academy of Sciences of the CR, Czech Republic; 2Institute of Physics, Academy of Sciences, Cukrovarnická 2, Praha, Czech Republic; 3Faculty of Science, Charles University, Hlavova 2030, Praha, Czech Republic; 4Institute of Macromol.Chemistry, Academy of Sciences, Heyrovského nám.2, Praha, Czech Republic

Valinomycin is a well-known compound important for the highly selective transport of K+ ions through lipophilic membranes. This study indicates the possible formation of valinomycin tunnels through membranes under specific conditions. Originally, the crystal structures of valinomycin were studied on crystals grown from strictly dehydrated solutions. This is probably the reason for the low diffraction quality of the crystals and the unusually high R factors (R>17 %). Eight structures from the period 1975–1980 were not accepted into the CSD, and thus their coordinates are lost. Water is undoubtedly important to satisfy the hydrophilic inner part of the valinomycin surface. Thus, we decided to uncover the role of water in the valinomycin structure and function.

QUANTUM CHEMICAL CALCULATIONS


The quantum chemical calculations (QCC) were carried out using density functional theory (DFT) with the B3LYP functional and the 6-31G(d) basis set, employing the Gaussian 03 program package [22].

Figure 1. Superposition of valinomycin structures:

  1. valinomycin complexed with two neutral waters (diffraction experiment) - yellow carbons and waters,
  2. valinomycin complexed with two neutral waters (calculated) - green carbons and waters,
  3. valinomycin complexed with hydronium [H5O2]+ (calculated) - blue carbons and waters,
  4. valinomycin complexed with [H6O2]2+ (calculated) - magenta carbons and waters.

A reliable indicator of ionization of the trapped water is the distance between the water oxygens (neutral 3.2 Å, ionized 2.4 Å).
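The distance criterion can be written as a small check. The sketch below is illustrative only: it assumes Cartesian oxygen coordinates in Å, and the 2.8 Å cut-off (the midpoint of the two quoted distances) as well as the function names are assumptions, not values or code from the study.

```python
import math

# O...O distances quoted in the text (in Å).
NEUTRAL_OO = 3.2
IONIZED_OO = 2.4

# Assumption: use the midpoint of the two quoted distances as a cut-off.
CUTOFF = (NEUTRAL_OO + IONIZED_OO) / 2


def oo_distance(o1, o2):
    """Euclidean distance between two oxygen positions given as (x, y, z)."""
    return math.dist(o1, o2)


def looks_ionized(o1, o2, cutoff=CUTOFF):
    """Return True if the O...O distance is shorter than the cut-off."""
    return oo_distance(o1, o2) < cutoff
```

For real structures, the coordinates would come from the refined or calculated model, and the cut-off would need to be justified against the uncertainty of the atomic positions.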

EXPERIMENTAL

We prepared crystals and experimentally determined structures of:

  1. Valinomycin complexed with two uncharged water molecules (R=3.9 %),
  2. Valinomycin dimers encapsulating >12 water molecules. The dimers are stacked in the crystal to form infinite hydrophilic tunnels (the higher R=7.9 % corresponds to the mobility of waters in the inner tunnel).

Self-organization of valinomycin to form hydrophilic tunnels in hydrophobic membranes


The crystal structure is formed by large spherical valinomycin dimers encapsulating 12-15 water molecules (blue spheres in Fig. 2) and forming disks with an external diameter of ~16 Å and a height of about 20 Å (on the left side of Fig. 2). The disks are stacked in the crystal to form infinite water-filled tunnels (on the right side of Fig. 2), stabilized by hexagons of chlorine ions encapsulated in hydrophobic pockets. The interior of the tunnel is hydrophilic and the external surface of the tunnel is hydrophobic.

Fig. 2. Two valinomycin dimers can form a water tunnel through the lipid bilayer membrane. This configuration in the membrane may be stabilized by chlorine ions, similarly to the tunnels observed in the crystal structure.

Conclusion

The quantum chemical calculations and the experiments confirmed that the positive charge of the water molecules has only a small effect on the valinomycin conformation. The only reliable indicator of water ionization is the distance between the water oxygens.

The main driving force for stabilization of the valinomycin tunnel is provided by hexagons of chaotropic anions trapped in the hydrophobic pockets formed by external hydrophobic groups of neighbouring valinomycin molecules. The experimentally confirmed structure indicates the possibility of formation of self-assembled hydrophilic tunnels through hydrophobic bilayers under special conditions (e.g. the presence of small chaotropic anions).

The work was supported by the Czech Academy of Science project no. 86652036.



AlphaMissense performance on transmembrane proteins

Hedvig Tordai1, Tamas Hegedus1,2

1Semmelweis University, Hungary; 2HUN-REN-SU Biophysical Virology Research Group

Single amino acid substitutions can impact protein folding, dynamics, and function, often leading to severe pathological consequences. Distinguishing between benign and pathogenic substitutions is crucial for guiding research and therapeutic interventions. Unfortunately, the experimental investigation of these variants remains limited due to resource constraints. In response to this challenge, AlphaMissense has emerged as a promising tool for predicting the pathogenicity of single nucleotide polymorphism (SNP) variants, surpassing other existing predictors. Since transmembrane proteins are challenging for both experimental and computational approaches, we evaluated the performance of AlphaMissense on transmembrane proteins, utilizing ClinVar data for validation (positive: missense variants likely pathogenic and pathogenic; negative: likely benign and benign; reviewed: with at least one star). To ensure the reliability of our dataset, we exclusively considered TM topology predictions with a high confidence level (reliability score > 85) from the Human Transmembrane Proteome. Consequently, our dataset comprises 1,653 transmembrane proteins, with 1,228 and 6,370 variations located within TM and non-TM regions, respectively. Our evaluation reveals that AlphaMissense demonstrates remarkable performance, achieving F1 and MCC scores of 0.90 and 0.74 for TM regions and 0.82 and 0.69 for non-TM locations. This suggests that AlphaMissense is equally effective at predicting the pathogenicity of variants in both transmembrane and soluble regions of proteins. Furthermore, we investigated variant predictions for selected TM ABC proteins associated with diseases, such as cystic fibrosis, to assess the distribution of pathogenic mutations in their structural context. Moreover, recognizing that the accessibility of AlphaMissense predictions may be limited for many researchers and clinicians, we have developed a user-friendly, protein-centric web database. This resource, available at https://alphamissense.hegelab.org, offers easy access to AlphaMissense data, enhancing the utility of this valuable tool. Our study on AlphaMissense's performance in the context of membrane proteins, coupled with this web resource, promises to facilitate the broader utilization of AlphaMissense for research and clinical applications.
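For readers less familiar with the two reported metrics, the following minimal sketch shows how F1 and the Matthews correlation coefficient (MCC) are obtained from a confusion matrix. The function name and the example counts are hypothetical and are not taken from the study's data.

```python
import math


def f1_and_mcc(tp, fp, tn, fn):
    """Compute the F1 score and MCC from confusion-matrix counts.

    tp/fn: pathogenic variants predicted pathogenic/benign
    tn/fp: benign variants predicted benign/pathogenic
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return f1, mcc


# Hypothetical counts, for illustration only.
f1, mcc = f1_and_mcc(tp=40, fp=10, tn=40, fn=10)
print(f"F1 = {f1:.2f}, MCC = {mcc:.2f}")
```

F1 ignores true negatives, whereas MCC uses all four cells of the confusion matrix, so reporting both gives a more balanced picture when the pathogenic and benign classes differ in size.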

Funding: This work has been supported by the National Research, Development and Innovation Office (grant number: K 137610).



Large-scale application of ProteinMPNN to design conformation-specific GPCR agonists

Hannes Junker

Leipzig University, Germany

With more than 800 members, G protein-coupled receptors (GPCRs) are the largest family of transmembrane receptors. Many GPCRs occupy variations of active states, each with a conformation-dependent signaling profile, a phenomenon termed biased signaling. Modulating the propensity for the occupation of certain states with fine-tuned agonists therefore holds great pharmacological promise. However, computational design of GPCR-targeting ligands usually relies on experimentally solved structures, which represent only one snapshot within a conformational ensemble. Here we present our efforts to computationally design peptide agonists that can target conformational substates, using the Class A GPCR Growth Hormone Secretagogue Receptor (GHSR) as an example. To this end, we investigate the novel deep-learning design method ProteinMPNN, which predicts an alternative amino acid sequence based on a provided backbone conformation. More precisely, we hypothesize that ProteinMPNN can be used to design peptide agonists that are optimized for a specific receptor conformation. To provide coherent backbone variations, we utilized continuous frames from a ~36 μs long MD simulation of the GHSR in complex with its endogenous peptide agonist ghrelin and applied ProteinMPNN to the agonist. We demonstrate that ProteinMPNN is highly sensitive to minor changes in the conformation of the orthosteric binding pocket, responding with a suitable agonist sequence. Given an adequate set of alternative receptor structures, this approach provides the possibility to rapidly identify peptide sequences that address specific conformational states, which in turn are associated with a specific signaling profile.



Determining Structural Characteristics of Native and de novo Proteins for Improved Protein Design Algorithms

Johannes Andreas Klier

Leipzig University, Germany

De novo proteins are derived from methods such as computational protein design in contrast to naturally evolved proteins. By definition, amino acid sequences of de novo proteins are not found in nature and are unmatched by natural sequences. De novo proteins designed with the software framework Rosetta often display characteristic biophysical behaviours like high thermostability and rigid structures, which limits protein flexibility. In order to design more natural proteins in the future, computational algorithms have to be improved. To achieve this, it is necessary to understand differences between natural and de novo proteins. For this purpose, a computational scoring system was developed, derived from structural and biophysical characteristics and predictions made by the deep residual network trRosetta. Various interresidue distance metrics showed clear differences between natural and de novo proteins. Predictions by trRosetta for pairwise Cβ distances between two residues were more accurate for de novo proteins than for natural proteins while predictions for interresidue orientations were worse. It was also found that de novo proteins contain more interresidue interactions than natural proteins and a lower fraction of noninteracting residues. A correlation between low absolute contact order and designed proteins could be established. These findings will help inform further protein design algorithm optimization to improve design of proteins with natural flexibility and function.
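As an illustration of one of the metrics mentioned above, the sketch below computes absolute and relative contact order from a list of residue contacts, using the common definition (mean sequence separation of contacting residue pairs, divided by chain length for the relative value). How the contacts are defined (atom types, distance cut-off) is left to the caller; the function name is hypothetical and this is not the scoring system described in the abstract.

```python
def contact_orders(contacts, length):
    """Return (absolute, relative) contact order.

    contacts: iterable of (i, j) residue-index pairs that are in contact
    length:   number of residues in the chain

    Absolute contact order = mean |i - j| over all contacts.
    Relative contact order = absolute contact order / length.
    """
    separations = [abs(i - j) for i, j in contacts]
    if not separations or length <= 0:
        return 0.0, 0.0
    absolute = sum(separations) / len(separations)
    return absolute, absolute / length


# Toy example: two contacts in a 10-residue chain.
print(contact_orders([(1, 5), (2, 8)], length=10))
```

A low absolute contact order means that most contacts are local in sequence, which is typical of highly regular, helix-rich topologies.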



Assessing the Effects of Mutations and Corrector Molecules on APOE Dynamics via Comparative Markov State Analysis

Jakub Kopko1, Petr Kouba1,2, Sérgio M. Marques2,3, Joan Planas-Iglesias2,3, Jiří Damborský2,3, Stanislav Mazurenko2,3, Josef Šivic1, David Bednář2,3, Jiří Sedlář1

1Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslávských partyzánů 1580/3, 160 00 Prague, Czech Republic; 2Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic; 3International Clinical Research Center, St. Anne's University Hospital Brno, Pekařská 53, 656 91 Brno, Czech Republic

Due to the growing recognition of the importance of dynamical properties of proteins [1], molecular dynamics simulations have become an important tool for their analysis. These simulations generate large-scale high-dimensional datasets, motivating the development of data analysis methods aimed at distilling this information into a format understandable to humans. Among these methods, the VAMPnet neural network [2] stands out as one of the leading approaches, offering a linear perspective on the dynamics by directly learning a Markov state model from the data. The resulting models not only capture information about the metastable states observed during the simulation but also provide the probabilities of transition between these states.
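To make the notion of a Markov state model concrete, the sketch below estimates a transition probability matrix from a trajectory that has already been assigned to discrete states. This is a deliberately simplified, count-based estimator with hard state labels; it is not the VAMPnet method, which learns soft state assignments with a neural network. The function name and the toy trajectory are assumptions for illustration.

```python
def transition_matrix(traj, n_states, lag=1):
    """Estimate row-stochastic transition probabilities at a given lag time.

    traj:     sequence of integer state labels, one per simulation frame
    n_states: total number of states
    lag:      lag time in frames (must be >= 1)
    """
    counts = [[0] * n_states for _ in range(n_states)]
    # Count transitions from the state at frame t to the state at frame t + lag.
    for a, b in zip(traj[:-lag], traj[lag:]):
        counts[a][b] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Rows of unvisited states are left as zeros.
        matrix.append([c / total if total else 0.0 for c in row])
    return matrix


# Toy trajectory visiting two states.
print(transition_matrix([0, 0, 1, 1, 0], n_states=2))
```

Each row of the resulting matrix gives the probabilities of moving from one state to every other state after one lag time, which is the quantity referred to as the transition probabilities above.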

For enhancing the interpretability of Markov state models, we have developed a comparative Markov state analysis approach CoVAMPnet [3]. The CoVAMPnet method introduces a way to assess the significance of inter-residue distances for assigning Markov states through the analysis of aggregated neural network gradients. Additionally, CoVAMPnet enables comparison of Markov states between different sets of simulations, including simulations corresponding to different systems. This is achieved through the alignment of ensembles of Markov state models obtained as a solution to an optimal transport problem. The versatility of CoVAMPnet allows its application in various research areas where the comparison of molecular dynamics simulations of different systems finds its use. Crucially, this includes the study of the effects of drug candidates on the conformational behaviour of intrinsically disordered biomolecules [3] and the analysis and comparison of closely related protein variants, illuminating the influence of the mutations on the dynamical properties of the protein.

We applied CoVAMPnet to analyse the dynamics of apolipoprotein E (APOE), specifically its 4-helix bundle domain, which plays a role in the APOE dimerization process [4]. In the first part of our investigation, we focused on differences between APOE3 and APOE4, two APOE variants that appear to exhibit different risk levels of Alzheimer's disease onset in their human carriers. Subsequently, our research also explored the influence of the small molecule drug candidate 3SPA on the conformational behaviour of APOE3 and APOE4, providing valuable insights into possible explanations of its therapeutic potential.

With CoVAMPnet we verified the previously reported findings [4] and gained two interesting novel insights into the flexibility of APOE. Firstly, we found out that the HL1 subdomain of the free APOE3 is remarkably flexible, which we hypothesise may play a role in forming dimer interfaces [4]. Secondly, APOE4 simulated in the presence of 3SPA formed previously unknown conformations with a specific loss of structure in the H2 subdomain. Traditional methods of data processing missed this specific conformational change [4], demonstrating CoVAMPnet's effectiveness in comprehensively analyzing complex protein dynamics.

  1. Miller MD, Phillips GN Jr., Moving beyond static snapshots: Protein dynamics and the Protein Data Bank. Journal of Biological Chemistry, 296 (2021).

  2. Andreas Mardt, Luca Pasquali et al. VAMPnets for deep learning of molecular kinetics. Nature Communications, 9, (2018).

  3. Sérgio M. Marques, Petr Kouba et al. Effects of Alzheimer’s Disease Drug Candidates on Disordered Aβ42 Dissected by Comparative Markov State Analysis (CoVAMPnet). bioRxiv 2023.01.06.523007, (2023).

  4. Michal Nemergut, Sérgio M. Marques et al. Domino-like effect of C112R mutation on ApoE4 aggregation and its reduction by Alzheimer’s disease drug candidate. Molecular Neurodegeneration, 18, (2023).

This work was supported by the Technology Agency of the Czech Republic through projects RETEMED and TEREP (TN02000122, TN02000122/001N) and by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90254). This research has received funding from the project National Institute for Neurological Research (Programme EXCELES, ID Project No. LX22NPO5107) - Funded by the European Union – Next Generation EU.



E-pRSA: protein language models improve the prediction of residue solvent accessibility from sequence

Matteo Manfredi, Castrense Savojardo, Pier Luigi Martelli, Rita Casadio

University of Bologna, Italy

Knowledge of residue solvent accessibility in a protein is important for different applications, including the identification of interacting functional surfaces and the characterization of residues undergoing variations. Accessible Surface Area (ASA) and Relative Solvent Accessibility (RSA) values can be directly computed from the 3-dimensional structure. When resolved structures are not available, machine learning-based tools can provide accurate estimates starting from the protein sequence.

Recently introduced Protein Language Models (PLMs) allow the development of faster and more accurate tools with respect to canonical encodings such as Multiple Sequence Alignments (MSAs). After being trained on huge datasets including billions of protein sequences in a self-supervised way, PLMs can be efficiently adopted to generate a representation of the protein sequence that casts relevant evolutionary, structural, and contextual information. Different architectures are then fine-tuned to provide task-specific predictions.

Here we present E-pRSA, a tool to accurately estimate the RSA values for residues of a protein sequence in the absence of the structure.

E-pRSA is trained on 6,552 proteins and benchmarked on 21 proteins, all obtained from the PDB. Each protein has been mapped on the corresponding UniProt entry and DSSP was adopted to compute accessibility values from protein structures. The training dataset was split into 10 equally sized subsets for performing a 10-fold cross-validation. Proteins included in two different subsets share less than 25% sequence identity over a minimum of 40% coverage. Moreover, the blind test set is completely non-redundant with respect to the training datasets of all methods included in the benchmark.

The method is described in Figure 1 and in the corresponding legend. The input encoding is based on two state-of-the-art PLM-based embeddings, namely ProtT5 [1] and ESM2 [2], and the prediction is performed with a Deep Learning architecture.

The method achieves a 0.72 Pearson Correlation Coefficient on the blind test set, outperforming our previous method DeepREx [3], which exploits evolutionary information contained in MSAs. Moreover, E-pRSA outperforms two recently released methods based on PLM encoding, NetSurfP-2.0 [4] and SPOT-1D-LM [5]. The peculiarity of E-pRSA is to adopt two different PLMs to embed the input. Results then confirm our previous observation [6-8] that the concatenation of embeddings generated by different and complementary models can improve the performance of the downstream predictors.
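The two ideas in this paragraph, concatenating per-residue embeddings from two models and scoring predictions with the Pearson correlation coefficient, can be sketched as follows. The sketch uses plain Python lists and toy numbers; the function names are hypothetical, and it does not reproduce the E-pRSA architecture or the actual ProtT5 and ESM2 embeddings.

```python
import math


def concat_embeddings(emb_a, emb_b):
    """Concatenate two per-residue embeddings residue by residue.

    emb_a, emb_b: lists with one feature vector per residue; both must
    describe the same sequence and therefore have the same length.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must cover the same number of residues")
    return [list(a) + list(b) for a, b in zip(emb_a, emb_b)]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Toy example: 2 residues, embeddings of size 2 and 1.
print(concat_embeddings([[1, 2], [3, 4]], [[5], [6]]))
```

In practice, the concatenated vectors would be fed to the downstream predictor, and the Pearson coefficient would be computed between predicted and DSSP-derived RSA values over all residues of the test set.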

E-pRSA is available at https://e-prsa.biocomp.unibo.it/main/



Proteome secondary structures generated by AlphaFold

Ivana Hutařová Vařeková1, Dominik Martinat1, Radka Svobodová2,3, Karel Berka1

1Department of Physical Chemistry, Faculty of Science, Palacký University Olomouc, tř. 17. listopadu 12, 77146 Olomouc, Czech Republic; 2CEITEC – Central European Institute of Technology, Masaryk University Brno, Kamenice 5, 625 00 Brno, Czech Republic; 3National Centre for Biomolecular Research, Faculty of Science, Masaryk University Brno, Kamenice 5, 625 00 Brno, Czech Republic

The AlphaFold [1] algorithm and its associated database [2] provide convenient access to structural data for entire proteomes across various species, facilitating comprehensive statistical analysis. In our study, we examined the distribution of secondary structures, specifically alpha helices and beta strands, within the proteomes of model organisms and those of relevance to human health. While there seems to be a general distribution of secondary structures within proteins in all analyzed life forms, our findings reveal several noteworthy functional exceptions: 1) an abundance of short proteins with small amounts of secondary structure in plant proteomes, 2) a spike of structures with 9-12 alpha helices in some mammals (e.g., mouse or rat), arising from the abundance of olfactory receptors from the GPCR family, and 3) an enhanced presence of long proteins with abundant secondary structures (50+) in the human proteome. These insights contribute to a deeper understanding of the structural diversity within proteomes, shedding light on specific patterns and variations across different species and functional categories of proteins.
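Counting secondary structure elements per protein can be sketched as below. The abstract does not state how secondary structure was assigned, so the sketch assumes a DSSP-style per-residue string in which "H" marks alpha helix and "E" marks beta strand; the function name is hypothetical.

```python
from itertools import groupby


def count_elements(ss):
    """Count alpha helices and beta strands in a per-residue assignment.

    ss: string with one DSSP-style code per residue (assumed input format).
    A helix or strand is counted as one uninterrupted run of 'H' or 'E'.
    """
    helices = strands = 0
    for code, _run in groupby(ss):
        if code == "H":
            helices += 1
        elif code == "E":
            strands += 1
    return helices, strands


# Toy assignment with two helices and two strands.
print(count_elements("--HHHH--EEE-HHH-EE"))
```

Applying such a count to every model in a proteome and histogramming the results gives the kind of per-species distribution discussed above.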

[1]Jumper J., Evans R., Pritzel A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021). https://doi.org/10.1038/s41586-021-03819-2

[2]Varadi M., Anyango S., Deshpande M. et al. AlphaFold Protein Structure Database: massively expanding the structural coverage of protein-sequence space with high-accuracy models, Nucleic Acids Research, Volume 50, Issue D1, D439–D444 (2022), https://doi.org/10.1093/nar/gkab1061



Old folds can learn new tricks: Alphafold-driven insights on coiled-coil structure

Mikel Martinez Goikoetxea1, Rafal Madaj2, Jan Ludwiczak2,3, Andrei Lupas1, Stanislaw Dunin-Horkawicz1,2

1Max Planck Institute for Biology, Tuebingen, Germany; 2Institute of Evolutionary Biology, Faculty of Biology, Biological and Chemical Research Centre, University of Warsaw, Warsaw, Poland; 3Prescient Design, Genentech Research & Early Development, Roche Group, Basel, Switzerland

Coiled coils are a widespread protein structure motif that consists of two or more α-helices that wind around a central axis to form a helical bundle with a buried hydrophobic core. They are built from relatively short and simple sequence repeats, typically consisting of 7 residues (heptads), although alternative repeat sizes are possible, the most common being 11 residues (hendecads) [1,2]. The repeat size and residue composition of coiled coils are responsible for their considerable variety in terms of the helical topology (number and orientation of the helices) and geometry (axial rotation and degree and direction of the winding). Conversely, these structural features are responsible for the extraordinary diversity of functions that coiled-coil domains perform in nature, such as mechanical support, muscle contraction, vesicle transport and fusion, transcription factor activity, or signal transduction. In this work, we have benchmarked how accurate AlphaFold is in modeling typical heptad coiled coils [3], and investigated whether it could be applied to new, hitherto undescribed non-heptad coiled coils such as the ones composed primarily of hendecads. Our results show that AlphaFold is able to recapitulate a number of known coiled-coil rules that relate sequence and structure, and that it can be used to obtain insights into new ones. At the same time, we have found a number of cases that highlight the limitations and biases of AlphaFold in coiled-coil modeling. We hope that our work will serve as a foundation to develop new tools with which to further advance our understanding of this model protein structure motif.



The Protein Universe Atlas

Janani Durairaj1,2, Andrew M. Waterhouse1,2, Toomas Mets3,4, Tetiana Brodiazhenko3, Minhal Abdullah3,4, Gabriel Studer1,2, Gerardo Tauriello1,2, Mehmet Akdel5, Antonina Andreeva6, Alex Bateman6, Tanel Tenson3, Vasili Hauryliuk3,4,7,8, Torsten Schwede1,2, Joana Pereira1,2

1Biozentrum, University of Basel, Basel, Switzerland; 2SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland; 3Institute of Technology, University of Tartu, Tartu, Estonia; 4Department of Experimental Medical Science, Lund University, Lund, Sweden; 5VantAI, New York, USA; 6European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, United Kingdom; 7Science for Life Laboratory, Lund, Sweden; 8Virus Centre, Lund University, Lund, Sweden

The collection of all protein sequences sampled by nature is commonly referred to as the “Protein Universe”. This can be seen as a multidimensional space of amino acid sequences, where each protein adopts a coordinate that is defined by its similarity to all others. This landscape can be conceptualized as a large protein similarity network, where protein families and superfamilies form clusters and superclusters whose internal structure illustrates the evolutionary (or other similarity) relationships of the encompassed proteins. While protein similarity networks are commonly used for the study of individual protein or domain families, the methods currently available are memory intensive and are limited to relatively small sets of up to 10 to 20 thousand proteins.
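The idea of a protein similarity network whose clusters correspond to related proteins can be illustrated with a toy example. The sketch below links proteins whose pairwise similarity score passes a threshold and reports the connected components using a union-find structure. The scores, the threshold and the function name are hypothetical; this is not the GPU-accelerated layout or summarization approach used for the Atlas.

```python
def similarity_clusters(n, scored_pairs, threshold):
    """Group proteins into connected components of a similarity network.

    n:            number of proteins, identified by indices 0..n-1
    scored_pairs: iterable of (i, j, score) pairwise similarities
    threshold:    minimum score for two proteins to be linked by an edge
    """
    parent = list(range(n))

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, score in scored_pairs:
        if score >= threshold:
            parent[find(i)] = find(j)

    clusters = {}
    for node in range(n):
        clusters.setdefault(find(node), []).append(node)
    return sorted(clusters.values())


# Toy network: proteins 0-2 are mutually similar, 3 and 4 are not linked.
pairs = [(0, 1, 0.9), (1, 2, 0.8), (3, 4, 0.2)]
print(similarity_clusters(5, pairs, threshold=0.5))
```

Storing only the edges that pass the threshold, rather than a full all-against-all matrix, is one simple way to keep memory use manageable as the number of proteins grows.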

Leveraging recent advances in GPU-accelerated force-directed graph layout and complex network summarization approaches, we constructed for the first time a protein similarity network encompassing more than 50 million proteins. We annotated this network, linked it to various protein databases and released it as a new web resource, the Protein Universe Atlas, available at https://uniprot3d.org. In contrast to common protein databases such as UniProt, GenBank and the Protein Data Bank, which are protein-centric and display a page for each single queried protein, the Atlas represents entries as points in a 2D landscape where similar proteins are close in space. This puts them in evolutionary context, highlighting evolutionary relationships that can help guide biocuration, protein family classification, and protein function prediction and annotation.

Thanks to this development, we have already been able to shed light on a new toxin-antitoxin superfamily, to discover a new protein fold family, and to add hundreds of previously undescribed protein families to Pfam. We expect that expanding the resource to the more than 250 million proteins in UniProt and to those resulting from large-scale metagenomic studies will significantly speed up the study of any protein of interest, the automatic identification of protein families and superfamilies, and the functional annotation of the exponentially growing protein data deposited in protein databases.
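
As a rough illustration of how clusters emerge from a protein similarity network, the following minimal sketch groups proteins into connected components after discarding weak edges. It is not the Atlas pipeline, which relies on GPU-accelerated force-directed layout and network summarization; the protein names, similarity scores and the `min_score` threshold are made-up assumptions.

```python
def connected_components(nodes, edges, min_score):
    """Group nodes linked by edges scoring at least min_score (union-find)."""
    parent = {n: n for n in nodes}

    def find(n):
        # Follow parent links to the root, compressing the path on the way.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b, score in edges:
        if score >= min_score:
            parent[find(a)] = find(b)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(c) for c in clusters.values())


# Hypothetical proteins and pairwise similarity scores (illustrative only).
toy_nodes = ["P1", "P2", "P3", "P4", "P5"]
toy_edges = [("P1", "P2", 0.9), ("P2", "P3", 0.8),
             ("P3", "P4", 0.2), ("P4", "P5", 0.7)]
toy_clusters = connected_components(toy_nodes, toy_edges, min_score=0.5)
```

With the toy data, the weak P3-P4 edge is dropped, so the network splits into two family-like clusters.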



Exploring the relationship between structural quality and journal impact factor

Jana Porubská1, Veronika Horská1, Vladimír Horský1, Radka Svobodová1,2

1National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Kamenice 753/5, 625 00 Brno, Czech Republic; 2CEITEC – Central European Institute of Technology, Masaryk University, Kamenice 753/5, 625 00 Brno, Czech Republic

In structural bioinformatics, scientific research heavily depends on the structural data stored in the Protein Data Bank (PDB). The scientific community's widespread utilization of structural models puts pressure on the integrity of these structures. Consequently, a profound emphasis exists on structural validation, with the PDB providing detailed validation reports with various quality attributes.

Our project embarks on an exploration of structural quality across different scientific journals. We investigate the interplay between structural quality metrics and the journal impact factor. Our methodology involves a selection of journals, categorizing them based on their impact factor, and the compilation of a dataset containing selected quality factors of these structures. Finally, our analysis uses statistical methods to describe the dynamic nature of the structural quality within each journal category.



Towards finding high affinity binders of the PSB domain of human collagen prolyl 4-hydroxylase

M. Mubinur Rahman1, Hongmin Tu2, Bukunmi Adediran1, Sudarshan Murthy1, Antti M. Salo1,2, Outi Lampela2, Andre Juffer2, Johanna Myllyharju1,2, Rik K. Wierenga1, M. Kristian Koski1,2

1Faculty of Biochemistry and Molecular Medicine, University of Oulu, Oulu, Finland; 2Biocenter Oulu, University of Oulu, Oulu, Finland

Collagen prolyl 4-hydroxylase (C-P4H) catalyses the hydroxylation of the Y prolines of the XYG-repeat of procollagen. In humans there are three isoforms, C-P4H-I, -II and -III. C-P4Hs are tetrameric α2β2 enzymes. The α-subunit provides the N-terminal dimerization domain, the middle peptide-substrate-binding (PSB) domain and the C-terminal catalytic (CAT) domain. The CAT domain belongs to the Fe(II) and 2-oxoglutarate-dependent dioxygenases, which adopt the DSBH-fold and which use molecular oxygen to hydroxylate proline residues. The β-subunit is identical to PDI and its function in catalysis is not yet known. The PSB domain (about 100 residues), unique to the collagen prolyl 4-hydroxylases, binds the proline-rich peptide substrate that is hydroxylated by the CAT domain. The PSB domain has five α-helices, adopting the TPR fold.

Crystal structures are now available of all domains and subunits of C-P4H. For these structures, different truncated forms of the α-subunit have been used. The crystal structure of the dimeric DD construct (of C-P4H-I), consisting of the dimerization and PSB domains of the α-subunit, shows the dimerization motif between the two α-subunits [1], whereas the crystal structure of the CAT domain (of the truncated C-P4H-II α-subunit) complexed with mature PDI (the CAT-PDI complex) shows the mode of interaction between the CAT domain and PDI [2]. These structures do not reveal the mode of assembly of the DD dimer with the two CAT-PDI units, which is the structure of the mature C-P4H α2β2 complex. The latter structure has, however, been predicted [2] by AI-based structure prediction tools, such as Robetta and AlphaFold.

The PSB domain is important for the efficient hydroxylation of proline rich peptides by the CAT domain [3]. The binding of proline rich peptides to the PSB-I domain has previously been reported for the peptides (PPG)3 and P9, which bind in an extended, PP-II conformation [1]. Protein crystallographic binding studies show that the mode of binding to PSB-I of proline-rich peptides that have the PxGP motif, is different, adopting a bent conformation, which is the same as previously reported for the mode of binding of such peptides to the PSB-II domain [4]. The PSB domain has two proline binding pockets, referred to as the P5 and P8 pockets. In both modes of binding proline side chains bind in these pockets. Calorimetric binding studies show that these PxGP peptides have good affinity for the PSB-I domain.

The current PSB studies are aimed at finding tight binders at the P5 and P8 pockets of the PSB domain. Such compounds will compete with the binding of the proline-rich substrate peptides and are therefore potential inhibitors of C-P4Hs, and hence potential pharmaceuticals against fibrotic diseases and cancer. Two approaches are being developed to find these compounds: (i) the FRET method using various screens with a wide range of compounds and (ii) biocomputational methods using the currently available structures of PSB peptide complexes as a starting point. The poster will describe the currently available structural and affinity data.

This project is funded by the Jane and Aatos Erkko foundation.

1. Anantharajan, J., Koski, M. K., Kursula, P., Hieta, R., Bergmann, U., Myllyharju, J., Wierenga, R. K., Structure, 21 (2013), 2107-2118.

2. Murthy, A. V., Sulu, R., Lebedev, A., Salo, A. M., Korhonen, K., Venkatesan, R., Tu, H., Bergmann, U., Jänis, J., Laitaoja, M., Ruddock, L., Myllyharju, J., Koski, M. K., Wierenga, R. K., J. Biol. Chem., 298 (2022), 102614.

3. Pekkala, M., Hieta, R., Bergmann, U., Kivirikko, K. I., Wierenga, R.K., Myllyharju, J., J. Biol. Chem., 279 (2004), 52255-52261.

4. Murthy, A. V., Sulu, R., Koski, M. K., Tu, H., Anantharajan, J., Sah-Teli, S. K., Myllyharju, J., Wierenga, R. K., Prot. Sci., 27 (2018), 1692-1703.



Onedata4Sci: Life-science experimental datasets management system

Adrian Rosinec1,2,3, Tomas Svoboda1,2,3, Tomas Racek1,2,3, Josef Handl1, Jozef Sabo2,3, Ales Krenek3, Radka Svobodova1,2

1National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Kamenice 753/5, Brno, 625 00, Czech Republic; 2CEITEC - Central European Institute of Technology, Kamenice 753/5, Brno, 625 00, Czech Republic; 3Institute of Computer Science, Masaryk University, Sumavská 416/15, Brno, 602 00, Czech Republic

In many scientific disciplines, especially the life sciences, expensive equipment is now commonly shared (like cryoEM devices, optical microscopes, …). The users, typically scientists, request specific experiments from facilities, which perform the experiments on their behalf. The outcome of such an experiment is a dataset, which can get quite large in many cases (tens of gigabytes to terabytes). Data are then processed in order to draw scientific conclusions from their interpretation, and the results are published. However, today more and more emphasis is being placed on sharing the primary data itself, not only for the verification of scientific findings, but also for the re-use of the dataset in future research. This requires automatic or manual annotation with appropriate metadata, storage or archiving of the dataset, assignment of DOIs, and subsequent publication of the dataset in disciplinary metadata catalogues or data repositories.

To address these challenges, we are designing and developing Onedata4Sci, a system that automates the acquisition, sharing, and publishing of data produced by specialized scientific devices. The proposed solution automatically makes experimental data available to the scientific community in a predefined way. It is particularly useful for on-the-fly processing in local or distant data centers, real-time analysis, or archiving to permanent storage according to a defined quality of service (e.g., data distribution). The solution includes a web-based system that can be used to manage emerging datasets and annotate them with metadata (automatically extracted from the data produced by the instruments or manually entered by users according to defined templates).

The system makes it easy to automate the individual steps of dataset preparation, checking compliance with FAIR principles, and publishing the dataset to the scientific community. The development of the system is guided by FAIR principles and national EOSC-CZ activities.



The analysis of the Impact of Substitutions within EXO-Motifs on Hsa-MiR-1246 Intercellular Transfer in Breast Cancer Cells

Agnieszka Rybarczyk1,2, Tomasz P. Lehmann3, Ewa Iwańczyk-Skalska3, Wojciech Juzwa4, Kamil Kopciuch1, Paweł P. Jagodziński3

1Institute of Computing Science, Poznan University of Technology, Poland; 2Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poznan, Poland; 3Department of Biochemistry and Molecular Biology, Poznan University of Medical Sciences, Poznan, Poland; 4Department of Biotechnology and Food Microbiology, Poznan University of Life Sciences, Poznan, Poland

Extracellular vesicles (EVs) release various biomolecules into the extracellular space, with miR-1246 garnering recent interest for its oncogenic role in several cancers. The processes and mechanisms guiding miR-1246 into EVs and its stability remain elusive. In our study, we explored the influence of single-nucleotide alterations in miR-1246's exosome sorting motifs (EXO-motifs: GGAG and GCAG) through both in silico methods, such as structural analysis and modeling, and in vitro assays involving the transfection of fluorescently labeled miRNA to MDA-MB-231 cells, which we analyzed by flow cytometry and fluorescent microscopy. Our findings indicate that disruptions in miR-1246 EXO-motifs can affect its stability and intercellular transfer, suggesting a connection between RNA stability and EV-mediated transfer.



The hijacker of host immune system: Structural analysis of Klebsiella immune evasin A

Mia Åstrand1, Tobias Eriksberg1, Joana Sá-Pessoa2, José A. Bengoechea2, Christine Touma1, Tiina A. Salminen1

1Structural Bioinformatics Laboratory and InFLAMES Research Flagship Center, Biochemistry, Faculty of Science and Engineering, Åbo Akademi University, Turku, Finland; 2Wellcome-Wolfson Institute for Experimental Medicine, School of Medicine, Dentistry and Biomedical Sciences, Queen's University Belfast, Belfast, UK

The multidrug-resistant bacterium Klebsiella pneumoniae, which causes infections in the respiratory system and the urinary tract, as well as life-threatening hospital-acquired infections, is included on WHO’s list of pathogens that need to be prioritized when developing new antibiotics. Due to the increasing spread of antibiotic resistance among its strains, it is considered an urgent problem. The recently discovered K. pneumoniae protein, Klebsiella immune evasin A (KivA), has been shown to inhibit the IL-17 and TLR signaling pathways that play a significant role in the human immune response against pathogens. KivA consists of an N-terminal SEFIR domain, which is similar to the eukaryotic SEFIR domain, and a C-terminal domain with an unknown 3D fold. KivA likely inhibits key proteins in the signaling pathways by forming SEFIR-SEFIR interactions with them, but nothing is yet known about the role of the C-terminal domain. We have used traditional and AI-based methods to model the 3D structure of KivA, and crystal structure determination to validate the modeling results is ongoing. The SEFIR domain of KivA is predicted to have a typical flavodoxin-like fold, whereas the C-terminal domain forms a novel alpha-helical 3D fold. The structural and functional analysis of KivA will provide insight into the mechanisms used by K. pneumoniae to disrupt the immune responses and will enable the development of new, effective and more selective treatment methods.



ChannelsDB 2.0: A Comprehensive Database of Protein Tunnels and Pores in AlphaFold Era

Anna Špačková1, Ondřej Vávra2,3, Tomáš Raček4,5, Václav Bazgier1, David Sehnal4,5, Jiří Damborský2,3, Radka Svobodová4,5, David Bednář2,3, Karel Berka1

1Department of Physical Chemistry, Faculty of Science, Palacký University, tř. 17. listopadu 12, 771 46 Olomouc, Czech Republic; 2Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic; 3International Clinical Research Center, St. Anne’s University Hospital Brno, Pekařská 53, 656 91 Brno, Czech Republic; 4CEITEC – Central European Institute of Technology, Masaryk University Brno, Kamenice 5, 625 00 Brno, Czech Republic; 5National Centre for Biomolecular Research, Faculty of Science, Kamenice 5, 625 00 Brno, Czech Republic

ChannelsDB 2.0 has been upgraded to offer a comprehensive insight into protein channels' structural characteristics, geometry, and physicochemical attributes, encompassing both tunnels and pores. These channels are computed on deposited biomacromolecular structures originating from the PDBe and AlphaFoldDB databases. In the new version of the ChannelsDB database, we have incorporated data generated through the widely used CAVER tool, augmenting the insights previously acquired only through the original MOLE tool. Additionally, we have extended the database's coverage by introducing tunnels originating from cofactors localised within the AlphaFill database or from cognate ligands within PDB structures. This expansion has increased the available channel annotations by almost five times. ChannelsDB 2.0 houses information concerning geometric properties such as length and radius and physicochemical attributes based on the amino acids lining the channels. These stored data are intricately linked with the existing UniProt mutation annotation data, facilitating in-depth investigations into the functional roles of biomacromolecular tunnels and pores.

In summary, ChannelsDB 2.0 represents an invaluable resource for conducting in-depth analyses of the significance of biomacromolecular channels. The database is freely accessible to the public at the address https://channelsdb2.biodata.ceitec.cz.



C-RCPred: a multi-objective algorithm for interactive secondary structure prediction of RNA complexes integrating user knowledge and SHAPE data

Mandy Ibéné, Audrey Legendre, Guillaume Postic, Eric Angel, Fariza Tahi

Université Paris-Saclay / Univ. Evry, France

RNAs can interact with other molecules in their environment, such as ions, proteins or other RNAs, to form complexes with important biological roles. The prediction of the structure of these complexes is therefore an important issue and a difficult task. We are interested in RNA complexes composed of several (more than two) interacting RNAs. We show how available knowledge on the considered RNAs can help predict their secondary structure. We propose an interactive tool for the prediction of RNA complexes, called C-RCPred, that considers user knowledge and probing data (which can be generated experimentally or artificially). C-RCPred is based on a multi-objective optimization algorithm. Through an extensive benchmarking procedure, which includes state-of-the-art methods, we show the efficiency of the multi-objective approach and the positive impact of considering user knowledge and probing data on the prediction results. C-RCPred is freely available as an open-source program and web server on the EvryRNA website (https://evryrna.ibisc.univ-evry.fr).



Reusing and accessing computed structure models with SWISS-MODEL and ModelArchive

Gerardo Tauriello1,2, Gabriel Studer1,2, Stefan Bienert1,2, Andrew Waterhouse1,2, Torsten Schwede1,2

1Biozentrum, University of Basel, Basel, Switzerland; 2SIB Swiss Institute of Bioinformatics, University of Basel, Basel, Switzerland

A broad spectrum of applications in life science research benefits from the availability of three-dimensional structures of proteins as they provide valuable molecular insights into their function. While experimental structures are still the most accurate option when available, it has now become possible to generate high quality computed structure models for almost every known protein. We are hence in a situation where creating a structural model is not the challenge any more but we can instead aim to make best use of available models and to get the critical details right for a given downstream application.

Within SWISS-MODEL we have therefore expanded our modelling capabilities to take advantage of the available high quality models in the AlphaFold database (AFDB). We now query AFDB models, rank them together with available experimental structures, and use them as templates for modelling. Since the AFDB contains models for more than 200 million proteins and our users create approximately two new SWISS-MODEL projects every minute, we implemented efficient storage of the AFDB models using the OpenStructure Minimal Format (OMF) and a fast and resource-efficient sequence search based on k-mer search which can query the AFDB in seconds.

While the AFDB contains models for large numbers of proteins, it is limited to sequences in UniProt, to monomer models and to the use of the default AlphaFold2 modelling pipeline. Different modelling pipelines have distinct advantages and disadvantages and users can thus benefit from access to a variety of models to find the one which is best suited for their application. Models for any protein and from any modelling pipeline which is not based on experimental data can be stored in the ModelArchive. There, we have seen a large increase of depositions in the past years from 2’770 publicly released models at the end of 2021 to over 74’000 models in September 2023. This increase in depositions was enabled by the establishment of the ModelCIF data format to store models and their metadata and complemented by the development of the 3D-Beacons network which enables sharing of models across multiple model providers.

Taken together, our latest developments in ModelArchive make a variety of computed structure models accessible and SWISS-MODEL can reuse models to help users obtain the best ones for their applications.
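
The k-mer based sequence search mentioned in this abstract can be illustrated with a minimal sketch: an inverted index maps each k-mer to the entries containing it, and a query is ranked by the number of k-mers it shares with each entry. This is not the SWISS-MODEL implementation; the entry identifiers, sequences, the choice of `k=5` and the scoring by shared k-mer count are all assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(db, k):
    """Map each k-mer to the set of database entry IDs that contain it."""
    index = defaultdict(set)
    for entry_id, seq in db.items():
        for kmer in kmers(seq, k):
            index[kmer].add(entry_id)
    return index


def search(query, index, k, top=3):
    """Rank entries by the number of distinct k-mers shared with the query."""
    hits = defaultdict(int)
    for kmer in kmers(query, k):
        for entry_id in index.get(kmer, ()):
            hits[entry_id] += 1
    # Highest count first; ties are broken alphabetically by entry ID.
    return sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


# Toy database with made-up identifiers and sequences.
toy_db = {
    "model_A": "MKTAYIAKQRQISFVKSHFSRQ",
    "model_B": "MKTAYIAKQRGGGGGGGGGGGG",
    "model_C": "PPPPPPPPPPPPPPPPPPPPPP",
}
toy_index = build_index(toy_db, k=5)
results = search("MKTAYIAKQRQISF", toy_index, k=5)
```

Because the index is built once and each query only touches the k-mers it contains, lookups stay cheap even when the database is large; entries sharing no k-mer with the query are never visited.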



Protein folding enhanced by Adversarial Autoencoders

Guglielmo Tedeschi1, Vladimír Višňovský2, Aleš Křenek2, Vojtech Spiwok1

1Department of Biochemistry and Microbiology - University of Chemistry and Technology, Prague; 2Department of Machine Learning and Data Processing - Faculty of Informatics - Masaryk University, Brno

This research is motivated by acceleration. Molecular simulations make it possible to simulate the motion of systems ranging from small molecules to large proteins and their combinations in drug-target complexes. They let us predict changing conformations, stability and plenty of other properties by following the evolution of the molecular structure. However, the application of molecular simulations is limited by their large computational cost: integration steps must be on the order of femtoseconds to ensure numerical stability when integrating Newton's equations of motion. Given this limitation, a typical molecular dynamics simulation is capable of sampling only a small fraction of the states available to the simulated system, so slow or rarely occurring processes may or may not be captured, depending on the simulation time. There are numerous techniques to address this limitation and to speed up simulations. Metadynamics is an enhanced sampling method based on biasing the Hamiltonian of the system; it helps the system to cross barriers and move into unexplored areas of the free energy surface, using selected internal coordinates, so-called collective variables. Choosing collective variables that make metadynamics successful is not a trivial task and depends first of all on the knowledge and expertise of the user. In the last few years, opportunities have emerged for machine learning and artificial neural networks in this field. We decided to develop an adversarial autoencoder [1] as a tool to analyze simulation data and to help users derive good collective variables to enhance molecular dynamics simulations. The potential of this platform is demonstrated using unbiased molecular dynamics simulations of Trp-cage and Villin headpiece [2]. We aim to explore its applicability in more complex systems.

We thank D. E. Shaw Research for providing the Trp-cage trajectory for asmsa training. This work was supported by the Czech Science Foundation (22-29667S).

1. A. Makhzani, J. Shlens, N. Jaitly and I. Goodfellow, Adversarial autoencoders. International Conference on Learning Representations, 2016.

2. K. Lindorff-Larsen, S. Piana, R. Dror and D. Shaw, How fast-folding proteins fold. Science 334:517, 2011.



LIGYSIS: a pipeline and web application for the analysis of ligand binding sites

Javier S. Utgés, Stuart A. MacGowan, Geoffrey J. Barton

University of Dundee, United Kingdom

Ligands are key for protein function and can act as substrates, co-factors, or inhibitors in complex with a myriad of proteins spread across all molecular processes. For that reason, understanding how they interact with proteins is crucial and can provide insight into drug development and a wider understanding of protein function. In this work, we gathered >25,000 proteins from multiple species with experimentally resolved 3D structures. We analyse the interactions between these proteins and biologically meaningful ligands as defined by BioLiP. Protein-ligand interaction fingerprints are used to group ligands and define >100,000 unique ligand binding sites. Ligand sites are grouped into four clusters based on their solvent accessibility profile. The defined clusters are biologically different in terms of evolutionary divergence, enrichment in neutral missense variation, relative solvent accessibility, and functional enrichment. These results strongly suggest that these cluster labels can be used to infer ligand binding site functionality. These findings will be of interest to those studying protein-ligand interactions or developing new drugs. To facilitate access to these data, we are developing a web application that will allow users to explore the defined binding sites on the structure of a protein of interest, as well as dynamically explore graphs and tables displaying the features of the residues forming the sites. The LIGYSIS web service is a Python Flask application that uses 3Dmol.js for protein visualisation, Chart.js for dynamic graph rendering, Bootstrap for styling, and vanilla JavaScript and jQuery to link all the components together and make them interact.



Defining conformational states and their variability

Jose Gavalda Garcia, David Bickel, Joel Roca Martinez, Wim Vranken

Interuniversity Institute of Bioinformatics in Brussels, Vrije Universiteit Brussel, Belgium

The dynamics and related conformational changes of proteins are often essential for their function, but are difficult to characterise and interpret. The energy landscape that determines the conformational behavior and dynamics of an amino acid residue in a protein is determined by its local environment, which encompasses interactions with other residues or molecules as well as parameters such as temperature or pH. The lowest energy state for a given residue can correspond to very sharply defined conformations, for example when the residue is part of a stable helix, or can cover a wide range of conformations, for example for residues in intrinsically disordered regions. Defining such low energy states will therefore help to describe the behavior of a residue and how it changes with its environment. We propose a novel data-driven probabilistic definition of six low energy conformational states typically accessible for amino acid residues in proteins. This definition is based on in solution NMR information for 1414 proteins, through a combined analysis of structure ensembles with interpreted chemical shifts. We further introduce a conformational state variability parameter that captures, based on an ensemble of protein structures from molecular dynamics or other methods, how often a residue moves between these conformational states. The approach enables a different perspective on the conformational behavior of proteins that is complementary to their static interpretation from single structure models.
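
To give a feel for what a per-residue variability measure over an ensemble might look like, here is a minimal sketch. The abstract does not give the formula for the conformational state variability parameter, so the measure below (the fraction of consecutive ensemble members in which the state label changes) is purely an assumption, as are the state labels `S1` to `S6` and the toy residue numbers.

```python
def transition_fraction(states):
    """Fraction of consecutive frame pairs in which the state label changes."""
    if len(states) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(states, states[1:]))
    return changes / (len(states) - 1)


# Hypothetical state assignments per residue across six ordered frames.
# "S1".."S6" are placeholder labels, not the state names used by the authors.
trajectory = {
    10: ["S1"] * 6,                             # rigid residue
    11: ["S1", "S1", "S2", "S2", "S1", "S1"],   # occasional switching
    12: ["S1", "S3", "S2", "S5", "S4", "S6"],   # switches at every step
}
variability = {res: transition_fraction(s) for res, s in trajectory.items()}
```

A measure of this kind only makes sense for ordered ensembles such as molecular dynamics trajectories; for unordered ensembles, an occupancy-based measure (for example, the entropy of the state frequencies) would be a more natural choice.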



Computational design of a cyclic peptide for inhibition of CD59 using hotspot extension in Rosetta

Max Beining1,2

1Institute for Drug Discovery, Leipzig University Faculty of Medicine, Leipzig, Germany; 2School of Embedded Composite Artificial Intelligence (SECAI), Cooperation of University Leipzig and TU Dresden

CD59 is a crucial membrane complement regulatory protein that plays a pivotal role in limiting the activation of the complement system and preventing the formation of the membrane attack complex (MAC) on cell membranes. CD59 restricts the construction of the MAC on cell membranes by inhibiting the transmembrane channel-forming activity of homologous C8 and C9 proteins during the final stage of MAC formation. CD59 shows increased expression on various tumor cells and is associated with reduced survival in cancer patients. It has been shown that inhibition of CD59 expression on leukemia cancer cells leads to increased therapy efficacy with the monoclonal antibody Rituximab [1]. In this early-stage project, a computational approach was used to address the challenge of designing a potential inhibitor for CD59. By extending critical interaction anchor residues in the C9-CD59 interface, we build de novo cyclic peptides containing non-canonical amino acids (NCAA) using the generalized kinematic loop closure method in the software suite Rosetta. Docking experiments showed funneling into the binding region for some of the designed peptides. In a second computational design experiment, bicyclic peptides were created using an NCAA Cys-Maleimide linker. Promising candidates will be tested in the laboratory to validate their specific binding affinity.



3D-Bioinfo nucleic acid activity - a progress report

Bohdan Schneider

Institute of Biotechnology of the Czech Academy of Sciences, Czech Republic

The 3D-Bioinfo nucleic acid activity initiated and is involved in two benchmarking activities: one, dubbed NAVAL, deals with valence geometries (interatomic distances and angles); the other, NABIR, aims at benchmarking publicly available algorithms and programs that assign base pairing classes. The NAVAL project is slowly (perhaps a bit too slowly) moving towards formulating recommendations for the community to resolve inconsistencies in valence geometry dictionaries and to unify the standards used in the main software packages for structure determination and analysis; the main goal remains to formulate sensible recommendations for the validation procedures used by the main public archive, the PDB. The NABIR initiative started about a year ago by opening a discussion about consistency, or rather the lack of it, in assigning the various classes of base pairing topologies. A group of about ten people assigned base pairs on a broad range of RNA and DNA structures (we benefited from the careful selection of structures in NAVAL) and compared the assignments made by various programs, noticing significant differences. In the near future, we plan to determine validation criteria based on the physics of interatomic interactions, likely based on hydrogen bonding geometries, and to suggest them for base pair assignment. Thanks to the collaborations in the NA activity, we published an opinion article in Nucleic Acids Research about issues related to predicting RNA 3D structures (Schneider et al, NAR, 2023).



Validation of nucleic acid valence geometry at dnatco.datmos.org

Jiří Černý, Paulína Božíková, Barbora Schramlová, Bohdan Schneider, Lada Biedermannová, Michal Malý

Institute of Biotechnology of the Czech Academy of Sciences, Czech Republic

Nucleic acids play a fundamental role in all living organisms. They store genetic information and control the process of protein synthesis. For their study, it is essential to determine their spatial arrangement. A better understanding of the structure will help create more accurate structural models that can help explain many biological mechanisms and develop new substances. Several methods are used to construct a three-dimensional atomic model of nucleic acid structure. The most common of these is X-ray crystallography. The initial model needs to be further refined, usually performing a number of refinement and validation cycles.

The goal of our work is to characterize the covalent geometry parameters of nucleic acids and to propose new procedures for their validation. We will present our recently proposed approach based on the analysis of a sequentially non-redundant, high-resolution, quality-filtered reference set of nucleic acid structures. It will provide an intuitive valence geometry validation score to the authors of nucleic acid structures. The reference implementation is currently available at https://dnatco.datmos.org.



Reference-free ranking method for RNA 3D models

Tomasz Zok1, Jan Pielesiak1, Maciej Antczak1,2, Marta Szachniuk1,2

1Poznan University of Technology, Poland; 2Institute of Bioorganic Chemistry, Polish Academy of Sciences, Poland

There has been a surge of interest in predicting RNA 3D structures lately, with more researchers recognizing the significance of understanding the structure and function of RNA. As our knowledge of RNA molecules expands, we can leverage the advancements made in protein structure prediction to improve our predictions in the RNA field.

However, a significant challenge when predicting novel structural folds is assessing the quality of the models produced. Modeling software often generates multiple models per input, sometimes even thousands, making selecting the most promising ones crucial. Traditionally, researchers determine the quality of the model based on energy terms calculated using force fields or coarse-grained statistical potentials. The lower the energy calculated, the more likely the RNA structure is considered to be. However, the energy landscape usually contains many local minima, leading to inconclusive results.

Therefore, we propose a different approach for ranking multiple 3D models created from the same sequence by analyzing the base pairs and stacking interactions within them. We build a consensus secondary structure from the extracted data and rank each model's interaction network against that consensus to provide a final ranking.

We benchmarked our proposed method on public RNA 3D modeling datasets to verify its usefulness, comparing its performance against state-of-the-art energy-based evaluations.
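
The consensus-based ranking outlined in this abstract can be sketched as follows. This is a simplified illustration rather than the authors' method: each model is reduced to a set of residue-index pairs, the consensus is taken by majority vote, and models are scored by F1 against it. The majority threshold, the F1 score and the toy interaction sets are all assumptions.

```python
from collections import Counter


def consensus_pairs(models, min_fraction=0.5):
    """Keep interactions seen in more than min_fraction of the models."""
    counts = Counter(pair for pairs in models.values() for pair in pairs)
    threshold = min_fraction * len(models)
    return {pair for pair, n in counts.items() if n > threshold}


def f1_score(predicted, reference):
    """Harmonic mean of precision and recall between two interaction sets."""
    shared = len(predicted & reference)
    if shared == 0:
        return 0.0
    precision = shared / len(predicted)
    recall = shared / len(reference)
    return 2 * precision * recall / (precision + recall)


def rank_models(models):
    """Score every model against the consensus and sort best-first."""
    consensus = consensus_pairs(models)
    scores = {name: f1_score(pairs, consensus) for name, pairs in models.items()}
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical interaction sets (residue index pairs) for three 3D models.
toy_models = {
    "model_1": {(1, 20), (2, 19), (3, 18)},
    "model_2": {(1, 20), (2, 19), (4, 17)},
    "model_3": {(1, 20), (5, 16)},
}
ranking = rank_models(toy_models)
```

In this toy case, the consensus contains the two pairs found in a majority of models, so the two models that reproduce both rank above the one that reproduces only one. No reference structure or energy function is needed, which is the point of a reference-free ranking.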

 

Date: Thursday, 16/Nov/2023
9:00am - 9:30amS3: Activity 4 - To develop tools to Describe, Analyse, Annotate, and Predict Nucleic Acid Structures
Location: Chamber Hall
Session Chair: Christine Orengo

ELIXIR program 2024-2028

9:30am - 11:30amS4: Activity 4 - To develop tools to Describe, Analyse, Annotate, and Predict Nucleic Acid Structures
Location: Chamber Hall
Session Chair: Bohdan Schneider
 
9:30am - 10:00am

RNA-Puzzles: Blind Assessments of (Semi)-Automatic 3D RNA Modeling

Eric Westhof1, Zhichao Miao2

1Architecture et Réactivité de l'ARN, Université de Strasbourg, Institut de biologie moléculaire et cellulaire du CNRS, 67000 Strasbourg France; 2GMU-GIBH Joint School of Life Sciences, Guangzhou Laboratory, Guangzhou Medical University, Guangzhou, China

RNA 3D structure modeling dates to the late 1960s and several computer programs for predicting RNA 3D structures have been proposed since then. RNA-Puzzles is a collaborative effort dedicated to advancing and improving RNA 3D structure prediction. With the agreement of crystallographers, RNA structures are predicted by different groups before the publication of crystal structures. Since the success of AlphaFold in protein structure prediction, artificial intelligence approaches are continuously designed to solve the problem of RNA 3D structure prediction with strategies like AlphaFold. However, eliminating redundancy between training and test data is not trivial and some programs have shown overfitting results. Therefore, blind, unbiased evaluations (based on equivalence of comparison metrics) of all prediction tools are a necessary requirement.
A dedicated website (http://www.rnapuzzles.org/) gathers the systematic protocols and parameters used for comparing models and crystal structures, all the data, analysis of the assessments, and related publications. Up to now, 40 RNA sequences with experimentally determined structures (X-ray or cryo-EM) have been predicted by many groups from several countries. Many of the predictions have achieved high accuracy after comparison with the solved structures.



10:00am - 10:30am

Rfam, RNA 3D structures, and issues facing RNA 3D structure prediction

Blake Sweeney

EMBL-EBI, United Kingdom

Rfam is a database of over 4,000 non-coding RNA (ncRNA) families. Each family is composed of a sequence alignment called the seed, often manually curated, a consensus secondary structure and a covariance model. Rfam was originally developed 20 years ago to annotate genomes with ncRNAs using the covariance models. However, it has become the de facto reference database for known ncRNAs and their alignments. This has led to its use in new contexts, including RNA 3D structure prediction, which has pushed Rfam in new directions. Recently, Rfam has been improved by aligning sequences and base pair annotations from 3D structures into seed alignments. This connects Rfam alignments with 3D structures directly and allows improvements of families. We have used this to improve over 30 families and have started annotating pseudoknots. However, despite these improvements, Rfam still has several limitations that make the prediction of RNA 3D structures challenging. Briefly, ncRNA data is limited, biased and incomplete. In this talk we will discuss some of these issues, suggest possible improvements, and challenge the community to solve them.



10:30am - 10:45am

Unraveling the RNA web: detecting and deciphering entanglements in 3D structures

Marta Szachniuk1,2, Maciej Antczak1,2, Mariusz Popenda2, Joanna Sarzynska2, Tomasz Zok1

1Poznan University of Technology, Poland; 2Institute of Bioorganic Chemistry PAS, Poland

RNA molecules, essential players in the intricate machinery of cellular processes, exhibit a remarkable level of complexity in their three-dimensional structures. For many years, the primary focus in RNA structure study has been on base-pairing interactions and simple structure motifs. However, recent advances have unveiled another dimension of complexity: the presence of entanglements within RNA 3D structures. These structural intricacies, reminiscent of topological puzzles, may have profound implications for RNA function and dynamics. On the other hand, some types of entanglements may be artifacts introduced into the structure during its determination or in silico modeling.

In this presentation, we will explore the diverse range of entangled motifs that can be found within RNA molecules. We will delve into the computational algorithms that have been developed to detect and analyze these unusual topological configurations in RNA structures. Finally, we will take a look at entanglements in experimental and simulated models of RNA 3D structure and we will learn if they can be untangled with any existing methods.



10:45am - 11:00am

Posttranscriptional Modifications in RNA Experimental 3D Structures: Occurrences and Effect on Interbase Hydrogen Bonding

Mohit Chawla1, Luigi Cavallo1, Romina Oliva2

1King Abdullah University of Science and Technology (KAUST), Physical Sciences and Engineering Division, Kaust Catalysis Center, Thuwal 23955-6900, Saudi Arabia; 2Department of Sciences and Technologies, University Parthenope of Naples, Centro Direzionale Isola C4, I-80143 Naples, Italy

The physicochemical information of RNA molecules is greatly enhanced by posttranscriptional modifications, contributing to explain the diversity of their structures and functions.

To date, over 150 natural modifications have been characterized in all major classes of RNAs, ranging from isomerization or methylation, to the addition of bulky and complex chemical groups [1-2]. Modifications can change the folding landscape of RNA, resulting at times in alternative conformations [2-4]. This occurs by altering the interactions between nucleotides. Especially the H-bonding between nucleobases, both the regular Watson–Crick pairs enclosed in the RNA stems and the non Watson–Crick pairs outside the stems [5] - also known as tertiary interactions -, can be affected by modifications due to steric and energetic effects.

In order to investigate the impact of modifications on the interbase H-bonding, we have set up an approach combining structural bioinformatics with quantum mechanics (QM) calculations. Specifically, occurrences and structural context of modified base pairs (MBPs), i.e. base pairs featuring posttranscriptional modifications, are collected from the RNA structures in the PDB and classified by bioinformatics tools. Then, QM calculations are performed to clarify the effect of the modification on the geometry and stability of the corresponding base pair. We have applied this approach over time to both natural and non-natural (synthetic) modifications (see for instance [6-7]) and, in 2015, we presented an atlas of MBPs, i.e. a systematic study of all the MBPs in RNA experimental structures [8]. At the time, we could identify a total of ≈900 occurrences for 11 natural modifications, with roughly half of them involved in base pairing. Our atlas 1.0 consisted of 27 MBPs, unique in terms of identity of H-bonded bases and/or geometry classification.

Herein, to extend our understanding of how posttranscriptional modifications act on the structure of RNA molecules to influence their function, we present an updated atlas, derived from a structure dataset more than twice the size of the original. It consists overall of almost 100 unique MBPs, featuring 35 different posttranscriptional modifications, located in a variety of different RNA molecules and structural motifs. Consistent with our previous findings, most of the MBPs are non Watson–Crick like and are involved in RNA tertiary structure motifs. Results of the structural analyses, along with insight from QM calculations into the impact of the different modifications on the geometry and stability of the corresponding base pairs, will be presented and discussed.

1. P. Boccaletto, F. Stefaniak, A. Ray, A. Cappannini, S. Mukherjee, et al., Nucleic Acids Res., 50, (2022), D231.

2. P. F. Agris, RNA, 21, (2015), 552.

3. M. Helm, Nucleic Acids Res., 34, (2006), 721.

4. K. I. Zhou, M. Parisien, Q. Dai, N. Liu, L. Diatchenko, J. R. Sachleben, T. Pan, J. Mol. Biol, 428, (2016), 822.

5. N. B. Leontis, J. Stombaugh, E. Westhof, Nucleic Acids Res., 30, (2002), 3497.

6. R. Oliva, L. Cavallo, A. Tramontano, Nucleic Acids Res., 34, (2006), 865.

7. M. Chawla, S. Gorle, A. R. Shaikh, R. Oliva, L. Cavallo, Comput. Struct. Biotechnol. J., 19, (2021), 1312.

8. M. Chawla, R. Oliva, J. M. Bujnicki, L. Cavallo, Nucleic Acids Res., 43, (2015), 6714.



11:00am - 11:15am

RNAdvisor: Evaluation of RNA 3D structures with metrics and energies

Clément Bernard1, Sahar Ghannay2, Guillaume Postic1, Fariza Tahi1

1IBISC, France; 2LISN, CNRS, France

RNA adopts three-dimensional structures that play a crucial and direct role in its biological function. Understanding these diverse functions is necessary for the development of RNA-based therapies, but the complex structure of RNA molecules remains a major challenge. Computational methods have been developed throughout the years to fill the gap between the huge amount of known RNA sequences and their structures. With the increased number of RNA structures that are still to be discovered, predictive methods need to be robust and to be able to generalize to unseen new RNA families.

While structure prediction is a vast and complex problem, evaluating how native-like a structure is also remains an open question. An RNA structure is a 3D object, and how to evaluate a prediction has been discussed for years. Current methods, referred to as metrics, rely on the comparison of a prediction with a solved reference structure. They can measure deviations between atoms, like RMSD or εRMSD [1], or overlaps between them, like the CLASH score [2]. Other metrics are inspired by protein 3D evaluation metrics from the CASP competition. Indeed, RNA and protein 3D structures share common properties as 3D objects, and known protein metrics like TM-score [3] can be adapted to RNA. However, structural differences between protein and RNA molecules remain and hamper the full efficiency of such metrics. RNA-oriented metrics like the INF [4] or MCQ [5] scores have been developed to take full advantage of RNA structural specificities.
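As an illustration of the simplest reference-based metric mentioned above, the following sketch computes the RMSD between matched atoms of a reference and a model. It assumes the two structures are already superimposed and that atoms are listed in the same order; the optimal superposition step used in practice is deliberately left out, and the coordinates are made-up toy values.

```python
import math


def rmsd(reference, model):
    """Root-mean-square deviation between two equally long lists of (x, y, z) atoms.

    Assumes the structures are already superimposed and atoms are matched by index.
    """
    if len(reference) != len(model) or not reference:
        raise ValueError("structures must be non-empty and have the same number of atoms")
    total = sum(math.dist(a, b) ** 2 for a, b in zip(reference, model))
    return math.sqrt(total / len(reference))


# Toy coordinates in angstroms (hypothetical values).
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
model = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 1.0)]
value = rmsd(reference, model)
print(round(value, 3))
```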

Nonetheless, these metrics rely on a known solved structure, which in practice is not available. Predictive models also typically generate multiple structures before selecting the best ones. A common approach is thus to approximate the free energy of the molecule, where an energy minimum would indicate a stable structure. Such approximations of the free energy have become a standard in the ranking, filtering and confidence assessment of structures. They often use knowledge-based statistical potentials, which require a reference state to simulate structures without native interactions. This is the case for NAST [6], 3dRNAScore [7], DFIRE-RNA [8] and rsRNASP [9]. Recent approaches like RNA3DCNN [10] or ARES [11] tend to use deep learning to avoid manual pre-processing of RNA features.

RNA 3D structures remain highly complex, and no single existing metric or energy can correctly evaluate all the available structures. Metrics and energies can be redundant with one another, yet also complementary for structure assessment. The different existing metrics can be needed to understand the weaknesses of predictive models, while the diverse energies could help improve model generation, for example in the filtering step.

The current metrics and energies are the results of years of research by various groups. Each has been developed in a different programming language, with different installation procedures and library versions that have evolved over the years. The installation process can be laborious for the community and is multiplied by the number of different metrics and energies. Efforts should be focused on developing predictive models, and the engineering aspects of structure assessment should not be a bottleneck. The community has addressed this with the development of RNA-Puzzles [12], a CASP-like competition for RNA 3D structure assessment. It comes with RNA-tools [13], a centralised platform that tries to include the available RNA 3D structure related works. Nonetheless, it is limited in practice by the need to manually include binary files that depend on the operating system of the user. There are also web servers available for some metrics and energies, which are useful for users who do not code. However, web servers limit automation, which should be considered given the increasing number of solved 3D structures.

To help the development and the automation of RNA 3D structure evaluation, we have developed RNAdvisor: a tool usable from a single command line that can compute both metrics and energies for a given RNA. It uses eight existing codes written in C++, Java or Python and gathers them into a single interface. All the laborious installations are done in different stages of the Dockerfile. It leverages Docker containers for easy installation across diverse operating systems, simplifying accessibility for all researchers. It enables researchers to access both metrics and energies in one line of code, with customizable parameters to suit individual preferences.

RNAdvisor represents a significant advancement for the automation of RNA 3D structure evaluation. It offers a unified tool that enhances the accessibility of existing metrics and energies. It helps accelerate investigation in RNA 3D structure predictions.

The source code is available at: https://github.com/EvryRNA/rnadvisor.

1. Bottaro S, Di Palma F, and Bussi G. The Role of Nucleobase Interactions in RNA Structure and Dynamics. Nucleic Acids Research 2014;42.

2. Davis IW, Leaver-Fay A, Chen VB, et al. MolProbity: all-atom contacts and structure validation for proteins and nucleic acids. Nucleic Acids Research 2007;35:W375–W383.

3. Zhang Y and Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins 2004;57:702–10.

4. Parisien M, Cruz J, Westhof E, and Major F. New metrics for comparing and assessing discrepancies between RNA 3D structures and models. RNA (New York, N.Y.) 2009;15:1875–85

5. Zok T, Popenda M, and Szachniuk M. MCQ4Structures to compute similarity of molecule structures. Central European Journal of Operations Research 2013;22.

6. Jonikas MA, Radmer RJ, Laederach A, et al. Coarse-grained modeling of large RNA molecules with knowledge-based potentials and structural filters. RNA 2 2009;15:189–99.

7. Wang J, Zhao Y, Zhu C, and Xiao Y. 3dRNAscore: a distance and torsion angle dependent evaluation function of 3D RNA structures. Nucleic Acids Research 10 2015;43:e63–e63

8. Capriotti E, Norambuena T, Marti-Renom MA, and Melo F. All-atom knowledge-based potential for RNA structure prediction and assessment. Bioinformatics 2011; 27:1086–93.

9. Tan YL, Wang X, Shi YZ, Zhang W, and Tan ZJ. rsRNASP: A residue-separation based statistical potential for RNA 3D structure evaluation. Biophysical Journal 1 2022;121:142–56.

10. Li J, Zhu W, Wang J, et al. RNA3DCNN: Local and global quality assessments of RNA 3D structures using 3D deep convolutional neural networks. PLOS Computational Biology 2018;14:1–18.

11. Townshend RJL, Eismann S, Watkins AM, et al. Geometric deep learning of RNA structure. Science 6558 2021;373:1047–51.

12. Cruz J, Blanchet MF, Boniecki M, et al. RNA-Puzzles: A CASP-like evaluation of RNA three-dimensional structure prediction. RNA (New York, N.Y.) 2012;18:610–25.

13. Magnus M. rna-tools.online: a Swiss army knife for RNA 3D structure modeling workflow. Nucleic Acids Research 2022;50:W657–W662.



11:15am - 11:30am

Prediction of secondary structure for long non-coding RNAs using a recursive cutting method based on deep learning

Loïc Omnes1, Eric Angel1, Pierre Bartet3, François Radvanyi2, Fariza Tahi1

1Université Paris-Saclay, Univ Evry, France; 2CNRS - Institut Curie, France; 3ADLIN Science, France

Accurately predicting the secondary structure of RNA, particularly for long non-coding RNA, has direct implications in healthcare, where it can be used for diagnostic, therapeutic, and drug discovery purposes. However, the majority of previous approaches are too computationally costly to cope with the increasing complexity of long RNAs, and those that can scale to long RNAs lack the accuracy to reliably predict their structures. We propose a new approach combining recursive cutting and machine learning techniques for predicting the secondary structures of long non-coding RNAs. Our method achieves computational efficiency by recursively partitioning a sequence into smaller fragments until they can be easily managed by an existing model. We benchmark different state-of-the-art models and show that our approach performs better for long RNAs and has the potential to bring significant improvements in the future, as well as interesting enhancing properties, which we discuss.
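The recursive partitioning idea can be sketched as a simple divide-and-conquer routine. Everything specific here is a placeholder assumption: `choose_cut` cuts at the midpoint where the described method would use a learned cutting model, `predict_fragment` stands in for an existing secondary structure predictor, and `max_len` is an arbitrary fragment size limit.

```python
def predict_fragment(seq):
    """Placeholder for an existing structure predictor (hypothetical): all unpaired."""
    return "." * len(seq)


def choose_cut(seq):
    """Placeholder cut-point selection (assumption): cut at the midpoint."""
    return len(seq) // 2


def fold_recursive(seq, max_len=8):
    """Cut the sequence recursively until fragments are short enough to predict."""
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    if len(seq) <= max_len:
        return predict_fragment(seq)
    cut = choose_cut(seq)
    # Fragment structures are joined in sequence order (dot-bracket strings).
    return fold_recursive(seq[:cut], max_len) + fold_recursive(seq[cut:], max_len)


sequence = "GGGAAACCCUUUGGGAAACCC"
structure = fold_recursive(sequence, max_len=8)
print(structure)
```

Plain concatenation cannot represent base pairs that span two fragments, so a real implementation would need cut points and fragment assembly that preserve such long-range pairs; this sketch only shows the recursion itself.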

 
11:30am - 12:00pmCB2: Coffee break
Location: Forum Hall Foyer 3
12:00pm - 1:00pmS5: Activity 2, 5 - Additional talks
Location: Chamber Hall
Session Chair: Lynne Regan
Session Chair: Bohdan Schneider
 
12:00pm - 12:30pm

Protein Quaternary Structures in Solution are a Mixture of Multiple forms

Gideon Schreiber

Weizmann Institute of Science, Israel

Over half the proteins in the E. coli cytoplasm form homo- or hetero-oligomeric structures. Experimentally determined structures are often considered in determining a protein’s oligomeric state, but static structures miss the dynamic equilibrium between different quaternary forms. The problem is exacerbated in homo-oligomers, where the oligomeric states are challenging to characterize. Here, we re-evaluated the oligomeric state of 17 different bacterial proteins across a broad range of protein concentrations and solutions by native mass spectrometry (MS), mass photometry (MP), size exclusion chromatography (SEC), and small-angle X-ray scattering (SAXS), finding that most exhibit several oligomeric states. Surprisingly, many proteins did not show mass-action driven equilibrium between the oligomeric states. For approximately half the proteins, the predicted oligomeric forms described in publicly available databases underestimated the complexity of protein quaternary structures in solution. Conversely, AlphaFold Multimer provided an accurate description of the potential multimeric states for most proteins, suggesting that it could help resolve uncertainties on the solution state of many proteins.



12:30pm - 1:00pm

Structural plasticity in the loop region of engineered lipocalins with novel ligand specificities – Anticalins

Arne Skerra

Technical University of Munich, Germany

Anticalins are generated via combinatorial protein design on the basis of the lipocalin protein scaffold and constitute a novel class of small and robust binding proteins. These engineered lipocalins offer prospects as an alternative to antibodies for applications in medical therapy as well as in vivo diagnostics. The lipocalins are natural binding proteins with diverse ligand specificities which share a simple architecture with a central eight-stranded antiparallel β-barrel and an α-helix attached to its side. At the open end of the β-barrel, four structurally variable loops connect the β-strands in a pair-wise manner and, together, shape the ligand pocket. Using targeted random mutagenesis in combination with molecular selection techniques, this loop region can be reshaped to generate pockets for the tight binding of various ligands ranging from small molecules over peptides to proteins. While such Anticalin proteins can be derived from different natural lipocalins, the human lipocalin 2 (Lcn2) scaffold proved particularly successful for the design of binding proteins with novel specificities and, over the years, more than 20 crystal structures of Lcn2-based Anticalins have been elucidated. Using a novel way of graphical representation, the conformational variability that emerged in the loop region of these functionally diverse artificial binding proteins can be illustrated in comparison with the natural scaffold. This analysis has provided picturesque evidence of the high structural plasticity around the binding site of the lipocalins which explains their proven tolerance toward excessive mutagenesis. 
Furthermore, apart from a simple lock-and-key mode of ligand recognition, structural evidence suggests two distinct mechanisms of spatial adaptation during the formation of Anticalin-ligand complexes: (i) induced fit, in which conformational alteration follows ligand binding, and (ii) conformational selection, which is based on a pre-existing mixture of conformational states. Taken together, these molecular mechanisms demonstrate remarkable resemblance between the binding site of lipocalins (natural or engineered) and the well characterized complementarity-determining region of immunoglobulins (antibodies), which represent two structurally and functionally different types of mammalian plasma proteins.

 
1:00pm - 2:00pmL2: Lunch
Location: Forum Hall Foyer 3
2:00pm - 3:30pmS6a: Activity 5 - To establish a Biostudies database of protein engineering results
Location: Chamber Hall
Session Chair: Lynne Regan
 
2:00pm - 2:30pm

Novel Strategies and Web-based Tools for Protein Engineering

Jiri Damborsky

Masaryk University and ICRC-FNUSA, Czech Republic

We develop novel strategies and web-based protein engineering tools under the ELIXIR Czech Republic umbrella. These are fully automated computational workflows which can be operated using an intuitive graphical user interface [1]. Protein sequence or structure is typically the only input required for the calculation. The tools can be accessed freely via the Protein Engineering Portal (Figure 1). The tools are particularly suitable for experimentalists without prior structural biology or bioinformatics knowledge. The National Supercomputing Centre IT4Innovations provides the infrastructure for high-performance computing. This talk will introduce some of our web tools and illustrate their use for engineering proteins for biotechnological and biomedical applications [2].



2:30pm - 3:00pm

Novel immunotherapeutic drugs through computational protein design

Clara Tabea Schoeder

Leipzig University, Faculty of Medicine, Germany

Recent advances in computational protein design have demonstrated its usability for tasks such as designing self-assembling nanoparticles, β-barrel membrane proteins or the rapid discovery of picomolar binders for the SARS-CoV-2 receptor binding domain as antiviral drugs. In my laboratory, we are further leveraging these tools to design new immunotherapeutic drugs, including stabilized viral glycoproteins for rationally acting vaccines, antibodies and antibody fragments for cellular therapies, and adeno-associated virus capsids for targeted gene therapy. While computational protein design has been based on biophysical and knowledge-derived energy terms in the past, new machine learning methods are emerging with new capabilities. Here, we present a study in which protein design in the Rosetta software framework was combined with the prediction of post-translational modifications using artificial neural networks. We integrated these models in the Rosetta framework, allowing the use of these predictions during design. With this, it is possible to enrich for intended post-translational modifications while altering the sequence, e.g. for the design of N-linked glycosylation, but also to decrease the occurrence of unintended modification sites, such as deamidation of asparagine. This new method will be applied during epitope-focused immunogen design for influenza virus vaccines and for the stabilization of antibody therapeutics.



3:00pm - 3:15pm

Using molecular dynamics (MD) calculations for the characterization of structural transitions

Outi Tuulikki Lampela1, Tiila-Riikka Kiema1, Takayuki Ohnuma2,3, Tamo Fukamizo3, Rikkert Wierenga4, André Juffer1,4

1Biocenter Oulu, University of Oulu, Finland; 2Department of Advanced Bioscience, Kindai University, Japan; 3Agricultural Technology and Innovation Research Institute (ATIRI), Kindai University, Japan; 4Faculty of Biochemistry and Molecular Medicine, University of Oulu, Finland

The increasing number of structures available for individual proteins in different conditions enables defining structural changes between protein states, such as unbound versus liganded and active versus inactive conformations. These structures do not reveal all the atomic movements (dynamics) needed for the transition between these states. Molecular dynamics (MD) simulations can be used to study and confirm anticipated structural changes. This requires an analysis of the available structures to define the states as well as the recognition of the path (transition) between these states from a simulated ensemble of conformations. The present work has applied correlation-based analysis tools to MD simulation results to find possible states and transitions and to understand the underlying mechanism of protein structure changes. We present examples of the information available from MD calculations when combined with structural data. We also present some of the challenges of finding the transition of interest in the simulation data.

For two proteins, vcCBP (Vibrio cholerae carbohydrate binding protein) [1] and MFE1 (multifunctional enzyme, type-1) [2], several 500 ns all-atom MD simulations were performed to study the dynamics of the proteins when starting from different structural states. vcCBP is a periplasmic solute binding protein specific for chitooligosaccharide, a polymer of N-acetylglucosamine, (GlcNAc)n. MFE1 is a monomeric enzyme with two active sites. In both studies, structure-based predetermined distances between carefully selected Cα atoms were analysed as a function of the simulation time as well as by cross-correlation of these distances with each other. The atom distance analysis and dynamical cross-correlation (DCC) data from MD simulations have been used to investigate the intra-domain and inter-domain movements.
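The distance analysis described above can be sketched in a few lines, under the assumption that a trajectory is available as a list of frames, each holding the coordinates of the selected atoms. The sketch computes two distance time series and their Pearson correlation; it does not cover DCC of atomic positions, which correlates displacement vectors instead of distances. The trajectory and atom indices are invented toy data.

```python
import math


def distance_series(trajectory, i, j):
    """Distance between atoms i and j in every frame of the trajectory."""
    return [math.dist(frame[i], frame[j]) for frame in trajectory]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


# Toy trajectory: 4 frames, 4 selected atoms with (x, y, z) coordinates.
trajectory = [
    [(0, 0, 0), (1, 0, 0), (0, 5, 0), (1, 5, 0)],
    [(0, 0, 0), (2, 0, 0), (0, 5, 0), (2, 5, 0)],
    [(0, 0, 0), (3, 0, 0), (0, 5, 0), (3, 5, 0)],
    [(0, 0, 0), (2, 0, 0), (0, 5, 0), (2, 5, 0)],
]
d01 = distance_series(trajectory, 0, 1)
d23 = distance_series(trajectory, 2, 3)
r = pearson(d01, d23)
print(d01, d23, r)
```

A correlation close to 1 indicates that the two distances open and close together during the simulation, while a value close to -1 indicates opposite movements.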

In the case of vcCBP, the distance analysis can differentiate between a bound and an unbound state, but it is the DCC data of the Cα-atom positions that highlight key features of the binding mode of the two different ligands, (GlcNAc)2 and (GlcNAc)3, as shown in Fig. 1. The cross-correlation analysis reveals different behaviour of the domains and a division of a single domain into two domains, which is consistent with the thermodynamic data measured for these ligands. The results predict the existence of dynamical domains, which is confirmed by a combination of experimental structural data [3].

For the MFE1 study, two different conformational states are found in the asymmetric unit of the crystallised protein. The characteristic distances for the A and B states have been defined in previous publications [2]. The analysis of the 500 ns simulation runs of the unliganded structure, using these characteristic distances, shows that these structural states do not converge to each other on the time scale attainable by the MD simulation. The simulations are being extended, and analysis of 500 ns simulation runs of liganded MFE1 complexes is still in progress.

The collection of the distance characteristics from two or more protein conformational states using extensive simulation data will make it possible to automate the search of the residues that are important for the function of these proteins.



3:15pm - 3:30pm

From AlphaFold to PyMOL: enabling seamless access to Structural Bioinformatics Tools

Serena Rosignoli, Alessandro Paiardini

Sapienza, Italy

AlphaFold, the groundbreaking artificial intelligence system developed by Google DeepMind, has revolutionised the field of structural biology with its remarkable accuracy in predicting protein structures. A pivotal development that facilitated its widespread adoption was the creation of a comprehensive database of predictions. Subsequently, AlphaFold has been investigated and examined from various viewpoints, leading to a rapid increase in its utilisation across all domains of biotechnological research. This growth has also catalysed new software development aimed at incorporating AlphaFold's predictions, often focusing on addressing the algorithm's remaining challenges. The cornerstone of AlphaFold's practical success lies in its capacity to integrate into established protocols within structural bioinformatics. To accomplish this, ensuring that AlphaFold is made as user-friendly as possible becomes paramount.

Against this background, our goal is to integrate these protein structure models seamlessly into the PyMOL-PyMod environment. The latter, an established project designed to act as a complete suite for structural bioinformatics, will incorporate AlphaFold predictions into its tools for sequence similarity searches, multiple sequence/structure alignment building, phylogenetic trees and evolutionary conservation analyses, domain parsing, single/multiple chains and loop modelling. In addition to this canonical use of PyMod, tailored PDB-API queries will enrich AlphaFold models with information on their conformational space, biological assemblies, post-translational modifications, ligands and cofactors, hence powering the system with biophysical knowledge. Despite the significant breakthroughs of ab-initio models in structural predictions, homology-based methods remain of interest. The synergistic application of homology-based and ab-initio algorithms could enable structural biologists to create detailed models of intricate protein complexes where AlphaFold may have had limited predictive capacity regarding the relative positioning of domains or the prediction of proper multimeric assemblies.

Our proposal represents a significant step forward in streamlining research workflows and democratising access to cutting-edge structural biology tools, embracing this synergy between AI-driven predictions and user-friendly platforms with a unified and accessible environment.

 
3:30pm - 4:00pmCB3: Coffee break
Location: Forum Hall Foyer 3
4:00pm - 4:45pmS6b: Activity 5 - To establish a Biostudies database of protein engineering results
Location: Chamber Hall
Session Chair: Lynne Regan
 
4:00pm - 4:15pm

Design of novel peptides targeted to human primary amine oxidase

Marion Denise Alix, Gabriela Guédez, Tiina Annamaria Salminen

Åbo Akademi university, Finland

M. Alix, G. Guédez, T.A. Salminen, Åbo Akademi university, Science and Engineering, Biochemistry, Tuomiokiokontori 3, 20500, marion.alix@abo.fi

Human primary amine oxidase (hAOC3), a transmembrane protein on the endothelium, binds the Sialic Acid-binding Immunoglobulin-like lectin-9 (Siglec-9) on the leukocyte surface for extravasation to the inflammatory site [1]. During inflammatory conditions, hAOC3 is translocated to the endothelial cell surface and, therefore, a labelled Siglec-9 peptide specifically targeting hAOC3 [1] can be used to detect tumors and acute or chronic inflammatory responses in many diseases, e.g. rheumatoid arthritis [2], by positron emission tomography (PET). The aim of this study is to design improved Siglec-9-derived peptides for PET imaging and to develop them further as anti-cancer and anti-inflammatory agents.

We have started our study by designing a novel Siglec-9 peptide that corresponds to the original phage peptide but is significantly smaller than the Siglec-9 peptide used for PET imaging [1]. The experimental binding studies show that the designed peptide interacts with hAOC3 like the original phage peptide (positive control), whereas the mouse Siglec-E peptide (negative control) shows no binding. In addition to these peptides, we have predicted, by docking experiments, the binding of even smaller Siglec-9 peptides containing the WRG motif like the phage peptide, a Siglec-10 peptide containing the QRG motif, and other related peptides. The docking results help us to understand the specific binding of the human Siglec-9 peptide to hAOC3. Interestingly, the binding sites for the phage peptide and the human Siglec-9 peptides coincide with the known inhibitor binding site in hAOC3 around the specificity site (Fig. 1).

1. K. Aalto, A. Autio, E. A. Kiss, K. Elima, Y. Nymalm, T.Z. Veres, F. Marttila-Ichihara, H. Elovaara, T. Saanijoki, P. R. Crocker, M. Maksimow, E. Bligt, T.A. Salminen, M. Salmi, A. Roivainen and S. Jalkanen, Blood, 118, (2011), pp. 3725-33.

2. H. Virtanen, A. Autio, R. Siitinen, H. Liljenbäck, T. Saanijoki, P. Lankinen, J. Mäkilä, M. Käkelä, J. Teuho, N. Savisto, K. Jaakkola, S. Jalkanen, and A. Roivainen, Arthritis Research & Therapy, 17, (2015), 308.



4:15pm - 4:30pm

Assessing the performance of protein regression models

Wouter Boomsma

University of Copenhagen, Denmark

Machine learning is increasingly used to predict protein fitness as a guide for selecting promising candidates in protein engineering. It is therefore natural to ask how good these predictions are, and how much we are able to learn from one round of experiments to the next. In this talk, I will discuss how different assessment criteria for regressor performance can lead to quite different conclusions, depending on the choice of metric and the underlying definition of generalization. I will also highlight issues of sample bias in typical regression scenarios and how this can lead to misleading conclusions about regressor performance.
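As an illustration of how the choice of metric can change the conclusion, the following minimal sketch compares two regressors on the same data using Spearman rank correlation and root-mean-square error (RMSE). The fitness values and both sets of predictions are invented for this example and do not come from the talk; the choice of these two metrics is also an assumption.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values receive the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def rmse(x, y):
    """Root-mean-square error between two equally long sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical measured fitness values and two hypothetical regressors.
measured = [0.1, 0.2, 0.3, 0.4, 0.5]
model_a = [1.1, 1.2, 1.3, 1.4, 1.5]  # perfect ranking, constant offset
model_b = [0.2, 0.1, 0.3, 0.5, 0.4]  # small errors, two pairs swapped

for name, predicted in (("A", model_a), ("B", model_b)):
    print(name, spearman(measured, predicted), rmse(measured, predicted))
```

In this toy case model A ranks the variants perfectly but has the larger RMSE, while model B has the smaller RMSE but a worse ranking, so which model is "better" depends on the metric chosen.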



4:30pm - 4:45pm

Prediction of bacterial interactomes based on genome-wide coevolutionary networks: an updated implementation of the ContextMirror approach

Miguel Fernandez, Camila Pontes, Victoria Ruiz-Serra, Alfonso Valencia

Barcelona Supercomputing Center, Spain

The biological function of proteins is preserved through coevolution, which can be quantified by computing the similarity between the phylogenetic trees of pairs of protein families [1]. High phylogenetic similarity indicates that the proteins are likely to interact. However, this similarity is influenced by many factors, including background evolution. Current coevolution-based methods treat protein pairs independently, even though each protein interacts with multiple others.
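A common way to compute such tree similarity is to correlate the inter-species distance matrices of the two protein families. The sketch below assumes that formulation (a linear correlation over the upper triangles of two matrices built for the same species in the same order). The matrices are invented for illustration, and the code is not the ContextMirror pipeline, which additionally accounts for the coevolutionary network.

```python
import math


def upper_triangle(matrix):
    """Flatten the strictly upper triangle of a square distance matrix."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def tree_similarity(dist_a, dist_b):
    """Correlation between two families' inter-species distance matrices.

    Both matrices must cover the same species in the same order.
    """
    return pearson(upper_triangle(dist_a), upper_triangle(dist_b))


# Hypothetical evolutionary distances between four species.
family_a = [[0, 1, 4, 5], [1, 0, 3, 6], [4, 3, 0, 2], [5, 6, 2, 0]]
# Family B evolves twice as fast but with the same tree shape.
family_b = [[2 * d for d in row] for row in family_a]
# Family C has an unrelated tree.
family_c = [[0, 6, 2, 1], [6, 0, 4, 3], [2, 4, 0, 5], [1, 3, 5, 0]]

print(tree_similarity(family_a, family_b))
print(tree_similarity(family_a, family_c))
```

Families A and B give a correlation of 1 despite different evolutionary rates, which is the behaviour that makes the correlation a candidate coevolution signal; families A and C are negatively correlated in this toy data.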

The ContextMirror methodology evaluates coevolution by integrating the influence of every interactor on a given protein pair (coevolutionary network), providing more accurate protein-protein interaction predictions [2]. In our study, we evaluate the ContextMirror pipeline, already shown to improve the prediction of protein-protein interactions, by predicting protein-protein interactions for the full proteome of Escherichia coli (4298 proteins). Preliminary predictions reveal the potential of this approach to improve our understanding of protein coevolution. The true positive rate of the top-500 predictions (≈ 60% accuracy) is comparable to that of other methods and, when compared with the STRING database [3], these predictions map only to high-confidence pairs (confidence score > 0.8). The physical compatibility between these pairs was confirmed by quantifying the structural PPI interface as the pDockQ score from structures modelled with AlphaFold-Multimer [4].
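The comparison of top-ranked predictions against a reference database can be sketched as follows. This is only an assumed evaluation scheme: the protein names, scores, and reference confidences are invented, and the 0.8 cutoff is taken from the text above. It is not the evaluation code used in the study.

```python
def precision_at_k(scored_pairs, reference_pairs, k):
    """Fraction of the k highest-scoring predictions found in the reference.

    scored_pairs: list of ((protein_1, protein_2), score) tuples.
    reference_pairs: set of frozensets, so pair order does not matter.
    """
    top = sorted(scored_pairs, key=lambda item: item[1], reverse=True)[:k]
    hits = sum(1 for pair, _ in top if frozenset(pair) in reference_pairs)
    return hits / len(top)


# Hypothetical coevolution scores for predicted pairs.
predictions = [
    (("protA", "protB"), 0.9),
    (("protC", "protD"), 0.8),
    (("protA", "protC"), 0.7),
    (("protB", "protD"), 0.2),
]

# Hypothetical reference confidences; keep only pairs above the cutoff.
reference_confidence = {
    frozenset(("protA", "protB")): 0.95,
    frozenset(("protC", "protA")): 0.85,
    frozenset(("protC", "protD")): 0.50,
}
high_confidence = {
    pair for pair, score in reference_confidence.items() if score > 0.8
}

print(precision_at_k(predictions, high_confidence, 3))
```

Storing each pair as a frozenset makes the lookup independent of the order in which the two proteins are listed.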

In the current stage of our analysis, ContextMirror is being used to predict PPI networks for different bacterial proteomes and to identify differences in their predicted interactomes with potential applications in drug design and protein engineering [5].

1. Pazos, F., & Valencia, A. (2001). Similarity of phylogenetic trees as indicator of protein–protein interaction. Protein Engineering, Design and Selection, 14(9), 609-614.
2. Juan, D., Pazos, F., & Valencia, A. (2008). High-confidence prediction of global interactomes based on genome-wide coevolutionary networks. Proceedings of the National Academy of Sciences, 105(3), 934-939.
3. Szklarczyk, D. et al. (2023). The STRING database in 2023: protein-protein association networks and functional enrichment analyses for any sequenced genome of interest. Nucleic Acids Research, 51(D1), D638-D646.
4. Zhu, W. et al. (2023). Evaluation of AlphaFold-Multimer prediction on multi-chain protein complexes. Bioinformatics, 39(7), btad424.

 
4:45pm - 6:00pmDiscussion: Elixir Platform Discussion
Location: Chamber Hall
Session Chair: Bohdan Schneider

Elixir Platform Discussions.

Chairs: Sameer Velankar, Shoshana Wodak, Vincent Zoete, Bohdan Schneider, Lynne Regan


Date: Friday, 17/Nov/2023
9:00am - 11:00amS7: Activity 3 - To develop models for protein-ligand interactions
Location: Chamber Hall
Session Chair: Vincent Zoete
 
9:00am - 9:15am

Introduction and status of the ligand-protein activity

Vincent Zoete

SIB Swiss Institute of Bioinformatics, Switzerland

Ligand-protein activity



9:15am - 9:45am

AlphaFill: enriching AlphaFold models with ligands and cofactors

Robbie P. Joosten

Netherlands Cancer Institute, The Netherlands

With the public availability of the AlphaFold Protein Structure Database, structural biologists now have a valuable new tool for their research. However, biochemical interpretation of these structures is limited to amino acids only. That is, co-factors, small molecules and/or metal ions, which are required for protein function or structural integrity, are lacking. To address this limitation we developed AlphaFill: an algorithm that enriches AlphaFold models with co-factors, small molecules and (metal) ions based on homologous structures available in the PDB-REDO databank. All AlphaFill models are freely available through alphafill.eu: a FAIR (Findable, Accessible, Interoperable and Reusable) resource to support scientists interpreting the enriched AlphaFold models while creating new hypotheses and designing experiments.

Besides the AlphaFill databank containing “filled” AlphaFold models, we developed a webservice where users can upload their own protein structures and run the algorithm to enrich their structures of interest with ligands and cofactors. Here, we will present the latest features and future developments that improve the interpretability of the protein-ligand complexes obtained by AlphaFill, and increase user flexibility while “filling” their macromolecular structures of interest.



9:45am - 10:15am

A Bottom-Up Approach to Screening Massive Virtual Collections

Xavier Barril1,2,3

1University of Barcelona; 2ICREA; 3Gain Therapeutics

Drug discovery starts with the identification of a ‘hit’ compound that, following a long and expensive optimisation process, evolves into a drug candidate. Bigger screening collections offer the possibility of finding more and better hits. For this reason, the emergence of on-demand chemical collections (which nowadays contain >50 billion accessible compounds) has the potential to revolutionise the drug discovery process, delivering hits that are closer to the final drug, even for the most difficult targets. But first, it will be necessary to develop new tools and protocols capable of sieving through such massive collections in an effective and efficient manner. To address this challenge, we have conceived a novel virtual screening strategy that explores the chemical universe from the bottom up, i.e. performing a systematic search of the fragment-size chemical space, followed by focused exploration of the most promising areas of the lead-like chemical space. Using a hierarchy of increasingly sophisticated computational methods, we also maximise the success probability of each selected compound. In this talk I will describe the first application of this concept, the results obtained in a prospective validation and its current implementation as a Galaxy pipeline.



10:15am - 10:30am

SiteMine: large scale binding site similarity searching in protein databases

Thorben Reim1, Christiane Ehrt1, Joel Graef1, Sebastian Günther2, Alke Meents2, Matthias Rarey1

1Center for Bioinformatics, Universität Hamburg, Bundesstraße 43, 20146 Hamburg, Germany; 2Center for Free-Electron Laser Science, Deutsches Elektronen-Synchrotron, Notkestraße 85, 22607 Hamburg, Germany

Drug discovery and design challenges such as drug repurposing, the analyses of protein-ligand and protein-protein complexes, ligand promiscuity studies or function prediction can be addressed by protein binding site similarity searches. Although numerous tools were developed in the past, they all have their individual benefits and drawbacks with regard to run time, provision of structure superpositions, and applicability to diverse application domains. Here, we introduce SiteMine, an all-in-one database-driven, alignment-providing binding site similarity search tool to tackle the most pressing challenges with regard to binding site comparison. The performance of SiteMine is evaluated on the ProSPECCTs benchmark, showing a promising performance on most of the data sets. The method performs convincingly regarding all quality criteria for reliable binding site comparison, offering a novel state-of-the-art approach for structure-based molecular design based on binding site comparisons. In a SiteMine showcase, we found a high structural similarity between Cathepsin L and Calpain 1 binding sites and give an outlook on how this can be used in structure-based drug design.



10:30am - 10:45am

Automated benchmarking of protein-ligand complex prediction

Janani Durairaj1,2, Michele Leemann1,2, Ander Sagasta1,2, Louis Ollivier1,2, Peter Skrinjar1,2, Arthur Goetzee1,2, Leila Alexander1,2, Xavier Robin1,2, Torsten Schwede1,2

1Biozentrum, University of Basel, Switzerland; 2SIB Swiss Institute of Bioinformatics

Following the major advances in protein structure prediction, novel protein-ligand complex (PLC) prediction methods, including those using innovative deep learning methods, are being developed with very promising results. This was reflected in the inclusion of the protein-ligand interaction prediction category in the CASP15 experiment, where the task consisted of predicting both the three-dimensional structure of the receptor protein as well as the position and conformation of the ligand. We address the challenges and propose solutions for devising automated benchmarking techniques for PLC prediction:

  • Is the ground truth good enough? - experimental uncertainty in structure determination may invalidate comparison.

  • Is a PLC interesting to assess? - the bias and redundancy in structure determination efforts hinder representative evaluation across diverse PLCs.

  • What scoring metrics apply to PLCs? - existing scores are not suited to evaluate local interactions between protein and ligand atoms.

With the Continuous Automated Model EvaluatiOn (CAMEO) platform (cameo3d.org) we take advantage of the prediction window afforded by the weekly pre-release of the PDB to perform blind evaluations of PLC prediction methods. In parallel, the LIGATE project aims to assess these methods on diverse and representative datasets with different characteristics, potentially highlighting strengths and weaknesses of the algorithms. Both of these efforts rely on fully automated workflows.

We describe how we identify high-quality targets and assess their novelty with binding pocket-focused clustering techniques. We then evaluate the accuracy of the predicted PLCs from the protein, ligand, and, most importantly, the interaction perspective, for which we developed two new PLC-specific scores. We show the need for more representative and higher quality benchmark sets, and also show that redocking on crystal structures is a much simpler task than docking into predicted protein models. We introduce a fully automated Nextflow pipeline that predicts PLCs and evaluates their accuracy.
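For orientation, the ligand perspective is often assessed with a plain heavy-atom RMSD between the predicted and the experimental ligand pose, with 2 Å as a widely used success cutoff. The sketch below shows that conventional baseline only; it is not one of the two new PLC-specific scores mentioned above. It assumes that both poses are already in the same receptor frame and list their atoms in the same order, with no correction for ligand symmetry. All coordinates are invented.

```python
import math


def ligand_rmsd(predicted, reference):
    """RMSD in Å between two ligand poses with identical atom ordering."""
    if len(predicted) != len(reference):
        raise ValueError("poses must contain the same number of atoms")
    squared = sum(
        (p[axis] - r[axis]) ** 2
        for p, r in zip(predicted, reference)
        for axis in range(3)
    )
    return math.sqrt(squared / len(reference))


def is_successful(predicted, reference, threshold=2.0):
    """True if the pose lies within the (assumed) 2 Å success cutoff."""
    return ligand_rmsd(predicted, reference) <= threshold


# Hypothetical three-atom ligand: experimental pose and two predictions.
experimental = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
close_pose = [(x + 1.0, y, z) for x, y, z in experimental]  # shifted by 1 Å
far_pose = [(x + 3.0, y, z) for x, y, z in experimental]    # shifted by 3 Å

print(ligand_rmsd(close_pose, experimental), is_successful(close_pose, experimental))
print(ligand_rmsd(far_pose, experimental), is_successful(far_pose, experimental))
```

A score of this kind says nothing about whether the protein-ligand contacts are reproduced, which is one reason interaction-focused scores are needed.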

By providing the structural biology community with novel high-quality benchmarking targets that challenge the capabilities of existing methods, and automated workflows to efficiently and effectively assess their prediction results, we are supporting new developments in this highly active area of research.



10:45am - 11:00am

Jumpcount: Exact confidence intervals for free energy difference estimations

Pavel Kříž1, Jan Beránek2, Vojtěch Spiwok2

1Faculty of Mathematics and Physics, Charles University, Prague, Czech Republic; 2Department of Biochemistry and Microbiology, University of Chemistry and Technology, Prague, Czech Republic

An increasing number of biomolecular simulations are long enough to predict equilibrium constants and the associated free energies of simulated processes such as protein folding or protein-ligand binding. The equilibrium constant of a transition from state A to state B can be estimated as the ratio of the times spent in B and in A. Here we present a simple, yet exact, method to calculate the error of the free energy estimate solely from the temperature and the numbers of transitions from A to B and from B to A. Markovianity of the process is a prerequisite of the method. The error can be calculated online by our tool at jumpcount.cz.
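The quantities involved can be sketched in a few lines. The example below computes the point estimate ΔG = -RT ln K with K = t_B / t_A, together with the two transition counts that the method takes as input. It does not compute the confidence interval itself, because the exact formula is not given in this abstract. The state trajectory is invented and assumed to be sampled at equal time intervals.

```python
import math

GAS_CONSTANT = 0.008314462618  # kJ/(mol K)


def summarize_trajectory(states, temperature):
    """Equilibrium constant, free energy and transition counts for A <-> B.

    states: sequence of "A"/"B" labels sampled at equal time intervals.
    temperature: simulation temperature in kelvin.
    """
    time_a = states.count("A")
    time_b = states.count("B")
    a_to_b = sum(1 for s, t in zip(states, states[1:]) if (s, t) == ("A", "B"))
    b_to_a = sum(1 for s, t in zip(states, states[1:]) if (s, t) == ("B", "A"))
    k_eq = time_b / time_a  # ratio of times spent in B and in A
    delta_g = -GAS_CONSTANT * temperature * math.log(k_eq)  # kJ/mol
    return {"K": k_eq, "delta_G": delta_g, "A_to_B": a_to_b, "B_to_A": b_to_a}


# Hypothetical trajectory: 12 frames in state A and 4 frames in state B.
trajectory = list("AAAABBAAAABBAAAA")
summary = summarize_trajectory(trajectory, temperature=300.0)
print(summary)
```

With few transitions, as in this toy trajectory, the point estimate alone says little about its reliability, which is the gap the confidence intervals are meant to fill.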

This method can be used for results based on unbiased molecular dynamics simulations or simulations based on enhanced sampling methods with a static bias potential, e.g. umbrella sampling. We also present the methodology to calculate confidence intervals for temperature replica-exchange simulations (parallel tempering).

The approach was tested on numerous molecular systems (glycerol, alanine dipeptide and miniprotein folding).

The work was supported by the Czech Science Foundation (22-29667S and 19-16857S) and the Ministry of Education, Youth and Sports of the Czech Republic (LM2018140, LM2018131).

 
11:00am - 11:30amCB4: Coffee break
Location: Forum Hall Foyer 3
11:30am - 12:30pmS8: Final Session
Location: Chamber Hall
Session Chair: Bohdan Schneider

Keynote by Michael Sternberg

Discussion and concluding remarks

Chairs: Shoshana Wodak, Vincent Zoete, Bohdan Schneider, Lynne Regan, Mihaly Varadi, Geraldo Tauriello

 
11:30am - 12:30pm

Structure-based modelling of missense variants in the AlphaFold era

Michael Sternberg

Imperial College London, United Kingdom

This talk will focus on the Missense3D portal (missense3d.bc.ic.ac.uk) to predict the structural impact of missense variants [1,2,3]. The key features of this resource are:

  1. it provides a structural explanation for the predicted phenotypic effect of the missense variant and thus complements many other predictors, including the recently released AlphaMissense from DeepMind/Google;
  2. it was developed to be applicable to predicted structures, including those predicted by AlphaFold.

Three related web servers have been developed to model missense variants in protein tertiary structure, in protein/protein interfaces and in membrane-spanning regions. The results of benchmarking the accuracy of the algorithm on both experimental structures and models predicted by our server Phyre2 (www.sbg.bio.ic.ac.uk/~phyre2) [4] and by AlphaFold will be presented. Challenges in developing and evaluating missense variant predictors will be discussed.

The talk will also introduce the web-based graphics program EzMol [5] (www.sbg.bio.ic.ac.uk/~ezmol/), designed for rapid display of protein structures by occasional users, which will facilitate identifying the location of missense variants on protein structure. Finally, the talk will illustrate how structure-based modelling of variants led to the formulation of a hypothesis about a missense variant in the human protein TMPRSS2 that reduces the chances of severe COVID-19 [6].

Group publications

1) Ittisoponpisan, et al. (2019). Can predicted protein 3D structures provide reliable insights into whether missense variants are disease associated?. Journal of molecular biology, 431(11), 2197-2212.

2) Khanna, et al. (2021). Missense3D-DB web catalogue: an atom-based analysis and repository of 4M human protein-coding genetic variants. Human Genetics, 140, 805-812.

3) Khanna, et al. (2021). Missense3D-DB web catalogue: an atom-based analysis and repository of 4M human protein-coding genetic variants. Human Genetics, 140, 805-812.

4) Kelley, et al. (2015). The Phyre2 web portal for protein modeling, prediction and analysis. Nature protocols, 10(6), 845-858.

5) Reynolds, et al. (2018). EzMol: a web server wizard for the rapid visualization and image production of protein and nucleic acid structures. Journal of molecular biology, 430(15), 2244-2248.

6) David, et al. (2022). A common TMPRSS2 variant has a protective effect against severe COVID-19. Current research in translational medicine, 70(2), 103333.

 
12:30pm - 1:00pmDiscussion2: Elixir Platform Discussion
Location: Chamber Hall
Session Chair: Bohdan Schneider

Elixir Platform Discussions.

Chairs: Sameer Velankar, Shoshana Wodak, Vincent Zoete, Bohdan Schneider, Lynne Regan

1:00pm - 2:00pmL3: Lunch
Location: Forum Hall Foyer 3