Publications - Kabir's AI2Bio Lab @USF

Peer Reviewed Journal Articles

Tang, M., Cromie, G. A., Kabir, A., Timour, M. S., Ashmead, J., Lo, R. S., Corley, N., DiMaio, F., Morizono, H., Caldovic, L., Mew, N. A., Gropman, A., Shehu, A., & Dudley, A. M. (2026). Predicting epistasis across proteins by structural logic. Proceedings of the National Academy of Sciences, 123(3), e2516291123. https://www.pnas.org/doi/abs/10.1073/pnas.2516291123

@article{TangDudley2026epistasis,
  author = {Tang, Michelle and Cromie, Gareth A. and Kabir, Anowarul and Timour, Martin S. and Ashmead, Julee and Lo, Russell S. and Corley, Nathaniel and DiMaio, Frank and Morizono, Hiroki and Caldovic, Ljubica and Mew, Nicholas Ah and Gropman, Andrea and Shehu, Amarda and Dudley, Aimée M.},
  title = {Predicting epistasis across proteins by structural logic},
  journal = {Proceedings of the National Academy of Sciences},
  volume = {123},
  number = {3},
  pages = {e2516291123},
  year = {2026},
  doi = {10.1073/pnas.2516291123},
  url = {https://www.pnas.org/doi/abs/10.1073/pnas.2516291123},
  eprint = {https://www.pnas.org/doi/pdf/10.1073/pnas.2516291123},
  impact = {10.6}
}

Epistatic interactions present a major challenge in accurate genotype–phenotype correlation. Here, we combined high-throughput yeast assays and machine learning to explore and predict intragenic complementation, a form of epistatic interaction, in human ASL, a gene associated with a urea cycle disorder. Of the 3,600 combinations of complete loss-of-function missense variants tested, over 60% can restore argininosuccinate lyase (ASL) function to near wild-type levels. We demonstrated that intragenic complementation is determined by the spatial positioning of the two variants in the protein structure and is predictable by machine learning with exceptional accuracy. We estimated intragenic complementation to occur in at least 800 human proteins. Our study provides a method for predicting intragenic complementation to help bridge the genotype–phenotype gap. Accurately predicting the phenotypic consequences of genetic variation is a major challenge for precision medicine. The problem is exacerbated by epistatic interactions, nonadditive effects between genetic variants that produce unexpected phenotypes. Here, we explore an understudied form of positive epistasis: intragenic complementation, in which pairs of loss-of-function variants restore near wild-type protein function. Using mutational scanning in yeast, we identify thousands of such interactions in a clinically important enzyme, human argininosuccinate lyase (ASL). Restoration of protein function is not due to the biochemical properties of the substituted amino acids, but rather to a structural feature of the protein, the active site assembly. We develop a machine learning algorithm that uses protein language model embeddings to predict intragenic complementation in ASL with 99.6% accuracy. Additionally, the model trained on ASL generalizes to a structurally related but sequence-divergent enzyme, fumarase, with accuracy over 90%. Our findings reveal a structural basis for this form of epistasis and provide a predictive framework that could extend to at least 4% of human proteins.

Kabir, A., Bhattarai, M., Peterson, S., Najman-Licht, Y., Rasmussen, K. Ø., Shehu, A., Bishop, A. R., Alexandrov, B., & Usheva, A. (2024). DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors. Nucleic Acids Research, gkae783. https://doi.org/10.1093/nar/gkae783

@article{kabirbhattarai2024epbdbert,
  author = {Kabir, Anowarul and Bhattarai, Manish and Peterson, Selma and Najman-Licht, Yonatan and Rasmussen, Kim Ø and Shehu, Amarda and Bishop, Alan R and Alexandrov, Boian and Usheva, Anny},
  title = {{DNA breathing integration with deep learning foundational model advances genome-wide binding prediction of human transcription factors}},
  journal = {Nucleic Acids Research},
  pages = {gkae783},
  year = {2024},
  month = sep,
  issn = {0305-1048},
  doi = {10.1093/nar/gkae783},
  url = {https://doi.org/10.1093/nar/gkae783},
  eprint = {https://academic.oup.com/nar/advance-article-pdf/doi/10.1093/nar/gkae783/59112098/gkae783.pdf},
  impact = {16.6}
}

It was previously shown that DNA breathing, thermodynamic stability, as well as transcriptional activity and transcription factor (TF) bindings are functionally correlated. To ascertain the precise relationship between TF binding and DNA breathing, we developed the multi-modal deep learning model EPBDxDNABERT-2, which is based on the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA dynamics model. To train our EPBDxDNABERT-2, we used chromatin immunoprecipitation sequencing (ChIP-Seq) data comprising 690 ChIP-seq experimental results encompassing 161 distinct TFs and 91 human cell types. EPBDxDNABERT-2 significantly improves the prediction of over 660 TF-DNA, with an increase in the area under the receiver operating characteristic (AUROC) metric of up to 9.6% when compared to the baseline model that does not leverage DNA biophysical properties. We expanded our analysis to an in vitro high-throughput Systematic Evolution of Ligands by Exponential enrichment (HT-SELEX) dataset of 215 TFs from 27 families, comparing EPBD with established frameworks. The integration of the DNA breathing features with the DNABERT-2 foundational model greatly enhanced TF-binding predictions. Notably, EPBDxDNABERT-2, trained on large-scale multi-species genomes with a cross-attention mechanism, improved predictive power, shedding light on the mechanisms underlying disease-related non-coding variants discovered in genome-wide association studies.

Kabir, A., Moldwin, A., Bromberg, Y., & Shehu, A. (2024). In the twilight zone of protein sequence homology: do protein language models learn protein structure? Bioinformatics Advances, 4(1), vbae119. https://doi.org/10.1093/bioadv/vbae119

@article{kabirshehu2023remhombioadv,
  author = {Kabir, Anowarul and Moldwin, Asher and Bromberg, Yana and Shehu, Amarda},
  title = {{In the twilight zone of protein sequence homology: do protein language models learn protein structure?}},
  journal = {Bioinformatics Advances},
  volume = {4},
  number = {1},
  pages = {vbae119},
  year = {2024},
  month = aug,
  issn = {2635-0041},
  doi = {10.1093/bioadv/vbae119},
  url = {https://doi.org/10.1093/bioadv/vbae119},
  eprint = {https://academic.oup.com/bioinformaticsadvances/article-pdf/4/1/vbae119/58914492/vbae119.pdf},
  impact = {4.4}
}

Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent. We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the “twilight zone” of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.

Bromberg, Y., Prabakaran, R., Kabir, A., & Shehu, A. (2024). Variant Effect Prediction in the Age of Machine Learning. Cold Spring Harbor Perspectives in Biology, 16(7), a041467. http://dx.doi.org/10.1101/cshperspect.a041467

@article{brombergshehu2024,
  title = {Variant Effect Prediction in the Age of Machine Learning},
  volume = {16},
  issn = {1943-0264},
  url = {http://dx.doi.org/10.1101/cshperspect.a041467},
  doi = {10.1101/cshperspect.a041467},
  number = {7},
  journal = {Cold Spring Harbor Perspectives in Biology},
  publisher = {Cold Spring Harbor Laboratory},
  author = {Bromberg, Yana and Prabakaran, R. and Kabir, Anowarul and Shehu, Amarda},
  year = {2024},
  month = apr,
  pages = {a041467},
  impact = {6.9}
}

Over the years, many computational methods have been created for the analysis of the impact of single amino acid substitutions resulting from single-nucleotide variants in genome coding regions. Historically, all methods have been supervised and thus limited by the inadequate sizes of experimentally curated data sets and by the lack of a standardized definition of variant effect. The emergence of unsupervised, deep learning (DL)-based methods raised an important question: Can machines learn the language of life from the unannotated protein sequence data well enough to identify significant errors in the protein “sentences”? Our analysis suggests that some unsupervised methods perform as well or better than existing supervised methods. Unsupervised methods are also faster and can, thus, be useful in large-scale variant evaluations. For all other methods, however, their performance varies by both evaluation metrics and by the type of variant effect being predicted. We also note that the evaluation of method performance is still lacking on less-studied, nonhuman proteins where unsupervised methods hold the most promise.

Kabir, A., Bhattarai, M., Rasmussen, K. Ø., Shehu, A., Usheva, A., Bishop, A. R., & Alexandrov, B. (2023). Examining DNA breathing with pyDNA-EPBD. Bioinformatics, 39(11), btad699. https://doi.org/10.1093/bioinformatics/btad699

@article{kabir2023pydnaepbd,
  author = {Kabir, Anowarul and Bhattarai, Manish and Rasmussen, Kim Ø and Shehu, Amarda and Usheva, Anny and Bishop, Alan R and Alexandrov, Boian},
  title = {{Examining DNA breathing with pyDNA-EPBD}},
  journal = {Bioinformatics},
  volume = {39},
  number = {11},
  pages = {btad699},
  year = {2023},
  month = nov,
  issn = {1367-4811},
  doi = {10.1093/bioinformatics/btad699},
  url = {https://doi.org/10.1093/bioinformatics/btad699},
  eprint = {https://academic.oup.com/bioinformatics/article-pdf/39/11/btad699/53863029/btad699.pdf},
  impact = {4.4}
}

The two strands of the DNA double helix locally and spontaneously separate and recombine in living cells due to the inherent thermal DNA motion. This dynamics results in transient openings in the double helix and is referred to as “DNA breathing” or “DNA bubbles.” The propensity to form local transient openings is important in a wide range of biological processes, such as transcription, replication, and transcription factor binding. However, the modeling and computer simulation of these phenomena have remained a challenge due to the complex interplay of numerous factors, such as temperature, salt content, DNA sequence, hydrogen bonding, base stacking, and others. We present pyDNA-EPBD, a parallel software implementation of the Extended Peyrard-Bishop-Dauxois (EPBD) nonlinear DNA model that allows us to describe some features of DNA dynamics in detail. The pyDNA-EPBD generates genomic scale profiles of average base-pair openings, base flipping probability, DNA bubble probability, and calculations of the characteristically dynamic length indicating the number of base pairs statistically significantly affected by a single point mutation using the Markov Chain Monte Carlo algorithm. pyDNA-EPBD is supported across most operating systems and is freely available at https://github.com/lanl/pyDNA_EPBD. Extensive documentation can be found at https://lanl.github.io/pyDNA_EPBD/.

Kabir, A., & Shehu, A. (2022). GOProFormer: A Multi-Modal Transformer Method for Gene Ontology Protein Function Prediction. Biomolecules, 12(11), 1709. https://www.mdpi.com/2218-273X/12/11/1709

@article{kabirshehu2022goproformer,
  author = {Kabir, Anowarul and Shehu, Amarda},
  title = {GOProFormer: A Multi-Modal Transformer Method for Gene Ontology Protein Function Prediction},
  journal = {Biomolecules},
  volume = {12},
  year = {2022},
  number = {11},
  article-number = {1709},
  url = {https://www.mdpi.com/2218-273X/12/11/1709},
  pubmedid = {36421723},
  issn = {2218-273X},
  doi = {10.3390/biom12111709},
  impact = {5.8}
}

Protein Language Models (PLMs) are shown to be capable of learning sequence representations useful for various prediction tasks, from subcellular localization, evolutionary relationships, family membership, and more. They have yet to be demonstrated useful for protein function prediction. In particular, the problem of automatic annotation of proteins under the Gene Ontology (GO) framework remains open. This paper makes two key contributions. It debuts a novel method that leverages the transformer architecture in two ways. A sequence transformer encodes protein sequences in a task-agnostic feature space. A graph transformer learns a representation of GO terms while respecting their hierarchical relationships. The learned sequence and GO terms representations are combined and utilized for multi-label classification, with the labels corresponding to GO terms. The method is shown superior over recent representative GO prediction methods. The second major contribution in this paper is a deep investigation of different ways of constructing training and testing datasets. The paper shows that existing approaches under- or over-estimate the generalization power of a model. A novel approach is proposed to address these issues, resulting in a new benchmark dataset to rigorously evaluate and compare methods and advance the state-of-the-art.

Peer Reviewed Conference Proceedings

Kabir, A., Moldwin, A., & Shehu, A. (2023). A Comparative Analysis of Transformer-based Protein Language Models for Remote Homology Prediction. Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. https://doi.org/10.1145/3584371.3612942

@inproceedings{kabirshehu2023remhomcsbw,
  author = {Kabir, Anowarul and Moldwin, Asher and Shehu, Amarda},
  title = {A Comparative Analysis of Transformer-based Protein Language Models for Remote Homology Prediction},
  year = {2023},
  isbn = {9798400701269},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3584371.3612942},
  doi = {10.1145/3584371.3612942},
  booktitle = {Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics},
  articleno = {97},
  numpages = {9},
  keywords = {large language model, transformer, remote homology},
  location = {Houston, TX, USA},
  series = {BCB '23}
}

Protein language models based on the transformer architecture are increasingly shown to learn rich representations from protein sequences that improve performance on a variety of downstream protein prediction tasks. These tasks encompass a wide range of predictions, including prediction of secondary structure, subcellular localization, evolutionary relationships within protein families, as well as superfamily and family membership. There is recent evidence that such models also implicitly learn structural information. In this paper we put this to the test on a hallmark problem in computational biology, remote homology prediction. We employ a rigorous setting, where, by lowering sequence identity, we clarify whether the problem of remote homology prediction has been solved. Among various interesting findings, we report that current state-of-the-art, large models are still underperforming in the "twilight zone" of very low sequence identity.

Kabir, A., Inan, T., & Shehu, A. (2022). Analysis of AlphaFold2 for Modeling Structures of Wildtype and Variant Protein Sequences. In H. Al-Mubaid, T. Aldwairi, & O. Eulenstein (Eds.), Proceedings of 14th International Conference on Bioinformatics and Computational Biology (Vol. 83, pp. 53–65). EasyChair. https://doi.org/10.29007/5g4v

@inproceedings{kabirshehu2022af2mutanalysis,
  author = {Kabir, Anowarul and Inan, Toki and Shehu, Amarda},
  title = {Analysis of AlphaFold2 for Modeling Structures of Wildtype and Variant Protein Sequences},
  booktitle = {Proceedings of 14th International Conference on Bioinformatics and Computational Biology},
  editor = {Al-Mubaid, Hisham and Aldwairi, Tamer and Eulenstein, Oliver},
  series = {EPiC Series in Computing},
  volume = {83},
  publisher = {EasyChair},
  bibsource = {EasyChair, https://easychair.org},
  issn = {2398-7340},
  doi = {10.29007/5g4v},
  pages = {53--65},
  year = {2022}
}

ResNet and, more recently, AlphaFold2 have demonstrated that deep neural networks can now predict a tertiary structure of a given protein amino-acid sequence with high accuracy. This seminal development will allow molecular biology researchers to advance various studies linking sequence, structure, and function. Many studies will undoubtedly focus on the impact of sequence mutations on stability, fold, and function. In this paper, we evaluate the ability of AlphaFold2 to predict accurate tertiary structures of wildtype and mutated sequences of protein molecules. We do so on a benchmark dataset in mutation modeling studies. Our empirical evaluation utilizes global and local structure analyses and yields several interesting observations. It shows, for instance, that AlphaFold2 performs similarly on wildtype and variant sequences. The placement of the main chain of a protein molecule is highly accurate. However, while AlphaFold2 reports similar confidence in its predictions over wildtype and variant sequences, its performance on placements of the side chains suffers in comparison to main-chain predictions. The analysis overall supports the premise that AlphaFold2-predicted structures can be utilized in further downstream tasks, but that further refinement of these structures may be necessary.

Kabir, A., & Shehu, A. (2022). Sequence-Structure Embeddings via Protein Language Models Improve on Prediction Tasks. 2022 IEEE International Conference on Knowledge Graph (ICKG), 105–112.

@inproceedings{kabirshehu2022protoformer,
  author = {Kabir, Anowarul and Shehu, Amarda},
  booktitle = {2022 IEEE International Conference on Knowledge Graph (ICKG)},
  title = {Sequence-Structure Embeddings via Protein Language Models Improve on Prediction Tasks},
  year = {2022},
  pages = {105--112},
  keywords = {Location awareness;Soft sensors;Semantics;Training data;Predictive models;Transformers;Protein sequence;Protein language model;Transformer;Sequence structure transformer;Protein function;superfamily},
  doi = {10.1109/ICKG55886.2022.00021}
}

Building on the transformer architecture and its revolutionizing of language models for natural language processing, protein language models (PLMs) are now emerging as a powerful tool for learning over large numbers of sequences in protein sequence databases and linking protein sequence to function. PLMs are shown to learn useful, task-agnostic sequence representations that allow predicting protein secondary structure, protein subcellular localization, and evolutionary relationships within protein families. However, existing models are strictly trained over protein sequences and miss an opportunity to leverage and integrate the information present in heterogeneous data sources. In this paper, inspired by the intrinsic role of three-dimensional/tertiary protein structure in determining a broad range of protein properties, we propose a PLM that integrates and attends to both protein sequence and tertiary structure. In particular, this paper posits that learning joint sequence-structure representations yields better representations for function-related prediction tasks. A detailed experimental evaluation shows that such joint sequence-structure representations are more powerful than sequence-based representations, yield better performance on superfamily membership across various metrics, and capture interesting relationships in the PLM-learned embedding space.

Du, Y., Kabir, A., Zhao, L., & Shehu, A. (2020). From Interatomic Distances to Protein Tertiary Structures with a Deep Convolutional Neural Network. Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. https://doi.org/10.1145/3388440.3414699

@inproceedings{dushehu2020protstruct,
  author = {Du, Yuanqi and Kabir, Anowarul and Zhao, Liang and Shehu, Amarda},
  title = {From Interatomic Distances to Protein Tertiary Structures with a Deep Convolutional Neural Network},
  year = {2020},
  isbn = {9781450379649},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3388440.3414699},
  doi = {10.1145/3388440.3414699},
  booktitle = {Proceedings of the 11th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics},
  articleno = {101},
  numpages = {8},
  keywords = {coordinate reconstruction, deep learning, protein modeling, tertiary structure},
  location = {Virtual Event, USA},
  series = {BCB '20}
}

Elucidating biologically-active protein structures remains a daunting task both in the wet and dry laboratory, and many proteins lack structural characterization. This lack of knowledge continues to motivate the development of computational methods for protein structure prediction. Methods are diverse in their approaches, and recent efforts have debuted deep learning-based methods for various sub-problems within the larger problem of protein structure prediction. In this paper, we focus on such a sub-problem, the reconstruction of three-dimensional structures consistent with given inter-atomic distances. Inspired by a recent architecture put forward in the larger context of generative frameworks, we design and evaluate a deep convolutional network model on experimentally- and computationally-obtained tertiary structures. Comparison with convex and stochastic optimization-based methods shows that the deep model is faster and similarly or more accurate, opening up several avenues of further research to advance the larger problem of protein structure prediction.

Khan, T. S., Kabir, A., Pfoser, D., & Züfle, A. (2019). CrowdZIP: A System to Improve Reverse ZIP Code Geocoding using Spatial and Crowdsourced Data (Demo Paper). Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 588–591. https://doi.org/10.1145/3347146.3359362

@inproceedings{khanandreas2019crowdzip,
  author = {Khan, Tunaggina Subrina and Kabir, Anowarul and Pfoser, Dieter and Z\"{u}fle, Andreas},
  title = {CrowdZIP: A System to Improve Reverse ZIP Code Geocoding using Spatial and Crowdsourced Data (Demo Paper)},
  year = {2019},
  isbn = {9781450369091},
  publisher = {Association for Computing Machinery},
  address = {New York, NY, USA},
  url = {https://doi.org/10.1145/3347146.3359362},
  doi = {10.1145/3347146.3359362},
  booktitle = {Proceedings of the 27th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems},
  pages = {588--591},
  numpages = {4},
  keywords = {ZIP Codes, ZIP Code Classification, Reverse Geocoding, Microblog Data, Location Based Services, Geocoding},
  location = {Chicago, IL, USA},
  series = {SIGSPATIAL '19},
  impact = {4.6}
}

Zoning Improvement Plan (ZIP) Codes provide a sub-division of space. Interestingly, the ZIP code area polygons for different data sources do not match, resulting in uncertainty for a range of services that rely on such data. This paper presents a system that employs traditional classification methods to map a given spatial coordinate to a distribution of ZIP codes using various publicly available ZIP code maps as predictors, and using the (not publicly available) United States Postal Service (USPS) map as an authoritative ground truth. We show that large sets of microblog data, from which we extract potential ZIP codes, can significantly improve classification accuracy despite the noise of such data. The demonstrator allows users to select locations on a map of Orlando, FL, view the resulting distribution of ZIP codes predicted for this location, compare the results to the ground truth, and view the microblogs that have enriched the result. A focus will be on showing that the signal present in large, noisy, and 99.99% unrelated microblog data can indeed be used to improve reverse ZIP code geocoding.

Peer Reviewed Workshop Papers

Inan, T. T., Kabir, A., Rasmussen, K., Shehu, A., Usheva, A., Bishop, A., Alexandrov, B., & Bhattarai, M. (2024). Efficient High-Throughput DNA Breathing Features Generation Using Jax-EPBD. In bioRxiv. Cold Spring Harbor Laboratory. https://www.biorxiv.org/content/early/2024/12/12/2024.12.06.627191

@misc{InanBhattarai2024JaxEPBD,
  author = {Inan, Toki Tahmid and Kabir, Anowarul and Rasmussen, Kim and Shehu, Amarda and Usheva, Anny and Bishop, Alan and Alexandrov, Boian and Bhattarai, Manish},
  title = {Efficient High-Throughput DNA Breathing Features Generation Using Jax-EPBD},
  elocation-id = {2024.12.06.627191},
  year = {2024},
  doi = {10.1101/2024.12.06.627191},
  publisher = {Cold Spring Harbor Laboratory},
  url = {https://www.biorxiv.org/content/early/2024/12/12/2024.12.06.627191},
  eprint = {https://www.biorxiv.org/content/early/2024/12/12/2024.12.06.627191.full.pdf},
  journal = {bioRxiv}
}

DNA breathing dynamics (transient base-pair opening and closing due to thermal fluctuations) are vital for processes like transcription, replication, and repair. Traditional models, such as the Extended Peyrard-Bishop-Dauxois (EPBD), provide insights into these dynamics but are computationally limited for long sequences. We present JAX-EPBD, a high-throughput Langevin molecular dynamics framework leveraging JAX for GPU-accelerated simulations, achieving up to 30x speedup and superior scalability compared to the original C-based EPBD implementation. JAX-EPBD efficiently captures time-dependent behaviors, including bubble lifetimes and base flipping kinetics, enabling genome-scale analyses. Applying it to transcription factor (TF) binding affinity prediction using SELEX datasets, we observed consistent improvements in R² values when incorporating breathing features with sequence data. Validating on the 77-bp AAV P5 promoter, JAX-EPBD revealed sequence-specific differences in bubble dynamics correlating with transcriptional activity. These findings establish JAX-EPBD as a powerful and scalable tool for understanding DNA breathing dynamics and their role in gene regulation and transcription factor binding.

Kabir, A., Inan, T. T., Rasmussen, K., Shehu, A., Usheva, A., Bishop, A., Alexandrov, B., & Bhattarai, M. (2024). Scalable DNA Feature Generation and Transcription Factor Binding Prediction via Deep Surrogate Models. In bioRxiv. Cold Spring Harbor Laboratory. https://www.biorxiv.org/content/early/2024/12/10/2024.12.06.626709

@misc{KabirBhattarai2024SurrEPBD,
  author = {Kabir, Anowarul and Inan, Toki Tahmid and Rasmussen, Kim and Shehu, Amarda and Usheva, Anny and Bishop, Alan and Alexandrov, Boian and Bhattarai, Manish},
  title = {Scalable DNA Feature Generation and Transcription Factor Binding Prediction via Deep Surrogate Models},
  elocation-id = {2024.12.06.626709},
  year = {2024},
  doi = {10.1101/2024.12.06.626709},
  publisher = {Cold Spring Harbor Laboratory},
  url = {https://www.biorxiv.org/content/early/2024/12/10/2024.12.06.626709},
  eprint = {https://www.biorxiv.org/content/early/2024/12/10/2024.12.06.626709.full.pdf},
  journal = {bioRxiv}
}

Simulating DNA breathing dynamics, for instance with the Extended Peyrard-Bishop-Dauxois (EPBD) model, across the entire human genome using traditional biophysical methods like pyDNA-EPBD is computationally prohibitive due to intensive techniques such as Markov Chain Monte Carlo (MCMC) and Langevin dynamics. To overcome this limitation, we propose a deep surrogate generative model utilizing a conditional Denoising Diffusion Probabilistic Model (DDPM) trained on DNA sequence-EPBD feature pairs. This surrogate model efficiently generates high-fidelity DNA breathing features conditioned on DNA sequences, reducing computational time from months to hours, a speedup of over 1000 times. By integrating these features into the EPBDxDNABERT-2 model, we enhance the accuracy of transcription factor (TF) binding site predictions. Experiments demonstrate that the surrogate-generated features perform comparably to those obtained from the original EPBD framework, validating the model’s efficacy and fidelity. This advancement enables real-time, genome-wide analyses, significantly accelerating genomic research and offering powerful tools for disease understanding and therapeutic development.

Book Chapters

Kabir, A., & Shehu, A. (2022). Graph Neural Networks in Predicting Protein Function and Interactions. In L. Wu, P. Cui, J. Pei, & L. Zhao (Eds.), Graph Neural Networks: Foundations, Frontiers, and Applications (pp. 541–556). Springer Nature Singapore. https://doi.org/10.1007/978-981-16-6054-2_25

@inbook{Kabirshehu2022gnnbookchapter,
  author = {Kabir, Anowarul and Shehu, Amarda},
  editor = {Wu, Lingfei and Cui, Peng and Pei, Jian and Zhao, Liang},
  title = {Graph Neural Networks in Predicting Protein Function and Interactions},
  booktitle = {Graph Neural Networks: Foundations, Frontiers, and Applications},
  year = {2022},
  publisher = {Springer Nature Singapore},
  address = {Singapore},
  pages = {541--556},
  isbn = {978-981-16-6054-2},
  doi = {10.1007/978-981-16-6054-2_25},
  url = {https://doi.org/10.1007/978-981-16-6054-2_25}
}

Graph Neural Networks (GNNs) are becoming increasingly popular and powerful tools in molecular modeling research due to their ability to operate over non-Euclidean data, such as graphs. Because of their ability to embed both the inherent structure and preserve the semantic information in a graph, GNNs are advancing diverse molecular structure-function studies. In this chapter, we focus on GNN-aided studies that bring together one or more protein-centric sources of data with the goal of elucidating protein function. We provide a short survey on GNNs and their most successful, recent variants designed to tackle the related problems of predicting the biological function and molecular interactions of protein molecules. We review the latest methodological advances, discoveries, as well as open challenges promising to spur further research.

Preprints

Singh, A., Infante, S., Kim, S., & Kabir, A. (2026). Predicting Obstetric and Non-obstetric Diagnoses Co-occurrences during Pregnancy. In bioRxiv. Cold Spring Harbor Laboratory. https://www.biorxiv.org/content/early/2026/02/09/2026.02.06.704385

@unpublished{SinghKabir2026Obstetric,
  author = {Singh, Akash and Infante, Samuel and Kim, Seungbae and Kabir, Anowarul},
  title = {Predicting Obstetric and Non-obstetric Diagnoses Co-occurrences during Pregnancy},
  elocation-id = {2026.02.06.704385},
  year = {2026},
  doi = {10.64898/2026.02.06.704385},
  publisher = {Cold Spring Harbor Laboratory},
  url = {https://www.biorxiv.org/content/early/2026/02/09/2026.02.06.704385},
  eprint = {https://www.biorxiv.org/content/early/2026/02/09/2026.02.06.704385.full.pdf},
  journal = {bioRxiv}
}

Pregnancy care often involves simultaneous obstetric and other medical conditions, but their co-occurrence patterns are rarely modeled explicitly in a systematic, network-based approach. In this work, we formulate obstetric and non-obstetric diagnoses co-occurrences as a link prediction problem on a diagnosis-level homogeneous graph constructed from pregnancy encounters. Diagnoses are represented as nodes connected by co-occurrence edges, with node features capturing graph structure and demographic statistics. We address this challenge by leveraging collected electronic health records data and study several standalone and hybrid graph neural network (GNN) architectures, including GCN, GAT, GraphSAGE, and three hybrid encoders that combine complementary aggregation mechanisms, namely GCN+GraphSAGE, GCN+GAT, and GAT+GraphSAGE. All models use consistent train-validation-test splits and are evaluated on 5-fold cross-validation sets. Among standalone models, GraphSAGE achieved the strongest performance, whereas the hybrid GraphSAGE-based models (GCN+GraphSAGE and GAT+GraphSAGE) were the best performers overall. The GCN+GraphSAGE hybrid, reaching an AUROC and AUPRC of approximately 0.90, consistently outperformed all other architectures. Further analysis of top-ranked predicted links revealed clinically plausible associations between pregnancy stage and risk-related diagnoses and common endocrine, metabolic, and hematological conditions. These findings indicate that graph-based link prediction may effectively prioritize obstetric and non-obstetric diagnosis pairs, providing a scalable framework for identifying clinically meaningful comorbidity patterns. They may further support hypothesis generation and downstream obstetric risk stratification efforts.

Availability: All code, including data preparation scripts, training and validation recipes, and experimental configurations, is available at: https://github.com/kabir-ai2bio-lab/ob-nonob-diagnoses-cooccurrences.

Competing Interest Statement: The authors have declared no competing interest.

Infante, S., Singh, A., & Kabir, A. (2025). LoMuS: Low-Rank Adaptation with Multimodal Representations Improves Protein Stability Prediction. In bioRxiv. Cold Spring Harbor Laboratory. https://www.biorxiv.org/content/early/2025/12/18/2025.12.15.694540

@unpublished{InfanteKabir2025LoMuS,
  author = {Infante, Samuel and Singh, Akash and Kabir, Anowarul},
  title = {LoMuS: Low-Rank Adaptation with Multimodal Representations Improves Protein Stability Prediction},
  elocation-id = {2025.12.15.694540},
  year = {2025},
  doi = {10.64898/2025.12.15.694540},
  publisher = {Cold Spring Harbor Laboratory},
  url = {https://www.biorxiv.org/content/early/2025/12/18/2025.12.15.694540},
  eprint = {https://www.biorxiv.org/content/early/2025/12/18/2025.12.15.694540.full.pdf},
  journal = {bioRxiv}
}

Protein folding stability is a key determinant for understanding protein dynamics, including molecular function, pathogenicity, and/or protein engineering. Yet, accurate prediction of protein stability changes remains a challenging problem due to the high variability in the available data, especially from sequence-only information when structural knowledge is of low resolution or unavailable. In this work, we introduce LoMuS, a Multimodal deep learning model that combines two complementary aspects of the molecule and predicts the unnormalized protein Stability effect from the primary sequence as input. At the core of the model architecture, a fusion network integrates explicit physicochemical descriptors with Low-rank adapted protein language model embeddings derived from the sequence, which shows powerful and accurate generalization ability across various benchmark settings for predicting protein folding stability changes. We compared and rigorously evaluated our model's capacity across tasks spanning from fold-induced stability changes to mutation-caused stability effect prediction. This includes benchmarking against various held-out protein domains, out-of-distribution label settings, and per-protein evaluation. LoMuS consistently outperforms other sequence-only protein stability baselines. It achieves an absolute performance gain of at least 10% in the Spearman rank correlation metric for predicting protein stability across many held-out domains and out-of-distribution stability label predictions. Per-protein validation additionally demonstrates a promising performance gain for our model. Ablation analysis on the model architectural choices confirms that complementary signals from derived features are critical for this multimodal approach. We believe LoMuS advances protein engineering research and can aid in rational protein design by elucidating precise protein stability changes.

Availability: All code, including data preparation scripts, training and validation recipes, and experimental configurations for LoMuS, is available at: https://github.com/samuelinfantee/LoMuS-repository.

Competing Interest Statement: The authors have declared no competing interest.

Abbreviations: ML, Machine Learning; DL, Deep Learning; PLM, Protein Language Model; MLPs, Multi-layer Perceptrons.

Kabir, A., & Shehu, A. (2022). Transformer Neural Networks Attending to Both Sequence and Structure for Protein Prediction Tasks. https://arxiv.org/abs/2206.11057

@unpublished{KabirShehu2022SeqStructTransformer,
  title = {Transformer Neural Networks Attending to Both Sequence and Structure for Protein Prediction Tasks},
  author = {Kabir, Anowarul and Shehu, Amarda},
  year = {2022},
  eprint = {2206.11057},
  archiveprefix = {arXiv},
  primaryclass = {cs.LG},
  url = {https://arxiv.org/abs/2206.11057},
  doi = {10.48550/arXiv.2206.11057}
}

About

Office Location