<source>https://r.jina.ai/https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409</source> <cacheid>cache:2519822628</cacheid><accessDate>9/19/2024, 1:53:25 AM</accessDate><content>
 Title: Large language models improve annotation of viral proteins

URL Source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409

Markdown Content:

Version 1. Res Sq. Preprint. 2023 May 2.

Abstract
--------

Viral sequences are poorly annotated in environmental samples, a major roadblock to understanding how viruses influence microbial community structure. Current annotation approaches rely on alignment-based sequence homology methods, which are limited by available viral sequences and sequence divergence in viral proteins. Here, we show that protein language model representations capture viral protein function beyond the limits of remote sequence homology by targeting two axes of viral sequence annotation: systematic labeling of protein families and function identification for biologic discovery. Protein language model representations capture protein functional properties specific to viruses and expand the annotated fraction of ocean virome viral protein sequences by 37%. Among unannotated viral protein families, we identify a novel DNA editing protein family that defines a new mobile element in marine picocyanobacteria. Protein language models thus significantly enhance remote homology detection of viral proteins and can be utilized to enable new biological discovery across diverse functional categories.

Viruses of microbes, hereafter, ‘viruses’, are abundant in the environment and have wide-ranging impacts on microbial communities. Most of what we know about viral diversity, ecology, and function comes from analysis of sequences obtained from environmental samples, yet viruses are difficult to identify, classify, and annotate. Thus, we make statements about viral impacts on microbial community structure and function based on a tiny fraction of viral sequences with sufficient similarity to existing references. In recent years, next-generation sequencing and increasing computational resources have been applied to catalogue the world’s virome[1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R1)–[7](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R7). While there has been substantial methodological progress in identifying viral DNA in whole community metagenomic sequence data[8](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R8)–[16](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R16), sequence feature annotation and overall taxonomic assignment of identified uncultivated viral genomes (**UViGs**) has lagged considerably. Viruses have no conserved marker genes to enable broad, unified, taxonomic analysis and thus most of the hundreds of thousands of new viruses uncovered in viral catalogue studies remain unclassified[1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R1)–[7](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R7). Viral taxonomic classification is generally based on using predicted UViG proteins as features for clustering-based[17](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R17)–[19](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R19) or machine learning-based[20](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R20) taxonomic classification. 
Yet, as many as 86% of environmental viral protein clusters match uncharacterized protein families or have no hits at all[6](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R6),[7](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R7),[16](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R16),[21](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R21),[22](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R22). Improved annotation of viral protein families (**VPFs**) is thus a necessary, unrealized, step towards understanding the roles of viruses in microbial ecology.

Viral protein annotation currently relies on sequence homology using state-of-the-art profile Hidden Markov Model (**pHMM**)-based approaches. For viral metagenomics, sequence homology methods suffer from two fundamental limitations: (1) the limited library of annotated viral protein sequences from which to construct probabilistic sequence models and (2) the rate at which viral proteins change, quickly diverging beyond recognition by traditional sequence homology metrics. An alignment-free method that does not depend on constructing sequence profiles for statistical sequence homology and that can leverage functional homology between proteins could overcome both challenges.

Advances in the field of natural language processing have increasingly been utilized to identify viral sequences in whole community sequencing data, including k-mer frequency[9](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R9),[11](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R11) and learned vector representation[10](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R10),[16](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R16),[23](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R23),[24](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R24) methods. In natural language processing, current state-of-the-art large language models are trained in an unsupervised manner on gigantic corpora of text. Recently, this approach has been used to train protein language models (**PLMs**) on billions of protein sequences to learn real number vector representations of amino acids. PLMs capture physico-chemical properties of amino acids and can resolve protein structural and functional information from sequence input alone[25](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R25)–[30](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R30). Unlike sequence, structure and function of viral proteins are better maintained over evolutionary time due to biochemical and fitness constraints[31](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R31),[32](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R32). We hypothesized that annotating VPFs based on functional homology captured in PLM-based protein representations, rather than strict protein sequence homology, would improve VPF annotation. Therefore, we developed a PLM-based viral protein function classifier and asked if it could improve the viral protein annotation problem.

Using curated VPF databases and recently published PLMs, we show that PLM-based representations of viral protein sequences can capture viral functional homology beyond remote sequence homology. Our analysis focuses on the two axes of viral sequence annotation: systematic labeling of protein families and specific function identification for biologic discovery. The first is the ability to characterize VPFs in environmental viral profiling studies, where we utilize our PLM-based viral classifier to expand the annotated fraction of VPFs collected from the ocean virome by 37%. The second is the functional characterization of specific proteins in UViGs of biological interest, which we demonstrate by using our functional classifier to identify novel phage-like DNA editing proteins. Finally, we show that the PLM-based representations capture functional groupings unique to viruses. PLMs thus capture features of viral proteins that aid in detecting remote homology, a necessary step toward understanding the functions of viral populations across the world.

Results
-------

### Protein language models capture viral protein function

To determine whether PLMs can aid in annotating viral proteins, it is necessary to establish that PLMs capture properties of viral protein function and that they can identify sequence homology that is invisible to state-of-the-art approaches for identifying distantly related sequences, such as pHMMs. We based our efforts on the Prokaryotic virus Remote Homologous Groups (PHROGs) database, a curated library of VPFs constructed to capture remote sequence homology and manually annotated to high-level functional categories[21](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R21). Because the database was constructed to maximize remote sequence homology captured by each family, it is an ideal dataset to determine whether PLM-based representations can capture viral protein function beyond sequence homology. PHROGs contains 868,340 protein sequences clustered to 38,880 families, of which 5,088 are annotated to 9 functional classes (Supplemental Table 1).

We built our classification model by embedding proteins in annotated families in the PHROGs database to a distributed representation using a PLM. Using a feed-forward neural network with sequence embeddings as input ([Figure 1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F1/)), a functional annotation classifier was trained on VPFs to predict the functional category of sequences from held out VPFs. Five-fold cross validation over the entire annotated set was performed with proteins embedded using four trained PLMs. The PLM trained exclusively with the unsupervised objective of predicting masked amino acids on over 2.1 billion protein sequences from the Big Fantastic Database (**BFD**) (Transformer\_BFD[28](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R28)) performed the best of the PLMs evaluated[28](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R28)–[30](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R30) (Supplemental Table 2, Supplemental Figure 1) and is used for subsequent experiments. Performance with Transformer\_BFD for each category is shown in [Figure 2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F2/). The multiclass classifier achieved an average AUROC of 0.90 and average AUPRC of 0.62 across all classes and folds.

![Figure 1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/bin/nihpp-rs2852098v1-f0001.jpg)

**Overview of training viral protein family (VPF) function classifier using protein language models (PLMs).**

(a) VPFs collected from the PHROGs database with manual annotation to 9 functional categories[21](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R21). (b) Protein sequences are embedded using trained PLMs by averaging amino acid (_a_) vectors (_v_) to a single embedding vector (_e_) of dimension _d_, and (c) used as input to a feed-forward multi-class classifier.
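The averaging step in panel (b) can be illustrated with a minimal sketch. This is not the authors' code: the function name `mean_pool` and the toy 3-dimensional residue vectors are assumptions made for illustration, whereas a trained PLM would supply one _d_-dimensional vector per amino acid.

```python
from typing import List, Sequence


def mean_pool(residue_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors (v) into one sequence embedding (e) of dimension d."""
    if not residue_vectors:
        raise ValueError("cannot embed an empty sequence")
    d = len(residue_vectors[0])
    if any(len(v) != d for v in residue_vectors):
        raise ValueError("all residue vectors must share the same dimension d")
    n = len(residue_vectors)
    # Component-wise mean over the n residues of the protein.
    return [sum(v[i] for v in residue_vectors) / n for i in range(d)]


# Toy protein with two residues and d = 3 (hypothetical values, not PLM output).
toy_residue_vectors = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
embedding = mean_pool(toy_residue_vectors)
```

Because the mean is taken over residues, proteins of any length map to a vector of the same dimension, which is what allows a fixed-size classifier input.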

![Figure 2](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/bin/nihpp-rs2852098v1-f0002.jpg)

**Functional category classification of PHROGs VPFs with PLM-based protein embeddings.**

(a) Receiver operating characteristic curve with average area under curve (AUC) and standard deviation (SD) over five folds. (b) Precision-recall curve with AUC and SD over five folds. Per fold, training is performed over all proteins in a family and testing is performed on a random single sequence from test families. Protein sequences were embedded using the Transformer\_BFD PLM and the classifier consists of a three hidden layer dense neural network and an output layer with softmax activation.
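The family-level hold-out described in this caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the published pipeline: the helper `family_folds`, the fixed seed, and the round-robin assignment of families to folds are hypothetical, and the sketch does not stratify folds by functional category.

```python
import random
from typing import Dict, Iterator, List, Tuple


def family_folds(
    families: Dict[str, List[str]], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train, test) sequence lists with whole families held out per fold.

    Training uses every sequence of each training family; testing uses one
    randomly chosen sequence from each held-out family.
    """
    rng = random.Random(seed)
    names = sorted(families)
    rng.shuffle(names)
    for fold in range(k):
        test_names = set(names[fold::k])  # every k-th family goes to this fold
        train = [s for n in names if n not in test_names for s in families[n]]
        test = [rng.choice(families[n]) for n in names if n in test_names]
        yield train, test


# Toy input: 10 hypothetical families with 3 sequence identifiers each.
toy_families = {f"fam{i}": [f"fam{i}_seq{j}" for j in range(3)] for i in range(10)}
folds = list(family_folds(toy_families, k=5))
```

Splitting by family rather than by sequence keeps close homologs of a test sequence out of the training set, so the evaluation reflects generalization to unseen families.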

A single classification model was then trained on all annotated families as well as 14,280 families of the unknown function category in order to capture sequences that do not match the functional categories. Subsequent to classifier training, 57 PHROGs families were reclassified. The classifier correctly predicted the re-annotation of 38/57 families (66.7%) despite being trained on the previous incorrect annotation for those families (Supplemental Table 3). The performance on the reannotated families serves as a validation of the classifier’s ability to capture function. Our trained classifier is available for download ([https://github.com/kellylab/viral\_protein\_function\_plm](https://github.com/kellylab/viral_protein_function_plm)).

### Language model protein embeddings capture phage biology

Having determined that PLM-based representations of viral proteins can predict function, the viral protein embedding space was investigated to understand what enables the PLM to detect differences between functions. Because a PLM can produce a dense vector representation for any protein sequence, we can interrogate the similarity of sequences in a family and families in a functional category using vector similarity. VPFs were represented as the centroid of sequence embeddings for constituent proteins and visualized for the functionally annotated PHROGs subset ([Figure 3a](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F3/)).

![Figure 3](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/bin/nihpp-rs2852098v1-f0003.jpg)

**Investigation of PLM-based embedding of PHROGs VPFs.**

(a) UMAP projection of PHROGs VPFs. VPFs are represented as the centroid of sequence vectors. (b) Spectral network visualization of the inter-category family-family similarity (edge weight), which is measured as the mean family-family centroid similarity across all family pairs between two categories. The category-category similarity matrix is clustered with n=2 into two groups (black and yellow). (c) Spectral clusters are used to color the PHROGs VPF UMAP projection. (d) Clusters are used as binary classes for the PHROGs VPF classifier, as in Figure 2b. (e) Classifier performance on 10 random two-group splits, with AUPRC averaged over groups and splits.

While the sequence-sequence vector similarity in families across all categories is high (Supplemental Figure 2a), the intra-category family-family similarity varies between functional categories (Supplemental Figure 2b). Families in the transcription regulation category are most similar, with a median family-family similarity of 0.68, while the integration and excision category had the lowest median family-family similarity of 0.51. To ask if there are groupings of categories in the embedding space, we first measured the category-category similarity as the average of the family-family vector similarity for all pairs of families between the categories (Supplemental Figure 3). We spectrally clustered the category-category distance matrix ([Figure 3b](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F3/)), revealing a biologically meaningful partition of functional categories into those relating to phage virion structure and infection (cluster1) and those relating to viral genome replication and other host derived genes (cluster2). The partition is apparent when visualizing the PHROGs embedding space ([Figure 3c](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F3/)). Grouping of the functional categories into the two clusters identified greatly increases the performance of the PHROGs classifier in five-fold cross validation ([Figure 3d](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F3/)) when compared to individual classes ([Figure 2b](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F2/)) and when compared to random partitions of the categories into groups of two ([Figure 3e](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F3/)).
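The category-category similarity described above can be sketched in a few lines. The text does not state which vector similarity was used, so cosine similarity is an assumption here, as are the function names and the toy 2-dimensional embeddings; the sketch covers only the similarity computation, not the spectral clustering step.

```python
import math
from itertools import product
from typing import List

Vector = List[float]


def centroid(vectors: List[Vector]) -> Vector:
    """Represent a family as the component-wise mean of its sequence embeddings."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def category_similarity(cat_a: List[Vector], cat_b: List[Vector]) -> float:
    """Mean family-family centroid similarity over all family pairs between two categories."""
    sims = [cosine(fa, fb) for fa, fb in product(cat_a, cat_b)]
    return sum(sims) / len(sims)


# Toy categories, each a list of family centroids (hypothetical 2-d embeddings).
category_one = [centroid([[1.0, 0.0], [1.0, 0.0]]), centroid([[2.0, 0.0]])]
category_two = [centroid([[0.0, 1.0]]), centroid([[1.0, 1.0]])]
similarity = category_similarity(category_one, category_two)
```

Evaluating `category_similarity` for every pair of categories yields the symmetric matrix that a spectral clustering routine would take as input.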

The ability to classify phage structural proteins, termed phage virion proteins (**PVPs**), is important for identifying and grouping novel sequences, and a number of methods have been recently developed to tackle this problem[33](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R33),[34](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R34). Given the high performance of our cluster1 vs. cluster2 binary classifier, we evaluated how PLM-based classification compares with existing methods for PVP classification. Using a PVP identification task designed previously[34](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R34),[35](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R35), our method performed on par with state-of-the-art approaches ([Table 1](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/table/T1/)).

### Table 1:

**PLM-based viral protein sequence embedding produces best performance in phage virion protein (PVP) classification task.**

PVP classification task designed previously[34](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R34) with PHANNs dataset[35](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R35).

| Method | Recall (%) | Precision (%) | F1-score (%) |
| --- | --- | --- | --- |
| PLM+FNN | 90.32 | 96.88 | 93.48 |
| \*DeePVP | 88.10 | 96.75 | 92.22 |
| \*PHANNs | 91.68 | 76.11 | 83.17 |
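As a consistency check on Table 1, the F1-score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it from the tabulated precision and recall; the function and variable names are illustrative, and the tolerance allows for rounding of the published values.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (recall %, precision %, reported F1 %) as listed in Table 1.
table_1 = {
    "PLM+FNN": (90.32, 96.88, 93.48),
    "DeePVP": (88.10, 96.75, 92.22),
    "PHANNs": (91.68, 76.11, 83.17),
}

# Absolute difference between recomputed and reported F1 for each method.
differences = {
    method: abs(f1_score(precision, recall) - reported)
    for method, (recall, precision, reported) in table_1.items()
}
```

Each recomputed value agrees with the reported F1-score to within 0.01 percentage points.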

### Improved classification of proteins from the global ocean virome

To further validate the trained functional classifier, we evaluated the performance of the classifier against pHMM annotation of the largest pan-ecosystem VPF database, EFAM, curated from UViGs identified in the global oceans[22](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R22) (Supplemental Data 1). With different phage genome sources for database construction, viral genomes in EFAM are not present in the PHROGs training sequences, making this dataset well-suited for an external validation of our classifier. To assign true functional categories to the EFAM VPFs, we used profile-profile HMM matching with the PHROGs database (Supplemental Data 2). 80,942/240,311 (33.7%) of EFAM VPFs were assigned a PHROGs function and these VPFs were predicted using our PLM-based functional classifier trained on PHROGs VPFs ([Figure 4a](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F4/)). The weighted average F1 score across all categories was 0.70 and increases to 0.75 when the unknown function category is excluded (Supplemental Table 4).
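A weighted average F1 score of this kind can be sketched as below. The text does not define the weights, so weighting each category by its number of true examples (its support) is an assumption, and all category names and numbers are toy values rather than results from the paper.

```python
from typing import Dict, Tuple


def weighted_f1(per_category: Dict[str, Tuple[float, int]]) -> float:
    """Average per-category F1 scores, weighting each by its support (assumed weighting)."""
    total_support = sum(support for _, support in per_category.values())
    return sum(f1 * support for f1, support in per_category.values()) / total_support


# Toy per-category (F1, support) values; hypothetical, not taken from the paper.
toy_scores = {
    "category_a": (0.90, 100),
    "category_b": (0.80, 100),
    "unknown": (0.50, 200),
}
overall = weighted_f1(toy_scores)
without_unknown = weighted_f1({k: v for k, v in toy_scores.items() if k != "unknown"})
```

In this toy case, dropping a large, poorly predicted category raises the weighted average, which is the same direction of change reported when the unknown function category is excluded.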

![Figure 4](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/bin/nihpp-rs2852098v1-f0004.jpg)

**Functional category classifier validation and discovery with the EFAM database of VPFs curated from the ocean virome.**

(a) Precision-recall curve for EFAM VPFs labeled with PHROGs HMMs and predicted with the PLM-based functional classifier. Performance is measured with AUPRC and optimal F1-score. (b) Number of VPFs in EFAM that are labeled to each functional category based on the category-specific optimal threshold and not captured by PHROGs HMMs. (c) Predicted integration and excision probability of EFAM VPFs as a function of average protein length in the VPF. Annotations of excisionase (pink) and integrase/recombinase (purple) terms are for VPFs annotated in EFAM (·). Predicted structures are shown for two EFAM VPFs that do not match PHROGs HMMs and are unannotated in EFAM (x), one excisionase (cluster122519) and one integrase (cluster86903). Decision probability is the threshold of maximal F1 for integration and excision category prediction.

To highlight the systematic annotation capability of our approach, the optimal F1 decision probability per category was used to predict the functional category of EFAM VPFs not captured by the PHROGs HMMs ([Figure 4b](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F4/)). In total we expand the annotated fraction of EFAM by 39,258 families, a 37.8% increase over the number annotated internal to the EFAM database supplemented with annotation by PHROGs (103,919 families).
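Choosing an optimal F1 decision probability for one category can be sketched as a scan over candidate thresholds. This is an illustrative sketch only: the function name, the tie-breaking rule (the lowest threshold wins), and the toy probabilities and labels are assumptions, not details taken from the paper.

```python
from typing import List, Tuple


def best_f1_threshold(probs: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, F1) maximizing F1 when predicting positive for prob >= threshold."""
    best_threshold, best_f1 = 1.0, 0.0
    for t in sorted(set(probs)):  # each observed probability is a candidate threshold
        tp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 1)
        fp = sum(1 for p, y in zip(probs, labels) if p >= t and y == 0)
        fn = sum(1 for p, y in zip(probs, labels) if p < t and y == 1)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:  # strict comparison keeps the lowest threshold on ties
            best_threshold, best_f1 = t, f1
    return best_threshold, best_f1


# Toy predicted probabilities for one category and binary ground-truth labels.
toy_probs = [0.9, 0.8, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0]
threshold, f1 = best_f1_threshold(toy_probs, toy_labels)
```

Repeating the scan once per functional category gives a category-specific threshold that can then be applied to families without a ground-truth label.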

### PLMs enable identification of novel phage DNA editing enzymes

Integration and excision had the best prediction performance; therefore, VPFs in EFAM labeled to this category were used to highlight the ability of PLM-representations to uncover novel biology. Additionally, detection of genes associated with phage integration into host genomes is crucial in viral bioinformatics for the characterization of phage genomes as temperate. EFAM VPFs predicted in this category can be stratified based on their annotation in the EFAM database itself, with VPFs having average protein lengths \>100 matching annotation to known integrase/recombinase proteins and VPFs with average protein lengths <100 matching known excisionases ([Figure 4c](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F4/)). We validated our integration and excision prediction for EFAM VPFs that were not annotated in EFAM or by PHROGs HMM matching using structure and domain predictions (Supplemental Table 5). Further investigation of predicted EFAM integrase families led to the annotation of an integrase (cluster86903) on a previously reported putative prophage in uncultured Alphaproteobacteria[36](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R36), highlighting the utility of this approach.

Our method was also able to annotate related genes in non-viral contexts. We identified a novel integrase family (cluster158946) located within marine picocyanobacterial genomes, including members of the globally abundant cyanobacteria _Prochlorococcus_ and _Synechococcus_. Phylogenetic analysis revealed these enzymes as a novel subgroup within the tyrosine integrase/recombinase family of site-specific integrases. These cyanobacterial integrases are distinct from other recombinases commonly seen in phages and bacterial mobile genetic elements, integrases recently described as being associated with Tycheposon mobile elements in _Prochlorococcus_[37](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R37), or tyrosine recombinases associated with VEIME phage satellites[38](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R38) ([Figure 5a](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F5/)). This protein has a different domain structure than is typical of many tyrosine integrases[39](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R39), yet structural modeling confirmed that this enzyme retains the key catalytic residues required for activity[40](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R40) (Supplemental Figure 4). While found only within a subset of available _Prochlorococcus_ and _Synechococcus_ genomes, where identified the integrase is typically found upstream of one of two specific tRNAs, either tRNA-Phe or tRNA-Cys. tRNAs are frequent integration sites for mobile genetic elements[41](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R41) and phylogenetic groupings of these enzymes correlate with their respective tRNA ([Figure 5b](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F5/)), suggesting that these may represent the integration site. 
The integrases are located in islands of variable genetic content and are also frequently, though not exclusively, found near a small serine recombinase ([Figure 5c](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F5/)–[d](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/figure/F5/)). Together, these properties suggest that this novel enzyme defines a mobile genetic element within marine picocyanobacteria.

![Figure 5](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409/bin/nihpp-rs2852098v1-f0005.jpg)

**Identification of a novel integrase/recombinase within marine picocyanobacteria.**

(a) Phylogenetic relationship of the novel integrases (blue) in comparison with tyrosine recombinases described in marine viral parasites (VEIMES; yellow), cyanobacterial mobile element integrases (green) and classes commonly found among well-described phage and mobile elements (e.g. IS, PICIs, ICEs). (b) Phylogenetic groupings of full-length (\>350aa) novel integrases in _Prochlorococcus_ and _Synechococcus_, in relationship to the closest downstream tRNA (outer ring) and genome taxonomy (inner ring). Gaps reflect unknown tRNA associations from limitations of genome assemblies. (c and d) Genomic context of the novel integrase in selected marine _Prochlorococcus_ and _Synechococcus_ genomes, respectively. Colored genes indicate the novel integrase (blue), a small serine recombinase frequently found near the integrase (red) and the downstream tRNA (purple or yellow). Shaded regions connect orthologous genes.

Discussion
----------

Improving viral protein annotation in environmental samples is a vital step towards understanding how viruses influence microbial community structure. Current approaches annotate on average less than 30% of viral protein families in large, global, environmental metagenomes[6](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R6),[7](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R7),[16](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R16),[21](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R21),[22](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R22), meaning that our understanding of viruses of microbes is based on a small fraction of phage genomes. Annotating viral proteins is also key to studies of viral evolution[42](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R42), to the characterization of isolate genomes[43](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R43), and to understanding the role of viruses as disseminators of DNA in microbial populations[44](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R44).

Our work provides a proof of concept that high-level viral function can be learned with PLM-based representations and extends remote homology detection beyond the ability of universally used, state-of-the-art, alignment-based methods. Furthermore, we used our PLM classifier to discover novel biology in the oceans by interrogating predictions in the integration and excision category, unveiling a previously unrecognized DNA-editing enzyme in the globally abundant marine phototrophs _Prochlorococcus_ and _Synechococcus_ that anchors a putative novel mobile genetic element.

We show that across all nine categories in PHROGs, a single multi-class classifier was able to learn viral function in a stratified five-fold cross validation across the annotated PHROGs VPFs. How does performance differ by category? “Tail” and “DNA, RNA, and nucleotide metabolism” were the best performing categories and had the highest number of families, though not the highest number of sequences per family. Two of the worst performing categories, “moron, auxiliary metabolic gene and host takeover” and “other”, are the categories that are least specific in their annotation, grouping sequence-diverse and functionally distinct families into single categories. These categories also have the lowest average number of sequences per family, which may contribute to their worse performance. While the “integration and excision” class had the fewest families and the fewest sequences, it had performance similar to “head and packaging” and “transcription regulation”. It remains an open question how family sequence diversity and number of sequences contribute to classifier performance. However, the fact that our multi-class classifier can predict functional labels of held out VPFs demonstrates that our classifier has captured underlying homology beyond what is captured with alignment-based approaches.

PLM training on the BFD[45](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R45), the largest existing protein corpus that contains protein sequences from uncultivated genomes from metagenomic sequencing data, resulted in the best viral protein function classifier performance. Interestingly, supplemental supervision tasks in PLM training related to structure, such as predicting residue contacts in protein structures[30](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R30), or function, such as protein GO term annotation[29](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R29), did not result in better classification performance. It is possible that this is due to the dearth of viral protein representation in protein structure and knowledge databases. Future work is necessary to determine if there are viral-specific supervised tasks that can enhance PLM training.

A major advantage of alignment-free classification is the ability to make predictions even when a VPF does not share sequence homology with known proteins/families. To ask if our classifier could improve VPF annotation, we used VPFs from the ocean virome curated in the EFAM database. Using PHROGs HMM-labeled EFAM VPFs as ground truth, our classifier achieved a weighted F1 score of 0.75 for functional annotation when the unknown function category is excluded (0.70 across all categories).

To ask if our classifier could uncover novel biology, we interrogated newly annotated EFAM VPFs of the integration and excision category, which had the highest predictive performance. We identified a novel mobile genetic element defined by a previously unrecognized integrase related to the tyrosine integrase/recombinase family. The genomic context of these integrases indicates that their activity contributes to generating genomic diversity among globally abundant marine picocyanobacteria. We identified representative sequences of this integrase in cultured isolate and single-cell genomes of _Prochlorococcus_ and _Synechococcus_ and found that the region immediately surrounding the integrase represents a genomic island whose length, gene content, and gene orientation varies among individual genomes. Variable genes found near the integrase include putative restriction/modification systems, biosynthetic enzymes, and nutrient acquisition genes, indicating that the integrase-associated element can move genetic cargo of ecological relevance in the ocean. The consistent proximity of the integrase to two specific tRNAs suggests these as likely integration sites for the element. The integrases are also frequently, though not exclusively, found near a small serine recombinase which might contribute to resolving mobile element insertion into a target molecule[46](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R46). However, the specific mechanism through which this element is mobilized or integrated is not yet known.

Additionally, proteins in the integration and excision functional category are related to the viral capacity to integrate its genome into a host. These proteins are of particular interest in viral profiling studies as they are used to distinguish between lytic and temperate viral life-cycles[13](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R13),[16](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R16),[47](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R47). We note that mobile genetic elements, some of which contain integrases, are widespread in environmental samples and are still being discovered and characterized[37](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R37),[38](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R38). Understanding the origins and dynamics of these elements remains an open question. Finally, the ability to detect proteins that function in host genome integration is crucial for the field of phage therapy[48](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R48).

Aside from serving as feature embeddings for protein sequences in downstream tasks, PLM transformation of viral proteins also captured underlying patterns of protein functions in viruses. The clustering together of families that function in virion formation and host lysis separately from families involved in metabolic processes, expression regulation, and host genome integration constitutes a biologically meaningful division of viral function. While our original hypothesis was that sequences in different VPFs but the same functional category would help predict unseen sequences of that function, reflecting shared evolution or functional redundancy in reticulate evolution, we were surprised that using sequences in related functions also aids in prediction. Having identified that PLM embeddings capture this biology, we show that PLM-based representations outperform feature engineering and uninformed but local learning through convolution for PVP classification. As PVPs are crucial to initial phage-host interaction, the ability to better identify this class of proteins in UViGs will aid studies of phage-host specificity.

Our study has several limitations. In attempting to systematically annotate VPF function and highlight the ability to label integration and excision VPFs, we note that experimentalists interested in annotating UViGs face a plethora of methods, parameters, and thresholds, and through thorough investigation they may arrive at an annotation for a specific gene that large-scale approaches leave unannotated. Annotation goals are project-specific and may require different levels of granularity; here we have focused on protein family-level annotations. In selecting the PHROGs database for training the functional classifier, we benefited from its high-level functional category annotation, which collapses a wide array of annotation terms into defined categories. However, the categories vary in scope: some are relatively narrow (integration and excision; lysis) and their prediction can be relevant to experimentalists, whereas the comparatively broad ones (head and packaging; moron, auxiliary metabolic gene and host takeover) provide limited specific information when predicted. Finally, while we leverage previously trained PLMs, both because of the computational resources necessary to train large language models and as proof of their ability to capture viral protein function, there may be better approaches to training PLMs for viral function prediction. Future work will seek to determine whether there are supervised tasks that can increase performance of the functional classifier.

Our PLM-based classifier is trained on the same data that underlies the PHROG pHMMs yet can detect homology across a larger sequence space, identifying integrase genes that the original pHMMs and other annotation tools were blind to. This suggests that PLMs are accessing features of sequence space that alignment-based methods cannot. We hypothesize that these features reflect protein structure, as PLMs, and large language models more generally, have been shown to be adept at capturing domain and structural features of proteins. Using our approach, targeted hypotheses about protein function can be gleaned from PLM-based classification and then tested experimentally, providing a powerful method for directing study into currently hidden functions of interest.

Methods
-------

### Viral protein sequence data

The PHROGs VPF database v3[21](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R21) ([https://phrogs.lmge.uca.fr/](https://phrogs.lmge.uca.fr/)) was downloaded on 01/26/2022. Reannotation data were downloaded after the v4 release. The EFAM VPF database[22](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R22) was downloaded from the project repository on CyVerse Data Commons on 09/07/2022. PHANNs protein sequences and annotations[35](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R35) ([https://phanns.com/downloads](https://phanns.com/downloads)) were downloaded on 01/17/2023.

### Protein language models

Protein sequences were embedded into vectors using trained PLMs. The Transformer\_BFD PLM from the ProtTrans[28](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R28) project was used via the DeepChainBio/BioTransformers Python package. Sequences were embedded with pool\_mode=‘mean’ and batch\_size=2. Sequences were cut off at 5,096 amino acids, which is the limit of the Transformer\_BFD PLM. LSTM\_Uniref90 and LSTM\_Uniref90\_MT from the ProSE[30](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R30) project were downloaded from the project GitHub repository ([https://github.com/tbepler/prose](https://github.com/tbepler/prose)) and protein sequences were embedded with the embed\_sequences.py script with --pool avg. Transformer Uniref90 MT from the ProteinBERT[29](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R29) project was downloaded from the project GitHub repository ([https://github.com/nadavbra/protein\_bert](https://github.com/nadavbra/protein_bert)) and protein sequences were embedded using the get\_model\_with\_hidden\_layers\_as\_outputs function in the proteinbert Python package. All protein sequence embedding was performed on 2 NVIDIA TITAN V GPUs.
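As a rough illustration of the pooling step described above, the sketch below truncates a sequence to the stated length limit and averages per-residue vectors into a single fixed-length protein vector. It is a minimal sketch in plain Python, not the BioTransformers or ProSE code; the names `truncate`, `mean_pool`, and `MAX_LEN` are hypothetical, and the toy per-residue vectors stand in for real PLM output.

```python
from typing import List, Sequence

# Length cutoff stated in the text for the Transformer_BFD PLM.
MAX_LEN = 5096


def truncate(sequence: str, max_len: int = MAX_LEN) -> str:
    """Cut a protein sequence off at the model's maximum length."""
    return sequence[:max_len]


def mean_pool(residue_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average per-residue vectors column-wise into one protein vector."""
    if not residue_vectors:
        raise ValueError("cannot pool an empty sequence")
    n_residues = len(residue_vectors)
    dim = len(residue_vectors[0])
    return [sum(vec[j] for vec in residue_vectors) / n_residues for j in range(dim)]


# Toy example: a 2-residue protein with 2-dimensional residue vectors
# (real PLM embeddings have hundreds to thousands of dimensions).
protein_vector = mean_pool([[1.0, 2.0], [3.0, 4.0]])
```

Mean pooling makes the vector length independent of protein length, which is what allows a fixed-input classifier to be trained on proteins of any size.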

### Classifier training and evaluation

To test the ability of a model to predict the functional category of a test sequence, all labeled PHROG families were split into five stratified sets for fivefold cross-validation. In each split, training was done on all sequences in the training families, while testing was performed on a single randomly selected sequence from each of the testing families. Data preparation for model training was done using sklearn[49](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R49) methods StratifiedKFold and LabelBinarizer. The same training-validation procedure was used for cluster1 vs. cluster2 fivefold cross-validation.
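The family-level split described above can be sketched as follows. This is an illustrative plain-Python stand-in, not the authors' code: the study used sklearn's StratifiedKFold, whereas this sketch uses a simple per-category round-robin assignment, and the function names, toy family names, and seed are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def stratified_family_folds(
    family_category: Dict[str, str], n_folds: int = 5, seed: int = 0
) -> List[List[str]]:
    """Assign whole families to folds so each fold has a similar category mix."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for family, category in sorted(family_category.items()):
        by_category[category].append(family)
    folds: List[List[str]] = [[] for _ in range(n_folds)]
    for category in sorted(by_category):
        families = by_category[category]
        rng.shuffle(families)
        for i, family in enumerate(families):
            folds[i % n_folds].append(family)  # round-robin within a category
    return folds


def make_split(
    folds: List[List[str]],
    test_index: int,
    family_sequences: Dict[str, List[str]],
    seed: int = 0,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Train on all sequences of training families; test on one random
    sequence per held-out family."""
    rng = random.Random(seed)
    train = [
        (seq, fam)
        for i, fold in enumerate(folds)
        if i != test_index
        for fam in fold
        for seq in family_sequences[fam]
    ]
    test = [(rng.choice(family_sequences[fam]), fam) for fam in folds[test_index]]
    return train, test


# Toy data: 10 hypothetical families in 2 categories, 3 sequences each.
family_category = {f"fam{i}": ("lysis" if i % 2 else "tail") for i in range(10)}
family_sequences = {fam: [f"{fam}_seq{j}" for j in range(3)] for fam in family_category}
folds = stratified_family_folds(family_category)
train, test = make_split(folds, 0, family_sequences)
```

Splitting by family rather than by sequence keeps close homologs of a test sequence out of the training set, so the evaluation measures generalization to unseen families.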

For training the PHROGs functional category classifier used in the EFAM classification experiment, families from the ‘unknown function’ category were included as an additional functional category. However, because the unknown families may be missing annotation, any family that was predicted by the model trained without the unknown function category with a score \>0.5 was removed from training (n=19,512).

The classifier architecture is a dense, feed-forward neural network trained with tensorflow[50](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R50). The network has three hidden layers of dimensions 512, 256, and 128 trained with 20% dropout and ReLU activation. The output layer is of dimension equal to the number of functional categories being predicted and has a softmax activation. Input dimension is equal to the embedding vector length output from the PLM. For PLMs with embedding dimension greater than 1,024, an additional hidden layer of dimension 1,024 was added as the first hidden layer. The model was fit with the following parameters: n\_epoch=20, loss=categorical\_crossentropy, opt=Adam(0.0001), batch size=60. Class prediction is assigned based on the highest probability of the softmax layer. For binary classifiers based on binary clusters of PHROGs functional categories, the same architecture and training parameters are used with the exception of n\_epochs=5.
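A minimal Keras sketch of the architecture described above is given below, under stated assumptions: the text does not say where dropout is applied, so placing a dropout layer after each hidden layer is a guess, and the helper names `hidden_layer_dims` and `build_classifier` are hypothetical. TensorFlow is imported inside the builder so that the layer-size helper can be used without it.

```python
from typing import List


def hidden_layer_dims(embedding_dim: int) -> List[int]:
    """Hidden layer sizes: 512, 256, 128, with an extra 1,024-unit first
    layer when the PLM embedding dimension exceeds 1,024."""
    dims = [512, 256, 128]
    if embedding_dim > 1024:
        dims = [1024] + dims
    return dims


def build_classifier(embedding_dim: int, n_categories: int):
    """Dense feed-forward softmax classifier over PLM embeddings."""
    import tensorflow as tf  # lazy import; requires TensorFlow 2.x

    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(embedding_dim,)))
    for dim in hidden_layer_dims(embedding_dim):
        model.add(tf.keras.layers.Dense(dim, activation="relu"))
        # Assumption: 20% dropout after every hidden layer.
        model.add(tf.keras.layers.Dropout(0.2))
    model.add(tf.keras.layers.Dense(n_categories, activation="softmax"))
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="categorical_crossentropy",
    )
    return model
```

With one-hot labels, training under the stated parameters would then look like `model.fit(X, y_onehot, epochs=20, batch_size=60)`, with `epochs=5` for the binary cluster classifiers.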

Classifier performance was evaluated per functional category using area under the receiver operating characteristic curve (AUROC), area under the precision-recall curve (AUPRC), and the F1-score: $F_1 = \frac{TP}{TP + \frac{1}{2}(FP + FN)}$, where TP, FP, and FN are the numbers of true positives, false positives, and false negatives predicted, respectively. ROC and PRC curves, AUC, and F1-score were all calculated using sklearn[49](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R49) methods roc\_curve, precision\_recall\_curve, and auc. In the case of PHROGs fivefold cross-validation, true labels are known for holdout families. In the case of EFAM, true labels are assigned based on HMM matching of EFAM families to PHROG families. EFAM families were aligned using clustal omega v1.2.4[51](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R51) and searched against the PHROG HMM database using hhsearch[52](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R52). PHROGs functional label assignment was made if an EFAM family matched a PHROGs HMM with e-value < 1.0e−12. The label of the PHROGs family with the lowest e-value is considered the true label unless that label is unknown function, in which case the label of the family with the next-lowest e-value is assigned. For predicting EFAM category in the absence of PHROGs HMM hits, the decision threshold probability for category assignment in EFAM was identified by calculating the per-category maximum F1-score. For EFAM VPFs with annotation in the EFAM database, annotation terms present \> 5 times in families predicted by the classifier as integration and excision are shown to highlight the split around proteins of length 100 in the category.
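Two of the rules above lend themselves to a short sketch: choosing a per-category decision threshold by maximum F1-score, and assigning an EFAM family its label from PHROG HMM hits. The code below is a plain-Python illustration rather than the sklearn-based pipeline used in the study. The function names and toy inputs are hypothetical, and the handling of a family whose passing hits are all unknown function is an assumption the text does not specify.

```python
from typing import Optional, Sequence, Tuple


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 = 2TP / (2TP + FP + FN), equivalent to TP / (TP + (FP + FN) / 2)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(scores: Sequence[float], labels: Sequence[bool]) -> Tuple[float, float]:
    """Return (threshold, F1) maximizing F1 when predicting score >= threshold."""
    best_threshold, best_f1 = 0.0, -1.0
    for threshold in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
        f1 = f1_score(tp, fp, fn)
        if f1 > best_f1:
            best_threshold, best_f1 = threshold, f1
    return best_threshold, best_f1


def assign_label(
    hits: Sequence[Tuple[str, float]],
    max_evalue: float = 1e-12,
    unknown: str = "unknown function",
) -> Optional[str]:
    """Label from (category, e-value) hits: lowest e-value below the cutoff,
    skipping unknown-function hits when a known category is available."""
    passing = sorted((h for h in hits if h[1] < max_evalue), key=lambda h: h[1])
    if not passing:
        return None  # no HMM-based label; fall back to the classifier
    for category, _ in passing:
        if category != unknown:
            return category
    return unknown  # assumption: every passing hit is unknown function


threshold, f1 = best_f1_threshold([0.1, 0.4, 0.35, 0.8], [False, False, True, True])
```

On the toy scores, the threshold of 0.35 recovers both positives at the cost of one false positive.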

### Viral protein family embedding space

PHROGs v4 annotations were used for interrogation of the embedding space. PHROGs families were collapsed to centroid vectors by taking the column average of the vector representation of all proteins in a family. Uniform Manifold Approximation and Projection for Dimension Reduction (UMAP) in Python[53](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R53) was used to visualize embedded VPFs. Cosine similarity is a measure of similarity between two vectors and is calculated:

$$\text{cosine similarity}(\mathbf{u}, \mathbf{v}) = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^2}\,\sqrt{\sum_{i=1}^{n} v_i^2}}$$

where **u** and **v** are vectors of length _n_ and $u_i$ and $v_i$ are the _i_\-th elements of each vector. It is used to measure sequence-sequence similarity and family-family similarity from protein vectors and family centroid vectors, respectively. Families with vector similarities \> 0.999 (n=312) were excluded from the median family mean sequence-sequence similarity calculation, because some families contain only duplicate sequences (PHROGs did not deduplicate protein sequences). For intra-category similarity, pairwise similarity was calculated for all category families. For inter-category similarity, each family in one category was compared to each family in another category, with the mean across all pairwise comparisons constituting the category-category similarity. Differences in the distribution of similarities between categories were evaluated with the independent Student's t-test with Bonferroni correction using statannotations[54](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R54). The category-category similarity matrix was converted to a network using networkx[55](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R55) and displayed with spectral layout. The distance matrix was clustered using sklearn[49](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R49) SpectralClustering with n\_clusters=2.
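The centroid and similarity calculations described in this section can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names are hypothetical and the two-dimensional toy vectors stand in for real PLM embeddings.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Dot product of u and v divided by the product of their Euclidean norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Column average of all protein vectors in a family."""
    n = len(vectors)
    return [sum(vec[j] for vec in vectors) / n for j in range(len(vectors[0]))]


def category_similarity(
    centroids_a: Sequence[Sequence[float]], centroids_b: Sequence[Sequence[float]]
) -> float:
    """Mean cosine similarity over all family pairs between two categories."""
    sims = [cosine_similarity(a, b) for a in centroids_a for b in centroids_b]
    return sum(sims) / len(sims)


# Toy family of two proteins, collapsed to a centroid vector.
family_centroid = centroid([[1.0, 3.0], [3.0, 5.0]])
```

Because cosine similarity depends only on the angle between vectors, two embeddings that differ only in magnitude score as identical.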

### Phage virion protein classification

PHANNs protein sequences were embedded using the Transformer\_BFD PLM and a PVP vs. other classifier was trained with the same architecture and parameters as the cluster1 vs. cluster2 classifier. The training and testing sequence split is as described previously[34](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R34). All sequences in the 10 PHANNs validation splits for all PVP classes were combined into a single PVP training set (n=154,183) and all 10 other validation splits were combined into a single other training set (n=336,151). Testing was done on the held-out PVP sequences for all classes (n=14,477) and the held-out other sequences (n=33,402).

### Viral protein sequence annotation validation tools

Viral sequence predictions were manually validated using existing sequence and structural homology software. Individual sequence homology was performed with NCBI-hosted blastp[56](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R56) using the nr database and default parameters. Domain prediction was performed using InterPro[57](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R57). MPI bioinformatics suite[58](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R58) was used for searching protein sequences against HMM databases using HHpred[59](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R59) with default databases (PDB\_mmCIF30\_10\_Jan, UniProt-SwissProt-viral70\_3\_Nov\_2021, COG\_KOG\_v1.0, PHROGs\_v4) and parameters and for searching sequence databases (nr30\_17\_jan) for HMM hits using HMMER[60](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R60) with default parameters. Phyre2 was used for protein structural fold prediction and 3D model prediction[61](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R61).

### Investigation of predicted integrase protein families

A putative integrase protein sequence (MAK08069.1) from cluster158946 was used to search MGnify for similar sequences in metagenomic datasets. We took the first MGnify hit, MGYP000503484273 (e-value 3.3E-257), and used it as a seed to search the IMG-VR database for sequences in the Viral Protein Database using default cutoffs (1E-5). This led to the discovery of putative integrase homologs from _Prochlorococcus_ and _Synechococcus_ genomes, which were interrogated further.

### Integrase analysis

Novel integrase sequences originally identified in IMG were used to query a custom database of _Prochlorococcus_ genomes from cultured isolates and single cell genomes[37](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R37). Additional sequences, such as those from _Synechococcus_, were retrieved through blastp searches of the NCBI nr database (Supplemental Data 3). The tyrosine integrase phylogeny was constructed from a set of tyrosine recombinases extracted from the UniRef50 database ([http://www.uniprot.org/uniref](http://www.uniprot.org/uniref)) using HMM models from ref[39](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R39); a set of integrases associated with _Prochlorococcus_ Tycheposons and cryptic elements[37](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R37); and representative sequences of VEIME-associated integrases[38](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R38) (based on 40% identity clusters generated with MMSeqs2[62](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R62)). Sequences were aligned with Mafft v7.520[63](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R63), a maximum likelihood phylogeny was generated using FastTree v2.1.11[64](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R64), and the tree was plotted using iTOL[65](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R65). Genome regions surrounding the integrases were plotted in R using gggenomes 0.9.7.9000 ([https://github.com/thackl/gggenomes](https://github.com/thackl/gggenomes)).

### Protein structure modeling of a novel integrase sequence

Protein structure can be conserved among very distantly related sequences. We previously utilized homology modeling approaches to identify distantly related structural homologs to novel viral capsid protein sequences[66](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R66). Here, we took a similar approach to identify structures related to sequences in our putative novel integrase family. We utilized the fully automated protein structure homology-modelling server SWISS-MODEL via the Expasy web server[67](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R67) for template selection, target/template alignment, and model generation using default parameters for an integrase sequence from the Prochlorococcus PAC1 genome (WP 052038630). The top template, as identified by the Global Model Quality Estimate score, was PDB ID 1Z1B, the phage lambda integrase[68](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R68). The target/template alignment has 13% sequence identity, consistent with our sequences not previously being identified as integrases. The MolProbity protein quality score, provided by SWISS-MODEL, which combines protein structure quality features that together reflect crystallographic resolution, was 2.2[69](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R69). The lambda integrase is a tyrosine recombinase with defined active site residues Arg 212, Lys 235, His 308, Arg 311, His 333, and Tyr 342[68](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R68). In a study of catalysis requirements for tyrosine recombinases, the key residues strictly required for function were identified as the Tyr (Y) and Lys (K) residues[40](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10187409#R40). The target/template alignment demonstrates that residues Arg 212, Lys 235, Arg 311, and Tyr 342 are conserved in our target sequence (Supplemental Figure 4, panel A). 
The sequence is modeled as a homo-tetramer, consistent with the quaternary structure of the template (Supplemental Figure 4, panel B).

Acknowledgements
----------------

We thank Thomas Hackl, Kathryn Kauffman, and Cole Matrishin for helpful discussions. Z.N.F. was supported by the Einstein Medical Scientist Training Program (2T32GM007288). S.J.B. was supported by grants from the National Science Foundation (OCE-2049004) and the Simons Foundation (Award ID 917971). L.K. is supported in part by NIH NHLBI grant R01HL069438. Computational resources were supported by an award from the Google Cloud Research Credits program (GCP19980904) to L.K. We thank the NVIDIA Academic Hardware Grant Program for GPUs used in this work.

REFERENCES
----------

\[1\] Roux S. et al. Ecogenomics and potential biogeochemical impacts of globally abundant ocean viruses. Nature, 537(7622):689–693, 2016. \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/27654921)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nature&title=Ecogenomics+and+potential+biogeochemical+impacts+of+globally+abundant+ocean+viruses&author=S.+Roux&volume=537&issue=7622&publication_year=2016&pages=689-693&pmid=27654921&)\]

\[2\] Paez-Espino D. et al. Uncovering Earth’s virome. Nature, 536(7617):425–430, 2016. \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/27533034)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nature&title=Uncovering+Earth%E2%80%99s+virome&author=D.+Paez-Espino&volume=536&issue=7617&publication_year=2016&pages=425-430&pmid=27533034&)\]

\[3\] Gregory A. C. et al. Marine DNA Viral Macro- and Microdiversity from Pole to Pole. Cell, 177(5):1109–1123.e14, may 2019. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6525058/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/31031001)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Cell&title=Marine+DNA+Viral+Macro-+and+Microdiversity+from+Pole+to+Pole&author=A.+C.+Gregory&volume=177&issue=5&publication_year=2019&pages=1109-1123.e14&pmid=31031001&)\]

\[4\] ter Horst A. M. et al. Minnesota peat viromes reveal terrestrial and aquatic niche partitioning for local and global viral populations. Microbiome, 9(1):233, 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8626947/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/34836550)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Microbiome&title=Minnesota+peat+viromes+reveal+terrestrial+and+aquatic+niche+partitioning+for+local+and+global+viral+populations&author=A.+M.+ter+Horst&volume=9&issue=1&publication_year=2021&pages=233&pmid=34836550&)\]

\[5\] Gregory A. C. et al. The gut virome database reveals age-dependent patterns of virome diversity in the human gut. Cell Host & Microbe, 28(5):724–740.e8, 2020. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7443397/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/32841606)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Cell+Host+&+Microbe&title=The+gut+virome+database+reveals+age-dependent+patterns+of+virome+diversity+in+the+human+gut&author=A.+C.+Gregory&volume=28&issue=5&publication_year=2020&pages=724-740.e8&pmid=32841606&)\]

\[6\] Camarillo-Guerrero L. F. et al. Massive expansion of human gut bacteriophage diversity. Cell, 184(4):1098–1109.e9, 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7895897/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/33606979)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Cell&title=Massive+expansion+of+human+gut+bacteriophage+diversity&author=L.+F.+Camarillo-Guerrero&volume=184&issue=4&publication_year=2021&pages=1098-1109.e9&pmid=33606979&)\]

\[7\] Nayfach S. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nature Microbiology, 6(7):960–970, 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8241571/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/34168315)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nature+Microbiology&title=Metagenomic+compendium+of+189,680+DNA+viruses+from+the+human+gut+microbiome&author=S.+Nayfach&volume=6&issue=7&publication_year=2021&pages=960-970&)\]

\[8\] Roux S. et al. Virsorter: mining viral signal from microbial genomic data. PeerJ, 3:e985, May 2015. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4451026/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/26038737)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=PeerJ&title=Virsorter:+mining+viral+signal+from+microbial+genomic+data&author=S.+Roux&volume=3&publication_year=2015&pages=e985&pmid=26038737&)\]

\[9\] Ren J. et al. VirFinder: a novel k-mer based tool for identifying viral sequences from assembled metagenomic data. Microbiome, 5(1):69, 2017. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5501583/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/28683828)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Microbiome&title=VirFinder:+a+novel+k-mer+based+tool+for+identifying+viral+sequences+from+assembled+metagenomic+data&author=J.+Ren&volume=5&issue=1&publication_year=2017&pages=69&pmid=28683828&)\]

\[10\] Ren J. et al. Identifying viruses from metagenomic data using deep learning. Quantitative Biology, 8(1):64–77, 2020. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8172088/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/34084563)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Quantitative+Biology&title=Identifying+viruses+from+metagenomic+data+using+deep+learning&author=J.+Ren&volume=8&issue=1&publication_year=2020&pages=64-77&pmid=34084563&)\]

\[11\] Wood D. E. et al. Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1):257, 2019. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6883579/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/31779668)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Genome+Biology&title=Improved+metagenomic+analysis+with+Kraken+2&author=D.+E.+Wood&volume=20&issue=1&publication_year=2019&pages=257&pmid=31779668&)\]

\[12\] Guo J. et al. VirSorter2: a multi-classifier, expert-guided approach to detect diverse DNA and RNA viruses. Microbiome, 9(1):37, 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7852108/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/33522966)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Microbiome&title=VirSorter2:+a+multi-classifier,+expert-guided+approach+to+detect+diverse+DNA+and+RNA+viruses&author=J.+Guo&volume=9&issue=1&publication_year=2021&pages=37&pmid=33522966&)\]

\[13\] Kieft K. et al. VIBRANT: automated recovery, annotation and curation of microbial viruses, and evaluation of viral community function from genomic sequences. Microbiome, 8(1):90, 2020. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7288430/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/32522236)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Microbiome&title=VIBRANT:+automated+recovery,+annotation+and+curation+of+microbial+viruses,+and+evaluation+of+viral+community+function+from+genomic+sequences&author=K.+Kieft&volume=8&issue=1&publication_year=2020&pages=90&pmid=32522236&)\]

\[14\] Tisza M. J. et al. Cenote-Taker 2 democratizes virus discovery and sequence annotation. Virus Evolution, 7(1):veaa100, jan 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7816666/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/33505708)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Virus+Evolution&title=Cenote-Taker+2+democratizes+virus+discovery+and+sequence+annotation&author=M.+J.+Tisza&volume=7&issue=1&publication_year=2021&pages=veaa100&pmid=33505708&)\]

\[15\] Glickman C. et al. Simulation study and comparative evaluation of viral contiguous sequence identification tools. BMC Bioinformatics, 22(1):329, 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8207588/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/34130621)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=BMC+Bioinformatics&title=Simulation+study+and+comparative+evaluation+of+viral+contiguous+sequence+identification+tools&author=C.+Glickman&volume=22&issue=1&publication_year=2021&pages=329&pmid=34130621&)\]

\[16\] Camargo A. P. et al. You can move, but you can’t hide: identification of mobile genetic elements with genomad. bioRxiv, 2023. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11324519/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/37735266)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=bioRxiv&title=You+can+move,+but+you+can%E2%80%99t+hide:+identification+of+mobile+genetic+elements+with+genomad&author=A.+P.+Camargo&publication_year=2023&)\]

\[17\] Meier-Kolthoff J. P. and Göker M.. VICTOR: genome-based phylogeny and classification of prokaryotic viruses. Bioinformatics, 33(21):3396–3404, 07 2017. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5860169/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/29036289)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Bioinformatics&title=VICTOR:+genome-based+phylogeny+and+classification+of+prokaryotic+viruses&author=J.+P.+Meier-Kolthoff&author=M.+G%C3%B6ker&volume=33&issue=21&publication_year=2017&pages=3396-3404&pmid=29036289&)\]

\[18\] Bin Jang H. et al. Taxonomic assignment of uncultivated prokaryotic virus genomes is enabled by gene-sharing networks. Nature Biotechnology, 37(6):632–639, 2019. \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/31061483)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nature+Biotechnology&title=Taxonomic+assignment+of+uncultivated+prokaryotic+virus+genomes+is+enabled+by+gene-sharing+networks&author=H.+Bin+Jang&volume=37&issue=6&publication_year=2019&pages=632-639&)\]

\[19\] Moraru C.. Virclust – a tool for hierarchical clustering, core gene detection and annotation of (prokaryotic) viruses. bioRxiv, 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10143988/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/37112988)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=bioRxiv&title=Virclust+%E2%80%93+a+tool+for+hierarchical+clustering,+core+gene+detection+and+annotation+of+\(prokaryotic\)+viruses&author=C.+Moraru&publication_year=2021&)\]

\[20\] Pons J. C. et al. VPF-Class: taxonomic assignment and host prediction of uncultivated viruses based on viral protein families. Bioinformatics, 37(13):1805–1813, 01 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8830756/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/33471063)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Bioinformatics&title=VPF-Class:+taxonomic+assignment+and+host+prediction+of+uncultivated+viruses+based+on+viral+protein+families&author=J.+C.+Pons&volume=37&issue=13&publication_year=2021&pages=1805-1813&pmid=33471063&)\]

\[21\] Terzian P. et al. PHROG: families of prokaryotic virus proteins clustered using remote homology. NAR Genomics and Bioinformatics, 3(3), 08 2021. lqab067. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8341000/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/34377978)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=NAR+Genomics+and+Bioinformatics&title=PHROG:+families+of+prokaryotic+virus+proteins+clustered+using+remote+homology&author=P.+Terzian&volume=3&issue=3&publication_year=2021&pages=08&)\]

\[22\] Zayed A. A. et al. efam: an expanded, metaproteome-supported HMM profile database of viral protein families. Bioinformatics, 37(22):4202–4208, 06 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9502166/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/34132786)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Bioinformatics&title=efam:+an+expanded,+metaproteome-supported+HMM+profile+database+of+viral+protein+families&author=A.+A.+Zayed&volume=37&issue=22&publication_year=2021&pages=4202-4208&pmid=34132786&)\]

\[23\] Abdelkareem A. O. et al. Virnet: Deep attention model for viral reads identification. In 2018 13th International Conference on Computer Engineering and Systems (ICCES), pp. 623–626, 2018. \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=2018+13th+International+Conference+on+Computer+Engineering+and+Systems+\(ICCES\)&title=Virnet:+Deep+attention+model+for+viral+reads+identification&author=A.+O.+Abdelkareem&publication_year=2018&pages=623-626&)\]

\[24\] Tynecki P. et al. Phageai - bacteriophage life cycle recognition with machine learning and natural language processing. bioRxiv, 2020. \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=bioRxiv&title=Phageai+-+bacteriophage+life+cycle+recognition+with+machine+learning+and+natural+language+processing&author=P.+Tynecki&publication_year=2020&)\]

\[25\] Asgari E. and Mofrad M. R. K.. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLOS ONE, 10(11):1–15, 11 2015. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4640716/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/26555596)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=PLOS+ONE&title=Continuous+distributed+representation+of+biological+sequences+for+deep+proteomics+and+genomics&author=E.+Asgari&author=M.+R.+K.+Mofrad&volume=10&issue=11&publication_year=2015&pages=1-15&)\]

\[26\] Heinzinger M. et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics, 20(1):723, 2019. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6918593/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/31847804)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=BMC+Bioinformatics&title=Modeling+aspects+of+the+language+of+life+through+transfer-learning+protein+sequences&author=M.+Heinzinger&volume=20&issue=1&publication_year=2019&pages=723&pmid=31847804&)\]

\[27\] Rives A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15):e2016239118, 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8053943/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/33876751)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Proceedings+of+the+National+Academy+of+Sciences&title=Biological+structure+and+function+emerge+from+scaling+unsupervised+learning+to+250+million+protein+sequences&author=A.+Rives&volume=118&issue=15&publication_year=2021&pages=e2016239118&)\]

\[28\] Elnaggar A. et al. Prottrans: Towards cracking the language of lifes code through self-supervised deep learning and high performance computing. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021. \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=IEEE+Transactions+on+Pattern+Analysis+and+Machine+Intelligence&title=Prottrans:+Towards+cracking+the+language+of+lifes+code+through+self-supervised+deep+learning+and+high+performance+computing&author=A.+Elnaggar&publication_year=2021&pages=1-1&pmid=31331880&)\]

\[29\] Brandes N. et al. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics, 01 2022. btac020. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9386727/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/35020807)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Bioinformatics&title=ProteinBERT:+a+universal+deep-learning+model+of+protein+sequence+and+function&author=N.+Brandes&publication_year=2022&pages=01&)\]

\[30\] Bepler T. and Berger B.. Learning the protein language: Evolution, structure, and function. Cell Systems, 12(6):654–669.e3, 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8238390/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/34139171)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Cell+Systems&title=Learning+the+protein+language:+Evolution,+structure,+and+function&author=T.+Bepler&author=B.+Berger&volume=12&issue=6&publication_year=2021&pages=654-669.e3&pmid=34139171&)\]

\[31\] Nasir A. and Caetano-Anollés G.. A phylogenomic data-driven exploration of viral origins and evolution. Science Advances, 1(8):e1500527, 2015. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4643759/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/26601271)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Science+Advances&title=A+phylogenomic+data-driven+exploration+of+viral+origins+and+evolution&author=A.+Nasir&author=G.+Caetano-Anoll%C3%A9s&volume=1&issue=8&publication_year=2015&pages=e1500527&pmid=26601271&)\]

\[32\] Balaji S. and Srinivasan N.. Comparison of sequence-based and structure-based phylogenetic trees of homologous proteins: Inferences on protein evolution. Journal of Biosciences, 32(1):83–96, 2007. \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/17426382)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Journal+of+Biosciences&title=Comparison+of+sequence-based+and+structure-based+phylogenetic+trees+of+homologous+proteins:+Inferences+on+protein+evolution&author=S.+Balaji&author=N.+Srinivasan&volume=32&issue=1&publication_year=2007&pages=83-96&pmid=17426382&)\]

\[33\] Meng C. et al. Review and comparative analysis of machine learning-based phage virion protein identification methods. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics, 1868(6):140406, 2020. \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/32135196)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Biochimica+et+Biophysica+Acta+\(BBA\)+-+Proteins+and+Proteomics&title=Review+and+comparative+analysis+of+machine+learning-based+phage+virion+protein+identification+methods&author=C.+Meng&volume=1868&issue=6&publication_year=2020&pages=140406&pmid=32135196&)\]

\[34\] Fang Z. et al. DeePVP: Identification and classification of phage virion proteins using deep learning. GigaScience, 11, 08 2022. giac076. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9366990/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/35950840)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Giga-Science&title=DeePVP:+Identification+and+classification+of+phage+virion+proteins+using+deep+learning&author=Z.+Fang&volume=11&publication_year=2022&pages=08&)\]

\[35\] Cantu V. A. et al. PhANNs, a fast and accurate tool and web server to classify phage structural proteins. PLOS Computational Biology, 16(11):1–18, 11 2020. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7660903/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/33137102)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=PLOS+Computational+Biology&title=Phanns,+a+fast+and+accurate+tool+and+web+server+to+classify+phage+structural+proteins&author=V.+A.+Cantu&volume=16&issue=11&publication_year=2020&pages=1-18&)\]

\[36\] Mizuno C. M. et al. Genomes of abundant and widespread viruses from the deep ocean. mBio, 7(4):e00805–16, 2016. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4981710/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/27460793)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=mBio&title=Genomes+of+abundant+and+widespread+viruses+from+the+deep+ocean&author=C.+M.+Mizuno&volume=7&issue=4&publication_year=2016&pages=e00805-16&pmid=27460793&)\]

\[37\] Hackl T. et al. Novel integrative elements and genomic plasticity in ocean ecosystems. Cell, 186(1):47–62.e16, 2023. \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/36608657)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Cell&title=Novel+integrative+elements+and+genomic+plasticity+in+ocean+ecosystems&author=T.+Hackl&volume=186&issue=1&publication_year=2023&pages=47-62.e16&pmid=36608657&)\]

\[38\] Eppley J. M. et al. Marine viral particles reveal an expansive repertoire of phage-parasitizing mobile elements. Proceedings of the National Academy of Sciences, 119(43):e2212722119, 2022. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9618062/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/36256808)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Proceedings+of+the+National+Academy+of+Sciences&title=Marine+viral+particles+reveal+an+expansive+repertoire+of+phage-parasitizing+mobile+elements&author=J.+M.+Eppley&volume=119&issue=43&publication_year=2022&pages=e2212722119&)\]

\[39\] Smyshlyaev G. et al. Sequence analysis of tyrosine recombinases allows annotation of mobile genetic elements in prokaryotic genomes. Molecular Systems Biology, 17(5):e9880, 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8138268/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/34018328)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Molecular+Systems+Biology&title=Sequence+analysis+of+tyrosine+recombinases+allows+annotation+of+mobile+genetic+elements+in+prokaryotic+genomes&author=G.+Smyshlyaev&volume=17&issue=5&publication_year=2021&pages=e9880&pmid=34018328&)\]

\[40\] Gibb B. et al. Requirements for catalysis in the Cre recombinase active site. Nucleic Acids Research, 38(17):5817–5832, 05 2010. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2943603/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/20462863)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nucleic+Acids+Research&title=Requirements+for+catalysis+in+the+Cre+recombinase+active+site&author=B.+Gibb&volume=38&issue=17&publication_year=2010&pages=5817-5832&pmid=20462863&)\]

\[41\] Williams K. P.. Integration sites for genetic elements in prokaryotic tRNA and tmRNA genes: sublocation preference of integrase subfamilies. Nucleic Acids Research, 30(4):866–875, 02 2002. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC100330/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/11842097)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nucleic+Acids+Research&title=Integration+sites+for+genetic+elements+in+prokaryotic+tRNA+and+tmRNA+genes:+sublocation+preference+of+integrase+subfamilies&author=K.+P.+Williams&volume=30&issue=4&publication_year=2002&pages=866-875&pmid=11842097&)\]

\[42\] Koonin E. V. et al. The global virome: How much diversity and how many independent origins? Environmental Microbiology, 25(1):40–44, 2023. \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/36097140)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Environmental+Microbiology&title=The+global+virome:+How+much+diversity+and+how+many+independent+origins?&author=E.+V.+Koonin&volume=25&issue=1&publication_year=2023&pages=40-44&pmid=36097140&)\]

\[43\] Shen A. and Millard A.. Phage genome annotation: Where to begin and end. PHAGE, 2(4):183–193, 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9041514/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/36159890)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=PHAGE&title=Phage+genome+annotation:+Where+to+begin+and+end&author=A.+Shen&author=A.+Millard&volume=2&issue=4&publication_year=2021&pages=183-193&pmid=36159890&)\]

\[44\] Borodovich T. et al. Phage-mediated horizontal gene transfer and its implications for the human gut microbiome. Gastroenterology Report, 10, 04 2022. goac012. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9006064/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/35425613)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Gastroenterology+Report&title=Phage-mediated+horizontal+gene+transfer+and+its+implications+for+the+human+gut+microbiome&author=T.+Borodovich&volume=10&publication_year=2022&pages=04&)\]

\[45\] Jumper J. et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596:583–589, 8 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8371605/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/34265844)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nature&title=Highly+accurate+protein+structure+prediction+with+alphafold&author=J.+Jumper&volume=596&publication_year=2021&pages=583-589&pmid=34265844&)\]

\[46\] Nicolas E. et al. The Tn3-family of replicative transposons. Microbiology Spectrum, 3(4):3.4.14, 2015. \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/26350313)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Microbiology+Spectrum&title=The+tn%C2%A1i%C2%BF3%C2%A1/i%C2%BF-family+of+replicative+transposons&author=E.+Nicolas&volume=3&issue=4&publication_year=2015&)\]

\[47\] Mavrich T. N. and Hatfull G. F.. Bacteriophage evolution differs by host, lifestyle and genome. Nature Microbiology, 2, 7 2017. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5540316/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/28692019)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nature+Microbiology&title=Bacteriophage+evolution+differs+by+host,+lifestyle+and+genome&author=T.+N.+Mavrich&author=G.+F.+Hatfull&volume=2&publication_year=2017&pages=7&)\]

\[48\] Pires D. P. et al. Current challenges and future opportunities of phage therapy. FEMS Microbiology Reviews, 44(6):684–700, 05 2020. \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/32472938)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=FEMS+Microbiology+Reviews&title=Current+challenges+and+future+opportunities+of+phage+therapy&author=D.+P.+Pires&volume=44&issue=6&publication_year=2020&pages=684-700&pmid=32472938&)\]

\[49\] Pedregosa F. et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011. \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Journal+of+Machine+Learning+Research&title=Scikit-learn:+Machine+learning+in+Python&author=F.+Pedregosa&volume=12&publication_year=2011&pages=2825-2830&)\]

\[50\] Abadi M. et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.

\[51\] Sievers F. and Higgins D. G.. Clustal Omega for making accurate alignments of many protein sequences. Protein Science, 27(1):135–145, 2018. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5734385/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/28884485)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Protein+Science&title=Clustal+omega+for+making+accurate+alignments+of+many+protein+sequences&author=F.+Sievers&author=D.+G.+Higgins&volume=27&issue=1&publication_year=2018&pages=135-145&pmid=28884485&)\]

\[52\] Steinegger M. et al. HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics, 20(1):473, 2019. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6744700/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/31521110)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=BMC+Bioinformatics&title=HH-suite3+for+fast+remote+homology+detection+and+deep+protein+annotation&author=M.+Steinegger&volume=20&issue=1&publication_year=2019&pages=473&pmid=31521110&)\]

\[53\] McInnes L. et al. UMAP: Uniform manifold approximation and projection for dimension reduction, 2018.

\[54\] Charlier F. et al. Statannotations, October 2022.

\[55\] Hagberg A. A. et al. Exploring network structure, dynamics, and function using NetworkX. In Varoquaux G. et al., editors, Proceedings of the 7th Python in Science Conference, pp. 11–15, Pasadena, CA USA, 2008. \[[Google Scholar](https://scholar.google.com/scholar_lookup?title=Proceedings+of+the+7th+Python+in+Science+Conference&author=A.+A.+Hagberg&author=G.+Varoquaux&publication_year=2008&)\]

\[56\] Altschul S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25(17):3389–3402, 09 1997. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC146917/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/9254694)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nucleic+Acids+Research&title=Gapped+BLAST+and+PSI-BLAST:+a+new+generation+of+protein+database+search+programs&author=S.+F.+Altschul&volume=25&issue=17&publication_year=1997&pages=3389-3402&pmid=9254694&)\]

\[58\] Gabler F. et al. Protein sequence analysis using the MPI Bioinformatics Toolkit. Current Protocols in Bioinformatics, 72(1):e108, 2020. \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/33315308)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Current+Protocols+in+Bioinformatics&title=Protein+sequence+analysis+using+the+mpi+bioinformatics+toolkit&author=F.+Gabler&volume=72&issue=1&publication_year=2020&pages=e108&pmid=33315308&)\]

\[59\] Zimmermann L. et al. A completely reimplemented MPI Bioinformatics Toolkit with a new HHpred server at its core. Journal of Molecular Biology, 430(15):2237–2243, 2018. Computation Resources for Molecular Biology. \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/29258817)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Journal+of+Molecular+Biology&title=A+completely+reimplemented+mpi+bioinformatics+toolkit+with+a+new+hhpred+server+at+its+core&author=L.+Zimmermann&volume=430&issue=15&publication_year=2018&pages=2237-2243&pmid=29258817&)\]

\[60\] Potter S. C. et al. HMMER web server: 2018 update. Nucleic Acids Research, 46(W1):W200–W204, 06 2018. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6030962/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/29905871)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nucleic+Acids+Research&title=HMMER+web+server:+2018+update&author=S.+C.+Potter&volume=46&issue=W1&publication_year=2018&pages=W200-W204&pmid=29905871&)\]

\[61\] Kelley L. A. et al. The Phyre2 web portal for protein modeling, prediction and analysis. Nature Protocols, 10:845–858, 6 2015. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5298202/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/25950237)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nature+Protocols&title=The+phyre2+web+portal+for+protein+modeling,+prediction+and+analysis&author=L.+A.+Kelley&volume=10&publication_year=2015&pages=845-858&pmid=25950237&)\]

\[62\] Steinegger M. and Söding J.. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology, 35:1026–1028, 11 2017. \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/29035372)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nature+Biotechnology&title=Mmseqs2+enables+sensitive+protein+sequence+searching+for+the+analysis+of+massive+data+sets&author=M.+Steinegger&author=J.+Soding&volume=35&publication_year=2017&pages=1026-1028&)\]

\[63\] Katoh K. and Standley D. M.. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution, 30(4):772–780, 01 2013. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3603318/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/23329690)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Molecular+Biology+and+Evolution&title=MAFFT+Multiple+Sequence+Alignment+Software+Version+7:+Improvements+in+Performance+and+Usability&author=K.+Katoh&author=D.+M.+Standley&volume=30&issue=4&publication_year=2013&pages=772-780&pmid=23329690&)\]

\[64\] Price M. N. et al. FastTree 2 – approximately maximum-likelihood trees for large alignments. PLOS ONE, 5(3):1–10, 03 2010. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2835736/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/20224823)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=PLOS+ONE&title=Fasttree+2+%E2%80%93+approximately+maximum-likelihood+trees+for+large+alignments&author=M.+N.+Price&volume=5&issue=3&publication_year=2010&pages=1-10&)\]

\[65\] Letunic I. and Bork P.. Interactive Tree Of Life (iTOL) v5: an online tool for phylogenetic tree display and annotation. Nucleic Acids Research, 49(W1):W293–W296, 04 2021. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8265157/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/33885785)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nucleic+Acids+Research&title=Interactive+Tree+Of+Life+\(iTOL\)+v5:+an+online+tool+for+phylogenetic+tree+display+and+annotation&author=I.+Letunic&author=P.+Bork&volume=49&issue=W1&publication_year=2021&pages=W293-W296&pmid=33885785&)\]

\[66\] Kauffman K. M. et al. Viruses of the Nahant Collection, characterization of 251 marine Vibrionaceae viruses. Scientific Data, 5(1):180114, 2018. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6029569/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/29969110)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Scientific+Data&title=Viruses+of+the+Nahant+Collection,+characterization+of+251+marine+Vibrionaceae+viruses&author=K.+M.+Kauffman&volume=5&issue=1&publication_year=2018&pages=180114&pmid=29969110&)\]

\[67\] Waterhouse A. et al. SWISS-MODEL: homology modelling of protein structures and complexes. Nucleic Acids Research, 46(W1):W296–W303, 05 2018. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6030848/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/29788355)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nucleic+Acids+Research&title=SWISS-MODEL:+homology+modelling+of+protein+structures+and+complexes&author=A.+Waterhouse&volume=46&issue=W1&publication_year=2018&pages=W296-W303&pmid=29788355&)\]

\[68\] Biswas T. et al. A structural basis for allosteric control of DNA recombination by integrase. Nature, 435:1059–1066, 6 2005. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1809751/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/15973401)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Nature&title=A+structural+basis+for+allosteric+control+of+dna+recombination+by+integrase&author=T.+Biswas&volume=435&publication_year=2005&pages=1059-1066&pmid=15973401&)\]

\[69\] Chen V. B. et al. _MolProbity_: all-atom structure validation for macromolecular crystallography. Acta Crystallographica Section D, 66(1):12–21, Jan 2010. \[[PMC free article](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2803126/)\] \[[PubMed](https://pubmed.ncbi.nlm.nih.gov/20057044)\] \[[Google Scholar](https://scholar.google.com/scholar_lookup?journal=Acta+Crystallographica+Section+D&title=MolProbity:+all-atom+structure+validation+for+macromolecular+crystallography&author=V.+B.+Chen&volume=66&issue=1&publication_year=2010&pages=12-21&)\]


</content>