Associate Professor, University of Virginia, United States
Introduction: Cell signaling relies on the activity, localization, and prevalence of signaling molecules and their interacting partners. For proteins, molecular interactions are conferred by the presence and combination of ordered regions with distinct structural and functional properties, known as domains. While thousands of domains exist within the human proteome, only a limited number of combinations have been observed, and certain combinations appear more frequently, such as the Src homology 3 (SH3) – SH2 – Kinase domain architecture of cytoplasmic kinases. Additionally, in signaling modules like the phosphotyrosine machinery, two catalytic domains (e.g. kinases and phosphatases) rarely co-occur within a single protein. Further, certain domains, like kinase domains, are frequently involved in gene fusions associated with cancers, such as BCR-ABL in chronic myelogenous leukemia or EML4-ALK in lung cancer. These observations suggest that favorable domain combinations follow principles determined by a domain’s tolerance to new functionalization. This flexibility and ability to contextualize new functions are reminiscent of the evolution and development of grammar in natural languages. However, the diversity and combination of domains are primarily studied in the context of evolution or to classify protein subfamilies. Whether domain architectures can be used to describe the available machinery and potential interactions associated with changes in cell state remains underexplored.
Materials and Methods: Domain architectures were retrieved from the InterPro database using our previously developed computational tool CoDIAC. The retrieved domain architectures were then subjected to n-gram analysis, which extracts ordered sequences of protein domains. Both the complete domain architectures and their sub-architectures were extracted from the canonical isoform of each protein. The extracted domain (sub)architectures (hereafter referred to as n-grams) were then assembled into networks in which individual nodes represent n-grams and an edge between two nodes indicates that one n-gram is an extension of the other.
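The n-gram extraction and network assembly described above can be illustrated with a minimal Python sketch. It is not the CoDIAC implementation. It assumes that an n-gram is any contiguous run of domains within an architecture and that one n-gram "extends" another when it adds exactly one domain at either end; the function names `extract_ngrams` and `build_ngram_network` are hypothetical.

```python
def extract_ngrams(architecture):
    """Return every contiguous sub-architecture (n-gram) as a tuple of domain names."""
    domains = tuple(architecture)
    return {
        domains[i:j]
        for i in range(len(domains))
        for j in range(i + 1, len(domains) + 1)
    }


def build_ngram_network(architectures):
    """Build (nodes, edges) where an edge links an n-gram to a one-domain extension.

    Assumption: "extension" means adding a single domain at the N- or C-terminal
    end of the shorter n-gram. Edges are stored as (shorter, longer) pairs.
    """
    nodes = set()
    for arch in architectures:
        nodes |= extract_ngrams(arch)

    edges = set()
    for gram in nodes:
        if len(gram) > 1:
            # Dropping the last or the first domain yields the n-grams this one extends.
            for shorter in (gram[:-1], gram[1:]):
                edges.add((shorter, gram))
    return nodes, edges


# Example: the SH3 - SH2 - Kinase architecture of cytoplasmic kinases.
nodes, edges = build_ngram_network([("SH3", "SH2", "Kinase")])
```

Storing n-grams as tuples keeps domain order explicit, so SH3-SH2 and SH2-SH3 remain distinct nodes.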
Both the Hallmarks gene sets from the Molecular Signatures Database and public datasets of RNA-sequencing and proteomics results were retrieved and subjected to n-gram network analysis. Topological features of the networks, such as connected components, articulation points, network diameter, and isolates, were quantified. Changes in node clustering through community detection were compared for n-grams of the Hallmarks gene sets relative to the complete proteome. For datasets containing differential gene expression results, changes in the conditional frequency of domains given a preceding n-gram were determined for differentially regulated genes.
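As a rough illustration of the quantities named above, the sketch below computes connected components, isolates, and component diameter on an undirected n-gram network, and estimates the conditional frequency of a domain given the preceding n-gram. It uses only the standard library; articulation points and community detection are omitted for brevity. All function names are hypothetical, and the conditional frequency is assumed here to be a simple count ratio over the supplied architectures.

```python
from collections import Counter, defaultdict, deque


def adjacency(nodes, edges):
    """Undirected adjacency map; nodes without edges are kept as isolates."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


def bfs_distances(adj, source):
    """Shortest-path lengths from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def connected_components(adj):
    """List of node sets, one per connected component."""
    seen, components = set(), []
    for start in adj:
        if start not in seen:
            comp = set(bfs_distances(adj, start))
            seen |= comp
            components.append(comp)
    return components


def isolates(adj):
    """Nodes with no neighbours."""
    return [n for n, nbrs in adj.items() if not nbrs]


def diameter(adj, component):
    """Longest shortest path within one component (BFS from every node)."""
    return max(max(bfs_distances(adj, s).values()) for s in component)


def conditional_domain_frequency(architectures, n=1):
    """Estimate P(next domain | preceding n-gram) by counting over architectures."""
    counts = defaultdict(Counter)
    for arch in architectures:
        for i in range(len(arch) - n):
            counts[tuple(arch[i:i + n])][arch[i + n]] += 1
    return {
        context: {dom: c / sum(nxt.values()) for dom, c in nxt.items()}
        for context, nxt in counts.items()
    }
```

Comparing the output of `conditional_domain_frequency` between a background set (e.g. the complete proteome) and a set of differentially regulated genes would give the change in conditional frequency; how that comparison was scored in the study is not specified here.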
Cancer fusion gene analysis was conducted by retrieving breakpoint information for in-frame fusions identified in the ChimerDB database for samples in The Cancer Genome Atlas. Breakpoints were then mapped to the corresponding domain locations, and the complete domain architecture of each fusion gene was constructed. The resulting domain architectures were then incorporated into the n-gram network of the complete proteome and analyzed for the topological changes detailed above.
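The breakpoint-to-architecture mapping can be sketched as follows. This is a simplified illustration under stated assumptions, not the pipeline used in the study: breakpoints and domain boundaries are given as residue positions on each partner protein, the 5' partner contributes domains lying entirely upstream of its breakpoint, the 3' partner contributes domains lying entirely downstream of its breakpoint, and domains disrupted by a breakpoint are dropped. The function name, domain names, and coordinates are hypothetical.

```python
def fusion_architecture(five_prime_domains, three_prime_domains, bp5, bp3):
    """Assemble the domain architecture of an in-frame fusion.

    Each domain is a (name, start, end) tuple in residue coordinates.
    bp5 / bp3 are the breakpoint positions on the 5' and 3' partner proteins.
    Assumption: domains cut by a breakpoint are treated as lost.
    """
    head = [name for name, start, end in five_prime_domains if end <= bp5]
    tail = [name for name, start, end in three_prime_domains if start >= bp3]
    return tuple(head + tail)


# Hypothetical partners with made-up coordinates.
partner_a = [("Coiled-coil", 10, 60), ("WD40", 100, 300)]
partner_b = [("Transmembrane", 20, 40), ("Kinase", 120, 380)]

fusion = fusion_architecture(partner_a, partner_b, bp5=80, bp3=100)

# A fusion architecture is "novel" if it is not already a node in the proteome network.
proteome_nodes = {("Coiled-coil",), ("Kinase",), ("WD40",)}
is_novel = fusion not in proteome_nodes
```

The resulting tuple can then be added to the proteome n-gram network in the same way as any native architecture, and the component counts recomputed.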
Results, Conclusions, and Discussions: N-gram network analysis of the complete proteome found >1300 connected components representing the different domain architecture families containing distinct domain members. About 700 domains do not co-occur with other domains, and 400 multi-domain architectures consisting of only unique domains were identified in the n-gram network. Like many biological systems, the n-gram network is dominated by a single large connected component containing 4999 n-grams (Figure 1A). Comparing the n-grams associated with the Hallmarks gene sets reveals that this large connected component is split into 142 components, and that network-wide there are >800 components missing, 400 truncated, and 10 fractionated (Figure 1A,B). Further community detection within the Hallmarks gene sets revealed an over-identification of clusters associated with clathrin adaptors and serine peptidases, while the n-grams associated with the phosphoserine-binding BRCT domain or the homeobox domain were under-identified. Evaluating differential gene expression results from public datasets further confirmed the fractionation of the complete proteome, as the distribution of component diameters and the mean pairwise distance between nodes within the largest connected component were reduced relative to the complete proteome (Figure 1C). Further, analyzing changes in domain architecture frequencies reflects how the available machinery associated with different cell states and signaling pathways changes.
Fusion gene networks showed that about 20% of fusion genes with novel domain architectures incorporate domains that were isolated from the rest of the network (i.e. they reduce the number of connected components). The same fraction was observed irrespective of whether a gene was involved in fusions across multiple cancers (e.g. FGFR3, ROCK1) or in a single cancer type (Figure 1D). For recurrent fusion genes (e.g. FGFR3-TACC3) that occurred in multiple cancers, this fraction was further reduced to <10%. Together, these data suggest that fusion genes remain under the selective principles governing the natural evolution of domain architectures, which reinforce pre-existing domain co-occurrence trends.
Altogether, our analysis demonstrates the possibility that incorporating domain architecture information into analyses of cell signaling can provide new insights into the diversity of processes utilized in response to different signaling contexts.