Background: Functional assignments for short-read metagenomic data pose a significant computational challenge due to perceived unpredictability of alignment behavior and the inability to infer useful functional information from translated protein-fragments/peptides. To address this problem, we have examined the predictability of short peptide alignments by systematically studying alignment behavior of large sets of short peptides generated from well-characterized proteins as well as hypothetical proteins in the KEGG database. Results: Using test sets of peptides modeling the length and phylogenetic distributions of short-read metagenomic data, we observed that peptides from well-characterized proteins had indistinguishable alignments to proteins from the same orthologous family and proteins from different families. Nonetheless, the patterns contained remarkable phylogenetic and structural signals, with alignments of even very short peptides naturally restricted to their orthologous family and/or proteins having similar structural folds. In stark contrast, peptides from "hypothetical proteins" had only sparse hit patterns with low frequencies and much lower identities. By weighting the structure-driven alignments and filtering peptides with behaviors similar to those derived from "hypothetical proteins", we demonstrate that the accuracy of abundance predictions of protein families is dramatically improved. Conclusions: Evolutionary processes have dispersed protein folds across multiple protein families, precluding accurate functional assignment to short peptides, whose alignment behavior is non-random and driven by structure. Algorithms that filter sparse peptides and weight hit patterns of peptides from "known space" dramatically improve quantification of functions from diverse mixtures of peptides and should substantially improve applications of metagenomic analyses requiring accurate quantitative measures of functional families.
ASJC Scopus subject areas