Use of average mutual information and derived measures to find coding regions

Garin Newcomb, Khalid Sayood

Research output: Contribution to journal › Article › peer-review

Abstract

An important step in genome annotation is identifying the regions of the genome that code for proteins. Most annotation approaches rely on signals extracted from a genomic region to decide whether that region is protein coding. Motivated by the fact that these regions are information-bearing structures, we propose signals based on measures derived from the average mutual information for use in this task. We show that these signals can be used to distinguish coding from noncoding sequences with high accuracy. We also show that these signals are robust across species, phyla, and kingdoms and can therefore be used in species-agnostic genome annotation algorithms for identifying protein-coding regions. These, in turn, could be used for gene identification.
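As a rough illustration of the quantity the abstract builds on, the average mutual information between symbols k positions apart can be written as I(k) = Σ p_k(x, y) log2[p_k(x, y) / (p(x) p(y))], summed over nucleotide pairs (x, y). The sketch below estimates it from empirical pair counts. It is not the authors' implementation, and the derived measures mentioned in the abstract are not reproduced here; the function name `average_mutual_information` and its plug-in estimator are assumptions made for this example.

```python
from collections import Counter
from math import log2


def average_mutual_information(seq, k):
    """Estimate the average mutual information (in bits) between
    symbols that are k positions apart in seq.

    Hypothetical helper for illustration only: a plug-in estimate
    from empirical pair frequencies, with no bias correction.
    """
    if k < 1 or k >= len(seq):
        raise ValueError("lag k must satisfy 1 <= k < len(seq)")

    # All (symbol at i, symbol at i + k) pairs in the sequence.
    pairs = list(zip(seq, seq[k:]))
    n = len(pairs)

    joint = Counter(pairs)                 # counts of (x, y)
    left = Counter(x for x, _ in pairs)    # marginal counts of x
    right = Counter(y for _, y in pairs)   # marginal counts of y

    ami = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y), written with counts to limit rounding.
        ami += p_xy * log2(p_xy * n * n / (left[x] * right[y]))
    return ami
```

For example, `average_mutual_information("ACGT" * 50, 4)` compares each base with the identical base four positions later, so the estimate equals the entropy of a uniform four-letter alphabet, 2 bits. A signal for a genomic window could then be built by evaluating this function over a range of lags, though which lags and window sizes are appropriate is left open here.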

Original language: English (US)
Article number: 1324
Journal: Entropy
Volume: 23
Issue number: 10
State: Published - Oct 2021

Keywords

  • DNA annotation
  • Mutual information
  • Protein coding

ASJC Scopus subject areas

  • Information Systems
  • Mathematical Physics
  • Physics and Astronomy (miscellaneous)
  • Electrical and Electronic Engineering
