Automatic annotation of organellar genomes with DOGMA

Stacia K. Wyman

Department of Computer Sciences

University of Texas at Austin

Jeffrey L. Boore

DOE Joint Genome Institute

Robert K. Jansen

Section of Integrative Biology

University of Texas at Austin



Dual Organellar GenoMe Annotator (DOGMA) automates the annotation of extra-nuclear organellar (chloroplast and animal mitochondrial) genomes. It is a web-based package that uses comparative BLAST searches to identify and annotate genes in a genome. DOGMA presents a list of putative genes to the user in a graphical format for viewing and editing. Annotations are stored on our password-protected server. Complete annotations can be extracted for direct submission to GenBank. Furthermore, intergenic regions of a specified length can be extracted, as well as the nucleotide and amino acid sequences of the genes.

URL:

Keywords: annotation, organelles, chloroplasts, mitochondria.

1 Introduction

The comparison of complete organellar genome sequences is becoming increasingly important for reconstructing the evolutionary relationships of organisms [2,3,7,8], for studying population structure and history [11], including those of humans [6], for identifying forensic materials [10], and for understanding the inheritance of certain human diseases [12]. Identifying and annotating genes is currently a time-consuming and error-prone process and, with the output of high-throughput genome sequencing centers, is becoming the rate-limiting step in the production of complete chloroplast and mitochondrial genome sequences. For extra-nuclear organellar genomes, gene content and function are largely known, so annotation consists of locating and identifying a known set of genes. An automated and accurate method such as DOGMA is therefore an invaluable tool. We may also be able to use this program as a model on which to base methods for automating the annotation of other genomes.

DOGMA is a web-based annotation package that takes as input a file containing the complete nucleotide sequence of an animal mitochondrial or chloroplast genome in FASTA format. The genome is BLASTed against our custom databases, constructed from all the genes from a set of animal mitochondrial and green plant chloroplast genomes. DOGMA constructs a list of genes from the BLAST output and graphically displays the list to the user for annotation. The putative genes are laid out on a number line, and when a gene is selected, a detailed view of the gene's sequence and BLAST hits is displayed. The user can then choose a start and stop codon for each protein coding gene, and a begin and end position for each transfer RNA (tRNA) and ribosomal RNA (rRNA) in the genome. Annotations are stored on our password-protected server so they can be retrieved and edited. When complete, the annotation may be retrieved in Sequin format for direct submission to GenBank. Additionally, intergenic regions of a specified length can be extracted, as well as the nucleotide and amino acid sequences of the genes.
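The extraction of intergenic regions of a specified length can be illustrated with a short sketch. This is not DOGMA's implementation, which is not described here; it is a minimal Python example under stated assumptions: the genome is circular, gene coordinates are 1-based and inclusive, no gene spans the origin, and the function name `intergenic_regions`, the gene names, and the coordinates are all hypothetical.

```python
from typing import List, Tuple

# (name, start, end) with 1-based inclusive coordinates; a hypothetical layout,
# not DOGMA's internal representation.
Gene = Tuple[str, int, int]


def intergenic_regions(
    genes: List[Gene], genome_length: int, min_length: int = 1
) -> List[Tuple[str, str, int, int]]:
    """Return (left_gene, right_gene, start, end) for each gap of at least
    min_length bp between annotated genes on a circular genome.

    Assumes that no gene itself spans the origin. The gap that closes the
    circle may wrap past the origin, in which case start > end.
    """
    if not genes:
        return []
    ordered = sorted(genes, key=lambda g: g[1])
    regions = []
    # Track the furthest gene end seen so far, so that overlapping genes
    # do not produce spurious gaps.
    left_name, left_end = ordered[0][0], ordered[0][2]
    for name, start, end in ordered[1:]:
        gap_start, gap_end = left_end + 1, start - 1
        if gap_end - gap_start + 1 >= min_length:
            regions.append((left_name, name, gap_start, gap_end))
        if end > left_end:
            left_name, left_end = name, end
    # Gap between the last gene and the first gene, across the origin.
    first_name, first_start, _ = ordered[0]
    wrap_length = (genome_length - left_end) + (first_start - 1)
    if wrap_length >= min_length and wrap_length > 0:
        wrap_start = left_end % genome_length + 1
        wrap_end = (first_start - 2) % genome_length + 1
        regions.append((left_name, first_name, wrap_start, wrap_end))
    return regions


# Made-up example annotation on a 1,000 bp circular genome.
example_genes = [
    ("cox1", 1, 100),
    ("nad1", 151, 300),
    ("trnL", 295, 360),  # overlaps nad1, so no gap is reported between them
    ("rrnS", 400, 900),
]
for region in intergenic_regions(example_genes, genome_length=1000, min_length=30):
    print(region)
```

Tracking the furthest gene end, rather than only the previous gene's end, keeps overlapping genes from being reported as negative-length gaps. Genes that themselves cross the origin would need extra handling that this sketch omits.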


Organelles are membrane-bound structures in the cell that carry out various functions. Two organelles, chloroplasts and mitochondria, have circular, double-stranded chromosomes with an almost completely known set of genes.

Animal mitochondrial genomes. Animal mitochondrial genomes typically are about 15,000 base pairs (bp) in length and contain 37 genes: 13 protein coding genes, 22 transfer RNAs (tRNAs) and 2 ribosomal RNAs (rRNAs) [1]. Gene content is mostly fixed, though the gene order can be highly rearranged. Duplications or deletions of genes are rare, most genes do not overlap (though there are some well-identified exceptions), and genes do not contain introns. At the time of writing, there were 467 complete, annotated animal mitochondrial genomes in GenBank.
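Because the expected gene content is nearly fixed (13 + 22 + 2 = 37 genes), a draft annotation can be checked against these counts. The following Python sketch is only an illustration of that idea, not part of DOGMA; the function name `content_deviations` and the type labels "protein", "tRNA" and "rRNA" are assumptions made for the example.

```python
from collections import Counter
from typing import Dict, Iterable

# Typical animal mitochondrial gene content as stated in the text:
# 13 protein coding genes, 22 tRNAs and 2 rRNAs (37 genes in total).
EXPECTED_MT_COUNTS = {"protein": 13, "tRNA": 22, "rRNA": 2}


def content_deviations(
    gene_types: Iterable[str], expected: Dict[str, int] = EXPECTED_MT_COUNTS
) -> Dict[str, int]:
    """Return {gene_type: observed - expected} for each type whose count
    differs from the expected content. An empty dict means the counts match."""
    observed = Counter(gene_types)
    kinds = sorted(set(expected) | set(observed))
    return {
        kind: observed.get(kind, 0) - expected.get(kind, 0)
        for kind in kinds
        if observed.get(kind, 0) != expected.get(kind, 0)
    }


# Hypothetical draft annotation that is missing one tRNA.
draft = ["protein"] * 13 + ["tRNA"] * 21 + ["rRNA"] * 2
print(content_deviations(draft))
```

A nonzero deviation does not by itself mean the annotation is wrong, since duplications and deletions are rare but do occur; it only flags genomes that merit a closer look.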

Chloroplast genomes. Chloroplast genomes, on the other hand, are usually about 150,000 bp (but can be as long as 220,000 bp) and contain 110-130 genes [9]. There are 4 ribosomal RNA genes, about 30 transfer RNAs and about 80 protein coding genes. Introns are infrequent in chloroplast genomes, occurring in Nicotiana in 20 genes. Chloroplast genomes contain 4 distinct regions. Two of the re-