Monterey Bay Coastal Ocean Microbial Observatory
Annotation of Closed Environmental BACs
Image ©MBARI
The closed environmental BACs are annotated similarly to completed microbial genomes. BAC annotation uses a variety of methods to analyze sequence data, identify genes, and assign a name, a function, or both to each gene.
Gene Finding
We use the Glimmer program to find genes in the closed BACs. Glimmer is very useful for this project since it can be trained from raw sequence data; training sets of known proteins, which would be difficult to generate for uncultured bacteria, are therefore not required. In tests on numerous completely sequenced bacterial genomes, the system consistently finds over 99% of the genes in a fully automated fashion.
One of the limitations of the system is its inability in a few cases to resolve overlaps between genes; i.e., to decide what to do when two genes in different reading frames overlap. When such cases arise, Glimmer outputs both genes and they are carefully examined manually.
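The overlap check described above can be sketched in a few lines. This is a minimal illustration, not Glimmer's own logic: the `Gene` record, the `overlapping_pairs` helper, and the coordinate-based frame convention are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # 1-based, inclusive; start < end on either strand
    end: int
    strand: int  # +1 (forward) or -1 (reverse)

    def frame(self) -> int:
        """Frame 1..3 on the forward strand, -1..-3 on the reverse strand.

        Assumed convention: the frame is derived from the coordinate of the
        first base of the gene; other tools number reverse frames differently.
        """
        if self.strand > 0:
            return (self.start - 1) % 3 + 1
        return -((self.end - 1) % 3 + 1)


def overlapping_pairs(genes, min_overlap=1):
    """Return (name_a, name_b, overlap_bp) for genes that overlap in different frames."""
    flagged = []
    for a, b in combinations(sorted(genes, key=lambda g: g.start), 2):
        overlap = min(a.end, b.end) - max(a.start, b.start) + 1
        if overlap >= min_overlap and a.frame() != b.frame():
            flagged.append((a.name, b.name, overlap))
    return flagged


if __name__ == "__main__":
    predictions = [
        Gene("orf1", 1, 300, 1),
        Gene("orf2", 290, 600, 1),
        Gene("orf3", 700, 900, -1),
    ]
    for pair in overlapping_pairs(predictions):
        print(pair)  # candidates for manual review
```

Pairs returned by such a check would be the ones passed to an annotator for manual examination.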
Functional Assignment
The putative proteins predicted by Glimmer are analyzed using several searches of their amino acid sequences. Each protein is searched against a non-redundant amino acid database (nraa) made up of all proteins available from GenBank, PIR, SWISS-PROT and TIGR's internal protein database, EGAD.
All of the proteins from the genome sequences are also searched against Pfam and TIGRFAM hidden Markov models (HMMs) (Bateman, et al., 2000; Haft, et al.). HMMs are useful for annotation since they are more sensitive and accurate than pairwise alignments, and each HMM has an associated cutoff score above which hits are known to be significant.
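As a rough sketch of how per-model cutoffs can be applied, the snippet below keeps only hits that reach the cutoff of the model they matched. The tuple layout, the `significant_hits` name, the model identifiers, and the choice to treat a score equal to the cutoff as significant are assumptions for illustration; real Pfam and TIGRFAM searches report richer output.

```python
def significant_hits(hits, cutoffs):
    """Keep (protein, model, score) hits whose score reaches the model's cutoff.

    hits    -- iterable of (protein_id, model_id, bit_score) tuples
    cutoffs -- dict mapping model_id to its cutoff score
    Hits against models with no recorded cutoff are dropped.
    """
    kept = []
    for protein, model, score in hits:
        cutoff = cutoffs.get(model)
        if cutoff is not None and score >= cutoff:
            kept.append((protein, model, score))
    return kept


if __name__ == "__main__":
    # Hypothetical model names and scores, for illustration only.
    cutoffs = {"MODEL_A": 50.0, "MODEL_B": 120.0}
    hits = [
        ("orf1", "MODEL_A", 75.2),
        ("orf2", "MODEL_B", 80.0),
        ("orf3", "MODEL_C", 300.0),
    ]
    print(significant_hits(hits, cutoffs))
```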
In addition, searches for PROSITE motifs (Hofmann, et al., 1999), lipoproteins, signal peptides (Nielsen, et al., 1997), and membrane spanning regions (Claros, et al., 1994) are performed.
Non-coding genes and other features of the genome are also identified during the annotation process. The program tRNAscan (Lowe, et al., 1997) is used to find tRNAs. Other software written at TIGR is used to detect structural RNAs, rho-independent terminators, and DNA repeats.
Distilling all of this information from a wide variety of sources to an accurate gene assignment is a complex task. We strive to annotate each gene with as much information as we can confidently impart, but are also wary of inferring too much from sequence similarity. This has led to a conservative approach to gene naming and a system of nomenclature in which the specificity of the name reflects our confidence in the assignment.
If multiple lines of evidence indicate that a protein has a specific function, including HMM matches, multiple full-length pairwise matches with percent identity greater than 30%, and conserved PROSITE motifs (where applicable), then we use the fully descriptive name of the protein and assign a gene name. However, if the searches are not as specific, a more general name is given to the protein without a gene name.
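The naming rule above can be read as a simple decision function. The sketch below is one possible encoding, not the annotators' actual procedure: the `Evidence` fields, the `naming_level` name, and the interpretation of "multiple" as at least two matches are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class Evidence:
    hmm_match: bool                      # hit above the HMM's cutoff score
    full_length_identities: list = field(default_factory=list)  # percent identity per full-length pairwise match
    prosite_applicable: bool = False     # does a PROSITE motif exist for this family?
    prosite_conserved: bool = False      # was that motif found in the protein?


def naming_level(ev, min_identity=30.0, min_matches=2):
    """Return "specific" (descriptive name plus gene name) or "general"."""
    strong = [p for p in ev.full_length_identities if p > min_identity]
    motif_ok = ev.prosite_conserved or not ev.prosite_applicable
    if ev.hmm_match and len(strong) >= min_matches and motif_ok:
        return "specific"
    return "general"


if __name__ == "__main__":
    well_supported = Evidence(True, [62.0, 48.5, 35.1], True, True)
    weak = Evidence(True, [41.0], False, False)
    print(naming_level(well_supported), naming_level(weak))
```

In practice annotators weigh the evidence by hand, so a function like this would at most pre-sort proteins for review.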
Coding Region Management
To ensure that Glimmer did not miss any ORFs, all intergenic regions of the genome sequence are searched against our internal non-redundant database. Regions that have database matches are examined to see if missed ORFs should be inserted.
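Extracting the intergenic regions is the first step of this check. A minimal sketch follows, assuming a linear sequence and gene coordinates given as 1-based inclusive `(start, end)` pairs; the `intergenic_regions` name and the `min_len` parameter are illustrative choices, and the database search itself is not shown.

```python
def intergenic_regions(seq_len, genes, min_len=1):
    """Return (start, end) intervals of a linear sequence not covered by any gene.

    seq_len -- length of the BAC sequence in bases
    genes   -- iterable of (start, end) pairs, 1-based inclusive
    min_len -- shortest gap worth reporting
    """
    regions = []
    cursor = 1  # first base not yet covered by a gene
    for start, end in sorted(genes):
        if start - cursor >= min_len:
            regions.append((cursor, start - 1))
        cursor = max(cursor, end + 1)
    if seq_len - cursor + 1 >= min_len:
        regions.append((cursor, seq_len))
    return regions


if __name__ == "__main__":
    genes = [(101, 400), (350, 600), (800, 1000)]
    print(intergenic_regions(1000, genes, min_len=50))
```

Each returned interval would then be searched against the protein database, and intervals with matches reviewed for missed ORFs.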
Potential frameshift mutations or start or stop codon point mutations identified during analysis of the pairwise alignments must be resolved. These sequence discrepancies can arise not only from actual mutation of the DNA, but also from errors in sequencing or sequence editing. If an annotator identifies a potential mutation of this type, the laboratory is notified. The sequence coverage and quality are checked, and the sequence is either verified or a repair is suggested. Annotators will merge, split, or extend the predicted coding regions based on the lab report. If a frameshift or point mutation is verified, "authentic frame shift" or "authentic point mutation" is appended to the common name of the gene.
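The final labeling step can be illustrated with a small helper. The two label strings come from the text above; the `label_verified_mutation` name, the `kind` keys, and the comma separator are assumptions, since the exact formatting of the appended text is not specified here.

```python
# Labels quoted from the annotation policy; the dictionary keys are hypothetical.
VERIFIED_LABELS = {
    "frameshift": "authentic frame shift",
    "point": "authentic point mutation",
}


def label_verified_mutation(common_name, kind, separator=", "):
    """Append the label for a lab-verified mutation to a gene's common name.

    The separator is an assumption made for this sketch.
    """
    label = VERIFIED_LABELS[kind]  # raises KeyError for an unknown kind
    if common_name.endswith(label):
        return common_name          # already labeled; avoid duplicating it
    return common_name + separator + label


if __name__ == "__main__":
    print(label_verified_mutation("hypothetical protein", "frameshift"))
```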
Data Availability
Once all of the annotation and coding region management steps have been completed, the annotation is made available at the Monterey Bay Coastal Ocean Microbial Observatory database. At this site, users can access the information used to annotate the genes, search for genes of interest, and compare their sequences to the BAC sequence. Users can also download files containing the complete sequence of the BAC, nucleotide sequences of each predicted coding region, and protein sequences of each predicted coding region.
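For readers working with the downloaded files, the relationship between the three file types can be sketched as follows: each coding-region nucleotide sequence is a slice of the BAC sequence (reverse-complemented on the minus strand), and each protein sequence is its translation. This is an illustrative sketch, not the site's own export code; the function names are assumptions, the standard codon table is used, and alternative start codons are not converted to methionine.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse-complement an uppercase ACGT sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def coding_sequence(bac_seq, start, end, strand):
    """Nucleotides of a coding region; coordinates are 1-based inclusive."""
    sub = bac_seq[start - 1:end]
    return sub if strand > 0 else reverse_complement(sub)


def translate(cds):
    """Translate up to the first stop codon; unknown codons become 'X'."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def fasta_record(header, seq, width=60):
    """Format one FASTA record with fixed-width sequence lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    bac = "CCATGAAATTTTAAGG"  # toy sequence, not real data
    cds = coding_sequence(bac, 3, 14, 1)
    print(fasta_record("orf1 nucleotide", cds), end="")
    print(fasta_record("orf1 protein", translate(cds)), end="")
```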