The currently supported file formats include GDE data files, Genbank formatted files (with type extensions), a generic flat file format, and a color mask file.
All other lines are retained as comments. The LOCUS line also specifies
what type of sequence follows. The form of this line is:
where name is the Genbank Locus name, size is total base count, type is
one of DNA, RNA, PROTEIN, MASK, or TEXT and date is of the form dd-MON-yyyy.
In this way, the standard Genbank format is extended to store all text,
mask and protein data. The Genbank character set has also been extended
in order to support these other data types. Valid characters are:
Here is a valid Genbank entry for two E. coli tRNAs:
The type character is # for DNA/RNA, % for protein sequence, @ for mask
sequence, and " for text. The short name is the same as the LOCUS
line in Genbank. This is followed by lines of sequence, each ending with
a return character. These lines are read until the next type character is
encountered, or until the end of the file is reached. Care should be taken
when using this format with text, as space characters are stripped automatically.
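
The reading rules above can be sketched in Python. This is only a minimal
illustration, not GDE's own reader: the function name read_flat and the type
labels are hypothetical, and the sketch assumes that each header line holds
only the type character and short name, with the sequence on the lines that
follow.

```python
# Type characters from the description above; the labels are chosen for this sketch.
TYPE_CHARS = {"#": "DNA/RNA", "%": "PROTEIN", "@": "MASK", '"': "TEXT"}


def read_flat(text):
    """Return a list of (type_label, name, sequence) tuples from flat-file text."""
    records = []
    current = None  # [type_label, name, list_of_sequence_chunks]
    for line in text.splitlines():
        if not line.strip():
            continue  # skipping blank lines is an assumption of this sketch
        if line[0] in TYPE_CHARS:
            # A type character in the first column starts a new sequence.
            if current is not None:
                records.append((current[0], current[1], "".join(current[2])))
            fields = line[1:].split()
            # Any "(offset)" suffix is left attached to the name here.
            current = [TYPE_CHARS[line[0]], fields[0] if fields else "", []]
        elif current is not None:
            # Space characters are stripped from sequence lines, as noted above.
            current[2].append("".join(line.split()))
    if current is not None:
        records.append((current[0], current[1], "".join(current[2])))
    return records
```

Note that a sequence line which happens to begin with one of the type
characters would be taken as a new header by this sketch, which is one more
reason for care when storing text in this format.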
As of release 2.0, flat file format allows for an optional offset to be
specified in parentheses after the sequence name. An offset represents
how many leading gap characters should be placed before the start of a
sequence. If no offset is given, it defaults to 0.
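
The offset rule might be handled along the following lines. This is a hedged
sketch: parse_header and pad are hypothetical names, the exact header spelling
(for example NAME(5), with optional spaces before the parenthesis) is an
assumption, and so is the use of "-" as the gap character.

```python
import re

# Assumed header shape: NAME or NAME(offset), after the type character is removed.
HEADER = re.compile(r"^(?P<name>[^\s(]+)\s*(?:\((?P<offset>\d+)\))?\s*$")


def parse_header(header):
    """Split 'NAME' or 'NAME(offset)' into (name, offset); a missing offset is 0."""
    match = HEADER.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognized header: {header!r}")
    return match.group("name"), int(match.group("offset") or 0)


def pad(sequence, offset, gap="-"):
    """Place `offset` gap characters before the sequence (gap character assumed)."""
    return gap * offset + sequence
```

For example, parse_header("ECOTRNT4(5)") gives the name ECOTRNT4 with an
offset of 5, and pad("GGG", 3) gives "---GGG".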
Here is a sample flat file for two E. coli tRNAs:
#ECOTRNT4
GGGUCGUUAGCUCAGUUGGUAGAGCAGUUGACUUUUAAUCAAUUGGNCGCAGGUUCGAAU
CCUGCACGACCCACCA
#ECOTRQ1
UGGGGUAUCGCCAAGCGGUAAGGCACCGGUUUUUGAUACCGGCAUUCCCUGGUUCGAAUC
CAGGUACCCCAGCCA
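
Going the other way, a file in this layout could be produced with a short
routine such as the one below. It is a sketch under stated assumptions:
write_flat is a hypothetical name, the header is written on its own line, and
the 60-character line width is simply taken from the sample above rather than
from any stated requirement.

```python
def write_flat(records, width=60):
    """Format (type_char, name, sequence) tuples as flat-file text.

    The width of 60 mirrors the sample file; it is an assumption, not a rule.
    """
    lines = []
    for type_char, name, sequence in records:
        lines.append(f"{type_char}{name}")  # header: type character + short name
        # Break the sequence into lines, each ending with a return character.
        for start in range(0, len(sequence), width):
            lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n" if lines else ""
```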