File Format Notes - Generic Feature Format Version 3 (GFF3)

Introduction

Annotating Genomes with GFF3 or GTF files
Generic Feature Format Version 3 (GFF3) v1.26
GFF3 format

Formatting requirements

[1] The seqid in GFF3/GTF column 1 should match the sequence identifier in the corresponding FASTA or ASN.1 file that is being annotated. For assemblies already in GenBank, seqids will be matched to their corresponding accessions if they are the same as those used in the original submission. [The seqid is the text between the ‘>’ and the first space in the FASTA definition line; do not include the ‘>’ in the GFF file]
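The seqid rule in [1] can be checked before submission. The following is a minimal sketch, not a GenBank tool: the function names and the sample records (scaffold_1, scaffold_2, scaffold_3) are hypothetical, and it only compares column 1 against FASTA definition lines; accession matching for assemblies already in GenBank is not covered.

```python
def fasta_seqids(fasta_lines):
    """Collect seqids: the text between '>' and the first whitespace
    character of each FASTA definition line (the '>' itself is excluded)."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:
                ids.add(fields[0])
    return ids


def unmatched_gff_seqids(gff_lines, known_ids):
    """Return column-1 values of GFF3/GTF rows that are not in known_ids."""
    missing = set()
    for line in gff_lines:
        # Skip blank lines, directives, and comments.
        if not line.strip() or line.startswith("#"):
            continue
        seqid = line.split("\t", 1)[0]
        if seqid not in known_ids:
            missing.add(seqid)
    return missing


# Hypothetical sample data for illustration only.
fasta = [">scaffold_1 Example organism scaffold 1", "ACGTACGT", ">scaffold_2", "GGCC"]
gff = [
    "##gff-version 3",
    "scaffold_1\texample\tgene\t1\t8\t.\t+\t.\tID=gene1",
    "scaffold_3\texample\tgene\t1\t4\t.\t+\t.\tID=gene2",
]
print(unmatched_gff_seqids(gff, fasta_seqids(fasta)))  # {'scaffold_3'}
```

Any seqid reported here would need to be renamed in the GFF file or in the FASTA definition line so that the two agree.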

[2] contig, supercontig, chromosome and similar landmark features are not required and will be ignored.

[3] multi-exon mRNA and other RNA features can be represented using any of the following:
[a] child exon features
[b] child five_prime_UTR, CDS, and three_prime_UTR features
[c] multiple RNA feature rows with the same ID

Furthermore, although the GFF3 specification requires that all rows of a multi-exon CDS feature use the same ID, some commonly used software deviates from this requirement. To allow for such deviations, for eukaryotes the GenBank software assumes that multiple CDS rows with the same Parent attribute represent parts of the same CDS feature. Multiple CDS features for the same gene must therefore each be annotated under a separate mRNA Parent feature, so that there is always a 1:1 relationship of mRNA to CDS, as in the following schematic:

gene1            ================================    ID=gene1
mRNA1            ================================    ID=mRNA1;Parent=gene1
five_prime_UTR   ==                                  Parent=mRNA1
CDS1               ==....=====...........==          Parent=mRNA1 (3 rows)
three_prime_UTR                            ======    Parent=mRNA1
mRNA2            ================================    ID=mRNA2;Parent=gene1
exon             ====                                Parent=mRNA2
CDS2               ==....................==          Parent=mRNA2 (2 rows)
exon                                     ========    Parent=mRNA2
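The Parent-based grouping described above can be sketched in a few lines. This is an illustration of the stated assumption (CDS rows sharing a Parent form one CDS feature), not the GenBank implementation. The function names, the seqid chr1, and all coordinates are hypothetical, chosen only to mirror the schematic, and rows with several comma-separated Parent values are not handled.

```python
from collections import defaultdict


def parse_attributes(column9):
    """Parse the 'key=value;key=value' string of GFF3 column 9 into a dict."""
    attrs = {}
    for part in column9.strip().split(";"):
        if "=" in part:
            key, value = part.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def cds_segments_by_parent(gff_lines):
    """Group CDS rows by Parent; each group is treated as one CDS feature."""
    groups = defaultdict(list)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        parent = parse_attributes(cols[8]).get("Parent")
        if parent is not None:
            groups[parent].append((int(cols[3]), int(cols[4])))
    return {parent: sorted(segments) for parent, segments in groups.items()}


# Hypothetical rows mirroring the schematic: CDS rows carry no ID, only Parent.
rows = [
    "chr1\texample\tgene\t1\t320\t.\t+\t.\tID=gene1",
    "chr1\texample\tmRNA\t1\t320\t.\t+\t.\tID=mRNA1;Parent=gene1",
    "chr1\texample\tCDS\t21\t40\t.\t+\t0\tParent=mRNA1",
    "chr1\texample\tCDS\t81\t130\t.\t+\t.\tParent=mRNA1",
    "chr1\texample\tCDS\t241\t260\t.\t+\t.\tParent=mRNA1",
    "chr1\texample\tmRNA\t1\t320\t.\t+\t.\tID=mRNA2;Parent=gene1",
    "chr1\texample\tCDS\t21\t40\t.\t+\t0\tParent=mRNA2",
    "chr1\texample\tCDS\t241\t260\t.\t+\t.\tParent=mRNA2",
]
cds_features = cds_segments_by_parent(rows)
for parent, segments in cds_features.items():
    print(parent, segments)
```

Because the grouping key is the mRNA, two distinct CDS features placed under one mRNA would be merged into a single feature, which is why each CDS needs its own mRNA parent.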

[4] GFF3 ID attributes are required for interpreting parent-child feature relationships and that is their only role here.

They are not automatically used for the locus_tag qualifier, so if the ID is applicable as the locus_tag, it should be copied into that attribute with the appropriate formatting.
However, if no transcript_id or protein_id qualifiers are present, then the GFF3 ID attribute will be used as the basis of those qualifiers, as described in point [5c] below. These qualifiers do not appear in the flatfile view, so if the GFF3 IDs are meant to be seen in that view, they should be copied into a ‘note’ attribute with the appropriate formatting.
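Copying the ID into another attribute, as point [4] suggests, can be done mechanically on column 9. The sketch below uses a hypothetical helper, copy_id_to_attribute, and copies the ID value verbatim. The text does not spell out what "appropriate formatting" means for locus_tag or note values, so any required reformatting is left out and would have to be added.

```python
def copy_id_to_attribute(column9, target="note"):
    """Append target=<ID> to a GFF3 attribute string.

    The string is returned unchanged (apart from surrounding whitespace)
    when it has no ID or when the target attribute is already present.
    The ID is copied verbatim; no locus_tag formatting rules are applied.
    """
    pairs = [part.split("=", 1) for part in column9.strip().split(";") if "=" in part]
    attrs = dict(pairs)
    if "ID" not in attrs or target in attrs:
        return column9.strip()
    pairs.append([target, attrs["ID"]])
    return ";".join(f"{key}={value}" for key, value in pairs)


print(copy_id_to_attribute("ID=mRNA1;Parent=gene1"))
print(copy_id_to_attribute("ID=gene1", target="locus_tag"))
```

The first call yields a note attribute carrying the ID so it can appear in the flatfile view; the second shows the same copy into locus_tag.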

[5] GFF3 Name attributes are ignored.
