
[Arabidopsis] Re: Inconsistency in protein fasta and gene annotation

Wei, Xuehong via arab-gen@net.bio.net (by weix from cshl.edu)
Wed Aug 31 10:45:53 EST 2016


Dear Araport developer,

I am following up on a question I raised a week ago about discrepancies between the translations of the GFF gene models and the peptide sequences listed under the same transcript names. Both data sets were downloaded from https://www.araport.org/data/araport11. Here is one of the problematic gene models: AT1G07500.1

The translation of the model based on the TAIR10 assembly is
Met E E K N Y D D G D T V T V D D D Y Q Met G C T T P T R D D C R I P A Y P P C P P P V R R K R S L L G F G K K R E P P K K G Y F Q P P D L D L F F S V V A A S Q A A T Stop

However, the sequence in Araport11_genes.201606.pep.fasta is
>AT1G07500.1 | hypothetical protein | Chr1:2304766-2305031 REVERSE LENGTH=74 | 201606
Met E E K N Y D D G D T V T V D D D Y Q Met G C T T P T R D D C R I P A Y P P C P P P G T T E E G I F S A A G S R L V L L G

The two sequences agree through position 41, after which the remainder looks like a frameshift. There are about 50 such cases among all the gene models. Could you investigate?
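
For anyone who wants to locate the divergence point programmatically, here is a minimal Python sketch. It assumes both peptides are held as one-letter strings with the stop symbol omitted. The function name `shared_prefix_length` is hypothetical and is not part of any Araport or Gramene tool.

```python
def shared_prefix_length(a: str, b: str) -> int:
    """Return the number of leading residues that two peptide strings share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Peptides for AT1G07500.1 as quoted in this message (one-letter code, stop omitted).
gff_translation = (
    "MEEKNYDDGDTVTVDDDYQMGCTTPTRDDCRIPAYPPCPPP"
    "VRRKRSLLGFGKKREPPKKGYFQPPDLDLFFSVVAASQAAT"
)
fasta_peptide = (
    "MEEKNYDDGDTVTVDDDYQMGCTTPTRDDCRIPAYPPCPPP"
    "GTTEEGIFSAAGSRLVLLG"
)

# Number of residues on which the two sequences agree before diverging.
print(shared_prefix_length(gff_translation, fasta_peptide))  # 41
```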

Thanks,

Sharon


On Aug 24, 2016, at 1:59 PM, Sharon Wei <weix from cshl.edu> wrote:

Dear Araport developer,

I recently loaded the protein-coding genes from the GFF3 file downloaded from https://www.araport.org/data/araport11. While doing QC, I found that about 50 proteins in Araport11_genes.201606.pep.fasta do not match the translations of their gene annotations. One example is gene AT1G07500.

The gene annotation in the GFF3 file is
Chr1    Araport11       gene    2304766 2305031 .       -       .       ID=AT1G07500;Name=AT1G07500;Dbxref=PMID:26546445,PMID:16024587,PMID:17085506,PMID:17227549,PMID:17599908,PMID:18650403,PMID:20706207,PMID:24399300,PMID:25037213,PMID:25385697,PMID:25624148,locus:2024957;Note=hypothetical protein;Alias=SMR5,SIAMESE-RELATED 5;computational_description=unknown protein%3B Has 4 Blast hits to 4 proteins in 3 species: Archae - 0%3B Bacteria - 0%3B Metazoa - 0%3B Fungi - 0%3B Plants - 4%3B Viruses - 0%3B Other Eukaryotes - 0 (source: NCBI BLink).;locus_type=protein_coding
Chr1    Araport11       mRNA    2304766 2305031 .       -       .       ID=AT1G07500.1;Parent=AT1G07500;Name=AT1G07500.1;Note=hypothetical protein;conf_class=8;Alias=SMR5,SIAMESE-RELATED 5;computational_description=unknown protein%3B Has 4 Blast hits to 4 proteins in 3 species: Archae - 0%3B Bacteria - 0%3B Metazoa - 0%3B Fungi - 0%3B Plants - 4%3B Viruses - 0%3B Other Eukaryotes - 0 (source: NCBI BLink).;conf_rating=*;Dbxref=gene:2024956,UniProt:F4HQP3
Chr1    Araport11       exon    2304766 2305031 .       -       .       ID=AT1G07500:exon:1;Parent=AT1G07500.1;Name=AT1G07500:exon:1
Chr1    Araport11       CDS     2304783 2305031 .       -       0       ID=AT1G07500:CDS:1;Parent=AT1G07500.1;Name=AT1G07500:CDS:1
Chr1    Araport11       protein 2304783 2305031 .       -       .       ID=AT1G07500.1-Protein;Name=AT1G07500.1;Derives_from=AT1G07500.1
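
As a quick sanity check on the CDS record above, its coordinates can be pulled out and the length tested for divisibility by three. This is only a sketch: the helper `parse_gff3_line` is hypothetical, and it splits on runs of whitespace because that is how the line appears when pasted into this message (a real GFF3 file is tab-delimited, so splitting on tabs would be the safer choice there).

```python
def parse_gff3_line(line: str) -> dict:
    """Split one GFF3 feature line into the fields needed for a CDS check."""
    # Assumption: columns are separated by whitespace, as pasted in this message.
    # maxsplit=8 keeps the attributes column intact even if it contains spaces.
    cols = line.split(None, 8)
    attributes = dict(
        pair.split("=", 1) for pair in cols[8].strip().split(";") if pair
    )
    return {
        "seqid": cols[0],
        "type": cols[2],
        "start": int(cols[3]),
        "end": int(cols[4]),
        "strand": cols[6],
        "phase": cols[7],
        "attributes": attributes,
    }


cds_line = (
    "Chr1    Araport11       CDS     2304783 2305031 .       -       0       "
    "ID=AT1G07500:CDS:1;Parent=AT1G07500.1;Name=AT1G07500:CDS:1"
)

cds = parse_gff3_line(cds_line)
# GFF3 coordinates are 1-based and inclusive on both ends.
cds_length = cds["end"] - cds["start"] + 1
print(cds["attributes"]["Parent"], cds["strand"], cds_length, cds_length % 3)
# AT1G07500.1 - 249 0
```

A length of 249 nt corresponds to 83 codons, that is, 82 residues plus the stop codon.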

The CDS sequence, extracted from JBrowse on your site, is

>Chr1 Chr1:2304783..2305031
CTAGGTTGCCGCTTGGGAGGCTGCTACCACCGAGAAGAACAAGTCTAGATCCGGCGGCTGAAAATATCCCTTCTTCGG
TGGTTCCCTCTTCTTCCCAAAGCCTAGTAGCGATCTCTTCCTTCTCACCGGAGGTGGACAAGGCGGATATGCTGGTAT
CCGGCAATCATCACGTGTAGGCGTCGTGCATCCCATCTGATAATCATCATCAACCGTCACCGTATCTCCGTCGTCGTA
GTTTTTCTCCTCCAT

The translation of this sequence is
Met E E K N Y D D G D T V T V D D D Y Q Met G C T T P T R D D C R I P A Y P P C P P P V R R K R S L L G F G K K R E P P K K G Y F Q P P D L D L F F S V V A A S Q A A T Stop
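
The translation above can be reproduced with a short Python sketch. It assumes that the JBrowse sequence is given on the forward strand, so that the minus-strand CDS must be reverse-complemented first, and that the standard genetic code (NCBI translation table 1) applies. The function names are illustrative only.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(cds: str) -> str:
    """Translate a coding sequence in frame 0; '*' marks a stop codon."""
    usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, usable, 3))


# Chr1:2304783..2305031 as quoted in this message.
forward_strand = (
    "CTAGGTTGCCGCTTGGGAGGCTGCTACCACCGAGAAGAACAAGTCTAGATCCGGCGGCTGAAAATATCCCTTCTTCGG"
    "TGGTTCCCTCTTCTTCCCAAAGCCTAGTAGCGATCTCTTCCTTCTCACCGGAGGTGGACAAGGCGGATATGCTGGTAT"
    "CCGGCAATCATCACGTGTAGGCGTCGTGCATCCCATCTGATAATCATCATCAACCGTCACCGTATCTCCGTCGTCGTA"
    "GTTTTTCTCCTCCAT"
)

# AT1G07500.1 is on the minus strand, so translate the reverse complement.
protein = translate(reverse_complement(forward_strand))
print(protein)
```

The printed string is 82 residues followed by `*`, which is the same sequence as the spaced translation quoted above.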


However, the sequence in Araport11_genes.201606.pep.fasta is
>AT1G07500.1 | hypothetical protein | Chr1:2304766-2305031 REVERSE LENGTH=74 | 201606
MEEKNYDDGDTVTVDDDYQMGCTTPTRDDCRIPAYPPCPPPGTTEEGIFSAAGSRLVLLG

Could you please look into this problem?

Thanks,

Sharon
Gramene project





