Search GTDB


What format should my sequence be?

Your sequence must be a protein (amino acid) sequence in FASTA format, with or without a header line. For example:

>tr|Q9FAE8|Q9FAE8_9BURK Flagellin OS=Acidovorax avenae subsp. avenae OX=80870 GN=N1141-fla1 PE=3 SV=1
MASTINTNVSSLTAQRNLSLSQSSLNTSIQRLSSGLRINSAKDDAAGLAISERFTSQIRG
LNQAVRNANDGISLAQTAEGALKSTGDILQRVRELAVQSANATNSSGDRKAIQAEVGQLL
SEMDRIAGNTEFNGQKLLDGSFGSATFQVGANANQTITATTGNFRTNNYGAQLTASATGA
ATTGATAGSAGAAAGTVVIAGLQTKTVNVAAAGTASDIASAVNAVADSTGVTASARNVSE
MKFSGTGSFTLAVKGDNSTAANVTFNVSATSTAAGLAEAVKAFNDVSSQTGVTAKLNSDS
SGLILTNESGNDINIANGSSSAAGITLASQDAVTTQSSGTLTFTSATAAGTGVTVASRGT
VEYKSDKGYTVSGTGGTMTNATATSSTLTKVSDIDVSTVDGSTKALKIIDAALSAVNGQR
ASFGALQSRFETTVNNLQSTSENMSASRSRIQDADFAAETANLSRSQILQQAGTAMVAQANQLPQGVLSLLK
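
If you want to check a query before submitting it, the sketch below tests whether a piece of text is a single protein sequence in FASTA format, with or without a header line. The function name and the accepted letter set (the 20 standard amino acids plus the codes B, J, O, U, X, Z and an optional trailing `*`) are assumptions made for illustration; AnnoView's own validation may differ.

```python
# Hypothetical helper for illustration; not part of AnnoView.
ALLOWED = set("ACDEFGHIKLMNPQRSTVWY" "BJOUXZ")


def parse_single_protein_fasta(text):
    """Return (header, sequence); raise ValueError if the input is not
    a single protein sequence. header is None when no '>' line is given."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty input")

    header = None
    if lines[0].startswith(">"):
        header = lines[0][1:]
        lines = lines[1:]

    # A second '>' line means more than one record was pasted.
    if any(ln.startswith(">") for ln in lines):
        raise ValueError("only one sequence is allowed per search")

    # Join wrapped lines, normalise case, drop a trailing stop symbol.
    sequence = "".join(lines).upper().rstrip("*")
    if not sequence:
        raise ValueError("no sequence found")

    bad = sorted(set(sequence) - ALLOWED)
    if bad:
        raise ValueError("unexpected characters: " + "".join(bad))
    return header, sequence
```

A check like this cannot tell a nucleotide sequence from a protein sequence, because A, C, G, and T are also valid amino acid codes.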


Which methods/databases are used when I use the Search GTDB option?

The search uses DIAMOND in its default mode to perform a protein sequence similarity search against the AnnoView database, which is based on GTDB data (Release 95) and the AnnoTree database. The protein homology search criteria include the E-value, the coverage cut-off, and which database to search (bacteria or archaea). Annotations come from the AnnoTree database, in which every protein sequence is annotated with KEGG orthology identifiers, Pfam protein families, and TIGRFAM protein families.
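
For readers who want to run a comparable search locally, the sketch below builds a `diamond blastp` command with an E-value and a coverage cut-off. It is not AnnoView's actual server command: the file names, the default thresholds, and the choice of `--query-cover` (rather than `--subject-cover`) for the coverage cut-off are all assumptions. No sensitivity flag is passed, which leaves DIAMOND in its default mode.

```python
def diamond_command(query_fasta, database, evalue=1e-5, coverage=70.0,
                    out="hits.tsv"):
    """Build an argument list for a DIAMOND protein search.

    All default values and file names here are illustrative assumptions.
    """
    return [
        "diamond", "blastp",
        "--query", query_fasta,            # protein FASTA with the query
        "--db", database,                  # a .dmnd file made by `diamond makedb`
        "--evalue", str(evalue),           # maximum E-value to report
        "--query-cover", str(coverage),    # minimum % of the query covered
        "--out", out,
        "--outfmt", "6",                   # BLAST-style tabular output
    ]


# Example: print the command instead of running it.
print(" ".join(diamond_command("query.faa", "bacteria.dmnd")))
```

The list can be passed to `subprocess.run` on a machine where DIAMOND is installed and a database has been built.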


What if I want to search against a different database?

This is not currently supported. Future versions of AnnoView will be updated to the latest GTDB release.


Can I search multiple protein sequences using Search GTDB?

No. Only one protein sequence is allowed at a time.


Will I still be able to see the result page if I lose internet connection before the search is done?

No. You won’t be able to see the result, even if the connection is restored. You’ll have to run the search again with your query.


Is there a way to save the result of the intermediate page?

No. The search result can only be saved after the gene neighborhood is displayed. Saving the intermediate page is planned for a later version of the tool.


How long does a query take using the Search GTDB option?

Running a query against the archaea database typically takes around 15 seconds, while a query against the bacteria database typically takes about 6 minutes. These times may increase depending on the number of concurrent users.


What if there are multiple protein hits similar to the query in one genome? Will this tool show all of them?

Yes. After entering your query, you will be directed to an intermediate page where you can choose which protein hits and associated gene neighborhoods to view. If there are multiple hits within one genome, you can display a single hit or select several of them for further exploration.


Upload


What format should my data be when uploading to AnnoView?

AnnoView currently accepts .gbk, .gff, and .csv files.


What kind of information should I include if I upload .csv format files to AnnoView?

The .csv file should include the following columns: GTDB/nucleotide ID (must be the first column), organism/species name (must be the second column), GTDB gene ID/protein ID, start position, end position, CDS length, and at least one metadata column used to annotate the genes (gene name, product, KEGG, Pfam, TIGRFAM, etc.). The following columns are optional: sequence, domain, phylum, class, order, family, genus, default center (the gene around which the rows are centered), and extra metadata/function annotation columns. A template is available for download: Slr4_example.


What is the file size limit for upload to AnnoView?

Currently, the maximum size for a single file upload is 5 MB, and the maximum total size for multiple files is 16 MB.
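
To check a set of files against these limits before uploading, something like the following sketch could be used. The function names are hypothetical, and counting 1 MB as 1024 × 1024 bytes is an assumption, since the FAQ does not say how the server measures size.

```python
import os

MB = 1024 * 1024        # assumption: 1 MB = 2**20 bytes
MAX_SINGLE = 5 * MB     # limit for one file
MAX_TOTAL = 16 * MB     # limit for all files together


def within_upload_limits(sizes):
    """sizes: iterable of file sizes in bytes."""
    sizes = list(sizes)
    return all(s <= MAX_SINGLE for s in sizes) and sum(sizes) <= MAX_TOTAL


def file_sizes(paths):
    """Look up the size in bytes of each file on disk."""
    return [os.path.getsize(p) for p in paths]
```

Typical use would be `within_upload_limits(file_sizes(["a.gbk", "b.gbk"]))`, with your own file paths.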


Can I upload multiple .csv files to AnnoView?

Only one .csv file can be uploaded at a time.


What does each column mean in the .csv file downloaded from Search GTDB? Is there a difference between the .csv files downloaded from AnnoTree and from uploading?


| Column explanation | .csv downloaded from Search GTDB | .csv downloaded from uploading NCBI .gbk/.gff | Columns required for upload |
|---|---|---|---|
| Genome/assembly/nucleotide accession | GTDB ID | Nucleotide ID | Yes, and must be the first column |
| Organism name | Species | Organism | Yes, and must be the second column |
| Taxonomy level | Domain | - | Optional |
| Taxonomy level | Phylum | - | Optional |
| Taxonomy level | Class | - | Optional |
| Taxonomy level | Order | - | Optional |
| Taxonomy level | Family | - | Optional |
| Taxonomy level | Genus | - | Optional |
| Gene product | - | Product | Optional |
| Protein ID | GTDB Gene ID | Protein ID | Yes |
| Gene name | - | Gene | Optional |
| Start location | Start | Start | Yes |
| Stop location | End | End | Yes |
| Orientation | Strand | Strand | Optional |
| Nucleotide sequence length | CDS Length | CDS Length | Yes |
| KEGG orthology | KEGG | - | Optional |
| Pfam protein family | Pfam | - | Optional |
| TIGRFAM protein family | TIGRFAM | - | Optional |
| Center gene used for gene neighborhood clustering | Default Center | - | Optional |
| Protein sequence | Sequence | Sequence | Optional |
| Customized protein annotation by user | - | - | Optional |


Visualization


How can I save my visualizations for editing later?

You can download the figure in .svg format, and edit it using a vector graphic editor such as Adobe Illustrator or Inkscape. An alternative way of saving or editing the visualization is to download the gene neighborhood dataset in .csv format. You can then add columns that contain taxonomic information, default centering instructions for the visualization, or protein annotations to the table. This table in .csv format can be re-uploaded to AnnoView.
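
As an illustration of the second approach, the sketch below adds a custom annotation column to a downloaded .csv before re-upload. The function name, the `Protein ID` key column, and the new column name are assumptions; adjust them to the headers in your own file (for example `GTDB Gene ID` in a Search GTDB download).

```python
import csv
import io


def add_annotation_column(csv_text, column, values, key="Protein ID",
                          default=""):
    """Return csv_text with one extra column appended.

    values maps the entry in the `key` column to the new annotation;
    rows without an entry get `default`. Assumes `column` is not
    already present in the header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + [column]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row[column] = values.get(row[key], default)
        writer.writerow(row)
    return out.getvalue()
```

To work with files, read the download with `open(path, newline="").read()`, pass the text to the function, and write the returned text to a new .csv file.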


Is there a way to reorder and align the gene neighborhoods?

Yes. Gene neighborhoods can be sorted and aligned using a clustering algorithm implemented on our server. To do this, pick a center gene, then right-click on it and choose “center on that gene”. This centers each row on the first gene in that row whose annotation matches (using whichever annotation category is currently being visualized). Alternatively, you can download the .csv file and set the default center gene to whichever gene you want (one per row). You can also change the row order manually by dragging the gene neighborhood labels on the left side. Each row can be moved left or right by grabbing it, so that you can align rows by hand. Rows can also be flipped by right-clicking on a row and choosing “Flip track”.
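
The centering step can be pictured with a small sketch. This is not the server's clustering algorithm, which is not described here; it only illustrates "center each row on the first gene with a matching annotation". The data layout (a list of `(start, end, annotation)` tuples per row, in display order) and the choice of the gene midpoint as the alignment point are assumptions.

```python
def center_offsets(rows, annotation):
    """Compute a horizontal shift for each row.

    rows: {label: [(start, end, annotation), ...]} with genes in
    display order. Returns {label: shift}, where adding shift to every
    coordinate in the row puts the midpoint of the first matching gene
    at position 0. Rows without a match get None and stay where they are.
    """
    offsets = {}
    for label, genes in rows.items():
        match = next((g for g in genes if g[2] == annotation), None)
        if match is None:
            offsets[label] = None
        else:
            start, end, _ = match
            offsets[label] = -((start + end) // 2)
    return offsets


rows = {
    "genome A": [(100, 200, "K1"), (300, 500, "K2"), (600, 700, "K2")],
    "genome B": [(10, 50, "K3")],
}
print(center_offsets(rows, "K2"))
```

In this example only the first K2 gene of genome A is used, and genome B has no matching gene.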


Can I delete unwanted gene neighborhood tracks?

Yes. You can select one or multiple gene neighborhood tracks by clicking on their labels on the left, then right click on the labels and choose “Delete all selected tracks”.


Why are some genes colored light grey?

These genes don’t have any annotations under the currently selected annotation category.


There are so many genes in the browser. How do I quickly find a specific gene?

You can use the search function in the toolbar and type the annotation of the gene you want to locate. The browser will then highlight all the genes with that specific annotation. The genes in the search bar are sorted according to their frequency, allowing users to locate the most commonly occurring genes within the current gene neighborhoods.
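
The frequency ordering can be reproduced on a downloaded table with a few lines of Python. This is a sketch of the idea rather than the browser's implementation; listing the most frequent annotation first and breaking ties alphabetically are assumptions.

```python
from collections import Counter


def annotations_by_frequency(annotations):
    """Return (annotation, count) pairs, most frequent first.

    Empty annotations (genes shown in light grey) are ignored;
    ties are broken alphabetically.
    """
    counts = Counter(a for a in annotations if a)
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


# Example with made-up Pfam identifiers from one annotation column.
print(annotations_by_frequency(["PF1", "PF2", "PF1", "", "PF3", "PF2", "PF1"]))
```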


General Questions


What browser do you recommend to run AnnoView?

Chrome.


Other questions/comments?

Please contact h29tan@uwaterloo.ca or use the following question form.


Question form