Annotation
Gene basic information
This part shows the basic information for a gene, extracted from several sources. The Entrez ID, HGNC ID, Gene Symbol, Transcript ID, Transcript Name, Transcript Length, Transcript RefSeq ID, Protein ID, Protein RefSeq ID, UniProt ID, and Gene Type were extracted from Ensembl BioMart (Ensembl Genes 105). Ensembl ID, Strand, Gene Length, and Position were obtained from Ensembl GTF files (Ensembl Release 105). Protein Length was obtained from Ensembl FASTA files (Ensembl Release 105). Gene Alias, Full Name, and Gene Summary were retrieved from NCBI databases.
Gene model
This part shows how the CDS, UTR, and intron regions of a gene are distributed along the chromosome, based on the information in Ensembl GTF files.
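As a rough illustration of how a gene model can be derived from a GTF file, the sketch below collects exon, CDS, and UTR intervals per transcript and infers introns as the gaps between consecutive exons. It is not the pipeline used for this database. The function names are hypothetical, and the feature labels `five_prime_utr` and `three_prime_utr` are assumed to match the Ensembl GTF convention.

```python
from collections import defaultdict

# Assumed feature labels (column 3 of the GTF) to keep for the gene model.
WANTED = ("exon", "CDS", "five_prime_utr", "three_prime_utr")

def parse_gtf_features(lines, wanted=WANTED):
    """Collect (start, end) intervals per transcript and feature type."""
    features = defaultdict(lambda: defaultdict(list))
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] not in wanted:
            continue
        # Attribute column looks like: gene_id "X"; transcript_id "Y";
        attrs = {}
        for item in cols[8].split(";"):
            item = item.strip()
            if item:
                key, _, value = item.partition(" ")
                attrs[key] = value.strip('"')
        transcript_id = attrs.get("transcript_id")
        if transcript_id:
            features[transcript_id][cols[2]].append((int(cols[3]), int(cols[4])))
    return features

def introns_from_exons(exons):
    """Return the gaps between consecutive exons (1-based, inclusive)."""
    exons = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
        if next_start - prev_end > 1
    ]
```

Introns are not listed explicitly in a GTF file, which is why they are computed from the exon coordinates here.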
Protein function domain
The functional domain panel displays the domain distribution of the longest protein for each gene. To identify the functional domains of TFs, we first downloaded all the HMMER profiles from the Pfam database (version 35.0). We then used PfamScan with default settings to search the longest protein sequence of each TF against these profiles. After filtering for domain coverage >= 70%, we obtained the final functional domains for each sequence.
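The coverage filter can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that "domain coverage" means the fraction of the Pfam HMM length spanned by the match, and the `DomainHit` record and its field names are hypothetical stand-ins for values taken from the PfamScan output.

```python
from dataclasses import dataclass

@dataclass
class DomainHit:
    """Hypothetical record for one domain match on a protein."""
    seq_id: str
    domain: str
    hmm_start: int   # first matched position of the HMM (1-based)
    hmm_end: int     # last matched position of the HMM (inclusive)
    hmm_length: int  # full length of the HMM

def hmm_coverage(hit):
    """Fraction of the HMM spanned by the match (assumed definition)."""
    return (hit.hmm_end - hit.hmm_start + 1) / hit.hmm_length

def filter_by_coverage(hits, min_coverage=0.70):
    """Keep only hits whose coverage reaches the threshold."""
    return [hit for hit in hits if hmm_coverage(hit) >= min_coverage]
```

If coverage is instead defined relative to the aligned region of the protein sequence, only `hmm_coverage` would need to change.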
Gene ontology
The GO annotations were parsed from the gene2go file, which was downloaded from the NCBI FTP site.
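A minimal sketch of parsing gene2go is shown below. It assumes the tab-separated column order tax_id, GeneID, GO_ID, Evidence, Qualifier, GO_term, PubMed, Category, and the function name `parse_gene2go` is hypothetical.

```python
from collections import defaultdict

def parse_gene2go(lines, tax_id=None):
    """Map each Entrez Gene ID to a set of (GO ID, GO term, category)."""
    go_by_gene = defaultdict(set)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the header line
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:
            continue  # malformed row
        row_tax, gene_id, go_id, _evidence, _qualifier, go_term, _pubmed, category = cols[:8]
        # Optionally restrict the result to one species.
        if tax_id is not None and row_tax != str(tax_id):
            continue
        go_by_gene[gene_id].add((go_id, go_term, category))
    return go_by_gene
```

Because the file is distributed gzip-compressed, it can be read directly with `gzip.open(path, "rt")` and the resulting file object passed as `lines`.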
TF related GWAS phenotype
We downloaded the latest GWAS Catalog data from the GWAS Catalog and SNP annotation data (release 155) from dbSNP. By mapping the phenotype-associated SNPs identified by GWAS to the genomic locations of TF genes, we obtained a list of SNPs located in TFs with related disease phenotypes, together with 30,675 TF-related SNP-disease pairs, which are presented in the GWAS panel of the TF detail page.
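The mapping step can be illustrated with a simple interval check. This sketch assumes that a SNP is assigned to a TF when its position falls within the gene's start and end coordinates on the same chromosome; the tuple layouts and the function name `map_snps_to_genes` are hypothetical.

```python
from collections import defaultdict

def map_snps_to_genes(snps, genes):
    """Return sorted (gene, rsid, trait) triples for SNPs inside gene bodies.

    snps:  iterable of (rsid, chrom, pos, trait)
    genes: iterable of (symbol, chrom, start, end), 1-based inclusive
    """
    genes_by_chrom = defaultdict(list)
    for symbol, chrom, start, end in genes:
        genes_by_chrom[chrom].append((start, end, symbol))

    pairs = set()  # a set removes duplicate SNP-trait records
    for rsid, chrom, pos, trait in snps:
        for start, end, symbol in genes_by_chrom.get(chrom, []):
            if start <= pos <= end:
                pairs.add((symbol, rsid, trait))
    return sorted(pairs)
```

The linear scan per chromosome is adequate for a few thousand TF genes; an interval tree would be a reasonable substitute for larger gene sets.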
TFBS
We downloaded mononucleotide TF binding site matrix files for human and mouse from HOCOMOCO v11 and obtained non-redundant matrix files for vertebrates and nematodes from JASPAR. In addition, we purchased the February 2018 version of the TRANSFAC database and extracted the TFBS matrices for vertebrates and nematodes. We also downloaded TF binding site matrix files from CIS-BP. Finally, we integrated the matrix data from all four databases and used the MEME Suite to draw the logo of each TFBS.
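To show what a sequence logo encodes, the sketch below converts a count matrix into per-position probabilities and computes the information content in bits at each position. It is a simplified illustration rather than the MEME Suite's implementation, which may apply additional corrections; the function names and the A, C, G, T column order are assumptions.

```python
import math

def position_probabilities(counts, pseudocount=0.0):
    """Normalize each row of [A, C, G, T] counts into probabilities."""
    probs = []
    for row in counts:
        total = sum(row) + pseudocount * len(row)
        probs.append([(c + pseudocount) / total for c in row])
    return probs

def information_content(prob_row):
    """Bits of information at one position: log2(alphabet size) minus entropy."""
    entropy = -sum(p * math.log2(p) for p in prob_row if p > 0)
    return math.log2(len(prob_row)) - entropy
```

A fully conserved position carries 2 bits for DNA, while a position with equal base frequencies carries 0 bits, which is why conserved positions appear tallest in a logo.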
Phenotype
We parsed the disease information from MalaCards and Ensembl BioMart (Ensembl Genes 105).
Protein-protein interaction
The protein-protein interactions were extracted from the BioGRID (version 4.4) and HPRD databases.
Ortholog
We extracted the ortholog information using the Ensembl API.
Paralog
We extracted the paralog information using the Ensembl API.
Gene expression
In AnimalTFDB4.0, we collected expression data for 38 species: Astatotilapia calliptera, Astyanax mexicanus, Bos taurus, Caenorhabditis elegans, Callithrix jacchus, Capra hircus, Cavia porcellus, Cercocebus atys, Chlorocebus sabaeus, Danio rerio, Drosophila melanogaster, Equus caballus, Esox lucius, Felis catus, Gallus gallus, Gasterosteus aculeatus, Gorilla gorilla, Homo sapiens, Latimeria chalumnae, Lepisosteus oculatus, Macaca fascicularis, Macaca mulatta, Macaca nemestrina, Microcebus murinus, Monodelphis domestica, Mus musculus, Neolamprologus brichardi, Nothobranchius furzeri, Ornithorhynchus anatinus, Oryctolagus cuniculus, Pan paniscus, Pan troglodytes, Papio anubis, Poecilia reticulata, Rattus norvegicus, Scophthalmus maximus, Sus scrofa, and Xenopus tropicalis. The data were collected from the following sources:
1. Human gene expression in 37 cancers from TCGA.
2. Human gene expression in normal tissues and cell lines from the EMBL-EBI Expression Atlas.
3. Human protein expression in normal tissues and cells from the Human Proteome Map.
4. Human gene expression in non-diseased tissue sites from GTEx.
5. Human transcript expression levels summarized per gene in 60 tissues, based on CAGE data from FANTOM5.
6. Human transcript expression levels summarized per gene in 256 tissues and cell lines, based on RNA-seq data from the Human Protein Atlas.
7. Gene expression in normal tissues and cells for human and other species from the Bgee database (version 15.0).
8. Gene expression in seven organs (cerebrum, cerebellum, heart, kidney, liver, ovary, and testis) across developmental time points from early organogenesis to adulthood for human, rhesus macaque, mouse, rat, rabbit, opossum, and chicken. These data were extracted from the paper published by Henrik Kaessmann et al. (PMID: 31243369).
9. Gene expression at different developmental stages and in different tissues for human and other species. These data were extracted from the paper published by Gong et al. (PMID: 32986825).
Family Multiple sequence alignment
We performed multiple sequence alignment of the DBD sequences with ClustalW2 and built phylogenetic tree files for TFs in the same family of each species using the PHYLIP Neighbor-Joining method with 100 bootstrap replicates. The multiple sequence alignment result is displayed with WebLogo.
Pathway
We parsed the pathway information from KEGG and BioCarta.
Mutation
To characterize mutations in human TFs and cofactors, we collected mutation information from ClinVar and COSMIC (v96). In total, there are 190,627 mutation records from ClinVar and 8,294,851 mutation records from COSMIC.
Autophagy
We also collected information from the THANATOS database on whether a TF or cofactor is involved in regulating autophagy-related processes. These processes include autophagy, apoptosis, and necrosis. The mode of regulation can be positive, negative, or both. In total, we collected autophagy information for 1,045 TFs or cofactors in six model organisms: Homo sapiens, Mus musculus, Rattus norvegicus, Caenorhabditis elegans, Danio rerio, and Drosophila melanogaster.
Post Translational Modification
Post-translational modifications (PTMs) play important roles in regulating a broad spectrum of biological processes and pathways. We parsed 38,943 acetylation, methylation, and ubiquitination records from the CPLM database and 131,378 phosphorylation records from the EPSD database for eight species (Homo sapiens, Mus musculus, Rattus norvegicus, Caenorhabditis elegans, Bos taurus, Cavia porcellus, Gallus gallus, and Drosophila melanogaster).
References
- Kim, M.S., Pinto, S.M. et al. (2014) A draft map of the human proteome. Nature, 509, 575-581.
- Wilhelm, M., Schlegl, J. et al. (2014) Mass-spectrometry-based draft of the human proteome. Nature, 509, 582-587.
- Li, J.J., Huang, H., Bickel, P.J. and Brenner, S.E. (2014) Comparison of D. melanogaster and C. elegans developmental stages, tissues, and cells by modENCODE RNA-seq data. Genome research, 24, 1086-1101.