Knowledge Guides: Graduate School of Arts & Sciences-Hawthorne Campus: Bioinformatics Resources

Bioinformatics Databases

START HERE: HGNC - HUGO Gene Nomenclature Committee - make sure you are using the most current symbol and name for human genes. Then, from the symbol report page, use the GenBank and other database links.

ArrayExpress - public archive for transcriptomics data, which is aimed at storing MIAME- and MINSEQE-compliant data in accordance with FGED recommendations. The ArrayExpress Warehouse stores gene-indexed expression profiles from a curated subset of experiments in the archive.

BioCyc [subscription required for access] - a collection of 1763 Pathway/Genome Databases. Each database in the BioCyc collection describes the genome and metabolic pathways of a single organism.

BRENDA - the main collection of enzyme functional data available to the scientific community.

dbGaP - The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the results of studies that have investigated the interaction of genotype and phenotype. Such studies include genome-wide association studies, medical sequencing, molecular diagnostic assays, as well as association between genotype and non-clinical traits. dbGaP provides two levels of access - open and controlled - in order to allow broad release of non-sensitive data, while providing oversight and investigator accountability for sensitive data sets involving personal health information. Summaries of studies and the contents of measured variables as well as original study document text are generally available to the public, while access to individual-level data, including phenotypic data tables and genotypes, requires varying levels of authorization.

Ensembl Genome - a joint project between the EMBL-EBI and the Wellcome Trust Sanger Institute that aims at developing a system that maintains automatic annotation of large eukaryotic genomes. Access to all the software and data is free and without constraints of any kind. The project is primarily funded by the Wellcome Trust. It is a comprehensive source of stable annotation with confirmed gene predictions that have been integrated from external data sources. Ensembl annotates known genes and predicts new ones, with functional annotation from InterPro, OMIM and gene families.

GenBank - NIH genetic sequence database, an annotated collection of all publicly available DNA sequences.

Gene - NCBI's database for gene-specific information.

Genome - provides views for a variety of genomes, complete chromosomes, sequence maps with contigs, and integrated genetic and physical maps.

Genome Browser - created by the Genome Bioinformatics Group of the University of California, Santa Cruz (UCSC).

GEO - Gene Expression Omnibus is a gene expression/molecular abundance repository supporting MIAME-compliant data submissions, and a curated, online resource for gene expression data browsing, query, and retrieval.

KEGG - a complete computer representation of the cell, the organism, and the biosphere, enabling computational prediction of higher-level complexity of cellular processes and organism behaviors from genomic and molecular information.

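PRIDE - the PRoteomics IDEntifications database, a public repository of mass spectrometry-based proteomics data hosted by the European Bioinformatics Institute (EMBL-EBI), which maintains the world’s most comprehensive range of freely available and up-to-date molecular data resources.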

SNP - The Single Nucleotide Polymorphism database (dbSNP) is a public-domain archive for a broad collection of simple genetic polymorphisms.

UniProt - comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

WormBase - database of the model organism Caenorhabditis elegans and related nematodes.

Bioinformatics Tools

Bioconductor - an open source and open development software project to provide tools for the analysis and comprehension of genomic data.

Discovery Studio Visualizer - visualize and share molecular information in a clear and consistent way, and in a wide variety of industry-standard formats. You can also create high-quality graphics.

ExPASy - The ExPASy (Expert Protein Analysis System) proteomics server of the Swiss Institute of Bioinformatics (SIB) is dedicated to the analysis of protein sequences and structures as well as 2-D PAGE.

FirstGlance - view the 3D structures of proteins, DNA, RNA, and their complexes. FirstGlance in Jmol can display major structural features of the molecule with one click each. One-click options display secondary structure, amino and carboxy (or 3' and 5') termini, composition (protein, DNA, RNA, ligands, and solvent), the distributions of hydrophobic, polar, and charged amino acids, salt bridges and cation-pi orbital interactions for amino acids. Non-standard residues and missing side chains are flagged automatically.

GeneCards - a searchable, integrated database of human genes that provides concise genomic, proteomic, transcriptomic, genetic and functional information on all known and predicted human genes. Information featured in GeneCards includes orthologies, disease relationships, mutations and SNPs, gene expression, gene function, pathways, protein-protein interactions, related drugs & compounds and direct links to cutting edge research reagents and tools such as antibodies, recombinant proteins, clones, expression assays and RNAi reagents.

GenePattern - combines a powerful scientific workflow platform with more than 100 genomic analysis tools.

NCBI Structure - Cn3D ("see in 3D") is a helper application for your web browser that allows you to view 3-dimensional structures from NCBI's Entrez retrieval service. Cn3D runs on Windows, Macintosh, and Unix. Cn3D simultaneously displays structure, sequence, and alignment, and now has powerful annotation and alignment editing features.

SAM - Significance Analysis of Microarrays, an Excel add-in that can be applied to data from oligo or cDNA arrays, SNP arrays, protein arrays, etc. It correlates expression data with clinical parameters including treatment, diagnosis categories, survival time, paired (before and after), quantitative (e.g. tumor volume) and one-class. Both parametric and non-parametric tests are offered. It also correlates expression data with time, to study time trends. The experimental units can fall into one or two classes, or be paired. Missing data are imputed automatically via a nearest neighbor algorithm (better and faster in SAM version 2.0). An adjustable threshold determines the number of genes called significant. Data permutations are used to estimate the False Discovery Rate for multiple testing. Gene lists are provided in Excel workbook form, easily exportable into TreeView, Cluster or other software.
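
For readers unfamiliar with how permuting data can yield a False Discovery Rate estimate, the following is a minimal Python sketch of the general idea only. It is not SAM's actual procedure: the score used here is a plain difference of class means, and the function names (mean_diff, permutation_fdr), the threshold value and the toy data are all assumptions made for illustration.

```python
import random
from statistics import mean


def mean_diff(values, labels):
    """Difference of class means for one gene (class 1 minus class 0)."""
    group1 = [v for v, lab in zip(values, labels) if lab == 1]
    group0 = [v for v, lab in zip(values, labels) if lab == 0]
    return mean(group1) - mean(group0)


def permutation_fdr(expr, labels, threshold, n_perm=200, seed=0):
    """Estimate the FDR for genes whose |score| meets the threshold.

    expr   : list of rows, one row of expression values per gene
    labels : list of 0/1 class labels, one per sample (column)
    Returns (number of genes called significant, estimated FDR).
    """
    rng = random.Random(seed)

    # Genes called significant with the real class labels.
    observed = [abs(mean_diff(row, labels)) for row in expr]
    called = sum(score >= threshold for score in observed)
    if called == 0:
        return 0, 0.0

    # Shuffling the labels breaks any real association, so genes that
    # still pass the threshold approximate the number of false calls.
    shuffled = list(labels)
    false_counts = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        false_counts.append(
            sum(abs(mean_diff(row, shuffled)) >= threshold for row in expr)
        )

    fdr = min(1.0, mean(false_counts) / called)
    return called, fdr


# Toy data (hypothetical): three genes, six samples, two classes.
labels = [1, 1, 1, 0, 0, 0]
expr = [
    [10, 10, 10, 0, 0, 0],  # clearly different between classes
    [1, 2, 1, 2, 1, 2],     # no real difference
    [3, 3, 3, 3, 3, 3],     # constant
]
called, fdr = permutation_fdr(expr, labels, threshold=5)
print(called, round(fdr, 3))
```

Raising the threshold calls fewer genes significant and usually lowers the estimated FDR, which is the trade-off an adjustable threshold exposes.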

Bioinformatics Vocabularies

Gene Ontology - provides a controlled vocabulary to describe gene and gene product attributes in any organism.