Week 3

October 13

Bioinformatics

Definition via Wikipedia: Bioinformatics and computational biology involve the use or development of techniques, including applied mathematics, informatics, statistics, computer science, artificial intelligence, chemistry, and biochemistry to solve biological problems, usually on the molecular level. The primary goal of bioinformatics is to increase our understanding of biological processes. What sets it apart from other approaches, however, is its focus on developing and applying computationally intensive techniques (e.g., data mining, and machine learning algorithms) to achieve this goal. Major research efforts in the field include sequence alignment, gene finding, genome assembly, protein structure alignment, protein structure prediction, prediction of gene expression and protein-protein interactions, and the modeling of evolution.

NCBI Science Primer on Bioinformatics

For more information see:
Roberts Lab wiki (includes several nice video tutorials)
FISH507: Bioinformatics Course
Bioinformatics Cheat Sheet.doc

Paper:
A hitchhiker’s guide to expressed sequence tag (EST) analysis

BLAST: The Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
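
The "statistical significance" BLAST reports is usually given as an expect value (E-value): the number of hits with at least that score expected by chance. A minimal sketch, assuming the standard relation E = m * n * 2^(-S'), where S' is the bit score, m the query length, and n the total database length. The function name and the numbers are illustrative only, and BLAST itself also applies effective-length corrections that are ignored here.

```python
def expect_value(bit_score: float, query_len: int, db_len: int) -> float:
    """Expected number of chance hits scoring at least `bit_score`.

    Uses E = m * n * 2**(-S') and ignores the edge-effect
    ("effective length") corrections that BLAST applies.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    # Hypothetical search: a 500 nt query against a 1 Gb database.
    for score in (30, 50, 100):
        print(score, expect_value(score, 500, 1_000_000_000))
```

Higher bit scores give smaller E-values, and the same score is less significant against a larger database.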

Nice Tutorial via Geospiza

Interesting Databases:
KEGG PATHWAY
Enzymes

Short demo of what one might do with a fragment obtained by PCR, DEG, SSH, etc.

Numerous examples of aggregation of gene-related information

Another primary use is finding sequences of interest that are publicly available.
How might you go about doing this?

Text from Lisa's Cheat Sheet
Bioinformatics: field of science in which biology, computer science, and information technology merge to form a single discipline. The ultimate goal of the field is to enable the discovery of new biological insights as well as to create a global perspective from which unifying principles in biology can be discerned (NCBI).
EST: small pieces of DNA sequence (usually 200 to 500 nucleotides long) that are generated by sequencing either one or both ends of an expressed gene. The idea is to sequence bits of DNA that represent genes expressed in certain cells, tissues, or organs from different organisms and use these "tags" to fish a gene out of a portion of chromosomal DNA by matching base pairs. The challenge associated with identifying genes from genomic sequences varies among organisms and is dependent upon genome size as well as the presence or absence of introns, the intervening DNA sequences interrupting the protein coding sequence of a gene (NCBI).

Pros
- enable gene discovery
- complement genome annotation
- aid gene structure identification
- establish viability of alternative transcripts
- guide SNP characterization
- facilitate proteome analysis
- can be used with methylation filtration and high C0t selection to examine gene pools
- low cost
- mine UTRs
- poly(A) tails can distinguish untranslated mRNA from productive transcripts, leading to protein isoforms

Cons
- “poor man’s genome”: subject to sampling bias, resulting in underrepresentation of rare transcripts (only 60% of an organism's genes)
- error in choosing the right tool for each step of EST analysis
- only a short copy of the mRNA, so the sequence is error prone at the ends (vector contamination)
- usually only sequenced once
- redundancy & under/over representation of transcripts due to variable protocols in EST generation
- sequencing artifacts: base calling errors, stuttering, low quality sequences
- SNPs – artifacts or natural?
- Multiple clustering programs (loose/stringent)

How are ESTs Generated?
mRNA (expressed genes) → reverse transcriptase → cDNA (double stranded) → cloned → sequenced (single pass or full length)
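
The pipeline above can be mimicked in a few lines. This is a rough sketch, assuming first-strand cDNA is simply the reverse complement of the mRNA and that a single-pass read is just the first `read_len` bases from one end of the double-stranded cDNA. The function names and the default read length are made up for illustration; real ESTs also carry vector sequence and read errors.

```python
# RNA base -> the DNA base that pairs with it
RNA_TO_CDNA = str.maketrans("ACGU", "TGCA")


def first_strand_cdna(mrna: str) -> str:
    """First-strand cDNA (written 5'->3') for an mRNA written 5'->3'."""
    return mrna.upper().translate(RNA_TO_CDNA)[::-1]


def est_reads(mrna: str, read_len: int = 300):
    """Return (5' EST, 3' EST) as single-pass reads of a double-stranded cDNA.

    The 5' read comes from the sense strand; the 3' read comes from the
    antisense (first) strand, so it starts at the poly(A) end.
    """
    sense = mrna.upper().replace("U", "T")
    antisense = first_strand_cdna(mrna)
    return sense[:read_len], antisense[:read_len]


if __name__ == "__main__":
    five_prime, three_prime = est_reads("AUGGCCAAAA", read_len=4)
    print(five_prime, three_prime)
```
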
Data Sources:

NCBI – dbEST, UniGene (gene-oriented clusters of transcript sequences)
TIGR
Geneious: an integrated, cross-platform bioinformatics software suite for manipulating, finding, sharing, and exploring biological data such as DNA sequences or proteins, phylogenies, 3D structure information, publications, etc. It features sequence alignment and phylogenetic analysis, contig assembly, primer design and cloning, access to NCBI and UniProt, BLAST, protein structure viewing, and automated PubMed searching. This program ROCKS! An all in one stop and a great way to organize and share your data.

Pre-processing
- Vector screening (UniVec, VecScreen at NCBI): UniVec is a database of vector sequences, and VecScreen is a tool for identifying segments of a nucleic acid sequence that may be of vector, linker, or adapter origin prior to sequence analysis or submission.
- Cross-match: general purpose utility for comparing any two DNA sequence sets. It can be used to compare a set of reads to a set of vector sequences and produce vector-masked versions of the reads. It is slower but more sensitive than BLAST.
- DUST/RepeatMasker/MaskerAid: programs for filtering low complexity regions from nucleic acid sequences.
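
To get a feel for what low-complexity masking does, here is a minimal sketch that replaces windows with low Shannon entropy by N. This is not the DUST or RepeatMasker algorithm, only a simple stand-in for the same idea; the window size and entropy cutoff are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (bits per base) of the base composition of `window`."""
    total = len(window)
    counts = Counter(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq: str, window: int = 12, min_entropy: float = 1.0) -> str:
    """Replace every window whose entropy is below `min_entropy` with N."""
    seq = seq.upper()
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = "N"
    return "".join(masked)


if __name__ == "__main__":
    read = "ACGTACGTACGT" + "A" * 12 + "ACGTACGTACGT"
    print(mask_low_complexity(read))
```

Because windows overlap, masking spreads a few bases into the flanking sequence; real tools tune this much more carefully.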

Clustering & Assembly (to reduce redundancy)
- PHRAP: program for assembling shotgun DNA sequence data. It allows use of the entire read and not just the trimmed high quality part; uses a combination of user-supplied and internally computed data quality information to improve assembly accuracy in the presence of repeats; constructs the contig sequence as a mosaic of the highest quality read segments rather than a consensus; provides extensive assembly information to assist in troubleshooting assembly problems; and handles large datasets.
- CAP3: A DNA sequence assembly program. The program has a capability to clip 5' and 3' low-quality regions of reads. It uses base quality values in computation of overlaps between reads, construction of multiple sequence alignments of reads, and generation of consensus sequences. The program also uses forward-reverse constraints to correct assembly errors and link contigs.
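
The core idea behind assembly (merge reads that overlap, so redundant ESTs collapse into contigs) can be sketched with a toy greedy assembler. This is not how PHRAP or CAP3 work internally: it assumes error-free reads on the same strand and ignores base quality values, mismatches, and forward-reverse constraints. The function names are made up for this example.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

The three toy reads collapse into a single contig; reads with no overlap of at least `min_len` bases are left as singletons.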

Database Similarity Searching
- BLAST - Basic Local Alignment Search Tool for comparing gene and protein sequences against others in public databases. Comparisons are made in a pairwise fashion. Each comparison is given a score reflecting the degree of similarity between the query and the sequence being compared. The higher the score, the greater the degree of similarity.
  - blastn : compares a nucleotide query sequence against a nucleotide sequence database
  - blastx : compares a nucleotide query sequence translated in all reading frames against a protein sequence database
  - rpsblast: program that searches a query protein sequence or protein sequences against a database of position specific scoring matrices (PSSMs, profiles, or more commonly known as conserved domains) to identify the ones the query is similar to.
  - MegaBLAST: uses a greedy algorithm for the nucleotide sequence alignment search. Optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". When a larger word size (the minimal length of an identical match an alignment must contain if it is to be found by the algorithm) is used, it is up to 10 times faster than more common sequence similarity programs. MegaBLAST is also able to efficiently handle much longer DNA sequences than blastn.
- Conserved Domain search
- COGs (Clusters of Orthologous Groups): Phylogenetic classification of proteins encoded in complete genomes.
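
The "word size" mentioned for MegaBLAST is easiest to see in the seeding step: only positions where query and subject share an identical word are considered for alignment. A minimal sketch of that step alone, assuming exact word matches and no extension or scoring; this is a simplification and not the BLAST implementation, and the default of 11 is used here only as an illustrative value.

```python
from collections import defaultdict


def word_index(subject: str, word_size: int):
    """Map every word of length `word_size` in `subject` to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - word_size + 1):
        index[subject[pos:pos + word_size]].append(pos)
    return index


def seed_hits(query: str, subject: str, word_size: int = 11):
    """Return (query_pos, subject_pos) pairs where an identical word occurs."""
    index = word_index(subject, word_size)
    hits = []
    for qpos in range(len(query) - word_size + 1):
        for spos in index.get(query[qpos:qpos + word_size], []):
            hits.append((qpos, spos))
    return hits


if __name__ == "__main__":
    print(seed_hits("ACGTAC", "TTACGTACTT", word_size=4))
    print(seed_hits("ACGTAC", "TTACGTACTT", word_size=6))
```

A larger word size produces fewer seeds, which is why it is faster but can miss more divergent matches.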

Translating & Functional Annotation
- ORF Finder - identifies all possible ORFs in a DNA sequence by locating the standard and alternative stop and start codons. The deduced amino acid sequences can then be used to BLAST against GenBank.
- ESTScan: can detect coding regions in DNA sequences, even if they are of low quality. It will also detect and correct sequencing errors that lead to frameshifts. ESTScan is not a gene prediction program nor is it an open reading frame detector. In fact, its strength lies in the fact that it does not require an open reading frame to detect a coding region. As a result, the program may miss a few translated amino acids at either the N or the C terminus, but will detect coding regions with high selectivity and sensitivity.
- Spidey - aligns one or more mRNA sequences to a single genomic sequence. Spidey will try to determine the exon/intron structure, returning one or more models of the genomic structure, including the genomic/mRNA alignments for each exon.
- Splign - a utility for computing cDNA-to-genomic alignments based on a variation of the Needleman-Wunsch algorithm combined with BLAST for compartment detection and greater performance.
- UniGene DDD - Digital Differential Display - an online tool to compare computed gene expression profiles between selected cDNA libraries. Using a statistical test, genes whose expression levels differ significantly from one tissue to the next are identified and shown to the user.
- Blast2GO: ALL in ONE tool for functional annotation of (novel) sequences and the analysis of annotation data. Blast against public or private databases, map against GO resources to fetch functional data, and annotate to generate reliable functional assignments.
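
What an ORF search does can be sketched as a six-frame scan. This is a simplified illustration, not the ORF Finder program: it assumes the standard genetic code, accepts only ATG as a start codon (ORF Finder also handles alternative starts), and reports only ORFs that end in a stop codon. `find_orfs` and `min_codons` are names invented for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(seq: str, min_codons: int = 2):
    """Return (strand, frame, protein) for each ATG...stop ORF in six frames."""
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            protein = []
            for i in range(frame, len(s) - 2, 3):
                aa = CODON_TABLE.get(s[i:i + 3], "X")  # X for ambiguous codons
                if aa == "*":
                    if len(protein) >= min_codons:
                        orfs.append((strand, frame, "".join(protein)))
                    protein = []
                elif protein or aa == "M":
                    protein.append(aa)  # start collecting at the first ATG
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGGCTTAAGG"))
```

The returned protein strings are what one would then use as queries in a protein BLAST search.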

High Performance Computing

Chris Dwan speaks on genomics and high performance computing at Grey Thumb Boston on September 7th, 2007.