Bioinformatics Dr. Víctor Treviño [email protected] A7-421 Ext -4536+103 BT4007
Blast and Alignments
PAPER PRESENTATIONS IN MARCH
Find a research article related to your project that has a strong bioinformatics component. For example:
Generation of a database
Development of a program or service
Discovery of genes, metabolic pathways, etc. by means of, or with the help of, bioinformatics methods
Propose the paper to the professor and get confirmation
Study the paper
Prepare the presentation
Present it in class: 15 minutes (10 minutes of presentation + 5 of questions)
Presentations are evaluated by the professor and the students using a rubric that grades elements such as: Topic, Intro, Methods, Results, Discussion, Critique, Voice, Clarity, Confidence, Knowledge, Answers, Time
PAPERS FOR NEXT SESSION
SEQUENCE SIMILARITY
Sequences are similar because they are derived from a common ancestor
This will most often be the result of duplication events
Similarity will then depend on divergence times
General rule: 25% identity in a 100 aa sequence is good evidence of common ancestry
Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
SEQUENCE SIMILARITY
Within a protein sequence, some regions will be more conserved than others
The more conserved a region is, the more important it is likely to be:
for function
for 3D structure
for localization
for modification
for interaction
for regulation/control
for transcriptional regulation (in DNA)
REASONS TO PERFORM SEQUENCE SIMILARITY SEARCHES
SEQUENCE SIMILARITY - TERMS
Homologous: similar due to common ancestry
Analogous: similar due to convergent evolution
Orthologous: homologous with conserved function (by speciation in separated species)
Paralogous: homologous with different function (commonly within the same species)
Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
SEQUENCE SIMILARITY - TERMS
Xenologous: homologous due to horizontal gene transfer
HGT (horizontal gene transfer): transfer of genetic material to an organism that is not its offspring
VGT (vertical gene transfer): transfer of genetic material from an ancestor to its offspring (mitosis) [VGT is not related to xenologous]
Ohnologous: paralogous genes that originated by whole genome duplication
Gametologous: homologous genes on non-recombining opposite sex chromosomes
Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI Wikipedia
SEQUENCE SIMILARITY – EVOLUTIONARY RELATIONSHIP
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
SEQUENCE SIMILARITY – ORIGINS OF GENES
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
a1 & a2 are Paralogous
a1-S1 and a1-S2 are Orthologous; a2-S1 and a2-S2 are Orthologous
Analogous Genes – Same Function, Different Origin
Xenologous
SEQUENCE SIMILARITY – TYPES OF MODIFICATION
Mutations occur during evolution by:
Original (affected region: GTG): …ACCAGTGTGCCGTACA…
Insertions: …ACCAGTaGTGCCGTACA…
Deletions: …ACCAGTCCGTACA…
Substitutions: …ACCAGTGCGCCGTACA…
SIMILARITY AND DISTANCE BETWEEN SEQUENCES
SIMILARITY is the maximal SUM of WEIGHTS for the conserved residues
More useful for database searching
DISTANCE is the minimal SUM of WEIGHTS for a set of mutations transforming one sequence into the other
More useful for phylogenetic tree reconstruction
Both are opposite and interconvertible concepts
WEIGHT accounts for the different roles of mutation events, AA residue similarity, etc.
e.g. synonymous mutations are weighted differently than nonsense mutations
Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
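The DISTANCE definition above (minimal sum of weights over the mutations that turn one sequence into the other) can be illustrated with a small dynamic programming sketch. This is only an illustration under assumed unit weights: the function name `weighted_edit_distance` and the parameters `w_sub` and `w_indel` are hypothetical, not taken from the slides. The test sequences are the insertion, deletion and substitution examples from the TYPES OF MODIFICATION slide.

```python
def weighted_edit_distance(a, b, w_sub=1, w_indel=1):
    """Minimal total weight of substitutions and single-residue indels turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * w_indel for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * w_indel]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else w_sub
            cur.append(min(prev[j - 1] + cost,   # match or substitution
                           prev[j] + w_indel,    # residue of a deleted
                           cur[j - 1] + w_indel)) # residue of b inserted
        prev = cur
    return prev[-1]


original = "ACCAGTGTGCCGTACA"
print(weighted_edit_distance(original, "ACCAGTAGTGCCGTACA"))  # insertion of one A
print(weighted_edit_distance(original, "ACCAGTCCGTACA"))      # deletion of GTG
print(weighted_edit_distance(original, "ACCAGTGCGCCGTACA"))   # substitution T -> C
```

With unit weights the three calls give 1, 3 and 1. Changing `w_sub` or `w_indel` is one simple way to express that different mutation events carry different weights.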
SEQUENCE ALIGNMENT
Procedure for comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for similar patterns that are in the same order in the sequences
Identical residues (nt or aa) are placed in the same column
Non-identical residues can be placed in the same column or indicated as gaps
Wikipedia, http://www-personal.umich.edu/~lpt/fgf/fgfrcomp.htm
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
Overall similitude
SEQUENCE ALIGNMENT
GLOBAL - Procedure applied to the entire sequence to include as many matches as possible up to the end of the sequence
Methods
Brute Force – impractical
Dot Matrix – graphical, easy to understand
Dynamic Programming – the most accurate
Heuristic Methods – fast, not so accurate
Word k-tuple – Database Searching – BLAST
Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI Wikipedia
GLOBAL AND LOCAL ALIGNMENTS
Proteins are MODULAR
Patterns formed by exchange of whole EXONS
Example:
F12: Coagulation Factor XII
PLAT: Tissue-type plasminogen activator
A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.
F1/2 - Fibronectins; E - Epidermal Growth Factors; K - "Kringle" domain
GLOBAL ALIGNMENT METHODS DO NOT CONSIDER THESE ISSUES
LOCAL ALIGNMENT
GLOBAL AND LOCAL ALIGNMENTS
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
LOCAL ALIGNMENT
Alignment stops at the end of regions of identity or strong similarity
Much higher priority is given to finding these local regions than to extending the alignment
A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.
DOT-MATRIX METHOD
Primary method for comparing sequences
Provides a global and local overview of similarity
Useful for direct or inverted repeats
Useful for self-complementary RNA regions
Programs: DNA Strider, DOTTER, GCG-DOTPLOT, DOTLET
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
http://myhits.isb-sib.ch/cgi-bin/dotlet
DOT-MATRIX METHOD
Align the aa sequences "DOROTHYHODGKIN" vs "DOROTHYCROWFOOTHODGKIN"
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
DOT-MATRIX METHOD – EX 1
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
WINDOW SIZE = 11
STRINGENCY = 7 (how many residues in the window must be identical)
…ACCAGTGTGCCGTACA…
window
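The window and stringency parameters can be sketched in a few lines. This is a minimal illustration, not the algorithm of any particular dot-plot program: the function names `dot_matrix` and `render` are hypothetical, and a dot is placed at the start position of every window pair that contains at least `stringency` identical residues.

```python
def dot_matrix(seq_a, seq_b, window=1, stringency=1):
    """Return the set of (i, j) window starts where at least `stringency`
    of the `window` paired residues are identical."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots


def render(seq_a, seq_b, dots):
    """Draw the matrix as text: seq_b across the top, seq_a down the side."""
    lines = ["  " + seq_b]
    for i, residue in enumerate(seq_a):
        row = "".join("*" if (i, j) in dots else "." for j in range(len(seq_b)))
        lines.append(residue + " " + row)
    return "\n".join(lines)


a, b = "DOROTHYHODGKIN", "DOROTHYCROWFOOTHODGKIN"
print(render(a, b, dot_matrix(a, b)))                          # every identity is a dot
print(render(a, b, dot_matrix(a, b, window=3, stringency=3)))  # less background noise
```

With window 1 the two shared segments (DOROTHY and HODGKIN) show up as diagonals among scattered single dots. Increasing the window and stringency removes most of the scattered dots, which is the reason DNA comparisons typically use larger values such as the 11 and 7 shown above.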
DOT-MATRIX METHOD – EX 2
A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.
DOT-MATRIX METHOD – EX 3 -REPEATS
Figure 3.6. Dot matrix analysis of the human LDL receptor against itself using DNA Strider, vers. 1.3, on a Macintosh
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
DOT-MATRIX METHOD – PROGRAMS
(you could use PubMed also)
Bioinformatics for Dummies – Claviere – Notredame – Wiley - 2nd Ed. 2007
DOT-MATRIX EXAMPLES
http://hits.isb-sib.ch/util/dotlet/doc/dotlet_examples.html
http://myhits.isb-sib.ch/cgi-bin/dotlet
DYNAMIC PROGRAMMING METHOD
Provides the very best or optimal alignment in a very reasonable amount of time
Several parameters though
Global: Needleman-Wunsch
Local: Smith-Waterman
Provides a p-value of obtaining the alignment by chance from unrelated sequences
There is a method for statistical significance
Results depend on the scoring system
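To make the local (Smith-Waterman) variant concrete, here is a minimal score-only sketch. The scoring values (match +2, mismatch -1, gap -1) and the name `local_alignment_score` are assumptions chosen for illustration. What distinguishes the local method is that every cell is floored at zero, so an alignment can start and stop anywhere in either sequence.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score with a linear gap penalty (score only, no traceback)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for residue_a in a:
        cur = [0]
        for j, residue_b in enumerate(b, start=1):
            s = match if residue_a == residue_b else mismatch
            cur.append(max(0,                 # start a new local alignment here
                           prev[j - 1] + s,   # extend along the diagonal
                           prev[j] + gap,     # gap in b
                           cur[j - 1] + gap)) # gap in a
        best = max(best, max(cur))
        prev = cur
    return best


# The shared segment ACGT scores 4 * 2 = 8; the unrelated flanks are ignored.
print(local_alignment_score("TTTACGTTT", "GGACGTGG"))
```

A global (Needleman-Wunsch) version differs mainly in dropping the zero floor and reading the score from the last cell rather than taking the maximum over all cells.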
DYN. PROG. METHOD - SCORING
Results depend on the scoring system – SCORING MATRICES
Depending on: Pair-wise, Gap Penalties
DNA alignments require a similar scoring system
DYNAMIC PROGRAMMING METHOD
i
j
x, y are the "radius"
Gap penalties from the scoring matrix
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
DYNAMIC PROGRAMMING EXAMPLE
      gap     A       C       G       G       A       T       A       T
gap   0       -1      -1      -1      -1      -1      -1      -1      -1
G     -1      0       -1      (d)+1   (d)+1   (l)0    (ld)-1  (d)-1   (d)-1
G     -1      -1      0       (d)+1   (d)+3   (l)+2   (l)+1   (l)0    (ld)-1
C     -1      -1      (d)+1   (ldu)0  (u)+2   (d)+3   (ld)+2  (ld)+1  (ld)0
T     -1      (d)-1   (u)0    (d)+1   (u)+1   (ud)+2  (d)+5   (l)+4   (ld)+3
A     -1      (d)+1   (l)0    (d)0    (d)+1   (d)+3   (u)+4   (d)+7   (l)+6
Worked cells: first G row, column A: Max(0,-2,-2) = 0; first G row, column C: Max(-1,-2,-1) = -1; second G row, column A: Max(-1,-1,-2) = -1
X = 1, Y = 1
Gap: W(x=1) = 1, W(x=2) = 1, …
Gap: W(y=1) = 1, …
s(a,b) = 2, if a = b
s(a,b) = 0, if a <> b
Resulting alignment:
ACGGATAT
--GGCTA-
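The worked example can be reproduced with a short sketch. It rests on assumptions read off the example rather than stated by the slides as an algorithm: each cell takes the maximum over its diagonal, upper and left neighbours only (radius X = Y = 1), the first row and column are fixed at -1, s(a,b) is 2 for a match and 0 for a mismatch, and each gap step costs 1. The function name `global_alignment` and its parameters are hypothetical.

```python
def global_alignment(top, left, match=2, mismatch=0, gap=1, edge=-1):
    """Fill the score matrix (rows = left, columns = top) and trace back one alignment."""
    n_rows, n_cols = len(left), len(top)
    S = [[0] * (n_cols + 1) for _ in range(n_rows + 1)]
    for j in range(1, n_cols + 1):
        S[0][j] = edge          # first row of the example is all -1
    for i in range(1, n_rows + 1):
        S[i][0] = edge          # first column of the example is all -1

    def s(i, j):
        return match if left[i - 1] == top[j - 1] else mismatch

    for i in range(1, n_rows + 1):
        for j in range(1, n_cols + 1):
            S[i][j] = max(S[i - 1][j - 1] + s(i, j),  # (d) diagonal
                          S[i - 1][j] - gap,          # (u) up
                          S[i][j - 1] - gap)          # (l) left

    # Traceback from the bottom-right cell, preferring the diagonal on ties.
    top_aln, left_aln = [], []
    i, j = n_rows, n_cols
    while i > 0 or j > 0:
        if i > 0 and j > 0 and S[i][j] == S[i - 1][j - 1] + s(i, j):
            top_aln.append(top[j - 1]); left_aln.append(left[i - 1])
            i -= 1; j -= 1
        elif i > 0 and (j == 0 or S[i][j] == S[i - 1][j] - gap):
            top_aln.append("-"); left_aln.append(left[i - 1])
            i -= 1
        else:
            top_aln.append(top[j - 1]); left_aln.append("-")
            j -= 1
    return S, "".join(reversed(top_aln)), "".join(reversed(left_aln))


S, top_aln, left_aln = global_alignment("ACGGATAT", "GGCTA")
print(S[-1][-1])   # score in the bottom-right cell: 6
print(top_aln)     # ACGGATAT
print(left_aln)    # --GGCTA-
```

In the table, (d), (l) and (u) mark which neighbour (diagonal, left, up) supplied the maximum; cells with several letters are ties. The sketch breaks ties in favour of the diagonal, so other equally scoring alignments are possible with a different tie rule.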
DYN. PROG. METHOD - SCORING
Results depend on the scoring system – SCORING MATRICES
Depending on: Pair-wise, Gap Penalties
The Dayhoff PAM (point accepted mutation) matrix is based on an evolutionary model for proteins
One PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed in very similar sequences
BLOSUM matrices are designed to identify members of the same family
Derived from the BLOCKS database (for distant sequences; BLOcks SUbstitution Matrix)
DYNAMIC PROGRAMMING - SCORING
Remember the "SUM OF WEIGHTS" for similarity/distance
PAM250 is the PAM1 matrix applied 250 times (PAM1 raised to the 250th power)
BLOSUM62: sequences at least 62% identical are merged into one. BLOSUM90 for comparing more similar sequences. BLOSUM30 for very different ones.
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
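A PAM-n matrix is obtained by applying the PAM1 mutation probabilities n times, which is a matrix power rather than a multiplication by n. The sketch below uses a made-up two-residue alphabet (`toy_pam1`, purely hypothetical; the real PAM1 is a 20 x 20 amino acid matrix) so the arithmetic is easy to follow. The helper names `mat_mul` and `mat_power` are also illustrative.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, n):
    """Apply the mutation probability matrix m n times in a row."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Toy "PAM1": each residue has a 1% chance of changing per PAM unit.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(p, 4) for p in row])
```

Each row still sums to 1 after 250 steps, whereas multiplying the 1% change by 250 would give an impossible probability of 2.5. In the toy case roughly half of the positions end up in their original state, because repeated and back mutations at the same position accumulate.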
DYNAMIC PROGRAMMING METHOD
Some programs provide alternative alignments, depending on the goal:
domains
structural
same family
biological function
common ancestor
There are several variations of the original Needleman-Wunsch and Smith-Waterman methods that improve memory usage, CPU time, and other features
DYNAMIC PROGRAMMING METHOD - OUTPUT
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE
To assign a p-value, we could "shuffle" both sequences 100,000 times
The proportion of times we obtain SCORES larger than the score obtained for the real sequences represents the p-value
Another, quicker method is converting the alignment to BINARY sequences (match or no match)
e.g. the probability of obtaining HTHTHHHH in a coin toss experiment
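The shuffling procedure can be sketched as follows. The names `shuffle_p_value` and `identity_score` are hypothetical, and `identity_score` is a deliberately simple stand-in (number of identical positions, no gaps) for whatever alignment scoring function is actually used. The default of 1,000 shuffles only keeps the example fast; the slide suggests 100,000.

```python
import random


def identity_score(a, b):
    """Stand-in score: identical residues at the same position (ungapped)."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_p_value(a, b, score_fn=identity_score, n_shuffles=1000, seed=0):
    """Fraction of shuffled sequence pairs scoring higher than the real pair."""
    rng = random.Random(seed)       # fixed seed so the run is repeatable
    observed = score_fn(a, b)
    higher = 0
    for _ in range(n_shuffles):
        shuffled_a, shuffled_b = list(a), list(b)
        rng.shuffle(shuffled_a)     # shuffling keeps length and composition
        rng.shuffle(shuffled_b)
        if score_fn(shuffled_a, shuffled_b) > observed:
            higher += 1
    return higher / n_shuffles


print(shuffle_p_value("ACGTACGTACGTACGTACGT", "ACGTACGTACGTACGTACGT"))
print(shuffle_p_value("ACGTACGTACGTACGTACGT", "TGCATGCATGCATGCATGCA"))
```

Identical sequences already have the maximum possible score, so no shuffle can exceed it and the first estimate is 0.0. With n shuffles the smallest non-zero value that can be reported is 1/n, which is why a large number of shuffles is needed for small p-values.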
DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE
Two random sequences of lengths m and n, and p = probability of a match
Expected length of the longest match = log1/p(mn)
DNA sequences of length 100, p = 0.25 (equal nt frequencies): the longest match = 2 x log4(100) ≈ 6.64
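The arithmetic of this example can be checked directly. The function name `expected_longest_match` is illustrative; it simply evaluates the logarithm of mn in base 1/p.

```python
import math


def expected_longest_match(m, n, p):
    """Expected longest run of matches between two random sequences: log base 1/p of m*n."""
    return math.log(m * n, 1 / p)


# Two DNA sequences of length 100 with equal nucleotide frequencies (p = 0.25).
print(round(expected_longest_match(100, 100, 0.25), 2))  # 6.64
```

Since mn = 100 x 100, the result equals 2 x log4(100), so by this estimate a run of about 6 or 7 consecutive matches is what chance alone would produce for sequences of this size.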
More precise formula
DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE
Simplifying
k = mismatches, m and n are sequence lengths
Effective length = n – E(m) (used in BLAST)
(mean of the highest possible local alignment score)
ALIGNMENT PROCEDURE OVERVIEW
WORD K-TUPLE METHOD - BLAST
Search a database for sequences that share at least W identical residues
For a sequence of length L, the number of "internal searches" is L-W+1
All "potential" sequences are then "extended" using the Dynamic Programming Method
A statistical significance score is estimated, representing the number of similar sequences expected in the database by chance (E value, equivalent to a p-value for the entire database)
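The word-lookup stage can be sketched as below. This is a simplified illustration of the idea only (exact word matches, no neighbourhood words, no extension step), not BLAST's actual implementation; the function names and the tiny `database` dictionary are hypothetical.

```python
def query_words(seq, w):
    """All overlapping words of length w: a sequence of length L yields L - w + 1."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def build_index(database, w):
    """Map every word of length w to the (sequence name, position) pairs where it occurs."""
    index = {}
    for name, seq in database.items():
        for pos, word in enumerate(query_words(seq, w)):
            index.setdefault(word, []).append((name, pos))
    return index


def seed_hits(query, index, w):
    """Exact word hits as (sequence name, query position, database position)."""
    hits = []
    for q_pos, word in enumerate(query_words(query, w)):
        for name, d_pos in index.get(word, []):
            hits.append((name, q_pos, d_pos))
    return hits


database = {"s1": "TTTACCAGTTT", "s2": "GGGGGGGG"}   # made-up sequences
index = build_index(database, w=4)
print(len(query_words("ACCAGTGTGCCGTACA", 11)))      # 16 - 11 + 1 = 6
print(seed_hits("ACCAGT", index, w=4))
```

Only s1 produces hits here, at three consecutive positions along the same diagonal. In the full method, such seeds are the candidates that get extended with dynamic programming and then scored for significance.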
BLAST
Pi – random residue probability
Sij – from the score matrix
Score S = sum(Pi Pj Sij)
Transformation, for statistical comparisons, expressed in bits: S' = (λS − ln K) / ln 2
Expected number of matches with a score of at least S': E = mn · 2^(−S')
Lengths: query = m, database = n
Example: m = 250, n = 50,000,000; to achieve E = 0.05
S' = 38 bits
S = [(38 * ln 2) + ln K] / λ
S = 76.6
(for the ungapped version: λu = 0.3176 and Ku = 0.134)
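The numbers in this example can be reproduced with the standard bit-score relations S' = (λS − ln K) / ln 2 and E = mn · 2^(−S'), using the ungapped λ and K values quoted above. The function names are illustrative only.

```python
import math


def bits_needed(e_value, m, n):
    """Bit score S' at which the expected number of chance matches equals e_value."""
    return math.log2(m * n / e_value)


def raw_from_bits(bits, lam, k):
    """Invert S' = (lam * S - ln K) / ln 2 to recover the raw score S."""
    return (bits * math.log(2) + math.log(k)) / lam


def expected_matches(bits, m, n):
    """E = m * n * 2 ** (-S')."""
    return m * n * 2 ** (-bits)


m, n = 250, 50_000_000
print(round(bits_needed(0.05, m, n), 2))             # 37.86, rounded up to 38 bits
print(round(raw_from_bits(38, 0.3176, 0.134), 1))    # 76.6
print(round(expected_matches(38, m, n), 3))          # 0.045, just under 0.05
```

Rounding up to 38 bits makes the threshold slightly stricter than E = 0.05, which is why the expected number of chance matches at 38 bits comes out a little below the target.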