Bioinformatics Dr. Víctor Treviño vtrevino@itesm.mx A7-421 Ext -4536+103 BT4007

Blast and Alignments


PAPER PRESENTATIONS IN MARCH
Find a research article related to your project that has a strong bioinformatics component. For example:
  Generation of a database
  Development of a program or service
  Discovery of genes/metabolic pathways/etc. by means of, or with the help of, bioinformatics methods
Propose the paper to the professor and confirm
Study the paper
Prepare the presentation
Present it in class: 15 minutes (10 minutes of presentation + 5 of questions)
Presentations are evaluated by the professor and the students, using a rubric that grades elements such as: Topic, Intro, Methods, Results, Discussion, Critique, Voice, Clarity, Confidence, Knowledge, Answers, Time


PAPERS FOR NEXT SESSION


SEQUENCE SIMILARITY
Sequences are similar because they are derived from a common ancestor
This will most often be the result of duplication events
Similarity will then depend on divergence times
General rule: 25% identity over a 100 aa sequence is good evidence of common ancestry

Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
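As a rough illustration of the 25% rule, percent identity can be computed over two sequences that are already aligned. This is a minimal sketch: the function name `percent_identity`, the toy sequences, and the convention of writing gaps as "-" are assumptions made for illustration, not part of the slides.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of ungapped alignment columns that hold identical residues."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where neither sequence has a gap.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for x, y in columns if x == y)
    return 100.0 * identical / len(columns)


# Hypothetical aligned pair: 7 ungapped columns, 6 of them identical.
print(round(percent_identity("DOROTHY-", "DOROTHEA"), 1))  # 85.7
```

Other definitions divide by the full alignment length or by the shorter sequence; the denominator used here is only one possible choice.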


SEQUENCE SIMILARITY
Within a protein sequence, some regions will be more conserved than others
The more conserved a region is, the more important it is:
  for function
  for 3D structure
  for localization
  for modification
  for interaction
  for regulation/control
  for transcriptional regulation (in DNA)

REASONS TO PERFORM SEQUENCE SIMILARITY SEARCHES


SEQUENCE SIMILARITY - TERMS
Homologous: similar due to common ancestry
Analogous: similar due to convergent evolution
Orthologous: homologous with conserved function (by speciation in separated species)
Paralogous: homologous with different function (commonly within the same species)

Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI


SEQUENCE SIMILARITY - TERMS
Xenologous: homologous due to horizontal gene transfer (HGT)
  HGT: transfer of genetic material to an organism that is not its offspring
  VGT: transfer of genetic material from its ancestor (mitosis) [VGT is not related to xenologous]
Ohnologous: paralogous genes that originated by whole genome duplication
Gametologous: homologous genes on non-recombining opposite sex chromosomes

Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI Wikipedia


SEQUENCE SIMILARITY – EVOLUTIONARY RELATIONSHIP

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

SEQUENCE SIMILARITY – ORIGINS OF GENES

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

a1 & a2 are Paralogous
a1-S1 and a1-S2 are Orthologous
a2-S1 and a2-S2 are Orthologous
Analogous Genes – Same Function, Different Origin
Xenologous


SEQUENCE SIMILARITY – TYPES OF MODIFICATION
Mutations occur during evolution by:
  Original:      …ACCAGTGTGCCGTACA…
  Insertions:    …ACCAGTaGTGCCGTACA…
  Deletions:     …ACCAGTCCGTACA… (GTG deleted)
  Substitutions: …ACCAGTGCGCCGTACA…
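The three mutation types on this slide can be reproduced with plain string slicing. A minimal sketch: the helper names `insert`, `delete` and `substitute` and the 0-based positions are assumptions chosen so the outputs match the sequences shown on the slide.

```python
def insert(seq: str, pos: int, residues: str) -> str:
    """Insert `residues` before 0-based position `pos`."""
    return seq[:pos] + residues + seq[pos:]


def delete(seq: str, pos: int, length: int) -> str:
    """Remove `length` residues starting at 0-based position `pos`."""
    return seq[:pos] + seq[pos + length:]


def substitute(seq: str, pos: int, residue: str) -> str:
    """Replace the residue at 0-based position `pos`."""
    return seq[:pos] + residue + seq[pos + 1:]


original = "ACCAGTGTGCCGTACA"
print(insert(original, 6, "a"))      # ACCAGTaGTGCCGTACA
print(delete(original, 6, 3))        # ACCAGTCCGTACA
print(substitute(original, 7, "C"))  # ACCAGTGCGCCGTACA
```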


SIMILARITY AND DISTANCE BETWEEN SEQUENCES
SIMILARITY is the maximal SUM of WEIGHTS for the conserved residues
  More useful for phylogenetic tree reconstruction
DISTANCE is the minimal SUM of WEIGHTS for a set of mutations transforming one sequence into the other
  More useful for database searching
The two are opposite and interconvertible concepts
WEIGHT accounts for the different roles of mutation events, AA residue similarity, etc.
  e.g. synonymous mutations are weighted differently from nonsense mutations

Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI
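The "minimal sum of weights" definition of distance can be sketched as a weighted edit distance. The function name `weighted_distance` and the default weights (1 per substitution, 1 per inserted or deleted residue) are assumptions for illustration; the slide does not fix particular weights.

```python
def weighted_distance(a: str, b: str, sub_w: float = 1.0, gap_w: float = 1.0) -> float:
    """Minimal sum of weights of the mutations that transform a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * gap_w for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap_w]
        for j in range(1, len(b) + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else sub_w
            curr.append(min(prev[j - 1] + cost,     # match or substitution
                            prev[j] + gap_w,        # residue deleted from a
                            curr[j - 1] + gap_w))   # residue inserted into a
        prev = curr
    return prev[-1]


original = "ACCAGTGTGCCGTACA"
print(weighted_distance(original, "ACCAGTCCGTACA"))     # 3.0 (three deleted residues)
print(weighted_distance(original, "ACCAGTGCGCCGTACA"))  # 1.0 (one substitution)
```

Raising `sub_w` or `gap_w` changes which set of mutations is cheapest, which is the role the slide assigns to WEIGHT.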


SEQUENCE ALIGNMENT
Procedure for comparing two (pair-wise alignment) or more (multiple sequence alignment) sequences by searching for similar patterns that are in the same order in the sequences
Identical residues (nt or aa) are placed in the same column
Non-identical residues can be placed in the same column or indicated as gaps

Wikipedia, http://www-personal.umich.edu/~lpt/fgf/fgfrcomp.htm
Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

[Figure label: Overall similitude]


SEQUENCE ALIGNMENT
GLOBAL - Procedure applied to the entire sequence to include as many matches as possible up to the end of the sequence
Methods
  Brute Force – impractical
  Dot Matrix – graphical, easy to understand
  Dynamic Programming – the most accurate
  Heuristic Methods – fast, not so accurate
  Word k-tuple – Database Searching – BLAST

Bioinformatics - Methods and Applications – Genomics, Proteomics and Drug Discovery – Rastogi – Mendiratta - PHI Wikipedia


GLOBAL AND LOCAL ALIGNMENTS
Proteins are MODULAR
Patterns formed by exchange of whole EXONS
Example:
  F12: Coagulation Factor XII
  PLAT: Tissue-type plasminogen activator

A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.

F1/2 - Fibronectins
E - Epidermal Growth Factors
K - "Kringle" domain

GLOBAL ALIGNMENT METHODS DO NOT CONSIDER THESE ISSUES

LOCAL ALIGNMENT


GLOBAL AND LOCAL ALIGNMENTS

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


LOCAL ALIGNMENT
Alignment stops at the end of regions of identity or strong similarity
Much higher priority is given to finding these local regions than to extending the alignment

A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.
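The idea that a local alignment stops where similarity ends can be sketched with the Smith-Waterman recursion, in which a cell score is never allowed to drop below zero. The function name `local_alignment_score` and the scoring values (match +2, mismatch -1, gap -1) are assumptions for illustration only.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score between a and b (score only, no traceback)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The 0 restarts the alignment whenever extending it would score below zero.
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# The shared region ACCAGT is found regardless of the unrelated flanking residues.
print(local_alignment_score("ACCAGT", "TTTACCAGTTTT"))  # 12
```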


DOT-MATRIX METHOD
Primary method for comparing sequences
Provides a global and local overview of similarity
Useful for direct or inverted repeats
Useful for self-complementary RNA regions
Programs: DNA Strider, DOTTER, GCG-DOTPLOT, DOTLET

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

http://myhits.isb-sib.ch/cgi-bin/dotlet


DOT-MATRIX METHOD
Align the aa sequence "DOROTHYHODGKIN" vs "DOROTHYCROWFOOTHODGKIN"

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
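A plain dot matrix for the two sequences on this slide can be printed with a few lines of code. This is a minimal sketch without any window filtering; the function name `dot_matrix` and the "*" / "." symbols are assumptions for illustration.

```python
def dot_matrix(rows: str, cols: str):
    """One text line per residue of `rows`: '*' where residues are identical, '.' elsewhere."""
    return ["".join("*" if r == c else "." for c in cols) for r in rows]


short_seq = "DOROTHYHODGKIN"
long_seq = "DOROTHYCROWFOOTHODGKIN"

print("  " + long_seq)
for residue, line in zip(short_seq, dot_matrix(short_seq, long_seq)):
    print(residue, line)
```

In the printout, DOROTHY forms a diagonal starting at the top-left corner and HODGKIN forms a second diagonal shifted eight columns to the right, which corresponds to the inserted CROWFOOT.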


DOT-MATRIX METHOD – EX 1

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

WINDOW SIZE = 11
STRINGENCY = 7 (how many residues in the window must be identical)
…ACCAGTGTGCCGTACA… (window)
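Window size and stringency act as a noise filter: a dot is drawn only when enough residues inside the window are identical. A minimal sketch, assuming the window is anchored at the start position (i, j) rather than centred on it, which differs between tools; the function name `windowed_dots` is hypothetical.

```python
def windowed_dots(a: str, b: str, window: int = 11, stringency: int = 7):
    """Set of (i, j) where windows starting at a[i] and b[j] share >= stringency identities."""
    dots = set()
    for i in range(len(a) - window + 1):
        for j in range(len(b) - window + 1):
            # Count identical residues position by position inside the two windows.
            matches = sum(1 for k in range(window) if a[i + k] == b[j + k])
            if matches >= stringency:
                dots.add((i, j))
    return dots


seq = "ACCAGTGTGCCGTACA"
dots = windowed_dots(seq, seq, window=11, stringency=7)
print(sorted(dots))
```

Comparing a sequence with itself always places the main diagonal in the result; off-diagonal dots would indicate internal repeats that pass the filter.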


DOT-MATRIX METHOD – EX 2

A practical guide to the analysis of genes and proteins – Baxevanis – Ouellette – Wiley 2Ed.


DOT-MATRIX METHOD – EX 3 – REPEATS

Figure 3.6. Dot matrix analysis of the human LDL receptor against itself using DNA Strider, vers. 1.3, on a Macintosh

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press

DOT-MATRIX METHOD – PROGRAMS

(you could use PubMed also)

Bioinformatics for Dummies – Claviere – Notredame – Wiley - 2nd Ed. 2007


DOT-MATRIX EXAMPLES
http://hits.isb-sib.ch/util/dotlet/doc/dotlet_examples.html

http://myhits.isb-sib.ch/cgi-bin/dotlet


DYNAMIC PROGRAMMING METHOD
Provides the very best or optimal alignment in a very reasonable amount of time
Several parameters though
Global: Needleman-Wunsch
Local: Smith-Waterman
Provides a p-value for obtaining the alignment by chance from unrelated sequences
There is a method for statistical significance
Results depend on the scoring system



DYN. PROG. METHOD - SCORING
Results depend on the scoring system – SCORING MATRICES
Depending on:
  Pair-wise
  Gap Penalties
DNA alignments require a similar scoring system

DYNAMIC PROGRAMMING METHOD

[Figure: scoring matrix indexed by i and j]

x, y are the "radius"

Gap penalties from the scoring matrix

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press



DYNAMIC PROGRAMMING METHOD

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


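DYNAMIC PROGRAMMING EXAMPLE

        gap     A       C       G       G       A       T       A       T
gap     0       -1      -1      -1      -1      -1      -1      -1      -1
G       -1      0       -1      (d)+1   (d)+1   (l)0    (ld)-1  (d)-1   (d)-1
G       -1      -1      0       (d)+1   (d)+3   (l)+2   (l)+1   (l)0    (ld)-1
C       -1      -1      (d)+1   (ldu)0  (u)+2   (d)+3   (ld)+2  (ld)+1  (ld)0
T       -1      (d)-1   (u)0    (d)+1   (u)+1   (ud)+2  (d)+5   (l)+4   (ld)+3
A       -1      (d)+1   (l)0    (d)0    (d)+1   (d)+3   (u)+4   (d)+7   (l)+6

(d) = diagonal, (l) = left, (u) = up: the neighbouring cell(s) that give the maximum

Worked cells, as Max(diagonal, up, left):
  first G row, column A: Max(0,-2,-2) = 0
  first G row, column C: Max(-1,-2,-1) = -1
  second G row, column A: Max(-1,-1,-2) = -1

X=1
Y=1
Gap W(x=1) = 1, W(x=2) = 1 …
Gap W(y=1) = 1, …
s(a,b) = 2, if a = b
s(a,b) = 0, if a <> b

Resulting alignment:
ACGGATAT
--GGCTA-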
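The table on this slide can be recomputed with a short matrix fill. This is a sketch under the slide's parameters (s = 2 for a match, 0 for a mismatch, gap weight 1, radius 1); the function name `fill_score_matrix` is hypothetical. One detail is copied from the slide's table rather than from the usual textbook version: every border cell is set to -1, whereas many presentations of Needleman-Wunsch let the border penalty grow with the gap length.

```python
def fill_score_matrix(row_seq: str, col_seq: str, match: int = 2,
                      mismatch: int = 0, gap: int = 1):
    """Fill the dynamic programming matrix for row_seq (rows) against col_seq (columns)."""
    n_rows, n_cols = len(row_seq) + 1, len(col_seq) + 1
    # Border: 0 in the corner and -gap elsewhere, as in the slide's table.
    S = [[-gap] * n_cols for _ in range(n_rows)]
    S[0][0] = 0
    for i in range(1, n_rows):
        for j in range(1, n_cols):
            s = match if row_seq[i - 1] == col_seq[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s,  # (d) diagonal: align the two residues
                          S[i - 1][j] - gap,    # (u) up: gap in the column sequence
                          S[i][j - 1] - gap)    # (l) left: gap in the row sequence
    return S


S = fill_score_matrix("GGCTA", "ACGGATAT")
for residue, row in zip("-GGCTA", S):
    print(residue, row)
print("final score:", S[-1][-1])  # 6, the bottom-right cell of the slide's table
```

Recording which of the three terms gave the maximum in each cell, as the slide does with (d), (u) and (l), is what allows the alignment to be traced back from the bottom-right cell.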


DYN. PROG. METHOD - SCORING
Results depend on the scoring system – SCORING MATRICES
Depending on:
  Pair-wise
  Gap Penalties
The Dayhoff PAM (point accepted mutation) matrix is based on an evolutionary model for proteins
  One PAM is a unit of evolutionary divergence in which 1% of the amino acids have been changed in very similar sequences
BLOSUM matrices are designed to identify members of the same family
  Derived from the BLOCKS database (for distant sequences, blocks substitution matrix)


DYNAMIC PROGRAMMING - SCORING
Remember "SUM OF WEIGHTS" for similarity/distance

PAM250 is PAM1 applied 250 times (the PAM1 matrix multiplied by itself 250 times)

BLOSUM62: sequences that are at least 62% identical are merged into one.
BLOSUM90 for comparing more similar sequences.
BLOSUM30 for very different sequences.

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press
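The relation between PAM1 and PAM250 is repeated matrix multiplication, not multiplication of each entry by 250. The sketch below uses a toy two-letter alphabet with a 1% change probability per PAM unit; the matrix values and the helper names `mat_mul` and `mat_power` are assumptions for illustration, and real PAM matrices are 20 x 20 and are later converted to log-odds scores.

```python
def mat_mul(A, B):
    """Product of two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(M, n: int):
    """M multiplied by itself n times (n >= 1), using repeated squaring."""
    result = None
    base = M
    while n > 0:
        if n & 1:
            result = base if result is None else mat_mul(result, base)
        base = mat_mul(base, base)
        n >>= 1
    return result


# Toy "PAM1": each residue has a 1% chance of changing per PAM unit (hypothetical values).
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam250 = mat_power(toy_pam1, 250)
print(toy_pam250[0][0])  # probability that a residue is unchanged after 250 PAM units
```

Because changes can revert or hit the same position twice, 250 PAM units do not mean that 250% of the positions have changed, which is why the matrix power is needed.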


DYNAMIC PROGRAMMING METHOD
Some programs provide alternative alignments, depending on the goal:
  domains
  structural
  same family
  biological function
  common ancestor
There are several variations on the original Needleman-Wunsch and Smith-Waterman methods that improve memory usage, CPU time, and other features


DYNAMIC PROGRAMMING METHOD - OUTPUT

Bioinformatics – Sequence and Genome Analysis – Mount – CSH Lab Press


DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE
To assign a p-value, we could "shuffle" both sequences 100,000 times
  The proportion of times we obtain SCORES larger than the score of the real alignment represents the p-value
Another, quicker method is converting the alignment to BINARY sequences (match or no match)
  e.g. probability of obtaining HTHTHHHH in a coin toss experiment
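The shuffling procedure can be sketched as a small permutation test. Everything named here is an assumption for illustration: the helper `local_score` (a Smith-Waterman score with match +2, mismatch -1, gap -1), the function `shuffle_p_value`, the fixed seed, and the default of 200 shuffles instead of the 100,000 mentioned on the slide. Scores equal to the real score are counted as hits (">=" rather than strictly larger), which is a slightly more conservative choice.

```python
import random


def local_score(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -1) -> int:
    """Best local alignment score (Smith-Waterman recursion, score only)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0, prev[j - 1] + s, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def shuffle_p_value(a: str, b: str, n_shuffles: int = 200, seed: int = 0) -> float:
    """Proportion of shuffled pairs scoring at least as high as the real pair."""
    rng = random.Random(seed)
    real = local_score(a, b)
    hits = 0
    for _ in range(n_shuffles):
        la, lb = list(a), list(b)
        rng.shuffle(la)  # shuffling keeps the composition but destroys the order
        rng.shuffle(lb)
        if local_score("".join(la), "".join(lb)) >= real:
            hits += 1
    return hits / n_shuffles


seq = "ACGTTGCAAGCTTAGGCTAACGTTAGCCAT"
print(shuffle_p_value(seq, seq))
```

With only 200 shuffles the smallest non-zero p-value that can be reported is 1/200, so more shuffles are needed to resolve very small p-values.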


DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE
Two random sequences of lengths m and n, and p = prob. of match
Expected length of the longest match = log_(1/p)(mn)
DNA seq. length = 100, p = 0.25 (equal nt): the longest match = 2 x log4(100) = 6.64
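The worked number on this slide can be checked directly. A minimal sketch: the function name `expected_longest_match` is hypothetical and it implements only the simple log_(1/p)(mn) estimate, not the more precise formula.

```python
import math


def expected_longest_match(m: int, n: int, p: float) -> float:
    """Simple estimate of the longest run of matches between two random sequences."""
    return math.log(m * n, 1.0 / p)  # logarithm of m*n in base 1/p


# Two DNA sequences of length 100 with equal nucleotide frequencies (p = 0.25).
print(round(expected_longest_match(100, 100, 0.25), 2))  # 6.64
```

Since m = n = 100, log4(100 x 100) equals 2 x log4(100), which is the form used on the slide.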

More precise formula


DYNAMIC PROGRAMMING – STATISTICAL SIGNIFICANCE

Simplifying

k = mismatches, m and n are the sequence lengths

Effective length = n – E(m) (used in BLAST)

(mean of the highest possible local alignment score)


ALIGNMENT PROCEDURE OVERVIEW

WORD K-TUPLE METHOD - BLAST
Search a database for sequences that share at least W identical residues

For a sequence of length L, the number of "internal searches" is L-W+1

All "potential" sequences are then "extended" using the Dynamic Programming Method

A statistical significance score is estimated, representing the number of similar sequences expected in the database (E value, equivalent to a p-value for the entire database)
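The L-W+1 count comes from sliding a window of W residues along the query. A minimal sketch of that first step only: the names `query_words` and `shares_word` are hypothetical, exact word matching is assumed, and this is not how the BLAST programs index a database internally.

```python
def query_words(query: str, w: int):
    """All overlapping words of length w in the query (L - w + 1 of them)."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def shares_word(query: str, subject: str, w: int) -> bool:
    """True if the subject contains at least one word of length w from the query."""
    return any(word in subject for word in query_words(query, w))


words = query_words("ACCAGTGTGCCGTACA", 3)
print(len(words))                             # 16 - 3 + 1 = 14
print(shares_word("ACCAGT", "TTTCAGTTT", 3))  # True: the word CAG is shared
```

Only the database sequences that pass this filter would go on to the extension step described on the slide.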


BLAST
Pi – random residue probability
Sij – from score matrix
Score: S = sum(Pi Pj Sij)
Transformation (for statistical comparisons, expressed in bits): S' = (λS - ln K) / ln 2
Expected number of matches of at least S': E = m n 2^(-S')
Lengths: query = m, database = n
Example: m = 250, n = 50,000,000, to achieve E = 0.05
  S' = 38 bits
  S = [(38 * ln 2) + ln K] / λ
  S = 76.6
  (for the ungapped version: λu = 0.3176 and Ku = 0.134)
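The example on this slide can be reproduced numerically. This is a sketch that assumes the standard bit-score relations S' = (λS - ln K) / ln 2 and E = m n 2^(-S'), which agree with the slide's own line S = [(38 * ln 2) + ln K] / λ; the function names are hypothetical and the constants are the ungapped λ and K quoted on the slide.

```python
import math

LAMBDA_U = 0.3176  # ungapped lambda, from the slide
K_U = 0.134        # ungapped K, from the slide


def bit_score(raw: float, lam: float = LAMBDA_U, k: float = K_U) -> float:
    """Convert a raw score S into a bit score S'."""
    return (lam * raw - math.log(k)) / math.log(2)


def raw_score(bits: float, lam: float = LAMBDA_U, k: float = K_U) -> float:
    """Convert a bit score S' back into a raw score S."""
    return (bits * math.log(2) + math.log(k)) / lam


def expected_hits(bits: float, m: int, n: int) -> float:
    """E value: expected number of matches with a bit score of at least `bits`."""
    return m * n * 2.0 ** (-bits)


def bits_for_e_value(e: float, m: int, n: int) -> float:
    """Bit score needed to reach a given E value."""
    return math.log2(m * n / e)


m, n = 250, 50_000_000
print(round(bits_for_e_value(0.05, m, n), 2))  # 37.86, which the slide rounds up to 38 bits
print(round(raw_score(38), 1))                 # 76.6
print(expected_hits(38, m, n))                 # a little below 0.05
```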