Hernán Dopazo
Laboratorio de Genómica Biomédica y Evolución
Lab 25, Pab. II, FCEyN, UBA. CONICET
June 2012
Bioinformatics and Genomics
Keys to the New Biology
Technological Paradigm Shift
Artisanal Production. Technological Change. Industrial Production
Genetic Data (~1980). Genomic Data (~2005)
Tsunami of Genetic Data
What do we actually do?
Basic Kit of Bioinformatics Resources
Notes on the New Biology
Ecology and Population Genetics
Phylogeography and Population Ecology
Learning Models and Evolution
Phylogenomics
SNPs and Diseases
Natural Selection in Complete Genomes
Human Phylome
Software Development
Natural Selection in Functional Modules
Phylogenomics and Models of Evolution
Phylogenomics in Viruses
Complete 454 genome sequencing and phylogenomic analysis
Complexity and Information Theory
A widely accepted measure of relative complexity is the minimum amount of information required to specify the ontogeny and operation of a system.
The information content or complexity of an object can be measured by the length of its shortest description.
For instance, the string 01010101010101010101010101 has the short description "13 repetitions of 01", while 11001000011000011101111011 has no simpler description than writing down the string itself.
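The idea of "shortest description" can be illustrated with a general-purpose compressor. The sketch below is only a rough proxy: the length of zlib output is an upper bound on description length, not the true minimum, and the helper name `compressed_size`, the string lengths, and the fixed random seed are assumptions introduced here for illustration.

```python
import random
import zlib

def compressed_size(text):
    """Bytes needed by zlib at maximum compression level (a proxy, not the true minimum)."""
    return len(zlib.compress(text.encode("ascii"), 9))

# A highly regular string: 500 repetitions of "01".
regular = "01" * 500

# An irregular string of the same length; the seed only makes the run reproducible.
rng = random.Random(0)
irregular = "".join(rng.choice("01") for _ in range(1000))

print(compressed_size(regular), compressed_size(irregular))
```

The regular string should compress to far fewer bytes than the irregular one, mirroring the contrast between the two strings on the slide.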
The minimum genome sizes observed across metazoa show a consistent increase with complexity from simple nematode worms to insects to vertebrates.
This holds true for both biochemical and genome sequence measurements, as well as for computationally compressed genome sizes.
BioEssays 29:288-299, 2007
1- Burrows-Wheeler transform (BWT)
The transform is done by sorting all rotations of the text in lexicographic order, then taking the last column.
2- Move-To-Front transform (MTF)
The MTF encodes a string so that, in most cases, it can be represented more compactly than in its original form.
The main idea is that each symbol in the data is replaced by its index in the stack of recently used symbols.
3- Entropy: E = -\sum_{i=0}^{3} p(i) \log_4 p(i), where p(i) is the relative frequency of the value i in the transformed sequence.
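The three steps can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. The function names are chosen here, and two conventions are assumptions: the BWT uses plain cyclic rotations with no end-of-text marker, and the MTF stack starts empty, so a symbol seen for the first time is coded by the current stack size. Both choices reproduce the values in the worked examples of these slides.

```python
import math
from collections import Counter

def bwt(text):
    """Step 1: last column of the sorted cyclic rotations (no end-of-text marker)."""
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)

def mtf(text):
    """Step 2: replace each symbol by its index in a stack of recently used symbols."""
    stack, codes = [], []              # stack[0] is the most recently used symbol
    for symbol in text:
        if symbol in stack:
            index = stack.index(symbol)
            stack.pop(index)
        else:
            index = len(stack)         # assumption: a first-seen symbol enters at the bottom
        codes.append(index)
        stack.insert(0, symbol)        # move (or push) the symbol to the front
    return codes

def entropy(codes, base=4):
    """Step 3: Shannon entropy of the code frequencies, with logarithms in the given base."""
    n = len(codes)
    return sum(c / n * math.log(n / c, base) for c in Counter(codes).values())

s = "AAAAAACCAC"
last_column = bwt(s)
codes = mtf(last_column)
print(last_column)                 # CAAAACAACA
print(codes)                       # [0, 1, 0, 0, 0, 1, 1, 0, 1, 1]
print(round(entropy(codes), 3))    # 0.5
```

Sorting whole rotations costs memory quadratic in the text length, so this form suits short strings only; genome-scale inputs would need a suffix-array based construction.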
Measuring Complexity
Definition: Given a string s, we define the complexity ratio
CR(s) = Entropy(MTF(BWT(s)))
0 ≤ CR(s) ≤ 1; CV(s) = CR(s) * length(s)
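The bounds on CR follow from the base-4 logarithm in the entropy. A short derivation, assuming a four-letter alphabet and writing k (a symbol introduced here, not in the slides) for the number of distinct symbols in s, so that the MTF output takes at most k distinct values:

```latex
\[
  0 \;\le\; CR(s) \;=\; -\sum_{i=0}^{3} p(i)\,\log_4 p(i) \;\le\; \log_4 k \;\le\; \log_4 4 \;=\; 1 ,
\]
\[
  k = 1 \;\Rightarrow\; CR(s) = 0, \qquad
  k = 2 \;\Rightarrow\; CR(s) \le \log_4 2 = 0.5, \qquad
  0 \;\le\; CV(s) \;\le\; \mathrm{length}(s).
\]
```

The lower bound holds because every term -p(i) log_4 p(i) is non-negative, and the upper bound is the entropy of a uniform distribution over k values.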
Example 1. Sequences with CR = 0 have the simplest combinatorial structure: they contain just one symbol.
Rotations in lexicographic order (A):
9 A|AAAAAAAAA
8 AA|AAAAAAAA
7 AAA|AAAAAAA
6 AAAA|AAAAAA
5 AAAAA|AAAAA
4 AAAAAA|AAAA
3 AAAAAAA|AAA
2 AAAAAAAA|AA
1 AAAAAAAAA|A
0 AAAAAAAAAA|
BWT(s) = AAAAAAAAAA
MTF(BWT(s)) = 0,0,0,0,0,0,0,0,0,0
CR(s) = E(MTF(BWT(s))) = 0
Example 2. If a sequence contains just half of the symbols, then CR is at most 0.5.
Let s = AAAAAACCAC
Rotations in lexicographic order (A, C):
0 AAAAAACCAC|
1 AAAAACCAC|A
2 AAAACCAC|AA
3 AAACCAC|AAA
4 AACCAC|AAAA
8 AC|AAAAAACC
5 ACCAC|AAAAA
9 C|AAAAAACCA
7 CAC|AAAAAAC
6 CCAC|AAAAAA
BWT(s) = CAAAACAACA
MTF(BWT(s)) = 0,1,0,0,0,1,1,0,1,1
CR(s) = E(MTF(BWT(s))) = 0.500
Example 3. The de Bruijn sequence with the lowest CR among all de Bruijn sequences of order 2 in the four letter alphabet: s= AACCTTCGTAGCATGG
Rotations in lexicographic order (A, C, G, T)
0 AACCTTCGTAGCATGG|
1 ACCTTCGTAGCATGG|A
9 AGCATGG|AACCTTCGT
12 ATGG|AACCTTCGTAGC
11 CATGG|AACCTTCGTAG
2 CCTTCGTAGCATGG|AA
6 CGTAGCATGG|AACCTT
3 CTTCGTAGCATGG|AAC
15 G|AACCTTCGTAGCATG
10 GCATGG|AACCTTCGTA
14 GG|AACCTTCGTAGCAT
7 GTAGCATGG|AACCTTC
8 TAGCATGG|AACCTTCG
5 TCGTAGCATGG|AACCT
13 TGG|AACCTTCGTAGCA
4 TTCGTAGCATGG|AACC
BWT(s) = GATCGATCGATCGTAC
MTF(BWT(s)) = 0,1,2,3,3,3,3,3,3,3,3,3,3,2,3,3
CR(s) = E(MTF(BWT(s))) = 0.593
Example 4. The de Bruijn sequence with the highest CR among all de Bruijn sequences of order 2 in the four letter alphabet: s=AACCATAGTTGCTCGG
Rotations in lexicographic order (A, C, G, T)
0 AACCATAGTTGCTCGG|
1 ACCATAGTTGCTCGG|A
6 AGTTGCTCGG|AACCAT
4 ATAGTTGCTCGG|AACC
3 CATAGTTGCTCGG|AAC
2 CCATAGTTGCTCGG|AA
13 CGG|AACCATAGTTGCT
11 CTCGG|AACCATAGTTG
15 G|AACCATAGTTGCTCG
10 GCTCGG|AACCATAGTT
14 GG|AACCATAGTTGCTC
7 GTTGCTCGG|AACCATA
5 TAGTTGCTCGG|AACCA
12 TCGG|AACCATAGTTGC
9 TGCTCGG|AACCATAGT
8 TTGCTCGG|AACCATAG
BWT(s) = GATCCATGGTCAACTG
MTF(BWT(s)) = 0,1,2,3,0,2,2,3,0,1,3,3,0,1,2,3
CR(s) = E(MTF(BWT(s))) = 0.989
Human Language Complexity
Torquato Tasso. Johann Wolfgang von Goethe. German
Divina Commedia. Dante Alighieri. Italian
Principia Mathematica. Isaac Newton. Latin
El Quijote. Miguel de Cervantes. Spanish
Complete Works. W. Shakespeare. English
Coefficients:
             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)  -4.457e+03  1.660e+04   -0.269   0.794
size          3.781e-01  8.019e-03   47.149   4.35e-12 ***
Multiple R-squared: 0.996, Adjusted R-squared: 0.9955
F-statistic: 2223 on 1 and 9 DF, p-value: 4.349e-12
Cuentos de amor, locura y muerte. Horacio Quiroga. Spanish
Origin of Species. Charles Robert Darwin. English
Facundo. Domingo F. Sarmiento. Spanish
Les Miserables. Victor Hugo. French
The Human Genome. Scientific Abstract. English
El Aleph. Jorge Luis Borges. Spanish
[Figure: Human language complexity (log MTFE) versus text length (log L); both axes span 10^4 to 10^6.]
Genome Diversity (54 sps)
Features Species Taxa
LGS Monodelphis domestica Mammals
Homo sapiens Mammals
Pongo abelii Mammals
Macaca mulatta Mammals
Pan troglodytes Mammals
Mus musculus Mammals
Rattus norvegicus Mammals
Bos taurus Mammals
Equus caballus Mammals
Canis familiaris Mammals
AP, RP Zea mays Plants
AP Danio rerio Fishes
Taeniopygia guttata Birds
Gallus gallus Birds
AP Sorghum bicolor Plants
AP Oryzias latipes Fishes
AP Physcomitrella patens Bryophyta
AP Populus trichocarpa Plants
AP Oryza sativa Plants
AP Brachypodium distachyon Plants
Anopheles gambiae Invertebrates
Apis mellifera Invertebrates
AP Tetraodon nigroviridis Fishes
AP Arabidopsis lyrata Plants
GE Daphnia pulex Invertebrates
Drosophila melanogaster Invertebrates
AP, RG Arabidopsis thaliana Plants
Tribolium castaneum Invertebrates
Caenorhabditis elegans Invertebrates
LGS: Largest Genome Sequenced
SGS: Shortest Genome Sequenced
AP: Ancient Polyploid
RP: Recent Polyploid
LBG: Largest Bacterial Genome
SBG: Shortest Bacterial Genome
IBP: Intracellular Bacterial Parasite
RG: Reduced Genome
GE: Gene Expansion
EE: Extreme Environment
UE: Unicellular Eukaryote
SSD: Single-Strand DNA
DSD: Double-Strand DNA
RNA: RNA Virus
SL: Synthetic Life
Features Species Taxa
Ciona intestinalis Urochordate
UE Dictyostelium discoideum Amebozoa
UE Thalassiosira pseudonana Heterokonta
UE Phaeodactylum tricornutum Heterokonta
UE Plasmodium falciparum Apicomplexa
AP Saccharomyces cerevisiae Fungi
LBG Burkholderia xenovorans Bacteria
Escherichia coli Bacteria
Mycobacterium tuberculosis Bacteria
Bacillus subtilis Bacteria
EE Sulfolobus islandicus Archaea
EE Methanocaldococcus vulcanius Archaea
EE Thermococcus sibiricus Archaea
SL Synthetic Mycoplasma mycoides Bacteria
IBP, RG Ureaplasma urealyticum Bacteria
IBP, RG Buchnera aphidicola Bacteria
IBP, RG, SBG Carsonella ruddii Bacteria
DSD Human herpesvirus 1 Virus
DSD Enterobacteria phage lambda Phage
RNA Sudan ebolavirus Virus
RNA HIV-1 Virus
SSD Enterobacteria phage M13 Phage
SSD Tomato mosaic Virus
RNA Hepatitis B Virus
SGS, RNA Hepatitis D Virus
[Figure: Genome complexity (log MTFE) versus genome size (log S); both axes span 10^4 to 10^9. Points are grouped by taxa (Virus, Phage, Bacteria, Archaea, Fungi, Apicomplexa, Heterokonta, Amebozoa, Urochordate, Invertebrates, Plants, Fishes, Bryophyta, Birds, Mammals), and selected species are labeled.]
Genome Complexity
Full set (54 sps)
Coefficients:
             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)  -9.115e+06  1.848e+07   -0.493   0.624
size          9.673e-01  1.517e-02   63.750   [...]

[...]
Coefficients:
             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)  -1.279e+06  1.513e+06   -0.845   0.403
size          9.888e-01  1.150e-03   859.732  [...]

[...]
Coefficients:
             Estimate    Std. Error  t value  Pr(>|t|)
(Intercept)   9.165e+07  4.357e+07   2.103    0.0617 .
size          6.332e-01  5.571e-02   11.367   4.85e-07 ***
Multiple R-squared: 0.9282, Adjusted R-squared: 0.921
F-statistic: 129.2 on 1 and 10 DF, p-value: 4.855e-07
-------------------------------------------------------------
Linear Model * Interaction