CONICET · Hernán Dopazo, Laboratorio de Genómica Biomédica y...


  • Hernán Dopazo, Laboratorio de Genómica Biomédica y Evolución, Lab 25, Pab. II, FCEyN, UBA. CONICET

    June 2012

    Bioinformatics and Genomics

    Keys to the New Biology

  • Technological Paradigm Shift

    Artisanal production, technological change, industrial production

    Genetic data (~1980), genomic data (~2005)

  • A Tsunami of Genetic Data

  • What Do We Actually Do?

  • Basic Kit of Bioinformatics Resources

  • Notes on the New Biology

  • Ecology and Population Genetics

  • Phylogeography and Population Ecology

  • Phylogeography and Population Ecology

  • Learning Models and Evolution

  • Phylogenomics

  • SNPs and Disease

  • Natural Selection in Complete Genomes

  • Human Phylome

  • Software Development

  • Natural Selection in Functional Modules

  • Phylogenomics and Models of Evolution

  • Phylogenomics in Viruses

    Complete 454 genome sequencing and phylogenomic analysis

  • Complexity and Information Theory

    A widely accepted measure of relative complexity is the minimum amount of information required to specify the ontogeny and operation of a system.

    The information content or complexity of an object can be measured by the length of its shortest description.

    For instance, the string 01010101010101010101010101 has the short description "13 repetitions of 01", while 11001000011000011101111011 has no simpler description than writing down the string itself.

    The minimum genome sizes observed across metazoa show a consistent increase with complexity from simple nematode worms to insects to vertebrates.

    This holds true for both biochemical and genome sequence measurements, as well as for computationally compressed genome sizes.

    BioEssays 29:288-299, 2007
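    The "shortest description" idea can be illustrated with the two strings quoted on this slide. The sketch below is only a crude stand-in: it looks for the shortest unit whose exact repetition reproduces the string, which gives an upper bound on description length and nothing more (the true shortest description is not computable in general). The function name `shortest_repeat_description` is hypothetical and not from the slides.

```python
def shortest_repeat_description(s):
    """Return (unit, count) for the shortest unit whose repetition equals s.

    If s has no shorter repeating unit, the unit is s itself and count is 1.
    """
    n = len(s)
    for period in range(1, n + 1):
        # A valid period must divide the length and rebuild s exactly.
        if n % period == 0 and s[:period] * (n // period) == s:
            return s[:period], n // period


regular = "01010101010101010101010101"
irregular = "11001000011000011101111011"

print(shortest_repeat_description(regular))       # ('01', 13)
print(shortest_repeat_description(irregular)[1])  # 1
```

    The first string collapses to a two-character unit plus a count, while the second can only be described, under this toy scheme, by spelling it out in full.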

  • Measuring Complexity

    1- Burrows-Wheeler transform (BWT)

    The transform is done by sorting all rotations of the text in lexicographic order, then taking the last column.

    2- Move-To-Front transform (MTF)

    The MTF is a way to encode a string that, most of the time, allows the same string to be represented more compactly than in its original form.

    The main idea is that each symbol in the data is replaced by its index in the stack of recently used symbols.

    3- Entropy: $E = -\sum_{i=0}^{3} p(i) \log_4 p(i)$, where $p(i)$ is the relative frequency of index $i$ in the encoded string.

    Definition:

    Given a string s, we define the complexity ratio

    $CR(s) = \mathrm{Entropy}(\mathrm{MTF}(\mathrm{BWT}(s)))$

    with $0 \le CR(s) \le 1$, and $CV(s) = CR(s) \cdot \mathrm{length}(s)$.
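    The three steps and the definition of CR can be sketched in a few lines of Python. This is a minimal illustration, not the author's software. Two details are assumptions inferred from the worked examples on the next slide: the MTF stack starts empty (so a symbol seen for the first time is coded as the number of distinct symbols seen before it), and the entropy is taken over the frequencies of the MTF indices. The function names `bwt`, `mtf`, `entropy`, `cr` and `cv` are chosen here for illustration.

```python
import math
from collections import Counter


def bwt(s):
    """Last column of the lexicographically sorted rotations of s."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def mtf(s):
    """Move-to-front indices, starting from an empty stack (assumption)."""
    stack, codes = [], []
    for symbol in s:
        if symbol in stack:
            index = stack.index(symbol)
            stack.pop(index)
        else:
            index = len(stack)  # first occurrence of this symbol
        codes.append(index)
        stack.insert(0, symbol)  # most recently used symbol goes on top
    return codes


def entropy(codes, base=4):
    """Shannon entropy of the index frequencies, in units of log base 4."""
    n = len(codes)
    return sum((c / n) * math.log(n / c, base) for c in Counter(codes).values())


def cr(s):
    """Complexity ratio CR(s) = Entropy(MTF(BWT(s)))."""
    return entropy(mtf(bwt(s)))


def cv(s):
    """Complexity value CV(s) = CR(s) * length(s)."""
    return cr(s) * len(s)
```

    With these definitions, `cr("AAAAAACCAC")` evaluates to 0.5 (up to floating-point rounding) and `bwt("AAAAAACCAC")` gives `CAAAACAACA`, matching the figures quoted in Example 2.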

  • Example 1. Sequences with CR = 0 have the simplest combinatorial structure: they contain just one symbol. Let s = AAAAAAAAAA.

    Rotations in lexicographic order (A):

    9 A|AAAAAAAAA
    8 AA|AAAAAAAA
    7 AAA|AAAAAAA
    6 AAAA|AAAAAA
    5 AAAAA|AAAAA
    4 AAAAAA|AAAA
    3 AAAAAAA|AAA
    2 AAAAAAAA|AA
    1 AAAAAAAAA|A
    0 AAAAAAAAAA|

    BWT(s) = AAAAAAAAAA
    MTF(BWT(s)) = 0,0,0,0,0,0,0,0,0,0
    CR(s) = E(MTF(BWT(s))) = 0

    Example 2. If a sequence uses only half of the symbols in the alphabet (here two of the four), then CR is at most 0.5. Let s = AAAAAACCAC.

    Rotations in lexicographic order (A, C):

    0 AAAAAACCAC|
    1 AAAAACCAC|A
    2 AAAACCAC|AA
    3 AAACCAC|AAA
    4 AACCAC|AAAA
    8 AC|AAAAAACC
    5 ACCAC|AAAAA
    9 C|AAAAAACCA
    7 CAC|AAAAAAC
    6 CCAC|AAAAAA

    BWT(s) = CAAAACAACA
    MTF(BWT(s)) = 0,1,0,0,0,1,1,0,1,1
    CR(s) = E(MTF(BWT(s))) = 0.500

    Example 3. The de Bruijn sequence with the lowest CR among all de Bruijn sequences of order 2 in the four-letter alphabet: s = AACCTTCGTAGCATGG

    Rotations in lexicographic order (A, C, G, T):

    0 AACCTTCGTAGCATGG|
    1 ACCTTCGTAGCATGG|A
    9 AGCATGG|AACCTTCGT
    12 ATGG|AACCTTCGTAGC
    11 CATGG|AACCTTCGTAG
    2 CCTTCGTAGCATGG|AA
    6 CGTAGCATGG|AACCTT
    3 CTTCGTAGCATGG|AAC
    15 G|AACCTTCGTAGCATG
    10 GCATGG|AACCTTCGTA
    14 GG|AACCTTCGTAGCAT
    7 GTAGCATGG|AACCTTC
    8 TAGCATGG|AACCTTCG
    5 TCGTAGCATGG|AACCT
    13 TGG|AACCTTCGTAGCA
    4 TTCGTAGCATGG|AACC

    BWT(s) = GATCGATCGATCGTAC
    MTF(BWT(s)) = 0,1,2,3,3,3,3,3,3,3,3,3,3,2,3,3
    CR(s) = E(MTF(BWT(s))) = 0.593

    Example 4. The de Bruijn sequence with the highest CR among all de Bruijn sequences of order 2 in the four-letter alphabet: s = AACCATAGTTGCTCGG

    Rotations in lexicographic order (A, C, G, T):

    0 AACCATAGTTGCTCGG|
    1 ACCATAGTTGCTCGG|A
    6 AGTTGCTCGG|AACCAT
    4 ATAGTTGCTCGG|AACC
    3 CATAGTTGCTCGG|AAC
    2 CCATAGTTGCTCGG|AA
    13 CGG|AACCATAGTTGCT
    11 CTCGG|AACCATAGTTG
    15 G|AACCATAGTTGCTCG
    10 GCTCGG|AACCATAGTT
    14 GG|AACCATAGTTGCTC
    7 GTTGCTCGG|AACCATA
    5 TAGTTGCTCGG|AACCA
    12 TCGG|AACCATAGTTGC
    9 TGCTCGG|AACCATAGT
    8 TTGCTCGG|AACCATAG

    BWT(s) = GATCCATGGTCAACTG
    MTF(BWT(s)) = 0,1,2,3,0,2,2,3,0,1,3,3,0,1,2,3
    CR(s) = E(MTF(BWT(s))) = 0.989
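    Examples 3 and 4 rely on the strings being de Bruijn sequences of order 2, meaning that every two-letter word over {A, C, G, T} occurs exactly once when the string is read cyclically. A short check of that property might look like the sketch below; the helper name `is_de_bruijn` is hypothetical and not part of the slides.

```python
from itertools import product


def is_de_bruijn(s, k=2, alphabet="ACGT"):
    """True if every k-mer over the alphabet occurs exactly once in cyclic s."""
    if len(s) != len(alphabet) ** k:
        return False
    cyclic = s + s[:k - 1]  # wrap around so the last k-mers cross the end
    kmers = {cyclic[i:i + k] for i in range(len(s))}
    # With exactly |alphabet|^k windows, covering every k-mer means each occurs once.
    return kmers == {"".join(p) for p in product(alphabet, repeat=k)}


print(is_de_bruijn("AACCTTCGTAGCATGG"))  # True
print(is_de_bruijn("AACCATAGTTGCTCGG"))  # True
print(is_de_bruijn("AAAAAACCAC"))        # False
```

    Both example strings pass, so their different CR values reflect how the sixteen two-letter words are ordered, since the set of words is the same in both.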

  • Human Language Complexity

    Texts analyzed:

    Torquato Tasso. Johann Wolfgang von Goethe. German

    Divina Commedia. Dante Alighieri. Italian

    Principia Mathematica. Isaac Newton. Latin

    El Quijote. Miguel de Cervantes. Spanish

    Complete Works. W. Shakespeare. English

    Cuentos de amor, locura y muerte. Horacio Quiroga. Spanish

    Origin of Species. Charles Robert Darwin. English

    Facundo. Domingo F. Sarmiento. Spanish

    Les Miserables. Victor Hugo. French

    The Human Genome. Scientific Abstract. English

    El Aleph. Jorge Luis Borges. Spanish

    Coefficients:
                  Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept) -4.457e+03   1.660e+04   -0.269  0.794
    size         3.781e-01   8.019e-03   47.149  4.35e-12 ***

    Multiple R-squared: 0.996, Adjusted R-squared: 0.9955
    F-statistic: 2223 on 1 and 9 DF, p-value: 4.349e-12

    [Figure: Human Language Complexity (log MTFE) plotted against Length (log L); both axes run from 10^4 to 10^6.]
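    The regression summary on this slide is internally consistent, which can be checked with a little arithmetic. For a linear model with one predictor and an intercept, the t value is the estimate divided by its standard error, the F-statistic is the square of the slope's t value, and R-squared follows from F and the residual degrees of freedom. The sketch below uses only the numbers printed in the summary; the variable names are chosen here for illustration.

```python
# Values copied from the regression summary for the "size" coefficient.
estimate = 3.781e-01
std_error = 8.019e-03
df_resid = 9  # "on 1 and 9 DF"

# t value = estimate / standard error
t_value = estimate / std_error

# With a single predictor, F equals the squared t value of the slope.
f_stat = t_value ** 2

# R^2 = F / (F + residual df) for one predictor.
r_squared = f_stat / (f_stat + df_resid)

# Adjusted R^2 with n observations and one predictor: n - 2 = residual df.
n_obs = df_resid + 2
adj_r_squared = 1 - (1 - r_squared) * (n_obs - 1) / df_resid

print(f"t = {t_value:.2f}, F = {f_stat:.0f}")
print(f"R^2 = {r_squared:.4f}, adjusted R^2 = {adj_r_squared:.4f}")
```

    The recomputed values agree with the reported 47.149, 2223, 0.996 and 0.9955 up to the rounding of the printed inputs, and the 11 observations implied by 9 residual degrees of freedom match the 11 texts listed.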

  • Genome Diversity (54 sps)

    Features   Species                     Taxa
    LGS        Monodelphis domestica       Mammals
    -          Homo sapiens                Mammals
    -          Pongo abelii                Mammals
    -          Macaca mulatta              Mammals
    -          Pan troglodytes             Mammals
    -          Mus musculus                Mammals
    -          Rattus norvegicus           Mammals
    -          Bos taurus                  Mammals
    -          Equus caballus              Mammals
    -          Canis familiaris            Mammals
    AP, RP     Zea mays                    Plants
    AP         Danio rerio                 Fishes
    -          Taeniopygia guttata         Birds
    -          Gallus gallus               Birds
    AP         Sorghum bicolor             Plants
    AP         Oryzias latipes             Fishes
    AP         Physcomitrella patens       Bryophyta
    AP         Populus trichocarpa         Plants
    AP         Oryza sativa                Plants
    AP         Brachypodium distachyon     Plants
    -          Anopheles gambiae           Invertebrates
    -          Apis mellifera              Invertebrates
    AP         Tetraodon nigroviridis      Fishes
    AP         Arabidopsis lyrata          Plants
    GE         Daphnia pulex               Invertebrates
    -          Drosophila melanogaster     Invertebrates
    AP, RG     Arabidopsis thaliana        Plants
    -          Tribolium castaneum         Invertebrates
    -          Caenorhabditis elegans      Invertebrates

    LGS: Largest Genome Sequenced

    SGS: Shortest Genome Sequenced

    AP: Ancient Polyploid

    RP: Recent Polyploid

    LBG: Largest Bacterial Genome

    SBG: Shortest Bacterial Genome

    IBP: Intracellular Bacterial Parasite

    RG: Reduced Genome

    GE: Gene Expansion

    EE: Extreme Environment

    UE: Unicellular Eukaryote

    SSD: Single-Strand DNA

    DSD: Double-Strand DNA

    RNA: RNA Virus

    SL: Synthetic Life

    Features       Species                        Taxa
    -              Ciona intestinalis             Urochordate
    UE             Dictyostelium discoideum       Amoebozoa
    UE             Thalassiosira pseudonana       Heterokonta
    UE             Phaeodactylum tricornutum      Heterokonta
    UE             Plasmodium falciparum          Apicomplexa
    AP             Saccharomyces cerevisiae       Fungi
    LBG            Burkholderia xenovorans        Bacteria
    -              Escherichia coli               Bacteria
    -              Mycobacterium tuberculosis     Bacteria
    -              Bacillus subtilis              Bacteria
    EE             Sulfolobus islandicus          Archaea
    EE             Methanocaldococcus vulcanius   Archaea
    EE             Thermococcus sibiricus         Archaea
    SL             Synthetic Mycoplasma mycoides  Bacteria
    IBP, RG        Ureaplasma urealyticum         Bacteria
    IBP, RG        Buchnera aphidicola            Bacteria
    IBP, RG, SBG   Carsonella ruddii              Bacteria
    DSD            Human herpesvirus 1            Virus
    DSD            Enterobacteria phage lambda    Phage
    RNA            Sudan ebolavirus               Virus
    RNA            HIV 1                          Virus
    SSD            Enterobacteria phage M13       Phage
    SSD            Tomato mosaic                  Virus
    RNA            Hepatitis B                    Virus
    SGS, RNA       Hepatitis D                    Virus

  • [Figure: Genome Complexity (log MTFE) plotted against Genome Size (log S); both axes run from 10^4 to 10^9.]

    Taxa in the legend: Virus, Phage, Bacteria, Archaea, Fungi, Apicomplexa, Heterokonta, Amoebozoa, Urochordate, Invertebrates, Plants, Fishes, Bryophyta, Birds, Mammals.

    Labels in the plot: HIV virus, Sudan Ebola virus, Human Herpes virus, Buchnera aphidicola, Ureaplasma urealyticum, Escherichia coli, Plasmodium falciparum, Saccharomyces cerevisiae, Dictyostelium discoideum, Phaeodactylum tricornutum, Caenorhabditis elegans, Tetraodon nigroviridis, Physcomitrella patens, Gallus gallus, Ciona intestinalis, Lambda phage, M13 phage, Thermococcus sibiricus, Daphnia pulex, Burkholderia xenovorans, Zea mays, Synthetic Mycoplasma mycoides, M. domestica, Sulfolobus islandicus, Hepatitis D virus, Tomato mosaic virus, Hepatitis B, GCR, Carsonella ruddii.

    Genome Complexity

    Full set (54 sps)

    Coefficients:
                  Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept) -9.115e+06   1.848e+07   -0.493  0.624
    size         9.673e-01   1.517e-02   63.750  [...]

    [...]

    Coefficients:
                  Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept) -1.279e+06   1.513e+06   -0.845  0.403
    size         9.888e-01   1.150e-03  859.732  [...]

    [...]

    Coefficients:
                  Estimate  Std. Error  t value  Pr(>|t|)
    (Intercept)  9.165e+07   4.357e+07    2.103  0.0617 .
    size         6.332e-01   5.571e-02   11.367  4.85e-07 ***

    Multiple R-squared: 0.9282, Adjusted R-squared: 0.921
    F-statistic: 129.2 on 1 and 10 DF, p-value: 4.855e-07

    -------------------------------------------------------------

    Linear Model * Interaction