Bioinformatics KFC/BIN II. Sequences (fch.upol.cz/wp-content/uploads/2018/03/BIN_02_sekvence_vz4.pdf)
Bioinformatics
KFC/BIN
II. Sequences
RNDr. Karel Berka, Ph.D.
Palacký University Olomouc
Central dogma of molecular biology
DNA -> RNA -> protein
RNA -> DNA: reverse transcription
information -> function
IUB code
code  nucleotides  complement
A     A            T
C     C            G
G     G            C
T     T            A
(U    U)           A
M     AC           K
R     AG           Y
W     AT           W
S     CG           S
Y     CT           R
K     GT           M
V     ACG          B
H     ACT          D
D     AGT          H
B     CGT          V
N     ACGT         N
-     space (gap)  -
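The complement column of the IUB table can be derived rather than memorized: complement every base a code stands for and look up which code names the resulting set. The following is a minimal Python sketch of that idea; the function names are illustrative and not part of the lecture.

```python
# IUB ambiguity codes mapped to the set of bases each one stands for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}
BASE_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
# Reverse lookup: set of bases -> code letter.
SET_TO_CODE = {frozenset(bases): code for code, bases in IUB.items()}


def complement_code(code: str) -> str:
    """Complement one IUB symbol; gaps stay gaps, U is treated as T."""
    if code == "-":
        return "-"
    if code == "U":
        code = "T"
    complemented = frozenset(BASE_COMPLEMENT[b] for b in IUB[code])
    return SET_TO_CODE[complemented]


def reverse_complement(seq: str) -> str:
    """Reverse complement of a sequence that may contain ambiguity codes."""
    return "".join(complement_code(c) for c in reversed(seq.upper()))


print(reverse_complement("ACGTMRWN"))
```

Because W (A or T) and S (C or G) each cover a base together with its own partner, they are their own complements.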
Amino acid codes (proteins; the IUB code above applies to nucleic acids, NA)
code  three-letter code  amino acid
A     Ala                Alanine
C     Cys                Cysteine
D     Asp                Aspartic acid
E     Glu                Glutamic acid
F     Phe                Phenylalanine
G     Gly                Glycine
H     His                Histidine
I     Ile                Isoleucine
K     Lys                Lysine
L     Leu                Leucine
M     Met                Methionine
N     Asn                Asparagine
P     Pro                Proline
Q     Gln                Glutamine
R     Arg                Arginine
S     Ser                Serine
T     Thr                Threonine
V     Val                Valine
W     Trp                Tryptophan
Y     Tyr                Tyrosine
X     Xxx                Any amino acid
*     ---                stop
Genetic code
(first base on the left, second base across the top, third base on the right)
     T          C          A          G
T    TTT Phe    TCT Ser    TAT Tyr    TGT Cys    T
     TTC Phe    TCC Ser    TAC Tyr    TGC Cys    C
     TTA Leu    TCA Ser    TAA Stop   TGA Stop   A
     TTG Leu    TCG Ser    TAG Stop   TGG Trp    G
C    CTT Leu    CCT Pro    CAT His    CGT Arg    T
     CTC Leu    CCC Pro    CAC His    CGC Arg    C
     CTA Leu    CCA Pro    CAA Gln    CGA Arg    A
     CTG Leu    CCG Pro    CAG Gln    CGG Arg    G
A    ATT Ile    ACT Thr    AAT Asn    AGT Ser    T
     ATC Ile    ACC Thr    AAC Asn    AGC Ser    C
     ATA Ile    ACA Thr    AAA Lys    AGA Arg    A
     ATG Met    ACG Thr    AAG Lys    AGG Arg    G
G    GTT Val    GCT Ala    GAT Asp    GGT Gly    T
     GTC Val    GCC Ala    GAC Asp    GGC Gly    C
     GTA Val    GCA Ala    GAA Glu    GGA Gly    A
     GTG Val    GCG Ala    GAG Glu    GGG Gly    G
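The codon table maps directly onto a dictionary lookup. The sketch below builds the standard table from a compact 64-letter string in T, C, A, G order (the same ordering as the table) and translates a DNA or RNA string codon by codon. The names translate and CODON_TABLE are illustrative, and unknown codons are reported as X by assumption.

```python
BASES = "TCAG"
# One-letter amino acids for all 64 codons, ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = ("FFLLSSSSYY**CC*W"
         "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR"
         "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq: str, to_stop: bool = False) -> str:
    """Translate a nucleotide sequence in reading frame 1.

    RNA input is accepted (U is read as T). A trailing incomplete codon
    is ignored; codons with ambiguity codes become 'X' (an assumption).
    """
    dna = seq.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE.get(dna[i:i + 3], "X")
        if aa == "*" and to_stop:
            break
        protein.append(aa)
    return "".join(protein)


# First ten codons of the hemoglobin coding sequence used in this lecture.
print(translate("atggtgcatctgactcctgaggagaagtct"))
```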
Terms and abbreviations
Genomics: the complete genetic information of an
organism (DNA sequence) and its interpretation.
structural
functional
DNA, RNA: nt (nucleotide), bp (base pair)
Proteomics: what is expressed in the organism, where (and when),
and what function it has
Proteins: aa (amino acids)
Sequence
DNA      5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
            | | | | | | | | | | | | | | |
         3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'
RNA      5' C-G-A-U-U-G-C-A-A-C-G-A-U-G-C 3'
Protein  Nter R L Q R C Cter
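The relation between the two DNA strands and the RNA in this example can be reproduced in a few lines. This is a minimal sketch; the function names are illustrative, and the mRNA is taken to have the same sequence as the upper (coding) strand with U in place of T.

```python
# Translation table for complementing plain DNA (no ambiguity codes).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna: str) -> str:
    """Base-by-base complement, read in the same left-to-right order."""
    return dna.upper().translate(COMPLEMENT)


def transcribe(coding_strand: str) -> str:
    """mRNA with the same sequence as the coding strand (T -> U)."""
    return coding_strand.upper().replace("T", "U")


coding = "CGATTGCAACGATGC"       # upper strand, 5' -> 3'
print(complement(coding))         # lower strand, written 3' -> 5'
print(transcribe(coding))         # RNA, 5' -> 3'
```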
Example: Hemoglobin
DNA sequence - 444 bp
atggtgcatctgactcctgaggagaagtctgccgttactgccctgtggggcaaggtgaac
gtggatgaagttggtggtgaggccctgggcaggctgctggtggtctacccttggacccag
aggttctttgagtcctttggggatctgtccactcctgatgctgttatgggcaaccctaag
gtgaaggctcatggcaagaaagtgctcggtgcctttagtgatggcctggctcacctggac
aacctcaagggcacctttgccacactgagtgagctgcactgtgacaagctgcacgtggat
cctgagaacttcaggctcctgggcaacgtgctggtctgtgtgctggcccatcactttggc
aaagaattcaccccaccagtgcaggctgcctatcagaaagtggtggctggtgtggctaat
gccctggcccacaagtatcactaa
Protein sequence - 147 aa
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFE
SFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFAT
LSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQ
KVVAGVANALAHKYH
The DNA sequence determines the protein sequence
the protein sequence determines the protein structure
the protein structure determines the function
DNA
• DNA sequencing
– 1972 DNA cloning
– 1975 DNA sequencing
– from the 1980s: the sequencing revolution
Manual (dideoxy electrophoresis)
• Sanger
Automated - robotization
• J. Craig Venter
– Celera Genomics
Protein
• Protein sequencing
– Edman degradation
• Sanger - fluorescent reagent
– MS/MS
masses (m/z)
940.421 - ELSDIAR
1093.477 - QLLLTADDR
1341.556 - PHSHPALTPEQK
1469.633 - PHSHPALTPEQKK
1488.645 - GILAADESTGSIAKR
1646.650 - LQSIGTENTEWENRR
2122.975 - IGENHTPSALAIMENANVLAR
2241.903 - YTPSGQAGAAASESLFISNHAY
The Human Genome Project
• Launched in the mid-1980s
• Initial estimate: 100,000 genes, completion in 2005
• Automated sequencing and improvements in computing
– Shotgun methods
• First version published in 2000 jointly by
– International Consortium Human Genome Project (publicly funded consortium)
– Celera Genomics (private company)
• Reference sequence of human DNA completed in April 2003
http://genomics.energy.gov/
The Human Genome Project
• 20,313 genes (Ensembl.org, 21 Feb 2016)
• 20,769 genes (Ensembl.org, 30 Sep 2013)
Alternative splicing – 10,000,000 proteins
• Hundreds of genes result from horizontal transfer from bacteria (in the vertebrate lineage)
• Dozens of genes are derived from transposable elements
• The mutation rate in males is about twice that in females
• >1,400,000 single nucleotide polymorphisms (SNPs)
Biological databases
primary vs. secondary
format vs. content (computers vs. humans)
primary
collect information about the sequences themselves
secondary
contain the results of analyses of data from primary databases
built using multiple alignment of homologous sequences
to capture conserved regions – classification into families
DNA databases
• GenBank (NCBI) – since 1982
– release 212: 190,250,235 sequences, 207,018,196,067 nt (Feb 2016)
– release 1: 606 sequences, 680,338 nt (Dec 1982)
• WGS (Whole Genome Shotgun) – since 2002
– release 212: 333,012,760 sequences, 1,399,865,495,608 nt (Feb 2016)
• ENA - EMBL (EBI)
– 713,500,000 sequences, 1,611,100,000,000 nt (Feb 2016)
– 83,666,567 sequences, 150,163,403,742 nt (Nov 2006)
69 GB compressed (376 GB uncompressed)
• DDBJ (DNA DataBase of Japan)
– 64,267,978 sequences, 68,259,314,742 nt (Dec 2006)
The databases share accession numbers ("A12345" in EMBL is the same record as "A12345" in GenBank or DDBJ)
Primary protein databases
• UniProtKB (PIR-PSD, SwissProt, TrEMBL)
– UniProtKB/Swiss-Prot
manually curated and reviewed protein sequence database: 550,552 entries (Feb 2016)
– UniProtKB/TrEMBL
automatically annotated and not reviewed: 60,971,489 entries (Feb 2016)
• NCBInr
– compiled from a variety of sources, including SwissProt, PIR, PRF, PDB,
and translations from annotated coding regions in GenBank and RefSeq:
4,396,331 entries (January 2007) - 4 GB
Secondary protein databases
Secondary database  Data source      Classification principle
PROSITE             UniProt          regular expressions (patterns)
PRINTS              OWL              motifs (fingerprints)
Pfam                UniProt          hidden Markov models (HMMs)
BLOCKS              PROSITE/PRINTS   motifs (blocks)
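PROSITE patterns are essentially regular expressions written in their own notation. As an illustration only, the sketch below converts the most common elements of that notation (x for any residue, [..] for allowed residues, {..} for forbidden residues, (n) and (n,m) for repeats, < and > for the sequence ends) into a Python regex. The function name is hypothetical, and less common syntax is not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE-style pattern to a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        at_start = element.startswith("<")   # anchored to the N-terminus
        at_end = element.endswith(">")       # anchored to the C-terminus
        element = element.strip("<>")
        # Split the element into its core and an optional (n) or (n,m) repeat.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if at_start else "") + core + ("$" if at_end else ""))
    return "".join(parts)


# Pattern of the form used for the N-glycosylation site motif.
regex = prosite_to_regex("N-{P}-[ST]-{P}.")
print(regex, bool(re.search(regex, "AANGSAA")))
```

Profile-based resources such as Pfam use HMMs instead, which score every position and cannot be reduced to a regex in this way.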
Sequence formats
binary, with chromatograms (for programs)
SCF, ALF, ABI
internal databases of these programs
text (human readable)
minimal: text, fasta
annotated: EMBL, GenBank, ASN, XML
SCF: standard chromatogram file
fasta format
>gi|6102607|gb|AF145233.1|AF145233 Mus musculus transcription factor PAX4
TGGCAGGACTGAAGCAGCTGGAGGCTGTTACAAGACCAGACCACCAGCAAACCCTGGAGCCTGCACAGGA
CCCTGAGACCTCTTCCTGGAATTCCCACCTTTTTTCCTCCATCCAGAACCAGTCCCAAAGAGAAACTTCC
AGAAGGAGCTCTCCGTTTTCAGTTTGCCAGTTGGCTTCCTGTCCTTCTGTGAGGAGTACCAGTGTGAAGC
ATGCAGCAGGACGGACTCAGCAGTGTGAATCAGCTAGGGGGACTCTTTGTGAATGGCCGGCCCCTTCCTC
TGGACACCAGGCAGCAGATTGTGCAGCTAGCAATAAGAGGGATGCGACCCTGTGACATTTCACGGAGCCT
TAAGGTATCTAATGGCTGTGTGAGCAAGATCCTAGGACGCTACTACCGCACAGGTGTCTTGGAACCCAAG
TGTATTGGGGGAAGCAAACCACGTCTGGCCACACCTGCTGTGGTGGCTCGAATTGCCCAGCTAAAGGATG
AGTACCCTGCTCTTTTTGCCTGGGAGATCCAACACCAGCTTTGCACTGAAGGGCTTTGTACCCAGGACAA
GGCTCCCAGTGTGTCCTCTATCAATCGAGTACTTCGGGCACTTCAGGAAGACCAGAGCTTGCACTGGACT
CAACTCAGATCACCAGCTGTGTTGGCTCCAGTTCTTCCCAGTCCCCACAGTAACTGTGGGGCTCCCCGAG
GCCCCCACCCAGGAACCAGCCACAGGAATCGGACTATCTTCTCCCCGGGACAAGCCGAGGCACTGGAGAA
AGAGTTTCAGCGTGGGCAGTATCCAGATTCAGTGGCCCGTGGGAAGCTGGCTGCTGCCACCTCTCTGCCT
GAAGACACGGTGAGGGTTTGGTTTTCTAACAGAAGAGCCAAATGGCGCAGGCAAGAGAAGCTGAAATGGG
AAGCACAGCTGCCAGGTGCTTCCCAGGACCTGACAGTACCAAAAAATTCTCCAGGGATCATCTCTGCACA
GCAGTCCCCCGGCAGTGTACCCTCAGCTGCCTTGCCTGTGCTGGAACCATTGAGTCCTTCCTTCTGTCAG
CTATGCTGTGGGACAGCACCAGGCAGATGTTCCAGTGACACCTCATCCCAGGCCTATCTCCAACCCTACT
GGGACTGCCAATCCCTCCTTCCTGTGGCTTCCTCCTCATATGTGGAATTTGCCTGGCCCTGCCTCACCAC
CCATCCTGTGCATCATCTGATTGGAGGCCCAGGACAAGTGCCATCAACCCATTGCTCAAACTGGCCATAA
GAGGCCTCTATTTGACAGTAATAAAAACCTTTTCTTAGATGTTAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
> comment (header) line – specifies whether the record is NA or protein
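Reading this format takes only a few lines: a line starting with > opens a new record, and every following line is appended to its sequence. The sketch below is a minimal reader under that assumption (the name read_fasta is illustrative); it also splits the pipe-separated identifier used in the header above.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = """>gi|6102607|gb|AF145233.1|AF145233 Mus musculus transcription factor PAX4
TGGCAGGACTGAAGCAGCTGGAGGCTGTTACAAGACCAGACCACCAGCAAACCCTGGAGCCTGCACAGGA
CCCTGAGACCTCTTCCTGGAATTCCCACCTTTTTTCCTCCATCCAGAACCAGTCCCAAAGAGAAACTTCC
"""
for header, seq in read_fasta(example.splitlines()):
    identifier = header.split()[0]          # text before the first space
    print(identifier.split("|"), len(seq))
```

In practice one would pass an open file object instead of a list of lines; the generator works on either.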
GenBank fields
Reference Seq-id
The NCBI RefSeq project provides a curated, nonredundant set of
reference sequence standards for naturally occurring biological
molecules, ranging from chromosomes to transcripts to proteins.
Prefixes:
•NC_ chromosomes
•NM_ mRNAs
•NP_ proteins
•NT_ constructed genomic contigs
•NG_ genomic regions or gene clusters
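Since the prefix encodes the molecule type, a RefSeq accession can be classified with a simple lookup. The sketch below covers only the five prefixes listed above; the function name and the accession used in the example are made up for illustration.

```python
# Only the prefixes listed on this slide; RefSeq defines more.
REFSEQ_PREFIXES = {
    "NC_": "chromosome",
    "NM_": "mRNA",
    "NP_": "protein",
    "NT_": "constructed genomic contig",
    "NG_": "genomic region or gene cluster",
}


def refseq_molecule_type(accession: str):
    """Return the molecule type for a RefSeq accession, or None if unknown."""
    return REFSEQ_PREFIXES.get(accession[:3].upper())


# "NM_000000.0" is a placeholder, not a real record.
print(refseq_molecule_type("NM_000000.0"))
print(refseq_molecule_type("AF145233"))
```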
GenBank fields
FEATURE field:
structured record
must have location (which can be partial)
main fields:
•SOURCE
•CDS (coding region)
•RNA
•GENE
•PROTEIN
GenBank flatfile
LOCUS AF145233 1360 bp mRNA ROD 23-OCT-1999
DEFINITION Mus musculus transcription factor PAX4 (Pax4) mRNA, complete cds.
ACCESSION AF145233
VERSION AF145233.1 GI:6102607
KEYWORDS .
SOURCE house mouse.
ORGANISM Mus musculus
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
REFERENCE 1 (bases 1 to 1360)
AUTHORS Kalousova,A., Benes,V., Paces,J., Paces,V. and Kozmik,Z.
TITLE DNA binding and transactivating properties of the paired and
homeobox protein Pax4
JOURNAL Biochem. Biophys. Res. Commun. 259 (3), 510-518 (1999)
MEDLINE 99294619
PUBMED 10364449
REFERENCE 2 (bases 1 to 1360)
AUTHORS Kalousova,A., Paces,J. and Kozmik,Z.
TITLE Direct Submission
JOURNAL Submitted (23-APR-1999) Dept. of Transcription Regulation,
Institute of Molecular Genetics, Videnska 1083, Prague 142 20,
Czech Republic
FEATURES Location/Qualifiers
source 1..1360
/organism="Mus musculus"
/db_xref="taxon:10090"
gene 1..1360
/gene="Pax4"
CDS 211..1260
/gene="Pax4"
/note="DNA binding protein; paired box protein; homeobox
protein"
/codon_start=1
/product="transcription factor PAX4"
/protein_id="AAF03533.1"
…
GenBank flatfile
CDS 211..1260
/gene="Pax4"
/note="DNA binding protein; paired box protein; homeobox
protein"
/codon_start=1
/product="transcription factor PAX4"
/protein_id="AAF03533.1"
/db_xref="GI:6102608"
/translation="MQQDGLSSVNQLGGLFVNGRPLPLDTRQQIVQLAIRGMRPCDIS
RSLKVSNGCVSKILGRYYRTGVLEPKCIGGSKPRLATPAVVARIAQLKDEYPALFAWE
IQHQLCTEGLCTQDKAPSVSSINRVLRALQEDQSLHWTQLRSPAVLAPVLPSPHSNCG
APRGPHPGTSHRNRTIFSPGQAEALEKEFQRGQYPDSVARGKLAAATSLPEDTVRVWF
SNRRAKWRRQEKLKWEAQLPGASQDLTVPKNSPGIISAQQSPGSVPSAALPVLEPLSP
SFCQLCCGTAPGRCSSDTSSQAYLQPYWDCQSLLPVASSSYVEFAWPCLTTHPVHHLI
GGPGQVPSTHCSNWP"
BASE COUNT 359 a 381 c 328 g 292 t
ORIGIN
1 tggcaggact gaagcagctg gaggctgtta caagaccaga ccaccagcaa accctggagc
61 ctgcacagga ccctgagacc tcttcctgga attcccacct tttttcctcc atccagaacc
121 agtcccaaag agaaacttcc agaaggagct ctccgttttc agtttgccag ttggcttcct
181 gtccttctgt gaggagtacc agtgtgaagc atgcagcagg acggactcag cagtgtgaat
…
1081 tccagtgaca cctcatccca ggcctatctc caaccctact gggactgcca atccctcctt
1141 cctgtggctt cctcctcata tgtggaattt gcctggccct gcctcaccac ccatcctgtg
1201 catcatctga ttggaggccc aggacaagtg ccatcaaccc attgctcaaa ctggccataa
1261 gaggcctcta tttgacagta ataaaaacct tttcttagat gttaaaaaaa aaaaaaaaaa
1321 aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa
//
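The CDS location and the /translation qualifier can be checked against each other: a location a..b spans b - a + 1 nucleotides, and a complete CDS ends with a stop codon that is not part of the protein. The sketch below does this arithmetic for simple a..b locations only (no join(...) or complement(...)); the function names are illustrative.

```python
def parse_simple_location(location: str):
    """Parse 'start..end' (1-based, inclusive); '<' and '>' marks are dropped."""
    start, end = location.replace("<", "").replace(">", "").split("..")
    return int(start), int(end)


def cds_summary(location: str):
    """Return (nucleotides, codons, amino acids) for a complete CDS."""
    start, end = parse_simple_location(location)
    length = end - start + 1
    codons = length // 3
    return length, codons, codons - 1   # last codon is the stop codon


# CDS of the Pax4 record above: 1050 nt = 350 codons = 349 aa + stop.
print(cds_summary("211..1260"))
```

The result agrees with the record: the /translation qualifier holds 349 residues, position 211 is the a of atg in the ORIGIN block, and positions 1258..1260 read taa.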
ID AF031150 standard; RNA; ROD; 1379 BP.
XX
AC AF031150;
XX
SV AF031150.1
XX
DT 27-FEB-1998 (Rel. 54, Created)
DT 27-FEB-1998 (Rel. 54, Last updated, Version 1)
XX
DE Mus musculus paired-box transcription factor (Pax4) mRNA, complete cds.
XX
KW .
XX
OS Mus musculus (house mouse)
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
XX
RN [1]
RP 1-1379
RA Inoue H., Nomiyama J., Nakai K., Matsutani A., Tanizawa Y., Oka Y.;
RT Isolation of full-length cDNA of mouse PAX4 gene and identification of its
RT human homologue;
RL Biochem. Biophys. Res. Commun. 243:628-633(1998).
XX
RN [2]
RP 1-1379
RA Inoue H., Nomiyama J., Nakai K., Tanizawa Y., Oka Y.;
RT ;
RL Submitted (23-OCT-1997) to the EMBL/GenBank/DDBJ databases.
RL Third Dept. of Int. Med., Yamaguchi University, 1144 Kogushi, Ube,
RL Yamaguchi 755, Japan
XX
FH Key Location/Qualifiers
…
EMBL flatfile
…
FH Key Location/Qualifiers
FH
FT source 1..1379
FT /db_xref=taxon:10090
FT /organism=Mus musculus
FT /cell_line=MIN6
FT CDS 297..1346
FT /codon_start=1
FT /gene=Pax4
FT /product=paired-box transcription factor
FT /protein_id=AAC40046.1
FT /translation=MQQDGLSSVNQLGGLFVNGRPLPLDTRQQIVQLAIRGMRPCDISR
FT SLKVSNGCVSKILGRYYRTGVLEPKCIGGSKPRLATPAVVARIAQLKDEYPALFAWEIQ
FT HQLCTEGLCTQDKAPSVSSINRVLRALQEDQSLHWTQLRSPAVLAPVLPSPHSNCGAPR
FT GPHPGTSHRNRTIFSPGQAEALEKEFQRGQYPDSVARGKLAAATSLPEDTVRVWFSNRR
FT AKWRRQEKLKWEAQLPGASQDLTVPKNSPGIISAQQSPGSVPSAALPVLEPLSPSFCQL
FT CCGTAPGRCSSDTSSQAYLQPYWDCQSLLPVASSSYVEFAWPCLTTHPVHHLIGGPGQV
FT PSTHCSNWP
XX
SQ Sequence 1379 BP; 327 A; 402 C; 347 G; 303 T; 0 other;
aaaaaaaaaa aaaaagcggc cgctgaattc tagcagaagg ctgccctctg ctcctgagtg 60
aaggctctgt gaagctctgg accccctggc aggactgaag cagctggagg ctgttacaag 120
accagaccac cagcaaaccc tggagcctgc acaggaccct gagacctctt cctggaattc 180
ccaccttttt tcctccatcc agaaccagtc ccaaagagaa acttccagaa ggagctctcc 240
gttttcagtt tgccagttgg cttcctgtcc ttctgtgagg agtaccagtg tgaagcatgc 300
agcaggacgg actcagcagt gtgaatcagc tagggggact ctttgtgaat ggccggcccc 360
…
gctgtgggac agcaccaggc agatgttcca gtgacacctc atcccaggcc tatctccaac 1200
cctactggga ctgccaatcc ctccttcctg tggcttcctc ctcatatgtg gaatttgcct 1260
ggccctgcct caccacccat cctgtgcatc atctgattgg aggcccagga caagtgccat 1320
caacccattg ctcaaactgg ccataagagg cctctatttg acagtaataa aaacctttt 1379
//
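In the EMBL flatfile every annotation line starts with a two-letter code (ID, AC, DE, OS, FT, SQ, ...), XX lines are spacers, and // ends the record. The sketch below groups lines by that code and checks the base counts on the SQ line; it is a simplified reader written for this example, with illustrative function names, not a full parser.

```python
import re


def group_embl_lines(lines):
    """Group EMBL annotation lines by their two-letter line code."""
    fields = {}
    for line in lines:
        code = line[:2]
        # Skip spacer lines, the record terminator and indented sequence lines.
        if code in ("XX", "//") or not code.strip():
            continue
        fields.setdefault(code, []).append(line[2:].strip())
    return fields


def sq_base_counts(sq_text: str):
    """Extract the per-base counts from the text of an SQ line."""
    return {base: int(n)
            for n, base in re.findall(r"(\d+) (A|C|G|T|other);", sq_text)}


record = [
    "ID   AF031150 standard; RNA; ROD; 1379 BP.",
    "XX",
    "AC   AF031150;",
    "XX",
    "OS   Mus musculus (house mouse)",
    "OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;",
    "OC   Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.",
    "SQ   Sequence 1379 BP; 327 A; 402 C; 347 G; 303 T; 0 other;",
    "//",
]
fields = group_embl_lines(record)
counts = sq_base_counts(fields["SQ"][0])
print(fields["AC"], sum(counts.values()))
```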
ASN.1
Seq-entry ::= set {
class nuc-prot ,
descr {
title "Mus musculus transcription factor PAX4 (Pax4) mRNA, complete cds." ,
source {
org {
taxname "Mus musculus" ,
common "house mouse" ,
db {
{
db "taxon" ,
tag
id 10090 } } ,
orgname {
name
binomial {
genus "Mus" ,
species "musculus" } ,
lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae;
Mus" ,
gcode 1 ,
mgcode 2 ,
div "ROD" } } } ,
pub {
pub {
sub {
authors {
names
std
Genome project of Rhodopseudomonas palustris
Sequencing and characterization of 5kb region.
(diploma thesis of Jana Prejdová, supervised by
Jan Pačes)
a model example of
the use of sequencing
DNA sequencing
connecting contigs
>jana (4797 nt)
GAATTCGCCGCGGGGCTGCGCATCACCGATGCCGCCACCATCGAGATCGTCGAGATGGTACTGGCCGGCTCGATCAACAAGCAGCTCGTCGGC
TACATCAACGAAGCGGGCGGCAAGGCCGTCGGCCTGTGCGGCAAGGACGGCAACATGGTGTCCGCCACCAAGGCGACGCGCACCATGGTCGAT
CCGGATTCGCGGATCGAAGAGGTGATCGACCTCGGTTTCGTCGGCGAGCCGGAGAAGGTCGACCTCACCCTGCTCAACCAGCTGATCGGCCAC
GAGTTGATCCCGGTGCTGGCGCCGCTGGCGACCTCCGCGTCGGGCCAGACCTTCAACGTCAATGCCGACACCTTTGCAGGTGCGGTTGCCGGT
GCGCTGCGGGCCAAGCGCCTGCTGCTGCTGACCGACGTGCCGGGCGTGCTCGACCAGAACAAGAAGCTGATCCCCGAACTGTCGATCAAGGAT
GCCCGCAAGCTGATCGCAGACGGCACCATCTCGGGCGGCATGATCCCCAAGGTCGAGACCTGCATCTACGCGCTCGAACAGGGCGTCGAAGGC
GTCGTCATCCTCGACGGCAAGGTCCCGCACGCAGTGCTGCTCGAATTGTTCACCAACCAGGGCACCGGCACGCTGATCCACAAGTGATGCGAG
GCTGCGGCGACAACATCCGTCATGGCCGGGCTCGTCCCGGCCATCCACGTCTTTCCGGCGGTTTTCTCAGCAAGACGTGGATGCCCGGCACAA
GGCCGGGCATGACGGGGTGGAGATCGCGCGCCCTCGCCGCCATTGTCACCACCCTCGCCCTCACCTCCGCCGCCCACGCCGACCTCAAGCTCT
GCAACCGCATGAGCTACGTGGTCGAGACGGCGATCGGGGTCGATTCCAACGGCACCACCGCCTCGCGCGGATGGCTGCGGATTGATCCGGCGC
AATGCCGGGTCGTGGTGCAAGGCGCGCTCAACGCCGACCGCATCATGCTGAATGCCCGCGCGCTGGCGGTGTACGGCGTCTCGCCGCTGCCGC
AGAACGGCACTGACCGGCTGTGCATTGCCGAAGACAATTTCGTCATCGCCGCCGCGCGGCAATGCCGCGGCGGCCAAACGCTCGCCGCCTTCA
CCGAGATCAAGCCCACCGACACCGAGGACGGCAACAAGATCGCTTATCTGGCGGAAGACTCCGGCTACGACGACGAACAGGCCAAACTCGCCG
CGATCCAGCGGCTGCTGGTGATCGCCGGTTACGACGCCTCGCCGATCGACGGCGTCGACGGCCCGAAGACGCAGGCCGCGCTGTCCGCCTTCC
TCAAGAGCCGAGGCCTGAAGCCCGAGATCGTCGATGCGCCGGATTTCTTCGACGTGATGATCAAGGCAGTGCAGCAGCCGTCCGGCAGCGGGC
TGACCTGGTGCAACGACACCAAGTACAAGATCATGGCGGCCGTCGGCGAAGACGACGGCAAGACTGTCACCAGCCGCGGCTGGTACGGTGTTG
CGCCCGGCCAATGCCTGCGCCCCGACCTCGGCGCACAGCCGAAGCGGGTGTTCAGCTTCGCCGAAGCGGTCGACGGCAGCGGCAGGCCGGTGA
CCATCAAGGGCCGTGCGCTGAACTGGGGCGGCGGCGTGACGCTGTGCACGCGTGACAGCAAGTTCGAGATCGGCGAGCAAGGCGATTGCGCGG
CGCGCGGCCTCGCCGCCACCGGCTTCGCCGCCGTCGATCTCAGTAGCGGCAAGACATTGAGGTTGTCCGCCCCATGATGCAGCTCGGCAAACG
CGGCTTCGATCACGTCGAGACCTGGGTGTTCGATCTCGACAACACGCTGTACCCGCATCACCTCAACCTATGGCAGCAGGTCGATGCGCGGAT
CCGCGACTTCGTCGCCGACTGGCTGAAGGTTTCGCCGGAAGAAGCCTTCCGTATCCAGAAGGATTACTACAAGCGCTACGGCACCACGATGCG
CGGGATGATGACCGAGCACGGCGTTCACGCCGACGACTACCTGGCTTATGTCCACGCCATCGACCATTCGCCGCTGCAGCCGAATCCGGCGAT
GGGCGATGCGATCGAGCGACTGCCGGGCCGCAAGCTGATCCTGACCAACGGCTCGACCGCCCATGCGGGCAAGGTGCTGGAGCGGCTCGGCAT
CGGCCATCATTTCGAGGCGGTGTTCGACATCATTGCGGCCGACCTCGAGCCGAAGCCGGCGCCGCAGACCTACCGCCGTTTTCTCGATCGCCA
TGGTGTCGACCCGGCCCGCGCCGCGATGTTCGAAGACCTCGCCCGCAACCTCACCGTGCCGCACCAGCTCGGCATGACCACCGTGCTGGTGGT
GCCTGACGATAGCCAGGACGTGGTCCGCGAAGATTGGGAGCTTGAAGGCCGCGACGCCGCCCACGTCGATCACGTGACTGATGATTTGACAGG
GTTCTTGGGGAAGCTGAGTTCGCTGTAGGCCGGGGACGCCTCCCAAGCGTCAATCGTCATCGCCGCCGGATGCAAGGCGGCTAGGTATTGCGG
AGCGCTCGCGATCTTCCGTCCAATGCCCTGGGATACTGGATCGCCCGGACGAGCCGGGCGACGACGTTGAAGAGAGATGACGTGGCGTCACCA
CATCCCCCGCCGTCATCGCCCGCGCAGGCGGGCGATGACTTGGCGGACGGGGCGGCGCCTTGACTCCGACCCGGCGAATCCGGACAACACTCC
GCAAGGACTGGACCACGCTGTTCTTCAGCTTTCGAGGTCGGATCAATCGCGCCAAATACTGGCTGGTCGGACTGATCTACGTCGCCGCCTGGA
TGG …
sequence in FastA
Leucine
        Rhodobacter capsulatus   Escherichia coli
codon   number   %               %
CUA     3        <1              4
CUC     119      16              9
CUG     458      60              52
CUU     157      20              10
UUA     0        0               11
UUG     27       3               13
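The percentage column is simply each codon count divided by the total number of leucine codons. The sketch below recomputes it from the Rhodobacter capsulatus counts in the table; the variable and function names are illustrative.

```python
# Leucine codon counts for Rhodobacter capsulatus, taken from the table.
leucine_counts = {"CUA": 3, "CUC": 119, "CUG": 458, "CUU": 157, "UUA": 0, "UUG": 27}


def usage_percent(counts):
    """Relative usage of each synonymous codon, in percent."""
    total = sum(counts.values())
    return {codon: 100.0 * n / total for codon, n in counts.items()}


percent = usage_percent(leucine_counts)
for codon, value in percent.items():
    print(codon, leucine_counts[codon], f"{value:.1f}")
```

The strong preference for CUG and CUC over UUA reflects the codon bias of this organism; the E. coli column shows a different profile for the same amino acid.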
how to find genes?
genes
Sanger
Ch21 (in Nature)
cDNA
GENESCAN
EXOFISH
eukaryotic genes
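For bacterial sequences without introns, a first approximation to gene finding is a scan for open reading frames (ORFs): an ATG followed in the same frame by a stop codon. The sketch below is such a naive scan on the forward strand only. It is not the method used by the gene predictors named above, which model exons, introns and splice sites; the function name and the min_codons threshold are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 3):
    """Return (start, end) of forward-strand ORFs, 1-based and inclusive.

    An ORF runs from the first ATG in a frame to the next in-frame stop
    codon; the length in codons includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start + 1, i + 3))
                start = None
    return orfs


print(find_orfs("CCATGAAATTTTAGGG"))
```

A realistic search would also scan the reverse complement and use a much larger minimum length to suppress chance ORFs.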
which proteins are encoded by
genes?
ja1 ja2 ja3 ja4 ja5 ja6
BLAST - search for relatives
which proteins are encoded by
genes?
ja1  ACETYLGLUTAMATE KINASE                        EC 2.7.2.8
ja2
ja3
ja4  TETRAHYDRODIPICOLINATE N-SUCCINYLTRANSFERASE  EC 2.3.1.117
ja5
ja6  SUCCINYL-DIAMINOPIMELATE DESUCCINYLASE        EC 3.5.1.18
ja1 ja2 ja3 ja4 ja5 ja6
what function do these genes have
in the cell?
which proteins are encoded by
genes?
ja4  TETRAHYDRODIPICOLINATE N-SUCCINYLTRANSFERASE  EC 2.3.1.117
ja5  ACETYLORNITHINE TRANSAMINASE                  EC 2.6.1.11
ja6  SUCCINYL-DIAMINOPIMELATE DESUCCINYLASE        EC 3.5.1.18
ja1 ja2 ja3 ja4 ja5 ja6
bioinformatics
Rhodopseudomonas palustris
can synthesize the amino acid
lysine via a biochemical pathway that uses the
enzyme EC 2.6.1.17.
Credits
• This lecture was prepared using material from the following
lectures:
– Jan Pačes and Jiří Vondrášek – Bioinformatika
(UK Praha)
– Aplikovaná proteomika (UO Hradec Králové)