Bioinformatics KFC/BIN II. Sequences
fch.upol.cz/wp-content/uploads/2018/03/BIN_02_sekvence_vz4.pdf

Page 1:

Bioinformatics

KFC/BIN

II. Sequences

RNDr. Karel Berka, Ph.D.

Univerzita Palackého v Olomouci

Page 2:

The central dogma of molecular biology

Page 3:

The central dogma of molecular biology

reverse transcription

information, function

DNA → RNA → protein

Page 4:

IUB code

Nucleic acids (NA)

code   nucleotides   complement
A      A             T
C      C             G
G      G             C
T      T             A
(U)    (U)           A
M      A, C          K
R      A, G          Y
W      A, T          W
S      C, G          S
Y      C, T          R
K      G, T          M
V      A, C, G       B
H      A, C, T       D
D      A, G, T       H
B      C, G, T       V
N      A, C, G, T    N
-      gap           -

Proteins

code   three-letter code   amino acid
A      Ala                 Alanine
C      Cys                 Cysteine
D      Asp                 Aspartic acid
E      Glu                 Glutamic acid
F      Phe                 Phenylalanine
G      Gly                 Glycine
H      His                 Histidine
I      Ile                 Isoleucine
K      Lys                 Lysine
L      Leu                 Leucine
M      Met                 Methionine
N      Asn                 Asparagine
P      Pro                 Proline
Q      Gln                 Glutamine
R      Arg                 Arginine
S      Ser                 Serine
T      Thr                 Threonine
V      Val                 Valine
W      Trp                 Tryptophan
Y      Tyr                 Tyrosine
X      Xxx                 Any amino acid
*      ---                 stop
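The complement column of the nucleotide table can be applied mechanically. The following is a minimal sketch, not part of the original slides: the names `IUB_COMPLEMENT` and `reverse_complement` are illustrative, and W (A or T) and S (C or G) are treated as their own complements, since complementing each member of the set gives the same set back.

```python
# Complement of each IUB nucleotide code; W and S are self-complementary.
IUB_COMPLEMENT = {
    "A": "T", "C": "G", "G": "C", "T": "A", "U": "A",
    "M": "K", "K": "M", "R": "Y", "Y": "R", "W": "W", "S": "S",
    "V": "B", "B": "V", "H": "D", "D": "H", "N": "N", "-": "-",
}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence, keeping ambiguity codes."""
    return "".join(IUB_COMPLEMENT[base] for base in reversed(seq.upper()))


print(reverse_complement("ACGTMRN"))  # NYKACGT
```

A dictionary lookup raises `KeyError` on any character outside the table, which is usually preferable to silently passing through a typo in a sequence.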

Page 5:

Genetic code (first base on the left, second base across the top, third base on the right)

      T          C          A           G
T     TTT Phe    TCT Ser    TAT Tyr     TGT Cys     T
      TTC Phe    TCC Ser    TAC Tyr     TGC Cys     C
      TTA Leu    TCA Ser    TAA Stop    TGA Stop    A
      TTG Leu    TCG Ser    TAG Stop    TGG Trp     G
C     CTT Leu    CCT Pro    CAT His     CGT Arg     T
      CTC Leu    CCC Pro    CAC His     CGC Arg     C
      CTA Leu    CCA Pro    CAA Gln     CGA Arg     A
      CTG Leu    CCG Pro    CAG Gln     CGG Arg     G
A     ATT Ile    ACT Thr    AAT Asn     AGT Ser     T
      ATC Ile    ACC Thr    AAC Asn     AGC Ser     C
      ATA Ile    ACA Thr    AAA Lys     AGA Arg     A
      ATG Met    ACG Thr    AAG Lys     AGG Arg     G
G     GTT Val    GCT Ala    GAT Asp     GGT Gly     T
      GTC Val    GCC Ala    GAC Asp     GGC Gly     C
      GTA Val    GCA Ala    GAA Glu     GGA Gly     A
      GTG Val    GCG Ala    GAG Glu     GGG Gly     G
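The table is degenerate: several codons map to the same amino acid. As a small sketch (the compact 64-letter string and the names `CODON_TABLE` and `degeneracy` are illustrative, not from the slides), the table can be rebuilt in the same T, C, A, G order and the number of codons per amino acid counted.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"  # same order as the rows and columns of the table
# One-letter amino acids for TTT, TTC, TTA, TTG, TCT, ... GGG; '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# How many codons encode each amino acid (or stop)?
degeneracy = Counter(CODON_TABLE.values())

print(CODON_TABLE["ATG"], CODON_TABLE["TGG"])              # M W
print(degeneracy["L"], degeneracy["M"], degeneracy["*"])   # 6 1 3
```

Leucine, serine and arginine each have six codons, while methionine and tryptophan have one each.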

Page 6:

Terms and abbreviations

Genomics: the complete genetic information of an organism (its DNA sequence) and its interpretation.

structural

functional

DNA, RNA: nt (nucleotide), bp (base pair)

Proteomics: what is expressed in an organism, where (and when), and what function it has

Proteins: aa (amino acids)

Page 7:

Sequence

DNA

5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
   | | | | | | | | | | | | | | |
3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'

RNA

5' C-G-A-U-U-G-C-A-A-C-G-A-U-G-C 3'

Protein

N-ter R L Q R C C-ter
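The three levels on this slide can be derived from the top DNA strand alone. A minimal sketch follows, assuming the top strand is the coding strand; the variable names are illustrative. Read with the standard genetic code, the five codons are Arg, Leu, Gln, Arg, Cys.

```python
# Base-pairing rule for DNA: A-T and C-G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

coding_strand = "CGATTGCAACGATGC"                       # top strand, 5'->3'
template_strand = coding_strand.translate(COMPLEMENT)   # bottom strand, written 3'->5' as drawn
mrna = coding_strand.replace("T", "U")                  # RNA carries U in place of T

# Split the RNA into consecutive triplets starting at the first base.
codons = [mrna[i:i + 3] for i in range(0, len(mrna), 3)]

print(template_strand)  # GCTAACGTTGCTACG
print(mrna)             # CGAUUGCAACGAUGC
print(codons)           # ['CGA', 'UUG', 'CAA', 'CGA', 'UGC']
```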

Page 8:

Example: Hemoglobin

DNA sequence - 444 bp

atggtgcatctgactcctgaggagaagtctgccgttactgccctgtggggcaaggtgaac

gtggatgaagttggtggtgaggccctgggcaggctgctggtggtctacccttggacccag

aggttctttgagtcctttggggatctgtccactcctgatgctgttatgggcaaccctaag

gtgaaggctcatggcaagaaagtgctcggtgcctttagtgatggcctggctcacctggac

aacctcaagggcacctttgccacactgagtgagctgcactgtgacaagctgcacgtggat

cctgagaacttcaggctcctgggcaacgtgctggtctgtgtgctggcccatcactttggc

aaagaattcaccccaccagtgcaggctgcctatcagaaagtggtggctggtgtggctaat

gccctggcccacaagtatcactaa

Protein sequence - 147 aa

MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFE

SFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFAT

LSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQ

KVVAGVANALAHKYH

The DNA sequence determines the protein sequence

the protein sequence determines the protein structure

the protein structure determines the function
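The first of these statements can be checked directly on the hemoglobin example: 444 bp is 148 codons, that is 147 amino acids plus one stop codon. The sketch below is an illustration rather than the lecture's own code; `CODON_TABLE` and `translate` are names introduced here, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order; '*' is a stop codon.
CODON_TABLE = dict(zip(
    ("".join(c) for c in product("tcag", repeat=3)),
    "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG",
))

dna = (
    "atggtgcatctgactcctgaggagaagtctgccgttactgccctgtggggcaaggtgaac"
    "gtggatgaagttggtggtgaggccctgggcaggctgctggtggtctacccttggacccag"
    "aggttctttgagtcctttggggatctgtccactcctgatgctgttatgggcaaccctaag"
    "gtgaaggctcatggcaagaaagtgctcggtgcctttagtgatggcctggctcacctggac"
    "aacctcaagggcacctttgccacactgagtgagctgcactgtgacaagctgcacgtggat"
    "cctgagaacttcaggctcctgggcaacgtgctggtctgtgtgctggcccatcactttggc"
    "aaagaattcaccccaccagtgcaggctgcctatcagaaagtggtggctggtgtggctaat"
    "gccctggcccacaagtatcactaa"
)

expected_protein = (
    "MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFE"
    "SFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFAT"
    "LSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQ"
    "KVVAGVANALAHKYH"
)


def translate(cds: str) -> str:
    """Translate a coding sequence codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE[cds[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


protein = translate(dna)
print(len(dna), len(protein))        # 444 147
print(protein == expected_protein)   # True
```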

Page 9:

DNA

• DNA sequencing
– 1972 DNA cloning
– 1975 DNA sequencing
– since the 1980s – the sequencing revolution

Manual (dideoxy electrophoresis)
• Sanger

Automated - robotization
• J. Craig Venter
– Celera Genomics

Page 10:

Protein

• Protein sequencing
– Edman degradation
• Sanger - fluorescent reagent
– MS/MS

masses (m/z)

940.421 - ELSDIAR

1093.477 - QLLLTADDR

1341.556 - PHSHPALTPEQK

1469.633 - PHSHPALTPEQKK

1488.645 - GILAADESTGSIAKR

1646.650 - LQSIGTENTEWENRR

2122.975 - IGENHTPSALAIMENANVLAR

2241.903 - YTPSGQAGAAASESLFISNHAY

Page 11:

The Human Genome Project

• Launched in the mid-1980s
• Estimate: 100,000 genes, completion in 2005
• Automated sequencing and improved computing technology
– Shotgun methods
• First version published in 2000 jointly by
– International Consortium Human Genome Project (publicly funded)
– Celera Genomics (private company)
• Reference sequence of human DNA completed in April 2003

http://genomics.energy.gov/

Page 12:

The Human Genome Project

• 20,313 genes (Ensembl.org, 21 Feb 2016)
• 20,769 genes (Ensembl.org, 30 Sep 2013)

Alternative splicing – 10,000,000 proteins

• Hundreds of genes result from horizontal transfer from bacteria (in the vertebrate lineage)
• Dozens of genes are derived from transposable elements
• The mutation rate in males is about 2x higher than in females
• >1,400,000 single nucleotide polymorphisms (SNPs)

Page 13:

Biological databases

primary vs. secondary

format vs. content (computers vs. humans)

primary

collect information about the sequences themselves

secondary

Contain the results of analyzing data from primary databases

Built by multiple alignment of homologous sequences to capture conserved regions – classification into families

Page 14:

DNA databases

• GenBank (NCBI) – since 1982
– release 212: 190,250,235 sequences, 207,018,196,067 nt (Feb 2016)
– release 1: 606 sequences, 680,338 nt (Dec 1982)
• WGS (Whole Genome Shotgun) – since 2002
– release 212: 333,012,760 sequences, 1,399,865,495,608 nt (Feb 2016)
• ENA - EMBL (EBI)
– 713,500,000 sequences, 1,611,100,000,000 nt (Feb 2016)
– 83,666,567 sequences, 150,163,403,742 nt (Nov 2006), 69 GB compressed (376 GB uncompressed)
• DDBJ (DNA DataBase of Japan)
– 64,267,978 sequences, 68,259,314,742 nt (Dec 2006)

They share accession numbers ("A12345" in EMBL is the same as "A12345" in GenBank or DDBJ)
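The two GenBank figures give a feel for the growth rate. As a rough back-of-the-envelope sketch (it assumes steady exponential growth between the two releases, which the slide does not claim), the average doubling time follows from the ratio of nucleotide counts.

```python
import math

nt_dec_1982 = 680_338            # GenBank release 1
nt_feb_2016 = 207_018_196_067    # GenBank release 212
years = 33 + 2 / 12              # Dec 1982 to Feb 2016

# Number of times the database size doubled, assuming exponential growth.
doublings = math.log2(nt_feb_2016 / nt_dec_1982)
doubling_time_years = years / doublings

print(f"{doublings:.1f} doublings")                  # 18.2 doublings
print(f"{doubling_time_years:.1f} years each")       # 1.8 years each
```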

Page 15:

Primary protein databases

• UniProtKB (PIR-PSD, SwissProt, TrEMBL)
– UniProtKB/Swiss-Prot: manually curated and reviewed protein sequence database, 550,552 entries (Feb 2016)
– UniProtKB/TrEMBL: automatically annotated and not reviewed, 60,971,489 entries (Feb 2016)
• NCBInr
– compiled from a variety of sources, including SwissProt, PIR, PRF, PDB, and translations from annotated coding regions in GenBank and RefSeq; 4,396,331 entries (January 2007), 4 GB

Page 16:

Secondary protein databases

Secondary database   Data source       Classification principle
PROSITE              UNIPROT           regular expressions (patterns)
PRINTS               OWL               motifs (fingerprints)
Pfam                 UNIPROT           hidden Markov models (HMMs)
BLOCKS               PROSITE/PRINTS    motifs (blocks)
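PROSITE patterns are close to ordinary regular expressions, so a simple pattern can be converted and searched with a standard regex engine. The sketch below is illustrative only: `prosite_to_regex` is a hypothetical helper that handles the common elements (`x`, `[...]`, `{...}`, repetition counts, `<` and `>` anchors) and not every corner of the PROSITE syntax. The example pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("x", ".")                      # x = any residue
        element = element.replace("{", "[^").replace("}", "]")   # {P} = any residue except P
        element = element.replace("(", "{").replace(")", "}")    # x(2,4) = 2 to 4 repeats
        element = element.replace("<", "^").replace(">", "$")    # N- and C-terminal anchors
        parts.append(element)
    return "".join(parts)


regex = prosite_to_regex("N-{P}-[ST]-{P}")
print(regex)                                   # N[^P][ST][^P]
print(re.search(regex, "AANGTAA").group())     # NGTA
print(prosite_to_regex("C-x(2,4)-C"))          # C.{2,4}C
```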

Page 17:

Sequence formats

binary
  with chromatograms: SCF, ALF, ABI
  for programs: internal databases of those programs

text (human readable)
  minimal: text, fasta
  annotated: EMBL, GenBank, ASN, XML

Page 18:

SCF

SCF: standard chromatogram file

Page 19:

fasta format

>gi|6102607|gb|AF145233.1|AF145233 Mus musculus transcription factor PAX4

TGGCAGGACTGAAGCAGCTGGAGGCTGTTACAAGACCAGACCACCAGCAAACCCTGGAGCCTGCACAGGA

CCCTGAGACCTCTTCCTGGAATTCCCACCTTTTTTCCTCCATCCAGAACCAGTCCCAAAGAGAAACTTCC

AGAAGGAGCTCTCCGTTTTCAGTTTGCCAGTTGGCTTCCTGTCCTTCTGTGAGGAGTACCAGTGTGAAGC

ATGCAGCAGGACGGACTCAGCAGTGTGAATCAGCTAGGGGGACTCTTTGTGAATGGCCGGCCCCTTCCTC

TGGACACCAGGCAGCAGATTGTGCAGCTAGCAATAAGAGGGATGCGACCCTGTGACATTTCACGGAGCCT

TAAGGTATCTAATGGCTGTGTGAGCAAGATCCTAGGACGCTACTACCGCACAGGTGTCTTGGAACCCAAG

TGTATTGGGGGAAGCAAACCACGTCTGGCCACACCTGCTGTGGTGGCTCGAATTGCCCAGCTAAAGGATG

AGTACCCTGCTCTTTTTGCCTGGGAGATCCAACACCAGCTTTGCACTGAAGGGCTTTGTACCCAGGACAA

GGCTCCCAGTGTGTCCTCTATCAATCGAGTACTTCGGGCACTTCAGGAAGACCAGAGCTTGCACTGGACT

CAACTCAGATCACCAGCTGTGTTGGCTCCAGTTCTTCCCAGTCCCCACAGTAACTGTGGGGCTCCCCGAG

GCCCCCACCCAGGAACCAGCCACAGGAATCGGACTATCTTCTCCCCGGGACAAGCCGAGGCACTGGAGAA

AGAGTTTCAGCGTGGGCAGTATCCAGATTCAGTGGCCCGTGGGAAGCTGGCTGCTGCCACCTCTCTGCCT

GAAGACACGGTGAGGGTTTGGTTTTCTAACAGAAGAGCCAAATGGCGCAGGCAAGAGAAGCTGAAATGGG

AAGCACAGCTGCCAGGTGCTTCCCAGGACCTGACAGTACCAAAAAATTCTCCAGGGATCATCTCTGCACA

GCAGTCCCCCGGCAGTGTACCCTCAGCTGCCTTGCCTGTGCTGGAACCATTGAGTCCTTCCTTCTGTCAG

CTATGCTGTGGGACAGCACCAGGCAGATGTTCCAGTGACACCTCATCCCAGGCCTATCTCCAACCCTACT

GGGACTGCCAATCCCTCCTTCCTGTGGCTTCCTCCTCATATGTGGAATTTGCCTGGCCCTGCCTCACCAC

CCATCCTGTGCATCATCTGATTGGAGGCCCAGGACAAGTGCCATCAACCCATTGCTCAAACTGGCCATAA

GAGGCCTCTATTTGACAGTAATAAAAACCTTTTCTTAGATGTTAAAAAAAAAAAAAAAAAAAAAAAAAAA

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

> comment line – specifies whether the sequence is NA or protein
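Because the format is just a `>` comment line followed by sequence lines, a reader takes only a few lines of code. This is a minimal sketch; `parse_fasta` is an illustrative name, and the example holds only the first two sequence lines of the record above.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = """>gi|6102607|gb|AF145233.1|AF145233 Mus musculus transcription factor PAX4
TGGCAGGACTGAAGCAGCTGGAGGCTGTTACAAGACCAGACCACCAGCAAACCCTGGAGCCTGCACAGGA
CCCTGAGACCTCTTCCTGGAATTCCCACCTTTTTTCCTCCATCCAGAACCAGTCCCAAAGAGAAACTTCC
"""

header, sequence = next(parse_fasta(example))
identifiers, _, description = header.partition(" ")
accession = identifiers.split("|")[3]

print(accession)      # AF145233.1
print(description)    # Mus musculus transcription factor PAX4
print(len(sequence))  # 140
```

Using a generator keeps memory use low for large files, since only one record is held at a time.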

Page 20:

GenBank fields

Reference Seq-id

The NCBI RefSeq project provides a curated, nonredundant set of

reference sequence standards for naturally occurring biological

molecules, ranging from chromosomes to transcripts to proteins.

Prefixes:

•NC_ chromosomes

•NM_ mRNAs

•NP_ proteins

•NT_ constructed genomic contigs

•NG_ genomic regions or gene clusters
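The prefix alone tells what kind of molecule a RefSeq record describes. A small sketch of such a lookup follows; the function name and the accession `NM_000000.0` are made up for illustration, and only the prefixes listed above are covered.

```python
# RefSeq prefixes as listed on the slide.
REFSEQ_PREFIXES = {
    "NC_": "chromosome",
    "NM_": "mRNA",
    "NP_": "protein",
    "NT_": "constructed genomic contig",
    "NG_": "genomic region or gene cluster",
}


def refseq_molecule_type(accession: str) -> str:
    """Return the molecule type implied by the first three characters of an accession."""
    return REFSEQ_PREFIXES.get(accession[:3], "not one of the listed RefSeq prefixes")


print(refseq_molecule_type("NM_000000.0"))  # mRNA (hypothetical accession)
print(refseq_molecule_type("AF145233"))     # not one of the listed RefSeq prefixes
```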

Page 21:

GenBank fields

FEATURE field:

structured record

must have location (which can be partial)

main fields:

•SOURCE

•CDS (coding region)

•RNA

•GENE

•PROTEIN

Page 22:

GenBank flatfile

LOCUS AF145233 1360 bp mRNA ROD 23-OCT-1999

DEFINITION Mus musculus transcription factor PAX4 (Pax4) mRNA, complete cds.

ACCESSION AF145233

VERSION AF145233.1 GI:6102607

KEYWORDS .

SOURCE house mouse.

ORGANISM Mus musculus

Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;

Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.

REFERENCE 1 (bases 1 to 1360)

AUTHORS Kalousova,A., Benes,V., Paces,J., Paces,V. and Kozmik,Z.

TITLE DNA binding and transactivating properties of the paired and

homeobox protein Pax4

JOURNAL Biochem. Biophys. Res. Commun. 259 (3), 510-518 (1999)

MEDLINE 99294619

PUBMED 10364449

REFERENCE 2 (bases 1 to 1360)

AUTHORS Kalousova,A., Paces,J. and Kozmik,Z.

TITLE Direct Submission

JOURNAL Submitted (23-APR-1999) Dept. of Transcription Regulation,

Institute of Molecular Genetics, Videnska 1083, Prague 142 20,

Czech Republic

FEATURES Location/Qualifiers

source 1..1360

/organism="Mus musculus"

/db_xref="taxon:10090"

gene 1..1360

/gene="Pax4"

CDS 211..1260

/gene="Pax4"

/note="DNA binding protein; paired box protein; homeobox

protein"

/codon_start=1

/product="transcription factor PAX4"

/protein_id="AAF03533.1"

Page 23:

GenBank flatfile

CDS 211..1260

/gene="Pax4"

/note="DNA binding protein; paired box protein; homeobox

protein"

/codon_start=1

/product="transcription factor PAX4"

/protein_id="AAF03533.1"

/db_xref="GI:6102608"

/translation="MQQDGLSSVNQLGGLFVNGRPLPLDTRQQIVQLAIRGMRPCDIS

RSLKVSNGCVSKILGRYYRTGVLEPKCIGGSKPRLATPAVVARIAQLKDEYPALFAWE

IQHQLCTEGLCTQDKAPSVSSINRVLRALQEDQSLHWTQLRSPAVLAPVLPSPHSNCG

APRGPHPGTSHRNRTIFSPGQAEALEKEFQRGQYPDSVARGKLAAATSLPEDTVRVWF

SNRRAKWRRQEKLKWEAQLPGASQDLTVPKNSPGIISAQQSPGSVPSAALPVLEPLSP

SFCQLCCGTAPGRCSSDTSSQAYLQPYWDCQSLLPVASSSYVEFAWPCLTTHPVHHLI

GGPGQVPSTHCSNWP"

BASE COUNT 359 a 381 c 328 g 292 t

ORIGIN

1 tggcaggact gaagcagctg gaggctgtta caagaccaga ccaccagcaa accctggagc

61 ctgcacagga ccctgagacc tcttcctgga attcccacct tttttcctcc atccagaacc

121 agtcccaaag agaaacttcc agaaggagct ctccgttttc agtttgccag ttggcttcct

181 gtccttctgt gaggagtacc agtgtgaagc atgcagcagg acggactcag cagtgtgaat

1081 tccagtgaca cctcatccca ggcctatctc caaccctact gggactgcca atccctcctt

1141 cctgtggctt cctcctcata tgtggaattt gcctggccct gcctcaccac ccatcctgtg

1201 catcatctga ttggaggccc aggacaagtg ccatcaaccc attgctcaaa ctggccataa

1261 gaggcctcta tttgacagta ataaaaacct tttcttagat gttaaaaaaa aaaaaaaaaa

1321 aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa aaaaaaaaaa

//
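The fields of this record can be cross-checked against each other. The sketch below uses hypothetical helper names and only the two ORIGIN lines that cover the ends of the CDS; it confirms that the CDS location 211..1260 starts at `atg`, ends at a stop codon, has a length divisible by three, and that the BASE COUNT adds up to the 1360 bp of the LOCUS line.

```python
def parse_location(location: str):
    """Parse a simple 'start..end' feature location into 1-based integers."""
    start, end = location.split("..")
    return int(start), int(end)


def bases_by_position(origin_lines):
    """Map each 1-based position to its base, using the number that starts each ORIGIN line."""
    positions = {}
    for line in origin_lines:
        fields = line.split()
        first = int(fields[0])
        for offset, base in enumerate("".join(fields[1:])):
            positions[first + offset] = base
    return positions


origin_lines = [
    "181 gtccttctgt gaggagtacc agtgtgaagc atgcagcagg acggactcag cagtgtgaat",
    "1201 catcatctga ttggaggccc aggacaagtg ccatcaaccc attgctcaaa ctggccataa",
]

cds_start, cds_end = parse_location("211..1260")
bases = bases_by_position(origin_lines)
start_codon = "".join(bases[i] for i in range(cds_start, cds_start + 3))
stop_codon = "".join(bases[i] for i in range(cds_end - 2, cds_end + 1))
cds_length = cds_end - cds_start + 1
base_count = 359 + 381 + 328 + 292   # a, c, g, t from the BASE COUNT line

print(start_codon, stop_codon)     # atg taa
print(cds_length, cds_length % 3)  # 1050 0
print(base_count)                  # 1360
```

Real feature locations can be more complex (for example `join(...)` or `complement(...)`); this parser deliberately handles only the plain `start..end` form shown here.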

Page 24:

ID AF031150 standard; RNA; ROD; 1379 BP.

XX

AC AF031150;

XX

SV AF031150.1

XX

DT 27-FEB-1998 (Rel. 54, Created)

DT 27-FEB-1998 (Rel. 54, Last updated, Version 1)

XX

DE Mus musculus paired-box transcription factor (Pax4) mRNA, complete cds.

XX

KW .

XX

OS Mus musculus (house mouse)

OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;

OC Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.

XX

RN [1]

RP 1-1379

RA Inoue H., Nomiyama J., Nakai K., Matsutani A., Tanizawa Y., Oka Y.;

RT Isolation of full-length cDNA of mouse PAX4 gene and identification of its

RT human homologue;

RL Biochem. Biophys. Res. Commun. 243:628-633(1998).

XX

RN [2]

RP 1-1379

RA Inoue H., Nomiyama J., Nakai K., Tanizawa Y., Oka Y.;

RT ;

RL Submitted (23-OCT-1997) to the EMBL/GenBank/DDBJ databases.

RL Third Dept. of Int. Med., Yamaguchi University, 1144 Kogushi, Ube,

RL Yamaguchi 755, Japan

XX

FH Key Location/Qualifiers

EMBL flatfile

Page 25:

EMBL flatfile

FH Key Location/Qualifiers

FH

FT source 1..1379

FT /db_xref=taxon:10090

FT /organism=Mus musculus

FT /cell_line=MIN6

FT CDS 297..1346

FT /codon_start=1

FT /gene=Pax4

FT /product=paired-box transcription factor

FT /protein_id=AAC40046.1

FT /translation=MQQDGLSSVNQLGGLFVNGRPLPLDTRQQIVQLAIRGMRPCDISR

FT SLKVSNGCVSKILGRYYRTGVLEPKCIGGSKPRLATPAVVARIAQLKDEYPALFAWEIQ

FT HQLCTEGLCTQDKAPSVSSINRVLRALQEDQSLHWTQLRSPAVLAPVLPSPHSNCGAPR

FT GPHPGTSHRNRTIFSPGQAEALEKEFQRGQYPDSVARGKLAAATSLPEDTVRVWFSNRR

FT AKWRRQEKLKWEAQLPGASQDLTVPKNSPGIISAQQSPGSVPSAALPVLEPLSPSFCQL

FT CCGTAPGRCSSDTSSQAYLQPYWDCQSLLPVASSSYVEFAWPCLTTHPVHHLIGGPGQV

FT PSTHCSNWP

XX

SQ Sequence 1379 BP; 327 A; 402 C; 347 G; 303 T; 0 other;

aaaaaaaaaa aaaaagcggc cgctgaattc tagcagaagg ctgccctctg ctcctgagtg 60

aaggctctgt gaagctctgg accccctggc aggactgaag cagctggagg ctgttacaag 120

accagaccac cagcaaaccc tggagcctgc acaggaccct gagacctctt cctggaattc 180

ccaccttttt tcctccatcc agaaccagtc ccaaagagaa acttccagaa ggagctctcc 240

gttttcagtt tgccagttgg cttcctgtcc ttctgtgagg agtaccagtg tgaagcatgc 300

agcaggacgg actcagcagt gtgaatcagc tagggggact ctttgtgaat ggccggcccc 360

gctgtgggac agcaccaggc agatgttcca gtgacacctc atcccaggcc tatctccaac 1200

cctactggga ctgccaatcc ctccttcctg tggcttcctc ctcatatgtg gaatttgcct 1260

ggccctgcct caccacccat cctgtgcatc atctgattgg aggcccagga caagtgccat 1320

caacccattg ctcaaactgg ccataagagg cctctatttg acagtaataa aaacctttt 1379

//
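Every annotation line of an EMBL flatfile starts with a two-letter line code (ID, AC, DE, OS, FT, SQ, ...), with XX as a spacer, which makes the format easy to split by field. The following is a minimal sketch with illustrative names, using a few lines from the record above; it also checks that the base counts on the SQ line add up to the stated length.

```python
import re
from collections import defaultdict


def group_embl_lines(text: str):
    """Group EMBL flatfile lines by their two-letter line code, skipping XX and //."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, _, content = line.partition(" ")
        if len(code) == 2 and code not in ("XX", "//"):
            fields[code].append(content.strip())
    return fields


record = """ID AF031150 standard; RNA; ROD; 1379 BP.
XX
AC AF031150;
XX
OS Mus musculus (house mouse)
XX
SQ Sequence 1379 BP; 327 A; 402 C; 347 G; 303 T; 0 other;
//
"""

fields = group_embl_lines(record)
# Pick out the per-base counts from the SQ line, e.g. '327 A;'.
counts = {base: int(n) for n, base in re.findall(r"(\d+) (A|C|G|T|other);", fields["SQ"][0])}

print(fields["OS"])          # ['Mus musculus (house mouse)']
print(counts["C"])           # 402
print(sum(counts.values()))  # 1379
```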

Page 26:

ASN.1

Seq-entry ::= set {

class nuc-prot ,

descr {

title "Mus musculus transcription factor PAX4 (Pax4) mRNA, complete cds." ,

source {

org {

taxname "Mus musculus" ,

common "house mouse" ,

db {

{

db "taxon" ,

tag

id 10090 } } ,

orgname {

name

binomial {

genus "Mus" ,

species "musculus" } ,

lineage "Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;

Euteleostomi; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae;

Mus" ,

gcode 1 ,

mgcode 2 ,

div "ROD" } } } ,

pub {

pub {

sub {

authors {

names

std

Page 27:

Genome project of Rhodopseudomonas palustris

Sequencing and characterization of a 5 kb region.

(diploma thesis of Jana Prejdová, supervised by Jan Pačes)

a model example of the use of sequencing

Page 28:

DNA sequencing

Page 29:

connecting contigs

Page 30:

>jana (4797 nt)

GAATTCGCCGCGGGGCTGCGCATCACCGATGCCGCCACCATCGAGATCGTCGAGATGGTACTGGCCGGCTCGATCAACAAGCAGCTCGTCGGC

TACATCAACGAAGCGGGCGGCAAGGCCGTCGGCCTGTGCGGCAAGGACGGCAACATGGTGTCCGCCACCAAGGCGACGCGCACCATGGTCGAT

CCGGATTCGCGGATCGAAGAGGTGATCGACCTCGGTTTCGTCGGCGAGCCGGAGAAGGTCGACCTCACCCTGCTCAACCAGCTGATCGGCCAC

GAGTTGATCCCGGTGCTGGCGCCGCTGGCGACCTCCGCGTCGGGCCAGACCTTCAACGTCAATGCCGACACCTTTGCAGGTGCGGTTGCCGGT

GCGCTGCGGGCCAAGCGCCTGCTGCTGCTGACCGACGTGCCGGGCGTGCTCGACCAGAACAAGAAGCTGATCCCCGAACTGTCGATCAAGGAT

GCCCGCAAGCTGATCGCAGACGGCACCATCTCGGGCGGCATGATCCCCAAGGTCGAGACCTGCATCTACGCGCTCGAACAGGGCGTCGAAGGC

GTCGTCATCCTCGACGGCAAGGTCCCGCACGCAGTGCTGCTCGAATTGTTCACCAACCAGGGCACCGGCACGCTGATCCACAAGTGATGCGAG

GCTGCGGCGACAACATCCGTCATGGCCGGGCTCGTCCCGGCCATCCACGTCTTTCCGGCGGTTTTCTCAGCAAGACGTGGATGCCCGGCACAA

GGCCGGGCATGACGGGGTGGAGATCGCGCGCCCTCGCCGCCATTGTCACCACCCTCGCCCTCACCTCCGCCGCCCACGCCGACCTCAAGCTCT

GCAACCGCATGAGCTACGTGGTCGAGACGGCGATCGGGGTCGATTCCAACGGCACCACCGCCTCGCGCGGATGGCTGCGGATTGATCCGGCGC

AATGCCGGGTCGTGGTGCAAGGCGCGCTCAACGCCGACCGCATCATGCTGAATGCCCGCGCGCTGGCGGTGTACGGCGTCTCGCCGCTGCCGC

AGAACGGCACTGACCGGCTGTGCATTGCCGAAGACAATTTCGTCATCGCCGCCGCGCGGCAATGCCGCGGCGGCCAAACGCTCGCCGCCTTCA

CCGAGATCAAGCCCACCGACACCGAGGACGGCAACAAGATCGCTTATCTGGCGGAAGACTCCGGCTACGACGACGAACAGGCCAAACTCGCCG

CGATCCAGCGGCTGCTGGTGATCGCCGGTTACGACGCCTCGCCGATCGACGGCGTCGACGGCCCGAAGACGCAGGCCGCGCTGTCCGCCTTCC

TCAAGAGCCGAGGCCTGAAGCCCGAGATCGTCGATGCGCCGGATTTCTTCGACGTGATGATCAAGGCAGTGCAGCAGCCGTCCGGCAGCGGGC

TGACCTGGTGCAACGACACCAAGTACAAGATCATGGCGGCCGTCGGCGAAGACGACGGCAAGACTGTCACCAGCCGCGGCTGGTACGGTGTTG

CGCCCGGCCAATGCCTGCGCCCCGACCTCGGCGCACAGCCGAAGCGGGTGTTCAGCTTCGCCGAAGCGGTCGACGGCAGCGGCAGGCCGGTGA

CCATCAAGGGCCGTGCGCTGAACTGGGGCGGCGGCGTGACGCTGTGCACGCGTGACAGCAAGTTCGAGATCGGCGAGCAAGGCGATTGCGCGG

CGCGCGGCCTCGCCGCCACCGGCTTCGCCGCCGTCGATCTCAGTAGCGGCAAGACATTGAGGTTGTCCGCCCCATGATGCAGCTCGGCAAACG

CGGCTTCGATCACGTCGAGACCTGGGTGTTCGATCTCGACAACACGCTGTACCCGCATCACCTCAACCTATGGCAGCAGGTCGATGCGCGGAT

CCGCGACTTCGTCGCCGACTGGCTGAAGGTTTCGCCGGAAGAAGCCTTCCGTATCCAGAAGGATTACTACAAGCGCTACGGCACCACGATGCG

CGGGATGATGACCGAGCACGGCGTTCACGCCGACGACTACCTGGCTTATGTCCACGCCATCGACCATTCGCCGCTGCAGCCGAATCCGGCGAT

GGGCGATGCGATCGAGCGACTGCCGGGCCGCAAGCTGATCCTGACCAACGGCTCGACCGCCCATGCGGGCAAGGTGCTGGAGCGGCTCGGCAT

CGGCCATCATTTCGAGGCGGTGTTCGACATCATTGCGGCCGACCTCGAGCCGAAGCCGGCGCCGCAGACCTACCGCCGTTTTCTCGATCGCCA

TGGTGTCGACCCGGCCCGCGCCGCGATGTTCGAAGACCTCGCCCGCAACCTCACCGTGCCGCACCAGCTCGGCATGACCACCGTGCTGGTGGT

GCCTGACGATAGCCAGGACGTGGTCCGCGAAGATTGGGAGCTTGAAGGCCGCGACGCCGCCCACGTCGATCACGTGACTGATGATTTGACAGG

GTTCTTGGGGAAGCTGAGTTCGCTGTAGGCCGGGGACGCCTCCCAAGCGTCAATCGTCATCGCCGCCGGATGCAAGGCGGCTAGGTATTGCGG

AGCGCTCGCGATCTTCCGTCCAATGCCCTGGGATACTGGATCGCCCGGACGAGCCGGGCGACGACGTTGAAGAGAGATGACGTGGCGTCACCA

CATCCCCCGCCGTCATCGCCCGCGCAGGCGGGCGATGACTTGGCGGACGGGGCGGCGCCTTGACTCCGACCCGGCGAATCCGGACAACACTCC

GCAAGGACTGGACCACGCTGTTCTTCAGCTTTCGAGGTCGGATCAATCGCGCCAAATACTGGCTGGTCGGACTGATCTACGTCGCCGCCTGGA

TGG …

sequence in FastA

Page 31:

Leucine

          Rhodobacter capsulatus    Escherichia coli
codon     number    %               %
CUA       3         <1              4
CUC       119       16              9
CUG       458       60              52
CUU       157       20              10
UUA       0         0               11
UUG       27        3               13

how to find genes?
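Codon usage of this kind is obtained by counting in-frame codons and expressing each synonymous codon as a percentage of the total for that amino acid. The sketch below is an illustration with made-up names and a short made-up mRNA; the second part recomputes the percentages from the Rhodobacter capsulatus counts in the table, where a result may differ from the slide by a point because of rounding.

```python
from collections import Counter

LEU_CODONS = ("CUA", "CUC", "CUG", "CUU", "UUA", "UUG")


def leucine_usage(mrna: str) -> dict:
    """Percentage of each leucine codon among all in-frame leucine codons."""
    codons = Counter(mrna[i:i + 3] for i in range(0, len(mrna) - 2, 3))
    total = sum(codons[c] for c in LEU_CODONS)
    return {c: (100 * codons[c] / total if total else 0.0) for c in LEU_CODONS}


# Made-up reading frame: CUG CUG CUC AUG UUG
usage = leucine_usage("CUGCUGCUCAUGUUG")
print(usage["CUG"], usage["CUC"], usage["UUG"])   # 50.0 25.0 25.0

# Counts from the table for Rhodobacter capsulatus.
table_counts = {"CUA": 3, "CUC": 119, "CUG": 458, "CUU": 157, "UUA": 0, "UUG": 27}
table_total = sum(table_counts.values())
table_usage = {c: 100 * n / table_total for c, n in table_counts.items()}
print(table_total)   # 764
```

A strong preference for one codon, as for CUG here, is one of the signals that can help tell real coding regions apart from random open reading frames.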

Page 32:

genes

Page 33:

Sanger

Ch21 (in Nature)

cDNA

GENESCAN

EXOFISH

eukaryotic genes

Page 34:

which proteins are encoded by genes?

ja1 ja2 ja3 ja4 ja5 ja6

Page 35:

BLAST - search for relatives

Page 36:

which proteins are encoded by genes?

ja1   ACETYLGLUTAMATE KINASE                         EC 2.7.2.8
ja2
ja3
ja4   TETRAHYDRODIPICOLINATE N-SUCCINYLTRANSFERASE   EC 2.3.1.117
ja5
ja6   SUCCINYL-DIAMINOPIMELATE DESUCCINYLASE         EC 3.5.1.18

ja1 ja2 ja3 ja4 ja5 ja6

Page 37:

what function do these genes have in the cell?

Page 38:

what function do these genes have in the cell?

Page 39:

which proteins are encoded by genes?

ja4   TETRAHYDRODIPICOLINATE N-SUCCINYLTRANSFERASE   EC 2.3.1.117
ja5   ACETYLORNITHINE TRANSAMINASE                   EC 2.6.1.11
ja6   SUCCINYL-DIAMINOPIMELATE DESUCCINYLASE         EC 3.5.1.18

ja1 ja2 ja3 ja4 ja5 ja6

Page 40:

bioinformatics

Rhodopseudomonas palustris can synthesize the amino acid lysine in a biochemical pathway with the enzyme EC 2.6.1.17.

Page 41:

Credits

• This lecture was prepared using material from the following lectures:
– Jan Pačes and Jiří Vondrášek – Bioinformatika (UK Praha)
– Aplikovaná proteomika (UO Hradec Králové)