Simulation and Data Mining - Grupo de Geofísica … · 2007-09-25

Simulation and Data Mining: Two New Facets of Computational Modeling

Chris Stephens, Instituto de Ciencias Nucleares, UNAM
Seminario de Modelación Matemática y Computacional, 21/09/2007

What is a model?

“A mathematical model is an abstract model that uses mathematical language to describe the behavior of a system.” (Wikipedia)

… a representation of the essential aspects of a system that presents knowledge of the system in usable form

It must give information:
– qualitative: understanding and intuition
– quantitative: predictions

[Figure: map of modeling approaches and fields. Approaches: simulation, mathematical modeling, data mining. Disciplines: computer science, computational science. Axes: parametricity (low to high), deductivity (low to high), complexity (high to low). Fields placed on the map: physics, chemistry, engineering, hydrodynamics, dynamics, population genetics, biology, economics, complex systems, financial markets (simulation and data mining), microarrays, biodiversity, student performance.]

Models in the exact sciences

An example: the two-body problem with gravitational interaction

Newton's equations (relative coordinate in polar form, with reduced mass μ = Mm/(M+m) and angular momentum L):

$\ddot{r} - r\dot{\theta}^{2} = -\dfrac{G(M+m)}{r^{2}}, \qquad \mu r^{2}\dot{\theta} = L$

Exact, analytic solution:

$r(\theta) = \dfrac{A}{1 + e\cos\theta}$

Qualitative information: the orbits are conic sections: e = 0, circle; 0 < e < 1, ellipse; e = 1, parabola; e > 1, hyperbola

Quantitative information: r_min = A/(1+e); r_max = A/(1-e) (finite only for e < 1)
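The qualitative and quantitative readings of the orbit formula can be checked numerically. The following is a minimal sketch, assuming the solution r(θ) = A/(1 + e cos θ) given on the slide; the function names are illustrative and not part of the talk.

```python
import math


def orbit_radius(theta, A, e):
    """Radius r(theta) = A / (1 + e*cos(theta)) of the conic-section orbit."""
    return A / (1.0 + e * math.cos(theta))


def classify_orbit(e):
    """Qualitative information: the conic section selected by the eccentricity e."""
    if e == 0:
        return "circle"
    if e < 1:
        return "ellipse"
    if e == 1:
        return "parabola"
    return "hyperbola"


def turning_points(A, e):
    """Quantitative information: closest and farthest distances.

    r_max is infinite for unbound orbits (e >= 1).
    """
    r_min = A / (1.0 + e)
    r_max = A / (1.0 - e) if e < 1 else math.inf
    return r_min, r_max
```

For example, `turning_points(1.0, 0.5)` gives the two extremes of an ellipse, and `orbit_radius` evaluated at θ = 0 and θ = π reproduces them.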

Models in the life sciences

An example: population dynamics

• x(t+1) = r x(t)(1-x(t))
– x(t) is the relative population of an organism (relative to the maximum possible, so 0 < x < 1)
– r is the effective growth rate (0 < r < 4)
– the term x(t) gives positive feedback (birth rate) and (1-x(t)) gives negative feedback (death rate, due for example to finite resources)
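The map x(t+1) = r x(t)(1-x(t)) is simple enough to iterate directly. The sketch below is illustrative only: the helper names, the starting value, the transient length and the rounding precision are assumptions chosen for the example, not values from the talk.

```python
def logistic_map(r, x0, steps):
    """Iterate x(t+1) = r*x(t)*(1 - x(t)) and return the whole trajectory."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs


def attractor(r, x0=0.2, transient=1000, keep=64, decimals=6):
    """Distinct long-run values of the population after discarding a transient.

    One value means a fixed point, two values a period-2 cycle, and many
    values indicate higher periods or chaos.
    """
    xs = logistic_map(r, x0, transient + keep)
    return sorted(set(round(x, decimals) for x in xs[transient:]))
```

Calling `attractor` for increasing r (for instance 2.5, 3.2, 3.9) shows the population settling on a single value, then alternating between two values, then wandering over many values.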

An example: population dynamics

[Figure: plot with regions labeled “bifurcation” and “chaos”]

– c_i(t), v_i(t): position/direction vectors of a “particle”
– Competition between an effective repulsion and an effective attraction between “particles”
– Equation for “charged” particles following an external force g_i

And this model: does it belong to the exact sciences or to the life sciences?

Couzin, I.D., Krause, J., Franks, N.R. & Levin, S.A. (2005) Nature, 433, 513-516.

What are the differences and similarities between these models?

• How faithful are they?
• What degree of idealization is there?
• Do they give a quantitative as well as a qualitative description?
• What degree of approximation is there?
• Which phenomena of the systems they describe do they capture, and which do they miss?
• How many parameters do the models have?

They are simple parametric models that in some cases (physics) capture “all” of the dynamics and in other cases (biology) qualitatively model a single aspect of the system

Financial markets

AFM Model – The Market Mechanism

• One risky asset (no dividends), one riskless asset (no interest - “cash”)
• No short sales, no borrowing, uniform trade size (traders buy/sell/hold)
• Double Auction
– At time t list traders’ bids and offers (obtained from a Gaussian distribution centered on p(t-1)); every auction is a “tick”
– Match best bid with best offer at the midpoint price iff p_b(t) ≥ p_o(t)
– Excess demand/supply is determined only from bids and offers that are unmatched and overlap, i.e. p_b(t) ≥ p(t) and p(t) ≥ p_o(t)

AFM Model – The Traders

• N traders
• One-parameter family of trading strategies
– P(b) = 2d/3; P(h) = 1/3; P(o) = 2(1 - d)/3, d ∈ [0,1]
– Denote strategy by (100d, 100(1-d))
– d = ½: (50,50) “noise” trader
– Traders choose a strategy from this family
– Traders may dynamically adapt their strategy
• Portfolio for trader i at time t: n_i(t), C_i(t)
– Wealth W_i(t) = C_i(t) + n_i(t) p_i(t); C_i(t) is the riskless asset
• Learning included by “copycat” mechanism; copycat traders reproduce the best observed strategy in the market
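The one-parameter strategy family can be sampled directly from the stated probabilities P(b) = 2d/3, P(h) = 1/3, P(o) = 2(1 - d)/3. This is a minimal sketch; the function names and the action labels "bid", "hold" and "offer" are illustrative choices, not identifiers from the AFM model itself.

```python
import random


def action_probabilities(d):
    """Probabilities of bidding, holding and offering for strategy parameter d."""
    if not 0.0 <= d <= 1.0:
        raise ValueError("d must lie in [0, 1]")
    return {"bid": 2.0 * d / 3.0, "hold": 1.0 / 3.0, "offer": 2.0 * (1.0 - d) / 3.0}


def sample_action(d, rng):
    """Draw one action for a trader with strategy (100d, 100(1-d))."""
    probs = action_probabilities(d)
    return rng.choices(list(probs), weights=list(probs.values()))[0]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the illustration repeatable
    draws = [sample_action(0.9, rng) for _ in range(10_000)]
    print({a: draws.count(a) / len(draws) for a in ("bid", "hold", "offer")})
```

A (50,50) trader has all three probabilities equal to 1/3, which is why it acts as a “noise” trader; a (90,10) trader bids far more often than it offers.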

AFM Model – Price Dynamics

p(t+1) = p(t)(1 + η(D(t) – S(t)))
– D(t) = Demand
– S(t) = Supply
– η = tuning parameter

Market state: (C_i(t), n_i(t)), (p_i(t), X_i(t)), p(t)
– (C_i(t), n_i(t)): portfolio parameters
– (p_i(t), X_i(t)): position parameters (price and buy/sell/hold)
– X_i(t) = X(d_i(t)) = -1, 0, 1: stochastic variable that depends only on the strategy parameter d_i
– In principle: d_i(t) = F(risk preferences, utility, information set, price …)
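The price rule p(t+1) = p(t)(1 + η(D(t) – S(t))) can be written as a one-line update. In this sketch the demand and supply at each tick are taken as given inputs; the double-auction matching that produces them is not reproduced, and the function names are illustrative.

```python
def update_price(p, demand, supply, eta):
    """One tick of p(t+1) = p(t) * (1 + eta * (D(t) - S(t)))."""
    return p * (1.0 + eta * (demand - supply))


def price_path(p0, excess_orders, eta):
    """Apply the update over a sequence of (demand, supply) pairs."""
    prices = [p0]
    for demand, supply in excess_orders:
        prices.append(update_price(prices[-1], demand, supply, eta))
    return prices
```

Excess demand pushes the price up in proportion to η, excess supply pushes it down, and balanced ticks leave it unchanged.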

Efficient Markets

[Figures: number of traders with a given excess profit after 3001 ticks. Panels: 100 (90,10) traders; 100 (50,50) traders. Label: I(50,50,A)(50,50,B)(t,0)]

Divide traders into two groups, A and B, to see if there exists a relative inefficiency between them; A and B traders may have unknown beliefs

No statistically significant excess trading gains

Homogeneous markets are efficient (no relative informational advantage for any given trader group)

Inefficient Markets

50 (90,10) traders and 50 (50,50) traders

[Figures: number of traders with a given excess profit after 101 and 3001 ticks]

Apparent multi-modality: indication of excess profits? Signal or noise?

Multi-modality is evidence that informed traders are making profits at the expense of noise traders

Distributions separate at a speed that depends on (d(90,10) - d(50,50))

Relative informational advantages lead to inefficient markets

How does this simulation differ from the others?

• There are many parameters in this model (but few compared with the real system)
• Each object does not simply change its state; it can also change the dynamics that change that state!
• It changes strategy: “adaptation” (learning)
• It shows an emergent phenomenon: the efficiency of the market

Simulating “Evolution”

Evolution

Data mining

Data mining

• Data mining is the exploration and analysis of data in order to discover patterns, correlations and other regularities
– All the previous models we have seen can be thought of in terms of data mining

Here’s the data… data mining in one-dimension!

What would you do?

Here, there is no “law” or fundamental theory. We have to try to statistically infer relationships.

Do you think that the ROI only depends on spending?

Let’s make it a bit more interesting!

Data mining in two dimensions!

• Want to predict the probability to be in a class C given two “features” 1 and 2.
– E.g. What’s the probability for a client to spend $ C on a new product as a function of $ spent on two other products 1 and 2?

From past data we find this…

What model would you use to predict here? Multi-variate linear regression?

[Figure axes: income level (1 is highest), age level (5 is oldest), probability of health risk]

But what if we’d found this…?

What model would you use to predict now?

All we have to do is understand this “topography”

Sound easy?

Very intuitive

For good statistical inference we need the height function for all the feature vectors of the search space

So what’s the catch…?

Problem 1: The World is Noisy!

[Figure panels: low variance, high variance]

Are we sure this is a high point in the Predictability Landscape?

Solution: Obtain more data?

Problem 2: The Curse of Dimensionality

• Number of seconds in your lifetime: 2.5 x 10^9
• Number of atoms in this room: 10^25
• Number of atoms in universe: 10^80

Fastest computer in the world: IBM's BlueGene/L - 360 teraflops (1 teraflop = 10^12 floating point ops per sec)

• Number of possible responses to a 50 question survey with 1-10 scale answers: 10^50

So if everybody on the planet filled in a survey we’d still only be exploring less than one part in 10^40 of the search space

• Number of possible socio-demographic profiles obtained from 100 census-style socio-demographic variables divided into deciles: 10^100
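The orders of magnitude on this slide follow from exact integer arithmetic. The sketch below reproduces them; the world population of 10^10 is a rounded order-of-magnitude assumption, and the lifetime-operations figure is an added illustration that simply multiplies two numbers quoted on the slide.

```python
def search_space_size(n_variables, n_levels):
    """Number of distinct profiles for n_variables, each with n_levels values."""
    return n_levels ** n_variables


survey_space = search_space_size(50, 10)    # 50 questions on a 1-10 scale
census_space = search_space_size(100, 10)   # 100 variables split into deciles

# Assumption: about 10**10 people on the planet (order of magnitude only).
world_population = 10 ** 10
# Even if every person gave a different answer sheet, each sampled point
# would stand for this many unexplored ones.
unexplored_per_sample = survey_space // world_population

# 360 teraflops sustained over 2.5 x 10**9 seconds (one lifetime).
ops_per_lifetime = 360 * 10 ** 12 * 25 * 10 ** 8

if __name__ == "__main__":
    print(f"survey space:          10^{len(str(survey_space)) - 1}")
    print(f"census space:          10^{len(str(census_space)) - 1}")
    print(f"unexplored per sample: 10^{len(str(unexplored_per_sample)) - 1}")
    print(f"ops in a lifetime:     {ops_per_lifetime:.1e}")
```

Brute-force enumeration is hopeless in either case: the operation count for a whole lifetime on that machine is still dozens of orders of magnitude below the size of the survey space.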

Your data!

[Figure labels: “The possible data points”, “The Predictability peaks?”]

How do we infer the height of those points for which we have no information?

“Coarse graining”

• “Binning”/Grouping data
– E.g. 100 survey respondents, expect ~ 3 respondents for every age in years
– Bin the data: too few bins risks losing predictability and discrimination, too many risks statistically unreliable predictions
• “Ignoring” data
– Removing variables – but which ones?

“Coarse graining”

• Averaging or marginalizing data
– Introduce a new “symbol” “*”
• P(C | X) = P(C | x1 x2)
– E.g. C = high spending on autos, x1 = age, x2 = income
• P(C | x1 *) = Σ_{x2} P(C | x1 x2) P(x2 | x1)
– Probability to have high spending on autos given age x1, irrespective of income
• P(C | * x2) = Σ_{x1} P(C | x1 x2) P(x1 | x2)
– Probability to have high spending on autos given income x2, irrespective of age
• P(C) = P(C | * *) = Σ_{x1,x2} P(C | x1 x2) P(x1 x2)
– Probability to have high spending on autos irrespective of age or income
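In practice the “*” symbol amounts to pooling counts over the ignored feature. The sketch below uses a small hypothetical table of counts (the groups and numbers are invented for illustration) and shows that the pooled estimate is a weighted average of the fine-grained probabilities, with each term weighted by how often that feature value occurs, rather than a plain sum of them.

```python
# Hypothetical data: (age group, income group) -> (records, records in class C)
counts = {
    ("young", "low"): (40, 4),
    ("young", "high"): (10, 5),
    ("old", "low"): (20, 2),
    ("old", "high"): (30, 9),
}


def conditional(counts, x1=None, x2=None):
    """Estimate P(C | x1 x2); passing None for a feature plays the role of '*'."""
    n = n_c = 0
    for (a, b), (total, in_class) in counts.items():
        if (x1 is None or a == x1) and (x2 is None or b == x2):
            n += total
            n_c += in_class
    return n_c / n


def weighted_average_over_x2(counts, x1):
    """Same quantity as conditional(counts, x1), built from fine-grained terms."""
    n_x1 = sum(total for (a, _), (total, _) in counts.items() if a == x1)
    return sum(
        conditional(counts, a, b) * total / n_x1  # P(C | x1 x2) * P(x2 | x1)
        for (a, b), (total, _) in counts.items()
        if a == x1
    )
```

Here `conditional(counts, "young")` is P(C | x1 *), `conditional(counts, x2="high")` is P(C | * x2), and `conditional(counts)` is P(C | * *) = P(C).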

Determining important drivers

A useful statistical diagnostic:

[Figure: overlapping sets C and X inside a population of N records, with counts N_X, N_{C,X}, N_C and N. The formula for ε was not recovered; its parts are labeled “Signal” and “Noise”.]

P(C | X) = N_{C,X} / N_X

P(C) = N_C / N

e.g. X is age group 65-70, C is the top 5% of spending on denture cleaners. ε > 2 implies that being in the age group 65-70 is positively correlated in a statistically significant way with being in the top 5% of spenders
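The formula for ε itself is not in the recovered text. The sketch below therefore assumes a standard binomial z-score built from the quantities that are defined on the slide: the “signal” is the excess of observed over expected co-occurrences, N_X (P(C | X) - P(C)), and the “noise” is the binomial standard deviation sqrt(N_X P(C)(1 - P(C))). This form is an assumption that is consistent with the ε > 2 threshold, not a quotation from the talk, and the counts in the example are hypothetical.

```python
import math


def epsilon(n_x, n_cx, n_c, n):
    """Assumed diagnostic: binomial z-score of class C within feature group X.

    n_x  - records with feature X
    n_cx - records with feature X that are also in class C
    n_c  - records in class C
    n    - all records
    """
    p_c = n_c / n
    p_c_given_x = n_cx / n_x
    signal = n_x * (p_c_given_x - p_c)             # observed minus expected
    noise = math.sqrt(n_x * p_c * (1.0 - p_c))     # binomial standard deviation
    return signal / noise


if __name__ == "__main__":
    # Hypothetical counts: 10,000 clients, 500 in the top 5% of spenders,
    # 400 in the age group, 40 of whom are top spenders.
    print(round(epsilon(n_x=400, n_cx=40, n_c=500, n=10_000), 2))
```

Under this assumed form, a group whose rate equals the overall rate gives ε = 0, and values above about 2 correspond to roughly two standard deviations above random chance.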

And the “topographic” interpretation?

High values of ε for feature 1 and feature 2 together are associated with good statistical confidence that these points are significantly above the random chance plane

So what do data mining models do?

• They make “guesses” - statistical inferences - about the topography of the Predictability Landscape
• They are “templates” that we try to fit to the form of the landscape
• There are “zillions” of templates to try!
• We can fit to what we think is a high point on the landscape only to find with more sampling that it wasn’t really high (overfitting/variance)
• We can fit with a “biased” (parametric) model and miss structure

No “Magic Bullet”

• Out of the zillions of models NONE is “perfect”
• Why? Because “perfect” is multi-dimensional: predictability, discrimination, transparency, interpretability, robustness, portability, runtime, cost, simplicity …
• Each model can give a different perspective of the Predictability Landscape and of the problem at hand
– E.g. Neural networks can score high on predictability but low on transparency, simplicity and runtime
– Association rules can score high on interpretability but often low on predictability

Conclusions

• Simulation and data mining have been practiced ever since science began
• The computer has made possible much richer simulations (more parameters) of more complicated systems
• It has become possible to build simulations that differ intrinsically from their traditional counterparts (a dynamics in the space of “laws” as well as of states), which apply to biology, finance, etc. (adaptation and learning)
• Data mining deals with models where in principle many variables are involved, where we do not know the correlations between the variables a priori, and where there are no fundamental laws to guide us
• The non-parametric models characteristic of that area have little bias but have to be constructed empirically, “by hand”

An example: the three-body problem with gravitational interaction

If we go to three bodies there is no longer a general analytic solution!

An example: the three-body problem with gravitational interaction

And predictability…?

Genetic Dynamics

What’s Genetic Dynamics?

Population of “objects” – “genotypes”

[Figure: space of populations, with the evolution operator taking P(t) to P(t+1)]

General evolution equation: [equation not recovered]
– P(t) determines the state of the population at time t; Ω is the dimension of the space of states of an “object”; for linear chromosomes with binary alleles Ω = 2^N
– p represents a set of parameters associated with the evolution operator

Consider: mutation, selection, recombination

Abstractions of the principal Genetic Operators:

[Figure: bit-string examples of the operators, with labels “recombination point”, “cloning” and “mixing of genetic material”]

In mathematics…

That’s most of standard population genetics and evolutionary computation!

Finite population model determined by Markov chain. In the infinite population limit for haploids:

[Equation not recovered; implicit summation over repeated indices]

Terms of the equation:
– Probability to mutate genotype J to genotype I
– Probability to implement recombination, p_c
– Probability that, given recombination takes place, it is implemented with mode m, p_c(m)
– Probability to select genotype I
– Conditional probability for “child” J given “parents” K and L and a mode m

The two contributions:
– Select an object J, don’t recombine it with another, mutate it to object I
– Select two “parents” K and L, recombine them with respect to a recombination mode m applied with probability p_c p_c(m) to obtain a “child” J, then mutate it to object I
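The equation on this slide was not recovered, but the annotations describe a select, recombine, mutate pipeline that can be sketched for small chromosomes. The code below is an assumed instance, not the author's equation: it takes fitness-proportional selection, one-point crossover with every cut point equally likely (so p_c(m) = 1/(N-1)), and independent per-bit mutation with rate `mu`. All names are illustrative.

```python
def evolve(P, fitness, n_bits, p_c, mu):
    """One generation of infinite-population dynamics for genotype frequencies P.

    Genotypes are the integers 0 .. 2**n_bits - 1 read as bit strings;
    n_bits must be at least 2 so that a cut point exists.
    """
    omega = 2 ** n_bits

    # Selection (assumed fitness-proportional).
    mean_f = sum(f * p for f, p in zip(fitness, P))
    selected = [f * p / mean_f for f, p in zip(fitness, P)]

    # With probability 1 - p_c the selected object J is not recombined.
    mixed = [(1.0 - p_c) * s for s in selected]

    # With probability p_c two parents K and L are recombined with mode m
    # (assumed: one-point crossover, each cut point equally likely).
    cuts = range(1, n_bits)
    for K in range(omega):
        for L in range(omega):
            weight = p_c * selected[K] * selected[L] / len(cuts)
            for m in cuts:
                low = (1 << m) - 1
                child = (K & ~low) | (L & low)  # high bits from K, low m bits from L
                mixed[child] += weight

    # Mutation of J to I (assumed: independent bit flips with probability mu).
    new_P = [0.0] * omega
    for I in range(omega):
        for J in range(omega):
            d = bin(I ^ J).count("1")  # Hamming distance between I and J
            new_P[I] += mu ** d * (1.0 - mu) ** (n_bits - d) * mixed[J]
    return new_P


if __name__ == "__main__":
    n_bits = 3
    fitness = [1 + bin(g).count("1") for g in range(2 ** n_bits)]  # count of 1s
    P = [1.0 / 2 ** n_bits] * 2 ** n_bits
    for _ in range(50):
        P = evolve(P, fitness, n_bits, p_c=0.5, mu=0.01)
    print([round(p, 3) for p in P])
```

The double loop over parents K and L for every child is what makes the cost grow like Ω^3 in this basis, which is manageable for 3 bits and quickly becomes prohibitive.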

• Ω coupled non-linear difference equations
• Population genetics has spent the last 70 years trying to deal with them
– Go to reduced number of loci
• In object basis there are Ω^3 different λ_J^{KL} - that’s a lot!
• Most of them are 0!
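The scale of the problem is easy to make concrete. This sketch counts the equations (Ω = 2^N for N binary loci) and the recombination coefficients (Ω^3) for a few chromosome lengths; the function name and the chosen values of N are illustrative.

```python
def genetic_state_space(n_loci):
    """Return (number of genotypes, number of lambda coefficients) for binary alleles."""
    omega = 2 ** n_loci
    return omega, omega ** 3


if __name__ == "__main__":
    for n_loci in (3, 10, 100):
        omega, n_lambda = genetic_state_space(n_loci)
        print(f"N = {n_loci:3d}: Omega ~ 10^{len(str(omega)) - 1}, "
              f"Omega^3 ~ 10^{len(str(n_lambda)) - 1}")
```

For N = 100 the number of coupled equations alone is about 10^30, which is why reducing the number of loci or changing basis is needed before anything can be computed.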

Two Questions…

1. Can we “solve” them?
Put them on the computer. Not very feasible for N = 100!

2. Can we understand anything “qualitatively” from them?
How does genetic dynamics “work”? What are the effective degrees of freedom/collective modes?

Simulate the system