Simulación y Minería de datos - Grupo de Geofísica … · 2007-09-25 · abstracto que usa el...
Transcript of Simulación y Minería de datos - Grupo de Geofísica … · 2007-09-25 · abstracto que usa el...
Chris Stephens,Instituto de Ciencias Nucleares, UNAMSeminario de Modelación Matemática y Computacional 21/09/2007
SimulaciSimulacióónn y y MinerMinerííaa de de datosdatos: :
Dos Dos facetasfacetas nuevasnuevas de la de la modelacimodelacióónn computacionalcomputacional
¿Qué es un modelo?
“Un modelo matemático es un modelo abstracto que usa el lenguaje matemático para describir el comportamiento de un sistema.” (Wikipedia)…una representación de los aspectos
esenciales de un sistema que presenta conocimiento del sistema en forma usable
Debe dar información: cualitativa – entendimiento y intuicióncuantitativa - predicciones
CienciaInformática
Ciencia Computacional
Modelación matemática
Hidrodinámica
Minería de datos
Mercadosfinancieros Microarreglos
Biodiversidad
Desempeñoestudiantil
Baja Parametricidad Alta Baja Deductividad Alta
Alta Complejidad Baja
Dinámica
GenéticaPoblacional
SistemasComplejos
EconomíaBiología
Física
QuímicaIngeniería
Simulación
Mercados financieros(simulacion)
Un ejemplo: el problema de dos cuerpos con interacción gravitacional
θ LrrmMGrr
=
+−=−
θμ
θ2
22 /)(r
v Ecuaciones de Newton
Solución exacta, analítica: )cos1(
)(θ
θeAr
+=
Información cualitativa: las orbitas son secciones cónicas, e = 0, círculo; e < 1, elipse; e = 1, parábola; e > 1, hipérbola
Información cuantitativa: rmin= A/(1+e); rmax = A/(1-e)
Un ejemplo: dinámica de poblaciones
• x(t+1) = r x(t)(1-x(t))– x(t) es la población relativa de un organismo
(relativa al máximo posible entonces 0 < x < 1– r es la taza efectiva de crecimiento; (0 < r < 4)– el término x(t) da retroalimentación positiva
(taza de nacimiento) y (1-x(t)) de retroalimentación negativa (taza de muerte, debido por ejemplo a recursos finitos)
ci(t), vi(t) – position/direction vectors of a “particle”
Competencia entreuna repulsión y atracción efectivaentre “partículas”
Ecuación para partículas “cargadas”siguiendo una fuerza externa gi
Y este modelo – ¿de las cienciasexactas o de la vida?
Couzin, I.D., Krause, J., Franks, N.R. & Levin, S.A. (2005) Nature, 433, 513-516.
¿Qué son las diferencias y semejanzas entre estos modelos?
• ¿Cómo tan fieles son?• ¿Qué grado de idealización hay?• ¿Dan una descripción tanto cuantitativa como
cualitativa?• ¿Qué grado de aproximación hay?• ¿Qué fenómenos capturan y cuales no de los
sistemas que describen?• ¿Cuántos parámetros hay en los modelos?
Son modelos paramétricos simples que en ciertos casos (física) capturan “toda” la dinámica y en otros casos (biología) modelan cualitativamente un único aspecto del sistema
AFM Model – The Market Mechanism
• One risky asset (no dividends), one riskless asset (no interest - “cash”)
• No short sales, no borrowing, uniform trade size (traders buy/sell/hold)
• Double Auction– At time t list traders’ bids and offers (obtained from a Gaussian
distribution centered on p(t-1)); every auction is a “tick”– Match best bid with best offer at the midpoint price iff pb(t) ≥ po(t)– Excess demand/supply is determined only from bids and offers that are
unmatched and overlap, i.e. pb(t) ≥ p(t) p(t) ≥ po(t)
AFM Model – The Traders• N traders • One-parameter family of trading strategies
– P(b) = 2d/3; P(h) = 1/3; P(o) = 2(1 - d)/3 d € [0,1]– Denote strategy by (100d,100(1-d))– d = ½ (50,50) “noise” trader– Traders choose a strategy from this family– Traders may dynamically adapt their strategy
• Portfolio for trader i at time t - ni(t), Ci(t) – Wealth Wi(t) = (Ci(t) + ni(t)pi(t)); Ci(t) – riskless asset
• Learning included by “copycat”mechanism; copycat traders reproduce the best observed strategy in the market
AFM Model – Price Dynamics
p(t+1) = p(t)(1 + η(D(t) – S(t))) D(t) = DemandS(t) = Supplyη = tuning parameter
Market state - (Ci(t),ni(t)),(pi(t),Xi(t)),p(t)Portfolio parameters
Position ParametersBuy/sell/hold price
Xi(t) = X(di(t)) = -1,0,1 - stochastic variable that depends only on the strategy parameter di; In principle: di(t) = F(risk preferences, utility, information set, price …)
Efficient Markets100 (90,10) traders
100 (50,50) traders
Divide traders into two groups, A and B, to see if there exists a relative inefficiency between them; A and B traders may have unknown beliefs
No statistically significant excess trading
gains Homogeneous markets are efficient (no relative informational advantage for any given trader group)
I(50,50,A)(50,50,B) (t,0)
Graphs of # of traders with a given excess profit after 3001 ticks
Inefficient Markets 50 (90,10) traders and 50 (50,50) traders
Graphs of # of traders with a given excess profit after 101 and 3001 ticks
Apparent multi-modality – indication of excess profits? Signal or noise?
Multi-modality – is evidence that informed traders are making profits at the expense of noise traders
Distributions separate at a speed that depends on (d(90,10)- d(50,50)) Relative informational
advantages lead to Inefficient markets
¿Cómo difiere esta simulación de las otras?
• Hay muchos parámetros en este modelo (pero pocos comparado con el sistema real)
• Cada objeto no simplemente cambia su estado pero también ¡puede cambiar la dinámica que cambia ese estado!
• Cambia de estrategia - “adaptación”(aprendizaje)
• Muestra un fenómeno emergente – la eficiencia del mercado
Minería de datos
• Data mining is the exploration and analysis of data in order to discover patterns, correlations and other regularities– All the previous models we have seen can be
thought of in terms of data mining
Here’s the data… data mining in one-dimension!
What would you do?
Here, there is no“law” or fundamentaltheory. We have to tryand statistically inferrelationships.
Do you think that the ROIonly depends onspending?
Let’s make it a bit more interesting!
Datamining in two dimensions!
• Want to predict the probability to be in a class C given two “features” 1 and 2.– E.g. What’s the probability for a client to
spend $ C on a new product as a function of $ spent on two other products 1 and 2?
From past data we find this…
What model would you use to predict here?Multi-variate linear regression?
Income level(1 is highest)
Age level(5 is oldest)
Probabilityof health risk
All we have to do isunderstand this “topography”
Sound easy?
Very intuitive
For good statistical inference we need the heightfunction for all the feature vectors of the search space
So what’s the catch…?
Low variance High variance
Problem 1: The World is Noisy!
Are we surethis is a highpoint in the
PredictabilityLandscape?
Solution: Obtain more
data?
Problem 2: The Curse ofDimensionality
• Number of seconds in your lifetime: 2.5 x 109
• Number of atoms in this room: 1025
• Number of atoms in universe: 1080
Fastest computer in the world: IBM's BlueGene/L - 360 teraflops (1012 floating point ops per sec)
• Number of possible responses to a 50 question surveywith 1-10 scale answers: 1050
So if everybody on the planet filled in a survey we’d still only be exploring lessthan one part in 1040 of the search space
• Number of possible socio-demographic profiles obtained from100 census-style socio-demographic variables divided intodeciles: 10100
Your data!
The possible data points The Predictability peaks?
How do we infer the height of thosepoints for which we have no information?
“Coarse graining”
• “Binning”/Grouping data– E.g. 100 survey respondants, expect ~ 3
respondants for every age in yearsbin the data: too few bins risks losingpredictability and discrimination, too manyrisks statistically unreliable predictions
• “Ignoring” data– Removing variables – but which ones?
“Coarse graining”• Averaging or marginalizing data
– Introduce a new “symbol” “*”• P(C | X) = P(C | x1 x2 )
– E.g. C = high spending on autos, x1 = age, x2 = income
• P(C | x1 * ) = Σx2 P(C | x1 x2 ) – Probability to have high spending on autos given age x1
irrespective of income
• P(C | * x2 ) = Σx1 P(C | x1 x2 ) – Probability to have high spending on autos given income
x2 irrespective of age
• P(C) = P(C | * * ) = Σx1,x2 P(C | x1 x2 ) – Probability to have high spending on autos irrespective
of age or income
Determining important driversA useful statistical diagnostic:
C X
NX
NC,X
NC
N
P( C | X ) = NC,X / NX
P( C ) = NC / N
“Signal”
“Noise”
e.g. X is age group 65-70, C is the top 5% of spending on denture cleanersε > 2 implies that being in the age group 65-70 is positively correlated in a statistically significant way with being in the top 5% of spenders
And the “topographic”interpretation?
High values of ε for feature 1 and feature 2 together associated withgood statistical confidence that these points are significantly above
the random chance plane
So what do data miningmodels do?
• They make “guesses” - statistical inferences -about the topography of the PredictabilityLandscape
• They are “templates” that we try to fit to the formof the landscape
• There are “zillions” of templates to try!• We can fit to what we think is a high point
on the landscape only to find with more samplingthat it wasn’t really high (overfitting/variance)
• We can fit with a “biased” (parametric) modeland miss structure
No “Magic Bullet”• Out of the zillions of models NONE is “perfect”• Why? Because “perfect” is multi-dimensional:
predictability, discrimination, transparency, interpretability, robustness, portability, runtime, cost, simplicity …
• Each model can give a different perspective ofthe Predictability Landscape and of the problemat hand– E.g. Neural networks can score high on predictability
but low on transparency, simplicity and runtime– Association rules can score high on interpretability but
often low on predictability
Conclusiones• Se ha estado haciendo simulación y minería de datos
desde empezó la ciencia• La computadora ha permitido la creación de
simulaciones mucho mas ricas (mas parámetros) de sistemas mas complicados
• Se ha podido hacer simulaciones que difieren intrínsicamente de sus contrapartes tradicionales (una dinámica en el espacio de “leyes” tanto como de estados) que aplican a biología, finanzas etc. (Adaptación y aprendizaje)
• En la Minería de datos se trata de modelos donde en principio hay muchas variables involucrados donde no sabemos a priori las correlaciones entre las variables y no hay leyes fundamentales para guiarnos
• Los modelos no parametrizados características de esa área no tienen mucho cesgo pero tienen que estar construidos empiricamente “a mano”
Un ejemplo: el problema de tres cuerpos con interacción gravitacional
¡Si pasamos a tres cuerpos ya no hay solución analítica!
What’s Genetic Dynamics?Population of “objects” – “genotypes”
Space of populations
P(t)
P(t+1)
General evolution equation
determines the state of thepopulation at time t; Ω is thedimension of the space ofstates of an “object”; for linearchromosomes with binaryalleles Ω = 2N
p represents a set of parameters associated withthe evolution operator
Evolutionoperator
Consider:
mutationselection
recombination
000 0
0
0 111
1
000 0
0
0 111
0
000 0
0
0 111
1
101 0
0
1 111
0
100 0
0
0 111
1
001 0
0
1 111
0
+
+
recombination point
001 0
0
1 111
0
001 0
0
1 111
0
“cloning”
Mixing of genetic material
Abstractions of the principalGenetic Operators:
In mathematics…
That’s most of standard population genetics and evolutionary computation!
Finite population model determined by Markov chain. In the infinite population limitfor haploids:
Implicit summation over repeated indices
Probability to mutate genotype J to genotype I
Probability to implement recombination
Probability that given recombination takes place it is implementedwith mode m
Probability to select genotype I
Conditional probability for “child” J given “parents” K and L and a mode m
Select an object J
Don’t recombine it with another
Mutate it to object I
Select two “parents”K and L
Recombine them withrespect to a recombinationmode m applied with probabilitypcpc(m) to obtain a “child” J
• Ω coupled non-linear difference equations• Population genetics has spent the last 70 years
trying to deal with them• Go to reduced number of loci
• In object basis there are Ω3 different λJKL - that’s a lot!
• Most of them are 0!
Two Questions…
1. Can we “solve” them?Put them on the computer. Not very feasible for N = 100!
2. Can we understand anything “qualitatively”from them?
How does genetic dynamics “work”?What are the effective degrees of freedom/collective modes?