UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the...

81
Curs 2014-15 UPC APUNTS DE CLASSE: TEMA 3 RESPOSTA BINÀRIA

Transcript of UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the...

Page 1: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

Curs 2014-15

UPC APUNTS DE CLASSE: TEMA 3 RESPOSTA BINÀRIA

Page 2: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 2 Curs 2. 01 4- 2. 01 5

TABLE OF CONTENTS

3-1. TOPIC 3: BINARY RESPONSE DATA. BINOMIAL MODELS _________________________________________________________________ 3

3-1.1 COMPONENTS OF GENERALIZED LINEAR MODELS _____________________________________________________________________________ 3 3-1.2 CLASSIFICATION OF STATISTICAL LINEAR MODELS ____________________________________________________________________________ 5

3-2. BINOMIAL MODELS FOR BINARY DATA _______________________________________________________________________________ 12

3-2.1 LINK FUNCTIONS _______________________________________________________________________________________________________ 13

3-3. INTERPRETING MODEL PARAMETERS ________________________________________________________________________________ 19

3-4. ESTIMATION OF MODEL PARAMETERS ________________________________________________________________________________ 21

3-5. GOODNESS OF FIT ____________________________________________________________________________________________________ 23

3-5.1 ROC CURVE AND CONFUSION MATRICES ___________________________________________________________________________________ 30

3-6. MODEL DIAGNOSTICS ________________________________________________________________________________________________ 33

3-6.1 RESIDUALS IN GLMZ ___________________________________________________________________________________________________ 33 3-6.2 INFLUENCE DATA IN GLMZ ______________________________________________________________________________________________ 36 3-6.3 GRAPHICS FOR DIAGNOSTICS _____________________________________________________________________________________________ 37

3-7. EJEMPLO 1: ACCIDENTES CON HERIDOS SEGÚN USO DEL CINTURÓN – AGRESTI (2002) __________________________________ 38

3-8. EJEMPLO 2: PARTICIPACIÓN LABORAL FEMENINA EN CANADA (FOX) __________________________________________________ 67

Page 3: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 3 Curs 2. 01 4- 2. 01 5

3-1. TOPIC 3: BINARY RESPONSE DATA. BINOMIAL MODELS

3-1.1 Components of generalized linear models Generalized linear models are extensions of the classical multiple regression models.

Let ( )nT yy ,,y 1= be a vector of n components, sample of a random vector ( )n

T YY ,,Y 1= ,

with statistical independent components, identically distributed with means ( )nT µµ ,,1 =µ :

• The random component assumes that mutual Independence holds and each random variable in ( )n

T YY ,,Y 1= belongs to the exponential family with one parameter distributions with

jointly expected values [ ] µ=YΕ .

At desaggreted leved for each individual observation, the response is dycothomic and we are concerned with Bernoulli distribution. For grouped data, binomial distributions are suitable.

• The systematic component in the model model specifies a vector η , the predictor lineal vector is a

linear combination from a limited number of explicative variables ( )pXX ,,X 1= or regressors

and parameters ( )pT ββ ,,1=β

to be estimated. In matrix notation, βη X= where η is

nx1, X is nxp and β is px1.

Page 4: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 4 Curs 2. 01 4- 2. 01 5

THEME 3: BINARY RESPONSE DATA. BINOMIAL MODELS

• For each observation, the expected value µ is related to the linear predictor η , through the link

function, notated as g(. ), and thus ( )µη g= .

Response function is: ( )ηµ 1g−=

In ordinary least squares models for normal data, the identity link is used µη = .

For binary data, several link functions are commonly used and will be detailed in a forthcoming section.

Since ML estimates ( )iiTii gxi ηµβηβ ˆˆˆˆˆ 1−=→=∀→

Page 5: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 5 Curs 2. 01 4- 2. 01 5

THEME 3: BINARY RESPONSE DATA. BINOMIAL MODELS

3-1.2 Classification of statistical linear models

Explicative Variables

Response Variable Dicothomic or

Binary Polythomic Counts

(discrete) Continuous

Normal Time between events

Dicothomic Contingency tables Logistic regression Log-linear models

Contingency tables Log-linear models

Log-linear models

Tests for 2 subpopulation means: t.test

Survival Analysis

Polythomic Contingency tables Logistic regression Log-linear models

Contingency tables Log-linear models

Log-linear models

ONEWAY, ANOVA

Survival Analysis

Continuous (covariates)

Logistic regression * Log-linear models

Multiple regression

Survival Analysis

Factors and covariates

Logistic regression * Log-linear models

Covariance Analysis

Survival Analysis

Random Effects

Mixed models Mixed models Mixed models

Mixed models Mixed models

Page 6: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 6 Curs 2. 01 4- 2. 01 5

3-2. INTRODUCTION TO BINARY RESPONSE DATA. BINOMIAL MODELS

This variable appears when given a sample each individual holds or does not hold a target characteristic of the study and this is codified as (Y=1) or not (Y=0).

For example, on mode selection in transportation models, one might be interested in the modal choice between public (metro, bus, light train, etc) or private modes (car, motorcycle, etc) in the study area for home to work trips. In those models, the response variable can be defined for a commuter as Y=1 (positive response or success, for example public modes), or Y=0 (negative response or fail, in our case private modes).

It is possible the extension to more than 2 levels or categories in the response variable.

The probability of success is notated by π , in such a way that,

( ) kkYP π== 1 : Probability of positive response (success) for kth individual unit in sample.

( ) kkYP π−== 10 : Probability of negative response (fail) for kth individual in sample.

Each individual in the sample is characterized by a set of variables some covariates (income, age)

some of them factors (gender, grades, etc) that defines: ( )pxx 1=Tkx .

Page 7: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 7 Curs 2. 01 4- 2. 01 5

3-2. INTRODUCTION TO BINARY RESPONSE DATA. BINOMIAL MODELS

Explicative variables that will form the linear predictor ( )pxx 1=Tkx might be:

• Quantitative variables or covariates.

• Transformation of original variables .

• Polynomial regressors build from covariates.

• Dummy variables to represent factors.

• Dummy variables to represent interaction between factors and covariates.

Por example, in the public-private binary modal choice model, for each commuter variables as income, gender, car availability, distance to local public transport, value of time, etc.

In this subject, the goal relies on studying the relationship between the response variable y and the

explicative variables in order to model the probability of positive response: ( )xππ = .

In design of experiment groups of individual units are defined, and each group receives a combination of experimental conditions in such a way that are shared by all the unit in the group, in general, factors are considered as explicative variables, and kth experimental condition is modeled by a common set of values for all the explicative variables of individual units in the group

( )pxx 1=Tkx and thus apply to km individual units.

Page 8: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 8 Curs 2. 01 4- 2. 01 5

3-2. INTRODUCTION TO BINARY RESPONSE DATA. BINOMIAL MODELS

The total number of units in the sample is the sum of the size of the groups and thus, the number of experimental conditions or groups is defined by n and the total number of units is

nmmN ++= 1 .

Each group or combination of experimental conditions defines a covariate class where all individual units belonging to the covariate class share the same values for explicative variables.

The former difference between individual and covariate class is critical when specifying data to statistical packages. In general, both representations fully desaggregated (each individual outcome is detailed) or aggregated at some level according to covariate classes are allowed:

1. Some analysis methods are well suited for aggregated data, and perform badly when applied to disaggregated data, for example asymptotic approximations to normality.

2. Asymptotic approximations for aggregated data are based either on the asymptotic evolution of the number of covariate classes (or groups) ( ∞→m ) or on the total number of individual units (

∞→N ).

3. Disaggregated data is suitable for asymptotic approximations based on the total size.

Page 9: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 9 Curs 2. 01 4- 2. 01 5

3-2. INTRODUCTION TO BINARY RESPONSE DATA. BINOMIAL MODELS

… Let us work with a simple example to see differences in the representation…

Disaggregated data Aggregated data Individual unit Variables Response Covariate class Size of the

class Positive

responses 1 (male, 1) 0 (1,1) 2 1

2 (male,2) 1 (1,2) 3 2

3 (male,2) 0 (2,1) 1 0

4 (female,1) 0 (2,2) 1 1

5 (female,2) 1

6 (male,2) 1

7 (male,1) 1

Former table shows an experiment consisting on dicothomic factors A and C and thus n=4=2x2 is the number of covariate classes, but the total number of individuals is N=7 . In our example, factor A can be gener (two levels, coded as male and female) and factor C is the car availability (1 car or more than 1).

Page 10: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 1 0 Curs 2. 01 4- 2. 01 5

3-2. INTRODUCTION TO BINARY RESPONSE DATA. BINOMIAL MODELS

… Aggregated versus disaggregated data …

More efficient and less memory consumption when aggregated data is considered. It simplifies significant effect detected at a glance.

Aggregated data implies serial order is lost. If additional variables are present, only average values can be considered possibly leading to ecological fallacy situations.

Aggregated data implies a response variable model of the binomial type, since sample observed

positive responses are nn mymy ,,11 , where kk my ≤≤0 being the number of positive

responses in kth covariate class which size is km .

Size of covariate classes in a vector form are called binomial index vector and notated by ( )nmm 1=m . For disaggregated data, each individual unit defines a binomial response for a

group of size 1 and thus, ( )11 =m .

Page 11: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 1 1 Curs 2. 01 4- 2. 01 5

3-2. INTRODUCTION TO BINARY RESPONSE DATA. BINOMIAL MODELS

When factor levels define covariate classes, as in our example, a contingency table seems a good representation for aggregated data, our convention for future subject is stated as: Y response (in columns), factor C (subtable) and factor A (rows):

FACTOR C C1 =1 CK=2 =2

FACTOR A FACTOR B – Respuesta Y FACTOR B – Respuesta Y TOTAL

B1 Y=0 BJ=2 Y=1 SUBTOTAL B1 Y=0 BJ=2 Y=1 SUBTOTAL A1 = male 1 1 2 1 2 3 5

AI=2 =female 1 0 1 0 1 1 2

SUBTOTAL 2 1 1 3

TOTAL 3 4 7

Page 12: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 1 2 Curs 2. 01 4- 2. 01 5

3-2. BINOMIAL MODELS FOR BINARY DATA

Usually considered and defined in introductory courses to statistical analysis or theory of probability:

Let ( )π,mBY ≈ a binomial variable that models the number of positive responses in m independent trials in a Bernoully process and thus each one with a common probability π .

[ ]( )

( )

[ ][ ] ( )ππ

π

1

0ππ

0

)(

π)(1π)(

−⋅⋅=⋅=

>

≤≤−

<

=

===

∑=

1

1

0

0

mYVmY

my

myim

y

yF

ym

yYPyp

y

i

imiY

ymyY

Ε

Page 13: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 1 3 Curs 2. 01 4- 2. 01 5

3-2. BINOMIAL MODELS FOR BINARY DATA

3-2.1 Link functions The goal consists on stablishing a functional relationship between the probability of a positive result

π and the vector of explicative variables (predictors in general, covariates if they are continuous)

( )pxx 1=Tx : ( )xππ = .

• In Generalized Linear Models, the link function relates the linear predictor scale with the expected value of the probabilistic variable selected to model the random response. In the case of a binomial model concerned with the probability of positive response for a dicothomic individual response, the

linear predictor η might for a given observation any value in the real axis, but the probability of positive answer belongs to the open interval (0, 1).

Vector π is related to the linear predictor η , through the link function, notated as g(. ),

( )πη g= , π is nx1.

Canonic link for binomial data is the logit function: ( )πθη logit== .

Page 14: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 1 4 Curs 2. 01 4- 2. 01 5

3-2. BINOMIAL MODELS FOR BINARY DATA

Logit link function is the most frequently used link, but one should understand the role of link functions and do not act authomatically.

Some common link function for binary response data are:

1. The logit link (sometimes bad-called logistic

link): ( ) ( )

−===

ππππη

11 loglogitg .

and ( ) ( ) ( )( )η

ηηηπexp

exp+

== −

11

11 g ; this is the

distribution function for a standard logistic variable, whose density function is

( ) ( ) ( )( )( )2

11 1 η

ηηexp

exp'+

=−g with 0 mean (position

parameter) and variance 32π (scale parameter 1), this is a continuous and symmetric variable, quite similar to normal distribution.

Page 15: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 1 5 Curs 2. 01 4- 2. 01 5

3-2. BINOMIAL MODELS FOR BINOMIAL DATA : LINK FUNCTIONS

… link function for binary response data:

2. Probit link or standard normal inverse: ( ) ( )ππη 12

−== Φg and ( ) ( ) ( )ηηηπ Φ== −122 g .

Standard normal (mean 0 and variance 1).

3. Log-log complementary link ( ) ( )( )( )π1πη −== 1loglog3g . Where the response function is,

( ) ( ) ( )( )ηηηπ expexp −−== − 1133 g .

This link function is the inverse of the distribution function for the minimum extreme value law (Gompertz law), with position parameter 0 and scale parameter 1, giving an expected value of e=-0.577216 (first derivative of gamma function evaluated at 1) and variance 62π .

4. Log-log link ( ) ( )( )ππη 1loglog4 −== g , where the response function is

( ) ( ) ( )( )ηηηπ −−−== − expexp1144 g . This link function is the inverse of distribution of

probability function for the maximum extreme value law (Gumbel law) ), with position parameter 0 and scale parameter 1, giving an expected value of e=0.577216 (-first derivative of gamma function evaluated at 1) and variance 62π .

All former link function are related to well-known inverse of distribution functions for continuous variates.

Page 16: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 1 6 Curs 2. 01 4- 2. 01 5

3-2. BINOMIAL MODELS FOR BINOMIAL DATA : LINK FUNCTIONS

Logit link:

( ) ( ) ( )( )η

ηηηπexp

exp+

== −

11

11 g y ( ) ( ) ( )( )( )

( ) ( )( )ηπηπη

ηη 1121

1 11

−=+

=−

expexp'g

In general,

( ) ( )βTixΡΡ == ii ηπ ,

where P(.) indicates some distribution function for continuous variables and transforms real values

for the linear predictor to the [ ]1,0interval.

Transformations should depend on characteristics of data and not to be selected in an straight forward way.

0 0.2 0.4 0.6 0.8

1 1.2

-4

-3.1

4

-2.2

8

-1.4

2

-0.5

6

0.3

1.16

2.02

2.88

3.74

Prob

abili

ty

Linear Predictor

Link Logt

PI_1(ETA)

D_PI_1(ETA)

Page 17: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 1 7 Curs 2. 01 4- 2. 01 5

3-2. BINOMIAL MODELS FOR BINOMIAL DATA : LINK FUNCTIONS

logit and probit links can be seen related to changes in scales: Probability

π Odds

ππ−1

Log-odds

βx=

−ππ

1log

Probit

( ) βx=Φ− π1

C_log-log

βx=

−π1

1loglog

Log-log

βx=

π1loglog

0,01 0,0101 -4,5951 -2,3263 -4,60015 -1,52718 0,05 0,0526 -2,9444 -1,6449 -2,97020 -1,09719 0,10 0,1111 -2,1972 -1,2816 -2,25037 -0,83403 0,15 0,1765 -1,7346 -1,0364 -1,81696 -0,64034 0,20 0,2500 -1,3863 -0,8416 -1,49994 -0,47588 0,25 0,3333 -1,0986 -0,6745 -1,24590 -0,32663 0,30 0,4286 -0,8473 -0,5244 -1,03093 -0,18563 0,50 1,0000 0,0000 0,0000 -0,36651 0,36651 0,70 2,3333 0,8473 0,5244 0,18563 1,03093 0,75 3,0000 1,0986 0,6745 0,32663 1,24590 0,80 4,0000 1,3863 0,8416 0,47588 1,49994 0,85 5,6667 1,7346 1,0364 0,64034 1,81696 0,90 9,0000 2,1972 1,2816 0,83403 2,25037 0,95 19,0000 2,9444 1,6449 1,09719 2,97020 0,99 99,0000 4,5951 2,3263 1,52718 4,60015

Page 18: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 1 8 Curs 2. 01 4- 2. 01 5

3-2. BINOMIAL MODELS FOR BINOMIAL DATA : LINK FUNCTIONS

Relantionship between log-log and c-log-log link functions:

( ) ( )π1π1

1π1

1π −−=

−−=

−= 43 gg loglogloglog

All former link functions are continuous and monotone increasing functions in the (0,1). logit and probit function show an almost linear relationship in the 0.1 to 0.9 subinterval.

For low probabilities logit and complementary log-log functions are quite similar.

For probabilities closed to 1, complementary log-log trends slowly to 1 than logit function does.

For probabilities closed to 1, logit and log-log links are quite similar.

Page 19: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 1 9 Curs 2. 01 4- 2. 01 5

3-3. INTERPRETING MODEL PARAMETERS

Assume a logit link.

The linear predictor contains 2 covariates X1 and X2, so the log-odd of a positive response would be:

[ ] βTxlog =

=

−=

2

1

βββ

π1πη

0

211 xx

In the odd scale for a positive response:

( ) ( ) ( )22110 xx βββηπ1

π++===

−expxexpexp T β

In terms of the probability scale, for a positive response given a logit link, and its response function ( )ηπ 1

1−= g one has,

( )( )

( )( )

( )( )22110

22110

111 xxxx

ββββββ

ηηπ

+++++

=+

=+

=exp

expxexp

xexpexp

expT

T

ββ

Page 20: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 20 Curs 2. 01 4- 2. 01 5

INTERPRETING MODEL PARAMETERS

… and thus, the negative response probability is,

( ) ( ) ( )2211011

11

11

xx βββηπ1

+++=

+=

+=−

expXexpexp β

In the log-odds scale a change from the reference level of factor C to the alternative level modeled

through the dummy variable x2 whose parameter is 2β :

1. Would have an effect of a change in the log odds scale of a positive response equal to the estimate

for the model parameter 2β .

2. In the odd scale, the effect would be multiplicative, a changing from the reference level of C to the alternative level represents to multiply the odd for the reference level by exponential of the

estimated parameter for the dummy variable ( )2βexp , all else being equal, and this means leaving all the other variables in the same values.

3. Interpreting in the probability scale is much more difficult since the effect depends not only of factor C levels, but on the rest of model variables and thus, only an approximation can be given. We

can use the partial derivative effect on the positive response π is ( ) 2

2

βπ1π∂∂π

−=x , that

means a greater effect when the positive response probabilityπ is closer to 0.5 and not to 0 or 1.

Page 21: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 21 Curs 2. 01 4- 2. 01 5

3-4. ESTIMATION OF MODEL PARAMETERS

The estimation process relies on an unconstricted maximization of the log likelihood function,

( ) ( )∑=

=n

ipiyfMax

1,log βyβ,β , ( )pββ ,,1 =Tβ and ( )n

T yy ,,y 1= .

The iterative process to compute the estimates is called the method of the scores, a second order method Newton-type, specialized to the properties of the log likelihood function. The method converges fast, but it is not globally convergent.

Existence and unicity for estimates under any of the formerly presented link functions, if ii my <<0 for any covariate class/observations.

The quality of the initialization is not usually very important, since the algorithm shows fast convergence properties, but it is not globally convergent- so an extreme initial point might lead to divergence.

Page 22: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 22 Curs 2. 01 4- 2. 01 5

3-4 ESTIMATION OF MODEL PARAMETERS

There are some examples, easy to build, where convergence of the estimates does not hold.

When some values for variables perfectly separate positive from negative responses, then observations

have a null probability or full positive response, 0=iy o ii my = .

Estimates for such parameters β do not converge, but fitted values π̂ and deviance tend to a limiting value. For example,

( )( ) 1

exp1exp11

21

212 →

+++

=∞→⇒=⇒=i

iiii x

xxifββ

ββπβπ

Or

( )( ) 0

exp1exp00

21

212 →

+++

=−∞→⇒=⇒=i

iiii x

xxifββ

ββπβπ

Page 23: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 23 Curs 2. 01 4- 2. 01 5

3-5. GOODNESS OF FIT

Let β̂ the estimates of the model parameters, thus a linear predictor for each observation i might be

computed βxˆ Ti=iη and thus through the response function (inverse of the selected link function)

fitted values can be computed: ( )ii g ηπ ˆˆ 1−= .

The scaled deviance can be computed, ( ) ( )y,ˆy)(y,ˆy,' µµ 22 −=D .

And so, the deviance that under the binomial distribution is identical, since 1φ =

( ) ( ) ( )µµµ ˆy,'φˆy,'ˆy, DDD == if ( )iii mBY π,≈

The saturated model y)(y, implies fitted probabilities as observed probabilities, notated in the

following as i

ii m

y=π~ , for all i = 1,…n.

Page 24: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 24 Curs 2. 01 4- 2. 01 5

BINOMIAL MODELS FOR BINARY DATA: GOODNESS OF FIT

Expression for the deviance:

( ) ( ) ( ) ( )( )∑

=

−−

−+

==

n

i iii

iiii

ii

ii mm

ymymm

yyDD1

2ππ ˆ

logˆ

logˆy,ˆy, πµ

Sometimes, the deviance statistic is ,

∑ ∑=

=negativepostive

n

i i

ii e

ooD, 1

log2 where,

1. Observed values for positive response for observation i, ii yo = .

2. Observed values for negative response for observation i, iii ymo −= .

3. Expected positive responses for observation i, iii me π̂= .

4. Expected negative responses for observation i, iiii mme π̂−= .

Page 25: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 25 Curs 2. 01 4- 2. 01 5

BINOMIAL MODELS FOR BINARY DATA: GOODNESS OF FIT

Assimptotic distribution for model (M) with p parameters ( )π̂,YDDM = is 2

pn−χ (not to be

confused with 2

pN −χ ). Assymptotic conditions are

Thus, a goodness of fit test can be formulated as H0 “The current model fits properly the data” and

the p value for the test is ( ) valuepDP Mpn _2 =>−χ :

If pvalue <<0.05 then there is evidence to reject H0 and thus, the model (M) does not fit properly data. There is an statistical evidence of discrepancy between observations and fitted values provided by model (M).

If pvalue >> 0.05 then there is not evidence to reject H0 and thus, H0 is accepted leading to the conclusion that model (M) does fit properly data, since discrepance between observed and fitted values is not significative in statistical terms.

AIC (Akaike Information Criteria, 1974) is defined as a trade-off between a goodness of fit provided by a model (M) and the number of parameters p in the model (as an indicator for model

complexity) kaike. Let M be a model with p parameters ( )( )pAIC MM +−= y,ˆ2 π . Models with minimum AIC are preferred.

Page 26: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 26 Curs 2. 01 4- 2. 01 5

BINOMIAL MODELS FOR BINARY DATA: GOODNESS OF FIT

In order to consider sample size, another statistic named BIC (Bayesian Information Criteria) is

proposed (in SAS©), Schwartz criteria ( ) npBIC B logy,ˆ2 +−= πB . Minimum BIC models are preferred (AIC(model,k=log(n)).

AIC and BIC might be used to compare unnested models.

Following McCullagh, the test for Generalized LM equivalent to F.Test in classical linear regression, consists on comparing differences on scaled deviance on 2 hierarchical models (nested models):

Let MA be a model with q parameters nested in model MB with p > q parameters, let Aπ̂ and Bπ̂ fitted probabilities for both models, Such that, the set of parameters for MB are those common to

MA and those specific; i.e., ( )TT21 , βββ Τ

Β = and ( )T1ββ Τ

Α = with dim( Aβ )=q<p, then

( ) ( ) ( ) ( )y,ˆ2y,ˆ2ˆ,ˆ, ABBAAB DDD ππππ −=−=∆ yy is asymptotically distributed 2

qp−χ.

And for testing 0=Η 20 β:

( )

Η<<Η<<

→∆>− AcceptedRejected

0

02

αα

χ ABqp DP

It is a contrast for multiple coefficients! Large values indicate non-equivalence of models

Page 27: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 27 Curs 2. 01 4- 2. 01 5

BINOMIAL MODELS FOR BINARY DATA: GOODNESS OF FIT

Classical t de Student test for linear regression has the equivalence in Wald test for binomial models:

H0: jj ββ ˆ= then

( )100 ,ˆ

ˆ.

ˆ

NasintZj

jj

βσ

ββ −= , if H0 is true.

Bilateral asymptotic confidence interval at level ( )%α-1100 for a model parameter is

jzj βα σβ ˆ/ ˆˆ

2± , where 2/α-1z is N(0,1) distributed

Wald Test is natural in a maximum likelihood estimation framework since asymptotically :

( )10,ˆ −ℑ≈− pNββ ,

where [ ]TUUΕ=ℑ is Fisher Expected Information Matrix (score variance), using as an approximation

WXXT belonging to last iteration (convergence) of Method of the Scores

Then , ( ) ( ) 2p

Tχββββ ≈−ℑ− ˆˆ , Wald statistic is W= ( ) ( )ββββ −ℑ− ˆˆ T

.

Page 28: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 28 Curs 2. 01 4- 2. 01 5

BINOMIAL MODELS FOR BINARY DATA: GOODNESS OF FIT

Multiple hypothesis testing is also possible with Wald Test

00 : ββ =Η through ( ) [ ] ( ) 20

1

0ˆˆˆ

p

TVW χ≈−−=

−βββββ .

Or ( )TTT21 , βββ = dim( 2β )=q<p and 0=Η 20 : β then [ ] 2

2

1

22ˆˆˆ

qTVW χ≈=

−βββ .

If dim( 2β )=1 then 0: 20 =Η β : [ ] ( )1,0ˆ

ˆ

2

2 NV

z ≈=β

β.

Deviance for a GLM plays a role similar to Residual Sum of Squares in classical regression, thus it allows to define a Generalized R2, or pseudo R2

( )( )

( )( ) ( ) ( ) ( ) ( )AA

AA

AA DDGwhereDG

GDDR πππ

πππ

ππ ,,,

,,,

,,1 0

0

2 yyyyy

yyy

−=+

=−= ,

10 2 ≤≤ R In R software: library(lmtest) waldtest(modelA, modelB, ..., test = c("F", "Chisq")) # Wald Test anova(modelA, modelB, ..., test = c("F", "Chisq")) # Deviance Test

Page 29: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 29 Curs 2. 01 4- 2. 01 5

BINOMIAL MODELS FOR BINARY DATA: GOODNESS OF FIT

Goodness of fit by generalized Pearson X2 statistic, asymptotically distributed as:

( )( )

( )( )

( )

−=

−−

=−

−= ∑∑∑∑

−+ === , 1

2

1

2

1

22

μ̂μ̂μ̂

π̂π̂π̂ n

i i

iin

i iii

iiin

i iii

iii

eeo

mym

1mmyX

When disaggregated data is used then mi = 1 and asymptotic theory does not apply, as a rule of thumb at a fully disaggregated level for data, a model is good if Deviance or Pearson X2 statistics are similar to degrees of freedom for the model. Large values indicate a lack of fit of the model.

The Hosmer-Lemeshow Statistic (1980,1989) is another measure of goodness of fit. Hosmer and Lemeshow recommend partitioning the observations into 10 equal sized groups according to their predicted probabilities. Usually G=10 groups are defined for predicted probabilities 0–0.1, 0.1–0.2, … 0.9–1, but fewer than 10 groups depending of data characteristics can be applied. For each group, the observed number of outcomes (positive, negative) in the sample and the expected number of outcomes (positive, negative) are compared by calculating the Pearson chi-square statistic from the 2xG table of observed and expected frequencies. The statistic is distributed as a chi-square distribution with G-2 degrees of freedom;

( )( )∑

= −

−=

G

g ggg

gggHL 1m

moX

1

22

π̂π̂π̂

Page 30: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 30 Curs 2. 01 4- 2. 01 5

BINOMIAL MODELS FOR BINARY DATA: GOODNESS OF FIT

3-5.1 ROC Curve and Confusion Matrices ROC (Receiver Operating Characteristic) curve analysis has been widely accepted as the standard for describing and comparing the accuracy of predictions.

If the ROC curve rises rapidly towards the upper right-hand corner of the graph, or if the value of area under the curve is large, we can say the model performs well. If the area is close to 1.0, it indicates that the model is good. While if the area is to 0.5, it shows that the model is bad.

Confusion matrix for a binary model (M) shows predicted response versus observed response (positive/negative outcomes)

Let the prediction in response be 1ˆ =iy if si >π̂ , where is a threshold s between 0 and 1. For each s a confusion matrix can be built for model (M):

s Y=1 Y=0 Total

1ˆ =iy a b a+b

0ˆ =iy c d c+d

a+c b+d n

• Sensibility is the proportion of observed positive

outcomes (Y=1) predicted as positive ( 1ˆ =iy ): Sn =a/(a+c).

• Specificity is the proportion of observed negative outcomes (Y=0) predicted as negative ( 0ˆ =iy ): Sp = d/(b+d).

Page 31: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 31 Curs 2. 01 4- 2. 01 5

BINOMIAL MODELS FOR BINARY DATA: GOODNESS OF FIT

ROC Curve shows in X axis for each s, 1-Sp (false negative rate) and Sensibility - Sn (true positive rate) in Y-Axis.

• The point (0,1) is the perfect classifier: it classifies all positive cases and negative cases correctly. It is (0,1) because the false positive rate is 0 (none), and the true positive rate is 1 (all).

• The point (0,0) represents a classifier that predicts all cases to be negative.

• The point (1,1) corresponds to a classifier that predicts every case to be positive.

• A good electronic address to understand ROC curves http://gim.unmc.edu/dxtests/ROC1.htm.

A guideline for interpreting ROC curve is:

.90-1 = excellent(A)

.80-.90 = very good (B)

.70-.80 = good (C)

.60-.70 = bad (D)

.50-.60 = very bad (F)

Page 32: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 32 Curs 2. 01 4- 2. 01 5

BINOMIAL MODELS FOR BINARY DATA: GOODNESS OF FIT – ROC CURVE

How to compute in R, Pearson X2 statistic - sum of squares of Pearson’s residuals:

sum( resid( model, ‘pearson’) ^2 )

As in the case, for deviance residual:

sum( resid( model, ‘deviance’) ^2 ) == model$deviance

Package rms contains the specific method lrm(.) for logistic regression with additional diagnostics (c, Naglekerke R2, and so on). NagelkerkeR2 is also in the fmsb package.

To compute ROC curves: Install package ROCR – specific performance plots are available

> library("ROCR") > dadesroc<-prediction(predict(lm2_logit,type="response"),ars$resposta) > par(mfrow=c(1,2)) > plot(performance(dadesroc,"err")) > plot(performance(dadesroc,"tpr","fpr")) > abline(0,1,lty=2)

Page 33: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 33 Curs 2. 01 4- 2. 01 5

3-6. MODEL DIAGNOSTICS

3-6.1 Residuals in GLMz Extension to Generalized linear Models of Normal regression methods:

According to Pregibon (1981), Williams (1987) and Fox (2008): a residual can be defined for each

covariate class i: iiiiii myyye π̂ˆ −=−=.The response residuals are not used in

diagnostics,however, because they ignore the nonconstant variance that is a crucial part of a GLMz.

Pearson residuals are casewise components of the Pearson goodness of fit statistic for the model,

( )[ ] φ̂μ̂

μ̂

i

iiPi

V

ye −=

and ∑=

=n

iPi

rX1

22

These are a basic set of residuals for use with a GLM because of their direct analogy to linear models. For a model named M, the R command residuals(M, type="pearson") returns the Pearson residuals.

Standardized Pearson residuals correct for conditional response variation and for the leverage of the observations:

( )[ ]( )iii

iiPSi hV

ye−

−=

1μ̂μ̂

Page 34: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 34 Curs 2. 01 4- 2. 01 5

MODEL DIAGNOSTICS: RESIDUALS

To compute the standardized Pearson residuals, we need to define the hat-values hii for GLMs.

Deviance residuals, ( ) iiiDi dysigne μ̂−= are the square roots of the casewise components of the

residual deviance ( ) ∑ ==

ni DirD

,,ˆy,

12µ , attaching the sign of iiy μ̂− .

o In the linear model, the deviance residuals reduce to the Pearson residuals. o The deviance residuals are often the preferred form of residual for GLMs, and are returned by

the command residuals(M, type="deviance").

Standardized deviance residuals are ( )iiDi

DSi hee −= 1φ̂ .

Studentized residual in generalized linear models are approximations to the scaled difference between the response and the fitted value computed without case i:

( ) ( )( ) ( )221μ̂ PSiii

DSiiiii

Stui ehehysigne +−−=

The following functions, some in standard R and some in the car package, have methods for GLMs:

rstudent, hatvalues, cooks.distance, dfbetas, outlierTest, avPlots, residualPlots, marginalModelPlots, crPlots, etc.

Page 35: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 35 Curs 2. 01 4- 2. 01 5

MODEL DIAGNOSTICS: RESIDUALS

Hat matrix for Generalized Linear Models can be defined, although it depends on Y (through W) and x’s values,

( ) 21T1T21 WXWXXXWH −=

H matrix is symmetric with diagonal values between 0 and 1, hii, named leverages and average value

p/n. It corresponds to the last iteration (convergence) of the IWLS for estimating model parameters.

The hii are taken from the final iteration of the Iterative Weighted Least Squares procedure for fitting the model and have the usual interpretation, except that, unlike in a linear model, the hat-values in a GLM depend on y as well as on the configuration of the xs.

Page 36: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 36 Curs 2. 01 4- 2. 01 5

MODEL DIAGNOSTICS: RESIDUALS

3-6.2 Influence data in GLMz

Influence data are detected by and adapted Cook’s statitstic derived from Wald statistic for

multiple hypothesis testing: H0: 0ββ = ,

( ) [ ] ( ) ( ) ( )000

1

020 βββββββββ −−=−−=

− ˆWXXˆˆˆˆˆ TTTVZ

Let Wald statistic be for observation I, ( )2

iZ − for testing H0: ( )i−= ββ ˆ , distance between β̂ and

( )i−β̂ ( ( )i−−= ββ ˆˆdi ).

And thus, ( ) ( )( ) ( )( ) ( )( )ii

iiPSi

i

T

ii hpheZ

−=−−= −−− 1

ˆˆˆˆ2

2 ββββ WXXT

Page 37: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 37 Curs 2. 01 4- 2. 01 5

MODEL DIAGNOSTICS: RESIDUALS

3-6.3 Graphics for diagnostics

A scatterplot showing Standardized Pearson residuals (Y-axis) and leverage (hii, diagonal of H). Cut-offs can be included at 2p/n.

A scatterplot showing Pearson residuals versus each of the predictors in turn.

A scatterplot showing Pearson residuals against fitted values in the linear predictor scale, thus, residualPlots() plots residuals against the estimated linear predictor, η( x).

Examine leverage for observations.

Examine Cook’s distance for observations.

o In binary regression for disaggregated data, the plots of Pearson residuals or deviance residuals are strongly patterned—particularly the plot against the linear predictor, where the residuals can take on only two values, depending on whether the response is equal to 0 or 1.

o A correct model requires that the conditional mean function in any residual plot be constant as we move across the plot, and ‘to see this’ smoothers help in the purpose.

In R, residualPlots(M), in each panel in the graph by default includes a smooth fit; a lack-of-fit test is provided only for the numeric predictor residualPlots(mod.working, layout=c(1, 3))

influenceIndexPlot(mod.working, vars=c("Cook", "hat"), id.n=3)

Page 38: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 38 Curs 2. 01 4- 2. 01 5

3-7. EJEMPLO 1: ACCIDENTES CON HERIDOS SEGÚN USO DEL CINTURÓN – AGRESTI (2002)

Datos de 68694 accidentes sucedidos en el estado de Main. Se recoge la gravedad y las variables explicativas de género, entorno y uso del cinturón. Se estudiará la incidencia en la presencia de heridos de los factores, por tanto se crea un factor dicotómico: Sin – Con Heridos (ref. Sin)genero entorno cinturon gravedad y Mujer Urbano No SinHeridos 7287 Mujer Urbano Si SinHeridos 11587 Mujer NoUrbano No SinHeridos 3246 Mujer NoUrbano Si SinHeridos 6134 Hombre Urbano No SinHeridos 10381 Hombre Urbano Si SinHeridos 10969 Hombre NoUrbano No SinHeridos 6123 Hombre NoUrbano Si SinHeridos 6693 Mujer Urbano No LeveSinHospital 175 Mujer Urbano Si LeveSinHospital 126 Mujer NoUrbano No LeveSinHospital 73 Mujer NoUrbano Si LeveSinHospital 94 Hombre Urbano No LeveSinHospital 136 Hombre Urbano Si LeveSinHospital 83 Hombre NoUrbano No LeveSinHospital 141 Hombre NoUrbano Si LeveSinHospital 74 Mujer Urbano No LeveConHospital 720 Mujer Urbano Si LeveConHospital 577 Mujer NoUrbano No LeveConHospital 710 Mujer NoUrbano Si LeveConHospital 564

genero entorno cinturon gravedad y Hombre Urbano No LeveConHospital 566 Hombre Urbano Si LeveConHospital 259 Hombre NoUrbano No LeveConHospital 710 Hombre NoUrbano Si LeveConHospital 353 Mujer Urbano No Hospitalización 91 Mujer Urbano Si Hospitalización 48 Mujer NoUrbano No Hospitalización 159 Mujer NoUrbano Si Hospitalización 82 Hombre Urbano No Hospitalización 96 Hombre Urbano Si Hospitalización 37 Hombre NoUrbano No Hospitalización 188 Hombre NoUrbano Si Hospitalización 74 Mujer Urbano No Mortal 10 Mujer Urbano Si Mortal 8 Mujer NoUrbano No Mortal 31 Mujer NoUrbano Si Mortal 17 Hombre Urbano No Mortal 14 Hombre Urbano Si Mortal 1 Hombre NoUrbano No Mortal 45 Hombre NoUrbano Si Mortal 12

Page 39: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 39 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

> summary(acc) genero entorno cinturon gravedad y f.heridos Hombre:20 NoUrbano:20 Si:20 Hospitalización:8 Min. : 1.00 Sin: 8 Mujer :20 Urbano :20 No:20 LeveConHospital:8 1st Qu.: 66.75 Con:32 LeveSinHospital:8 Median : 138.50 Mortal :8 Mean : 1717.35 SinHeridos :8 3rd Qu.: 710.00 Max. :11587.00 > tapply(acc$y,acc$f.heridos,sum);sum(acc$y) Sin Con 62420 6274 [1] 68694

Tomando como variable de respuesta la presencia de heridos (f.heridos), globalmente se observa 6274 accidentes de un total de 68694, con una probabilidad de 0,0913. El odds es 6274/62420 o 0,1005 a 1 i el log-odds es log(0,1005) = -2.297472.

Se propone comparar inicialmente la presencia de heridos (respuesta) según el Factor Uso del Cinturón (2 niveles, base-line Si).

Cinturón Con Heridos

(respuesta positiva) Sin Heridos

m

Si (ref) 2409 35383 37792

No 3865 27037 30902

6274 62420 68694

P(‘Accidente CON Heridos’)=0. 091 3=6274/68694

Page 40: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 40 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

Sólo hay 2 posibles modelos: el modelo nulo que asume homogeneidad en el Uso en los dos grupos definidos por el Factor (M1) y el modelo completo (M2) que propone proporciones diferentes en el Uso entre los dos grupos:

(M1) ηπ

π=

− i

i

1log (M2) 021

1==+=

− 1ι ααη

ππ ,log i

i

i

> dfc cinturon m ypos yneg Si Si 37792 2409 35383 No No 30902 3865 27037 > > acc.m1 <-glm(cbind(ypos,yneg)~1, family=binomial(link=logit), data=dfc) > summary(acc.m1) Call: glm(formula = cbind(ypos, yneg) ~ 1, family = binomial(link = logit), data = dfc) Deviance Residuals: Si No -19.59 19.60 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.29747 0.01324 -173.5 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Page 41: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 41 Curs 2. 01 4- 2. 01 5

(Dispersion parameter for binomial family taken to be 1) Null deviance: 768.03 on 1 degrees of freedom Residual deviance: 768.03 on 1 degrees of freedom AIC: 789.55 > > acc.m2 <-glm(cbind(ypos,yneg)~cinturon, family=binomial(link=logit), data=dfc) > summary(acc.m2) Call: glm(formula = cbind(ypos, yneg) ~ cinturon, family = binomial(link = logit), data = dfc) Deviance Residuals: [1] 0 0 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.68702 0.02106 -127.61 <2e-16 *** cinturonNo 0.74178 0.02719 27.29 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 7.6803e+02 on 1 degrees of freedom Residual deviance: -4.3099e-13 on 0 degrees of freedom AIC: 23.523 > residuals(acc.m1,'pearson') Si No -18.61742 20.58856 > xpea<-sum(residuals(acc.m1,'pearson')^2);xpea [1] 770.4972

Page 42: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 42 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

El estadístico de Pearson de (M2) es 0 y de (M1) toma por expresión: ( )( )

2112

2

2,12 4972.770

ˆˆˆ

=−=−=≈=

−−

= ∑ pniiii

iiiP m

ymX χµµµ

La devianza de (M2) es 0 y de (M1) toma por expresión:

( ) 21122,1

3.768ˆ

logˆ

log2 =−=−=≈=

−−

−+

= ∑ pni

ii

iiii

i

ii m

ymymyyD χµµ .

Ambos estadísticos son altamente significativos, implicando que el modelo no se ajusta bien a los datos.

En (M1) el estimador 2.29747η −=ˆ , el logit de la proporción muestral.

En (M2), el estimador η̂ , es el logit del nivel de referencia (Si) (logit de la proporción de heridos en grupo que Usa cinturón, logit(2409/37792)=-2.687) y el efecto del nivel No sobre el logit de la proporción de heridos (diferencia de logits entre el nivel No y el nivel de referencia Si: logit(3865/30902)-logit(2409/37792)=0.742.

==

=− Noee

Yese

i

i

21 2 ι 1ι

ππ

αη

η

1.22 ==− αeNovsYesratioodds

Los odds de tener heridos entre los accidentes que No usan cinturón es más del doble que el odds de tener heridos entre los que Si usan cinturón.

Page 43: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 43 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

Ahora procedamos a analizar la incidencia de accidentes con heridos según el género del conductor accidentado (referencia género hombre).

Genero Con iy Sin ii ym − im

Hombre 2789 34166 36955

Mujer 3485 28254 31739

6274 62420 68694 > acc.m2g <-glm(cbind(ypos,yneg)~genero, family=binomial(link=logit), data=dfg) > summary(acc.m2g) Call: glm(formula = cbind(ypos, yneg) ~ genero, family = binomial(link = logit), data = dfg) Deviance Residuals: [1] 0 0 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.50555 0.01969 -127.23 <2e-16 *** generoMujer 0.41278 0.02665 15.49 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 2.4172e+02 on 1 degrees of freedom Residual deviance: -7.0122e-13 on 0 degrees of freedom

Page 44: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 44 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

AIC: 23.571 Number of Fisher Scoring iterations: 2 >> xpea<-sum(residuals(acc.m1g,'pearson')^2);xpea [1] 242.4970 > log(2789 /34166);log(3485 /28254);log(3485 /28254)-log(2789 /34166) [1] -2.505548 [1] -2.092767 [1] 0.4127809 > exp(0.41278) [1] 1.511013 >

Sólo hay 2 posibles modelos: el modelo nulo que asume homogeneidad en la presencia de heridos en accidentes en los 2 grupos definidos por el Factor (M1) y el modelo completo (M2) que propone proporciones diferentes en los accidentes con heridos entre los 2 grupos:

(M1) ηπ

π=

− i

i

1log (M2)

≡=+≡=

=

− Mi

Hi

i

i

21

1log

2αηη

ππ

El estadístico de Pearson de (M2) es 0 y de (M1) toma por expresión:

( )( )

21121

2

212 497.242

ˆˆˆ

=−=−÷=≈=

−−

= ∑ niiii

iiiP m

ymX χµµµ

La devianza de (M2) es 0 y de (M1) toma por expresión:

Page 45: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 45 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

( ) 211221

72.241ˆ

logˆ

log2 =−=−÷=≈=

−−

−+

= ∑ pni

ii

iiii

i

ii m

ymymyyD χµµ . Ambos estadísticos

son altamente significativos, implicando que el modelo no se ajusta bien a los datos.

En (M1) el estimador 0,775η −=ˆ , el logit de la proporción muestral.

En (M2), el estimador η̂ , es el logit del nivel de referencia (Hombres) (logit de la proporción de heridos en accidentes en hombres a la vista de la tabla, logit(2789/34166)= -2.51) y el efecto del nivel 2 (mujeres) sobre el logit de “H” (diferencia de logits en los grupos: log(3485 /28254)-log(2789 /34166)=0.413.

==

=− Hee

Hei

i

i

21 ι 1ι

ππ

αη

η

51.1==− ieHvsiGruporatioodds α

Los odds de accidentes con heridos se incrementan en un 51% en las mujeres respecto los hombres.

Queda por probar el último modelo univariante según Entorno urbano o no urbano: los odds de accidentes con heridos se decrementan en un (1-exp(-0.7158))x100%=51% si sucede en entorno urbano. Los odds de urbano son 0.4887= exp(-0.7158) veces los odds de no urbano.

Page 46: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 46 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

> summary(acc.m2e) Call: glm(formula = cbind(ypos, yneg) ~ entorno, family = binomial(link = logit), data = dfe) Deviance Residuals: [1] 0 0 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.89784 0.01859 -102.08 <2e-16 *** entornoUrbano -0.71584 0.02664 -26.87 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 7.1961e+02 on 1 degrees of freedom Residual deviance: 3.9262e-12 on 0 degrees of freedom AIC: 23.564 Number of Fisher Scoring iterations: 2 > xpea<-sum(residuals(acc.m1e,'pearson')^2);xpea [1] 745.0957 >

Page 47: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 47 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

Modelos con 2 Predictores: Cinturón y Entorno

Hay 4 grupos o clases de las covariables, sea ijy el número de accidentes con heridos en el grupo de Cinturón i-ésimo y grupo de Entorno j-ésimo, donde los niveles de referencia son ‘Si’ para Cinturón (Factor A) y ‘NoUrbano’ para el Factor C. > df2 cinturon entorno m ypos yneg 1 Si NoUrbano 14097 1270 12827 2 No NoUrbano 11426 2057 9369 3 Si Urbano 23695 1139 22556 4 No Urbano 19476 1808 17668

Hay 5 modelos de interés aplicables a la estructura sistemática de los datos anteriores (M1) a (M5), cuyas devianzas y detalles de la estimación con MINITAB se detallan a continuación.

Modelo n-p Devianza D∆ Contraste g.l. Modelo 1 1 3 1504.1 Todos significativos Constante: η

2 A 2 736.11 767.99 (M2) vs (M1) 1 Cinturón: iαη +

3 C 2 784.53 719.57 (M3) vs (M1) 1 Entorno: jβη +

4 A+C 1 2.7116 733.4 (M4) vs (M2) 1 Aditivo: ji βαη ++ 781.8 (M4) vs (M3) 1

5 A*C 0 0 2.7116 (M5) vs (M4) 1 Interacción Factores: ijji αββαη +++

Page 48: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 48 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

> sum(df2[,3]);sum(df2[,4]);sum(df2[,5]) [1] 68694 [1] 6274 [1] 62420 > acc.m20 <-glm(cbind(ypos,yneg)~1, family=binomial(link=logit), data=df2) > summary(acc.m20) Call: glm(formula = cbind(ypos, yneg) ~ 1, family = binomial(link = logit), data = df2) Deviance Residuals: 1 2 3 4 -0.5131 29.4486 -25.2217 0.7247 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.29747 0.01324 -173.5 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1504.1 on 3 degrees of freedom Residual deviance: 1504.1 on 3 degrees of freedom AIC: 1542.4 Number of Fisher Scoring iterations: 4 > acc.m21 <-glm(cbind(ypos,yneg)~entorno, family=binomial(link=logit), data=df2) > summary(acc.m21)

Page 49: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 49 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

Call: glm(formula = cbind(ypos, yneg) ~ entorno, family = binomial(link = logit), data = df2) Deviance Residuals: 1 2 3 4 -14.92 15.04 -12.97 12.94 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.89784 0.01859 -102.08 <2e-16 *** entornoUrbano -0.71584 0.02664 -26.87 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1504.14 on 3 degrees of freedom Residual deviance: 784.53 on 2 degrees of freedom AIC: 824.76 Number of Fisher Scoring iterations: 4 > acc.m22 <-glm(cbind(ypos,yneg)~cinturon, family=binomial(link=logit), data=df2) > summary(acc.m22) Call: glm(formula = cbind(ypos, yneg) ~ cinturon, family = binomial(link = logit), data = df2)

Page 50: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 50 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

Deviance Residuals: 1 2 3 4 12.10 16.82 -10.30 -14.17 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.68702 0.02106 -127.61 <2e-16 *** cinturonNo 0.74178 0.02719 27.29 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1504.14 on 3 degrees of freedom Residual deviance: 736.11 on 2 degrees of freedom AIC: 776.34 Number of Fisher Scoring iterations: 4 > acc.m23 <-glm(cbind(ypos,yneg)~cinturon+entorno, family=binomial(link=logit), data=df2) > summary(acc.m23) Call: glm(formula = cbind(ypos, yneg) ~ cinturon + entorno, family = binomial(link = logit), data = df2) Deviance Residuals: 1 2 3 4 -0.8793 0.7358 0.9220 -0.7396

Page 51: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 51 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -2.28676 0.02465 -92.78 <2e-16 *** cinturonNo 0.75265 0.02734 27.53 <2e-16 *** entornoUrbano -0.72721 0.02682 -27.12 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1504.1407 on 3 degrees of freedom Residual deviance: 2.7116 on 1 degrees of freedom AIC: 44.938 Number of Fisher Scoring iterations: 3 > xpea<-sum(residuals(acc.m21,'pearson')^2);xpea [1] 787.0698 > xpea<-sum(residuals(acc.m22,'pearson')^2);xpea [1] 761.8445 > xpea<-sum(residuals(acc.m20,'pearson')^2);xpea [1] 1618.284 > xpea<-sum(residuals(acc.m23,'pearson')^2);xpea [1] 2.712893 > 1-pchisq(xpea,1) [1] 0.09954032 >

Page 52: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 52 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

El modelo aditivo ajusta bien los datos, vamos a interpretar sus parámetros:

1. 2.28676−=η es el logit de la probabilidad base: accidentes cuando se usa cinturón en entorno rural.

2. 2α muestra un efecto creciente de la incidencia de accidentados cuando No se usa el cinturón.

3. 2β muestra un efecto decreciente de la incidencia de accidentados cuando el accidente ocurre en Entorno urbano.

4. 2α es positivo y el odds de padecer heridos cuando no se usa cinturón es más del doble que entre los accidentes cuando se usa cinturón dentro del mismo grupo de Entorno (all else being equal o ceteris paribus).

La tentativa final consiste en considerar todos las variables explicativas disponibles, es decir, considerar tres factores A, C y D (Cinturón, Entorno y Género). Los posibles modelos son 12 ¡!! Se va a cambiar el orden de los niveles del Factor C – Entorno para facilitar la interpretación.

Page 53: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 53 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

El modelo aditivo ajusta bien los datos, pero todavía queda devianza por explicar: > summary(acc) genero entorno cinturon gravedad y f.heridos heridos Hombre:20 Urbano :20 Si:20 Hospitalización:8 Min. : 1.00 Sin: 8 Min. : 0.0 Mujer :20 NoUrbano:20 No:20 LeveConHospital:8 1st Qu.: 66.75 Con:32 1st Qu.: 9.5 LeveSinHospital:8 Median : 138.50 Median : 74.0 Mortal :8 Mean : 1717.35 Mean :156.8 SinHeridos :8 3rd Qu.: 710.00 3rd Qu.:163.0 Max. :11587.00 Max. :720.0 > > df3 cinturon entorno genero m ypos yneg 1 Si Urbano Hombre 11349 380 10969 2 No Urbano Hombre 11193 812 10381 3 Si NoUrbano Hombre 7206 513 6693 4 No NoUrbano Hombre 7207 1084 6123 5 Si Urbano Mujer 12346 759 11587 6 No Urbano Mujer 8283 996 7287 7 Si NoUrbano Mujer 6891 757 6134 8 No NoUrbano Mujer 4219 973 3246

Page 54: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 54 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

> summary(acc.m331) Call: glm(formula = cbind(ypos, yneg) ~ cinturon + entorno + genero, family = binomial(link = logit), data = df3) Deviance Residuals: 1 2 3 4 5 6 7 8 -0.5055 -0.7976 0.2133 0.9023 1.7426 -0.4639 -1.5365 0.3172 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.33639 0.03114 -107.14 <2e-16 *** cinturonNo 0.81710 0.02765 29.55 <2e-16 *** entornoNoUrbano 0.75806 0.02697 28.11 <2e-16 *** generoMujer 0.54483 0.02727 19.98 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1912.4532 on 7 degrees of freedom Residual deviance: 7.4645 on 4 degrees of freedom AIC: 82.167 Number of Fisher Scoring iterations: 3

Page 55: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 55 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

El siguiente paso podría ser añadir una interacción entre 2 de los factores: A*C o A*D o C*D.

Modelo n-p Devianza D∆ Contraste g.l. Modelo 1 A+C+D 4 7.4645 Aditivo: kji γβαη +++

2 A*C+D 3 3.5914 3.8730 (M2) vs (M1) 1 Interacción Cinturón-Entorno : ijkji αβγβαη ++++

3 A*D+B 3 7.3826 0.0818 (M3) vs (M1) 1 Interacción Cinturón-Género: ikkji αγγβαη ++++

4 C*D+A 3 4.4909 2.9736 (M4) vs (M1) 1 Interacción Entorno-Género: jkkji βγγβαη ++++

Estrictamente sólo la interacción entre Cinturón y Entorno es estadísticamente significativa, aunque la interacción entre Entorno y Género tiene un pvalor del 8% según el contraste de devianza con el modelo aditivo. Se interpreta el mejor modelo obtenido hasta el momento donde intervienen los 3 factores y una interacción doble entre el Uso de Cinturón y el Entorno donde sucede el accidente.

glm(formula = cbind(ypos, yneg) ~ cinturon * entorno + genero, family = binomial, data = df3) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.30342 0.03509 -94.149 <2e-16 *** cinturonNo 0.76173 0.03933 19.366 <2e-16 *** entornoNoUrbano 0.69360 0.04239 16.362 <2e-16 *** generoMujer 0.54594 0.02729 20.007 <2e-16 *** cinturonNo:entornoNoUrbano 0.10800 0.05486 1.968 0.049 *

Page 56: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 56 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

La interpretación en la escala lineal de:

• Si el conductor es mujer los log odds se incrementan en 0.55 unidades respecto al grupo de referencia hombres dentro del mismo grupo del resto de factores.

• No usar el cinturón incrementa la escala lineal en 0.76 unidades en Entorno urbano y 0.76+0.11 en entorno NoUrbano; dentro del mismo grupo de género.

• Conducir en entorno No Urbano incrementa la escala lineal en 0.69 unidades si se usa cinturón y 0.69+0.11 si no se uso cinturón.

• Tanto el uso del cinturón como el entorno no pueden interpretarse independientemente, ya que hay un término de interacción.

La interpretación en la escala de los odds seria:

• Si el conductor es mujer los odds de darse heridos en el accidente se incrementan en un 73% (exp(0.55)=1.73) respecto al grupo de referencia hombres, dentro del mismo grupo del resto de factores.

• No usar el cinturón incrementa los odds de darse heridos en el accidente en un 113% (exp(0.76)=2.13) en Entorno urbano y en un 140% (exp(0.76+0.11)=2.387) en entorno NoUrbano; dentro del mismo grupo de género.

• Conducir en entorno No Urbano incrementa los odds de darse heridos en el accidente en un 100% (exp(0.69)=1.994) si se usa cinturón y en casi un 125% (exp(0.69+0.11)=2.226) si no se usa cinturón; dentro del mismo grupo de género.

Page 57: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 57 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

La interpretación en la escala de las probabilidades son aproximadas y seria en términos absolutos según una probabilidad marginal de darse heridos en un accidente de P(‘Accidente CON Heridos’)=0. 091 3=6274/68694: Y de aquí 0. 091 3x(1 - 0. 091 3)=0. 083. • Si el conductor es mujer la probabilidad de darse heridos en el accidente sube en 0.046

(0.083x0.55=0.046) respecto al grupo de referencia hombres, dentro del mismo grupo del resto de factores.

• No usar el cinturón incrementa la probabilidad de darse heridos en el accidente en 0.063 (0.083x0.76=0.063) en Entorno urbano y en un 0.072 (0.083(0.76+0.11)=0.072) en entorno NoUrbano; dentro del mismo grupo de género.

• Conducir en entorno No Urbano incrementa la probabilidad de darse heridos en el accidente en 0.057 (0.083x0.69=0.057) si se usa cinturón y en 0.066 (0.083(0.696+0.11)=0.066) si no se usa cinturón; dentro del mismo grupo de género.

Page 58: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 58 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

> summary(acc.m331) Call: glm(formula = cbind(ypos, yneg) ~ cinturon + entorno + genero, family = binomial(link = logit), data = df3) Deviance Residuals: 1 2 3 4 5 6 7 8 -0.5055 -0.7976 0.2133 0.9023 1.7426 -0.4639 -1.5365 0.3172 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.33639 0.03114 -107.14 <2e-16 *** cinturonNo 0.81710 0.02765 29.55 <2e-16 *** entornoNoUrbano 0.75806 0.02697 28.11 <2e-16 *** generoMujer 0.54483 0.02727 19.98 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Null deviance: 1912.4532 on 7 degrees of freedom Residual deviance: 7.4645 on 4 degrees of freedom AIC: 82.167 Number of Fisher Scoring iterations: 3 > summary(acc.m332) Call: glm(formula = cbind(ypos, yneg) ~ cinturon + entorno * genero, family = binomial(link = logit), data = df3)

Page 59: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 59 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

… Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.36383 0.03519 -95.592 <2e-16 *** cinturonNo 0.81618 0.02765 29.521 <2e-16 *** entornoNoUrbano 0.80907 0.04010 20.177 <2e-16 *** generoMujer 0.59306 0.03914 15.152 <2e-16 *** entornoNoUrbano:generoMujer -0.09345 0.05422 -1.724 0.0848 . --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Null deviance: 1912.4532 on 7 degrees of freedom Residual deviance: 4.4909 on 3 degrees of freedom AIC: 81.193 Number of Fisher Scoring iterations: 3 > summary(acc.m333) Call: glm(formula = cbind(ypos, yneg) ~ cinturon * entorno + genero, family = binomial(link = logit), data = df3) … Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.30342 0.03509 -94.149 <2e-16 *** cinturonNo 0.76173 0.03933 19.366 <2e-16 *** entornoNoUrbano 0.69360 0.04239 16.362 <2e-16 *** generoMujer 0.54594 0.02729 20.007 <2e-16 *** cinturonNo:entornoNoUrbano 0.10800 0.05486 1.968 0.049 *

Page 60: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 60 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

Null deviance: 1912.4532 on 7 degrees of freedom Residual deviance: 3.5914 on 3 degrees of freedom AIC: 80.294 Number of Fisher Scoring iterations: 3 > summary(acc.m334) Call: glm(formula = cbind(ypos, yneg) ~ cinturon * genero + entorno, family = binomial(link = logit), data = df3) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -3.34236 0.03755 -89.014 <2e-16 *** cinturonNo 0.82621 0.04220 19.579 <2e-16 *** generoMujer 0.55459 0.04370 12.691 <2e-16 *** entornoNoUrbano 0.75792 0.02698 28.096 <2e-16 *** cinturonNo:generoMujer -0.01598 0.05586 -0.286 0.775 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Null deviance: 1912.4532 on 7 degrees of freedom Residual deviance: 7.3826 on 3 degrees of freedom AIC: 84.085 Number of Fisher Scoring iterations: 3

Page 61: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 61 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

> anova(acc.m331,acc.m332,test="Chisq") Analysis of Deviance Table Model 1: cbind(ypos, yneg) ~ cinturon + entorno + genero Model 2: cbind(ypos, yneg) ~ cinturon + entorno * genero Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 4 7.4645 2 3 4.4909 1 2.9736 0.0846 > anova(acc.m331,acc.m333,test="Chisq") Analysis of Deviance Table Model 1: cbind(ypos, yneg) ~ cinturon + entorno + genero Model 2: cbind(ypos, yneg) ~ cinturon * entorno + genero Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 4 7.4645 2 3 3.5914 1 3.8730 0.0491 > anova(acc.m331,acc.m334,test="Chisq") Analysis of Deviance Table Model 1: cbind(ypos, yneg) ~ cinturon + entorno + genero Model 2: cbind(ypos, yneg) ~ cinturon * genero + entorno Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 4 7.4645 2 3 7.3826 1 0.0818 0.7748 > xpea<-sum(residuals(acc.m332,'pearson')^2);xpea [1] 4.496567 > 1-pchisq(xpea,3) [1] 0.2125967 > xpea<-sum(residuals(acc.m333,'pearson')^2);xpea [1] 3.580126 > 1-pchisq(xpea,3) [1] 0.3105178

Page 62: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 62 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

El siguiente paso consistiría en analizar los modelos con 2 interacciones entre los factores, ya que el modelo A*C+D ajusta bien los datos, pero todavía deja una devianza de 3.5914 por explicar en 3 grados de libertad, se podría dar por bueno el modelo.

Modelo n-p Devianza D∆ Contraste g.l. Modelo 1 A*C+A*D 2 3.562410 2.2371 (M1) vs (M4) 1 Interacción Cinturón-Entorno Y

Cinturón-Género : jkijkji αγαβγβαη +++++

2 A*D+C*D 2 4.371979 3.0467 (M2) vs (M4) 1 Interacción Cinturón-Género Y Entorno-Género :

jkikkji βγαγγβαη +++++

3 A*C+C*D 2 1.367022 0.04171 (M3) vs (M4) 1 Interacción Cinturón-Entorno Y Entorno-Género :

jkijkji βγαβγβαη +++++

4 A*C+C*D+A*D

1 1.325317 jkikijkji βγαγαβγβαη ++++++

El modelo no requiere de más análisis, no hay diferencias significativas entre el modelo con las 3 interacciones dobles y ninguno de los modelos con 2 pares de factores en interacciones.

Page 63: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 63 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

El siguiente paso consistiría en analizar los modelos con 2 interacciones entre los factores y compararlos con el modelo aditivo, para ver si son significativas 2 interacciones dobles simultáneamente.

Modelo n-p Devianza D∆ Contraste g.l. Modelo 1 A*C+A*D 2 3.562410 3.9021 (M1) vs (M4) 1 Interacción Cinturón-Entorno Y

Cinturón-Género : jkijkji αγαβγβαη +++++

2 A*D+C*D 2 4.371979 3.0925 (M2) vs (M4) 1 Interacción Cinturón-Género Y Entorno-Género :

jkikkji βγαγγβαη +++++

3 A*C+C*D 2 1.367022 6.0975 (M3) vs (M4) 1 Interacción Cinturón-Entorno Y Entorno-Género :

jkijkji βγαβγβαη +++++

4 A+C+D 4 7.4645 kji γβαη +++

El modelo no requiere de más análisis, ya que simultáneamente son significativas 2 interacciones dobles Cinturón-Entorno Y Entorno-Género.

Page 64: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 64 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

Comparando el mejor modelo con 1 interacción doble (Cinturón-Entorno) con el modelo que tiene 2 interacciones dobles (Cinturón-Entorno y Entorno-Genero) se cuantifica el p valor del contraste de la devianza de la interacción Entorno-Género con un 0.14, por tanto, no significativa una vez que Cinturón-Entorno está en el modelo, pero con un valor incómodo.

> anova(acc.m333,acc.m43,test="Chisq") Analysis of Deviance Table Model 1: cbind(ypos, yneg) ~ cinturon * entorno + genero Model 2: cbind(ypos, yneg) ~ cinturon * entorno + entorno * genero Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 3 3.5914 2 2 1.3670 1 2.2244 0.1358 >

Se propone para finalizar el análisis valorar el modelo con 2 interacciones dobles y el mejor modelo con 1 interacción doble según el criterio de información de Akaike y el método step() en R. Se prefiere mantener las 2 interacciones dobles.

Al final se da una tabla resumen con la devianza residual y el AIC para todos los modelos que se han calculado.

Page 65: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 65 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

> acc.res<-step(acc.m34) Start: AIC=82.7 cbind(ypos, yneg) ~ cinturon * genero * entorno Df Deviance AIC - cinturon:genero:entorno 1 1.325 82.028 <none> 2.411e-12 82.702 Step: AIC=82.03 cbind(ypos, yneg) ~ cinturon + genero + entorno + cinturon:genero + cinturon:entorno + genero:entorno Df Deviance AIC - cinturon:genero 1 1.367 80.069 <none> 1.325 82.028 - genero:entorno 1 3.562 82.265 - cinturon:entorno 1 4.372 83.074 Step: AIC=80.07 cbind(ypos, yneg) ~ cinturon + genero + entorno + cinturon:entorno + genero:entorno Df Deviance AIC <none> 1.367 80.069 - genero:entorno 1 3.591 80.294 - cinturon:entorno 1 4.491 81.193 >

Page 66: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 66 Curs 2. 01 4- 2. 01 5

EXEMPLE 1

Modelos logit(πijk) Devianza n-p AIC

1 η 1912.5 7 1981.2 Cinturón - A η+ αi 1144.4 6 1215.1 Entorno - C η+ βj 1192.8 6 1263.5 Género -D η+ γk 1670.7 6 1741.4

A + D η+ αi+ βj 795.82 5 868.52 A + C η+ αi+ γk 411.02 5 483.73 D + C η+ βj+ γk 911.01 5 983.71 A D η+ αi+ βj+ (αβ)ij 795.32 4 870.03 A C η+ αi+ γk+ (αγ)ik 408.31 4 483.01

A + D + C η+ αi+ βj+ γk 7.4645 4 82.167 A D + C η+ αi+ βj+ γk+ (αβ)ij 7.3826 3 84.085 A C + D η+ αi+ βj+ γk+ (αγ)ik 3.5914 3 80.294 A + D C η+ αi+ βj+ γk+ (βγ)jk 4.4909 3 81.193 A D + A C η+ αi+ βj+ γk+ (αβ)ij+ (αγ)ik 3.5624 2 82.265 A D + D C η+ αi+ βj+ γk+ (αβ)ij+ (βγ)jk 4.372 2 83.074 A C + D C η+ αi+ βj+ γk+ (αγ)ik+ (βγ)jk 1.3670 2 80.07 A D + A C + D C η+ αi+ βj+ γk+ (αβ)ij+ (αγ)ik+ (βγ)jk 1.3253 1 82.028

Page 67: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 67 Curs 2. 01 4- 2. 01 5

3-8. EJEMPLO 2: PARTICIPACIÓN LABORAL FEMENINA EN CANADA (FOX)

En 1977 se realizó una encuesta sociodemográfica a la población de Canadá. El modelo lineal generalizado que se plantea investiga el análisis de la relación entre las mujeres jóvenes casadas que trabajan en función de la existencia de hijos en el hogar, los ingresos de sus maridos y la región del país donde residen.

• La variable de respuesta es dicotómica: trabaja frente a no trabaja (para cada mujer joven casada que interviene en el modelo). Originariamente en los datos la variable tiene 3 categorías, lo que será aprovechado en un ejemplo del Tema 5.

• La presencia de hijos en el hogar es el factor A, que tiene 2 categorías (SI, NO). Categoría base: NO (la constante corresponde al valor medio de la categoría NO).

• La región del Canadá es un factor politómico B, con 5 categorías. Los ingresos del marido (en miles de dólares) es la covariable X.

• La intuición indica una interacción entre los ingresos de los maridos (X) y la presencia de hijos (A).

Page 68: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 68 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

WOMEN'S LABOUR-FORCE PARTICIPATION DATASET, CANADA 1977 [1] OBSERVATION [2] LABOUR-FORCE PARTICIPATION fulltime = WORKING FULL-TIME parttime = WORKING PART-TIME not_work = NOT WORKING OUTSIDE THE HOME [3] HUSBAND'S IINCOME, $1000'S [4] PRESENCE OF CHILDREN absent present [5] REGION Atlantic = ATLANTIC CANADA Quebec Ontario Prairie = PRAIRIE PROVINCES BC = BRITISH COLUMBIA Source: Social Change in Canada Project, York Institute for Social Research. DATA: 1 not_work 15 present Ontario 2 not_work 13 present Ontario … 253 not_work 13 present Quebec 254 parttime 23 present Quebec 255 fulltime 11 absent Quebec 256 not_work 9 absent Quebec 257 fulltime 2 absent Quebec … 263 not_work 15 present Quebec ENDDATA

Page 69: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 69 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

La tabla contiene el análisis de la devianza para diversos modelos. El modelo más adecuado contiene X y A, cuyo coeficiente negativo indican que ante la presencia de niños y mayores ingresos masculinos es menor la incidencia del trabajo femenino.

Análisis de la Devianza Modelo p Devianza o

Log-Verosimilitud

Devianza∆ g.l. Comentarios Contraste 0H Accept.

0 1 1 ¿? 39.609 7 0 vs 8 No

1 A 2 -162.279 4.826 1 1 vs 3 No

2 X 2 -175.528 31.324 1 2 vs 3 No

3 A+X 3 -159.866 2.43 4 3 vs 7 Si

4 A+B 6 -161.213 5.124 1 4 vs 7 No

5 B+X 6 -171.322 25.342 1 5 vs 7 No

6 A*X 4 -159.562 2.582 4 6 vs 8 Si

7 A+B+X 7 -158.651 0.76 1 7 vs 8 Si

8 B+A*X 8 -158.271 84320501 .., =χ

Page 70: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 70 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

El contraste de M7 vs M8 indica que las interacciones entre los ingresos masculinos y la presencia de niños no es estadísticamente significativa (Factor A).

El contraste de M3 vs M7 indica que la región (Factor B) tampoco es estadísticamente significativa. > summary(bm3) Call: glm(formula = bwork ~ sons + income, family = binomial, data = womenlf) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 1.33583 0.38376 3.481 0.0005 *** sonspresent -1.57565 0.29226 -5.391 7e-08 *** income -0.04231 0.01978 -2.139 0.0324 * Null deviance: 356.15 on 262 degrees of freedom Residual deviance: 319.73 on 260 degrees of freedom AIC: 325.73 >

Sin embargo, los efectos principales del Factor A (M1 vs M3) y de la covariable (M2 vs M3) son estadísticamente significativos (se rechazan las correspondientes hipótesis nulas).

iii

i xAFactor 042310576133611

...log −−=− ππ

donde 1=iAFactor si hay presencia de niños y 0 de otro modo.

Page 71: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 71 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

El análisis de los residuos de la devianza frente a las probabilidades estimadas es:

absent present

0,1 0,2 0,3 0,4 0,5 0,6 0,7 0,8

-2

-1

0

1

2

3

EPRO1

DR

ES

1

Page 72: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 72 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

Los residuos de la devianza frente al leverage:

El valor medio del leverage p/n es 0,06522 y el extremo superior del intervalo a 2 y 3 veces la distancia es 0.16704 y 0.21795, respectivamente.

absent present

0,0 0,1 0,2

-2

-1

0

1

2

3

HI1

DR

ES

1

Page 73: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 73 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

Los residuos son difíciles de interpretar en los modelos lineales generalizados!!! residualPlots(bm3, layout=c(1, 3))

Page 74: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 74 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

El modelo propuesto no parece demasiado adecuado a los datos: el logit no es lineal a los ingresos!!!

La distorsión y dificultad de comprensión de los gráficos ante un nivel de máxima desagregación de los datos.

absent present

0 10 20 30 40 50

-2

-1

0

1

2

Income-XO

LOG

IT6

absent present

0 10 20 30 40 50

-2

-1

0

1

2

Income-X

OLO

GIT

6

Page 75: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 75 Curs 2. 01 4- 2. 01 5

absent present

50403020100

1

0

-1

-2

Income-X

ELO

GIT

6

absent present

0 5 10 15 20 25 30 35 40 45

-2

-1

0

1

2

C_INCOMEX

OLO

GIT

7

109

2

26

44

4321 12

3

3

EJEMPLO 2 (FOX)

• Los 2 gráficos muestran en la escala logit, la comparación entre valores empíricos (considerando una categorización de INCOME-X cada 10 unidades y con etiquetas el número total de observaciones en la clase de la covariable correspondiente) y ajustados con el modelo INCOME-X sin categorizar: hay un problema serio de observaciones influyentes y no linealidad.

Page 76: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 76 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

Page 77: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 77 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

> options(contrasts=c("contr.treatment","contr.treatment")) > bm3 <-glm( bwork~sons+income,family=binomial, data=womenlf ) > bm6 <-glm( bwork~sons*income, family=binomial, data=womenlf ) > anova(bm3,bm6,test='Chisq') Analysis of Deviance Table Model 1: bwork ~ sons + income Model 2: bwork ~ sons * income Resid. Df Resid. Dev Df Deviance P(>|Chi|) 1 260 319.73 2 259 319.12 1 0.60831 0.4354 > > ## Diagnosi > library(car) > residualPlots(bm3, layout=c(1, 3)) Test stat Pr(>|t|) sons NA NA income 1.226 0.268 > influenceIndexPlot(bm3, id.n=10) > matplot(dfbetas(bm3),type='l') # Cut-off dfbetas 2/arrel(n) > abline(h=sqrt(4/(dim(womenlf)[1])),lty=3,col=6) > abline(h=-sqrt(4/(dim(womenlf)[1])),lty=3,col=6) > lines(sqrt(cooks.distance(bm3)),lwd=3,col=1) >legend(locator(n=1),legend=c(names(as.data.frame(dfbetas(bm3))),"Cook D"), col=c(1:3,1), lty=c(3,3,3,1) )

Page 78: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 78 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

>

Page 79: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 79 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

> library(car) > Anova(bm3) Analysis of Deviance Table (Type II tests) Response: bwork LR Chisq Df Pr(>Chisq) sons 31.3229 1 2.185e-08 *** income 4.8264 1 0.02803 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 > library(effects) > plot(allEffects(bm3),ask=FALSE) >

Page 80: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 80 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

El análisis de los residuos de la devianza frente al anclaje (leverage, id.n=x indica que se desea mostrar las x observaciones más resaltadas de rstudent i hatvalues, x de cada uno): > llista<-influencePlot(bm3,id.n=3,col=3,pch=19) > womenlf[row.names(llista),c(8,3:4)] bwork income sons 76 work 38 present 77 work 38 present 89 not_work 35 absent 90 not_work 35 absent 91 not_work 35 absent 120 work 30 present > llista StudRes Hat CookD 76 2.044657 0.02949436 0.2573350 77 2.044657 0.02949436 0.2573350 89 -1.138115 0.05332491 0.1309859 90 -1.138115 0.05332491 0.1309859 91 -1.138115 0.05332491 0.1309859 120 1.871816 0.01866566 0.1709365 > El valor medio del leverage p/n es 0,06522 y el extremo superior del intervalo a 2 y 3 veces la distancia es 0.16704 y 0.21795, respectivamente

0.01 0.02 0.03 0.04 0.05

-10

12

Hat-Values

Stu

dent

ized

Res

idua

ls

7677

899091

120

Page 81: UPC APUNTS DE CLASSE: TEMA 3 › ~lmontero › lmm_tm › Tr3Qua_mlg_v8.pdfIt is possible the extension to more than 2 levels or categories in the response variable. The probability

GRAU D’ESTADÍSTICA – UB- UPC MLGz

Prof. Lídia Montero i Josep A. Sánchez Pàg. 3- 81 Curs 2. 01 4- 2. 01 5

EJEMPLO 2 (FOX)

avPlots(bm3,labels=row.names(womenlf),id.method=abs(cooks.distance(bm3)), id.n=5) marginalModelPlots(bm3,labels=row.names(df),id.method=abs(cooks.distance(bm3)), id.n=5)