
Descriptive Statistics (II)
Statistical Techniques in Market Analysis

Xavi Barber

Centro de Investigación Operativa
Universidad Miguel Hernández de Elche

2017-03-02

Xavi Barber (@umh1480 @XaviBarberUMH) Descriptive Statistics (II) 2017-03-02

1 Packages that generate nice tables

2 Customizable graphics


library(s20x)
data(course.df)


Packages that generate nice tables

Recall...



stargazer

This package does more than produce a decent descriptive summary, but it is very easy to use to begin with:

library(stargazer)
stargazer(course.df)

% Table created by stargazer v.5.2 by Marek Hlavac, Harvard University. E-mail: hlavac at fas.harvard.edu
% Date and time: jue, mar 02, 2017 - 11:38:49

Table 1

Statistic      N    Mean    St. Dev.  Min    Max
Exam           146  52.877  18.678    11     93
Assign         146  13.827  4.405     0.000  20.000
Test           146  11.567  3.779     3.600  20.000
B              146  9.253   4.087     0      18
C              146  11.144  5.281     0      20
MC             146  16.240  5.774     4      29
Years.Since    146  1.592   1.057     0.000  4.500



The only precaution is to set the chunk options results='asis' and message=FALSE, and to pass header=FALSE inside the call, in order to avoid unexpected headers.

library(stargazer)
stargazer(course.df, header = FALSE)

Table 2

Statistic      N    Mean    St. Dev.  Min    Max
Exam           146  52.877  18.678    11     93
Assign         146  13.827  4.405     0.000  20.000
Test           146  11.567  3.779     3.600  20.000
B              146  9.253   4.087     0      18
C              146  11.144  5.281     0      20
MC             146  16.240  5.774     4      29
Years.Since    146  1.592   1.057     0.000  4.500
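When checking a table at the console rather than in a LaTeX document, the type argument listed among the stargazer options can be set to "text". The following is a minimal sketch under that assumption; the data frame marks and its values are hypothetical and stand in for course.df.

```r
library(stargazer)

# Toy data frame (hypothetical values, not course.df)
marks <- data.frame(
  Exam   = c(52, 61, 47, 78, 90),
  Assign = c(13.5, 15.0, 9.8, 18.2, 19.6)
)

# type = "text" prints an ASCII table instead of LaTeX code;
# the printed lines are also returned invisibly as a character vector
out <- stargazer(marks, type = "text", header = FALSE,
                 title = "Toy summary")
```

With type = "text" the chunk option results='asis' is not needed, since the output is plain text rather than LaTeX markup.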



The most interesting stargazer options are:

stargazer(..., type = "latex", title = "", style = "default",
          summary = NULL, out = NULL, out.header = FALSE,
          column.labels = NULL, column.separate = NULL,
          covariate.labels = NULL, dep.var.caption = NULL,
          dep.var.labels = NULL, dep.var.labels.include = TRUE,
          align = FALSE,
          coef = NULL, se = NULL, t = NULL, p = NULL,
          t.auto = TRUE, p.auto = TRUE,
          ci = FALSE, ci.custom = NULL,
          ci.level = 0.95, ci.separator = NULL,
          add.lines = NULL,
          apply.coef = NULL, apply.se = NULL,
          apply.t = NULL, apply.p = NULL, apply.ci = NULL,
          colnames = NULL,
          column.sep.width = "5pt",
          decimal.mark = NULL, df = TRUE,
          digit.separate = NULL, digit.separator = NULL,
          digits = NULL, digits.extra = NULL, flip = FALSE,
          float = TRUE, float.env = "table",
          font.size = NULL, header = TRUE,
          initial.zero = NULL,
          intercept.bottom = TRUE, intercept.top = FALSE,
          keep = NULL, keep.stat = NULL,
          label = "", model.names = NULL,
          model.numbers = NULL, multicolumn = TRUE,
          no.space = NULL,
          notes = NULL, notes.align = NULL,
          notes.append = TRUE, notes.label = NULL,
          object.names = FALSE,
          omit = NULL, omit.labels = NULL,
          omit.stat = NULL, omit.summary.stat = NULL,
          omit.table.layout = NULL,
          omit.yes.no = c("Yes", "No"),
          order = NULL, ord.intercepts = FALSE,
          perl = FALSE, report = NULL, rownames = NULL,
          rq.se = "nid", selection.equation = FALSE,
          single.row = FALSE,
          star.char = NULL, star.cutoffs = NULL,
          suppress.errors = FALSE,
          table.layout = NULL, table.placement = "!htbp",
          zero.component = FALSE,
          summary.logical = TRUE, summary.stat = NULL,
          nobs = TRUE, mean.sd = TRUE, min.max = TRUE,
          median = FALSE, iqr = FALSE)



reporttools

With this package we can produce descriptive summaries of categorical variables, as well as of continuous variables broken down by a categorical one.
One thing we are often asked for is the p-values for comparing percentages or means between groups.

Link: reporttools: R Functions to Generate LaTeX Tables of Descriptive Statistics

https://cran.r-project.org/web/packages/reporttools/vignettes/reporttools.pdf


tableContinuous(vars, weights = NA, subset = NA, group = NA,
                stats = c("n", "min", "q1", "median", "mean", "q3", "max",
                          "s", "iqr", "na"),
                prec = 1,
                col.tit = NA,
                col.tit.font = c("bf", "", "sf", "it", "rm"),
                print.pval = c("none", "anova", "kruskal"),  # highlighted on the slide
                pval.bound = 10^-4,                          # highlighted on the slide
                declare.zero = 10^-10,
                cap = "", lab = "",
                font.size = "footnotesize", longtable = TRUE,
                disp.cols = NA, nams = NA, ...)


First example:

library(reporttools)

# table title
cap5 <- "Exam results by Gender"

# Statistics to display, shown by their symbol rather than spelled out
stats <- list("mean", "s", "min", "median", "max", "n")

# continuous variables to analyse
sele <- c(3, 7, 8, 9, 10, 11, 14)  # those marked numeric in the initial table

# and here we ask it to compute everything
tableContinuous(vars = course.df[, sele],   # variables to analyse
                group = course.df$Gender,   # "by" variable (factor)
                stats = stats,              # statistics to display
                print.pval = "kruskal",     # type of test
                cap = cap5, lab = "tab: cont2",
                longtable = FALSE)

Remember to set in the chunk: results='asis'


Variable     Levels  x̄     s     Min   x̃     Max   n
Exam         Female  54.6  18.1  20.0  54.0  89.0  78
             Male    50.9  19.3  11.0  48.0  93.0  68
p = 0.21     all     52.9  18.7  11.0  51.5  93.0  146
Assign       Female  14.7  3.8   2.4   16.0  20.0  78
             Male    12.8  4.8   0.0   13.6  19.6  68
p = 0.01     all     13.8  4.4   0.0   14.8  20.0  146
Test         Female  11.9  3.6   3.6   12.2  20.0  78
             Male    11.2  4.0   3.6   10.9  20.0  68
p = 0.22     all     11.6  3.8   3.6   11.8  20.0  146
B            Female  10.0  4.0   1.0   10.0  17.0  78
             Male    8.4   4.1   0.0   8.0   18.0  68
p = 0.02     all     9.3   4.1   0.0   9.0   18.0  146
C            Female  12.3  4.9   0.0   13.0  20.0  78
             Male    9.8   5.4   0.0   10.0  20.0  68
p = 0.0054   all     11.1  5.3   0.0   12.0  20.0  146
MC           Female  16.2  5.6   5.0   16.0  29.0  78
             Male    16.3  6.0   4.0   15.0  29.0  68
p = 0.96     all     16.2  5.8   4.0   16.0  29.0  146
Years.Since  Female  1.5   1.1   0.0   1.5   4.5   78
             Male    1.7   1.0   0.0   1.5   4.5   68
p = 0.44     all     1.6   1.1   0.0   1.5   4.5   146

Table 3: Exam results by Gender
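The p-values in the table come from the print.pval argument. As a rough illustration of what such a p-value measures, the sketch below runs base R's kruskal.test() and a one-way ANOVA on a small made-up data set; the data frame scores and its values are hypothetical, and it is an assumption here that reporttools relies on equivalent tests internally.

```r
# Toy marks for two groups (hypothetical values)
scores <- data.frame(
  mark   = c(54, 61, 47, 70, 66, 58, 49, 52, 40, 63, 45, 51),
  gender = factor(rep(c("Female", "Male"), each = 6))
)

# Kruskal-Wallis rank-sum test: compares the groups without assuming normality
kw <- kruskal.test(mark ~ gender, data = scores)
kw$p.value

# One-way ANOVA p-value, the parametric counterpart
aov_p <- anova(lm(mark ~ gender, data = scores))[["Pr(>F)"]][1]
aov_p
```

The Kruskal-Wallis test is the safer default when the marks are skewed or the groups are small, which is one reason to prefer print.pval = "kruskal" in a first descriptive pass.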



library(reporttools)

# table title
cap5 <- "Exam results by Gender"

# Statistics to display, shown by their symbol rather than spelled out;
# named entries are user-defined functions
stats <- list("n", "min", "median",
              "$\\bar{x}_{\\mathrm{trim}}$" = function(x) {return(mean(x, trim = .05))},
              "max", "iqr",
              "c$_{\\mathrm{v}}$" = function(x) {return(sd(x) / mean(x))},
              "s", "na")

# continuous variables to analyse
sele <- c(3, 7, 8, 9, 10, 11, 14)  # those marked numeric in the initial table

# and here we ask it to compute everything
tableContinuous(vars = course.df[, sele], group = course.df$Gender,
                stats = stats, print.pval = "kruskal",
                cap = cap5, lab = "tab: cont2", longtable = FALSE)



Variable     Levels  n    Min   x̃     x̄trim  Max   IQR   cv   s     #NA
Exam         Female  78   20.0  54.0  54.6   89.0  30.5  0.3  18.1  0
             Male    68   11.0  48.0  50.6   93.0  26.0  0.4  19.3  0
p = 0.21     all     146  11.0  51.5  52.7   93.0  28.5  0.4  18.7  0
Assign       Female  78   2.4   16.0  14.9   20.0  4.7   0.3  3.8   0
             Male    68   0.0   13.6  13.1   19.6  6.0   0.4  4.8   0
p = 0.01     all     146  0.0   14.8  14.1   20.0  5.5   0.3  4.4   0
Test         Female  78   3.6   12.2  11.9   20.0  5.2   0.3  3.6   0
             Male    68   3.6   10.9  11.1   20.0  5.4   0.4  4.0   0
p = 0.22     all     146  3.6   11.8  11.5   20.0  5.4   0.3  3.8   0
B            Female  78   1.0   10.0  10.0   17.0  5.0   0.4  4.0   0
             Male    68   0.0   8.0   8.4    18.0  5.0   0.5  4.1   0
p = 0.02     all     146  0.0   9.0   9.3    18.0  6.8   0.4  4.1   0
C            Female  78   0.0   13.0  12.5   20.0  8.0   0.4  4.9   0
             Male    68   0.0   10.0  9.9    20.0  8.0   0.6  5.4   0
p = 0.0054   all     146  0.0   12.0  11.3   20.0  8.0   0.5  5.3   0
MC           Female  78   5.0   16.0  16.2   29.0  8.0   0.3  5.6   0
             Male    68   4.0   15.0  16.2   29.0  9.0   0.4  6.0   0
p = 0.96     all     146  4.0   16.0  16.2   29.0  8.0   0.4  5.8   0
Years.Since  Female  78   0.0   1.5   1.5    4.5   1.4   0.7  1.1   0
             Male    68   0.0   1.5   1.6    4.5   1.6   0.6  1.0   0
p = 0.44     all     146  0.0   1.5   1.5    4.5   1.9   0.7  1.1   0

Table 4: Exam results by Gender
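The two user-defined entries in the stats list are a 5% trimmed mean and a coefficient of variation. The sketch below applies the same two functions to a made-up vector so their behaviour can be checked by hand; the vector x and the helper names trimmed_mean and coef_var are hypothetical.

```r
# 20 made-up marks, already sorted for readability
x <- c(11, 20, 25, 31, 36, 40, 44, 47, 50, 52,
       54, 57, 60, 63, 67, 71, 76, 82, 89, 93)

# Same definitions as in the stats list
trimmed_mean <- function(x) mean(x, trim = 0.05)  # drops 5% of the values at each end
coef_var     <- function(x) sd(x) / mean(x)       # spread relative to the mean

trimmed_mean(x)
coef_var(x)

# With 20 values, trim = 0.05 removes exactly one value per tail,
# so the result matches the mean of the 18 central observations
manual_trim <- mean(sort(x)[2:19])
```

With fewer than 20 observations, trim = 0.05 removes nothing and the trimmed mean equals the ordinary mean, which is worth keeping in mind for small groups.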



And if we only want to look at the factors:

sele <- c(1, 2, 4, 5, 6)
titulo <- "Characteristics of the nominal variables"
tableNominal(vars = course.df[, sele], cap = titulo, vertical = FALSE,
             font.size = "scriptsize", lab = "tab:nominal1",
             longtable = FALSE, cumsum = TRUE)



Variable  Levels  n    %      ∑%
Grade     A       32   21.9   21.9
          B       29   19.9   41.8
          C       41   28.1   69.9
          D       44   30.1   100.0
          all     146  100.0
Pass      No      44   30.1   30.1
          Yes     102  69.9   100.0
          all     146  100.0
Degree    BA      17   11.6   11.6
          BCom    49   33.6   45.2
          BSc     64   43.8   89.0
          Other   16   11.0   100.0
          all     146  100.0
Gender    Female  78   53.4   53.4
          Male    68   46.6   100.0
          all     146  100.0
Attend    No      46   31.5   31.5
          Yes     100  68.5   100.0
          all     146  100.0

Table 5: Characteristics of the nominal variables



And now cross-tabulating two categorical variables:

sele <- c(1, 2, 4)
titulo <- "Characteristics of the nominal variables by the repeater variable"
tableNominal(vars = course.df[, sele], group = course.df[, 15],
             cap = titulo, vertical = FALSE, font.size = "scriptsize",
             print.pval = "chi2", fisher.B = "fisher.test",
             lab = "tab:nominal1", longtable = FALSE, cumsum = FALSE)



Variable     Levels  n_No  %_No   n_Yes  %_Yes  n_all  %_all
Grade        A       31    28.4   1      2.7    32     21.9
             B       25    22.9   4      10.8   29     19.9
             C       24    22.0   17     46.0   41     28.1
             D       29    26.6   15     40.5   44     30.1
p = 0.00048  all     109   100.0  37     100.0  146    100.0
Pass         No      29    26.6   15     40.5   44     30.1
             Yes     80    73.4   22     59.5   102    69.9
p = 0.16     all     109   100.0  37     100.0  146    100.0
Degree       BA      12    11.0   5      13.5   17     11.6
             BCom    41    37.6   8      21.6   49     33.6
             BSc     41    37.6   23     62.2   64     43.8
             Other   15    13.8   1      2.7    16     11.0
p = 0.03     all     109   100.0  37     100.0  146    100.0

Table 6: Characteristics of the nominal variables by the repeater variable
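To see where a p-value of this kind can come from, the sketch below rebuilds the Pass block of Table 6 as a 2 x 2 matrix and runs base R's chisq.test() and fisher.test() on it. The counts are copied from the table; the object name pass_by_repeat is hypothetical, and it is an assumption that reporttools applies the same defaults.

```r
# Pass (rows) by repeater status (columns), counts copied from Table 6
pass_by_repeat <- matrix(
  c(29, 15,
    80, 22),
  nrow = 2, byrow = TRUE,
  dimnames = list(Pass = c("No", "Yes"), Repeat = c("No", "Yes"))
)

# Pearson chi-squared test; for 2 x 2 tables R applies
# Yates' continuity correction by default
chi_p <- chisq.test(pass_by_repeat)$p.value
chi_p

# Fisher's exact test, preferable when expected counts are small
fisher_p <- fisher.test(pass_by_repeat)$p.value
fisher_p
```

Neither p-value is small here, so these counts alone give little evidence that passing depends on repeater status.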



dplyr

This R package does more than descriptive statistics; it is a whole way of working.
It can perform different actions on the data: grouping, selecting, etc.

Link: Tutorial

It solves the most common data-manipulation operations.
It provides simple functions that correspond to the most common data-manipulation verbs, so it is very easy to translate your thoughts into code.
It is computationally very efficient.

https://cran.r-project.org/web/packages/dplyr/index.html


library(dplyr)

dplyr provides one function for each common data-manipulation verb:

filter and slice: filter rows and pick rows by position
arrange: order rows
select and rename: select columns
distinct: get the distinct rows
mutate and transmute: create new variables
summarise: compute statistics
sample_n and sample_frac: draw subsamples
etc.



Example data set

We will use a data set that records the arrival delays of different flights. We have been hired to analyse whether there is any relationship between these delays and the other variables in the data set.

library(dplyr)
library(nycflights13)
dim(flights)

## [1] 336776     19

head(flights)

## # A tibble: 6 × 19
##    year month   day dep_time sched_dep_time dep_delay arr_time
##   <int> <int> <int>    <int>          <int>     <dbl>    <int>
## 1  2013     1     1      517            515         2      830
## 2  2013     1     1      533            529         4      850
## 3  2013     1     1      542            540         2      923
## 4  2013     1     1      544            545        -1     1004
## 5  2013     1     1      554            600        -6      812
## 6  2013     1     1      554            558        -4      740
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>


Creating filters

vuelos.enero <- filter(flights, month == 1, day == 1)
head(vuelos.enero[, 1:4])

## # A tibble: 6 × 4
##    year month   day dep_time
##   <int> <int> <int>    <int>
## 1  2013     1     1      517
## 2  2013     1     1      533
## 3  2013     1     1      542
## 4  2013     1     1      544
## 5  2013     1     1      554
## 6  2013     1     1      554



This is equivalent to:

Lunes.de.enero <- flights[flights$month == 1 & flights$day == 1, ]

table(Lunes.de.enero$month, Lunes.de.enero$day)

##
##       1
##   1 842
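The equivalence between filter() and bracket subsetting holds when the columns used in the condition have no missing values. The sketch below, on a small made-up table, shows how the two differ once an NA appears; the data frame toy and its values are hypothetical.

```r
library(dplyr)

# Toy table (hypothetical values); note the missing month in row 3
toy <- data.frame(
  month = c(1, 1, NA, 2),
  day   = c(1, 2, 1, 1),
  delay = c(5, -3, 12, 40)
)

with_filter  <- filter(toy, month == 1, day == 1)       # drops rows where the condition is NA
with_bracket <- toy[toy$month == 1 & toy$day == 1, ]    # keeps an all-NA row for the NA condition
with_subset  <- subset(toy, month == 1 & day == 1)      # base R, behaves like filter() here

nrow(with_filter)
nrow(with_bracket)
nrow(with_subset)
```

When translating base R code to dplyr (or back), it is therefore worth checking for missing values in the filtering columns first.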



Another example:

temp <- filter(flights, month == 1 | month == 2)
table(temp$month)

##
##     1     2
## 27004 24951



To select rows by position:

slice(flights, 1:10)

## # A tibble: 10 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>


Ordering rows with:

arrange(flights, year, month, day)

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>



Ordering rows in descending order with:

arrange(flights, desc(dep_delay))  # descending order

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     6    27      959           1900       899     1236
##  9  2013     7    22     2257            759       898      121
## 10  2013    12     5      756           1700       896     1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>



arrange(flights, year, month, day) is equivalent to:

flights[order(flights$year, flights$month, flights$day), ]

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>



arrange(flights, desc(dep_delay)) is equivalent to:

flights[order(desc(flights$dep_delay)), ]

## # A tibble: 336,776 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     9      641            900      1301     1242
##  2  2013     6    15     1432           1935      1137     1607
##  3  2013     1    10     1121           1635      1126     1239
##  4  2013     9    20     1139           1845      1014     1457
##  5  2013     7    22      845           1600      1005     1044
##  6  2013     4    10     1100           1900       960     1342
##  7  2013     3    17     2321            810       911      135
##  8  2013     6    27      959           1900       899     1236
##  9  2013     7    22     2257            759       898      121
## 10  2013    12     5      756           1700       896     1058
## # ... with 336,766 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>



Selecting columns with:

select(flights, year, month, day)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows



select(flights, year:day)

## # A tibble: 336,776 × 3
##     year month   day
##    <int> <int> <int>
##  1  2013     1     1
##  2  2013     1     1
##  3  2013     1     1
##  4  2013     1     1
##  5  2013     1     1
##  6  2013     1     1
##  7  2013     1     1
##  8  2013     1     1
##  9  2013     1     1
## 10  2013     1     1
## # ... with 336,766 more rows



select(flights, -(year:day))

## # A tibble: 336,776 × 16
##    dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay
##       <int>          <int>     <dbl>    <int>          <int>     <dbl>
##  1      517            515         2      830            819        11
##  2      533            529         4      850            830        20
##  3      542            540         2      923            850        33
##  4      544            545        -1     1004           1022       -18
##  5      554            600        -6      812            837       -25
##  6      554            558        -4      740            728        12
##  7      555            600        -5      913            854        19
##  8      557            600        -3      709            723       -14
##  9      557            600        -3      838            846        -8
## 10      558            600        -2      753            745         8
## # ... with 336,766 more rows, and 10 more variables: carrier <chr>,
## #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
## #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>



Renaming variables

# rename variables
select(flights, tail_num = tailnum)

## # A tibble: 336,776 × 1
##    tail_num
##       <chr>
##  1   N14228
##  2   N24211
##  3   N619AA
##  4   N804JB
##  5   N668DN
##  6   N39463
##  7   N516JB
##  8   N829AS
##  9   N593JB
## 10   N3ALAA
## # ... with 336,766 more rows
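Renaming inside select() keeps only the columns that are named in the call, which is why the output above has a single column. dplyr's rename() changes the name and keeps every other column. A minimal sketch on a made-up two-column table (the data frame toy is hypothetical):

```r
library(dplyr)

# Toy table (hypothetical values)
toy <- data.frame(
  tailnum = c("N14228", "N24211"),
  origin  = c("EWR", "LGA")
)

only_renamed <- select(toy, tail_num = tailnum)  # keeps just the renamed column
all_columns  <- rename(toy, tail_num = tailnum)  # renames and keeps the rest

names(only_renamed)
names(all_columns)
```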



Getting the distinct rows, similar to unique()

distinct(select(flights, tailnum))

## # A tibble: 4,044 × 1
##    tailnum
##      <chr>
##  1  N14228
##  2  N24211
##  3  N619AA
##  4  N804JB
##  5  N668DN
##  6  N39463
##  7  N516JB
##  8  N829AS
##  9  N593JB
## 10  N3ALAA
## # ... with 4,034 more rows



distinct(select(flights, origin, dest))

## # A tibble: 224 × 2
##    origin  dest
##     <chr> <chr>
##  1    EWR   IAH
##  2    LGA   IAH
##  3    JFK   MIA
##  4    JFK   BQN
##  5    LGA   ATL
##  6    EWR   ORD
##  7    EWR   FLL
##  8    LGA   IAD
##  9    JFK   MCO
## 10    LGA   ORD
## # ... with 214 more rows



Creating new variables in the data.frame

mutate(flights,
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60)

## # A tibble: 336,776 × 21
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     1     1      517            515         2      830
##  2  2013     1     1      533            529         4      850
##  3  2013     1     1      542            540         2      923
##  4  2013     1     1      544            545        -1     1004
##  5  2013     1     1      554            600        -6      812
##  6  2013     1     1      554            558        -4      740
##  7  2013     1     1      555            600        -5      913
##  8  2013     1     1      557            600        -3      709
##  9  2013     1     1      557            600        -3      838
## 10  2013     1     1      558            600        -2      753
## # ... with 336,766 more rows, and 14 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, gain <dbl>, speed <dbl>



Creating new variables from the existing data.frame, keeping only the new ones

transmute(flights,
          gain = arr_delay - dep_delay,
          gain_per_hour = gain / (air_time / 60))

## # A tibble: 336,776 × 2
##     gain gain_per_hour
##    <dbl>         <dbl>
##  1     9      2.378855
##  2    16      4.229075
##  3    31     11.625000
##  4   -17     -5.573770
##  5   -19     -9.827586
##  6    16      6.400000
##  7    24      9.113924
##  8   -11    -12.452830
##  9    -5     -2.142857
## 10    10      4.347826
## # ... with 336,766 more rows



summarise

library(knitr)  # provides kable()

kk <- summarise(flights,
                Media_Retraso = mean(dep_delay, na.rm = TRUE),
                Desv_Retraso = sd(dep_delay, na.rm = TRUE))

kable(kk, caption = "Mean and standard deviation of the delays")

Table 7: Mean and standard deviation of the delays

Media_Retraso  Desv_Retraso
12.63907       40.21006



Random selection of rows

sample_n(flights, 10)

## # A tibble: 10 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     3    29     1509           1510        -1     1752
##  2  2013    10     5     2024           2035       -11     2123
##  3  2013    10    27     1821           1829        -8     2007
##  4  2013     2     4     2141           2154       -13       47
##  5  2013     2    14      735            740        -5     1040
##  6  2013     6    29      721            705        16      950
##  7  2013     6    22     1954           2000        -6     2205
##  8  2013    11    26     1558           1545        13     1815
##  9  2013     4    27     1934           1925         9     2110
## 10  2013     1     3      553            600        -7      721
## # ... with 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
## #   time_hour <dttm>



sample_frac(flights, 0.01)

## # A tibble: 3,368 × 19
##     year month   day dep_time sched_dep_time dep_delay arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>
##  1  2013     3    20      557            600        -3      858
##  2  2013     3    21     2050           1759       171     2346
##  3  2013     6    24     1115           1012        63     1419
##  4  2013     1     4     2158           2159        -1     2314
##  5  2013     3    28     1636           1645        -9     1744
##  6  2013     3     8     2126           1940       106        7
##  7  2013     4    20     1338           1345        -7     1707
##  8  2013     5     9     1547           1535        12     1753
##  9  2013    12    10      614            615        -1      932
## 10  2013     5    14      625            630        -5      858
## # ... with 3,358 more rows, and 12 more variables: sched_arr_time <int>,
## #   arr_delay <dbl>, carrier <chr>, flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>
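Because sample_n() and sample_frac() draw at random, every run returns different rows. If a report needs to be reproducible, one common approach is to fix the random seed first with set.seed(). A minimal sketch on a made-up table (the data frame toy and the seed value 2017 are arbitrary choices):

```r
library(dplyr)

# Toy table with 100 rows (hypothetical)
toy <- data.frame(id = 1:100)

# Same seed, same draw
set.seed(2017)
first_draw <- sample_n(toy, 10)

set.seed(2017)
second_draw <- sample_n(toy, 10)

identical(first_draw$id, second_draw$id)

# sample_frac() takes a proportion instead of a count
tenth <- sample_frac(toy, 0.1)
nrow(tenth)
```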



group_by

kk2 <- summarise(group_by(flights, month),
                 Media = mean(dep_delay, na.rm = TRUE),
                 Desv = sd(dep_delay, na.rm = TRUE),
                 min = min(dep_delay, na.rm = TRUE),
                 Mediana = median(dep_delay, na.rm = TRUE),
                 Max = max(dep_delay, na.rm = TRUE),
                 N = n())

kable(kk2, caption = "Departure delays by month")

Table 8: Departure delays by month

month  Media      Desv      min  Mediana  Max   N
1      10.036665  36.39031  -30  -2       1301  27004
2      10.816842  36.26655  -33  -2       853   24951
3      13.227076  40.13097  -25  -1       911   28834
4      13.938038  42.96626  -21  -2       960   28330
5      12.986859  39.35283  -24  -1       878   28796
6      20.846332  51.45694  -21  0        1137  28243
7      21.727787  51.61608  -22  0        1005  29425
8      12.611040  37.66692  -26  -1       520   29327
9      6.722476   35.61480  -24  -3       1014  27574
10     6.243988   29.67176  -25  -3       702   28889
11     5.435362   27.58836  -32  -3       798   27268
12     16.576688  41.87681  -43  0        896   28135
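The nested call summarise(group_by(...), ...) can also be written with the pipe operator %>%, which dplyr re-exports and which reads top to bottom in the order the steps happen. A minimal sketch on a made-up table (the data frame toy and its values are hypothetical; the column names mirror the flights example):

```r
library(dplyr)

# Toy delays (hypothetical values), with one missing value
toy <- data.frame(
  month     = c(1, 1, 2, 2, 2),
  dep_delay = c(10, NA, -2, 4, 7)
)

by_month <- toy %>%
  group_by(month) %>%                            # one group per month
  summarise(Media = mean(dep_delay, na.rm = TRUE),
            N     = n())                         # n() counts rows, including the NA one

by_month
```

Note that N counts every row in the group, while Media ignores the missing delay, so the two are not computed on the same number of observations.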


Customizable graphics




ggplot2

R offers several packages for producing graphics.
lattice has been, and still is, an alternative for producing different kinds of plots and for building panels of several plots.
ggplot2, in its latest version, has proven remarkably powerful.

H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009.



Complex and different syntax

The only objection raised against ggplot2 is that its syntax is completely different from what had been used with plot or with lattice.

Index of ggplot2 commands, version 2.1.0:
http://docs.ggplot2.org/current/index.html


housing <- read.csv("landdata-states.csv")
head(housing[1:5])

##   State region    Date Home.Value Structure.Cost
## 1    AK   West 2010.25     224952         160599
## 2    AK   West 2010.50     225511         160252
## 3    AK   West 2009.75     225820         163791
## 4    AK   West 2010.00     224994         161787
## 5    AK   West 2008.00     234590         155400
## 6    AK   West 2008.25     233714         157458



Let us compare the classic commands with ggplot2:

hist(housing$Home.Value)

[Figure: base-graphics histogram titled "Histogram of housing$Home.Value"; x-axis housing$Home.Value, y-axis Frequency]



library(ggplot2)
ggplot(housing, aes(x = Home.Value)) +
  geom_histogram()

[Figure: ggplot2 histogram; x-axis Home.Value, y-axis count]



plot(Home.Value ~ Date,
     data = subset(housing, State == "MA"))

points(Home.Value ~ Date, col = "red",
       data = subset(housing, State == "TX"))

legend(1975, 400000,
       c("MA", "TX"), title = "State",
       col = c("black", "red"),
       pch = c(1, 1))

[Figure: base-graphics scatter plot of Home.Value against Date, MA in black and TX in red, with a legend titled "State"]



ggplot(subset(housing, State %in% c("MA", "TX")),
       aes(x = Date,
           y = Home.Value,
           color = State)) +
  geom_point()

[Figure: ggplot2 scatter plot of Home.Value against Date, coloured by State (MA, TX)]



hp2001Q1 <- subset(housing, Date == 2001.25)
ggplot(hp2001Q1,
       aes(y = Structure.Cost, x = Land.Value)) +
  geom_point()

[Figure: scatter plot of Structure.Cost against Land.Value]



ggplot(hp2001Q1,
       aes(y = Structure.Cost, x = log(Land.Value))) +
  geom_point()

[Figure: scatter plot of Structure.Cost against log(Land.Value)]



hp2001Q1$pred.SC <- predict(lm(Structure.Cost ~ log(Land.Value), data = hp2001Q1))

p1 <- ggplot(hp2001Q1, aes(x = log(Land.Value), y = Structure.Cost))

p1 + geom_point(aes(color = Home.Value)) +
  geom_line(aes(y = pred.SC))

[Figure: scatter plot of Structure.Cost against log(Land.Value), points coloured by Home.Value, with the fitted regression line]



p1 +
  geom_point(aes(color = Home.Value)) +
  geom_smooth()

[Figure: scatter plot of Structure.Cost against log(Land.Value), points coloured by Home.Value, with a smoothed trend line]



p1 + geom_text(aes(label = State), size = 3)

[Figure: Structure.Cost against log(Land.Value), each observation drawn as its State abbreviation; several labels overlap]



## install.packages('ggrepel')
library("ggrepel")
p1 + geom_point() + geom_text_repel(aes(label = State), size = 3)

[Figure: Structure.Cost against log(Land.Value), points labelled with State abbreviations that are pushed apart to reduce overlap]



p1 +
  geom_point(aes(size = 2),   # incorrect! 2 is not a variable
             color = "red")   # this is fine -- all points red

[Figure: Structure.Cost against log(Land.Value), all points red, with a spurious size legend showing the value 2]
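The comments in the call above point at the difference between mapping an aesthetic to a variable (inside aes()) and setting it to a constant (outside aes()). The sketch below builds both kinds of layer on a small made-up data frame; toy, its values, and the object names are hypothetical.

```r
library(ggplot2)

# Toy data (hypothetical values)
toy <- data.frame(
  x     = c(9, 10, 11, 12),
  y     = c(80000, 100000, 130000, 170000),
  value = c(1e5, 1.5e5, 2e5, 3e5)
)

base_plot <- ggplot(toy, aes(x = x, y = y))

# Mapping: colour varies with a column, so ggplot2 draws a legend
mapped_layer <- geom_point(aes(colour = value))

# Setting: constants go outside aes(); no legend is produced
set_layer <- geom_point(colour = "red", size = 2)

mapped_plot <- base_plot + mapped_layer
fixed_plot  <- base_plot + set_layer
```

Writing aes(size = 2) instead treats 2 as if it were a data value, which is why a legend for it appears in the figure above.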



p1 + geom_point(aes(color = Home.Value, shape = region))

[Figure: Structure.Cost against log(Land.Value), points coloured by Home.Value and shaped by region (Midwest, N. East, South, West, NA)]

To be continued...



http://r4stats.com/examples/graphics-ggplot2/

http://tutorials.iq.harvard.edu/R/Rgraphics/Rgraphics.html

https://cran.r-project.org/web/packages/ggplot2/index.html

https://plot.ly/ggplot2/getting-started/