Workshop 5 (Solved)

Workshop 5

The following table consists of training data from an employee database. The data have been generalized. For example, "31-35" for age represents the age range 31 to 35. For a given row, count represents the number of data tuples that have the values of department, status, age, and salary given in that row.

department  status  age    salary   count
sales       senior  31-35  46K-50K  30
sales       junior  26-30  26K-30K  40
sales       junior  31-35  31K-35K  40
systems     junior  21-25  46K-50K  20
systems     senior  31-35  66K-70K  5
systems     junior  26-30  46K-50K  3
systems     senior  41-45  66K-70K  3
marketing   senior  36-40  46K-50K  10
marketing   junior  31-35  41K-45K  4
secretary   senior  46-50  36K-40K  4
secretary   junior  26-30  26K-30K  6

Let status be the class label attribute.

1) Using Weka:
a. Build the tree using Id3, J48, and random forest. Compare the results.
b. Bayes net
c. Multilayer perceptron
d. LibSVM: try 4 different kernel types. Compare the results.
e.

2) Another method for solving Bayesian networks is belief propagation, also known as sum-product message passing. Briefly describe what it consists of.

3) Do exercise 9.1 from the book.

Solution
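Before reading the classifier outputs, it can help to check the totals implied by the table. The following is a minimal sketch, not the input file actually used (the source does not show it); the names `ROWS` and `expand` are illustrative. It repeats each generalized row `count` times and tallies the class label.

```python
from collections import Counter

# (department, status, age, salary, count), copied from the table above
ROWS = [
    ("sales", "senior", "31-35", "46K-50K", 30),
    ("sales", "junior", "26-30", "26K-30K", 40),
    ("sales", "junior", "31-35", "31K-35K", 40),
    ("systems", "junior", "21-25", "46K-50K", 20),
    ("systems", "senior", "31-35", "66K-70K", 5),
    ("systems", "junior", "26-30", "46K-50K", 3),
    ("systems", "senior", "41-45", "66K-70K", 3),
    ("marketing", "senior", "36-40", "46K-50K", 10),
    ("marketing", "junior", "31-35", "41K-45K", 4),
    ("secretary", "senior", "46-50", "36K-40K", 4),
    ("secretary", "junior", "26-30", "26K-30K", 6),
]


def expand(rows):
    """Repeat each generalized row `count` times, dropping the count column."""
    return [row[:4] for row in rows for _ in range(row[4])]


instances = expand(ROWS)
class_counts = Counter(status for _, status, _, _ in instances)
print(len(instances), dict(class_counts))
```

The totals (165 instances, 52 senior and 113 junior) agree with the instance counts and confusion-matrix row sums reported by Weka below.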

a. Id3

=== Classifier model (full training set) ===

Id3

salary = 46K-50k: senior

salary = 26K-30K: junior

salary = 31K-35K: junior

salary = 46K-50K

| department = sales: null

| department = systems: junior

| department = marketing: senior

| department = secretary: null

salary = 66K-70K: senior

salary = 41K-45k: junior

salary = 36K-40K: senior

Time taken to build model: 0.03 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 165 100 %

Incorrectly Classified Instances 0 0 %

Kappa statistic 1

Mean absolute error 0

Root mean squared error 0

Relative absolute error 0 %

Root relative squared error 0 %

Total Number of Instances 165

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

1 0 1 1 1 1 senior

1 0 1 1 1 1 junior

Weighted Avg. 1 0 1 1 1 1

=== Confusion Matrix ===

a b <-- classified as

52 0 | a = senior

0 113 | b = junior

J48

=== Classifier model (full training set) ===

J48 pruned tree

------------------

salary = 46K-50k: senior (30.0)

salary = 26K-30K: junior (46.0)

salary = 31K-35K: junior (40.0)

salary = 46K-50K

| department = sales: junior (0.0)

| department = systems: junior (23.0)

| department = marketing: senior (10.0)

| department = secretary: junior (0.0)

salary = 66K-70K: senior (8.0)

salary = 41K-45k: junior (4.0)

salary = 36K-40K: senior (4.0)

Number of Leaves : 10

Size of the tree : 12

Time taken to build model: 0.02 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 165 100 %

Incorrectly Classified Instances 0 0 %

Kappa statistic 1

Mean absolute error 0

Root mean squared error 0

Relative absolute error 0 %

Root relative squared error 0 %

Total Number of Instances 165

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

1 0 1 1 1 1 senior

1 0 1 1 1 1 junior

Weighted Avg. 1 0 1 1 1 1

=== Confusion Matrix ===

a b <-- classified as

52 0 | a = senior

0 113 | b = junior

random forest

=== Classifier model (full training set) ===

Random forest of 10 trees, each constructed while considering 3 random features.

Out of bag error: 0

Time taken to build model: 0.04 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 165 100 %

Incorrectly Classified Instances 0 0 %

Kappa statistic 1

Mean absolute error 0.0029

Root mean squared error 0.0199

Relative absolute error 0.6805 %

Root relative squared error 4.2893 %

Total Number of Instances 165

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

1 0 1 1 1 1 senior

1 0 1 1 1 1 junior

Weighted Avg. 1 0 1 1 1 1

=== Confusion Matrix ===

a b <-- classified as

52 0 | a = senior

0 113 | b = junior

The Id3 and J48 outputs both show how the tuples listed above are distributed across the tree. J48 additionally reports the instance count at each leaf, the number of leaves (10), and the size of the tree (12), which Id3 does not. Both methods show zero error. Random forest also classifies all 165 instances correctly, but it reports small nonzero error values (mean absolute error 0.0029, relative absolute error 0.6805 %), so by those error measures J48 is slightly more precise than random forest.
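Both trees above split on salary at the root. One way to see why is to compute the information gain of each attribute with respect to status, which is the criterion Id3 uses. The sketch below is illustrative only: it uses the labels as printed in the table, whereas the Weka output distinguishes "46K-50k" from "46K-50K", which suggests the input file had inconsistent capitalization that this sketch does not reproduce. The names `ROWS`, `ATTRS`, `entropy`, and `info_gain` are assumptions.

```python
from collections import defaultdict
from math import log2

# (department, status, age, salary, count), copied from the table
ROWS = [
    ("sales", "senior", "31-35", "46K-50K", 30),
    ("sales", "junior", "26-30", "26K-30K", 40),
    ("sales", "junior", "31-35", "31K-35K", 40),
    ("systems", "junior", "21-25", "46K-50K", 20),
    ("systems", "senior", "31-35", "66K-70K", 5),
    ("systems", "junior", "26-30", "46K-50K", 3),
    ("systems", "senior", "41-45", "66K-70K", 3),
    ("marketing", "senior", "36-40", "46K-50K", 10),
    ("marketing", "junior", "31-35", "41K-45K", 4),
    ("secretary", "senior", "46-50", "36K-40K", 4),
    ("secretary", "junior", "26-30", "26K-30K", 6),
]
ATTRS = {"department": 0, "age": 2, "salary": 3}  # column index of each attribute


def entropy(counts):
    """Shannon entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def info_gain(rows, col):
    """Class entropy minus the count-weighted entropy after splitting on `col`."""
    class_totals = defaultdict(int)
    by_value = defaultdict(lambda: defaultdict(int))
    for row in rows:
        status, count = row[1], row[4]
        class_totals[status] += count
        by_value[row[col]][status] += count
    n = sum(class_totals.values())
    remainder = sum(
        sum(d.values()) / n * entropy(list(d.values())) for d in by_value.values()
    )
    return entropy(list(class_totals.values())) - remainder


gains = {name: info_gain(ROWS, col) for name, col in ATTRS.items()}
best = max(gains, key=gains.get)
print(best, gains)
```

With these labels salary has the largest gain, followed by age and then department, which is consistent with salary appearing at the root of both trees.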

b. Bayes Net

=== Classifier model (full training set) ===

Bayes Network Classifier

not using ADTree

#attributes=4 #classindex=1

Network structure (nodes followed by parents)

department(4): status

status(2):

age(6): status

salary(7): status

LogScore Bayes: -671.2178322167754

LogScore BDeu: -734.0007744477259

LogScore MDL: -741.4529682985715

LogScore ENTROPY: -667.416758927013

LogScore AIC: -696.416758927013

Time taken to build model: 0.01 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 161 97.5758 %

Incorrectly Classified Instances 4 2.4242 %

Kappa statistic 0.945

Mean absolute error 0.0273

Root mean squared error 0.0912

Relative absolute error 6.3016 %

Root relative squared error 19.6156 %

Total Number of Instances 165

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

1 0.035 0.929 1 0.963 1 senior

0.965 0 1 0.965 0.982 1 junior

Weighted Avg. 0.976 0.011 0.977 0.976 0.976 1

=== Confusion Matrix ===

a b <-- classified as

52 0 | a = senior

4 109 | b = junior

This classifier reports log scores for the learned network under several measures: Bayes, BDeu, MDL, entropy, and AIC. It also shows how many instances were classified correctly (161, or 97.5758 %) and incorrectly (4, or 2.4242 %). According to the confusion matrix, the four errors are junior instances classified as senior.
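In the network structure printed above, status is the only parent of department, age, and salary, which is the structure of a naive Bayes classifier. The sketch below shows how a posterior over status could be computed under that structure. It is an approximation, not Weka's implementation: the smoothing constant `alpha=0.5` and the helper names `fit` and `posterior` are assumptions, and the labels are taken as printed in the table.

```python
from collections import defaultdict

# (department, status, age, salary, count), copied from the table
ROWS = [
    ("sales", "senior", "31-35", "46K-50K", 30),
    ("sales", "junior", "26-30", "26K-30K", 40),
    ("sales", "junior", "31-35", "31K-35K", 40),
    ("systems", "junior", "21-25", "46K-50K", 20),
    ("systems", "senior", "31-35", "66K-70K", 5),
    ("systems", "junior", "26-30", "46K-50K", 3),
    ("systems", "senior", "41-45", "66K-70K", 3),
    ("marketing", "senior", "36-40", "46K-50K", 10),
    ("marketing", "junior", "31-35", "41K-45K", 4),
    ("secretary", "senior", "46-50", "36K-40K", 4),
    ("secretary", "junior", "26-30", "26K-30K", 6),
]
COLS = {"department": 0, "age": 2, "salary": 3}


def fit(rows, alpha=0.5):
    """Collect weighted class counts and per-attribute (value, class) counts."""
    class_n = defaultdict(float)
    cond = {a: defaultdict(float) for a in COLS}
    values = {a: set() for a in COLS}
    for row in rows:
        status, n = row[1], row[4]
        class_n[status] += n
        for a, c in COLS.items():
            cond[a][(row[c], status)] += n
            values[a].add(row[c])
    return class_n, cond, values, alpha


def posterior(model, query):
    """P(status | query), assuming attributes are independent given status."""
    class_n, cond, values, alpha = model
    total = sum(class_n.values())
    scores = {}
    for status, n in class_n.items():
        p = (n + alpha) / (total + alpha * len(class_n))
        for a, v in query.items():
            # additive smoothing avoids zero probabilities for unseen pairs
            p *= (cond[a][(v, status)] + alpha) / (n + alpha * len(values[a]))
        scores[status] = p
    z = sum(scores.values())
    return {s: p / z for s, p in scores.items()}


model = fit(ROWS)
post = posterior(model, {"department": "sales", "age": "26-30", "salary": "26K-30K"})
print(post)
```

For this query the posterior is strongly in favor of junior, as expected, since no senior tuple in the table has that age or salary range.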

c. Multilayer perceptron

=== Classifier model (full training set) ===

Sigmoid Node 0

Inputs Weights

Threshold 2.7394370243645954

Node 2 -1.8508751465691542

Node 3 -2.354821458534509

Node 4 -1.4870386211949875

Node 5 -1.3229943341191277

Node 6 1.5278889125307347

Node 7 -2.2131181020972868

Node 8 -0.5914245531409845

Node 9 -2.0587896015776685

Node 10 2.270779949723552

Sigmoid Node 1

Inputs Weights

Threshold -2.7825884536066616

Node 2 1.85191579937515

Node 3 2.323626175638421

Node 4 1.477966841443966

Node 5 1.304880433881045

Node 6 -1.4594876918432313

Node 7 2.232715726449767

Node 8 0.6801335831028799

Node 9 2.051171802934638

Node 10 -2.297941872862322

Sigmoid Node 2

Inputs Weights

Threshold -0.07755042992069021

Attrib department=sales 0.22690884175757925

Attrib department=systems 0.1545650553116968

Attrib department=marketing -0.028916340821306043

Attrib department=secretary -0.2113862149709065

Attrib age=31-35 -0.02750617633512445

Attrib age=26-30 0.7708789960391705

Attrib age=21-25 0.9031969460307472

Attrib age=41-45 -0.24524874175681483

Attrib age=36-40 -0.9372910085639369

Attrib age=46-50 -0.3173972567786108

Attrib salary=46K-50k -1.2721688131980107

Attrib salary=26K-30K 0.596029591410492

Attrib salary=31K-35K 1.073614573206162

Attrib salary=46K-50K 0.10622100546611281

Attrib salary=66K-70K -0.8931249639552377

Attrib salary=41K-45k 0.9840722037777665

Attrib salary=36K-40K -0.32173827499151414

Sigmoid Node 3

Inputs Weights

Threshold -0.09160443154315269

Attrib department=sales 0.19633050850394618

Attrib department=systems 0.14350673033770345

Attrib department=marketing -0.021132001738419306

Attrib department=secretary -0.22057503111112434

Attrib age=31-35 0.01617542844596184

Attrib age=26-30 0.9011003972162315

Attrib age=21-25 1.0622717118790497

Attrib age=41-45 -0.2752012547415663

Attrib age=36-40 -1.084971185421792

Attrib age=46-50 -0.41474256360460926

Attrib salary=46K-50k -1.448381199332202

Attrib salary=26K-30K 0.6690483312961746

Attrib salary=31K-35K 1.2615937981296461

Attrib salary=46K-50K 0.11767253931489512

Attrib salary=66K-70K -1.0619222218399373

Attrib salary=41K-45k 1.1155548070459484

Attrib salary=36K-40K -0.4197992341286937

Sigmoid Node 4

Inputs Weights

Threshold -0.020643481201846912

Attrib department=sales 0.15069976965985996

Attrib department=systems 0.12379838338640137

Attrib department=marketing -0.0943505094026181

Attrib department=secretary -0.14725072714787346

Attrib age=31-35 0.010512235705337582

Attrib age=26-30 0.7498866595807208

Attrib age=21-25 0.7873805984482601

Attrib age=41-45 -0.1464959547224344

Attrib age=36-40 -0.7580525890150757

Attrib age=46-50 -0.2556075853274041

Attrib salary=46K-50k -1.0701392832809675

Attrib salary=26K-30K 0.5473953000497558

Attrib salary=31K-35K 0.992235520740865

Attrib salary=46K-50K 0.13349935484850442

Attrib salary=66K-70K -0.7751334030115394

Attrib salary=41K-45k 0.7376105027424174

Attrib salary=36K-40K -0.31165331988873113

Sigmoid Node 5

Inputs Weights

Threshold -0.10504649400363189

Attrib department=sales 0.1888580478883902

Attrib department=systems 0.15300803282270037

Attrib department=marketing -0.02098097843654761

Attrib department=secretary -0.1280029758315231

Attrib age=31-35 -0.017009595286887676

Attrib age=26-30 0.6482292510400863

Attrib age=21-25 0.6915153763340572

Attrib age=41-45 -0.19586354097092593

Attrib age=36-40 -0.7359220880188584

Attrib age=46-50 -0.29334007131163364

Attrib salary=46K-50k -1.0108613445175572

Attrib salary=26K-30K 0.5262317715574415

Attrib salary=31K-35K 0.9182024924100928

Attrib salary=46K-50K 0.1188196030377099

Attrib salary=66K-70K -0.670937831354511

Attrib salary=41K-45k 0.7010453789252467

Attrib salary=36K-40K -0.22576085497477247

Sigmoid Node 6

Inputs Weights

Threshold 0.09557922327970046

Attrib department=sales -0.018139053459302966

Attrib department=systems -0.10209290440196074

Attrib department=marketing -0.1328719579969402

Attrib department=secretary 0.028572139713019157

Attrib age=31-35 -0.0739998323577446

Attrib age=26-30 -0.3457681026476964

Attrib age=21-25 -0.46939913970448427

Attrib age=41-45 -0.042763003576478054

Attrib age=36-40 0.39852936666748573

Attrib age=46-50 0.06929754238286273

Attrib salary=46K-50k 0.3741253602958736

Attrib salary=26K-30K -0.24297175284159717

Attrib salary=31K-35K -0.4818491178517223

Attrib salary=46K-50K -0.10589748590134765

Attrib salary=66K-70K 0.3434715967606044

Attrib salary=41K-45k -0.5752678422949488

Attrib salary=36K-40K 0.045120149940549435

Sigmoid Node 7

Inputs Weights

Threshold -0.06160115651344392

Attrib department=sales 0.2134113259865331

Attrib department=systems 0.172272506671537

Attrib department=marketing -0.029607224853917643

Attrib department=secretary -0.24106617744959774

Attrib age=31-35 0.029991616105111238

Attrib age=26-30 0.8651532630509001

Attrib age=21-25 0.9906708595270766

Attrib age=41-45 -0.20575332717275158

Attrib age=36-40 -1.047645476787963

Attrib age=46-50 -0.3535834941223248

Attrib salary=46K-50k -1.4332784615759104

Attrib salary=26K-30K 0.6493573607930859

Attrib salary=31K-35K 1.1819371137499333

Attrib salary=46K-50K 0.12596930399229297

Attrib salary=66K-70K -1.064741079546573

Attrib salary=41K-45k 1.1431299292474153

Attrib salary=36K-40K -0.3944593594329362

Sigmoid Node 8

Inputs Weights

Threshold -0.0472094583119319

Attrib department=sales 0.15217654915023798

Attrib department=systems 0.11722287931424957

Attrib department=marketing -0.06807690704999361

Attrib department=secretary -0.024785605611675223

Attrib age=31-35 -0.08736108359059333

Attrib age=26-30 0.4530072183024529

Attrib age=21-25 0.5003986272198906

Attrib age=41-45 -0.03799085042854391

Attrib age=36-40 -0.3920240148745457

Attrib age=46-50 -0.1205586069197979

Attrib salary=46K-50k -0.605742149160759

Attrib salary=26K-30K 0.3802185575469776

Attrib salary=31K-35K 0.6066272566318488

Attrib salary=46K-50K 0.10660422983302109

Attrib salary=66K-70K -0.3320427568391321

Attrib salary=41K-45k 0.42400945288403125

Attrib salary=36K-40K -0.10285393387441728

Sigmoid Node 9

Inputs Weights

Threshold -0.08424420101244516

Attrib department=sales 0.18487395033283963

Attrib department=systems 0.14441985599862148

Attrib department=marketing -0.014746998161577226

Attrib department=secretary -0.17691559953689145

Attrib age=31-35 0.02784723014273439

Attrib age=26-30 0.8489877575101271

Attrib age=21-25 0.9380267979491645

Attrib age=41-45 -0.2608835537805671

Attrib age=36-40 -1.0141091744602124

Attrib age=46-50 -0.39718995571738797

Attrib salary=46K-50k -1.3435954859187489

Attrib salary=26K-30K 0.6385529218748827

Attrib salary=31K-35K 1.186853891625651

Attrib salary=46K-50K 0.14603610059077612

Attrib salary=66K-70K -0.9588501691188384

Attrib salary=41K-45k 1.0099505342882553

Attrib salary=36K-40K -0.362947971136462

Sigmoid Node 10

Inputs Weights

Threshold 0.09908187740254618

Attrib department=sales -0.05342457992841399

Attrib department=systems -0.1192290423182317

Attrib department=marketing -0.05846524907628152

Attrib department=secretary 0.14338709613151754

Attrib age=31-35 -0.09957529167902422

Attrib age=26-30 -0.5969152237408825

Attrib age=21-25 -0.7313360476645168

Attrib age=41-45 0.11828014170484517

Attrib age=36-40 0.7424640467459025

Attrib age=46-50 0.2309416143733404

Attrib salary=46K-50k 0.854775905742494

Attrib salary=26K-30K -0.4117530818438961

Attrib salary=31K-35K -0.7689602300444687

Attrib salary=46K-50K -0.09902009869134437

Attrib salary=66K-70K 0.7183111474183733

Attrib salary=41K-45k -0.8460358832180466

Attrib salary=36K-40K 0.22686248383226118

Class senior

Input

Node 0

Class junior

Input

Node 1

Time taken to build model: 2.09 seconds

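With this algorithm we can see each node of the network together with its inputs and the weight assigned to each input. Nodes 2 to 10 form the hidden layer and take the attribute values as inputs; Node 0 (senior) and Node 1 (junior) are the output nodes and take the hidden nodes as inputs.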
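To make the listing above easier to read, the sketch below shows how a single sigmoid node could turn its inputs into an output. It assumes the printed "Threshold" acts as a bias term added to the weighted sum, which is the usual convention but is not stated in the output itself. The weights are those of Sigmoid Node 0 rounded to four decimals; the hidden activations are hypothetical values chosen only for illustration.

```python
from math import exp


def sigmoid_node(threshold, weights, inputs):
    """Output of one sigmoid unit: sigmoid(threshold + sum(w_i * x_i))."""
    z = threshold + sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + exp(-z))


# Sigmoid Node 0 (class senior): weights for hidden Nodes 2..10, rounded
node0_threshold = 2.7394
node0_weights = [-1.8509, -2.3548, -1.4870, -1.3230, 1.5279,
                 -2.2131, -0.5914, -2.0588, 2.2708]

# Hypothetical hidden-layer activations, NOT taken from the Weka run
hidden = [0.5] * len(node0_weights)

senior_score = sigmoid_node(node0_threshold, node0_weights, hidden)
print(senior_score)
```

The same function would apply to the hidden nodes, with the one-hot attribute indicators as inputs, if the inputs are encoded as the "Attrib department=sales" style labels suggest.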

d. LibSVM, kernel type: Linear

=== Classifier model (full training set) ===

LibSVM wrapper, original code by Yasser EL-Manzalawy (= WLSVM)

Time taken to build model: 0.14 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 125 75.7576 %

Incorrectly Classified Instances 40 24.2424 %

Kappa statistic 0.3429

Mean absolute error 0.2424

Root mean squared error 0.4924

Relative absolute error 56.0304 %

Root relative squared error 105.9548 %

Coverage of cases (0.95 level) 75.7576 %

Mean rel. region size (0.95 level) 50 %

Total Number of Instances 165

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class

0,346 0,053 0,750 0,346 0,474 0,386 0,647 0,466 senior

0,947 0,654 0,759 0,947 0,843 0,386 0,647 0,755 junior

Weighted Avg. 0,758 0,465 0,756 0,758 0,726 0,386 0,647 0,664

=== Confusion Matrix ===

a b <-- classified as

18 34 | a = senior

6 107 | b = junior

LibSVM, kernel type: Polynomial

=== Classifier model (full training set) ===

LibSVM wrapper, original code by Yasser EL-Manzalawy (= WLSVM)

Time taken to build model: 0.06 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 132 80 %

Incorrectly Classified Instances 33 20 %

Kappa statistic 0.4409

Mean absolute error 0.2

Root mean squared error 0.4472

Relative absolute error 46.2251 %

Root relative squared error 96.2382 %

Coverage of cases (0.95 level) 80 %

Mean rel. region size (0.95 level) 50 %

Total Number of Instances 165

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class

0,365 0,000 1,000 0,365 0,535 0,532 0,683 0,565 senior

1,000 0,635 0,774 1,000 0,873 0,532 0,683 0,774 junior

Weighted Avg. 0,800 0,435 0,845 0,800 0,766 0,532 0,683 0,708

=== Confusion Matrix ===

a b <-- classified as

19 33 | a = senior

0 113 | b = junior

LibSVM, kernel type: Radial basis function

=== Classifier model (full training set) ===

LibSVM wrapper, original code by Yasser EL-Manzalawy (= WLSVM)

Time taken to build model: 0.03 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 165 100 %

Incorrectly Classified Instances 0 0 %

Kappa statistic 1

Mean absolute error 0

Root mean squared error 0

Relative absolute error 0 %

Root relative squared error 0 %

Coverage of cases (0.95 level) 100 %

Mean rel. region size (0.95 level) 50 %

Total Number of Instances 165

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class

1,000 0,000 1,000 1,000 1,000 1,000 1,000 1,000 senior

1,000 0,000 1,000 1,000 1,000 1,000 1,000 1,000 junior

Weighted Avg. 1,000 0,000 1,000 1,000 1,000 1,000 1,000 1,000

=== Confusion Matrix ===

a b <-- classified as

52 0 | a = senior

0 113 | b = junior

LibSVM, kernel type: Sigmoid

=== Classifier model (full training set) ===

LibSVM wrapper, original code by Yasser EL-Manzalawy (= WLSVM)

Time taken to build model: 0.04 seconds

=== Stratified cross-validation ===

=== Summary ===

Correctly Classified Instances 143 86.6667 %

Incorrectly Classified Instances 22 13.3333 %

Kappa statistic 0.6513

Mean absolute error 0.1333

Root mean squared error 0.3651

Relative absolute error 30.8167 %

Root relative squared error 78.5782 %

Coverage of cases (0.95 level) 86.6667 %

Mean rel. region size (0.95 level) 50 %

Total Number of Instances 165

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure MCC ROC Area PRC Area Class

0,577 0,000 1,000 0,577 0,732 0,695 0,788 0,710 senior

1,000 0,423 0,837 1,000 0,911 0,695 0,788 0,837 junior

Weighted Avg. 0,867 0,290 0,888 0,867 0,855 0,695 0,788 0,797

=== Confusion Matrix ===

a b <-- classified as

30 22 | a = senior

0 113 | b = junior

This type of algorithm reports the number of correctly and incorrectly classified instances with their respective percentages. It also shows the mean absolute error and relative absolute error, the coverage of cases with its percentage, and the root mean squared error with its relative counterpart.

A table of per-class accuracy details is also produced, with measures such as TP rate, FP rate, precision, recall, F-measure, and MCC. In these runs all of the reported values fall between 0 and 1.
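The per-class measures can be recomputed from a confusion matrix. The sketch below does so for the linear-kernel matrix reported above, treating senior as the positive class. The function name `binary_metrics` is illustrative, and the formulas are the standard definitions rather than code taken from Weka.

```python
from math import sqrt


def binary_metrics(tp, fn, fp, tn):
    """Standard two-class measures from the four confusion-matrix cells."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    tp_rate = tp / (tp + fn)          # also called recall
    fp_rate = fp / (fp + tn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * tp_rate / (precision + tp_rate)
    # agreement expected by chance, from the row and column totals
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2
    kappa = (accuracy - expected) / (1 - expected)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {
        "accuracy": accuracy, "tp_rate": tp_rate, "fp_rate": fp_rate,
        "precision": precision, "f_measure": f_measure,
        "kappa": kappa, "mcc": mcc,
    }


# Linear kernel: 18 senior correct, 34 senior missed, 6 junior misclassified, 107 junior correct
m = binary_metrics(tp=18, fn=34, fp=6, tn=107)
print({k: round(v, 4) for k, v in m.items()})
```

The rounded results agree with the linear-kernel summary: accuracy 0.7576, kappa 0.3429, TP rate 0.346, FP rate 0.053, precision 0.750, F-measure 0.474, and MCC 0.386.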

With each kernel type the number of correctly and incorrectly classified instances changes, and so all the derived figures and percentages vary: 75.7576 % correct with the linear kernel, 80 % with the polynomial kernel, 86.6667 % with the sigmoid kernel, and 100 % with the radial basis function kernel.

With the radial basis function kernel the percentage of incorrectly classified instances was 0. For that reason the error percentages and the root mean squared error are 0, and the TP rate is 1, the FP rate 0, precision 1, recall 1, F-measure 1, and MCC 1.
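For reference, the four kernel types compared above can be written as plain functions. This is a minimal sketch following the kernel definitions given in the LIBSVM documentation; the default values for `gamma`, `coef0`, and `degree` are illustrative assumptions, since the source does not say which parameter values were used in the runs.

```python
from math import exp, tanh


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear(u, v):
    return dot(u, v)


def polynomial(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def rbf(u, v, gamma=1.0):
    # depends only on the squared distance between u and v
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def sigmoid(u, v, gamma=1.0, coef0=0.0):
    return tanh(gamma * dot(u, v) + coef0)


# Two hypothetical one-hot encoded tuples
x = [1, 0, 0, 1]
y = [0, 1, 0, 1]
print(linear(x, y), polynomial(x, y), rbf(x, y), sigmoid(x, y))
```

Only the radial basis function kernel depends on the distance between tuples rather than on their inner product, which may be part of why it separates this small categorical data set more easily than the others.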

2.

Belief propagation

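It is an algorithm for performing inference on graphical models, such as Bayesian networks and Markov random fields. Nodes exchange messages with their neighbors, and from those messages the algorithm computes the marginal distribution of each unobserved node, conditional on any observed nodes. It gives exact results on tree-structured graphs. It is used in artificial intelligence and information theory, and it has been shown to be a useful approximate algorithm on general graphs.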
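The message-passing idea can be illustrated on the simplest tree, a chain of nodes. The sketch below runs sum-product on a three-node binary chain and compares the resulting marginals with brute-force enumeration. The potentials are hypothetical values chosen for illustration, and the function names are assumptions; this is a toy example, not a general implementation.

```python
from itertools import product


def chain_marginals(unary, pairwise):
    """Sum-product on a chain; pairwise[i][x][y] links node i to node i+1."""
    n, k = len(unary), len(unary[0])
    # fwd[i]: message reaching node i from the left; bwd[i]: from the right
    fwd = [[1.0] * k for _ in range(n)]
    bwd = [[1.0] * k for _ in range(n)]
    for i in range(1, n):
        fwd[i] = [
            sum(fwd[i - 1][x] * unary[i - 1][x] * pairwise[i - 1][x][y]
                for x in range(k))
            for y in range(k)
        ]
    for i in range(n - 2, -1, -1):
        bwd[i] = [
            sum(bwd[i + 1][y] * unary[i + 1][y] * pairwise[i][x][y]
                for y in range(k))
            for x in range(k)
        ]
    marginals = []
    for i in range(n):
        belief = [fwd[i][x] * unary[i][x] * bwd[i][x] for x in range(k)]
        z = sum(belief)
        marginals.append([b / z for b in belief])
    return marginals


def brute_force_marginals(unary, pairwise):
    """Reference answer: enumerate every joint assignment."""
    n, k = len(unary), len(unary[0])
    totals = [[0.0] * k for _ in range(n)]
    for xs in product(range(k), repeat=n):
        w = 1.0
        for i, x in enumerate(xs):
            w *= unary[i][x]
        for i in range(n - 1):
            w *= pairwise[i][xs[i]][xs[i + 1]]
        for i, x in enumerate(xs):
            totals[i][x] += w
    z = sum(totals[0])
    return [[t / z for t in row] for row in totals]


# Hypothetical potentials for a three-node binary chain
unary = [[0.6, 0.4], [1.0, 1.0], [0.3, 0.7]]
pairwise = [[[0.9, 0.1], [0.2, 0.8]], [[0.7, 0.3], [0.4, 0.6]]]
bp = chain_marginals(unary, pairwise)
exact = brute_force_marginals(unary, pairwise)
print(bp)
```

On a chain the two results coincide, while the message-passing version needs work that grows linearly with the number of nodes instead of exponentially.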