Presentazione tutorial

AWS Machine Learning Engineering in Computer Science - Data Mining Class

Transcript of Presentazione tutorial

Page 1: Presentazione tutorial

AWS Machine Learning

Engineering in Computer Science - Data Mining Class

Page 2: Presentazione tutorial


Who are we?


Lukas Hermann

Milad Kiwan

Dario Molinari

Lorenzo Vitali

Daniele De Cillis

Matteo Pallotta

Page 3: Presentazione tutorial


Where to find the material

SlideShare repository: http://www.slideshare.net/dariospin93/presentazione-tutorial-70026708

GitHub repository: https://github.com/dariospin93/TutorialDataMining

Here you’ll find the files needed for this tutorial

Page 4: Presentazione tutorial

What is Machine Learning?

“Machine learning is the subfield of computer science that gives computers the ability to learn without being explicitly programmed.” (Wikipedia)

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.” (Tom M. Mitchell, Chair of the Machine Learning Department at Carnegie Mellon University)

Page 5: Presentazione tutorial

What is Machine Learning?

Some ML tasks:

• Classification: inputs are divided into two or more classes. The goal is to produce a model that assigns unseen inputs to one of these classes (e.g., spam filtering: input = emails, output = “spam” or “not spam”).

• Regression: related to classification, but the outputs are continuous rather than discrete (e.g., input = “size of a house”, output = “price”).

Page 6: Presentazione tutorial

What is Machine Learning?

Some ML tasks:

• Clustering: divide inputs into groups. The main difference with respect to classification problems is that the groups are not known beforehand.

• Dimensionality reduction: map inputs into a lower-dimensional space (e.g., input = “set of documents in human language”, output = “which documents cover similar topics”).

• ...

Page 7: Presentazione tutorial

Why Machine Learning?

• Growing flood of data

• Growing availability of computational power

• Progress in algorithms

Page 8: Presentazione tutorial

Amazon Machine Learning

“Service that makes it easy for developers of all skill levels to use machine learning technology. Amazon Machine Learning provides visualization tools and wizards that guide you through the process of creating machine learning (ML) models without having to learn complex ML algorithms and technology.”

Page 9: Presentazione tutorial

What is Amazon ML?

• Robust cloud-based service that makes it easy for developers of all skill levels to use ML technology.

• Create ML models by finding patterns in your existing data.

• Provides visualization tools and wizards that guide you through the process.

Page 10: Presentazione tutorial

When to use Amazon ML?

• No need to learn complex ML algorithms and technology.

• Makes it easy to obtain predictions for your application using simple APIs.

• ML is not a solution for every type of problem. You do not need it if you can determine a target value by using simple rules, computations, or predetermined steps that can be programmed.

Page 11: Presentazione tutorial

When to use Amazon ML?

• Many human tasks cannot be adequately solved using a simple rule-based solution, for example recognizing whether an email is spam or not spam.

• When rules depend on too many factors and many of these rules overlap or need to be tuned very finely.

Page 12: Presentazione tutorial

When to use Amazon ML?

You can use ML approaches for these specific ML tasks:

• binary classification (predicting one of two possible outcomes).

• multiclass classification (predicting one of more than two outcomes).

• regression (predicting a numeric value).

Page 13: Presentazione tutorial

Formulating The Problem

• The first step in machine learning is to decide what you want to predict, which is known as the label or target answer.

– Predict the number of purchases your customers will make for each product. (regression problem)

– Predict which products will get more than 10 purchases. (binary classification problem)

– Predict which category of products is most interesting to this customer. (multiclass classification problem)
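The three formulations above can be illustrated on one small table. The product names, categories, and purchase counts below are made up for illustration; the sketch only shows how the same raw data yields a different target for each problem type.

```python
# Hypothetical purchase data: product -> (category, number of purchases).
purchases = {
    "laptop": ("electronics", 14),
    "novel": ("books", 7),
    "headphones": ("electronics", 25),
    "cookbook": ("books", 3),
}

# Regression target: the number of purchases itself (a numeric value).
regression_targets = {name: count for name, (_, count) in purchases.items()}

# Binary classification target: 1 if the product gets more than 10 purchases, else 0.
binary_targets = {name: int(count > 10) for name, (_, count) in purchases.items()}

# Multiclass target: the category with the most purchases overall.
totals = {}
for category, count in purchases.values():
    totals[category] = totals.get(category, 0) + count
multiclass_target = max(totals, key=totals.get)

print(regression_targets)
print(binary_targets)
print(multiclass_target)
```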

Page 14: Presentazione tutorial

Collecting Labeled Data

• Labeled data: data for which you already know the target answer.

• The target: the answer that you want to predict.

Page 15: Presentazione tutorial

Collecting Labeled Data

• Data is often not readily available in labeled form. Collecting and preparing the variables and the target are often the most important steps in solving an ML problem.

• You provide data that is labeled with the target to the ML algorithm to learn from. Then, you will use the trained ML model to predict this answer on data for which you do not know the target answer.

Page 16: Presentazione tutorial

What is Amazon S3 (Simple Storage Service)?

• Amazon S3 has a simple web services interface that you can use to store and retrieve any amount of data.

• It is designed to make web-scale computing easier for developers.

Page 17: Presentazione tutorial

Training and Evaluation Data

• The fundamental goal of ML is to generalize beyond the data instances used to train models.

• When you train a model through the Amazon ML console, Amazon ML uses the first 70 percent of the input data for training and the remaining 30 percent as the evaluation datasource.

Page 18: Presentazione tutorial

Training and Evaluation Data

• The ML system uses the training data to train models to see patterns, and uses the evaluation data to evaluate the predictive quality of the trained model

• The ML system evaluates predictive performance by comparing predictions on the evaluation data set with true values.
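The split and the comparison with true values can be sketched in plain Python. This is only an illustration of the idea, not Amazon ML's implementation; the function names and the toy rows are assumptions made for the example.

```python
def split_70_30(rows):
    """Return (training_rows, evaluation_rows) using a sequential 70/30 split."""
    cut = (len(rows) * 7) // 10
    return rows[:cut], rows[cut:]


def accuracy(predictions, true_values):
    """Fraction of predictions that match the true values."""
    matches = sum(1 for p, t in zip(predictions, true_values) if p == t)
    return matches / len(true_values)


rows = list(range(10))          # stand-in for ten observations
training, evaluation = split_70_30(rows)
print(training)                 # first 70 percent, used to train the model
print(evaluation)               # remaining 30 percent, used only for evaluation
print(accuracy([1, 0, 1], [1, 1, 1]))
```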

Page 19: Presentazione tutorial

Evaluation

• Threshold for prediction can be adjusted

• Control precision and recall

Page 20: Presentazione tutorial

Precision and Recall
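The figure of this slide is not part of the transcript. As a reminder, the standard definitions are precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP, and FN are the true positives, false positives, and false negatives. The sketch below computes both at a given score threshold; the scores and labels are made up for illustration.

```python
def precision_recall(scores, labels, threshold):
    """Compute (precision, recall) when scores above the threshold are labeled 1."""
    predicted = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical prediction scores and true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

print(precision_recall(scores, labels, 0.5))
print(precision_recall(scores, labels, 0.35))
```

Lowering the threshold in this toy example raises recall, because more true positives are caught.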

Page 21: Presentazione tutorial

AWS ML Techniques

• For regression, Amazon ML uses linear regression.

• For classification, Amazon ML uses logistic regression:

– Despite its name, it is a classification method.

– It uses a model similar to linear regression, passed through a logistic (sigmoid) function.

– It can be binomial or multinomial.

Page 22: Presentazione tutorial

Logistic Regression: Example

• Labeled data with labels y ∈ {0, 1}

• E.g.:

– x: hours of study

– y: pass (1) or fail (0)

• What’s the probability of success given a certain time spent studying?
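A minimal sketch of this example follows. It fits a one-variable logistic regression with plain gradient descent; the study hours and outcomes are invented for illustration, and the training loop is a textbook method, not necessarily what Amazon ML does internally.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= learning_rate * sum(e * x for e, x in zip(errors, xs)) / n
        b -= learning_rate * sum(errors) / n
    return w, b


# Hypothetical data: hours of study and whether the exam was passed.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
passed = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1]

w, b = fit_logistic(hours, passed)
for h in (1.0, 3.0, 4.5):
    print(h, round(sigmoid(w * h + b), 3))  # estimated probability of passing
```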


Page 23: Presentazione tutorial

Logistic Regression: Example


Page 24: Presentazione tutorial

Limitations of Amazon ML

• Only supervised learning (no clustering etc.)

• No selection of the ML method possible

• Preprocessing of the data is a black box


Page 25: Presentazione tutorial

AWS Machine Learning – hands-on

Page 26: Presentazione tutorial


A cloud application

With Amazon ML, we can build and train a predictive model in a scalable cloud solution. There is no need to install any application to run this tool, because it runs in the cloud; we only need a web browser to access it. In this tutorial we’ll show you the basic functionality of Amazon ML: creating a datasource, building a model, and using this model to generate predictions.

To do this, we need a dataset, as big as possible. Our dataset is taken from the University of California at Irvine (UCI) Machine Learning Repository, which hosts many such datasets.

Page 27: Presentazione tutorial

What we will see in this tutorial

In this tutorial we’ll see how machine learning can be used for marketing purposes. To do this, we’ll show you how to build and train a model that helps you make decisions based on the data you have.

We’ll focus on selecting people based on their earnings, which may be useful to find who is most suitable for certain marketing offers.

Page 28: Presentazione tutorial

Tutorial Plan

1. Preparing the data

2. Creating a training datasource

3. Creating a model

4. Reviewing the model’s predictive performance and setting a score threshold

5. Using the model to generate predictions

6. Cleaning up (to avoid incurring unwanted charges)

Page 29: Presentazione tutorial

Step 1: Preparing the data

First, we must be sure that the tool understands the data we pass to it. To do this, we should ensure that our dataset follows Amazon’s guidelines:

• Data must be saved in .csv format

• Each row must be a single observation

• Each column must contain a single attribute of the observation

• The first row should contain the attribute names (or you can provide them in a separate file, but it’s not recommended)

• Attributes must be separated by commas

• If you use Excel on macOS, do not save in the “comma separated value (.csv)” format; use “windows comma separated (.csv)” instead.
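Some of these guidelines can be checked before uploading. The sketch below is a hypothetical helper (not part of Amazon ML) that verifies that every row has as many comma-separated values as the header row; the sample text is made up.

```python
import csv
import io


def check_csv(text):
    """Return a list of problems found in CSV text; an empty list means none."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return ["file is empty"]
    problems = []
    header = rows[0]
    if len(set(header)) != len(header):
        problems.append("duplicate attribute names in the header row")
    # Rows are numbered from 1, so the first data row is row 2.
    for number, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            problems.append(
                f"row {number}: expected {len(header)} values, found {len(row)}"
            )
    return problems


sample = "age,education,class\n39,Bachelors,0\n50,Masters\n"
print(check_csv(sample))
```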

Page 30: Presentazione tutorial

Step 1: Preparing the data

Consider our dataset: open the “census.csv” file

Our target is the attribute “class”: how much a person earns per year (binary: 1 if > 50,000, 0 if ≤ 50,000).

Page 31: Presentazione tutorial

Step 1: Preparing the data

In practice, the machine will learn the characteristics of the people who earn more than the threshold and of those who earn less. With this knowledge, we will ask it to predict which class other people belong to.

Page 32: Presentazione tutorial

Step 1: Preparing the data

Open the census-batch.csv file: there is no “class” attribute there. The tool’s job now is to show us what it has learned by working on this dataset, for which we know the right “class” values even though they are not included in the file.

Page 33: Presentazione tutorial

Step 2: Creating the training datasource

In order to use all our files, we have to upload them to Amazon S3

• Open https://console.aws.amazon.com/s3/

• Create a new bucket

• Choose upload in the navigation bar

• Add the files mentioned before
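The upload can also be scripted with boto3. The following is a minimal sketch under a few assumptions: the bucket already exists, the files are in the current directory, and the helper name is ours. The S3 client is passed in so the function itself makes no network call until it is used.

```python
def upload_tutorial_files(s3_client, bucket, filenames):
    """Upload each local file to the bucket, using the file name as the object key."""
    for filename in filenames:
        # boto3 S3 client signature: upload_file(Filename, Bucket, Key)
        s3_client.upload_file(filename, bucket, filename)
    return [f"s3://{bucket}/{name}" for name in filenames]


# Usage against a real account (not executed here; "your-bucket" is a placeholder):
#   import boto3
#   s3 = boto3.client("s3")
#   upload_tutorial_files(s3, "your-bucket", ["census.csv", "census-batch.csv"])
```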

Page 34: Presentazione tutorial

Step 2: Creating the training datasource

Now to create the datasource (it will contain only the location of the data):

• Open https://console.aws.amazon.com/machinelearning/

• Choose Get Started (or Create New) and launch

• Select S3 from “Where your data is located?”

• Type <name of your bucket>/census.csv

• Enter the name “Census data”

• Choose verify and grant permission

• Review and choose continue
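For reference, the same datasource can be created through the API with the boto3 `create_data_source_from_s3` call. This is a sketch under assumptions: unlike the console wizard, the API does not infer the schema, so we assume a schema file has already been uploaded to the bucket (`schema_key` is a hypothetical object key), and the helper name and datasource ID are ours.

```python
def create_census_datasource(ml_client, datasource_id, bucket, schema_key):
    """Create a training datasource that points at census.csv in S3."""
    return ml_client.create_data_source_from_s3(
        DataSourceId=datasource_id,
        DataSourceName="Census data",
        DataSpec={
            "DataLocationS3": f"s3://{bucket}/census.csv",
            # Assumption: a schema file was uploaded next to the data.
            "DataSchemaLocationS3": f"s3://{bucket}/{schema_key}",
        },
        # Statistics are needed for a datasource that will be used for training.
        ComputeStatistics=True,
    )


# Usage against a real account (not executed here):
#   import boto3
#   ml = boto3.client("machinelearning", region_name="eu-west-1")
#   create_census_datasource(ml, "ds-census-0001", "your-bucket", "census.csv.schema")
```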

Page 35: Presentazione tutorial

Step 2: Creating the training datasource

A schema contains the information needed to interpret the input data for the model. The simplest and fastest option is to let Amazon infer it, but we have to check that it is correct. Review the schema and be sure that:

• Attributes with only 2 possible states are marked as binary

• Attributes that are numbers or strings used to denote a category are marked as categorical

• Attributes that are numbers where order matters are marked as numeric

• Attributes that are plain strings are marked as text

Then choose Continue.
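To see why the inferred schema needs a review, consider a naive guesser for the four attribute types. This heuristic is purely illustrative and is not Amazon's inference algorithm; the function name and the `max_categories` cut-off are assumptions.

```python
def guess_attribute_type(values, max_categories=20):
    """Guess BINARY, NUMERIC, CATEGORICAL, or TEXT from a column's string values."""
    distinct = set(values)
    if len(distinct) == 2:
        return "BINARY"
    try:
        for value in distinct:
            float(value)
    except ValueError:
        pass  # at least one value is not a number
    else:
        return "NUMERIC"
    if len(distinct) <= max_categories:
        return "CATEGORICAL"
    return "TEXT"


print(guess_attribute_type(["0", "1", "1", "0"]))
print(guess_attribute_type(["39", "50", "38", "53"]))
print(guess_attribute_type(["Bachelors", "Masters", "HS-grad"]))
```

A guesser like this marks every all-digit column as numeric, including numeric codes that really denote categories, which is exactly the kind of mistake the review step is meant to catch.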

Page 36: Presentazione tutorial

Step 2: Creating the training datasource

Finally, we choose the target attribute to predict; in this case it is “class”. We don’t have an identifier, so we skip that step, choose Continue, and the datasource will be created.

Page 37: Presentazione tutorial

Step 3: Creating an ML model

Amazon should redirect us to the page of model creation. If not:

• From the console, click on “create a new model”

• Choose “I already created a datasource pointing to my S3 data”

• Pick the datasource we created previously and click Continue

• Be sure the model name is “ML model: Census data” and select Default

• The evaluation name must be “Evaluation ML model: Census data”; review and finish

Page 38: Presentazione tutorial

Step 3: Creating an ML model

Now Amazon is processing our data, and this may take some minutes

Page 39: Presentazione tutorial

Step 3: Creating an ML model

Amazon is performing the following operations:

• Splitting the training datasource into 2 parts: one containing 70% of the data and one containing the remaining 30%

• Training the model with the 70% of the data

• Testing the resulting model with the 30%

The status is now Pending. It will change to In Progress and then to Completed.
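When datasources are created through the API rather than the console, the same 70/30 split is expressed with a `DataRearrangement` JSON string in the datasource specification. The sketch below only builds the two strings; the helper name is ours.

```python
import json


def split_rearrangement(percent_begin, percent_end):
    """Build a DataRearrangement string selecting a percentage range of the data."""
    return json.dumps(
        {"splitting": {"percentBegin": percent_begin, "percentEnd": percent_end}}
    )


training_rearrangement = split_rearrangement(0, 70)      # first 70% for training
evaluation_rearrangement = split_rearrangement(70, 100)  # last 30% for evaluation

print(training_rearrangement)
print(evaluation_rearrangement)
```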

Page 40: Presentazione tutorial

Step 3: Creating an ML model

Page 41: Presentazione tutorial

Step 4: Reviewing the model’s predictive performance and setting a score threshold

It’s important to check whether the model is good enough for future predictions. This can be done by looking at the model evaluation.

Take a look at the AUC (Area Under the Curve) metric: it is an industry-standard metric that expresses the quality of the model’s predictions.

• Choose evaluation in the model summary

• Click on our model

• Click on summary
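For intuition about the AUC value shown in the summary: it equals the probability that a randomly chosen positive record receives a higher score than a randomly chosen negative one. The sketch below computes it by comparing all positive/negative pairs; the scores and labels are invented, and this is an illustration, not how the console computes the metric.

```python
def auc(scores, labels):
    """Area under the ROC curve, computed by pairwise comparison (ties count 0.5)."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical prediction scores and true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
print(auc(scores, labels))
```

A value of 1.0 means every positive outranks every negative, while 0.5 is what random scores would give.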

Page 42: Presentazione tutorial

Step 4: Reviewing the model’s predictive performance and setting a score threshold

In short, the ML model generates a numeric prediction score for each record and then, based on a threshold, converts these scores into binary labels.

Page 43: Presentazione tutorial

Step 4: Reviewing the model’s predictive performance and setting a score threshold

We can interact with this evaluation: if we change the threshold, we change how the model assigns the labels.

• On the evaluation summary page, choose “Adjust score threshold”

• Try moving the vertical line on the graph; the number of correct predictions and errors will change:

– Moving it to the right reduces the number of false positives

– Moving it to the left reduces the number of false negatives

• Move it until the score threshold becomes 0.37 (this decreases the false negatives)
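The effect of moving the threshold can be reproduced on a toy example. The scores and labels below are made up; only the 0.37 value comes from the tutorial, and the helper name is ours.

```python
def confusion_counts(scores, labels, threshold):
    """Return (false_positives, false_negatives) for a given score threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s > threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= threshold and y == 1)
    return fp, fn


# Hypothetical prediction scores and true labels.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]

# Higher thresholds (line moved right) trade false positives for false negatives.
for threshold in (0.2, 0.37, 0.5, 0.7):
    print(threshold, confusion_counts(scores, labels, threshold))
```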

Page 44: Presentazione tutorial

Step 4: Reviewing the model’s predictive performance and setting a score threshold

From now on, every time the model predicts a label, it will use this new threshold.

Page 45: Presentazione tutorial

Step 5: Using the ML model to generate predictions

Two types of prediction can be generated:

• Real-time predictions: a prediction for a single observation that Amazon generates on demand

• Batch predictions: a set of predictions for a group of observations (N.B.: Amazon will charge you €0.10 per 1,000 predictions, rounding up to the next thousand)
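The rounding rule matters for small batches. A quick sketch of the arithmetic, using the rate quoted on the slide (current pricing may differ) and working in cents to avoid floating-point rounding:

```python
def batch_prediction_cost_cents(num_predictions, cents_per_thousand=10):
    """Cost in cents, billing every started block of 1,000 predictions."""
    thousands = -(-num_predictions // 1000)  # ceiling division
    return thousands * cents_per_thousand


for n in (1, 1000, 1001, 25000):
    print(n, batch_prediction_cost_cents(n))
```

A batch of 1,001 predictions is therefore billed as 2,000.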

Page 46: Presentazione tutorial

Step 5: Using the ML model to generate predictions

We’ll now try batch predictions, for which we need the census-batch.csv file that we uploaded at the beginning.

• Click on Amazon Machine Learning

• Click on Batch prediction

• Choose the model we created and click Continue

• In “Locate the input data”, choose “My data is in S3, and I need to create a datasource”

• For the name of the datasource, type “Census data 2” and for the location of the file type “your-bucket/census-batch.csv”

• For “Does the first line in your CSV contain the column names?”, choose Yes, then Verify and Continue

Page 47: Presentazione tutorial

Step 5: Using the ML model to generate predictions

• For the destination, type the location where you uploaded the file at the beginning

• Accept the default name

• Choose Review

• Grant permission to Amazon S3

• On the review page choose Finish

As we saw with the training, Amazon will now process our file and give us the results.

Page 48: Presentazione tutorial

Step 5: Using the ML model to generate predictions

Page 49: Presentazione tutorial

Step 5: Using the ML model to generate predictions

To view the results:

• Go to https://console.aws.amazon.com/s3/

• Navigate to the output location given before

• You will find a compressed file containing the result: download it and open it

Page 50: Presentazione tutorial

Step 5: Using the ML model to generate predictions

This file has 2 columns: the best answer and the score for each row of the datasource.

If the score is greater than the threshold, the best answer will be “> 50,000”.

If the score is smaller than the threshold, the best answer will be “≤ 50,000”.
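The downloaded file can also be read programmatically. The sketch below assumes the result is a gzip-compressed CSV with a header row naming the two columns `bestAnswer` and `score`; check the header of your own file, since these names are an assumption here. It also shows how the scores could be relabeled with a different threshold.

```python
import csv
import gzip
import io


def read_batch_output(gz_bytes):
    """Decompress the result file and return its rows as dictionaries."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


def relabel(rows, threshold):
    """Recompute the binary label of each row from its score."""
    return [1 if float(row["score"]) > threshold else 0 for row in rows]


# Made-up file content standing in for a downloaded result.
example = gzip.compress(b"bestAnswer,score\n1,0.81\n0,0.12\n0,0.40\n")
rows = read_batch_output(example)
print(rows)
print(relabel(rows, 0.37))
```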

Page 51: Presentazione tutorial

Step 6: Cleaning up

It’s safe to delete all the models and predictions we created so far, in order to avoid additional charges and to keep our console clean.
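The cleanup can be scripted as well. This sketch uses the boto3 Amazon ML delete calls; the helper name is ours and the IDs are whatever was assigned when the objects were created. It does not remove the files stored in S3.

```python
def clean_up(ml_client, batch_prediction_id, evaluation_id, model_id, datasource_ids):
    """Delete the Amazon ML objects created during the tutorial."""
    ml_client.delete_batch_prediction(BatchPredictionId=batch_prediction_id)
    ml_client.delete_evaluation(EvaluationId=evaluation_id)
    ml_client.delete_ml_model(MLModelId=model_id)
    for datasource_id in datasource_ids:
        ml_client.delete_data_source(DataSourceId=datasource_id)


# Usage against a real account (not executed here; all IDs are placeholders):
#   import boto3
#   ml = boto3.client("machinelearning", region_name="eu-west-1")
#   clean_up(ml, "bp-id", "ev-id", "ml-id", ["ds-train-id", "ds-batch-id"])
```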


Page 52: Presentazione tutorial

AWS Machine Learning - Homework Assignment

Page 53: Presentazione tutorial

Homework Assignment

The tutorial introduced the Amazon ML service through its graphical interface; in practice, however, it can be useful to integrate the service into an application.

Amazon ML addresses this need by offering a large, complete, and easy-to-use set of APIs.

http://docs.aws.amazon.com/machine-learning/latest/APIReference

Page 54: Presentazione tutorial

Homework Assignment

Assignment:

You are asked to repeat the steps presented in the tutorial, with the exception of the 5th step (Using the model to generate predictions). Instead, you are asked to complete that step by writing a Python script that uses the APIs.

Write the code needed to:

1) Generate real-time predictions

2) Generate batch predictions

Page 55: Presentazione tutorial

Homework Assignment – Before starting

DASHBOARD LINK: https://eu-west-1.console.aws.amazon.com/machinelearning

DATASOURCE_ID: once in the dashboard, click on the datasource (created at step 2), then copy the ID

MODEL_ID: once in the dashboard, click on the model (created at step 3), then copy the ID

ID and KEY: once in the dashboard, click on your username at the top right of the screen → "My Security Credentials" → expand the item "Access Keys" → "Create new access key" → copy the generated ID and KEY

Page 56: Presentazione tutorial

Homework Assignment – Before starting

GIVE PERMISSIONS TO FILES IN S3: It is mandatory to grant usage permissions to the files uploaded to S3. To do so: right click on the files → Properties → Permissions → Add more permissions → select 'Any authenticated AWS user' → tick all the permissions

ENABLE MODEL FOR REAL TIME PREDICTIONS: click on the model → create endpoint

Page 57: Presentazione tutorial

Homework Assignment – Exercise 1

Generate real-time predictions: in a new file, store 10 records of the “census-batch.csv” file. Generate one real-time prediction per record and print the results.

You can make use of the following function:

    from boto3.session import Session  # install the boto3 library first

    MODEL_ID = 'the id of the model you have created'
    ID = 'your id'
    KEY = 'your key'

    session = Session(aws_access_key_id=ID, aws_secret_access_key=KEY)
    client = session.client('machinelearning', region_name='eu-west-1')
    prediction_endpoint = "https://realtime.machinelearning.eu-west-1.amazonaws.com"

    # Attribute names, in the same order as the columns of the csv file
    fields = ["age", "work class", "fnlwgt", "education", "education-num",
              "marital-status", "occupation", "relationship", "race", "sex",
              "capital-gain", "capital-loss", "hours-per-week", "native-country"]

    def real_time_prediction(line):
        # line = one line of the csv file
        record = dict()
        # strip() removes the trailing newline so it does not end up in the last value
        for index, val in enumerate(line.strip().split(',')):
            record[fields[index]] = val
        response = client.predict(MLModelId=MODEL_ID, Record=record,
                                  PredictEndpoint=prediction_endpoint)
        return response.get('Prediction')

Page 58: Presentazione tutorial

Homework Assignment – Exercise 2

Generate batch predictions: use the “census-batch.csv” file that you’ve uploaded and then check the results on S3.

You can make use of the following function:

    import time  # needed for time.sleep below
    from boto3.session import Session  # install the boto3 library first

    ID = 'your id'
    KEY = 'your key'
    MODEL_ID = 'the id of the model you have created'
    DATASOURCE_ID = 'the id of the data source you have created (the one related to census-batch.csv)'
    PREDICTION_ID = "batch_prediction_0001"  # must be unique
    PREDICTION_NAME = "bp_0001"
    OUTPUT_URI = "s3://your_bucket/dir_batch_0001"

    session = Session(aws_access_key_id=ID, aws_secret_access_key=KEY)
    client = session.client('machinelearning', region_name='eu-west-1')
    client.create_batch_prediction(BatchPredictionId=PREDICTION_ID,
                                   BatchPredictionName=PREDICTION_NAME,
                                   MLModelId=MODEL_ID,
                                   BatchPredictionDataSourceId=DATASOURCE_ID,
                                   OutputUri=OUTPUT_URI)

    # Poll the status until the batch prediction has finished
    status = "PENDING"
    while status != "COMPLETED" and status != "FAILED":
        print(status)
        response = client.get_batch_prediction(BatchPredictionId=PREDICTION_ID)
        status = response['Status']
        time.sleep(3)

    print(status)
    print(response)
    print("Your results are in s3!")

Page 59: Presentazione tutorial

Homework Assignment

For this homework, you’ll have one week to deliver the results.

In particular, the deadline is 20/12/2016, 23:59.

You are asked to send the code and the instructions to run it, in a .zip file, to one of our email addresses.

Page 60: Presentazione tutorial

Homework Assignment

For any kind of problem or information, please contact us!

Contacts:

• Dario Molinari: [email protected]

• Daniele De Cillis: [email protected]

• Lorenzo Vitali: [email protected]

• Lukas Hermann: [email protected]

• Milad Kiwan: [email protected]

• Matteo Pallotta: [email protected]
