dialecto

Automatic Dialect Extraction through supervised learning using IBM Watson

Meet the Team

• Pablo Sarquis Peillard
  • CS undergraduate
  • Experienced working in startups through VC funding and development
  • Broke into the top 10 on the iOS App Store multiple times

• David Orozco
  • Software engineering undergraduate student focused on artificial intelligence
  • Programming skills in Python, C/C++, Java, MATLAB
  • Experienced working in teams to develop software applications

• Eduardo Gonzalez
  • Computer science undergraduate student
  • Specialized in games using C#
  • Undergraduate research in virtual reality

Outline

• Dialects
  • What’s a dialect?
  • Why do companies care?
  • What’s Dialecto?

• Deployment Solutions
  • Cognitive Applications
  • Server-less Infrastructure
  • Front End Development

• Dialect Classification
  • Data Collection
  • Text Condensation
  • Training Watson

• Closing Statements
  • Product
  • Team
  • Questions

What’s a dialect?

Why?

• As globalization becomes the norm, a user’s location can no longer be the only go-to signal for market analysis

• Dialect classification offers a way to understand a demographic’s word choice

Dialecto?

• It is a trained language classifier that detects nuances in a text to determine its country of origin.

Cognitive Applications

• Adaptive solutions

• React and change based on new interactions and data

• Contextual

• Are able to quickly define the domain and collect relevant data.

IBM Watson

• Watson is a question-answering computer system capable of answering questions posed in natural language

• Computing system that rivals a human’s ability to retrieve, analyze and interpret vast amounts of information

Server-less Infrastructure using OpenWhisk

• Event-action platform to execute code in response to events

• This allows us to scale effortlessly

• Fast Scaling

• Reduced Operational Costs

Front End Development

• Static Website

• Service runs entirely on JavaScript

• OpenWhisk
  • Serverless platform to serve the custom application

• IBM’s Bluemix API Connect
  • Provides a gateway to the custom application without having to deploy a server

Diagram

Data Collection - Twitter

• Preselected 31 Twitter accounts
• Archived their tweets from the past 3 months
• Over 190k tweets

Limitations

• Watson is limited in the type of data you can send it.

Text Condensation - Unigram filter

To classify tweets as general or unique, each tweet is tokenized and the normalized values of its tokens are multiplied together to obtain a uniqueness value (U) for that tweet. If U exceeds a certain threshold, the tweet is discarded.
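The filter described above can be sketched roughly as follows. The slides do not say how token values are normalized, so this sketch assumes each token's corpus count divided by the count of the most frequent token; the function names, the whitespace tokenizer, and the threshold value are all illustrative, not the team's actual implementation.

```python
from collections import Counter


def tokenize(tweet):
    # Assumption: lowercase and split on whitespace. A real pipeline
    # would likely also strip punctuation, URLs, and mentions.
    return tweet.lower().split()


def normalized_unigram_values(tweets):
    # Assumption: normalize each token count by the largest count,
    # so every value falls in the range (0, 1].
    counts = Counter(tok for tweet in tweets for tok in tokenize(tweet))
    top = max(counts.values())
    return {tok: count / top for tok, count in counts.items()}


def uniqueness(tweet, values):
    # Multiply the normalized values of the tweet's tokens to get U.
    u = 1.0
    for tok in tokenize(tweet):
        u *= values[tok]
    return u


def condense(tweets, threshold):
    # Discard any tweet whose U exceeds the threshold, as on the slide.
    values = normalized_unigram_values(tweets)
    return [t for t in tweets if uniqueness(t, values) <= threshold]


tweets = ["de la de la", "de la", "radiotelescopio goteo de"]
kept = condense(tweets, threshold=0.5)
```

Under this normalization, tweets made up of very common tokens get a U close to 1 and are dropped, while tweets containing rare tokens get a small U and are kept. An empty tweet has U = 1.0 and is also dropped.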

5 Most Unique Words by Country

Country    1st           2nd            3rd        4th              5th
Chile      Andreas       Ruptura        Ponga      Candreva         Desinformacion
USA        Contratos     Doñana         Esteriles  Jerusalen        Gasolineras
Mexico     Directiorio   Completamente  Ruptura    Desatencion      Pokemones
Colombia   Instauracion  Candia         Municipio  Tio              Jornada
Argentina  Autonoma      Dialogo        HuffPos    Muecas           Desplazo
Spain      Intitez       Canes          Goteo      Radiotelescopio  motoGP

Text Condensation - Edit Distance

• Calculated the number of character edits required to change one word into another, and one tweet into another.
• Used those distances to reduce redundancy in the tweets.
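As an illustration of the idea, the sketch below computes the Levenshtein edit distance between two strings and uses it to drop near-duplicate tweets. The slides do not specify which edit-distance variant or cutoff was used, so the `max_distance` parameter and the greedy deduplication strategy are assumptions made for this example.

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance: the minimum number
    # of single-character insertions, deletions, and substitutions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def drop_near_duplicates(tweets, max_distance):
    # Assumption: keep a tweet only if it is more than max_distance edits
    # away from every tweet already kept (greedy, order-dependent).
    kept = []
    for tweet in tweets:
        if all(edit_distance(tweet, k) > max_distance for k in kept):
            kept.append(tweet)
    return kept


sample = ["hola mundo", "hola mundo!", "buenos dias"]
deduped = drop_near_duplicates(sample, max_distance=2)
```

Comparing every tweet against every kept tweet is quadratic in the worst case, so a corpus of the size mentioned earlier would likely need batching or a cheaper pre-filter before this step.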

Similarity Between Countries

CHILE - ARGENTINA       1.16
CHILE - COLOMBIA        1.03
CHILE - MEXICO          1.06
CHILE - SPAIN           8.53
CHILE - USA             2.60
ARGENTINA - COLOMBIA    0.84
ARGENTINA - MEXICO      0.68
ARGENTINA - SPAIN       11.91
ARGENTINA - USA         6.80
COLOMBIA - MEXICO       0.20
COLOMBIA - SPAIN        16.70
COLOMBIA - USA          6.40
MEXICO - SPAIN          7.78
MEXICO - USA            0.28
SPAIN - USA             31.93

Training Watson

• Limitations on Watson (15k records, etc.) caused us to split the data into two sets

• Following the standard 80/20 approach
  • Used one set for training (80%) and the other for testing (20%)

• Condensation allowed us to train Watson with the most relevant data
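A minimal sketch of an 80/20 split is shown below. The slides mention a 15k-record limit but not how it was applied, so capping only the training set is an assumption here, as are the function name, the fixed random seed, and the parameter names.

```python
import random


def split_train_test(records, train_fraction=0.8,
                     max_train_records=15000, seed=0):
    # Shuffle a copy so the split does not depend on the original order.
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)

    # The first 80% goes to training, the remaining 20% to testing.
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]

    # Assumption: the record limit applies to the training set only.
    return train[:max_train_records], test


train, test = split_train_test(range(100))
```

Fixing the seed keeps the split reproducible between runs, which makes test results comparable when the classifier is retrained.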

Demo

Questions