Meet the Team
• Pablo Sarquis Peillard
  • CS undergrad
  • Experienced working in startups through VC funding and development
  • Has broken the top 10 on the iOS App Store multiple times
• David Orozco
  • Software engineering undergraduate student with a focus on artificial intelligence
  • Programming skills in Python, C/C++, Java, MATLAB
  • Experienced working in teams to develop software applications
• Eduardo Gonzalez
  • Computer science undergraduate student
  • Specialized in games using C#
  • Undergraduate research in virtual reality
Outline
• Dialects
  • What’s a dialect?
  • Why do companies care?
  • What’s Dialecto?
• Deployment Solutions
  • Cognitive Applications
  • Server-less Infrastructure
  • Front End Development
• Dialect Classification
  • Data Collection
  • Text Condensation
  • Training Watson
• Closing Statements
  • Product
  • Team
  • Questions
Why?
• As globalization becomes the norm, a user’s location can no longer be the only go-to signal for market analysis
• Dialect classification provides an avenue for understanding word choice within a demographic
Dialecto?
• It is a trained language classifier able to detect nuances in a text to determine its country of origin.
Cognitive Applications
• Adaptive solutions
• React and change based on new interactions and data
• Contextual
• Are able to quickly define the domain and collect relevant data.
IBM Watson
• Watson is a question answering computer system capable of answering questions posed in natural language
• Computing system that rivals a human’s ability to retrieve, analyze and interpret vast amounts of information
Server-less Infrastructure using OpenWhisk
• Event-action platform to execute code in response to events
• This allows us to scale effortlessly
• Fast Scaling
• Reduced Operational Costs
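As a rough illustration of the event-action model, the sketch below shows the shape of an OpenWhisk Python action: the platform calls `main` with the event's parameters as a dictionary and expects a dictionary back. The `classify_dialect` helper and the `text` parameter name are hypothetical placeholders, not the project's actual code.

```python
# Minimal sketch of an OpenWhisk Python action (not the project's actual code).
# OpenWhisk calls main() with the event parameters as a dict and expects
# a JSON-serializable dict as the result.


def classify_dialect(text):
    """Hypothetical placeholder for the call to the trained classifier."""
    # Assumption: a real action would query the classifier service here.
    return "unknown"


def main(args):
    # "text" is an assumed parameter name for the input to classify.
    text = args.get("text", "").strip()
    if not text:
        return {"error": "missing 'text' parameter"}
    return {"text": text, "country": classify_dialect(text)}
```

Because each invocation is stateless, the platform can run as many copies of the action as incoming events require, which is where the scaling benefit comes from.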
Front End Development
• Static Website
  • Service runs entirely on JavaScript
• OpenWhisk
  • Serverless platform to serve the custom application
• IBM’s Bluemix API Connect
  • Provides a gateway to the custom application without having to deploy a server
Data Collection - Twitter
• Preselected 31 Twitter accounts
• Archived their tweets for the past 3 months
• Over 190k tweets
Text Condensation - Unigram filter
To classify each tweet as general or unique, the tweet is tokenized and the normalized values of its tokens are multiplied together to obtain a uniqueness value (U) for that tweet. If U exceeds a certain threshold, the tweet is discarded.
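The filter can be sketched as follows. The slides do not say how tokens are normalized, so this sketch assumes each token's value is its corpus count divided by the count of the most frequent token; under that assumption a high U means the tweet is made of very common words. The function names and the threshold are illustrative only.

```python
import math
import re
from collections import Counter


def tokenize(tweet):
    # Lowercase word tokens; \w matches accented letters in Python 3.
    return re.findall(r"\w+", tweet.lower())


def normalized_values(tweets):
    # Assumption: value = token count / count of the most frequent token.
    counts = Counter(tok for t in tweets for tok in tokenize(t))
    if not counts:
        return {}
    top = max(counts.values())
    return {tok: c / top for tok, c in counts.items()}


def uniqueness(tweet, values):
    # U is the product of the normalized values of the tweet's tokens.
    return math.prod(values[tok] for tok in tokenize(tweet))


def condense(tweets, threshold):
    # Discard tweets whose U exceeds the threshold, as described above.
    values = normalized_values(tweets)
    return [t for t in tweets
            if tokenize(t) and uniqueness(t, values) <= threshold]
```

A plain product shrinks as tweets get longer, so a real filter would likely need a length-aware threshold; the slides do not say how this was handled.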
5 Most Unique Words by Country
Country    1st           2nd            3rd        4th              5th
Chile      Andreas       Ruptura        Ponga      Candreva         Desinformacion
USA        Contratos     Doñana         Esteriles  Jerusalen        Gasolineras
Mexico     Directiorio   Completamente  Ruptura    Desatencion      Pokemones
Colombia   Instauracion  Candia         Municipio  Tio              Jornada
Argentina  Autonoma      Dialogo        HuffPos    Muecas           Desplazo
Spain      Intitez       Canes          Goteo      Radiotelescopio  motoGP
Text Condensation - Edit Distance
• Calculated the edit distance (the number of character changes needed to turn one string into another) between words and between tweets.
• Used that distance to reduce redundancy in the tweets.
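A minimal sketch of this step, assuming the standard Levenshtein distance and a simple greedy pass that drops any tweet within `max_distance` edits of one already kept. Both the choice of metric and the `drop_near_duplicates` helper are assumptions; the slides do not specify them.

```python
def edit_distance(a, b):
    # Levenshtein distance via dynamic programming, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def drop_near_duplicates(tweets, max_distance):
    # Hypothetical helper: keep a tweet only if it is more than
    # max_distance edits away from every tweet kept so far.
    kept = []
    for tweet in tweets:
        if all(edit_distance(tweet, k) > max_distance for k in kept):
            kept.append(tweet)
    return kept
```

The greedy pass compares every tweet against all kept tweets, so it is quadratic; on a corpus of this size some form of bucketing or sampling would probably be needed.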
Similarity Between Countries
CHILE - ARGENTINA 1.16
CHILE - COLOMBIA 1.03
CHILE - MEXICO 1.06
CHILE - SPAIN 8.53
CHILE - USA 2.60
ARGENTINA - COLOMBIA 0.84
ARGENTINA - MEXICO 0.68
ARGENTINA - SPAIN 11.91
ARGENTINA - USA 6.80
COLOMBIA - MEXICO 0.20
COLOMBIA - SPAIN 16.70
COLOMBIA - USA 6.40
MEXICO - SPAIN 7.78
MEXICO - USA 0.28
SPAIN - USA 31.93
Training Watson
• Limitations on Watson (15k records, etc.) led us to split our data into two sets
• Following the standard 80/20 approach
  • Used one set for training (80%) and the other for testing (20%)
• Condensation allowed us to train Watson with the most relevant data
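The split can be sketched as below. How the 15k-record limit was applied is not stated, so this sketch assumes it caps the total number of records kept before the 80/20 split; `split_records` and its parameters are illustrative names.

```python
import random


def split_records(records, max_records=15000, train_percent=80, seed=0):
    # Shuffle with a fixed seed so the split is reproducible.
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Assumption: the record limit applies to the total kept.
    shuffled = shuffled[:max_records]
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * train_percent // 100
    return shuffled[:cut], shuffled[cut:]
```

For example, `split_records(range(20000))` returns 12,000 training records and 3,000 test records with no overlap.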