How to Clean Dirty Data

Meet Chris Greenough, a regular DataDriven contributor and co-founder of HyperLab. He’ll be teaching us how to clean dirty data.

How To Clean Dirty Text For Machine Learning Models

Text data is a highly insightful and highly available data source that companies often undervalue. Part of a growing subset of Artificial Intelligence (AI), Conversational AI uses a combination of Natural Language Processing, Machine Learning, Neural Networks and Contextual Awareness to process, understand and extract value from text data.

Processing text requires rigorous cleansing to ensure that your data models are accurate. You cannot go from raw text to an effective data model without heeding the ‘garbage in, garbage out’ mantra.

While it’s important to know that text cleaning is task specific, after working with many enterprises building Intelligent Assistants that automate sales and support, we’ve identified 7 essential steps you should take in order to get crisply clean data to use in your machine learning models.

Most developers use Python and tools like NLTK, and spend time searching for large open source libraries to perform these steps. But, at the end of this tutorial, I’ll suggest a much faster method.

The 7 Steps of Squeaky Clean Data

Here are the 7 steps to cleaning garbage text and transforming it into a sanitary word heaven:

  1. Tokenization

Tokenizing is the first step and entails splitting your documents and/or sentences into individual “tokens”. When doing this manually, you can choose how you want to define a “token”, whether it’s a word, sentence or paragraph. In most classification cases, splitting it into individual words is most effective.
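As a minimal sketch of word-level tokenization, the snippet below uses only Python's standard `re` module. The function name `tokenize` and the regular expression are illustrative assumptions, not a fixed standard; libraries such as NLTK ship more thorough tokenizers.

```python
import re

def tokenize(text):
    """Split text into word tokens, keeping each punctuation mark as its own token.

    Hypothetical helper for illustration. Words may contain one internal
    ASCII apostrophe, so "what's" stays a single token.
    """
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

tokens = tokenize("I want 4 of them, please!")
# tokens == ['I', 'want', '4', 'of', 'them', ',', 'please', '!']
```

Keeping punctuation as separate tokens at this stage means a later step can decide whether to drop it.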

  2. Normalise Case

After converting your data into tokens, the next step is to standardise your case into lowercase so that a machine can recognize the same words. While there are some instances (like the name “Bush” vs. the plant “bush”) where you may lose context, lowercasing tokens is a simple and effective way of ensuring that the input fits a universal data format.
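Lowercasing a list of tokens is a one-liner in Python. The helper name `normalise_case` is an assumption made for this sketch.

```python
def normalise_case(tokens):
    """Return a new list with every token lowercased (hypothetical helper)."""
    return [token.lower() for token in tokens]

lowered = normalise_case(["Bush", "bush", "EMAIL"])
# lowered == ['bush', 'bush', 'email']
```

Note how “Bush” and “bush” collapse into the same token, which is exactly the loss of context mentioned above.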

  3. Remove Punctuation

As with mixed case, punctuation is often superfluous and prevents a machine from recognizing the same words.
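One possible way to strip punctuation uses `str.translate` with the standard library's `string.punctuation`. This is a sketch under the assumption that ASCII punctuation is all you need to remove; curly quotes and other Unicode marks are not covered, and the helper name `remove_punctuation` is hypothetical.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation(tokens):
    """Strip ASCII punctuation from each token and drop tokens left empty."""
    stripped = (token.translate(_PUNCT_TABLE) for token in tokens)
    return [token for token in stripped if token]

cleaned = remove_punctuation(["clean", ",", "data!", "e-mail"])
# cleaned == ['clean', 'data', 'email']
```

Because this also deletes apostrophes and hyphens, you may prefer to expand contractions such as “what’s” before running it.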

  4. Stem Each Token

Stemming is an optional process of reducing a word to its base form. For example, in English, nouns can be plural or singular, and verbs can be expressed in different tenses, each with its own spelling variations.

Words of the same meaning but different form can be represented by the same token for the sake of natural language processing and understanding. Stemming’s job is to systematically chop words such that only their “base” form remains. The stem is then used to represent the token in the processing pipeline. For example, words like “cleaning”, “cleaned”, “cleaner” and “cleans” would be stemmed into “clean”.
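To make the “chopping” idea concrete, here is a deliberately naive suffix stripper. It is a toy written for this example, not the Porter algorithm or any library's stemmer; the suffix list and the minimum stem length of three characters are assumptions. For real work, an established implementation such as NLTK's `PorterStemmer` is a better choice.

```python
# Toy suffix list, checked in order; an assumption for illustration only.
SUFFIXES = ("ing", "ed", "er", "s")

def naive_stem(token):
    """Chop the first matching suffix, provided at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

stems = [naive_stem(w) for w in ["cleaning", "cleaned", "cleaner", "cleans"]]
# stems == ['clean', 'clean', 'clean', 'clean']
```

A rule this crude will mangle plenty of words (for example, it turns “during” into “dur”), which is one reason real stemmers carry many more rules.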

However, modern methodologies, such as the word embeddings made popular by word2vec, favor minimal word cleansing when building a model, so stemming is not recommended in those cases.

  5. Convert Non-alphabetic Tokens

Although people may type in numbers, it’s important to normalise text inputs to achieve maximum data integrity. But it’s not as simple as it sounds. The challenge is to convert numbers encountered during parsing into full-length words, which may differ depending on their context. For example, “I want 4 of them” is cleaned into “i want four of them”, while “What do I need that 4?” is cleaned into “what do i need that for”. This is just another step towards achieving a standardized, universal representation of your text.
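The sketch below shows one very crude way to approximate this context sensitivity. The rule it uses (treat a trailing “4” as the word “for”, otherwise spell the digit out) is a hypothetical heuristic invented for this example so that it reproduces the two sentences above; it will misfire on inputs like “I want 4”, and a production system would need a far richer model of context. It expects tokens that have already been lowercased and stripped of punctuation.

```python
# Digit strings mapped to number words (small illustrative range).
NUMBER_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
    "10": "ten",
}

# Hypothetical homophone shortcuts people type in informal text.
HOMOPHONES = {"4": "for"}

def convert_numbers(tokens):
    """Replace digit tokens with words, using a toy positional heuristic."""
    result = []
    last_index = len(tokens) - 1
    for index, token in enumerate(tokens):
        if token in HOMOPHONES and index == last_index:
            # Assumption: a trailing "4" has nothing to count, so read it as "for".
            result.append(HOMOPHONES[token])
        elif token in NUMBER_WORDS:
            result.append(NUMBER_WORDS[token])
        else:
            result.append(token)
    return result

counted = convert_numbers(["i", "want", "4", "of", "them"])
# counted == ['i', 'want', 'four', 'of', 'them']
asked = convert_numbers(["what", "do", "i", "need", "that", "4"])
# asked == ['what', 'do', 'i', 'need', 'that', 'for']
```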

  6. Fix Common Nuanced Errors

This is the hardest step in this process to complete without finding a magical open source dataset, or already having a large one of your own. Things you should look into are: correcting common misspellings, including typos from fast fingers and keystroke errors; converting cultural abbreviations, like “xde” into “tak ada” (Bahasa Malaysia) or “ppl” into “people”; removing superfluous cultural lingo, like “eh” or “la”; expanding contractions, like “what’s” into “what is”; and standardising common terms like “email”, “e-mail” and “e mail”.

If you can effectively do this, you can focus your time on building clean data models without needing to account for the endless supply of garbage words people may give you.
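In its simplest form, this step is a lookup table applied token by token. The sketch below assumes lowercased tokens that still contain apostrophes and hyphens (that is, it runs before punctuation removal). The dictionaries hold only the examples mentioned above and are nowhere near a complete dataset; the names `REPLACEMENTS`, `FILLERS` and `fix_common_errors` are hypothetical.

```python
# Tiny illustrative lookup tables; a real system needs a much larger dataset.
REPLACEMENTS = {
    "ppl": "people",       # abbreviation
    "xde": "tak ada",      # Bahasa Malaysia abbreviation
    "what's": "what is",   # contraction
    "e-mail": "email",     # term standardisation
}
FILLERS = {"eh", "la"}     # superfluous lingo to drop

def fix_common_errors(tokens):
    """Drop filler tokens and expand known abbreviations and contractions."""
    result = []
    for token in tokens:
        if token in FILLERS:
            continue
        # A replacement may expand into several words, so split and extend.
        result.extend(REPLACEMENTS.get(token, token).split())
    return result

fixed = fix_common_errors(["what's", "the", "e-mail", "la"])
# fixed == ['what', 'is', 'the', 'email']
```

Variants that span more than one token, such as “e mail”, cannot be caught by a per-token lookup and would need phrase-level matching.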

  7. Filter Out Stop Words

A ‘stop word’ is text processing lingo for common words that do not contribute to any deeper meaning and that machines are trained to ignore. These mostly include articles, like “the” and “a”, and other very common function words, like “is”.
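Filtering stop words amounts to a set membership test. The short list below is an assumption for illustration; real stop word lists (for instance the one bundled with NLTK) are considerably longer and language specific.

```python
# Small illustrative stop word list; not a complete or standard list.
STOP_WORDS = frozenset({"the", "a", "an", "is", "are", "of", "to", "and", "in"})

def remove_stop_words(tokens):
    """Keep only tokens that are not in the stop word list."""
    return [token for token in tokens if token not in STOP_WORDS]

kept = remove_stop_words(["the", "data", "is", "a", "mess"])
# kept == ['data', 'mess']
```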

Finally, You Can Begin…

After your hours of hard work, your text is finally ready to build a machine learning model. You may be tired, but don’t be, because more data will be coming soon and you’ll have to do this all over again. If only there were another solution…

Conveniently, there is an API service called Dialex, which completely automates this process. Dialex performs each processing task with care (except for stemming and removing stop words) to avoid changing the original meaning and structure of the sentence.

Dialex is available for English and Bahasa Malaysia, and we are working on supporting more Asian languages, to be released soon. You can connect to Dialex in less than 5 lines of code and take advantage of our prepared SDKs for JavaScript, Python and Go to get started right away.

Try Dialex today and get 10,000 API calls per month for FREE! We would welcome feedback and look forward to helping you get crispy clean text.
