Despite what sensationalist media may claim, computers do not (yet) understand human language. We need to turn sentences and words into a format that a Machine Learning or Deep Learning algorithm can manipulate effectively; this is often called text representation. Here I will explain several methods that turn words into a meaningful representation.

Integer encoding #

This approach is the simplest. Once we have a list of the tokens composing the vocabulary, we associate each one with an integer. For example, if the vocabulary is “Roses, are, red, Violets, blue”, we can create the mapping: Roses: 0, are: 1, red: 2, Violets: 3, blue: 4.

The sentence “Roses are red, Violets are blue” would become [0, 1, 2, 3, 1, 4]. This turns each sentence into an array of integers, which is easy for a Machine Learning algorithm to work with. It is also a convenient intermediate format for associating a word with numerical features; see the Word Embedding section below for an example.
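As a minimal sketch in plain Python (the variable names are illustrative, not part of any particular library):

```python
# Build an integer encoding from a fixed vocabulary.
vocabulary = ["Roses", "are", "red", "Violets", "blue"]
word_to_index = {word: index for index, word in enumerate(vocabulary)}

# Encode a tokenized sentence as a list of integers.
sentence = ["Roses", "are", "red", "Violets", "are", "blue"]
encoded = [word_to_index[word] for word in sentence]
print(encoded)  # [0, 1, 2, 3, 1, 4]
```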

One hot encoding #

This is a way to transform words into categorical features: it creates one boolean (0 or 1) feature per word in the vocabulary. For example, if the vocabulary is (Roses, are, red, Violets, blue), we can represent it this way:

Roses: 10000
are: 01000
red: 00100
Violets: 00010
blue: 00001

The sentence thus becomes: [10000, 01000, 00100, 00010, 01000, 00001].
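Here is a small sketch of the same idea in plain Python (illustrative names again):

```python
# One-hot encode each word against a fixed vocabulary.
vocabulary = ["Roses", "are", "red", "Violets", "blue"]
word_to_index = {word: index for index, word in enumerate(vocabulary)}

def one_hot(word):
    vector = [0] * len(vocabulary)
    vector[word_to_index[word]] = 1
    return vector

sentence = ["Roses", "are", "red", "Violets", "are", "blue"]
encoded = [one_hot(word) for word in sentence]
print(encoded[0])  # [1, 0, 0, 0, 0] for "Roses"
```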

Word count #

This data representation takes the number of occurrences of each word as an input feature for a Machine Learning or Deep Learning model. For example, if our vocabulary is (Tulip, orchids, Roses, Violets, are, red, blue), the sentences “Roses are red. Violets are blue.” can be represented by the count of each word in the vocabulary: [0, 0, 1, 1, 2, 1, 1]. Replacing words with their number of occurrences is a good starting point for classification problems: the intuition is that word counts alone can already help categorize a text.
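A quick sketch with Python’s collections.Counter (the names are illustrative):

```python
# Count word occurrences and project them onto a fixed vocabulary.
from collections import Counter

vocabulary = ["Tulip", "orchids", "Roses", "Violets", "are", "red", "blue"]
tokens = ["Roses", "are", "red", "Violets", "are", "blue"]

counts = Counter(tokens)
features = [counts[word] for word in vocabulary]
print(features)  # [0, 0, 1, 1, 2, 1, 1]
```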

N-gram #

This is a more advanced form of word counts where words are grouped by 2, 3, 4, ..., n tokens. The goal here is to capture more of the semantics of the sentences by catching expressions (e.g. “science fiction”), negations, and so on. For 2-grams, the sentence “Roses are not black, Violets are blue” might be represented with the vocabulary “Roses are”, “are not”, “not black”, “Violets are”, “are blue”.
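Extracting n-grams is just a sliding window over the tokens. Here is a minimal sketch in plain Python; note that this naive version also produces the bigram crossing the comma (“black Violets”), which real tokenizers usually avoid by splitting on punctuation first:

```python
# Slide a window of size n over the tokens and join each window into an n-gram.
def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["Roses", "are", "not", "black", "Violets", "are", "blue"]
print(ngrams(tokens, 2))
# ['Roses are', 'are not', 'not black', 'black Violets', 'Violets are', 'are blue']
```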

TF-IDF (Term Frequency - Inverse Document Frequency) #

One issue with word count representations is that they treat all words equally. For example, they represent the count of “are” the same way as the count of “planet”, although the former carries much less meaning than the latter. To address this issue, one can use TF-IDF (Term Frequency - Inverse Document Frequency). Each term t in a document d (where d can be an article, a paragraph, or a sentence) is associated with a weight that tells how important t is in that document. TF-IDF is composed of (surprise!) two parts.

Term Frequency #

Term frequency counts the occurrences of a term t in a document d. The intuition is straightforward: a word that is important to a document tends to appear frequently in it.
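In its simplest form this is just a raw count (normalized variants, such as dividing by the document length, also exist). Using the same notation as the IDF formula below:

$$\mathrm{TF}(t, d) = \vert \{ w \in d : w = t \} \vert$$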

Inverse Document Frequency #

The goal here is to tell how important the word is across the documents. We use the formula:

$$\mathrm{IDF}(t) = \log {\frac{\vert D \vert}{\vert \{ d \in D : t \in d \} \vert}}$$

Where:

  • D represents the set of documents (therefore |D| is the total number of documents)
  • t is the term whose weight we are computing
  • |{d ∈ D : t ∈ d}| is the number of documents in which the term t appears

In plain English, it means that the Inverse Document Frequency is equal to the logarithm of the total number of documents divided by the number of documents where the word has been found.

But why is this useful?

The IDF part tells how important a word is. For example, “planet” will have a high IDF because it is not a common word and is strongly tied to the document’s subject, whereas “are” is found in virtually every document, so its IDF will be close to zero.

Combining the two parts #

After that, computing TF-IDF is as simple as multiplying the two parts: TF-IDF(t, d) = TF(t, d) × IDF(t).
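Putting the two parts together, here is a minimal TF-IDF sketch in plain Python (raw counts, no smoothing; the tiny corpus and the names are illustrative):

```python
# Compute a TF-IDF weight for a term in one document of a small corpus.
import math
from collections import Counter

documents = [
    ["Roses", "are", "red"],
    ["Violets", "are", "blue"],
    ["Mars", "is", "a", "planet"],
]

def tf_idf(term, document, documents):
    tf = Counter(document)[term]                             # term frequency in this document
    docs_with_term = sum(1 for d in documents if term in d)  # documents containing the term
    if docs_with_term == 0:
        return 0.0
    idf = math.log(len(documents) / docs_with_term)
    return tf * idf

print(tf_idf("are", documents[0], documents))     # ~0.41: "are" appears in 2 of 3 documents
print(tf_idf("planet", documents[2], documents))  # ~1.10: "planet" appears in only 1 document
```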

Word Embedding #

Word Embedding is a way to represent words as numerical vectors of a fixed length (called the dimension of the embedding). These vectors often have nice properties: for example, the vectors of two related words are close to each other (to compute the distance between two vectors, one can use a similarity measure such as cosine similarity). A few popular algorithms to learn an embedding (i.e. to adjust the numerical vectors) are Word2Vec, GloVe, ELMo and FastText. One embedding (Sense2Vec) even deals with multi-sense words such as “bank” or “bat”. I will not cover how they work here because explaining each of them would require a dedicated blog post.
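To illustrate the “close to each other” idea, here is a small sketch that looks up vectors in a pre-built table and compares them with cosine similarity. The tiny 3-dimensional vectors are made up for illustration; real embeddings typically have hundreds of dimensions.

```python
# Compare word vectors with cosine similarity (toy, hand-made vectors).
import math

embedding = {
    "rose":   [0.8, 0.1, 0.3],
    "violet": [0.7, 0.2, 0.4],
    "planet": [0.1, 0.9, 0.2],
}

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity(embedding["rose"], embedding["violet"]))  # high: related words
print(cosine_similarity(embedding["rose"], embedding["planet"]))  # lower: unrelated words
```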

Generally, learning a word embedding requires a large number of text documents and a lot of computational power. Once an embedding is ready, one can replace each word by its vector, and feed this representation to a Machine Learning or Deep Learning model. To make this process easy, the words often need to be integer-encoded first.

Graph of Words #

All the methods above extract features from the words themselves, but they don’t make use of the relationships between words. One can also represent a text as a graph data structure that captures how the words are connected together. This paper explains well how to build such a graph.

A graph is a data structure composed of nodes (here, our words) connected together by edges. Each edge can be associated with a weight (for example, the distance between two connected words, their number of neighbors, or their summed TF-IDF weights) and each node can have any number of attributes. A graph of words is a directed graph because a language usually reads from left to right (or right to left), so the edges are one-way only.

Because we can’t build a graph representing all word connections, we usually pick a context size and keep only the connections within it. For example, if the context size is 3, we would represent the connections from a word to its 3 neighbors on either side.

I have built a library to create a graph of words here. A graph of words can be represented as below:

town: live, where, in
live: where, man, born
man: who, called

This representation can be used by various machine learning algorithms.
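As a rough sketch of how such a graph can be built with a sliding context window (plain-Python adjacency lists; this is not the API of the library mentioned above):

```python
# Build a directed graph of words: each word points to the words that
# follow it within a fixed context window.
from collections import defaultdict

def graph_of_words(tokens, context_size=3):
    graph = defaultdict(set)
    for i, word in enumerate(tokens):
        for neighbor in tokens[i + 1:i + 1 + context_size]:
            graph[word].add(neighbor)
    return graph

tokens = ["the", "man", "who", "lives", "in", "the", "town"]
for word, neighbors in sorted(graph_of_words(tokens).items()):
    print(word, "->", sorted(neighbors))
```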

Conclusion #

Despite being a well-known topic, text representation is hard to get right. The best way to represent a text for natural language processing often depends on the problem at hand and on the model that is meant to solve it.