What is the Bag of Words Model in Natural Language Processing?

In natural language processing, we use the bag of words model to extract features from a text corpus for machine learning tasks such as text classification (e.g., spam filtering, text categorization, sentiment analysis, etc.).

Since machine learning classifiers cannot work directly with raw text, we need to transform this information into meaningful numerical features. One such way is the bag of words model.

This method constructs a vocabulary of unique words, which is then used to build feature vectors that any machine learning classifier can consume.

We read the documents and scan for words, storing them in an array-like data structure such as a plain list. This step removes punctuation characters from the document so that only the words remain.

We then take each word of the transformed document and add it to a set; the set data structure removes duplicate words.

The words in the set are then stored in a dictionary. The dictionary's keys are the words, and the values are the indexes associated with each word. This dictionary is now our 'bag of words'.

import string

doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "Mary also likes to watch football games."

documents = [doc1, doc2]

def transform(s):
    """Removes punctuation characters from a string and converts it to lower case."""
    return s.translate(str.maketrans('', '', string.punctuation)).lower()

def createWordBag(documents):
    # Punctuation/special characters are removed from every document
    result = [transform(document) for document in documents]
    result = " ".join(result)

    # Each word in the combined text is placed in a list
    result = result.split(" ")

    # Duplicate words are filtered out by the set
    result = set(result)

    # Each unique word is mapped to an index
    wordBag = {}
    for index, word in enumerate(result):
        wordBag[word] = index

    return wordBag
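
As a quick sanity check, calling 'createWordBag' on the two sample documents produces a dictionary along these lines. The exact index assigned to each word depends on the set's iteration order, so it may differ between runs:

print(createWordBag(documents))
# Possible output (indexes may vary):
# {'john': 0, 'likes': 1, 'to': 2, 'watch': 3, 'movies': 4,
#  'mary': 5, 'too': 6, 'also': 7, 'football': 8, 'games': 9}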

Next, to process a document, we create a feature array whose size is equal to the number of keys in the 'bag of words' dictionary.

We traverse all the words in the document, and for each word we look up its index in the dictionary. We then access that index in our feature array and increment the value by 1. The feature vector is complete once the entire document has been traversed.

def getFeature(document, wordBag):
    # Clean the document and split it into individual words
    newDocument = transform(document).split(" ")

    # One counter per word in the vocabulary, initialised to zero
    feature = [0] * len(wordBag)
    for word in newDocument:
        # Look up the word's index and increment its count
        index = wordBag[word]
        feature[index] += 1

    return feature
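
Note that 'getFeature' assumes every word of the document already exists in the word bag; an unseen word would raise a KeyError. If you need to score a document that was not part of the original corpus, one small variation (my own sketch, not part of the original implementation, with the hypothetical name 'getFeatureSafe') is to simply skip out-of-vocabulary words:

def getFeatureSafe(document, wordBag):
    # Variation of getFeature that ignores words missing from the vocabulary
    newDocument = transform(document).split(" ")
    feature = [0] * len(wordBag)
    for word in newDocument:
        index = wordBag.get(word)  # None if the word is not in the word bag
        if index is not None:
            feature[index] += 1

    return feature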

In the code snippet below, we create the 'word bag' model from the list of documents defined above, and for each document the 'getFeature' method creates the feature vector with the aid of the word bag model.

wordBag = createWordBag(documents)

feature1 = getFeature(doc1, wordBag)
feature2 = getFeature(doc2, wordBag)

print(feature1) # e.g. [0, 1, 1, 0, 2, 2, 1, 1, 0, 1]
print(feature2) # e.g. [1, 1, 1, 1, 1, 0, 1, 0, 1, 0]

The feature vectors generated with the bag of words method can be used by any machine learning model.
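
For example, here is a minimal sketch of feeding these feature vectors into a classifier. It assumes scikit-learn is installed, and the labels are made up purely for illustration (say, 1 for documents about movies and 0 otherwise); it is not part of the original implementation:

from sklearn.naive_bayes import MultinomialNB

X = [feature1, feature2]
y = [1, 0]  # hypothetical labels: 1 = "about movies", 0 = otherwise

clf = MultinomialNB()
clf.fit(X, y)

newDoc = "Mary likes to watch movies"
print(clf.predict([getFeature(newDoc, wordBag)]))  # e.g. [1]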

The entire implementation in Python can be found in the following GitHub Gist.

Python Implementation

Thank you for reading till the end. In a future article, I will explore this method further and write a couple of posts about the Hashing Trick, which scales this method to large text corpora, as well as about tf–idf.