Master Dispersion Plots in 5 Minutes!

Fast Win Data Science

Learn graphical text analysis with NLTK

A sepia photograph of Sherlock Holmes examining a book with a magnifying glass. — Sherlock Holmes (by DALL-E3)

The Natural Language Toolkit (NLTK) ships with a fun feature called a dispersion plot, which lets you plot the location of a word in a text. More specifically, it plots each occurrence of a word against its word offset, that is, the number of words from the beginning of the corpus.

Here is an example of a dispersion plot for the main characters of the Sherlock Holmes novel The Hound of the Baskervilles:

Dispersion plot of the main characters of “The Hound of the Baskervilles”, with blue vertical ticks marking each occurrence of a word in the text (by author)

The blue vertical marks represent the locations of the target words in the text. Each row covers the corpus from start to finish.
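To make the idea concrete, here is a minimal sketch of the bookkeeping behind such a plot: for each target word, collect the word offsets at which it occurs. This is not NLTK's implementation. The `word_offsets` function, the naive regex tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re


def word_offsets(text, targets):
    """Map each target word to the token positions where it occurs."""
    # Naive tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:  # case-sensitive match
            offsets[token].append(position)
    return offsets


# Made-up sample text, not a quotation from the novel.
sample = "Holmes looked at Watson. Watson looked at the hound. Holmes smiled."
offsets = word_offsets(sample, ["Holmes", "Watson", "hound"])
for word, positions in offsets.items():
    print(word, positions)
```

Each list of positions corresponds to one row of the plot, with a tick drawn at every offset in the list.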

If you are familiar with The Hound of the Baskervilles (and I won't spoil it if you aren't), then you'll appreciate Holmes's sparse appearance in the middle, Mortimer's late return, and the overlap of Barrymore, Selden, and the hound.

Dispersion plots also have more practical applications. For example, imagine you are a data scientist working with paralegals on a criminal case involving insider trading. To find out whether the defendant contacted board members just before making the illegal trades, you can load the defendant's subpoenaed emails as a continuous string and generate a dispersion plot to check for names that appear close together.
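The visual check can be backed up with a numeric one. The sketch below, written under stated assumptions, lists every pair of positions where two names fall within a given number of words of each other. The `close_mentions` function, the `window` parameter, the names, and the email text are all hypothetical.

```python
import re


def close_mentions(text, name_a, name_b, window=50):
    """Return (offset_a, offset_b) pairs where the two names occur
    within `window` words of each other."""
    # Naive tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[A-Za-z']+", text)
    pos_a = [i for i, tok in enumerate(tokens) if tok == name_a]
    pos_b = [i for i, tok in enumerate(tokens) if tok == name_b]
    return [(a, b) for a in pos_a for b in pos_b if abs(a - b) <= window]


# Hypothetical emails joined into one continuous string.
emails = "Smith wrote to Jones on Monday. " + "filler " * 100 + "Jones replied."
pairs = close_mentions(emails, "Smith", "Jones", window=10)
print(pairs)
```

A small `window` flags only tight clusters, while a larger one catches looser overlaps at the cost of more false positives.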

Social scientists analyze dispersion plots to study language trends related to specific topics. By tracking the appearance of terms like “climate change” or “gun control” in news articles, they can gain insight into which issues mattered to society during specific time periods.

In this Fast Win Data Science project, we will write the Python code that generated the dispersion plot of The Hound of the Baskervilles shown above.

We will use a copy of the novel stored in this Gist. It originally came from Project Gutenberg, a great source of public domain literature. As recommended for natural language processing, I have removed it…