Fast Win Data Science
The Natural Language Toolkit (NLTK) ships with a fun feature called a dispersion plot, which lets you plot the location of a word in a text. More specifically, it plots each occurrence of a word against its word offset, that is, the number of words since the beginning of the corpus.
Here is an example of a dispersion plot for the main characters of the Sherlock Holmes novel The Hound of the Baskervilles:
The blue vertical marks represent the locations of the target words in the text. Each row covers the corpus from start to finish.
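The positions behind those marks are simply word offsets. As a minimal sketch of the idea, and not NLTK's own implementation, the snippet below finds the offsets with a crude regex tokenizer. The function name `word_offsets` and the sample sentence are hypothetical, chosen only for illustration.

```python
import re


def word_offsets(text, targets):
    """Map each target word to the word offsets at which it occurs."""
    # Assumption: a simplistic tokenizer (runs of letters and apostrophes),
    # standing in for a real tokenizer such as NLTK's.
    tokens = re.findall(r"[A-Za-z']+", text)
    # Match case-insensitively, but report results under the original spelling.
    wanted = {t.lower(): t for t in targets}
    offsets = {t: [] for t in targets}
    for i, tok in enumerate(tokens):
        key = tok.lower()
        if key in wanted:
            offsets[wanted[key]].append(i)
    return offsets


# Made-up sample text, not a quote from the novel.
sample = "Holmes looked at Watson. Watson shrugged, and Holmes smiled."
result = word_offsets(sample, ["Holmes", "Watson"])
print(result)
```

Each list of offsets corresponds to one row of the plot: drawing a vertical tick at every offset along a shared x-axis reproduces the picture described above.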
If you are familiar with The Hound of the Baskervilles (and I won't spoil it if you aren't), then you'll appreciate Holmes' sparse appearance in the middle, Mortimer's late return, and the overlap of Barrymore, Selden, and the hound.
Dispersion plots also have more practical applications. For example, imagine you are a data scientist working with paralegals on a criminal case involving insider trading. To find out whether the defendant contacted board members just before making the illegal trades, you could load the defendant's subpoenaed emails as a continuous string and generate a dispersion plot to check for names that appear close together.
Social scientists analyze dispersion plots to study linguistic trends related to specific topics. By tracking the appearance of terms like “climate change” or “gun control” in news articles, they can gain insight into society's priorities during specific time periods.
In this Fast Win Data Science project, we will write the Python code that generated The Hound of the Baskervilles dispersion plot shown above.
We will use a copy of the novel stored in this Gist. It originally came from Project Gutenberg, a great source of public domain literature. As recommended for natural language processing, I have removed it…