While ingesting the data from the API, we will apply a few criteria. First, we will only ingest documents whose publication year is between 2016 and 2022. We want fairly recent language, as the terms and taxonomy of certain topics can change over long periods of time.
We will also add key terms and perform multiple searches. While we could ingest documents from random topic areas, using key terms narrows the search and gives us an idea of how many high-level topics we have, which we can later compare to the output of the model. Next, we create a function that accepts key terms and performs searches through the API.
import pandas as pd
import requests

def import_data(pages, start_year, end_year, search_terms):
    """
    This function uses the OpenAlex API to conduct a search on works and returns a dataframe of associated works.

    Inputs:
    - pages: int, number of pages to loop through
    - search_terms: str, keywords to search for (must be formatted according to OpenAlex standards)
    - start_year and end_year: int, years to set as a range for filtering works
    """
    # create an empty dataframe
    search_results = pd.DataFrame()

    for page in range(1, pages):
        # use parameters to conduct the request and format the response as a dataframe
        response = requests.get(f'https://api.openalex.org/works?page={page}&per-page=200&filter=publication_year:{start_year}-{end_year},type:article&search={search_terms}')
        data = pd.DataFrame(response.json()['results'])

        # append to the results dataframe
        search_results = pd.concat([search_results, data])

    # subset to relevant features
    search_results = search_results[["id", "title", "display_name", "publication_year", "publication_date",
                                     "type", "countries_distinct_count", "institutions_distinct_count",
                                     "has_fulltext", "cited_by_count", "keywords", "referenced_works_count",
                                     "abstract_inverted_index"]]

    return search_results
We conducted five different searches, each in a different technology area. These technology areas are inspired by the Department of Defense’s “Critical Technology Areas.” See more here:
Below is an example of a search using the required OpenAlex syntax:
# search for Trusted AI and Autonomy
ai_search = import_data(35, 2016, 2024, "'artificial intelligence' OR 'deep learn' OR 'neural net' OR 'autonomous' OR drone")
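Since each of the five searches produces its own dataframe, the results need to be combined and de-duplicated before cleaning. Below is a minimal sketch of that step, assuming the search results are collected in a list (only ai_search is defined above; the other entries are placeholders):
# gather the dataframes returned by import_data, one per technology area
searches = [ai_search]  # add the other four search results here

# combine the searches and drop duplicate works by their OpenAlex id
all_search_results = pd.concat(searches)
all_search_results = all_search_results.drop_duplicates(subset='id').reset_index(drop=True)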
After compiling our queries and removing duplicate documents, we need to clean the data to prepare it for our topic model. There are two main problems with our current result.
- Abstracts are returned as an inverted index (for legal reasons). However, we can use the inverted index to reconstruct the original text.
- Once we have the original text, it will be raw and unprocessed, which will introduce noise and harm our model. We will perform traditional NLP preprocessing to prepare it for the model.
Below is a function that reconstructs the original text from an inverted index.
def undo_inverted_index(inverted_index):
    """
    The purpose of this function is to 'undo' an inverted index. It inputs an inverted index and
    returns the original string.
    """
    # create empty lists to store the un-inverted index
    word_index = []
    words_unindexed = []

    # loop through the index and store (word, position) pairs
    for k, v in inverted_index.items():
        for index in v:
            word_index.append((k, index))

    # sort by the position
    word_index = sorted(word_index, key=lambda x: x[1])

    # keep only the words and join them back into a string
    for pair in word_index:
        words_unindexed.append(pair[0])
    words_unindexed = ' '.join(words_unindexed)

    return words_unindexed
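Assuming the combined dataframe from the earlier sketch (all_search_results), the function can be applied to the abstract_inverted_index column; some works have no abstract, so it helps to guard against missing values:
# reconstruct abstracts; rows without an inverted index are left as None
all_search_results['abstract'] = all_search_results['abstract_inverted_index'].apply(
    lambda idx: undo_inverted_index(idx) if isinstance(idx, dict) else None
)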
Now that we have the raw text, we can perform our traditional preprocessing steps such as standardization, stopword removal, and lemmatization. Below are functions that can be applied to a list or series of documents.
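These functions rely on re and nltk, and NLTK's tokenizer, stopword list, and lemmatizer each depend on resources that must be downloaded once. A minimal setup sketch:
import re
import nltk

# one-time downloads for the tokenizer, stopword list, and WordNet lemmatizer
# (newer NLTK versions may also require 'punkt_tab')
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')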
def preprocess(text):
    """
    This function takes in a string, converts it to lowercase, cleans
    it (removes special characters and numbers), and tokenizes it.
    """
    # convert to lowercase
    text = text.lower()

    # remove digits and special characters
    text = re.sub(r'\d+', '', text)
    text = re.sub(r'[^\w\s]', '', text)

    # tokenize
    tokens = nltk.word_tokenize(text)

    return tokens
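As a quick illustration (the example sentence is arbitrary), preprocess lowercases the text, strips digits and punctuation, and splits it into tokens:
# sample run of the tokenizer
sample_tokens = preprocess("Autonomous drones use 5 neural networks!")
# sample_tokens -> ['autonomous', 'drones', 'use', 'neural', 'networks']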
def remove_stopwords(tokens):
    """
    This function takes in a list of tokens (from the 'preprocess' function) and
    removes stopwords. Custom stopwords can be added to the 'custom_stopwords' list.
    """
    # set default and custom stopwords
    stop_words = nltk.corpus.stopwords.words('english')
    custom_stopwords = []
    stop_words.extend(custom_stopwords)

    # filter out stopwords
    filtered_tokens = [word for word in tokens if word not in stop_words]

    return filtered_tokens
def lemmatize(tokens):
    """
    This function conducts lemmatization on a list of tokens (from the 'remove_stopwords' function).
    This shortens each word down to its root form to improve modeling results.
    """
    # initialize the lemmatizer and lemmatize each token
    lemmatizer = nltk.WordNetLemmatizer()
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return lemmatized_tokens
def clean_text(text):
    """
    This function uses the previously defined functions to take a string and
    run it through the entire preprocessing pipeline.
    """
    # clean, tokenize, and lemmatize a string
    tokens = preprocess(text)
    filtered_tokens = remove_stopwords(tokens)
    lemmatized_tokens = lemmatize(filtered_tokens)
    cleaned_text = ' '.join(lemmatized_tokens)

    return cleaned_text
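With the pipeline assembled, the cleaned documents can be produced in one pass (dataframe and column names assumed from the earlier sketches):
# clean every non-missing abstract; the result is one preprocessed string per document
docs = all_search_results['abstract'].dropna()
clean_docs = docs.apply(clean_text)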
Now that we have a set of preprocessed documents, we can create our first topic model!