Similarity Assessment
Next, I wanted to see how similar each batch of generated reviews was to the original reviews. To do this, we can use cosine similarity to calculate how similar the sentence vectors from each source are. First, we can write a helper function that transforms two sentences into vectors using TfidfVectorizer() and then computes the cosine similarity between the two new sentence vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def calculate_cosine_similarity(sentence1, sentence2):
    """
    A function that accepts two sentences as input and outputs their cosine
    similarity.

    Inputs:
        sentence1 (str): A string of words
        sentence2 (str): A string of words
    Returns:
        cosine_sim: Cosine similarity score for the two input sentences
    """
    # Initialize the TfidfVectorizer
    vectorizer = TfidfVectorizer()
    # Create the TF-IDF matrix for the two sentences
    tfidf_matrix = vectorizer.fit_transform([sentence1, sentence2])
    # Calculate the cosine similarity between the two vectors
    cosine_sim = cosine_similarity(tfidf_matrix[0], tfidf_matrix[1])
    return cosine_sim[0][0]
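As a quick sanity check, we can run the function on a couple of toy sentences (the strings below are just illustrative examples, not reviews from the data set):

# Quick sanity check on two toy sentences
print(calculate_cosine_similarity('the food was great', 'the food was awful'))      # shares most words, so well above 0
print(calculate_cosine_similarity('the food was great', 'checkout took forever'))  # no shared words, so exactly 0.0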
One problem I had was that the data sets were now so large that calculations were taking too long (and sometimes I didn’t have enough RAM in Google Colab to continue). To combat this problem, I randomly sampled 200 reviews from each of the data sets to calculate similarity.
# Randomly sample 200 reviews from each source
from random import sample

o_review = sample(reviews_dict['original review'], 200)
p_review = sample(reviews_dict['fake positive review'], 200)
n_review = sample(reviews_dict['fake negative review'], 200)

r_dict = {'original review': o_review,
          'fake positive review': p_review,
          'fake negative review': n_review}
Now that we have the randomly selected samples, we can look at the cosine similarities between the different combinations of the data sets.
# Cosine similarity calculation
import numpy as np
import pandas as pd

source = ['original review', 'fake negative review', 'fake positive review']
source_to_compare = ['original review', 'fake negative review', 'fake positive review']
avg_cos_sim_per_word = {}

for s in source:
    for s2 in source_to_compare:
        if s != s2:
            count = []
            for sent in r_dict[s]:
                for sent2 in r_dict[s2]:
                    similarity = calculate_cosine_similarity(sent, sent2)
                    count.append(similarity)
            avg_cos_sim_per_word['{0} to {1}'.format(s, s2)] = np.mean(count)

results = pd.DataFrame(avg_cos_sim_per_word, index=[0]).T
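Printing the resulting DataFrame, sorted with the most similar pair first, makes the averages easy to compare:

# Inspect the average pairwise similarities, highest first
print(results.sort_values(by=0, ascending=False))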
Compared to the original data set, the fake negative reviews were more similar to the originals than the fake positive ones were. My guess is that this is because I used more prompts to create the negative reviews than the positive reviews. Not surprisingly, the two batches of reviews generated by ChatGPT were the most similar to each other.
Great, we have the cosine similarities, but are there any other steps we can take to assess the similarity of the reviews? There are! Let's visualize the sentences as vectors. To do this, we need to embed the sentences (convert them to vectors of numbers), and then we can visualize them in 2D space. I used spaCy to embed the sentences and matplotlib to display them.
import spacy
import numpy as np

# Load spaCy's large English model, which includes pre-trained word vectors
nlp = spacy.load('en_core_web_lg')

source_embeddings = {}
for source, source_sentences in reviews_dict.items():
    source_embeddings[source] = []
    for sentence in source_sentences:
        # Tokenize the sentence using spaCy
        doc = nlp(sentence)
        # Retrieve the word embedding for each token
        word_embeddings = np.array([token.vector for token in doc])
        # Save the word embeddings for the source
        source_embeddings[source].append(word_embeddings)
import matplotlib.pyplot as plt

def legend_without_duplicate_labels(figure):
    # Each scatter call repeats its source label, so collapse duplicates
    handles, labels = plt.gca().get_legend_handles_labels()
    by_label = dict(zip(labels, handles))
    figure.legend(by_label.values(), by_label.keys(), loc='lower right')

# Plot embeddings with colors based on source
fig, ax = plt.subplots()
colors = ['g', 'b', 'r']  # One color for each source
i = 0
for source, embeddings in source_embeddings.items():
    for embedding in embeddings:
        # Plot the first two dimensions of each token's vector
        ax.scatter(embedding[:, 0], embedding[:, 1], c=colors[i], label=source)
    i += 1

legend_without_duplicate_labels(plt)
plt.show()
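One caveat: the plot above uses only the first two of spaCy's 300 embedding dimensions. As an alternative sketch (not part of the original analysis), we could project the full vectors down to 2D with PCA from scikit-learn, which preserves more of the overall variance:

from sklearn.decomposition import PCA

# Stack every token vector from every source into one matrix
all_vectors = np.vstack([emb for embs in source_embeddings.values() for emb in embs])

# Fit PCA once so all three sources share the same 2D projection
pca = PCA(n_components=2)
pca.fit(all_vectors)

fig, ax = plt.subplots()
for color, (source, embeddings) in zip(['g', 'b', 'r'], source_embeddings.items()):
    projected = pca.transform(np.vstack(embeddings))
    ax.scatter(projected[:, 0], projected[:, 1], c=color, label=source, s=5)
ax.legend(loc='lower right')
plt.show()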
The good news is that we can clearly see that the embeddings and distributions of the sentence vectors align closely. Visual inspection shows more variability in the distribution of the original reviews, supporting the claim that they are more diverse. Since ChatGPT generated both the positive and negative reviews, we might expect their distributions to be the same; however, the fake negative reviews actually have a wider distribution and more variation than the positive ones. Why could this be? It is probably partly because I had to trick ChatGPT into creating the fake negative reviews (ChatGPT is designed to make positive statements) and had to provide more prompts to get enough negative reviews compared to positive ones. This helps the data set, because the additional diversity lets us train higher-performing machine learning models.
We can then inspect the differences between the three distributions of reviews and see if there are any distinctive patterns.
What do we see? Visually, most of the reviews are centered at the origin and range from -10 to 10. This is a positive sign and supports the use of the fake reviews to train prediction models. The spreads are roughly the same; however, the original reviews show wider variation in their distribution, both laterally and longitudinally, an indicator that there is more diversity in the lexicon within those reviews. The ChatGPT reviews definitely had similar distributions, but the positive reviews had more outliers. As noted, these differences are likely a result of how I asked the system to generate the reviews.
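To put a rough number on these visual impressions, we could also compute the spread of the two plotted embedding dimensions per source (a hypothetical follow-up check, not something from the analysis above):

# Rough numeric check: per-source standard deviation of the plotted dimensions
for source, embeddings in source_embeddings.items():
    stacked = np.vstack(embeddings)
    print(source,
          'std dim 0: {:.3f}'.format(stacked[:, 0].std()),
          'std dim 1: {:.3f}'.format(stacked[:, 1].std()))

A larger standard deviation for a source would back up the visual impression that its reviews are more spread out.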