Image created with DALL-E
Did you know that election results can be predicted to some extent using sentiment analysis? Data science can be fun and very useful when applied to real-life situations rather than to simulated data sets.
In this article, we will conduct a brief case study using Twitter data. At the end, you will see a case study with a significant real-life impact, which is sure to pique your interest. But first, let's start with the basics.
Sentiment analysis is a method used to predict feelings in text, like a digital psychologist. With this psychologist you've created, the fate of the text you analyze will be in your hands. You can do it like the famous psychologist Freud, or you can just be there like a psychologist charging $10 a session.
Just as your psychologist listens to and understands your emotions, sentiment analysis does the same on text, such as reviews, comments, or tweets, as we will do in the next section. To get started, let's do a case study on a ready-made data set.
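To make the idea concrete before we touch the real data, here is a minimal sketch of scoring single sentences with NLTK's ready-made VADER analyzer. This is just one of several off-the-shelf options and assumes you have nltk installed:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
nltk.download('vader_lexicon')  # one-time download of the sentiment lexicon
sia = SentimentIntensityAnalyzer()
# polarity_scores returns neg/neu/pos components and a compound score in [-1, 1]
print(sia.polarity_scores("I love this movie!"))          # compound near +1: positive
print(sia.polarity_scores("This was a waste of time."))   # negative compound: negative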
To perform sentiment analysis, we will use a data set from Kaggle that was collected using the Twitter API. Here is the link to this data set: https://www.kaggle.com/datasets/kazanova/sentiment140
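If you prefer to fetch the files programmatically rather than through the browser, a small optional sketch using the kagglehub package could look like this (assuming kagglehub is installed and your Kaggle credentials are configured):
import kagglehub
# Download the Sentiment140 dataset; returns the local directory containing the CSV
path = kagglehub.dataset_download("kazanova/sentiment140")
print("Files downloaded to:", path)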
Now, let's start exploring the data set.
Exploring the Data Set
Before performing sentiment analysis, let's explore our data set. To read it, we must specify the encoding, and since the file has no header row, we will add the column names ourselves. There are many methods you can use for data exploration; the head, info, and describe methods alone will give you a lot of information. Let's see the code.
import pandas as pd
# The file has no header row, so read it with an explicit encoding and no header
data = pd.read_csv('training.csv', encoding='ISO-8859-1', header=None)
# Name the columns ourselves
data.columns = ['target', 'ids', 'date', 'flag', 'user', 'text']
# First rows, schema, and summary statistics
print(data.head())
data.info()  # info() prints directly and returns None
print(data.describe())
Here is the result.
Of course, you can run these methods one by one if your project doesn't limit the number of images you can include. Let's look at the insights we gleaned from these exploration methods.
Insights
- The data set has 1.6 million tweets, with no missing values in any column.
- Each tweet has a target sentiment (0 for negative, 2 for neutral, 4 for positive), an ID, a timestamp, a flag (a query or 'NO_QUERY'), the username, and the text.
- Sentiment targets are balanced, with the same number of positive and negative labels, as the quick check below confirms.
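You can verify that last point with a one-line check. Note that although the schema allows a neutral label of 2, this training file in practice contains only 0s and 4s:
# Quick sanity check of the label balance
print(data['target'].value_counts())
# Expect 800,000 tweets for each of the two labels (0 and 4)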
Visualizing the Data Set
Wonderful, we now have statistical and structural knowledge of our data set. Next, let's create some visualizations. We all know the two strongest sentiments, positive and negative; to see which words are used for each, we'll use a Python library called WordCloud.
This library displays the words in your data set scaled by their frequency. The relationship is straightforward: the larger a word appears, the more often it is used.
But first, we need to select the positive and negative tweets and combine their text using Python's join method. Let's look at the code.
from wordcloud import WordCloud
import matplotlib.pyplot as plt
# Separate positive and negative tweets based on the 'target' column
positive_tweets = data[data['target'] == 4]['text']
negative_tweets = data[data['target'] == 0]['text']
# Sample some positive and negative tweets to create word clouds
sample_positive_text = " ".join(text for text in positive_tweets.sample(frac=0.1, random_state=23))
sample_negative_text = " ".join(text for text in negative_tweets.sample(frac=0.1, random_state=23))
# Generate word cloud images for both positive and negative sentiments
wordcloud_positive = WordCloud(width=800, height=400, max_words=200, background_color="white").generate(sample_positive_text)
wordcloud_negative = WordCloud(width=800, height=400, max_words=200, background_color="white").generate(sample_negative_text)
# Display the generated images side by side using matplotlib
plt.figure(figsize=(15, 7.5))
# Positive word cloud
plt.subplot(1, 2, 1)
plt.imshow(wordcloud_positive, interpolation='bilinear')
plt.title('Positive Tweets Word Cloud')
plt.axis("off")
# Negative word cloud
plt.subplot(1, 2, 2)
plt.imshow(wordcloud_negative, interpolation='bilinear')
plt.title('Negative Tweets Word Cloud')
plt.axis("off")
plt.show()
Here is the result.
The words "thank you" and "now" in the cloud on the left sound more positive. However, "work" and "now" are interesting because they also appear frequently in negative tweets.
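If you want to go beyond eyeballing the clouds, a rough frequency count makes the same comparison numerically. This sketch reuses the sampled text strings from above and Python's Counter; a proper analysis would also strip stop words and punctuation:
from collections import Counter
# Count raw word frequencies in each sample (lowercased, whitespace-tokenized)
positive_counts = Counter(sample_positive_text.lower().split())
negative_counts = Counter(sample_negative_text.lower().split())
# Compare the top words behind each cloud
print(positive_counts.most_common(10))
print(negative_counts.most_common(10))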
Sentiment Analysis
To perform sentiment analysis, these are the steps we will follow:
- Preprocess the text data
- Split the data set
- Vectorize the data set
- Convert the data
- Encode the labels
- Build a neural network
- Train the model
- Evaluate the model (with a plot)
Now, working with 1.6 million tweets puts a heavy load on your computer or platform, which is why I sampled 50,000 positive tweets and 50,000 negative ones at the start.
# Since we need a smaller dataset due to resource constraints, let's sample 100k tweets
# Balanced sampling: 50k positive and 50k negative
sample_size_per_class = 50000
positive_sample = data[data['target'] == 4].sample(n=sample_size_per_class, random_state=23)
negative_sample = data[data['target'] == 0].sample(n=sample_size_per_class, random_state=23)
# Combine the samples into one dataset
balanced_sample = pd.concat([positive_sample, negative_sample])
# Check the balance of the sampled data
balanced_sample['target'].value_counts()
Next, let's build and train our neural network.
import tensorflow as tf
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.utils import to_categorical
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
# TF-IDF over unigrams and bigrams, capped at 10,000 features
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2))
# Train and test split
X_train, X_val, y_train, y_val = train_test_split(balanced_sample['text'], balanced_sample['target'], test_size=0.2, random_state=23)
# Vectorize the text data using TF-IDF (fit on train only, transform both)
X_train_vectorized = vectorizer.fit_transform(X_train)
X_val_vectorized = vectorizer.transform(X_val)
# Convert the sparse matrices to dense arrays for Keras
X_train_vectorized = X_train_vectorized.toarray()
X_val_vectorized = X_val_vectorized.toarray()
# Convert the labels (0/4) to one-hot encoding
encoder = LabelEncoder()
y_train_encoded = to_categorical(encoder.fit_transform(y_train))
y_val_encoded = to_categorical(encoder.transform(y_val))
# Define a simple neural network model
model = Sequential()
model.add(Dense(512, input_shape=(X_train_vectorized.shape[1],), activation='relu'))
model.add(Dense(2, activation='softmax'))  # 2 because we have two classes
# Compile the model
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=['accuracy'])
# Train the model over epochs
history = model.fit(X_train_vectorized, y_train_encoded, epochs=10, batch_size=128,
                    validation_data=(X_val_vectorized, y_val_encoded), verbose=1)
# Plot the model accuracy over epochs
plt.figure(figsize=(10, 6))
plt.plot(history.history['accuracy'], label="Train Accuracy", marker="o")
plt.plot(history.history['val_accuracy'], label="Validation Accuracy", marker="o")
plt.title('Model Accuracy over Epochs')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.grid(True)
plt.show()
Here is the result.
Final Insights on Sentiment Analysis
- Training accuracy: Accuracy starts at almost 80% and steadily increases to nearly 100% by the tenth epoch, so the model appears to be learning effectively.
- Validation accuracy: Validation accuracy also starts around 80% but then levels off while training accuracy keeps climbing, which could indicate that the model is not generalizing well to unseen data (see the usage sketch after this list).
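Once trained, the model can score any new text: vectorize it with the already-fitted TfidfVectorizer, take the argmax of the softmax output, and map it back to the original label. Here is a minimal usage sketch reusing the vectorizer, model, and encoder objects from the code above (the example tweets are made up):
# Score new, unseen text with the trained model
new_tweets = ["I love this!", "This is the worst day ever."]
# Reuse the fitted vectorizer; transform (not fit_transform) keeps the same vocabulary
new_vectorized = vectorizer.transform(new_tweets).toarray()
# predict returns softmax probabilities; argmax picks the more likely class
probabilities = model.predict(new_vectorized)
predicted = probabilities.argmax(axis=1)
# Map class indices back to the original labels (0 = negative, 4 = positive)
print(list(zip(new_tweets, encoder.inverse_transform(predicted))))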
I promised at the beginning of this article to pique your interest, so let's now explain the real story behind all this.
The article "Predicting election results from Twitter using machine learning algorithms", published in Recent Advances in Computing and Communications, presents a machine learning-based method for predicting election results. You can read it in full here.
In short, they performed sentiment analysis and achieved 94.2% accuracy for the 2019 AP Assembly elections. It seems they really came close.
If you plan to do a portfolio project, research like this, or go beyond this case study, you can use the Twitter (now X) API. Here are the plans: https://developer.twitter.com/en/products/twitter-api
For instance, you can conduct hashtag sentiment analysis on Twitter after major sporting or political events. In 2024, elections will be held in several countries, such as the United States, so there will be plenty of news to work with.
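If you go that route, here is a hedged sketch of pulling recent hashtag tweets with Tweepy's v2 client. It assumes you have installed tweepy, have API access, and hold a bearer token (the hashtag and token below are placeholders), and the volume you can pull depends on your plan:
import tweepy
# A bearer token from your developer account is assumed here
client = tweepy.Client(bearer_token="YOUR_BEARER_TOKEN")
# Search recent English-language tweets for a hashtag, excluding retweets
response = client.search_recent_tweets(query="#election lang:en -is:retweet", max_results=100)
# Collect the tweet texts (response.data is None when nothing matches)
tweets = [tweet.text for tweet in response.data] if response.data else []
print(f"Fetched {len(tweets)} tweets")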
The power of data science really shows in this example. This year we will witness numerous elections around the world, so if you want your project to draw attention, this could be a good idea. If you are a beginner looking for ways to learn data science, you can find many real-life projects, data science interview questions, and blog posts featuring data science projects like this on StrataScratch.
Nate Rosidi is a data scientist and works in product strategy. He is also an adjunct professor teaching analytics, and is the founder of StrataScratch, a platform that helps data scientists prepare for their interviews with real questions from top companies. Connect with him on Twitter: StrataScratch or LinkedIn.