Sentiment Analysis with Twitter Climate Change Data¶
This guide demonstrates how to use Google Colab and its AI-powered features to perform sentiment analysis on climate change-related Twitter data. The goal is to show researchers in ecology and environmental science how neural networks and natural language processing (NLP) techniques can be applied to understand public sentiment about climate change. We will walk through the process of data loading, tokenization, model building, and evaluation using popular Python libraries.
Open in Colab¶
If you want to try out this notebook yourself, you can make a copy directly in Google Colab. Click the link below and then choose "File" > "Save a copy in Drive" in Colab to start editing your own version.
Table of Contents¶
- Introduction
- Environment Setup
- Loading and Preparing the Data
- Tokenization
- Dataset Splitting and DataLoader Creation
- Building a Simple RNN Classifier
- Training and Evaluation of the RNN
- Improving the Model
- Conclusion and Next Steps
Introduction¶
In this notebook, we will:

- Leverage the Hugging Face ecosystem (Transformers, Datasets, Evaluate, Accelerate) to process and analyze text data.
- Work with Twitter sentiment data that reflects opinions about climate change.
- Build and train a simple recurrent neural network (RNN) to classify tweets as positive or negative.
- Explore improvements to the model architecture for better performance.
Google Colab’s integration with AI assistants (such as Gemini, visible at the top right of the notebook interface) makes it easy to ask questions and get real-time help while you work on your projects.
Environment Setup¶
Before we begin, ensure that the required libraries are installed. The notebook requires the following libraries:
- transformers
- datasets
- evaluate
- accelerate
- torch
- torchmetrics
You can install these packages with the commands provided below.
import os
import sys
import torch
import numpy as np
os.environ["TOKENIZERS_PARALLELISM"] = "false"
print(torch.__version__)
We'll use Hugging Face's popular tools which work seamlessly with both PyTorch and TensorFlow.
try:
    import transformers
    import datasets
    import evaluate
    import accelerate
    import torchmetrics
except ImportError:
    !pip install transformers datasets evaluate accelerate
    !pip install torchmetrics
Loading and Preparing the Data¶
For this demonstration, we will work with climate change-related Twitter sentiment data. You can download the dataset from Kaggle.
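In Colab, one straightforward way to bring the file into your session is to upload it manually. The snippet below is a minimal sketch; it assumes you are running in Colab and have already downloaded twitter_sentiment_data.csv from Kaggle.

# Upload the CSV into the Colab session via the file picker.
from google.colab import files
uploaded = files.upload()  # select twitter_sentiment_data.csv when prompted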
Data Description¶
The dataset contains tweets with sentiment labels:
- 1 indicating a positive attitude (affirming that climate change is man-made),
- -1 indicating a negative attitude (rejecting that claim),
- other labels such as 2 (news) and 0 (neutral).
For the purpose of this analysis, we will filter the dataset to include only two classes:
- 1 (positive)
- -1 (negative)
These will be transformed to binary labels:
- 1 becomes 1 (positive)
- -1 becomes 0 (negative)
Below is the code that loads and preprocesses the data:
from datasets import load_dataset
import torch.nn as nn
import torch.optim as optim
import pandas as pd
df = pd.read_csv("twitter_sentiment_data.csv")
df.head()
We filter the dataset to keep only tweets with sentiment -1 or 1 and then convert them to binary labels.
df = df[df['sentiment'].isin([-1, 1])].reset_index(drop=True)  # reset the index so positional access like df['message'][0] works after filtering
df['sentiment'] = df['sentiment'].apply(lambda x: 0 if x == -1 else 1)
df.head()
Now the classes are as follows:

- 1: Positive
- 0: Negative
You can use the AI recommendation feature within Colab to generate plots that help you better understand the data.
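For example, here is a quick bar chart of the class balance (a minimal sketch using matplotlib, which is preinstalled in Colab; the column name sentiment matches the dataframe above):

import matplotlib.pyplot as plt

# Count tweets per binary class (0 = negative, 1 = positive) and plot them
class_counts = df['sentiment'].value_counts().sort_index()
class_counts.plot(kind='bar')
plt.xticks([0, 1], ['Negative (0)', 'Positive (1)'], rotation=0)
plt.ylabel('Number of tweets')
plt.title('Class distribution')
plt.show()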
Let's inspect some sample tweets to verify the data:
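A minimal sketch for spot-checking a few random rows (the column names message and sentiment match the dataframe above):

# Print a handful of random tweets together with their binary labels
print(df[['sentiment', 'message']].sample(5, random_state=0))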
And we can check the tokenization on one sample tweet:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
output = tokenizer(df['message'][0], truncation=True, max_length=100)
print("------------Original-------------")
print(df['message'][0])
print("------------Encoded-------------")
print(output['input_ids'])
print("------------Decoded-------------")
print(tokenizer.decode(output['input_ids']))
Tokenization¶
Tokenization is a crucial step in NLP where text is converted into tokens (words or sub-words) that can be processed by machine learning models. We use the pre-trained AutoTokenizer from Hugging Face.
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")
Dataset Splitting and DataLoader Creation¶
Next, we split the data into training and testing sets, using an 80/20 split and stratifying on the sentiment label so that the class distribution is preserved in both sets.
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42, stratify=df["sentiment"])
We then convert our Pandas DataFrames to Hugging Face Datasets for efficient processing.
from datasets import Dataset, DatasetDict
train_dataset = Dataset.from_pandas(train_df)
test_dataset = Dataset.from_pandas(test_df)
dataset = DatasetDict({
'train': train_dataset,
'test': test_dataset
})
Now, define a tokenization function that processes the "message" field in the dataset:
def tokenize_function(examples):
return tokenizer(examples["message"], truncation=True, max_length=100)
# Apply the tokenizer to the dataset (batched tokenization)
tokenized_dataset = dataset.map(tokenize_function, batched=True)
tokenized_dataset
To streamline further processing, we remove unnecessary columns:
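The exact columns to drop depend on the CSV; below is a minimal sketch for the Kaggle file used here (the column names message and tweetid, plus the index column that Dataset.from_pandas may add, are assumptions):

# Keep only the fields the DataLoader needs: input_ids, attention_mask, sentiment.
cols_to_remove = [c for c in ["message", "tweetid", "__index_level_0__"]
                  if c in tokenized_dataset["train"].column_names]
tokenized_dataset = tokenized_dataset.remove_columns(cols_to_remove)
tokenized_dataset

Removing the raw text columns also keeps the data collator happy, since it expects numeric fields that it can pad and convert to tensors.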
Finally, we create DataLoaders with dynamic padding using Hugging Face's DataCollatorWithPadding.
from transformers import DataCollatorWithPadding
from torch.utils.data import DataLoader
# Create a data collator that will dynamically pad the inputs in a batch.
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
# Create DataLoaders for training and testing.
train_dataloader = DataLoader(tokenized_dataset['train'], shuffle=True, batch_size=256, collate_fn=data_collator)
test_dataloader = DataLoader(tokenized_dataset['test'], shuffle=False, batch_size=256, collate_fn=data_collator)
batch = next(iter(train_dataloader))
print(batch.keys())
print(batch['input_ids'].shape)
print(batch['input_ids'])
print(batch['sentiment'].shape)
vocab_size = len(tokenizer.vocab)
Building a Simple RNN Classifier¶
We now define a simple RNN model using an LSTM layer for sentiment classification. This model uses an embedding layer followed by an LSTM and a fully connected layer. You can experiment with model architectures and consult AI assistants (like Gemini in Colab) for suggestions.
class SimpleRNNClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, output_dim, pad_index):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_index)
        self.rnn = nn.LSTM(embed_dim, embed_dim, batch_first=True, dropout=0.0)
        self.fc = nn.Linear(embed_dim, output_dim)

    def forward(self, x):
        embedding = self.embedding(x)                    # (batch, seq_len, embed_dim)
        output, states = self.rnn(embedding)             # (batch, seq_len, embed_dim)
        output = self.fc(output[:, -1, :]).squeeze(-1)   # last time step -> (batch,)
        return torch.sigmoid(output)                     # probability of the positive class
Training and Evaluation of the RNN¶
We now train the simple RNN model for 5 epochs. The training loop uses Adam optimization and binary cross-entropy loss.
from tqdm import tqdm
from torchmetrics.classification import Accuracy, F1Score
n_epochs = 5
test_loss = []
train_loss = []
model = SimpleRNNClassifier(vocab_size=vocab_size, embed_dim=50, output_dim=1, pad_index=0)
optimizer = optim.Adam(model.parameters(), lr=0.001)
loss_fn = torch.nn.BCELoss()
for epoch in range(n_epochs):
model.train()
_loss = []
for batch in tqdm(train_dataloader):
optimizer.zero_grad()
output = model(batch['input_ids'])
loss = loss_fn(output, batch['sentiment'].float())
loss.backward()
optimizer.step()
_loss.append(loss.item())
train_loss.append(np.mean(_loss[-10:]))
model.eval()
with torch.no_grad():
test_pred, test_label = zip(*[
(model(test_batch['input_ids']), test_batch['sentiment'])
for test_batch in test_dataloader
])
test_pred = torch.cat(test_pred)
test_label = torch.cat(test_label).float()
test_loss.append(loss_fn(test_pred, test_label).item())
print("Loss", train_loss[-1], "Val Loss", test_loss[-1])
import matplotlib.pyplot as plt
accuracy = Accuracy(task="binary")
f1_score = F1Score(task="binary")
print("Accuracy", accuracy(test_pred, test_label))
print("F1", f1_score(test_pred, test_label))
plt.plot(train_loss, label="Train Loss")
plt.plot(test_loss, label="Validation Loss")
plt.legend()
plt.show()
Improving the Model¶
To further improve the model, we experiment with the following changes:

- Increase the number of LSTM layers.
- Use a bidirectional LSTM.
- Add dropout for regularization.
- Introduce an additional fully connected layer.
- Adjust the learning rate and weight decay.
The following code shows an improved model and training loop:
device = 'cuda' if torch.cuda.is_available() else 'cpu'
print(device)
import torch.nn.functional as F
n_epochs = 5
test_loss = []
train_loss = []
class MyClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, output_dim, pad_index):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=pad_index)
        # Two stacked bidirectional LSTM layers with dropout between them
        self.rnn = nn.LSTM(embed_dim, embed_dim, batch_first=True, num_layers=2, dropout=0.4, bidirectional=True)
        self.fc1 = nn.Linear(2 * embed_dim, 32)  # bidirectional -> forward + backward hidden states
        self.fc2 = nn.Linear(32, output_dim)

    def forward(self, x):
        embedding = self.embedding(x)
        output, states = self.rnn(embedding)
        output = F.relu(self.fc1(output[:, -1, :]))
        output = self.fc2(output).squeeze(-1)
        return torch.sigmoid(output)
my_model = MyClassifier(vocab_size=vocab_size, embed_dim=64, output_dim=1, pad_index=0).to(device)
optimizer = optim.Adam(my_model.parameters(), lr=0.0001, weight_decay=2e-4)
loss_fn = torch.nn.BCELoss()
for epoch in range(n_epochs):
my_model.train()
_loss = []
for batch in tqdm(train_dataloader):
optimizer.zero_grad()
output = my_model(batch['input_ids'].to(device))
loss = loss_fn(output, batch['sentiment'].float().to(device))
loss.backward()
optimizer.step()
_loss.append(loss.item())
train_loss.append(np.mean(_loss[-10:]))
my_model.eval()
with torch.no_grad():
test_pred, test_label = zip(*[
(my_model(test_batch['input_ids'].to(device)), test_batch['sentiment'].to(device))
for test_batch in test_dataloader
])
test_pred = torch.cat(test_pred)
test_label = torch.cat(test_label).float()
test_loss.append(loss_fn(test_pred, test_label).item())
print("Loss", train_loss[-1], "Val Loss", test_loss[-1])
Finally, we print the final accuracy and F1 score to ensure that the performance metrics are visible in the notebook output:
print("Accuracy", accuracy(test_pred.detach().cpu(), test_label.detach().cpu()))
print("F1", f1_score(test_pred.detach().cpu(), test_label.detach().cpu()))
plt.plot(train_loss, label="Train Loss")
plt.plot(test_loss, label="Validation Loss")
plt.legend()
plt.show()
- Accuracy: 0.8520
- F1: 0.9201
Conclusion and Next Steps¶
This notebook demonstrates how to use Google Colab to perform sentiment analysis on Twitter data related to climate change. By building and training a simple RNN and then enhancing it with additional layers, dropout, and bidirectional LSTM, we were able to achieve around 85% accuracy with only 5 epochs of training.
What Did We Improve?¶
- Added an extra fully connected layer.
- Increased LSTM layers to 2 and made the LSTM bidirectional.
- Applied dropout (0.4) to reduce overfitting.
- Lowered the learning rate to 0.0001 and introduced weight decay.
- Increased the embedding dimension to 64 for better feature representation.
Next Steps¶
- Experiment with Transformer Models: Consider trying transformer-based models like BERT for even better performance.
- Hyperparameter Tuning: Experiment with different batch sizes, learning rates, and regularization techniques.
- Early Stopping: Implement early stopping to avoid overfitting when training for more epochs; a minimal sketch is shown after this list.
- Data Visualization: Use recommended plots and further data visualizations to better understand the sentiment distributions.
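As a rough sketch of early stopping on the validation loss, the snippet below reuses the objects defined earlier (my_model, train_dataloader, test_dataloader, optimizer, loss_fn, device); the patience value and epoch cap are arbitrary choices for illustration:

# Stop training when the validation loss has not improved for `patience` epochs.
best_val_loss = float("inf")
patience, patience_counter = 3, 0
best_state = None

for epoch in range(50):  # upper bound; early stopping usually ends training sooner
    my_model.train()
    for batch in train_dataloader:
        optimizer.zero_grad()
        output = my_model(batch['input_ids'].to(device))
        loss = loss_fn(output, batch['sentiment'].float().to(device))
        loss.backward()
        optimizer.step()

    my_model.eval()
    with torch.no_grad():
        val_losses = [
            loss_fn(my_model(b['input_ids'].to(device)), b['sentiment'].float().to(device)).item()
            for b in test_dataloader
        ]
    val_loss = float(np.mean(val_losses))

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        best_state = {k: v.clone() for k, v in my_model.state_dict().items()}
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= patience:
            print(f"Stopping early after epoch {epoch}")
            break

if best_state is not None:
    my_model.load_state_dict(best_state)  # restore the best-performing weights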
This project is a great starting point for environmental researchers looking to leverage AI for data analysis. Happy coding and exploring!