In this project, we'll build a multilabel classifier to classify disaster-related messages into appropriate categories. This can help tell us about the nature of the event, so that each message can be routed to the correct organizations, enabling faster mobilization of resources.
The project is divided into two separate modules: "Extract, Transform, Load" (ETL) and Machine Learning. As you'll see, the initial data for the project is not clean. We'll take this opportunity to showcase some basic ETL skills and save the cleaned data to a SQLite database, which can then be loaded by the ML pipeline. Finally, we'll train an XGBoost classifier to predict the labels for new messages.
This is a walkthrough notebook of the project. The finished scripts, along with a web app providing a user interface for message classification, can be found here.
Note: This project is available as part of some Udacity nanodegrees.
import pandas as pd
import matplotlib
from sqlalchemy import create_engine
Data is available to us in two different files: `disaster_messages.csv` and `disaster_categories.csv`.
Let's load these two datasets and take a look at their first few rows.
categories = pd.read_csv('https://raw.githubusercontent.com/alex-coch/alex-coch.github.io/main/message/data/disaster_categories.csv')
messages = pd.read_csv('https://raw.githubusercontent.com/alex-coch/alex-coch.github.io/main/message/data/disaster_messages.csv')
messages.head()
We're only interested in the `message` column from the `disaster_messages` dataset. We'll use that column to train our message classifier and ignore the other columns.
categories.head()
categories.iloc[0]['categories']
The `disaster_categories` dataset contains the category labels for our messages, but in a serialized format. We've also taken a closer look at the categories string for the first row. We'll need to convert the categories to a format better suited for our ML model.
Let's start by merging the two datasets so that the messages and categories are present in the same dataframe.
df = messages.merge(categories, on='id')
df.head()
Next, let's split the categories column into separate columns, one for each category.
categories = df['categories'].str.split(';',expand=True)
categories.head()
We can extract the column names from the first row of the categories dataframe:
category_column_names = categories.iloc[0].apply(lambda x: x.split("-")[0])
# Let's check which categories we have
print([name for name in category_column_names])
Using the extracted column names as the header of the categories dataframe:
categories.columns = category_column_names
categories.head()
Next, we'll fix the values of the above dataset so that they are binary: 1 indicates that a message belongs to a given category.
for column in categories:
    # keep only the digit after the dash, e.g. "related-1" -> 1
    categories[column] = categories[column].str[-1].astype(int)
The `related` column contains values other than 0 or 1, so we'll convert the 2s to 1s by restricting the maximum value of a category to 1.
categories['related'].value_counts()
categories['related'] = categories['related'].clip(0,1)
categories['related'].value_counts()
Let's concatenate the categories dataframe back to our original dataframe.
# drop the original categories column from `df`
df.drop('categories',axis=1,inplace=True)
# concatenate the original dataframe with the new `categories` dataframe
df = pd.concat([df,categories], axis=1)
df.head(2)
# checking for duplicate rows in df
df.duplicated().sum()
# drop duplicates
df.drop_duplicates(keep='first', inplace=True)
# drop rows with NaN messages
df.dropna(subset=["message"], axis=0, inplace=True)
# drop the id and original columns as they are not useful for the learning problem
df.drop(["id", "original"], axis=1, inplace=True)
As part of this project's simulated scenario, we'll use SQLite and save the dataframe to a database.
database_filename = "disaster_db"
engine = create_engine('sqlite:///'+ database_filename)
df.to_sql('messages', engine, if_exists="replace", index=False)
Let's start by loading the data back from our database:
import re
import nltk
# uncomment below if NLTK data needs to be downloaded
nltk.download(['punkt', 'wordnet'], quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('omw-1.4', quiet=True)
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, accuracy_score
from sklearn.multioutput import MultiOutputClassifier
engine = create_engine('sqlite:///' + database_filename)
df = pd.read_sql_query('select * from messages', engine)
# the training data is a numpy array of all messages
X = df['message'].values
# the labels are all the different categories
Y = df.drop(columns=['message', 'genre'])
category_names = Y.columns
len(X)
X[:5]
Y.head()
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
print(len(X_train), len(X_test))
For the classifier to perform well, we need to preprocess the disaster message texts to standardize and tokenize them. We can define a function to do this, which can be used later when defining our model pipeline:
punctuation_regex = re.compile(r"[^\w\s]")
# use a set for fast membership tests; avoid shadowing the nltk `stopwords` module
stop_words = set(stopwords.words('english'))
wordnet_lemmatizer = WordNetLemmatizer()
pos_tags_to_lemmatize = ["n", "v"]
def tokenize(text: str) -> list:
    """
    Tokenizes a given text.
    Args:
        text: text string
    Returns:
        tokens: list of tokens
    """
    # lowercase the string and remove punctuation
    text = punctuation_regex.sub(" ", text.lower()).strip()
    # tokenize text
    tokens = word_tokenize(text)
    # lemmatize tokens based on pos tags (first as nouns, then as verbs)
    for pos_tag in pos_tags_to_lemmatize:
        tokens = [wordnet_lemmatizer.lemmatize(token, pos=pos_tag) for token in tokens]
    # remove stopwords
    tokens = [token for token in tokens if token not in stop_words]
    return tokens
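As a quick sanity check, we can run the tokenizer on a made-up example message (the exact tokens will depend on the NLTK resources downloaded above):
# sanity check: lowercased, lemmatized tokens with punctuation and stopwords removed
print(tokenize("We are trapped after the flooding, please send water!"))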
def build_model() -> GridSearchCV:
    """
    Builds classification model
    Returns:
        cv: model
    """
    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(XGBClassifier(learning_rate=0.1)))
    ])
    parameters = {
        "clf__estimator__max_depth": [8, 16],
        "clf__estimator__colsample_bytree": [0.5, 0.75]
    }
    cv = GridSearchCV(pipeline, cv=3, param_grid=parameters, n_jobs=1, scoring="f1_micro")
    return cv
model = build_model()
model.fit(X_train, y_train)
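Since build_model returns a GridSearchCV, after fitting we can inspect which hyperparameter combination won the search using scikit-learn's standard attributes:
# best hyperparameter combination and its cross-validated f1_micro score
print(model.best_params_)
print(model.best_score_)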
y_preds = model.predict(X_test)
print(classification_report(y_test.values, y_preds, target_names=category_names, zero_division=0))
# collect accuracy scores in a dict
category_name_2_accuracy_score = {}
for i in range(len(category_names)):
    category_name_2_accuracy_score[category_names[i]] = accuracy_score(y_test.values[:, i], y_preds[:, i])
print(pd.Series(category_name_2_accuracy_score))
This is an imbalanced dataset: most categories for a given message will be 0, so a model that always predicts 0 can still achieve high accuracy. The f1-score is therefore a better measure of the model's performance.
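To see the imbalance concretely, we can look at the fraction of positive labels per category. A minimal check using the label dataframe Y defined earlier:
# fraction of messages labeled 1 in each category, sorted ascending
print(Y.mean().sort_values())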
def predict(text: str) -> list:
    """
    Returns a list of predicted categories for the given text.
    Args:
        text: text string
    Returns:
        predicted_categories: list of categories
    """
    preds = model.predict([text])
    predicted_categories = [category for i, category in enumerate(category_names) if preds[0][i] == 1]
    return predicted_categories
predict("after the floods in our area we are trapped. we need food and shelter ")
This notebook sticks to the basics in order to provide a good baseline model for this classification task. There is plenty of room for improvement here, including but not limited to:
- The `related` and `child_alone` categories are highly skewed. We can look at adding more data for these categories.

In this project, we built an ETL pipeline to load messy, unsuitable-for-training data, clean and transform it, and save it to a database. We also built an ML pipeline to tokenize message text and train an XGBoost classifier to classify messages into different categories.
We stuck to the basics for building this classifier, and there's plenty of room for improvement in the future using modern NLP architectures.