1. Introduction

In this project, we'll build a multilabel classifier to classify Disaster event related messages into appropriate categories. This can can help tell us about the nature of the event so that the message can be routed to the correct organizations, enabling faster mobilization of resources.

The project will be divided into two separate modules: "Extract, Transformer, Load" and Machine Learning. As you'll see, the initial data for the project is not clean. We'll take this opportunity to showcase some basic ETL skills, and save the clean data to a Sqlite database, which can then be loaded by the ML pipeline. Finally, we will train an XGBoost classifier to predict the labels fo new messages.

This is a walkthrough notebook of the project. The finished scripts, and a web app with a user interface to allow message classification can be found here.

Note: This project is available as part of some Udacity nanodegrees.

In [ ]:
import pandas as pd
import matplotlib
from sqlalchemy import create_engine

2. Data

Data is available to us in two different files: disaster_messages.csv and disaster_categories.csv.

Let's load these two datasets and take a look at their first few rows.

In [ ]:
categories = pd.read_csv('https://raw.githubusercontent.com/alex-coch/alex-coch.github.io/main/message/data/disaster_categories.csv')
messages = pd.read_csv('https://raw.githubusercontent.com/alex-coch/alex-coch.github.io/main/message/data/disaster_messages.csv')
In [ ]:
messages.head()
Out[ ]:
id message original genre
0 2 Weather update - a cold front from Cuba that c... Un front froid se retrouve sur Cuba ce matin. ... direct
1 7 Is the Hurricane over or is it not over Cyclone nan fini osinon li pa fini direct
2 8 Looking for someone but no name Patnm, di Maryani relem pou li banm nouvel li ... direct
3 9 UN reports Leogane 80-90 destroyed. Only Hospi... UN reports Leogane 80-90 destroyed. Only Hospi... direct
4 12 says: west side of Haiti, rest of the country ... facade ouest d Haiti et le reste du pays aujou... direct

We're only interested in the message column from the disaster_messages dataset. We'll use that column to train our message classifier, and ignore the other columns.

In [ ]:
categories.head()
Out[ ]:
id categories
0 2 related-1;request-0;offer-0;aid_related-0;medi...
1 7 related-1;request-0;offer-0;aid_related-1;medi...
2 8 related-1;request-0;offer-0;aid_related-0;medi...
3 9 related-1;request-1;offer-0;aid_related-1;medi...
4 12 related-1;request-0;offer-0;aid_related-0;medi...
In [ ]:
categories.iloc[0]['categories']
Out[ ]:
'related-1;request-0;offer-0;aid_related-0;medical_help-0;medical_products-0;search_and_rescue-0;security-0;military-0;child_alone-0;water-0;food-0;shelter-0;clothing-0;money-0;missing_people-0;refugees-0;death-0;other_aid-0;infrastructure_related-0;transport-0;buildings-0;electricity-0;tools-0;hospitals-0;shops-0;aid_centers-0;other_infrastructure-0;weather_related-0;floods-0;storm-0;fire-0;earthquake-0;cold-0;other_weather-0;direct_report-0'

The disaster_categories dataset contains the category labels for our messages, but in a serialized fashion. We have also taken a closer look at the category for the first row. We'll need to convert the categories to a format better suited for our ML model.

3. Extract, Transform, Load!

Let's start by merging the two datasets so that the messages and categories are present in the same dataframe.

In [ ]:
df = messages.merge(categories, on='id')
In [ ]:
df.head()
Out[ ]:
id message original genre categories
0 2 Weather update - a cold front from Cuba that c... Un front froid se retrouve sur Cuba ce matin. ... direct related-1;request-0;offer-0;aid_related-0;medi...
1 7 Is the Hurricane over or is it not over Cyclone nan fini osinon li pa fini direct related-1;request-0;offer-0;aid_related-1;medi...
2 8 Looking for someone but no name Patnm, di Maryani relem pou li banm nouvel li ... direct related-1;request-0;offer-0;aid_related-0;medi...
3 9 UN reports Leogane 80-90 destroyed. Only Hospi... UN reports Leogane 80-90 destroyed. Only Hospi... direct related-1;request-1;offer-0;aid_related-1;medi...
4 12 says: west side of Haiti, rest of the country ... facade ouest d Haiti et le reste du pays aujou... direct related-1;request-0;offer-0;aid_related-0;medi...

Cleaning and transforming the categories column

Let's start by splitting the categories column into separate columns for each category.

In [ ]:
categories = df['categories'].str.split(';',expand=True)
In [ ]:
categories.head()
Out[ ]:
0 1 2 3 4 5 6 7 8 9 ... 26 27 28 29 30 31 32 33 34 35
0 related-1 request-0 offer-0 aid_related-0 medical_help-0 medical_products-0 search_and_rescue-0 security-0 military-0 child_alone-0 ... aid_centers-0 other_infrastructure-0 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 other_weather-0 direct_report-0
1 related-1 request-0 offer-0 aid_related-1 medical_help-0 medical_products-0 search_and_rescue-0 security-0 military-0 child_alone-0 ... aid_centers-0 other_infrastructure-0 weather_related-1 floods-0 storm-1 fire-0 earthquake-0 cold-0 other_weather-0 direct_report-0
2 related-1 request-0 offer-0 aid_related-0 medical_help-0 medical_products-0 search_and_rescue-0 security-0 military-0 child_alone-0 ... aid_centers-0 other_infrastructure-0 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 other_weather-0 direct_report-0
3 related-1 request-1 offer-0 aid_related-1 medical_help-0 medical_products-1 search_and_rescue-0 security-0 military-0 child_alone-0 ... aid_centers-0 other_infrastructure-0 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 other_weather-0 direct_report-0
4 related-1 request-0 offer-0 aid_related-0 medical_help-0 medical_products-0 search_and_rescue-0 security-0 military-0 child_alone-0 ... aid_centers-0 other_infrastructure-0 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 other_weather-0 direct_report-0

5 rows × 36 columns

Extracting the column names from a row of the categories dataframe.

In [ ]:
category_column_names = categories.iloc[0].apply(lambda x: x.split("-")[0])
In [ ]:
# Let's check which categories we have
print([name for name in category_column_names])
['related', 'request', 'offer', 'aid_related', 'medical_help', 'medical_products', 'search_and_rescue', 'security', 'military', 'child_alone', 'water', 'food', 'shelter', 'clothing', 'money', 'missing_people', 'refugees', 'death', 'other_aid', 'infrastructure_related', 'transport', 'buildings', 'electricity', 'tools', 'hospitals', 'shops', 'aid_centers', 'other_infrastructure', 'weather_related', 'floods', 'storm', 'fire', 'earthquake', 'cold', 'other_weather', 'direct_report']

Using the extracted column names as the header of the categories dataframe:

In [ ]:
categories.columns = category_column_names
In [ ]:
categories.head()
Out[ ]:
related request offer aid_related medical_help medical_products search_and_rescue security military child_alone ... aid_centers other_infrastructure weather_related floods storm fire earthquake cold other_weather direct_report
0 related-1 request-0 offer-0 aid_related-0 medical_help-0 medical_products-0 search_and_rescue-0 security-0 military-0 child_alone-0 ... aid_centers-0 other_infrastructure-0 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 other_weather-0 direct_report-0
1 related-1 request-0 offer-0 aid_related-1 medical_help-0 medical_products-0 search_and_rescue-0 security-0 military-0 child_alone-0 ... aid_centers-0 other_infrastructure-0 weather_related-1 floods-0 storm-1 fire-0 earthquake-0 cold-0 other_weather-0 direct_report-0
2 related-1 request-0 offer-0 aid_related-0 medical_help-0 medical_products-0 search_and_rescue-0 security-0 military-0 child_alone-0 ... aid_centers-0 other_infrastructure-0 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 other_weather-0 direct_report-0
3 related-1 request-1 offer-0 aid_related-1 medical_help-0 medical_products-1 search_and_rescue-0 security-0 military-0 child_alone-0 ... aid_centers-0 other_infrastructure-0 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 other_weather-0 direct_report-0
4 related-1 request-0 offer-0 aid_related-0 medical_help-0 medical_products-0 search_and_rescue-0 security-0 military-0 child_alone-0 ... aid_centers-0 other_infrastructure-0 weather_related-0 floods-0 storm-0 fire-0 earthquake-0 cold-0 other_weather-0 direct_report-0

5 rows × 36 columns

Next, we'll fix the values of the above dataset so that they are binary. 1 indicates that a message belonging to a given category.

In [ ]:
for column in categories:
    categories[column] = categories[column].str[-1].astype(int)

The related column contains values other than 1 or 2, so we will convert the 2s to 1s by testricting the maximum value of a category as 1.

In [ ]:
categories['related'].value_counts()
Out[ ]:
1    20042
0     6140
2      204
Name: related, dtype: int64
In [ ]:
categories['related'] = categories['related'].clip(0,1)
In [ ]:
categories['related'].value_counts()
Out[ ]:
1    20246
0     6140
Name: related, dtype: int64

Let's concatenate the categories dataframe back to our original dataframe.

In [ ]:
# drop the original categories column from `df`
df.drop('categories',axis=1,inplace=True)

# concatenate the original dataframe with the new `categories` dataframe
df = pd.concat([df,categories], axis=1)
In [ ]:
df.head(2)
Out[ ]:
id message original genre related request offer aid_related medical_help medical_products ... aid_centers other_infrastructure weather_related floods storm fire earthquake cold other_weather direct_report
0 2 Weather update - a cold front from Cuba that c... Un front froid se retrouve sur Cuba ce matin. ... direct 1 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 7 Is the Hurricane over or is it not over Cyclone nan fini osinon li pa fini direct 1 0 0 1 0 0 ... 0 0 1 0 1 0 0 0 0 0

2 rows × 40 columns

Removing duplicates and NaN rows, if any

In [ ]:
# checking for duplicate rows in df
df.duplicated().sum()
Out[ ]:
171
In [ ]:
# drop duplicates
df.drop_duplicates(keep='first', inplace=True)
In [ ]:
# drop NaN messages
df.dropna(subset=["message"], axis=0, inplace=True) # drop the row
In [ ]:
# drop the id and original columns as they are not useful for the learning problem
df.drop(["id", "original"], axis=1, inplace=True)

Saving to a database

As part of this project's simulated scenario, we'll use sqlite and save the dataframe to a db.

In [ ]:
database_filename = "disaster_db"

engine = create_engine('sqlite:///'+ database_filename)
df.to_sql('messages', engine, if_exists="replace", index=False)

4. Machine Learning Pipeline

Let's start by loading the data back form our database:

In [ ]:
import re

import nltk
# uncomment below if NLTK data needs to be downloaded
nltk.download(['punkt', 'wordnet'], quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('omw-1.4', quiet=True)
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

from xgboost import XGBClassifier

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import classification_report, accuracy_score
from sklearn.multioutput import MultiOutputClassifier
In [ ]:
engine = create_engine('sqlite:///' + database_filename)
df = pd.read_sql_query('select * from messages', engine)
# the training data is a numpy array of all messages
X = df['message'].values
# the labels are all the different categories
Y = df.drop(columns=['message','genre'], axis=1)
category_names = Y.columns
In [ ]:
len(X)
Out[ ]:
26215
In [ ]:
X[:5]
Out[ ]:
array(['Weather update - a cold front from Cuba that could pass over Haiti',
       'Is the Hurricane over or is it not over',
       'Looking for someone but no name',
       'UN reports Leogane 80-90 destroyed. Only Hospital St. Croix functioning. Needs supplies desperately.',
       'says: west side of Haiti, rest of the country today and tonight'],
      dtype=object)
In [ ]:
Y.head()
Out[ ]:
related request offer aid_related medical_help medical_products search_and_rescue security military child_alone ... aid_centers other_infrastructure weather_related floods storm fire earthquake cold other_weather direct_report
0 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
1 1 0 0 1 0 0 0 0 0 0 ... 0 0 1 0 1 0 0 0 0 0
2 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
3 1 1 0 1 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
4 1 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 36 columns

Splitting the dataset into train/test sets

In [ ]:
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2)
In [ ]:
print(len(X_train), len(X_test))
20972 5243

Preprocessing the messages texts

For the classifier to perform well, we need to preprocess the disaster message texts to standardize and tokenize them. We can define a function to do this, which can be used later when defining our model pipeline:

In [ ]:
punctuation_regex = re.compile(r"[^\w\s]")
stopwords = stop_words = stopwords.words('english')
wordnet_lemmatizer = WordNetLemmatizer()
pos_tags_to_lemmatize = ["n", "v"]
In [ ]:
def tokenize(text: str) -> list:
    """
    Tokenizes a given text.
    Args:
        text: text string
    Returns:
        tokens: list of tokens
    """
    # lowercase string and remove punctuation
    text = punctuation_regex.sub(" ", text.lower()).strip()
    # tokenize text
    tokens = [token for token in word_tokenize(text)]
    # lemmatize text based on pos tags
    for pos_tag in pos_tags_to_lemmatize:
        tokens = [wordnet_lemmatizer.lemmatize(token, pos=pos_tag) for token in word_tokenize(text)]
    # remove stopwords
    tokens = [token for token in tokens if token not in stopwords]
    return tokens

Building the model pipeline

In [ ]:
def build_model() -> GridSearchCV:
    """
    Builds classification model 
    Returns:
        cv: model
    """

    pipeline = Pipeline([
        ('vect', CountVectorizer(tokenizer=tokenize)),
        ('tfidf', TfidfTransformer()),
        ('clf', MultiOutputClassifier(XGBClassifier(learning_rate=0.1)))
    ])

    parameters = {
        "clf__estimator__max_depth": [8, 16],
        "clf__estimator__colsample_bytree":[0.5, 0.75]
    }

    cv = GridSearchCV(pipeline, cv=3, param_grid=parameters, n_jobs=1, scoring="f1_micro")
    return cv

Fitting the model

In [ ]:
model = build_model()
In [ ]:
model.fit(X_train, y_train)
Out[ ]:
GridSearchCV(cv=3,
             estimator=Pipeline(steps=[('vect',
                                        CountVectorizer(tokenizer=<function tokenize at 0x7f5f74575820>)),
                                       ('tfidf', TfidfTransformer()),
                                       ('clf',
                                        MultiOutputClassifier(estimator=XGBClassifier()))]),
             n_jobs=1,
             param_grid={'clf__estimator__colsample_bytree': [0.5, 0.75],
                         'clf__estimator__max_depth': [8, 16]},
             scoring='f1_micro')

Evaluating the model

In [ ]:
y_preds = model.predict(X_test)
print(classification_report(y_preds, y_test.values, target_names=category_names, zero_division=0))
                        precision    recall  f1-score   support

               related       0.96      0.83      0.89      4657
               request       0.56      0.78      0.65       644
                 offer       0.00      0.00      0.00         0
           aid_related       0.67      0.77      0.72      1867
          medical_help       0.26      0.60      0.37       181
      medical_products       0.32      0.74      0.45       120
     search_and_rescue       0.22      0.67      0.34        49
              security       0.01      0.11      0.03         9
              military       0.33      0.67      0.44        79
           child_alone       0.00      0.00      0.00         0
                 water       0.66      0.74      0.70       296
                  food       0.78      0.79      0.78       585
               shelter       0.60      0.74      0.66       351
              clothing       0.53      0.76      0.62        58
                 money       0.23      0.67      0.34        42
        missing_people       0.14      0.73      0.24        11
              refugees       0.24      0.56      0.34        68
                 death       0.51      0.76      0.61       157
             other_aid       0.15      0.53      0.23       193
infrastructure_related       0.07      0.56      0.12        39
             transport       0.26      0.84      0.40        74
             buildings       0.42      0.70      0.53       154
           electricity       0.25      0.56      0.34        48
                 tools       0.00      0.00      0.00         0
             hospitals       0.07      0.44      0.12         9
                 shops       0.00      0.00      0.00         0
           aid_centers       0.00      0.00      0.00         3
  other_infrastructure       0.02      0.33      0.04        15
       weather_related       0.69      0.85      0.76      1217
                floods       0.59      0.86      0.70       286
                 storm       0.66      0.75      0.70       440
                  fire       0.27      0.74      0.40        23
            earthquake       0.79      0.90      0.84       456
                  cold       0.34      0.73      0.47        55
         other_weather       0.12      0.48      0.19        67
         direct_report       0.45      0.71      0.55       622

             micro avg       0.61      0.79      0.69     12875
             macro avg       0.34      0.58      0.40     12875
          weighted avg       0.72      0.79      0.74     12875
           samples avg       0.53      0.66      0.54     12875

In [ ]:
# collect accuracy scores in a dict
category_name_2_accuracy_score = {}
for i in range(len(category_names)):
    category_name_2_accuracy_score[y_test.columns[i]] = accuracy_score(y_test.values[:,i],y_preds[:,i])
print(pd.Series(category_name_2_accuracy_score))
related                   0.815945
request                   0.898341
offer                     0.995041
aid_related               0.781995
medical_help              0.928285
medical_products          0.957849
search_and_rescue         0.975014
security                  0.985123
military                  0.974061
child_alone               1.000000
water                     0.963761
food                      0.950792
shelter                   0.949456
clothing                  0.989891
money                     0.979210
missing_people            0.990082
refugees                  0.971581
death                     0.971200
other_aid                 0.872401
infrastructure_related    0.938013
transport                 0.964906
buildings                 0.963189
electricity               0.980355
tools                     0.992562
hospitals                 0.988556
shops                     0.995804
aid_centers               0.986649
other_infrastructure      0.957086
weather_related           0.877932
floods                    0.959756
storm                     0.946023
fire                      0.990082
earthquake                0.970628
cold                      0.982644
other_weather             0.946595
direct_report             0.862293
dtype: float64

This is an imbalanced dataset, as most categories for a given message will be 0. The f1-score is a better metric of the model's performance.

Predicting categories for new messages

In [ ]:
def predict(text: str) -> list:
    """
    Returns a list of predicted categories for the given text
    Args:
        text: text string
    Returns:
        predicted_categories: list of  categories
    """
    preds = model.predict([text])
    predicted_categories = [category for i, category in enumerate(y_test.columns) if preds[0][i] == 1]
    return predicted_categories
In [ ]:
predict("after the floods in our area we are trapped. we need food and shelter ")
Out[ ]:
['related',
 'request',
 'aid_related',
 'search_and_rescue',
 'food',
 'shelter',
 'weather_related',
 'floods',
 'direct_report']

5. Further Improvements

This notebook sticks to the basics in order to provide a good baseline model for this classification task. There is plenty of room for improvement here, including but not limited to:

  1. Using word embeddings (GloVe, word2vec) or even sentence embeddings (Univsersal Sentgence Encoder) to transform message text, instead of using a CountVectorizer. This should allow the model to generalize well to similar/unseen words and improve the model accuracy too.
  2. Some of the category column values like related and child_alone are highly skewed. We can look at adding more data for these categories.
  3. A Neural Network can be trained for this task. Better yet, a pre-trained state-of-the-art transformer based network can be fine tuned using the data available to us.
  4. Certain classification categories are "noisy" such as "related" or "child alone" (no positive examples). These can either be removed, or more data can be acquired that provides enough missing examples for such cases.

6. Summary

In this project, we built an ETL pipeline to load messy and unsuitable-for-training data, clean and transform it, and save it to a database. We also built a ML pipeline to tokenize message text, and train an XGBoost classifier to classify messages into different categories.

We stuck to the basics for building this classifier, and there's plenty of room for improvement in the future using modern NLP architectures.