Exploratory Data Analysis and Visualizations of the Titanic Dataset¶

The Titanic Dataset on Kaggle is likely the most popular dataset for newbies in Data Science, and for good reasons too. One of these reasons is, of course, its pop culture appeal thanks to the 1997 movie Titanic, directed by James Cameron. Also, as an introduction to machine learning, it gives beginners the opportunity to build models that predict whether a given passenger on the Titanic shipwreck survived (classification), based on features in the data such as age, sex and the fare paid for the trip.

This field of Machine Learning is broadly labelled Supervised Learning. Supervised Learning is the branch of machine learning (ML) that involves predicting labels, such as 'Survived' or 'Not Survived'. Such models:

  • learn from labelled data, e.g. data that includes whether a passenger survived (called model training).
  • and then predict on unlabelled data (see the minimal sketch just below).
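As a minimal sketch of that workflow (an illustration only: it assumes scikit-learn, which isn't used elsewhere in this notebook, and a tiny made-up two-column frame):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Labelled data: a feature column plus the label we want to learn ('Survived')
labelled = pd.DataFrame({'Fare': [7.25, 71.28, 7.92, 53.10],
                         'Survived': [0, 1, 1, 1]})

model = LogisticRegression()
model.fit(labelled[['Fare']], labelled['Survived'])    # model training on labelled data

# Unlabelled data: the trained model predicts the missing label
unlabelled = pd.DataFrame({'Fare': [8.05, 80.00]})
print(model.predict(unlabelled))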

But before building your machine learning model, it's important to get familiar with the dataset, make some assumptions about correlations between features and identify any hidden patterns within the data. It is also important to identify any outliers or missing data before feeding the data into the model. This process is called Exploratory Data Analysis.

Approach¶

Here's a simplified approach to supervised learning:

  • Exploratory Data Analysis (EDA) & cleaning;
  • Feature engineering;
  • Model building;
  • Iterate to get a model that performs better.

This project will focus only on the Exploratory Data Analysis, using Plotly Express to visualize the data!

Import your data and check it out¶

In [1]:
# Import libraries to be used
import pandas as pd
import plotly.express as px
In [2]:
# Import test and train datasets
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

# View first lines of training data
df_train.head()
Out[2]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
  • What are all these features? Check out the Kaggle data documentation here for a deeper understanding of the columns.

Important note on terminology:

  • variables/columns are known as features (or predictor variables).
In [3]:
# View first lines of test data
df_test.head()
Out[3]:
PassengerId Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 892 3 Kelly, Mr. James male 34.5 0 0 330911 7.8292 NaN Q
1 893 3 Wilkes, Mrs. James (Ellen Needs) female 47.0 1 0 363272 7.0000 NaN S
2 894 2 Myles, Mr. Thomas Francis male 62.0 0 0 240276 9.6875 NaN Q
3 895 3 Wirz, Mr. Albert male 27.0 0 0 315154 8.6625 NaN S
4 896 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) female 22.0 1 1 3101298 12.2875 NaN S
  • Use the DataFrame .info() method to check out datatypes, missing values and more (of df_train).
In [4]:
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
  • Use the DataFrame .describe() method to check out summary statistics of numeric columns (of df_train).
In [5]:
df_train.describe()
Out[5]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 891.000000 891.000000 891.000000 714.000000 891.000000 891.000000 891.000000
mean 446.000000 0.383838 2.308642 29.699118 0.523008 0.381594 32.204208
std 257.353842 0.486592 0.836071 14.526497 1.102743 0.806057 49.693429
min 1.000000 0.000000 1.000000 0.420000 0.000000 0.000000 0.000000
25% 223.500000 0.000000 2.000000 20.125000 0.000000 0.000000 7.910400
50% 446.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 668.500000 1.000000 3.000000 38.000000 1.000000 0.000000 31.000000
max 891.000000 1.000000 3.000000 80.000000 8.000000 6.000000 512.329200

Recap:

  • The data was loaded successfully and inspected. When using this data to build a machine learning model in the future, it's worth bearing in mind that there are a number of missing (null) values in the dataset that will need to be treated; a quick way to count them is shown below.
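A small sketch, using the df_train already loaded, to count those missing values per column:

# Number of missing values in each column of the training set
df_train.isnull().sum()

From the .info() output above, the columns with gaps are Age (177 missing), Cabin (687 missing) and Embarked (2 missing).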

Up next: Explore the data visually and draw inferences!

Visual exploratory data analysis¶

  • Use plotly to build a histogram plot, showing the count of Titanic survivors.
In [6]:
fig = px.histogram(data_frame=df_train, x=df_train['Survived'].astype('string'), color='Survived')
fig.show()
# The values in the 'Survived' column are converted to strings so the feature is treated as categorical rather than numeric

Take-away: In the training set, fewer people survived than didn't.
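A quick non-visual check of the same counts (a small pandas sketch):

# Count of non-survivors (0) vs. survivors (1) in the training set
df_train['Survived'].value_counts()

This should return 549 non-survivors against 342 survivors, matching the bar heights above.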

Up next: More EDA.

EDA on feature variables¶

  • Use plotly to build a histogram plot of the Titanic dataset feature 'Sex' (of df_train).
In [7]:
fig = px.histogram(data_frame=df_train, x ='Sex', color='Sex')
fig.show()
  • Use plotly to build histogram plots of the Titanic dataset feature 'Sex', first split (faceted) over the feature 'Survived' and then coloured by it.
In [8]:
fig = px.histogram(data_frame=df_train, x ='Sex', facet_col='Survived', color='Sex')
fig.show()
In [9]:
fig = px.histogram(data_frame=df_train, x ='Sex', color='Survived')
fig.show()

Take-away: Distribution of passengers according to gender shows women were more likely to survive than men.

  • Use pandas to calculate the number of women and men who survived or not.
In [10]:
sex_group = df_train.groupby('Sex')
sex_group['Survived'].value_counts()
Out[10]:
Sex     Survived
female  1           233
        0            81
male    0           468
        1           109
Name: Survived, dtype: int64
In [11]:
sex_group['Survived'].sum()
# An alternative way to calculate the number of survivors by gender, since the 'Survived' column is encoded as 1s and 0s.
Out[11]:
Sex
female    233
male      109
Name: Survived, dtype: int64
  • Use pandas to figure out the proportion of women that survived, along with the proportion of men:
In [12]:
print(df_train[df_train.Sex == 'female'].Survived.sum()/df_train[df_train.Sex == 'female'].Survived.count())
print(df_train[df_train.Sex == 'male'].Survived.sum()/df_train[df_train.Sex == 'male'].Survived.count())
0.7420382165605095
0.18890814558058924

~74% of women survived, while ~19% of men survived.
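A more concise way to get the same proportions (a sketch; the cell above uses boolean filtering instead) relies on the fact that the mean of a 0/1 column is the proportion of 1s:

# Survival rate by sex
df_train.groupby('Sex')['Survived'].mean()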

Recap:

  • loaded in the data and had a look at it.
  • explored the 'Survived' variable visually.

Up next: EDA of other feature variables, categorical and numeric.

Explore your data more!¶

  • Use plotly to build histogram plots of the Titanic dataset feature 'Survived' split (faceted) over the feature 'Pclass'.
In [13]:
fig = px.histogram(data_frame=df_train, x =df_train['Survived'].astype('string'), facet_col='Pclass', color='Survived')
fig.show()

Take-away: The chart suggests that, tragically, passengers in the lower classes of the ship were less likely to have survived the wreck. Conversely, it appears that passengers in first class were more likely to survive, perhaps by virtue of their social status.
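One quick numeric check of this (a sketch, not part of the original cells) is the survival rate within each passenger class:

# Proportion of survivors in each passenger class
df_train.groupby('Pclass')['Survived'].mean()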

  • Use plotly to build bar plots of the Titanic dataset feature 'Survived' split (faceted) over the feature 'Embarked'.
In [14]:
fig = px.histogram(data_frame=df_train, x =df_train['Survived'].astype('string'), color='Embarked')
fig.show()
# returns an error
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Input In [14], in <cell line: 1>()
----> 1 fig = px.histogram(data_frame=df_train, x =df_train['Survived'].astype('string'), color='Embarked')
      2 fig.show()

File ~\anaconda3\lib\site-packages\plotly\express\_chart_types.py:472, in histogram(data_frame, x, y, color, pattern_shape, facet_row, facet_col, facet_col_wrap, facet_row_spacing, facet_col_spacing, hover_name, hover_data, animation_frame, animation_group, category_orders, labels, color_discrete_sequence, color_discrete_map, pattern_shape_sequence, pattern_shape_map, marginal, opacity, orientation, barmode, barnorm, histnorm, log_x, log_y, range_x, range_y, histfunc, cumulative, nbins, text_auto, title, template, width, height)
    426 def histogram(
    427     data_frame=None,
    428     x=None,
   (...)
    464     height=None,
    465 ):
    466     """
    467     In a histogram, rows of `data_frame` are grouped together into a
    468     rectangular mark to visualize the 1D distribution of an aggregate
    469     function `histfunc` (e.g. the count or sum) of the value `y` (or `x` if
    470     `orientation` is `'h'`).
    471     """
--> 472     return make_figure(
    473         args=locals(),
    474         constructor=go.Histogram,
    475         trace_patch=dict(
    476             histnorm=histnorm, histfunc=histfunc, cumulative=dict(enabled=cumulative),
    477         ),
    478         layout_patch=dict(barmode=barmode, barnorm=barnorm),
    479     )

File ~\anaconda3\lib\site-packages\plotly\express\_core.py:1988, in make_figure(args, constructor, trace_patch, layout_patch)
   1986 facet_col_wrap = args.get("facet_col_wrap", 0)
   1987 for group_name in sorted_group_names:
-> 1988     group = grouped.get_group(group_name if len(group_name) > 1 else group_name[0])
   1989     mapping_labels = OrderedDict()
   1990     trace_name_labels = OrderedDict()

File ~\anaconda3\lib\site-packages\pandas\core\groupby\groupby.py:747, in BaseGroupBy.get_group(self, name, obj)
    745 inds = self._get_index(name)
    746 if not len(inds):
--> 747     raise KeyError(name)
    749 return obj._take_with_is_copy(inds, axis=self.axis)

KeyError: (nan, '', '', '', '')
In [15]:
# returns an error because there are NaN values in the 'Embarked' column. Let's take a look at the count.
In [16]:
df_train['Embarked'].value_counts(dropna=False)
# 2 NaN values. Could be stowaways.
Out[16]:
S      644
C      168
Q       77
NaN      2
Name: Embarked, dtype: int64
In [17]:
# Makes sense to fill the NaN values with 'S' as that is the most common port of embarkation
fig = px.histogram(data_frame=df_train, x =df_train['Survived'].astype('string'), facet_col=df_train['Embarked'].fillna('S'), color='Survived')
fig.show()

Take-away: The chart suggests that passengers who embarked at Southampton were the least likely to survive the wreck. Perhaps there is a correlation between the port of embarkation and the class of ticket bought.
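That hypothesis could be checked with a cross-tabulation of port against ticket class; a small sketch:

# How ticket classes are distributed across the ports of embarkation
pd.crosstab(df_train['Embarked'], df_train['Pclass'])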

EDA with numeric variables¶

  • Use plotly to plot a histogram of the 'Fare' column of df_train.
In [18]:
fig = px.histogram(data_frame=df_train, x='Fare')
fig.show()

Take-away: The distribution of fares paid is skewed to the right. The majority of tickets bought for the trip cost less than 100 (currency not specified).
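With a right-skewed distribution like this, the long tail is easier to read with more bins and a log-scaled count axis; a quick sketch:

# Same histogram, with more bins and a logarithmic y-axis to expose the right tail
fig = px.histogram(data_frame=df_train, x='Fare', nbins=50, log_y=True)
fig.show()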

  • Use plotly to plot a histogram of the 'Age' column of df_train. Hint: Before plotting, check whether there are any null values; you may need to drop them first.
In [19]:
df_train['Age'].isna().sum()
# 177 records without age. These records will be dropped for visualizing, but during cleaning we would fill them with either the mean, mode or median age.
Out[19]:
177
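For that later cleaning step, filling the missing ages might look like this (a sketch kept in a separate variable so the EDA below is unaffected; the choice between mean, mode and median would be made during cleaning):

# Example: impute missing ages with the median age (illustration only)
age_filled = df_train['Age'].fillna(df_train['Age'].median())
age_filled.isna().sum()    # 0 missing values after imputation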
In [20]:
fig = px.histogram(data_frame=df_train, x=df_train['Age'].dropna())
fig.show()

Take-away: Age distribution of passengers looks slightly skewed to the right.

  • Plot a strip plot of 'Fare' with 'Survived' on the x-axis.
In [21]:
fig = px.strip(df_train, x=df_train["Survived"], y="Fare")
fig.show()

Take-away: With the strip plot, we can easily identify the outliers in the 'Fare' feature.
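A box plot gives a complementary view, since it draws the points beyond the whiskers individually as outliers; a sketch with Plotly Express:

# Box plot of Fare by survival; outliers appear as individual points beyond the whiskers
fig = px.box(data_frame=df_train, x=df_train['Survived'].astype('string'), y='Fare')
fig.show()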

  • Use the DataFrame method .describe() to check out summary statistics of 'Fare' as a function of survival.
In [22]:
not_survived = df_train['Survived'] == 0
df_train.loc[not_survived, 'Fare'].describe()
Out[22]:
count    549.000000
mean      22.117887
std       31.388207
min        0.000000
25%        7.854200
50%       10.500000
75%       26.000000
max      263.000000
Name: Fare, dtype: float64
In [23]:
survived = df_train['Survived'] == 1
df_train.loc[survived, 'Fare'].describe()
Out[23]:
count    342.000000
mean      48.395408
std       66.596998
min        0.000000
25%       12.475000
50%       26.000000
75%       57.000000
max      512.329200
Name: Fare, dtype: float64
  • Use plotly to plot a scatter plot of 'Age' against 'Fare', colored by 'Survived'.
In [24]:
fig = px.scatter(data_frame=df_train, x='Age', y='Fare', color='Survived')
fig.show()

Take-away: Another interesting detail that pops up is that many of the survivors were young people between the ages of 0 and ~18. If true, this is unsurprising taking into account the Birkenhead drill, a code of conduct whereby the lives of women and children were to be saved first in a life-threatening situation, typically when abandoning ship with limited survival resources such as lifeboats.
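That impression can be sanity-checked directly (a quick sketch; the cut-off of 18 is an assumption):

# Survival rate for passengers aged 18 or under, compared with the overall rate
print(df_train.loc[df_train['Age'] <= 18, 'Survived'].mean())
print(df_train['Survived'].mean())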

Recap:

  • Successfully loaded in the data and had a look at it.
  • Explored the target variable ('Survived') visually.
  • Explored the feature variables visually, both categorical and numeric.