Data 101

While you wait

Please complete this survey: Data Literacy Survey

Welcome to Juno College

Welcome to a FREE Sneak Peek into our newly launched Continuing Education Data Foundations Course.

We’ve been teaching people to code since 2012, and we’ve mastered pairing the perfect students with the perfect material and learning environment. How do we know you’ll love Juno? Don't just take our word for it.

What we are going to cover

About the course

The Data Foundations course is a beginner-friendly course that teaches functional data literacy and foundational data science concepts.

Data literacy means being able to understand, work with, and analyze data, and is becoming an increasingly in-demand skill across all industries. If having a better understanding of data is something that would help you in your day-to-day work, then this course is the perfect starting point.

Using the most popular tools and commonly used approaches, the course teaches students how to use Spreadsheets, SQL, R and Python to interact with data and perform typical steps in the data science workflow.

The course includes a mini capstone project that students will be expected to complete as part of the curriculum.

Download the course package.

Set Up

We are going to use a free, demo version of a tool called Jupyter Notebook.

Jupyter Notebook

Notebooks have rapidly grown in popularity among data scientists to become the de facto standard for quick prototyping and exploratory analysis.

Notebooks allow you to create and share documents that contain live code, equations, visualizations, and narrative text.

Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.

Get Started : Jupyter

1. Go to https://jupyter.org/try and click on "Try Jupyter Lab". Wait for it to load.

2. Close the two tabs "Lorenz" and "Reference"

3. Click the New Folder icon to create a new folder and give it a name

4. Click the Plus icon to create a new notebook (in your folder) and give it a name

In today's workshop, we will use a language called Markdown.

Markdown is a lightweight markup language with plain text formatting syntax. Markdown is often used to format readme files, and to create rich text using a plain text editor.

Get Started : Markdown

1. Open the notebook you created and change the type of the cell from code to markdown using the drop-down menu

2. Create a heading using this markdown code
# This is a heading

3. Create a markdown sub-heading using the code
## This is a h2

4. Create a markdown list using the code
* This is an item in a list

Python

In today's workshop, we are going to use Python to write code to explore, visualize and model the data.

We will not be learning Python itself; rather, we will use the power of Python libraries to accomplish today's data science steps!

Data Science

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data (source: Wikipedia).

Here is one representation of a typical data science workflow that is a great way to start learning about data! (source)

In today's workshop, we will spend most of our time on step 3 and step 4. In the course itself, we will work through each step using a dataset of your choice and the tools we learn.

As you start learning about the field of data science, it's important to note that data science is an iterative process.

Step 1: Ask an interesting question

For today's workshop, we will be working with data from the legendary Titanic shipwreck. The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

To start us off, here are two questions we are interested in:
What sorts of people were aboard the Titanic? Were some groups of people more likely to survive the shipwreck than others?

Step 2: Get the data

We are going to use a publicly available dataset that contains data on a subset of passengers aboard the Titanic.


2.1 Download the Titanic dataset here
2.2 Upload the data file (titanic.csv) to your folder in Jupyter Notebook
2.3 From the left pane in Jupyter, double click the data file to get a quick view of the dataset

Step 3: Explore the data

The next step is to explore the data - this process is formally called exploratory data analysis (or EDA).

The goal of EDA is to develop an understanding of your dataset and start to answer the questions we asked in step 1!

Here are some of the types of questions that are commonly used in EDA (a few Pandas one-liners that answer them are sketched after this list):
  • What is the structure of the data:
    • How many rows?
    • How many columns?
    • What are the data types?
  • What does each row of data represent?
  • What is the quality of the data and how many missing values are there?
  • How does each variable vary across the passengers?
  • What are the relationships between variables?
  • What is the data lineage (source, processing and potential sources of bias)?
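Once the data is loaded in step 3.1 below, a few Pandas one-liners can answer several of these questions - a quick optional sketch, not a required step:
data.shape            # the number of rows and columns
data.dtypes           # the data type of each column
data.isnull().sum()   # the count of missing values per column
data.nunique()        # the count of distinct values per column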
Step 3.1: Import data using Pandas

Pandas

3.1.1 Go to your Notebook, add a code cell and run the code
pip install pandas
This tells Jupyter to use pip to install the Pandas library.
3.1.2 Add a new code cell and run the code
import pandas as pd
This tells Jupyter that you want to use the Pandas library, and refer to it using a shorter name 'pd'
3.1.3 Add a new code cell and run the code
data = pd.read_csv("titanic.csv")
This line of code says to use the Pandas library ('pd') to read the .csv data file and assign it to a Pandas data frame object called 'data'
3.1.4 Add a new code cell and run the code
data.head(10)
This line of code says to display or print the first 10 rows of the data frame

Step 3.2: Profile the data using Pandas
3.2.1 Add a code cell and run the code
data.columns
This tells Pandas to give us the names of the columns


From the data source, we have the following notes about the columns:
  • Survived ... 1 is survived, 0 means did not survive
  • Sex ... 0 for male, 1 for female
  • Pclass ... is 'passenger class'
  • SibSp ... is the number of siblings and spouses that the passenger was travelling with
  • Parch ... is the number of parents / children that the passenger was travelling with
  • Embark ... is the port of embarkation


Let's add this data dictionary to a markdown cell in our notebook.
3.2.2 Add a new code cell and run the code
data.info()
This tells Pandas to give us information about the columns - you can see the data types and how many rows have data

3.2.3 Add a new code cell and run the code
data['Age'].max()
This code calculates the maximum value of the Age column for our dataset.

Add a new code cell, change the name of the variable to 'Fare', and change the function to min(). This will return the minimum value of Fare
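If you want to check your work, the resulting cell should look something like this:
data['Fare'].min()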
3.2.4 Add a new code cell and run the code
data.mean(numeric_only=True).round(2)
This code calculates the mean value for each numeric column. The mean (average) of a set of numbers is the sum of the values divided by the number of values.
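As a tiny illustration of that arithmetic, with made-up numbers rather than values from the dataset:
values = [22, 38, 26]
print(sum(values) / len(values))   # 28.67 (rounded) - the sum divided by the count
print(pd.Series(values).mean())    # Pandas computes the same value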

3.2.5 Add a new code cell and run the code
data.describe().round(2)
This code generates basic descriptive statistics for the numeric columns in the dataset: the count, mean, standard deviation, minimum, quartiles, and maximum.

In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range.
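A quick illustration with made-up numbers, not values from the dataset:
tight = pd.Series([29, 30, 31])   # values close to their mean of 30
wide = pd.Series([10, 30, 50])    # same mean of 30, but more spread out
print(tight.std(), wide.std())    # the wider set has the much larger standard deviation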

Step 3.3: Analyze each variable

One of the key aspects of EDA is to inspect each of the variables in our dataset. Broadly speaking, there are two types of variables: categorical (e.g. Pclass, Sex, Embarked) and numerical (e.g. Age, Fare).
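One quick way to see which columns Pandas stores as numbers - keeping in mind that some categorical variables, like Pclass and Survived, are stored as numbers too, so the data dictionary still matters:
data.dtypes                                     # the storage type of each column
data.select_dtypes(include='number').columns    # just the columns stored as numbers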

A very common technique to explore variables is to look at data visualizations

We are going to use a Python library called "seaborn" to help us do visualizations!

Seaborn

3.3.1 Add a new code cell and run the code
pip install matplotlib==3.1.0
This tells Jupyter to install a library we need for plotting i.e. "matplotlib"
3.3.2 Add a new code cell and run the code
import matplotlib.pyplot as plt
3.3.3 Add a code cell and run the code
pip install seaborn
This tells Jupyter to install the "seaborn" library
3.3.4 Add a new code cell and run the code
import seaborn as sns
This tells Jupyter that we want to use the seaborn library, and will refer to it using a shorter name "sns"
Great - let's analyze the categorical variables!
3.3.5 Add a new code cell and run the code
sns.catplot(x="Pclass", kind="count", palette="ch:.25", data=data);
This tells Jupyter to use seaborn (i.e sns) to create a "category plot", using the categorical variable "Pclass" for the x-axis, and to represent the count of the number of rows on the y-axis.


This type of visualization is called a Bar/Category plot and is used to analyze the distribution of categorical variables. The height of the bars displays how many observations occurred with each x value. Taller bars show the more common values of a variable, and shorter bars show less-common values.
3.3.6 Add a few new code cells and use the code snippet from above to create Category plots for the following variables:
  • Survived
  • Sex
  • Embarked
  • SibSp
  • Parch
You will need to change the name of the x-variable parameter for each plot (or use the loop sketched below).
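If you prefer, a short loop produces all five plots in one cell - a sketch that reuses the same catplot call (adjust the column names if your file spells them differently):
for column in ["Survived", "Sex", "Embarked", "SibSp", "Parch"]:
    sns.catplot(x=column, kind="count", palette="ch:.25", data=data)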

For each plot, add a markdown cell and make some comments about the variable
Next, let's analyze the numerical variables, i.e. Age and Fare.
3.3.7 Add a new code cell and run the code
sns.distplot(data['Age'], color='red', bins=20);
This tells Jupyter to use seaborn (i.e sns) to create a "distribution plot ", using the numerical variable "Age" for the x-axis, and to categorize the variable into 20 bins.



This type of chart is called a Distribution Plot or Histogram - we use this type of chart to analyze the distribution of numeric variables. A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. You can set the width of the intervals in a histogram. Taller bars show the more common values of a variable, and shorter bars show less-common values.

Add a new code cell and generate a distribution plot for Fare. To do this, replace data['Age'] with data['Fare'].
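The resulting cell should look something like this:
sns.distplot(data['Fare'], color='red', bins=20);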

Add markdown cells and make a comment for the distribution of each variable.
Step 3.4: Visualize relationships between variables

One of the easiest ways to analyze relationships between variables is to use a scatter plot:

3.4.1 Add a new code cell and run the code
sns.scatterplot(x="Age", y="Fare",
              data=data);
This tells Jupyter to use seaborn (i.e. sns) to create a "scatter plot", using the numerical variable "Age" for the x-axis and "Fare" for the y-axis.

Scatter Plot
  • A scatter plot, also known as a scatter graph or a scatter chart
  • It is typically used to display values for two variables of a set of data
  • It uses dots to represent the values obtained for two different variables - one plotted along the x-axis and the other plotted along the y-axis.
  • There are ways to add a third and fourth dimension to a scatter plot (e.g. the color of the dots)
Let's add a third variable to the scatter plot!
3.4.2 Add a new code cell and run the code
sns.scatterplot(x="Age", y="Fare", hue="Survived", data=data);
This tells Jupyter to use seaborn (i.e. sns) to create a "scatter plot", using the numerical variable "Age" for the x-axis, "Fare" for the y-axis, and to color the dots using the variable "Survived".



Add a new code cell, and generate a scatter plot using Age, Embarked and Survived
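One possible reading of that exercise - the exact chart is up to you - is to put Age on the x-axis, Embarked on the y-axis, and color the dots by Survived (note that the column may be named 'Embark' in some copies of the file):
sns.scatterplot(x="Age", y="Embarked", hue="Survived", data=data);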
Step 3.5: Quantify the relationship between variables

Next, we want to quantify the relationships between variables i.e. correlation:

Correlation

3.5.1 Add a new code cell and run the code
data.corr(numeric_only=True).round(2)
This tells Jupyter that we want to calculate the correlations for the numerical variables in our dataset, and display the results in a table/matrix.

Each cell in the table is a number between -1 and 1 that quantifies the relationship between the two variables.

The closer the number is to 1 or -1, the stronger the relationship. A positive value means the two variables tend to increase together, while a negative value means one tends to decrease as the other increases.
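To look at just one pair of variables rather than the whole matrix, you can correlate two columns directly - a small optional sketch:
data['Pclass'].corr(data['Fare'])   # a single correlation value between -1 and 1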

Step 3.6: Visualize the correlation matrix

One of the easiest ways to make sense of the correlation matrix is to visualize it as a heatmap.

Here is an example of a heatmap visualization of the correlation matrix from our Titanic dataset

3.6.1 Add a new cell and run the following code:
plt.figure(figsize=(19,9))
sns.heatmap(data.corr(numeric_only=True), annot=True, linewidths=0.5, cmap="coolwarm")

This code uses "sns" to generate a heatmap that visualizes the correlation matrix. The heatmap is annotated with the actual correlation value in each cell, uses the specified color map, and adds lines between cells to make it easier to read.

Step 3.7: Let's use Aggregations to analyze relationships
3.7.1 Add a new code cell and run the code
data.groupby('Pclass')['Survived'].mean()
This code aggregates the data by Passenger Class, and calculates the average of the survival rate for each category.

As you can see from running this code, people in passenger class 1 had a much higher chance of survival, compared to the passengers in the other classes.
3.7.2 Add a new code cell and run the code
data.groupby('Sex')['Survived'].mean()
This code aggregates the data by Sex, and calculates the average of the survival rate for each category.

As you can see from running this code, female passengers had a much higher chance of survival than male passengers - nearly 75% of female passengers survived.

Add a new code cell and calculate the survival rate by port of embarkation
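A sketch of one possible answer, assuming the port column is named 'Embark' as in the data dictionary above (it may be 'Embarked' in your copy of the file):
data.groupby('Embark')['Survived'].mean()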
3.7.3 Add a new code cell and run the code
data.groupby(['Sex', 'Pclass'])['Survived'].mean()
This code aggregates the data by Sex and Pclass, calculating the average value for the variable Survived, for each sub-group.

As you can see from running this code, female passengers in first class had the highest chance of survival, while male passengers in the lower classes had the lowest.

Before moving on to the next step:

Sweet! We just DID some cool data science stuff - let's quickly recap what we did:

To recap - here are the steps we performed in the EDA process:

Ok - on to the next step!

Step 4: Model the data

Predictive models use input data (also called predictors, variables or features) and statistics to predict outcomes (also called targets).

For our project, we want to create a model to predict whether a passenger will survive or not, given some information about the passenger.

This is a "Classification" problem, because we want to predict a 'class' (i.e. survived or not survived) for each of the passengers.

Here are some common examples of classification models:

Here are the basic steps we will follow today to build our Classifier

The remainder of this workshop focuses on building a Classification model using the Titanic dataset.

To get started, create a new notebook and give it a name!

Step 4.1: Load dataset using Pandas
4.1.1 Add a new code cell and run the code
import pandas as pd
data = pd.read_csv("titanic.csv")
data.head(10)
Step 4.2: Split the dataset using scikit-learn
4.2.1 Add a new code cell and run the code to install the library
pip install scikit-learn

For the modelling steps, including this one, we need to use a very popular Python library called Scikit-learn - a free software machine learning library for the Python programming language.

A key concept in predictive modeling using classification is that we split our data into two parts - training and testing.

The idea is to train the model on one portion of the data, and then test it on rows it has never seen - in our case, 70% of the rows for training and 30% for testing.



4.2.2 Next, add another code cell and run the following code to split the data
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.30)
This code creates two new data frames called train and test, from the original data frame called 'data' - test contains 30% of the overall data.
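As a quick optional check, you can confirm the sizes of the two new data frames in another cell:
print(len(train), len(test))   # roughly 70% and 30% of the original rows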
Step 4.3: Select features and target using Pandas
4.3.1 Add a new code cell and run the code below to select the input features
feature_columns = ['Pclass']
features = train[feature_columns]
features.head(5)
We start by selecting a single feature, i.e. Pclass, as the input to the model.
4.3.2 Add a new code cell and run the code below to set the target variable to 'Survived' (this is the outcome we want to learn to predict)
target = train['Survived'] 
target.head(5)
We have just created two objects from our train dataframe - one called "features", the other called "target".
Step 4.4: Select an algorithm
4.4.1 Add a new code cell and run the code
from sklearn.linear_model import SGDClassifier

model = SGDClassifier()

This code creates a Classifier model which we will "train" using our dataset
Step 4.5: Train the model
4.5.1 Add a new code cell and run the code
model.fit(features, target)
This one line of code trains a model with inputs set as the selected features and the output set to the target that the model will learn to predict.

Step 4.6: Score the model
4.6.1 Add a new code cell and run the code
score = model.score(features, target)

print(round(score * 100, 2))
This code will calculate a score for the accuracy of the model. Broadly speaking, a higher score is indicative of a more accurate model.
Step 4.7: Use the model to make predictions on the testing data
4.7.1 Add a new code cell and run the code
test_features = test[feature_columns] 
test_target = test['Survived']
test_features.head(5)
This code prepares the test data to feed into our model
Score the model using the test data
4.7.2 Add a new code cell and run the code
score = model.score(test_features, test_target)
print(round(score * 100, 2))
This line of code scores the accuracy of the model on the test data - as you can see, the score is different from (and often lower than) the accuracy of the model on the training data.

Why do you think this is the case?
Make predictions about survival using the test data and the trained model
4.7.3 Add a new code cell and run the code
predictions = model.predict(test_features)

predictions
This line of code makes predictions - as you can see, the values are 0 or 1
Step 4.8: Let's start to evaluate the results of our predictions!
4.8.1 Add a new code cell and run the code
test["Prediction"] = predictions
test.head(10)
This code puts the predictions we made on our test data together with the actual known Survival label. We use this output to see where our predictions are "wrong" (or where our classifier is confused).

4.8.2 Add a new code cell and run the code
test.count()
This code shows the count of the rows in our test data
4.8.3 Add a new code cell and run the code
test.query('Survived == Prediction').count() 
This code shows the count of the rows in test dataset where our prediction was correct
4.8.4 Add a new code cell and run the code
test.query('Survived != Prediction').count()
This code shows the count of the rows in test dataset where our prediction was wrong
4.8.5 Add a new code cell and run the code
test[(test.Survived == 1) & (test.Prediction == 0)].count()
This code shows the count of the rows in test dataset where our prediction was 0, and the correct label was 1 i.e. survived
4.8.6 Add a new code cell and run the code
test[(test.Survived == 0) & (test.Prediction == 1)].count()
This code shows the count of the rows in test dataset where our prediction was 1, and the correct label was 0 i.e. did not survive

These counts form the basis for more advanced methods and metrics for evaluating the results of a predictive model.
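For example, scikit-learn can compute these same counts and a summary accuracy for us - a small optional sketch using its built-in metrics:
from sklearn.metrics import confusion_matrix, accuracy_score

# Rows are the actual labels (0, 1) and columns are the predicted labels (0, 1)
print(confusion_matrix(test['Survived'], test['Prediction']))
print(accuracy_score(test['Survived'], test['Prediction']))   # the same idea as model.score()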

Step 4.9: Let's see if we can make our model better!

Ok! Now that we have a model and a baseline evaluation, the next step is to start improving/optimizing our model. This is where most of the data science work happens (after data cleaning). There are a variety of techniques that we can use:

Let's add a feature and score the new model
4.9.1 Add a new code cell and run the following lines of code
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.30)

feature_columns_2 = ['Pclass', 'Sex']

features_2 = train[feature_columns_2]
# Re-select the target from the new training split so the rows line up with the features
target_2 = train['Survived']

model_2 = SGDClassifier()
model_2.fit(features_2, target_2)

score_2 = model_2.score(features_2, target_2)
print(round(score_2 * 100, 2))
This code re-splits the data, selects 2 features and the matching target, fits a model to the training data, and scores the accuracy
Now, let's evaluate our model on the test data
4.9.2 Add a new code cell and run the code
test_features_2 = test[feature_columns_2]
test_target = test['Survived']

testscore_2 = model_2.score(test_features_2, test_target)
print(round(testscore_2 * 100, 2))
This code will evaluate the accuracy of our second model using the test data.
Great - let's add one more feature and run through the process again
4.9.3 Add a new code cell and run the following lines of code
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size=0.30)

feature_columns_3 = ['Pclass', 'Sex', 'Embark']

features_3 = train[feature_columns_3]
# Re-select the target from the new training split so the rows line up with the features
target_3 = train['Survived']

model_3 = SGDClassifier()
model_3.fit(features_3, target_3)

score_3 = model_3.score(features_3, target_3)
print(round(score_3 * 100, 2))
This code re-splits the data, selects 3 features and the matching target, fits a model to the training data, and scores the accuracy
Now, let's evaluate our model on the test data
4.9.4 Add a new code cell and run the code
test_features_3 = test[feature_columns_3]
test_target = test['Survived']

testscore_3 = model_3.score(test_features_3, test_target)
print(round(testscore_3 * 100, 2))
This code will evaluate the accuracy of our third model using the test data.

Recap - we just performed the basic steps in the process for creating a predictive model. In summary:

Before we move into the Q & A:

Feedback Form

We are always looking for ways to improve; it would help us tremendously if you could take a moment to fill out this short survey.

Learn More

There is a ton to learn about data and analytics, and this was only the first step!

Our upcoming Data Foundations course is a beginner-friendly course that teaches functional data literacy and foundational data science concepts. Data literacy means being able to understand, work with, and analyze data, and is becoming an increasingly in-demand skill across all industries. If having a better understanding of data is something that would help you in your day-to-day work, then this course is the perfect starting point.

Using the most popular tools and commonly used approaches, the course teaches students:

  • How to use Spreadsheets, SQL, R and Python
  • How to interact with databases, build dashboards and tell a story with data
  • How to perform typical steps in the data science workflow
The course includes a mini capstone project that students will be expected to complete as part of the curriculum.

Next Steps:

Course Dates

Thank you for joining us!