Welcome to Juno College
Welcome to a FREE Sneak Peek into our newly launched Continuing Education Data Foundations Course.
We’ve been teaching people to code since 2012, and we’ve mastered pairing the perfect students with the perfect material and learning environment. How do we know you’ll love Juno? Don't just take our word for it.
What we are going to cover
- Overview of the Data Foundations course
- Set Up
- Code Along: Part 1: Explore the data
- Break
- Code Along: Part 2: Make predictions
- Q & A
About the course
The Data Foundations course is a beginner-friendly course that teaches functional data literacy and foundational data science concepts.
Data literacy means being able to understand, work with, and analyze data, and is becoming an increasingly more in-demand skill across all industries. If having a better understanding of data is something that would help you in your day-to-day work, then this course is the perfect starting point.
Using the most popular tools and commonly used approaches, the course teaches students how to use Spreadsheets, SQL, R and Python to interact with data and perform typical steps in the data science workflow.
The course includes a mini capstone project that students will be expected to complete as part of the curriculum.
Set Up
We are going to use a free, demo version of a tool called Jupyter Notebook.
Jupyter Notebook
Notebooks have rapidly grown in popularity among data scientists to become the de facto standard for quick prototyping and exploratory analysis.
Notebooks allow you to create and share documents that contain live code, equations, visualizations and narrative text.
Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Get Started : Jupyter
1. Go to https://jupyter.org/try and click on "Try Jupyter Lab". Wait for it to load.
2. Close the two tabs "Lorenz" and "Reference"
3. Click the New Folder icon to create a new folder and give it a name
4. Click the Plus icon to create a new notebook (in your folder) and give it a name
In today's workshop, we will use a language called Markdown.
Markdown is a lightweight markup language with plain text formatting syntax. Markdown is often used to format readme files, and to create rich text using a plain text editor.
Get Started : Markdown
1. Open the notebook you created and change the type of the cell from code to markdown using the drop-down menu
2. Create a heading using this markdown code # This is a heading
3. Create a markdown sub-heading using the code ## This is a h2
4. Create a markdown list using the code * This is an item in a list
Python
In today's workshop, we are going to use Python to write code to explore, visualize and model the data.
- Python is a general purpose programming language
- It is open-source, easy to understand and powerful
- There are a number of dedicated data and analytical libraries freely available
- Python is the leading industry standard in the field of data science
- The Python programming language was created in the late 1980s and was named after the BBC TV show Monty Python’s Flying Circus
We will not be learning the Python language itself; rather, we will learn how to use the power of Python libraries to accomplish our data science steps for today!
Data Science
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data (source: Wikipedia).
Here is one representation of a typical data science workflow that is a great way to start learning about data! (source)
In today's workshop, we will spend most of our time on step 3 and step 4. In the course itself, we will work through each step using a dataset of your choice and the tools we learn.
As you start learning about the field of data science, it's important to note that data science is an iterative process.
Step 1: Ask an interesting question
For today's workshop, we will be working with data from the legendary Titanic shipwreck. The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
To start us off, here are two questions we are interested in:
What sorts of people were aboard the Titanic?
Were some groups of people more likely to survive the shipwreck than others?
Step 2: Get the data
We are going to use a publicly available dataset that contains data on a subset of passengers aboard the Titanic.
2.1 Download the Titanic dataset here
2.2. Upload the data file (titanic.csv) to your folder in Jupyter Notebook
2.3 From the left pane in Jupyter, double click the data file to get a quick view of the dataset
Step 3: Explore the data
The next step is to explore the data - this process is formally called exploratory data analysis (or EDA).
The goal of EDA is to develop an understanding of your dataset and start to answer the questions we asked in step 1!
Here are some of the types of questions that are commonly used in EDA:
- What is the structure of the data:
- How many rows?
- How many columns?
- What are the data types?
- What does each row of data represent?
- What is the quality of the data and how many missing values are there?
- How does each variable vary across the passengers?
- What are the relationships between variables?
- What is the data lineage (source, processing and potential sources of bias)?
Step 3.1: Import data using Pandas
Pandas
- Pandas is a software library written for the Python programming language for data manipulation and analysis
- Pandas takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a data frame
- We will be using a Python tool called "pip" to install these libraries in our Jupyter Notebook
3.1.1 Go to your Notebook, add a code cell and run the code pip install pandas
This tells Jupyter to use pip to install the Pandas library.
3.1.2 Add a new code cell and run the code import pandas as pd
This tells Jupyter that you want to use the Pandas library, and refer to it using a shorter name 'pd'
3.1.3 Add a new code cell and run the code data = pd.read_csv("titanic.csv")
This line of code says to use the Pandas library ('pd') to read the .csv data file and assign it to a Pandas data frame object called 'data'
3.1.4 Add a new code cell and run the code data.head(10)
This line of code says to display or print the first 10 rows of the data frame
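Putting steps 3.1.1 to 3.1.4 together, a single code cell could look like this (a minimal sketch, assuming titanic.csv has been uploaded to the same folder as your notebook):
# Load the Titanic data into a Pandas data frame and preview it
import pandas as pd
data = pd.read_csv("titanic.csv")   # read the CSV file into a data frame called 'data'
data.head(10)                       # display the first 10 rows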
Step 3.2: Profile the data using Pandas
3.2.1 Add a code cell and run the code data.columns
This tells Pandas to give us the names of the columns
From the data source, we have the following notes about the columns:
- Survived ... 1 is survived, 0 means did not survive
- Sex ... 0 for male, 1 for female
- Pclass ... is 'passenger class'
- SibSp ... is the number of siblings and spouses that the passenger was travelling with
- Parch ... is the number of parents / children that the passenger was travelling with
- Embark ... is the port of embarkation
Let's add this data dictionary to a markdown cell in our notebook.
3.2.2 Add a new code cell and run the code data.info()
This tells Pandas to give us information about the columns - you can see the data types and how many rows have data
3.2.3 Add a new code cell and run the code data['Age'].max()
This code calculates the maximum value of the Age column for our dataset.
Add a new code cell, change the name of the variable to 'Fare', and change the function to min(). This will return the minimum value of Fare
3.2.4 Add a new code cell and run the code data.mean().round(2)
This code calculates the mean value for each numeric column - the mean, or average, of a set of numbers is the sum of the values divided by the number of values.
3.2.5 Add a new code cell and run the code data.describe().round(2)
This code generates basic descriptive statistics about the numeric columns in the dataset. The statistics that get generated include - minimum value, maximum value, mean, standard deviation and some others.
In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range
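For reference, the profiling steps above can also be combined into a single cell (a minimal sketch; the numeric_only argument is an assumption for newer versions of Pandas, which no longer skip text columns automatically when averaging):
# Profile the data frame: structure, extremes, averages and summary statistics
data.info()                                    # column names, data types and non-null counts
print(data['Age'].max())                       # the oldest passenger in the dataset
print(data['Fare'].min())                      # the lowest fare paid
print(data.mean(numeric_only=True).round(2))   # mean of each numeric column
data.describe().round(2)                       # min, max, mean, standard deviation and more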
Step 3.3: Analyze each variable
One of the key aspects of EDA is to inspect each of the variables in our dataset. Broadly speaking, there are two types of variables:
- Categorical variables ... are those that can take on one of several discrete values, are not numbers and have no ordering
- Numerical variables ... are those that can be quantified by a number and have a meaningful ordering
A very common technique to explore variables is to look at data visualizations
We are going to use a Python library called "seaborn" to help us do visualizations!
Seaborn
- Seaborn is a Python data visualization library
- It provides a high-level interface for drawing attractive and informative statistical graphics
3.3.1 Add a new code cell and run the code pip install matplotlib==3.1.0;
This tells Jupyter to install a library we need for plotting i.e. "matplotlib"
3.3.2 Add a new code cell and run the code import matplotlib.pyplot as plt
This tells Jupyter that we want to use the matplotlib plotting module, and refer to it using a shorter name 'plt'
3.3.3 Add a code cell and run the code pip install seaborn
This tells Jupyter to install the "seaborn" library
3.3.4 Add a new code cell and run the code import seaborn as sns
This tells Jupyter that we want to use the seaborn library, and will refer to it using a shorter name "sns"
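Once the installs from steps 3.3.1 and 3.3.3 have finished, the two imports can live together in one cell (a minimal sketch):
# Import the plotting libraries under their conventional short names
import matplotlib.pyplot as plt   # the plotting library that seaborn builds on
import seaborn as sns             # high-level statistical visualization library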
Great - let's analyze the categorical variables
3.3.5 Add a new code cell and run the code sns.catplot(x="Pclass", kind="count", palette="ch:.25", data=data);
This tells Jupyter to use seaborn (i.e sns) to create a "category plot", using the categorical variable "Pclass" for the x-axis, and to represent the count of the number of rows on the y-axis.
This type of visualization is called a Bar/Category plot and is used to analyze the distribution of categorical variables. The height of the bars displays how many observations occurred with each x value. Taller bars show the more common values of a variable, and shorter bars show less-common values.
3.3.6 Add a few new code cells and use the code snippet from above, to create Category plots for the following variables -
- Survived
- Sex
- Embarked
- SibSp
- Parch
You will need to change the name of the x-variable parameter.
For each plot, add a markdown cell and make some comments about the variable
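Rather than copying the cell once per variable, you can also loop over the column names in a single cell (a minimal sketch; the list below assumes the port column is named 'Embarked' in your CSV file):
# Draw a count plot for each categorical variable
for column in ['Survived', 'Sex', 'Embarked', 'SibSp', 'Parch']:
    sns.catplot(x=column, kind="count", palette="ch:.25", data=data)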
Next, let's analyze the numerical variables, i.e. Age and Fare
3.3.7 Add a new code cell and run the code sns.distplot(data['Age'], color='red', bins=20);
This tells Jupyter to use seaborn (i.e sns) to create a "distribution plot ", using the numerical variable "Age" for the x-axis, and to categorize the variable into 20 bins.
This type of chart is called a Distribution Plot or Histogram - we use this type of chart to analyze the distribution of numeric variables. A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. You can set the width of the intervals in a histogram. Taller bars show the more common values of a variable, and shorter bars show less-common values.
Add a new code cell and generate a distribution plot for Fare. To do this, replace data['Age'] with data['Fare'].
Add markdown cells and make a comment for the distribution of each variable.
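For reference, the Fare plot looks like this (a minimal sketch; note that recent versions of seaborn have replaced distplot with histplot, so you may need to swap the function name):
# Histogram of ticket fares, grouped into 20 bins
sns.distplot(data['Fare'], color='red', bins=20);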
Step 3.4: Visualize relationships between variables
One of the easiest ways to analyze relationships between variables is to use a scatter plot:
3.4.1 Add a new code cell and run the code sns.scatterplot(x="Age", y="Fare", data=data);
This tells Jupyter to use seaborn (i.e sns) to create a "scatter plot", using the numerical variable "Age" for the x-axis, "Fare" for the y axis.
Scatter Plot
- A scatter plot, also known as a scatter graph or a scatter chart
- It is typically used to display values for two variables of a set of data
- It uses dots to represent the values obtained for two different variables - one plotted along the x-axis and the other plotted along the y-axis.
- There are ways to add a third and fourth dimension to a scatter plot (e.g. the color of the dot)
Let's add a third variable to the scatter plot!
3.4.2 Add a new code cell and run the code sns.scatterplot(x="Age", y="Fare", hue="Survived", data=data);
This tells Jupyter to use seaborn (i.e sns) to create a "distribution plot ", using the numerical variable "Age" for the x-axis, "Fare" for the y-axis and color the dots using the variable "Survived"
Add a new code cell, and generate a scatter plot using Age, Embarked and Survived
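One possible answer to the exercise is below (a minimal sketch; it assumes the port column is named 'Embarked', as in the category plot exercise - adjust the name if your file differs):
# Age on the x-axis, port of embarkation on the y-axis, dots colored by survival
sns.scatterplot(x="Age", y="Embarked", hue="Survived", data=data);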
Step 3.5: Quantify the relationship between variables
Next, we want to quantify the relationships between variables, i.e. correlation:
Correlation
- Correlation measures both the strength and direction of the relationship between two variables.
- The values of the correlation coefficient can range from -1 to +1 ... the closer it is to +1 or -1, the more closely the two variables are related.
- The sign signifies the direction of the correlation: a positive correlation means that as one variable increases, the other tends to increase as well.
- You can only correlate numerical features
3.5.1 Add a new code cell and run the code data.corr().round(2)
This tells Jupyter that we want to calculate the correlations for the numerical variables in our dataset, and display the results in a table/matrix.
Each cell in the table is a number between 1 and -1 that quantifies the relationship between the two variables.
The closer the number to 1 or -1, the stronger the relationship. Negative values imply a negative correlation and vice-versa for positive correlation values.
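If you are running a recent version of Pandas, corr() no longer skips text columns silently, so you may need to ask for the numeric columns explicitly (a minimal sketch; the numeric_only argument is an assumption for newer Pandas versions):
# Correlation matrix for the numeric columns only, rounded to 2 decimal places
data.corr(numeric_only=True).round(2)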
Step 3.6: Visualize the correlation matrix
One of the easiest ways to interpret the correlation matrix is to visualize it as a heatmap:
- A correlation matrix is a table showing correlation coefficients between variables.
- Each cell in the table shows the correlation between two variables.
- A correlation matrix can be visualized using a heatmap
- A heatmap is a representation of data in the form of a map or diagram in which data values are represented as colors.
- Heatmaps are visually appealing and make it quick and easy to make inferences about correlations
Here is an example of a heatmap visualization of the correlation matrix from our Titanic dataset
3.6.1 Add a new cell and run the following code:
plt.figure(figsize=(19,9))
sns.heatmap(data.corr(), annot=True, linewidth=0.5, cmap="coolwarm")
This code uses "sns" to generate a heatmap that visualizes the correlation matrix. The heatmap is annotated with the actual correlation value in each cell, uses the specified color map, and adds lines between cells to make it easier to read.
Step 3.7: Let's use Aggregations to analyze relationships
3.7.1 Add a new code cell and run the code data.groupby('Pclass')['Survived'].mean()
This code aggregates the data by Passenger Class, and calculates the average of the survival rate for each category.
As you can see from running this code, people in passenger class 1 had a much higher chance of survival, compared to the passengers in the other classes.
3.7.2 Add a new code cell and run the code data.groupby('Sex')['Survived'].mean()
This code aggregates the data by Sex, and calculates the average of the survival rate for each category.
As you can see from running this code, female passengers had a much higher chance of survival than male passengers - nearly 75% of female passengers survived.
Add a new code cell and calculate the survival rate by port of embarkation
3.7.3 Add a new code cell and run the code data.groupby(['Sex', 'Pclass'])['Survived'].mean()
This code aggregates the data by Sex and Pclass, calculating the average value for the variable Survived, for each sub-group.
As you can see from running this code, female passengers in first class had the highest chance of survival, while male passengers in the lower classes had the lowest chance of survival.
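A small readability trick for the two-level grouping is to pivot one of the grouping variables into columns (a minimal, optional sketch using unstack):
# Survival rate by Sex (rows) and Pclass (columns)
data.groupby(['Sex', 'Pclass'])['Survived'].mean().unstack().round(2)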
Before moving on to the next step:
- Save your notebook
- In the left pane, right click your notebook and download it as a .ipynb file
- Export your notebook (Go to File > Export > HTML)
- Open the downloaded HTML notebook in your browser!
- Get the completed EDA notebook here (Right click > Save Link As .ipynb)
Sweet! We just DID some cool data science stuff - let's quickly recap what we did:
- Jupyter Notebook - tool of choice for data scientists
- Markdown - questions, observations, narrative for your work
- Titanic dataset - great learning example
- EDA - data profile, analyze variables and relationships
- Python - Pandas and Seaborn libraries
To recap - here are the steps we performed in the EDA process:
- 3.1 Import the data
- 3.2 Profile the data
- 3.3 Analyze each variable
- 3.4 Analyze relationships between variables
- 3.5 Quantify correlation between variables
- 3.6 Visualize correlations using a heatmap
- 3.7 Use aggregations to analyze relationships
Ok - on to the next step!
Step 4: Model the data
Predictive models use input data (also called predictors, variables or features) and statistics to predict outcomes (also called targets).
For our project, we want to create a model to predict whether a passenger will survive or not, given some information about the passenger.
This is a "Classification" problem, because we want to predict a 'class' (i.e. survived or not survived) for each of the passengers.
Here are some common examples of classification models:
- To determine whether an email is spam or "ham" (non-spam)
- To anticipate visitor behavior to a website - eg. will-buy, window-shopping
- To predict whether a credit card transaction is fraudulent
- To predict if a passenger aboard the Titanic will survive or not
Here are the basic steps we will follow today to build our Classifier
- 4.1. Load the dataset
- 4.2. Split the dataset into training and testing datasets
- 4.3. Select feature variables and specify the target variable
- 4.4. Select an algorithm / model to use
- 4.5. Fit the model using the training data and algorithm
- 4.6. Score the accuracy of the model
- 4.7. Make predictions using the test data
- 4.8. Evaluate the predictions of the model on the test data
- 4.9. Iterate and improve
The remainder of this workshop focuses on building a Classification model using the Titanic dataset.
To get started, create a new notebook and give it a name!
Step 4.1: Load dataset using Pandas
4.1.1 Add a new code cell and run the code
import pandas as pd
data = pd.read_csv("titanic.csv")
data.head(10)
Step 4.2: Split the dataset using scikit-learn
4.2.1 Add a new code cell and run the code to install the library
pip install scikit-learn
For the modelling steps, including this one, we need to use a very popular Python library called Scikit-learn - a free, open-source machine learning library for the Python programming language
A key concept in predictive modeling using classification is that we split our data into two parts - training and testing
- The training data is used to train a model
- The testing data is used to evaluate (or score the accuracy) of the model
The image below illustrates this concept:
4.2.2 Next, add another code cell and run the following code to split the data
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.30)
This code creates two new data frames called train and test, from the original data frame called 'data' - test contains 30% of the overall data.
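Because the split is random, each run of the cell produces a slightly different train/test mix, and therefore slightly different scores later on. If you want a repeatable split, you can optionally pass a random_state (a minimal sketch; the value 42 is arbitrary):
# Reproducible 70/30 split - the same rows land in train and test on every run
train, test = train_test_split(data, test_size=0.30, random_state=42)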
Step 4.3: Select features and target using Pandas
4.3.1 Add a new code cell and run the code below to select the input features
feature_columns = ['Pclass']
features = train[feature_columns]
features.head(5)
We start by selecting one feature, i.e. Pclass, as the input to the model
4.3.2 Add a new code cell and run the code below to set the target variable to 'Survived' (this is the outcome we want to learn to predict)
target = train['Survived']
target.head(5)
We have just created two new objects from our train data frame - one called "features" and the other called "target"
Step 4.4: Select an algorithm
4.4.1 Add a new code cell and run the code
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
This code creates a Classifier model which we will "train" using our dataset
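SGDClassifier is itself stochastic, so two training runs on the same data can produce slightly different models and scores. If you want repeatable results, you can optionally fix its random_state as well (a minimal sketch; the value is arbitrary):
# Fix the random seed so repeated training runs give the same model
model = SGDClassifier(random_state=42)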
Step 4.5: Train the model
4.5.1 Add a new code cell and run the code model.fit(features, target)
This one line of code trains a model with inputs set as the selected features and the output set to the target that the model will learn to predict.
Step 4.6: Score the model
4.6.1 Add a new code cell and run the code score = model.score(features, target)
print(round(score * 100, 2))
This code will calculate a score for the accuracy of the model. Broadly speaking, a higher score indicates a more accurate model.
Step 4.7: Use the model to make predictions on the testing data
4.7.1 Add a new code cell and run the code test_features = test[feature_columns]
test_target = test['Survived']
test_features.head(5)
This code prepares the test data to feed into our model
Score the model using the test data
4.7.2 Add a new code cell and run the code score = model.score(test_features, test_target)
print(round(score * 100, 2))
This line of code scores the accuracy of the model on the test data - as you can see, the score is different from (and typically lower than) the accuracy of the model on the training data.
Why do you think this is the case?
Make predictions about survival using the test data and the trained model
4.7.3 Add a new code cell and run the code predictions = model.predict(test_features)
predictions
This line of code makes predictions - as you can see, the values are 0 or 1
Step 4.8: Let's start to evaluate the results of our predictions!
4.8.1 Add a new code cell and run the code test["Prediction"] = predictions
test.head(10)
This code puts the predictions we made on our test data next to the actual known Survived label. We use this output to see where our predictions are "wrong" (or where our classifier is confused).
4.8.2 Add a new code cell and run the code test.count()
This code shows the count of the rows in our test data
4.8.3 Add a new code cell and run the code test.query('Survived == Prediction').count()
This code shows the count of the rows in the test dataset where our prediction was correct
4.8.4 Add a new code cell and run the code test.query('Survived != Prediction').count()
This code shows the count of the rows in the test dataset where our prediction was wrong
4.8.5 Add a new code cell and run the code test[(test.Survived == 1) & (test.Prediction == 0)].count()
This code shows the count of the rows in the test dataset where our prediction was 0, but the correct label was 1, i.e. the passenger survived
4.8.6 Add a new code cell and run the code test[(test.Survived == 0) & (test.Prediction == 1)].count()
This code shows the count of the rows in the test dataset where our prediction was 1, but the correct label was 0, i.e. the passenger did not survive
These counts form the basis for more advanced methods and metrics for evaluating the results of a predictive model.
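In fact, these four counts are exactly what a confusion matrix summarizes. A quick, optional way to see them all at once is a Pandas cross-tabulation (a minimal sketch):
# Rows are the true Survived labels, columns are the model's predictions
pd.crosstab(test['Survived'], test['Prediction'])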
Step 4.9: Let's see if we can make our model better!
Ok! Now that we have a model and a baseline evaluation, the next step is to start improving/optimizing our model. This is where most of the data science work happens (after data cleaning). There are a variety of techniques that we can use:
- Add features to use when training the model
- Remove features that don't add any predictive power to our model
- Tune the algorithm using the available parameters
- Try a different algorithm
- Create composite features or add new features from other data sources
4.9.1 Add a new code cell and run the following lines of code
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.30)
feature_columns_2 = ['Pclass', 'Sex']
features_2 = train[feature_columns_2]
target = train['Survived']  # re-select the target from the new training split
model_2 = SGDClassifier()
model_2.fit(features_2, target)
score_2 = model_2.score(features_2, target)
print(round(score_2 * 100, 2))
This code selects 2 features, fits a model to the training data and scores the accuracy
Now, lets evaluate our model on the test data
4.9.2 Add a new code cell and run the code test_features_2 = test[feature_columns_2]
test_target = test['Survived']
testscore_2 = model_2.score(test_features_2, test_target)
print(round(testscore_2 * 100, 2))
This code will evaluate the accuracy of our second model using the test data.
Great - let's try adding one more feature and run through the process again
4.9.3 Add a new code cell and run the following lines of code
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.30)
feature_columns_3 = ['Pclass', 'Sex', 'Embark']
features_3 = train[feature_columns_3]
target = train['Survived']  # re-select the target from the new training split
model_3 = SGDClassifier()
model_3.fit(features_3, target)
score_3 = model_3.score(features_3, target)
print(round(score_3 * 100, 2))
This code selects 3 features, fits a model to the training data and scores the accuracy
Now, lets evaluate our model on the test data
4.9.4 Add a new code cell and run the code test_features_3 = test[feature_columns_3]
test_target = test['Survived']
testscore_3 = model_3.score(test_features_3, test_target)
print(round(testscore_3 * 100, 2))
This code will evaluate the accuracy of our third model using the test data.
Recap - we just performed the basic steps in the process for creating a predictive model:
- 4.1. Load the dataset
- 4.2. Split the dataset into training and testing datasets
- 4.3. Select predictors/features and outcome/target
- 4.4. Select an algorithm based on the problem and data
- 4.5. Train the model using the training data and algorithm
- 4.6. Score the accuracy of the model
- 4.7. Make predictions
- 4.8. Evaluate the predictions of the model using the test data
- 4.9. Iterate and improve
- Predictive models use input data (also called predictors, variables or features) and statistics to predict outcomes (also called targets).
- A predictive model is able to learn how different points of data connect with each other, and use these learned relationships to make a prediction about an outcome.
- Two of the oldest and most widely studied and used predictive modeling techniques are called Regression and its close relative Classification. Broadly speaking:
- Regression refers to predicting a numeric quantity (like temperature, price, income)
- Classification refers to predicting a label or category, like "spam", or "survived", or "cat"
Before we move into the Q & A:
- Save your Notebook
- Download it as a .ipynb file (i.e. an IPython Notebook) - this file format can be opened as a notebook in Jupyter
- Export it as a .HTML file - you can open this in a browser!
- Download the completed Notebook from today's workshop here
Feedback Form
We are always looking for ways to improve, and it would help us tremendously if you could take a moment to fill out this short survey.
Learn More
There is a ton to learn about data and analytics, and this was only the first step!
Our upcoming Data Foundations course is a beginner-friendly course that teaches functional data literacy and foundational data science concepts. Data literacy means being able to understand, work with, and analyze data, and is becoming an increasingly more in-demand skill across all industries. If having a better understanding of data is something that would help you in your day-to-day work, then this course is the perfect starting point.
Using the most popular tools and commonly used approaches, the course teaches students:
- How to use Spreadsheets, SQL, R and Python
- How to interact with databases, build dashboards and tell a story with data
- How to perform typical steps in the data science workflow
Next Steps:
Course Dates