Welcome to Juno College
Welcome to a FREE Sneak Peek into our newly launched Continuing Education Data Foundations Course.
We’ve been teaching people to code since 2012, and we’ve mastered pairing the perfect students with the perfect material and learning environment. How do we know you’ll love Juno? Don't just take our word for it.
What we are going to cover
- Overview of the Data Foundations course
- Set Up
- Code Along: Part 1: Explore the data
- Break
- Code Along: Part 2: Make predictions
- Q & A
About the course
The Data Foundations course is a beginner-friendly course that teaches functional data literacy and foundational data science concepts.
Data literacy means being able to understand, work with, and analyze data, and is becoming an increasingly more in-demand skill across all industries. If having a better understanding of data is something that would help you in your day-to-day work, then this course is the perfect starting point.
Using the most popular tools and commonly used approaches, the course teaches students how to use Spreadsheets, SQL, R and Python to interact with data and perform typical steps in the data science workflow.
The course includes a mini capstone project that students will be expected to complete as part of the curriculum.
Set Up
We are going to use a free, demo version of a tool called Jupyter Notebook.
Jupyter Notebook
Notebooks have rapidly grown in popularity among data scientists to become the de facto standard for quick prototyping and exploratory analysis.
Notebooks allow you to create and share documents that contain live code, equations, visualizations and narrative text.
Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more.
Get Started : Jupyter
1. Go to https://jupyter.org/try and click on "Try Jupyter Lab". Wait for it to load.
2. Close the two tabs "Lorenz" and "Reference"
3. Click the New Folder icon to create a new folder and give it a name
4. Click the Plus icon to create a new notebook (in your folder) and give it a name
In today's workshop, we will use a language called Markdown.
Markdown is a lightweight markup language with plain text formatting syntax. Markdown is often used to format readme files, and to create rich text using a plain text editor.
Get Started : Markdown
1. Open the notebook you created and change the type of the cell from code to markdown using the drop-down menu
2. Create a heading using this markdown code # This is a heading
3. Create a markdown sub-heading using the code ## This is a h2
4. Create a markdown list using the code * This is an item in a list
Python
In today's workshop, we are going to use Python to write code to explore, visualize and model the data.
- Python is a general purpose programming language
- It is open-source, easy to understand and powerful
- There are a number of dedicated data and analytical libraries freely available
- Python is the leading industry standard in the field of data science
- The Python programming language was created in the late 1980s and was named after the BBC TV show Monty Python’s Flying Circus
We will not be learning the Python language itself; rather, we will learn how to use the power of Python libraries to accomplish our data science steps for today!
Data Science
Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data (source: Wikipedia).
Here is one representation of a typical data science workflow that is a great way to start learning about data! (source)
In today's workshop, we will spend most of our time on step 3 and step 4. In the course itself, we will work through each step using a dataset of your choice and the tools we learn.
As you start learning about the field of data science, it's important to note that data science is an iterative process.
Step 1: Ask an interesting question
For today's workshop, we will be working with data from the legendary Titanic shipwreck. The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.
To start us off, here are two questions we are interested in:
What sorts of people were aboard the Titanic?
Were some groups of people more likely to survive the shipwreck than others?
Step 2: Get the data
We are going to use a publicly available dataset that contains data on a subset of passengers aboard the Titanic.
2.1 Download the Titanic dataset here
2.2. Upload the data file (titanic.csv) to your folder in Jupyter Notebook
2.3 From the left pane in Jupyter, double click the data file to get a quick view of the dataset
Step 3: Explore the data
The next step is to explore the data - this process is formally called exploratory data analysis (or EDA).
The goal of EDA is to develop an understanding of your dataset and start to answer the questions we asked in step 1!
Here are some of the types of questions that are commonly used in EDA:
- What is the structure of the data:
- How many rows?
- How many columns?
- What are the data types?
- What does each row of data represent?
- What is the quality of the data and how many missing values are there?
- How does each variable vary across the passengers?
- What are the relationships between variables?
- What is the data lineage (source, processing and potential sources of bias)?
Step 3.1: Import data using Pandas
Pandas
- Pandas is a software library written for the Python programming language for data manipulation and analysis
- Pandas takes data (like a CSV or TSV file, or a SQL database) and creates a Python object with rows and columns called a data frame
- We will be using a Python tool called "pip" to install these libraries in our Jupyter Notebook
3.1.1 Go to your Notebook, add a code cell and run the code pip install pandas
This tells Jupyter to use pip to install the Pandas library.
3.1.2 Add a new code cell and run the code import pandas as pd
This tells Jupyter that you want to use the Pandas library, and refer to it using a shorter name 'pd'
3.1.3 Add a new code cell and run the code data = pd.read_csv("titanic.csv")
This line of code says to use the Pandas library ('pd') to read the .csv data file and assign it to a Pandas data frame object called 'data'
3.1.4 Add a new code cell and run the code data.head(10)
This line of code says to display or print the first 10 rows of the data frame
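Putting steps 3.1.1 to 3.1.4 together, a single code cell could look like this (a minimal sketch, assuming titanic.csv has been uploaded to the same folder as your notebook):
# Load the Titanic data into a Pandas data frame and preview it
import pandas as pd
data = pd.read_csv("titanic.csv")   # read the CSV file into a data frame called 'data'
data.head(10)                       # display the first 10 rows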
Step 3.2: Profile the data using Pandas
3.2.1 Add a code cell and run the code data.columns
This tells Pandas to give us the names of the columns
From the data source, we have the following notes about the columns:
- Survived ... 1 is survived, 0 means did not survive
- Sex ... 0 for male, 1 for female
- Pclass ... is 'passenger class'
- SibSp ... is the number of siblings and spouses that the passenger was travelling with
- Parch ... is the number of parents / children that the passenger was travelling with
- Embark ... is the port of embarkation
Let's add this data dictionary to a markdown cell in our notebook.
3.2.2 Add a new code cell and run the code data.info()
This tells Pandas to give us information about the columns - you can see the data types and how many rows have data
3.2.3 Add a new code cell and run the code data['Age'].max()
This code calculates the maximum value of the Age column for our dataset.
Add a new code cell, change the name of the variable to 'Fare', and change the function to min(). This will return the minimum value of Fare
3.2.4 Add a new code cell and run the code data.mean().round(2)
This code calculates the mean value for each numeric column - the mean, or average, of a set of numbers is the sum of the values divided by the number of values.
3.2.5 Add a new code cell and run the code data.describe().round(2)
This code generates basic descriptive statistics about the numeric columns in the dataset. The statistics that get generated include - minimum value, maximum value, mean, standard deviation and some others.
In statistics, the standard deviation is a measure of the amount of variation or dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the mean of the set, while a high standard deviation indicates that the values are spread out over a wider range
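For reference, the profiling steps above can also be combined into a single cell (a minimal sketch; the numeric_only argument is an assumption for newer versions of Pandas, which no longer skip text columns automatically when averaging):
# Profile the data frame: structure, extremes, averages and summary statistics
data.info()                                    # column names, data types and non-null counts
print(data['Age'].max())                       # the oldest passenger in the dataset
print(data['Fare'].min())                      # the lowest fare paid
print(data.mean(numeric_only=True).round(2))   # mean of each numeric column
data.describe().round(2)                       # min, max, mean, standard deviation and more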
Step 3.3: Analyze each variable
One of the key aspects of EDA is to inspect each of the variables in our dataset. Broadly speaking, there are two types of variables:
- Categorical variables ... are those that can take on one of several discrete values, are not numbers and have no ordering
- Numerical variables ... are those that can be quantified by a number and have a meaningful ordering
A very common technique to explore variables is to look at data visualizations
We are going to use a Python library called "seaborn" to help us do visualizations!
Seaborn
- Seaborn is a Python data visualization library
- It provides a high-level interface for drawing attractive and informative statistical graphics
3.3.1 Add a new code cell and run the code pip install matplotlib==3.1.0;
This tells Jupyter to install a library we need for plotting i.e. "matplotlib"
3.3.2 Add a new code cell and run the code import matplotlib.pyplot as plt
This tells Jupyter that we want to use the matplotlib plotting module, and refer to it using a shorter name 'plt'
3.3.3 Add a code cell and run the code pip install seaborn
This tells Jupyter to install the "seaborn" library
3.3.4 Add a new code cell and run the code import seaborn as sns
This tells Jupyter that we want to use the seaborn library, and will refer to it using a shorter name "sns"
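Once the installs from steps 3.3.1 and 3.3.3 have finished, the two imports can live together in one cell (a minimal sketch):
# Import the plotting libraries under their conventional short names
import matplotlib.pyplot as plt   # the plotting library that seaborn builds on
import seaborn as sns             # high-level statistical visualization library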
Great - let's analyze the categorical variables
3.3.5 Add a new code cell and run the code sns.catplot(x="Pclass", kind="count", palette="ch:.25", data=data);
This tells Jupyter to use seaborn (i.e sns) to create a "category plot", using the categorical variable "Pclass" for the x-axis, and to represent the count of the number of rows on the y-axis.
This type of visualization is called a Bar/Category plot and is used to analyze the distribution of categorical variables. The height of the bars displays how many observations occurred with each x value. Taller bars show the more common values of a variable, and shorter bars show less-common values.
3.3.6 Add a few new code cells and use the code snippet from above, to create Category plots for the following variables -
- Survived
- Sex
- Embarked
- SibSp
- Parch
You will need to change the name of the x-variable parameter.
For each plot, add a markdown cell and make some comments about the variable
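Rather than copying the cell once per variable, you can also loop over the column names in a single cell (a minimal sketch; the list below assumes the port column is named 'Embarked' in your CSV file):
# Draw a count plot for each categorical variable
for column in ['Survived', 'Sex', 'Embarked', 'SibSp', 'Parch']:
    sns.catplot(x=column, kind="count", palette="ch:.25", data=data)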
Next, let's analyze the numerical variables, i.e. Age and Fare
3.3.7 Add a new code cell and run the code sns.distplot(data['Age'], color='red', bins=20);
This tells Jupyter to use seaborn (i.e sns) to create a "distribution plot ", using the numerical variable "Age" for the x-axis, and to categorize the variable into 20 bins.
This type of chart is called a Distribution Plot or Histogram - we use this type of chart to analyze the distribution of numeric variables. A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. You can set the width of the intervals in a histogram. Taller bars show the more common values of a variable, and shorter bars show less-common values.
Add a new code cell and generate a distribution plot for Fare. To do this, replace data['Age'] with data['Fare'].
Add markdown cells and make a comment for the distribution of each variable.
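For reference, the Fare plot looks like this (a minimal sketch; note that recent versions of seaborn have replaced distplot with histplot, so you may need to swap the function name):
# Histogram of ticket fares, grouped into 20 bins
sns.distplot(data['Fare'], color='red', bins=20);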
Step 3.4: Visualize relationships between variables
One of the easiest ways to analyze relationships between variables is to use a scatter plot:
3.4.1 Add a new code cell and run the code sns.scatterplot(x="Age", y="Fare", data=data);
This tells Jupyter to use seaborn (i.e sns) to create a "scatter plot", using the numerical variable "Age" for the x-axis, "Fare" for the y axis.
Scatter Plot
- A scatter plot, also known as a scatter graph or a scatter chart
- It is typically used to display values for two variables of a set of data
- It uses dots to represent the values obtained for two different variables - one plotted along the x-axis and the other plotted along the y-axis.
- There are ways to add a third and fourth dimension to a scatter plot (e.g. the color of the dot)
Let's add a third variable to the scatter plot!
3.4.2 Add a new code cell and run the code sns.scatterplot(x="Age", y="Fare", hue="Survived", data=data);
This tells Jupyter to use seaborn (i.e sns) to create a "distribution plot ", using the numerical variable "Age" for the x-axis, "Fare" for the y-axis and color the dots using the variable "Survived"
Add a new code cell, and generate a scatter plot using Age, Embarked and Survived
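One possible answer to the exercise is below (a minimal sketch; it assumes the port column is named 'Embarked', as in the category plot exercise - adjust the name if your file differs):
# Age on the x-axis, port of embarkation on the y-axis, dots colored by survival
sns.scatterplot(x="Age", y="Embarked", hue="Survived", data=data);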
Step 3.5: Quantify the relationship between variables
Next, we want to quantify the relationships between variables, i.e. correlation:
Correlation
- Correlation measures both the strength and direction of the relationship between two variables.
- The values of the correlation coefficient can range from -1 to +1 ... the closer it is to +1 or -1, the more closely the two variables are related.
- The sign signifies the direction of the correlation: a positive correlation means that as one variable increases, the other tends to increase as well.
- You can only correlate numerical features
3.5.1 Add a new code cell and run the code data.corr().round(2)
This tells Jupyter that we want to calculate the correlations for the numerical variables in our dataset, and display the results in a table/matrix.
Each cell in the table is a number between 1 and -1 that quantifies the relationship between the two variables.
The closer the number to 1 or -1, the stronger the relationship. Negative values imply a negative correlation and vice-versa for positive correlation values.
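If you are running a recent version of Pandas, corr() no longer skips text columns silently, so you may need to ask for the numeric columns explicitly (a minimal sketch; the numeric_only argument is an assumption for newer Pandas versions):
# Correlation matrix for the numeric columns only, rounded to 2 decimal places
data.corr(numeric_only=True).round(2)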
Step 3.6: Visualize the correlation matrix
One of the easiest ways to interpret the correlation matrix is to visualize it as a heatmap:
- A correlation matrix is a table showing correlation coefficients between variables.
- Each cell in the table shows the correlation between two variables.
- A correlation matrix can be visualized using a heatmap
- A heatmap is a representation of data in the form of a map or diagram in which data values are represented as colors.
- Heatmaps are visually appealing and make it quick and easy to make inferences about correlations
Here is an example of a heatmap visualization of the correlation matrix from our Titanic dataset
3.6.1 Add a new cell and run the following code:
plt.figure(figsize=(19,9))
sns.heatmap(data.corr(), annot=True, linewidth=0.5, cmap="coolwarm")
This code uses "sns" to generate a heatmap that visualizes the correlation matrix. The heatmap is annotated with the actual correlation value in each cell, uses the specified color map, and adds lines between cells to make it easier to read.
Step 3.7: Let's use Aggregations to analyze relationships
3.7.1 Add a new code cell and run the code data.groupby('Pclass')['Survived'].mean()
This code aggregates the data by Passenger Class, and calculates the average of the survival rate for each category.
As you can see from running this code, people in passenger class 1 had a much higher chance of survival, compared to the passengers in the other classes.
3.7.2 Add a new code cell and run the code data.groupby('Sex')['Survived'].mean()
This code aggregates the data by Sex, and calculates the average of the survival rate for each category.
As you can see from running this code, female passengers had a much higher chance of survival than male passengers - nearly 75% of female passengers survived.
Add a new code cell and calculate the survival rate by port of embarkation
3.7.3 Add a new code cell and run the code data.groupby(['Sex', 'Pclass'])['Survived'].mean()
This code aggregates the data by Sex and Pclass, calculating the average value for the variable Survived, for each sub-group.
As you can see from running this code, female passengers in first class had the highest chance of survival, while male passengers in the lower classes had the lowest chance of survival.
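A small readability trick for the two-level grouping is to pivot one of the grouping variables into columns (a minimal, optional sketch using unstack):
# Survival rate by Sex (rows) and Pclass (columns)
data.groupby(['Sex', 'Pclass'])['Survived'].mean().unstack().round(2)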
Before moving on to the next step:
- Save your notebook
- In the left pane, right click your notebook and download it as a .ipynb file
- Export your notebook (Go to File > Export > HTML)
- Open the downloaded HTML notebook in your browser!
- Get the completed EDA notebook here (Right click > Save Link As .ipynb)
Sweet! We just DID some cool data science stuff - let's quickly recap what we did:
- Jupyter Notebook - tool of choice for data scientists
- Markdown - questions, observations, narrative for your work
- Titanic dataset - great learning example
- EDA - data profile, analyze variables and relationships
- Python - Pandas and Seaborn libraries
To recap - here are the steps we performed in the EDA process:
- 3.1 Import the data
- 3.2 Profile the data
- 3.3 Analyze each variable
- 3.4 Analyze relationships between variables
- 3.5 Quantify correlation between variables
- 3.6 Visualize correlations using a heatmap
- 3.7 Use aggregations to analyze relationships
Ok - on to the next step!
Step 4: Model the data
Predictive models use input data (also called predictors, variables or features) and statistics to predict outcomes (also called targets).
For our project, we want to create a model to predict whether a passenger will survive or not, given some information about the passenger.
This is a "Classification" problem, because we want to predict a 'class' (i.e. survived or not survived) for each of the passengers.
Here are some common examples of classification models:
- To determine whether an email is spam or "ham" (non-spam)
- To anticipate visitor behavior to a website - eg. will-buy, window-shopping
- To predict whether a credit card transaction is fraudulent
- To predict if a passenger aboard the Titanic will survive or not
Here are the basic steps we will follow today to build our Classifier
- 4.1. Load the dataset
- 4.2. Split the dataset into training and testing datasets
- 4.3. Select feature variables and specify the target variable
- 4.4. Select an algorithm / model to use
- 4.5. Fit the model using the training data and algorithm
- 4.6. Score the accuracy of the model
- 4.7. Make predictions using the test data
- 4.8. Evaluate the predictions of the model on the test data
- 4.9. Iterate and improve
The remainder of this workshop focuses on building a Classification model using the Titanic dataset.
To get started, create a new notebook and give it a name!
Step 4.1: Load dataset using Pandas
4.1.1 Add a new code cell and run the code
import pandas as pd
data = pd.read_csv("titanic.csv")
data.head(10)
Step 4.2: Split the dataset using scikit-learn
4.2.1 Add a new code cell and run the code to install the library
pip install scikit-learn
For the modelling steps, including this one, we need to use a very popular Python library called Scikit-learn - a free, open-source machine learning library for the Python programming language
A key concept in predictive modeling using classification is that we split our data into two parts - training and testing
- The training data is used to train a model
- The testing data is used to evaluate (or score the accuracy) of the model
The image below illustrates this concept:
4.2.2 Next, add another code cell and run the following code to split the data
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.30)
This code creates two new data frames called train and test, from the original data frame called 'data' - test contains 30% of the overall data.
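Because the split is random, each run of the cell produces a slightly different train/test mix, and therefore slightly different scores later on. If you want a repeatable split, you can optionally pass a random_state (a minimal sketch; the value 42 is arbitrary):
# Reproducible 70/30 split - the same rows land in train and test on every run
train, test = train_test_split(data, test_size=0.30, random_state=42)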
Step 4.3: Select features and target using Pandas
4.3.1 Add a new code cell and run the code below to select the input features
feature_columns = ['Pclass']
features = train[feature_columns]
features.head(5)
We start by selecting one feature, i.e. Pclass, as the input to the model
4.3.2 Add a new code cell and run the code below to set the target variable to 'Survived' (this is the outcome we want to learn to predict)
target = train['Survived']
target.head(5)
We have just created two new objects from our train data frame - one called "features" and the other called "target"
Step 4.4: Select an algorithm
4.4.1 Add a new code cell and run the code
from sklearn.linear_model import SGDClassifier
model = SGDClassifier()
This code creates a Classifier model which we will "train" using our dataset
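SGDClassifier is itself stochastic, so two training runs on the same data can produce slightly different models and scores. If you want repeatable results, you can optionally fix its random_state as well (a minimal sketch; the value is arbitrary):
# Fix the random seed so repeated training runs give the same model
model = SGDClassifier(random_state=42)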
Step 4.5: Train the model
4.5.1 Add a new code cell and run the code model.fit(features, target)
This one line of code trains a model with inputs set as the selected features and the output set to the target that the model will learn to predict.
Step 4.6: Score the model
4.6.1 Add a new code cell and run the code score = model.score(features, target)
print(round(score * 100, 2))
This code will calculate a score for the accuracy of the model. Broadly speaking, a higher score indicates a more accurate model.
Step 4.7: Use the model to make predictions on the testing data
4.7.1 Add a new code cell and run the code test_features = test[feature_columns]
test_target = test['Survived']
test_features.head(5)
This code prepares the test data to feed into our model
Score the model using the test data
4.7.2 Add a new code cell and run the code score = model.score(test_features, test_target)
print(round(score * 100, 2))
This line of code scores the accuracy of the model on the test data - as you can see, the score is different from (and typically lower than) the accuracy of the model on the training data.
Why do you think this is the case?
Make predictions about survival using the test data and the trained model
4.7.3 Add a new code cell and run the code predictions = model.predict(test_features)
predictions
This line of code makes predictions - as you can see, the values are 0 or 1
Step 4.8: Let's start to evaluate the results of our predictions!
4.8.1 Add a new code cell and run the code test["Prediction"] = predictions
test.head(10)
This code puts the predictions we made on our test data next to the actual known Survived label. We use this output to see where our predictions are "wrong" (or where our classifier is confused).
4.8.2 Add a new code cell and run the code test.count()
This code shows the count of the rows in our test data
4.8.3 Add a new code cell and run the code test.query('Survived == Prediction').count()
This code shows the count of the rows in the test dataset where our prediction was correct
4.8.4 Add a new code cell and run the code test.query('Survived != Prediction').count()
This code shows the count of the rows in the test dataset where our prediction was wrong
4.8.5 Add a new code cell and run the code test[(test.Survived == 1) & (test.Prediction == 0)].count()
This code shows the count of the rows in the test dataset where our prediction was 0, but the correct label was 1, i.e. the passenger survived
4.8.6 Add a new code cell and run the code test[(test.Survived == 0) & (test.Prediction == 1)].count()
This code shows the count of the rows in the test dataset where our prediction was 1, but the correct label was 0, i.e. the passenger did not survive
These counts form the basis for more advanced methods and metrics for evaluating the results of a predictive model.
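In fact, these four counts are exactly what a confusion matrix summarizes. A quick, optional way to see them all at once is a Pandas cross-tabulation (a minimal sketch):
# Rows are the true Survived labels, columns are the model's predictions
pd.crosstab(test['Survived'], test['Prediction'])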
Step 4.9: Let's see if we can make our model better!
Ok! Now that we have a model and a baseline evaluation, the next step is to start improving/optimizing our model. This is where most of the data science work happens (after data cleaning). There are a variety of techniques that we can use:
- Add features to use when training the model
- Remove features that don't add any predictive power to our model
- Tune the algorithm using the available parameters
- Try a different algorithm
- Create composite features or add new features from other data sources
4.9.1 Add a new code cell and run the following lines of code
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.30)
feature_columns_2 = ['Pclass', 'Sex']
features_2 = train[feature_columns_2]
target = train['Survived']  # re-select the target from the new training split
model_2 = SGDClassifier()
model_2.fit(features_2, target)
score_2 = model_2.score(features_2, target)
print(round(score_2 * 100, 2))
This code selects 2 features, fits a model to the training data and scores the accuracy
Now, lets evaluate our model on the test data
4.9.2 Add a new code cell and run the code test_features_2 = test[feature_columns_2]
test_target = test['Survived']
testscore_2 = model_2.score(test_features_2, test_target)
print(round(testscore_2 * 100, 2))
This code will evaluate the accuracy of our second model using the test data.
Great - let's try adding one more feature and run through the process again
4.9.3 Add a new code cell and run the following lines of code
from sklearn.model_selection import train_test_split
train, test = train_test_split(data, test_size=0.30)
feature_columns_3 = ['Pclass', 'Sex', 'Embark']
features_3 = train[feature_columns_3]
target = train['Survived']  # re-select the target from the new training split
model_3 = SGDClassifier()
model_3.fit(features_3, target)
score_3 = model_3.score(features_3, target)
print(round(score_3 * 100, 2))
This code selects 3 features, fits a model to the training data and scores the accuracy
Now, lets evaluate our model on the test data
4.9.4 Add a new code cell and run the code test_features_3 = test[feature_columns_3]
test_target = test['Survived']
testscore_3 = model_3.score(test_features_3, test_target)
print(round(testscore_3 * 100, 2))
This code will evaluate the accuracy of our third model using the test data.
Recap - we just performed the basic steps in the process for creating a predictive model:
- 4.1. Load the dataset
- 4.2. Split the dataset into training and testing datasets
- 4.3. Select predictors/features and outcome/target
- 4.4. Select an algorithm based on the problem and data
- 4.5. Train the model using the training data and algorithm
- 4.6. Score the accuracy of the model
- 4.7. Make predictions
- 4.8. Evaluate the predictions of the model using the test data
- 4.9. Iterate and improve
- Predictive models use input data (also called predictors, variables or features) and statistics to predict outcomes (also called targets).
- A predictive model is able to learn how different points of data connect with each other, and use these learned relationships to make a prediction about an outcome.
- Two of the oldest and most widely studied and used predictive modeling techniques are called Regression and its close relative Classification. Broadly speaking:
- Regression refers to predicting a numeric quantity (like temperature, price, income)
- Classification refers to predicting a label or category, like "spam", or "survived", or "cat"
Before we move into the Q & A:
- Save your Notebook
- Download it as a .ipynb file (i.e. an IPython Notebook) - this file format can be opened as a notebook in Jupyter
- Export it as a .HTML file - you can open this in a browser!
- Download the completed Notebook from today's workshop here
Feedback Form
We are always looking for ways to improve, and it would help us tremendously if you could take a moment to fill out this short survey.
Learn More
There is a ton to learn about data and analytics, and this was only the first step!
Our upcoming Data Foundations course is a beginner-friendly course that teaches functional data literacy and foundational data science concepts. Data literacy means being able to understand, work with, and analyze data, and is becoming an increasingly more in-demand skill across all industries. If having a better understanding of data is something that would help you in your day-to-day work, then this course is the perfect starting point.
Using the most popular tools and commonly used approaches, the course teaches students:
- How to use Spreadsheets, SQL, R and Python
- How to interact with databases, build dashboards and tell a story with data
- How to perform typical steps in the data science workflow
Next Steps:
Course Dates