Lesson Plan: Data Science: Predictive Analysis using Mumbai Temperature Data

As an undergraduate Mathematics or Data Science teacher, you can use this set of computer-based tools to help you in teaching Introductory Predictive Analysis and specifically Exploratory Data Analysis with Linear Regression.

Introduction

This lesson plan will help you to teach Introductory Predictive Analysis through an Exploratory Data Analysis with Linear Regression assignment. The lesson plan includes a hands-on computer-based classroom activity to be conducted on a dataset of the annual temperature records of Mumbai, a coastal city in Western India, for the period 1842 to 2019. This activity includes hands-on Python code and a set of inquiry-based questions that will enable your students to apply their understanding of scatter plots, trendlines, moving averages, heatmaps, correlation coefficients, linear regression, and regression equations.

Thus, the use of this lesson plan allows you to integrate the teaching of a climate science topic with a core topic in Mathematics, Statistics, and Data Science.

Questions

Use this lesson plan to help your students find answers to:

  1. Use an example to describe the time series analysis of a dataset spanning 177 years.
  2. Use an example to describe exploratory data analysis with linear regression analysis.
  3. What are heatmaps and correlation analysis for a given dataset?
  4. Use regression analyses to describe how the seasonal and annual temperatures of Mumbai have changed over time.
  5. Discuss reasons for changes in temperature patterns and the impact of climate change on temperatures of various cities in the world.

Location of Mumbai in India

About the Lesson Plan

Grade Level: Undergraduate
Discipline: Mathematics, Data Science
Topic(s) in Discipline: Scatter Plots, Correlation Coefficients, Regression Equations, Linear Regression, OLS and LOWESS Trendlines, Heatmap
Climate Topic: Climate and the Atmosphere; Climate Variability Record
Location: Global
Language(s): English
Access: Online, Offline
Computer Skills Required: Intermediate
Approximate Time Required: 60-90 min

Contents

Teaching Module (25 min)

A teaching module to explain the basics of scatter plots, correlation coefficients, regression equations, and linear regression.

Video micro-lecture (14 min)

A video micro-lecture that gives an introduction to Simple Linear Regression.

Classroom/Laboratory activity (30 min)

A classroom activity with Python code to apply the understanding of Exploratory Data Analysis and Linear Regression by using a dataset of the annual and seasonal temperatures of Mumbai city for the period 1842 to 2019.

Go to GitHub Repository

Video

Here is a step-by-step guide to using this lesson plan in the classroom/laboratory. We have suggested these steps as a possible plan of action. You may customize the lesson plan according to your preferences and requirements.

Step 1: Topic introduction and discussion:

1. Use the teaching module, ‘Introduction-Linear Regression and Correlation’ by OpenStax™, Rice University (for High School level) or ‘Chapter-3: Linear Regression’ provided by Ramesh Sridharan, Massachusetts Institute of Technology (for Undergraduate level), to introduce these topics of basic statistics.

2. Navigate to the sub-sections within the module that cover the basics of scatter plots, correlation coefficients, regression equations, and linear regression.

3. Use the in-built practice exercises and quizzes to evaluate your students’ understanding of the topics.

Find the Linear Regression Teaching Module PDF here

 

Step 2: Develop the topic further:

Use the video micro-lecture, ‘Introduction to Simple Linear Regression’ by dataminingincae, for a basic introduction to Simple Linear Regression and terms like dependent variable, independent variable, regression line, and regression coefficients.
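
To connect these terms to a concrete calculation, you may find a short sketch like the one below useful. It computes the regression coefficients (intercept and slope) of a regression line with the standard closed-form least-squares formulas. The function name and the five data points are made up for illustration and are not taken from the Mumbai dataset or the video.

```python
# Minimal sketch: closed-form least-squares fit of y = intercept + slope * x.
# The data below are made-up illustrative values, not Mumbai records.

def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of x and y divided by the variation of x
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    # The regression line always passes through (mean_x, mean_y)
    intercept = mean_y - slope * mean_x
    return intercept, slope

years = [2000, 2001, 2002, 2003, 2004]   # independent variable
temps = [27.0, 27.1, 27.2, 27.3, 27.4]   # dependent variable (hypothetical, deg C)

intercept, slope = fit_simple_linear_regression(years, temps)
predicted_2005 = intercept + slope * 2005  # point on the regression line

print(f"slope = {slope:.3f} deg C per year")
print(f"predicted temperature for 2005 = {predicted_2005:.2f} deg C")
```

Here the year is the independent variable, the temperature is the dependent variable, and the slope tells students how much the fitted temperature changes per year.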

Step 3: Extend understanding by practicing Hands-on Python code:

1. Use the provided Dataset mumbai-temp-data.csv and associated Python Notebook for Exploratory Data Analysis with Linear Regression.

2. The raw data was collected from Colaba’s meteorological station for the period 1842 to 2019, a span of 177 years. However, the usable data runs from 1878 to 2019, which still covers 141 years. Temperatures for all twelve months are available for each of these 141 years.

Data Source: Colaba Meteorological Station, Mumbai, India

3.  Use the Python Notebook and Dataset to:

  • Part 1: Load and prepare the data for use
      - Read the dataset into a DataFrame.
      - Learn the basics of the dataset, such as its dimensions, data types, and memory usage.
      - Plot scatter plots of the annual temperature variable and of the seasonal temperatures for Jan-Feb, March-May, June-September, and October-December.
      - Use the NumPy library to convert the DataFrame to a NumPy array, which is used in the later steps.
      - Calculate moving averages to get a smoother curve when plotting the readings. The moving averages are calculated over five-year windows for all the seasonal columns.
  • Part 2: Exploratory Data Analysis: Know your data
      - Handling missing values: A first look at the dataset shows that some rows have missing values in the moving-average columns, so we drop those rows first. This reduces the data from 143 rows to 138 rows.
      - Basic statistical functions: Next, we look at some basic summary statistics to build intuition about where to start looking in the data. We find the mean, median, minimum, and maximum values for each column to check for outliers. There is no big difference between the 75th percentile and the maximum values of the moving-average columns, so no outlier removal is needed.
      - The null values in the moving-average columns were removed so that they do not affect later steps. In addition, the data contained a total of 10 missing values coded as -999, which had to be imputed using the averages of their columns.
  • Part 3: Heatmaps for correlations
      - Let the students create visualizations of the feature correlations, i.e., heatmaps. The first heatmap covers all 12 months, and the second shows the correlations between the moving-average columns of all seasons. Comparing the two helps determine the difference made by taking moving averages.
      - The first heatmap shows that February, May, September, October, November, and December contribute the most to the annual column and have a high positive correlation with it.
      - The second heatmap shows a very high positive correlation between the October-December moving average and the annual temperature column. The October-December column thus contributes the most to the annual column, which means that the rise in annual temperature is seen more because of the October-December (winter) season than because of the summer months.
      - None of the columns are negatively correlated with one another, which means that no inverse correlation exists between any of the columns.
  • Part 4: Trendlines
      - Let the students plot annual and seasonal temperature scatter plots using the data columns for the 5-year moving averages.
      - Let them divide the data into 50-year intervals to understand the trends for every 50 years.
      - Encourage the students to analyze the trendlines and discuss their observations.
      - Discuss the data ranges for the various seasons.
  • Part 5: Linear Regression for Predictions
      - Find the regression coefficients for Simple Linear Regression.
      - Plot the scatter plot and the regression line given by the fitted coefficients.
      - Calculate RMSE (Root Mean Squared Error) values.
      - Discuss how well the regression line describes the data points for the total time period and for every 50 years.
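
The main computations in Parts 1 to 5 can be sketched roughly as shown below. This is a minimal illustration, not the code from the provided Python Notebook: it uses a small synthetic table instead of mumbai-temp-data.csv, and the column names (YEAR, ANNUAL, OCT_DEC) are hypothetical, so adapt them to the actual dataset. Plotting is left out to keep the sketch short.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the temperature table. The column names and
# values are hypothetical, not taken from mumbai-temp-data.csv.
years = np.arange(1990, 2010)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "YEAR": years,
    "ANNUAL": 27.0 + 0.02 * (years - 1990) + rng.normal(0, 0.1, years.size),
    "OCT_DEC": 26.5 + 0.03 * (years - 1990) + rng.normal(0, 0.1, years.size),
})
df.loc[3, "ANNUAL"] = -999  # sentinel marking a missing reading

# Treat -999 as missing and impute it with the mean of its column
df = df.replace(-999, np.nan)
df = df.fillna(df.mean(numeric_only=True))

# 5-year moving averages; the leading rows lack a full window and become NaN
for col in ["ANNUAL", "OCT_DEC"]:
    df[f"{col}_MA5"] = df[col].rolling(window=5).mean()

# Drop the rows whose moving average is undefined
df = df.dropna().reset_index(drop=True)

# Correlation matrix that a heatmap would display
corr = df[["ANNUAL_MA5", "OCT_DEC_MA5"]].corr()

# Simple linear regression of the smoothed annual series on the year
slope, intercept = np.polyfit(df["YEAR"], df["ANNUAL_MA5"], deg=1)
predicted = intercept + slope * df["YEAR"]

# Root Mean Squared Error between the observed and fitted values
rmse = float(np.sqrt(np.mean((df["ANNUAL_MA5"] - predicted) ** 2)))

print(corr.round(2))
print(f"slope = {slope:.4f} deg C per year, RMSE = {rmse:.4f}")
```

To study 50-year intervals, students can apply the same regression and RMSE steps to a subset of rows selected by year.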

4. Encourage your students to answer topical questions by applying their understanding of scatter plots, correlation coefficients, heatmap, moving averages, trendlines, and linear regression.

5. Use the regression analyses performed to initiate a discussion on the increase in temperatures from 1980 to 2020 due to anthropogenic activities, which are one major reason behind global climate change.

 
