As a high school or undergraduate teacher of Mathematics or Data Science, you can use this lesson plan to teach Exploratory Data Analysis using a dataset of long-term temperature records in India.
Introduction
Exploratory Data Analysis is the process of examining a dataset to understand it and to extract insights about its main characteristics.
The dataset used in this Lesson Plan includes temperature data recorded in India from 1901 to 2011. India observes three distinct seasons: summer, rainy season (monsoon), and winter. This data can be used to study how temperature patterns in these seasons have changed and when the early signs of those changes first appeared. The Python code given below can be used to perform an exploratory analysis of the temperature data, summarize its main characteristics, and find any trends.
Contents
Video Lecture: "Exploratory Data Analysis" by Prof. Patrick Meyer, Curry School of Education, University of Virginia, USA. The lecture is an introduction to exploratory data analysis that includes a discussion of descriptive statistics, graphs, outliers, and robust statistics.
Classroom/Laboratory activity (30 min): A classroom/computer lab activity that includes Python code to practice Exploratory Data Analysis using a dataset of the temperature data (in degrees Celsius) recorded in India from 1901 to 2011.
Suggested Questions
Why is exploratory data analysis important in data science?
What is climate change? What are the causes of global warming?
What are the trends in the temperature of India over the past 100 years?
Can you find a trend for the given temperature records?
About the Lesson Plan
Grade Level: High school, Undergraduate
Discipline: Data Science, Mathematics, Environmental Sciences
Topic(s) in Discipline: Basics of Data Analysis, Exploratory Data Analysis, Global Warming
Climate Topic: Introduction to Climate Change, Climate Variability Record, Long-term climate records
Location: India
Language(s): English
About the Data
The dataset includes temperature data recorded in India from 1901 to 2011. The columns include the Annual temperature and the temperatures for January to February, March to May, June to September, and October to December. The dataset is available for free access at the Open Government Platform India.
Video: Exploratory Data Analysis
Python Code
Exploratory Data Analysis
Steps
Statistical Observations
Target Variables
Correlations
Checking Outliers
Checking Skewness
Here is a step-by-step guide to using this lesson plan in the classroom/laboratory. We have suggested these steps as a possible plan of action. You may customize the lesson plan according to your preferences and requirements.
Step 1: Introduction to Exploratory Data Analysis
Use the video lecture "Exploratory Data Analysis" by Prof. Patrick Meyer, Curry School of Education, University of Virginia, USA. The lecture is an introduction to exploratory data analysis that includes a discussion of descriptive statistics, graphs, outliers, and robust statistics.
Step 2: Classroom/Laboratory Activity
Conduct an activity to practice coding for Exploratory Data Analysis using Python. Guide the students to use the code to understand the steps involved in loading the necessary libraries, loading the data, and plotting interactive graphs.
The adjoining Python code first imports, into the Jupyter Notebook, the libraries needed to plot interactive graphs with Plotly. Once the necessary libraries are loaded, arrays are prepared to store the values that will be plotted. The NumPy library is used to create these arrays, which supply the X-axis and Y-axis values of each plot. Plotly is then used to plot interactive line graphs for the Annual, Jan-Feb, March-May, June-September, and October-December temperature columns.
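The workflow described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the notebook's actual code: the file name `temperatures.csv` is hypothetical, the column names are assumed to match those used in this lesson plan, and the three inline rows are made-up placeholder values so that the example runs without the real file.

```python
import io

import numpy as np
import pandas as pd

# Made-up placeholder rows (NOT real records) so the sketch runs on its own.
# With the real file you would call something like
# pd.read_csv("temperatures.csv"); that file name is hypothetical.
SAMPLE_CSV = """Year,Annual,Jan-Feb,Mar-May,Jun-Sept,Oct-Dec
1901,29.0,24.0,31.0,31.0,27.0
1902,29.5,25.0,32.0,31.5,27.5
1903,28.5,23.0,31.5,31.0,26.5
"""

df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# NumPy arrays for the X-axis (years) and the Y-axis (temperatures).
years = np.array(df["Year"])
annual = np.array(df["Annual"])


def make_line_figure(x, y, title):
    """Return an interactive Plotly line figure for one temperature column."""
    # Imported here so the data preparation above also works without Plotly.
    import plotly.graph_objects as go

    fig = go.Figure(data=go.Scatter(x=x, y=y, mode="lines+markers", name=title))
    fig.update_layout(
        title=title,
        xaxis_title="Year",
        yaxis_title="Temperature (degrees Celsius)",
    )
    return fig
```

In a Jupyter Notebook, `make_line_figure(years, annual, "Annual").show()` should render the interactive plot; the same function can be reused for each seasonal column.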
Step 3: Data Analysis and Suggested Discussion
Once the code is ready and running without any issues, encourage the students to analyze the data using the plots. The analysis can be run for the Annual temperature and for the seasonal temperatures in the intervals January-February, March-May, June-September, and October-December. Help the students to note the changes in the average, highest, and lowest values of the temperature over the period of 111 years.
A detailed Data Analysis and Discussion is suggested below.
This dataset consists of a total of 111 entries and 6 columns, namely Year, Annual, Jan-Feb, Mar-May, Jun-Sept, and Oct-Dec. There are no null or missing values in the dataset, and all the columns except Year have a float data type. While performing exploratory data analysis on this dataset, we will divide the analysis into two categories: Basic Understanding of the dataset and Data Visualization.
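These structural checks (number of entries and columns, missing values, and data types) can be reproduced with a few pandas calls. The sketch below is illustrative only: the values are made-up placeholders rather than the real records, and the column names are assumed to match those listed above.

```python
import pandas as pd

# Made-up placeholder values, not the real records.
df = pd.DataFrame({
    "Year": [1901, 1902, 1903, 1904],
    "Annual": [29.0, 29.5, 28.5, 29.0],
    "Jan-Feb": [24.0, 25.0, 23.0, 24.5],
    "Mar-May": [31.0, 32.0, 31.5, 31.0],
    "Jun-Sept": [31.0, 31.5, 31.0, 31.5],
    "Oct-Dec": [27.0, 27.5, 26.5, 27.0],
})

# Number of entries (rows) and columns.
n_rows, n_cols = df.shape

# Count of null or missing values in each column.
missing_per_column = df.isnull().sum()

# Data type of each column (Year is an integer, the rest are floats).
column_types = df.dtypes

print(f"{n_rows} rows and {n_cols} columns")
print(missing_per_column)
print(column_types)
```

On the real dataset, the same calls would be expected to report 111 rows, 6 columns, and no missing values, as stated above.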
Basic Understanding:
This part discusses the basic understanding that can be derived by applying statistical functions from the pandas and NumPy libraries to the dataset. The describe() function reports summary statistics for each column, including the mean, median (50th percentile), minimum, and maximum values. The main observations from its output are listed below.
Key Observations:
The mean value of each column is less than its median value (the 50th percentile).
A noticeably large difference can also be seen between the 75th percentile and the maximum values of the Jan-Feb and Mar-May columns.
Observations 1 and 2 suggest that there are outliers present in the dataset that we should look out for.
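The comparisons behind these observations can be computed directly from the output of describe(). The following is a minimal sketch using made-up placeholder values (not the real records); the variable names are illustrative.

```python
import pandas as pd

# Made-up placeholder values, not the real records.
df = pd.DataFrame({
    "Jan-Feb": [24.0, 24.5, 24.5, 25.0, 27.0],
    "Mar-May": [31.0, 31.5, 32.0, 32.5, 33.0],
})

# describe() returns the rows count, mean, std, min, 25%, 50%, 75%, and max.
stats = df.describe()

# Observation 1: compare the mean with the median (the 50% row).
mean_minus_median = stats.loc["mean"] - stats.loc["50%"]

# Observation 2: gap between the maximum and the 75th percentile.
upper_gap = stats.loc["max"] - stats.loc["75%"]

print(stats)
print(mean_minus_median)
print(upper_gap)
```

A mean that sits well away from the median, or a maximum far above the 75th percentile, is a hint that a few extreme values may be present and worth checking with a box plot.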
Here we will try to understand each column’s data individually and understand the data ranges.
Annual: The data values are in the range of 28 to 31 degrees Celsius. The average temperature is 29.33 degrees Celsius and the median value is 29.07 degrees Celsius. The most frequent values in the column are 28.76, 28.89, 28.66, 28.80, and 28.70 degrees Celsius.
January - February: In this column, the data is in the range of 23 to 27 degrees Celsius. The median value is 24.51 degrees Celsius. The most frequent values are 24.99, 24.51, and 23.62 degrees Celsius.
March - May: For this interval, which is the hottest time of the year, the data is in the range of 31 to 33 degrees Celsius. The median temperature comes out to be 31.46 degrees Celsius. The values with the highest frequency in the dataset are 30.84, 31.17, and 31.89 degrees Celsius.
June - September: During this period, the temperature has ranged from 31 to 32 degrees Celsius, which differs only slightly from the previous interval. The median temperature comes out to be 31.16 degrees Celsius. The most frequent values are 31.55, 31.28, 31.11, and 31.25 degrees Celsius.
October - December: The last few months of the year, also known as the winter period, show a steep drop in temperature, to a range of 26 to 28 degrees Celsius, with the median value being 27.18 degrees Celsius. The temperatures 27.24, 27.50, and 27.26 degrees Celsius are encountered the most for this interval.
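The per-column figures above (range, mean, median, and most frequent values) can be gathered with a small helper. This is a sketch only: `summarize_column` is a hypothetical helper name and the sample values are made-up placeholders, not the real records.

```python
import pandas as pd


def summarize_column(series, top_n=3):
    """Return the range, mean, median, and most frequent values of a column."""
    return {
        "min": series.min(),
        "max": series.max(),
        "mean": series.mean(),
        "median": series.median(),
        # value_counts() sorts the values by how often they occur.
        "most_frequent": series.value_counts().head(top_n).index.tolist(),
    }


# Made-up placeholder values, not the real records.
jan_feb = pd.Series([24.5, 24.5, 24.5, 25.0, 25.0, 23.5], name="Jan-Feb")
summary = summarize_column(jan_feb, top_n=2)
print(summary)
```

Because temperatures are recorded to two decimal places, many values occur only once or twice, so the most frequent values are best read alongside the median rather than on their own.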
So far, we have used statistical values to derive a basic understanding of our dataset. Certain visualizations can be used to dive deeper into our data and understand how data is correlated. These correlations will help us in feature selection i.e. in choosing which columns are the ones we should focus on.
First, the Seaborn library in Python is used to plot a heatmap that helps in understanding correlations between columns. Strongly correlated variables and outliers can both distort the results of prediction models, so we look at each in turn: the heatmap shows the correlations and a box plot helps to identify outliers. Furthermore, by plotting distribution curves for the values, we can assess their skewness too.
Correlations through Heatmap: As seen in the image below, the lighter blue shades represent a negative correlation between columns, whereas the darker shades represent a positive correlation. Here we use the range 0.5 to 1.0 to judge how strong or weak a correlation is. We can infer that the Annual column has a strong positive correlation with the Year column and with all the seasonal columns, since the annual value takes into account the data from all those seasons.
Since this dataset does not particularly require feature selection, we need not eliminate any features based on this heatmap. When using Linear Regression, however, it is important to eliminate features that are strongly linearly related to one another, because such redundant features can make the model unstable and prone to overfitting.
We can also display the correlation values on the heatmap by passing annot=True in the code, as seen in the image below. These heatmaps are very useful for feature selection, though not particularly in this case.
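A correlation heatmap of this kind can be sketched as follows. The sketch makes several assumptions: the values are made-up placeholders (not the real records), `plot_correlation_heatmap` is a hypothetical helper name, and the `"Blues"` colour map is simply a guess at a palette similar to the one described.

```python
import pandas as pd

# Made-up placeholder values, not the real records.
df = pd.DataFrame({
    "Year": [1901, 1902, 1903, 1904, 1905],
    "Annual": [29.0, 29.1, 29.2, 29.3, 29.4],
    "Jan-Feb": [25.0, 24.0, 25.0, 24.0, 25.0],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr()
print(corr)


def plot_correlation_heatmap(corr_matrix, annot=True):
    """Draw the correlation matrix as a Seaborn heatmap and return the axes."""
    # Imported here so the correlation table above works without Seaborn.
    import matplotlib.pyplot as plt
    import seaborn as sns

    # annot=True writes the correlation value inside each cell.
    ax = sns.heatmap(corr_matrix, annot=annot, cmap="Blues", vmin=-1, vmax=1)
    ax.set_title("Correlation between columns")
    plt.tight_layout()
    return ax
```

Calling `plot_correlation_heatmap(corr)` in a notebook should draw the heatmap; in this placeholder data, Annual rises in step with Year (correlation close to 1) while Jan-Feb shows no linear relationship with Year.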
In order to sniff out the outliers present in the data, we will plot a box and whisker plot. A box plot is typically used to display the distribution of quantitative data in a way that makes comparisons between the features easy to see.
In the box plot, the central rectangle spans the first quartile to the third quartile; this span is known as the interquartile range (IQR). A line inside the rectangle marks the median value. The whiskers above and below the box extend to the largest and smallest values that are not treated as outliers. By the common convention, which is also the default in Matplotlib and Seaborn, a point is drawn as an outlier if it lies more than 1.5 x IQR above the third quartile or below the first quartile; points more than 3 x IQR away are sometimes called extreme outliers.
Thus we can see that in our dataset none of the columns have outliers.
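The outlier check that a box plot performs visually can also be done numerically. Below is a minimal sketch of the IQR rule, assuming a multiplier `k` that the reader can adjust (1.5 is the usual box-plot convention and 3 is a stricter choice); `iqr_outliers` is a hypothetical helper name and the sample values are made-up placeholders.

```python
import numpy as np


def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # Keep only the points that fall outside the [lower, upper] fences.
    mask = (values < lower) | (values > upper)
    return values[mask], (lower, upper)


# Made-up placeholder values, not the real records.
sample = [24.0, 24.5, 24.5, 25.0, 26.0]

outliers, fences = iqr_outliers(sample)
strict_outliers, strict_fences = iqr_outliers(sample, k=3)
print(outliers, fences)
print(strict_outliers, strict_fences)
```

In this placeholder sample, 26.0 falls outside the 1.5 x IQR fences but inside the 3 x IQR fences, which shows how the choice of multiplier changes what counts as an outlier.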
We plot a distribution graph to check the shape of the distribution of the values and to look for skewness in the features.
As one can see from the image above, all the columns except Annual and Oct-Dec are approximately normally distributed. The Annual plot seems to be left-skewed, whereas the Oct-Dec plot seems to be right-skewed.
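Skewness can also be measured numerically instead of being judged by eye. The sketch below uses the pandas `skew()` method on three made-up columns whose names and values are purely illustrative (they are not taken from the temperature dataset).

```python
import pandas as pd

# Made-up illustrative columns, not taken from the temperature dataset.
df = pd.DataFrame({
    "symmetric": [1.0, 2.0, 3.0, 4.0, 5.0],
    "right_tail": [1.0, 1.0, 1.0, 2.0, 10.0],
    "left_tail": [1.0, 9.0, 10.0, 10.0, 10.0],
})

# Sample skewness of each column.
skewness = df.skew()
print(skewness)
```

A value near zero indicates a roughly symmetric distribution, a positive value indicates a longer right tail (right-skewed), and a negative value indicates a longer left tail (left-skewed).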
Data Analysis and Suggested Discussion
We have suggested these points that can be used as a possible plan of action to conduct a discussion in your class. You may customize the lesson plan according to your preferences and requirements.
1. Annual Temperature Line Plot: The line plot for annual temperatures over a time period of 110 years (1901-2011) shows that the temperature lies in the range of 28 to 30 degrees Celsius. According to the line trends, there are significant drops in the years 1920, 1950, and 2000. Similarly, a significant rise in temperature can be seen in the years 1940, 1959, and 1999. In the years 1921 and 1961, the temperature was slightly below the average temperature, whereas in the year 2001 the temperature spiked to 29.99 degrees Celsius (almost 30 degrees Celsius).
Help your students to note that there has been a notable rise in temperature since 1981.
Interactive Plots
2. January-February Line Plot: The line plot for the months of January and February shows that the temperature ranges from 23 to 27 degrees Celsius for this time of the year. The median value is 24.51 degrees Celsius. Temperature rises were seen in the years 1902, 1946, 1966, and 2006, whereas drops in temperature were seen in the years 1905, 1968, and 2008.
Suggest that your students note that the highest temperature recorded over the time period was 27.44 degrees Celsius in 2006 and the lowest was 22.25 degrees Celsius in 1905.
3. March-May Line Plot: The adjoining line plot shows the temperature for the months of March, April, and May. For this interval, which is the hottest time of the year, the data is in the range of 31 to 33 degrees Celsius. The median temperature comes out to be 31.46 degrees Celsius. The values with the highest frequency in the dataset are 30.84, 31.17, and 31.89 degrees Celsius.
Temperature rises were seen in the years 1985, 2002, and 2011, whereas drops in temperature were seen in the years 1907, 1917, 1926, 1933, and 1957. The highest temperature recorded over the time period was 33.46 degrees Celsius in 2010, and the lowest was 29.92 degrees Celsius in 1907.
Help the students to compare the median, lowest, and highest values of the temperature for other time intervals.
4. June – September Line Plot: The line plot shows the temperature for the months of June, July, August, and September. The range of temperature is recorded from 31.2 to 31.5 degrees Celsius for this time of the year. Temperature rises were seen in the years 1919, 1945, 1970, 1985, and 2011, whereas drops in temperature were seen in the years 1920, 1938, 1959, and 1979.
The highest temperature recorded over the time period was 32.24 degrees Celsius, in 1987 and 2009, and the lowest was 30.2 degrees Celsius in 1956.
5. October – December Line Plot: The adjoining plot shows the temperature readings in the months of October, November, and December, which is the winter period in India. The temperature is in the range of 27 to 28 degrees Celsius for this interval. The highest recorded temperature is 28.5 degrees Celsius in the year 1995, while the lowest, 25.7 degrees Celsius, was recorded in the year 1917. Spikes in temperature were recorded in 1922, 1941, 1955, 1999, and 2001, whereas drops were seen in 1919, 1961, and 2000.
Help the students to note the change in the lowest values of the temperature over the period of 111 years.
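Students can check readings like these against the data itself, for example by looking up the years of the highest and lowest values and by fitting a straight line to estimate the overall trend. The sketch below is one possible approach and not the notebook's own code: `extremes` and `trend_per_decade` are hypothetical helper names, a linear fit is only a rough summary of a trend, and the values are made-up placeholders rather than the real records.

```python
import numpy as np
import pandas as pd

# Made-up placeholder values, not the real records.
df = pd.DataFrame({
    "Year": [1901, 1902, 1903, 1904, 1905, 1906],
    "Annual": [29.0, 29.1, 29.2, 29.3, 29.4, 29.5],
    "Jan-Feb": [24.0, 23.0, 25.0, 24.5, 26.0, 24.0],
})


def extremes(frame, column):
    """Return (year of max, max value, year of min, min value) for a column."""
    hottest = frame.loc[frame[column].idxmax()]
    coldest = frame.loc[frame[column].idxmin()]
    return (
        int(hottest["Year"]), float(hottest[column]),
        int(coldest["Year"]), float(coldest[column]),
    )


def trend_per_decade(frame, column):
    """Slope of a least-squares straight line, in degrees Celsius per decade."""
    slope, _intercept = np.polyfit(frame["Year"], frame[column], deg=1)
    return slope * 10


print(extremes(df, "Jan-Feb"))
print(trend_per_decade(df, "Annual"))
```

On the real dataset, the same helpers could be applied to each column, or to a subset such as `df[df["Year"] >= 1981]`, to compare the recent trend with the earlier one.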
Learning Outcomes
Questions/Assignments
Additional Resources
Downloads
Credits
Learning outcomes for Data Science:
Learn about basic Exploratory Analysis of a dataset to find the number of records and columns, identify obvious errors, understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.
Learn how to calculate the mean, standard deviation, and minimum and maximum values of various columns.
Learn about the overall distribution of data over a given range.
Learn how to prepare a dataset and plot graphs using Python.
Learn the use of libraries like NumPy, pandas, and Matplotlib for Python.
Learn how to plot interactive graphs on a platform called 'Plotly'.
Learning outcomes for climate change:
The tools in this lesson plan will enable students to:
learn about climate change and global warming with the help of the analysis of India's long-term temperature records
use Python functions to calculate and describe temperature trends in India from the beginning of the 20th century to recent times (1901-2011)
discuss how these changes suggest that the planet has warmed significantly since the beginning of the industrial age
discuss the overall trends of temperature in summer, monsoon, and winter seasons in India
Suggested questions/assignments for learning evaluation:
Questions for evaluation of Data Analysis:
What is Exploratory Data Analysis? Why is it important?
What type of pre-processing of the data is required before the analysis?
How can Exploratory Data Analysis of a dataset be performed using Python functions and libraries?
How can line plots be drawn using Python functions and libraries?
Questions for climate change understanding:
Has the annual mean temperature of India been increasing since 1980?
What is the difference between the average annual temperature, Jan-Feb temperature, March-May temperature, June-Sept temperature and October-December temperature?
What is the latest rate of change of global average temperatures according to the last recorded data point (2018)?
What are the trends in the temperature of India over the past 100 years?
Can you find a trend for the given temperature records?
Can you draw an overall conclusion by looking at the line plot of the annual average temperature of India?
Assignment:
Download the original dataset and Python code files. Try to run the code and perform the Exploratory Data Analysis using Python studio (offline) or Jupyter Notebook (online).
The downloadable dataset and Python code file are available on the TROP ICSU GitHub page.
If you or your students would like to explore the topic further, these additional resources will be useful.
Use the interactive visualization ‘Average temperature anomaly, Global’ by Our World in Data, to encourage discussion amongst your students about the changes in the average global temperatures from the years 1850-2017.
Discuss how these changes suggest that the planet is warming and therefore, could be impacting Earth’s climate.