This lesson plan and tutorial shows how to turn raw data into Python code and, in turn, into interactive graphs, using a dataset of long-term temperature records in India.
Introduction
The dataset includes temperature data recorded in India from 1901 to 2011. India observes three distinct seasons: summer, the rainy season (monsoon), and winter. The data can be used to study changes in temperature patterns across these seasons and to identify when early signs of those changes first appeared.
The Python code given below can be used to perform an exploratory analysis of the temperature data, summarize its main characteristics, and find any trends.
About the Data
The dataset includes temperature data recorded in India from 1901 to 2011. The columns are Year, Annual temperature, and the seasonal temperatures for January to February, March to May, June to September, and October to December. The dataset is freely available at the Open Government Platform India.
About the Lesson Plan
- Grade Level: High school, Undergraduate
- Discipline: Environmental Sciences, Data Science
- Topic(s) in Discipline: Basics of Data Analysis, Global Warming
- Climate Topic: Introduction to Climate Change, Climate Variability Record
- Location: India
- Language(s): English
Suggested Questions
- What is climate change? What are the causes of global warming?
- What are the trends in the temperature of India over the past 100 years?
- Can you find a trendline for the given temperature records?
Algorithm
Here we look in depth at the code used in a Jupyter Notebook to plot interactive graphs with the Plotly library.
The adjoining Python code lists the libraries that must be imported into the Jupyter Notebook to plot interactive graphs. Once the necessary libraries are loaded, we prepare arrays to store the data values to be plotted.
We use the NumPy library to create arrays and store values in them; these arrays supply the x-axis and y-axis values of the plot. After the arrays are passed into the figure, stored in the variable “fig”, we give a title to the entire plot and then to the x and y axes. Calling .show() displays the plotted figure.
Plotly is used to plot interactive line graphs for the Annual, Jan-Feb, March-May, June-September and October-December temperature columns.
Python Code
Exploratory Data Analysis
This dataset consists of a total of 111 entries and 6 columns, namely Year, Annual, Jan-Feb, Mar-May, Jun-Sept and Oct-Dec. There are no null or missing values in the dataset, and all columns except Year have the float data type. We divide the exploratory data analysis of this dataset into two categories: Intuition about the dataset and Data Visualization.
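A quick structural check of this kind might look like the sketch below. The four rows are placeholder values, not the real records; with the real file loaded into a DataFrame, the same calls would be expected to report 111 rows and 6 columns.

```python
import pandas as pd

# Placeholder rows for illustration; column names follow the dataset description.
df = pd.DataFrame({
    "Year": [1901, 1902, 1903, 1904],
    "Annual": [28.9, 29.1, 28.8, 29.0],
    "Jan-Feb": [23.5, 24.9, 23.4, 23.0],
    "Mar-May": [31.4, 31.7, 31.5, 30.8],
    "Jun-Sept": [31.3, 31.0, 30.7, 30.9],
    "Oct-Dec": [27.2, 27.3, 27.0, 26.9],
})

n_rows, n_cols = df.shape                 # number of entries and columns
missing_per_column = df.isnull().sum()    # count of null values in each column
column_types = df.dtypes                  # data type of each column

print(n_rows, n_cols)
print(missing_per_column)
print(column_types)
```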
Intuition:
This part discusses the basic intuition that can be derived by applying statistical functions from the pandas and NumPy libraries to the dataset. The describe() function reports summary statistics for each column, including the mean, the median (50th percentile), and the minimum and maximum values. The main observations are listed below.
Key Observations:
- The mean value of each column is less than the median value (the 50th percentile).
- A significant difference can also be seen between the 75th percentile and the maximum values of the Jan-Feb and Mar-May columns. The difference is large enough to be noticed.
- Observations 1 and 2 suggest that the dataset may contain outliers that we should look out for.
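The comparisons behind these observations can be read directly from the output of describe(). The sketch below uses two short placeholder columns (not the real records) to show how the gap between mean and median, and between the 75th percentile and the maximum, could be computed.

```python
import pandas as pd

# Placeholder values for illustration only.
df = pd.DataFrame({
    "Annual": [28.0, 29.0, 29.2, 29.4, 29.4],
    "Jan-Feb": [24.0, 24.2, 24.4, 24.6, 27.0],
})

summary = df.describe()  # rows: count, mean, std, min, 25%, 50%, 75%, max

# Negative values mean the mean lies below the median (the 50% row).
mean_minus_median = summary.loc["mean"] - summary.loc["50%"]

# A large gap between the 75th percentile and the maximum hints at high outliers.
upper_gap = summary.loc["max"] - summary.loc["75%"]

print(mean_minus_median)
print(upper_gap)
```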
Here we look at each column individually and note the range of its values.
Annual: The values range from 28 to 31 degrees Celsius. The average temperature is 29.33 degrees Celsius and the median is 29.07 degrees Celsius. The most frequent values in the column are 28.76, 28.89, 28.66, 28.80 and 28.70 degrees Celsius.
January - February: In this column the values range from 23 to 27 degrees Celsius. The median is 24.51 degrees Celsius. The most frequent values are 24.99, 24.51 and 23.62 degrees Celsius.
March - May: For this interval, the values range from 31 to 33 degrees Celsius, making it the hottest time of the year. The median temperature is 31.46 degrees Celsius. The most frequent values are 30.84, 31.17 and 31.89 degrees Celsius.
June - September: During this period, the temperature ranges from 31 to 32 degrees Celsius, which differs only slightly from the previous interval. The median temperature is 31.16 degrees Celsius. The most frequent values are 31.55, 31.28, 31.11 and 31.25 degrees Celsius.
October - December: The last few months of the year, also known as the winter period, show a steep drop in temperature, with values ranging from 26 to 28 degrees Celsius and a median of 27.18 degrees Celsius. The most frequent values for this interval are 27.24, 27.50 and 27.26 degrees Celsius.
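Per-column figures like these (range, median, most frequent values) can be obtained with a few pandas calls. The sketch below works on a short illustrative series for the Mar-May column; the values are placeholders chosen for the example, not the full column.

```python
import pandas as pd

# Illustrative values only; the real column has 111 entries.
mar_may = pd.Series(
    [31.17, 30.84, 31.17, 31.89, 30.84, 31.17, 32.40], name="Mar-May"
)

stats = {
    "min": mar_may.min(),
    "max": mar_may.max(),
    "median": mar_may.median(),
}

# value_counts() sorts by frequency, so head(3) gives the three most common values.
most_frequent = mar_may.round(2).value_counts().head(3)

print(stats)
print(most_frequent)
```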
So far, we have used statistical values to derive basic intuition from our dataset. Now we will use visualizations to dive deeper into the data and understand how the columns are correlated. These correlations help in feature selection, i.e. in choosing which columns we should focus on.
First, we use the Seaborn library to plot a heatmap that helps us understand the correlations between columns. Strongly correlated variables can distort prediction models, and outliers can skew values, so both need to be identified. We will examine correlations with a heatmap and look for outliers with a box plot. Furthermore, by plotting distribution curves for the values, we can assess their skewness too. Let's get plotting!
Correlations through Heatmap: As seen in the image below, the lighter blue shades represent a negative correlation between columns whereas darker shades represent a positive correlation. Here we use the range 0.5 to 1.0 to judge how strong a positive correlation is. We can infer that the Annual column has a strong positive correlation with Year and with all the seasonal columns, since the annual value takes the data from the seasonal columns into account.
Since this dataset does not particularly require feature selection, we need not eliminate any features based on this heatmap. When using linear regression, however, it is important to eliminate features that are strongly linearly related to one another, because they can make the model unstable or overfitted. We can also display the correlation values by setting annot=True in the code, as seen in the image below. These heatmaps are very useful for feature selection, though not particularly in this case.
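A correlation heatmap of this kind could be produced roughly as follows. This is a sketch on placeholder data: the colour map "Blues" and the fixed colour range are assumptions made for the example, and the off-screen backend line is only needed outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; this line can be dropped in a notebook
import pandas as pd
import seaborn as sns

# Placeholder values for illustration only.
df = pd.DataFrame({
    "Year": [1901, 1902, 1903, 1904, 1905],
    "Annual": [28.8, 28.9, 29.0, 29.2, 29.3],
    "Jan-Feb": [24.0, 24.3, 24.1, 24.6, 24.8],
})

corr = df.corr()  # pairwise Pearson correlation between columns

# annot=True writes the correlation value inside each cell.
ax = sns.heatmap(corr, annot=True, cmap="Blues", vmin=-1, vmax=1)
ax.set_title("Correlation between columns (placeholder data)")
```

With the "Blues" colour map, lighter cells correspond to lower correlation values and darker cells to higher ones.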
To spot any outliers present in the data, we plot a box-and-whisker plot. A box plot is typically used to display the distribution of quantitative data so that comparisons between features can be seen clearly.
In the box plot, the central rectangle spans the first quartile to the third quartile, also known as the interquartile range (IQR). A line inside the rectangle marks the median, and the whiskers extend to the smallest and largest values that are not outliers. By the conventional rule, points lying more than 1.5 x IQR above the third quartile or below the first quartile are drawn individually as outliers.
Thus we can see that in our dataset none of the columns have outliers.
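The visual check can be complemented with a numerical one. The sketch below flags values that fall outside the whisker limits; the helper name iqr_outliers and the sample values are hypothetical, and the default multiplier of 1.5 is the conventional box-plot setting, exposed as a parameter so a different threshold can be tried.

```python
import pandas as pd

def iqr_outliers(values: pd.Series, whisker: float = 1.5) -> pd.Series:
    """Return the entries lying beyond whisker * IQR from the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - whisker * iqr
    upper = q3 + whisker * iqr
    return values[(values < lower) | (values > upper)]

# Placeholder values: the last entry is deliberately extreme.
sample = pd.Series([27.0, 27.1, 27.2, 27.3, 27.4, 30.0], name="Oct-Dec")

flagged = iqr_outliers(sample)
print(flagged)
```

An empty result for every column would agree with a box plot that shows no points beyond the whiskers.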
We plot a distribution graph to check whether the values are normally distributed and to look for skewness in the features.
As you can see from the image above, except for Annual and Oct-Dec columns all the others are normally distributed. Annual plot seems to be left skewed whereas the Oct-Dec plot seems to be right skewed.
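Skewness can also be quantified rather than judged by eye. In the sketch below, pandas' skew() returns a negative value for a left-skewed column, a positive value for a right-skewed one, and a value near zero for a symmetric one. The three columns hold placeholder values constructed to show each case.

```python
import pandas as pd

# Placeholder values built to illustrate the three cases.
df = pd.DataFrame({
    "Annual": [28.0, 29.2, 29.3, 29.4, 29.5],   # long tail on the left
    "Jan-Feb": [24.0, 24.5, 25.0, 25.5, 26.0],  # symmetric
    "Oct-Dec": [27.0, 27.1, 27.2, 27.3, 28.5],  # long tail on the right
})

skewness = df.skew()  # sample skewness of each column
print(skewness)
```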
This ends the exploratory data analysis of our dataset. In the next section we focus on the descriptive analysis of the data, plotting line plots to better understand the trends in this dataset.
Data Analysis
1. Annual Temperature Line Plot: The line plot shows annual temperature over a period of 110 years (1901-2011). The temperature lies in the range of 29 to 30 degrees Celsius. According to the line trends, there are significant drops in the years 1920, 1950 and 2000. Similarly, significant rises in temperature can be seen in the years 1940, 1959 and 1999. In the years 1921 and 1961, the temperature is slightly below the average, whereas in 2001 the temperature spiked to 29.99 degrees Celsius (almost 30).
The main observation for the annual plot is that there has been a notable rise in temperature since 1981.
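One simple way to put a number on such a rise is a straight-line fit. The sketch below uses NumPy's polyfit on a synthetic, perfectly linear series (an assumption made so the result is easy to check); applied to the real Annual column from 1981 onwards, the slope would estimate the warming rate in degrees Celsius per year.

```python
import numpy as np

# Synthetic series for illustration: 29.0 °C in 1981, rising 0.02 °C per year.
years = np.arange(1981, 1991)
annual = 29.0 + 0.02 * (years - 1981)

# A degree-1 polynomial fit returns the slope and intercept of the trendline.
slope, intercept = np.polyfit(years, annual, deg=1)
trend = slope * years + intercept   # fitted values, ready to plot as a line

print(f"Trend: {slope * 10:.2f} °C per decade")
```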
Plots
2. January-February Line Plot: The line plot shows the temperature for the months of January and February. The temperature ranges from 23 to 27 degrees Celsius for this time of the year. The median is 24.51 degrees Celsius. Rises in temperature were seen in the years 1902, 1946, 1966 and 2006, whereas drops were seen in the years 1905, 1968 and 2008.
The highest temperature recorded over the period was 27.44 degrees Celsius in 2006, whereas the lowest was 22.25 degrees Celsius in 1905.
3. March – May Line Plot: The line plot shows the temperature for the months of March, April and May. For this interval, the values range from 31 to 33 degrees Celsius, making it the hottest time of the year. The median temperature is 31.46 degrees Celsius. The most frequent values in the dataset are 30.84, 31.17 and 31.89 degrees Celsius.
Rises in temperature were seen in the years 1985, 2002 and 2011, whereas drops were seen in the years 1907, 1917, 1926, 1933 and 1957. The highest temperature recorded over the period was 33.46 degrees Celsius in 2010, whereas the lowest was 29.92 degrees Celsius in 1907.
4. June – September Line Plot: The line plot shows the temperature for the months of June, July, August and September. The temperature ranges from 31.2 to 31.5 degrees Celsius for this time of the year. Rises in temperature were seen in the years 1919, 1945, 1970, 1985 and 2011, whereas drops were seen in the years 1920, 1938, 1959 and 1979.
The highest temperature recorded over the period was 32.24 degrees Celsius in 1987 and 2009, whereas the lowest was 30.2 degrees Celsius in 1956.
5. October – December Line Plot: This graph plots the temperature readings for October, November and December, i.e. the last few months of the year, or the winter period in India. The temperature is in the range of 27 to 28 degrees Celsius. The highest recorded temperature is 28.5 degrees Celsius in 1995, while the lowest is 25.7 degrees Celsius in 1917. Spikes in temperature were recorded in 1922, 1941, 1955, 1999 and 2001, whereas drops were seen in 1919, 1961 and 2000.
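The hottest and coldest years in a column can be looked up with idxmax() and idxmin() once Year is used as the index. In the sketch below, the 1917 and 1995 values are the ones quoted above for October to December; the other two rows are placeholders added for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Year": [1917, 1950, 1995, 2000],
    # 25.7 (1917) and 28.5 (1995) are quoted in the text; the rest are placeholders.
    "Oct-Dec": [25.7, 27.2, 28.5, 27.0],
}).set_index("Year")

hottest_year = df["Oct-Dec"].idxmax()   # index label (year) of the maximum
coldest_year = df["Oct-Dec"].idxmin()   # index label (year) of the minimum

print(hottest_year, df.loc[hottest_year, "Oct-Dec"])
print(coldest_year, df.loc[coldest_year, "Oct-Dec"])
```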
Learning outcomes for Data Science:
- Learn about basic Exploratory Analysis of a dataset to find the number of records and columns, to check null values etc.
- Learn how to calculate the mean, standard deviation, and minimum and maximum values of various columns.
- Learn how to prepare a dataset and plot graphs using Python.
- Learn the use of libraries like NumPy, pandas and Matplotlib for Python.
- Learn how to plot interactive graphs on a platform called 'Plotly'.
Learning outcomes for climate change:
The tools in this lesson plan will enable students to:
- learn about climate change and global warming with the help of the analysis of India's long term temperature records
- use Python functions to calculate and describe temperature anomalies in India from the beginning of the 20th century to recent times (1901-2011)
- discuss how these changes suggest that the planet has warmed significantly since the beginning of the industrial age
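A temperature anomaly is the difference between a year's temperature and the average over a reference period. The sketch below shows one way to compute it; the baseline period (the first three years of the placeholder series) is an assumption made for the example, and the values are not the real records.

```python
import pandas as pd

# Placeholder annual temperatures indexed by year.
annual = pd.Series(
    [29.0, 29.2, 28.8, 29.6, 29.9],
    index=[1901, 1902, 1903, 2010, 2011],
    name="Annual",
)

# Hypothetical baseline: the mean over 1901-1903 (label slicing is inclusive).
baseline = annual.loc[1901:1903].mean()

# Positive anomalies are warmer than the baseline, negative ones cooler.
anomaly = annual - baseline
print(anomaly)
```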
Suggested questions/assignments for learning evaluation :
Questions for evaluation of Data Analysis:
- What is Exploratory Data Analysis? Why is it important?
- What type of pre-processing of the data is required before the analysis?
- How to perform Exploratory Data Analysis of a dataset using Python functions and libraries?
- How to plot line plots using Python functions and libraries?
Questions for climate change understanding:
- Has the annual mean temperature of India been increasing since 1980?
- What is the difference between the average annual temperature, Jan-Feb temperature, March-May temperature, June-Sept temperature and October-December temperature?
- What is the latest rate of change of global average temperatures according to the last recorded data point (2018)?
- What are the trends in the temperature of India over the past 100 years?
- Can you find a trendline for the given temperature records?
- Can you draw an overall conclusion by looking at the line plot of the annual average temperature of India?
Assignment:
Download the original dataset and Python code files. Try to run the code and perform the Exploratory Data Analysis using Python studio (offline) or Jupyter Notebook (online).
The downloadable dataset and Python code file are available on the TROP ICSU GitHub page.
If you or your students would like to explore the topic further, these additional resources will be useful.
Use the interactive visualization ‘Average temperature anomaly, Global’ by Our World in Data, to encourage discussion amongst your students about the changes in the average global temperatures from the years 1850-2017.
Discuss how these changes suggest that the planet is warming and therefore, could be impacting Earth’s climate.
These can be accessed here.
Download Original Data in MS Excel format
Download Lesson Plan
Credits and copyrights
- Dataset : Open Government Platform India
- Additional Resource : visualization ‘Average temperature anomaly, Global’ by Our World in Data
- Image : Commission for Environmental Cooperation