
Data Visualization with Pandas

Programmingempire

In this post on data visualization with Pandas, I will show how to visualize data by plotting various kinds of charts with the pandas library for Python. The library provides several plotting functions that are simple to use and highly customizable, so you can create a variety of plots from a data frame with very little code.

Plotting Functions in Pandas Library

The pandas library offers several plotting functions for analyzing data visually. Let us consider them one by one. The examples that follow use a dataset downloaded from www.kaggle.com.

About the Dataset

It is a dataset of credit card customers that records the total transaction amount of each customer along with other fields, as shown below.

Example of the Dataset

import pandas as pd

df=pd.read_csv("BankChurners.csv", sep=",")
print(df.dtypes)

Output

Credit Card Customers Dataset

In this article, I use this dataset for most of the charts, along with a second dataset of temperature and relative humidity readings recorded over a single day. The second dataset comes from an IoT application that uses a DHT11 temperature and humidity sensor.

Here, I will discuss the following charts.

  • Pie Chart
  • Histogram
  • Scatter Plot
  • Line Plot
  • Area Plot
  • Box Plot
  • Hexbin Plot
  • Kernel Density Estimation (KDE) Plot

Let us start with the pie chart. To run the examples, download the CSV file from Kaggle and place it in the current directory.

Pie Chart

A pie chart represents the contribution of different categories to a whole. In other words, it shows the proportion of each category as a slice. Here, three columns of the dataset serve as categories in turn: Education_Level, Marital_Status, and Gender. Each chart is plotted against the field Total_Trans_Amt, which represents the total transaction amount. However, we need to group the data before plotting it. Therefore, we perform the following steps for each category column:

  1. Create another data frame by extracting the two desired columns (the category and Total_Trans_Amt)
  2. Group the rows by category and find the sum of the transaction amount for each category
  3. Plot the pie chart for the grouped data

Code Example

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("BankChurners.csv", sep=",")
print(df.head())
print(df.describe())
print(df.dtypes)

colors = ['yellow', 'red', 'orange', 'navy', 'pink', 'green', 'brown']

# Draw one pie chart for each categorical column
for category in ['Education_Level', 'Marital_Status', 'Gender']:
    # Step 1: keep only the category and the transaction amount
    df1 = df[[category, 'Total_Trans_Amt']]

    # Step 2: total transaction amount for each value of the category
    df2 = df1.groupby(category).sum()
    print(df2)

    # Step 3: one slice per category value
    df2.plot.pie(y='Total_Trans_Amt', labels=df2.index, figsize=(8, 6), colors=colors)
    plt.show()

Output

Plotting Pie Charts with Pandas
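
To make the grouping step concrete, here is a minimal sketch that computes the percentage share behind each slice. The five rows are made-up values standing in for BankChurners.csv, and the names toy, totals, and shares are my own, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy rows standing in for BankChurners.csv
toy = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F"],
    "Total_Trans_Amt": [1000, 3000, 2000, 1000, 3000],
})

# Total transaction amount per category (the values the pie chart receives)
totals = toy.groupby("Gender")["Total_Trans_Amt"].sum()

# Share of the grand total that each slice represents, in percent
shares = (totals / totals.sum() * 100).round(1)
print(shares)
```

If you want the same percentages printed on the chart, matplotlib's pie function accepts an autopct format string such as '%1.1f%%', and pandas forwards extra keyword arguments of plot.pie to it.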

Histogram

Another important chart is the histogram, which displays rectangles whose areas are proportional to the frequency of the values that fall in each class interval. Because all rectangles here have the same width, equal to the class interval, their heights are also proportional to the frequency.
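
The counting that a histogram performs can be sketched with NumPy on a handful of values. The ages below are invented for illustration and only stand in for the Customer_Age column. The bin range of 20 to 60 is also my assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical ages standing in for the Customer_Age column
ages = pd.Series([26, 31, 34, 38, 41, 43, 45, 47, 52, 58], name="Customer_Age")

# Four equal-width bins between 20 and 60, so the class interval is 10
counts, edges = np.histogram(ages, bins=4, range=(20, 60))

# Print each class interval with its frequency (the height of its rectangle)
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count}")
```

The hist function of a data frame takes a bins parameter as well, so the number of rectangles in the plots can be controlled in the same way.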

Consider the following examples of creating histograms using the same dataset of credit card customers.

Code Example

import pandas as pd
import matplotlib.pyplot as plt

df=pd.read_csv("BankChurners.csv", sep=",")
print(df.head())
print(df.describe())
print(df.dtypes)

df1=df[['Total_Trans_Amt', 'Education_Level', 'Customer_Age']]
df1.hist(column='Total_Trans_Amt')
plt.show()

df1.hist(column='Customer_Age')
plt.show()


# One histogram per education level; xrot is a rotation angle in degrees, so pass a number
df1.hist(column='Customer_Age', by='Education_Level', xrot=90, figsize=(10,12))
plt.show()

The following histogram shows the frequency distribution of the Total Transaction Amount and reveals the range in which most transaction totals fall.

Histogram for Total Transaction Amount

Next, the histogram for Customer Age shows which age group has the most customers.

Histogram for Customer Age

Finally, the following histograms show the age distribution of customers for each category of Education Level. As the figure shows, the by parameter produces a separate histogram for each category.

Category-wise plot of Histogram

Scatter Plot

Suppose you want to see the relationship between individual data points as well as the pattern exhibited by the data as a whole. In that case, a scatter plot is best suited. We use a scatter plot to examine the correlation between two sets of numeric values. For instance, we can use it to show how a student's attendance can affect his or her result.

As an example, the scatter plot below represents the correlation between two attributes, Total Transaction Amount and Total Transaction Count.

Code Example

import pandas as pd
import matplotlib.pyplot as plt

df=pd.read_csv("BankChurners.csv", sep=",")
print(df.head())
print(df.describe())
print(df.dtypes)

df1=df[['Total_Trans_Amt', 'Total_Trans_Ct']]
df1.plot.scatter(x='Total_Trans_Amt', y='Total_Trans_Ct', c='#cb4154')
plt.show()

In the following plot, Total Transaction Count is shown on the y-axis, whereas Total Transaction Amount is shown on the x-axis.

Scatter Plot Showing the Correlation
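
A scatter plot shows correlation visually. If you also want a number for it, the Pearson correlation coefficient can be computed with the corr method of a pandas Series. The following is only a sketch on five invented rows. The values are not taken from BankChurners.csv.

```python
import pandas as pd

# Hypothetical values standing in for the two transaction columns
toy = pd.DataFrame({
    "Total_Trans_Amt": [1000, 2000, 3000, 4000, 5000],
    "Total_Trans_Ct": [20, 35, 50, 70, 80],
})

# Pearson correlation coefficient: +1 is a perfect increasing linear
# relationship, 0 is no linear relationship, -1 is a perfect decreasing one
r = toy["Total_Trans_Amt"].corr(toy["Total_Trans_Ct"])
print(round(r, 3))
```

A value close to +1, as in this toy data, corresponds to points that lie near a rising straight line in the scatter plot.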

Line Plot

Whenever we want to see how the values of a variable change over a period of time, we can use a line plot. A line plot can also reveal the relationship between two different data series. For this example, we use a CSV file called roomdata.csv, which holds the temperature and humidity readings described earlier.

Code Example

import pandas as pd
import matplotlib.pyplot as plt

df=pd.read_csv("roomdata.csv")
print(df.head())
print(df.describe())
print(df.dtypes)

df1=df[['created_at', 'Temperature', 'Humidity']]

df1.plot.line()
plt.title('Temperature and Humidity variation in a Day')
plt.xlabel('Time of Day')

plt.show()

df1.plot.line(x='created_at', y='Temperature', figsize=(20, 6))
plt.title('Temperature variation in a Day')
plt.xlabel('Time of Day')
plt.show()

df1.plot.line(x='created_at', y='Humidity', figsize=(20, 6))
plt.title('Humidity variation in a Day')
plt.xlabel('Time of Day')
plt.show()

The first chart below shows the variation in both temperature and humidity together, and the next two charts show the variation in a single specified column.

Line Plot for Both Temperature and Humidity Variations in a Day
Line Plot for Temperature Variations in a Day
Line Plot for Humidity Variations in a Day
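
When a column such as created_at is read from a CSV file, it arrives as plain text. One possible refinement is to convert it to real timestamps and use it as the index, so that pandas places the points on a proper time axis. The sketch below assumes a timestamp format and uses three invented readings, since the actual layout of roomdata.csv is not shown here. The Agg backend is selected only so that the sketch runs without opening a window.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import pandas as pd

# Hypothetical readings; the real timestamp format in roomdata.csv may differ
readings = pd.DataFrame({
    "created_at": ["2021-03-01 09:00:00", "2021-03-01 09:30:00", "2021-03-01 10:00:00"],
    "Temperature": [24.0, 25.5, 27.0],
    "Humidity": [48.0, 46.0, 43.0],
})

# Parse the text into timestamps and use them as the row index
readings["created_at"] = pd.to_datetime(readings["created_at"])
readings = readings.set_index("created_at")

# With a DatetimeIndex, plot.line() uses the timestamps for the x-axis
ax = readings.plot.line(figsize=(10, 4), title="Temperature and Humidity variation in a Day")
ax.set_xlabel("Time of Day")
```

In an interactive session, you would drop the matplotlib.use line and call plt.show() as in the examples above.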

Area Plot

An area plot is a line chart in which the region between the line and the axis is filled in. It is commonly used to show trends rather than to compare specific values. The following code demonstrates the use of the area plot for all columns in the data frame as well as for specified individual columns.

Code Example

import pandas as pd
import matplotlib.pyplot as plt

df=pd.read_csv("roomdata.csv")
print(df.head())
print(df.describe())
print(df.dtypes)

df1=df[['created_at', 'Temperature', 'Humidity']]
df1.plot.area(colormap='prism')
plt.title('Temperature and Humidity variation in a Day')
plt.xlabel('Time of Day')

plt.show()

df1.plot.area(x='created_at', y='Temperature', figsize=(20, 6), colormap='Spectral')
plt.title('Temperature variation in a Day')
plt.xlabel('Time of Day')
plt.show()

df1.plot.area(x='created_at', y='Humidity', figsize=(20, 6), colormap='rainbow')
plt.title('Humidity variation in a Day')
plt.xlabel('Time of Day')
plt.show()

Output

Area Plot for Temperature and Humidity Variation
Temperature Variation
Humidity Variation
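
One detail worth knowing is that plot.area() stacks the columns by default, so each filled band is drawn on top of the previous one rather than from zero. The sketch below uses three invented readings to show the heights that a stacked plot draws, and then the stacked=False variant that overlays the areas instead. The Agg backend is selected only so that the sketch runs without opening a window.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import pandas as pd

# Hypothetical readings standing in for roomdata.csv
readings = pd.DataFrame({
    "Temperature": [24.0, 25.5, 27.0],
    "Humidity": [48.0, 46.0, 43.0],
})

# In a stacked area plot the upper edge of each band is a running sum
# across the columns, so the Humidity band ends at Temperature + Humidity
stacked_tops = readings.cumsum(axis=1)
print(stacked_tops)

# Default behaviour: bands stacked on top of each other
ax_stacked = readings.plot.area()

# stacked=False: each area starts at zero and the areas overlap
ax_overlaid = readings.plot.area(stacked=False)
```

The stacked form is useful when the sum of the columns is meaningful. For temperature and humidity it is not, so the overlaid form is probably easier to read.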

Box Plot

To summarize the data, you can use a box plot. It shows the median, the quartiles, the overall spread, and any outliers of the data in a single figure, as in the following example.

Code Example

import pandas as pd
import matplotlib.pyplot as plt

df=pd.read_csv("roomdata.csv")
print(df.head())
print(df.describe())
print(df.dtypes)

df.boxplot(column=['Temperature', 'Humidity'])
plt.title('Distribution of Temperature and Humidity in a Day')
# The x-axis of a box plot lists the plotted columns, not the time of day
plt.xlabel('Measurement')
plt.show()

As the box plot below shows, the variation in temperature over the day is much smaller than the variation in relative humidity.

Box Plot Example
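
The numbers behind a box plot can be computed directly. The sketch below uses nine invented temperature values and assumes matplotlib's default whisker rule, in which points more than 1.5 times the interquartile range away from the box are drawn as outliers.

```python
import pandas as pd

# Hypothetical readings; the last value is deliberately extreme
temps = pd.Series([22.0, 23.0, 24.0, 25.0, 26.0, 27.0, 28.0, 29.0, 40.0],
                  name="Temperature")

# The box spans the first to the third quartile; the line inside is the median
q1, median, q3 = temps.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Fences for the default whisker rule (1.5 * IQR beyond the box)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Values outside the fences are plotted as individual outlier points
outliers = temps[(temps < lower_fence) | (temps > upper_fence)]
print(q1, median, q3)
print(outliers.tolist())
```

Each whisker extends to the most extreme value that still lies inside its fence, so here the upper whisker would end at 29.0.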

Another example of the box plot is given next. Here the box plot is drawn for the Total Transaction Amount of credit card customers, grouped by the Education Level of the customers. Also, note how we can assign a color to each component of the plot: the medians, caps, whiskers, and boxes.

Code Example

import pandas as pd
import matplotlib.pyplot as plt

df=pd.read_csv("BankChurners.csv")
print(df.head())
print(df.describe())
print(df.dtypes)

df.boxplot(column=['Total_Trans_Amt'], by='Education_Level', figsize=(10,6), color=dict(boxes='r', whiskers='g', medians='y', caps='b'))
plt.show()

Output

Box Plot with Categories

Hexbin Plot

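Sometimes we have a large amount of data to plot, and the points in a scatter plot overlap so heavily that the pattern is hard to see. In such cases we can use a hexbin plot instead of a scatter plot. A hexbin plot divides the plane into hexagonal bins and colors each bin by the number of points it contains, so it shows where the data is densest. The gridsize parameter sets the number of hexagons in the x-direction.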

Code Example

import pandas as pd
import matplotlib.pyplot as plt

df=pd.read_csv("BankChurners.csv")
print(df.head())
print(df.describe())
print(df.dtypes)

df.plot.hexbin(x='Total_Trans_Amt', y='Total_Trans_Ct', cmap='Oranges', gridsize=30)
plt.title('Transaction Amount vs. Transaction Count')
plt.xlabel('Total Transaction Amount')
plt.ylabel('Total Transaction Count')
plt.show()

Output

Hexbin Plot Example
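
To see what the hexagons hold, the following sketch generates 5,000 random points and reads the per-bin counts back from the plot. The data is synthetic, the relationship between the two columns is invented, and the Agg backend is selected only so that the sketch runs without opening a window.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend, no window is opened
import numpy as np
import pandas as pd

# Synthetic data: the transaction count rises with the amount, plus noise
rng = np.random.default_rng(0)
n = 5000
amount = rng.normal(4000, 1000, n)
count = amount / 60 + rng.normal(0, 10, n)
toy = pd.DataFrame({"Total_Trans_Amt": amount, "Total_Trans_Ct": count})

# A smaller gridsize gives fewer, larger hexagons
ax = toy.plot.hexbin(x="Total_Trans_Amt", y="Total_Trans_Ct",
                     gridsize=20, cmap="Oranges")

# The hexagons are the first collection on the axes; its array holds
# the number of points that fell into each hexagon
bin_counts = ax.collections[0].get_array()
print("points in the densest hexagon:", int(bin_counts.max()))
```

The color bar that pandas draws next to the plot maps these counts to colors.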

Kernel Density Estimation (KDE) Plot

Another plot for visualizing the data distribution is the KDE plot. A Kernel Density Estimation (KDE) plot shows the data distribution as a smooth, continuous estimate of the probability density function. It is typically used for continuous variables, and it lets us analyze their probability distribution without the binning that a histogram requires. The bw_method parameter controls the bandwidth, that is, how strongly the curve is smoothed.

Code Example

import pandas as pd
import matplotlib.pyplot as plt
# Note: plot.kde() requires SciPy to be installed; pandas imports it internally

df=pd.read_csv("roomdata.csv")
print(df.head())
print(df.describe())
print(df.dtypes)

df1=df[['created_at', 'Temperature', 'Humidity']]

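# Default bandwidth
df1.plot.kde(cmap='rainbow')
plt.title('Density of Temperature and Humidity')
plt.show()

# A smaller bw_method gives a narrower kernel and a more detailed curve
df1.plot.kde(bw_method=0.3, cmap='winter')
plt.title('Density of Temperature and Humidity (bw_method=0.3)')
plt.show()

# A larger bw_method gives a wider kernel and a smoother, flatter curve
df1.plot.kde(bw_method=3, cmap='prism')
plt.title('Density of Temperature and Humidity (bw_method=3)')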
plt.show()

Output

KDE Plot Example
KDE Plot with bw_method=0.3
KDE Plot with bw_method=3
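
The effect of bw_method can also be checked numerically. Pandas relies on SciPy for the estimate, so the sketch below calls scipy.stats.gaussian_kde directly on synthetic data with two clusters, one around 22 and one around 30. The data is invented for illustration and is not taken from roomdata.csv.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic readings with two clusters, around 22 and around 30
rng = np.random.default_rng(1)
temps = np.concatenate([rng.normal(22, 1, 200), rng.normal(30, 1, 200)])

# A scalar bw_method is used as the bandwidth factor of the kernel
narrow = gaussian_kde(temps, bw_method=0.3)
wide = gaussian_kde(temps, bw_method=3)

# Evaluate both estimates at the two cluster centres and midway between them
points = np.array([22.0, 26.0, 30.0])
narrow_vals = narrow(points)
wide_vals = wide(points)

# Narrow kernel: the density dips between the clusters (two peaks survive)
print("narrow:", narrow_vals.round(4))
# Wide kernel: the clusters merge into one broad bump centred near 26
print("wide:  ", wide_vals.round(4))
```

This is the same trade-off that the three plots illustrate: a small bandwidth keeps detail but may show noise, while a large bandwidth can smooth real structure away.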

Summary

To sum up, pandas provides a variety of plots for data visualization. These plots enable you to visualize individual data points, compare values, find correlations, and view the distribution of the data as a whole.
