Data Analysis with Python: NumPy, Pandas and Matplotlib (Basic)

I'm a full-stack web developer with Django(Python) and React js as my main stack. I'm also a beginner writer, who loves to write tutorials for different purposes.

Python is an incredibly versatile language, and one of its strengths is its ability to work with data. With various libraries available, Python makes it easy to manipulate, analyze, and visualize data in a way that is both intuitive and powerful.

The most popular libraries for data analysis in Python are NumPy, Pandas and Matplotlib. NumPy is a library for working with numerical data. Pandas is built on top of NumPy and provides more advanced functionality for data manipulation and analysis. Matplotlib plots data as charts and graphs. In this article, we'll explore how to use these libraries to analyze data in Python.

P.S.: This tutorial assumes that you have a basic understanding of Python and know how to set up a virtual environment to keep each project's dependencies separate.

Installing NumPy, Pandas and Matplotlib

Before we can start analyzing data with NumPy, Pandas and Matplotlib, we need to install them, which is a simple process.

pip install numpy pandas matplotlib

This will download and install NumPy, Pandas and Matplotlib on your system. With these packages installed, we're ready to start working with data.

NumPy Basics

NumPy provides a powerful set of tools for working with numerical data in Python. One of the most important features of NumPy is the ndarray object, which provides a fast and efficient way to store and manipulate arrays of numerical data.

To get started with NumPy, we first need to import the library:

import numpy as np

This imports NumPy under the alias "np" for convenience. Creating an array in NumPy is simple. We can create a one-dimensional array (also called a vector) by passing a Python list to the np.array() function:

import numpy as np
x = np.array([1,2,3])
print(x)

Output:

[1 2 3]

We can also create two-dimensional arrays (matrices) by passing a list of lists to the np.array() function:

y = np.array([[1,2,3], [4,5,6]])
print(y)

Output:

[[1 2 3]
 [4 5 6]]
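
Every array carries a few attributes that describe its layout. As a small sketch (the variable name `m` is only illustrative), we can inspect the number of dimensions, the shape, the element count and the element type of the matrix above:

```python
import numpy as np

# Same 2x3 matrix as above; the name m is only illustrative
m = np.array([[1, 2, 3], [4, 5, 6]])

print(m.ndim)   # number of dimensions
print(m.shape)  # (rows, columns)
print(m.size)   # total number of elements
print(m.dtype)  # element type, chosen by NumPy from the input values
```

The exact integer dtype that NumPy picks can differ between platforms, so it is worth printing it rather than assuming it.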

NumPy provides a number of functions for creating arrays filled with specific values. For example, np.zeros() creates an array of zeros and np.ones() creates an array of ones. Both take the desired shape as a tuple:

z = np.zeros((3,3))
print(z)

Output:

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]

a = np.ones((4,4))
print(a)

Output:

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
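
Besides zeros() and ones(), a few other creation functions come up often. The following is a minimal sketch using np.arange(), np.linspace() and np.full(); the variable names are only illustrative:

```python
import numpy as np

evens = np.arange(0, 10, 2)      # start, stop (exclusive), step
grid = np.linspace(0.0, 1.0, 5)  # 5 evenly spaced values, both ends included
sevens = np.full((2, 2), 7)      # 2x2 array filled with the value 7

print(evens)
print(grid)
print(sevens)
```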

NumPy provides a lot of functions for manipulating arrays. For example, we can reshape an array with the reshape() function:

a = np.array([1,2,3,4,5,6])
b = a.reshape((2,3))
print(b)

Output:

[[1 2 3]
 [4 5 6]]
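
The new shape has to account for every element. As a small sketch (the names `values` and `three_rows` are only illustrative), we can let NumPy infer one dimension with -1 and see what happens when the shape does not fit:

```python
import numpy as np

values = np.arange(1, 13)  # 12 elements: 1..12

# -1 asks NumPy to infer that dimension from the total size
three_rows = values.reshape((3, -1))
print(three_rows.shape)

# A shape that does not match the element count raises ValueError
try:
    values.reshape((5, 3))
except ValueError as err:
    print("reshape failed:", err)
```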

We can also perform mathematical operations on arrays. For example, we can add two arrays together:

c = np.array([1,2,3])
d = np.array([4,5,6])
e = c + d
print(e)

Output:

[5 7 9]

Similarly, we can multiply two arrays together:

f = np.array([1,2,3])
g = np.array([4,5,6])
h = f * g
print(h)

Output:

[ 4 10 18]
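
Note that * multiplies arrays element by element; it is not a matrix product. The sketch below contrasts it with the dot product (the @ operator) and with multiplying by a single number:

```python
import numpy as np

f = np.array([1, 2, 3])
g = np.array([4, 5, 6])

elementwise = f * g  # multiplies matching positions
dot = f @ g          # dot product: 1*4 + 2*5 + 3*6
scaled = f * 10      # a scalar is applied to every element

print(elementwise)
print(dot)
print(scaled)
```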

NumPy also provides functions for computing basic statistics on arrays. For example, we can compute the mean, median, and standard deviation of an array:

i = np.array([1,2,3,4,5,6])
print(np.mean(i))
print(np.median(i))
print(np.std(i))

Output:

3.5
3.5
1.707825127659933
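
By default np.std() computes the population standard deviation (it divides by N). The sketch below shows the sample version via the ddof parameter and how the axis parameter computes statistics per column or per row; the variable names are only illustrative:

```python
import numpy as np

i = np.array([1, 2, 3, 4, 5, 6])

population_std = np.std(i)      # divides by N (the default, ddof=0)
sample_std = np.std(i, ddof=1)  # divides by N - 1

table = i.reshape((2, 3))
column_means = table.mean(axis=0)  # mean of each column
row_means = table.mean(axis=1)     # mean of each row

print(population_std, sample_std)
print(column_means)
print(row_means)
```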

Pandas Basics

While NumPy provides a powerful set of tools for working with numerical data, it doesn't provide much in the way of data manipulation or analysis. This is where Pandas comes in.

Pandas provides two main classes for working with data: Series and DataFrame. A Series is a one-dimensional array-like object, while a DataFrame is a two-dimensional table-like object. Both of these classes provide a rich set of functions for manipulating and analyzing data.

Creating a Series in Pandas is simple. We can create a Series from a Python list:

import pandas as pd
s = pd.Series([1,2,3])
print(s)

Output:

0    1
1    2
2    3
dtype: int64

The first column of the output shows the index of the Series, which is automatically generated by Pandas. We can also specify our own index:

t = pd.Series([1,2,3], index=['a', 'b', 'c'])
print(t)

Output:

a    1
b    2
c    3
dtype: int64
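
A custom index lets us look values up by label. A short sketch of label-based and position-based access on the Series above:

```python
import pandas as pd

t = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

print(t['b'])         # lookup by label
print(t.loc['c'])     # explicit label-based lookup
print(t.iloc[0])      # explicit position-based lookup
print(t[['a', 'c']])  # a list of labels returns a smaller Series
```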

Creating a DataFrame in Pandas is also simple. We can create a DataFrame from a Python dictionary:

u = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})
print(u)

Output:

     Name  Age
0   Shady   25
1    Tony   30
2  Amanda   35

Like Series, DataFrames also have an index, which is generated automatically if not specified.
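
Before analyzing a DataFrame it helps to look at its basic layout. The following sketch, with the illustrative name `people`, checks the shape and column labels and pulls out a single column and a single row:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})

print(people.shape)          # (rows, columns)
print(list(people.columns))  # column labels

ages = people['Age']         # a single column is a Series
print(type(ages).__name__)

first_row = people.iloc[0]   # first row, selected by position
print(first_row['Name'])
```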

Pandas provides a rich set of functions for manipulating and analyzing data. For example, we can filter a DataFrame to include only rows that meet a certain condition:

v = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})
w = v[v['Age'] > 30]
print(w)

Output:

     Name  Age
2  Amanda   35
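
Conditions can also be combined. As a small sketch, each condition goes in its own parentheses and is joined with & (and) or | (or); the names `between` and `either` are only illustrative:

```python
import pandas as pd

v = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})

# Each condition needs its own parentheses; use & for "and", | for "or"
between = v[(v['Age'] >= 30) & (v['Name'] != 'Amanda')]
either = v[(v['Age'] < 30) | (v['Name'] == 'Amanda')]

print(between)
print(either)
```

Python's plain `and` and `or` keywords do not work here, because the comparison produces a whole column of True/False values rather than a single one.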

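We can also sort a DataFrame by a certain column. sort_values() sorts in ascending order by default, so this already-sorted example comes back in the same order: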

x = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})
y = x.sort_values(by='Age')
print(y)

Output:

     Name  Age
0   Shady   25
1    Tony   30
2  Amanda   35
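
To make the effect of sorting visible, here is a short sketch that sorts the same data in descending order and by a text column; the variable names are only illustrative:

```python
import pandas as pd

people = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})

oldest_first = people.sort_values(by='Age', ascending=False)
by_name = people.sort_values(by='Name')  # text columns sort alphabetically

print(oldest_first)
print(by_name)
```

The original index labels travel with the rows. If a fresh 0, 1, 2 index is preferred, reset_index(drop=True) can be chained after the sort.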

Another useful function is groupby(), which allows us to group rows of a DataFrame by a certain column:

z = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Gender': ['F', 'M', 'M'], 'Age': [25, 30, 35]})
a = z.groupby('Gender')['Age'].mean()
print(a)

Output:

Gender
F    25.0
M    32.5
Name: Age, dtype: float64

This groups the rows of the DataFrame by the "Gender" column and computes the mean of the "Age" column for each group.
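
A group can be summarized with more than one statistic at a time. As a sketch on the same data, agg() accepts a list of function names and returns one column per statistic:

```python
import pandas as pd

z = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                  'Gender': ['F', 'M', 'M'],
                  'Age': [25, 30, 35]})

# One row per group, one column per statistic
summary = z.groupby('Gender')['Age'].agg(['count', 'mean', 'max'])
print(summary)
```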

Now that we know some basic functionalities of NumPy and Pandas, let's move on to see how we can use these libraries for data analysis.

Data Analysis with NumPy and Pandas

Suppose we have a CSV file containing some data we want to analyze. We can read the data into a Pandas DataFrame using the read_csv() function:

sales = pd.read_csv('sales_data.csv')
sales.head(10)

Output:

id  store  product_group  product_code  stock_qty  cost  price  last_week_sales  last_month_sales
0100UlerPG34000731921.35516.92991412
1101PaladinPG1400180984.39436.05121250
2102UlerPG340022729608.01838.54511970
3103UlerPG240032217577.9762.6111737
4104PaladinPG140043652633.661318.50881727
5105UlerPG340052183254.731274.13381264
6106AlpinoPG14006393885.61566.2822745
7107UlerPG240072879581.86385.215719
8108AlpinoPG140084943287.91952.56251360
9109PaladinPG140093273786.10754.781001610

This assumes that the CSV file is in the current working directory, which is usually the directory we run our Python script from. If the file is somewhere else, we should pass the full path to it, for example '/home/user/my_project/data.csv'.
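
If you do not have the file at hand, read_csv() can also read from an in-memory text buffer. The sketch below uses a few made-up rows that only mimic the column layout of the sales data; the store names and numbers are hypothetical and are not taken from sales_data.csv:

```python
import io
import pandas as pd

# Made-up rows that only mimic the column layout; not the real sales_data.csv
csv_text = """id,store,product_group,product_code,stock_qty,cost,price,last_week_sales,last_month_sales
1,StoreA,PG1,9001,10,5.50,8.00,3,12
2,StoreB,PG2,9002,20,7.25,9.50,4,15
3,StoreA,PG1,9003,30,6.00,7.75,5,20
"""

demo = pd.read_csv(io.StringIO(csv_text))
print(demo.shape)   # (rows, columns)
print(demo.dtypes)  # column types inferred by Pandas
```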

Once we have the data in the DataFrame, we can use Pandas functions to analyze it. Our CSV file contains sales data for different stores: product group, product code, stock quantity, cost, price, and last week's and last month's sales. We can compute some statistics based on this data using Pandas.

average_cost = sales['cost'].mean()
print("Average Cost: ", average_cost)

total_sales_week = sales.groupby('store')['last_week_sales'].sum()
print("Total sales last week: ", total_sales_week)

Output:

Average Cost:  502.79470000000003
Total sales last week:  store
Alpino      915
Mary       1024
Paladin    1200
Soner      1066
Uler       1222
Name: last_week_sales, dtype: int64
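
The grouped totals are themselves a Series, so they can be sorted and searched. A minimal sketch with made-up numbers (the `demo_sales` frame and its store names are hypothetical) that ranks stores by weekly sales:

```python
import pandas as pd

# Made-up numbers, used only to keep the sketch self-contained
demo_sales = pd.DataFrame({
    'store': ['StoreA', 'StoreB', 'StoreA', 'StoreC', 'StoreB'],
    'last_week_sales': [10, 4, 6, 7, 9],
})

totals = demo_sales.groupby('store')['last_week_sales'].sum()
ranked = totals.sort_values(ascending=False)  # best-selling store first
best_store = totals.idxmax()                  # label of the largest total

print(ranked)
print(best_store)
```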

We can also filter the data based on certain conditions. For example, if we wanted to only see the data of one store:

ulers_data = sales[sales.store=="Uler"]
ulers_data

Output:

id  store  product_group  product_code  stock_qty  cost  price  last_week_sales  last_month_sales
0100UlerPG34000731921.35516.92991412
2102UlerPG340022729608.01838.54511970
3103UlerPG240032217577.9762.6111737
5105UlerPG340052183254.731274.13381264
7107UlerPG240072879581.86385.215719
19119UlerPG24019526900.231488.23591014
36136UlerPG34036153496.11529.4894543
41141UlerPG34041885349.82463.15541154
49149UlerPG340494249297.051018.1861403
50150UlerPG14050356523.171438.89731542
52152UlerPG340521863298.031038.4556404
54154UlerPG140543997737.651396.198657
56156UlerPG340563481369.75908.6359927
57157UlerPG340573710338.39120.02571353
59159UlerPG240593594794.62938.8227844
70170UlerPG140701129529.481449.5398321
81181UlerPG340814900597.64220.9731231
82182UlerPG14082826308.20302.9073353
89189UlerPG340892492280.95191.827357
91191UlerPG340914842778.14332.67891931
92192UlerPG340922501485.57309.23221931
99199UlerPG340991522200.721132.0958933

This outputs a DataFrame containing only the rows where the store column is "Uler".

We can also use Matplotlib to plot the data. For example, let's say we want to plot a bar chart of last week's sales for each of Uler's products.

import matplotlib.pyplot as plt
# plt.bar(x, height): product codes go on the x-axis, sales are the bar heights
plt.bar(ulers_data['product_code'], ulers_data['last_week_sales'])
plt.xlabel('Product Code')
plt.ylabel('Last week sales')
plt.title('Last week sales of products')
plt.show()

Output: [bar chart: Last week sales of products]
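
A bar chart draws one bar per product. A histogram instead groups the values into ranges and counts how many fall into each one. The sketch below uses plt.hist() on made-up weekly sales figures; the numbers are hypothetical, and the "Agg" backend is selected only so the code runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up weekly sales figures, for illustration only
weekly = pd.Series([5, 12, 18, 22, 25, 31, 38, 44, 47, 53, 59, 66, 71, 78, 95])

# bins=5 splits the value range into 5 equal-width intervals
counts, bin_edges, _ = plt.hist(weekly, bins=5)
plt.xlabel('Last week sales')
plt.ylabel('Number of products')
plt.title('Distribution of last week sales')
plt.close()  # in an interactive session, call plt.show() instead

print(counts)
print(bin_edges)
```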

We can also compute the average of last week's sales for each store and product group, and plot the result:

avg_sales = sales.groupby(['store', 'product_group'], as_index=False).agg(avg_sales = ("last_week_sales", "mean"))
avg_sales

Output:

      store  product_group  avg_sales
0    Alpino            PG1  52.250000
1    Alpino            PG2  76.000000
2    Alpino            PG3  57.272727
3      Mary            PG1  44.800000
4      Mary            PG2  49.000000
5      Mary            PG3  71.800000
6   Paladin            PG1  63.636364
7   Paladin            PG2  61.142857
8   Paladin            PG3  18.000000
9     Soner            PG1  51.000000
10    Soner            PG2  51.166667
11    Soner            PG3  50.454545
12     Uler            PG1  82.500000
13     Uler            PG2  36.000000
14     Uler            PG3  53.428571

avg_sales.plot.bar(x="store", y="avg_sales", color=['r'])

Output: [bar chart: avg_sales per store, drawn in red]
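
With one row per store and product group, the chart repeats each store label several times. One possible alternative, sketched here with made-up averages (the `demo_avg` frame is hypothetical), is to pivot the table so each store becomes one row and each product group one column:

```python
import pandas as pd

# Made-up averages that mimic the layout of avg_sales
demo_avg = pd.DataFrame({
    'store': ['StoreA', 'StoreA', 'StoreB', 'StoreB'],
    'product_group': ['PG1', 'PG2', 'PG1', 'PG2'],
    'avg_sales': [50.0, 60.0, 40.0, 70.0],
})

# One row per store, one column per product group
wide = demo_avg.pivot(index='store', columns='product_group', values='avg_sales')
print(wide)

# wide.plot.bar() would then draw one group of bars per store
```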

Conclusion

In this article, we have shown how to use NumPy, Pandas and Matplotlib for data analysis. We have demonstrated how to read data from a CSV file into a Pandas DataFrame, compute statistics on the data, filter the data based on certain conditions, and plot the data using Matplotlib.

These are just a few examples of the many things you can do with NumPy and Pandas for data analysis. With these tools, you can quickly and easily analyze large amounts of data, allowing you to gain insights and make informed decisions.
