Data Analysis with Python: NumPy, Pandas and Matplotlib (Basic)

I'm a full-stack web developer with Django (Python) and React.js as my main stack. I'm also a beginner writer who loves to write tutorials for different purposes.
Python is an incredibly versatile language, and one of its strengths is its ability to work with data. With various libraries available, Python makes it easy to manipulate, analyze, and visualize data in a way that is both intuitive and powerful.
The most popular libraries for data analysis in Python are NumPy, Pandas and Matplotlib. NumPy is a library for working with numerical data. Pandas is built on top of NumPy and provides more advanced functionality for data manipulation and analysis. Matplotlib plots data as charts and graphs. In this article, we'll explore how to use these libraries to analyze data in Python.
P.S.: This tutorial assumes that you have a basic understanding of how Python works and know how to set up virtual environments to keep the dependencies of different projects separate.
Installing NumPy, Pandas and Matplotlib
Before we can start analyzing data with NumPy, Pandas and Matplotlib, we need to install them, which is a simple process.
pip install numpy pandas matplotlib
This will download and install NumPy, Pandas and Matplotlib on your system. With these packages installed, we're ready to start working with data.
NumPy Basics
NumPy provides a powerful set of tools for working with numerical data in Python. One of the most important features of NumPy is the ndarray object, which provides a fast and efficient way to store and manipulate arrays of numerical data.
To get started with NumPy, we first need to import the library:
import numpy as np
This imports NumPy under the alias "np" for convenience. Creating an array in NumPy is simple. We can create a one-dimensional array (also called a vector) by passing a Python list to the np.array() function:
import numpy as np
x = np.array([1,2,3])
print(x)
Output:
[1 2 3]
We can also create two-dimensional arrays (matrices) by passing a list of lists to the np.array() function:
y = np.array([[1,2,3], [4,5,6]])
print(y)
Output:
[[1 2 3]
 [4 5 6]]
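To see how the two arrays differ, we can inspect a few attributes of the ndarray object. This is a minimal sketch that reuses the x and y arrays from above; ndim, shape and size are standard ndarray attributes.

```python
import numpy as np

x = np.array([1, 2, 3])                 # one-dimensional
y = np.array([[1, 2, 3], [4, 5, 6]])    # two-dimensional

print(x.ndim, x.shape)   # 1 (3,)
print(y.ndim, y.shape)   # 2 (2, 3)
print(y.size)            # 6 elements in total
print(y[1, 2])           # row 1, column 2 -> 6
```

Indexing is zero-based, so y[1, 2] picks the last element of the second row.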
NumPy provides a number of functions for creating arrays with specific values, such as zeros() and ones(). np.zeros() creates an array filled with zeros, and np.ones() creates an array filled with ones:
z = np.zeros((3,3))
print(z)
Output:
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
a = np.ones((4,4))
print(a)
Output:
[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
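zeros() and ones() are not the only helpers of this kind. The sketch below shows a few others that often come in handy; the variable names are only illustrative.

```python
import numpy as np

evens = np.arange(0, 10, 2)         # start, stop (exclusive), step
points = np.linspace(0.0, 1.0, 5)   # 5 evenly spaced values, both ends included
sevens = np.full((2, 3), 7)         # 2x3 array filled with the value 7
identity = np.eye(3)                # 3x3 identity matrix

print(evens.tolist())    # [0, 2, 4, 6, 8]
print(points.tolist())   # [0.0, 0.25, 0.5, 0.75, 1.0]
print(sevens.shape)      # (2, 3)
```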
NumPy provides a lot of functions for manipulating arrays. For example, we can reshape an array with the reshape() function:
a = np.array([1,2,3,4,5,6])
b = a.reshape((2,3))
print(b)
Output:
[[1 2 3]
 [4 5 6]]
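reshape() only works when the new shape holds exactly as many elements as the original array. Here is a small sketch of that rule, including the -1 shortcut that lets NumPy work out one dimension for us:

```python
import numpy as np

a = np.arange(1, 7)        # [1 2 3 4 5 6]
b = a.reshape((3, -1))     # -1 means "infer this dimension", giving shape (3, 2)
flat = b.reshape(-1)       # back to a one-dimensional array

print(b.shape)             # (3, 2)
print(flat.tolist())       # [1, 2, 3, 4, 5, 6]

try:
    a.reshape((4, 2))      # 8 slots for 6 elements cannot work
except ValueError as err:
    print("reshape failed:", err)
```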
We can also perform mathematical operations on arrays. For example, we can add two arrays together:
c = np.array([1,2,3])
d = np.array([4,5,6])
e = c + d
print(e)
Output:
[5 7 9]
Similarly, we can multiply two arrays together:
f = np.array([1,2,3])
g = np.array([4,5,6])
h = f * g
print(h)
Output:
[ 4 10 18]
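Note that * multiplies the arrays element by element. It is not the dot product or matrix multiplication, for which NumPy offers the @ operator. A short sketch of the difference, reusing f and g from above:

```python
import numpy as np

f = np.array([1, 2, 3])
g = np.array([4, 5, 6])

elementwise = f * g    # [1*4, 2*5, 3*6]
dot = f @ g            # 1*4 + 2*5 + 3*6
scaled = f * 10        # the scalar is applied to every element

print(elementwise.tolist())   # [4, 10, 18]
print(dot)                    # 32
print(scaled.tolist())        # [10, 20, 30]
```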
NumPy also provides functions for computing basic statistics on arrays. For example, we can compute the mean, median, and standard deviation of an array:
i = np.array([1,2,3,4,5,6])
print(np.mean(i))
print(np.median(i))
print(np.std(i))
Output:
3.5
3.5
1.707825127659933
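Two details are worth knowing here. By default, np.std() computes the population standard deviation (dividing by N), while the ddof=1 argument gives the sample standard deviation (dividing by N - 1), which is also what Pandas uses by default. For multi-dimensional arrays, the axis argument chooses the direction of the computation. A minimal sketch:

```python
import numpy as np

i = np.array([1, 2, 3, 4, 5, 6])

pop_std = np.std(i)               # divides by N
sample_std = np.std(i, ddof=1)    # divides by N - 1

m = i.reshape((2, 3))             # [[1 2 3], [4 5 6]]
col_means = m.mean(axis=0)        # one mean per column
row_means = m.mean(axis=1)        # one mean per row

print(col_means.tolist())   # [2.5, 3.5, 4.5]
print(row_means.tolist())   # [2.0, 5.0]
```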
Pandas Basics
While NumPy provides a powerful set of tools for working with numerical data, it doesn't provide much in the way of data manipulation or analysis. This is where Pandas comes in.
Pandas provides two main classes for working with data: Series and DataFrame. A Series is a one-dimensional array-like object, while a DataFrame is a two-dimensional table-like object. Both of these classes provide a rich set of functions for manipulating and analyzing data.
Creating a Series in Pandas is simple. We can create a Series from a Python list:
import pandas as pd
s = pd.Series([1,2,3])
print(s)
Output:
0    1
1    2
2    3
dtype: int64
The first column of the output shows the index of the Series, which is automatically generated by Pandas. We can also specify our own index:
t = pd.Series([1,2,3], index=['a', 'b', 'c'])
print(t)
Output:
a    1
b    2
c    3
dtype: int64
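A custom index lets us look up values by label instead of by position. The sketch below reuses the Series t from above; doubled and subset are just illustrative names.

```python
import pandas as pd

t = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

print(t['b'])         # 2, lookup by label
print(t.iloc[0])      # 1, lookup by position

doubled = t * 2           # arithmetic keeps the index
subset = t[['a', 'c']]    # select several labels at once

print(doubled.tolist())   # [2, 4, 6]
print(subset.tolist())    # [1, 3]
```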
Creating a DataFrame in Pandas is also simple. We can create a DataFrame from a Python dictionary:
u = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})
print(u)
Output:
     Name  Age
0   Shady   25
1    Tony   30
2  Amanda   35
Like Series, DataFrames also have an index, which is generated automatically if not specified.
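Before analyzing a DataFrame, it usually helps to look at its shape and columns and to pull out single columns or rows. A minimal sketch using the DataFrame u from above (u_indexed is an illustrative name):

```python
import pandas as pd

u = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})

print(u.shape)            # (3, 2): three rows, two columns
print(list(u.columns))    # ['Name', 'Age']

ages = u['Age']           # a single column is a Series
row = u.loc[1]            # the row whose index label is 1

# Use the Name column as the index instead of the default 0, 1, 2
u_indexed = u.set_index('Name')
print(u_indexed.loc['Tony', 'Age'])   # 30
```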
Pandas provides a rich set of functions for manipulating and analyzing data. For example, we can filter a DataFrame to include only rows that meet a certain condition:
v = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})
w = v[v['Age'] > 30]
print(w)
Output:
     Name  Age
2  Amanda   35
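Conditions can be combined with & (and) and | (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons. A small sketch on the same data:

```python
import pandas as pd

v = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})

# Both conditions must hold
in_range = v[(v['Age'] > 25) & (v['Age'] < 35)]

# At least one condition must hold
either = v[(v['Age'] < 30) | (v['Name'] == 'Amanda')]

print(in_range['Name'].tolist())   # ['Tony']
print(either['Name'].tolist())     # ['Shady', 'Amanda']
```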
We can also sort a DataFrame by a certain column:
x = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})
y = x.sort_values(by='Age')
print(y)
Output:
     Name  Age
0   Shady   25
1    Tony   30
2  Amanda   35
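Since this data is already in ascending order, the output looks the same as the input. Sorting in descending order makes the effect visible. A quick sketch (oldest_first and renumbered are illustrative names):

```python
import pandas as pd

x = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})

oldest_first = x.sort_values(by='Age', ascending=False)
print(oldest_first['Name'].tolist())   # ['Amanda', 'Tony', 'Shady']
print(oldest_first.index.tolist())     # [2, 1, 0], original labels are kept

# Renumber the rows 0, 1, 2 after sorting
renumbered = oldest_first.reset_index(drop=True)
print(renumbered.index.tolist())       # [0, 1, 2]
```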
Another useful function is groupby(), which allows us to group rows of a DataFrame by a certain column:
z = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Gender': ['F', 'M', 'M'], 'Age': [25, 30, 35]})
a = z.groupby('Gender')['Age'].mean()
print(a)
Output:
Gender
F    25.0
M    32.5
Name: Age, dtype: float64
This groups the rows of the DataFrame by the "Gender" column and computes the mean of the "Age" column for each group.
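A group is not limited to a single statistic. With agg() we can ask for several at once. Here is a minimal sketch on the same data, where summary is an illustrative name:

```python
import pandas as pd

z = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Gender': ['F', 'M', 'M'],
    'Age': [25, 30, 35],
})

# One row per gender, one column per statistic
summary = z.groupby('Gender')['Age'].agg(['count', 'mean', 'max'])
print(summary)

print(summary.loc['M', 'count'])   # 2
print(summary.loc['M', 'mean'])    # 32.5
```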
Now that we know some basic functionalities of NumPy and Pandas, let's move on to see how we can use these libraries for data analysis.
Data Analysis with NumPy and Pandas
Suppose we have a CSV file containing some data we want to analyze. We can read the data into a Pandas DataFrame using the read_csv() function:
sales = pd.read_csv('sales_data.csv')
sales.head(10)
Output:
| | id | store | product_group | product_code | stock_qty | cost | price | last_week_sales | last_month_sales |
| 0 | 100 | Uler | PG3 | 4000 | 731 | 921.35 | 516.92 | 99 | 1412 |
| 1 | 101 | Paladin | PG1 | 4001 | 80 | 984.39 | 436.05 | 12 | 1250 |
| 2 | 102 | Uler | PG3 | 4002 | 2729 | 608.01 | 838.54 | 51 | 1970 |
| 3 | 103 | Uler | PG2 | 4003 | 2217 | 577.97 | 62.61 | 1 | 1737 |
| 4 | 104 | Paladin | PG1 | 4004 | 3652 | 633.66 | 1318.50 | 88 | 1727 |
| 5 | 105 | Uler | PG3 | 4005 | 2183 | 254.73 | 1274.13 | 38 | 1264 |
| 6 | 106 | Alpino | PG1 | 4006 | 3938 | 85.61 | 566.28 | 22 | 745 |
| 7 | 107 | Uler | PG2 | 4007 | 2879 | 581.86 | 385.21 | 57 | 19 |
| 8 | 108 | Alpino | PG1 | 4008 | 4943 | 287.91 | 952.56 | 25 | 1360 |
| 9 | 109 | Paladin | PG1 | 4009 | 3273 | 786.10 | 754.78 | 100 | 1610 |
This assumes that the CSV file is located in the same directory as our Python script. If the file is somewhere else, we need to pass the path to that location instead, for example 'home/user/my_project/data.csv'.
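If you don't have sales_data.csv at hand, you can still try read_csv() on text held in memory. The sketch below is a stand-in, not the real file: it types out the first five rows from the table above, and the column order is an assumption based on that printed output.

```python
import io
import pandas as pd

# Stand-in for sales_data.csv (first five rows shown above, typed by hand)
csv_text = """id,store,product_group,product_code,stock_qty,cost,price,last_week_sales,last_month_sales
100,Uler,PG3,4000,731,921.35,516.92,99,1412
101,Paladin,PG1,4001,80,984.39,436.05,12,1250
102,Uler,PG3,4002,2729,608.01,838.54,51,1970
103,Uler,PG2,4003,2217,577.97,62.61,1,1737
104,Paladin,PG1,4004,3652,633.66,1318.50,88,1727
"""

sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)     # (5, 9)
print(sample.dtypes)    # column types inferred by Pandas

# usecols loads only the columns we care about
slim = pd.read_csv(io.StringIO(csv_text), usecols=['store', 'cost', 'last_week_sales'])
print(list(slim.columns))   # ['store', 'cost', 'last_week_sales']
```

read_csv() accepts a file path or any file-like object, which is why io.StringIO works here.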
Once we have the data in the DataFrame, we can use Pandas functions to analyze it. Our CSV file contains sales data for different stores, including product group, product code, stock quantity, cost, price, and sales for the last week and the last month. We can compute some statistics based on this data using Pandas.
average_cost = sales['cost'].mean()
print("Average Cost: ", average_cost)
total_sales_week = sales.groupby('store')['last_week_sales'].sum()
print("Total sales last week: ", total_sales_week)
Output:
Average Cost:  502.79470000000003
Total sales last week:  store
Alpino      915
Mary       1024
Paladin    1200
Soner      1066
Uler       1222
Name: last_week_sales, dtype: int64
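Both statistics can also be computed per store in a single step with named aggregations. The sketch below uses a small hand-typed sample (five rows from the table above) rather than the full file, so the numbers differ from the output above; total_week and avg_cost are illustrative column names.

```python
import pandas as pd

# Five rows copied from the table above, only the columns we need
sample = pd.DataFrame({
    'store': ['Uler', 'Paladin', 'Uler', 'Uler', 'Paladin'],
    'cost': [921.35, 984.39, 608.01, 577.97, 633.66],
    'last_week_sales': [99, 12, 51, 1, 88],
})

per_store = (
    sample.groupby('store')
    .agg(total_week=('last_week_sales', 'sum'), avg_cost=('cost', 'mean'))
    .sort_values('total_week', ascending=False)   # best-selling store first
)
print(per_store)
```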
We can also filter the data based on certain conditions. For example, if we wanted to only see the data of one store:
ulers_data = sales[sales.store=="Uler"]
ulers_data
Output:
| | id | store | product_group | product_code | stock_qty | cost | price | last_week_sales | last_month_sales |
| 0 | 100 | Uler | PG3 | 4000 | 731 | 921.35 | 516.92 | 99 | 1412 |
| 2 | 102 | Uler | PG3 | 4002 | 2729 | 608.01 | 838.54 | 51 | 1970 |
| 3 | 103 | Uler | PG2 | 4003 | 2217 | 577.97 | 62.61 | 1 | 1737 |
| 5 | 105 | Uler | PG3 | 4005 | 2183 | 254.73 | 1274.13 | 38 | 1264 |
| 7 | 107 | Uler | PG2 | 4007 | 2879 | 581.86 | 385.21 | 57 | 19 |
| 19 | 119 | Uler | PG2 | 4019 | 526 | 900.23 | 1488.23 | 59 | 1014 |
| 36 | 136 | Uler | PG3 | 4036 | 153 | 496.11 | 529.48 | 94 | 543 |
| 41 | 141 | Uler | PG3 | 4041 | 885 | 349.82 | 463.15 | 54 | 1154 |
| 49 | 149 | Uler | PG3 | 4049 | 4249 | 297.05 | 1018.18 | 61 | 403 |
| 50 | 150 | Uler | PG1 | 4050 | 356 | 523.17 | 1438.89 | 73 | 1542 |
| 52 | 152 | Uler | PG3 | 4052 | 1863 | 298.03 | 1038.45 | 56 | 404 |
| 54 | 154 | Uler | PG1 | 4054 | 3997 | 737.65 | 1396.19 | 86 | 57 |
| 56 | 156 | Uler | PG3 | 4056 | 3481 | 369.75 | 908.63 | 59 | 927 |
| 57 | 157 | Uler | PG3 | 4057 | 3710 | 338.39 | 120.02 | 57 | 1353 |
| 59 | 159 | Uler | PG2 | 4059 | 3594 | 794.62 | 938.82 | 27 | 844 |
| 70 | 170 | Uler | PG1 | 4070 | 1129 | 529.48 | 1449.53 | 98 | 321 |
| 81 | 181 | Uler | PG3 | 4081 | 4900 | 597.64 | 220.97 | 3 | 1231 |
| 82 | 182 | Uler | PG1 | 4082 | 826 | 308.20 | 302.90 | 73 | 353 |
| 89 | 189 | Uler | PG3 | 4089 | 2492 | 280.95 | 191.82 | 7 | 357 |
| 91 | 191 | Uler | PG3 | 4091 | 4842 | 778.14 | 332.67 | 89 | 1931 |
| 92 | 192 | Uler | PG3 | 4092 | 2501 | 485.57 | 309.23 | 22 | 1931 |
| 99 | 199 | Uler | PG3 | 4099 | 1522 | 200.72 | 1132.09 | 58 | 933 |
This outputs a DataFrame containing only the rows where the store column is Uler.
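There are other ways to express such filters. isin() matches several values at once, and query() takes the condition as a string. A minimal sketch on a small hand-made sample (values copied from the table above, variable names illustrative):

```python
import pandas as pd

sample = pd.DataFrame({
    'store': ['Uler', 'Paladin', 'Uler', 'Alpino'],
    'last_week_sales': [99, 12, 51, 22],
})

# Rows that belong to any of the listed stores
two_stores = sample[sample['store'].isin(['Uler', 'Alpino'])]

# The same kind of filter written as a query string
busy_uler = sample.query("store == 'Uler' and last_week_sales > 60")

print(len(two_stores))                          # 3
print(busy_uler['last_week_sales'].tolist())    # [99]
```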
We can also use Matplotlib to plot the data. For example, let's say we want to plot a bar chart of last week's sales for each of Uler's products.
import matplotlib.pyplot as plt
# x positions are the product codes, bar heights are last week's sales
plt.bar(ulers_data['product_code'], ulers_data['last_week_sales'])
plt.xlabel('Product Code')
plt.ylabel('Last week sales')
plt.title('Last week sales of products')
plt.show()
Output: [bar chart titled "Last week sales of products", x-axis: Product Code, y-axis: Last week sales]

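A bar chart draws one bar per product. A histogram, in contrast, shows how the values are distributed by counting how many fall into each range. Here is a minimal sketch using only the ten last_week_sales values printed by sales.head(10) earlier; the bin settings are an arbitrary choice for illustration.

```python
import matplotlib.pyplot as plt

# last_week_sales of the first ten rows shown earlier
last_week_sales = [99, 12, 51, 1, 88, 38, 22, 57, 25, 100]

fig, ax = plt.subplots()
# Five bins of width 20 covering 0 to 100
counts, edges, _ = ax.hist(last_week_sales, bins=5, range=(0, 100))
ax.set_xlabel('Last week sales')
ax.set_ylabel('Number of products')
ax.set_title('Distribution of last week sales')

print(counts.tolist())   # [2.0, 3.0, 2.0, 0.0, 3.0]
```

Call plt.show() to display the figure, or fig.savefig() with a file name to write it to disk.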
We can get the average sales of each product group in each store and plot them as well:
avg_sales = sales.groupby(['store', 'product_group'], as_index=False).agg(avg_sales = ("last_week_sales", "mean"))
avg_sales
Output:
| | store | product_group | avg_sales |
| 0 | Alpino | PG1 | 52.250000 |
| 1 | Alpino | PG2 | 76.000000 |
| 2 | Alpino | PG3 | 57.272727 |
| 3 | Mary | PG1 | 44.800000 |
| 4 | Mary | PG2 | 49.000000 |
| 5 | Mary | PG3 | 71.800000 |
| 6 | Paladin | PG1 | 63.636364 |
| 7 | Paladin | PG2 | 61.142857 |
| 8 | Paladin | PG3 | 18.000000 |
| 9 | Soner | PG1 | 51.000000 |
| 10 | Soner | PG2 | 51.166667 |
| 11 | Soner | PG3 | 50.454545 |
| 12 | Uler | PG1 | 82.500000 |
| 13 | Uler | PG2 | 36.000000 |
| 14 | Uler | PG3 | 53.428571 |
avg_sales.plot.bar(x="store", y="avg_sales", color=['r'])
Output: [bar chart of avg_sales, one red bar per store and product group row]

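Because every store appears once per product group, the chart above repeats each store name three times on the x-axis. One possible alternative, sketched here under the assumption that a grouped layout is what we want, is to pivot the result so that each product group becomes its own column. The values are copied from the table above for two stores only.

```python
import pandas as pd

# Two stores from the avg_sales table above
avg_sales = pd.DataFrame({
    'store': ['Alpino', 'Alpino', 'Alpino', 'Uler', 'Uler', 'Uler'],
    'product_group': ['PG1', 'PG2', 'PG3', 'PG1', 'PG2', 'PG3'],
    'avg_sales': [52.25, 76.0, 57.272727, 82.5, 36.0, 53.428571],
})

# One row per store, one column per product group
wide = avg_sales.pivot(index='store', columns='product_group', values='avg_sales')
print(wide)
print(wide.loc['Uler', 'PG1'])   # 82.5
```

Calling wide.plot.bar() on this table draws one cluster of bars per store, with one bar for each product group.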
Conclusion
In this article, we have shown how to use NumPy, Pandas and Matplotlib for data analysis. We have demonstrated how to read data from a CSV file into a Pandas DataFrame, compute statistics on the data, filter the data based on certain conditions, and plot the data using Matplotlib.
These are just a few examples of the many things you can do with NumPy and Pandas for data analysis. With these tools, you can quickly and easily analyze large amounts of data, allowing you to gain insights and make informed decisions.