Data Analysis with Python: NumPy, Pandas and Matplotlib (Basic)

Python is an incredibly versatile language, and one of its strengths is its ability to work with data. With various libraries available, Python makes it easy to manipulate, analyze, and visualize data in a way that is both intuitive and powerful.

The most popular libraries for data analysis in Python are NumPy, Pandas and Matplotlib. NumPy is a library for working with numerical data. Pandas is built on top of NumPy and provides higher-level functionality for data manipulation and analysis. Matplotlib is used to plot the data as charts and graphs. In this article, we’ll explore how to use these libraries to analyze data in Python.

Note: This tutorial assumes that you have a basic understanding of how Python works and know how to set up virtual environments to keep the dependencies of different projects separate.

Installing NumPy, Pandas and Matplotlib

Before we can start analyzing data with NumPy, Pandas and Matplotlib, we need to install them, which is a simple process.

pip install numpy pandas matplotlib

This will download and install NumPy, Pandas and Matplotlib on your system. With these packages installed, we're ready to start working with data.
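To confirm that the installation worked, we can import the three packages and print their versions. This is only a quick sanity check; the version numbers printed will depend on what pip installed on your machine.

```python
import matplotlib
import numpy as np
import pandas as pd

# Each package exposes its version as a string in __version__.
print("NumPy:", np.__version__)
print("Pandas:", pd.__version__)
print("Matplotlib:", matplotlib.__version__)
```

If any of the imports fails with ModuleNotFoundError, the package was most likely installed into a different environment than the one running the script.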

NumPy Basics

NumPy provides a powerful set of tools for working with numerical data in Python. One of the most important features of NumPy is the ndarray object, which provides a fast and efficient way to store and manipulate arrays of numerical data.

To get started with NumPy, we first need to import the library:

import numpy as np

This imports NumPy under the alias "np", which is the conventional short name. Creating an array in NumPy is simple. We can create a one-dimensional array (also called a vector) by passing a Python list to the np.array() function:

import numpy as np
x = np.array([1,2,3])
print(x)

Output:

[1 2 3]

We can also create two-dimensional arrays (matrices) by passing a list of lists to the np.array() function:

y = np.array([[1,2,3], [4,5,6]])
print(y)

Output:

[[1 2 3]
 [4 5 6]]
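Every ndarray also describes itself through a few attributes. The short sketch below inspects the two-dimensional array from above; the exact integer type is platform dependent, so the comment on dtype is only the common case.

```python
import numpy as np

y = np.array([[1, 2, 3], [4, 5, 6]])

print(y.shape)  # (2, 3): two rows, three columns
print(y.ndim)   # 2: number of dimensions
print(y.size)   # 6: total number of elements
print(y.dtype)  # an integer type, commonly int64
```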

NumPy provides a number of functions for creating arrays filled with specific values. For example, np.zeros() creates an array of zeros and np.ones() creates an array of ones. Both take the shape of the array as a tuple:

z = np.zeros((3,3))
print(z)

Output:

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]

a = np.ones((4,4))
print(a)

Output:

[[1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]
 [1. 1. 1. 1.]]
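The trailing dots in the output above show that zeros() and ones() produce floating-point arrays by default. A minimal sketch of how to change that, along with a few related creation functions (np.full, np.arange and np.linspace), chosen here purely as illustration:

```python
import numpy as np

int_zeros = np.zeros((2, 2), dtype=int)  # integer zeros instead of floats
sevens = np.full((2, 3), 7)              # 2x3 array where every value is 7
steps = np.arange(0, 10, 2)              # start, stop (exclusive), step
points = np.linspace(0.0, 1.0, 5)        # 5 evenly spaced values, both ends included

print(int_zeros)
print(sevens)
print(steps)   # [0 2 4 6 8]
print(points)  # [0.   0.25 0.5  0.75 1.  ]
```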

NumPy provides a lot of functions for manipulating arrays. For example, we can reshape an array with the reshape() function:

a = np.array([1,2,3,4,5,6])
b = a.reshape((2,3))
print(b)

Output:

[[1 2 3]
 [4 5 6]]
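The new shape has to hold exactly as many elements as the original array. As a small sketch, we can let NumPy infer one dimension by passing -1, and see what happens when the sizes do not match:

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6])

# -1 asks NumPy to work out that dimension: 6 elements / 3 rows = 2 columns.
b = a.reshape((3, -1))
print(b.shape)  # (3, 2)

# A 4x2 array needs 8 elements, but we only have 6, so NumPy raises ValueError.
try:
    a.reshape((4, 2))
except ValueError as err:
    print("reshape failed:", err)
```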

We can also perform mathematical operations on arrays. For example, we can add two arrays together:

c = np.array([1,2,3])
d = np.array([4,5,6])

e = c + d
print(e)

Output:

[5 7 9]

Similarly, we can multiply two arrays together:

f = np.array([1,2,3])
g = np.array([4,5,6])
h = f * g
print(h)

Output:

[ 4 10 18]
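Note that * multiplies the arrays element by element; it is not a matrix or dot product. The sketch below contrasts the two, and also shows that a single number is applied to every element of an array:

```python
import numpy as np

f = np.array([1, 2, 3])
g = np.array([4, 5, 6])

elementwise = f * g  # [1*4, 2*5, 3*6]
dot = f @ g          # 1*4 + 2*5 + 3*6
scaled = f * 10      # the scalar is applied to each element

print(elementwise)  # [ 4 10 18]
print(dot)          # 32
print(scaled)       # [10 20 30]
```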

NumPy also provides functions for computing basic statistics on arrays. For example, we can compute the mean, median, and standard deviation of an array:

i = np.array([1,2,3,4,5,6])
print(np.mean(i))
print(np.median(i))
print(np.std(i))

Output:

3.5
3.5
1.707825127659933
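Two details are worth knowing here. By default, np.std() computes the population standard deviation (it divides by n); passing ddof=1 gives the sample standard deviation (it divides by n - 1). For multi-dimensional arrays, the axis argument controls the direction in which the statistic is computed. A minimal sketch:

```python
import numpy as np

i = np.array([1, 2, 3, 4, 5, 6])

population_std = np.std(i)       # divides by n
sample_std = np.std(i, ddof=1)   # divides by n - 1
print(population_std)
print(sample_std)

m = i.reshape((2, 3))            # [[1 2 3], [4 5 6]]
col_means = np.mean(m, axis=0)   # one mean per column
row_means = np.mean(m, axis=1)   # one mean per row
print(col_means)  # [2.5 3.5 4.5]
print(row_means)  # [2. 5.]
```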

Pandas Basics

While NumPy provides a powerful set of tools for working with numerical data, it doesn't provide much in the way of labeled data manipulation or analysis. This is where Pandas comes in.

Pandas provides two main classes for working with data: Series and DataFrame. A Series is a one-dimensional array-like object, while a DataFrame is a two-dimensional table-like object. Both of these classes provide a rich set of functions for manipulating and analyzing data.

Creating a Series in Pandas is simple. We can create a Series from a Python list:

import pandas as pd
s = pd.Series([1,2,3])
print(s)

Output:

0    1
1    2
2    3
dtype: int64

The first column of the output shows the index of the Series, which is automatically generated by Pandas. We can also specify our own index:

t = pd.Series([1,2,3], index=['a', 'b', 'c'])
print(t)

Output:

a    1
b    2
c    3
dtype: int64
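With a custom index, we can look up values by label as well as by position. The sketch below uses .loc for labels and .iloc for integer positions:

```python
import pandas as pd

t = pd.Series([1, 2, 3], index=['a', 'b', 'c'])

by_label = t.loc['b']        # value stored under label 'b'
by_position = t.iloc[0]      # first value, regardless of its label
subset = t.loc[['a', 'c']]   # a list of labels returns a smaller Series

print(by_label)     # 2
print(by_position)  # 1
print(subset)
```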

Creating a DataFrame in Pandas is also simple. We can create a DataFrame from a Python dictionary:

u = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})
print(u)

Output:

     Name  Age
0   Shady   25
1    Tony   30
2  Amanda   35

Like Series, DataFrames also have an index, which is generated automatically if not specified.
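As a small illustration, we can pull a single column out of a DataFrame (which gives us a Series) and replace the automatic index with one of the columns using set_index(). The variable names here are just for this sketch:

```python
import pandas as pd

u = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})

print(u.shape)   # (3, 2): three rows, two columns

ages = u['Age']  # a single column is a Series
print(type(ages).__name__)  # Series

# Use the Name column as the index, then look up a cell by row and column label.
by_name = u.set_index('Name')
tonys_age = by_name.loc['Tony', 'Age']
print(tonys_age)  # 30
```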

Pandas provides a rich set of functions for manipulating and analyzing data. For example, we can filter a DataFrame to include only rows that meet a certain condition:

v = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})
w = v[v['Age'] > 30]
print(w)

Output:

     Name  Age
2  Amanda   35
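Conditions can be combined with & (and) and | (or). Each condition needs its own parentheses, because these operators bind more tightly than comparisons such as > or ==. A minimal sketch on the same data:

```python
import pandas as pd

v = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})

# Both conditions must hold: at least 30 years old and not named Amanda.
both = v[(v['Age'] >= 30) & (v['Name'] != 'Amanda')]
print(both)

# Either condition may hold: younger than 30 or named Amanda.
either = v[(v['Age'] < 30) | (v['Name'] == 'Amanda')]
print(either)
```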

We can also sort a DataFrame by a certain column. By default, sort_values() sorts in ascending order:

x = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})
y = x.sort_values(by='Age')
print(y)

Output:

     Name  Age
0   Shady   25
1    Tony   30
2  Amanda   35
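Since the sample data is already ordered by age, a descending sort makes the effect easier to see. The sketch below passes ascending=False; note that the original index labels travel with their rows.

```python
import pandas as pd

x = pd.DataFrame({'Name': ['Shady', 'Tony', 'Amanda'], 'Age': [25, 30, 35]})

oldest_first = x.sort_values(by='Age', ascending=False)
print(oldest_first)  # rows now appear with index labels 2, 1, 0

# reset_index(drop=True) renumbers the rows from 0 and discards the old labels.
renumbered = oldest_first.reset_index(drop=True)
print(renumbered)
```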

Another useful function is groupby(), which allows us to group rows of a DataFrame by a certain column:

z = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'], 'Gender': ['F', 'M', 'M'], 'Age': [25, 30, 35]})
a = z.groupby('Gender')['Age'].mean()
print(a)

Output:

Gender
F    25.0
M    32.5
Name: Age, dtype: float64

This groups the rows of the DataFrame by the "Gender" column and computes the mean of the "Age" column for each group.
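We are not limited to one statistic per group. As a sketch, agg() accepts a list of aggregation names and returns one column per statistic:

```python
import pandas as pd

z = pd.DataFrame({'Name': ['Alice', 'Bob', 'Charlie'],
                  'Gender': ['F', 'M', 'M'],
                  'Age': [25, 30, 35]})

# One row per gender, one column per requested statistic.
summary = z.groupby('Gender')['Age'].agg(['count', 'mean', 'max'])
print(summary)
```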

Now that we know some basic functionalities of NumPy and Pandas, let's move on to see how we can use these libraries for data analysis.

Data Analysis with NumPy and Pandas

Suppose we have a CSV file containing some data we want to analyze. We can read the data into a Pandas DataFrame using the read_csv() function:

sales = pd.read_csv('sales_data.csv')
sales.head(10)

Output:

	id	store	product_group	product_code	stock_qty	cost	price	last_week_sales	last_month_sales
0	100	Uler	PG3	4000	731	921.35	516.92	99	1412
1	101	Paladin	PG1	4001	80	984.39	436.05	12	1250
2	102	Uler	PG3	4002	2729	608.01	838.54	51	1970
3	103	Uler	PG2	4003	2217	577.97	62.61	1	1737
4	104	Paladin	PG1	4004	3652	633.66	1318.50	88	1727
5	105	Uler	PG3	4005	2183	254.73	1274.13	38	1264
6	106	Alpino	PG1	4006	3938	85.61	566.28	22	745
7	107	Uler	PG2	4007	2879	581.86	385.21	57	19
8	108	Alpino	PG1	4008	4943	287.91	952.56	25	1360
9	109	Paladin	PG1	4009	3273	786.10	754.78	100	1610

This assumes that the CSV file is located in the current working directory, which is usually the directory we run our Python script from. If the file is somewhere else, we should give the full path to that location, for example '/home/user/my_project/data.csv'.
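If you don't have sales_data.csv at hand, you can still try read_csv() on an in-memory string. The sketch below copies the first three rows of the output above; it assumes the file is comma-separated with a header row, which we cannot see from the DataFrame alone.

```python
import io

import pandas as pd

# First three rows of the sample output, written out as CSV text.
# The comma delimiter is an assumption about the original file.
csv_text = """id,store,product_group,product_code,stock_qty,cost,price,last_week_sales,last_month_sales
100,Uler,PG3,4000,731,921.35,516.92,99,1412
101,Paladin,PG1,4001,80,984.39,436.05,12,1250
102,Uler,PG3,4002,2729,608.01,838.54,51,1970
"""

# read_csv() accepts a file-like object as well as a path.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)   # (3, 9)
print(sample.dtypes)  # numeric columns are detected automatically
```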

Once we have the data in the DataFrame, we can use Pandas functions to analyze it. Our CSV file contains sales data for several stores, including the product group, product code, stock quantity, cost, price, and sales of each product. We can compute some statistics based on this data using Pandas.

average_cost = sales['cost'].mean()
print("Average Cost: ", average_cost)

total_sales_week = sales.groupby('store')['last_week_sales'].sum()
print("Total sales last week: ", total_sales_week)

Output:

Average Cost:  502.79470000000003
Total sales last week:  store
Alpino      915
Mary       1024
Paladin    1200
Soner      1066
Uler       1222
Name: last_week_sales, dtype: int64
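The result of groupby(...).sum() is itself a Series, so we can keep working with it. As a sketch, the code below rebuilds that Series by hand from the totals printed above (so it runs without the CSV file), then ranks the stores and computes each store's share of total sales:

```python
import pandas as pd

# Totals copied from the output above, so this runs without sales_data.csv.
total_sales_week = pd.Series(
    {'Alpino': 915, 'Mary': 1024, 'Paladin': 1200, 'Soner': 1066, 'Uler': 1222},
    name='last_week_sales',
)

best_store = total_sales_week.idxmax()                # label of the largest value
ranked = total_sales_week.sort_values(ascending=False)
share = total_sales_week / total_sales_week.sum()     # fraction of all sales per store

print(best_store)  # Uler
print(ranked)
print(share.round(3))
```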

We can also filter the data based on certain conditions. For example, if we wanted to only see the data of one store:

ulers_data = sales[sales.store=="Uler"]
ulers_data

Output:

	id	store	product_group	product_code	stock_qty	cost	price	last_week_sales	last_month_sales
0	100	Uler	PG3	4000	731	921.35	516.92	99	1412
2	102	Uler	PG3	4002	2729	608.01	838.54	51	1970
3	103	Uler	PG2	4003	2217	577.97	62.61	1	1737
5	105	Uler	PG3	4005	2183	254.73	1274.13	38	1264
7	107	Uler	PG2	4007	2879	581.86	385.21	57	19
19	119	Uler	PG2	4019	526	900.23	1488.23	59	1014
36	136	Uler	PG3	4036	153	496.11	529.48	94	543
41	141	Uler	PG3	4041	885	349.82	463.15	54	1154
49	149	Uler	PG3	4049	4249	297.05	1018.18	61	403
50	150	Uler	PG1	4050	356	523.17	1438.89	73	1542
52	152	Uler	PG3	4052	1863	298.03	1038.45	56	404
54	154	Uler	PG1	4054	3997	737.65	1396.19	86	57
56	156	Uler	PG3	4056	3481	369.75	908.63	59	927
57	157	Uler	PG3	4057	3710	338.39	120.02	57	1353
59	159	Uler	PG2	4059	3594	794.62	938.82	27	844
70	170	Uler	PG1	4070	1129	529.48	1449.53	98	321
81	181	Uler	PG3	4081	4900	597.64	220.97	3	1231
82	182	Uler	PG1	4082	826	308.20	302.90	73	353
89	189	Uler	PG3	4089	2492	280.95	191.82	7	357
91	191	Uler	PG3	4091	4842	778.14	332.67	89	1931
92	192	Uler	PG3	4092	2501	485.57	309.23	22	1931
99	199	Uler	PG3	4099	1522	200.72	1132.09	58	933

This outputs a DataFrame containing only the rows where the store column is "Uler".
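The same idea extends to several stores or several conditions. The sketch below uses a hand-built DataFrame with four rows and three columns copied from the sample output, so that it runs on its own. It uses the bracket form sample['store'], which is equivalent to sample.store but also works for column names that are not valid Python identifiers.

```python
import pandas as pd

# Four rows taken from the sample output above (only three of the columns).
sample = pd.DataFrame({
    'id': [100, 101, 102, 106],
    'store': ['Uler', 'Paladin', 'Uler', 'Alpino'],
    'last_week_sales': [99, 12, 51, 22],
})

# isin() keeps rows whose store matches any value in the list.
two_stores = sample[sample['store'].isin(['Uler', 'Alpino'])]
print(two_stores)

# Combine a store filter with a numeric condition.
busy_uler = sample[(sample['store'] == 'Uler') & (sample['last_week_sales'] > 60)]
print(busy_uler)
```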

We can also use Matplotlib to plot the data. For example, let's say we want to plot a bar chart of last week's sales for each of Uler's products.

import matplotlib.pyplot as plt
plt.bar(ulers_data['product_code'], ulers_data['last_week_sales'])
plt.xlabel('Product Code')
plt.ylabel('Last week sales')
plt.title('Last week sales of products')
plt.show()

Output: [Figure: bar chart of last week's sales by product code for the Uler store]
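When running as a script on a machine without a display, plt.show() may have nothing to draw to. A minimal sketch of an alternative is to render the chart off-screen and save it as an image. The five product codes and sales figures below are the first five Uler rows from the table above; the "Agg" backend and the in-memory buffer are choices made for this example only.

```python
import io

import matplotlib
matplotlib.use("Agg")  # off-screen backend, no window needed
import matplotlib.pyplot as plt

# First five Uler rows from the table above.
product_codes = ["4000", "4002", "4003", "4005", "4007"]
last_week_sales = [99, 51, 1, 38, 57]

fig, ax = plt.subplots()
bars = ax.bar(product_codes, last_week_sales)  # x values first, then bar heights
ax.set_xlabel("Product Code")
ax.set_ylabel("Last week sales")
ax.set_title("Last week sales of products")

# Save as PNG. A file name such as "sales.png" works here as well.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
```

Passing the product codes as strings makes Matplotlib treat them as categories, so the bars are evenly spaced.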

We can get the average last-week sales for each combination of store and product group and plot them as well:

avg_sales = sales.groupby(['store', 'product_group'], as_index=False).agg(avg_sales = ("last_week_sales", "mean"))
avg_sales

Output:

	store	product_group	avg_sales
0	Alpino	PG1	52.250000
1	Alpino	PG2	76.000000
2	Alpino	PG3	57.272727
3	Mary	PG1	44.800000
4	Mary	PG2	49.000000
5	Mary	PG3	71.800000
6	Paladin	PG1	63.636364
7	Paladin	PG2	61.142857
8	Paladin	PG3	18.000000
9	Soner	PG1	51.000000
10	Soner	PG2	51.166667
11	Soner	PG3	50.454545
12	Uler	PG1	82.500000
13	Uler	PG2	36.000000
14	Uler	PG3	53.428571

avg_sales.plot.bar(x="store", y="avg_sales", color=['r'])

Output: [Figure: bar chart of avg_sales, one bar per store and product group row, labeled by store]
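Because avg_sales has one row per store and product group, the chart above draws three separate bars for every store. One possible way to make the comparison easier is to pivot the table so that each product group becomes its own column. The sketch below rebuilds avg_sales from the values printed above so that it runs without the CSV file:

```python
import pandas as pd

stores = ["Alpino", "Mary", "Paladin", "Soner", "Uler"]
groups = ["PG1", "PG2", "PG3"]
# Averages copied from the avg_sales output above, in row order.
values = [52.25, 76.0, 57.272727,
          44.8, 49.0, 71.8,
          63.636364, 61.142857, 18.0,
          51.0, 51.166667, 50.454545,
          82.5, 36.0, 53.428571]

avg_sales = pd.DataFrame({
    "store": [store for store in stores for _ in groups],
    "product_group": groups * len(stores),
    "avg_sales": values,
})

# One row per store, one column per product group.
wide = avg_sales.pivot(index="store", columns="product_group", values="avg_sales")
print(wide)

# Product group with the highest average for each store.
best_group = wide.idxmax(axis=1)
print(best_group)
```

Calling wide.plot.bar() on the pivoted table draws one cluster of bars per store, with one bar for each product group.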

Conclusion

In this article, we have shown how to use NumPy, Pandas and Matplotlib for data analysis. We have demonstrated how to create and manipulate NumPy arrays, build Pandas Series and DataFrames, read data from a CSV file into a DataFrame, compute statistics on the data, filter the data based on certain conditions, and plot the data using Matplotlib.

These are just a few examples of the many things you can do with NumPy and Pandas for data analysis. With these tools, you can quickly and easily analyze large amounts of data, allowing you to gain insights and make informed decisions.

