Essential Pandas Operations You Must Bookmark Right Away

AI With Hariharan
7 min readDec 9, 2022

--

Photo by Alex Chumak on Unsplash

Introduction

As we know, Data Cleaning and Preprocessing take up the majority of time in a Data Science project. We often find that a single library — Pandas — can solve several issues primarily.

Pandas is a Python module with extensive functions that can handle everything from data pre-processing to data analysis.

The following article will explore some simple yet powerful Pandas operations that may be useful and, more importantly, come in handy when needed. These bookmark-worthy operations can be modified depending on the requirements.

I just listed out the operations in 4 simple steps.

Operations in Pandas

1. To replace NaN values with Random Values from a List

Generally, to replace missing data in a Pandas DataFrame, we use the .fillna() method. We can use .loc() on the DataFrame if we are trying to fill the NaN values randomly from a list of numbers or strings.Let’s see an example below:

Importing the Libraries
import pandas as pd
import numpy as np
Creating a Pandas DataFrame with dummy data
df = pd.DataFrame({
'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
'Salary': [12134343, 21312324, 421324554, 234434325]
})
Image Credits: O/P from the authors

The above code generates a Pandas DataFrame, as shown in the image above.

For this dataset, we purposefully added a few NaN values, since we need these values to treat. Changing the seed value:

Let’s set the seed value since we will replace NaN values with random NumPy values, and we want the same results every time.

np.random.seed(124)

Replacing the NaN values…!

Now that we have our seed set, we will use the .loc() method on our DataFrame to perform our operation.

df.loc[df['Favorite Sport'].isna(), 'Favorite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favorite Sport'].isna().sum())]

In the column ‘Favorite Sport’, by executing this code, a list of values will be assigned to the NaN values. There would be a random selection of values from a list of [‘Volleyball’, ‘Football’, ‘Basketball’, ‘Cricket’], and the number of the values picked will be equal to the number of NaNs in the selected column.

Putting it all together

# Import the Libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
'Salary': [12134343, 21312324, 421324554, 234434325]
})
np.random.seed(124)
df.loc[df['Favourite Sport'].isna(), 'Favourite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favourite Sport'].isna().sum())]
print(df)

On executing this code, our final dataset would look something like this:

Image Credits: O/P from the author

A NaN is replaced with a randomly selected value, such as ‘Basketball’ and ‘Volleyball’ from the mentioned sports lists.

2. Map values in a categorical column into numeric codes.

Mapping values to numeric codes is a useful method for numeric data in our DataFrame that has to be unique.

Let’s begin by repeating the prerequisites from the previous step.

This function is useful for automatically assigning Roll Numbers in a class from a list of student names. creating a Pandas DataFrame and executing the (1.) operation after importing the libraries.

Performing the Prerequisites

Importing the libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
'Favourite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
'Salary': [12134343, 21312324, 421324554, 234434325]
})
# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favorite Sport'].isna(), 'Favorite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favorite Sport'].isna().sum())]
Creating a list of Codes (from the 'Name' column)
list(pd.Categorical(df['Name'], ordered = True).codes)

On executing this, we get the following:

Image Credits: o/p

Using Pandas’ Categorical() method, we passed the ‘Name’ column and the ‘ordered’ parameter from our DataFrame, to get the list of numbers based on the alphabetically ordered ‘Name’ column.

Accordingly, when the name ‘Alex’ is entered, the assigned code is ‘0’, while when the name ‘Jimmy’ is entered, the assigned code will be ‘2’ because ‘Jimmy’ is listed in the 3rd position in the ‘Name’ column.

We can also pass this list of values into the DataFrame as a column.

Creating New Column from Codes

df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
Executing this will create a new column named 'Roll Number.'
Putting this all together
# Import the Libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
'Favorite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
'Salary': [12134343, 21312324, 421324554, 234434325]
})
# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favorite Sport'].isna(), 'Favorite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favorite Sport'].isna().sum())]
# Mapping 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
print(df)

On executing this code, our DataFrame would look like this:

3. Is to format the integers in a DataFrame.

This process improves the readability of the numbers. DataFrames with numerous digits can be confusing and cause misinterpretations.

Here is an example of formatting the ‘Salary’ column values.

The first step is to complete the requirements for the previous three operations: importing libraries, creating the Pandas DataFrame, and building the DataFrame.

Performing the prerequisites

# Import the Libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
'Favorite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
'Salary': [12134343, 21312324, 421324554, 234434325]
})
# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favorite Sport'].isna(), 'Favorite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favorite Sport'].isna().sum())]
# Mapping 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
Formatting the 'Salary' column
df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))

Putting it all together

# Import the Libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
'Favorite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
'Salary': [12134343, 21312324, 421324554, 234434325]
})
# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favorite Sport'].isna(), 'Favorite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favorite Sport'].isna().sum())]
# Mapping 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
# Format values in 'Salary' column
df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))
print(df)

On executing this code, we get the following:

Our ‘Salary’ column values are passed to the built-in format() method, using Pandas’ .apply() method. The commas between the digits may cause the value to become object-type or categorical when performing this operation.

4. If a certain categorical column contains a given substring, we can remove rows that satisfy a specific requirement.

This is often accomplished using categorical columns. We will perform a similar operation on one of our categorical columns below.

In our DataFrame, we will extract all the rows where the person has ball games as their favorite sport.

Using our Favorite Sport column, let’s start with the prerequisites, such as library imports, DataFrame construction, and the previously completed operations.

Performing the Prerequisites

# Import the Libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
'Favorite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
'Salary': [12134343, 21312324, 421324554, 234434325]
})
# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favorite Sport'].isna(), 'Favorite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favorite Sport'].isna().sum())]
# Mapping the 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
# Format values in the 'Salary' column
df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))
Extracting the Rows of Interest
print(df[df['Favorite Sport'].str.contains('ball')])

Executing this will extract all the rows where the Favorite Sport of a person has the text ‘ball’ in it.

Putting it all together

# Import the Libraries
import pandas as pd
import numpy as np
# Creating the DataFrame
df = pd.DataFrame({
'Name': ['Alex', 'Jimmy', 'Katie', 'Brute'],
'Favorite Sport': [np.nan, 'Lawn Tennis', 'Basketball', np.nan],
'Height (in cm)': [167.7, 182.3, 178.7, 166.2],
'Salary': [12134343, 21312324, 421324554, 234434325]
})
# Replacing the NaN values
np.random.seed(124)
df.loc[df['Favorite Sport'].isna(), 'Favorite Sport'] = [i for i in np.random.choice(['Volleyball', 'Football', 'Basketball', 'Cricket'], df['Favorite Sport'].isna().sum())]
# Mapping the 'Name' column into numeric codes
df['Roll Number'] = list(pd.Categorical(df['Name'], ordered = True).codes)
# Format values in the 'Salary' column
df['Salary'] = df['Salary'].apply(lambda x: format(x, ',d'))
# Checking if 'ball' is in the 'Favorite Sport' column
print(df[df['Favourite Sport'].str.contains('ball')])

On executing this code, we get the following:

As explained, we did get all the data in their respective columns. Despite the fact that all the actions described have been carried out in the simplest possible way, other ways of performing them may exist.

Because the article is so simple, it’s worth bookmarking because it saves time looking for similar solutions on StackOverflow.

It was demonstrated how to replace NaNs with randomized values — strings or numbers, as well as how to code strings into numerics based on the alphabetical order of the strings.

In the third operation, we learned how to format integers for easier reading and how to change the datatype of the column from int to str by formatting.

During the fourth operation, we learned how to extract rows if a given substring is found in one of the columns.

--

--

No responses yet