
The Keys to Faster Data Filtering in pandas
- Last updated on February 18, 2025 at 4:22 PM
Have you ever found yourself writing a loop to filter rows in a pandas DataFrame, only to realize that it’s taking way too long? Or maybe you’ve struggled with getting the right rows when filtering data? If so, you’re not alone. Many beginners (and even experienced analysts) overlook one of pandas’ most powerful features: Boolean indexing.
Boolean indexing lets you filter data using conditions, eliminating the need for inefficient loops. Once you get the hang of it, you’ll be able to select the exact data you need in just one line of code. Let’s walk through the best ways to filter data in pandas, the right way!
The Dataset: Fortune 500 Companies
To practice Boolean indexing, we'll use the Fortune 500 dataset, which contains financial data on the top 500 companies in the world. This dataset includes columns such as:
company
: Name of the companyrevenues
: Total revenue (in millions)profits
: Net income (in millions)country
: Country where the company is headquartered
You can download the dataset and follow along locally using this link: Fortune 500 dataset.
Now, let’s learn some Boolean indexing techniques!
Boolean Filtering Basics
Boolean indexing works by applying a condition to a DataFrame or Series, which results in a Boolean mask—a series of True
and False
values. When it's used as a filter, only the rows where the condition evaluated to True
are returned.
Example: Selecting Profitable Companies
Say we have a DataFrame f500
that contains financial data for Fortune 500 companies, including revenue and profit columns. If we want to select only the companies that reported a profit, we can load the data and use:
import pandas as pd
f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None
bool_profitable = f500["profits"] > 0
profitable = f500[bool_profitable]
Here’s what’s happening:
f500["profits"] > 0
creates a Boolean mask (bool_profitable
) where each row is marked asTrue
if the profits are positive.f500[bool_profitable]
filters the DataFrame, returning only the rows where the mask isTrue
.
And just like that—no loops needed!
Combining Conditions with Boolean Operators
You’re not limited to filtering by a single condition. You can combine multiple conditions using &
(AND), |
(OR), and ~
(NOT) operators. Just be sure to wrap each condition in parentheses to avoid syntax errors.
Example: Companies with High Revenue and Negative Profit
big_rev_neg_profit = f500[(f500["revenues"] > 100000) & (f500["profits"] < 0)]
This selects companies that:
- Have revenues greater than $100B (
100000
in millions) - Reported negative profits
Want to find companies with high revenue OR negative profits? Just swap &
for |
:
big_or_neg_profit = f500[(f500["revenues"] > 100000) | (f500["profits"] < 0)]
Or if you want to exclude companies in the USA:
non_us_companies = f500[~(f500["country"] == "USA")]
The ~
operator negates the condition, selecting everything except companies in the USA.
Filtering Missing Values with .notnull()
Since it is such a common task, let’s talk about handling missing values in pandas. Often, datasets contain NaN
(Not a Number) values, which represent missing or undefined data. The .notnull()
method is a quick way to filter out these missing values.
Example: Selecting Companies with a Previous Rank
previously_ranked = f500[f500["previous_rank"].notnull()]
print(previously_ranked.shape) # Output: (467,)
This selects only the rows where the previous_rank
column contains valid (non-null) values. Using .notnull()
ensures that we don’t perform calculations on missing data, preventing unexpected errors.
Understanding pandas Index Alignment
One of the lesser-known but powerful features of pandas is index alignment. When performing operations on DataFrames or Series, pandas automatically aligns values based on their index labels. This makes merging, updating, and performing calculations across different datasets much easier.
How Index Alignment Works
Let's say we want to calculate how much each company’s ranking has changed compared to the previous year. We can compute the rank_change
column by subtracting the current rank
from the previous_rank
:
previously_ranked = f500[f500["previous_rank"].notnull()]
rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]
f500["rank_change"] = rank_change
Even though the previously_ranked
DataFrame contains fewer rows than f500
, pandas automatically aligns the index labels. This ensures that only the rows that exist in previously_ranked
are updated in f500
, while all other rows remain unchanged. For these unaffected rows, the rank_change
column will contain NaN
values, since no corresponding previous rank exists to compute a change.
Advantages of Index Alignment
- Prevents incorrect data merges – Ensures values correspond to the correct rows.
- No need for manual sorting – pandas automatically aligns values based on index labels.
- Works even when indexes are non-sequential – Making transformations and calculations easier.
This built-in behavior allows seamless operations, reducing errors and saving time when working with large datasets.
Filtering Multiple Values with .isin()
What if you want to filter by a list of values? Instead of chaining multiple OR conditions, we can use .isin()
.
Example: Selecting Companies from Brazil or Venezuela
latin_american_companies = f500[f500["country"].isin(["Brazil", "Venezuela"])]
This is much cleaner than writing multiple |
conditions!
Filtering Numeric Ranges with .between()
If you’re filtering numeric values within a range, .between()
is a more readable alternative to >=
and <=
comparisons.
Example: Finding Mid-Sized Companies by Revenue
mid_size_companies = f500[f500["revenues"].between(50000, 100000)]
This selects companies with revenue between $50B and $100B (inclusive).
Why Boolean Indexing Matters
- Faster than loops – Filtering happens in a single, optimized operation.
- More readable – Cleaner code that’s easier to understand.
- More scalable – Works efficiently even on large datasets.
If you’re still writing loops to filter DataFrames, it’s time to level up with Boolean indexing!
Want to sharpen your pandas skills even more? Learn all about data selection and exploration in our lesson on Exploring Data with Pandas: Intermediate or enroll in the Junior Data Analyst path to build on your skills.
Happy coding, and keep experimenting!