The Keys to Faster Data Filtering in pandas

  • Last updated on February 18, 2025 at 4:22 PM

Have you ever found yourself writing a loop to filter rows in a pandas DataFrame, only to realize that it’s taking way too long? Or maybe you’ve struggled with getting the right rows when filtering data? If so, you’re not alone. Many beginners (and even experienced analysts) overlook one of pandas’ most powerful features: Boolean indexing.

Boolean indexing lets you filter data using conditions, eliminating the need for inefficient loops. Once you get the hang of it, you’ll be able to select the exact data you need in just one line of code. Let’s walk through the best ways to filter data in pandas, the right way!


The Dataset: Fortune 500 Companies

To practice Boolean indexing, we'll use the Fortune 500 dataset, which contains financial data on the top 500 companies in the world. This dataset includes columns such as:

  • company: Name of the company
  • revenues: Total revenue (in millions)
  • profits: Net income (in millions)
  • country: Country where the company is headquartered

You can download the dataset and follow along locally using this link: Fortune 500 dataset.

Now, let’s learn some Boolean indexing techniques!


Boolean Filtering Basics

Boolean indexing works by applying a condition to a DataFrame or Series, which results in a Boolean mask—a series of True and False values. When it's used as a filter, only the rows where the condition evaluated to True are returned.

Example: Selecting Profitable Companies

Say we have a DataFrame f500 that contains financial data for Fortune 500 companies, including revenue and profit columns. If we want to select only the companies that reported a profit, we can load the data and use:

import pandas as pd

f500 = pd.read_csv("f500.csv", index_col=0)
f500.index.name = None

bool_profitable = f500["profits"] > 0
profitable = f500[bool_profitable]

Here’s what’s happening:

  • f500["profits"] > 0 creates a Boolean mask (bool_profitable) where each row is marked as True if the profits are positive.
  • f500[bool_profitable] filters the DataFrame, returning only the rows where the mask is True.

And just like that—no loops needed!


Combining Conditions with Boolean Operators

You’re not limited to filtering by a single condition. You can combine multiple conditions using & (AND), | (OR), and ~ (NOT) operators. Just be sure to wrap each condition in parentheses to avoid syntax errors.

Example: Companies with High Revenue and Negative Profit

big_rev_neg_profit = f500[(f500["revenues"] > 100000) & (f500["profits"] < 0)]

This selects companies that:

  • Have revenues greater than $100B (100000 in millions)
  • Reported negative profits

Want to find companies with high revenue OR negative profits? Just swap & for |:

big_or_neg_profit = f500[(f500["revenues"] > 100000) | (f500["profits"] < 0)]

Or if you want to exclude companies in the USA:

non_us_companies = f500[~(f500["country"] == "USA")]

The ~ operator negates the condition, selecting everything except companies in the USA.


Filtering Missing Values with .notnull()

Since it is such a common task, let’s talk about handling missing values in pandas. Often, datasets contain NaN (Not a Number) values, which represent missing or undefined data. The .notnull() method is a quick way to filter out these missing values.

Example: Selecting Companies with a Previous Rank

previously_ranked = f500[f500["previous_rank"].notnull()]
print(previously_ranked.shape)  # Output: (467,)

This selects only the rows where the previous_rank column contains valid (non-null) values. Using .notnull() ensures that we don’t perform calculations on missing data, preventing unexpected errors.


Understanding pandas Index Alignment

One of the lesser-known but powerful features of pandas is index alignment. When performing operations on DataFrames or Series, pandas automatically aligns values based on their index labels. This makes merging, updating, and performing calculations across different datasets much easier.

How Index Alignment Works

Let's say we want to calculate how much each company’s ranking has changed compared to the previous year. We can compute the rank_change column by subtracting the current rank from the previous_rank:

previously_ranked = f500[f500["previous_rank"].notnull()]
rank_change = previously_ranked["previous_rank"] - previously_ranked["rank"]
f500["rank_change"] = rank_change

Even though the previously_ranked DataFrame contains fewer rows than f500, pandas automatically aligns the index labels. This ensures that only the rows that exist in previously_ranked are updated in f500, while all other rows remain unchanged. For these unaffected rows, the rank_change column will contain NaN values, since no corresponding previous rank exists to compute a change.

Advantages of Index Alignment

  • Prevents incorrect data merges – Ensures values correspond to the correct rows.
  • No need for manual sorting – pandas automatically aligns values based on index labels.
  • Works even when indexes are non-sequential – Making transformations and calculations easier.

This built-in behavior allows seamless operations, reducing errors and saving time when working with large datasets.


Filtering Multiple Values with .isin()

What if you want to filter by a list of values? Instead of chaining multiple OR conditions, we can use .isin().

Example: Selecting Companies from Brazil or Venezuela

latin_american_companies = f500[f500["country"].isin(["Brazil", "Venezuela"])]

This is much cleaner than writing multiple | conditions!


Filtering Numeric Ranges with .between()

If you’re filtering numeric values within a range, .between() is a more readable alternative to >= and <= comparisons.

Example: Finding Mid-Sized Companies by Revenue

mid_size_companies = f500[f500["revenues"].between(50000, 100000)]

This selects companies with revenue between $50B and $100B (inclusive).


Why Boolean Indexing Matters

  • Faster than loops – Filtering happens in a single, optimized operation.
  • More readable – Cleaner code that’s easier to understand.
  • More scalable – Works efficiently even on large datasets.

If you’re still writing loops to filter DataFrames, it’s time to level up with Boolean indexing!

Want to sharpen your pandas skills even more? Learn all about data selection and exploration in our lesson on Exploring Data with Pandas: Intermediate or enroll in the Junior Data Analyst path to build on your skills.

Happy coding, and keep experimenting!