
What Most People Get Wrong When Selecting Data in pandas
- Last updated on February 7, 2025 at 8:26 PM
When you start working with data, one of the first things you'll run into is the need to grab the right information—whether it's a column of sales figures or a row detailing customer activity. For those using pandas, selecting data efficiently can make or break your workflow. The good news? It doesn't have to be complicated. Let's explore how to select rows and columns, highlight common mistakes, and give you the tools to avoid them.
You'll learn this and much more in the Introduction to pandas lesson. Ready to sharpen your skills?
Meet the Data: Fortune Global 500 Dataset
Throughout this tutorial, we'll be working with the Fortune Global 500 dataset, which contains information about the world’s largest companies. This dataset is stored in a file called f500.csv
, which you can download here if you want to follow along locally.
Here's how to load it using pandas:
import pandas as pd
# Load the dataset
df = pd.read_csv("f500.csv", index_col=0)
The index_col=0
argument sets the first column (company names) as the row labels, making it easier to reference rows by company names.
To get a sense of the data structure, we can use:
# Display the first 5 rows
df.head()
Expected output:
rank revenues profits country
company
Walmart 1 485873.0 13643.0 USA
State Grid 2 315199.0 9571.3 China
Sinopec Group 3 267518.0 1257.9 China
China Natural Petroleum 4 262573.0 1867.5 China
Toyota Motor 5 254694.0 16899.3 Japan
Please note that the output above is truncated for display purposes; you’ll see more columns if you run this command locally.
Understanding the Anatomy of a DataFrame
Before we get into the nitty-gritty, take a quick look at this labeled DataFrame:
Here’s a quick breakdown:
- Row labels are along the index axis (typically representing individual records).
- Column labels run across the column axis (typically representing features or variables).
- The values in the DataFrame can have different types (e.g., integers, floats, or strings).
This visual is your roadmap—knowing the structure helps you understand how pandas
selections work.
Selecting Columns the Right Way
The most common way to select a column in pandas is by using square brackets:
# Select a single column (returns a Series)
single_col = df["revenues"]
But here’s a key distinction: selecting one column like this returns a Series, not a DataFrame. If you need the result as a DataFrame, wrap the column name in double square brackets:
# Select a single column and return it as a DataFrame
single_col_df = df[["revenues"]]
If you want multiple columns, separate them with commas:
# Select multiple columns
subset = df[["revenues", "profits"]]
Pro tip: Many beginners get tripped up when they expect a single column to behave like a DataFrame but instead get a Series. The difference matters when applying further operations—so be mindful of the brackets.
Selecting Rows: loc
and iloc
Now let’s talk rows. The two main methods for row selection in pandas are:
loc[]
: Select rows by label.iloc[]
: Select rows by index position.
Example: Selecting by Label
Assume you want the row where the company is Walmart:
walmart_row = df.loc["Walmart"]
Example: Selecting by Position
If you know the row number instead:
first_row = df.iloc[0]
Series vs. DataFrames: Why It Matters
Notice how selecting a single column or row returns a Series, while selecting multiple columns or rows returns a DataFrame. This distinction becomes important when chaining operations—a Series doesn’t support all the methods that a DataFrame does.
Let’s clarify this with some examples using data we’ve already defined:
# Select columns (returns a DataFrame)
subset = df[["revenues", "profits"]]
# Calculate the total revenue
total_revenues = subset["revenues"].sum()
If you accidentally work with a Series instead of a DataFrame (or vice versa), certain methods may behave differently or lead to errors. Always check what type of object you are working with using the type()
function:
# Check for Series or a DataFrame print(type(df["revenues"]))
# <class 'pandas.core.series.Series'>
print(type(df[["revenues"]]))
# <class 'pandas.core.frame.DataFrame'>
By verifying the object type, you can ensure you're applying the correct methods and avoid unexpected behavior.
Common Pitfalls to Avoid
- Misusing single brackets: As mentioned, single brackets return a Series. If your downstream code expects a DataFrame, this mismatch can cause errors.
- Mixing up
loc
andiloc
: If you accidentally use the wrong one, you may select unexpected rows.df.loc[0]
won’t work if 0 isn’t a label in your index—you’d needdf.iloc[0]
.
- Forgetting about slices: Both
loc
andiloc
support slicing. For example:
# Select rows by position
subset_rows = df.iloc[0:3] # First 3 rows
# Select rows by label
subset_rows = df.loc["Walmart":"Sinopec Group"]
Example: Filtering for High-Performing Companies
Imagine you’re analyzing financial data and want to select companies with revenues over 300,000:
high_revenue = df[df["revenues"] > 300000]
Want to be more specific? Select only high-revenue companies from China:
china_high_revenue = df[(df["revenues"] > 300000) & (df["country"] == "China")]
This kind of filtering is common when cleaning or exploring datasets.
Key Takeaways
- Use single brackets to get a Series, and double brackets to get a DataFrame.
- Select rows using
loc[]
for labels andiloc[]
for positions. - Be mindful of Series vs. DataFrame distinctions when chaining operations.
Ready to practice this yourself? Check out the Introduction to pandas lesson to get hands-on experience. If you’re aiming to become a data analyst, you can also explore our Junior Data Analyst career path for a structured learning journey.
Happy coding, and keep experimenting!