Messy Column Names? Here’s How to Fix Them with pandas

Question

Accepted Answer

You load a dataset, eager to explore it, but something feels off. The column names are inconsistent—some have spaces, others have weird capitalization, and a few even contain special characters. You try to reference a column and get an error. Annoying, right?Messy column names are a common frustration, but pandas makes it easy to clean them up. Let’s go over a few simple techniques to make your column names readable, standardized, and easy to work with.The Dataset: Laptops.csvFor this article, we’ll be working with a dataset containing information about 1,300 laptop computers. If you’d like to follow along and run the code yourself, you can download the dataset here: laptops.csv.Let's first load the dataset and inspect the column names:import pandas as pd

# Load dataset
df = pd.read_csv("laptops.csv", 
                 encoding="latin1")

# Display the column names
print(df.columns)
Output:Index([' Manufacturer', 
       'Model Name',
       'Category',
       'Screen Size',
       'Screen',
       'CPU',
       'RAM',
       ' Storage',
       'GPU',
       'Operating System',
       'Operating System Version',
       'Weight',
       'Price (Euros)'],
dtype='object')
What’s Wrong with Our Column Names?From the output above, here's what you might notice:Spaces instead of underscores (Operating System instead of operating_system)Inconsistent capitalization (Manufacturer vs. model name)Hidden leading or trailing whitespace (" Storage" instead of "Storage")Special characters (Price (Euros) instead of price_euros)These small inconsistencies can slow you down, especially when writing queries or referencing columns in your analysis. We will now rectify these issues, one step at a time.How to Clean Up Column Names with pandasWe'll go through a series of steps to clean our column names. After each step, we'll display the updated column names to see our progress.Step 1: Remove Leading and Trailing WhitespaceColumn names sometimes contain extra spaces at the beginning or end, which can lead to hard-to-spot errors when referencing them. Removing these unnecessary spaces ensures consistency and prevents issues when selecting columns.# Strip whitespace from column names
df.columns = df.columns.str.strip()
print(df.columns)
Output:Index(['Manufacturer',
       'Model Name',
       'Category',
       'Screen Size',
       'Screen',
       'CPU',
       'RAM',
       'Storage',
       'GPU',
       'Operating System',
       'Operating System Version',
       'Weight',
       'Price (Euros)'],
dtype='object')
Notice that leading and trailing spaces have been removed from column names like ' Storage' and ‘ Manufacturer’, making them easier to reference.Step 2: Standardize Formatting (Lowercase & Underscores)Column names should be easy to reference in code. A common convention is to use lowercase letters and replace spaces with underscores. This makes accessing and manipulating columns more predictable and less error-prone.df.columns = df.columns.str.lower().str.replace(" ", "_")
print(df.columns)
Output:Index(['manufacturer',
       'model_name',
       'category',
       'screen_size',
       'screen',
       'cpu',
       'ram',
       'storage',
       'gpu',
       'operating_system',
       'operating_system_version',
       'weight',
       'price_(euros)'],
dtype='object')
All column names are now lowercase, and spaces have been replaced with underscores, following a standard naming convention. However, note that special characters are still present, such as parentheses in 'price_(euros)', which we'll address next.Step 3: Remove Special CharactersSome column names may include special characters like parentheses, currency symbols, or other non-alphanumeric characters. These can cause errors when performing operations like merging or querying data. Removing them ensures compatibility across different processing tools.df.columns = df.columns.str.replace("[()€$]", "", regex=True)
print(df.columns)
Output:Index(['manufacturer',
       'model_name',
       'category',
       'screen_size',
       'screen',
       'cpu',
       'ram',
       'storage',
       'gpu',
       'operating_system',
       'operating_system_version',
       'weight',
       'price_euros'],
dtype='object')
Special characters like parentheses and currency symbols have been removed. 'price_(euros)' is now 'price_euros', making it more user-friendly and easier to reference in code.Step 4: Apply All Fixes at OnceInstead of applying each fix manually, we can create a function that performs all the cleaning steps in one go. This approach is useful when working with multiple datasets that require the same cleaning steps, making the process more efficient and reusable.def clean_column_names(df):
    df.columns = (df.columns
               .str.strip()
               .str.lower()
               .str.replace(" ", "_")
               .str.replace("[()€$]", "",
                            regex=True))
    return df

# Clean the column names
df = clean_column_names(df)
print(df.columns)
Output:Index(['manufacturer',
       'model_name',
       'category',
       'screen_size',
       'screen',
       'cpu',
       'ram',
       'storage',
       'gpu',
       'operating_system',
       'operating_system_version',
       'weight',
       'price_euros'],
dtype='object')
After applying all fixes in one function, we now have fully cleaned column names that are easy to reference and work within any analysis or transformation.Now your dataset is much easier to work with!Why This MattersCleaning column names may seem like a small step, but it saves time and prevents frustrating errors. When you can quickly reference your columns, your data analysis flows much more smoothly.Want to learn more about cleaning messy data? Check out the full Data Cleaning Basics lesson. If you're serious about data analysis, explore the full Junior Data Analyst career path.Happy coding, and remember to always keep your data clean!