Practical Python Data Cleaning: How to Elegantly Handle Dirty Data for More Efficient Analysis
2024-10-29

Current Situation

Have you ever run into this: you receive a dataset, open it, and it's a complete mess. Missing values everywhere, inconsistent formats, and more duplicate rows than you can count. As data analysts we spend something like 80% of our time on data cleaning, so I'm sure you can relate.

Recently, while processing an e-commerce dataset, I encountered various types of "dirty data." Through this experience, I developed a methodology for data cleaning that I'd like to share with you today.

Problems

Before we begin, let's look at some of the most common issues in data cleaning:

import pandas as pd
import numpy as np


data = {
    'user_id': [1, 2, 2, 3, None, 5],
    'age': ['25', '30', '30', '-1', '150', '35'],
    'purchase_date': ['2023-01-01', 'Jan 1, 2023', '2023/01/01', '', '2023-13-01', '2023-01-01'],
    'amount': ['100.0', 'unknown', '100', '200.0', '1000000', '150.0']
}

df = pd.DataFrame(data)
print(df)

Data like this probably gives you a headache too. Let me break down the main issues (a quick diagnostic sketch follows the list):

  1. Duplicate data: records with user_id=2 are duplicated
  2. Missing values: null values exist in the user_id column
  3. Anomalies: age values of -1 and 150 are clearly unreasonable, and amount=1000000 might be an outlier
  4. Inconsistent data types: numeric columns such as age and amount are stored as strings
  5. Chaotic date formats: date formats in the purchase_date column are inconsistent
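
If you would rather confirm these issues programmatically than by eyeballing the frame, a few quick pandas checks are enough. Here is a minimal diagnostic sketch over the df defined above:

print(df.dtypes)  # everything arrives as object (or float for user_id), not the types we want
print("Duplicated user_id rows:", df.duplicated(subset='user_id').sum())
print(df.isnull().sum())  # missing values per column
print(pd.to_numeric(df['age'], errors='coerce').describe())  # the min of -1 and max of 150 stand out immediately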

Methods

Handling Duplicates

Handling duplicate data is the first step in data cleaning. Let's see how to elegantly solve this problem:

duplicates = df.duplicated(subset='user_id').sum()
print(f"Number of duplicate records: {duplicates}")


# The two user_id=2 rows differ in other columns, so deduplicate on the business key
df_clean = df.drop_duplicates(subset='user_id', keep='first')

I personally suggest paying special attention to these points when handling duplicates (a short sketch follows the list):

  - Determine which columns to use for identifying duplicates
  - Decide whether to keep the first or the last occurrence
  - Check for data inconsistencies among duplicate records
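
To make the third point concrete, here is a minimal sketch (reusing the df defined earlier) that checks whether rows sharing a user_id actually disagree in other columns before any of them are thrown away; picking user_id as the key is specific to this example:

# Rows that share a user_id but differ elsewhere are conflicts, not pure duplicates
per_user = df.groupby('user_id', dropna=True).nunique()
conflicting_ids = per_user[per_user.gt(1).any(axis=1)].index.tolist()
print("user_ids with conflicting duplicate rows:", conflicting_ids)

# keep='first' vs keep='last' decides which version of a conflicting record survives
print(df.drop_duplicates(subset='user_id', keep='last'))

In this sample, user_id=2 appears twice with different purchase dates and amounts, which is exactly the kind of conflict worth resolving by an explicit rule rather than by blindly keeping whichever row comes first.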

Handling Missing Values

Handling missing values takes some judgment, because different scenarios call for different strategies:

missing_stats = df.isnull().sum()
print("Missing value statistics:")
print(missing_stats)


df_clean = df_clean.copy()
df_clean['user_id'] = df_clean['user_id'].fillna(-999)  # Fill missing IDs with a sentinel value
df_clean['age'] = pd.to_numeric(df_clean['age'], errors='coerce')  # Convert first so the median can be computed
df_clean['age'] = df_clean['age'].fillna(df_clean['age'].median())  # Fill with the median

In my practice, I follow this approach when handling missing values (see the sketch after the list):

  - Understand why the data is missing
  - Evaluate the impact of the missing data on the analysis
  - Choose an appropriate filling strategy
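
For the second point, it helps to quantify how much data each option would touch before committing to it. A small sketch, run on the raw df for illustration:

# Percentage of missing values per column
missing_ratio = df.isnull().mean() * 100
print(missing_ratio.round(1))

# Dropping rows is only attractive when few rows are affected
n_total = len(df)
n_complete = len(df.dropna())
print(f"dropna() would discard {n_total - n_complete} of {n_total} rows")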

Handling Anomalies

Anomaly detection and handling is the part of data cleaning that requires the most business knowledge:

def detect_outliers(series):
    Q1 = series.quantile(0.25)
    Q3 = series.quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return (series < lower_bound) | (series > upper_bound)


# age is already numeric after the missing-value step; flag and null out statistical outliers
age_outliers = detect_outliers(df_clean['age'])
df_clean.loc[age_outliers, 'age'] = np.nan

For anomalies, I suggest the following (a combined example follows the list):

  - Define anomalies based on business rules
  - Use statistical methods to assist judgment
  - Handle boundary cases carefully
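
As an illustration of combining the first two points, the sketch below flags an age when either a business rule or the IQR test objects; the 0-120 range is my own assumption of a plausible human age, not something from the dataset. It runs on the raw ages so both checks have something to find:

# Illustration on the raw ages, before any cleaning
ages = pd.to_numeric(df['age'], errors='coerce')

# Business rule: ages outside a plausible human range are invalid regardless of statistics
rule_outliers = ~ages.between(0, 120)

# Statistical rule: the IQR-based test defined above
stat_outliers = detect_outliers(ages)

# A value flagged by only one of the two checks is a boundary case worth
# reviewing by hand rather than deleting automatically
print(ages[rule_outliers | stat_outliers])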

Format Standardization

Standardizing data formats is key to ensuring analysis accuracy:

def standardize_date(date_str):
    try:
        return pd.to_datetime(date_str).strftime('%Y-%m-%d')
    except (ValueError, TypeError):
        # Empty strings and impossible dates such as '2023-13-01' end up here
        return None

df_clean['purchase_date'] = df_clean['purchase_date'].apply(standardize_date)


df_clean['amount'] = pd.to_numeric(df_clean['amount'], errors='coerce')

My experience here (a short sketch follows the list):

  - Establish standard formats at the beginning of the project
  - Use automated tools for the conversion
  - Keep a backup of the original data
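
A small sketch of the "automate the conversion, keep the originals" idea; df_raw is just a name I am introducing for this example:

# Keep a pristine copy before any conversion so the original values can always be recovered
df_raw = df.copy()

# errors='coerce' turns values that do not fit the standard format into NaN
# instead of raising, so the offending rows can be reviewed afterwards
amounts = pd.to_numeric(df_raw['amount'], errors='coerce')
print("Amount values that failed to parse:")
print(df_raw.loc[amounts.isna(), 'amount'])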

Results

Let's look at the improvement in data quality after these cleaning steps:

print("Overview of cleaned data:")
print(df_clean.info())
print("
Data samples:")
print(df_clean.head())


quality_metrics = {
    "Total records": len(df_clean),
    "Complete record ratio": (df_clean.notna().all(axis=1).sum() / len(df_clean)) * 100,
    "Anomaly ratio": (detect_outliers(df_clean['age']).sum() / len(df_clean)) * 100
}
print("
Data quality metrics:")
print(quality_metrics)

Through these cleaning steps, our data quality has significantly improved:

  - Duplicate records completely eliminated
  - Missing values reasonably filled
  - Anomalies effectively identified and handled
  - Data formats standardized

Insights

In practice, I've found that data cleaning is an iterative process. Each time you process data, there are new discoveries and room for improvement. I suggest:

  1. Establish standard procedures for data cleaning
  2. Write reusable cleaning functions (one possible shape is sketched after this list)
  3. Document cleaning processes and decision rationales in detail
  4. Regularly evaluate cleaning effectiveness
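
To make point 2 concrete, here is one possible shape for such a reusable function, stitching together the steps from this article; the name clean_orders and the default column names are my own choices, not a fixed convention:

def clean_orders(raw, id_col='user_id', date_col='purchase_date', amount_col='amount'):
    # Deduplicate on the business key, keeping the first occurrence
    out = raw.drop_duplicates(subset=id_col, keep='first').copy()
    # Missing identifiers get a sentinel value
    out[id_col] = out[id_col].fillna(-999)
    # Numeric conversion, then null out statistical outliers
    out['age'] = pd.to_numeric(out['age'], errors='coerce')
    out.loc[detect_outliers(out['age']), 'age'] = np.nan
    # Standardize dates and amounts
    out[date_col] = out[date_col].apply(standardize_date)
    out[amount_col] = pd.to_numeric(out[amount_col], errors='coerce')
    return out

cleaned = clean_orders(df)
cleaned.info()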

Do you find these methods helpful? Feel free to share your data cleaning experiences in the comments. If you want to learn more about data cleaning techniques, we can discuss some advanced topics next time, such as how to clean time series data or how to use machine learning methods for anomaly detection.
