Python Data Processing Tool: Using Pandas to Handle Millions of Records with Ease

Opening Thoughts

Have you ever been frustrated by Excel's sluggishness when handling large datasets? Or struggled with processing different data file formats? Today, I want to share with you a powerful Python tool - Pandas - that can make your data processing dramatically more efficient.

As a data analysis practitioner, I deeply appreciate Pandas' power. I remember once needing to process a CSV file containing 5 million rows - trying to open it in Excel would crash my computer. But with Pandas, I could smoothly complete data cleaning and analysis with just a few lines of code, which was truly delightful.

Basic Concepts

Before we explore Pandas' powerful features, let's understand two fundamental data structures: Series and DataFrame.

You can think of a Series as an enhanced list, while a DataFrame is like a super Excel spreadsheet. Let me illustrate with a simple example:

import pandas as pd

# A Series is a one-dimensional labeled array
fruits = pd.Series(['Apple', 'Banana', 'Orange', 'Pear'], name='Fruits')
print("Series example:")
print(fruits)

# A DataFrame is a two-dimensional table built from columns
data = {
    'Fruit': ['Apple', 'Banana', 'Orange', 'Pear'],
    'Price': [5, 3, 4, 6],
    'Stock': [100, 200, 150, 80]
}
df = pd.DataFrame(data)
print("\nDataFrame example:")
print(df)

Why do I like using this fruit example? Because it's simple and intuitive - everyone can understand it. In real work, you might be handling stock data, user behavior data, or sales data, but the basic principles are the same.
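In real projects that data usually arrives in a file rather than a hand-typed dictionary. Here is a minimal sketch of loading an external dataset and taking a first look at it; the file name sales.csv and its contents are hypothetical:

import pandas as pd

# Load a CSV file into a DataFrame ('sales.csv' is a made-up file name)
sales = pd.read_csv('sales.csv')

print(sales.head())   # first five rows
sales.info()          # column dtypes and non-null counts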

Data Operations

When it comes to data operations, Pandas is like a Swiss Army knife. Whether it's filtering, sorting, or statistical analysis, everything can be done with concise code.

# Filter rows with a boolean condition
expensive_fruits = df[df['Price'] > 4]
print("Fruits more expensive than $4:")
print(expensive_fruits)

# Sort by a column in descending order
sorted_df = df.sort_values('Stock', ascending=False)
print("\nSorted by stock quantity:")
print(sorted_df)

# Summary statistics for the numeric columns
summary = df.describe()
print("\nData statistical summary:")
print(summary)

See how intuitive these operations are? In my work, these are basic operations I use every day. For example, I often need to filter products within a certain price range or sort products by sales volume.
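Those two everyday operations chain together naturally. As a small sketch using the fruit DataFrame from earlier, here is a price-range filter followed by a sort:

# Fruits priced between $3 and $5, sorted by stock from high to low
in_range = df[df['Price'].between(3, 5)]
print(in_range.sort_values('Stock', ascending=False))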

Efficient Techniques

Now I'll share some efficient techniques I've summarized from practice. These might not be easily found in official documentation but can greatly improve your data processing efficiency.

1. Using query method for complex filtering

# Traditional boolean indexing
result1 = df[(df['Price'] > 4) & (df['Stock'] > 100)]

# The same filter expressed with the query method
result2 = df.query('Price > 4 and Stock > 100')

I particularly like using the query method because its syntax is closer to natural language and makes the code more readable. This advantage is especially apparent when handling complex filtering conditions.
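One detail worth knowing: query can reference local Python variables with the @ prefix, which keeps changing thresholds out of the query string itself. A quick sketch:

min_price = 4
min_stock = 100

# @ lets the query string reference local Python variables
result3 = df.query('Price > @min_price and Stock > @min_stock')
print(result3)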

2. Using apply for custom operations

def calculate_value(row):
    return row['Price'] * row['Stock']

df['Total Value'] = df.apply(calculate_value, axis=1)

The apply function is one of my most frequently used features. It allows you to perform any custom operation on your data, offering great flexibility.
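That said, apply calls a Python function once per row, so for simple arithmetic a vectorized expression is usually much faster. The same total-value column can be computed directly:

# Vectorized equivalent of the apply version above - no per-row Python call
df['Total Value'] = df['Price'] * df['Stock']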

Performance Optimization

When handling large-scale data, performance optimization becomes particularly important. Here are some optimization techniques I've summarized from practice:

# Downcast to a smaller dtype to reduce memory usage
df['Price'] = df['Price'].astype('float32')

# Read a large file in chunks instead of all at once
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    # Process each chunk
    pass

# query evaluates the expression efficiently (via the numexpr engine when installed)
result = df.query('Price > 4 & Stock > 100')

These optimization techniques are especially useful when handling millions of records. I remember once needing to process a 2GB CSV file, and through these optimizations, the processing time was reduced from 10 minutes to 2 minutes.
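To show what chunked reading looks like beyond a bare loop, here is a sketch that sums one column across chunks without ever holding the whole file in memory; the file large_file.csv and its Revenue column are assumptions for illustration:

total = 0.0

# Accumulate a running sum one chunk at a time
for chunk in pd.read_csv('large_file.csv', chunksize=10000):
    total += chunk['Revenue'].sum()

print(f"Total revenue: {total:,.2f}")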

Practical Case

Let's apply the knowledge we've learned through a practical case. Suppose we need to analyze a store's sales data:

import pandas as pd
import numpy as np


np.random.seed(42)

# Build a simulated sales dataset: 1000 days x 5 products
dates = pd.date_range('20230101', periods=1000)
products = ['A', 'B', 'C', 'D', 'E']
sales_data = {
    'Date': np.repeat(dates, len(products)),
    'Product': products * len(dates),
    'Sales': np.random.randint(10, 100, size=len(dates) * len(products)),
    'Unit Price': np.random.uniform(10, 50, size=len(dates) * len(products))
}

sales_df = pd.DataFrame(sales_data)

# Revenue per row, then total revenue per product
sales_df['Revenue'] = sales_df['Sales'] * sales_df['Unit Price']
product_summary = sales_df.groupby('Product')['Revenue'].sum().sort_values(ascending=False)

# Daily totals, resampled to month-end ('M'; newer pandas prefers the alias 'ME')
daily_sales = sales_df.groupby('Date')['Revenue'].sum()
monthly_sales = daily_sales.resample('M').sum()

print("Product Sales Ranking:")
print(product_summary)
print("\nMonthly Sales Trend:")
print(monthly_sales)

This case demonstrates how to use Pandas for actual data analysis work. Through the combined use of functions like groupby and sort_values, we can quickly obtain valuable business insights.
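Building on the same results, a couple more lines answer concrete business questions, such as which product earns the most and which month peaked:

# Highest-revenue product and the strongest month
best_product = product_summary.idxmax()
peak_month = monthly_sales.idxmax()
print(f"Top product: {best_product}, peak month: {peak_month:%Y-%m}")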

Conclusion

Learning Pandas is like learning a new language - it might seem difficult at first, but once you master the basic concepts and common operations, you'll find it greatly improves your data processing efficiency.

Have you encountered any memorable data processing scenarios in your work? Feel free to share your experiences in the comments. Data analysis is a process of continuous learning and improvement; let's continue exploring this field together.

In the next article, we'll delve into Pandas' advanced features, including table joins, pivot table operations, and more. Stay tuned.
