First Meeting with Pandas
Have you ever wondered why the most popular data analysis library in Python is called Pandas? The name officially comes from "panel data", but whenever I hear it, I think back to my first encounter with the library. I was struggling with a messy sales dataset full of null values, duplicates, inconsistent formats, and various other issues. Just when I was at my wit's end, Pandas, like a kung fu master, dispatched these problems in no time. I've been deeply in love with this "Kung Fu Panda" ever since.
Now, let me take you on a journey to explore Pandas' unique techniques.
Data Cleaning
In data analysis, the most time-consuming yet crucial step is data cleaning. Like stance training in kung fu, it is a fundamental skill that every data analyst must master.
import pandas as pd
import numpy as np

# A deliberately messy dataset: a missing name, impossible ages,
# mixed income formats, and stray whitespace in the departments
data = {
    'name': ['Zhang San', 'Li Si', np.nan, 'Wang Wu', 'Zhang San'],
    'age': [25, -1, 30, 150, 25],
    'income': ['5000', '8000yuan', '12000', 'bad data', '5000'],
    'department': ['Sales ', 'Tech', 'Marketing', ' Tech', 'Sales']
}
df = pd.DataFrame(data)
Does this data give you a headache? Don't worry, let's handle it step by step:
# Standardize text first so near-duplicate rows can be detected
df_clean = df.copy()
df_clean['department'] = df_clean['department'].str.strip()

# Drop exact duplicate rows
df_clean = df_clean.drop_duplicates()

# Fill missing names (column assignment avoids chained inplace calls)
df_clean['name'] = df_clean['name'].fillna('Unknown')

# Keep only plausible ages
df_clean = df_clean[(df_clean['age'] > 0) & (df_clean['age'] < 120)]

# Extract the numeric part of income; unparseable values become NaN
df_clean['income'] = df_clean['income'].str.extract(r'(\d+)', expand=False).astype(float)
See how much cleaner the data looks after applying these "kung fu" moves?
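If you want to double-check the result of these moves, a quick inspection goes a long way (just a sanity check on the df_clean we built above):
# Inspect the cleaned data: remaining rows, column types, and missing values
print(df_clean)
df_clean.info()
print(df_clean.isna().sum())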
Data Transformation
In my view, data transformation is like the "Great Shift" in kung fu: it can recast the same data into completely different shapes.
# Pivot: average income by department (rows) and age (columns)
pivot_table = df_clean.pivot_table(
    values='income',
    index='department',
    columns='age',
    aggfunc='mean',
    fill_value=0
)

# Melt: reshape from wide to long, one row per (department, metric) pair
melted_df = df_clean.melt(
    id_vars=['department'],
    value_vars=['age', 'income'],
    var_name='metric',
    value_name='value'
)
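To appreciate the "shift", it helps to print both results and compare the wide and long shapes side by side (just an inspection of the frames created above):
# Wide format: one row per department, one column per age
print(pivot_table)

# Long format: one row per (department, metric) pair
print(melted_df)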
Data Analysis
After mastering the basic moves, it's time for practical exercises. Let's see how to use Pandas for actual data analysis:
# Overall descriptive statistics for the numeric columns
stats = df_clean.describe()

# Per-department statistics for income and age
dept_stats = df_clean.groupby('department').agg({
    'income': ['mean', 'std', 'count'],
    'age': ['mean', 'min', 'max']
})

# Attach a date to each row and compute the monthly average income
# ('M' = month-end; recent pandas versions prefer the alias 'ME')
df_clean['date'] = pd.date_range(start='2024-01-01', periods=len(df_clean))
time_series = df_clean.set_index('date')['income'].resample('M').mean()
These analysis methods are particularly useful in real work. For instance, I once helped an e-commerce company analyze their sales data using these methods and discovered that weekend sales were 30% higher than weekday sales, which helped them adjust their marketing strategy accordingly.
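For the curious, here is roughly how such a weekday-versus-weekend comparison can be done. This is only a sketch: the sales DataFrame, its column names, and the numbers are made up for illustration, not the client's actual data.
# Hypothetical sales data: one row per day with a sales amount
sales = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=90, freq='D'),
    'amount': np.random.default_rng(0).integers(1000, 5000, size=90)
})

# Flag weekends (Saturday=5, Sunday=6) and compare average daily sales
sales['is_weekend'] = sales['date'].dt.dayofweek >= 5
print(sales.groupby('is_weekend')['amount'].mean())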
Practical Tips
At this point, I'd like to share some practical tips from my work experience:
# Tip 1: method chaining keeps multi-step analyses readable
result = (
    df_clean
    .query('age >= 25')
    .groupby('department')['income']
    .agg(['mean', 'count'])
    .round(2)
)

# Tip 2: apply a custom rule to every value in a column
# (a vectorized alternative is sketched after this block)
def process_income(x):
    # Boost incomes of 5000 or below by 10%
    return x if x > 5000 else x * 1.1

df_clean['adjusted_income'] = df_clean['income'].apply(process_income)

# Tip 3: boolean masks make complex row filters easy to reuse
mask = (df_clean['age'] > 30) & (df_clean['income'] > 8000)
high_value_employees = df_clean[mask]
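One more note on the apply tip: for simple element-wise rules like process_income, a vectorized np.where usually runs faster on large frames. This is just an alternative sketch with the same logic, not a required change.
# Vectorized equivalent of process_income: boost incomes of 5000 or below by 10%
df_clean['adjusted_income'] = np.where(
    df_clean['income'] > 5000,
    df_clean['income'],
    df_clean['income'] * 1.1
)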
Summary and Outlook
Through this article, we've explored the core functionalities of Pandas together. Have you noticed how Pandas is like a treasure chest, always providing the right tool when you need it?
However, this is just the tip of the iceberg of Pandas' capabilities. Did you know that Pandas can directly handle Excel files? It can also perform time series analysis and even process large-scale datasets.
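For example, reading an Excel sheet is a one-liner, and a huge CSV can be processed in chunks so it never has to fit in memory all at once. The file and column names below are placeholders, just to sketch the idea:
# Read an Excel worksheet directly (needs the openpyxl package for .xlsx files)
excel_df = pd.read_excel('sales.xlsx', sheet_name='2024')

# Process a large CSV in chunks instead of loading it all at once
total = 0
for chunk in pd.read_csv('big_sales.csv', chunksize=100_000):
    total += chunk['amount'].sum()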
Finally, I'd like to ask you a question: What data processing challenges have you encountered in your work? Feel free to share your experiences in the comments section - perhaps we can discuss how to solve these problems using Pandas together.
Every practice session is an opportunity to improve your data analysis skills. Let's keep advancing on the path of data analysis, using Pandas as our powerful tool to solve more interesting problems.