🧹 Module 3: Data Cleaning & Preparation
This module covers the essential concepts and practical techniques for cleaning and preparing real-world data for analysis.
Beginner Level
⏱️ 45-60 minutes
📋 Topics Covered
- ✅ Common Data Quality Issues
- ✅ Handling Missing Values
- ✅ Outlier Detection & Treatment
- ✅ Data Normalization & Standardization
- ✅ Data Type Conversion
- ✅ Deduplication Strategies
- ✅ Data Transformation Techniques
- ✅ Creating Data Cleaning Workflows
🔑 Key Concepts
- Identifying and categorizing data quality problems
- Choosing appropriate strategies for missing data
- Statistical methods for outlier detection
- Preparing data for analysis and modeling
- Documentation and reproducibility in data cleaning
3.1 Common Data Quality Issues
Real-world data is messy. Understanding common problems helps you clean data systematically.
The "Dirty Dozen" - five of the most common issues:
| Issue                | Example                             | Impact                   | Solution                   |
|----------------------|-------------------------------------|--------------------------|----------------------------|
| Missing Values       | Blank cells, NULL, N/A              | Incomplete analysis      | Imputation, deletion       |
| Duplicates           | Same record appears 3x              | Inflated counts          | Deduplication              |
| Inconsistent Formats | 03/15/2025 vs 2025-03-15            | Sorting/filtering errors | Standardization            |
| Outliers             | Age: 999 years                      | Skewed statistics        | Investigation, capping     |
| Typos                | "Californa" instead of "California" | Incorrect grouping       | Fuzzy matching, validation |
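Before fixing anything, it helps to profile a dataset to see which of these issues are present. A minimal sketch in pandas (the file and column names here are hypothetical):

import pandas as pd

df = pd.read_csv("customers.csv")           # hypothetical input file
print(df.isna().sum())                      # missing values per column
print(df.duplicated().sum())                # exact duplicate rows
print(df.describe())                        # min/max expose outliers (e.g., Age of 999)
print(df["State"].value_counts().head(10))  # typos surface as rare variants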
Real-World Example (Healthcare - Canada):
A Vancouver hospital analyzed patient readmission rates. The initial data showed an 18% readmission rate. After removing duplicates (the same patient counted multiple times due to system errors), standardizing inconsistent date formats, and resolving outliers (one patient with 47 admissions was flagged as a data entry error), the true rate was 12.3% - a significant difference for resource planning.
3.2 Handling Missing Values
Missing data is inevitable. The key is choosing the right strategy based on why data is missing.
Types of Missing Data:
- MCAR (Missing Completely At Random) - No pattern; omissions are purely random (e.g., a sensor glitch)
- MAR (Missing At Random) - Missingness is related to other observed variables (e.g., younger customers skip the phone field)
- MNAR (Missing Not At Random) - Missing for systematic reasons tied to the value itself (e.g., high earners decline to report income)
Simulation: Missing Data Analysis Tool
┌──────────────────────────────────────────────┐
│ Missing Data Analyzer                        │
├──────────────────────────────────────────────┤
│                                              │
│ Dataset: Customer_Database.csv               │
│ Total Records: 50,000                        │
│                                              │
│ Missing Values Summary:                      │
│ ┌────────────────┬─────────┬───────────┐     │
│ │ Column         │ Missing │ % Missing │     │
│ ├────────────────┼─────────┼───────────┤     │
│ │ Customer_ID    │       0 │      0.0% │     │
│ │ Email          │   4,123 │      8.2% │     │
│ │ Phone          │   2,847 │      5.7% │     │
│ │ Income         │   6,500 │     13.0% │     │
│ │ Age            │     342 │      0.7% │     │
│ │ Purchase_Date  │       0 │      0.0% │     │
│ └────────────────┴─────────┴───────────┘     │
│                                              │
│ Recommended Actions:                         │
│ • Email: ⚠️ Keep (8% acceptable for email)   │
│ • Phone: ✅ Keep or impute median            │
│ • Income: ⚠️ High missing - investigate      │
│ • Age: ✅ Impute with median (0.7% only)     │
│                                              │
│ [Generate Report] [Apply Fixes] [Export]     │
└──────────────────────────────────────────────┘
Strategies for Handling Missing Data:
| Strategy               | When to Use           | Pros                        | Cons                           |
|------------------------|-----------------------|-----------------------------|--------------------------------|
| Deletion               | <5% missing, MCAR     | Simple, no assumptions      | Reduces sample size            |
| Mean/Median Imputation | Numeric, <10% missing | Preserves sample size       | Reduces variance               |
| Mode Imputation        | Categorical data      | Quick, maintains categories | May not fit individual records |
| Forward/Backward Fill  | Time series data      | Logical for temporal data   | Assumes stability              |
| Predictive Imputation  | Complex patterns, MAR | Most accurate               | Complex, time-consuming        |
Example: Excel Imputation
Before:
Age: 25, 32, [blank], 45, [blank], 28, 52
After (Median Imputation):
Median = 32
Age: 25, 32, 32, 45, 32, 28, 52
Excel Formula: =IF(ISBLANK(A2), MEDIAN($A$2:$A$100), A2)
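The same median imputation in pandas might look like this minimal sketch (the Age values are taken from the example above):

import pandas as pd

ages = pd.Series([25, 32, None, 45, None, 28, 52], name="Age")
median = ages.median()             # 32.0, computed from the non-missing values
ages_filled = ages.fillna(median)  # 25, 32, 32, 45, 32, 28, 52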
3.3 Outlier Detection & Treatment
Outliers can be genuine extreme values or errors. Detecting them requires statistical methods.
Common Detection Methods:
- IQR Method - Values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR (sketched in code after this list)
- Z-Score - Values more than 3 standard deviations from mean
- Visual Inspection - Box plots, scatter plots
- Domain Knowledge - Age > 120, negative prices
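As a minimal sketch, the IQR method can be written in a few lines of pandas (toy data here; on the dashboard below the same rule produces boundaries of -$175 and $785):

import pandas as pd

df = pd.DataFrame({"Order_Amount": [120.0, 289.0, 310.0, 425.0, 12450.0]})  # toy data

q1 = df["Order_Amount"].quantile(0.25)
q3 = df["Order_Amount"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the IQR fences
outliers = df[(df["Order_Amount"] < lower) | (df["Order_Amount"] > upper)]
print(outliers)                                 # flags the $12,450 order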
Simulation: Outlier Detection Dashboard
┌──────────────────────────────────────────────┐
│ Outlier Detection Tool                       │
├──────────────────────────────────────────────┤
│                                              │
│ Column: Order_Amount                         │
│ Method: [IQR Method ▼]                       │
│                                              │
│ Statistics:                                  │
│   Mean:    $342.50                           │
│   Median:  $289.00                           │
│   Std Dev: $156.23                           │
│   Q1:      $185.00                           │
│   Q3:      $425.00                           │
│   IQR:     $240.00                           │
│                                              │
│ Outlier Boundaries:                          │
│   Lower: -$175.00 (Q1 - 1.5×IQR)             │
│   Upper:  $785.00 (Q3 + 1.5×IQR)             │
│                                              │
│ Outliers Detected: 347 (0.7%)                │
│                                              │
│ Sample Outliers:                             │
│ • Order #45892: $12,450.00 ⚠️                │
│ • Order #78234:  $8,923.00 ⚠️                │
│ • Order #12455:  $7,100.00 ⚠️                │
│                                              │
│ [View All] [Cap Values] [Remove] [Keep]      │
└──────────────────────────────────────────────┘
Treatment Options:
- Investigate - Are they errors or legitimate? (Check original source)
- Remove - Delete if confirmed errors or impossible values
- Cap (Winsorize) - Replace with boundary value (e.g., 95th percentile); see the sketch after this list
- Transform - Log transformation reduces outlier impact
- Keep - If legitimate and meaningful (e.g., high-value customers)
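Capping can be sketched with pandas' clip, continuing the toy example above (lower and upper are the IQR fences computed earlier):

# Winsorize: replace values beyond the fences with the boundary values
df["Amount_capped"] = df["Order_Amount"].clip(lower=lower, upper=upper)

# Alternative: cap at the 5th/95th percentiles instead of the IQR fences
p5, p95 = df["Order_Amount"].quantile([0.05, 0.95])
df["Amount_p95"] = df["Order_Amount"].clip(lower=p5, upper=p95)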
E-commerce Example (USA):
A New York retailer found 23 orders over $10,000 (outliers). Investigation revealed:
• 18 were legitimate bulk B2B orders → Kept
• 3 were data entry errors (decimal point mistakes) → Corrected
• 2 were fraudulent transactions → Removed and flagged
3.4 Data Normalization & Standardization
Different scales can bias analysis. Normalization brings variables to comparable ranges.
When to Normalize:
- Machine learning algorithms (neural networks, K-means clustering)
- Comparing variables with different units (age vs income)
- Distance-based calculations
Common Techniques:
| Method                  | Formula                 | Range    | Use Case           |
|-------------------------|-------------------------|----------|--------------------|
| Min-Max Normalization   | (X - min) / (max - min) | 0 to 1   | Neural networks    |
| Z-Score Standardization | (X - mean) / std        | ~-3 to 3 | Most algorithms    |
| Robust Scaling          | (X - median) / IQR      | Varies   | Data with outliers |
Example: Normalizing Customer Data
Original Data:
Age: 25, 35, 45, 55 (range: 30)
Income: $30,000, $50,000, $70,000, $90,000 (range: $60,000)
After Min-Max Normalization (0-1):
Age: 0.00, 0.33, 0.67, 1.00
Income: 0.00, 0.33, 0.67, 1.00
Now both variables are on the same scale for clustering algorithms!
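Both scalings can be reproduced with a short pandas sketch:

import pandas as pd

df = pd.DataFrame({"Age": [25, 35, 45, 55],
                   "Income": [30000, 50000, 70000, 90000]})

# Min-max normalization to the 0-1 range (applied column-wise)
minmax = (df - df.min()) / (df.max() - df.min())  # Age: 0.00, 0.33, 0.67, 1.00

# Z-score standardization: mean 0, standard deviation 1 per column
zscore = (df - df.mean()) / df.std()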
3.5 Data Type Conversion
Ensuring correct data types prevents errors and enables proper analysis.
Common Conversions Needed:
- Text to Date: "2025-03-15" (string) → March 15, 2025 (date)
- Text to Number: "$1,234.56" (string) → 1234.56 (numeric)
- Number to Category: Age 25 → "Young Adult" category
- Boolean Conversion: "Yes"/"No" → True/False
Simulation: Data Type Converter
┌──────────────────────────────────────────────┐
│ Data Type Conversion Tool                    │
├──────────────────────────────────────────────┤
│                                              │
│ Column: Purchase_Date                        │
│ Current Type: Text/String                    │
│ Detected Format: MM/DD/YYYY                  │
│                                              │
│ Sample Values:                               │
│ • "03/15/2025"                               │
│ • "04/02/2025"                               │
│ • "12/25/2024"                               │
│                                              │
│ Convert To: [Date/Time ▼]                    │
│ Output Format: YYYY-MM-DD                    │
│                                              │
│ Preview After Conversion:                    │
│ • 2025-03-15                                 │
│ • 2025-04-02                                 │
│ • 2024-12-25                                 │
│                                              │
│ ⚠️ 3 values could not be converted           │
│    (Invalid dates - will be set to NULL)     │
│                                              │
│ [Convert] [Preview More] [Cancel]            │
└──────────────────────────────────────────────┘
Common Issues & Solutions:
Problem: "Revenue" column imported as text due to $ signs and commas
Solution: Remove $ and commas, convert to numeric
Excel: =VALUE(SUBSTITUTE(SUBSTITUTE(A2,"$",""),",",""))
Python: df["Revenue"] = df["Revenue"].str.replace("[$,]", "", regex=True).astype(float)
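The date conversion from the simulation above can be sketched the same way; in pandas, errors="coerce" turns unparseable values into NaT (the date equivalent of NULL), matching the tool's behaviour:

import pandas as pd

dates = pd.Series(["03/15/2025", "04/02/2025", "13/45/2025"])  # last value is invalid
converted = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
# Result: 2025-03-15, 2025-04-02, NaT (could not be converted)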
3.6 Deduplication Strategies
Duplicate records inflate counts and skew analysis. Effective deduplication requires strategy.
Types of Duplicates:
- Exact Duplicates - All fields identical (easy to detect)
- Near Duplicates - Minor differences (typos, formatting)
- Semantic Duplicates - Same entity, different representation
Simulation: Duplicate Detection Tool
┌──────────────────────────────────────────────┐
│ Duplicate Record Finder                      │
├──────────────────────────────────────────────┤
│                                              │
│ Match On: ☑ Email  ☑ Phone  ☐ Name           │
│ Matching Type: [Exact Match ▼]               │
│                                              │
│ Duplicates Found: 1,203 records              │
│                                              │
│ Example Duplicate Set:                       │
│ ┌──────┬─────────────┬──────────┬───────┐    │
│ │ ID   │ Email       │ Phone    │ Date  │    │
│ ├──────┼─────────────┼──────────┼───────┤    │
│ │ 1045 │ john@co.com │ 555-1234 │ Jan 5 │    │
│ │ 2389 │ john@co.com │ 555-1234 │ Feb 2 │    │
│ │ 4521 │ john@co.com │ 555-1234 │ Mar 8 │    │
│ └──────┴─────────────┴──────────┴───────┘    │
│                                              │
│ Keep Record: (●) Most Recent  ( ) First      │
│              ( ) Manual Review               │
│                                              │
│ [Remove Duplicates] [Export List] [Review]   │
└──────────────────────────────────────────────┘
Deduplication Decision Tree:
Step 1: Identify key fields (email, customer_id, etc.)
Step 2: Check for exact matches on key fields
Step 3: For near-duplicates, use fuzzy matching (e.g., Levenshtein distance; see the sketch after these steps)
Step 4: Decide which record to keep:
β’ Most recent (if timestamped)
β’ Most complete (fewest nulls)
β’ Manual review for critical data
Step 5: Document removals for audit trail
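A minimal sketch of steps 2-4 in pandas, using the duplicate set from the simulation above. For near-duplicates it uses the standard library's SequenceMatcher, a Levenshtein-style similarity score (dedicated libraries such as rapidfuzz serve the same purpose):

import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({
    "ID":    [1045, 2389, 4521],
    "Email": ["john@co.com", "john@co.com", "john@co.com"],
    "Date":  pd.to_datetime(["2025-01-05", "2025-02-02", "2025-03-08"]),
})

# Exact match on the key field, keeping the most recent record
deduped = df.sort_values("Date").drop_duplicates(subset="Email", keep="last")

# Fuzzy similarity (0.0-1.0) for near-duplicates such as typos
score = SequenceMatcher(None, "Californa", "California").ratio()  # ~0.95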
3.7 Data Transformation Techniques
Reshaping and deriving new variables makes data more useful for analysis.
Common Transformations:
| Transformation      | Purpose                           | Example                            |
|---------------------|-----------------------------------|------------------------------------|
| Binning             | Create categories from continuous | Age → Age Groups (18-25, 26-35, …) |
| Log Transformation  | Reduce skewness                   | Income → log(Income)               |
| Feature Engineering | Create new variables              | Revenue - Cost = Profit            |
| Pivoting            | Reshape data structure            | Long format → Wide format          |
| Encoding            | Convert categories to numbers     | Red/Blue/Green → 1/2/3             |
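A few of these transformations sketched in pandas (column names and bin edges are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22, 30, 41, 58],
                   "Income": [30000, 50000, 70000, 250000],
                   "Color": ["Red", "Blue", "Green", "Red"]})

# Binning: continuous age -> labelled groups
df["Age_Group"] = pd.cut(df["Age"], bins=[17, 25, 35, 50, 120],
                         labels=["18-25", "26-35", "36-50", "51+"])

# Log transformation to reduce right skew
df["Log_Income"] = np.log(df["Income"])

# Encoding: one indicator column per category (an alternative to 1/2/3 codes)
df = pd.get_dummies(df, columns=["Color"])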
Example: Feature Engineering
Banking Example (Canada):
A Toronto bank improved credit risk models by creating derived features:
Original Fields:
• Income: $65,000
• Debt: $28,000
• Age: 34
Engineered Features:
• Debt-to-Income Ratio: 43% (28,000 / 65,000)
• Years to Retirement: 31 (65 - 34)
• Risk Score: Calculated from multiple factors
Result: Model accuracy improved from 76% to 84%
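The derived features themselves are simple arithmetic; a sketch with the values above:

income, debt, age = 65_000, 28_000, 34
dti = debt / income              # 0.431 -> the 43% debt-to-income ratio
years_to_retirement = 65 - age   # 31, assuming retirement at 65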
3.8 Creating Data Cleaning Workflows
Document your cleaning process for reproducibility and transparency.
Standard Cleaning Workflow (a code sketch of such a pipeline follows this list):
1. Initial Assessment - Profile data, identify issues
2. Handle Missing Values - Decide strategy per column
3. Remove Duplicates - Based on key fields
4. Fix Data Types - Convert to appropriate formats
5. Detect Outliers - Statistical methods + domain knowledge
6. Standardize Values - Consistent formats, spelling
7. Transform/Derive - Create calculated fields
8. Validate - Check business rules, ranges
9. Document - Log all changes made
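A minimal sketch of such a workflow as a pandas pipeline, with each step a named function so the sequence documents itself (the function bodies and file names are hypothetical; the columns follow the running Customer_Database example):

import pandas as pd

def remove_duplicates(df):
    return df.drop_duplicates(subset="Customer_ID")

def impute_age(df):
    return df.assign(Age=df["Age"].fillna(df["Age"].median()))

def drop_missing_income(df):
    return df.dropna(subset=["Income"])

cleaned = (pd.read_csv("Customer_Database.csv")
             .pipe(remove_duplicates)
             .pipe(impute_age)
             .pipe(drop_missing_income))
cleaned.to_csv("Customer_Database_clean.csv", index=False)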
Simulation: Data Cleaning Workflow Builder
┌──────────────────────────────────────────────┐
│ Data Cleaning Pipeline                       │
├──────────────────────────────────────────────┤
│                                              │
│ Step 1: ✓ Load Data (50,000 rows)            │
│ Step 2: ✓ Remove Duplicates (1,203 found)    │
│ Step 3: ► Handle Missing Values              │
│   • Email: Keep as-is (8.2%)                 │
│   • Age: Impute median                       │
│   • Income: Delete rows (13% - high)         │
│ Step 4: ○ Detect Outliers                    │
│ Step 5: ○ Normalize Data                     │
│ Step 6: ○ Validate Results                   │
│                                              │
│ Expected Output: ~42,500 rows                │
│                                              │
│ [Run Pipeline] [Save Workflow] [Schedule]    │
│                                              │
│ ☑ Generate cleaning report                   │
│ ☑ Save cleaned data to new file              │
└──────────────────────────────────────────────┘
✅ Module 3 Complete
You've learned:
- Common data quality issues and their impacts
- Strategies for handling missing values (deletion, imputation)
- Outlier detection methods (IQR, Z-score) and treatment
- Data normalization and standardization techniques
- Data type conversion and format standardization
- Deduplication strategies for exact and near-duplicates
- Data transformation techniques (binning, encoding, feature engineering)
- Building reproducible data cleaning workflows
- Real-world examples from healthcare, banking, and e-commerce
Next: Module 4 covers Excel for data analysis - the most universal analytics tool.