πŸ“Š Data Analytics Mastery Course

Master techniques for collecting, analyzing, and interpreting data to drive informed business decisions and strategic insights.

πŸ“š Total Modules: 20
🎯 Skill Levels: All Levels
🌎 Coverage: USA & Canada
⏱️ Total Duration: ~20 Hours

🧹 Module 3: Data Cleaning & Preparation

This module shows you how to turn raw, messy data into an analysis-ready dataset: diagnosing quality issues, handling missing values and outliers, standardizing formats and types, and building repeatable cleaning workflows.

Beginner Level
⏱️ 45-60 minutes

πŸ“š Topics Covered

  • βœ“ Common Data Quality Issues
  • βœ“ Handling Missing Values
  • βœ“ Outlier Detection & Treatment
  • βœ“ Data Normalization & Standardization
  • βœ“ Data Type Conversion
  • βœ“ Deduplication Strategies
  • βœ“ Data Transformation Techniques
  • βœ“ Creating Data Cleaning Workflows

πŸ”‘ Key Concepts

  • β€’ Identifying and categorizing data quality problems
  • β€’ Choosing appropriate strategies for missing data
  • β€’ Statistical methods for outlier detection
  • β€’ Preparing data for analysis and modeling
  • β€’ Documentation and reproducibility in data cleaning

3.1 Common Data Quality Issues

Real-world data is messy. Understanding common problems helps you clean data systematically.

The "Dirty Dozen" - Most Common Issues (top five shown):

| Issue | Example | Impact | Solution |
|-------|---------|--------|----------|
| Missing Values | Blank cells, NULL, N/A | Incomplete analysis | Imputation, deletion |
| Duplicates | Same record appears 3x | Inflated counts | Deduplication |
| Inconsistent Formats | 03/15/2025 vs 2025-03-15 | Sorting/filtering errors | Standardization |
| Outliers | Age: 999 years | Skewed statistics | Investigation, capping |
| Typos | "Californa" instead of "California" | Incorrect grouping | Fuzzy matching, validation |

Real-World Example (Healthcare - Canada):

A Vancouver hospital analyzed patient readmission rates. The initial data showed an 18% readmission rate. After cleaning duplicates (the same patient counted multiple times due to system errors), inconsistent date formats, and outliers (one patient with 47 admissions, flagged as a data entry error), the true rate was 12.3% - a significant difference for resource planning.

3.2 Handling Missing Values

Missing data is inevitable. The key is choosing the right strategy based on why data is missing.

Types of Missing Data:

  • MCAR (Missing Completely At Random) - No pattern, random omissions
  • MAR (Missing At Random) - Missing related to other variables
  • MNAR (Missing Not At Random) - Missing for systematic reasons

Simulation: Missing Data Analysis Tool

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Missing Data Analyzer                        β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                              β”‚
β”‚ Dataset: Customer_Database.csv               β”‚
β”‚ Total Records: 50,000                        β”‚
β”‚                                              β”‚
β”‚ Missing Values Summary:                      β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”      β”‚
β”‚ β”‚ Column         β”‚ Missing β”‚ % Missingβ”‚      β”‚
β”‚ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€      β”‚
β”‚ β”‚ Customer_ID    β”‚       0 β”‚     0.0% β”‚      β”‚
β”‚ β”‚ Email          β”‚   4,123 β”‚     8.2% β”‚      β”‚
β”‚ β”‚ Phone          β”‚   2,847 β”‚     5.7% β”‚      β”‚
β”‚ β”‚ Income         β”‚   6,500 β”‚    13.0% β”‚      β”‚
β”‚ β”‚ Age            β”‚     342 β”‚     0.7% β”‚      β”‚
β”‚ β”‚ Purchase_Date  β”‚       0 β”‚     0.0% β”‚      β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜      β”‚
β”‚                                              β”‚
β”‚ Recommended Actions:                         β”‚
β”‚ β€’ Email: ⚠️ Keep (8% acceptable for email)   β”‚
β”‚ β€’ Phone: βœ“ Keep or impute median             β”‚
β”‚ β€’ Income: ⚠️ High missing - investigate      β”‚
β”‚ β€’ Age: βœ“ Impute with median (0.7% only)      β”‚
β”‚                                              β”‚
β”‚ [Generate Report] [Apply Fixes] [Export]     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
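
The same summary can be produced in a few lines of pandas. A minimal sketch - the file name and column names mirror the simulated dataset above:

```python
import pandas as pd

# Load the customer dataset (file name matches the analyzer above)
df = pd.read_csv("Customer_Database.csv")

# Count and percentage of missing values per column, mirroring the
# analyzer's summary table
summary = pd.DataFrame({
    "Missing": df.isna().sum(),
    "% Missing": (df.isna().mean() * 100).round(1),
})
print(summary.sort_values("% Missing", ascending=False))
```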

Strategies for Handling Missing Data:

| Strategy | When to Use | Pros | Cons |
|----------|-------------|------|------|
| Deletion | <5% missing, MCAR | Simple, no assumptions | Reduces sample size |
| Mean/Median Imputation | Numeric, <10% missing | Preserves sample size | Reduces variance |
| Mode Imputation | Categorical data | Quick, maintains categories | May not fit the individual |
| Forward/Backward Fill | Time series data | Logical for temporal data | Assumes stability |
| Predictive Imputation | Complex patterns, MAR | Most accurate | Complex, time-consuming |

Example: Excel Imputation

Before:
Age: 25, 32, [blank], 45, [blank], 28, 52

After (Median Imputation):
Median = 32
Age: 25, 32, 32, 45, 32, 28, 52

Excel Formula: =IF(ISBLANK(A2), MEDIAN($A$2:$A$100), A2)
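
The same imputation in pandas, as a minimal sketch (the values match the example above):

```python
import pandas as pd

ages = pd.Series([25, 32, None, 45, None, 28, 52], name="Age")

# The median is computed over the non-missing values only (32)
ages = ages.fillna(ages.median())
print(ages.tolist())  # [25.0, 32.0, 32.0, 45.0, 32.0, 28.0, 52.0]
```

For time series, the equivalent of forward fill is the .ffill() method.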

3.3 Outlier Detection & Treatment

Outliers can be genuine extreme values or errors. Detecting them requires statistical methods.

Common Detection Methods:

  • IQR Method - Values beyond Q1 - 1.5Γ—IQR or Q3 + 1.5Γ—IQR
  • Z-Score - Values more than 3 standard deviations from mean
  • Visual Inspection - Box plots, scatter plots
  • Domain Knowledge - Age > 120, negative prices

Simulation: Outlier Detection Dashboard

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Outlier Detection Tool                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                              β”‚
β”‚ Column: Order_Amount                         β”‚
β”‚ Method: [IQR Method β–Ό]                       β”‚
β”‚                                              β”‚
β”‚ Statistics:                                  β”‚
β”‚ Mean:    $342.50                             β”‚
β”‚ Median:  $289.00                             β”‚
β”‚ Std Dev: $156.23                             β”‚
β”‚ Q1:      $185.00                             β”‚
β”‚ Q3:      $425.00                             β”‚
β”‚ IQR:     $240.00                             β”‚
β”‚                                              β”‚
β”‚ Outlier Boundaries:                          β”‚
β”‚ Lower: -$175.00 (Q1 - 1.5Γ—IQR)               β”‚
β”‚ Upper: $785.00  (Q3 + 1.5Γ—IQR)               β”‚
β”‚                                              β”‚
β”‚ Outliers Detected: 347 (0.7%)                β”‚
β”‚                                              β”‚
β”‚ Sample Outliers:                             β”‚
β”‚ β€’ Order #45892: $12,450.00 ⚠️                β”‚
β”‚ β€’ Order #78234: $8,923.00 ⚠️                 β”‚
β”‚ β€’ Order #12455: $7,100.00 ⚠️                 β”‚
β”‚                                              β”‚
β”‚ [View All] [Cap Values] [Remove] [Keep]      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
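
A sketch of the IQR method in pandas; the Order_Amount column name is taken from the dashboard above:

```python
import pandas as pd

def iqr_bounds(s: pd.Series, k: float = 1.5):
    """Return (lower, upper) bounds: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Example usage on an orders table:
# lower, upper = iqr_bounds(orders["Order_Amount"])
# mask = (orders["Order_Amount"] < lower) | (orders["Order_Amount"] > upper)
# print(f"Outliers: {mask.sum()} ({mask.mean():.1%})")
```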

Treatment Options:

  1. Investigate - Are they errors or legitimate? (Check original source)
  2. Remove - Delete if confirmed errors or impossible values
  3. Cap (Winsorize) - Replace with boundary value (e.g., 95th percentile)
  4. Transform - Log transformation reduces outlier impact
  5. Keep - If legitimate and meaningful (e.g., high-value customers)

E-commerce Example (USA):
A New York retailer found 23 orders over $10,000 (outliers). Investigation revealed:
β€’ 18 were legitimate bulk B2B orders β†’ Kept
β€’ 3 were data entry errors (decimal point mistakes) β†’ Corrected
β€’ 2 were fraudulent transactions β†’ Removed and flagged
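
Options 3 and 4 take only a few lines with pandas and NumPy. A sketch on toy data, using the 5th/95th percentiles as illustrative capping bounds:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Order_Amount": [120.0, 289.0, 425.0, 12450.0]})  # toy data

# Cap (winsorize): clip values into the 5th-95th percentile range
low, high = df["Order_Amount"].quantile([0.05, 0.95])
df["Order_Amount_capped"] = df["Order_Amount"].clip(low, high)

# Transform: log1p compresses the long right tail (and handles zeros)
df["Order_Amount_log"] = np.log1p(df["Order_Amount"])
```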

3.4 Data Normalization & Standardization

Different scales can bias analysis. Normalization brings variables to comparable ranges.

When to Normalize:

  • Machine learning algorithms (neural networks, K-means clustering)
  • Comparing variables with different units (age vs income)
  • Distance-based calculations

Common Techniques:

| Method | Formula | Range | Use Case |
|--------|---------|-------|----------|
| Min-Max Normalization | (X - min) / (max - min) | 0 to 1 | Neural networks |
| Z-Score Standardization | (X - mean) / std | ~-3 to 3 | Most algorithms |
| Robust Scaling | (X - median) / IQR | Varies | Data with outliers |

Example: Normalizing Customer Data

Original Data:
Age: 25, 35, 45, 55 (range: 30)
Income: $30,000, $50,000, $70,000, $90,000 (range: $60,000)

After Min-Max Normalization (0-1):
Age: 0.00, 0.33, 0.67, 1.00
Income: 0.00, 0.33, 0.67, 1.00

Now both variables are on same scale for clustering algorithms!
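
Both rescalings are one-liners in pandas. A sketch using the ages and incomes above:

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 35, 45, 55],
                   "Income": [30_000, 50_000, 70_000, 90_000]})

# Min-max normalization: (X - min) / (max - min) -> values in [0, 1]
minmax = (df - df.min()) / (df.max() - df.min())

# Z-score standardization: (X - mean) / std
zscore = (df - df.mean()) / df.std()

print(minmax)  # both columns now share the same 0-1 scale
```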

3.5 Data Type Conversion

Ensuring correct data types prevents errors and enables proper analysis.

Common Conversions Needed:

  • Text to Date: "2025-03-15" (string) β†’ March 15, 2025 (date)
  • Text to Number: "$1,234.56" (string) β†’ 1234.56 (numeric)
  • Number to Category: Age 25 β†’ "Young Adult" category
  • Boolean Conversion: "Yes"/"No" β†’ True/False

Simulation: Data Type Converter

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Data Type Conversion Tool                    β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                              β”‚
β”‚ Column: Purchase_Date                        β”‚
β”‚ Current Type: Text/String                    β”‚
β”‚ Detected Format: MM/DD/YYYY                  β”‚
β”‚                                              β”‚
β”‚ Sample Values:                               β”‚
β”‚ β€’ "03/15/2025"                               β”‚
β”‚ β€’ "04/02/2025"                               β”‚
β”‚ β€’ "12/25/2024"                               β”‚
β”‚                                              β”‚
β”‚ Convert To: [Date/Time β–Ό]                    β”‚
β”‚ Output Format: YYYY-MM-DD                    β”‚
β”‚                                              β”‚
β”‚ Preview After Conversion:                    β”‚
β”‚ β€’ 2025-03-15                                 β”‚
β”‚ β€’ 2025-04-02                                 β”‚
β”‚ β€’ 2024-12-25                                 β”‚
β”‚                                              β”‚
β”‚ ⚠️ 3 values could not be converted           β”‚
β”‚ (Invalid dates - will be set to NULL)        β”‚
β”‚                                              β”‚
β”‚ [Convert] [Preview More] [Cancel]            β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
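
In pandas, errors="coerce" reproduces the tool's "set to NULL" behaviour: unparseable dates become NaT. A minimal sketch (the sample values echo the tool above; the boolean example is illustrative):

```python
import pandas as pd

dates = pd.Series(["03/15/2025", "04/02/2025", "13/45/2025"])

# Invalid dates (like "13/45/2025") become NaT, pandas' NULL for datetimes
converted = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
print(converted)

# Boolean conversion: map text flags to True/False
flags = pd.Series(["Yes", "No", "Yes"]).map({"Yes": True, "No": False})
```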

Common Issues & Solutions:

Problem: "Revenue" column imported as text due to $ signs and commas
Solution: Remove $ and commas, convert to numeric

Excel: =VALUE(SUBSTITUTE(SUBSTITUTE(A2,"$",""),",",""))
Python: df["Revenue"] = df["Revenue"].str.replace("[$,]", "").astype(float)

3.6 Deduplication Strategies

Duplicate records inflate counts and skew analysis. Effective deduplication requires strategy.

Types of Duplicates:

  • Exact Duplicates - All fields identical (easy to detect)
  • Near Duplicates - Minor differences (typos, formatting)
  • Semantic Duplicates - Same entity, different representation

Simulation: Duplicate Detection Tool

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Duplicate Record Finder                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                              β”‚
β”‚ Match On: β˜‘ Email  β˜‘ Phone  ☐ Name           β”‚
β”‚ Matching Type: [Exact Match β–Ό]               β”‚
β”‚                                              β”‚
β”‚ Duplicates Found: 1,203 records              β”‚
β”‚                                              β”‚
β”‚ Example Duplicate Set:                       β”‚
β”‚ β”Œβ”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”    β”‚
β”‚ β”‚ ID   β”‚ Email       β”‚ Phone    β”‚ Date  β”‚    β”‚
β”‚ β”œβ”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€    β”‚
β”‚ β”‚ 1045 β”‚ john@co.com β”‚ 555-1234 β”‚ Jan 5 β”‚    β”‚
β”‚ β”‚ 2389 β”‚ john@co.com β”‚ 555-1234 β”‚ Feb 2 β”‚    β”‚
β”‚ β”‚ 4521 β”‚ john@co.com β”‚ 555-1234 β”‚ Mar 8 β”‚    β”‚
β”‚ β””β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”˜    β”‚
β”‚                                              β”‚
β”‚ Keep Record: β¦Ώ Most Recent  β—‹ First          β”‚
β”‚              β—‹ Manual Review                 β”‚
β”‚                                              β”‚
β”‚ [Remove Duplicates] [Export List] [Review]   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Deduplication Decision Tree:

Step 1: Identify key fields (email, customer_id, etc.)
Step 2: Check for exact matches on key fields
Step 3: For near-duplicates, use fuzzy matching (Levenshtein distance)
Step 4: Decide which record to keep:
  β€’ Most recent (if timestamped)
  β€’ Most complete (fewest nulls)
  β€’ Manual review for critical data
Step 5: Document removals for audit trail
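
Steps 2-4 of the decision tree, sketched in pandas (file and column names are assumed to match the customer dataset from earlier; fuzzy matching here uses the standard library's difflib rather than a dedicated Levenshtein package):

```python
from difflib import SequenceMatcher

import pandas as pd

df = pd.read_csv("Customer_Database.csv", parse_dates=["Purchase_Date"])

# Exact duplicates on key fields: keep the most recent record
# per (Email, Phone) pair
deduped = (df.sort_values("Purchase_Date")
             .drop_duplicates(subset=["Email", "Phone"], keep="last"))

# Near-duplicates: a simple similarity ratio flags likely typos
print(SequenceMatcher(None, "Californa", "California").ratio())  # ~0.95
```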

3.7 Data Transformation Techniques

Reshaping and deriving new variables makes data more useful for analysis.

Common Transformations:

| Transformation | Purpose | Example |
|----------------|---------|---------|
| Binning | Create categories from continuous values | Age β†’ Age Groups (18-25, 26-35...) |
| Log Transformation | Reduce skewness | Income β†’ log(Income) |
| Feature Engineering | Create new variables | Revenue - Cost = Profit |
| Pivoting | Reshape data structure | Long format β†’ Wide format |
| Encoding | Convert categories to numbers | Red/Blue/Green β†’ 1/2/3 |
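
A sketch of binning and encoding in pandas (bin edges, labels, and the toy data are illustrative; one-hot dummies are used here instead of 1/2/3 codes to avoid implying an order among colors):

```python
import pandas as pd

df = pd.DataFrame({"Age": [19, 23, 31, 42],
                   "Color": ["Red", "Blue", "Green", "Red"]})

# Binning: continuous age -> labelled age groups
df["Age_Group"] = pd.cut(df["Age"], bins=[17, 25, 35, 120],
                         labels=["18-25", "26-35", "36+"])

# Encoding: one-hot columns Color_Red, Color_Blue, Color_Green
df = pd.get_dummies(df, columns=["Color"])
```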

Example: Feature Engineering

Banking Example (Canada):
A Toronto bank improved credit risk models by creating derived features:

Original Fields:
β€’ Income: $65,000
β€’ Debt: $28,000
β€’ Age: 34

Engineered Features:
β€’ Debt-to-Income Ratio: 43% (28000/65000)
β€’ Years to Retirement: 31 (65 - 34)
β€’ Risk Score: Calculated from multiple factors

Result: Model accuracy improved from 76% to 84%
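
The derived features from this example, sketched in pandas (the bank's actual risk-score formula is not given, so only the first two features are shown):

```python
import pandas as pd

df = pd.DataFrame({"Income": [65_000], "Debt": [28_000], "Age": [34]})

# Debt-to-income ratio: 28,000 / 65,000 = 0.43 (43%)
df["Debt_To_Income"] = df["Debt"] / df["Income"]

# Years to a retirement age of 65
df["Years_To_Retirement"] = 65 - df["Age"]
```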

3.8 Creating Data Cleaning Workflows

Document your cleaning process for reproducibility and transparency.

Standard Cleaning Workflow:

  1. Initial Assessment - Profile data, identify issues
  2. Handle Missing Values - Decide strategy per column
  3. Remove Duplicates - Based on key fields
  4. Fix Data Types - Convert to appropriate formats
  5. Detect Outliers - Statistical methods + domain knowledge
  6. Standardize Values - Consistent formats, spelling
  7. Transform/Derive - Create calculated fields
  8. Validate - Check business rules, ranges
  9. Document - Log all changes made

Simulation: Data Cleaning Workflow Builder

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Data Cleaning Pipeline                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                              β”‚
β”‚ Step 1: βœ“ Load Data (50,000 rows)            β”‚
β”‚ Step 2: βœ“ Remove Duplicates (1,203 found)    β”‚
β”‚ Step 3: β–Ί Handle Missing Values              β”‚
β”‚   β€’ Email: Keep as-is (8.2%)                 β”‚
β”‚   β€’ Age: Impute median                       β”‚
β”‚   β€’ Income: Delete rows (13% - high)         β”‚
β”‚ Step 4: β—‹ Detect Outliers                    β”‚
β”‚ Step 5: β—‹ Normalize Data                     β”‚
β”‚ Step 6: β—‹ Validate Results                   β”‚
β”‚                                              β”‚
β”‚ Expected Output: ~42,500 rows                β”‚
β”‚                                              β”‚
β”‚ [Run Pipeline] [Save Workflow] [Schedule]    β”‚
β”‚                                              β”‚
β”‚ β˜‘ Generate cleaning report                   β”‚
β”‚ β˜‘ Save cleaned data to new file              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
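
A workflow like this is easiest to reproduce when captured as a single documented function. A sketch, with the steps, thresholds, and column names taken from the screens above:

```python
import pandas as pd

def clean_customers(path: str) -> pd.DataFrame:
    """Run the cleaning pipeline and print a simple audit trail."""
    df = pd.read_csv(path, parse_dates=["Purchase_Date"])
    log = {"loaded": len(df)}

    # Step 2: remove duplicates, keeping the most recent record
    df = (df.sort_values("Purchase_Date")
            .drop_duplicates(subset=["Email", "Phone"], keep="last"))
    log["after_dedup"] = len(df)

    # Step 3: impute Age with the median; drop rows missing Income
    df["Age"] = df["Age"].fillna(df["Age"].median())
    df = df.dropna(subset=["Income"])
    log["after_missing"] = len(df)

    print(log)  # row counts at each step, for the cleaning report
    return df
```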

βœ“ Module 3 Complete

You've learned:

  • Common data quality issues and their impacts
  • Strategies for handling missing values (deletion, imputation)
  • Outlier detection methods (IQR, Z-score) and treatment
  • Data normalization and standardization techniques
  • Data type conversion and format standardization
  • Deduplication strategies for exact and near-duplicates
  • Data transformation techniques (binning, encoding, feature engineering)
  • Building reproducible data cleaning workflows
  • Real-world examples from healthcare, banking, and e-commerce

Next: Module 4 covers Excel for data analysis - the most universal analytics tool.
