🧹 Module 3: Data Cleaning & Preparation
This module covers the essential concepts and practical techniques for cleaning and preparing real-world data for analysis.
Beginner Level
⏱️ 45-60 minutes
📋 Topics Covered
- ✅ Common Data Quality Issues
- ✅ Handling Missing Values
- ✅ Outlier Detection & Treatment
- ✅ Data Normalization & Standardization
- ✅ Data Type Conversion
- ✅ Deduplication Strategies
- ✅ Data Transformation Techniques
- ✅ Creating Data Cleaning Workflows
🔑 Key Concepts
- Identifying and categorizing data quality problems
- Choosing appropriate strategies for missing data
- Statistical methods for outlier detection
- Preparing data for analysis and modeling
- Documentation and reproducibility in data cleaning
3.1 Common Data Quality Issues
Real-world data is messy. Understanding common problems helps you clean data systematically.
The "Dirty Dozen" - five of the most common issues:
| Issue                | Example                             | Impact                   | Solution                   |
|----------------------|-------------------------------------|--------------------------|----------------------------|
| Missing Values       | Blank cells, NULL, N/A              | Incomplete analysis      | Imputation, deletion       |
| Duplicates           | Same record appears 3x              | Inflated counts          | Deduplication              |
| Inconsistent Formats | 03/15/2025 vs 2025-03-15            | Sorting/filtering errors | Standardization            |
| Outliers             | Age: 999 years                      | Skewed statistics        | Investigation, capping     |
| Typos                | "Californa" instead of "California" | Incorrect grouping       | Fuzzy matching, validation |
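Before fixing anything, it helps to profile a dataset to see which of these issues are present. A minimal sketch in pandas (the file and column names here are hypothetical):

import pandas as pd

df = pd.read_csv("customers.csv")           # hypothetical input file
print(df.isna().sum())                      # missing values per column
print(df.duplicated().sum())                # exact duplicate rows
print(df.describe())                        # min/max expose outliers (e.g., Age of 999)
print(df["State"].value_counts().head(10))  # typos surface as rare variants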
Real-World Example (Healthcare - Canada):
A Vancouver hospital analyzed patient readmission rates. The initial data showed an 18% readmission rate. After removing duplicates (the same patient counted multiple times due to system errors), standardizing inconsistent date formats, and resolving outliers (one patient with 47 admissions was flagged as a data entry error), the true rate was 12.3% - a significant difference for resource planning.
3.2 Handling Missing Values
Missing data is inevitable. The key is choosing the right strategy based on why data is missing.
Types of Missing Data:
- MCAR (Missing Completely At Random) - No pattern; omissions are purely random (e.g., a sensor glitch)
- MAR (Missing At Random) - Missingness is related to other observed variables (e.g., younger customers skip the phone field)
- MNAR (Missing Not At Random) - Missing for systematic reasons tied to the value itself (e.g., high earners decline to report income)
Simulation: Missing Data Analysis Tool
┌──────────────────────────────────────────────┐
│ Missing Data Analyzer                        │
├──────────────────────────────────────────────┤
│                                              │
│ Dataset: Customer_Database.csv               │
│ Total Records: 50,000                        │
│                                              │
│ Missing Values Summary:                      │
│ ┌────────────────┬─────────┬───────────┐     │
│ │ Column         │ Missing │ % Missing │     │
│ ├────────────────┼─────────┼───────────┤     │
│ │ Customer_ID    │       0 │      0.0% │     │
│ │ Email          │   4,123 │      8.2% │     │
│ │ Phone          │   2,847 │      5.7% │     │
│ │ Income         │   6,500 │     13.0% │     │
│ │ Age            │     342 │      0.7% │     │
│ │ Purchase_Date  │       0 │      0.0% │     │
│ └────────────────┴─────────┴───────────┘     │
│                                              │
│ Recommended Actions:                         │
│ • Email: ⚠️ Keep (8% acceptable for email)   │
│ • Phone: ✅ Keep or impute median            │
│ • Income: ⚠️ High missing - investigate      │
│ • Age: ✅ Impute with median (0.7% only)     │
│                                              │
│ [Generate Report] [Apply Fixes] [Export]     │
└──────────────────────────────────────────────┘
Strategies for Handling Missing Data:
| Strategy               | When to Use           | Pros                        | Cons                           |
|------------------------|-----------------------|-----------------------------|--------------------------------|
| Deletion               | <5% missing, MCAR     | Simple, no assumptions      | Reduces sample size            |
| Mean/Median Imputation | Numeric, <10% missing | Preserves sample size       | Reduces variance               |
| Mode Imputation        | Categorical data      | Quick, maintains categories | May not fit individual records |
| Forward/Backward Fill  | Time series data      | Logical for temporal data   | Assumes stability              |
| Predictive Imputation  | Complex patterns, MAR | Most accurate               | Complex, time-consuming        |
Example: Excel Imputation
Before:
Age: 25, 32, [blank], 45, [blank], 28, 52
After (Median Imputation):
Median = 32
Age: 25, 32, 32, 45, 32, 28, 52
Excel Formula: =IF(ISBLANK(A2), MEDIAN($A$2:$A$100), A2)
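The same median imputation in pandas might look like this minimal sketch (the Age values are taken from the example above):

import pandas as pd

ages = pd.Series([25, 32, None, 45, None, 28, 52], name="Age")
median = ages.median()             # 32.0, computed from the non-missing values
ages_filled = ages.fillna(median)  # 25, 32, 32, 45, 32, 28, 52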
3.3 Outlier Detection & Treatment
Outliers can be genuine extreme values or errors. Detecting them requires statistical methods.
Common Detection Methods:
- IQR Method - Values beyond Q1 - 1.5×IQR or Q3 + 1.5×IQR (sketched in code after this list)
- Z-Score - Values more than 3 standard deviations from mean
- Visual Inspection - Box plots, scatter plots
- Domain Knowledge - Age > 120, negative prices
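As a minimal sketch, the IQR method can be written in a few lines of pandas (toy data here; on the dashboard below the same rule produces boundaries of -$175 and $785):

import pandas as pd

df = pd.DataFrame({"Order_Amount": [120.0, 289.0, 310.0, 425.0, 12450.0]})  # toy data

q1 = df["Order_Amount"].quantile(0.25)
q3 = df["Order_Amount"].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # the IQR fences
outliers = df[(df["Order_Amount"] < lower) | (df["Order_Amount"] > upper)]
print(outliers)                                 # flags the $12,450 order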
Simulation: Outlier Detection Dashboard
┌──────────────────────────────────────────────┐
│ Outlier Detection Tool                       │
├──────────────────────────────────────────────┤
│                                              │
│ Column: Order_Amount                         │
│ Method: [IQR Method ▼]                       │
│                                              │
│ Statistics:                                  │
│   Mean:    $342.50                           │
│   Median:  $289.00                           │
│   Std Dev: $156.23                           │
│   Q1:      $185.00                           │
│   Q3:      $425.00                           │
│   IQR:     $240.00                           │
│                                              │
│ Outlier Boundaries:                          │
│   Lower: -$175.00 (Q1 - 1.5×IQR)             │
│   Upper:  $785.00 (Q3 + 1.5×IQR)             │
│                                              │
│ Outliers Detected: 347 (0.7%)                │
│                                              │
│ Sample Outliers:                             │
│ • Order #45892: $12,450.00 ⚠️                │
│ • Order #78234:  $8,923.00 ⚠️                │
│ • Order #12455:  $7,100.00 ⚠️                │
│                                              │
│ [View All] [Cap Values] [Remove] [Keep]      │
└──────────────────────────────────────────────┘
Treatment Options:
- Investigate - Are they errors or legitimate? (Check original source)
- Remove - Delete if confirmed errors or impossible values
- Cap (Winsorize) - Replace with boundary value (e.g., 95th percentile); see the sketch after this list
- Transform - Log transformation reduces outlier impact
- Keep - If legitimate and meaningful (e.g., high-value customers)
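Capping can be sketched with pandas' clip, continuing the toy example above (lower and upper are the IQR fences computed earlier):

# Winsorize: replace values beyond the fences with the boundary values
df["Amount_capped"] = df["Order_Amount"].clip(lower=lower, upper=upper)

# Alternative: cap at the 5th/95th percentiles instead of the IQR fences
p5, p95 = df["Order_Amount"].quantile([0.05, 0.95])
df["Amount_p95"] = df["Order_Amount"].clip(lower=p5, upper=p95)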
E-commerce Example (USA):
A New York retailer found 23 orders over $10,000 (outliers). Investigation revealed:
• 18 were legitimate bulk B2B orders → Kept
• 3 were data entry errors (decimal point mistakes) → Corrected
• 2 were fraudulent transactions → Removed and flagged
3.4 Data Normalization & Standardization
Different scales can bias analysis. Normalization brings variables to comparable ranges.
When to Normalize:
- Machine learning algorithms (neural networks, K-means clustering)
- Comparing variables with different units (age vs income)
- Distance-based calculations
Common Techniques:
| Method                  | Formula                 | Range    | Use Case           |
|-------------------------|-------------------------|----------|--------------------|
| Min-Max Normalization   | (X - min) / (max - min) | 0 to 1   | Neural networks    |
| Z-Score Standardization | (X - mean) / std        | ~-3 to 3 | Most algorithms    |
| Robust Scaling          | (X - median) / IQR      | Varies   | Data with outliers |
Example: Normalizing Customer Data
Original Data:
Age: 25, 35, 45, 55 (range: 30)
Income: $30,000, $50,000, $70,000, $90,000 (range: $60,000)
After Min-Max Normalization (0-1):
Age: 0.00, 0.33, 0.67, 1.00
Income: 0.00, 0.33, 0.67, 1.00
Now both variables are on the same scale for clustering algorithms!
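Both scalings can be reproduced with a short pandas sketch:

import pandas as pd

df = pd.DataFrame({"Age": [25, 35, 45, 55],
                   "Income": [30000, 50000, 70000, 90000]})

# Min-max normalization to the 0-1 range (applied column-wise)
minmax = (df - df.min()) / (df.max() - df.min())  # Age: 0.00, 0.33, 0.67, 1.00

# Z-score standardization: mean 0, standard deviation 1 per column
zscore = (df - df.mean()) / df.std()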
3.5 Data Type Conversion
Ensuring correct data types prevents errors and enables proper analysis.
Common Conversions Needed:
- Text to Date: "2025-03-15" (string) → March 15, 2025 (date)
- Text to Number: "$1,234.56" (string) → 1234.56 (numeric)
- Number to Category: Age 25 → "Young Adult" category
- Boolean Conversion: "Yes"/"No" → True/False
Simulation: Data Type Converter
┌──────────────────────────────────────────────┐
│ Data Type Conversion Tool                    │
├──────────────────────────────────────────────┤
│                                              │
│ Column: Purchase_Date                        │
│ Current Type: Text/String                    │
│ Detected Format: MM/DD/YYYY                  │
│                                              │
│ Sample Values:                               │
│ • "03/15/2025"                               │
│ • "04/02/2025"                               │
│ • "12/25/2024"                               │
│                                              │
│ Convert To: [Date/Time ▼]                    │
│ Output Format: YYYY-MM-DD                    │
│                                              │
│ Preview After Conversion:                    │
│ • 2025-03-15                                 │
│ • 2025-04-02                                 │
│ • 2024-12-25                                 │
│                                              │
│ ⚠️ 3 values could not be converted           │
│    (Invalid dates - will be set to NULL)     │
│                                              │
│ [Convert] [Preview More] [Cancel]            │
└──────────────────────────────────────────────┘
Common Issues & Solutions:
Problem: "Revenue" column imported as text due to $ signs and commas
Solution: Remove $ and commas, convert to numeric
Excel: =VALUE(SUBSTITUTE(SUBSTITUTE(A2,"$",""),",",""))
Python: df["Revenue"] = df["Revenue"].str.replace("[$,]", "", regex=True).astype(float)
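The date conversion from the simulation above can be sketched the same way; in pandas, errors="coerce" turns unparseable values into NaT (the date equivalent of NULL), matching the tool's behaviour:

import pandas as pd

dates = pd.Series(["03/15/2025", "04/02/2025", "13/45/2025"])  # last value is invalid
converted = pd.to_datetime(dates, format="%m/%d/%Y", errors="coerce")
# Result: 2025-03-15, 2025-04-02, NaT (could not be converted)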
3.6 Deduplication Strategies
Duplicate records inflate counts and skew analysis. Effective deduplication requires strategy.
Types of Duplicates:
- Exact Duplicates - All fields identical (easy to detect)
- Near Duplicates - Minor differences (typos, formatting)
- Semantic Duplicates - Same entity, different representation
Simulation: Duplicate Detection Tool
┌──────────────────────────────────────────────┐
│ Duplicate Record Finder                      │
├──────────────────────────────────────────────┤
│                                              │
│ Match On: ☑ Email  ☑ Phone  ☐ Name           │
│ Matching Type: [Exact Match ▼]               │
│                                              │
│ Duplicates Found: 1,203 records              │
│                                              │
│ Example Duplicate Set:                       │
│ ┌──────┬─────────────┬──────────┬───────┐    │
│ │ ID   │ Email       │ Phone    │ Date  │    │
│ ├──────┼─────────────┼──────────┼───────┤    │
│ │ 1045 │ john@co.com │ 555-1234 │ Jan 5 │    │
│ │ 2389 │ john@co.com │ 555-1234 │ Feb 2 │    │
│ │ 4521 │ john@co.com │ 555-1234 │ Mar 8 │    │
│ └──────┴─────────────┴──────────┴───────┘    │
│                                              │
│ Keep Record: (●) Most Recent  ( ) First      │
│              ( ) Manual Review               │
│                                              │
│ [Remove Duplicates] [Export List] [Review]   │
└──────────────────────────────────────────────┘
Deduplication Decision Tree:
Step 1: Identify key fields (email, customer_id, etc.)
Step 2: Check for exact matches on key fields
Step 3: For near-duplicates, use fuzzy matching (e.g., Levenshtein distance; see the sketch after these steps)
Step 4: Decide which record to keep:
β’ Most recent (if timestamped)
β’ Most complete (fewest nulls)
β’ Manual review for critical data
Step 5: Document removals for audit trail
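A minimal sketch of steps 2-4 in pandas, using the duplicate set from the simulation above. For near-duplicates it uses the standard library's SequenceMatcher, a Levenshtein-style similarity score (dedicated libraries such as rapidfuzz serve the same purpose):

import pandas as pd
from difflib import SequenceMatcher

df = pd.DataFrame({
    "ID":    [1045, 2389, 4521],
    "Email": ["john@co.com", "john@co.com", "john@co.com"],
    "Date":  pd.to_datetime(["2025-01-05", "2025-02-02", "2025-03-08"]),
})

# Exact match on the key field, keeping the most recent record
deduped = df.sort_values("Date").drop_duplicates(subset="Email", keep="last")

# Fuzzy similarity (0.0-1.0) for near-duplicates such as typos
score = SequenceMatcher(None, "Californa", "California").ratio()  # ~0.95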
3.7 Data Transformation Techniques
Reshaping and deriving new variables makes data more useful for analysis.
Common Transformations:
| Transformation      | Purpose                           | Example                            |
|---------------------|-----------------------------------|------------------------------------|
| Binning             | Create categories from continuous | Age → Age Groups (18-25, 26-35, …) |
| Log Transformation  | Reduce skewness                   | Income → log(Income)               |
| Feature Engineering | Create new variables              | Revenue - Cost = Profit            |
| Pivoting            | Reshape data structure            | Long format → Wide format          |
| Encoding            | Convert categories to numbers     | Red/Blue/Green → 1/2/3             |
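A few of these transformations sketched in pandas (column names and bin edges are illustrative):

import numpy as np
import pandas as pd

df = pd.DataFrame({"Age": [22, 30, 41, 58],
                   "Income": [30000, 50000, 70000, 250000],
                   "Color": ["Red", "Blue", "Green", "Red"]})

# Binning: continuous age -> labelled groups
df["Age_Group"] = pd.cut(df["Age"], bins=[17, 25, 35, 50, 120],
                         labels=["18-25", "26-35", "36-50", "51+"])

# Log transformation to reduce right skew
df["Log_Income"] = np.log(df["Income"])

# Encoding: one indicator column per category (an alternative to 1/2/3 codes)
df = pd.get_dummies(df, columns=["Color"])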
Example: Feature Engineering
Banking Example (Canada):
A Toronto bank improved credit risk models by creating derived features:
Original Fields:
• Income: $65,000
• Debt: $28,000
• Age: 34
Engineered Features:
• Debt-to-Income Ratio: 43% (28,000 / 65,000)
• Years to Retirement: 31 (65 - 34)
• Risk Score: Calculated from multiple factors
Result: Model accuracy improved from 76% to 84%
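The derived features themselves are simple arithmetic; a sketch with the values above:

income, debt, age = 65_000, 28_000, 34
dti = debt / income              # 0.431 -> the 43% debt-to-income ratio
years_to_retirement = 65 - age   # 31, assuming retirement at 65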
3.8 Creating Data Cleaning Workflows
Document your cleaning process for reproducibility and transparency.
Standard Cleaning Workflow (a code sketch of such a pipeline follows this list):
1. Initial Assessment - Profile data, identify issues
2. Handle Missing Values - Decide strategy per column
3. Remove Duplicates - Based on key fields
4. Fix Data Types - Convert to appropriate formats
5. Detect Outliers - Statistical methods + domain knowledge
6. Standardize Values - Consistent formats, spelling
7. Transform/Derive - Create calculated fields
8. Validate - Check business rules, ranges
9. Document - Log all changes made
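A minimal sketch of such a workflow as a pandas pipeline, with each step a named function so the sequence documents itself (the function bodies and file names are hypothetical; the columns follow the running Customer_Database example):

import pandas as pd

def remove_duplicates(df):
    return df.drop_duplicates(subset="Customer_ID")

def impute_age(df):
    return df.assign(Age=df["Age"].fillna(df["Age"].median()))

def drop_missing_income(df):
    return df.dropna(subset=["Income"])

cleaned = (pd.read_csv("Customer_Database.csv")
             .pipe(remove_duplicates)
             .pipe(impute_age)
             .pipe(drop_missing_income))
cleaned.to_csv("Customer_Database_clean.csv", index=False)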
Simulation: Data Cleaning Workflow Builder
┌──────────────────────────────────────────────┐
│ Data Cleaning Pipeline                       │
├──────────────────────────────────────────────┤
│                                              │
│ Step 1: ✓ Load Data (50,000 rows)            │
│ Step 2: ✓ Remove Duplicates (1,203 found)    │
│ Step 3: ► Handle Missing Values              │
│   • Email: Keep as-is (8.2%)                 │
│   • Age: Impute median                       │
│   • Income: Delete rows (13% - high)         │
│ Step 4: ○ Detect Outliers                    │
│ Step 5: ○ Normalize Data                     │
│ Step 6: ○ Validate Results                   │
│                                              │
│ Expected Output: ~42,500 rows                │
│                                              │
│ [Run Pipeline] [Save Workflow] [Schedule]    │
│                                              │
│ ☑ Generate cleaning report                   │
│ ☑ Save cleaned data to new file              │
└──────────────────────────────────────────────┘
✅ Module 3 Complete
You've learned:
- Common data quality issues and their impacts
- Strategies for handling missing values (deletion, imputation)
- Outlier detection methods (IQR, Z-score) and treatment
- Data normalization and standardization techniques
- Data type conversion and format standardization
- Deduplication strategies for exact and near-duplicates
- Data transformation techniques (binning, encoding, feature engineering)
- Building reproducible data cleaning workflows
- Real-world examples from healthcare, banking, and e-commerce
Next: Module 4 covers Excel for data analysis - the most universal analytics tool.