📊 Data Analytics Mastery Course

Master techniques for collecting, analyzing, and interpreting data to drive informed business decisions and strategic insights.

📚 Total Modules

20

🎯 Skill Levels

All Levels

🌎 Coverage

USA & Canada

⏱️ Total Duration

~20 Hours

📥 Module 2: Data Collection & Source Integration

This module covers how to collect data from internal and external sources (databases, APIs, files, web scraping, and cloud storage) and how to assess data quality before import.

Beginner Level
⏱️ 45-60 minutes

📚 Topics Covered

  • Data Sources: Internal vs External
  • Database Connections & Queries
  • API Integration for Real-Time Data
  • Web Scraping Fundamentals
  • File Format Handling (CSV, Excel, JSON, XML)
  • Cloud Storage Integration
  • Data Quality Assessment at Source

🔑 Key Concepts

  • Identifying reliable and relevant data sources
  • Establishing secure database connections
  • Automating data collection workflows
  • Understanding real-time vs batch data collection
  • Evaluating data source quality and completeness

2.1 Understanding Data Sources

Data can come from dozens of sources. Successful analysts know which sources to tap and how to access them efficiently.

Two Main Categories:

| Type | Examples | Advantages | Challenges |
|------|----------|------------|------------|
| Internal Data | Sales databases, CRM, ERP, internal surveys, server logs | Controlled, relevant, accessible | Siloed systems, legacy formats |
| External Data | Government data, social media, market research, weather APIs | Broad perspective, benchmarking | Quality varies, cost, format inconsistency |

Real-World Scenario (Retail - USA/Canada):
A grocery chain combines internal point-of-sale data with external weather data and local event calendars. They discover that umbrella sales increase 240% on rainy days within 2 hours of rain starting. They now use real-time weather APIs to trigger automated restocking alerts.

2.2 Database Connections

Most business data lives in databases. Learning to connect and query databases is essential.

Common Database Types:

  • Relational (SQL): MySQL, PostgreSQL, SQL Server, Oracle - structured tables with relationships
  • NoSQL: MongoDB, Cassandra - flexible schemas for unstructured data
  • Cloud Data Warehouses: Snowflake, BigQuery, Redshift - massive scale analytics

Simulation: Connecting to SQL Server Database

┌─────────────────────────────────────────────┐
│ Database Connection Manager │
├─────────────────────────────────────────────┤
│ │
│ Server: sql-prod-01.company.com │
│ Port: 1433 │
│ Database: [Sales_Analytics_DB ▼] │
│ Authentication: Windows Authentication │
│ Username: analyst@company.com │
│ Password: •••••••••• │
│ │
│ ☑ Use encrypted connection (SSL) │
│ ☑ Save connection for reuse │
│ │
│ [Test Connection] [Connect] [Cancel] │
│ │
│ Status: ✓ Connected successfully │
│ Tables available: 47 │
└─────────────────────────────────────────────┘

Basic SQL Query Example:

SELECT TOP 100
    CustomerID,
    SUM(OrderTotal) AS TotalRevenue,
    COUNT(OrderID) AS OrderCount
FROM Orders
WHERE OrderDate >= '2024-01-01'
GROUP BY CustomerID
ORDER BY TotalRevenue DESC;

What this does: Retrieves the top 100 customers by 2024 revenue, showing total spent and order count. Note that TOP 100 is SQL Server syntax, matching the connection above; MySQL and PostgreSQL use a LIMIT 100 clause after ORDER BY instead.
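The same pattern can be run from Python through a DB-API connection. Here is a minimal, self-contained sketch using the standard library's sqlite3 module as a stand-in for a production server (the table and rows are made up for illustration; SQLite uses LIMIT rather than TOP):

```python
import sqlite3

# In-memory database standing in for a production SQL server
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Orders (OrderID INTEGER, CustomerID TEXT, OrderDate TEXT, OrderTotal REAL)"
)
conn.executemany(
    "INSERT INTO Orders VALUES (?, ?, ?, ?)",
    [
        (1, "C001", "2024-01-15", 250.0),
        (2, "C002", "2024-02-10", 900.0),
        (3, "C001", "2024-03-05", 150.0),
        (4, "C003", "2023-12-20", 500.0),  # excluded by the date filter
    ],
)

# Aggregate revenue and order count per customer, top spenders first
rows = conn.execute("""
    SELECT CustomerID,
           SUM(OrderTotal) AS TotalRevenue,
           COUNT(OrderID)  AS OrderCount
    FROM Orders
    WHERE OrderDate >= '2024-01-01'
    GROUP BY CustomerID
    ORDER BY TotalRevenue DESC
    LIMIT 100
""").fetchall()

for customer_id, revenue, count in rows:
    print(customer_id, revenue, count)
# C002 900.0 1
# C001 400.0 2
```

The connection dialog shown above does the same handshake behind the scenes; in code you would swap sqlite3 for a driver such as pyodbc or psycopg2 and pass the server, database, and credentials in the connection string.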

2.3 API Integration for Real-Time Data

APIs (Application Programming Interfaces) allow you to pull data directly from web services in real-time.

Popular Business APIs:

  • Salesforce API - Customer relationship data
  • Google Analytics API - Website traffic and behavior
  • Stripe API - Payment and transaction data
  • Twitter/X API - Social media sentiment and engagement
  • Weather API - Environmental data for retail/logistics
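Most REST APIs are called by building a URL with query parameters and an API key. A minimal sketch of that step using only the standard library (the endpoint and parameter names here are hypothetical; real services such as a weather API document their own):

```python
from urllib.parse import urlencode

# Hypothetical weather API endpoint -- real providers publish their own
# base URL, parameter names, and authentication scheme.
base_url = "https://api.example-weather.com/v1/current"
params = {
    "lat": 43.65,           # Toronto
    "lon": -79.38,
    "units": "metric",
    "apikey": "YOUR_API_KEY",
}
request_url = f"{base_url}?{urlencode(params)}"
print(request_url)

# In practice you would then fetch and decode it, e.g. with the requests
# library: data = requests.get(request_url, timeout=10).json()
```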

Simulation: Google Analytics API Connection

┌─────────────────────────────────────────────┐
│ Google Analytics Data Connector │
├─────────────────────────────────────────────┤
│ │
│ Account: Company Website Analytics │
│ Property: www.company.com (UA-12345678) │
│ View: All Web Site Data │
│ │
│ Date Range: [Last 30 Days ▼] │
│ From: 2025-03-05 To: 2025-04-04 │
│ │
│ Metrics to Pull: │
│ ☑ Sessions │
│ ☑ Users │
│ ☑ Bounce Rate │
│ ☑ Average Session Duration │
│ ☑ Conversions │
│ │
│ Dimensions: │
│ ☑ Date │
│ ☑ Traffic Source │
│ ☑ Device Category │
│ │
│ [Preview Data] [Import] [Schedule Auto] │
│ │
│ Status: ✓ 45,892 rows ready to import │
└─────────────────────────────────────────────┘

API Response Example (JSON):

{
  "date": "2025-04-03",
  "sessions": 1523,
  "users": 1204,
  "bounceRate": 42.3,
  "avgSessionDuration": 185,
  "conversions": 47,
  "source": "google",
  "device": "mobile"
}
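A payload like the one above is decoded with the standard library's json module, after which the fields behave like an ordinary dictionary. A short sketch that parses the example response and derives a conversion rate from it:

```python
import json

# The JSON payload shown above, as it would arrive over the wire
raw = """{
  "date": "2025-04-03",
  "sessions": 1523,
  "users": 1204,
  "bounceRate": 42.3,
  "avgSessionDuration": 185,
  "conversions": 47,
  "source": "google",
  "device": "mobile"
}"""

record = json.loads(raw)

# Derived metric: conversions as a percentage of sessions
conversion_rate = record["conversions"] / record["sessions"] * 100
print(f"{record['date']}: {conversion_rate:.2f}% conversion rate")
# 2025-04-03: 3.09% conversion rate
```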

2.4 File Format Handling

Data arrives in many formats. Each has specific use cases and handling requirements.

| Format | Best For | Tools | Notes |
|--------|----------|-------|-------|
| CSV | Simple tabular data | Excel, Python, R | Universal, lightweight |
| Excel (.xlsx) | Business reports, multiple sheets | Excel, Python (pandas) | Formatted, formulas preserved |
| JSON | API responses, hierarchical data | JavaScript, Python, R | Nested structures, web-friendly |
| XML | Legacy systems, configuration | Python, Java | Verbose but structured |
| Parquet | Big data, analytics | Spark, Python | Columnar, compressed, fast |
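For CSV, the standard library alone is enough: csv.DictReader maps each row to a dictionary keyed by the header. A minimal sketch using an inline sample modeled on the sales data above (the figures are illustrative):

```python
import csv
import io

# Small CSV payload standing in for a downloaded file
csv_text = """Date,Product,Units,Revenue
2025-01-15,Widget A,150,3450
2025-01-16,Widget B,203,5878
2025-01-16,Widget A,89,2047
"""

# DictReader uses the first row as headers, like the import dialog's
# "First row contains headers" checkbox
reader = csv.DictReader(io.StringIO(csv_text))
rows = list(reader)

# CSV values arrive as strings, so numeric fields must be converted
total_revenue = sum(int(r["Revenue"]) for r in rows)
print(len(rows), "rows, total revenue:", total_revenue)
# 3 rows, total revenue: 11375
```

For Excel and Parquet you would reach for third-party libraries (pandas.read_excel, pyarrow); the pattern of load, type-convert, then aggregate stays the same.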

Simulation: Excel Import Dialog

┌─────────────────────────────────────────────┐
│ Import Excel File │
├─────────────────────────────────────────────┤
│ │
│ File: Sales_Q1_2025.xlsx │
│ [Browse...] [Recent Files ▼] │
│ │
│ Worksheet: [Sheet1 - Sales Data ▼] │
│ │
│ ☑ First row contains headers │
│ ☑ Detect data types automatically │
│ ☐ Skip blank rows │
│ │
│ Preview (first 5 rows): │
│ ┌────────┬──────────┬────────┬─────────┐ │
│ │ Date │ Product │ Units │ Revenue │ │
│ ├────────┼──────────┼────────┼─────────┤ │
│ │ Jan 15 │ Widget A │ 150 │ 3,450 │ │
│ │ Jan 16 │ Widget B │ 203 │ 5,878 │ │
│ │ Jan 16 │ Widget A │ 89 │ 2,047 │ │
│ └────────┴──────────┴────────┴─────────┘ │
│ │
│ Rows detected: 4,582 │
│ Columns detected: 8 │
│ │
│ [Import] [Cancel] │
└─────────────────────────────────────────────┘

2.5 Web Scraping Basics

Sometimes data isn't available through APIs or databases. Web scraping extracts data from websites.

When to Use Web Scraping:

  • Competitor pricing monitoring
  • Public data not available via API (government sites, industry reports)
  • Social media sentiment analysis (when API limits restrict access)
  • Real estate listings, job postings, product catalogs

⚠️ Legal & Ethical Considerations:

  • Always check website Terms of Service
  • Respect robots.txt files
  • Don't overload servers (rate limiting)
  • USA: Computer Fraud and Abuse Act (CFAA)
  • Canada: Criminal Code provisions on unauthorized access
  • Consider privacy laws (GDPR, PIPEDA, CCPA)

Common Web Scraping Tools:

  • Python Libraries: BeautifulSoup, Scrapy, Selenium
  • Browser Extensions: Web Scraper, Data Miner
  • No-Code Tools: Octoparse, ParseHub, Import.io

Example: Scraping Product Prices

Target: E-commerce product page
Data to Extract: Product name, price, availability, reviews

# Python with BeautifulSoup (hypothetical page; class names depend on the site)
from bs4 import BeautifulSoup
import requests

url = "https://example-store.com/products/laptop"
page = requests.get(url, timeout=10)
page.raise_for_status()  # fail early on HTTP errors
soup = BeautifulSoup(page.content, "html.parser")

# .text assumes each element exists; check find() for None in real code
product_name = soup.find("h1", class_="product-title").text.strip()
price = soup.find("span", class_="price").text.strip()
stock = soup.find("div", class_="availability").text.strip()
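If BeautifulSoup is not available, the same extraction can be sketched with the standard library's html.parser. The example below runs against an inline HTML snippet so it needs no network access (the markup and product values are made up):

```python
from html.parser import HTMLParser

# Made-up product page fragment mirroring the selectors used above
SAMPLE_HTML = """
<html><body>
  <h1 class="product-title">UltraBook 15</h1>
  <span class="price">$1,299.00</span>
  <div class="availability">In Stock</div>
</body></html>
"""

class ProductParser(HTMLParser):
    """Collect the text of elements whose class is one of our targets."""
    TARGETS = {"product-title", "price", "availability"}

    def __init__(self):
        super().__init__()
        self.fields = {}
        self._current = None  # target class we are inside, if any

    def handle_starttag(self, tag, attrs):
        cls = dict(attrs).get("class", "")
        if cls in self.TARGETS:
            self._current = cls

    def handle_data(self, data):
        if self._current and data.strip():
            self.fields[self._current] = data.strip()
            self._current = None

parser = ProductParser()
parser.feed(SAMPLE_HTML)
print(parser.fields)
# {'product-title': 'UltraBook 15', 'price': '$1,299.00', 'availability': 'In Stock'}
```

BeautifulSoup and Scrapy handle malformed HTML and nested structures far more robustly; this stdlib version is only a sketch of the underlying idea.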

2.6 Cloud Storage Integration

Modern data often lives in cloud storage. Learn to connect to these platforms.

Major Cloud Storage Platforms:

  • Amazon S3 - Object storage, data lakes
  • Google Cloud Storage - Integrated with BigQuery
  • Azure Blob Storage - Microsoft ecosystem
  • Dropbox/OneDrive - Business file sharing

Simulation: Connecting to Amazon S3

┌─────────────────────────────────────────────┐
│ Amazon S3 Data Source │
├─────────────────────────────────────────────┤
│ │
│ Access Key ID: AKIAIOSFODNN7EXAMPLE │
│ Secret Access Key: •••••••••••••••••••• │
│ Region: [us-east-1 ▼] │
│ │
│ Bucket: company-analytics-data │
│ │
│ Available Files: │
│ 📁 raw-data/ │
│   📄 sales_2025_Q1.csv (2.3 MB) │
│   📄 customer_data.parquet (15.7 MB) │
│ 📁 processed/ │
│   📄 monthly_summary.xlsx (450 KB) │
│ │
│ Selected: sales_2025_Q1.csv │
│ │
│ [Test Connection] [Download] [Import] │
│ │
│ Status: ✓ Connected to S3 bucket │
└─────────────────────────────────────────────┘

2.7 Data Quality Assessment at Source

Before importing data, evaluate its quality to avoid "garbage in, garbage out."

Data Quality Checklist:

| Dimension | What to Check | Red Flags |
|-----------|---------------|-----------|
| Completeness | Missing values, null records | >5% missing in critical fields |
| Accuracy | Values within expected ranges | Ages >120, negative prices |
| Consistency | Format uniformity, standards | Mixed date formats, units |
| Timeliness | Data freshness, update frequency | Last updated >6 months ago |
| Uniqueness | Duplicate records | Same ID appears multiple times |

Quick Data Profile Example:

Dataset: Customer Purchase History
Rows: 125,483
Columns: 12

Quality Issues Detected:
• Email: 8.2% missing (10,289 records)
• Purchase_Date: 3 records with future dates (data entry error)
• Product_Price: 47 records with $0.00 (free samples or errors?)
• Customer_ID: 1,203 duplicates found
• State/Province: Inconsistent (CA, Calif., California all used)

Recommendation: Clean data before analysis
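A profile like the one above can be produced with a few lines of code. Here is a minimal sketch over a hand-made sample (field names mirror the example; the three records are hypothetical, and "today" is pinned so the output is reproducible):

```python
from collections import Counter
from datetime import date

# Tiny stand-in for the 125k-row purchase history
records = [
    {"Customer_ID": "C001", "Email": "a@x.com", "Purchase_Date": date(2025, 1, 5), "Product_Price": 19.99},
    {"Customer_ID": "C002", "Email": None,      "Purchase_Date": date(2025, 2, 1), "Product_Price": 0.00},
    {"Customer_ID": "C001", "Email": "a@x.com", "Purchase_Date": date(2026, 3, 3), "Product_Price": 49.50},
]

today = date(2025, 4, 4)  # fixed reference date for the future-date check

missing_email = sum(1 for r in records if not r["Email"])
future_dates = sum(1 for r in records if r["Purchase_Date"] > today)
zero_prices = sum(1 for r in records if r["Product_Price"] == 0)

# Each ID occurring n times contributes n-1 duplicates
id_counts = Counter(r["Customer_ID"] for r in records)
duplicates = sum(c - 1 for c in id_counts.values() if c > 1)

print(f"Missing emails: {missing_email} ({missing_email / len(records):.1%})")
print(f"Future purchase dates: {future_dates}")
print(f"$0.00 prices: {zero_prices}")
print(f"Duplicate customer IDs: {duplicates}")
```

On real data the same checks run over a DataFrame (e.g. pandas isna, duplicated) instead of a list of dictionaries, but the logic per quality dimension is identical.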

✓ Module 2 Complete

You've learned:

  • Difference between internal and external data sources
  • How to connect to SQL databases and write basic queries
  • API integration for real-time data collection
  • File format handling (CSV, Excel, JSON, XML, Parquet)
  • Web scraping fundamentals and legal considerations
  • Cloud storage integration (S3, Google Cloud, Azure)
  • Data quality assessment before importing
  • Practical simulations of connection interfaces

Next: Module 3 teaches you how to clean and prepare data for accurate analysis.
