24 - Checkpoint 3B: Database Fundamentals II Review
Complexity: Easy (E)
24.0 Introduction: Why This Matters for Data Engineering
In data engineering at Hijra Group, robust database skills are critical for managing Sharia-compliant financial transaction data, enabling scalable analytics for stakeholder reporting. This checkpoint consolidates advanced SQL querying, indexing, optimization, schema design, and type-safe Python integration with SQLite and PostgreSQL from Chapters 12–23, ensuring you can build efficient, reliable data pipelines. These skills are foundational for cloud analytics in Phase 4 (Chapters 25–30) and production-grade deployments in Phase 9 (Chapters 60–66). The micro-project integrates these concepts into a comprehensive database tool, processing sales data from data/sales.db and data/transactions.csv, producing a JSON report and visualizations, all with 4-space indentation per PEP 8, preferring spaces over tabs to avoid IndentationError.
Data Engineering Workflow Context
This diagram illustrates the database pipeline:
flowchart TD
A["Raw Data
sales.db, transactions.csv"] --> B["Python Scripts
SQLite/PostgreSQL"]
B --> C{"Database Operations"}
C -->|Load/Validate| D["SQLite/PostgreSQL Tables"]
C -->|Query/Analyze| E["Aggregated Metrics"]
D --> F["Output
JSON/Plots"]
E --> F
F --> G["Storage/Reporting"]
classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
classDef storage fill:#ddffdd,stroke:#363,stroke-width:1px
class A,D,E,F data
class B,C process
class G storage
Building On and Preparing For
- Building On:
- Chapter 12: SQL basics with SQLite, extended to advanced queries.
- Chapter 13: Python-SQLite integration, now with type safety.
- Chapter 14: Advanced SQLite operations (transactions, views).
- Chapter 15: Type-safe SQLite programming with Pydantic.
- Chapter 16–17: PostgreSQL fundamentals and Python integration.
- Chapter 18: Schema design with ER diagrams.
- Chapter 19: Advanced SQL querying (joins, subqueries).
- Chapter 20: SQLite indexing and optimization.
- Chapter 21–22: Advanced PostgreSQL querying and optimization.
- Chapter 23: Type-safe database integration.
- Preparing For:
- Chapter 25: BigQuery fundamentals, extending SQL skills.
- Chapter 26: Python-BigQuery integration, building on type-safe programming.
- Chapter 47: Data access patterns for web applications.
What You’ll Learn
This chapter reinforces:
- Advanced SQL Querying: Joins, subqueries, CTEs, and window functions.
- Schema Design: Normalized tables with ER diagrams.
- Indexing and Optimization: Efficient query performance.
- Type-Safe Integration: Python with SQLite/PostgreSQL using Pydantic.
- Visualization: Matplotlib plots for stakeholder reporting.
- Testing: Unit and integration tests with pytest.
The micro-project builds a type-annotated database tool that queries sales.db, validates data with config.yaml, and produces a JSON report and plot, tested with pytest and verified by Pyright, all with 4-space indentation per PEP 8.
Follow-Along Tips:
- Create de-onboarding/data/ and populate with files from Appendix 1 (sales.db, transactions.csv, config.yaml, empty.csv, invalid.csv, malformed.csv, negative.csv).
- Install libraries: pip install numpy pandas matplotlib pyyaml psycopg2-binary pydantic pytest (sqlite3 ships with Python's standard library, so it is not installed via pip).
- Configure editor for 4-space indentation per PEP 8 (VS Code: “Editor: Tab Size” = 4, “Editor: Insert Spaces” = true, “Editor: Detect Indentation” = false).
- Use print statements (e.g., print(df.head())) to debug DataFrames.
- Save plots to data/ (e.g., data/sales_report.png).
- Verify file paths with ls data/ (Unix/macOS) or dir data\ (Windows).
- Use UTF-8 encoding to avoid UnicodeDecodeError.
- Run pytest with -v for verbose output.
24.1 Core Concepts
24.1.1 Advanced SQL Querying
Advanced SQL techniques include:
- Joins: Combine tables (e.g., sales and transactions).
- Subqueries: Nested queries for filtering.
- CTEs: Readable query structures.
- Window Functions: Compute rankings or running totals.
Example (SQLite, from Chapter 19):
WITH SalesCTE AS (
SELECT product, price * quantity AS amount
FROM sales
)
SELECT product, amount,
RANK() OVER (ORDER BY amount DESC) AS sales_rank
FROM SalesCTE;
Join Example (hypothetical, from Chapter 19):
-- Joining sales with a products table
SELECT s.product, s.price, p.category
FROM sales s
JOIN products p ON s.product = p.product_name
WHERE s.product LIKE 'Halal%';
Subquery Example (from Chapter 19):
-- Select products with above-average price
SELECT product, price
FROM sales
WHERE price > (SELECT AVG(price) FROM sales)
AND product LIKE 'Halal%';
Time Complexity: O(n log n) for sorting in window functions. Joins cost O(n × m) in the worst case with nested loops, dropping to roughly O(n log m) when the join column is indexed; a single uncorrelated subquery adds one O(n) scan. Space Complexity: O(n) for result sets.
24.1.2 Schema Design
Normalized schemas use ER diagrams to define relationships, reducing redundancy. From Chapter 18, normalization ensures data integrity by adhering to:
- First Normal Form (1NF): Eliminates repeating groups (e.g., each sale has one product).
- Second Normal Form (2NF): Removes partial dependencies (e.g., product details in a separate table).
- Third Normal Form (3NF): Eliminates transitive dependencies (e.g., no derived fields like amount stored).
Example: Splitting sales into two tables:
CREATE TABLE products (
product_id INTEGER PRIMARY KEY,
product_name TEXT
);
CREATE TABLE sales (
sale_id INTEGER PRIMARY KEY,
product_id INTEGER,
price REAL,
quantity INTEGER,
FOREIGN KEY (product_id) REFERENCES products(product_id)
);
This reduces redundancy (e.g., storing Halal Laptop once) and ensures consistency. However, normalization increases query complexity (e.g., requiring joins to retrieve product names, as sketched below), a trade-off balanced by efficient indexing.
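For example, after this split, listing each sale with its product name requires a join across the two tables, a minimal sketch against the schema just defined:
SELECT s.sale_id, p.product_name, s.price, s.quantity
FROM sales s
JOIN products p ON s.product_id = p.product_id;
The ER diagram below visualizes these relationships, alongside the transactions table: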
erDiagram
PRODUCTS ||--o{ SALES : references
SALES ||--o{ TRANSACTIONS : contains
PRODUCTS {
INTEGER product_id
TEXT product_name
}
SALES {
INTEGER sale_id
INTEGER product_id
REAL price
INTEGER quantity
}
TRANSACTIONS {
TEXT transaction_id
TEXT product
REAL price
INTEGER quantity
TEXT date
}
Implication: Ensures data integrity for Hijra Group’s transaction analytics.
24.1.3 Indexing and Optimization
Indexes improve query performance (Chapter 20, 22):
- B-Tree Indexes: O(log n) for lookups, with ~O(n) storage overhead for n rows, justified by faster queries (e.g., SELECT * FROM sales WHERE product = 'Halal Laptop').
- Query Optimization: Use EXPLAIN QUERY PLAN (SQLite) or EXPLAIN (PostgreSQL) to analyze plans.
Example (SQLite):
CREATE INDEX idx_product ON sales(product);
SELECT * FROM sales WHERE product = 'Halal Laptop';
Performance: Reduces query time from O(n) to O(log n).
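To confirm the index is actually used, inspect the plan with EXPLAIN QUERY PLAN (a minimal sketch; the exact plan text varies by SQLite version):
EXPLAIN QUERY PLAN
SELECT * FROM sales WHERE product = 'Halal Laptop';
-- Without idx_product the plan reports a full table scan (SCAN sales);
-- with it, an index search (SEARCH sales USING INDEX idx_product (product=?)).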
24.1.4 Type-Safe Integration
Type annotations with Pydantic (Chapter 15, 23) ensure data integrity:
from pydantic import BaseModel
class Sale(BaseModel):
product: str
price: float
    quantity: int
Implication: Prevents runtime errors in pipelines.
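A minimal sketch of the failure mode this prevents, assuming the Sale model above (the non-integer quantity raises a ValidationError instead of propagating bad data):
from pydantic import ValidationError

try:
    Sale(product="Halal Laptop", price=999.99, quantity="two")  # invalid quantity
except ValidationError as e:
    print(e)  # reports that quantity is not a valid integer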
24.2 Micro-Project: Type-Safe Database Tool
Project Requirements
Build a type-annotated database tool that:
- Queries sales.db and transactions.csv for sales analytics.
- Validates data using config.yaml with Pydantic.
- Computes metrics (total sales, top products) with Pandas/NumPy.
- Exports results to data/report.json.
- Generates a sales plot saved to data/sales_report.png with dpi=100 for clarity.
- Includes pytest tests for unit and integration testing.
- Uses 4-space indentation per PEP 8, verified by Pyright.
- Tests edge cases with empty.csv, invalid.csv, malformed.csv, negative.csv.
Note on Plot Resolution: The plot uses dpi=100 to ensure clarity for stakeholder reports, increasing file size slightly (e.g., ~100KB vs. ~70KB for dpi=72). Test with dpi=72 if smaller files are needed, checking with ls -lh data/sales_report.png (Unix/macOS) or dir data\sales_report.png (Windows).
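A minimal sketch of that dpi comparison, using a hypothetical throwaway figure (run from de-onboarding/ so data/ exists):
import os
import matplotlib.pyplot as plt

# Render the same bar chart at two resolutions and compare file sizes
plt.figure(figsize=(8, 6))
plt.bar(["Halal Laptop"], [1999.98])
for dpi in (72, 100):
    path = f"data/dpi_test_{dpi}.png"  # hypothetical scratch files
    plt.savefig(path, dpi=dpi)
    print(path, os.path.getsize(path), "bytes")
plt.close()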
Sample Input Files
data/sales.db (from Appendix 1):
CREATE TABLE sales (
product TEXT,
price REAL,
quantity INTEGER
);
INSERT INTO sales VALUES
('Halal Laptop', 999.99, 2),
('Halal Mouse', 24.99, 10),
('Halal Keyboard', 49.99, 5);
data/transactions.csv (from Appendix 1):
transaction_id,product,price,quantity,date
T001,Halal Laptop,999.99,2,2023-10-01
T002,Halal Mouse,24.99,10,2023-10-02
T003,Halal Keyboard,49.99,5,2023-10-03
T004,,29.99,3,2023-10-04
T005,Monitor,199.99,2,2023-10-05
data/config.yaml (from Appendix 1):
min_price: 10.0
max_quantity: 100
required_fields:
- product
- price
- quantity
product_prefix: 'Halal'
max_decimals: 2
Data Processing Flow
flowchart TD
A["Input
sales.db, transactions.csv"] --> B["Load Data
sqlite3, pandas"]
B --> C["Validate
Pydantic/config.yaml"]
C -->|Invalid| D["Log Warning"]
C -->|Valid| E["Query/Analyze
SQL/Pandas"]
E --> F["Export JSON
report.json"]
E --> G["Generate Plot
sales_report.png"]
D --> H["End"]
F --> H
G --> H
classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
classDef error fill:#ffdddd,stroke:#933,stroke-width:1px
classDef endpoint fill:#ddffdd,stroke:#363,stroke-width:1px
class A data
class B,C,E,F,G process
class D error
class H endpoint
Acceptance Criteria
- Go Criteria:
  - Loads sales.db and transactions.csv correctly.
  - Validates data with Pydantic and config.yaml.
  - Computes total sales and top 3 products.
  - Exports to data/report.json.
  - Saves plot to data/sales_report.png.
  - Includes pytest tests (unit, integration).
  - Uses type annotations, verified by Pyright.
  - Uses 4-space indentation per PEP 8.
  - Handles edge cases (empty.csv, etc.).
- No-Go Criteria:
  - Fails to load data or validate.
  - Incorrect metrics or outputs.
  - Missing tests or type annotations.
  - Inconsistent indentation.
Common Pitfalls to Avoid
- Database Connection Issues:
  - Problem: sqlite3.OperationalError due to missing sales.db.
  - Solution: Verify sales.db exists with ls data/sales.db (Unix/macOS) or dir data\sales.db (Windows). Print db_path.
- Type Mismatches:
  - Problem: Pydantic validation fails due to non-numeric values.
  - Solution: Print df.dtypes and use astype(int) for quantity.
- SQL Query Errors:
  - Problem: sqlite3.ProgrammingError from incorrect query syntax.
  - Solution: Print the query and use EXPLAIN QUERY PLAN to debug.
- PostgreSQL Connection Issues:
  - Problem: psycopg2.OperationalError due to incorrect host/port (if testing PostgreSQL locally).
  - Solution: Print connection parameters (e.g., host=localhost, port=5432) and verify PostgreSQL is running; see the sketch after this list.
- IndentationError:
  - Problem: Mixed spaces/tabs.
  - Solution: Use 4 spaces per PEP 8. Run python -tt database_tool.py.
- Test Failures:
  - Problem: pytest errors due to incorrect test setup.
  - Solution: Run pytest -v and print test inputs (e.g., print(df.head())).
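A minimal connectivity check for the PostgreSQL pitfall above, with hypothetical local credentials (adjust dbname, user, and password to your setup):
import psycopg2  # installed via psycopg2-binary

# Hypothetical connection parameters -- replace with your own
conn = psycopg2.connect(
    host="localhost",
    port=5432,
    dbname="analytics",
    user="postgres",
    password="postgres",
)
print("Connected, status:", conn.status)  # prints a ready status code on success
conn.close()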
How This Differs from Production
- Scalability: Handles small datasets; production uses chunked queries (Chapter 40).
- Security: Lacks encryption (Chapter 65).
- Observability: Uses print logs; production uses Grafana (Chapter 66).
- Deployment: Local execution; production uses Kubernetes (Chapter 61).
- Performance: Pydantic’s validation adds overhead for large datasets, mitigated by batch processing in Chapter 40.
Implementation
# File: de-onboarding/utils.py
from pydantic import BaseModel, validator
from typing import Dict, List
import yaml

class Config(BaseModel):
    min_price: float
    max_quantity: int
    required_fields: List[str]
    product_prefix: str
    max_decimals: int

    @validator('required_fields')
    def check_required_fields(cls, v):
        if not all(isinstance(f, str) for f in v):
            raise ValueError("All required fields must be strings")
        return v

def read_config(config_path: str) -> Dict:
    """Read YAML configuration."""
    print(f"Opening config: {config_path}")  # Debug
    with open(config_path, "r") as file:
        config = yaml.safe_load(file)
    config_model = Config(**config)
    print(f"Loaded config: {config_model.dict()}")  # Debug
    return config_model.dict()

def clean_string(s: str) -> str:
    """Strip whitespace from string."""
    return s.strip()

def is_numeric_value(x: float) -> bool:
    """Check if value is numeric."""
    return isinstance(x, (int, float))

def is_integer(x: float) -> bool:
    """Check if value is integer."""
    return isinstance(x, int) or (isinstance(x, float) and x.is_integer())

def has_valid_decimals(x: float, max_decimals: int) -> bool:
    """Check if value has valid decimal places."""
    if not is_numeric_value(x):
        return False
    parts = str(x).split(".")
    return len(parts) == 1 or len(parts[1]) <= max_decimals
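A quick sanity check of the validators above (a usage sketch run next to utils.py; expected results shown as comments follow from the logic):
import utils

print(utils.has_valid_decimals(999.99, 2))   # True: two decimal places
print(utils.has_valid_decimals(999.999, 2))  # False: three decimal places
print(utils.is_integer(10.0))                # True: float with an integral value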
# File: de-onboarding/database_tool.py
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import json
import os
from typing import Tuple, Dict, List
from pydantic import BaseModel
import utils

class Sale(BaseModel):
    product: str
    price: float
    quantity: int

def connect_db(db_path: str) -> sqlite3.Connection:
    """Connect to SQLite database."""
    print(f"Connecting to: {db_path}")  # Debug
    conn = sqlite3.connect(db_path)
    return conn

def load_transactions(csv_path: str, config: Dict) -> pd.DataFrame:
    """Load and validate transactions CSV."""
    print(f"Loading CSV: {csv_path}")  # Debug
    df = pd.read_csv(csv_path)
    print("Initial DataFrame:")  # Debug
    print(df.head())
    required_fields = config["required_fields"]
    missing_fields = [f for f in required_fields if f not in df.columns]
    if missing_fields:
        print(f"Missing columns: {missing_fields}")  # Log
        return pd.DataFrame()
    df = df.dropna(subset=["product"])
    df = df[df["product"].str.startswith(config["product_prefix"])]
    df = df[df["quantity"].apply(utils.is_integer)]
    df["quantity"] = df["quantity"].astype(int)
    df = df[df["quantity"] <= config["max_quantity"]]
    df = df[df["price"].apply(utils.is_numeric_value)]
    df = df[df["price"] > 0]
    df = df[df["price"] >= config["min_price"]]
    df = df[df["price"].apply(lambda x: utils.has_valid_decimals(x, config["max_decimals"]))]
    print("Validated DataFrame:")  # Debug
    print(df)
    return df

def query_sales(conn: sqlite3.Connection, config: Dict) -> pd.DataFrame:
    """Query sales data with advanced SQL."""
    query = """
    WITH SalesCTE AS (
        SELECT product, price * quantity AS amount
        FROM sales
        WHERE product LIKE ? || '%'
        AND price >= ?
        AND quantity <= ?
        AND price = ROUND(price, 2)  -- Ensure up to 2 decimal places
    )
    SELECT product, amount,
           RANK() OVER (ORDER BY amount DESC) AS sales_rank
    FROM SalesCTE;
    """
    # Example PostgreSQL query (optional, test locally if PostgreSQL is installed).
    # Note: psycopg2 uses %s placeholders, so a literal % must be doubled as %%:
    # query = """
    # WITH SalesCTE AS (
    #     SELECT product, price * quantity AS amount
    #     FROM sales
    #     WHERE product LIKE %s || '%%'
    #     AND price >= %s
    #     AND quantity <= %s
    #     AND price = ROUND(price, 2)
    # )
    # SELECT product, amount,
    #        ROW_NUMBER() OVER (ORDER BY amount DESC) AS sales_rank
    # FROM SalesCTE;
    # """
    params = (config["product_prefix"], config["min_price"], config["max_quantity"])
    df = pd.read_sql_query(query, conn, params=params)
    print("Queried DataFrame:")  # Debug
    print(df)
    return df

def process_data(sales_df: pd.DataFrame, transactions_df: pd.DataFrame) -> Dict:
    """Process data and compute metrics."""
    # Keep only frames with the expected columns; edge-case CSVs may yield column-less empty frames
    frames = [
        df[["product", "amount"]]
        for df in (sales_df, transactions_df)
        if not df.empty and {"product", "amount"}.issubset(df.columns)
    ]
    if not frames:
        print("No valid data")  # Log
        return {"total_sales": 0.0, "top_products": {}, "unique_products": []}
    # Combine data
    combined_df = pd.concat(frames)
    combined_df["amount"] = combined_df["amount"].astype(float)
    total_sales = np.sum(combined_df["amount"].values)
    unique_products = combined_df["product"].unique().tolist()
    sales_by_product = combined_df.groupby("product")["amount"].sum()
    top_products = sales_by_product.sort_values(ascending=False).head(3).to_dict()
    print(f"Total Sales: {total_sales}")  # Debug
    print(f"Top Products: {top_products}")  # Debug
    return {
        "total_sales": float(total_sales),
        "unique_products": unique_products,
        "top_products": top_products
    }

def export_results(results: Dict, json_path: str) -> None:
    """Export results to JSON."""
    print(f"Writing to: {json_path}")  # Debug
    with open(json_path, "w") as file:
        json.dump(results, file, indent=2)
    print(f"Exported to: {json_path}")  # Confirm

def plot_sales(df: pd.DataFrame, plot_path: str) -> None:
    """Generate sales plot."""
    if df.empty:
        print("No data to plot")  # Log
        return
    plt.figure(figsize=(8, 6))
    plt.bar(df["product"], df["amount"])
    plt.title("Sales by Product")
    plt.xlabel("Product")
    plt.ylabel("Sales Amount ($)")
    plt.xticks(rotation=45)
    plt.grid(True)
    plt.tight_layout()
    plt.savefig(plot_path, dpi=100)
    plt.close()
    print(f"Plot saved to: {plot_path}")  # Confirm

def main() -> None:
    """Main function."""
    db_path = "data/sales.db"
    csv_path = "data/transactions.csv"
    config_path = "data/config.yaml"
    json_path = "data/report.json"
    plot_path = "data/sales_report.png"
    config = utils.read_config(config_path)
    conn = connect_db(db_path)
    sales_df = query_sales(conn, config)
    transactions_df = load_transactions(csv_path, config)
    if not transactions_df.empty:  # Guard: edge-case frames may lack price/quantity columns
        transactions_df["amount"] = transactions_df["price"] * transactions_df["quantity"]
    results = process_data(sales_df, transactions_df)
    export_results(results, json_path)
    plot_sales(sales_df, plot_path)
    print("\nSales Report:")
    print(f"Total Sales: ${round(results['total_sales'], 2)}")
    print(f"Unique Products: {results['unique_products']}")
    print(f"Top Products: {results['top_products']}")
    conn.close()

if __name__ == "__main__":
    main()
# File: de-onboarding/test_database_tool.py
import pytest
import pandas as pd
import sqlite3
import os
from database_tool import connect_db, query_sales, load_transactions, process_data, export_results, plot_sales
from utils import read_config

@pytest.fixture
def setup_db(tmp_path):
    db_path = tmp_path / "test.db"
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("""
        CREATE TABLE sales (
            product TEXT,
            price REAL,
            quantity INTEGER
        )
    """)
    cursor.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("Halal Laptop", 999.99, 2), ("Halal Mouse", 24.99, 10)]
    )
    conn.commit()
    yield str(db_path)
    conn.close()

@pytest.fixture
def config():
    return read_config("data/config.yaml")

def test_connect_db(setup_db):
    conn = connect_db(setup_db)
    assert isinstance(conn, sqlite3.Connection)
    conn.close()

def test_query_sales(setup_db, config):
    conn = connect_db(setup_db)
    df = query_sales(conn, config)
    assert not df.empty
    assert "product" in df.columns
    assert "amount" in df.columns
    assert df["amount"].iloc[0] == 1999.98  # Halal Laptop
    conn.close()

def test_load_transactions(config):
    df = load_transactions("data/transactions.csv", config)
    assert not df.empty
    assert df["product"].str.startswith("Halal").all()
    assert df["quantity"].max() <= config["max_quantity"]

def test_process_data():
    sales_df = pd.DataFrame({
        "product": ["Halal Laptop"],
        "amount": [1999.98]
    })
    transactions_df = pd.DataFrame({
        "product": ["Halal Mouse"],
        "amount": [249.90]
    })
    results = process_data(sales_df, transactions_df)
    assert results["total_sales"] == 2249.88
    assert "Halal Laptop" in results["unique_products"]
    assert results["top_products"]["Halal Laptop"] == 1999.98

def test_export_results(tmp_path):
    results = {"total_sales": 2499.83, "unique_products": ["Halal Laptop"], "top_products": {"Halal Laptop": 1999.98}}
    json_path = tmp_path / "test.json"
    export_results(results, str(json_path))
    assert os.path.exists(json_path)

def test_plot_sales(tmp_path):
    df = pd.DataFrame({
        "product": ["Halal Laptop"],
        "amount": [1999.98]
    })
    plot_path = tmp_path / "test.png"
    plot_sales(df, str(plot_path))
    assert os.path.exists(plot_path)
Expected Outputs
data/report.json:
{
  "total_sales": 4999.66,
  "unique_products": ["Halal Laptop", "Halal Mouse", "Halal Keyboard"],
  "top_products": {
    "Halal Laptop": 3999.96,
    "Halal Keyboard": 499.9,
    "Halal Mouse": 499.8
  }
}
data/sales_report.png: Bar plot of sales amounts by product.
Console Output (abridged):
Opening config: data/config.yaml
Loaded config: {'min_price': 10.0, 'max_quantity': 100, 'required_fields': ['product', 'price', 'quantity'], 'product_prefix': 'Halal', 'max_decimals': 2}
Connecting to: data/sales.db
Queried DataFrame:
product amount sales_rank
0 Halal Laptop 1999.98 1
1 Halal Mouse 249.90 3
2 Halal Keyboard 249.95 2
Loading CSV: data/transactions.csv
Validated DataFrame:
transaction_id product price quantity date
0 T001 Halal Laptop 999.99 2 2023-10-01
1 T002 Halal Mouse 24.99 10 2023-10-02
2 T003 Halal Keyboard 49.99 5 2023-10-03
Total Sales: 4999.66
Exported to: data/report.json
Plot saved to: data/sales_report.png
Sales Report:
Total Sales: $4999.66
Unique Products: ['Halal Laptop', 'Halal Mouse', 'Halal Keyboard']
Top Products: {'Halal Laptop': 3999.96, 'Halal Keyboard': 499.9, 'Halal Mouse': 499.8}
How to Run and Test
Setup:
- Create de-onboarding/data/ and populate per Appendix 1.
- Install: pip install numpy pandas matplotlib pyyaml psycopg2-binary pydantic pytest (sqlite3 is part of Python's standard library).
- Save utils.py, database_tool.py, test_database_tool.py.
- Configure editor for 4-space indentation per PEP 8.
- Create virtual environment: python -m venv venv, activate (Windows: venv\Scripts\activate, Unix: source venv/bin/activate).
Run:
- Open terminal in de-onboarding/.
- Run: python database_tool.py.
- Outputs: data/report.json, data/sales_report.png.
Test:
- Run: pytest test_database_tool.py -v. Verify all tests pass.
Edge Case Testing:
| CSV File | Expected Output | Test Code Snippet |
|---|---|---|
| empty.csv | Empty DataFrame | df = load_transactions("data/empty.csv", config); print(df) |
| invalid.csv | Empty DataFrame | df = load_transactions("data/invalid.csv", config); print(df) |
| malformed.csv | DataFrame with only Halal Mouse row | df = load_transactions("data/malformed.csv", config); print(df) |
| negative.csv | DataFrame with only Halal Mouse row | df = load_transactions("data/negative.csv", config); print(df) |
Example:
config = utils.read_config("data/config.yaml")
df = load_transactions("data/empty.csv", config)
print(df)  # Expected: Empty DataFrame
24.3 Practice Exercises
Exercise 1: Advanced SQL Query
Write a function to query sales.db with a CTE and window function, returning a ranked DataFrame.
Expected Output:
product amount sales_rank
0 Halal Laptop 1999.98 1
1 Halal Keyboard 249.95 2
2 Halal Mouse 249.90 3
Follow-Along:
- Save as ex1_sql.py.
- Configure editor for 4-space indentation per PEP 8.
- Run: python ex1_sql.py.
- Test: Verify output with print(df).
Exercise 2: Type-Safe Validation with Edge Case
Write a type-annotated function to validate transactions using Pydantic, handling a non-string product value (e.g., numeric 12345) by filtering it out.
Sample Input (data/transactions_invalid.csv):
transaction_id,product,price,quantity,date
T001,Halal Laptop,999.99,2,2023-10-01
T002,12345,24.99,10,2023-10-02
T003,Halal Keyboard,49.99,5,2023-10-03
Expected Output:
Validated DataFrame:
product price quantity
0 Halal Laptop 999.99 2
2 Halal Keyboard 49.99 5
Follow-Along:
- Create data/transactions_invalid.csv with the sample input.
- Save as ex2_validation.py.
- Configure editor for 4-space indentation per PEP 8.
- Run: python ex2_validation.py.
- Test: Verify with empty.csv (should return an empty DataFrame).
Exercise 3: Indexing Optimization
Write a function to create an index on sales.db, insert 1000 rows, and measure query performance, saving results to data/ex3_benchmark.txt. The price range (10.0–1000.0) and quantity range (1–100) simulate realistic sales data, matching sales.db’s structure.
Expected Output (data/ex3_benchmark.txt; exact times vary by machine):
Query Time Before Index: 0.005s
Query Time After Index: 0.001s
Follow-Along:
- Save as ex3_index.py.
- Configure editor for 4-space indentation per PEP 8.
- Run: python ex3_index.py.
- Test: Verify index creation with sqlite3 data/sales.db ".indexes" and the benchmark file with cat data/ex3_benchmark.txt (Unix/macOS) or type data\ex3_benchmark.txt (Windows).
Exercise 4: Visualization
Write a function to plot sales from sales.db, saving to data/ex4_plot.png.
Expected Output:
Plot saved to data/ex4_plot.png
Follow-Along:
- Save as ex4_plot.py.
- Configure editor for 4-space indentation per PEP 8.
- Run: python ex4_plot.py.
- Test: Verify plot exists with ls data/ex4_plot.png (Unix/macOS) or dir data\ex4_plot.png (Windows).
- Debugging Tip: If the plot is empty or incorrect, print df before plotting to check data. Temporarily add plt.show() to inspect the plot interactively, then remove it to comply with Pyodide’s non-interactive requirements.
Exercise 5: Debug SQL Query
Fix a buggy SQL query with a parameter mismatch, causing sqlite3.ProgrammingError.
Buggy Code:
def query_sales(conn):
    query = "SELECT product, price * quantity AS amount FROM sales WHERE product LIKE ?"
    params = ("Halal%", "invalid_param")  # Bug: Too many parameters
    return pd.read_sql_query(query, conn, params=params)
Expected Output:
product amount
0 Halal Laptop 1999.98
1 Halal Mouse 249.90
2 Halal Keyboard 249.95
Follow-Along:
- Save as ex5_debug.py.
- Configure editor for 4-space indentation per PEP 8.
- Run: python ex5_debug.py to see the error.
- Fix and re-run.
- Test: Verify filtered output with print(df).
Exercise 6: Conceptual Analysis of Indexing
Explain why B-Tree indexes reduce query time from O(n) to O(log n) for sales.db, saving the response to data/ex6_concepts.txt.
Expected Output (data/ex6_concepts.txt):
B-Tree indexes reduce query time from O(n) to O(log n) because they organize data in a balanced tree structure, allowing logarithmic-time lookups. For example, querying `sales` for `product = 'Halal Laptop'` without an index scans all rows (O(n)). With a B-Tree index on `product`, the database traverses the tree, reducing comparisons to O(log n).
Follow-Along:
- Complete the following steps:
  - Explain why B-Tree indexes improve query performance, referencing sales.db.
  - Save the explanation to data/ex6_concepts.txt.
  - Verify with cat data/ex6_concepts.txt (Unix/macOS) or type data\ex6_concepts.txt (Windows).
- Test: Ensure the explanation references sales.db and complexity correctly.
24.4 Exercise Solutions
Solution to Exercise 1
import sqlite3
import pandas as pd

def query_sales(db_path: str) -> pd.DataFrame:
    conn = sqlite3.connect(db_path)
    query = """
    WITH SalesCTE AS (
        SELECT product, price * quantity AS amount
        FROM sales
        WHERE product LIKE 'Halal%'
    )
    SELECT product, amount,
           RANK() OVER (ORDER BY amount DESC) AS sales_rank
    FROM SalesCTE;
    """
    df = pd.read_sql_query(query, conn)
    conn.close()
    return df

print(query_sales("data/sales.db"))
Solution to Exercise 2
import pandas as pd
from pydantic import BaseModel
from typing import Dict
import utils

class Transaction(BaseModel):
    product: str
    price: float
    quantity: int

def validate_transactions(csv_path: str, config: Dict) -> pd.DataFrame:
    df = pd.read_csv(csv_path)
    # Filter for string products starting with prefix
    df = df[df["product"].apply(lambda x: isinstance(x, str) and x.startswith(config["product_prefix"]))]
    df = df[df["quantity"].apply(utils.is_integer)]
    df["quantity"] = df["quantity"].astype(int)
    df = df[df["price"].apply(utils.is_numeric_value)]
    for _, row in df.iterrows():
        Transaction(**row[["product", "price", "quantity"]].to_dict())
    return df

config = utils.read_config("data/config.yaml")
print(validate_transactions("data/transactions_invalid.csv", config))
Solution to Exercise 3
import sqlite3
import time
import random
from typing import Tuple  # needed for the Tuple[float, float] return annotation

def optimize_index(db_path: str, output_path: str) -> Tuple[float, float]:
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    # Insert 1000 rows
    products = ["Halal Laptop", "Halal Mouse", "Halal Keyboard"]
    rows = [(random.choice(products), round(random.uniform(10.0, 1000.0), 2), random.randint(1, 100)) for _ in range(1000)]
    cursor.executemany("INSERT INTO sales (product, price, quantity) VALUES (?, ?, ?)", rows)
    conn.commit()
    query = "SELECT * FROM sales WHERE product = 'Halal Laptop'"
    start = time.time()
    cursor.execute(query)
    cursor.fetchall()
    before = time.time() - start
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_product ON sales(product)")
    start = time.time()
    cursor.execute(query)
    cursor.fetchall()
    after = time.time() - start
    with open(output_path, "w") as f:
        f.write(f"Query Time Before Index: {before:.3f}s\nQuery Time After Index: {after:.3f}s\n")
    conn.close()
    return before, after

before, after = optimize_index("data/sales.db", "data/ex3_benchmark.txt")
print("Results saved to data/ex3_benchmark.txt")
Solution to Exercise 4
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

def plot_sales(db_path: str, plot_path: str) -> None:
    conn = sqlite3.connect(db_path)
    df = pd.read_sql_query("SELECT product, price * quantity AS amount FROM sales WHERE product LIKE 'Halal%'", conn)
    conn.close()  # Release the connection once the data is loaded
    plt.figure(figsize=(8, 6))
    plt.bar(df["product"], df["amount"])
    plt.title("Sales by Product")
    plt.xlabel("Product")
    plt.ylabel("Sales Amount ($)")
    plt.xticks(rotation=45)
    plt.grid(True)
    plt.tight_layout()
    plt.savefig(plot_path, dpi=100)
    plt.close()
    print(f"Plot saved to: {plot_path}")

plot_sales("data/sales.db", "data/ex4_plot.png")
Solution to Exercise 5
import sqlite3
import pandas as pd

def query_sales(db_path: str) -> pd.DataFrame:
    conn = sqlite3.connect(db_path)
    query = "SELECT product, price * quantity AS amount FROM sales WHERE product LIKE ?"
    params = ("Halal%",)  # Fix: Correct number of parameters
    df = pd.read_sql_query(query, conn, params=params)
    conn.close()
    return df

print(query_sales("data/sales.db"))
Solution to Exercise 6
# Save this explanation to data/ex6_concepts.txt
explanation = """
B-Tree indexes reduce query time from O(n) to O(log n) because they organize data in a balanced tree structure, allowing logarithmic-time lookups. For example, querying `sales` for `product = 'Halal Laptop'` without an index scans all rows (O(n)). With a B-Tree index on `product`, the database traverses the tree, reducing comparisons to O(log n).
"""
with open("data/ex6_concepts.txt", "w") as file:
    file.write(explanation)
print("Explanation saved to data/ex6_concepts.txt")
24.5 Chapter Summary and Connection to Chapter 25
This chapter consolidated advanced SQL, schema design, indexing, optimization, and type-safe Python integration, building a robust database tool for Hijra Group’s analytics. The micro-project demonstrated practical application, producing a JSON report and visualization, tested with pytest and verified by Pyright, all with 4-space indentation per PEP 8. These skills prepare you for cloud-based analytics in Chapter 25, BigQuery Fundamentals, where you’ll extend SQL querying to BigQuery, load data with google-cloud-bigquery, and compute metrics, leveraging sales.csv for cloud analytics, maintaining type safety and testing practices.