19 - Advanced SQL Querying with SQLite
Complexity: Moderate (M)
19.0 Introduction: Why This Matters for Data Engineering
In data engineering, advanced SQL querying is essential for extracting meaningful insights from large datasets, enabling Hijra Group to analyze Sharia-compliant financial transactions efficiently. SQLite, as a lightweight, serverless database, supports complex queries like joins, subqueries, and aggregations, making it ideal for prototyping pipelines before scaling to PostgreSQL (Chapter 21) or BigQuery (Chapter 25). SQLite’s single-threaded writes and lack of concurrent access make it suitable for prototyping but unsuitable for Hijra Group’s high-throughput transaction systems, which use PostgreSQL or BigQuery (Chapters 21, 25). Building on Chapters 12–18, which introduced SQL basics, Python-SQLite integration, and schema design, this chapter focuses on advanced querying techniques to handle sales data, preparing for optimization (Chapter 20) and production-grade databases.
This chapter avoids concepts not yet introduced, such as type annotations (Chapter 7), testing (Chapter 9), or advanced indexing (Chapter 20). All Python code uses 4-space indentation per PEP 8, preferring spaces over tabs to avoid IndentationError, ensuring compatibility with Hijra Group’s pipeline scripts. The micro-project uses data/sales.db from Appendix 1, testing queries for sales analytics and visualizing results with Matplotlib.
Data Engineering Workflow Context
This diagram illustrates how advanced SQL querying fits into a data engineering pipeline:
flowchart TD
A["Raw Data (SQLite DB)"] --> B["Python Scripts with SQLite"]
B --> C{"Advanced SQL Querying"}
C -->|Joins/Subqueries| D["Aggregated Metrics"]
C -->|Aggregations| E["Filtered Datasets"]
D --> F["Output (JSON/Plot)"]
E --> F
F --> G["Storage/Analysis"]
classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
classDef storage fill:#ddffdd,stroke:#363,stroke-width:1px
class A,D,E,F data
class B,C process
class G storageBuilding On and Preparing For
- Building On:
- Chapter 12: SQL basics (SELECT, INSERT, WHERE) for querying
sales.db. - Chapter 13: Python-SQLite integration with
sqlite3for programmatic access. - Chapter 14: Transactions and views for managing data operations.
- Chapter 15: Type-safe database programming (avoided here due to no type annotations).
- Chapter 18: Schema design and entity-relationship (ER) diagrams for normalized tables.
- Chapter 12: SQL basics (SELECT, INSERT, WHERE) for querying
- Preparing For:
- Chapter 20: Indexing and optimization for query performance.
- Chapter 21: Advanced PostgreSQL querying with CTEs and window functions.
- Chapter 23: Type-safe database integration for robust pipelines.
- Chapter 24: Comprehensive database review for cloud analytics.
What You’ll Learn
This chapter covers:
- Advanced SQL Techniques: Joins, subqueries, and aggregations for complex analytics.
- Query Design: Structuring queries for readability and efficiency.
- Python Integration: Executing queries with
sqlite3and processing results with Pandas. - Visualization: Plotting query results with Matplotlib for stakeholder reporting.
- Performance Considerations: Understanding query complexity (e.g., O(n) for table scans).
By the end, you’ll build a query tool to analyze sales data, producing a JSON report and a sales plot, all with 4-space indentation per PEP 8. The micro-project uses data/sales.db and validates queries against expected metrics.
Follow-Along Tips:
- Create
de-onboarding/data/and populate withsales.dbper Appendix 1. - Install libraries:
pip install pandas matplotlib sqlite3. - If
IndentationError, use 4 spaces (not tabs) per PEP 8. Runpython -tt script.pyto detect tab/space mixing. - Use print statements (e.g.,
print(pd.DataFrame(rows))) to debug query results. - Save plots to
data/(e.g.,sales_query_plot.png) instead of usingplt.show(). - Verify file paths with
ls data/(Unix/macOS) ordir data\(Windows). - Use UTF-8 encoding for all files to avoid
UnicodeDecodeError.
19.1 Advanced SQL Querying Techniques
Advanced SQL queries enable complex data analysis, such as joining tables, nesting subqueries, and aggregating metrics. SQLite supports these operations efficiently for small to medium datasets (e.g., <1GB), with O(n) complexity for table scans and O(n log n) for joins with indexes (covered in Chapter 20). Format SQL queries with consistent indentation (e.g., align FROM, JOIN, WHERE) and use uppercase for keywords (e.g., SELECT, GROUP BY) to enhance readability. Break long queries across lines and align clauses for clarity, improving maintainability in pipeline scripts.
19.1.1 Joins
Joins combine data from multiple tables based on related columns. SQLite supports INNER, LEFT, and CROSS joins, with INNER JOIN being the most common for matching records.
import sqlite3 # Import SQLite
# Connect to database
conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()
# Create a temporary categories table
cursor.execute("""
CREATE TABLE IF NOT EXISTS categories (
product TEXT,
category TEXT
)
""")
cursor.executemany(
"INSERT INTO categories (product, category) VALUES (?, ?)",
[
("Halal Laptop", "Electronics"),
("Halal Mouse", "Accessories"),
("Halal Keyboard", "Accessories"),
]
)
conn.commit()
# INNER JOIN to combine sales and categories
cursor.execute("""
SELECT s.product, s.price, s.quantity, c.category
FROM sales s
INNER JOIN categories c
ON s.product = c.product
""")
rows = cursor.fetchall()
# Print results
print("Join Results:") # Debug
for row in rows:
print(row) # Show joined rows
# Close connection
conn.close()
# Expected Output:
# Join Results:
# ('Halal Laptop', 999.99, 2, 'Electronics')
# ('Halal Mouse', 24.99, 10, 'Accessories')
# ('Halal Keyboard', 49.99, 5, 'Accessories')Follow-Along Instructions:
- Ensure
data/sales.dbexists per Appendix 1. - Save as
de-onboarding/sql_joins.py. - Configure editor for 4-space indentation per PEP 8.
- Run:
python sql_joins.py. - Verify output matches comments.
- Common Errors:
- sqlite3.OperationalError: Ensure
sales.dbexists. Printos.path.exists("data/sales.db"). - IndentationError: Use 4 spaces (not tabs). Run
python -tt sql_joins.py.
- sqlite3.OperationalError: Ensure
Key Points:
- INNER JOIN: Matches rows where the condition (e.g.,
s.product = c.product) is true. - Time Complexity: O(n * m) for unindexed joins (n, m rows in tables), improved with indexes (Chapter 20).
- Space Complexity: O(k) for k result rows.
- Implication: Joins enable combining sales data with metadata, e.g., product categories for Hijra Group’s analytics.
19.1.1.1 Multiple Joins
Complex analytics often require joining multiple tables. For example, combining sales data with categories and supplier information involves chaining joins.
import sqlite3 # Import SQLite
# Connect to database
conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()
# Create categories and suppliers tables
cursor.execute("""
CREATE TABLE IF NOT EXISTS categories (
product TEXT,
category TEXT
)
""")
cursor.execute("""
CREATE TABLE IF NOT EXISTS suppliers (
product TEXT,
supplier_name TEXT
)
""")
cursor.executemany(
"INSERT INTO categories (product, category) VALUES (?, ?)",
[
("Halal Laptop", "Electronics"),
("Halal Mouse", "Accessories"),
("Halal Keyboard", "Accessories"),
]
)
cursor.executemany(
"INSERT INTO suppliers (product, supplier_name) VALUES (?, ?)",
[
("Halal Laptop", "TechCorp"),
("Halal Mouse", "PeriTech"),
("Halal Keyboard", "PeriTech"),
]
)
conn.commit()
# Multiple JOINs to combine sales, categories, and suppliers
cursor.execute("""
SELECT s.product, s.price, c.category, p.supplier_name
FROM sales s
INNER JOIN categories c
ON s.product = c.product
INNER JOIN suppliers p
ON s.product = p.product
""")
rows = cursor.fetchall()
# Print results
print("Multiple Join Results:") # Debug
for row in rows:
print(row) # Show joined rows
# Close connection
conn.close()
# Expected Output:
# Multiple Join Results:
# ('Halal Laptop', 999.99, 'Electronics', 'TechCorp')
# ('Halal Mouse', 24.99, 'Accessories', 'PeriTech')
# ('Halal Keyboard', 49.99, 'Accessories', 'PeriTech')Follow-Along Instructions:
- Ensure
data/sales.dbexists per Appendix 1. - Save as
de-onboarding/sql_multi_joins.py. - Configure editor for 4-space indentation per PEP 8.
- Run:
python sql_multi_joins.py. - Verify output matches comments.
- Common Errors:
- sqlite3.OperationalError: Ensure
sales.dbexists. Printos.path.exists("data/sales.db"). - IndentationError: Use 4 spaces (not tabs). Run
python -tt sql_multi_joins.py.
- sqlite3.OperationalError: Ensure
Key Points:
- Multiple Joins: Chain INNER JOINs to combine multiple tables (e.g., sales, categories, suppliers).
- Time Complexity: O(n _ m _ p) for unindexed joins across three tables (n, m, p rows), improved with indexes (Chapter 20).
- Space Complexity: O(k) for k result rows.
- Implication: Enables complex analytics, e.g., supplier-based sales reports for Hijra Group.
19.1.2 Subqueries
Subqueries are nested queries that return intermediate results for the outer query, useful for filtering or aggregating.
import sqlite3 # Import SQLite
# Connect to database
conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()
# Subquery to find products with above-average price
cursor.execute("""
SELECT product, price, quantity
FROM sales
WHERE price > (
SELECT AVG(price)
FROM sales
)
""")
rows = cursor.fetchall()
# Print results
print("Subquery Results:") # Debug
for row in rows:
print(row) # Show filtered rows
# Close connection
conn.close()
# Expected Output:
# Subquery Results:
# ('Halal Laptop', 999.99, 2)Follow-Along Instructions:
- Save as
de-onboarding/sql_subqueries.py. - Configure editor for 4-space indentation per PEP 8.
- Run:
python sql_subqueries.py. - Verify output shows high-priced products.
- Common Errors:
- sqlite3.OperationalError: Check query syntax. Print the query string.
- IndentationError: Use 4 spaces (not tabs). Run
python -tt sql_subqueries.py.
Key Points:
- Subquery: Executes first, providing a value (e.g., average price) for the outer query.
- Time Complexity: O(n) for subquery scan, O(n) for outer query scan.
- Space Complexity: O(1) for scalar subqueries.
- Implication: Useful for dynamic filtering, e.g., identifying premium products.
19.1.3 Aggregations
Aggregations compute summary metrics (e.g., SUM, COUNT) across groups, often with GROUP BY.
import sqlite3 # Import SQLite
# Connect to database
conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()
# Aggregate total sales by product
cursor.execute("""
SELECT product, SUM(price * quantity) AS total_sales
FROM sales
GROUP BY product
""")
rows = cursor.fetchall()
# Print results
print("Aggregation Results:") # Debug
for row in rows:
print(row) # Show grouped totals
# Close connection
conn.close()
# Expected Output:
# Aggregation Results:
# ('Halal Keyboard', 249.95)
# ('Halal Laptop', 1999.98)
# ('Halal Mouse', 249.9)Follow-Along Instructions:
- Save as
de-onboarding/sql_aggregations.py. - Configure editor for 4-space indentation per PEP 8.
- Run:
python sql_aggregations.py. - Verify output matches comments.
- Common Errors:
- sqlite3.OperationalError: Ensure
price * quantityis valid. Print query results. - IndentationError: Use 4 spaces (not tabs). Run
python -tt sql_aggregations.py.
- sqlite3.OperationalError: Ensure
Key Points:
- GROUP BY: Groups rows by column (e.g.,
product). - SUM: Aggregates values within groups.
- Time Complexity: O(n) for scanning and grouping.
- Space Complexity: O(k) for k groups.
- Implication: Essential for summarizing sales metrics.
19.2 Micro-Project: Sales Data Query Tool
Project Requirements
Build a query tool to analyze sales data in data/sales.db, producing a JSON report and a sales plot for Hijra Group’s analytics. Ensure sales.db is created per Appendix 1 before running the project. The tool supports Hijra Group’s sales forecasting by analyzing Halal product trends, enabling data-driven decisions for Sharia-compliant inventory management. Queries ensure Sharia-compliant product analytics by focusing on Halal products, aligning with Islamic Financial Services Board (IFSB) standards for transaction reporting. The tool uses advanced SQL queries (joins, subqueries, aggregations) to compute metrics, integrating with Python and Pandas for processing and Matplotlib for visualization.
- Connect to
data/sales.dbwithsqlite3. - Create a
categoriestable for joins. - Execute queries for total sales, top products, and high-value sales.
- Export results to
data/sales_query_results.json. - Generate a sales plot saved to
data/sales_query_plot.png. - Log steps using print statements.
- Use 4-space indentation per PEP 8, preferring spaces over tabs.
- Test edge cases (e.g., empty database).
Sample Input File
data/sales.db (from Appendix 1):
- Table:
sales - Columns:
product(TEXT),price(REAL),quantity(INTEGER) - Data:
('Halal Laptop', 999.99, 2) ('Halal Mouse', 24.99, 10) ('Halal Keyboard', 49.99, 5)
Optionally, save the categories table creation as create_categories.py for reuse in exercises. Run before exercises if the categories table is missing.
Data Processing Flow
flowchart TD
A["Input DB
sales.db"] --> B["Connect
sqlite3"]
B --> C["Create Categories
Table"]
C --> D["Execute SQL Queries
Joins/Subqueries"]
D --> E["Process Results
Pandas"]
E -->|Valid| F["Export JSON
sales_query_results.json"]
E -->|Valid| G["Generate Plot
sales_query_plot.png"]
E -->|Invalid| H["Log Warning"]
F --> I["End Processing"]
G --> I
H --> I
classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
classDef error fill:#ffdddd,stroke:#933,stroke-width:1px
classDef endpoint fill:#ddffdd,stroke:#363,stroke-width:1px
class A data
class B,C,D,E,F,G process
class H error
class I endpointAcceptance Criteria
- Go Criteria:
- Connects to
sales.dband createscategoriestable. - Executes queries for total sales, top products, and high-value sales.
- Exports results to
data/sales_query_results.json. - Saves plot to
data/sales_query_plot.png, confirming file existence. - Logs steps and invalid cases.
- Uses 4-space indentation per PEP 8, preferring spaces over tabs.
- Handles edge cases (e.g., empty database).
- Connects to
- No-Go Criteria:
- Fails to connect to
sales.db. - Incorrect query results or calculations.
- Missing JSON export or plot.
- Uses try/except or type annotations.
- Inconsistent indentation or tab/space mixing.
- Fails to connect to
Common Pitfalls to Avoid
- Database Connection Issues:
- Problem:
sqlite3.OperationalErrordue to missingsales.db. - Solution: Ensure
sales.dbexists per Appendix 1. Printos.path.exists("data/sales.db").
- Problem:
- Query Syntax Errors:
- Problem: Invalid SQL syntax causes errors.
- Solution: Test queries in SQLite CLI with
sqlite3 data/sales.db 'SELECT * FROM sales;'in your terminal to test a query. If errors occur, check for missing tables or syntax issues. Alternatively, print the query string withprint(query)beforecursor.execute.
- Empty Results:
- Problem: Queries return no rows for empty tables.
- Solution: Check row counts with
SELECT COUNT(*). Print results.
- Plotting Issues:
- Problem: Plot not saved.
- Solution: Use
plt.savefig()and check permissions. Printos.path.exists(plot_path). To debug plot appearance (e.g., cut-off labels), temporarily addplt.show()beforeplt.close()to view the plot interactively (displays on your screen, unlikeplt.savefig’s file output). Removeplt.show()afterward for Pyodide compatibility, as it requires non-interactive outputs. Check figure size withprint(plt.gcf().get_size_inches()).
- IndentationError:
- Problem: Mixed spaces/tabs.
- Solution: Use 4 spaces per PEP 8. Run
python -tt query_tool.py.
How This Differs from Production
In production, this solution would include:
- Error Handling: Try/except for robust errors (Chapter 7).
- Type Safety: Type annotations with Pyright (Chapter 7).
- Testing: Unit tests with
pytest(Chapter 9). - Optimization: Indexes for faster joins (Chapter 20).
- Logging: File-based logging (Chapter 52).
- Scalability: Distributed queries with PostgreSQL/BigQuery (Chapters 21, 25).
Implementation
# File: de-onboarding/query_tool.py
import sqlite3 # For database operations
import pandas as pd # For data processing
import matplotlib.pyplot as plt # For plotting
import json # For JSON export
import os # For file existence check
# Define function to setup database
def setup_database(db_path): # Takes database path
"""Setup database with categories table."""
print(f"Connecting to: {db_path}") # Debug: print path
conn = sqlite3.connect(db_path) # Connect to database
cursor = conn.cursor()
# Create categories table
cursor.execute("""
CREATE TABLE IF NOT EXISTS categories (
product TEXT,
category TEXT
)
""")
# Insert category data
cursor.executemany(
"INSERT OR IGNORE INTO categories (product, category) VALUES (?, ?)",
[
("Halal Laptop", "Electronics"),
("Halal Mouse", "Accessories"),
("Halal Keyboard", "Accessories"),
]
)
conn.commit()
print("Categories table created") # Debug
return conn, cursor # Return connection and cursor
# Define function to execute queries
def execute_queries(conn, cursor): # Takes connection and cursor
"""Execute advanced SQL queries."""
results = {} # Store query results
# Query 1: Total sales by category (JOIN)
query = """
SELECT c.category, SUM(s.price * s.quantity) AS total_sales
FROM sales s
INNER JOIN categories c
ON s.product = c.product
GROUP BY c.category
"""
print("Executing category sales query") # Debug
# For syntax errors, uncomment `print(query)` to inspect the full query
cursor.execute(query)
rows = cursor.fetchall()
if not rows: # Check for empty results
print("No data for category sales") # Log empty
results["category_sales"] = {}
else:
results["category_sales"] = {row[0]: row[1] for row in rows}
print("Category Sales:", results["category_sales"]) # Debug
# Query 2: Top 2 products by sales (Aggregation)
query = """
SELECT product, SUM(price * quantity) AS total_sales
FROM sales
GROUP BY product
ORDER BY total_sales DESC
LIMIT 2
"""
print("Executing top products query") # Debug
# For syntax errors, uncomment `print(query)` to inspect the full query
cursor.execute(query)
rows = cursor.fetchall()
if not rows: # Check for empty results
print("No data for top products") # Log empty
results["top_products"] = {}
else:
results["top_products"] = {row[0]: row[1] for row in rows}
print("Top Products:", results["top_products"]) # Debug
# Query 3: High-value sales (Subquery)
query = """
SELECT product, price, quantity
FROM sales
WHERE price > (
SELECT AVG(price)
FROM sales
)
"""
print("Executing high-value sales query") # Debug
# For syntax errors, uncomment `print(query)` to inspect the full query
cursor.execute(query)
rows = cursor.fetchall()
if not rows: # Check for empty results
print("No high-value sales") # Log empty
results["high_value_sales"] = []
else:
results["high_value_sales"] = [{"product": r[0], "price": r[1], "quantity": r[2]} for r in rows]
print("High-Value Sales:", results["high_value_sales"]) # Debug
return results # Return results dictionary
# Define function to export results
def export_results(results, json_path): # Takes results and file path
"""Export query results to JSON."""
print(f"Writing to: {json_path}") # Debug: print path
file = open(json_path, "w") # Open JSON file
json.dump(results, file, indent=2) # Write JSON
file.close() # Close file
print(f"Exported results to {json_path}") # Confirm export
# Define function to plot sales
def plot_sales(results, plot_path): # Takes results and plot path
"""Generate sales by category plot."""
if not results["category_sales"]: # Check for empty data
print("No data to plot") # Log empty
return
# Prepare data for plotting
categories = list(results["category_sales"].keys())
sales = list(results["category_sales"].values())
plt.figure(figsize=(8, 6)) # Set figure size
plt.bar(categories, sales) # Bar plot
plt.title("Sales by Category") # Title
plt.xlabel("Category") # X-axis label
plt.ylabel("Sales Amount ($)") # Y-axis label
plt.grid(True) # Add grid
plt.tight_layout() # Adjust layout
plt.savefig(plot_path, dpi=100) # Save plot with high resolution
plt.close() # Close figure
print(f"Plot saved to {plot_path}") # Confirm save
print(f"File exists: {os.path.exists(plot_path)}") # Confirm file creation
# Define main function
def main(): # No parameters
"""Main function to run query tool."""
db_path = "data/sales.db" # Database path
json_path = "data/sales_query_results.json" # JSON output path
plot_path = "data/sales_query_plot.png" # Plot output path
# Setup database
conn, cursor = setup_database(db_path)
# Execute queries
results = execute_queries(conn, cursor)
# Export and plot
export_results(results, json_path)
plot_sales(results, plot_path)
# Close connection
conn.close()
print("Processing completed") # Confirm completion
if __name__ == "__main__":
main() # Run main functionExpected Outputs
data/sales_query_results.json:
{
"category_sales": {
"Electronics": 1999.98,
"Accessories": 499.85
},
"top_products": {
"Halal Laptop": 1999.98,
"Halal Mouse": 249.9
},
"high_value_sales": [
{
"product": "Halal Laptop",
"price": 999.99,
"quantity": 2
}
]
}data/sales_query_plot.png: Bar plot showing sales by category (Electronics, Accessories), saved with dpi=100.
Console Output (abridged):
Connecting to: data/sales.db
Categories table created
Executing category sales query
Category Sales: {'Electronics': 1999.98, 'Accessories': 499.85}
Executing top products query
Top Products: {'Halal Laptop': 1999.98, 'Halal Mouse': 249.9}
Executing high-value sales query
High-Value Sales: [{'product': 'Halal Laptop', 'price': 999.99, 'quantity': 2}]
Writing to: data/sales_query_results.json
Exported results to data/sales_query_results.json
Plot saved to data/sales_query_plot.png
File exists: True
Processing completedHow to Run and Test
Setup:
Initial Setup:
Run
python create_sales_db.pyto createsales.dbandpython create_categories.pyto create thecategoriestable (both from Appendix 1 and below). Save these scripts inde-onboarding/and run before the micro-project and Exercises 1 and 4.Create
de-onboarding/data/directory.Save
sales.dbper Appendix 1 (runcreate_sales_db.py).Install libraries:
pip install pandas matplotlib sqlite3.Create virtual environment:
python -m venv venv, activate (Windows:venv\Scripts\activate, Unix:source venv/bin/activate).Verify Python 3.10+:
python --version.Configure editor for 4-space indentation per PEP 8 (VS Code: “Editor: Tab Size” = 4, “Editor: Insert Spaces” = true, “Editor: Detect Indentation” = false).
Save
query_tool.pyinde-onboarding/.Create
create_categories.pyfor exercise reuse:# File: de-onboarding/create_categories.py import sqlite3 conn = sqlite3.connect("data/sales.db") cursor = conn.cursor() cursor.execute(""" CREATE TABLE IF NOT EXISTS categories ( product TEXT, category TEXT ) """) cursor.executemany( "INSERT OR IGNORE INTO categories (product, category) VALUES (?, ?)", [ ("Halal Laptop", "Electronics"), ("Halal Mouse", "Accessories"), ("Halal Keyboard", "Accessories"), ] ) conn.commit() conn.close() print("Categories table created")Run:
python create_categories.pybefore exercises requiringcategories.
Troubleshooting:
- If
sqlite3.OperationalError, checksales.dbexistence and query syntax. - If
PermissionErrorfor JSON/plot, checkdata/permissions withls -l data/. - If
IndentationError, use 4 spaces (not tabs). Runpython -tt query_tool.py.
- If
Run:
- Open terminal in
de-onboarding/. - Run:
python query_tool.py. - Outputs:
data/sales_query_results.json,data/sales_query_plot.png, console logs.
- Open terminal in
Test Scenarios:
- Valid Data: Verify JSON matches expected metrics and plot shows category sales.
- Empty Database: Test with empty
salestable:conn = sqlite3.connect("data/sales.db") cursor = conn.cursor() cursor.execute("DELETE FROM sales") conn.commit() results = execute_queries(conn, cursor) print(results) # Expected: {'category_sales': {}, 'top_products': {}, 'high_value_sales': []} conn.close()
19.3 Practice Exercises
Before running Exercises 1 and 4, refer to the Initial Setup in Section 19.2 to ensure sales.db and the categories table exist by running python create_sales_db.py and python create_categories.py.
Exercise 1: Join Query
Write a function to join sales and categories tables, returning products and categories, with 4-space indentation per PEP 8.
Expected Output:
[('Halal Laptop', 'Electronics'), ('Halal Mouse', 'Accessories'), ('Halal Keyboard', 'Accessories')]Follow-Along Instructions:
- Save as
de-onboarding/ex1_join.py. - Run
create_categories.pyif not already executed. - Configure editor for 4-space indentation per PEP 8.
- Run:
python ex1_join.py. - How to Test:
- Add:
print(join_sales_categories("data/sales.db")). - Verify output matches expected.
- Test with empty
categoriestable: Should return empty list.
- Add:
Exercise 2: Subquery
Write a function to find sales with above-average quantities, with 4-space indentation per PEP 8.
Expected Output:
[('Halal Mouse', 24.99, 10)]Follow-Along Instructions:
- Save as
de-onboarding/ex2_subquery.py. - Configure editor for 4-space indentation per PEP 8.
- Run:
python ex2_subquery.py. - How to Test:
- Add:
print(high_quantity_sales("data/sales.db")). - Verify output matches expected.
- Test with empty
salestable: Should return empty list.
- Add:
Exercise 3: Aggregation
Write a function to compute total sales by product, with 4-space indentation per PEP 8.
Expected Output:
[('Halal Keyboard', 249.95), ('Halal Laptop', 1999.98), ('Halal Mouse', 249.9)]Follow-Along Instructions:
- Save as
de-onboarding/ex3_aggregation.py. - Configure editor for 4-space indentation per PEP 8.
- Run:
python ex3_aggregation.py. - How to Test:
- Add:
print(total_sales_by_product("data/sales.db")). - Verify output matches expected.
- Test with empty
salestable: Should return empty list.
- Add:
Exercise 4: Visualization
Write a function to plot total sales by category, saving to data/category_plot.png, with 4-space indentation per PEP 8.
Expected Output:
Plot saved to data/category_plot.pngFollow-Along Instructions:
- Save as
de-onboarding/ex4_plot.py. - Run
create_categories.pyif not already executed. - Configure editor for 4-space indentation per PEP 8.
- Run:
python ex4_plot.py. - How to Test:
- Verify
data/category_plot.pngexists with correct bars. - Test with empty
salestable: Should not generate plot.
- Verify
Exercise 5: Debug a Subquery Bug
Fix this buggy code that uses incorrect subquery syntax, causing no results, ensuring 4-space indentation per PEP 8.
Buggy Code:
import sqlite3
def high_price_sales(db_path):
conn = sqlite3.connect(db_path)
cursor = conn.cursor()
cursor.execute("""
SELECT product, price, quantity
FROM sales
WHERE price > (SELECT price FROM sales) # Bug: Invalid subquery
""")
rows = cursor.fetchall()
conn.close()
return rows
print(high_price_sales("data/sales.db"))Expected Output:
[('Halal Laptop', 999.99, 2)]Follow-Along Instructions:
- Save as
de-onboarding/ex5_debug.py. - Configure editor for 4-space indentation per PEP 8.
- Run:
python ex5_debug.pyto see incorrect output. - Fix and re-run.
- How to Test:
- Verify output matches expected.
- Test with empty
salestable: Should return empty list.
Exercise 6: Conceptual Analysis of Joins vs. Subqueries
Explain when to use a JOIN versus a subquery for filtering sales data, considering performance and readability. Ensure your explanation compares JOIN and subquery performance (e.g., time complexity like O(n * m) for joins vs. O(n) for subqueries) and readability (e.g., query structure clarity). Write a 100-word explanation and save it to de-onboarding/ex6_concepts.txt, using 4-space indentation per PEP 8 for any code references.
Expected Output:
Explanation saved to de-onboarding/ex6_concepts.txtFollow-Along Instructions:
- Save as
de-onboarding/ex6_concepts.py. - Configure editor for 4-space indentation per PEP 8.
- Run:
python ex6_concepts.py. - How to Test:
- Verify
ex6_concepts.txtexists with a ~100-word explanation. - Check content for clarity and relevance to joins vs. subqueries.
- Test with a shorter explanation (<50 words): Should prompt for 100 words.
- Verify
19.4 Exercise Solutions
Solution to Exercise 1: Join Query
import sqlite3 # Import SQLite
def join_sales_categories(db_path): # Takes database path
"""Join sales and categories tables."""
conn = sqlite3.connect(db_path) # Connect
cursor = conn.cursor()
# Execute join query
query = """
SELECT s.product, c.category
FROM sales s
INNER JOIN categories c
ON s.product = c.product
"""
print("Executing join query") # Debug
# For syntax errors, uncomment `print(query)` to inspect the full query
cursor.execute(query)
rows = cursor.fetchall()
print("Join Results:", rows) # Debug
conn.close()
return rows # Return results
# Test
print(join_sales_categories("data/sales.db"))
# Output:
# Executing join query
# Join Results: [('Halal Laptop', 'Electronics'), ('Halal Mouse', 'Accessories'), ('Halal Keyboard', 'Accessories')]
# [('Halal Laptop', 'Electronics'), ('Halal Mouse', 'Accessories'), ('Halal Keyboard', 'Accessories')]Solution to Exercise 2: Subquery
import sqlite3 # Import SQLite
def high_quantity_sales(db_path): # Takes database path
"""Find sales with above-average quantities."""
conn = sqlite3.connect(db_path) # Connect
cursor = conn.cursor()
# Execute subquery
query = """
SELECT product, price, quantity
FROM sales
WHERE quantity > (
SELECT AVG(quantity)
FROM sales
)
"""
print("Executing subquery") # Debug
# For syntax errors, uncomment `print(query)` to inspect the full query
cursor.execute(query)
rows = cursor.fetchall()
print("High Quantity Sales:", rows) # Debug
conn.close()
return rows # Return results
# Test
print(high_quantity_sales("data/sales.db"))
# Output:
# Executing subquery
# High Quantity Sales: [('Halal Mouse', 24.99, 10)]
# [('Halal Mouse', 24.99, 10)]Solution to Exercise 3: Aggregation
import sqlite3 # Import SQLite
def total_sales_by_product(db_path): # Takes database path
"""Compute total sales by product."""
conn = sqlite3.connect(db_path) # Connect
cursor = conn.cursor()
# Execute aggregation query
query = """
SELECT product, SUM(price * quantity) AS total_sales
FROM sales
GROUP BY product
"""
print("Executing aggregation query") # Debug
# For syntax errors, uncomment `print(query)` to inspect the full query
cursor.execute(query)
rows = cursor.fetchall()
print("Total Sales by Product:", rows) # Debug
conn.close()
return rows # Return results
# Test
print(total_sales_by_product("data/sales.db"))
# Output:
# Executing aggregation query
# Total Sales by Product: [('Halal Keyboard', 249.95), ('Halal Laptop', 1999.98), ('Halal Mouse', 249.9)]
# [('Halal Keyboard', 249.95), ('Halal Laptop', 1999.98), ('Halal Mouse', 249.9)]Solution to Exercise 4: Visualization
import sqlite3 # Import SQLite
import pandas as pd # Import Pandas
import matplotlib.pyplot as plt # Import Matplotlib
def plot_category_sales(db_path, plot_path): # Takes database and plot paths
"""Plot total sales by category."""
conn = sqlite3.connect(db_path) # Connect
cursor = conn.cursor()
# Execute query
query = """
SELECT c.category, SUM(s.price * s.quantity) AS total_sales
FROM sales s
INNER JOIN categories c
ON s.product = c.product
GROUP BY c.category
"""
print("Executing visualization query") # Debug
# For syntax errors, uncomment `print(query)` to inspect the full query
cursor.execute(query)
rows = cursor.fetchall()
conn.close()
if not rows: # Check for empty results
print("No data to plot") # Log empty
return
# Convert to DataFrame
df = pd.DataFrame(rows, columns=["category", "total_sales"])
# Plot
plt.figure(figsize=(8, 6)) # Set figure size
plt.bar(df["category"], df["total_sales"]) # Bar plot
plt.title("Sales by Category") # Title
plt.xlabel("Category") # X-axis label
plt.ylabel("Sales Amount ($)") # Y-axis label
plt.grid(True) # Add grid
plt.tight_layout() # Adjust layout
plt.savefig(plot_path, dpi=100) # Save plot
plt.close() # Close figure
print(f"Plot saved to {plot_path}") # Confirm save
# Test
plot_category_sales("data/sales.db", "data/category_plot.png")
# Output:
# Executing visualization query
# Plot saved to data/category_plot.pngSolution to Exercise 5: Debug a Subquery Bug
import sqlite3 # Import SQLite
def high_price_sales(db_path): # Takes database path
"""Find sales with above-average prices."""
conn = sqlite3.connect(db_path) # Connect
cursor = conn.cursor()
query = """
SELECT product, price, quantity
FROM sales
WHERE price > (
SELECT AVG(price)
FROM sales
)
"""
print("Executing high-price sales query") # Debug
# For syntax errors, uncomment `print(query)` to inspect the full query
cursor.execute(query)
rows = cursor.fetchall()
conn.close()
return rows
# Test
print(high_price_sales("data/sales.db"))
# Output:
# Executing high-price sales query
# [('Halal Laptop', 999.99, 2)]Explanation:
- Subquery Bug: The subquery
SELECT price FROM salesreturned multiple rows, causing an error. Fixed by usingAVG(price)to return a single value.
Solution to Exercise 6: Conceptual Analysis of Joins vs. Subqueries
def explain_join_vs_subquery(output_path): # Takes output file path
"""Explain JOIN vs. subquery for filtering sales data."""
explanation = (
"Use a JOIN to combine sales and categories tables when retrieving related data, "
"e.g., product categories, as it’s readable and efficient for matching records (O(n * m) unindexed). "
"Subqueries are better for dynamic filtering, e.g., finding sales above average price, "
"but can be less readable and slower (O(n) per scan) for complex conditions. "
"For Hijra Group’s analytics, prefer JOINs for reporting with metadata and subqueries for ad-hoc filters. "
"Format queries for clarity, e.g., `SELECT s.product FROM sales s JOIN categories c ON s.product = c.product`."
)
word_count = len(explanation.split())
if word_count < 90: # Check word count
print(f"Explanation too short ({word_count} words). Aim for ~100 words addressing performance (e.g., complexity) and readability.")
return
with open(output_path, "w") as file:
file.write(explanation)
print(f"Explanation saved to {output_path}") # Confirm save
# Test
explain_join_vs_subquery("de-onboarding/ex6_concepts.txt")
# Output:
# Explanation saved to de-onboarding/ex6_concepts.txtExplanation (saved to ex6_concepts.txt):
Use a JOIN to combine sales and categories tables when retrieving related data, e.g., product categories, as it’s readable and efficient for matching records (O(n * m) unindexed). Subqueries are better for dynamic filtering, e.g., finding sales above average price, but can be less readable and slower (O(n) per scan) for complex conditions. For Hijra Group’s analytics, prefer JOINs for reporting with metadata and subqueries for ad-hoc filters. Format queries for clarity, e.g., `SELECT s.product FROM sales s JOIN categories c ON s.product = c.product`.19.5 Chapter Summary and Connection to Chapter 20
In this chapter, you’ve mastered:
- Advanced SQL: Joins (O(n * m) unindexed), subqueries (O(n)), and aggregations (O(n)) for complex analytics.
- Python Integration: Using
sqlite3and Pandas for query execution and processing. - Visualization: Matplotlib plots for stakeholder reporting (O(n) for bar plots).
- Query Design: Formatting queries with indentation and uppercase keywords for readability.
- White-Space Sensitivity and PEP 8: Using 4-space indentation, preferring spaces over tabs.
The micro-project built a query tool for sales.db, using joins, subqueries, and aggregations to produce a JSON report and a high-resolution plot, tested for edge cases. The modular functions and formatted queries prepare for optimization in Chapter 20, where indexing will reduce join complexity. Chapter 20 also explores query optimization strategies, such as minimizing nested subqueries to reduce O(n) scans, building on this chapter’s query design principles.
Connection to Chapter 20
Chapter 20 introduces SQLite Indexing and Optimization, building on this chapter:
- Query Performance: Optimizes joins and subqueries with indexes, reducing complexity from O(n * m) to O(n log n).
- Data Structures: Uses
sales.dbfor indexing experiments. - Modules: Reuses query logic for performance testing.
- Fintech Context: Prepares for efficient analytics in Hijra Group’s pipelines, maintaining PEP 8’s 4-space indentation.