20 - SQLite Indexing and Optimization
Complexity: Moderate (M)
20.0 Introduction: Why This Matters for Data Engineering
In data engineering, optimizing database performance is critical for processing large volumes of financial transaction data efficiently, especially for Hijra Group’s Sharia-compliant fintech analytics. SQLite, a lightweight database, is widely used for prototyping and small-scale applications, but without optimization, queries can become slow as data grows. For example, a table with 1 million sales records can take seconds for unindexed queries (O(n) scans), but indexing can reduce this to milliseconds (O(log n) lookups). This chapter builds on Chapter 19 (Advanced SQL Querying with SQLite) by introducing indexing and query optimization techniques to enhance SQLite performance, ensuring efficient data retrieval for analytics pipelines.
This chapter avoids advanced concepts like type annotations (Chapter 7), testing (Chapter 9), or concurrency (Chapter 40), focusing on practical indexing and optimization techniques. All code uses PEP 8’s 4-space indentation, preferring spaces over tabs to avoid IndentationError, aligning with Hijra Group’s pipeline scripts. The micro-project optimizes a sales database using data/sales.db from Appendix 1, preparing for PostgreSQL optimization in Chapter 22.
Data Engineering Workflow Context
This diagram illustrates how indexing and optimization fit into a data engineering pipeline:
```mermaid
flowchart TD
    A["Raw Data (CSV)"] --> B["SQLite Database (sales.db)"]
    B --> C{"Query Processing"}
    C -->|Unoptimized| D["Full Table Scan (O(n))"]
    C -->|Optimized| E["Indexed Lookup (O(log n))"]
    E --> F["Aggregated Metrics"]
    F --> G["Output (Reports/JSON)"]
    G --> H["Analytics Pipeline"]
    classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
    classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
    classDef storage fill:#ddffdd,stroke:#363,stroke-width:1px
    class A,B,F,G data
    class C,D,E process
    class H storage
```

Building On and Preparing For
- Building On:
- Chapter 12: SQL Fundamentals with SQLite (basic queries, table creation).
- Chapter 13: Python and SQLite Integration (programmatic access with `sqlite3`).
- Chapter 14: Advanced Database Operations with SQLite (transactions, views).
- Chapter 19: Advanced SQL Querying with SQLite (joins, subqueries, aggregations).
- Preparing For:
- Chapter 21: Advanced PostgreSQL Querying (extends querying techniques).
- Chapter 22: PostgreSQL Indexing and Optimization (applies similar concepts to enterprise databases).
- Chapter 24: Checkpoint 3B (consolidates database skills).
What You’ll Learn
This chapter covers:
- SQLite Indexing: Creating and managing indexes to speed up queries.
- Query Optimization: Analyzing and rewriting queries for efficiency.
- Performance Analysis: Using `EXPLAIN QUERY PLAN` to understand query execution.
- Practical Application: Optimizing a sales database for analytics.
By the end, you’ll optimize queries on data/sales.db, reducing query time for sales analytics, and produce a performance report. The micro-project uses 4-space indentation per PEP 8 and tests edge cases, ensuring robust optimization.
Follow-Along Tips:
- Create `de-onboarding/data/` and populate it with `sales.db` per Appendix 1.
- Install SQLite: Verify with `sqlite3 --version`.
- Install Python libraries: `pip install pyyaml`.
- Use print statements (e.g., `print(cursor.fetchall())`) to debug queries.
- Verify file paths with `ls data/` (Unix/macOS) or `dir data\` (Windows).
- Use UTF-8 encoding for all files to avoid `UnicodeDecodeError`.
- Configure editor for 4-space indentation per PEP 8 (VS Code: “Editor: Tab Size” = 4, “Editor: Insert Spaces” = true, “Editor: Detect Indentation” = false).
20.1 SQLite Indexing Basics
Indexes in SQLite are B-tree data structures that speed up query execution by reducing the number of rows scanned. Without an index, a SELECT query performs a full table scan (O(n) for n rows). An index allows O(log n) lookups for equality or range queries on indexed columns. For a 1 million-row sales table, an index can reduce query time from seconds to milliseconds, but indexes increase storage (e.g., ~10MB for an index on a 1M-row table) and slow down INSERT/UPDATE operations (O(log n) per operation).
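The write-side cost is easy to observe directly. The sketch below is illustrative only, not part of the micro-project; it assumes `data/sales.db` from Appendix 1 and inserts a hypothetical `Halal Cable` row. It times 1,000 inserts with and without `idx_product`; on a table this small the difference is tiny, but it grows with table size and with the number of indexes to maintain:

```python
import sqlite3  # For SQLite operations
import time  # For timing

conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()

def time_inserts(n):  # Hypothetical helper: time n single-row inserts
    start = time.time()
    for _ in range(n):
        cursor.execute(
            "INSERT INTO sales (product, price, quantity) VALUES (?, ?, ?)",
            ("Halal Cable", 9.99, 1),
        )
    return time.time() - start

cursor.execute("DROP INDEX IF EXISTS idx_product")  # Baseline: no index to maintain
print(f"1000 inserts, no index: {time_inserts(1000):.4f}s")
cursor.execute("CREATE INDEX idx_product ON sales (product)")
print(f"1000 inserts, with index: {time_inserts(1000):.4f}s")  # Slightly slower
cursor.execute("DELETE FROM sales WHERE product = 'Halal Cable'")  # Remove test rows
conn.commit()
conn.close()
```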
20.1.1 Creating and Managing Indexes
Create indexes on frequently queried columns, such as product for sales filtering. The following diagram illustrates a simplified B-tree index for the product column. Note that real B-trees have multiple levels for large datasets, but this is simplified for clarity:
```mermaid
graph TD
    A["Root Node<br>[Halal Keyboard]"] --> B["Leaf Node<br>Halal Laptop"]
    A --> C["Leaf Node<br>Halal Mouse"]
    B --> D["Data Pointer<br>(Halal Laptop, 999.99, 2)"]
    C --> E["Data Pointer<br>(Halal Mouse, 24.99, 10)"]
    classDef node fill:#d0e0ff,stroke:#336,stroke-width:1px
    classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
    class A,B,C node
    class D,E data
```

```python
import sqlite3  # Import SQLite

# Connect to database
conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()

# Create index on product column
cursor.execute("CREATE INDEX IF NOT EXISTS idx_product ON sales (product)")
print("Created index idx_product")  # Debug

# Verify index
cursor.execute("SELECT name FROM sqlite_master WHERE type='index'")
indexes = cursor.fetchall()
print("Indexes:", indexes)  # Debug: list all indexes

# Drop index (if needed)
cursor.execute("DROP INDEX IF EXISTS idx_product")
print("Dropped index idx_product")  # Debug

# Commit and close
conn.commit()
conn.close()

# Expected Output:
# Created index idx_product
# Indexes: [('idx_product',)]
# Dropped index idx_product
```

Follow-Along Instructions:
- Ensure `data/sales.db` exists per Appendix 1.
- Save as `de-onboarding/index_basics.py`.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python index_basics.py`.
- Verify output matches comments.
- Common Errors:
  - DatabaseNotFound: Ensure `data/sales.db` exists. Run `create_sales_db.py` from Appendix 1.
  - OperationalError: Check table/column names. Print `cursor.execute("SELECT name FROM sqlite_master WHERE type='table'").fetchall()`.
  - IndentationError: Use 4 spaces (not tabs). Run `python -tt index_basics.py`.
Key Points:

- `CREATE INDEX`: Builds a B-tree index (O(n log n) creation time).
- Time Complexity:
  - Query with index: O(log n) for lookups.
  - Index creation: O(n log n).
  - Insert/Update with index: O(log n) per operation.
- Space Complexity: O(n) for index storage (~10MB for 1M rows).
- Underlying Implementation: B-trees store sorted keys, enabling binary search for lookups. SQLite maintains indexes automatically during data modifications.
- Index Selection: Choose columns with high selectivity (e.g., `product`, whose values are mostly unique) for maximum benefit, as these narrow the scan the most; a quick way to measure selectivity is shown below.
- Implication: Use indexes for frequently queried columns in Hijra Group’s sales analytics.
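To check selectivity before creating an index, divide the number of distinct values by the total row count; values near 1.0 favor indexing. A minimal sketch, assuming `data/sales.db` from Appendix 1:

```python
import sqlite3  # For SQLite operations

conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()
# Selectivity = distinct values / total rows
cursor.execute("SELECT COUNT(DISTINCT product) * 1.0 / COUNT(*) FROM sales")
selectivity = cursor.fetchone()[0]
print(f"Selectivity of product: {selectivity:.2f}")  # 1.00 here: all 3 products are unique
conn.close()
```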
20.1.2 Querying with Indexes
Indexes speed up WHERE and JOIN clauses. Compare query performance with and without an index.
```python
import sqlite3  # Import SQLite
import time  # For timing

# Connect to database
conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()

# Drop any leftover index so the first timing really measures a scan
cursor.execute("DROP INDEX IF EXISTS idx_product")

# Query without index
start = time.time()
cursor.execute("SELECT * FROM sales WHERE product = 'Halal Laptop'")
results = cursor.fetchall()
end = time.time()
print("Without index:", results, f"Time: {end - start:.4f}s")  # Debug

# Create index
cursor.execute("CREATE INDEX IF NOT EXISTS idx_product ON sales (product)")

# Query with index
start = time.time()
cursor.execute("SELECT * FROM sales WHERE product = 'Halal Laptop'")
results = cursor.fetchall()
end = time.time()
print("With index:", results, f"Time: {end - start:.4f}s")  # Debug

# Commit and close
conn.commit()
conn.close()

# Expected Output (times vary):
# Without index: [('Halal Laptop', 999.99, 2)] Time: 0.0010s
# With index: [('Halal Laptop', 999.99, 2)] Time: 0.0005s
```

Follow-Along Instructions:
- Save as `de-onboarding/index_query.py`.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python index_query.py`.
- Verify faster time with index.
- Common Errors:
  - OperationalError: Ensure the `sales` table exists. Print the schema with `cursor.execute("SELECT sql FROM sqlite_master WHERE name='sales'").fetchone()` (note: `.schema` is a `sqlite3` shell command, not SQL).
  - IndentationError: Use 4 spaces (not tabs). Run `python -tt index_query.py`.
Key Points:

- Indexes reduce query time for `WHERE` clauses.
- Performance Considerations:
  - Without index: O(n) full scan.
  - With index: O(log n) lookup.
- Implication: Critical for filtering sales data in pipelines.
20.2 Query Optimization
Optimizing queries involves rewriting SQL to minimize resource usage and leverage indexes. Common techniques include selecting specific columns, avoiding unnecessary joins, and simplifying conditions so that indexed columns appear bare in `WHERE` clauses, as the sketch below shows.
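One condition-simplifying rule worth knowing up front: wrapping an indexed column in an expression or function prevents SQLite from using the index. A minimal sketch, assuming an `idx_price` index on `sales (price)`:

```python
import sqlite3  # For SQLite operations

conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()
cursor.execute("CREATE INDEX IF NOT EXISTS idx_price ON sales (price)")

# The expression price * 2 hides the column, forcing a full scan
cursor.execute("EXPLAIN QUERY PLAN SELECT product FROM sales WHERE price * 2 > 100")
print("Expression on column:", cursor.fetchall())  # Expect SCAN

# The algebraically equivalent rewrite leaves the column bare, so the index applies
cursor.execute("EXPLAIN QUERY PLAN SELECT product FROM sales WHERE price > 50")
print("Rewritten:", cursor.fetchall())  # Expect SEARCH ... USING INDEX idx_price

conn.commit()
conn.close()
```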
20.2.1 Analyzing Queries with EXPLAIN QUERY PLAN
Use EXPLAIN QUERY PLAN to understand query execution and identify optimization opportunities. The following flowchart illustrates the query execution process:
```mermaid
flowchart TD
    A["SQL Query"] --> B["Check Index Availability"]
    B -->|No Index| C["SCAN TABLE (O(n))"]
    B -->|Index Found| D["SEARCH TABLE USING INDEX (O(log n))"]
    C --> E["Return Results"]
    D --> E
    classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
    classDef result fill:#ddffdd,stroke:#363,stroke-width:1px
    class A,B,C,D process
    class E result
```

```python
import sqlite3  # Import SQLite

# Connect to database
conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()

# Drop any leftover index so the first plan really shows a scan
cursor.execute("DROP INDEX IF EXISTS idx_product")

# Analyze query without index
cursor.execute("EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product = 'Halal Laptop'")
plan = cursor.fetchall()
print("Query Plan (no index):", plan)  # Debug

# Create index
cursor.execute("CREATE INDEX IF NOT EXISTS idx_product ON sales (product)")

# Analyze query with index
cursor.execute("EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product = 'Halal Laptop'")
plan = cursor.fetchall()
print("Query Plan (with index):", plan)  # Debug

# Commit and close
conn.commit()
conn.close()

# Expected Output:
# Query Plan (no index): [(0, 0, 0, 'SCAN TABLE sales')]
# Query Plan (with index): [(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_product (product=?)')]
```

Follow-Along Instructions:
- Save as `de-onboarding/query_plan.py`.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python query_plan.py`.
- Verify plan shows index usage.
- Common Errors:
  - OperationalError: Check SQL syntax. Print the query string.
  - IndentationError: Use 4 spaces (not tabs). Run `python -tt query_plan.py`.
Key Points:

- `EXPLAIN QUERY PLAN`: Shows execution steps (e.g., `SCAN` vs. `SEARCH`).
- `SCAN TABLE`: O(n) full scan, indicating no index usage.
- `SEARCH TABLE USING INDEX`: O(log n) index lookup, confirming optimization.
- Note: Exact plan wording varies by SQLite version; newer releases print `SCAN sales` and `SEARCH sales USING INDEX ...` without the word `TABLE`.
- Implication: Use to debug slow queries in analytics.
20.2.2 Optimizing Queries
Rewrite queries to use indexes and reduce complexity. For example, select specific columns instead of *.
```python
import sqlite3  # Import SQLite
import time  # For timing

# Connect to database
conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()

# Drop any leftover index so the first timing is a clean baseline
cursor.execute("DROP INDEX IF EXISTS idx_quantity")

# Unoptimized query
start = time.time()
cursor.execute("SELECT * FROM sales WHERE quantity > 5")
results = cursor.fetchall()
end = time.time()
print("Unoptimized:", results, f"Time: {end - start:.4f}s")  # Debug

# Create index on quantity
cursor.execute("CREATE INDEX IF NOT EXISTS idx_quantity ON sales (quantity)")

# Optimized query
start = time.time()
cursor.execute("SELECT product, quantity FROM sales WHERE quantity > 5")
results = cursor.fetchall()
end = time.time()
print("Optimized:", results, f"Time: {end - start:.4f}s")  # Debug

# Commit and close
conn.commit()
conn.close()

# Expected Output (times vary):
# Unoptimized: [('Halal Mouse', 24.99, 10)] Time: 0.0010s
# Optimized: [('Halal Mouse', 10)] Time: 0.0004s
```

Follow-Along Instructions:
- Save as `de-onboarding/query_optimize.py`.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python query_optimize.py`.
- Verify faster time for optimized query.
- Common Errors:
  - OperationalError: Ensure the column exists. Print `cursor.execute("PRAGMA table_info(sales)").fetchall()`.
  - IndentationError: Use 4 spaces (not tabs). Run `python -tt query_optimize.py`.
Key Points:

- Select specific columns to reduce I/O.
- Indexes on `WHERE` columns improve performance; a composite index that covers all selected columns can avoid touching the table entirely, as the sketch below shows.
- Time Complexity: O(log n) with index vs. O(n) without.
- Implication: Optimizes sales queries for Hijra Group’s reporting.
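Going one step beyond the chapter's single-column indexes: a covering index contains every column a query needs, so SQLite can answer from the index alone without touching the table. A minimal sketch (the composite index `idx_qty_product` is illustrative and is dropped afterward so it does not affect the micro-project):

```python
import sqlite3  # For SQLite operations

conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()
# Composite index on (quantity, product) covers the query below entirely
cursor.execute("CREATE INDEX IF NOT EXISTS idx_qty_product ON sales (quantity, product)")
cursor.execute("EXPLAIN QUERY PLAN SELECT product, quantity FROM sales WHERE quantity > 5")
print(cursor.fetchall())  # Expect: SEARCH ... USING COVERING INDEX idx_qty_product
cursor.execute("DROP INDEX IF EXISTS idx_qty_product")  # Leave sales.db unchanged
conn.commit()
conn.close()
```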
20.3 Micro-Project: Optimizing a Sales Database
Project Requirements
Optimize the sales.db database for sales analytics, focusing on queries for Hijra Group’s transaction reporting. The processor will:
- Load `data/config.yaml` with PyYAML to define optimization rules.
- Create indexes on `product` and `quantity` columns.
- Optimize queries for total sales, high-quantity sales, and product-specific sales.
- Generate a performance report in `data/optimization_report.json`.
- Log query plans and timings using print statements.
- Use 4-space indentation per PEP 8, preferring spaces over tabs.
- Test edge cases (e.g., empty tables, missing indexes).

Sample Input Files

`data/sales.db` (from Appendix 1):

- Table: `sales`
- Columns: `product` (TEXT), `price` (REAL), `quantity` (INTEGER)
- Data:

```sql
INSERT INTO sales (product, price, quantity) VALUES
('Halal Laptop', 999.99, 2),
('Halal Mouse', 24.99, 10),
('Halal Keyboard', 49.99, 5);
```

`data/config.yaml` (from Appendix 1):

```yaml
min_price: 10.0
max_quantity: 100
required_fields:
  - product
  - price
  - quantity
product_prefix: 'Halal'
max_decimals: 2
```

Data Processing Flow
```mermaid
flowchart TD
    A["SQLite Database<br>sales.db"] --> B["Load Config<br>config.yaml"]
    B --> C["Connect to SQLite"]
    C --> D["Create Indexes<br>product, quantity"]
    D --> E["Run Queries<br>Unoptimized/Optimized"]
    E --> F["Analyze Query Plans<br>EXPLAIN QUERY PLAN"]
    F --> G["Measure Timings"]
    G --> H["Export Report<br>optimization_report.json"]
    G --> I["Log Plans/Timings"]
    classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
    classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
    classDef endpoint fill:#ddffdd,stroke:#363,stroke-width:1px
    class A,B,H data
    class C,D,E,F,G,I process
```

Acceptance Criteria
- Go Criteria:
  - Connects to `sales.db` and loads `config.yaml`.
  - Creates indexes on `product` and `quantity`.
  - Runs unoptimized and optimized queries for:
    - Total sales amount.
    - High-quantity sales (>5).
    - Sales by product (e.g., Halal Laptop).
  - Exports performance report to `data/optimization_report.json` with query plans and timings.
  - Logs query plans and timings.
  - Uses 4-space indentation per PEP 8, preferring spaces over tabs.
  - Handles edge cases (e.g., empty table).
- No-Go Criteria:
  - Fails to connect to `sales.db` or load `config.yaml`.
  - Incorrect query results or missing indexes.
  - Missing JSON report.
  - Uses try/except or type annotations.
  - Inconsistent indentation or tab/space mixing.
Common Pitfalls to Avoid

- Database Connection Issues:
  - Problem: `sales.db` not found.
  - Solution: Ensure `data/sales.db` exists. Run `create_sales_db.py` from Appendix 1. Print the path with `print(db_path)`.
- Index Creation Errors:
  - Problem: `OperationalError` due to an existing index.
  - Solution: Use `IF NOT EXISTS`. Print `cursor.execute("SELECT name FROM sqlite_master WHERE type='index'").fetchall()`.
- Query Plan Misinterpretation:
  - Problem: Plan shows `SCAN` instead of `SEARCH`.
  - Solution: Verify the index with `cursor.execute("PRAGMA index_list(sales)").fetchall()`.
- Misleading Performance Numbers:
  - Problem: A small dataset shows minimal time difference.
  - Solution: Note that benefits scale with data size. Print the row count with `cursor.execute("SELECT COUNT(*) FROM sales").fetchone()`.
- Index Fragmentation:
  - Problem: Frequent `INSERT`/`UPDATE` operations fragment indexes, slowing queries.
  - Solution: Monitor query performance with `EXPLAIN QUERY PLAN` to detect slowdowns. Rebuild with SQLite’s `REINDEX` or `VACUUM` commands if needed (see the sketch below).
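A minimal maintenance sketch for the fragmentation pitfall above. `REINDEX` rebuilds index B-trees in place, while `VACUUM` rewrites the whole database file to compact it; `VACUUM` cannot run inside a transaction, hence the autocommit setting:

```python
import sqlite3  # For SQLite operations

conn = sqlite3.connect("data/sales.db")
conn.isolation_level = None  # Autocommit mode: VACUUM cannot run in a transaction
conn.execute("REINDEX")  # Rebuild all index B-trees
conn.execute("VACUUM")  # Rewrite the database file, compacting pages
conn.close()
print("Maintenance complete")  # Debug
```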
How This Differs from Production
In production, this solution would include:
- Error Handling: Try/except for robust errors (Chapter 7).
- Type Safety: Type annotations with Pyright (Chapter 7).
- Testing: Unit tests with `pytest` (Chapter 9).
- Scalability: Handling larger datasets with partitioning (Chapter 22).
- Monitoring: Query performance metrics (Chapter 66).
- Automation: Index maintenance scripts (Chapter 56).
Implementation
```python
# File: de-onboarding/optimize_sales.py
import sqlite3  # For SQLite operations
import yaml  # For YAML parsing
import json  # For JSON export
import time  # For timing
import os  # For file existence check


def read_config(config_path):  # Takes config file path
    """Read YAML configuration."""
    print(f"Opening config: {config_path}")  # Debug: print path
    file = open(config_path, "r")  # Open YAML
    config = yaml.safe_load(file)  # Parse YAML
    file.close()  # Close file
    print(f"Loaded config: {config}")  # Debug: print config
    return config  # Return config dictionary


def create_indexes(conn, cursor):  # Takes connection and cursor
    """Create indexes on product and quantity."""
    print("Creating indexes")  # Debug
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_product ON sales (product)")
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_quantity ON sales (quantity)")
    print("Indexes created")  # Debug
    cursor.execute("SELECT name FROM sqlite_master WHERE type='index'")
    indexes = cursor.fetchall()
    print("Current indexes:", indexes)  # Debug
    conn.commit()  # Commit changes


def get_query_plan(cursor, query):  # Takes cursor and query
    """Get EXPLAIN QUERY PLAN for a query."""
    cursor.execute(f"EXPLAIN QUERY PLAN {query}")
    plan = cursor.fetchall()
    print(f"Query Plan for '{query}': {plan}")  # Debug
    return plan  # Return plan


def run_queries(conn, cursor):  # Takes connection and cursor
    """Run unoptimized and optimized queries, return results and timings."""
    results = {
        "total_sales": {},
        "high_quantity": {},
        "product_specific": {},
    }

    # Total sales (unoptimized: select all columns)
    query1_unopt = "SELECT * FROM sales"
    start = time.time()
    cursor.execute(query1_unopt)
    total_sales_unopt = sum(row[1] * row[2] for row in cursor.fetchall())
    end = time.time()
    plan1_unopt = get_query_plan(cursor, query1_unopt)
    results["total_sales"]["unoptimized"] = {
        "result": float(total_sales_unopt),
        "time": end - start,
        "plan": str(plan1_unopt),
    }

    # Total sales (optimized: select specific columns)
    query1_opt = "SELECT price, quantity FROM sales"
    start = time.time()
    cursor.execute(query1_opt)
    total_sales_opt = sum(row[0] * row[1] for row in cursor.fetchall())
    end = time.time()
    plan1_opt = get_query_plan(cursor, query1_opt)
    results["total_sales"]["optimized"] = {
        "result": float(total_sales_opt),
        "time": end - start,
        "plan": str(plan1_opt),
    }

    # High-quantity sales (unoptimized: select all columns)
    query2_unopt = "SELECT * FROM sales WHERE quantity > 5"
    start = time.time()
    cursor.execute(query2_unopt)
    high_qty_unopt = cursor.fetchall()
    end = time.time()
    plan2_unopt = get_query_plan(cursor, query2_unopt)
    results["high_quantity"]["unoptimized"] = {
        "result": [(row[0], row[1], row[2]) for row in high_qty_unopt],
        "time": end - start,
        "plan": str(plan2_unopt),
    }

    # High-quantity sales (optimized: specific columns, indexed filter)
    query2_opt = "SELECT product, quantity FROM sales WHERE quantity > 5"
    start = time.time()
    cursor.execute(query2_opt)
    high_qty_opt = cursor.fetchall()
    end = time.time()
    plan2_opt = get_query_plan(cursor, query2_opt)
    results["high_quantity"]["optimized"] = {
        "result": [(row[0], row[1]) for row in high_qty_opt],
        "time": end - start,
        "plan": str(plan2_opt),
    }

    # Product-specific sales (unoptimized: select all columns)
    query3_unopt = "SELECT * FROM sales WHERE product = 'Halal Laptop'"
    start = time.time()
    cursor.execute(query3_unopt)
    product_unopt = cursor.fetchall()
    end = time.time()
    plan3_unopt = get_query_plan(cursor, query3_unopt)
    results["product_specific"]["unoptimized"] = {
        "result": [(row[0], row[1], row[2]) for row in product_unopt],
        "time": end - start,
        "plan": str(plan3_unopt),
    }

    # Product-specific sales (optimized: specific columns, indexed filter)
    query3_opt = "SELECT product, price, quantity FROM sales WHERE product = 'Halal Laptop'"
    start = time.time()
    cursor.execute(query3_opt)
    product_opt = cursor.fetchall()
    end = time.time()
    plan3_opt = get_query_plan(cursor, query3_opt)
    results["product_specific"]["optimized"] = {
        "result": [(row[0], row[1], row[2]) for row in product_opt],
        "time": end - start,
        "plan": str(plan3_opt),
    }

    print("Query Results:", results)  # Debug
    return results


def export_report(results, report_path):  # Takes results and file path
    """Export optimization report to JSON."""
    print(f"Writing to: {report_path}")  # Debug: print path
    file = open(report_path, "w")  # Open JSON file
    json.dump(results, file, indent=2)  # Write JSON
    file.close()  # Close file
    print(f"Exported report to {report_path}")  # Confirm export
    print(f"File exists: {os.path.exists(report_path)}")  # Confirm file creation


def main():  # No parameters
    """Main function to optimize sales database."""
    db_path = "data/sales.db"  # Database path
    config_path = "data/config.yaml"  # YAML path
    report_path = "data/optimization_report.json"  # JSON output path

    config = read_config(config_path)  # Read config
    conn = sqlite3.connect(db_path)  # Connect to database
    cursor = conn.cursor()

    # Check if table is empty
    cursor.execute("SELECT COUNT(*) FROM sales")
    row_count = cursor.fetchone()[0]
    print(f"Table row count: {row_count}")  # Debug
    if row_count == 0:
        print("No data in sales table")  # Log empty
        conn.close()
        return

    create_indexes(conn, cursor)  # Create indexes
    results = run_queries(conn, cursor)  # Run queries
    export_report(results, report_path)  # Export report

    # Output summary
    print("\nOptimization Report:")  # Print header
    print(f"Total Sales (Unoptimized): ${results['total_sales']['unoptimized']['result']:.2f}, Time: {results['total_sales']['unoptimized']['time']:.4f}s")
    print(f"Total Sales (Optimized): ${results['total_sales']['optimized']['result']:.2f}, Time: {results['total_sales']['optimized']['time']:.4f}s")
    print(f"High-Quantity Sales: {len(results['high_quantity']['optimized']['result'])} records")
    print(f"Product-Specific Sales: {len(results['product_specific']['optimized']['result'])} records")
    print("Optimization completed")  # Confirm completion

    conn.commit()  # Commit changes
    conn.close()  # Close connection


if __name__ == "__main__":
    main()  # Run main function
```

Expected Outputs
data/optimization_report.json:
```json
{
  "total_sales": {
    "unoptimized": {
      "result": 2499.83,
      "time": 0.001,
      "plan": "[(0, 0, 0, 'SCAN TABLE sales')]"
    },
    "optimized": {
      "result": 2499.83,
      "time": 0.0005,
      "plan": "[(0, 0, 0, 'SCAN TABLE sales')]"
    }
  },
  "high_quantity": {
    "unoptimized": {
      "result": [["Halal Mouse", 24.99, 10]],
      "time": 0.0008,
      "plan": "[(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_quantity (quantity>?)')]"
    },
    "optimized": {
      "result": [["Halal Mouse", 10]],
      "time": 0.0004,
      "plan": "[(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_quantity (quantity>?)')]"
    }
  },
  "product_specific": {
    "unoptimized": {
      "result": [["Halal Laptop", 999.99, 2]],
      "time": 0.0007,
      "plan": "[(0, 0, 0, 'SCAN TABLE sales')]"
    },
    "optimized": {
      "result": [["Halal Laptop", 999.99, 2]],
      "time": 0.0003,
      "plan": "[(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_product (product=?)')]"
    }
  }
}
```

(Note: because `create_indexes` runs before any queries, the "unoptimized" filtered queries may also show index searches, as in the console output below; "unoptimized" here refers to selecting all columns, not to missing indexes.)

Console Output (abridged):
```
Opening config: data/config.yaml
Loaded config: {'min_price': 10.0, 'max_quantity': 100, 'required_fields': ['product', 'price', 'quantity'], 'product_prefix': 'Halal', 'max_decimals': 2}
Table row count: 3
Creating indexes
Indexes created
Current indexes: [('idx_product',), ('idx_quantity',)]
Query Plan for 'SELECT * FROM sales': [(0, 0, 0, 'SCAN TABLE sales')]
Query Plan for 'SELECT price, quantity FROM sales': [(0, 0, 0, 'SCAN TABLE sales')]
Query Plan for 'SELECT * FROM sales WHERE quantity > 5': [(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_quantity (quantity>?)')]
Query Plan for 'SELECT product, quantity FROM sales WHERE quantity > 5': [(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_quantity (quantity>?)')]
Query Plan for 'SELECT * FROM sales WHERE product = 'Halal Laptop'': [(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_product (product=?)')]
Query Plan for 'SELECT product, price, quantity FROM sales WHERE product = 'Halal Laptop'': [(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_product (product=?)')]
Writing to: data/optimization_report.json
Exported report to data/optimization_report.json
File exists: True

Optimization Report:
Total Sales (Unoptimized): $2499.83, Time: 0.0010s
Total Sales (Optimized): $2499.83, Time: 0.0005s
High-Quantity Sales: 1 records
Product-Specific Sales: 1 records
Optimization completed
```

How to Run and Test
Setup:

- Setup Checklist:
  - Create `de-onboarding/data/` directory.
  - Save `sales.db` and `config.yaml` per Appendix 1.
  - Install libraries: `pip install pyyaml`.
  - Create a virtual environment: `python -m venv venv`, then activate it (Windows: `venv\Scripts\activate`, Unix: `source venv/bin/activate`).
  - Verify Python 3.10+: `python --version`.
  - Configure editor for 4-space indentation per PEP 8.
  - Save `optimize_sales.py` in `de-onboarding/`.
- Troubleshooting:
  - If `OperationalError`, check the table schema with `sqlite3 data/sales.db ".schema sales"`.
  - If `yaml.YAMLError`, run `print(open(config_path).read())` to inspect the YAML.
  - If `IndentationError`, use 4 spaces (not tabs). Run `python -tt optimize_sales.py`.

Run:

- Open a terminal in `de-onboarding/`.
- Run: `python optimize_sales.py`.
- Outputs: `data/optimization_report.json`, console logs.

Test Scenarios:

- Valid Data: Verify `optimization_report.json` shows correct results, faster times for optimized queries, and index usage in plans.
- Empty Table: Test with an empty table:

```python
conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()
cursor.execute("DELETE FROM sales")
conn.commit()
main()  # Expected: "No data in sales table"
```

- Missing Indexes: Drop indexes and test:

```python
conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()
cursor.execute("DROP INDEX IF EXISTS idx_product")
cursor.execute("DROP INDEX IF EXISTS idx_quantity")
conn.commit()
results = run_queries(conn, cursor)
print(results["product_specific"]["unoptimized"]["plan"])  # Expected: SCAN TABLE
```

- Large Dataset: Test with ~1000 rows:

```python
conn = sqlite3.connect("data/sales.db")
cursor = conn.cursor()
cursor.execute("SELECT * FROM sales")
original_rows = cursor.fetchall()
for _ in range(333):
    for row in original_rows:
        cursor.execute(
            "INSERT INTO sales (product, price, quantity) VALUES (?, ?, ?)", row
        )
conn.commit()
main()  # Expected: same JSON structure, larger timings, optimized queries faster
# Compare timings in data/optimization_report.json
# Restore original data:
cursor.execute("DELETE FROM sales")
cursor.executemany(
    "INSERT INTO sales (product, price, quantity) VALUES (?, ?, ?)", original_rows
)
conn.commit()
```
20.4 Practice Exercises
Exercise 1: Create an Index
Write a function to create an index on the price column, with 4-space indentation per PEP 8.
Expected Output:
```
Index idx_price created
```

Follow-Along Instructions:

- Save as `de-onboarding/ex1_index.py`.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python ex1_index.py`.
- How to Test:
  - Verify the index with `sqlite3 data/sales.db "SELECT name FROM sqlite_master WHERE type='index'"`.
  - Test with an existing index: should not raise an error.
Exercise 2: Query Plan Analysis
Write a function to print the query plan for a product-specific query, with 4-space indentation per PEP 8.
Expected Output:
```
Query Plan: [(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_product (product=?)')]
```

Follow-Along Instructions:

- Save as `de-onboarding/ex2_query_plan.py`.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python ex2_query_plan.py`.
- How to Test:
  - Verify plan shows index usage.
  - Test without index: should show `SCAN TABLE`.
- Key Points:
  - `SCAN TABLE` indicates no index usage (O(n) scan).
  - `SEARCH TABLE USING INDEX` confirms optimization (O(log n) lookup).
Exercise 3: Optimize a Query
Write a function to optimize a query for high-price sales (>50), with 4-space indentation per PEP 8.
Expected Output:
```
Results: [('Halal Laptop', 999.99, 2)]
Time: 0.0004s
```

Follow-Along Instructions:

- Save as `de-onboarding/ex3_optimize.py`.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python ex3_optimize.py`.
- How to Test:
  - Verify faster time with index.
  - Test with an empty table: should return an empty list.
Exercise 4: Performance Comparison
Write a function to compare unoptimized and optimized query times for total sales, with 4-space indentation per PEP 8.
Expected Output:
```
Unoptimized: $2499.83, Time: 0.0010s
Optimized: $2499.83, Time: 0.0005s
```

Follow-Along Instructions:

- Save as `de-onboarding/ex4_compare.py`.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python ex4_compare.py`.
- How to Test:
  - Verify the optimized query is faster.
  - Test with a larger dataset (manually insert rows).
Exercise 5: Debug an Index Issue
Fix this buggy code that fails to use an index, ensuring 4-space indentation per PEP 8.
Buggy Code:

```python
import sqlite3

def query_sales(db_path):
    conn = sqlite3.connect(db_path)
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM sales WHERE product LIKE '%Laptop%'")
    results = cursor.fetchall()
    print("Results:", results)
    conn.close()
    return results

query_sales("data/sales.db")
```

Expected Output:

```
Results: [('Halal Laptop', 999.99, 2)]
Query Plan: [(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_product (product=?)')]
```

Follow-Along Instructions:

- Save as `de-onboarding/ex5_debug.py`.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python ex5_debug.py` to see the issue.
- Fix and re-run.
- How to Test:
  - Verify plan shows index usage.
  - Test with a non-existent product: should return an empty list.
Exercise 6: Debug Slow Query After Scaling Dataset
Write a function to insert ~1000 rows into sales.db, run a slow query (SELECT * FROM sales WHERE product LIKE '%Laptop%'), use EXPLAIN QUERY PLAN to diagnose why it’s slow, fix it to use an index, and explain why LIKE '%Laptop%' bypasses the index, printing the explanation to the console. Optionally, scale to 10,000 rows to observe larger performance gains (note: slower hardware may show more significant differences; 1000 rows are sufficient for most learners). Use 4-space indentation per PEP 8.
Expected Output:
```
Rows inserted: 1002
Slow Query Time: 0.0050s
Query Plan (Slow): [(0, 0, 0, 'SCAN TABLE sales')]
Fixed Query Time: 0.0010s
Query Plan (Fixed): [(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_product (product=?)')]
Explanation: LIKE '%Laptop%' bypasses the index because the wildcard (%) at the start prevents SQLite from using the B-tree's sorted keys, forcing a full table scan.
```

(The count is 1002: the 3 original rows plus 333 copies of each.)

Follow-Along Instructions:

- Save as `de-onboarding/ex6_debug_scale.py`.
- Ensure `data/sales.db` exists per Appendix 1.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python ex6_debug_scale.py`.
- How to Test:
  - Verify row count with `sqlite3 data/sales.db "SELECT COUNT(*) FROM sales"`.
  - Check query plans, timings, and explanation in console output.
  - Test with an empty table: should handle gracefully.
  - Optional: Test with ~10,000 rows by increasing the iterations to 3333.
- Key Points:
  - `LIKE '%Laptop%'` causes a full scan (O(n)) due to pattern matching.
  - Exact matches (`=`) leverage indexes (O(log n)).
20.5 Exercise Solutions
Solution to Exercise 1: Create an Index
```python
import sqlite3  # Import SQLite

def create_price_index(db_path):  # Takes database path
    """Create index on price column."""
    conn = sqlite3.connect(db_path)  # Connect
    cursor = conn.cursor()
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_price ON sales (price)")
    print("Index idx_price created")  # Debug
    conn.commit()
    conn.close()

# Test
create_price_index("data/sales.db")  # Call function

# Output:
# Index idx_price created
```

Solution to Exercise 2: Query Plan Analysis
```python
import sqlite3  # Import SQLite

def get_product_query_plan(db_path):  # Takes database path
    """Get query plan for product-specific query."""
    conn = sqlite3.connect(db_path)  # Connect
    cursor = conn.cursor()
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_product ON sales (product)")
    cursor.execute("EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product = 'Halal Laptop'")
    plan = cursor.fetchall()
    print("Query Plan:", plan)  # Debug
    conn.commit()
    conn.close()
    return plan

# Test
get_product_query_plan("data/sales.db")  # Call function

# Output:
# Query Plan: [(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_product (product=?)')]
```

Solution to Exercise 3: Optimize a Query
```python
import sqlite3  # Import SQLite
import time  # For timing

def query_high_price(db_path):  # Takes database path
    """Optimized query for high-price sales."""
    conn = sqlite3.connect(db_path)  # Connect
    cursor = conn.cursor()
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_price ON sales (price)")
    start = time.time()
    cursor.execute("SELECT product, price, quantity FROM sales WHERE price > 50")
    results = cursor.fetchall()
    end = time.time()
    print("Results:", results, f"Time: {end - start:.4f}s")  # Debug
    conn.commit()
    conn.close()
    return results

# Test
query_high_price("data/sales.db")  # Call function

# Output:
# Results: [('Halal Laptop', 999.99, 2)] Time: 0.0004s
```

Solution to Exercise 4: Performance Comparison
```python
import sqlite3  # Import SQLite
import time  # For timing

def compare_total_sales(db_path):  # Takes database path
    """Compare unoptimized and optimized total sales queries."""
    conn = sqlite3.connect(db_path)  # Connect
    cursor = conn.cursor()

    # Unoptimized: fetch all columns
    start = time.time()
    cursor.execute("SELECT * FROM sales")
    total_unopt = sum(row[1] * row[2] for row in cursor.fetchall())
    end = time.time()
    print(f"Unoptimized: ${total_unopt:.2f}, Time: {end - start:.4f}s")

    # Optimized: fetch only the needed columns
    start = time.time()
    cursor.execute("SELECT price, quantity FROM sales")
    total_opt = sum(row[0] * row[1] for row in cursor.fetchall())
    end = time.time()
    print(f"Optimized: ${total_opt:.2f}, Time: {end - start:.4f}s")

    conn.close()

# Test
compare_total_sales("data/sales.db")  # Call function

# Output:
# Unoptimized: $2499.83, Time: 0.0010s
# Optimized: $2499.83, Time: 0.0005s
```

Solution to Exercise 5: Debug an Index Issue
```python
import sqlite3  # Import SQLite

def query_sales(db_path):  # Takes database path
    """Fixed query using index for product search."""
    conn = sqlite3.connect(db_path)  # Connect
    cursor = conn.cursor()
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_product ON sales (product)")
    cursor.execute("SELECT * FROM sales WHERE product = 'Halal Laptop'")  # Fix: use equality
    results = cursor.fetchall()
    cursor.execute("EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product = 'Halal Laptop'")
    plan = cursor.fetchall()
    print("Results:", results)  # Debug
    print("Query Plan:", plan)  # Debug
    conn.commit()
    conn.close()
    return results

# Test
query_sales("data/sales.db")  # Call function

# Output:
# Results: [('Halal Laptop', 999.99, 2)]
# Query Plan: [(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_product (product=?)')]
```

Explanation:

- Bug: `LIKE '%Laptop%'` doesn’t use the index, causing a full scan. Fixed by using `=` for an exact match, leveraging `idx_product`.
Solution to Exercise 6: Debug Slow Query After Scaling Dataset
```python
import sqlite3  # Import SQLite
import time  # For timing

def debug_scale_query(db_path):  # Takes database path
    """Insert ~1000 rows, debug slow query, fix with index, and explain LIKE issue."""
    conn = sqlite3.connect(db_path)  # Connect
    cursor = conn.cursor()

    # Insert ~1000 rows by duplicating data
    cursor.execute("SELECT * FROM sales")  # Get existing data
    original_rows = cursor.fetchall()
    if not original_rows:
        print("No data to duplicate")  # Handle empty table
        conn.close()
        return

    # Insert 333 copies of each row (3 rows * 333 = 999 new rows, 1002 total)
    for _ in range(333):
        for row in original_rows:
            cursor.execute(
                "INSERT INTO sales (product, price, quantity) VALUES (?, ?, ?)", row
            )
    cursor.execute("SELECT COUNT(*) FROM sales")
    row_count = cursor.fetchone()[0]
    print(f"Rows inserted: {row_count}")  # Debug

    # Slow query (LIKE pattern)
    cursor.execute("DROP INDEX IF EXISTS idx_product")  # Ensure no index
    start = time.time()
    cursor.execute("SELECT * FROM sales WHERE product LIKE '%Laptop%'")
    results = cursor.fetchall()
    end = time.time()
    slow_time = end - start
    cursor.execute("EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product LIKE '%Laptop%'")
    slow_plan = cursor.fetchall()
    print(f"Slow Query Time: {slow_time:.4f}s")  # Debug
    print(f"Query Plan (Slow): {slow_plan}")  # Debug

    # Fix with index and exact match
    cursor.execute("CREATE INDEX IF NOT EXISTS idx_product ON sales (product)")
    start = time.time()
    cursor.execute("SELECT * FROM sales WHERE product = 'Halal Laptop'")
    results = cursor.fetchall()
    end = time.time()
    fixed_time = end - start
    cursor.execute("EXPLAIN QUERY PLAN SELECT * FROM sales WHERE product = 'Halal Laptop'")
    fixed_plan = cursor.fetchall()
    print(f"Fixed Query Time: {fixed_time:.4f}s")  # Debug
    print(f"Query Plan (Fixed): {fixed_plan}")  # Debug

    # Explain why LIKE bypasses the index
    explanation = (
        "LIKE '%Laptop%' bypasses the index because the wildcard (%) at the start "
        "prevents SQLite from using the B-tree's sorted keys, forcing a full table scan."
    )
    print(f"Explanation: {explanation}")  # Debug

    # Optional: scale to ~10,000 rows
    # for _ in range(3333 - 333):  # Additional 3000 iterations
    #     for row in original_rows:
    #         cursor.execute("INSERT INTO sales (product, price, quantity) VALUES (?, ?, ?)", row)
    # cursor.execute("SELECT COUNT(*) FROM sales")
    # print(f"Rows inserted (optional): {cursor.fetchone()[0]}")

    conn.commit()
    conn.close()

# Test
debug_scale_query("data/sales.db")  # Call function

# Output:
# Rows inserted: 1002
# Slow Query Time: 0.0050s
# Query Plan (Slow): [(0, 0, 0, 'SCAN TABLE sales')]
# Fixed Query Time: 0.0010s
# Query Plan (Fixed): [(0, 0, 0, 'SEARCH TABLE sales USING INDEX idx_product (product=?)')]
# Explanation: LIKE '%Laptop%' bypasses the index because the wildcard (%) at the start prevents SQLite from using the B-tree's sorted keys, forcing a full table scan.
```

20.6 Chapter Summary and Connection to Chapter 21
In this chapter, you’ve mastered:
- Indexing: Creating B-tree indexes (O(log n) lookups, O(n) storage), favoring high-selectivity columns for maximum benefit.
- Query Optimization: Rewriting queries to use indexes and reduce I/O, avoiding patterns like leading-wildcard `LIKE` that bypass indexes.
- Performance Analysis: Using `EXPLAIN QUERY PLAN` to debug execution, distinguishing `SCAN` (O(n)) from `SEARCH` (O(log n)).
- Trade-offs: Understanding that indexes help less on small tables and add maintenance overhead to write-heavy workloads.
- White-Space Sensitivity and PEP 8: Using 4-space indentation, preferring spaces over tabs.
The micro-project optimized sales.db queries, reducing query times and producing a performance report, all with 4-space indentation per PEP 8. It tested edge cases (empty tables, missing indexes), ensuring robustness. Exercise 6 scaled the dataset, debugged a slow query, and explained index limitations, reinforcing practical skills. These skills prepare for scaling to larger datasets in PostgreSQL. Specifically, SQLite’s B-tree indexing principles (e.g., O(log n) lookups, storage trade-offs) directly apply to PostgreSQL’s B-tree indexes, enabling efficient query optimization in Chapter 22 and beyond.
Connection to Chapter 21
Chapter 21 introduces Advanced PostgreSQL Querying, building on this chapter:
- Query Techniques: Extends to CTEs and window functions, using `data/transactions.csv`, leveraging your ability to write efficient SQL from Chapter 19 and to optimize it here.
- Optimization: Applies indexing concepts to PostgreSQL, where B-tree indexes function similarly, preparing for Chapter 22’s advanced optimization techniques.
- Fintech Context: Enhances transaction analytics for Hijra Group, maintaining PEP 8’s 4-space indentation for maintainable pipeline scripts.