12 - SQL Fundamentals with SQLite
Complexity: Easy (E)
12.0 Introduction: Why This Matters for Data Engineering
Structured Query Language (SQL) is the cornerstone of data management in data engineering, enabling efficient querying and manipulation of financial transaction data for Hijra Group’s Sharia-compliant fintech analytics. For example, SQL queries enable Hijra Group to generate daily sales reports for Sharia compliance audits, supporting reviews by Sharia boards to ensure transactions meet IFSB standards by filtering product categories (e.g., Halal products) to ensure compliance in financial analytics, avoiding riba (interest). SQLite, a lightweight, serverless database, is ideal for learning SQL fundamentals due to its simplicity and zero-configuration setup, storing data in a single file (e.g., sales.db, ~4KB for small datasets like in Appendix 1). SQL queries offer O(log n) performance with indexes for lookups on n rows, compared to O(n) for Python loops, making it critical for scalable pipelines. This chapter builds on Chapters 1–11 (Python basics, Pandas, type safety, and testing) and introduces SQL for querying data/sales.db, preparing for Python-SQLite integration in Chapter 13 and PostgreSQL in Chapter 16.
This chapter avoids advanced concepts like type annotations (used from Chapter 7), Python-SQLite integration (Chapter 13), or complex joins (covered in Chapter 19) to focus on foundational skills. All code uses PEP 8’s 4-space indentation, preferring spaces over tabs to avoid IndentationError, ensuring compatibility with Hijra Group’s pipeline scripts. SQLite commands are executed via the sqlite3 command-line tool or Python’s sqlite3 module for testing, with queries written in uppercase per SQL convention for readability.
Data Engineering Workflow Context
This diagram illustrates SQL’s role in a data engineering pipeline:
flowchart TD
A["Raw Data (CSV)"] --> B["SQLite Database (sales.db)"]
B --> C["SQL Queries"]
C -->|SELECT| D["Filtered Data"]
C -->|INSERT| E["Updated Tables"]
C -->|GROUP BY| F["Aggregated Metrics"]
D --> G["Output (CSV/JSON)"]
E --> G
F --> G
G --> H["Pipeline/Analysis"]
classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
classDef storage fill:#ddffdd,stroke:#363,stroke-width:1px
class A,D,E,F data
class B,C process
class G,H storageBuilding On and Preparing For
- Building On:
- Chapter 1: Uses Python loops and lists to process data, now replaced with SQL queries for efficiency.
- Chapter 3: Leverages Pandas DataFrame filtering and grouping, now performed with SQL
WHEREandGROUP BY. - Chapter 9: Applies testing concepts for query validation, though without
pytest(introduced later). - Chapter 11: Ensures code quality with modular scripts, reused for SQL execution.
- Preparing For:
- Chapter 13: Prepares for Python-SQLite integration using
sqlite3module. - Chapter 14: Lays groundwork for advanced SQLite operations (transactions, views).
- Chapter 15: Enables type-safe database programming.
- Chapter 19: Supports advanced SQL querying (joins, subqueries).
- Chapter 13: Prepares for Python-SQLite integration using
What You’ll Learn
This chapter covers:
- SQLite Basics: Database setup and table creation.
- SQL Queries: SELECT, INSERT, UPDATE, DELETE for data manipulation.
- Filtering: WHERE clause for conditional queries.
- Aggregation: GROUP BY and COUNT for metrics.
- Testing Queries: Validating results with Python scripts.
- Micro-Project: A SQL tool to query
data/sales.dbfor sales analytics.
By the end, you’ll build a tested SQL tool to query data/sales.db, producing a JSON report of sales metrics, using 4-space indentation per PEP 8. The micro-project leverages sales.db (Appendix 1) and tests edge cases, ensuring robust data retrieval for Hijra Group’s analytics.
Follow-Along Tips:
- Create
de-onboarding/data/and populate withsales.dbper Appendix 1. - Install SQLite: Verify with
sqlite3 --version. - Install libraries:
pip install pyyaml(for config parsing). - Use
sqlite3command-line tool or Python’ssqlite3module for queries. - If
IndentationError, use 4 spaces per PEP 8. Runpython -tt script.py. - Use print statements (e.g.,
print(rows)) to debug query results. - Save outputs to
data/(e.g.,sales_report.json). - Verify file paths with
ls data/(Unix/macOS) ordir data\(Windows). - Use UTF-8 encoding for all files to avoid
UnicodeDecodeError.
12.1 SQLite Basics
SQLite is a serverless, file-based relational database, storing data in tables with rows and columns. The sales.db (Appendix 1) contains a sales table with columns product (TEXT), price (REAL), quantity (INTEGER), using ~4KB for 3 rows. SQLite supports O(log n) lookups with indexes and O(1) row inserts without constraints.
12.1.1 Setting Up SQLite
Create and inspect sales.db using the sqlite3 command-line tool.
# Navigate to de-onboarding/
cd de-onboarding
# Create sales.db (if not already created per Appendix 1)
sqlite3 data/sales.db
# Inside sqlite3 prompt
sqlite> CREATE TABLE IF NOT EXISTS sales (
...> product TEXT,
...> price REAL,
...> quantity INTEGER
...> );
sqlite> INSERT INTO sales (product, price, quantity) VALUES
...> ('Halal Laptop', 999.99, 2),
...> ('Halal Mouse', 24.99, 10),
...> ('Halal Keyboard', 49.99, 5);
sqlite> .tables
sales
sqlite> .schema sales
CREATE TABLE sales (
product TEXT,
price REAL,
quantity INTEGER
);
sqlite> SELECT * FROM sales;
Halal Laptop|999.99|2
Halal Mouse|24.99|10
Halal Keyboard|49.99|5
sqlite> .exitFollow-Along Instructions:
- Ensure
de-onboarding/data/exists. - Install SQLite:
sqlite3 --version(pre-installed on most systems). - Run commands above in terminal.
- Verify table has 3 rows with
SELECT COUNT(*) FROM sales;. - Common Errors:
- sqlite3: not found: Install SQLite (e.g.,
sudo apt install sqlite3on Ubuntu). - Database locked: Close other SQLite connections. Check with
lsof data/sales.db. - Syntax error: Ensure SQL commands end with
;. Print query to debug.
- sqlite3: not found: Install SQLite (e.g.,
Key Points:
- Table Creation:
CREATE TABLEdefines schema with data types (TEXT, REAL, INTEGER). - Data Insertion:
INSERT INTOadds rows. - Inspection:
.tableslists tables,.schemashows structure. - Underlying Implementation: SQLite stores data in a B-tree, enabling O(log n) lookups with indexes (Chapter 20). Tables are stored in a single file (
sales.db), with minimal overhead (~4KB for small datasets). - Performance Considerations:
- Time Complexity: O(1) for inserts (no indexes), O(n) for full scans.
- Space Complexity: O(n) for n rows (~4KB for 3 rows with TEXT/REAL/INTEGER).
- Implication: SQLite is ideal for small-scale analytics in Hijra Group’s pipelines.
12.2 SQL Queries
SQL queries manipulate data using SELECT, INSERT, UPDATE, and DELETE.
- Note: Transactions (Chapter 14) ensure data consistency for
INSERT/UPDATEoperations but are excluded here to focus on single-query operations. - Note: Subqueries (Chapter 19) enable nested queries but are excluded here to focus on basic operations.
This diagram illustrates the flow of a SELECT query:
flowchart TD
A["SQL Query (e.g., SELECT)"] --> B["SQLite Engine"]
B --> C["Process Query (B-tree)"]
C --> D["Return Results"]
D --> E["Output (e.g., Rows)"]
classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
classDef output fill:#ddffdd,stroke:#363,stroke-width:1px
class A,B,C process
class D,E output12.2.1 SELECT Syntax
Retrieve data with SELECT.
-- Select all columns
SELECT * FROM sales;
-- Select specific columns
SELECT product, price FROM sales;
-- Expected Output (in sqlite3):
-- Halal Laptop|999.99
-- Halal Mouse|24.99
-- Halal Keyboard|49.99Follow-Along Instructions:
- Open
sqlite3 data/sales.db. - Run queries above.
- Verify output matches expected.
- Common Errors:
- no such column: Check column names with
.schema sales. - syntax error: Ensure commas between columns and semicolon at end.
- no such column: Check column names with
Key Points:
SELECT *: Retrieves all columns.SELECT column1, column2: Retrieves specific columns.- Time Complexity: O(n) for full table scan.
- Space Complexity: O(k) for k returned rows.
- Implication: Efficient for retrieving sales data for reports.
12.2.2 INSERT Syntax
Add data with INSERT.
-- Insert a single row
INSERT INTO sales (product, price, quantity) VALUES ('Halal Monitor', 199.99, 3);
-- Verify insertion
SELECT * FROM sales WHERE product = 'Halal Monitor';
-- Expected Output:
-- Halal Monitor|199.99|3Follow-Along Instructions:
- Open
sqlite3 data/sales.db. - Run
INSERTandSELECTqueries. - Verify new row appears.
- Common Errors:
- datatype mismatch: Ensure
priceis REAL (float),quantityis INTEGER. Print.schema sales. - syntax error: Check parentheses and commas.
- datatype mismatch: Ensure
Key Points:
INSERT INTO: Adds rows with specified values.- Time Complexity: O(1) for inserts (no indexes).
- Space Complexity: O(1) per row.
- Implication: Useful for adding new sales records.
12.2.3 UPDATE Syntax
Modify data with UPDATE.
-- Update price for Halal Mouse
UPDATE sales SET price = 29.99 WHERE product = 'Halal Mouse';
-- Verify update
SELECT * FROM sales WHERE product = 'Halal Mouse';
-- Expected Output:
-- Halal Mouse|29.99|10Follow-Along Instructions:
- Open
sqlite3 data/sales.db. - Run
UPDATEandSELECTqueries. - Verify price is updated.
- Common Errors:
- no rows affected: Ensure
WHEREcondition matches rows. RunSELECT * FROM salesto check. - syntax error: Check
SETsyntax and semicolon.
- no rows affected: Ensure
Key Points:
UPDATE: Modifies existing rows based onWHERE.- Time Complexity: O(n) for scan, O(log n) with indexes.
- Space Complexity: O(1) for updates.
- Implication: Corrects pricing errors in sales data.
12.2.4 DELETE Syntax
Remove data with DELETE.
-- Delete rows with quantity > 5
DELETE FROM sales WHERE quantity > 5;
-- Verify deletion
SELECT * FROM sales;
-- Expected Output:
-- Halal Laptop|999.99|2
-- Halal Keyboard|49.99|5Follow-Along Instructions:
- Open
sqlite3 data/sales.db. - Run
DELETEandSELECTqueries. - Verify
Halal Mouse(quantity 10) is removed. - Common Errors:
- no rows affected: Check
WHEREcondition. RunSELECT quantity FROM sales. - syntax error: Ensure
WHEREclause is correct.
- no rows affected: Check
Key Points:
DELETE: Removes rows based onWHERE.- Time Complexity: O(n) for scan, O(log n) with indexes.
- Space Complexity: O(1) for deletions.
- Implication: Removes invalid sales records.
12.3 Filtering with WHERE
Filter data using WHERE for conditions.
-- Filter high-value sales
SELECT product, price, quantity
FROM sales
WHERE price > 100;
-- Expected Output:
-- Halal Laptop|999.99|2Follow-Along Instructions:
- Open
sqlite3 data/sales.db. - Run query above.
- Verify output shows only
Halal Laptop. - Common Errors:
- no rows returned: Check condition. Run
SELECT price FROM sales. - syntax error: Ensure
WHEREsyntax (e.g., no quotes for numbers).
- no rows returned: Check condition. Run
Key Points:
WHERE: Filters rows based on conditions (e.g.,price > 100,product = 'Halal Laptop').- Operators:
=,>,<,>=,<=,!=,LIKE,IN. - Note: Indexes (Chapter 20) can reduce WHERE’s time complexity to O(log n), but are excluded here to focus on basic filtering.
- Time Complexity: O(n) for scan, O(log n) with indexes.
- Space Complexity: O(k) for k matching rows.
- Implication: Filters Sharia-compliant products for analytics.
12.4 Aggregation with GROUP BY
Aggregate data using GROUP BY and functions like COUNT, SUM.
-- Count sales by product
SELECT product, COUNT(*) AS sale_count
FROM sales
GROUP BY product;
-- Expected Output:
-- Halal Laptop|1
-- Halal Mouse|1
-- Halal Keyboard|1Follow-Along Instructions:
- Open
sqlite3 data/sales.db. - Run query above.
- Verify output shows counts per product.
- Optional: Visualize GROUP BY results as a text-based bar chart (e.g., Halal Laptop: *_** for sale_count=4) using asterisks. Print asterisks proportional to counts (e.g., sale_count _ 2 asterisks) for exploration, ensuring it fits the session.
- Common Errors:
- non-aggregated column: Ensure non-grouped columns are aggregated. Check query syntax.
- syntax error: Verify
GROUP BYcolumn exists.
Key Points:
GROUP BY: Groups rows by column values.COUNT(*): Counts rows per group.SUM(column): Sums column values (e.g.,SUM(quantity)).- Time Complexity: O(n) for grouping n rows.
- Space Complexity: O(k) for k groups.
- Implication: Summarizes sales metrics for reports.
12.5 Testing Queries
Validate queries using Python’s sqlite3 module (no type annotations or pytest).
# File: de-onboarding/test_queries.py
import sqlite3 # For SQLite access
# Connect to database
conn = sqlite3.connect("data/sales.db") # Open database
cursor = conn.cursor() # Create cursor
# Test SELECT query
cursor.execute("SELECT product, price FROM sales WHERE price > 100") # Run query
rows = cursor.fetchall() # Fetch results
print("High-value sales:", rows) # Debug
if len(rows) == 1 and rows[0][0] == "Halal Laptop": # Validate
print("SELECT test passed") # Confirm
else:
print("SELECT test failed") # Report failure
# Close connection
conn.close() # Close database
# Expected Output:
# High-value sales: [('Halal Laptop', 999.99)]
# SELECT test passedFollow-Along Instructions:
- Save as
de-onboarding/test_queries.py. - Ensure
data/sales.dbexists per Appendix 1. - Configure editor for 4-space indentation per PEP 8.
- Run:
python test_queries.py. - Verify output confirms test passed.
- Optional: Use
EXPLAIN QUERY PLAN(e.g.,EXPLAIN QUERY PLAN SELECT * FROM sales WHERE price > 100;) insqlite3to debug query performance, preparing for Chapter 20’s optimization. - Optional: Run
sqlite3 data/sales.db .dumpin terminal to inspect sales.db’s schema and data, aiding debugging without requiring code changes. - Note: Manual validation (e.g., checking
len(rows)) will transition to automatedpytesttests in Chapter 9, enhancing testing scalability. - Note: pytest (Chapter 9) is avoided to maintain simplicity of manual validation, focusing on basic sqlite3 usage.
- Note: Type annotations (Chapter 7) are avoided to maintain simplicity, focusing on basic Python and sqlite3 usage.
- Common Errors:
- OperationalError: Check database path. Print
data/sales.db. - IndentationError: Use 4 spaces. Run
python -tt test_queries.py.
- OperationalError: Check database path. Print
Key Points:
sqlite3.connect(): Opens database.cursor.execute(): Runs SQL queries.fetchall(): Retrieves results as list of tuples.- Time Complexity: O(n) for query execution.
- Space Complexity: O(k) for k returned rows.
- Implication: Validates query accuracy for pipeline reliability.
12.5.1 Summary of SQL Fundamentals
This section summarized SQLite setup (database and table creation), basic SQL queries (SELECT, INSERT, UPDATE, DELETE), filtering with WHERE, aggregation with GROUP BY, and query testing with Python’s sqlite3 module. These skills enable efficient data manipulation for Hijra Group’s analytics, preparing for programmatic integration in Chapter 13.
12.6 Micro-Project: Sales Database Query Tool
Project Requirements
Build a SQL-based tool to query data/sales.db for Hijra Group’s sales analytics, producing a JSON report with total sales, unique products, and top products. The tool supports Hijra Group’s transaction reporting, ensuring compliance with IFSB standards by filtering Halal products:
- Query
salestable for valid records (price > 10, quantity ≤ 100, product starts with “Halal”). - Compute total sales (
price * quantity), unique products, and top 3 products by sales. - Export results to
data/sales_report.json. - Log steps and invalid records using print statements.
- Use Python’s
sqlite3module for query execution (no type annotations). - Use
utils.pyfrom Chapter 3 for validation. - Use 4-space indentation per PEP 8, preferring spaces over tabs.
- Test edge cases (e.g., empty table, invalid data).
Sample Input
data/sales.db (from Appendix 1):
-- Schema
CREATE TABLE sales (
product TEXT,
price REAL,
quantity INTEGER
);
-- Data
INSERT INTO sales (product, price, quantity) VALUES
('Halal Laptop', 999.99, 2),
('Halal Mouse', 24.99, 10),
('Halal Keyboard', 49.99, 5);data/config.yaml (from Appendix 1):
min_price: 10.0
max_quantity: 100
required_fields:
- product
- price
- quantity
product_prefix: 'Halal'
max_decimals: 2Data Processing Flow
flowchart TD
A["Sales Database
sales.db"] --> B["Load Config
config.yaml"]
B --> C["Execute SQL Queries
SELECT"]
C --> D["Validate Rows
utils.py"]
D -->|Invalid| E["Log Warning"]
D -->|Valid| F["Compute Metrics
Total, Top Products"]
F --> G["Export JSON
sales_report.json"]
E --> H["End Processing"]
G --> H
classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
classDef error fill:#ffdddd,stroke:#933,stroke-width:1px
classDef endpoint fill:#ddffdd,stroke:#363,stroke-width:1px
class A data
class B,C,D,F,G process
class E error
class H endpointAcceptance Criteria
- Go Criteria:
- Connects to
sales.dband readsconfig.yaml. - Queries valid records (price > 10, quantity ≤ 100, Halal products).
- Computes total sales, unique products, and top 3 products.
- Exports results to
data/sales_report.json. - Logs steps and invalid records.
- Uses 4-space indentation per PEP 8.
- Passes edge case tests (empty table, invalid data).
- Connects to
- No-Go Criteria:
- Fails to connect to
sales.dbor readconfig.yaml. - Incorrect query results or calculations.
- Missing JSON export.
- Uses type annotations or try/except.
- Inconsistent indentation.
- Fails to connect to
Common Pitfalls to Avoid
- Database Connection Failure:
- Problem:
OperationalErrordue to missingsales.db. - Solution: Ensure
data/sales.dbexists. Print path withprint(db_path).
- Problem:
- Query Syntax Errors:
- Problem:
sqlite3.OperationalErrorfrom incorrect SQL. - Solution: Print query with
print(query). Validate withsqlite3 data/sales.db.
- Problem:
- Validation Errors:
- Problem: Invalid rows not filtered.
- Solution: Use
utils.validate_sale. Print rows before filtering.
- JSON Export Failure:
- Problem:
PermissionErrorwhen writingsales_report.json. - Solution: Check write permissions with
ls -l data/. Printjson_path.
- Problem:
- IndentationError:
- Problem: Mixed spaces/tabs.
- Solution: Use 4 spaces per PEP 8. Run
python -tt sales_query.py.
- Empty Query Results:
- Problem:
fetchall()returns[]if no rows match. - Solution: Run
SELECT COUNT(*) FROM salesinsqlite3to verify data presence. Printrowsbefore validation.
- Problem:
How This Differs from Production
In production, this solution would include:
- Error Handling: Avoids try/except (introduced in Chapter 7) to maintain simplicity, using basic conditionals instead.
- Resource Management: Avoids context managers (with statements, introduced in Chapter 13) for file and database handling, using manual open/close instead.
- Type Safety: Type annotations with Pyright (Chapter 15).
- Testing: Unit tests with
pytest(Chapter 9). - Pipeline Testing: Manual validation in
test_queries.pywill be extended by automated data pipeline testing (Chapter 42) for production-grade reliability. - Scalability: SQLite’s single-file storage is limited for large datasets; PostgreSQL’s client-server model (Chapter 16) supports enterprise-scale analytics. Indexes for large datasets (Chapter 20).
- Deployment: Kubernetes (Chapter 61) will enable scalable SQLite deployments, supporting production-grade analytics.
- Logging: File-based logging (Chapter 52).
- Integration: Python-SQLite integration for automation (Chapter 13).
Implementation
# File: de-onboarding/utils.py (from Chapter 3, reused)
def is_numeric(s, max_decimals=2): # Check if string is a decimal number
"""Check if string is a decimal number with up to max_decimals."""
parts = s.split(".") # Split on decimal point
if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
return False # Invalid format
return len(parts[1]) <= max_decimals # Check decimal places
def clean_string(s): # Clean string by stripping whitespace
"""Strip whitespace from string."""
return s.strip()
def is_numeric_value(x): # Check if value is numeric
"""Check if value is an integer or float."""
return isinstance(x, (int, float)) # Return True for numeric types
def has_valid_decimals(x, max_decimals): # Check decimal places
"""Check if value has valid decimal places."""
return is_numeric(str(x), max_decimals) # Use is_numeric for validation
def apply_valid_decimals(x, max_decimals): # Apply decimal validation
"""Apply has_valid_decimals to a value."""
return has_valid_decimals(x, max_decimals)
def is_integer(x): # Check if value is an integer
"""Check if value is an integer when converted to string."""
return str(x).isdigit() # Return True for integer strings
# Note: Converts price/quantity to strings for consistency with Chapter 3’s string-based validation. In production, numeric validation could use SQLite’s type system (e.g., isinstance(row[1], float)).
def validate_sale(sale, config): # Validate a sale dictionary
"""Validate sale based on config rules."""
required_fields = config["required_fields"] # Get required fields
min_price = config["min_price"] # Get minimum price
max_quantity = config["max_quantity"] # Get maximum quantity
prefix = config["product_prefix"] # Get product prefix
max_decimals = config["max_decimals"] # Get max decimal places
print(f"Validating sale: {sale}") # Debug: print sale
# Empty string checks are sufficient for sales.db; in production, None checks (e.g., sale[field] is None) could enhance robustness.
for field in required_fields: # Loop through required fields
if not sale[field] or sale[field].strip() == "": # Check if field is empty
print(f"Invalid sale: missing {field}: {sale}") # Log invalid
return False
# Validate product: non-empty and matches prefix
product = clean_string(sale["product"]) # Clean product string
if not product.startswith(prefix): # Check prefix
print(f"Invalid sale: product lacks '{prefix}' prefix: {sale}") # Log invalid
return False
# Validate price: numeric, meets minimum, and positive
price = str(sale["price"]) # Convert to string
if not is_numeric(price, max_decimals) or float(price) < min_price or float(price) <= 0: # Check format, value
print(f"Invalid sale: invalid price: {sale}") # Log invalid
return False
# Validate quantity: integer and within limit
quantity = str(sale["quantity"]) # Convert to string
if not quantity.isdigit() or int(quantity) > max_quantity: # Check format and limit
print(f"Invalid sale: invalid quantity: {sale}") # Log invalid
return False
return True # Return True if all checks pass
# File: de-onboarding/sales_query.py
import sqlite3 # For SQLite access
import yaml # For YAML parsing
import json # For JSON export
import utils # Import utils module
# Define function to read YAML configuration
def read_config(config_path): # Takes config file path
"""Read YAML configuration."""
print(f"Opening config: {config_path}") # Debug: print path
file = open(config_path, "r") # Open YAML
# Assumes valid YAML; debug malformed config.yaml by printing file contents (e.g., print(open(config_path).read())).
config = yaml.safe_load(file) # Parse YAML
file.close() # Close file
print(f"Loaded config: {config}") # Debug: print config
return config # Return config dictionary
# Define function to query sales data
def query_sales(db_path, config): # Takes database path and config
"""Query sales data and validate."""
# Modular design prepares for integration with FastAPI endpoints (Chapter 53).
print(f"Connecting to database: {db_path}") # Debug
conn = sqlite3.connect(db_path) # Open database
cursor = conn.cursor() # Create cursor
# Query all sales
query = """
SELECT product, price, quantity
FROM sales
WHERE price > ? AND quantity <= ? AND product LIKE ?
"""
cursor.execute(query, (config["min_price"], config["max_quantity"], config["product_prefix"] + "%")) # Run query
rows = cursor.fetchall() # Fetch results
print("Raw query results:", rows) # Debug
# Validate rows
valid_sales = [] # Store valid sales
total_records = len(rows) # Count total records
for row in rows: # Loop through rows
sale = {
"product": row[0],
"price": str(row[1]), # Convert to string for validation
"quantity": str(row[2]) # Convert to string for validation
}
if utils.validate_sale(sale, config): # Validate sale
valid_sales.append({
"product": row[0],
"price": float(row[1]), # Convert back to float
"quantity": int(row[2]), # Convert back to int
"amount": float(row[1]) * int(row[2]) # Compute amount
})
conn.close() # Close database
print("Valid sales:", valid_sales) # Debug
return valid_sales, total_records # Return valid sales and count
# Define function to process sales data
def process_sales(sales, config): # Takes sales list and config
"""Process sales: compute total and top products."""
if not sales: # Check for empty sales
print("No valid sales data") # Debug
return {"total_sales": 0.0, "unique_products": [], "top_products": {}}, 0
# Compute metrics
total_sales = 0.0 # Initialize total
products = set() # Store unique products
product_sales = {} # Store sales by product
for sale in sales: # Loop through sales
amount = sale["amount"] # Get amount
total_sales += amount # Add to total
products.add(sale["product"]) # Add product
product_sales[sale["product"]] = product_sales.get(sale["product"], 0) + amount # Sum sales
# Get top 3 products
top_products = dict(sorted(product_sales.items(), key=lambda x: x[1], reverse=True)[:3])
valid_sales_count = len(sales) # Count valid sales
print(f"Valid sales: {valid_sales_count} records") # Debug
return {
"total_sales": total_sales,
"unique_products": list(products),
"top_products": top_products
}, valid_sales_count # Return results and count
# Define function to export results
def export_results(results, json_path): # Takes results and file path
"""Export results to JSON."""
print(f"Writing to: {json_path}") # Debug: print path
print(f"Results: {results}") # Debug: print results
file = open(json_path, "w") # Open JSON file
json.dump(results, file, indent=2) # Write JSON
file.close() # Close file
print(f"Exported results to {json_path}") # Confirm export
# Define main function
def main(): # No parameters
"""Main function to query sales data."""
db_path = "data/sales.db" # Database path
config_path = "data/config.yaml" # YAML path
json_path = "data/sales_report.json" # JSON output path
config = read_config(config_path) # Read config
sales, total_records = query_sales(db_path, config) # Query sales
results, valid_sales = process_sales(sales, config) # Process
export_results(results, json_path) # Export results
# Print ASCII table for visual clarity; no external libraries used
# Fixed width (20 chars) may truncate long product names, but sufficient for sales.db’s data (e.g., "Halal Laptop"). Rounding to 2 decimals ensures consistent display; production may adjust precision.
print("\nSales Report (Table):")
print("-" * 40)
print(f"{'Product':<20} {'Sales ($)':>15}")
print("-" * 40)
for product, amount in results["top_products"].items():
print(f"{product:<20} {round(amount, 2):>15}")
print("-" * 40)
# Output report
print("\nSales Report (Details):") # Print header
print(f"Total Records Processed: {total_records}") # Total records
print(f"Valid Sales: {valid_sales}") # Valid count
print(f"Invalid Sales: {total_records - valid_sales}") # Invalid count
print(f"Total Sales: ${round(results['total_sales'], 2)}") # Total sales
print(f"Unique Products: {results['unique_products']}") # Products
print(f"Top Products: {results['top_products']}") # Top products
print("Processing completed") # Confirm completion
if __name__ == "__main__":
main() # Run main functionExpected Outputs
data/sales_report.json:
{
"total_sales": 2499.83,
"unique_products": ["Halal Laptop", "Halal Mouse", "Halal Keyboard"],
"top_products": {
"Halal Laptop": 1999.98,
"Halal Mouse": 249.9,
"Halal Keyboard": 249.95
}
}Console Output (abridged):
Opening config: data/config.yaml
Loaded config: {'min_price': 10.0, 'max_quantity': 100, 'required_fields': ['product', 'price', 'quantity'], 'product_prefix': 'Halal', 'max_decimals': 2}
Connecting to database: data/sales.db
Raw query results: [('Halal Laptop', 999.99, 2), ('Halal Mouse', 24.99, 10), ('Halal Keyboard', 49.99, 5)]
Validating sale: {'product': 'Halal Laptop', 'price': '999.99', 'quantity': '2'}
Validating sale: {'product': 'Halal Mouse', 'price': '24.99', 'quantity': '10'}
Validating sale: {'product': 'Halal Keyboard', 'price': '49.99', 'quantity': '5'}
Valid sales: [{'product': 'Halal Laptop', 'price': 999.99, 'quantity': 2, 'amount': 1999.98}, ...]
Valid sales: 3 records
Writing to: data/sales_report.json
Exported results to data/sales_report.json
Sales Report (Table):
----------------------------------------
Product Sales ($)
----------------------------------------
Halal Laptop 1999.98
Halal Mouse 249.90
Halal Keyboard 249.95
----------------------------------------
Sales Report (Details):
Total Records Processed: 3
Valid Sales: 3
Invalid Sales: 0
Total Sales: $2499.83
Unique Products: ['Halal Laptop', 'Halal Mouse', 'Halal Keyboard']
Top Products: {'Halal Laptop': 1999.98, 'Halal Mouse': 249.9, 'Halal Keyboard': 249.95}
Processing completedHow to Run and Test
Setup Instructions
- Setup Checklist:
- Create
de-onboarding/data/directory. - Save
sales.dbandconfig.yamlper Appendix 1. - Install libraries:
pip install pyyaml. - Verify SQLite version supports LIKE and parameterized queries:
sqlite3 --version(requires SQLite 3.5.0+, typically pre-installed; newer versions like 3.40.0 may offer minor performance improvements). - Create virtual environment:
python -m venv venv, activate (Windows:venv\Scripts\activate, Unix:source venv/bin/activate). - Verify Python 3.10+:
python --version. - Configure editor for 4-space indentation per PEP 8.
- Save
utils.pyandsales_query.pyinde-onboarding/.
- Create
- Troubleshooting:
- If
OperationalError, checksales.dbpath withprint(db_path). - If
ModuleNotFoundError, installpyyamlor checkutils.py. - If
IndentationError, use 4 spaces. Runpython -tt sales_query.py. - If
UnicodeDecodeError, ensure UTF-8 encoding forconfig.yaml. - If
yaml.YAMLError, printprint(open(config_path).read())to inspect YAML.
- If
Execution Steps
- Open terminal in
de-onboarding/. - Run:
python sales_query.py. - Outputs:
data/sales_report.json, console output with ASCII table.
Test Scenarios
- Valid Data: Verify
sales_report.jsonshowstotal_sales: 2499.83, correct top products. Check ASCII table displays correctly. - Empty Table:
conn = sqlite3.connect("data/sales.db") cursor = conn.cursor() cursor.execute("DELETE FROM sales") # Clear table conn.commit() conn.close() main() # Run main # Expected: {'total_sales': 0.0, 'unique_products': [], 'top_products': {}}, 0 - Invalid Data:
conn = sqlite3.connect("data/sales.db") cursor = conn.cursor() cursor.execute("INSERT INTO sales (product, price, quantity) VALUES (?, ?, ?)", ("Non-Halal", 5.00, 150)) conn.commit() conn.close() main() # Run main # Expected: Excludes Non-Halal row
12.7 Practice Exercises
- Optional Peer Review: After completing Exercise 6, exchange answers with a peer to discuss SQL vs. Python efficiency, fostering collaborative learning. Ensure discussions fit within the session.
Exercise 1: SELECT Query
Write a SQL query to select products with quantity > 5 from sales.db.
Expected Output (in sqlite3):
Halal Mouse|24.99|10Follow-Along Instructions:
- Open
sqlite3 data/sales.db. - Run:
SELECT * FROM sales WHERE quantity > 5;. - Verify output matches expected.
- How to Test:
- Check row count:
SELECT COUNT(*) FROM sales WHERE quantity > 5;.
- Check row count:
- Common Errors:
- syntax error: Ensure semicolon at end of query. Run query in
sqlite3to verify. - no rows returned: Check condition with
SELECT quantity FROM sales.
- syntax error: Ensure semicolon at end of query. Run query in
Exercise 2: INSERT Query
Write a SQL query to insert a new sale into sales.db.
Sample Input:
INSERT INTO sales (product, price, quantity) VALUES ('Halal Monitor', 199.99, 3);Expected Output (in sqlite3):
Halal Monitor|199.99|3Follow-Along Instructions:
- Open
sqlite3 data/sales.db. - Run
INSERTandSELECT * FROM sales WHERE product = 'Halal Monitor';. - Verify new row appears.
- How to Test:
- Check insertion:
SELECT COUNT(*) FROM sales;.
- Check insertion:
- Common Errors:
- datatype mismatch: Ensure
priceis REAL (float),quantityis INTEGER. Print.schema sales. - syntax error: Check parentheses and commas in
INSERT.
- datatype mismatch: Ensure
Exercise 3: UPDATE Query
Write a SQL query to update prices < 30 to 30.00 in sales.db.
Expected Output (in sqlite3):
Halal Mouse|30.00|10Follow-Along Instructions:
- Open
sqlite3 data/sales.db. - Run:
UPDATE sales SET price = 30.00 WHERE price < 30;. - Verify with
SELECT * FROM sales WHERE product = 'Halal Mouse';. - How to Test:
- Check updates:
SELECT COUNT(*) FROM sales WHERE price = 30.00;.
- Check updates:
- Common Errors:
- no rows affected: Ensure
WHEREcondition matches rows. RunSELECT * FROM sales. - syntax error: Check
SETsyntax and semicolon.
- no rows affected: Ensure
Exercise 4: GROUP BY Query
Write a SQL query to count sales by product in sales.db.
Expected Output (in sqlite3):
Halal Laptop|1
Halal Mouse|1
Halal Keyboard|1Follow-Along Instructions:
- Open
sqlite3 data/sales.db. - Run:
SELECT product, COUNT(*) AS sale_count FROM sales GROUP BY product;. - Verify output matches expected.
- How to Test:
- Check groups:
SELECT DISTINCT product FROM sales;.
- Check groups:
- Common Errors:
- non-aggregated column: Ensure non-grouped columns are aggregated. Check query syntax.
- syntax error: Verify
GROUP BYcolumn exists with.schema sales.
Exercise 5: Debug a SQL Query Bug
Fix this buggy query that returns incorrect counts.
Buggy Query:
SELECT product, COUNT(price) AS sale_count
FROM sales
GROUP BY product;Expected Output (in sqlite3):
Halal Laptop|1
Halal Mouse|1
Halal Keyboard|1Follow-Along Instructions:
- Open
sqlite3 data/sales.db. - Run buggy query and note results.
- Fix to
SELECT product, COUNT(*) AS sale_count FROM sales GROUP BY product;. - Verify output matches expected.
- How to Test:
- Compare counts with
SELECT product FROM sales;.
- Compare counts with
- Common Errors:
- non-aggregated column: Use
COUNT(*)to count all rows, notCOUNT(price). - syntax error: Ensure
GROUP BYsyntax is correct.
- non-aggregated column: Use
Exercise 6: Conceptual Analysis of SQL vs. Python Filtering
Explain why SQL’s WHERE clause is faster than Python loops for filtering large datasets, referencing B-tree indexing.
Expected Output (written explanation):
SQL’s WHERE clause uses B-tree indexing (O(log n) for lookups with indexes, covered in Chapter 20) compared to Python loops (O(n) for linear scans), making it faster for large datasets like Hijra Group’s transaction data.Follow-Along Instructions:
- Write the explanation in a text file (e.g.,
de-onboarding/ex6_concepts.txt). - Verify it addresses B-tree indexing and performance.
- Hint: Reference the B-tree explanation in 12.1’s “Underlying Implementation” for indexing details.
- How to Test:
- Compare with sample answer in solutions.
- Common Errors:
- Incomplete explanation: Ensure both SQL (B-tree) and Python (linear scan) are compared.
- Incorrect complexity: Verify O(log n) for indexed lookups, O(n) for loops.
Exercise 7: Debug a WHERE Clause Bug
Fix this buggy query that returns no rows due to an incorrect WHERE condition.
Buggy Query:
SELECT * FROM sales WHERE price > 1000;Expected Output (in sqlite3):
Halal Laptop|999.99|2Follow-Along Instructions:
- Open
sqlite3 data/sales.db. - Run buggy query and note no rows are returned.
- Fix to
SELECT * FROM sales WHERE price > 100;. - Verify output matches expected.
- Hint: Check
sales.db’s price range withSELECT MAX(price) FROM salesto diagnose theWHEREcondition. - How to Test:
- Check rows returned:
SELECT COUNT(*) FROM sales WHERE price > 100;.
- Check rows returned:
- Common Errors:
- no rows returned: Verify price range with
SELECT MAX(price) FROM sales. - syntax error: Ensure
WHEREcondition syntax is correct.
- no rows returned: Verify price range with
12.7.1 Exercise Summary
These exercises reinforced SQL querying (SELECT, INSERT, UPDATE, DELETE), filtering (WHERE), aggregation (GROUP BY), debugging, and conceptual analysis of SQL’s efficiency, preparing for Python-SQLite integration in Chapter 13.
12.8 Exercise Solutions
Solution to Exercise 1: SELECT Query
SELECT * FROM sales WHERE quantity > 5;Explanation:
- Filters rows with
quantity > 5, returningHalal Mouse.
Solution to Exercise 2: INSERT Query
INSERT INTO sales (product, price, quantity) VALUES ('Halal Monitor', 199.99, 3);
SELECT * FROM sales WHERE product = 'Halal Monitor';Explanation:
- Inserts new row and verifies with
SELECT.
Solution to Exercise 3: UPDATE Query
UPDATE sales SET price = 30.00 WHERE price < 30;
SELECT * FROM sales WHERE product = 'Halal Mouse';Explanation:
- Updates
priceto 30.00 for rows withprice < 30.
Solution to Exercise 4: GROUP BY Query
SELECT product, COUNT(*) AS sale_count
FROM sales
GROUP BY product;Explanation:
- Groups by
productand counts rows per group.
Solution to Exercise 5: Debug a SQL Query Bug
SELECT product, COUNT(*) AS sale_count
FROM sales
GROUP BY product;Explanation:
- Bug:
COUNT(price)counts non-nullpricevalues, which may exclude nulls. Fixed toCOUNT(*)to count all rows.
Solution to Exercise 6: Conceptual Analysis of SQL vs. Python Filtering
Explanation (save to de-onboarding/ex6_concepts.txt):
SQL’s WHERE clause uses B-tree indexing (O(log n) for lookups with indexes, covered in Chapter 20) compared to Python loops (O(n) for linear scans), making it faster for large datasets like Hijra Group’s transaction data.Solution to Exercise 7: Debug a WHERE Clause Bug
SELECT * FROM sales WHERE price > 100;Explanation:
- Bug:
price > 1000is too restrictive, assales.db’s maximum price is 999.99 (verify withSELECT MAX(price) FROM sales). Fixed toprice > 100to return high-value sales.
12.9 Chapter Summary and Connection to Chapter 13
In this chapter, you’ve mastered:
- SQLite Basics: Setting up databases and tables (O(1) inserts, ~4KB for small datasets).
- SQL Queries:
SELECT,INSERT,UPDATE,DELETE(O(n) scans, O(log n) with indexes). - Filtering:
WHEREfor conditional queries (O(n) scans). - Aggregation:
GROUP BYfor metrics (O(n) grouping). - White-Space Sensitivity and PEP 8: Using 4-space indentation in Python scripts, preferring spaces over tabs.
The micro-project built a SQL tool to query data/sales.db, producing a JSON report with sales metrics, tested for edge cases, and aligned with Hijra Group’s analytics needs. The modular functions (e.g., query_sales) prepare for Python-SQLite integration in Chapter 13, where sqlite3 will automate query execution. For example, the query_sales function will be refactored into a Python class in Chapter 13, enhancing modularity for complex pipelines. Chapter 13’s Python-SQLite integration will enable batch processing of sales data, supporting Hijra Group’s real-time analytics needs. SQLite’s single-user limitation will be addressed by PostgreSQL’s multi-user support in Chapter 16, enabling enterprise-scale analytics. These skills will also support data mart creation in Chapter 32, enabling targeted analytics, BI dashboard creation in Chapter 51, enabling stakeholder reporting, dbt transformations in Chapter 54, enabling data transformation workflows, and will be orchestrated with Airflow in Chapter 56, enabling automated pipeline workflows for Hijra Group’s analytics.
- Reflective Question: How does SQL’s efficiency (e.g., O(log n) with indexes) compare to Pandas’ boolean indexing (e.g.,
df[df["price"] > 100]) from Chapter 3 for large datasets? Consider writing a brief answer inde-onboarding/reflection.txt.
Connection to Chapter 13
Chapter 13 introduces Python and SQLite Integration, building on this chapter:
- SQL Queries: Extends
SELECTto programmatic execution withsqlite3module, usingdata/sales.csvandsales.dbper Appendix 1. - Data Validation: Reuses
utils.pyfor validation, integrating with Python logic. - Automation: Automates query execution, preparing for type-safe programming in Chapter 15.
- Fintech Context: Supports dynamic data loading for real-time analytics, maintaining PEP 8’s 4-space indentation.