11 - Checkpoint 2: Python Code Quality Review
Complexity: Easy (E)
11.0 Introduction: Why This Matters for Data Engineering
In data engineering, high-quality code ensures reliable, maintainable, and scalable pipelines, critical for Hijra Group’s Sharia-compliant fintech analytics processing thousands of daily transactions. Validating Halal products aligns with Islamic Financial Services Board (IFSB) standards, ensuring compliance in financial data pipelines. This checkpoint consolidates skills from Chapters 7–10, focusing on type safety (Pyright), annotations and decorators, testing (unittest, pytest), and code quality tools (black, ruff, pre-commit). These practices reduce errors, enhance collaboration, and align with production standards, preparing for database fundamentals in Phase 3A. Building on Python foundations (Chapters 1–6), this chapter avoids advanced database or concurrency concepts (Chapters 12, 40), emphasizing modular, tested, and type-safe code with 4-space indentation per PEP 8, preferring spaces over tabs to avoid IndentationError.
Data Engineering Workflow Context
flowchart TD
A["Raw Data (CSV)"] --> B["Type-Safe Python (Pandas)"]
B --> C{"Code Quality Checks"}
C -->|Validation| V["Validate (utils.py)"]
V --> C
C -->|Type Checking| D["Pyright"]
C -->|Formatting| E["black"]
C -->|Linting| F["ruff"]
C -->|Testing| G["pytest"]
C -->|Pre-Commit| H["Validate Code"]
H --> I["Processed Data (JSON)"]
classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
classDef tool fill:#ddffdd,stroke:#363,stroke-width:1px
class A,I data
class B,C,H,V process
class D,E,F,G tool

Building On and Preparing For
- Building On:
- Chapter 7: Static typing with Pyright for type-safe functions.
- Chapter 8: Annotations and decorators for logging and modularity.
- Chapter 9: Testing with `unittest` and `pytest` for robust code.
- Chapter 10: Code quality with black, ruff, and pre-commit hooks.
- Preparing For:
- Chapter 12: SQL with SQLite, using type-safe Python for data loading.
- Chapter 13: Python-SQLite integration with tested pipelines.
- Chapter 42: Advanced pipeline testing with type safety.
What You’ll Learn
This chapter reviews:
- Type-Safe Processing: Using Pyright-verified type annotations.
- Decorators: Logging execution details.
- Testing: Unit and integration tests with `pytest`.
- Code Quality: Formatting (black), linting (ruff), and pre-commit hooks.
- Modular Design: Organizing code in modules (`utils.py`, `processor.py`).
The micro-project builds a type-safe, tested sales data processor with a pre-commit pipeline, processing data/sales.csv and config.yaml, producing a JSON report with 4-space indentation per PEP 8.
Follow-Along Tips:
- Create `de-onboarding/data/` with files from Appendix 1 (`sales.csv`, `config.yaml`, `empty.csv`, `invalid.csv`, `malformed.csv`, `negative.csv`).
- Install libraries: `pip install numpy pandas pyyaml pytest black ruff pre-commit pyright`.
- Use 4 spaces (not tabs) per PEP 8. Run `python -tt script.py` to detect tab/space mixing.
- Debug with `print(df.head())` for DataFrames, `print(config)` for YAML.
- Use UTF-8 encoding to avoid `UnicodeDecodeError`.
11.1 Core Concepts
11.1.1 Type-Safe Processing
Type annotations with Pyright ensure type safety, reducing runtime errors. For example, annotating a function to process sales data:
import pandas as pd  # required for the pd.DataFrame annotation
from typing import Dict, Any

def process_sales(df: pd.DataFrame, config: Dict[str, Any]) -> Dict[str, Any]:
    # Process sales with type-safe inputs
    pass

- Implementation: Pyright checks annotations, catching type mismatches (e.g., passing a string instead of a DataFrame); a sketch of such a mismatch follows this list.
- Time Complexity: O(1) for type checking (static analysis).
- Space Complexity: O(1) for type metadata.
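To make the benefit concrete, the sketch below shows the kind of mismatch Pyright reports when a caller passes a CSV path (a string) where a `pd.DataFrame` is expected. The placeholder body and the error wording in the comment are illustrative assumptions, not output copied from Pyright.

```python
# Minimal sketch of a type mismatch Pyright would flag (illustrative only)
from typing import Dict, Any
import pandas as pd

def process_sales(df: pd.DataFrame, config: Dict[str, Any]) -> Dict[str, Any]:
    """Placeholder matching the annotated signature above."""
    return {"total_sales": 0.0}

config: Dict[str, Any] = {"product_prefix": "Halal"}
process_sales("data/sales.csv", config)  # Pyright: str is not assignable to parameter "df" of type DataFrame
```

Running `pyright` on a file like this fails the check before the code ever runs, which is what the pre-commit hook relies on.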
11.1.2 Decorators for Logging
Decorators add logging to functions, improving debugging without altering logic.
from functools import wraps
import logging

def log_execution(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        logging.info(f"Executing {func.__name__}")
        result = func(*args, **kwargs)
        logging.info(f"Completed {func.__name__}")
        return result
    return wrapper

- Implementation: Logs function execution, useful for pipeline monitoring; a usage sketch follows this list.
- Time Complexity: O(1) for logging overhead.
- Space Complexity: O(1) for log entries.
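To see the decorator in action, the short sketch below applies `log_execution` to a hypothetical `compute_total` helper; the function name and sample prices are assumptions added for illustration.

```python
import logging
from functools import wraps

logging.basicConfig(level=logging.INFO, format="%(levelname)s - %(message)s")

def log_execution(func):
    """Log entry and exit of the wrapped function (as defined above)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        logging.info(f"Executing {func.__name__}")
        result = func(*args, **kwargs)
        logging.info(f"Completed {func.__name__}")
        return result
    return wrapper

@log_execution
def compute_total(prices):  # hypothetical helper for illustration
    return sum(prices)

print(compute_total([999.99, 24.99]))  # Logs "Executing compute_total" then "Completed compute_total"
```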
11.1.3 Testing with pytest
pytest ensures code reliability through unit and integration tests, organized in modules (tests/test_processor.py).
# File: tests/test_processor.py
import pytest
import pandas as pd
from processor import process_sales

def test_process_sales():
    df = pd.DataFrame({"product": ["Halal Laptop"], "price": [999.99], "quantity": [2]})
    config = {"min_price": 10.0, "max_quantity": 100, "product_prefix": "Halal"}
    result, _ = process_sales(df, config)
    assert result["total_sales"] == 1999.98

- Implementation: Tests validate outputs, catching logical errors; a parametrized variant is sketched after this list.
- Time Complexity: O(n) for testing n rows.
- Space Complexity: O(n) for test data.
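When several inputs exercise the same logic, `pytest.mark.parametrize` keeps the test compact. The sketch below is an optional variant, assuming `process_sales` returns a `(results, valid_sales)` tuple as in the micro-project; the parameter values are illustrative.

```python
# File: tests/test_totals.py (illustrative parametrized variant)
import pandas as pd
import pytest
from processor import process_sales

@pytest.mark.parametrize(
    "price,quantity,expected_total",
    [(999.99, 2, 1999.98), (24.99, 10, 249.90)],
)
def test_total_sales(price, quantity, expected_total):
    df = pd.DataFrame({"product": ["Halal Item"], "price": [price], "quantity": [quantity]})
    config = {"min_price": 10.0, "max_quantity": 100, "product_prefix": "Halal", "max_decimals": 2}
    result, valid_sales = process_sales(df, config)
    assert result["total_sales"] == pytest.approx(expected_total)
    assert valid_sales == 1
```

Using `pytest.approx` avoids brittle floating-point comparisons.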
11.1.4 Code Quality Tools
- black: Formats code for consistency (O(n) for n lines).
- ruff: Lints for errors and style (O(n) for n lines).
- pre-commit: Runs checks before commits, ensuring quality.
- Pyright: Verifies type annotations (O(1) per check).
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/psf/black
    rev: 24.8.0
    hooks:
      - id: black
  - repo: https://github.com/charliermarsh/ruff
    rev: 0.6.8
    hooks:
      - id: ruff
  - repo: https://github.com/pycqa/pyright
    rev: 1.1.382
    hooks:
      - id: pyright

11.2 Micro-Project: Type-Safe Sales Data Processor
Project Requirements
Build a type-safe, tested sales data processor with a pre-commit pipeline, processing data/sales.csv and config.yaml to produce data/sales_results.json. The processor validates Halal products per IFSB standards, computes metrics (total sales, unique products, top 3 products), and logs execution, using 4-space indentation per PEP 8. It ensures robust handling of edge cases (empty.csv, invalid.csv, malformed.csv, negative.csv) for Hijra Group’s transaction reporting.
Sample Input Files
data/sales.csv (Appendix 1):
product,price,quantity
Halal Laptop,999.99,2
Halal Mouse,24.99,10
Halal Keyboard,49.99,5
,29.99,3
Monitor,invalid,2
Headphones,5.00,150

data/config.yaml (Appendix 1):
min_price: 10.0
max_quantity: 100
required_fields:
- product
- price
- quantity
product_prefix: 'Halal'
max_decimals: 2

Data Processing Flow
flowchart TD
A["sales.csv"] --> B["Load CSV (Pandas)"]
B --> C["DataFrame"]
C --> D["config.yaml"]
D --> E["Validate (utils.py)"]
E -->|Invalid| F["Log Error"]
E -->|Valid| G["Process (Pandas/NumPy)"]
G --> H["Log Execution"]
G --> I["Export JSON"]
I --> J["Test (pytest)"]
J --> K["Pre-Commit (black, ruff)"]
classDef data fill:#f9f9f9,stroke:#333
classDef process fill:#d0e0ff,stroke:#336
classDef error fill:#ffdddd,stroke:#933
classDef tool fill:#ddffdd,stroke:#363
class A,C,D,I data
class B,E,G,H,J,K process
class F error

Acceptance Criteria
- Go Criteria:
  - Loads and validates `sales.csv` and `config.yaml` with type annotations, ensuring Halal products and edge cases (`empty.csv`, `invalid.csv`, `malformed.csv`, `negative.csv`) per IFSB standards.
  - Computes total sales, unique products, top 3 products.
  - Exports to `data/sales_results.json`.
  - Logs execution with decorator.
  - Passes `pytest` tests in `tests/test_processor.py`.
  - Uses pre-commit with black (24.8.0), ruff (0.6.8), Pyright (1.1.382).
- No-Go Criteria:
  - Fails to load files or validate data.
  - Incorrect metrics or missing JSON.
  - Fails tests or pre-commit checks.
  - Uses tabs or inconsistent indentation.
Common Pitfalls
- Type Errors:
  - Problem: Pyright flags incorrect types (e.g., `str` instead of `pd.DataFrame`).
  - Solution: Verify annotations with `print(type(df))`. Run `pyright processor.py`.
- Test Failures:
  - Problem: `pytest` fails due to incorrect logic (e.g., `test_malformed_data` fails if non-integer quantities are present).
  - Solution: Print `result` in tests and `df.dtypes` to verify `quantity` is int. Run `pytest -v tests/test_processor.py`.
- Pre-Commit Errors:
  - Problem: black or ruff fails.
  - Solution: Run `black .` and `ruff check .` manually.
- IndentationError:
  - Problem: Mixed spaces/tabs.
  - Solution: Use 4 spaces. Run `python -tt processor.py`.
- FileNotFoundError:
  - Problem: Missing `sales.csv` or `config.yaml`.
  - Solution: Verify paths with `ls data/`. Print `csv_path`.
Complexity Analysis
- Loading CSV (`pd.read_csv`): O(n) time for n rows, O(n) space (~24MB for 1M rows, 3 columns, numeric types); the sketch after this list shows how to check this.
- Validation (Pandas filtering): O(n) time for n rows, O(k) space for k valid rows.
- Processing (NumPy sum, Pandas groupby): O(n) time for sum and grouping, O(k) space for k groups.
- Exporting JSON: O(n) time for n records, O(n) space for output.
- Testing (pytest): O(n) time per test for n rows, O(n) space for test data.
Implication: Efficient for Hijra Group’s daily transaction datasets (thousands of rows), but production systems may require chunked processing for millions of rows (Chapter 40).
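The ~24MB figure is roughly 3 numeric columns × 8 bytes × 1M rows. A quick way to sanity-check it is shown below; the column names and random data are assumptions, and the exact number varies slightly by platform and index overhead.

```python
import numpy as np
import pandas as pd

# Rough estimate: 3 numeric columns x 8 bytes x 1,000,000 rows ~= 24 MB
rows = 1_000_000
df = pd.DataFrame({
    "price": np.random.rand(rows),                                 # float64, 8 bytes per value
    "quantity": np.random.randint(1, 100, rows, dtype=np.int64),   # int64, 8 bytes per value
    "amount": np.random.rand(rows),                                # float64, 8 bytes per value
})
print(f"{df.memory_usage().sum() / 1e6:.1f} MB")  # approximately 24.0 MB
```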
Production Differences
In production:
- Error Handling: Robust try/except (Chapter 7).
- Logging: File-based with `logging` (Chapter 52).
- Scalability: Chunked processing for large CSVs (Chapter 40), critical for Hijra Group’s pipelines processing millions of transactions daily. For example, processing 10M transactions may require splitting the CSV into 1M-row chunks to avoid memory issues (see the sketch after this list).
- CI/CD: Automated pre-commit in GitHub Actions (Chapter 66).
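As a preview of the chunked approach referenced above (covered in Chapter 40), Pandas can read a CSV in fixed-size pieces so only one chunk is in memory at a time. This is a minimal sketch; `data/large_sales.csv` is a hypothetical file and only a running total is computed.

```python
import pandas as pd

# Minimal chunked-read sketch; "data/large_sales.csv" is hypothetical
total_sales = 0.0
for chunk in pd.read_csv("data/large_sales.csv", chunksize=1_000_000):
    total_sales += float((chunk["price"] * chunk["quantity"]).sum())
print(f"Total sales: {total_sales:.2f}")
```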
Implementation
# File: de-onboarding/utils.py
from typing import Dict, Any, Union
import pandas as pd
import logging
def is_numeric(s: str, max_decimals: int = 2) -> bool:
"""Check if string is a decimal number with up to max_decimals."""
parts = s.split(".")
if len(parts) != 2 or not parts[0].replace("-", "").isdigit() or not parts[1].isdigit():
return False
return len(parts[1]) <= max_decimals
def clean_string(s: Union[str, float]) -> str:
"""Strip whitespace from string."""
return str(s).strip()
def is_numeric_value(x: Any) -> bool:
"""Check if value is numeric."""
return isinstance(x, (int, float))
def is_integer(x: Any) -> bool:
"""Check if value is an integer."""
return isinstance(x, int) or (isinstance(x, str) and x.isdigit())
def validate_sale(sale: Dict[str, str], config: Dict[str, Any]) -> bool:
"""Validate sale based on config rules, ensuring IFSB-compliant Halal products."""
required_fields = config["required_fields"]
min_price = config["min_price"]
max_quantity = config["max_quantity"]
prefix = config["product_prefix"]
max_decimals = config["max_decimals"]
logging.debug(f"Validating sale: {sale}")
for field in required_fields:
if field not in sale or not sale[field] or clean_string(sale[field]) == "":
logging.warning(f"Invalid sale: missing {field}: {sale}")
return False
product = clean_string(sale["product"])
if not product.startswith(prefix):
logging.warning(f"Invalid sale: product lacks '{prefix}' prefix: {sale}")
return False
price = clean_string(sale["price"])
if not is_numeric(price, max_decimals) or float(price) < min_price or float(price) <= 0:
logging.warning(f"Invalid sale: invalid price: {sale}")
return False
quantity = clean_string(sale["quantity"])
if not is_integer(quantity) or int(quantity) > max_quantity:
logging.warning(f"Invalid sale: invalid quantity: {sale}")
return False
return True
# File: de-onboarding/processor.py
from typing import Dict, Any, Tuple
import pandas as pd
import numpy as np
import yaml
import json
import logging
from functools import wraps
import utils
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
def log_execution(func):
"""Decorator to log function execution."""
@wraps(func)
def wrapper(*args, **kwargs):
logging.info(f"Executing {func.__name__}")
result = func(*args, **kwargs)
logging.info(f"Completed {func.__name__}")
return result
return wrapper
@log_execution
def read_config(config_path: str) -> Dict[str, Any]:
"""Read YAML configuration."""
logging.info(f"Opening config: {config_path}")
with open(config_path, "r") as file:
config = yaml.safe_load(file)
logging.debug(f"Loaded config: {config}")
return config
@log_execution
def load_and_validate_sales(csv_path: str, config: Dict[str, Any]) -> Tuple[pd.DataFrame, int, int]:
"""Load and validate sales CSV."""
logging.info(f"Loading CSV: {csv_path}")
df = pd.read_csv(csv_path)
logging.debug(f"Initial DataFrame:\n{df.head()}")
required_fields = config["required_fields"]
missing_fields = [f for f in required_fields if f not in df.columns]
if missing_fields:
logging.error(f"Missing columns: {missing_fields}")
return pd.DataFrame(), 0, len(df)
df = df.dropna(subset=["product"])
df = df[df["product"].str.startswith(config["product_prefix"])]
df = df[df["quantity"].apply(utils.is_integer)]
df["quantity"] = df["quantity"].astype(int)
df = df[df["quantity"] <= config["max_quantity"]]
df = df[df["price"].apply(utils.is_numeric_value)]
df = df[df["price"] > 0]
df = df[df["price"] >= config["min_price"]]
total_records = len(df)
logging.debug(f"Validated DataFrame:\n{df}")
return df, len(df), total_records
@log_execution
def process_sales(df: pd.DataFrame, config: Dict[str, Any]) -> Tuple[Dict[str, Any], int]:
"""Process sales data."""
if df.empty:
logging.warning("No valid sales data")
return {"total_sales": 0.0, "unique_products": [], "top_products": {}}, 0
df["amount"] = df["price"] * df["quantity"]
logging.debug(f"DataFrame with Amount:\n{df}")
total_sales = np.sum(df["amount"].values)
unique_products = df["product"].unique().tolist()
sales_by_product = df.groupby("product")["amount"].sum()
top_products = sales_by_product.sort_values(ascending=False).head(3).to_dict()
valid_sales = len(df)
logging.info(f"Valid sales: {valid_sales} records")
return {
"total_sales": float(total_sales),
"unique_products": unique_products,
"top_products": top_products
}, valid_sales
@log_execution
def export_results(results: Dict[str, Any], json_path: str) -> None:
"""Export results to JSON."""
logging.info(f"Writing to: {json_path}")
with open(json_path, "w") as file:
json.dump(results, file, indent=2)
logging.info(f"Exported results to {json_path}")
def main() -> None:
"""Main function."""
csv_path = "data/sales.csv"
config_path = "data/config.yaml"
json_path = "data/sales_results.json"
config = read_config(config_path)
df, valid_sales, total_records = load_and_validate_sales(csv_path, config)
results, valid_sales = process_sales(df, config)
export_results(results, json_path)
logging.info("\nSales Report:")
logging.info(f"Total Records Processed: {total_records}")
logging.info(f"Valid Sales: {valid_sales}")
logging.info(f"Invalid Sales: {total_records - valid_sales}")
logging.info(f"Total Sales: ${round(results['total_sales'], 2)}")
logging.info(f"Unique Products: {results['unique_products']}")
logging.info(f"Top Products: {results['top_products']}")
if __name__ == "__main__":
main()
# File: tests/test_processor.py
import pytest
import pandas as pd
from processor import process_sales, load_and_validate_sales
@pytest.fixture
def sample_config():
return {
"min_price": 10.0,
"max_quantity": 100,
"required_fields": ["product", "price", "quantity"],
"product_prefix": "Halal",
"max_decimals": 2
}
def test_process_sales(sample_config):
df = pd.DataFrame({
"product": ["Halal Laptop", "Halal Mouse"],
"price": [999.99, 24.99],
"quantity": [2, 10]
})
result, valid_sales = process_sales(df, sample_config)
assert result["total_sales"] == 2249.88
assert valid_sales == 2
assert result["unique_products"] == ["Halal Laptop", "Halal Mouse"]
assert result["top_products"] == {
"Halal Laptop": 1999.98,
"Halal Mouse": 249.9
}
def test_empty_dataframe(sample_config):
df = pd.DataFrame(columns=["product", "price", "quantity"])
result, valid_sales = process_sales(df, sample_config)
assert result["total_sales"] == 0.0
assert valid_sales == 0
assert result["unique_products"] == []
assert result["top_products"] == {}
def test_invalid_columns(sample_config):
df = pd.DataFrame({
"name": ["Halal Laptop"],
"price": [999.99],
"quantity": [2]
})
df, valid_sales, total_records = load_and_validate_sales("data/invalid.csv", sample_config)
assert df.empty
assert valid_sales == 0
assert total_records == 0
def test_malformed_data(sample_config):
df, valid_sales, total_records = load_and_validate_sales("data/malformed.csv", sample_config)
assert len(df) == 1
assert valid_sales == 1
assert total_records == 1
assert df.iloc[0]["product"] == "Halal Mouse"
assert df.iloc[0]["price"] == 24.99
assert df.iloc[0]["quantity"] == 10
def test_negative_prices(sample_config):
df, valid_sales, total_records = load_and_validate_sales("data/negative.csv", sample_config)
assert len(df) == 1
assert valid_sales == 1
assert total_records == 1
assert df.iloc[0]["product"] == "Halal Mouse"
assert df.iloc[0]["price"] == 24.99
    assert df.iloc[0]["quantity"] == 10

Expected Output
data/sales_results.json:
{
  "total_sales": 2499.83,
  "unique_products": ["Halal Laptop", "Halal Mouse", "Halal Keyboard"],
  "top_products": {
    "Halal Laptop": 1999.98,
    "Halal Mouse": 249.9,
    "Halal Keyboard": 249.95
  }
}

Console Output (abridged):
2025-04-24 10:00:00,000 - INFO - Executing read_config
2025-04-24 10:00:00,001 - INFO - Opening config: data/config.yaml
2025-04-24 10:00:00,002 - INFO - Completed read_config
2025-04-24 10:00:00,003 - INFO - Executing load_and_validate_sales
2025-04-24 10:00:00,004 - INFO - Loading CSV: data/sales.csv
2025-04-24 10:00:00,005 - INFO - Valid sales: 3 records
2025-04-24 10:00:00,006 - INFO - Writing to: data/sales_results.json
2025-04-24 10:00:00,007 - INFO - Sales Report:
2025-04-24 10:00:00,008 - INFO - Total Records Processed: 3
2025-04-24 10:00:00,009 - INFO - Valid Sales: 3
2025-04-24 10:00:00,010 - INFO - Total Sales: $2499.83

How to Run and Test
Setup Checklist:
- Create `de-onboarding/data/` with `sales.csv`, `config.yaml`, `empty.csv`, `invalid.csv`, `malformed.csv`, `negative.csv` (Appendix 1).
- Install: `pip install numpy pandas pyyaml pytest black==24.8.0 ruff==0.6.8 pre-commit pyright==1.1.382`.
- Save `utils.py`, `processor.py`, `tests/test_processor.py`.
- Create `.pre-commit-config.yaml` in `de-onboarding/` with the provided content.
- Configure pre-commit: `pre-commit install`.
- Configure editor for 4-space indentation per PEP 8 (VS Code: “Editor: Tab Size” = 4, “Editor: Insert Spaces” = true, “Editor: Detect Indentation” = false).
- Verify files: `ls data/` and check `sales.csv` has 6 rows with `wc -l data/sales.csv` (Unix/macOS) or `findstr /r /n "^" data\sales.csv | find /c ":"` (Windows).
Run:
- Run: `python processor.py`.
- Outputs: `data/sales_results.json`, console logs.
Test:
- Run: `pytest tests/test_processor.py -v`.
- Verify all tests pass.
- Debugging Tests:
  - For all tests, debug with `print(df)`, `print(df.dtypes)`, `print(df.head())` to inspect data types and contents.
  - Debug `FileNotFoundError`: Print paths with `print(csv_path)` in `processor.py`.
- Edge Case Tests:
  - Empty CSV (`empty.csv`):
    config = read_config("data/config.yaml")
    df, valid_sales, total_records = load_and_validate_sales("data/empty.csv", config)
    print(df, valid_sales, total_records)
    # Expected: Empty DataFrame with columns ['product', 'price', 'quantity'], valid_sales=0, total_records=0
  - Invalid Headers (`invalid.csv`):
    config = read_config("data/config.yaml")
    df, valid_sales, total_records = load_and_validate_sales("data/invalid.csv", config)
    print(df, valid_sales, total_records)
    # Expected: Empty DataFrame with no columns, valid_sales=0, total_records=0
  - Malformed Data (`malformed.csv`):
    config = read_config("data/config.yaml")
    df, valid_sales, total_records = load_and_validate_sales("data/malformed.csv", config)
    print(df)
    # Expected: DataFrame with 1 row: product='Halal Mouse', price=24.99, quantity=10
  - Negative Prices (`negative.csv`):
    config = read_config("data/config.yaml")
    df, valid_sales, total_records = load_and_validate_sales("data/negative.csv", config)
    print(df)
    # Expected: DataFrame with 1 row: product='Halal Mouse', price=24.99, quantity=10
11.3 Practice Exercises
Exercise 1: Type-Safe Halal Validation
Write a type-safe function to validate a sale’s Halal product prefix per IFSB standards, using 4-space indentation.
from typing import Dict, Any
import logging

def validate_halal_sale(sale: Dict[str, str], config: Dict[str, Any]) -> bool:
    """Validate sale’s product prefix for IFSB-compliant Halal products."""
    prefix = config["product_prefix"]
    product = sale.get("product", "")
    logging.debug(f"Validating product: {product}")
    if not product.startswith(prefix):
        logging.warning(f"Invalid product: lacks '{prefix}' prefix")
        return False
    return True

Test:
- Save as `de-onboarding/ex1_validate.py`.
- Verify: `validate_halal_sale({"product": "Halal Laptop"}, {"product_prefix": "Halal"})` returns `True`.
- Debug: Print `sale["product"]` if `False`.
- Run: `python ex1_validate.py` (a driver sketch follows below).
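A possible driver for this exercise is sketched below; the `logging.basicConfig` call and the second sample sale are assumptions added for illustration.

```python
# File: de-onboarding/ex1_validate.py (driver sketch; repeats the function above for completeness)
import logging
from typing import Any, Dict

logging.basicConfig(level=logging.DEBUG)

def validate_halal_sale(sale: Dict[str, str], config: Dict[str, Any]) -> bool:
    """Validate sale's product prefix for IFSB-compliant Halal products (as above)."""
    prefix = config["product_prefix"]
    product = sale.get("product", "")
    logging.debug(f"Validating product: {product}")
    if not product.startswith(prefix):
        logging.warning(f"Invalid product: lacks '{prefix}' prefix")
        return False
    return True

if __name__ == "__main__":
    print(validate_halal_sale({"product": "Halal Laptop"}, {"product_prefix": "Halal"}))  # True
    print(validate_halal_sale({"product": "Monitor"}, {"product_prefix": "Halal"}))       # False
```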
Exercise 2: Logging Decorator
Write a decorator to log function inputs, using 4-space indentation.
from functools import wraps
import logging

def log_inputs(func):
    """Log function inputs."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        logging.info(f"Inputs for {func.__name__}: args={args}, kwargs={kwargs}")
        return func(*args, **kwargs)
    return wrapper

Test:
- Save as `de-onboarding/ex2_decorator.py`.
- Apply to a function (e.g., `validate_sale`) and verify logs show inputs (see the usage sketch below).
- Debug: Check `logging.basicConfig` setup with `print(logging.getLogger().level)`.
- Run: `python ex2_decorator.py`.
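One way to verify the decorator, per the steps above: apply `log_inputs` to a small function and confirm the arguments appear in the log. The `add_sale` function below is a hypothetical example.

```python
# File: de-onboarding/ex2_decorator.py (usage sketch)
import logging
from functools import wraps

logging.basicConfig(level=logging.INFO)

def log_inputs(func):
    """Log function inputs (as defined above)."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        logging.info(f"Inputs for {func.__name__}: args={args}, kwargs={kwargs}")
        return func(*args, **kwargs)
    return wrapper

@log_inputs
def add_sale(product: str, price: float, quantity: int) -> float:  # hypothetical example
    return price * quantity

add_sale("Halal Laptop", 999.99, quantity=2)
# Expected log: Inputs for add_sale: args=('Halal Laptop', 999.99), kwargs={'quantity': 2}
```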
Exercise 3: pytest Test Case
Write a pytest test for process_sales, using 4-space indentation.
import pytest
import pandas as pd
from processor import process_sales

def test_process_sales():
    df = pd.DataFrame({"product": ["Halal Mouse"], "price": [24.99], "quantity": [10]})
    config = {"min_price": 10.0, "max_quantity": 100, "product_prefix": "Halal"}
    result, valid_sales = process_sales(df, config)
    assert result["total_sales"] == 249.9

Test:
- Save as `de-onboarding/ex3_test.py`.
- Run: `pytest ex3_test.py`.
- Verify: Test passes.
- Debug: Print `result` if it fails.
Exercise 4: Conceptual Pipeline Integration
Explain why type safety is important for Hijra Group’s data pipelines and, optionally for advanced learners, how the sales processor could integrate with automated scheduling or web endpoints. Save the explanation to ex4_concepts.txt, using 4-space indentation in any code snippets. This prepares for pipeline orchestration in Chapter 56 and API development in Chapter 53, critical for capstone projects in Chapters 68–71.
Template:
- Type Safety: [Why is type safety important for pipelines?]
- (Optional) Scheduling: [How to automate the processor’s execution]
- (Optional) Web Endpoint: [How to expose the processor’s results online]
Example Phrases:
- Type Safety: “Type safety prevents errors by ensuring data types match expected formats…”
- Scheduling: “Automate daily runs to process transactions consistently…”
- Web Endpoint: “Provide a URL to view sales metrics securely…”
Expected Output (ex4_concepts.txt, 50–100 words):
Type Safety: Type safety prevents errors in Hijra Group’s pipelines by ensuring data types match expected formats, reducing runtime issues in transaction processing. It improves code reliability and maintainability, critical for Sharia-compliant analytics.
(Optional) Scheduling: The processor could be automated to run daily, ensuring consistent transaction processing without manual intervention.
(Optional) Web Endpoint: The processor’s results could be exposed through a web service, allowing stakeholders to access sales metrics like total sales via a secure URL.

Test:
- Save as `de-onboarding/ex4_concepts.py` with code to write to `ex4_concepts.txt` (see the sketch below).
- Run: `python ex4_concepts.py`.
- Verify: `ex4_concepts.txt` exists with 50–100 words (check with `wc -w ex4_concepts.txt` on Unix/macOS).
- Debug: Print file contents with `print(open("ex4_concepts.txt").read())`.
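A possible `ex4_concepts.py` is sketched below; the explanation text is only a placeholder and should be replaced with your own answer.

```python
# File: de-onboarding/ex4_concepts.py (sketch; replace the text with your own explanation)
explanation = (
    "Type Safety: Type safety prevents errors in Hijra Group's pipelines by ensuring "
    "data types match expected formats, reducing runtime issues in transaction processing. "
    "It improves code reliability and maintainability, critical for Sharia-compliant analytics.\n"
)

with open("ex4_concepts.txt", "w", encoding="utf-8") as file:
    file.write(explanation)

print(open("ex4_concepts.txt").read())  # verify contents
```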
Exercise 5: Debug a Buggy Processor
Fix this buggy process_sales function that groups by the wrong column, causing incorrect totals, using 4-space indentation. The input contains only Halal products.
Buggy Code:
import pandas as pd
from typing import Dict, Any, Tuple

def process_sales(df: pd.DataFrame, config: Dict[str, Any]) -> Tuple[Dict[str, Any], int]:
    """Process sales data."""
    df["amount"] = df["price"] * df["quantity"]
    total_sales = df["amount"].sum()
    unique_products = df["product"].unique().tolist()
    sales_by_product = df.groupby("price")["amount"].sum()  # Bug: Wrong column
    top_products = sales_by_product.sort_values(ascending=False).head(3).to_dict()
    valid_sales = len(df)
    return {
        "total_sales": float(total_sales),
        "unique_products": unique_products,
        "top_products": top_products
    }, valid_sales

Sample Input (data/sample.csv):
product,price,quantity
Halal Laptop,999.99,2
Halal Mouse,24.99,10

Expected Output:
{
    'total_sales': 2249.88,
    'unique_products': ['Halal Laptop', 'Halal Mouse'],
    'top_products': {'Halal Laptop': 1999.98, 'Halal Mouse': 249.9}
}, 2

Test:
- Save as `de-onboarding/ex5_debug.py`.
- Run buggy code: `python ex5_debug.py` to see incorrect output.
- Fix by changing `groupby("price")` to `groupby("product")`.
- Re-run and verify output matches expected.
- Debug: Print `df.groupby("price").size()` vs. `df.groupby("product").size()` to inspect grouping. Use `print(df.head())` to inspect data, `print(df.dtypes)` to check column types.
11.4 Chapter Summary and Connection to Chapter 12
This chapter solidified type safety, decorators, testing, and code quality, producing a robust sales processor with pre-commit checks, validated against IFSB-compliant Halal products. The modular design (utils.py, processor.py), 4-space indentation per PEP 8, and comprehensive edge case testing (empty.csv, malformed.csv, negative.csv) ensure reliable pipelines. The complexity analysis (O(n) for processing, ~24MB for 1M rows) prepares for scalability discussions in Chapter 40. Chapter 12: SQL Fundamentals with SQLite builds on this by introducing SQL queries, using type-safe Python to load data from sales.db, maintaining code quality for Hijra Group’s transaction analytics.