60 - Docker for Data Applications
Complexity: Moderate (M)
60.0 Introduction: Why This Matters for Data Engineering
Docker is a cornerstone of modern data engineering, enabling consistent, portable, and scalable deployment of data pipelines. For Hijra Group’s Sharia-compliant fintech analytics, Docker containers ensure that Python-based data applications, such as sales processing pipelines, run identically across development, testing, and production environments. By containerizing dependencies like pandas, psycopg2, and PyYAML, Docker eliminates “it works on my machine” issues, reducing deployment errors by up to 70% in complex systems. This chapter builds on prior knowledge of Python (Phase 1), Pandas (Chapters 3, 39), PostgreSQL (Chapters 16–17, 47–48), and type-safe programming (Chapters 7, 15, 23, 41) to package a sales data pipeline into a Docker container, preparing for Kubernetes deployments in Chapters 61–64.
Docker containers are lightweight, using ~100MB for a Python app with pandas and psycopg2, compared to ~1GB for virtual machines, with startup times under 1 second. This chapter introduces Docker concepts, containerization of Python applications, and Docker Compose for multi-container setups, all with type annotations verified by Pyright and tests using pytest, adhering to the curriculum’s quality standards. Code uses 4-space indentation per PEP 8, preferring spaces over tabs to avoid IndentationError, ensuring compatibility with Hijra Group’s pipeline scripts.
Data Engineering Workflow Context
This diagram illustrates Docker’s role in a data engineering pipeline, highlighting container isolation:
flowchart TD
A["Raw Data (CSV)"] --> B["Python App (Pandas, psycopg2)"]
subgraph Docker Environment
B --> C["Docker Container (App)"]
C --> D{"Data Processing"}
D -->|Load/Validate| E["PostgreSQL Container"]
D -->|Analyze| F["Aggregated Metrics"]
end
E --> G["Output (JSON)"]
F --> G
G --> H["Storage/Analysis"]
classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
classDef container fill:#ffd700,stroke:#336,stroke-width:1px
classDef storage fill:#ddffdd,stroke:#363,stroke-width:1px
class A,F,G data
class B,D process
class C,E container
class H storage
Building On and Preparing For
- Building On:
- Phase 1 (Chapters 1–6): Python basics for scripting data processing.
- Chapter 3, 39: Pandas for data manipulation in the sales pipeline.
- Chapters 16–17, 47–48: PostgreSQL integration for storing sales data.
- Chapters 7, 15, 23, 41: Type annotations for robust, type-safe code.
- Chapter 59: Airflow orchestration, preparing for containerized workflows.
- Preparing For:
- Chapter 61: Kubernetes fundamentals for orchestrating containers.
- Chapter 62: Deploying stateful data applications in Kubernetes.
- Chapter 64: Running Airflow in Kubernetes with Helm Charts.
- Chapters 67–70: Capstone projects integrating Dockerized pipelines.
What You’ll Learn
This chapter covers:
- Docker Basics: Images, containers, and Dockerfiles for packaging applications.
- Containerizing Python Apps: Building a type-annotated sales pipeline with pandas and psycopg2.
- Docker Compose: Orchestrating a Python app and PostgreSQL database.
- Type-Safe Integration: Ensuring type safety with Pyright in containers.
- Testing: Validating containerized apps with pytest.
- Performance: Analyzing container resource usage and overhead.
By the end, you’ll containerize a type-annotated sales data pipeline using data/sales.csv and config.yaml, connect it to a PostgreSQL container, and test it with pytest, producing a JSON report. The micro-project tests edge cases (empty.csv, invalid.csv, malformed.csv, negative.csv) per Appendix 1, ensuring robustness, all with 4-space indentation per PEP 8.
Follow-Along Tips:
- Create de-onboarding/data/ and populate with files from Appendix 1 (sales.csv, config.yaml, empty.csv, invalid.csv, malformed.csv, negative.csv).
- Install Docker Desktop: Verify with docker --version.
- Install libraries: pip install pandas psycopg2-binary pyyaml pytest.
- Configure editor for 4-space indentation per PEP 8 (VS Code: “Editor: Tab Size” = 4, “Editor: Insert Spaces” = true, “Editor: Detect Indentation” = false).
- Use print statements (e.g., print(df.head())) to debug DataFrames.
- Verify file paths with ls data/ (Unix/macOS) or dir data\ (Windows); see the Python sketch after this list.
- Use UTF-8 encoding for all files to avoid UnicodeDecodeError.
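Before building any images, it helps to confirm the Appendix 1 files are actually in place. A minimal, hypothetical check script (the filename check_data.py and the data/ path are assumptions, not part of the chapter’s required files):
# File: de-onboarding/check_data.py (hypothetical helper)
from pathlib import Path

EXPECTED_FILES = [
    "sales.csv", "config.yaml", "empty.csv",
    "invalid.csv", "malformed.csv", "negative.csv",
]

def check_data_files(data_dir: str = "data") -> None:
    """Print OK/MISSING for each expected Appendix 1 file."""
    base = Path(data_dir)
    for name in EXPECTED_FILES:
        status = "OK" if (base / name).exists() else "MISSING"
        print(f"{status}: {base / name}")

if __name__ == "__main__":
    check_data_files()
Run it from de-onboarding/ with python check_data.py before building the image.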
60.1 Docker Basics
Docker packages applications into containers, isolated environments with code, dependencies, and configurations. Containers are built from images, defined by Dockerfiles, and run with minimal overhead (~100MB for a Python app with pandas). Images are layered, caching build steps for fast rebuilds, with O(1) layer access. Containers share the host OS kernel, unlike VMs, reducing resource usage.
60.1.1 Dockerfile for Python Apps
A Dockerfile defines an image. Version pinning in requirements.txt (e.g., pandas==2.2.2) ensures consistent builds across environments, a key benefit of Docker. Below is a Dockerfile for a Python app.
# File: de-onboarding/Dockerfile
# Base image with Python 3.10
FROM python:3.10-slim
# Set working directory
WORKDIR /app
# Copy requirements
COPY requirements.txt .
# Install dependencies
RUN pip install --no-cache-dir -r requirements.txt
# Copy application code
COPY . .
# Run the app
CMD ["python", "sales_processor.py"]
Follow-Along Instructions:
- Create de-onboarding/requirements.txt:
  pandas==2.2.2
  psycopg2-binary==2.9.9
  pyyaml==6.0.1
  pytest==8.3.2
- Save the Dockerfile in de-onboarding/.
- Configure editor for 4-space indentation (not tabs) for Python files.
- Build the image: docker build -t sales-processor .
- Verify image: docker images | grep sales-processor.
- Common Errors:
  - Docker Not Running: Start Docker Desktop or run sudo systemctl start docker (Linux).
  - Build Failure: Check requirements.txt for typos. Print Dockerfile steps with docker build --progress=plain.
  - Permission Denied: Ensure write permissions in de-onboarding/ with ls -l.
Key Points:
- FROM: Specifies the base image (python:3.10-slim is ~150MB, optimized for size).
- WORKDIR: Sets the container’s working directory.
- COPY: Transfers files into the image.
- RUN: Executes commands during build (e.g., pip install).
- CMD: Defines the default command to run the container.
- Underlying Implementation: Images are immutable layers stored in a union filesystem (e.g., OverlayFS), with O(1) layer access. Containers are writable instances of images, using ~10MB overhead for isolation.
- Performance Considerations:
- Time Complexity: O(n) for building n layers, O(1) for running containers.
- Space Complexity: O(m) for m layers (~100MB for a Python app with pandas).
- Implication: Docker ensures consistent environments for Hijra Group’s pipelines.
60.1.2 Running Containers
Run the image as a container and access outputs.
# Run container
docker run --rm -v $(pwd)/data:/app/data sales-processor
# Expected Output (depends on sales_processor.py):
# Sales Report:
# Total Records Processed: 3
# Valid Sales: 3
# ...
Follow-Along Instructions:
- Ensure data/sales.csv and config.yaml exist in de-onboarding/data/.
- Run the command above.
- Verify data/sales_results.json is created.
- Common Errors:
  - Volume Not Found: Ensure data/ exists. Check with ls data/.
  - Container Exits: Inspect logs with docker logs <container_id>.
Key Points:
- --rm: Removes container after exit.
- -v: Mounts host directory (data/) to container (/app/data).
- Time Complexity: O(1) for starting containers.
- Space Complexity: O(1) for container overhead (~10MB).
- Implication: Containers enable portable pipeline execution.
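After a container run completes, the mounted output can be sanity-checked from the host. A minimal sketch (a hypothetical helper, not one of the chapter’s files), assuming the report was written to data/sales_results.json:
# File: de-onboarding/check_report.py (hypothetical helper)
import json
from pathlib import Path

def check_report(json_path: str = "data/sales_results.json") -> None:
    """Print the report's keys and total sales, or a warning if the file is missing."""
    path = Path(json_path)
    if not path.exists():
        print(f"Missing report: {path}")
        return
    report = json.loads(path.read_text(encoding="utf-8"))
    print("Report keys:", sorted(report.keys()))
    print("Total sales:", report.get("total_sales"))

if __name__ == "__main__":
    check_report()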
60.2 Containerizing Python Apps
Containerize a type-annotated sales pipeline with pandas and psycopg2, connecting to a PostgreSQL database.
60.2.1 Sales Pipeline Code
Below is the type-annotated sales processor, adapted from Chapter 59, integrating with PostgreSQL.
# File: de-onboarding/utils.py
from typing import Any, Dict, Union
import yaml
def is_numeric(s: str, max_decimals: int = 2) -> bool:
"""Check if string is a decimal number with up to max_decimals."""
parts = s.split(".")
if len(parts) != 2 or not parts[0].replace("-", "").isdigit() or not parts[1].isdigit():
return False
return len(parts[1]) <= max_decimals
def clean_string(s: Union[str, Any]) -> str:
"""Strip whitespace from string."""
return str(s).strip()
def is_numeric_value(x: Any) -> bool:
"""Check if value is numeric."""
return isinstance(x, (int, float))
def has_valid_decimals(x: Any, max_decimals: int) -> bool:
"""Check if value has valid decimal places."""
return is_numeric(str(x), max_decimals)
def apply_valid_decimals(x: Any, max_decimals: int) -> bool:
"""Apply has_valid_decimals to a value."""
return has_valid_decimals(x, max_decimals)
def is_integer(x: Any) -> bool:
"""Check if value is an integer."""
return isinstance(x, int) or (isinstance(x, str) and x.isdigit())
def validate_sale(sale: Dict[str, Any], config: Dict[str, Any]) -> bool:
"""Validate sale based on config rules."""
required_fields = config["required_fields"]
min_price = config["min_price"]
max_quantity = config["max_quantity"]
prefix = config["product_prefix"]
max_decimals = config["max_decimals"]
print(f"Validating sale: {sale}")
for field in required_fields:
if field not in sale or not sale[field] or clean_string(sale[field]) == "":
print(f"Invalid sale: missing {field}: {sale}")
return False
product = clean_string(sale["product"])
if not product.startswith(prefix):
print(f"Invalid sale: product lacks '{prefix}' prefix: {sale}")
return False
price = clean_string(sale["price"])
if not is_numeric(price, max_decimals) or float(price) < min_price or float(price) <= 0:
print(f"Invalid sale: invalid price: {sale}")
return False
quantity = clean_string(sale["quantity"])
if not is_integer(quantity) or int(quantity) > max_quantity:
print(f"Invalid sale: invalid quantity: {sale}")
return False
return True
def read_config(config_path: str) -> Dict[str, Any]:
"""Read YAML configuration."""
print(f"Opening config: {config_path}")
with open(config_path, "r") as file:
config = yaml.safe_load(file)
print(f"Loaded config: {config}")
    return config

# File: de-onboarding/sales_processor.py
from typing import Dict, Tuple, Any
import pandas as pd
import psycopg2
import argparse
import json
import os
from utils import read_config, is_integer, is_numeric_value, apply_valid_decimals
def load_and_validate_sales(csv_path: str, config: Dict[str, Any]) -> Tuple[pd.DataFrame, int, int]:
"""Load sales CSV and validate using Pandas."""
print(f"Loading CSV: {csv_path}")
try:
df = pd.read_csv(csv_path)
except FileNotFoundError:
print(f"CSV not found: {csv_path}")
return pd.DataFrame(), 0, 0
print("Initial DataFrame:")
print(df.head())
required_fields = config["required_fields"]
missing_fields = [f for f in required_fields if f not in df.columns]
if missing_fields:
print(f"Missing columns: {missing_fields}")
return pd.DataFrame(), 0, len(df)
df = df.dropna(subset=["product"])
df = df[df["product"].str.startswith(config["product_prefix"])]
df = df[df["quantity"].apply(is_integer)]
df["quantity"] = df["quantity"].astype(int)
df = df[df["quantity"] <= config["max_quantity"]]
df = df[df["price"].apply(is_numeric_value)]
df = df[df["price"] > 0]
df = df[df["price"] >= config["min_price"]]
df = df[df["price"].apply(lambda x: apply_valid_decimals(x, config["max_decimals"]))]
total_records = len(df)
print("Validated DataFrame:")
print(df)
return df, len(df), total_records
def save_to_postgres(df: pd.DataFrame, conn_params: Dict[str, str]) -> None:
"""Save DataFrame to PostgreSQL. Uses row-by-row inserts for simplicity; bulk inserts are covered in Chapter 47."""
if df.empty:
print("No data to save to PostgreSQL")
return
print("Saving to PostgreSQL")
try:
conn = psycopg2.connect(**conn_params)
cursor = conn.cursor()
cursor.execute("""
CREATE TABLE IF NOT EXISTS sales (
product TEXT,
price REAL,
quantity INTEGER,
amount REAL
)
""")
for _, row in df.iterrows():
cursor.execute(
"INSERT INTO sales (product, price, quantity, amount) VALUES (%s, %s, %s, %s)",
(row["product"], row["price"], row["quantity"], row["amount"])
)
conn.commit()
cursor.close()
conn.close()
print("Data saved to PostgreSQL")
except psycopg2.Error as e:
print(f"Database error: {e}")
raise
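# Illustrative alternative (not part of the micro-project): for larger DataFrames,
# psycopg2.extras.execute_values batches the INSERT into far fewer round trips
# than the row-by-row loop above. A minimal sketch, assuming the same sales table:
#
#     from psycopg2.extras import execute_values
#     rows = list(df[["product", "price", "quantity", "amount"]].itertuples(index=False, name=None))
#     execute_values(
#         cursor,
#         "INSERT INTO sales (product, price, quantity, amount) VALUES %s",
#         rows,
#     )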
def process_sales(df: pd.DataFrame, config: Dict[str, Any]) -> Tuple[Dict[str, Any], int]:
"""Process sales: compute total and top products."""
if df.empty:
print("No valid sales data")
return {"total_sales": 0.0, "unique_products": [], "top_products": {}}, 0
df["amount"] = df["price"] * df["quantity"]
print("DataFrame with Amount:")
print(df)
total_sales = df["amount"].sum()
unique_products = df["product"].unique().tolist()
sales_by_product = df.groupby("product")["amount"].sum()
top_products = sales_by_product.sort_values(ascending=False).head(3).to_dict()
valid_sales = len(df)
print(f"Valid sales: {valid_sales} records")
return {
"total_sales": float(total_sales),
"unique_products": unique_products,
"top_products": top_products
}, valid_sales
def export_results(results: Dict[str, Any], json_path: str) -> None:
"""Export results to JSON."""
print(f"Writing to: {json_path}")
with open(json_path, "w") as file:
json.dump(results, file, indent=2)
print(f"Exported results to {json_path}")
def main() -> None:
"""Main function to process sales data."""
    parser = argparse.ArgumentParser(description="Process sales data")
    parser.add_argument("--csv", default="data/sales.csv", help="Path to the sales CSV")
    args = parser.parse_args()
    csv_path = args.csv  # Defaults to data/sales.csv; overridable for the edge-case test scenarios
config_path = "data/config.yaml"
json_path = "data/sales_results.json"
config = read_config(config_path)
df, valid_sales, total_records = load_and_validate_sales(csv_path, config)
conn_params = {
"dbname": os.getenv("POSTGRES_DB", "postgres"),
"user": os.getenv("POSTGRES_USER", "postgres"),
"password": os.getenv("POSTGRES_PASSWORD", "postgres"),
"host": os.getenv("POSTGRES_HOST", "postgres"),
"port": os.getenv("POSTGRES_PORT", "5432")
}
results, valid_sales = process_sales(df, config)
if not df.empty:
save_to_postgres(df, conn_params)
export_results(results, json_path)
print("\nSales Report:")
print(f"Total Records Processed: {total_records}")
print(f"Valid Sales: {valid_sales}")
print(f"Invalid Sales: {total_records - valid_sales}")
print(f"Total Sales: ${round(results['total_sales'], 2)}")
print(f"Unique Products: {results['unique_products']}")
print(f"Top Products: {results['top_products']}")
print("Processing completed")
if __name__ == "__main__":
    main()
60.2.2 Testing the Pipeline
Test the pipeline with pytest to ensure type safety and functionality.
# File: de-onboarding/tests/test_sales_processor.py
from typing import Dict, Any
import pytest
import pandas as pd
from unittest.mock import patch
from sales_processor import load_and_validate_sales, process_sales, save_to_postgres
from utils import read_config
@pytest.fixture
def config() -> Dict[str, Any]:
"""Fixture for config."""
return read_config("data/config.yaml")
def test_load_and_validate_sales(config: Dict[str, Any]) -> None:
"""Test loading and validating sales CSV."""
df, valid_sales, total_records = load_and_validate_sales("data/sales.csv", config)
assert valid_sales == 3
assert total_records == 3
assert len(df) == 3
assert all(df["product"].str.startswith("Halal"))
def test_empty_csv(config: Dict[str, Any]) -> None:
"""Test empty CSV."""
df, valid_sales, total_records = load_and_validate_sales("data/empty.csv", config)
assert df.empty
assert valid_sales == 0
assert total_records == 0
def test_invalid_csv(config: Dict[str, Any]) -> None:
"""Test invalid CSV."""
df, valid_sales, total_records = load_and_validate_sales("data/invalid.csv", config)
assert df.empty
assert valid_sales == 0
assert total_records == 2
def test_process_sales(config: Dict[str, Any]) -> None:
"""Test processing sales."""
df = pd.DataFrame({
"product": ["Halal Laptop", "Halal Mouse"],
"price": [999.99, 24.99],
"quantity": [2, 10]
})
results, valid_sales = process_sales(df, config)
assert valid_sales == 2
assert results["total_sales"] == 2249.88
assert len(results["unique_products"]) == 2
def test_save_to_postgres() -> None:
    """Test saving to PostgreSQL with a mocked psycopg2 connection (no live database needed)."""
    df = pd.DataFrame({
        "product": ["Halal Laptop"],
        "price": [999.99],
        "quantity": [2],
        "amount": [1999.98]
    })
    conn_params = {"dbname": "postgres", "user": "postgres", "password": "postgres", "host": "postgres", "port": "5432"}
    with patch("sales_processor.psycopg2.connect") as mock_connect:
        mock_cursor = mock_connect.return_value.cursor.return_value
        save_to_postgres(df, conn_params)
        mock_connect.assert_called_once_with(**conn_params)
        insert_calls = [c for c in mock_cursor.execute.call_args_list if "INSERT INTO sales" in c.args[0]]
        assert len(insert_calls) == 1
        assert insert_calls[0].args[1] == ("Halal Laptop", 999.99, 2, 1999.98)
Follow-Along Instructions:
- Save the test file in de-onboarding/tests/.
- Run tests: pytest tests/test_sales_processor.py -v
- Verify all tests pass.
- Common Errors:
  - ModuleNotFoundError: Ensure sales_processor.py and utils.py are in de-onboarding/. Run ls to check; a conftest.py sketch that handles this follows the Key Points below.
  - Test Failure: Print df.head() in tests to debug DataFrames.
Key Points:
- Tests verify data loading, validation, processing, and database saving.
- Time Complexity: O(n) for testing n rows.
- Space Complexity: O(n) for DataFrames in tests.
- Implication: Ensures pipeline reliability in containers.
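If pytest is invoked from a directory other than de-onboarding/, importing sales_processor and utils can raise ModuleNotFoundError. A minimal conftest.py sketch (a hypothetical addition, assuming the tests/ layout above) that puts the project root on sys.path:
# File: de-onboarding/tests/conftest.py (hypothetical helper)
import sys
from pathlib import Path

# Add de-onboarding/ to sys.path so tests can import sales_processor and utils
# regardless of pytest's working directory.
PROJECT_ROOT = Path(__file__).resolve().parent.parent
if str(PROJECT_ROOT) not in sys.path:
    sys.path.insert(0, str(PROJECT_ROOT))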
60.3 Docker Compose for Multi-Container Apps
Docker Compose orchestrates multiple containers (e.g., the Python app and PostgreSQL) using a YAML file. This setup declares Compose file format version 3.9, which supports all features used here. Note that PostgreSQL’s port (5432) is not exposed externally for simplicity, as the app communicates internally via the postgres service name. For external debugging, you can add ports: - "5432:5432" to the postgres service, but this is unnecessary for the micro-project.
# File: de-onboarding/docker-compose.yml
version: '3.9'
services:
app:
build: .
volumes:
- ./data:/app/data
depends_on:
- postgres
environment:
- POSTGRES_HOST=postgres
- POSTGRES_DB=postgres
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=postgres
- POSTGRES_PORT=5432
postgres:
image: postgres:13
environment:
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=postgres
- POSTGRES_DB=postgres
volumes:
- postgres_data:/var/lib/postgresql/data
volumes:
  postgres_data:
Follow-Along Instructions:
- Save the Docker Compose file in de-onboarding/.
- Run: docker-compose up --build
- Verify data/sales_results.json and database data: docker exec -it $(docker ps -q -f name=postgres) psql -U postgres -d postgres -c "SELECT * FROM sales;"
- Stop: docker-compose down.
- Common Errors:
  - Connection Refused: Ensure the postgres service starts first. Check logs with docker-compose logs postgres.
  - Volume Issues: Verify data/ permissions with ls -l data/.
Key Points:
- services: Defines containers (app, postgres).
- volumes: Persists PostgreSQL data.
- depends_on: Ensures the postgres container starts before the app, though it does not wait for PostgreSQL to accept connections (see the retry sketch below).
- Time Complexity: O(1) for starting containers.
- Space Complexity: O(1) for Compose overhead.
- Implication: Simplifies multi-container setups for Hijra Group’s pipelines.
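Because depends_on only orders container startup, the app can occasionally reach PostgreSQL before it accepts connections. A minimal retry sketch (the helper name, attempt count, and delay are illustrative assumptions, not part of the chapter’s code) that could wrap the psycopg2.connect call in save_to_postgres:
# Hypothetical helper: retry psycopg2.connect until PostgreSQL is ready.
import time
from typing import Dict

import psycopg2

def connect_with_retry(conn_params: Dict[str, str], attempts: int = 10, delay: float = 1.0):
    """Try to connect up to `attempts` times, sleeping `delay` seconds between tries."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return psycopg2.connect(**conn_params)
        except psycopg2.OperationalError as e:
            last_error = e
            print(f"PostgreSQL not ready (attempt {attempt}/{attempts}): {e}")
            time.sleep(delay)
    raise last_error  # Give up after the final attempt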
60.4 Micro-Project: Containerized Sales Data Pipeline
Project Requirements
Build a Dockerized sales data pipeline that processes data/sales.csv, stores results in PostgreSQL, and exports to data/sales_results.json. The pipeline supports Hijra Group’s transaction reporting, ensuring data integrity and scalability:
- Use sales_processor.py and utils.py above.
- Create a Dockerfile and docker-compose.yml.
- Validate data using type-annotated Pandas and config rules.
- Store valid sales in PostgreSQL.
- Export metrics to data/sales_results.json.
- Test with pytest for edge cases (empty.csv, invalid.csv, malformed.csv, negative.csv).
- Use 4-space indentation per PEP 8, preferring spaces over tabs.
- Log steps and errors using print statements.
Sample Input Files
data/sales.csv (from Appendix 1):
product,price,quantity
Halal Laptop,999.99,2
Halal Mouse,24.99,10
Halal Keyboard,49.99,5
,29.99,3
Monitor,invalid,2
Headphones,5.00,150
data/config.yaml (from Appendix 1):
min_price: 10.0
max_quantity: 100
required_fields:
- product
- price
- quantity
product_prefix: 'Halal'
max_decimals: 2
Data Processing Flow
flowchart TD
A["Input CSV
sales.csv"] --> B["Load CSV
Pandas"]
B --> C["Pandas DataFrame"]
C --> D["Read YAML
config.yaml"]
D --> E["Validate DataFrame
Pandas/utils.py"]
E -->|Invalid| F["Log Warning"]
E -->|Valid| G["Compute Metrics
Pandas"]
G --> H["Save to PostgreSQL"]
G --> I["Export JSON
sales_results.json"]
F --> J["End Processing"]
H --> J
I --> J
classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
classDef error fill:#ffdddd,stroke:#933,stroke-width:1px
classDef endpoint fill:#ddffdd,stroke:#363,stroke-width:1px
class A,C data
class B,D,E,G,H,I process
class F error
class J endpoint
Acceptance Criteria
- Go Criteria:
- Builds and runs Docker containers without errors.
- Loads sales.csv and config.yaml correctly.
- Validates records for Halal prefix, numeric price/quantity, and config rules.
- Stores valid sales in PostgreSQL.
- Exports results to data/sales_results.json.
- Passes pytest tests for edge cases.
- Uses 4-space indentation per PEP 8.
- No-Go Criteria:
- Fails to build or run containers.
- Incorrect validation or database storage.
- Missing JSON export.
- Fails edge case tests.
- Inconsistent indentation or tab/space mixing.
Common Pitfalls to Avoid
- Docker Build Failure:
  - Problem: Missing dependencies in requirements.txt.
  - Solution: Run docker logs <container_id> to inspect errors. Verify pandas, psycopg2-binary, pyyaml, and pytest are in requirements.txt. Print build logs with docker build --progress=plain -t sales-processor .
- Database Connection Errors:
  - Problem: App cannot connect to PostgreSQL.
  - Solution: Run docker logs <app_container_id> to check errors. Ensure depends_on in docker-compose.yml. Inspect network settings with docker inspect <container_id>. Check PostgreSQL logs with docker-compose logs postgres.
- Type Mismatches:
  - Problem: Non-numeric prices cause errors.
  - Solution: Run docker logs <app_container_id> to identify issues. Validate with is_numeric_value. Print df.dtypes to debug.
- Volume Permissions:
  - Problem: Cannot write to data/.
  - Solution: Run docker logs <app_container_id> to check errors. Verify permissions with ls -l data/. Fix with chmod -R 777 data/. Inspect container files with docker exec -it <app_container_id> bash.
- IndentationError:
  - Problem: Mixed spaces/tabs in Python code.
  - Solution: Run docker logs <app_container_id> to identify syntax errors. Use 4 spaces per PEP 8. Run python -tt sales_processor.py to detect issues.
How This Differs from Production
In production, this solution would include:
- Security: Encrypted connections and secrets management (Chapter 65).
- Observability: Logging and monitoring (Chapter 66).
- Scalability: Kubernetes for orchestration (Chapter 61).
- CI/CD: Automated builds and deployments (Chapter 66).
- Resource Optimization: CPU/memory limits in docker-compose.yml.
Implementation
The implementation is provided in the Dockerfile, docker-compose.yml, sales_processor.py, utils.py, and test_sales_processor.py above.
Expected Outputs
data/sales_results.json:
{
"total_sales": 2499.83,
"unique_products": ["Halal Laptop", "Halal Mouse", "Halal Keyboard"],
"top_products": {
"Halal Laptop": 1999.98,
"Halal Mouse": 249.9,
"Halal Keyboard": 249.95
}
}
PostgreSQL Data:
SELECT * FROM sales;
-- Expected:
-- product | price | quantity | amount
-- Halal Laptop | 999.99 | 2 | 1999.98
-- Halal Mouse | 24.99 | 10 | 249.90
-- Halal Keyboard| 49.99 | 5 | 249.95
Console Output (abridged):
Opening config: data/config.yaml
Loaded config: {'min_price': 10.0, 'max_quantity': 100, ...}
Loading CSV: data/sales.csv
Initial DataFrame:
product price quantity
0 Halal Laptop 999.99 2
1 Halal Mouse 24.99 10
2 Halal Keyboard 49.99 5
...
Validated DataFrame:
product price quantity
0 Halal Laptop 999.99 2
1 Halal Mouse 24.99 10
2 Halal Keyboard 49.99 5
Saving to PostgreSQL
Data saved to PostgreSQL
Exported results to data/sales_results.json
Sales Report:
Total Records Processed: 3
Valid Sales: 3
Invalid Sales: 0
Total Sales: $2499.83
Unique Products: ['Halal Laptop', 'Halal Mouse', 'Halal Keyboard']
Top Products: {'Halal Laptop': 1999.98, 'Halal Mouse': 249.9, 'Halal Keyboard': 249.95}
Processing completed
How to Run and Test
Setup:
- Setup Checklist:
  - Create de-onboarding/data/ and populate with files per Appendix 1.
  - Install Docker Desktop: Verify with docker --version and docker version (ensure ≥20.10 for Compose 3.9 compatibility).
  - Install libraries: pip install pandas psycopg2-binary pyyaml pytest.
  - Create virtual environment: python -m venv venv, then activate (Windows: venv\Scripts\activate, Unix: source venv/bin/activate).
  - Verify Python 3.10+: python --version.
  - Configure editor for 4-space indentation per PEP 8 (VS Code: “Editor: Tab Size” = 4, “Editor: Insert Spaces” = true, “Editor: Detect Indentation” = false).
  - Save Dockerfile, docker-compose.yml, sales_processor.py, utils.py, tests/test_sales_processor.py, and requirements.txt.
  - Clean up unused Docker images and volumes after testing: docker system prune -f (run only after saving needed images).
- Troubleshooting:
  - If FileNotFoundError, check data/ files. Print paths with print(csv_path).
  - If psycopg2.OperationalError, check PostgreSQL logs: docker-compose logs postgres.
  - If IndentationError, use 4 spaces. Run python -tt sales_processor.py.
  - If yaml.YAMLError, run print(open(config_path).read()) to inspect config.yaml.
Run:
- Open terminal in de-onboarding/.
- Run: docker-compose up --build.
- Verify data/sales_results.json and PostgreSQL data: docker exec -it $(docker ps -q -f name=postgres) psql -U postgres -d postgres -c "SELECT * FROM sales;"
- Check resource usage: docker stats (expected: ~100MB memory for the app container, minimal CPU; e.g., MEM USAGE: 98.5 MiB).
- Stop: docker-compose down.
Test Scenarios:
- Valid Data: Verify sales_results.json and the PostgreSQL table.
- Empty CSV: docker-compose run app python sales_processor.py --csv data/empty.csv (expected: empty DataFrame, no PostgreSQL insert).
- Invalid Headers: docker-compose run app python sales_processor.py --csv data/invalid.csv (expected: empty DataFrame).
- Malformed Data: docker-compose run app python sales_processor.py --csv data/malformed.csv (expected: only valid rows processed).
- Negative Prices: docker-compose run app python sales_processor.py --csv data/negative.csv (expected: negative prices filtered).
- Pytest: pytest tests/test_sales_processor.py -v (expected: all tests pass). A sketch of extra parametrized edge-case tests follows this list.
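The provided test file exercises sales.csv, empty.csv, and invalid.csv directly; malformed.csv and negative.csv can be covered with a small parametrized addition. A minimal sketch (a hypothetical extension of tests/test_sales_processor.py; assertions are kept generic since exact counts depend on the Appendix 1 file contents):
# Hypothetical addition to de-onboarding/tests/test_sales_processor.py
import pytest
from sales_processor import load_and_validate_sales
from utils import read_config

@pytest.mark.parametrize("csv_path", ["data/malformed.csv", "data/negative.csv"])
def test_edge_case_csvs(csv_path: str) -> None:
    """Malformed rows and negative prices should be filtered out, never crash."""
    config = read_config("data/config.yaml")
    df, valid_sales, _ = load_and_validate_sales(csv_path, config)
    assert valid_sales == len(df)  # Only validated rows remain
    if not df.empty:
        assert (df["price"] > 0).all()  # Negative/zero prices filtered out
        assert all(df["product"].str.startswith(config["product_prefix"]))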
60.5 Practice Exercises
Exercise 1: Dockerfile for Testing
Write a Dockerfile to run pytest tests, with 4-space indentation in Python files.
Expected Output:
5 passed in 0.10s
Follow-Along Instructions:
- Save as de-onboarding/Dockerfile.test.
- Configure editor for 4-space indentation per PEP 8.
- Run: docker build -f Dockerfile.test -t sales-tester . && docker run --rm sales-tester.
- Verify test output.
- How to Test:
  - Check output for “5 passed”.
  - Test with missing tests/ directory: Should fail with a clear error.
Exercise 2: Docker Compose with Logging
Extend docker-compose.yml to add logging configuration, with 4-space indentation in Python files. Note: The logging code below is for this exercise only and should not be applied to the micro-project’s sales_processor.py, as file-based logging is covered in Chapter 52.
Expected Output:
Log file created at /app/data/app.log
Follow-Along Instructions:
- Save as de-onboarding/docker-compose-logging.yml.
- Configure editor for 4-space indentation per PEP 8.
- Run: docker-compose -f docker-compose-logging.yml up.
- Verify data/app.log exists.
- How to Test:
  - Check data/app.log contents with cat data/app.log.
  - Test with an invalid log path: Should fail with a permission error.
Exercise 3: Type-Safe PostgreSQL Query
Write a type-annotated function to query sales data to identify top-performing Halal products for Hijra Group’s quarterly report, with 4-space indentation.
Expected Output:
[{'product': 'Halal Laptop', 'amount': 1999.98}, ...]
Follow-Along Instructions:
- Save as de-onboarding/query_sales.py.
- Configure editor for 4-space indentation per PEP 8.
- Run in container: docker-compose run app python query_sales.py.
- Verify output.
- How to Test:
- Check output matches expected.
- Test with empty table: Should return empty list.
Exercise 4: Debug Docker Connection Issue
Fix a buggy docker-compose.yml with incorrect PostgreSQL host, ensuring 4-space indentation.
Buggy Code:
services:
app:
build: .
environment:
      - POSTGRES_HOST=wrong_host
Expected Output:
Data saved to PostgreSQL
Follow-Along Instructions:
- Save as de-onboarding/docker-compose-buggy.yml.
- Configure editor for 4-space indentation per PEP 8.
- Run and observe failure: docker-compose -f docker-compose-buggy.yml up.
- Fix and re-run: docker-compose -f docker-compose-fixed.yml up.
- How to Test:
  - Verify PostgreSQL data with the psql command.
  - Test with incorrect credentials: Should fail with an authentication error.
Exercise 5: Conceptual Analysis of Docker vs. VMs
Explain the trade-offs between Docker containers and virtual machines for data applications, including specific metrics (e.g., memory usage, startup time) and one use case for Hijra Group. Save your explanation to de-onboarding/docker_vs_vms.txt.
Expected Output (in docker_vs_vms.txt):
Docker containers use ~100MB for a Python app with pandas and psycopg2, compared to ~1GB for VMs, due to sharing the host OS kernel. Containers start in under 1 second, while VMs take minutes due to full OS boot. Docker’s lightweight isolation suits Hijra Group’s scalable pipelines, but VMs offer stronger isolation for sensitive applications. For Hijra Group, Docker enables rapid deployment of sales processing pipelines across cloud environments.
Follow-Along Instructions:
- Write the explanation in de-onboarding/docker_vs_vms.txt.
- Verify content with cat de-onboarding/docker_vs_vms.txt (Unix/macOS) or type de-onboarding\docker_vs_vms.txt (Windows).
- How to Test:
- Check file exists and includes metrics (memory, startup time) and a Hijra Group use case.
- Discuss with peers to validate understanding.
Exercise 6: Debug Missing Dependency in Dockerfile
Fix a buggy Dockerfile missing pyyaml in requirements.txt, causing a ModuleNotFoundError for yaml. Verify the fix using docker logs, with 4-space indentation in Python files.
Buggy Dockerfile:
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "sales_processor.py"]Buggy requirements.txt:
pandas==2.2.2
psycopg2-binary==2.9.9
pytest==8.3.2
Expected Output:
Processing completed
Follow-Along Instructions:
- Save the buggy Dockerfile as de-onboarding/Dockerfile.buggy and the buggy requirements.txt as de-onboarding/requirements_buggy.txt.
- Configure editor for 4-space indentation per PEP 8.
- Build and run: docker build -f Dockerfile.buggy -t sales-buggy . && docker run --rm -v $(pwd)/data:/app/data sales-buggy.
- Observe failure: Check logs with docker logs <container_id>.
- Fix requirements.txt by adding pyyaml==6.0.1 and re-run.
- How to Test:
  - Verify sales_results.json and PostgreSQL data.
  - Test with another missing dependency (e.g., pandas): Should fail similarly.
60.6 Exercise Solutions
Solution to Exercise 1: Dockerfile for Testing
# File: de-onboarding/Dockerfile.test
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["pytest", "tests/test_sales_processor.py", "-v"]Solution to Exercise 2: Docker Compose with Logging
# File: de-onboarding/docker-compose-logging.yml
version: '3.9'
services:
app:
build: .
volumes:
- ./data:/app/data
depends_on:
- postgres
environment:
- POSTGRES_HOST=postgres
- POSTGRES_DB=postgres
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=postgres
- POSTGRES_PORT=5432
logging:
driver: 'local'
options:
max-size: '10m'
postgres:
image: postgres:13
environment:
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=postgres
- POSTGRES_DB=postgres
volumes:
- postgres_data:/var/lib/postgresql/data
volumes:
  postgres_data:
Modified sales_processor.py (snippet for logging):
# Add to main() in de-onboarding/sales_processor.py for Exercise 2
def main() -> None:
"""Main function to process sales data."""
log_path = "data/app.log"
with open(log_path, "w") as log_file:
log_file.write("Log file created\n")
print(f"Log file created at {log_path}")
    # ... rest of main() unchanged
Solution to Exercise 3: Type-Safe PostgreSQL Query
# File: de-onboarding/query_sales.py
from typing import List, Dict, Any
import os
import psycopg2
def query_sales(conn_params: Dict[str, str]) -> List[Dict[str, Any]]:
"""Query sales data for Hijra Group's quarterly report."""
try:
conn = psycopg2.connect(**conn_params)
cursor = conn.cursor()
cursor.execute("SELECT product, amount FROM sales ORDER BY amount DESC")
results = [{"product": row[0], "amount": row[1]} for row in cursor.fetchall()]
cursor.close()
conn.close()
print("Query Results:", results)
return results
except psycopg2.Error as e:
print(f"Database error: {e}")
return []
if __name__ == "__main__":
conn_params = {
"dbname": os.getenv("POSTGRES_DB", "postgres"),
"user": os.getenv("POSTGRES_USER", "postgres"),
"password": os.getenv("POSTGRES_PASSWORD", "postgres"),
"host": os.getenv("POSTGRES_HOST", "postgres"),
"port": os.getenv("POSTGRES_PORT", "5432")
}
    print(query_sales(conn_params))
Solution to Exercise 4: Debug Docker Connection Issue
# File: de-onboarding/docker-compose-fixed.yml
version: '3.9'
services:
app:
build: .
volumes:
- ./data:/app/data
depends_on:
- postgres
environment:
- POSTGRES_HOST=postgres
- POSTGRES_DB=postgres
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=postgres
- POSTGRES_PORT=5432
postgres:
image: postgres:13
environment:
- POSTGRES_USER=postgres
- POSTGRES_PASSWORD=postgres
- POSTGRES_DB=postgres
volumes:
- postgres_data:/var/lib/postgresql/data
volumes:
  postgres_data:
Explanation:
- Fixed POSTGRES_HOST=wrong_host to POSTGRES_HOST=postgres.
Solution to Exercise 5: Conceptual Analysis of Docker vs. VMs
File: de-onboarding/docker_vs_vms.txt
Docker containers use ~100MB for a Python app with pandas and psycopg2, compared to ~1GB for VMs, due to sharing the host OS kernel. Containers start in under 1 second, while VMs take minutes due to full OS boot. Docker’s lightweight isolation suits Hijra Group’s scalable pipelines, but VMs offer stronger isolation for sensitive applications. For Hijra Group, Docker enables rapid deployment of sales processing pipelines across cloud environments.
Solution to Exercise 6: Debug Missing Dependency in Dockerfile
Fixed requirements.txt:
pandas==2.2.2
psycopg2-binary==2.9.9
pyyaml==6.0.1
pytest==8.3.2
Fixed Dockerfile (same as main Dockerfile):
# File: de-onboarding/Dockerfile
FROM python:3.10-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["python", "sales_processor.py"]Explanation:
- Added pyyaml==6.0.1 to requirements.txt to resolve ModuleNotFoundError.
60.7 Chapter Summary and Connection to Chapter 61
In this chapter, you’ve mastered:
- Docker Basics: Building images and running containers (~100MB for Python apps).
- Containerizing Apps: Type-annotated pipelines with pandas and psycopg2.
- Docker Compose: Orchestrating Python and PostgreSQL containers.
- Testing: Ensuring reliability with pytest.
- White-Space Sensitivity and PEP 8: Using 4-space indentation to avoid IndentationError.
The micro-project containerized a sales pipeline, storing data in PostgreSQL and exporting results, tested for edge cases per Appendix 1. This prepares for orchestrating containers in Kubernetes (Chapter 61), enhancing scalability for Hijra Group’s analytics.
Connection to Chapter 61
Chapter 61 introduces Kubernetes Fundamentals, building on this chapter:
- Container Orchestration: Extends Docker with Kubernetes pods and Helm Charts.
- Deployment: Deploys the sales pipeline as a Kubernetes pod.
- Scalability: Prepares for stateful apps (Chapter 62) and Airflow (Chapter 64).
- Type Safety: Maintains type annotations for robust code, with 4-space indentation per PEP 8.