03 - Essential Data Libraries: NumPy and Pandas Basics
Complexity: Moderate (M)
3.0 Introduction: Why This Matters for Data Engineering
In data engineering, efficient data manipulation is critical for transforming financial transaction data into actionable insights for Hijra Group’s Sharia-compliant fintech analytics. NumPy and Pandas are foundational libraries for handling large datasets, with Pandas DataFrames using ~24MB for 1 million rows (3 columns) of numeric types and NumPy arrays offering 10–100x faster computations than Python loops. Visualization with Matplotlib enables stakeholder reporting by producing plots like sales trends, laying the groundwork for interactive dashboards with BI tools in Chapter 51. Building on Chapters 1 (Python basics) and 2 (file handling, modules), this chapter introduces NumPy for numerical computations and Pandas for structured data analysis, essential for processing sales data in pipelines.
This chapter avoids advanced concepts like type annotations (Chapter 7), testing (Chapter 9), or error handling (try/except, Chapter 7), focusing on basic operations, filtering, and grouping. All code uses PEP 8’s 4-space indentation, preferring spaces over tabs to avoid IndentationError due to Python’s white-space sensitivity, ensuring compatibility with Hijra Group’s pipeline scripts.
Data Engineering Workflow Context
This diagram illustrates how NumPy and Pandas fit into a data engineering pipeline:
flowchart TD
A["Raw Data (CSV)"] --> B["Python Scripts with NumPy/Pandas"]
B --> C{"Data Processing"}
C -->|Load/Validate| D["NumPy Arrays/DataFrames"]
C -->|Analyze| E["Aggregated Metrics"]
D --> F["Output (CSV/JSON)"]
E --> F
F --> G["Storage/Analysis"]
classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
classDef storage fill:#ddffdd,stroke:#363,stroke-width:1px
class A,D,E,F data
class B,C process
class G storage
Building On and Preparing For
- Building On:
- Chapter 1: Uses lists, dictionaries, and loops for data manipulation, extended to NumPy arrays and Pandas DataFrames.
- Chapter 2: Leverages CSV parsing (csv.DictReader), string validation, and modules (utils.py) for data loading and cleaning, now streamlined with pandas.read_csv.
- Preparing For:
- Chapter 4: Prepares for API data integration by handling structured data.
- Chapter 5: Supports OOP by organizing data processing logic in modules.
- Chapter 7: Lays groundwork for type-safe Pandas operations.
- Chapters 38–39: Enable advanced NumPy/Pandas for large-scale analytics.
What You’ll Learn
This chapter covers:
- NumPy Basics: Arrays for numerical operations (e.g., sales calculations).
- Pandas Basics: DataFrames for structured data (e.g., sales records).
- Data Loading: Reading CSVs with pandas.read_csv.
- Data Cleaning: Filtering and validating data with Pandas.
- Basic Analysis: Grouping and aggregating metrics.
- Visualization: Simple plots with Matplotlib (saved to files).
By the end, you’ll refactor Chapter 2’s sales processor to use NumPy for calculations and Pandas for data handling, producing a JSON report and a sales trend plot, all with 4-space indentation per PEP 8. The micro-project uses data/sales.csv and tests edge cases with empty.csv, invalid.csv, malformed.csv, and negative.csv, as specified in Appendix 1.
Follow-Along Tips:
- Create de-onboarding/data/ and populate it with files from Appendix 1 (sales.csv, config.yaml, sample.csv, empty.csv, invalid.csv, malformed.csv, negative.csv).
- Install libraries: pip install numpy pandas matplotlib pyyaml.
- If IndentationError occurs, use 4 spaces (not tabs) per PEP 8. Run python -tt script.py or use VS Code’s Pylint to detect tab/space mixing.
- Use print statements (e.g., print(df.head())) to debug DataFrames.
- Save plots to data/ (e.g., sales_trend.png) instead of using plt.show().
- Verify file paths with ls data/ (Unix/macOS) or dir data\ (Windows).
- Use UTF-8 encoding for all files to avoid UnicodeDecodeError.
3.1 NumPy Basics
NumPy provides arrays for fast numerical computations, ideal for sales calculations. Arrays are fixed-size, contiguous memory blocks, offering O(1) access and vectorized operations, unlike Python lists (O(n) for some operations). For 1 million sales records, NumPy arrays use ~8MB for floats, with operations 10–100x faster than loops.
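The 10–100x speedup claim can be checked with a quick, informal benchmark (a sketch; exact timings vary by machine, and synthetic random data stands in for real sales records):

```python
import time
import numpy as np

n = 1_000_000
prices = np.random.rand(n) * 100  # 1M synthetic prices
quantities = np.random.randint(1, 100, n)  # 1M synthetic quantities

start = time.time()
loop_total = sum(p * q for p, q in zip(prices, quantities))  # pure-Python loop
loop_time = time.time() - start

start = time.time()
vec_total = float(np.sum(prices * quantities))  # vectorized NumPy
vec_time = time.time() - start

print(f"Loop: {loop_time:.4f}s, Vectorized: {vec_time:.4f}s")
```

Both approaches compute the same total; the vectorized version avoids per-element Python interpreter overhead.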
The following diagram shows the NumPy data flow:
flowchart TD
A["CSV Data"] --> B["Python List"]
B --> C["NumPy Array"]
C --> D["Filtered Array"]
D --> E["Aggregated Metrics"]
3.1.1 Creating and Operating on Arrays
Create arrays from lists and perform vectorized operations.
import numpy as np # Import NumPy with standard alias
# Create arrays from sales data
prices = np.array([999.99, 24.99, 49.99]) # Array of prices
quantities = np.array([2, 10, 5]) # Array of quantities
# Vectorized operations
amounts = prices * quantities # Element-wise multiplication
total_sales = np.sum(amounts) # Sum all amounts
average_price = np.mean(prices) # Average price
max_quantity = np.max(quantities) # Maximum quantity
# Print results
print("Prices:", prices) # Debug: print array
print("Quantities:", quantities) # Debug: print array
print("Amounts:", amounts) # Debug: print computed amounts
print("Total Sales:", total_sales) # Output total
print("Average Price:", average_price) # Output average
print("Max Quantity:", max_quantity) # Output max
# Expected Output:
# Prices: [999.99 24.99 49.99]
# Quantities: [2 10 5]
# Amounts: [1999.98 249.9 249.95]
# Total Sales: 2499.83
# Average Price: 358.3233333333333
# Max Quantity: 10
Follow-Along Instructions:
- Ensure de-onboarding/ exists from Chapter 2.
- Install NumPy: pip install numpy.
- Save as de-onboarding/numpy_basics.py.
- Configure editor for 4-space indentation (not tabs) per PEP 8 (VS Code: “Editor: Tab Size” = 4, “Editor: Insert Spaces” = true, “Editor: Detect Indentation” = false).
- Run: python numpy_basics.py.
- Verify output matches comments.
- Common Errors:
  - ModuleNotFoundError: Install NumPy with pip install numpy.
  - ValueError: Ensure lists have the same length for operations. Print len(prices), len(quantities).
  - IndentationError: Use 4 spaces (not tabs). Run python -tt numpy_basics.py.
Key Points:
- White-Space Sensitivity and PEP 8: Indentation (4 spaces per PEP 8) ensures readable code. Spaces are preferred over tabs to avoid IndentationError.
- np.array(): Creates arrays from lists.
- Vectorized operations: +, *, etc., are O(n) but faster than loops due to C-based implementation.
- Aggregation: np.sum(), np.mean(), np.max() are O(n).
- Underlying Implementation: Arrays are contiguous memory blocks, enabling SIMD (Single Instruction, Multiple Data) operations. NumPy uses C for vectorized operations, reducing Python’s interpreter overhead and making operations 10–100x faster than loops.
- Performance Considerations:
  - Time Complexity: O(n) for vectorized operations and aggregations.
  - Space Complexity: O(n) for n elements (~8MB for 1M floats).
  - Implication: Use NumPy for numerical tasks in pipelines, e.g., computing sales totals for Hijra Group’s analytics.
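The ~8MB figure for 1 million floats can be verified directly with the array’s nbytes attribute:

```python
import numpy as np

# Each float64 occupies 8 bytes, so 1M elements take 8,000,000 bytes
arr = np.zeros(1_000_000, dtype=np.float64)
print("Bytes:", arr.nbytes)  # 8000000
print("MB:", arr.nbytes / 1e6)  # 8.0
```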
3.1.2 Array Filtering
Filter arrays based on conditions, e.g., high-quantity sales.
import numpy as np # Import NumPy
# Create arrays
quantities = np.array([2, 10, 5, 150]) # Sales quantities
prices = np.array([999.99, 24.99, 49.99, 5.00]) # Prices
# Filter high-quantity sales (>100)
high_quantity = quantities > 100 # Boolean array
filtered_prices = prices[high_quantity] # Filter prices where quantities > 100
# Print results
print("Quantities:", quantities) # Debug: print quantities
print("High Quantity Mask:", high_quantity) # Debug: print boolean mask
print("Filtered Prices:", filtered_prices) # Output filtered prices
# Expected Output:
# Quantities: [ 2 10 5 150]
# High Quantity Mask: [False False False True]
# Filtered Prices: [5.]
Follow-Along Instructions:
- Save as de-onboarding/numpy_filtering.py.
- Configure editor for 4-space indentation per PEP 8.
- Run: python numpy_filtering.py.
- Verify output matches comments.
- Common Errors:
  - IndexError: Ensure arrays have the same length. Print len(quantities), len(prices).
  - IndentationError: Use 4 spaces (not tabs). Run python -tt numpy_filtering.py.
Key Points:
- Boolean indexing: Creates O(n) boolean arrays for filtering.
- Time Complexity: O(n) for filtering.
- Space Complexity: O(n) for boolean mask, O(k) for filtered array (k elements).
- Implication: Efficient for identifying outliers, e.g., excessive quantities in sales data.
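Boolean masks can also be combined with & (and) and | (or) to express multiple validation rules at once, as in the config-driven checks later in this chapter. A minimal sketch using the same sample data:

```python
import numpy as np

quantities = np.array([2, 10, 5, 150])  # sales quantities
prices = np.array([999.99, 24.99, 49.99, 5.00])  # prices

# Combine conditions; parentheses are required around each comparison
valid = (quantities <= 100) & (prices >= 10.0)
print("Valid Mask:", valid)
print("Valid Prices:", prices[valid])
```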
3.2 Pandas Basics
Pandas provides DataFrames, tabular structures for structured data, ideal for sales records. DataFrames are built on NumPy arrays, offering labeled columns and rows, with O(1) column access and O(n) row operations. A 1 million-row DataFrame (~3 columns) uses ~24MB for numeric types. Pandas Series are 1D structures, used for single columns or grouped results, distinct from DataFrames’ 2D tables.
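The DataFrame/Series distinction can be seen by selecting a single column, which yields a Series (the column names below are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Halal Laptop", "Halal Mouse"],
    "price": [999.99, 24.99],
})
col = df["price"]  # selecting one column returns a 1D Series
print(type(df).__name__)   # DataFrame
print(type(col).__name__)  # Series
```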
3.2.1 Creating and Loading DataFrames
Load CSVs into DataFrames using pandas.read_csv.
import pandas as pd # Import Pandas with standard alias
# Load sales CSV
df = pd.read_csv("data/sales.csv") # Read CSV into DataFrame
# Inspect DataFrame
print("DataFrame Head:") # Debug: print first few rows
print(df.head()) # Show first 5 rows
print("DataFrame Info:") # Debug: print structure
df.info() # Show column types and null counts (prints directly; returns None)
# Expected Output (with sales.csv from Appendix 1):
# DataFrame Head:
# product price quantity
# 0 Halal Laptop 999.99 2
# 1 Halal Mouse 24.99 10
# 2 Halal Keyboard 49.99 5
# 3 NaN 29.99 3
# 4 Monitor NaN 2
# DataFrame Info:
# <class 'pandas.core.frame.DataFrame'>
# RangeIndex: 6 entries, 0 to 5
# Data columns (total 3 columns):
# # Column Non-Null Count Dtype
# --- ------ -------------- -----
# 0 product 5 non-null object
# 1 price 5 non-null float64
# 2 quantity 6 non-null int64
# dtypes: float64(1), int64(1), object(1)
# memory usage: 272.0+ bytes
Follow-Along Instructions:
- Ensure data/sales.csv exists in de-onboarding/data/ per Appendix 1.
- Install Pandas: pip install pandas.
- Save as de-onboarding/pandas_basics.py.
- Configure editor for 4-space indentation per PEP 8.
- Run: python pandas_basics.py.
- Verify output shows DataFrame structure.
- Common Errors:
  - FileNotFoundError: Ensure data/sales.csv exists. Print the path with print("data/sales.csv").
  - ModuleNotFoundError: Install Pandas with pip install pandas.
  - IndentationError: Use 4 spaces (not tabs). Run python -tt pandas_basics.py.
Key Points:
- pd.read_csv(): Loads CSVs with automatic type inference.
- df.head(): Shows the first 5 rows for debugging.
- df.info(): Displays column types, null counts, and memory usage.
- Underlying Implementation: DataFrames are column-oriented, storing each column as a NumPy array for fast column-wise operations (O(1) access), but row iterations are slower (O(n)) due to non-contiguous memory.
- Performance Considerations:
  - Time Complexity: O(n) for loading n rows.
  - Space Complexity: O(n) for n rows. The ~24MB estimate for 1 million rows (3 columns) assumes float64/int64 types (8 bytes each); string columns like product may increase memory usage because object dtype stores variable-length data.
  - Implication: Use Pandas for structured data in pipelines, e.g., sales records at Hijra Group.
3.2.2 Data Cleaning and Filtering
Clean and filter DataFrames using boolean indexing and string operations. The is_numeric_type function from utils.py (defined in the micro-project) checks if a value is numeric, ensuring clarity.
import pandas as pd # Import Pandas
import utils # Import utils module
# Load and clean DataFrame
df = pd.read_csv("data/sales.csv") # Read CSV
df = df.dropna(subset=["product"]) # Drop rows with missing product
df = df[df["product"].str.startswith("Halal")] # Filter Halal products
df = df[df["quantity"] <= 100] # Filter quantity <= 100
df = df[df["price"].apply(utils.is_numeric_type)] # Ensure price is numeric
# Print cleaned DataFrame
print("Cleaned DataFrame:") # Debug
print(df) # Show filtered DataFrame
# Expected Output:
# Cleaned DataFrame:
# product price quantity
# 0 Halal Laptop 999.99 2
# 1 Halal Mouse 24.99 10
# 2 Halal Keyboard 49.99 5
Follow-Along Instructions:
- Save as de-onboarding/pandas_cleaning.py.
- Ensure utils.py includes is_numeric_type (see micro-project).
- Configure editor for 4-space indentation per PEP 8.
- Run: python pandas_cleaning.py.
- Verify output shows 3 rows with Halal products, valid prices, and quantity <= 100.
- Common Errors:
  - KeyError: Ensure column names match the CSV. Print df.columns.
  - TypeError: Check for non-string values in str.startswith. Print df["product"].
  - Type Mismatch: If filtering fails (e.g., TypeError in price checks), print df["price"].apply(type) to inspect data types and ensure numeric values.
  - IndentationError: Use 4 spaces (not tabs). Run python -tt pandas_cleaning.py.
Key Points:
- dropna(): Removes rows with missing values.
- str.startswith(): Filters strings (e.g., Halal products).
- Boolean indexing: Filters rows based on conditions.
- apply(): Applies custom validation (e.g., utils.is_numeric_type).
- Time Complexity: O(n) for filtering n rows.
- Space Complexity: O(k) for the filtered DataFrame (k rows).
- Implication: Efficient for cleaning transaction data, ensuring Sharia compliance.
3.2.3 Grouping and Aggregation
Group data and compute metrics, e.g., total sales by product. The result of groupby().sum() is a Pandas Series, a 1D structure, unlike a DataFrame’s 2D table.
import pandas as pd # Import Pandas
# Load and clean DataFrame
df = pd.read_csv("data/sales.csv") # Read CSV
df = df.dropna(subset=["product", "price"]) # Drop missing product/price
df = df[df["product"].str.startswith("Halal")] # Filter Halal products
df["price"] = df["price"].astype(float) # Convert price to float
df["quantity"] = df["quantity"].astype(int) # Convert quantity to int
# Compute amount per sale
df["amount"] = df["price"] * df["quantity"] # New column for price * quantity
# Group by product and sum amounts
sales_by_product = df.groupby("product")["amount"].sum() # Returns a Series
# Print results
print("Sales by Product (Series):") # Debug
print(sales_by_product) # Show grouped sums
# Expected Output:
# Sales by Product (Series):
# product
# Halal Keyboard 249.95
# Halal Laptop 1999.98
# Halal Mouse 249.90
# Name: amount, dtype: float64
Follow-Along Instructions:
- Save as de-onboarding/pandas_grouping.py.
- Configure editor for 4-space indentation per PEP 8.
- Run: python pandas_grouping.py.
- Verify output shows sales by product.
- Common Errors:
  - KeyError: Ensure the amount column exists. Print df.columns.
  - Grouping Errors: If groupby produces unexpected results, print df.groupby("product").size() to check the number of rows per group and verify data integrity.
  - IndentationError: Use 4 spaces (not tabs). Run python -tt pandas_grouping.py.
Key Points:
- groupby(): Groups data by column(s).
- sum(): Aggregates grouped data, returning a Series.
- Time Complexity: O(n) for grouping n rows.
- Space Complexity: O(k) for k groups.
- Implication: Useful for summarizing sales metrics in pipelines.
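Beyond sum(), a grouped column can report several metrics at once with agg(); a minimal sketch on illustrative data:

```python
import pandas as pd

df = pd.DataFrame({
    "product": ["Halal Laptop", "Halal Mouse", "Halal Mouse"],
    "amount": [1999.98, 124.95, 124.95],
})
# agg() computes multiple aggregations per group in one pass,
# returning a DataFrame with one column per metric
summary = df.groupby("product")["amount"].agg(["sum", "mean", "count"])
print(summary)
```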
3.3 Basic Visualization with Matplotlib
Visualize data with Matplotlib, saving plots to files (avoiding plt.show() per Pyodide guidelines).
import pandas as pd # Import Pandas
import matplotlib.pyplot as plt # Import Matplotlib
# Load and clean DataFrame
df = pd.read_csv("data/sales.csv") # Read CSV
df = df.dropna(subset=["product", "price"]) # Drop missing values
df = df[df["product"].str.startswith("Halal")] # Filter Halal products
# Compute amount
df["amount"] = df["price"] * df["quantity"] # Price * quantity
# Plot sales by product
plt.figure(figsize=(8, 6)) # Set figure size
plt.bar(df["product"], df["amount"]) # Bar plot
plt.title("Sales by Product") # Title
plt.xlabel("Product") # X-axis label
plt.ylabel("Sales Amount ($)") # Y-axis label
plt.xticks(rotation=45) # Rotate x labels
plt.grid(True) # Add grid
plt.tight_layout() # Adjust layout
plt.savefig("data/sales_plot.png", dpi=100) # Save plot with high resolution
plt.close() # Close figure
print("Plot saved to data/sales_plot.png") # Confirm save
# Expected Output:
# Plot saved to data/sales_plot.png
# (Creates data/sales_plot.png with a bar plot)
Follow-Along Instructions:
- Install Matplotlib: pip install matplotlib.
- Save as de-onboarding/visualization.py.
- Configure editor for 4-space indentation per PEP 8.
- Run: python visualization.py.
- Verify data/sales_plot.png exists and shows bars for Halal products.
- Common Errors:
  - ModuleNotFoundError: Install Matplotlib with pip install matplotlib.
  - FileNotFoundError: Ensure write permissions for data/. Check with ls -l data/ (Unix/macOS) or dir data\ (Windows).
  - Plot Sizing: If labels are cut off, adjust plt.tight_layout() or increase figsize (e.g., (10, 8)). Print plt.gcf().get_size_inches() to check figure dimensions.
  - Plot Resolution: Setting dpi=100 in plt.savefig increases sales_plot.png’s file size slightly but ensures clarity for stakeholder reports. Check file size with ls -lh data/sales_plot.png (Unix/macOS) or dir data\sales_plot.png (Windows).
  - IndentationError: Use 4 spaces (not tabs). Run python -tt visualization.py.
Key Points:
- plt.bar(): Creates bar plots.
- plt.savefig(): Saves plots to files.
- Time Complexity: O(n) for plotting n items in bar plots. Complex visualizations like scatter plots with many points may have higher rendering costs.
- Space Complexity: O(1) for plot metadata.
- Implication: Visualizations aid stakeholder reporting at Hijra Group.
3.4 Micro-Project: Refactored Sales Data Processor
Project Requirements
Refactor Chapter 2’s sales processor to use NumPy and Pandas for efficient analytics and visualization, processing data/sales.csv for Hijra Group’s analytics. This processor supports Hijra Group’s transaction reporting for Sharia-compliant product sales, ensuring compliance with Islamic Financial Services Board (IFSB) standards by validating Halal products and generating stakeholder reports. Hijra Group’s daily transaction datasets often include thousands of records, making efficient tools like Pandas and NumPy critical for processing large volumes of sales data. Real-world CSVs may have inconsistent delimiters (e.g., semicolons) or encodings (e.g., UTF-16). pd.read_csv supports these with parameters like sep or encoding, which are explored in Chapter 40 for large-scale data processing:
- Load data/sales.csv with pandas.read_csv.
- Read data/config.yaml with PyYAML.
- Validate records using Pandas filtering and utils.py, ensuring Halal products, positive prices, and config rules (e.g., currency validation from Chapter 2’s Exercise 7 will be extended in later chapters).
- Compute total sales and top 3 products using Pandas/NumPy.
- Export results to data/sales_results.json.
- Generate a sales trend plot saved to data/sales_trend.png.
- Log steps and invalid records using print statements.
- Use 4-space indentation per PEP 8, preferring spaces over tabs.
- Test edge cases with empty.csv, invalid.csv, malformed.csv, and negative.csv per Appendix 1.
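As a taste of the sep and encoding parameters mentioned above (a sketch using an in-memory string in place of a real file):

```python
import io
import pandas as pd

# Semicolon-delimited data, as sometimes exported by spreadsheet tools
raw = io.StringIO("product;price;quantity\nHalal Mouse;24.99;10\n")
df = pd.read_csv(raw, sep=";")  # override the default comma delimiter
print(df)
```

For a file on disk, an encoding override would look like pd.read_csv(path, encoding="utf-16"); Chapter 40 covers these options for large-scale processing.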
Sample Input Files
data/sales.csv (from Appendix 1):
product,price,quantity
Halal Laptop,999.99,2
Halal Mouse,24.99,10
Halal Keyboard,49.99,5
,29.99,3
Monitor,invalid,2
Headphones,5.00,150
data/config.yaml (from Appendix 1):
min_price: 10.0
max_quantity: 100
required_fields:
- product
- price
- quantity
product_prefix: 'Halal'
max_decimals: 2
Data Processing Flow
flowchart TD
A["Input CSV
sales.csv"] --> B["Load CSV
pandas.read_csv"]
B --> C["Pandas DataFrame"]
C --> D["Read YAML
config.yaml"]
D --> E["Validate DataFrame
Pandas/utils.py"]
E -->|Invalid| F["Log Warning
Skip Record"]
E -->|Valid| G["Compute Metrics
Pandas/NumPy"]
G --> H["Export JSON
sales_results.json"]
G --> I["Generate Plot
sales_trend.png"]
F --> J["End Processing"]
H --> J
I --> J
classDef data fill:#f9f9f9,stroke:#333,stroke-width:2px
classDef process fill:#d0e0ff,stroke:#336,stroke-width:1px
classDef error fill:#ffdddd,stroke:#933,stroke-width:1px
classDef endpoint fill:#ddffdd,stroke:#363,stroke-width:1px
class A,C data
class B,D,E,G,H,I process
class F error
class J endpoint
Acceptance Criteria
- Go Criteria:
  - Loads sales.csv and config.yaml correctly.
  - Validates records for required fields, Halal prefix, numeric price/quantity, positive prices, and config rules.
  - Computes total sales and top 3 products by sales.
  - Exports results to data/sales_results.json.
  - Saves the sales trend plot to data/sales_trend.png, confirming file existence.
  - Logs steps and invalid records.
  - Uses 4-space indentation per PEP 8, preferring spaces over tabs.
  - Passes edge case tests with empty.csv, invalid.csv, malformed.csv, and negative.csv.
- No-Go Criteria:
  - Fails to load sales.csv or config.yaml.
  - Incorrect validation or calculations.
  - Missing JSON export or plot.
  - Uses try/except or type annotations.
  - Inconsistent indentation or tab/space mixing.
Common Pitfalls to Avoid
- Incorrect CSV Loading:
  - Problem: pd.read_csv fails due to a missing file.
  - Solution: Print the path with print(csv_path). Ensure data/sales.csv exists in de-onboarding/data/.
- Validation Errors:
  - Problem: Missing values cause filtering issues.
  - Solution: Use dropna() and print df.head() to debug.
- Type Mismatches:
  - Problem: Non-numeric prices cause calculation errors.
  - Solution: Validate with utils.is_numeric_type. Print df.dtypes.
- NumPy Shape Mismatches:
  - Problem: ValueError when multiplying arrays of different lengths.
  - Solution: Print prices.shape, quantities.shape to verify compatibility.
- Pandas Type Inference:
  - Problem: pd.read_csv parses quantity as float if the CSV contains non-integer values (e.g., “invalid”). This causes errors in integer-based operations.
  - Solution: Validate types with utils.is_integer and convert with df["quantity"].astype(int). Print df.dtypes to debug.
- Plotting Issues:
  - Problem: Plot not saved.
  - Solution: Use plt.savefig() and check permissions. Print plt.get_fignums() and os.path.exists(plot_path). To debug plot issues, temporarily add plt.show() before plt.close() to inspect the plot interactively, then remove it to comply with Pyodide’s non-interactive requirements.
- IndentationError:
  - Problem: Mixed spaces/tabs.
  - Solution: Use 4 spaces per PEP 8. Run python -tt sales_processor.py.
How This Differs from Production
In production, this solution would include:
- Error Handling: Try/except for robust error recovery (Chapter 7).
- Type Safety: Type annotations with Pyright (Chapter 7).
- Testing: Unit tests with pytest (Chapter 9).
- Scalability: Chunked loading for large CSVs (Chapter 40).
- Logging: File-based logging (Chapter 52).
- Visualization: Interactive dashboards with Metabase (Chapter 51).
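As a preview of the chunked loading mentioned above (a sketch using an in-memory CSV so it runs standalone; a real pipeline would pass a file path):

```python
import io
import pandas as pd

csv_data = io.StringIO(
    "product,price,quantity\n"
    "Halal Laptop,999.99,2\n"
    "Halal Mouse,24.99,10\n"
    "Halal Keyboard,49.99,5\n"
)
total = 0.0
# chunksize=2 yields DataFrames of at most 2 rows, bounding memory usage
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += float((chunk["price"] * chunk["quantity"]).sum())
print("Total sales:", round(total, 2))
```

The running total matches the full-load computation; only the peak memory differs.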
Implementation
# File: de-onboarding/utils.py
def is_decimal_string(s, max_decimals=2): # Check if string is a decimal number
"""Check if string is a decimal number with up to max_decimals."""
parts = s.split(".") # Split on decimal point
if len(parts) != 2 or not parts[0].isdigit() or not parts[1].isdigit():
return False # Invalid format
return len(parts[1]) <= max_decimals # Check decimal places
def clean_string(s): # Clean string by stripping whitespace
"""Strip whitespace from string."""
return s.strip()
def is_numeric_type(x): # Check if value is numeric
"""Check if value is an integer or float."""
return isinstance(x, (int, float)) # Return True for numeric types
def has_valid_decimals(x, max_decimals): # Check decimal places
"""Check if value has valid decimal places."""
return is_decimal_string(str(x), max_decimals) # Use is_decimal_string for validation
def apply_valid_decimals(x, max_decimals): # Apply decimal validation
"""Apply has_valid_decimals to a value."""
return has_valid_decimals(x, max_decimals)
def is_integer(x): # Check if value is an integer
"""Check if value is an integer when converted to string."""
return str(x).isdigit() # Return True for integer strings
def validate_sale(sale, config): # Validate a sale dictionary
"""Validate sale based on config rules."""
required_fields = config["required_fields"] # Get required fields
min_price = config["min_price"] # Get minimum price
max_quantity = config["max_quantity"] # Get maximum quantity
prefix = config["product_prefix"] # Get product prefix
max_decimals = config["max_decimals"] # Get max decimal places
print(f"Validating sale: {sale}") # Debug: print sale
# Check for missing or empty fields
for field in required_fields: # Loop through required fields
if not sale[field] or sale[field].strip() == "": # Check if field is empty
print(f"Invalid sale: missing {field}: {sale}") # Log invalid
return False
# Validate product: non-empty and matches prefix
product = clean_string(sale["product"]) # Clean product string
if not product.startswith(prefix): # Check prefix
print(f"Invalid sale: product lacks '{prefix}' prefix: {sale}") # Log invalid
return False
# Validate price: numeric, meets minimum, and positive
price = clean_string(sale["price"]) # Clean price string
if not is_decimal_string(price, max_decimals) or float(price) < min_price or float(price) <= 0: # Check format, value, and positivity
print(f"Invalid sale: invalid price: {sale}") # Log invalid
return False
# Validate quantity: integer and within limit
quantity = clean_string(sale["quantity"]) # Clean quantity string
if not quantity.isdigit() or int(quantity) > max_quantity: # Check format and limit
print(f"Invalid sale: invalid quantity: {sale}") # Log invalid
return False
return True # Return True if all checks pass
# File: de-onboarding/sales_processor.py
import pandas as pd # For DataFrame operations
import numpy as np # For numerical computations
import yaml # For YAML parsing
import json # For JSON export
import matplotlib.pyplot as plt # For plotting
import utils # Import custom utils module
import os # For file existence check
# Define function to read YAML configuration
def read_config(config_path): # Takes config file path
"""Read YAML configuration."""
print(f"Opening config: {config_path}") # Debug: print path
file = open(config_path, "r") # Open YAML
config = yaml.safe_load(file) # Parse YAML
file.close() # Close file
print(f"Loaded config: {config}") # Debug: print config
return config # Return config dictionary
# Define function to load and validate sales data
def load_and_validate_sales(csv_path, config): # Takes CSV path and config
"""Load sales CSV and validate using Pandas."""
print(f"Loading CSV: {csv_path}") # Debug: print path
# Load CSV and coerce price to numeric, invalids become NaN
df = pd.read_csv(csv_path)
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print("Initial DataFrame:")
print(df)
# Check for missing required columns
missing_fields = [f for f in config["required_fields"] if f not in df.columns]
if missing_fields:
print(f"Missing columns: {missing_fields}")
return pd.DataFrame(), 0, 0
# Drop rows with missing required fields (including NaN price)
df = df.dropna(subset=config["required_fields"])
# Early return if DataFrame is empty after dropping required fields
if df.empty:
print("No data available after dropping missing required fields.")
return df, 0, 0
df = df[df["product"].str.startswith(config["product_prefix"])] # Keep only Halal products per config
df = df[df["quantity"].apply(utils.is_integer)] # Ensure quantity is integer
df["quantity"] = df["quantity"].astype(int)
df = df[df["quantity"] <= config["max_quantity"]]
df = df[df["price"] > 0]
df = df[df["price"] >= config["min_price"]]
total_records = len(df)
print("Validated DataFrame:")
print(df)
return df, len(df), total_records
# Define function to process sales data
def process_sales(df): # Takes DataFrame
"""Process sales: compute total and top products using Pandas/NumPy."""
if df.empty: # Check for empty DataFrame
print("No valid sales data") # Log empty
return {"total_sales": 0.0, "unique_products": [], "top_products": {}}, 0
# Compute amount
df["amount"] = (df["price"] * df["quantity"]).round(2) # Price * quantity, rounded
print("DataFrame with Amount:") # Debug
print(df) # Show DataFrame with amount
# Compute metrics using NumPy
total_sales = float(np.sum(df["amount"].values)) # Total sales
unique_products = df["product"].unique().tolist() # Unique products
sales_by_product = df.groupby("product")["amount"].sum().round(2)
# Use OrderedDict to preserve order
from collections import OrderedDict
top_products = OrderedDict(
sales_by_product.sort_values(ascending=False).head(3).items()
)
valid_sales = len(df) # Count valid sales
print(f"Valid sales: {valid_sales} records") # Log valid count
return {
"total_sales": round(total_sales, 2), # Convert to float for JSON
"unique_products": unique_products, # List of products
"top_products": dict(top_products), # Top 3 products, ordered
}, valid_sales # Return results and count
# Define function to export results
def export_results(results, json_path): # Takes results and file path
"""Export results to JSON."""
print(f"Writing to: {json_path}") # Debug: print path
print(f"Results: {results}") # Debug: print results
file = open(json_path, "w") # Open JSON file
json.dump(results, file, indent=2) # Write JSON
file.close() # Close file
print(f"Exported results to {json_path}") # Confirm export
# Define function to plot sales
def plot_sales(df, plot_path): # Takes DataFrame and plot path
"""Generate sales trend plot."""
if df.empty: # Check for empty DataFrame
print("No data to plot") # Log empty
return
plt.figure(figsize=(8, 6)) # Set figure size
plt.bar(df["product"], df["amount"]) # Bar plot
plt.title("Sales by Product") # Title
plt.xlabel("Product") # X-axis label
plt.ylabel("Sales Amount ($)") # Y-axis label
plt.xticks(rotation=45) # Rotate x labels
plt.grid(True) # Add grid
plt.tight_layout() # Adjust layout
plt.savefig(plot_path, dpi=100) # Save plot with high resolution
plt.close() # Close figure
print(f"Plot saved to {plot_path}") # Confirm save
print(f"File exists: {os.path.exists(plot_path)}") # Confirm file creation
# Define main function
def main(): # No parameters
"""Main function to process sales data."""
csv_path = "data/sales.csv" # CSV path
config_path = "data/config.yaml" # YAML path
json_path = "data/sales_results.json" # JSON output path
plot_path = "data/sales_trend.png" # Plot output path
config = read_config(config_path) # Read config
df, valid_sales, total_records = load_and_validate_sales(
csv_path, config
) # Load and validate
results, valid_sales = process_sales(df) # Process
export_results(results, json_path) # Export results
plot_sales(df, plot_path) # Generate plot
# Output report
print("\nSales Report:") # Print header
print(f"Total Records Processed: {total_records}") # Total records
print(f"Valid Sales: {valid_sales}") # Valid count
print(f"Invalid Sales: {total_records - valid_sales}") # Invalid count
print(f"Total Sales: ${round(results['total_sales'], 2)}") # Total sales
print(f"Unique Products: {results['unique_products']}") # Products
print(f"Top Products: {results['top_products']}") # Top products
print("Processing completed") # Confirm completion
if __name__ == "__main__":
main() # Run main function
Expected Outputs
data/sales_results.json:
{
"total_sales": 2499.83,
"unique_products": ["Halal Laptop", "Halal Mouse", "Halal Keyboard"],
"top_products": {
"Halal Laptop": 1999.98,
"Halal Keyboard": 249.95,
"Halal Mouse": 249.9
}
}
data/sales_trend.png: Bar plot showing sales amounts for Halal products, saved with dpi=100.
Console Output (abridged):

    Opening config: data/config.yaml
    Loaded config: {'min_price': 10.0, 'max_quantity': 100, 'required_fields': ['product', 'price', 'quantity'], 'product_prefix': 'Halal', 'max_decimals': 2}
    Loading CSV: data/sales.csv
    Initial DataFrame:
              product   price  quantity
    0    Halal Laptop  999.99         2
    1     Halal Mouse   24.99        10
    2  Halal Keyboard   49.99         5
    3             NaN   29.99         3
    4         Monitor     NaN         2
    Validated DataFrame:
              product   price  quantity
    0    Halal Laptop  999.99         2
    1     Halal Mouse   24.99        10
    2  Halal Keyboard   49.99         5
    DataFrame with Amount:
              product   price  quantity   amount
    0    Halal Laptop  999.99         2  1999.98
    1     Halal Mouse   24.99        10   249.90
    2  Halal Keyboard   49.99         5   249.95
    Valid sales: 3 records
    Writing to: data/sales_results.json
    Exported results to data/sales_results.json
    Plot saved to data/sales_trend.png
    File exists: True

    Sales Report:
    Total Records Processed: 3
    Valid Sales: 3
    Invalid Sales: 0
    Total Sales: $2499.83
    Unique Products: ['Halal Laptop', 'Halal Mouse', 'Halal Keyboard']
    Top Products: {'Halal Laptop': 1999.98, 'Halal Mouse': 249.9, 'Halal Keyboard': 249.95}
    Processing completed

How to Run and Test
Setup:

- Setup Checklist:
  - Create `de-onboarding/data/` directory.
  - Save `sales.csv`, `config.yaml`, `sample.csv`, `empty.csv`, `invalid.csv`, `malformed.csv`, `negative.csv` per Appendix 1.
  - Install libraries: `pip install numpy pandas matplotlib pyyaml`.
  - Create virtual environment: `python -m venv venv`, activate (Windows: `venv\Scripts\activate`, Unix: `source venv/bin/activate`).
  - Verify Python 3.10+: `python --version`.
  - Configure editor for 4-space indentation per PEP 8 (VS Code: “Editor: Tab Size” = 4, “Editor: Insert Spaces” = true, “Editor: Detect Indentation” = false).
  - Save `utils.py` and `sales_processor.py` in `de-onboarding/`.
- Troubleshooting:
  - If `FileNotFoundError` or `PermissionError` occurs when saving `sales_results.json` or `sales_trend.png`, check write permissions with `ls -l data/` (Unix/macOS) or `dir data\` (Windows).
  - If `ModuleNotFoundError`, install libraries or check the `utils.py` path.
  - If `IndentationError`, use 4 spaces (not tabs). Run `python -m tabnanny sales_processor.py` to check indentation (Python 3 removed the old `-tt` flag).
  - If `UnicodeDecodeError`, ensure UTF-8 encoding for all files.
  - If `yaml.YAMLError`, run `print(open(config_path).read())` to inspect `config.yaml` for syntax errors like incorrect indentation or missing colons.

Run:

- Open terminal in `de-onboarding/`.
- Run: `python sales_processor.py`.
- Outputs: `data/sales_results.json`, `data/sales_trend.png`, console logs.
Test Scenarios:

- Valid Data: Verify `sales_results.json` shows `total_sales: 2499.83`, correct top products, and `sales_trend.png` displays bars with high resolution (dpi=100).
- Empty CSV: Test with `empty.csv`:

      config = read_config("data/config.yaml")
      df, valid_sales, total_records = load_and_validate_sales("data/empty.csv", config)
      results, valid_sales = process_sales(df)
      print(results, valid_sales, total_records)
      # Expected: {'total_sales': 0.0, 'unique_products': [], 'top_products': {}}, 0, 0

  - Note: `empty.csv` contains only headers (`product`, `price`, `quantity`), resulting in an empty DataFrame with columns but no rows. Scripts should handle this gracefully, returning zero metrics and no plot.
- Invalid Headers: Test with `invalid.csv`:

      config = read_config("data/config.yaml")
      df, valid_sales, total_records = load_and_validate_sales("data/invalid.csv", config)
      print(df)  # Expected: Empty DataFrame

- Malformed Data: Test with `malformed.csv`:

      config = read_config("data/config.yaml")
      df, valid_sales, total_records = load_and_validate_sales("data/malformed.csv", config)
      print(df)  # Expected: DataFrame with only Halal Mouse row

  - Note: `quantity: "invalid"` is parsed as an object type by `pd.read_csv`. The validation in `load_and_validate_sales` filters out non-integer quantities using `utils.is_integer`, ensuring only valid integer quantities (e.g., 10) remain.
- Negative Prices: Test with `negative.csv`:

      config = read_config("data/config.yaml")
      df, valid_sales, total_records = load_and_validate_sales("data/negative.csv", config)
      print(df)  # Expected: DataFrame with only Halal Mouse row
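To make the empty-CSV scenario concrete, here is a small self-contained sketch that simulates a headers-only file with `StringIO` instead of reading `data/empty.csv` from disk; the zero-metrics dictionary is a hypothetical stand-in for the pipeline's result:

```python
from io import StringIO  # Simulate a file in memory

import pandas as pd

# A headers-only CSV, mirroring data/empty.csv
empty_csv = StringIO("product,price,quantity\n")
df = pd.read_csv(empty_csv)  # Columns exist, but there are no rows

print(df.empty)          # True
print(list(df.columns))  # ['product', 'price', 'quantity']

# Zero metrics, matching what the pipeline should return when no valid rows exist
results = {"total_sales": 0.0, "unique_products": [], "top_products": {}}
print(results)
```

Because the DataFrame keeps its column labels even with zero rows, downstream code like `df["price"] * df["quantity"]` still runs and simply produces an empty result, which is why graceful zero-metric handling is possible without special cases.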
3.5 Practice Exercises
Exercise 1: NumPy Sales Calculator
Write a function to compute total sales using NumPy arrays, with 4-space indentation per PEP 8.
Sample Input:

    prices = [999.99, 24.99, 49.99]
    quantities = [2, 10, 5]

Expected Output:

    2499.83

Follow-Along Instructions:

- Save as `de-onboarding/ex1_numpy.py`.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python ex1_numpy.py`.
- How to Test:
  - Add: `print(calculate_sales([999.99, 24.99, 49.99], [2, 10, 5]))`.
  - Verify output: `2499.83`.
  - Test with empty lists: Should return `0.0`.
  - Test with negative quantities: `calculate_sales([999.99, 24.99, 49.99], [2, -10, 5])` should return `0.0`. Print `quantities` to debug.
  - Note: Negative quantities are invalid in sales data and return `0.0` to ensure data integrity, aligning with the micro-project’s positive price validation for Sharia-compliant transactions.
  - Common Errors:
    - ValueError: Print `len(prices)`, `len(quantities)` to check lengths.
    - IndentationError: Use 4 spaces (not tabs). Run `python -m tabnanny ex1_numpy.py`.
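An aside on this exercise's computation: elementwise multiply-then-sum is mathematically a dot product, so NumPy can also express it in a single call. A minimal sketch (not required for the exercise):

```python
import numpy as np

prices = [999.99, 24.99, 49.99]
quantities = [2, 10, 5]

# Elementwise multiply, then sum
total = np.sum(np.array(prices) * np.array(quantities))
# Equivalent single-call dot product
total_dot = np.dot(prices, quantities)

print(round(float(total), 2))      # 2499.83
print(round(float(total_dot), 2))  # 2499.83
```

Both run in vectorized C code; `np.dot` is simply the more compact spelling when you only need the total, not the per-row amounts.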
Exercise 2: Pandas Data Loader
Write a function to load and filter data/sample.csv for Halal products, with 4-space indentation per PEP 8.
Sample Input (data/sample.csv):

    product,price,quantity
    Halal Laptop,999.99,2
    Halal Mouse,24.99,10

Expected Output:

            product   price  quantity
    0  Halal Laptop  999.99         2
    1   Halal Mouse   24.99        10

Follow-Along Instructions:

- Save as `de-onboarding/ex2_pandas.py`.
- Ensure `data/sample.csv` exists per Appendix 1.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python ex2_pandas.py`.
- How to Test:
  - Add: `print(load_halal_sales("data/sample.csv"))`.
  - Verify output matches expected.
  - Test with empty CSV: Should return empty DataFrame.
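One wrinkle worth knowing: `sample.csv` has clean product names, but files like `sales.csv` contain a missing product. `str.startswith` returns NaN for missing values, which breaks boolean filtering; its `na=False` parameter treats them as non-matches. A sketch with hypothetical inline data:

```python
import pandas as pd

# Hypothetical rows mirroring sales.csv, including a missing product name
df = pd.DataFrame({
    "product": ["Halal Laptop", None, "Monitor"],
    "price": [999.99, 29.99, 24.99],
})

# na=False converts the NaN result for the missing name into False,
# keeping the mask purely boolean so filtering works
mask = df["product"].str.startswith("Halal", na=False)
print(df[mask])  # Only the Halal Laptop row remains
```

Without `na=False`, the mask contains NaN for the missing name and indexing with it can raise an error, so this one parameter is the difference between a robust filter and a crash on real data.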
Exercise 3: Pandas Grouping
Write a function to group sales by product from data/sample.csv and compute totals, with 4-space indentation per PEP 8.
Sample Input (data/sample.csv):

    product,price,quantity
    Halal Laptop,999.99,2
    Halal Mouse,24.99,10

Expected Output:

    product
    Halal Laptop    1999.98
    Halal Mouse      249.90
    Name: amount, dtype: float64

Follow-Along Instructions:

- Save as `de-onboarding/ex3_grouping.py`.
- Ensure `data/sample.csv` exists per Appendix 1.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python ex3_grouping.py`.
- How to Test:
  - Add: `print(group_sales("data/sample.csv"))`.
  - Verify output matches expected.
  - Test with non-Halal products: Should exclude them.
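Once grouping with `.sum()` is comfortable, note that `groupby` can compute several statistics per product in one pass via `.agg`; a brief sketch with hypothetical inline data:

```python
import pandas as pd

# Hypothetical sales rows with a repeated product
df = pd.DataFrame({
    "product": ["Halal Laptop", "Halal Mouse", "Halal Mouse"],
    "amount": [1999.98, 124.95, 124.95],
})

# One pass over the groups, three statistics per product
summary = df.groupby("product")["amount"].agg(["sum", "mean", "count"])
print(summary)
```

The result is a DataFrame with one row per product and one column per statistic, which is often exactly the shape a stakeholder report needs.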
Exercise 4: Visualization
Write a function to plot sales by product from data/sample.csv, saving to data/plot.png, with 4-space indentation per PEP 8.
Sample Input (data/sample.csv):

    product,price,quantity
    Halal Laptop,999.99,2
    Halal Mouse,24.99,10

Expected Output:

    Plot saved to data/plot.png

Follow-Along Instructions:

- Save as `de-onboarding/ex4_plot.py`.
- Ensure `data/sample.csv` exists per Appendix 1.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python ex4_plot.py`.
- How to Test:
  - Verify `data/plot.png` exists with correct bars at high resolution (dpi=100).
  - Test with empty CSV: Should not generate plot.
Exercise 5: Debug a Pandas Grouping Bug
Fix this buggy code that groups by the wrong column, causing incorrect sales totals, ensuring 4-space indentation per PEP 8.
Buggy Code:

    import pandas as pd

    def group_sales(csv_path):
        df = pd.read_csv(csv_path)
        df["amount"] = df["price"] * df["quantity"]
        sales_by_product = df.groupby("price")["amount"].sum()  # Bug: Wrong column
        return sales_by_product

    print(group_sales("data/sample.csv"))

Sample Input (data/sample.csv):

    product,price,quantity
    Halal Laptop,999.99,2
    Halal Mouse,24.99,10

Expected Output:

    product
    Halal Laptop    1999.98
    Halal Mouse      249.90
    Name: amount, dtype: float64

Follow-Along Instructions:

- Save as `de-onboarding/ex5_debug.py`.
- Ensure `data/sample.csv` exists per Appendix 1.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python ex5_debug.py` to see incorrect output.
- Fix and re-run.
- How to Test:
  - Verify output matches expected.
  - Test with additional products to ensure correct grouping.
  - Common Errors:
    - KeyError: Print `df.columns` to check column names.
    - IndentationError: Use 4 spaces (not tabs). Run `python -m tabnanny ex5_debug.py`.
Exercise 6: NumPy vs. Pandas Selector with Conceptual Analysis
Write a function to process sales data using NumPy or Pandas based on a use_numpy flag, and explain how Pandas’ column-oriented storage impacts performance compared to NumPy’s contiguous arrays, with 4-space indentation per PEP 8.
Sample Input:

    prices = [999.99, 24.99]
    quantities = [2, 10]
    use_numpy = True

Expected Output:

    2249.88
    Explanation: NumPy is chosen for numerical computations like summing sales amounts due to its fast, C-based array operations. Pandas’ column-oriented storage enables fast column access but slower row iterations, while NumPy’s contiguous arrays optimize numerical computations.

Follow-Along Instructions:

- Save as `de-onboarding/ex6_selector.py`.
- Configure editor for 4-space indentation per PEP 8.
- Run: `python ex6_selector.py`.
- How to Test:
  - Add: `print(process_sales([999.99, 24.99], [2, 10], True))`.
  - Verify output: `2249.88` and explanation.
  - Test with `use_numpy=False`: Should return a DataFrame with amounts.
  - Test with mismatched lengths: `process_sales([999.99], [2, 10], True)` should return `0.0`. Print `len(prices)`, `len(quantities)` to debug.
  - Save explanation to `de-onboarding/ex6_concepts.txt` for review.
  - Common Errors:
    - ValueError: Ensure `prices` and `quantities` have same length.
    - IndentationError: Use 4 spaces (not tabs). Run `python -m tabnanny ex6_selector.py`.
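To ground the performance claims this exercise asks you to explain, here is a rough timing sketch comparing a pure-Python loop with NumPy's vectorized sum; the exact speedup varies by machine, so treat the numbers as illustrative:

```python
import time

import numpy as np

values = list(range(1_000_000))  # One million integers
array = np.array(values)

start = time.perf_counter()
loop_total = 0
for v in values:  # Interpreted loop: one Python object at a time
    loop_total += v
loop_time = time.perf_counter() - start

start = time.perf_counter()
numpy_total = int(np.sum(array))  # Vectorized C loop over contiguous memory
numpy_time = time.perf_counter() - start

print(loop_total == numpy_total)  # True: identical result
print(f"Loop: {loop_time:.4f}s, NumPy: {numpy_time:.4f}s")
```

The gap comes from NumPy operating on a contiguous block of machine integers rather than boxing and unboxing a Python object per element, which is the same reason the chapter prefers vectorized operations in pipeline code.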
3.6 Exercise Solutions
Solution to Exercise 1: NumPy Sales Calculator
import numpy as np  # Import NumPy

def calculate_sales(prices, quantities):  # Takes prices and quantities lists
    """Compute total sales using NumPy."""
    if not prices or not quantities or len(prices) != len(quantities):  # Check for empty or mismatched inputs
        return 0.0  # Return 0 for invalid input
    if any(q < 0 for q in quantities):  # Check for negative quantities
        print("Error: Negative quantities detected")  # Log error
        return 0.0  # Return 0 for invalid input
    prices_array = np.array(prices)  # Convert to array
    quantities_array = np.array(quantities)  # Convert to array
    amounts = prices_array * quantities_array  # Compute amounts
    total = np.sum(amounts)  # Sum amounts
    print("Amounts:", amounts)  # Debug
    return total  # Return total

# Test
print(calculate_sales([999.99, 24.99, 49.99], [2, 10, 5]))  # Call function

# Output:
# Amounts: [1999.98 249.9 249.95]
# 2499.83

Solution to Exercise 2: Pandas Data Loader
import pandas as pd  # Import Pandas

def load_halal_sales(csv_path):  # Takes CSV path
    """Load and filter CSV for Halal products."""
    df = pd.read_csv(csv_path)  # Load CSV
    print("Initial DataFrame:")  # Debug
    print(df.head())  # Show first rows
    df = df[df["product"].str.startswith("Halal")]  # Filter Halal products
    return df  # Return filtered DataFrame

# Test
print(load_halal_sales("data/sample.csv"))  # Call function

# Output:
# Initial DataFrame:
#         product   price  quantity
# 0  Halal Laptop  999.99         2
# 1   Halal Mouse   24.99        10
#         product   price  quantity
# 0  Halal Laptop  999.99         2
# 1   Halal Mouse   24.99        10

Solution to Exercise 3: Pandas Grouping
import pandas as pd  # Import Pandas

def group_sales(csv_path):  # Takes CSV path
    """Group sales by product and compute totals."""
    df = pd.read_csv(csv_path)  # Load CSV
    df["amount"] = df["price"] * df["quantity"]  # Compute amount
    sales_by_product = df.groupby("product")["amount"].sum()  # Group and sum
    print("Grouped Data:")  # Debug
    print(sales_by_product)  # Show results
    return sales_by_product  # Return Series

# Test
print(group_sales("data/sample.csv"))  # Call function

# Output:
# Grouped Data:
# product
# Halal Laptop    1999.98
# Halal Mouse      249.90
# Name: amount, dtype: float64

Solution to Exercise 4: Visualization
import pandas as pd  # Import Pandas
import matplotlib.pyplot as plt  # Import Matplotlib

def plot_sales(csv_path, plot_path):  # Takes CSV and plot paths
    """Plot sales by product."""
    df = pd.read_csv(csv_path)  # Load CSV
    if df.empty:  # Empty CSV: nothing to plot
        print("No data to plot")  # Log skip
        return  # Do not generate plot
    df["amount"] = df["price"] * df["quantity"]  # Compute amount
    plt.figure(figsize=(8, 6))  # Set figure size
    plt.bar(df["product"], df["amount"])  # Bar plot
    plt.title("Sales by Product")  # Title
    plt.xlabel("Product")  # X-axis label
    plt.ylabel("Sales Amount ($)")  # Y-axis label
    plt.xticks(rotation=45)  # Rotate x labels
    plt.grid(True)  # Add grid
    plt.tight_layout()  # Adjust layout
    plt.savefig(plot_path, dpi=100)  # Save plot with high resolution
    plt.close()  # Close figure
    print(f"Plot saved to {plot_path}")  # Confirm save

# Test
plot_sales("data/sample.csv", "data/plot.png")  # Call function

# Output:
# Plot saved to data/plot.png

Solution to Exercise 5: Debug a Pandas Grouping Bug
import pandas as pd  # Import Pandas

def group_sales(csv_path):  # Takes CSV path
    """Group sales by product and compute totals."""
    df = pd.read_csv(csv_path)  # Load CSV
    df["amount"] = df["price"] * df["quantity"]  # Compute amount
    sales_by_product = df.groupby("product")["amount"].sum()  # Fix: Group by product
    return sales_by_product  # Return Series

# Test
print(group_sales("data/sample.csv"))  # Call function

# Output:
# product
# Halal Laptop    1999.98
# Halal Mouse      249.90
# Name: amount, dtype: float64

Explanation:

- Grouping Bug: Grouping by `price` instead of `product` produced incorrect totals, as prices are not unique identifiers. Fixed by using `df.groupby("product")`.
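To see why grouping by `price` is wrong rather than merely unconventional, consider hypothetical data in which two distinct products share a price; grouping by `price` silently merges them into one row:

```python
import pandas as pd

# Two different products with the same price (hypothetical data)
df = pd.DataFrame({
    "product": ["Halal Mouse", "Halal Pad"],
    "price": [24.99, 24.99],
    "quantity": [10, 4],
})
df["amount"] = df["price"] * df["quantity"]

by_price = df.groupby("price")["amount"].sum()      # One merged group
by_product = df.groupby("product")["amount"].sum()  # Two correct groups

print(len(by_price))    # 1
print(len(by_product))  # 2
```

The merged total is not wrong arithmetic, it is wrong attribution: per-product reporting becomes impossible once rows are keyed by a non-unique column.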
Solution to Exercise 6: NumPy vs. Pandas Selector with Conceptual Analysis
import numpy as np  # Import NumPy
import pandas as pd  # Import Pandas

def process_sales(prices, quantities, use_numpy):  # Takes prices, quantities, and flag
    """Process sales using NumPy or Pandas and explain storage performance."""
    if len(prices) != len(quantities):  # Check for mismatched lengths
        print("Error: Mismatched lengths")  # Log error
        return 0.0  # Return 0 for invalid input
    if use_numpy:  # Use NumPy for numerical computation
        prices_array = np.array(prices)  # Convert to array
        quantities_array = np.array(quantities)  # Convert to array
        amounts = prices_array * quantities_array  # Compute amounts
        total = np.sum(amounts)  # Sum amounts
        explanation = (
            "Explanation: NumPy is chosen for numerical computations like summing sales amounts "
            "due to its fast, C-based array operations. Pandas’ column-oriented storage enables fast "
            "column access but slower row iterations, while NumPy’s contiguous arrays optimize numerical "
            "computations."
        )
        print(explanation)
        return total  # Return total
    else:  # Use Pandas for structured data
        df = pd.DataFrame({"price": prices, "quantity": quantities})  # Create DataFrame
        df["amount"] = df["price"] * df["quantity"]  # Compute amount
        explanation = (
            "Explanation: Pandas is chosen for structured data with labeled columns, ideal for "
            "filtering Halal products or grouping sales by product. Pandas’ column-oriented storage "
            "enables fast column access but slower row iterations, while NumPy’s contiguous arrays "
            "optimize numerical computations."
        )
        print(explanation)
        return df[["price", "quantity", "amount"]]  # Return DataFrame

# Test
print(process_sales([999.99, 24.99], [2, 10], True))  # Call with NumPy

# Output:
# Explanation: NumPy is chosen for numerical computations like summing sales amounts due to its fast, C-based array operations. Pandas’ column-oriented storage enables fast column access but slower row iterations, while NumPy’s contiguous arrays optimize numerical computations.
# 2249.88

Explanation (save to de-onboarding/ex6_concepts.txt):
NumPy is chosen for numerical computations like summing sales amounts due to its fast, C-based array operations, ideal for large datasets. Pandas is chosen for structured data with labeled columns, perfect for filtering Halal products or grouping sales by product in Hijra Group’s pipelines. Pandas’ column-oriented storage enables fast column access but slower row iterations, while NumPy’s contiguous arrays optimize numerical computations.

3.7 Chapter Summary and Connection to Chapter 4
In this chapter, you’ve mastered:
- NumPy: Arrays for numerical computations (O(n) vectorized operations, ~8MB for 1M floats).
- Pandas: DataFrames for structured data (O(n) loading, ~24MB for 1M rows with numeric types) and Series for 1D results.
- Visualization: Matplotlib plots saved to files (O(n) for bar plots, higher for complex visualizations).
- White-Space Sensitivity and PEP 8: Using 4-space indentation, preferring spaces over tabs to avoid `IndentationError`.
The micro-project refactored Chapter 2’s processor, leveraging Pandas for data handling and NumPy for calculations, producing a JSON report and high-resolution plot, all with 4-space indentation per PEP 8. It tested edge cases (empty.csv, invalid.csv, malformed.csv, negative.csv) per Appendix 1, ensuring robustness. The modular functions like load_and_validate_sales prepare for organizing pipeline logic into classes in Chapter 5’s Object-Oriented Programming, enhancing code scalability. For example, the load_and_validate_sales function could be encapsulated in a SalesProcessor class in Chapter 5, improving modularity for complex pipelines. Visualization skills prepare for creating dashboards with Metabase in Chapter 51, enhancing stakeholder reporting.
Connection to Chapter 4
Chapter 4 introduces Web Integration and APIs, building on this chapter:
- Data Loading: Extends `pd.read_csv` to API data with `requests.get`, processing JSON responses into DataFrames, using `data/transactions.csv` per Appendix 1. Chapter 4’s `transactions.csv` extends `sales.csv`’s structure with transaction IDs and dates, preparing for time-series analysis in later chapters, maintaining consistency with Hijra Group’s transaction data processing.
- Data Structures: Uses Pandas DataFrames for API data, building on Chapter 3’s filtering and grouping.
- Modules: Reuses `utils.py` for validation, preparing for OOP in Chapter 5.
- Fintech Context: Prepares for fetching transaction data from APIs, aligning with Hijra Group’s real-time analytics, maintaining PEP 8’s 4-space indentation for maintainable pipeline scripts.
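As a preview of Chapter 4's API work, a JSON response that arrives as a list of dictionaries converts directly into a DataFrame, so this chapter's filtering and grouping skills carry over unchanged. A sketch with a hypothetical payload (no network call involved):

```python
import pandas as pd

# Hypothetical JSON payload, shaped like an API response
records = [
    {"product": "Halal Laptop", "price": 999.99, "quantity": 2},
    {"product": "Halal Mouse", "price": 24.99, "quantity": 10},
]

df = pd.DataFrame(records)  # Each dict becomes one row
df["amount"] = df["price"] * df["quantity"]  # Same vectorized math as before
print(df)
```

In Chapter 4 the `records` list would come from the parsed body of an HTTP response, but everything after `pd.DataFrame(records)` is exactly the workflow practiced here.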