Student Performance: Data Preprocessing¶
Data Mining for Business Intelligence, December 2025, Professor Qu Meng
Report Author: Jack Pattarini
1. Overview¶
This notebook documents the preprocessing of the Student Performance Factors dataset, sourced from Kaggle. The goal of preprocessing was to prepare the raw data for several data mining tasks while preserving data integrity and meaningful relationships between attributes.
2. The Raw Dataset¶
The dataset includes 6,607 rows and 20 columns, with a mix of numerical and categorical variables, all plausibly correlated with student performance. Our target variable is Exam_Score.
3. Results and Output¶
The preprocessing pipeline successfully processed all 6,607 records with 20 attributes. The script identified and handled missing values in three categorical attributes.
Missing Value Counts¶
Teacher_Quality: 78
Parental_Education_Level: 90
Distance_from_Home: 67
Missing Value Treatments¶
All missing values belonged to categorical attributes. Mode imputation was used to assign the most frequent value for each attribute with missing data:
- Teacher_Quality: 'Medium' (mode)
- Parental_Education_Level: 'High School' (mode)
- Distance_from_Home: 'Near' (mode)
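The imputation step above can be sketched as follows; the miniature DataFrame here is a hypothetical stand-in for the raw Kaggle data, not an excerpt from it:

```python
import pandas as pd

# Hypothetical miniature frame standing in for the raw data (illustration only).
df = pd.DataFrame({
    "Teacher_Quality": ["Medium", None, "High", "Medium"],
    "Parental_Education_Level": ["College", "High School", None, "High School"],
    "Distance_from_Home": [None, "Near", "Near", "Far"],
})

# Mode imputation: fill each categorical column's gaps with its most frequent value.
for col in ["Teacher_Quality", "Parental_Education_Level", "Distance_from_Home"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```

Mode imputation keeps the categorical dtype intact and is a reasonable default when the missing fraction is small, as here (at most 90 of 6,607 rows per column).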
Categorical Variable Encoding¶
After treating missing values, two encoding strategies were applied:
- Ordinal encoding for attributes with meaningful order/ranking
- One-hot encoding for nominal attributes
Ordinal-Encoded Attributes¶
Parental_Involvement: Low=0, Medium=1, High=2
Access_to_Resources: Low=0, Medium=1, High=2
Motivation_Level: Low=0, Medium=1, High=2
Family_Income: Low=0, Medium=1, High=2
Teacher_Quality: Low=0, Medium=1, High=2
Peer_Influence: Negative=0, Neutral=1, Positive=2
Parental_Education_Level: High School=0, College=1, Postgraduate=2
Distance_from_Home: Near=0, Moderate=1, Far=2
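The ordinal mappings above can be applied with explicit dictionaries, which guards against an encoder inferring the wrong order; the sample rows below are hypothetical:

```python
import pandas as pd

# Ordered category -> integer maps matching the encoding tables above
# (shown for four of the eight ordinal attributes).
ordinal_maps = {
    "Parental_Involvement": {"Low": 0, "Medium": 1, "High": 2},
    "Peer_Influence": {"Negative": 0, "Neutral": 1, "Positive": 2},
    "Parental_Education_Level": {"High School": 0, "College": 1, "Postgraduate": 2},
    "Distance_from_Home": {"Near": 0, "Moderate": 1, "Far": 2},
}

# Hypothetical sample rows for illustration.
df = pd.DataFrame({
    "Parental_Involvement": ["Low", "High", "Medium"],
    "Peer_Influence": ["Positive", "Neutral", "Negative"],
    "Parental_Education_Level": ["College", "High School", "Postgraduate"],
    "Distance_from_Home": ["Far", "Near", "Moderate"],
})

# Replace each category with its integer rank.
for col, mapping in ordinal_maps.items():
    df[col] = df[col].map(mapping)
```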
One-Hot Encoded Attributes¶
- Extracurricular_Activities -> Extracurricular_Activities_Yes
- Internet_Access -> Internet_Access_Yes
- School_Type -> School_Type_Public
- Learning_Disabilities -> Learning_Disabilities_Yes
- Gender -> Gender_Male
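For binary attributes, one-hot encoding with the first category dropped yields exactly one indicator column per attribute, matching the column names above; the two-column frame here is a hypothetical example:

```python
import pandas as pd

# Hypothetical binary attributes for illustration.
df = pd.DataFrame({
    "Gender": ["Male", "Female", "Male"],
    "Internet_Access": ["Yes", "No", "Yes"],
})

# drop_first=True removes the alphabetically first category, leaving a single
# indicator per binary attribute (e.g. Gender_Male, Internet_Access_Yes).
encoded = pd.get_dummies(df, columns=["Gender", "Internet_Access"], drop_first=True)
```

Dropping one level per attribute avoids redundant, perfectly anti-correlated columns (Gender_Male already implies Gender_Female).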
Numerical Attribute Scaling¶
Min-Max Normalization was applied to numerical features to scale them to the [0, 1] range:
- Hours_Studied: [1.0, 44.0] -> [0, 1]
- Attendance: [60.0, 100.0] -> [0, 1]
- Sleep_Hours: [4.0, 10.0] -> [0, 1]
- Previous_Scores: [50.0, 100.0] -> [0, 1]
- Tutoring_Sessions: [0.0, 8.0] -> [0, 1]
- Physical_Activity: [0.0, 6.0] -> [0, 1]
Target Variable¶
Exam_Score was left unscaled for interpretability.
Dimensionality Reduction Analysis¶
Principal Component Analysis (PCA) revealed that:
- 13 components are needed to explain 90% of variance
- 14 components are needed to explain 95% of variance
- No highly correlated feature pairs were found (all pairwise |r| < 0.8), so there was no obvious redundancy to remove
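The component counts above come from the cumulative explained-variance curve; the computation can be sketched with a numpy-only PCA via SVD (the random matrix below is a synthetic stand-in, not the real 6,607 x 19 feature matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the feature matrix (the real one is 6,607 rows x 19 features).
X = rng.normal(size=(200, 6))

# PCA via SVD of the centered matrix; squared singular values are proportional
# to the variance captured by each principal component.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained = (s ** 2) / (s ** 2).sum()
cumulative = np.cumsum(explained)

# Smallest number of components whose cumulative ratio reaches a threshold.
n_90 = int(np.searchsorted(cumulative, 0.90) + 1)
```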
Feature Correlations with Exam Score¶
The strongest positive correlations with Exam Score were:
- Attendance: +0.581
- Hours_Studied: +0.445
- Previous_Scores: +0.175
- Access_to_Resources: +0.170
- Parental_Involvement: +0.157
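A ranking like the one above can be produced by correlating each feature with the target and sorting; the data below is synthetic, generated so that attendance dominates, and only illustrates the mechanics:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
hours = rng.uniform(0, 1, 300)
attendance = rng.uniform(0, 1, 300)
# Hypothetical target loosely driven by attendance and study hours plus noise.
exam = 0.6 * attendance + 0.4 * hours + rng.normal(0, 0.1, 300)

df = pd.DataFrame({"Attendance": attendance,
                   "Hours_Studied": hours,
                   "Exam_Score": exam})

# Pearson correlation of every feature with the target, strongest first.
corr = df.corr()["Exam_Score"].drop("Exam_Score").sort_values(ascending=False)
```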
Output Files¶
The preprocessing script generated four output files:
- processed_student_data.csv: Complete processed dataset (6,607 rows, 20 columns)
- features.csv: All non-target attributes (6,607 rows, 19 columns)
- target.csv: Exam_Score only (6,607 rows, 1 column)
- metadata.json: Feature metadata including data types and preprocessing details
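The feature/target split and metadata export can be sketched as below; the two-row frame and metadata fields are hypothetical placeholders for the real processed dataset:

```python
import json
import pandas as pd

# Hypothetical processed frame standing in for the full 6,607-row dataset.
df = pd.DataFrame({"Hours_Studied": [0.0, 1.0], "Exam_Score": [62, 74]})

# Split into feature and target files, plus a small metadata record.
features = df.drop(columns=["Exam_Score"])
target = df[["Exam_Score"]]

features.to_csv("features.csv", index=False)
target.to_csv("target.csv", index=False)
with open("metadata.json", "w") as f:
    json.dump({"n_rows": len(df),
               "feature_columns": list(features.columns),
               "target": "Exam_Score"}, f)
```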
Data Type Transformations¶
Original data types:
- Object (categorical): 13 columns
- Integer: 7 columns
Processed data types:
- Integer: 9 columns (ordinal encoded)
- Float: 6 columns (scaled numerical)
- Boolean: 5 columns (one-hot encoded)
All variables are now numerical: continuous features scaled to [0, 1], ordinal variables treated as integer-valued, and binary variables as boolean indicators.