Generate Report¶
Loading dataset... Dataset shape: (6607, 20)
Columns: ['Hours_Studied', 'Attendance', 'Parental_Involvement', 'Access_to_Resources', 'Sleep_Hours', 'Previous_Scores', 'Motivation_Level', 'Tutoring_Sessions', 'Family_Income', 'Teacher_Quality', 'Peer_Influence', 'Physical_Activity', 'Parental_Education_Level', 'Distance_from_Home', 'Exam_Score', 'Extracurricular_Activities_Yes', 'Internet_Access_Yes', 'School_Type_Public', 'Learning_Disabilities_Yes', 'Gender_Male']
============================================================
DATASET OVERVIEW
============================================================
   Hours_Studied  Attendance  Parental_Involvement  Access_to_Resources  \
0       0.511628       0.600                     0                    2
1       0.418605       0.100                     0                    1
2       0.534884       0.950                     1                    1
3       0.651163       0.725                     0                    1
4       0.418605       0.800                     1                    1

   Sleep_Hours  Previous_Scores  Motivation_Level  Tutoring_Sessions  \
0     0.500000             0.46                 0              0.000
1     0.666667             0.18                 0              0.250
2     0.500000             0.82                 1              0.250
3     0.666667             0.96                 1              0.125
4     0.333333             0.30                 1              0.375

   Family_Income  Teacher_Quality  Peer_Influence  Physical_Activity  \
0              0                1               2           0.500000
1              1                1               0           0.666667
2              1                1               1           0.666667
3              1                1               0           0.666667
4              1                2               1           0.666667

   Parental_Education_Level  Distance_from_Home  Exam_Score  \
0                         0                   0          67
1                         1                   1          61
2                         2                   0          74
3                         0                   1          71
4                         1                   0          70

   Extracurricular_Activities_Yes  Internet_Access_Yes  School_Type_Public  \
0                           False                 True                True
1                           False                 True                True
2                            True                 True                True
3                            True                 True                True
4                            True                 True                True

   Learning_Disabilities_Yes  Gender_Male
0                      False         True
1                      False        False
2                      False         True
3                      False         True
4                      False        False

Data types:
Hours_Studied                     float64
Attendance                        float64
Parental_Involvement                int64
Access_to_Resources                 int64
Sleep_Hours                       float64
Previous_Scores                   float64
Motivation_Level                    int64
Tutoring_Sessions                 float64
Family_Income                       int64
Teacher_Quality                     int64
Peer_Influence                      int64
Physical_Activity                 float64
Parental_Education_Level            int64
Distance_from_Home                  int64
Exam_Score                          int64
Extracurricular_Activities_Yes       bool
Internet_Access_Yes                  bool
School_Type_Public                   bool
Learning_Disabilities_Yes            bool
Gender_Male                          bool
dtype: object

Missing values: 0 in all 20 columns (Total missing: 0)
============================================================
DATA PREPARATION
============================================================
Created binary target: Pass (≥70) = 1625, Fail = 4982
Pass rate: 24.60%
Features shape: (6607, 19)
Target shape: (6607,)
Train set: (5285, 19), Test set: (1322, 19)
============================================================
1. CLASSIFICATION ALGORITHMS COMPARISON
============================================================
Training Decision Tree...
Accuracy: 0.8540
Precision: 0.7230
Recall: 0.6585
F1-Score: 0.6892
Training Random Forest...
Accuracy: 0.9145
Precision: 0.9309
Recall: 0.7046
F1-Score: 0.8021
Training K-Nearest Neighbors...
Accuracy: 0.7632
Precision: 0.5312
Recall: 0.3138
F1-Score: 0.3946
Training Neural Network...
Accuracy: 0.9372
Precision: 0.9172
Recall: 0.8185
F1-Score: 0.8650
Training Naïve Bayes...
Accuracy: 0.9259
Precision: 0.9283
Recall: 0.7569
F1-Score: 0.8339
Training Support Vector Machine...
Accuracy: 0.9342
Precision: 0.9508
Recall: 0.7723
F1-Score: 0.8523
Training Gradient Boosting...
Accuracy: 0.9327
Precision: 0.9069
Recall: 0.8092
F1-Score: 0.8553
============================================================
DECISION TREE ANALYSIS AND VISUALIZATION
============================================================
Decision Tree Rules (first 3 levels):
|--- Attendance <= 0.64
|   |--- Hours_Studied <= 0.59
|   |   |--- Hours_Studied <= 0.45
|   |   |   |--- Tutoring_Sessions <= 0.81
|   |   |   |   |--- truncated branch of depth 10
|   |   |   |--- Tutoring_Sessions > 0.81
|   |   |   |   |--- truncated branch of depth 2
|   |   |--- Hours_Studied > 0.45
|   |   |   |--- Attendance <= 0.56
|   |   |   |   |--- truncated branch of depth 13
|   |   |   |--- Attendance > 0.56
|   |   |   |   |--- truncated branch of depth 8
|   |--- Hours_Studied > 0.59
|   |   |--- Attendance <= 0.39
|   |   |   |--- Hours_Studied <= 0.66
|   |   |   |   |--- truncated branch of depth 6
|   |   |   |--- Hours_Studied > 0.66
|   |   |   |   |--- truncated branch of depth 9
|   |   |--- Attendance > 0.39
|   |   |   |--- Previous_Scores <= 0.35
|   |   |   |   |--- truncated branch of depth 7
|   |   |   |--- Previous_Scores > 0.35
|   |   |   |   |--- truncated branch of depth 8
|--- Attendance > 0.64
|   |--- Hours_Studied <= 0.45
|   |   |--- Hours_Studied <= 0.34
|   |   |   |--- Access_to_Resources <= 1.50
|   |   |   |   |--- truncated branch of depth 9
|   |   |   |--- Access_to_Resources > 1.50
|   |   |   |   |--- truncated branch of depth 7
|   |   |--- Hours_Studied > 0.34
|   |   |   |--- Previous_Scores <= 0.51
|   |   |   |   |--- truncated branch of depth 12
|   |   |   |--- Previous_Scores > 0.51
|   |   |   |   |--- truncated branch of depth 11
|   |--- Hours_Studied > 0.45
|   |   |--- Hours_Studied <= 0.55
|   |   |   |--- Parental_Involvement <= 0.50
|   |   |   |   |--- truncated branch of depth 8
|   |   |   |--- Parental_Involvement > 0.50
|   |   |   |   |--- truncated branch of depth 10
|   |   |--- Hours_Studied > 0.55
|   |   |   |--- Access_to_Resources <= 0.50
|   |   |   |   |--- truncated branch of depth 8
|   |   |   |--- Access_to_Resources > 0.50
|   |   |   |   |--- truncated branch of depth 8
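The comparison above can be sketched with scikit-learn as below. The estimator names match the report, but the synthetic data, split, and hyperparameters are stand-ins (the report does not show its code), and only two of the seven models are included to keep the sketch short.

```python
# Hedged sketch of the model-comparison loop; synthetic data stands in
# for the student-performance dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

X, y = make_classification(n_samples=600, n_features=19, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "Accuracy": accuracy_score(y_test, pred),
        "Precision": precision_score(y_test, pred),
        "Recall": recall_score(y_test, pred),
        "F1-Score": f1_score(y_test, pred),
    }
```

The same loop extends to the remaining five estimators by adding entries to `models`.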
Top 10 Features from Decision Tree:
feature importance
Attendance 0.334078
Hours_Studied 0.222294
Previous_Scores 0.101916
Tutoring_Sessions 0.059365
Access_to_Resources 0.045726
Parental_Involvement 0.043599
Physical_Activity 0.025517
Family_Income 0.025028
Sleep_Hours 0.022586
Parental_Education_Level 0.019073
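A ranking like the one above can be read from a fitted tree's `feature_importances_`; a minimal sketch on synthetic data (the placeholder names `f0`…`f5` are not the report's features):

```python
# Extract and sort Gini importances from a fitted decision tree.
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=6, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

importances = (pd.Series(tree.feature_importances_,
                         index=[f"f{i}" for i in range(6)])
               .sort_values(ascending=False))
```

The importances are normalized, so they sum to 1 across all features.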
----------------------------------------
DECISION PATH ANALYSIS
----------------------------------------
Student #0 (Actual: Fail, Predicted: Fail):
Top influencing features:
Attendance: 0.5750 (importance: 0.3341)
Hours_Studied: 0.4419 (importance: 0.2223)
Previous_Scores: 0.5400 (importance: 0.1019)
Tutoring_Sessions: 0.0000 (importance: 0.0594)
Access_to_Resources: 0.0000 (importance: 0.0457)
Student #10 (Actual: Fail, Predicted: Fail):
Top influencing features:
Attendance: 0.3750 (importance: 0.3341)
Hours_Studied: 0.5581 (importance: 0.2223)
Previous_Scores: 0.6400 (importance: 0.1019)
Tutoring_Sessions: 0.3750 (importance: 0.0594)
Access_to_Resources: 0.0000 (importance: 0.0457)
Student #20 (Actual: Pass, Predicted: Pass):
Top influencing features:
Attendance: 0.6750 (importance: 0.3341)
Hours_Studied: 0.5814 (importance: 0.2223)
Previous_Scores: 0.4000 (importance: 0.1019)
Tutoring_Sessions: 0.3750 (importance: 0.0594)
Access_to_Resources: 2.0000 (importance: 0.0457)
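One plausible way to obtain per-student paths like those above is `DecisionTreeClassifier.decision_path`; this sketch (synthetic data, not the report's exact method) lists the features tested along a single student's root-to-leaf path:

```python
# Inspect the decision path for one sample through a fitted tree.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

node_indicator = tree.decision_path(X[:1])   # sparse matrix: nodes visited by sample 0
node_ids = node_indicator.indices            # node ids along the path
features_used = [tree.tree_.feature[n] for n in node_ids
                 if tree.tree_.feature[n] >= 0]  # negative values mark leaf nodes
```

Combining `features_used` with the sample's feature values and the global importances reproduces a per-student summary like the one printed above.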
============================================================
CLASSIFICATION RESULTS SUMMARY
============================================================
Algorithm Accuracy Precision Recall F1-Score
Neural Network 0.937216 0.917241 0.818462 0.865041
Support Vector Machine 0.934191 0.950758 0.772308 0.852292
Gradient Boosting 0.932678 0.906897 0.809231 0.855285
Naïve Bayes 0.925870 0.928302 0.756923 0.833898
Random Forest 0.914523 0.930894 0.704615 0.802102
Decision Tree 0.854009 0.722973 0.658462 0.689211
K-Nearest Neighbors 0.763238 0.531250 0.313846 0.394584
============================================================
NAÏVE BAYES ANALYSIS AND VISUALIZATION
============================================================
----------------------------------------
FEATURE DISTRIBUTION ANALYSIS
----------------------------------------
Top 10 Features by Class Discrimination:
feature mean_fail mean_pass mean_diff discrimination_score
Attendance 0.4195 0.7516 0.3321 1.4235
Hours_Studied 0.4113 0.5337 0.1224 0.9669
Access_to_Resources 1.0462 1.2954 0.2492 0.3659
Previous_Scores 0.4741 0.5774 0.1032 0.3647
Parental_Involvement 1.0409 1.2500 0.2091 0.3062
Tutoring_Sessions 0.1751 0.2188 0.0436 0.2777
Peer_Influence 1.1448 1.3085 0.1637 0.2208
Distance_from_Home 0.5405 0.4000 0.1405 0.2161
Parental_Education_Level 0.6612 0.8123 0.1511 0.1922
Motivation_Level 0.8790 0.9908 0.1117 0.1607
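A sketch of the per-class statistics behind this table. The discrimination score here is the absolute mean difference divided by the feature's overall standard deviation; that formula is an assumption, since the report does not show its definition.

```python
# Per-class feature means and a simple class-discrimination score.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Attendance": rng.random(100),
    "Hours_Studied": rng.random(100),
    "Pass": rng.integers(0, 2, 100),   # toy stand-in for the binary target
})

rows = []
for col in ["Attendance", "Hours_Studied"]:
    m_fail = df.loc[df["Pass"] == 0, col].mean()
    m_pass = df.loc[df["Pass"] == 1, col].mean()
    rows.append({
        "feature": col,
        "mean_fail": m_fail,
        "mean_pass": m_pass,
        "mean_diff": abs(m_pass - m_fail),
        "discrimination_score": abs(m_pass - m_fail) / df[col].std(),
    })
stats = pd.DataFrame(rows).sort_values("discrimination_score", ascending=False)
```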
============================================================
2. CLUSTERING ANALYSIS
============================================================
Determining optimal number of clusters...
Silhouette Scores:
K=2: Silhouette Score = 0.0646
K=3: Silhouette Score = 0.0556
K=4: Silhouette Score = 0.0523
K=5: Silhouette Score = 0.0574
Cluster distribution:
Cluster
0 2381
1 3531
2 695
Name: count, dtype: int64
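The K scan and cluster assignment above can be sketched as below; synthetic blobs stand in for the scaled student features.

```python
# Fit KMeans for several K and compare silhouette scores.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
best_k = max(scores, key=scores.get)
```

Silhouette scores range from -1 to 1; the low values in the report (all below 0.07) suggest the student clusters are weakly separated, which is consistent with the near-identical cluster means shown next.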
============================================================
CLUSTER CHARACTERISTICS
============================================================
Exam_Score Hours_Studied Attendance Previous_Scores \
Cluster
0 66.96 0.45 0.50 0.50
1 67.61 0.44 0.51 0.50
2 66.27 0.44 0.48 0.51
Parental_Involvement
Cluster
0 1.08
1 1.09
2 1.10
============================================================
3. ASSOCIATION RULE MINING (APRIORI ALGORITHM)
============================================================
Using 9 binary columns for association analysis
Running Apriori algorithm...
Found 936 association rules
Top 10 Association Rules (sorted by Lift):
antecedents consequents support confidence lift
934 (School_Type_Public, Pass) (High_Study, High_Score, High_Attendance, Internet_Access_Yes) 0.110640 0.651515 2.775345
881 (Pass) (High_Study, High_Score, High_Attendance, Internet_Access_Yes) 0.159074 0.646769 2.755129
863 (High_Study, High_Score, High_Attendance, Internet_Access_Yes) (Pass) 0.159074 0.677627 2.755129
724 (High_Score, High_Attendance, High_Study, Extracurricular_Activities_Yes) (Pass) 0.101710 0.676737 2.751509
905 (High_Score, High_Study, Internet_Access_Yes, School_Type_Public, High_Attendance) (Pass) 0.110640 0.675601 2.746889
903 (School_Type_Public, Pass) (High_Score, High_Attendance, High_Study) 0.116997 0.688948 2.712683
927 (School_Type_Public, Internet_Access_Yes, Pass) (High_Score, High_Attendance, High_Study) 0.110640 0.687030 2.705130
578 (High_Score, High_Attendance, High_Study) (Pass) 0.168760 0.664482 2.701680
586 (Pass) (High_Score, High_Attendance, High_Study) 0.168760 0.686154 2.701680
877 (Internet_Access_Yes, Pass) (High_Score, High_Attendance, High_Study) 0.159074 0.683800 2.692410
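The rule metrics reduce to simple frequency ratios, so they can be checked by hand. A toy sketch in pandas (the report itself used mlxtend's `apriori`/`association_rules`):

```python
# Support, confidence, and lift for the rule High_Attendance -> Pass
# on a tiny boolean frame (toy data, not the report's).
import pandas as pd

basket = pd.DataFrame({
    "High_Attendance": [True, True, False, True, True],
    "Pass":            [True, False, False, True, True],
})

support_a = basket["High_Attendance"].mean()                        # P(A)
support_both = (basket["High_Attendance"] & basket["Pass"]).mean()  # P(A and B)
confidence = support_both / support_a                               # P(B | A)
lift = confidence / basket["Pass"].mean()                           # P(B|A) / P(B)
```

A lift above 1 (here 1.25; about 2.7-2.8 in the report's top rules) means the consequent is that much more likely when the antecedent holds than at baseline.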
============================================================
4. DIMENSIONALITY REDUCTION WITH PCA
============================================================
Principal Components Analysis:
Total features: 19
Explained variance by component:
PC1: 0.0574 (5.7%), Cumulative: 5.7%
PC2: 0.0567 (5.7%), Cumulative: 11.4%
PC3: 0.0558 (5.6%), Cumulative: 17.0%
PC4: 0.0554 (5.5%), Cumulative: 22.5%
PC5: 0.0551 (5.5%), Cumulative: 28.0%
PC6: 0.0546 (5.5%), Cumulative: 33.5%
PC7: 0.0537 (5.4%), Cumulative: 38.9%
PC8: 0.0534 (5.3%), Cumulative: 44.2%
PC9: 0.0532 (5.3%), Cumulative: 49.5%
PC10: 0.0520 (5.2%), Cumulative: 54.7%
PC11: 0.0518 (5.2%), Cumulative: 59.9%
PC12: 0.0514 (5.1%), Cumulative: 65.1%
PC13: 0.0513 (5.1%), Cumulative: 70.2%
PC14: 0.0510 (5.1%), Cumulative: 75.3%
PC15: 0.0505 (5.0%), Cumulative: 80.3%
PC16: 0.0502 (5.0%), Cumulative: 85.4%
PC17: 0.0496 (5.0%), Cumulative: 90.3%
PC18: 0.0486 (4.9%), Cumulative: 95.2%
PC19: 0.0483 (4.8%), Cumulative: 100.0%
Components needed for 95% variance: 18
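The variance scan above can be sketched as below; synthetic standardized data stands in for the 19 student features. The near-uniform variance ratios in the report (each component explaining roughly 5%) indicate the features carry largely independent information, which is why 18 of 19 components are needed.

```python
# Cumulative explained variance from a full PCA fit.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = StandardScaler().fit_transform(rng.random((500, 19)))

pca = PCA().fit(X)                                   # keep all 19 components
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_for_95 = int(np.argmax(cumulative >= 0.95)) + 1    # first index reaching 95%
```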
============================================================
5. ANOMALY DETECTION
============================================================
Running Isolation Forest...
Anomalies detected: 661 (10.0%)
Running Local Outlier Factor...
Anomalies detected: 661 (10.0%)
Running Elliptic Envelope...
Anomalies detected: 661 (10.0%)
Anomaly Detection Comparison:
Method Anomalies_Detected Percentage
Isolation Forest 661 10.004541
Local Outlier Factor 661 10.004541
Elliptic Envelope 661 10.004541
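All three detectors flag the same 10% because `contamination=0.1` fixes the flagged fraction up front; what differs between methods is *which* records are flagged, not how many. A sketch on synthetic data:

```python
# Three scikit-learn outlier detectors at a fixed contamination rate;
# each returns -1 for anomalies and 1 for inliers.
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))

iso = IsolationForest(contamination=0.1, random_state=0).fit_predict(X)
lof = LocalOutlierFactor(contamination=0.1).fit_predict(X)
ell = EllipticEnvelope(contamination=0.1, random_state=0).fit_predict(X)

counts = {name: int((pred == -1).sum())
          for name, pred in [("Isolation Forest", iso),
                             ("Local Outlier Factor", lof),
                             ("Elliptic Envelope", ell)]}
```

A more informative comparison would be the overlap between the three flagged sets (e.g. via set intersection on the flagged indices).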
============================================================
6. ENSEMBLE METHODS
============================================================
Training Random Forest (Bagging)...
Accuracy: 0.9145
CV Accuracy: 0.9199 (+/- 0.0116)
Training AdaBoost (Boosting)...
Accuracy: 0.9213
CV Accuracy: 0.9228 (+/- 0.0084)
Training Gradient Boosting...
Accuracy: 0.9327
CV Accuracy: 0.9345 (+/- 0.0213)
Ensemble Methods Comparison:
Algorithm Accuracy Precision Recall F1-Score CV_Mean CV_Std
Random Forest (Bagging) 0.914523 0.930894 0.704615 0.802102 0.919934 0.005816
AdaBoost (Boosting) 0.921331 0.859935 0.812308 0.835443 0.922808 0.004219
Gradient Boosting 0.932678 0.906897 0.809231 0.855285 0.934464 0.010662
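The CV figures above can be sketched with `cross_val_score`; the 5-fold setting here is an assumption (the report does not state its fold count), and the `+/-` values printed earlier are two standard deviations.

```python
# Cross-validated accuracy for two boosting ensembles on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=1)

cv_results = {}
for name, model in [("AdaBoost", AdaBoostClassifier(random_state=1)),
                    ("Gradient Boosting", GradientBoostingClassifier(random_state=1))]:
    scores = cross_val_score(model, X, y, cv=5)   # accuracy per fold
    cv_results[name] = (scores.mean(), scores.std())
```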
============================================================
RANDOM FOREST FEATURE IMPORTANCE
============================================================
feature importance
Attendance 0.331548
Hours_Studied 0.212918
Previous_Scores 0.088600
Tutoring_Sessions 0.043570
Access_to_Resources 0.038045
Sleep_Hours 0.034478
Parental_Involvement 0.031346
Physical_Activity 0.030850
Family_Income 0.024878
Parental_Education_Level 0.024545
============================================================
7. NEURAL NETWORK ANALYSIS
============================================================
Training Neural Network...
Neural Network Performance:
Accuracy: 0.9561
Precision: 0.9238
Recall: 0.8954
F1-Score: 0.9094
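A sketch of a standalone neural-network step, assuming scikit-learn's `MLPClassifier`; the hidden-layer sizes and iteration budget are placeholders, since the report does not name its architecture.

```python
# Train a small multilayer perceptron and score it on a held-out split.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=500, n_features=19, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=7)

mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=7)
mlp.fit(X_train, y_train)
score = f1_score(y_test, mlp.predict(X_test))
```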
============================================================
FINAL COMPREHENSIVE ANALYSIS
============================================================
ALL ALGORITHMS RANKED BY ACCURACY:
Algorithm Accuracy F1-Score Precision Recall
Neural Network 0.937216 0.865041 0.917241 0.818462
Support Vector Machine 0.934191 0.852292 0.950758 0.772308
Gradient Boosting 0.932678 0.855285 0.906897 0.809231
Naïve Bayes 0.925870 0.833898 0.928302 0.756923
AdaBoost (Boosting) 0.921331 0.835443 0.859935 0.812308
Random Forest 0.914523 0.802102 0.930894 0.704615
Random Forest (Bagging) 0.914523 0.802102 0.930894 0.704615
Decision Tree 0.854009 0.689211 0.722973 0.658462
K-Nearest Neighbors 0.763238 0.394584 0.531250 0.313846
============================================================
SUMMARY AND KEY FINDINGS
============================================================
BEST PERFORMING ALGORITHM: Neural Network
Accuracy: 0.9372
F1-Score: 0.8650
Precision: 0.9172
Recall: 0.8185
CLUSTERING INSIGHTS:
Optimal number of clusters: 3
Students distributed across clusters:
Cluster 0: 2381 students (36.0%)
Cluster 1: 3531 students (53.4%)
Cluster 2: 695 students (10.5%)
DIMENSIONALITY REDUCTION:
Original features: 19
Features for 95% variance: 18
Reduction possible: 5.3%
ANOMALY DETECTION:
Average anomalies detected: 10.0%
Recommended review: 661 student records
TOP 5 MOST IMPORTANT FEATURES:
Attendance: 0.3315
Hours_Studied: 0.2129
Previous_Scores: 0.0886
Tutoring_Sessions: 0.0436
Access_to_Resources: 0.0380
============================================================
ANALYSIS COMPLETE
============================================================
Rationale¶
Selection Criteria¶
- Mixed data types
- Potential non-linear relationships
- Many interactions
- Results should be interpretable
Model Summaries¶
1. Decision Tree / Random Forest¶
- Interpretability
- Identifies which factors most influence student performance
- Can capture complex interactions
Potential Use Cases:
- "What factors most predict student success?"
- "What interventions would help specific student profiles?"
2. K-Nearest Neighbors¶
- Simple comparative baseline
- No distribution assumptions
- Lazy learning models can incorporate new student data without retraining
Potential Use Cases:
- Finding similar student cases
- Students may fail in recognizable ways
- Personalized intervention suggestions based on prior student performance
3. Neural Networks¶
- Can capture complex non-linear interactions
- Robust to higher dimensionality
Potential Use Cases:
- When many factors interact in unpredictable ways
- When prediction accuracy is prioritized over interpretability
4. Naïve Bayes¶
- Computationally inexpensive
- Provides probability of passing instead of yes/no
Potential Use Cases:
- Both categorical and continuous features (using GaussianNB)
- Small datasets (less of a factor here, with 6,607 records)
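The pass-probability output mentioned above can be sketched with `GaussianNB.predict_proba` (synthetic data, not the report's):

```python
# Class-probability predictions from Gaussian Naive Bayes.
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=300, n_features=5, random_state=3)
nb = GaussianNB().fit(X, y)

proba_pass = nb.predict_proba(X[:5])[:, 1]   # P(class 1) for five students
```

Ranking students by this probability, rather than thresholding at 0.5, lets interventions target the most borderline cases first.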
5. Support Vector Machines¶
- Robust to high dimensionality
- Finds a maximum-margin boundary between pass/fail (convex objective, so the optimum is global)
- Can handle non-linear relationships
- Good generalization, resistant to overfitting
6. Random Forest, Boosting (Ensemble Methods)¶
- Combines multiple models
- Reduced susceptibility to overfitting
- Reduces variance, handles noise (Random Forest)
- Focuses on hard-to-predict cases (AdaBoost/Gradient)
7. Association Rule Mining (Apriori)¶
- Discovers interesting patterns
- Highly interpretable, immediately actionable
Additional Algorithms¶
Clustering (K-means)
- Groups similar students for targeted interventions
- Identifies unusual student profiles
- Tailor teaching strategies to different clusters
PCA (Dimensionality Reduction)
- Plot students in 2D to identify patterns
- Removes redundant features
- Speeds up other algorithms
Anomaly Detection (Isolation Forest)
- Flags students with unusual profiles
- Detects data errors or exceptional cases
- Finds students who don't fit normal patterns
Excluded Algorithms¶
RIPPER/CN2/1R/AQ (Rule-based)
- Covered by decision tree
- Less accurate than ensemble methods
Bayesian Belief Networks
- Too complex for this application
DBSCAN
- K-means is more interpretable