Generate Report

Loading dataset...
Dataset shape: (6607, 20)
Columns: ['Hours_Studied', 'Attendance', 'Parental_Involvement', 'Access_to_Resources', 'Sleep_Hours', 'Previous_Scores', 'Motivation_Level', 'Tutoring_Sessions', 'Family_Income', 'Teacher_Quality', 'Peer_Influence', 'Physical_Activity', 'Parental_Education_Level', 'Distance_from_Home', 'Exam_Score', 'Extracurricular_Activities_Yes', 'Internet_Access_Yes', 'School_Type_Public', 'Learning_Disabilities_Yes', 'Gender_Male']

============================================================
DATASET OVERVIEW
============================================================
   Hours_Studied  Attendance  Parental_Involvement  Access_to_Resources  \
0       0.511628       0.600                     0                    2   
1       0.418605       0.100                     0                    1   
2       0.534884       0.950                     1                    1   
3       0.651163       0.725                     0                    1   
4       0.418605       0.800                     1                    1   

   Sleep_Hours  Previous_Scores  Motivation_Level  Tutoring_Sessions  \
0     0.500000             0.46                 0              0.000   
1     0.666667             0.18                 0              0.250   
2     0.500000             0.82                 1              0.250   
3     0.666667             0.96                 1              0.125   
4     0.333333             0.30                 1              0.375   

   Family_Income  Teacher_Quality  Peer_Influence  Physical_Activity  \
0              0                1               2           0.500000   
1              1                1               0           0.666667   
2              1                1               1           0.666667   
3              1                1               0           0.666667   
4              1                2               1           0.666667   

   Parental_Education_Level  Distance_from_Home  Exam_Score  \
0                         0                   0          67   
1                         1                   1          61   
2                         2                   0          74   
3                         0                   1          71   
4                         1                   0          70   

   Extracurricular_Activities_Yes  Internet_Access_Yes  School_Type_Public  \
0                           False                 True                True   
1                           False                 True                True   
2                            True                 True                True   
3                            True                 True                True   
4                            True                 True                True   

   Learning_Disabilities_Yes  Gender_Male  
0                      False         True  
1                      False        False  
2                      False         True  
3                      False         True  
4                      False        False  

Data types:
Hours_Studied                     float64
Attendance                        float64
Parental_Involvement                int64
Access_to_Resources                 int64
Sleep_Hours                       float64
Previous_Scores                   float64
Motivation_Level                    int64
Tutoring_Sessions                 float64
Family_Income                       int64
Teacher_Quality                     int64
Peer_Influence                      int64
Physical_Activity                 float64
Parental_Education_Level            int64
Distance_from_Home                  int64
Exam_Score                          int64
Extracurricular_Activities_Yes       bool
Internet_Access_Yes                  bool
School_Type_Public                   bool
Learning_Disabilities_Yes            bool
Gender_Male                          bool
dtype: object

Missing values:
Hours_Studied                     0
Attendance                        0
Parental_Involvement              0
Access_to_Resources               0
Sleep_Hours                       0
Previous_Scores                   0
Motivation_Level                  0
Tutoring_Sessions                 0
Family_Income                     0
Teacher_Quality                   0
Peer_Influence                    0
Physical_Activity                 0
Parental_Education_Level          0
Distance_from_Home                0
Exam_Score                        0
Extracurricular_Activities_Yes    0
Internet_Access_Yes               0
School_Type_Public                0
Learning_Disabilities_Yes         0
Gender_Male                       0
dtype: int64

Total missing: 0

============================================================
DATA PREPARATION
============================================================
Created binary target: Pass (≥70) = 1625, Fail = 4982
Pass rate: 24.60%
Features shape: (6607, 19)
Target shape: (6607,)
Train set: (5285, 19), Test set: (1322, 19)

============================================================
1. CLASSIFICATION ALGORITHMS COMPARISON
============================================================

Training Decision Tree...
  Accuracy: 0.8540
  Precision: 0.7230
  Recall: 0.6585
  F1-Score: 0.6892

Training Random Forest...
  Accuracy: 0.9145
  Precision: 0.9309
  Recall: 0.7046
  F1-Score: 0.8021

Training K-Nearest Neighbors...
  Accuracy: 0.7632
  Precision: 0.5312
  Recall: 0.3138
  F1-Score: 0.3946

Training Neural Network...
  Accuracy: 0.9372
  Precision: 0.9172
  Recall: 0.8185
  F1-Score: 0.8650

Training Naïve Bayes...
  Accuracy: 0.9259
  Precision: 0.9283
  Recall: 0.7569
  F1-Score: 0.8339

Training Support Vector Machine...
  Accuracy: 0.9342
  Precision: 0.9508
  Recall: 0.7723
  F1-Score: 0.8523

Training Gradient Boosting...
  Accuracy: 0.9327
  Precision: 0.9069
  Recall: 0.8092
  F1-Score: 0.8553

============================================================
DECISION TREE ANALYSIS AND VISUALIZATION
============================================================

Decision Tree Rules (first 3 levels):
|--- Attendance <= 0.64
|   |--- Hours_Studied <= 0.59
|   |   |--- Hours_Studied <= 0.45
|   |   |   |--- Tutoring_Sessions <= 0.81
|   |   |   |   |--- truncated branch of depth 10
|   |   |   |--- Tutoring_Sessions >  0.81
|   |   |   |   |--- truncated branch of depth 2
|   |   |--- Hours_Studied >  0.45
|   |   |   |--- Attendance <= 0.56
|   |   |   |   |--- truncated branch of depth 13
|   |   |   |--- Attendance >  0.56
|   |   |   |   |--- truncated branch of depth 8
|   |--- Hours_Studied >  0.59
|   |   |--- Attendance <= 0.39
|   |   |   |--- Hours_Studied <= 0.66
|   |   |   |   |--- truncated branch of depth 6
|   |   |   |--- Hours_Studied >  0.66
|   |   |   |   |--- truncated branch of depth 9
|   |   |--- Attendance >  0.39
|   |   |   |--- Previous_Scores <= 0.35
|   |   |   |   |--- truncated branch of depth 7
|   |   |   |--- Previous_Scores >  0.35
|   |   |   |   |--- truncated branch of depth 8
|--- Attendance >  0.64
|   |--- Hours_Studied <= 0.45
|   |   |--- Hours_Studied <= 0.34
|   |   |   |--- Access_to_Resources <= 1.50
|   |   |   |   |--- truncated branch of depth 9
|   |   |   |--- Access_to_Resources >  1.50
|   |   |   |   |--- truncated branch of depth 7
|   |   |--- Hours_Studied >  0.34
|   |   |   |--- Previous_Scores <= 0.51
|   |   |   |   |--- truncated branch of depth 12
|   |   |   |--- Previous_Scores >  0.51
|   |   |   |   |--- truncated branch of depth 11
|   |--- Hours_Studied >  0.45
|   |   |--- Hours_Studied <= 0.55
|   |   |   |--- Parental_Involvement <= 0.50
|   |   |   |   |--- truncated branch of depth 8
|   |   |   |--- Parental_Involvement >  0.50
|   |   |   |   |--- truncated branch of depth 10
|   |   |--- Hours_Studied >  0.55
|   |   |   |--- Access_to_Resources <= 0.50
|   |   |   |   |--- truncated branch of depth 8
|   |   |   |--- Access_to_Resources >  0.50
|   |   |   |   |--- truncated branch of depth 8

Top 10 Features from Decision Tree:
                 feature  importance
              Attendance    0.334078
           Hours_Studied    0.222294
         Previous_Scores    0.101916
       Tutoring_Sessions    0.059365
     Access_to_Resources    0.045726
    Parental_Involvement    0.043599
       Physical_Activity    0.025517
           Family_Income    0.025028
             Sleep_Hours    0.022586
Parental_Education_Level    0.019073
----------------------------------------
DECISION PATH ANALYSIS
----------------------------------------

Student #0 (Actual: Fail, Predicted: Fail):
  Top influencing features:
    Attendance: 0.5750 (importance: 0.3341)
    Hours_Studied: 0.4419 (importance: 0.2223)
    Previous_Scores: 0.5400 (importance: 0.1019)
    Tutoring_Sessions: 0.0000 (importance: 0.0594)
    Access_to_Resources: 0.0000 (importance: 0.0457)

Student #10 (Actual: Fail, Predicted: Fail):
  Top influencing features:
    Attendance: 0.3750 (importance: 0.3341)
    Hours_Studied: 0.5581 (importance: 0.2223)
    Previous_Scores: 0.6400 (importance: 0.1019)
    Tutoring_Sessions: 0.3750 (importance: 0.0594)
    Access_to_Resources: 0.0000 (importance: 0.0457)

Student #20 (Actual: Pass, Predicted: Pass):
  Top influencing features:
    Attendance: 0.6750 (importance: 0.3341)
    Hours_Studied: 0.5814 (importance: 0.2223)
    Previous_Scores: 0.4000 (importance: 0.1019)
    Tutoring_Sessions: 0.3750 (importance: 0.0594)
    Access_to_Resources: 2.0000 (importance: 0.0457)

============================================================
CLASSIFICATION RESULTS SUMMARY
============================================================
             Algorithm  Accuracy  Precision   Recall  F1-Score
        Neural Network  0.937216   0.917241 0.818462  0.865041
Support Vector Machine  0.934191   0.950758 0.772308  0.852292
     Gradient Boosting  0.932678   0.906897 0.809231  0.855285
           Naïve Bayes  0.925870   0.928302 0.756923  0.833898
         Random Forest  0.914523   0.930894 0.704615  0.802102
         Decision Tree  0.854009   0.722973 0.658462  0.689211
   K-Nearest Neighbors  0.763238   0.531250 0.313846  0.394584
============================================================
NAÏVE BAYES ANALYSIS AND VISUALIZATION
============================================================

----------------------------------------
FEATURE DISTRIBUTION ANALYSIS
----------------------------------------

Top 10 Features by Class Discrimination:
                 feature  mean_fail  mean_pass  mean_diff  discrimination_score
              Attendance     0.4195     0.7516     0.3321                1.4235
           Hours_Studied     0.4113     0.5337     0.1224                0.9669
     Access_to_Resources     1.0462     1.2954     0.2492                0.3659
         Previous_Scores     0.4741     0.5774     0.1032                0.3647
    Parental_Involvement     1.0409     1.2500     0.2091                0.3062
       Tutoring_Sessions     0.1751     0.2188     0.0436                0.2777
          Peer_Influence     1.1448     1.3085     0.1637                0.2208
      Distance_from_Home     0.5405     0.4000     0.1405                0.2161
Parental_Education_Level     0.6612     0.8123     0.1511                0.1922
        Motivation_Level     0.8790     0.9908     0.1117                0.1607
============================================================
2. CLUSTERING ANALYSIS
============================================================
Determining optimal number of clusters...
Silhouette Scores:
  K=2: Silhouette Score = 0.0646
  K=3: Silhouette Score = 0.0556
  K=4: Silhouette Score = 0.0523
  K=5: Silhouette Score = 0.0574

Cluster distribution:
Cluster
0    2381
1    3531
2     695
Name: count, dtype: int64

============================================================
CLUSTER CHARACTERISTICS
============================================================
         Exam_Score  Hours_Studied  Attendance  Previous_Scores  \
Cluster                                                           
0             66.96           0.45        0.50             0.50   
1             67.61           0.44        0.51             0.50   
2             66.27           0.44        0.48             0.51   

         Parental_Involvement  
Cluster                        
0                        1.08  
1                        1.09  
2                        1.10  
============================================================
3. ASSOCIATION RULE MINING (APRIORI ALGORITHM)
============================================================
Using 9 binary columns for association analysis

Running Apriori algorithm...

Found 936 association rules

Top 10 Association Rules (sorted by Lift):
                                                                            antecedents                                                     consequents   support  confidence      lift
934                                                          (School_Type_Public, Pass)  (High_Study, High_Score, High_Attendance, Internet_Access_Yes)  0.110640    0.651515  2.775345
881                                                                              (Pass)  (High_Study, High_Score, High_Attendance, Internet_Access_Yes)  0.159074    0.646769  2.755129
863                      (High_Study, High_Score, High_Attendance, Internet_Access_Yes)                                                          (Pass)  0.159074    0.677627  2.755129
724           (High_Score, High_Attendance, High_Study, Extracurricular_Activities_Yes)                                                          (Pass)  0.101710    0.676737  2.751509
905  (High_Score, High_Study, Internet_Access_Yes, School_Type_Public, High_Attendance)                                                          (Pass)  0.110640    0.675601  2.746889
903                                                          (School_Type_Public, Pass)                       (High_Score, High_Attendance, High_Study)  0.116997    0.688948  2.712683
927                                     (School_Type_Public, Internet_Access_Yes, Pass)                       (High_Score, High_Attendance, High_Study)  0.110640    0.687030  2.705130
578                                           (High_Score, High_Attendance, High_Study)                                                          (Pass)  0.168760    0.664482  2.701680
586                                                                              (Pass)                       (High_Score, High_Attendance, High_Study)  0.168760    0.686154  2.701680
877                                                         (Internet_Access_Yes, Pass)                       (High_Score, High_Attendance, High_Study)  0.159074    0.683800  2.692410
============================================================
4. DIMENSIONALITY REDUCTION WITH PCA
============================================================
Principal Components Analysis:
Total features: 19

Explained variance by component:
  PC1: 0.0574 (5.7%), Cumulative: 5.7%
  PC2: 0.0567 (5.7%), Cumulative: 11.4%
  PC3: 0.0558 (5.6%), Cumulative: 17.0%
  PC4: 0.0554 (5.5%), Cumulative: 22.5%
  PC5: 0.0551 (5.5%), Cumulative: 28.0%
  PC6: 0.0546 (5.5%), Cumulative: 33.5%
  PC7: 0.0537 (5.4%), Cumulative: 38.9%
  PC8: 0.0534 (5.3%), Cumulative: 44.2%
  PC9: 0.0532 (5.3%), Cumulative: 49.5%
  PC10: 0.0520 (5.2%), Cumulative: 54.7%
  PC11: 0.0518 (5.2%), Cumulative: 59.9%
  PC12: 0.0514 (5.1%), Cumulative: 65.1%
  PC13: 0.0513 (5.1%), Cumulative: 70.2%
  PC14: 0.0510 (5.1%), Cumulative: 75.3%
  PC15: 0.0505 (5.0%), Cumulative: 80.3%
  PC16: 0.0502 (5.0%), Cumulative: 85.4%
  PC17: 0.0496 (5.0%), Cumulative: 90.3%
  PC18: 0.0486 (4.9%), Cumulative: 95.2%
  PC19: 0.0483 (4.8%), Cumulative: 100.0%

Components needed for 95% variance: 18
============================================================
5. ANOMALY DETECTION
============================================================

Running Isolation Forest...
  Anomalies detected: 661 (10.0%)

Running Local Outlier Factor...
  Anomalies detected: 661 (10.0%)

Running Elliptic Envelope...
  Anomalies detected: 661 (10.0%)

Anomaly Detection Comparison:
              Method  Anomalies_Detected  Percentage
    Isolation Forest                 661   10.004541
Local Outlier Factor                 661   10.004541
   Elliptic Envelope                 661   10.004541
============================================================
6. ENSEMBLE METHODS
============================================================

Training Random Forest (Bagging)...
  Accuracy: 0.9145
  CV Accuracy: 0.9199 (+/- 0.0116)

Training AdaBoost (Boosting)...
  Accuracy: 0.9213
  CV Accuracy: 0.9228 (+/- 0.0084)

Training Gradient Boosting...
  Accuracy: 0.9327
  CV Accuracy: 0.9345 (+/- 0.0213)

Ensemble Methods Comparison:
              Algorithm  Accuracy  Precision   Recall  F1-Score  CV_Mean   CV_Std
Random Forest (Bagging)  0.914523   0.930894 0.704615  0.802102 0.919934 0.005816
    AdaBoost (Boosting)  0.921331   0.859935 0.812308  0.835443 0.922808 0.004219
      Gradient Boosting  0.932678   0.906897 0.809231  0.855285 0.934464 0.010662

============================================================
RANDOM FOREST FEATURE IMPORTANCE
============================================================
                 feature  importance
              Attendance    0.331548
           Hours_Studied    0.212918
         Previous_Scores    0.088600
       Tutoring_Sessions    0.043570
     Access_to_Resources    0.038045
             Sleep_Hours    0.034478
    Parental_Involvement    0.031346
       Physical_Activity    0.030850
           Family_Income    0.024878
Parental_Education_Level    0.024545
============================================================
7. NEURAL NETWORK ANALYSIS
============================================================
Training Neural Network...
Neural Network Performance:
  Accuracy: 0.9561
  Precision: 0.9238
  Recall: 0.8954
  F1-Score: 0.9094
============================================================
FINAL COMPREHENSIVE ANALYSIS
============================================================

ALL ALGORITHMS RANKED BY ACCURACY:
              Algorithm  Accuracy  F1-Score  Precision   Recall
         Neural Network  0.937216  0.865041   0.917241 0.818462
 Support Vector Machine  0.934191  0.852292   0.950758 0.772308
      Gradient Boosting  0.932678  0.855285   0.906897 0.809231
            Naïve Bayes  0.925870  0.833898   0.928302 0.756923
    AdaBoost (Boosting)  0.921331  0.835443   0.859935 0.812308
          Random Forest  0.914523  0.802102   0.930894 0.704615
Random Forest (Bagging)  0.914523  0.802102   0.930894 0.704615
          Decision Tree  0.854009  0.689211   0.722973 0.658462
    K-Nearest Neighbors  0.763238  0.394584   0.531250 0.313846
============================================================
SUMMARY AND KEY FINDINGS
============================================================

BEST PERFORMING ALGORITHM: Neural Network
  Accuracy: 0.9372
  F1-Score: 0.8650
  Precision: 0.9172
  Recall: 0.8185

CLUSTERING INSIGHTS:
  Optimal number of clusters: 3
  Students distributed across clusters:
    Cluster 0: 2381 students (36.0%)
    Cluster 1: 3531 students (53.4%)
    Cluster 2: 695 students (10.5%)

DIMENSIONALITY REDUCTION:
  Original features: 19
  Features for 95% variance: 18
  Reduction possible: 5.3%

ANOMALY DETECTION:
  Average anomalies detected: 10.0%
  Recommended review: 661 student records

TOP 5 MOST IMPORTANT FEATURES:
  Attendance: 0.3315
  Hours_Studied: 0.2129
  Previous_Scores: 0.0886
  Tutoring_Sessions: 0.0436
  Access_to_Resources: 0.0380

============================================================
ANALYSIS COMPLETE
============================================================

Rationale

Selection Criteria

  1. Mixed data types
  2. Potential non-linear relationships
  3. Many interactions
  4. Results should be interpretable

Model Summaries

1. Decision Tree / Random Forest

  • Interpretability
  • Identifies which factors most influence student performance
  • Can capture complex interactions

Potential Use Cases:

  • "What factors most predict student success?"
  • "What interventions would help specific student profiles?"
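
As a rough illustration of how a tree ranks factors, the sketch below scores one candidate split by its Gini impurity decrease. The attendance values and exam scores are the first five rows of the dataset overview (a score of 70 or more counts as a pass), and 0.64 is the root threshold from the printed tree rules. The function names are hypothetical, and the notebook's own importances may have been computed differently.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 1.0 - p**2 - (1.0 - p) ** 2


def gini_decrease(values, labels, threshold):
    """Impurity decrease obtained by splitting on `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted


# First five rows of the overview table (scaled Attendance, raw Exam_Score).
attendance = [0.600, 0.100, 0.950, 0.725, 0.800]
exam_scores = [67, 61, 74, 71, 70]
passed = [int(score >= 70) for score in exam_scores]

gain = gini_decrease(attendance, passed, threshold=0.64)
print(round(gain, 2))  # 0.48 on these five rows: the split separates them perfectly
```

Summing such decreases over every node where a feature is used, weighted by the share of samples reaching the node, is the usual basis for impurity-based feature importance.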

2. K-Nearest Neighbor

  • Instance-based: classifies a student by comparison with the most similar known students
  • No distribution assumptions
  • Lazy learning models can incorporate new student data without retraining

Potential Use Cases:

  • Finding similar student cases
  • Recognizing recurring failure patterns (students may fail in recognizable ways)
  • Personalized intervention suggestions based on prior student performance
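
A minimal sketch of the "similar student" lookup, using only the standard library. The five stored rows are the first five students from the dataset overview, restricted to three features; the new student and the helper name `nearest_students` are hypothetical, and the notebook itself may have used a library classifier instead.

```python
import math

# (Hours_Studied, Attendance, Previous_Scores) for rows 0-4 of the overview.
students = [
    (0.511628, 0.600, 0.46),
    (0.418605, 0.100, 0.18),
    (0.534884, 0.950, 0.82),
    (0.651163, 0.725, 0.96),
    (0.418605, 0.800, 0.30),
]
exam_scores = [67, 61, 74, 71, 70]


def nearest_students(query, rows, k):
    """Return indices of the k rows closest to `query` (Euclidean distance)."""
    order = sorted(range(len(rows)), key=lambda i: math.dist(query, rows[i]))
    return order[:k]


# Hypothetical new student; looking them up requires no retraining.
new_student = (0.50, 0.70, 0.50)
neighbors = nearest_students(new_student, students, k=3)

# Majority vote over the neighbors' pass/fail outcomes (pass means score >= 70).
votes = sum(exam_scores[i] >= 70 for i in neighbors)
predicted_pass = votes > len(neighbors) / 2
print(neighbors, predicted_pass)
```

Adding a student to the model is just appending a row to `students` and `exam_scores`, which is what "lazy learning" means in practice.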

3. Neural Networks

  • Can capture complex non-linear interactions
  • Robust to higher dimensionality

Potential Use Cases:

  • When many factors interact in unpredictable ways
  • When prediction accuracy is prioritized over interpretability

4. Naïve Bayes

  • Computationally inexpensive
  • Provides a probability of passing rather than only a yes/no prediction

Potential Use Cases:

  • Both categorical and continuous features (using GaussianNB)
  • Small datasets (this dataset has 6,607 rows)
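
To show what "probability of passing" looks like, here is a one-feature Gaussian Naïve Bayes sketch. The class means for Attendance and the pass rate are taken from the report; the shared standard deviation of 0.23 is an assumption made for illustration, since the report does not print per-class variances. A full model would multiply the likelihoods of all 19 features in the same way.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def pass_probability(attendance, prior_pass=0.2460,
                     mean_pass=0.7516, mean_fail=0.4195, std=0.23):
    """Posterior P(pass | attendance) via Bayes' rule.

    prior_pass, mean_pass and mean_fail come from the report;
    std is an assumed value, not a fitted one.
    """
    joint_pass = prior_pass * gaussian_pdf(attendance, mean_pass, std)
    joint_fail = (1.0 - prior_pass) * gaussian_pdf(attendance, mean_fail, std)
    return joint_pass / (joint_pass + joint_fail)


for attendance in (0.30, 0.60, 0.90):
    print(attendance, round(pass_probability(attendance), 3))
```

The output is a graded score between 0 and 1, so students can be ranked by risk instead of only being labeled pass or fail.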

5. Support Vector Machines

  • Robust to high dimensionality
  • Finds a globally optimal (maximum-margin) boundary between pass and fail
  • Can handle non-linear relationships
  • Good generalization, resistant to overfitting

6. Random Forest, Boosting (Ensemble Methods)

  • Combines multiple models
  • Reduces susceptibility to overfitting
  • Reduces variance, handles noise (Random Forest)
  • Focuses on hard-to-predict cases (AdaBoost/Gradient Boosting)
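
Why combining models helps can be sketched with a simple calculation: the chance that a majority of voters is right, given that each is right with the same probability. The 70% per-model accuracy is a made-up figure, and the independence of the models is an idealizing assumption; real trees in a forest are correlated, so the gain is smaller in practice.

```python
from math import comb


def majority_vote_accuracy(n_models, p_correct):
    """P(a majority of n independent models is correct); n should be odd."""
    needed = n_models // 2 + 1
    return sum(
        comb(n_models, k) * p_correct**k * (1 - p_correct) ** (n_models - k)
        for k in range(needed, n_models + 1)
    )


# Hypothetical per-model accuracy of 0.70.
for n in (1, 5, 25):
    print(n, round(majority_vote_accuracy(n, 0.70), 4))
```

With five such voters the majority is right about 83.7% of the time. Bootstrap sampling and random feature subsets in a Random Forest are ways of pushing the individual trees toward the independence this calculation assumes.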

7. Association Rule Mining (Apriori)

  • Discovers interesting patterns
  • Highly interpretable, immediately actionable
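
For reading the rule table, it helps to see how lift follows from the other columns. The sketch below recomputes the lift of rule 578 from the report, (High_Score, High_Attendance, High_Study) → (Pass), using only figures printed in the output. The variable names are illustrative.

```python
n_students = 6607
n_pass = 1625                 # from the data preparation output

# Rule 578: (High_Score, High_Attendance, High_Study) -> (Pass)
support_rule = 0.168760       # P(antecedent and Pass)
confidence = 0.664482         # P(Pass | antecedent)

support_pass = n_pass / n_students              # P(Pass), about 0.246
lift = confidence / support_pass                # how much the rule beats the base rate
support_antecedent = support_rule / confidence  # P(antecedent)

print(round(lift, 3))  # 2.702, matching the reported lift of 2.701680
```

A lift near 2.7 means students who match the antecedent pass roughly 2.7 times as often as students overall.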

Additional Algorithms

Clustering (K-means)

  • Groups similar students for targeted interventions
  • Identifies unusual student profiles
  • Tailor teaching strategies to different clusters

PCA (Dimensionality Reduction)

  • Plot students in 2D to identify patterns
  • Removes redundant features
  • Speeds up other algorithms
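
How much PCA can actually remove depends on the variance spectrum. The sketch below takes the explained variance ratios printed in the report and counts the components needed to reach a target; the helper name `components_for` is hypothetical.

```python
from itertools import accumulate

# Explained variance ratios for PC1..PC19, copied from the PCA output.
explained = [
    0.0574, 0.0567, 0.0558, 0.0554, 0.0551, 0.0546, 0.0537, 0.0534,
    0.0532, 0.0520, 0.0518, 0.0514, 0.0513, 0.0510, 0.0505, 0.0502,
    0.0496, 0.0486, 0.0483,
]


def components_for(ratios, target):
    """Smallest number of leading components whose ratios sum to >= target."""
    for count, cumulative in enumerate(accumulate(ratios), start=1):
        if cumulative >= target:
            return count
    return len(ratios)


needed = components_for(explained, 0.95)
reduction = 1 - needed / len(explained)
print(needed, f"{reduction:.1%}")  # 18 5.3%
```

Each component carries close to 1/19 of the variance, which suggests the 19 features are nearly uncorrelated, so on this dataset PCA offers little redundancy to remove.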

Anomaly Detection (Isolation Forest)

  • Flags students with unusual profiles
  • Detects data errors or exceptional cases
  • Finds students who don't fit normal patterns

Excluded Algorithms

RIPPER/CN2/1R/AQ (Rule-based)

  • Rule extraction is already covered by the decision tree
  • Less accurate than ensemble methods

Bayesian Belief Networks

  • Too complex for this application

DBSCAN

  • K-means is more interpretable