Machine learning approach to student performance prediction of online learning

Apryl Johns · 5 months ago · 6 minutes read


## Student Performance Prediction in Online Learning: An Enhanced Machine Learning Approach

### Abstract

Student performance is crucial for addressing problems in the learning process and is a significant factor in evaluating learning outcomes. The growing need to use data insights to enhance educational systems has driven the development of educational data mining research. This study proposes a machine learning method for predicting student performance in online learning environments. The key idea is to construct eleven learning behavioral indicators based on the online learning process. By examining the correlation between these eleven indicators and students' online learning scores, we remove the indicators that are weakly correlated with student scores and retain those that are strongly correlated. The retained indicators, referred to as eigenvalue indicators, are then used to train a proposed logistic regression model with Taylor expansion. Experimental results demonstrate that the proposed logistic regression model outperforms comparative models in prediction accuracy. The results also indicate a significant relationship between students' initiative in learning and learning duration, with learning duration having a significant impact on student performance prediction.

### Introduction

Educational data mining has recently gained traction as a viable way to improve higher education management through data-driven decision-making. It aims to leverage new data processing capabilities and the maturity of data mining algorithms to enhance the learning process and transform existing information into knowledge.

Educational data mining involves analyzing educational data (such as student information, educational records, exam scores, participation in online activities, and classroom records) to develop models that improve learning experiences and institutional effectiveness.
This involves extracting knowledge from data stored in various formats and at various granularities across multiple sources (such as enrollment systems, registration systems, and learning management systems), each requiring specific handling. Traditional data mining techniques are often inadequate for these challenges; hence, more advanced data mining methods are necessary for the knowledge discovery process.

### Related Work

Several studies have sought to predict student performance. Conijn et al. [28] used multi-level and standard regressions to predict student performance. However, due to differences in course data, it is difficult to draw general conclusions about the online behavior of at-risk students.

In [29], a convolutional neural network was proposed for student performance prediction, and the results demonstrated successful prediction. This work used traditional and straightforward features to build its prediction model. Similarly, a machine learning method was implemented in [30].

### Methodology

**Overall Scheme**

The proposed method involves four stages:

1. **Data collection:** Data is gathered from online learning platforms and used to predict student performance. Since the collected data typically encompasses diverse data types (e.g., relational and non-relational) and may contain missing or anomalous values, it requires preprocessing based on the learning behavior indicators we construct. This preprocessing step prepares the data for the second stage.
2. **Learning behavior analysis:** Students are classified based on specific criteria to compare their learning behaviors and analyze their behavioral characteristics. To identify whether the behavioral indicators are related to the outcome, we analyzed the correlation between the learning behavioral indicators and online learning scores.
If the analysis indicates no correlation, the behavioral indicators are discarded; otherwise, they are retained as eigenvalues.
3. **Behavior modeling:** We constructed a logistic regression model trained on the eigenvalues.
4. **Student performance prediction:** We used our model to predict student performance.

**Analysis of Learning Behavior**

Online learning takes many forms, so learning behavior indicators based on it are multifaceted. Accordingly, we considered eleven learning behavior indicators, listed in Table 1 and described below.

The learning process consists of three parts: preparation, major learning behavior, and secondary learning behavior.

- **Preparation:** the number of course introductions viewed, course registrations, and course logins.
- **Major learning behavior:** this part, which is critical for monitoring, consists of five behavioral indicators: learning time; resource utilization (the time spent on learning resources divided by the recommended time); the number of repeated views of resources; the number of repeated learning sessions after completing a course; and resource density utilization (the resource view time divided by the time difference between the last and first resource view).
- **Secondary learning behavior:** this part is regarded as learning interaction behavior and comprises the number of forum discussions browsed, posted, and replied to.

We analyzed the correlation between the eleven learning behavioral indicators in Table 1 and the average score achieved by students in Table 2 using SPSS. Consequently, we excluded the learning behavioral indicators with weak correlation, retaining only those with higher correlation.
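
The filtering step can be illustrated with a minimal Python sketch. It is not the study's procedure, which was carried out in SPSS. It assumes the Pearson coefficient (the source does not name the coefficient type), compares the signed value against a cutoff of 0.6, and uses made-up data; the names `pearson`, `select_indicators`, `learning_time`, and `forum_posts` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def select_indicators(indicators, scores, threshold=0.6):
    """Keep indicators whose correlation with the scores meets the threshold.

    Assumption: the signed coefficient is compared with the threshold;
    the source does not say whether absolute values are used.
    """
    kept = {}
    for name, values in indicators.items():
        r = pearson(values, scores)
        if r >= threshold:
            kept[name] = r
    return kept


# Hypothetical toy data for five students (not taken from the study).
indicators = {
    "learning_time": [2.0, 4.0, 6.0, 8.0, 10.0],
    "forum_posts": [3.0, 1.0, 4.0, 1.0, 5.0],
}
scores = [55.0, 60.0, 72.0, 80.0, 91.0]

kept = select_indicators(indicators, scores)
print(sorted(kept))
```

In this toy example only `learning_time` clears the cutoff, so it alone would be carried forward as an eigenvalue indicator.
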
Specifically, learning behavioral indicators with a correlation coefficient below 0.6 are removed, while those at or above 0.6 are retained. These retained learning behavioral indicators are treated as the indicators that affect online learning. For convenience, we refer to them as eigenvalue indicators in subsequent sections.

**Behavioral Modeling**

Based on the obtained eigenvalue indicators, we construct a logistic regression model. Given eigenvalue indicators $x_1, x_2, \ldots, x_m$ and the corresponding weights $\beta_1, \beta_2, \ldots, \beta_m$, the probability that the outcome equals 1 is denoted by $h(x)$. The joint density function of $n$ samples, where $x_i$ is the indicator vector of the $i$-th sample and $y_i$ is its binary outcome, can be calculated as follows:

$$f(y_1, y_2, \ldots, y_n \mid x_1, x_2, \ldots, x_n, \beta) = \prod_{i=1}^n h(x_i)^{y_i}\,(1-h(x_i))^{1-y_i}$$

To achieve accurate prediction results, we introduced a penalized log-likelihood. Substituting (2) into (3), we obtain:

$$L^* = \sum_{i=1}^n \left[ y_i \log h(x_i) + (1-y_i)\log(1-h(x_i)) \right] - \lambda \sum_{j=1}^m \beta_j^2$$

where $\lambda$ is the penalty coefficient; larger values of $\lambda$ impose a stronger penalty on large weights. The penalty term is subtracted because $L^*$ is maximized. Here $y_i$ is the observed outcome of the $i$-th sample, $p_i = h(x_i)$ is the probability that $y_i = 1$, and $\beta_1, \beta_2, \ldots, \beta_m$ are parameters that can be estimated by the maximum likelihood criterion.

**Algorithm Implementation**

The model algorithm is summarized in Algorithm 1 below:

**Input:** Learning behavioral indicators LBI(k), average score of the j-th course AS(j)

**Output:** Prediction accuracy

1. Initialize parameters;
2. For k = 1 to 11:
3. For j = 1 to 15:
4. Calculate the correlation between LBI(k) and AS(j);
5. Obtain the correlation coefficient CR(k, j);
6. If CR(k, j) ≥ 0.6 Then:
7. Save the eigenvalue indicator;
8. End If
9. End For
10. End For
11. Use the saved eigenvalue indicators to build a matrix X with 300 rows and Q columns, where each row corresponds to a student and the Q columns are determined by the number of eigenvalue indicators and the number of courses;
12. Obtain the training set Train(M) by randomly selecting 80% of the samples;
13. Obtain the testing set Test(M);
14. For i = 1 to Imax:
15. Train a logistic regression model h(x) in Eq (1);
16. If the current training accuracy reaches its maximum value:
17. Save the model h(x, Train(M));
18. Obtain the current training accuracy;
19. Break;
20. End If
21. End For
22. Verify the trained h(x, Train(M)) using the testing set Test(M);
23. Obtain the predicted accuracy;

### Experimental Settings

**Datasets**

The experimental datasets are provided by the MOOC platform (https://www.icourse163.org