Understanding the drivers of academic achievement: Evidence for Portugal’s high school system: A machine learning approach

Research output: Thesis (Doctoral Thesis)


Despite the proven superiority of data-driven approaches based on machine learning techniques over survey-based methods, the use of machine learning in the field of education is still in its infancy. To address this situation, this thesis aims to predict academic achievement by presenting a machine learning framework specifically designed to analyze the enormous amount of data provided by the public administration. In detail, the key goals are to: (i) apply data science and machine learning methods in the context of academic achievement; (ii) conduct a study of academic achievement that captures virtually the entire universe of public high school students, i.e., does not rely on sample data; (iii) use predictive models to actively “flag” those students with a greater likelihood of underperforming, thereby enabling an appropriate educational response; and (iv) contribute to the development of the domain by focusing on hypothetical novel quantitative approaches. In pursuing these objectives, the research uses a pioneering public initiative, MISI, that holds data on high school systems and students’ academic achievement at the country level. Sections 1 and 2, the Introduction and the Literature review, shape the research framework accordingly. The section 3 study uses an anonymized dataset for the 2014-15 school year from the Directorate-General for Statistics of Education and Science of the Portuguese Ministry of Education to compare the predictive power of the classic multilinear regression model with that of a chosen set of machine learning algorithms. A multilinear regression model is used in parallel with random forest, support vector machine, artificial neural network, and extreme gradient boosting machine implementations, combined in a stacking ensemble.
The intent is a hybrid analysis in which classical statistical analysis and artificial intelligence algorithms are blended to strengthen the conclusions and better support the results. The machine learning algorithms attain a higher level of predictive ability. In addition, stacking becomes more appropriate as the determinant of the base learners’ output correlation matrix increases, and the empirical distributions of the random forest feature importances correlate with the structure of p-values and the statistical significance test results of the multiple linear model. An information system that supports the nationwide education system should be designed and structured to collect meaningful and precise data about the full range of academic achievement antecedents. The study concludes that there is no evidence in favour of smaller classes. The section 4 study focuses on machine learning bias when predicting teacher grades. The experimental phase consists of predicting the grades of 11th and 12th grade Portuguese high school students and computing the bias and variance decomposition. In the base implementation, only the critical factors of academic achievement are considered. In the second implementation, the preceding year’s grade is appended as an input variable. The machine learning algorithms in use are random forest, support vector machine, and extreme gradient boosting machine. The poor performance of the machine learning algorithms stems either from the imprecision of the input space or from the lack of a sound record of student performance. We introduce the new concept of knowledge bias and a new classification of predictive models. Precision education would reduce bias by providing low-bias intensive-knowledge models. To avoid bias, however, it is not necessary to add knowledge to the input space. Low-bias extensive-knowledge models are achievable simply by appending the student’s earlier performance record to the model.
The low-bias intensive-knowledge learning models promoted by precision education are suited to designing new policies and actions aimed at academic attainment. If the aim is solely prediction, opting for a low-bias extensive-knowledge model can be appropriate and correct. The section 5 study applies deep learning to the prediction of Portuguese high school grades. Two implementations are undertaken in the experimental phase, one of a deep multilayer perceptron and the other of multiple linear regression. The architecture, topology, regularization, initialization, and optimization algorithms are fine-tuned in the deep learning hyperparameter tuning phase. The results encompass point predictions, prediction intervals, variable gradients, and the impact of an increase in class size on grades. The deep learning generalization error is smaller in the prediction of student grades, and its prediction intervals are more accurate. The empirical distributions of the deep multilayer perceptron gradients largely align with the regression coefficient estimates, indicating a satisfactory regression fit. Based on gradient discrepancies, a student's mother being an employer does not seem to be a positive factor. A benign paradigm change in the balance between home and career affairs for both genders should be reinforced. The deep multilayer perceptron broadens the spectrum of possibilities and treats each specificity as a core element of the analysis by providing a quantum solution hinged on a universal approximator. For an academic achievement critical factor such as class size, where the literature agrees on neither its importance nor its direction, the multilayer perceptron formed three distinct clusters according to the individual gradient signals. Finally, section 6 recaps the findings and conclusions of the thesis.
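The bias and variance decomposition mentioned in the section 4 study can be illustrated with a minimal sketch. It is not the thesis implementation: the data are synthetic, the 0-20 grade scale and the linear relation between a prior grade and the current grade are assumptions made for illustration, and an ordinary least-squares line stands in for the random forest, support vector machine, and boosting models. The function name `bias_variance_decomposition` and its parameters are hypothetical. The error is measured against the noiseless target, in which case the mean squared error equals squared bias plus variance.

```python
import numpy as np

def bias_variance_decomposition(fit, predict, x_train, y_train,
                                x_test, f_test, n_rounds=200, seed=0):
    """Bootstrap estimate of squared bias and variance under squared loss.

    f_test is the noiseless target at x_test (known here only because the
    data are synthetic), so error == bias_sq + variance by construction.
    """
    rng = np.random.default_rng(seed)
    n = len(x_train)
    preds = np.empty((n_rounds, len(x_test)))
    for r in range(n_rounds):
        idx = rng.integers(0, n, size=n)           # resample with replacement
        model = fit(x_train[idx], y_train[idx])
        preds[r] = predict(model, x_test)
    mean_pred = preds.mean(axis=0)                 # average model per test point
    bias_sq = np.mean((mean_pred - f_test) ** 2)   # squared bias, averaged
    variance = np.mean(preds.var(axis=0))          # spread across resamples
    error = np.mean((preds - f_test) ** 2)         # expected squared error
    return bias_sq, variance, error

# Synthetic illustration (assumed): current grade depends on the prior grade.
rng = np.random.default_rng(42)
true_f = lambda x: 2.0 + 0.8 * x                   # hypothetical relation
x_train = rng.uniform(0.0, 20.0, size=200)
y_train = true_f(x_train) + rng.normal(0.0, 2.0, size=200)
x_test = np.linspace(0.0, 20.0, 50)

# Stand-in model: least-squares line (the thesis uses other learners).
fit = lambda x, y: np.polyfit(x, y, deg=1)
predict = lambda coeffs, x: np.polyval(coeffs, x)

bias_sq, variance, error = bias_variance_decomposition(
    fit, predict, x_train, y_train, x_test, true_f(x_test))
print(f"bias^2={bias_sq:.4f} variance={variance:.4f} error={error:.4f}")
```

Comparing the two implementations described above would amount to running such a decomposition once without and once with the preceding year’s grade among the inputs, and comparing the resulting bias terms.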
Keywords: Academic Achievement, Machine Learning, Deep Learning, Support Vector Regression, Random Forest, Stacking, Boosting, Bias and Variance Decomposition, Quantitative Political Analysis
Original language: English
Qualification: Doctor of Philosophy
  • Oliveira, Tiago, Supervisor
  • Castelli, Mauro, Supervisor
  • Jesus, Frederico Cruz, Supervisor
Award date: 25 Jul 2022
Publication status: Published - 25 Jul 2022



