Feedback is known to be a major factor for successful learning (Wisniewski et al., 2020). Rather than only indicating what is incorrect, feedback is particularly effective when it is formative, providing explanations of errors and advice on appropriate solution strategies (Hattie & Timperley, 2007). In higher education, however, feedback is seldom provided to learners due to resource constraints (Boud & Molloy, 2013), which is particularly true for writing tasks. The advent of artificial intelligence (AI) in educational contexts might offer a means to scale feedback.
Research investigating ChatGPT-generated feedback on text assignments reveals that the quality of human feedback is higher than that of AI feedback for clarity of directions for improvement, accuracy, prioritization of essential features, and supportive tone, but not for criteria-based feedback (Steiss et al., 2024). For feedback to be useful, learners need to be willing to accept it and use it to revise their work (Hattie & Timperley, 2007; Narciss, 2008). Hence, besides investigating the accuracy of human and AI feedback, students’ perceptions also need to be examined (Nazaretsky et al., 2024; Rüdian et al., 2025), yet they have received limited attention in research (Strijbos et al., 2021). Recent research on feedback perceptions found that students' ability to correctly detect the feedback source depends on the course and task, and that students prefer human over AI feedback, a preference that is also related to the correct identification of the feedback source (Nazaretsky et al., 2024).
To advance research on students' perceptions of AI feedback, this initial exploratory study within a larger research project first examines whether students in an AI feedback group correctly detect the feedback source (RQ1a) and whether this relates to the presentation order (RQ1b). Second, it investigates whether students’ perceptions of feedback differ by source (RQ2). Third, it analyzes the congruence between human- and AI-based feedback scoring (RQ3).
This study was conducted in three parallel seminar groups in the teacher education Master’s program, in which 78 students were enrolled. The N = 65 participating students (73.8% female) had a mean age of M = 24.65 years (SD = 3.04) and had studied for 10.51 semesters on average (SD = 2.45).
Students had to write a research project exposé (500 words). Students signed up themselves for either the AI-feedback group (n = 38) or the human-feedback group (n = 27). The AI-feedback group received feedback from the lecturer (rubrics and short text), plus AI feedback with values assigned to extended rubrics. To avoid an order effect, students in this group randomly received either human or AI feedback first. The human-feedback group only received the lecturer's feedback. The feedback rubrics included six categories and three score levels (0 = not fulfilled, 0.5 = partly fulfilled, 1 = fulfilled; see Table 2). Students received the feedback three days after the submission deadline or one week later, depending on their participation in the course.
To assess students’ feedback perceptions, the Feedback Perceptions Questionnaire (FPQ; Strijbos et al., 2021) was adapted. Its 16 items were rated on a 5-point Likert scale (1 = fully disagree, 5 = fully agree). For the subscales, Cronbach’s α ranged from .71 to .93. Students in the AI-feedback group were asked whether the feedback was AI-generated. Furthermore, participants reported demographic details (e.g., semesters studied, age, and current grade point average).
The AI feedback tool is designed to support teachers by automating the process of providing criteria-oriented feedback on student submissions. The tool analyzes students' text submissions using predefined analytic rubrics, leveraging natural language processing and machine learning, including large language models' capabilities.
The system is composed of three main components. First, a feature extraction pipeline processes student submissions, converting them into numerical indicators based on linguistic and contextual criteria. This step ensures that the model can quantitatively assess different aspects of the student’s submission. Second, the training process involves a machine-learning regression tree model that learns from historical student submissions and teacher-provided ratings. Third, the prediction component applies the trained model to new student submissions to predict rubric-based ratings.
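The three components described above can be illustrated with a minimal sketch. It is not the tool's actual implementation: the two indicators (word count and number of parenthetical citations), the toy training texts, and the function names `extract_features`, `fit_stump`, and `predict` are all assumptions made for illustration, and the regression tree is reduced to a single split (a depth-1 tree) for one rubric criterion.

```python
import re
from statistics import mean

def extract_features(text):
    """Component 1 (hypothetical indicators): word count and number of
    parenthetical citations such as '(Narciss, 2008)'."""
    words = text.split()
    citations = re.findall(r"\([^()]*\d{4}[^()]*\)", text)
    return [float(len(words)), float(len(citations))]

def fit_stump(X, y):
    """Component 2: fit a depth-1 regression tree by choosing the
    (feature, threshold) split with the smallest squared error."""
    best = None
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, y) if row[j] <= thr]
            right = [t for row, t in zip(X, y) if row[j] > thr]
            ml, mr = mean(left), mean(right)
            sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, ml, mr)
    if best is None:  # all feature rows identical: fall back to the overall mean
        return (0, float("inf"), mean(y), mean(y))
    return best[1:]  # (feature index, threshold, left mean, right mean)

def predict(stump, features):
    """Component 3: apply the fitted split to the indicators of a new submission."""
    j, thr, left_mean, right_mean = stump
    return left_mean if features[j] <= thr else right_mean

# Toy "historical" submissions with teacher ratings for one criterion (invented data).
history = [
    ("No sources here.", 0.0),
    ("Still no sources in this slightly longer draft.", 0.0),
    ("This draft cites prior work (Narciss, 2008).", 1.0),
    ("This draft cites (Boud & Molloy, 2013) and (Narciss, 2008) throughout.", 1.0),
]
X = [extract_features(text) for text, _ in history]
y = [rating for _, rating in history]

stump = fit_stump(X, y)
predicted = predict(stump, extract_features("A new draft building on (Hattie & Timperley, 2007)."))
print(stump, predicted)
```

On this toy data the split falls on the citation indicator, so a new submission with at least one citation receives the higher rating. A production system would use many more indicators, a deeper tree per criterion, and held-out data to check the predictions.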
RQ1. Correct detection of the feedback source and relation to presentation order
To determine whether students correctly identified the feedback source, the response frequencies were inspected. Human feedback was correctly identified as human by 54.1% of students, 29.7% considered it AI feedback, and 16.2% were unsure. For the AI feedback, 47.4% correctly identified it as AI feedback, 28.9% thought it was not AI-based, and 23.7% were unsure.
To investigate whether correct identification relates to the presentation order of the feedback types (12 participants received AI feedback first, 25 received human feedback first), Pearson correlations were computed. They indicate that participants were more likely to correctly identify human feedback when human feedback was presented second (r = .407, p = .023), and that AI feedback was also more likely to be correctly identified when human feedback was presented second (r = .433, p = .027).
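As a minimal sketch of this kind of analysis, the snippet below computes a Pearson correlation between two binary indicators (for binary variables, Pearson's r equals the phi coefficient). The coding (1 = human feedback presented second, 1 = source identified correctly) and the eight data points are invented for illustration and are not the study's data.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical coding, one entry per participant:
# human_second: 1 if human feedback was shown second, 0 if it was shown first
# correct:      1 if the participant identified the feedback source correctly
human_second = [1, 1, 1, 1, 0, 0, 0, 0]
correct = [1, 1, 1, 0, 1, 0, 0, 0]

r = pearson_r(human_second, correct)
print(round(r, 3))
```

A positive r here means that correct identification is more frequent in the group that saw human feedback second. The associated p value would additionally require a t distribution, which the Python standard library does not provide.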
RQ2. Students’ perceptions of the feedback depend on the source
To investigate whether students in the AI group perceived the two feedback types differently in terms of affect, willingness to use the feedback, fairness, acceptance, and usefulness, two-sided paired t-tests were used. Results (see Table 1) indicate that human feedback is perceived as significantly fairer and more acceptable.
Table 1 Students’ perceptions of AI-feedback and human-feedback
Variable | Feedback source | M | SD | t(36) | p | d |
Positive affect | Human-feedback | 3.64 | .94 | | | |
Positive affect | AI-feedback | 3.36 | 1.02 | 1.36 | .181 | .22 |
Negative affect | Human-feedback | 1.28 | .65 | | | |
Negative affect | AI-feedback | 1.45 | .81 | -1.83 | .076 | -.30 |
Willingness to use feedback | Human-feedback | 3.80 | 1.01 | | | |
Willingness to use feedback | AI-feedback | 3.54 | 1.11 | 1.43 | .161 | .24 |
Fairness | Human-feedback | 4.09 | .63 | | | |
Fairness | AI-feedback | 3.67 | .87 | 2.94 | .006 | .48 |
Acceptance | Human-feedback | 4.68 | .49 | | | |
Acceptance | AI-feedback | 4.41 | .79 | 2.46 | .019 | .40 |
Usefulness | Human-feedback | 3.59 | .89 | | | |
Usefulness | AI-feedback | 3.33 | 1.18 | 1.51 | .139 | .25 |
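The paired comparison behind Table 1 can be sketched as follows. The ratings are invented, and the effect size used here is d_z (mean difference divided by the standard deviation of the differences), which is an assumption: the text does not state which variant of d was computed. The reported values do appear consistent with d_z = t / √n for n = 37, for example for Fairness.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_and_dz(a, b):
    """Paired t statistic, effect size d_z, and degrees of freedom
    for two ratings given by the same participants."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    d_z = mean(diffs) / stdev(diffs)  # stdev uses the n - 1 denominator
    t = d_z * sqrt(n)                 # same as mean / (sd / sqrt(n))
    return t, d_z, n - 1

# Hypothetical 5-point ratings from five students (not the study's data).
human = [4, 5, 3, 4, 4]
ai = [3, 4, 3, 3, 4]
t, d_z, df = paired_t_and_dz(human, ai)
print(round(t, 2), round(d_z, 2), df)

# Consistency check against Table 1 (Fairness row): t(36) = 2.94, n = 37.
fairness_d = round(2.94 / sqrt(37), 2)
print(fairness_d)
```

The sign convention follows the table: differences are taken as human minus AI, so a positive t favors human feedback.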
RQ3. Congruence between human- and AI-based feedback scoring
Across all feedback, the scores of the human rater (70) averaged M = .829 points (SD = .131), and those of the AI (41) averaged M = .837 (SD = .097). To assess the congruence of the scoring (human versus AI), the mean absolute error (MAE) was computed as an interpretable metric. For the imbalanced and small dataset (see Table 2), we obtained an overall MAE of 0.21 and an overall RMSE of 0.27, indicating good congruence on a scale of [0,1].
Table 2 Criteria set and statistics, including Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE)
ID | Criterion | M | SD | MAE | RMSE |
1 | Concept idea and state of research | .89 | .24 | .31 | .24 |
2 | Scientific justification and relevance | .83 | .29 | .18 | .29 |
3 | Development of research question | .86 | .25 | .23 | .25 |
4 | Theoretical framework and concepts | .69 | .29 | .27 | .30 |
5 | Number of academic sources | .95 | .21 | .09 | .22 |
6 | Length | .80 | .32 | .20 | .32 |
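For readers less familiar with the two congruence metrics, the following sketch shows how MAE and RMSE are computed from paired human and AI scores on the rubric scale (0, 0.5, 1). The six score pairs are invented for illustration and do not come from the study.

```python
from math import sqrt

def mae(human, ai):
    """Mean absolute error between paired scores."""
    return sum(abs(h - a) for h, a in zip(human, ai)) / len(human)

def rmse(human, ai):
    """Root mean squared error between paired scores."""
    return sqrt(sum((h - a) ** 2 for h, a in zip(human, ai)) / len(human))

# Hypothetical scores for one criterion on the 0 / 0.5 / 1 rubric scale.
human_scores = [1.0, 0.5, 1.0, 0.0, 1.0, 0.5]
ai_scores = [1.0, 1.0, 0.5, 0.5, 1.0, 0.5]

mae_value = mae(human_scores, ai_scores)
rmse_value = rmse(human_scores, ai_scores)
print(mae_value, round(rmse_value, 3))
```

Because squaring weights large deviations more heavily, RMSE is never smaller than MAE for the same set of errors, so comparing the two columns offers a quick consistency check.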
Results indicate that about 54% of participants correctly detected human feedback, whereas only about 47% correctly identified AI feedback, a pattern similarly reported by Nazaretsky et al. (2024). Furthermore, the results suggest that students are better able to identify human feedback when it is presented after AI feedback. In a larger study, the mediating roles of presentation order and correct identification in feedback perceptions need further examination. In this sample, students consider human feedback fairer and accept it more than AI feedback, but it must also be taken into account that both types of feedback were rated as only moderately useful. Still, this feedback tool highlights the possibilities of using AI to assist teachers in efficiently rating students’ submissions, ensuring timely, structured, and criteria-based feedback at scale. Keeping a human-in-the-loop approach maintains educators’ essential role in guiding student learning while harnessing the benefits of AI. Our human-in-the-loop approach might also accommodate students’ preference for human feedback (Nazaretsky et al., 2024).
As this exploratory study was conducted in the field, the groups receiving AI feedback or human feedback first are not equal in size, resulting in statistical limitations. More training data might also have resulted in better performance in predicting all feedback criteria.
This work was supported by the Federal Ministry of Research, Technology, and Space (BMFTR), grant number 16DHBKI045.