ChatGPT-supported formative assessment in interpreter education: validity of automated ratings and perceived quality of diagnostic feedback

Wenjing Liu; Adriana Silvina  Pagano

doi:10.26512/les.v26i2.59793

Autores

Wenjing Liu Universidade Politécnica de Macau https://orcid.org/0009-0004-7865-5549
Adriana Silvina Pagano Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brasil https://orcid.org/0000-0002-3150-3503

DOI:

https://doi.org/10.26512/les.v26i2.59793

Palavras-chave:

avaliação formativa, feedback, modelo de linguagem de grande escala, ChatGPT, ensino de interpretação

Resumo

Formative assessment plays a critical role in teaching and learning. Recent advances in large language models (LLMs) have enabled their application as automated assessment systems and feedback providers. This study explores the validity of ChatGPT-based assessment and the perceived quality of its feedback in interpreter education. To this end, ChatGPT-4o was used to assess 60 Chinese–Portuguese simultaneous interpreting tasks, producing rubric-based quantitative ratings and qualitative diagnostic feedback. To examine its effectiveness, three types of validity (concurrent, predictive, and know-group) were examined by comparing ChatGPT-generated scores with those of nine trained human raters. A post-hoc questionnaire was also administered to collect raters’ subjective perceptions of the feedback. Results show strong alignment between the model and human scores, with ChatGPT demonstrating robust predictive power and discriminative ability. Raters viewed the feedback favorably and supported its use as a complement to teacher feedback, highlighting the pedagogical value of LLMs in interpreter training.

Downloads

Não há dados estatísticos.

Biografia do Autor

Wenjing Liu, Universidade Politécnica de Macau

Wenjing Liu is currently PhD candidate in Interpreting Studies at Macao Polytechnic University in China. Her research interests include translation and interpreting quality assessment, interpreter training, and intercultural studies.

Adriana Silvina Pagano, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brasil

Adriana S. Pagano is Full Professor in Applied Linguistics at Universidade Federal de Minas Gerais, Brazil. She is a research fellow of CNPq (National Council for Scientific and Technological Development, Ministry of Science and Technology, Brazil) and FAPEMIG (Research Foundation of the State of Minas Gerais, Brazil). Her research interests include systemic-functional approaches to translation and interpreting and multilingual and multimodal modelling of meaning.

Referências

ALAHMADI, N. et al. The impact of the formative assessment in speaking test on Saudi students’ performance. Arab World English Journal, Kuala Lumpur, v. 10, n. 1, p. 259-270, 2019.

BALAMAN, S. Exploring Undergraduate Students’ Viewpoints on Corrective Feedback Implementations in Interpreting. Korkut Ata Türkiyat Araştırmaları Dergisi, Osmaniye, v. 15, p. 994-1011, 2024.

BOUD, D.; SOLER, R. Sustainable assessment revisited. Assessment & Evaluation in Higher Education, London, v. 41, n. 3, p. 400-413, 2016.

BROWN, T. et al. Language Models are Few-Shot Learners. Advances in neural information processing systems, New York, v. 33, p. 1877-1901, 2020.

CARLESS, D.; BOUD, D. The development of student feedback literacy: enabling uptake of feedback. Assessment & Evaluation in Higher Education, London, v. 43, n. 8, p. 1315-1325, 2018.

CREZEE, I.; GRANT, L. Thrown in the deep end: Challenges of interpreting informal paramedic language. Translation & Interpreting: The International Journal of Translation and Interpreting Research, Sydney, v. 8, n. 2, p. 1-12, 2016.

DAMACENA, M.; QUEVEDO-CAMARGO, G. Avaliação e formação de professores de línguas: uma discussão sobre o currículo e as percepções dos formandos. Olhares & Trilhas, Uberlândia, v. 23, n. 3, p. 1054-1073, 2021.,

DAVIS, F. D.; BAGOZZI, R. P.; WARSHAW, P. R. User Acceptance of Computer Technology: A Comparison of Two Theoretical Models. Management Science, Catonsville, v. 35, n. 8, p. 982-1003, 1989.

ER, E. et al. Assessing student perceptions and use of instructor versus AI-generated feedback. British Journal of Educational Technology, London, v. 56, n. 3, p. 1074-1091. 2024.

FOWLER, Y. Formative assessment: Using peer and self-assessment in interpreter training. In: WADENSJO, C.; DIMITROVA, B. E.; NILSSON, A. (eds.). The Critical Link 4: Professionalisation of interpreting in the community. Amsterdam: John Benjamins, 2007. p. 253-262.

FREITAG, M. et al. Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation. Transactions of the Association for Computational Linguistics, Cambridge, v. 9, p. 1460-1474, 2021.

GEORGE, D.; MALLERY, P. IBM spss statistics 26 step by step: A simple guide and reference. New York: Routledge, 2019.

GIESHOFF, A. C.; ALBL-MIKASA, M. Interpreting accuracy revisited: a refined approach to interpreting performance analysis. Perspectives, London, v. 32, n. 2, p. 210-228, 2024.

GILE, D. Consecutive vs. Simultaneous: Which is more accurate? Interpretation Studies: The Journal of the Japan Association for Interpretation Studies, Tokyo, v. 1, p. 8-20, 2001.

GISEV, N.; BELL, J. S.; CHEN, T. F. Interrater agreement and interrater reliability: key concepts, approaches, and applications. Research in Social and Administrative Pharmacy, New York, v. 9, n. 3, p. 330-338, 2013.

GLAZER, N. Formative plus Summative Assessment in Large Undergraduate Courses: Why Both? International Journal of Teaching and Learning in Higher Education, Fort Collins, v. 26, n. 2, p. 276-286, 2014.

GUO, K.; WANG, D. To resist it or to embrace it? Examining ChatGPT’s potential to support teacher feedback in EFL writing. Education and Information Technologies, New York, v. 29, n. 7, p. 8435-8463, 2024.

HAN, C. Using rating scales to assess interpretation: Practices, problems and prospects. Interpreting, Amsterdam, v. 20, n. 1, p. 59-95, 2018.

HAN, C. Detecting and measuring rater effects in interpreting assessment: A methodological comparison of classical test theory, generalizability theory, and many-facet rasch measurement. In: CHEN, J.; HAN, C. (eds.). Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 2021. p. 85-113.

HAN, C. Interpreting Testing and Assessment: A state-Of-The-Art Review. Language Testing, London, v. 39, n. 1, p. 30-55, 2022.

HAN, C.; LU, X. Can automated machine translation evaluation metrics be used to assess students’ interpretation in the language learning classroom? Computer Assisted Language Learning, London, v. 36, n. 5-6, p. 1064-1087, 2021.

HAN, C.; LU, X. Interpreting quality assessment re-imagined: The synergy between human and machine scoring. Interpreting and Society: An Interdisciplinary Journal, Beijing, v. 1, n. 1, p. 70-90, 2021.

HAN, C.; LU, X.; FAN, Q. Taming generative AI for interpreter education: using large language models in classroom-based assessment of English-Chinese consecutive interpreting. The Interpreter and Translator Trainer, London, v. 19, n. 3-4, p. 444-464, 2025.

HATTIE, J.; TIMPERLEY, H. The power of feedback. Review of educational research, Thousand Oaks, v. 77, n. 1, p. 81-112, 2007.

HOLEWIK, K. Peer feedback and reflective practice in public service interpreter training. Theory and Practice of Second Language Acquisition, Katowice, v. 6, n. 2, p. 133-159, 2020.

IMRAN, M.; ALMUSHARRAF., N. Analyzing the role of ChatGPT as a writing assistant at higher education level: a systematic review of the literature. Contemporary Educational Technology, Podgorica, v. 15, n. 4, e464, 2023.

JIA, Y.; ARYADOUST, V. The Utility of Generative Artificial Intelligence in Rating Interpreters’ Accuracy: A Case Study of ChatGPT-4. In: CHAPELLE, C. A.; BECKETT, G. H.; RANALLI, J. (eds.). Exploring artificial intelligence in applied linguistics. Ames: Iowa State University Digital Press, 2024. p. 59-72.

KELLY, D. A Handbook for Translator Trainers. London: Routledge, 2014.

KOCMI, T.; FEDERMANN, C. Large language models are state-of-the-art evaluators of translation quality. arXiv, 2023. arXiv:2302.14520. Disponível em: https://arxiv.org/abs/2302.14520. Acesso em: 17 jan. 2026.

KOO, T. K.; LI, M. Y. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of chiropractic medicine, Lombard, v. 15, n. 2, p. 155-163, 2016.

KORZYNSKI, P. et al. Artificial intelligence prompt engineering as a new digital competence: Analysis of generative AI technologies such as ChatGPT. Entrepreneurial Business and Economics Review, Kraków, v. 11, n. 3, p. 25-37, 2023.

LEE, J. Feedback on feedback: Guiding student interpreter performance. Translation & Interpreting: The International Journal of Translation and Interpreting Research, Sydney, v. 10, n. 1, p. 152-170, 2018.

LIU, Y. Exploring a corpus-based approach to assessing interpreting quality. In: CHEN, J.; HAN, C. (eds.). Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 2021. p. 159-178.

LU, Q. et al. Error analysis prompting enables human-like translation evaluation in large language models: A case study on chatgpt. arXiv, 2023. arXiv:2303.13809. Disponível em: https://arxiv.org/abs/2303.13809. Acesso em: 17 jan. 2026.

MESSICK, S. Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American psychologist, Washington, DC, v. 50, n. 9, p. 741-749, 1995.

NAZARETSKY, T. et al. AI or human? Evaluating student feedback perceptions in higher education. In: FERREIRA MELLO, R.; RUMMEL, N.; JIVET, I.; PISHTARI, G.; RUIPÉREZ VALIENTE, J. A. (eds.). Technology Enhanced Learning for Inclusive and Equitable Quality Education: EC-TEL 2024. Cham: Springer, 2024. p. 284-298.

OUYANG, L. et al. Coh-Metrix model-based automatic assessment of interpreting quality. In: CHEN, J.; HAN, C. (eds.). Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 2021. p. 179-200.

OZILI, P. K. The acceptable R-square in empirical modelling for social science research. In: SALIYA, C. A. (ed.). Social research methodology and publishing results: A guide to non-native English speakers. Hershey: IGI global, 2023. p. 134-143.

SAWYER, D. B. Fundamental aspects of interpreter education: curriculum and assessment. Amsterdam: John Benjamins Publishing, 2004.

SCARAMUCCI, M. V. R. O professor avaliador: sobre a importância da avaliação na formação do professor de língua estrangeira. In: ROTTAVA, L.; SANTOS, S. (orgs.). Ensino-aprendizagem de línguas: língua estrangeira. Ijuí: Editora da UNIJUI, 2006. p. 49-64.

SUN, L. Transforming business interpretation education with AI: Perspectives from instructors and learners. Education and Information Technologies, New York, n. 30, p. 1-35, 2025.

TAHRAOUI, A. Teaching sight and bilateral interpreting online: students’ perceptions of teacher feedback. Texto Livre, Belo Horizonte, v. 15, e39545, 2022.

TENG, M. F. “ChatGPT is the companion, not enemies”: EFL learners’ perceptions and experiences in using ChatGPT for feedback in writing. Computers and Education: Artificial Intelligence, Oxford, v. 7, e100270, 2024.

ÜNLÜ, C. Interpretutor: Using large language models for interpreter assessment. In: INTERNATIONAL CONFERENCE HUMAN-INFORMED TRANSLATION AND INTERPRETING TECHNOLOGY (HiT-IT), 2023, Naples. Proceedings of the International Conference HiT-IT 2023. Shoumen: INCOMA Ltd., 2023. p. 78-96.

WANG, X.; FANTINUOLI, C. Exploring the correlation between human and machine evaluation of simultaneous speech translation. arXiv, 2024. arXiv:2406.10091. Disponível em: https://arxiv.org/abs/2406.10091. Acesso em: 17 jan. 2026.

WANG, X.; WANG, B. Identifying fluency parameters for a machine-learning-based automated interpreting assessment system. Perspectives, London, v. 32, n. 2, p. 278-294, 2024.

WANG, X.; WANG, B. Advancing automatic assessment of target-language quality in interpreter training with large language models: insights from explainable AI. The Interpreter and Translator Trainer, London, v. 19, n. 3-4, p. 465-485, 2025.

WILIAM, D.; THOMPSON, M. Integrating assessment with learning: What will it take to make it work? In: DWYER C. (ed.). The future of assessment: shaping teaching and learning. New York: Routledge, 2017. p.53-82.

WU, Z. The interrelationship among in-class peer-assessment, interpreting anxiety and interpreting performance. Language Education, Abingdon, v. 5, n. 4, p. 33-37, 2017.

WU, Z. Chasing the unicorn? The feasibility of automatic assessment of interpreting fluency. In: CHEN, J.; HAN, C. (eds.). Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 2021. p. 143-158.

XU, S.; SU, Y.; LIU, K. Investigating student engagement with AI-driven feedback in translation revision: A mixed-methods study. Education and Information Technologies, New York, v. 30, p. 16969-16995, 2025.

XUE, R.; LIU, Q. Exploring student interpreters’ engagement with different sources of feedback on note-taking. Innovations in Education and Teaching International, Abingdon, v. 62, n. 4, p. 1135-1148, 2024.

YAN, D.; AMINI, M.; KASUMA, S. A. A. Status quo of the formative assessment enactments in spoken language interpreter training: a scoping review of research and practice. International Journal of Academic Research in Progressive Education and Development, Bahawalpur, v. 12, n. 4, p. 652-673, 2023.

YU, W.; VAN HEUVEN, V. J. Quantitative correlates as predictors of judged fluency in consecutive interpreting: Implications for automatic assessment and pedagogy. In: CHEN, J.; HAN, C. (eds.), Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 2021. p. 117-142.

YU, Y., WEI, W., CHEN, Z. Comparing learners’ engagement strategies with feedback from a Generative AI chatbot and peers in an interpreter training programme: a quasi-experimental study. The Interpreter and Translator Trainer, London, v. 19, n. 3-4, p. 338-356, 2025.

ZHENG, C. et al. Progressive-Hint Prompting improves reasoning in large language models. arXiv, 2023. arXiv:2304.09797. Disponível em: https://arxiv.org/abs/2304.09797. Acesso em: 17 jan. 2026.

ZHOU, J.; DONG, Y. Effects of note-taking on the accuracy and fluency of consecutive interpreters' immediate free recall of source texts: A three-stage developmental study. Acta Psychologica, Amesterdam, v. 248, e104359, 2024.

WANG, X.; WANG, B. Advancing automatic assessment of target-language quality in interpreter training with large language models: insights from explainable AI. The Interpreter and Translator Trainer, p. 1-21, 2025.

WILIAM, D.; THOMPSON, M. Integrating assessment with learning: What will it take to make it work? In DWYER C. (Ed.), The future of assessment: shaping teaching and learning. New York: Routledge, 2017, p.53-82.

WU, J.; LIU, M.; LIAO, C. Analytic scoring in interpretation test: Construct validity and the halo effect. 2013. In LIAO, H-H.; KAO T-E.; LIN, Y. (Eds.), The making of a translator: Multiple perspectives. Taipei: Bookman, 2013, 277-292.

WU, Z. The interrelationship among in-class peer-assessment, interpreting anxiety and interpreting performance. Language Education, v. 5, n. 4, p. 33-37, 2017.

WU, Z. Chasing the unicorn? The feasibility of automatic assessment of interpreting fluency. In CHEN J.; HAN, C. (Eds.), Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 2021, p. 143-158.

XU, S.; SU, Y.; LIU, K. Investigating student engagement with AI-driven feedback in translation revision: A mixed-methods study. Education and Information Technologies, p. 1-27, 2025.

XUE, R.; LIU, Q. Exploring student interpreters’ engagement with different sources of feedback on note-taking. Innovations in Education and Teaching International, p. 1-14, 2024.

YAN, D.; AMINI, M.; KASUMA, S. A. A. Status quo of the formative assessment enactments in spoken language interpreter training: a scoping review of research and practice. International Journal of Academic Research in Progressive Education and Development, v. 12, n. 4, p. 652-673, 2023.

YU, W.; VAN HEUVEN, V. J. Quantitative correlates as predictors of judged fluency in consecutive interpreting: Implications for automatic assessment and pedagogy. In CHEN J.; HAN, C. (Eds.), Testing and assessment of interpreting: Recent developments in China. Singapore: Springer, 2021, p. 117-142.

YU, Y., WEI, W., CHEN, Z. Comparing learners’ engagement strategies with feedback from a Generative AI chatbot and peers in an interpreter training programme: a quasi-experimental study. The Interpreter and Translator Trainer, p. 1-19, 2025.

ZHENG, C., et al. Progressive-Hint Prompting improves reasoning in large language models. arXiv preprint arXiv:2304.09797, 2023.

ZHOU, J.; DONG, Y. Effects of note-taking on the accuracy and fluency of consecutive interpreters' immediate free recall of source texts: A three-stage developmental study. Acta Psychologica, 248, 2024.

ChatGPT-supported formative assessment in interpreter education

validity of automated ratings and perceived quality of diagnostic feedback

Autores

DOI:

Palavras-chave:

Resumo

Downloads

Biografia do Autor

Wenjing Liu, Universidade Politécnica de Macau

Adriana Silvina Pagano, Universidade Federal de Minas Gerais, Belo Horizonte, Minas Gerais, Brasil

Referências

Downloads

Publicado

Como Citar

Edição

Seção

Licença

Enviar Submissão

Idioma

Informações