A comparative study of reading sections of high-stakes teacher-made tests in Portugal

Authors

DOI:

https://doi.org/10.26512/les.v26i2.59640

Keywords:

high-stakes teacher-made tests, validity, reliability, EFL, secondary-level education, teachers’ assessment literacy

Abstract

This study investigates the validity of high-stakes, teacher-made tests as parallel measures of achievement of identical curricular goals. Focusing on three reading sections from tests designed in different schools across Portugal, the research assessed a sample of 75 students representative of the intended population. Two complementary approaches were adopted: expert evaluations and psychometric analyses. Findings revealed discrepancies among experts regarding construct coverage and highlighted variations in item scope, format, subcomponents assessed, and comprehension demands across the sections. Technical shortcomings in instructions, item design, and scoring criteria were also identified. Results from a one-way repeated measures ANOVA demonstrated significant differences in mean performance across the three sections, with section two producing notably lower scores compared to sections one and three. These results suggest that the sections are not equivalent in construct or difficulty. The study reinforces concerns about teachers’ assessment literacy, test quality, and fairness in assessment.
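The one-way repeated measures ANOVA reported in the abstract partitions each student's scores on the three sections into condition, subject, and residual variance. A minimal sketch of that computation is shown below; the score matrix used here is hypothetical illustration, not the study's data (the study used 75 students).

```python
import numpy as np

def rm_anova(data):
    """One-way repeated measures ANOVA.

    data: 2-D array, rows = subjects, columns = conditions
    (here, the three reading sections). Returns (F, df1, df2).
    """
    x = np.asarray(data, dtype=float)
    n, k = x.shape                                         # subjects, conditions
    grand = x.mean()
    ss_cond = n * ((x.mean(axis=0) - grand) ** 2).sum()    # between conditions
    ss_subj = k * ((x.mean(axis=1) - grand) ** 2).sum()    # between subjects
    ss_total = ((x - grand) ** 2).sum()
    ss_error = ss_total - ss_cond - ss_subj                # residual
    df1, df2 = k - 1, (n - 1) * (k - 1)
    f_stat = (ss_cond / df1) / (ss_error / df2)
    return f_stat, df1, df2

# Hypothetical scores: 4 students x 3 sections (NOT the study's data)
scores = [[3, 1, 4],
          [2, 2, 5],
          [4, 1, 6],
          [3, 2, 5]]
f_stat, df1, df2 = rm_anova(scores)
print(f"F({df1}, {df2}) = {f_stat:.2f}")  # -> F(2, 6) = 22.20
```

A library implementation such as statsmodels' `AnovaRM` yields the same F statistic under the standard sphericity assumption; the manual version above makes the variance partition explicit.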

Downloads

No statistical data available.

References

ALDERSON, J. (2005). Diagnosing foreign language proficiency: The interface between learning and assessment. London, UK: Continuum.

ALDERSON, J. (2001). The shape of things to come: will it be the normal distribution? In MILANOVIC, M., WEIR, C. & ASSOCIATION OF LANGUAGE TESTERS IN EUROPE (2004). European language testing in a global context: Proceedings of the ALTE Barcelona conference, July 2001 (Studies in Language Testing; 18). Cambridge: Cambridge University Press, pp. 1-26.

ALKHARUSI, H., KAZEM, A. & AL-MUSAWAI, A. (2011). Knowledge, skills, and attitudes of preservice and inservice teachers in educational measurement. Asia-Pacific Journal of Teacher Education, 39(2), pp. 113-123.

ALLEN, M., & YEN, W. (1979). Introduction to measurement theory. Monterey, Calif: Brooks/Cole Pub. Co.

ANASTASI, A. (1976). Psychological testing (4th ed.). New York: Macmillan; London: Collier Macmillan.

ASCHBACHER, P. (1999). Developing Indicators of Classroom Practice to Monitor and Support School Reform. CSE Technical report 513. National Center for Research on Evaluation. University of California, Los Angeles.

BACHMAN, L. (2005). Building and supporting a case for test use. Language Assessment Quarterly: An International Journal, 2(1), pp. 1-34, DOI:10.1207/s15434311laq0201_1

BACHMAN, L. (1991). What does language testing have to offer? TESOL Quarterly, 25(4), pp. 671-704.

BACHMAN, L. (2007). What is the construct? The dialectic of abilities and contexts in defining constructs in language assessment. In FOX, J. et al. (Eds). Language Testing Reconsidered, 3 (pp. 41-71). Ottawa: University of Ottawa Press.

BAKER, E. (2013). The chimera of validity. Teachers College Record, 115. Columbia University.

BECKER, A. & NEKRASOVA-BEKER, T. (2018). Investigating the effect of different selected-response item formats for reading comprehension. Educational Assessment, 23(4), pp. 296-317. DOI: 10.1080/10627197.2018.1517023.

BLACK, P. (2010). Validity in teachers’ summative assessments, Assessment in Education: Principles, Policy & Practice, 17(2), pp. 215-232. DOI: 10.1080/09695941003696016.

BROEKKAMP, H., HOUT-WOLTERS, B., VAN DEN BERGH, H. & RIJLAARSDAM, G. (2004). Students' expectations about the processing demands of teacher-made tests. Studies in Educational Evaluation, 30, pp. 281-304.

BROOKHART, S. (2001). The “standards” and classroom assessment research. Paper presented at the annual meeting of the American Association of Colleges for Teacher Education, Dallas, TX. https://archive.org/details/ERIC_ED451189.

CAMPBELL, C. (2013). Research on teacher competency in classroom assessment. In MCMILLAN, J. (Ed.), Sage Handbook of Research on Classroom Assessment. Thousand Oaks, CA: SAGE Publications, pp. 71-84.

CARTER, K. (1984). Do teachers understand principles for writing tests? Journal of Teacher Education, 35, pp. 57-60. DOI: 10.1177/002248718403500613.

CHAPELLE, C. (1999). Validity in language assessment. Annual Review of Applied Linguistics 19, pp. 254-272.

CHAPELLE, C., ENRIGHT, M. & JAMIESON, J. (2011). Building a validity argument for the test of English as a foreign language (ESL & applied linguistics professional series). New York; London: Routledge. Taylor & Francis e-Library.

COHEN, L., MANION, L. & MORRISON, K. (2007). Research Methods in Education (6th Ed.), Taylor & Francis Group. London: Routledge.

DAVIES, A. (2007). Assessing academic English language proficiency: 40+ years of U.K. language tests. In FOX, J. et al. (Eds). Language Testing Reconsidered, 4 (pp.73-86). Ottawa: University of Ottawa Press.

DOUGLAS, D. (2001). Performance consistency in second language acquisition and language testing research: a conceptual gap. Second Language Research, 17(4), pp. 442-456.

DOWNING, S. (2006). Twelve steps for effective test development. In DOWNING, S., & HALADYNA, T. (2006). Handbook of Test Development, 1 (pp. 3-25). Mahwah, N.J.: Lawrence Erlbaum Associates.

FARLEY-RIPPLE, E., MAY, H., KARPYN, A., TILLEY, K. & MCDONOUGH, K. (2018). Rethinking connections between research and practice in education: A conceptual framework. Educational Researcher, 47(4), pp. 235-245.

FERRARA, S. & WAY, D. (2016) Design and development of end-of-course tests for student assessment and teacher evaluation. In NCME Applications of Educational Measurement and Assessment: Meeting the Challenges to Measurement in an Era of Accountability, 5. NCME Book Series. New York: Routledge. Retrieved from https://www.book2look.com/embed/9781135040154

GREEN, A. & WEIR, C. (2004). Can Placement Tests Inform Instructional Decisions? Language Testing, 21(4), pp. 467-494.

GREEN, R. (2013). Statistical Analyses for Language Testers. New York: Palgrave Macmillan.

GULLICKSON, A. (1993). Matching measurement instruction to classroom-based evaluation: perceived discrepancies, needs, and challenges. In WISE, S. (Ed.), Teacher Training in Measurement and Assessment Skills, 3. Lincoln, NE: Buros Institute of Mental Measurements, University of Nebraska–Lincoln. http://digitalcommons.unl.edu/burosteachertraining/3

HERMAN, J. & DORR-BREMME, D. (1983). In HATHAWAY, W., Testing in the schools (New directions for testing and measurement; 19). San Francisco: Jossey-Bass.

IMPARA, J., DIVINE, K., BRUCE, F., LIVERMAN, M. & GAY, A. (1991) Does Interpretive Test Score Information Help Teachers? Educational Measurement: Issues and Practices, 10(4), pp. 16-18.

KANE, M. (2006) Content-related validity evidence in test development. In DOWNING, S., & HALADYNA, T. (2006). Handbook of Test Development, 7 (pp.131-153). Mahwah, N.J.: Lawrence Erlbaum Associates.

KANE, M. (2011). Validating score interpretations and uses: Messick Lecture, Language Testing Research Colloquium, Cambridge, April 2010. Language Testing, 29(1), pp. 3-17.

KANE, M. (2013). Validating the interpretations and uses of test scores. Journal of Educational Measurement, 50(1), 1-73.

KHALIFA, H., & WEIR, C. (2009). Examining Reading: Research and practice in assessing second language reading (Studies in language Testing, 29). Cambridge: Cambridge University Press.

KINTSCH, W. (1998). Comprehension: A paradigm for cognition. New York, NY, US: Cambridge University Press.

KISELNIKOV, A., VAKHITOVA, D. & KAZYMOV, T. (2020). Coh-Metrix readability formulas for an academic text analysis. IOP Conference Series: Materials Science and Engineering, 890, International Scientific Conference on Socio-Technical Construction and Civil Engineering (STCCE - 2020), 29 April - 15 May 2020, Kazan, Russian Federation. DOI: 10.1088/1757-899X/890/1/012207.

KOLEN, M. & BRENNAN, R. (2014). Test Equating, Scaling, and Linking: Methods and practices (3rd ed.). Statistics for social and public policy. New York: Springer.

KUNNAN, A. & CARR, N. (2017). A comparability study between the general English proficiency test-advanced and the Internet-based test of English as a foreign language. Language Testing in Asia, 7(17), pp. 1-16.

KUNNAN, A. (2018). Evaluating Language Assessments (New perspectives on Language Assessment Series). New York: Routledge, https://doi.org/10.4324/9780203803554

LAFLAIR, G., ISBELL, D., MAY, L., ARVIZU, M. & JAMIESON, J. (2017). Equating in small-scale language testing programs. Language Testing, 34(1), pp. 127-144.

LEE, V. & KLEIN, S. (2002). Technical criteria for evaluating tests. In Hamilton, L. et al. (Eds). Making Sense of Test-Based Accountability in Education. Santa Monica, CA: RAND Corporation. https://www.rand.org/pubs/monograph_reports/MR1554.html.

LUOMA, S. (2001). What does your test measure? Construct definition in language test development and validation. Retrieved from http://www.solki.jyu.fi/vanhat/Luoma_Sari_2001_PhD_manuscript1.pdf

MALONE, M. (2013). The essentials of assessment literacy: Contrasts between testers and users. Language Testing, 30(3), pp. 329-344.

MARSO, R. & PIGGE, F. (1988). An analysis of teacher-made tests: testing practices, cognitive demands, and item construction errors. Paper presented at the annual meeting of the National Council on Measurement in Education. New Orleans, Louisiana.

MARSO, R. & PIGGE, F. (1991) An analysis of teacher-made tests: item types, cognitive demands, and item construction errors. Contemporary educational psychology, 16, pp. 279-286.

MCNAMARA, T. (1996). Measuring second language performance. London/NY: Longman.

MCNAMARA, T. (2007). Assessment in foreign language education: The struggles over constructs. The Modern Language Journal, 91(2), pp. 280-282.

MCNAMARA, T. (2011). Applied Linguistics and Measurement: A Dialogue. Language Testing, 28(4), pp.435-440.

MERTLER, C. & CAMPBELL, C. (2005). Measuring teachers’ knowledge & application of classroom assessment concepts: development of the Assessment Literacy Inventory. Paper presented at the annual meeting of the American Educational Research Association, Montréal, Quebec, Canada.

MESSICK, S. (1989). Meaning and values in test validation: The science and ethics of assessment. Educational Researcher, 18(2), pp. 5-11.

MESSICK, S. (1994). The interplay of evidence and consequences in the validation of performance assessments. Educational Researcher, 23(2), pp. 13-23. Retrieved from http://www.jstor.org/stable/1176219.

MESSICK, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), pp. 741-749.

MILLER, M., LINN, R. & GRONLUND, N. (2009). Measurement and assessment in teaching (10th ed.). New Jersey: Merrill/Pearson.

OESCHER, J. & KIRBY, P. (1990). Assessing Teacher-made tests in secondary Math and Science classrooms. Paper presented at the annual meeting of the National Council on Measurement in Education, Boston, MA. ERIC Document Reproduction Service No. 322 169.

ORT, V. (1967). Teacher-made tests. The Clearing House, 41(7), pp. 396-399.

PALLANT, J. (2016) SPSS Survival Manual (6th ed.). NY: Open University Press/McGraw-Hill Education.

PATO, M. (2019). A comparative study of the reading section of high-stakes teacher-made tests in Portugal [Unpublished Master’s thesis]. Lancaster University.

POPHAM, W. (2008). Classroom assessment: what teachers need to know (5th ed.). Boston: Pearson/Allyn & Bacon.

SANDERS, J. & VOGEL, S. (1993). The development of standards for teacher competence in educational assessment of students. In WISE, S. (Ed.), Teacher Training in Measurement and Assessment Skills, 5. Lincoln, NE: Buros Institute of Mental Measurements, University of Nebraska–Lincoln. http://digitalcommons.unl.edu/burosteachertraining/5

SCHNEIDER, M., EGAN, K. & JULIAN, M. (2013). Classroom assessment in the context of high-stakes testing. In MCMILLAN, J. (Ed.), SAGE Handbook of Research on Classroom Assessment, 4, pp. 55-70.

SIMSEK, A. (2016). A comparative analysis of common mistakes in achievement tests prepared by school teachers and corporate trainers. European Journal of Science and Mathematics Education, 4(4), pp. 477‐489.

SHOHAMY, E. (1982). Affective considerations in language testing. The Modern Language Journal, 66(1), pp. 13-17.

SOLÓRZANO, R. (2008). High stakes testing: issues, implications, and remedies for English language learners. Review of Educational Research, 78(2), pp. 260-329.

STIGGINS, R. & CONKLIN, N. (1992). In teachers’ hands: investigating the practices of classroom assessment. Albany, NY: State University of New York Press.

TIGHE, J., MCMANUS, I., DEWHURST, N., CHIS, L. & MUCKLOW, J. (2010). The standard error of measurement is a more appropriate measure of quality for postgraduate medical assessments than is reliability: an analysis of MRCP (UK) examinations. BMC Medical Education, 10(40). DOI: 10.1186/1472-6920-10-40.

TOULMIN, S. (2003). The uses of argument (Updated ed.). Cambridge, England: Cambridge University Press.

TROIKE, R. (1983). Can language be tested? The Journal of Education, 165(2), pp. 209-216.

U.S. CONGRESS (1992). Testing in American Schools: Asking the Right Questions, OTA-SET-519. Congress of the U.S., Washington, DC. Office of Technology Assessment.

VOGT, K. & TSAGARI, D. (2014). Assessment literacy of foreign language teachers: Findings of a European study. Language Assessment Quarterly, 11(4), pp. 374-402. DOI: 10.1080/15434303.2014.960046.

WEIR, C. (2005). Language testing and validation: An evidence-based approach (Research and practice in applied linguistics). Basingstoke: Palgrave Macmillan.

WEIR, C., HAWKEY, R., GREEN, A. & DEVI, S. (2009). The cognitive processes underlying the academic reading construct as measured by IELTS. IELTS Research Reports, 9, pp. 157-189. British Council/IDP Australia.

WISE, S., LUKIN, L. & ROOS, L. (1991). Teacher beliefs about training and measurement. Journal of Teacher Education, 42(1), pp.37-42.

Published

2026-03-29

How to Cite

Pato, M. M. (2026). A comparative study of reading sections of high-stakes teacher-made tests in Portugal. Cadernos De Linguagem E Sociedade, 26(2), 291–313. https://doi.org/10.26512/les.v26i2.59640