A Study on the Application of Generative AI to Writing Graded Chinese Language Proficiency Test Items: A Comparative Analysis with TOCFL Test Items

Date

2025

Abstract

This study investigated the applicability of generative artificial intelligence (AI) to developing test items for graded Chinese language proficiency tests. Using Test of Chinese as a Foreign Language (TOCFL) mock tests as a reference, it analyzed AI-generated listening and reading test booklets in terms of language difficulty, cognitive level, item structure, and test-taker performance. The generative tool employed was GPT-4, which was used to design items at the TOCFL Band A (Beginner) level; these AI-generated items were then compared against manually compiled TOCFL test booklets. A total of 40 Vietnamese learners of Chinese with proficiency at or above A1 took the tests. Item discrimination and item quality were evaluated from the test takers' answer performance, the PIRLS four-level questioning framework was used to analyze question difficulty and type, and Bloom's taxonomy of cognitive levels was applied to examine the cognitive abilities and limitations that AI demonstrated in Chinese test item development. The results indicated that the AI-generated items performed stably in linguistic form and item-type imitation, particularly in the listening section, where they showed comprehension difficulty and score distributions similar to those of the human-compiled booklets; they effectively identified learners' difficulties and could be generated quickly in valid and varied forms, demonstrating genuine potential for item development. However, the AI-generated reading booklet was too easy overall, producing a concentrated score distribution and insufficient discrimination. The AI also struggled to control language level and questioning tier consistently, frequently producing out-of-syllabus vocabulary and insufficient contextual cues. Cognitive-level analysis further revealed that the AI predominantly generated low-level questions (e.g., direct retrieval and direct inference) and remained weak at high-level item types such as integration and evaluation. In conclusion, generative AI can serve as an auxiliary tool for Chinese test item development, offering practical value in raising item-writing efficiency and producing first drafts; nevertheless, for high-level contextual logic and the design of distractor options, human review and linguistic judgment remain essential to ensure item quality and test validity.
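
The abstract notes that GPT-4 was used to generate TOCFL-style items at Band A. As a minimal illustrative sketch only (the study's actual prompts and pipeline are not given here), item generation through the OpenAI Python SDK might look like the following; the prompt wording, temperature, and the `generate_item` helper are assumptions for illustration.

```python
# Illustrative sketch: generating one TOCFL-style Band A reading item
# with GPT-4. The prompt text and constraints are assumptions, not the
# prompts actually used in the study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You write items for the Test of Chinese as a Foreign Language "
    "(TOCFL), Band A (Beginner). Use only vocabulary within the Band A "
    "word list. Output one multiple-choice reading item with a short "
    "passage, four options (A-D), and the answer key."
)

def generate_item(topic: str) -> str:
    """Ask GPT-4 for one Band A reading item on the given topic."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Topic: {topic}"},
        ],
        temperature=0.7,  # some variety across generated items
    )
    return resp.choices[0].message.content

print(generate_item("ordering food in a restaurant"))
```

One of the abstract's findings (out-of-syllabus vocabulary, unstable level control) is consistent with the known difficulty of enforcing a closed word list through prompting alone, which is why the study recommends human review of such drafts.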
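
The abstract states that item discrimination was estimated from the 40 test takers' answer performance but does not name the statistic used. A common choice, shown here purely as an assumption, is the upper–lower group discrimination index:

```latex
% Upper-lower group discrimination index (illustrative; the abstract
% does not state which statistic the study used).
D = P_U - P_L, \qquad
P_U = \frac{R_U}{n_U}, \qquad
P_L = \frac{R_L}{n_L}
```

Here \(R_U\) and \(R_L\) are the numbers of correct responses to an item in the top and bottom scoring groups (conventionally the top and bottom 27%), and \(n_U\), \(n_L\) are the group sizes. Under the usual guidelines, values of \(D\) below roughly 0.2 mark an item as poorly discriminating; the concentrated score distribution reported for the AI reading booklet would push \(D\) toward zero, matching the abstract's finding of insufficient discrimination.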

Keywords

Artificial intelligence, Test of Chinese as a Foreign Language (TOCFL), Chinese language test item development
