A Comparison of Methods for Estimating the Reliability of Testlet-Composed Test Scores Using Item Response Theory: IRT Approaches to Estimating the Reliability of Testlet Composed Test Scores

김나나

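Document type: Thesis

Author: 김나나 (Yonsei University, Graduate School of Yonsei University)

Advisor: 이규민

Year of publication: 2016

Copyright: Yonsei University theses are protected by copyright.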

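Research history of this thesis (2)

2017

Reliability Estimation Methods for Testlet-Composed Test Scores Using Bifactor Multidimensional Item Response Theory: Focusing on Testlet Effect Size and the Degree of Data Imbalance

김나나, 이규민, 강상진 · 교육평가연구 · 2017.03 · Journal article

2016

A Comparison of Methods for Estimating the Reliability of Testlet-Composed Test Scores Using Item Response Theory

김나나 · Department of Education · 2016.01 · Thesis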

Abstract and Keywords

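In education, most standardized tests are composed of testlets, in which one or more items share the same stimulus material. Because the items in such tests are mutually dependent, estimating reliability without accounting for the structure of the data risks over- or underestimated results, and using an incorrect reliability can lead to errors in interpreting and using test scores. This study therefore explored methods of estimating the reliability of testlet-composed test scores, focusing on the item response theory (IRT) approach.
This study sought to determine how reliability estimates for testlet-composed test scores differ depending on how the IRT model treats the measurement structure of the testlets, and how those differences change with the characteristics of the testlets. It also compared reliabilities estimated under IRT and under generalizability theory in order to examine the differences between the two measurement theories. The specific research questions were as follows.
First, when three IRT models with different assumptions about testlets (a unidimensional dichotomous IRT model, a unidimensional polytomous IRT model, and a bifactor multidimensional IRT model) are applied, how do the estimated reliabilities differ according to two testlet characteristics (the testlet effect size and the degree of testlet imbalance)? Second, when the three IRT models are applied, how much error do the estimated reliabilities show under each of the two testlet characteristics? Third, when the three IRT models are applied, how do the estimated reliabilities differ from the reliability estimated under generalizability theory?
To answer these questions, simulated data were generated repeatedly under each study condition using the bifactor model. Reliability was then estimated with the dichotomous IRT model (the two-parameter logistic model), the polytomous IRT model (the graded response model), and the bifactor model, and the estimates were compared across testlet characteristics. The mean absolute error of each model's reliability estimates, computed against the criterion (true) reliability, was also compared across testlet characteristics. Finally, the reliabilities estimated with the three models were compared with generalizability coefficients computed for each testlet condition.
The results are summarized as follows. First, the reliability estimates were largest for the dichotomous IRT model, followed by the bifactor model and then the polytomous IRT model, and the gap between the dichotomous IRT model and the other two models widened as the testlet effect size increased. The degree of testlet imbalance did not affect the reliability estimates. Second, the error of the reliability estimates was smallest for the bifactor model, followed by the polytomous IRT model and then the dichotomous IRT model. The error of the dichotomous IRT model grew as the testlet effect size increased and grew slightly as the testlet imbalance became more severe. Finally, when the three models' reliability estimates were compared with the reliability from generalizability theory, the polytomous IRT model was closest to the generalizability coefficient, and the bifactor model was about 0.01 higher than the generalizability coefficient.
These results led to the following conclusions. First, the dichotomous and polytomous IRT models overestimate and underestimate, respectively, the reliability of testlet-composed test scores, while the bifactor model estimates it most accurately. The polytomous IRT model nevertheless estimates this reliability relatively accurately, because its underestimation is small. Second, the larger the testlet effect, the more caution is needed when using the dichotomous IRT model to estimate the reliability of test scores. The degree of testlet imbalance, by contrast, need not be weighted as heavily as the testlet effect size when choosing a reliability estimation method. Third, IRT produced higher reliability estimates than generalizability theory, which can be attributed to differences between the two measurement theories in the amount of information they use and in their assumptions about parallel test forms.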

Most standardized tests in education or psychology are composed of item bundles called testlets. Testlets are widely used not only because they are useful for measuring problem-solving or integrated skills but also because they make test construction more time- and cost-efficient. This study investigated item response theory (IRT) approaches to estimating the reliability of testlet-composed test scores by applying three IRT models: the two-parameter logistic (2PL) model, the graded response model (GRM), and the bifactor model. Because previous studies have found that the testlet effect size and the degree of imbalance in testlet lengths may influence the estimation of reliability, their effects on the reliability estimates and corresponding errors derived from each of the three IRT models were also examined. In addition, the reliabilities estimated using the three IRT models were compared with those estimated via the generalizability theory approach.
Using simulated data, the reliability estimates and corresponding mean absolute errors (MAEs) were obtained from the three IRT models. The estimates and errors from each IRT model were then compared across five testlet effect sizes and three degrees of imbalance in testlet lengths. Furthermore, the reliabilities estimated using the three IRT models were compared with the estimated generalizability coefficients in each condition.
The results of the study were as follows. Comparing the reliability estimates and corresponding MAEs derived from the three IRT models showed that the 2PL model overestimated, while the GRM underestimated, the reliability of testlet-composed test scores. However, the underestimation by the GRM was very small; it produced almost the same MAEs as the bifactor model. Hence, although the bifactor model appeared to be the most appropriate IRT model for estimating the reliability of testlet-composed test scores, the GRM can also serve as an appropriate model.
Regarding the effects of the testlet effect size and the degree of imbalance in testlet lengths on estimating the reliability of testlet-composed test scores, the results showed that as the testlet effect increased, the MAEs of the reliability estimates from the GRM and the bifactor model did not change, while those from the 2PL model increased. A plausible interpretation is that greater dependency between items violates the local independence assumption of IRT more severely, which in turn increases the magnitude of overestimation in the 2PL model. Researchers should therefore be cautious about using the 2PL model to estimate the reliability of testlet-composed test scores, especially when the testlet effects are large. The degree of imbalance in testlet lengths also influenced the MAEs of the reliability estimates derived from the 2PL model, but the effects were relatively small. This implies that the degree of imbalance in testlet lengths is not as important as the testlet effect size when estimating the reliability of testlet-composed test scores using IRT models.
Lastly, a comparison of the reliabilities estimated from the IRT models with the generalizability coefficients revealed that the GRM produced the estimates closest to the generalizability coefficients, while the estimates from the bifactor model were about 0.001 higher than the generalizability coefficients. This indicates that the IRT approach produces slightly higher reliability estimates than the generalizability approach, presumably because of differences between the two theories, such as their different assumptions about parallel test forms.
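
The mean absolute error (MAE) used above to compare the models is the average absolute gap between the reliability estimates from repeated simulated data sets and the true reliability of the condition. The sketch below shows that computation only. The function name and all numbers in it are hypothetical placeholders, not results from the thesis.

```python
import numpy as np

def mean_absolute_error(estimates, true_reliability):
    """Average absolute deviation of reliability estimates from the true value."""
    estimates = np.asarray(estimates, dtype=float)
    return float(np.mean(np.abs(estimates - true_reliability)))

# Hypothetical estimates from four replications of one simulation condition.
true_reliability = 0.85
estimates_by_model = {
    "2PL": [0.90, 0.91, 0.89, 0.92],
    "GRM": [0.84, 0.85, 0.83, 0.84],
    "bifactor": [0.85, 0.86, 0.84, 0.85],
}
mae_by_model = {
    model: mean_absolute_error(values, true_reliability)
    for model, values in estimates_by_model.items()
}
```

Because the MAE ignores the sign of each deviation, it measures the size of the error but not its direction. Whether a model over- or underestimates has to be read from the estimates themselves.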

#reliability #testlet #item response theory #bifactor model #dichotomous IRT model #polytomous IRT model #generalizability theory / #reliability #testlet #item response theory #bifactor model #two parameter logistic model #graded response model #generalizability coefficient
