Basic Paper Information

Document type
Academic journal
Author
Donald STURGEON (Harvard University)
Journal
Academy of Buddhist Studies, Dongguk University, International Journal of Buddhist Thought & Culture, Vol. 28, No. 2
Publication date
December 2018
Pages
11-44 (34 pages)
DOI
10.16893/IJBTC.2018.12.28.2.11

Abstract and Keywords

Optical character recognition (OCR) - the fully automated transcription of text appearing in a digitized image - offers transformative opportunities for the scholarly study of written materials produced prior to the digital age. Digitization, in the sense of photographic reproduction, is now a well-established and efficient mechanical process, and one with significant value in its own right for preservation as well as for access to rare materials. As a result, hundreds of millions of pages of pre-modern Chinese works have already been digitized by libraries and academic institutions around the world, and a significant and growing portion of these is being made freely available online.
To make use of this material efficiently, transcriptions of the textual content of these images are needed. Given the enormous volume of image data in existence - and its ongoing production as digitization continues - this task is only feasible if it can be fully automated: performed by software without manual human intervention. Individually, reliable transcriptions produced by OCR offer enormous time savings to researchers, enabling efficient navigation of materials in ways not possible without digital transcription. In aggregate, however, these transcriptions facilitate entirely new ways of exploring historical materials - for example, rapidly identifying material that one suspects might exist somewhere, without knowing in advance where that might actually be. Automated transcription is also a prerequisite for virtually any type of statistical analysis of these materials - the potential utility of which continues to increase as a larger and larger proportion of the extant corpus is transcribed.
This paper introduces a procedure for OCR of pre-modern Chinese written materials, both printed and handwritten, describing the complete process from digitized image through to automated transcription and manual correction of remaining errors, with particular attention to issues arising in this domain. The process described has been applied to over 25 million pages of pre-modern Chinese works. The paper also introduces the Chinese Text Project (https://ctext.org) platform, which is used both to make these results available to scholars and to provide a distributed, crowdsourced mechanism for manual correction at scale and for further analysis of the materials.

Table of Contents

Abstract
Introduction
Image Pre-processing
Character Segmentation
Training Data Extraction
Language Modeling
Results
Using the Data
Crowdsourcing of Corrections
Conclusions and Future Work
References


UCI (KEPA): I410-ECN-0101-2019-022-000344828