A Similar Patent Search System using Latent Dirichlet Allocation (Korean title: 잠재 의미 분석을 적용한 유사 특허 검색 시스템, "A Similar Patent Search System Applying Latent Semantic Analysis")

임현근

Document type: Thesis

Author: 임현근 (배재대학교, Graduate School of 배재대학교)

Advisor: 정회경

Year of publication: 2019

Copyright: Theses of 배재대학교 are protected by copyright.

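Research history of this thesis (2)

2019

잠재 의미 분석을 적용한 유사 특허 검색 시스템 (A Similar Patent Search System Applying Latent Semantic Analysis)

임현근, Department of Computer Engineering, 2019.01, thesis

2018

잠재 의미 분석을 적용한 유사 특허 검색 서비스 시스템 (A Similar Patent Search Service System Applying Latent Semantic Analysis)

임현근, 김재윤, 정회경, 한국정보통신학회논문지, 2018.08, journal article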

Abstract and Keywords

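In the era of the Fourth Industrial Revolution, technological competition is intensifying among companies that seek a competitive edge in the technology market by securing intellectual property such as patents. The number of patent registrations increases every year, but the current expert-based, qualitative approach to patent analysis is limited in its capacity to handle this growing volume. To address this, there have been attempts to automate patent analysis. Earlier studies searched for similar patents by keyword search, whereas recent work uses automatic classification based on machine learning.
Keyword search extracts core keywords from a document, stores them in an index, and retrieves documents through a search engine using keyword weights. The machine learning approach extracts text features from patent documents, converts them into vector data, and classifies each document into the label with the highest data-based similarity. Keyword search analyzes text that has been structured through data cleansing. It is accurate for short queries, but for long texts such as documents, in which many words occur irregularly, it cannot analyze the meaning implied in the sentences. Automatic classification at the semantic analysis stage is an unstructured-data analysis method used to classify sentences made up of many words.
There have been attempts to combine keyword search and machine learning analysis for similar-document search, but because structured and unstructured data are processed differently, applying both at once posed implementation problems.
This study therefore investigated a method for efficiently retrieving semantically similar patents by combining the existing search method with machine-learning-based automatic classification and with a topic modeling algorithm, a semantic analysis method for unstructured text.
First, as the search classification method, a Naive Bayes model was trained on real patent data for multi-class patent classification. The analysis used 5,160 patent abstracts from group A of the public patents provided by WIPO. The model classified patent classes with an accuracy of about 87%, nearly the same as the SVM algorithm (88%). Second, to improve the accuracy of the features of the patent classes assigned in the first step, the keywords for keyword search were generated with TextRank. Rather than TF-IDF, which simply measures keyword frequency, TextRank was used to form phrases from words with high association (Apriori), enabling retrieval of highly relevant content. Most search services today operate on single keywords, but related-term search is gradually spreading.
Finally, latent semantic analysis with LDA was applied to the candidate documents retrieved with the related keywords, making it possible to identify documents with high semantic similarity to the document under analysis.
By adding machine learning functionality to a search engine, this study built a system that can quickly perform semantic analysis of patents even on existing systems. The implemented system can analyze large volumes of data, such as patents, even on a low-specification PC.
This study is expected to enable effective construction of patent analysis systems, reduce cost and time in actual patent filing and examination, and make patent search somewhat more accurate.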

With the arrival of the Fourth Industrial Revolution, competition among companies to secure intellectual property such as patents is intensifying in the technology market. The number of patent applications grows every year, and the expert-based approach to patent analysis is limited in its ability to handle them. There have been attempts to automate patent analysis to solve this problem.
In the past, keyword search was used to find similar patents.
Recently, automatic classification using machine learning has been used. Keyword search extracts keywords from a document, stores them in an index, and retrieves documents through a search engine using weighted keywords. Machine learning extracts text features from a patent document, transforms them into vector data, and classifies the document into labels based on data similarity.
Keyword search analyzes text that has been structured through data refinement. For a short sentence its accuracy is high, but for a long text such as a document, in which many words occur irregularly, it cannot analyze the semantics implied in the sentences.
In the semantic analysis step, automatic classification is used as an unstructured-data method for classifying sentences composed of many words.
There have been attempts to find similar documents by combining the two methods. However, combining them is problematic because unstructured and structured data are analyzed in different ways. In this paper, we study a method that extracts the keywords implied in a document and applies latent semantic analysis with LDA (Latent Dirichlet Allocation) to classify documents efficiently without human intervention and to find similar patents.
First, we train a Naive Bayes model on actual patent data for multi-class classification. Among the public patents provided by WIPO, 5,160 patent abstracts of group A were used for evaluation. As a result, the model classified patent classes with an accuracy of about 87%, showing little difference in accuracy from the SVM algorithm (88%).
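The classification step can be illustrated with a minimal multinomial Naive Bayes sketch. The abstract does not say which features, smoothing, or library the thesis used, so the bag-of-words tokenizer, the add-one smoothing, the toy abstracts, and the class names below are all assumptions made for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on non-letters; a stand-in for real preprocessing."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)          # documents per class
        self.word_counts = defaultdict(Counter)    # class -> word -> count
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        self.total_docs = len(labels)
        return self

    def predict(self, text):
        # Words never seen in training carry no evidence and are skipped.
        tokens = [t for t in tokenize(text) if t in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / self.total_docs)  # log prior
            for t in tokens:
                score += math.log((counts[t] + 1) / denom)  # log likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy, made-up abstracts; the labels are hypothetical class names.
train_texts = [
    "a plough blade for tilling soil and planting seeds",
    "harvesting machine for cutting crops in a field",
    "a tablet composition for treating pain and fever",
    "pharmaceutical dosage form for oral drug delivery",
]
train_labels = ["agriculture", "agriculture", "medicine", "medicine"]

model = NaiveBayes().fit(train_texts, train_labels)
predicted = model.predict("a drug tablet for reducing fever")
print(predicted)
```

Working in log space avoids numeric underflow when abstracts are long. A comparison against an SVM, as reported above, would need a separate classifier trained on the same features.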
Second, to increase the accuracy of the features of the patent classes assigned in the first step, the keywords used for keyword search were generated with TextRank. We use TextRank rather than TF-IDF, which measures keyword frequency, to make content retrieval more relevant. Currently, most search services are based on a single keyword, but search by related words is also becoming more common.
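As a rough illustration of the keyword step, the sketch below ranks words with TextRank, that is, PageRank run over a word co-occurrence graph. It scores single words only and does not build the multi-word phrases the thesis describes. The stopword list, window size, damping factor, and sample text are assumptions, not values from the thesis.

```python
import re
from collections import defaultdict

# A tiny illustrative stopword list (assumption).
STOPWORDS = {"a", "an", "the", "of", "for", "and", "in", "to", "is", "with", "by", "on"}


def textrank_keywords(text, window=2, damping=0.85, iterations=50, top_k=3):
    """Rank words by PageRank over an undirected co-occurrence graph."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

    # Link each word to the words that follow it within `window` positions.
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != w:
                neighbors[w].add(other)
                neighbors[other].add(w)

    # Power iteration: each word shares its score equally among its neighbors.
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        scores = {
            w: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


sample = (
    "The battery pack cools the battery cell. "
    "A battery controller monitors the battery charger."
)
keywords = textrank_keywords(sample)
print(keywords)
```

Unlike a pure frequency count, the score of a word here depends on how well connected its neighbors are, which is why a word that co-occurs with many different terms tends to rise to the top.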
Finally, among the candidate documents retrieved with the related keywords, latent semantic analysis with LDA was used to identify documents with high semantic similarity to the document being analyzed. By combining machine learning with a search engine, we were able to build a system that can quickly perform semantic analysis of patents on top of an existing system.
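The LDA comparison step might look like the following sketch, which fits a small LDA model by collapsed Gibbs sampling and ranks candidates by the cosine similarity of their topic mixtures to the target document. The abstract does not state the inference method, hyperparameters, or similarity measure, so the Gibbs sampler, the values of `alpha` and `beta`, the cosine measure, and the toy documents are all assumptions.

```python
import math
import random
import re


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def lda_doc_topics(docs, n_topics=2, alpha=0.1, beta=0.01, sweeps=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling; return per-document topic mixtures."""
    rng = random.Random(seed)
    tokens = [tokenize(d) for d in docs]
    vocab_size = len({w for doc in tokens for w in doc})

    doc_topic = [[0] * n_topics for _ in docs]      # n(d, k)
    topic_word = [dict() for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                    # n(k)
    assign = []                                     # topic of every token

    # Random initial topic assignment.
    for d, doc in enumerate(tokens):
        assign.append([])
        for w in doc:
            k = rng.randrange(n_topics)
            assign[d].append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] = topic_word[k].get(w, 0) + 1
            topic_total[k] += 1

    for _ in range(sweeps):
        for d, doc in enumerate(tokens):
            for i, w in enumerate(doc):
                k = assign[d][i]
                # Remove the token, then resample its topic from the conditional.
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t].get(w, 0) + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] = topic_word[k].get(w, 0) + 1
                topic_total[k] += 1

    # Smoothed topic proportions theta(d, k); each row sums to 1.
    return [
        [(c + alpha) / (sum(row) + alpha * n_topics) for c in row]
        for row in doc_topic
    ]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy, made-up target and candidate texts.
target = "battery cell cooling module for electric vehicle battery pack"
candidates = [
    "cooling plate for a battery pack in an electric vehicle",
    "tablet composition for oral drug delivery",
    "drug tablet coating for pain treatment",
]
theta = lda_doc_topics([target] + candidates)
similarities = [cosine(theta[0], t) for t in theta[1:]]
ranked = sorted(range(len(candidates)), key=lambda i: -similarities[i])
for i in ranked:
    print(f"{similarities[i]:.3f}  {candidates[i]}")
```

On a corpus this small the topic assignments depend heavily on the random seed, so the ranking is only indicative. A realistic setup would fit the model on the full set of candidate documents and remove stopwords first.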
We implemented a system that can analyze large volumes of data, such as patents, even on low-specification PCs.
Through this study, patent analysis systems are expected to be built more effectively, reducing cost and time in the actual patent application and examination process and enabling more accurate patent search.

#Machine Learning #Document Classification #Similar Patent Search #Latent Semantic Analysis #LDA #Keyword Extraction

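Korean Abstract
Table of Contents
List of Figures
List of Tables
Ⅰ. Introduction
1.1 Research Background and Purpose
1.2 Research Content and Scope
1.3 Organization of the Thesis
Ⅱ. Related Work
2.1 Patent Document Structure
2.2 Patent Document Classification Methods
2.3 Keyword Extraction
2.4 LDA Similarity Verification
Ⅲ. System Design
3.1 Service Configuration
3.1.1 Service Scope
3.1.2 Functional Requirements
3.1.3 System Configuration
3.2 Step-by-Step Processing
3.2.1 Document Preprocessing
3.2.2 Document Analysis Stage
3.2.3 Patent Search
3.2.4 Patent Topic Analysis
Ⅳ. System Implementation and Experiments
4.1 Implementation Environment
4.2 System Implementation
4.2.1 MS Azure
4.2.2 Web Page
4.2.3 Application Configuration
4.3 Experiments
Ⅴ. Conclusion
References
English Abstract
Acknowledgement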
