Thesis Information

Document type: Thesis (학위논문)
Author: 최현화 (충남대학교 대학원, Chungnam National University Graduate School)
Advisor: 李圭哲
Year of publication: 2013
Copyright: Theses from 충남대학교 (Chungnam National University) are protected by copyright.

Abstract

We are currently witnessing a rapid growth of image data, triggered by the popularity of the Internet and the huge amount of user-generated content from Web 2.0 applications. Given such image collections, performing a similarity search to find the objects most similar to a given object is a classical problem with many applications. In principle, there are two basic types of similarity query: the ε-range query and the k-nearest neighbors (k-NN) query. Given a set of data points represented in a high-dimensional space and a distance measure between them, an ε-range query finds all objects whose distance from a query point is smaller than or equal to a given range ε, whereas a k-NN query returns the k most similar neighbors of the query point.
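For reference, the brute-force semantics of these two query types can be written in a few lines. The sketch below is illustrative only and is not the index proposed in this thesis; it assumes Euclidean distance over NumPy arrays, and the names range_query and knn_query are ours.

```python
# Illustrative brute-force semantics of the two basic similarity queries
# (not the proposed index); assumes Euclidean distance over NumPy arrays.
import numpy as np

def range_query(data, q, eps):
    """epsilon-range query: all points within distance eps of q."""
    dists = np.linalg.norm(data - q, axis=1)
    return data[dists <= eps]

def knn_query(data, q, k):
    """k-NN query: the k points closest to q."""
    dists = np.linalg.norm(data - q, axis=1)
    return data[np.argsort(dists)[:k]]

# Example: 10,000 random 64-dimensional feature vectors.
rng = np.random.default_rng(0)
data = rng.random((10_000, 64))
q = rng.random(64)
print(len(range_query(data, q, 2.5)), len(knn_query(data, q, 10)))
```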
To support similarity search in very large collections of high-dimensional data, a number of distributed index structures for high-dimensional data spaces have been proposed. Most recently published approaches focus mainly on supporting range queries or on operating in decentralized systems. LSH Forest and LSH At Large leverage Locality Sensitive Hashing (LSH) to provide approximate results for a k-NN search. LSH has received considerable attention recently because its query time was shown to be independent of the dimensionality of the feature vectors. However, LSH, which probabilistically assigns similar data to the same bucket of a hash table, is not suitable for skewed data and can be highly inaccurate in a similarity search. SkipIndex partitions the data space with a k-d tree and maps the space onto a skip-graph overlay network by encoding each region as a unique key. DiST (Distributed Search Tree) and the VBI-tree (Virtual Binary Index tree) are also tree-based approaches, which do not scale well when the data dimensionality is high.
Several approaches to parallelizing an NN search on centralized systems have also been developed. Approaches such as the Master R-tree, Master-Client R-trees, the GPR-tree (Global Parallel R-tree), and the distributed hybrid spill-tree parallelize tree-based structures. Parallel VA-files (Vector Approximation files) and parallel CBF (Cell-Based Filtering) perform a parallel NN search based on the VA-file to achieve a linear increase in search speed as the number of servers grows. However, the query response times of these solutions are not satisfactory for a search engine on the World Wide Web.
To provide a similarity search over massive high-dimensional data in cloud computing or web search services, we propose a high-dimensional indexing structure, called the Distributed Vector Approximation tree (DVA-tree), which operates in a cluster environment. We designed a centralized indexing structure to address the search performance and accuracy problems of decentralized indexing methods. The DVA-tree partitions the feature vector space in a hierarchical tree manner and maps the leaf nodes of the tree onto separate servers in the cluster. Each server then builds a VA-file as its local index and uses it to retrieve neighbors. In short, we propose a mechanism that efficiently distributes VA-files in order to increase k-NN search performance as well as to support index scalability for large-scale data.
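To make the division of labor concrete, the following is a heavily simplified sketch of the general idea only, not the thesis's actual DVA-tree construction or k-NN algorithm: the data set is partitioned (a single median split stands in for the tree's leaf nodes), each partition receives a VA-file-style quantized approximation acting as its local index, and a query is answered by filtering candidates on each "server" and merging the partial results. All names and parameters here (build_local_index, local_knn, BITS, the 4*k candidate budget) are our own assumptions for illustration.

```python
# Heavily simplified sketch of the general idea only, not the thesis's
# DVA-tree: partition the vectors (stand-in for leaf nodes mapped to
# servers), approximate each partition with a few bits per dimension
# (VA-file style), filter candidates with the approximation, then refine
# with exact distances and merge the per-partition results.
import numpy as np

BITS = 4                      # bits per dimension in the approximation
LEVELS = 2 ** BITS

def build_local_index(partition):
    """Quantize each vector of one partition into LEVELS cells per dimension."""
    lo, hi = partition.min(axis=0), partition.max(axis=0)
    width = np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum((partition - lo) / width * LEVELS, LEVELS - 1).astype(np.uint8)
    return {"lo": lo, "width": width, "cells": cells, "vectors": partition}

def local_knn(index, q, k):
    """Approximate filtering with cell centers, then exact refinement."""
    centers = index["lo"] + (index["cells"] + 0.5) / LEVELS * index["width"]
    approx = np.linalg.norm(centers - q, axis=1)
    cand = np.argsort(approx)[: 4 * k]                 # coarse candidate set
    exact = np.linalg.norm(index["vectors"][cand] - q, axis=1)
    return index["vectors"][cand][np.argsort(exact)[:k]]

# One median split stands in for the hierarchical tree; each side plays the
# role of a separate server holding its own local approximation.
rng = np.random.default_rng(1)
data = rng.random((20_000, 32))
d = data.var(axis=0).argmax()
median = np.median(data[:, d])
servers = [build_local_index(data[data[:, d] <= median]),
           build_local_index(data[data[:, d] > median])]

q, k = rng.random(32), 5
partial = np.vstack([local_knn(s, q, k) for s in servers])
result = partial[np.argsort(np.linalg.norm(partial - q, axis=1))[:k]]
print(result.shape)           # (5, 32): the merged global k-NN answer
```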
In this thesis, we make the following contributions: (i) we present a new distributed index scheme, called the DVA-tree, that combines a tree-based index structure with a scan method based on VA-files; (ii) we present an approximate k-NN algorithm for processing the distributed k-NN search and propose cost formulas for predicting its response time; (iii) for non-uniform datasets, we enhance the k-NN search performance of the DVA-tree by building the VA-files with independent encoding of the vector data; (iv) finally, we experimentally evaluate the efficiency and effectiveness of our approach on both skewed and uniformly distributed data.
Experimental results show that the proposed DVA-tree has a significant performance advantage over existing index structures on non-uniform data. We believe that the DVA-tree is the most suitable candidate to support high search performance as well as index scalability for large-scale datasets.

Table of Contents

1. Introduction
1.1. Research Background and Objectives
1.2. Research Content and Scope
1.3. Organization of the Thesis
2. Related Work
2.1. High-Dimensional Indexing Methods on a Single Server
2.2. Requirements for Distributed High-Dimensional Indexes
2.3. Decentralized Distributed High-Dimensional Indexing Methods
2.3.1. Hash-Based Approaches
2.3.2. Partitioning-Based Approaches
2.4. Centralized Distributed High-Dimensional Indexing Methods
2.4.1. Filter-Based Approaches
2.4.2. Partitioning-Based Approaches
2.5. Limitations of Existing Work
3. Preliminaries
3.1. Hybrid spill-tree
3.2. VA-file
4. The Distributed Vector Approximation Tree (DVA-tree)
4.1. Construction of the DVA-tree
4.2. k-Nearest Neighbor Search
4.3. Cost Model for the k-NN Search
5. Improving Search Performance on Non-uniform Data
5.1. Motivation
5.2. Limitations of the DVA-tree
5.3. Performance Improvement Methods for the DVA-tree
6. Experiments
6.1. Comparison with Existing Distributed High-Dimensional Indexing Methods
6.2. Scalability Experiments
6.3. Performance Comparison of Independent VA-file Construction
7. Conclusion and Future Work
References
ABSTRACT
