Driven by recent advances in deep learning for text and image data, image captioning, a field at the intersection of the two, has been studied actively. Image captioning is a technology that generates a textual description of a given image, addressing image understanding and text generation at the same time. Because of its wide applicability, it has become a core area of artificial intelligence research, and studies to improve its performance have been conducted steadily. Despite these diverse efforts, however, research aimed at 'interpreting' images from the perspective of domain experts rather than of the general public is difficult to find. Even for the same image, the parts that a viewer focuses on differ according to that viewer's domain of expertise, and the way the image is expressed and interpreted also varies with the level of expertise. This study therefore proposes a method of transplanting an expert's expertise into a model and, through it, a way to generate image captions specialized for the corresponding domain. Specifically, the proposed methodology builds a pre-trained model by training on a large amount of general data and then transplants domain expertise by transfer learning on a small amount of specialized data. In addition, this study proposes a 'Feature-Independent Transfer Learning' method to prevent the interference-between-observations problem that can arise during training. To validate the proposed methodology, we built a pre-trained model using the MSCOCO image captioning dataset and conducted an experiment that transplants expertise using 'image-expert caption' data created based on the advice of a practicing art therapist. The experimental results confirmed that, whereas the general captions generated from a general point of view contained content irrelevant to expert interpretation, the expert captions generated through the proposed methodology contained all the content necessary for expert interpretation.
As deep learning has recently attracted attention, its application is being considered as a way of solving problems in various fields. In particular, deep learning is known to exhibit excellent performance on unstructured data such as text, images, and sound, and its effectiveness has been proven in many studies. Thanks to the remarkable advances in deep learning for images and text, interest in image captioning technology and its applications is growing rapidly. Image captioning is a technology that automatically generates adequate captions for a given image by handling image understanding and text generation simultaneously. Despite the high barrier to entry of image captioning, which requires researchers to handle both image and text data, its wide applicability has made it one of the key fields of AI research. In addition, many studies have been conducted to enhance the performance of image captioning in diverse aspects. Recent studies attempt to create advanced captions that not only describe the image accurately but also convey the information contained in the image more sophisticatedly. In spite of these many efforts, it is difficult to find studies that interpret images from the viewpoint of experts in each domain rather than from the viewpoint of the general public. Even for the same image, the points of interest may differ depending on the domain of expertise of the person viewing it. In addition, the way of interpreting and expressing the image differs depending on the level of expertise. The general public tends to perceive an image from a holistic and general point of view, that is, by identifying the components of the image and their relationships. Domain experts, on the other hand, tend to recognize the image based on their expertise, focusing on the specific components necessary to interpret it. This implies that the meaningful parts of the same image differ depending on the viewer's perspective, and image captioning needs to reflect this phenomenon. Therefore, in this study, we propose a methodology that uses the expertise of experts to generate domain-specialized captions for images. Specifically, after pre-training on a huge amount of general data, we transplant domain expertise through transfer learning with a small amount of specialized data. However, applying transfer learning as-is to expertise data may lead to another type of problem. When a caption containing a variety of features is used for learning, it can cause a so-called 'interference between observations' problem, which makes it difficult to learn each feature perspective purely. When learning with a huge amount of data, this problem is mostly self-corrected and has little effect on the results. Conversely, in the case of fine-tuning, which trains on a small amount of data, the effect of this problem on learning can be relatively significant. To solve this problem, we therefore present a novel 'Feature-Independent Transfer Learning' method that performs transfer learning independently for each feature (see the sketch following this abstract). To confirm the validity of the proposed methodology, we conducted experiments using the results of pre-training on the MSCOCO dataset, which consists of 120K images and about 600K general captions.
In addition, an experiment was conducted to transplant expertise using 'image-expert caption' data created based on the advice of an art therapist. The experimental results verified that the proposed method generates captions from the viewpoint of the transplanted expertise, whereas captions generated through general image captioning contain a number of components irrelevant to expert interpretation. In this paper, we propose a novel approach to specialized image interpretation, presenting how to utilize transfer learning to generate captions specialized for a specific domain. In the future, we expect this methodology to be widely applied to transplanting expertise in various domains, helping to address the scarcity of expertise data and to improve the performance of image captioning.
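To make the two-stage training described above concrete, the following is a minimal sketch rather than the thesis implementation. It assumes a PyTorch-style setup in which CaptionNet, make_loader, the feature names, and all hyperparameters are hypothetical stand-ins, and it realizes Feature-Independent Transfer Learning as fine-tuning an independent copy of the pre-trained model for each feature, so that gradient updates from one feature's captions cannot interfere with those of another.

```python
# Minimal sketch of pre-training + Feature-Independent Transfer Learning.
# CaptionNet, make_loader, the feature names, and all hyperparameters are
# hypothetical stand-ins for illustration, NOT the models or data of the thesis.
import copy

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

VOCAB, HIDDEN, IMG_FEAT, SEQ = 1000, 256, 2048, 12


class CaptionNet(nn.Module):
    """Toy encoder-decoder captioner standing in for a CNN+RNN/Transformer model."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(IMG_FEAT, HIDDEN), nn.ReLU())
        self.decoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.head = nn.Linear(HIDDEN, VOCAB)

    def forward(self, img_feats, token_embs):
        h0 = self.encoder(img_feats).unsqueeze(0)  # (1, B, H): image context
        out, _ = self.decoder(token_embs, h0)      # (B, T, H)
        return self.head(out)                      # (B, T, V): next-token logits


def make_loader(n_samples=64):
    """Random stand-in data: (image features, token embeddings, target token ids)."""
    imgs = torch.randn(n_samples, IMG_FEAT)
    embs = torch.randn(n_samples, SEQ, HIDDEN)
    tgts = torch.randint(0, VOCAB, (n_samples, SEQ))
    return DataLoader(TensorDataset(imgs, embs, tgts), batch_size=16)


def train(model, loader, epochs, lr):
    """One shared loop used for both pre-training and fine-tuning."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for imgs, embs, tgts in loader:
            logits = model(imgs, embs)
            loss = loss_fn(logits.flatten(0, 1), tgts.flatten())
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model


# Stage 1: pre-train once on a large general corpus (MSCOCO in the thesis).
base = train(CaptionNet(), make_loader(n_samples=256), epochs=3, lr=1e-4)

# Stage 2: Feature-Independent Transfer Learning -- each feature gets its own
# copy of the pre-trained weights and sees only captions about that feature,
# so updates for one feature cannot interfere with observations of another.
feature_names = ["color", "shape", "placement"]  # hypothetical features
experts = {
    name: train(copy.deepcopy(base), make_loader(), epochs=2, lr=1e-5)
    for name in feature_names
}
```

Duplicating the whole model per feature is only one plausible realization of the feature-independence idea; a shared encoder with separate per-feature decoders would serve the same purpose with fewer parameters.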
Table of Contents
1. Introduction
2. Related Work
   2.1. Deep Learning Research: Applications to Text and Images
   2.2. Image Captioning Research
   2.3. Transfer Learning Research
   2.4. Art Therapy Research Based on Data Analysis
3. Proposed Methodology
   3.1. Construction of the Observation/Interpretation Map (O2I Map) and Expertise Quad (E-Quad)
   3.2. Feature-Independent Transfer Learning Model
   3.3. Expert Interpretation Caption Generation Model
4. Experiments and Results
   4.1. Experimental Overview
   4.2. Comparison of Caption Quality by Data Characteristics
   4.3. Comparison of General and Expert Captions
5. Conclusion
References
Abstract