A New Malicious Mail Filtering Method Using the Naive Bayes Technique on an Efficiently Tuned Hadoop Framework

지경엽

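Document type: Thesis

Author: 지경엽 (Chungnam National University, Graduate School of Chungnam National University)

Advisor: 권영미

Year of publication: 2023

Copyright: Chungnam National University theses are protected by copyright.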

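Research history of this thesis (3)

2023

Implementation of a Malicious Mail Filtering System through Refinement of the Multinomial Naive Bayes Technique

지경엽, 권영미. The Journal of Korean Institute of Information Technology, 2023.07. Journal article

Implementation of a Malicious Mail Detection Function Using the Naive Bayes Machine Learning Technique

지경엽, 권영미. Proceedings of KIIT Conference, 2023.06. Conference paper

A New Malicious Mail Filtering Method Using the Naive Bayes Technique on an Efficiently Tuned Hadoop Framework

지경엽. Department of Electronics, Radio and Information Communications Engineering, 2023.01. Thesis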

Abstract and Keywords

As email grows in importance, the volume of malicious email is also increasing, and with it the need for malicious email filtering. It is more economical to combine commodity hardware, such as media servers or PCs, with a virtual environment, use it as a single server resource, and filter malicious email with machine learning techniques. We therefore used the Hadoop MapReduce framework together with Naïve Bayes, chosen from among various machine learning methods. Naïve Bayes was selected because it is one of the top machine learning methods (Support Vector Machine (SVM), Naïve Bayes, K-Nearest Neighbor (KNN), and Decision Tree) in terms of execution time and accuracy.

To shorten the execution time of a Hadoop framework built on PC-based commodity hardware, 21 major configuration parameters that affect execution time were selected from more than 200 Hadoop parameters and optimized through several experimental cases. Setting these 21 parameters to their optimized values shortened the execution time. Notably, in a distributed environment, increasing the block size used to store and manage data does not necessarily shorten the execution time, and increasing the number of data replicas does not increase the execution time proportionally. The main Hadoop configuration parameters cover the block size, the number of server threads, the number of block replicas, the number of MapReduce tasks, the memory and heap sizes required for MapReduce, the number of virtual cores allocated to each container, and compression of the output of all MapReduce jobs running on the cluster, among others.
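The parameter categories listed above can be made concrete with a small sketch. The abstract does not name the 21 parameters or their optimized values, so the mapping below is an assumption: it pairs each category with a standard Hadoop 2.x property key and uses placeholder values, then renders them in the `<configuration>` XML layout that Hadoop site files use. The helper `render_site_xml` and the dictionary `EXAMPLE_SETTINGS` are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET

# Placeholder values only; these are NOT the optimized settings from the thesis.
# The keys are standard Hadoop 2.x property names, but which of them belong to
# the thesis's 21 tuned parameters is an assumption made for this sketch.
EXAMPLE_SETTINGS = {
    "hdfs-site.xml": {
        "dfs.blocksize": "134217728",        # HDFS block size in bytes (128 MiB)
        "dfs.namenode.handler.count": "10",  # NameNode server threads
        "dfs.replication": "3",              # number of replicas per block
    },
    "mapred-site.xml": {
        "mapreduce.job.reduces": "1",              # number of reduce tasks
        "mapreduce.map.memory.mb": "1024",         # container memory per map task
        "mapreduce.map.java.opts": "-Xmx819m",     # JVM heap size per map task
        "mapreduce.map.cpu.vcores": "1",           # virtual cores per map container
        "mapreduce.output.fileoutputformat.compress": "false",  # compress job output
    },
}


def render_site_xml(properties):
    """Render a name/value mapping as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    for filename, props in EXAMPLE_SETTINGS.items():
        print(filename)
        print(render_site_xml(props))
```

Keeping the settings in one mapping per site file makes it straightforward to generate one configuration per experimental case and compare execution times across cases.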
The Hadoop configuration parameter values were optimized experimentally to shorten the execution time. On the Hadoop framework, malicious mail prediction was implemented as a MapReduce program using Naïve Bayes, a supervised machine learning technique. For comparison, spam prediction was also implemented as a Python program using Naïve Bayes in a bare-metal server environment that does not use the Hadoop framework. Laplace smoothing was applied to both the Hadoop MapReduce Naïve Bayes method and the bare-metal Python Naïve Bayes method to keep the calculation of the prediction probability stable, and the prediction error rate and accuracy of the two methods were compared. The Hadoop MapReduce Naïve Bayes method used a smoothing value of 1. The bare-metal Python Naïve Bayes method was tested with eleven smoothing values; the result for 0.0001, which gave the best duration together with a good prediction error rate and accuracy, was compared with the result of the Hadoop MapReduce Naïve Bayes method.

In this comparison, the Hadoop MapReduce Naïve Bayes method improved the accuracy of identifying spam and ham by a factor of 1.04 and reduced the prediction error rate by a factor of 6.04 relative to the bare-metal Python Naïve Bayes method. In conclusion, the Hadoop MapReduce Naïve Bayes machine learning method presented in this thesis shortened the execution time through appropriate selection and optimized setting of Hadoop configuration parameters, and improved system stability through easy scaling out and distributed data set replication. The proposed system architecture and algorithm therefore improved the accuracy of malicious mail prediction and reduced the prediction error rate.
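As a rough illustration of the technique described above, the following is a minimal sketch of multinomial Naïve Bayes with Laplace smoothing, with word counting split into map and reduce steps in the spirit of MapReduce. It is not the thesis's implementation: the toy corpus, the function names, and the choice to skip words never seen in training are all assumptions. Only the smoothing values 1 and 0.0001 come from the abstract.

```python
import math
from collections import Counter

# Hypothetical toy corpus; the thesis's data set is not reproduced here.
TRAIN = [
    ("spam", "win free prize now"),
    ("spam", "free money offer"),
    ("ham", "meeting schedule for monday"),
    ("ham", "project report attached"),
]


def map_phase(label, text):
    """Emit ((label, word), 1) pairs, as a MapReduce mapper would."""
    for word in text.lower().split():
        yield (label, word), 1


def reduce_phase(pairs):
    """Sum the counts for each (label, word) key, as a reducer would."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return totals


def train(docs):
    """Build per-class word counts, per-class document counts, and the vocabulary."""
    word_counts = reduce_phase(
        pair for label, text in docs for pair in map_phase(label, text)
    )
    doc_counts = Counter(label for label, _ in docs)
    vocab = {word for _, word in word_counts}
    return word_counts, doc_counts, vocab


def predict(text, model, alpha=1.0):
    """Return the class with the highest smoothed log-probability."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label, n_docs in doc_counts.items():
        label_total = sum(c for (lbl, _), c in word_counts.items() if lbl == label)
        score = math.log(n_docs / total_docs)  # log prior
        for word in text.lower().split():
            if word not in vocab:
                continue  # assumption: ignore words never seen in training
            count = word_counts[(label, word)]
            # Laplace smoothing: alpha keeps unseen (label, word) pairs above zero.
            score += math.log((count + alpha) / (label_total + alpha * len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


if __name__ == "__main__":
    model = train(TRAIN)
    print(predict("free prize", model, alpha=1.0))
    print(predict("project meeting", model, alpha=0.0001))
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the `alpha` argument makes it easy to rerun the same test set with different smoothing values, as the comparison in the abstract does.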

#Hadoop #Hadoop Distributed File System #HDFS #MapReduce #Hadoop configuration parameters #malicious mail filtering #Naïve Bayes #Laplace smoothing

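Table of Contents

Chapter 1 Introduction
1.1 Research Background
1.2 Research Direction
Chapter 2 Related Technologies and Previous Research
2.1 Hadoop Overview and Architecture
2.2 Methods for Improving Hadoop Execution Time
2.3 Major Components of Hadoop 2.0
2.4 Machine Learning Components and Overview of Naive Bayes Theory
2.5 Analysis of Related Research on Hadoop Performance Improvement and Machine-Learning-Based Spam Filtering
Chapter 3 Rationale for Selecting Hadoop MapReduce and Naive Bayes, and Technical Implementation Methods for Improving Performance and Filtering Accuracy
3.1 Rationale for Selecting Hadoop MapReduce and Naive Bayes
3.2 Optimization of Configuration Parameters to Improve Hadoop Execution Speed
3.3 Analysis of Machine Learning Techniques for Improving Malicious Mail Prediction Accuracy and Comparison with Prediction Accuracy in a Non-Hadoop Environment
Chapter 4 Experiments on Improving Execution Time through Hadoop Configuration Parameter Optimization and Implementing a Malicious Mail Prediction System Using the MapReduce Naive Bayes Technique
4.1 System Environment and Architecture of This Thesis
4.2 Hadoop Configuration Parameter Settings by Test Type
4.3 Performance (Execution Time) Improvement Results of the Malicious Mail Detection System through Hadoop Parameter Tuning
4.4 Prediction Error Rate and Accuracy Results of the Malicious Mail Detection System
4.5 Analysis and Implications of the Experimental Results
Chapter 5 Conclusion
[References]
[Appendix: List of Abbreviations]
ABSTRACT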
