Main Content Extraction from Web Pages Based on Node Characteristics

Qingtang Liu; Mingbo Shao; Linjing Wu; Gang Zhao; Guilin Fan; Jun Li

Type: Academic journal

Author information: Qingtang Liu (Central China Normal University); Mingbo Shao (Central China Normal University); Linjing Wu (Central China Normal University); Gang Zhao (Central China Normal University); Guilin Fan (Central China Normal University); Jun Li (Hubei University for Nationalities)

Journal information: Korean Institute of Information Scientists and Engineers, Journal of Computing Science and Engineering, Vol.11 No.2

Publication year: 2017.6

Pages: 39 - 48 (10 pages)

Abstract · Keywords

Main content extraction from web pages is widely used in search engines, web content aggregation, and mobile Internet browsing. However, web pages contain a large amount of irrelevant information, such as advertisements, unrelated navigation links, and other junk content. Such information reduces the efficiency of web content processing in content-based applications. This paper proposes an automatic method for extracting the main content of web pages. The method uses two indicators to describe the characteristics of web pages: text density and hyperlink density. Because similar content tends to be distributed continuously on a page, an estimation algorithm judges whether a node is a content node or a noise node based on the characteristics of the node and its neighboring nodes. This algorithm makes it possible to filter out advertisement nodes and irrelevant navigation. Experimental results on 10 news websites show that the algorithm achieves an average acceptable rate of 96.34%.
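The abstract names the two indicators but does not give their formulas, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes text density is the number of text characters per tag in a node's subtree, and hyperlink density is the share of a node's text that sits inside `<a>` elements. The neighbor-weighted score, the 0.5/0.25/0.25 weights, the threshold of 20, and all names (`Node`, `TreeBuilder`, `classify_siblings`) are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM node: tag name, parent, children, directly owned text."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.own_text = ""


class TreeBuilder(HTMLParser):
    """Builds a simple node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore stray closing tags
            self.current = node.parent

    def handle_data(self, data):
        self.current.own_text += data.strip()


def text_len(node):
    return len(node.own_text) + sum(text_len(c) for c in node.children)


def link_text_len(node):
    if node.tag == "a":
        return text_len(node)
    return sum(link_text_len(c) for c in node.children)


def tag_count(node):
    return 1 + sum(tag_count(c) for c in node.children)


def text_density(node):
    # Assumed definition: text characters per tag in the subtree.
    return text_len(node) / tag_count(node)


def hyperlink_density(node):
    # Assumed definition: fraction of the node's text that is link text.
    total = text_len(node)
    return link_text_len(node) / total if total else 0.0


def classify_siblings(nodes, threshold=20.0):
    """Label sibling nodes using each node's score and its neighbors' scores."""
    raw = [text_density(n) * (1.0 - hyperlink_density(n)) for n in nodes]
    labels = []
    for i, score in enumerate(raw):
        left = raw[i - 1] if i > 0 else score
        right = raw[i + 1] if i + 1 < len(raw) else score
        smoothed = 0.5 * score + 0.25 * left + 0.25 * right
        labels.append("content" if smoothed >= threshold else "noise")
    return labels


# Toy page: navigation bar, three article blocks, one advertisement link.
body = "word " * 24
page = (
    '<div><a href="/">Home</a><a href="/w">World</a><a href="/s">Sports</a></div>'
    f"<div><p>{body}</p></div>"
    "<div><p>Photo: city hall</p></div>"
    f"<div><p>{body}</p></div>"
    '<div><a href="#">Buy now</a></div>'
)
builder = TreeBuilder()
builder.feed(page)
builder.close()
blocks = builder.root.children
labels = classify_siblings(blocks)
```

In this toy page the short caption block has a low score on its own, but it sits between two dense text blocks, so the neighbor-weighted score keeps it as content. The link-only navigation and advertisement blocks score zero and fall below the threshold. A real page would need tuned thresholds and handling of scripts, styles, and nested layouts.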

#Content extraction #Web page #Text density #Hyperlink density
