
Design and Implementation of an Indexing Subsystem (3)

Published: 2021-06-08   Source: unknown


ABSTRACT

The CnX indexing subsystem is a complete index builder for Chinese XML data. It consists mainly of a Chinese-English semantic processing module, an inverted index building module, and a scoring module that uses the Okapi BM25 algorithm, a probabilistic ranking model. This paper presents the design and implementation of CnX as a multi-threaded subsystem built on a client/server (C/S) architecture. It starts from modern information retrieval technology, first considers ways of retrieving XML data, and then discusses the requirements analysis, design, and implementation of CnX. XML is semi-structured data, so how to store the structural information of an XML document must be considered when building the inverted index. The structure of an XML document resembles a tree data structure made up of element nodes. These nodes can be divided into inner nodes and leaf nodes; leaf nodes are usually considered to contain text content, while inner nodes usually do not. The text of the leaf nodes can be retrieved by full-text search, just like retrieving a plain text file.
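The inner-node/leaf-node distinction described above can be illustrated with a minimal Python sketch. It is not the CnX implementation: the function name `leaf_texts`, the path format, and the sample document are assumptions made for illustration, and any text attached to inner nodes is simply ignored.

```python
import xml.etree.ElementTree as ET

def leaf_texts(xml_string):
    """Return (tag_path, text) pairs for every leaf element that has text."""
    root = ET.fromstring(xml_string)
    results = []

    def walk(node, path):
        current = path + "/" + node.tag
        children = list(node)
        if not children:
            # Leaf node: assumed to carry the text content.
            text = (node.text or "").strip()
            if text:
                results.append((current, text))
        else:
            # Inner node: only descend; its own text is ignored here.
            for child in children:
                walk(child, current)

    walk(root, "")
    return results

# Hypothetical sample document.
sample = "<book><title>信息检索</title><chapter><para>倒排索引</para></chapter></book>"
pairs = leaf_texts(sample)
```

Each returned pair keeps the structural path alongside the text, which is the information an index over XML has to preserve.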

Because CnX focuses on Chinese indexing, Chinese sentences must first undergo lexical analysis before full-text retrieval; CnX then builds tag-term pairs according to the structural information of the XML document. Before a virtual document object is built for an XML document, the structure of the in-memory tree must be adjusted to a consistent state. After this tree is processed repeatedly, the inverted index is finally stored in a database system. Once the index is fully built, CnX scores the stored index with the Okapi BM25 algorithm for use by the upper-level core procedures.
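As a rough illustration of tag-term pairs and BM25 scoring, the following Python sketch builds an in-memory inverted index keyed by (tag, term) and scores one key against one document. It rests on several assumptions that do not come from the paper: the text is already segmented into terms, the index lives in a dictionary rather than a database, the parameters `k1=1.2` and `b=0.75` are common defaults, and the IDF uses the non-negative "+1" variant. The names `build_index` and `bm25` are hypothetical.

```python
import math
from collections import defaultdict

def build_index(docs):
    """docs: {doc_id: [(tag, [terms])]} -> (postings, doc_len).

    postings maps a (tag, term) pair to {doc_id: term frequency}.
    """
    postings = defaultdict(lambda: defaultdict(int))
    doc_len = {}
    for doc_id, fields in docs.items():
        length = 0
        for tag, terms in fields:
            for term in terms:
                postings[(tag, term)][doc_id] += 1
                length += 1
        doc_len[doc_id] = length
    return postings, doc_len

def bm25(postings, doc_len, key, doc_id, k1=1.2, b=0.75):
    """BM25 score of one (tag, term) key for one document."""
    plist = postings.get(key, {})
    tf = plist.get(doc_id, 0)
    if tf == 0:
        return 0.0
    n_docs = len(doc_len)
    df = len(plist)                      # documents containing the key
    avgdl = sum(doc_len.values()) / n_docs
    # Assumed IDF variant: the "+1" inside the log keeps it non-negative.
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avgdl)
    return idf * tf * (k1 + 1) / norm

# Hypothetical, pre-segmented documents.
docs = {
    "d1": [("title", ["索引", "系统"])],
    "d2": [("title", ["检索"]), ("para", ["索引"])],
}
postings, doc_len = build_index(docs)
score_d1 = bm25(postings, doc_len, ("title", "索引"), "d1")
score_d2 = bm25(postings, doc_len, ("title", "索引"), "d2")
```

Keying the postings by (tag, term) rather than by term alone means the same word under different tags is indexed separately, so `score_d2` is zero here even though "索引" occurs in `d2` under another tag.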

The CnX indexing subsystem is a complete XML-based information retrieval system, and it plays an important role in building the overall information retrieval system.

Key words: XML; Chinese Words; Inverted Index; Information Retrieval (IR)
