
A Formal Model for Information Selection in Multi-Sentence Text Extraction

Elena Filatova
Department of Computer Science
Columbia University
New York, NY 10027, USA
filatova@cs.columbia.edu

Vasileios Hatzivassiloglou
Center for Computational Learning Systems
Columbia University
New York, NY 10027, USA
vh@cs.columbia.edu

Abstract

Selecting important information while accounting for repetitions is a hard task for both summarization and question answering. We propose a formal model that represents a collection of documents in a two-dimensional space of textual and conceptual units with an associated mapping between these two dimensions. This representation is then used to describe the task of selecting textual units for a summary or answer as a formal optimization task. We provide approximation algorithms and empirically validate the performance of the proposed model when used with two very different sets of features, words and atomic events.

1 Introduction

Many natural language processing tasks involve the collection and assembling of pieces of information from multiple sources, such as different documents or different parts of a document. Text summarization clearly entails selecting the most salient information (whether generically or for a specific task) and putting it together in a coherent summary. Question answering research has recently started examining the production of multi-sentence answers, where multiple pieces of information are included in the final output.

When the answer or summary consists of multiple separately extracted (or constructed) phrases, sentences, or paragraphs, additional factors influence the selection process. Obviously, each of the selected text snippets should individually be important. However, when many of the competing passages are included in the final output, the issue of information overlap between the parts of the output comes up, and a mechanism for addressing redundancy is needed. Current approaches in both summarization and long answer generation are primarily oriented towards making good decisions for each potential part of the output, rather than examining whether these parts overlap. Most current methods adopt a statistical framework, without full semantic analysis of the selected content passages; this makes the comparison of content across multiple selected text passages hard, and necessarily approximated by the textual similarity of those passages.

Thus, most current summarization or long-answer question-answering systems employ two levels of analysis: a content level, where every textual unit is scored according to the concepts or features it covers, and a textual level, where, before being added to the final output, the textual units deemed to be important are compared to each other and only those that are not too similar to other candidates are included in the final answer or summary. This comparison can be performed purely on the basis of text similarity, or on the basis of shared features that may be the same as the features used to select the candidate text units in the first place.
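A minimal sketch of this typical two-stage pipeline is given below, assuming a generic content scorer and a pairwise similarity function; the names two_stage_select, content_score, and similarity, as well as the threshold value, are illustrative assumptions rather than any particular system's interface.

    from typing import Callable, List, Sequence, TypeVar

    T = TypeVar("T")

    def two_stage_select(units: Sequence[T],
                         content_score: Callable[[T], float],
                         similarity: Callable[[T, T], float],
                         k: int,
                         threshold: float = 0.7) -> List[T]:
        """Rank textual units by a content score, then add them in order,
        skipping any unit that is too similar to one already selected."""
        ranked = sorted(units, key=content_score, reverse=True)
        chosen: List[T] = []
        for unit in ranked:
            if all(similarity(unit, prev) < threshold for prev in chosen):
                chosen.append(unit)
            if len(chosen) == k:
                break
        return chosen

Note that in this scheme the redundancy check is a post hoc filter applied to units that were scored independently, which is exactly the separation the model proposed below removes.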

In this paper, we propose a formal model for integrating these two tasks, simultaneously performing the selection of important text passages and the minimization of information overlap between them. We formalize the problem by positing a textual unit space, from which all potential parts of the summary or answer are drawn, a conceptual unit space, which represents the distinct conceptual pieces of information that should be maximally included in the final output, and a mapping between conceptual and textual units. All three components of the model are application- and task-dependent, allowing for different applications to operate on text pieces of different granularity and aim to cover different conceptual features, as appropriate for the task at hand. We cast the problem of selecting the best textual units as an optimization problem over a general scoring function that measures the total coverage of conceptual units by any given set of textual units, and provide general algorithms for obtaining a solution.

By integrating redundancy checking into the selection of the textual units we provide a unified framework for addressing content overlap that does not require external measures of similarity between textual units. We also account for the partial overlap of information between textual units (e.g., a single shared clause), a situation which is common in nat-
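To make the coverage objective concrete: if each conceptual unit c has a weight w_c and a set S of textual units covers the conceptual units mapped to its members, a natural scoring function is f(S) = sum of w_c over all c covered by S. Maximizing f(S) under a budget on the number of textual units is an instance of weighted maximum coverage, for which a greedy strategy is the standard approximation. The sketch below illustrates that generic greedy strategy, not the paper's exact algorithm or feature sets; all identifiers and the toy data are hypothetical.

    from typing import Dict, List, Set

    def greedy_coverage(mapping: Dict[str, Set[str]],
                        weight: Dict[str, float],
                        budget: int) -> List[str]:
        """Greedy approximation for weighted maximum coverage:
        repeatedly add the textual unit whose not-yet-covered
        conceptual units have the largest total weight."""
        selected: List[str] = []
        covered: Set[str] = set()
        remaining = dict(mapping)
        for _ in range(budget):
            best_unit, best_gain = None, 0.0
            for unit, concepts in remaining.items():
                gain = sum(weight.get(c, 0.0) for c in concepts - covered)
                if gain > best_gain:
                    best_unit, best_gain = unit, gain
            if best_unit is None:  # no remaining unit adds new information
                break
            selected.append(best_unit)
            covered |= mapping[best_unit]
            del remaining[best_unit]
        return selected

    # Toy example: sentences as textual units, word features as conceptual units.
    sentences = {
        "s1": {"earthquake", "casualties"},
        "s2": {"earthquake", "magnitude"},
        "s3": {"relief", "casualties"},
    }
    weights = {"earthquake": 2.0, "casualties": 1.5, "magnitude": 1.0, "relief": 0.5}
    print(greedy_coverage(sentences, weights, budget=2))  # -> ['s1', 's2']

Because each unit's gain is computed only over conceptual units not yet covered, redundant textual units contribute little once their content is already in the selection, which is how redundancy handling folds into the selection step itself rather than being a separate similarity filter.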
