Selecting important information while accounting for repetitions is a hard task for both summarization and question answering. We propose a formal model that represents a collection of documents in a two-dimensional space of textual and conceptual units wi
weighing the two components appropriately.
3 Handling Redundancy in Summarization
Redundancy of information has been found useful in determining what text pieces should be included during summarization, on the basis that information that is repeated is likely to be central to the topic or event being discussed. Earlier work has also recognized that, while it is a good idea to select among the passages repeating information, it is also important to avoid repetition of the same information in the final output.
Two main approaches have been proposed for avoiding redundancy in the output. One approach relies on grouping together potential output text units on the basis of their similarity, and outputting only a representative from each group (Hatzivassiloglou et al., 2001). Sentences can be clustered in this manner according to word overlap, or by using additional content similarity features. This approach has been recently applied to the construction of paragraph-long answers (e.g., (Blair-Goldensohn et al., 2003; Yu and Hatzivassiloglou, 2003)).
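The clustering approach can be sketched as follows. This is an illustrative reconstruction, not the cited systems' implementation: the Jaccard word-overlap measure, the similarity threshold, and the choice of the earliest sentence as cluster representative are all assumptions made for the example.

```python
def jaccard(a, b):
    """Word-overlap similarity between two sentences (illustrative choice)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def cluster_representatives(sentences, threshold=0.5):
    """Group sentences by word overlap; emit one representative per group."""
    clusters = []  # each cluster is a list of similar sentences
    for s in sentences:
        for c in clusters:
            if any(jaccard(s, member) >= threshold for member in c):
                c.append(s)
                break
        else:
            clusters.append([s])
    # keep the first (e.g., earliest) sentence of each cluster
    return [c[0] for c in clusters]

sentences = [
    "the storm hit the coast on monday",
    "on monday the storm hit the coast",   # repeats the first sentence
    "damage estimates reached ten million dollars",
]
print(cluster_representatives(sentences))  # two representatives, not three
```

Because similarity here is purely surface word overlap, two sentences conveying the same content with different words would land in separate clusters; this is the motivation for the additional content similarity features mentioned above.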
An alternative approach, proposed for the synthesis of information during query-based passage retrieval, is the maximum marginal relevance (MMR) method (Goldstein et al., 2000). This approach assigns to each potential new sentence in the output a similarity score with the sentences already included in the summary. Only those sentences that contain a substantial amount of new information can get into the summary. MMR bases this similarity score on word overlap and additional information about the time when each document was released, and thus can fail to identify repeated information when paraphrasing is used to convey the same meaning.
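A minimal sketch of MMR-style selection follows. The word-overlap `sim` function and the `lambda_` trade-off parameter are illustrative assumptions; the cited method additionally uses document release times, which this sketch omits.

```python
def sim(a, b):
    """Surface word-overlap similarity (illustrative stand-in)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / max(len(wa | wb), 1)

def mmr_select(candidates, query, k, lambda_=0.7):
    """Greedily pick k sentences, balancing query relevance against
    redundancy with sentences already selected."""
    selected = []
    pool = list(candidates)
    while pool and len(selected) < k:
        def score(s):
            redundancy = max((sim(s, t) for t in selected), default=0.0)
            return lambda_ * sim(s, query) - (1 - lambda_) * redundancy
        best = max(pool, key=score)
        selected.append(best)
        pool.remove(best)
    return selected
```

Because `sim` is surface word overlap, a paraphrase sharing few words with an already-selected sentence incurs almost no redundancy penalty, which is exactly the failure mode noted above.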
In contrast to these approaches, our model handles redundancy in the output at the same time it selects the output sentences. It is clear from equations (1)–(3) that each conceptual unit is counted only once whether it appears in one or multiple textual units. Thus, when we find the subset of textual units that maximizes overall information coverage with a constraint on the total number or length of textual units, the model will prefer the collection of textual units that have minimal overlap of covered conceptual units. Our approach offers three advantages versus both clustering and MMR: First, it integrates redundancy elimination into the selection process, requiring no additional features for defining a text-level similarity between selected textual units. Second, decisions are based on the same features that drive the summarization itself, not on additional surface properties of similarity. Finally, because all decisions are informed by the overlap of conceptual units, our approach accounts for partial overlap of information across textual units. To illustrate this last point, consider a case where three features A, B, and C should be covered in the output, and where three textual units are available, covering A and B, A and C, and B and C, respectively. Then our model will determine that selecting any two of the textual units is fully sufficient, while this may not be apparent on the basis of text similarity between the three text units; a clustering algorithm may form three singleton clusters, and MMR may determine that each textual unit is sufficiently different from the others, especially if A, B, and C are realized with nearly the same number of words.

4 Applying the Model
Having presented a formal metric for the information content (and optionally the cost) of any potential summary or answer, the task that remains is to optimize this metric and select the corresponding set of textual units for the final output. As stated in Section 2.3, one possible way to do this is to focus on the information content metric and introduce an additional constraint, limiting the total cost to a constant. An alternative is to optimize directly the composite function that combines cost and information content into a single number.
We examine the case of zero-one mappings between textual and conceptual units, where the total information content is specified by equation (1). The complexity of the problem depends on the cost function, and whether we optimize I(S) while keeping P(S) fixed or whether we optimize a combined function of both of those quantities. We will only consider the former case in the present paper. We start by examining an artificially simple case, where the cost assigned to each textual unit is 1, and the function P for combining costs is their sum. In this case, the total cost is equal to the number of textual units used in a summary.
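The zero-one case with unit costs can be made concrete with the three-feature example from the previous section. In this sketch, each textual unit is represented simply as the set of conceptual units it covers, and I(S) is computed as the total weight of the union of covered units, so that a unit covered by several selected textual units is counted only once. The uniform weights and the brute-force search over subsets of size k are illustrative choices for this small example, not the algorithm proposed in the paper.

```python
from itertools import combinations

def information_content(selection, weights):
    """I(S) under the zero-one mapping: total weight of conceptual
    units covered by at least one selected textual unit, each
    counted once regardless of how many units cover it."""
    covered = set().union(*selection) if selection else set()
    return sum(weights[c] for c in covered)

def best_k_units(units, weights, k):
    """Brute-force maximum coverage, feasible only for tiny inputs."""
    return max(combinations(units, k),
               key=lambda s: information_content(s, weights))

# Three textual units covering {A,B}, {A,C}, and {B,C}, with
# illustrative uniform conceptual-unit weights.
units = [frozenset("AB"), frozenset("AC"), frozenset("BC")]
weights = {"A": 1.0, "B": 1.0, "C": 1.0}
chosen = best_k_units(units, weights, k=2)
print(information_content(chosen, weights))  # → 3.0: any two units cover all of A, B, C
```

Note that every pair of textual units here achieves full coverage of 3.0, while a third unit would add no information at all, which is how the model penalizes redundancy without any separate text-similarity computation.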
This problem, as we have formalized it above, is identical to the Maximum Set Coverage problem studied in theoretical computer science: given C, a finite set of weighted elements, a collection T of subsets of C, and an integer k, find those k sets that maximize the total weight of the elements in the union of the selected members of T (Hochbaum, 1997). In our case, the zero-one mapping allows us to view each textual unit as a subset of the conceptual unit space, containing those conceptual units covered by the textual unit, and k is the total target cost. Unfortunately, maximum set coverage is NP-hard, as it is