Technology
The Power Text Solutions text-mining technology can analyze any form of
unstructured text resource. It combines Information Synthesis and
Information Extraction procedures.
Our approach to multi-document summarization follows these major
criteria:
a. pertinence of information output
b. comprehensiveness
c. readability - the artificially compiled text summary must approach the
quality of human-written overviews
d. ease of understanding and usage of the summarized information - this
is primarily about how the output text is ordered and organized.
These criteria are met in four basic processing steps:
1. Processing of original documents
2. Information extraction
3. Ordering of the output text units
4. Organizing the output
Let's now look at these steps in more detail.
1. Processing of original documents
The key components here are:
a. fetching the documents aggregated by a search engine, a database, a
web forum site or a custom collection of documents
b. retrieving and indexing the text passages - these must be text units
that are independent of their original semantic context
Typically, it all starts with submitting the user's search query to a
search engine or a database. The retrieved documents are "cleaned" of
non-informative content. Semantically self-sufficient, complete text
passages are then extracted. No dangling anaphora (references to
content outside the passage itself) are allowed. Any redundant passages
are eliminated at this stage. In the course of the indexing procedure,
all essential content elements are assigned to the text passages.
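The passage filtering described above could be sketched roughly as
follows. This is only an illustration, not the actual Power Text
Solutions implementation: the list of anaphoric openers, the
word-overlap (Jaccard) redundancy measure and the 0.8 threshold are all
assumptions made for the example.

```python
import re

# Hypothetical list of openers that usually refer back to earlier text.
ANAPHORIC_OPENERS = {"he", "she", "it", "they", "this", "that",
                     "these", "those", "such"}


def words(passage):
    """Return the lower-cased word set of a passage."""
    return set(re.findall(r"[a-z0-9']+", passage.lower()))


def has_dangling_anaphora(passage):
    """Crude check: the passage opens with a pronoun-like word."""
    tokens = re.findall(r"[a-z']+", passage.lower())
    return bool(tokens) and tokens[0] in ANAPHORIC_OPENERS


def jaccard(a, b):
    """Word-overlap similarity between two word sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0


def filter_passages(passages, redundancy_threshold=0.8):
    """Keep self-sufficient passages and drop near-duplicates."""
    kept, kept_words = [], []
    for passage in passages:
        if has_dangling_anaphora(passage):
            continue  # refers to content outside the passage
        w = words(passage)
        if any(jaccard(w, seen) >= redundancy_threshold for seen in kept_words):
            continue  # redundant with a passage already kept
        kept.append(passage)
        kept_words.append(w)
    return kept
```

A production system would need real anaphora resolution; checking only
the opening word is a deliberately simple stand-in.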
2. Information Extraction
a) using patterns
We have identified thousands of patterns that show how certain kinds of
data are presented in written texts. Searching for these characteristic
patterns allows us to extract specific kinds of exact data, such as
answers to factual questions (Who? Where? When?) or direct speech on a
given subject, accompanied by information about the author of each
quotation, such as the person's full name, affiliation and position.
Here are a few examples:
Personal names in relation to their organizations and occupied posts.
Data on companies (location, founding date, executives, partners, etc.).
Information on a business area: segments of a given market field, key
players, their profiles, top executives, important events (lawsuits,
takeovers, mergers, acquisitions, etc.), planned actions, company
announcements, and relevant analysts' views and opinions (including the
supporting information).
Exact answers to factual questions (who?, where?, when?, how many?, how
much?, what is?, etc.). For example, as an answer to the question "Who
is Mr X?" a user obtains this person’s full name and position in the
respective organization.
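To illustrate the idea of pattern-based extraction, here is a minimal
sketch that answers a "Who is Mr X?" question with a single surface
pattern. The pattern itself, the function name who_is and the sample
sentence (Jane Smith, Acme Holdings) are invented for this example; the
real system is described as using thousands of patterns.

```python
import re

# Hypothetical surface pattern:
#   "<Full Name>, <position> of|at <Organization>"
PERSON_PATTERN = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+), "
    r"(?P<position>[a-z][a-z ]+?) (?:of|at) "
    r"(?P<organization>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)


def who_is(surname, text):
    """Return name, position and organization for people matching `surname`."""
    answers = []
    for match in PERSON_PATTERN.finditer(text):
        if surname in match.group("name").split():
            answers.append(match.groupdict())
    return answers


text = ("The deal was announced by Jane Smith, chief executive of "
        "Acme Holdings, on Monday.")
answers = who_is("Smith", text)
```

Each additional kind of data (locations, dates, quotations) would get
its own family of patterns in the same way.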
b) applying semantic lists and thesauri
This approach extracts information relevant to a user's professional
domain or a special perspective.
It has proved to be particularly useful in our custom solutions
available for several professional areas. One example is the Homeland
Security Reporter system, designed for HLS practitioners. This system
offers 8 distinct work modes. These modes can be used with 16 major
homeland security Perspectives, e.g. potential hazards (natural,
technological, health hazards) and threats (terrorism, extremism, law
enforcement, etc.). The system outputs a structured information report
about alerts, about a particular person or organization, etc.
A noteworthy application of this approach is presenting information
from a certain angle, such as causes and consequences or advantages and
disadvantages (say, of a certain product). This option is particularly
useful for analyzing a complex research area (example: viewing
information on business strategy from the perspective of analytical
tools supporting strategic choices).
c) identifying main concepts of the document(s)
We use this methodology as an auxiliary technique in query-biased
multi-document summarization, though its most significant role is in
non-query-biased summarization. The most noteworthy example here is our
NewsFeed Researcher project - continuous automatic summarization of
documents aggregated by Google News.
d) analyzing the user's query
Query analysis is one of the most sophisticated steps of the presented
technology. Its distinctive feature is deducing the user's intentions.
One example is revealing implicitly requested contexts. From the
presence of certain semantic elements and their combinations, the
software can recognize the user's request for retrieving information
from a specific angle, such as "Advantages & Disadvantages" or "Causes
& Consequences", and many others.
At this step, we also identify semantically conjoined terms (phrases),
even if the user has not enclosed them in double quotes. For example,
if the query is about New York, the engine will not search for just
any sentences containing the word "new".
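The two query-analysis behaviours described above (detecting an
implicitly requested angle and keeping conjoined terms together) can be
sketched as follows. The cue words, the phrase lexicon, the stop-word
list and the function name analyze_query are all hypothetical; the
actual analysis is certainly richer than a dictionary lookup.

```python
import re

# Hypothetical cue words signalling an implicitly requested perspective.
PERSPECTIVE_CUES = {
    "Advantages & Disadvantages": {"advantages", "disadvantages",
                                   "benefits", "drawbacks", "pros", "cons"},
    "Causes & Consequences": {"causes", "consequences", "reasons", "effects"},
}

# Hypothetical lexicon of multi-word terms that must stay together.
KNOWN_PHRASES = ["new york", "text mining", "homeland security"]

# Small illustrative stop-word list.
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the"}


def analyze_query(query):
    """Split a query into phrases, remaining terms and detected perspectives."""
    text = query.lower()
    phrases = [p for p in KNOWN_PHRASES
               if re.search(r"\b" + re.escape(p) + r"\b", text)]
    remaining = text
    for phrase in phrases:
        # Remove the phrase so its words are not searched for individually.
        remaining = re.sub(r"\b" + re.escape(phrase) + r"\b", " ", remaining)
    tokens = re.findall(r"[a-z0-9]+", remaining)
    perspectives = [name for name, cues in PERSPECTIVE_CUES.items()
                    if cues & set(tokens)]
    all_cues = set().union(*PERSPECTIVE_CUES.values())
    terms = [t for t in tokens if t not in all_cues and t not in STOPWORDS]
    return {"phrases": phrases, "terms": terms, "perspectives": perspectives}


result = analyze_query("benefits and drawbacks of living in New York")
```

Here "new york" is kept as one phrase, so the word "new" never becomes
a standalone search term.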
3. The Ordering Step
Meaningful ordering ensures that the retrieved information is
perceived as a unified text, not as a collection of independent text
passages. It is primarily intended for reading as a whole, from
beginning to end, not just for picking out separate notes.
The ordering of text passages is achieved by applying a combination of
two main criteria: relevance score and semantic closeness.
As a result, passages representing a given thematic line are grouped
together, with the most essential at the top, followed by passages that
supplement the most pertinent ones with additional information. With
such a coherent order of passages, the text unfolds naturally, starting
from the most relevant items and moving downwards until a given
thematic line is completed. By adjusting the parameters that govern the
relevance score and semantic closeness, we are able to increase the
conciseness of the output or to achieve higher completeness.
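One possible way to combine the two criteria is a greedy grouping,
sketched below. It is an assumption-laden illustration rather than the
actual algorithm: semantic closeness is approximated by word overlap
(Jaccard), relevance scores are taken as given, and the parameter names
min_relevance and min_closeness are invented.

```python
import re


def words(text):
    """Return the lower-cased word set of a passage."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def closeness(a, b):
    """Word-overlap (Jaccard) stand-in for semantic closeness."""
    wa, wb = words(a), words(b)
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0


def order_passages(scored, min_relevance=0.0, min_closeness=0.2):
    """Order (passage, relevance) pairs into thematic lines.

    Each line starts with the most relevant remaining passage, followed
    by the passages close to it, in descending order of relevance.
    """
    pool = sorted((p for p in scored if p[1] >= min_relevance),
                  key=lambda p: p[1], reverse=True)
    ordered = []
    while pool:
        head = pool.pop(0)
        ordered.append(head[0])
        related = [p for p in pool if closeness(head[0], p[0]) >= min_closeness]
        for p in related:
            ordered.append(p[0])
            pool.remove(p)
    return ordered


scored = [
    ("Acme acquired Beta Corp in 2005.", 0.9),
    ("Text mining tools analyze unstructured documents.", 0.8),
    ("The Acme acquisition of Beta Corp cost two million.", 0.5),
    ("Unstructured documents include text from web forums.", 0.4),
]
full = order_passages(scored)
concise = order_passages(scored, min_relevance=0.45)
```

Raising min_relevance shortens the output (more concise), while
lowering it keeps more supplementary passages (more complete).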
4. The Text Organization Step
Our approach allows the compiled text to be broken into paragraphs,
subsections and sections. The starting sentences of all subsections are
put into a list at the top of the corresponding section. This list
serves as the Contents. From every item heading in the Contents, a
reader can navigate to the given subsection.
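Building the Contents from the starting sentences of subsections can be
sketched as below. The sentence splitter is a simple regular expression
that would mis-handle abbreviations such as "e.g.", and the function
names are invented for the example; the numeric index stands in for the
navigation link.

```python
import re


def first_sentence(text):
    """Return the text up to the first sentence-ending punctuation mark."""
    match = re.match(r".+?[.!?](?=\s|$)", text.strip(), flags=re.DOTALL)
    return match.group(0) if match else text.strip()


def build_contents(subsections):
    """Return (subsection number, heading) pairs for a section."""
    return [(i, first_sentence(s)) for i, s in enumerate(subsections, start=1)]


subsections = [
    "Acme was founded in 1990. It sells tools.",
    "Revenue grew 3.5 percent last year! Analysts were surprised.",
]
contents = build_contents(subsections)
```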
At the very top of this content hierarchy are the sections. Each
section starts with its own Contents (list of subsections). These
sections differ by degree of pertinence of the information they
contain.
The Key Topics section presents the most pertinent information on the
subject. Depending on the nature of the information resources and the
abundance of information, we may need to adjust the internal parameters
of the system in order to keep this section concise. The In-Depth
section contains highly relevant, though probably somewhat less
important, information. The Possibly Useful section provides additional
or supplementary information on the subject, even when it is only
moderately relevant.
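Assuming, for illustration only, that each passage carries a single
pertinence score between 0 and 1, the split into sections could look
like the sketch below. The cutoff values and function names are
hypothetical; they play the role of the "internal parameters" that can
be adjusted to keep Key Topics concise.

```python
def assign_section(relevance, key_cutoff=0.8, in_depth_cutoff=0.5,
                   useful_cutoff=0.2):
    """Map a pertinence score to a section name (None means: drop it)."""
    if relevance >= key_cutoff:
        return "Key Topics"
    if relevance >= in_depth_cutoff:
        return "In-Depth"
    if relevance >= useful_cutoff:
        return "Possibly Useful"
    return None


def build_sections(scored, **cutoffs):
    """Group (passage, relevance) pairs into the three sections."""
    sections = {"Key Topics": [], "In-Depth": [], "Possibly Useful": []}
    for passage, relevance in scored:
        name = assign_section(relevance, **cutoffs)
        if name is not None:
            sections[name].append(passage)
    return sections


scored = [("a", 0.9), ("b", 0.6), ("c", 0.3), ("d", 0.1)]
default_sections = build_sections(scored)
# A stricter cutoff keeps Key Topics shorter when information is abundant.
strict_sections = build_sections(scored, key_cutoff=0.95)
```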
Additionally, we provide the General Info section, which contains
definitions of the subject and basic information about it. This section
shows up when the query is a single term, a two-word phrase or a longer
phrase enclosed in quotes and, of course, when the appropriate
information is present. A professional or analyst probably wouldn't
need the General Info section, being usually interested in a more
specific consideration of facts and relationships. However, for other
users, e.g. students, such a section might be quite useful when the
query subject is something rare and not widely known (e.g., text
mining). In many cases, their information need can be satisfied with
just this section. Alternatively, the General Info section may help
them to digest the more detailed and specific information in the
subsequent sections.
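The rule for when the General Info section appears can be written down
directly. The function name show_general_info and the boolean flag
has_general_passages are hypothetical; the flag stands for "the
appropriate information is present".

```python
def show_general_info(query, has_general_passages):
    """Decide whether the General Info section should be shown."""
    if not has_general_passages:
        return False
    q = query.strip()
    # A longer phrase qualifies only when enclosed in double quotes.
    if len(q) > 2 and q.startswith('"') and q.endswith('"'):
        return True
    # Otherwise the query must be a single term or a two-word phrase.
    return 1 <= len(q.split()) <= 2
```

For example, the query "text mining" qualifies, while an unquoted
three-word query does not.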
The final section of the summary is represented by Sources. The source
items are ranked according to their informativeness and relevance.
Several components of our proprietary technology have been successfully
tested at the TREC (2001, 2002) and DUC (2005) conferences conducted by
NIST.
Selected Bibliography:
M.M. Soubbotin, S.M. Soubbotin. Trade-Off between Factors Influencing
Quality of the Summary. Document Understanding Conference (DUC) - 2005.
M.M. Soubbotin, S.M. Soubbotin. Exhaustive Mining of Information from
Unstructured Documents. Accepted for presentation at the 9th World
Multiconference on Systemics, Cybernetics and Informatics (WMSCI 2005),
Orlando, USA, July 10-13, 2005.
M.M. Soubbotin, S.M. Soubbotin. Use of Patterns for Detection of Likely
Answer Strings: A Systematic Approach. In: Proceedings of the Text
REtrieval Conference (TREC-2002). Gaithersburg, MD, November 2003.
M.M. Soubbotin, S.M. Soubbotin. Patterns of Potential Answer Expressions
as Clues to the Right Answer. In: Proceedings of TREC-10 Conference.
Gaithersburg, MD, November 2001.