Text Mining Applications

Technology


The Power Text Solutions text-mining technology can analyze any form of unstructured text resources. It combines Information Synthesis and Information Extraction procedures.

Our approach to multi-document summarization follows these major criteria:
a. pertinence of information output
b. comprehensiveness
c. readability - the automatically compiled text summary must approach the quality of human-written overviews
d. ease of understanding and usage of the summarized information - this is primarily about how the output text is ordered and organized.

These criteria are met in four basic processing steps.

1. Processing of original documents
2. Information extraction
3. Ordering of the output text units
4. Organizing the output

Let's now look at these steps in more detail.

1. Processing of original documents

The key components here are:
a. fetching the documents - these may be aggregated by a search engine, a database, a web forum site or a custom collection of documents
b. retrieving and indexing the text passages - these must be text units that are independent of their original semantic context

Typically, it all starts with submitting the user's search query to a search engine or a database. The retrieved documents are "cleaned up" by removing non-informative content. Semantically self-sufficient, complete text passages are then extracted. No dangling anaphora (references to content outside the passage itself) are allowed. Any redundant passages are eliminated at this stage. In the course of the indexing procedure, all essential content elements are assigned to the text passages.
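
The passage selection described above can be illustrated with a minimal sketch. It is not the actual implementation: the list of anaphoric opening words, the duplicate test and the function names are assumptions made for illustration only.

```python
import re

# Hypothetical list of words that often signal a reference to earlier text.
ANAPHORIC_OPENERS = {"he", "she", "it", "they", "this", "that", "these", "those", "such"}


def normalize(passage):
    """Lowercase the passage and drop punctuation, for duplicate detection."""
    return " ".join(re.findall(r"\w+", passage.lower()))


def select_passages(paragraphs):
    """Keep passages that look self-sufficient and are not duplicates."""
    seen = set()
    kept = []
    for text in paragraphs:
        text = text.strip()
        words = re.findall(r"\w+", text.lower())
        if not words:
            continue  # non-informative (empty) block
        if words[0] in ANAPHORIC_OPENERS:
            continue  # probably refers to content outside the passage
        key = normalize(text)
        if key in seen:
            continue  # redundant passage
        seen.add(key)
        kept.append(text)
    return kept
```

A real system would need far more than a check of the first word to detect dangling anaphora; the sketch only shows where such a filter sits in the pipeline.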

2. Information Extraction

a) using patterns
We have identified thousands of patterns of how certain kinds of data are presented in written texts. Searching for these characteristic patterns allows us to extract special kinds of exact data, such as answers to factual questions like Who? Where? When?, or direct speech on a given subject accompanied by information about the author of each quotation, such as the person's full name, affiliation and position.

Here are a few examples:
Personal names in relation to their organizations and the posts they occupy.
Data on companies (location, founding date, executives, partners, etc.)
Information on a business area: segments of a given market field, key players, their profiles, top executives, important events (lawsuits, takeovers, mergers, acquisitions, etc.), planned actions, company announcements, relevant analysts' views and opinions (including the supporting information).
Exact answers to factual questions (who?, where?, when?, how many?, how much?, what is?, etc.). For example, as an answer to the question "Who is Mr X?" a user obtains this person’s full name and position in the respective organization.
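
To make the idea of a pattern concrete, here is a sketch of a single pattern for personal names with their posts and organizations. The regular expression, the group names and the function name are illustrative assumptions; the patterns actually used are not published on this page.

```python
import re

# One illustrative pattern: "<Full Name>, <position> of|at <Organization>".
PERSON_POST = re.compile(
    r"(?P<name>[A-Z][a-z]+(?: [A-Z][a-z]+)+), "
    r"(?P<position>[a-z][a-z ]+?) (?:of|at) "
    r"(?P<organization>[A-Z][A-Za-z]*(?: [A-Z][A-Za-z]*)*)"
)


def extract_person_posts(text):
    """Return one dict (name, position, organization) per pattern match."""
    return [match.groupdict() for match in PERSON_POST.finditer(text)]


sample = "According to Jane Smith, chief executive of Example Systems, sales rose."
print(extract_person_posts(sample))
```

A single pattern like this covers only one way of phrasing the relation, which is why a large inventory of patterns is needed in practice.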

b) applying semantic lists and thesauri
This is about extracting information on a user's professional domain or a special perspective.
This approach has proved particularly useful in our custom solutions for several professional areas. One example is the Homeland Security Reporter system, designed for HLS practitioners. The system offers 8 distinct work modes, which can be used with 16 major homeland security Perspectives, e.g. potential hazards (natural, technological, health hazards) and threats (terrorism, extremism, law enforcement, etc.). The system outputs a structured information report about alerts, about a particular person or organization, etc.

A noteworthy application of this approach is presenting information from a certain angle, such as causes and consequences or advantages and disadvantages (say, of a certain product). This option is particularly useful for analysis of a complex research area (example: viewing information on business strategy from the perspective of analytical tools supporting strategic choices).
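
A semantic list can be pictured as a set of cue phrases attached to each perspective. The sketch below tags a passage with every perspective whose cues it contains. The cue lists and names are invented for illustration and are far smaller than any real thesaurus.

```python
import re

# Hypothetical cue phrases per perspective.
PERSPECTIVE_CUES = {
    "advantages": ["advantage", "benefit", "strength"],
    "disadvantages": ["disadvantage", "drawback", "weakness"],
    "causes": ["because", "caused by", "due to"],
    "consequences": ["as a result", "leads to", "results in"],
}

# The leading \b keeps "advantage" from matching inside "disadvantage";
# no trailing boundary is used, so plural forms such as "benefits" still match.
COMPILED = {
    name: re.compile(r"\b(?:" + "|".join(re.escape(cue) for cue in cues) + r")", re.IGNORECASE)
    for name, cues in PERSPECTIVE_CUES.items()
}


def tag_perspectives(passage):
    """Return the sorted names of all perspectives whose cues occur in the passage."""
    return sorted(name for name, pattern in COMPILED.items() if pattern.search(passage))
```

For instance, a passage mentioning a "drawback" would be filed under the disadvantages perspective, and a report on a product could then list advantages and disadvantages separately.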

c) identifying the main concepts of the document(s)
We apply this methodology as an auxiliary technique in query-biased multi-document summarization, but its most significant role is in non-query-biased summarization. The most noteworthy example here is our NewsFeed Researcher project - continuous automatic summarization of documents aggregated by Google News.

d) analyzing the user's query
Query analysis is one of the most sophisticated steps of the presented technology. Its main distinguishing feature is deducing the user's intentions. One example is revealing implicitly requested contexts. From the presence of certain semantic elements and their combinations, the software can recognize the user's request to retrieve information from a specific angle, such as "Advantages & Disadvantages" or "Causes & Consequences", among many others.
At this step, we also identify semantically conjoined terms (phrases), even if the user has not enclosed them in double quotes. For example, if the query is about New York, the engine will not search for just any sentences containing the word "new".
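
Both parts of query analysis, joining terms into phrases and recognizing a requested angle, can be sketched as follows. The phrase list, the cue words, the stop words and the function name are all assumptions for illustration; a real system would rely on much richer resources.

```python
# Hypothetical resources, kept tiny for the example.
KNOWN_PHRASES = {("new", "york"), ("text", "mining")}
ANGLE_CUES = {
    "advantages": "Advantages & Disadvantages",
    "disadvantages": "Advantages & Disadvantages",
    "causes": "Causes & Consequences",
    "consequences": "Causes & Consequences",
}
STOP_WORDS = {"a", "an", "the", "of", "in", "for"}


def analyze_query(query):
    """Split a query into search terms (with joined phrases) and requested angles."""
    words = query.lower().split()
    terms, angles = [], []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in KNOWN_PHRASES:
            terms.append(" ".join(pair))  # treat as one unit, as if quoted
            i += 2
            continue
        word = words[i]
        if word in ANGLE_CUES:
            if ANGLE_CUES[word] not in angles:
                angles.append(ANGLE_CUES[word])
        elif word not in STOP_WORDS:
            terms.append(word)
        i += 1
    return terms, angles
```

With these assumed lists, the query "advantages of New York hotels" yields the terms "new york" and "hotels" together with the angle "Advantages & Disadvantages", so the word "new" is never searched on its own.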

3. The Ordering Step

Meaningful ordering ensures that the retrieved information is perceived as a unified text, not as a collection of independent text passages. The output is primarily intended to be read as a whole, from beginning to end, not just scanned for separate notes.
The ordering of text passages is achieved by applying a combination of two main criteria: relevance score and semantic closeness.
As a result, passages representing a given thematic line are grouped together, with the most essential at the top, followed by passages that supplement the most pertinent ones with additional information. With such a coherent order of passages, the text unfolds naturally, starting from the most relevant items and moving downwards until a given thematic line is completed. By adjusting the parameters that control the relevance score and semantic closeness, we can either increase the conciseness of the output or achieve higher completeness.
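
One simple way to combine the two criteria is a greedy grouping: take the most relevant remaining passage as the head of a thematic line, then attach the passages that are close enough to it, in decreasing relevance. The sketch below assumes relevance scores are already computed and uses word overlap (Jaccard similarity) as a stand-in for semantic closeness; the names and the threshold are hypothetical.

```python
import re


def word_set(text):
    return set(re.findall(r"\w+", text.lower()))


def closeness(a, b):
    """Jaccard overlap of word sets, a crude stand-in for semantic closeness."""
    wa, wb = word_set(a), word_set(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


def order_passages(scored, min_closeness=0.2):
    """Order (passage, relevance) pairs so that thematic lines stay together."""
    remaining = sorted(scored, key=lambda item: item[1], reverse=True)
    ordered = []
    while remaining:
        head = remaining.pop(0)  # most relevant passage opens a thematic line
        ordered.append(head[0])
        related = [item for item in remaining if closeness(head[0], item[0]) >= min_closeness]
        for item in related:  # supplementing passages, still by relevance
            ordered.append(item[0])
            remaining.remove(item)
    return ordered
```

In this sketch, raising min_closeness makes thematic lines tighter, while dropping low-relevance passages before ordering would make the output more concise.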

4. The Text Organization Step

Our approach allows the compiled text to be broken into paragraphs, subsections and sections. The starting sentences of all subsections are put into a list at the top of the corresponding section. This list serves as the Contents. From every heading item in the Contents, a reader can navigate to the given subsection.
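
Building the Contents from the starting sentences can be sketched in a few lines. The sentence splitter here is deliberately naive (it would cut at abbreviations such as "Inc."), and the function names are assumptions for illustration.

```python
import re


def first_sentence(text):
    """Return the text up to the first sentence-ending mark, or the whole text."""
    text = text.strip()
    match = re.match(r".+?[.!?](?=\s|$)", text, flags=re.DOTALL)
    return match.group(0) if match else text


def build_contents(subsections):
    """List the starting sentence of each subsection, in order."""
    return [first_sentence(subsection) for subsection in subsections]
```

Each entry of the resulting list would then be linked to its subsection so that the reader can jump to it.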

At the very top of this content hierarchy are the sections. Each section starts with its own Contents (list of subsections). These sections differ by degree of pertinence of the information they contain.

The Key Topics section presents the most pertinent information on the subject. Depending on the nature of the information resources and the abundance of information, we may need to adjust the internal parameters of the system in order to keep this section concise. The In-Depth section contains highly relevant, though probably somewhat less important, information. The Possibly Useful section provides additional or supplementary information on the subject, even when it is only moderately relevant.
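
As a rough illustration, passages could be distributed among the three sections by relevance thresholds. The threshold values and the function name below are assumptions; the actual internal parameters of the system are not described here.

```python
def assign_sections(scored, key_min=0.8, depth_min=0.5, useful_min=0.3):
    """Place (passage, relevance) pairs into sections; drop anything below useful_min."""
    sections = {"Key Topics": [], "In-Depth": [], "Possibly Useful": []}
    for passage, relevance in scored:
        if relevance >= key_min:
            sections["Key Topics"].append(passage)
        elif relevance >= depth_min:
            sections["In-Depth"].append(passage)
        elif relevance >= useful_min:
            sections["Possibly Useful"].append(passage)
    return sections
```

Under this assumption, raising key_min is one way to keep the Key Topics section concise when the sources are abundant.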

Additionally, we provide the General Info section, which contains definitions of the subject and basic information about it. This section appears when the query consists of a single term, a two-word phrase, or a longer phrase enclosed in quotes, and, of course, when the appropriate information is present. A professional or analyst probably would not need the General Info section, being usually interested in a more specific consideration of facts and relationships. However, for other users, e.g. students, such a section might be quite useful when the query subject is something rare and not widely known (e.g., text mining). In many cases, their information need can be satisfied with just this section. Alternatively, General Info may help them digest the more detailed and specific information in the subsequent sections.
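
The condition for showing the General Info section can be written down directly. This is a sketch under the assumption that the system already knows whether definitions were found; the function and parameter names are hypothetical.

```python
def show_general_info(query, definitions_found):
    """Decide whether the General Info section should appear for this query."""
    q = query.strip()
    quoted = len(q) > 2 and q.startswith('"') and q.endswith('"')
    short = 1 <= len(q.split()) <= 2  # a single term or a two-word phrase
    return definitions_found and (quoted or short)
```

For example, the query "text mining" qualifies as a two-word phrase, while an unquoted four-word query does not.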

The final section of the summary is Sources. The source items are ranked according to their informativeness and relevance.


Several components of our proprietary technology have been successfully tested at the TREC (2001, 2002) and DUC (2005) conferences conducted by NIST.

Selected Bibliography:

M.M. Soubbotin, S.M. Soubbotin. Trade-Off between Factors Influencing Quality of the Summary. Document Understanding Conference (DUC) - 2005.

M.M. Soubbotin, S.M. Soubbotin. Exhaustive Mining of Information from Unstructured Documents. Accepted for presentation at the 9th World Multiconference on Systemics, Cybernetics and Informatics (WMSCI 2005), Orlando, USA, July 10-13, 2005.

M.M. Soubbotin, S.M. Soubbotin. Use of Patterns for Detection of Likely Answer Strings: A Systematic Approach. In: Proceedings of the Text REtrieval Conference (TREC-2002). Gaithersburg, MD, November 2003.

M.M. Soubbotin, S.M. Soubbotin. Patterns of Potential Answer Expressions as Clues to the Right Answer. In: Proceedings of TREC-10 Conference. Gaithersburg, MD, November 2001.


© Copyright 1998-2008, Power Text Solutions, All Rights Reserved.