PolyVista's Patent Pending Solution
PolyVista's approach to delivering a text mining
solution includes three fundamental processes.
1. Clustering: The first process creates
hierarchical text clusters from data fields that
generally consist of free-form text.
2. Integration: The second process integrates
these clusters (text-based dimensions) into an
otherwise standard multi-dimensional database
(MS Analysis Services). This provides the
all-important link between structured and
unstructured data.
3. Analysis: The third process is the capability
to analyze these new clustered text dimensions
using both automated methods (discovery algorithms)
and standard (manual) interactive OLAP methods.
Note that without step 3 (the ability to effectively
analyze the structured and unstructured dimensions
together), the text mining process becomes largely
academic and offers significantly less value.
1. Hierarchical Text Clusters
Text data is provided as delimited ASCII files,
MS Access tables, or MS SQL Server tables.
This data is processed to create a keyword frequency
distribution that takes stemming, synonyms, and
stop words into account. Using keyword frequency as
a basis, hierarchical text clustering is performed
using a PolyVista proprietary parallel processing method.
This involves an iterative k-means clustering technique
that produces a hierarchy of text clusters organized
in parent-child relationships. The clustering process
is largely data driven, though we expose a number
of tuning parameters to influence it. These
parameters include the number of clusters at a given
level, the number of terms describing each cluster,
and the degree to which a given word vector "fits"
a given cluster. The order of the terms describing
each cluster is driven entirely by their observed
frequency rank. While k-means is one of the fastest
clustering techniques, PolyVista employs a unique
(patent pending) parallel processing method to handle
the millions of text records that many large call
center applications receive daily.
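The general idea of clustering keyword frequency vectors into a parent-child
hierarchy can be sketched in a few lines of Python. This is only an illustrative
sketch, not PolyVista's proprietary method: the stop-word list, synonym map,
suffix-stripping stemmer, farthest-first seeding, and the names `keywords`,
`kmeans`, and `build_hierarchy` are all assumptions made for the example, and
neither the parallel processing nor the cluster "fit" parameter is modeled.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"the", "a", "is", "not", "on", "and", "keeps"}
SYNONYMS = {"display": "screen", "monitor": "screen"}

def stem(word):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def keywords(text):
    """Keyword frequencies for one record: stop words, stemming, synonyms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stems = [stem(t) for t in tokens if t not in STOP_WORDS]
    return Counter(SYNONYMS.get(s, s) for s in stems)

def distance(vec, centroid):
    """Squared Euclidean distance between two sparse vectors."""
    keys = set(vec) | set(centroid)
    return sum((vec.get(k, 0) - centroid.get(k, 0)) ** 2 for k in keys)

def kmeans(vectors, k, iterations=10):
    """Plain k-means; returns k lists of member positions (some may be empty)."""
    # Deterministic farthest-first seeding (an assumption of this sketch).
    centroids = [dict(vectors[0])]
    while len(centroids) < k:
        far = max(vectors, key=lambda v: min(distance(v, c) for c in centroids))
        centroids.append(dict(far))
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for pos, vec in enumerate(vectors):
            best = min(range(k), key=lambda c: distance(vec, centroids[c]))
            groups[best].append(pos)
        for c, members in enumerate(groups):
            if members:  # recompute the centroid as the mean of its members
                total = Counter()
                for pos in members:
                    total.update(vectors[pos])
                centroids[c] = {t: n / len(members) for t, n in total.items()}
    return groups

def build_hierarchy(vectors, indices, k=2, n_terms=3, min_size=4):
    """Recursively split a cluster; each node is labeled by its most frequent terms."""
    total = Counter()
    for i in indices:
        total.update(vectors[i])
    node = {"terms": [t for t, _ in total.most_common(n_terms)],
            "size": len(indices), "children": []}
    if len(indices) >= max(min_size, k):
        groups = [g for g in kmeans([vectors[i] for i in indices], k) if g]
        if len(groups) > 1:  # only descend when the split separated something
            for g in groups:
                child = [indices[pos] for pos in g]
                node["children"].append(
                    build_hierarchy(vectors, child, k, n_terms, min_size))
    return node

def show(node, depth=0):
    """Print the hierarchy with one indentation level per generation."""
    print("  " * depth + f"{', '.join(node['terms'])} ({node['size']} records)")
    for child in node["children"]:
        show(child, depth + 1)

records = [
    "Screen is flickering",
    "The display keeps flickering",
    "Monitor flickers on startup",
    "Battery is not charging",
    "Battery not charging on startup",
    "Battery drains and is not charging",
]
vectors = [keywords(r) for r in records]
tree = build_hierarchy(vectors, list(range(len(records))))
show(tree)
```

Here `k` and `n_terms` play the role of the tuning parameters described above
(clusters per level and terms per cluster), and `min_size` is an extra,
hypothetical stopping rule that keeps small clusters from being split further.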
2. Integrating Structured and Unstructured Data
The second step in our text mining solution involves
creating a cube (multi-dimensional database) using
this new text-based information. The text data is
processed and represented internally as a "ragged"
dimension (a hierarchical dimension with non-uniform
levels) in the cube. The final cube represents a
very powerful integration of standard structured
data with its corresponding unstructured data elements.
This approach allows analysts, for the first time,
to understand text field data both in a parent-child
relationship (a summarized, hierarchical view) and
in a multi-dimensional context.
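To make the idea of a ragged text dimension concrete, the sketch below stores
a small cluster hierarchy as a parent-child table and rolls record counts up
that hierarchy while slicing by a structured field. It is a plain-Python
illustration under stated assumptions, not the MS Analysis Services
implementation: the member names, the `product` and `day` fields, and the
`build_cube` helper are all hypothetical.

```python
from collections import defaultdict

# Parent-child table for a text-cluster dimension (member -> parent).
# "Screen" is a leaf at depth 1 while the battery leaves sit at depth 2,
# so the levels are non-uniform, i.e. the dimension is "ragged".
PARENT = {
    "All Problems": None,
    "Screen": "All Problems",
    "Battery": "All Problems",
    "Battery / Not charging": "Battery",
    "Battery / Drains fast": "Battery",
}

# Fact rows: structured fields plus the cluster assigned to the record's text.
FACTS = [
    {"product": "Laptop A", "day": "Mon", "cluster": "Screen"},
    {"product": "Laptop A", "day": "Mon", "cluster": "Battery / Not charging"},
    {"product": "Laptop B", "day": "Mon", "cluster": "Battery / Not charging"},
    {"product": "Laptop B", "day": "Tue", "cluster": "Battery / Drains fast"},
    {"product": "Laptop B", "day": "Tue", "cluster": "Battery"},
]

def ancestors(member):
    """Yield the member itself and then every ancestor up to the root."""
    while member is not None:
        yield member
        member = PARENT[member]

def build_cube(facts, dimension):
    """Count records per (cluster member, dimension value), rolled up the hierarchy."""
    cube = defaultdict(int)
    for row in facts:
        for member in ancestors(row["cluster"]):
            cube[(member, row[dimension])] += 1
    return cube

cube = build_cube(FACTS, "product")
for (member, product), count in sorted(cube.items()):
    print(f"{member:25s} {product:10s} {count}")
```

Because every record is counted at its own cluster and at each ancestor, a
parent such as "Battery" summarizes its children, and the same counts can be
sliced by any structured field by passing a different `dimension`.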
3. Automated Analysis Techniques
In addition to offering the technology to process
and construct large text-based dimensions, PolyVista
has also developed new algorithms and visualizations
particularly suited to text-based feature discovery.
Automating the discovery process is crucial to efficiently
surfacing new business insight in complex multi-dimensional
cubes. While our current discovery algorithms can
be applied effectively to any text dimension, we
have also added a new algorithm to our Discovery suite.
The Difference algorithm is designed to find
and rank differences, or deltas, between any two
user-definable sets in terms of one or more selectable
dimensions. For example, one could rank deltas between
the number of calls received yesterday (set 1) and
over the previous 7 days (set 2) in terms of a text
cluster dimension such as Problem Area.
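A difference ranking of this kind might look like the following sketch. The
source does not say how the Difference algorithm normalizes or scores deltas,
so this example assumes one simple choice: compare each member's share of
records in the two sets and rank by the absolute change. The function name
`rank_differences`, the `problem_area` field, and the sample counts are
hypothetical.

```python
from collections import Counter

def rank_differences(set1, set2, dimension):
    """Rank members of `dimension` by the change in their share of records.

    Shares (not raw counts) are compared so that sets of different sizes,
    such as one day versus seven days, are on the same footing. This
    normalization is an assumption of the sketch.
    """
    counts1 = Counter(row[dimension] for row in set1)
    counts2 = Counter(row[dimension] for row in set2)
    deltas = []
    for member in set(counts1) | set(counts2):
        share1 = counts1[member] / len(set1) if set1 else 0.0
        share2 = counts2[member] / len(set2) if set2 else 0.0
        deltas.append((member, share1 - share2))
    # Largest absolute change first; member name breaks ties deterministically.
    return sorted(deltas, key=lambda d: (-abs(d[1]), d[0]))

def calls(**counts):
    """Build toy call records from per-member counts."""
    return [{"problem_area": area} for area, n in counts.items() for _ in range(n)]

yesterday = calls(Battery=6, Screen=3, Login=1)          # set 1: 10 calls
previous_week = calls(Battery=14, Screen=28, Login=28)   # set 2: 70 calls

ranked = rank_differences(yesterday, previous_week, "problem_area")
for member, delta in ranked:
    print(f"{member:8s} {delta:+.2f}")
```

In this toy data, battery problems make up a much larger share of yesterday's
calls than of the previous week's, so that member rises to the top of the
ranking, which is the kind of signal an analyst would want surfaced first.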
Example
A call or contact center is a good example
of how value can be derived from integrating
text and structured data. Consider a chat-based
support system where customers initiate a
real-time "electronic dialog" (chat) with
a customer support agent. For each chat session,
the dialog between agent and customer
is recorded and stored. The structured elements
of this call record include information about
the customer as well as the agent (who, what,
where, when, etc.). The unstructured data
is the verbatim (free-form text) customer/agent
dialog itself.
The business value in clustering these text
fields and merging them with their related
structured elements includes the following:
Problem Identification -
Timely and accurate problem identification
is critical for the support agent as well
as the product engineers and service managers.
In a typical support system the agent is usually
responsible for identifying the problem and
classifying the problem type. Where a classification
scheme is very simple, the odds of a correct
classification are good. Unfortunately, a
simple classification scheme may not have
enough detail to support identifying important
new problem trends or identifying root causes.
On the other hand, a more complex scheme may
offer very robust analysis potential, but
its complex nature invites errors and creative
shortcuts, thus negating any analytic value.
Text clustering can provide an automated method
to assist in problem identification and classification;
it is generally unbiased and not prone to
taking shortcuts.
Early Warning -
With accelerating product lifecycles, the
value of analysis is closely tied to how quickly
actionable results can be delivered. A problem
discovered and corrected in the first weeks
of a product's life is much less costly than
one discovered several weeks later. Integrating
the structured and unstructured data facilitates
analysis of these "problem" clusters across
multiple dimensions like time, manufacturing
location, component suppliers, or product
family. This capability enables analysts to
quickly identify root causes and to take immediate
corrective action.