With continued advances in communication network and sensing technology,
there has been an astounding growth in the amount of data produced and made
available through cyberspace. Efficient and high-quality clustering of large
datasets continues to be one of the most important problems in large-scale
data analysis. A commonly used methodology for cluster analysis on large
datasets is the three-phase framework of "sampling/summarization - iterative
cluster analysis - disk-labeling". There are three known problems with this
framework, which demand effective solutions. The first problem is how to
effectively define and validate irregularly shaped clusters, especially in
large datasets. Automated algorithms and statistical methods are typically
not effective in handling such clusters. The second problem is how to label
the entire dataset on disk (disk-labeling) without introducing additional
errors, which includes dealing with outliers, irregular clusters, and
cluster boundary extension. The third problem is the lack of research on how
to integrate the three phases effectively.
The iVIBRATE project studies an interactive-visualization-based three-phase
framework for clustering large datasets. The two main components of iVIBRATE
are its VISTA visual cluster rendering subsystem, which brings the user into
the large-scale iterative clustering process through interactive
visualization, and its
Adaptive ClusterMap Labeling subsystem, which offers visualization-guided
disk-labeling solutions that are effective in dealing with outliers,
irregular clusters, and cluster boundary extension. Another important
contribution of iVIBRATE is the identification of the special issues that
arise in integrating the two components and the sampling approach into a
coherent framework, together with solutions that improve the reliability of
the framework and minimize the errors introduced throughout the cluster
analysis process. We study the effectiveness of the iVIBRATE framework
through a walkthrough of an example dataset of a million records, and we
evaluate the approach experimentally on both real-life and synthetic
datasets. Our results show that iVIBRATE can efficiently involve the user in
the clustering process and generate high-quality clustering results for
large datasets.
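
To make the three-phase structure concrete, the following is a minimal sketch of a generic, fully automated pipeline: sample, cluster the sample, then label every record. It is not the iVIBRATE method. The iterative phase here is plain k-means standing in for the interactive visual step, and labeling is a nearest-centroid rule with a distance cutoff for outliers. All names (`three_phase`, `label_stream`, `outlier_radius`, and so on) are hypothetical and chosen only for illustration.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(coord) / n for coord in zip(*points))


def init_centroids(sample, k):
    """Farthest-point initialization: deterministic given the sample order."""
    centroids = [sample[0]]
    while len(centroids) < k:
        farthest = max(sample, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids


def kmeans(sample, k, iters=20):
    """Phase 2 stand-in: plain k-means on the in-memory sample."""
    centroids = init_centroids(sample, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in sample:
            idx = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            groups[idx].append(p)
        # An empty group keeps its previous centroid.
        centroids = [mean(g) if g else centroids[i] for i, g in enumerate(groups)]
    return centroids


def label_stream(records, centroids, outlier_radius):
    """Phase 3: label records one at a time; -1 marks an outlier."""
    for p in records:
        idx = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
        if sq_dist(p, centroids[idx]) > outlier_radius ** 2:
            yield -1
        else:
            yield idx


def three_phase(records, k, sample_size, outlier_radius, seed=0):
    """Sample, cluster the sample, then label the full record set."""
    rng = random.Random(seed)
    sample = rng.sample(records, sample_size)              # phase 1: sampling
    centroids = kmeans(sample, k)                          # phase 2: clustering
    labels = list(label_stream(records, centroids, outlier_radius))  # phase 3
    return centroids, labels
```

In this sketch `records` is an in-memory list of tuples; in a real disk-labeling pass, `label_stream` would instead consume records read sequentially from storage. A centroid-plus-radius rule of this kind assumes roughly spherical clusters, which is exactly the limitation the abstract raises for irregularly shaped clusters and cluster boundaries.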

Related papers:
- Keke Chen and Ling Liu: "VISTA: Validating and Refining Clusters via
  Visualization." Journal of Information Visualization. Sept. 2004.
- Keke Chen and Ling Liu: "ClusterMap: Labeling Large Datasets via
  Visualization." ACM Conf. of Information and Knowledge Management
  (CIKM04). Washington DC, Nov. 2004.
- Keke Chen and Ling Liu: "Validating and Refining Clusters via Visual
  Rendering." Proc. of Intl. Conf. on Data Mining (ICDM03). Melbourne, FL,
  November 2003.
- Keke Chen and Ling Liu: "A Visual Framework Invites Human into the
  Clustering Process." Proc. of Scientific and Statistical Database
  Management (SSDBM03). Cambridge, Boston, July 2003.
- Keke Chen and Ling Liu: "Cluster Rendering of Skewed Datasets via
  Visualization." Proc. of ACM Symposium on Applied Computing (ACM SAC03).
  Melbourne, FL, March 2003.
- Keke Chen and Ling Liu: "iVIBRATE: Interactive Visualization Based
  Framework for Clustering Large Datasets." Submitted to ACM Transactions on
  Information Systems (TOIS), to appear.
- Keke Chen and Ling Liu: "Models and Performance Evaluation of
  Star-Coordinates Visualizations." Under review.