Dennis Gannon and Beth Plale
From Chemical Informatics and Cyberinfrastructure Collaboratory
Status on Optimization of Data Clustering Algorithms
Jiahu Deng 10/05/2006
Clustering algorithms are useful in molecular chemistry for detecting similarities among molecular structures in terascale databases, where visual inspection is impossible. Our goal in this project is to study the behavior of different clustering routines in order to optimize a parallel clustering algorithm on a multicore machine. The algorithm must perform well on a large community data collection such as PubChem.
We are starting our study by examining a simple sequential K-means clustering algorithm, using the open source Cluster 3.0* data clustering package. We have carried out preliminary performance testing of the K-means clustering routine, measuring the execution times of the individual components of the algorithm to understand its performance profile. We have also studied how to avoid the non-convergence problem of the K-means clustering algorithm. The next step is to parallelize a K-means algorithm (bisecting K-means), both as a learning experience and as a benchmark against which we can compare the Digital Chemistry DivKmeans algorithm or another algorithm based on fuzzy clustering.
* http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/software.htm
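To make the profiling and convergence discussion concrete, the following is a minimal pure-Python sketch; it is not the Cluster 3.0 code and not the project's implementation. It assumes Euclidean distance, initial centroids sampled from the data, and an iteration cap (the hypothetical `max_iter` parameter) as one simple guard against runs that never settle. All function names are illustrative. The bisecting variant shown here splits the largest cluster at each step, which is only one of several possible splitting criteria.

```python
import random
import time


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def kmeans(points, k, max_iter=100, seed=0):
    """K-means with per-phase timing and an iteration cap.

    Returns (labels, centroids, timings, converged).
    """
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: seed centroids from the data
    labels = [-1] * len(points)
    timings = {"assign": 0.0, "update": 0.0}
    converged = False

    for _ in range(max_iter):  # the cap bounds runs that fail to converge
        # Phase 1: assign every point to its nearest centroid.
        t0 = time.perf_counter()
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        timings["assign"] += time.perf_counter() - t0

        if new_labels == labels:  # no point changed cluster
            converged = True
            break
        labels = new_labels

        # Phase 2: recompute each centroid as the mean of its members.
        t0 = time.perf_counter()
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster goes empty
                centroids[c] = mean(members)
        timings["update"] += time.perf_counter() - t0

    return labels, centroids, timings, converged


def bisecting_kmeans(points, k, max_iter=100, seed=0):
    """Split the largest cluster in two with K-means until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()  # assumption: choose the cluster to split by size
        if len(largest) < 2:
            clusters.append(largest)
            break
        labels, _, _, _ = kmeans(largest, 2, max_iter, seed)
        left = [p for p, lab in zip(largest, labels) if lab == 0]
        right = [p for p, lab in zip(largest, labels) if lab == 1]
        if not left or not right:  # split failed, e.g. all points identical
            clusters.append(largest)
            break
        clusters.extend([left, right])
    return clusters
```

Calling `kmeans(points, k)` on a list of coordinate tuples returns the accumulated time spent in the assignment and update phases, which is the kind of per-component breakdown described above, along with a flag that reports whether the run converged or stopped at the cap. Each split in `bisecting_kmeans` is an independent two-cluster problem, which is one reason the bisecting form is a natural candidate for parallelization.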
Educational Initiative: Tools and Technology for Computational e-Science.
Instructors/developers: Dennis Gannon and Beth Plale
This is a new course designed to teach students the core computational systems science they will need in order to do research in the distributed computing e-science environment of the future. The course covers the design and implementation of web-based scientific gateway portals, web service architectures, data analysis and mining, and software tools for workflow design and execution. It applies this knowledge to scientific topics of interest to CiCC and to other large-scale e-science initiatives, including TeraGrid and the NSF LEAD project.
Currently 35 graduate students are enrolled in the course. They are divided into teams of 4 students. Each team is building a portal to provide workflow and web services to a particular scientific application.
