ParaKMeans is a high-performance parallel processing implementation of the K Means Clustering algorithm.
This software is distributed under the Open Software License v3.0 agreement.
To download the software, please read and accept the license here.
We designed the software so that it can be deployed on most Windows operating systems. The applications are written for the .NET Framework v1.1 using the C# programming language. The parallel nature of the application comes from the use of a web service to perform the distance calculations and cluster assignments. Because we use a web service, at least one computer must have Internet Information Services (IIS v5 or later) installed and running.
Additional help and system requirements can be found here.
The application was designed in a modular fashion to provide flexibility in both deployment and the user interface. The application is made of three software components:
- A Web Service - this software component is the main computational workhorse and resides on the "compute nodes". Data and the cluster centroids are sent to the web service, where the distance calculations and cluster assignments are performed. Note that once the data is sent to the web service, it never leaves.
- A Main API - this software component is a .NET dynamic link library (DLL) used by the user interfaces to orchestrate the activities of the compute nodes. The API is responsible for managing the ThreadPool and working with the web services to perform the K Means clustering algorithm across the compute nodes. The API provides all the necessary methods and properties. We will provide documentation for the API in case anyone wants to use it in another application.
- User Interfaces - this software component provides the actual application that the user interacts with to run the programs. We provide two different user interfaces: a stand-alone Windows application and a web application.
  - ParaKMeans stand-alone Windows application. The Windows application can be installed on any Windows machine, whether or not IIS is installed. This application provides easy file management, compute node management, program options and a results window for viewing and saving data.
  - ParaKMeans web application. This interface requires IIS to be installed on the computer. The web application provides the same functionality as the stand-alone application, but requires that each data set to be analyzed be uploaded to the server.
Although the software was created using this modular design, the end user only needs to concern themselves with the web service and which user interface they want to install.
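As a rough illustration of the web service's role, the sketch below models a single compute node in Python (chosen for brevity; the actual software is written in C#). The `ComputeNode` class and its `load_data` and `assign` methods are hypothetical names, not the ParaKMeans API. The sketch assumes the node keeps its data partition after the first upload and returns only per-cluster summaries, and that the performance value is the sum of squared Euclidean distances to the assigned centroids.

```python
class ComputeNode:
    """Hypothetical stand-in for one web service instance (not the real API)."""

    def __init__(self):
        self._rows = []  # data partition kept on the node after upload

    def load_data(self, rows):
        """Receive this node's partition once; it is reused in later rounds."""
        self._rows = [list(row) for row in rows]

    def assign(self, centroids):
        """Assign each resident row to its nearest centroid.

        Returns summaries only (an assumption of this sketch): per-cluster
        coordinate sums, per-cluster counts, and the summed squared distance.
        """
        k, dim = len(centroids), len(centroids[0])
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        performance = 0.0
        for row in self._rows:
            # Squared Euclidean distance gives the same nearest centroid
            # as Euclidean distance and avoids a square root.
            dists = [sum((x - c) ** 2 for x, c in zip(row, centroid))
                     for centroid in centroids]
            best = dists.index(min(dists))
            counts[best] += 1
            performance += dists[best]
            for j, value in enumerate(row):
                sums[best][j] += value
        return sums, counts, performance


node = ComputeNode()
node.load_data([[1.0, 1.0], [1.0, 3.0], [9.0, 9.0]])
sums, counts, performance = node.assign([[0.0, 0.0], [10.0, 10.0]])
print(counts, sums, performance)
```

In this sketch only the centroids travel to the node and only small summaries travel back, which is one way to read the statement that the data never leaves the web service.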
The basic steps involved in the ParaKMeans algorithm are:
- The user opens or uploads the data to be analyzed.
- The user selects whether to cluster genes, arrays or both.
- The user selects the number of clusters and compute nodes to use in the algorithm.
- The user selects the method to initialize the centroids for the first round.
- The algorithm partitions the data based on the number of nodes used.
- The algorithm creates an array of web proxies used to connect to the compute nodes.
- The algorithm initializes the centroids based on the method selected by the user.
- The algorithm asynchronously sends the data and the initial centroids to the compute nodes.
- Each compute node calculates the Euclidean distance matrix and assigns each data point on that node to its nearest cluster centroid.
- Once all the compute nodes finish the cluster assignments, the performance function for each node is returned and summed across all nodes. The summed performance function is used to calculate the new centroids.
- The algorithm sends the new centroids back to the compute nodes for another round of assignments.
- The algorithm ends when the performance function does not change between rounds.
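The steps above can be sketched as a small single-machine simulation in Python. This is a minimal sketch under stated assumptions, not the ParaKMeans implementation: worker threads stand in for the web proxies and compute nodes, the function names (`assign_partition`, `para_kmeans`) are hypothetical, the performance function is assumed to be the total squared Euclidean distance, and each node is assumed to return per-cluster sums and counts so that the coordinator can recompute the centroids.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_partition(rows, centroids):
    """One round of work on one simulated compute node."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    performance = 0.0
    for row in rows:
        dists = [sum((x - c) ** 2 for x, c in zip(row, centroid))
                 for centroid in centroids]
        best = dists.index(min(dists))  # nearest centroid
        counts[best] += 1
        performance += dists[best]
        for j, value in enumerate(row):
            sums[best][j] += value
    return sums, counts, performance


def para_kmeans(data, initial_centroids, n_nodes, max_rounds=100):
    # Partition the rows round-robin, one slice per node.
    partitions = [data[i::n_nodes] for i in range(n_nodes)]
    centroids = [list(c) for c in initial_centroids]
    k, dim = len(centroids), len(centroids[0])
    previous = performance = None
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        for _ in range(max_rounds):
            # Send the current centroids to every node concurrently.
            futures = [pool.submit(assign_partition, part, centroids)
                       for part in partitions]
            results = [f.result() for f in futures]
            # Sum the per-node performance values.
            performance = sum(r[2] for r in results)
            if performance == previous:
                break  # no change between rounds: stop
            previous = performance
            # Combine per-node sums and counts into new centroids.
            for c in range(k):
                count = sum(r[1][c] for r in results)
                if count:  # an empty cluster keeps its old centroid
                    centroids[c] = [sum(r[0][c][j] for r in results) / count
                                    for j in range(dim)]
    return centroids, performance


data = [[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
        [8.0, 8.0], [9.0, 8.0], [8.0, 9.0]]
centroids, performance = para_kmeans(data, [[1.0, 1.0], [8.0, 8.0]], n_nodes=2)
print(centroids, performance)
```

The simulation resends each partition on every round for simplicity; in the design described above, the data is sent to the compute nodes once and only the centroids are sent in later rounds.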