
CUDA K-Means Clustering with GPU Acceleration for High-Dimensional Data

This project leverages CUDA-based GPU acceleration to optimize vector database operations, focusing on index building performance. Key features include:

Features

  • Efficient Data Partitioning and Clustering: Implements advanced techniques commonly used in IVF (Inverted File) indexing for faster and more accurate data handling.
  • Out-of-Core Dataset Optimization: Accelerates data loading and processing for large datasets that exceed GPU memory capacity.
  • High-Dimensional Data Support: Optimized for managing and clustering high-dimensional vectors with scalable GPU parallelism.

Implementation Details

The core implementation uses CUDA to parallelize key operations such as:

  • Distance calculations between data points and cluster centroids.
  • Cluster assignment and centroid updates using GPU threads.
  • Reduction and aggregation for efficient mean calculations.
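The parallelization pattern above can be sketched as a CUDA kernel in which each thread handles one data point, computes the squared Euclidean distance to every centroid, and records the nearest one. This is an illustrative sketch, not the project's actual code; the names (`assignClusters`, `d_points`, `d_centroids`, `d_assign`, `N`, `K`, `D`) are assumed for the example.

```cuda
#include <float.h>

// One thread per point: scan all K centroids and record the nearest.
// Centroids are read-only during this phase, so no synchronization is needed.
__global__ void assignClusters(const float *d_points, const float *d_centroids,
                               int *d_assign, int N, int K, int D) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    float bestDist = FLT_MAX;
    int bestK = 0;
    for (int k = 0; k < K; ++k) {
        float dist = 0.0f;
        for (int d = 0; d < D; ++d) {
            float diff = d_points[idx * D + d] - d_centroids[k * D + d];
            dist += diff * diff;
        }
        if (dist < bestDist) { bestDist = dist; bestK = k; }
    }
    d_assign[idx] = bestK;
}
```

Squared distances are sufficient for nearest-centroid assignment, so the square root is omitted.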

Kmeans Data Structure

The original CUDA K-means is an optimized implementation that parallelizes the traditional K-means algorithm using CUDA. Its main features are:

  • Data Parallelism Utilization
    • Cluster centroids remain fixed during each iteration, ensuring safe concurrent access from multiple threads
    • Parallel processing of distance calculations for data points to enhance performance
  • Computation Optimization
    • Race-free centroid updates through atomic operations during the accumulation phase
    • Efficient workload distribution using block and thread structure
  • Input Parameters
    • Dataset with D-dimensional coordinates
    • Number of clusters (K)
    • Number of threads per block (TPB)
    • Maximum number of iterations

This implementation provides significantly improved clustering performance compared to CPU-based approaches.
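The atomic centroid-update phase and the block/thread launch structure described above might look like the following. This is a hedged sketch under assumed names (`accumulateCentroids`, `d_sums`, `d_counts`, `TPB`); the actual project code may organize these buffers differently.

```cuda
// Each thread adds its point's coordinates into the sum buffer of its
// assigned cluster. atomicAdd makes concurrent updates from many threads safe.
__global__ void accumulateCentroids(const float *d_points, const int *d_assign,
                                    float *d_sums, int *d_counts, int N, int D) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= N) return;

    int k = d_assign[idx];
    for (int d = 0; d < D; ++d)
        atomicAdd(&d_sums[k * D + d], d_points[idx * D + d]);
    atomicAdd(&d_counts[k], 1);
}

// Host-side launch: TPB threads per block, enough blocks to cover N points.
// int blocks = (N + TPB - 1) / TPB;
// accumulateCentroids<<<blocks, TPB>>>(d_points, d_assign, d_sums, d_counts, N, D);
```

After the kernel completes, each new centroid is obtained by dividing its coordinate sums by its count, and the loop repeats until the assignments stabilize or the maximum iteration count is reached.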

Kmeans_dim Data Structure

In a GPU-based parallel implementation of K-means, each thread (representing one point) must repeatedly read centroid data from global memory. These global memory loads introduce substantial overhead that grows with data dimensionality. The Kmeans_dim structure therefore adds the following optimizations:

  • Shared Memory Optimization
    • Each thread block loads centroids into shared memory to minimize global memory access latency
    • Threads within the same block can efficiently share and reuse centroid data
  • Dimension Chunk Strategy
    • High-dimensional data is processed in fixed-size chunks (256 dimensions per chunk)
    • This chunking approach overcomes shared memory size limitations (48KB)
    • Partial distance calculations are performed for each chunk and accumulated for final results
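The shared-memory staging and dimension-chunking strategy can be sketched as follows. This is an illustrative example, not the project's code: `CHUNK` mirrors the 256-dimension chunk size stated above, `MAX_K` is an assumed compile-time bound on the cluster count, and all buffer names are hypothetical.

```cuda
#define CHUNK 256   // dimensions per chunk; K * CHUNK floats must fit in 48KB
#define MAX_K 32    // assumed upper bound on K for the per-thread accumulator

__global__ void assignClustersChunked(const float *d_points,
                                      const float *d_centroids,
                                      int *d_assign, int N, int K, int D) {
    extern __shared__ float s_cent[];   // holds one chunk of every centroid
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    float partial[MAX_K];               // partial squared distances per centroid
    for (int k = 0; k < K; ++k) partial[k] = 0.0f;

    for (int base = 0; base < D; base += CHUNK) {
        int len = min(CHUNK, D - base);

        // Threads in the block cooperatively stage this chunk of all K
        // centroids from global memory into shared memory.
        for (int i = threadIdx.x; i < K * len; i += blockDim.x)
            s_cent[i] = d_centroids[(i / len) * D + base + (i % len)];
        __syncthreads();

        // Accumulate partial distances against the staged chunk.
        if (idx < N)
            for (int k = 0; k < K; ++k)
                for (int d = 0; d < len; ++d) {
                    float diff = d_points[idx * D + base + d] - s_cent[k * len + d];
                    partial[k] += diff * diff;
                }
        __syncthreads();  // keep the chunk alive until all threads finish it
    }

    if (idx < N) {
        int best = 0;
        for (int k = 1; k < K; ++k)
            if (partial[k] < partial[best]) best = k;
        d_assign[idx] = best;
    }
}

// Launch with dynamic shared memory sized for one chunk of all centroids:
// assignClustersChunked<<<blocks, TPB, K * CHUNK * sizeof(float)>>>(...);
```

With `CHUNK = 256`, each centroid chunk occupies 1KB of shared memory, so up to 48 centroids fit within the 48KB limit per block; larger K would require a second level of tiling over the centroids.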

Reference

TBD

27336, College of Software, Sungkyunkwan University, Tel. +82 31-299-4917, Seobu-ro 2066, Jangan-gu, Suwon, 16419, South Korea
Campus map (how to reach CGLab)