Classification algorithms distributed on the cloud
Lokad’s first major disruption, after the project’s inception back in 2008, was the emergence of cloud computing, a new paradigm that had taken the industry by storm. Overnight, the old HPC (high-performance computing) model was dead, and Lokad had to embrace its successor. Cloud computing represented Lokad’s first radical departure from what can still be considered ‘mainstream’ enterprise software. Although most enterprise vendors offer SaaS nowadays, almost none have adopted a cloud-native design [1]. The adoption of cloud computing was driven by the pioneering work of Matthieu Durut, Lokad’s second-ever employee (Lokad’s first employee was another PhD).
Much like the work of Benoit Petra, this manuscript had never previously been published on Lokad’s website. I am pleased to right that wrong today.
Author: Matthieu Durut
Date: September 2012
The subjects addressed in this thesis are inspired by research problems faced by the Lokad company. These problems relate to the challenge of designing efficient parallelization techniques for clustering algorithms on a Cloud Computing platform. Chapter 2 provides an introduction to Cloud Computing technologies, especially those devoted to intensive computations. Chapter 3 focuses on Microsoft’s Cloud Computing offering: Windows Azure. Chapter 4 details technical aspects of cloud application development and provides some cloud design patterns. Chapter 5 is dedicated to the parallelization of a well-known clustering algorithm: Batch K-Means. It provides insights into the challenges of a cloud implementation of distributed Batch K-Means, especially the impact of communication costs on the implementation’s efficiency. Chapters 6 and 7 are devoted to the parallelization of another clustering algorithm, Vector Quantization (VQ). Chapter 6 analyzes different parallelization schemes of VQ and presents the speedups to convergence that each provides. Chapter 7 provides a cloud implementation of these schemes. It highlights that the online nature of the VQ technique is what enables an asynchronous cloud implementation, which drastically reduces the communication costs introduced in Chapter 5.
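To make the contrast between the two algorithms named in the abstract more concrete, here is a minimal single-machine sketch in Python. It is not taken from the thesis: the function names (`nearest`, `batch_kmeans_step`, `vq_update`) and the fixed `step` parameter are assumptions chosen for illustration, and the textbook update rules are used. A Batch K-Means iteration needs a full pass over the data before any prototype changes, whereas an online VQ update touches a single prototype per data point, which is the property the abstract credits for enabling an asynchronous implementation.

```python
def nearest(point, prototypes):
    """Index of the prototype closest to `point` (squared Euclidean distance)."""
    return min(
        range(len(prototypes)),
        key=lambda k: sum((p - c) ** 2 for p, c in zip(point, prototypes[k])),
    )


def batch_kmeans_step(points, prototypes):
    """One Batch K-Means iteration (illustrative helper, not from the thesis).

    Every point is assigned to its nearest prototype, then each prototype is
    replaced by the mean of the points assigned to it.
    """
    dim = len(prototypes[0])
    sums = [[0.0] * dim for _ in prototypes]
    counts = [0] * len(prototypes)
    for x in points:
        k = nearest(x, prototypes)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += x[d]
    # A prototype with no assigned point keeps its previous position.
    return [
        [s / counts[k] for s in sums[k]] if counts[k] else list(prototypes[k])
        for k in range(len(prototypes))
    ]


def vq_update(point, prototypes, step):
    """One online VQ update (illustrative helper, not from the thesis).

    Only the nearest prototype moves, by a fraction `step` of the way toward
    the point. The list `prototypes` is modified in place.
    """
    k = nearest(point, prototypes)
    prototypes[k] = [c + step * (p - c) for p, c in zip(point, prototypes[k])]
    return k


# Usage: one batch iteration over four points, then one online update.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
batch_result = batch_kmeans_step(points, [[1.0, 1.0], [9.0, 1.0]])

online_prototypes = [[0.0, 0.0], [10.0, 0.0]]
moved = vq_update((2.0, 0.0), online_prototypes, step=0.5)
```

In practice the VQ step size usually decreases over time rather than staying fixed; the constant value above is only a simplification for the example.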
[1] A litmus test to assess whether a vendor has a cloud-native design is to ask the vendor if, as a client, you can go from zero to 10 terabytes within hours without permission from the vendor. Most enterprise software vendors don’t operate with pooled resources; hence, an upfront negotiation is required to secure a properly sized pool of dedicated resources.