Different types of clustering for textual documents

Clustering Model 1: Nearest Neighbour

Algorithms: KD-Trees and Locality Sensitive Hashing

Core Machine Learning Concepts: Distance Metrics and Approximation Algorithms

Problem Domain: Finding similar documents

Nearest neighbour: Compute the distance between the query document and every other document, then retrieve the document that is closest.
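The brute-force search described above can be sketched as follows. This is a minimal illustration, assuming a bag-of-words representation and Euclidean distance; the function names and the tiny corpus are made up for the example.

```python
import math
from collections import Counter

def word_counts(text):
    """Bag-of-words representation: lowercase word -> count."""
    return Counter(text.lower().split())

def euclidean(a, b):
    """Euclidean distance between two sparse count vectors (Counters)."""
    keys = set(a) | set(b)
    # A Counter returns 0 for words it has not seen.
    return math.sqrt(sum((a[k] - b[k]) ** 2 for k in keys))

def nearest_neighbour(query, corpus):
    """Return (index, distance) of the corpus document closest to the query."""
    q = word_counts(query)
    best_index, best_dist = None, float("inf")
    for i, doc in enumerate(corpus):
        d = euclidean(q, word_counts(doc))
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist

# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat",
    "stock markets fell sharply today",
    "a cat and a dog sat together",
]
index, dist = nearest_neighbour("the cat sat on a mat", corpus)
print(index, round(dist, 3))
```

Every query scans the whole corpus, so the cost grows linearly with the number of documents.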

Critical Components

Document Representation
Computing Distances
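One common pairing for these two components is TF-IDF vectors with cosine distance. The notes do not say which representation or metric is intended, so the sketch below is an assumption; `tf_idf` and `cosine_distance` are illustrative helper names, and the IDF formula used is the plain `log(N / df)` variant.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one sparse TF-IDF vector (a dict) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for tokens in tokenised for word in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Words that appear in every document get weight log(1) = 0.
        vectors.append({w: c * math.log(n_docs / df[w]) for w, c in tf.items()})
    return vectors

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, 1 means no shared weight."""
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical toy documents, for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the stock market fell",
]
vectors = tf_idf(docs)
print(cosine_distance(vectors[0], vectors[1]))
print(cosine_distance(vectors[0], vectors[2]))
```

TF-IDF downweights words such as "the" that occur in every document, so they stop dominating the distance.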

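Scaling to large datasets: KD-Trees speed up exact nearest neighbour search, but they are not appropriate for very high-dimensional document vectors, where a query ends up inspecting most of the tree.
Thus, locality sensitive hashing is used for approximate nearest neighbour search.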
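A rough sketch of locality sensitive hashing with random hyperplanes is shown below. It assumes documents are already dense numeric vectors. The helper names, the number of hyperplanes, and the single-bucket lookup are simplifications chosen for illustration; practical systems usually also probe neighbouring buckets or use several hash tables.

```python
import random
from collections import defaultdict

def make_hyperplanes(n_planes, dim, seed=0):
    """Draw random hyperplanes through the origin (one normal vector each)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_planes)]

def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )

def build_index(vectors, hyperplanes):
    """Group vector indices into buckets keyed by their bit signature."""
    buckets = defaultdict(list)
    for i, v in enumerate(vectors):
        buckets[signature(v, hyperplanes)].append(i)
    return buckets

def query(q, vectors, buckets, hyperplanes):
    """Search only the query's own bucket; may miss the true nearest neighbour."""
    candidates = buckets.get(signature(q, hyperplanes), [])
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda i: sum((a - b) ** 2 for a, b in zip(q, vectors[i])),
    )

# Hypothetical toy vectors, for illustration only.
vectors = [
    [1.0, 0.0, 2.0],
    [0.9, 0.1, 2.1],
    [-1.0, 3.0, 0.0],
    [0.0, -2.0, -1.0],
]
hyperplanes = make_hyperplanes(n_planes=4, dim=3)
buckets = build_index(vectors, hyperplanes)
print(query([-1.0, 3.0, 0.0], vectors, buckets, hyperplanes))
```

Vectors that point in similar directions tend to share a signature, so each query compares against one bucket instead of the whole dataset. The price is that the result is approximate.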

Clustering Model 2: Capturing uncertainty in clustering: Mixture Models

Any document can be related to topic t_i by x%.
Can learn a user's topic preferences
Making soft assignments to clusters: Expectation Maximization
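To make the idea of soft assignments concrete, here is a minimal sketch of Expectation Maximization for a mixture of two one-dimensional Gaussians. Real document clustering would work on high-dimensional vectors; the 1-D data, the initial parameters, and the function names are assumptions made to keep the example short.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(data, means, variances, weights, n_iter=50):
    """Run EM and return the parameters plus the last set of responsibilities."""
    k = len(means)
    means, variances, weights = list(means), list(variances), list(weights)
    resp = []
    for _ in range(n_iter):
        # E-step: responsibility of each cluster for each point (soft assignment).
        resp = []
        for x in data:
            scores = [
                weights[j] * gaussian_pdf(x, means[j], variances[j])
                for j in range(k)
            ]
            total = sum(scores)
            resp.append([s / total for s in scores])
        # M-step: re-estimate parameters from the responsibility-weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # guard against a collapsing variance
    return means, variances, weights, resp

# Hypothetical 1-D data: two groups plus one point (3.0) that lies between them.
data = [1.0, 1.2, 0.8, 1.1, 5.0, 5.2, 4.8, 5.1, 3.0]
means, variances, weights, resp = em_gmm_1d(
    data, means=[0.0, 6.0], variances=[1.0, 1.0], weights=[0.5, 0.5]
)
for x, r in zip(data, resp):
    print(x, [round(p, 3) for p in r])
```

Each row of `resp` sums to 1 and plays the role of the "x%" above: it says how strongly a point belongs to each cluster rather than forcing a single hard label.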

Clustering Model 3: Latent Dirichlet Allocation

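Mixed membership: each document belongs to several topics in different proportions
Each topic is a probability distribution over the words in the vocabulary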
Unsupervised learning task
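The sketch below illustrates the generative story behind Latent Dirichlet Allocation: a document draws its own topic proportions, and each word is then drawn from one of the topics. It is not an inference algorithm. In practice the topics and proportions are unknown and are learned from unlabelled documents. The two topics, their word probabilities, and the `alpha` values are invented for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalising Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(topics, alpha, n_words, rng):
    """Generate one document; returns its topic proportions and its words."""
    theta = sample_dirichlet(alpha, rng)  # mixed membership of this document
    topic_ids = list(range(len(topics)))
    words = []
    for _ in range(n_words):
        # Pick a topic for this word according to the document's proportions.
        z = rng.choices(topic_ids, weights=theta)[0]
        # Pick a word according to that topic's distribution over the vocabulary.
        vocab = list(topics[z])
        probs = [topics[z][w] for w in vocab]
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words

# Hypothetical topics: each maps word -> probability, summing to 1.
topics = [
    {"goal": 0.5, "team": 0.3, "match": 0.2},
    {"market": 0.5, "stock": 0.3, "price": 0.2},
]
rng = random.Random(0)
theta, words = generate_document(topics, alpha=[0.5, 0.5], n_words=20, rng=rng)
print([round(t, 3) for t in theta])
print(words)
```

Fitting LDA means running this story in reverse: given only the words, estimate the per-document proportions and the per-topic word probabilities.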
