Different types of clustering for textual documents
Clustering Model 1: Nearest Neighbour Algorithms: KD-Trees and Locality Sensitive Hashing Core Machine Learning Concepts: Distance Metrics and Approximation Algorithms Problem Domain: Finding similar documents Nearest neighbour: Find distance between all other documents and query document. Retrieve the document which is closest Critical Component Document Representation Computing Distances Scaling to large dataset: KD Trees - Not appropriate for very high dimensional documents Thus, Locality sensitive hashing is used for approximate nearest neighbour search. Clustering Model 2: Capturing uncertainty in clustering: Mixture Models Any document can be related to t i topic by x%. Can learn users topic preferences Making soft assignments to clusters -Expectation Maximization Clustering Model 3: Latent dirichlet allocation Mixed membership Probability of words in vocabulary Unsupervised learning task