Different types of clustering for textual documents

Clustering Model 1: Nearest Neighbour
Algorithms: KD-Trees and Locality Sensitive Hashing
Core Machine Learning Concepts: Distance Metrics and Approximation Algorithms
Problem Domain: Finding similar documents
Nearest neighbour: Compute the distance between the query document and every other document, then retrieve the document that is closest.
  • Critical components
    • Document representation
    • Computing distances
  • Scaling to large datasets: KD-trees speed up the search, but they are not appropriate for very high-dimensional document vectors.
  • Locality sensitive hashing is therefore used for approximate nearest neighbour search.
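
To make the two critical components concrete, here is a minimal sketch in plain Python. It assumes TF-IDF as the document representation and cosine distance as the metric, which are common choices but not the only ones. The toy corpus and the helper names (tf_idf_vectors, nearest_neighbour, lsh_buckets) are made up for illustration, and the hashing part uses random hyperplanes, which is one of several LSH schemes.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical toy corpus, for illustration only.
docs = [
    "the cat sat on the mat",
    "a cat and a kitten sat on a mat",
    "stock markets fell sharply today",
    "investors sold stock as markets fell",
]

def tf_idf_vectors(texts):
    """Document representation: dense TF-IDF vectors over a shared vocabulary."""
    tokenized = [t.lower().split() for t in texts]
    vocab = sorted({w for toks in tokenized for w in toks})
    n = len(tokenized)
    df = Counter(w for toks in tokenized for w in set(toks))
    # Smoothed inverse document frequency (an assumed variant).
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[w] * idf[w] for w in vocab])
    return vocab, vectors

def cosine_distance(a, b):
    """Computing distances: 1 minus the cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)

def nearest_neighbour(query_idx, vectors, candidates=None):
    """Index of the closest document to the query, or None if no candidate."""
    if candidates is None:
        candidates = range(len(vectors))  # brute force: scan everything
    best, best_dist = None, float("inf")
    for i in candidates:
        if i == query_idx:
            continue
        d = cosine_distance(vectors[query_idx], vectors[i])
        if d < best_dist:
            best, best_dist = i, d
    return best

def lsh_buckets(vectors, n_planes=4, seed=0):
    """Hash each vector by which side of each random hyperplane it falls on."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]
    buckets = defaultdict(list)
    for i, v in enumerate(vectors):
        key = tuple(sum(p * x for p, x in zip(plane, v)) >= 0 for plane in planes)
        buckets[key].append(i)
    return buckets

vocab, vectors = tf_idf_vectors(docs)
exact = nearest_neighbour(0, vectors)  # exact search over all documents
buckets = lsh_buckets(vectors)
query_bucket = next(b for b in buckets.values() if 0 in b)
# Approximate search: only documents sharing the query's bucket are compared.
approx = nearest_neighbour(0, vectors, candidates=query_bucket)
print(exact, approx)
```

The approximate search can return None or a different document than the exact one, because a true neighbour may land in another bucket. In practice this is usually mitigated by using several hash tables or by also probing nearby buckets.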

Clustering Model 2: Capturing uncertainty in clustering: Mixture Models
  • Any document can be related to a topic by x%, so cluster membership is a probability rather than a hard label.
  • Can learn a user's topic preferences
  • Soft assignments to clusters are made with Expectation Maximization (EM)
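
As a rough illustration of soft assignments, the sketch below runs EM on a two-component Gaussian mixture in one dimension. This is a simplification: real documents would be high-dimensional vectors, and the data points, starting means, and the function names e_step and em_gmm_1d are assumptions made up for this example.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def e_step(data, weights, means, variances):
    """E-step: responsibility of each component for each point (soft assignment)."""
    k = len(means)
    resp = []
    for x in data:
        scores = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
        total = sum(scores)
        if total == 0.0:
            # Guard against numerical underflow: fall back to a uniform split.
            resp.append([1.0 / k] * k)
        else:
            resp.append([s / total for s in scores])
    return resp

def em_gmm_1d(data, init_means, n_iter=50, min_var=1e-6):
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        resp = e_step(data, weights, means, variances)
        # M-step: re-estimate parameters from the soft counts.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            safe_nj = max(nj, 1e-12)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / safe_nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / safe_nj
            variances[j] = max(var, min_var)  # keep the variance from collapsing
            weights[j] = nj / len(data)
    # Final responsibilities under the fitted parameters.
    return weights, means, variances, e_step(data, weights, means, variances)

# Hypothetical 1-D feature values: two groups plus one point in between.
data = [0.9, 1.0, 1.1, 1.2, 3.0, 4.8, 5.0, 5.1, 5.3]
weights, means, variances, resp = em_gmm_1d(data, init_means=[0.0, 6.0])
for x, r in zip(data, resp):
    print(x, [round(p, 3) for p in r])
```

Each printed row is a probability distribution over the two clusters, which is the soft assignment described above. A hard clustering would keep only the larger entry of each row.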


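Clustering Model 3: Latent Dirichlet Allocation (LDA)
  • Mixed membership: each document is a mixture of several topics
  • Each topic is a probability distribution over the words in the vocabulary
  • Unsupervised learning task: topics are learned without labels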
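
The points above can be sketched with a small collapsed Gibbs sampler, which is one common way to fit LDA (variational inference is another). Everything here is an assumption for illustration: the toy corpus, the hyperparameters alpha and beta, the number of topics, and the function name lda_gibbs. It is not tuned or optimized.

```python
import random

def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=200, seed=0):
    """Collapsed Gibbs sampling for LDA on documents given as lists of word ids."""
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]              # words per topic, per document
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # word counts per topic
    topic_total = [0] * n_topics
    assignments = []
    # Start from a random topic for every word occurrence.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            z = rng.randrange(n_topics)
            zs.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(zs)
    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Resample the topic given all other assignments.
                probs = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=probs)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    # Mixed membership: topic proportions per document.
    theta = [
        [(c + alpha) / (len(doc) + n_topics * alpha) for c in doc_topic[d]]
        for d, doc in enumerate(docs)
    ]
    # Probability of each vocabulary word under each topic.
    phi = [
        [(c + beta) / (topic_total[k] + vocab_size * beta) for c in topic_word[k]]
        for k in range(n_topics)
    ]
    return theta, phi

# Hypothetical toy corpus, mapped to word ids.
texts = [
    "cat dog pet cat",
    "dog pet pet cat",
    "stock market trade stock",
    "market trade stock pet",
]
vocab = sorted({w for t in texts for w in t.split()})
word_id = {w: i for i, w in enumerate(vocab)}
corpus = [[word_id[w] for w in t.split()] for t in texts]

theta, phi = lda_gibbs(corpus, n_topics=2, vocab_size=len(vocab))
for row in theta:
    print([round(p, 2) for p in row])
```

Each row of theta is a document's mix of topics, and each row of phi is a topic's probability distribution over the vocabulary. No labels are used at any point, which is what makes this an unsupervised task.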

