Different types of clustering for textual documents
Clustering Model 1: Nearest Neighbour
Algorithms: KD-Trees and Locality Sensitive Hashing
Core Machine Learning Concepts: Distance Metrics and Approximation
Algorithms
Problem Domain: Finding similar documents
Nearest neighbour: Find distance between all other documents
and query document. Retrieve the document which is closest
- Critical Component
- Document Representation
- Computing Distances
- Scaling to large dataset: KD Trees - Not appropriate for very high dimensional documents
- Thus, Locality sensitive hashing is used for approximate nearest neighbour search.
Clustering Model 2: Capturing uncertainty in clustering:
Mixture Models
- Any document can be related to ti topic by x%.
- Can learn users topic preferences
- Making soft assignments to clusters -Expectation Maximization
Clustering Model 3: Latent dirichlet allocation
- Mixed membership
- Probability of words in vocabulary
- Unsupervised learning task
Comments
Post a Comment