Unsupervised Machine Learning Algorithms
Unsupervised Machine Learning is a type of machine learning where the model learns patterns and structures from unlabeled data. It is primarily used for clustering, dimensionality reduction, association rule mining, and anomaly detection.
Key Concepts
Unsupervised Learning Objective
Given data $X = \{x_1, x_2, \dots, x_n\}$, $x_i \in \mathbb{R}^d$, with no labels, the objective is to discover hidden structure in $X$, such as clusters, low-dimensional representations, frequent patterns, or outliers.
Applications
- Clustering: Grouping similar data points.
- Dimensionality Reduction: Simplifying data while retaining key patterns.
- Association Rule Mining: Discovering relationships between variables.
- Anomaly Detection: Identifying outliers or abnormal instances.
Types of Unsupervised Algorithms
1. Clustering Algorithms
Clustering divides data into groups (clusters) based on similarity.
a. K-Means Clustering
K-Means partitions data into $K$ clusters by minimizing the within-cluster sum of squared distances.
Mathematical Formulation:
$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2$$
Where:
- $C_k$: Cluster $k$.
- $\mu_k$: Centroid of cluster $k$.
Algorithm:
- Initialize $K$ centroids.
- Assign each point to the nearest centroid.
- Recompute centroids.
- Repeat until convergence.
Use Cases:
- Customer segmentation.
- Image compression.
- Market segmentation.
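As a quick sketch of the algorithm in practice, K-Means can be run with scikit-learn on synthetic two-dimensional data (the blob locations and cluster count here are illustrative assumptions):

```python
# Sketch: K-Means on three synthetic Gaussian blobs using scikit-learn.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Three well-separated blobs of 50 points each.
X = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(kmeans.cluster_centers_)   # one centroid per cluster
print(kmeans.inertia_)           # within-cluster sum of squares (the objective J)
```

`inertia_` is exactly the objective $J$ above, so it decreases monotonically across the assign/recompute iterations.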
b. Hierarchical Clustering
Builds a tree-like structure of nested clusters.
Two Approaches:
- Agglomerative: Bottom-up merging of clusters.
- Divisive: Top-down splitting of clusters.
Linkage Methods:
- Single Linkage: $d(A, B) = \min_{a \in A,\, b \in B} d(a, b)$
- Complete Linkage: $d(A, B) = \max_{a \in A,\, b \in B} d(a, b)$
- Average Linkage: $d(A, B) = \frac{1}{|A|\,|B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$
Use Cases:
- Gene expression analysis.
- Social network analysis.
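The linkage methods above can be compared directly in scikit-learn's agglomerative (bottom-up) implementation; the two-blob dataset here is a toy assumption:

```python
# Sketch: agglomerative clustering of two synthetic blobs under different linkages.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(4, 0.3, (30, 2))])

for linkage in ("single", "complete", "average"):
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit(X)
    print(linkage, np.bincount(model.labels_))   # cluster sizes per linkage method
```

On well-separated data all three linkages agree; they differ mainly on elongated or noisy clusters, where single linkage tends to "chain" clusters together.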
c. DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Groups points based on density and identifies outliers as noise.
Parameters:
- $\varepsilon$ (eps): Neighborhood radius.
- minPts: Minimum number of points required to form a dense region.
Use Cases:
- Geospatial data analysis.
- Noise filtering.
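A minimal sketch with scikit-learn shows both behaviors: dense regions become clusters, and an isolated point is labeled as noise (the `eps` and `min_samples` values are chosen for this toy data):

```python
# Sketch: DBSCAN finding two dense blobs and flagging a far-away point as noise.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0, 0.2, (40, 2)),
    rng.normal(5, 0.2, (40, 2)),
    [[20.0, 20.0]],              # isolated point, far from both blobs
])

db = DBSCAN(eps=0.8, min_samples=5).fit(X)
print(set(db.labels_))           # two cluster labels plus -1 for noise
print(db.labels_[-1])            # the isolated point gets label -1
```

Unlike K-Means, no cluster count is specified; the number of clusters emerges from the density parameters.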
2. Dimensionality Reduction Algorithms
a. Principal Component Analysis (PCA)
PCA reduces dimensionality by projecting data onto orthogonal axes that maximize variance.
Mathematical Formulation:
Given a centered data matrix $X \in \mathbb{R}^{n \times d}$, PCA computes the eigenvectors of the covariance matrix $\Sigma = \frac{1}{n} X^\top X$; the top $k$ eigenvectors $W_k$ (the principal components) define the projection $Z = X W_k$.
Use Cases:
- Data visualization.
- Noise reduction.
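As a sketch, consider 3-D data constructed to lie almost exactly in a 2-D plane; PCA recovers that plane, and the explained variance ratio confirms how little information the reduction discards (the data construction is an assumption for illustration):

```python
# Sketch: PCA reducing near-planar 3-D data to 2 components.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# The third coordinate is a linear combination of the first two plus tiny noise,
# so nearly all variance lives in a 2-D subspace.
A = rng.normal(size=(200, 2))
X = np.column_stack([A, A @ [1.0, -1.0] + rng.normal(scale=0.01, size=200)])

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)
print(Z.shape)                          # (200, 2)
print(pca.explained_variance_ratio_)    # the two components capture almost all variance
```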
b. t-SNE (t-Distributed Stochastic Neighbor Embedding)
t-SNE maps high-dimensional data to a lower-dimensional space while preserving local structure.
Use Cases:
- Visualizing high-dimensional datasets.
- Exploring clusters.
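A minimal usage sketch with scikit-learn, embedding two synthetic groups from 50 dimensions into 2 (perplexity is a tunable assumption, roughly the effective number of neighbors considered per point):

```python
# Sketch: t-SNE embedding a small 50-D dataset into 2-D for visualization.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
# Two groups in 50-D space.
X = np.vstack([rng.normal(0, 1, (40, 50)), rng.normal(6, 1, (40, 50))])

emb = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(X)
print(emb.shape)   # (80, 2): one 2-D point per input sample
```

Note that t-SNE preserves local neighborhoods, not global distances, so distances between well-separated groups in the embedding should not be over-interpreted.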
3. Association Algorithms
a. Apriori Algorithm
Discovers frequent itemsets and association rules in transaction data.
Steps:
1. Identify frequent itemsets using a minimum support threshold.
2. Generate association rules using a minimum confidence threshold.
Mathematical Definitions:
- Support: Proportion of transactions containing an itemset: $\text{support}(A) = \frac{|\{t \in T : A \subseteq t\}|}{|T|}$
- Confidence: Likelihood of $B$ given $A$: $\text{confidence}(A \Rightarrow B) = \frac{\text{support}(A \cup B)}{\text{support}(A)}$
- Lift: Measures the strength of the rule: $\text{lift}(A \Rightarrow B) = \frac{\text{confidence}(A \Rightarrow B)}{\text{support}(B)}$
Use Cases:
- Market basket analysis.
- Recommender systems.
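The definitions above can be computed directly on a toy transaction set. This is an illustrative, unoptimized sketch of the support/confidence/lift arithmetic, not a full Apriori implementation (real implementations prune candidates level by level):

```python
# Sketch: support, confidence, and lift on a toy market-basket dataset.
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk", "butter"},
    {"bread", "milk"},
]
N = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / N

# Frequent 1- and 2-itemsets at min_support = 0.4.
items = sorted({i for t in transactions for i in t})
frequent = [frozenset(c) for size in (1, 2)
            for c in combinations(items, size)
            if support(set(c)) >= 0.4]
print(frequent)

# Rule {bread} -> {milk}: confidence = support(A and B) / support(A).
A, B = {"bread"}, {"milk"}
confidence = support(A | B) / support(A)
lift = confidence / support(B)
print(round(confidence, 2), round(lift, 2))   # 0.75 0.94
```

Here the lift is slightly below 1, indicating bread and milk co-occur marginally less often than independence would predict on this toy data.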
b. FP-Growth Algorithm
Efficiently mines frequent itemsets without candidate generation by using a prefix-tree structure.
Use Cases:
- Retail analytics.
- Fraud detection.
4. Anomaly Detection Algorithms
a. Isolation Forest
Detects anomalies by isolating instances using a tree structure.
Key Idea:
Anomalies are isolated quickly due to their rarity and differences.
Use Cases:
- Fraud detection.
- Network intrusion detection.
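A short sketch with scikit-learn illustrates the key idea: a point far from the bulk of the data is isolated in few splits and flagged as an anomaly (the synthetic data and the planted outlier are assumptions for illustration):

```python
# Sketch: Isolation Forest flagging one planted outlier among normal points.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[10.0, 10.0]]])  # last row is anomalous

clf = IsolationForest(random_state=0).fit(X)
labels = clf.predict(X)   # +1 = inlier, -1 = anomaly
print(labels[-1])         # the planted outlier is labeled -1
```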
b. One-Class SVM
Learns a boundary around the training data, treated as a single "normal" class, and flags points falling outside that boundary as anomalies.
Mathematical Formulation:
Solves:
$$\min_{w,\, \xi,\, \rho} \ \frac{1}{2} \lVert w \rVert^2 + \frac{1}{\nu n} \sum_{i=1}^{n} \xi_i - \rho \quad \text{s.t.} \quad \langle w, \phi(x_i) \rangle \ge \rho - \xi_i, \ \xi_i \ge 0$$
Where:
- $\nu$: Upper bound on the fraction of anomalies (outliers) in the training data.
Use Cases:
- Manufacturing defect detection.
- Medical diagnosis.
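In the typical workflow, the model is fit on data assumed to be normal and then queried on new points; a minimal sketch with scikit-learn (the $\nu$ value and test points are illustrative assumptions):

```python
# Sketch: One-Class SVM trained on normal data, then used to flag a new outlier.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(5)
X_train = rng.normal(0, 1, (200, 2))   # training data assumed "normal"

clf = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(X_train)
preds = clf.predict([[0.0, 0.0], [8.0, 8.0]])
print(preds)   # +1 for the central point, -1 for the distant one
```

Here `nu=0.05` encodes the assumption that at most about 5% of the training data are anomalies.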
c. Elliptic Envelope
Fits data to a Gaussian distribution and identifies anomalies based on Mahalanobis distance.
Use Cases:
- Financial fraud detection.
- Sensor anomaly detection.
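A sketch with scikit-learn: the envelope is fit to roughly Gaussian data, and a planted outlier receives both an anomaly label and a large Mahalanobis distance (the contamination level and data are illustrative assumptions):

```python
# Sketch: Elliptic Envelope fitting a Gaussian and flagging a distant point.
import numpy as np
from sklearn.covariance import EllipticEnvelope

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (200, 2)), [[9.0, 9.0]]])   # last row is an outlier

clf = EllipticEnvelope(contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)                 # +1 = inlier, -1 = anomaly
print(labels[-1])                       # the planted outlier is labeled -1
print(clf.mahalanobis([[9.0, 9.0]]))    # large squared Mahalanobis distance
```

Because the decision rule is an ellipsoid, this method assumes the normal data is roughly unimodal and Gaussian; it degrades on multi-modal data.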
Performance Metrics for Unsupervised Learning
Clustering Metrics
1. Silhouette Score:
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}$$
Where:
- $a(i)$: Average distance from point $i$ to the other points in its own cluster.
- $b(i)$: Average distance from point $i$ to the points in the nearest other cluster.
2. Davies-Bouldin Index:
$$DB = \frac{1}{K} \sum_{k=1}^{K} \max_{j \ne k} \frac{\sigma_k + \sigma_j}{d(\mu_k, \mu_j)}$$
Where:
- $\sigma_k$: Dispersion of cluster $k$ (average distance of its points to the centroid $\mu_k$).
- $d(\mu_k, \mu_j)$: Distance between the centroids of clusters $k$ and $j$.
3. Dunn Index:
$$D = \frac{\min_{i \ne j} \delta(C_i, C_j)}{\max_{k} \Delta(C_k)}$$
Where:
- $\delta(C_i, C_j)$: Inter-cluster distance between clusters $C_i$ and $C_j$.
- $\Delta(C_k)$: Intra-cluster distance (diameter) of cluster $C_k$.
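Silhouette and Davies-Bouldin are available directly in scikit-learn (the Dunn index is not built in and is omitted here); a sketch on synthetic well-separated clusters, where silhouette approaches 1 and Davies-Bouldin approaches 0:

```python
# Sketch: clustering quality metrics on two well-separated synthetic blobs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0, 0.4, (50, 2)), rng.normal(5, 0.4, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)
db_index = davies_bouldin_score(X, labels)
print(sil)        # high (near 1) for compact, well-separated clusters
print(db_index)   # low (near 0) for compact, well-separated clusters
```

Note the opposite polarities: higher is better for silhouette, lower is better for Davies-Bouldin.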
Anomaly Detection Metrics
- Precision: Proportion of true anomalies among detected anomalies.
- Recall: Proportion of true anomalies detected.
- F1-Score: Harmonic mean of precision and recall.
- Area Under ROC Curve (AUC): Measures the trade-off between true positive rate and false positive rate.
Conclusion
Unsupervised learning algorithms are powerful tools for exploring data, identifying patterns, and detecting anomalies. Choosing the right algorithm depends on the problem type, data structure, and desired outcome.