Request for labeled classification datasets with specific visual cluster shapes
Hey there! If you're hunting for datasets that produce those specific clustering patterns when plotted, I've got you covered, whether you want pre-built generators or code to build your own (which is often more flexible for tuning exactly what you need). Let's break down each case:
Targeted Datasets & Generation for Specific Clustering Patterns
1. Crescent-shaped Clusters
- Go-to tool: The sklearn.datasets.make_moons() function generates two interleaving half circles, which is the classic crescent pattern. Its noise parameter adds Gaussian noise to mimic real-world fuzziness.
- Quick generation code:

    from sklearn.datasets import make_moons
    import matplotlib.pyplot as plt

    # Generate 500 samples with minimal noise
    X, y = make_moons(n_samples=500, noise=0.05, random_state=42)

    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.title("Crescent-Shaped Clusters")
    plt.show()

- Pro tip: Crank up the noise parameter (e.g., to 0.15) if you want less defined, more realistic crescent boundaries.
2. Cluster-in-Cluster (Nested Clusters)
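- scikit-learn's make_circles() covers the simplest nested case (one ring inside another). Generating the points yourself is still trivial and gives you independent control over each cluster's radius, spread, and size, whether you want concentric circles or a large loose cluster wrapping a tight inner cluster.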
- Concentric circle example code:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(42)  # seeded so the plot is reproducible

    # Outer cluster: 400 points around radius 5
    n_outer = 400
    theta_outer = rng.uniform(0, 2 * np.pi, n_outer)
    radius_outer = rng.normal(5, 0.3, n_outer)
    X_outer = np.column_stack((radius_outer * np.cos(theta_outer),
                               radius_outer * np.sin(theta_outer)))

    # Inner cluster: 100 points around radius 2
    n_inner = 100
    theta_inner = rng.uniform(0, 2 * np.pi, n_inner)
    radius_inner = rng.normal(2, 0.2, n_inner)
    X_inner = np.column_stack((radius_inner * np.cos(theta_inner),
                               radius_inner * np.sin(theta_inner)))

    # Combine and label (0 = outer, 1 = inner)
    X = np.vstack((X_outer, X_inner))
    y = np.hstack((np.zeros(n_outer, dtype=int), np.ones(n_inner, dtype=int)))

    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.title("Cluster-in-Cluster (Concentric Circles)")
    plt.show()

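
If plain concentric circles are all you need, scikit-learn's make_circles can produce them in one call. A minimal sketch follows; the factor of 0.4 and the rescaling by 5 are my own choices to roughly mirror the radii 2 and 5 used above, not anything required by the function.

```python
from sklearn.datasets import make_circles

# factor is the inner-to-outer radius ratio; 0.4 mirrors radius 2 vs. 5
# from the hand-rolled example (an illustrative choice).
X, y = make_circles(n_samples=500, noise=0.05, factor=0.4, random_state=42)

# make_circles places the outer ring on the unit circle, so rescale to get
# coordinates comparable to the example above. This scales the noise too.
X = X * 5
```

Here label 0 is the outer ring and label 1 the inner one, and the samples are split evenly between them. The hand-rolled version is still the better fit when the two clusters should differ in size or spread.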
3. 4 Uniformly Distributed Clusters
- Use sklearn.datasets.make_blobs() with manually specified symmetric centers to ensure even spatial distribution.
- Generation code:

    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt

    # Define evenly spaced centers (e.g., corners of a square)
    centers = [[0, 0], [0, 5], [5, 0], [5, 5]]
    X, y = make_blobs(n_samples=800, centers=centers, cluster_std=0.6, random_state=42)

    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.title("4 Uniformly Distributed Clusters")
    plt.show()

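- Note: Adjust cluster_std to control how tight each cluster is, and keep it the same across centers for a uniform look. With an integer n_samples, make_blobs splits the points evenly across the centers, so each cluster gets 200 of the 800 points. The points inside each blob are still Gaussian; "uniform" here refers to the even spacing and equal size of the clusters.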
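
If you would rather not type the centers by hand, one option is to compute them. The sketch below places k centers at equal angles on a circle; ring_centers is a hypothetical helper of mine, and the radius of 5 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_blobs

def ring_centers(k, radius=5.0):
    """Return k centers spaced at equal angles on a circle (hypothetical helper)."""
    angles = 2 * np.pi * np.arange(k) / k
    return np.column_stack((radius * np.cos(angles), radius * np.sin(angles)))

centers = ring_centers(4)
X, y = make_blobs(n_samples=800, centers=centers, cluster_std=0.6, random_state=42)
```

Calling ring_centers(2) gives two centers mirrored across the origin, so the same helper covers the 2-cluster case as well.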
4. 2 Uniformly Distributed Clusters
- Similar to the 4-cluster case, just use 2 symmetric centers with make_blobs.
- Generation code:

    from sklearn.datasets import make_blobs
    import matplotlib.pyplot as plt

    # Symmetric centers along the x-axis
    centers = [[-3, 0], [3, 0]]
    X, y = make_blobs(n_samples=400, centers=centers, cluster_std=0.8, random_state=42)

    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.title("2 Uniformly Distributed Clusters")
    plt.show()

5. 2 or 3 Stretched (Elliptical) Clusters
- Take standard blobs and apply a linear transformation to stretch them into ellipses. You can customize the stretch direction and intensity per cluster.
- 2 stretched clusters example:
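
    from sklearn.datasets import make_blobs
    import numpy as np
    import matplotlib.pyplot as plt

    X, y = make_blobs(n_samples=600, centers=2, random_state=42)

    # One stretch matrix per cluster
    stretch_matrices = {
        0: np.array([[2, 0], [0, 0.5]]),  # stretch horizontally, compress vertically
        1: np.array([[0.5, 0], [0, 2]]),  # compress horizontally, stretch vertically
    }

    for label, matrix in stretch_matrices.items():
        mask = y == label
        # Subtract the cluster mean first so the cluster is stretched in place
        # rather than being scaled relative to the origin and moved.
        center = X[mask].mean(axis=0)
        X[mask] = (X[mask] - center) @ matrix + center

    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.title("2 Stretched (Elliptical) Clusters")
    plt.show()
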
- For 3 stretched clusters, add a third center and apply unique stretch matrices to each cluster group.
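
A rough sketch of the 3-cluster variant is below. The center coordinates, the stretch factors, and the rotation helper are all illustrative assumptions on my part; swap in whatever shapes you need.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Explicit centers (illustrative values) so each cluster can be
# transformed around a known point.
centers = np.array([[-6.0, 0.0], [0.0, 6.0], [6.0, 0.0]])
X, y = make_blobs(n_samples=900, centers=centers, cluster_std=1.0, random_state=42)

def rotation(deg):
    """2x2 rotation matrix for the given angle in degrees (hypothetical helper)."""
    t = np.deg2rad(deg)
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

# One matrix per cluster: scale the axes, optionally followed by a rotation.
stretches = [
    np.diag([3.0, 0.5]),                 # wide and flat
    np.diag([0.5, 3.0]),                 # tall and narrow
    np.diag([3.0, 0.5]) @ rotation(45),  # stretched along a diagonal
]

for label, matrix in enumerate(stretches):
    mask = y == label
    # Transform around the cluster center so the cluster stays where it is.
    X[mask] = (X[mask] - centers[label]) @ matrix + centers[label]
```

Plot the result with the same plt.scatter(X[:, 0], X[:, 1], c=y) call used in the other examples.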
6. Clusters with Outliers
- Start with a standard clustered dataset, then manually insert extreme points far from all existing clusters to simulate outliers.
- Generation code:

    from sklearn.datasets import make_blobs
    import numpy as np
    import matplotlib.pyplot as plt

    # Base 3-cluster dataset
    X, y = make_blobs(n_samples=500, centers=3, cluster_std=0.7, random_state=42)

    # Add 5 outliers in regions far from the main clusters
    outliers = np.array([[15, 10], [-12, 8], [9, -15], [-8, -10], [13, -7]])
    X = np.vstack((X, outliers))
    y = np.hstack((y, np.full(len(outliers), 3)))  # Label outliers separately (label 3)

    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.title("Clusters with Outliers")
    plt.show()

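- Customize: Change the number of outliers or their coordinates to match how extreme you want the anomalies to be. By default make_blobs draws its random centers from the range (-10, 10) on each axis, so pick coordinates well outside that box, or check the plot to confirm each outlier sits clear of every cluster.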
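
If you want the outliers placed automatically instead of by hand, one possible approach is rejection sampling: draw random candidates in a box around the data and keep only those that are far enough from every existing point. The add_outliers helper and its min_dist and margin parameters below are my own illustration, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_blobs

def add_outliers(X, y, n_outliers=10, min_dist=3.0, margin=5.0, seed=0):
    """Append random outliers at least min_dist from every point (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Sample candidates from the data's bounding box, widened by margin.
    low, high = X.min(axis=0) - margin, X.max(axis=0) + margin
    outliers = []
    while len(outliers) < n_outliers:
        candidate = rng.uniform(low, high)
        # Keep the candidate only if no existing point is closer than min_dist.
        if np.linalg.norm(X - candidate, axis=1).min() >= min_dist:
            outliers.append(candidate)
    outliers = np.array(outliers).reshape(-1, X.shape[1])
    outlier_label = y.max() + 1  # outliers get their own label
    X_out = np.vstack((X, outliers))
    y_out = np.hstack((y, np.full(n_outliers, outlier_label)))
    return X_out, y_out

X, y = make_blobs(n_samples=500, centers=3, cluster_std=0.7, random_state=42)
X_out, y_out = add_outliers(X, y, n_outliers=10)
```

Keep min_dist smaller than margin so the widened box always has room for accepted candidates. Raising min_dist pushes the outliers further from the clusters.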
This question comes from Stack Exchange; it was asked by K.J Fogang Fokoa.




