Request: Labeled Datasets with Specific Cluster Shapes for Visualization

阿华AIGC实验室

2026-4-30

Hey there! If you're hunting for labeled datasets that produce specific clustering patterns when plotted, generating them yourself is usually the most flexible route, since you can tune exactly what you need. scikit-learn's generators cover most cases, and a few lines of NumPy handle the rest. Let's break down each case:

Targeted Datasets & Generation for Specific Clustering Patterns

1. Crescent-shaped Clusters

Go-to tool: sklearn.datasets.make_moons() generates two interleaving half circles, which is the usual way to get crescent-shaped clusters. Its noise parameter adds Gaussian noise to the points to mimic real-world fuzziness.
Quick generation code:

from sklearn.datasets import make_moons
import matplotlib.pyplot as plt

# Generate 500 samples with minimal noise
X, y = make_moons(n_samples=500, noise=0.05, random_state=42)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title("Crescent-Shaped Clusters")
plt.show()

Pro tip: Crank up the noise parameter (e.g., to 0.15) if you want less defined, more realistic crescent boundaries.
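To get a feel for what the noise value means before plotting, here is a minimal sketch that measures how far the points actually move. It assumes the documented behavior that noise is the standard deviation of Gaussian noise added to each coordinate; the measured dictionary is just an illustrative name.

```python
import numpy as np
from sklearn.datasets import make_moons

# shuffle=False keeps the point order fixed, so the clean and noisy
# versions line up row by row
X_clean, y = make_moons(n_samples=500, shuffle=False, noise=None)

measured = {}
for noise in (0.05, 0.15):
    X_noisy, _ = make_moons(n_samples=500, shuffle=False, noise=noise, random_state=42)
    # The difference between noisy and clean points is the added noise itself
    displacement = X_noisy - X_clean
    measured[noise] = displacement.std()
    print(f"noise={noise}: measured std of displacement = {measured[noise]:.3f}")
```

The measured spread should land close to the requested value, which makes it easier to pick a noise level relative to the gap between the two crescents.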

2. Cluster-in-Cluster (Nested Clusters)

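For plain concentric circles, sklearn.datasets.make_circles() is a ready-made generator. For other nested layouts, such as a large loose cluster wrapping a tight inner cluster, it's easy to build the data yourself, which also lets you set each ring's radius, spread, and point count independently.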
Concentric circle example code:

import numpy as np
import matplotlib.pyplot as plt

# Seeded generator so the figure is reproducible, like the random_state=42 examples
rng = np.random.default_rng(42)

# Outer cluster: 400 points around radius 5
n_outer = 400
theta_outer = rng.uniform(0, 2 * np.pi, n_outer)
radius_outer = rng.normal(5, 0.3, n_outer)
X_outer = np.column_stack((radius_outer * np.cos(theta_outer), radius_outer * np.sin(theta_outer)))

# Inner cluster: 100 points around radius 2
n_inner = 100
theta_inner = rng.uniform(0, 2 * np.pi, n_inner)
radius_inner = rng.normal(2, 0.2, n_inner)
X_inner = np.column_stack((radius_inner * np.cos(theta_inner), radius_inner * np.sin(theta_inner)))

# Combine and label
X = np.vstack((X_outer, X_inner))
y = np.hstack((np.zeros(n_outer), np.ones(n_inner)))

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title("Cluster-in-Cluster (Concentric Circles)")
plt.show()
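If the hand-rolled version is more control than you need, scikit-learn's make_circles gives a similar picture in one call. This is a sketch, not a drop-in replacement: make_circles puts the outer ring at radius 1 and the inner ring at radius factor, and with an integer n_samples it splits the points evenly between the rings. The scale of 5 and factor of 0.4 are chosen here only to mimic the radii 5 and 2 used above.

```python
import numpy as np
from sklearn.datasets import make_circles

# factor is the inner radius relative to the outer one (2 / 5 = 0.4)
X, y = make_circles(n_samples=500, factor=0.4, noise=0.05, random_state=42)

# Rescale so the outer ring sits near radius 5 and the inner ring near radius 2
X = X * 5

# Label 0 is the outer ring, label 1 the inner ring
radii = np.linalg.norm(X, axis=1)
outer_mean = radii[y == 0].mean()
inner_mean = radii[y == 1].mean()
print(f"mean outer radius: {outer_mean:.2f}, mean inner radius: {inner_mean:.2f}")
```

The result plots with the same plt.scatter(X[:, 0], X[:, 1], c=y) call as the other examples. Note that the noise here is added per coordinate rather than to the radius, so the rings look slightly different from the NumPy version.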

3. 4 Uniformly Distributed Clusters

Use sklearn.datasets.make_blobs() with manually specified symmetric centers to ensure even spatial distribution.
Generation code:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Define evenly spaced centers (e.g., corners of a square)
centers = [[0, 0], [0, 5], [5, 0], [5, 5]]
X, y = make_blobs(n_samples=800, centers=centers, cluster_std=0.6, random_state=42)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title("4 Uniformly Distributed Clusters")
plt.show()

Note: Adjust cluster_std to control how tight each cluster is. Passing a single float, as here, gives every cluster the same spread, which keeps the look uniform.

4. 2 Uniformly Distributed Clusters

Similar to the 4-cluster case, just use 2 symmetric centers with make_blobs.
Generation code:

from sklearn.datasets import make_blobs
import matplotlib.pyplot as plt

# Symmetric centers along the x-axis
centers = [[-3, 0], [3, 0]]
X, y = make_blobs(n_samples=400, centers=centers, cluster_std=0.8, random_state=42)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title("2 Uniformly Distributed Clusters")
plt.show()
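Both layouts are special cases of spacing k centers symmetrically, so the centers can also be computed instead of typed by hand. The sketch below uses a hypothetical helper, ring_centers, that places k centers evenly on a circle; the helper name and the radius value are illustrative assumptions, not part of scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_blobs

def ring_centers(k, radius):
    """Return k centers evenly spaced on a circle (hypothetical helper)."""
    angles = 2 * np.pi * np.arange(k) / k
    return np.column_stack((radius * np.cos(angles), radius * np.sin(angles)))

# k=4 gives the corners of a square (rotated 45 degrees relative to the axes)
centers = ring_centers(k=4, radius=4.0)
X, y = make_blobs(n_samples=800, centers=centers, cluster_std=0.6, random_state=42)

# k=2 with radius 3 reproduces the [-3, 0] / [3, 0] pair from the 2-cluster case
two_centers = ring_centers(k=2, radius=3.0)
```

Keeping the radius several times larger than cluster_std keeps the clusters visually separate; shrinking it makes them start to merge.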

5. 2 or 3 Stretched (Elliptical) Clusters

Take standard blobs and apply a linear transformation to stretch them into ellipses. You can customize the stretch direction and intensity per cluster.
2 stretched clusters example:

from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

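X, y = make_blobs(n_samples=600, centers=2, random_state=42)
# Stretch each cluster horizontally and compress it vertically.
# Transform about the cluster's own mean so it changes shape without moving.
stretch_matrix = np.array([[2, 0], [0, 0.5]])
for label in (0, 1):
    mask = y == label
    center = X[mask].mean(axis=0)
    X[mask] = (X[mask] - center) @ stretch_matrix + center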

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title("2 Stretched (Elliptical) Clusters")
plt.show()

For 3 stretched clusters, add a third center and apply unique stretch matrices to each cluster group.
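The 3-cluster variant can be sketched as follows. The centers, scale factors, and angles are arbitrary illustrative values, and stretch_and_rotate is a hypothetical helper. Each cluster is transformed about its own mean, so it changes shape and orientation without drifting from its center.

```python
import numpy as np
from sklearn.datasets import make_blobs

def stretch_and_rotate(scale_x, scale_y, angle_deg):
    """Matrix for row-vector points: scale along the axes, then rotate (hypothetical helper)."""
    t = np.deg2rad(angle_deg)
    rotation = np.array([[np.cos(t), -np.sin(t)],
                         [np.sin(t),  np.cos(t)]])
    # Points are rows, so "scale then rotate" is X @ (S @ R.T)
    return np.diag([scale_x, scale_y]) @ rotation.T

centers = np.array([[-8.0, 0.0], [0.0, 8.0], [8.0, 0.0]])
X, y = make_blobs(n_samples=900, centers=centers, cluster_std=1.0, random_state=42)

# One transform per cluster: same stretch, different orientation
transforms = {
    0: stretch_and_rotate(2.0, 0.4, 90),   # elongated vertically
    1: stretch_and_rotate(2.0, 0.4, 0),    # elongated horizontally
    2: stretch_and_rotate(2.0, 0.4, 45),   # elongated along the diagonal
}
for label, matrix in transforms.items():
    mask = y == label
    center = X[mask].mean(axis=0)
    X[mask] = (X[mask] - center) @ matrix + center
```

Plot the result with the same plt.scatter call as in the 2-cluster example. Changing the two scale values per cluster gives ellipses with different aspect ratios as well as different orientations.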

6. Clusters with Outliers

Start with a standard clustered dataset, then manually insert extreme points far from all existing clusters to simulate outliers.
Generation code:

from sklearn.datasets import make_blobs
import numpy as np
import matplotlib.pyplot as plt

# Base 3-cluster dataset
X, y = make_blobs(n_samples=500, centers=3, cluster_std=0.7, random_state=42)
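# Add 5 hand-picked outliers. Check them against your own plot: with centers=3,
# make_blobs draws the cluster centers at random inside (-10, 10) by default.
outliers = np.array([[15, 10], [-12, 8], [9, -15], [-8, -10], [13, -7]])
X = np.vstack((X, outliers))
y = np.hstack((y, np.full(len(outliers), 3)))  # Label outliers as their own class (3)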

plt.scatter(X[:, 0], X[:, 1], c=y)
plt.title("Clusters with Outliers")
plt.show()
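Hand-picked coordinates only count as outliers if they happen to land far from wherever the clusters ended up. As an alternative, here is a minimal sketch that samples outliers at random and keeps only candidates at least min_dist away from every existing point. The helper name sample_outliers and the values for min_dist and margin are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

def sample_outliers(X, n_outliers, min_dist, rng, margin=5.0, max_tries=10_000):
    """Rejection-sample points at least min_dist from every row of X (hypothetical helper)."""
    low = X.min(axis=0) - margin
    high = X.max(axis=0) + margin
    kept = []
    for _ in range(max_tries):
        candidate = rng.uniform(low, high)
        # Distance from the candidate to its nearest existing point
        if np.linalg.norm(X - candidate, axis=1).min() >= min_dist:
            kept.append(candidate)
            if len(kept) == n_outliers:
                return np.array(kept)
    raise RuntimeError("could not place all outliers; lower min_dist or raise margin")

X, y = make_blobs(n_samples=500, centers=3, cluster_std=0.7, random_state=42)
rng = np.random.default_rng(0)
outliers = sample_outliers(X, n_outliers=5, min_dist=4.0, rng=rng)

# Append the outliers and give them their own label (3)
X_all = np.vstack((X, outliers))
y_all = np.hstack((y, np.full(len(outliers), 3)))
```

Plotting X_all and y_all with plt.scatter works the same way as before. Because the distance check runs against the generated points, this keeps working if you change random_state or the number of clusters.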