How do I generate 100 random samples of size 30 from the Boston housing dataset in Python?

阿华AIGC实验室

2026-5-6

Hey there, here's the implementation you're asking for, with an explanation of each step. Quick heads-up: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2 because of ethical concerns about the dataset, so the import below only works on scikit-learn versions older than 1.2. I'm using it here because your requirements call for it.

Complete Implementation

1. Import Required Libraries

First, we'll import all the libraries you mentioned:

import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
import random

2. Load Dataset & Convert to DataFrame

Next, we'll load the Boston housing data, convert the feature matrix into a pandas DataFrame, and add the target column MEDV (which represents the median value of owner-occupied homes in $1000s):

# Load the dataset
boston = load_boston()

# Convert features to DataFrame
df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Add target column 'MEDV'
df['MEDV'] = boston.target
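
Because load_boston is missing from recent scikit-learn releases, it can help to guard the loading step so the failure is reported clearly. The following is only a sketch: load_boston_frame is a hypothetical helper name, not part of scikit-learn, and it returns None when the loader cannot be imported.

```python
import pandas as pd

def load_boston_frame():
    """Return the Boston data as a DataFrame, or None if load_boston is unavailable.

    Hypothetical helper for illustration; not part of scikit-learn.
    """
    try:
        # Available in scikit-learn < 1.2; the import fails in 1.2 and later.
        from sklearn.datasets import load_boston
    except ImportError:
        return None
    boston = load_boston()
    frame = pd.DataFrame(boston.data, columns=boston.feature_names)
    frame["MEDV"] = boston.target
    return frame

df = load_boston_frame()
if df is None:
    print("load_boston is not available; use scikit-learn < 1.2 or another data source.")
else:
    print(df.shape)
```

If the helper returns None, the rest of the steps still apply to any DataFrame that has the same columns, whatever source it was loaded from.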

3. Generate 100 Random Samples (Each with 30 Observations)

We'll set a random seed for reproducibility, then use random.sample to generate 100 separate samples, each containing 30 rows from our DataFrame:

# Set random seed for reproducibility
random.seed(1)

# Initialize a list to store all samples
samples_list = []

# Generate 100 samples
for _ in range(100):
    # Randomly select 30 row indices without replacement
    sample_indices = random.sample(list(df.index), k=30)
    # Extract the corresponding rows from the DataFrame
    sample = df.loc[sample_indices]
    # Add the sample to our list
    samples_list.append(sample)

Key Notes:

random.seed(1) seeds Python's built-in random module, so every run of this code produces the exact same set of samples (critical for reproducibility). It does not affect NumPy's random number generator.
random.sample(list(df.index), k=30) picks 30 unique row indices from the DataFrame, so no row repeats within a single sample. The same row can still appear in different samples.
Each item in samples_list is a separate pandas DataFrame with 30 rows and all original columns (including MEDV).
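
As an alternative to random.sample, the same kind of sampling can be sketched with pandas' own DataFrame.sample. This is a swapped-in technique, not the approach used above, and it will not reproduce the same rows as random.seed(1). The DataFrame here is a synthetic stand-in with 506 rows and made-up values, assumed only so the example runs without load_boston.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the Boston DataFrame (assumption: 506 rows, made-up values).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "RM": rng.normal(6.3, 0.7, size=506),
    "MEDV": rng.normal(22.5, 9.2, size=506),
})

# Draw 100 samples of 30 rows each. DataFrame.sample draws without replacement
# by default, so rows are unique within a sample. Using a different
# random_state per draw keeps the whole run reproducible.
samples_list = [df.sample(n=30, random_state=seed) for seed in range(100)]

print(len(samples_list), samples_list[0].shape)
```

The design choice is mostly about convenience: DataFrame.sample returns the rows directly, so there is no separate step for picking indices and calling df.loc.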

If you want to verify the output, you can run these quick checks:

# Check total number of samples generated
print(f"Total samples: {len(samples_list)}")
# Check the dimensions of the first sample
print(f"First sample shape: {samples_list[0].shape}")
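
For a slightly stricter check, the sketch below confirms that every sample has 30 rows and no repeated rows. check_samples is a hypothetical helper name, and the one-column DataFrame is a stand-in assumed only to make the example self-contained.

```python
import random
import pandas as pd

def check_samples(samples, n_samples=100, sample_size=30):
    """Return True if there are n_samples frames, each with sample_size unique rows."""
    if len(samples) != n_samples:
        return False
    return all(len(s) == sample_size and s.index.is_unique for s in samples)

# Stand-in data (assumption): 506 rows, one placeholder column.
df = pd.DataFrame({"MEDV": range(506)})

# Same sampling logic as in step 3.
random.seed(1)
samples_list = [df.loc[random.sample(list(df.index), k=30)] for _ in range(100)]

print(check_samples(samples_list))
```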

The question comes from Stack Exchange; it was asked by akietta.