How can I generate 100 random samples of size 30 from the Boston housing dataset in Python?
Hey there, here's an implementation of what you're asking for, with an explanation of each step. Quick heads-up: load_boston was deprecated in scikit-learn 1.0 and removed in 1.2 because of ethical problems with the original data, so the code below needs an older scikit-learn version to run as written. Other than that, I'll follow your stated requirements closely.
1. Import Required Libraries
First, we'll import all the libraries you mentioned:
import pandas as pd
import numpy as np
from sklearn.datasets import load_boston
import random
2. Load Dataset & Convert to DataFrame
Next, we'll load the Boston housing data, convert the feature matrix into a pandas DataFrame, and add the target column MEDV (which represents the median value of owner-occupied homes in $1000s):
# Load the dataset
boston = load_boston()

# Convert features to DataFrame
df = pd.DataFrame(boston.data, columns=boston.feature_names)

# Add target column 'MEDV'
df['MEDV'] = boston.target
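If load_boston is not available in your scikit-learn version, one possible workaround is sketched below. The helper name load_boston_frame is hypothetical, and the fallback rests on assumptions: that the OpenML copy named "boston" (version 1) is reachable over the network and that it exposes the target as MEDV. Column dtypes in that copy may differ from the original loader.

```python
import pandas as pd

try:
    # Only importable in scikit-learn versions before 1.2.
    from sklearn.datasets import load_boston
except ImportError:
    load_boston = None


def load_boston_frame():
    """Return the Boston data as a DataFrame with a MEDV column (hypothetical helper)."""
    if load_boston is not None:
        boston = load_boston()
        df = pd.DataFrame(boston.data, columns=boston.feature_names)
        df["MEDV"] = boston.target
        return df
    # Assumption: the OpenML dataset "boston" (version 1) is available
    # and its frame already contains the MEDV target column.
    from sklearn.datasets import fetch_openml
    return fetch_openml(name="boston", version=1, as_frame=True).frame
```

Calling df = load_boston_frame() would then stand in for the loading step, and the sampling code can stay unchanged.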
3. Generate 100 Random Samples (Each with 30 Observations)
We'll set a random seed for reproducibility, then use random.sample to generate 100 separate samples, each containing 30 rows from our DataFrame:
# Set random seed for reproducibility
random.seed(1)

# Initialize a list to store all samples
samples_list = []

# Generate 100 samples
for _ in range(100):
    # Randomly select 30 row indices without replacement
    sample_indices = random.sample(list(df.index), k=30)
    # Extract the corresponding rows from the DataFrame
    sample = df.loc[sample_indices]
    # Add the sample to our list
    samples_list.append(sample)
Key Notes:
- random.seed(1) ensures that every time you run this code, you'll get the exact same set of samples (critical for reproducibility).
- random.sample(list(df.index), k=30) picks 30 unique row indices from the DataFrame without repeating any rows in a single sample.
- Each item in samples_list is a separate pandas DataFrame with 30 rows and all original columns (including MEDV).
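If you'd rather stay inside pandas, the same idea can be sketched with DataFrame.sample. This is an alternative, not the approach used above, and it will not pick the same rows as random.sample. The draw_samples helper and the placeholder DataFrame (506 rows, matching the Boston data's row count, with made-up values) are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Placeholder standing in for the Boston DataFrame: 506 rows, synthetic values.
df = pd.DataFrame({"MEDV": np.arange(506, dtype=float)})


def draw_samples(frame, n_samples=100, sample_size=30, seed=1):
    """Return n_samples DataFrames, each with sample_size rows drawn without replacement."""
    return [
        # A different random_state per sample, so the samples are not all identical.
        frame.sample(n=sample_size, replace=False, random_state=seed + i)
        for i in range(n_samples)
    ]


samples_list = draw_samples(df)
```

Deriving each random_state from a base seed keeps the whole set reproducible while letting the individual samples differ from one another.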
If you want to verify the output, you can run these quick checks:
# Check total number of samples generated
print(f"Total samples: {len(samples_list)}")

# Check the dimensions of the first sample
print(f"First sample shape: {samples_list[0].shape}")
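For a slightly stronger check, here is a minimal sketch that confirms no row repeats within a sample and that re-seeding reproduces the same draws. The generate_samples wrapper and the placeholder DataFrame (506 rows of made-up values) are assumptions so the snippet runs on its own. Swap in your real df to test the actual data.

```python
import random

import pandas as pd

# Placeholder standing in for the Boston DataFrame: 506 rows, synthetic values.
df = pd.DataFrame({"MEDV": range(506)})


def generate_samples(frame, n_samples=100, sample_size=30, seed=1):
    """Same sampling loop as in step 3, wrapped in a function so it can be rerun."""
    random.seed(seed)
    return [
        frame.loc[random.sample(list(frame.index), k=sample_size)]
        for _ in range(n_samples)
    ]


first_run = generate_samples(df)
second_run = generate_samples(df)

# Within each sample, every row index should appear only once.
no_repeats = all(sample.index.is_unique for sample in first_run)

# With the same seed, both runs should select identical rows in identical order.
reproducible = all(a.index.equals(b.index) for a, b in zip(first_run, second_run))

print(f"No repeated rows within a sample: {no_repeats}")
print(f"Same samples on rerun: {reproducible}")
```

Keep in mind that the 100 samples are drawn independently, so the same row can still show up in more than one sample.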
This question originally comes from Stack Exchange, asked by akietta.




