Help request: "The actor ImplicitFunc is too large" error after upgrading to Ray 1.7
This error appears because Ray 1.7 turned the actor size check (only a warning in 1.6) into a hard error with a 95 MiB threshold. The ImplicitFunc actor that Ray generates wraps your instance method self.train, and a bound method carries the entire self object with it: every attribute of your class, not just the data you passed via tune.with_parameters. That is why the error persisted even after you stripped the body of train. The size comes from the captured self instance, not from the function's logic.
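To see the mechanism without Ray, here is a minimal sketch. The Trainer class, standalone_train, and carried_state_size are hypothetical stand-ins, and plain pickle is used only as a rough proxy for Ray's own serialization, so treat the numbers as indicative rather than exact:

```python
import pickle


class Trainer:  # hypothetical stand-in for the class in the question
    def __init__(self, n_bytes):
        self.train_data = bytes(n_bytes)  # placeholder for a large dataset
        self.device = "cpu"

    def train(self, config):
        return config


def standalone_train(config):
    return config


def carried_state_size(fn):
    """Pickled size of the instance state attached to fn (0 for a plain function)."""
    instance = getattr(fn, "__self__", None)  # bound methods reference their instance
    if instance is None:
        return 0
    return len(pickle.dumps(vars(instance)))


trainer = Trainer(n_bytes=1_000_000)
method_size = carried_state_size(trainer.train)
function_size = carried_state_size(standalone_train)

print("bound method references the instance:", trainer.train.__self__ is trainer)
print("state carried by trainer.train:", method_size, "bytes")
print("state carried by standalone_train:", function_size, "bytes")
```

The bound method drags along everything stored on the instance, while the module-level function carries no instance state at all. Both fixes below rely on that difference.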
Here are actionable solutions tailored to your code:
1. Convert train to a standalone function (Recommended)
By turning train into an independent function instead of a class instance method, you avoid dragging the entire self object along. Pass only the specific attributes you need (like device, df, and your datasets) via tune.with_parameters:
    # Move train outside your class as a standalone function
    def train(config, data, device, df):
        print("Train")
        net = None
        if df:
            net = Net(k1=config["k1"], k2=config["k2"], out1=config["out1"],
                      out2=config["out2"], L1=config["l1"])
        else:
            net = Net()
        net.to(device)

        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

        trainloader = torch.utils.data.DataLoader(
            data[0], batch_size=int(config["batch_size"]), shuffle=True, num_workers=8)
        valloader = torch.utils.data.DataLoader(
            data[1], batch_size=int(config["batch_size"]), shuffle=True, num_workers=8)

        # Rest of your training logic remains the same...
        tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
Then update your main method to call this standalone function:
    def main(self, num_samples=50, max_num_epochs=20, gpus_per_trial=1):
        # ... (your existing config and scheduler setup)
        result = tune.run(
            tune.with_parameters(train,
                                 data=(self.train_data, self.val),
                                 device=self.device,
                                 df=self.df),
            resources_per_trial={"cpu": 4, "gpu": 1},
            config=config,
            num_samples=num_samples,
            scheduler=scheduler,
            progress_reporter=ExperimentTerminationReporter(),
            verbose=1)
2. Use ray.put() for large objects if you need to keep train as an instance method
If you must retain train as a class method, store large objects (like your datasets) in Ray's object store first and pass the lightweight references instead. Note that ray.put() alone does not shrink self: the instance still holds the original data, so you also need to clear those attributes before handing self.train to Tune. Any other large attribute on the class needs the same treatment.

    def main(self, num_samples=50, max_num_epochs=20, gpus_per_trial=1):
        # Store large datasets in Ray's object store
        train_data_ref = ray.put(self.train_data)
        val_ref = ray.put(self.val)

        # Drop the heavy attributes so the captured self stays small.
        # This mutates the instance; restore them afterwards if you need them.
        self.train_data = None
        self.val = None

        # ... (your existing config and scheduler setup)
        result = tune.run(
            tune.with_parameters(self.train,
                                 train_data_ref=train_data_ref,
                                 val_ref=val_ref),
            resources_per_trial={"cpu": 4, "gpu": 1},
            config=config,
            num_samples=num_samples,
            scheduler=scheduler,
            progress_reporter=ExperimentTerminationReporter(),
            verbose=1)
Then modify your train method to retrieve the objects from the store:
    def train(self, config, train_data_ref, val_ref):
        print("Train")
        # Fetch data from Ray's object store
        train_data = ray.get(train_data_ref)
        val_data = ray.get(val_ref)

        net = None
        if self.df:
            net = Net(k1=config["k1"], k2=config["k2"], out1=config["out1"],
                      out2=config["out2"], L1=config["l1"])
        else:
            net = Net()
        net.to(self.device)

        criterion = nn.CrossEntropyLoss()
        optimizer = optim.SGD(net.parameters(), lr=config["lr"], momentum=0.9)

        trainloader = torch.utils.data.DataLoader(
            train_data, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8)
        valloader = torch.utils.data.DataLoader(
            val_data, batch_size=int(config["batch_size"]), shuffle=True, num_workers=8)

        # Rest of your training logic remains the same...
        tune.report(loss=(val_loss / val_steps), accuracy=correct / total)
Why tune.with_parameters() didn't fix it initially
Even though you passed data via tune.with_parameters, since train is an instance method, Ray still wraps the entire self object into the ImplicitFunc actor. This includes all other attributes in your class (not just train_data/val), which pushed the total size over Ray 1.7's error threshold.
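If you are not sure which attributes are responsible, a rough per-attribute size report can help. This is only a sketch: attribute_sizes and the Experiment class are hypothetical names, plain pickle stands in for Ray's serializer, and the 95 MiB limit is simply the figure quoted above.

```python
import pickle

# Threshold quoted above (95 MiB); used here only for a rough comparison.
SIZE_LIMIT_BYTES = 95 * 1024 * 1024


def attribute_sizes(obj):
    """Return (name, pickled_size) pairs for obj's attributes, largest first.

    Attributes that plain pickle cannot serialize are reported with size -1.
    """
    sizes = []
    for name, value in vars(obj).items():
        try:
            sizes.append((name, len(pickle.dumps(value))))
        except Exception:
            sizes.append((name, -1))
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)


class Experiment:  # hypothetical stand-in for the class that owns train()
    def __init__(self):
        self.train_data = list(range(200_000))
        self.val = list(range(20_000))
        self.device = "cpu"


report = attribute_sizes(Experiment())
for name, size in report:
    print(f"{name}: {size} bytes")

total = sum(size for _, size in report if size > 0)
print("over limit:", total > SIZE_LIMIT_BYTES)
```

Calling attribute_sizes(self) just before tune.run should show which attributes to move out of the instance or into the object store.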
The question originates from Stack Exchange; asked by TheBatz.




