AWS EMR technical question: pre-installed Python packages on cluster nodes and automatic module installation at bootstrap
Great question! Let's break this down into two clear parts: checking pre-installed Python packages on your EMR nodes, and automating package installation during the cluster bootstrap phase.
EMR nodes (master, core, and task) come with a standard set of Python packages pre-installed, and you can verify them in a couple of ways:
SSH into the Master Node
The easiest way is to SSH into your EMR master node (you can find its public DNS name in the AWS Console under your cluster details). Once connected, run these commands to list packages:

    # List packages for Python 3 (default in EMR 6.x+)
    pip3 list

    # Same list in requirements format (name==version)
    pip3 freeze

    # For Python 2 (the default on older releases such as EMR 5.x)
    pip list

Core and task nodes are built from the same base image as the master, so their package lists should generally match, although they are not guaranteed to be identical. To confirm on a core or task node, SSH from the master to that node's private DNS name or IP as the hadoop user (ssh hadoop@<node-private-dns>). The key pair must be available on the master, for example through SSH agent forwarding.

Check Within Your MapReduce Code
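If you don't want to log into nodes, add a quick function to your Python code to log installed packages during job execution. Write to stderr rather than stdout: in a Hadoop Streaming job, stdout carries the mapper or reducer output, so printing there would corrupt your results.

    import sys
    import pkg_resources

    def log_installed_packages():
        installed = sorted(
            f"{pkg.key}=={pkg.version}" for pkg in pkg_resources.working_set
        )
        sys.stderr.write("Installed Python Packages:\n")
        for pkg in installed:
            sys.stderr.write(pkg + "\n")

    # Call this once at the start of your mapper or reducer script
    log_installed_packages()

You can find the output later in the task stderr logs, which EMR copies to your S3 log location.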
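Once you have the pip3 freeze output from a node, or the list logged by the function above, a small helper can report which of the packages your job needs are missing or at a different version. This is only a minimal sketch: parse_freeze and find_missing are hypothetical helper names, the sample freeze text is made up for illustration, and package names are compared case-insensitively without any further normalization.

```python
def parse_freeze(freeze_text):
    """Parse `pip freeze` output into {lowercased name: version}."""
    installed = {}
    for line in freeze_text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries without a pinned version
        # (for example editable installs).
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        installed[name.strip().lower()] = version.strip()
    return installed


def find_missing(installed, required):
    """Return requirements that are absent or at a different version.

    `required` maps package name -> version string, or None for "any version".
    """
    problems = {}
    for name, wanted in required.items():
        have = installed.get(name.lower())
        if have is None:
            problems[name] = "not installed"
        elif wanted is not None and have != wanted:
            problems[name] = "installed %s, wanted %s" % (have, wanted)
    return problems


if __name__ == "__main__":
    sample = "numpy==1.21.6\nsix==1.13.0\n"  # illustrative, not real EMR output
    required = {"numpy": "1.21.6", "pandas": None}
    print(find_missing(parse_freeze(sample), required))
```

Anything the helper reports is a candidate for the bootstrap installation described next.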
EMR Bootstrap Actions let you run custom scripts or predefined actions when your cluster spins up—perfect for installing missing Python packages across all nodes. Here are two reliable methods:
Option 1: Use EMR's Predefined Pip Installer
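A bootstrap action is an S3 script path plus a list of arguments, so a generic pip installer script only needs the package names as arguments. Verify that the script path below exists in your region before relying on it (for example with aws s3 ls); if it is not available, use Option 2. Note that script-runner.jar is meant for running scripts as steps, so the bootstrap action references the script directly. Here's how to implement it with the boto library:

    import boto.emr
    from boto.emr.bootstrap_action import BootstrapAction
    from boto.emr.instance_group import InstanceGroup

    # Connect to your target AWS region
    conn = boto.emr.connect_to_region('us-east-1')

    # boto expects InstanceGroup objects: (count, role, type, market, name)
    instance_groups = [
        InstanceGroup(1, 'MASTER', 'm5.xlarge', 'ON_DEMAND', 'Master'),
        InstanceGroup(2, 'CORE', 'm5.xlarge', 'ON_DEMAND', 'Core'),
    ]

    # BootstrapAction takes (name, script path, list of arguments)
    install_packages = BootstrapAction(
        'Install Python Packages',
        's3://elasticmapreduce/libs/pip/pip-installer.py',  # unverified path
        ['pandas', 'numpy', 'scipy'],
    )

    # Launch the cluster with the bootstrap action
    cluster_id = conn.run_jobflow(
        name='My EMR Cluster',
        log_uri='s3://your-emr-log-bucket/',
        ec2_keyname='your-key-pair-name',
        instance_groups=instance_groups,
        bootstrap_actions=[install_packages],
        steps=[],  # Add your MapReduce job steps here
    )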
This will install the specified packages on every node in the cluster.
Option 2: Custom Bootstrap Script
For more control (like installing system dependencies or specific package versions), create a custom shell script:
- Write a script named install_packages.sh:

      #!/bin/bash

      # Update system packages (optional; this lengthens bootstrap time)
      sudo yum update -y

      # Install system-level dependencies (example: headers and a compiler
      # for packages that build C extensions from source)
      sudo yum install -y postgresql-devel gcc python3-devel

      # Install Python packages with pip3. pandas 1.3.5 is the last release
      # that supports Python 3.7; adjust the pins to your cluster's Python.
      sudo pip3 install pandas==1.3.5 numpy==1.21.6 psycopg2-binary==2.9.5

- Upload the script to an S3 bucket (e.g., s3://your-bootstrap-scripts/install_packages.sh).
- Reference it in your boto cluster creation code:
      import boto.emr
      from boto.emr.bootstrap_action import BootstrapAction
      from boto.emr.instance_group import InstanceGroup

      conn = boto.emr.connect_to_region('us-east-1')

      # boto expects InstanceGroup objects: (count, role, type, market, name)
      instance_groups = [
          InstanceGroup(1, 'MASTER', 'm5.xlarge', 'ON_DEMAND', 'Master'),
          InstanceGroup(2, 'CORE', 'm5.xlarge', 'ON_DEMAND', 'Core'),
      ]

      # BootstrapAction takes (name, script path, list of arguments)
      custom_setup = BootstrapAction(
          'Custom Package Setup',
          's3://your-bootstrap-scripts/install_packages.sh',
          [],
      )

      cluster_id = conn.run_jobflow(
          name='Custom EMR Cluster',
          log_uri='s3://your-emr-log-bucket/',
          ec2_keyname='your-key-pair-name',
          instance_groups=instance_groups,
          bootstrap_actions=[custom_setup],
          steps=[],  # Add your MapReduce job steps here
      )
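The boto library used above is the legacy SDK that boto3 has since replaced, and the same bootstrap action can be expressed with boto3's run_job_flow. The sketch below only builds the request arguments, so it can be inspected without AWS credentials. build_cluster_request and launch are hypothetical helper names, and the release label, IAM role names, bucket, and key pair are placeholder assumptions to replace with your own values.

```python
import json


def build_cluster_request(script_uri, script_args=None):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": "Custom EMR Cluster",
        "LogUri": "s3://your-emr-log-bucket/",
        "ReleaseLabel": "emr-6.15.0",  # assumption: pick your own release
        "Instances": {
            "Ec2KeyName": "your-key-pair-name",
            "KeepJobFlowAliveWhenNoSteps": False,
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
        },
        "BootstrapActions": [
            {
                "Name": "Custom Package Setup",
                "ScriptBootstrapAction": {
                    "Path": script_uri,
                    "Args": list(script_args or []),
                },
            }
        ],
        "Steps": [],  # add your MapReduce job steps here
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumption: default EC2 role
        "ServiceRole": "EMR_DefaultRole",      # assumption: default service role
    }


def launch(request, region="us-east-1"):
    """Submit the request; needs boto3 installed and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("emr", region_name=region)
    return client.run_job_flow(**request)["JobFlowId"]


if __name__ == "__main__":
    request = build_cluster_request(
        "s3://your-bootstrap-scripts/install_packages.sh"
    )
    print(json.dumps(request, indent=2))
```

Keeping the request builder separate from the launch call makes it easy to review or unit-test the bootstrap configuration before any cluster is started.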
Quick Tips
- EMR Version Notes: EMR 6.x uses Amazon Linux 2 (7.x moves to Amazon Linux 2023), and both default to Python 3. For older EMR versions (5.x), swap pip3 with pip and adjust system package commands if needed.
- Global Installation: Use sudo to install packages globally. Bootstrap actions run as the hadoop user, and a global install makes the packages importable by whichever user runs your MapReduce tasks.
- Troubleshooting: Bootstrap action logs live in your S3 log bucket under <cluster-id>/node/<instance-id>/bootstrap-actions/ (and on each node under /mnt/var/log/bootstrap-actions/). Check these if packages fail to install.
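To locate those logs programmatically, you can derive the S3 prefix from the cluster's log URI. The sketch below assumes the <log-uri>/<cluster-id>/node/<instance-id>/bootstrap-actions/ layout; bootstrap_log_prefix is a hypothetical helper, and the cluster and instance IDs in the example are placeholders.

```python
from urllib.parse import urlparse


def bootstrap_log_prefix(log_uri, cluster_id, instance_id):
    """Return (bucket, key prefix) for one node's bootstrap action logs.

    Assumes the layout <log_uri>/<cluster_id>/node/<instance_id>/bootstrap-actions/.
    """
    parsed = urlparse(log_uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError("expected an s3:// URI, got %r" % log_uri)
    base = parsed.path.strip("/")
    # Drop the base component when the log URI points at the bucket root.
    parts = [p for p in (base, cluster_id, "node", instance_id, "bootstrap-actions") if p]
    return parsed.netloc, "/".join(parts) + "/"


if __name__ == "__main__":
    bucket, prefix = bootstrap_log_prefix(
        "s3://your-emr-log-bucket/", "j-EXAMPLE", "i-EXAMPLE"
    )
    print("s3://%s/%s" % (bucket, prefix))
```

The returned bucket and prefix can be passed to any S3 listing tool to fetch the stdout and stderr files each bootstrap action wrote.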
This question originally comes from Stack Exchange; it was asked by pankaj agarwal.




