AWS EMR Technical Question: Pre-Installed Python Packages on Cluster Nodes and Automatic Module Installation at Bootstrap

Great question! Let's break this down into two clear parts: checking pre-installed Python packages on your EMR nodes, and automating package installation during the cluster bootstrap phase.

1. Checking Pre-Installed Python Packages on EMR Nodes

EMR nodes (master, core, and task) come with a standard set of Python packages pre-installed, and you can verify them in a couple of ways:

  • SSH into the Master Node
    The easiest way is to SSH into your EMR master node (you can find the public IP in the AWS Console under your cluster details). Once connected, run these commands to list packages:

    # List packages for Python 3 (default in EMR 6.x+)
    pip3 list
    # Get a version-formatted list
    pip3 freeze
    
    # For Python 2 (only available in older EMR versions like 5.x)
    pip list
    

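    Core and task nodes are built from the same base image as the master, so the base package list should match, although applications installed only on the master can add packages there. To confirm on a core or task node, SSH from the master to the node's private DNS name (yarn node -list shows the node hostnames), for example ssh hadoop@<node-private-dns>. This requires your key to be available on the master, for instance through SSH agent forwarding.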

  • Check Within Your MapReduce Code
    If you don't want to log into nodes, add a quick function to your Python code to print installed packages during job execution:

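    import sys
    import pkg_resources
    
    def log_installed_packages():
        # Write to stderr: in Hadoop Streaming, stdout carries the mapper/reducer output
        installed = sorted(f"{pkg.key}=={pkg.version}" for pkg in pkg_resources.working_set)
        print("Installed Python Packages:", file=sys.stderr)
        for pkg in installed:
            print(pkg, file=sys.stderr)
    
    # Call this once at the start of your mapper or reducer script
    log_installed_packages()
    

    The list ends up in the stderr log of each task attempt, which EMR copies to the S3 log location configured for the cluster.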
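Both checks above produce name==version lines, so the result can be compared against the packages a job needs. The following is a minimal sketch of that comparison; the function names parse_freeze and find_gaps and the sample pins are illustrative assumptions, not part of EMR or pip.

```python
def parse_freeze(text):
    """Parse `pip3 freeze`-style lines ("name==version") into a dict."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries that are not simple pins
        # (for example editable installs or "name @ file://..." lines).
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.strip().lower()] = version.strip()
    return packages


def find_gaps(required, installed):
    """Return {name: (wanted, found)} for missing or mismatched packages."""
    gaps = {}
    for name, wanted in required.items():
        found = installed.get(name.lower())
        if found != wanted:
            gaps[name] = (wanted, found)
    return gaps


if __name__ == "__main__":
    # Hypothetical sample; in practice paste the output of `pip3 freeze`.
    sample = "numpy==1.21.6\npandas==1.3.5\n"
    print(find_gaps({"numpy": "1.21.6", "scipy": "1.7.3"}, parse_freeze(sample)))
```

Anything reported by find_gaps is a candidate for the bootstrap installation described in the next section.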

2. Automating Package Installation During Bootstrap Phase

EMR Bootstrap Actions let you run custom scripts or predefined actions on every node while the cluster is being provisioned, which makes them a good fit for installing missing Python packages across all nodes. Here are two approaches:

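Option 1: Use a Ready-Made Pip Installer Script

One approach is to point a bootstrap action at a ready-made pip installer script and pass the package names as arguments. Note that the S3 paths in this example do not appear among the predefined bootstrap actions in the EMR documentation, so confirm that they exist (for example with aws s3 ls) before relying on them, and fall back to Option 2 if they do not. Here's how to wire it up with the boto library: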

import boto.emr
from boto.emr.bootstrap_action import BootstrapAction
from boto.emr.instance_group import InstanceGroup

# Connect to your target AWS region
conn = boto.emr.connect_to_region('us-east-1')

# Launch the cluster with a bootstrap action
cluster_id = conn.run_jobflow(
    name='My EMR Cluster',
    log_uri='s3://your-emr-log-bucket/',
    ec2_keyname='your-key-pair-name',
    # boto expects InstanceGroup objects: (num_instances, role, type, market, name)
    instance_groups=[
        InstanceGroup(1, 'MASTER', 'm5.xlarge', 'ON_DEMAND', 'Master'),
        InstanceGroup(2, 'CORE', 'm5.xlarge', 'ON_DEMAND', 'Core')
    ],
    bootstrap_actions=[
        # BootstrapAction takes (name, path, bootstrap_action_args)
        BootstrapAction(
            'Install Python Packages',
            # Unverified paths: confirm they exist before relying on them
            's3://elasticmapreduce/libs/script-runner/script-runner.jar',
            ['s3://elasticmapreduce/libs/pip/pip-installer.py', 'pandas', 'numpy', 'scipy']
        )
    ],
    steps=[
        # Add your MapReduce job steps here
    ]
)

Bootstrap actions run on each node as it is provisioned, so if the action succeeds the specified packages are installed on every node in the cluster.

Option 2: Custom Bootstrap Script

For more control (like installing system dependencies or specific package versions), create a custom shell script:

  1. Write a script named install_packages.sh:
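#!/bin/bash

# Update system packages (optional; this lengthens bootstrap time)
sudo yum update -y

# Install system-level dependencies (example: for building psycopg2 from source)
sudo yum install -y postgresql-devel gcc python3-devel

# Install pinned Python packages with pip3
# (pandas 1.4+ requires Python 3.8+, so 1.3.5 is pinned here for Python 3.7;
# adjust the pins to the Python version on your EMR release)
sudo pip3 install pandas==1.3.5 numpy==1.21.6 psycopg2-binary==2.9.5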

  2. Upload the script to an S3 bucket (e.g., s3://your-bootstrap-scripts/install_packages.sh).

  3. Reference it in your boto cluster creation code:

import boto.emr
from boto.emr.bootstrap_action import BootstrapAction
from boto.emr.instance_group import InstanceGroup

conn = boto.emr.connect_to_region('us-east-1')

cluster_id = conn.run_jobflow(
    name='Custom EMR Cluster',
    log_uri='s3://your-emr-log-bucket/',
    ec2_keyname='your-key-pair-name',
    # boto expects InstanceGroup objects: (num_instances, role, type, market, name)
    instance_groups=[
        InstanceGroup(1, 'MASTER', 'm5.xlarge', 'ON_DEMAND', 'Master'),
        InstanceGroup(2, 'CORE', 'm5.xlarge', 'ON_DEMAND', 'Core')
    ],
    bootstrap_actions=[
        # BootstrapAction takes (name, path, bootstrap_action_args);
        # the script needs no arguments, so pass an empty list
        BootstrapAction(
            'Custom Package Setup',
            's3://your-bootstrap-scripts/install_packages.sh',
            []
        )
    ],
    steps=[
        # Add your MapReduce job steps here
    ]
)
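The boto library used above is the legacy AWS SDK for Python; its successor is boto3. As a rough sketch of the same custom-script launch in boto3, the helper below builds the arguments for the EMR client's run_job_flow call. The helper names, the release label emr-6.10.0, and the default role names EMR_EC2_DefaultRole and EMR_DefaultRole are assumptions to adapt to your account, and the bucket paths are the same placeholders as before.

```python
def build_cluster_request(name, log_uri, key_name, script_path,
                          script_args=None, release_label="emr-6.10.0"):
    """Build keyword arguments for boto3's EMR run_job_flow call."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": release_label,  # assumed release; pick your own
        "Instances": {
            "Ec2KeyName": key_name,
            "KeepJobFlowAliveWhenNoSteps": True,
            "InstanceGroups": [
                {"Name": "Master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "Core", "InstanceRole": "CORE",
                 "InstanceType": "m5.xlarge", "InstanceCount": 2},
            ],
        },
        "BootstrapActions": [
            {"Name": "Custom Package Setup",
             "ScriptBootstrapAction": {"Path": script_path,
                                       "Args": list(script_args or [])}},
        ],
        # Assumed default role names; replace with the roles in your account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
        "Steps": [],
    }


def launch_cluster(request, region="us-east-1"):
    """Submit the request; needs boto3 installed and AWS credentials configured."""
    import boto3  # imported lazily so the builder above works without boto3
    client = boto3.client("emr", region_name=region)
    return client.run_job_flow(**request)["JobFlowId"]
```

Keeping the request builder separate from the API call makes it possible to inspect or unit-test the configuration without touching AWS; launch_cluster(build_cluster_request(...)) would then start the cluster and return its ID.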

Quick Tips

  • EMR Version Notes: EMR 6.x runs on Amazon Linux 2 and defaults to Python 3 (EMR 7.x moves to Amazon Linux 2023). For older EMR versions (5.x), swap pip3 with pip and adjust system package commands if needed.
  • Global Installation: Use sudo to install packages globally, so that every user on the node, including the one that runs your MapReduce tasks, can import them.
  • Troubleshooting: Bootstrap action logs are written to /mnt/var/log/bootstrap-actions/ on each node and copied to your S3 log location under <cluster-id>/node/<instance-id>/bootstrap-actions/. Check these if packages fail to install.
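To make the troubleshooting tip concrete, here is a small sketch that assembles the S3 prefix where a node's bootstrap logs should appear. It assumes the layout <log-uri>/<cluster-id>/node/<instance-id>/bootstrap-actions/; the function name and the IDs in the example are hypothetical, so check the result against what you actually see in your bucket.

```python
def bootstrap_log_prefix(log_uri, cluster_id, instance_id):
    """Build the S3 prefix holding one node's bootstrap action logs.

    Assumes the layout <log-uri>/<cluster-id>/node/<instance-id>/bootstrap-actions/.
    """
    base = log_uri.rstrip("/")  # tolerate a trailing slash on the log URI
    return f"{base}/{cluster_id}/node/{instance_id}/bootstrap-actions/"


if __name__ == "__main__":
    # Hypothetical cluster and instance IDs, for illustration only.
    print(bootstrap_log_prefix("s3://your-emr-log-bucket/",
                               "j-EXAMPLECLUSTER", "i-exampleinstance"))
```

Listing that prefix (for example with aws s3 ls) should show one numbered folder per bootstrap action, each with its captured output.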

This question originally comes from Stack Exchange; asked by pankaj agarwal.
