
Accelerate foundation model training and fine-tuning with new Amazon SageMaker HyperPod recipes



Today, we’re announcing the general availability of Amazon SageMaker HyperPod recipes to help data scientists and developers of all skill sets get started training and fine-tuning foundation models (FMs) in minutes with state-of-the-art performance. They can now access optimized recipes for training and fine-tuning popular publicly available FMs such as Llama 3.1 405B, Llama 3.2 90B, or Mixtral 8x22B.

At AWS re:Invent 2023, we launched SageMaker HyperPod to reduce the time to train FMs by up to 40 percent and scale across more than a thousand compute resources in parallel with preconfigured distributed training libraries. With SageMaker HyperPod, you can find the accelerated compute resources required for training, create the most optimal training plans, and run training workloads across different blocks of capacity based on the availability of compute resources.

SageMaker HyperPod recipes include a training stack tested by AWS, removing the tedious work of experimenting with different model configurations and eliminating weeks of iterative evaluation and testing. The recipes automate several critical steps, such as loading training datasets, applying distributed training techniques, automating checkpoints for faster recovery from faults, and managing the end-to-end training loop.

With a simple recipe change, you can seamlessly switch between GPU- or Trainium-based instances to further optimize training performance and reduce costs. You can easily run workloads in production on SageMaker HyperPod or SageMaker training jobs.
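As a rough illustration of what such a change looks like, the snippet below swaps the instance type and recipe in the cluster configuration (covered in the next section) to target Trainium. The recipe path is a placeholder rather than an exact name from the repository:

# Illustrative sketch only: switching a GPU recipe to a Trainium-based one.
# The recipe path is a placeholder; choose an actual Trainium recipe
# from the repository for your model.
defaults:
- cluster: slurm
- recipes: fine-tuning/llama/<a_trainium_recipe_for_your_model>
instance_type: ml.trn1.32xlarge # Trainium-based instance instead of ml.p5.48xlarge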

SageMaker HyperPod recipes in action
To get started, visit the SageMaker HyperPod recipes GitHub repository to browse training recipes for popular publicly available FMs.

You only need to edit straightforward recipe parameters to specify an instance type and the location of your dataset in the cluster configuration, then run the recipe with a one-line command to achieve state-of-the-art performance.

After cloning the repository, edit the recipe config.yaml file to specify the model and cluster type.

$ git clone --recursive https://github.com/aws/sagemaker-hyperpod-recipes.git
$ cd sagemaker-hyperpod-recipes
$ pip3 install -r requirements.txt
$ cd ./recipes_collection
$ vim config.yaml

The recipes support SageMaker HyperPod with Slurm, SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS), and SageMaker training jobs. For example, you can set a cluster type (Slurm orchestrator), a model name (Meta Llama 3.1 405B language model), an instance type (ml.p5.48xlarge), and your data locations, such as where to store the training data, results, logs, and so on.

defaults:
- cluster: slurm # support: slurm / k8s / sm_jobs
- recipes: fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora # name of model to be trained
debug: False # set to True to debug the launcher configuration
instance_type: ml.p5.48xlarge # or other supported cluster instances
base_results_dir: # Location(s) to store the results, checkpoints, logs etc.

You can optionally adjust model-specific training parameters in this YAML file, which outlines the optimal configuration, including the number of accelerator devices, instance type, training precision, parallelization and sharding techniques, the optimizer, and logging to monitor experiments through TensorBoard.

run:
  name: llama-405b
  results_dir: ${base_results_dir}/${.name}
  time_limit: "6-00:00:00"
restore_from_path: null
trainer:
  devices: 8
  num_nodes: 2
  accelerator: gpu
  precision: bf16
  max_steps: 50
  log_every_n_steps: 10
  ...
exp_manager:
  exp_dir: # location for TensorBoard logging
  name: helloworld
  create_tensorboard_logger: True
  create_checkpoint_callback: True
  checkpoint_callback_params:
    ...
  auto_checkpoint: True # for automated checkpointing
use_smp: True
distributed_backend: smddp # optimized collectives
# Start training from the pretrained model
model:
  model_type: llama_v3
  train_batch_size: 4
  tensor_model_parallel_degree: 1
  expert_model_parallel_degree: 1
  # other model-specific params

To run this recipe in SageMaker HyperPod with Slurm, you must prepare the SageMaker HyperPod cluster following the cluster setup instructions.

Then, connect to the SageMaker HyperPod head node, access the Slurm controller, and copy the edited recipe. Next, run a helper file to generate a Slurm submission script for the job, which you can use for a dry run to examine the contents before starting the training job.

$ python3 main.py --config-path recipes_collection --config-name=config
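If you want to check what the launcher produced before the training job actually starts, you can inspect the generated submission artifacts under your results directory. The paths below are illustrative and depend on your base_results_dir and run name:

$ ls ${base_results_dir}/llama-405b/        # illustrative path; actual layout depends on your configuration
$ cat ${base_results_dir}/llama-405b/*.sh   # review the generated Slurm submission script before the job starts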

After training completes, the trained model is automatically saved to your assigned data location.

To run this recipe on SageMaker HyperPod with Amazon EKS, clone the recipe from the GitHub repository, install the requirements, and edit the recipe (cluster: k8s) on your laptop. Then, create a connection between your laptop and the running EKS cluster and subsequently use the HyperPod Command Line Interface (CLI) to run the recipe.
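One common way to establish that connection is to point your local kubeconfig at the EKS cluster with the AWS CLI; the cluster name and Region below are placeholders:

$ aws eks update-kubeconfig --region <region> --name <your_eks_cluster_name>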

$ hyperpod start-job --recipe fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora \
--persistent-volume-claims fsx-claim:data \
--override-parameters \
'{
  "recipes.run.name": "hf-llama3-405b-seq8k-gpu-qlora",
  "recipes.exp_manager.exp_dir": "/data/<your_exp_dir>",
  "cluster": "k8s",
  "cluster_type": "k8s",
  "container": "658645717510.dkr.ecr.<region>.amazonaws.com/smdistributed-modelparallel:2.4.1-gpu-py311-cu121",
  "recipes.model.data.train_dir": "<your_train_data_dir>",
  "recipes.model.data.val_dir": "<your_val_data_dir>"
}'
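Because the job is launched as pods on the EKS cluster, you can follow its progress with standard kubectl commands once your kubeconfig points at the cluster. The pod name and namespace below are placeholders; use whichever namespace your job was submitted to:

$ kubectl get pods -n <namespace>                      # list the pods launched for the training job
$ kubectl logs -f <training_pod_name> -n <namespace>   # stream logs from one of the training pods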

You can also run a recipe on SageMaker training jobs using the SageMaker Python SDK. The following example runs PyTorch training scripts on SageMaker training jobs while overriding training recipes.

...
from sagemaker.pytorch import PyTorch  # SageMaker Python SDK estimator

recipe_overrides = {
    "run": {
        "results_dir": "/opt/ml/model",
    },
    "exp_manager": {
        "exp_dir": "",
        "explicit_log_dir": "/opt/ml/output/tensorboard",
        "checkpoint_dir": "/opt/ml/checkpoints",
    },
    "model": {
        "data": {
            "train_dir": "/opt/ml/input/data/train",
            "val_dir": "/opt/ml/input/data/val",
        },
    },
}

pytorch_estimator = PyTorch(
    output_path=<output_path>,
    base_job_name="llama-recipe",
    role=<role>,
    instance_type="p5.48xlarge",
    training_recipe="fine-tuning/llama/hf_llama3_405b_seq8k_gpu_qlora",
    recipe_overrides=recipe_overrides,
    sagemaker_session=sagemaker_session,
    tensorboard_output_config=tensorboard_output_config,
)
...
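To launch the training job, you would then call fit() on the estimator, mapping your datasets in Amazon S3 to the train and val channels that the recipe overrides above point at. The S3 URIs are placeholders:

# Hypothetical launch call; replace the S3 URIs with your own dataset locations.
pytorch_estimator.fit(
    inputs={
        "train": "s3://<your_bucket>/datasets/train/",  # mounted at /opt/ml/input/data/train
        "val": "s3://<your_bucket>/datasets/val/",      # mounted at /opt/ml/input/data/val
    },
    wait=True,
)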

As training progresses, the model checkpoints are saved on Amazon Simple Storage Service (Amazon S3) with the fully automated checkpointing capability, enabling faster recovery from training faults and instance restarts.
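If you want to confirm that checkpoints are being written while the job runs, a quick listing of the checkpoint location works; the bucket and prefix below are placeholders for wherever your job is configured to store them:

$ aws s3 ls s3://<your_bucket>/<your_checkpoint_prefix>/ --recursive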

Now available
Amazon SageMaker HyperPod recipes are available today in the SageMaker HyperPod recipes GitHub repository. To learn more, visit the SageMaker HyperPod product page and the Amazon SageMaker AI Developer Guide.

Give SageMaker HyperPod recipes a try and send feedback to AWS re:Post for SageMaker or through your usual AWS Support contacts.

Channy


