Developer
Masahiro Tanaka
81312776+tohtana@users.noreply.github.com
Performance
Key patterns and highlights from this developer's activity.
Breakdown of growth, maintenance, and fixes effort over time.
Bugs introduced vs. fixed over time.
Reclassifies engineering effort based on bug attribution. Commits that introduced bugs are retrospectively counted as poor investments.
Investment Quality reclassifies engineering effort based on bug attribution data. Commits identified as buggy origins (those that introduced a bug later fixed by someone) have their grow and maintenance time moved into the Wasted Time category. Fix commits (waste in the standard model) remain counted as productive. All other commits retain their standard classification: grow is productive, maintenance is maintenance, and waste (fixes) is productive.
The standard model classifies commits as Growth, Maintenance, or Fixes. Investment Quality adds a quality lens: a commit that introduced a bug is retrospectively counted as a poor investment — the engineering time spent on it was wasted because it ultimately required additional fix work. Fix commits (Fixes in the standard model) are reframed as productive, because fixing bugs is valuable work.
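The reclassification rules above can be sketched in a few lines. This is a minimal illustration, assuming hypothetical field names (`category`, `introduced_bug`, `hours`) rather than the product's actual data model:

```python
from dataclasses import dataclass

@dataclass
class Commit:
    category: str         # standard model: "grow" | "maint" | "waste"
    introduced_bug: bool  # True if a later fix commit points back here
    hours: float          # estimated effort

def reclassify(commits):
    """Apply the Investment Quality lens to standard-model commits."""
    totals = {"productive": 0.0, "maintenance": 0.0, "wasted": 0.0}
    for c in commits:
        if c.category == "waste":
            # Fix commits are reframed as productive work.
            totals["productive"] += c.hours
        elif c.introduced_bug:
            # Grow/maintenance effort on a buggy-origin commit is wasted.
            totals["wasted"] += c.hours
        elif c.category == "grow":
            totals["productive"] += c.hours
        else:  # "maint"
            totals["maintenance"] += c.hours
    return totals
```

Note that a buggy-origin commit's own time moves to wasted, while the later fix commit's time stays productive, so the same bug contributes to both buckets.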
Currently computed client-side from commit and bug attribution data. Ideal server-side endpoint:
```
POST /v1/organizations/{orgId}/investment-quality
Content-Type: application/json
```

Request:

```
{
  "startTime": "2025-01-01T00:00:00Z",
  "endTime": "2025-12-31T23:59:59Z",
  "bucketSize": "BUCKET_SIZE_MONTH",
  "groupBy": ["repository_id" | "deliverer_email"]
}
```
Response:

```json
{
  "productivePct": 74,
  "maintenancePct": 18,
  "wastedPct": 8,
  "buckets": [
    {
      "bucketStart": "2025-01-01T00:00:00Z",
      "productive": 4.2,
      "maintenance": 1.8,
      "wasted": 0.6
    }
  ]
}
```

Latest analyzed commits from this developer.
| Hash | Message | Date | Files | Effort |
|---|---|---|---|---|
| 3bdebc0 | This commit **fixes a CI failure** occurring in tests for **AutoTP (Automatic Tensor Parallelism)** and **universal checkpoint**. The issue, a "RuntimeError: Cannot re-initialize CUDA", arose because `torch.cuda.current_device()` was called prematurely during test setup under `pytest --forked`. To resolve this, a new method `_should_materialize_tp_partition` is introduced in `deepspeed/module_inject/layers.py` to conditionally skip constructor-time AutoTP materialization when no model-parallel group is provided. This **bug fix** ensures that **AutoTP** partitioning only occurs when an actual `mp_group` is present, preventing device placement issues and stabilizing the CI pipeline for these critical features. | Mar 31 | 1 | waste |
| 36f0b0c | This commit introduces a **feature enhancement** to the **CI/CD pipeline** by implementing dynamic hardware detection for DeepSpeed's full test suite. It modifies the `.github/workflows/aws-torch-latest-full.yml` workflow to detect the **CUDA architecture** and the **number of GPUs** available in the test environment. These detected values are then set as environment variables, enabling adaptive configuration of DeepSpeed installation and test execution. This change provides a crucial **fallback mechanism** to improve the **reliability** of nightly full tests, specifically addressing recent failures by allowing the system to better utilize available resources like A100 nodes. | Mar 30 | 1 | grow |
| 138f20d | This commit introduces a **backward compatibility fix** for DeepSpeed, specifically addressing issues when installing from source with **PyTorch versions older than 2.4**. It resolves a build failure caused by the absence of `torch.amp.custom_fwd` in older PyTorch releases, which was implicitly imported by DeepSpeed's `setup.py`. The **DeepSpeed runtime's Zero module** in `deepspeed/runtime/zero/linear.py` now includes a fallback mechanism, utilizing `torch.cuda.amp.custom_fwd` for these legacy environments. This ensures that users can **install and run DeepSpeed from source** on a broader range of PyTorch versions, with new unit tests verifying the correct `autocast` decorator behavior across different PyTorch versions. | Mar 25 | 2 | waste |
| 784cc26 | This commit **fixes a critical bug** in the **Evoformer attention mechanism** that caused order-dependent failures during multi-architecture CUDA builds. It **refactors** the GPU architecture detection in `csrc/deepspeed4science/evoformer_attn/gemm_kernel_utils.h` to enable runtime dispatch of appropriate kernels based on the device's compute capability. This ensures that **Evoformer** binaries built for mixed architectures (e.g., pre-Ampere and Ampere+) correctly select optimized kernels, deprecating the `DS_EVOFORMER_GPU_ARCH` build flag. The change improves the stability and performance of **Evoformer** across diverse GPU environments by providing a robust multi-architecture build and runtime solution. | Mar 13 | 4 | maint |
| 6c59d54 | This commit delivers a **critical performance fix** for **DeepSpeed's ZeRO-enabled training**, resolving a regression where dynamic gradient hook counting caused significant overhead during the backward pass. It introduces a `should_refresh_expected_hook_count()` predicate to ensure the expensive hook count computation is performed only once per reentrant backward phase, rather than for every gradient hook. This optimization is applied across **ZeRO-1, ZeRO-2, and ZeRO-3 stages** by conditionally refreshing or reusing cached hook counts, and also includes resetting counters in `enter_backward()` to prevent pollution. The **performance improvement** is substantial, leading to a 2.5x speedup in backward pass iteration times for large transformer models. | Mar 5 | 4 | maint |
| 4dba1e2 | This commit introduces a **documentation update** to the `docs/code-docs/source/training.rst` file, enhancing the project's user guidance. It adds a new section that **clarifies the behavior and usage of `torch.autocast` when nested**, specifically detailing its interaction within the **DeepSpeed engine**. This **documentation improvement** explains the rationale behind nesting `autocast` and provides guidance on when and why it is needed, thereby improving user understanding for developers utilizing these advanced training features. | Mar 4 | 1 | maint |
| 04d69cc | This commit delivers a **bug fix** addressing a `RuntimeError` encountered during `import deepspeed` on PyTorch 2.3 with Python 3.12. The `deepspeed.utils.torch.py` module's `jit_script_compat` decorator was unconditionally invoking `torch.compile()`, which lacked Dynamo support for Python 3.12 in that specific PyTorch version, leading to import crashes. The fix introduces a version gate within `jit_script_compat` to prevent `torch.compile()` calls on known-unsupported combinations and implements a robust double fallback mechanism. This ensures **DeepSpeed can be successfully imported and utilized** on these specific platform configurations, significantly improving **compatibility**. | Mar 2 | 1 | waste |
| bffaf45 | This commit **fixes a critical bug** affecting **DeepSpeed ZeRO's parameter counting** mechanism when running on **PyTorch 2.3**. Previously, the `_get_grad_fn_or_grad_acc` lookup within `count_used_parameters_in_backward` in `deepspeed/runtime/utils.py` would fail in `no-grad` contexts during backward hooks, leading to crashes. The **bug fix** explicitly wraps this lookup with `torch.enable_grad()` to ensure proper gradient function retrieval, aligning with newer PyTorch behavior. This ensures **DeepSpeed ZeRO** can reliably count used parameters and operate correctly on **PyTorch 2.3**, with a new unit test added to validate the change. | Mar 1 | 2 | waste |
| efc0b49 | This commit **enhances the Automatic Tensor Parallelism (AutoTP) documentation** by **restructuring its navigation and content**. It **adds a new sidebar navigation link** for the `AutoTP Training` tutorial, while **renaming the existing AutoTP entry** to `AutoTP Inference` for improved clarity within `docs/_data/navigation.yml`. Furthermore, this **documentation update** includes **fixing broken internal links** within the `docs/_tutorials/autotp-training.md` tutorial and updating introductory notes in `docs/_tutorials/automatic-tensor-parallelism.md` to correctly reference the new training guide. These changes improve the **discoverability and accuracy** of AutoTP resources for users. | Feb 25 | 3 | maint |
| 0416cf6 | This commit **schedules a nightly execution** for the **full unit test suite** within the **CI/CD pipeline**. It introduces a **new capability** by adding a schedule trigger to the `.github/workflows/aws-torch-latest-full.yml` workflow. This **maintenance improvement** ensures that the comprehensive tests run automatically every night, but intelligently, only when new commits have been detected since the last successful run. This regular, conditional execution helps in **continuously monitoring test stability** and **early detection of regressions** in the project's core functionalities. | Feb 24 | 1 | grow |
| 93524c8 | This commit **fixes a regression** in the `TestZeroStaticScale` unit test, which was failing for **ZeRO optimizer** stages 1, 2, and 3. The issue arose from an incorrect assertion in `tests/unit/runtime/half_precision/test_fp16.py` that attempted to access `optim.loss_scale_config.dynamic_loss_scale`, a property not present in ZeRO optimizers. This **bug fix** reverts the assertion to the correct `optim.dynamic_loss_scale`, ensuring the **FP16 half-precision tests** accurately validate static loss scaling behavior for ZeRO optimizers. This restores the integrity and reliability of **ZeRO optimizer testing** for mixed-precision training. | Feb 22 | 1 | maint |
| 57b10d5 | This commit **fixes a bug** in the **DeepSpeed ZeRO Redundancy Optimizer** by enhancing the `GatheredParameters` context manager. It introduces **sanity checks** within `deepspeed/runtime/zero/partition_parameters.py` to detect and prevent in-place modification of parameters when `modifier_rank` is `None`. Previously, an incomplete implementation failed to catch these modifications, leading to potential silent data corruption or unexpected behavior during distributed training. Now, attempting such an operation will correctly raise a `RuntimeError`, improving the **robustness and predictability** of the ZeRO optimizer. This change ensures data integrity and provides clearer error feedback to users of the **DeepSpeed ZeRO** optimization. | Feb 21 | 2 | maint |
| dbc1b07 | This commit **fixes compilation errors** on **HIP/ROCm (AMD)** platforms within the **DeepSpeed Inference v2** module. It addresses the absence of specific CUDA-style BF16 conversion intrinsics by introducing **platform-specific fallback implementations** in `deepspeed/inference/v2/kernels/includes/conversion_utils.h`. This **bug fix** ensures that integer, unsigned integer, and float to BF16 conversions, and vice-versa, are correctly handled on AMD GPUs. The change significantly improves **platform compatibility** for **DeepSpeed Inference v2**, enabling it to compile and run successfully on **HIP/ROCm** systems. | Feb 18 | 1 | waste |
| d2ca6e7 | This commit introduces a **compatibility layer** for JIT compilation within DeepSpeed, primarily to **resolve deprecation warnings** encountered when importing DeepSpeed on `torch==2.10.0`. It **refactors** several internal helper functions across the **Mixture of Experts (MoE)**, **ZeRO optimizer utilities**, and **sequence parallelism layers** by replacing direct calls to `@torch.jit.script` with a new utility decorator, `jit_script_compat`. This new utility, defined in `deepspeed/utils/torch.py`, conditionally leverages `torch.compile` for newer PyTorch versions while falling back to `torch.jit.script` for older ones. The change ensures **cleaner imports** and better alignment with PyTorch's recommended JIT compilation practices, improving **forward compatibility**. | Feb 12 | 4 | grow |
| 1752c2a | This commit **fixes gradient norm divergence** observed during **BF16 training with ZeRO stage 0** by addressing two critical bugs within the **DeepSpeed engine**. It resolves incorrect dynamic loss scaling application in `FP16_UnfusedOptimizer` and prevents unintended gradient accumulation caused by skipping `zero_grad` for BF16 without ZeRO. The **bug fix** disables loss scaling for BF16 and removes the `zero_optimization()` gate on `zero_grad`, complemented by a **refactoring** of the loss scaling mechanism to use a new `LossScaleConfig`. This ensures **stable and accurate gradient updates** for models leveraging these specific mixed-precision and optimization configurations. | Feb 12 | 8 | maint |
| a44fb58 | This commit delivers a **bug fix** and **enhancement** for **DeepSpeed's Auto Tensor Parallelism (AutoTP)**, resolving critical issues with custom pattern configurations. It updates `deepspeed/module_inject/auto_tp.py` to correctly respect `use_default_specs: false` and disable traditional injection when custom patterns are enabled, ensuring proper module replacement. Additionally, `deepspeed/runtime/tensor_parallel/init_utils.py` is modified to automatically create a tensor parallel group during `deepspeed.initialize` if `mpu` is not provided, significantly improving **Hugging Face Trainer integration**. These changes make custom AutoTP patterns reliable and enhance the overall usability and compatibility of the tensor parallelism features. | Feb 7 | 4 | maint |
| 6b9cab1 | This commit introduces a **new capability** for **Automatic Tensor Parallelism (AutoTP)**, enabling users to define **custom layer partitioning patterns** via a flexible, configuration-driven API. This allows for precise control over how model parameters are sharded, supporting **any model architecture** including those with complex fused layers and unequal sub-parameter sizes, using regex patterns within the DeepSpeed configuration. The `deepspeed.initialize` function is enhanced to simplify AutoTP setup by integrating these configurations directly, while maintaining **backward compatibility** with previous initialization methods. This significantly improves the **extensibility and usability** of AutoTP for diverse and custom model training scenarios. | Jan 31 | 19 | grow |
| 52b1d4d | This commit **fixes a race condition** within **DeepSpeed ZeRO3 leaf modules** during the backward pass, specifically when PyTorch's autograd concurrently triggers hooks for modules returning multiple outputs. This **bug fix** introduces **thread synchronization** in `deepspeed/runtime/zero/partitioned_param_coordinator.py` to ensure only a single thread handles parameter fetching for a leaf module, preventing concurrent modifications to internal parameter states. This significantly improves the **stability and correctness** of **ZeRO3** training, especially for models leveraging multi-output leaf modules. New tests in `test_zero_leaf_module.py` validate this thread-safe behavior. | Jan 30 | 2 | waste |
| b19987c | This commit performs a **maintenance update** by upgrading the **PyTorch version** used within the project's continuous integration (CI) pipelines. Specifically, the **`accelerate` and `torch_latest` CI environments** are updated to use PyTorch **v2.9.1** from the previous v2.6.0. This **chore** involves modifying `ci/accelerate.py` and `ci/torch_latest.py` to reflect the new base image version. Additionally, `ci/torch_latest.py` adjusts the `pytest` command to align with the updated torch and cuda versions, ensuring tests are run against a more current and compatible deep learning framework. | Jan 29 | 2 | maint |
| d9f3d40 | This commit **fixes a crash** in **DeepSpeed's ZeRO-3** by introducing a **clearer `RuntimeError`** when `GatheredParameters` are modified in-place without `modifier_rank` specified. Specifically, the `GatheredParameters.__exit__` method in `deepspeed/runtime/zero/partition_parameters.py` now detects and raises an actionable error, synchronized across ranks, instead of an obscure internal invariant assertion. Additionally, the `free_param` function now provides more informative error messages when parameters are still active in submodules. This **error handling improvement** enhances **debugging clarity** and the overall **developer experience** for users of ZeRO-3. | Jan 28 | 2 | maint |
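The once-per-phase refresh described for the gradient-hook performance fix (commit 6c59d54) is an instance of phase-scoped memoization: reset a dirty flag when the phase begins, and recompute the expensive value at most once until the next reset. A minimal sketch with hypothetical names, not the actual DeepSpeed code:

```python
class HookCountCache:
    """Phase-scoped memoization of an expensive count computation."""

    def __init__(self, count_fn):
        self._count_fn = count_fn  # expensive computation
        self._cached = None
        self._phase_dirty = True

    def enter_backward(self):
        # Reset at phase start so stale counts never leak across iterations.
        self._phase_dirty = True

    def expected_hook_count(self):
        if self._phase_dirty:  # refresh at most once per backward phase
            self._cached = self._count_fn()
            self._phase_dirty = False
        return self._cached
```

Every gradient hook can then read `expected_hook_count()` cheaply, since only the first call per phase pays for the recount.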
Commit activity distribution by hour and day of week. Shows when this developer is most active.
Developers who frequently work on the same files and symbols. Higher score means stronger code collaboration.