Developer
Ma, Guokai
guokai.ma@gmail.com
Performance
Key patterns and highlights from this developer's activity.
Breakdown of growth, maintenance, and fixes effort over time.
Bugs introduced vs. fixed over time.
Reclassifies engineering effort based on bug attribution. Commits that introduced bugs are retrospectively counted as poor investments.
Investment Quality reclassifies engineering effort based on bug attribution data. Commits identified as buggy origins (those that introduced bugs later fixed by someone) have their grow and maintenance time moved into the Wasted Time category. Time spent on fix commits still counts as productive, since fixing bugs is valuable work. All other commits retain their standard classification: grow is productive, maintenance is maintenance, and waste (fixes) is productive.
The standard model classifies commits as Growth, Maintenance, or Fixes. Investment Quality adds a quality lens: a commit that introduced a bug is retrospectively counted as a poor investment — the engineering time spent on it was wasted because it ultimately required additional fix work. Fix commits (Fixes in the standard model) are reframed as productive, because fixing bugs is valuable work.
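The reclassification rules above can be sketched in a few lines. This is a minimal illustration, not the dashboard's actual implementation; the `Commit` record shape and the category names (`grow`, `maintenance`, `waste`, matching the table's Effort column) are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Commit:
    category: str         # "grow", "maintenance", or "waste" (a fix commit)
    hours: float
    introduced_bug: bool  # True if a later fix was attributed to this commit

def investment_quality(commits):
    """Reclassify effort per the Investment Quality model: buggy-origin
    grow/maintenance time becomes wasted; fix commits count as productive."""
    totals = {"productive": 0.0, "maintenance": 0.0, "wasted": 0.0}
    for c in commits:
        if c.category == "waste":      # fix commits are reframed as productive
            totals["productive"] += c.hours
        elif c.introduced_bug:         # buggy origin: grow/maintenance -> wasted
            totals["wasted"] += c.hours
        elif c.category == "grow":
            totals["productive"] += c.hours
        else:
            totals["maintenance"] += c.hours
    total = sum(totals.values()) or 1.0
    return {k: round(100 * v / total) for k, v in totals.items()}
```

For example, four commits of equal size where one grow commit later proved buggy would yield 60% productive, 20% maintenance, 20% wasted.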
Currently computed client-side from commit and bug attribution data. Ideal server-side endpoint:
POST /v1/organizations/{orgId}/investment-quality
Content-Type: application/json
Request:
{
"startTime": "2025-01-01T00:00:00Z",
"endTime": "2025-12-31T23:59:59Z",
"bucketSize": "BUCKET_SIZE_MONTH",
"groupBy": ["repository_id" | "deliverer_email"]
}
Response:
{
"productivePct": 74,
"maintenancePct": 18,
"wastedPct": 8,
"buckets": [
{
"bucketStart": "2025-01-01T00:00:00Z",
"productive": 4.2,
"maintenance": 1.8,
"wasted": 0.6
}
]
}
Latest analyzed commits from this developer.
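Until the server-side endpoint exists, a client can assemble the proposed request like this. This is a sketch against the endpoint as proposed above; the base URL, organization id, and the lack of an auth header here are placeholder assumptions, and any real HTTP client can send the returned URL and body.

```python
import json

def build_investment_quality_request(base_url, org_id, start, end,
                                     bucket_size="BUCKET_SIZE_MONTH",
                                     group_by=("deliverer_email",)):
    """Assemble the POST for the proposed investment-quality endpoint.

    Returns (url, json_body); group_by accepts "repository_id" and/or
    "deliverer_email" per the request schema above.
    """
    url = f"{base_url}/v1/organizations/{org_id}/investment-quality"
    body = json.dumps({
        "startTime": start,
        "endTime": end,
        "bucketSize": bucket_size,
        "groupBy": list(group_by),
    })
    return url, body
```

The response would then be parsed straight into the `productivePct` / `maintenancePct` / `wastedPct` totals and per-bucket series shown above.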
| Hash | Message | Date | Files | Effort |
|---|---|---|---|---|
| abb88ce | This commit **updates the documentation** within `AGENTS.md` and `CLAUDE.md` to provide more precise instructions for running `pre-commit` checks. Specifically, it clarifies the **CI requirements** by adding the explicit command `pre-commit run --files <changed_files>`, indicating that only modified files should be checked. This **documentation enhancement** improves developer guidance and streamlines the pre-commit process, preventing unnecessary full codebase scans and enhancing efficiency. | Mar 29 | 2 | maint |
| a240c4d | This commit introduces **HuggingFace `tp_plan` support** to **DeepSpeed AutoTP**, enabling automatic tensor parallelism for models that ship with a `base_model_tp_plan` like Llama or Qwen. It adds a new `TPPlanConverter` to translate HuggingFace's `colwise` and `rowwise` plans into DeepSpeed's `TPLayerSpec`, integrating this into the **AutoTP partitioning logic** within `engine.py`. This **new capability** significantly enhances **model compatibility** by allowing AutoTP to leverage built-in HuggingFace tensor parallelism configurations, reducing the need for manual `partition_config`. The system now prioritizes `partition_config` > HF `tp_plan` > AutoTP heuristics, streamlining the user experience for tensor parallelism with many popular models. | Mar 25 | 13 | maint |
| 49a82a0 | This commit **adds new documentation files**, `AGENTS.md` and `CLAUDE.md`, to establish project-specific rules and guidelines for **AI coding agents** like OpenCode and Claude Code contributing to the DeepSpeed codebase. This **maintenance and documentation update** aims to address common issues where AI-generated code violates DeepSpeed conventions, such as DCO sign-off, formatting (`yapf`/`flake8`), and code style. By providing explicit instructions on commit requirements, code change discipline, and tool caveats, this change will **improve the quality and consistency of AI-generated contributions** and streamline the review process for automated pull requests. The new files, located in the project's root, serve as a critical reference for automated development workflows. | Mar 13 | 2 | maint |
| a41a96b | This commit **refactors** DeepSpeed's **Automatic Mixed Precision (AMP) integration** to align with current PyTorch best practices. It replaces the legacy `get_accelerator().amp()` method with the recommended `torch.amp` API, specifically impacting the **`deepspeed/runtime/zero/linear.py` module**. This **API migration** ensures **broader compatibility** with various PyTorch backends, such as XPU, which may not provide the older device-specific AMP modules. By adopting the standard `torch.amp`, DeepSpeed enhances its **maintainability and future-proofing** for mixed precision training. | Mar 2 | 11 | maint |
| d8e15da | This commit performs a **major refactoring** to remove the dependency on `Intel Extension for PyTorch` (IPEX) for **XPU device support**, aligning DeepSpeed with native PyTorch 2.8+ XPU capabilities. It **updates the XPU accelerator logic**, `op_builder` implementations, and CI workflows to use stock PyTorch's builder protocol and SYCL compilation. This change simplifies the XPU stack, but users must upgrade to the latest PyTorch for XPU features, as DeepSpeed will no longer be compatible with previous PyTorch+IPEX setups on XPU devices. The **documentation** and **tests** have also been updated to reflect this new, streamlined approach. | Mar 2 | 16 | maint |
| 116dbe2 | This commit delivers a **bug fix** for the **DeepSpeed Muon optimizer** that addresses a `ValueError` encountered during **partial model training**. Previously, when only a subset of model parameters were trainable, the optimizer's internal parameter grouping logic could incorrectly include non-trainable parameters, leading to an empty tensor list being passed to `torch.cat`. The fix modifies the `_configure_optimizer` method in `deepspeed/runtime/engine.py` to ensure that only parameters explicitly requiring gradients are added to the optimizer's parameter groups. This prevents crashes and enhances the robustness of the **Muon optimizer** for advanced training scenarios, with new unit tests added to verify correct behavior. | Mar 1 | 2 | maint |
| 7f2f423 | This commit introduces a crucial **performance optimization** for the **DeepSpeed Muon optimizer** by relocating its **momentum buffer from CPU to GPU memory**. The core change in `deepspeed/runtime/zero/stage_1_and_2.py` modifies the `create_param_group` method to ensure this buffer is allocated directly on the device. This relocation significantly **reduces iteration time** for models leveraging the Muon optimizer, as evidenced by a 39% speedup (1500ms to 910ms) during finetuning of large models. Additionally, a new `compiler.py` utility is introduced and applied to key Muon functions like `zeropower_via_newtonschulz5` and `muon_update`, which could further enhance execution efficiency. This enhancement directly benefits **DeepSpeed ZeRO users** seeking faster training throughput. | Nov 26 | 4 | grow |
| a83fd7b | This commit **updates the documentation** for **Automatic Tensor Parallelism (AutoTP)** by explicitly adding `qwen2.5` and `qwen3` to the list of supported model families. This is a **documentation update** and **maintenance** task, correcting an omission in the `docs/_tutorials/automatic-tensor-parallelism.md` tutorial. The change ensures that users consulting the DeepSpeed tutorials are accurately informed about the full range of models compatible with AutoTP, improving clarity and usability for **Qwen2.5 and Qwen3 users**. | Nov 18 | 1 | maint |
| df59f20 | This commit introduces a **new capability** to the **DeepSpeed runtime engine**, enabling users to specify distinct learning rates for the Muon and Adam components of the `MuonWithAuxAdam` optimizer. Previously, these components shared a single learning rate, but now separate `muon_lr` and `adam_lr` parameters can be configured. This **feature enhancement** provides **finer-grained control** over the optimization process, potentially improving training stability and performance for models utilizing this specific optimizer. The change primarily affects the `deepspeed/runtime/engine.py` module. | Nov 11 | 1 | grow |
| 67b365a | This commit introduces a **configuration adjustment** to the **Muon optimizer**, preventing its application to specific neural network components. It **excludes embedding layers and language model head layers** (identified by `embed` and `lm_head` in parameter names) from using the Muon optimizer. The `set_optimizer_flags` function in `deepspeed/__init__.py` was modified to implement this exclusion. This **performance optimization** is based on empirical evidence suggesting these layers perform better without Muon, aiming to achieve **improved training stability and efficiency** for models utilizing this optimizer. | Oct 22 | 1 | waste |
| 2b68bbc | This commit **introduces a new blog post** that details a performance study on **ZenFlow and ZeRO offload with DeepSpeed CPU core binding**. This **documentation addition** provides valuable insights into optimizing DeepSpeed performance, specifically focusing on core binding techniques. It affects the **project's educational content** by adding `blogs/zenflow-corebinding/README.md` and updates the `docs/index.md` to integrate this new resource into the 'Latest News' section. This expands the available knowledge base for users interested in advanced DeepSpeed performance tuning. | Oct 6 | 2 | maint |
| 65322e1 | This commit **adds a Chinese version** of the **DeepSpeed SuperOffload blog post** to the project's documentation. Specifically, it introduces the `blogs/deepspeed-superoffload/README_cn.md` file, which provides a comprehensive overview of SuperOffload's features, working principles, and usage in Chinese. This **documentation update** significantly enhances the project's **internationalization efforts**, making crucial information about the SuperOffload module accessible to a broader, Chinese-speaking audience. | Oct 4 | 1 | maint |
| 66c7031 | This commit introduces a **bug fix** to **DeepSpeed's accelerator device handling** by updating how device identifiers are retrieved for tensor operations. Previously, using `get_accelerator().current_device()` could lead to failures when creating tensors on CPU devices, as it was primarily designed for CUDA. The change replaces this with `torch.device(get_accelerator().current_device_name())` to ensure **robust device compatibility** across various hardware. This **enhances the reliability of tensor allocation and operations** within `deepspeed/runtime/engine.py`, `deepspeed/runtime/utils.py`, and `deepspeed/runtime/zero/partitioned_param_coordinator.py`, enabling correct execution on both GPU and CPU environments. | Sep 28 | 3 | waste |
| 2585881 | This commit **enhances the usability of the Muon optimizer** by automating the configuration of its `use_muon` flags. It introduces a new function `set_optimizer_flags` within the **DeepSpeed initialization process** in `deepspeed/__init__.py`, integrating it into the `initialize` function. This **streamlines the DeepSpeed engine setup**, allowing users to enable the **Muon optimizer** simply by specifying it in `config.json`. The change **significantly simplifies the user experience**, eliminating the need for manual code modifications to `model.parameters()`. | Sep 17 | 2 | grow |
| 43537d0 | This commit introduces **automatic CPU core affinity management** for **DeepSpeed's ZenFlow optimizer workers**, resolving core binding issues (issue #7478). It **integrates ZenFlow's core binding** with DeepSpeed's existing `--bind_cores_to_rank` mechanism, dynamically splitting available CPU cores between the main PyTorch process and ZenFlow optimizer workers. A new configuration parameter, `pt_reserved_cores_perc` in `deepspeed/runtime/zenflow/zenflow_config.py`, allows users to specify the percentage of cores reserved for PyTorch threads. This **new capability** significantly **improves performance and resource utilization** for ZenFlow by ensuring optimal CPU resource allocation, primarily affecting the `ZenFlowZeroOptimizer` and `zenflow_optimizer_process` in `deepspeed/runtime/zenflow/zenflow_stage_1_and_2.py`. | Sep 4 | 2 | grow |
| f03d416 | This commit **updates the ZeRO offload tutorial** (`docs/_tutorials/zero-offload.md`) to incorporate a crucial **performance optimization tip**. It advises users to utilize the `--bind_cores_to_rank` flag in the DeepSpeed launch command, particularly for workloads heavily relying on **CPUAdam** within the ZeRO offload configuration. This **documentation enhancement** aims to increase user awareness and guide them towards **improved performance** by optimizing CPU core binding, which can lead to approximately a **1.3x speedup** in average step time for CPU-intensive operations. | Aug 8 | 1 | maint |
| 80bc7b7 | This commit **enhances DeepSpeed's Automatic Tensor Parallelism (AutoTP)** by **enabling proper metadata loading for Qwen3 models**. It specifically updates the `deepspeed/module_inject/auto_tp.py` module to include `Qwen3RMSNorm` in the list of recognized module names within the `Loading` class. This **new capability** resolves a compatibility issue (referenced #7275), allowing users to successfully apply AutoTP to Qwen3 architectures without encountering errors related to unrecognized module types. Consequently, it improves the overall **model compatibility and usability** of DeepSpeed for a wider range of large language models. | May 19 | 1 | grow |
| f459502 | This commit **rolls back** changes introduced in PR #6726 to **fix a critical bug** affecting the **DeepSpeed ZeRO Redundancy Optimizer**. Specifically, it reverts modifications to the **parameter offloading** logic in `deepspeed/runtime/zero/parameter_offload.py`, restoring the unconditional initialization of `ds_grads_remaining` within `_post_backward_module_hook`. Additionally, it undoes changes in `deepspeed/runtime/zero/partitioned_param_coordinator.py`, moving the initialization of `__n_available_params` back to the constructor and removing its reset in `release_all_live_params`. This **maintenance rollback** resolves an issue that caused incorrect state management and instability in ZeRO-enabled training, ensuring the correct functioning of partitioned parameter coordination. | May 19 | 3 | waste |
Commit activity distribution by hour and day of week. Shows when this developer is most active.
Developers who frequently work on the same files and symbols. A higher score indicates stronger code collaboration.