
Ma, Guokai

Developer

guokai.ma@gmail.com

18 commits · ~4 files/commit

Performance

2026 · Previous year

Insights

Key patterns and highlights from this developer's activity.

Peak Month: Mar '26 (100 performance)
Growth Trend: ↓47% vs. prior period
Avg Files/Commit: 4 files per commit
Active Days: 16 of 455 days
Top Repo: DeepSpeed (18 commits)

Effort Over Time

Breakdown of growth, maintenance, and fixes effort over time.

Bug Behavior

Beta

Bugs introduced vs. fixed over time.

Investment Quality

Beta

Reclassifies engineering effort based on bug attribution. Commits that introduced bugs are retrospectively counted as poor investments.

17% Productive Time (Growth 80% + Fixes 20%)
81% Maintenance Time
2% Wasted Time

Methodology

Investment Quality reclassifies engineering effort based on bug attribution data. Commits identified as buggy origins (those that introduced bugs later fixed by someone) have their grow and maintenance time moved into the Wasted Time category; their waste time (fix commits) remains counted as productive. All other commits retain their standard classification: grow is productive, maintenance is maintenance, and waste (fixes) is productive.
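
As a rough sketch of this reclassification, the example below assumes each commit record carries its standard effort split plus a buggy-origin flag; the field names (growHours, maintenanceHours, fixHours, isBuggyOrigin) are illustrative and not the actual data model.

Sketch (TypeScript):
// Minimal sketch of the reclassification described above. Field names are
// hypothetical; the real commit schema may differ.
interface CommitEffort {
  growHours: number;        // standard Growth effort on this commit
  maintenanceHours: number; // standard Maintenance effort on this commit
  fixHours: number;         // standard Fixes ("waste") effort on this commit
  isBuggyOrigin: boolean;   // true if a bug introduced here was later fixed by someone
}

function investmentQuality(commits: CommitEffort[]) {
  let productive = 0;
  let maintenance = 0;
  let wasted = 0;
  for (const c of commits) {
    // Fix work always counts as productive.
    productive += c.fixHours;
    if (c.isBuggyOrigin) {
      // Growth and maintenance time on a buggy-origin commit is reclassified as wasted.
      wasted += c.growHours + c.maintenanceHours;
    } else {
      productive += c.growHours;
      maintenance += c.maintenanceHours;
    }
  }
  const total = productive + maintenance + wasted || 1; // guard against empty input
  return {
    productivePct: Math.round((100 * productive) / total),
    maintenancePct: Math.round((100 * maintenance) / total),
    wastedPct: Math.round((100 * wasted) / total),
  };
}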

Relationship to Growth / Maintenance / Fixes

The standard model classifies commits as Growth, Maintenance, or Fixes. Investment Quality adds a quality lens: a commit that introduced a bug is retrospectively counted as a poor investment — the engineering time spent on it was wasted because it ultimately required additional fix work. Fix commits (Fixes in the standard model) are reframed as productive, because fixing bugs is valuable work.

Proposed API Endpoint

This metric is currently computed client-side from commit and bug attribution data. An ideal server-side endpoint would look like:

POST /v1/organizations/{orgId}/investment-quality
Content-Type: application/json

Request:
{
  "startTime": "2025-01-01T00:00:00Z",
  "endTime": "2025-12-31T23:59:59Z",
  "bucketSize": "BUCKET_SIZE_MONTH",
  "groupBy": ["repository_id" | "deliverer_email"]
}

Response:
{
  "productivePct": 74,
  "maintenancePct": 18,
  "wastedPct": 8,
  "buckets": [
    {
      "bucketStart": "2025-01-01T00:00:00Z",
      "productive": 4.2,
      "maintenance": 1.8,
      "wasted": 0.6
    }
  ]
}
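
Until such an endpoint exists, a client-side call shaped after the proposed contract might look like the sketch below; the URL, payload, and response types simply mirror the draft spec above and are not an implemented API.

Sketch (TypeScript):
// Sketch of a client call to the proposed endpoint. The endpoint is not
// implemented yet; this only mirrors the draft request/response shapes above.
interface InvestmentQualityBucket {
  bucketStart: string;
  productive: number;
  maintenance: number;
  wasted: number;
}

interface InvestmentQualityResponse {
  productivePct: number;
  maintenancePct: number;
  wastedPct: number;
  buckets: InvestmentQualityBucket[];
}

async function fetchInvestmentQuality(orgId: string): Promise<InvestmentQualityResponse> {
  const res = await fetch(`/v1/organizations/${orgId}/investment-quality`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      startTime: "2025-01-01T00:00:00Z",
      endTime: "2025-12-31T23:59:59Z",
      bucketSize: "BUCKET_SIZE_MONTH",
      groupBy: ["repository_id"], // or ["deliverer_email"]
    }),
  });
  if (!res.ok) throw new Error(`investment-quality request failed: ${res.status}`);
  return (await res.json()) as InvestmentQualityResponse;
}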

Recent Activity

Latest analyzed commits from this developer.

abb88ce · Mar 29

This commit **updates the documentation** within `AGENTS.md` and `CLAUDE.md` to provide more precise instructions for running `pre-commit` checks. Specifically, it clarifies the **CI requirements** by adding the explicit command `pre-commit run --files <changed_files>`, indicating that only modified files should be checked. This **documentation enhancement** improves developer guidance and streamlines the pre-commit process, preventing unnecessary full codebase scans and enhancing efficiency.

2 files · maint
a240c4d · Mar 25

This commit introduces **HuggingFace `tp_plan` support** to **DeepSpeed AutoTP**, enabling automatic tensor parallelism for models that ship with a `base_model_tp_plan` like Llama or Qwen. It adds a new `TPPlanConverter` to translate HuggingFace's `colwise` and `rowwise` plans into DeepSpeed's `TPLayerSpec`, integrating this into the **AutoTP partitioning logic** within `engine.py`. This **new capability** significantly enhances **model compatibility** by allowing AutoTP to leverage built-in HuggingFace tensor parallelism configurations, reducing the need for manual `partition_config`. The system now prioritizes `partition_config` > HF `tp_plan` > AutoTP heuristics, streamlining the user experience for tensor parallelism with many popular models.

13 files · maint
49a82a0 · Mar 13

This commit **adds new documentation files**, `AGENTS.md` and `CLAUDE.md`, to establish project-specific rules and guidelines for **AI coding agents** like OpenCode and Claude Code contributing to the DeepSpeed codebase. This **maintenance and documentation update** aims to address common issues where AI-generated code violates DeepSpeed conventions, such as DCO sign-off, formatting (`yapf`/`flake8`), and code style. By providing explicit instructions on commit requirements, code change discipline, and tool caveats, this change will **improve the quality and consistency of AI-generated contributions** and streamline the review process for automated pull requests. The new files, located in the project's root, serve as a critical reference for automated development workflows.

2 files · maint
a41a96b · Mar 2

This commit **refactors** DeepSpeed's **Automatic Mixed Precision (AMP) integration** to align with current PyTorch best practices. It replaces the legacy `get_accelerator().amp()` method with the recommended `torch.amp` API, specifically impacting the **`deepspeed/runtime/zero/linear.py` module**. This **API migration** ensures **broader compatibility** with various PyTorch backends, such as XPU, which may not provide the older device-specific AMP modules. By adopting the standard `torch.amp`, DeepSpeed enhances its **maintainability and future-proofing** for mixed precision training.

11 files · maint
d8e15da · Mar 2

This commit performs a **major refactoring** to remove the dependency on `Intel Extension for PyTorch` (IPEX) for **XPU device support**, aligning DeepSpeed with native PyTorch 2.8+ XPU capabilities. It **updates the XPU accelerator logic**, `op_builder` implementations, and CI workflows to use stock PyTorch's builder protocol and SYCL compilation. This change simplifies the XPU stack, but users must upgrade to the latest PyTorch for XPU features, as DeepSpeed will no longer be compatible with previous PyTorch+IPEX setups on XPU devices. The **documentation** and **tests** have also been updated to reflect this new, streamlined approach.

16 files · maint
116dbe2 · Mar 1

This commit delivers a **bug fix** for the **DeepSpeed Muon optimizer** that addresses a `ValueError` encountered during **partial model training**. Previously, when only a subset of model parameters were trainable, the optimizer's internal parameter grouping logic could incorrectly include non-trainable parameters, leading to an empty tensor list being passed to `torch.cat`. The fix modifies the `_configure_optimizer` method in `deepspeed/runtime/engine.py` to ensure that only parameters explicitly requiring gradients are added to the optimizer's parameter groups. This prevents crashes and enhances the robustness of the **Muon optimizer** for advanced training scenarios, with new unit tests added to verify correct behavior.

2 files · maint
7f2f423 · Nov 26

This commit introduces a crucial **performance optimization** for the **DeepSpeed Muon optimizer** by relocating its **momentum buffer from CPU to GPU memory**. The core change in `deepspeed/runtime/zero/stage_1_and_2.py` modifies the `create_param_group` method to ensure this buffer is allocated directly on the device. This relocation significantly **reduces iteration time** for models leveraging the Muon optimizer, as evidenced by a 39% speedup (1500ms to 910ms) during finetuning of large models. Additionally, a new `compiler.py` utility is introduced and applied to key Muon functions like `zeropower_via_newtonschulz5` and `muon_update`, which could further enhance execution efficiency. This enhancement directly benefits **DeepSpeed ZeRO users** seeking faster training throughput.

4 files · grow
a83fd7b · Nov 18

This commit **updates the documentation** for **Automatic Tensor Parallelism (AutoTP)** by explicitly adding `qwen2.5` and `qwen3` to the list of supported model families. This is a **documentation update** and **maintenance** task, correcting an omission in the `docs/_tutorials/automatic-tensor-parallelism.md` tutorial. The change ensures that users consulting the DeepSpeed tutorials are accurately informed about the full range of models compatible with AutoTP, improving clarity and usability for **Qwen2.5 and Qwen3 users**.

1 file · maint
df59f20 · Nov 11

This commit introduces a **new capability** to the **DeepSpeed runtime engine**, enabling users to specify distinct learning rates for the Muon and Adam components of the `MuonWithAuxAdam` optimizer. Previously, these components shared a single learning rate, but now separate `muon_lr` and `adam_lr` parameters can be configured. This **feature enhancement** provides **finer-grained control** over the optimization process, potentially improving training stability and performance for models utilizing this specific optimizer. The change primarily affects the `deepspeed/runtime/engine.py` module.

1 file · grow
67b365a · Oct 22

This commit introduces a **configuration adjustment** to the **Muon optimizer**, preventing its application to specific neural network components. It **excludes embedding layers and language model head layers** (identified by `embed` and `lm_head` in parameter names) from using the Muon optimizer. The `set_optimizer_flags` function in `deepspeed/__init__.py` was modified to implement this exclusion. This **performance optimization** is based on empirical evidence suggesting these layers perform better without Muon, aiming to achieve **improved training stability and efficiency** for models utilizing this optimizer.

1 file · waste
2b68bbc · Oct 6

This commit **introduces a new blog post** that details a performance study on **ZenFlow and ZeRO offload with DeepSpeed CPU core binding**. This **documentation addition** provides valuable insights into optimizing DeepSpeed performance, specifically focusing on core binding techniques. It affects the **project's educational content** by adding `blogs/zenflow-corebinding/README.md` and updates the `docs/index.md` to integrate this new resource into the 'Latest News' section. This expands the available knowledge base for users interested in advanced DeepSpeed performance tuning.

2 files · maint
65322e1 · Oct 4

This commit **adds a Chinese version** of the **DeepSpeed SuperOffload blog post** to the project's documentation. Specifically, it introduces the `blogs/deepspeed-superoffload/README_cn.md` file, which provides a comprehensive overview of SuperOffload's features, working principles, and usage in Chinese. This **documentation update** significantly enhances the project's **internationalization efforts**, making crucial information about the SuperOffload module accessible to a broader, Chinese-speaking audience.

1 file · maint
66c7031 · Sep 28

This commit introduces a **bug fix** to **DeepSpeed's accelerator device handling** by updating how device identifiers are retrieved for tensor operations. Previously, using `get_accelerator().current_device()` could lead to failures when creating tensors on CPU devices, as it was primarily designed for CUDA. The change replaces this with `torch.device(get_accelerator().current_device_name())` to ensure **robust device compatibility** across various hardware. This **enhances the reliability of tensor allocation and operations** within `deepspeed/runtime/engine.py`, `deepspeed/runtime/utils.py`, and `deepspeed/runtime/zero/partitioned_param_coordinator.py`, enabling correct execution on both GPU and CPU environments.

3 files · waste
2585881 · Sep 17

This commit **enhances the usability of the Muon optimizer** by automating the configuration of its `use_muon` flags. It introduces a new function `set_optimizer_flags` within the **DeepSpeed initialization process** in `deepspeed/__init__.py`, integrating it into the `initialize` function. This **streamlines the DeepSpeed engine setup**, allowing users to enable the **Muon optimizer** simply by specifying it in `config.json`. The change **significantly simplifies the user experience**, eliminating the need for manual code modifications to `model.parameters()`.

2 files · grow
43537d0 · Sep 4

This commit introduces **automatic CPU core affinity management** for **DeepSpeed's ZenFlow optimizer workers**, resolving core binding issues (issue #7478). It **integrates ZenFlow's core binding** with DeepSpeed's existing `--bind_cores_to_rank` mechanism, dynamically splitting available CPU cores between the main PyTorch process and ZenFlow optimizer workers. A new configuration parameter, `pt_reserved_cores_perc` in `deepspeed/runtime/zenflow/zenflow_config.py`, allows users to specify the percentage of cores reserved for PyTorch threads. This **new capability** significantly **improves performance and resource utilization** for ZenFlow by ensuring optimal CPU resource allocation, primarily affecting the `ZenFlowZeroOptimizer` and `zenflow_optimizer_process` in `deepspeed/runtime/zenflow/zenflow_stage_1_and_2.py`.

2 files · grow
f03d416 · Aug 8

This commit **updates the ZeRO offload tutorial** (`docs/_tutorials/zero-offload.md`) to incorporate a crucial **performance optimization tip**. It advises users to utilize the `--bind_cores_to_rank` flag in the DeepSpeed launch command, particularly for workloads heavily relying on **CPUAdam** within the ZeRO offload configuration. This **documentation enhancement** aims to increase user awareness and guide them towards **improved performance** by optimizing CPU core binding, which can lead to approximately a **1.3x speedup** in average step time for CPU-intensive operations.

1 file · maint
80bc7b7 · May 19

This commit **enhances DeepSpeed's Automatic Tensor Parallelism (AutoTP)** by **enabling proper metadata loading for Qwen3 models**. It specifically updates the `deepspeed/module_inject/auto_tp.py` module to include `Qwen3RMSNorm` in the list of recognized module names within the `Loading` class. This **new capability** resolves a compatibility issue (referenced #7275), allowing users to successfully apply AutoTP to Qwen3 architectures without encountering errors related to unrecognized module types. Consequently, it improves the overall **model compatibility and usability** of DeepSpeed for a wider range of large language models.

1 file · grow
f459502 · May 19

This commit **rolls back** changes introduced in PR #6726 to **fix a critical bug** affecting the **DeepSpeed ZeRO Redundancy Optimizer**. Specifically, it reverts modifications to the **parameter offloading** logic in `deepspeed/runtime/zero/parameter_offload.py`, restoring the unconditional initialization of `ds_grads_remaining` within `_post_backward_module_hook`. Additionally, it undoes changes in `deepspeed/runtime/zero/partitioned_param_coordinator.py`, moving the initialization of `__n_available_params` back to the constructor and removing its reset in `release_all_live_params`. This **maintenance rollback** resolves an issue that caused incorrect state management and instability in ZeRO-enabled training, ensuring the correct functioning of partitioned parameter coordination.

3 files · waste

Work Patterns

Beta

Commit activity distribution by hour and day of week. Shows when this developer is most active.
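
As an illustration only, such an hour-by-weekday distribution could be bucketed from commit timestamps roughly as in the sketch below; this is not the product's actual aggregation logic.

Sketch (TypeScript):
// Illustrative bucketing of commit timestamps into a 7x24 activity grid.
// grid[day][hour] counts commits, with day 0 = Sunday (JavaScript Date.getDay()).
function commitHeatmap(commitDates: string[]): number[][] {
  const grid: number[][] = Array.from({ length: 7 }, () => new Array(24).fill(0));
  for (const iso of commitDates) {
    const d = new Date(iso);
    grid[d.getDay()][d.getHours()] += 1;
  }
  return grid;
}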

Collaboration

Beta

Developers who frequently work on the same files and symbols. Higher score means stronger code collaboration.
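
One simple way to derive such a score is to count the files two developers have both touched; the sketch below illustrates that idea only and is not the product's actual formula, which also considers shared symbols.

Sketch (TypeScript):
// Illustrative co-edit score: number of files two developers have both modified.
// filesByDev maps a developer email to the set of file paths they touched.
function collaborationScore(
  filesByDev: Map<string, Set<string>>,
  devA: string,
  devB: string,
): number {
  const a = filesByDev.get(devA) ?? new Set<string>();
  const b = filesByDev.get(devB) ?? new Set<string>();
  let shared = 0;
  for (const file of a) {
    if (b.has(file)) shared += 1;
  }
  return shared;
}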
