Developer
Ma, Guokai
guokai.ma@gmail.com
Performance
Key patterns and highlights from this developer's activity.
Breakdown of growth, maintenance, and fixes effort over time.
Bugs introduced vs. fixed over time.
Reclassifies engineering effort based on bug attribution. Commits that introduced bugs are retrospectively counted as poor investments.
Investment Quality reclassifies engineering effort based on bug attribution data. Commits identified as buggy origins (those that introduced bugs later fixed by someone) have their grow and maintenance time moved into the Wasted Time category. Time spent on fix commits still counts as productive, since fixing bugs is valuable work. All other commits retain their standard classification: grow is productive, maintenance is maintenance, and waste (fixes) is productive.
The standard model classifies commits as Growth, Maintenance, or Fixes. Investment Quality adds a quality lens: a commit that introduced a bug is retrospectively counted as a poor investment — the engineering time spent on it was wasted because it ultimately required additional fix work. Fix commits (Fixes in the standard model) are reframed as productive, because fixing bugs is valuable work.
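The reclassification rules above can be sketched in a few lines. This is a minimal illustration, not the dashboard's actual implementation; the `Commit` record shape and the category names (`grow`, `maintenance`, `waste`, matching the table's Effort column) are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass
class Commit:
    category: str         # "grow", "maintenance", or "waste" (a fix commit)
    hours: float
    introduced_bug: bool  # True if a later fix was attributed to this commit

def investment_quality(commits):
    """Reclassify effort per the Investment Quality model: buggy-origin
    grow/maintenance time becomes wasted; fix commits count as productive."""
    totals = {"productive": 0.0, "maintenance": 0.0, "wasted": 0.0}
    for c in commits:
        if c.category == "waste":      # fix commits are reframed as productive
            totals["productive"] += c.hours
        elif c.introduced_bug:         # buggy origin: grow/maintenance -> wasted
            totals["wasted"] += c.hours
        elif c.category == "grow":
            totals["productive"] += c.hours
        else:
            totals["maintenance"] += c.hours
    total = sum(totals.values()) or 1.0
    return {k: round(100 * v / total) for k, v in totals.items()}
```

For example, four commits of equal size where one grow commit later proved buggy would yield 60% productive, 20% maintenance, 20% wasted.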
Currently computed client-side from commit and bug attribution data. Ideal server-side endpoint:
POST /v1/organizations/{orgId}/investment-quality
Content-Type: application/json
Request:
{
"startTime": "2025-01-01T00:00:00Z",
"endTime": "2025-12-31T23:59:59Z",
"bucketSize": "BUCKET_SIZE_MONTH",
"groupBy": ["repository_id" | "deliverer_email"]
}
Response:
{
"productivePct": 74,
"maintenancePct": 18,
"wastedPct": 8,
"buckets": [
{
"bucketStart": "2025-01-01T00:00:00Z",
"productive": 4.2,
"maintenance": 1.8,
"wasted": 0.6
}
]
}
Latest analyzed commits from this developer.
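Until the server-side endpoint exists, a client can assemble the proposed request like this. This is a sketch against the endpoint as proposed above; the base URL, organization id, and the lack of an auth header here are placeholder assumptions, and any real HTTP client can send the returned URL and body.

```python
import json

def build_investment_quality_request(base_url, org_id, start, end,
                                     bucket_size="BUCKET_SIZE_MONTH",
                                     group_by=("deliverer_email",)):
    """Assemble the POST for the proposed investment-quality endpoint.

    Returns (url, json_body); group_by accepts "repository_id" and/or
    "deliverer_email" per the request schema above.
    """
    url = f"{base_url}/v1/organizations/{org_id}/investment-quality"
    body = json.dumps({
        "startTime": start,
        "endTime": end,
        "bucketSize": bucket_size,
        "groupBy": list(group_by),
    })
    return url, body
```

The response would then be parsed straight into the `productivePct` / `maintenancePct` / `wastedPct` totals and per-bucket series shown above.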
| Hash | Message | Date | Files | Effort |
|---|---|---|---|---|
| abb88ce | This commit **updates the documentation** within `AGENTS.md` and `CLAUDE.md` to provide more precise instructions for running `pre-commit` checks. Specifically, it clarifies the **CI requirements** by adding the explicit command `pre-commit run --files <changed_files>`, indicating that only modified files should be checked. This **documentation enhancement** improves developer guidance and streamlines the pre-commit process, preventing unnecessary full codebase scans and enhancing efficiency. | Mar 29 | 2 | maint |
| a240c4d | This commit introduces **HuggingFace `tp_plan` support** to **DeepSpeed AutoTP**, enabling automatic tensor parallelism for models that ship with a `base_model_tp_plan` like Llama or Qwen. It adds a new `TPPlanConverter` to translate HuggingFace's `colwise` and `rowwise` plans into DeepSpeed's `TPLayerSpec`, integrating this into the **AutoTP partitioning logic** within `engine.py`. This **new capability** significantly enhances **model compatibility** by allowing AutoTP to leverage built-in HuggingFace tensor parallelism configurations, reducing the need for manual `partition_config`. The system now prioritizes `partition_config` > HF `tp_plan` > AutoTP heuristics, streamlining the user experience for tensor parallelism with many popular models. | Mar 25 | 13 | maint |
| 49a82a0 | This commit **adds new documentation files**, `AGENTS.md` and `CLAUDE.md`, to establish project-specific rules and guidelines for **AI coding agents** like OpenCode and Claude Code contributing to the DeepSpeed codebase. This **maintenance and documentation update** aims to address common issues where AI-generated code violates DeepSpeed conventions, such as DCO sign-off, formatting (`yapf`/`flake8`), and code style. By providing explicit instructions on commit requirements, code change discipline, and tool caveats, this change will **improve the quality and consistency of AI-generated contributions** and streamline the review process for automated pull requests. The new files, located in the project's root, serve as a critical reference for automated development workflows. | Mar 13 | 2 | maint |
| a41a96b | This commit **refactors** DeepSpeed's **Automatic Mixed Precision (AMP) integration** to align with current PyTorch best practices. It replaces the legacy `get_accelerator().amp()` method with the recommended `torch.amp` API, specifically impacting the **`deepspeed/runtime/zero/linear.py` module**. This **API migration** ensures **broader compatibility** with various PyTorch backends, such as XPU, which may not provide the older device-specific AMP modules. By adopting the standard `torch.amp`, DeepSpeed enhances its **maintainability and future-proofing** for mixed precision training. | Mar 2 | 11 | maint |
| d8e15da | This commit performs a **major refactoring** to remove the dependency on `Intel Extension for PyTorch` (IPEX) for **XPU device support**, aligning DeepSpeed with native PyTorch 2.8+ XPU capabilities. It **updates the XPU accelerator logic**, `op_builder` implementations, and CI workflows to use stock PyTorch's builder protocol and SYCL compilation. This change simplifies the XPU stack, but users must upgrade to the latest PyTorch for XPU features, as DeepSpeed will no longer be compatible with previous PyTorch+IPEX setups on XPU devices. The **documentation** and **tests** have also been updated to reflect this new, streamlined approach. | Mar 2 | 16 | maint |
| 116dbe2 | This commit delivers a **bug fix** for the **DeepSpeed Muon optimizer** that addresses a `ValueError` encountered during **partial model training**. Previously, when only a subset of model parameters were trainable, the optimizer's internal parameter grouping logic could incorrectly include non-trainable parameters, leading to an empty tensor list being passed to `torch.cat`. The fix modifies the `_configure_optimizer` method in `deepspeed/runtime/engine.py` to ensure that only parameters explicitly requiring gradients are added to the optimizer's parameter groups. This prevents crashes and enhances the robustness of the **Muon optimizer** for advanced training scenarios, with new unit tests added to verify correct behavior. | Mar 1 | 2 | maint |
| 7f2f423 | This commit introduces a crucial **performance optimization** for the **DeepSpeed Muon optimizer** by relocating its **momentum buffer from CPU to GPU memory**. The core change in `deepspeed/runtime/zero/stage_1_and_2.py` modifies the `create_param_group` method to ensure this buffer is allocated directly on the device. This relocation significantly **reduces iteration time** for models leveraging the Muon optimizer, as evidenced by a 39% speedup (1500ms to 910ms) during finetuning of large models. Additionally, a new `compiler.py` utility is introduced and applied to key Muon functions like `zeropower_via_newtonschulz5` and `muon_update`, which could further enhance execution efficiency. This enhancement directly benefits **DeepSpeed ZeRO users** seeking faster training throughput. | Nov 26 | 4 | grow |
| a83fd7b | This commit **updates the documentation** for **Automatic Tensor Parallelism (AutoTP)** by explicitly adding `qwen2.5` and `qwen3` to the list of supported model families. This is a **documentation update** and **maintenance** task, correcting an omission in the `docs/_tutorials/automatic-tensor-parallelism.md` tutorial. The change ensures that users consulting the DeepSpeed tutorials are accurately informed about the full range of models compatible with AutoTP, improving clarity and usability for **Qwen2.5 and Qwen3 users**. | Nov 18 | 1 | maint |
| df59f20 | This commit introduces a **new capability** to the **DeepSpeed runtime engine**, enabling users to specify distinct learning rates for the Muon and Adam components of the `MuonWithAuxAdam` optimizer. Previously, these components shared a single learning rate, but now separate `muon_lr` and `adam_lr` parameters can be configured. This **feature enhancement** provides **finer-grained control** over the optimization process, potentially improving training stability and performance for models utilizing this specific optimizer. The change primarily affects the `deepspeed/runtime/engine.py` module. | Nov 11 | 1 | grow |
| 67b365a | This commit introduces a **configuration adjustment** to the **Muon optimizer**, preventing its application to specific neural network components. It **excludes embedding layers and language model head layers** (identified by `embed` and `lm_head` in parameter names) from using the Muon optimizer. The `set_optimizer_flags` function in `deepspeed/__init__.py` was modified to implement this exclusion. This **performance optimization** is based on empirical evidence suggesting these layers perform better without Muon, aiming to achieve **improved training stability and efficiency** for models utilizing this optimizer. | Oct 22 | 1 | waste |
| 2b68bbc | This commit **introduces a new blog post** that details a performance study on **ZenFlow and ZeRO offload with DeepSpeed CPU core binding**. This **documentation addition** provides valuable insights into optimizing DeepSpeed performance, specifically focusing on core binding techniques. It affects the **project's educational content** by adding `blogs/zenflow-corebinding/README.md` and updates the `docs/index.md` to integrate this new resource into the 'Latest News' section. This expands the available knowledge base for users interested in advanced DeepSpeed performance tuning. | Oct 6 | 2 | maint |
| 65322e1 | This commit **adds a Chinese version** of the **DeepSpeed SuperOffload blog post** to the project's documentation. Specifically, it introduces the `blogs/deepspeed-superoffload/README_cn.md` file, which provides a comprehensive overview of SuperOffload's features, working principles, and usage in Chinese. This **documentation update** significantly enhances the project's **internationalization efforts**, making crucial information about the SuperOffload module accessible to a broader, Chinese-speaking audience. | Oct 4 | 1 | maint |
| 66c7031 | This commit introduces a **bug fix** to **DeepSpeed's accelerator device handling** by updating how device identifiers are retrieved for tensor operations. Previously, using `get_accelerator().current_device()` could lead to failures when creating tensors on CPU devices, as it was primarily designed for CUDA. The change replaces this with `torch.device(get_accelerator().current_device_name())` to ensure **robust device compatibility** across various hardware. This **enhances the reliability of tensor allocation and operations** within `deepspeed/runtime/engine.py`, `deepspeed/runtime/utils.py`, and `deepspeed/runtime/zero/partitioned_param_coordinator.py`, enabling correct execution on both GPU and CPU environments. | Sep 28 | 3 | waste |
| 2585881 | This commit **enhances the usability of the Muon optimizer** by automating the configuration of its `use_muon` flags. It introduces a new function `set_optimizer_flags` within the **DeepSpeed initialization process** in `deepspeed/__init__.py`, integrating it into the `initialize` function. This **streamlines the DeepSpeed engine setup**, allowing users to enable the **Muon optimizer** simply by specifying it in `config.json`. The change **significantly simplifies the user experience**, eliminating the need for manual code modifications to `model.parameters()`. | Sep 17 | 2 | grow |
| 43537d0 | This commit introduces **automatic CPU core affinity management** for **DeepSpeed's ZenFlow optimizer workers**, resolving core binding issues (issue #7478). It **integrates ZenFlow's core binding** with DeepSpeed's existing `--bind_cores_to_rank` mechanism, dynamically splitting available CPU cores between the main PyTorch process and ZenFlow optimizer workers. A new configuration parameter, `pt_reserved_cores_perc` in `deepspeed/runtime/zenflow/zenflow_config.py`, allows users to specify the percentage of cores reserved for PyTorch threads. This **new capability** significantly **improves performance and resource utilization** for ZenFlow by ensuring optimal CPU resource allocation, primarily affecting the `ZenFlowZeroOptimizer` and `zenflow_optimizer_process` in `deepspeed/runtime/zenflow/zenflow_stage_1_and_2.py`. | Sep 4 | 2 | grow |
| f03d416 | This commit **updates the ZeRO offload tutorial** (`docs/_tutorials/zero-offload.md`) to incorporate a crucial **performance optimization tip**. It advises users to utilize the `--bind_cores_to_rank` flag in the DeepSpeed launch command, particularly for workloads heavily relying on **CPUAdam** within the ZeRO offload configuration. This **documentation enhancement** aims to increase user awareness and guide them towards **improved performance** by optimizing CPU core binding, which can lead to approximately a **1.3x speedup** in average step time for CPU-intensive operations. | Aug 8 | 1 | maint |
| 80bc7b7 | This commit **enhances DeepSpeed's Automatic Tensor Parallelism (AutoTP)** by **enabling proper metadata loading for Qwen3 models**. It specifically updates the `deepspeed/module_inject/auto_tp.py` module to include `Qwen3RMSNorm` in the list of recognized module names within the `Loading` class. This **new capability** resolves a compatibility issue (referenced #7275), allowing users to successfully apply AutoTP to Qwen3 architectures without encountering errors related to unrecognized module types. Consequently, it improves the overall **model compatibility and usability** of DeepSpeed for a wider range of large language models. | May 19 | 1 | grow |
| f459502 | This commit **rolls back** changes introduced in PR #6726 to **fix a critical bug** affecting the **DeepSpeed ZeRO Redundancy Optimizer**. Specifically, it reverts modifications to the **parameter offloading** logic in `deepspeed/runtime/zero/parameter_offload.py`, restoring the unconditional initialization of `ds_grads_remaining` within `_post_backward_module_hook`. Additionally, it undoes changes in `deepspeed/runtime/zero/partitioned_param_coordinator.py`, moving the initialization of `__n_available_params` back to the constructor and removing its reset in `release_all_live_params`. This **maintenance rollback** resolves an issue that caused incorrect state management and instability in ZeRO-enabled training, ensuring the correct functioning of partitioned parameter coordination. | May 19 | 3 | waste |
Commit activity distribution by hour and day of week. Shows when this developer is most active.
Developers who frequently work on the same files and symbols. A higher score indicates stronger code collaboration.