Arena turns merged code into per-commit scores and aggregates them into organization-level trends. This page explains the measurement system and how the quarterly report on public engineering teams was produced.
At a glance
Score per commit, not per developer
Every merged file change gets a work-type score; people and teams are aggregates, never direct inputs.
Three work types, one unit
Growth, Maintenance, and Fixes — each measured in ETV (Engineering Throughput Value). No single “performance score.”
Deterministic mechanics
An AI stage classifies work and traces bugs. A second, reproducible stage computes scores from context complexity and engagement.
Public code, five quarters, six orgs
The quarterly report covers public default branches of six organizations from Q1 2025 through Q1 2026, with bootstrap confidence intervals.
Performance is not a single metric. Arena scores a triple — Growth, Maintenance, Fixes — so that shipping a new feature, refactoring a module, and correcting a long-standing bug are counted as different kinds of output. The unit is ETV, and it flows from per-file decisions up to teams, repositories, and organizations.
ETV — Engineering Throughput Value
A dimensionless unit for engineering output. ETV is additive within a work type (Growth ETV sums to total Growth ETV), and deliberately not additive across types to preserve the distinction between new capability, upkeep, and repair.
The atomic unit is an individual file change within a merged commit on the default branch. Each file change receives three sub-scores, which roll up hierarchically: contributor → team → repository → organization.
Attribution follows the merge date, not the authorship date — a commit authored in March and merged in April is counted in April. The primary git author receives credit; co-authors are tracked in the knowledge graph but do not receive score credit.
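The attribution rule can be sketched as follows. This is an illustrative model, not Arena's actual data structures; `Commit` and its field names are placeholders.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Commit:
    author: str     # primary git author (after email alias resolution)
    authored: date  # when the change was written
    merged: date    # when it landed on the default branch

def attribution_month(commit: Commit) -> tuple[int, int]:
    """Commits count toward the month they were merged, not authored."""
    return (commit.merged.year, commit.merged.month)

# Authored in March, merged in April -> counted in April.
c = Commit(author="alice", authored=date(2025, 3, 28), merged=date(2025, 4, 2))
assert attribution_month(c) == (2025, 4)
```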
| Type | Definition | Conventional hint |
|---|---|---|
| Growth | Net-new functionality or capability | feat |
| Maintenance | Upkeep, refactors, cleanup, performance, tests, dependencies, docs, style, build, CI | chore, refactor, perf, test, style, build, ci, docs |
| Fixes | Corrects previous output — bug fixes and regressions | fix |
KTLO view
A two-bucket alternative combines Maintenance and Fixes into KTLO (Keep The Lights On), contrasted with Growth. Useful when the operational question is “how much is net-new versus keeping the system running?”
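The two-bucket view is a simple regrouping of the same ETV totals; the values below are made up for illustration.

```python
# Illustrative ETV totals per work type for one team (not real data).
etv = {"growth": 120.0, "maintenance": 80.0, "fixes": 40.0}

def ktlo_view(etv: dict[str, float]) -> dict[str, float]:
    """Two-bucket view: Maintenance and Fixes fold into KTLO, contrasted with Growth."""
    return {
        "growth": etv["growth"],
        "ktlo": etv["maintenance"] + etv["fixes"],
    }

print(ktlo_view(etv))  # {'growth': 120.0, 'ktlo': 120.0}
```

Because ETV is additive within a work type, this regrouping is lossless: the three-type view can always be collapsed into KTLO, but not recovered from it.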
The score reflects merged code. It does not capture code review depth, incident response, planning, mentorship, architectural decisions, or pair programming that does not result in standalone commits. Supporting roles surface indirectly through team-level outcomes (fewer bugs, cleaner architecture), not through individual commit scores.
Work on unmerged branches, unconnected repositories, or private forks is invisible to the measurement.
Scoring runs in two stages. An AI stage reads each diff in context, classifies work type per file, and traces the origin of bugs. A mechanical stage then computes a deterministic score from two foundational inputs, adjusts it through dampeners and a fix multiplier, and rolls the result up the aggregation hierarchy.
Stage 1 — AI classification
Reads commit message, full diff, and surrounding code. Produces per-file work type, identifies changed symbols, and for fixes traces the bug back to the commit and author that introduced it.
Stage 2 — Mechanical scoring
Deterministic algorithms compute complexity and engagement, apply dampeners and the fix multiplier, and aggregate. No LLM involvement at this stage — the same inputs always produce the same score.
base_score = f(context_complexity, engagement)
adjusted_score = base_score × dampeners × fix_multiplier
Conceptual shape of the per-file score. Exact coefficients are calibrated against a labeled corpus and applied identically across all customers.
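A minimal sketch of that conceptual shape in code. The combining function and every coefficient here are placeholders — the calibrated formulas are not public — but the structure (deterministic base score, multiplicative adjustments) follows the description above.

```python
def base_score(context_complexity: float, engagement: float) -> float:
    # Placeholder combination; the real f and its coefficients are
    # calibrated against a labeled corpus and are not published.
    return context_complexity * (1.0 + engagement)

def adjusted_score(context_complexity: float,
                   engagement: float,
                   dampeners: float,        # product of similarity, blame decay, copy decay
                   fix_multiplier: float    # 1.0 for non-fix work, > 1.0 for fixes
                   ) -> float:
    return base_score(context_complexity, engagement) * dampeners * fix_multiplier

# Determinism: the same inputs always produce the same score.
assert adjusted_score(3.0, 0.5, 0.9, 1.0) == adjusted_score(3.0, 0.5, 0.9, 1.0)
```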
Context complexity
Cognitive weight of modified lines — computed per function scope over added and modified lines. Same line count carries different weight depending on what those lines do. Consistent across languages and paradigms: React JSX is treated like Go.
Engagement
Surface area the developer had to understand to make the change safely. Within the file it tracks shared identifiers and data flow; across the repository it traces values through callers and callees. Bounded so one-line edits to universal helpers don’t dominate.
Dampeners
Reduce credit for mechanical or low-novelty work. Three independent factors apply before aggregation: similarity, blame decay, and copy decay.
Fix multiplier
Amplifies the score for repair work. Grows with the age of the bug, with context transfer (fix touches another author’s code), and with churn in the affected area since the bug was introduced.
| Factor | Effect |
|---|---|
| Similarity | Reduces credit when the change structure matches existing patterns — mechanical refactors and boilerplate. |
| Blame decay | Discounts overwriting very recent work by the same author over a short business-day window. |
| Copy decay | Reduces credit for literally duplicated lines copied from elsewhere in the repository. |
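The fix multiplier's behavior can be sketched as below. Only the monotone properties come from the description above — grows with bug age, with context transfer, and with churn; the specific curve, the `log1p` shape, and every constant are invented for illustration.

```python
import math

def fix_multiplier(bug_age_days: int, cross_author: bool, churn_commits: int) -> float:
    """Illustrative only: the real calibrated curve is not public.
    Grows with bug age, context transfer, and churn since introduction."""
    age_term = 1.0 + math.log1p(bug_age_days) / 10.0    # older bugs earn more
    transfer_term = 1.2 if cross_author else 1.0        # fixing someone else's code
    churn_term = 1.0 + min(churn_commits, 50) / 100.0   # bounded churn bonus
    return age_term * transfer_term * churn_term

# A year-old bug in another author's heavily-churned code earns more
# than a same-day fix of one's own fresh code.
assert fix_multiplier(365, True, 30) > fix_multiplier(1, False, 0)
```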
Analysis depth falls into two tiers: full structural analysis, and partial analysis (classification plus line-level signals).
Credit goes to the primary git author of the merged commit, after email alias resolution; co-authors remain visible in the knowledge graph but earn no score credit. A default list of bots (dependabot, renovate, github-actions, …) is filtered from aggregates, and admins can flag additional contributors as bots or excluded.
All platform aggregates — totals, per-developer averages, and headcount — are scoped to contributors tagged with role = SWE. Non-engineering roles (PM, design, QA, …) still produce commits in the knowledge graph but do not enter numerators or denominators of any reported figure, so per-SWE values reflect engineering output per software engineer.
At the organization level, contributors active in multiple repositories are counted once and their output is summed across repositories.
The quarterly report is a second layer on top of per-commit scoring. It defines the population (which organizations, which repositories, which contributors), the time windows, and the statistics — including bootstrap confidence intervals — that turn per-commit scores into cross-organization comparisons.
The measurement and the report are independent layers. The per-commit scoring engine produces ETV values for every merged file change. The report layer aggregates those values across a defined cohort and window, adds confidence intervals, and frames the results for cross-organization comparison.
This separation matters: the same scoring engine can feed any number of reports (weekly team reviews, quarterly organization reports, ad-hoc repository deep-dives) without changing the measurement itself.
Intervals around aggregate values are computed by the bootstrap method — resampling the underlying observations with replacement — rather than by assuming a parametric distribution. This is robust to skewed per-commit score distributions and to small cohorts, at the cost of being computationally heavier than a standard error formula.
The shaded bands on aggregate charts are 95% bootstrap confidence intervals.
Context boundary — the repository
The repository is the unit of aggregation for engagement. Context complexity is computed against local scopes; engagement against the repository’s functions and call sites. Cross-repository analysis is not performed — a change in Repo A carries no engagement weight from Repo B. Repositories are the natural abstraction boundary: coherent build, review process, and ownership.
The six organizations studied were not selected randomly. They were chosen because they publish identifiable engineering teams through public repositories at a scale that supports quarterly statistics — clearly attributable merged commits, enough distinct SWEs per quarter to bootstrap, and independent public repositories that can be extracted reliably.
The selection is documented in Appendix A of the report, which lists the repositories analyzed per organization.
The cohort for the report is the intersection of qualifying SWEs across every quarter of the reporting window. A qualifying SWE merged at least one scored commit in every quarter of the window. Contributors who appear in only some quarters are excluded from the cohort used for quarter-over-quarter comparison, but remain visible in per-quarter aggregates.
This fixed-period construction controls for composition changes — growth, attrition, reorganizations — so quarterly deltas reflect changes in output, not changes in who was counted.
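The cohort construction reduces to a set intersection across quarters; the contributor names and quarters below are illustrative.

```python
# Illustrative: SWEs who merged at least one scored commit, per quarter.
active_by_quarter = {
    "2025Q1": {"alice", "bob", "carol"},
    "2025Q2": {"alice", "bob", "dave"},
    "2025Q3": {"alice", "bob", "carol", "dave"},
}

def fixed_period_cohort(active_by_quarter: dict[str, set[str]]) -> set[str]:
    """Qualifying SWEs: active in *every* quarter of the window."""
    quarters = iter(active_by_quarter.values())
    cohort = set(next(quarters))
    for active in quarters:
        cohort &= active
    return cohort

print(sorted(fixed_period_cohort(active_by_quarter)))  # ['alice', 'bob']
```

Carol and dave each miss one quarter, so they drop out of the quarter-over-quarter cohort while still appearing in the per-quarter aggregates for the quarters they were active.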
Quarters follow the standard calendar definition (Q1 = Jan–Mar, Q2 = Apr–Jun, and so on). Commits are attributed to the quarter of their merge date. No per-organization fiscal calendar is applied.
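The quarter attribution is a direct function of the merge date:

```python
from datetime import date

def calendar_quarter(merge_date: date) -> str:
    """Standard calendar quarters: Q1 = Jan-Mar, Q2 = Apr-Jun, and so on."""
    q = (merge_date.month - 1) // 3 + 1
    return f"{merge_date.year}Q{q}"

assert calendar_quarter(date(2025, 3, 31)) == "2025Q1"
assert calendar_quarter(date(2025, 4, 1)) == "2025Q2"
```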
Every measurement system has boundaries. These are the boundaries for Arena’s performance scoring and for this report in particular — worth keeping in mind when interpreting numbers or comparing cohorts.
Calibration is global
Thresholds and coefficients — dampener sensitivities, engagement bounds, fix-multiplier curves — are calibrated against a labeled commit corpus and recalibrated periodically. The same calibration applies to every customer; there is no per-customer model training. Formulas are fixed and auditable.
| Term | Definition |
|---|---|
| ETV | Engineering Throughput Value — dimensionless unit for all performance metrics. Additive within a work type, not across types. |
| Context complexity | Cognitive weight of changed lines based on scope and control flow. Computed per function scope over added and modified lines. |
| Engagement | Surface area of the codebase a developer must understand to make a change safely. Combines within-file and within-repository data flow. |
| Fix multiplier | Amplification factor applied to fixes based on bug age, authorship transfer, and churn in the affected area. |
| Similarity dampener | Reduction factor applied when a change’s structure matches existing patterns — boilerplate and mechanical refactors. |
| Blame decay | Reduction factor for overwriting very recent work by the same author over a short business-day window. |
| Copy decay | Reduction factor for literally duplicated added lines from elsewhere in the repository. |
| KTLO | Keep The Lights On — two-bucket view combining Maintenance and Fixes into a single category, contrasted with Growth. |
| Bootstrap CI | A 95% confidence interval for an aggregate, computed by resampling underlying observations with replacement — no parametric distribution assumed. |
| Qualifying SWE | A contributor who merged at least one scored commit in every quarter of the reporting window; the basis of the fixed-period cohort. |
| Fixed-period cohort | The intersection of qualifying SWEs across every quarter of the reporting window — holds composition constant for quarter-over-quarter comparison. |