Arena turns merged code into per-commit scores and aggregates them into organization-level trends. This page explains the measurement system and how the quarterly report on public engineering teams was produced.
At a glance
Score per commit, not per developer
Every merged file change gets a work-type score; people and teams are aggregates, never direct inputs.
Three work types, one unit
Growth, Maintenance, and Fixes — each measured in ETV (Engineering Throughput Value). No single “performance score.”
Deterministic mechanics
An AI stage classifies work and traces bugs. A second, reproducible stage computes scores from context complexity and engagement.
Public code, five quarters, six orgs
The quarterly report covers public default branches of six organizations from Q1 2025 through Q1 2026, with bootstrap confidence intervals.
Performance is not a single metric. Arena scores a triple — Growth, Maintenance, Fixes — so that shipping a new feature, refactoring a module, and correcting a long-standing bug are counted as different kinds of output. The unit is ETV, and it flows from per-file decisions up to teams, repositories, and organizations.
ETV — Engineering Throughput Value
A dimensionless unit for engineering output. ETV is additive within a work type (Growth ETV sums to total Growth ETV), and deliberately not additive across types to preserve the distinction between new capability, upkeep, and repair.
The atomic unit is an individual file change within a merged commit on the default branch. Each file change receives three sub-scores, which roll up hierarchically: contributor → team → repository → organization.
Attribution follows the merge date, not the authorship date — a commit authored in March and merged in April is counted in April. The primary git author receives credit; co-authors are tracked in the knowledge graph but do not receive score credit.
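The attribution rule can be sketched as follows. This is an illustrative model, not Arena's actual data structures; `Commit` and its field names are placeholders.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Commit:
    author: str     # primary git author (after email alias resolution)
    authored: date  # when the change was written
    merged: date    # when it landed on the default branch

def attribution_month(commit: Commit) -> tuple[int, int]:
    """Commits count toward the month they were merged, not authored."""
    return (commit.merged.year, commit.merged.month)

# Authored in March, merged in April -> counted in April.
c = Commit(author="alice", authored=date(2025, 3, 28), merged=date(2025, 4, 2))
assert attribution_month(c) == (2025, 4)
```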
| Type | Definition | Conventional hint |
|---|---|---|
| Growth | Net-new functionality or capability | feat |
| Maintenance | Upkeep, refactors, cleanup, performance, tests, dependencies, docs, style, build, CI | chore, refactor, perf, test, style, build, ci, docs |
| Fixes | Corrects previous output — bug fixes and regressions | fix |
KTLO view
A two-bucket alternative combines Maintenance and Fixes into KTLO (Keep The Lights On), contrasted with Growth. Useful when the operational question is “how much is net-new versus keeping the system running?”
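The two-bucket view is a simple regrouping of the same ETV totals; the values below are made up for illustration.

```python
# Illustrative ETV totals per work type for one team (not real data).
etv = {"growth": 120.0, "maintenance": 80.0, "fixes": 40.0}

def ktlo_view(etv: dict[str, float]) -> dict[str, float]:
    """Two-bucket view: Maintenance and Fixes fold into KTLO, contrasted with Growth."""
    return {
        "growth": etv["growth"],
        "ktlo": etv["maintenance"] + etv["fixes"],
    }

print(ktlo_view(etv))  # {'growth': 120.0, 'ktlo': 120.0}
```

Because ETV is additive within a work type, this regrouping is lossless: the three-type view can always be collapsed into KTLO, but not recovered from it.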
The score reflects merged code. It does not capture code review depth, incident response, planning, mentorship, architectural decisions, or pair programming that does not result in standalone commits. Supporting roles surface indirectly through team-level outcomes (fewer bugs, cleaner architecture), not through individual commit scores.
Work on unmerged branches, unconnected repositories, or private forks is invisible to the measurement.
Scoring runs in two stages. An AI stage reads each diff in context, classifies work type per file, and traces the origin of bugs. A mechanical stage then computes a deterministic score from two foundational inputs, adjusts it through dampeners and a fix multiplier, and rolls the result up the aggregation hierarchy.
Stage 1 — AI classification
Reads commit message, full diff, and surrounding code. Produces per-file work type, identifies changed symbols, and for fixes traces the bug back to the commit and author that introduced it.
Stage 2 — Mechanical scoring
Deterministic algorithms compute complexity and engagement, apply dampeners and the fix multiplier, and aggregate. No LLM involvement at this stage — the same inputs always produce the same score.
base_score = f(context_complexity, engagement)
adjusted_score = base_score × dampeners × fix_multiplier
Conceptual shape of the per-file score. Exact coefficients are calibrated against a labeled corpus and applied identically across all customers.
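A minimal sketch of that conceptual shape in code. The combining function and every coefficient here are placeholders — the calibrated formulas are not public — but the structure (deterministic base score, multiplicative adjustments) follows the description above.

```python
def base_score(context_complexity: float, engagement: float) -> float:
    # Placeholder combination; the real f and its coefficients are
    # calibrated against a labeled corpus and are not published.
    return context_complexity * (1.0 + engagement)

def adjusted_score(context_complexity: float,
                   engagement: float,
                   dampeners: float,        # product of similarity, blame decay, copy decay
                   fix_multiplier: float    # 1.0 for non-fix work, > 1.0 for fixes
                   ) -> float:
    return base_score(context_complexity, engagement) * dampeners * fix_multiplier

# Determinism: the same inputs always produce the same score.
assert adjusted_score(3.0, 0.5, 0.9, 1.0) == adjusted_score(3.0, 0.5, 0.9, 1.0)
```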
Context complexity
Cognitive weight of modified lines — computed per function scope over added and modified lines. Same line count carries different weight depending on what those lines do. Consistent across languages and paradigms: React JSX is treated like Go.
Engagement
Surface area the developer had to understand to make the change safely. Within the file it tracks shared identifiers and data flow; across the repository it traces values through callers and callees. Bounded so one-line edits to universal helpers don’t dominate.
Dampeners
Reduce credit for mechanical or low-novelty work. Three independent factors apply before aggregation: similarity, blame decay, and copy decay.
Fix multiplier
Amplifies the score for repair work. Grows with the age of the bug, with context transfer (fix touches another author’s code), and with churn in the affected area since the bug was introduced.
| Factor | Effect |
|---|---|
| Similarity | Reduces credit when the change structure matches existing patterns — mechanical refactors and boilerplate. |
| Blame decay | Discounts overwriting very recent work by the same author over a short business-day window. |
| Copy decay | Reduces credit for literally duplicated lines copied from elsewhere in the repository. |
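The fix multiplier's behavior can be sketched as below. Only the monotone properties come from the description above — grows with bug age, with context transfer, and with churn; the specific curve, the `log1p` shape, and every constant are invented for illustration.

```python
import math

def fix_multiplier(bug_age_days: int, cross_author: bool, churn_commits: int) -> float:
    """Illustrative only: the real calibrated curve is not public.
    Grows with bug age, context transfer, and churn since introduction."""
    age_term = 1.0 + math.log1p(bug_age_days) / 10.0    # older bugs earn more
    transfer_term = 1.2 if cross_author else 1.0        # fixing someone else's code
    churn_term = 1.0 + min(churn_commits, 50) / 100.0   # bounded churn bonus
    return age_term * transfer_term * churn_term

# A year-old bug in another author's heavily-churned code earns more
# than a same-day fix of one's own fresh code.
assert fix_multiplier(365, True, 30) > fix_multiplier(1, False, 0)
```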
Analysis depth falls into two tiers: full structural analysis, and partial analysis (classification plus line-level signals).
Credit goes to the primary git author of the merged commit, after email alias resolution; co-authors remain visible in the knowledge graph but earn no score credit. A default list of bots (dependabot, renovate, github-actions, …) is filtered from aggregates, and admins can flag additional contributors as bots or excluded.
All platform aggregates — totals, per-developer averages, and headcount — are scoped to contributors tagged with role = SWE. Non-engineering roles (PM, design, QA, …) still produce commits in the knowledge graph but do not enter numerators or denominators of any reported figure, so per-SWE values reflect engineering output per software engineer.
At the organization level, contributors active in multiple repositories are counted once and their output is summed across repositories.
The quarterly report is a second layer on top of per-commit scoring. It defines the population (which organizations, which repositories, which contributors), the time windows, and the statistics — including bootstrap confidence intervals — that turn per-commit scores into cross-organization comparisons.
The measurement and the report are independent layers. The per-commit scoring engine produces ETV values for every merged file change. The report layer aggregates those values across a defined cohort and window, adds confidence intervals, and frames the results for cross-organization comparison.
This separation matters: the same scoring engine can feed any number of reports (weekly team reviews, quarterly organization reports, ad-hoc repository deep-dives) without changing the measurement itself.
Intervals around aggregate values are computed by the bootstrap method — resampling the underlying observations with replacement — rather than by assuming a parametric distribution. This is robust to skewed per-commit score distributions and to small cohorts, at the cost of being computationally heavier than a standard error formula.
The shaded bands on aggregate charts are 95% bootstrap confidence intervals.
Context boundary — the repository
The repository is the unit of aggregation for engagement. Context complexity is computed against local scopes; engagement against the repository’s functions and call sites. Cross-repository analysis is not performed — a change in Repo A carries no engagement weight from Repo B. Repositories are the natural abstraction boundary: coherent build, review process, and ownership.
The six organizations studied were not selected randomly. They were chosen because they publish identifiable engineering teams through public repositories at a scale that supports quarterly statistics — clearly attributable merged commits, enough distinct SWEs per quarter to bootstrap, and independent public repositories that can be extracted reliably.
The selection is documented in Appendix A of the report, which lists the repositories analyzed per organization.
The cohort for the report is the intersection of qualifying SWEs across every quarter of the reporting window. A qualifying SWE merged at least one scored commit in every quarter of the window. Contributors who appear in only some quarters are excluded from the cohort used for quarter-over-quarter comparison, but remain visible in per-quarter aggregates.
This fixed-period construction controls for composition changes — growth, attrition, reorganizations — so quarterly deltas reflect changes in output, not changes in who was counted.
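The cohort construction reduces to a set intersection across quarters; the contributor names and quarters below are illustrative.

```python
# Illustrative: SWEs who merged at least one scored commit, per quarter.
active_by_quarter = {
    "2025Q1": {"alice", "bob", "carol"},
    "2025Q2": {"alice", "bob", "dave"},
    "2025Q3": {"alice", "bob", "carol", "dave"},
}

def fixed_period_cohort(active_by_quarter: dict[str, set[str]]) -> set[str]:
    """Qualifying SWEs: active in *every* quarter of the window."""
    quarters = iter(active_by_quarter.values())
    cohort = set(next(quarters))
    for active in quarters:
        cohort &= active
    return cohort

print(sorted(fixed_period_cohort(active_by_quarter)))  # ['alice', 'bob']
```

Carol and dave each miss one quarter, so they drop out of the quarter-over-quarter cohort while still appearing in the per-quarter aggregates for the quarters they were active.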
Quarters follow the standard calendar definition (Q1 = Jan–Mar, Q2 = Apr–Jun, and so on). Commits are attributed to the quarter of their merge date. No per-organization fiscal calendar is applied.
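The quarter attribution is a direct function of the merge date:

```python
from datetime import date

def calendar_quarter(merge_date: date) -> str:
    """Standard calendar quarters: Q1 = Jan-Mar, Q2 = Apr-Jun, and so on."""
    q = (merge_date.month - 1) // 3 + 1
    return f"{merge_date.year}Q{q}"

assert calendar_quarter(date(2025, 3, 31)) == "2025Q1"
assert calendar_quarter(date(2025, 4, 1)) == "2025Q2"
```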
Every measurement system has boundaries. These are the boundaries for Arena’s performance scoring and for this report in particular — worth keeping in mind when interpreting numbers or comparing cohorts.
Calibration is global
Thresholds and coefficients — dampener sensitivities, engagement bounds, fix-multiplier curves — are calibrated against a labeled commit corpus and recalibrated periodically. The same calibration applies to every customer; there is no per-customer model training. Formulas are fixed and auditable.
| Term | Definition |
|---|---|
| ETV | Engineering Throughput Value — dimensionless unit for all performance metrics. Additive within a work type, not across types. |
| Context complexity | Cognitive weight of changed lines based on scope and control flow. Computed per function scope over added and modified lines. |
| Engagement | Surface area of the codebase a developer must understand to make a change safely. Combines within-file and within-repository data flow. |
| Fix multiplier | Amplification factor applied to fixes based on bug age, authorship transfer, and churn in the affected area. |
| Similarity dampener | Reduction factor applied when a change’s structure matches existing patterns — boilerplate and mechanical refactors. |
| Blame decay | Reduction factor for overwriting very recent work by the same author over a short business-day window. |
| Copy decay | Reduction factor for literally duplicated added lines from elsewhere in the repository. |
| KTLO | Keep The Lights On — two-bucket view combining Maintenance and Fixes into a single category, contrasted with Growth. |
| Bootstrap CI | A 95% confidence interval for an aggregate, computed by resampling underlying observations with replacement — no parametric distribution assumed. |
| Qualifying SWE | A contributor who merged at least one scored commit in every quarter of the reporting window; the basis of the fixed-period cohort. |
| Fixed-period cohort | The intersection of qualifying SWEs across every quarter of the reporting window — holds composition constant for quarter-over-quarter comparison. |