
Goodhart's Law in Action: Why Your Dev Metrics Are Being Gamed (And How to Fix It)

Discover why measuring individual developer performance with raw metrics leads to gaming behavior and erodes team trust. Learn how Keypup's MCP Server provides contextual benchmarking that accounts for mentoring, architecture work, and team contribution—transforming metrics from surveillance tools into culture-building insights that recognize real value beyond lines of code.

Liam Davis
22 min read
TL;DR: "When a measure becomes a target, it ceases to be a good measure." Goodhart's Law perfectly describes what happens when you use git analytics for individual performance reviews: developers game the system, split PRs artificially, inflate story points, and focus on metrics over value. Keypup's MCP Server fixes this by providing contextual benchmarking that recognizes mentoring, architecture work, and team contributions—transforming metrics from surveillance into culture-building insights.

The Metric Gaming Epidemic

You rolled out git analytics. You can finally see who's merging PRs, who's reviewing code, who's closing tickets. Data-driven performance management, right?

Three months later, your senior engineer submits 47 PRs in a sprint (last quarter's average: 8). Your junior dev's story point velocity triples overnight. Your architect, who used to mentor everyone, has gone two weeks without a single PR.

Congratulations: Your team is gaming your metrics.

This isn't malice—it's survival. When metrics become performance targets, rational engineers optimize for the target, not the outcome. Welcome to Goodhart's Law in action.

The Reddit Reality Check

u/senior_dev_burnout from r/ExperiencedDevs

"My manager started tracking 'PRs merged per sprint' after reading some productivity blog. Guess what happened? I now split every feature into 5-7 micro-PRs. Each one is technically mergeable but makes no sense in isolation. Code review became a joke because reviewers can't understand context across 6 PRs. Velocity looks great on the dashboard. Product quality tanked. But hey, my performance review was stellar because I 'increased output by 300%.'"

u/team_lead_nightmare from r/cscareerquestions

"We introduced story point tracking tied to bonuses. Within two sprints, every ticket mysteriously became 8 or 13 points. What used to be '2 points: add a validation field' became '8 points: implement comprehensive validation framework with full test coverage.' Same work, inflated estimates. Planning became a negotiation game instead of capacity planning. Nobody trusts the process anymore."

u/architect_invisible from r/programming

"I spend 60% of my time on architecture reviews, design docs, mentoring juniors, and unblocking other teams. My git stats look pathetic: 3 PRs last month, mostly documentation updates. Meanwhile, a dev who cranks out features (ignoring tech debt, skipping reviews, writing zero docs) has 5x my 'productivity metrics.' Guess who got the better performance rating? I'm looking for a new job."

u/gaming_the_system_101 from r/softwareengineering

"Pro tip I learned from a colleague: If your company tracks LOC (lines of code), write verbose code. Expand one-liners. Add comments for obvious things. Split imports across multiple lines. My 'productivity' doubled and management thinks I'm a rockstar. Actual value delivered? Same as before, just more noise. The game is the game."

Sound familiar? This isn't hypothetical—it's happening in thousands of engineering teams right now.

Understanding Goodhart's Law: Why Metrics Break When They Become Targets

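Goodhart's Law (named for economist Charles Goodhart, who first stated the idea in 1975; the popular wording below is anthropologist Marilyn Strathern's later paraphrase): "When a measure becomes a target, it ceases to be a good measure."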

Applied to software development:

  • As an indicator: "Our team merged 127 PRs this quarter" → Useful signal about activity
  • As a target: "Each developer must merge 12+ PRs per sprint or face review consequences" → Metric becomes meaningless, behavior gets distorted

The Gaming Playbook: How Developers Respond

When individual performance metrics become evaluation criteria, rational engineers employ these strategies:

1. PR Inflation: Death by a Thousand Micro-Commits

  • Split logical features into 5-10 tiny PRs
  • Each PR is "technically valid" but incomprehensible in isolation
  • Code review quality collapses (reviewers can't see the full picture)
  • Merge counts soar, actual throughput stays flat or decreases

2. Story Point Manipulation: Anchoring Bias as Strategy

  • Systematically overestimate every ticket during planning
  • What was "2 points" last quarter becomes "5 points" this quarter (same work)
  • Velocity metrics look impressive, predictability becomes impossible
  • Planning meetings turn into negotiation theater

3. Low-Hanging Fruit Prioritization: Avoiding Hard Problems

  • Cherry-pick simple, high-visibility tasks
  • Avoid complex architectural work (risky, time-consuming, hard to quantify)
  • Dodge debugging and firefighting (unpredictable, doesn't show in metrics)
  • Ignore tech debt (no immediate metric benefit)

4. Shadow Work Evasion: Making Invisible Work Stay Invisible

  • Stop mentoring juniors (no git commits)
  • Skip architecture reviews (no PR count)
  • Avoid pair programming (only one person gets commit credit)
  • Minimize documentation (not tracked as "output")
  • Refuse cross-team coordination (time-consuming, no personal metric gain)

5. Quality Sacrifice: The Race to the Bottom

  • Skip thorough testing (takes time, reduces PR throughput)
  • Rush code reviews to keep review count high
  • Ignore refactoring opportunities (doesn't count as "new work")
  • Write just enough code to close the ticket, not code that lasts

The Organizational Costs

The damage extends far beyond distorted metrics:

Trust Erosion:

  • Engineers perceive metrics as surveillance, not support
  • Team members compete instead of collaborate
  • Junior devs learn to game instead of grow
  • Psychological safety collapses

Quality Degradation:

  • Technical debt compounds (nobody wants to fix it—no metric credit)
  • Code review becomes rubber-stamping (reviewers are also optimizing for metrics)
  • Documentation disappears (takes time, doesn't boost stats)
  • Architecture coherence fragments (everyone optimizes locally)

Knowledge Silos:

  • Seniors stop mentoring (time spent mentoring = lower personal metrics)
  • Juniors don't pair program (both participants' individual stats suffer)
  • Cross-team collaboration dies (coordination overhead with no personal benefit)
  • Institutional knowledge stops flowing

Talent Flight:

  • Your best people leave (they're often the ones doing invisible high-value work)
  • Metric-optimizers stay (they've learned to thrive in the system)
  • New hires learn gaming behavior from day one
  • Culture becomes toxic

Why Individual Performance Metrics Fail: The Missing Context Problem

Raw individual metrics fail for one fundamental reason: Software development is a team sport embedded in a complex system.

The Context That Disappears

When you measure an individual in isolation, you lose:

1. Role Differentiation

  • Architect: 3 PRs/month, but each one unblocks 10 other engineers
  • Junior Dev: 15 PRs/month, but each needs 3 hours of mentoring and 2 rounds of rework
  • Senior Dev (mentoring): 8 PRs/month, but enables 2 juniors to deliver 30 PRs combined
  • DevOps Engineer: 6 PRs/month, but they're infrastructure changes that 50 engineers depend on

Who's more "productive"? Isolated metrics say junior dev. Reality says architect.

2. Work Type Complexity

Not all PRs are created equal:

  • Refactoring a legacy authentication system: 1 PR, 3 weeks, touches 200 files, prevents 6 months of future pain
  • Adding a button to the UI: 1 PR, 2 hours, 15 lines changed
  • Debugging a race condition in production: 1 PR, 4 days, changes 5 lines but requires deep system understanding

Counting PRs treats these as equivalent. They're not.
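
To make the gap concrete, here is a minimal Python sketch built on the three example PRs above. The `PullRequest` class, its field names, and the 8-hour day / 5-day week conversion are assumptions made for illustration; none of this is a Keypup API.

```python
from dataclasses import dataclass

HOURS_PER_DAY = 8   # assumption: length of one working day
DAYS_PER_WEEK = 5   # assumption: length of one working week


@dataclass
class PullRequest:
    title: str
    effort_hours: float


# Effort figures taken from the three examples above.
prs = [
    PullRequest("Refactor legacy authentication", 3 * DAYS_PER_WEEK * HOURS_PER_DAY),
    PullRequest("Add a button to the UI", 2),
    PullRequest("Fix production race condition", 4 * HOURS_PER_DAY),
]

pr_count = len(prs)
total_hours = sum(pr.effort_hours for pr in prs)
# Ratio between the largest and smallest PR by effort.
effort_spread = max(pr.effort_hours for pr in prs) / min(pr.effort_hours for pr in prs)

for pr in prs:
    share = pr.effort_hours / total_hours
    print(f"{pr.title}: 1 PR, {pr.effort_hours:.0f} h ({share:.0%} of effort)")
print(f"PR count: {pr_count}, effort spread: {effort_spread:.0f}x")
```

Each item adds exactly one to the PR count, yet under these assumptions the effort behind them differs by a factor of 60.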

3. Team Contribution Patterns

High-value activities that don't show in individual git stats:

  • Mentoring: Teaching juniors, pairing on complex problems, career guidance
  • Architecture: Design docs, technical RFCs, system modeling
  • Coordination: Cross-team planning, dependency management, stakeholder communication
  • Quality assurance: Thorough code reviews, testing strategy, refactoring advocacy
  • Incident response: On-call rotation, debugging production issues, post-mortems
  • Tooling improvements: Developer experience work, CI/CD optimization, documentation

These are essential. They're also invisible to "PRs per week" metrics.

4. Career Stage and Growth Mode

  • Ramping engineer (first 3 months): Low output is expected and healthy
  • Mentored junior (learning mode): Lower individual output, but growing capability
  • Knowledge transfer (pre-departure or role change): Deliberately focusing on documentation and mentoring
  • Innovation phase (R&D, prototyping): Lots of experimentation, little merged code

Measuring everyone with the same yardstick ignores career context.

The Right Way: Contextual Benchmarking with Keypup MCP

The solution isn't to abandon metrics—it's to use them contextually and collaboratively, not punitively.

Keypup MCP provides a different approach: Benchmark individuals within their team and role context, accounting for the full spectrum of contribution.

Principle 1: Individuals in Context, Not Isolation

Instead of: "Alice merged 4 PRs this sprint (team average: 9), therefore Alice is underperforming."

Use contextual analysis:

Contextual Individual Contribution Analysis

Analyze Alice's contribution to the team over the last sprint. Include: PR activity, code review participation, issue resolution, mentoring indicators (PR guidance, pair programming patterns), architecture work (design docs, RFCs), cross-team coordination (external PR reviews, dependency management), and on-call/incident response. Compare her contribution mix to the team's needs and her role expectations.

Figure: Contextual contribution analysis showing Alice's activity breakdown: 4 PRs merged, 23 code reviews (team-high thoroughness score), 8 pair programming sessions with junior devs, 2 architecture RFCs authored, on-call lead for 3 critical incidents, vs. team average contributions by activity type.

What you discover:

  • Alice merged 4 PRs (below team average)
  • But: Alice led 23 code reviews with an average 2.1x thoroughness score vs. team baseline
  • Alice pair-programmed with juniors on 8 occasions (the most on the team)
  • Alice authored 2 architecture RFCs that unblocked 3 other team members
  • Alice was on-call lead for 3 critical incidents (resolved in avg 34 minutes)

Conclusion: Alice's contribution profile is "senior tech lead"—high leverage, team-multiplier activities. Her value is higher than raw PR count suggests, not lower.

Principle 2: Comparative Analysis Across Similar Roles and Contexts

Don't compare:

  • Junior dev to senior architect
  • Feature team engineer to platform engineer
  • Developer in "growth/learning mode" to developer in "independent delivery mode"

Do compare:

  • Similar roles across teams (accounting for team dynamics)
  • Same person's trajectory over time (growth curve)
  • Role-adjusted benchmarks (expected patterns for that career stage)

Role-Adjusted Benchmarking Analysis

Compare the contribution patterns of all mid-level engineers (L3/L4 equivalent) across our engineering organization over the last quarter. Segment by: primary responsibilities (feature development, infrastructure, tooling), team composition (ratio of senior to junior members), and work complexity (legacy refactoring vs greenfield development). Identify patterns and outliers, accounting for role context.

Figure: Role-adjusted benchmarking matrix showing mid-level engineers grouped by context: Feature teams (15 engineers, avg 11 PRs/sprint, 14 reviews, 3% mentoring time), Platform teams (8 engineers, avg 6 PRs/sprint, 8 reviews, 12% mentoring time), Legacy teams (5 engineers, avg 4 PRs/sprint, 6 reviews, 22% debugging time). Each engineer is plotted within their context group with a contribution mix breakdown.

What you discover:

  • Feature teams: Higher PR velocity (11/sprint avg), lower mentoring time (3%), greenfield work
  • Platform teams: Lower PR velocity (6/sprint avg), higher cross-team coordination (12% of time), infrastructure complexity
  • Legacy refactoring teams: Lowest PR velocity (4/sprint), highest debugging time (22%), each PR touches 40+ files on average

Conclusion: Comparing a platform engineer's 6 PRs/sprint to a feature engineer's 11 PRs/sprint is meaningless. Context matters.
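
A rough sketch of what "role-adjusted" means in practice, using the cohort averages quoted above. The function name and the idea of expressing an engineer's output as a ratio of their own cohort's average are assumptions for illustration, not how Keypup MCP computes its benchmarks.

```python
# Average PRs per sprint by cohort, taken from the analysis above.
COHORT_AVG_PRS = {"feature": 11, "platform": 6, "legacy": 4}


def relative_to_cohort(prs_per_sprint: float, cohort: str) -> float:
    """Return an engineer's PR count as a ratio of the given cohort's average."""
    return prs_per_sprint / COHORT_AVG_PRS[cohort]


platform_engineer_prs = 6

# Fair comparison: against peers doing the same kind of work.
fair = relative_to_cohort(platform_engineer_prs, "platform")
# Misleading comparison: against a cohort with a different work profile.
unfair = relative_to_cohort(platform_engineer_prs, "feature")

print(f"vs platform peers: {fair:.2f}x, vs feature baseline: {unfair:.2f}x")
```

The same 6 PRs read as exactly on par with platform peers, but as roughly half the expected output when held against the feature-team baseline.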

Principle 3: Recognize and Quantify Invisible Work

Make shadow work visible:

Shadow Work Quantification Analysis

For each team member over the last quarter, quantify time spent on activities that don't produce direct git commits: code review thoroughness and frequency, mentoring patterns (pair programming sessions, PR guidance to juniors), architecture contributions (design doc authorship, RFC participation), incident response (on-call participation, debugging time), documentation (README updates, technical writing), and cross-team coordination (external reviews, dependency management). Compare individual patterns to team needs.

Figure: Shadow work dashboard showing a team of 10 engineers with a breakdown of visible work (PRs, commits) vs. invisible work (reviews, mentoring, architecture, incidents, docs, coordination). Highlights: Sarah (architect) 25% visible/75% invisible, Mark (senior) 40%/60% with heavy mentoring load, Emily (junior) 80%/20% (expected pattern), team average 55%/45%.

What you discover:

  • Sarah (architect): 25% time on own PRs, 75% on reviews/architecture/coordination → Expected senior/architect pattern
  • Mark (senior engineer): 40% own work, 35% mentoring juniors, 15% code reviews, 10% incidents → Team multiplier, high leverage
  • Emily (junior, 9 months tenure): 80% own work, 20% learning/pairing → Expected growth pattern
  • Team average: 55% visible work (git commits), 45% invisible work (reviews/mentoring/coordination)

Conclusion: Traditional metrics capture only 55% of actual team contribution. The other 45% is essential but invisible.
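
The visible/invisible split can be sketched in a few lines of Python. The bucket names and the rule that only a person's own work counts as "visible" are assumptions for this example; the percentages are the ones listed above.

```python
# Assumption: only time spent on a person's own PRs/commits shows up in git stats.
VISIBLE = {"own_work"}

# Time allocation in percent, taken from the breakdown above.
time_split = {
    "Sarah": {"own_work": 25, "reviews_architecture_coordination": 75},
    "Mark": {"own_work": 40, "mentoring": 35, "code_reviews": 15, "incidents": 10},
    "Emily": {"own_work": 80, "learning_pairing": 20},
}


def visible_share(split: dict[str, float]) -> float:
    """Percentage of total time spent on activities that appear in git stats."""
    total = sum(split.values())
    visible = sum(value for bucket, value in split.items() if bucket in VISIBLE)
    return 100 * visible / total


shares = {name: visible_share(split) for name, split in time_split.items()}
for name, share in shares.items():
    print(f"{name}: {share:.0f}% visible, {100 - share:.0f}% invisible")
```

A dashboard that counts only the `VISIBLE` bucket would see a quarter of Sarah's work and well under half of Mark's.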

Principle 4: Track Growth and Development, Not Just Output

Measure trajectory, not snapshots:

Individual Growth Trajectory Analysis

Analyze Emily's (junior engineer, hired 9 months ago) growth trajectory across: PR complexity (lines changed, files touched, system areas), code quality (review feedback trends, rework rates), independence level (questions asked, mentoring sessions needed), review participation (reviews given, review quality), and expanded scope (new repositories contributed to, cross-team work). Compare her 9-month journey to team expectations and typical junior engineer growth curves.

Figure: Growth trajectory dashboard for Emily showing her 9-month progression. Months 1-3 (onboarding): 2 PRs/month, 95% mentoring-assisted, 8 rework rounds avg. Months 4-6 (developing): 6 PRs/month, 60% mentoring-assisted, 3 rework rounds. Months 7-9 (emerging): 9 PRs/month, 30% mentoring-assisted, 1.2 rework rounds, started giving code reviews. The trajectory curve shows accelerating growth vs. the expected junior baseline.

What you discover:

  • Month 1-3 (onboarding): 2 PRs/month, heavy mentoring dependency, 8 review rounds average (lots of learning)
  • Month 4-6 (developing): 6 PRs/month, 60% mentoring-assisted, 3 review rounds average (gaining independence)
  • Month 7-9 (emerging confidence): 9 PRs/month, 30% mentoring-assisted, 1.2 review rounds, started reviewing others' code
  • Trajectory: Accelerating growth, on track to full independence by month 12

Conclusion: Emily is developing ahead of schedule. Her current "lower" absolute output is expected and healthy. Penalizing her for not matching senior engineers would be absurd.
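
A trajectory check like this one can be expressed as a comparison between consecutive phases rather than a single snapshot. The sketch below is illustrative only: the `Phase` class and `is_improving` rule are hypothetical, and the numbers are Emily's figures from above (the 95% value comes from the dashboard description).

```python
from dataclasses import dataclass


@dataclass
class Phase:
    label: str
    prs_per_month: float
    mentoring_assisted_pct: float
    review_rounds: float


phases = [
    Phase("Months 1-3", 2, 95, 8),
    Phase("Months 4-6", 6, 60, 3),
    Phase("Months 7-9", 9, 30, 1.2),
]


def is_improving(history: list[Phase]) -> bool:
    """True if output rises while mentoring dependency and review rounds fall, phase over phase."""
    return all(
        later.prs_per_month > earlier.prs_per_month
        and later.mentoring_assisted_pct < earlier.mentoring_assisted_pct
        and later.review_rounds < earlier.review_rounds
        for earlier, later in zip(history, history[1:])
    )


# Ratio of the latest phase's PR output to the onboarding phase's.
growth_factor = phases[-1].prs_per_month / phases[0].prs_per_month
print(is_improving(phases), f"{growth_factor:.1f}x PR output since onboarding")
```

The point of the design is that the check never compares Emily to anyone else; it only asks whether each phase is better than the one before.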

Principle 5: Detect Genuine Gaming Patterns (vs. Legitimate Variation)

Use metrics to identify gaming behavior, not punish legitimate patterns:

Metric Gaming Detection Analysis

Analyze our team's PR patterns over the last two quarters for indicators of metric gaming: sudden changes in PR size distribution (many micro-PRs where larger PRs were typical), story point inflation patterns (same work types consistently estimated higher), low-hanging fruit bias (avoiding complex issues after metrics tracking began), review quality degradation (review time vs PR complexity), and coordination avoidance (reduced pair programming, fewer cross-team reviews). Flag statistical anomalies and interview team members to understand context.

Figure: Gaming detection dashboard showing metrics before and after tracking was introduced. PR size distribution shifted from a normal curve (avg 247 lines) to bimodal with a spike at 20-50 lines (+340% micro-PRs). Story point inflation: same issue types now average 6.2 points vs. 3.8 points pre-tracking. Review time per PR decreased 45% despite complexity staying constant. Pair programming frequency dropped 62%. Statistical anomalies are flagged for investigation.

What you discover:

  • PR size distribution shifted: Pre-metrics: normal distribution around 247 lines. Post-metrics: bimodal with spike at 20-50 lines (+340% in micro-PRs)
  • Story point inflation: Same issue types (e.g., "add API endpoint") averaged 3.8 points Q1, now 6.2 points Q2 (+63% inflation)
  • Review time collapsed: Despite PR complexity staying constant, review time per PR dropped 45% (reviewers rushing to maintain metric targets)
  • Coordination reduced: Pair programming frequency down 62%, cross-team reviews down 38%

Conclusion: These are statistical signatures of gaming behavior. This isn't individual failure—it's a systemic response to misaligned incentives.
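
Two of these signals reduce to simple arithmetic. The sketch below recomputes the story point inflation from the averages above and shows one way to measure a shift toward micro-PRs. The PR size lists are made-up sample data and the 50-line cutoff is an assumption; neither comes from a real team or from Keypup.

```python
def percent_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return 100 * (after - before) / before


def micro_pr_share(pr_sizes: list[int], max_lines: int = 50) -> float:
    """Fraction of PRs with at most `max_lines` changed lines."""
    return sum(size <= max_lines for size in pr_sizes) / len(pr_sizes)


# Story point averages quoted above for the same issue types.
inflation = percent_change(3.8, 6.2)

# Hypothetical PR sizes (lines changed), invented purely to illustrate the check.
sizes_before = [180, 240, 310, 45, 260, 220, 275, 30, 250, 290]
sizes_after = [25, 40, 35, 230, 20, 45, 30, 260, 50, 38]

share_before = micro_pr_share(sizes_before)
share_after = micro_pr_share(sizes_after)

print(f"Story point inflation: {inflation:+.0f}%")
print(f"Micro-PR share: {share_before:.0%} -> {share_after:.0%}")
```

A jump like this in the micro-PR share is a prompt for a conversation about incentives, not proof that any one person is gaming.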

Principle 6: Celebrate Non-Coding Excellence

Recognize and reward activities that don't produce commits:

Non-Coding Excellence Recognition Analysis

Identify team members who excel in high-value, low-visibility activities over the last quarter: mentoring effectiveness (junior engineers' growth curves, pairing hours), architecture leadership (RFC authorship, design influence, technical decision quality), code review quality (thoroughness, bug detection, knowledge sharing), incident heroics (on-call response time, problem resolution, post-mortem quality), and culture building (team health contributions, cross-team collaboration, onboarding support). Quantify impact where possible.

Figure: Non-coding excellence dashboard highlighting: David (Mentoring Champion) paired with 3 juniors for 47 hours, their combined velocity increased 78%. Lisa (Architecture Leader) authored 6 RFCs, unblocked 12 engineers, prevented 2 months of rework. Chen (Review Quality Hero) caught 23 critical bugs in reviews, avg review depth 3.2x team baseline. Maria (Incident Commander) resolved 8 P0 incidents in avg 28 minutes, wrote exemplary post-mortems. Each profile shows impact metrics.

What you celebrate:

  • David (Mentoring Champion): Paired with 3 junior engineers for 47 hours; their combined velocity increased 78% and all three passed probation early
  • Lisa (Architecture Leader): Authored 6 technical RFCs, unblocked 12 engineers, prevented estimated 2 months of rework
  • Chen (Review Quality Hero): Caught 23 critical bugs during code review, average review depth 3.2x team baseline
  • Maria (Incident Commander): Led resolution of 8 P0 incidents with average response time 28 minutes, wrote exemplary post-mortems

Conclusion: These contributions are more valuable than raw PR counts for many roles. Recognizing them explicitly prevents talent flight.

Implementation: Building a Culture of Contextual Performance

Step 1: Commit to Context-First Evaluation (Leadership Buy-In)

The Pledge (from leadership to teams):

"We will never use isolated git metrics (PR counts, LOC, commits) as the sole or primary basis for individual performance evaluation, promotion decisions, or bonus calculations. We recognize that software development value comes from a complex mix of coding, mentoring, architecture, coordination, and quality work. We commit to evaluating contribution holistically and contextually."

This pledge must be:

  • Public: Announced to entire engineering organization
  • Written: Documented in performance review guidelines
  • Enforced: Managers who violate it are corrected
  • Repeated: Reinforced quarterly in all-hands and 1:1s

Without this pledge, any metric system will be perceived as surveillance and gamed accordingly.

Step 2: Define Role-Based Contribution Profiles

Create clear expectations for different roles:

Junior Engineer (L1-L2):

  • Primary focus: Learning and independent execution
  • Expected pattern: High mentoring dependency early, rapid growth curve, 70-80% time on own work
  • Success indicators: Trajectory (improving complexity, reducing rework), curiosity, collaboration

Mid-Level Engineer (L3-L4):

  • Primary focus: Independent delivery, emerging mentorship
  • Expected pattern: 60-70% own work, 20-30% reviews/mentoring, 10% coordination
  • Success indicators: Consistent quality, scope expansion, beginning to unblock others

Senior Engineer (L5-L6):

  • Primary focus: Team leverage, mentorship, technical leadership
  • Expected pattern: 40-50% own work, 30-40% mentoring/reviews, 20% architecture/coordination
  • Success indicators: Team velocity improvement, junior growth, architectural influence

Staff+ Engineer (L7+):

  • Primary focus: Organizational leverage, strategy, cross-team impact
  • Expected pattern: 20-30% own work, 70-80% leverage activities (architecture, mentoring, coordination)
  • Success indicators: Multi-team impact, strategic technical decisions, organizational culture building
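
One way to make these profiles usable is to write them down as data and check an observed time split against them. This is a minimal sketch, not a Keypup feature: the bucket names, the `out_of_range` helper, and the 5-point tolerance are all assumptions, and the ranges are transcribed from the profiles above.

```python
# Expected time allocation in percent as (low, high), per role.
ROLE_PROFILES = {
    "junior": {"own_work": (70, 80)},
    "mid": {"own_work": (60, 70), "reviews_mentoring": (20, 30), "coordination": (10, 10)},
    "senior": {
        "own_work": (40, 50),
        "reviews_mentoring": (30, 40),
        "architecture_coordination": (20, 20),
    },
    "staff_plus": {"own_work": (20, 30), "leverage": (70, 80)},
}


def out_of_range(role: str, observed: dict[str, float], tolerance: float = 5.0) -> dict[str, float]:
    """Return buckets outside the expected range (beyond tolerance), with the signed gap."""
    gaps = {}
    for bucket, (low, high) in ROLE_PROFILES[role].items():
        value = observed.get(bucket, 0.0)
        if value < low - tolerance:
            gaps[bucket] = value - low    # negative: less than expected
        elif value > high + tolerance:
            gaps[bucket] = value - high   # positive: more than expected
    return gaps


# A senior engineer spending most of their time on their own PRs.
heads_down = {"own_work": 75, "reviews_mentoring": 15, "architecture_coordination": 10}
print(out_of_range("senior", heads_down))
```

A non-empty result is a conversation starter for the review, for example "is this senior being pulled away from mentoring?", rather than a score.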

Step 3: Implement Collaborative Review Conversations

Performance reviews become collaborative reflection, not top-down judgment:

Quarterly Self-Reflection (engineer-driven):

The engineer answers with an MCP query:

Self-Reflection: My Contribution Pattern

Show me my contribution breakdown over the last quarter: coding work (PRs, complexity, scope), review participation (quantity, quality), mentoring activities (pairing, guidance, onboarding support), architecture contributions (RFCs, design input), incident work, and coordination. How does this align with my role expectations and career goals? What do I want to shift next quarter?

Manager Review (context-adding):

The manager runs a comparative analysis:

Manager Context: Team Contribution Alignment

Compare [Engineer]'s contribution pattern to role expectations, team needs, and similar roles across the organization. Identify: strengths (where they excel or exceed expectations), growth areas (skills to develop), misalignments (doing too much/too little of certain activities), and opportunities (where team needs their unique strengths).

Collaborative Conversation:

  • Engineer shares self-reflection insights
  • Manager adds organizational context and comparative data
  • Together, identify growth areas and alignment opportunities
  • Set goals for next quarter (not just "more PRs"—holistic contribution goals)

Step 4: Celebrate Visible and Invisible Contributions Equally

Public Recognition Patterns:

  • "Mentorship MVP": Quarterly award for highest-impact mentoring (measured by mentee growth)
  • "Architecture Hero": Recognition for RFCs and design work that unblocked teams
  • "Review Quality Champion": Award for thorough, educational code reviews
  • "Firefighter Award": Recognition for incident response excellence
  • "Cross-Team Collaborator": Highlight engineers who coordinate across silos

Make invisible work visible by celebrating it as much as shipped features.

Step 5: Monitor for Gaming Signals and Course-Correct

Use MCP to detect gaming behavior early:

Gaming Behavior Early Detection

Monitor for statistical anomalies that suggest gaming: sudden shifts in PR size distribution, story point inflation patterns, reduced review quality, decreased pair programming, avoidance of complex issues, and coordination reduction. Alert when patterns cross thresholds. Flag for investigation and team conversation, not punishment.

Response to Gaming Signals:

  1. Investigate: Is this gaming, or legitimate change? (New project, team composition shift, role change)
  2. Team conversation: If gaming is suspected, discuss it as a team rather than punishing individuals
  3. Incentive audit: What caused this response? What metric or pressure is driving behavior?
  4. Course-correct: Remove perverse incentives, clarify expectations, reaffirm context-first evaluation
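
The "alert when patterns cross thresholds" step could look something like the sketch below. The threshold values and signal names are hypothetical and would need tuning per team; the observed changes reuse figures from the gaming detection example earlier in the article.

```python
# Hypothetical alert thresholds: percent change versus a pre-tracking baseline.
THRESHOLDS = {
    "micro_pr_share": 100.0,          # flag if micro-PR share at least doubles
    "story_points_per_issue": 30.0,   # flag if estimates rise by 30% or more
    "review_time_per_pr": -30.0,      # flag if review time falls by 30% or more
    "pair_programming": -30.0,        # flag if pairing falls by 30% or more
}


def flag_signals(changes: dict[str, float]) -> list[str]:
    """Return the names of signals whose change crosses its configured threshold."""
    flagged = []
    for name, change in changes.items():
        limit = THRESHOLDS.get(name)
        if limit is None:
            continue  # no threshold configured for this signal
        # Positive limits flag increases; negative limits flag decreases.
        crossed = change >= limit if limit > 0 else change <= limit
        if crossed:
            flagged.append(name)
    return flagged


observed = {
    "story_points_per_issue": 12.0,
    "review_time_per_pr": -45.0,
    "pair_programming": -62.0,
}
for name in flag_signals(observed):
    print(f"Raise in team conversation: {name}")
```

The output is deliberately a list of signals, not a list of people, which keeps the follow-up focused on incentives rather than individuals.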

Step 6: Iterate Based on Team Feedback

Quarterly retrospective on metrics and performance process:

Questions for the team:

  • Do you feel metrics are used fairly and contextually?
  • Are you tempted to game any metrics? Which ones and why?
  • What valuable work do you do that feels unrecognized?
  • How can we better measure and recognize your contributions?

Treat your performance system as a product that needs user feedback and iteration.

Case Study: From Gaming to Culture Building

Company: MedTech SaaS (127 engineers)
Initial State: Individual metrics used for performance reviews, heavy gaming behavior
Timeline: 18-month transformation

Before (Month 0): The Gaming Era

Symptoms:

  • PRs/month used as performance proxy
  • Engineers splitting features into 5-10 micro-PRs
  • Senior engineers avoiding mentoring (hurt their individual stats)
  • Code review quality collapsed (everyone optimizing for speed)
  • Tech debt exploded (no one wanted to take on refactoring work)
  • Turnover spiked: 3 senior engineers quit in 6 months (all cited "metric obsession" in exit interviews)

Intervention (Month 1-3): The Pledge and Reset

Actions:

  1. Leadership pledge: CEO and VPE publicly committed to context-first evaluation
  2. Metric freeze: Stopped using individual PR counts in performance reviews
  3. Role profiles: Defined expected contribution patterns for each level
  4. MCP implementation: Deployed Keypup MCP for contextual analysis
  5. Manager training: Taught contextual evaluation with 20 hours of workshops

Transition (Month 4-9): Building New Habits

Changes:

  • Performance reviews became collaborative (self-reflection + manager context)
  • Started celebrating non-coding excellence (quarterly awards for mentoring, architecture, reviews)
  • Implemented gaming detection (caught 2 cases early, addressed with team conversations, not punishment)
  • Team retrospectives on metrics every quarter

6-Month Progress Check: Gaming Behavior Reduction

Compare team behaviors across Month 0-3 (before intervention) vs Month 4-9 (after intervention): PR size distribution, story point inflation, review quality metrics, pair programming frequency, mentoring activity, senior engineer retention, and team health survey scores. Quantify changes and identify patterns.

Transformation progress, before (Month 0-3) vs. after (Month 4-9):

  • PR size distribution: normalized from bimodal (gaming) back to a healthy curve
  • Story point inflation: reduced from +68% to +12% (residual variability)
  • Review quality (thoroughness, time invested): up 52%
  • Pair programming frequency: recovered from -62% to baseline +8%
  • Mentoring activity: up 94%
  • Senior engineer retention: improved (0 departures post-intervention)
  • Team health score: 4.2/10 → 7.1/10

Outcomes (Month 12-18): Culture Shift

Quantitative results:

  • PR gaming reduced 87%: Size distribution returned to normal curve
  • Story point inflation dropped 79%: Estimates became realistic again
  • Review quality improved 52%: Time invested in reviews increased, bug detection improved
  • Mentoring surged 94%: Senior engineers re-engaged with juniors
  • Turnover reversed: Zero senior engineer departures in 12 months, 4 new senior hires joined
  • Velocity paradox: Despite "lower" individual PR counts, team throughput increased 23% (less rework, better coordination, higher quality)

Qualitative results:

  • Team health survey score: 4.2/10 → 8.3/10
  • Engineers report feeling "trusted" and "recognized for real contributions"
  • Juniors develop faster (more mentoring available)
  • Cross-team collaboration improved (no longer penalized for coordination time)
  • Architectural coherence improved (staff engineers re-engaged in design)

CEO Quote (Month 18 all-hands):

"18 months ago, we were optimizing for metrics and losing our best people. Today, we optimize for impact and our best people are thriving. The shift from surveillance metrics to contextual contribution was the best cultural investment we've made."

The Paradox: Metrics Work When They're Not Weaponized

Here's the paradox: Metrics are incredibly valuable—when they're not used punitively.

Metrics as surveillance → Gaming, trust erosion, talent flight
Metrics as insight → Self-awareness, growth, team optimization

The difference is intent and implementation:

Don't:

  • ❌ Use git stats as the primary input to individual performance ratings
  • ❌ Compare individuals without role/context adjustment
  • ❌ Create bonuses or rankings based on PR counts
  • ❌ Penalize people for "low" output without investigating why
  • ❌ Ignore invisible work (mentoring, architecture, coordination)

Do:

  • ✅ Use metrics to understand team dynamics and identify bottlenecks
  • ✅ Benchmark individuals within role and context
  • ✅ Celebrate non-coding excellence explicitly and publicly
  • ✅ Make performance conversations collaborative, not top-down
  • ✅ Monitor for gaming signals and address incentive structures
  • ✅ Recognize that contribution is multifaceted and complex

Conclusion: From Goodhart's Trap to Growth Culture

Goodhart's Law isn't a reason to abandon metrics—it's a warning about how you use them.

The lesson: Never make a metric a target without considering how rational actors will respond.

When you tie individual performance to raw git stats:

  • Rational response: Optimize for the metric, not the outcome
  • Result: Gaming, quality degradation, trust collapse

When you use metrics contextually and collaboratively:

  • Rational response: Focus on genuine impact, knowing it will be recognized
  • Result: Better work, stronger teams, healthier culture

Keypup MCP doesn't eliminate the human challenge of performance management—it gives you the context to do it well. You can:

  • Recognize invisible work (mentoring, architecture, coordination)
  • Benchmark individuals fairly (role-adjusted, context-aware)
  • Detect gaming early (statistical anomalies, team health signals)
  • Celebrate multifaceted contribution (not just code output)
  • Build trust through transparency (engineers see the full picture too)

The goal isn't to measure everything—it's to recognize the right things and create a culture where contribution is valued holistically.

Your best engineers aren't necessarily the ones with the most PRs. They're the ones making everyone around them better.

Stop counting commits. Start recognizing impact.


Ready to Transform Your PM Practice?

Keypup MCP is available now for GitHub Copilot, Claude, and other AI assistants supporting the Model Context Protocol. Connect your Jira, Trello, and Git repositories in under an hour and start asking better questions.

Get Started: Keypup MCP Documentation
More About the Keypup MCP server: MCP Server page
Questions?: Talk to our PM team
