METHODOLOGY

Aggregation Theory Offline — Master Prompt

Purpose

This document is the single source of truth for the industry concentration research behind aggregationtheory.world. It defines the taxonomy, metrics, data sources, validation requirements, and update procedures. Any agent, tool, or human working on this project should follow this document.

1. Thesis

Industry concentration today maps closely to marginal cost structure. Industries sit on a spectrum from most to least concentrated across five categories:

1. Internet — zero marginal cost distribution; winner-take-most
2. Software — low marginal cost; moderate concentration with oligopoly tendencies
3. Cognitive Services — knowledge work constrained by labor; fragmented
4. Physical Services — physical labor + local delivery; highly fragmented
5. Manufacturing / Production — capital + labor + materials; bimodal (oligopolies in capital-intensive, fragmented in discrete/job-shop)

AI will compress labor and distribution constraints in categories 3–5, driving concentration dynamics that resemble what happened in categories 1–2. The data on this site tracks whether that shift is occurring.

2. Taxonomy

2.1 Five Top-Level Categories

Category	Description	Typical S1	Typical CR3
Internet	Digital platforms, zero marginal cost distribution	20–90%	50–100%
Software	Enterprise/consumer software by workflow	7–48%	17–76%
Cognitive Services	Professional services where the product is expertise	2–22%	5–48%
Physical Services	Field services, trades, facilities — labor + local delivery	1.5–22%	3–44%
Manufacturing / Production	Goods production — capital + labor + materials	0.5–62%	1–85%

2.2 Manufacturing Sub-Categories

Manufacturing / Production is subdivided into three structural types because concentration is driven by fundamentally different forces:

Sub-Category	Description	Concentration Driver	AI Impact
Capital-Intensive	High capex, IP, regulation barriers (autos, semis, aerospace, industrial gases, cement, beverages, EMS)	Supply-side scale economies	Low — already concentrated
Process	Continuous-flow operations, commodity/specialty mix (pharma, chemicals, refining, paper, plastics, glass)	Process efficiency, feedstock access	Moderate — yield optimization
Discrete / Job-Shop	Make-to-order, local/regional, labor-intensive operations (machine shops, metal fab, printing, food processing, building products, packaging, furniture, textiles, medical devices)	Same as Physical Services: quoting, scheduling, labor, local sales	HIGH — structurally identical to Physical Services

2.3 Market Selection Criteria

Each category should contain 8–12 markets that are:

Independently definable (clear boundary between this market and adjacent ones)
Measurable (denominator can be sourced from at least one authoritative source)
Representative of the category's concentration dynamics
Relevant to the thesis (especially for categories 3–5)

3. Metrics

3.1 Primary Metrics (always report)

S1 (Leader Share): Revenue or usage share of the #1 player in the market.

S1 = Revenue_of_#1_player / Total_market_revenue

Report as percentage to one decimal place.

CR3 (Concentration Ratio, Top 3): Combined share of the top 3 players.

CR3 = (Rev_#1 + Rev_#2 + Rev_#3) / Total_market_revenue
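
The two primary metrics can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the function names and the sample revenue figures are hypothetical, and both values are rounded to one decimal place as the S1 reporting rule asks (applying the same rounding to CR3 is an assumption).

```python
def leader_share(revenues, total_market_revenue):
    """S1: share of the largest player, as a percentage to one decimal place."""
    if total_market_revenue <= 0:
        raise ValueError("total_market_revenue must be positive")
    return round(100 * max(revenues) / total_market_revenue, 1)


def cr3(revenues, total_market_revenue):
    """CR3: combined share of the three largest players, as a percentage."""
    if total_market_revenue <= 0:
        raise ValueError("total_market_revenue must be positive")
    top3 = sorted(revenues, reverse=True)[:3]
    return round(100 * sum(top3) / total_market_revenue, 1)


# Hypothetical market, figures in $B (illustrative only, not from the dataset).
player_revenues = [30.0, 20.0, 10.0, 5.0]
total_market = 100.0

print(leader_share(player_revenues, total_market))  # 30.0
print(cr3(player_revenues, total_market))           # 60.0
```

Note that the denominator is passed in separately rather than summed from the listed players, since the listed players rarely cover the whole market.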

3.2 Supporting Metric

HHI (Herfindahl-Hirschman Index): Sum of squared market shares of all firms (or top 50).

HHI = Σ(share_i²)  where shares are in percentage points, not fractions (30% → 30² = 900)

DOJ thresholds (2010 Horizontal Merger Guidelines):
  <1,500   = unconcentrated
  1,500–2,500 = moderately concentrated
  >2,500   = highly concentrated
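
As a rough sketch, the HHI formula and the threshold bands above could be coded as follows. The function names are hypothetical, and treating exactly 1,500 and exactly 2,500 as "moderately concentrated" is an assumption read off the 1,500–2,500 band.

```python
def hhi(shares_pct):
    """Sum of squared market shares, with shares in percentage points."""
    return sum(share ** 2 for share in shares_pct)


def concentration_band(hhi_value):
    """Map an HHI value to the threshold bands listed above."""
    if hhi_value < 1500:
        return "unconcentrated"
    if hhi_value <= 2500:
        return "moderately concentrated"
    return "highly concentrated"


# Hypothetical shares for three listed firms; the remaining 40% is held by
# firms not listed here, so this figure understates the true HHI.
listed_shares = [30, 20, 10]
value = hhi(listed_shares)
print(value, concentration_band(value))  # 1400 unconcentrated
```

Because every omitted firm would add a positive squared term, an HHI computed from a truncated list (for example, the top 50) is a lower bound on the full-market figure.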

3.3 Denominator Rules (CRITICAL)

The denominator (total market size) is the most important and most debatable decision in any concentration calculation. Different denominators produce different shares for the same company.

Requirements:

Define the denominator explicitly for every market. Record it in the dataset (e.g., "US ad revenue share," "global query share," "US shipment value").
Validate against multiple sources. Every market size denominator MUST be cross-referenced against at least two independent sources where possible: an industry tracker as primary, and the Economic Census, BEA industry value-added, company 10-K TAM commentary, or trade association data as secondary.
Use the Census Bureau concentration tables (data.census.gov) as ground truth for manufacturing (NAICS 31-33) and services sectors where available.
Use revenue as the default denominator unless there is a strong reason to use another metric (query share for search engines, GMV for e-commerce, installed base for OS).
Geography: US only unless the market is inherently global (mobile OS, cloud infrastructure, semiconductor fab).
Do not mix denominators across markets within the same comparison chart.
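
The last rule lends itself to an automated guard before a chart is rendered. The sketch below assumes each market record carries a `denominator_basis` field (for example "revenue" or "usage"); that field name and the sample records are hypothetical and are not part of the published dataset schema.

```python
def chart_denominator_basis(markets):
    """Return the single denominator basis shared by all markets in a chart.

    Raises ValueError if the markets mix bases (e.g. revenue and usage share).
    """
    bases = {market["denominator_basis"] for market in markets}
    if len(bases) > 1:
        raise ValueError(f"chart mixes denominators: {sorted(bases)}")
    return bases.pop() if bases else None


# Hypothetical records for one comparison chart.
chart_markets = [
    {"market": "Market A", "denominator_basis": "revenue"},
    {"market": "Market B", "denominator_basis": "revenue"},
]
print(chart_denominator_basis(chart_markets))  # revenue
```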

4. Data Sources — Hierarchy and Validation

4.1 Source Priority by Category

Internet Markets: StatCounter (browser, search, OS usage share) → eMarketer (ad spend, e-commerce GMV) → Synergy Research (cloud infrastructure) → company 10-Ks / earnings for revenue → Nielsen, Antenna, Gridwise, Bloomberg Second Measure (streaming, ride-hailing, delivery).

Software Markets: IDC Semiannual Software Tracker (CRM, ERP, security, etc.) → Gartner Market Share reports → Canalys, Synergy (cloud, infrastructure) → CIMdata (CAD/PLM) → company 10-Ks for revenue numerators.

Cognitive Services: Am Law 100 / American Lawyer (legal) → Big 4 annual reports (accounting) → AM Best (insurance) → SIA (staffing) → ENR Top 500 (A&E firms) → Ad Age (agencies) → company filings for revenue numerators → Economic Census concentration tables for denominator validation.

Physical Services: IBISWorld industry reports → trade association lists (RC Top 100 roofing, SDM Top 100 fire/safety, etc.) → company filings (EMCOR, Rollins, ABM, Waste Management) → Economic Census (NAICS 238 — Specialty Trade Contractors) → BLS employment data for firm count validation.

Manufacturing / Production: Economic Census concentration tables (data.census.gov) — PRIMARY source for CR4/CR8/HHI at 6-digit NAICS → company 10-Ks for revenue numerators → IHS Markit / S&P Global (auto, chemicals) → IQVIA (pharma) → Euromonitor (beverages, food) → SIPRI (defense) → BEA industry value-added for denominator validation.

4.2 Cross-Validation Procedure

1. DENOMINATOR VALIDATION
   a. Identify primary source for total market size
   b. Cross-reference with at least one secondary source:
      - Economic Census (for manufacturing/services NAICS codes)
      - BEA value-added by industry (FRED series)
      - Trade association reported market size
      - Aggregated company filings (if top 10 players cover >50% of market)
   c. If primary and secondary disagree by >20%, investigate and note discrepancy
   d. Record both sources and chosen value with justification

2. NUMERATOR VALIDATION
   a. Use most recent fiscal year revenue from company 10-K or equivalent
   b. Adjust for geography (US vs global) using segment reporting or analyst estimates
   c. Adjust for scope (e.g., Palo Alto total revenue vs security-only revenue)
   d. Note any estimates or adjustments made

3. CENSUS BUREAU CHECK (for Manufacturing and Services)
   a. Look up NAICS code for the market
   b. Check data.census.gov for 2022 Economic Census concentration tables
   c. Compare our calculated CR3 with Census CR4 (expect our CR3 ≤ Census CR4)
   d. If significant discrepancy, investigate market definition differences
   e. Note Census CR4/CR8 values alongside our metrics
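
Two of the checks in the procedure above are mechanical enough to sketch in code: the 20% denominator discrepancy test (step 1c) and the CR3 versus Census CR4 sanity check (step 3c). The procedure does not say which value the 20% is measured against, so this sketch assumes the gap is taken relative to the primary source; the function names and sample figures are hypothetical.

```python
def denominator_discrepancy(primary, secondary):
    """Relative gap between two market-size estimates, measured against the primary."""
    if primary <= 0:
        raise ValueError("primary market size must be positive")
    return abs(primary - secondary) / primary


def needs_investigation(primary, secondary, threshold=0.20):
    """Step 1c: flag the market when the sources disagree by more than the threshold."""
    return denominator_discrepancy(primary, secondary) > threshold


def cr3_consistent_with_census(cr3_pct, census_cr4_pct):
    """Step 3c: with matching market definitions, top-3 share cannot exceed top-4 share."""
    return cr3_pct <= census_cr4_pct


# Hypothetical market sizes in $B.
print(needs_investigation(100.0, 75.0))        # True  (25% gap)
print(needs_investigation(100.0, 85.0))        # False (15% gap)
print(cr3_consistent_with_census(40.0, 45.0))  # True
```

A failed CR3 check does not by itself mean the data is wrong; as step 3d notes, it usually points to a difference in market definition that should be investigated and recorded.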

4.3 FRED / BEA Cross-Reference Series

For denominator validation, check these FRED/BEA series:

GDP by Industry (Value Added): BEA Table 1.3.5 — gross output by NAICS sector
Industrial Production indices by NAICS (FRED series IPG____S) — for manufacturing trend context
Annual Survey of Manufactures — shipment values by NAICS (more frequent than Economic Census)
Service Annual Survey — revenue by NAICS for services sectors

5. Time Series Requirements

5.1 Historical Data Collection

Every market in the dataset must have a time series going back as far as reliable data allows. Target time horizons:

Category	Target Start Year	Rationale
Internet	~2000 (or market inception)	Most markets didn't exist before this
Software	~1990 (or market inception)	Pre-cloud ERP/productivity era
Cognitive Services	~1980 (or earlier)	Big 8→Big 6→Big 5→Big 4 history
Physical Services	~1990 (Economic Census 1992)	Census provides baseline
Manufacturing	~1950 (Census of Manufactures)	Long history of concentration data

5.2 Composite Category Time Series

In addition to individual market time series, compute and display a composite CR3 average by category for each time period. Method: simple average of CR3 across all markets in the category for each year. When individual market data is missing for a year, interpolate linearly between adjacent known values.
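
A minimal sketch of this composite, assuming each market's history is stored as a mapping from year to CR3. Two choices here are assumptions rather than stated rules: years outside a market's known range are skipped instead of extrapolated, and the average is taken over the markets that have a value for that year. All names and figures are hypothetical.

```python
def interpolate(series, year):
    """CR3 for `year`, linearly interpolated between adjacent known years.

    Returns None when `year` falls outside the known range (no extrapolation).
    """
    if year in series:
        return series[year]
    earlier = [y for y in series if y < year]
    later = [y for y in series if y > year]
    if not earlier or not later:
        return None
    y0, y1 = max(earlier), min(later)
    v0, v1 = series[y0], series[y1]
    return v0 + (v1 - v0) * (year - y0) / (y1 - y0)


def composite_cr3(market_series, year):
    """Simple average of CR3 across the markets that have a value for `year`."""
    values = [interpolate(series, year) for series in market_series.values()]
    known = [v for v in values if v is not None]
    return sum(known) / len(known) if known else None


# Hypothetical category with two markets.
category = {
    "Market A": {2010: 40.0, 2020: 60.0},
    "Market B": {2015: 20.0},
}
print(composite_cr3(category, 2015))  # 35.0 (Market A interpolated to 50.0)
print(composite_cr3(category, 2010))  # 40.0 (Market B has no data yet)
```

One consequence of this design is that the composite can shift when a market enters the sample, independent of any change in concentration, so it may be worth flagging years where the number of contributing markets changes.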

6. Current Dataset

62 markets spanning 5 top-level categories with Manufacturing subdivided into 3 structural sub-types. Full market-level data is available on the Data page as CSV download.

7. Category Averages (Current)

Category	Markets	Avg S1	Avg CR3	Avg HHI	Level
Internet	10	52.9%	78.1%	3,771	Highly Concentrated
Software	10	22.3%	40.8%	807	Mildly Concentrated
Cognitive Services	9	9.2%	21.7%	326	Unconcentrated
Physical Services	10	7.0%	13.8%	223	Unconcentrated
Manufacturing (blended)	23	10.5%	24.6%	531	Mildly Concentrated
→ Capital-Intensive	7	24.3%	49.6%	1,450	Moderately Concentrated
→ Process	6	9.1%	22.2%	293	Unconcentrated
→ Discrete / Job-Shop	10	3.8%	9.3%	68	Highly Fragmented

8. Software Concentration Time Series Patterns

Four distinct patterns of software concentration have been identified:

Pattern 1: Declining S1 (market expanded faster than leader)

ERP (SAP): 25% → 13% (2010–2025)
Cloud Infra (AWS): 100% → 30% (2010–2025)
Productivity (Microsoft): 90% → 48% (2005–2025)
Mechanism: Market TAM grew 3–10x; new cloud-native entrants captured expansion
Key insight: S1 fell but CR3 often stayed stable — oligopoly formed among top 3

Pattern 2: Rising S1 (cloud-native winner aggregated)

CRM (Salesforce): 10% → 22% (2010–2025)
ITSM (ServiceNow): 2% → 35% (2010–2025)
Payments (Stripe): 0% → 20% (2010–2025)
Mechanism: Cloud-native player with superior distribution displaced on-prem incumbents
Key insight: This IS the Aggregation Theory pattern within software

Pattern 3: Stable oligopoly

CAD/PLM: Autodesk/Dassault/Siemens held ~56% CR3 for 15 years
Mechanism: High switching costs, deep workflow integration, long sales cycles

Pattern 4: Reconsolidating after disruption

Data/Analytics: CR3 30% → 18% → 29% (2010–2018–2025)
Mechanism: Open-source (Hadoop) disrupted incumbents; then Snowflake/Databricks re-concentrated
Key insight: Disruption temporarily fragments; new winners reconsolidate over ~8–10 years

10. Update Procedure

Annual Q2 refresh. Pull latest data from StatCounter, eMarketer, IDC Semiannual Software Tracker, Gartner, Big 4 annual reports, AM Best, SIA, ENR Top 500, Ad Age, IBISWorld, Economic Census, company 10-Ks, BEA value-added, BLS QCEW. Cross-validate any market with >10% TAM change. Update CR3 time series. Publish updated dataset with new vintage timestamp.
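
The ">10% TAM change" trigger can be sketched as a comparison between two dataset vintages. This assumes the change is measured relative to the previous vintage's market size; the function name is hypothetical, and while the previous sizes reuse the pest control and food processing figures quoted elsewhere in this document, the current-year values are invented for illustration.

```python
def markets_to_revalidate(previous_tam, current_tam, threshold=0.10):
    """Return markets whose total market size moved by more than the threshold."""
    flagged = []
    for market, old_size in previous_tam.items():
        new_size = current_tam.get(market)
        if new_size is None or old_size <= 0:
            continue  # nothing to compare; handle missing data separately
        if abs(new_size - old_size) / old_size > threshold:
            flagged.append(market)
    return flagged


# Market sizes in $B; the current-year values are hypothetical.
previous = {"Pest control": 25.0, "Food processing": 950.0}
current = {"Pest control": 28.0, "Food processing": 960.0}
print(markets_to_revalidate(previous, current))  # ['Pest control']
```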

13. Averaging Methodology

13.1 Simple Average (Default)

Each market is one equal observation. Formula: avg_CR3 = Σ(CR3_i) / n

Strengths: treats each industry as an equal data point about structural dynamics. Easy to explain. Weakness: tiny markets (pest control $25B) weigh the same as massive ones (food processing $950B).

13.2 Revenue-Weighted Average

Each market's CR3 is weighted by its total market size. Formula: weighted_CR3 = Σ(CR3_i × size_i) / Σ(size_i)

Markets without revenue denominators (browser usage share, OS installed base) are excluded from weighted calculations.
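
Both averaging methods can be computed from the same records, which is what a dashboard toggle would rely on. In this sketch a market without a revenue denominator is represented by `size` set to `None`; that representation, the field names, and the sample figures are all hypothetical.

```python
def simple_average_cr3(markets):
    """13.1: every market counts as one equal observation."""
    return sum(m["cr3"] for m in markets) / len(markets)


def weighted_average_cr3(markets):
    """13.2: weight each market's CR3 by its total market size.

    Markets without a revenue denominator (size is None) are excluded.
    """
    sized = [m for m in markets if m.get("size") is not None]
    total_size = sum(m["size"] for m in sized)
    if total_size <= 0:
        raise ValueError("no markets with a positive revenue denominator")
    return sum(m["cr3"] * m["size"] for m in sized) / total_size


# Hypothetical category: CR3 in percent, size in $B.
markets = [
    {"market": "Market A", "cr3": 50.0, "size": 100.0},
    {"market": "Market B", "cr3": 10.0, "size": 900.0},
    {"market": "Market C", "cr3": 90.0, "size": None},  # usage-share market
]
print(simple_average_cr3(markets))    # 50.0
print(weighted_average_cr3(markets))  # 14.0
```

The gap between the two results in this example shows why both views are worth exposing: the weighted figure is dominated by the largest market, and it also drops the usage-share market entirely.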

13.3 Display Requirement

The dashboard MUST provide a toggle between simple and weighted averages. Both values should be computable from the dataset. Summary cards and bar charts should update dynamically when the toggle is switched.

14. Source Attribution Requirements

Every market in the dataset must have a primary source and at least one secondary validation source. When a user hovers or clicks any market in any chart, the tooltip or detail page must show: market name, category, leader, S1, CR3, market size, primary source, and validation sources with their quality tier.
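
One way to enforce this requirement is a completeness check run before publishing. The sketch below assumes a dictionary record per market with the keys shown and a numeric `tier` of 1, 2, or 3 on each validation source; the key names and the sample record are hypothetical, not the published schema.

```python
REQUIRED_FIELDS = (
    "market", "category", "leader", "s1", "cr3",
    "market_size", "primary_source", "validation_sources",
)


def attribution_problems(record):
    """List what a market record is missing for the tooltip or detail page."""
    problems = [
        f"missing {field}"
        for field in REQUIRED_FIELDS
        if record.get(field) in (None, "", [])
    ]
    for source in record.get("validation_sources") or []:
        if source.get("tier") not in (1, 2, 3):
            problems.append(f"validation source {source.get('name')!r} has no quality tier")
    return problems


# Hypothetical record; every value here is a placeholder.
record = {
    "market": "Example Market",
    "category": "Physical Services",
    "leader": "Example Co",
    "s1": 7.0,
    "cr3": 13.8,
    "market_size": 25.0,
    "primary_source": "Example industry tracker",
    "validation_sources": [{"name": "Example government table", "tier": 1}],
}
print(attribution_problems(record))  # []
```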

15. Source Quality Tiers

Every source used in this dataset is classified into one of three quality tiers. The tiering follows the source-priority rules in §4.1 and the cross-validation procedure in §4.2.

Source Quality Hierarchy

Every market has a primary source plus 1–3 validation sources. Revenue numerators come from SEC 10-K filings; denominators are validated against the Economic Census where available.

TIER 1 · Authoritative anchors · Government data + SEC filings

US Economic Census (data.census.gov) · SEC EDGAR / Company 10-Ks · BEA Industry Value-Added (FRED) · BLS QCEW · USGS Mineral Commodity Summaries · EIA Refinery Capacity Report · FDA device listings

TIER 2 · Leading commercial trackers · Industry-standard paid trackers

IDC Semiannual Software Tracker · Gartner Market Share & Magic Quadrant · IBISWorld NAICS reports · StatCounter Global Stats · eMarketer / Insider Intelligence · Synergy Research Group

TIER 3 · Category specialists · Best-in-class niche authorities

SIPRI (aerospace & defense) · IQVIA (pharma) · TrendForce (semiconductors) · CIMdata (CAD/PLM) · Nilson Report (payments) · Am Law 100 (legal) · ENR Top 500 (A&E) · AM Best (insurance) · SIA (staffing) · Ad Age (agencies) · Barron's / Cerulli (wealth) · Evaluate MedTech · Nielsen Gauge (streaming) · Gridwise (ride-hail) · Bloomberg Second Measure (delivery) · RC Top 100 (roofing) · SDM Top 100 (fire/safety) · Big 4 annual reports · ALM Intelligence (consulting)

Data vintage: Q2 2025 research compilation. Labels updated 2026-Q2. Tier 1 sources refreshed from FY2025 SEC filings where available.

References

Federal Reserve FEDS Note (Feb 2023): "A Note on Industry Concentration Measurement" — validates Economic Census over Compustat for concentration measurement.
Kulick (2022), US Chamber: "Industrial Concentration in the United States: 2002–2017" — comprehensive CR4 trends across all NAICS sectors.
Gutiérrez & Philippon (2018), NBER: "From Good to Bad Concentration?" — national vs local concentration trends.
Autor et al. (2020): Rising concentration trends 1992–2012 using Economic Census.
2022 Economic Census: CR4/CR8/CR20/CR50 tables released April 2025.