Aggregation Theory Offline — Master Prompt
Purpose
This document is the single source of truth for the industry concentration research behind aggregationtheory.world. It defines the taxonomy, metrics, data sources, validation requirements, and update procedures. Any agent, tool, or human working on this project should follow this document.
1. Thesis
Industry concentration today maps closely to marginal cost structure. Industries fall into five categories, ordered roughly from most to least concentrated and referred to below as categories 1–5 in this order:
- Internet — zero marginal cost distribution; winner-take-most
- Software — low marginal cost; moderate concentration with oligopoly tendencies
- Cognitive Services — knowledge work constrained by labor; fragmented
- Physical Services — physical labor + local delivery; highly fragmented
- Manufacturing / Production — capital + labor + materials; bimodal (oligopolies in capital-intensive, fragmented in discrete/job-shop)
AI will compress labor and distribution constraints in categories 3–5, driving concentration dynamics that resemble what happened in categories 1–2. The data on this site tracks whether that shift is occurring.
2. Taxonomy
2.1 Five Top-Level Categories
| Category | Description | Typical S1 | Typical CR3 |
|---|---|---|---|
| Internet | Digital platforms, zero marginal cost distribution | 20–90% | 50–100% |
| Software | Enterprise/consumer software by workflow | 7–48% | 17–76% |
| Cognitive Services | Professional services where the product is expertise | 2–22% | 5–48% |
| Physical Services | Field services, trades, facilities — labor + local delivery | 1.5–22% | 3–44% |
| Manufacturing / Production | Goods production — capital + labor + materials | 0.5–62% | 1–85% |
2.2 Manufacturing Sub-Categories
Manufacturing / Production is subdivided into three structural types because concentration is driven by fundamentally different forces:
| Sub-Category | Description | Concentration Driver | AI Impact |
|---|---|---|---|
| Capital-Intensive | High capex, IP, regulation barriers (autos, semis, aerospace, industrial gases, cement, beverages, EMS) | Supply-side scale economies | Low — already concentrated |
| Process | Continuous-flow operations, commodity/specialty mix (pharma, chemicals, refining, paper, plastics, glass) | Process efficiency, feedstock access | Moderate — yield optimization |
| Discrete / Job-Shop | Make-to-order, local/regional, labor-intensive operations (machine shops, metal fab, printing, food processing, building products, packaging, furniture, textiles, medical devices) | Same as Physical Services: quoting, scheduling, labor, local sales | HIGH — structurally identical to Physical Services |
2.3 Market Selection Criteria
Each category should contain 8–12 markets that are:
- Independently definable (clear boundary between this market and adjacent ones)
- Measurable (denominator can be sourced from at least one authoritative source)
- Representative of the category's concentration dynamics
- Relevant to the thesis (especially for categories 3–5)
3. Metrics
3.1 Primary Metrics (always report)
S1 (Leader Share): Revenue or usage share of the #1 player in the market.
S1 = Revenue_of_#1_player / Total_market_revenue
Report as percentage to one decimal place.
CR3 (Concentration Ratio, Top 3): Combined share of the top 3 players.
CR3 = (Rev_#1 + Rev_#2 + Rev_#3) / Total_market_revenue
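As a minimal sketch of the two primary metrics, the function below computes S1 and CR3 from a list of player revenues and a separately sourced market total. The function name and the example figures are hypothetical and used only for illustration; they are not taken from the dataset.

```python
def leader_share_and_cr3(revenues, total_market_revenue):
    """Return (S1, CR3) as percentages rounded to one decimal place."""
    if not revenues:
        raise ValueError("revenues must contain at least one player")
    if total_market_revenue <= 0:
        raise ValueError("total_market_revenue must be positive")
    # Rank players by revenue so the order of the input does not matter.
    ranked = sorted(revenues, reverse=True)
    s1 = 100.0 * ranked[0] / total_market_revenue
    cr3 = 100.0 * sum(ranked[:3]) / total_market_revenue
    return round(s1, 1), round(cr3, 1)


# Hypothetical market: illustrative revenues in $B, not real data.
example_revenues = [12.0, 8.0, 5.0, 3.0]
s1, cr3 = leader_share_and_cr3(example_revenues, total_market_revenue=50.0)
print(f"S1 = {s1}%, CR3 = {cr3}%")
```

The market total is passed in rather than summed from the player list, because the denominator normally comes from a separate source and usually covers firms that are not listed individually.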
3.2 Supporting Metric
HHI (Herfindahl-Hirschman Index): Sum of squared market shares of all firms (or top 50).
HHI = Σ(share_i²), where shares are whole numbers (30% → 30² = 900)
DOJ thresholds:
- <1,500 = unconcentrated
- 1,500–2,500 = moderately concentrated
- >2,500 = highly concentrated
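The HHI formula and the thresholds quoted above can be sketched as follows. The function names and the example shares are hypothetical. Treating exactly 1,500 and exactly 2,500 as "moderately concentrated" is an assumption, since the text gives the ranges without stating which side the boundaries fall on.

```python
def hhi(shares_pct):
    """Sum of squared market shares, with shares given as whole-number percentages."""
    return sum(share ** 2 for share in shares_pct)


def doj_level(hhi_value):
    """Map an HHI value to the threshold labels used in this document."""
    if hhi_value < 1500:
        return "unconcentrated"
    if hhi_value <= 2500:  # boundary handling is an assumption
        return "moderately concentrated"
    return "highly concentrated"


# Hypothetical five-firm market; shares are illustrative, not real data.
example_shares = [30, 25, 20, 15, 10]
example_hhi = hhi(example_shares)  # 900 + 625 + 400 + 225 + 100
print(example_hhi, doj_level(example_hhi))
```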
3.3 Denominator Rules (CRITICAL)
The denominator (total market size) is the most important and most debatable decision in any concentration calculation. Different denominators produce different shares for the same company.
Requirements:
- Define the denominator explicitly for every market. Record it in the dataset (e.g., "US ad revenue share," "global query share," "US shipment value").
- Validate against multiple sources. Every market size denominator MUST be cross-referenced against at least two independent sources where possible: an industry tracker as primary, and the Economic Census, BEA industry value-added, company 10-K TAM commentary, or trade association data as secondary.
- Use the Census Bureau concentration tables (data.census.gov) as ground truth for manufacturing (NAICS 31-33) and services sectors where available.
- Use revenue as the default denominator unless there is a strong reason to use another metric (query share for search engines, GMV for e-commerce, installed base for OS).
- Geography: US only unless the market is inherently global (mobile OS, cloud infrastructure, semiconductor fab).
- Do not mix denominators across markets within the same comparison chart.
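One way to enforce the last rule is a check that runs before a comparison chart is rendered. This is only a sketch: the `denominator_type` field and the record layout are assumptions about how the dataset might store the denominator, not a description of the actual schema.

```python
def mixed_denominators(chart_markets):
    """Return the distinct denominator types if a chart mixes them, else an empty set."""
    # "denominator_type" is a hypothetical field name.
    kinds = {m["denominator_type"] for m in chart_markets}
    return kinds if len(kinds) > 1 else set()


# Hypothetical chart inputs for illustration only.
consistent_chart = [
    {"market": "Market A", "denominator_type": "revenue"},
    {"market": "Market B", "denominator_type": "revenue"},
]
mixed_chart = [
    {"market": "Market A", "denominator_type": "revenue"},
    {"market": "Market C", "denominator_type": "query share"},
]

print(mixed_denominators(consistent_chart))  # empty set: safe to chart together
print(sorted(mixed_denominators(mixed_chart)))
```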
4. Data Sources — Hierarchy and Validation
4.1 Source Priority by Category
Internet Markets: StatCounter (browser, search, OS usage share) → eMarketer (ad spend, e-commerce GMV) → Synergy Research (cloud infrastructure) → company 10-Ks / earnings for revenue → Nielsen, Antenna, Gridwise, Bloomberg Second Measure (streaming, ride-hailing, delivery).
Software Markets: IDC Semiannual Software Tracker (CRM, ERP, security, etc.) → Gartner Market Share reports → Canalys, Synergy (cloud, infrastructure) → CIMdata (CAD/PLM) → company 10-Ks for revenue numerators.
Cognitive Services: Am Law 100 / American Lawyer (legal) → Big 4 annual reports (accounting) → AM Best (insurance) → SIA (staffing) → ENR Top 500 (A&E firms) → Ad Age (agencies) → company filings for revenue numerators → Economic Census concentration tables for denominator validation.
Physical Services: IBISWorld industry reports → trade association lists (RC Top 100 roofing, SDM Top 100 fire/safety, etc.) → company filings (EMCOR, Rollins, ABM, Waste Management) → Economic Census (NAICS 238 — Specialty Trade Contractors) → BLS employment data for firm count validation.
Manufacturing / Production: Economic Census concentration tables (data.census.gov; PRIMARY source for CR4/CR8/HHI at 6-digit NAICS) → company 10-Ks for revenue numerators → IHS Markit / S&P Global (auto, chemicals) → IQVIA (pharma) → Euromonitor (beverages, food) → SIPRI (defense) → BEA industry value-added for denominator validation.
4.2 Cross-Validation Procedure
1. DENOMINATOR VALIDATION
a. Identify primary source for total market size
b. Cross-reference with at least one secondary source:
- Economic Census (for manufacturing/services NAICS codes)
- BEA value-added by industry (FRED series)
- Trade association reported market size
- Aggregated company filings (if top 10 players cover >50% of market)
c. If primary and secondary disagree by >20%, investigate and note discrepancy
d. Record both sources and chosen value with justification
2. NUMERATOR VALIDATION
a. Use most recent fiscal year revenue from company 10-K or equivalent
b. Adjust for geography (US vs global) using segment reporting or analyst estimates
c. Adjust for scope (e.g., Palo Alto total revenue vs security-only revenue)
d. Note any estimates or adjustments made
3. CENSUS BUREAU CHECK (for Manufacturing and Services)
a. Look up NAICS code for the market
b. Check data.census.gov for 2022 Economic Census concentration tables
c. Compare our calculated CR3 with Census CR4 (expect our CR3 ≤ Census CR4)
d. If significant discrepancy, investigate market definition differences
e. Note Census CR4/CR8 values alongside our metrics
4.3 FRED / BEA Cross-Reference Series
For denominator validation, check these FRED/BEA series:
- GDP by Industry (Value Added): BEA Table 1.3.5 — gross output by NAICS sector
- Industrial Production indices by NAICS (FRED series IPG____S) — for manufacturing trend context
- Annual Survey of Manufactures — shipment values by NAICS (more frequent than Economic Census)
- Service Annual Survey — revenue by NAICS for services sectors
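Two of the checks in §4.2 are mechanical enough to sketch in code: the 20% denominator disagreement test (step 1c) and the CR3 versus Census CR4 comparison (step 3c). The function names are hypothetical. Measuring the disagreement relative to the primary source is an assumption, since the procedure does not say which value is the base.

```python
def denominator_discrepancy(primary, secondary):
    """Relative gap between two market size estimates, as a fraction of the primary."""
    if primary <= 0:
        raise ValueError("primary market size must be positive")
    return abs(primary - secondary) / primary


def needs_investigation(primary, secondary, threshold=0.20):
    """True when the two denominators disagree by more than the threshold (step 1c)."""
    return denominator_discrepancy(primary, secondary) > threshold


def cr3_consistent_with_census(our_cr3, census_cr4):
    """True when our CR3 does not exceed the Census CR4 for the same market (step 3c)."""
    return our_cr3 <= census_cr4


# Hypothetical values for illustration only (market sizes in $B, ratios in %).
print(needs_investigation(primary=100.0, secondary=75.0))  # 25% gap
print(needs_investigation(primary=100.0, secondary=90.0))  # 10% gap
print(cr3_consistent_with_census(our_cr3=42.0, census_cr4=38.0))
```

A failed CR3 check does not by itself mean the numbers are wrong. As step 3d notes, it usually points to a difference in market definition that needs to be documented.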
5. Time Series Requirements
5.1 Historical Data Collection
Every market in the dataset must have a time series going back as far as reliable data allows. Target time horizons:
| Category | Target Start Year | Rationale |
|---|---|---|
| Internet | ~2000 (or market inception) | Most markets didn't exist before this |
| Software | ~1990 (or market inception) | Pre-cloud ERP/productivity era |
| Cognitive Services | ~1980 (or earlier) | Big 8→Big 6→Big 5→Big 4 history |
| Physical Services | ~1990 (Economic Census 1992) | Census provides baseline |
| Manufacturing | ~1950 (Census of Manufactures) | Long history of concentration data |
5.2 Composite Category Time Series
In addition to individual market time series, compute and display a composite CR3 average by category for each time period. Method: simple average of CR3 across all markets in the category for each year. When individual market data is missing for a year, interpolate linearly between adjacent known values.
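The composite method can be sketched as below. The function names and sample series are hypothetical. One assumption goes beyond the text: years before a market's first known value or after its last are left out of that year's average instead of being extrapolated.

```python
def interpolate_series(known, years):
    """Fill gaps between known points linearly; years outside the known range stay None."""
    points = sorted(known.items())
    filled = {}
    for year in years:
        if year in known:
            filled[year] = known[year]
            continue
        before = [(y, v) for y, v in points if y < year]
        after = [(y, v) for y, v in points if y > year]
        if before and after:
            (y0, v0), (y1, v1) = before[-1], after[0]
            filled[year] = v0 + (v1 - v0) * (year - y0) / (y1 - y0)
        else:
            filled[year] = None  # assumption: no extrapolation
    return filled


def composite_cr3(markets, years):
    """Simple average of CR3 across markets for each year, after interpolation."""
    filled = [interpolate_series(series, years) for series in markets.values()]
    composite = {}
    for year in years:
        values = [f[year] for f in filled if f[year] is not None]
        composite[year] = sum(values) / len(values) if values else None
    return composite


# Hypothetical category with two markets; CR3 values in percent, not real data.
example_markets = {
    "Market A": {2010: 40.0, 2012: 50.0},  # 2011 is missing and gets interpolated
    "Market B": {2010: 20.0, 2011: 30.0, 2012: 40.0},
}
example_composite = composite_cr3(example_markets, [2010, 2011, 2012])
print(example_composite)
```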
6. Current Dataset
62 markets spanning 5 top-level categories with Manufacturing subdivided into 3 structural sub-types. Full market-level data is available on the Data page as CSV download.
7. Category Averages (Current)
| Category | Markets | Avg S1 | Avg CR3 | Avg HHI | Level |
|---|---|---|---|---|---|
| Internet | 10 | 52.9% | 78.1% | 3,771 | Highly Concentrated |
| Software | 10 | 22.3% | 40.8% | 807 | Mildly Concentrated |
| Cognitive Services | 9 | 9.2% | 21.7% | 326 | Unconcentrated |
| Physical Services | 10 | 7.0% | 13.8% | 223 | Unconcentrated |
| Manufacturing (blended) | 23 | 10.5% | 24.6% | 531 | Mildly Concentrated |
| → Capital-Intensive | 7 | 24.3% | 49.6% | 1,450 | Moderately Concentrated |
| → Process | 6 | 9.1% | 22.2% | 293 | Unconcentrated |
| → Discrete / Job-Shop | 10 | 3.8% | 9.3% | 68 | Highly Fragmented |
8. Software Concentration Time Series Patterns
Four distinct patterns of software concentration have been identified:
Pattern 1: Declining S1 (market expanded faster than leader)
- ERP (SAP): 25% → 13% (2010–2025)
- Cloud Infra (AWS): 100% → 30% (2010–2025)
- Productivity (Microsoft): 90% → 48% (2005–2025)
- Mechanism: Market TAM grew 3–10x; new cloud-native entrants captured expansion
- Key insight: S1 fell but CR3 often stayed stable — oligopoly formed among top 3
Pattern 2: Rising S1 (cloud-native winner aggregated)
- CRM (Salesforce): 10% → 22% (2010–2025)
- ITSM (ServiceNow): 2% → 35% (2010–2025)
- Payments (Stripe): 0% → 20% (2010–2025)
- Mechanism: Cloud-native player with superior distribution displaced on-prem incumbents
- Key insight: This IS the Aggregation Theory pattern within software
Pattern 3: Stable oligopoly
- CAD/PLM: Autodesk/Dassault/Siemens held ~56% CR3 for 15 years
- Mechanism: High switching costs, deep workflow integration, long sales cycles
Pattern 4: Reconsolidating after disruption
- Data/Analytics: CR3 30% → 18% → 29% (2010–2018–2025)
- Mechanism: Open-source (Hadoop) disrupted incumbents; then Snowflake/Databricks re-concentrated
- Key insight: Disruption temporarily fragments; new winners reconsolidate over ~8–10 years
10. Update Procedure
Annual Q2 refresh. Pull latest data from StatCounter, eMarketer, IDC Semiannual Software Tracker, Gartner, Big 4 annual reports, AM Best, SIA, ENR Top 500, Ad Age, IBISWorld, Economic Census, company 10-Ks, BEA value-added, BLS QCEW. Cross-validate any market with >10% TAM change. Update CR3 time series. Publish updated dataset with new vintage timestamp.
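The ">10% TAM change" trigger can be expressed as a small filter over the previous and current vintages. This is a sketch under assumptions: the function name and dictionary layout are hypothetical, and the change is measured relative to the previous vintage.

```python
def markets_to_revalidate(previous_tam, current_tam, threshold=0.10):
    """Return markets whose TAM moved by more than the threshold since the last vintage."""
    flagged = []
    for market, old in previous_tam.items():
        new = current_tam.get(market)
        if new is None or old <= 0:
            continue  # markets without a comparable pair are skipped in this sketch
        if abs(new - old) / old > threshold:
            flagged.append(market)
    return flagged


# Hypothetical market sizes in $B for two vintages; not real data.
previous = {"Market A": 80.0, "Market B": 25.0}
current = {"Market A": 92.0, "Market B": 26.0}
flagged_markets = markets_to_revalidate(previous, current)
print(flagged_markets)
```

Markets that appear only in the new vintage are skipped here; a fuller refresh would presumably validate those from scratch under §4.2.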
13. Averaging Methodology
13.1 Simple Average (Default)
Each market is one equal observation. Formula: avg_CR3 = Σ(CR3_i) / n
Strengths: treats each industry as an equal data point about structural dynamics. Easy to explain. Weakness: tiny markets (pest control $25B) weigh the same as massive ones (food processing $950B).
13.2 Revenue-Weighted Average
Each market's CR3 is weighted by its total market size. Formula: weighted_CR3 = Σ(CR3_i × size_i) / Σ(size_i)
Markets without revenue denominators (browser usage share, OS installed base) are excluded from weighted calculations.
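The two averaging methods can be compared side by side with a sketch like the one below. The field names `cr3` and `size` are assumptions about the dataset layout, and the sample values are illustrative. A market with `size` set to `None` stands in for one without a revenue denominator.

```python
def simple_avg_cr3(markets):
    """Each market counts as one equal observation."""
    return sum(m["cr3"] for m in markets) / len(markets)


def weighted_avg_cr3(markets):
    """Weight each market's CR3 by market size, excluding markets with no revenue denominator."""
    priced = [m for m in markets if m.get("size") is not None]
    total_size = sum(m["size"] for m in priced)
    return sum(m["cr3"] * m["size"] for m in priced) / total_size


# Hypothetical markets: CR3 in percent, size in $B; not real data.
example = [
    {"market": "Market A", "cr3": 60.0, "size": 100.0},
    {"market": "Market B", "cr3": 20.0, "size": 300.0},
    {"market": "Market C", "cr3": 90.0, "size": None},  # usage-share market
]
print(round(simple_avg_cr3(example), 1))
print(weighted_avg_cr3(example))
```

In this example the large, fragmented market pulls the weighted figure well below the simple average, which is the effect the weighting is meant to expose.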
13.3 Display Requirement
The dashboard MUST provide a toggle between simple and weighted averages. Both values should be computable from the dataset. Summary cards and bar charts should update dynamically when the toggle is switched.
14. Source Attribution Requirements
Every market in the dataset must have a primary source and at least one secondary validation source. When a user hovers or clicks any market in any chart, the tooltip or detail page must show: market name, category, leader, S1, CR3, market size, primary source, and validation sources with their quality tier.
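A record type along these lines would carry every field the tooltip needs and make the attribution rule checkable. This is a sketch: the class names, field names, and sample values are hypothetical, and representing the quality tier as an integer from 1 to 3 is an assumption.

```python
from dataclasses import dataclass, field


@dataclass
class Source:
    name: str
    tier: int  # assumed encoding: 1, 2, or 3


@dataclass
class MarketRecord:
    market: str
    category: str
    leader: str
    s1: float  # percent
    cr3: float  # percent
    market_size_usd_b: float
    primary_source: Source
    validation_sources: list = field(default_factory=list)

    def attribution_errors(self):
        """Return a list of problems; an empty list means the record can be displayed."""
        errors = []
        if not self.validation_sources:
            errors.append("needs at least one validation source")
        for src in [self.primary_source, *self.validation_sources]:
            if src.tier not in (1, 2, 3):
                errors.append(f"{src.name}: tier must be 1, 2, or 3")
        return errors


# Hypothetical record for illustration only.
record = MarketRecord(
    market="Market A", category="Software", leader="Firm X",
    s1=24.0, cr3=50.0, market_size_usd_b=50.0,
    primary_source=Source("Industry tracker", 2),
    validation_sources=[Source("Economic Census", 1)],
)
print(record.attribution_errors())
```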
15. Source Quality Tiers
Every source used in this dataset is classified into one of three quality tiers. The tiering follows the source-priority rules in §4.1 and the cross-validation procedure in §4.2.
Source Quality Hierarchy
Every market has a primary source plus 1–3 validation sources. Revenue numerators come from SEC 10-K filings; denominators are validated against the Economic Census where available.
Data vintage: Q2 2025 research compilation. Labels updated 2026-Q2. Tier 1 sources refreshed from FY2025 SEC filings where available. For full methodology see the Methodology page.
References
- Federal Reserve FEDS Note (Feb 2023): "A Note on Industry Concentration Measurement" (validates Economic Census over Compustat for concentration measurement).
- Kulick (2022), US Chamber: "Industrial Concentration in the United States: 2002–2017" (comprehensive CR4 trends across all NAICS sectors).
- Gutiérrez & Philippon (2018), NBER: "From Good to Bad Concentration?" (national vs local concentration trends).
- Autor et al. (2020): Rising concentration trends 1992–2012 using Economic Census.
- 2022 Economic Census: CR4/CR8/CR20/CR50 tables released April 2025.