METHODOLOGY

Aggregation Theory Offline — Master Prompt

Purpose

This document is the single source of truth for the industry concentration research behind aggregationtheory.world. It defines the taxonomy, metrics, data sources, validation requirements, and update procedures. Any agent, tool, or human working on this project should follow this document.

1. Thesis

Industry concentration today maps closely to marginal cost structure. Industries fall along a spectrum from most to least concentrated across five categories:

  1. Internet — zero marginal cost distribution; winner-take-most
  2. Software — low marginal cost; moderate concentration with oligopoly tendencies
  3. Cognitive Services — knowledge work constrained by labor; fragmented
  4. Physical Services — physical labor + local delivery; highly fragmented
  5. Manufacturing / Production — capital + labor + materials; bimodal (oligopolies in capital-intensive, fragmented in discrete/job-shop)

AI will compress labor and distribution constraints in categories 3–5, driving concentration dynamics that resemble what happened in categories 1–2. The data on this site tracks whether that shift is occurring.

2. Taxonomy

2.1 Five Top-Level Categories

Category | Description | Typical S1 | Typical CR3
Internet | Digital platforms, zero marginal cost distribution | 20–90% | 50–100%
Software | Enterprise/consumer software by workflow | 7–48% | 17–76%
Cognitive Services | Professional services where the product is expertise | 2–22% | 5–48%
Physical Services | Field services, trades, facilities (labor + local delivery) | 1.5–22% | 3–44%
Manufacturing / Production | Goods production (capital + labor + materials) | 0.5–62% | 1–85%

2.2 Manufacturing Sub-Categories

Manufacturing / Production is subdivided into three structural types because concentration is driven by fundamentally different forces:

Sub-Category | Description | Concentration Driver | AI Impact
Capital-Intensive | High capex, IP, regulation barriers (autos, semis, aerospace, industrial gases, cement, beverages, EMS) | Supply-side scale economies | Low (already concentrated)
Process | Continuous-flow operations, commodity/specialty mix (pharma, chemicals, refining, paper, plastics, glass) | Process efficiency, feedstock access | Moderate (yield optimization)
Discrete / Job-Shop | Make-to-order, local/regional, labor-intensive operations (machine shops, metal fab, printing, food processing, building products, packaging, furniture, textiles, medical devices) | Same as Physical Services: quoting, scheduling, labor, local sales | HIGH (structurally identical to Physical Services)

2.3 Market Selection Criteria

Each category should contain 8–12 markets that are:

3. Metrics

3.1 Primary Metrics (always report)

S1 (Leader Share): Revenue or usage share of the #1 player in the market.

S1 = Revenue_of_#1_player / Total_market_revenue

Report as percentage to one decimal place.

CR3 (Concentration Ratio, Top 3): Combined share of the top 3 players.

CR3 = (Rev_#1 + Rev_#2 + Rev_#3) / Total_market_revenue
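The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual pipeline: the function name, the input shape, and the example figures are hypothetical, and rounding CR3 to one decimal place simply mirrors the S1 reporting rule.

```python
def leader_share_and_cr3(revenues, total_market_revenue):
    """Return (S1, CR3) as percentages rounded to one decimal place.

    revenues: per-firm revenues, in the same units and scope as
    total_market_revenue (the market-size denominator).
    """
    if total_market_revenue <= 0:
        raise ValueError("total_market_revenue must be positive")
    ranked = sorted(revenues, reverse=True)
    if not ranked:
        raise ValueError("at least one firm revenue is required")
    s1 = 100 * ranked[0] / total_market_revenue
    cr3 = 100 * sum(ranked[:3]) / total_market_revenue
    return round(s1, 1), round(cr3, 1)


# Hypothetical market: four tracked firms in a 200-unit market.
s1, cr3 = leader_share_and_cr3([60, 30, 20, 10], 200)
print(s1, cr3)  # 30.0 55.0
```

Note that the denominator is the total market, not the sum of the tracked firms, so untracked long-tail revenue still counts against each share.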

3.2 Supporting Metric

HHI (Herfindahl-Hirschman Index): Sum of squared market shares of all firms (or top 50).

HHI = Σ(share_i²)  where shares are expressed in percentage points (a 30% share contributes 30² = 900)

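DOJ thresholds (2010 Horizontal Merger Guidelines; the 2023 Merger Guidelines lowered the cutoffs to 1,000 and 1,800):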
  <1,500   = unconcentrated
  1,500–2,500 = moderately concentrated
  >2,500   = highly concentrated
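As a rough sketch of the HHI calculation and the threshold bands listed above (the function names and example shares are hypothetical; the default cutoffs are the 1,500 and 2,500 values from this section and are passed as parameters so they can be swapped):

```python
def hhi(shares_pct):
    """Sum of squared market shares, with shares in percentage points."""
    return sum(share ** 2 for share in shares_pct)


def concentration_band(hhi_value, moderate_cutoff=1500, high_cutoff=2500):
    """Map an HHI value to the bands listed in this section."""
    if hhi_value < moderate_cutoff:
        return "unconcentrated"
    if hhi_value <= high_cutoff:
        return "moderately concentrated"
    return "highly concentrated"


# Hypothetical five-firm market whose shares sum to 100%.
example_hhi = hhi([30, 25, 20, 15, 10])  # 900 + 625 + 400 + 225 + 100
print(example_hhi, concentration_band(example_hhi))
```

When only the top 50 firms are available, the omitted tail makes the result a slight underestimate, which is usually negligible because small shares contribute very little once squared.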

3.3 Denominator Rules (CRITICAL)

The denominator (total market size) is the most important and most debatable decision in any concentration calculation. Different denominators produce different shares for the same company.

Requirements:

  1. Define the denominator explicitly for every market. Record it in the dataset (e.g., "US ad revenue share," "global query share," "US shipment value").
  2. Validate against multiple sources. Every market size denominator MUST be cross-referenced against at least two independent sources where possible: an industry tracker as primary, and the Economic Census, BEA industry value-added, company 10-K TAM commentary, or trade association data as secondary.
  3. Use the Census Bureau concentration tables (data.census.gov) as ground truth for manufacturing (NAICS 31-33) and services sectors where available.
  4. Use revenue as the default denominator unless there is a strong reason to use another metric (query share for search engines, GMV for e-commerce, installed base for OS).
  5. Geography: US only unless the market is inherently global (mobile OS, cloud infrastructure, semiconductor fab).
  6. Do not mix denominators across markets within the same comparison chart.

4. Data Sources — Hierarchy and Validation

4.1 Source Priority by Category

Internet Markets: StatCounter (browser, search, OS usage share) → eMarketer (ad spend, e-commerce GMV) → Synergy Research (cloud infrastructure) → company 10-Ks / earnings for revenue → Nielsen, Antenna, Gridwise, Bloomberg Second Measure (streaming, ride-hailing, delivery).

Software Markets: IDC Semiannual Software Tracker (CRM, ERP, security, etc.) → Gartner Market Share reports → Canalys, Synergy (cloud, infrastructure) → CIMdata (CAD/PLM) → company 10-Ks for revenue numerators.

Cognitive Services: Am Law 100 / American Lawyer (legal) → Big 4 annual reports (accounting) → AM Best (insurance) → SIA (staffing) → ENR Top 500 (A&E firms) → Ad Age (agencies) → company filings for revenue numerators → Economic Census concentration tables for denominator validation.

Physical Services: IBISWorld industry reports → trade association lists (RC Top 100 roofing, SDM Top 100 fire/safety, etc.) → company filings (EMCOR, Rollins, ABM, Waste Management) → Economic Census (NAICS 238 — Specialty Trade Contractors) → BLS employment data for firm count validation.

Manufacturing / Production: Economic Census concentration tables (data.census.gov; PRIMARY source for CR4/CR8/HHI at 6-digit NAICS) → company 10-Ks for revenue numerators → IHS Markit / S&P Global (auto, chemicals) → IQVIA (pharma) → Euromonitor (beverages, food) → SIPRI (defense) → BEA industry value-added for denominator validation.

4.2 Cross-Validation Procedure

1. DENOMINATOR VALIDATION
   a. Identify primary source for total market size
   b. Cross-reference with at least one secondary source:
      - Economic Census (for manufacturing/services NAICS codes)
      - BEA value-added by industry (FRED series)
      - Trade association reported market size
      - Aggregated company filings (if top 10 players cover >50% of market)
   c. If primary and secondary disagree by >20%, investigate and note discrepancy
   d. Record both sources and chosen value with justification

2. NUMERATOR VALIDATION
   a. Use most recent fiscal year revenue from company 10-K or equivalent
   b. Adjust for geography (US vs global) using segment reporting or analyst estimates
   c. Adjust for scope (e.g., Palo Alto total revenue vs security-only revenue)
   d. Note any estimates or adjustments made

3. CENSUS BUREAU CHECK (for Manufacturing and Services)
   a. Look up NAICS code for the market
   b. Check data.census.gov for 2022 Economic Census concentration tables
   c. Compare our calculated CR3 with Census CR4 (expect our CR3 ≤ Census CR4)
   d. If significant discrepancy, investigate market definition differences
   e. Note Census CR4/CR8 values alongside our metrics
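Steps 1c and 3c of the procedure above reduce to two simple checks. The sketch below is illustrative only: the function names are hypothetical, and measuring the 20% disagreement relative to the primary source is an assumption, since the procedure does not say which value is the base.

```python
def denominator_discrepancy(primary_size, secondary_size):
    """Relative gap between two market-size estimates.

    Assumption: the gap is measured against the primary source.
    """
    return abs(primary_size - secondary_size) / primary_size


def needs_investigation(primary_size, secondary_size, tolerance=0.20):
    """Step 1c: flag the market when the sources disagree by more than 20%."""
    return denominator_discrepancy(primary_size, secondary_size) > tolerance


def cr3_consistent_with_census(our_cr3, census_cr4):
    """Step 3c: a top-3 share should not exceed the Census top-4 share."""
    return our_cr3 <= census_cr4


# Hypothetical figures, in the same currency units.
print(needs_investigation(100.0, 75.0))        # True: 25% gap
print(needs_investigation(100.0, 85.0))        # False: 15% gap
print(cr3_consistent_with_census(40.0, 45.0))  # True
```

A failed CR3 vs. CR4 check does not by itself mean the numbers are wrong; as step 3d notes, it most often points to a market definition that differs from the NAICS code.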

4.3 FRED / BEA Cross-Reference Series

For denominator validation, check these FRED/BEA series:

5. Time Series Requirements

5.1 Historical Data Collection

Every market in the dataset must have a time series going back as far as reliable data allows. Target time horizons:

Category | Target Start Year | Rationale
Internet | ~2000 (or market inception) | Most markets didn't exist before this
Software | ~1990 (or market inception) | Pre-cloud ERP/productivity era
Cognitive Services | ~1980 (or earlier) | Big 8→Big 6→Big 5→Big 4 history
Physical Services | ~1990 (Economic Census 1992) | Census provides baseline
Manufacturing | ~1950 (Census of Manufactures) | Long history of concentration data

5.2 Composite Category Time Series

In addition to individual market time series, compute and display a composite CR3 average by category for each time period. Method: simple average of CR3 across all markets in the category for each year. When individual market data is missing for a year, interpolate linearly between adjacent known values.
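One way to compute the composite is sketched below. The function names, data shape, and figures are hypothetical. The sketch also assumes that years before a market's first known value or after its last are left out of that year's average rather than extrapolated, which the method above does not specify.

```python
def interpolate_series(known, years):
    """Fill missing years linearly between adjacent known values.

    known: dict mapping year -> CR3. Years outside the known range
    are returned as None (assumption: no extrapolation).
    """
    known_years = sorted(known)
    filled = {}
    for year in years:
        if year in known:
            filled[year] = known[year]
            continue
        earlier = [y for y in known_years if y < year]
        later = [y for y in known_years if y > year]
        if not earlier or not later:
            filled[year] = None
            continue
        y0, y1 = earlier[-1], later[0]
        weight = (year - y0) / (y1 - y0)
        filled[year] = known[y0] + weight * (known[y1] - known[y0])
    return filled


def composite_cr3(markets, years):
    """Simple average of CR3 across a category's markets for each year."""
    filled = [interpolate_series(series, years) for series in markets.values()]
    composite = {}
    for year in years:
        values = [f[year] for f in filled if f[year] is not None]
        composite[year] = sum(values) / len(values) if values else None
    return composite


# Hypothetical category with two markets; market_a has no 2021 value.
markets = {
    "market_a": {2020: 40.0, 2022: 50.0},
    "market_b": {2020: 20.0, 2021: 22.0, 2022: 24.0},
}
result = composite_cr3(markets, [2020, 2021, 2022])
print(result)  # {2020: 30.0, 2021: 33.5, 2022: 37.0}
```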

6. Current Dataset

The dataset covers 62 markets across 5 top-level categories, with Manufacturing subdivided into 3 structural sub-types. Full market-level data is available on the Data page as a CSV download.

7. Category Averages (Current)

Category | Markets | Avg S1 | Avg CR3 | Avg HHI | Level
Internet | 10 | 52.9% | 78.1% | 3,771 | Highly Concentrated
Software | 10 | 22.3% | 40.8% | 807 | Mildly Concentrated
Cognitive Services | 9 | 9.2% | 21.7% | 326 | Unconcentrated
Physical Services | 10 | 7.0% | 13.8% | 223 | Unconcentrated
Manufacturing (blended) | 23 | 10.5% | 24.6% | 531 | Mildly Concentrated
→ Capital-Intensive | 7 | 24.3% | 49.6% | 1,450 | Moderately Concentrated
→ Process | 6 | 9.1% | 22.2% | 293 | Unconcentrated
→ Discrete / Job-Shop | 10 | 3.8% | 9.3% | 68 | Highly Fragmented

8. Software Concentration Time Series Patterns

Four distinct patterns of software concentration have been identified:

Pattern 1: Declining S1 (market expanded faster than leader)

Pattern 2: Rising S1 (cloud-native winner aggregated)

Pattern 3: Stable oligopoly

Pattern 4: Reconsolidating after disruption

10. Update Procedure

Annual Q2 refresh. Pull latest data from StatCounter, eMarketer, IDC Semiannual Software Tracker, Gartner, Big 4 annual reports, AM Best, SIA, ENR Top 500, Ad Age, IBISWorld, Economic Census, company 10-Ks, BEA value-added, BLS QCEW. Cross-validate any market with >10% TAM change. Update CR3 time series. Publish updated dataset with new vintage timestamp.
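The ">10% TAM change" trigger could be checked along these lines. This is a sketch under stated assumptions: the function name and data shape are hypothetical, the change is measured against the previous vintage, and markets with no previous value are flagged as well.

```python
def markets_to_cross_validate(previous_tam, current_tam, threshold=0.10):
    """Return markets whose market size moved by more than the threshold.

    previous_tam, current_tam: dicts mapping market name -> market size.
    Assumption: markets missing from the previous vintage are flagged too.
    """
    flagged = []
    for market, new_size in current_tam.items():
        old_size = previous_tam.get(market)
        if old_size is None:
            flagged.append(market)
            continue
        if abs(new_size - old_size) / old_size > threshold:
            flagged.append(market)
    return flagged


# Hypothetical vintages: market_a grew 15%, market_b grew 5%, market_c is new.
previous = {"market_a": 100.0, "market_b": 200.0}
current = {"market_a": 115.0, "market_b": 210.0, "market_c": 50.0}
flagged = markets_to_cross_validate(previous, current)
print(flagged)  # ['market_a', 'market_c']
```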

13. Averaging Methodology

13.1 Simple Average (Default)

Each market is one equal observation. Formula: avg_CR3 = Σ(CR3_i) / n

Strengths: treats each industry as an equal data point about structural dynamics. Easy to explain. Weakness: tiny markets (pest control $25B) weigh the same as massive ones (food processing $950B).

13.2 Revenue-Weighted Average

Each market's CR3 is weighted by its total market size. Formula: weighted_CR3 = Σ(CR3_i × size_i) / Σ(size_i)

Markets without revenue denominators (browser usage share, OS installed base) are excluded from weighted calculations.
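Both averages can be computed from the same records, as in the sketch below. The field names ("cr3", "size") and the figures are hypothetical; a size of None stands in for markets without a revenue denominator.

```python
def simple_average_cr3(markets):
    """Each market counts as one equal observation."""
    return sum(m["cr3"] for m in markets) / len(markets)


def weighted_average_cr3(markets):
    """Weight each market's CR3 by its market size.

    Markets with no revenue denominator (size is None) are excluded.
    """
    sized = [m for m in markets if m.get("size") is not None]
    total_size = sum(m["size"] for m in sized)
    if total_size == 0:
        raise ValueError("no markets with a revenue denominator")
    return sum(m["cr3"] * m["size"] for m in sized) / total_size


# Hypothetical category: one large fragmented market, one small concentrated
# market, and one usage-share market with no revenue denominator.
markets = [
    {"name": "market_a", "cr3": 10.0, "size": 900.0},
    {"name": "market_b", "cr3": 50.0, "size": 100.0},
    {"name": "market_c", "cr3": 90.0, "size": None},
]
simple = simple_average_cr3(markets)
weighted = weighted_average_cr3(markets)
print(simple, weighted)  # 50.0 14.0
```

The gap between the two results shows why the choice matters: the simple average is pulled up by small concentrated markets, while the weighted average is dominated by the largest one.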

13.3 Display Requirement

The dashboard MUST provide a toggle between simple and weighted averages. Both values should be computable from the dataset. Summary cards and bar charts should update dynamically when the toggle is switched.

14. Source Attribution Requirements

Every market in the dataset must have a primary source and at least one secondary validation source. When a user hovers or clicks any market in any chart, the tooltip or detail page must show: market name, category, leader, S1, CR3, market size, primary source, and validation sources with their quality tier.
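A record shape that carries the tooltip fields, plus a completeness check, might look like the sketch below. The class names, field names, and example values are all hypothetical and do not describe the site's actual schema; the tier numbers follow the three-tier classification used in this document.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Source:
    name: str
    tier: int  # quality tier: 1, 2, or 3


@dataclass
class MarketRecord:
    market: str
    category: str
    leader: str
    s1_pct: float
    cr3_pct: float
    market_size_usd: float
    primary_source: Optional[Source] = None
    validation_sources: List[Source] = field(default_factory=list)


def attribution_problems(record):
    """List the attribution requirements that a record fails to meet."""
    problems = []
    if record.primary_source is None:
        problems.append("missing primary source")
    if not record.validation_sources:
        problems.append("missing validation source")
    sources = [record.primary_source, *record.validation_sources]
    for source in sources:
        if source is not None and source.tier not in (1, 2, 3):
            problems.append(f"invalid tier for {source.name}")
    return problems


# Placeholder values only; not real market data.
complete = MarketRecord(
    market="Example Market", category="Physical Services", leader="Example Co",
    s1_pct=10.0, cr3_pct=25.0, market_size_usd=1_000_000_000.0,
    primary_source=Source("IBISWorld NAICS reports", 2),
    validation_sources=[Source("US Economic Census", 1)],
)
incomplete = MarketRecord(
    market="Example Market 2", category="Software", leader="Example Inc",
    s1_pct=20.0, cr3_pct=40.0, market_size_usd=2_000_000_000.0,
    primary_source=Source("Gartner Market Share", 2),
)
print(attribution_problems(complete))    # []
print(attribution_problems(incomplete))  # ['missing validation source']
```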

15. Source Quality Tiers

Every source used in this dataset is classified into one of three quality tiers. The tiering follows the source-priority rules in §4.1 and the cross-validation procedure in §4.2.

Source Quality Hierarchy

Every market has a primary source plus 1–3 validation sources. Revenue numerators come from SEC 10-K filings; denominators are validated against the Economic Census where available.

TIER 1: Authoritative anchors (government data + SEC filings)
US Economic Census (data.census.gov) · SEC EDGAR / Company 10-Ks · BEA Industry Value-Added (FRED) · BLS QCEW · USGS Mineral Commodity Summaries · EIA Refinery Capacity Report · FDA device listings

TIER 2: Leading commercial trackers (industry-standard paid trackers)
IDC Semiannual Software Tracker · Gartner Market Share & Magic Quadrant · IBISWorld NAICS reports · StatCounter Global Stats · eMarketer / Insider Intelligence · Synergy Research Group

TIER 3: Category specialists (best-in-class niche authorities)
SIPRI (aerospace & defense) · IQVIA (pharma) · TrendForce (semiconductors) · CIMdata (CAD/PLM) · Nilson Report (payments) · Am Law 100 (legal) · ENR Top 500 (A&E) · AM Best (insurance) · SIA (staffing) · Ad Age (agencies) · Barron's / Cerulli (wealth) · Evaluate MedTech · Nielsen Gauge (streaming) · Gridwise (ride-hail) · Bloomberg Second Measure (delivery) · RC Top 100 (roofing) · SDM Top 100 (fire/safety) · Big 4 annual reports · ALM Intelligence (consulting)

Data vintage: Q2 2025 research compilation. Labels updated 2026-Q2. Tier 1 sources refreshed from FY2025 SEC filings where available.
