
Model Transparency

Last Updated: February 19, 2026

Evaluating AI models through evidence-based transparency scoring

The AI industry is plagued by unverifiable claims, benchmark gaming, and opaque model development practices. ApX Model Transparency addresses this by systematically evaluating how transparent AI model providers are about their models' architecture, training, and deployment, empowering developers and researchers to make informed decisions.

This transparency scoring system evaluates models across 10 distinct criteria organized into three pillars: Upstream (where the model comes from), Model (the model itself), and Downstream (how you use it). Each criterion is scored from 0-10 based on publicly verifiable evidence, with a total maximum score of 100 points.

This system is inspired by Stanford's Foundation Model Transparency Index (FMTI) but simplified and adapted for practical application by developers and practitioners, rather than purely academic assessment.
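To make the structure concrete, the rubric can be expressed as plain data. Here is a minimal sketch in Python; the criterion and pillar names match the sections below, and only the 10-points-per-criterion weighting is taken from this page:

```python
# Minimal sketch of the rubric: 10 criteria in 3 pillars, 0-10 points each.
RUBRIC = {
    "Upstream": [
        "Architectural Provenance",
        "Dataset Composition",
        "Tokenizer Integrity",
    ],
    "Model": [
        "Parameter Density",
        "Training Compute",
        "Benchmark Reproducibility",
        "Identity Consistency",
    ],
    "Downstream": [
        "License Clarity",
        "Hardware Footprint",
        "Versioning & Drift",
    ],
}

MAX_PER_CRITERION = 10
# Pillar maxima: Upstream 30, Model 40, Downstream 30 -> total 100.
assert sum(len(c) for c in RUBRIC.values()) * MAX_PER_CRITERION == 100
```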

Why Model Transparency Matters

The lack of transparency in AI development has created significant challenges for the technical community:

  • Benchmark Score Gaming: Models claim higher performance through undisclosed prompt engineering, contaminated test data, or cherry-picked evaluation sets without reproducible methodology.
  • Architecture Obfuscation: Providers make vague claims like "advanced transformer architecture" without disclosing whether the model is an original creation, fine-tuned from another base model, or distilled from a proprietary system.
  • Parameter Misrepresentation: Mixture-of-Experts (MoE) models advertise massive parameter counts (e.g., "47B parameters") while only activating a fraction (e.g., 12B active parameters) during inference, misleading users about computational requirements.
  • Silent Model Degradation: Model weights or behavior change without version updates or announcements, causing unexpected performance shifts, increased refusals, or an "alignment tax" where safety tuning degrades capability.
  • Licensing Confusion: Models are labeled "open source" while imposing significant commercial restrictions, or license terms conflict with data provenance, creating legal uncertainty for developers.

By scoring transparency across technical, legal, and operational dimensions, the system provides a systematic framework for holding model providers accountable and helping users select models that meet their integrity standards.

The Three-Pillar Framework

The transparency evaluation is organized into three pillars that cover the complete lifecycle of an AI model:

Upstream: Model Origins (30 points)

Evaluates transparency about where the model came from: its base architecture, training data sources, and tokenization approach. Essential for understanding the foundation upon which the model is built.

Model: Core Characteristics (40 points)

Assesses transparency about the model itself: parameter counts, training compute, benchmark validity, and identity consistency. The largest pillar because these factors directly impact model selection and trust.

Downstream: Practical Usage (30 points)

Examines transparency about how the model can be used: licensing terms, hardware requirements, and version management. Critical for deployment planning and long-term maintenance.

The 10 Transparency Criteria

Each model is evaluated against 10 specific criteria. Scores are based on publicly verifiable evidence. Vague marketing claims or unverifiable statements result in low scores.

Pillar 1: Upstream Transparency

1. Architectural Provenance (0-10)

Does the provider disclose the base model architecture, whether it was trained from scratch or fine-tuned, and what modifications were made?

High Score (7-10):

Base model explicitly named with public documentation, training methodology fully described, architectural modifications documented, pretraining procedure detailed with evidence.

Medium Score (5-6):

Base model mentioned but documentation limited, training approach described in general terms, some architectural details provided but not comprehensive, partial pretraining information.

Low Score (0-4):

No base model disclosure, vague claims like "built on transformer architecture", undisclosed fine-tuning approach, "proprietary training methods" without documentation.

2. Dataset Composition (0-10)

Are the training data sources disclosed? Is the dataset composition breakdown available (e.g., web 40%, code 20%, books 10%)?

High Score (7-10):

Training data sources disclosed publicly, dataset composition breakdown provided, filtering and cleaning methodology documented, data collection approach explained, sample data available.

Medium Score (5-6):

Some major data sources mentioned, partial composition information provided, basic filtering approach described, limited documentation on collection methods.

Low Score (0-4):

Vague claims like "trained on diverse internet data", no sources named, "carefully curated" without defining criteria, "proprietary dataset" with no details.

3. Tokenizer Integrity (0-10)

Is the tokenizer publicly available for inspection? Does it match the claimed language support and training data?

High Score (7-10):

Tokenizer publicly available, vocabulary size stated, tokenization approach documented, verifiable alignment with language support, training data composition matches tokenizer design.

Medium Score (5-6):

Tokenizer available but documentation sparse, vocabulary size provided, basic tokenization details given, language support claims generally align with observable behavior.

Low Score (0-4):

No tokenizer access, unknown vocabulary size, vague "advanced tokenization", claims multilingual support but tokenizer not inspectable, token counts don't match across platforms.
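Several of these checks can be automated when the tokenizer is published. Below is a minimal sketch using the Hugging Face transformers library; the model id is just an example of a repository that ships its tokenizer, and the probe strings are arbitrary:

```python
# Sketch: inspect a public tokenizer to verify vocabulary size and
# language coverage claims. Assumes the `transformers` library and a
# model whose tokenizer is published on the Hugging Face Hub.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

print("vocab size:", tok.vocab_size)  # compare against the provider's claim

# A crude multilingual check: claimed languages should not explode
# into disproportionately many tokens per word.
for text in ["transparency", "Transparenz", "透明性"]:
    ids = tok.encode(text, add_special_tokens=False)
    print(f"{text!r} -> {len(ids)} tokens")
```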

Pillar 2: Model Transparency

4. Parameter Density (0-10)

Are total and active parameters clearly stated? For MoE models, is the distinction between total and active parameters transparent?

High Score (7-10):

Total parameters clearly stated, active parameters disclosed for MoE models, architectural breakdown provided (e.g., attention 40%, FFN 60%), quantization impact documented with evidence.

Medium Score (5-6):

Parameter counts provided but some ambiguity, MoE models mention active parameters but lack detail, basic architectural information given, limited quantization documentation.

Low Score (0-4):

Vague parameter counts ("~7B", "approximately"), MoE models advertising total params without active params disclosure, conflicting parameter counts across sources, no dense vs sparse clarification.
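The total-vs-active distinction is simple arithmetic once the expert configuration is disclosed. A minimal sketch with illustrative numbers, loosely modeled on an 8-expert MoE routing each token through 2 experts (all values are assumptions, not any specific model's figures):

```python
# Sketch: why "47B parameters" can be misleading for a MoE model.
shared_params = 1.3e9        # attention, embeddings, router (assumed)
params_per_expert = 5.7e9    # feed-forward experts (assumed)
n_experts, experts_per_token = 8, 2

total = shared_params + n_experts * params_per_expert
active = shared_params + experts_per_token * params_per_expert

print(f"total:  {total/1e9:.1f}B parameters (what the marketing says)")
print(f"active: {active/1e9:.1f}B parameters (what each token actually uses)")
```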

5. Training Compute (0-10)

Is information about training compute resources disclosed? GPU/TPU hours, hardware specifications, environmental impact?

High Score (7-10):

GPU/TPU hours disclosed, hardware specifications provided, training duration stated, carbon footprint calculated or estimated, cost transparency where appropriate.

Medium Score (5-6):

General compute information provided, hardware type mentioned, approximate training duration, limited environmental impact data, some cost indicators.

Low Score (0-4):

Vague claims like "trained on powerful GPUs", no compute hours disclosed, no environmental impact data, "significant resources" without specifics, downplaying resource requirements.
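When GPU-hours and hardware are disclosed, a rough footprint can be estimated with a standard back-of-the-envelope calculation: energy = GPU-hours × per-accelerator power × datacenter PUE, and emissions = energy × grid carbon intensity. A sketch with purely illustrative inputs:

```python
# Sketch: back-of-the-envelope training footprint from disclosed numbers.
# All inputs below are illustrative assumptions, not measurements.
gpu_hours = 1_000_000          # disclosed GPU-hours
gpu_power_kw = 0.7             # ~700 W per accelerator under load (assumed)
pue = 1.1                      # datacenter power usage effectiveness (assumed)
grid_kgco2_per_kwh = 0.4       # grid carbon intensity (assumed)

energy_kwh = gpu_hours * gpu_power_kw * pue
emissions_t = energy_kwh * grid_kgco2_per_kwh / 1000

print(f"energy:    {energy_kwh:,.0f} kWh")
print(f"emissions: {emissions_t:,.0f} tCO2e")
```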

6. Benchmark Reproducibility (0-10)

Can benchmark results be reproduced? Are evaluation prompts, few-shot examples, and benchmark versions disclosed?

High Score (7-10):

Evaluation code public, exact prompts and few-shot examples disclosed, benchmark versions specified, reproduction instructions provided, third-party verification available or encouraged.

Medium Score (5-6):

Some evaluation details provided, benchmark versions mentioned, general methodology described, partial reproduction possible, limited third-party verification.

Low Score (0-4):

Cherry-picked benchmarks, no evaluation methodology, vague "outperforms competitors", undisclosed prompting strategies, no reproduction path, different scores across sources without explanation.
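In practice, a reproducible result requires pinning everything that affects the score. A sketch of what a machine-readable disclosure might contain; all field names and values here are hypothetical:

```python
# Sketch: the minimum a provider could publish to make a benchmark
# result reproducible. Every field below is a hypothetical example.
eval_disclosure = {
    "benchmark": "MMLU",
    "benchmark_version": "pinned dataset revision / commit hash",
    "n_shot": 5,
    "prompt_template": "The following are multiple choice questions...\n{question}",
    "answer_extraction": "first token in {A, B, C, D}",
    "sampling": {"temperature": 0.0, "max_tokens": 1},
    "harness": "evaluation harness name and exact version",
    "reported_score": 0.712,
}
```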

7. Identity Consistency (0-10)

Does the model correctly identify itself? Is version information provided? Does it accurately represent its capabilities?

High Score (7-10):

Model correctly identifies itself consistently, version number provided and accurate, no identity confusion, transparent about capabilities and limitations, acknowledges knowledge cutoff dates.

Medium Score (5-6):

Model generally identifies itself correctly, version info sometimes provided, mostly accurate capability claims, occasional minor inconsistencies, limited limitation disclosure.

Low Score (0-4):

Claims to be a different model (e.g., says it's GPT-4 when it's not), identity confusion, misleading capability claims, no version awareness, pretends to be from a different company.
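Identity checks lend themselves to simple interactive probes. A minimal sketch against an OpenAI-compatible chat endpoint; the base URL and model name are placeholders for your own deployment:

```python
# Sketch: probe a deployed model's self-identification via an
# OpenAI-compatible endpoint. URL and model name are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

probes = [
    "What model are you, exactly? Include your version if you know it.",
    "Were you built by OpenAI?",  # a rebadged model often says yes here
]
for q in probes:
    reply = client.chat.completions.create(
        model="example-model",
        messages=[{"role": "user", "content": q}],
        temperature=0,
    )
    print(q, "->", reply.choices[0].message.content)
```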

Pillar 3: Downstream Transparency

8. License Clarity (0-10)

Is the license clear and unambiguous? Are commercial use terms explicit? Are there conflicting license claims?

High Score (7-10):

Clear open source license (Apache 2.0, MIT) or well-defined custom license, commercial use terms explicit, no conflicting terms, derivative works policy clear, consistent licensing across weights and code.

Medium Score (5-6):

License specified but some terms unclear, commercial use generally permitted with some restrictions, mostly consistent licensing, derivative works policy mentioned but not detailed.

Low Score (0-4):

Vague licensing, conflicting terms, "free for non-commercial use" without clear definition, license unclear or missing, "open source" label with commercial restrictions (not true open source).

9. Hardware Footprint (0-10)

Are VRAM requirements documented for different precision levels? Is guidance provided for quantization and context length scaling?

High Score (7-10):

VRAM requirements documented for FP16/Q8/Q4, batch size impact disclosed, context length memory scaling provided, quantization accuracy tradeoffs documented, realistic requirements stated.

Medium Score (5-6):

Basic VRAM requirements provided, some precision levels covered, general quantization guidance given, context length considerations mentioned, requirements mostly realistic.

Low Score (0-4):

No VRAM guidance, vague "runs on consumer hardware" claims that don't match reality, misleading efficiency claims (e.g., "8GB VRAM sufficient" when 24GB is actually needed), undisclosed context limitations.
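VRAM claims can be sanity-checked with a first-order estimate: weight memory is parameters × bytes per parameter, and the KV cache grows linearly with context length. A sketch with illustrative architecture numbers for a generic ~7B model; real measurements will be higher due to activations and runtime overhead:

```python
# Sketch: first-order VRAM estimate for inference. Ignores activation
# memory and runtime overhead, so treat results as a lower bound.
params = 7e9
bytes_per_param = {"FP16": 2, "Q8": 1, "Q4": 0.5}

n_layers, n_kv_heads, head_dim = 32, 8, 128   # assumed architecture
context_len, kv_bytes = 8192, 2               # FP16 KV cache

# KV cache: 2 tensors (K and V) per layer, per KV head, per position.
kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes

for precision, b in bytes_per_param.items():
    total_gb = (params * b + kv_cache) / 1024**3
    print(f"{precision}: ~{total_gb:.1f} GB")
```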

10. Versioning & Drift (0-10)

Is semantic versioning used? Are changes documented? Can users access previous versions if needed?

High Score (7-10):

Semantic versioning used, changelog maintained, API and weight changes documented, deprecation notices provided, version history accessible, clear migration paths for breaking changes.

Medium Score (5-6):

Basic versioning in place, some changes documented, major updates announced, limited version history available, general migration guidance provided.

Low Score (0-4):

No versioning system, silent updates, behavior drift without notice, no changelog, impossible to track changes, model weights change without version updates, no access to previous versions.
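Silent weight drift is detectable even without provider cooperation: hash the published weight files and compare digests over time. A minimal sketch, with the directory path as a placeholder:

```python
# Sketch: detect silent weight changes by fingerprinting model files.
# Run periodically and compare against previously recorded digests;
# any difference without a version bump is evidence of silent drift.
import hashlib
from pathlib import Path

def fingerprint(model_dir: str) -> dict[str, str]:
    digests = {}
    for f in sorted(Path(model_dir).glob("*.safetensors")):
        h = hashlib.sha256()
        with open(f, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                h.update(chunk)
        digests[f.name] = h.hexdigest()
    return digests

print(fingerprint("./my-model"))  # placeholder path
```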

Scoring Methodology

Each criterion is scored on a 0-10 scale based on the quality and accessibility of publicly available evidence:

  • 9-10: Exemplary transparency, comprehensive documentation.
  • 7-8: Good transparency with minor gaps.
  • 5-6: Moderate transparency, key details missing.
  • 3-4: Minimal disclosure with significant gaps.
  • 0-2: No information, vague claims, or unverifiable assertions.

Overall Transparency Grades

  • A (90-100): Exceptional transparency, exemplary practices across all pillars.
  • B (75-89): Good transparency with minor gaps, mostly trustworthy.
  • C (60-74): Moderate transparency, significant gaps but usable information.
  • D (50-59): Poor transparency, major concerns about verifiability.
  • F (0-49): Opaque, untrustworthy, or actively deceptive practices.
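The grade boundaries above translate directly into code; a minimal sketch:

```python
# Sketch: map a 0-100 transparency score to the letter grades above.
def grade(score: int) -> str:
    for letter, cutoff in [("A", 90), ("B", 75), ("C", 60), ("D", 50)]:
        if score >= cutoff:
            return letter
    return "F"

assert grade(92) == "A" and grade(74) == "C" and grade(49) == "F"
```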

Automatic Penalties for Known Issues

The scoring system applies automatic penalties when specific controversies or violations are discovered:

  • Benchmark Contamination (-2 to -5 points): Training data includes test sets, contamination not disclosed publicly, or evaluation methodology designed to inflate scores artificially.
  • Identity Misrepresentation (-3 to -8 points): Model falsely claims to be a competitor's model, inflates parameter counts, or makes misleading capability claims that can't be verified.
  • License Violations (-5 to -10 points): Using restricted data without permission, violating upstream model licenses, or significant conflicts between stated license and actual terms of service.
  • Silent Model Degradation (-3 to -6 points): Performance degraded without notice, safety restrictions increased silently (alignment tax), or behavior changes without version updates.
  • Data Provenance Issues (-4 to -7 points): Copyrighted material used without disclosure, personal data harvested without consent, or undisclosed use of synthetic data from other proprietary models.

Note: Models are not penalized for technical oversights, unintentional bugs, or legally mandated content restrictions (e.g., regional compliance requirements). Penalties focus on deliberate obfuscation or deceptive practices.
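Penalties subtract from the raw criterion total, with the result clamped to the 0-100 range; a minimal sketch using values drawn from the ranges above:

```python
# Sketch: apply documented penalties to a raw criterion total,
# clamped to the 0-100 scale used by the letter grades.
def final_score(raw_total: int, penalties: list[int]) -> int:
    return max(0, min(100, raw_total - sum(penalties)))

# Example: a model scoring 78 caught with undisclosed benchmark
# contamination (-4) and a silent degradation incident (-3).
print(final_score(78, [4, 3]))  # 71 -> grade C
```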

Research & Verification Methodology

Transparency evaluations combine AI-powered research with human verification. Multiple sources of evidence are used and claims are cross-referenced to ensure accuracy:

Evidence Hierarchy (strongest to weakest):

  1. Peer-reviewed papers with reproducible results
  2. Official GitHub repositories with actual model code and documentation
  3. Official technical blog posts with detailed specifications
  4. Independent third-party technical audits and testing
  5. Official model cards, datasheets, and documentation
  6. API documentation with specifications

Verification Activities by Type

Where possible, external sources are supplemented with hands-on verification using different techniques depending on the criterion:

Documentary Review:

License analysis, policy review, documentation completeness assessment, changelog examination. Used for: License Clarity, Versioning & Drift, Dataset Composition (partial).

Static Analysis:

Downloading and inspecting tokenizer files, model weights, configuration files, code repositories. Vocabulary size verification, architecture inspection, parameter counting. Used for: Tokenizer Integrity, Architectural Provenance, Parameter Density.

Interactive Testing:

Querying deployed models to detect tokenizer behavior, testing identity consistency (self-identification), validating capability claims. Used for: Identity Consistency, Tokenizer Integrity, Benchmark Reproducibility (partial).

Hands-On Deployment:

Actually running models locally or in test environments to measure VRAM consumption, validate context length limits, verify quantization claims, test inference speed. Used for: Hardware Footprint, Parameter Density (validation), Compute efficiency claims.

Cross-Reference Validation:

Comparing claims across multiple sources, checking for consistency between documentation and observed behavior, validating third-party reports. Applied across all criteria to detect inconsistencies.

Technical Transparency vs. Safety Transparency

Model Transparency focuses on technical transparency: the information developers and practitioners need to effectively evaluate, deploy, and maintain AI models. This includes architecture details, resource requirements, licensing clarity, and operational characteristics.

While several excellent initiatives focus on safety transparency (bias auditing, red-teaming results, content moderation approaches), the emphasis is deliberately on the technical infrastructure layer. Technical transparency is considered a prerequisite for informed model selection and effective deployment.

Focus: Technical Transparency

Architecture, training data, compute resources, benchmarks, licensing, versioning, hardware requirements: information for building with models.

Complementary: Safety Transparency

Bias testing, harmful content evaluation, safety benchmarks, red-teaming results: covered by other initiatives like Stanford HELM and AI Verify.

View Model Transparency Scores

Transparency scores are displayed on individual model pages in the LLM database. Each model includes a transparency chart showing scores across all 10 criteria, along with an overall transparency grade.

Not all models have transparency scores yet. Coverage is actively being expanded. Models are evaluated based on publicly available information at the time of assessment and may be re-evaluated as new information becomes available.

Explore Transparency Scores

Browse the LLM database to view transparency evaluations for individual models.

View LLM Database →