Name: AgentsOrg
Author: AgentsOrg

Your organization has 500 Custom GPTs and Projects. How many are actually delivering business value?

If you’re like most enterprises with ChatGPT Enterprise, the honest answer is: you don’t know. Some were built in five minutes during a team meeting. Others represent weeks of careful prompt engineering, tool integration, and user testing. From the admin console, they all look the same.

The GPT Portfolio Maturity Model is a framework for classifying every Custom GPT and Project in your workspace by the depth and quality of its configuration: turning an opaque list into an actionable intelligence map.

The Three Tiers

The model classifies GPTs using a Sophistication Score (1-5), derived from semantic analysis of each GPT’s system prompt, tool configuration, and integration depth.

Tier 1: Experimental (Score 1-2)

What it looks like:

System prompt is 1-3 sentences (“You are a helpful assistant for marketing”)
No custom tools or actions configured
No external integrations
No structured output format
Often a test or experiment that was never revisited

What it means: These GPTs represent the natural “try it and see” phase of AI adoption. Someone heard about Custom GPTs, built one in 5 minutes, and either forgot about it or found it wasn’t useful enough to invest more time.

Typical characteristics:

Created by someone exploring ChatGPT for the first time
No clear audience beyond the creator
No measurable business impact
Often duplicates a GPT that already exists

Tier 2: Functional (Score 3)

What it looks like:

Structured system prompt with clear role definition
Some use of custom instructions (tone, format, constraints)
May have basic tool usage (code interpreter, file upload)
Serves a specific use case with a defined audience
Used regularly by the creator, sometimes shared with team

What it means: These GPTs are delivering value but haven’t been optimized. They solve a real problem, but the implementation has room to grow: better prompt engineering, tool integration, quality safeguards, or broader distribution.

Typical characteristics:

Created by someone with moderate ChatGPT experience
Clear business process alignment
Active usage within a team
Could benefit from prompt engineering training

Tier 3: Production (Score 4-5)

What it looks like:

Comprehensive system prompt (500+ words) with detailed instructions
Structured output formats (JSON, markdown templates, etc.)
External integrations via Actions (APIs, databases, third-party services)
Error handling and edge case coverage in the prompt
Clear documentation or onboarding for users
Active usage across teams or departments

What it means: These are your organization’s AI assets. They automate real business processes, handle complex workflows, and are relied upon by multiple people. The builders behind them are your AI power users.

Typical characteristics:

Created by someone with strong prompt engineering skills
Often integrated into existing workflows
Measurable time savings or quality improvement
Others should learn from these implementations

The Typical Enterprise Distribution

After analyzing GPT portfolios across organizations, a consistent pattern emerges:

Tier	Percentage	Implication
Experimental	~60%	Natural exploration: but creates noise, duplicates, and governance overhead
Functional	~25%	The growth opportunity: targeted upskilling moves these toward Production
Production	~15%	The proof of value: these justify the ChatGPT Enterprise investment

This 60/25/15 distribution isn’t inherently bad. It reflects healthy organic adoption: people experimenting, finding what works, and investing in what delivers value. The problem is when you can’t see this distribution. Without measurement, you can’t:

Retire the 60% that’s creating noise
Upskill the 25% that’s close to Production
Recognize the 15% that’s proving the value of your AI investment
Learn from the patterns that distinguish Production from Experimental

Measuring Sophistication

The Sophistication Score isn’t a simple heuristic. It’s derived from semantic analysis of each GPT’s full configuration using LLM-based enrichment. The analysis considers:

Prompt Depth (40% weight)

Length and structure of system instructions
Presence of role definition, constraints, and examples
Edge case handling and error instructions
Output format specifications

Tool Usage (25% weight)

Number and type of configured tools/actions
External API integrations
File handling capabilities
Code execution configuration

Integration Depth (20% weight)

Connection to external data sources
Webhook or API action configurations
Authentication and security setup
Multi-step workflow orchestration

User Experience (15% weight)

Conversation starters / onboarding
Clear audience definition
Documentation or usage instructions
Feedback mechanisms

Using the Model

For AI Leaders

The maturity distribution is your first dashboard metric. Track it over time:

Improving: Experimental percentage shrinking, Functional and Production growing
Stagnant: Distribution unchanged quarter over quarter: training isn’t working or isn’t reaching the right people
Growing but immature: Total GPT count rising but Experimental percentage also rising: creation is outpacing capability building

For L&D Teams

Map each tier to specific learning interventions:

Tier	Learning Intervention
Experimental → Functional	Prompt engineering fundamentals, ChatGPT best practices
Functional → Production	Advanced prompt patterns, tool/action integration, testing methodologies
Production (maintain)	Peer sharing, architecture reviews, advanced API courses

The key insight: don’t give everyone the same training. An Experimental builder needs different skills than a Functional builder who’s ready for Production.

For Governance Teams

The maturity model informs risk prioritization:

Experimental GPTs are low risk individually (simple prompts, no integrations) but high risk collectively (noise, duplicates, ungoverned data handling)
Functional GPTs need review: they’re handling real data but may lack proper safeguards
Production GPTs warrant full governance review: they’re critical enough that failures have real business impact

For Finance / ROI

The model directly answers the ROI question:

Count Production GPTs and estimate time savings per GPT
Identify Functional GPTs close to Production (score 3 with high usage): these are your lowest-hanging ROI fruit
Quantify Experimental waste: how many duplicates? How many abandoned?

Beyond Static Classification

The maturity model becomes powerful when tracked over time:

Builder Trajectories: Which builders are consistently creating Production-grade GPTs? These are your AI champions. Which are stuck at Experimental? They need targeted support.

Department Patterns: Is Engineering at 30% Production while Marketing is at 5%? That’s not a failure: it’s a data point for where to invest L&D resources.

Training Impact: Did the prompt engineering workshop actually move the needle? Compare maturity distributions before and after for workshop participants.

GPT Lifecycle: How long does it take for a GPT to move from Experimental to Functional? From Functional to Production? This is your AI capability velocity.

Implementation with AgentsOrg

AgentsOrg implements the GPT Portfolio Maturity Model automatically:

Discovery: Connects to OpenAI’s Compliance API to fetch your full GPT and Project inventory
Enrichment: Runs 9 semantic KPI analyses per GPT, including Sophistication Score
Classification: Maps each GPT to Experimental, Functional, or Production
Visualization: The Leader Dashboard shows maturity distribution with drill-down to individual GPTs
Action: Builder Recognition identifies your champions; L&D recommends specific courses for each builder based on actual gaps

Deploy in minutes. Demo mode included for evaluation:

git clone https://github.com/ori129/agentsorg.git
cd agentsorg && make up

The GPT Portfolio Maturity Model is part of AI Transformation Intelligence: a discipline we’re defining in the open. Contribute to the framework or discuss your organization’s experience.

The GPT Portfolio Maturity Model: From Experimental to Production

The Three Tiers

Tier 1: Experimental (Score 1-2)

Tier 2: Functional (Score 3)

Tier 3: Production (Score 4-5)

The Typical Enterprise Distribution

Measuring Sophistication

Using the Model

For AI Leaders

For L&D Teams

For Governance Teams

For Finance / ROI

Beyond Static Classification

Implementation with AgentsOrg

Try AgentsOrg Today