Planning Poker for AI and Machine Learning Projects: Estimating ML Development Work

Machine learning project estimation remains one of the most challenging aspects of AI development. Unlike traditional software engineering where experienced teams can predict timelines with reasonable accuracy, ML projects introduce fundamental uncertainties that can derail even the most carefully planned sprints. According to recent industry research, nearly 52% of AI projects fail to reach production, and those that do often take significantly longer than initially estimated.

Planning Poker, the consensus-based agile estimation technique, offers a structured approach to tackling these challenges. However, applying Planning Poker to machine learning projects requires adapting the methodology to account for the unique characteristics of ML development: experiment-driven workflows, data dependencies, model uncertainty, and the inherent unpredictability of algorithm performance.

This guide explores how to effectively use Planning Poker for AI and machine learning projects, providing practical frameworks for estimating everything from data preparation to model deployment.

Why Machine Learning Estimation Is Fundamentally Different

Before diving into Planning Poker techniques, it's essential to understand why ML estimation diverges from traditional software development.

The Experiment-Driven Nature of ML Development

Traditional software development follows a relatively linear path: you write code, test it, and deploy it. Machine learning development is fundamentally experimental. You're not just writing a program—you're writing a program that generates a program that learns from data. This multi-layered complexity introduces uncertainty at every stage.

Consider a typical ML workflow:

Data collection and preparation: Unknown data quality issues emerge during exploration
Feature engineering: Initial features may prove ineffective, requiring iteration
Model training: Algorithm performance on real data is unpredictable until tested
Hyperparameter tuning: Optimal configurations require extensive experimentation
Deployment: Production data distribution may differ from training data

Each stage contains unknown unknowns—challenges you won't discover until you're actively working on the problem.

Data Dependencies and Quality Issues

Machine learning models are only as good as their training data. Data quality challenges represent a primary source of estimation errors:

Data availability: Assumed data sources may be incomplete or inaccessible
Annotation requirements: Labeling data often takes 10-100x longer than initially estimated
Data drift: Production data characteristics change over time, requiring model retraining
Privacy constraints: Compliance requirements may limit usable data

These dependencies create estimation challenges that don't exist in traditional development. A user story might seem straightforward—"build a sentiment classifier"—but the actual work depends heavily on data quality, quantity, and accessibility.

The Iteration Tax

Machine learning requires extensive iteration. Your first model rarely achieves production-quality performance. Industry data suggests that successful ML projects typically go through 5-15 major model iterations before deployment, with each iteration potentially revealing new challenges.

This iteration tax means that ML estimation must account for:

Baseline model development
Performance improvement iterations
Failed experiments and dead ends
A/B testing and validation
Model monitoring and maintenance

Adapting Planning Poker for ML Projects

Standard Planning Poker works well for traditional software, but ML projects require modified approaches. Here's how to adapt the technique for data science work.

Modified Story Point Scales for ML Work

The Fibonacci sequence (1, 2, 3, 5, 8, 13, 21) works for traditional development because the widening gaps between values reflect how uncertainty grows with the size of a task. For machine learning projects, consider these adaptations:

Research-Heavy Scale: For projects with significant uncertainty

1 point: Well-understood task with established approach (e.g., deploying a proven model)
3 points: Some uncertainty, standard ML techniques apply (e.g., training a classification model with clean data)
5 points: Moderate research required, multiple approaches possible (e.g., feature engineering for new domain)
8 points: Significant experimentation needed, outcome uncertain (e.g., novel architecture exploration)
13 points: High uncertainty, may require literature review or external consultation
21+ points: Requires breaking down into smaller experiments

Confidence Multipliers: Add confidence levels to story points

High confidence (1.0x): Similar problems solved before, clean data available
Medium confidence (1.5x): Some unknowns, but reasonable assumptions possible
Low confidence (2.0x): Significant unknowns, experimental approach required

For example, a 5-point story with low confidence effectively becomes a 10-point story when planning sprints.
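
The multiplier arithmetic can be sketched in a few lines of Python. This is a minimal illustration rather than part of any Planning Poker tool: the names `CONFIDENCE_MULTIPLIERS` and `effective_points` are hypothetical, and the multiplier values are the ones listed above.

```python
# Multipliers taken from the confidence levels listed above.
CONFIDENCE_MULTIPLIERS = {"high": 1.0, "medium": 1.5, "low": 2.0}


def effective_points(story_points: float, confidence: str) -> float:
    """Return the planning weight of a story after applying its confidence multiplier."""
    return story_points * CONFIDENCE_MULTIPLIERS[confidence]


# The 5-point, low-confidence story from the example above.
print(effective_points(5, "low"))  # 10.0
```

Keeping the raw estimate and the multiplier separate lets the team revisit the confidence level after a spike without re-voting on the story itself.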

Estimating Different ML Workflow Stages

Machine learning projects consist of distinct phases, each requiring different estimation approaches.

Data Preparation and Annotation (30-50% of ML Project Time)

Data work consistently takes longer than expected. When estimating data preparation:

Data Collection:

1-2 points: Accessing existing, well-documented datasets
3-5 points: Combining multiple data sources with known schemas
8-13 points: Web scraping, API integration, or creating new data pipelines
21+ points: Requires partnerships, legal agreements, or building data infrastructure

Data Cleaning:

2-3 points: Standard cleaning (missing values, duplicates) on structured data
5-8 points: Handling complex quality issues, outlier detection, data validation
13+ points: Unstructured data processing, complex transformations, or dealing with severe quality problems

Data Annotation:

Base effort: Divide the item count by a measured hourly annotation rate (typically 50-200 items/hour depending on complexity)
Quality control overhead: Add 30-50% for validation and inter-annotator agreement
Iteration buffer: Add 20% for guideline refinements and re-annotation
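
The annotation guidelines above translate into a simple effort calculation. The sketch below is one possible reading, not a prescribed formula: it assumes both percentages are applied to the base hours (added, not compounded), and the function name `annotation_hours` and the `annotators_per_item` parameter are hypothetical.

```python
def annotation_hours(
    items: int,
    items_per_hour: float,
    qc_overhead: float = 0.4,       # 30-50% range above; midpoint used as an assumption
    iteration_buffer: float = 0.2,  # 20% for guideline refinements and re-annotation
    annotators_per_item: int = 1,   # assumption: raise this when measuring agreement
) -> float:
    """Estimate total annotation hours from a measured per-hour labeling rate."""
    base_hours = items * annotators_per_item / items_per_hour
    return base_hours * (1 + qc_overhead + iteration_buffer)


# 10,000 items at an assumed rate of 100 items/hour, single annotator.
print(round(annotation_hours(10_000, 100), 1))  # 160.0
```

Measure `items_per_hour` on a small pilot batch before the session; the result is in hours, so the team still has to map it onto its own story point scale.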

Example estimation conversation:

Product Owner: "We need to label 10,000 customer support tickets for sentiment."

ML Engineer 1: "I'm thinking 8 points. Sentiment is subjective, we'll need clear guidelines."

ML Engineer 2: "I'd say 13. We'll need multiple annotators for agreement, and the guidelines will require iteration as edge cases emerge."

Data Scientist: "Agreed on 13. Plus, we should split this into two stories—one for guideline development (3 points) and one for annotation at scale (8 points with medium confidence)."

Feature Engineering (15-25% of ML Project Time)

Feature engineering directly impacts model performance but involves significant trial and error.

3 points: Standard features from clean, structured data (e.g., basic aggregations, one-hot encoding)
5 points: Domain-specific features requiring business logic (e.g., customer lifetime value calculations)
8 points: Time-series features, complex aggregations, or features from unstructured data
13 points: Novel feature creation requiring domain expertise or external data sources
21+ points: Advanced feature engineering (embeddings, graph features, automated feature learning)

Key estimation questions:

How well do we understand the domain?
Are features straightforward transformations or do they require experimentation?
Do we have access to all necessary raw data?
How will we validate feature importance?

Model Development and Training (20-30% of ML Project Time)

This is where ML estimation becomes most challenging due to performance uncertainty.

Initial Model Development:

2-3 points: Applying proven algorithms to well-understood problems (e.g., logistic regression on tabular data)
5 points: Standard deep learning architectures with established frameworks
8 points: Adapting research papers or techniques from adjacent domains
13+ points: Novel architectures or custom algorithm development

Model Training and Tuning:

2 points: Quick iteration cycles (minutes to hours per experiment)
5 points: Moderate training time (hours to overnight)
8 points: Long training cycles (days) requiring infrastructure optimization
13+ points: Distributed training, very large models, or complex hyperparameter spaces

Performance Improvement: Always allocate separate stories for iteration:

Baseline model: Initial implementation (use above guidelines)
First improvement iteration: 60% of baseline story points
Subsequent iterations: Diminishing returns (40%, 30%, 20% of baseline)
Production threshold: Estimate separately based on acceptable performance

Example:

Story: "Develop recommendation model with 70% precision"

Breakdown:

Baseline collaborative filtering (5 points, expect ~50% precision)

Content-based features iteration (3 points, target 60%)

Hybrid model experimentation (2 points, target 70%)

Total: 10 points with medium confidence
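
The diminishing-returns rule can be written down as a small helper. This is a sketch under stated assumptions: the fractions are the 60%, 40%, 30%, and 20% figures given above, rounding to whole points is an assumption, and `iteration_plan` is a hypothetical name.

```python
# Share of the baseline estimate assigned to each successive improvement iteration.
ITERATION_FRACTIONS = [0.6, 0.4, 0.3, 0.2]


def iteration_plan(baseline_points: int, iterations: int) -> list[int]:
    """Return rounded point estimates for the first `iterations` improvement stories."""
    return [round(baseline_points * f) for f in ITERATION_FRACTIONS[:iterations]]


baseline = 5
improvements = iteration_plan(baseline, 2)
print(improvements)                   # [3, 2]
print(baseline + sum(improvements))   # 10
```

The output matches the recommendation example: a 5-point baseline followed by 3-point and 2-point iterations, for 10 points in total.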

MLOps and Deployment (15-20% of ML Project Time)

Deployment complexity varies dramatically based on infrastructure maturity.

3 points: Deploying to established ML platform with monitoring (e.g., SageMaker, Vertex AI)
5 points: Containerization, API development, basic monitoring setup
8 points: Custom serving infrastructure, real-time predictions, A/B testing framework
13 points: Building MLOps pipeline from scratch, complex integration requirements
21+ points: Distributed serving, edge deployment, or strict latency requirements

Don't forget ongoing maintenance:

Model monitoring: 2-3 points per sprint for dashboard development and alert setup
Retraining pipelines: 5-8 points for automated retraining infrastructure
Model versioning: 3-5 points for experiment tracking and model registry

ML-Specific Story Templates

Using standardized templates helps teams estimate consistently. Here are templates for common ML stories:

Data Collection Story

As a [data scientist/ML engineer]
I need to [collect/acquire data from source]
So that [we can train/evaluate the model]

Acceptance Criteria:
- [ ] X records collected with required fields
- [ ] Data quality validated (completeness, accuracy)
- [ ] Data stored in accessible format
- [ ] Basic exploratory analysis completed
- [ ] Documentation of data schema and sources

Estimation Considerations:
- Data availability and access complexity
- Data volume and transfer time
- Quality assessment effort
- Legal/privacy review requirements

Model Training Story

As a [ML engineer/data scientist]
I need to [train a model for task]
So that [we can predict/classify/recommend X]

Acceptance Criteria:
- [ ] Baseline model trained with performance metrics
- [ ] Model performance documented and compared to benchmark
- [ ] Training pipeline code reviewed and documented
- [ ] Hyperparameters logged and reproducible
- [ ] Model artifacts saved with versioning

Estimation Considerations:
- Model complexity and training time
- Hyperparameter tuning scope
- Available compute resources
- Baseline performance uncertainty
- Number of experiments planned

Feature Engineering Story

As a [data scientist]
I need to [create features from data source]
So that [model can learn relationships]

Acceptance Criteria:
- [ ] Features engineered and validated
- [ ] Feature importance analysis completed
- [ ] Features added to training pipeline
- [ ] Documentation of feature logic
- [ ] Performance impact measured

Estimation Considerations:
- Feature complexity and domain knowledge required
- Data transformations needed
- Feature validation approach
- Expected number of features to test
- Pipeline integration complexity

Model Deployment Story

As a [ML engineer]
I need to [deploy model to environment]
So that [users/systems can access predictions]

Acceptance Criteria:
- [ ] Model deployed with API endpoint
- [ ] Latency and throughput requirements met
- [ ] Monitoring and logging configured
- [ ] Rollback procedure documented
- [ ] Load testing completed

Estimation Considerations:
- Infrastructure complexity
- Latency requirements
- Traffic volume expectations
- Integration points
- Monitoring sophistication needed

Communicating ML Uncertainty to Stakeholders

One of the biggest challenges in ML project planning is managing stakeholder expectations around uncertainty. Planning Poker sessions provide an opportunity for this crucial communication.

The Cone of Uncertainty for ML Projects

Traditional software projects follow a cone of uncertainty that narrows as development progresses. ML projects have a different shape—uncertainty may actually increase during early iterations as you discover data quality issues or performance challenges.

Visualize this during estimation:

Sprint 1-2: High uncertainty (±100% variance is normal)
Sprint 3-5: Uncertainty should decrease as baseline performance is established
Sprint 6+: If uncertainty isn't decreasing, the problem may need rescoping

Using Planning Poker to Surface Risks

When estimates diverge widely during Planning Poker, it often indicates different team members perceive different risks. Use this as a discussion trigger:

Large Estimate Gaps (e.g., votes of 3 and 13):

"What assumptions are different?"
"What could go wrong that some team members foresee?"
"Do we need a spike to reduce uncertainty?"

Consistently High Estimates:

"Should we break this down differently?"
"Is this really a research project disguised as a user story?"
"Do we need external expertise or resources?"

The Spike Story Approach

For highly uncertain work, use spike stories explicitly:

Spike: Investigate [technical approach/data quality/model feasibility]

Time-box: [1-3 days]

Success Criteria:
- [ ] Recommendation on approach with confidence level
- [ ] Rough estimate for full implementation
- [ ] Identified risks and unknowns
- [ ] Prototype or proof of concept (if applicable)

Spikes acknowledge uncertainty while providing structure. They should:

Have strict time limits (typically 1-3 days)
Produce actionable insights, not production code
Lead to better estimates for follow-up stories
Be valued at 2-5 points depending on time-box

Advanced Estimation Frameworks for Complex ML Projects

The Three-Estimate Approach

For critical ML initiatives, use three-point estimation:

Optimistic (O): Everything goes well, data is clean, first approach works
Most Likely (M): Realistic scenario with normal challenges
Pessimistic (P): Multiple iterations needed, data issues, performance challenges

Weighted average: (O + 4M + P) / 6

Example:

Story: Implement fraud detection model

Optimistic: 8 points (clean data, proven algorithms work)

Most Likely: 13 points (some data cleaning, 2-3 model iterations)

Pessimistic: 21 points (significant data issues, novel approach required)

Weighted: (8 + 4 × 13 + 21) / 6 = 13.5 ≈ 13 points

Use the pessimistic estimate for risk planning even if you commit to the weighted average.
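
The weighted average is easy to compute by hand, but a short helper avoids arithmetic slips during a session. In this sketch the function names and the step of snapping the result to the nearest card are assumptions added for illustration; only the (O + 4M + P) / 6 formula comes from the text above.

```python
CARDS = [1, 2, 3, 5, 8, 13, 21]


def three_point_estimate(optimistic: float, most_likely: float, pessimistic: float) -> float:
    """Weighted average that counts the most likely case four times."""
    return (optimistic + 4 * most_likely + pessimistic) / 6


def nearest_card(value: float) -> int:
    """Snap a raw estimate to the closest card in the deck."""
    return min(CARDS, key=lambda card: abs(card - value))


# Fraud detection story: O = 8, M = 13, P = 21.
weighted = three_point_estimate(8, 13, 21)
print(weighted)                # 13.5
print(nearest_card(weighted))  # 13
```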

Experimentation Budgets

Rather than estimating individual experiments, allocate experimentation budgets:

Per-Sprint Experiment Allocation:

Junior data scientists: 5-8 points of experimentation capacity
Senior data scientists: 8-13 points of experimentation capacity
ML researchers: 13-21 points of experimentation capacity

Stories become "objectives" rather than fixed tasks:

"Improve model recall from 60% to 75%" (8 points, includes multiple experiments)
"Reduce inference latency below 100ms" (5 points, experimentation budget)
"Explore alternative architectures for time-series prediction" (13 points, research spike)

This approach acknowledges that you can't predict which experiments will succeed, only how much time you can allocate to finding solutions.
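
A quick capacity check shows whether a sprint's objectives fit the team's experimentation budget. The sketch below uses the per-role ranges and the three example objectives from above; the team composition, the names `CAPACITY_RANGES` and `team_budget`, and the choice to compare against the low end of the range are all assumptions.

```python
# Per-sprint experimentation capacity (low, high) in points, from the ranges above.
CAPACITY_RANGES = {"junior": (5, 8), "senior": (8, 13), "researcher": (13, 21)}


def team_budget(roles: list[str]) -> tuple[int, int]:
    """Sum the low and high ends of each team member's experimentation capacity."""
    low = sum(CAPACITY_RANGES[role][0] for role in roles)
    high = sum(CAPACITY_RANGES[role][1] for role in roles)
    return low, high


objectives = {
    "Improve model recall from 60% to 75%": 8,
    "Reduce inference latency below 100ms": 5,
    "Explore alternative architectures for time-series prediction": 13,
}

# Hypothetical team: one junior, one senior, one researcher.
low, high = team_budget(["junior", "senior", "researcher"])
requested = sum(objectives.values())
print(low, high, requested)  # 26 42 26
print(requested <= low)      # True
```

Comparing against the low end is the conservative choice; a team with a stable track record might plan against the midpoint instead.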

The T-Shirt Sizing Alternative

For early-stage ML projects with high uncertainty, consider T-shirt sizing (S, M, L, XL) before transitioning to story points:

Small: 1-3 days, well-understood task
Medium: 1-2 weeks, standard ML techniques apply
Large: 2-4 weeks, requires experimentation
Extra Large: 1-2 months, research-level effort

T-shirt sizing works better when:

The team is new to ML estimation
The problem space is poorly understood
You're doing discovery work before committing to implementation

Once you have baseline models and clearer requirements, transition to story points for more precise sprint planning.

Practical Tips for Running ML Planning Poker Sessions

Pre-Planning Preparation

Before the Planning Poker session:

Data Assessment: Have data scientists review available data and document quality/quantity
Literature Review: For novel problems, review relevant research papers and existing solutions
Technical Spikes: Complete time-boxed investigations for high-uncertainty areas
Success Metrics: Define clear, measurable performance thresholds
Infrastructure Check: Verify compute resources and tooling availability

During the Session

Estimation Discussion Structure:

Product Owner reads the story and acceptance criteria
ML Lead provides technical context (data availability, algorithm options, expected challenges)
Team asks clarifying questions focusing on:
- Data quality and quantity assumptions
- Performance requirements and how they'll be measured
- Infrastructure and tools available
- Similar past experiences
Silent Estimation: Each team member selects a card
Reveal: Discuss outliers first (highest and lowest estimates)
Re-estimate: After discussion, vote again until consensus
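
The reveal and re-estimate steps can be supported by a small helper that reports who should speak first and whether another vote is needed. This is a sketch only: the rule that votes more than one card apart trigger a re-vote is an assumption, and `card_gap` and `review_round` are hypothetical names.

```python
CARDS = [1, 2, 3, 5, 8, 13, 21]


def card_gap(votes: list[int]) -> int:
    """Number of card steps between the lowest and highest vote."""
    positions = [CARDS.index(vote) for vote in votes]
    return max(positions) - min(positions)


def review_round(votes: list[int], max_gap: int = 1) -> dict:
    """Summarize a reveal: the outliers to discuss and whether to vote again."""
    gap = card_gap(votes)
    return {
        "lowest": min(votes),
        "highest": max(votes),
        "gap": gap,
        "revote": gap > max_gap,  # assumption: adjacent cards count as near-consensus
    }


print(review_round([5, 8, 8, 13]))
# {'lowest': 5, 'highest': 13, 'gap': 2, 'revote': True}
```

Measuring the gap in card steps rather than raw points keeps the check consistent across the scale, since 13 versus 21 is one step just as 2 versus 3 is.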

Red Flags to Watch For:

Estimates based on "best case" data assumptions
Ignoring iteration and experimentation time
Forgetting model monitoring and maintenance
Underestimating deployment complexity
No buffer for failed experiments

Post-Planning Tracking

ML projects require different tracking metrics:

Standard Metrics:

Story points completed vs. planned
Velocity trends over time
Estimation accuracy (actual vs. estimated)

ML-Specific Metrics:

Experiment success rate (% of experiments improving performance)
Data quality issues discovered per sprint
Model performance trends toward production threshold
Time to production for ML features

Track these to improve future estimates. If you consistently underestimate data preparation by 50%, adjust your estimation discussions accordingly.
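
Two of these metrics are simple ratios that can be logged at every retrospective. The sketch below assumes estimates and actuals are recorded in the same unit; the function names are hypothetical.

```python
def estimation_error(estimated_points: float, actual_points: float) -> float:
    """Relative error; positive values mean the work was underestimated."""
    return (actual_points - estimated_points) / estimated_points


def experiment_success_rate(improved: list[bool]) -> float:
    """Share of experiments that improved the target metric."""
    return sum(improved) / len(improved) if improved else 0.0


# A data preparation story estimated at 8 points that took 12.
print(estimation_error(8, 12))  # 0.5

# Four experiments this sprint, two of which improved performance.
print(experiment_success_rate([True, False, False, True]))  # 0.5
```

An error of 0.5 corresponds to the 50% underestimate mentioned above.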

Case Study: Estimating a Recommendation System

Let's walk through estimating a complete ML project using Planning Poker.

Project: Build a product recommendation system for e-commerce platform

Sprint 1: Data Foundation

Story 1.1: Collect user interaction data (clicks, purchases, views)

Discussion: "We have analytics data, but it's in separate systems. Need to join and validate."
Estimates: 3, 5, 5, 8 → Consensus at 5 points
Confidence: Medium (existing data, but quality unknown)

Story 1.2: Create user and product feature sets

Discussion: "Basic features are straightforward, but we might need category embeddings."
Estimates: 5, 5, 8, 8 → Consensus at 8 points
Confidence: Medium-Low (feature effectiveness uncertain)

Story 1.3: Exploratory data analysis and baseline metrics

Discussion: "Need to understand current conversion rates and user behavior patterns."
Estimates: 2, 3, 3, 5 → Consensus at 3 points
Confidence: High (standard analysis)

Sprint 1 Total: 16 points (realistic for a team with 20-25 point velocity)

Sprint 2: Baseline Model

Story 2.1: Implement collaborative filtering baseline

Discussion: "Standard approach, but we need to handle cold-start problem."
Estimates: 5, 5, 8, 8 → Consensus at 8 points
Confidence: High (proven technique)

Story 2.2: Evaluate baseline model performance

Discussion: "Need offline metrics and A/B test preparation."
Estimates: 3, 3, 5, 5 → Consensus at 3 points
Confidence: High (standard evaluation)

Story 2.3: Set up model training pipeline

Discussion: "Want automated retraining as new data arrives."
Estimates: 5, 8, 8, 13 → After discussion, consensus at 8 points
Confidence: Medium (infrastructure dependency)

Sprint 2 Total: 19 points

Sprint 3: Model Improvement

Story 3.1: Add content-based features to hybrid model

Discussion: "Product descriptions and categories. NLP might be needed."
Estimates: 5, 8, 8, 13 → Consensus at 8 points with caveat
Caveat: "If NLP is required, we'll need a follow-up story"
Confidence: Medium

Story 3.2: Hyperparameter tuning for production model

Discussion: "We have compute resources, but this could take time."
Estimates: 3, 5, 5, 8 → Consensus at 5 points
Confidence: Medium

Story 3.3: Production deployment and monitoring

Discussion: "We have existing ML infrastructure, so integration should be smooth."
Estimates: 5, 5, 8, 8 → Consensus at 8 points
Confidence: Medium-High

Sprint 3 Total: 21 points

Project Summary: 56 total points across 3 sprints for initial production deployment (approximately 6-9 weeks for a team with 20 point velocity).
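
The sprint totals can be verified with a few lines of Python. The consensus values below are copied from the case study; the variable names and the 20-point velocity used for the sprint count are illustrative.

```python
import math

# Consensus estimates per story, as agreed in the case study.
sprints = {
    "Sprint 1": [5, 8, 3],
    "Sprint 2": [8, 3, 8],
    "Sprint 3": [8, 5, 8],
}

totals = {name: sum(points) for name, points in sprints.items()}
project_total = sum(totals.values())
sprints_needed = math.ceil(project_total / 20)  # 20-point velocity

print(totals)          # {'Sprint 1': 16, 'Sprint 2': 19, 'Sprint 3': 21}
print(project_total)   # 56
print(sprints_needed)  # 3
```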

Key Learnings from This Estimation:

Broke down the project into clear phases (data, baseline, improvement)
Identified uncertainty early (data quality, feature effectiveness)
Allocated appropriate time for deployment and monitoring
Flagged the NLP risk in Sprint 3 with an explicit caveat and a possible follow-up story
Team discussed specific technical challenges during estimation

Common Pitfalls and How to Avoid Them

Pitfall 1: Estimating Only Happy Path

Problem: Assuming the data is perfect and the first model works
Solution: Always add a 30-50% buffer for data issues and failed experiments

Pitfall 2: Ignoring Model Monitoring

Problem: Forgetting ongoing maintenance and performance tracking
Solution: Include monitoring, retraining, and drift detection in initial estimates

Pitfall 3: Underestimating Data Annotation

Problem: Assuming annotation is straightforward manual work
Solution: Calculate per-item rates empirically, include quality control overhead

Pitfall 4: Treating Research as Development

Problem: Estimating research-level problems like standard features
Solution: Use spike stories for high-uncertainty work, then re-estimate

Pitfall 5: Not Accounting for Deployment Complexity

Problem: Thinking model training is the end goal
Solution: MLOps and deployment typically take 15-20% of total project time, and more when infrastructure is immature. Estimate accordingly

Conclusion: Embracing Uncertainty in ML Estimation

Machine learning project estimation will never be as precise as traditional software development—and that's okay. The goal of Planning Poker for ML projects isn't perfect predictions; it's creating shared understanding of uncertainty, surfacing risks early, and continuously improving estimation accuracy through retrospectives.

Key principles to remember:

Acknowledge uncertainty explicitly: Use confidence levels and spike stories
Estimate in phases: Data, baseline, iteration, deployment
Track experiment outcomes: Learn from what works and what doesn't
Communicate risks proactively: Use estimation discussions to surface concerns
Iterate your estimation process: Retrospect on estimation accuracy and adjust

The most successful ML teams treat estimation as an ongoing learning process. They track their accuracy, discuss what went wrong (or right), and continuously refine their approach. Over time, this creates organizational knowledge that dramatically improves planning accuracy.

Planning Poker provides the structure for these crucial conversations. By bringing together product owners, data scientists, ML engineers, and stakeholders, you create shared understanding of what's really involved in machine learning development—not just the glamorous model training, but the unglamorous data cleaning, the frustrating debugging, and the essential ongoing maintenance.

Start your next ML Planning Poker session with this question: "What could go wrong?" The answers will lead to better estimates, more realistic timelines, and ultimately, more successful AI projects.
