Planning Poker for AI and Machine Learning Projects: Estimating ML Development Work
Master machine learning estimation with Planning Poker. Learn frameworks for AI project planning, model development estimation, and communicating ML uncertainty to stakeholders.
Machine learning project estimation remains one of the most challenging aspects of AI development. Unlike traditional software engineering, where experienced teams can predict timelines with reasonable accuracy, ML projects introduce fundamental uncertainties that can derail even the most carefully planned sprints. According to recent industry research, about 52% of AI projects fail to reach production, and those that do often take significantly longer than initially estimated.
Planning Poker, the consensus-based agile estimation technique, offers a structured approach to tackling these challenges. However, applying Planning Poker to machine learning projects requires adapting the methodology to account for the unique characteristics of ML development: experiment-driven workflows, data dependencies, model uncertainty, and the inherent unpredictability of algorithm performance.
This guide explores how to effectively use Planning Poker for AI and machine learning projects, providing practical frameworks for estimating everything from data preparation to model deployment.
Why Machine Learning Estimation Is Fundamentally Different
Before diving into Planning Poker techniques, it's essential to understand why ML estimation diverges from traditional software development.
The Experiment-Driven Nature of ML Development
Traditional software development follows a relatively linear path: you write code, test it, and deploy it. Machine learning development is fundamentally experimental. You're not just writing a program—you're writing a program that generates a program that learns from data. This multi-layered complexity introduces uncertainty at every stage.
Consider a typical ML workflow:
- Data collection and preparation: Unknown data quality issues emerge during exploration
- Feature engineering: Initial features may prove ineffective, requiring iteration
- Model training: Algorithm performance on real data is unpredictable until tested
- Hyperparameter tuning: Optimal configurations require extensive experimentation
- Deployment: Production data distribution may differ from training data
Each stage contains unknown unknowns—challenges you won't discover until you're actively working on the problem.
Data Dependencies and Quality Issues
Machine learning models are only as good as their training data. Data quality challenges represent a primary source of estimation errors:
- Data availability: Assumed data sources may be incomplete or inaccessible
- Annotation requirements: Labeling data often takes 10-100x longer than initially estimated
- Data drift: Production data characteristics change over time, requiring model retraining
- Privacy constraints: Compliance requirements may limit usable data
These dependencies create estimation challenges that don't exist in traditional development. A user story might seem straightforward—"build a sentiment classifier"—but the actual work depends heavily on data quality, quantity, and accessibility.
The Iteration Tax
Machine learning requires extensive iteration. Your first model rarely achieves production-quality performance. Industry data suggests that successful ML projects typically go through 5-15 major model iterations before deployment, with each iteration potentially revealing new challenges.
This iteration tax means that ML estimation must account for:
- Baseline model development
- Performance improvement iterations
- Failed experiments and dead ends
- A/B testing and validation
- Model monitoring and maintenance
Adapting Planning Poker for ML Projects
Standard Planning Poker works well for traditional software, but ML projects require modified approaches. Here's how to adapt the technique for data science work.
Modified Story Point Scales for ML Work
The Fibonacci sequence (1, 2, 3, 5, 8, 13, 21) works for traditional development because the widening gaps between values reflect how uncertainty grows with the size of the work. For machine learning projects, consider these adaptations:
Research-Heavy Scale: For projects with significant uncertainty
- 1 point: Well-understood task with established approach (e.g., deploying a proven model)
- 3 points: Some uncertainty, standard ML techniques apply (e.g., training a classification model with clean data)
- 5 points: Moderate research required, multiple approaches possible (e.g., feature engineering for new domain)
- 8 points: Significant experimentation needed, outcome uncertain (e.g., novel architecture exploration)
- 13 points: High uncertainty, may require literature review or external consultation
- 21+ points: Requires breaking down into smaller experiments
Confidence Multipliers: Add confidence levels to story points
- High confidence (1.0x): Similar problems solved before, clean data available
- Medium confidence (1.5x): Some unknowns, but reasonable assumptions possible
- Low confidence (2.0x): Significant unknowns, experimental approach required
For example, a 5-point story with low confidence effectively becomes a 10-point story when planning sprints.
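The multiplier arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the dictionary keys and the function name `planning_points` are assumptions chosen for this example, while the multiplier values come from the list above.

```python
# Multipliers from the confidence levels listed above.
# The key names and function name are illustrative assumptions.
CONFIDENCE_MULTIPLIERS = {"high": 1.0, "medium": 1.5, "low": 2.0}


def planning_points(story_points: float, confidence: str) -> float:
    """Return the capacity to reserve for a story during sprint planning."""
    if confidence not in CONFIDENCE_MULTIPLIERS:
        raise ValueError(f"unknown confidence level: {confidence!r}")
    return story_points * CONFIDENCE_MULTIPLIERS[confidence]


# The 5-point, low-confidence story from the text.
print(planning_points(5, "low"))  # 10.0
```

Keeping the raw estimate and the multiplier separate lets the team revisit the confidence level later without re-voting on the story itself.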
Estimating Different ML Workflow Stages
Machine learning projects consist of distinct phases, each requiring different estimation approaches.
Data Preparation and Annotation (30-50% of ML Project Time)
Data work consistently takes longer than expected. When estimating data preparation:
Data Collection:
- 1-2 points: Accessing existing, well-documented datasets
- 3-5 points: Combining multiple data sources with known schemas
- 8-13 points: Web scraping, API integration, or creating new data pipelines
- 21+ points: Requires partnerships, legal agreements, or building data infrastructure
Data Cleaning:
- 2-3 points: Standard cleaning (missing values, duplicates) on structured data
- 5-8 points: Handling complex quality issues, outlier detection, data validation
- 13+ points: Unstructured data processing, complex transformations, or dealing with severe quality problems
Data Annotation:
- Base effort: Measure the hourly annotation rate (typically 50-200 items/hour depending on complexity) and derive the hours needed per 1,000 items
- Quality control overhead: Add 30-50% for validation and inter-annotator agreement
- Iteration buffer: Add 20% for guideline refinements and re-annotation
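As a rough sketch, the annotation guidelines above can be turned into an hours estimate. The function name, the default percentages (picked from within the ranges above), and the choice to apply both overheads additively to the base labeling time are all assumptions made for illustration.

```python
def annotation_hours(
    items: int,
    items_per_hour: float,
    qc_overhead: float = 0.4,       # assumed midpoint of the 30-50% range
    iteration_buffer: float = 0.2,  # the 20% buffer from the list above
) -> float:
    """Estimate total annotation hours.

    Both overheads are added to the base labeling time. That is an
    assumption: the guidelines only say to "add" each percentage.
    """
    if items_per_hour <= 0:
        raise ValueError("items_per_hour must be positive")
    base_hours = items / items_per_hour
    return base_hours * (1 + qc_overhead + iteration_buffer)


# 10,000 tickets at an assumed measured rate of 100 items/hour:
# 100 base hours plus 60% combined overhead.
hours = annotation_hours(10_000, items_per_hour=100)
print(f"{hours:.0f} hours")
```

The rate should come from a small timed pilot rather than a guess, since it dominates the result.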
Example estimation conversation:
Product Owner: "We need to label 10,000 customer support tickets for sentiment."
ML Engineer 1: "I'm thinking 8 points. Sentiment is subjective, we'll need clear guidelines."
ML Engineer 2: "I'd say 13. We'll need multiple annotators for agreement, and the guidelines will require iteration as edge cases emerge."
Data Scientist: "Agreed on 13. Plus, we should split this into two stories—one for guideline development (3 points) and one for annotation at scale (8 points with medium confidence)."
Feature Engineering (15-25% of ML Project Time)
Feature engineering directly impacts model performance but involves significant trial and error.
- 3 points: Standard features from clean, structured data (e.g., basic aggregations, one-hot encoding)
- 5 points: Domain-specific features requiring business logic (e.g., customer lifetime value calculations)
- 8 points: Time-series features, complex aggregations, or features from unstructured data
- 13 points: Novel feature creation requiring domain expertise or external data sources
- 21+ points: Advanced feature engineering (embeddings, graph features, automated feature learning)
Key estimation questions:
- How well do we understand the domain?
- Are features straightforward transformations or do they require experimentation?
- Do we have access to all necessary raw data?
- How will we validate feature importance?
Model Development and Training (20-30% of ML Project Time)
This is where ML estimation becomes most challenging due to performance uncertainty.
Initial Model Development:
- 2-3 points: Applying proven algorithms to well-understood problems (e.g., logistic regression on tabular data)
- 5 points: Standard deep learning architectures with established frameworks
- 8 points: Adapting research papers or techniques from adjacent domains
- 13+ points: Novel architectures or custom algorithm development
Model Training and Tuning:
- 2 points: Quick iteration cycles (minutes to hours per experiment)
- 5 points: Moderate training time (hours to overnight)
- 8 points: Long training cycles (days) requiring infrastructure optimization
- 13+ points: Distributed training, very large models, or complex hyperparameter spaces
Performance Improvement: Always allocate separate stories for iteration:
- Baseline model: Initial implementation (use above guidelines)
- First improvement iteration: 60% of baseline story points
- Subsequent iterations: Diminishing returns (40%, 30%, 20% of baseline)
- Production threshold: Estimate separately based on acceptable performance
Example:
Story: "Develop recommendation model with 70% precision"
Breakdown:
- Baseline collaborative filtering (5 points, expect ~50% precision)
- Content-based features iteration (3 points, target 60%)
- Hybrid model experimentation (2 points, target 70%)
- Total: 10 points with medium confidence
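The diminishing-returns rule used in this breakdown can be sketched as follows. The function name `iteration_plan` is hypothetical, and the fractions are the ones listed above (60%, then 40%, 30%, 20% of the baseline).

```python
# Fractions of the baseline estimate for each successive improvement
# iteration, taken from the guidelines above.
ITERATION_FRACTIONS = (0.6, 0.4, 0.3, 0.2)


def iteration_plan(baseline_points: float, iterations: int) -> list[float]:
    """Return story points for the baseline followed by each iteration."""
    if not 0 <= iterations <= len(ITERATION_FRACTIONS):
        raise ValueError(
            f"iterations must be between 0 and {len(ITERATION_FRACTIONS)}"
        )
    fractions = ITERATION_FRACTIONS[:iterations]
    return [baseline_points] + [baseline_points * f for f in fractions]


# The recommendation example: a 5-point baseline and two iterations.
plan = iteration_plan(5, iterations=2)
print(plan, "total:", sum(plan))
```

In practice a team would round each value to the nearest card in its deck; the raw numbers are shown here to keep the rule visible.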
MLOps and Deployment (15-20% of ML Project Time)
Deployment complexity varies dramatically based on infrastructure maturity.
- 3 points: Deploying to established ML platform with monitoring (e.g., SageMaker, Vertex AI)
- 5 points: Containerization, API development, basic monitoring setup
- 8 points: Custom serving infrastructure, real-time predictions, A/B testing framework
- 13 points: Building MLOps pipeline from scratch, complex integration requirements
- 21+ points: Distributed serving, edge deployment, or strict latency requirements
Don't forget ongoing maintenance:
- Model monitoring: 2-3 points per sprint for dashboard development and alert setup
- Retraining pipelines: 5-8 points for automated retraining infrastructure
- Model versioning: 3-5 points for experiment tracking and model registry
ML-Specific Story Templates
Using standardized templates helps teams estimate consistently. Here are templates for common ML stories:
Data Collection Story
As a [data scientist/ML engineer]
I need to [collect/acquire data from source]
So that [we can train/evaluate the model]
Acceptance Criteria:
- [ ] X records collected with required fields
- [ ] Data quality validated (completeness, accuracy)
- [ ] Data stored in accessible format
- [ ] Basic exploratory analysis completed
- [ ] Documentation of data schema and sources
Estimation Considerations:
- Data availability and access complexity
- Data volume and transfer time
- Quality assessment effort
- Legal/privacy review requirements
Model Training Story
As a [ML engineer/data scientist]
I need to [train a model for task]
So that [we can predict/classify/recommend X]
Acceptance Criteria:
- [ ] Baseline model trained with performance metrics
- [ ] Model performance documented and compared to benchmark
- [ ] Training pipeline code reviewed and documented
- [ ] Hyperparameters logged and reproducible
- [ ] Model artifacts saved with versioning
Estimation Considerations:
- Model complexity and training time
- Hyperparameter tuning scope
- Available compute resources
- Baseline performance uncertainty
- Number of experiments planned
Feature Engineering Story
As a [data scientist]
I need to [create features from data source]
So that [model can learn relationships]
Acceptance Criteria:
- [ ] Features engineered and validated
- [ ] Feature importance analysis completed
- [ ] Features added to training pipeline
- [ ] Documentation of feature logic
- [ ] Performance impact measured
Estimation Considerations:
- Feature complexity and domain knowledge required
- Data transformations needed
- Feature validation approach
- Expected number of features to test
- Pipeline integration complexity
Model Deployment Story
As a [ML engineer]
I need to [deploy model to environment]
So that [users/systems can access predictions]
Acceptance Criteria:
- [ ] Model deployed with API endpoint
- [ ] Latency and throughput requirements met
- [ ] Monitoring and logging configured
- [ ] Rollback procedure documented
- [ ] Load testing completed
Estimation Considerations:
- Infrastructure complexity
- Latency requirements
- Traffic volume expectations
- Integration points
- Monitoring sophistication needed
Communicating ML Uncertainty to Stakeholders
One of the biggest challenges in ML project planning is managing stakeholder expectations around uncertainty. Planning Poker sessions provide an opportunity for this crucial communication.
The Cone of Uncertainty for ML Projects
Traditional software projects follow a cone of uncertainty that narrows as development progresses. ML projects have a different shape—uncertainty may actually increase during early iterations as you discover data quality issues or performance challenges.
Visualize this during estimation:
- Sprints 1-2: High uncertainty (±100% variance is normal)
- Sprints 3-5: Uncertainty should decrease as baseline performance is established
- Sprints 6+: If uncertainty isn't decreasing, the problem may need rescoping
Using Planning Poker to Surface Risks
When estimates diverge widely during Planning Poker, it often indicates different team members perceive different risks. Use this as a discussion trigger:
Large Estimate Gaps (e.g., votes of 3 and 13):
- "What assumptions are different?"
- "What could go wrong that some team members foresee?"
- "Do we need a spike to reduce uncertainty?"
Consistently High Estimates:
- "Should we break this down differently?"
- "Is this really a research project disguised as a user story?"
- "Do we need external expertise or resources?"
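A facilitator could flag these two situations automatically. The sketch below is one possible approach under stated assumptions: the helper names, the "more than one card apart" rule, and the 13-point threshold are illustrative choices, not part of standard Planning Poker.

```python
FIBONACCI_DECK = (1, 2, 3, 5, 8, 13, 21)


def needs_discussion(votes: list[int], max_card_gap: int = 1) -> bool:
    """True when the lowest and highest votes are more than
    `max_card_gap` cards apart on the deck (assumed threshold).

    Raises ValueError if a vote is not a card in the deck.
    """
    positions = [FIBONACCI_DECK.index(v) for v in votes]
    return max(positions) - min(positions) > max_card_gap


def looks_like_research(votes: list[int], threshold: int = 13) -> bool:
    """True when every vote is at or above the threshold (assumed 13),
    which suggests splitting the story or running a spike."""
    return min(votes) >= threshold


print(needs_discussion([3, 13]))          # True: the cards are three steps apart
print(needs_discussion([5, 5, 8, 8]))     # False: adjacent cards
print(looks_like_research([13, 21, 13]))  # True
```

Measuring the gap in deck positions rather than raw points keeps the rule consistent across small and large stories.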
The Spike Story Approach
For highly uncertain work, use spike stories explicitly:
Spike: Investigate [technical approach/data quality/model feasibility]
Time-box: [1-3 days]
Success Criteria:
- [ ] Recommendation on approach with confidence level
- [ ] Rough estimate for full implementation
- [ ] Identified risks and unknowns
- [ ] Prototype or proof of concept (if applicable)
Spikes acknowledge uncertainty while providing structure. They should:
- Have strict time limits (typically 1-3 days)
- Produce actionable insights, not production code
- Lead to better estimates for follow-up stories
- Be valued at 2-5 points depending on time-box
Advanced Estimation Frameworks for Complex ML Projects
The Three-Estimate Approach
For critical ML initiatives, use three-point estimation:
- Optimistic (O): Everything goes well, data is clean, first approach works
- Most Likely (M): Realistic scenario with normal challenges
- Pessimistic (P): Multiple iterations needed, data issues, performance challenges
Weighted average: (O + 4M + P) / 6
Example:
Story: Implement fraud detection model
- Optimistic: 8 points (clean data, proven algorithms work)
- Most Likely: 13 points (some data cleaning, 2-3 model iterations)
- Pessimistic: 21 points (significant data issues, novel approach required)
- Weighted: (8 + 4 × 13 + 21) / 6 = 13.5 ≈ 13 points
Use the pessimistic estimate for risk planning even if you commit to the weighted average.
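The weighted average is easy to script so the arithmetic is not done by hand during a session. In this sketch the function names are hypothetical, and snapping the result to the nearest card is an assumed convention rather than part of the three-point formula.

```python
def three_point_estimate(
    optimistic: float, most_likely: float, pessimistic: float
) -> float:
    """Weighted average (O + 4M + P) / 6."""
    return (optimistic + 4 * most_likely + pessimistic) / 6


def nearest_card(value: float, deck=(1, 2, 3, 5, 8, 13, 21)) -> int:
    """Snap a weighted estimate to the closest card in the deck."""
    return min(deck, key=lambda card: abs(card - value))


# Fraud detection example: O = 8, M = 13, P = 21.
weighted = three_point_estimate(8, 13, 21)  # (8 + 52 + 21) / 6 = 13.5
print(weighted, nearest_card(weighted))     # 13.5 13
```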
Experimentation Budgets
Rather than estimating individual experiments, allocate experimentation budgets:
Per-Sprint Experiment Allocation:
- Junior data scientists: 5-8 points of experimentation capacity
- Senior data scientists: 8-13 points of experimentation capacity
- ML researchers: 13-21 points of experimentation capacity
Stories become "objectives" rather than fixed tasks:
- "Improve model recall from 60% to 75%" (8 points, includes multiple experiments)
- "Reduce inference latency below 100ms" (5 points, experimentation budget)
- "Explore alternative architectures for time-series prediction" (13 points, research spike)
This approach acknowledges that you can't predict which experiments will succeed, only how much time you can allocate to finding solutions.
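One way to check whether a set of objectives fits the team's experimentation budget is sketched below. The role keys and function names are made up for this example, and comparing against the low end of each range is a deliberately conservative assumption.

```python
# Per-sprint experimentation capacity ranges from the list above.
# Role keys are illustrative.
EXPERIMENT_CAPACITY = {
    "junior_data_scientist": (5, 8),
    "senior_data_scientist": (8, 13),
    "ml_researcher": (13, 21),
}


def team_experiment_capacity(roles: list[str]) -> tuple[int, int]:
    """Return the (low, high) experimentation capacity for a team."""
    low = sum(EXPERIMENT_CAPACITY[role][0] for role in roles)
    high = sum(EXPERIMENT_CAPACITY[role][1] for role in roles)
    return low, high


def fits_budget(objective_points: list[int], roles: list[str]) -> bool:
    """Conservative check: objectives must fit the low end of capacity."""
    low, _ = team_experiment_capacity(roles)
    return sum(objective_points) <= low


team = ["junior_data_scientist", "senior_data_scientist", "ml_researcher"]
print(team_experiment_capacity(team))  # (26, 42)
print(fits_budget([8, 5, 13], team))   # True: 26 points against a low end of 26
```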
The T-Shirt Sizing Alternative
For early-stage ML projects with high uncertainty, consider T-shirt sizing (S, M, L, XL) before transitioning to story points:
- Small: 1-3 days, well-understood task
- Medium: 1-2 weeks, standard ML techniques apply
- Large: 2-4 weeks, requires experimentation
- Extra Large: 1-2 months, research-level effort
T-shirt sizing works better when:
- The team is new to ML estimation
- The problem space is poorly understood
- You're doing discovery work before committing to implementation
Once you have baseline models and clearer requirements, transition to story points for more precise sprint planning.
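For a quick roadmap range during discovery, the size bands above can be summed. This sketch assumes 5 working days per week and 4 weeks per month, and it assumes the items are worked sequentially; the names are illustrative.

```python
# T-shirt sizes mapped to (min, max) working days.
# Assumes 5 working days per week and 4 weeks per month.
TSHIRT_DAYS = {
    "S": (1, 3),     # 1-3 days
    "M": (5, 10),    # 1-2 weeks
    "L": (10, 20),   # 2-4 weeks
    "XL": (20, 40),  # 1-2 months
}


def roadmap_range(sizes: list[str]) -> tuple[int, int]:
    """Return the (min, max) working days for items done one after another."""
    low = sum(TSHIRT_DAYS[size][0] for size in sizes)
    high = sum(TSHIRT_DAYS[size][1] for size in sizes)
    return low, high


print(roadmap_range(["S", "M", "L"]))  # (16, 33)
```

The width of the resulting range is itself useful to show stakeholders, since it makes the early-stage uncertainty explicit.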
Practical Tips for Running ML Planning Poker Sessions
Pre-Planning Preparation
Before the Planning Poker session:
- Data Assessment: Have data scientists review available data and document quality/quantity
- Literature Review: For novel problems, review relevant research papers and existing solutions
- Technical Spikes: Complete time-boxed investigations for high-uncertainty areas
- Success Metrics: Define clear, measurable performance thresholds
- Infrastructure Check: Verify compute resources and tooling availability
During the Session
Estimation Discussion Structure:
- Product Owner reads the story and acceptance criteria
- ML Lead provides technical context (data availability, algorithm options, expected challenges)
- Team asks clarifying questions focusing on:
  - Data quality and quantity assumptions
  - Performance requirements and how they'll be measured
  - Infrastructure and tools available
  - Similar past experiences
- Silent Estimation: Each team member selects a card
- Reveal: Discuss outliers first (highest and lowest estimates)
- Re-estimate: After discussion, vote again until consensus
Red Flags to Watch For:
- Estimates based on "best case" data assumptions
- Ignoring iteration and experimentation time
- Forgetting model monitoring and maintenance
- Underestimating deployment complexity
- No buffer for failed experiments
Post-Planning Tracking
ML projects require different tracking metrics:
Standard Metrics:
- Story points completed vs. planned
- Velocity trends over time
- Estimation accuracy (actual vs. estimated)
ML-Specific Metrics:
- Experiment success rate (% of experiments improving performance)
- Data quality issues discovered per sprint
- Model performance trends toward production threshold
- Time to production for ML features
Track these to improve future estimates. If you consistently underestimate data preparation by 50%, adjust your estimation discussions accordingly.
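A simple way to spot a systematic bias like this is to compare actual and estimated effort per work category. The sketch below assumes the team records an "actual" point value for each story in retrospectives; that practice, the sample data, and the function name are assumptions for illustration.

```python
from collections import defaultdict


def overrun_by_category(stories) -> dict[str, float]:
    """Return actual/estimated ratios per category.

    `stories` is an iterable of (category, estimated_points, actual_points).
    A ratio of 1.5 means the category took 50% more than estimated.
    Categories with a zero estimated total are skipped.
    """
    estimated = defaultdict(float)
    actual = defaultdict(float)
    for category, est, act in stories:
        estimated[category] += est
        actual[category] += act
    return {c: actual[c] / estimated[c] for c in estimated if estimated[c]}


# Made-up history for illustration only.
history = [
    ("data_prep", 5, 8),
    ("data_prep", 3, 4),
    ("modeling", 8, 8),
]
print(overrun_by_category(history))  # {'data_prep': 1.5, 'modeling': 1.0}
```

The ratio can feed the next session as a discussion prompt rather than a mechanical multiplier.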
Case Study: Estimating a Recommendation System
Let's walk through estimating a complete ML project using Planning Poker.
Project: Build a product recommendation system for e-commerce platform
Sprint 1: Data Foundation
Story 1.1: Collect user interaction data (clicks, purchases, views)
- Discussion: "We have analytics data, but it's in separate systems. Need to join and validate."
- Estimates: 3, 5, 5, 8 → Consensus at 5 points
- Confidence: Medium (existing data, but quality unknown)
Story 1.2: Create user and product feature sets
- Discussion: "Basic features are straightforward, but we might need category embeddings."
- Estimates: 5, 5, 8, 8 → Consensus at 8 points
- Confidence: Medium-Low (feature effectiveness uncertain)
Story 1.3: Exploratory data analysis and baseline metrics
- Discussion: "Need to understand current conversion rates and user behavior patterns."
- Estimates: 2, 3, 3, 5 → Consensus at 3 points
- Confidence: High (standard analysis)
Sprint 1 Total: 16 points (realistic for a team with 20-25 point velocity)
Sprint 2: Baseline Model
Story 2.1: Implement collaborative filtering baseline
- Discussion: "Standard approach, but we need to handle cold-start problem."
- Estimates: 5, 5, 8, 8 → Consensus at 8 points
- Confidence: High (proven technique)
Story 2.2: Evaluate baseline model performance
- Discussion: "Need offline metrics and A/B test preparation."
- Estimates: 3, 3, 5, 5 → Consensus at 3 points
- Confidence: High (standard evaluation)
Story 2.3: Set up model training pipeline
- Discussion: "Want automated retraining as new data arrives."
- Estimates: 5, 8, 8, 13 → After discussion, consensus at 8 points
- Confidence: Medium (infrastructure dependency)
Sprint 2 Total: 19 points
Sprint 3: Model Improvement
Story 3.1: Add content-based features to hybrid model
- Discussion: "Product descriptions and categories. NLP might be needed."
- Estimates: 5, 8, 8, 13 → Consensus at 8 points with caveat
- Caveat: "If NLP is required, we'll need a follow-up story"
- Confidence: Medium
Story 3.2: Hyperparameter tuning for production model
- Discussion: "We have compute resources, but this could take time."
- Estimates: 3, 5, 5, 8 → Consensus at 5 points
- Confidence: Medium
Story 3.3: Production deployment and monitoring
- Discussion: "We have existing ML infrastructure, so integration should be smooth."
- Estimates: 5, 5, 8, 8 → Consensus at 8 points
- Confidence: Medium-High
Sprint 3 Total: 21 points
Project Summary: 56 total points across 3 sprints for initial production deployment (approximately 6-9 weeks, assuming two- to three-week sprints and a team velocity of about 20 points).
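The totals in this case study can be checked with a short script. The consensus values come from the stories above; the function names and the 20-point velocity used for the sprint count are illustrative.

```python
import math

# Consensus points per story, as agreed in the case study.
CASE_STUDY = {
    "Sprint 1": [5, 8, 3],
    "Sprint 2": [8, 3, 8],
    "Sprint 3": [8, 5, 8],
}


def sprint_totals(plan: dict[str, list[int]]) -> dict[str, int]:
    """Sum the story points in each sprint."""
    return {name: sum(points) for name, points in plan.items()}


def sprints_needed(total_points: int, velocity: int) -> int:
    """Whole sprints required to deliver the total at a given velocity."""
    return math.ceil(total_points / velocity)


totals = sprint_totals(CASE_STUDY)
project_total = sum(totals.values())
print(totals)                             # {'Sprint 1': 16, 'Sprint 2': 19, 'Sprint 3': 21}
print(project_total)                      # 56
print(sprints_needed(project_total, 20))  # 3
```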
Key Learnings from This Estimation:
- Broke down the project into clear phases (data, baseline, improvement)
- Identified uncertainty early (data quality, feature effectiveness)
- Allocated appropriate time for deployment and monitoring
- Left buffer for experimentation in Sprint 3
- Team discussed specific technical challenges during estimation
Common Pitfalls and How to Avoid Them
Pitfall 1: Estimating Only Happy Path
Problem: Assuming data is perfect and first model works
Solution: Always add 30-50% buffer for data issues and failed experiments
Pitfall 2: Ignoring Model Monitoring
Problem: Forgetting ongoing maintenance and performance tracking
Solution: Include monitoring, retraining, and drift detection in initial estimates
Pitfall 3: Underestimating Data Annotation
Problem: Assuming annotation is straightforward manual work
Solution: Calculate per-item rates empirically, include quality control overhead
Pitfall 4: Treating Research as Development
Problem: Estimating research-level problems like standard features
Solution: Use spike stories for high-uncertainty work, then re-estimate
Pitfall 5: Not Accounting for Deployment Complexity
Problem: Thinking model training is the end goal
Solution: MLOps and deployment typically take 15-20% of total project time, and more when serving infrastructure must be built from scratch. Estimate accordingly
Conclusion: Embracing Uncertainty in ML Estimation
Machine learning project estimation will never be as precise as traditional software development—and that's okay. The goal of Planning Poker for ML projects isn't perfect predictions; it's creating shared understanding of uncertainty, surfacing risks early, and continuously improving estimation accuracy through retrospectives.
Key principles to remember:
- Acknowledge uncertainty explicitly: Use confidence levels and spike stories
- Estimate in phases: Data, baseline, iteration, deployment
- Track experiment outcomes: Learn from what works and what doesn't
- Communicate risks proactively: Use estimation discussions to surface concerns
- Iterate your estimation process: Retrospect on estimation accuracy and adjust
The most successful ML teams treat estimation as an ongoing learning process. They track their accuracy, discuss what went wrong (or right), and continuously refine their approach. Over time, this creates organizational knowledge that dramatically improves planning accuracy.
Planning Poker provides the structure for these crucial conversations. By bringing together product owners, data scientists, ML engineers, and stakeholders, you create shared understanding of what's really involved in machine learning development—not just the glamorous model training, but the unglamorous data cleaning, the frustrating debugging, and the essential ongoing maintenance.
Start your next ML Planning Poker session with this question: "What could go wrong?" The answers will lead to better estimates, more realistic timelines, and ultimately, more successful AI projects.
Ready to improve your ML team's estimation accuracy? Try Planning Poker with ML-specific card sets and features designed for data science teams. Start planning your next machine learning sprint with confidence.