172 scored conversations. 13 team members. From invisible gaps to measurable transformation.
Morgan Stanley faced a problem that exists at every major wealth management firm: their leadership team was having high-stakes conversations with financial advisors every day, but there was no visibility into the quality of those conversations. Were they asking the right questions? Were they diagnosing real blockers or accepting surface-level answers? Were they closing with specific commitments or vague next steps?
In a contact center, every call is recorded, scored, and analyzed. In wealth management, leadership conversations happen behind closed doors. The gap between what the organization believed was happening and what was actually happening was completely invisible.
The team needed a way to create that visibility without disrupting the relationship-driven culture that makes wealth management work. And they needed to do it fast enough to prove the concept before scaling it across the broader organization.
BlueEye Advisory designed a multi-phase AI coaching program that gave each team member a private, low-stakes environment to practice high-stakes conversations. The program spanned 6 weeks and evolved through three distinct scoring phases, each building on the behavioral patterns revealed by the last.
Every conversation was scored against behavioral criteria developed from real advisor interactions. The scenarios were calibrated to mirror actual advisor behaviors: resistance patterns, stall tactics, and the conversational nuances unique to wealth management. When team members pushed back that something felt unrealistic, the scenarios were refined immediately. Realism was non-negotiable.
Design principle: Scenarios were built to feel real, not to feel like a test. If the AI behaved in a way that would get someone hung up on in real life, it got rewritten. Realism drives engagement. Without it, the data means nothing.
Every conversation was scored across six behavioral dimensions, weighted by their impact on real-world conversation outcomes:
| Dimension | Weight | What It Measures |
|---|---|---|
| Opening & Framing | 15% | Clean, confident opening that establishes purpose and control |
| Discovery Depth | 25% | Open-ended questions that go beyond surface answers |
| Active Listening | 20% | Paraphrasing and reflection that demonstrates real understanding |
| Diagnostic Quality | 20% | Accurate identification of the real blocker, not the stated one |
| Resistance Handling | 10% | Composure under pressure without premature solutioning |
| Commitment & Next Steps | 10% | Specific, time-bound action items, not "let's circle back" |
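The weighted model above reduces to a simple weighted average. A minimal sketch, where only the weights come from the scorecard; the dictionary keys, function name, and example sub-scores are illustrative:

```python
# Composite score = weighted average of the six behavioral dimensions.
# Weights mirror the scorecard table; per-dimension scores run 0-100.
WEIGHTS = {
    "opening_framing": 0.15,
    "discovery_depth": 0.25,
    "active_listening": 0.20,
    "diagnostic_quality": 0.20,
    "resistance_handling": 0.10,
    "commitment_next_steps": 0.10,
}

def composite_score(dimension_scores: dict) -> float:
    """Weighted average of per-dimension scores (each 0-100)."""
    return round(sum(WEIGHTS[d] * s for d, s in dimension_scores.items()), 1)

# Illustrative sub-scores for one conversation (not real program data):
example = {
    "opening_framing": 60,
    "discovery_depth": 40,
    "active_listening": 35,
    "diagnostic_quality": 45,
    "resistance_handling": 70,
    "commitment_next_steps": 50,
}
print(composite_score(example))  # → 47.0
```

Because Discovery Depth carries a 25% weight, a weak discovery performance drags the composite down more than any other single dimension.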
What happened over 6 weeks tells the real story. The team's average score didn't climb in a straight line. It followed the pattern that every organization goes through when they first get honest measurement of something that was previously invisible.
Week 1 was humbling. The team averaged 3.4 out of 100. Not because they were bad at their jobs. Because they had never been measured against specific behavioral criteria before. The gap between "I think I do this well" and "the data says otherwise" hit hard.
By Week 2, heavy practice kicked in. 31 calls in a single week. Scores jumped to 26.2 as the coaching frameworks started taking hold. By Week 4, the team hit its peak: a 55.7 average across 45 conversations. That's a 16x improvement from where they started.
The Week 6 dip is actually the most telling data point. We introduced a harder module targeting a completely different conversation type. Scores dropped, but the team's response was immediate: they leaned in. That's the behavioral shift. The instinct to practice, not avoid.
The pattern that matters: Week 1 scores are always low. Always. The organizations that transform are the ones where the team treats that score as fuel, not as a reason to disengage. Morgan Stanley's team treated it as fuel.
The team average tells one story. The individual trajectories tell the real one. Across 13 team members, the range of improvement was dramatic. Here are four that stand out:
**Member A, 22 conversations.** Started near the bottom, finished at the very top of the team. This is what happens when someone with natural ability finally gets structured feedback on specific behaviors.

**Member B, 20 conversations.** Scored 0 on initial attempts because the approach didn't match the scorecard criteria at all. By the end, consistently scoring in the top tier. Complete behavioral reset.

**Member C, 19 conversations.** Consistent, methodical improvement week over week. No dramatic swings. Just steady gains from deliberate practice. The kind of trajectory that tells you the system is working.

**Member D, 24 conversations.** The highest volume on the team. Started with the highest baseline among the four, which means the habits were already partially formed. The challenge was refinement, not reinvention.
Not every trajectory was a success story. One team member completed only 6 conversations and regressed from 23.3 to 2.7. That data point is just as valuable. It tells leadership exactly where to focus, who needs a different kind of support, and whether the issue is skill, motivation, or something else entirely.
Across 172 scored conversations, six behavioral patterns emerged that no amount of observation or self-reporting would have surfaced:
- **Opening & Framing:** Vague or filler-heavy openings that immediately ceded conversational control. The first 30 seconds predicted the quality of the entire conversation.
- **Discovery Depth:** Closed-ended questions that shut down explanation and created shallow, transactional conversations instead of diagnostic ones.
- **Diagnostic Quality:** Accepting the first answer without going deeper. The real blocker almost never surfaces in the initial response.
- **Active Listening:** The single biggest gap. Generic acknowledgments ("got it," "makes sense") instead of paraphrasing that signals genuine understanding.
- **Resistance Handling:** Jumping to recommendations before earning the transition. Stronger conversations build the case before offering the answer.
- **Commitment & Next Steps:** "Let's reconnect next week" instead of "I'll send the analysis by Thursday and we'll review it together Friday at 2."
The program didn't stay static. As the team improved, the measurement framework evolved with them. Three distinct scorecard phases pushed the team progressively harder:
| Phase | Calls | Avg Score | Focus |
|---|---|---|---|
| Phase 1: Foundation | 62 | 28.8 | Baseline coaching conversation skills. Opening, discovery, active listening, next steps. |
| Phase 2: Refinement | 97 | 50.7 | Recalibrated scoring based on Phase 1 data. Higher bar for discovery depth and diagnostic quality. |
| Phase 3: Advanced | 13 | 55.6 | Entirely new conversation type: re-engagement after commitment stalls. Diagnosing inertia, handling skepticism. |
The jump from Phase 1 (28.8 avg) to Phase 2 (50.7 avg) represents genuine behavioral improvement. The Phase 3 score of 55.6 is especially telling: when the team faced a completely new, harder scenario type, they still outperformed their Phase 2 average. The skills had become transferable.
Across all 13 team members, 172 conversations, and 6 weeks of practice:
| Team Member | Calls | First 3 Avg | Last 3 Avg | Gain |
|---|---|---|---|---|
| Member A | 22 | 9.2 | 96.7 | +951% |
| Member B | 20 | 0.0 | 88.3 | From 0 |
| Member C | 19 | 17.5 | 81.1 | +363% |
| Member D | 24 | 30.8 | 78.2 | +154% |
| Member E | 17 | 32.5 | 66.8 | +106% |
| Member F | 13 | 46.6 | 57.2 | +23% |
| Member G | 11 | 6.7 | 40.8 | +509% |
| Member H | 16 | 0.0 | 39.2 | From 0 |
| Member I | 16 | 9.3 | 37.9 | +308% |
| Member J | 4 | 11.7 | 28.3 | +142% |
| Member K | 3 | 43.7 | 43.7 | 0% |
| Member L | 6 | 23.3 | 2.7 | -88% |
| Team Average | 172 | 18.8 | 53.6 | +185% |
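The Gain column is the percent change from each member's first-three to last-three conversation average. A minimal sketch (the function name is hypothetical; the sample inputs are rows from the table above):

```python
def gain_pct(first3_avg: float, last3_avg: float) -> str:
    """Percent change from first-3 to last-3 conversation average."""
    if first3_avg == 0:
        # Percent change is undefined from a zero baseline,
        # hence the "From 0" entries in the table.
        return "From 0"
    return f"{round((last3_avg - first3_avg) / first3_avg * 100):+d}%"

print(gain_pct(9.2, 96.7))   # Member A → +951%
print(gain_pct(0.0, 88.3))   # Member B → From 0
print(gain_pct(23.3, 2.7))   # Member L → -88%
```

Note that the metric rewards low starters: Member A's +951% reflects both a strong finish and a near-zero baseline, which is why the table pairs it with the raw first-3 and last-3 averages.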
Why this table matters: Every row is a coaching decision. The top four need advanced scenarios and stretch assignments. The middle group needs targeted reinforcement. The bottom two need a fundamentally different conversation. Without this data, every person on this team gets the same generic training. With it, every person gets exactly what they need.
The transformation wasn't just in the scores. It was in the team's relationship with practice itself. By Week 3, team members were voluntarily completing multiple sessions per week. They were comparing scores, sharing what worked, and asking for harder scenarios. The program created a culture of deliberate practice that didn't exist before.
Leadership distilled its coaching priorities down to three:

1. **Get the opening clean.** Practice the first 30 seconds until it's second nature. No filler, no rambling, clear purpose.
2. **Go one level deeper.** When someone gives you an answer, ask "what's driving that?" before moving on. The real insight is always underneath the first response.
3. **Summarize before moving forward.** Paraphrase what you heard before transitioning. This single behavior separates good conversations from great ones.
Based on the results, leadership initiated conversations to bring this approach to additional groups across the organization. The behavioral scoring model and coaching intelligence framework were designed from the start to scale beyond the initial team. The infrastructure, the scorecards, and the measurement methodology are all built to expand.
This engagement proved something that wealth management has struggled with for decades: you can bring the measurement rigor of a contact center to relationship-driven advisory conversations without sacrificing the human element. The AI doesn't replace the conversation. It makes the invisible visible so that coaching becomes precise, targeted, and measurable.
172 conversations in 6 weeks. A team that went from a 3.4 average to a 55.7 peak. Individual improvements of nearly 1,000%. And most importantly, a leadership team that can now answer the question every organization should be asking: "Are our people actually having the conversations we think they're having?"
The answer was more revealing than anyone expected. And that's the point.
Book a 30-minute diagnostic to explore how AI coaching intelligence can transform your team's performance.
Book a Diagnostic