Debugging Flaky Tests with AI: Playwright + Copilot Guide (2026)
This is Part 2 of my AI Playwright Testing series. If you haven’t read Part 1: AI Playwright Testing with GitHub Copilot & MCP yet, I’d start there — this post assumes you already have Copilot and an MCP server wired into your Playwright project. If you don’t, go set that up first and come back.
Why I’m Writing About Debugging Flaky Tests With AI
In Part 1, I was upfront that I’m not an AI expert. I’m an SDET who decided to stop reading about AI-assisted testing and actually use it on real projects. That post covered getting GitHub Copilot and an MCP server talking to my Playwright suite. This post is the next, messier step: debugging flaky tests with AI — what happens when you point that setup at the thing every automation engineer secretly dreads.
If you’ve maintained any automation suite for more than a few months, you know the feeling. A test fails. You re-run it. It passes. You shrug, mark it as “flaky,” and move on — until it fails again during a release demo, in front of the one person you didn’t want watching the pipeline turn red. Three years ago, I would have spent an afternoon scrolling through console logs trying to spot the millisecond where a click happened before a modal finished animating. Today, I spent that same debugging session having a conversation with an AI assistant that had direct access to my trace files, my test code, and my CI logs through MCP — and it found the root cause in about six minutes.
This post is my honest documentation of debugging flaky tests with AI — what actually works, what GitHub Copilot and MCP get right, where they get it wrong, and a repeatable workflow you can apply to your own Playwright suite starting today. I’m not going to pretend AI magically fixes flaky tests. It doesn’t. But it changes the economics of debugging them enough that I think every SDET should be doing some version of this by the end of 2026.
What You’ll Walk Away With
- A clear framework for classifying flaky tests by root cause, before you even open an AI tool
- A repeatable process for debugging flaky tests with AI that you can apply to your own Playwright suite this week
- A step-by-step workflow for using GitHub Copilot and MCP to diagnose flakiness instead of guessing
- Specific, copy-paste prompt patterns that actually produce useful output (and which ones waste your time)
- Real code examples for fixing the most common categories of flaky Playwright tests
- A way to build a simple MCP-powered flaky test triage tool for your own repo
- An honest list of where AI assistance currently fails, so you don’t over-trust it
A Quick Locator Resilience Reference
Since locator choice comes up repeatedly throughout this post as both a cause of and fix for flakiness, here’s a quick reference table I keep handy, ranking common Playwright locator strategies by general resilience to both timing issues and markup changes:
| Locator Strategy | Resilience | Notes |
|---|---|---|
| `getByRole()` | High | Tied to accessible semantics, survives most markup/CSS changes |
| `getByTestId()` | High | Stable by design, but requires discipline to maintain test IDs in the app code |
| `getByText()` | Medium | Breaks on copy changes, fragile across locales |
| `getByLabel()` | High for forms | Excellent for form fields specifically, tied to accessible labeling |
| CSS class selector | Low | Breaks easily on styling refactors unrelated to functionality |
| XPath with positional indexing | Very low | Breaks on almost any markup reordering; avoid except as a last resort |
When an AI assistant suggests a locator-related fix anywhere in this post, I cross-check it against this table — a suggestion that moves you down this resilience ranking (e.g., “just add an nth-child selector”) is usually a sign to push back and ask for a more resilient alternative, even if the suggested code would technically fix the immediate failure.
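To make that cross-check mechanical, here is a minimal sketch of how I might encode the table. The numeric scores and the `isResilienceDowngrade` helper are my own illustrative assumptions, not anything Playwright provides.

```typescript
// Strategies from the table above. The numeric ranks are an illustrative
// assumption: higher means more resilient.
type LocatorStrategy =
  | "getByRole"
  | "getByTestId"
  | "getByLabel"
  | "getByText"
  | "cssClass"
  | "xpathPositional";

const resilienceRank: Record<LocatorStrategy, number> = {
  getByRole: 3,
  getByTestId: 3,
  getByLabel: 3, // "High for forms" in the table
  getByText: 2,
  cssClass: 1,
  xpathPositional: 0,
};

// True when the suggested locator is less resilient than the current one.
function isResilienceDowngrade(
  current: LocatorStrategy,
  suggested: LocatorStrategy
): boolean {
  return resilienceRank[suggested] < resilienceRank[current];
}

console.log(isResilienceDowngrade("getByRole", "cssClass")); // true
console.log(isResilienceDowngrade("cssClass", "getByTestId")); // false
```

A `true` result is not an automatic rejection, just the cue to ask the assistant for a more resilient alternative.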
MCP Architecture, Explained for Testers Who Haven’t Used It Yet
If you read Part 1, feel free to skip this section. But I’ve had enough readers tell me they jumped straight to Part 2 that I want to cover the basics of what MCP actually is, in plain terms, before going deeper into debugging workflows that assume you understand it.
The Problem MCP Solves
Before MCP, getting an AI coding assistant useful context meant manually copy-pasting things into a chat window: a stack trace here, a code snippet there, maybe a screenshot. This works for small, one-off questions, but it falls apart for anything requiring repeated, structured access to live data — like trace files, CI run history, or database state — because you’d be manually re-gathering and re-pasting that data every single time.
What MCP Actually Is
Model Context Protocol is a standard way for an AI assistant to call out to external “tools” — small programs that expose specific capabilities, like “get the last 20 CI run results for a test” or “read this trace file’s network log” — and get structured data back, the same way a human would call an API. An MCP server is the program exposing these tools. An MCP client (your AI assistant, like Claude or Copilot) connects to one or more servers and can call their tools mid-conversation, automatically, when it decides a tool would help answer your question.
Why MCP Actually Helps When Debugging Flaky Tests
Every workflow in this post — comparing multiple traces, cross-referencing run history, pulling network logs — becomes dramatically less tedious once it’s a tool call instead of a manual copy-paste. The custom MCP server built later in this post is a concrete example: instead of you manually opening a JSON file, finding the relevant test’s run history, and pasting it into a chat, the AI assistant calls get_flakiness_report directly and gets back exactly the data it needs, structured and ready to reason over.
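For readers who have never seen a tool call on the wire, here is a rough sketch of what one might look like. MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method. The in-memory run history, the `testName` argument, and the handler body are hypothetical stand-ins for illustration, not the server built later in this post.

```typescript
// An MCP "tools/call" exchange, reduced to the fields used in this sketch.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

interface ToolCallResult {
  content: { type: "text"; text: string }[];
  isError?: boolean;
}

// Hypothetical run history (true = passed). A real server would read CI data.
const recentRuns: Record<string, boolean[] | undefined> = {
  "updateProfile.spec.ts": [true, true, false, true, true, true, true, false],
};

// Hypothetical handler for the get_flakiness_report tool.
function handleToolCall(req: ToolCallRequest): ToolCallResult {
  if (req.params.name !== "get_flakiness_report") {
    return {
      content: [{ type: "text", text: `Unknown tool: ${req.params.name}` }],
      isError: true,
    };
  }
  const testName = String(req.params.arguments["testName"]);
  const runs = recentRuns[testName] ?? [];
  const failures = runs.filter((passed) => !passed).length;
  const report = { testName, totalRuns: runs.length, failures };
  return { content: [{ type: "text", text: JSON.stringify(report) }] };
}

const exampleRequest: ToolCallRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "tools/call",
  params: {
    name: "get_flakiness_report",
    arguments: { testName: "updateProfile.spec.ts" },
  },
};

console.log(handleToolCall(exampleRequest).content[0].text);
// {"testName":"updateProfile.spec.ts","totalRuns":8,"failures":2}
```

The assistant decides when to send that request; your side of the contract is only to return structured text it can reason over.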
You Don’t Need to Build Your Own Server to Get Started
This is worth saying clearly: Playwright already ships an official MCP server that exposes trace and accessibility-tree data out of the box, which is what most of the basic workflows in this post rely on. Building a custom server (covered later) is for when you want something more specific to your team’s setup, like historical flakiness scoring — it’s an enhancement, not a prerequisite for getting started.
What Actually Makes a Test “Flaky”?
Before we bring AI into this at all, we need a shared definition, because “flaky” gets used as a catch-all term that hides very different problems. A flaky test is one that produces different results — pass or fail — across multiple runs against the exact same code, with no changes to the application under test. That’s the key distinguishing factor from a genuinely broken test: a broken test fails consistently because something is actually wrong. A flaky test fails inconsistently because something about the test, the environment, or the timing is non-deterministic. This lines up with how the wider testing community defines it too — Martin Fowler’s well-known writing on eradicating non-determinism in tests describes essentially the same root problem, years before AI tooling entered the picture.
This distinction matters more than it sounds like it should, because it changes where you point your debugging effort — and it changes what you ask an AI assistant to look for. If you ask Copilot “why is this test failing” when the test is actually broken (not flaky), you’ll get a different — and usually more useful — answer than if you ask the same question about a genuinely non-deterministic test. AI tools are good at spotting deterministic bugs in code. They’re worse, on their own, at spotting timing-dependent non-determinism, because that requires evidence across multiple runs, not just a static read of the code. That gap is precisely why debugging flaky tests with AI needs a different approach than debugging a straightforward broken test.
The Flaky Test Spectrum
In practice, I’ve found it useful to think of flakiness on a spectrum rather than a binary:
- Always passes — boring, ideal, the goal
- Mostly passes, rare random failure — true flakiness, often timing or environment related
- Passes/fails roughly 50/50 — usually a race condition or test isolation problem
- Mostly fails, rare random pass — often mislabeled as “flaky” when it’s actually broken with an occasional lucky timing window
- Always fails — not flaky, just broken
Get this classification wrong and you’ll waste hours. I’ve watched teams spend a sprint “fixing flakiness” on a test that was actually consistently broken under a specific data condition — they just hadn’t run it enough times to notice the pattern. This is actually one of the first places AI assistance genuinely helps: feeding an AI assistant your last 50-100 CI run results for a given test, through MCP, lets it spot the pattern (e.g., “this test only fails when run after Test X, never in isolation”) far faster than a human scrolling through a CI dashboard.
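The "only fails when run after Test X" pattern is simple enough to check in code before you even involve an assistant. This is a minimal sketch, assuming your run history records execution order per CI job. The `CiRun` shape and `suspectPredecessors` are hypothetical names for illustration.

```typescript
interface CiRun {
  // Tests in the order they executed within one job, with their outcome.
  executed: { test: string; passed: boolean }[];
}

// Returns tests that ran before `target` in every failing run and in no
// passing run. A strict heuristic: it yields suspects, not proof.
function suspectPredecessors(runs: CiRun[], target: string): string[] {
  const failingCount: Record<string, number> = {};
  const seenBeforePass: Record<string, boolean> = {};
  let failingRuns = 0;

  for (const run of runs) {
    const names = run.executed.map((e) => e.test);
    const idx = names.indexOf(target);
    if (idx === -1) continue; // target did not run in this job
    const passed = run.executed[idx].passed;
    if (!passed) failingRuns++;
    // Unique tests that ran earlier in the same job.
    const earlier = names
      .slice(0, idx)
      .filter((n, i, all) => all.indexOf(n) === i);
    for (const name of earlier) {
      if (passed) seenBeforePass[name] = true;
      else failingCount[name] = (failingCount[name] ?? 0) + 1;
    }
  }

  return Object.keys(failingCount)
    .filter((n) => failingCount[n] === failingRuns && !seenBeforePass[n])
    .sort();
}

const sampleRuns: CiRun[] = [
  { executed: [{ test: "login", passed: true }, { test: "cart", passed: true }, { test: "checkout", passed: false }] },
  { executed: [{ test: "login", passed: true }, { test: "checkout", passed: true }] },
  { executed: [{ test: "cart", passed: true }, { test: "login", passed: true }, { test: "checkout", passed: false }] },
  { executed: [{ test: "checkout", passed: true }] },
];

console.log(suspectPredecessors(sampleRuns, "checkout")); // ["cart"]
```

Here `login` is cleared because `checkout` also passed after it, which leaves `cart` as the only candidate worth handing to the assistant alongside the traces.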
One practical habit that flows directly from this spectrum: whenever a test gets reported as “flaky,” I now ask whoever reported it for the actual recent pass/fail ratio before doing anything else, rather than accepting the label at face value. “It failed once and I re-ran it” is a single data point, not evidence of a spectrum position — and treating it as confirmed flakiness without checking the ratio is exactly how a consistently-broken-with-rare-lucky-passes test gets miscategorized and deprioritized for weeks.
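If you want that ratio check to be more than a gut call, something like the following works as a starting point. The thresholds (70%, 30%, and a minimum of 10 runs) are assumptions I picked for illustration, so tune them for your own suite.

```typescript
type SpectrumPosition =
  | "always passes"
  | "mostly passes"
  | "roughly 50/50"
  | "mostly fails"
  | "always fails"
  | "insufficient data";

// Maps recent results (true = passed) to a position on the spectrum.
// All thresholds are illustrative assumptions.
function classifySpectrum(results: boolean[], minRuns = 10): SpectrumPosition {
  if (results.length < minRuns) return "insufficient data";
  const passRate = results.filter((passed) => passed).length / results.length;
  if (passRate === 1) return "always passes";
  if (passRate === 0) return "always fails";
  if (passRate >= 0.7) return "mostly passes";
  if (passRate <= 0.3) return "mostly fails";
  return "roughly 50/50";
}

// Small helper to build sample data.
function makeResults(passes: number, failures: number): boolean[] {
  const out: boolean[] = [];
  for (let i = 0; i < passes; i++) out.push(true);
  for (let i = 0; i < failures; i++) out.push(false);
  return out;
}

console.log(classifySpectrum([false, true])); // "insufficient data"
console.log(classifySpectrum(makeResults(18, 2))); // "mostly passes"
console.log(classifySpectrum(makeResults(2, 18))); // "mostly fails"
```

The first call is the "it failed once and I re-ran it" case: two data points deliberately produce no label at all.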
Why Flaky Tests Are the Real Trust Problem in QA
I want to spend a section on why this matters before we get into the how, because if you’re trying to justify spending time on this to a manager (or if you are the manager, like me), the business case matters as much as the technical fix.
Flaky tests don’t just waste CI minutes. They destroy trust in the test suite itself. The moment a developer sees “build failed” and their first instinct is “probably just flaky, re-run it” instead of “let me check what broke,” your test suite has stopped doing its job. At that point, you have all the maintenance cost of automation with almost none of the benefit, because real failures start getting lost in the noise of re-runs. I’ve seen teams normalize re-running failed pipelines two, three, even four times before anyone actually looks at the failure — which means a genuinely broken feature can sit unnoticed for hours while everyone assumes it’s “just flaky again.”
There’s also a quieter cost: flaky tests are disproportionately expensive to debug compared to deterministic bugs, because by definition you can’t reliably reproduce them. A developer can usually reproduce a deterministic bug by running the failing test once. A flaky test might need to run 20 times before it fails again, and even then, the failure might present differently each time — a different element not found, a different timeout, a different assertion mismatch — because the underlying cause (a race condition, a timing window, a resource contention issue) manifests differently depending on exactly when it gets triggered.
This is exactly the kind of problem that benefits from AI assistance with memory and pattern recognition across runs, rather than a human trying to hold twenty different failure screenshots in their head at once.
The Release-Cycle Cost
Beyond the per-test debugging cost, flaky tests have a compounding effect on release confidence. A team I worked with tracked something simple: how often a release was delayed, even briefly, by someone needing to manually verify whether a CI failure was “real” or “just flaky” before signing off. Individually each delay was minor — twenty minutes here, an hour there — but across a quarter, it added up to a meaningful chunk of lost release-day time, almost entirely attributable to a handful of chronically flaky tests nobody had prioritized fixing because each individual instance seemed too small to matter.
A Before/After Snapshot
To make the trust-erosion point concrete rather than abstract, here’s a simplified before/after comparison from a sprint where we applied the classification-and-AI-assisted workflow in this post to our ten most chronically flaky tests:
| Metric | Before | After (4 weeks later) |
|---|---|---|
| CI re-run rate (pipelines manually re-triggered) | ~28% | ~9% |
| Average time-to-diagnosis for a new flaky report | ~50 minutes | ~15 minutes |
| Flaky tests open longer than 2 sprints | 7 | 2 |
I want to be careful here: this is one team’s sprint-over-sprint snapshot, not a controlled study, and other changes happening in parallel (general test suite cleanup, a couple of unrelated infrastructure fixes) likely contributed too. I’m not presenting this as rigorous proof that AI assistance alone caused the improvement — but it does match the qualitative experience described throughout this post, and it’s the kind of before/after framing I’d recommend any QA lead track for their own team rather than relying purely on anecdote when making the case to leadership.
The Traditional Flaky Test Debugging Workflow (And Why It’s Slow)
Before talking about what changes with AI, it’s worth being honest about what the old workflow actually looked like, because that’s the baseline we’re improving on.
1. Notice the failure: usually from a CI notification, sometimes from a developer complaint
2. Re-run it: to confirm it’s actually flaky and not a real regression
3. Pull up the trace/video: if your team even captures these consistently, which not all do
4. Scroll through manually: looking for the moment things diverged from expected behavior
5. Form a hypothesis: “I think it’s a timing issue with the modal”
6. Add a workaround: usually a longer wait, a retry, or a different locator strategy
7. Run it 10-20 times locally: to see if the fix actually holds
8. Ship it and hope: because you can never be 100% sure a flaky test fix actually worked until it survives a few weeks in CI
Step 4 and step 5 are where most of the time goes, and they’re also the steps where AI assistance through MCP makes the biggest difference — not because AI replaces your judgment, but because it can process the trace data, network logs, and console output faster than you can scroll through them, and surface the 2-3 most likely hypotheses instead of you starting from a blank page.
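The "run it 10-20 times locally" step hides a bit of arithmetic worth making explicit. As a rough sketch, assuming each run fails independently with the same probability (real suites rarely behave that neatly), this estimates how many consecutive clean runs you need before a still-present flake would probably have shown itself. `runsNeeded` is a name I made up for illustration.

```typescript
// Smallest n such that (1 - failureRate)^n <= 1 - confidence, i.e. the number
// of runs after which an unfixed flake would have failed at least once with
// the given probability. Assumes independent runs with a constant failure rate.
function runsNeeded(failureRate: number, confidence = 0.95): number {
  if (failureRate <= 0 || failureRate >= 1) {
    throw new RangeError("failureRate must be strictly between 0 and 1");
  }
  if (confidence <= 0 || confidence >= 1) {
    throw new RangeError("confidence must be strictly between 0 and 1");
  }
  return Math.ceil(Math.log(1 - confidence) / Math.log(1 - failureRate));
}

console.log(runsNeeded(1 / 8)); // 23
console.log(runsNeeded(1 / 20)); // 59
```

So for a test that fails 1 in 8 runs, 10 clean local runs say very little, and even 20 falls slightly short of 95% confidence. That is part of why "ship it and hope" is the last step.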
Setting Up GitHub Copilot and MCP for Debugging (Not Just Writing Tests)
If you followed Part 1, you already have the basic Copilot + MCP wiring in place for writing tests. Debugging needs a slightly different setup, because the AI needs access to artifacts it didn’t need before: trace files, screenshots, videos, and CI run history — not just your test source code.
What MCP Needs Access To For Debugging
The core idea behind MCP (Model Context Protocol) is giving your AI assistant structured, tool-based access to context outside the chat window — instead of you copy-pasting logs into a prompt by hand. For flaky test debugging specifically, you want your MCP server (or the MCP servers you connect to) to expose:
- Playwright trace files — the single richest source of debugging information Playwright produces
- Test run history — pass/fail results across recent CI runs, ideally with timestamps
- Console and network logs — captured during the failing run
- The test source code itself — obviously, but worth stating explicitly since debugging prompts often need this alongside the trace data
- Screenshots/video on failure — Playwright’s built-in failure capture
If you set up the Playwright MCP server from Part 1, a good chunk of this is already wired up, since Playwright’s own MCP server exposes trace and accessibility-tree data to connected AI tools. What you’ll want to add for debugging specifically is a way to feed in historical run data — which is where a small custom MCP tool (covered later in this post) becomes genuinely useful rather than just a nice-to-have.
Configuring Playwright to Capture What AI Needs
Before any AI tool can help, your Playwright config needs to actually be capturing the right artifacts. This is the single most common reason people tell me “AI debugging didn’t work for my flaky test” — they never configured trace capture in the first place, so there was nothing for the AI to read.
// playwright.config.ts
import { defineConfig } from '@playwright/test';

export default defineConfig({
  use: {
    trace: 'retain-on-failure', // captures full trace only when a test fails
    screenshot: 'only-on-failure',
    video: 'retain-on-failure',
  },
  retries: process.env.CI ? 2 : 0, // capture multiple attempts in CI for pattern comparison
  reporter: [
    ['html'],
    ['json', { outputFile: 'test-results/results.json' }], // structured output an MCP tool can parse
  ],
});
Two settings matter most here for AI-assisted debugging specifically. First, trace: 'retain-on-failure' — without this, you simply have no trace file to feed into anything, AI or otherwise. See Playwright’s official trace viewer documentation for the full set of capture options. Second, the JSON reporter output gives you a structured, machine-readable result format that’s far easier for an MCP tool to parse than scraping HTML reports or console output. Getting this configuration right is the real prerequisite for debugging flaky tests with AI — none of the prompt patterns later in this post work without actual trace data behind them.
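To show why the JSON reporter matters, here is a minimal sketch of pulling per-attempt statuses out of a report. The interfaces cover only a subset of fields and reflect my understanding of the report shape (nested `suites`, each with `specs`, `tests`, and per-attempt `results`), so verify them against your own `results.json`. The function names are hypothetical.

```typescript
// Subset of the JSON report used below. Treat these shapes as an assumption
// to verify against a real results.json from your own suite.
interface JsonResult { status: string; retry: number; duration: number }
interface JsonTest { projectName: string; status: string; results: JsonResult[] }
interface JsonSpec { title: string; tests: JsonTest[] }
interface JsonSuite { title: string; specs: JsonSpec[]; suites?: JsonSuite[] }
interface JsonReport { suites: JsonSuite[] }

interface AttemptSummary {
  title: string;
  project: string;
  outcome: string;
  attempts: string[]; // status of each attempt, in order
}

// Walks nested suites and collects the per-attempt statuses of every test.
function summarizeAttempts(report: JsonReport): AttemptSummary[] {
  const out: AttemptSummary[] = [];
  const visit = (suite: JsonSuite): void => {
    for (const spec of suite.specs) {
      for (const t of spec.tests) {
        out.push({
          title: spec.title,
          project: t.projectName,
          outcome: t.status,
          attempts: t.results.map((r) => r.status),
        });
      }
    }
    for (const child of suite.suites ?? []) visit(child);
  };
  for (const s of report.suites) visit(s);
  return out;
}

// Tests that needed more than one attempt and passed on the last one.
function passedOnRetry(report: JsonReport): AttemptSummary[] {
  return summarizeAttempts(report).filter(
    (t) => t.attempts.length > 1 && t.attempts[t.attempts.length - 1] === "passed"
  );
}

const sampleReport: JsonReport = {
  suites: [{
    title: "profile.spec.ts",
    specs: [{
      title: "saves profile",
      tests: [{
        projectName: "chromium",
        status: "flaky",
        results: [
          { status: "failed", retry: 0, duration: 5200 },
          { status: "passed", retry: 1, duration: 1800 },
        ],
      }],
    }],
    suites: [{
      title: "avatar",
      specs: [{
        title: "loads avatar",
        tests: [{
          projectName: "chromium",
          status: "expected",
          results: [{ status: "passed", retry: 0, duration: 900 }],
        }],
      }],
    }],
  }],
};

console.log(passedOnRetry(sampleReport).map((t) => t.title)); // ["saves profile"]
```

In practice you would load the report with `JSON.parse` over the file at `test-results/results.json` rather than an inline object. The point is that a few dozen lines get you from raw reporter output to "which tests only passed on retry", which is exactly the list worth feeding to an assistant.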
Configuring Custom Copilot Instructions for Debugging Flaky Tests with AI
One upgrade I made after writing Part 1 that made a real, measurable difference to answer quality: a custom instructions file. GitHub Copilot (and most modern AI coding assistants) supports a repository-level instructions file that gets automatically included as context in every chat, so you’re not re-explaining your debugging conventions every single session.
Creating .github/copilot-instructions.md
# Flaky Test Debugging Conventions for This Repository

When helping diagnose a flaky Playwright test, follow this process:

1. Always ask for or check the trace file, run history, and test source before suggesting a root cause. Do not guess from a description alone.
2. Classify the flakiness into one of: timing/race condition, test data collision, test isolation, network/third-party dependency, animation/rendering timing, resource contention, or genuine application bug.
3. Explicitly state your confidence level in the diagnosis and what additional evidence (if any) would increase that confidence.
4. Never suggest a longer timeout or a retry as the primary fix unless you have ruled out an underlying root cause first.
5. For isolation-related fixes, prefer fixture-level fixes (beforeEach/afterEach or custom fixtures) over test-level workarounds.
6. Flag if the evidence suggests this might be a real application bug rather than a test problem.

## Project conventions

- We use `getByRole` and `getByTestId` locators, not raw CSS selectors
- Test data should be worker-scoped using `testInfo.workerIndex`, not shared hardcoded values
- All third-party API calls in tests should be mocked via `page.route()`, not hit live in CI
This single file changed the character of my debugging sessions noticeably. Before adding it, I had to restate “don’t just suggest a longer timeout” almost every session. After adding it, that constraint is baked into every response by default, which saves real time across dozens of debugging sessions per month — exactly the kind of small, compounding setup investment that’s easy to skip and genuinely worth doing.
VS Code Workspace Settings That Help
// .vscode/settings.json
{
  "github.copilot.chat.codeGeneration.useInstructionFiles": true,
  "github.copilot.advanced": {
    "indentationMode": "explicit"
  }
}
The useInstructionFiles setting is the one that actually activates the custom instructions file above — without it, Copilot won’t automatically include it as context, and you’re back to manually pasting your conventions into every chat session. See GitHub’s official Copilot documentation for the full range of configuration options beyond what’s covered here.
Comparing AI Assistants for Flaky Test Debugging: Copilot vs Claude vs Cursor
Since Part 1, the most common question I’ve gotten is whether GitHub Copilot is actually the best choice for this specific job, or whether I just used it because it was the one I happened to set up first. Fair question, so I spent a few weeks running the same flaky test debugging sessions through GitHub Copilot Chat, Claude (via the desktop app with MCP connected), and Cursor’s built-in AI, to see how they actually compared for this narrow use case — not general coding ability, specifically flaky Playwright test diagnosis.
GitHub Copilot Chat (in VS Code)
Strongest where you’d expect: deep integration with your actual codebase context in the editor. If you’re already mid-debug with the test file, the trace viewer, and the terminal open in VS Code, Copilot Chat’s ability to reference “the file currently open” or “the selected code” without you re-pasting anything is genuinely faster for quick iterations. Where it was weaker: longer, multi-document reasoning across a trace file plus run history plus test code all at once tended to lose track of earlier context in longer chat sessions, requiring me to re-state the hypothesis more often than I’d like.
Claude (via MCP)
This is what I used for most of the deeper diagnostic sessions in this post, including the case study walkthrough. With MCP wiring trace data, run history, and test code in as structured context up front (rather than relying on in-editor file references), I found the multi-step reasoning — “compare these two traces, then cross-reference against this run history, then suggest a fixture-level fix” — held together more reliably across a longer back-and-forth conversation. The tradeoff is setup friction: getting the MCP server actually exposing the right data took more upfront configuration than just opening Copilot Chat.
Cursor
Cursor’s tight integration between its AI and the actual codebase (similar strength to Copilot) plus its own context-management approach made it a reasonable middle ground. I didn’t find a decisive reason to prefer it over Copilot+MCP or Claude+MCP specifically for flaky test debugging, though if you’re already using Cursor as your primary editor, there’s no strong reason to add a second tool just for this.
My Honest Recommendation
If you’re doing a quick, single-test diagnosis and you’re already in VS Code, Copilot Chat is the lowest-friction choice. If you’re doing the deeper, evidence-heavy diagnosis described in this post — comparing multiple traces, cross-referencing run history, building toward a documented root cause — the MCP-connected setup (whether that’s Claude or Copilot with MCP servers wired in) earns its setup cost. I don’t think there’s a single “best” tool here; it depends on how deep the specific debugging session needs to go.
DIY MCP Workflow vs. Commercial Self-Healing Test Platforms
It’s worth addressing directly, since I get this question often: there are commercial platforms (Testim, mabl, Functionize, Applitools, among others) that bundle AI-assisted test maintenance and self-healing locators as a packaged product, rather than the DIY GitHub Copilot + MCP approach in this post. Here’s an honest comparison based on what I’ve evaluated.
What Commercial Platforms Do Well
These platforms generally offer a more polished, purpose-built experience for specific problems — self-healing locators that automatically adapt when an element’s selector changes, visual diffing with less manual setup, and dashboards built specifically for flakiness tracking rather than assembled from a custom JSON reporter like the one in this post. If your team doesn’t have engineering capacity to build and maintain custom MCP tooling, a commercial platform removes that burden entirely.
What the DIY Approach Offers Instead
The tradeoff is flexibility and cost. A custom MCP-based workflow built on top of Copilot or Claude, tools many teams already have licenses for, costs essentially nothing additional beyond the engineering time to set it up, versus a per-seat or per-test commercial platform subscription. It’s also fully customizable to your specific application’s quirks (like the worker-scoped test data pattern or the custom triage scoring covered in this post), where a commercial platform’s self-healing logic is necessarily more generic since it has to work across many different customers’ applications.
My Honest Take
For a small-to-mid-size team already comfortable with Playwright and already paying for Copilot or Claude licenses, the DIY approach in this post gets you 70-80% of the value of a commercial platform for a fraction of the cost, with the tradeoff being your own engineering time to build and maintain it. For a larger organization with budget for dedicated test-platform tooling and less appetite for maintaining custom MCP servers, a commercial self-healing platform is a completely reasonable choice — I wouldn’t tell a team to rip one out just because this post exists. They’re solving overlapping but not identical problems.
A Full Sample Debugging Conversation, Start to Finish
Everything in this post so far has been individual prompt patterns and code snippets. To make the actual experience concrete, here’s a condensed but realistic transcript of what an actual debugging session looks like end-to-end, based closely on a real session from this sprint (lightly trimmed for length).
Me:
I have a flaky test called `updateProfile.spec.ts` that fails about 1 in 8 runs in CI, never locally. The failure is always “element not found: .profile-saved-toast”. Here’s the trace from the most recent failure [trace attached via MCP], and here’s the last 15 CI runs [run history attached]. I suspect this is a timing issue with the toast notification, but I’m not certain.
AI Assistant:
Looking at the trace, the test clicks “Save” at timestamp 00:02.140, then immediately asserts on `.profile-saved-toast` at 00:02.180, only 40ms later. The network log shows the PATCH request to `/api/profile` doesn’t resolve until 00:02.310, roughly 130ms after the assertion already ran (170ms after the click). The toast appears to be rendered only after the API response resolves, based on the DOM snapshot at 00:02.350 showing the toast element appearing for the first time. This looks like a genuine race condition: the test isn’t waiting for the save request to complete before checking for the toast. Looking at the run history, the 2 failures both show similarly fast click-to-assert timing, while passing runs show a slightly longer gap, which is consistent with this being timing-dependent rather than environmental.
Me:
That matches what I was seeing. What’s the actual fix — just add a wait before the assertion, or is there something better?
AI Assistant:
A blind wait would work but is fragile and slower than necessary. Better options, in order of preference: First, wait explicitly for the PATCH response before asserting on the toast, using `page.waitForResponse()` scoped to the specific API call. Second, if Playwright’s `toBeVisible()` isn’t already being used for the toast assertion, switching to it would help since it auto-retries rather than checking once. Based on your trace, though, the assertion does appear to be a one-shot check rather than an auto-retrying expect, which is the more fundamental issue here.
Me:
It’s currently using a plain `if (await page.locator(…).isVisible())` check, not `expect().toBeVisible()`. That’s probably the actual bug.
AI Assistant:
That confirms it. `isVisible()` checks the current state once and returns immediately. It doesn’t wait or retry. Switching to `await expect(page.locator('.profile-saved-toast')).toBeVisible()` will make Playwright auto-retry the check for up to the default timeout, which resolves this without needing an explicit wait for the API response at all, and is more idiomatic for this codebase based on the patterns in your other test files.
This transcript is a good representative example of the pattern I described earlier: I brought the symptom and a hypothesis with actual evidence, the AI confirmed and refined it with specific timestamp analysis I wouldn’t have manually cross-referenced as quickly, and the final fix came from a genuine back-and-forth rather than a single one-shot answer. The actual root cause here — a non-retrying isVisible() check instead of an auto-retrying expect() — is a subtle enough distinction that I think it would have taken me longer to spot manually than it did with the trace timing laid out explicitly like this.
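If the one-shot versus auto-retry distinction still feels abstract, this small simulation replays the trace timeline from the transcript. It is not Playwright code and not how Playwright implements retrying internally. The fixed 100ms polling interval and the helper names are assumptions for illustration only.

```typescript
// Timeline from the transcript's trace, in milliseconds.
const assertionStartMs = 2180; // test starts checking for the toast
const toastRenderedMs = 2350; // first DOM snapshot containing the toast

// Stand-in for "is the toast visible at time t?"
function toastVisibleAt(t: number): boolean {
  return t >= toastRenderedMs;
}

// One-shot check: looks once and returns, like isVisible().
function checkOnce(nowMs: number): boolean {
  return toastVisibleAt(nowMs);
}

// Retrying check: keeps polling until visible or the timeout elapses, which is
// the behaviour expect(...).toBeVisible() gives you. The 100ms interval is an
// assumption for this simulation, not Playwright's actual schedule.
function checkWithRetry(
  startMs: number,
  timeoutMs = 5000,
  intervalMs = 100
): { visible: boolean; atMs: number } {
  for (let t = startMs; t <= startMs + timeoutMs; t += intervalMs) {
    if (toastVisibleAt(t)) return { visible: true, atMs: t };
  }
  return { visible: false, atMs: startMs + timeoutMs };
}

console.log(checkOnce(assertionStartMs)); // false
console.log(checkWithRetry(assertionStartMs)); // { visible: true, atMs: 2380 }
```

Same page, same timing, opposite results: the one-shot check loses the race every time the API response is slower than the gap between click and assertion, while the retrying check simply waits it out.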
Categorizing Flaky Tests: The Seven Root Causes I See Most
Before throwing a failing test at Copilot, I’ve found it dramatically speeds things up to first mentally (or literally) categorize what kind of flakiness you’re likely looking at. This isn’t AI-specific advice — it’s just good debugging hygiene — but it becomes especially powerful when paired with AI assistance, because you can tell the AI which category you suspect and get a much more targeted analysis instead of a generic “here are some things that could be wrong” response.
| Root Cause Category | Typical Symptom | How Often I See It |
|---|---|---|
| Timing / race conditions | Element not found, or found before it’s interactive | Very common |
| Test data collisions | Assertion mismatch, “expected X, got stale data” | Common |
| Test isolation / shared state | Passes alone, fails in suite | Common |
| Network / third-party dependency | Timeout, intermittent 5xx | Common |
| Animation / rendering timing | Click doesn’t register, element “not stable” | Common |
| Resource contention (CI parallelism) | Fails only under load, not locally | Less common but painful |
| Genuine application bug | Intermittent in the app itself, not the test | Underrated — often missed |
That last row deserves its own callout, because it’s the category most teams get wrong. Not every flaky test is a bad test. Sometimes the application genuinely has an intermittent bug — a backend race condition, a caching inconsistency — and the test is correctly catching it. Treating every flaky failure as “the test’s fault” and reaching straight for a longer timeout or a retry can paper over a real production issue. This is one place where I explicitly prompt the AI assistant to consider both possibilities rather than assuming the test is always to blame — more on the exact prompt wording in the next section.
A Small Habit That Makes AI Diagnosis Easier: Descriptive Test Names and Comments
One low-effort change worth adopting team-wide: test names and step comments that describe intent, not just action, give the AI assistant (and any human debugging later) far more to work with. Compare test('test 4', ...) against test('checkout completes and shows confirmation toast within expected backend response window', ...) — the second version tells an AI assistant (or a teammate six months later) exactly what assumption the test is encoding, which makes it much faster to spot when a failure represents a violated assumption (like the backend response window assumption from the checkout case study earlier in this post) rather than starting diagnosis from a completely blank slate. This costs almost nothing at write-time and pays back consistently at debug-time — exactly the kind of habit that’s easy to skip under deadline pressure but compounds in value the longer a test lives in your suite.
Common Playwright-Specific Flaky Patterns (And How AI Helps Spot Each One)
Beyond the seven broad root-cause categories, there’s a set of patterns specific to Playwright’s API that show up constantly once you’ve debugged enough flaky tests. I’ve started keeping a running list, and feeding this list to the AI assistant as context (“check specifically for these known Playwright flakiness patterns”) consistently produces sharper diagnoses than a generic prompt.
1. waitForSelector vs. Auto-Waiting Conflicts
Older Playwright code (or code migrated from Selenium habits) sometimes mixes explicit waitForSelector calls with Playwright’s built-in auto-waiting, creating redundant or conflicting wait conditions:
// Redundant and occasionally conflicting
await page.waitForSelector('.product-card', { state: 'visible' });
await page.locator('.product-card').click(); // already auto-waits
// Cleaner — let Playwright's auto-waiting do the work
await page.locator('.product-card').click();
I ask the AI assistant to specifically flag any test mixing both patterns, since it’s a quick, safe cleanup that removes one source of timing ambiguity.
2. networkidle Misuse
waitForLoadState('networkidle') is one of the most commonly misapplied waits in Playwright suites. It waits for no network requests for 500ms — which breaks down completely on pages with polling, analytics beacons, or websocket connections that never go fully idle, causing the wait to either time out unnecessarily or resolve too early relative to what you actually needed to wait for.
// Fragile on pages with polling/analytics
await page.waitForLoadState('networkidle');
// More reliable — wait for the actual signal you care about
await expect(page.locator('[data-testid="dashboard-loaded"]')).toBeVisible();
This is a pattern I now explicitly prompt for: “check whether this test uses networkidle, and if so, whether the page has any polling or websocket activity that would make it unreliable.”
3. Iframe Timing Issues
Tests interacting with iframes (payment widgets, embedded third-party content) are flaky almost by default if the iframe’s own load timing isn’t explicitly waited for before interacting with its contents:
const frame = page.frameLocator('#payment-iframe');
await frame.locator('input[name="cardNumber"]').waitFor({ state: 'visible' });
await frame.locator('input[name="cardNumber"]').fill('4242424242424242');
AI assistants are reliably good at catching this pattern when given the trace, since a “frame not found” or “element not attached” error inside an iframe context has a fairly distinctive signature.
4. Shadow DOM Locator Fragility
This pattern shows up often enough that it’s become a standard checkpoint when debugging flaky tests with AI.
Components using Shadow DOM (common in design systems built on web components) are a frequent source of locator confusion. Playwright’s CSS, text, and role locators pierce open shadow roots by default, but XPath selectors do not, and elements inside closed shadow roots can’t be reached by locators at all:
// Standard locators pierce open shadow roots automatically;
// XPath selectors and closed-mode shadow roots are the cases that need a different approach
await page.locator('my-custom-element').locator('button').click();
This is one area where I’ve found AI suggestions less reliable, since shadow DOM behavior varies significantly by component library — always verify against your specific design system’s documentation rather than trusting a generic suggestion here.
5. File Upload Race Conditions
File upload flows often trigger asynchronous processing (virus scan, thumbnail generation, validation) after the file input event fires, and tests that assert on a post-upload UI state immediately after setInputFiles are a common flakiness source:
await page.locator('input[type="file"]').setInputFiles('test-image.png');
// Flaky: asserting immediately
// await expect(page.locator('.upload-success')).toBeVisible();
// Better: wait for the actual completion signal
await expect(page.locator('.upload-success')).toBeVisible({ timeout: 15_000 });
await expect(page.locator('.thumbnail-preview img')).toHaveAttribute('src', /.+/);
6. Drag-and-Drop Flakiness
Drag-and-drop interactions are notoriously timing-sensitive because they depend on a sequence of mouse events firing with realistic timing gaps. Playwright’s dragTo() helper handles most of this, but custom drag implementations (common in kanban boards, file managers) often need manual mouse event sequencing with explicit small delays between steps — a pattern AI assistants reliably suggest once you show them a failing drag-and-drop trace.
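Here is a minimal sketch of what that manual sequencing can look like. The dragWithSteps helper, the MouseLike interface, and the default step count and pause length are my own illustrative assumptions, not Playwright APIs; the interface only names the move, down, and up calls that page.mouse provides, so page.mouse can be passed in directly.

```typescript
// Minimal shape of the mouse API this sketch relies on. Playwright's
// page.mouse is assumed to satisfy it structurally.
interface MouseLike {
  move(x: number, y: number, options?: { steps?: number }): Promise<void>;
  down(): Promise<void>;
  up(): Promise<void>;
}

interface Point {
  x: number;
  y: number;
}

const pause = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical helper: press, move in small increments, release, with short
// pauses so drag libraries that listen for intermediate events can react.
async function dragWithSteps(
  mouse: MouseLike,
  from: Point,
  to: Point,
  steps = 10,
  pauseMs = 50,
): Promise<void> {
  await mouse.move(from.x, from.y); // hover the drag handle first
  await mouse.down();
  await pause(pauseMs);
  await mouse.move(to.x, to.y, { steps }); // steps asks for intermediate move events
  await pause(pauseMs);
  await mouse.up();
}
```

In a test, the two points would typically come from the centers of the source and target bounding boxes, and the pause length is something to tune against your own drag library rather than a recommended value.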
7. Multi-Tab and Popup Timing
Tests that open a new tab or popup (OAuth flows, “open in new window” links) need an explicit wait for the new page object before interacting with it:
const [newPage] = await Promise.all([
context.waitForEvent('page'),
page.locator('a[target="_blank"]').click(),
]);
await newPage.waitForLoadState();
Forgetting the Promise.all pattern here — clicking first, then trying to grab the new page afterward — is a textbook race condition that AI assistants catch almost instantly when shown the relevant code, since it’s such a well-documented Playwright gotcha.
More Flaky Patterns Worth Knowing: Mobile, Localization, and Database Seeding
Beyond the seven Playwright-specific patterns above, a few additional categories come up often enough in real projects that I want to cover them separately, since they each have a distinct diagnostic signature.
Mobile and Responsive Viewport Flakiness
Tests run against mobile viewport emulation sometimes behave differently than the same test at desktop viewport, particularly around touch event handling versus mouse events, and elements that get hidden behind a hamburger menu or bottom sheet at smaller viewports. A test that clicks an element directly visible at desktop width might need an explicit menu-open step at mobile width that’s easy to forget when adapting a desktop test:
// Viewport-aware navigation, avoids flakiness from hidden nav at mobile widths
const viewportSize = page.viewportSize();
if (viewportSize && viewportSize.width < 768) {
await page.getByRole('button', { name: 'Menu' }).click();
}
await page.getByRole('link', { name: 'Account Settings' }).click();
When an AI assistant is given a trace from a mobile-viewport failure, I explicitly ask it to check whether the target element was actually within the visible viewport at the time of the action, not just present in the DOM — a distinction that matters a lot at narrow viewports where content frequently sits below the fold or behind a collapsed menu.
Locale and Internationalization Flakiness
Tests that assert on visible text are inherently fragile across locales, but a subtler version of this shows up even in single-locale test suites: date formatting, number formatting, and pluralization that depend on the test environment’s system locale settings rather than the application’s configured locale, producing intermittent mismatches when CI runners have inconsistent locale configuration.
// Locale-fragile — date format depends on system locale
await expect(page.locator('.order-date')).toHaveText('6/20/2026');
// More resilient — assert against a data-testid carrying the raw ISO value
await expect(page.locator('[data-testid="order-date"]'))
.toHaveAttribute('data-iso-date', '2026-06-20');
This category is easy to miss because it often presents as “this test fails only on this one CI runner” — which looks like resource contention or an environment issue, but is actually a locale configuration mismatch. I now explicitly ask the AI assistant to check whether a failure correlates with a specific CI runner/region rather than assuming it’s purely load-related when I see that pattern.
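To make the mechanism concrete, here is a small standalone sketch (plain TypeScript, no Playwright involved) showing how the same date renders differently depending on locale, while the ISO calendar date stays fixed. The order date and the two locales are just illustrative choices.

```typescript
// The same instant, formatted under different locales. Pinning timeZone keeps
// the calendar day stable regardless of where the runner is located.
const orderDate = new Date('2026-06-20T12:00:00Z');

function formatForLocale(date: Date, locale: string): string {
  return new Intl.DateTimeFormat(locale, { timeZone: 'UTC' }).format(date);
}

const usFormat = formatForLocale(orderDate, 'en-US'); // month/day/year order
const gbFormat = formatForLocale(orderDate, 'en-GB'); // day/month/year order

// The ISO calendar date does not depend on locale at all, which is why
// asserting on it is the more stable choice.
const isoDate = orderDate.toISOString().slice(0, 10);

console.log({ usFormat, gbFormat, isoDate });
```

If an application formats dates with the runtime’s default locale instead of an explicit one, two CI runners with different locale settings will disagree in exactly this way.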
Database and Test Environment Seeding Race Conditions
For tests that depend on seeded data created by a setup script (rather than created fresh within the test itself), a race between the seeding script completing and the test starting is a common, hard-to-spot flakiness source — especially in CI setups where seeding happens as a separate pipeline step before the test job starts, with no explicit readiness check connecting the two.
// Fragile — assumes seeding finished, no explicit check
test('order history shows seeded orders', async ({ page }) => {
await page.goto('/orders');
await expect(page.locator('.order-row')).toHaveCount(5);
});
// More resilient — explicit readiness check before the test body runs
test.beforeAll(async ({ request }) => {
await expect.poll(async () => {
const res = await request.get('/api/seed-status');
return (await res.json()).status;
}, { timeout: 30_000 }).toBe('complete');
});
This is a category where AI assistance is particularly good at spotting the missing readiness check once shown the CI pipeline configuration alongside the test, since “step B starts before step A’s async work is confirmed done” is a well-understood pattern across many types of systems, not specific to browser testing at all.
Accessibility Tree Changes Causing Locator Flakiness
Role-based locators (getByRole) are generally more resilient than CSS selectors, but they’re not immune to flakiness — a component library update that changes an element’s ARIA role or accessible name mid-transition (for example, a button that briefly reports as aria-disabled="true" during a loading state) can cause a role-based locator to intermittently fail to match. Since Playwright’s MCP server exposes accessibility-tree snapshots directly, this is one category where I lean on MCP specifically rather than just the trace viewer — asking the assistant to diff the accessibility tree at the moment of the failed action against a passing run’s tree at the same step surfaces this kind of subtle attribute-timing mismatch faster than reading raw HTML would.
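As a rough illustration of the kind of diff I’m asking the assistant for, here is a sketch over a deliberately simplified node shape. The A11yNode interface is hypothetical and much flatter than a real accessibility snapshot, and diffA11ySnapshots is my own helper name, not part of Playwright or its MCP server.

```typescript
// Hypothetical, flattened view of one accessibility-tree entry.
interface A11yNode {
  role: string;
  name: string;
  disabled?: boolean;
}

const keyOf = (node: A11yNode): string => `${node.role} "${node.name}"`;

// Report nodes that are missing, added, or differ in disabled state between
// a passing run and a failing run captured at the same test step.
function diffA11ySnapshots(passing: A11yNode[], failing: A11yNode[]): string[] {
  const failingByKey = new Map<string, A11yNode>();
  failing.forEach((node) => failingByKey.set(keyOf(node), node));
  const passingKeys = new Set(passing.map(keyOf));

  const differences: string[] = [];
  for (const node of passing) {
    const match = failingByKey.get(keyOf(node));
    if (!match) {
      differences.push(`${keyOf(node)}: missing in failing run`);
    } else if (Boolean(node.disabled) !== Boolean(match.disabled)) {
      differences.push(
        `${keyOf(node)}: disabled ${Boolean(node.disabled)} -> ${Boolean(match.disabled)}`,
      );
    }
  }
  for (const node of failing) {
    if (!passingKeys.has(keyOf(node))) {
      differences.push(`${keyOf(node)}: only in failing run`);
    }
  }
  return differences;
}

const a11yDiff = diffA11ySnapshots(
  [{ role: 'button', name: 'Submit' }, { role: 'link', name: 'Help' }],
  [
    { role: 'button', name: 'Submit', disabled: true },
    { role: 'status', name: 'Saving' },
    { role: 'link', name: 'Help' },
  ],
);
console.log(a11yDiff);
```

Matching on role plus name means a renamed element shows up as one missing and one added entry, which is usually the signal you want when a role-based locator stops matching.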
AI-Assisted Root Cause Analysis: Prompt Patterns That Actually Work
This is the section I get the most questions about, so I want to be specific rather than hand-wavy. Not all prompts produce useful output, and I’ve wasted real time on vague ones before landing on patterns that consistently work.
The Prompt That Doesn’t Work
Why is this test flaky?
This is too vague. Without trace data, run history, or category context, Copilot will give you a generic list of “common causes of flaky tests” pulled from its training data: useful as a checklist, not useful as a diagnosis of your specific test.
The Prompt Pattern That Works
Here is the Playwright trace for a failing run of `checkout.spec.ts`, plus the last 15 CI run results for this test (8 pass, 7 fail, no clear pattern by time of day). The test passes 100% of the time when run in isolation. I suspect this is a test isolation issue, not a timing issue, based on the "passes alone, fails in suite" symptom.
1. Confirm or rule out test isolation as the root cause using the trace data
2. If confirmed, identify which other test(s) in the suite are likely leaving behind state that affects this test
3. If ruled out, suggest the next most likely category from: timing, network, animation, resource contention
Notice the structure: I give the AI the evidence (trace, run history), my hypothesis (with the specific symptom that led me there), and a constrained, numbered set of things to actually do — not an open-ended “what do you think.” This is the single biggest lesson from months of doing this: AI debugging assistance works best as a hypothesis-testing tool, not a hypothesis-generating tool. You bring the suspicion based on the symptom pattern from the table above; the AI confirms, refines, or redirects it using evidence you couldn’t have manually cross-referenced as fast.
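If you find yourself retyping that structure, it can be captured in a few lines. This is a minimal sketch: buildTriagePrompt and the TriagePrompt shape are names I made up for illustration, and the section labels in the output are just one reasonable layout.

```typescript
interface TriagePrompt {
  evidence: string[]; // what you are attaching: traces, run history, logs
  hypothesis: string; // your suspicion, including the symptom behind it
  tasks: string[]; // constrained, concrete things for the assistant to do
}

function buildTriagePrompt({ evidence, hypothesis, tasks }: TriagePrompt): string {
  if (tasks.length === 0) {
    throw new Error('Give the assistant at least one concrete task');
  }
  return [
    'Evidence:',
    ...evidence.map((item) => `- ${item}`),
    '',
    `Hypothesis: ${hypothesis}`,
    '',
    'Tasks:',
    ...tasks.map((task, index) => `${index + 1}. ${task}`),
  ].join('\n');
}

const triagePrompt = buildTriagePrompt({
  evidence: ['Playwright trace for a failing run', 'Last 15 CI results (8 pass, 7 fail)'],
  hypothesis: 'Test isolation issue, because the test passes alone and fails in the suite',
  tasks: [
    'Confirm or rule out test isolation using the trace data',
    'If ruled out, suggest the next most likely category',
  ],
});
console.log(triagePrompt);
```

Refusing to build a prompt with no tasks is deliberate: it forces the hypothesis-testing framing instead of an open-ended question.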
A Prompt Library for the Seven Categories
Here are the specific prompt skeletons I now reuse for each category, adapted from the pattern above:
Timing / race conditions:
Analyze this trace for `[test name]`. Identify any action (click, fill, etc.) that occurs within 200ms of a DOM mutation or network response completing. Flag any locator waits that resolved on `visible` rather than the action's actionability requirements (stable, enabled, receives-events).
Test data collisions:
This test creates a record with a hardcoded identifier `[value]`. Check the trace and network log for any 409/conflict responses, or any assertion that compares against stale data created by a previous run. Suggest whether this identifier should be made unique per run.
Test isolation:
Compare the network requests and console output of this test when run alone versus in full-suite mode [paste both]. Identify any shared resource (cookie, localStorage key, database row, API rate limit) that differs between the two runs.
Network / third-party dependency:
Review the network log for this trace. List every external domain called during the test, their response times, and flag any call exceeding 2 seconds or returning a non-2xx status that the test didn't explicitly expect.
Using AI to Read Playwright Trace Files (The Part That Actually Saves Time)
The Playwright trace viewer is already one of the best debugging tools in the testing ecosystem — it gives you a timeline, DOM snapshots, network activity, and console logs for every step of a test run. What changes with AI assistance isn’t the data; it’s who’s reading it.
Manually, reading a trace means clicking through each step, hovering over the timeline, expanding network calls one at a time. For a 40-step test, that’s genuinely tedious, and it’s exactly the kind of repetitive, pattern-matching task AI is well-suited to once it has structured access to the trace data rather than just a screenshot.
Exporting Trace Data for AI Analysis
Playwright traces are zip files containing structured JSON alongside screenshots and network records. The show-trace command opens the interactive viewer rather than writing a summary file, so to give an AI tool the underlying data, unpack the archive instead of relying on the AI to “see” the visual trace viewer:
unzip -o trace.zip -d trace-extracted/
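Or, for a quick scripted capture of the console data most relevant to flaky test debugging:
import { test } from '@playwright/test';
import fs from 'fs';

// Console messages per test, keyed by test id so parallel tests don't mix
const consoleLogs = new Map<string, string[]>();

test.beforeEach(async ({ page }, testInfo) => {
  const logs: string[] = [];
  consoleLogs.set(testInfo.testId, logs);
  // Record every browser console message as it is emitted
  page.on('console', (msg) => logs.push(`[${msg.type()}] ${msg.text()}`));
});

test.afterEach(async ({}, testInfo) => {
  if (testInfo.status !== testInfo.expectedStatus) {
    // outputPath() writes into this test's own results directory,
    // which avoids file names built from titles with spaces or slashes
    fs.writeFileSync(
      testInfo.outputPath('debug-context.json'),
      JSON.stringify({ consoleLogs: consoleLogs.get(testInfo.testId) ?? [], status: testInfo.status }, null, 2)
    );
  }
});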
Once you have this structured, MCP can hand it to Copilot as context alongside your prompt, and the difference in answer quality compared to a vague “this test is flaky, help” is significant. The AI isn’t guessing anymore — it’s pattern-matching against actual timestamps and actual network responses.
Integrating This Workflow With Your CI Platform
Everything covered so far works whether your traces and run history live in GitHub Actions, GitLab CI, or Jenkins — but getting the artifacts out of each platform in a usable form looks slightly different for debugging flaky tests with AI at scale. Here’s how I’ve set this up across the platforms I’ve actually used.
GitHub Actions
I’d treat this as foundational to any serious attempt at debugging flaky tests with AI at scale.
See GitHub’s artifact storage documentation for the full configuration reference.
# .github/workflows/playwright.yml
- name: Run Playwright tests
  run: npx playwright test
- name: Upload trace artifacts
  if: failure()
  uses: actions/upload-artifact@v4
  with:
    name: playwright-traces
    path: test-results/
    retention-days: 14
- name: Upload run history JSON
  if: always()
  uses: actions/upload-artifact@v4
  with:
    name: run-results-json
    path: test-results/results.json
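The retention-days: 14 setting matters more than it looks. GitHub’s default artifact retention is 90 days, but repositories and organizations often lower it to save storage, and you need at least a couple of weeks of history for the run-history comparison prompts described earlier in this post to have enough data to work with. Setting it explicitly documents that requirement, though the value can’t exceed the limit configured for the repository or organization.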
GitLab CI
Reference: GitLab’s CI artifacts and reports documentation.
# .gitlab-ci.yml
playwright_tests:
  script:
    - npx playwright test
  artifacts:
    when: always
    paths:
      - test-results/
    expire_in: 14 days
    reports:
      junit: test-results/results.xml
GitLab’s reports: junit integration is worth setting up alongside the raw artifact upload, since it gives you a built-in pass/fail history view in the GitLab UI that’s a useful quick reference even before you bring AI tooling into it.
Jenkins
// Jenkinsfile (declarative)
post {
  always {
    archiveArtifacts artifacts: 'test-results/**', allowEmptyArchive: true
    junit 'test-results/results.xml'
  }
}
Jenkins requires a bit more manual setup to expose this data to an MCP tool compared to GitHub Actions or GitLab, since there’s no single standard artifact API — you’ll typically need a small script that pulls artifacts via the Jenkins REST API into a format your MCP server can read locally.
A Note on Artifact Retention
Whatever platform you’re on, the single most common setup mistake I see is artifact retention set too short to build any meaningful run history. If you’re only keeping 3 days of trace artifacts, you’ll never have enough data for the kind of pattern-matching prompts described in this post. I’d treat 14 days as a practical minimum, and 30 days as ideal if your storage budget allows it.
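One thing the CI snippets in this section take for granted is that Playwright actually writes test-results/results.json, test-results/results.xml, and trace files in the first place. Here is a sketch of a config that would produce them. The output paths are assumptions chosen to match the snippets above, and the trace mode is one reasonable choice rather than the only one.

```typescript
// playwright.config.ts (sketch)
import { defineConfig } from '@playwright/test';

export default defineConfig({
  // Paths are assumptions chosen to match the CI snippets in this section
  reporter: [
    ['list'],
    ['json', { outputFile: 'test-results/results.json' }],
    ['junit', { outputFile: 'test-results/results.xml' }],
  ],
  use: {
    // Keep a trace only for failed tests so artifact size stays manageable
    trace: 'retain-on-failure',
  },
});
```

If your suite uses retries, 'on-first-retry' is a common alternative for the trace setting; the trade-off is that a test which fails only once and never retries leaves no trace behind.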
How Trace Data Actually Maps to AI Reasoning
I want to go one level deeper than “feed it the trace” for readers who want to understand the mechanics, not just the workflow. A Playwright trace zip contains several distinct data streams, and understanding what each contributes to an AI assistant’s reasoning helps you debug more effectively when a diagnosis seems off.
The Action Timeline
Every action (click, fill, navigation) is recorded with precise start and end timestamps, including the time spent in Playwright’s internal actionability wait. This is the data an AI assistant uses for the timing-correlation analysis seen throughout this post — comparing an action’s timestamp against a network response’s timestamp, for example, is really just precise arithmetic over this timeline data, which is why AI assistants are reliably good at this specific kind of analysis: it’s closer to structured data processing than open-ended reasoning.
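To show how mechanical that arithmetic is, here is a sketch that flags any action landing within a window of a network response completing. The TimedEvent shape is a simplification I invented for illustration (real trace events carry far more fields), and the 200ms default mirrors the threshold from the timing prompt earlier in this post.

```typescript
interface TimedEvent {
  label: string;
  timeMs: number; // milliseconds since the start of the trace
}

interface TightTiming {
  action: string;
  response: string;
  gapMs: number; // negative: the action started before the response completed
}

// Pair every action with every response that completed within windowMs of it.
function findTightTimings(
  actions: TimedEvent[],
  responses: TimedEvent[],
  windowMs = 200,
): TightTiming[] {
  const findings: TightTiming[] = [];
  for (const action of actions) {
    for (const response of responses) {
      const gapMs = action.timeMs - response.timeMs;
      if (Math.abs(gapMs) <= windowMs) {
        findings.push({ action: action.label, response: response.label, gapMs });
      }
    }
  }
  return findings;
}

const tightTimings = findTightTimings(
  [
    { label: 'click Submit', timeMs: 1000 },
    { label: 'fill Email', timeMs: 3000 },
  ],
  [
    { label: 'GET /api/cart', timeMs: 1080 },
    { label: 'GET /api/user', timeMs: 2000 },
  ],
);
console.log(tightTimings);
```

In this made-up data the click starts 80ms before the cart response completes, which is the kind of pairing worth asking the assistant to explain.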
DOM Snapshots
Trace files capture DOM snapshots at multiple points, which is what powers the “before vs after” comparisons used in the isolation-debugging and accessibility-tree sections earlier in this post. This is also where AI reasoning is weakest in isolation — a DOM snapshot tells you what changed, but not always why, which is precisely why I keep emphasizing bringing your own hypothesis about the mechanism rather than expecting the AI to infer intent purely from structural diffs.
Network HAR Data
The full request/response cycle, including timing breakdowns (DNS, connection, time-to-first-byte, download). This is the richest data source for the network and resource-contention categories covered earlier, since response-time analysis is largely a matter of comparing numbers across multiple captured requests — again, a strength area for AI-assisted analysis specifically because it’s structured and comparative rather than requiring deep contextual understanding.
Console Logs
Browser console output, including application-level errors and warnings that might not surface as a test failure directly but indicate something relevant happened. I’ve found this stream most useful for catching genuine application bugs (the last category in the root-cause table near the start of this post): a console error logged right before a flaky assertion failure is often the strongest single signal that you’re looking at a real bug, not a test problem.
Why This Matters Practically
This is the part of debugging flaky tests with AI that benefits most from a second pair of eyes.
Knowing which data stream supports which kind of conclusion helps you sanity-check an AI’s diagnosis. If an assistant claims a timing-related root cause but hasn’t referenced specific action-timeline timestamps in its reasoning, that’s a signal to ask it to actually walk through the timestamps rather than accepting a plausible-sounding but unsubstantiated explanation — directly connecting back to the “treating AI confidence language as calibrated” pitfall covered earlier in this post.
Fixing Flaky Locators With AI-Suggested Strategies
Locator-related flakiness is probably the single most common category I run into, and it’s also the category where AI suggestions are most reliably good, because locator strategy is a well-understood, well-documented problem space that AI models have seen thousands of examples of.
The Classic Mistake
// Flaky: passes Playwright's actionability checks while the app is still settling
await page.locator('.submit-button').click();
This looks fine and passes most of the time. Playwright’s .click() already auto-waits for the element to be visible, stable, enabled, and able to receive events, so when it still fails intermittently the cause is usually something those checks can’t see: a button can pass all of them while the application isn’t actually ready for the click (e.g., during a form validation debounce, or while a CSS transition elsewhere on the page is still finishing). Copilot, when given the trace showing a click registering 80ms before a CSS transition completes, will typically suggest:
// More explicit: state the expected precondition as an assertion before clicking
await page.locator('.submit-button').waitFor({ state: 'visible' });
await expect(page.locator('.submit-button')).toBeEnabled();
await page.locator('.submit-button').click();
Or, often a better suggestion, using Playwright’s built-in auto-waiting more effectively by switching to a role-based locator that’s inherently more stable against markup churn:
await page.getByRole('button', { name: 'Submit' }).click();
Where AI Locator Suggestions Need a Human Check
This is a good place for one of my honesty callouts. AI suggestions for locator fixes are usually syntactically correct and often the right pattern — but they don’t know your application’s actual behavior unless you tell them. I’ve had Copilot suggest a waitForLoadState('networkidle') fix for what was actually an animation-timing issue, because “networkidle” is a common pattern it’s seen in training data for similar-looking problems, not because it understood this specific app’s animation behavior. Always verify the suggested fix against what you actually observed in the trace, not just whether the code looks plausible.
Debugging Network and Third-Party Dependency Flakiness
This category is trickier than locator issues because the fix isn’t always “write better Playwright code” — sometimes the fix is “stop depending on a live third-party service in your test.” AI assistance is genuinely useful here for spotting the pattern quickly, even though the fix itself is more architectural.
Diagnosing With Network Logs
Here is the network log from 5 failing runs and 5 passing runs of the same test. Identify any external API call (not first-party to our app) and compare response times across the pass/fail sets. Flag anything where the failing runs show meaningfully slower or failed responses from the same endpoint.
This kind of comparative analysis across multiple runs is exactly where AI assistance through MCP earns its keep — manually diffing five network logs against five other network logs is the kind of tedious cross-referencing a human will rush through, while an AI assistant won’t get bored doing it properly.
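For a sense of what that comparison boils down to, here is a sketch over a simplified log format. The NetworkEntry shape, the flagSuspectEndpoints name, and the 2x slowdown threshold are all assumptions for illustration; real HAR entries would need to be mapped into this shape first.

```typescript
interface NetworkEntry {
  url: string;
  durationMs: number;
  status: number;
}

interface EndpointComparison {
  endpoint: string;
  passMedianMs: number;
  failMedianMs: number;
  failedResponses: number;
}

function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Group third-party calls by host + path, skipping first-party traffic.
function groupByEndpoint(entries: NetworkEntry[], firstPartyHost: string): Map<string, NetworkEntry[]> {
  const groups = new Map<string, NetworkEntry[]>();
  for (const entry of entries) {
    const { hostname, pathname } = new URL(entry.url);
    if (hostname === firstPartyHost) continue;
    const key = `${hostname}${pathname}`;
    groups.set(key, [...(groups.get(key) ?? []), entry]);
  }
  return groups;
}

// Flag endpoints that are much slower, or returning errors, in failing runs.
function flagSuspectEndpoints(
  passing: NetworkEntry[],
  failing: NetworkEntry[],
  firstPartyHost: string,
  slowdownFactor = 2,
): EndpointComparison[] {
  const passGroups = groupByEndpoint(passing, firstPartyHost);
  const flagged: EndpointComparison[] = [];
  groupByEndpoint(failing, firstPartyHost).forEach((failEntries, endpoint) => {
    const passEntries = passGroups.get(endpoint);
    if (!passEntries) return; // nothing to compare against
    const passMedianMs = median(passEntries.map((e) => e.durationMs));
    const failMedianMs = median(failEntries.map((e) => e.durationMs));
    const failedResponses = failEntries.filter((e) => e.status >= 400).length;
    if (failMedianMs >= passMedianMs * slowdownFactor || failedResponses > 0) {
      flagged.push({ endpoint, passMedianMs, failMedianMs, failedResponses });
    }
  });
  return flagged;
}
```

Using the median rather than the mean keeps one unusually slow request from dominating the comparison when you only have a handful of runs.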
The Actual Fix: Mocking, Not Just Waiting Longer
This detail is easy to overlook, but it’s central to debugging flaky tests with AI properly.
Once a third-party dependency is confirmed as the flakiness source, the right fix is almost always to mock it rather than add a longer timeout:
await page.route('**/api/external/payment-gateway/**', async (route) => {
await route.fulfill({
status: 200,
contentType: 'application/json',
body: JSON.stringify({ status: 'approved', transactionId: 'test-txn-123' }),
});
});
This is a case where I’d push back on an AI suggestion if it just recommended increasing the timeout — that treats the symptom, not the cause, and you’ll be back debugging the same test in a month when the third-party service has a slow day again.
Resource Contention and CI Parallelism: The Category Teams Underestimate
This category deserves its own deep dive because it’s the one most commonly missed entirely — tests that pass reliably locally and fail intermittently only in CI, with no obvious code-level cause, often turn out to be resource contention rather than a test bug at all.
Why This Happens
CI environments run multiple test workers in parallel, often on shared infrastructure with limited CPU, memory, or network bandwidth compared to a developer’s local machine. A test that works fine when it’s the only thing running can fail when three other workers are simultaneously hitting the same staging database, the same rate-limited API, or simply competing for CPU cycles, causing browser actions and network calls to take longer than your test’s timeouts allow for.
How to Confirm This Is the Cause
The diagnostic signature is fairly distinctive once you know to look for it: failures cluster around specific CI shards or worker counts, failures correlate with overall CI load (more failures during peak deployment hours), and — critically — the test never fails when run with --workers=1 locally or in an isolated CI job.
# Quick diagnostic run: isolate from parallelism as a variable
npx playwright test checkout.spec.ts --workers=1 --repeat-each=20
If a test fails 0/20 times in this isolated run but fails intermittently in the normal parallel CI run, resource contention under parallelism is strongly implicated. This is exactly the kind of correlation I now ask the AI assistant to check for explicitly, since “did failures cluster by worker count or CI load” requires cross-referencing data most engineers don’t bother collecting manually.
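The cross-referencing itself is simple once the data exists. Here is a sketch that groups run results by worker count; the CiRun shape is an assumption about what you would record per run, since Playwright doesn’t hand you this history in one place.

```typescript
interface CiRun {
  workers: number; // value passed to --workers for that run
  passed: boolean;
}

interface WorkerStats {
  workers: number;
  total: number;
  failed: number;
  failureRate: number;
}

// Failure rate per worker count, sorted from least to most parallelism.
function failureRateByWorkers(runs: CiRun[]): WorkerStats[] {
  const byWorkers = new Map<number, { total: number; failed: number }>();
  for (const run of runs) {
    const stats = byWorkers.get(run.workers) ?? { total: 0, failed: 0 };
    stats.total += 1;
    if (!run.passed) stats.failed += 1;
    byWorkers.set(run.workers, stats);
  }
  const rows: WorkerStats[] = [];
  byWorkers.forEach((stats, workers) => {
    rows.push({ workers, ...stats, failureRate: stats.failed / stats.total });
  });
  return rows.sort((a, b) => a.workers - b.workers);
}

const workerStats = failureRateByWorkers([
  { workers: 4, passed: false },
  { workers: 4, passed: true },
  { workers: 1, passed: true },
  { workers: 4, passed: false },
  { workers: 1, passed: true },
  { workers: 4, passed: true },
]);
console.log(workerStats);
```

A failure rate that climbs with worker count points toward contention, while a flat rate across worker counts suggests looking at a different category first.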
Fixes That Actually Address This
- Reduce parallelism for specific test groups that hit shared, rate-limited resources, rather than reducing it globally
- Use dedicated test data/accounts per worker via Playwright’s testInfo.workerIndex, so parallel workers aren’t contending for the same database rows
- Scope timeouts to known-slower operations explicitly, rather than a blanket global timeout increase that masks the underlying contention
- Consider sharding by resource sensitivity — grouping tests that are sensitive to backend load into a shard that runs with lower parallelism, separate from UI-only tests that don’t touch shared backend resources
// Worker-scoped test data avoids contention between parallel workers
import { test } from '@playwright/test';
test('checkout completes successfully', async ({ page }, testInfo) => {
const uniqueEmail = `test-worker-${testInfo.workerIndex}-${Date.now()}@example.com`;
// ... use uniqueEmail instead of a shared hardcoded test account
});
Test Isolation and State Leakage: The Hardest Category to Debug
This is, honestly, the category where AI assistance has helped me the most, because it’s also the category that’s most tedious to debug manually. The classic symptom: a test passes every time you run it alone, and fails roughly half the time when run as part of the full suite.
Common Causes
- Shared test data (same user account, same database row) modified by an earlier test
- Browser context/storage state leaking between tests that should be isolated
- A previous test leaving the application in an unexpected state (e.g., a modal still open)
- Global test fixtures with unintended shared mutable state
An AI-Assisted Diffing Workflow
The workflow I now use: run the failing test in isolation, capture the trace. Run the full suite, capture the trace of the same test when it fails. Feed both to Copilot via MCP with a structured comparison prompt:
Compare these two traces of the same test: trace-isolated-pass.zip (passing, run alone) and trace-suite-fail.zip (failing, run as part of the full suite).
1. List every difference in initial application state between the two (cookies, localStorage, visible DOM elements before the test's first action)
2. Identify which prior test in the suite run order is most likely responsible for any state difference found
3. Suggest a fixture-level fix (not a test-level workaround) to properly isolate this test
I specifically ask for a “fixture-level fix, not a test-level workaround” because the lazy fix here is always to add cleanup code inside the failing test itself, which just moves the problem instead of solving it. A proper fix usually belongs in a beforeEach/afterEach hook or a test fixture that guarantees isolation for every test, not just the one that happened to expose the bug.
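// Proper isolation fixture, not a one-off patch
import { test as base } from '@playwright/test';
export const test = base.extend({
  page: async ({ page, baseURL }, use) => {
    await page.context().clearCookies();
    // localStorage is per-origin and not readable on the initial blank page,
    // so load the app origin first (assumes baseURL is set in your config)
    if (baseURL) {
      await page.goto(baseURL);
      await page.evaluate(() => localStorage.clear());
    }
    await use(page);
  },
});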
Building a Simple MCP Tool for Flaky Test Triage
Once you’ve manually debugged a handful of flaky tests using the prompt patterns above, the natural next step is automating the first-pass triage so you’re not manually pulling trace files and run history every time. This is a good beginner MCP-building project — exactly the kind of thing I’m still learning myself, so I’ll walk through it at the level of someone doing this for the first time, not as an MCP architecture expert.
What This Tool Should Do
- Pull the last N CI run results for a given test from your test results JSON
- Calculate a simple flakiness score (fail rate over the last N runs)
- Attach the most recent failing trace file as context
- Hand all of this to the connected AI assistant in one structured call, instead of you assembling it by hand each time
A Minimal MCP Server Skeleton
// flaky-triage-mcp-server.ts
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { z } from 'zod';
import fs from 'fs';
const server = new McpServer({ name: 'flaky-triage', version: '1.0.0' });
server.tool(
  'get_flakiness_report',
  'Returns recent run history and flakiness score for a given test name',
  { testName: z.string() }, // tool parameters are declared as a Zod shape
  async ({ testName }) => {
    // Assumes results.json accumulates entries across runs; a single
    // JSON-reporter file from one run holds one entry per test and project
    const results = JSON.parse(fs.readFileSync('test-results/results.json', 'utf-8'));
    const runs = results.suites
      .flatMap((s: any) => s.specs) // top-level suites only; nested describes need a recursive walk
      .filter((spec: any) => spec.title === testName)
      .flatMap((spec: any) => spec.tests); // the JSON reporter stores status on each spec's tests
    const recent = runs.slice(-20);
    const failCount = recent.filter((r: any) => r.status === 'unexpected').length;
    const flakinessScore = failCount / (recent.length || 1);
    return {
      content: [{
        type: 'text',
        text: JSON.stringify({
          testName,
          totalRuns: recent.length,
          failures: failCount,
          flakinessScore: flakinessScore.toFixed(2),
          recentOutcomes: recent.map((r: any) => r.status),
        }, null, 2),
      }],
    };
  }
);
const transport = new StdioServerTransport();
await server.connect(transport);
This is intentionally minimal — it’s a starting point, not a production triage system. Once connected, you can ask Copilot something like “use get_flakiness_report for checkout.spec.ts and tell me if this looks like timing-related or isolation-related flakiness based on the failure pattern,” and it’ll pull live data instead of you pasting it manually.
Extending the Tool: Attaching the Most Recent Failing Trace
I’ve found this step alone meaningfully improves the reliability of debugging flaky tests with AI.
The version above gives run history but not the trace content itself, which still leaves a manual step. A natural extension — the “more complete version” I mentioned wanting to build — adds a second tool that locates and returns the most recent failing trace’s key data alongside the flakiness report, so a single tool call gives the AI assistant everything it needs in one pass:
server.tool(
  'get_latest_failure_trace_summary',
  'Returns a summary of the most recent failing trace for a given test',
  { testName: z.string() }, // Zod shape; needs `import { z } from 'zod'` at the top of the file
  async ({ testName }) => {
    const traceDir = `test-results/${testName.replace(/\s+/g, '-')}`;
    // readdirSync throws on a missing directory, so check for it first
    const traceFiles = fs.existsSync(traceDir)
      ? fs.readdirSync(traceDir).filter(f => f.endsWith('.zip')).sort().reverse()
      : [];
    if (traceFiles.length === 0) {
      return { content: [{ type: 'text', text: 'No trace files found for this test.' }] };
    }
    // In a fuller implementation, this would unzip and parse the trace's
    // network and action data directly. For a minimal version,
    // returning the file path lets the connected AI tool's own trace-reading
    // capability (where available) take it from there.
    return {
      content: [{
        type: 'text',
        text: `Most recent failing trace: ${traceDir}/${traceFiles[0]}`,
      }],
    };
  }
);
I’m showing this as a partial implementation rather than a polished production tool, in keeping with the learning-in-public approach of this series. Fully parsing the trace zip internals (the HAR file, the snapshot data) into structured tool output is the next piece I’m actively working on, and I’d rather show you the real, current state than a finished tool I haven’t battle-tested yet.
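One small hardening step for the tool above: sorting by file name only finds the latest trace if the names happen to sort chronologically. A sketch of picking the newest `.zip` by modification time instead, using only Node's `fs` and `path` modules. The `findNewestTrace` name is hypothetical, and it only looks at the directory's direct children.

```typescript
import * as fs from 'node:fs';
import * as path from 'node:path';

// Return the most recently modified .zip directly inside `dir`,
// or null when the directory is missing or holds no traces.
function findNewestTrace(dir: string): string | null {
  if (!fs.existsSync(dir)) return null; // readdirSync would throw here
  const zips = fs.readdirSync(dir)
    .filter((name) => name.endsWith('.zip'))
    .map((name) => {
      const file = path.join(dir, name);
      return { file, mtimeMs: fs.statSync(file).mtimeMs };
    })
    .sort((a, b) => b.mtimeMs - a.mtimeMs); // newest first
  return zips.length > 0 ? zips[0].file : null;
}
```

Modification time is only a proxy for "most recent failure": if CI restores artifacts from a cache, the timestamps may all collapse to the restore time, so it's worth checking that assumption against your own pipeline.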
Surfacing This in Pull Requests, Not Just in Chat
The triage tool above is useful interactively, but the bigger workflow win comes from surfacing the same data automatically when a test fails in CI, rather than waiting for someone to manually go ask. A simple CI step can call the same flakiness-report logic and post a structured comment on the pull request when a test fails, so the first-pass classification (timing vs. isolation vs. network, plus the historical fail rate) is already sitting in front of the developer before they even open the CI logs.
// ci-flaky-comment.ts — run as a CI step after test failures
// Both imports are project-local helper modules, not published packages.
import { getFlakinessReport } from './flaky-triage-lib';
import { postPrComment } from './github-api-helper';

async function reportFlakyFailures(failedTests: string[]) {
  for (const testName of failedTests) {
    const report = await getFlakinessReport(testName);
    // flakinessScore is a string (from toFixed), so parse it before comparing.
    if (parseFloat(report.flakinessScore) > 0.1) {
      await postPrComment({
        body: `⚠️ **${testName}** has a historical flakiness score of ` +
          `${report.flakinessScore} (${report.failures}/${report.totalRuns} recent runs). ` +
          `This may not be a new regression — check trace before assuming the PR broke it.`,
      });
    }
  }
}
This small addition changes the team behavior I described earlier in this post — developers no longer have to guess whether a red pipeline is “probably just flaky” or a real regression; the historical data is right there in the PR comment. I’d treat this as a natural extension of the MCP triage tool once you’ve validated it works well for your own interactive debugging first.
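The `postPrComment` helper is not shown in this post, so here is one way such a helper could be sketched. It assumes GitHub's REST endpoint for issue comments (pull request conversation comments go through the issues API) and Node 18+ for the global `fetch`. The `PrCommentTarget` shape, the function names, and the signature are my assumptions, and the signature differs from the single-argument call used in the CI script above.

```typescript
interface PrCommentTarget { owner: string; repo: string; prNumber: number; token: string; }

// Build the REST request for adding a comment to a pull request.
function buildPrCommentRequest(target: PrCommentTarget, body: string) {
  return {
    url: `https://api.github.com/repos/${target.owner}/${target.repo}/issues/${target.prNumber}/comments`,
    init: {
      method: 'POST',
      headers: {
        Authorization: `Bearer ${target.token}`,
        Accept: 'application/vnd.github+json',
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ body }),
    },
  };
}

// Send the comment; throws when GitHub answers with a non-2xx status.
async function sendPrComment(target: PrCommentTarget, body: string): Promise<void> {
  const { url, init } = buildPrCommentRequest(target, body);
  const response = await fetch(url, init);
  if (!response.ok) {
    throw new Error(`GitHub comment failed with status ${response.status}`);
  }
}
```

Splitting the request builder from the network call keeps the URL and payload testable without hitting the API, which matters for a step that only runs when CI is already red.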
A Slack Reporting Template Worth Standardizing
For teams that report flaky test findings in Slack rather than (or in addition to) PR comments, standardizing the message format pays off the same way the RCA template does — it makes past reports searchable and comparable rather than free-form prose that’s hard to scan later. A simple structure I’ve adopted:
🔍 Flaky test flagged: `[test name]`
Fail rate: [X/Y recent runs]
Suspected category: [from the seven-category table]
Status: [investigating / root-caused / fix in review / monitoring]
Thread 🧵 for diagnosis details
Keeping diagnosis details in a thread rather than the main message keeps the channel scannable, while the structured header line makes it easy to search Slack later for “what did we already learn about this test” — a lighter-weight companion to the formal RCA template for cases that don’t warrant a full ticket.
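If the Slack message is generated by a script rather than typed by hand, the template can be pinned down in a small formatter so every report comes out identical. A sketch, with `FlakyReport` and `formatSlackHeader` as hypothetical names of my own:

```typescript
type FlakyStatus = 'investigating' | 'root-caused' | 'fix in review' | 'monitoring';

interface FlakyReport {
  testName: string;
  failures: number;
  totalRuns: number;
  category: string; // one of the categories from the classification table
  status: FlakyStatus;
}

// Render the structured header; diagnosis details go in the thread instead.
function formatSlackHeader(r: FlakyReport): string {
  return [
    `🔍 Flaky test flagged: \`${r.testName}\``,
    `Fail rate: ${r.failures}/${r.totalRuns} recent runs`,
    `Suspected category: ${r.category}`,
    `Status: ${r.status}`,
    'Thread 🧵 for diagnosis details',
  ].join('\n');
}
```

Restricting `status` to a fixed set of strings is what keeps later Slack searches reliable, since nobody can post "looking into it" in place of "investigating".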
Case Study: Debugging a Real Flaky Checkout Test End-to-End
To make this concrete rather than theoretical, here’s a full walkthrough of debugging a representative flaky test using the workflow above — the kind of checkout flow test that exists in some form in almost every e-commerce automation suite.
The Symptom
A checkout test failed roughly 1 in 6 runs in CI, always passed locally, and the failure was always the same assertion: an order confirmation number that came back empty.
Step 1 — Classify
Symptom pattern: fails intermittently, never locally, always the same assertion. This points toward either resource contention under CI parallelism or a network/timing issue specific to the order-confirmation API call — not test isolation, since the failure mode is consistent rather than varying.
Step 2 — Gather Evidence
Pulled the last 20 CI runs via the triage MCP tool above: 4 failures, all on the same assertion, no pattern by time of day, but all 4 failures occurred when this test ran in a CI shard alongside at least 3 other tests hitting the same backend service concurrently.
Step 3 — AI-Assisted Diagnosis
This test fails 4/20 times in CI, always on the orderConfirmationNumber assertion coming back empty, never locally. All 4 failures occurred when run concurrently with 3+ other tests against the same staging backend. Here is the network log from one failing run [pasted]. Check whether the order-confirmation API call in the failing run shows a slower response time than typical, and whether the test's wait/assertion timeout is shorter than the observed worst-case response time under load.
The AI assistant’s analysis (cross-referencing the network timing against the assertion’s default timeout) found that the order-confirmation endpoint averaged 800ms under normal load but spiked to 4-6 seconds when the staging backend was handling concurrent requests from parallel CI shards — and the test’s assertion was using Playwright’s default 5-second expect timeout, which meant it was right on the edge of timing out under load.
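The comparison the assistant made here is simple enough to reproduce by hand once you have response times pulled from the network log. A sketch under the assumption that you've extracted them into a plain array of milliseconds. The function names are mine, and the sample values in the usage note are illustrative, not the real measurements from this incident.

```typescript
// Share of observed response times that would exceed a given timeout.
function shareOverTimeout(samplesMs: number[], timeoutMs: number): number {
  if (samplesMs.length === 0) return 0;
  const over = samplesMs.filter((ms) => ms > timeoutMs).length;
  return over / samplesMs.length;
}

// Worst observed response time, useful for choosing a scoped timeout with headroom.
function worstCaseMs(samplesMs: number[]): number {
  return samplesMs.length === 0 ? 0 : Math.max(...samplesMs);
}
```

With illustrative samples such as `[800, 750, 4200, 5600, 900, 6000]`, two of the six calls exceed a 5000 ms timeout and none exceed 10000 ms. Keep in mind the response time is only part of what has to finish before the assertion can pass, so treat the worst case as a lower bound on the timeout you need.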
Step 4 — The Fix
Two changes, not one. First, an explicit, longer timeout specifically for this slower endpoint rather than a blanket increase across the whole suite:
await expect(page.locator('[data-testid="order-confirmation-number"]'))
  .toBeVisible({ timeout: 10_000 }); // explicit, documented, scoped to this known-slow call
Second — and this is the fix that actually addressed the root cause rather than the symptom — reducing CI parallelism against the shared staging backend for this specific test group, since the real problem was backend resource contention under concurrent load, not the test itself being poorly written.
Step 5 — Verify
Ran the test 30 times locally with artificial network throttling to simulate the slow-response condition, then watched it across the next two weeks of CI runs: zero failures. That two-week observation window matters — a flaky test fix that “looks right” on day one can still resurface, so I don’t consider any flaky test fix verified until it survives multiple days of real CI traffic patterns.
Case Study Two: The Multi-Tab OAuth Flow That Failed Only on Tuesdays
A second example worth walking through, because the symptom pattern was genuinely confusing at first. A login test using a third-party OAuth popup failed almost exclusively during a specific weekly batch job window — initially this looked like a coincidence, but the run-history pattern was too consistent to ignore.
Classification: Initially miscategorized as “environment flakiness, unclear cause” — exactly the kind of vague bucket that wastes time. Evidence gathering via the triage MCP tool showed failures clustered tightly around a 90-minute window, every week, on the same day. AI-assisted diagnosis, once given the full run-history timestamps rather than just the failure logs, correctly identified that this window matched a scheduled backend reindexing job that temporarily slowed authentication API response times — a connection a human would eventually find too, but likely after several more weeks of “huh, weird” before someone thought to check the backend job schedule.
The fix here wasn’t a Playwright code change at all — it was flagging the backend reindexing job’s resource impact to the platform team, who moved it to a lower-traffic window. The lesson: not every flaky test fix lives in the test file, and AI-assisted timestamp correlation across a long run history is genuinely better at catching weekly/cyclical patterns than a human scanning a CI dashboard.
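The timestamp correlation in this case study doesn't strictly need an AI assistant. A few lines of bucketing will surface a weekly cluster on their own, and the output makes a good piece of evidence to hand to the assistant. A sketch, assuming failure timestamps are available as ISO 8601 strings; the function names are hypothetical and everything is bucketed in UTC.

```typescript
// Count failures per (UTC weekday, UTC hour) bucket so cyclical clusters stand out.
function bucketFailuresByWeekHour(failureTimestamps: string[]): Map<string, number> {
  const days = ['Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat'];
  const buckets = new Map<string, number>();
  for (const ts of failureTimestamps) {
    const d = new Date(ts);
    if (Number.isNaN(d.getTime())) continue; // skip unparseable timestamps
    const hour = d.getUTCHours();
    const key = `${days[d.getUTCDay()]} ${hour < 10 ? '0' : ''}${hour}:00`;
    buckets.set(key, (buckets.get(key) ?? 0) + 1);
  }
  return buckets;
}

// Return the busiest bucket as [label, count], or null when there are no failures.
function topBucket(buckets: Map<string, number>): [string, number] | null {
  const sorted = Array.from(buckets.entries()).sort((a, b) => b[1] - a[1]);
  return sorted.length > 0 ? sorted[0] : null;
}
```

If your backend jobs are scheduled in local time, convert before bucketing, otherwise a job that straddles midnight UTC will split across two weekdays and look weaker than it is.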
Case Study Three: The Drag-and-Drop Kanban Test That “Couldn’t” Be Flaky
A drag-and-drop test for a kanban-style task board failed roughly 1 in 10 runs, always with the card ending up in the wrong column. The developer who wrote it insisted the code was correct, since it worked every time they ran it manually.
Classification: Animation/rendering timing, based on the “wrong final position” symptom rather than an outright error — this pointed toward the drag completing before the column’s drop-zone boundary recalculated after a CSS transition, rather than a logic error in the test.
AI-assisted diagnosis, given the trace’s mouse-event timeline alongside the column’s CSS transition duration from the stylesheet, confirmed the drag’s final mouseup event fired 60-80ms before the drop-zone’s animated width transition completed — meaning the drop coordinates were being calculated against a stale layout. The fix added an explicit wait for the drop-zone’s transition to finish before initiating the final drop step, rather than a blanket delay added to every drag operation in the suite, which would have slowed down dozens of unrelated tests unnecessarily.
Calculating the Real ROI of Debugging Flaky Tests with AI
Numbers help when you’re justifying this to a manager — or, in my case, justifying my own time investment to myself. Here’s a rough, honest accounting based on the cases above and a handful of others from the same sprint.
| Debugging Step | Manual Time (Typical) | AI-Assisted Time |
|---|---|---|
| Forming an initial hypothesis | 20-40 minutes (scrolling logs/traces) | 5-10 minutes |
| Cross-referencing multiple CI runs | 30-60 minutes | 2-5 minutes (via triage tool) |
| Comparing isolated vs. suite-run traces | 30-45 minutes | 5-10 minutes |
| Identifying the specific fix pattern | 15-30 minutes | 5-10 minutes |
None of these numbers are scientific — they’re rough averages from my own sprints, and your mileage will vary based on suite size and how well-instrumented your traces already are. But even a conservative read of this table suggests something like a 60-70% reduction in time-to-diagnosis for the categories of flakiness covered in this post, with the caveat that the fix itself still takes the same engineering time it always did — AI speeds up finding the root cause, not implementing or verifying the fix.
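For anyone who wants to check that percentage against the table, here is the arithmetic spelled out. The ranges are copied from the table rows; the pairing logic (slowest AI-assisted session against fastest manual one for the conservative end) is my own reading of what "conservative" means here.

```typescript
// [min, max] minutes per debugging step, in table order.
const manualRanges: Array<[number, number]> = [[20, 40], [30, 60], [30, 45], [15, 30]];
const assistedRanges: Array<[number, number]> = [[5, 10], [2, 5], [5, 10], [5, 10]];

// Sum the minimums and maximums separately.
function totals(rows: Array<[number, number]>): [number, number] {
  return rows.reduce<[number, number]>(([lo, hi], [a, b]) => [lo + a, hi + b], [0, 0]);
}

const [manualMin, manualMax] = totals(manualRanges);       // 95 and 175 minutes
const [assistedMin, assistedMax] = totals(assistedRanges); // 17 and 35 minutes

// Most pessimistic pairing: slowest assisted session vs. fastest manual one.
const conservativeReduction = 1 - assistedMax / manualMin; // about 0.63
// Most optimistic pairing: fastest assisted session vs. slowest manual one.
const optimisticReduction = 1 - assistedMin / manualMax;   // about 0.90
```

So the 60-70% figure sits at the pessimistic end of what the table implies, which is where I'd rather anchor an estimate I'm showing to a manager.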
Setting a Team Flakiness SLA
Once you have a baseline from the metrics above, it’s worth setting an explicit, internal SLA rather than treating flakiness reduction as an open-ended aspiration. Ours, after the four-week period referenced in the before/after table earlier: any test exceeding a 10% fail rate over its last 20 runs gets flagged for triage within the next sprint, and any test still above that threshold after two sprints gets escalated to a documented RCA using the template shown earlier, with the rotating flaky-test owner accountable for driving it to resolution. Having a concrete number, rather than a vague “we should fix flaky tests when we have time,” is what actually keeps this from sliding back into the old normalized-re-run culture described at the start of this post once the initial enthusiasm for the new workflow settles into routine.
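The SLA rule is mechanical enough to encode, which helps keep it from being reinterpreted sprint to sprint. A sketch, where `TestHistory`, `sprintsAboveThreshold`, and `slaAction` are hypothetical names, and the outcome strings follow the `unexpected` convention used by the triage tool earlier:

```typescript
interface TestHistory {
  title: string;
  outcomes: string[];            // oldest first
  sprintsAboveThreshold: number; // how many sprints it has already spent over the limit
}

type SlaAction = 'ok' | 'triage' | 'escalate-to-rca';

// Apply the SLA: over 10% failures in the last 20 runs means triage,
// and still being over after two sprints means a documented RCA.
function slaAction(h: TestHistory, threshold = 0.1, window = 20): SlaAction {
  const recent = h.outcomes.slice(-window);
  if (recent.length === 0) return 'ok';
  const failRate = recent.filter((o) => o === 'unexpected').length / recent.length;
  if (failRate <= threshold) return 'ok';
  return h.sprintsAboveThreshold >= 2 ? 'escalate-to-rca' : 'triage';
}
```

Note that "exceeding 10%" is read strictly here, so a test with exactly 2 failures in 20 runs is not flagged and one with 3 is.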
Cost Considerations: Token Usage at Scale
Worth a brief, honest note for any manager evaluating this: feeding full trace files and run history into an AI assistant repeatedly does consume meaningfully more tokens than a quick code question, since trace data — especially network logs and DOM snapshots — can be verbose. For occasional debugging sessions, this is a non-issue on most standard subscription tiers. If you’re planning to run the automated PR-comment workflow described earlier across a large suite with many tests failing daily, it’s worth keeping an eye on usage, and worth trimming trace payloads to the relevant excerpt (the specific failing step plus surrounding context) rather than always sending entire raw trace files, both for cost and for keeping the AI’s context focused on what’s actually relevant.
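Trimming to "the failing step plus surrounding context" can be as simple as slicing around the first step that carries an error. A sketch, assuming you've already flattened the trace into an ordered list of steps. The `TraceStep` shape is hypothetical and is not the actual format inside a Playwright trace zip.

```typescript
// Hypothetical flattened step shape; adapt to however you extract steps.
interface TraceStep { index: number; action: string; error?: string; }

// Keep the first failing step plus `context` steps on either side.
function excerptAroundFailure(steps: TraceStep[], context = 3): TraceStep[] {
  const failedAt = steps.findIndex((s) => s.error !== undefined);
  if (failedAt === -1) return []; // nothing failed, nothing worth sending
  const start = Math.max(0, failedAt - context);
  const end = Math.min(steps.length, failedAt + context + 1);
  return steps.slice(start, end);
}
```

A window of three steps either side is an arbitrary default. For isolation problems, where the cause can sit in an earlier test entirely, a wider window or the full setup phase may be the relevant excerpt instead.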
Measuring Success: Flaky Test Metrics Worth Tracking
Fixing individual flaky tests is good. Knowing whether your overall flakiness is improving over time is better, and it’s also the kind of evidence that justifies the time investment to a manager or team lead.
Metrics I Track
- Flakiness rate — percentage of test runs that pass on re-run after an initial failure, tracked weekly
- Re-run frequency — how often CI pipelines are manually re-triggered, a proxy for trust erosion
- Time-to-diagnosis — how long from “test flagged flaky” to “root cause identified,” which is the metric AI assistance most directly improves
- Flaky test backlog age — how long known-flaky tests sit unfixed, since these tend to get deprioritized indefinitely without a forcing function
A simple dashboard built from your CI JSON output (the same data your MCP triage tool reads) can track these over time without needing a dedicated paid tool, especially for a team just getting started with this discipline.
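As a concrete version of the first metric: if CI runs with retries enabled, Playwright marks a test that failed and then passed on retry with the status `flaky`, which lines up with the "passes on re-run after an initial failure" definition. A sketch of the weekly calculation under that assumption, with a function name of my own:

```typescript
type RunStatus = 'expected' | 'unexpected' | 'flaky' | 'skipped';

// Percentage of executed runs that failed first and then passed on retry.
function flakinessRatePercent(statuses: RunStatus[]): number {
  const executed = statuses.filter((s) => s !== 'skipped');
  if (executed.length === 0) return 0;
  const flaky = executed.filter((s) => s === 'flaky').length;
  return (flaky / executed.length) * 100;
}
```

Without retries configured, nothing is ever recorded as `flaky`, so this number would read zero regardless of how unstable the suite is. In that setup the manual re-run frequency is the more honest signal.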
Assigning Ownership, Not Just Tracking Metrics
Metrics alone don’t fix anything if nobody owns acting on them. The change that actually made our flaky-test backlog shrink wasn’t the dashboard itself — it was assigning a rotating “flaky test owner” role each sprint, whose explicit responsibility (alongside their regular work, time-boxed to a few hours) was triaging the top entries from the flakiness summary using the workflow in this post. Without an owner, even an accurate, visible dashboard tends to become background noise everyone glances at and nobody acts on, the same way a code-coverage number can sit unchanged for months once the initial push to add it fades.
A Note on Test Pyramid Balance
It’s worth stepping back to note that some flakiness is best addressed not by debugging individual tests at all, but by reconsidering whether a given check belongs at the end-to-end UI layer in the first place. E2E tests are inherently more flakiness-prone than unit or API-level tests, simply because they have more moving parts — network, rendering, animation, browser quirks — any one of which can introduce non-determinism. Several of the chronically flaky tests I’ve debugged using this workflow, once root-caused, turned out to be testing business logic that would have been far more reliably (and faster) verified at the API or unit level, with the E2E test only needing to confirm the UI correctly displays the result. I’d treat persistent, hard-to-fix flakiness in a given test as a prompt to ask “does this need to be an E2E test at all,” not just “how do I make this E2E test more stable.”
Common Myths About Debugging Flaky Tests with AI
Before the limitations section, I want to directly address a few misconceptions I keep running into in comments and conversations, since some of these actively discourage people from trying a workflow that would genuinely help them.
Myth: “AI will just tell me to add a longer timeout, so what’s the point”
This is a fair criticism of vague, low-effort prompting — and exactly why this entire post is structured around bringing evidence and a classification hypothesis rather than asking an open-ended question. With the prompt patterns shown here, “just add a timeout” stops being the default answer because the AI is being asked to confirm a specific mechanism using specific trace evidence, not asked to generate a generic suggestion from nothing.
Myth: “This only works if you’re already an AI/prompt engineering expert”
I’d push back on this directly — I’m not a prompt engineering expert, and I said as much at the start of this series. The prompt patterns in this post came from iterating on what actually produced useful output over real debugging sessions, not from any specialized prompt-engineering theory. Anyone comfortable reading a Playwright trace can apply these patterns with a normal learning curve.
Myth: “AI-assisted debugging replaces the need to understand Playwright deeply”
The opposite is closer to true, based on my experience. The sessions where AI assistance worked best were ones where I already had a strong hypothesis from understanding the application and Playwright’s actionability model — the AI accelerated verification and pattern-matching across data, but it didn’t substitute for that underlying understanding. Readers newer to Playwright should treat this post as a complement to learning the fundamentals, not a shortcut around them.
Myth: “Once you fix a flaky test with AI assistance, it’s permanently fixed”
Covered earlier in this post but worth restating as a myth specifically: no debugging approach, AI-assisted or otherwise, guarantees permanence. The verification period (multiple days of real CI traffic) and the “check related tests for the same pattern” step in the RCA template both exist because flaky test fixes can resurface or have siblings elsewhere in the suite.
When Debugging Flaky Tests with AI Goes Wrong
I promised at the start of this series that I’d be honest about limitations, and this section is where that matters most, because over-trusting AI output on flaky tests can genuinely make things worse, not just waste time.
False Confidence on Timing Issues
AI assistants are good at pattern-matching against common causes, but they will sometimes confidently suggest a timing fix (longer wait, different wait state) for a problem that’s actually a genuine application bug. If you accept that suggestion without verifying against the actual trace evidence, you’ve just hidden a real bug behind a longer timeout. Always ask “is this a test problem or an application problem?” explicitly, and don’t accept the first plausible-sounding answer.
Suggesting Retries as a Default Fix
Copilot and most other AI coding assistants have a strong tendency to suggest adding retries (test.describe.configure({ retries: 2 })) as a quick fix for flaky tests. Retries can be a reasonable temporary mitigation for genuinely environmental flakiness, but they’re frequently suggested even when the real fix is a proper code change, simply because retries are a common, low-effort pattern in the AI’s training data. I treat any AI suggestion that amounts to “just retry it” as a signal to dig deeper, not as an acceptable final answer.
Hallucinated API Methods
Less common with Playwright specifically since it’s well-represented in training data, but it still happens — an AI assistant suggesting a method or config option that doesn’t actually exist in your Playwright version, or that existed in an older version and was renamed. Always check suggested code against the actual Playwright docs for your installed version before trusting it blindly, especially right after a major version bump.
Not Understanding Your Application’s Specific Behavior
This is the big one. AI assistants reason from patterns seen across many codebases. They don’t know that your app’s specific modal has a 400ms CSS transition, or that your specific backend has a known caching quirk on the third API call in a session. The trace data gives them evidence, but the interpretation of that evidence still benefits enormously from your own domain knowledge of the application. I think of debugging flaky tests with AI as pairing with a very fast, very well-read junior engineer — genuinely useful, sometimes faster than me at spotting a pattern in data, but not a replacement for actually understanding the system under test.
Overfitting a Fix to One Failure Instance
Because AI assistants reason from the specific trace you give them, there’s a real risk of getting a fix that resolves the exact failure you showed it without addressing the broader pattern. I’ve had this happen with a test isolation issue where the suggested fix correctly addressed the specific state leak shown in one trace, but a different, related state leak from a different prior test still caused intermittent failures afterward. The lesson: after applying a suggested fix, re-run the full classification process rather than assuming one fix instance has solved the entire flakiness pattern, especially for isolation-related issues where multiple tests can be contributing to shared state problems simultaneously.
Treating AI Confidence Language as Calibrated
AI assistants will sometimes phrase a guess with the same confident tone as a well-evidenced conclusion. Without explicitly prompting for a stated confidence level (as shown in the custom instructions file earlier in this post), it can be hard to tell from tone alone whether a suggestion is backed by strong trace evidence or is a plausible-sounding pattern match with weaker support. This is part of why I now explicitly require a confidence statement in my custom instructions — it forces a distinction that doesn’t naturally show up in conversational tone alone.
Building a Simple Flakiness Dashboard
Once you’re tracking the metrics described earlier in this post — flakiness rate, re-run frequency, time-to-diagnosis, backlog age — a lightweight dashboard makes the trend visible to the whole team rather than living in your own head or a spreadsheet only you check.
A Minimal Implementation
You don’t need a dedicated paid tool to start. A small script run nightly against your CI’s stored run-history JSON, writing a simple summary, is enough to get useful visibility:
// generate-flakiness-summary.ts
import fs from 'fs';

interface TestRun { title: string; outcome: string; timestamp: string; }

function generateSummary(runs: TestRun[]) {
  // Group runs by test title, preserving input order within each group.
  const byTest: Record<string, TestRun[]> = {};
  for (const run of runs) {
    byTest[run.title] = byTest[run.title] || [];
    byTest[run.title].push(run);
  }
  const summary = Object.entries(byTest).map(([title, testRuns]) => {
    const recent = testRuns.slice(-20); // assumes runs arrive oldest first
    const fails = recent.filter(r => r.outcome === 'unexpected').length;
    return {
      title,
      runs: recent.length,
      flakinessScore: (fails / (recent.length || 1)).toFixed(2),
    };
  }).sort((a, b) => parseFloat(b.flakinessScore) - parseFloat(a.flakinessScore)); // flakiest first
  fs.writeFileSync('flakiness-summary.json', JSON.stringify(summary, null, 2));
}
This produces a sorted list of your flakiest tests by recent fail rate, which is genuinely useful on its own as a backlog-prioritization tool, separate from any AI tooling — start your next sprint’s flaky-test cleanup with whatever’s at the top of this list, not whatever happened to fail most recently and annoy someone.
Connecting the Dashboard Back to Debugging Flaky Tests with AI
The natural next step is wiring this same summary as another MCP tool, so you can ask “what are our top 5 flakiest tests right now, and what category does each likely fall into based on its failure pattern” and get a prioritized, pre-classified list rather than working through them one at a time. I’d treat this as a logical extension of the triage server built earlier in this post, once the basic version has proven useful in daily debugging.
Mistakes Teams Make When Adopting This Workflow
Beyond the individual-debugging-level mistakes covered above, there’s a set of team-level adoption mistakes worth flagging explicitly, since I’ve made some of these myself.
Skipping the Classification Step Entirely
The single most common mistake: jumping straight to “ask AI why this is flaky” without first classifying the symptom pattern, which produces the vague, low-quality prompts shown earlier in this post that waste everyone’s time and erode trust in the whole approach before it’s had a fair chance.
Not Investing in Artifact Capture First
Teams sometimes want to start with the AI tooling before fixing basic trace/artifact capture configuration. No prompt pattern in this post works without actual trace data behind it — get the boring infrastructure (config, CI artifact retention) right first.
Treating This as a One-Time Setup Instead of an Evolving Practice
The custom instructions file, the triage MCP tool, the prompt library — none of these are “set once and forget.” I update my instructions file every few weeks as I notice new patterns the AI keeps getting wrong, and I’d encourage treating this the same way you’d treat any other piece of internal tooling: something that improves with iteration, not a one-time installation.
Letting One Person Own All the Tribal Knowledge
If only one engineer on the team understands the prompt patterns and classification framework, the whole practice is fragile — it disappears the moment that person is busy or leaves. Document it centrally, as mentioned earlier, and rotate who leads flaky-test triage sessions so the knowledge spreads.
Measuring Adoption Once and Declaring Victory
The last mistake worth naming: treating the before/after snapshot from a single sprint as proof the problem is permanently solved. Flaky test rates tend to creep back up gradually as new tests get added by engineers who weren’t in the original rollout session, unless the code-review checklist and SLA described elsewhere in this post are actively maintained as ongoing practices rather than a one-time cleanup push. I’d recommend re-running the same before/after measurement every quarter, not just once, to catch this drift early rather than discovering six months later that the re-run rate has quietly climbed back toward where it started.
Best Practices Checklist for Debugging Flaky Tests with AI
- Configure trace, screenshot, and video capture on failure before you need them, not after
- Classify the flakiness category (timing, isolation, network, data, etc.) before prompting the AI — give it your hypothesis, don’t ask it to guess from nothing
- Always provide actual evidence (trace, network log, run history) rather than just describing the symptom from memory
- Explicitly ask whether the issue could be a real application bug, not just a test problem
- Be suspicious of any AI suggestion that’s purely “wait longer” or “add a retry” — push for the actual root cause
- Verify suggested locator/API changes against your actual installed Playwright version
- Don’t consider a flaky test fixed until it survives at least several days of real CI traffic, not just local re-runs
- Track flakiness metrics over time, not just individual test fixes, to know if the overall trend is improving
- Use fixture-level fixes for isolation problems, not test-level workarounds that hide the same bug elsewhere
Why Each of These Matters (Not Just a List to Skim)
It’s worth unpacking a couple of these rather than leaving them as a bare checklist, since the reasoning is as important as the rule itself. “Don’t consider a flaky test fixed until it survives several days” exists because flaky test fixes have a specific failure mode where they appear to work for the first day or two simply because the triggering condition (peak CI load, a specific data state, a particular time-of-day backend job) hasn’t recurred yet — not because the fix actually addressed the root cause. I’ve personally closed out a “fixed” flaky test ticket on day two, only to reopen it on day nine when the same weekly batch-job timing condition from Case Study Two recurred. Similarly, “use fixture-level fixes, not test-level workarounds” exists because test-level patches create a maintenance trap: the next engineer who writes a similar test won’t know about the workaround and will reintroduce the exact same bug in a new file, whereas a fixture-level fix protects every test that uses it going forward.
Code Review Checklist: Catching Flakiness Before It Ships
Everything above is about debugging flakiness after it’s already in your suite. As a QA lead, I’d rather catch these patterns in code review before a flaky test ever merges. Here’s the specific checklist I now use (and have started having Copilot check automatically via the custom instructions file from earlier) during PR review of new Playwright tests:
- Hardcoded test data identifiers — flag any test creating a record with a fixed ID, email, or username rather than a worker-scoped or timestamp-based unique value
- Mixed wait patterns — flag any test combining explicit waitForSelector with auto-waiting locator actions on the same element
- One-shot state checks — flag any isVisible(), isEnabled(), or similar one-shot boolean check used as an assertion, instead of an auto-retrying expect()
- networkidle usage — flag any use of waitForLoadState('networkidle') for a manual review of whether the page has polling/websocket activity that would make it unreliable
- Missing cleanup in custom fixtures — flag any new fixture that doesn’t clear cookies/storage, unless explicitly justified by a comment explaining why shared state is intentional
- Drag-and-drop or multi-step interactions without explicit waits — flag any custom (non-dragTo()) interaction sequence missing intermediate stability checks
- External API calls not mocked — flag any test hitting a third-party domain directly rather than through page.route() mocking, unless the test is explicitly an integration test by design
I’ve found that catching even half of these at review time meaningfully reduces how many new flaky tests show up in the backlog each sprint — it’s a much better return on effort than debugging flakiness after the fact, even with AI assistance making that debugging faster.
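Two of the checklist items are simple enough to pre-screen with a regex pass before a human or Copilot looks at the PR. A rough sketch covering only the networkidle and one-shot-check rules. The rule names, patterns, and `lintTestSource` function are my own, and a line-by-line regex will miss multi-line expressions, so treat hits as prompts for review rather than verdicts.

```typescript
interface Finding { line: number; rule: string; }

// Line-based patterns for two of the review checklist items.
const reviewRules: Array<{ rule: string; pattern: RegExp }> = [
  { rule: 'networkidle usage', pattern: /waitForLoadState\(\s*['"]networkidle['"]\s*\)/ },
  {
    rule: 'one-shot state check used as assertion',
    pattern: /expect\(\s*await\s.*\.(isVisible|isEnabled)\(\s*\)/,
  },
];

// Scan test source line by line and report 1-based line numbers for each hit.
function lintTestSource(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split('\n').forEach((text, i) => {
    for (const { rule, pattern } of reviewRules) {
      if (pattern.test(text)) findings.push({ line: i + 1, rule });
    }
  });
  return findings;
}
```

The remaining checklist items (hardcoded identifiers, missing fixture cleanup, unmocked third-party calls) depend on context a regex can't see, which is where the custom-instructions review earns its keep.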
Building a Flaky Test Knowledge Base
One habit I’d recommend alongside everything else in this post: keep the RCA reports described earlier somewhere searchable, rather than letting each one disappear into a closed ticket. Over a few months, this becomes a genuinely valuable internal resource — both for spotting recurring patterns specific to your application (maybe every isolation issue in your suite traces back to the same shared fixture, or every network-flakiness incident involves the same third-party payment provider) and for onboarding new team members faster, since they can search past RCAs instead of re-learning lessons the team already paid for once.
Practically, this doesn’t need to be elaborate — a dedicated tag or label in your issue tracker (flaky-rca), combined with the RCA template from earlier, is enough to make this searchable without building a separate system. The value compounds the longer you keep it up: six months in, a new flaky test that “smells” similar to a past incident can often be diagnosed in minutes just by searching the knowledge base, before even opening an AI debugging session at all — sometimes the fastest debugging is recognizing you’ve already solved this exact problem before.
Security and Data Privacy: What You Should (and Shouldn’t) Feed to AI Tools
This is a section I almost skipped, but as a QA lead I’d be doing readers a disservice not to address it directly, because trace files and network logs can contain sensitive data that shouldn’t casually be sent to a third-party AI service.
What Commonly Ends Up in Traces
- Authentication tokens and session cookies, captured automatically in network logs
- Real customer data, if your tests run against any environment with production-like data rather than fully synthetic test data
- API keys, if they’re visible in request headers or query parameters during the recorded run
- Internal URLs and infrastructure details that reveal more about your backend architecture than you’d want shared externally
Practical Mitigations
- Always test against synthetic/staging data, never production data, as a baseline practice independent of AI tooling — this is good hygiene regardless
- Check your organization’s data handling policy for whichever AI tool/provider you’re using before sending trace data, especially if your company has specific data residency or processing agreements in place
- Scrub or redact tokens and cookies from network logs before pasting them into a chat session, if your MCP setup doesn’t already do this automatically
- Prefer enterprise/business-tier AI tooling with documented data retention and training-opt-out policies over consumer-tier tools when working with anything that touches real infrastructure details
None of this should block you from using AI-assisted debugging — but it’s the kind of thing worth a five-minute conversation with whoever owns security/compliance at your organization before rolling this out broadly, not something to discover after the fact.
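The redaction step in the mitigations above can be sketched as a small pre-processing function you run before pasting a network log anywhere. This is a minimal sketch under assumptions, not a complete scrubber: it expects a HAR-shaped object (`log.entries[].request.headers` and `response.headers`), and the `redactHar` name and both deny-lists are my own illustrative choices rather than anything Playwright or an MCP server provides.

```typescript
// Minimal HAR shapes: only the fields this sketch touches.
interface HarHeader { name: string; value: string }
interface HarMessage { headers: HarHeader[] }
interface HarEntry {
  request: HarMessage & { url: string };
  response: HarMessage;
}
interface Har { log: { entries: HarEntry[] } }

// Hypothetical deny-lists: extend them for your own auth scheme.
const SENSITIVE_HEADERS = ["authorization", "cookie", "set-cookie", "x-api-key"];
const SENSITIVE_PARAMS = /([?&](?:token|api_key|apikey|access_token)=)[^&#]*/gi;

// Replace the value of any header whose name is on the deny-list.
function redactHeaders(headers: HarHeader[]): HarHeader[] {
  return headers.map((h) =>
    SENSITIVE_HEADERS.indexOf(h.name.toLowerCase()) !== -1
      ? { ...h, value: "[REDACTED]" }
      : h
  );
}

// Replace the value of any query parameter whose name is on the deny-list.
function redactUrl(url: string): string {
  return url.replace(SENSITIVE_PARAMS, "$1[REDACTED]");
}

// Returns a redacted copy; the input object is left untouched.
function redactHar(har: Har): Har {
  return {
    ...har,
    log: {
      ...har.log,
      entries: har.log.entries.map((entry) => ({
        ...entry,
        request: {
          ...entry.request,
          url: redactUrl(entry.request.url),
          headers: redactHeaders(entry.request.headers),
        },
        response: {
          ...entry.response,
          headers: redactHeaders(entry.response.headers),
        },
      })),
    },
  };
}
```

Note what this sketch does not cover: request and response bodies, and the separate `cookies` arrays a HAR file can carry, would need their own handling. Treat it as a starting point to review with whoever owns security at your organization, not as proof that a log is safe to share.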
Rolling This Out Across a QA Team: A Manager’s Perspective
Everything in this post so far has been about the individual debugging workflow. As a QA lead, the harder problem is usually getting an entire team to actually adopt a new workflow consistently, rather than having one enthusiastic engineer use it while everyone else keeps doing things the old way. Here’s what’s actually worked when I’ve rolled similar workflow changes out before.
Start With One Visible Win, Not a Mandate
Rather than announcing “we’re now debugging flaky tests with AI” as a policy, I picked the team’s most notoriously annoying flaky test — the one everyone had individually lost an afternoon to at some point — and debugged it live in a team session using the workflow in this post. A concrete, visible win on a test people already had history with does more to drive adoption than any process document.
Document the Prompt Patterns Centrally
The specific prompt patterns in this post took me weeks of trial and error to land on. Don’t make every engineer on your team rediscover them independently — put them in your team wiki or repository README so a junior SDET can use a refined prompt pattern on day one instead of starting from “why is this test flaky?” like I did.
Pair the Workflow With the Classification Table, Not Just the AI Tool
The biggest adoption mistake I’d warn other QA leads against: introducing the AI tooling without first teaching the underlying classification framework (the seven root-cause categories from earlier in this post). Engineers who skip straight to “ask the AI” without understanding what they’re asking it to check for tend to accept the first plausible-sounding answer, even when it’s a “just add a retry” suggestion that should be pushed back on.
Track Adoption the Same Way You Track Outcomes
Alongside the flakiness metrics covered earlier in this post, I also track a simpler adoption metric: what percentage of flaky test fixes in a given sprint went through the documented AI-assisted triage process versus an ad-hoc fix. This isn’t about enforcement — it’s about knowing whether the investment in setting this up is actually changing team behavior, or whether it quietly fell out of use after the initial novelty wore off.
Addressing Skepticism From Senior Engineers
Not everyone on a team is immediately convinced, and that skepticism is often reasonable — senior engineers who’ve been burned by overconfident AI suggestions before are right to be cautious. What’s worked for me isn’t arguing the case abstractly; it’s showing the “When AI Gets It Wrong” section of this post explicitly as part of the rollout, so the team understands this is a tool with documented limitations, not a black box being treated as infallible. Skeptical engineers tend to come around faster when the limitations are stated upfront rather than discovered the hard way.
A Sample Workshop Agenda for Introducing This to Your Team
If you’re leading this rollout yourself, here’s roughly the structure that worked for my team’s introductory session, scaled to about 90 minutes:
- (10 min) Why flaky tests matter — the trust-erosion argument, framed around the team’s own actual re-run habits, not abstract theory
- (15 min) The seven-category classification framework, with real examples from your own codebase if you have them
- (30 min) Live debugging of a real, currently-unsolved flaky test from your backlog, using the prompt patterns in this post, with the whole team watching and contributing hypotheses
- (15 min) Walk through the “When AI Gets It Wrong” limitations explicitly, so the team calibrates trust appropriately from day one
- (20 min) Hands-on — pair up and have each pair pick a flaky test from the backlog to triage using the framework, reporting back next session
The live debugging segment is the part people remember — watching a genuinely difficult, real test get diagnosed in real time does more for buy-in than any slide deck.
A Sample RCA Report Template
For teams that require documented root-cause-analysis for recurring issues (common in regulated industries or larger organizations with formal incident processes), here’s a template I’ve adapted specifically for flaky test RCAs, informed by the workflow in this post:
## Flaky Test RCA: [test name]
**Symptom:** [exact failure mode, e.g. "element not found: .toast"]
**Frequency:** [X fails / Y recent runs]
**Reproduces locally:** [yes/no]
**Initial classification:** [timing / isolation / network / data / etc.]
**Evidence reviewed:** [trace files, run history range, network logs]
**AI-assisted diagnosis summary:** [1-2 sentence summary of the AI's analysis, plus your own verification of it]
**Confirmed root cause:** [the actual mechanism, e.g. "isVisible() check used instead of expect().toBeVisible(), causing a non-retrying assertion"]
**Fix applied:** [code change description, link to PR]
**Verification period:** [date range observed post-fix, pass rate during that window]
**Related tests checked for the same pattern:** [list, if any]
That last field — checking related tests for the same pattern — is one I added after the “overfitting a fix to one failure instance” issue described earlier in this post. A good RCA process should prompt you to check whether the same root cause exists elsewhere in the suite, not just in the one test that happened to get flagged.
How to Talk About This Experience in Interviews
Since a good chunk of this blog’s audience is job-hunting SDETs and QA engineers (myself included, as regular readers know), I want to address this directly: the workflow in this post is genuinely strong interview material, but only if you can speak to it specifically rather than name-dropping “I used AI for testing.”
What’s worked for me when discussing this in interviews: lead with the classification framework, not the AI tool. Saying “I built a systematic approach to categorizing flaky test root causes, and used AI assistance to accelerate evidence-gathering within that framework” demonstrates engineering judgment, not just tool usage. Interviewers, especially technical ones, tend to probe on exactly the limitations covered in the “When AI Gets It Wrong” section of this post; being able to discuss those limitations unprompted signals you’ve actually used these tools critically, not just enthusiastically. If asked for a concrete example, the checkout test case study earlier in this post is exactly the kind of structured, evidence-based story (symptom → hypothesis → evidence → diagnosis → fix → verification) that interviewers who favor STAR-format answers want to hear.
A Brief Note on Visual Regression and Cross-Browser Flakiness
One category I haven’t given a full treatment to elsewhere in this post: visual regression tests (pixel-diffing screenshots) have their own distinct flakiness profile, driven mostly by font rendering differences across CI runner OS versions, anti-aliasing inconsistencies, and animation timing at the moment a screenshot is captured — rather than the DOM/network-driven causes covered in the rest of this post. AI assistance here is more limited in my experience, since the evidence is fundamentally visual/pixel-based rather than structured timing or network data; an AI assistant can help reason about why a screenshot test might be unstable (e.g., “this screenshot is captured immediately after a CSS transition starts rather than waiting for it to settle”) once you describe the symptom, but it can’t directly compare pixel diffs the way it can compare structured trace JSON. This is an area I expect to improve significantly as multimodal AI tooling matures, and it’s likely to get its own dedicated treatment in a future post once I’ve done enough of this work to write about it honestly.
Cross-Browser Flakiness: Chromium vs Firefox vs WebKit
A related pattern worth flagging separately: tests that pass reliably on Chromium but flake specifically on WebKit (Safari’s engine) or Firefox are common enough to deserve their own check in your classification process. The usual cause is subtly different default timing behavior. WebKit in particular has historically had different default animation and focus-event timing than Chromium, which can expose race conditions that Chromium’s more forgiving timing happens to mask. When a test is flaky on only one browser project, I explicitly tell the AI assistant which browser engine the failure is specific to, since this narrows the likely category significantly toward timing/animation issues. It’s an easy detail to forget if you’re used to running everything on Chromium locally during development.
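To be able to say which engine a failure is specific to, the suite needs one project per browser in the first place. The fragment below is a minimal sketch of that setup; the project names are the conventional ones from Playwright's scaffolded config, and your real config will carry other settings (trace, retries, reporters) that I've left out here.

```typescript
// playwright.config.ts (sketch: only the cross-browser part is shown)
import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
  projects: [
    // One project per engine, so results and traces are labeled by browser.
    { name: 'chromium', use: { ...devices['Desktop Chrome'] } },
    { name: 'firefox', use: { ...devices['Desktop Firefox'] } },
    { name: 'webkit', use: { ...devices['Desktop Safari'] } },
  ],
});
```

With projects defined this way, a suspected WebKit-only flake can be exercised on its own with `npx playwright test --project=webkit --repeat-each=20`, which gives you a per-engine fail rate to include in the prompt rather than an impression.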
What’s Next: Part 3
This post focused specifically on debugging flaky tests with AI once they already exist in your suite. Part 3 of this series will go upstream — using AI assistance to prevent flakiness at the point of writing new tests in the first place, including patterns for AI-suggested locator strategies that are flaky-resistant by design, and a closer look at self-healing test approaches. If you’ve found this post useful, the prompt patterns and MCP triage tool here are a good foundation to have in place before that next post, since prevention strategies build directly on the same diagnostic habits.
Frequently Asked Questions
Can AI completely eliminate flaky tests?
No, and I’d be skeptical of any claim that it can. Debugging flaky tests with AI significantly speeds up root cause analysis, but it doesn’t replace the underlying engineering work of fixing test isolation, locator strategy, or application-level race conditions. Think of it as a faster diagnostic tool, not an autopilot.
Does GitHub Copilot need MCP to help debug flaky Playwright tests?
Not strictly — you can paste trace excerpts and code into a Copilot chat without MCP and still get useful suggestions when debugging flaky tests with AI. MCP’s value is removing the manual copy-paste step and giving the AI structured, automatic access to trace files, run history, and logs, which produces noticeably better diagnoses with less manual effort on your part.
What’s the difference between a flaky test and a broken test?
A broken test fails consistently because something is genuinely wrong. A flaky test produces different results across runs of identical code — sometimes passing, sometimes failing — usually due to timing, race conditions, test isolation problems, or environmental factors rather than a deterministic bug. This distinction matters a lot once you start debugging flaky tests with AI, since the diagnostic approach for each is different.
Should I just add retries to fix flaky tests?
Retries are a reasonable short-term mitigation for genuinely environmental flakiness (e.g., known third-party service instability you can’t control), but they shouldn’t be your default fix. Both AI assistants and engineers under deadline pressure tend to over-suggest retries because they’re low-effort, but they hide root causes rather than fixing them.
How many CI runs should I look at before trusting an AI’s flakiness diagnosis?
I wouldn’t trust a diagnosis based on a single failure. Aim for at least 10-20 recent runs so there’s an actual pattern to analyze — a fail rate, a correlation with parallelism or time of day, or a consistent symptom. A single data point isn’t enough evidence for a confident root-cause claim, AI-assisted or not.
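The "enough runs" rule above is easy to encode so nobody diagnoses from a single data point by accident. This is a minimal sketch: the `summarizeRuns` name and the result shape are hypothetical, and the default of 10 simply mirrors the lower bound I suggested, so treat it as a judgment call rather than a statistical threshold.

```typescript
type RunResult = "pass" | "fail";

interface FlakinessSummary {
  totalRuns: number;
  failedRuns: number;
  failRate: number;        // failedRuns / totalRuns, between 0 and 1
  enoughEvidence: boolean; // true once the window holds at least minRuns runs
}

// Summarize a recent window of CI results for one test.
// minRuns is an assumption taken from the 10-20 run guideline above.
function summarizeRuns(results: RunResult[], minRuns = 10): FlakinessSummary {
  const totalRuns = results.length;
  const failedRuns = results.filter((r) => r === "fail").length;
  return {
    totalRuns,
    failedRuns,
    failRate: totalRuns === 0 ? 0 : failedRuns / totalRuns,
    enoughEvidence: totalRuns >= minRuns,
  };
}
```

In practice I'd check `enoughEvidence` before starting an AI session at all, and include `failedRuns` and `totalRuns` verbatim in the prompt.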
Is it worth building a custom MCP server just for debugging flaky tests with AI?
If you’re debugging flaky tests occasionally, the manual prompt patterns in this post are enough — you don’t need custom tooling. If your team is debugging flaky tests weekly or more, the time saved by automating the trace/run-history retrieval (as shown in the MCP triage tool section) tends to pay for the initial setup effort within the first month, based on the rough ROI numbers in this post.
Does this workflow work for Selenium tests too, or only Playwright?
The underlying principles — classify before diagnosing, bring evidence not just a symptom description, be suspicious of “just add a retry” suggestions — apply equally to Selenium. The specific tooling differs: Selenium doesn’t have Playwright’s built-in trace viewer, so you’d need to assemble equivalent evidence (screenshots, browser console logs, network HAR files) through other means before feeding it to an AI assistant.
Can AI help with flaky tests in CI environments I don’t have direct log access to?
Only as well as the data you can extract and provide to it. If your CI platform doesn’t surface trace files, run history, or logs in an exportable format, no AI tool can read what it doesn’t have access to. Step one for any team struggling here is improving CI artifact capture and retention before worrying about AI tooling at all.
How do I know if my flaky test fix actually worked, versus just got lucky?
Don’t trust a handful of passing local runs. Watch the test across real CI traffic for at least several days to a week, ideally covering different load conditions (peak deployment hours, parallel CI runs, off-hours). A fix that holds for one day and then resurfaces the following week is a strong sign the root cause wasn’t fully addressed.
Should junior QA engineers learn manual flaky test debugging before using AI tools?
Yes, and I’d say this strongly. The prompt patterns in this post work because I bring a hypothesis informed by years of manually debugging exactly these patterns. If you skip straight to “ask AI why it’s flaky” without understanding the underlying categories (timing, isolation, network, etc.) yourself, you won’t be able to judge whether the AI’s suggestion actually makes sense for your application — which is precisely the failure mode described in the “When AI Gets It Wrong” section above.
Can this workflow help with flaky tests in mobile app automation (Appium), not just web?
The classification framework and prompting principles transfer directly — timing, isolation, network, and environment categories all apply to mobile automation too. The tooling specifics differ, since Appium doesn’t have Playwright’s built-in trace viewer; you’d need to assemble equivalent evidence from Appium’s logs and screenshots before feeding it to an AI assistant the way described in this post.
What’s the single highest-leverage change to make if I only do one thing from this post?
Configure proper trace, screenshot, and video capture on failure (covered early in this post) if you haven’t already. Every other technique in this post depends on having actual evidence to feed the AI assistant — without it, you’re back to vague, low-quality prompts regardless of how good the AI tool itself is.
How is this different from just using a flaky test detection feature built into my CI platform?
Built-in flaky test detection (several CI platforms now flag tests with inconsistent pass/fail history automatically) is a useful first step for identifying which tests are flaky, but it generally stops there — it tells you a test is flaky, not why. This post’s workflow picks up from that point, using AI assistance to actually diagnose the root cause once a test has been flagged.
Do I need a paid AI tool subscription, or does a free tier work for this?
Most of the prompt patterns in this post work on standard subscription tiers without anything exotic. Where you’ll hit limits is high-volume automated usage (like the PR-comment bot described earlier, run across a large suite with frequent failures) — for occasional interactive debugging sessions, a standard tier is generally sufficient.
What if my team isn’t ready to set up MCP at all — is there still value here?
Yes. Every prompt pattern in this post works, just with more manual effort, by directly pasting trace excerpts, network logs, and run history into a regular Copilot or Claude chat session instead of having an MCP server fetch them automatically. MCP removes friction and enables automation (like the PR-comment bot), but the classification framework and prompt structure, which is where most of the actual value comes from, work identically either way.
How do I convince my manager this is worth the time investment?
Use the before/after metrics framing shown earlier in this post — re-run rate, time-to-diagnosis, backlog age — rather than an abstract pitch about AI tooling. Managers respond better to “our re-run rate dropped from 28% to 9% over four weeks” than to “we should use AI for testing.” If you’re the manager reading this and trying to build the case upward yourself, the same framing applies one level up.
Does this approach work equally well for new test suites versus large legacy suites?
The classification framework and prompting principles apply equally to both. In practice, legacy suites tend to have more accumulated test-isolation debt (shared fixtures written years ago by people no longer on the team), which makes the isolation-debugging workflow in this post disproportionately valuable there — that’s usually the category with the largest backlog in an older suite.
Can I use this same approach to debug flaky unit tests, not just end-to-end UI tests?
The classification categories shift somewhat — timing/race conditions and test isolation still apply, but network and animation-timing categories are less relevant for most unit tests, while categories like shared mutable state in test doubles or improperly reset mocks become more prominent. The core principle (classify first, bring evidence, be suspicious of “just add a retry”) transfers directly even though the specific tooling and evidence sources differ.
How often should I revisit and update my custom Copilot instructions file?
I update mine roughly every few weeks, whenever I notice the AI repeatedly making the same unhelpful suggestion (like over-suggesting retries) that I have to manually correct. Treat it the same way you’d treat a linter configuration — something that improves incrementally as you notice gaps, not something you write once and never touch again.
Does capturing traces on every CI run slow down the pipeline significantly?
With trace: 'retain-on-failure' rather than 'on', Playwright still records a trace for every test but discards it when the test passes, so only failing runs leave trace artifacts to store and upload. The recording itself adds some overhead to every run, but since the failing runs are exactly the ones you need to debug, the tradeoff is firmly worth it. I wouldn’t recommend disabling trace capture for performance reasons unless you’ve specifically measured it as a bottleneck. If you have, 'on-first-retry' records only when a test is retried (which requires retries to be enabled), at the cost of having no trace for the first failure.
Should I apply this same workflow to flaky tests in a staging environment versus production smoke tests?
The classification framework applies equally, but the stakes and caution level differ. For production smoke tests specifically, I’d be more conservative about any fix that touches timeouts or retries, since masking a genuine intermittent production issue behind a longer timeout has real business consequences beyond just CI noise — treat any AI-suggested fix here with extra scrutiny against the “is this actually an application bug” question raised earlier in this post.
What if the AI assistant and I disagree on the root cause?
Trust your own judgment, informed by your knowledge of the application — that’s precisely the scenario the “domain knowledge” limitation described earlier in this post is about. When I’ve disagreed with an AI’s diagnosis, the resolution has usually been asking it to walk through its reasoning against the specific timestamps or data points step by step, which either reveals a gap in its analysis or surfaces something I’d missed. Either way, don’t defer automatically just because the explanation sounds confident.
Is it worth retroactively applying this workflow to an old, large flaky-test backlog, or only to new flakiness going forward?
I’d recommend starting with new flakiness going forward (catching it via the code-review checklist and the SLA described earlier) while working through the existing backlog gradually using the flakiness summary’s sorted list, rather than trying to tackle the entire backlog in one dedicated sprint. A large backlog accumulated over months or years usually represents a wide mix of root causes, and working through it gradually, prioritized by fail rate, tends to be more sustainable than a single big push that competes with regular feature work and risks losing momentum partway through.
A Quick-Reference Error Pattern Table
One more practical reference before the appendix — a quick lookup table mapping common Playwright failure messages to their most likely category, useful as a first-pass triage even before bringing in AI assistance:
| Error Message Pattern | Most Likely Category |
|---|---|
| “element not found” / “element not attached to the DOM” | Timing, or test isolation if intermittent across suite runs |
| “Timeout exceeded while waiting for…” | Network/third-party, or resource contention under CI load |
| “element is not stable” / “element is not visible” | Animation/rendering timing |
| Assertion mismatch with seemingly correct expected value | Test data collision or isolation |
| Failure only inside an iframe context | Iframe timing (see Playwright-specific patterns above) |
| Failure only on a specific CI runner/region | Locale/internationalization mismatch |
| Failure only at specific times of day/week | Resource contention or a scheduled backend job, as in Case Study Two |
This table is intentionally a starting heuristic, not a guarantee — always confirm against actual trace evidence using the prompt patterns above rather than assuming the table’s first guess is correct. It’s meant to save you a few minutes of initial classification, not replace the evidence-based diagnosis process this entire post is built around.
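The message-based rows of the table can be turned into a first-pass lookup, for example to pre-label failures in a report. This is a minimal sketch: `firstPassCategory` is a hypothetical helper, the patterns are copied from the table rather than from Playwright's source, and the exact wording of real error messages varies by Playwright version, so expect to adjust the regexes against your own logs. The context-based rows (iframe, runner, time of day) are deliberately left out because a message alone can't reveal them.

```typescript
interface PatternRule {
  pattern: RegExp;
  category: string;
}

// Rules mirror the message-based rows of the table above, in the same order.
const RULES: PatternRule[] = [
  {
    pattern: /element not found|not attached to the DOM/i,
    category: "Timing, or test isolation if intermittent across suite runs",
  },
  {
    pattern: /Timeout exceeded while waiting for/i,
    category: "Network/third-party, or resource contention under CI load",
  },
  {
    pattern: /element is not stable|element is not visible/i,
    category: "Animation/rendering timing",
  },
];

// Returns the table's first guess, or null when no pattern matches.
function firstPassCategory(errorMessage: string): string | null {
  const rule = RULES.find((r) => r.pattern.test(errorMessage));
  return rule ? rule.category : null;
}
```

A `null` result is useful information too: it means the failure needs classifying by symptom pattern and run history rather than by message text.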
Appendix: Complete Prompt Cheat Sheet
For quick reference, here’s every prompt pattern from this post consolidated in one place — useful to bookmark or paste into your team wiki alongside the custom instructions file from earlier.
General Diagnosis Starter
Here is the Playwright trace for a failing run of `[test name]`, plus the last [N] CI run results for this test ([X] pass, [Y] fail). The test [passes/fails] [consistently/inconsistently] when run in isolation. I suspect this is a [category] issue, based on [specific symptom].
1. Confirm or rule out [category] as the root cause using the trace data
2. If confirmed, identify the specific mechanism
3. If ruled out, suggest the next most likely category
Timing / Race Conditions
Analyze this trace for `[test name]`. Identify any action (click, fill, etc.) that occurs within 200ms of a DOM mutation or network response completing. Flag any locator waits that resolved on `visible` rather than full actionability (stable, enabled, receives-events).
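It helps to know what the 200ms check in this prompt amounts to, so you can sanity-check the AI's answer by hand. The sketch below runs the same check over a simplified event timeline. The `TimelineEvent` shape is hypothetical: it is not Playwright's trace format, just a flattened list you could assemble from the trace viewer or an MCP tool's output.

```typescript
type EventKind = "action" | "dom-mutation" | "network-response";

interface TimelineEvent {
  kind: EventKind;
  label: string;       // e.g. "click #submit" or "GET /api/cart"
  timestampMs: number; // milliseconds since test start
}

interface RiskyAction {
  action: TimelineEvent;
  trigger: TimelineEvent;
  gapMs: number;
}

// Flag every action that falls within windowMs of a DOM mutation or a
// completed network response, on either side of it.
function findRiskyActions(events: TimelineEvent[], windowMs = 200): RiskyAction[] {
  const actions = events.filter((e) => e.kind === "action");
  const triggers = events.filter((e) => e.kind !== "action");
  const risky: RiskyAction[] = [];
  for (const action of actions) {
    for (const trigger of triggers) {
      const gapMs = Math.abs(action.timestampMs - trigger.timestampMs);
      if (gapMs <= windowMs) {
        risky.push({ action, trigger, gapMs });
      }
    }
  }
  return risky;
}
```

A flagged pair is a candidate for a race, not proof of one, which is why the prompt also asks about actionability rather than stopping at timestamps.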
Test Data Collisions
This test creates a record with identifier `[value]`. Check the trace and network log for any conflict responses, or any assertion comparing against stale data from a previous run. Suggest whether this identifier should be made unique per run.
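When the answer to that last question is "yes", the fix is usually a small helper. Here is a minimal sketch of one, assuming you pass in Playwright's `testInfo.workerIndex`; the `uniqueTestId` name and the ID layout are my own illustrative choices.

```typescript
// Per-process counter so two IDs created in the same millisecond still differ.
let sequence = 0;

// Build an identifier that is unique per worker and per call.
// workerIndex is expected to come from Playwright's testInfo.workerIndex.
function uniqueTestId(prefix: string, workerIndex: number, now: number = Date.now()): string {
  sequence += 1;
  return `${prefix}-w${workerIndex}-${now.toString(36)}-${sequence}`;
}

// Hypothetical usage inside a test:
//   test('creates an order', async ({ page }, testInfo) => {
//     const orderRef = uniqueTestId('order', testInfo.workerIndex);
//     ...
//   });
```

The worker index keeps parallel workers apart, the timestamp keeps separate CI runs apart, and the counter covers repeated calls within one test.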
Test Isolation
Compare the network requests and console output of this test run alone versus in full-suite mode [paste both]. Identify any shared resource that differs between the two runs, and suggest a fixture-level fix, not a test-level workaround.
Network / Third-Party Dependency
Review the network log for this trace. List every external domain called, their response times, and flag any call exceeding 2 seconds or returning a non-2xx status the test didn't explicitly expect.
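The same review can be run locally over a HAR export, which is a useful cross-check on whatever the AI reports. This is a minimal sketch using the standard HAR entry fields (`request.url`, `response.status`, and `time` in milliseconds). The `summarizeExternalCalls` name and the `appHost` parameter are hypothetical, and the code can't know which non-2xx responses your test expected, so it flags all of them.

```typescript
interface HarEntryLite {
  request: { url: string };
  response: { status: number };
  time: number; // total elapsed time for the request, in milliseconds
}

interface DomainReport {
  domain: string;
  calls: number;
  slowestMs: number;
  flagged: string[];
}

// Extract the host from an absolute URL; returns "" if it can't be parsed.
function hostOf(url: string): string {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^/:?#]+)/i.exec(url);
  return match?.[1]?.toLowerCase() ?? "";
}

// Group calls by external domain and flag slow or non-2xx responses.
function summarizeExternalCalls(
  entries: HarEntryLite[],
  appHost: string,
  slowMs = 2000
): DomainReport[] {
  const byDomain = new Map<string, DomainReport>();
  for (const entry of entries) {
    const domain = hostOf(entry.request.url);
    if (domain === "" || domain === appHost) continue; // skip first-party calls
    const report = byDomain.get(domain) ?? { domain, calls: 0, slowestMs: 0, flagged: [] };
    report.calls += 1;
    report.slowestMs = Math.max(report.slowestMs, entry.time);
    if (entry.time > slowMs) {
      report.flagged.push(`${entry.request.url} took ${Math.round(entry.time)}ms`);
    }
    const status = entry.response.status;
    if (status < 200 || status >= 300) {
      report.flagged.push(`${entry.request.url} returned ${status}`);
    }
    byDomain.set(domain, report);
  }
  return Array.from(byDomain.values());
}
```

Pasting the resulting per-domain summary into the prompt, instead of the raw log, also keeps the context smaller and makes the redaction step easier to verify.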
Resource Contention / CI Parallelism
Here is the run history with timestamps and worker/shard IDs for `[test name]`. Check whether failures correlate with specific worker counts, CI load, or time of day, rather than appearing randomly distributed.
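Before asking the AI to look for correlations, I find it worth computing the obvious ones myself. The sketch below groups a run history by worker count or by hour of day and reports the fail rate in each bucket. The `RunRecord` shape is hypothetical, so map it from whatever your CI or the Playwright JSON reporter actually gives you.

```typescript
interface RunRecord {
  passed: boolean;
  workerCount: number; // parallel workers used for that CI run
  startedAt: string;   // ISO 8601 timestamp, e.g. "2026-01-05T02:10:00Z"
}

interface Bucket {
  runs: number;
  failures: number;
  failRate: number;
}

// Group runs by a numeric key and compute the fail rate per group.
function failRateBy(runs: RunRecord[], keyOf: (run: RunRecord) => number): Map<number, Bucket> {
  const buckets = new Map<number, Bucket>();
  for (const run of runs) {
    const key = keyOf(run);
    const bucket = buckets.get(key) ?? { runs: 0, failures: 0, failRate: 0 };
    bucket.runs += 1;
    if (!run.passed) bucket.failures += 1;
    bucket.failRate = bucket.failures / bucket.runs;
    buckets.set(key, bucket);
  }
  return buckets;
}

// Two ready-made groupings matching the prompt above.
const byWorkerCount = (runs: RunRecord[]) => failRateBy(runs, (r) => r.workerCount);
const byUtcHour = (runs: RunRecord[]) =>
  failRateBy(runs, (r) => new Date(r.startedAt).getUTCHours());
```

With only 10-20 runs per test, each bucket will be small, so I treat a skewed bucket as a hypothesis to bring to the AI session rather than as a conclusion.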
Accessibility Tree Diffing
Compare the accessibility tree snapshot at the moment of the failed action against the same step's snapshot from a passing run. Flag any difference in role, accessible name, or disabled state for the target element.
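The comparison this prompt asks for is small enough to express directly, which makes it easy to verify what the AI claims to have found. This is a minimal sketch over a hypothetical `A11yNode` shape holding just the three properties named in the prompt. How you extract those values for the target element depends on your tooling, and I'm not assuming any particular Playwright API here.

```typescript
interface A11yNode {
  role: string;
  name: string;
  disabled?: boolean; // treated as false when absent
}

// List the differences in role, accessible name, and disabled state
// between the passing run's node and the failing run's node.
function diffA11yNode(passing: A11yNode, failing: A11yNode): string[] {
  const diffs: string[] = [];
  if (passing.role !== failing.role) {
    diffs.push(`role: "${passing.role}" -> "${failing.role}"`);
  }
  if (passing.name !== failing.name) {
    diffs.push(`name: "${passing.name}" -> "${failing.name}"`);
  }
  const wasDisabled = Boolean(passing.disabled);
  const isDisabled = Boolean(failing.disabled);
  if (wasDisabled !== isDisabled) {
    diffs.push(`disabled: ${wasDisabled} -> ${isDisabled}`);
  }
  return diffs;
}
```

An empty result means the target element looked the same to assistive technology in both runs, which points away from a locator or state problem and back toward timing.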
Quick Start: Your First Week With This Workflow
If everything above feels like a lot to absorb at once, here’s the condensed, practical sequence I’d actually recommend for someone starting from zero this week:
- Day 1: Confirm trace, screenshot, and video capture are configured correctly in your Playwright config (the snippet near the start of this post). If they’re not, fix this first — nothing else works without it.
- Day 2: Pick your single most chronically annoying flaky test. Pull its last 15-20 CI run results and classify it against the seven-category table in this post, based on symptom pattern alone, before involving AI at all.
- Day 3: Run the AI-assisted diagnosis using the matching prompt pattern from the cheat sheet, with actual trace data attached. Don’t accept the first answer uncritically — cross-check it against the “When AI Gets It Wrong” section.
- Day 4: Implement the fix at the fixture level if it’s an isolation issue, or the specific code-level fix for other categories. Run it 20-30 times locally as a first check, then watch it in CI over the following days before considering it done.
- Day 5: Set up the custom Copilot instructions file for your repository, encoding the conventions and guardrails from this post so future sessions start from a better baseline.
- Following week: If this proved useful, build the minimal MCP triage tool and start tracking the flakiness metrics described earlier, so you have real before/after numbers within a month rather than relying on impression alone.
This sequence deliberately starts narrow — one test, fully understood — rather than trying to apply the entire framework across your whole suite on day one. The patterns and tooling generalize quickly once you’ve internalized them on a single real example.
Glossary of Terms Used in This Post
- Flaky test — a test that produces different pass/fail results across runs of identical code, with no underlying code change
- MCP (Model Context Protocol) — a standard for giving AI assistants structured, tool-based access to external context (files, logs, APIs) rather than relying on manual copy-paste
- Trace file — Playwright’s recorded timeline of a test run, including DOM snapshots, network activity, and console output, viewable in the trace viewer
- Test isolation — the principle that each test should run independently, without depending on or being affected by the state left behind by other tests
- Race condition — a bug where the outcome depends on the relative timing of two or more events, producing inconsistent results depending on which happens first
- Actionability — Playwright’s internal check that an element is visible, stable, enabled, and able to receive events before performing an action like a click
- CI sharding/parallelism — splitting a test suite across multiple parallel workers or machines to reduce total run time, which can introduce resource contention as a flakiness source
- Self-healing test automation — automation frameworks or AI-assisted approaches that automatically adapt locators or wait conditions when the application under test changes, reducing maintenance burden (covered in more depth in Part 3)
- Flakiness score — a simple ratio of failed runs to total runs over a recent window (e.g., the last 20), used throughout this post to prioritize which flaky tests to triage first
- Root cause analysis (RCA) — a structured, documented process for identifying the actual underlying mechanism behind a failure, rather than just describing the symptom
- Worker-scoped test data — test data generated uniquely per parallel CI worker (using Playwright’s testInfo.workerIndex) to avoid collisions when multiple workers run simultaneously
- Network HAR — HTTP Archive format, a structured record of every network request and response during a test run, including timing breakdowns
Tools and Resources Referenced in This Post
- Playwright’s built-in trace viewer and trace: 'retain-on-failure' configuration
- GitHub Copilot Chat (VS Code integration)
- Claude with MCP-connected servers for structured context access
- Cursor’s built-in AI assistant
- A custom MCP server skeleton (shown earlier in this post) for flakiness reporting
- Playwright’s JSON reporter for structured, machine-readable test results
If You’re Still Building Your MCP Foundation
Everything in this post assumes the baseline setup from Part 1 — Copilot and an MCP server connected to your Playwright project. If you haven’t done that step yet, go back and work through Part 1 first; the prompt patterns and tooling here build directly on that foundation rather than standing alone. And if you’ve already read Part 1 and worked through this post, the next logical step in this series — preventing flakiness at the point of writing new tests, rather than debugging it after the fact — is Part 3, which I’m actively writing now.
Keeping This Practice Sustainable Long Term
A risk worth naming honestly: novelty fades. The team workshop, the new MCP tool, the custom instructions file — all of this gets genuine enthusiasm in week one. The harder problem is whether it’s still being used in month six, after the initial excitement wears off and the backlog has new, more urgent priorities competing for attention.
What’s worked for me to keep this alive rather than letting it quietly lapse: tying it to an existing recurring ritual rather than creating a new one. Our team already had a bi-weekly tech-debt review; I added “review the flakiness summary dashboard, triage the top 2-3 entries” as a standing five-minute agenda item rather than a separate meeting people have to remember to schedule. Recurring rituals survive; standalone initiatives quietly die. I’d strongly recommend folding this into whatever cadence your team already has, rather than treating it as a new process people have to actively choose to keep doing.
Where This Is Headed: A Few Predictions
I’ll close with a few honest, hedged predictions about where debugging flaky tests with AI is likely headed over the next year or two, since several readers have asked where I think this is going.
First, I expect the manual “export trace, paste into chat” friction described throughout this post to shrink further as MCP tooling matures and more testing frameworks ship official MCP servers out of the box, the way Playwright already has — meaning more of this workflow becomes default behavior rather than something you have to deliberately wire up. Second, I expect visual regression flakiness (the category I flagged as currently underserved by AI tooling) to improve significantly as multimodal reasoning over screenshots/video becomes more reliable, closing the gap between structured-data debugging (where AI already excels, as covered in this post) and visual debugging (where it currently lags). Third, and more speculatively, I expect the line between “self-healing test automation” and “AI-assisted debugging” to blur — rather than a human running the diagnostic conversation described in this post, I’d expect more of this analysis to run automatically on every CI failure, with a human reviewing a pre-classified, pre-diagnosed report rather than starting from a blank failure log. None of these are certainties, and I’d rather be honest that I’m extrapolating from current trends than present these as confident forecasts — but it’s the direction all the tooling I’ve used this year seems to be pointing.
Final Thoughts
I started this series saying I wasn’t an AI expert, and that’s still true after writing this post. What’s changed since Part 1 is that I’ve now used GitHub Copilot and MCP on enough real flaky tests to trust the workflow described here — not blindly, but as a genuinely faster first pass than the manual trace-scrolling approach I used for years. The core lesson, if I had to compress this whole post into one sentence: AI is excellent at confirming or ruling out a hypothesis you bring with evidence, and much weaker when you ask it to generate a hypothesis from nothing. Bring the classification, bring the trace, bring the run history — let the AI do the fast cross-referencing you don’t have time to do by hand.
If you’re reading this as part of your own job search rather than as a practicing SDET on an active team, my advice is the same as in the interview-prep section above: don’t just read this post. Build the triage MCP tool, run it against a real (even personal or practice) Playwright suite, and be ready to talk through a specific debugging session the way the case studies in this post are written: symptom, evidence, hypothesis, diagnosis, fix, verification. That structure, more than any tool name on a resume, is what signals real experience to anyone technical enough to ask a follow-up question.
I’ll also say this plainly, since it’s easy to lose in a post this long: none of the individual techniques here are revolutionary on their own. Trace analysis, root-cause classification, and fixture-level fixes are established testing discipline that predates any of this AI tooling. What’s actually changed in my own day-to-day work is the speed and consistency with which that discipline gets applied. The tedious cross-referencing step that used to make thorough debugging feel like a luxury is now fast enough to do every time, not just when a flaky test was annoying enough to justify the afternoon. That’s the real shift worth taking from this post: not a new debugging philosophy, but enough speed to practice good debugging discipline consistently.
If you try the MCP triage tool or the prompt patterns above on your own suite, I’d genuinely like to hear what worked and what didn’t — drop a comment below. Part 3 is already in progress.