If you cannot measure a prompt, you cannot improve it.
That premise is behind every "prompt grader" tool. Most of them are black boxes. You paste a prompt, get a number, and have no idea what the number means or how to move it.
Here is the rubric we use inside FixMyPrompt to grade text prompts. Seven axes, scored independently, weighted, and rolled up into the 0-100 score you see in the report. We are publishing it because prompt engineering should be a discipline with a shared rubric, not a feel-it-out craft.
The seven axes
Goal clarity (0 to 20 points)
Does the prompt state what you are trying to achieve?
- 0 to 5: "help me with X." No measurable goal.
- 6 to 12: "write an email about X." Task is named, success is not defined.
- 13 to 17: "write a 150-word cold outreach email pitching X to CTOs." Task and audience defined.
- 18 to 20: same plus a measurable success criterion ("the email should make the recipient want to schedule a 15-min call").
Why it matters: models cannot infer your goal. They infer the most common goal for that kind of prompt, which is usually not yours.
Audience definition (0 to 15 points)
Who is the output for?
- 0 to 5: not stated.
- 6 to 10: high-level audience ("for a developer audience").
- 11 to 13: specific role plus seniority plus context ("for senior backend engineers at fintechs evaluating Postgres alternatives").
- 14 to 15: plus what the audience already knows and what they do not ("assume familiarity with replication, no familiarity with our product").
Why it matters: the same task produces different output for different audiences. Without this, the model targets a generic median reader.
Format and structure (0 to 15 points)
What should the output look like?
- 0 to 5: nothing specified.
- 6 to 10: format named ("a bulleted list," "a 3-paragraph essay").
- 11 to 13: format plus length plus delivery vehicle ("a 200-word LinkedIn post with 3 short paragraphs and one question at the end").
- 14 to 15: plus an example or schema (especially for JSON outputs, agent loops, or anything programmatically consumed).
Why it matters: prompts without format hints produce variable-length, variable-structure outputs. Fine for chat. Fatal for production pipelines.
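To make the top band concrete, here is a rough sketch of a format-plus-schema instruction for output a pipeline will parse. The field names and limits are invented for the example; they are not part of the rubric.

```python
# Illustrative only: a "format plus schema" instruction for programmatically
# consumed output. The fields and limits below are invented for this example.
OUTPUT_FORMAT = """
Return only valid JSON matching this shape, with no prose before or after it:
{
  "subject": "string, under 60 characters",
  "body": "string, under 150 words, 3 short paragraphs",
  "call_to_action": "string, one sentence ending in a question"
}
"""

prompt = (
    "Write a cold outreach email pitching our observability tool to CTOs.\n"
    + OUTPUT_FORMAT
)
```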
Constraints (0 to 15 points)
What rules must the output follow?
- 0 to 5: no constraints.
- 6 to 10: one or two soft constraints ("keep it short").
- 11 to 13: hard constraints with measurable thresholds ("under 150 words, no jargon, third person").
- 14 to 15: plus exclusions ("do not mention competitors, do not use the word 'leverage,' no code blocks").
Why it matters: constraints prevent the most common failure modes. Too long, off-tone, off-format, hallucinated content, mode-collapsed outputs.
Context and background (0 to 15 points)
What does the model need to know to do the job?
- 0 to 5: nothing supplied.
- 6 to 10: high-level context ("for a SaaS company").
- 11 to 13: specific facts ("our product is a usage-based observability tool, our ICP is platform engineering teams at 50 to 500-person companies").
- 14 to 15: plus prior art or reference material ("here's a previous email that worked: [paste]").
Why it matters: context is the difference between generic output and output that fits your situation. Models do not know your company, your product, or your prior decisions unless you tell them.
Tone and voice (0 to 10 points)
How should it sound?
- 0 to 3: not specified.
- 4 to 7: tone named ("friendly," "professional").
- 8 to 10: tone plus a reference example ("warm but direct, like a Stripe support email").
Why it matters: tone is invisible until it is wrong. Models default to a slightly corporate LinkedIn voice unless you push them off it.
Examples / few-shot (0 to 10 points)
Have you shown the model what good looks like?
- 0 to 3: no examples.
- 4 to 7: one example.
- 8 to 10: 2 to 3 examples covering edge cases.
Why it matters: few-shot examples can cut hallucination by 30% or more and dramatically improve format adherence. The highest-leverage thing you can add to a struggling prompt.
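As a rough illustration of the 8-to-10 band, here is a sketch of a few-shot block with two ordinary examples and one edge case. The classification task and labels are invented for the example.

```python
# Hypothetical few-shot block: two ordinary examples plus one edge case (empty input).
# The task and labels are made up for this illustration.
FEW_SHOT = """
Input: "Server returned 502 twice during the deploy"
Output: {"category": "incident", "severity": "major"}

Input: "Can we get SSO on the Teams plan?"
Output: {"category": "feature_request", "severity": "minor"}

Input: ""
Output: {"category": "unknown", "severity": "none"}
"""

prompt = (
    "Classify each support message. Respond with JSON only.\n"
    + FEW_SHOT
    + '\nInput: "{message}"\nOutput:'
)
```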
How the score rolls up
The seven axes sum to 100. Each issue we flag in a report is tagged with the axis it falls under and a severity (critical, major, or minor). You will see something like:
Score: 64 / 100
- Goal clarity: 16 / 20. OK.
- Audience: 8 / 15. Major. Audience is named but seniority and context are missing.
- Format: 4 / 15. Critical. No format or length specified.
- Constraints: 9 / 15. OK.
- Context: 12 / 15. OK.
- Tone: 7 / 10. OK.
- Examples: 8 / 10. OK.
That breakdown tells you exactly where to focus. The improved prompt addresses the flagged issues directly. You see the before-and-after on each axis, not just an opaque rewrite.
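If it helps to see the rollup mechanically, here is a minimal sketch. It assumes the per-axis maximums above act as the weights, and it derives a severity band from the axis score purely for illustration; in the actual report, severity is attached to individual issues rather than computed from the score.

```python
# Sketch of the rollup, not the production implementation. Axis maximums double as weights.
AXIS_MAX = {
    "goal_clarity": 20,
    "audience": 15,
    "format": 15,
    "constraints": 15,
    "context": 15,
    "tone": 10,
    "examples": 10,
}  # sums to 100

def band(score: int, max_points: int) -> str:
    """Illustrative severity banding derived from the axis score alone."""
    ratio = score / max_points
    if ratio < 0.4:
        return "Critical"
    if ratio < 0.6:
        return "Major"
    return "OK"

def roll_up(scores: dict[str, int]) -> int:
    """Clamp each axis to its maximum and sum to a 0-100 total."""
    return sum(min(scores.get(axis, 0), cap) for axis, cap in AXIS_MAX.items())

sample = {"goal_clarity": 16, "audience": 8, "format": 4,
          "constraints": 9, "context": 12, "tone": 7, "examples": 8}
print(roll_up(sample))  # 64, matching the sample report above
for axis, pts in sample.items():
    print(f"{axis}: {pts} / {AXIS_MAX[axis]}. {band(pts, AXIS_MAX[axis])}.")
```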
Why a rubric matters more than a score
A single number ("score: 64") does not help you ship better prompts. A rubric does.
It is a checklist. Run any prompt through the seven axes and you will catch most failure modes before you spend a token.
It is a teaching tool. Junior engineers on a team can read a rubric. They cannot read "this prompt feels off."
It is a regression test. Before-and-after rubric scores tell you whether a change actually improved the prompt or just made it longer.
It is model-agnostic. The rubric is about what prompts need, not about what any specific model wants. A rubric-strong prompt works equally well on Claude, GPT, Gemini, and Llama.
How to use this rubric without our tool
You do not need FixMyPrompt to apply the rubric. You can do it by hand:
- Read your prompt.
- Score each axis from 0 to its max.
- Sum to 100.
- Anything under 70: fix the lowest-scoring axis first.
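If you already keep the numbers in a scratch file, a few lines can do the bookkeeping. Another sketch under the same assumptions: the axis maximums are the weights, and "fix the lowest-scoring axis first" is read as the lowest fraction of its maximum.

```python
# Sketch of the hand-scoring loop: clamp, sum to 100, flag the weakest axis first.
AXIS_MAX = {"goal_clarity": 20, "audience": 15, "format": 15, "constraints": 15,
            "context": 15, "tone": 10, "examples": 10}

def grade(scores: dict[str, int]) -> None:
    total = sum(min(scores.get(a, 0), cap) for a, cap in AXIS_MAX.items())
    weakest = min(AXIS_MAX, key=lambda a: scores.get(a, 0) / AXIS_MAX[a])
    print(f"Total: {total} / 100")
    if total < 70:
        print(f"Fix '{weakest}' first ({scores.get(weakest, 0)} / {AXIS_MAX[weakest]}).")

grade({"goal_clarity": 12, "audience": 6, "format": 5,
       "constraints": 8, "context": 10, "tone": 6, "examples": 3})
# Total: 50 / 100
# Fix 'examples' first (3 / 10).
```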
For production work (running the rubric against dozens of prompts a week, generating reports for code review, sharing with teammates), paste the prompt into fixmyprompt.net/try and let our QA model score it for you. Three free runs per day. No signup.