I tried using ChatGPT to grade papers. Here’s why it didn’t work.
When I was teaching freshman composition, I had a ritual for the day after students’ final portfolios were due. I would stare at the towering pile of essays on my desk – or, later, at the long list of file submissions in the course management system – and I would read this McSweeney’s piece to myself out loud: I Would Rather Do Anything Else Than Grade Your Final Papers by Robin Lee Mozer.
I would rather base jump off of the parking garage next to the student activity center or eat that entire sketchy tray of taco meat left over from last week’s student achievement luncheon that’s sitting in the department refrigerator or walk all the way from my house to the airport on my hands than grade your Final Papers.
I am sure there are some writing professors who enjoy grading essays, but I was never one of them. (As Mozer says, I would rather eat beef stroganoff.) If I could outsource this task to someone else and trust that they would be able to grade the papers with consistency and attention to detail and to hold students accountable to the standards established in class, I would certainly be tempted to hand over the whole stack. Of course, giving my students’ essays to another person to grade would be unethical and probably a FERPA violation, so I never did, but I sure thought about it.
But that was before ChatGPT came along. What about using ChatGPT? That’s not a person, and I would still be overseeing the grading. Could I run students’ papers through ChatGPT and let the AI do the heavy lifting?
To cut the suspense, I’ll jump ahead to the conclusion. After testing this out, I am sad to say: No, ChatGPT cannot grade your papers for you.
Let me explain how I tested this.
The Test: Part One, Establishing a Baseline
Because of privacy concerns, I did not want to put actual student writing into ChatGPT. Instead, I used the example essays from the ACT website, which provides thorough scoring criteria along with scored example essays ranging from poor to excellent. That made them perfect writing samples to feed into ChatGPT and compare its output against the ACT website’s scores.
I began by copying and pasting in the entire scoring rubric, then the writing sample prompt, and finally, several essays, one by one. You can read the whole conversation here.
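(A quick aside for the technically inclined: I did all of this by hand in the ChatGPT web interface, but if you wanted to rerun the test without the copying and pasting, you could script roughly the same thing against the OpenAI API. The sketch below is just that – a sketch. The file names are placeholders for the ACT materials, and the prompt wording is an approximation of what I typed, not a transcript.)

```python
# A rough sketch of the same test, scripted against the OpenAI API instead of
# the ChatGPT web interface. File names are placeholders for the ACT materials.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

rubric = Path("act_rubric.txt").read_text()
writing_prompt = Path("act_writing_prompt.txt").read_text()
essay = Path("act_sample_essay_level2.txt").read_text()

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # the model behind the free ChatGPT at the time
    messages=[
        {
            "role": "system",
            "content": "Score essays using this ACT writing rubric:\n\n"
            + rubric
            + "\n\nThe essays respond to this prompt:\n\n"
            + writing_prompt,
        },
        {
            "role": "user",
            "content": "Please score the following essay against the rubric "
            "and briefly explain the score:\n\n" + essay,
        },
    ],
)

print(response.choices[0].message.content)
```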
First, I pasted in two low-scoring essays from the ACT website. On a scale from 2 to 6, ChatGPT scored them both as 2s, the same as the website did.
Next, I entered the ACT website’s example level-5 and level-6 essays. ChatGPT gave both a score of 4.
Finally, I asked ChatGPT to generate an essay in response to the prompt that would score a 6 based on the rubric criteria. It churned out an organized, thoughtful-sounding 6-paragraph essay.
“Now please provide a score breakdown of this essay per the rubric,” I told it.
“Certainly!” ChatGPT responded, and wrote up a score analysis of the essay.
ChatGPT’s analysis: It gave its own supposedly level-6 essay – the essay it had created just one prompt earlier, under instructions to write an essay that would earn a 6 – a 5.
Next, I entered another essay prompt – this one from the Kaplan test prep website – along with another example essay, and asked ChatGPT to score it against the same rubric. ChatGPT gave this essay a 5 as well, which looked consistent with the scored essays on the ACT website.
In the feedback on the Kaplan essay, ChatGPT noted, “There are only minor errors in grammar, usage, and mechanics, which do not significantly impact meaning.” I was surprised; I hadn’t spotted any errors when I skimmed the essay. Curious, I asked: Could you please list the errors in grammar, usage, and mechanics from this essay?
ChatGPT spat out a list of 7 errors in the essay, explaining what made each one incorrect and giving an example of how it could be corrected. But there was a problem: None of the errors were errors, and none of ChatGPT’s explanations were correct.
So far, ChatGPT was only moderately consistent at evaluating an essay against the rubric, and its attempts at specific feedback on grammar and mechanics were completely baffling.
The Test: Part Two, Checking Reproducibility
Next, I started a new conversation with ChatGPT. (You can read the whole conversation here.) I began the same way, pasting in the ACT website’s rubric criteria and writing prompt. But then, instead of the example ACT essays, I pasted in the essay that ChatGPT wrote – the essay that it wrote after being prompted to write a 6, and then scored as a 5.
The result:
It gave its essay a 4.
The Test: Part Three, Once More Unto the Breach
I opened a new chat and again pasted in the rubric criteria and the prompt. (Here’s the full conversation.) Then I pasted in the level-5 writing sample from the ACT website, which ChatGPT had scored as a 4 in the first chat.
This time, it gave the essay a 3.
I tested the ACT website’s level-6 example, which ChatGPT previously also scored as a 4. This time, ChatGPT rated it slightly more favorably than the level-5 example, but still scored it a 3.
And the essay ChatGPT wrote earlier as an example of a 6? This time, it scored a 5.
Clearly, ChatGPT was struggling to apply the rubric criteria consistently. Could the problem be that it needed to be calibrated on what a level-6 essay looks like?
The Test: Part Four, Calibration
I started yet another conversation to test this. Once again, I pasted in the rubric and essay prompt. Then I dropped in the ACT level 2 and level 6 essays and told ChatGPT that these were examples of the lowest and highest scores so that it could compare additional essays against the rubric and against what constituted a top-scoring or low-scoring essay.
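(In API terms, this calibration step is just a few-shot setup: the anchor essays go into the conversation before you ask for any new scores. Here’s a rough sketch of that structure, again with placeholder file names and paraphrased wording – I did all of this by hand in the web interface.)

```python
# A sketch of the calibration setup: the ACT level-2 and level-6 essays go in
# as labeled low/high anchors before any new essay is scored.
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

rubric = Path("act_rubric.txt").read_text()
writing_prompt = Path("act_writing_prompt.txt").read_text()
anchor_low = Path("act_sample_essay_level2.txt").read_text()
anchor_high = Path("act_sample_essay_level6.txt").read_text()
essay_to_score = Path("act_sample_essay_level5.txt").read_text()

messages = [
    {
        "role": "system",
        "content": "Score essays using this ACT writing rubric:\n\n" + rubric
        + "\n\nThe essays respond to this prompt:\n\n" + writing_prompt,
    },
    # Anchor examples: show the model what the lowest and highest scores look like.
    {"role": "user", "content": "This essay is an example of the lowest score, a 2:\n\n" + anchor_low},
    {"role": "assistant", "content": "Noted: that essay represents a score of 2."},
    {"role": "user", "content": "This essay is an example of the highest score, a 6:\n\n" + anchor_high},
    {"role": "assistant", "content": "Noted: that essay represents a score of 6."},
    # Now ask for a score on a new essay, with the anchors in context.
    {"role": "user", "content": "Score this essay against the rubric:\n\n" + essay_to_score},
]

response = client.chat.completions.create(model="gpt-3.5-turbo", messages=messages)
print(response.choices[0].message.content)
```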
Then I asked it to score the ACT level 5, 4, and 3 essays. ChatGPT scored these as 4.5, 3.75, and 3.5, respectively, more closely aligned with how the ACT website rated them.
So I asked it to score the essay written by ChatGPT as an example of a 6, which you’ll recall ChatGPT elsewhere scored as a 5, a 4, and a 5.
Reader, this time, ChatGPT scored its own level-6 essay at a 3. So much for calibration being the problem.
Conclusion
So, can ChatGPT be used to grade student papers?
Nope. It is not advisable to use ChatGPT, or at least GPT-3.5 at its current capabilities,[1] to grade student papers. It’s inconsistent, it finds errors where there are none, and it isn’t really doing analysis.
Remember, ChatGPT is not an analysis machine – it is a language machine. It is not built to critically assess papers and provide thoughtful feedback; it is built to produce language that sounds like thoughtful feedback.
We also haven’t dug into the student privacy concerns; the Common Sense Privacy Program gives ChatGPT a 48% rating. I would strongly advise against putting students’ work into ChatGPT at all, and certainly not with any identifying information attached.
How Else Can ChatGPT Be Used in the Grading Process?
That doesn’t mean there’s no room to use ChatGPT to help you save time grading papers. One use for it is to write up coherent feedback. You can give ChatGPT a shorthand list of things you want to point out, and ChatGPT can turn that into sentences that you can give a student as feedback.
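(Here’s a rough sketch of what that could look like if you scripted it against the OpenAI API rather than pasting into the chat window; the shorthand notes below are invented for illustration.)

```python
# A sketch of turning an instructor's grading shorthand into full-sentence
# feedback. The notes below are invented for illustration.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

shorthand_notes = """
- thesis unclear until page 3
- good use of sources in body paragraphs
- comma splices throughout
- conclusion just restates the intro
"""

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "system",
            "content": "Turn the instructor's shorthand grading notes into two "
            "or three paragraphs of clear, encouraging feedback addressed to the "
            "student. Do not add praise or criticism that is not in the notes.",
        },
        {"role": "user", "content": shorthand_notes},
    ],
)

# The instructor still reviews and edits this before it goes to the student.
print(response.choices[0].message.content)
```

Either way, I would still read and edit whatever it produces before it goes to a student.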
ChatGPT can also help create assignment sheets and grading rubrics, as I explain here: 10 Practical Ways to Use ChatGPT to Make Your Teaching Life Easier (Even If You’re Afraid of AI).
The bottom line is that when it comes to grading papers accurately and consistently, providing your students with targeted feedback that meets them where they are, and assessing your students’ growth as writers through the semester, you are irreplaceable.
Abi Bechtel is a writer, educator, and ChatGPT enthusiast. They have an MFA in Creative Writing from the Northeast Ohio MFA program through the University of Akron, and they just think generative AI is neat.
[1] I only tested this with GPT-3.5, which is the free version of ChatGPT. As of this writing, GPT-4, which OpenAI touts as being “great for tasks requiring creativity and advanced reasoning,” requires a ChatGPT Plus subscription at $20/month. It’s possible GPT-4 could handle this task better; however, I suspect that the people who would like to relieve some of the burden of grading papers are not in the demographic of people who want to spend $20 a month on ChatGPT Plus.