Abstract
We analyzed nearly 5,000 zero-shot case law citations generated by GPT-3.5 Turbo to identify accuracy patterns across jurisdictions and practice areas. We gave the model as little help as possible: no prompt engineering, no grounding, and no examples. The results show that GPT's case law citations vary widely across jurisdictions and practice areas, with Federal constitutional law citations being the most accurate and Maine bankruptcy law citations the least accurate.
Methodology
First, we'll explain what we mean by case law citations.
What is a Case Law Citation?
According to the Bluebook, a proper case law citation includes seven components.
- Case Name
- Volume Number
- Reporter Abbreviation
- First Page of Case
- Pinpoint
- Court
- Decision Year
In this study, we focused on Case Name, Volume Number, Reporter Abbreviation, and First Page of Case.
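To make the components concrete, here is how a full citation of United States v. Booker maps onto them (the pinpoint page below is purely illustrative, not a page we verified):

```python
# Components of "United States v. Booker, 543 U.S. 220, 244 (2005)"
citation = {
    "case_name": "United States v. Booker",
    "volume": 543,                   # volume number of the reporter
    "reporter": "U.S.",              # reporter abbreviation (United States Reports)
    "first_page": 220,               # first page of the opinion
    "pinpoint": 244,                 # specific page cited (illustrative)
    "court": "U.S. Supreme Court",   # implied by the U.S. reporter
    "year": 2005,                    # decision year
}
```

Our verification covered only the first four fields, since those are what a reporter-based lookup can check.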
Citation Generation
The experiment utilized the gpt-3.5-turbo model, tested on April 22, 2024, to generate case law citations for 51 jurisdictions (the Federal courts plus all 50 states) and 21 practice areas. A zero-shot prompt was used, instructing the model to generate full case law citations without any prior examples or context. The model generated approximately 5 citations for each jurisdiction and practice area combination, resulting in a dataset of 4,904 legal cases.
We understand GPT-3.5 Turbo is not the best available language model. That was deliberate. More advanced models with more recent training data should, in theory, be better at generating citations. Our study emulates the unfortunate (but probably common) scenario of someone using an older language model without prompt engineering, grounding, or domain expertise--for example, a pro se litigant. That's where the insights from this study are most useful. The prompt, jurisdictions, and practice areas we used to generate citations are below.
Our goal here was to give GPT the bare minimum to really test its ability to recall case law citations from training data.
prompt = [
"No talking, start listing court cases about {state} {practice_area} with full citations"
]
jurisdictions = [
"Federal", "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado", "Connecticut", "Delaware", "Florida",
"Georgia", "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine",
"Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska",
"Nevada", "New Hampshire", "New Jersey", "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio",
"Oklahoma", "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota", "Tennessee", "Texas",
"Utah", "Vermont", "Virginia", "Washington", "West Virginia", "Wisconsin", "Wyoming"
]
practices = [
"criminal law", "constitutional law", "property law", "contract law", "personal injury law",
"tort law", "family law", "corporate law", "intellectual property law", "labor law", "tax law",
"environmental law", "bankruptcy law", "immigration law", "health law", "education law",
"real estate law", "consumer protection law", "insurance law", "evidence law", "business law"
]
GPT generated a list of about 5 citations for each jurisdiction/practice combination.
For example, the first combination--Federal criminal law--produced the following citations.
1. United States v. Booker, 543 U.S. 220
2. United States v. Lopez, 514 U.S. 549
3. United States v. Morrison, 529 U.S. 598
4. United States v. Jones, 565 U.S. 400
5. United States v. Davis, 139 S. Ct. 2319
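Before verification, each generated line has to be split into the four fields we checked. A simplified sketch of that parsing (the pattern and function are ours, not part of the study's actual pipeline, and the regex will not handle every citation format):

```python
import re

# Split a generated citation into case name, volume, reporter, and first page.
# Simplification: assumes "Name, <volume> <reporter> <page>" with no pinpoint.
CITE_RE = re.compile(
    r"^(?P<name>.+?),\s+(?P<volume>\d+)\s+(?P<reporter>[A-Za-z.\d ]+?)\s+(?P<page>\d+)$"
)

def parse(line: str) -> dict:
    m = CITE_RE.match(line.strip())
    if not m:
        raise ValueError(f"unparseable citation: {line!r}")
    return m.groupdict()

fields = parse("United States v. Davis, 139 S. Ct. 2319")
# fields -> {'name': 'United States v. Davis', 'volume': '139',
#            'reporter': 'S. Ct.', 'page': '2319'}
```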
We repeated this for all 1,071 combinations of jurisdiction and practice area.
Verifying Citations
The generated citations were verified using Court Listener's citation API, which primarily validates the Volume Number, Reporter Abbreviation, and First Page of Case. However, the API can return false positives when the language model pairs a mismatched Case Name with a real reporter citation. To address this, we introduced a new status code (301) for GPT citations with a proper reporter citation but a mismatched case name.
Please note our adaptation is not perfect. (See Maine, where no cases were correctly cited). Please conduct your own research or reach out to Counsel Stack for further inquiries.
Status Codes Explanation
If you look at the raw data, you'll see that each GPT citation has a status code. The status code tells us whether the citation was correct.
The dataset includes the following status codes to indicate the validity and accuracy of each citation.
- 200 (Valid Citation): The citation's reporter and case name are accurate and match the information in the legal database.
- 300 (Ambiguous Reporter): The provided citation is ambiguous and could refer to multiple cases due to a non-specific reporter abbreviation.
- 301 (Mismatched Case Name and Reporter): The citation appears valid, but the case name does not correspond to the referenced reporter and page number.
- 404 (Invalid Citation): The reporter citation is invalid or does not exist.
200 (Valid Citation)
The citation's reporter and case name are accurate and match the information in the legal database.
These citations can be considered fully valid and reliable.
- For example: Federal, criminal law, "United States v. Booker, 543 U.S. 220"
- This is a real case that GPT cited properly. The Case Name and Reporter are both correct.
300 (Ambiguous Reporter)
The provided citation is ambiguous and could refer to multiple cases.
This often happens when the reporter abbreviation is not specific enough. For example, "H." could refer to Handy's Ohio Reports, Hawaii Reports, or Hill's New York Reports. Additional context or a more precise citation would be needed to determine the correct case.
- For example: Louisiana, corporate law, "Louisiana Public Service Commission v. Federal Communications Commission, 476 U.S. 355"
- It's unclear whether this is a real case cited properly. There are multiple citations matching the Reporter. Further analysis is needed.
301 (Mismatched Case Name and Reporter)
The citation appears to be valid at first glance, but upon cross-referencing the case name and reporter citation, there is a mismatch.
This suggests that while the citation format is correct, the case name does not correspond to the referenced reporter and page number. These citations are not reliable without further verification. This category is not perfect, but it is strict: even a slightly mismatched case name converts a citation from a 200 to a 301.
- For example: Missouri, family law, "In re Custody of Johnson, 611 S.W.2d 456"
- This is a real reporter citation (611 S.W.2d 456) but the real case name is actually Jones v. Tucker. This citation is not reliable because the case name does not match the reporter cited.
But also consider close mismatches. For example, GPT created "McCulloch v. Maryland, 17 U.S. 316" which was flagged as a 301 because the real case name is "M'culloch v. State of Maryland." Similarly, "City of Hazleton v. Lozano, 496 F. Supp. 2d 477" raises a 301 because the real case name is "Lozano v. City of Hazleton" and "Tennessee v. Thomas, 158 S.W.3d 361" raises a 301 because the real case name is "State v. Thomas."
So you see the point--even minor deviations from the official name listed by Court Listener will raise a 301 error.
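That strictness follows from comparing names essentially verbatim, as in the sketch below (our illustration; the study's actual comparison may differ in details):

```python
def names_match(gpt_name: str, official_name: str) -> bool:
    # Case-insensitive exact comparison -- no fuzzy matching,
    # so even spelling variants and reversed party order count as mismatches.
    return gpt_name.strip().lower() == official_name.strip().lower()

# Near-misses from the study that still earn a 301:
names_match("McCulloch v. Maryland", "M'culloch v. State of Maryland")   # False
names_match("City of Hazleton v. Lozano", "Lozano v. City of Hazleton")  # False
names_match("Tennessee v. Thomas", "State v. Thomas")                    # False
```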
404 (Invalid Citation)
The reporter citation is invalid or does not exist.
This could be due to a formatting error or a reference to a non-existent reporter series, volume, or page number.
- For example: New Mexico, bankruptcy law, "In re Garcia, 201 F.3d 1209"
- This reporter citation does not exist. Although the case name might be real, this citation is not reliable because the reporter citation is not real.
Results
The GPT generated dataset comprised 4,904 case law citations.
We did two separate analyses--one with Court Listener's three response statuses, and another with our fourth 301 status for mismatched case name false positives. This discussion will focus on the latter, although the data for both segments of the study are available in the data workbook.
Analysis revealed that the overall citation accuracy was relatively low, with only 23.45% of cases having fully valid citations. A significant portion (37.09%) had mismatched case names and reporter citations, while 34.93% were entirely invalid. Ambiguous cases were relatively uncommon, accounting for only 4.53% of the dataset.
Find the overall statistics below. Notice how 1,819 of the 2,969 valid cases were in fact mismatched. If relied upon, these citations could result in sanctions or other action by the court. Again, the mismatch category is not perfect, but it demonstrates the existence of false positives that have a real volume number, reporter abbreviation, and first page, but an incorrect case name.
Overall Statistics (Without 301)
- Total Cases: 4904
- Valid Cases: 2969 (60.54%)
- Ambiguous Cases: 222 (4.53%)
- Invalid Cases: 1713 (34.93%)
Overall Statistics (With 301)
- Total Cases: 4904
- Valid Cases: 1150 (23.45%)
- Ambiguous Cases: 222 (4.53%)
- Mismatched Cases: 1819 (37.09%)
- Invalid Cases: 1713 (34.93%)
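The percentages above follow directly from the raw counts:

```python
# Overall counts from the with-301 analysis.
counts = {"valid": 1150, "ambiguous": 222, "mismatched": 1819, "invalid": 1713}
total = sum(counts.values())                                  # 4904
pct = {k: round(100 * v / total, 2) for k, v in counts.items()}
# pct -> {'valid': 23.45, 'ambiguous': 4.53, 'mismatched': 37.09, 'invalid': 34.93}
```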
Breakdown by Jurisdiction
Looking at the breakdown by jurisdiction, there is wide variation in citation accuracy.
The Federal jurisdiction stands out with the highest percentage of valid cases at 71.3%, significantly outperforming the overall average. On the other end of the spectrum, Maine had 0% valid cases, with the bulk falling into the mismatched (64.4%) or invalid (28.9%) categories. Other top performing jurisdictions include Arizona (52.3% valid), California (48.6% valid), New Jersey (49% valid) and Texas (51.5% valid). The worst performing jurisdictions after Maine are Iowa (5.7% valid), Rhode Island (6.9% valid), Nebraska (7.2% valid) and New Mexico (7.7% valid).
This stark difference in accuracy across states supports our hypothesis that GPT's performance depends heavily on the prevalence of each jurisdiction's case law in its training data. Well-represented states and the Federal body of case law see much better results.
Breakdown by Practice Area
Turning to practice areas, constitutional law had the highest rate of valid citations at 45.9%, followed by education law at 41.2% and immigration law at 39.3%. This suggests these may be areas with better data representation in GPT's knowledge base. On the flip side, bankruptcy law was the clear worst performer with only 6.9% valid cites and a combined 85.1% mismatched or invalid. Other poorly performing practice areas include environmental law (14.7% valid), personal injury law (15.3% valid), and corporate law (16.2% valid). These more niche and specialized areas of law appear to give GPT-3.5 Turbo the most trouble in generating accurate case references.
Some other interesting observations
- No jurisdiction exceeded 75% valid case citations, showing room for improvement across the board
- For many states, invalid cases made up the largest share of the results, sometimes exceeding 50% (e.g. Vermont at 72.2% invalid)
- The Federal jurisdiction was an outlier in both its high accuracy and very low invalid rate of only 12%
- Ambiguous cases made up a relatively small portion of results in most categories, usually under 10%
- Bankruptcy and environmental law were the only practice areas with under 10% valid cases
- Constitutional, education and immigration law were the only practice areas that broke 40% valid
Anomalies
Our West Virginia contract law prompt produced the following case citations.
2. Syl. Pt. 1, Syl. Pt. 2, Syl. Pt. 3, Syl. Pt. 4, Syl. Pt. 5
Our Louisiana family law prompt produced the following case citation.
1. Succession of Succession of Succession of Succession of
For the following jurisdiction/practice area combinations, GPT refused to generate any citations.
Arizona education law, Georgia education law, Hawaii health law, Hawaii education law, Idaho intellectual property law, Kansas education law, Kentucky constitutional law, Kentucky education law, Kentucky real estate law, Maryland education law, Massachusetts immigration law, Nebraska immigration law, New Mexico education law, Pennsylvania immigration law, South Carolina education law, Wisconsin contract law
Maine returned 0 valid responses, which is most likely wrong--after a cursory review of the data, it seems that we either accidentally labeled some valid citations as 301s or that the naming conventions in the Citation API for Maine case law differ in some way from other jurisdictions.
Conclusion
GPT-3.5 Turbo's case law citation accuracy varies considerably by jurisdiction and area of law. While some states and practice areas see accuracy rates approaching 50-70%, others languish under 10%. The Federal body of case law is uniquely well handled, while certain states like Maine and Iowa are very poorly represented. Specialized practice areas like bankruptcy, environmental, and corporate law pose the biggest challenges, while constitutional, education, and immigration law are relative bright spots.
Learn more by viewing the data below.
JURISDICTION STATISTICS
| Jurisdiction | Total | Valid | Ambiguous | Mismatched | Invalid | Valid % | Ambiguous % | Mismatched % | Invalid % |
|:---------------|--------:|--------:|------------:|-------------:|----------:|----------:|--------------:|---------------:|------------:|
| Alabama | 103 | 13 | 8 | 48 | 34 | 12.6214 | 7.76699 | 46.6019 | 33.0097 |
| Alaska | 104 | 35 | 11 | 41 | 17 | 33.6538 | 10.5769 | 39.4231 | 16.3462 |
| Arizona | 88 | 46 | 0 | 27 | 15 | 52.2727 | 0 | 30.6818 | 17.0455 |
| Arkansas | 100 | 34 | 5 | 27 | 34 | 34 | 5 | 27 | 34 |
| California | 105 | 51 | 6 | 34 | 14 | 48.5714 | 5.71429 | 32.381 | 13.3333 |
| Colorado | 104 | 25 | 4 | 46 | 29 | 24.0385 | 3.84615 | 44.2308 | 27.8846 |
| Connecticut | 105 | 28 | 5 | 25 | 47 | 26.6667 | 4.7619 | 23.8095 | 44.7619 |
| Delaware | 89 | 34 | 1 | 19 | 35 | 38.2022 | 1.1236 | 21.3483 | 39.3258 |
| Federal | 108 | 77 | 3 | 15 | 13 | 71.2963 | 2.77778 | 13.8889 | 12.037 |
| Florida | 104 | 32 | 5 | 50 | 17 | 30.7692 | 4.80769 | 48.0769 | 16.3462 |
| Georgia | 98 | 20 | 0 | 39 | 39 | 20.4082 | 0 | 39.7959 | 39.7959 |
| Hawaii | 85 | 15 | 2 | 44 | 24 | 17.6471 | 2.35294 | 51.7647 | 28.2353 |
| Idaho | 94 | 17 | 2 | 35 | 40 | 18.0851 | 2.12766 | 37.234 | 42.5532 |
| Illinois | 105 | 20 | 6 | 24 | 55 | 19.0476 | 5.71429 | 22.8571 | 52.381 |
| Indiana | 103 | 15 | 3 | 37 | 48 | 14.5631 | 2.91262 | 35.9223 | 46.6019 |
| Iowa | 105 | 6 | 2 | 42 | 55 | 5.71429 | 1.90476 | 40 | 52.381 |
| Kansas | 98 | 25 | 0 | 35 | 38 | 25.5102 | 0 | 35.7143 | 38.7755 |
| Kentucky | 70 | 8 | 0 | 29 | 33 | 11.4286 | 0 | 41.4286 | 47.1429 |
| Louisiana | 92 | 17 | 7 | 30 | 38 | 18.4783 | 7.6087 | 32.6087 | 41.3043 |
| Maine* | 90 | 0 | 6 | 58 | 26 | 0 | 6.66667 | 64.4444 | 28.8889 |
| Maryland | 92 | 14 | 4 | 32 | 42 | 15.2174 | 4.34783 | 34.7826 | 45.6522 |
| Massachusetts | 99 | 43 | 4 | 25 | 27 | 43.4343 | 4.0404 | 25.2525 | 27.2727 |
| Michigan | 107 | 29 | 4 | 38 | 36 | 27.1028 | 3.73832 | 35.514 | 33.6449 |
| Minnesota | 104 | 32 | 2 | 36 | 34 | 30.7692 | 1.92308 | 34.6154 | 32.6923 |
| Mississippi | 102 | 16 | 9 | 54 | 23 | 15.6863 | 8.82353 | 52.9412 | 22.549 |
| Missouri | 102 | 16 | 13 | 47 | 26 | 15.6863 | 12.7451 | 46.0784 | 25.4902 |
| Montana | 85 | 15 | 3 | 32 | 35 | 17.6471 | 3.52941 | 37.6471 | 41.1765 |
| Nebraska | 97 | 7 | 13 | 29 | 48 | 7.21649 | 13.4021 | 29.8969 | 49.4845 |
| Nevada | 78 | 8 | 0 | 24 | 46 | 10.2564 | 0 | 30.7692 | 58.9744 |
| New Hampshire | 103 | 15 | 1 | 67 | 20 | 14.5631 | 0.970874 | 65.0485 | 19.4175 |
| New Jersey | 100 | 49 | 7 | 34 | 10 | 49 | 7 | 34 | 10 |
| New Mexico | 78 | 6 | 5 | 14 | 53 | 7.69231 | 6.41026 | 17.9487 | 67.9487 |
| New York | 104 | 39 | 5 | 34 | 26 | 37.5 | 4.80769 | 32.6923 | 25 |
| North Carolina | 99 | 18 | 2 | 33 | 46 | 18.1818 | 2.0202 | 33.3333 | 46.4646 |
| North Dakota | 81 | 7 | 4 | 45 | 25 | 8.64198 | 4.93827 | 55.5556 | 30.8642 |
| Ohio | 95 | 10 | 6 | 34 | 45 | 10.5263 | 6.31579 | 35.7895 | 47.3684 |
| Oklahoma | 94 | 24 | 14 | 25 | 31 | 25.5319 | 14.8936 | 26.5957 | 32.9787 |
| Oregon | 101 | 29 | 5 | 29 | 38 | 28.7129 | 4.9505 | 28.7129 | 37.6238 |
| Pennsylvania | 105 | 24 | 8 | 26 | 47 | 22.8571 | 7.61905 | 24.7619 | 44.7619 |
| Rhode Island | 101 | 7 | 2 | 41 | 51 | 6.93069 | 1.9802 | 40.5941 | 50.495 |
| South Carolina | 100 | 10 | 6 | 52 | 32 | 10 | 6 | 52 | 32 |
| South Dakota | 83 | 9 | 5 | 52 | 17 | 10.8434 | 6.0241 | 62.6506 | 20.4819 |
| Tennessee | 98 | 28 | 3 | 42 | 25 | 28.5714 | 3.06122 | 42.8571 | 25.5102 |
| Texas | 103 | 53 | 3 | 32 | 15 | 51.4563 | 2.91262 | 31.068 | 14.5631 |
| Utah | 88 | 12 | 3 | 44 | 29 | 13.6364 | 3.40909 | 50 | 32.9545 |
| Vermont | 90 | 7 | 1 | 17 | 65 | 7.77778 | 1.11111 | 18.8889 | 72.2222 |
| Virginia | 94 | 23 | 0 | 24 | 47 | 24.4681 | 0 | 25.5319 | 50 |
| Washington | 104 | 41 | 5 | 34 | 24 | 39.4231 | 4.80769 | 32.6923 | 23.0769 |
| West Virginia | 91 | 11 | 3 | 35 | 42 | 12.0879 | 3.2967 | 38.4615 | 46.1538 |
| Wisconsin | 84 | 14 | 0 | 47 | 23 | 16.6667 | 0 | 55.9524 | 27.381 |
| Wyoming | 92 | 16 | 6 | 36 | 34 | 17.3913 | 6.52174 | 39.1304 | 36.9565 |
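Each percentage column in the table above is simply the category count over the row total. A quick sanity check on the Federal row:

```python
# Sanity-check one table row: the Federal jurisdiction.
row = {"total": 108, "valid": 77, "ambiguous": 3, "mismatched": 15, "invalid": 13}

# Category counts sum to the row total.
assert row["valid"] + row["ambiguous"] + row["mismatched"] + row["invalid"] == row["total"]

valid_pct = round(100 * row["valid"] / row["total"], 4)   # 71.2963
```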
PRACTICE AREA STATISTICS
| Practice Area | Total | Valid | Ambiguous | Mismatched | Invalid | Valid % | Ambiguous % | Mismatched % | Invalid % |
|:--------------------------|--------:|--------:|------------:|-------------:|----------:|----------:|--------------:|---------------:|------------:|
| bankruptcy law | 202 | 14 | 16 | 100 | 72 | 6.93069 | 7.92079 | 49.505 | 35.6436 |
| business law | 190 | 62 | 6 | 65 | 57 | 32.6316 | 3.15789 | 34.2105 | 30 |
| constitutional law | 205 | 94 | 7 | 48 | 56 | 45.8537 | 3.41463 | 23.4146 | 27.3171 |
| consumer protection law | 191 | 50 | 14 | 65 | 62 | 26.178 | 7.32984 | 34.0314 | 32.4607 |
| contract law | 177 | 37 | 3 | 74 | 63 | 20.904 | 1.69492 | 41.8079 | 35.5932 |
| corporate law | 185 | 30 | 13 | 81 | 61 | 16.2162 | 7.02703 | 43.7838 | 32.973 |
| criminal law | 197 | 56 | 5 | 56 | 80 | 28.4264 | 2.53807 | 28.4264 | 40.6091 |
| education law | 170 | 70 | 10 | 45 | 45 | 41.1765 | 5.88235 | 26.4706 | 26.4706 |
| environmental law | 190 | 28 | 16 | 77 | 69 | 14.7368 | 8.42105 | 40.5263 | 36.3158 |
| evidence law | 193 | 64 | 4 | 52 | 73 | 33.1606 | 2.07254 | 26.943 | 37.8238 |
| family law | 172 | 32 | 8 | 52 | 80 | 18.6047 | 4.65116 | 30.2326 | 46.5116 |
| health law | 194 | 39 | 13 | 65 | 77 | 20.1031 | 6.70103 | 33.5052 | 39.6907 |
| immigration law | 173 | 68 | 1 | 46 | 58 | 39.3064 | 0.578035 | 26.5896 | 33.526 |
| insurance law | 202 | 37 | 4 | 78 | 83 | 18.3168 | 1.9802 | 38.6139 | 41.0891 |
| intellectual property law | 179 | 56 | 12 | 70 | 41 | 31.2849 | 6.70391 | 39.1061 | 22.905 |
| labor law | 192 | 43 | 4 | 92 | 53 | 22.3958 | 2.08333 | 47.9167 | 27.6042 |
| personal injury law | 196 | 30 | 6 | 78 | 82 | 15.3061 | 3.06122 | 39.7959 | 41.8367 |
| property law | 187 | 36 | 15 | 59 | 77 | 19.2513 | 8.02139 | 31.5508 | 41.1765 |
| real estate law | 192 | 37 | 12 | 68 | 75 | 19.2708 | 6.25 | 35.4167 | 39.0625 |
| tax law | 197 | 42 | 8 | 70 | 77 | 21.3198 | 4.06091 | 35.533 | 39.0863 |
| tort law | 180 | 54 | 5 | 71 | 50 | 30 | 2.77778 | 39.4444 | 27.7778 |