OpenAI has stated that GPT-4 exhibits human-level performance on a range of professional and academic benchmarks. Here, we'll explore the exams taken by GPT-4 and GPT-3.5 and the scores each model achieved.
Getting to Know GPT-4: A Brief Introduction
GPT-4, which stands for “Generative Pre-trained Transformer 4,” was released by OpenAI on March 14, 2023. It is a language model and represents the most advanced and creative model ever published by OpenAI. With its exceptional reasoning abilities, GPT-4 exhibits near-human-level performance across various domains and demonstrates the capability to tackle complex problem-solving tasks.
A Table of Comparison: Exams Taken By GPT-4
| Exam | GPT-4 Score | GPT-4 Percentile | GPT-3.5 Score | GPT-3.5 Percentile |
| --- | --- | --- | --- | --- |
| Uniform Bar Exam (MBE+MEE+MPT) | 298 | ~90th | 213 | ~10th |
| Advanced Sommelier (theory knowledge) | 77% | – | 46% | – |
| AP Calculus BC | 4 | 43rd–59th | 1 | 0th–7th |
| Certified Sommelier Exams (Intro) | 92% | – | 80% | – |
| Certified Sommelier Exams (Certified) | 86% | – | 58% | – |
As the table shows, GPT-4 outperformed its predecessor, GPT-3.5, on nearly every exam, with performance approaching human level.
Uniform Bar Exam (MBE+MEE+MPT)
The Uniform Bar Exam (UBE) is a standardized bar examination in the United States, developed by the National Conference of Bar Examiners (NCBE). It consists of the Multistate Bar Examination (MBE), the Multistate Essay Examination (MEE), and the Multistate Performance Test (MPT).
GPT-4 performed exceptionally well in the UBE with a score of 298 out of 400 according to a document by OpenAI, which ranks it in the estimated 90th percentile of test takers. This suggests that GPT-4 has a solid understanding of complex legal topics and concepts, displaying capabilities beyond simple tasks.
In comparison, GPT-3.5 scored 213 out of 400, which places it in the 10th percentile. GPT-4 has clearly demonstrated a significant improvement over GPT-3.5 on this examination.
To put these scores into perspective, the mean scaled score on the MBE section of the UBE in 2020 was 132.5 according to NCBE, so both AI models outperformed the average.
Law School Admission Test (LSAT)
The Law School Admission Test (LSAT) is a half-day standardized test administered at designated testing centers throughout the world. It's an integral part of the law school admission process in the United States.
GPT-4 achieved a score of 163, placing it in the 88th percentile of test takers, suggesting a proficient understanding of the logical and verbal reasoning skills measured in the test.
GPT-3.5, on the other hand, scored 149, which falls into the 40th percentile. This again shows the noticeable improvement in performance between GPT-3.5 and GPT-4 on the LSAT.
The average LSAT score is around 150, so both models exceed this average, with GPT-4 significantly surpassing it.
Graduate Record Examination (GRE) Quantitative
The GRE Quantitative section measures the test-taker’s ability to understand, interpret, and analyze quantitative information, and to solve problems using mathematical concepts.
GPT-4 scored 163 out of 170 in this section, putting it in the estimated 80th percentile of test-takers, displaying strong mathematical reasoning and problem-solving abilities.
In contrast, GPT-3.5 scored 147 out of 170, which lands it in the 25th percentile, revealing a significant gap between GPT-4 and GPT-3.5 in quantitative reasoning and problem-solving skills.
The average score on the GRE Quantitative section is around 154, so while GPT-3.5 scores are slightly below average, GPT-4 exceeds the average score by a fair margin.
Graduate Record Examination (GRE) Verbal
The GRE Verbal section evaluates a test-taker’s ability to analyze and draw conclusions from discourse, understand multiple levels of meaning, and summarize text.
GPT-4 displayed an exceptional performance by scoring 169 out of 170, which puts it in the 99th percentile of test takers. This shows the advanced capabilities of GPT-4 in understanding and processing complex verbal information.
In comparison, GPT-3.5 scored a lower 154 out of 170, placing it in the 63rd percentile. Again, the leap in performance from GPT-3.5 to GPT-4 is evident, especially in handling intricate verbal tasks.
The average GRE Verbal score is approximately 150, and while GPT-3.5 slightly exceeds this average, GPT-4 substantially surpasses it.
Advanced Sommelier (theory knowledge)
The Advanced Sommelier exam is designed to assess the wine knowledge and service abilities of prospective sommeliers. It’s known for being particularly challenging.
GPT-4 achieved a score of 77%, showing a deep understanding of the world of wine, from its history and production to service and tasting.
GPT-3.5, on the other hand, scored lower with 46%. This highlights a considerable jump in the depth and breadth of knowledge from GPT-3.5 to GPT-4.
Codeforces
Codeforces is a platform that hosts competitive programming contests. Participants are ranked based on their problem-solving abilities.
GPT-4 earned a rating of 392 and GPT-3.5 a rating of 260, both of which fall below the 5th percentile. This suggests that while both models have some capacity for solving programming problems, they still struggle with the more complex tasks typical of competitive programming.
AP Calculus BC
The AP Calculus BC exam tests students’ understanding of the concepts of calculus, their ability to apply these concepts, and their ability to make connections among graphical, numerical, analytical, and verbal representations of mathematics.
GPT-4 scored a 4 on this exam, placing it in the 43rd–59th percentile of test takers and demonstrating a solid understanding of calculus concepts and applications. GPT-3.5, by contrast, scored only a 1, which falls in the 0th–7th percentile.
Even for GPT-4, this result represents a noticeable drop compared to many of the other tests evaluated, indicating that there is still room for improvement in how these models handle calculus problems.
The average AP Calculus BC score is approximately 3.8, so GPT-4 sits slightly above the average while GPT-3.5 falls well below it.
AP Biology
The AP Biology exam assesses a student's understanding and knowledge of modern biology. The exam includes multiple-choice questions, as well as free-response questions that require essay writing, problem-solving, and design of experiments.
GPT-4 and GPT-3.5 scored a 5 and a 4 respectively. This places GPT-4 in the 85th-100th percentile of test takers, and GPT-3.5 in the 62nd-85th percentile. This again emphasizes the improved comprehension and application of biological knowledge demonstrated by GPT-4 over its predecessor.
The average AP Biology score is around 2.9, so both models perform well above the average.
Leetcode (easy, medium, hard)
Leetcode is a platform that provides coding challenges of varying difficulty levels to prepare individuals for software engineering interviews.
On the easy level, GPT-4 solved 31 of 41 problems; on medium, 21 of 80; and on hard, 3 of 45. GPT-3.5 managed only 12, 8, and 0 problems at the easy, medium, and hard levels respectively.
It’s hard to compare these scores to an average as it varies greatly depending on the individual’s programming experience. However, given that GPT-4 solved a reasonable proportion of problems at each level, we can infer that its coding ability is fairly robust, certainly outperforming GPT-3.5 in this domain.
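To put those raw counts in perspective, a minimal sketch (using the problem counts reported above) converts them into solve rates per difficulty level:

```python
# Problems solved per difficulty level, out of the totals reported by OpenAI
results = {
    "easy":   {"total": 41, "gpt4": 31, "gpt35": 12},
    "medium": {"total": 80, "gpt4": 21, "gpt35": 8},
    "hard":   {"total": 45, "gpt4": 3,  "gpt35": 0},
}

for level, r in results.items():
    # Express each model's solved count as a percentage of the total
    rate4 = 100 * r["gpt4"] / r["total"]
    rate35 = 100 * r["gpt35"] / r["total"]
    print(f"{level:>6}: GPT-4 {rate4:4.1f}%  GPT-3.5 {rate35:4.1f}%")
```

Roughly three quarters of the easy problems fall to GPT-4, but only about 7% of the hard ones, which mirrors the pattern seen on Codeforces.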
American Mathematics Competitions 10 (AMC 10)
The American Mathematics Competitions 10 (AMC 10) is a math competition for students in grades 10 and below. It's known for challenging students with non-routine problems that encourage creative problem-solving.
In this exam, GPT-4 scored 30 out of 150 and GPT-3.5 scored 36 out of 150, placing them in the 6th–12th and 10th–19th percentiles respectively; notably, this is one of the few exams where GPT-3.5 outscored GPT-4. While these scores are lower than in most of the other exams, the AMC 10 is known for its challenging, non-routine problems, which can be difficult even for strong students.
AP Chemistry
AP Chemistry is an advanced placement course that covers a full year of college-level general chemistry. The AP exam evaluates a student's knowledge in multiple areas of chemistry, including thermodynamics, kinetics, and atomic structure.
In this exam, GPT-4 scored a 4, putting it in the 71st-88th percentile, while GPT-3.5 scored a 2, landing in the 22nd-46th percentile. This is a clear testament to GPT-4’s enhanced ability to grasp and apply chemical knowledge compared to GPT-3.5.
The average AP Chemistry score is around 2.6, indicating that GPT-4 scored well above the average.
OpenAI also compared GPT-4 with the state-of-the-art (SOTA) language models across different tasks. For instance, in the AI2 Reasoning Challenge (ARC), GPT-4 achieved a 96.3% accuracy, surpassing GPT-3.5, which achieved an 85.2% accuracy.
In the HumanEval Python coding tasks, GPT-4 managed to achieve a score of 67%, while GPT-3.5 only managed to score 48.1%. For reading comprehension and arithmetic tasks in the DROP benchmark, GPT-4 scored 80.9, while GPT-3.5 scored 64.1.
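To make these gaps concrete, here is a small sketch that turns the benchmark figures above into absolute and relative improvements (DROP is reported as an F1 score):

```python
# (GPT-4 score, GPT-3.5 score) per benchmark, as cited above
benchmarks = {
    "ARC (accuracy)": (96.3, 85.2),
    "HumanEval (pass rate)": (67.0, 48.1),
    "DROP (F1)": (80.9, 64.1),
}

for name, (gpt4, gpt35) in benchmarks.items():
    # Absolute gain in points, and gain relative to the GPT-3.5 baseline
    delta = gpt4 - gpt35
    relative = 100 * delta / gpt35
    print(f"{name}: +{delta:.1f} points ({relative:.0f}% relative gain)")
```

The relative gain is largest on HumanEval (about 39%), consistent with the coding improvements seen on Leetcode.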
These results clearly demonstrate GPT-4’s improved ability in tasks requiring complex problem solving, logical reasoning, and understanding context over GPT-3.5.
Overall, the GPT-4 model demonstrated superior performance across a variety of human-designed tests and benchmarks. This showcases the improvements in GPT-4’s ability to handle more complex tasks and more nuanced instructions than its predecessor, GPT-3.5. However, it is worth noting that there is still much progress to be made, and these AI models are far from perfect.
What about the Turing Test?
The Turing Test, initially suggested by British mathematician and computer scientist Alan Turing in 1950, is an AI assessment technique that evaluates a computer’s capacity to emulate human intelligence.
The test revolves around the computer’s capability to demonstrate intelligence comparable to that of a human, or to mimic human behavior convincingly. It serves as a standard to gauge the advancement and complexity of AI models.
Considering the current capabilities and exam performance of the GPT-4 model, it showcases performance levels approaching human-like intelligence. Given this, the potential for GPT-4 to pass the Turing Test is indeed high.
The model’s ability to understand context, infer meaning, and generate coherent and meaningful responses would give it a significant advantage in such an evaluation. You can also take a look at our post about ChatGPT’s Turing Test performance.
The differences between the GPT-3.5 and GPT-4 models
GPT-4 is the most recent and advanced model released by OpenAI. The GPT-3.5 model set the stage for GPT-4’s advanced features. Comparing their exam performances, it’s evident that GPT-4 outperforms GPT-3.5.
However, the differences between GPT-3.5 and GPT-4 extend far beyond their respective performances in tests. There are numerous variances between these two models, with one of the most significant being the multimodal capabilities of GPT-4. Unlike GPT-3.5, GPT-4 has the capacity to process image inputs. This allows it to handle tasks involving images and text together.
For more detailed information take a look at our post about the differences between GPT-4 and GPT-3.5.
We’ve explored how impressive GPT-4’s performance is in various tests and competitions. It has even surpassed human performance in some areas, showcasing significant advancements in AI development.
GPT-4 outshines previous models like GPT-3.5, with more reliable results, increased creativity, and improved understanding of complex instructions. These upgrades make it a powerful tool for various applications.
The transition from GPT-3.5 to GPT-4 represents a substantial step forward, and it is exciting to imagine what future AI models will make possible.