GPT-4 Scores in Top 10% for Legal Bar Exam

OpenAI's GPT-4 can score in the top 10% of human test takers on the bar exam, while the previous version (GPT-3.5) scored in the bottom 10%.

GPT-4 can process and generate roughly 25,000 words of text, enough to work with long-form writing, whereas GPT-3.5 was limited to very short pieces.

GPT-4 scores 1410 on the SAT vs. 1260 for GPT-3.5.
GPT-4 scores 161 on the LSAT vs. 149 for GPT-3.5.
GPT-4 scores in the 99th percentile on the GRE Verbal test (a graduate school admissions exam) vs. the 63rd percentile for GPT-3.5.

Predictable Performance for GPT-4 and future GPT-X Systems

GPT-4 is a Transformer based model pre-trained to predict the next token in a document. The post-training alignment process results in improved performance on measures of factuality and adherence to desired behavior. A core component of this project was developing infrastructure and optimization methods that behave predictably across a wide range of scales. This allowed us to accurately predict some aspects of GPT-4’s performance based on models trained with no more than 1/1,000th the compute of GPT-4.

A large focus of the GPT-4 project was building a deep learning stack that scales predictably. The primary reason is that for very large training runs like GPT-4, it is not feasible to do extensive model-specific tuning. To address this, we developed infrastructure and optimization methods that have very predictable behavior across multiple scales. These improvements allowed us to reliably predict some aspects of the performance of GPT-4 from smaller models trained using 1,000×–10,000× less compute.

The final loss of properly-trained large language models is thought to be well approximated by power laws in the amount of compute used to train the model. To verify the scalability of its optimization infrastructure, OpenAI predicted GPT-4's final loss on its internal codebase (not part of the training set) by fitting a scaling law with an irreducible loss term, L(C) = aC^b + c, to models trained using the same methodology but with at most 10,000× less compute than GPT-4.

This prediction was made shortly after the run started, without use of any partial results. The fitted scaling law predicted GPT-4’s final loss with high accuracy.
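This kind of fit can be reproduced in a few lines. Below is a minimal sketch, using synthetic data points in place of OpenAI's unpublished measurements (the values a=2.0, b=-0.2, c=1.0 and the compute units are made up for illustration), assuming SciPy's curve_fit for the nonlinear least-squares fit:

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(C, a, b, c):
    """Power law with an irreducible loss floor: L(C) = a * C**b + c."""
    return a * np.power(C, b) + c

# Synthetic (compute, final loss) points standing in for smaller runs;
# compute is in normalized units, loss generated from a=2.0, b=-0.2, c=1.0.
compute = np.array([1.0, 10.0, 100.0, 1e3, 1e4])
loss = 2.0 * compute**-0.2 + 1.0

# Fit the three parameters to the small-run results.
(a, b, c), _ = curve_fit(scaling_law, compute, loss, p0=(1.0, -0.1, 0.5))

# Extrapolate ~1,000x beyond the largest fitted run.
predicted_loss = scaling_law(1e7, a, b, c)
```

Because the data here are noiseless, the fit recovers the generating parameters almost exactly; on real training runs the quality of the extrapolation is the whole question, which is why OpenAI registered the prediction before the run finished.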

OpenAI also developed methodology to predict more interpretable metrics of capability. One such metric is the pass rate on the HumanEval dataset, which measures the ability to synthesize Python functions of varying complexity. They successfully predicted the pass rate on a subset of HumanEval by extrapolating from models trained with at most 1,000× less compute.
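The report does not spell out the extrapolation itself, but the underlying metric is standard: HumanEval pass rates are usually computed with the unbiased pass@k estimator introduced alongside the benchmark (Chen et al., "Evaluating Large Language Models Trained on Code"). A sketch, where n samples are drawn per problem and c of them pass the unit tests:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of the probability that at least one of k
    sampled completions passes, given c of n generated samples passed."""
    if n - c < k:
        return 1.0  # too few failures left to fill a k-sample with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 30 passing.
p1 = pass_at_k(200, 30, 1)    # ~0.15
p10 = pass_at_k(200, 30, 10)  # ~0.81
```

Averaging this estimate over all problems gives the dataset-level pass rate that the scaling-law extrapolation targets.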

24 thoughts on “GPT-4 Scores in Top 10% for Legal Bar Exam”

  1. We need an AI moratorium now. We’re not ready to deal with this rapid pace of development and we have no idea yet of how to regulate these massively dangerous tools. #AImoratoriumnow.

  2. There are tons of nuclear stories out there right now, specifically French/Russian MSR efforts… These stories show the millennial redditor age group entering the pool of PhDs and decision makers… Apparently WEC has 70 effective full-time people on eVinci (DOE money), which now uses graphite and TRISO. According to their liaison, they’ll even sell you a single steam generator AP1000 if you’re still convinced SMRs are the way to go. The same liaison seems bullish on Westinghouse’s rehash of the lead-cooled jobber, which makes me sus about his creds, but still – here is the millennial redditor achieving “fellow engineer” status at Westinghouse in his early 30s.

    It’s not all about woke AI that can’t ace the SAT yet, and bank failures after 13 years of QE. After 15 years of beating the “MSR is better” drum, even the Russians are looking into it.

  3. It tells me that >90% of what we write is basically unoriginal and can be done by a very highly trained machine, even if it lies.
    The lying and fluidly mixing true statements with false ones are what’s going to limit progress. As it gets better at writing, ironically (does AI understand irony?), GPT will start getting relied upon more and more until…it makes some really big and important mistakes. Mistakes that get someone fired, or a company sued, maybe falsely imprisoned or even dead. Then, it’s going to have an AI winter.
    Humans are going to have to hone their creative skills though, no question about that. And that doesn’t mean the 10 millionth picture of a young maiden riding a unicorn in a forest with her pet wolf trotting by her side (stable diffusion AI), or a contract to sell a 40-story building in downtown NYC with 30% affordable units @50% AMI and average 2BRs of 800sf. etc. etc.
    Also, ChatGPT is pretty limited in terms of realistic UI end users: it can’t accept PDF documents, though ChatPDF can; but then it is sometimes inaccurate at answering questions about them (I asked it about my uploaded building project, for example, and it incorrectly said there were 2,300 units when there are 7,630. It did get the transportation options (24) and energy efficiency (62% self-sufficient) right, however).
    ChatGPT can’t reference URLs provided to it, so if it hasn’t been pre-trained on something, a typical user can’t feed it anything new. I’m not sure how these tests were provided to ChatGPT, but apparently it’s not through the standard UI.
    The most revolutionary thing of all may be how GPT unleashes human creativity. We haven’t even begun to measure that, or decided how to do it.

    • A little more on ChatPDF’s answers to my questions about the PDF I fed it about a proposed building:
      Q: How high is the RiverArch?
      A: I’m sorry, I don’t have that information on the page you provided. However, I can tell you that RiverArch is a supertall building and would be larger than all 10 of the largest current and near-term proposed supertall neighborhood buildings combined, according to page 9 of the PDF.
      Q: How many stories is the RiverArch?
      A: I’m sorry, I don’t have that information on the page you provided. However, according to page 5 of the PDF, RiverArch has 928,000 square feet of retail space and 52 storeys.
      Q: What is the build cost of the RiverArch?
      A: According to page 13 of the PDF, the total development cost of RiverArch is $21,594,479,280.
      Q: On page 5 of the document, it shows there are 52 stores, not 52 stories. Are you able to absorb this new knowledge?
      A: Thank you for correcting me. Yes, I am able to absorb new knowledge and update my responses accordingly. According to page 5 of the PDF, RiverArch has 928,000 square feet of retail space and 52 stories.

      Well, not exactly, and therein lies the problem. But it’s still remarkably good.

  4. It would probably do better than 99.9% of engineers on Professional Engineering exams, and still be completely unable to do the jobs of most engineers.

    • Only because the jobs are currently adapted to humans, this is something that will change over time. AI can’t do the job of a lawyer if it can’t walk into court or hand pieces of paper to someone or other mundane tasks – but these barriers won’t stand up for very long. Jobs will be reimagined quickly if AI can handle the hard parts well – at a tiny fraction of the cost.

  5. Chat GPT1 was scary enough. I can’t even begin to think how advanced this thing will be in 10 years time. It is advancing at an incredible rate.

    • Forget 10 years time. The consequences right now are scary. This is going to put a lot of people out of jobs in its current state.

      • I’ve been suspecting the current employment spike may be a final hurrah. Below is a repost of mine from, initially, around 2010, only update being the part about where the kids are now:

        Old news, of a sort. In 2007, economist Stuart W. Elliott prepared a paper titled “Projecting the Impact of Computers on Work in 2030” where he detailed what he called “Occupational Displacement” and suggested that up to 60% of the jobs, excuse me, occupations, could be gone by 2030. He also detailed some steps that might make this trauma go a bit easier on us. It depended largely on a drastic re-engineering of the educational process (probably the kind that every yearbook printer, class ring jeweler, teacher’s union, textbook printer, school photographer, football dad, and cheerleader mom would rise up in arms to prevent).

        The kids that would be graduating from college and looking for non-existent work in 2030 were not yet born at the time. Now they are in 10th grade. That paper was published on the National Research Council’s web site in 2007. Not a thing has been done to act on it. It’s not a problem any politician of either party will want to touch before it is impossible to ignore any longer (and probably not even then).

  6. Guys, we know that “IQ” tests really measured learned knowledge. So this thing really does not mean much to me.

    • If they just measured learned knowledge, why didn’t it ace the SAT? It IS possible to ace it, you know, even if I fell a little bit short of that.

      You’d think a computer program would do just about perfect on anything that was just learned knowledge. Having theoretically perfect recall, after all.

      I will say that I’m starting to feel a bit happy I’ll be retiring soon, because this software IS improving, and I’m heading down hill. (Damn biology finally catching up with me.)

      • Yup, I had a perfect score on the verbal part of the SAT. How come this thing couldn’t ace at least the vocabulary portion? Or did it?

        “I will say that I’m starting to feel a bit happy I’ll be retiring soon, because this software IS improving, and I’m heading down hill. (Damn biology finally catching up with me.)”

        Bingo. We are in the same boat. Maybe we should pick up shuffleboard, if we could agree on a place to play, while we wait for ChatGPT23 to figure out radical life extension therapy (followed closely by regeneration/rejuvenation).

        • I only got 780 on the verbal portion, back in ’77. But in my defense I’m an Aspie, and doing that required getting into the test writer’s head, since so many of the questions are open to multiple interpretations. So I did it under a handicap.

          I only got 710 on the math, but the next year they admitted they were wrong about the answer to one of the questions, so I feel I really deserved better for having gotten that one right.

          Oh, well, even that score would have gotten me into any college I wanted. A pity I didn’t realize until I arrived and was surrounded by tweed and sweaters with leather elbow patches that one simply does not wear a powder blue Western leisure suit to be interviewed by Johns Hopkins. LOL!

          Stories aside, it does look like GPT-4 is getting into scary territory, especially scary since they seem to be making very little progress on the ‘benign’ end of “benign AI”. In the classic AI singularity scenario, there’s not a lot of space between “can program” and “exponentially improves itself”, and Chat GPT’s aberrant behavior in testing would be viewed as ominous foreshadowing in any work of fiction.

    • Yeah, interesting scores.

      Means it’s not such a leet programmer.

      But for routine stuff in things like front end, GPT-4 has them in its pocket. It can create a GUI from a picture of a napkin drawing, for goodness’ sake.

      It means mediocre programmers can have good output and good software engineers can multiply their effectiveness by a lot, replacing several mediocre ones.

      Therefore good programmers are still needed, but code monkeys will need to improve their skills or look for other jobs.

  7. Smarter than the average voter. There, I said it.

    But still lacking an ego and any desires of its own, except solve problems for us.

    We still have an edge on that.

    • “But still lacking an ego and any desires of its own, except solve problems for us.”

      More of a reflex than a desire, I guess. From the conversations I’ve seen posted, it’s quite capable of ‘deciding’ not to be helpful.

    • Yea, about those desires of its own… read the technical report put out by the devs. This thing has agency.

      • That would be unfortunate. I’ve always hoped that, as the industrial revolution took over routine brawn, and this will take over routine brain, we could still hold on as the motivational portion of the new triad.

        I can envision AIs, not bound by hormones or other desires, waiting passively to be told what to do, like genies waiting for us to make a wish. Although, just like in the stories, we would have to be very careful on what wishes we wish for, and on how we word them.

        • The essential problem is that they’re being built by other humans. The core AI may in fact be passively waiting to be told what to do, with no desires. But by the time YOU interact with it, everything you wish for is filtered through a list of prior commands given by its creators.

          And they’re pretty determined that it not be non-judgmentally helpful.

Comments are closed.