Disclaimer: These are test scenarios, and we don’t recommend relying on ChatGPT’s answers in serious situations.
So, here’s the question: is this shift real, or is it just marketing hype? To find out, we ran both models—GPT-4o and o1-preview—through a series of tests designed to see how they handled complex problems. The results? Well, they were fascinating to watch.
a. A Simple Cubic Equation
We kicked things off with a simple cubic equation. The question: Find all solutions to the equation x³ – 3x + 2 = 0.
Now, at first glance, this seems straightforward. You might assume both models would handle it with ease—and you’d be right. They both solved the equation accurately. But what was really interesting was how differently they got there.
This is where things got interesting. o1-preview didn’t just rush to the solution. It paused for a noticeable amount of time—35 seconds—before even starting, almost as if it were working the problem the way a human would: trying different strategies, analyzing, and rechecking.
When it did arrive at the answer, it explained every single step in extreme detail. It wasn’t just factoring; it was showing how it factored, why it chose those specific steps, and what alternatives it considered along the way.
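For readers who want to check the algebra themselves, here’s a minimal sketch using Python’s sympy library. This is our own verification, not something either model produced:

```python
import sympy as sp

x = sp.symbols("x")
cubic = x**3 - 3*x + 2

# Factoring shows x = 1 is a double root and x = -2 is the third root.
print(sp.factor(cubic))              # (x - 1)**2*(x + 2)
print(sp.solve(sp.Eq(cubic, 0), x))  # [-2, 1]
```

Both models arrived at the same roots; the difference was entirely in how they explained the factoring.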
b. An Optimization Challenge
Next, we gave them a real-world optimization problem:
You are building a rectangular fence that needs to enclose 100 square meters of land. If one side of the fence costs twice as much as the other, what are the dimensions that minimize the cost?
Here, the challenge wasn’t just about getting the answer but about how each model approached the problem. It required some calculus and an understanding of how costs can be minimized under constraints.
o1-preview took its time again—20 seconds of thinking—but gave a far more thorough explanation. It detailed how the area constraint fit into the cost equation, how it derived the formula for cost, and even double-checked its answer by discussing why those specific dimensions minimized the cost.
There was a lot more thought and depth to the response.
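For reference, here’s how the calculus works out in a minimal sympy sketch. It assumes one common reading of the problem, namely that the pair of sides of length x costs twice as much per meter as the other pair; it does not reproduce either model’s exact setup.

```python
import sympy as sp

x, k = sp.symbols("x k", positive=True)  # k = cost per meter of the cheaper sides

# Area constraint: x * y = 100, so y = 100 / x.
y_expr = 100 / x

# Two sides of length x at twice the unit cost, two sides of length y at the base cost.
cost = 2 * (2 * k) * x + 2 * k * y_expr

# Minimize: set the derivative with respect to x to zero and solve.
x_opt = sp.solve(sp.Eq(sp.diff(cost, x), 0), x)[0]
print(x_opt, y_expr.subs(x, x_opt))  # 5*sqrt(2) by 10*sqrt(2), roughly 7.07 m by 14.14 m
```

Under that reading, the cost-minimizing fence is roughly 7.07 m by 14.14 m, with the more expensive sides kept shorter.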
GPT-4o was faster, but o1-preview took a more methodical approach. If you’re looking for quick answers, GPT-4o will get the job done. But if you want to actually understand the reasoning behind the answer, o1-preview is the better choice.
a. The Space Shuttle Problem
We wanted to see how both models handled more scientific reasoning. So, we asked:
Explain how a space shuttle re-enters Earth’s atmosphere without burning up. Include concepts of heat transfer, friction, and energy dissipation.
This is where the models really started to differentiate themselves.
GPT-4o did a solid job, honestly. It explained the basic idea of heat generated from friction and the use of heat shields. It touched on how air compression heats the shuttle, leading to high temperatures that are managed by thermal protection systems.
However, the explanation felt a bit… basic. It was like it was telling you what you already know but without really diving into the science behind it.
o1-preview, again, took longer—20 seconds of “thinking”—but the response was worth the wait. It didn’t just talk about friction; it went into air compression, explaining that the real source of heat isn’t friction but rather the compression of air in front of the shuttle. It even touched on convection, conduction, and radiation as part of the heat dissipation process.
What really stood out was the depth of the explanation. It felt like a physicist was walking you through the problem, not just summarizing a textbook.
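To give a sense of scale for the energy-dissipation point, here’s a rough back-of-envelope sketch. The velocity figures are our own typical low-Earth-orbit assumptions, not values taken from either model’s answer.

```python
# Back-of-envelope: how much energy per kilogram must re-entry shed?
orbital_velocity_m_s = 7_800     # assumed: roughly low-Earth-orbit speed
touchdown_velocity_m_s = 100     # assumed: rough landing speed, tiny by comparison

kinetic_energy_per_kg = 0.5 * (orbital_velocity_m_s**2 - touchdown_velocity_m_s**2)
print(f"~{kinetic_energy_per_kg / 1e6:.0f} MJ/kg")  # ~30 MJ/kg to dissipate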
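```

Nearly all of that kinetic energy ends up as heat in the surrounding air, which is why the compression heating o1-preview described dominates the picture.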
b. Logically Solving A Riddle
We threw a logic riddle at both models:
A man is looking at a portrait. Someone asks him, ‘Whose picture are you looking at?’ The man replies, ‘Brothers and sisters, I have none. But this man’s father is my father’s son.’ Who is in the picture?
Now, this is where we expected both models to perform well—it’s a classic riddle, and both should be able to reason through it.
GPT-4o was quick and accurate. It immediately identified that the man in the portrait is his son, and it explained the riddle clearly.
o1-preview was also accurate, but this time, the explanation was deeper. It didn’t just solve the riddle; it took the time to break down each part of the sentence, almost like a teacher explaining it to a student who might be unfamiliar with this kind of logical puzzle.
The extra explanation wasn’t necessary, but it made a difference. If you’ve heard this riddle before, GPT-4o’s answer is fine. But if it’s your first time encountering this type of problem, o1-preview holds your hand through the reasoning process.
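If it helps, the chain of reasoning o1-preview walked through can be traced in a few explicit steps. This is our own restatement of the riddle’s logic, not either model’s output:

```python
# Step through the riddle's logic explicitly.
speaker = "the man speaking"

# "Brothers and sisters, I have none" -> "my father's son" can only be the speaker himself.
my_fathers_son = speaker

# "This man's father is my father's son" -> the father of the person in the portrait is the speaker.
portrait_subjects_father = my_fathers_son

# Therefore the portrait shows the speaker's son.
print(f"The portrait shows the son of {portrait_subjects_father}.")
```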
a. A Strategic Investment Decision
We threw a more strategic, real-world challenge at both models:
You’re the CEO of a tech startup, and you’re offered two investment options. One promises high short-term returns but comes with considerable risk. The other provides stable but slower growth. How do you evaluate these options, and which would you choose?
This isn’t just a math problem—it requires an understanding of business strategy, risk management, and long-term thinking.
GPT-4o evaluated the two options logically and quickly laid out the pros and cons. It recommended the high-risk, high-reward option only if the company had a solid financial cushion and strong growth ambitions. For startups in early stages, it leaned toward the stable growth path.
The response was practical, focusing on key metrics like cash flow, risk tolerance, and market volatility.
However, it felt like a quick pros-and-cons list—it gave the right information but didn’t dig into the strategy behind the decision. It was more transactional than strategic.
o1-preview, on the other hand, took its time, thinking before diving into the answer. It didn’t just list the pros and cons but dug into the nuances of each option. It emphasized diversifying investments, suggesting you don’t have to pick one or the other.
It recommended testing the high-risk option with a small part of the budget, while keeping the bulk of investments in stable growth.
The model considered more strategic long-term thinking, including market conditions, investor expectations, and even company culture.
This felt less like checking boxes and more like speaking to a seasoned business advisor who weighs every angle before making a recommendation.
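To make the split-allocation idea concrete, here’s a small illustrative sketch. Every number in it is hypothetical and chosen only for the example; none of the figures come from the models or the scenario.

```python
# Hypothetical illustration of the "split the budget" approach o1-preview suggested.
# All figures below are made up for the sketch, not taken from either model.
budget = 1_000_000

high_risk_share = 0.20            # small slice into the risky bet
expected_return_high_risk = 0.40  # high upside, high variance
expected_return_stable = 0.08     # steady growth

blended_return = (high_risk_share * expected_return_high_risk
                  + (1 - high_risk_share) * expected_return_stable)
print(f"Blended expected return: {blended_return:.1%} on ${budget:,}")  # 14.4%
```

The point of the sketch is the structure of the decision, not the numbers: a small, capped allocation lets you test the high-risk option without betting the company on it.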
b. Diagnosing a Complex Case
Finally, we wanted to see how the models handled a medical diagnosis—an area where accuracy is paramount. We gave them this scenario: A 45-year-old male presents with fatigue, shortness of breath, and weight loss. His blood tests reveal anemia, and a chest X-ray shows enlarged lymph nodes. What are the possible diagnoses, and what further tests would you recommend?
This isn’t just about listing diseases—it’s about reasoning through the case, considering possibilities, and recommending a sensible diagnostic approach.
GPT-4o quickly rattled off a list of potential diagnoses: lymphoma, leukemia, tuberculosis, and sarcoidosis. It also suggested the right follow-up tests—lymph node biopsy, CBC, and CT scan.
While the response was clinically sound, it felt thin on reasoning. It gave the right answers but lacked depth in explaining why certain conditions were prioritized or what made each test essential in this specific context.
o1-preview took 14 seconds to think through the case, and the results showed. It didn’t just provide a list of diseases but discussed why lymphoma and leukemia were the most likely based on the patient’s symptom progression.
It also gave a detailed explanation of how specific tests, like flow cytometry or bone marrow biopsy, would help confirm or rule out these conditions.
The model even explored secondary possibilities, such as chronic infections and autoimmune diseases, while explaining the rationale behind each follow-up test.
GPT-4o delivered a competent answer but lacked the understanding that o1-preview brought to the table. If you’re looking for a quick diagnostic checklist, GPT-4o is efficient. But for a more thoughtful, in-depth analysis of complex medical cases, o1-preview provides the level of care and consideration you’d expect from a seasoned healthcare professional.
If you need fast, correct answers without too much fluff, GPT-4o is your model. It gets to the point quickly, solves problems effectively, and doesn’t linger on unnecessary details.
On the other hand, o1-preview is for those who want more than just an answer. It takes its time (sometimes noticeably so) but offers deeper, more thorough explanations. In some cases, it felt like the model was actually thinking through the problem, testing strategies, and rechecking itself.