
[AI SPRINT] What Is Deep Research AI? Comparing OpenAI, Google, Perplexity & X.AI

The top question I received last week was, “What do I need to know about the new Deep Research AI models?” This is such an important topic I’m devoting the full newsletter to it, with a comparison of OpenAI, Google, Perplexity and X.AI’s Deep Research models. As always, let me know what you think about this episode at the bottom of the newsletter.

Most people are now familiar with language model-based AI like OpenAI’s GPT-4o or Google’s Gemini. You provide input, the AI interprets your intent using natural language processing, and it generates a response based on its training data. These systems excel at handling high-level concepts and widely available information, but their performance deteriorates when addressing niche or lesser-known topics. Researchers have attempted to mitigate this through methods like fine-tuning and multi-shot prompting, yet these approaches remain limited by the one-to-one question-answer framework.

The Emergence of Reasoning AI

To overcome those limitations, in September 2024, OpenAI was the first to announce a “reasoning” language model (called “o1”). Unlike conventional AI models, reasoning models work by breaking down a user's question into smaller parts, planning how to answer it, and then refining their response step-by-step. This process—sometimes called a "chain-of-thought"—helps ensure the answer is clear and on target.
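As a toy illustration only (this is not how any vendor actually implements it, and every name here is hypothetical), the difference between a one-shot answer and a chain-of-thought approach can be sketched as decomposing a question into explicit steps whose intermediate results stay visible:

```python
# Toy sketch of the chain-of-thought idea: instead of mapping a question
# straight to an answer, break it into smaller steps, solve each one, and
# keep a visible trace of the reasoning. Purely illustrative; the function
# and the sample "trip budget" question are invented for this example.

def reasoned_answer(quantities):
    """Answer 'what is the total cost?' via explicit steps with a trace."""
    trace = []
    total = 0.0
    for item, (unit_price, count) in quantities.items():
        subtotal = unit_price * count                    # step: price one line item
        trace.append(f"{item}: {count} x {unit_price} = {subtotal}")
        total += subtotal                                # step: accumulate
    trace.append(f"total = {total}")                     # step: final synthesis
    return total, trace

total, trace = reasoned_answer({"hotel": (150.0, 6), "museum": (25.0, 4)})
print(total)        # 1000.0
for line in trace:  # the visible "thinking" a reasoning model exposes
    print(line)
```

The trace is the point: like a reasoning model's visible "thinking", each intermediate step can be inspected and checked, rather than trusting a single opaque answer.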

All of that takes time, and when you work with a reasoning model you'll see it "think", often in real time, so you can monitor its approach. Reasoning models are essentially more mature versions of the AI models you already use, built to follow user intent more accurately and provide more useful responses. OpenAI has the lead in reasoning AI today, improving on its o1 model with the new o3-mini and o3-mini-high models launched in January.

Although these provide significantly better results than non-reasoning models on most research-related uses, they still suffer from hallucinations: with the exception of Perplexity, the models do not validate facts against online sources and still rely solely on their training data.

The Rise of Deep Research Models

That’s where Deep Research comes in. Alongside the release of OpenAI’s o3 Reasoning model in January, they introduced Deep Research—a version specifically designed to mitigate hallucinations by incorporating real-time internet validation. Built on the o3 reasoning framework, Deep Research goes beyond existing training data by actively searching for additional information online, cross-checking facts, and providing human-checkable citations. This results in a more research-oriented response style that enhances accuracy and reliability.

The other language model providers have followed quickly, with Perplexity releasing their Deep Research model last week, and Elon Musk's X.AI releasing their competitor, Grok3, yesterday.

Each Deep Research model significantly outperforms all prior models, and the results are dramatic: OpenAI’s model outperformed its own previous best-performing model by nearly tripling its score on the "Humanity’s Last Exam" benchmark (from 9.07% to 26.6%). Perplexity’s model achieved 20.5%, while Gemini fell significantly behind at 7.2%. Data is not yet published for X.AI’s performance on that benchmark.

Because of the additional steps needed to iteratively gather and check their work, Reasoning and Deep Research models generally take much longer to provide a response. For example, I asked each OpenAI model to help me plan a weeklong trip to Chicago. GPT-4o provided an answer immediately, o3-mini responded in 27 seconds, and Deep Research took 8 minutes. Quality increased with each one, with the Deep Research response being an excellent, tailor-made vacation itinerary.

Reasoning and Deep Research models are quite impressive: the leap in quality and usefulness is as big as the ChatGPT launch just over 2 years ago.  This is a generational shift in AI.

Comparing the Top Models

There are now four top Reasoning and Deep Research products on the market, from OpenAI, Google, Perplexity, and X.AI. Although OpenAI's product is the hands-down winner, each has different capabilities, and ultimately the best one is the one you are willing to pay for.

Here’s a brief overview of each and their capabilities today. Afterward, I’ll go into detail on each, and finish with cautions and tips for beginning your use.

Perplexity:

  • Reasoning: Perplexity offers a reasoning model based on DeepSeek R1, though as I explain below, I suggest avoiding it.

  • Deep Research: Free for everyone at 5 queries per day, with unlimited queries on a paid plan, making it the cheapest way to try Deep Research.

Gemini Deep Research:

  • Reasoning: Gemini 2.0 Flash Thinking Experimental is provided free, making it a great low-cost way to access reasoning for anyone who does not already have a ChatGPT subscription.

  • Deep Research: Gemini 1.5 Pro with Deep Research is Google's general-purpose Deep Research tool, and at only $20 a month it's a good deal (note that Google hasn't published the number of queries it allows). Unfortunately, although a good tool at a great price point, it currently underperforms the Perplexity and OpenAI models, so while useful, it may not be best for the most complex research use cases.

OpenAI ChatGPT:

  • Reasoning: ChatGPT currently has two reasoning models: o3-mini, the "fastest" version with the highest usage limits, best for general-purpose tasks; and o3-mini-high, the "advanced" thinking model for coding and scientific research. Both are included in paid plans, with limits of 150 messages a day for o3-mini and 50 per week for o3-mini-high.

  • Deep Research: This is the best Deep Research tool on the market, but it is only available with the Pro plan (and not yet business plans). For $200 a month, it gives up to 100 queries per month.

X.AI’s Grok3:

  • Reasoning: Just launched, Grok3 now has a "Think" mode, included in their $40-per-month AI plan. There is no published information on the number of queries it allows.

  • Deep Research: Also just launched, Grok3 has a Research mode to compete with the other Deep Research providers. It also requires the $40-per-month AI plan, with undetermined usage limits. Based on limited testing, it is by far the fastest of the bunch, but it also seems to be the weakest in capabilities.

Is your company struggling with AI Adoption or Strategy?

Let me help! I can get your company from zero to AI-enabled in as little as 30 days, with leadership and staff AI workshops, keynote talks, online education, and advisory, all following my AI SPRINT™ adoption methodology.

Just reply here and we can arrange a short meeting to explore how to advance your AI efforts.

Which works best, and which should you use?

I tested each one across three different use cases: planning a personal vacation to Chicago, scientific research for a client, and competitor business research. You can check out and compare the results for the first two here; just click on the links to see the results (I am keeping the business research private for my own purposes). The total time each query took and the number of checked sources are listed for comparison.

What is clear from the table is that Grok3 is dramatically faster than any other Deep Research model: 27 seconds vs ChatGPT's 7 minutes. On the surface that looks amazing, but so far the results seem sub-par in comparison, with much less depth and more of a key-point summary that explains concepts at a high level.

ChatGPT Dominates in Depth and Accuracy

To get it out of the way: ChatGPT is definitely the best today for reasoning and deep research. It has the most flexibility with different models, the deepest answers, the most useful results, and the best-written content, all within the same UI. It truly shows the power of AI, and what will be available to most people as costs continue to go down.

What sets it apart:

  • It has the best Reasoning and Deep Research quality, and the Reasoning models do a great job on their own—much better than the default ChatGPT 4o model, and available to everyone. It has the highest score on the Humanity’s Last Exam benchmark—by far.

  • For Deep Research, before beginning, it prompts for a series of questions to tailor the results directly to you, significantly improving the results. For my Chicago trip, it asked the ages of who was attending, identifying that although it was a family trip, my kids were too old for typical “kid” activities. For scientific research, it qualified the type of research, geography, use case, and other key details. For competitive research, it asked about service options I was interested in, and how I wanted the results. These up-front questions make a huge difference, simplifying interaction and providing a much higher quality result.

  • It does a fantastic job of providing written content customized to the intent behind the research, instead of just sticking to basic facts. For example, it suggested on the Chicago trip great places to take family photos, which was not something I asked for.

  • You have access to its thinking process after the work completes, which is useful to see how it came up with suggestions.

  • It’s easy to move between Reasoning, Deep Research, and other models, all within the ChatGPT interface with your other chats.

  • It is multi-modal, accepting images and documents for input.

Challenges:

  • Cost: at $200 a month, it’s pricey. Unless you are doing heavy research and cannot use other tools, the cost is likely to be prohibitive for most people.

  • Deep Research is not in ChatGPT business plans. Although Reasoning models are available for Teams and Enterprise users, Deep Research is not. This is a major issue, since businesses are the most likely to pay for it. You'll end up having to purchase and maintain multiple accounts until this is resolved, sometime later in Q1.

  • It takes the longest to do research. It’s fascinating and useful to see how it thinks, but most questions take at least 5 minutes to answer, with upwards of 7 or 8 minutes for anything complex (or longer).

  • It's the hardest to get data out of. I don't know why OpenAI doesn't fix this, but getting formatted data out into Word, Google Docs, or PDF is more difficult than it should be.

Google’s Gemini: The Best Value for General Research

I think Google’s Gemini has the best budget Reasoning and Deep Research tool, available at a totally reasonable cost. If you have only $20 a month, this is the one to go for.

What sets it apart:

  • Cost: For only $20 a month you get good capabilities in both the Gemini 2.0 Flash Thinking and Gemini 1.5 Deep Research models, plus other Google benefits like increased storage space in the Google One plan.

  • Plan Preview: Before executing your research, it gives you a preview of what it is going to do, so you can make sure it understood your goals properly and that its plan is correct. This is good for visibility, and you can edit the plan, but it doesn't work as well as ChatGPT's Deep Research question-based approach.

  • Direct integration with Google products: It natively supports opening and editing in Google Docs and editing components (like tables) in Google Sheets, preserving formatting and making content editing and sharing straightforward.

  • More Sources: As it does research, it seems to look at more sources than any other model, which should result in more accurate responses to general questions where information is readily available.

  • Faster than OpenAI: Although not as fast as Perplexity, it typically completes within 3-4 minutes.

Challenges:

  • Worst Reasoning Capability: According to the Humanity’s Last Exam benchmark, it underperforms even the first generation OpenAI reasoning model, o1, and far underperforms OpenAI’s Deep Research (7.2% vs 26.6%). That said, the benchmark is made to be very difficult and is not typical of what people will use AI for. I think for most general purposes, it’s still a good product and will continue to improve. 

  • Citations & Editing: Once you pull the content to another platform for editing, including the built-in Google Docs conversion, the citations may become lost. They don’t copy out by default when moving to Microsoft Word, and moving to Google Sheets breaks hyperlinks. If you are doing lengthy research, it may be time consuming to fix.

  • Content Quality: Its writing is dry and not very engaging. It's good for gathering information, but you'll need to work with it to coax it into a usable style.

  • Doesn’t Reveal Thinking: I find it very useful to know what the AI is doing—it gives you confidence in how it finds information and reasons. Gemini doesn’t provide visibility to this, however.

  • Not multimodal: You can only instruct it via text, not with images or documents as with the other models.

Perplexity: The Cheapest Choice

With Deep Research, Perplexity continues their disruptor mentality by giving free use of their deep research capability to anyone. I love the approach, but it comes at a cost, and to manage it they seem to have cut down on iterations and verbosity: results are uniformly bullet points, and it is not as exhaustive in its research as the other models. Also, I suggest you avoid their reasoning model, which is based on DeepSeek R1.

Use it if you don’t have access to any other deep research tool or are experimenting.

What sets it apart:

  • Low Cost: Anyone gets 5 Deep Research queries per day, and with a paid plan you get unlimited. This really can make this powerful new technology available to everyone across the globe.

  • Fast: With an average run-time of about 2-3 minutes, it is a good speed for decent results.

Challenges:

  • Less Robust Research: The results from Perplexity are essentially bullet points of key facts instead of a robust and cohesive research report. It often seems to search the fewest sources of any model, creating a risk of less accurate or complete research.

  • More Difficult to Export and Reuse Content: Perplexity exports only via PDF or Markdown, neither of which is highly usable. You can import Markdown into Google Docs, but Microsoft Word doesn't support it natively. Copying and pasting is the best route, but it doesn't work well either.

  • Less Effective UI: Perplexity's UI is built for quick, search-style information finding, which doesn't work as well for deep research and long documents.

X.AI’s Grok3: Unproven AI with Baggage

Grok3 has been out for only a day, so the jury is still out. Although it was promoted as breaking several benchmarks for math and science, those benchmarks didn't include the other reasoning models shown here. An X.AI employee (now fired) claimed OpenAI's models beat it at coding. X.AI doubled the price for access, now $40 a month, and others are seeing lower performance in testing.

My own testing? The model works, but it is suspiciously fast and doesn't go into enough detail to be "deep research". It has a unique approach: it gives a bullet-point summary to start, then much more detail on how it came up with its response. It also seems to look for "surprising facts", which might be useful, but maybe not.

Combined with the political environment around Elon Musk and the $40-a-month price point, I think there are better models, and I would put X.AI at the bottom of the list.

What sets it apart:

  • It's Fast: By far the fastest of all Reasoning and Deep Research products, 17x faster than OpenAI!

  • Integration with X (formerly Twitter): If you already pay for an X.AI subscription, it might be compelling to stay within that ecosystem.

Challenges:

  • Quality: For “deep research”, the depth and quality seems to be the poorest of all models in my own testing, and other initial testing coming out.

  • Price: The price is higher than competitors, although you get X.AI included in that if desired.

  • Politics: It’s an Elon Musk product, so if you disagree with his views and actions, this is not one to support.

Final Verdict for Which AI Deep Research is Best?

Here it is:

  • For ultimate performance, OpenAI Deep Research is the best—but expensive.

  • For free access or experimentation, Perplexity offers a strong introduction.

  • For the best balance of cost and features, or people using other Google services, Gemini is the most practical choice.

But this may change quickly: OpenAI has announced that within Q1 they'll integrate Deep Research, make model selection much easier, and add charts, graphs, images, and other improvements, likely keeping that platform the front-runner. I also expect their price will come down, since the others are strong contenders. Google will likely release a new version of Gemini soon, and I bet we'll see models from Meta and Amazon shortly.

But Remember, They Aren’t Perfect

These models are groundbreaking, but they’re still experimental, evolving rapidly with plenty of kinks to work out before becoming truly mainstream. Here are some key limitations to keep in mind:

  • Limited context length – They can’t store or process vast amounts of data at once.

  • Less inherent knowledge – Deep Research AIs often know less than general-purpose models because they rely more on external sources. Choosing the right model for the task is crucial.

  • Not a replacement for human oversight – These are tools, not infallible systems. Understanding their limitations and how to use them effectively is essential and your responsibility.

  • Selective internet access – They don’t scan the entire web for answers; like humans, they make educated guesses about relevance—and sometimes get it wrong.

  • Overcomplicating simple tasks – They tend to overthink straightforward requests. Don’t use them for things like marketing copywriting.

  • Higher operational costs – Deep Research and Reasoning models require more computational power, making them more resource intensive to run.

Pro Tips for Using AI Reasoning Models

Finally, here are some important best practices to know when working with reasoning models. With these, you’ll be up and running as an AI-empowered research pro in no time!

  • Let AI determine the methodology – Avoid over-engineering your prompts. Instead, define your goals and constraints, then let the AI break down the problem logically.

  • Use Deep Research strictly for research – These models aren’t great at rewriting or optimizing content. Once you’ve gathered research, switch to a general-purpose AI for refinement.

  • Plan with a general AI first – If you already know exactly what you need, map out your request using another AI before passing it to a Deep Research model for execution.

  • Experiment and iterate – These tools are new and require practice. If the output isn’t perfect on the first try, tweak your approach and try again.

  • You own the output – AI can misunderstand, think differently, or even fabricate information. If you’re using it for work, always fact-check thoroughly—because at the end of the day, you are responsible for the results.

Tell us what you thought of today's email.


Did someone forward this newsletter to you? If you're not already signed up, you can subscribe to AI SPRINT™ for free here.