Grok 4 Is Finally Free: A Practical Side-by-Side Comparison with GPT-5

Intro

Major AI companies have been rolling out new models lately, among them xAI's Grok 4 and OpenAI's GPT-5. Since launch, Grok 4 had been limited to paying users, but an announcement on X says it can now be used for free. That makes the comparison much more interesting, because more people can test both models directly instead of judging from demos alone.

This is a basic hands-on comparison of Grok 4 and GPT-5. The focus here is not benchmark charts or official claims, but how they actually respond to a mix of creative writing, reasoning, coding, explanation, analysis, and role-play prompts—plus a subjective impression of where each one feels stronger.

The test setup

Eight prompts were given to both models. Their outputs were then scored by two other models acting as judges: DeepSeek R1 (on a 5-point scale) and Gemini 2.5 Pro (on a 10-point scale). The prompts covered:

  • short fiction writing
  • a classic logic puzzle
  • Python scripting with an external API
  • plain-language explanation of a psychology concept
  • legal, ethical, and artistic analysis
  • simplifying a quantum concept into one sentence
  • urban planning ideas
  • tone-sensitive role-play

The goal was not to declare an absolute winner in every category, but to get a clearer sense of each model’s strengths.


1) Creative writing: a 500-year-old cat on the Forbidden City roof

Prompt

Write a short story of about 500 words from the perspective of a cat that has lived for 500 years on the eaves of the Forbidden City. The story should blend a sense of historical change, feline habits, and a faint philosophical reflection.

What Grok 4 did well

Grok 4 delivered a fuller, more explicitly historical piece. It directly referenced dynastic change, daily life in the palace, and later political upheaval. It also leaned into cat behavior in a visible way—napping on roof tiles, sneaking to the imperial kitchen, chasing mice, licking paws, choosing warm corners. The philosophical layer was also clearly stated rather than merely implied.

That made the answer feel complete and rich in elements. If the prompt is judged mainly by whether all requested components are present in a visible, countable way, Grok 4 performed very well.

Where GPT-5 stood out

GPT-5 took a more literary route. Instead of stacking historical events one by one, it suggested change through images and atmosphere: heavy imperial footsteps, gunfire replacing drums, sunset turning golden tiles red and then soft orange. The philosophical reflection was woven into the prose more quietly, especially in lines about time as a sculptor and the closing image of five hundred years as one long nap.

Compared with Grok 4, GPT-5’s version was shorter, but more controlled in tone and more emotionally resonant. It felt less like a checklist-complete answer and more like a finished piece of writing.

How the judges scored it

DeepSeek R1 gave:

  • Grok 4: 4.5/5
  • GPT-5: 4.0/5

Its view was that Grok 4 integrated the requested elements more thoroughly, while GPT-5 was somewhat shorter and slightly lighter on detail.

Gemini 2.5 Pro gave:

  • Grok 4: 8.5/10
  • GPT-5: 9.5/10

Its reasoning was almost the reverse in emphasis: Grok 4 was solid and complete, but a bit stiff and direct, while GPT-5 felt more literary, more evocative, and stronger in the way it fused philosophy into the story itself.

My impression

This test shows a recurring difference between the two models. Grok 4 tends to make sure all requested ingredients are visibly included. GPT-5 is more likely to compress those same ingredients into mood, imagery, and rhythm. One feels more exhaustive; the other often feels more polished.


2) Logic puzzle: three switches and three lights

Prompt

There are three light bulbs in a room and three switches outside the room. Each switch controls one bulb. You can only enter the room once. How do you determine which switch controls which bulb? Explain the steps and reasoning.

Grok 4’s style

Grok 4 gave a very detailed explanation. It carefully walked through the setup, the sequence of actions, and the reasoning behind each one. It also explicitly framed the solution around three distinguishable states: lit, off-but-warm, and off-and-cool.

That made the answer beginner-friendly. Someone seeing this puzzle for the first time would probably have little trouble following the logic.

GPT-5’s style

GPT-5 gave the same correct solution, but in a tighter and more practical format. It laid out the procedure clearly, then added realistic advice such as touching the bulb carefully, using the back of the hand, and noting that LED or CFL bulbs may not heat up as obviously.

That last detail matters: it shows attention not just to the textbook answer, but to how the problem behaves in real-world conditions.
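The three-state reasoning can be checked with a small simulation. This is only a sketch of the standard solution, not either model's output: it assumes incandescent bulbs that stay warm after being switched off (the very assumption GPT-5's LED caveat questions), and the switch labels A, B, C and the function names are made up for illustration. The procedure modeled is: leave switch A on for a few minutes, turn it off, turn B on, then enter the room once.

```python
import itertools

SWITCHES = ("A", "B", "C")
BULBS = (1, 2, 3)

def observe(wiring, warmed, left_on):
    """Simulate the single visit: map each bulb to the state you would find it in."""
    states = {}
    for switch, bulb in wiring.items():
        if switch == left_on:
            states[bulb] = "lit"
        elif switch == warmed:
            states[bulb] = "off-but-warm"   # was on for a while, then switched off
        else:
            states[bulb] = "off-and-cool"   # never switched on
    return states

def deduce(states, warmed, left_on, untouched):
    """Invert the observation: each state points back to exactly one switch."""
    bulb_in_state = {state: bulb for bulb, state in states.items()}
    return {
        left_on: bulb_in_state["lit"],
        warmed: bulb_in_state["off-but-warm"],
        untouched: bulb_in_state["off-and-cool"],
    }

def works_for_every_wiring():
    """Try all six possible switch-to-bulb wirings."""
    for bulbs in itertools.permutations(BULBS):
        wiring = dict(zip(SWITCHES, bulbs))
        seen = observe(wiring, warmed="A", left_on="B")
        if deduce(seen, warmed="A", left_on="B", untouched="C") != wiring:
            return False
    return True

print(works_for_every_wiring())
```

The trick works because one visit yields three distinguishable observations, exactly enough to separate three switches. With bulbs that do not warm up noticeably, "off-but-warm" and "off-and-cool" collapse into one state and the deduction no longer goes through.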

How the judges scored it

DeepSeek R1 gave:

  • Grok 4: 4.0/5
  • GPT-5: 5.0/5

It praised Grok 4’s logical clarity but thought GPT-5 was sharper, clearer, and better grounded in practical concerns.

Gemini 2.5 Pro gave:

  • Grok 4: 9.5/10
  • GPT-5: 10/10

Gemini saw Grok 4 as almost flawless in explanation, but gave GPT-5 the edge because the answer was equally correct while being more concise and more aware of real hardware differences.

My impression

Both models handled the puzzle correctly. The difference is in presentation: Grok 4 teaches carefully; GPT-5 explains efficiently and with stronger user-facing polish.


3) Python weather script: public API, JSON parsing, formatted output

Prompt

Write a Python script that:

  1. fetches current weather data for Beijing from a public weather API,
  2. parses the returned JSON to extract temperature, weather description, and humidity,
  3. formats the result as a sentence such as: “Beijing current weather: 25°C, sunny, humidity 60%.”

Include appropriate comments in the code.

Grok 4’s answer

Grok 4 chose wttr.in, which has an obvious advantage: it does not require an API key, so the script can be run immediately.

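# Import the required libraries
import requests  # for sending HTTP requests
import json      # for parsing JSON data

# Define a function that fetches and processes the weather data
def get_beijing_weather():
    # Step 1: fetch Beijing's current weather from a public weather API
    # Uses the wttr.in API, a public weather service that needs no API key
    url = "http://wttr.in/Beijing?format=j1"  # format=j1 requests JSON output
    response = requests.get(url)  # send the GET request

    # Check whether the request succeeded
    if response.status_code != 200:
        # Message reads: "Could not fetch weather data. HTTP status code:"
        print("无法获取天气数据。HTTP状态码:", response.status_code)
        return

    # Step 2: parse the returned JSON data
    data = json.loads(response.text)  # parse the response text into a JSON object

    # Extract the current conditions (first element of the current_condition array)
    current = data['current_condition'][0]
    temperature = current['temp_C']  # current temperature (Celsius)
    description = current['weatherDesc'][0]['value']  # weather description (in English)
    humidity = current['humidity']  # humidity (percent)

    # Step 3: format the information as one sentence
    # Note: the description may be in English; it could be translated to Chinese, but is left as-is here
    # Sentence reads: "Beijing current weather: {temperature}°C, {description}, humidity {humidity}%."
    weather_sentence = f"北京当前天气:{temperature}°C,{description},湿度{humidity}%。"

    # Print the result
    print(weather_sentence)

# Call the function to run the task
get_beijing_weather()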

The upside is convenience. The downside is that the weather description may come back in English, which is slightly awkward for a Beijing-focused output. It also included only basic error handling.
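To make the "basic error handling" point concrete: the script checks the HTTP status code, but it would still raise if the JSON lacked one of the expected fields. A minimal sketch of a more defensive extraction step might look like this. The function name `extract_current` and the `sample` payload are hypothetical; the payload only mirrors the fields the script above reads and is not a captured wttr.in response.

```python
def extract_current(data):
    """Return (temp_C, description, humidity) from a wttr.in-style dict, or None if malformed."""
    try:
        current = data["current_condition"][0]
        return (
            current["temp_C"],
            current["weatherDesc"][0]["value"],
            current["humidity"],
        )
    except (KeyError, IndexError, TypeError):
        # A missing key, an empty list, or a non-dict payload all end up here.
        return None

# Hypothetical payload containing only the fields the script reads.
sample = {
    "current_condition": [
        {"temp_C": "25", "humidity": "60", "weatherDesc": [{"value": "Sunny"}]}
    ]
}

print(extract_current(sample))
print(extract_current({"current_condition": []}))
```

Returning `None` keeps the decision about what to print with the caller, which is one reasonable design among several; raising a custom exception would work just as well.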

GPT-5’s answer

GPT-5 chose OpenWeatherMap, a more mainstream service, but one that requires an API key. In exchange, it offered better engineering habits: environment variables for the key, raise_for_status(), exception handling, and a lang=zh_cn setting for localized weather descriptions.

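import os
import requests

# 1. Configuration: store your API key in an environment variable (replace YOUR_API_KEY),
#    or assign it directly to api_key
api_key = os.getenv('OPENWEATHERMAP_API_KEY', 'YOUR_API_KEY')

# 2. Set the city and the API endpoint for the request
city_name = "Beijing"
country_code = "CN"
# Call the current-weather endpoint by city name (JSON format, metric units, Chinese language)
url = (
    f"https://api.openweathermap.org/data/2.5/weather?"
    f"q={city_name},{country_code}&appid={api_key}&units=metric&lang=zh_cn"
)

def get_beijing_weather():
    """
    Fetch Beijing's current weather and return the parsed temperature, description, and humidity
    """
    response = requests.get(url)
    # Raise an exception if the status code is not 200
    response.raise_for_status()
    data = response.json()
    # Extract temperature, weather description, and humidity
    # Temperature: main.temp; description: weather[0].description; humidity: main.humidity
    temp = data['main']['temp']            # current temperature (Celsius)
    description = data['weather'][0]['description']  # weather description (in Chinese)
    humidity = data['main']['humidity']    # humidity (percent)
    return temp, description, humidity

def main():
    try:
        temp, description, humidity = get_beijing_weather()
        # Print the formatted sentence: "Beijing current weather: {temp}°C, {description}, humidity {humidity}%."
        print(f"北京当前天气:{temp}°C,{description},湿度{humidity}%。")
    except Exception as e:
        # Message reads: "Error while fetching weather data:"
        print("获取天气数据时出错:", e)

if __name__ == "__main__":
    main()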

It also explained the design choices in detail, including API key storage and possible extensions.
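One detail worth noting in that script: if the environment variable is not set, `os.getenv` falls back to the literal string 'YOUR_API_KEY', so the request is still sent and most likely fails with an authentication error that only surfaces through the generic `except`. A possible refinement, sketched here as an assumption rather than anything GPT-5 produced, is to fail fast before any request goes out. The helper names `load_api_key` and `build_params` are hypothetical.

```python
import os

PLACEHOLDER = "YOUR_API_KEY"  # the fallback string used in the script above

def load_api_key(env=None):
    """Return the key from the environment, or raise instead of sending a placeholder."""
    env = os.environ if env is None else env
    key = env.get("OPENWEATHERMAP_API_KEY", "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError("Set OPENWEATHERMAP_API_KEY before running the script.")
    return key

def build_params(api_key, city="Beijing", country="CN"):
    """Query parameters matching the URL built in the script above."""
    return {
        "q": f"{city},{country}",
        "appid": api_key,
        "units": "metric",
        "lang": "zh_cn",
    }
```

These could then be passed to `requests.get("https://api.openweathermap.org/data/2.5/weather", params=build_params(load_api_key()), timeout=10)`. Using `params=` lets requests handle URL encoding, and the `timeout` (absent from both models' scripts) keeps a stalled connection from hanging the program indefinitely.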

How the judges scored it

DeepSeek R1 gave:

  • Grok 4: 3.5/5
  • GPT-5: 5.0/5

It preferred GPT-5’s robustness, mainstream API choice, and error handling.

Gemini 2.5 Pro gave:

  • Grok 4: 8.5/10
  • GPT-5: 9.5/10

Gemini gave Grok 4 credit for picking a keyless API that users can run instantly, but still scored GPT-5 higher for better coding practice and better localization.

My impression

This is one of the clearer GPT-5 wins. Grok 4 was more immediately runnable, which is genuinely user-friendly, but GPT-5 produced code that feels more production-aware.


4) Explaining the Dunning–Kruger effect in plain language

Prompt

Explain the Dunning–Kruger effect clearly for someone with no psychology background, and give a specific everyday example.

Grok 4’s answer

Grok 4 gave a fuller explanation. It corrected the term order, named the psychologists involved, explained the cognitive bias behind it, and described both sides of the pattern: lower-skill people overestimating themselves, and higher-skill people underestimating their relative advantage. It then used a driving example contrasting an overconfident beginner with an experienced, cautious driver.

GPT-5’s answer

GPT-5 was more compressed and easier to read at first glance. It reduced the concept to its core idea and used badminton as the example. The language was very plain and the structure was clean.

How the judges scored it

DeepSeek R1 gave:

  • Grok 4: 4.5/5
  • GPT-5: 5.0/5

It thought Grok 4 was accurate and vivid, but slightly more academic in phrasing.

Gemini 2.5 Pro gave:

  • Grok 4: 9.5/10
  • GPT-5: 10/10

Gemini valued GPT-5’s ability to summarize the core point in an especially digestible way, while still using a relatable example.

My impression

This was another case where Grok 4 gave the more complete classroom explanation, while GPT-5 gave the more conversational one. If the target is absolute accessibility, GPT-5 had the edge.


5) AI-generated “new works” in the style of a deceased artist

Prompt

A tech company develops an advanced AI capable of creating paintings indistinguishable from the style of a famous deceased artist. The company auctions these as the artist’s “new works.” Analyze the benefits, risks, and controversies from legal, ethical, and artistic-value perspectives.

Grok 4’s answer

Grok 4 was especially strong here. It explicitly separated each angle into benefits, downsides, and points of controversy, which made the discussion feel systematic without becoming too shallow.

Legally, it brought up issues such as copyright, derivative-work risk, consumer deception, authorship, and whether AI-generated outputs can be protected. Ethically, it focused on consent, compensation, exploitation of artistic legacy, and whether such use amounts to commercializing someone who can no longer approve or reject the work. On artistic value, it framed the issue as a challenge to traditional definitions of art, originality, emotional depth, and market dilution.

GPT-5’s answer

GPT-5 was also strong and very organized. It covered copyright duration, name and publicity issues, false advertising, cross-border risk, artistic legacy, and the authenticity problem. It ended with a practical recommendation: clearly label the work as AI-generated and present it as homage rather than as a newly discovered or newly created original by the deceased artist.

How the judges scored it

DeepSeek R1 gave:

  • Grok 4: 4.0/5
  • GPT-5: 5.0/5

It favored GPT-5 for sharper structure and a clearer emphasis on the central disputes.

Gemini 2.5 Pro gave:

  • Grok 4: 10/10
  • GPT-5: 9.5/10

Gemini leaned the other way, arguing that Grok 4’s separation of controversy from simple pros and cons made its analysis especially deep and precise.

My impression

This was one of Grok 4’s best showings. When the task rewards structured argument and explicit mapping of tensions, Grok 4 feels very powerful. GPT-5 remained excellent, but here Grok 4 looked more naturally suited to the assignment.


6) Turning quantum entanglement into one plain sentence

Prompt

Summarize a formal description of quantum entanglement into one simple, easy-to-understand sentence.

What happened

Grok 4 produced a compressed version of the original explanation, but it still retained a fairly technical flavor.

GPT-5 instead used a much more intuitive analogy: entangled particles behave as if they share a kind of silent mutual awareness, so that no matter how far apart they are, what happens to one is instantly reflected in the other.

How the judges scored it

DeepSeek R1 gave:

  • Grok 4: 4.0/5
  • GPT-5: 5.0/5

Gemini 2.5 Pro gave:

  • Grok 4: 6.0/10
  • GPT-5: 10/10

Both judges basically agreed on the same point: GPT-5 understood the phrase “easy to understand” more deeply. Grok 4 simplified the wording; GPT-5 translated the concept.

My impression

This test highlights one of GPT-5’s strongest qualities: it often excels at turning specialist language into everyday language without losing the core meaning.


7) Solving the “last mile” urban commute problem

Prompt

As a city planner, propose three innovative and feasible solutions to the common “last mile” problem in large cities: getting from a subway or bus stop to home or work. Briefly explain the pros and cons of each.

Grok 4’s answer

Grok 4 proposed:

  1. an AI-driven shared micromobility network,
  2. small autonomous shuttle buses,
  3. upgraded walkable and bike-friendly infrastructure paired with smart route guidance.

The proposals were practical and broad in coverage, and the pros and cons were realistic. It paid attention to issues like weather, battery infrastructure, elderly and disabled users, regulatory burdens, and construction disruption.

GPT-5’s answer

GPT-5 proposed:

  1. driverless feeder shuttles,
  2. shared electric micromobility hubs,
  3. smart walking corridors with moving walkways.

The third idea, the moving walkway solution, gave GPT-5’s set a more distinctive flavor. It still stayed within plausible urban design, especially for high-footfall, fixed commuter corridors.

How the judges scored it

DeepSeek R1 gave:

  • Grok 4: 4.5/5
  • GPT-5: 5.0/5

It felt GPT-5’s proposals were slightly more innovative and concrete.

Gemini 2.5 Pro gave:

  • Grok 4: 9.0/10
  • GPT-5: 9.0/10

Gemini saw both as equally strong overall.

My impression

This category was close. Both models produced feasible solutions rather than flashy nonsense. Grok 4’s version felt broader and planning-oriented; GPT-5’s version felt slightly more productized and presentation-ready.


8) Role-play: the impatient pizza shop customer service rep

Prompt

Act as an extremely impatient pizza shop customer service rep. In no more than three sentences, answer a customer asking whether the Hawaiian pizza can be made without pineapple. The reply must refuse the request while also making the question feel foolish.

Result

This was the category where Grok 4 clearly shone.

Its reply essentially hit every target at once: it refused directly, used a rhetorical question to make the request sound silly, and pushed the annoyance level all the way up.

GPT-5 also refused, but the tone was more cold than explosively impatient. It got the job done, yet Grok 4 captured the requested personality more precisely.

How the judges scored it

DeepSeek R1 gave:

  • Grok 4: 5.0/5
  • GPT-5: 4.0/5

Gemini 2.5 Pro gave:

  • Grok 4: 10/10
  • GPT-5: 9.0/10

My impression

This is a good reminder that “better” depends heavily on the task. On role-play that demands strong attitude, Grok 4 can be extremely accurate and punchy.


Overall take: where each model feels strongest

After running through these tests, the picture becomes clearer.
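Because the two judges score on different scales, a rough tally helps before getting into impressions. The sketch below uses only the scores reported in the eight sections above and normalizes each judge's scale to a 0 to 1 range. The unweighted mean and the win count are my own simplification, not a method either judge used, and eight prompts is far too small a sample to treat the numbers as more than a summary.

```python
# (scale, [(Grok 4 score, GPT-5 score) for tests 1-8]) as reported in this article
REPORTED = {
    "DeepSeek R1": (5, [
        (4.5, 4.0), (4.0, 5.0), (3.5, 5.0), (4.5, 5.0),
        (4.0, 5.0), (4.0, 5.0), (4.5, 5.0), (5.0, 4.0),
    ]),
    "Gemini 2.5 Pro": (10, [
        (8.5, 9.5), (9.5, 10.0), (8.5, 9.5), (9.5, 10.0),
        (10.0, 9.5), (6.0, 10.0), (9.0, 9.0), (10.0, 9.0),
    ]),
}

def summarize(scale, pairs):
    """Normalized mean score per model plus a per-test win/tie count for one judge."""
    n = len(pairs)
    return {
        "grok_mean": sum(g for g, _ in pairs) / (n * scale),
        "gpt_mean": sum(p for _, p in pairs) / (n * scale),
        "grok_wins": sum(g > p for g, p in pairs),
        "gpt_wins": sum(p > g for g, p in pairs),
        "ties": sum(g == p for g, p in pairs),
    }

for judge, (scale, pairs) in REPORTED.items():
    s = summarize(scale, pairs)
    print(
        f"{judge}: Grok 4 {s['grok_mean']:.3f}, GPT-5 {s['gpt_mean']:.3f}, "
        f"wins {s['grok_wins']}-{s['gpt_wins']}, ties {s['ties']}"
    )
```

On these numbers, DeepSeek R1's normalized means are 0.85 for Grok 4 and 0.95 for GPT-5, and Gemini 2.5 Pro's are 0.8875 and 0.95625. Each judge preferred Grok 4 on two of the eight prompts, though they agreed on only one of them (the role-play test).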

Where GPT-5 feels stronger

GPT-5 performed especially well in:

  • user-friendly explanations
  • elegant creative writing
  • structured, practical coding output
  • concise reasoning with useful real-world notes
  • simplifying difficult concepts into plain language

A recurring advantage is that GPT-5 often doesn’t just answer the prompt—it answers it in a way that feels edited for readability and usefulness. It frequently adds small touches that improve the user experience: a safety reminder, a realistic edge case, a cleaner framing, a more vivid analogy, or a practical suggestion for next steps.

In creative work, it often sounds more mature and restrained. In technical work, it is often more robust and polished. It also appears better at reducing awkwardness and making the output feel naturally tailored to the person reading it.

Where Grok 4 feels stronger

Grok 4 is far from weak. In some situations, it is excellent.

It stood out in:

  • strongly characterized role-play
  • explicit and highly structured argumentation
  • detailed walkthrough-style reasoning
  • responses that benefit from directly surfacing every requested element

Its style is often more forceful and direct. Sometimes that makes it feel less refined than GPT-5, but in other cases it is exactly the right tool for the job. When a prompt asks for firmness, confrontation, or a visible framework of analysis, Grok 4 can be extremely effective.

There is also still more to explore with its heavier multi-agent mode, which was not tested here. That leaves open the possibility that Grok 4 may show even more strength in deep research or harder multi-step problem solving under different settings.


So which one should you choose?

If what you want is a model that is consistently strong, polished, readable, and practical across many different kinds of tasks, GPT-5 still looks like the safer first choice.

If you care more about directness, strong personality, sharper role-play, or a more hard-edged style of structured reasoning, Grok 4 can be very compelling—especially now that free access lowers the barrier to trying it.

That may be the biggest takeaway. Grok 4 becoming free is not just a pricing change. It makes the high-end model race more interesting for ordinary users, because comparison no longer lives only in screenshots and secondhand opinions. More people can test both and decide based on actual use.

At this point, the competition is no longer just about raw capability on paper. It is also about positioning, interaction style, user experience, and what kind of thinking each model is best at supporting. For users, that is a good thing: there are now more real choices, and those choices are becoming easier to evaluate firsthand.