⚡🔋xAI Accused of Misleading Grok 3 Benchmarks
And more: 500 NIST Layoffs Could Cripple U.S. AI Safety—Here’s Why!; Meet Neo Gamma: Can This Humanoid Robot Run Your Home?
WHAT’S AT STAKE TODAY ⚡
- 🔍🤔 xAI's Grok 3 stats under scrutiny. Elon's benchmarks getting fact-checked!
- 💸⚠️ US AI Safety Institute facing budget axe. Safety first? Maybe not!
- 🐙✨ Octopus.do wants your attention. Another AI tool joins the party!
- 🎬💰 Google's Veo 2 video model costs 50¢ per second. AI movies getting pricey!
- 🏭🇺🇸 Apple pledges $500B for US manufacturing. Houston gets AI server farm!
- 🇦🇪🤖 Meta AI expands to Middle East and Africa. Arabic speakers get AI love!
- 📊💼 Patlytics scores $14M for patent platform. IP analytics gets funded!
- 🔍🌐 Perplexity teases Comet web browser. Search engine wants more screen time!
- 🧠⏳ Anthropic launches unlimited thinking model. AI ponders as long as you want!
⚡ Latest in AI
Did xAI lie about Grok 3’s benchmarks?

A significant controversy has erupted in the AI community over benchmark reporting practices, with Elon Musk's xAI facing accusations of publishing misleading performance data for its latest AI model, Grok 3. The dispute highlights the ongoing challenges of transparency and standardization in AI benchmarking, an issue that continues to plague the industry as companies compete for technical supremacy.
The controversy began when xAI published a graph on its blog showcasing Grok 3's performance on AIME 2025, a collection of challenging mathematics problems derived from a recent invitational mathematics examination. In the graph, two variants of Grok 3—Grok 3 Reasoning Beta and Grok 3 mini Reasoning—appeared to outperform OpenAI's best available model, o3-mini-high, bolstering xAI's marketing claim that Grok 3 is the "world's smartest AI."
However, OpenAI employees quickly challenged the presentation on social media, pointing out that xAI's graph omitted a crucial metric: o3-mini-high's score at "cons@64." Short for "consensus@64," this metric gives a model 64 attempts at each problem and takes the most frequent answer as its final answer. It typically yields higher scores than a single-attempt evaluation.
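To see why mixing the two metrics can mislead, here is a minimal Python sketch of how a first-attempt score and a consensus score might be computed from the same set of sampled answers. The function names (`pass_at_1`, `cons_at_k`) and the toy data are assumptions made up for illustration. They are not xAI's or OpenAI's actual evaluation code, and labs may compute "@1" differently (for example, by averaging over several samples).

```python
from collections import Counter


def pass_at_1(samples_per_problem, answers):
    """Fraction of problems whose FIRST sampled answer is correct."""
    correct = sum(samples[0] == answer
                  for samples, answer in zip(samples_per_problem, answers))
    return correct / len(answers)


def cons_at_k(samples_per_problem, answers, k=64):
    """Fraction of problems where the most frequent of the first k samples is correct."""
    correct = 0
    for samples, answer in zip(samples_per_problem, answers):
        # Majority vote over the first k attempts at this problem.
        majority, _count = Counter(samples[:k]).most_common(1)[0]
        correct += majority == answer
    return correct / len(answers)


# Hypothetical toy data: two problems, 64 sampled answers each.
answers = [42, 7]
samples = [
    [41] + [42] * 40 + [41] * 23,  # first attempt wrong, majority (42) right
    [7] * 64,                      # every attempt right
]

print(pass_at_1(samples, answers))        # 0.5
print(cons_at_k(samples, answers, k=64))  # 1.0
```

In this toy case the same model scores 50% on its first attempt and 100% at cons@64, which is why the two numbers are not directly comparable. The consensus score also takes 64 times as many samples per problem to produce.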
When comparing the models using the same methodology, a different picture emerges. Grok 3 Reasoning Beta and Grok 3 mini Reasoning's scores at "@1"—representing their first attempt at solving problems—actually fall below o3-mini-high's performance. Furthermore, Grok 3 Reasoning Beta appears to perform marginally worse than OpenAI's o1 model when set to "medium" computing parameters.
Igor Babuschkin, one of xAI's co-founders, defended the company's benchmark reporting, arguing that OpenAI has similarly published selective benchmark comparisons in the past, particularly when showcasing performance differences between its own models. This defense highlights the lack of standardization in how AI labs report benchmarking results, allowing companies to selectively emphasize metrics that place their models in the most favorable light.
A more neutral analysis from an independent researcher provided a comprehensive graph showing nearly all models' performance at cons@64, offering a more balanced comparison. This third-party analysis demonstrates the value of independent verification in an industry where companies have clear incentives to present their technologies in the most favorable terms possible.
The benchmark controversy extends beyond simple performance rankings. AI researcher Nathan Lambert raised a critical point often missing from these discussions: the computational and monetary costs required for each model to achieve its optimal performance. This missing context is crucial for meaningful comparisons, as a model that achieves slightly better results while consuming vastly more computational resources may not represent a genuine advancement in efficiency or capability.
This dispute illustrates several fundamental challenges facing the AI industry. First, the lack of standardized reporting practices allows companies to selectively highlight metrics that favor their products. Second, commonly used benchmarks like AIME 2025 may not comprehensively capture a model's real-world capabilities or limitations. Some experts have questioned AIME's validity as an AI benchmark, though it remains widely used to evaluate mathematical reasoning.
The controversy also reflects the intense competitive pressure in the AI industry, where companies vie for technical leadership positions that can translate into market dominance, investment opportunities, and talent acquisition advantages. For xAI, positioning Grok 3 as superior to competitors' offerings is crucial to its business strategy and public narrative, particularly as it competes with well-established rivals like OpenAI.
From a consumer perspective, these technical disputes highlight the challenge of making informed decisions about AI products. When companies can selectively report benchmarks and make broad claims about their models being the "smartest" or "most capable," users have limited ability to verify these assertions or understand their practical implications.
The broader AI community has long called for more transparent and standardized benchmarking practices. Some researchers advocate for the establishment of independent benchmarking authorities that would evaluate models using consistent methodologies and report comprehensive results. Others emphasize the need for benchmarks that better reflect real-world use cases rather than abstract problem-solving.
As AI models become increasingly powerful and integrated into critical systems, the importance of accurate performance reporting grows. Misleading benchmarks can create unrealistic expectations about AI capabilities, potentially leading to inappropriate deployment decisions or misaligned development priorities.
Looking forward, this controversy may accelerate efforts to develop more rigorous benchmarking standards within the AI community. It also serves as a reminder for users and investors to approach performance claims with healthy skepticism, recognizing that benchmark results represent just one dimension of a model's overall value and capability.
Why it matters: For now, the debate between xAI and OpenAI demonstrates that even as AI technology advances rapidly, the challenge of measuring and communicating these advances remains fundamentally human—subject to competitive pressures, selective presentation, and the ongoing need for greater transparency in an increasingly important technological field.
Engage Prospects at the Perfect Moment With Our AI BDR
A poorly timed outbound message is a wasted message. Ava tracks your prospects in real-time and waits for them to trigger an intent signal before automatically sending them a personalized email or LinkedIn message.
Hire Ava, who automates your entire outbound demand generation process, including:
Intent-Driven Lead Discovery Across Dozens of Sources
High Quality Emails with Human-Level Personalization
Follow-Up Management
Email Deliverability Management
⚡ The companies of the future
US AI Safety Institute Could Face Big Cuts

Reports indicate the National Institute of Standards and Technology (NIST) may lay off up to 500 staff members, severely impacting the U.S. Artificial Intelligence Safety Institute (AISI).
The cuts would primarily affect probationary employees, with some already receiving verbal notices. The AISI, created under President Biden's AI safety executive order, was already facing an uncertain future after Trump repealed the order on his first day in office and its director departed earlier this month.
Why it matters: AI safety organizations have criticized the potential cuts, warning they would significantly reduce the government's ability to address critical AI safety concerns.
⚡ The AI Edge: Smart Solutions for Business Growth
What Is Octopus.do Used For?

Octopus.do is a tool designed for website planning. It allows users to create visual sitemaps, wireframes, and project plans with a simple drag-and-drop interface.
Web designers, developers, and project managers use it to structure websites before development, making it easier to organize content, plan user journeys, and estimate project costs. The tool also supports real-time collaboration, helping teams stay aligned during the planning process.
With built-in SEO features and clear project timelines, Octopus.do streamlines the early stages of web development, ensuring projects start with a solid foundation.
⚡ More AI Bites
- 🎬 💰 Google's new AI video model Veo 2 will cost 50 cents per second.
- 🍎 🏭 Apple commits $500B to US manufacturing, including a new AI server facility in Houston.
- 🌍 🗣️ Meta AI arrives in the Middle East and Africa with support for Arabic.
- 📝 💡 Patlytics raises $14M for its patent analytics platform.
- 🔍 🌐 Perplexity teases a web browser called Comet.
- 🧠 ⏱️ Anthropic launches a new AI model that 'thinks' as long as you want.
⚡ Trends for the Future
Norwegian Firm 1X Unveils Neo Gamma Humanoid Robot for Home Use

The details:
Norwegian robotics company 1X has revealed its latest prototype, Neo Gamma, a humanoid robot specifically designed for the home environment. The new model, which succeeds the Neo Beta launched in August, represents a distinct approach in the rapidly evolving humanoid robotics market by prioritizing home applications over industrial use.
Images released with the announcement show the robot performing various household tasks, including brewing coffee, doing laundry, and vacuuming. While 1X plans to begin limited in-home testing, the company emphasizes that Neo Gamma remains far from commercial scaling and widespread deployment.
The robot features a deliberately approachable design, with a knitted nylon exterior that aims to mitigate potential injuries from human-robot contact. This focus on safety and user-friendly aesthetics contrasts with the industrial appearance of many competing humanoid systems.
Neo Gamma enters a competitive field that includes humanoid robots from companies such as Agility, Apptronik, Boston Dynamics, Figure, and Tesla. However, while these competitors have prioritized warehouse and factory applications, 1X's home-first strategy sets it apart in the market.
The home robotics sector has historically proven challenging. Despite decades of technological advancement, few robots beyond vacuum cleaners have achieved meaningful market penetration. Successful home robots must meet exceptionally high standards for usefulness, reliability, affordability, and safety.
1X highlights advancements in Neo Gamma's onboard AI systems as critical to achieving necessary safety levels for home operation. These systems must maintain acute environmental awareness to prevent harm to people and property. The company also acknowledges the importance of teleoperation capabilities, allowing human operators to take control when needed, particularly during the development phase before full autonomy is achieved.
The company first gained significant industry attention when OpenAI became an early investor, signaling growing interest in embodied intelligence as a natural extension of generative AI. OpenAI has since diversified its humanoid robotics investments, backing competitor Figure while reportedly developing its own robotics initiatives.
Like other advanced humanoid developers, 1X is building proprietary AI models to enhance the robot's speech capabilities and body language. The company's January acquisition of Bay Area startup Kind Humanoid likely contributed to these developments, though specific technological contributions remain undisclosed.
What makes this crucial: While the product videos accompanying the launch demonstrate the potential for humanoid robots in home settings, they represent a vision rather than current capabilities. Despite promising advances in industrial deployments, humanoid robots face substantial challenges in pricing, reliability, safety, and functionality before home adoption becomes practical.

Do you have a business problem keeping you up at night?
Here’s your chance to get it solved! Share your most staggering challenges with us, and I’ll use the power of AI to find solutions tailored just for you. I’ll feature the answers in one of our upcoming Supercharged issues—let’s tackle it together!

AI is not just about creating autonomous systems, it's about understanding the principles that underlie all forms of intelligence. Each breakthrough gives us new insights into both artificial and natural cognitive processes.
Doina Precup is an associate professor at McGill University and head of the Montreal office of DeepMind.
📝 AI Playground: Enhance Your Writing 📝
🔧 This Week’s Tool: Wordtune 🔧
Overview: Wordtune is an AI-powered writing companion developed by AI21 Labs, designed to assist users in crafting clear and compelling content. ✍️ Whether you're drafting emails, articles, or social media posts, Wordtune offers real-time suggestions to enhance your writing's tone, clarity, and style. 🚀
Why Is It Better Than Other Tools? ✨
- ⚡ Contextual Understanding: Wordtune comprehends the context and meaning of your text, providing suggestions that align with your intended message.
- 🤖 Versatility: Works across platforms like Gmail, Google Docs, Facebook, Twitter, and LinkedIn for seamless writing assistance.
- 📈 Real-Time Suggestions: Instantly suggests alternative phrasing to improve clarity and engagement.
What Does It Do Best? 🌟
- 📝 Paraphrasing: Provides alternative expressions to make your writing clearer and more impactful.
- 💡 Tone Adjustment: Adapts suggestions to fit your desired tone, whether casual or formal.
- ✏️ Grammar and Spelling: Corrects errors while ensuring a smooth writing flow.
Applications 💼:
- 📧 Email Writing: Draft professional and concise emails effortlessly.
- 🖋️ Content Creation: Enhance blogs, articles, and reports with clearer, more engaging language.
- 📱 Social Media Posts: Craft compelling posts that resonate with your audience.
- 📚 Academic Writing: Refine essays and papers for better readability and coherence.
- 💼 Business Documentation: Improve proposals, reports, and presentations with clear, effective language.
Follow This Simple Guide to Get Started with Wordtune:
- 🌐 Visit the Website: Go to wordtune.com and sign up for a free account.
- 🔗 Add the Extension: Install the Wordtune Chrome extension for seamless writing support.
- 📝 Start Writing: Write your content, and Wordtune will provide real-time suggestions.
- ✨ Apply Suggestions: Choose the phrasing that best fits your tone and message.
- 📈 Refine and Improve: Continue using Wordtune to enhance your writing style and clarity.
Wordtune is your go-to writing assistant, making it easier than ever to communicate effectively and confidently. 🚀
💡 Challenge: Use Wordtune to revise an email, social media post, or paragraph from an article. Share your before-and-after versions by replying to this email for a chance to win a special prize! 🎁 Start enhancing your writing with Wordtune today! ✍️

No more playing catch-up. It's time to GET AHEAD!!! 🚀🚀🚀, Elena
⚡︎🔋 The Supercharged - loved by thousands of readers ❤️🙋♀️
The Supercharged is aiming to be the world's #1 AI business magazine and is on a mission to empower 1,000,000 entrepreneurs worldwide by 2025, guiding them through the transition into the AI-driven creative age. We're dedicated to breaking down complex technologies, sharing actionable insights, and fostering a community that thrives on innovation, to become the ultimate resource for businesses navigating the AI revolution.
The Supercharged is the #1 AI Newsletter for Entrepreneurs, with 25,000+ readers working at the world’s leading startups and enterprises. The Supercharged is free for readers. Main ads are typically sold out 2 weeks in advance. You can book future ad spots here.