
Can NVIDIA’s Jet-Nemotron Fix AI’s Long-Conversation Problem?

A closer look at how smarter attention, not just bigger models, could make AI faster and more useful.


NVIDIA Wants to Solve AI's Speed Problem

Ever notice this?

The longer you chat with your favorite large language model (ChatGPT, Claude, Perplexity, etc.), the slower it is to respond…

For example, ask it to go through a 100-page document and, in all likelihood, you'll watch the model crawl toward some kind of answer.

Why?

Let’s imagine you're at a MASSIVE conference (AWS re:Invent in Las Vegas). Every time someone speaks, they literally turn to every single person in the room and check: “Did you hear me? Do you understand how this connects to what Bob said five minutes ago about Pop Tarts and Wi-Fi?”

That's “full attention,” which is the backbone of modern AI. It works well for accuracy but is wildly inefficient.

Kind of like having only one light switch in your house: whenever you flip it on, every single light in every room blazes on.

See, as your conversation with an LLM gets longer, the model has to attend to more and more context. With full attention, compute grows with the square of the conversation length, the cache of past tokens balloons, processing slows, and eventually your GPU runs out of memory.
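To make this concrete, here's a minimal single-head sketch of full attention in PyTorch. The shapes are illustrative (not anyone's production code), but the thing to watch is the (n, n) score matrix: double the conversation length and you quadruple it.

```python
import torch

n, d = 4096, 64                       # sequence length, head dimension
q, k, v = (torch.randn(n, d) for _ in range(3))

scores = q @ k.T / d**0.5             # (n, n): every token scored against every token
mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))   # causal: no peeking ahead
weights = torch.softmax(scores, dim=-1)
out = weights @ v                     # (n, d): each token's context-aware summary

print(scores.shape)                   # torch.Size([4096, 4096])
```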

This is why it’s so difficult to have a reliable conversation with a 200-page document in ChatGPT.

The math doesn't math, as my father used to say.

Per their recent research, NVIDIA’s solution is called Jet-Nemotron (which sounds like a knock-off Transformer to me, but oh well). It's a new family of models that generates text up to roughly 50 times faster than comparable models, uses far less memory, and stays just as smart. In theory, at least.

[Image: a conference crowd. Caption: AWS Conference in Which Pop Tarts and Wi-Fi are Probably not Actually Discussed]

The Problem: AI Models Are Terrible at Long Conversations

To address this, researchers tried a "linear attention" solution.

Back to our conference analogy real quick: instead of everyone checking with everyone else, people just whisper down the line. Person A tells Person B. Person B tells Person C. And so on. A kind of game of telephone.

This works great for simple messages. "The meeting moved to 3pm." Easy.

But what if the message is complex?

For example: "Remember what Sarah said about the budget in January? Well, that connects to what Mike just mentioned about Q3 projections, which changes everything John proposed earlier."

By the time that nuanced message gets whispered down a 100-person chain, it's basically "something about money changed."

That's linear attention. Lightning fast. Dirt cheap. But it loses the connections that make AI actually simulate intelligence. Yes, you save 50x on processing power, but your AI can't hold complex thoughts together. It becomes great at parroting facts, terrible at reasoning.
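In code, the telephone game looks roughly like this. It's a generic linear-attention sketch using a simple ELU-based feature map (the textbook recipe, not NVIDIA's exact formulation):

```python
import torch

def phi(x):
    # a simple positive feature map; real linear-attention variants
    # (including JetBlock) use fancier machinery
    return torch.nn.functional.elu(x) + 1

n, d = 4096, 64
q, k, v = (torch.randn(n, d) for _ in range(3))

S = torch.zeros(d, d)                 # running summary of everything said so far
z = torch.zeros(d)                    # running normalizer
out = torch.empty(n, d)
for t in range(n):                    # one pass down the telephone line
    S = S + torch.outer(phi(k[t]), v[t])    # fold token t into the summary
    z = z + phi(k[t])
    out[t] = phi(q[t]) @ S / (phi(q[t]) @ z + 1e-6)
```

Per-token cost is constant: no (n, n) matrix, just a fixed (d, d) state. The catch is that the entire past gets squashed into that little summary, which is exactly the whisper-down-the-line information loss.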

And so NVIDIA dug into this further.

Instead of throwing out the whole system, NVIDIA started with a question: Which parts of the AI's "brain" actually matter?

So they ran tests. Lots and lots (and lots) of tests.

And it turns out, AI models are like most teams: a few people do most of the work. When the model answers questions about history, maybe 5-10 out of 100 layers are pulling their weight. When it reads a document, it's a different 5-10 layers doing the heavy lifting.

The other 90 or so? They're not useless, but they're not critical either.
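The search for the star players boils down to: knock out one layer at a time and see how much the score drops. Here's a toy, runnable illustration of that idea (random linear layers stand in for transformer blocks, and a regression loss stands in for benchmark accuracy; NVIDIA's actual procedure is far more involved):

```python
import torch

torch.manual_seed(0)
layers = [torch.nn.Linear(16, 16) for _ in range(8)]   # stand-ins for attention layers
x, target = torch.randn(32, 16), torch.randn(32, 16)

def score(skip=None):
    h = x
    for i, layer in enumerate(layers):
        if i != skip:                 # skipping a layer = replacing it with identity
            h = h + layer(h)
    return -torch.nn.functional.mse_loss(h, target).item()

baseline = score()
importance = {i: baseline - score(skip=i) for i in range(len(layers))}
print(sorted(importance, key=importance.get, reverse=True))  # most important first
```

Layers at the bottom of the ranking barely move the score; those are the average performers you can afford to swap out.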

NVIDIA took this and experimented with it: Keep the star players doing what they do best (the full attention, complex reasoning stuff) but swap out the average performers for something leaner called JetBlock.

JetBlock is linear attention that's learned to be selective—it knows when to focus hard and when to kick back with its feet up—and is the building block that makes Jet-Nemotron work.

The Jet-Nemotron models are what you get when you combine these JetBlocks with the original high-performing full-attention layers.

Essentially, same intelligence for way less overhead.
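Structurally, the recipe looks something like this. The layer counts and indices are made up, and CheapLinearBlock just reuses the generic linear-attention math from earlier (JetBlock's actual internals are more sophisticated):

```python
import torch

class CheapLinearBlock(torch.nn.Module):
    """Stand-in for a JetBlock-style linear-attention layer (illustrative)."""
    def __init__(self, d):
        super().__init__()
        self.qkv = torch.nn.Linear(d, 3 * d)

    def forward(self, x):                          # x: (n, d)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        pq = torch.nn.functional.elu(q) + 1
        pk = torch.nn.functional.elu(k) + 1
        kv = torch.cumsum(pk.unsqueeze(-1) * v.unsqueeze(-2), dim=0)  # running summaries
        z = torch.cumsum(pk, dim=0)
        num = torch.einsum("nd,nde->ne", pq, kv)
        den = torch.einsum("nd,nd->n", pq, z).unsqueeze(-1) + 1e-6
        return x + num / den                       # residual connection

d, heads, num_layers = 64, 4, 12
star_players = {2, 7}                              # made-up indices; in reality the search decides
blocks = torch.nn.ModuleList(
    torch.nn.MultiheadAttention(d, heads) if i in star_players else CheapLinearBlock(d)
    for i in range(num_layers)
)
```

A real model would need a unified forward pass over these blocks; the point here is simply which layer goes where.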

The Results Are Kind of Ridiculous

Jet-Nemotron processes text 47x faster than the original model on long documents.

At really long contexts (256K tokens), it's 53x faster.

Memory usage? It drops from gigabytes to megabytes. You can now run serious AI on modest hardware.
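That one is easy to sanity-check with back-of-the-envelope arithmetic. The shapes below are my assumptions for a generic 2-billion-parameter model, not Jet-Nemotron's published configuration:

```python
layers, kv_heads, head_dim = 28, 8, 128   # assumed shapes for a ~2B model
bytes_per = 2                             # fp16
ctx = 256_000                             # tokens of conversation

# Full attention caches keys and values for every past token, so it grows with ctx.
kv_cache = 2 * layers * kv_heads * head_dim * ctx * bytes_per
# Linear attention keeps a fixed (head_dim x head_dim) state per head instead.
state = layers * kv_heads * head_dim * head_dim * bytes_per

print(f"full-attention KV cache: {kv_cache / 1e9:.1f} GB")   # ~29.4 GB
print(f"linear-attention state:  {state / 1e6:.1f} MB")      # ~7.3 MB
```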

But here's the kicker: it's just as smart. Sometimes smarter!

Their 2-billion-parameter model keeps pace with other models like Qwen3 and Gemma3. The 4-billion-parameter version beats them outright on many benchmarks.


Why You Should Care

This matters for three big reasons:

Your AI tools will get way better. Imagine a ChatGPT that can analyze entire books without slowing down, and research assistants that work with massive datasets in real time. That's coming.

Smaller players can compete. You don't need Google's budget to build serious AI anymore. This levels the playing field for startups and researchers, so to speak.

AI gets cheaper to run. Lower costs mean more access. More experimentation. More innovation reaching regular people instead of just tech giants.

The Smart Part: PostNAS (The Neural Architecture Search, not the Rapper)

Here's the really clever bit. Usually, figuring out which layers to keep requires rebuilding the entire model from scratch. That costs millions of dollars and months of compute time.

NVIDIA found a shortcut. They freeze the parts of the model that already "know" things, then just experiment with different attention patterns.

Don’t rebuild the entire building if all you really need to do is rearrange the furniture. Cost drops from millions to thousands.
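In PyTorch terms, the core trick is just flipping requires_grad. Here's a minimal sketch with hypothetical names (the freeze_except_replacements helper and the "jetblock" naming convention are mine, not NVIDIA's API):

```python
import torch

class Toy(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.mlp = torch.nn.Linear(8, 8)        # pretrained "knowledge": stays frozen
        self.jetblock = torch.nn.Linear(8, 8)   # swapped-in attention: gets trained

def freeze_except_replacements(model, marker="jetblock"):
    for name, param in model.named_parameters():
        param.requires_grad = marker in name    # only the new blocks learn
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"training {trainable / total:.0%} of the weights")

freeze_except_replacements(Toy())               # -> training 50% of the weights
```

Only the replacement blocks get gradients; everything the model already learned stays untouched, which is why the experiments are so cheap.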

The Bottom Line

In this research, NVIDIA shows you don't necessarily need to sacrifice intelligence for speed.

With some surgical precision and smart engineering, you can have both.

