Welcome to the 8th issue of AI Agents Simplified! 🎉
📰 What Happened in the Last Week?
OpenAI just dropped two new reasoning models: o3 and o4-mini. o3 is being called their “most powerful reasoning model yet” and o4-mini is the lightweight sibling that still punches way above its size and cost. Both models now "think" with images, meaning they can actually interpret stuff like whiteboard notes or sketches and even manipulate them (zoom, rotate, etc.) as part of the reasoning process.
Also: all these models now have access to the full ChatGPT toolset: browsing, DALL·E, code interpreter, you name it. Available for ChatGPT Plus, Pro, and Team users.
RIP to o1 and some older mini models, they’re getting phased out.
Wikipedia just said “enough with the scraping”, but in a smart way
Instead of blocking AI devs, Wikimedia teamed up with Kaggle to drop a machine-learning-friendly dataset of Wikipedia in English and French.
This thing is clean: well-structured JSON, short summaries, infoboxes, image links, everything you’d need for training, fine-tuning, or aligning models without hammering their servers 24/7.
It’s openly licensed, made for ML workflows, and way better than scraping raw pages. Shoutout to Kaggle for helping indie devs and smaller companies get access without needing the kind of deals Google has.
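To make the "well-structured JSON" point concrete, here's a minimal sketch of pulling (title, summary) pairs out of a JSON-Lines dump like this one. The field names (`title`, `summary`, `infobox`, `images`) and the sample records are illustrative assumptions, not the dataset's actual schema — check the Kaggle page for the real field names.

```python
import json

# Two made-up records in the spirit of the dataset: one JSON object per
# article, with a title, short summary, infobox fields, and image links.
sample = """\
{"title": "Ada Lovelace", "summary": "English mathematician and writer.", "infobox": {"born": "1815"}, "images": ["ada_lovelace.jpg"]}
{"title": "Alan Turing", "summary": "English mathematician and computer scientist.", "infobox": {"born": "1912"}, "images": []}
"""

def iter_articles(lines):
    """Parse a JSON-Lines dump into article dicts, skipping malformed rows."""
    for line in lines.splitlines():
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate the odd bad row rather than crash a long run

# Build a tiny fine-tuning corpus of (title, summary) pairs.
pairs = [(a["title"], a["summary"]) for a in iter_articles(sample)]
print(pairs[0][0])  # Ada Lovelace
```

Compare this with scraping: no HTML parsing, no rate limits, and the summaries and infoboxes are already separated out for you.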
🔮 Cool AI Tools & Repositories
📃 Paper of the Week
OLMOTRACE is a new system from the Allen Institute for AI that lets you trace any chunk of an LLM’s output back to the exact documents it was trained on, all in real time. It works with multi-trillion-token datasets. Yep, trillions.
Here’s what’s cool:
It’s real-time tracing. The system matches language model outputs to their original training documents in just a few seconds. That includes pre-, mid-, and post-training data.
It’s fully open source. Built on top of Infini-gram, and available on the AI2 Playground, so you can try it yourself.
Verbatim match only. OLMOTRACE shows exact matches between a model’s output and its training data.
Smart filters for relevance. It filters for long, unique spans with low token probability (aka not generic filler) and ranks source docs by BM25 relevance, color-coded by confidence level.
It’s fast. They benchmarked average trace time at 4.5 seconds for ~450-token outputs using a heavily parallelized setup and smart I/O optimizations.
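The two core ideas above — verbatim span matching plus BM25 ranking of the source docs — can be sketched in a few dozen lines. This is emphatically *not* AI2's implementation (they run Infini-gram over multi-trillion-token corpora); it's a toy illustration with a made-up three-document "corpus", and every name in it is hypothetical.

```python
import math
from collections import Counter

# Toy "training corpus": doc id -> text.
corpus = {
    "doc1": "the binomial coefficient n choose k counts subsets of size k",
    "doc2": "cats are small domesticated carnivorous mammals",
    "doc3": "n choose k equals n factorial over k factorial times n minus k factorial",
}

model_output = "the answer uses n choose k equals n factorial over k factorial"

def find_verbatim_spans(output, docs, min_len=4):
    """Return (span, [doc ids]) for maximal word spans of the output that
    occur verbatim in at least one document (spans shorter than min_len
    words are treated as generic filler and skipped)."""
    words = output.split()
    results, i = [], 0
    while i < len(words):
        best_j, best_docs = 0, []
        j = i + min_len
        while j <= len(words):
            span = " ".join(words[i:j])
            hits = [d for d, text in docs.items() if span in text]
            if not hits:
                break  # if this prefix doesn't match, no longer span can
            best_j, best_docs = j, hits
            j += 1
        if best_j:
            results.append((" ".join(words[i:best_j]), best_docs))
            i = best_j  # continue past the matched span
        else:
            i += 1
    return results

def bm25_rank(query, docs, k1=1.5, b=0.75):
    """Rank docs for a query with plain Okapi BM25, highest score first."""
    tokenized = {d: t.split() for d, t in docs.items()}
    N = len(tokenized)
    avgdl = sum(len(t) for t in tokenized.values()) / N
    scores = {}
    for d, dw in tokenized.items():
        tf = Counter(dw)
        s = 0.0
        for q in set(query.split()):
            df = sum(1 for t in tokenized.values() if q in t)
            if df == 0:
                continue
            idf = math.log(1 + (N - df + 0.5) / (df + 0.5))
            s += idf * tf[q] * (k1 + 1) / (tf[q] + k1 * (1 - b + b * len(dw) / avgdl))
        scores[d] = s
    return sorted(scores, key=scores.get, reverse=True)

# Trace: find verbatim spans, then rank the candidate source docs.
for span, docs_with_span in find_verbatim_spans(model_output, corpus):
    ranked = bm25_rank(span, {d: corpus[d] for d in docs_with_span})
    print(f"span: {span!r} -> top source: {ranked[0]}")
```

Here the combinatorics phrase in the model output traces back to `doc3`, which is exactly the "math tracing" use case described below — just at a scale of three documents instead of trillions of tokens.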
Use cases?
Here are a few real ones from the paper:
Fact-checking: Want to verify if a model’s answer is based on solid data or just hallucination? You can now see exactly where it came from.
Tracing "creativity": Is your model actually creative or just remixing a fanfic? OLMOTRACE can tell.
Math tracing: If your model nails a combinatorics question, OLMOTRACE might find the exact training doc where it learned that formula.
If you liked this issue of AI Agents Simplified, share it with your friends and spread the knowledge! ❣️