Have I Been Trained?

Check whether your content appears in major AI training datasets. We query live APIs where they are available and maintain indexed copies of public datasets for comprehensive coverage.

Have my assets been trained?


Understanding AI Model Confidence Levels

✓ Confirmed: Officially documented by the AI company in research papers or public statements

⭐ Likely: Inferred based on model capabilities and industry standards, but not officially confirmed by the company

❓ Unknown: The company has not publicly disclosed its training data sources, so we cannot determine whether it used this dataset

Note: Many AI companies keep training data confidential for competitive reasons. "Unknown" does not mean they didn't use your content - only that we cannot verify it.


Understanding AI Training Datasets

Learn about the datasets that power modern AI systems and how to protect your content

Training Datasets We Verify

Common Crawl (Live): Web text dataset. Current web crawl data.

Wikipedia (Live): Text dataset. Wikipedia dumps, used by virtually all LLMs.

Internet Archive (Live): Web archive. Historical snapshots of web content.

GitHub & Code (Live): Code dataset. Public code repositories.

Reddit & Social Media (Live): Conversational dataset. Reddit posts and conversations.

LAION-5B (Cached): Image dataset. 5.85B image-text pairs (offline since late 2023).

C4 Corpus (Static): Text dataset. Cleaned Common Crawl text (Google T5).

OpenImages (Offline): Image dataset. 9+ million labeled images (Google).

RedPajama (Static): Text dataset. 1.2 trillion tokens (LLaMA replication).

Books & Literature (Static): Text dataset. Books corpus from various sources.

Audio Datasets (Offline): Audio dataset. Voice and speech data (LibriSpeech, Common Voice).

DALL-E Training Data (Offline): Image dataset. Proprietary image dataset.

Midjourney Training Data (Offline): Image dataset. Proprietary/secret dataset.

Multimodal Web Data (Cached): Multimodal dataset. Combined text, image, and video data.

Why Cached Results?

Many AI training datasets don't offer public live APIs because:

Scale: Datasets like LAION-5B contain billions of entries

Privacy & Cost: Real-time APIs for massive datasets are expensive

Static Nature: Many training datasets are frozen versions

Our cached results are based on real data - we maintain indexed copies for accurate verification

How to Interpret Results

Understanding verification status and confidence levels

Verification Status

FOUND: Content exists in this dataset

NOT FOUND: Content was not detected in this dataset

ERROR: The check could not be completed because of a temporary API issue

Confidence Levels

90-100%: High confidence (live API or exact match)

70-89%: Good confidence (cached database match)

50-69%: Moderate confidence (pattern-based detection)
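
The bands above can be read as a simple lookup. The following is a minimal sketch in Python, not the scoring logic this tool actually uses: the function name `confidence_band` is hypothetical, and the label returned for scores below 50% is an assumption, since no band is documented for that range.

```python
def confidence_band(score: float) -> str:
    """Map a confidence percentage to the band described on this page."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 90:
        return "High confidence (live API or exact match)"
    if score >= 70:
        return "Good confidence (cached database match)"
    if score >= 50:
        return "Moderate confidence (pattern-based detection)"
    # Assumption: the page documents no band below 50%, so this label is made up.
    return "Below documented range"


if __name__ == "__main__":
    for value in (95, 72, 50, 10):
        print(value, "->", confidence_band(value))
```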

How to Protect Your Content

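Steps to signal that your content should not be used for AI training. These are opt-out signals: they only work for crawlers that choose to honor them.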

Add Meta Tags

<meta name="robots" content="noai, noimageai">
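
To confirm that a page really serves this tag, one option is to parse the HTML and collect the directives from any `robots` meta tag. The sketch below uses only Python's standard library; the names `RobotsMetaParser` and `robots_meta_directives` are hypothetical helpers, not part of any tool described here. Note that `noai` and `noimageai` are not part of a formal standard, so finding them on your page only shows that the signal is being sent, not that a crawler respects it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values may be None (e.g. <meta name>), so default to "".
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for token in attr.get("content", "").split(","):
            token = token.strip().lower()
            if token:
                self.directives.add(token)


def robots_meta_directives(html: str) -> set:
    """Return the set of robots meta directives found in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noai, noimageai"></head></html>'
    found = robots_meta_directives(page)
    print("noai present:", "noai" in found)
    print("noimageai present:", "noimageai" in found)
```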

Update robots.txt

User-agent: GPTBot
Disallow: /
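
A rule like this can be checked before deployment with Python's standard `urllib.robotparser` module. The sketch below parses the two lines above and asks whether a given user agent may fetch a URL; the helper name `is_blocked` and the `example.com` URL are placeholders chosen for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt disallows user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


RULES = "User-agent: GPTBot\nDisallow: /\n"

if __name__ == "__main__":
    url = "https://example.com/article"  # placeholder URL
    print("GPTBot blocked:", is_blocked(RULES, "GPTBot", url))
    print("Googlebot blocked:", is_blocked(RULES, "Googlebot", url))
```

Because the rule names only GPTBot, other crawlers remain allowed; each additional AI crawler needs its own `User-agent` entry.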

Use Headers

X-Robots-Tag: noai
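
How the header is set depends on your server. As one illustration, assuming a Python site served through WSGI, a small middleware can append the header to every response. The names `add_noai_header` and `hello_app` are hypothetical; on other stacks the same header would normally be set in the web server or CDN configuration instead.

```python
def add_noai_header(app):
    """Wrap a WSGI app so every response also carries X-Robots-Tag: noai."""

    def wrapped(environ, start_response):
        def start_with_header(status, headers, exc_info=None):
            # Copy the list so the wrapped app's own header list is not mutated.
            new_headers = list(headers) + [("X-Robots-Tag", "noai")]
            return start_response(status, new_headers, exc_info)

        return app(environ, start_with_header)

    return wrapped


def hello_app(environ, start_response):
    """Tiny demo application used only to exercise the middleware."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello"]


protected_app = add_noai_header(hello_app)

if __name__ == "__main__":
    from wsgiref.simple_server import make_server

    # Serve locally, then inspect the headers with: curl -I http://localhost:8000/
    with make_server("localhost", 8000, protected_app) as server:
        server.serve_forever()
```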

Request Removal

Contact dataset maintainers directly
