
How Cryptographic Watermarking Survives Distribution

A publisher's content travels through many hands before reaching an AI company's training corpus. For provenance to be useful, it needs to survive that journey. Here is what survives and what does not.

Why Statistical Approaches Fail in Distribution

Statistical watermarking systems like SynthID embed watermarks as properties of the content's statistical distribution - biased token probabilities for text, imperceptible pixel patterns for images. These properties degrade with content modification.

During typical distribution, content is routinely modified. Article text is reformatted, copy-edited, or summarized by downstream outlets. Images are resized, recompressed, or color-adjusted for different display contexts. Each modification degrades the statistical signal. By the time content reaches an AI training corpus through normal distribution channels, statistical watermarks may be below detection threshold.

Cryptographic provenance is not a statistical property. It is a discrete artifact embedded in defined positions in the content. The artifact is either present or absent. It does not degrade with ordinary distribution operations.

Scenario 1: Copy-Paste from Web

A reader copies text from a signed article and pastes it into an email, a document, or a social media post. The embedded provenance data copies with the text. This is standard Unicode copy behavior: the clipboard stores the Unicode character stream, and the paste operation inserts the same stream.

The pasted text contains the provenance markers. If the recipient subsequently shares that text, the markers travel further. If the text ends up in an AI training corpus via this path, the markers are present.

Confirmed environments: Chrome, Firefox, Safari, Edge, Gmail, Outlook, Apple Mail, Google Docs, Microsoft Word, Slack, Teams. There are edge cases in some processing pipelines that explicitly strip non-printing characters, but these are not standard behaviors in the distribution scenarios that matter for publisher protection.

Scenario 2: Wire Service B2B Distribution

A publisher signs an article and distributes it to AP. AP distributes it to 1,500 subscriber outlets via NITF or NewsML XML feeds. Each subscriber's CMS ingests the XML, extracts the article body, and stores it.

XML processing preserves Unicode character data. The article body extracted from the XML feed contains the provenance markers embedded in the original text. The CMS stores the character stream intact. The published article contains the markers.
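This preservation property can be checked directly with Python's standard XML parser. The snippet below uses a simplified NITF-like structure (the element layout is illustrative, not a full NITF document) with a zero-width character standing in for an embedded marker:

```python
import xml.etree.ElementTree as ET

ZWNJ = "\u200c"  # zero-width non-joiner, standing in for a provenance marker

# Minimal NITF-like feed item (structure simplified for illustration)
feed = (
    f"<nitf><body><body.content>"
    f"<p>Signed{ZWNJ}{ZWNJ} article body.</p>"
    f"</body.content></body></nitf>"
)

# A subscriber CMS parses the feed and extracts the article body
body_text = ET.fromstring(feed).find(".//p").text
assert ZWNJ in body_text  # markers survive XML parsing and extraction
```

The parser treats the zero-width characters as ordinary character data, so they pass through ingestion and storage unchanged.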

The Encypher API also supports wire service signing: the wire service can sign content on behalf of member publishers using delegated credentials. This means even unsigned originals can receive provenance at the distribution layer. The manifest identifies the originating publisher, not the wire service, when using delegated credentials.
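The attribution relationship in delegated signing might look like the following manifest sketch. The field names and values here are illustrative assumptions, not the actual Encypher API schema:

```python
# Hypothetical manifest shape for delegated wire-service signing.
# All field names are illustrative, not the actual Encypher API schema.
manifest = {
    "publisher": "example-news.com",    # originating publisher (credited party)
    "signer": "wire-service.example",   # party that applied the signature
    "delegation": {
        "granted_by": "example-news.com",
        "scope": "sign",
    },
}

# The manifest credits the originating publisher, not the wire service
assert manifest["publisher"] != manifest["signer"]
```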

Scenario 3: Aggregator Scraping

An aggregator scrapes article text from the publisher's website. Most scraping tools work by fetching HTML and using a parser to extract text content. HTML parsers handle Unicode character data correctly - they preserve all Unicode characters in the text content, including embedded provenance markers.

The scraped text contains the markers. Unless the scraper explicitly filters for non-printing Unicode characters (which is not a standard scraping operation), the markers travel with the scraped content into the aggregator's index.

Common Python scraping libraries (BeautifulSoup, Scrapy, requests-html) preserve Unicode characters in extracted text. JavaScript scraping environments (Puppeteer, Playwright) do the same. The standard scraping toolchain is compatible with provenance marker preservation.
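The same behavior can be demonstrated with Python's standard-library HTML parser, which the higher-level libraries build on or behave equivalently to for character data:

```python
from html.parser import HTMLParser

ZWJ = "\u200d"  # zero-width joiner, standing in for a provenance marker

class TextExtractor(HTMLParser):
    """Collect text content the way a basic scraper would."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

page = f"<html><body><p>Signed{ZWJ}{ZWJ} paragraph.</p></body></html>"
extractor = TextExtractor()
extractor.feed(page)
scraped_text = "".join(extractor.chunks)

assert ZWJ in scraped_text  # markers survive HTML parsing and text extraction
```

Nothing in the HTML-to-text path treats zero-width characters specially; they are ordinary character data to the parser.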

Scenario 4: Social Media Sharing

Social platforms handle text differently from web browsers. Some platforms normalize Unicode input - stripping certain categories of characters during content submission. Twitter/X, in particular, has historically stripped zero-width characters.

The durability of provenance on social platforms depends on the platform's Unicode handling. Platforms that preserve full Unicode character streams preserve the markers. Platforms that strip non-printing characters lose them.

For text that is shared to social platforms and then scraped, marker survival depends on whether the platform preserved the markers. For content shared as images (screenshots) or as links (where the link points to the signed original), the original retains its provenance regardless of what the social platform does to shared-text posts.

Scenario 5: AI Training Corpus Ingestion

AI training corpus builders scrape the web at scale and process the collected HTML into text. Common Crawl, a corpus widely used as a base for AI training data, stores raw HTTP responses in WARC (Web ARChive format) files, which preserve the full Unicode text content.

Processing pipelines that convert HTML to plain text use parsers that preserve Unicode characters. The text content in AI training corpora derived from web pages typically includes provenance markers that were present in the original HTML text content.

There is no industry standard for stripping embedded provenance data during corpus preprocessing. Some pipelines may normalize Unicode - applying Unicode normalization forms (NFC, NFKC) that could modify or eliminate certain character sequences. The extent to which this occurs varies by implementation and is an area of ongoing evaluation.
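Whether a given normalization form affects a given marker character can be probed directly with the standard library. The sketch below checks a few common zero-width carriers (an illustrative set, not the full inventory any particular scheme uses):

```python
import unicodedata

def survives_normalization(marker: str, form: str) -> bool:
    """Check whether a marker character survives a Unicode normalization form."""
    sample = f"abc{marker}def"
    return marker in unicodedata.normalize(form, sample)

# Common zero-width carriers (illustrative set)
carriers = ("\u200b", "\u200c", "\u200d")  # ZWSP, ZWNJ, ZWJ

for ch in carriers:
    for form in ("NFC", "NFKC"):
        assert survives_normalization(ch, form)
```

These particular characters have no decomposition mappings, so NFC and NFKC leave them in place; other character sequences, especially compatibility characters under NFKC, may be altered, which is why per-pipeline evaluation matters.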

For publishers, the relevant question is not whether every training corpus preserves markers perfectly - it is whether the training corpus ingested content that carried embedded rights terms at the time of ingestion. For content that was signed before scraping, the rights terms were present in the content at ingestion time, regardless of whether all markers survived preprocessing.

What Provenance Does Not Survive

To be useful, the limitations of provenance durability need to be stated as clearly as the capabilities.

  • OCR Scanning

    Printing and scanning text via OCR produces new Unicode characters from image pixels. The markers from the original text are not present in OCR output. Print-to-scan pipelines require a different mechanism: whitespace width encoding (Print Leak Detection), which encodes provenance in the widths of spaces rather than in invisible characters.

  • Paraphrasing and Rewriting

    If content is substantially rewritten in different words, the new text does not carry the original markers. Paraphrasing detection is a separate problem from provenance verification. Provenance tracks exact text reproduction, not semantic similarity.

  • Translation

    Translated text is new content. The markers from the original-language text are not present in the translation.

  • Explicit Stripping

    A party that deliberately filters embedded provenance data from text will remove the markers. This is detectable (the text then verifies as unsigned) but not preventable. The act of stripping is itself evidence of awareness.
