C2PA Section A.7: Text Provenance

Section A.7 of the C2PA 2.3 specification defines how C2PA manifests are embedded into unstructured text. Erik Svilich, Encypher's founder, authored this section and co-chairs the C2PA Text Provenance Task Force.

Read the specification: C2PA 2.3, Section A.7: Embedding Manifests into Unstructured Text

What Section A.7 Defines

The main C2PA specification defines how manifests are embedded into structured file formats: images, audio, video, and documents whose container structures have designated locations for metadata. Unstructured text (plain text, web articles, social media posts, email content) has no such container, so the manifest must be embedded in the text character stream itself.

Section A.7 solves this by defining how Unicode characters can carry manifest data invisibly within text. The Unicode standard includes ranges of characters that have no glyph of their own: they are present in the character stream but produce no visible output in standard text rendering environments.

Two encoding modes are defined, each optimized for different deployment environments. Both produce text that is visually identical to unsigned text. Both can be decoded by C2PA-compliant verification software.
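To make the idea of invisible carrier characters concrete, here is a minimal Python sketch. It assumes a simple one-byte-per-character mapping onto the 256 Unicode variation selectors (U+FE00 to U+FE0F and U+E0100 to U+E01EF). The function names and the mapping are illustrative assumptions, not the byte layout that Section A.7 specifies.

```python
from typing import Optional, Tuple

# Illustrative assumption: one variation selector per payload byte.
# This is NOT the wire format defined in C2PA Section A.7.

def byte_to_selector(b: int) -> str:
    """Map a byte value (0-255) to one of the 256 variation selectors."""
    if b < 16:
        return chr(0xFE00 + b)        # U+FE00..U+FE0F
    return chr(0xE0100 + (b - 16))    # U+E0100..U+E01EF

def selector_to_byte(ch: str) -> Optional[int]:
    """Return the byte a variation selector stands for, or None for other characters."""
    cp = ord(ch)
    if 0xFE00 <= cp <= 0xFE0F:
        return cp - 0xFE00
    if 0xE0100 <= cp <= 0xE01EF:
        return cp - 0xE0100 + 16
    return None

def embed(text: str, payload: bytes) -> str:
    """Append the payload to the text as a run of invisible characters."""
    return text + "".join(byte_to_selector(b) for b in payload)

def extract(signed: str) -> Tuple[str, bytes]:
    """Split a string into its visible text and the embedded payload bytes."""
    visible = []
    data = bytearray()
    for ch in signed:
        b = selector_to_byte(ch)
        if b is None:
            visible.append(ch)
        else:
            data.append(b)
    return "".join(visible), bytes(data)

signed = embed("Hello, world.", b"manifest-bytes")
text, payload = extract(signed)
```

The sketch treats every variation selector as payload, so it would misread text that already uses them (some emoji sequences do). A real format needs a way to mark where the embedded data starts and ends.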

Encoding Mode 1: Web and Digital Distribution

The default encoding mode is optimized for web and digital distribution. It uses Unicode characters that are invisible to readers and have no rendering effect in standard web browsers, email clients, and text processing systems that comply with Unicode standards.

This mode achieves high information density, allowing full C2PA manifests to be embedded without meaningfully expanding the character count of the document. The encoding is deterministic: the same manifest always produces the same embedded character sequence, and the same character sequence always decodes to the same manifest.

Content embedded in this mode survives copy-paste, email forwarding, and CMS processing, and verification is binary: a signed passage either passes or fails. This is Encypher's primary encoding path and the recommended default in Section A.7.

Encoding Mode 2: Microsoft Word Compatibility

The second encoding mode is optimized for Microsoft Word and Office-format processing environments. Word applies its own normalization to Unicode characters during document operations, which can strip or alter certain character ranges that the default mode relies on.

This mode selects characters from Unicode ranges that Word preserves reliably, ensuring that content signed for Word distribution remains verifiable after round-trips through Word's editing and save operations. The trade-off is lower information density compared to the default mode.
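The density trade-off can be illustrated with a small Python sketch. The source does not name the character ranges that Word preserves, so the four-character alphabet below is a placeholder assumption chosen only to show the arithmetic. With four symbols, each invisible character carries 2 bits, so one payload byte needs four characters.

```python
# Hypothetical 4-symbol alphabet of zero-width characters (an assumption for
# illustration; not the ranges Section A.7 selects, and not verified as Word-safe).
ALPHABET = ["\u200b", "\u200c", "\u200d", "\u2060"]
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

def encode_low_density(payload: bytes) -> str:
    """Encode each byte as four symbols of 2 bits each, most significant bits first."""
    out = []
    for b in payload:
        for shift in (6, 4, 2, 0):
            out.append(ALPHABET[(b >> shift) & 0b11])
    return "".join(out)

def decode_low_density(chars: str) -> bytes:
    """Rebuild bytes from alphabet symbols, skipping any other characters."""
    symbols = [INDEX[ch] for ch in chars if ch in INDEX]
    if len(symbols) % 4 != 0:
        raise ValueError("symbol count is not a multiple of 4")
    data = bytearray()
    for i in range(0, len(symbols), 4):
        b = 0
        for s in symbols[i:i + 4]:
            b = (b << 2) | s
        data.append(b)
    return bytes(data)

encoded = encode_low_density(b"C2PA")   # 4 bytes become 16 invisible characters
```

A smaller alphabet is easier to keep within ranges that a given editor preserves, at the cost of more characters per manifest byte.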

Encypher's signing API accepts a parameter to select this mode. Use it when content is destined for Word format distribution or any environment where Office-format compatibility is required.

Encoding Durability

Both encoding modes are designed to survive the operations that text undergoes in normal distribution: copy-paste between applications, email forwarding through standard mail servers, publication through CMS platforms, and aggregation by content syndication systems. The embedded manifest travels with the text through these operations without requiring any special handling by intermediaries.

Verification is deterministic. A signed passage either passes or fails. There is no probabilistic threshold or confidence score - the cryptographic signature either validates against the content or it does not. This makes text provenance suitable for compliance and legal contexts where binary verification outcomes are required.
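The binary outcome can be sketched in a few lines of Python. C2PA manifests use certificate-based signatures. This sketch substitutes an HMAC from the standard library as a stand-in so that it stays self-contained, and the `sign` and `verify` names are illustrative assumptions.

```python
import hashlib
import hmac

def sign(text: str, key: bytes) -> bytes:
    """Compute an authentication tag over the exact UTF-8 bytes of the text."""
    return hmac.new(key, text.encode("utf-8"), hashlib.sha256).digest()

def verify(text: str, tag: bytes, key: bytes) -> bool:
    """Return True only if the tag matches the text; there is no partial score."""
    return hmac.compare_digest(sign(text, key), tag)

key = b"demo-key"                       # placeholder key for the example
tag = sign("Signed passage.", key)
intact = verify("Signed passage.", tag, key)      # unchanged text
tampered = verify("Signed passage!", tag, key)    # one character changed
```

Any change to the signed bytes, however small, flips the result from pass to fail.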

Encypher's Extension: Sentence-Level Attribution

C2PA Section A.7 defines document-level authentication: a manifest that covers the text document as a whole. A verification check confirms that the entire signed text is authentic and unmodified. This matches how C2PA handles images and other media formats.

Encypher's implementation extends this with proprietary sentence-level attribution. Each sentence in a signed document carries its own attribution data, allowing verification at the granularity of individual sentences rather than the document as a whole. This means a passage extracted from a larger article can be verified on its own, and a document that mixes authentic and modified sentences can identify exactly which sentences remain intact.

This sentence-level capability is Encypher's proprietary technology, built on top of the C2PA standard. Content signed by Encypher passes standard C2PA document-level verification with any compliant tool. The sentence-level attribution requires Encypher's verification infrastructure to resolve.
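Encypher's sentence-level method is proprietary and is not described here. The Python sketch below shows only the general idea of per-sentence verification under simple assumptions: a naive regex sentence splitter and one HMAC tag per sentence. All names are hypothetical.

```python
import hashlib
import hmac
import re
from typing import List, Tuple

def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tag_sentence(sentence: str, key: bytes) -> bytes:
    return hmac.new(key, sentence.encode("utf-8"), hashlib.sha256).digest()

def sign_sentences(text: str, key: bytes) -> List[bytes]:
    """Produce one tag per sentence of the original document."""
    return [tag_sentence(s, key) for s in split_sentences(text)]

def check_sentences(text: str, tags: List[bytes], key: bytes) -> List[Tuple[str, bool]]:
    """Report, for each sentence, whether it matches any tag from the signed original."""
    results = []
    for s in split_sentences(text):
        candidate = tag_sentence(s, key)
        ok = any(hmac.compare_digest(candidate, t) for t in tags)
        results.append((s, ok))
    return results

key = b"demo-key"                       # placeholder key for the example
original = "The sky is blue. Water is wet. Fire is hot."
tags = sign_sentences(original, key)

# A document that mixes intact and modified sentences.
mixed = check_sentences("The sky is blue. Water is dry. Fire is hot.", tags, key)

# A single sentence extracted from the original.
excerpt = check_sentences("Water is wet.", tags, key)
```

In this sketch `mixed` flags only the altered middle sentence, and the extracted sentence in `excerpt` verifies on its own.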

Encypher's Contribution

Erik Svilich contributed the text provenance encoding approaches to the C2PA specification and co-chairs the C2PA Text Provenance Task Force. The task force is responsible for maintaining and extending Section A.7 as text provenance requirements evolve.

The contribution was accepted into C2PA 2.3, the current version of the specification, which is available at spec.c2pa.org. Section A.7 appears there under the title "Embedding Manifests into Unstructured Text".

Encypher's implementation fully complies with Section A.7. Content signed by Encypher can be verified by any C2PA 2.3-compliant verification tool. The sentence-level extension is implemented as a compatible addition to the standard manifest structure, not as a modification that breaks standard verification.


Implement Section A.7 Text Provenance

Encypher's API implements both Section A.7 encoding modes with sentence-level attribution. Python and TypeScript SDKs are available, and a free tier covers up to 1,000 documents per month.
