How I Built an Immersive AI Translation Extension with Codex
GitHub repository: realreadpaper/translate
The Short Version
The project became Immersive AI Translate, a Chrome / Edge Manifest V3 extension.
Its stable path is immersive web-page translation: click a floating button, translate the content near the current viewport first, insert translations back into the original page, and switch between bilingual, source-only, and translation-only reading modes without calling the model again.
What I really wanted to test was not whether AI could write a translation extension. I wanted to know whether the system would stay understandable after it grew from web pages into PDF, YouTube subtitles, and reserved ASR/OCR paths.
The final split is:
- segment: the smallest stable translation unit.
- provider: model request and response parsing only.
- target: web page, PDF, or YouTube subtitles.
- renderer: put translated results back in the right place.
- prompt contract: make model output an interface, not free-form text.
Reading Path
This post follows the implementation order:
- Ask Codex to design boundaries before writing code.
- Use segment id to make web-page rendering stable.
- Isolate DeepSeek, OpenAI-compatible APIs, and traditional translation behind provider adapters.
- Use translation targets for PDF and YouTube.
- Treat prompts as strict output contracts.
- Keep tests close to the boundaries.
1. I Asked Codex for Design First
If the first prompt is “build a web-page translation extension,” the result is usually a demo: read page text, send it to a model, and inject the response back into the DOM. That is fast, but it collapses under dynamic pages, malformed model output, and reading-mode requirements.
So the first prompt defined constraints instead:
Design a Chrome MV3 immersive web-page translation extension.
Requirements:
- Do not destroy the original page DOM
- Translations must map back to source segments reliably
- Support bilingual, source-only, and translation-only modes
- Providers are replaceable, default DeepSeek, also support OpenAI-compatible APIs
- API keys stay in browser local storage
- First define module boundaries, message protocol, and testing strategy
- Do not implement code yet
The first useful structure from Codex was close to the final one:
popup/options
-> background service worker
-> provider adapter
-> content script
-> DOM extractor / renderer
-> chrome.storage.local
That separation kept UI, configuration, model calls, DOM operations, and persistence from leaking into each other. Later, when I added PDF and YouTube, the project did not immediately become a wall of special cases.
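The message protocol between the content script and the background worker can be sketched as typed messages plus a runtime guard. This is only an illustration: the message names, the `requestId` field, and the helpers `isContentMessage` and `failure` are hypothetical and are not taken from the repository.

```typescript
type SourceSegment = { id: string; text: string };
type TranslatedSegment = { id: string; translatedText: string };

// Hypothetical message names; the post does not list the real protocol.
type ContentToBackground =
  | { type: "TRANSLATE_SEGMENTS"; requestId: string; segments: SourceSegment[] }
  | { type: "CANCEL"; requestId: string };

type BackgroundToContent =
  | { type: "SEGMENTS_TRANSLATED"; requestId: string; segments: TranslatedSegment[] }
  | { type: "TRANSLATION_FAILED"; requestId: string; message: string };

// Messages cross a process boundary, so check their shape before trusting them.
function isContentMessage(value: unknown): value is ContentToBackground {
  if (typeof value !== "object" || value === null) return false;
  const msg = value as Record<string, unknown>;
  if (typeof msg["requestId"] !== "string") return false;
  if (msg["type"] === "CANCEL") return true;
  if (msg["type"] !== "TRANSLATE_SEGMENTS") return false;
  const segments = msg["segments"];
  if (!Array.isArray(segments)) return false;
  return segments.every((s: unknown) => {
    if (typeof s !== "object" || s === null) return false;
    const seg = s as Record<string, unknown>;
    return typeof seg["id"] === "string" && typeof seg["text"] === "string";
  });
}

// Build the failure reply for a request; the content script matches it by requestId.
function failure(requestId: string, message: string): BackgroundToContent {
  return { type: "TRANSLATION_FAILED", requestId, message };
}
```

Keeping both directions in one pair of union types means a new message kind shows up as a compile error wherever a switch over `type` forgets to handle it.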
2. Do Not Translate Full HTML
The dangerous shortcut is sending an entire page’s innerHTML to a model. It breaks events, styles, scripts, React/Vue state, and makes restoration nearly impossible.
I had Codex implement a text-block pipeline:
Content script
-> DOM TreeWalker finds visible text nodes
-> skip script/style/code/pre/input/hidden nodes
-> merge text into SourceSegment[]
-> write data-segment-id onto real DOM nodes
-> insert data-translation-for nodes when translations return
The core shape is small:
type SourceSegment = {
id: string;
text: string;
};
type TranslatedSegment = {
id: string;
translatedText: string;
};
This solves three problems:
- Translations can return near the original text by id.
- Reading modes do not need another model request.
- The extension owns only the nodes it inserts, not the entire page.
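The three points above can be sketched as one pure function. This is a minimal sketch, not the extension's renderer: `ReadingMode`, `DisplayBlock`, and `buildDisplay` are hypothetical names, and a plain array stands in for the DOM nodes that carry data-segment-id and data-translation-for.

```typescript
type SourceSegment = { id: string; text: string };
type TranslatedSegment = { id: string; translatedText: string };

// Hypothetical names for this sketch; the real renderer works on DOM nodes.
type ReadingMode = "bilingual" | "source-only" | "translation-only";
type DisplayBlock = { id: string; lines: string[] };

// Join translations back to their sources by id, then decide what to show.
// Switching modes only re-runs this function over data already in memory.
function buildDisplay(
  sources: SourceSegment[],
  translations: TranslatedSegment[],
  mode: ReadingMode
): DisplayBlock[] {
  const byId = new Map<string, string>();
  for (const t of translations) {
    byId.set(t.id, t.translatedText);
  }
  return sources.map((s) => {
    const translated = byId.get(s.id);
    // A segment with no translation yet always falls back to its source text.
    if (translated === undefined || mode === "source-only") {
      return { id: s.id, lines: [s.text] };
    }
    if (mode === "translation-only") {
      return { id: s.id, lines: [translated] };
    }
    return { id: s.id, lines: [s.text, translated] };
  });
}
```

Because the lookup is keyed by id rather than by position, a translation that arrives late or out of order still lands next to the right source text.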
Viewport-first translation grew naturally from this model: translate nearby segments first, then continue as the user scrolls. The default batch size is six segments, which keeps latency low and reduces malformed output.
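A minimal sketch of viewport-first batching, assuming each segment carries a vertical offset measured by the content script. The `top` field and the helpers `orderByViewport` and `chunk` are hypothetical; only the batch size of six comes from the project.

```typescript
type SourceSegment = { id: string; text: string };
// Hypothetical: a segment plus its vertical offset in page pixels.
type PositionedSegment = SourceSegment & { top: number };

const DEFAULT_BATCH_SIZE = 6;

// Return a sorted copy, nearest to the vertical center of the viewport first.
function orderByViewport(
  segments: PositionedSegment[],
  scrollTop: number,
  viewportHeight: number
): PositionedSegment[] {
  const center = scrollTop + viewportHeight / 2;
  return [...segments].sort(
    (a, b) => Math.abs(a.top - center) - Math.abs(b.top - center)
  );
}

// Split an ordered list into fixed-size batches; the last one may be shorter.
function chunk<T>(items: T[], size: number = DEFAULT_BATCH_SIZE): T[][] {
  if (!Number.isInteger(size) || size < 1) {
    throw new Error("size must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}
```

Re-running `orderByViewport` on the untranslated remainder after each scroll keeps the next batch close to what the reader is looking at.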
3. Providers Only Adapt Models
A provider adapter never touches the DOM or popup UI. It owns four things:
- validateConfig()
- translateSegments()
- normalizeError()
- getMeta()
The background service worker is the orchestration layer:
read settings
-> validate provider config
-> collect segments
-> chunk batches
-> call provider
-> send results back to content script
That means DeepSeek, OpenAI-compatible APIs, custom Base URLs, and traditional translation services all use the same task flow. Adding a provider does not require touching dom-extractor or segment-renderer.
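The adapter boundary and the task flow can be sketched together. The four method names come from the list above; their signatures, the `ProviderConfig`, `ProviderError`, and `ProviderMeta` shapes, the `echoProvider` stand-in, and the `translateAll` function are assumptions made for this example.

```typescript
type SourceSegment = { id: string; text: string };
type TranslatedSegment = { id: string; translatedText: string };

// Assumed shapes: the post names the four methods but not their signatures.
type ProviderConfig = { apiKey: string; baseUrl: string; model: string };
type ProviderError = { code: "config" | "network" | "parse" | "unknown"; message: string };
type ProviderMeta = { id: string; label: string };

interface ProviderAdapter {
  validateConfig(config: ProviderConfig): string[]; // problems found; empty when valid
  translateSegments(batch: SourceSegment[], config: ProviderConfig): Promise<TranslatedSegment[]>;
  normalizeError(error: unknown): ProviderError;
  getMeta(): ProviderMeta;
}

// Stand-in provider so the flow runs without a network call.
const echoProvider: ProviderAdapter = {
  validateConfig: (c) => (c.apiKey.trim() === "" ? ["apiKey is required"] : []),
  translateSegments: async (batch) =>
    batch.map((s) => ({ id: s.id, translatedText: "[translated] " + s.text })),
  normalizeError: (e) => ({
    code: "unknown",
    message: e instanceof Error ? e.message : String(e),
  }),
  getMeta: () => ({ id: "echo", label: "Echo (test only)" }),
};

// Orchestration as a background worker might run it: validate, chunk,
// call the provider per batch, and hand each result to a callback.
async function translateAll(
  provider: ProviderAdapter,
  config: ProviderConfig,
  segments: SourceSegment[],
  onBatch: (result: TranslatedSegment[]) => void,
  batchSize: number = 6
): Promise<ProviderError | null> {
  const problems = provider.validateConfig(config);
  if (problems.length > 0) {
    return { code: "config", message: problems.join("; ") };
  }
  const size = Math.max(1, Math.floor(batchSize));
  try {
    for (let i = 0; i < segments.length; i += size) {
      onBatch(await provider.translateSegments(segments.slice(i, i + size), config));
    }
    return null;
  } catch (error) {
    return provider.normalizeError(error);
  }
}
```

Nothing in `translateAll` knows which provider it is driving, which is the property that lets a new adapter be added without touching extraction or rendering.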
4. The Prompt Is an Interface Contract
The model response must be machine-readable, so the prompt is strict:
You are a professional translation engine for immersive bilingual reading.
Return only one valid JSON object matching exactly:
{"segments":[{"id":"same id","translatedText":"translation"}]}
Rules:
- Copy each input id byte-for-byte.
- Never translate, shorten, rename, or omit ids.
- Do not return markdown fences, explanations, notes, or extra text.
- If a segment is hard to translate, keep its original text.
The point is not only better translation. The point is alignment. A browser extension fails badly when a model rewrites an id and a translation lands next to the wrong source text.
Even with a strict prompt, real models return odd shapes:
- Markdown json fences
- top-level arrays
- { "segments": [...] }
- { "translations": [...] }
- fields such as translation or targetText
- JSON-like text with broken string quoting
So Codex added a parseTranslatedSegments recovery layer. It prefers valid JSON, repairs when safe, recovers by source segment order when possible, and fails only when the batch cannot be mapped reliably.
The lesson: prompts stabilize the happy path; parsers protect the unhappy path.
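A minimal sketch of such a recovery layer follows. It is not the project's parseTranslatedSegments: the name `parseTranslatedSegmentsSketch` and its helpers are hypothetical, it covers only the fence, top-level array, alternate key, and alternate field cases, and it does not try to repair broken string quoting.

```typescript
type SourceSegment = { id: string; text: string };
type TranslatedSegment = { id: string; translatedText: string };

// Built at runtime so this sample never contains a literal Markdown fence.
const FENCE = "`".repeat(3);

// Remove a surrounding Markdown fence (with optional language tag) if present.
function stripFence(raw: string): string {
  const text = raw.trim();
  if (!text.startsWith(FENCE)) return text;
  const firstNewline = text.indexOf("\n");
  const end = text.lastIndexOf(FENCE);
  if (firstNewline === -1 || end <= firstNewline) return text;
  return text.slice(firstNewline + 1, end).trim();
}

// Return the first string value found under any of the given keys.
function pickString(obj: Record<string, unknown>, keys: string[]): string | undefined {
  for (const key of keys) {
    const value = obj[key];
    if (typeof value === "string") return value;
  }
  return undefined;
}

function parseTranslatedSegmentsSketch(
  raw: string,
  sources: SourceSegment[]
): TranslatedSegment[] {
  const parsed: unknown = JSON.parse(stripFence(raw)); // throws on invalid JSON
  let items: unknown;
  if (Array.isArray(parsed)) {
    items = parsed;
  } else if (typeof parsed === "object" && parsed !== null) {
    const obj = parsed as Record<string, unknown>;
    items = obj["segments"] ?? obj["translations"];
  }
  if (!Array.isArray(items)) throw new Error("no segment list in model output");

  const knownIds = new Set(sources.map((s) => s.id));
  const seen = new Set<string>();
  const byId: TranslatedSegment[] = [];
  const texts: string[] = [];
  let allIdsUsable = true;
  for (const item of items) {
    if (typeof item !== "object" || item === null) throw new Error("segment is not an object");
    const record = item as Record<string, unknown>;
    const text = pickString(record, ["translatedText", "translation", "targetText"]);
    if (text === undefined) throw new Error("segment has no translation field");
    texts.push(text);
    const id = record["id"];
    if (typeof id === "string" && knownIds.has(id) && !seen.has(id)) {
      seen.add(id);
      byId.push({ id, translatedText: text });
    } else {
      allIdsUsable = false; // missing, rewritten, or duplicated id
    }
  }
  if (allIdsUsable) return byId;
  // Order-based recovery is only safe when every source has exactly one output.
  if (texts.length === sources.length) {
    return sources.map((s, i) => ({ id: s.id, translatedText: texts[i] }));
  }
  throw new Error("cannot map model output to source segments");
}
```

The order-based fallback is deliberately narrow: if the counts differ, the batch fails instead of guessing, because a misplaced translation is worse than a missing one.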
5. PDF and YouTube Need Targets
After web pages worked, I added PDF and YouTube. If I kept adding checks inside the content script, it would become unmaintainable.
The next prompt changed the architecture:
Upgrade the existing page translation into a translation target architecture.
Target types:
- html-page
- pdf-document
- youtube-subtitles
Requirements:
- Reuse provider batch translation
- Do not mix YouTube/PDF logic into dom-extractor
- Each target has its own collector and renderer
- Anchors can represent DOM nodes, PDF blocks, and subtitle cues
Each target reuses translation orchestration but owns collection and rendering:
html-page
collector: DOM segment
renderer: insert translation node
pdf-document
collector: PDF text block
renderer: PDF translation workspace
youtube-subtitles
collector: timedtext / ASR cue
renderer: video overlay
The PDF boundary is narrow: independent PDF documents only, opened in the extension’s PDF workspace, with text-layer extraction first. OCR is a reserved fallback, not a fake promise.
The YouTube path also starts with the stable case: read timedtext subtitle tracks, translate cues, then render an overlay by video time. Videos without usable captions enter the experimental ASR path, but ASR must not slow down captioned videos.
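The target split can be sketched with a small interface and an in-memory subtitle target. The three target type names come from the prompt above; the `Anchor` variants, `TranslationTarget`, `runTarget`, and `createCueTarget` are assumptions for illustration, and the translate callback is synchronous here only to keep the example short.

```typescript
type SourceSegment = { id: string; text: string };
type TranslatedSegment = { id: string; translatedText: string };
type TargetType = "html-page" | "pdf-document" | "youtube-subtitles";

// Assumed anchor shapes: one variant per place a translation can land.
type Anchor =
  | { kind: "dom"; segmentId: string }
  | { kind: "pdf-block"; page: number; blockIndex: number }
  | { kind: "cue"; startMs: number; endMs: number };

type Collected = { segment: SourceSegment; anchor: Anchor };

interface TranslationTarget {
  type: TargetType;
  collect(): Collected[];
  render(anchor: Anchor, translation: TranslatedSegment): void;
}

// Shared orchestration: it never inspects an anchor, it only passes it through.
function runTarget(
  target: TranslationTarget,
  translate: (batch: SourceSegment[]) => TranslatedSegment[]
): number {
  const collected = target.collect();
  const anchors = new Map<string, Anchor>();
  for (const c of collected) anchors.set(c.segment.id, c.anchor);
  let rendered = 0;
  for (const t of translate(collected.map((c) => c.segment))) {
    const anchor = anchors.get(t.id);
    if (anchor === undefined) continue; // unknown id: skip rather than misplace
    target.render(anchor, t);
    rendered += 1;
  }
  return rendered;
}

// In-memory subtitle target standing in for the YouTube overlay.
function createCueTarget(cues: { startMs: number; endMs: number; text: string }[]) {
  const overlay: { startMs: number; text: string }[] = [];
  const target: TranslationTarget = {
    type: "youtube-subtitles",
    collect: () =>
      cues.map((cue, i) => ({
        segment: { id: "cue-" + i, text: cue.text },
        anchor: { kind: "cue" as const, startMs: cue.startMs, endMs: cue.endMs },
      })),
    render: (anchor, t) => {
      if (anchor.kind === "cue") {
        overlay.push({ startMs: anchor.startMs, text: t.translatedText });
      }
    },
  };
  return { target, overlay };
}
```

A PDF target would implement the same two methods with "pdf-block" anchors, so `runTarget` stays unchanged when a new content type is added.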
6. Different Content Needs Different Prompts
Same provider, different contentKind, different prompt.
Web-page text preserves tone and interface labels:
For web page text, preserve the original tone and intent.
Keep URLs, product names, code identifiers, and UI labels stable.
PDF text needs technical-document behavior:
The input is from an academic or technical PDF.
Preserve formulas, citations, references, code identifiers,
variable names, model names, dataset names, section numbers,
figure/table labels, and bibliography markers.
YouTube subtitles need compact, screen-readable language:
The input is timed subtitle text.
Translate naturally and concisely so each cue remains readable on screen.
This worked better than tuning from outside because the expected translation style is tied directly to the content type.
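One way to combine the shared output contract with a per-content style rule is a small prompt builder. This is a sketch under assumptions: the `ContentKind` values, the `buildPrompt` function, the system/user split, and the "Target language" line are hypothetical; the rule texts are the ones quoted in this post.

```typescript
type SourceSegment = { id: string; text: string };
// Hypothetical contentKind values; the post names the field but not its values.
type ContentKind = "web" | "pdf" | "subtitles";

// Shared contract: identical for every content kind, so the parser sees one shape.
const OUTPUT_CONTRACT = [
  "You are a professional translation engine for immersive bilingual reading.",
  "Return only one valid JSON object matching exactly:",
  '{"segments":[{"id":"same id","translatedText":"translation"}]}',
  "Rules:",
  "- Copy each input id byte-for-byte.",
  "- Never translate, shorten, rename, or omit ids.",
  "- Do not return markdown fences, explanations, notes, or extra text.",
  "- If a segment is hard to translate, keep its original text.",
].join("\n");

// Style rules vary by content kind and are appended after the contract.
const STYLE_RULES: Record<ContentKind, string> = {
  web:
    "For web page text, preserve the original tone and intent. " +
    "Keep URLs, product names, code identifiers, and UI labels stable.",
  pdf:
    "The input is from an academic or technical PDF. " +
    "Preserve formulas, citations, references, code identifiers, " +
    "variable names, model names, dataset names, section numbers, " +
    "figure/table labels, and bibliography markers.",
  subtitles:
    "The input is timed subtitle text. " +
    "Translate naturally and concisely so each cue remains readable on screen.",
};

function buildPrompt(
  kind: ContentKind,
  targetLanguage: string,
  segments: SourceSegment[]
): { system: string; user: string } {
  return {
    system: [OUTPUT_CONTRACT, STYLE_RULES[kind], "Target language: " + targetLanguage].join("\n"),
    // The segments travel as JSON so ids reach the model unchanged.
    user: JSON.stringify({ segments }),
  };
}
```

Because `STYLE_RULES` is typed as a `Record<ContentKind, string>`, adding a new content kind without a matching rule fails at compile time.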
7. Ad Cleaning Is Part of Translation Quality
On real pages, the pollution is often not the article itself. It is ads, sponsor cards, recommendation blocks, and ad iframes.
So the extension runs an ad cleaner before translation:
- once when the page opens;
- again when the floating button is clicked;
- again for automatic translation and popup-triggered translation;
- continuously for dynamically inserted ad-like nodes.
This is not meant to be an ad blocker. It makes the segment set closer to what the user actually wants to read.
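The filtering step can be pictured as a predicate applied before segments are collected. The sketch below is purely illustrative: the `NodeInfo` stand-in, the token list, and the functions `isAdLike` and `keepReadable` are hypothetical, and the project's actual rules are not shown in this post.

```typescript
// Minimal stand-in for a DOM element; a real cleaner would inspect Element objects.
type NodeInfo = { tag: string; id: string; classNames: string[] };

// Hypothetical markers chosen for this example only.
const AD_TOKENS = ["ad", "ads", "advert", "advertisement", "sponsor", "sponsored", "promo"];

// Split an id or class name into lowercase word tokens: "ad-slot-3" -> ["ad", "slot", "3"].
function tokens(value: string): string[] {
  return value
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((t) => t.length > 0);
}

// Match whole tokens only, so "header" or "download" do not count as "ad".
function isAdLike(node: NodeInfo): boolean {
  const names = [node.id, ...node.classNames].flatMap(tokens);
  return names.some((t) => AD_TOKENS.includes(t));
}

// Keep only the nodes that should reach segment collection.
function keepReadable(nodes: NodeInfo[]): NodeInfo[] {
  return nodes.filter((n) => !isAdLike(n));
}
```

Running the same predicate at page open, on each translation trigger, and on newly inserted nodes keeps the segment set consistent across all four entry points listed above.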
8. Tests Follow the Boundaries
The project stayed movable because tests protect the split:
- dom-extractor: segment order, skip rules, site fallbacks.
- segment-renderer: bilingual, source-only, translation-only modes.
- providers: JSON contract, malformed responses, normalized errors.
- PDF: text blocks, cache, workspace params.
- YouTube: subtitle tracks, overlay, prefetch queue, ASR session.
- Playwright: extension E2E and an explicit DeepSeek smoke path.
My implementation prompts usually include verification:
First add Vitest coverage:
- provider returns Markdown json fence
- provider returns {segments:[...]}
- ids are missing but order matches sourceSegments
Then implement the smallest parser change.
Run the specific test file.
Do not change unrelated modules.
That stops “make it more robust” from turning into an accidental provider-layer rewrite.
Current State
The stable capability is still html-page translation:
- viewport-first floating button;
- full-page popup action;
- bilingual, source-only, translation-only modes;
- DeepSeek defaults, OpenAI-compatible APIs, custom Base URL, and a traditional provider.
Experimental paths:
- PDF: an extension workspace for documents with extractable text layers; OCR is reserved.
- YouTube: timedtext subtitles can render a translated overlay; no-caption ASR remains experimental.
Takeaway
AI writes code quickly, but boundaries decide whether a project can grow.
For this extension, the important boundaries are:
- segment is the smallest stable translation unit.
- provider never touches the DOM.
- target keeps PDF and YouTube out of the web-page path.
- renderer only places results back in the right location.
- Prompting is part of the interface contract.
Codex was most useful when it helped turn each boundary into code, tests, and documentation. It brought speed; I kept the direction.