PDF Text Is Not Text: What I Learned Building rihaPDF

The Starting Point

The problem was not abstract. Editing PDFs is a normal office task here. A letter has the wrong date. A form needs a few fields filled. A name needs correcting. A signature or stamp needs to be added. A sensitive section needs to be removed before sending the file onward.

The usual tools people used were desktop editors: Acrobat, Foxit, and a few others in that class. They were the usual tools because they were the ones that actually worked with Thaana. Acrobat and Foxit are capable, but for an average person or a small office in the Maldives they are also expensive and heavy for what is often a small edit. The cheaper desktop tools I tried were unreliable. The web-based editors I tested, commercial or otherwise, did not handle Thaana source text.

The people I had in mind were not PDF engineers. They were office users who just needed to fix a document without sending it somewhere else.

My first idea was not to build rihaPDF from scratch. I looked at patching Stirling-PDF with better RTL and Thaana support. That seemed like the more obvious route: take an existing PDF project and improve the text editing path.

The problem was the product shape and the editing model. Stirling-PDF is a broad server-side PDF toolkit, not something centered around source text editing on a PDF. Text editing existed, but it was closer to a beta feature than the core of the project. What I needed was a local, browser-only editor built around source text editing from the start: which glyphs came from which text-show operation, which font owned them, which direction the text should edit in, where the caret should land, and which original PDF operators had to disappear on save.

That is when rihaPDF became a scratch build. I could not find a browser-only library or editor that exposed the source-text editing model I needed, even for English. The browser tools I found were overlay tools: draw text or shapes over a rendered page, then export something that looks edited. That is not the same problem as editing source PDF text.

The first technical target still sounded modest: render a PDF in the browser, let me click existing text, edit it, and save. I thought the hard parts would be the UI around the canvas and maybe font handling.

That was not where the time went.

The pattern was always the same: a working-looking edit would land, a real Maldivian PDF would expose a display or save mismatch, and the model would have to move down one layer.

The First False Success

The initial implementation used pdf.js to render the page and extract text content. That gives you text items with positions, transforms, font names, and strings. It is enough to put clickable spans over a canvas and make an edit box appear.

That version felt closer to done than it was.

The first version immediately had to deal with problems I had treated as later details: preserving formatting, hiding original glyphs while editing, resolving source font names, moving text by injecting Tm, detecting Office fake-bold via Tr 2, shearing italic text, and preserving move offsets across later edits.

One early revert matters. I tried going lower-level early, but the output was worse, so I backed the change out. pdf-lib's drawText was still safer until I had a reliable content-stream model. That became a pattern: use the higher-level API until a real PDF proves exactly why it is insufficient, then replace that piece only.

Moving text was the simplest successful surgery. If the user only dragged a run and did not edit the string, I could keep the original Tj / TJ and insert a new Tm before it. That preserved the original font and glyph bytes. It also showed the limit of that shortcut: once underline, strikethrough, rich text, or paragraph edits entered the picture, moving only the text operator was not enough because decoration and surrounding layout stayed behind.
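As a rough illustration of the move-only case, here is a minimal sketch that inserts a translated Tm in front of an existing text-show operation. The ContentOp shape, the TextMatrix type, and the moveShowOp helper are hypothetical names for this example, not rihaPDF's actual code.

```typescript
// Hypothetical shape for one parsed content-stream operation.
interface ContentOp {
  operator: string;   // e.g. "Tm", "Tj", "TJ"
  operands: string[]; // raw operand tokens, kept verbatim
}

// A PDF text matrix as its six numbers: a b c d e f.
type TextMatrix = [number, number, number, number, number, number];

// Format a number as plain decimal, since PDF numbers cannot use exponents.
function pdfNum(n: number): string {
  return Number(n.toFixed(4)).toString();
}

// Return a new op list with a translated Tm inserted before the show op.
// The show op itself (and therefore its font and glyph bytes) is untouched.
function moveShowOp(
  ops: ContentOp[],
  showIndex: number,
  current: TextMatrix,
  dx: number,
  dy: number,
): ContentOp[] {
  const [a, b, c, d, e, f] = current;
  const movedTm: ContentOp = {
    operator: "Tm",
    operands: [a, b, c, d, e + dx, f + dy].map(pdfNum),
  };
  return [...ops.slice(0, showIndex), movedTm, ...ops.slice(showIndex)];
}
```

The sketch only handles the show operation itself. Any later operators that position themselves relative to the moved line would need separate handling, which is the same limit described above for decorations and surrounding layout.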

That was the first trap: pdf.js gave me enough information to make a believable UI before I had a correct editing model.

Text Ownership

The first real wall was ownership. pdf.js gives rendered text items. The PDF page contains content-stream operators. They are related, but not one-to-one.

A page might have:

BT
/F12 11 Tf
1 0 0 1 72 640 Tm
[(...) -120 (...)] TJ
ET

The browser sees a line of text. pdf.js might split it into multiple text items. The user sees one editable phrase. The save path has to know which exact Tj or TJ operator drew the glyphs that phrase owns.

That led to the first custom parser: tokenize page content streams, preserve unknown operations as pass-through, and understand just enough PDF text state to track the operators I need to touch:

BT / ET for text objects;
Tf for active font and size;
Tm, Td, TD, T* for text position;
Tr for rendering mode;
Tj, TJ, ', " for text showing;
q / Q because text state is part of graphics state.

That last detail caught real output. Office sometimes simulates bold with Tr 2 inside a q...Q block. If the parser does not push and pop text rendering mode with graphics state, fake-bold leaks into later text and the editor thinks half the page is bold.
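A minimal sketch of that push-and-pop behaviour, assuming a hypothetical StreamOp shape for already-tokenized operations. It records which text rendering mode is in effect at each text-show operator and restores the mode when Q pops the graphics state.

```typescript
// Hypothetical shape for one tokenized content-stream operation.
interface StreamOp {
  operator: string;
  operands: string[];
}

// For each text-show op, in stream order, report the Tr mode in effect.
function renderModesAtShows(ops: StreamOp[]): number[] {
  const showOperators = new Set(["Tj", "TJ", "'", '"']);
  const savedModes: number[] = []; // one entry per unmatched q
  let mode = 0;                    // 0 = fill, the PDF default
  const modes: number[] = [];

  for (const op of ops) {
    if (op.operator === "q") {
      savedModes.push(mode);
    } else if (op.operator === "Q") {
      // Restore the mode saved by the matching q; ignore a stray Q.
      mode = savedModes.pop() ?? mode;
    } else if (op.operator === "Tr") {
      mode = Number(op.operands[0] ?? 0);
    } else if (showOperators.has(op.operator)) {
      modes.push(mode);
    }
  }
  return modes;
}
```

Without the savedModes stack, the second show operation in a stream like q, 2 Tr, Tj, Q, Tj would still report mode 2, which is the fake-bold leak described above.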

The next layer was text ownership work: recover long fili through /ToUnicode, use show-driven overrides, assign extracted items to the closest same-script text-show operation, expand overlay bounds across all items claimed by one show, and filter false-positive glyph clusters from images and antialiasing speckles.

This mattered because editing the wrong operator is worse than failing to edit. If rihaPDF stripped too little, old glyphs leaked into the saved file. If it stripped too much, neighbouring text disappeared. The editor had to know what it owned.
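One way to picture the "closest same-script text-show operation" rule is the sketch below. The types, the two-value script tag, and the plain Euclidean distance are simplifying assumptions for illustration; the real heuristics described above are more involved.

```typescript
type ScriptTag = "thaana" | "latin";

// Hypothetical extracted text item (from the renderer side).
interface ExtractedItem {
  id: string;
  x: number;
  y: number;
  script: ScriptTag;
}

// Hypothetical summary of one text-show op (from the content-stream side).
interface ShowOpInfo {
  opIndex: number;
  x: number;
  y: number;
  script: ScriptTag;
}

// Map each item id to the op index of the nearest show op of the same script.
// Items with no same-script candidate are left unassigned on purpose.
function assignToShows(items: ExtractedItem[], shows: ShowOpInfo[]): Map<string, number> {
  const owner = new Map<string, number>();
  for (const item of items) {
    let bestIndex = -1;
    let bestDistance = Infinity;
    for (const show of shows) {
      if (show.script !== item.script) continue;
      const distance = Math.hypot(show.x - item.x, show.y - item.y);
      if (distance < bestDistance) {
        bestDistance = distance;
        bestIndex = show.opIndex;
      }
    }
    if (bestIndex >= 0) owner.set(item.id, bestIndex);
  }
  return owner;
}
```

Leaving an item unassigned is deliberate in this sketch: refusing to edit is cheaper than claiming an operator the item does not own.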

That is where the project stopped being “text items over a canvas.” It became “rendered items are evidence, but source operators are authority.”

Preview Had to Stop Lying

Early editing hid the original glyphs with overlays. That is fine as a temporary visual trick, but it creates a bad debugging loop. The editor can look right while the canvas underneath still contains the old PDF pixels.

The fix was to make the live preview actually remove originals instead of covering them. That sounds small, but it changed how I judged the editor. The preview had to go through the same kind of strip-and-render thinking as save. If the active edit box, committed overlay, preview canvas, and saved/reopened PDF do not agree, the editor is not WYSIWYG in any useful sense.

This became the core loop:

Extract source text and source operator ownership.
Open the editor at the current visual position.
Hide or strip the original glyphs that the edit owns.
Commit the edit into app state.
Rewrite the PDF content stream on save.
Reopen the saved PDF and compare what pdf.js renders.

Most bugs lived between those steps.

Saved Output Had to Be Real

The save path now reads raw page content streams from the copied PDF, parses them into operations, edits the operation list, serializes it, and writes it back into the page.

For source text, the save path has to handle delete, move-only, edit, paragraph edit, and redaction. Each case has a different boundary problem: which operators can be preserved, which must be removed, and which replacements must be drawn.

The basic edit path is not “paint white, then draw black.” It is closer to:

parse page content stream
find text-show ops owned by this source run
drop those ops, plus paired decoration ops
register replacement font in page /Resources
draw shaped replacement operators at the edited position
serialize stream back to the page

Some parts had to stay conservative. If a source run is inside a Form XObject or if several extracted items were produced by one text-show operation, the bounds and ownership have to expand to the operator, not just the clicked item. If the editor strips too narrowly, old glyphs remain. If it strips too broadly, neighbouring text disappears. A lot of the early work was finding the least wrong boundary for real files.
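The strip-and-serialize part of that path can be sketched as follows. PageOp and stripAndAppend are hypothetical names, and appending the replacement as its own q BT ... ET Q block at the end of the stream is a simplification chosen for this example, not necessarily where the real save path puts it.

```typescript
// Hypothetical shape for one parsed content-stream operation.
interface PageOp {
  operator: string;
  operands: string[];
}

// Serialize one operation as "operand operand ... operator".
function serializeOp(op: PageOp): string {
  return [...op.operands, op.operator].join(" ");
}

// Drop every op whose index is owned by the edited run, then append the
// replacement ops wrapped in their own graphics-state and text-object pair.
function stripAndAppend(ops: PageOp[], owned: Set<number>, replacement: PageOp[]): string {
  const kept = ops.filter((_op, index) => !owned.has(index));
  const wrapped: PageOp[] = [
    { operator: "q", operands: [] },
    { operator: "BT", operands: [] },
    ...replacement,
    { operator: "ET", operands: [] },
    { operator: "Q", operands: [] },
  ];
  return [...kept, ...wrapped].map(serializeOp).join("\n");
}
```

The owned set is where the boundary decisions live: it should already have been expanded to whole operators (and paired decorations) before this function runs.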

Images and vectors added the same issue in another form. Moving an image is often a cm matrix problem around an XObject Do. Deleting a simple vector shape means removing a detected q...Q range. Redacting a vector path is more awkward because the visible path and the paint operation are not always neatly isolated. In those cases the safe fallback is to remove more than the rectangle visibly covers rather than leave recoverable content.

Thaana Made Every Shortcut Visible

Thaana did not make rihaPDF narrower. It made the editor more honest. Latin-only PDFs let many shortcuts survive longer. Thaana exposed them immediately.

First, there were no libraries or examples I could copy for source-text PDF editing in the browser. Not only for Thaana; I could not find that abstraction for English either. Thaana made the absence much more painful because naive text assumptions fail immediately: combining marks, fili, sukun, right-to-left layout, mixed English/Dhivehi text, and old font behavior all make “just draw the string” fall apart.

Second, real PDFs were inconsistent. One fixture had a broken /ToUnicode mapping for aabaafili. Another had mixed Thaana and English, images, and a boundary fili mapping bug. Non-Office PDFs exposed end-of-block sukun recovery problems. Form XObjects changed coordinate assumptions. These were not theoretical PDF edge cases; they were normal documents.

Font handling also had to become explicit. A browser might not have a good Thaana font installed. The saved PDF cannot depend on the user’s system fonts. rihaPDF ended up bundling Thaana fonts, using @font-face in the browser, and embedding the selected family in the output PDF.

The HarfBuzz path came only after I had a concrete reason to distrust the simpler save path. At one point the docs said more than the code did, so I removed the claim and then implemented the actual shaped save path in phases:

shape Thaana replacement runs through harfbuzzjs;
use shaped output for FreeText annotation appearance streams;
segment mixed-script text with bidi-js.

The PDF detail that mattered: Tf takes a page resource name, not the font’s BaseFont name. One raw shaped-output attempt used the wrong name and rendered garbage. The fixed path registers the embedded font in the page resources and uses that returned resource key.
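To make the resource-name point concrete, here is a small sketch of a per-page font registry. The PageFontResources class and the example BaseFont string are hypothetical; the only PDF fact relied on is that the operand of Tf is a key in the page's /Resources /Font dictionary.

```typescript
// Hypothetical in-memory view of a page's /Resources /Font dictionary.
class PageFontResources {
  private readonly usedKeys = new Set<string>();
  private readonly keyByBaseFont = new Map<string, string>();

  // existingKeys: resource keys the page already uses, e.g. ["F1", "F12"].
  constructor(existingKeys: string[] = []) {
    for (const key of existingKeys) this.usedKeys.add(key);
  }

  // Register a font and return the resource key that Tf must reference.
  register(baseFont: string): string {
    const known = this.keyByBaseFont.get(baseFont);
    if (known !== undefined) return known;
    let n = 1;
    while (this.usedKeys.has(`F${n}`)) n++;
    const key = `F${n}`;
    this.usedKeys.add(key);
    this.keyByBaseFont.set(baseFont, key);
    return key;
  }
}

// Build the Tf operator from the resource key, never from the BaseFont name.
function tfOperator(resourceKey: string, size: number): string {
  return `/${resourceKey} ${size} Tf`;
}
```

For example, registering a font whose BaseFont is "ABCDEF+ExampleThaana" on a page that already uses F1 yields the key F2, and the stream then contains /F2 11 Tf rather than anything derived from the BaseFont string.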

The shaped output itself is also not just one drawText call. HarfBuzz returns glyph IDs and positioning. The save path emits raw PDF text operators against an embedded Type 0 / Identity-H font. For Thaana, the output sometimes needs per-glyph Tm placement so mark positioning and extraction behavior do not fight each other. Visual correctness and extracted logical order are not the same target, and mixed-script output still has caveats because extractors can regroup adjacent text operators.
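A minimal sketch of per-glyph placement, under several assumptions: the ShapedGlyph shape stands in for a shaper's output, glyphs arrive in left-to-right visual order, positions are in font design units, and the embedded font maps character codes to glyph IDs one-to-one so a glyph ID can be written directly as a two-byte hex code.

```typescript
// Hypothetical shaped glyph: glyph ID plus advance/offsets in font units.
interface ShapedGlyph {
  glyphId: number;
  xAdvance: number;
  xOffset: number;
  yOffset: number;
}

// Plain-decimal formatting; PDF numbers cannot use exponent notation.
function fmtCoord(n: number): string {
  return Number(n.toFixed(4)).toString();
}

// Emit one Tm + Tj pair per glyph so each mark lands at its shaped position.
function emitGlyphOps(
  glyphs: ShapedGlyph[],
  originX: number,
  originY: number,
  fontSize: number,
  unitsPerEm: number,
): string[] {
  const scale = fontSize / unitsPerEm;
  const lines: string[] = [];
  let penX = originX;
  for (const glyph of glyphs) {
    const x = penX + glyph.xOffset * scale;
    const y = originY + glyph.yOffset * scale;
    const hex = glyph.glyphId.toString(16).toUpperCase().padStart(4, "0");
    lines.push(`1 0 0 1 ${fmtCoord(x)} ${fmtCoord(y)} Tm`);
    lines.push(`<${hex}> Tj`);
    penX += glyph.xAdvance * scale; // marks usually have a zero advance
  }
  return lines;
}
```

Placing every glyph with its own Tm is the heavy-handed option. It keeps mark offsets exact, at the cost of a larger stream and of leaving text extractors to regroup the pieces on their own terms.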

Bidi Was Not One Bug

The bidi work was not a single fix. It was a series of small, visible failures.

One case was dates. A slash date such as 14/1/2026 can behave like an LTR island inside RTL text. If the isolation is too wide, the closing parenthesis joins the date. If it is too narrow, the slashes and digits reorder. Another case was list markers: .4 and 3-1 look tiny, but if the dot or dash crosses the bidi boundary, the line looks obviously wrong. The comma cases were worse because they passed text assertions while moving to the wrong visual side of a Thaana word.
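For the browser-side contexts, one possible way to keep a slash date as a tight LTR island is to wrap only the digits and slashes in Unicode isolate controls. This is a sketch of the idea, not the approach rihaPDF necessarily uses, and the date pattern is an assumption that covers forms like 14/1/2026 only.

```typescript
const LRI = "\u2066"; // LEFT-TO-RIGHT ISOLATE
const PDI = "\u2069"; // POP DIRECTIONAL ISOLATE

// Wrap each slash date in an LTR isolate. The match covers digits and
// slashes only, so adjacent parentheses and commas stay outside the island.
function isolateSlashDates(text: string): string {
  return text.replace(/\d{1,2}\/\d{1,2}\/\d{2,4}/g, (date) => LRI + date + PDI);
}
```

The width of the regular expression is the whole point: widening it to swallow a closing parenthesis reproduces the "parenthesis joins the date" failure. The PDF drawing path has no bidi engine to interpret these controls, so it needs its own segmentation that produces the same island boundaries.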

The caret was its own problem. Browser caret placement follows browser layout. Source PDF text follows PDF glyph positions. Clicking near a glyph had to map through PDF-derived glyph edges, not just rely on where an HTML input would put the cursor.
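A simplified sketch of mapping a click to a caret offset through glyph edges, for a purely right-to-left run. The GlyphEdge shape is hypothetical, and the sketch assumes one box per logical character, which ignores combining marks and mixed-direction runs.

```typescript
// Hypothetical PDF-derived horizontal extent of one logical character.
interface GlyphEdge {
  charIndex: number; // logical index in the run's string
  left: number;      // x of the glyph's left edge, in page units
  right: number;     // x of the glyph's right edge, in page units
}

// Return the logical caret offset whose boundary is nearest to clickX.
// In an RTL run the caret before a character sits at its right edge,
// and the caret after it sits at its left edge.
function caretFromClickRtl(edges: GlyphEdge[], clickX: number): number {
  let bestOffset = 0;
  let bestDistance = Infinity;
  for (const edge of edges) {
    const candidates = [
      { offset: edge.charIndex, x: edge.right },
      { offset: edge.charIndex + 1, x: edge.left },
    ];
    for (const candidate of candidates) {
      const distance = Math.abs(candidate.x - clickX);
      if (distance < bestDistance) {
        bestDistance = distance;
        bestOffset = candidate.offset;
      }
    }
  }
  return bestOffset;
}
```

The returned offset can then be applied to the editor selection explicitly, instead of trusting where the browser would have placed the cursor for the same click.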

These looked like punctuation bugs. They were actually layout-contract bugs. The string value could be unchanged while the editor felt broken.

The important lesson was that there are at least three bidi contexts:

the browser contenteditable context;
the committed HTML overlay context;
the PDF drawing context.

They do not share a layout engine. Getting one right does not mean the others are right. The saved PDF path cannot rely on Chromium’s bidi behavior. It has to tokenize, segment, shape, and draw in a way that produces the same visual result.

Why Lexical Entered the Project

Plain inputs were enough for early single-run edits. They were not enough for source paragraphs.

Real documents do not expose paragraphs as paragraphs. A paragraph is often a set of separately positioned PDF text runs that happen to form visual lines. The editor has to group those runs, preserve line geometry, allow rich formatting, and avoid merging table rows that merely look close together.

Lexical was not added because a rich editor was nice to have. It was added because source paragraph editing needed a real contenteditable model: styled spans, selections, partial bold/italic changes, paragraph text, and browser input behavior that a plain input could not represent.

But Lexical did not solve PDF layout by itself. In fact, it introduced another renderer into the system. I still had to prepare source text for the editor, preserve PDF-derived line geometry, keep the committed overlay close to the active editor, and draw the saved PDF separately.

The failed attempts are the useful part.

I tried reconstructing source edit text from visual glyph spans. That sounded more correct because it was closer to the rendered page. It made ordering worse and was reverted. The span data was not a proven visual-order source of truth; some of it had already passed through extraction and recovery logic.

I tried leaning on browser justification. It looked plausible because source paragraphs were often justified. It was wrong because PDF text positioning and browser text justification do not stretch lines the same way.

I tried isolating too many bidi spans in the committed overlay. That fixed some small cases and broke others because the overlay stopped matching the live editor after edits.

The current approach is more boring and more explicit: source paragraph grouping records line layout, table-like neighbours are guarded against merging, resized boxes are treated as real layout changes, and save draws wrapped replacement lines rather than cropping or covering old content.

WYSIWYG Became a Test Problem

The test suite grew because string tests kept missing the real failures.

A string assertion can pass while the comma is on the wrong side of a Thaana word. DOM text can be right while the saved PDF reopens with different line breaks. pdf.js extraction can recover the expected codepoints while the visual output is wrong.

The visual test workflow compares crops across states: original/source canvas, active editor, committed overlay, and saved/reopened PDF. It is not mathematically perfect because browser HTML and pdf.js canvas are different renderers, but it catches the failures that mattered: line reordering, indentation drift, punctuation movement, wrong anchoring, and save output that does not match the editor.
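The crop comparison can be approximated with a tolerance-based pixel diff like the sketch below. The RgbaCrop shape, the luminance weights, and the idea of reporting a ratio are assumptions for this example; a test would pick its own tolerance and threshold per state pair.

```typescript
// Hypothetical crop: RGBA bytes in row-major order, 4 bytes per pixel.
interface RgbaCrop {
  width: number;
  height: number;
  data: Uint8ClampedArray;
}

// Approximate perceived brightness of the pixel starting at byte offset.
function luminanceAt(data: Uint8ClampedArray, offset: number): number {
  const r = data[offset] ?? 0;
  const g = data[offset + 1] ?? 0;
  const b = data[offset + 2] ?? 0;
  return 0.299 * r + 0.587 * g + 0.114 * b;
}

// Fraction of pixels whose luminance differs by more than the tolerance.
function differingPixelRatio(a: RgbaCrop, b: RgbaCrop, tolerance: number): number {
  if (a.width !== b.width || a.height !== b.height) {
    throw new Error("crops must have the same dimensions");
  }
  const total = a.width * a.height;
  if (total === 0) return 0;
  let differing = 0;
  for (let pixel = 0; pixel < total; pixel++) {
    const offset = pixel * 4;
    const delta = Math.abs(luminanceAt(a.data, offset) - luminanceAt(b.data, offset));
    if (delta > tolerance) differing++;
  }
  return differing / total;
}
```

A non-zero tolerance absorbs antialiasing differences between the HTML and canvas renderers, while a moved comma or a reordered line still changes enough pixels to push the ratio over a threshold.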

The suite now has 115 E2E tests and 39 unit tests. The number is less interesting than what ended up needing coverage: content-stream parsing, text-show state, source font ownership, source paragraph grouping, RTL display normalization, redaction glyph planning, image sanitization, vector stripping, annotation clipping, AcroForm cleanup, and browser workflows on real Maldivian fixtures.

The Probe Scripts

Before a lot of that became tests, it was a pile of one-off scripts. That is probably the clearest evidence of how much of the project was just trying to see what was actually happening.

The scripts fell into four buckets.

The first bucket answered: what did the PDF actually contain? Those scripts dumped content streams, text items, run-to-operator ownership, font mappings, /ToUnicode maps, image XObjects, and AcroForm fields. When something looked wrong on screen, I usually needed two views of it: what pdf.js thought it extracted, and what the PDF content stream actually contained.

The second bucket answered: what did the user actually see? One script scanned rendered canvas pixels for dark glyph clusters, then checked whether each cluster had a clickable data-run-id overlay under it. Another generated side-by-side crops: PDF render only, overlay visible, editor open, and post-commit state. A full-page screenshot hides tiny failures; a cropped run shows the edit box moving a few pixels, the overlay being too narrow, or old glyphs leaking around the replacement.

The third bucket asked whether a failure class was real or imaginary. The fili coverage script counted Thaana vowel marks across pages, because one Office-generated PDF mapped aabaafili through broken /ToUnicode data. The sukun probe checked whether CIDs mapping to U+0020 fit the old recovery rule or represented a different failure shape. The font probe walked embedded font programs, cmap subtables, post glyph names, and fonts without /ToUnicode.
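A probe in the spirit of the sukun check can be very small. The sketch below scans the text of a /ToUnicode CMap for single-code mappings to a given code point. It is an illustration with a hypothetical function name, and it only reads bfchar sections; bfrange sections are ignored here.

```typescript
// Return the source codes (as uppercase hex) that a ToUnicode CMap maps
// to the given code point via bfchar entries.
function cidsMappedTo(cmapText: string, codePoint: number): string[] {
  const target = codePoint.toString(16).toUpperCase().padStart(4, "0");
  const hits: string[] = [];
  const sectionPattern = /beginbfchar([\s\S]*?)endbfchar/g;
  let section: RegExpExecArray | null;
  while ((section = sectionPattern.exec(cmapText)) !== null) {
    const body = section[1] ?? "";
    const pairPattern = /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g;
    let pair: RegExpExecArray | null;
    while ((pair = pairPattern.exec(body)) !== null) {
      const source = (pair[1] ?? "").toUpperCase();
      const destination = (pair[2] ?? "").toUpperCase();
      if (destination === target) hits.push(source);
    }
  }
  return hits;
}
```

Calling cidsMappedTo(cmapText, 0x20) lists every code that claims to be a space. More than one hit in a Thaana font is a hint that some of those codes are really marks whose mapping was lost.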

That is also where the gazette sweep came from. It downloaded ranges of PDFs from gazette.gov.mv and ran the font probe across them to find real documents that lacked /ToUnicode in a way that would require a glyph-name fallback. That was not product work. It was an attempt to avoid building a fallback from imagination. If a failure class did not exist in a public corpus, it could stay lower priority until a real fixture appeared.

The fourth bucket was round-trip verification: save a file, reopen it in the app, screenshot the rendered result, dump the recovered run text, or parse the saved bytes with pdf-lib and inspect /V values in AcroForm fields.

Those scripts were not polished. They were measuring instruments. Most of the useful tests started as one of these probes after the same bug showed up twice.

What I Would Not Do Again

I would not trust glyph span order without first proving character-to-glyph pairing against the rendered page.

I would not let browser justification drive PDF source paragraph layout. PDF positioning and browser paragraph layout are different systems, even when they happen to look similar for one fixture.

I would not treat unchanged text as a no-op if rich formatting changed. A string can be identical while the edit payload is meaningfully different.

I would not test RTL source editing only with string assertions. For this problem, pixels are part of the API.

Other Boundaries That Broke

Once text editing started working, the surrounding PDF features forced the same lesson repeatedly.

Forms were not just HTML inputs over a page. Saving fillable PDFs meant writing /V values correctly, using UTF-16BE for Thaana where needed, and rebuilding /Root /AcroForm /Fields after pages were copied so other viewers still saw interactive fields.
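The UTF-16BE detail is easy to show. A PDF text string that cannot be expressed in the default encoding is written as UTF-16BE with a leading byte-order mark (FE FF). The helper name below is hypothetical; the sketch produces the hex-string form of such a value.

```typescript
// Encode a JavaScript string as a PDF hex string in UTF-16BE with a BOM.
// JavaScript strings are already sequences of UTF-16 code units, so each
// unit (including surrogate halves) becomes four hex digits.
function pdfTextStringHex(value: string): string {
  let hex = "FEFF"; // byte-order mark: marks the string as UTF-16BE
  for (let i = 0; i < value.length; i++) {
    hex += value.charCodeAt(i).toString(16).toUpperCase().padStart(4, "0");
  }
  return `<${hex}>`;
}
```

For a two-character Thaana value such as U+0783 U+07A8, this yields <FEFF078307A8>, which is the kind of value a /V entry would carry for that field.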

Annotations were not just overlay boxes. Highlights, ink, and FreeText comments needed to save as native /Annot objects, and Thaana FreeText appearance streams needed the shaped text path too.

Redaction was not just a black rectangle. Text, images, vector paths, annotations, form values, and appearance streams can all carry recoverable content. For redaction, the failure mode is not visual ugliness. The failure mode is leaking content the user believed was gone. Unsupported cases intentionally over-strip. That is not pretty, but under-stripping is worse.

Mobile was not just responsive CSS. Touch drag conflicted with scroll, browser zoom broke fixed chrome, and the soft keyboard changed the visible viewport. That led to touch-hold drag, app-owned zoom, visual-viewport anchored toolbars, and mobile-specific Thaana input.

None of these are the central story, but they explain why the codebase had to be split apart. Document I/O, save orchestration, PDF internals, domain primitives, page overlays, form fields, and visual tests all needed their own boundaries. That was not cleanup for its own sake. The first version of the app could not keep all these concerns in one component tree.

Where It Landed

rihaPDF is not a claim that browser PDF editing is solved.

It is a record of the contracts I had to make explicit. The hard part was not putting an input on a canvas. The hard part was making these surfaces agree:

what the source PDF painted;
what pdf.js extracted;
what the browser editor displays;
where the caret lands;
what the committed overlay shows;
what the save pipeline writes;
what the saved PDF renders after reopening;
what text extraction gets back afterward.

Thaana made that impossible to ignore. There was no browser-only source-text PDF editor or library to copy in the first place, and the usual browser text and PDF-editing assumptions did not hold once Dhivehi documents entered the test set. The project had to build its own path through rendering, extraction, bidi, shaping, content-stream rewriting, and visual regression testing.

The UI still says the simple thing: click text, edit, save.

The code exists because none of those words mean what they seem to mean in a PDF.

Technical Notes

rihaPDF is built with React, TypeScript, Vite, pdf.js, pdf-lib, HarfBuzz, bidi-js, Lexical, and Playwright. The current test suite has 115 E2E tests and 39 unit tests. It runs entirely in the browser and is deployed through Cloudflare Workers Static Assets.

Source: github.com/yashau/rihaPDF. Demo build: rihapdf.yashau.com.