PDFs are not documents — they're instructions. Here's what that means for you.
Most people treat PDF as just "a kind of document." But the way it stores information is fundamentally different from Word or Google Docs — and that difference explains almost every frustrating thing that happens when you try to compress, convert, extract text from, or edit a PDF and get unexpected results.
This page covers what a PDF actually is, why compression works differently on different kinds of PDFs, and which operations are completely lossless versus which ones touch quality. Then the toolkit below gives you eight operations — compress, split, merge, rotate, watermark, extract text, convert to images, and view metadata — all free, no sign-up.
A PDF is a set of page-drawing instructions, not stored text
PDF stands for Portable Document Format, and the portable part is the key. The goal when Adobe invented it in 1993 was to create a file that looks identical on every device and every printer, regardless of what fonts or software the recipient has. To do that, the format doesn't store "here is a paragraph of text with this style." It stores something closer to: "draw these character shapes at these exact x/y coordinates on this page, using this embedded font subset."
The result is a format that is excellent for distribution — you know exactly how it will look — but awkward for editing. There is no concept of "paragraph" or "heading" baked into most PDFs. There are just positioned objects: text blocks, vector shapes, and raster images, all layered together into a page description.
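To make "positioned objects" concrete, here is a minimal sketch that assembles a one-page PDF by hand using only the Python standard library. The function name `build_minimal_pdf` and the chosen coordinates are illustrative assumptions, not part of any tool described on this page. The line to look at is the content stream: it selects a font, moves to an x/y position, and paints a string. Nothing in it says "paragraph" or "heading".

```python
def build_minimal_pdf(text: str) -> bytes:
    """Assemble a one-page PDF by hand to expose its page-drawing instructions.

    Assumption: `text` contains no parentheses or backslashes, which would
    need escaping inside a PDF literal string.
    """
    # Content stream: font F1 at 24 pt, move to x=72, y=720, paint the glyphs.
    content = f"BT /F1 24 Tf 72 720 Td ({text}) Tj ET".encode("latin-1")

    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
        b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        b"<< /Length %d >>\nstream\n" % len(content) + content + b"\nendstream",
        b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
    ]

    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object, needed by xref
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"

    # Cross-reference table: one fixed-width 20-byte entry per object.
    xref_offset = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n%%%%EOF\n" % xref_offset
    return bytes(out)
```

Writing the returned bytes to a file ending in `.pdf` should give a page that opens in a typical viewer. Real exporters emit far more objects (embedded font subsets, compressed streams, metadata), but the principle is the same.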
What's actually inside a PDF file
Open a PDF in a hex editor and you'll find it contains several distinct types of data depending on how it was created. A PDF exported from Word contains embedded font subsets (just the characters used, not the full font), vector paths for any lines or boxes, and compressed content streams that position the text. A scanned PDF contains none of that. It is essentially a series of page images, one per page, wrapped in a PDF container; these are typically stored as JPEG, CCITT fax, or JBIG2 image streams. That distinction matters enormously for compression.
Why this matters for compression: if your PDF was created digitally — exported from Word, Google Docs, or a design tool — it contains structured text and vectors. These compress extremely well with lossless algorithms. If your PDF is a scan, it contains images. Those compress differently, and aggressive compression degrades them visibly.
How PDF compression actually works — and why it's not all the same
When you click "compress PDF," you are not doing one thing. You're potentially doing several different things simultaneously, depending on what the PDF contains.
Text content and vector graphics inside a PDF are usually already compressed using Flate compression — the same underlying algorithm as ZIP files. This is lossless, meaning no quality is lost. If your PDF is already well-optimised for text, compressing it again won't do much because there isn't much slack to remove.
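A small experiment with Python's `zlib` module, which implements the same deflate algorithm, illustrates both points: the round trip is exact, and a second pass over already-compressed data gains very little. The sample content stream below is synthetic and the helper names are made up for this sketch.

```python
import random
import zlib


def sample_content_stream(lines: int = 400, seed: int = 0) -> bytes:
    """Build a synthetic, text-heavy content stream (illustrative data only)."""
    rng = random.Random(seed)
    words = ["invoice", "total", "quarter", "report", "summary", "amount", "page", "draft"]
    out = []
    for i in range(lines):
        phrase = " ".join(rng.choice(words) for _ in range(6))
        out.append(f"BT /F1 11 Tf 72 {720 - (i % 50) * 14} Td ({phrase}) Tj ET")
    return "\n".join(out).encode("ascii")


def flate_sizes(raw: bytes) -> tuple[int, int, int]:
    """Return (raw size, size after one pass, size after a second pass)."""
    once = zlib.compress(raw, 9)
    assert zlib.decompress(once) == raw  # lossless: every byte comes back
    twice = zlib.compress(once, 9)       # compressing compressed data
    return len(raw), len(once), len(twice)


if __name__ == "__main__":
    print(flate_sizes(sample_content_stream()))
```

The first pass shrinks the stream substantially. The second pass has almost nothing left to remove, which is why a text-heavy PDF often barely changes size when it is "compressed" again.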
Where compression actually has impact is on embedded images. A PDF might contain photographs or diagrams that were originally saved at much higher resolution than needed for screen reading or typical printing. The compression tool resamples and re-encodes these images at a lower quality setting. Moderate compression (70–80% quality) is usually invisible to the human eye. Below 50%, you start to see blocky artefacts around edges, especially on scanned text.
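The resampling half of that process can be sketched in a few lines. The example below halves a tiny grayscale image by averaging each 2x2 block. It is a deliberately simple box filter written for illustration; real tools generally use better filters and then re-encode the result (for example as JPEG), which this sketch does not attempt.

```python
def downsample_2x(pixels: list[list[int]]) -> list[list[int]]:
    """Halve width and height by averaging each 2x2 block of grey values."""
    height, width = len(pixels), len(pixels[0])
    return [
        [
            (pixels[y][x] + pixels[y][x + 1]
             + pixels[y + 1][x] + pixels[y + 1][x + 1]) // 4
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


# Top half: a fine checkerboard (like thin strokes of scanned text).
# Bottom half: large flat areas.
image = [
    [0, 255, 0, 255],
    [255, 0, 255, 0],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]

small = downsample_2x(image)
```

The flat areas survive intact, but the fine checkerboard collapses to a uniform grey. That detail cannot be recovered from the smaller image, which is why resampling is a one-way trade.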
There are also things like unused font data, duplicate objects, and redundant metadata that can be stripped out silently with no quality impact at all. A well-written compression tool does all three things — strips metadata bloat, resamples images, and cleans up the object tree — and shows you the difference before you commit.
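Duplicate-object removal is the easiest of these to picture. A rough sketch, assuming the stream bytes have already been pulled out of the file into a dictionary keyed by object number (the function and the sample data are hypothetical):

```python
import hashlib


def deduplicate_streams(streams: dict[int, bytes]) -> dict[int, int]:
    """Map each object number to the lowest-numbered object with identical bytes."""
    first_seen: dict[str, int] = {}
    remap: dict[int, int] = {}
    for number in sorted(streams):
        digest = hashlib.sha256(streams[number]).hexdigest()
        # setdefault keeps the first object seen for each distinct content hash.
        remap[number] = first_seen.setdefault(digest, number)
    return remap


# Objects 4 and 9 hold the same logo image; 9 can point at 4 instead.
remap = deduplicate_streams({4: b"logo-bytes", 7: b"body-bytes", 9: b"logo-bytes"})
```

A real optimiser would then rewrite every reference to object 9 so that it points at object 4 and drop the redundant copy. No pixel or glyph changes, so there is no quality cost.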
The quality slider: the tool below lets you set quality from 1 to 100. A setting of 75 is a solid default for most use cases. Settings below 50 produce smaller files but visible image degradation on anything that contains scanned content or photographs. For text-only PDFs, the quality setting has almost no visible effect because text is stored as vectors, not pixels.
Which PDF operations change quality and which don't
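This is something most PDF tool sites don't explain clearly. Some operations are completely lossless: the page content they carry over (text, vector, and image streams) is byte-for-byte identical to the input, even though the surrounding file structure, such as object numbers and the cross-reference table, is rewritten. Others necessarily involve re-encoding or modification.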
| Operation | Quality effect | Why |
| --- | --- | --- |
| Split PDF | None | Pages are extracted without modification; the page streams are copied directly |
| Merge PDFs | None | Page objects are concatenated; no re-encoding happens |
| Rotate pages | None | Rotation is stored as a flag in the page dictionary, not a pixel transformation |
| View metadata | None | Read-only operation |
| Add watermark | None | Watermark text is added as a new vector layer; existing content is untouched |
| Extract text | None | Text streams are read and decoded; the source PDF is unchanged |
| Compress PDF | Images only | Image streams are re-encoded at lower quality. Text and vectors are unaffected. |
| Convert to images | Depends on DPI | Each page is rasterised (turned into pixels) at the DPI you choose. Higher DPI = sharper but larger files. |
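The DPI trade-off for converting to images is plain arithmetic. PDF page sizes are measured in points, with 72 points to the inch by default, so the pixel dimensions follow directly. The helper name below is an assumption for illustration.

```python
def raster_size(width_pt: float, height_pt: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at `dpi` (72 points = 1 inch)."""
    return round(width_pt / 72 * dpi), round(height_pt / 72 * dpi)


# US Letter is 612 x 792 points (8.5 x 11 inches).
for dpi in (96, 150, 300, 600):
    w, h = raster_size(612, 792, dpi)
    megabytes = w * h * 3 / 1_000_000  # uncompressed 8-bit RGB
    print(f"{dpi:>3} DPI: {w} x {h} px, about {megabytes:.1f} MB before encoding")
```

Doubling the DPI doubles both dimensions, so the pixel count (and the raw data to encode) grows fourfold.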
Rotation deserves a bit more explanation because people often assume rotating a page re-encodes it. It doesn't. PDF pages have a Rotate entry in their dictionary that tells the viewer how to orient the page. Changing that number from 0 to 90 doesn't touch any image or text data. The same goes for splitting and merging — these are purely structural operations on the PDF object tree.
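As a rough model, treat a page as a dictionary whose `Contents` entry holds the drawing instructions. This is a simplified stand-in for a real PDF page dictionary, and the function name is hypothetical. Rotating then amounts to changing one number:

```python
def rotate_page(page: dict, degrees: int) -> dict:
    """Return a copy of the page dictionary with its Rotate entry adjusted."""
    if degrees % 90 != 0:
        raise ValueError("PDF page rotation must be a multiple of 90 degrees")
    rotated = dict(page)  # shallow copy: the content bytes are shared, not rewritten
    rotated["Rotate"] = (page.get("Rotate", 0) + degrees) % 360
    return rotated


page = {"MediaBox": [0, 0, 612, 792], "Contents": b"BT /F1 12 Tf 72 720 Td (Hi) Tj ET"}
turned = rotate_page(page, 90)
```

After the call, `turned["Contents"]` is the very same bytes object as before. Only the `Rotate` value differs, and rotating by a further 270 degrees brings it back to 0.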
For archival purposes: if you're building a PDF archive that needs to stay readable indefinitely, prefer split, merge, rotate, and watermark over compression. Compression is a trade-off between file size and image quality that you can't reverse once applied.
Scanned PDFs and why text extraction sometimes fails
When you scan a physical document, the scanner produces an image — a photograph of the page. Saving that as a PDF gives you a PDF that contains an image. That image is not searchable. There is no text layer. The word "invoice" at the top of the scanned page is just a pattern of dark and light pixels from the scanner's perspective.
To extract text from a scanned PDF, OCR (Optical Character Recognition) has to run first — software that analyses the pixel patterns and guesses which characters they represent. Good OCR on a clean, well-lit scan of a clear typeface is surprisingly accurate. OCR on a faded photocopy of a handwritten form is not. The quality of the original scan is the limiting factor, not the software.
Digitally-created PDFs — anything exported from Word, Excel, Google Docs, or a modern application — contain actual text streams. Extracting text from these is fast and exact because you're just reading the embedded text data, not guessing from pixels.
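In the simplest case, "reading the embedded text data" looks like the sketch below: inflate a content stream and pick out the string operands of the `Tj` (show text) operator. This is a toy under strong assumptions. It handles only plain literal strings with no escapes or nested parentheses, and it ignores `TJ` arrays, hex strings, and font-specific encodings, all of which a real extractor must deal with.

```python
import re
import zlib

# A literal string operand followed by the Tj operator.
# Assumption: no escaped or nested parentheses inside the string.
SHOW_TEXT = re.compile(rb"\(([^()\\]*)\)\s*Tj")


def extract_simple_text(compressed_stream: bytes) -> list[str]:
    """Inflate a Flate-compressed content stream and list its Tj strings."""
    content = zlib.decompress(compressed_stream)
    return [m.group(1).decode("latin-1") for m in SHOW_TEXT.finditer(content)]


stream = zlib.compress(
    b"BT /F1 12 Tf 72 720 Td (Invoice) Tj 0 -14 Td (Total due: 40.00) Tj ET"
)
lines = extract_simple_text(stream)
```

No guessing is involved: the characters come straight out of the stream, which is why extraction from digitally created PDFs is exact where OCR is not.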
How to tell which kind of PDF you have: open it and try to select text with your cursor. If you can highlight individual words, you have a text-layer PDF. If the cursor draws a box over the whole page regardless of where you click, it's a scanned image PDF.
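The same check can be roughly approximated in code by looking for font and image markers in the raw bytes. Treat this as a crude heuristic, not a reliable detector: files that pack their dictionaries into compressed object streams can hide these markers, and a scan that has been through OCR contains both images and fonts. The function name and return labels are invented for this sketch.

```python
def guess_pdf_kind(pdf_bytes: bytes) -> str:
    """Crude guess at whether a PDF has a text layer, based on raw byte markers."""
    has_fonts = b"/Font" in pdf_bytes
    image_count = (
        pdf_bytes.count(b"/Subtype /Image") + pdf_bytes.count(b"/Subtype/Image")
    )
    if has_fonts:
        return "text layer likely"
    if image_count:
        return "image-only (scan) likely"
    return "unknown"
```

A typical use would be `guess_pdf_kind(open(path, "rb").read())`, where `path` points at the file in question. Selecting text with the cursor remains the more trustworthy test.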
PDF Toolkit
Free · No sign-up · Files deleted after download · Up to 100MB
Extract Text
Pull all text content from your PDF with page numbers preserved
Split PDF
Split into individual pages or custom page ranges — completely lossless
Merge PDFs
Combine multiple PDFs into one document — no re-encoding
Compress PDF
Reduce file size by resampling embedded images at your chosen quality level
Add Watermark
Add a text watermark as a clean vector layer — original content untouched
Rotate Pages
Rotate 90°, 180°, or 270° — stored as a flag, nothing re-encoded
View Metadata
See title, author, creation date, security settings, and page count
Convert to Images
Rasterise each page to PNG or JPEG at the DPI you choose
Things people ask
Why did compressing my PDF barely change the file size?
Text content and vector graphics inside a PDF are usually already compressed with Flate (the same algorithm ZIP uses). There's not much left to remove. Compression makes the most difference on PDFs that contain large raster images, such as photographs, scanned pages, or embedded diagrams, where image quality can be traded for file size.
Are splitting and merging really lossless?
Yes. Split and merge operations work at the PDF object level: they copy or concatenate page streams without re-encoding any content. The resulting pages are identical to the corresponding pages in the originals. No image quality is affected, and no fonts are modified.
Why can't I extract text from my scanned PDF?
A scanned PDF is essentially a series of images, so there is no actual text layer. The text you see is a photograph of text, not stored text data. Extracting meaningful text from a scanned PDF requires OCR (character recognition from pixels), which this tool supports for clean scans. The accuracy depends heavily on scan quality: a clear, well-lit scan of a typed document extracts well; a faded photocopy does not.
What DPI should I choose when converting to images?
For screen viewing or web use, 96–150 DPI is sufficient and keeps file sizes manageable. For printing, 300 DPI is the standard. Higher DPI produces sharper images but larger files, and the pixel count grows with the square of the DPI: a 600 DPI image of a page has four times as many pixels as the same page at 300 DPI.
Are my files stored after processing?
No. Files are deleted immediately after your download completes. Files that aren't downloaded expire automatically after 60 minutes. Processing happens in isolated memory, and the file is not indexed, logged, or retained.
What happens if an operation fails?
You'll see a specific error message. Common causes are encrypted PDFs without the password provided, corrupted files, or files that are technically valid PDFs but use obscure internal structures. Try re-uploading first. If the same file consistently fails with a specific operation, a different tool in the grid sometimes succeeds where the first attempt didn't.