Base64 Encoding Explained

Published · 8 min read

Base64 shows up everywhere: in JWTs, in src="data:..." image attributes, in HTTP Basic authentication headers, in email attachments. It looks like gibberish, and that is exactly the point: Base64 re-encodes arbitrary binary data using only 64 printable ASCII characters, so it can travel through systems that were built for text. This guide explains what is actually happening when you encode something, why the output is always about a third larger than the input, and the one misconception about Base64 that gets people into trouble.

The core idea

A byte is 8 bits, which gives 256 possible values. Many of those values are control characters: newlines, null bytes, the ASCII bell. Older text-oriented protocols and storage layers (SMTP for email, HTTP headers, many database text columns) treat some of those bytes as special. If you try to send a raw JPEG through a classic SMTP server, the mail transfer agent interprets the first byte that happens to equal a line feed as “end of line,” and your image is corrupted before it arrives.

Base64 sidesteps the problem. It takes the input bytes, regroups them into chunks of 6 bits (because 2^6 = 64), and maps each 6-bit chunk to one of 64 safe printable characters. Because every output character is plain ASCII, the encoded text can flow through any text-only channel without being mangled.

Three input bytes (24 bits) become exactly four output characters (4 × 6 = 24 bits). That 3-to-4 ratio is the whole reason Base64 is bigger than the original — more on that in a moment.
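The regrouping can be sketched in a few lines of JavaScript. This is a minimal illustration that handles exactly one full 3-byte group and no padding; the names ALPHABET and encodeGroup are hypothetical helpers chosen for this example, not part of any library.

```javascript
// Standard Base64 alphabet: index 0..63 -> output character.
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: encode exactly three bytes into four characters.
function encodeGroup(b0, b1, b2) {
  // Pack the three bytes into one 24-bit number.
  const n = (b0 << 16) | (b1 << 8) | b2;
  // Slice it into four 6-bit values, most significant first.
  return (
    ALPHABET[(n >> 18) & 63] +
    ALPHABET[(n >> 12) & 63] +
    ALPHABET[(n >> 6) & 63] +
    ALPHABET[n & 63]
  );
}

// "Man" is the bytes 0x4D 0x61 0x6E.
const manEncoded = encodeGroup(0x4d, 0x61, 0x6e); // "TWFu"
```

The 24 bits 010011 010110 000101 101110 read as the values 19, 22, 5, 46, which map to T, W, F, u.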

The alphabet

The 64 characters are defined by RFC 4648:

Value  Char      Value  Char      Value  Char      Value  Char
   0    A          17    R          34    i          51    z
   1    B          18    S          35    j          52    0
   2    C          19    T          36    k          53    1
 ...             ...             ...             ...
   8    I          25    Z          42    q          59    7
   9    J          26    a          43    r          60    8
  10    K          27    b          44    s          61    9
  11    L          28    c          45    t          62    +   (or -  in base64url)
  12    M          29    d          46    u          63    /   (or _  in base64url)
 ...             ...             ...
  16    Q          33    h          50    y        (pad)   =

A–Z, a–z, 0–9 give 62 characters. The last two are + and / in standard Base64. The variant called base64url — used in JWTs and filenames — replaces those with - and _ so the output is safe in URLs and file paths.

If the input length is not a multiple of three, padding with = characters brings the output length up to a multiple of four. The string "Man" encodes to TWFu; the string "Ma" encodes to TWE=; the single character "M" encodes to TQ==. The number of trailing = signs tells the decoder exactly how many bytes to drop from the final group.
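The padding rule can be sketched as a small hand-rolled encoder. This is an illustration under the assumption that the input is an array of byte values from 0 to 255; B64_CHARS, encodeBytes, and ascii are hypothetical names, and in real code the built-in btoa or a platform library would normally be the better choice.

```javascript
const B64_CHARS =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper: Base64-encode an array of byte values (0..255).
function encodeBytes(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const remaining = bytes.length - i; // bytes left, including this group
    const b0 = bytes[i];
    const b1 = remaining > 1 ? bytes[i + 1] : 0; // missing bytes count as zero bits
    const b2 = remaining > 2 ? bytes[i + 2] : 0;
    const n = (b0 << 16) | (b1 << 8) | b2;

    out += B64_CHARS[(n >> 18) & 63];
    out += B64_CHARS[(n >> 12) & 63];
    // Positions that would carry only filler bits are written as "=".
    out += remaining > 1 ? B64_CHARS[(n >> 6) & 63] : "=";
    out += remaining > 2 ? B64_CHARS[n & 63] : "=";
  }
  return out;
}

// Convenience for ASCII-only sample input (one byte per character).
const ascii = (s) => Array.from(s, (c) => c.charCodeAt(0));

const three = encodeBytes(ascii("Man")); // "TWFu"
const two = encodeBytes(ascii("Ma"));    // "TWE="
const one = encodeBytes(ascii("M"));     // "TQ=="
```

One = means the final group held two real bytes; two = signs mean it held one.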

Why the output is ~33% bigger

Three bytes of input become four characters of output. Four thirds is roughly 1.33, so the encoded form is about 33% larger than the raw bytes. Encode a 300 KB PNG and you ship ~400 KB. Add the padding and newline wrapping that some implementations insert every 76 characters, and the overhead creeps higher.
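The arithmetic can be checked with a short sketch. The function names base64Length and wrappedLength are hypothetical, and the wrapping calculation assumes every line, including the last, ends with a two-byte CRLF; implementations differ on that detail.

```javascript
// Hypothetical helper: length of padded Base64 output for `byteLength` input bytes.
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// Hypothetical helper: length when the output is also wrapped into lines of
// `lineLength` characters, each followed by CRLF (an assumption, see above).
function wrappedLength(byteLength, lineLength = 76) {
  const chars = base64Length(byteLength);
  const lines = Math.ceil(chars / lineLength);
  return chars + 2 * lines;
}

const raw = 300 * 1024;             // a 300 KB file: 307200 bytes
const encoded = base64Length(raw);  // 409600 characters, i.e. 400 KB
const ratio = encoded / raw;        // 4/3, about 1.33
const wrapped = wrappedLength(raw); // 420380 with 76-character lines
```

Under these assumptions the wrapped form is roughly 37% larger than the raw file, compared with 33% for the unwrapped form.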

This matters in practice. Embedding images as Base64 in CSS removes an HTTP request, but it forces the browser to download a third again as many bytes and blocks rendering until the (larger) CSS parses. For small icons under a few kilobytes the trade-off usually wins. For anything larger, ship the file as a separate asset and let the browser cache it.

gzip and Brotli partly claw the overhead back — Base64 text compresses well because it uses so few distinct characters — but compression happens at the transport layer, and the decompressed payload still has to fit in memory.

Text, UTF-8, and the encoding pipeline

People say “encode this string to Base64,” but there is an invisible first step. Base64 operates on bytes, not on characters. To Base64-encode a string you first convert the string to bytes using a character encoding — almost always UTF-8 — and then Base64-encode the bytes.

// JavaScript
const text = "안녕";                       // 2 Korean characters
const bytes = new TextEncoder().encode(text); // 6 bytes (UTF-8: 3 per char)
const b64   = btoa(String.fromCharCode(...bytes)); // "7JWI64WV"
const back  = new TextDecoder().decode(
  Uint8Array.from(atob(b64), c => c.charCodeAt(0))
); // "안녕" again

The classic mistake is calling btoa("안녕") directly. btoa expects a Latin-1 string and throws InvalidCharacterError on any character above code point 255. The fix is always the same: encode to UTF-8 bytes first, then Base64. Conversely, when you decode, you get bytes back; you must then interpret those bytes as UTF-8 to recover the original text. Skipping that last step is how you end up with mojibake such as ðŸ˜€ where your emoji used to be.
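The decoding half of that mistake can be reproduced in a few lines. This is a small sketch that assumes an environment with global atob and TextDecoder (browsers and recent Node.js versions); the variable names are made up for the example.

```javascript
// Base64 of the UTF-8 bytes F0 9F 98 80, which encode U+1F600 (a grinning face).
const emojiB64 = "8J+YgA==";

// atob returns a "binary string": one character per byte, not decoded text.
const binary = atob(emojiB64);

// Wrong: using the binary string as text yields four Latin-1 characters.
const wrong = binary;

// Right: turn the characters back into bytes, then decode those bytes as UTF-8.
const emojiBytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
const right = new TextDecoder("utf-8").decode(emojiBytes);
```

Here wrong has four characters, the first of which is ð (code point 0xF0), while right is the single emoji the sender intended.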

Where you will actually use it

Base64 is not encryption

This is the misconception that causes real harm. Base64 is reversible by anyone, instantly, with no key. Encoding a password or an API token in Base64 provides zero confidentiality. It is obfuscation at best, and often not even that — a Base64 string is so recognizable (== padding, the limited alphabet) that scanners flag it automatically.

If you need to keep data secret, use encryption (AES-GCM for symmetric, RSA-OAEP or an ECDH-based scheme for asymmetric). If you need to verify data has not been tampered with, use an HMAC or a digital signature such as ECDSA. If you need to transmit binary safely through a text channel, then Base64 is the right tool. Mixing up these three jobs is how “encrypted” API keys end up readable in a GitHub commit.
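To make the "reversible by anyone" point concrete, here is a sketch that builds an HTTP Basic authentication value and then reverses it. The credentials user and pass are placeholders, and the code assumes global btoa and atob are available.

```javascript
// A made-up header value; "user" and "pass" are placeholder credentials.
const header = "Basic " + btoa("user:pass"); // "Basic dXNlcjpwYXNz"

// Anyone who sees the header can reverse it. No key is involved.
const encodedPart = header.slice("Basic ".length);
const decoded = atob(encodedPart);

// Split on the first colon only, since a password may itself contain colons.
const sep = decoded.indexOf(":");
const username = decoded.slice(0, sep);
const password = decoded.slice(sep + 1);
```

This is why Basic authentication is only acceptable over TLS: the protection comes from the encrypted connection, not from the encoding.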

Variants and line wrapping

There are a few flavors of Base64 in the wild, and mixing them up is a common bug. Standard Base64 uses + and / and pads with =. Base64url swaps in - and _ so the output is safe inside a URL path or query string without further percent-encoding — this is the variant JWT uses. Some base64url implementations also drop the trailing = padding entirely and infer it from the string length, which is fine until you feed the unpadded string into a strict decoder that expects the padding.
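Converting between the two alphabets is mostly string substitution. The following is a minimal sketch; toBase64Url and fromBase64Url are hypothetical helper names, and the padding logic assumes the input is otherwise well-formed (a length that leaves a remainder of 1 when divided by 4 is never valid and is not checked here).

```javascript
// Hypothetical helper: standard Base64 -> unpadded base64url.
function toBase64Url(b64) {
  return b64.replace(/\+/g, "-").replace(/\//g, "_").replace(/=+$/, "");
}

// Hypothetical helper: base64url (padded or not) -> padded standard Base64.
function fromBase64Url(b64url) {
  const b64 = b64url.replace(/-/g, "+").replace(/_/g, "/");
  // Restore the padding a strict decoder expects: pad up to a multiple of four.
  const padLength = (4 - (b64.length % 4)) % 4;
  return b64 + "=".repeat(padLength);
}

const standard = btoa("\xfb\xff");        // bytes FB FF -> "+/8="
const urlSafe = toBase64Url(standard);    // "-_8"
const restored = fromBase64Url(urlSafe);  // "+/8=" again
```

Restoring the padding before decoding is the usual way to avoid the strict-decoder failure described above.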

PEM (used for SSL certificates and SSH keys) is standard Base64 wrapped at 64 characters per line with -----BEGIN CERTIFICATE----- headers. MIME email wraps at 76 characters. The line breaks are not part of the Base64 data; they exist so 1980s-era mail servers would not choke on long lines. Most modern decoders ignore whitespace, but if you are writing one yourself, remember to strip newlines before decoding or you will get a length error.
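Stripping the armor and the line breaks before decoding might look like the sketch below. pemToBytes is a hypothetical helper, the sample block is not a real certificate (it wraps the bytes of "hello world" purely for illustration), and the code assumes global atob and TextDecoder.

```javascript
// Hypothetical helper: extract the raw bytes from a PEM-style block.
function pemToBytes(pem) {
  const body = pem
    .replace(/-----BEGIN [^-]+-----/, "") // drop the header line marker
    .replace(/-----END [^-]+-----/, "")   // drop the footer line marker
    .replace(/\s+/g, "");                 // drop line breaks and other whitespace
  return Uint8Array.from(atob(body), (c) => c.charCodeAt(0));
}

// Not a real certificate: "hello world" split across two lines for illustration.
const fakePem = [
  "-----BEGIN CERTIFICATE-----",
  "aGVsbG8g",
  "d29ybGQ=",
  "-----END CERTIFICATE-----",
].join("\n");

const pemBytes = pemToBytes(fakePem);
const pemText = new TextDecoder().decode(pemBytes); // "hello world"
```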

When you should not reach for Base64

Because Base64 trades 33% more bytes for transport safety, it is the wrong tool whenever the channel is already binary-safe. Sending Base64-encoded image data through a multipart/form-data upload is strictly worse than sending the raw bytes: you pay the size tax for nothing, because the upload protocol handles binary fine. Storing Base64 in a database BLOB column takes a third more space than the raw bytes and forces every reader to decode. The rule: use Base64 at the boundary with a text-only system, and store or transmit raw bytes everywhere else.

A related anti-pattern is Base64-encoding then gzipping, or gzipping then Base64-encoding, in the hope of saving space. Base64-then-gzip recovers much of the overhead (Base64 text compresses to roughly the original size) but the result is no longer text-safe, which defeats the only reason to encode in the first place. gzip-then-Base64 doubles the work and still pays the size tax. If you need compression, compress; if you need text-safety, encode; doing both in sequence is almost always a mistake.

A short summary

Once you internalize that Base64 is purely a transport trick — bytes in, slightly more bytes out, fully reversible — the rest follows. You reach for it when a channel cannot safely carry raw bytes, and you reach for something else (TLS, real encryption, a hash) when you need actual protection.
