Understanding Character Encoding

Every piece of text you see on a screen is, at its core, a sequence of numbers. The way those numbers map to characters is the world of character encoding. I have spent years debugging encoding issues, and understanding how encoding works will save you from some of the most frustrating bugs in software development.

In this guide, I will cover the encoding schemes that matter most to web developers: Base64, URL encoding, HTML entities, Unicode, and more. You will understand not just how to use them, but why they exist and when each is the right choice.

ASCII: Where It All Started

The American Standard Code for Information Interchange (ASCII) was published in 1963 and defined 128 characters: the English alphabet, digits, punctuation, and control characters, each represented by a number from 0 to 127 (7 bits). ASCII worked for English but completely ignored the rest of the world. There was no way to represent accented characters, Chinese, Arabic, or the thousands of other symbols humans use. This limitation drove the development of every encoding system we will discuss here.

Base64: Turning Binary into Text

Base64 is an encoding scheme that converts binary data into a text representation using 64 printable ASCII characters. It was originally designed for email (the MIME standard), where binary attachments needed to travel through text-only channels.

How Base64 Works

The algorithm processes input bytes in groups of three (24 bits). Each group is split into four 6-bit chunks, and each 6-bit value maps to one of 64 characters: A-Z, a-z, 0-9, +, and /. If the input length is not divisible by three, the final group is filled out with zero bits and the output is padded with = signs: one = when the final group held two bytes, two when it held only one.

// JavaScript Base64 encoding and decoding.
// Note: btoa() and atob() work on binary strings, so btoa() throws for characters above U+00FF.
const encoded = btoa("Hello, World!");
console.log(encoded); // "SGVsbG8sIFdvcmxkIQ=="

const decoded = atob("SGVsbG8sIFdvcmxkIQ==");
console.log(decoded); // "Hello, World!"
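
To make the three-bytes-to-four-characters step concrete, here is a minimal hand-rolled sketch of the algorithm described above. The helper name base64FromBytes is my own, not a built-in, and in real code you would normally reach for btoa() or a library. Because it takes raw bytes, it also works for text that btoa() rejects, as long as you convert the text to UTF-8 with TextEncoder first.

```javascript
// Base64 alphabet: A-Z, a-z, 0-9, +, /
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Hypothetical helper (not a built-in): encode a Uint8Array three bytes at a time.
function base64FromBytes(bytes) {
  let out = "";
  for (let i = 0; i < bytes.length; i += 3) {
    const remaining = bytes.length - i; // bytes left, including this group
    // Pack up to three bytes into one 24-bit number; missing bytes count as zero.
    const group =
      (bytes[i] << 16) |
      ((remaining > 1 ? bytes[i + 1] : 0) << 8) |
      (remaining > 2 ? bytes[i + 2] : 0);
    // Split the 24 bits into four 6-bit indexes into the alphabet.
    out += ALPHABET[(group >> 18) & 63];
    out += ALPHABET[(group >> 12) & 63];
    // Positions that carry no input data are emitted as padding.
    out += remaining > 1 ? ALPHABET[(group >> 6) & 63] : "=";
    out += remaining > 2 ? ALPHABET[group & 63] : "=";
  }
  return out;
}

const helloBytes = new TextEncoder().encode("Hello, World!");
console.log(base64FromBytes(helloBytes)); // "SGVsbG8sIFdvcmxkIQ=="
```

"Hello, World!" is 13 bytes: four full groups plus one leftover byte, which is why the output ends in two = signs.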

When to Use Base64

Common use cases include embedding small images as data URIs in HTML/CSS, encoding email attachments via MIME, including binary data in JSON API payloads, and storing binary data in text-only database fields. The trade-off is size: Base64 increases data by approximately 33%. Try it with our Base64 Encoder/Decoder.
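
The 33% figure falls straight out of the three-to-four ratio. As a rough sketch (base64Length is a name I made up for illustration), the padded output length can be computed from the input size alone:

```javascript
// Hypothetical helper: padded Base64 length for a given number of input bytes.
// Every group of up to 3 input bytes becomes exactly 4 output characters.
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

console.log(base64Length(13));             // 20
console.log(btoa("Hello, World!").length); // 20 (13 input bytes)
console.log(base64Length(3000));           // 4000, one third larger than the input
```

Line breaks added by some MIME encoders come on top of this, so treat the number as a lower bound for email.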

URL Encoding: Making URLs Safe

URLs have a restricted character set. Characters like spaces, ampersands, question marks, and non-ASCII characters cannot appear as literal data in a URL, either because they have special meaning as delimiters or because the specification does not allow them at all. URL encoding (also called percent-encoding) solves this by replacing each unsafe character with a percent sign followed by the two-digit hex value of its byte. Non-ASCII characters are first converted to UTF-8 bytes, and each byte is encoded the same way.

Space       ->  %20
&           ->  %26
=           ->  %3D
?           ->  %3F
/           ->  %2F
cafe latte  ->  cafe%20latte
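
The mapping above is just "percent sign plus the byte in hex". Here is a small sketch that makes that visible. The helper percentEncodeAll is hypothetical and deliberately encodes every byte, which real encoders do not do; it is only meant to show where the hex digits come from, including for a non-ASCII character that takes two UTF-8 bytes.

```javascript
// Hypothetical helper: percent-encode EVERY UTF-8 byte of a string.
// Real encoders leave safe characters (letters, digits, etc.) untouched.
function percentEncodeAll(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(
    bytes,
    (b) => "%" + b.toString(16).toUpperCase().padStart(2, "0")
  ).join("");
}

console.log(percentEncodeAll(" "));      // "%20"
console.log(percentEncodeAll("&"));      // "%26"
console.log(percentEncodeAll("\u00E9")); // "%C3%A9" (e with acute accent, two UTF-8 bytes)

// The built-in function applies the same rule, but only to unsafe characters:
console.log(encodeURIComponent("caf\u00E9 latte")); // "caf%C3%A9%20latte"
```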

encodeURIComponent vs. encodeURI

JavaScript provides two functions for URL encoding, and using the wrong one is a common source of bugs:

const query = "name=Alice & Bob";

// encodeURIComponent: encodes everything except A-Z a-z 0-9 - _ . ! ~ * ' ( )
encodeURIComponent(query);
// "name%3DAlice%20%26%20Bob"

// encodeURI: preserves URL structure characters (: / ? # & = etc.)
encodeURI("https://example.com/search?q=" + query);
// "https://example.com/search?q=name=Alice%20&%20Bob"  (BROKEN!)

// Correct approach for query parameters:
"https://example.com/search?q=" + encodeURIComponent(query);
// "https://example.com/search?q=name%3DAlice%20%26%20Bob"

The rule: use encodeURIComponent() for individual query parameter values and path segments. Use encodeURI() only when encoding a complete URL. In practice, I reach for encodeURIComponent() about 95% of the time. Experiment with our URL Encoder/Decoder.
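
If you would rather not concatenate strings at all, the standard URL and URLSearchParams APIs do the encoding for you. This is a sketch of an alternative to the approach above, not a replacement for understanding it. One difference to expect: URLSearchParams uses form encoding, so spaces come out as + rather than %20.

```javascript
const rawQuery = "name=Alice & Bob";

const url = new URL("https://example.com/search");
url.searchParams.set("q", rawQuery); // the value is encoded when the URL is serialized

console.log(url.href);
// "https://example.com/search?q=name%3DAlice+%26+Bob"

// Reading the parameter back returns the original, unencoded value.
console.log(url.searchParams.get("q")); // "name=Alice & Bob"
```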

HTML Entities: Safe Characters in Markup

In HTML, certain characters have special meaning. The less-than sign < opens a tag. The ampersand & starts an entity reference. If you want to display these characters as literal text, you must encode them as HTML entities.

Named vs. Numeric Entities

HTML entities come in two forms. Named entities use readable labels like &lt; for the less-than sign and &amp; for the ampersand. Numeric entities use the character's code point in decimal (&#60;) or hex (&#x3C;). Named entities are easier to remember, while numeric entities work for any Unicode character.
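
Numeric entities are easy to generate because they are nothing more than the code point wrapped in markers. A quick sketch, with two helper names I invented for the purpose:

```javascript
// Hypothetical helpers: build numeric entities from the first code point of a string.
function toDecimalEntity(char) {
  return "&#" + char.codePointAt(0) + ";";
}

function toHexEntity(char) {
  return "&#x" + char.codePointAt(0).toString(16).toUpperCase() + ";";
}

console.log(toDecimalEntity("<"));     // "&#60;"
console.log(toHexEntity("<"));         // "&#x3C;"
console.log(toDecimalEntity("&"));     // "&#38;"
console.log(toHexEntity("\u{1F600}")); // "&#x1F600;" (works beyond the named set)
```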

XSS Prevention

HTML entity encoding is critical for security. If your application displays user-generated content without encoding it, an attacker can inject malicious <script> tags that execute in every visitor's browser. This is Cross-Site Scripting (XSS), one of the most common web vulnerabilities. Always encode user input before inserting it into HTML. Convert and inspect HTML entities with our HTML Entity Encoder/Decoder.
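
Here is a minimal sketch of what "encode before inserting" looks like. The escapeHtml function is my own illustration; most template engines and frameworks ship an equivalent and apply it by default. It is suitable for text placed in element content or quoted attribute values, and it is not sufficient for other contexts such as inline scripts or URLs.

```javascript
// Characters with special meaning in HTML and their entity replacements.
const HTML_ESCAPES = {
  "&": "&amp;",
  "<": "&lt;",
  ">": "&gt;",
  '"': "&quot;",
  "'": "&#39;",
};

// Hypothetical helper: replace each special character in a single pass,
// so an ampersand produced by one replacement is never escaped again.
function escapeHtml(text) {
  return String(text).replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch]);
}

const comment = '<script>alert("xss")</script>';
console.log(escapeHtml(comment));
// "&lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;"
```

The browser renders the escaped string as visible text instead of executing it.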

Unicode: One Encoding to Rule Them All

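Unicode aims to include every character from every writing system on Earth, plus mathematical symbols, technical symbols, and emoji. Strictly speaking it is a character set rather than a byte encoding: it assigns a number to each character, and encodings such as UTF-8 and UTF-16 turn those numbers into bytes. It defines more than 150,000 characters (154,998 as of Unicode 16.0) and counting.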

Code Points

Every Unicode character is assigned a unique number called a code point, written as U+ followed by a hex value:

A                     U+0041
é (e with acute)      U+00E9
Σ (Greek Sigma)       U+03A3
世 (Han character)     U+4E16
😀 (grinning face)     U+1F600
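
You can reproduce this notation in JavaScript with codePointAt() and go the other way with String.fromCodePoint(). A small sketch, where toCodePointLabel is a name I chose for illustration:

```javascript
// Hypothetical helper: format the first code point of a string as U+XXXX.
function toCodePointLabel(char) {
  const hex = char.codePointAt(0).toString(16).toUpperCase();
  return "U+" + hex.padStart(4, "0"); // at least four hex digits by convention
}

console.log(toCodePointLabel("A"));         // "U+0041"
console.log(toCodePointLabel("\u00E9"));    // "U+00E9"
console.log(toCodePointLabel("\u03A3"));    // "U+03A3"
console.log(toCodePointLabel("\u{1F600}")); // "U+1F600"

// And back from a number to a character:
console.log(String.fromCodePoint(0x41));    // "A"
```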

UTF-8 Encoding

UTF-8 is the dominant encoding on the web (used by over 98% of websites). It is a variable-width encoding that uses 1 to 4 bytes per character:

Code Point Range	Bytes	Example
U+0000 to U+007F	1	ASCII characters (A, 5, !)
U+0080 to U+07FF	2	Latin accented, Greek, Cyrillic
U+0800 to U+FFFF	3	Chinese, Japanese, Korean, most symbols
U+10000 to U+10FFFF	4	Emoji, historic scripts, math symbols

The beauty of UTF-8 is backward compatibility: any valid ASCII text is also valid UTF-8, making migration nearly painless for English-language systems.
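
To see the variable width for yourself, TextEncoder (which always produces UTF-8) will hand you the raw bytes. This sketch uses one character from each row of the table; utf8Hex is a hypothetical helper name.

```javascript
const utf8Encoder = new TextEncoder(); // TextEncoder always encodes to UTF-8

// Hypothetical helper: list the UTF-8 bytes of a string as two-digit hex values.
function utf8Hex(text) {
  return Array.from(utf8Encoder.encode(text), (b) =>
    b.toString(16).toUpperCase().padStart(2, "0")
  );
}

console.log(utf8Hex("A"));         // ["41"]                   1 byte, same as ASCII
console.log(utf8Hex("\u00E9"));    // ["C3", "A9"]             2 bytes (U+00E9)
console.log(utf8Hex("\u4E16"));    // ["E4", "B8", "96"]       3 bytes (U+4E16)
console.log(utf8Hex("\u{1F600}")); // ["F0", "9F", "98", "80"] 4 bytes (U+1F600)
```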

Surrogate Pairs and Emoji

JavaScript strings use UTF-16 internally. Characters with code points above U+FFFF (like most emoji) are stored as surrogate pairs, which are two 16-bit code units that together represent a single character:

const emoji = "\u{1F600}"; // Grinning face
console.log(emoji.length);          // 2 (two UTF-16 code units!)
console.log([...emoji].length);     // 1 (one actual character)
console.log(emoji.codePointAt(0));  // 128512 (decimal for U+1F600)

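This is why string.length in JavaScript can give surprising results with emoji. If you need to count code points rather than UTF-16 code units, use [...string].length or Array.from(string).length. Be aware that one visible character can still span several code points (flags, skin-tone variants, and family emoji are joined sequences), so for user-perceived characters use Intl.Segmenter with grapheme granularity. Explore Unicode characters and conversions with our Unicode Converter.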
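
Counting code points still does not always match what a reader would call "one character". The sketch below compares the three ways of counting using Intl.Segmenter, which is available in current browsers and recent Node.js versions. countGraphemes is a helper name I made up.

```javascript
// Hypothetical helper: count user-perceived characters (grapheme clusters).
function countGraphemes(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// "e" followed by a combining acute accent: displayed as a single accented letter.
const accented = "e\u0301";
console.log(accented.length);          // 2 UTF-16 code units
console.log([...accented].length);     // 2 code points
console.log(countGraphemes(accented)); // 1 visible character

// Family emoji: man + zero-width joiner + woman + zero-width joiner + girl.
const family = "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}";
console.log(family.length);            // 8 UTF-16 code units
console.log([...family].length);       // 5 code points
console.log(countGraphemes(family));   // 1 visible character
```

Which count you want depends on the job: code units for indexing into the string, graphemes for things like character limits shown to users.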

Binary and Hexadecimal Representations

At the lowest level, all data is binary: sequences of ones and zeros. Hexadecimal (base 16) provides a more compact way to represent binary data, where each hex digit maps to exactly 4 bits:

Character: A
ASCII:     65
Binary:    01000001
Hex:       41

Character: Z
ASCII:     90
Binary:    01011010
Hex:       5A

Hex appears everywhere: CSS color codes (#FF5733), memory addresses, MAC addresses, and cryptographic hashes. Convert between text, binary, and hex using our Binary to Text Converter and Hex Editor.
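
The conversions in the table above are one-liners in JavaScript: toString() with a radix goes from number to text, and parseInt() with a radix goes back. A short sketch, with describeChar as an illustrative helper name:

```javascript
// Hypothetical helper: show one ASCII character in decimal, binary, and hex.
function describeChar(char) {
  const code = char.charCodeAt(0);
  return {
    decimal: code,
    binary: code.toString(2).padStart(8, "0"), // pad to a full byte
    hex: code.toString(16).toUpperCase().padStart(2, "0"),
  };
}

console.log(describeChar("A")); // { decimal: 65, binary: "01000001", hex: "41" }
console.log(describeChar("Z")); // { decimal: 90, binary: "01011010", hex: "5A" }

// Parsing goes the other way:
console.log(parseInt("5A", 16));      // 90
console.log(parseInt("01011010", 2)); // 90
```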

Morse Code: A Historical Perspective

Before digital encoding, there was Morse code. Developed by Samuel Morse and Alfred Vail in the 1830s, Morse code encodes letters as sequences of short signals (dots) and long signals (dashes). It was the first widely adopted system for encoding text for electrical transmission, and it remained in active use for maritime communication until the late 1990s.

H  ....
E  .
L  .-..
L  .-..
O  ---

"HELLO" in Morse: .... . .-.. .-.. ---

Morse code is a variable-length encoding: common letters (E, T) get short codes, while rare ones (Q, Z) get longer ones. This is the same principle behind modern Huffman coding. Play with it using our Morse Code Translator.
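
Since Morse is just a lookup table, a toy encoder fits in a few lines. This sketch covers only the letters A to Z and silently drops everything else, which is my simplification; digits, punctuation, and word spacing are left out.

```javascript
// International Morse code for the 26 basic Latin letters.
const MORSE = {
  A: ".-",   B: "-...", C: "-.-.", D: "-..",  E: ".",    F: "..-.",
  G: "--.",  H: "....", I: "..",   J: ".---", K: "-.-",  L: ".-..",
  M: "--",   N: "-.",   O: "---",  P: ".--.", Q: "--.-", R: ".-.",
  S: "...",  T: "-",    U: "..-",  V: "...-", W: ".--",  X: "-..-",
  Y: "-.--", Z: "--..",
};

// Hypothetical helper: encode letters, separated by single spaces.
// Characters without an entry in the table are skipped (a simplification).
function toMorse(text) {
  return text
    .toUpperCase()
    .split("")
    .filter((ch) => MORSE[ch] !== undefined)
    .map((ch) => MORSE[ch])
    .join(" ");
}

console.log(toMorse("hello")); // ".... . .-.. .-.. ---"

// Variable length in action: common letters are short, rare ones are long.
console.log(MORSE.E.length, MORSE.Q.length); // 1 4
```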

Common Pitfalls and How to Avoid Them

Double Encoding

One of the most frequent encoding bugs is applying the same encoding twice. For URL encoding, this turns a space into %20, and then into %2520 because the percent sign itself gets encoded. The fix is simple: always encode exactly once, at the boundary where data enters a new context.
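
The bug is easy to reproduce in a console, and seeing it once makes it much easier to recognize in logs later. A minimal demonstration:

```javascript
const rawValue = "a b";

const encodedOnce = encodeURIComponent(rawValue);
console.log(encodedOnce);  // "a%20b"

// Encoding the already-encoded value escapes the percent sign itself.
const encodedTwice = encodeURIComponent(encodedOnce);
console.log(encodedTwice); // "a%2520b"

// A receiver that decodes once (as it should) now gets the wrong value.
console.log(decodeURIComponent(encodedTwice)); // "a%20b"
console.log(decodeURIComponent(encodedOnce));  // "a b"
```

If you spot %25 followed by two hex digits in a URL, double encoding is the first thing to suspect.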

Encoding Mismatches and Mojibake

If a file is saved as UTF-8 but the server declares it as ISO-8859-1, characters outside the ASCII range will display as garbled text, known as "mojibake." Always ensure your HTML declares <meta charset="UTF-8"> and your server sends the matching Content-Type header. In 2026, the answer is almost always UTF-8.
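
You can manufacture mojibake on purpose by decoding UTF-8 bytes with the wrong decoder. This sketch uses the windows-1252 label because that is how browsers treat a page declared as ISO-8859-1; it assumes a runtime whose TextDecoder supports legacy encodings, which is true of browsers and of standard Node.js builds.

```javascript
// "café" as UTF-8 bytes: the accented letter becomes two bytes, C3 A9.
const utf8Bytes = new TextEncoder().encode("caf\u00E9");

// Wrong: interpret each byte as its own single-byte character.
const garbled = new TextDecoder("windows-1252").decode(utf8Bytes);
console.log(garbled); // "cafÃ©"

// Right: decode with the encoding the bytes were actually written in.
const correct = new TextDecoder("utf-8").decode(utf8Bytes);
console.log(correct); // "café"
```

The telltale pattern of an accented capital A followed by a symbol is almost always UTF-8 being read as a single-byte encoding.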

Storing Encoded Values

If a user submits "Tom & Jerry" and you store it as "Tom &amp; Jerry" in the database, you now have an encoded string masquerading as raw text. When your template engine encodes it again on output, the markup becomes "Tom &amp;amp; Jerry" and the user sees the literal text "Tom &amp; Jerry". The rule: store raw data, encode only at the point of output.

A Quick Reference

Encoding	Use Case	Size Overhead
Base64	Binary data in text contexts	~33%
URL Encoding	Special characters in URLs	~3x for encoded chars
HTML Entities	Special characters in HTML	4 to 8 chars per entity
UTF-8	Universal text encoding	1 to 4 bytes per character
Hex	Byte-level data inspection	2x

Wrapping Up

Character encoding touches almost everything in web development. Getting it right means your application works correctly for users worldwide. Getting it wrong means mojibake, broken URLs, and security vulnerabilities.

Here are the BoltQuickTools encoding resources mentioned in this article:

Base64 Encoder/Decoder for converting between binary and text
URL Encoder/Decoder for percent-encoding and decoding
HTML Entity Encoder/Decoder for safe HTML output
Unicode Converter for exploring code points and encodings
Binary to Text Converter for binary and text conversions
Hex Editor for byte-level data inspection
Morse Code Translator for encoding text as Morse code

All of these tools run entirely in your browser with no data leaving your device. Whether you are debugging an encoding issue or just exploring how different encoding schemes work, they are free and ready to use.
