How Computers Store Text
Every email, log line, and source file is a sequence of bytes. The story of text is the story of how we agree which bytes mean which characters.
From keypress to stored data
When you type a letter, the keyboard sends a scan code to the operating system. The OS maps that event through keyboard layouts and input methods to a Unicode code point—an abstract character identity independent of fonts. Applications then serialize that character for storage or transmission using an encoding, which defines how code points become bytes.
That distinction matters: Unicode says what the character is; UTF-8 (or UTF-16, etc.) says how to represent it as bits. If any step guesses wrong—wrong layout, wrong normalization, wrong encoding—you get subtle bugs: mojibake (mangled character soup), corrupted filenames, or databases that truncate international names.
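The split between identity and representation is easy to see in a Python 3 sketch: the same character has one code point but different byte serializations per encoding.

```python
# A code point is an integer identity; an encoding turns it into bytes.
ch = "é"
print(ord(ch))             # 233 -- the code point, U+00E9
print(ch.encode("utf-8"))  # b'\xc3\xa9' -- two bytes in UTF-8
print(ch.encode("latin-1"))# b'\xe9'     -- one byte in Latin-1
```

Same character, same code point, different bytes on the wire: that gap is exactly where the "guessed wrong" bugs live.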
Characters as binary patterns
Computers do not store curly quotes, emoji, or diacritics as tiny pictures in RAM during editing; they store integers (code points) and byte sequences (encoded text). Rendering happens later, when a graphics stack chooses glyphs from fonts and applies shaping rules for complex scripts. For English plaintext, it is easy to imagine “A” as the number 65, but the same principle extends to every symbol Unicode includes—1,114,112 possible code points in the standard’s codespace (U+0000 through U+10FFFF), though only a subset is assigned meanings.
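Python exposes the character-to-integer mapping directly, which makes the principle concrete:

```python
# Characters are stored as integers (code points), not as pictures.
print(ord("A"))      # 65 -- "A" really is the number 65 in memory
print(chr(65))       # A  -- and 65 maps back to "A"
print(chr(0x1F600))  # 😀 -- same principle, much larger number
```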
ASCII: the 7-bit foundation
ASCII, standardized in the 1960s, assigned 128 characters to the numbers 0–127: uppercase and lowercase Latin letters, digits, common punctuation, space, and many non-printing control codes (carriage return, tab, etc.). Seven bits are enough for those 128 values, but early storage typically padded to 8 bits per character, leaving the high bit unused or repurposed (often for parity checking).
ASCII’s cultural footprint is enormous: C strings, HTTP headers, many file formats, and the basic Latin block in Unicode all inherit ASCII’s ordering. That is why tools can often “mostly work” with English even when encoding is underspecified—until you leave the ASCII range.
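A short Python check illustrates both points above: ASCII values all fit under 128, and ASCII-only text produces identical bytes whether you call it ASCII or UTF-8.

```python
# Every ASCII character has a value below 128 (fits in 7 bits).
text = "Hello, World!\r\n"
print(all(ord(c) < 128 for c in text))   # True

# ASCII-only text encodes to the same bytes under ASCII and UTF-8,
# which is why underspecified tooling "mostly works" for English.
print(text.encode("ascii") == text.encode("utf-8"))  # True
print(ord("\t"), ord("\r"))  # control codes: 9 13
```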
Extended ASCII and compatibility headaches
Once hardware standardized 8-bit bytes, vendors defined “extended ASCII” by filling values 128–255 with extra letters, symbols, and box-drawing characters. Unfortunately, different platforms picked different mappings—ISO-8859-1, Windows-1252, Mac Roman, and more. A byte like 0xA3 might be £ in one encoding and something else in another.
Those ambiguities still surface when opening old CSV exports, scraping legacy sites, or reading email without proper charset declarations. The fix is not mystical: declare the encoding explicitly, transcode to UTF-8 at system boundaries, and test with non-ASCII fixtures.
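You can reproduce the ambiguity in a few lines. One byte in the 0x80–0x9F range reads three different ways under three legacy code pages (codec names as Python spells them):

```python
# One stored byte, three historical interpretations.
raw = bytes([0x93])
for codec in ("cp1252", "mac_roman", "latin-1"):
    print(codec, repr(raw.decode(codec)))
# cp1252    '“'     -- left curly quote
# mac_roman 'ì'
# latin-1   '\x93'  -- an unprintable C1 control character
```

Nothing in the byte itself says which reading is right; only out-of-band metadata (or a declared convention) can.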
Unicode and the code-point idea
Unicode unifies character identities across languages. Each assigned character receives a code point such as U+0041 (LATIN CAPITAL LETTER A) or U+1F600 (GRINNING FACE). Planes organize these into ranges; most daily text lives in the Basic Multilingual Plane, but emoji and historic scripts occupy others. Unicode also defines properties—case mapping, directionality, combining marks—that software must respect for correct behavior.
Unicode is not an on-disk format by itself. You still need an encoding to serialize code points. That is where UTF-8 and friends enter.
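Python’s standard unicodedata module makes the code-point model tangible: each assigned character carries a formal name and machine-readable properties.

```python
import unicodedata

# Each assigned code point has a unique formal name in the standard.
for ch in ("A", "é", "😀"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+0041 LATIN CAPITAL LETTER A
# U+00E9 LATIN SMALL LETTER E WITH ACUTE
# U+1F600 GRINNING FACE
```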
UTF-8: variable width, ASCII-friendly
UTF-8 represents Unicode code points using one to four bytes. Crucially, code points U+0000 through U+007F (ASCII) map to identical single-byte values. That backward compatibility made UTF-8 easy to adopt in tools that were “ASCII-clean.” For many Western strings, UTF-8 is compact; for some scripts, other encodings might have been denser historically, but network effects and security reviews cemented UTF-8 as the internet’s default text encoding.
UTF-8’s variable width means you cannot assume one character equals one byte. String length in bytes differs from grapheme count, and slicing bytes blindly can split a multibyte sequence. High-level languages hide much of this, but file I/O, databases, and protocols still expose the truth.
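Both hazards show up immediately in Python: byte length diverges from code-point count, and a blind byte slice can land mid-sequence.

```python
# Code-point count and byte count are different measurements.
s = "héllo 😀"
print(len(s))                  # 7 code points
print(len(s.encode("utf-8")))  # 11 bytes (é takes 2, 😀 takes 4)

# Slicing the encoded bytes can split a multibyte sequence.
data = s.encode("utf-8")
try:
    data[:8].decode("utf-8")   # byte 7 is the first byte of 😀
except UnicodeDecodeError as e:
    print("split a multibyte sequence:", e.reason)
```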
What a text file looks like on disk
A .txt file is typically nothing more than encoded bytes written in order—often UTF-8 today. Line endings might be LF (0x0A), CRLF (0x0D 0x0A), or a lone CR (0x0D) on classic Mac OS files, which is why cross-platform projects normalize line endings in Git. “Plain text” still implies choices: encoding, newline style, and whether a BOM is present.
Text editors may add metadata in memory, but the saved artifact is usually just bytes. That simplicity is powerful for diffing, compressing, and hashing—but it also means there is no magic auto-detection standard. Heuristics exist (and sometimes fail).
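To confirm that the saved artifact is just bytes, write a small file and read it back raw (the filename here is a throwaway for the sketch):

```python
import os
import tempfile

# Write text through an encoding, then read the file back as raw bytes.
path = os.path.join(tempfile.mkdtemp(), "note.txt")
with open(path, "w", encoding="utf-8", newline="\n") as f:
    f.write("first line\nsecond line\n")

with open(path, "rb") as f:
    print(f.read())  # b'first line\nsecond line\n' -- bytes, nothing more
```

No hidden header, no stored encoding name: just the serialized characters and the newline bytes you chose.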
Encoding headers and the BOM
Protocols and file formats often carry charset information explicitly. HTTP responses include Content-Type: text/html; charset=utf-8; HTML can declare <meta charset="utf-8">; XML and JSON have their own rules (JSON is UTF-8 by standard). When metadata is missing, consumers guess—sometimes incorrectly.
A byte order mark at the beginning of a UTF-8 file is the sequence EF BB BF. It can signal UTF-8 to parsers that support BOM sniffing, but it also shifts byte offsets and can confuse naive tools that treat files as pure ASCII bytes. UTF-16 uses BOM to distinguish big-endian vs little-endian layouts. Teams often standardize “UTF-8 without BOM” for source code while accepting BOM in certain Windows-centric workflows.
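Python models the BOM convention with its "utf-8-sig" codec, which writes the marker on encode and strips it on decode; plain "utf-8" leaves it in the text as U+FEFF.

```python
# The UTF-8 BOM is the three-byte prefix EF BB BF.
data = "hello".encode("utf-8-sig")
print(data[:3])                        # b'\xef\xbb\xbf'
print(data.decode("utf-8-sig"))        # hello  (BOM stripped)
print(repr(data.decode("utf-8")))      # '\ufeffhello'  (BOM leaks through)
```

That leaked U+FEFF is exactly what confuses naive tools that expect the file to start with ordinary content.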
Frequently asked questions
- What encoding does my computer use?
- Operating systems and apps set defaults—often UTF-8 on modern macOS and Linux, and increasingly on Windows for developer tools—but individual files carry whatever bytes were written. Always inspect the format spec, database settings, or protocol headers.
- Why do some characters show as squares?
- Tofu squares mean missing glyphs, while gibberish usually means mis-decoded bytes. Try a different font, verify the charset, and ensure the source actually contains the intended Unicode sequence.
- What is a BOM?
- An optional leading marker that can signal UTF-8 or indicate UTF-16 byte order. Useful in some ecosystems, annoying in others—pick a team convention and stick to it.
- How much space does a character take?
- In UTF-8, between one and four bytes depending on the code point. Measuring “characters” for UI width is even trickier because of combining marks and emoji sequences.
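The one-to-four-byte progression from the last answer can be measured directly (one sample character per width class):

```python
# UTF-8 width grows with the code point: 1 to 4 bytes.
for ch in ("A", "é", "€", "😀"):
    print(ch, f"U+{ord(ch):04X}", len(ch.encode("utf-8")), "byte(s)")
# A  U+0041 1 byte(s)
# é  U+00E9 2 byte(s)
# €  U+20AC 3 byte(s)
# 😀 U+1F600 4 byte(s)
```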