File Signature / Magic Number
A specific sequence of bytes at the beginning of a file that uniquely identifies its format, regardless of file extension or filename. The term originated in Version 7 Unix (1979) where the header constant was assigned to a variable labeled ux_mag and called the magic number; the meaning expanded over time from executable format type to file system type to any type of file. Common signatures include 4D 5A (PE executables, “MZ” for Mark Zbikowski), FF D8 FF (JPEG), 89 50 4E 47 (PNG), 25 50 44 46 (PDF, “%PDF”), 50 4B 03 04 (ZIP, “PK” for Phil Katz). The libmagic library provides the standardized signature database used by the Unix file(1) command, python-magic library, and many security tools. PhotoRec includes a signature database covering 480+ file types for data carving recovery.
A file signature (also called a magic number or magic bytes) is a short sequence of bytes, usually at the beginning of a file, that identifies its format regardless of file extension. Common examples: 4D 5A for Windows executables, FF D8 FF for JPEG, 50 4B 03 04 for ZIP. Signatures enable identification when extensions are wrong, missing, or deliberately misleading. The libmagic database powers the Unix file command and many security tools, and PhotoRec’s signature database (covering 480+ file types) drives its data carving. For data recovery, signatures are foundational: they’re how carving tools like PhotoRec locate file boundaries on drives where filesystem metadata has been destroyed. A hex editor shows a signature directly in its offset, hex, and ASCII columns.
What File Signatures Are
The Wikipedia file signatures reference provides the foundational definition: “A file signature is data used to identify or verify the content of a file. Such signatures are also known as magic numbers or magic bytes and are usually inserted at the beginning of the file. Many file formats are not intended to be read as text. If such a file is accidentally viewed as a text file, its contents will be unintelligible. However, some file signatures can be recognizable when interpreted as text.” [1]
The fundamental concept
File signatures solve a specific problem: identifying file format when the filename or extension cannot be trusted:
- The filename problem: filenames can be wrong, missing, deliberately misleading, or stripped during copying.
- The extension problem: extensions are conventional, not enforced; renaming .exe to .pdf changes nothing about the file content.
- The signature solution: the file’s actual byte content includes a recognizable pattern that identifies the format.
- The location convention: signatures typically appear at byte offset 0; some formats use offset 4 (MPEG-4 ftyp) or other fixed positions.
- The size convention: signatures are typically 2-8 bytes; rarely longer than 16 bytes.
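A minimal sketch of the idea, assuming a small hand-picked rule table. The names `SIGNATURES` and `identify` are illustrative and do not come from any library; a real database holds thousands of rules.

```python
from typing import Optional

# Illustrative rules: (offset, signature bytes, label), taken from this entry.
SIGNATURES = [
    (0, bytes.fromhex("89504E470D0A1A0A"), "PNG"),
    (0, bytes.fromhex("25504446"), "PDF"),
    (0, bytes.fromhex("504B0304"), "ZIP"),
    (0, bytes.fromhex("FFD8FF"), "JPEG"),
    (0, bytes.fromhex("4D5A"), "PE/MZ executable"),
    (4, b"ftyp", "MPEG-4 family"),
]


def identify(data: bytes) -> Optional[str]:
    """Return the label of the first rule whose bytes match at its offset."""
    for offset, signature, label in SIGNATURES:
        if data[offset:offset + len(signature)] == signature:
            return label
    return None


# A file named invoice.pdf that begins with "MZ" is still an executable.
print(identify(b"MZ\x90\x00" + b"\x00" * 60))
```

Longer signatures are listed before shorter ones so that a 2-byte rule cannot shadow a more specific match.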
The CTF Handbook framing
The CTF Handbook describes file signatures in the context of forensic challenges: “File signatures (also known as File Magic Numbers) are bytes within a file used to identify the format of the file. Generally they’re 2-4 bytes long, found at the beginning of a file. Files can sometimes come without an extension, or with incorrect ones. We use file signature analysis to reveal what they truly are.” [2] The forensic and CTF use case captures why signatures matter:
- An attacker may rename malware.exe to invoice.pdf to bypass naive filtering.
- A file recovery tool finds a file with no name or extension; signature reveals format.
- A user receives an unknown file; signature confirms (or contradicts) the claimed type.
- A web upload validator must verify uploaded files match claimed types.
- A forensic examiner must identify file types without trusting metadata.
Signature properties
Effective file signatures share several properties:
- Uniqueness: signatures must be unlikely to occur in other formats; high-entropy patterns reduce false positives.
- Position consistency: the same signature at the same offset across all files of that format.
- Backward compatibility: format authors typically preserve signatures across versions for tool compatibility.
- Documentation: well-designed formats publicly document their signatures for parser implementation.
- Specificity vs generality: some signatures cover entire format families (RIFF for WAV/AVI/WebP) requiring secondary identification.
The “magic” connotation
The term “magic number” reflects the apparent arbitrariness of the chosen byte values:
- Values appear arbitrary to outside observers but have specific meaning to format parsers.
- Values are chosen to be improbable to occur randomly in other file types.
- Values often hide meanings: PE 4D 5A spells “MZ” for Mark Zbikowski, ZIP 50 4B spells “PK” for Phil Katz.
- Java class files use CA FE BA BE; “CAFEBABE” reads as a memorable hex word.
- Berkeley Fast File System superblock uses 19 54 01 19 representing the birthday of author Marshall Kirk McKusick.
The Magic Number Concept and History
The history of file signatures parallels the development of operating systems and file format conventions over the past several decades.
The Unix V7 origin
The Wikipedia magic number reference describes the term’s origin: “In Version Seven Unix, the header constant was not tested directly, but assigned to a variable labeled ux_mag and subsequently referred to as the magic number. Probably because of its uniqueness, the term magic number came to mean executable format type, then expanded to mean file system type, and expanded again to mean any type of file.” [3] The expansion path:
- 1979 (Unix V7): ux_mag variable holds executable header constant; called the “magic number”.
- Early 1980s: term extends to identify a.out vs other executable formats.
- Mid-1980s: term covers filesystem identification (UFS superblock, FFS).
- Late 1980s: term applies to any binary file format identifier.
- 1990s onward: “magic number” and “file signature” used interchangeably across all file types.
The Unix file(1) command and /etc/magic
The Unix file command (1973-1974) established the standard mechanism for signature-based identification:
- The file command: reads a file’s first bytes and looks them up in a magic database.
- The /etc/magic database: traditionally located at /etc/magic; modern systems use /usr/share/file/magic or compiled .mgc files.
- The libmagic library: programmatic interface to the magic database for non-shell programs.
- Magic database syntax: offset, type, value, message format with continuation rules for compound checks.
- Cross-platform: file command standard on Linux, macOS, FreeBSD, Solaris, and other Unix-like systems.
Hidden meanings in signatures
Many file signatures contain meaningful patterns chosen by format authors:
- 4D 5A (MZ) for Windows PE/EXE: Mark Zbikowski, Microsoft engineer who designed the original DOS executable format; PE inherits this signature.
- 50 4B (PK) for ZIP: Phil Katz, creator of PKZIP and the ZIP format.
- CA FE BA BE for Java class / Mach-O Universal: “CAFEBABE” reads as a memorable hex word; coincidentally shared by both formats.
- FE ED FA CE / CF for Mach-O: “FEEDFACE” / “FEEDFACF” reads as memorable; CE for 32-bit, CF for 64-bit.
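- 89 50 4E 47 for PNG: the first byte, 0x89, has its high bit set, so a transfer that strips data to 7 bits corrupts it, and it keeps the file from being mistaken for text; the next three bytes spell “PNG”.
- 0D 0A 1A 0A for the rest of the PNG signature: the CR LF pair and the final LF expose line-ending conversion in text-mode transfers, and 1A (Ctrl-Z) is the DOS end-of-file character.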
- 19 54 01 19 for Berkeley FFS superblock: Marshall Kirk McKusick’s birthday encoded in hex.
- %PDF for PDF: ASCII text “PDF” preceded by “%” comment marker.
- SQLite format 3 for SQLite: ASCII text directly identifying format and version.
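The readable parts of these signatures can be recovered by decoding the hex bytes as ASCII. The helper below is a small sketch (the name `render_ascii` is an assumption made for illustration); it prints a dot for every non-printable byte, which is the same convention the reference tables in this entry use for values such as “PK..”.

```python
def render_ascii(hex_signature: str) -> str:
    """Render a hex signature as text, using '.' for non-printable bytes."""
    raw = bytes.fromhex(hex_signature)
    return "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in raw)


for sig in ("4D 5A", "50 4B 03 04", "89 50 4E 47 0D 0A 1A 0A", "25 50 44 46"):
    print(f"{sig:<24} {render_ascii(sig)}")
```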
Signature design principles
Modern format design follows several principles for signature selection:
- Improbable byte combinations: avoid common byte sequences like all zeros (00 00 00 00) or all 0xFFs.
- Mixed printable and non-printable: typically combines high bytes with ASCII text for partial human readability.
- Endianness checks: some formats (such as TIFF) embed the byte order in the signature itself.
- Version coding: some formats encode version information in signature bytes for forward compatibility.
- Container vs content: wrapped formats (DOCX in ZIP) inherit container signature with content-types differentiating internally.
Multi-magic and complex signatures
Some formats have multiple valid signatures or non-trivial structure:
- TIFF: 49 49 2A 00 (“II”, little-endian) or 4D 4D 00 2A (“MM”, big-endian); the byte order itself is part of the format identifier.
- MPEG-4 family: ftyp box at offset 4 with content varying by sub-type (mp42, M4V, qt, isom, etc.).
- JPEG variants: FF D8 FF followed by E0 (JFIF), E1 (EXIF), E8 (SPIFF), or DB (raw); all are valid JPEG.
- RIFF container: 52 49 46 46 (RIFF) followed by a 4-byte size, then a 4-byte form type (“WEBP”, “WAVE”, or “AVI ” with a trailing space).
- Polyglot files: rare files crafted to satisfy multiple format signatures simultaneously; security research curiosity.
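The compound cases above need a second look past the first few bytes. The following sketch shows one way to do that for RIFF, the MPEG-4 `ftyp` box, TIFF, and JPEG; the function name `refine` and its return strings are illustrative assumptions, not any tool's actual output.

```python
def refine(data: bytes) -> str:
    """Second-stage identification for signatures that need more context."""
    # RIFF: "RIFF", a 4-byte size, then a 4-byte form type.
    if data[:4] == b"RIFF" and len(data) >= 12:
        forms = {b"WAVE": "WAV", b"AVI ": "AVI", b"WEBP": "WebP"}
        return forms.get(data[8:12], "RIFF (unknown form)")
    # MPEG-4 family: 4-byte box size, "ftyp", then a 4-byte major brand.
    if data[4:8] == b"ftyp" and len(data) >= 12:
        brand = data[8:12].decode("ascii", errors="replace")
        return f"ISO-BMFF, major brand {brand!r}"
    # TIFF: the byte-order mark decides how the value 42 after it is stored.
    if data[:4] == b"II\x2a\x00":
        return "TIFF (little-endian)"
    if data[:4] == b"MM\x00\x2a":
        return "TIFF (big-endian)"
    # JPEG: the fourth byte names the first marker segment.
    if data[:3] == b"\xff\xd8\xff" and len(data) >= 4:
        variants = {0xE0: "JFIF", 0xE1: "EXIF", 0xE8: "SPIFF", 0xDB: "raw"}
        return f"JPEG ({variants.get(data[3], 'other')})"
    return "unknown"
```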
File Signatures Across Major Formats
The following categorized reference covers the most-encountered file signatures across major format families. Each format is listed with its hex signature, ASCII representation where applicable, and notes on variants or compound usage.
Image formats
| Format | Hex Signature | End Marker | Notes |
|---|---|---|---|
| JPEG (JFIF/EXIF/SPIFF) | FF D8 FF | FF D9 | 4th byte E0/E1/E8/DB indicates variant |
| PNG | 89 50 4E 47 0D 0A 1A 0A | IEND chunk | 8-byte signature with CRLF/EOF checks |
| GIF (87a / 89a) | 47 49 46 38 37 61 / 47 49 46 38 39 61 | 3B (semicolon) | “GIF87a” or “GIF89a” |
| BMP / DIB | 42 4D | None standard | “BM”; size in next 4 bytes |
| TIFF (little-endian) | 49 49 2A 00 | None standard | “II”; common on Windows/Intel |
| TIFF (big-endian) | 4D 4D 00 2A | None standard | “MM”; common on macOS legacy |
| WebP | 52 49 46 46 ?? ?? ?? ?? 57 45 42 50 | None standard | “RIFF….WEBP” container |
| HEIC / HEIF | ?? ?? ?? ?? 66 74 79 70 68 65 69 63 | None standard | ftyp box at offset 4 with “heic” |
| ICO (Windows icon) | 00 00 01 00 | None standard | 0 reserved + 1 type + image count |
| PSD (Photoshop) | 38 42 50 53 | None standard | “8BPS” |
Document formats
| Format | Hex Signature | End Marker | Notes |
|---|---|---|---|
| PDF | 25 50 44 46 | %%EOF | “%PDF” + version |
| Old MS Office (DOC/XLS/PPT) | D0 CF 11 E0 A1 B1 1A E1 | None standard | OLE Compound Document |
| Modern Office (DOCX/XLSX/PPTX) | 50 4B 03 04 | 50 4B 05 06 | ZIP with [Content_Types].xml |
| OpenDocument (ODT/ODS/ODP) | 50 4B 03 04 | 50 4B 05 06 | ZIP with mimetype file |
| RTF | 7B 5C 72 74 66 31 | 7D (closing brace) | “{\rtf1” |
| EPUB | 50 4B 03 04 | 50 4B 05 06 | ZIP with META-INF/container.xml |
| MOBI | 42 4F 4F 4B 4D 4F 42 49 (at offset 60) | None standard | “BOOKMOBI”; bytes before offset 60 vary |
| PostScript | 25 21 50 53 | None standard | “%!PS” |
Archive and compression formats
| Format | Hex Signature | End Marker | Notes |
|---|---|---|---|
| ZIP | 50 4B 03 04 | 50 4B 05 06 | “PK..” for Phil Katz |
| RAR (v1.5-4) | 52 61 72 21 1A 07 00 | None standard | “Rar!..” |
| RAR (v5+) | 52 61 72 21 1A 07 01 00 | None standard | “Rar!…” with version 5 |
| 7-Zip | 37 7A BC AF 27 1C | None standard | “7z” + binary |
| GZIP | 1F 8B | None standard | 2-byte signature only |
| BZIP2 | 42 5A 68 | None standard | “BZh” |
| XZ | FD 37 7A 58 5A 00 | 59 5A | “.7zXZ\0”; stream footer ends with “YZ” |
| TAR | 75 73 74 61 72 (at offset 257) | None standard | “ustar”; bytes before offset 257 vary |
| JAR (Java archive) | 50 4B 03 04 | 50 4B 05 06 | ZIP with META-INF/MANIFEST.MF |
| APK (Android package) | 50 4B 03 04 | 50 4B 05 06 | ZIP with AndroidManifest.xml |
Executable formats
| Format | Hex Signature | End Marker | Notes |
|---|---|---|---|
| Windows PE (EXE/DLL/SYS) | 4D 5A | None standard | “MZ” for Mark Zbikowski |
| ELF (Linux/Unix) | 7F 45 4C 46 | None standard | 0x7F (DEL) followed by “ELF” |
| Mach-O (macOS 32-bit) | FE ED FA CE | None standard | “FEEDFACE” |
| Mach-O (macOS 64-bit) | FE ED FA CF | None standard | “FEEDFACF” |
| Mach-O Universal | CA FE BA BE | None standard | “CAFEBABE”; shared with Java class |
| Java class file | CA FE BA BE | None standard | “CAFEBABE”; shared with Mach-O Universal |
| WebAssembly | 00 61 73 6D | None standard | “.asm” |
| DEX (Android) | 64 65 78 0A 30 33 ?? 00 | None standard | “dex\n03?\0”; the digits carry the format version |
Audio and video formats
| Format | Hex Signature | End Marker | Notes |
|---|---|---|---|
| MP3 (with ID3 tag) | 49 44 33 | None standard | “ID3” |
| MP3 (raw frame) | FF FB / FF F3 / FF FA / FF F2 | None standard | 11-bit sync + version bits |
| MP4 / M4A / M4V | ?? ?? ?? ?? 66 74 79 70 | None standard | “ftyp” box at offset 4 |
| WAV | 52 49 46 46 ?? ?? ?? ?? 57 41 56 45 | None standard | “RIFF….WAVE” |
| AVI | 52 49 46 46 ?? ?? ?? ?? 41 56 49 20 | None standard | “RIFF….AVI ” (trailing space) |
| FLAC | 66 4C 61 43 | None standard | “fLaC” |
| OGG | 4F 67 67 53 | None standard | “OggS” |
| MKV / WebM | 1A 45 DF A3 | None standard | EBML container |
| FLV | 46 4C 56 | None standard | “FLV” |
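The raw MP3 frame row is a bit pattern rather than a fixed byte string: eleven sync bits, two version bits, two layer bits, and a protection bit. The sketch below tests those fields directly. The function name is hypothetical, and a two-byte test like this produces false positives on arbitrary binary data, so real tools also validate the rest of the frame header.

```python
def looks_like_mp3_frame(head: bytes) -> bool:
    """Check the 11-bit frame sync and require Layer III in the layer bits."""
    if len(head) < 2:
        return False
    sync_ok = head[0] == 0xFF and (head[1] & 0xE0) == 0xE0
    version_bits = (head[1] >> 3) & 0b11  # 0b01 is a reserved value
    layer_bits = (head[1] >> 1) & 0b11    # 0b01 means Layer III
    return sync_ok and version_bits != 0b01 and layer_bits == 0b01


for pair in ("FF FB", "FF F3", "FF FA", "FF F2", "FF D8"):
    print(pair, looks_like_mp3_frame(bytes.fromhex(pair)))
```

The last pair is the start of a JPEG and is rejected because its top three bits after the first byte are not all set.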
Database and filesystem formats
| Format | Hex Signature | End Marker | Notes |
|---|---|---|---|
| SQLite | 53 51 4C 69 74 65 20 66 6F 72 6D 61 74 20 33 00 | None standard | “SQLite format 3” + null |
| Microsoft Access (MDB) | 00 01 00 00 53 74 61 6E 64 61 72 64 20 4A 65 74 20 44 42 | None standard | “Standard Jet DB” |
| dBase / DBF | 03 / 04 / 05 / 30 | 1A (EOF) | First byte indicates version |
| Microsoft Compiled HTML (CHM) | 49 54 53 46 | None standard | “ITSF” |
| NTFS boot sector | EB 52 90 4E 54 46 53 | None standard | Jump + “NTFS” at offset 3 |
| FAT12/16/32 | EB ?? 90 | None standard | Jump instruction; FAT type via additional bytes |
| ext2/3/4 superblock | 53 EF (at offset 0x438) | None standard | 0xEF53 magic at superblock offset |
| HFS+ (macOS) | 48 2B | None standard | “H+” at offset 1024 |
| APFS (macOS) | 4E 58 53 42 | None standard | “NXSB” at container superblock |
Disk image and container formats
| Format | Hex Signature | End Marker | Notes |
|---|---|---|---|
| ISO 9660 | 43 44 30 30 31 | None standard | “CD001” at offset 32769 (0x8001) |
| VMware VMDK | 4B 44 4D | None standard | “KDM” |
| VirtualBox VDI | 3C 3C 3C 20 4F 72 61 63 6C 65 | None standard | “<<< Oracle” |
| QEMU QCOW2 | 51 46 49 FB | None standard | “QFI.” |
| VHD (Microsoft) | 63 6F 6E 65 63 74 69 78 | None standard | “conectix” at end of file |
| DMG (macOS) | 78 01 73 0D 62 62 60 | 6B 6F 6C 79 (“koly” trailer block) | Header varies; UDIF images carry a “koly” trailer near the end of the file |
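Several rows in the tables above place the signature at a fixed offset other than zero (TAR, ext2/3/4, HFS+, ISO 9660). A sketch of checking those positions in a raw image follows; `OFFSET_RULES` and `identify_at_offsets` are illustrative names, and the offsets are the ones listed in the tables.

```python
# (offset, signature, label) for formats whose magic is not at byte 0.
OFFSET_RULES = [
    (257, b"ustar", "TAR (ustar)"),
    (0x438, b"\x53\xef", "ext2/3/4 superblock"),
    (1024, b"H+", "HFS+ volume header"),
    (0x8001, b"CD001", "ISO 9660"),
]


def identify_at_offsets(image: bytes) -> list:
    """Return the labels of every offset rule that matches the image."""
    return [
        label
        for offset, signature, label in OFFSET_RULES
        if image[offset:offset + len(signature)] == signature
    ]
```

For large images it is usually better to seek to each offset and read a few bytes than to load the whole file; the in-memory version is used here only to keep the example short.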
Signature Identification Tools
Several tools and databases standardize the use of file signatures for identification across applications.
The libmagic library and file command
libmagic is the canonical signature database used by most Unix-like systems. The Inventive HQ reference describes its role: “The first few bytes of a file contain a signature that file identification tools compare against a database of known formats: The Unix file command, Python’s python-magic library, and this tool all use magic number databases to identify files. The most comprehensive database is maintained by the libmagic project.” [4] Key properties:
- Database location: /usr/share/file/magic (source format) or /usr/share/file/magic.mgc (compiled).
- Coverage: several thousand file formats; updated regularly.
- Magic database syntax: declarative format with offset, type, value, message; supports compound checks.
- libmagic API: magic_open(), magic_load(), magic_file() for programmatic access.
- Bindings: Python (python-magic), Ruby (filemagic gem), PHP (fileinfo), Java (jmimemagic).
- Limitations: may misidentify ambiguous formats; relies on database accuracy.
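To make the “offset, type, value, message” structure concrete, here is a deliberately simplified rule evaluator. It is a sketch of the concept only, not libmagic's engine or file syntax: the rule tuples, the four numeric type names borrowed from magic files, and the functions `rule_matches` and `describe` are assumptions made for illustration.

```python
import struct

# Simplified stand-ins for a few magic test types; the real database supports
# many more types, modifiers, and nested continuation levels.
_NUMERIC = {"beshort": ">H", "leshort": "<H", "belong": ">I", "lelong": "<I"}


def rule_matches(data: bytes, offset: int, kind: str, value) -> bool:
    """Evaluate one (offset, type, value) test against the data."""
    if kind == "string":
        return data[offset:offset + len(value)] == value
    fmt = _NUMERIC[kind]
    size = struct.calcsize(fmt)
    chunk = data[offset:offset + size]
    return len(chunk) == size and struct.unpack(fmt, chunk)[0] == value


# Each rule: a top-level test, a message, and continuation tests that add
# detail only after the top-level test has matched.
RULES = [
    ((0, "string", b"RIFF"), "RIFF data", [
        ((8, "string", b"WAVE"), ", WAVE audio"),
        ((8, "string", b"WEBP"), ", WebP image"),
    ]),
    ((0, "belong", 0xCAFEBABE), "Java class or Mach-O universal", []),
    ((0, "leshort", 0x5A4D), "MS-DOS/PE executable", []),
]


def describe(data: bytes) -> str:
    """Return the message of the first matching rule plus continuation text."""
    for test, message, continuations in RULES:
        if rule_matches(data, *test):
            extra = "".join(m for t, m in continuations if rule_matches(data, *t))
            return message + extra
    return "data"
```

The continuation list mirrors the compound checks mentioned above: a broad container match first, then a narrower test at a later offset.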
The PRONOM database
PRONOM is the UK National Archives’ file format registry, focused on long-term preservation:
- Maintained by: The National Archives (UK).
- Coverage: 1500+ file formats with detailed signatures and metadata.
- Format identifiers: Persistent Unique Identifiers (PUIDs) like fmt/123 for each format.
- Used by: DROID identification tool, Siegfried, format identification workflows in archives and libraries.
- Web interface: nationalarchives.gov.uk/PRONOM/ for browsing and searching.
- Use case: long-term digital preservation; format identification for archival storage.
Gary Kessler’s File Signatures Table
Gary Kessler’s signature table is a long-maintained reference resource:
- URL: gck.au (formerly garykessler.net).
- Coverage: hundreds of file formats with header and trailer signatures.
- Format: sortable HTML table with hex signatures, ASCII representations, and notes.
- Trailer signatures: particularly comprehensive; many formats not in libmagic.
- Used by: forensic examiners, security researchers, file format reverse engineers.
- Maintained: updated since the early 2000s with regular additions.
Identification utilities
Several command-line tools use signature databases for file identification:
- file: Unix standard; uses libmagic; available on Linux, macOS, BSD.
- TrID: proprietary identifier with extensive format database; uses statistical pattern matching beyond simple signatures.
- DROID: Java-based tool from National Archives UK; uses PRONOM database.
- Siegfried: Go-based PRONOM identifier; faster than DROID.
- binwalk: firmware analysis tool; uses signatures to identify embedded files.
- foremost / scalpel: file carving tools using signature databases for recovery.
- Apache Tika: Java library with signature-based identification plus metadata extraction.
PhotoRec’s signature database
PhotoRec, the open-source data recovery tool by Christophe Grenier, includes a comprehensive signature database specifically for carving:
- Coverage: 480+ file types including specialized formats (digital camera RAW, scientific instruments, custom databases).
- Header and trailer pairs: where applicable, for accurate file boundary detection.
- Maximum size limits: for formats without trailer signatures.
- Customization: users can add custom signatures for proprietary formats.
- Open source: signature definitions visible in PhotoRec source code.
- Use case: file carving from disk images where filesystem metadata is unavailable.
Hex editor signature inspection
Manual signature inspection via hex editor is the foundational technique for individual file analysis:
- Open the file in any hex editor (HxD, 010 Editor, Hex Fiend, etc.).
- Examine the first 16-32 bytes; the signature appears at offset 0 (or known offset for specific formats).
- Look up the signature in a reference table (libmagic, Gary Kessler, this entry).
- Confirm format by examining additional structure beyond the signature.
- For ambiguous cases, examine multiple files of suspected format for comparison.
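When no hex editor is at hand, the same view can be produced with a few lines of code. This sketch prints offset, hex, and ASCII columns for the leading bytes of a file; `hexdump` is an illustrative helper, not a standard-library function.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as offset, hex, and ASCII columns, like a hex editor."""
    pad = width * 3 - 1  # width of a full hex column, including spaces
    lines = []
    for start in range(0, len(data), width):
        row = data[start:start + width]
        hex_part = " ".join(f"{b:02X}" for b in row)
        text_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in row)
        lines.append(f"{start:08X}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(lines)


# First 16 bytes of a PNG: the 8-byte signature, then the IHDR chunk header.
print(hexdump(bytes.fromhex("89504E470D0A1A0A0000000D49484452")))
```

To inspect a real file, read its first 32 bytes in binary mode and pass them to the function.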
File Signatures and Data Recovery
File signatures play a central role in several specific data recovery scenarios where filesystem metadata is unavailable or unreliable.
The carving foundation
File carving is the recovery technique that extracts files based on signatures rather than filesystem structures. The GitHub File Magic Numbers gist describes the role: “I suppose the different tools for file recovery from a corrupt USB or hard drive work like they do a byte-by-byte reading with the file start and end byte signature detection. The size of a file is saved in the directory information of the file system. When this information is corrupt, recovery tools can try to find the boundary of files.” [5] The carving workflow:
- Image the source drive with dd or ddrescue to a separate destination.
- Run a file carving tool (PhotoRec, foremost, scalpel) against the image.
- The tool scans byte-by-byte for known magic numbers across the image.
- When a header signature is found, the tool marks the start of a potential file.
- When a trailer signature is found (or maximum size reached), the tool marks the end.
- The carved file is saved with sequential numbering and the appropriate extension.
- Optional verification: the carved file is opened in its native application to confirm validity.
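The header, trailer, and size-cap steps of that workflow can be sketched for a single format. The example below carves JPEG candidates from an in-memory image; it is a teaching sketch, not how PhotoRec, foremost, or scalpel are implemented, and the function name and the 20 MiB default cap are assumptions.

```python
JPEG_HEADER = b"\xff\xd8\xff"
JPEG_TRAILER = b"\xff\xd9"


def carve_jpegs(image: bytes, max_size: int = 20 * 1024 * 1024) -> list:
    """Return (start, end) byte ranges of candidate JPEG files.

    Assumes each file is stored contiguously; fragmented files come out
    truncated or corrupt.
    """
    found = []
    pos = 0
    while True:
        start = image.find(JPEG_HEADER, pos)
        if start == -1:
            break
        limit = min(len(image), start + max_size)
        trailer = image.find(JPEG_TRAILER, start + len(JPEG_HEADER), limit)
        # Fall back to the size cap when no trailer appears within the limit.
        end = trailer + len(JPEG_TRAILER) if trailer != -1 else limit
        found.append((start, end))
        pos = end
    return found
```

Each range can then be written out as `image[start:end]` under a sequential name. Stopping at the first FF D9 is a simplification: JPEGs with embedded thumbnails contain an earlier end marker, which is one reason production carvers parse the format's internal structure as well.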
When carving works best
File carving via signatures works best in specific scenarios:
- Formatted drive recovery: filesystem metadata destroyed, data still intact.
- Corrupted partition tables: drive can’t mount but data sectors readable.
- Severely damaged filesystems: NTFS MFT or ext4 inode tables destroyed.
- Unallocated space recovery: deleted files where directory entries are gone.
- Memory dump analysis: RAM images for forensic file extraction.
- Damaged removable media: SD cards, USB drives with filesystem corruption.
When carving has limitations
Signature-based carving has specific failure modes:
- Fragmented files: if a file is split across non-contiguous sectors, the carved file contains only the contiguous portion starting from the signature.
- Files without signatures: text files, raw data dumps, encrypted blobs cannot be carved by signature.
- Compound formats: ZIP-based formats (DOCX, JAR) all carve as ZIP; further analysis required for proper identification.
- Overlapping signatures: false positives when random data happens to match a magic number.
- Variable-length without trailers: formats without trailers depend on size limits for boundaries.
- Encrypted volumes: BitLocker/FileVault containers appear as random data; signatures inside are unreadable until decrypted.
FOUND.000 / .CHK file identification
One specific recovery scenario involves identifying file types from chkdsk’s FOUND.000 directory:
- chkdsk produces FILE0001.CHK, FILE0002.CHK, etc. with no original file names.
- Each .CHK file represents recovered fragments from cross-linked or orphaned chains.
- Open each .CHK in a hex editor and read the first 8-16 bytes.
- Match against signature reference (this entry, libmagic, Gary Kessler).
- Rename file with appropriate extension (FILE0001.CHK to FILE0001.jpg if signature is FF D8 FF).
- Open with native application to confirm proper recovery.
- Tools like UnCHK and ChkRecover automate this process for large FOUND.000 directories.
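The manual steps above can be scripted. The sketch below only proposes new names and leaves the files untouched; `EXTENSIONS`, `suggest_extension`, and `plan_renames` are hypothetical names, and the table is a small subset of the signatures in this entry. It is not a description of how UnCHK or ChkRecover work.

```python
from pathlib import Path

# Illustrative subset of header signatures mapped to extensions.
EXTENSIONS = [
    (b"\xff\xd8\xff", ".jpg"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"%PDF", ".pdf"),
    (b"PK\x03\x04", ".zip"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", ".doc"),  # OLE; may be XLS or PPT
]


def suggest_extension(head: bytes):
    """Return a likely extension for the given leading bytes, or None."""
    for signature, ext in EXTENSIONS:
        if head.startswith(signature):
            return ext
    return None


def plan_renames(found_dir: str) -> list:
    """List (old_path, new_path) pairs for .CHK files; nothing is renamed."""
    plan = []
    for chk in sorted(Path(found_dir).glob("*.CHK")):
        with chk.open("rb") as handle:
            ext = suggest_extension(handle.read(16))
        if ext:
            plan.append((chk, chk.with_suffix(ext)))
    return plan
```

Reviewing the plan before renaming keeps the operation reversible, and working on a copy of the FOUND.000 directory is safer than touching the originals.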
Header repair scenarios
When file headers are damaged but content is intact, signatures provide the repair template:
- Compare damaged file header with known-good signature for the same format.
- Identify which bytes differ; determine which represent damage vs normal variation.
- Restore correct signature bytes via hex editor.
- Repair often suffices to make file openable; deeper structural damage may persist.
- Common cases: JPEG header truncation, ZIP central directory corruption, PDF xref table damage.
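The compare-and-restore steps amount to a byte-level diff followed by an overwrite. A minimal sketch, assuming the damage is confined to the signature bytes (the function names are illustrative):

```python
def differing_offsets(data: bytes, signature: bytes, offset: int = 0) -> list:
    """List offsets where the file disagrees with the known-good signature."""
    window = data[offset:offset + len(signature)]
    return [offset + i for i, (a, b) in enumerate(zip(window, signature)) if a != b]


def repair_header(data: bytes, signature: bytes, offset: int = 0) -> bytes:
    """Return a copy of data with the signature bytes overwritten at offset."""
    if len(data) < offset + len(signature):
        raise ValueError("file is too short to hold the signature")
    patched = bytearray(data)
    patched[offset:offset + len(signature)] = signature
    return bytes(patched)
```

Returning a new byte string instead of editing in place means the damaged original is preserved, which matters if the repair turns out to be wrong.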
Verification of recovery results
After running automated recovery tools, signatures help verify the results:
- Open recovered files and verify the magic number matches the assigned extension.
- Check for proper trailer signature where applicable.
- For ZIP-based formats, verify [Content_Types].xml is present for Office files.
- Identify files where the recovery tool stitched fragments incorrectly (signature mismatch with content).
- Use TrID, DROID, or Siegfried for systematic batch verification of large recovery outputs.
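A batch check of header and trailer against the assigned extension might look like the sketch below. The `BOUNDS` table and the four result strings are assumptions for illustration; the PNG trailer is the IEND chunk type followed by its fixed CRC (AE 42 60 82).

```python
# Illustrative (header, trailer) pairs keyed by extension.
BOUNDS = {
    ".jpg": (b"\xff\xd8\xff", b"\xff\xd9"),
    ".png": (b"\x89PNG\r\n\x1a\n", b"IEND\xae\x42\x60\x82"),
    ".gif": (b"GIF8", b"\x3b"),
}


def check_recovered(name: str, data: bytes) -> str:
    """Classify a file as 'ok', 'truncated?', 'mismatch', or 'unchecked'."""
    ext = name[name.rfind("."):].lower() if "." in name else ""
    if ext not in BOUNDS:
        return "unchecked"
    header, trailer = BOUNDS[ext]
    if not data.startswith(header):
        return "mismatch"
    # Carved files are often padded to a sector boundary, so ignore
    # trailing zero bytes before looking for the trailer.
    if not data.rstrip(b"\x00").endswith(trailer):
        return "truncated?"
    return "ok"
```

A “truncated?” result is only a hint: it flags files worth opening by hand, not files that are certainly broken.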
Security and forensic context
Signatures play roles beyond pure recovery in security and forensic work:
- Extension renaming detection: a file named .pdf whose first bytes are 4D 5A (an executable signature) indicates a possible attack.
- Antivirus signature databases: many AV engines use file format signatures as part of malware identification.
- Web upload validation: servers verify uploaded files match claimed types before processing.
- Forensic chain of custody: file identification is part of evidence cataloging.
- Steganography detection: unexpected signatures embedded in carrier files reveal hidden content.
- Custom format reverse engineering: identifying signature patterns is the first step in understanding proprietary formats.
File signatures are the foundation of identifying file formats when filenames and extensions cannot be trusted, which is exactly the situation that arises in most data recovery scenarios. For data recovery purposes, the practical implication is that signatures determine what carving tools can recover and how reliably the results can be identified: when filesystem metadata is destroyed but data sectors are readable, signatures are the only mechanism for identifying file boundaries and types. The PhotoRec database covering 480+ file types reflects decades of accumulated format knowledge; the libmagic database powers the Unix file command and dozens of derivative tools; the Gary Kessler signature table captures formats that other databases miss.
For users wondering when file signatures matter in practice, the answer follows the recovery scenario. For routine recovery where filesystem metadata is intact (recently deleted files, accidentally formatted drives where original metadata still partially exists), signature analysis is unnecessary because the filesystem provides the information. For deeper recovery where filesystem metadata is destroyed (formatted drives, corrupted partition tables, severely damaged filesystems), signatures become essential because they’re the only remaining basis for file identification. For FOUND.000/.CHK files from chkdsk operations, signatures convert anonymous numbered files back to identifiable formats. For files received from untrusted sources or with suspicious extension renaming, signatures verify true file types regardless of what the filename claims.
The practical guidance depends on the scenario. If chkdsk produced a FOUND.000 directory full of .CHK files, open each in a hex editor, check the first 16 bytes against the categorized tables in this entry, and rename the file with the matching extension. If a drive needs deep recovery via carving, use a tool such as PhotoRec, whose signature database covers 480+ file types; that coverage determines which file types can be recovered. If a damaged header prevents a file from opening, compare its leading bytes to a known-good signature for the same format and repair them in a hex editor. Standard data recovery software typically includes signature-based carving as a fallback mode; HDD-focused recovery tools apply signature analysis to drives where filesystem repair has failed. Cleanroom recovery services use signatures to validate recovered output from severely damaged drives. The core insight: signatures are what make recovery possible when filesystem structures fail.
File Signature FAQ
A file signature (also called magic number, magic bytes, or format signature) is a specific sequence of bytes at the beginning of a file that uniquely identifies its format, regardless of file extension or filename. The Wikipedia file signatures reference defines them: “A file signature is data used to identify or verify the content of a file. Such signatures are also known as magic numbers or magic bytes and are usually inserted at the beginning of the file.” The term originated in Version 7 Unix (1979), where the header constant was assigned to a variable labeled ux_mag and subsequently called the magic number. Common signatures: Windows PE files start with 4D 5A; JPEG starts with FF D8 FF; PNG starts with 89 50 4E 47 0D 0A 1A 0A; PDF starts with 25 50 44 46 (“%PDF”); ZIP starts with 50 4B 03 04. Signatures are typically 2-8 bytes long, located at file offset 0 (or sometimes at known offsets like MPEG-4’s ftyp box at offset 4). The CTF Handbook describes them: “File signatures (also known as File Magic Numbers) are bytes within a file used to identify the format of the file. Generally they’re 2-4 bytes long, found at the beginning of a file.” Operating systems, tools like file and binwalk, and antivirus engines use these signatures rather than file extensions because extensions can be wrong, missing, or deliberately misleading.
The term “magic number” originated in early Unix programming and reflected the apparently arbitrary nature of the byte values chosen as identifiers. The Wikipedia magic number reference describes the origin: “In Version Seven Unix, the header constant was not tested directly, but assigned to a variable labeled ux_mag and subsequently referred to as the magic number. Probably because of its uniqueness, the term magic number came to mean executable format type, then expanded to mean file system type, and expanded again to mean any type of file.” The “magic” aspect refers to several properties: (1) The values appear arbitrary to outside observers but have specific meaning to the file format parser; (2) The values are chosen to be improbable to occur randomly in other file types, providing high-confidence identification; (3) The values often have hidden meanings to the format authors. Examples of these hidden meanings: PE files use 4D 5A which spells “MZ” for Mark Zbikowski (the Microsoft engineer who designed the original DOS executable format, whose header PE inherits); ZIP files use 50 4B which spells “PK” for Phil Katz (PKZIP author); Java class files use CA FE BA BE which spells “CAFEBABE” as a memorable hex word; the Berkeley Fast File System superblock uses 19 54 01 19 (or its variants) which represents the birthday of Marshall Kirk McKusick, the FFS author.
The most-encountered file signatures across major format categories:
- Images: JPEG starts with FF D8 FF (followed by E0, E1, E8, or DB indicating the variant); PNG starts with 89 50 4E 47 0D 0A 1A 0A (8 bytes including line-ending checks); GIF starts with 47 49 46 38 (“GIF8”) followed by 37 61 (“7a”) or 39 61 (“9a”); BMP starts with 42 4D (“BM”); TIFF starts with either 49 49 2A 00 (“II”, little-endian) or 4D 4D 00 2A (“MM”, big-endian).
- Documents: PDF starts with 25 50 44 46 (“%PDF”); old MS Office (DOC/XLS/PPT) starts with D0 CF 11 E0 A1 B1 1A E1; modern Office Open XML (DOCX/XLSX/PPTX) starts with 50 4B 03 04 (the ZIP signature).
- Archives: ZIP starts with 50 4B 03 04; RAR v1.5-4 starts with 52 61 72 21 1A 07 00; RAR v5+ starts with 52 61 72 21 1A 07 01 00; 7-Zip starts with 37 7A BC AF 27 1C; GZIP starts with 1F 8B; BZIP2 starts with 42 5A 68 (“BZh”).
- Executables: Windows PE/EXE/DLL starts with 4D 5A (“MZ”); ELF Linux executables start with 7F 45 4C 46 (0x7F then “ELF”); Mach-O 32-bit starts with FE ED FA CE; Mach-O 64-bit starts with FE ED FA CF; Java class and Mach-O Universal share CA FE BA BE.
- Audio/Video: MP3 with an ID3 tag starts with 49 44 33 (“ID3”); an MP3 raw frame starts with FF FB or FF F3; MP4/M4A/M4V have “ftyp” at offsets 4-7; MKV starts with 1A 45 DF A3; FLAC starts with 66 4C 61 43 (“fLaC”).
- Databases: SQLite starts with “SQLite format 3” followed by a null byte (53 51 4C 69 74 65 20 66 6F 72 6D 61 74 20 33 00).
Compound file signatures occur when one file format is wrapped or embedded in another, with the outer format’s signature visible at the start. The most common example is the ZIP-based formats: DOCX (Word), XLSX (Excel), PPTX (PowerPoint), JAR (Java archives), APK (Android packages), EPUB (e-books), and ODT (OpenDocument) all start with the ZIP signature 50 4B 03 04 because they are technically ZIP archives containing format-specific content files. The picoCTF Solutions reference describes the inheritance: “Office files (.docx, .xlsx, .pptx), JAR files, APK files, and EPUB files all share the ZIP signature because they are ZIP archives at heart.” To distinguish compound formats, signature analysis tools examine the contents of the ZIP archive: a file containing [Content_Types].xml is Office Open XML; a file containing META-INF/MANIFEST.MF is JAR; a file containing AndroidManifest.xml is APK; a file containing META-INF/container.xml is EPUB. Other compound signatures include: TAR archives that may be GZIP-compressed (1F 8B header followed by tar content); ISO 9660 disk images may contain UDF or other filesystems with their own signatures; QEMU QCOW2 disk images contain wrapped filesystem images. The compound nature means that signature-based identification is sometimes the first step rather than the final answer; deeper analysis of contents is required for full format identification.
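The member-name checks described above can be sketched with Python's standard `zipfile` module. The marker list follows the paragraph; the order is a design assumption, since an APK may also contain META-INF/MANIFEST.MF and so has to be tested before JAR. `classify_zip` is an illustrative name.

```python
import io
import zipfile

# Marker members named in the text above, checked in order.
MARKERS = [
    ("[Content_Types].xml", "Office Open XML (DOCX/XLSX/PPTX)"),
    ("AndroidManifest.xml", "APK"),
    ("META-INF/container.xml", "EPUB"),
    ("META-INF/MANIFEST.MF", "JAR"),
]


def classify_zip(data: bytes) -> str:
    """Name the ZIP-based format by looking for a characteristic member."""
    if not data.startswith(b"PK\x03\x04"):
        return "not a ZIP local-file header"
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return "ZIP signature, but unreadable archive"
    for member, label in MARKERS:
        if member in names:
            return label
    return "plain ZIP"
```

Carved ZIP files are often truncated, which is why the unreadable-archive case is reported separately instead of raising.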
Not all files have magic number signatures. Several common file types deliberately lack signatures or have signatures that vary so much they cannot be reliably identified. Plain text files (TXT, CSV, HTML, XML, source code) typically have no signature; identification relies on content analysis and encoding heuristics rather than fixed bytes. Some Unicode text files include a Byte Order Mark (BOM): UTF-8 BOM is EF BB BF, UTF-16 LE BOM is FF FE, UTF-16 BE BOM is FE FF, UTF-32 LE BOM is FF FE 00 00; the BOM serves a similar purpose to a magic number but is optional. HTML files may start with the DOCTYPE declaration but this is not technically a binary signature. Encrypted files (encrypted with AES-CBC, GPG output without armor, etc.) often have no recognizable signature because encryption produces apparently random bytes; some encryption formats include signatures (PGP/GPG armored output starts with -----BEGIN PGP); raw encrypted data does not. Compressed data alone (without container) may lack signatures: deflate-compressed data has no fixed header. Custom or proprietary formats may use any byte values as signatures, including no signature at all if the format author didn’t follow conventions. The libmagic database and tools like file/TrID/DROID handle these cases through content analysis (statistical analysis of byte distributions, encoding detection, structure pattern matching) rather than fixed signature lookup; these heuristic approaches are less reliable than magic number identification but extend coverage to text and unsigned binary formats.
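BOM detection has one ordering trap: the UTF-32 LE mark (FF FE 00 00) begins with the UTF-16 LE mark (FF FE), so the longer marks have to be tested first. A short sketch using the BOM constants from Python's `codecs` module (the `detect_bom` name is illustrative):

```python
import codecs

# Longest marks first: the UTF-32 LE BOM begins with the UTF-16 LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading byte order mark, or None."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None
```

A result of None does not mean the file is not text; as noted above, the BOM is optional, and most UTF-8 files omit it.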
File signatures are the foundation of file carving, one of the most important data recovery techniques for severely damaged or formatted drives. The standard recovery workflow when filesystem metadata is missing or destroyed: (1) Image the source drive with dd or ddrescue to a separate destination. (2) Run a file carving tool like PhotoRec, foremost, or scalpel against the image. (3) The tool scans the image for known magic numbers, marking the start of each potential file. (4) When a trailer signature is known (JPEG FF D9, PDF %%EOF, the PNG IEND chunk), the tool uses it to identify where the file ends. (5) When no trailer signature is known, the tool uses a maximum file size limit to bound the carved file. (6) Each detected signature with its boundary becomes a recovered file, named by sequence number. PhotoRec’s signature database covers 480+ file types; foremost and scalpel use configurable signature lists. The GitHub File Magic Numbers gist describes the carving role: “The size of a file is saved in the directory information of the file system. When this information is corrupt, recovery tools can try to find the boundary of files”. The FOUND.000/.CHK files produced by chkdsk also benefit from signature analysis: opening a .CHK file in a hex editor and reading the first 8-16 bytes typically reveals the file type via its magic number, allowing the file to be renamed with the correct extension. For corrupted file headers, comparison with known-good signatures from the same format identifies which bytes need reconstruction. File signatures are what make recovery possible when filesystem structures fail.
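Steps (3) to (5) of the workflow can be illustrated with a minimal header/trailer carver for JPEG data. This is a sketch under simplifying assumptions, not how PhotoRec, foremost, or scalpel are implemented: the function name carve_jpegs and the 20 MiB default limit are hypothetical, the whole image is assumed to fit in memory, and files are assumed to be stored contiguously.

```python
JPEG_HEADER = b"\xff\xd8\xff"   # start-of-file signature
JPEG_TRAILER = b"\xff\xd9"      # end-of-file trailer


def carve_jpegs(image: bytes, max_size: int = 20 * 1024 * 1024):
    """Return (start, end) byte offsets of candidate JPEGs in a raw image buffer."""
    spans = []
    pos = 0
    while True:
        # Step 3: find the next header signature.
        start = image.find(JPEG_HEADER, pos)
        if start == -1:
            break
        # Never look further than max_size bytes past the header.
        limit = min(len(image), start + max_size)
        # Step 4: look for the trailer within that window.
        trailer = image.find(JPEG_TRAILER, start + len(JPEG_HEADER), limit)
        if trailer != -1:
            end = trailer + len(JPEG_TRAILER)
        else:
            # Step 5: no trailer found, so bound the file by the size limit.
            end = limit
        spans.append((start, end))
        pos = end
    return spans
```

Each returned span could then be written out as a sequentially numbered file, matching step (6). Stopping at the first FF D9 is a known weakness of this naive approach: a JPEG with an embedded thumbnail contains an earlier trailer, so the carved file may be cut short, and fragmented files are not handled at all.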
Related glossary entries
- Hex Editor: the standard tool for manually inspecting file signatures and headers.
- dd Command: creates the disk images that signature-based carving tools then process.
- chkdsk: produces FOUND.000/.CHK files identified via magic number analysis.
- Forensic Recovery: signature-based identification is foundational to forensic file analysis.
- Data Recovery: signatures power file carving when filesystem metadata is destroyed.
- Disk Image: PhotoRec and similar tools scan disk images for signature patterns.
- Data Corruption: damaged headers can be repaired by comparing with known-good signatures.
Sources
- Wikipedia: List of file signatures (accessed May 2026)
- CTF Handbook: File Formats and Magic Numbers
- Wikipedia: Magic number (programming)
- Inventive HQ: File Magic Number Checker
- GitHub leommoore: File Magic Numbers gist
- picoCTF Solutions: File Magic / Signature Identifier
- Medium / Shailendra Purohit: Beneath the Bytes: Magic Numbers Deep Dive
- GitHub Ilias1988: Magic-Bytes-List repository
- Cyber Forensics Academy: signature analysis in forensic context
