File Signature / Magic Number: Format Identification

A specific sequence of bytes at the beginning of a file that uniquely identifies its format, regardless of file extension or filename. The term originated in Version 7 Unix (1979) where the header constant was assigned to a variable labeled ux_mag and called the magic number; the meaning expanded over time from executable format type to file system type to any type of file. Common signatures include 4D 5A (PE executables, “MZ” for Mark Zbikowski), FF D8 FF (JPEG), 89 50 4E 47 (PNG), 25 50 44 46 (PDF, “%PDF”), 50 4B 03 04 (ZIP, “PK” for Phil Katz). The libmagic library provides the standardized signature database used by the Unix file(1) command, python-magic library, and many security tools. PhotoRec includes a signature database covering 480+ file types for data carving recovery.

Reference content reviewed by recovery engineers.

A file signature (also called magic number or magic bytes) is a specific sequence of bytes, usually at the beginning of a file, that identifies its format regardless of file extension. Common examples: 4D 5A for Windows executables, FF D8 FF for JPEG, 50 4B 03 04 for ZIP. Signatures enable identification when extensions are wrong, missing, or deliberately misleading. The libmagic library powers the Unix file command and many security tools, while PhotoRec’s signature database (covering 480+ file types) drives data carving. For data recovery, signatures are foundational: they’re how carving tools like PhotoRec locate file boundaries on drives where filesystem metadata has been destroyed. A hex editor shows signatures directly in its three-column display of offset, hex bytes, and ASCII text.

What File Signatures Are

The Wikipedia file signatures reference provides the foundational definition: “A file signature is data used to identify or verify the content of a file. Such signatures are also known as magic numbers or magic bytes and are usually inserted at the beginning of the file. Many file formats are not intended to be read as text. If such a file is accidentally viewed as a text file, its contents will be unintelligible. However, some file signatures can be recognizable when interpreted as text.”1

The fundamental concept

File signatures solve a specific problem: identifying file format when the filename or extension cannot be trusted:

  • The filename problem: filenames can be wrong, missing, deliberately misleading, or stripped during copying.
  • The extension problem: extensions are conventional, not enforced; renaming .exe to .pdf changes nothing about the file content.
  • The signature solution: the file’s actual byte content includes a recognizable pattern that identifies the format.
  • The location convention: signatures typically appear at byte offset 0; some formats use offset 4 (MPEG-4 ftyp) or other fixed positions.
  • The size convention: signatures are typically 2-8 bytes; rarely longer than 16 bytes.
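
The location and size conventions above can be sketched as a small lookup. This is a minimal illustration, not a complete identifier: the names SIGNATURES, identify, identify_file, and probe_size are chosen here for the example, and the table holds only a handful of the signatures mentioned in this entry.

```python
from typing import Optional

# Each entry: (offset, signature bytes, format label). Most signatures sit at
# offset 0; the MPEG-4 family keeps "ftyp" at offset 4.
SIGNATURES = [
    (0, b"\x89PNG\r\n\x1a\n", "PNG"),
    (0, b"\xff\xd8\xff", "JPEG"),
    (0, b"%PDF", "PDF"),
    (0, b"PK\x03\x04", "ZIP"),
    (0, b"MZ", "PE/MZ executable"),
    (4, b"ftyp", "ISO base media (MP4 family)"),
]


def identify(data: bytes) -> Optional[str]:
    """Return the label of the first matching signature, or None."""
    for offset, magic, label in SIGNATURES:
        if data[offset:offset + len(magic)] == magic:
            return label
    return None


def identify_file(path: str, probe_size: int = 32) -> Optional[str]:
    """Identify a file from its first probe_size bytes; the name is ignored."""
    with open(path, "rb") as handle:
        return identify(handle.read(probe_size))
```

Because only the leading bytes are read, renaming a file has no effect on the result, which is the point of the signature approach.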

The CTF Handbook framing

The CTF Handbook describes file signatures in the context of forensic challenges: “File signatures (also known as File Magic Numbers) are bytes within a file used to identify the format of the file. Generally they’re 2-4 bytes long, found at the beginning of a file. Files can sometimes come without an extension, or with incorrect ones. We use file signature analysis to reveal what they truly are.”2 The forensic and CTF use case captures why signatures matter:

  • An attacker may rename malware.exe to invoice.pdf to bypass naive filtering.
  • A file recovery tool finds a file with no name or extension; signature reveals format.
  • A user receives an unknown file; signature confirms (or contradicts) the claimed type.
  • A web upload validator must verify uploaded files match claimed types.
  • A forensic examiner must identify file types without trusting metadata.

Signature properties

Effective file signatures share several properties:

  • Uniqueness: signatures must be unlikely to occur in other formats; high-entropy patterns reduce false positives.
  • Position consistency: the same signature at the same offset across all files of that format.
  • Backward compatibility: format authors typically preserve signatures across versions for tool compatibility.
  • Documentation: well-designed formats publicly document their signatures for parser implementation.
  • Specificity vs generality: some signatures cover entire format families (RIFF for WAV/AVI/WebP) requiring secondary identification.

The “magic” connotation

The term “magic number” reflects the apparent arbitrariness of the chosen byte values:

  • Values appear arbitrary to outside observers but have specific meaning to format parsers.
  • Values are chosen to be improbable to occur randomly in other file types.
  • Values often hide meanings: PE 4D 5A spells “MZ” for Mark Zbikowski, ZIP 50 4B spells “PK” for Phil Katz.
  • Java class files use CA FE BA BE; “CAFEBABE” reads as a memorable hex word.
  • Berkeley Fast File System superblock uses 19 54 01 19 representing the birthday of author Marshall Kirk McKusick.

The Magic Number Concept and History

The history of file signatures parallels the development of operating systems and file format conventions over the past several decades.

The Unix V7 origin

The Wikipedia magic number reference describes the term’s origin: “In Version Seven Unix, the header constant was not tested directly, but assigned to a variable labeled ux_mag and subsequently referred to as the magic number. Probably because of its uniqueness, the term magic number came to mean executable format type, then expanded to mean file system type, and expanded again to mean any type of file.”3 The expansion path:

  • 1979 (Unix V7): ux_mag variable holds executable header constant; called the “magic number”.
  • Early 1980s: term extends to identify a.out vs other executable formats.
  • Mid-1980s: term covers filesystem identification (UFS superblock, FFS).
  • Late 1980s: term applies to any binary file format identifier.
  • 1990s onward: “magic number” and “file signature” used interchangeably across all file types.

The Unix file(1) command and /etc/magic

The Unix file command (1973-1974) established the standard mechanism for signature-based identification:

  • The file command: reads a file’s first bytes and looks them up in a magic database.
  • The /etc/magic database: traditionally located at /etc/magic; modern systems use /usr/share/file/magic or compiled .mgc files.
  • The libmagic library: programmatic interface to the magic database for non-shell programs.
  • Magic database syntax: offset, type, value, message format with continuation rules for compound checks.
  • Cross-platform: file command standard on Linux, macOS, FreeBSD, Solaris, and other Unix-like systems.

Hidden meanings in signatures

Many file signatures contain meaningful patterns chosen by format authors:

  • 4D 5A (MZ) for Windows PE/EXE: Mark Zbikowski, Microsoft engineer who designed the original DOS executable format; PE inherits this signature.
  • 50 4B (PK) for ZIP: Phil Katz, creator of PKZIP and the ZIP format.
  • CA FE BA BE for Java class / Mach-O Universal: “CAFEBABE” reads as a memorable hex word; coincidentally shared by both formats.
  • FE ED FA CE / CF for Mach-O: “FEEDFACE” / “FEEDFACF” reads as memorable; CE for 32-bit, CF for 64-bit.
  • 89 50 4E 47 for PNG: first byte 0x89 has the high bit set, which makes the file unlikely to be mistaken for text and exposes transfers that strip bit 7; the next bytes spell “PNG”.
  • 0D 0A 1A 0A for PNG continuation: tests for line-ending corruption (CR, LF, EOF, LF) in text-mode transfers.
  • 19 54 01 19 for Berkeley FFS superblock: Marshall Kirk McKusick’s birthday encoded in hex.
  • %PDF for PDF: ASCII text “PDF” preceded by “%” comment marker.
  • SQLite format 3 for SQLite: ASCII text directly identifying format and version.
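
The ASCII readings listed above can be reproduced by decoding the hex bytes and masking anything outside the printable range, which is roughly what the text column of a hex editor does. The function name printable is illustrative.

```python
def printable(hex_signature: str) -> str:
    """Render a spaced hex signature as ASCII, using '.' for other bytes."""
    raw = bytes.fromhex(hex_signature)
    return "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in raw)


for signature in ("4D 5A", "50 4B 03 04", "25 50 44 46",
                  "89 50 4E 47 0D 0A 1A 0A"):
    print(f"{signature:<24} {printable(signature)}")
```

Signatures such as CA FE BA BE render as dots only, because their "word" is visible in the hex digits rather than in the ASCII column.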

Signature design principles

Modern format design follows several principles for signature selection:

  • Improbable byte combinations: avoid common byte sequences like all zeros (00 00 00 00) or all 0xFFs.
  • Mixed printable and non-printable: typically combines high bytes with ASCII text for partial human readability.
  • Endianness checks: some formats (TIFF, for example) embed the byte order in the signature itself.
  • Version coding: some formats encode version information in signature bytes for forward compatibility.
  • Container vs content: wrapped formats (DOCX in ZIP) inherit container signature with content-types differentiating internally.
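
As a sketch of the endianness principle, the TIFF signature can be checked in two steps: read the two-letter order mark, then confirm that the following 16-bit value decodes to 42 (0x2A) in that byte order. The function name tiff_byte_order is illustrative.

```python
import struct
from typing import Optional


def tiff_byte_order(header: bytes) -> Optional[str]:
    """Return 'little', 'big', or None if the TIFF signature is absent."""
    if len(header) < 4:
        return None
    if header[:2] == b"II":
        prefix, name = "<", "little"
    elif header[:2] == b"MM":
        prefix, name = ">", "big"
    else:
        return None
    # The 16-bit value after the order mark must be 42 in the declared order.
    (version,) = struct.unpack(prefix + "H", header[2:4])
    return name if version == 42 else None
```

A header of 49 49 00 2A is rejected: the order mark says little-endian, but the value only reads as 42 in big-endian order.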

Multi-magic and complex signatures

Some formats have multiple valid signatures or non-trivial structure:

  • TIFF: 49 49 2A 00 (“II”, little-endian) or 4D 4D 00 2A (“MM”, big-endian); the byte order itself is part of the format identifier.
  • MPEG-4 family: ftyp box at offset 4 with content varying by sub-type (mp42, M4V, qt, isom, etc.).
  • JPEG variants: FF D8 FF followed by E0 (JFIF), E1 (EXIF), E8 (SPIFF), or DB (raw); all are valid JPEG.
  • RIFF container: 52 49 46 46 (RIFF) followed by a 4-byte size, then a 4-byte form type (“WEBP”, “WAVE”, or “AVI ” with a trailing space).
  • Polyglot files: rare files crafted to satisfy multiple format signatures simultaneously; security research curiosity.
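
Secondary identification for these families can be sketched as a second look at fixed positions after the primary signature matches. The function name refine and the returned label format are assumptions made for this example; the offsets and byte values are the ones listed above.

```python
from typing import Optional

JPEG_VARIANTS = {0xE0: "JFIF", 0xE1: "EXIF", 0xE8: "SPIFF", 0xDB: "raw"}


def refine(data: bytes) -> Optional[str]:
    """Return a family/sub-type label for RIFF, ftyp, and JPEG data."""
    # RIFF: bytes 0-3 "RIFF", bytes 4-7 size, bytes 8-11 form type.
    if data[:4] == b"RIFF" and len(data) >= 12:
        return "RIFF/" + data[8:12].decode("ascii", errors="replace").strip()
    # MPEG-4 family: "ftyp" at offset 4, sub-type (brand) in bytes 8-11.
    if data[4:8] == b"ftyp" and len(data) >= 12:
        return "ftyp/" + data[8:12].decode("ascii", errors="replace").strip()
    # JPEG: the 4th byte selects the variant.
    if data[:3] == b"\xff\xd8\xff" and len(data) >= 4:
        return "JPEG/" + JPEG_VARIANTS.get(data[3], "other")
    return None
```

The trailing space in “AVI ” and in short brands such as “qt  ” is stripped for display only; a strict parser would compare all four bytes.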

File Signatures Across Major Formats

The following categorized reference covers the most-encountered file signatures across major format families. Each format is listed with its hex signature, ASCII representation where applicable, and notes on variants or compound usage.

Image formats

Format | Hex Signature | End Marker | Notes
JPEG (JFIF/EXIF/SPIFF) | FF D8 FF | FF D9 | 4th byte E0/E1/E8/DB indicates variant
PNG | 89 50 4E 47 0D 0A 1A 0A | IEND chunk | 8-byte signature with CRLF/EOF checks
GIF (87a / 89a) | 47 49 46 38 37 61 / 47 49 46 38 39 61 | 3B (semicolon) | “GIF87a” or “GIF89a”
BMP / DIB | 42 4D | None standard | “BM”; size in next 4 bytes
TIFF (little-endian) | 49 49 2A 00 | None standard | “II”; common on Windows/Intel
TIFF (big-endian) | 4D 4D 00 2A | None standard | “MM”; common on macOS legacy
WebP | 52 49 46 46 ?? ?? ?? ?? 57 45 42 50 | None standard | “RIFF….WEBP” container
HEIC / HEIF | ?? ?? ?? ?? 66 74 79 70 68 65 69 63 | None standard | ftyp box at offset 4 with “heic”
ICO (Windows icon) | 00 00 01 00 | None standard | 0 reserved + 1 type + image count
PSD (Photoshop) | 38 42 50 53 | None standard | “8BPS”
PSD (Photoshop)38 42 50 53None standard“8BPS”

Document formats

Format | Hex Signature | End Marker | Notes
PDF | 25 50 44 46 | %%EOF | “%PDF” + version
Old MS Office (DOC/XLS/PPT) | D0 CF 11 E0 A1 B1 1A E1 | None standard | OLE Compound Document
Modern Office (DOCX/XLSX/PPTX) | 50 4B 03 04 | 50 4B 05 06 | ZIP with [Content_Types].xml
OpenDocument (ODT/ODS/ODP) | 50 4B 03 04 | 50 4B 05 06 | ZIP with mimetype file
RTF | 7B 5C 72 74 66 31 | 7D (closing brace) | “{\rtf1”
EPUB | 50 4B 03 04 | 50 4B 05 06 | ZIP with META-INF/container.xml
MOBI | 42 4F 4F 4B 4D 4F 42 49 (at offset 60) | None standard | “BOOKMOBI” at offset 60
PostScript | 25 21 50 53 | None standard | “%!PS”

Archive and compression formats

Format | Hex Signature | End Marker | Notes
ZIP | 50 4B 03 04 | 50 4B 05 06 | “PK..” for Phil Katz
RAR (v1.5-4) | 52 61 72 21 1A 07 00 | None standard | “Rar!..”
RAR (v5+) | 52 61 72 21 1A 07 01 00 | None standard | “Rar!…” with version 5
7-Zip | 37 7A BC AF 27 1C | None standard | “7z” + binary
GZIP | 1F 8B | None standard | 2-byte signature only
BZIP2 | 42 5A 68 | None standard | “BZh”
XZ | FD 37 7A 58 5A 00 | 59 5A | “.7zXZ\0”; stream footer ends with “YZ” (often seen as 00 00 00 00 04 59 5A)
TAR | 75 73 74 61 72 (at offset 257) | None standard | “ustar” at offset 257
JAR (Java archive) | 50 4B 03 04 | 50 4B 05 06 | ZIP with META-INF/MANIFEST.MF
APK (Android package) | 50 4B 03 04 | 50 4B 05 06 | ZIP with AndroidManifest.xml

Executable formats

Format | Hex Signature | End Marker | Notes
Windows PE (EXE/DLL/SYS) | 4D 5A | None standard | “MZ” for Mark Zbikowski
ELF (Linux/Unix) | 7F 45 4C 46 | None standard | “.ELF”; 0x7F (DEL) + ASCII “ELF”
Mach-O (macOS 32-bit) | FE ED FA CE | None standard | “FEEDFACE”; appears byte-reversed (CE FA ED FE) in little-endian files
Mach-O (macOS 64-bit) | FE ED FA CF | None standard | “FEEDFACF”; appears byte-reversed (CF FA ED FE) in little-endian files
Mach-O Universal | CA FE BA BE | None standard | “CAFEBABE”; shared with Java class
Java class file | CA FE BA BE | None standard | “CAFEBABE”; shared with Mach-O Universal
WebAssembly | 00 61 73 6D | None standard | “\0asm”
DEX (Android) | 64 65 78 0A 30 33 ?? 00 | None standard | “dex.03?\0” with version

Audio and video formats

Format | Hex Signature | End Marker | Notes
MP3 (with ID3 tag) | 49 44 33 | None standard | “ID3”
MP3 (raw frame) | FF FB / FF F3 / FF FA / FF F2 | None standard | 11-bit sync + version bits
MP4 / M4A / M4V | ?? ?? ?? ?? 66 74 79 70 | None standard | “ftyp” box at offset 4
WAV | 52 49 46 46 ?? ?? ?? ?? 57 41 56 45 | None standard | “RIFF….WAVE”
AVI | 52 49 46 46 ?? ?? ?? ?? 41 56 49 20 | None standard | “RIFF….AVI ”
FLAC | 66 4C 61 43 | None standard | “fLaC”
OGG | 4F 67 67 53 | None standard | “OggS”
MKV / WebM | 1A 45 DF A3 | None standard | EBML container
FLV | 46 4C 56 | None standard | “FLV”

Database and filesystem formats

Format | Hex Signature | End Marker | Notes
SQLite | 53 51 4C 69 74 65 20 66 6F 72 6D 61 74 20 33 00 | None standard | “SQLite format 3” + null
Microsoft Access (MDB) | 00 01 00 00 53 74 61 6E 64 61 72 64 20 4A 65 74 20 44 42 | None standard | “Standard Jet DB”
dBase / DBF | 03 / 04 / 05 / 30 | 1A (EOF) | First byte indicates version
Microsoft Compiled HTML (CHM) | 49 54 53 46 | None standard | “ITSF”
NTFS boot sector | EB 52 90 4E 54 46 53 | None standard | Jump + “NTFS” at offset 3
FAT12/16/32 | EB ?? 90 | None standard | Jump instruction; FAT type via additional bytes
ext2/3/4 superblock | 53 EF (at offset 0x438) | None standard | 0xEF53 magic at superblock offset
HFS+ (macOS) | 48 2B | None standard | “H+” at offset 1024
APFS (macOS) | 4E 58 53 42 | None standard | “NXSB” at container superblock

Disk image and container formats

Format | Hex Signature | End Marker | Notes
ISO 9660 | 43 44 30 30 31 | None standard | “CD001” at offset 32769 (0x8001)
VMware VMDK | 4B 44 4D | None standard | “KDM”
VirtualBox VDI | 3C 3C 3C 20 4F 72 61 63 6C 65 | None standard | “<<< Oracle”
QEMU QCOW2 | 51 46 49 FB | None standard | “QFI.”
VHD (Microsoft) | 63 6F 6E 65 63 74 69 78 | None standard | “conectix” at end of file
DMG (macOS) | 78 01 73 0D 62 62 60 | None standard | Variable; UDIF format

Signature Identification Tools

Several tools and databases standardize the use of file signatures for identification across applications.

The libmagic library and file command

libmagic is the canonical signature database used by most Unix-like systems. The Inventive HQ reference describes its role: “The first few bytes of a file contain a signature that file identification tools compare against a database of known formats: The Unix file command, Python’s python-magic library, and this tool all use magic number databases to identify files. The most comprehensive database is maintained by the libmagic project.”4 Key properties:

  • Database location: commonly /usr/share/file/magic or /usr/share/misc/magic (source format) with a compiled magic.mgc alongside; exact paths vary by distribution.
  • Coverage: several thousand file formats; updated regularly.
  • Magic database syntax: declarative format with offset, type, value, message; supports compound checks.
  • libmagic API: magic_open(), magic_load(), magic_file() for programmatic access.
  • Bindings: Python (python-magic), Ruby (filemagic gem), PHP (fileinfo), Java (jmimemagic).
  • Limitations: may misidentify ambiguous formats; relies on database accuracy.
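
The offset, type, value, message idea behind the magic database can be illustrated with a toy evaluator. This is a simplified sketch and not libmagic's implementation: the type names (string, leshort, belong, and so on) are borrowed from the magic(5) format, but the RULES table, matches, and describe are written for this example, and continuation rules for compound checks are left out.

```python
import struct
from typing import Optional

# (offset, type, expected value, message)
RULES = [
    (0, "string", b"%PDF-", "PDF document"),
    (0, "string", b"\x7fELF", "ELF"),
    (0, "belong", 0xCAFEBABE, "Java class or Mach-O universal binary"),
    (0, "leshort", 0x5A4D, "MZ executable"),
    (0x438, "leshort", 0xEF53, "ext2/3/4 filesystem"),
]

_STRUCT_FORMATS = {"leshort": "<H", "beshort": ">H",
                   "lelong": "<I", "belong": ">I"}


def matches(data: bytes, offset: int, kind: str, expected) -> bool:
    """Test one rule: compare raw bytes or a fixed-width integer at offset."""
    if kind == "string":
        return data[offset:offset + len(expected)] == expected
    fmt = _STRUCT_FORMATS[kind]
    if len(data) < offset + struct.calcsize(fmt):
        return False
    return struct.unpack_from(fmt, data, offset)[0] == expected


def describe(data: bytes) -> Optional[str]:
    """Return the message of the first rule that matches, or None."""
    for offset, kind, expected, message in RULES:
        if matches(data, offset, kind, expected):
            return message
    return None
```

Typed integer rules explain why the same signature can be written two ways: the bytes 4D 5A are the little-endian 16-bit value 0x5A4D, and the ext superblock bytes 53 EF are 0xEF53.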

The PRONOM database

PRONOM is the UK National Archives’ file format registry, focused on long-term preservation:

  • Maintained by: The National Archives (UK).
  • Coverage: 1500+ file formats with detailed signatures and metadata.
  • Format identifiers: Persistent Unique Identifiers (PUIDs) like fmt/123 for each format.
  • Used by: DROID identification tool, Siegfried, format identification workflows in archives and libraries.
  • Web interface: nationalarchives.gov.uk/PRONOM/ for browsing and searching.
  • Use case: long-term digital preservation; format identification for archival storage.

Gary Kessler’s File Signatures Table

Gary Kessler’s signature table is a long-maintained reference resource:

  • URL: gck.au (formerly garykessler.net).
  • Coverage: hundreds of file formats with header and trailer signatures.
  • Format: sortable HTML table with hex signatures, ASCII representations, and notes.
  • Trailer signatures: particularly comprehensive; many formats not in libmagic.
  • Used by: forensic examiners, security researchers, file format reverse engineers.
  • Maintained: updated since the early 2000s with regular additions.

Identification utilities

Several command-line tools use signature databases for file identification:

  • file: Unix standard; uses libmagic; available on Linux, macOS, BSD.
  • TrID: proprietary identifier with extensive format database; uses statistical pattern matching beyond simple signatures.
  • DROID: Java-based tool from National Archives UK; uses PRONOM database.
  • Siegfried: Go-based PRONOM identifier; faster than DROID.
  • binwalk: firmware analysis tool; uses signatures to identify embedded files.
  • foremost / scalpel: file carving tools using signature databases for recovery.
  • Apache Tika: Java library with signature-based identification plus metadata extraction.

PhotoRec’s signature database

PhotoRec, the open-source data recovery tool by Christophe Grenier, includes a comprehensive signature database specifically for carving:

  • Coverage: 480+ file types including specialized formats (digital camera RAW, scientific instruments, custom databases).
  • Header and trailer pairs: where applicable, for accurate file boundary detection.
  • Maximum size limits: for formats without trailer signatures.
  • Customization: users can add custom signatures for proprietary formats.
  • Open source: signature definitions visible in PhotoRec source code.
  • Use case: file carving from disk images where filesystem metadata is unavailable.

Hex editor signature inspection

Manual signature inspection via hex editor is the foundational technique for individual file analysis:

  • Open the file in any hex editor (HxD, 010 Editor, Hex Fiend, etc.).
  • Examine the first 16-32 bytes; the signature appears at offset 0 (or known offset for specific formats).
  • Look up the signature in a reference table (libmagic, Gary Kessler, this entry).
  • Confirm format by examining additional structure beyond the signature.
  • For ambiguous cases, examine multiple files of suspected format for comparison.

File Signatures and Data Recovery

File signatures play a central role in several specific data recovery scenarios where filesystem metadata is unavailable or unreliable.

The carving foundation

File carving is the recovery technique that extracts files based on signatures rather than filesystem structures. The GitHub File Magic Numbers gist describes the role: “I suppose the different tools for file recovery from a corrupt USB or hard drive work like they do a byte-by-byte reading with the file start and end byte signature detection. The size of a file is saved in the directory information of the file system. When this information is corrupt, recovery tools can try to find the boundary of files.”5 The carving workflow:

  1. Image the source drive with dd or ddrescue to a separate destination.
  2. Run a file carving tool (PhotoRec, foremost, scalpel) against the image.
  3. The tool scans the image for known magic numbers, either byte-by-byte or at sector or block boundaries depending on the tool.
  4. When a header signature is found, the tool marks the start of a potential file.
  5. When a trailer signature is found (or maximum size reached), the tool marks the end.
  6. The carved file is saved with sequential numbering and the appropriate extension.
  7. Optional verification: the carved file is opened in its native application to confirm validity.
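
Steps 3 to 5 of the workflow can be sketched for a single format. This is a deliberately naive illustration, not how PhotoRec, foremost, or scalpel are implemented: it holds the whole image in memory, handles only JPEG, and the names carve_jpegs and max_size are assumptions for the example.

```python
from typing import Iterator, Tuple

JPEG_HEADER = b"\xff\xd8\xff"
JPEG_TRAILER = b"\xff\xd9"


def carve_jpegs(image: bytes,
                max_size: int = 20 * 1024 * 1024) -> Iterator[Tuple[int, bytes]]:
    """Yield (start_offset, candidate_bytes) for each JPEG header in image."""
    position = 0
    while True:
        start = image.find(JPEG_HEADER, position)
        if start == -1:
            return
        limit = min(len(image), start + max_size)
        end = image.find(JPEG_TRAILER, start + len(JPEG_HEADER), limit)
        # Trailer found: cut just after it. Otherwise fall back to the size cap.
        stop = end + len(JPEG_TRAILER) if end != -1 else limit
        yield start, image[start:stop]
        position = stop
```

Stopping at the first FF D9 is the simplest boundary rule and also a known weakness: a JPEG with an embedded thumbnail contains an earlier FF D9, so a carver built this way would cut such a file short.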

When carving works best

File carving via signatures works best in specific scenarios:

  • Formatted drive recovery: filesystem metadata destroyed, data still intact.
  • Corrupted partition tables: drive can’t mount but data sectors readable.
  • Severely damaged filesystems: NTFS MFT or ext4 inode tables destroyed.
  • Unallocated space recovery: deleted files where directory entries are gone.
  • Memory dump analysis: RAM images for forensic file extraction.
  • Damaged removable media: SD cards, USB drives with filesystem corruption.

When carving has limitations

Signature-based carving has specific failure modes:

  • Fragmented files: if a file is split across non-contiguous sectors, the carved file contains only the contiguous portion starting from the signature.
  • Files without signatures: text files, raw data dumps, encrypted blobs cannot be carved by signature.
  • Compound formats: ZIP-based formats (DOCX, JAR) all carve as ZIP; further analysis required for proper identification.
  • Overlapping signatures: false positives when random data happens to match a magic number.
  • Variable-length without trailers: formats without trailers depend on size limits for boundaries.
  • Encrypted volumes: BitLocker/FileVault containers appear as random data; signatures inside are unreadable until decrypted.

FOUND.000 / .CHK file identification

One specific recovery scenario involves identifying file types from chkdsk’s FOUND.000 directory:

  1. chkdsk produces FILE0001.CHK, FILE0002.CHK, etc. with no original file names.
  2. Each .CHK file represents recovered fragments from cross-linked or orphaned chains.
  3. Open each .CHK in a hex editor and read the first 8-16 bytes.
  4. Match against signature reference (this entry, libmagic, Gary Kessler).
  5. Rename file with appropriate extension (FILE0001.CHK to FILE0001.jpg if signature is FF D8 FF).
  6. Open with native application to confirm proper recovery.
  7. Tools like UnCHK and ChkRecover automate this process for large FOUND.000 directories.
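
Steps 3 to 5 can be scripted as a dry run that only proposes new names. This sketch is not how UnCHK or ChkRecover work internally; the EXTENSIONS table, guess_extension, and plan_renames are names invented for the example, and the table covers only a few of the signatures in this entry.

```python
from pathlib import Path
from typing import Dict, List, Optional, Tuple

EXTENSIONS: List[Tuple[bytes, str]] = [
    (b"\xff\xd8\xff", ".jpg"),
    (b"\x89PNG\r\n\x1a\n", ".png"),
    (b"%PDF", ".pdf"),
    (b"PK\x03\x04", ".zip"),  # may really be DOCX, JAR, APK, or EPUB
]


def guess_extension(head: bytes) -> Optional[str]:
    """Map the leading bytes of a file to an extension, if recognized."""
    for magic, extension in EXTENSIONS:
        if head.startswith(magic):
            return extension
    return None


def plan_renames(folder: str) -> Dict[str, str]:
    """Return {old_name: new_name} for recognizable .CHK files; rename nothing."""
    plan = {}
    for path in sorted(Path(folder).glob("*.CHK")):
        with path.open("rb") as handle:
            extension = guess_extension(handle.read(16))
        if extension:
            plan[path.name] = path.with_suffix(extension).name
    return plan
```

Returning a plan instead of renaming immediately leaves room for the confirmation step: files that do not open in their native application can keep their .CHK name.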

Header repair scenarios

When file headers are damaged but content is intact, signatures provide the repair template:

  • Compare damaged file header with known-good signature for the same format.
  • Identify which bytes differ; determine which represent damage vs normal variation.
  • Restore correct signature bytes via hex editor.
  • Repair often suffices to make file openable; deeper structural damage may persist.
  • Common cases: JPEG header truncation, ZIP central directory corruption, PDF xref table damage.

Verification of recovery results

After running automated recovery tools, signatures help verify the results:

  • Open recovered files and verify the magic number matches the assigned extension.
  • Check for proper trailer signature where applicable.
  • For ZIP-based formats, verify [Content_Types].xml is present for Office files.
  • Identify files where the recovery tool stitched fragments incorrectly (signature mismatch with content).
  • Use TrID, DROID, or Siegfried for systematic batch verification of large recovery outputs.

Security and forensic context

Signatures play roles beyond pure recovery in security and forensic work:

  • Extension renaming detection: a file named .pdf whose signature is 4D 5A (a PE executable) indicates a possible attack.
  • Antivirus signature databases: many AV engines use file format signatures as part of malware identification.
  • Web upload validation: servers verify uploaded files match claimed types before processing.
  • Forensic chain of custody: file identification is part of evidence cataloging.
  • Steganography detection: unexpected signatures embedded in carrier files reveal hidden content.
  • Custom format reverse engineering: identifying signature patterns is the first step in understanding proprietary formats.
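
Extension renaming detection and upload validation both reduce to comparing the claimed type with the leading bytes. A minimal sketch, assuming a small hand-written EXPECTED table and a function named extension_matches (neither comes from a real validator):

```python
import os
from typing import Optional

# claimed extension -> acceptable leading signatures
EXPECTED = {
    ".pdf": [b"%PDF"],
    ".jpg": [b"\xff\xd8\xff"],
    ".jpeg": [b"\xff\xd8\xff"],
    ".png": [b"\x89PNG\r\n\x1a\n"],
    ".exe": [b"MZ"],
    ".docx": [b"PK\x03\x04"],
}


def extension_matches(filename: str, head: bytes) -> Optional[bool]:
    """True/False if the claim can be checked, None for unknown extensions."""
    extension = os.path.splitext(filename)[1].lower()
    signatures = EXPECTED.get(extension)
    if signatures is None:
        return None  # no signature on record for this extension
    return any(head.startswith(signature) for signature in signatures)
```

A result of False for invoice.pdf with leading bytes 4D 5A is the renamed-executable case described above. A result of True is weaker evidence, since a matching signature says nothing about what follows it.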

File signatures are the foundation of identifying file formats when filenames and extensions cannot be trusted, which is exactly the situation that arises in most data recovery scenarios. For data recovery purposes, the practical implication is that signatures determine what carving tools can recover and how reliably the results can be identified: when filesystem metadata is destroyed but data sectors are readable, signatures are the only mechanism for identifying file boundaries and types. The PhotoRec database covering 480+ file types reflects decades of accumulated format knowledge; the libmagic database powers the Unix file command and dozens of derivative tools; the Gary Kessler signature table captures formats that other databases miss.

For users wondering when file signatures matter in practice, the answer follows the recovery scenario. For routine recovery where filesystem metadata is intact (recently deleted files, accidentally formatted drives where original metadata still partially exists), signature analysis is unnecessary because the filesystem provides the information. For deeper recovery where filesystem metadata is destroyed (formatted drives, corrupted partition tables, severely damaged filesystems), signatures become essential because they’re the only remaining basis for file identification. For FOUND.000/.CHK files from chkdsk operations, signatures convert anonymous numbered files back to identifiable formats. For files received from untrusted sources or with suspicious extension renaming, signatures verify true file types regardless of what the filename claims.

For users facing specific recovery scenarios, the practical guidance reflects the situation. If chkdsk produced a FOUND.000 directory full of .CHK files, open each in a hex editor and check the first 16 bytes against the categorized tables in this entry; rename with the appropriate extension. If a drive needs deep recovery via carving, use PhotoRec which leverages its 480+ signature database; the signature coverage determines what file types can be successfully recovered. If a file has a damaged header preventing it from opening, compare the damaged bytes to a known-good signature from the same format and repair via hex editor. Standard data recovery software typically includes signature-based carving as a fallback recovery mode; HDD-focused recovery tools apply signature analysis to drives where filesystem repair has failed. Cleanroom recovery services use signatures to validate recovered output from severely damaged drives. The core insight: signatures are what make recovery possible when filesystem structures fail.

File Signature FAQ

What is a file signature or magic number?

A file signature (also called magic number, magic bytes, or format signature) is a specific sequence of bytes at the beginning of a file that uniquely identifies its format, regardless of file extension or filename. The Wikipedia file signatures reference defines them: “A file signature is data used to identify or verify the content of a file. Such signatures are also known as magic numbers or magic bytes and are usually inserted at the beginning of the file.” The term originated in Version 7 Unix (1979), where the header constant was assigned to a variable labeled ux_mag and subsequently called the magic number. Common signatures: Windows PE files start with 4D 5A; JPEG starts with FF D8 FF; PNG starts with 89 50 4E 47 0D 0A 1A 0A; PDF starts with 25 50 44 46 (“%PDF”); ZIP starts with 50 4B 03 04. Signatures are typically 2-8 bytes long, located at file offset 0 (or sometimes at known offsets like MPEG-4’s ftyp box at offset 4). The CTF Handbook describes them: “File signatures (also known as File Magic Numbers) are bytes within a file used to identify the format of the file. Generally they’re 2-4 bytes long, found at the beginning of a file.” Operating systems, tools like file and binwalk, and antivirus engines use these signatures rather than file extensions because extensions can be wrong, missing, or deliberately misleading.

Why are magic numbers called ‘magic’?

The term ‘magic number’ originated in early Unix programming and reflected the apparently arbitrary nature of the byte values chosen as identifiers. The Wikipedia magic number reference describes the origin: “In Version Seven Unix, the header constant was not tested directly, but assigned to a variable labeled ux_mag and subsequently referred to as the magic number. Probably because of its uniqueness, the term magic number came to mean executable format type, then expanded to mean file system type, and expanded again to mean any type of file.” The ‘magic’ aspect refers to several properties: (1) The values appear arbitrary to outside observers but have specific meaning to the file format parser; (2) The values are chosen to be improbable to occur randomly in other file types, providing high-confidence identification; (3) The values often have hidden meanings to the format authors. Examples of these hidden meanings: PE files use 4D 5A which spells “MZ” for Mark Zbikowski (the Microsoft engineer who designed the original DOS executable format, whose signature PE inherits); ZIP files use 50 4B which spells “PK” for Phil Katz (PKZIP author); Java class files use CA FE BA BE which reads as “CAFEBABE”, a memorable hex word; the Berkeley Fast File System superblock uses 19 54 01 19 (or its variants) which represents the birthday of Marshall Kirk McKusick, the FFS author.

What are the most common file signatures?

The most-encountered file signatures across major format categories: Images: JPEG starts with FF D8 FF (followed by E0/E1/E8 indicating variant); PNG starts with 89 50 4E 47 0D 0A 1A 0A (8 bytes including line ending checks); GIF starts with 47 49 46 38 (GIF8) followed by 37 61 (“7a”) or 39 61 (“9a”); BMP starts with 42 4D (BM); TIFF starts with either 49 49 2A 00 (“II”, little-endian) or 4D 4D 00 2A (“MM”, big-endian). Documents: PDF starts with 25 50 44 46 (%PDF); old MS Office (DOC/XLS/PPT) starts with D0 CF 11 E0 A1 B1 1A E1; modern Office Open XML (DOCX/XLSX/PPTX) starts with 50 4B 03 04 (ZIP signature). Archives: ZIP starts with 50 4B 03 04; RAR v1.5-4 starts with 52 61 72 21 1A 07 00; RAR v5+ starts with 52 61 72 21 1A 07 01 00; 7-Zip starts with 37 7A BC AF 27 1C; GZIP starts with 1F 8B; BZIP2 starts with 42 5A 68 (BZh). Executables: Windows PE/EXE/DLL starts with 4D 5A (MZ); ELF Linux executable starts with 7F 45 4C 46 (.ELF); Mach-O 32-bit starts with FE ED FA CE; Mach-O 64-bit starts with FE ED FA CF; Java class and Mach-O Universal share CA FE BA BE. Audio/Video: MP3 with ID3 tag starts with 49 44 33 (ID3); MP3 raw frame starts with FF FB or FF F3; MP4/M4A/M4V has bytes at offset 4-7 spelling “ftyp”; MKV starts with 1A 45 DF A3; FLAC starts with 66 4C 61 43 (fLaC). Databases: SQLite starts with “SQLite format 3” followed by a null byte (53 51 4C 69 74 65 20 66 6F 72 6D 61 74 20 33 00).

What are compound file signatures?

Compound file signatures occur when one file format is wrapped or embedded in another, with the outer format’s signature visible at the start. The most common example is the ZIP-based formats: DOCX (Word), XLSX (Excel), PPTX (PowerPoint), JAR (Java archives), APK (Android packages), EPUB (e-books), and ODT (OpenDocument) all start with the ZIP signature 50 4B 03 04 because they are technically ZIP archives containing format-specific content files. The picoCTF Solutions reference describes the inheritance: “Office files (.docx, .xlsx, .pptx), JAR files, APK files, and EPUB files all share the ZIP signature because they are ZIP archives at heart.” To distinguish compound formats, signature analysis tools examine the contents of the ZIP archive: a file containing [Content_Types].xml is Office Open XML; a file containing META-INF/MANIFEST.MF is JAR; a file containing AndroidManifest.xml is APK; a file containing META-INF/container.xml is EPUB. Other compound signatures include: TAR archives that may be GZIP-compressed (1F 8B header followed by tar content); ISO 9660 disk images may contain UDF or other filesystems with their own signatures; QEMU QCOW2 disk images contain wrapped filesystem images. The compound nature means that signature-based identification is sometimes the first step rather than the final answer; deeper analysis of contents is required for full format identification.
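
The member-name checks described above can be sketched with Python's standard zipfile module. The MARKERS table and the classify_zip function are assumptions for this example, and only the four marker files named in the answer are covered.

```python
import io
import zipfile

# Checked in order: an APK also carries META-INF/MANIFEST.MF, so the more
# specific AndroidManifest.xml marker must come before the JAR marker.
MARKERS = [
    ("[Content_Types].xml", "Office Open XML"),
    ("AndroidManifest.xml", "APK"),
    ("META-INF/container.xml", "EPUB"),
    ("META-INF/MANIFEST.MF", "JAR"),
]


def classify_zip(data: bytes) -> str:
    """Classify ZIP-based data by the marker files it contains."""
    if not data.startswith(b"PK\x03\x04"):
        return "not a ZIP"
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return "damaged ZIP"
    for member, label in MARKERS:
        if member in names:
            return label
    return "plain ZIP"
```

Listing members depends on the central directory at the end of the archive, so a carved or truncated file can carry a valid 50 4B 03 04 header and still come back as a damaged ZIP.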

Do all files have signatures?

No, not all files have magic number signatures. Several common file types deliberately lack signatures or have signatures that vary so much they cannot be reliably identified. Plain text files (TXT, CSV, HTML, XML, source code) typically have no signature; identification relies on content analysis and encoding heuristics rather than fixed bytes. Some Unicode text files include a Byte Order Mark (BOM): UTF-8 BOM is EF BB BF, UTF-16 LE BOM is FF FE, UTF-16 BE BOM is FE FF, UTF-32 LE BOM is FF FE 00 00; the BOM serves a similar purpose to a magic number but is optional. HTML files may start with the DOCTYPE declaration but this is not technically a binary signature. Encrypted files (AES-CBC output, GPG output without armor, etc.) often have no recognizable signature because encryption produces apparently random bytes; some encryption formats include signatures (PGP/GPG armored output starts with -----BEGIN PGP); raw encrypted data does not. Compressed data alone (without a container) may lack signatures: deflate-compressed data has no fixed header. Custom or proprietary formats may use any byte values as signatures, including no signature at all if the format author didn’t follow conventions. The libmagic database and tools like file/TrID/DROID handle these cases through content analysis (statistical analysis of byte distributions, encoding detection, structure pattern matching) rather than fixed signature lookup; these heuristic approaches are less reliable than magic number identification but extend coverage to text and unsigned binary formats.
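
BOM detection is a small case of signature matching with one ordering trap: the UTF-32 LE mark (FF FE 00 00) begins with the UTF-16 LE mark (FF FE), so longer marks have to be tested first. A minimal sketch using the constants from Python's codecs module; the function name detect_bom is illustrative.

```python
import codecs
from typing import Optional

# Longest marks first, so FF FE 00 00 is not mistaken for UTF-16 LE.
BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32 LE"),
    (codecs.BOM_UTF32_BE, "UTF-32 BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16 LE"),
    (codecs.BOM_UTF16_BE, "UTF-16 BE"),
]


def detect_bom(data: bytes) -> Optional[str]:
    """Return the encoding named by a leading BOM, or None if there is none."""
    for mark, name in BOMS:
        if data.startswith(mark):
            return name
    return None
```

A None result does not mean the file is not text; most text files carry no BOM at all, which is why the heuristic content analysis described above is still needed.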

How do file signatures help with data recovery?

File signatures are the foundation of file carving, one of the most important data recovery techniques for severely damaged or formatted drives. The standard recovery workflow when filesystem metadata is missing or destroyed runs as follows: (1) Image the source drive with dd or ddrescue to a separate destination. (2) Run a file carving tool like PhotoRec, foremost, or scalpel against the image. (3) The tool scans for known magic numbers across the image, marking the start of potential files. (4) When trailer signatures are known (JPEG FF D9, PDF %%EOF, PNG IEND), the tool also identifies file boundaries. (5) When trailer signatures are unknown, the tool uses maximum file size limits to bound carved files. (6) Each detected signature with its boundary becomes a recovered file, named by sequence number. PhotoRec’s signature database covers 480+ file types; foremost and scalpel use configurable signature lists. The GitHub File Magic Numbers gist describes the carving role: “The size of a file is saved in the directory information of the file system. When this information is corrupt, recovery tools can try to find the boundary of files”. The FOUND.000/.CHK files produced by chkdsk also benefit from signature analysis: opening a .CHK file in a hex editor and reading the first 8-16 bytes typically reveals the file type via its magic number, allowing the file to be renamed with the correct extension. For corrupted file headers, comparison with known-good signatures from the same format identifies which bytes need reconstruction. File signatures are what make recovery possible when filesystem structures fail.
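Steps (3) to (5) can be sketched in a few lines of Python for the JPEG case. This is a deliberately simplified illustration and not how PhotoRec, foremost, or scalpel are implemented: the function name carve_jpegs and the 20 MiB default limit are assumptions, the whole image is assumed to fit in memory, and each file is assumed to be stored contiguously.

```python
JPEG_HEADER = b"\xff\xd8\xff"   # start-of-file signature
JPEG_TRAILER = b"\xff\xd9"      # end-of-file signature


def carve_jpegs(image: bytes, max_size: int = 20 * 1024 * 1024):
    """Return (start, end) byte offsets of candidate JPEG files in a raw image.

    max_size is an assumed cap used when no trailer is found in range.
    """
    results = []
    pos = 0
    while True:
        start = image.find(JPEG_HEADER, pos)
        if start == -1:
            break
        # Never look further than max_size bytes past the header.
        limit = min(start + max_size, len(image))
        trailer = image.find(JPEG_TRAILER, start + len(JPEG_HEADER), limit)
        if trailer != -1:
            end = trailer + len(JPEG_TRAILER)   # boundary from trailer signature
        else:
            end = limit                         # boundary from size limit
        results.append((start, end))
        pos = end
    return results
```

Each returned pair can be written out as image[start:end] under a sequence-numbered filename. The sketch stops at the first FF D9 it sees, so a JPEG with an embedded thumbnail would be cut short, and a fragmented file would be carved with foreign data in the middle; both cases need the trailer and structure verification that signature matching alone cannot provide.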

Related glossary entries

  • Hex Editor: the standard tool for manually inspecting file signatures and headers.
  • dd Command: creates the disk images that signature-based carving tools then process.
  • chkdsk: produces FOUND.000/.CHK files identified via magic number analysis.
  • Forensic Recovery: signature-based identification is foundational to forensic file analysis.
  • Data Recovery: signatures power file carving when filesystem metadata is destroyed.
  • Disk Image: PhotoRec and similar tools scan disk images for signature patterns.
  • Data Corruption: damaged headers can be repaired by comparing with known-good signatures.

Sources

  1. Wikipedia: List of file signatures (accessed May 2026)
  2. CTF Handbook: File Formats and Magic Numbers
  3. Wikipedia: Magic number (programming)
  4. Inventive HQ: File Magic Number Checker
  5. GitHub leommoore: File Magic Numbers gist
  6. picoCTF Solutions: File Magic / Signature Identifier
  7. Medium / Shailendra Purohit: Beneath the Bytes: Magic Numbers Deep Dive
  8. GitHub Ilias1988: Magic-Bytes-List repository
  9. Cyber Forensics Academy: signature analysis in forensic context

About the Authors

👥 Researched & Reviewed By
Rachel Dawson
Technical Approver · Data Recovery Engineer

Rachel brings over twelve years of data recovery engineering experience including extensive daily work involving file signature analysis. The most consistent pattern in signature-related cases is FOUND.000/.CHK file identification: customers receive a directory full of FILE0001.CHK, FILE0002.CHK files from chkdsk operations and need to know what each one actually is. The first 16 bytes almost always tell the story; the categorized tables in this entry cover the formats encountered in 95%+ of these cases. The harder cases involve carved files from PhotoRec or similar tools where the recovery is incomplete: a JPEG signature followed by what should be JPEG data turns out to be partial because the file was fragmented, and signature analysis alone cannot detect this without verifying the trailer signature and content structure. The compound signature problem (DOCX/XLSX/PPTX all sharing ZIP signature) creates a specific recovery challenge: PhotoRec carves these as ZIP files, and post-processing is required to identify which type each ZIP actually represents by examining its contents. The universal advice on signature work: when an automated recovery tool produces ambiguous output, manual signature inspection in a hex editor is the verification step that confirms which files were correctly recovered.

12+ years data recovery engineering · Carving signature analysis · FOUND.000 identification
✅
Editorial Independence & Affiliate Disclosure

Data Recovery Fix earns revenue through affiliate links on some product recommendations. This does not influence our reference content. Glossary entries are written and reviewed independently based on documented research, vendor documentation, independent testing, and recovery-engineer review. If anything on this page looks inaccurate, outdated, or worth revisiting, please reach out at contact@datarecoveryfix.com and we’ll review it promptly.
