How to Make a Metadata Extractor Tool (For OSINT & Digital Forensics)

Posted by spyboy

Most people see a photo.

A cybersecurity professional sees hidden data.

Images, PDFs, Word documents, and many other files often contain metadata, hidden information that can reveal:

  • GPS coordinates
  • Device model
  • Software version
  • Author name
  • Creation timestamps
  • Editing history

Metadata analysis is one of the most useful OSINT techniques, used in:

  • Digital forensics
  • Bug bounty reconnaissance
  • Threat intelligence
  • Journalism investigations
  • Law enforcement analysis

In this guide, you’ll learn how to make a Metadata Extractor Tool using Python. You’ll build a CLI-based tool that:

  • Extracts EXIF data from images
  • Extracts metadata from PDFs
  • Extracts metadata from DOCX files
  • Formats output cleanly
  • Saves results to JSON

Along the way, you’ll see how attackers and defenders use metadata differently.


What Is Metadata?

Metadata literally means:

“Data about data.”

Examples:

Image Metadata (EXIF)

  • GPS location
  • Camera model
  • Date taken
  • Software used

PDF Metadata

  • Author
  • Producer
  • Creation date
  • Modification date

Document Metadata

  • Creator name
  • Company name
  • Revision history

Even when a file looks clean, metadata can expose sensitive information.


Why Metadata Matters in Cybersecurity

Attackers use metadata to:

  • Identify employee names
  • Discover internal software
  • Find office GPS coordinates
  • Identify infrastructure leaks

Defenders use metadata to:

  • Investigate breaches
  • Track document origins
  • Validate evidence
  • Perform OSINT investigations

Some common tools used in investigations include:

  • ExifTool
  • FOCA
  • Autopsy

Today, you’ll build your own lightweight version.


What We’re Going to Build

Our Python Metadata Extractor Tool will:

✔ Extract EXIF data from images
✔ Extract metadata from PDFs
✔ Extract metadata from DOCX files
✔ Print structured output
✔ Save results to JSON
✔ Handle errors cleanly


Step 1: Install Required Libraries

pip install Pillow PyPDF2 python-docx

We’ll use:

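  • Pillow → Image EXIF extraction
  • PyPDF2 → PDF metadata (the project now continues as pypdf; PyPDF2 still provides the PdfReader used here)
  • python-docx → Word metadata
  • json → Structured storage (part of the standard library, so nothing to install)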

Step 2: Project Structure

metadata_extractor/
├── extractor.py
├── utils.py
└── output/

Step 3: Extract Image Metadata (EXIF)

Create extractor.py

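from PIL import Image
from PIL.ExifTags import GPSTAGS, TAGS


def extract_image_metadata(image_path):
    metadata = {}
    try:
        with Image.open(image_path) as image:
            # getexif() is Pillow's public API; _getexif() is a private method
            exif = image.getexif()
            tags = dict(exif)
            # Fields such as DateTimeOriginal live in the Exif sub-IFD (0x8769)
            tags.update(exif.get_ifd(0x8769))
            # GPS fields live in their own IFD (0x8825) with separate tag names
            gps = exif.get_ifd(0x8825)
        for tag_id, value in tags.items():
            # str() keeps bytes and rational values JSON-serializable
            metadata[str(TAGS.get(tag_id, tag_id))] = str(value)
        for tag_id, value in gps.items():
            metadata[str(GPSTAGS.get(tag_id, tag_id))] = str(value)
        if not metadata:
            metadata["Info"] = "No EXIF data found."
    except Exception as e:
        metadata["Error"] = str(e)
    return metadata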

This extracts hidden EXIF fields.


What EXIF Data Can Reveal

Common leaked EXIF data includes:

  • GPSLatitude
  • GPSLongitude
  • Make (camera brand)
  • Model (device model)
  • DateTimeOriginal
  • Software used

Many people accidentally leak GPS coordinates in photos.
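
EXIF stores each GPS coordinate as degrees, minutes, and seconds plus a hemisphere letter (N, S, E, or W), which is not what mapping tools expect. Here is a minimal sketch of the conversion to decimal degrees. The function name dms_to_decimal is made up for this example, and it assumes each component can be passed to float(), which holds for plain numbers and for Pillow's rational values.

```python
def dms_to_decimal(dms, ref):
    """Convert (degrees, minutes, seconds) plus a hemisphere letter to decimal degrees."""
    degrees, minutes, seconds = (float(part) for part in dms)
    decimal = degrees + minutes / 60 + seconds / 3600
    # South latitudes and west longitudes are negative by convention
    if ref.upper() in ("S", "W"):
        decimal = -decimal
    return round(decimal, 6)


# Illustrative values only, not taken from a real photo
latitude = dms_to_decimal((40, 26, 46.0), "N")
longitude = dms_to_decimal((79, 58, 56.0), "W")
print(latitude, longitude)
```

Paste the two numbers into any map search box as "latitude, longitude" to see where a photo was taken.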


Step 4: Extract PDF Metadata

Add to extractor.py

from PyPDF2 import PdfReader


def extract_pdf_metadata(pdf_path):
    metadata = {}
    try:
        reader = PdfReader(pdf_path)
        info = reader.metadata
        if info:
            for key, value in info.items():
                metadata[key] = str(value)
        else:
            metadata["Info"] = "No metadata found."
    except Exception as e:
        metadata["Error"] = str(e)
    return metadata

PDFs often reveal:

  • Author name
  • Company name
  • Creator software
  • Modification dates
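
The date fields (/CreationDate and /ModDate) usually come back as raw strings such as D:20240115093000+05'30', which are awkward to compare or sort. The sketch below turns them into datetime objects using only the standard library. The helper name parse_pdf_date is hypothetical, and the pattern assumes the common D:YYYYMMDDHHmmSS layout with an optional timezone; anything else returns None rather than guessing.

```python
import re
from datetime import datetime, timedelta, timezone

# PDF date strings look like D:YYYYMMDDHHmmSS+HH'mm'; fields after the year are optional
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'?(\d{2})?'?)?)?$"
)


def parse_pdf_date(raw):
    """Return a datetime for a PDF date string, or None if it does not parse."""
    match = _PDF_DATE.match(raw.strip())
    if not match:
        return None
    year, month, day, hour, minute, second, sign, tz_h, tz_m = match.groups()
    tzinfo = None
    if sign == "Z":
        tzinfo = timezone.utc
    elif sign in ("+", "-"):
        offset = timedelta(hours=int(tz_h or 0), minutes=int(tz_m or 0))
        tzinfo = timezone(offset if sign == "+" else -offset)
    try:
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tzinfo,
        )
    except ValueError:
        # Out-of-range values such as month 13
        return None


print(parse_pdf_date("D:20240115093000+05'30'"))
```

A timezone offset in a PDF date can itself be a clue, since it hints at where the authoring machine was configured.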

Step 5: Extract DOCX Metadata

Add:

from docx import Document


def extract_docx_metadata(doc_path):
    metadata = {}
    try:
        doc = Document(doc_path)
        props = doc.core_properties
        metadata["Author"] = props.author
        metadata["Created"] = str(props.created)
        metadata["Modified"] = str(props.modified)
        metadata["Last Modified By"] = props.last_modified_by
        metadata["Title"] = props.title
    except Exception as e:
        metadata["Error"] = str(e)
    return metadata

Word documents often leak internal employee names.
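
The core properties above do not include the company name. In a DOCX file, that value normally sits in a separate part of the package, docProps/app.xml, next to the application name and version. Since a DOCX is just a ZIP archive, a rough sketch with the standard library can read it directly. The function name extract_docx_app_properties is made up for this example, and not every document will contain this part or these fields.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by docProps/app.xml in Office Open XML packages
APP_NS = "{http://schemas.openxmlformats.org/officeDocument/2006/extended-properties}"


def extract_docx_app_properties(doc_path):
    """Read Company, Application, and AppVersion from docProps/app.xml, if present."""
    wanted = ("Company", "Application", "AppVersion")
    properties = {}
    with zipfile.ZipFile(doc_path) as package:
        if "docProps/app.xml" not in package.namelist():
            return properties
        root = ET.fromstring(package.read("docProps/app.xml"))
    for name in wanted:
        element = root.find(f"{APP_NS}{name}")
        # Skip fields that are missing or empty
        if element is not None and element.text:
            properties[name] = element.text
    return properties
```

You could merge the returned dictionary into the result of extract_docx_metadata() before printing or saving.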


Step 6: Create Output Utility

Create utils.py

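import json
import os


def save_to_json(data, filename):
    # Create the output folder if it is missing (no error if it already exists)
    os.makedirs("output", exist_ok=True)
    path = os.path.join("output", filename)
    with open(path, "w", encoding="utf-8") as f:
        # default=str covers values json cannot encode, such as bytes or EXIF rationals
        json.dump(data, f, indent=4, default=str)
    print(f"[+] Results saved to {path}")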

Step 7: Build CLI Interface

Update extractor.py

import os
from utils import save_to_json


def main():
    print("=== Metadata Extractor Tool ===")
    # Strip whitespace and quotes that often come along with a pasted path
    file_path = input("Enter file path: ").strip().strip("\"'")
    if not os.path.isfile(file_path):
        print("File does not exist.")
        return
    # splitext copes with names that contain several dots or no extension at all
    extension = os.path.splitext(file_path)[1].lower().lstrip(".")
    if extension in ["jpg", "jpeg", "png"]:
        metadata = extract_image_metadata(file_path)
    elif extension == "pdf":
        metadata = extract_pdf_metadata(file_path)
    elif extension == "docx":
        metadata = extract_docx_metadata(file_path)
    else:
        print("Unsupported file type.")
        return
    print("\n=== Extracted Metadata ===")
    for key, value in metadata.items():
        print(f"{key}: {value}")
    save_to_json(metadata, "metadata_results.json")


if __name__ == "__main__":
    main()

Now run:

python extractor.py

Your metadata tool is live.
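
The interactive prompt is fine for a first run, but it gets in the way once you want to script the tool. One option is to take the path from the command line with argparse. This is only a sketch: build_parser and the -o/--output flag are names chosen for this example, and the argument list is passed in by hand so the snippet runs on its own.

```python
import argparse


def build_parser():
    """Create a parser with one positional file path and an optional output name."""
    parser = argparse.ArgumentParser(
        description="Extract metadata from images, PDFs, and DOCX files."
    )
    parser.add_argument("file_path", help="path to the file to inspect")
    parser.add_argument(
        "-o", "--output",
        default="metadata_results.json",
        help="name of the JSON file written to the output folder",
    )
    return parser


# An explicit argument list is parsed here so the sketch runs without a real command line
args = build_parser().parse_args(["report.pdf", "-o", "report.json"])
print(args.file_path, args.output)
```

In the real tool you would call parse_args() with no arguments inside main() and use args.file_path in place of the input() call.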


How Hackers Abuse Metadata

Attackers use metadata to:

  • Identify internal usernames
  • Discover server software versions
  • Extract GPS locations
  • Map organizational structure
  • Find hidden document authors

For example:

An employee uploads a PDF to a public website.
Metadata reveals:

  • Author: John Smith
  • Company: ABC Corp
  • Software: Microsoft Word 2016

An attacker now knows:

  • Real employee name
  • Company software stack
  • Potential phishing target

Defensive Use of Metadata Extraction

Blue teams use metadata tools to:

  • Check if documents leak internal info
  • Clean files before public release
  • Investigate breach origins
  • Track digital evidence

Before uploading files publicly, always strip metadata.
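
A quick pre-release check can be built on top of the dictionaries the extractor returns. The sketch below flags fields whose names suggest something sensitive. The watch list and the name flag_sensitive_fields are assumptions made for this example; tune the list to whatever your organization considers a leak.

```python
# Hypothetical watch list; adjust to whatever your organization treats as sensitive
SENSITIVE_HINTS = ("author", "company", "gps", "last modified by", "creator", "producer", "software")


def flag_sensitive_fields(metadata):
    """Return the non-empty metadata entries whose key contains a watch-list hint."""
    flagged = {}
    for key, value in metadata.items():
        # PDF keys start with a slash (for example /Author), so drop it before matching
        normalized = str(key).lower().lstrip("/")
        if value in (None, "", "None"):
            continue
        if any(hint in normalized for hint in SENSITIVE_HINTS):
            flagged[key] = value
    return flagged


# Sample data modeled on the PDF example earlier in this guide
sample = {"/Author": "John Smith", "/Title": "Q3 Report", "Company": "ABC Corp", "Last Modified By": ""}
print(flag_sensitive_fields(sample))
```

If the function returns anything, strip the file and check it again before it goes public.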


How to Strip Metadata

To remove EXIF data:

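def strip_image_metadata(image_path, output_path="clean_image.jpg"):
    with Image.open(image_path) as image:
        # JPEG cannot store alpha or palette modes, so normalize to RGB first
        rgb = image.convert("RGB")
        # Copy only the pixel data into a fresh image; the EXIF block is left behind
        clean_image = Image.new(rgb.mode, rgb.size)
        clean_image.putdata(list(rgb.getdata()))
    clean_image.save(output_path)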
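
It is worth confirming that the cleaned file really is clean. In a JPEG, EXIF data lives in an APP1 segment that begins with the bytes Exif followed by two null bytes, so a small standard-library check can walk the header segments and look for it. This is a rough sketch: jpeg_has_exif is a made-up name, and the walker assumes a well-formed file where every segment before the image data carries a length field.

```python
import struct


def jpeg_has_exif(data):
    """Return True if the JPEG bytes contain an APP1 segment carrying EXIF data."""
    if data[:2] != b"\xff\xd8":  # every JPEG starts with the SOI marker
        raise ValueError("not a JPEG stream")
    pos = 2
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            break  # not at a marker, so stop rather than guess
        marker = data[pos + 1]
        if marker == 0xDA:
            break  # start of scan: compressed image data follows
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        if marker == 0xE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
            return True
        # The length field counts itself but not the two marker bytes
        pos += 2 + length
    return False
```

To check a file on disk, read it in binary mode and pass the bytes in, for example jpeg_has_exif(open("clean_image.jpg", "rb").read()).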

For PDFs and DOCX files, exporting a fresh copy often removes some metadata, but not reliably. Run the extractor on the exported file to confirm what is left.


Advanced Improvements

You can expand this tool by adding:

  • Recursive folder scanning
  • Bulk file processing
  • CSV output
  • SQLite storage
  • Geo-location decoding from GPS EXIF
  • Hidden stream detection
  • File hashing (MD5/SHA256)
  • Command-line arguments with argparse
  • Multi-threaded scanning

You could turn this into a mini forensic suite.
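
Two of these ideas, recursive folder scanning and file hashing, fit together in a few lines of standard-library code. The sketch below walks a folder, keeps only the file types this tool supports, and computes a SHA-256 hash for each so you can later show that a file was not altered. The names SUPPORTED, sha256_of, and scan_folder are chosen for this example.

```python
import hashlib
from pathlib import Path

# Extensions the extractor in this guide knows how to handle
SUPPORTED = {".jpg", ".jpeg", ".png", ".pdf", ".docx"}


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def scan_folder(root):
    """Yield (path, sha256) for every supported file under root, recursively."""
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in SUPPORTED:
            yield path, sha256_of(path)
```

From there, each path can be handed to the matching extract function and the hash stored alongside the metadata.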


Covered

  • how to extract metadata using Python
  • build metadata extractor tool
  • python exif metadata script
  • digital forensics metadata tool
  • OSINT metadata analysis

Real-World Use Cases

✔ Bug bounty reconnaissance
✔ OSINT investigations
✔ Journalism research
✔ Corporate security review
✔ Data leak investigations
✔ Incident response

Metadata often reveals more than the visible file content.


Final Thoughts

Most people don’t think about metadata.

Attackers do.

Defenders should.

When you build your own metadata extractor tool, you gain:

  • Technical OSINT skill
  • Digital forensics knowledge
  • Understanding of hidden data
  • Defensive awareness

Once you start checking metadata, you’ll be surprised how often sensitive information leaks silently.

And in cybersecurity…

Silent leaks are the most dangerous.
