SHA-256: How It Works

What SHA-256 is, its key properties, and why we use it to track file changes

Question

What is SHA-256 and why do we use it to track file changes?

Explanation

SHA-256 is a cryptographic hash function: it takes any input (a file, a string, any sequence of bytes) and produces a fixed-size fingerprint. The output is always 256 bits, usually written as 64 hexadecimal characters (0-9 and a-f).

"Hello"           -> 185f8db32271fe25f561a6fc938b2e26...
"Hello!"          -> 334d016f755cd6dc58c53a86e183882f...
(a 500-page PDF)  -> 9f86d081884c7d659a2feaa0c55ad015...
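
The string fingerprints above can be reproduced with Python's standard hashlib module. This is a minimal sketch: the helper name sha256_hex is made up for illustration, and the text is assumed to be encoded as UTF-8 before hashing.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of a UTF-8 string as hex (hypothetical helper)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


digest = sha256_hex("Hello")
print(digest[:32] + "...")              # first half of the fingerprint
print(len(digest))                      # 64 hex characters
print(len(bytes.fromhex(digest)) * 8)   # 256 bits
```

Each hex character encodes 4 bits, which is why 256 bits always print as 64 characters.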

Key properties

  1. Always the same - the same input always produces the same hash
  2. Fixed size - the output is always 64 hex characters, whether the input is 1 byte or 1 GB
  3. One-way - there is no practical way to recover the original input from the hash
  4. Tiny change, totally different result - changing a single character in the input changes the hash completely
  5. Collision-resistant - two different inputs can in theory share a hash (there are more possible inputs than hashes), but nobody has found a practical way to produce such a pair for SHA-256
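
Properties 1 and 4 are easy to check by hand. The sketch below hashes "Hello" twice, then compares it with "Hello!" by counting how many of the 256 output bits differ. The helper name sha256_bits is an assumption made for this example; for a good hash function roughly half the bits are expected to flip.

```python
import hashlib


def sha256_bits(data: bytes) -> int:
    """Return the SHA-256 digest as a 256-bit integer (hypothetical helper)."""
    return int.from_bytes(hashlib.sha256(data).digest(), "big")


a = sha256_bits(b"Hello")
b = sha256_bits(b"Hello!")

# Property 1: hashing the same input again gives the same result.
same_again = sha256_bits(b"Hello") == a

# Property 4: XOR marks the bit positions where the two digests differ.
differing_bits = bin(a ^ b).count("1")

print(same_again)
print(f"{differing_bits} of 256 bits differ")
```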

Why we use it in our project

To detect if a file has changed since we last indexed it:

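import hashlib

# Read the file in 8 KB blocks so large files never have to fit in memory.
sha256 = hashlib.sha256()
with open(file_path, "rb") as f:
    for block in iter(lambda: f.read(8192), b""):
        sha256.update(block)
# Prefix the digest with the algorithm name (and avoid shadowing the built-in hash()).
file_hash = f"sha256:{sha256.hexdigest()}"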

We store the hash in manifest.json. On the next run: same hash means skip it, different hash means re-index it.
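
The skip-or-re-index decision could look roughly like this. It is a sketch, not the project's actual code: the function name needs_reindex is hypothetical, the manifest is assumed to be a JSON object keyed by file name with a "hash" field (as in the example from our manifest), and the "sha256:aaa" style values are placeholders rather than real digests.

```python
import json
import tempfile
from pathlib import Path


def needs_reindex(manifest_path: Path, file_name: str, current_hash: str) -> bool:
    """Return True when the file is new or its stored hash differs."""
    if not manifest_path.exists():
        return True  # no manifest yet, so nothing has been indexed
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    entry = manifest.get(file_name)
    return entry is None or entry.get("hash") != current_hash


# Usage with a throwaway manifest in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    manifest_path = Path(tmp) / "manifest.json"
    manifest_path.write_text(
        json.dumps({"notes.txt": {"hash": "sha256:aaa"}}), encoding="utf-8"
    )
    unchanged = needs_reindex(manifest_path, "notes.txt", "sha256:aaa")
    changed = needs_reindex(manifest_path, "notes.txt", "sha256:bbb")
    unseen = needs_reindex(manifest_path, "other.txt", "sha256:ccc")

print(unchanged, changed, unseen)  # False True True
```

Treating a missing manifest or a missing entry as "needs indexing" means a first run and a newly added file follow the same path as a changed file.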

Why hash instead of modification date?

The modification date can change without the content changing (copy, sync, backup restore), so relying on it would trigger needless re-indexing. The hash changes only if the actual bytes in the file change.
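
A small experiment makes the difference concrete. The sketch below writes a temporary file, moves its timestamps back by an hour with os.utime to imitate what a copy or restore might do, and checks which signal changed. The file name and the helper file_sha256 are made up for this example.

```python
import hashlib
import os
import tempfile
from pathlib import Path


def file_sha256(path: Path) -> str:
    """Hash a file's bytes in 8 KB blocks (hypothetical helper)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(8192), b""):
            digest.update(block)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "notes.txt"
    path.write_bytes(b"same content")

    mtime_before = path.stat().st_mtime
    hash_before = file_sha256(path)

    # Imitate a copy or restore: shift the timestamps one hour back
    # while leaving the bytes untouched.
    os.utime(path, (mtime_before - 3600, mtime_before - 3600))

    mtime_changed = path.stat().st_mtime != mtime_before
    hash_changed = file_sha256(path) != hash_before

print(mtime_changed, hash_changed)  # True False
```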

SHA-256 vs other hashes

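  • MD5 (128 bits) - broken: researchers have produced two different inputs that give the same hash (a collision). Don't use it
  • SHA-1 (160 bits) - also broken: a practical collision was demonstrated in 2017
  • SHA-256 (256 bits) - secure, standard choice
  • SHA-512 (512 bits) - secure, with a digest twice as long; speed relative to SHA-256 depends on the CPU, and the extra length is rarely needed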
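
The bit sizes in this list can be read straight from hashlib. This sketch assumes Python 3.9 or newer, because it passes usedforsecurity=False so that MD5 and SHA-1 can still be constructed on builds that restrict them for security use.

```python
import hashlib

# Digest size in bits for each algorithm in the list above.
sizes = {
    name: hashlib.new(name, usedforsecurity=False).digest_size * 8
    for name in ("md5", "sha1", "sha256", "sha512")
}
print(sizes)  # {'md5': 128, 'sha1': 160, 'sha256': 256, 'sha512': 512}
```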

Example

From our manifest:

{
  "AI Engineering Guidebook.pdf": {
    "hash": "sha256:1273e50aceb06a003d5440c3713a9902...",
    "ingested_at": "2026-03-20T10:30:00+00:00"
  }
}

When the PDF is edited, its hash changes, and ingest.py re-indexes it automatically on the next run.
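
After a re-index, the manifest entry has to be refreshed with the new hash and timestamp. How ingest.py does this is not shown here, so the following is only a sketch of one way to build an entry in the same shape as the example; the function name manifest_entry is hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def manifest_entry(content: bytes) -> dict:
    """Build a manifest entry for the given file bytes (hypothetical helper)."""
    return {
        "hash": f"sha256:{hashlib.sha256(content).hexdigest()}",
        # UTC timestamp in ISO 8601 form, e.g. 2026-03-20T10:30:00+00:00
        "ingested_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }


entry = manifest_entry(b"example file bytes")
print(entry["hash"][:20] + "...")
print(entry["ingested_at"])
```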