Motivation

This tool is the result of me manually extracting content from Facebook data archives and getting increasingly frustrated about the omnipresent mojibake as well as lack of human-readable timestamps, both of which get in the way of browsing the raw JSON.

I won’t argue for the inclusion of human-readable timestamps in what really is a JSON-based data exchange format. But the presence of mojibake makes clear that Facebook engineers moved fast, broke things, and couldn’t be bothered to fix the mess they created. In this case, the mess involves unicode escape sequences of the form \u00xx that encode what really should be raw UTF-8 bytes. They are the result of Facebook using two encoding steps where one would be just right. That turns don’t into don\u00e2\u0080\u0099t inside Facebook’s JSON files and, after parsing the JSON, into a nonsensical donât.

Originally, I relied on Robyn Speer’s ftfy. But then one weekend I got curious and started playing with character encodings, came up with what really is a one-line work-around, and started automating further clean-up of the data. The result is this Python package.