Amazingly, it's been over eight years since I started using some kind of instant messaging. More amazing still is the fact that I have ALL my IM logs since that time; 2nd December 1997, to be precise. Going through some of them proved to be quite the humorous (man, I talked crap back then) and poignant (finding the first words exchanged between friends over IM) exercise. Some of you may have already been sent snippets from the past.
It really is a valuable resource, and that's why I've decided to clean, normalise and make more accessible the information contained in these logs - all 200 megs' worth. This isn't as easy as it sounds. Different media, clients and locations have left the data in a variety of formats and states of integrity. The project will have the following phases:
- Data collection, cleaning and extraction. First I have to locate all of my logs. This is easy enough seeing as I made it a point to secure them from the beginning. What's more difficult is determining the dates each "archive" covers, working out whether there are any gaps or overlaps, and then in the case of ICQ (whose logs are in a proprietary binary format) extracting the data, either conceptually via an API or physically to an intermediary format. It's mostly a manual job.
- Data normalisation and aggregation. This is where development kicks in, since I'll need a way of reading all the various file types and deciding what meaningful information to keep. Different formats keep different levels of detail, so at this point I'll have to think about what I want to put in the database, and how.
- Database design and data retention. Here, I'll create a database and fill it. I think I'll enjoy this bit the most since it's a bit more creative than the other parts.
- Data cleaning and integrity checking. Another manual process, this time looking for dupes, garbage or any other obvious flaws in the data. Not looking forward to this phase, especially with the older stuff.
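
Two of the phases above lend themselves to a little automation: checking the date ranges of the archives for gaps and overlaps, and loading normalised messages into a database without dupes. Here is a minimal sketch in Python using only the standard library. Everything named here is an assumption for illustration rather than a settled design: the function names, the `messages` table and its columns are hypothetical, and it assumes each archive has already been reduced to a first and last day, and each message to a simple tuple.

```python
import sqlite3
from datetime import date  # callers describe each archive's range with date(...)


def find_gaps_and_overlaps(archives):
    """Compare archive date ranges.

    archives: list of (name, first_day, last_day) tuples of datetime.date.
    Returns (gaps, overlaps): gaps are (earlier, later, missing_days) and
    overlaps are (earlier, later) pairs sharing at least one day.
    """
    ordered = sorted(archives, key=lambda a: a[1])
    gaps, overlaps = [], []
    if not ordered:
        return gaps, overlaps
    # Track the latest day covered so far and which archive covers it.
    covered_by, _, covered_until = ordered[0]
    for name, first_day, last_day in ordered[1:]:
        days = (first_day - covered_until).days
        if days > 1:
            gaps.append((covered_by, name, days - 1))
        elif days <= 0:
            overlaps.append((covered_by, name))
        if last_day > covered_until:
            covered_by, covered_until = name, last_day
    return gaps, overlaps


# Hypothetical schema: one row per message, whatever client it came from.
# The UNIQUE constraint is what drops exact dupes from overlapping archives.
SCHEMA = """
CREATE TABLE IF NOT EXISTS messages (
    id        INTEGER PRIMARY KEY,
    sent_at   TEXT NOT NULL,   -- ISO 8601 timestamp
    client    TEXT NOT NULL,   -- e.g. 'ICQ'
    sender    TEXT NOT NULL,
    recipient TEXT NOT NULL,
    body      TEXT NOT NULL,
    UNIQUE (sent_at, sender, recipient, body)
)
"""


def load_messages(conn, rows):
    """Insert (sent_at, client, sender, recipient, body) tuples.

    Rows that duplicate an existing message are silently skipped.
    Returns the number of rows actually added.
    """
    conn.execute(SCHEMA)
    before = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    conn.executemany(
        "INSERT OR IGNORE INTO messages "
        "(sent_at, client, sender, recipient, body) VALUES (?, ?, ?, ?, ?)",
        rows,
    )
    conn.commit()
    after = conn.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
    return after - before
```

The unique constraint only catches exact duplicates. Messages logged by two clients with slightly different timestamps or whitespace would still need the manual integrity pass described in the last phase.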