The second week of coding is coming to an end. Here is what happened this week:

Robert asked me to create a page on the MusicBrainz wiki that describes what data we will mine, how the mining will happen, and where the data will be published. Here is the link to the page:

http://wiki.musicbrainz.org/Server_Log_Analysis

As we work out more details, I will update that page.

The access logs contain sensitive information (IP addresses, for example) that I should not have access to, for privacy reasons. So far I have been working on a sample log that Robert cleaned for me. Cleaning a fresh log every day seemed like a problem, so Robert proposed that I sign a document stating that if I use or distribute this data, I will fail GSoC. Here is the blog post about it:

http://blog.musicbrainz.org/?p=1443

This didn’t work out, because it would have required changing the privacy policy, which (I guess) is a lot of hassle. Instead, I was given the task of writing the sanitizing/anonymizing script in Python. The script hashes IP addresses, user IDs, and other sensitive parts of the logs. The hashes are SHA-1 with a secret salt that somebody (Robert, I guess) will set before the script runs. I didn’t have a GitHub repository before, so I created one:

https://github.com/balidani/MusicBrainz-server-log-analysis

This is where I will upload the code that belongs to the project. Hopefully it will become more structured in the future :). I wrote the script, uploaded it to the repository, and now it needs to be reviewed. I hope I didn’t make too many mistakes; I’m sure my Python skills still need to improve a lot.
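To give an idea of how it works: the core of the anonymization is just a salted hash. Here is a minimal sketch (the field parsing in the real script is more involved, and the salt value below is obviously a placeholder):

import hashlib

# Placeholder: the real salt is secret and set by whoever runs the script.
SALT = "not-the-real-salt"

def anonymize(field):
    """Replace a sensitive field (IP address, user ID, ...) with a salted SHA-1 hash."""
    return hashlib.sha1((SALT + field).encode("utf-8")).hexdigest()

# The same input always yields the same hash, so counting unique visitors
# still works, but the original value cannot be recovered without the salt.
print(anonymize("203.0.113.42"))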

Here is a very small sample log for trying out the sanitizing script:
https://gist.github.com/2852624

Robert came up with an idea for a new query: the client versus the ranking of the highest search result. He also told me to think about how much data we can put into Splunk before we run out of disk space, and how we can decide how much data to keep (I sketch some back-of-envelope math after the list below). This is my todo list for now, plus what my proposal says for the next two weeks:

- Set up the script that runs daily, and make it possible to run Splunk queries from it.
- Decide, for each statistic, how we want to store it.
- Come up with a schema that is as generic as possible, so adding new statistics later won’t be a problem (a rough sketch follows below). This generality will be important throughout the whole project.
- Create some simple queries and store the results.
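On the disk space question, this is the kind of back-of-envelope math I have in mind. Every number here is a made-up placeholder; I don’t have real measurements yet:

# Back-of-envelope estimate of how many days of logs fit on disk.
# All numbers below are hypothetical placeholders, not measurements.
daily_raw_gb = 5.0       # hypothetical: raw log volume indexed per day
index_ratio = 0.5        # hypothetical: Splunk index size relative to raw size
disk_budget_gb = 500.0   # hypothetical: disk space reserved for Splunk

retention_days = disk_budget_gb / (daily_raw_gb * index_ratio)
print("Roughly %.0f days of logs fit in the budget" % retention_days)

Once we measure the real daily volume, the same calculation tells us how aggressively old data needs to be expired.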
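On the generic schema: one shape that would make adding new statistics cheap is a single long table keyed by date, statistic name, and item. A sketch using sqlite3 (the table and column names are just my guesses; nothing is decided yet):

import sqlite3

conn = sqlite3.connect("stats.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS statistic (
        collected_on TEXT NOT NULL,  -- date the daily script ran
        name         TEXT NOT NULL,  -- which statistic this row belongs to
        key          TEXT NOT NULL,  -- the item being measured
        value        REAL NOT NULL   -- the measured number
    )
""")

# A new statistic is just new rows; no schema change is needed.
conn.execute("INSERT INTO statistic VALUES (?, ?, ?, ?)",
             ("2012-06-15", "searches_per_client", "some-client", 4242))
conn.commit()
conn.close()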
