This week codereview still doesn’t work. I tried uploading many different diff files, but nothing seemed to work. I noticed something weird here, though I’m not sure I understand it properly:

https://github.com/metabrainz/musicbrainz-server-log-analysis/network

If you look at the graph, there is a commit before the merging, and I don’t know how that is possible.

Anyway, this week Robert reviewed my querying script. You can find the (new) code here:

https://github.com/metabrainz/musicbrainz-server-log-analysis/blob/master/querying/run_queries.py

I tried to bring some order to the repository. Robert said we should not use a CSV file for storing queries; we should either store them in the database or in code. With help from Oliver I decided to go with YAML. I’ve only just learned about it, but it’s a very nice tool for representing data structures and storing them in a human-readable way. You can also find the queries.yml file in the repository.
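
Just to give a flavour of the idea, here is a rough sketch (not the actual code from the repository) of how the script might load queries from a YAML file with PyYAML; the field names are made up for illustration:

# Minimal sketch of loading queries from a YAML file with PyYAML.
# The field names ('name', 'query') are illustrative; the real
# queries.yml in the repository may be structured differently.
import yaml

with open('queries.yml') as f:
    queries = yaml.safe_load(f)

for entry in queries:
    print(entry['name'], '->', entry['query'])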

One of the other things Robert suggested was to make a config file for the script. The config file contains the credentials for the database and also for Splunk. I also changed the way queries.yml is referenced: until now it was passed as an argument, and my original thought was to store the path in the config file, but Rob said it’s okay to refer to it statically, so I made the script that way. I created a default config file with the real username/password missing, and uploaded it to the repository.
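
As a sketch of the idea (the real config file may use different section and option names), reading the credentials could look something like this:

# Sketch of reading database and Splunk credentials from a config file.
# Section and option names here are placeholders, not necessarily the
# ones used in the real default config file.
from configparser import ConfigParser

config = ConfigParser()
config.read('default.cfg')

db_user = config.get('database', 'username')
db_pass = config.get('database', 'password')
splunk_host = config.get('splunk', 'host')
splunk_port = config.getint('splunk', 'port')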

I forked the musicbrainz-server repository, and right now I’m learning about the Perl modules that are created for every table in the database. I should eventually make a module like that for the “log_statistic” table. I’ll commit the changes I make to my fork of the repository.

Next week I should make a simple module that displays a report on the website. Robert said the learning curve is quite steep, so I will spend as much time as possible learning how things work and experimenting. He also said that displaying reports (and graphs, etc.) might take up the rest of the project’s time.


This week I’ve finished last week’s tasks. Since I couldn’t connect to our Splunk server (pino), Robert asked Dave (djce) to grant my user access to port 8089 on pino from rika. This means I had to stop development on the VM and move to rika. I learned about virtualenv, which is a very nice way to avoid installing everything on the server and making a huge mess. I set up a virtual environment for Python and installed the Splunk Python SDK and psycopg2 there. Psycopg2 is a Python module for handling PostgreSQL databases.

I created the database table on rika and finished the code with the part that actually inserts data into the table. I ran the code and it worked fine. Now we can move on to indexing things periodically.
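
The insert itself is simple with psycopg2. This is just a rough sketch with placeholder connection details and a made-up payload, not the exact code from the repository:

# Rough sketch of inserting one query result into the log_statistic table
# with psycopg2; host, database name and credentials are placeholders.
import json
import psycopg2

conn = psycopg2.connect(host='localhost', dbname='musicbrainz',
                        user='user', password='secret')
cur = conn.cursor()

result = {'artist': 'Ben Howard', 'views': 6265}  # example payload
cur.execute(
    "INSERT INTO log_statistic (report_type, data) VALUES (%s, %s)",
    ('top_artists', json.dumps(result)))

conn.commit()
cur.close()
conn.close()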

I also updated the log sanitizing script. For some reason our codereview site is buggy and it won’t let me upload a new diff. No matter what we tried, it still won’t work. I hope this gets fixed by next week, because we really need to move on now.

I’m back, and the fifth week of Google Summer of Code has just ended. I missed 10 days, which means I didn’t do anything in week #4 :(. This is what I managed to do in the second half of this week:

There were some discussions about the way logs are sanitized. Oliver (ocharles) suggested that we parse URL parameters and try to match an e-mail regex against each value. I modified the script so it finds log entries that would be anonymized if we did things this way, and mailed the results to Robert. Unfortunately 99.9% of the entries are real data that would be lost.
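
The idea is roughly the following (this is just a sketch; the actual regex and parsing in the script may differ):

# Sketch of the idea: parse the URL's query parameters and flag any value
# that looks like an e-mail address. The regex is a simplified stand-in
# for whatever the real script would use.
import re
from urllib.parse import urlparse, parse_qsl

EMAIL_RE = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

def has_email_param(url):
    params = parse_qsl(urlparse(url).query)
    return any(EMAIL_RE.search(value) for _, value in params)

print(has_email_param('/ws/2/artist?query=someone%40example.com'))  # True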

On the other hand, I found another type of log entry that contains sensitive information and wasn’t filtered by the regexes. I wrote a new regex that matches these entries, and I’ll push the new code to the repository before we start sanitizing logs on a daily basis.

This week’s most important task was to come up with a schema to store query results in. Ian (ianmcorvidae) suggested that we store query results in JSON format, which is a very good idea. Basically we will have two types of queries: queries that we want to keep a historical record of, and queries for which we are only interested in the latest results. For example, keeping the historical record for top artists is a good idea, but for other queries it might be a waste of space. This is the schema we have come up with so far:

CREATE TABLE log_statistic
(
    id          SERIAL,
    timestamp   TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    report_type VARCHAR(100) NOT NULL,
    data        TEXT NOT NULL -- JSON data
);

This allows many more report types in the future, so if we want to handle queries differently we can just do that. It is important that the timestamp is kept for each row.

I installed the MusicBrainz server VM on my computer to experiment with the database, and added the table described above. I had to install the psycopg2 module for Python to access the PostgreSQL database. The last step for this week would have been to run queries on splunk.musicbrainz.org and store the results in the database. Unfortunately I was not able to connect to our Splunk server remotely. This shouldn’t be that difficult, so it will soon be done. For now I’ve run the queries on my own local Splunk server for testing.

In the meantime, I wrote the Python script that runs queries and outputs the results in JSON format. Right now the queries are stored in a CSV file; each row contains a query name and the query itself. Later, more metadata will be added to this file, for example how often the query needs to run.
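
To illustrate the idea, this is roughly what the script does, sketched with the Splunk Python SDK’s oneshot search; the hostname, credentials and CSV layout are placeholders, and the real script in the repository may look different:

# Very rough sketch: read (name, query) rows from a CSV file, run each
# query as a Splunk oneshot search, and print the results as JSON.
import csv
import json
import splunklib.client as client
import splunklib.results as results

service = client.connect(host='localhost', port=8089,
                         username='admin', password='secret')

with open('queries.csv') as f:
    for name, query in csv.reader(f):
        # queries are assumed to be full searches, e.g. "search index=main ..."
        reader = results.ResultsReader(service.jobs.oneshot(query))
        rows = [dict(r) for r in reader if isinstance(r, dict)]
        print(json.dumps({'report_type': name, 'data': rows}))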

The only thing left is to connect to the Splunk server and store the results in the VM’s database. This shouldn’t take long once I know which port/credentials to use.

Next week we will focus on indexing data and running queries periodically. After that I should get the popularity statistics working, and find a way to present the results on the website. I should also ask for a sandbox from Ian, so the results are visible online.

Week 3 is over, and so are my exams! This is what I’ve been doing this week:

I found two minor problems with the sanitizing script. Instead of searching for matches of a pattern and then replacing all the matching parts in a line, the script now uses re.sub(), which is a much safer way to do substitution. Robert forked my repository; this is the new location, and from now on I will only be pushing code here:

https://github.com/metabrainz/musicbrainz-server-log-analysis/
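
For illustration, here is a tiny example of the re.sub() approach mentioned above (the pattern and replacement are simplified, not the ones in the actual script):

# Every match of the pattern is replaced in a single pass.
import re

IP_RE = re.compile(r'\b(?:\d{1,3}\.){3}\d{1,3}\b')
line = '1.2.3.4 - - [02/Jun/2012] "GET /ws/2/artist HTTP/1.1" 200'
print(IP_RE.sub('<ip>', line))
# <ip> - - [02/Jun/2012] "GET /ws/2/artist HTTP/1.1" 200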

The code went up for review. Robert said it was OK, but it would be best if someone else checked it out too. The review request is located here:

http://codereview.musicbrainz.org/r/1934/

I wrote a test using Python’s unit testing framework; this can also be found in the repository. The script generates lines with (fake) sensitive data, then checks whether this data is properly removed from each line after it has been processed. The test ran correctly on all inputs.
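
As a sketch of what such a test looks like (the module and function name sanitize_line() are assumed here, not necessarily what the real script exports):

# Feed lines containing fake sensitive data through the sanitizer and
# check that none of it survives.
import unittest
from sanitize import sanitize_line  # hypothetical import

class SanitizeTest(unittest.TestCase):
    def test_ip_removed(self):
        line = '10.0.0.1 - - "GET /ws/2/artist HTTP/1.1" 200'
        self.assertNotIn('10.0.0.1', sanitize_line(line))

    def test_email_removed(self):
        line = '10.0.0.1 - - "GET /?q=someone@example.com HTTP/1.1" 200'
        self.assertNotIn('someone@example.com', sanitize_line(line))

if __name__ == '__main__':
    unittest.main()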

Robert sanitized a bunch of log files (9 days’ worth) for me. It’s all indexed in Splunk now, and the total number of log entries is around 180 million!

I started collecting Splunk queries for each section of the wiki article that describes what we want to mine. I also ran many of these queries. You can see all the queries here:

http://pastebin.com/e6w6qmQQ

Some of the output of the top entity queries:

Top artists:

artist            mbid                                      views  percent
---------------------------------------------------------------------------
[unknown]         [125ec42a-7229-4250-afc5-e057484327fe]    19676  0.971494
.unknown.         [60f7d020-9580-47c5-8839-a4f25659687d]    18925  0.934414
unknown           [42931979-e2b8-4c1d-895f-cd6cd86ae69d]    18878  0.932093
Unknown           [4d307588-7e57-4032-bde6-5f217fc09b2a]    18546  0.915701
Unknown           [222d1430-c367-4d57-84c9-5c8e4ed37d53]    18093  0.893334
Unknown           [a3866930-01d3-4988-bfb0-9378306e5cb5]    18067  0.892050
Ben Howard        [534dda3c-b73f-408b-8889-bd68eae84df6]     6265  0.309332
Michael Kiwanuka  [11f570ff-44d9-4e9c-8812-e6d56103c5c1]     5994  0.295951
Bossk             [9b3abeb1-1dc6-4871-8e32-14d084362648]     5658  0.279361
Woody Guthrie     [cbd827e1-4e38-427e-a436-642683433732]     5635  0.278226

The many Unknown artists are not an error on MusicBrainz’s side. Two of them are special-purpose artists that were indeed created for recordings with unknown artists; the other four are actual artists called “Unknown”. Someone is mis-tagging a lot of songs here. A further search revealed that this “error” only happens on /ws/1/ (which is deprecated); results for /ws/2 are normal.

Top release-groups (for 2 June):

1 Coldplay feat. Rihanna: Princess of China
2 Eminem, Dr. Dre & 50 Cent: Crack a Bottle
3 Adele: 21
4 David Guetta feat. Rihanna: Who's That Chick?
5 Kanye West feat. Jay-Z, Rick Ross, Bon Iver & Nicki Minaj: Monster
6 Coldplay: Viva la Vida or Death and All His Friends
7 Lady Gaga feat. Beyoncé: Telephone
8 Coldplay: Mylo Xyloto
9 Coldplay: X&Y
10 Coldplay: A Rush of Blood to the Head

Top recordings (for 2 June):

1 MusicBrainz Test Artist: Please Mister Nagios
2 She Wants Revenge: [silence]
3 Boston Symphony Orchestra, Sir Colin Davis: Symphony No. 2 in D major, Op. 43: IV. Finale. Allegro moderato
4 Joaquín Rodrigo: Concierto de Aranjuez: I. Allegro con spirito
5 Kings of Leon: Sex on Fire
6 Eddy Mitchell: La Ballade de Bill Brillantine
7 Richard Wagner: Lohengrin: Act III, Scene I. "Treulich geführt ziehet dahin"
8 Johannes Brahms: Sonata for Piano and Violin No. 1 in G major, Op. 78: II. Adagio
9 Johannes Brahms: Sonata for Piano and Violin No. 1 in G major, Op. 78: II. Adagio
10 Johannes Brahms: Trio for Piano, Violin, and Cello No. 1 in B major, Op. 8: III. Adagio

The 8th and 9th recordings seem to be the same, just with different MBIDs (which can happen as far as I know). Sorry for the inconsistent formatting, this is just to give a little preview. I have also been thinking about different queries; for example, some information about user behavior would be nice. I don’t have access to IPs, but I do have the hashes, so I could see the average number of requests per distinct IP address in a day.
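
Just as a sketch of that idea (assuming the IP hash is the first field of each sanitized log line, which may not match the real format):

# Count requests per IP hash for one day's sanitized log and average them.
from collections import Counter

def avg_requests_per_hash(lines):
    counts = Counter(line.split()[0] for line in lines if line.strip())
    return sum(counts.values()) / float(len(counts)) if counts else 0.0

with open('access.log.sanitized') as f:  # placeholder file name
    print(avg_requests_per_hash(f))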

Tomorrow I’m off to Stockholm for 10 days. This was part of my proposal, but when I wrote it I thought I wouldn’t be leaving until 20 June. When I come back I should focus on creating the schema we will use to store query results. This is an important part of my project, because we want it to be as generic as possible.

Hej då! Vi ses snart! (Goodbye! See you soon!)

The second week of coding is coming to an end soon. This is what happened this week:

Robert asked me to create a wiki page on the MusicBrainz wiki that describes what data we will mine, how the mining is going to happen, and where the data will be published. Here is the link to the page:

http://wiki.musicbrainz.org/Server_Log_Analysis

When we work out more details I will update that page.

The access logs contain sensitive information (IP addresses, for example) that I should not be able to access, for privacy reasons. So far I have been working on a sample log that Robert cleaned for me. This seemed like a bit of a problem to do every day, so Robert proposed that I sign a document stating that if I use or distribute this data I will fail GSoC. Here is the blog post about it:

http://blog.musicbrainz.org/?p=1443

This didn’t work out, because the privacy policy would have had to be changed, which (I guess) is a lot of hassle. Instead, I was given the task of writing the sanitizing/anonymizing script in Python. The script hashes IP addresses, user IDs and other sensitive parts of the logs. The hashes are SHA-1 with a secret salt that somebody (Robert, I guess) will set before the script runs. I didn’t have a GitHub repository before, so I created one:

https://github.com/balidani/MusicBrainz-server-log-analysis

This is the place where I will upload code that belongs to the project. Hopefully it will be more structured in the future :). I wrote the script, uploaded it to the repository, and now it should be reviewed. I hope I didn’t make too many mistakes; I’m sure my Python skills still need to improve a lot.
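
To illustrate the hashing idea described above, here is a minimal sketch (the salt handling and function name in the real script may differ):

# SHA-1 over a secret salt plus the sensitive value, so the same IP always
# maps to the same opaque token without being recoverable. The salt value
# and the helper name are placeholders.
import hashlib

SALT = 'replace-me-before-running'  # set by whoever runs the script

def hash_value(value):
    return hashlib.sha1((SALT + value).encode('utf-8')).hexdigest()

print(hash_value('10.0.0.1'))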

Here is a very small sample to try the sanitizing script:
https://gist.github.com/2852624

Robert came up with a new idea for a query: the client vs. the highest search result ranking. He also told me to think about how much data we can put into Splunk before we run out of disk space, and how we can tell how much data to keep. This is my todo list for now, plus what my proposal says for the next two weeks:

Set up the script that runs daily, and make it possible to run Splunk queries with it. Decide for each statistic how we want to store it. Come up with a schema that is as generic as possible, so adding new statistics later won’t be a problem; this genericity will be important throughout the whole project. Create some simple queries and store the results.

This Monday (21st May) coding officially started. Sweet! For the first two weeks we planned some exploratory work. I had already started learning Python and Splunk earlier. Splunk is a very nice tool created for exactly what we want to do, and it has a Python SDK, so it will be even easier to integrate.

I received some anonymized logs from Robert and started mining data. The log contains ~23 million entries and was created over the course of a day. These Google Docs spreadsheets display some of the results:

(Webservice calls)
https://docs.google.com/spreadsheet/ccc?key=0Agqqc0nCeRq9dEhiVUdaUEhrR0RLNUhCU3lzS1U2X2c#gid=0

(Menu statistics)
https://docs.google.com/spreadsheet/ccc?key=0Agqqc0nCeRq9dFlPdEdock5KR0pfblNGNzNiLUQ0bHc#gid=7

I also started exploring graphing solutions, though I have only checked out Google Chart Tools so far. Ian suggested multiple other options on the mailing list, and I plan to take a look at them this week. The “problem” is that we may need to create very different charts/graphs, so having more options is always good.

This week I will also try to create more Splunk queries. Then I’ll put them in a batch and leave my poor computer on for the night.

Until 7th June I still have some exams to take, so I can’t work at my full capacity yet. I’ll have a few full days though.