Week 3 is over, and so are my exams! This is what I’ve been doing this week:

I found two minor problems with the sanitizing script. Instead of first searching for matches of a pattern and then replacing the matching parts of a line, the script now calls re.sub() directly, which is a much safer way to do substitution. Robert forked my repository. This is the new location; from now on I will push code only here:


The code went up for review. Robert said it was OK, but that it would be best if someone else checked it out too. The review request is located here:


I wrote a test using Python’s unit testing framework. It can also be found in the repository. The test script generates lines with (fake) sensitive data, then checks whether this data has been properly removed from each line after processing. The test passed on all inputs.
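To give an idea of how the two pieces fit together, here is a minimal sketch of substitution with re.sub() plus a generated-input test. It is not the code from the repository: the two patterns (email addresses and IPv4 addresses), the placeholder strings, the log line layout and the names sanitize, fake_line and SanitizeTest are all assumptions made up for this illustration, and I am assuming the unittest module as the testing framework.

```python
import random
import re
import unittest

# Hypothetical patterns, for illustration only; the real script's patterns
# are not shown in this post.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
IPV4_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def sanitize(line):
    """Replace every match of each pattern; re.sub() handles all matches in a line."""
    line = EMAIL_RE.sub("[email]", line)
    line = IPV4_RE.sub("[ip]", line)
    return line


def fake_line(rng):
    """Build a log-like line with fake sensitive values; return the line and the values."""
    ip = ".".join(str(rng.randint(1, 254)) for _ in range(4))
    email = "user%d@example.org" % rng.randint(0, 9999)
    line = '%s - - "GET /ws/2/artist?query=%s HTTP/1.1" 200' % (ip, email)
    return line, [ip, email]


class SanitizeTest(unittest.TestCase):
    def test_fake_sensitive_data_is_removed(self):
        rng = random.Random(0)  # fixed seed so a failure can be reproduced
        for _ in range(1000):
            line, secrets = fake_line(rng)
            cleaned = sanitize(line)
            for secret in secrets:
                self.assertNotIn(secret, cleaned)
```

Saved as a module, a test like this can be run with python -m unittest. Keeping the generated secrets next to each line makes the check simple: none of them may appear in the sanitized output.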

Robert sanitized a batch of log files for me (9 days’ worth). It’s all indexed in Splunk now, and the total number of log entries is around 180 million!

I started collecting Splunk queries for each part of the wiki article’s section that describes what we want to mine. I also ran many of these queries. You can see all the queries here:


Some of the output of the top entity queries:

Top artists:

artist            mbid                                  views  percent
[unknown]         125ec42a-7229-4250-afc5-e057484327fe  19676  0.971494
.unknown.         60f7d020-9580-47c5-8839-a4f25659687d  18925  0.934414
unknown           42931979-e2b8-4c1d-895f-cd6cd86ae69d  18878  0.932093
Unknown           4d307588-7e57-4032-bde6-5f217fc09b2a  18546  0.915701
Unknown           222d1430-c367-4d57-84c9-5c8e4ed37d53  18093  0.893334
Unknown           a3866930-01d3-4988-bfb0-9378306e5cb5  18067  0.892050
Ben Howard        534dda3c-b73f-408b-8889-bd68eae84df6   6265  0.309332
Michael Kiwanuka  11f570ff-44d9-4e9c-8812-e6d56103c5c1   5994  0.295951
Bossk             9b3abeb1-1dc6-4871-8e32-14d084362648   5658  0.279361
Woody Guthrie     cbd827e1-4e38-427e-a436-642683433732   5635  0.278226

The many Unknown artists are not an error on MusicBrainz’s side. Two of them are special-purpose artists that were indeed created for recordings with unknown artists. The other four are actual artists called “Unknown”. Someone is mis-tagging a lot of songs here. A further search revealed that this “error” only occurs on /ws/1/ (which is deprecated). Results for /ws/2 are normal.

Top release-groups (for 2 June):

1 Coldplay feat. Rihanna: Princess of China
2 Eminem, Dr. Dre & 50 Cent: Crack a Bottle
3 Adele: 21
4 David Guetta feat. Rihanna: Who's That Chick?
5 Kanye West feat. Jay-Z, Rick Ross, Bon Iver & Nicki Minaj: Monster
6 Coldplay: Viva la Vida or Death and All His Friends
7 Lady Gaga feat. Beyoncé: Telephone
8 Coldplay: Mylo Xyloto
9 Coldplay: X&Y
10 Coldplay: A Rush of Blood to the Head

Top recordings (for 2 June):

1 MusicBrainz Test Artist: Please Mister Nagios
2 She Wants Revenge: [silence]
3 Boston Symphony Orchestra, Sir Colin Davis: Symphony No. 2 in D major, Op. 43: IV. Finale. Allegro moderato
4 Joaquín Rodrigo: Concierto de Aranjuez: I. Allegro con spirito
5 Kings of Leon: Sex on Fire
6 Eddy Mitchell: La Ballade de Bill Brillantine
7 Richard Wagner: Lohengrin: Act III, Scene I. "Treulich geführt ziehet dahin"
8 Johannes Brahms: Sonata for Piano and Violin No. 1 in G major, Op. 78: II. Adagio
9 Johannes Brahms: Sonata for Piano and Violin No. 1 in G major, Op. 78: II. Adagio
10 Johannes Brahms: Trio for Piano, Violin, and Cello No. 1 in B major, Op. 8: III. Adagio

The 8th and 9th recordings seem to be the same recording under different mbids (which can happen, as far as I know). Sorry for the inconsistent formatting; this is just to give a little preview. I have also been thinking about other kinds of queries; for example, some information about user behavior would be nice. I don’t have access to IP addresses, but I do have their hashes, so I could compute the average number of requests per distinct IP address per day.
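The requests-per-address idea is easy to state in code. This is only a rough sketch, outside of Splunk, of the number I have in mind: it assumes each log entry has already been reduced to a (day, ip_hash) pair and that the same address always gets the same hash. The function name avg_requests_per_ip and the sample values are made up for the illustration.

```python
from collections import defaultdict


def avg_requests_per_ip(entries):
    """Return {day: requests that day / distinct IP hashes seen that day}.

    entries is an iterable of (day, ip_hash) pairs, one pair per request.
    """
    requests = defaultdict(int)   # day -> number of requests
    distinct = defaultdict(set)   # day -> set of IP hashes seen
    for day, ip_hash in entries:
        requests[day] += 1
        distinct[day].add(ip_hash)
    return {day: requests[day] / len(distinct[day]) for day in requests}


# Made-up sample: three requests from two hashed addresses on 2 June,
# one request on 3 June.
sample = [
    ("2 June", "a1"),
    ("2 June", "a1"),
    ("2 June", "b2"),
    ("3 June", "c3"),
]
averages = avg_requests_per_ip(sample)  # {"2 June": 1.5, "3 June": 1.0}
```

Counting distinct hashes instead of distinct addresses gives the same result as long as the hashing is consistent, which is why not having the raw IPs should not matter for this query.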

Tomorrow I’m off to Stockholm for 10 days. This was part of my proposal, but when I wrote it I thought I would not leave until 20 June. When I come back I should focus on creating a schema that we can use to store the results of queries. This is an important part of my project, because we want it to be as generic as possible.

Hej då! Vi ses snart! (Goodbye! See you soon!)