6.470
Freebase
This lesson will teach you the basics of getting data from Freebase into a MySQL database.
The Big Picture:



Case Study: Music Artists and Albums

Let's assume we want our database to contain musical artists and albums. For each artist, we want to store their name, a picture of the band, and when the band started. Our artist table will look like this:

CREATE TABLE `artists` (
    `id` VARCHAR( 256 ) NOT NULL ,
    `name` VARCHAR( 256 ) NOT NULL ,
    `image` VARCHAR( 256 ) NOT NULL ,
    `start` VARCHAR( 256 ) NOT NULL ,
    PRIMARY KEY ( `id` )
)

For each ablum, we also want the name and picture, as well as the number of tracks. Our album table will look like this:

CREATE TABLE `albums` (
    `id` VARCHAR( 256 ) NOT NULL ,
    `name` VARCHAR( 256 ) NOT NULL ,
    `image` VARCHAR( 256 ) NOT NULL ,
    `track_count` INT NOT NULL,
    PRIMARY KEY ( `id` )
)

Finally, we'd like a join table to associate artists with albums that they created.

CREATE TABLE `glittle+test_db`.`artist_to_album` (
    `artist` VARCHAR( 256 ) NOT NULL ,
    `album` VARCHAR( 256 ) NOT NULL ,
    PRIMARY KEY ( `artist` , `album` )
)
Finding the Data on Freebase

Let's begin by searching on Freebase for our favorite band, The Magnetic Fields.

Now what we really want is a list of all the musical artists on Freebase. We can get this by clicking "Edit and Show details", and then finding the link for Musical Artist.

Now find the text "404,837 Musical Artist topics". The words "Musical Artist" look like a link, but when you click it, a bubble opens with a link for the type schema.

What we are looking at is sortof like an SQL table definition. In Freebase, it is called a "type". The name of the type appears to be "Musical Artist", but it isn't really. The real name is "/music/artist", which appears as small grey text next to "Musical Artist Properties".

Note that there is a property for "Active as Musical Artist (start)". Note also that the real name of this property is the small grey text beneath it, namely "active_start".

Getting the Data from Freebase

Now we'll explore how to programmatically grab information from Freebase. Try pressing "Show Result" below.


What Happend?

When you press "Show Result", the query in the left textbox is formatted into a special URL. This URL begins with http://api.freebase.com/api/service/mqlread?query=, and ends with the query string. In this example, the full URL will become: http://api.freebase.com/api/service/mqlread?query={"query":[{"type":"/music/artist","name":null,"limit":1}]}. The right textbox is really an "iframe" with it's "src" attribute set to the special URL.

What to notice:

Getting the id

Every object in Freebase has a name, which we retrieved above. Every object also has a unique id. We can get the id by adding the following line to the query.

"id":null
Note that we can append the id to the URL http://www.freebase.com/view to get a link to the object. Try clicking http://www.freebase.com/view/en/blonde_redhead.

Getting the Band's Start Time

We want to know when each musical group was created. In the schema, we saw the property "active_start". Let's try adding the following line to our query:

"active_start":null

Getting a Picture of the Band

Now we want an image of the band. If we look at the bottom of the type schema, we see "Topic Properties", followed by the small grey text "/common/topic". This may be viewed as a supertype of "/music/artist". It has the property "image". Let's try getting the image by adding:

"image":null
Unfortunately we get an error. The error has a message for us: "Type /music/artist does not have property image". So perhaps supertype wasn't the best word to describe the relationship between "/common/topic" and "/music/artist". We can get around this by qualifying the "image" property as follows:
"/common/topic/image":null
Unfortunately we get another error. This time it says: "Unique query may have at most one result. Got 2". The problem is that there are 2 images for this band. Apparently "null" means "find exactly one value for this property". We need something that says "find all the values for this property". Freebase uses the JSON array "[]" to mean this. Try:
"/common/topic/image":[]
Now that worked, sortof. Unfortunately we are just getting a textual description of each image. Where is the actual image? By default, the "[]" placeholder returns a list of names of objects. We can coax Freebase into giving us a list of the actual objects by using "[{}]".
"/common/topic/image":[{}]
This still doesn't tell us the URL of each image. In a way it does, since it gives us the "id" for each image, and this can be turned into a URL; however, maybe there is a property that has the URL explicitely. Freebase has a special query property that means "return every property for this object". It is written as "*":null. Let's try it:
"/common/topic/image":[
  {
    "*":null
  }
]
Apparently there is no such property, although "image_location" looked promissing (this property probably says where this image exists elsewere on the web). In any case, the special URL for accessing the image is created by appending the "id" to the URL http://www.freebase.com/api/trans/raw. This gives us http://www.freebase.com/api/trans/raw/guid/9202a8c04000641f80000000049f07ae. Putting that in an "img" tag gives us:

If we just want the "id" for each image, we can change our query to:
"/common/topic/image":[
  {
    "id":null
  }
]
Since our database table only has space for one image per band, we may as well limit these images results as follows:
"/common/topic/image":[
  {
    "id":null,
    "limit":1
  }
]

Getting the Albums

We want a list of albums for each band. We can do this as follows:


Now we already know how to expand this query to give us an image for each album. The only thing we have left to get is the number of tracks on each album. We may start by adding the bold line below to our query:

      "album":[{
        "id":null,
        "name":null,
        "track":[]
      }],

This gives us the tracks, and we could just count them ourselves when we process the data. An alternative is to have Freebase count the tracks for us by doing:
        "track":[{
            "return":"count"
        }]

Putting the Data into MySQL

If we write the entire query suggested above, we'll get a result something like:

  "result":[
    {
      "type":"/music/artist",
      "id":"/en/led_zeppelin",
      "name":"Led Zeppelin"
      "/common/topic/image":[
        {
          "id":"/guid/9202a8c04000641f8000000001208c68"
        }
      ],
      "active_start":"1968",      
      "album":[
        {
          "track":[
            10
          ],
          "/common/topic/image":[
            {
              "id":"/wikipedia/images/en_id/194757"
            }
          ],
          "id":"/guid/9202a8c04000641f80000000000593d3",
          "name":"Led Zeppelin"
        },
        ...
      ],
    },
    ...
  ]
Conceptually, we want to iterate over each artist and add a row to the "artists" table of our database. While we're doing that, we want to iterate over each album within each artist and it to the "albums" table. We also want to add an entry to the "artists_to_albums" table for each album connecting it to the current artist.

We'll walk through the following PHP code which should do the trick. First, we connect to our database.

<?php

$db_host = "sql.mit.edu";
$db_username = "my_name";
$db_password = "jausfkwhytiu";
$db_database = "my_db";

mysql_connect($db_host, $db_username, $db_password) or die(mysql_error());
mysql_select_db($db_database) or die(mysql_error());

Here is a sample output to our query. We have just pasted the text directly into our PHP code. You'll probably want to have PHP download the query results directly, using a function like "file_get_contents".

$string_data = <<<ASDF
{
  "result":[
    {
      "type":"/music/artist",
      "id":"/en/led_zeppelin",
      "name":"Led Zeppelin",
      "/common/topic/image":[
        {
          "id":"/guid/9202a8c04000641f8000000001208c68"

        }
      ],
      "active_start":"1968",      
      "album":[
        {
          "id":"/guid/9202a8c04000641f80000000000593d3",
          "name":"Led Zeppelin",
          "/common/topic/image":[
            {
              "id":"/wikipedia/images/en_id/194757"

            }
          ],
          "track":[
            10
          ]
        }
      ]
    }
  ]
}
ASDF;

This code includes our PHP-JSON library, and uses it to parse the JSON string into PHP objects.

require_once 'JSON.php';
$json_service = new Services_JSON();
$json_data = $json_service->decode($string_data);

Now that we have the result as a PHP object, we want to access the "result" property of the top level object. Note that "result" is an array, so we'll iterate over it.

$result = $json_data->result;
for ($i = 0; $i < sizeof($result); $i++) {

Each element of the "result" array represents an artist, so we'll grab all the information we want from the artist into a set of temporary variables. Note that hassle we go through to get the image--this has to do with the cumbersome nature of PHP syntax when dealing with objects.

    $artist = $result[$i];
    
    $id = $artist->id;
    $name = $artist->name;
    
    $temp = "/common/topic/image";    
    $image = $artist->$temp;
    $image = $image[0];
    $image = $image->id;
    $image = "http://www.freebase.com/api/trans/raw".$image;
    
    $start = $artist->active_start;

Now we're ready to insert this artist into our database. We begin by escaping all of our variables (the main thing this will do is put a backslash in front of any apostrophes). Then we create the query string and send it to MySQL. Finally, we print the query string for debugging purposes.

    $id = mysql_real_escape_string($id);
    $name = mysql_real_escape_string($name);
    $image = mysql_real_escape_string($image);
    $start = mysql_real_escape_string($start);
    
    $sql = "INSERT INTO `artists` (id, name, image, start) VALUES ('$id', '$name', '$image', '$start')";
    mysql_query($sql);
    echo "query: $sql\n";

The next thing we'll do is iterate over each album for this artist. Before we do, we'll save a copy of the artist's id for later use.

    $artist_id = $id;
    $albums = $artist->album;
    for ($ii = 0; $ii < sizeof($albums); $ii++) {

For each album, we're going to grab all the information we want, just as we did for each artist.

        $album = $albums[$ii];
        
        $id = $album->id;
        $name = $album->name;
        
        $temp = "/common/topic/image";    
        $image = $album->$temp;
        $image = $image[0];
        $image = $image->id;
        $image = "http://www.freebase.com/api/trans/raw".$image;
        
        $track_count = $album->track;
        $track_count = $track_count[0];        

As before, we'll escape all of our variables before constructing the MySQL query string.

        $id = mysql_real_escape_string($id);
        $name = mysql_real_escape_string($name);
        $image = mysql_real_escape_string($image);
        $track_count = mysql_real_escape_string($track_count);
    
        $sql = "INSERT INTO `albums` (id, name, image, track_count) VALUES ('$id', '$name', '$image', '$track_count')";
        mysql_query($sql);
        echo "query: $sql\n";

This final query will associate this album with the current artist, whose id we saved in $artist_id.

        $sql = "INSERT INTO `artist_to_album` (artist, album) VALUES ('$artist_id', '$id')";
        mysql_query($sql);
        echo "query: $sql\n";
    }
}

echo "done!\n";

?>

Dealing with Large Results

Try running the following query:


We get an error saying that the query timed out. It took too long to run on Freebase's server, and it gave up. The way to get around this problem is to use "cursors". Try the following query:

{
  "cursor":true,
  "query":[
    {
      "type":"/music/artist",
      "name":null,
      "limit":10
    }
  ]
}
If we look at the result, we notice that the value of "cursor" has been changed from true to some long string of seemingly random characters. This string of characters is a placeholder. We can copy the string in place of "true" in our query and run it again.

This time the result is different. We have accessed the next result. The value of cursor has also changed. We need to copy this new cursor value into our query in order to get the next set of results. The last set of results will have the value of cursor set to false.