6.470 Day 2 |
Database Design |
Databases force you to store all of your data in tables. When you define a table, you give it the names of all the columns you want to use and what type of data that field will contain (integer, text, time, etc.)
Some databases are simple enough to fit in one table. For example, the NameTrends uses one table whose columns areunique id | year | name | gender | percentage | rank |
Let's consider a more complicated case. Two years ago, the challenge for 6.470 was to display movie data. Movie data encompassed actors' names and biographical data, as well as the names of movies and trivia about them, including: year of release, MPAA rating (G, PG, R, etc.), cast and parts. The cast included the names of actors that we had data for.
What's the wrong way of design a part of this database? I'll give you two tempting but losing strategies.
The "One Big Table" Strategy
name | m/f | title | year |
---|---|---|---|
Carrie Fisher | f | Star Wars: A New Hope | 1977 |
Carrie Fisher | f | The Empire Strikes Back | 1980 |
Carrie Fisher | f | Return of the Jedi | 1983 |
Mark Hamill | m | Star Wars: A New Hope | 1977 |
Mark Hamill | m | The Empire Strikes Back | 1980 |
Mark Hamill | m | Return of the Jedi | 1983 |
Harrison Ford | m | Star Wars: A New Hope | 1977 |
Harrison Ford | m | The Empire Strikes Back | 1980 |
Harrison Ford | m | Return of the Jedi | 1983 |
Harrison Ford | m | Indiana Jones | 1981 |
The "List Abuse" Strategy
name | m/f | movies |
---|---|---|
Carrie Fisher | f | 1, 2, 3 |
Mark Hamill | m | 1, 2, 3 |
Harrison Ford | m | 1, 2, 3, 4 |
unique id | title | year |
---|---|---|
1 | Star Wars: A New Hope | 1977 |
2 | The Empire Strikes Back | 1980 |
3 | Return of the Jedi | 1983 |
4 | Indiana Jones | 1981 |
Here's a better way to do this. We're going to do an informal "normalization" of the data by breaking it up into tables to get rid of repeated data. Normalization is a big part of database theory and there are algorithms that achieve different "Normal Forms" that have various desirable properties; fortunately, in most cases your intuition about what set of tables eliminates repeated data will produce a normalized schema; just don't be afraid of:
With those ideas in mind, let's break actors and movies into their own tables. Each actor has one id, one name, and one gender, which is good because one-to-one relationships tend to mean we're not repeating data. Similarly, each movie has one id, one title, and one year.
id | name | m/f |
---|---|---|
1 | Carrie Fisher | f |
2 | Mark Hamill | m |
3 | Harrison Ford | m |
id | title | year |
---|---|---|
1 | Star Wars: A New Hope | 1977 |
2 | The Empire Strikes Back | 1980 |
3 | Return of the Jedi | 1983 |
4 | Indiana Jones | 1981 |
The only thing missing is the information about which actors were in which films. To represent this we create a new join table that links the two tables.
actorid | movieid |
---|---|
1 | 1 |
1 | 2 |
1 | 3 |
2 | 1 |
2 | 2 |
2 | 3 |
3 | 1 |
3 | 2 |
3 | 3 |
3 | 4 |
Tada! A nice normalized schema with no repeated data.