6.470 DB Design

Databases force you to store all of your data in tables. When you define a table, you give it the names of all the columns you want to use and what type of data that field will contain (integer, text, time, etc.)

Some databases are simple enough to fit in one table. For example, the NameTrends uses one table whose columns are

unique id

year

name

gender

percentage

rank

Let's consider a more complicated case. Two years ago, the challenge for 6.470 was to display movie data. Movie data encompassed actors' names and biographical data, as well as the names of movies and trivia about them, including: year of release, MPAA rating (G, PG, R, etc.), cast and parts. The cast included the names of actors that we had data for.

What's the wrong way of design a part of this database? I'll give you two tempting but losing strategies.

The "One Big Table" Strategy

name	m/f	title	year
Carrie Fisher	f	Star Wars: A New Hope	1977
Carrie Fisher	f	The Empire Strikes Back	1980
Carrie Fisher	f	Return of the Jedi	1983
Mark Hamill	m	Star Wars: A New Hope	1977
Mark Hamill	m	The Empire Strikes Back	1980
Mark Hamill	m	Return of the Jedi	1983
Harrison Ford	m	Star Wars: A New Hope	1977
Harrison Ford	m	The Empire Strikes Back	1980
Harrison Ford	m	Return of the Jedi	1983
Harrison Ford	m	Indiana Jones	1981

This design repeats data, which is bad because a bigger database can mean longer query times and because repeated data always means an opportunity for corruption (what happens if Carrie Fischer marries Harrison Ford but you forget to update half of Ms. Fischer's last names?). Additionally, what if you wanted to store data about all kinds of actors, including television actors? How would you store a television actor that had never been in a movie in this table? Leave a bunch of nulls for all the movie fields? Yuck.

The "List Abuse" Strategy

name	m/f	movies
Carrie Fisher	f	1, 2, 3
Mark Hamill	m	1, 2, 3
Harrison Ford	m	1, 2, 3, 4

unique id	title	year
1	Star Wars: A New Hope	1977
2	The Empire Strikes Back	1980
3	Return of the Jedi	1983
4	Indiana Jones	1981

Here we try to avoid making more tables (or avoid joins, which nobody should be scared of but some people are) by listing the movies each actor is in. This goes against the concept of table storage because we've lost the ability to do efficient searching and efficient insertion. For example, to find all the actors in movie id=3 we have to look at every actor, parse the list of movies and see and then search it. With databases, lists are never the answer.

The "Normalization" Strategy

Here's a better way to do this. We're going to do an informal "normalization" of the data by breaking it up into tables to get rid of repeated data. Normalization is a big part of database theory and there are algorithms that achieve different "Normal Forms" that have various desirable properties; fortunately, in most cases your intuition about what set of tables eliminates repeated data will produce a normalized schema; just don't be afraid of:

creating more tables when you need them
creating join tables to provide a link between two tables and get rid of many-to-many relationships contained within one table

With those ideas in mind, let's break actors and movies into their own tables. Each actor has one id, one name, and one gender, which is good because one-to-one relationships tend to mean we're not repeating data. Similarly, each movie has one id, one title, and one year.

actors

id	name	m/f
1	Carrie Fisher	f
2	Mark Hamill	m
3	Harrison Ford	m

movies

id	title	year
1	Star Wars: A New Hope	1977
2	The Empire Strikes Back	1980
3	Return of the Jedi	1983
4	Indiana Jones	1981

The only thing missing is the information about which actors were in which films. To represent this we create a new join table that links the two tables.

movies_actors

actorid	movieid
1	1
1	2
1	3
2	1
2	2
2	3
3	1
3	2
3	3
3	4

Each movie has one or more actors and each actor has one or more movie - these many-to-one relationships create repetition. This join table gets rid of all the data repetition.

Tada! A nice normalized schema with no repeated data.