summaryrefslogtreecommitdiff
path: root/_posts/2019-12-20-sql-murder-mystery.md
diff options
context:
space:
mode:
Diffstat (limited to '_posts/2019-12-20-sql-murder-mystery.md')
-rw-r--r--_posts/2019-12-20-sql-murder-mystery.md529
1 files changed, 529 insertions, 0 deletions
diff --git a/_posts/2019-12-20-sql-murder-mystery.md b/_posts/2019-12-20-sql-murder-mystery.md
new file mode 100644
index 0000000..82aa2ce
--- /dev/null
+++ b/_posts/2019-12-20-sql-murder-mystery.md
@@ -0,0 +1,529 @@
+---
+title: Solving the SQL Murder Mystery
+---
+
+I saw this [SQL Murder Mystery](https://github.com/NUKnightLab/sql-mysteries)
+appear on Hacker News recently, thought it sounded fun, and figured I'd do a
+quick write-up of how I worked through it.
+
+If you want to follow along, go ahead and [download the SQLite database][db]
+(which is copyright NUKnightLab and redistributed here under the [MIT
+license][mit]). You'll need some kind of SQLite client to interact with it (I
+just used the `sqlite3` CLI tool).
+
+In addition to the database, it's very helpful to start with a prompt:
+
+> A crime has taken place and the detective needs your help. The detective gave
+> you the crime scene report, but you somehow lost it. You vaguely remember that
+> the crime was a murder that occurred sometime on Jan. 15, 2018 and that it
+> took place in SQL City. Start by retrieving the corresponding crime scene
+> report from the police department's database. If you want to get the most out
+> of this mystery, try to work through it only using your SQL environment and
+> refrain from using a notepad.
+
+<!--more-->
+
+Let's start by seeing what tables are available. The `sqlite3` CLI uses
+meta-commands that start with a dot, like this:
+
+ sqlite> .tables
+ crime_scene_report get_fit_now_check_in interview
+ drivers_license get_fit_now_member person
+ facebook_event_checkin income solution
+
+Okay, let's start with finding our crime scene report. First, we'll need to know
+what the data looks like. We can learn about this with the `.schema` command:
+
+ sqlite> .schema crime_scene_report
+ CREATE TABLE crime_scene_report (
+ date integer,
+ type text,
+ description text,
+ city text
+ );
+
+Okay, seems pretty straightforward. The only thing I'm not quite sure about is
+how the date is being represented -- it's just stored as an integer. A UNIX
+timestamp perhaps? Let's sample the data:
+
+ sqlite> select date from crime_scene_report limit 5;
+
+<table>
+ <tr><th>date</th></tr>
+ <tr><td>20180115</td></tr>
+ <tr><td>20180115</td></tr>
+ <tr><td>20180115</td></tr>
+ <tr><td>20180215</td></tr>
+ <tr><td>20180215</td></tr>
+</table>
+
+Okay, seems it's just being stored as YYYYMMDD. Let's take a crack at finding
+the crime scene report! We know the type (murder) and the city (SQL City). Let's
+be generous with the date and assume it was sometime in January of 2018:
+
+ sqlite> select * from crime_scene_report
+ ...> where type = 'murder'
+ ...> and city = 'SQL City'
+ ...> and date between 20180101 and 20180131;
+
+<table>
+ <tr>
+ <th>date</th>
+ <th>type</th>
+ <th>description</th>
+ <th>city</th>
+ </tr>
+ <tr>
+ <td>20180115</td>
+ <td>murder</td>
+ <td>
+ Security footage shows that there were 2 witnesses. The first witness
+ lives at the last house on &quot;Northwestern Dr&quot;. The second
+ witness, named Annabel, lives somewhere on &quot;Franklin Ave&quot;.
+ </td>
+ <td>SQL City</td>
+ </tr>
+</table>
+
+Great, there's only one row that matches our broad date criteria! Let's see if
+we can track down these witnesses. First, let's see how the data we need is
+structured:
+
+ sqlite> .schema person
+ CREATE TABLE person (
+ id integer PRIMARY KEY,
+ name text,
+ license_id integer,
+ address_number integer,
+ address_street_name text,
+ ssn integer,
+ FOREIGN KEY (license_id) REFERENCES drivers_license(id)
+ );
+ sqlite> .schema interview
+ CREATE TABLE interview (
+ person_id integer,
+ transcript text,
+ FOREIGN KEY (person_id) REFERENCES person(id)
+ );
+
+Okay, so we need to find the two rows in the person table, and then use their
+ids to cross reference their interview text. This is "the big idea" with
+relational databases, joining data in several tables based on something they
+have in common.
+
+We'll start with the witness who lives on Northwestern Drive. We know that they
+live in "the last house," which presumably has the highest house number on that
+street. We can easily find this by first filtering for only people who live on
+Northwestern Drive, then ordering those results by house number in descending
+order, and only showing the first result:
+
+ sqlite> select * from person
+ ...> where address_street_name = 'Northwestern Dr'
+ ...> order by address_number desc
+ ...> limit 1;
+
+<table>
+ <tr>
+ <th>id</th>
+ <th>name</th>
+ <th>license_id</th>
+ <th>address_number</th>
+ <th>address_street_name</th>
+ <th>ssn</th>
+ </tr>
+ <tr>
+ <td>14887</td>
+ <td>Morty Schapiro</td>
+ <td>118009</td>
+ <td>4919</td>
+ <td>Northwestern Dr</td>
+ <td>111564949</td>
+ </tr>
+</table>
+
+Great! Now let's find Annabel. We can use SQL's `LIKE` operator to match a
+partial name, along with the name of their street:
+
+ sqlite> select * from person
+ ...> where name like 'Annabel%'
+ ...> and address_street_name = 'Franklin Ave';
+
+<table>
+ <tr>
+ <th>id</th>
+ <th>name</th>
+ <th>license_id</th>
+ <th>address_number</th>
+ <th>address_street_name</th>
+ <th>ssn</th>
+ </tr>
+ <tr>
+ <td>16371</td>
+ <td>Annabel Miller</td>
+ <td>490173</td>
+ <td>103</td>
+ <td>Franklin Ave</td>
+ <td>318771143</td>
+ </tr>
+</table>
+
+Okay, so we've got our person IDs: `14887` and `16371`. I think we're going to
+want these IDs in a bunch of upcoming queries, so let's help our future selves
+out by saving their IDs as parameters (a sort of temporary variable):
+
+ sqlite> .parameter set $MORTY 14887
+ sqlite> .parameter set $ANNABEL 16371
+
+Let's grab their interviews. To do this, we'll put joins to use for the first
+time so we can show their name rather than just their person ID. We're selecting
+records from the `interview` table, but _joining_ matching records from the
+`person` table, using the `person_id` column to match up the people.
+
+ sqlite> select person.name, interview.transcript
+ ...> from interview
+ ...> join person on person.id = interview.person_id
+ ...> where person_id in ($MORTY, $ANNABEL);
+
+<table>
+ <tr>
+ <th>name</th>
+ <th>transcript</th>
+ </tr>
+ <tr>
+ <td>Morty Schapiro</td>
+ <td>
+ I heard a gunshot and then saw a man run out. He had a &quot;Get Fit Now
+ Gym&quot; bag. The membership number on the bag started with
+ &quot;48Z&quot;. Only gold members have those bags. The man got into a
+ car with a plate that included &quot;H42W&quot;.
+ </td>
+ </tr>
+ <tr>
+ <td>Annabel Miller</td>
+ <td>
+ I saw the murder happen, and I recognized the killer from my gym when I
+ was working out last week on January the 9th.
+ </td>
+ </tr>
+</table>
+
+Okay, we've got tons of info now! Since the car and bag might not belong to the
+killer, I think our best lead for narrowing things down is to see all the people
+who crossed paths with Annabel at the gym on January 9th, 2018. Let's see what
+those tables look like:
+
+ sqlite> .schema get_fit_now_check_in
+ CREATE TABLE get_fit_now_check_in (
+ membership_id text,
+ check_in_date integer,
+ check_in_time integer,
+ check_out_time integer,
+ FOREIGN KEY (membership_id) REFERENCES get_fit_now_member(id)
+ );
+ sqlite> .schema get_fit_now_member
+ CREATE TABLE get_fit_now_member (
+ id text PRIMARY KEY,
+ person_id integer,
+ name text,
+ membership_start_date integer,
+ membership_status text,
+ FOREIGN KEY (person_id) REFERENCES person(id)
+ );
+
+Alright, time to look for some check-ins! We could do this in two separate
+queries, one to find Annabel's Get Fit Now member ID by using her `person_id`,
+and a second query to find her check-ins using her `membership_id`, but we can
+also use a sub-query to do this in one shot:
+
+ sqlite> select check_in_time, check_out_time
+ ...> from get_fit_now_check_in
+ ...> where date = 20180109
+ ...> and membership_id = (
+ ...> select id
+ ...> from get_fit_now_member
+ ...> where person_id = $ANNABEL);
+
+<table>
+ <tr>
+ <th>check_in_time</th>
+ <th>check_out_time</th>
+ </tr>
+ <tr>
+ <td>1600</td>
+ <td>1700</td>
+ </tr>
+</table>
+
+Looks like Annabel was at the gym from 4pm to 5pm on the 9th. Since we're
+looking for someone who overlapped with Annabel at the gym, we're looking for
+someone who arrived before 5pm and left after 4pm. Again, we'll join some tables
+together here so we can grab their names and person IDs right away, not just
+their membership numbers:
+
+ sqlite> select person.id, person.name, get_fit_now_member.id,
+ ...> get_fit_now_check_in.check_in_time,
+ ...> get_fit_now_check_in.check_out_time
+ ...> from get_fit_now_check_in
+ ...> join get_fit_now_member on get_fit_now_member.id = membership_id
+ ...> join person on person.id = person_id
+ ...> where check_in_date = 20180109
+ ...> and check_in_time <= 1700 and check_out_time >= 1600;
+
+<table>
+ <tr>
+ <th>id</th>
+ <th>name</th>
+ <th>id</th>
+ <th>check_in_time</th>
+ <th>check_out_time</th>
+ </tr>
+ <tr>
+ <td>28819</td>
+ <td>Joe Germuska</td>
+ <td>48Z7A</td>
+ <td>1600</td>
+ <td>1730</td>
+ </tr>
+ <tr>
+ <td>67318</td>
+ <td>Jeremy Bowers</td>
+ <td>48Z55</td>
+ <td>1530</td>
+ <td>1700</td>
+ </tr>
+ <tr>
+ <td>16371</td>
+ <td>Annabel Miller</td>
+ <td>90081</td>
+ <td>1600</td>
+ <td>1700</td>
+ </tr>
+</table>
+
+Interesting, there were only two other gym members who were checked in for a
+period overlapping with Annabel on the 9th. Let's save their IDs as well:
+
+ sqlite> .parameter set $JOE 28819
+ sqlite> .parameter set $JEREMY 67318
+
+Their member numbers both start with 48Z; let's take a look at their vehicles,
+presumably in the `drivers_license` table:
+
+ sqlite> .schema drivers_license
+ CREATE TABLE drivers_license (
+ id integer PRIMARY KEY,
+ age integer,
+ height integer,
+ eye_color text,
+ hair_color text,
+ gender text,
+ plate_number text,
+ car_make text,
+ car_model text
+ );
+
+ sqlite> select person.id, person.name, drivers_license.*
+ ...> from person
+ ...> join drivers_license on drivers_license.id = person.license_id
+ ...> where person.id in ($JOE, $JEREMY);
+
+<table>
+ <tr>
+ <th>id</th>
+ <th>name</th>
+ <th>id</th>
+ <th>age</th>
+ <th>height</th>
+ <th>eye_color</th>
+ <th>hair_color</th>
+ <th>gender</th>
+ <th>plate_number</th>
+ <th>car_make</th>
+ <th>car_model</th>
+ </tr>
+ <tr>
+ <td>67318</td>
+ <td>Jeremy Bowers</td>
+ <td>423327</td>
+ <td>30</td>
+ <td>70</td>
+ <td>brown</td>
+ <td>brown</td>
+ <td>male</td>
+ <td>0H42W2</td>
+ <td>Chevrolet</td>
+ <td>Spark LS</td>
+ </tr>
+</table>
+
+So only Jeremy Bowers has a drivers license. And his car's license plate does
+contain H42W, so it looks like we've found the killer! According to the
+instructions in the GitHub repository, we should insert our answer into the
+`solution` table, then query it:
+
+ sqlite> insert into solution values (1, 'Jeremy Bowers');
+ sqlite> select value from solution;
+
+<table>
+ <tr>
+ <th>value</th>
+ </tr>
+ <tr>
+ <td>
+ Congrats, you found the murderer! But wait, there&#39;s more... If you
+ think you&#39;re up for a challenge, try querying the interview transcript
+ of the murderer to find the real villian behind this crime. If you feel
+ especially confident in your SQL skills, try to complete this final step
+ with no more than 2 queries.
+ </td>
+ </tr>
+</table>
+
+Aha! We did correctly identify Jeremy Bowers. Let's see if we can connect the
+dots to find the mastermind! First, we'll grab the killer's (Jeremy's) interview
+transcript:
+
+ sqlite> select transcript from interview where person_id = $JEREMY;
+
+<table>
+ <tr>
+ <th>transcript</th>
+ </tr>
+ <tr>
+ <td>
+ I was hired by a woman with a lot of money. I don&#39;t know her name but
+ I know she&#39;s around 5&#39;5&quot; (65&quot;) or 5&#39;7&quot;
+ (67&quot;). She has red hair and she drives a Tesla Model S. I know that
+ she attended the SQL Symphony Concert 3 times in December 2017.
+ </td>
+ </tr>
+</table>
+
+Alright, there goes one query... one more to make it count! We're going to be
+correlating data from a bunch of tables here: person, income (related by SSN,
+probably as a sort criterion since we don't have an exact figure to work with),
+we can grab height, hair color, gender, and car make/model from the drivers
+licenses. It's a bit of a risk to filter by Facebook checkins to the SQL
+Symphony, since we don't know that she checked in at all, but maybe we can
+include the count of the number of times there was a check-in at the symphony
+during December. Let's get a reminder of what these tables look like:
+
+ sqlite> .schema person
+ CREATE TABLE person (
+ id integer PRIMARY KEY,
+ name text,
+ license_id integer,
+ address_number integer,
+ address_street_name text,
+ ssn integer,
+ FOREIGN KEY (license_id) REFERENCES drivers_license(id)
+ );
+ sqlite> .schema income
+ CREATE TABLE income (
+ ssn integer PRIMARY KEY,
+ annual_income integer
+ );
+ sqlite> .schema facebook_event_checkin
+ CREATE TABLE facebook_event_checkin (
+ person_id integer,
+ event_id integer,
+ event_name text,
+ date integer,
+ FOREIGN KEY (person_id) REFERENCES person(id)
+ );
+ sqlite> .schema drivers_license
+ CREATE TABLE drivers_license (
+ id integer PRIMARY KEY,
+ age integer,
+ height integer,
+ eye_color text,
+ hair_color text,
+ gender text,
+ plate_number text,
+ car_make text,
+ car_model text
+ );
+
+And assemble our final mega-query!
+
+ sqlite> select p.id, p.name, i.annual_income, dl.height, dl.hair_color,
+ ...> dl.gender, dl.car_make, dl.car_model, (
+ ...> select count(*)
+ ...> from facebook_event_checkin
+ ...> where person_id = p.id
+ ...> and event_name like '%symphony%'
+ ...> and date between 20171201 and 20171231) as num_symphonies
+ ...> from person p
+ ...> join income i on i.ssn = p.ssn
+ ...> join drivers_license dl on dl.id = p.license_id
+ ...> where dl.height between 64 and 68
+ ...> and dl.hair_color like '%red%'
+ ...> and car_make like '%tesla%'
+ ...> and car_model like '%s%'
+ ...> order by i.annual_income desc;
+
+<table>
+ <tr>
+ <th>id</th>
+ <th>name</th>
+ <th>annual_income</th>
+ <th>height</th>
+ <th>hair_color</th>
+ <th>gender</th>
+ <th>car_make</th>
+ <th>car_model</th>
+ <th>num_symphonies</th>
+ </tr>
+ <tr>
+ <td>99716</td>
+ <td>Miranda Priestly</td>
+ <td>310000</td>
+ <td>66</td>
+ <td>red</td>
+ <td>female</td>
+ <td>Tesla</td>
+ <td>Model S</td>
+ <td>3</td>
+ </tr>
+ <tr>
+ <td>78881</td>
+ <td>Red Korb</td>
+ <td>278000</td>
+ <td>65</td>
+ <td>red</td>
+ <td>female</td>
+ <td>Tesla</td>
+ <td>Model S</td>
+ <td>0</td>
+ </tr>
+</table>
+
+Okay, so we actually got two results for red-haired people around 66" tall who
+make a lot of money and drive Tesla Model S's. However, one of them attended the
+symphony three times in December (and makes even more money), so I think we've
+found the mastermind!
+
+I didn't include gender in the filter as I wasn't sure how the data looked, and
+I technically would've needed an additional query to discover that.
+
+ sqlite> insert into solution values (1, 'Miranda Priestly');
+ sqlite> select value from solution;
+
+<table>
+ <tr>
+ <th>value</th>
+ </tr>
+ <tr>
+ <td>
+ Congrats, you found the brains behind the murder! Everyone in SQL City
+ hails you as the greatest SQL detective of all time. Time to break out the
+ champagne!
+ </td>
+ </tr>
+</table>
+
+Hooray! I had a lot of fun playing through this, and would love to do another
+similar puzzle again sometime.
+
+[db]: /assets/sql-murder-mystery.db
+[mit]: https://opensource.org/licenses/MIT