1 files changed, 529 insertions, 0 deletions
diff --git a/_posts/2019-12-20-sql-murder-mystery.md b/_posts/2019-12-20-sql-murder-mystery.md
new file mode 100644
index 0000000..82aa2ce
--- /dev/null
+++ b/_posts/2019-12-20-sql-murder-mystery.md
@@ -0,0 +1,529 @@
+---
+title: Solving the SQL Murder Mystery
+---
+
+I saw this [SQL Murder Mystery](https://github.com/NUKnightLab/sql-mysteries)
+appear on Hacker News recently, thought it sounded fun, and figured I'd do a
+quick write-up of how I worked through it.
+
+If you want to follow along, go ahead and [download the SQLite database][db]
+(which is copyright NUKnightLab and redistributed here under the [MIT
+license][mit]). You'll need some kind of SQLite client to interact with it (I
+just used the `sqlite3` CLI tool).
+
+In addition to the database, it's very helpful to start with a prompt:
+
+> A crime has taken place and the detective needs your help. The detective gave
+> you the crime scene report, but you somehow lost it. You vaguely remember that
+> the crime was a murder that occurred sometime on Jan. 15, 2018 and that it
+> took place in SQL City. Start by retrieving the corresponding crime scene
+> report from the police department's database. If you want to get the most out
+> of this mystery, try to work through it only using your SQL environment and
+> refrain from using a notepad.
+
+<!--more-->
+
+Let's start by seeing what tables are available. The `sqlite3` CLI uses
+meta-commands that start with a dot, like this:
+
+    sqlite> .tables
+    crime_scene_report      get_fit_now_check_in    interview
+    drivers_license         get_fit_now_member      person
+    facebook_event_checkin  income                  solution
+
+Okay, let's start with finding our crime scene report. First, we'll need to know
+what the data looks like. We can learn about this with the `.schema` command:
+
+    sqlite> .schema crime_scene_report
+    CREATE TABLE crime_scene_report (
+            date integer,
+            type text,
+            description text,
+            city text
+        );
+
+Okay, seems pretty straightforward. The only thing I'm not quite sure about is
+how the date is being represented -- it's just stored as an integer. A UNIX
+timestamp perhaps? Let's sample the data:
+
+    sqlite> select date from crime_scene_report limit 5;
+
+<table>
+  <tr><th>date</th></tr>
+  <tr><td>20180115</td></tr>
+  <tr><td>20180115</td></tr>
+  <tr><td>20180115</td></tr>
+  <tr><td>20180215</td></tr>
+  <tr><td>20180215</td></tr>
+</table>
+
+Okay, seems it's just being stored as YYYYMMDD. Let's take a crack at finding
+the crime scene report! We know the type (murder) and the city (SQL City). Let's
+be generous with the date and assume it was sometime in January of 2018:
+
+    sqlite> select * from crime_scene_report
+       ...> where type = 'murder'
+       ...> and city = 'SQL City'
+       ...> and date between 20180101 and 20180131;
+
+<table>
+  <tr>
+    <th>date</th>
+    <th>type</th>
+    <th>description</th>
+    <th>city</th>
+  </tr>
+  <tr>
+    <td>20180115</td>
+    <td>murder</td>
+    <td>
+      Security footage shows that there were 2 witnesses. The first witness
+      lives at the last house on &quot;Northwestern Dr&quot;. The second
+      witness, named Annabel, lives somewhere on &quot;Franklin Ave&quot;.
+    </td>
+    <td>SQL City</td>
+  </tr>
+</table>
+
+Great, there's only one row that matches our broad date criteria! Let's see if
+we can track down these witnesses. First, let's see how the data we need is
+structured:
+
+    sqlite> .schema person
+    CREATE TABLE person (
+            id integer PRIMARY KEY,
+            name text,
+            license_id integer,
+            address_number integer,
+            address_street_name text,
+            ssn integer,
+            FOREIGN KEY (license_id) REFERENCES drivers_license(id)
+        );
+    sqlite> .schema interview
+    CREATE TABLE interview (
+            person_id integer,
+            transcript text,
+            FOREIGN KEY (person_id) REFERENCES person(id)
+        );
+
+Okay, so we need to find the two rows in the person table, and then use their
+ids to cross reference their interview text. This is "the big idea" with
+relational databases, joining data in several tables based on something they
+have in common.
+
+We'll start with the witness who lives on Northwestern Drive.  We know that they
+live in "the last house," which presumably has the highest house number on that
+street. We can easily find this by first filtering for only people who live on
+Northwestern Drive, then ordering those results by house number in descending
+order, and only showing the first result:
+
+    sqlite> select * from person
+       ...> where address_street_name = 'Northwestern Dr'
+       ...> order by address_number desc
+       ...> limit 1;
+
+<table>
+  <tr>
+    <th>id</th>
+    <th>name</th>
+    <th>license_id</th>
+    <th>address_number</th>
+    <th>address_street_name</th>
+    <th>ssn</th>
+  </tr>
+  <tr>
+    <td>14887</td>
+    <td>Morty Schapiro</td>
+    <td>118009</td>
+    <td>4919</td>
+    <td>Northwestern Dr</td>
+    <td>111564949</td>
+  </tr>
+</table>
+
+Great! Now let's find Annabel. We can use SQL's `LIKE` operator to match a
+partial name, along with the name of their street:
+
+    sqlite> select * from person
+       ...> where name like 'Annabel%'
+       ...> and address_street_name = 'Franklin Ave';
+
+<table>
+  <tr>
+    <th>id</th>
+    <th>name</th>
+    <th>license_id</th>
+    <th>address_number</th>
+    <th>address_street_name</th>
+    <th>ssn</th>
+  </tr>
+  <tr>
+    <td>16371</td>
+    <td>Annabel Miller</td>
+    <td>490173</td>
+    <td>103</td>
+    <td>Franklin Ave</td>
+    <td>318771143</td>
+  </tr>
+</table>
+
+Okay, so we've got our person IDs: `14887` and `16371`. I think we're going to
+want these IDs in a bunch of upcoming queries, so let's help our future selves
+out by saving their IDs as parameters (a sort of temporary variable):
+
+    sqlite> .parameter set $MORTY 14887
+    sqlite> .parameter set $ANNABEL 16371
+
+Let's grab their interviews. To do this, we'll put joins to use for the first
+time so we can show their name rather than just their person ID. We're selecting
+records from the `interview` table, but _joining_ matching records from the
+`person` table, using the `person_id` column to match up the people.
+
+    sqlite> select person.name, interview.transcript
+       ...> from interview
+       ...> join person on person.id = interview.person_id
+       ...> where person_id in ($MORTY, $ANNABEL);
+
+<table>
+  <tr>
+    <th>name</th>
+    <th>transcript</th>
+  </tr>
+  <tr>
+    <td>Morty Schapiro</td>
+    <td>
+      I heard a gunshot and then saw a man run out. He had a &quot;Get Fit Now
+      Gym&quot; bag. The membership number on the bag started with
+      &quot;48Z&quot;.  Only gold members have those bags. The man got into a
+      car with a plate that included &quot;H42W&quot;.
+    </td>
+  </tr>
+  <tr>
+    <td>Annabel Miller</td>
+    <td>
+      I saw the murder happen, and I recognized the killer from my gym when I
+      was working out last week on January the 9th.
+    </td>
+  </tr>
+</table>
+
+Okay, we've got tons of info now! Since the car and bag might not belong to the
+killer, I think our best lead for narrowing things down is to see all the people
+who crossed paths with Annabel at the gym on January 9th, 2018. Let's see what
+those tables look like:
+
+    sqlite> .schema get_fit_now_check_in
+    CREATE TABLE get_fit_now_check_in (
+            membership_id text,
+            check_in_date integer,
+            check_in_time integer,
+            check_out_time integer,
+            FOREIGN KEY (membership_id) REFERENCES get_fit_now_member(id)
+        );
+    sqlite> .schema get_fit_now_member
+    CREATE TABLE get_fit_now_member (
+            id text PRIMARY KEY,
+            person_id integer,
+            name text,
+            membership_start_date integer,
+            membership_status text,
+            FOREIGN KEY (person_id) REFERENCES person(id)
+        );
+
+Alright, time to look for some check-ins! We could do this in two separate
+queries, one to find Annabel's Get Fit Now member ID by using her `person_id`,
+and a second query to find her check-ins using her `membership_id`, but we can
+also use a sub-query to do this in one shot:
+
+    sqlite> select check_in_time, check_out_time
+       ...> from get_fit_now_check_in
+       ...> where date = 20180109
+       ...> and membership_id = (
+       ...>   select id
+       ...>   from get_fit_now_member
+       ...>   where person_id = $ANNABEL);
+
+<table>
+  <tr>
+    <th>check_in_time</th>
+    <th>check_out_time</th>
+  </tr>
+  <tr>
+    <td>1600</td>
+    <td>1700</td>
+  </tr>
+</table>
+
+Looks like Annabel was at the gym from 4pm to 5pm on the 9th. Since we're
+looking for someone who overlapped with Annabel at the gym, we're looking for
+someone who arrived before 5pm and left after 4pm. Again, we'll join some tables
+together here so we can grab their names and person IDs right away, not just
+their membership numbers:
+
+    sqlite> select person.id, person.name, get_fit_now_member.id,
+       ...>   get_fit_now_check_in.check_in_time,
+       ...>   get_fit_now_check_in.check_out_time
+       ...> from get_fit_now_check_in
+       ...> join get_fit_now_member on get_fit_now_member.id = membership_id
+       ...> join person on person.id = person_id
+       ...> where check_in_date = 20180109
+       ...> and check_in_time <= 1700 and check_out_time >= 1600;
+
+<table>
+  <tr>
+    <th>id</th>
+    <th>name</th>
+    <th>id</th>
+    <th>check_in_time</th>
+    <th>check_out_time</th>
+  </tr>
+  <tr>
+    <td>28819</td>
+    <td>Joe Germuska</td>
+    <td>48Z7A</td>
+    <td>1600</td>
+    <td>1730</td>
+  </tr>
+  <tr>
+    <td>67318</td>
+    <td>Jeremy Bowers</td>
+    <td>48Z55</td>
+    <td>1530</td>
+    <td>1700</td>
+  </tr>
+  <tr>
+    <td>16371</td>
+    <td>Annabel Miller</td>
+    <td>90081</td>
+    <td>1600</td>
+    <td>1700</td>
+  </tr>
+</table>
+
+Interesting, there were only two other gym members who were checked in for a
+period overlapping with Annabel on the 9th. Let's save their IDs as well:
+
+    sqlite> .parameter set $JOE 28819
+    sqlite> .parameter set $JEREMY 67318
+
+Their member numbers both start with 48Z; let's take a look at their vehicles,
+presumably in the `drivers_license` table:
+
+    sqlite> .schema drivers_license
+    CREATE TABLE drivers_license (
+            id integer PRIMARY KEY,
+            age integer,
+            height integer,
+            eye_color text,
+            hair_color text,
+            gender text,
+            plate_number text,
+            car_make text,
+            car_model text
+        );
+
+    sqlite> select person.id, person.name, drivers_license.*
+       ...> from person
+       ...> join drivers_license on drivers_license.id = person.license_id
+       ...> where person.id in ($JOE, $JEREMY);
+
+<table>
+  <tr>
+    <th>id</th>
+    <th>name</th>
+    <th>id</th>
+    <th>age</th>
+    <th>height</th>
+    <th>eye_color</th>
+    <th>hair_color</th>
+    <th>gender</th>
+    <th>plate_number</th>
+    <th>car_make</th>
+    <th>car_model</th>
+  </tr>
+  <tr>
+    <td>67318</td>
+    <td>Jeremy Bowers</td>
+    <td>423327</td>
+    <td>30</td>
+    <td>70</td>
+    <td>brown</td>
+    <td>brown</td>
+    <td>male</td>
+    <td>0H42W2</td>
+    <td>Chevrolet</td>
+    <td>Spark LS</td>
+  </tr>
+</table>
+
+So only Jeremy Bowers has a drivers license. And his car's license plate does
+contain H42W, so it looks like we've found the killer! According to the
+instructions in the GitHub repository, we should insert our answer into the
+`solution` table, then query it:
+
+    sqlite> insert into solution values (1, 'Jeremy Bowers');
+    sqlite> select value from solution;
+
+<table>
+  <tr>
+    <th>value</th>
+  </tr>
+  <tr>
+    <td>
+      Congrats, you found the murderer! But wait, there&#39;s more... If you
+      think you&#39;re up for a challenge, try querying the interview transcript
+      of the murderer to find the real villian behind this crime. If you feel
+      especially confident in your SQL skills, try to complete this final step
+      with no more than 2 queries.
+    </td>
+  </tr>
+</table>
+
+Aha! We did correctly identify Jeremy Bowers. Let's see if we can connect the
+dots to find the mastermind! First, we'll grab the killer's (Jeremy's) interview
+transcript:
+
+    sqlite> select transcript from interview where person_id = $JEREMY;
+
+<table>
+  <tr>
+    <th>transcript</th>
+  </tr>
+  <tr>
+    <td>
+      I was hired by a woman with a lot of money. I don&#39;t know her name but
+      I know she&#39;s around 5&#39;5&quot; (65&quot;) or 5&#39;7&quot;
+      (67&quot;). She has red hair and she drives a Tesla Model S. I know that
+      she attended the SQL Symphony Concert 3 times in December 2017.
+    </td>
+  </tr>
+</table>
+
+Alright, there goes one query... one more to make it count! We're going to be
+correlating data from a bunch of tables here: person, income (related by SSN,
+probably as a sort criterion since we don't have an exact figure to work with),
+we can grab height, hair color, gender, and car make/model from the drivers
+licenses. It's a bit of a risk to filter by Facebook checkins to the SQL
+Symphony, since we don't know that she checked in at all, but maybe we can
+include the count of the number of times there was a check-in at the symphony
+during December. Let's get a reminder of what these tables look like:
+
+    sqlite> .schema person
+    CREATE TABLE person (
+            id integer PRIMARY KEY,
+            name text,
+            license_id integer,
+            address_number integer,
+            address_street_name text,
+            ssn integer,
+            FOREIGN KEY (license_id) REFERENCES drivers_license(id)
+        );
+    sqlite> .schema income
+    CREATE TABLE income (
+            ssn integer PRIMARY KEY,
+            annual_income integer
+        );
+    sqlite> .schema facebook_event_checkin
+    CREATE TABLE facebook_event_checkin (
+            person_id integer,
+            event_id integer,
+            event_name text,
+            date integer,
+            FOREIGN KEY (person_id) REFERENCES person(id)
+        );
+    sqlite> .schema drivers_license
+    CREATE TABLE drivers_license (
+            id integer PRIMARY KEY,
+            age integer,
+            height integer,
+            eye_color text,
+            hair_color text,
+            gender text,
+            plate_number text,
+            car_make text,
+            car_model text
+        );
+
+And assemble our final mega-query!
+
+    sqlite> select p.id, p.name, i.annual_income, dl.height, dl.hair_color,
+       ...> dl.gender, dl.car_make, dl.car_model, (
+       ...>   select count(*)
+       ...>   from facebook_event_checkin
+       ...>   where person_id = p.id
+       ...>   and event_name like '%symphony%'
+       ...>   and date between 20171201 and 20171231) as num_symphonies
+       ...> from person p
+       ...> join income i on i.ssn = p.ssn
+       ...> join drivers_license dl on dl.id = p.license_id
+       ...> where dl.height between 64 and 68
+       ...> and dl.hair_color like '%red%'
+       ...> and car_make like '%tesla%'
+       ...> and car_model like '%s%'
+       ...> order by i.annual_income desc;
+
+<table>
+  <tr>
+    <th>id</th>
+    <th>name</th>
+    <th>annual_income</th>
+    <th>height</th>
+    <th>hair_color</th>
+    <th>gender</th>
+    <th>car_make</th>
+    <th>car_model</th>
+    <th>num_symphonies</th>
+  </tr>
+  <tr>
+    <td>99716</td>
+    <td>Miranda Priestly</td>
+    <td>310000</td>
+    <td>66</td>
+    <td>red</td>
+    <td>female</td>
+    <td>Tesla</td>
+    <td>Model S</td>
+    <td>3</td>
+  </tr>
+  <tr>
+    <td>78881</td>
+    <td>Red Korb</td>
+    <td>278000</td>
+    <td>65</td>
+    <td>red</td>
+    <td>female</td>
+    <td>Tesla</td>
+    <td>Model S</td>
+    <td>0</td>
+  </tr>
+</table>
+
+Okay, so we actually got two results for red-haired people around 66" tall who
+make a lot of money and drive Tesla Model S's. However, one of them attended the
+symphony three times in December (and makes even more money), so I think we've
+found the mastermind!
+
+I didn't include gender in the filter as I wasn't sure how the data looked, and
+I technically would've needed an additional query to discover that.
+
+    sqlite> insert into solution values (1, 'Miranda Priestly');
+    sqlite> select value from solution;
+
+<table>
+  <tr>
+    <th>value</th>
+  </tr>
+  <tr>
+    <td>
+      Congrats, you found the brains behind the murder! Everyone in SQL City
+      hails you as the greatest SQL detective of all time. Time to break out the
+      champagne!
+    </td>
+  </tr>
+</table>
+
+Hooray! I had a lot of fun playing through this, and would love to do another
+similar puzzle again sometime.
+
+[db]: /assets/sql-murder-mystery.db
+[mit]: https://opensource.org/licenses/MIT