diff options
Diffstat (limited to '_posts')
-rw-r--r-- | _posts/2019-12-20-sql-murder-mystery.md | 529 |
1 files changed, 529 insertions, 0 deletions
diff --git a/_posts/2019-12-20-sql-murder-mystery.md b/_posts/2019-12-20-sql-murder-mystery.md new file mode 100644 index 0000000..82aa2ce --- /dev/null +++ b/_posts/2019-12-20-sql-murder-mystery.md @@ -0,0 +1,529 @@ +--- +title: Solving the SQL Murder Mystery +--- + +I saw this [SQL Murder Mystery](https://github.com/NUKnightLab/sql-mysteries) +appear on Hacker News recently, thought it sounded fun, and figured I'd do a +quick write-up of how I worked through it. + +If you want to follow along, go ahead and [download the SQLite database][db] +(which is copyright NUKnightLab and redistributed here under the [MIT +license][mit]). You'll need some kind of SQLite client to interact with it (I +just used the `sqlite3` CLI tool). + +In addition to the database, it's very helpful to start with a prompt: + +> A crime has taken place and the detective needs your help. The detective gave +> you the crime scene report, but you somehow lost it. You vaguely remember that +> the crime was a murder that occurred sometime on Jan. 15, 2018 and that it +> took place in SQL City. Start by retrieving the corresponding crime scene +> report from the police department's database. If you want to get the most out +> of this mystery, try to work through it only using your SQL environment and +> refrain from using a notepad. + +<!--more--> + +Let's start by seeing what tables are available. The `sqlite3` CLI uses +meta-commands that start with a dot, like this: + + sqlite> .tables + crime_scene_report get_fit_now_check_in interview + drivers_license get_fit_now_member person + facebook_event_checkin income solution + +Okay, let's start with finding our crime scene report. First, we'll need to know +what the data looks like. We can learn about this with the `.schema` command: + + sqlite> .schema crime_scene_report + CREATE TABLE crime_scene_report ( + date integer, + type text, + description text, + city text + ); + +Okay, seems pretty straightforward. The only thing I'm not quite sure about is +how the date is being represented -- it's just stored as an integer. A UNIX +timestamp perhaps? Let's sample the data: + + sqlite> select date from crime_scene_report limit 5; + +<table> + <tr><th>date</th></tr> + <tr><td>20180115</td></tr> + <tr><td>20180115</td></tr> + <tr><td>20180115</td></tr> + <tr><td>20180215</td></tr> + <tr><td>20180215</td></tr> +</table> + +Okay, seems it's just being stored as YYYYMMDD. Let's take a crack at finding +the crime scene report! We know the type (murder) and the city (SQL City). Let's +be generous with the date and assume it was sometime in January of 2018: + + sqlite> select * from crime_scene_report + ...> where type = 'murder' + ...> and city = 'SQL City' + ...> and date between 20180101 and 20180131; + +<table> + <tr> + <th>date</th> + <th>type</th> + <th>description</th> + <th>city</th> + </tr> + <tr> + <td>20180115</td> + <td>murder</td> + <td> + Security footage shows that there were 2 witnesses. The first witness + lives at the last house on "Northwestern Dr". The second + witness, named Annabel, lives somewhere on "Franklin Ave". + </td> + <td>SQL City</td> + </tr> +</table> + +Great, there's only one row that matches our broad date criteria! Let's see if +we can track down these witnesses. First, let's see how the data we need is +structured: + + sqlite> .schema person + CREATE TABLE person ( + id integer PRIMARY KEY, + name text, + license_id integer, + address_number integer, + address_street_name text, + ssn integer, + FOREIGN KEY (license_id) REFERENCES drivers_license(id) + ); + sqlite> .schema interview + CREATE TABLE interview ( + person_id integer, + transcript text, + FOREIGN KEY (person_id) REFERENCES person(id) + ); + +Okay, so we need to find the two rows in the person table, and then use their +ids to cross reference their interview text. This is "the big idea" with +relational databases, joining data in several tables based on something they +have in common. + +We'll start with the witness who lives on Northwestern Drive. We know that they +live in "the last house," which presumably has the highest house number on that +street. We can easily find this by first filtering for only people who live on +Northwestern Drive, then ordering those results by house number in descending +order, and only showing the first result: + + sqlite> select * from person + ...> where address_street_name = 'Northwestern Dr' + ...> order by address_number desc + ...> limit 1; + +<table> + <tr> + <th>id</th> + <th>name</th> + <th>license_id</th> + <th>address_number</th> + <th>address_street_name</th> + <th>ssn</th> + </tr> + <tr> + <td>14887</td> + <td>Morty Schapiro</td> + <td>118009</td> + <td>4919</td> + <td>Northwestern Dr</td> + <td>111564949</td> + </tr> +</table> + +Great! Now let's find Annabel. We can use SQL's `LIKE` operator to match a +partial name, along with the name of their street: + + sqlite> select * from person + ...> where name like 'Annabel%' + ...> and address_street_name = 'Franklin Ave'; + +<table> + <tr> + <th>id</th> + <th>name</th> + <th>license_id</th> + <th>address_number</th> + <th>address_street_name</th> + <th>ssn</th> + </tr> + <tr> + <td>16371</td> + <td>Annabel Miller</td> + <td>490173</td> + <td>103</td> + <td>Franklin Ave</td> + <td>318771143</td> + </tr> +</table> + +Okay, so we've got our person IDs: `14887` and `16371`. I think we're going to +want these IDs in a bunch of upcoming queries, so let's help our future selves +out by saving their IDs as parameters (a sort of temporary variable): + + sqlite> .parameter set $MORTY 14887 + sqlite> .parameter set $ANNABEL 16371 + +Let's grab their interviews. To do this, we'll put joins to use for the first +time so we can show their name rather than just their person ID. We're selecting +records from the `interview` table, but _joining_ matching records from the +`person` table, using the `person_id` column to match up the people. + + sqlite> select person.name, interview.transcript + ...> from interview + ...> join person on person.id = interview.person_id + ...> where person_id in ($MORTY, $ANNABEL); + +<table> + <tr> + <th>name</th> + <th>transcript</th> + </tr> + <tr> + <td>Morty Schapiro</td> + <td> + I heard a gunshot and then saw a man run out. He had a "Get Fit Now + Gym" bag. The membership number on the bag started with + "48Z". Only gold members have those bags. The man got into a + car with a plate that included "H42W". + </td> + </tr> + <tr> + <td>Annabel Miller</td> + <td> + I saw the murder happen, and I recognized the killer from my gym when I + was working out last week on January the 9th. + </td> + </tr> +</table> + +Okay, we've got tons of info now! Since the car and bag might not belong to the +killer, I think our best lead for narrowing things down is to see all the people +who crossed paths with Annabel at the gym on January 9th, 2018. Let's see what +those tables look like: + + sqlite> .schema get_fit_now_check_in + CREATE TABLE get_fit_now_check_in ( + membership_id text, + check_in_date integer, + check_in_time integer, + check_out_time integer, + FOREIGN KEY (membership_id) REFERENCES get_fit_now_member(id) + ); + sqlite> .schema get_fit_now_member + CREATE TABLE get_fit_now_member ( + id text PRIMARY KEY, + person_id integer, + name text, + membership_start_date integer, + membership_status text, + FOREIGN KEY (person_id) REFERENCES person(id) + ); + +Alright, time to look for some check-ins! We could do this in two separate +queries, one to find Annabel's Get Fit Now member ID by using her `person_id`, +and a second query to find her check-ins using her `membership_id`, but we can +also use a sub-query to do this in one shot: + + sqlite> select check_in_time, check_out_time + ...> from get_fit_now_check_in + ...> where date = 20180109 + ...> and membership_id = ( + ...> select id + ...> from get_fit_now_member + ...> where person_id = $ANNABEL); + +<table> + <tr> + <th>check_in_time</th> + <th>check_out_time</th> + </tr> + <tr> + <td>1600</td> + <td>1700</td> + </tr> +</table> + +Looks like Annabel was at the gym from 4pm to 5pm on the 9th. Since we're +looking for someone who overlapped with Annabel at the gym, we're looking for +someone who arrived before 5pm and left after 4pm. Again, we'll join some tables +together here so we can grab their names and person IDs right away, not just +their membership numbers: + + sqlite> select person.id, person.name, get_fit_now_member.id, + ...> get_fit_now_check_in.check_in_time, + ...> get_fit_now_check_in.check_out_time + ...> from get_fit_now_check_in + ...> join get_fit_now_member on get_fit_now_member.id = membership_id + ...> join person on person.id = person_id + ...> where check_in_date = 20180109 + ...> and check_in_time <= 1700 and check_out_time >= 1600; + +<table> + <tr> + <th>id</th> + <th>name</th> + <th>id</th> + <th>check_in_time</th> + <th>check_out_time</th> + </tr> + <tr> + <td>28819</td> + <td>Joe Germuska</td> + <td>48Z7A</td> + <td>1600</td> + <td>1730</td> + </tr> + <tr> + <td>67318</td> + <td>Jeremy Bowers</td> + <td>48Z55</td> + <td>1530</td> + <td>1700</td> + </tr> + <tr> + <td>16371</td> + <td>Annabel Miller</td> + <td>90081</td> + <td>1600</td> + <td>1700</td> + </tr> +</table> + +Interesting, there were only two other gym members who were checked in for a +period overlapping with Annabel on the 9th. Let's save their IDs as well: + + sqlite> .parameter set $JOE 28819 + sqlite> .parameter set $JEREMY 67318 + +Their member numbers both start with 48Z; let's take a look at their vehicles, +presumably in the `drivers_license` table: + + sqlite> .schema drivers_license + CREATE TABLE drivers_license ( + id integer PRIMARY KEY, + age integer, + height integer, + eye_color text, + hair_color text, + gender text, + plate_number text, + car_make text, + car_model text + ); + + sqlite> select person.id, person.name, drivers_license.* + ...> from person + ...> join drivers_license on drivers_license.id = person.license_id + ...> where person.id in ($JOE, $JEREMY); + +<table> + <tr> + <th>id</th> + <th>name</th> + <th>id</th> + <th>age</th> + <th>height</th> + <th>eye_color</th> + <th>hair_color</th> + <th>gender</th> + <th>plate_number</th> + <th>car_make</th> + <th>car_model</th> + </tr> + <tr> + <td>67318</td> + <td>Jeremy Bowers</td> + <td>423327</td> + <td>30</td> + <td>70</td> + <td>brown</td> + <td>brown</td> + <td>male</td> + <td>0H42W2</td> + <td>Chevrolet</td> + <td>Spark LS</td> + </tr> +</table> + +So only Jeremy Bowers has a drivers license. And his car's license plate does +contain H42W, so it looks like we've found the killer! According to the +instructions in the GitHub repository, we should insert our answer into the +`solution` table, then query it: + + sqlite> insert into solution values (1, 'Jeremy Bowers'); + sqlite> select value from solution; + +<table> + <tr> + <th>value</th> + </tr> + <tr> + <td> + Congrats, you found the murderer! But wait, there's more... If you + think you're up for a challenge, try querying the interview transcript + of the murderer to find the real villian behind this crime. If you feel + especially confident in your SQL skills, try to complete this final step + with no more than 2 queries. + </td> + </tr> +</table> + +Aha! We did correctly identify Jeremy Bowers. Let's see if we can connect the +dots to find the mastermind! First, we'll grab the killer's (Jeremy's) interview +transcript: + + sqlite> select transcript from interview where person_id = $JEREMY; + +<table> + <tr> + <th>transcript</th> + </tr> + <tr> + <td> + I was hired by a woman with a lot of money. I don't know her name but + I know she's around 5'5" (65") or 5'7" + (67"). She has red hair and she drives a Tesla Model S. I know that + she attended the SQL Symphony Concert 3 times in December 2017. + </td> + </tr> +</table> + +Alright, there goes one query... one more to make it count! We're going to be +correlating data from a bunch of tables here: person, income (related by SSN, +probably as a sort criterion since we don't have an exact figure to work with), +we can grab height, hair color, gender, and car make/model from the drivers +licenses. It's a bit of a risk to filter by Facebook checkins to the SQL +Symphony, since we don't know that she checked in at all, but maybe we can +include the count of the number of times there was a check-in at the symphony +during December. Let's get a reminder of what these tables look like: + + sqlite> .schema person + CREATE TABLE person ( + id integer PRIMARY KEY, + name text, + license_id integer, + address_number integer, + address_street_name text, + ssn integer, + FOREIGN KEY (license_id) REFERENCES drivers_license(id) + ); + sqlite> .schema income + CREATE TABLE income ( + ssn integer PRIMARY KEY, + annual_income integer + ); + sqlite> .schema facebook_event_checkin + CREATE TABLE facebook_event_checkin ( + person_id integer, + event_id integer, + event_name text, + date integer, + FOREIGN KEY (person_id) REFERENCES person(id) + ); + sqlite> .schema drivers_license + CREATE TABLE drivers_license ( + id integer PRIMARY KEY, + age integer, + height integer, + eye_color text, + hair_color text, + gender text, + plate_number text, + car_make text, + car_model text + ); + +And assemble our final mega-query! + + sqlite> select p.id, p.name, i.annual_income, dl.height, dl.hair_color, + ...> dl.gender, dl.car_make, dl.car_model, ( + ...> select count(*) + ...> from facebook_event_checkin + ...> where person_id = p.id + ...> and event_name like '%symphony%' + ...> and date between 20171201 and 20171231) as num_symphonies + ...> from person p + ...> join income i on i.ssn = p.ssn + ...> join drivers_license dl on dl.id = p.license_id + ...> where dl.height between 64 and 68 + ...> and dl.hair_color like '%red%' + ...> and car_make like '%tesla%' + ...> and car_model like '%s%' + ...> order by i.annual_income desc; + +<table> + <tr> + <th>id</th> + <th>name</th> + <th>annual_income</th> + <th>height</th> + <th>hair_color</th> + <th>gender</th> + <th>car_make</th> + <th>car_model</th> + <th>num_symphonies</th> + </tr> + <tr> + <td>99716</td> + <td>Miranda Priestly</td> + <td>310000</td> + <td>66</td> + <td>red</td> + <td>female</td> + <td>Tesla</td> + <td>Model S</td> + <td>3</td> + </tr> + <tr> + <td>78881</td> + <td>Red Korb</td> + <td>278000</td> + <td>65</td> + <td>red</td> + <td>female</td> + <td>Tesla</td> + <td>Model S</td> + <td>0</td> + </tr> +</table> + +Okay, so we actually got two results for red-haired people around 66" tall who +make a lot of money and drive Tesla Model S's. However, one of them attended the +symphony three times in December (and makes even more money), so I think we've +found the mastermind! + +I didn't include gender in the filter as I wasn't sure how the data looked, and +I technically would've needed an additional query to discover that. + + sqlite> insert into solution values (1, 'Miranda Priestly'); + sqlite> select value from solution; + +<table> + <tr> + <th>value</th> + </tr> + <tr> + <td> + Congrats, you found the brains behind the murder! Everyone in SQL City + hails you as the greatest SQL detective of all time. Time to break out the + champagne! + </td> + </tr> +</table> + +Hooray! I had a lot of fun playing through this, and would love to do another +similar puzzle again sometime. + +[db]: /assets/sql-murder-mystery.db +[mit]: https://opensource.org/licenses/MIT |