From 100f501849095e485329a8042366497ddd45af58 Mon Sep 17 00:00:00 2001 From: Ben Burwell Date: Fri, 20 Dec 2019 23:54:00 -0500 Subject: Add SQL murder mystery blog post --- _posts/2019-12-20-sql-murder-mystery.md | 529 ++++++++++++++++++++++++++++++++ 1 file changed, 529 insertions(+) create mode 100644 _posts/2019-12-20-sql-murder-mystery.md (limited to '_posts') diff --git a/_posts/2019-12-20-sql-murder-mystery.md b/_posts/2019-12-20-sql-murder-mystery.md new file mode 100644 index 0000000..82aa2ce --- /dev/null +++ b/_posts/2019-12-20-sql-murder-mystery.md @@ -0,0 +1,529 @@ +--- +title: Solving the SQL Murder Mystery +--- + +I saw this [SQL Murder Mystery](https://github.com/NUKnightLab/sql-mysteries) +appear on Hacker News recently, thought it sounded fun, and figured I'd do a +quick write-up of how I worked through it. + +If you want to follow along, go ahead and [download the SQLite database][db] +(which is copyright NUKnightLab and redistributed here under the [MIT +license][mit]). You'll need some kind of SQLite client to interact with it (I +just used the `sqlite3` CLI tool). + +In addition to the database, it's very helpful to start with a prompt: + +> A crime has taken place and the detective needs your help. The detective gave +> you the crime scene report, but you somehow lost it. You vaguely remember that +> the crime was a murder that occurred sometime on Jan. 15, 2018 and that it +> took place in SQL City. Start by retrieving the corresponding crime scene +> report from the police department's database. If you want to get the most out +> of this mystery, try to work through it only using your SQL environment and +> refrain from using a notepad. + + + +Let's start by seeing what tables are available. The `sqlite3` CLI uses +meta-commands that start with a dot, like this: + + sqlite> .tables + crime_scene_report get_fit_now_check_in interview + drivers_license get_fit_now_member person + facebook_event_checkin income solution + +Okay, let's start with finding our crime scene report. First, we'll need to know +what the data looks like. We can learn about this with the `.schema` command: + + sqlite> .schema crime_scene_report + CREATE TABLE crime_scene_report ( + date integer, + type text, + description text, + city text + ); + +Okay, seems pretty straightforward. The only thing I'm not quite sure about is +how the date is being represented -- it's just stored as an integer. A UNIX +timestamp perhaps? Let's sample the data: + + sqlite> select date from crime_scene_report limit 5; + + + + + + + + +
date
20180115
20180115
20180115
20180215
20180215
+ +Okay, seems it's just being stored as YYYYMMDD. Let's take a crack at finding +the crime scene report! We know the type (murder) and the city (SQL City). Let's +be generous with the date and assume it was sometime in January of 2018: + + sqlite> select * from crime_scene_report + ...> where type = 'murder' + ...> and city = 'SQL City' + ...> and date between 20180101 and 20180131; + + + + + + + + + + + + + + +
datetypedescriptioncity
20180115murder + Security footage shows that there were 2 witnesses. The first witness + lives at the last house on "Northwestern Dr". The second + witness, named Annabel, lives somewhere on "Franklin Ave". + SQL City
+ +Great, there's only one row that matches our broad date criteria! Let's see if +we can track down these witnesses. First, let's see how the data we need is +structured: + + sqlite> .schema person + CREATE TABLE person ( + id integer PRIMARY KEY, + name text, + license_id integer, + address_number integer, + address_street_name text, + ssn integer, + FOREIGN KEY (license_id) REFERENCES drivers_license(id) + ); + sqlite> .schema interview + CREATE TABLE interview ( + person_id integer, + transcript text, + FOREIGN KEY (person_id) REFERENCES person(id) + ); + +Okay, so we need to find the two rows in the person table, and then use their +ids to cross reference their interview text. This is "the big idea" with +relational databases, joining data in several tables based on something they +have in common. + +We'll start with the witness who lives on Northwestern Drive. We know that they +live in "the last house," which presumably has the highest house number on that +street. We can easily find this by first filtering for only people who live on +Northwestern Drive, then ordering those results by house number in descending +order, and only showing the first result: + + sqlite> select * from person + ...> where address_street_name = 'Northwestern Dr' + ...> order by address_number desc + ...> limit 1; + + + + + + + + + + + + + + + + + + +
idnamelicense_idaddress_numberaddress_street_namessn
14887Morty Schapiro1180094919Northwestern Dr111564949
+ +Great! Now let's find Annabel. We can use SQL's `LIKE` operator to match a +partial name, along with the name of their street: + + sqlite> select * from person + ...> where name like 'Annabel%' + ...> and address_street_name = 'Franklin Ave'; + + + + + + + + + + + + + + + + + + +
idnamelicense_idaddress_numberaddress_street_namessn
16371Annabel Miller490173103Franklin Ave318771143
+ +Okay, so we've got our person IDs: `14887` and `16371`. I think we're going to +want these IDs in a bunch of upcoming queries, so let's help our future selves +out by saving their IDs as parameters (a sort of temporary variable): + + sqlite> .parameter set $MORTY 14887 + sqlite> .parameter set $ANNABEL 16371 + +Let's grab their interviews. To do this, we'll put joins to use for the first +time so we can show their name rather than just their person ID. We're selecting +records from the `interview` table, but _joining_ matching records from the +`person` table, using the `person_id` column to match up the people. + + sqlite> select person.name, interview.transcript + ...> from interview + ...> join person on person.id = interview.person_id + ...> where person_id in ($MORTY, $ANNABEL); + + + + + + + + + + + + + + +
nametranscript
Morty Schapiro + I heard a gunshot and then saw a man run out. He had a "Get Fit Now + Gym" bag. The membership number on the bag started with + "48Z". Only gold members have those bags. The man got into a + car with a plate that included "H42W". +
Annabel Miller + I saw the murder happen, and I recognized the killer from my gym when I + was working out last week on January the 9th. +
+ +Okay, we've got tons of info now! Since the car and bag might not belong to the +killer, I think our best lead for narrowing things down is to see all the people +who crossed paths with Annabel at the gym on January 9th, 2018. Let's see what +those tables look like: + + sqlite> .schema get_fit_now_check_in + CREATE TABLE get_fit_now_check_in ( + membership_id text, + check_in_date integer, + check_in_time integer, + check_out_time integer, + FOREIGN KEY (membership_id) REFERENCES get_fit_now_member(id) + ); + sqlite> .schema get_fit_now_member + CREATE TABLE get_fit_now_member ( + id text PRIMARY KEY, + person_id integer, + name text, + membership_start_date integer, + membership_status text, + FOREIGN KEY (person_id) REFERENCES person(id) + ); + +Alright, time to look for some check-ins! We could do this in two separate +queries, one to find Annabel's Get Fit Now member ID by using her `person_id`, +and a second query to find her check-ins using her `membership_id`, but we can +also use a sub-query to do this in one shot: + + sqlite> select check_in_time, check_out_time + ...> from get_fit_now_check_in + ...> where date = 20180109 + ...> and membership_id = ( + ...> select id + ...> from get_fit_now_member + ...> where person_id = $ANNABEL); + + + + + + + + + + +
check_in_timecheck_out_time
16001700
+ +Looks like Annabel was at the gym from 4pm to 5pm on the 9th. Since we're +looking for someone who overlapped with Annabel at the gym, we're looking for +someone who arrived before 5pm and left after 4pm. Again, we'll join some tables +together here so we can grab their names and person IDs right away, not just +their membership numbers: + + sqlite> select person.id, person.name, get_fit_now_member.id, + ...> get_fit_now_check_in.check_in_time, + ...> get_fit_now_check_in.check_out_time + ...> from get_fit_now_check_in + ...> join get_fit_now_member on get_fit_now_member.id = membership_id + ...> join person on person.id = person_id + ...> where check_in_date = 20180109 + ...> and check_in_time <= 1700 and check_out_time >= 1600; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
idnameidcheck_in_timecheck_out_time
28819Joe Germuska48Z7A16001730
67318Jeremy Bowers48Z5515301700
16371Annabel Miller9008116001700
+ +Interesting, there were only two other gym members who were checked in for a +period overlapping with Annabel on the 9th. Let's save their IDs as well: + + sqlite> .parameter set $JOE 28819 + sqlite> .parameter set $JEREMY 67318 + +Their member numbers both start with 48Z; let's take a look at their vehicles, +presumably in the `drivers_license` table: + + sqlite> .schema drivers_license + CREATE TABLE drivers_license ( + id integer PRIMARY KEY, + age integer, + height integer, + eye_color text, + hair_color text, + gender text, + plate_number text, + car_make text, + car_model text + ); + + sqlite> select person.id, person.name, drivers_license.* + ...> from person + ...> join drivers_license on drivers_license.id = person.license_id + ...> where person.id in ($JOE, $JEREMY); + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
idnameidageheighteye_colorhair_colorgenderplate_numbercar_makecar_model
67318Jeremy Bowers4233273070brownbrownmale0H42W2ChevroletSpark LS
+ +So only Jeremy Bowers has a drivers license. And his car's license plate does +contain H42W, so it looks like we've found the killer! According to the +instructions in the GitHub repository, we should insert our answer into the +`solution` table, then query it: + + sqlite> insert into solution values (1, 'Jeremy Bowers'); + sqlite> select value from solution; + + + + + + + + +
value
+ Congrats, you found the murderer! But wait, there's more... If you + think you're up for a challenge, try querying the interview transcript + of the murderer to find the real villian behind this crime. If you feel + especially confident in your SQL skills, try to complete this final step + with no more than 2 queries. +
+ +Aha! We did correctly identify Jeremy Bowers. Let's see if we can connect the +dots to find the mastermind! First, we'll grab the killer's (Jeremy's) interview +transcript: + + sqlite> select transcript from interview where person_id = $JEREMY; + + + + + + + + +
transcript
+ I was hired by a woman with a lot of money. I don't know her name but + I know she's around 5'5" (65") or 5'7" + (67"). She has red hair and she drives a Tesla Model S. I know that + she attended the SQL Symphony Concert 3 times in December 2017. +
+ +Alright, there goes one query... one more to make it count! We're going to be +correlating data from a bunch of tables here: person, income (related by SSN, +probably as a sort criterion since we don't have an exact figure to work with), +we can grab height, hair color, gender, and car make/model from the drivers +licenses. It's a bit of a risk to filter by Facebook checkins to the SQL +Symphony, since we don't know that she checked in at all, but maybe we can +include the count of the number of times there was a check-in at the symphony +during December. Let's get a reminder of what these tables look like: + + sqlite> .schema person + CREATE TABLE person ( + id integer PRIMARY KEY, + name text, + license_id integer, + address_number integer, + address_street_name text, + ssn integer, + FOREIGN KEY (license_id) REFERENCES drivers_license(id) + ); + sqlite> .schema income + CREATE TABLE income ( + ssn integer PRIMARY KEY, + annual_income integer + ); + sqlite> .schema facebook_event_checkin + CREATE TABLE facebook_event_checkin ( + person_id integer, + event_id integer, + event_name text, + date integer, + FOREIGN KEY (person_id) REFERENCES person(id) + ); + sqlite> .schema drivers_license + CREATE TABLE drivers_license ( + id integer PRIMARY KEY, + age integer, + height integer, + eye_color text, + hair_color text, + gender text, + plate_number text, + car_make text, + car_model text + ); + +And assemble our final mega-query! + + sqlite> select p.id, p.name, i.annual_income, dl.height, dl.hair_color, + ...> dl.gender, dl.car_make, dl.car_model, ( + ...> select count(*) + ...> from facebook_event_checkin + ...> where person_id = p.id + ...> and event_name like '%symphony%' + ...> and date between 20171201 and 20171231) as num_symphonies + ...> from person p + ...> join income i on i.ssn = p.ssn + ...> join drivers_license dl on dl.id = p.license_id + ...> where dl.height between 64 and 68 + ...> and dl.hair_color like '%red%' + ...> and car_make like '%tesla%' + ...> and car_model like '%s%' + ...> order by i.annual_income desc; + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
idnameannual_incomeheighthair_colorgendercar_makecar_modelnum_symphonies
99716Miranda Priestly31000066redfemaleTeslaModel S3
78881Red Korb27800065redfemaleTeslaModel S0
+ +Okay, so we actually got two results for red-haired people around 66" tall who +make a lot of money and drive Tesla Model S's. However, one of them attended the +symphony three times in December (and makes even more money), so I think we've +found the mastermind! + +I didn't include gender in the filter as I wasn't sure how the data looked, and +I technically would've needed an additional query to discover that. + + sqlite> insert into solution values (1, 'Miranda Priestly'); + sqlite> select value from solution; + + + + + + + + +
value
+ Congrats, you found the brains behind the murder! Everyone in SQL City + hails you as the greatest SQL detective of all time. Time to break out the + champagne! +
+ +Hooray! I had a lot of fun playing through this, and would love to do another +similar puzzle again sometime. + +[db]: /assets/sql-murder-mystery.db +[mit]: https://opensource.org/licenses/MIT -- cgit v1.2.3