summaryrefslogtreecommitdiff
path: root/_posts/2019-12-20-sql-murder-mystery.md
blob: 8f7c1b5e0ba2e5f63e3164650cca4bdcf9282ced (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
---
title: Solving the SQL Murder Mystery
---

I saw this [SQL Murder Mystery](https://github.com/NUKnightLab/sql-mysteries)
appear on Hacker News recently, thought it sounded fun, and figured I'd do a
quick write-up of how I worked through it.

If you want to follow along, go ahead and [download the SQLite database][db]
(which is copyright NUKnightLab and redistributed here under the [MIT
license][mit]). You'll need some kind of SQLite client to interact with it (I
just used the `sqlite3` CLI tool).

In addition to the database, it's very helpful to start with a prompt:

> A crime has taken place and the detective needs your help. The detective gave
> you the crime scene report, but you somehow lost it. You vaguely remember that
> the crime was a murder that occurred sometime on Jan. 15, 2018 and that it
> took place in SQL City. Start by retrieving the corresponding crime scene
> report from the police department's database. If you want to get the most out
> of this mystery, try to work through it only using your SQL environment and
> refrain from using a notepad.

<!--more-->

Let's start by seeing what tables are available. The `sqlite3` CLI uses
meta-commands that start with a dot, like this:

    sqlite> .tables
    crime_scene_report      get_fit_now_check_in    interview
    drivers_license         get_fit_now_member      person
    facebook_event_checkin  income                  solution

Okay, let's start with finding our crime scene report. First, we'll need to know
what the data looks like. We can learn about this with the `.schema` command:

    sqlite> .schema crime_scene_report
    CREATE TABLE crime_scene_report (
            date integer,
            type text,
            description text,
            city text
        );

Okay, seems pretty straightforward. The only thing I'm not quite sure about is
how the date is being represented -- it's just stored as an integer. A UNIX
timestamp perhaps? Let's sample the data:

    sqlite> select date from crime_scene_report limit 5;

<table>
  <tr><th>date</th></tr>
  <tr><td>20180115</td></tr>
  <tr><td>20180115</td></tr>
  <tr><td>20180115</td></tr>
  <tr><td>20180215</td></tr>
  <tr><td>20180215</td></tr>
</table>

Okay, seems it's just being stored as YYYYMMDD. Let's take a crack at finding
the crime scene report! We know the type (murder) and the city (SQL City). Let's
be generous with the date and assume it was sometime in January of 2018:

    sqlite> select * from crime_scene_report
       ...> where type = 'murder'
       ...> and city = 'SQL City'
       ...> and date between 20180101 and 20180131;

<table>
  <tr>
    <th>date</th>
    <th>type</th>
    <th>description</th>
    <th>city</th>
  </tr>
  <tr>
    <td>20180115</td>
    <td>murder</td>
    <td>
      Security footage shows that there were 2 witnesses. The first witness
      lives at the last house on &quot;Northwestern Dr&quot;. The second
      witness, named Annabel, lives somewhere on &quot;Franklin Ave&quot;.
    </td>
    <td>SQL City</td>
  </tr>
</table>

Great, there's only one row that matches our broad date criteria! Let's see if
we can track down these witnesses. First, let's see how the data we need is
structured:

    sqlite> .schema person
    CREATE TABLE person (
            id integer PRIMARY KEY,
            name text,
            license_id integer,
            address_number integer,
            address_street_name text,
            ssn integer,
            FOREIGN KEY (license_id) REFERENCES drivers_license(id)
        );
    sqlite> .schema interview
    CREATE TABLE interview (
            person_id integer,
            transcript text,
            FOREIGN KEY (person_id) REFERENCES person(id)
        );

Okay, so we need to find the two rows in the person table, and then use their
ids to cross reference their interview text. This is "the big idea" with
relational databases, joining data in several tables based on something they
have in common.

We'll start with the witness who lives on Northwestern Drive.  We know that they
live in "the last house," which presumably has the highest house number on that
street. We can easily find this by first filtering for only people who live on
Northwestern Drive, then ordering those results by house number in descending
order, and only showing the first result:

    sqlite> select * from person
       ...> where address_street_name = 'Northwestern Dr'
       ...> order by address_number desc
       ...> limit 1;

<table>
  <tr>
    <th>id</th>
    <th>name</th>
    <th>license_id</th>
    <th>address_number</th>
    <th>address_street_name</th>
    <th>ssn</th>
  </tr>
  <tr>
    <td>14887</td>
    <td>Morty Schapiro</td>
    <td>118009</td>
    <td>4919</td>
    <td>Northwestern Dr</td>
    <td>111564949</td>
  </tr>
</table>

Great! Now let's find Annabel. We can use SQL's `LIKE` operator to match a
partial name, along with the name of their street:

    sqlite> select * from person
       ...> where name like 'Annabel%'
       ...> and address_street_name = 'Franklin Ave';

<table>
  <tr>
    <th>id</th>
    <th>name</th>
    <th>license_id</th>
    <th>address_number</th>
    <th>address_street_name</th>
    <th>ssn</th>
  </tr>
  <tr>
    <td>16371</td>
    <td>Annabel Miller</td>
    <td>490173</td>
    <td>103</td>
    <td>Franklin Ave</td>
    <td>318771143</td>
  </tr>
</table>

Okay, so we've got our person IDs: `14887` and `16371`. I think we're going to
want these IDs in a bunch of upcoming queries, so let's help our future selves
out by saving their IDs as parameters (a sort of temporary variable):

    sqlite> .parameter set $MORTY 14887
    sqlite> .parameter set $ANNABEL 16371

Let's grab their interviews. To do this, we'll put joins to use for the first
time so we can show their name rather than just their person ID. We're selecting
records from the `interview` table, but _joining_ matching records from the
`person` table, using the `person_id` column to match up the people.

    sqlite> select person.name, interview.transcript
       ...> from interview
       ...> join person on person.id = interview.person_id
       ...> where person_id in ($MORTY, $ANNABEL);

<table>
  <tr>
    <th>name</th>
    <th>transcript</th>
  </tr>
  <tr>
    <td>Morty Schapiro</td>
    <td>
      I heard a gunshot and then saw a man run out. He had a &quot;Get Fit Now
      Gym&quot; bag. The membership number on the bag started with
      &quot;48Z&quot;.  Only gold members have those bags. The man got into a
      car with a plate that included &quot;H42W&quot;.
    </td>
  </tr>
  <tr>
    <td>Annabel Miller</td>
    <td>
      I saw the murder happen, and I recognized the killer from my gym when I
      was working out last week on January the 9th.
    </td>
  </tr>
</table>

Okay, we've got tons of info now! Since the car and bag might not belong to the
killer, I think our best lead for narrowing things down is to see all the people
who crossed paths with Annabel at the gym on January 9th, 2018. Let's see what
those tables look like:

    sqlite> .schema get_fit_now_check_in
    CREATE TABLE get_fit_now_check_in (
            membership_id text,
            check_in_date integer,
            check_in_time integer,
            check_out_time integer,
            FOREIGN KEY (membership_id) REFERENCES get_fit_now_member(id)
        );
    sqlite> .schema get_fit_now_member
    CREATE TABLE get_fit_now_member (
            id text PRIMARY KEY,
            person_id integer,
            name text,
            membership_start_date integer,
            membership_status text,
            FOREIGN KEY (person_id) REFERENCES person(id)
        );

Alright, time to look for some check-ins! We could do this in two separate
queries, one to find Annabel's Get Fit Now member ID by using her `person_id`,
and a second query to find her check-ins using her `membership_id`, but we can
also use a sub-query to do this in one shot:

    sqlite> select check_in_time, check_out_time
       ...> from get_fit_now_check_in
       ...> where date = 20180109
       ...> and membership_id = (
       ...>   select id
       ...>   from get_fit_now_member
       ...>   where person_id = $ANNABEL);

<table>
  <tr>
    <th>check_in_time</th>
    <th>check_out_time</th>
  </tr>
  <tr>
    <td>1600</td>
    <td>1700</td>
  </tr>
</table>

Looks like Annabel was at the gym from 4pm to 5pm on the 9th. Since we're
looking for someone who overlapped with Annabel at the gym, we're looking for
someone who arrived before 5pm and left after 4pm. Again, we'll join some tables
together here so we can grab their names and person IDs right away, not just
their membership numbers:

    sqlite> select person.id, person.name, get_fit_now_member.id,
       ...>   get_fit_now_check_in.check_in_time,
       ...>   get_fit_now_check_in.check_out_time
       ...> from get_fit_now_check_in
       ...> join get_fit_now_member on get_fit_now_member.id = membership_id
       ...> join person on person.id = person_id
       ...> where check_in_date = 20180109
       ...> and check_in_time <= 1700 and check_out_time >= 1600;

<table>
  <tr>
    <th>id</th>
    <th>name</th>
    <th>id</th>
    <th>check_in_time</th>
    <th>check_out_time</th>
  </tr>
  <tr>
    <td>28819</td>
    <td>Joe Germuska</td>
    <td>48Z7A</td>
    <td>1600</td>
    <td>1730</td>
  </tr>
  <tr>
    <td>67318</td>
    <td>Jeremy Bowers</td>
    <td>48Z55</td>
    <td>1530</td>
    <td>1700</td>
  </tr>
  <tr>
    <td>16371</td>
    <td>Annabel Miller</td>
    <td>90081</td>
    <td>1600</td>
    <td>1700</td>
  </tr>
</table>

Interesting, there were only two other gym members who were checked in for a
period overlapping with Annabel on the 9th. Let's save their IDs as well:

    sqlite> .parameter set $JOE 28819
    sqlite> .parameter set $JEREMY 67318

Their member numbers both start with 48Z; let's take a look at their vehicles,
presumably in the `drivers_license` table:

    sqlite> .schema drivers_license
    CREATE TABLE drivers_license (
            id integer PRIMARY KEY,
            age integer,
            height integer,
            eye_color text,
            hair_color text,
            gender text,
            plate_number text,
            car_make text,
            car_model text
        );

    sqlite> select person.id, person.name, drivers_license.*
       ...> from person
       ...> join drivers_license on drivers_license.id = person.license_id
       ...> where person.id in ($JOE, $JEREMY);

<table>
  <tr>
    <th>id</th>
    <th>name</th>
    <th>id</th>
    <th>age</th>
    <th>height</th>
    <th>eye_color</th>
    <th>hair_color</th>
    <th>gender</th>
    <th>plate_number</th>
    <th>car_make</th>
    <th>car_model</th>
  </tr>
  <tr>
    <td>67318</td>
    <td>Jeremy Bowers</td>
    <td>423327</td>
    <td>30</td>
    <td>70</td>
    <td>brown</td>
    <td>brown</td>
    <td>male</td>
    <td>0H42W2</td>
    <td>Chevrolet</td>
    <td>Spark LS</td>
  </tr>
</table>

So only Jeremy Bowers has a drivers license. And his car's license plate does
contain H42W, so it looks like we've found the killer! According to the
instructions in the GitHub repository, we should insert our answer into the
`solution` table, then query it:

    sqlite> insert into solution values (1, 'Jeremy Bowers');
    sqlite> select value from solution;

<table>
  <tr>
    <th>value</th>
  </tr>
  <tr>
    <td>
      Congrats, you found the murderer! But wait, there&#39;s more... If you
      think you&#39;re up for a challenge, try querying the interview transcript
      of the murderer to find the real villian behind this crime. If you feel
      especially confident in your SQL skills, try to complete this final step
      with no more than 2 queries.
    </td>
  </tr>
</table>

Aha! We did correctly identify Jeremy Bowers. Let's see if we can connect the
dots to find the mastermind! First, we'll grab the killer's (Jeremy's) interview
transcript:

    sqlite> select transcript from interview where person_id = $JEREMY;

<table>
  <tr>
    <th>transcript</th>
  </tr>
  <tr>
    <td>
      I was hired by a woman with a lot of money. I don&#39;t know her name but
      I know she&#39;s around 5&#39;5&quot; (65&quot;) or 5&#39;7&quot;
      (67&quot;). She has red hair and she drives a Tesla Model S. I know that
      she attended the SQL Symphony Concert 3 times in December 2017.
    </td>
  </tr>
</table>

Alright, there goes one query... one more to make it count! We're going to be
correlating data from a bunch of tables here: person, income (related by SSN,
probably as a sort criterion since we don't have an exact figure to work with),
we can grab height, hair color, gender, and car make/model from the drivers
licenses. It's a bit of a risk to filter by Facebook checkins to the SQL
Symphony, since we don't know that she checked in at all, but maybe we can
include the count of the number of times there was a check-in at the symphony
during December. Let's get a reminder of what these tables look like:

    sqlite> .schema person
    CREATE TABLE person (
            id integer PRIMARY KEY,
            name text,
            license_id integer,
            address_number integer,
            address_street_name text,
            ssn integer,
            FOREIGN KEY (license_id) REFERENCES drivers_license(id)
        );
    sqlite> .schema income
    CREATE TABLE income (
            ssn integer PRIMARY KEY,
            annual_income integer
        );
    sqlite> .schema facebook_event_checkin
    CREATE TABLE facebook_event_checkin (
            person_id integer,
            event_id integer,
            event_name text,
            date integer,
            FOREIGN KEY (person_id) REFERENCES person(id)
        );
    sqlite> .schema drivers_license
    CREATE TABLE drivers_license (
            id integer PRIMARY KEY,
            age integer,
            height integer,
            eye_color text,
            hair_color text,
            gender text,
            plate_number text,
            car_make text,
            car_model text
        );

And assemble our final mega-query!

    sqlite> select p.id, p.name, i.annual_income, dl.height, dl.hair_color,
       ...> dl.gender, dl.car_make, dl.car_model, (
       ...>   select count(*)
       ...>   from facebook_event_checkin
       ...>   where person_id = p.id
       ...>   and event_name like '%symphony%'
       ...>   and date between 20171201 and 20171231) as num_symphonies
       ...> from person p
       ...> join income i on i.ssn = p.ssn
       ...> join drivers_license dl on dl.id = p.license_id
       ...> where dl.height between 64 and 68
       ...> and dl.hair_color like '%red%'
       ...> and car_make like '%tesla%'
       ...> and car_model like '%s%'
       ...> order by i.annual_income desc;

<table>
  <tr>
    <th>id</th>
    <th>name</th>
    <th>annual_income</th>
    <th>height</th>
    <th>hair_color</th>
    <th>gender</th>
    <th>car_make</th>
    <th>car_model</th>
    <th>num_symphonies</th>
  </tr>
  <tr>
    <td>99716</td>
    <td>Miranda Priestly</td>
    <td>310000</td>
    <td>66</td>
    <td>red</td>
    <td>female</td>
    <td>Tesla</td>
    <td>Model S</td>
    <td>3</td>
  </tr>
  <tr>
    <td>78881</td>
    <td>Red Korb</td>
    <td>278000</td>
    <td>65</td>
    <td>red</td>
    <td>female</td>
    <td>Tesla</td>
    <td>Model S</td>
    <td>0</td>
  </tr>
</table>

Okay, so we actually got two results for red-haired people around 66" tall who
make a lot of money and drive Tesla Model S's. However, one of them attended the
symphony three times in December (and makes even more money), so I think we've
found the mastermind!

I didn't include gender in the filter as I wasn't sure how the data looked, and
I technically would've needed an additional query to discover that.

    sqlite> insert into solution values (1, 'Miranda Priestly');
    sqlite> select value from solution;

<table>
  <tr>
    <th>value</th>
  </tr>
  <tr>
    <td>
      Congrats, you found the brains behind the murder! Everyone in SQL City
      hails you as the greatest SQL detective of all time. Time to break out the
      champagne!
    </td>
  </tr>
</table>

Hooray! I had a lot of fun playing through this, and would love to do another
similar puzzle again sometime.

[db]: https://s.bnbl.io/blog/sql-murder-mystery.db
[mit]: https://opensource.org/licenses/MIT