Crime Data Program

Summary: This program uses machine learning to display and analyze data about crime in Los Angeles.

Description: There are two versions of this project: one using AWS and the other a lite version. The AWS version is a school project, and the lite version is a personal project.

  • School Project: The program uses PySpark to retrieve and transform crime data for Los Angeles (2020 to present), stores the transformed data in AWS Athena tables for efficient querying, queries those tables to perform data analysis, and uses MLlib to train a machine learning model. The goal is to find clusters of crime so that police can send officers to the areas where crimes are concentrated.
  • Personal Project: This program uses a CSV of crime incidents in Los Angeles, cleans and filters latitude/longitude data, and aggregates incidents into a configurable spatial grid to produce a heatmap of crime density and a ranked list of hotspot cells. It also runs a K‑Means clustering workflow (including silhouette-score sampling to choose an optimal k) to identify and visualize geographic crime clusters and their centroids, with configurable parameters for grid size, display count, and K‑Means sampling to keep runtimes reasonable.
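
As a rough illustration of the grid aggregation in the personal project, the sketch below bins cleaned coordinates into square cells and ranks the busiest ones. It is not taken from the notebook: GRID_SIZE_DEG, TOP_N, and the helper functions are hypothetical names, the sample rows are made up, and treating (0, 0) coordinates as missing locations is an assumption about the dataset.

```python
import math
from collections import Counter

# Hypothetical defaults; the notebook's real parameter names and values may differ.
GRID_SIZE_DEG = 0.01  # width and height of one grid cell, in degrees
TOP_N = 3             # how many hotspot cells to report


def clean_points(rows):
    """Keep only rows whose latitude and longitude are usable numbers.

    Assumption: a coordinate of 0 marks a missing location.
    """
    points = []
    for lat, lon in rows:
        try:
            lat, lon = float(lat), float(lon)
        except (TypeError, ValueError):
            continue  # skip blanks and non-numeric values
        if math.isnan(lat) or math.isnan(lon):
            continue
        if lat == 0.0 or lon == 0.0:
            continue
        points.append((lat, lon))
    return points


def grid_counts(points, cell=GRID_SIZE_DEG):
    """Count incidents per grid cell, keyed by integer (lat_index, lon_index)."""
    return Counter(
        (math.floor(lat / cell), math.floor(lon / cell)) for lat, lon in points
    )


def top_hotspots(counts, n=TOP_N):
    """Return the n busiest cells as (cell, count) pairs, highest count first."""
    return counts.most_common(n)


# Made-up sample rows: three incidents close together, one elsewhere, two unusable.
rows = [
    ("34.0522", "-118.2437"),
    ("34.0525", "-118.2431"),
    ("34.0529", "-118.2439"),
    ("34.1016", "-118.3267"),
    ("0", "0"),
    ("", "-118.25"),
]
points = clean_points(rows)
counts = grid_counts(points, cell=0.01)
hotspots = top_hotspots(counts, n=2)
print(hotspots)
```

A smaller cell size gives a finer heatmap but more cells to rank, which is why the grid size is worth keeping configurable.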

Dates:

  • School Project: 6/2/2025 – 6/9/2025
  • Personal Project: 11/5/2025 – 11/7/2025

Language: Python

Skills:

  • PySpark
  • Athena
  • MLlib
  • Datasets
  • Data processing and transformation
  • Data mining
  • Imports
  • Constants
  • Data frames
  • Functions
  • Decimals
  • Lists
  • Tuples
  • Sorting
  • Printing messages
  • Displaying graphs
  • Machine learning (K-Means)

Files:

How to Run the AWS Version:

  1. Download the "Crime Data from 2020 to Present" dataset as a .csv file (click the Export button)
    1. Rename the .csv file to Crime_Data_from_2020_to_Present.csv
  2. Launch AWS
  3. Create buckets
    1. Go to S3
    2. Create a general purpose bucket (test-mario-spring2025 for my project)
    3. Create 3 folders inside the general purpose bucket: athena-output, athena, and crime-data
    4. Put the crime data file inside the crime-data folder
  4. Set up Athena
    1. Go to Athena
    2. Click Settings
    3. Set the query result location to the path of the athena folder (s3://test-mario-spring2025/athena/ for my project)
  5. Create an EMR Studio
    1. Go to EMR > EMR Studio > Studios
    2. Click Create Studio
    3. Select Custom in Setup options
    4. Set "Service role to let Studio access your AWS resources" to LabRole
    5. Select a VPC
    6. Select at least two subnets
    7. Click Create Studio
  6. Create a cluster
    1. Go to EMR
    2. Click Create cluster
    3. Set Name to test
    4. Set the Primary EC2 instance type to m4.large
    5. Set the Core EC2 instance type to m4.large
    6. Set the Task EC2 instance type to m4.large
    7. Set the task’s instance size to 3
    8. Set Amazon EC2 key pair for SSH to the cluster to vockey
    9. Set Service role to EMR_DefaultRole
    10. Set Instance profile to EMR_EC2_DefaultRole
    11. Click Create cluster
    12. If a cluster you created earlier with these steps is no longer running, you can clone it instead of creating a new one
    13. Wait about 10 minutes for the cluster to start
  7. Launch the workspace
    1. Go to EMR > EMR Studio > Workspaces (Notebooks)
    2. Select the workspace and choose the option to attach a cluster
    3. Choose your running EMR cluster
    4. Click Attach cluster and launch
  8. Import the Final_Project.ipynb file into the workspace
  9. Open the Final_Project.ipynb file in the workspace
  10. Edit the Final_Project.ipynb file
    1. Change the following constants in the constants section of the notebook
      • Change CRIME_DATA_URI to where you stored the .csv file in the crime-data folder (s3://test-mario-spring2025/crime-data/Crime_Data_from_2020_to_Present.csv for my project)
      • Change AWS_ACCESS_KEY_ID to your AWS access key ID
      • Change AWS_SECRET_ACCESS_KEY to your AWS secret access key
      • Change AWS_SESSION_TOKEN to your AWS session token
      • Change AWS_REGION to your AWS region (if necessary)
      • Change ATHENA_BUCKET to where your athena folder is (s3://test-mario-spring2025/athena for my project)
      • Change ATHENA_OUTPUT_BUCKET to where your athena-output folder is (s3://test-mario-spring2025/athena-output for my project)
      • (Optional) Change KMAX to control how many cluster counts the K-means search tries (the largest number of clusters tested is one less than KMAX)
    2. (Optional) In the train final model section, set the k variable to an appropriate integer (look for the k at which the k-means graph spikes)
  11. Run the code in the notebook
  12. Stop the workspace and terminate the cluster when finished
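
For orientation, the constants edited in step 10 might look something like the sketch below. The constant names and S3 paths come from the steps above. Everything else is an assumption made for illustration: reading credentials from environment variables, the "us-east-1" default region, the KMAX value of 10, the hypothetical split_s3_uri helper, and the idea that the candidate cluster counts start at 2.

```python
import os

# Paths from the example bucket in the steps above; replace with your own.
CRIME_DATA_URI = "s3://test-mario-spring2025/crime-data/Crime_Data_from_2020_to_Present.csv"
ATHENA_BUCKET = "s3://test-mario-spring2025/athena"
ATHENA_OUTPUT_BUCKET = "s3://test-mario-spring2025/athena-output"

# Assumption: credentials are pulled from environment variables here so they
# are not typed into the notebook; the notebook may simply hold string literals.
AWS_ACCESS_KEY_ID = os.environ.get("AWS_ACCESS_KEY_ID", "")
AWS_SECRET_ACCESS_KEY = os.environ.get("AWS_SECRET_ACCESS_KEY", "")
AWS_SESSION_TOKEN = os.environ.get("AWS_SESSION_TOKEN", "")
AWS_REGION = os.environ.get("AWS_DEFAULT_REGION", "us-east-1")  # assumed default

KMAX = 10  # assumed value; the notebook's default is not stated


def split_s3_uri(uri):
    """Hypothetical helper: split 's3://bucket/key' into (bucket, key)."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an S3 URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket:
        raise ValueError(f"missing bucket name: {uri!r}")
    return bucket, key


# Catch typos in the paths before starting a long Spark job.
for uri in (CRIME_DATA_URI, ATHENA_BUCKET, ATHENA_OUTPUT_BUCKET):
    split_s3_uri(uri)

# Assumption: the search tries k = 2 .. KMAX - 1, matching the "one less" note.
candidate_ks = list(range(2, KMAX))
print(candidate_ks)
```

Checking the URIs up front is optional, but a mistyped bucket path is cheaper to catch here than after the cluster has started working.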

How to Run the Lite Version:

  1. Download the "Crime Data from 2020 to Present" dataset as a .csv file (click the Export button)
    1. Rename the .csv file to Crime_Data_from_2020_to_Present.csv
    2. Store the .csv file in the same folder as the .ipynb file
  2. Run the .ipynb file in Jupyter Notebook
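
To give a feel for what the lite notebook's K-Means step does when it picks k, here is a minimal sketch of silhouette-based selection on a random sample. It is written in plain Python so it runs without extra packages, and it is not the notebook's code: the function names, the sample size, and the choice to both fit and score on the sample are assumptions, and the two synthetic blobs stand in for real coordinates.

```python
import math
import random


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return centroids, labels


def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, k_values, sample_size, seed=0):
    """Score each k on a random sample and return (best_k, scores)."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    scores = {}
    for k in k_values:
        _, labels = kmeans(sample, k, rng)
        scores[k] = silhouette(sample, labels)
    return max(scores, key=scores.get), scores


# Two synthetic, well-separated blobs of (latitude, longitude) points.
offsets = [(0.0, 0.0), (0.001, 0.0), (0.0, 0.001),
           (-0.001, 0.0), (0.0, -0.001), (0.001, 0.001)]
blob_a = [(34.05 + dy, -118.25 + dx) for dy, dx in offsets]
blob_b = [(34.20 + dy, -118.45 + dx) for dy, dx in offsets]
points = blob_a + blob_b

best_k, scores = choose_k(points, k_values=range(2, 5), sample_size=10)
print(best_k)
```

The silhouette score compares every pair of points, so its cost grows quadratically with the number of points; scoring a sample rather than the full dataset is what keeps the runtime reasonable.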

Repository: https://github.com/marvel3492/crime-data-program