Google BigQuery – London Crime Data

BigQuery is a RESTful web-service that enables interactive analysis of massive datasets working in conjunction with Google Storage. It is a server-less Software as a Service (SaaS) that may be used complementarily with MapReduce.

Kaggle has many public datasets stored in BigQuery which one can use to get a good hands on experience with BigQuery and querying in general.

In this article I will walk through the steps to follow to create a project that queries data from BigQuery and Jupyter notebook.

  1. Login/Create an account on Google Cloud. Search for BigQuery API and enable it from the console.

Next, go to credentials and create service account key. Select ‘new service account’, give it a suitable name. In role select Project > Owner (this gives you the full admin access). Select json and click create. This will download a .json file on your computer. Save it in any suitable folder and copy its path to clipboard.

We can now head over to our Jupyter notebook.

pip install google-cloud-bigquery
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] ='<paste the path of the .json file that you copied to clipboard'>
from google.cloud import bigquery
client = bigquery.Client()

Here, you can replace “london_crime” and “bigquery-public-dataset” with your own project as required.

dataset_ref = client.dataset("london_crime", project="bigquery-public-data")
# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)

We have now established the connection and we can now explore the dataset.

# List all the tables in the dataset
tables = list(client.list_tables(dataset))
#list all tables-in case there are multiple
for table in tables:  
    print(table.table_id)

crime_by_lsoa

# Print information on all the columns in the "crime_by_lsoa" table in the dataset
table.schema
[SchemaField('lsoa_code', 'STRING', 'NULLABLE', 'Lower Layer Super Output Area code according to the Office for National Statistics', ()),
 SchemaField('borough', 'STRING', 'NULLABLE', '', ()),
 SchemaField('major_category', 'STRING', 'NULLABLE', '', ()),
 SchemaField('minor_category', 'STRING', 'NULLABLE', '', ()),
 SchemaField('value', 'INTEGER', 'NULLABLE', 'Summary of the number of crimes for the month', ()),
 SchemaField('year', 'INTEGER', 'NULLABLE', '', ()),
 SchemaField('month', 'INTEGER', 'NULLABLE', '', ())]
bigquery_client = bigquery.Client()
#define query to view the structure of the dataframe
QUERY = """
select * 
from `bigquery-public-data.london_crime.crime_by_lsoa`
limit 1000;
"""
query_job = bigquery_client.query(QUERY)
#storing the dataframe in df
df = query_job.to_dataframe()
df

lsoa_code
boroughmajor_categoryminor_categoryvalueyearmonth
0E01000002City of LondonViolence Against the PersonHarassment020119
1E01000005City of LondonViolence Against the PersonHarassment020147
2E01000001City of LondonViolence Against the PersonHarassment020083
3E01032739City of LondonViolence Against the PersonHarassment020163
4E01000001City of LondonViolence Against the PersonHarassment0201110
5E01000002City of LondonViolence Against the PersonHarassment020113
6E01032739City of LondonViolence Against the PersonHarassment020099

.

.

.

lsoa_codeboroughmajor_categoryminor_categoryvalueyearmonth
0999E01032739 City of LondonTheft and HandlingTheft From Motor Vehicle020158

We can further write many more queries that help us explore the dataset, perform EDA, and feature engineer the values for inputs to predictive algorithms.
You can also download my Jupyter notebook for reference –

You can view this article for the EDA

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: