BigQuery is a RESTful web service that enables interactive analysis of massive datasets and works in conjunction with Google Cloud Storage. It is a serverless Software as a Service (SaaS) offering that can be used as a complement to MapReduce.
Kaggle hosts many public datasets in BigQuery that you can use to get hands-on experience with BigQuery and with querying in general.
In this article I will walk through the steps to create a project that queries BigQuery data from a Jupyter notebook.
- Log in to (or create an account on) Google Cloud. Search for the BigQuery API and enable it from the console.

Next, go to Credentials and create a service account key. Select ‘new service account’ and give it a suitable name. Under Role, select Project > Owner (this grants full admin access). Select JSON and click Create. This downloads a .json key file to your computer. Save it in a suitable folder and copy its path to the clipboard.

We can now head over to our Jupyter notebook.
pip install google-cloud-bigquery
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '<paste the path of the .json file that you copied to clipboard>'
from google.cloud import bigquery
client = bigquery.Client()
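If you would rather not rely on the environment variable, the client also accepts credentials loaded explicitly from the key file. A minimal sketch, where ‘path/to/key.json’ is a placeholder for your own key path:
from google.oauth2 import service_account
# Hypothetical path - point this at the downloaded .json key file
credentials = service_account.Credentials.from_service_account_file('path/to/key.json')
client = bigquery.Client(credentials=credentials, project=credentials.project_id)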
Here, you can replace “london_crime” and “bigquery-public-data” with your own dataset and project as required.
dataset_ref = client.dataset("london_crime", project="bigquery-public-data")
# API request - fetch the dataset
dataset = client.get_dataset(dataset_ref)
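To confirm the dataset was fetched, you can print a couple of the standard attributes exposed on the Dataset object:
# Basic metadata on the Dataset object
print(dataset.full_dataset_id)  # project:dataset, e.g. bigquery-public-data:london_crime
print(dataset.description)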
We have now established the connection and can explore the dataset.
# List all the tables in the dataset (in case there are multiple)
tables = list(client.list_tables(dataset))
for table in tables:
    print(table.table_id)
crime_by_lsoa
# Fetch the full table to print information on all the columns in "crime_by_lsoa"
# (the items returned by list_tables do not carry the schema)
table = client.get_table(dataset_ref.table("crime_by_lsoa"))
table.schema
[SchemaField('lsoa_code', 'STRING', 'NULLABLE', 'Lower Layer Super Output Area code according to the Office for National Statistics', ()),
 SchemaField('borough', 'STRING', 'NULLABLE', '', ()),
 SchemaField('major_category', 'STRING', 'NULLABLE', '', ()),
 SchemaField('minor_category', 'STRING', 'NULLABLE', '', ()),
 SchemaField('value', 'INTEGER', 'NULLABLE', 'Summary of the number of crimes for the month', ()),
 SchemaField('year', 'INTEGER', 'NULLABLE', '', ()),
 SchemaField('month', 'INTEGER', 'NULLABLE', '', ())]
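Before writing any SQL, you can also preview a few rows straight from the table with list_rows; max_results caps how many rows are fetched:
# Preview the first five rows of the table without issuing a query
preview = client.list_rows(table, max_results=5).to_dataframe()
preview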
# Define a query to view the structure of the data (reusing the client from above)
QUERY = """
SELECT *
FROM `bigquery-public-data.london_crime.crime_by_lsoa`
LIMIT 1000;
"""
query_job = client.query(QUERY)
# Store the query result in a pandas DataFrame
df = query_job.to_dataframe()
df
| | lsoa_code | borough | major_category | minor_category | value | year | month |
|---|---|---|---|---|---|---|---|
| 0 | E01000002 | City of London | Violence Against the Person | Harassment | 0 | 2011 | 9 |
| 1 | E01000005 | City of London | Violence Against the Person | Harassment | 0 | 2014 | 7 |
| 2 | E01000001 | City of London | Violence Against the Person | Harassment | 0 | 2008 | 3 |
| 3 | E01032739 | City of London | Violence Against the Person | Harassment | 0 | 2016 | 3 |
| 4 | E01000001 | City of London | Violence Against the Person | Harassment | 0 | 2011 | 10 |
| 5 | E01000002 | City of London | Violence Against the Person | Harassment | 0 | 2011 | 3 |
| 6 | E01032739 | City of London | Violence Against the Person | Harassment | 0 | 2009 | 9 |
| … | … | … | … | … | … | … | … |
| 999 | E01032739 | City of London | Theft and Handling | Theft From Motor Vehicle | 0 | 2015 | 8 |
We can write many more queries to explore the dataset, perform EDA, and engineer features as inputs to predictive algorithms.
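As a rough sketch (column names taken from the schema printed above), this query totals the reported crimes per year:
QUERY = """
SELECT year, SUM(value) AS total_crimes
FROM `bigquery-public-data.london_crime.crime_by_lsoa`
GROUP BY year
ORDER BY year
"""
# Run the aggregation and pull the result into a DataFrame
yearly = client.query(QUERY).to_dataframe()
yearly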
You can also download my Jupyter notebook for reference –
You can view this article for the EDA.