How to Load Kaggle Datasets into Jupyter Notebooks
Kaggle is an online community platform for data scientists and machine learning enthusiasts. It allows users to:
- find and publish data sets,
- explore and build models in a web-based data-science environment,
- work with other data scientists and machine learning engineers, and
- enter competitions to solve data science challenges.
We can't use `requests` to download a dataset from Kaggle, because Kaggle doesn't provide a raw URL for the dataset. In this notebook, we will learn how to download a Kaggle dataset using the `opendatasets` library with an API token.
Opendatasets
`opendatasets` is a Python library for downloading datasets from online sources like Kaggle and Google Drive using a simple Python command.
- Installing and Importing: You can install it with a simple `pip` command, and then import it.
!pip install opendatasets --upgrade --quiet
import opendatasets as od
- Downloading URL: The next step is to get the URL of the dataset you want to load into your Jupyter notebook and pass it to the `opendatasets.download()` function.
For now, we will be working with the US Accidents dataset: https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents
A good practice is to store the URL in a separate variable instead of typing it out every time.
- Kaggle Credentials: After you run the `download` function, you will be asked to enter your Kaggle username and API key.
Kaggle Credentials
- After signing up on https://www.kaggle.com/, click on your profile picture at the top right and select "My Account" from the menu.
- Scroll down to the API section and click "Create new API Token", which will download a `kaggle.json` file to your system.
The file should contain your Kaggle username and key in the format below:
{"username":"YOUR_KAGGLE_USERNAME","key":"YOUR_KAGGLE_KEY"}
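Before copying the values by hand, it can help to confirm that the downloaded file really has this shape. The following is a minimal sketch using only the standard library; it is not part of `opendatasets`, and the helper name `read_kaggle_credentials` and its default path are assumptions made for illustration.

```python
import json
from pathlib import Path


def read_kaggle_credentials(path="kaggle.json"):
    """Return (username, key) from a kaggle.json file (hypothetical helper)."""
    creds = json.loads(Path(path).read_text())
    # The file is expected to hold exactly the two fields shown above.
    missing = {"username", "key"} - set(creds)
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    return creds["username"], creds["key"]
```

Calling `read_kaggle_credentials("kaggle.json")` on a valid file returns the username and key as a pair; a file without one of the two fields raises a `ValueError` instead of failing later at the download prompt.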
- Now you can enter these credentials when prompted after running the `download` function.
dataset_url='https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents'
od.download(dataset_url)
Please provide your Kaggle credentials to download this dataset. Learn more: http://bit.ly/kaggle-creds
Your Kaggle username: himanigulati
Your Kaggle Key: ··········
Downloading us-accidents.zip to ./us-accidents
100%|██████████| 269M/269M [00:01<00:00, 188MB/s]
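Once the download finishes, the data still has to be read into the notebook. The sketch below assumes the archive was extracted into the `./us-accidents` folder shown in the output above and that it contains at least one CSV file. The file name is not visible in that output, so the code searches for it rather than hard-coding one. The helper names `find_csv_files` and `load_first_csv` are hypothetical, and pandas is assumed to be installed.

```python
from pathlib import Path


def find_csv_files(data_dir):
    """Return all CSV files under data_dir, sorted by path."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.csv"))


def load_first_csv(data_dir, nrows=None):
    """Load the first CSV found under data_dir into a pandas DataFrame."""
    import pandas as pd  # imported here so the search helper works without pandas

    csv_files = find_csv_files(data_dir)
    if not csv_files:
        raise FileNotFoundError(f"No CSV files found under {data_dir}")
    return pd.read_csv(csv_files[0], nrows=nrows)


# List whatever CSV files the download produced (empty if nothing is there yet).
print(find_csv_files("./us-accidents"))
```

Calling `load_first_csv("./us-accidents", nrows=1000)` reads only the first 1000 rows, which is a cheap way to inspect a large file before loading all of it.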
This was one way to add credentials, i.e., by manually copying and pasting the key from the downloaded `kaggle.json` file. There is also a more convenient way to provide them.
Automatically Adding Kaggle Credentials
We can save the extra seconds spent copying our Kaggle username and key from the file into the Jupyter notebook by uploading the `kaggle.json` file to the same directory as the notebook. The credentials will then be read automatically, and the `download` function will not prompt for them.
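A small guard can confirm that the file is in place before the download starts. This is a sketch under the assumption stated above, namely that `opendatasets` picks up `kaggle.json` from the notebook's working directory; the helper names `has_kaggle_json` and `download_without_prompt` are hypothetical.

```python
from pathlib import Path


def has_kaggle_json(directory="."):
    """Return True if a kaggle.json file exists in the given directory."""
    return (Path(directory) / "kaggle.json").is_file()


def download_without_prompt(dataset_url):
    """Download a dataset only if kaggle.json sits in the working directory."""
    if not has_kaggle_json("."):
        raise FileNotFoundError("Upload kaggle.json next to the notebook first.")
    import opendatasets as od  # imported lazily; requires `pip install opendatasets`

    od.download(dataset_url)


# Example call, left commented out so the cell does not start a download by accident:
# download_without_prompt("https://www.kaggle.com/datasets/sobhanmoosavi/us-accidents")
```

Since `kaggle.json` holds a secret key, it is worth keeping the file out of any notebook or repository you share publicly.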
Resources:
- Opendatasets Source Code: https://github.com/JovianML/opendatasets
- Kaggle: https://www.kaggle.com
- Some good datasets available on Kaggle:
- Getting started with Kaggle competitions: https://www.kaggle.com/code/alexisbcook/getting-started-with-kaggle-competitions
Conclusion
The best use you can make of Kaggle is to participate in Kaggle competitions. With experience comes wisdom, and with Kaggle competitions come skills (for machine learning) :)
The competitions you win on Kaggle and your Kaggle ranking can strengthen your resume for a career in data science.
Kaggle also offers other features such as GPUs, the opportunity to work with people with similar interests across the world, tons and tons of datasets, and more.
All the best :)