Problem statement & submission details:
Use this thread to ask questions, share your project, discover interesting projects and give feedback to others.
For the course project, you will pick a real-world dataset of your choice and apply the concepts learned in this course to perform exploratory data analysis. Focus on documentation and presentation - the Jupyter notebook will also serve as a project report, so make sure to include detailed explanations wherever possible using Markdown cells.
Step 1: Select a real-world dataset
Find and download an interesting real-world dataset (see the Recommended Datasets section below for ideas).
The dataset should contain tabular data (rows & columns), preferably in CSV/JSON/XLS or another format that can be read using Pandas. If it’s not in a compatible format, you may have to write some code to convert it to one.
The dataset should contain at least 3 columns and 150 rows of data. You can also combine data from multiple sources to create a large enough dataset.
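As a rough illustration of the size requirement, the sketch below combines two sources with the same columns and checks the result against the 3-column, 150-row minimum. The data frames, column names and the `meets_minimum_size` helper are hypothetical stand-ins, not part of the course material.

```python
import pandas as pd

# Hypothetical data standing in for two downloaded sources with the same columns.
source_a = pd.DataFrame({
    "city": ["A"] * 100,
    "year": list(range(1920, 2020)),
    "rainfall_mm": [float(i) for i in range(100)],
})
source_b = pd.DataFrame({
    "city": ["B"] * 80,
    "year": list(range(1940, 2020)),
    "rainfall_mm": [float(i) * 2 for i in range(80)],
})

# pd.concat stacks the rows of both frames; ignore_index renumbers the rows from 0.
combined = pd.concat([source_a, source_b], ignore_index=True)


def meets_minimum_size(df, min_rows=150, min_cols=3):
    """Return True if the frame has at least min_rows rows and min_cols columns."""
    n_rows, n_cols = df.shape
    return n_rows >= min_rows and n_cols >= min_cols


print(combined.shape)
print(meets_minimum_size(source_a), meets_minimum_size(combined))
```

Here neither source is large enough on its own, but the combined frame is.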
Step 2: Perform data preparation & cleaning
- Load the dataset into a data frame using Pandas
- Explore the number of rows & columns, ranges of values etc.
- Handle missing, incorrect and invalid data
- Perform any additional steps (parsing dates, creating additional columns, merging multiple datasets etc.)
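The preparation steps above can be sketched roughly as follows. The CSV content, column names and cleaning rules are made up for illustration; a real project would load its own file with `pd.read_csv` and choose rules that suit the data.

```python
import io

import pandas as pd

# Hypothetical CSV content; in a real project this would be a path to your file.
raw_csv = io.StringIO(
    "date,product,units,unit_price\n"
    "2021-01-05,widget,10,2.5\n"
    "2021-01-06,gadget,,4.0\n"
    "2021-02-10,widget,-3,2.5\n"
    "2021-02-11,gadget,7,4.0\n"
)

# Load the data and parse the date column while reading.
df = pd.read_csv(raw_csv, parse_dates=["date"])

# Explore the number of rows & columns and the ranges of values.
print(df.shape)
print(df.describe())

# Assumed rule for this example: a negative unit count is invalid, so mark it as missing.
df.loc[df["units"] < 0, "units"] = float("nan")

# Drop rows where the unit count is missing (filling in a value is another option).
clean = df.dropna(subset=["units"]).copy()

# Create additional columns derived from existing ones.
clean["revenue"] = clean["units"] * clean["unit_price"]
clean["month"] = clean["date"].dt.month

print(clean)
```

Whether to drop or fill missing values depends on the dataset; dropping is used here only because it keeps the example short.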
Step 3: Perform exploratory analysis & visualization
- Compute the mean, sum, range and other interesting statistics for numeric columns
- Explore distributions of numeric columns using histograms etc.
- Explore relationships between columns using scatter plots, bar charts etc.
- Make a note of interesting insights from the exploratory analysis
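A minimal sketch of the statistics and plots listed above, using a small made-up data frame. The column names and values are assumptions. The `matplotlib.use("Agg")` line is only there so the snippet runs as a plain script without a display; it is usually unnecessary inside a Jupyter notebook.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; usually not needed in a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical numeric data for illustration.
df = pd.DataFrame({
    "height_cm": [150, 160, 165, 170, 180],
    "weight_kg": [50, 58, 63, 70, 82],
})

# Mean, sum and range (max minus min) for each numeric column.
summary = pd.DataFrame({
    "mean": df.mean(),
    "sum": df.sum(),
    "range": df.max() - df.min(),
})
print(summary)

fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: distribution of a single numeric column.
ax_hist.hist(df["height_cm"], bins=5)
ax_hist.set_xlabel("Height (cm)")
ax_hist.set_ylabel("Count")
ax_hist.set_title("Distribution of height")

# Scatter plot: relationship between two numeric columns.
ax_scatter.scatter(df["height_cm"], df["weight_kg"])
ax_scatter.set_xlabel("Height (cm)")
ax_scatter.set_ylabel("Weight (kg)")
ax_scatter.set_title("Height vs. weight")

fig.tight_layout()
```

Labelled axes and titles make it easier to write up the insights from each graph in a Markdown cell afterwards.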
Step 4: Ask & answer questions about the data
- Ask at least 5 interesting questions about your dataset
- Answer the questions either by computing the results using NumPy/Pandas or by plotting graphs using Matplotlib/Seaborn
- Create new columns, merge multiple datasets and perform grouping/aggregation wherever necessary
- Wherever you’re using a library function from Pandas/NumPy/Matplotlib etc., explain briefly what it does
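As one possible shape for a question-and-answer cell, the sketch below asks "Which region has the highest total sales?" and answers it with a merge followed by a grouped aggregation. The two tables and their column names are hypothetical; the comments show the kind of brief function explanations the last point asks for.

```python
import pandas as pd

# Hypothetical tables for illustration.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5],
    "store_id": [10, 10, 20, 20, 30],
    "amount": [100.0, 50.0, 70.0, 30.0, 20.0],
})
stores = pd.DataFrame({
    "store_id": [10, 20, 30],
    "region": ["North", "South", "North"],
})

# Question: which region has the highest total sales?

# merge joins the two tables on the shared store_id column;
# how="left" keeps every order even if its store were missing from `stores`.
merged = orders.merge(stores, on="store_id", how="left")

# groupby splits the rows by region; sum adds up the amounts within each group;
# sort_values orders the regions from highest to lowest total.
sales_by_region = (
    merged.groupby("region")["amount"].sum().sort_values(ascending=False)
)

# idxmax returns the index label (here, the region name) of the largest value.
top_region = sales_by_region.idxmax()

print(sales_by_region)
print("Top region:", top_region)
```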
Step 5: Summarize your inferences & write a conclusion
- Write a summary of what you’ve learned from the analysis
- Include interesting insights and graphs from previous sections
- Share ideas for future work on the same topic using other relevant datasets
- Share links to resources you found useful during your analysis
Step 6: Make a submission & share your work
Upload your notebook to your Jovian.ml profile using
Make a submission here: https://jovian.ml/learn/data-analysis-with-python-zero-to-pandas/assignment/course-project
Share your work on the forum: Course Project on Exploratory Data Analysis - Discuss and Share Your Work
Browse through projects shared by other participants and give feedback
(Optional) Step 7: Write a blog post
- A blog post is a great way to present and showcase your work.
- Sign up on Medium.com to write a blog post for your project.
- Copy over the explanations from your Jupyter notebook into your blog post, and embed code cells & outputs
- Check out the Jovian.ml Medium publication for inspiration: https://medium.com/jovianml
Use the following resources for finding interesting datasets:
- Recommended datasets for the course project
- Kaggle datasets
- UCI Machine Learning Repository
- Google Dataset Search
- Your personal data from online services
Refer to these projects for inspiration:
Analyzing your browser history using Pandas & Seaborn by Kartik Godawat
WhatsApp Chat Data Analysis by Prajwal Prashanth
Understanding the Gender Divide in Data Science Roles by Aakanksha N S
Your submission will be evaluated using the following criteria:
- Dataset must contain at least 3 columns and 150 rows of data
- You must ask and answer at least 5 questions about the dataset
- Your submission must include at least 5 visualizations (graphs)
- Your submission must include explanations using markdown cells, apart from the code.
- Your work must not be plagiarized, i.e. copy-pasted from somewhere else.