The information in this GitHub repository is part of the materials for the subject High Performance Data Processing (SECP3133). This folder contains general Exploratory Data Analysis (EDA) information as well as EDA case studies using Malaysian datasets. This case study was created by a Bachelor of Computer Science (Data Engineering) student at Universiti Teknologi Malaysia. In addition, my research group also contributed materials and case studies. Thank you to the collaborators who shared their knowledge in this repository.
- EDA involves using graphics and visualizations to explore and analyze a dataset. The goal is to explore, investigate and learn, as opposed to confirming statistical hypotheses.
- EDA is used by data scientists to analyze and explore datasets and summarize their main characteristics.
- EDA makes it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.
- EDA is primarily used to provide a better understanding of a dataset's variables and the relationships between them.
- EDA can also help determine whether the statistical techniques you are considering are appropriate for the data.
- Developed by the American mathematician John Tukey in the 1970s, EDA techniques are still widely used in the data exploration process today.
- The main purpose of EDA is to help you look at the data before making any assumptions. Besides helping you understand patterns in the data and detect unusual events, it also helps you find interesting relationships between variables.
- Data scientists can use exploratory analysis to ensure that the results they produce are valid and relevant to the desired business outcomes and goals.
- EDA also helps stakeholders by verifying that they are asking the right questions.
- EDA can help answer questions about standard deviations, categorical variables, and confidence intervals.
- After the exploratory analysis is complete and insights have been drawn, its findings can be used for more complex data analysis or modeling, including machine learning.
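
As a rough illustration of the kinds of questions mentioned above (standard deviations, categorical variables, confidence intervals), here is a minimal sketch using Pandas and NumPy. The data frame and its column names (`state`, `income`) are invented for this example and do not come from any of the case studies. The confidence interval uses a simple normal approximation, which is only an assumption; for a sample this small a t-based interval would be more appropriate.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "state": ["Johor", "Selangor", "Johor", "Perak", "Selangor", "Johor"],
    "income": [4200.0, 6100.0, 3900.0, 3500.0, 5800.0, 4500.0],
})

# Spread of a numeric variable (sample standard deviation, ddof=1).
income_std = df["income"].std()

# Frequency of each level of a categorical variable.
state_counts = df["state"].value_counts()

# Rough 95% confidence interval for the mean, assuming the normal
# approximation: mean +/- 1.96 * standard error.
mean = df["income"].mean()
sem = income_std / np.sqrt(len(df))
ci_low, ci_high = mean - 1.96 * sem, mean + 1.96 * sem

print(f"std = {income_std:.1f}")
print(state_counts)
print(f"95% CI for mean income: ({ci_low:.1f}, {ci_high:.1f})")
```

On a real dataset the same three calls (`std`, `value_counts`, and a standard-error calculation) are usually a reasonable first look before any plotting.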
- developers.google: Good Data Analysis
- Towardsdatascience: What is Exploratory Data Analysis?
- Wikipedia: Exploratory data analysis
- r4ds: Exploratory Data Analysis
- careerfoundry: What Is Exploratory Data Analysis?
- How To Conduct Exploratory Data Analysis in 6 Steps
- A Five-Step Guide for Conducting Exploratory Data Analysis
- simplilearn: What is Exploratory Data Analysis? Steps and Market Analysis
- Exploratory Data Analysis (EDA): Types, Tools, Process
- projectpro: Exploratory Data Analysis in Python-Stop, Drop and Explore
- medium.com: 10 Things to do when conducting your Exploratory Data Analysis (EDA)
- towardsdatascience.com: An Extensive Step by Step Guide to Exploratory Data Analysis
- EDA - Exploratory Data Analysis: Using Python Functions
- Step-by-Step Exploratory Data Analysis (EDA) using Python
- Exploratory Data Analysis Tutorial | What Is EDA | How EDA Works | EDA In Python | Intellipaat
- Live Day 1-Live Session On EDA And Feature Engineering- Zomato Dataset
- Live Day 2-Live Session On EDA And Feature Engineering- Black Friday Dataset
- Live Day 3-Live Session On EDA And Feature Engineering- Flight Price Prediction Dataset
- Step By Step Process In EDA And Feature Engineering In Data Science Projects
- Exploratory Data Analysis(EDA) of Titanic dataset
- Exploratory Data Analysis (EDA) Using Python | Python Data Analysis | Python Training | Edureka
- Exploratory Data Analysis with Pandas Python
- How to Do Data Exploration (step-by-step tutorial on real-life dataset)
- Exploratory Data Analysis (Step by Step)
- A Simple Tutorial on Exploratory Data Analysis
- Intro to Exploratory data analysis (EDA) in Python
- Topic 1. Exploratory Data Analysis with Pandas
- Detailed exploratory data analysis with python
- EDA using Python Pandas
- Pandas: EDA of Cars Dataset
- Step-by-step Data Preprocessing & EDA
- PacktPublishing/Hands on Exploratory Data analysis with Python
- code4kunal/eda-python-examples
- SouRitra01/Exploratory-Data-Analysis-EDA-in-Banking-Using-Python
- sandipanpaul21/EDA-in-Python
- vharivinay/python-eda-viz
- demonpratapdemon/Exploratory-Data-Analysis-EDA-and-PreProcessing
- PacktPublishing/Python-for-Data-Analysis-step-by-step-with-projects-
- sandyy2505/Cardio Good Fitness Project
- ajaymache/Data analysis of used car database
No | Dataset | Colab | GitHub |
---|---|---|---|
1 | Boston | ||
2 | Car Features and MSRP | ||
3 | Housing Dataset | ||
4 | United Nations Development Corporation | ||
Your submission will be evaluated using the following criteria:
- Dataset must contain at least 5 columns and 1500 rows of data
- You must ask and answer at least 5 questions about the dataset
- Your submission must include at least 5 visualizations (graphs)
- Your submission must include explanations using markdown cells, apart from the code.
- Your work must not be plagiarized i.e. copy-pasted from somewhere else.
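
The size requirement above can be checked quickly before you commit to a dataset. The following is only a sketch: the helper name `meets_size_criteria` and the generated `example` frame are hypothetical and stand in for whatever dataset you load.

```python
import pandas as pd

# Thresholds taken from the evaluation criteria above.
MIN_COLUMNS = 5
MIN_ROWS = 1500


def meets_size_criteria(df: pd.DataFrame) -> bool:
    """Return True if the frame has at least MIN_ROWS rows and MIN_COLUMNS columns."""
    n_rows, n_cols = df.shape
    return n_rows >= MIN_ROWS and n_cols >= MIN_COLUMNS


# Hypothetical stand-in for a real dataset: 1500 rows, 5 columns.
example = pd.DataFrame({f"col_{i}": range(1500) for i in range(5)})
print(example.shape, meets_size_criteria(example))
```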
Follow this step-by-step guide to work on your project.
- A Malaysian dataset must be used for your case study.
- The dataset is available at:
- Load the dataset into a data frame using Pandas
- Explore the number of rows & columns, ranges of values etc.
- Handle missing, incorrect and invalid data
- Perform any additional steps (parsing dates, creating additional columns, merging multiple datasets etc.)
- Compute the mean, sum, range and other interesting statistics for numeric columns
- Explore distributions of numeric columns using histograms etc.
- Explore relationships between columns using scatter plots, bar charts etc.
- Make a note of interesting insights from the exploratory analysis
- Ask at least 5 interesting questions about your dataset
- Answer the questions either by computing the results using Numpy/Pandas or by plotting graphs using Matplotlib/Seaborn
- Create new columns, merge multiple datasets and perform grouping/aggregation wherever necessary
- Wherever you're using a library function from Pandas/Numpy/Matplotlib etc. explain briefly what it does
- Write a summary of what you've learned from the analysis
- Include interesting insights and graphs from previous sections
- Share ideas for future work on the same topic using other relevant datasets
- Share links to resources you found useful during your analysis
- Upload your notebook to GitHub.
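
The loading, cleaning, statistics, plotting and grouping steps in the guide above can be sketched roughly as follows. This is a minimal outline and not a model submission: the inline CSV, its column names (`date`, `state`, `population`, `hospital_beds`) and all of its values are invented for illustration, and treating negative values as invalid and filling gaps with the median are assumptions you would need to justify for your own data.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless backend; omit this line inside Colab/Jupyter
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical inline CSV standing in for a real dataset (values are made up).
RAW_CSV = """date,state,population,hospital_beds
2021-01-01,Johor,4010,5200
2021-01-01,Selangor,6990,7100
2021-01-01,Perak,2500,
2022-01-01,Johor,4030,5350
2022-01-01,Selangor,7050,-1
2022-01-01,Perak,2510,3900
"""

# Load the data into a data frame; read_csv parses the date column for us.
df = pd.read_csv(io.StringIO(RAW_CSV), parse_dates=["date"])

# Explore size, column types and value ranges.
print(df.shape)
print(df.dtypes)
print(df.describe())

# Handle invalid data (negative bed counts become missing),
# then fill missing values with the column median.
df["hospital_beds"] = df["hospital_beds"].mask(df["hospital_beds"] < 0)
df["hospital_beds"] = df["hospital_beds"].fillna(df["hospital_beds"].median())

# Create additional columns from existing ones.
df["year"] = df["date"].dt.year
df["beds_per_person"] = df["hospital_beds"] / df["population"]

# Mean, sum and range for a numeric column.
stats = df["population"].agg(["mean", "sum", "min", "max"])
value_range = stats["max"] - stats["min"]
print(stats)
print("range:", value_range)

# Distribution (histogram) and relationship (scatter plot).
fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))
df["population"].plot.hist(ax=ax_hist, bins=5, title="Population distribution")
df.plot.scatter(x="population", y="hospital_beds", ax=ax_scatter,
                title="Beds vs population")
fig.tight_layout()

# Grouping/aggregation to answer a question such as
# "which state has the most beds on average?"
by_state = (
    df.groupby("state")["hospital_beds"].mean().sort_values(ascending=False)
)
print(by_state)
```

In a notebook, each of these stages would normally sit in its own cell with a markdown cell explaining what the Pandas or Matplotlib call does, as the guide asks.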
Name | Title | Colab | GitHub |
---|---|---|---|
Li Jing | ABC | ||
Saleh Dhekre Saber Saleh | ABC | ||
Eman Al Jabarti | ABC | ||
Anwar Said Salim Al Talaii | ABC | ||
Zhu Caihua | ABC | ||
Shiekhah AL Binali | ABC | ||
Li Haopeng | ABC | ||
Please create an Issue for any improvements, suggestions or errors in the content.
You can also contact me on LinkedIn with any other queries or feedback.