Thread by Jaydeep Karale
- Tweet
- Dec 20, 2022
- #MachineLearning #DataScience #ComputerScience
💽 DATA IS KEY to Machine Learning 🤖
But in a world where data is the new oil, let's understand 6️⃣ issues with Machine Learning data⛔
{ Insufficient Data }
🔵 Even in a world full of data, insufficient-data problems still exist
🟠 Models trained with insufficient data perform poorly in the real world
🟢 Insufficient data also leads to either overfitting or underfitting
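A quick way to see the overfitting side of this is a minimal sketch, assuming a made-up noisy linear process (`noisy_line` is hypothetical, not from any real dataset). A flexible model fitted to only 5 points matches them almost perfectly but does worse on fresh points:

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_line(x):
    # Hypothetical ground truth: y = 2x plus Gaussian noise
    return 2 * x + rng.normal(scale=0.3, size=x.shape)

# Only 5 training points, but a flexible degree-4 polynomial
x_train = np.linspace(0, 1, 5)
y_train = noisy_line(x_train)
coeffs = np.polyfit(x_train, y_train, deg=4)

# Fresh points drawn from the same process
x_test = np.linspace(0.05, 0.95, 50)
y_test = noisy_line(x_test)

train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
print(f"train MSE: {train_mse:.4f}, test MSE: {test_mse:.4f}")
```

The polynomial passes through all 5 training points, noise included, so the training error says nothing about real-world performance.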
{ Too Much Data }
🔵 Too much data also presents its own set of challenges, such as:
🟠 Old, outdated data that is no longer relevant
🟢 The curse of dimensionality, i.e. too many features that are useless or less relevant
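Both points can be sketched with Pandas. This is only an illustration: the column names and the cutoff date are assumptions, and dropping constant columns is just one simple way to trim useless features, not a full answer to the curse of dimensionality.

```python
import pandas as pd

df = pd.DataFrame({
    "recorded_at": pd.to_datetime(["2015-01-01", "2022-06-01", "2022-11-15"]),
    "feature_a": [1.0, 2.5, 3.1],
    "constant_col": [7, 7, 7],  # same value everywhere, carries no information
})

# Keep only recent rows (the cutoff is a hypothetical choice)
cutoff = pd.Timestamp("2022-01-01")
recent = df[df["recorded_at"] >= cutoff]

# Drop columns that have a single unique value
informative = recent.loc[:, recent.nunique() > 1]
print(informative)
```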
{ Non-representative Data }
🔵 ML is simple: if you feed in garbage data, you get garbage output.
🟠 So inaccurate or non-representative data leads to poor models
🟢 Selecting relevant data is a key skill
{ Missing Data }
🔵 Data is key, so missing values are a big problem
🟠 Data cleaning addresses this by substituting missing values using various techniques
🟢 Substitution may lead to bias & hence poor accuracy
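Median imputation is one of the simplest substitution techniques. Here is a minimal sketch with Pandas; the `age` column and its values are made up for illustration. Note how the filled column ends up less spread out than the original, which is one way substitution introduces bias:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [25.0, np.nan, 31.0, 40.0, np.nan]})

# Record which rows were missing before filling them in
df["age_was_missing"] = df["age"].isna()

# Median imputation: simple, but it shrinks the spread of the column
df["age_filled"] = df["age"].fillna(df["age"].median())

print(df)
print("std before:", df["age"].std(), "std after:", df["age_filled"].std())
```

Keeping the `age_was_missing` flag lets a model still see which values were imputed.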
{ Duplicate Data }
🔵 Duplication of data is another major problem.
🟠 Removing duplicates is easy using Pandas
🟢 What matters is how much clean & relevant data remains after removing duplicates
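A minimal sketch of both steps with Pandas, on a made-up table (the `user` and `score` columns are assumptions): drop exact duplicate rows, then check how much data is left.

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["a", "b", "a", "c", "b"],
    "score": [10, 20, 10, 30, 20],
})

# Drop rows that are exact copies of an earlier row
deduped = df.drop_duplicates()

# Check how much data survives the cleanup
kept_fraction = len(deduped) / len(df)
print(f"kept {len(deduped)} of {len(df)} rows ({kept_fraction:.0%})")
```

To treat rows as duplicates based on only some columns, `drop_duplicates` also accepts a `subset` argument.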
{ Outliers }
🔵 Outliers are data points that differ significantly from the rest of the data
🟠 e.g. for temperature data from India, which ranges from 1 to 45 degrees Celsius, -60 or +60 is an outlier
🟢 Understanding the nature of outlier data is a problem ML engineers have to solve
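The temperature example can be sketched as a simple range check with Pandas. The readings are made up, and a fixed range is only one way to flag outliers. Whether a flagged value is a sensor error or a real event still needs a human decision.

```python
import pandas as pd

temps = pd.Series([12.0, 30.5, 44.0, -60.0, 60.0, 25.0], name="temp_c")

# Plausible range taken from the India example: 1 to 45 degrees Celsius
LOW, HIGH = 1, 45

# between() is inclusive on both ends by default
is_outlier = ~temps.between(LOW, HIGH)

print(temps[is_outlier].tolist())
```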
Hello 👋
I am Jaydeep from India 🇮🇳
Full-time Software Engineer & part-time content creator on
🐦Twitter
🖧 LinkedIn
🎥YouTube
Follow me for content on
🐍 Python
🤖 AI/ML
🎨Data Visualization
🌟Content creation
Subscribe To My YouTube🔽
youtu.be/FLdS-kBt88M
Mentions
Afiz ⚡️ @itsafiz
·
Dec 20, 2022
Very well written thread. 👏