Machine Learning

Topic: Handling Datasets of All Types – Part 2 of 5: Data Cleaning and Preprocessing

---

1. Importance of Data Cleaning

• Real-world data is often noisy, incomplete, or inconsistent.

• Cleaning improves data quality and model performance.

---

2. Handling Missing Data

• Detect missing values using isnull() or isna() in pandas.

• Strategies to handle missing data:

* Remove rows or columns with missing values:

df.dropna(inplace=True)

* Impute missing values with mean, median, or mode:

df['column'] = df['column'].fillna(df['column'].mean())
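
The two strategies above can be sketched end to end as follows. This is a minimal sketch on a hypothetical toy DataFrame (the column names `age` and `city` are assumptions made for illustration): count the missing values first, then impute numeric columns with the median and categorical columns with the mode.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],
    "city": ["NY", "LA", None, "LA"],
})

# Count missing values per column before deciding how to handle them.
missing_counts = df.isna().sum()
print(missing_counts)

# Numeric column: impute with the median (less sensitive to outliers than the mean).
df["age"] = df["age"].fillna(df["age"].median())

# Categorical column: impute with the mode (the most frequent value).
df["city"] = df["city"].fillna(df["city"].mode()[0])

print(df)
```

Assigning the result back to the column, rather than calling fillna with inplace=True on a selected column, avoids modifying a temporary copy.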

---

3. Handling Outliers

• Outliers can skew analysis and model results.

• Detect outliers using:

* Boxplots
* Z-score method
* IQR (Interquartile Range)

• Handle by removal or transformation (e.g., a log transform or clipping values to a bound).
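
As a rough illustration of the IQR and Z-score methods listed above, the sketch below flags outliers in a small hypothetical Series and then shows both removal and clipping. The data, the 1.5 multiplier, and the Z-score threshold of 2.0 are assumptions chosen for the example, not fixed rules.

```python
import pandas as pd

# Hypothetical data with one obvious outlier.
values = pd.Series([10.0, 12.0, 11.0, 13.0, 12.0, 11.0, 95.0], name="value")

# IQR rule: flag points more than 1.5 * IQR outside the quartiles.
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
iqr_outliers = values[(values < lower) | (values > upper)]

# Z-score rule: flag points more than `threshold` standard deviations from the mean.
# 3.0 is a common choice, but very small samples may never reach it.
threshold = 2.0
z_scores = (values - values.mean()) / values.std()
z_outliers = values[z_scores.abs() > threshold]

# Option 1: removal keeps only the values inside the IQR fences.
cleaned = values[(values >= lower) & (values <= upper)]

# Option 2: transformation, here clipping extreme values to the fences.
clipped = values.clip(lower=lower, upper=upper)

print(iqr_outliers)
print(z_outliers)
```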

---

4. Data Normalization and Scaling

• Many ML models, especially distance-based ones (e.g., k-NN) and those trained with gradient descent, work better when features are on a similar scale.

• Common techniques:

* Min-Max Scaling (scales values between 0 and 1)

* Standardization (mean = 0, std = 1)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
# Note: fit_transform returns a NumPy array, not a DataFrame
df_scaled = scaler.fit_transform(df[['feature1', 'feature2']])
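
To make the two formulas concrete, here is a small pandas-only sketch of what each technique computes. The DataFrame is hypothetical. Note the ddof=0 argument: StandardScaler uses the population standard deviation, whereas pandas defaults to the sample standard deviation (ddof=1).

```python
import pandas as pd

# Hypothetical numeric features.
df = pd.DataFrame({"feature1": [1.0, 2.0, 3.0], "feature2": [10.0, 20.0, 40.0]})

# Min-Max scaling: (x - min) / (max - min), maps each column onto [0, 1].
df_minmax = (df - df.min()) / (df.max() - df.min())

# Standardization: (x - mean) / std, giving each column mean 0 and std 1.
# ddof=0 selects the population standard deviation.
df_standard = (df - df.mean()) / df.std(ddof=0)

print(df_minmax)
print(df_standard)
```

In a real project the scaling statistics would normally be computed on the training split only and then reused on the test split.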

---

5. Encoding Categorical Variables

• Convert categorical data into numerical:

* Label Encoding: Assigns an integer to each category. The integers imply an order, so this suits ordinal categories best.

* One-Hot Encoding: Creates one binary column per category.

pd.get_dummies(df['category_column'])
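
A minimal side-by-side sketch of both encodings, using only pandas. The color values are hypothetical, and the label encoding here uses pandas category codes as one possible approach (scikit-learn's LabelEncoder is another).

```python
import pandas as pd

# Hypothetical categorical data.
df = pd.DataFrame({"category_column": ["red", "green", "blue", "green"]})

# Label encoding: categories are sorted (blue, green, red) and numbered 0..n-1.
df["category_code"] = df["category_column"].astype("category").cat.codes

# One-hot encoding: one 0/1 column per category, with a readable prefix.
one_hot = pd.get_dummies(df["category_column"], prefix="category", dtype=int)

print(df)
print(one_hot)
```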

---

6. Summary

• Data cleaning is essential for reliable modeling.

• Handling missing values, outliers, scaling, and encoding are key preprocessing steps.

---

Exercise

• Load a dataset, identify missing values, and apply mean imputation.

• Detect outliers using IQR and remove them.

• Normalize numeric features using standardization.

---

#DataCleaning #DataPreprocessing #MachineLearning #Python #DataScience

https://xn--r1a.website/DataScienceM


Python Commands for Data Cleaning

#Python #DataCleaning #DataAnalytics #DataScientists #MachineLearning #ArtificialIntelligence #DataAnalysis


             Age
count   5.000000
mean   30.000000
std     6.363961
min    22.000000
25%    26.000000
50%    29.000000
75%    35.000000
max    38.000000

---

10. df.columns
Returns the column labels of the DataFrame.

import pandas as pd
df = pd.DataFrame({'Name': [], 'Age': [], 'City': []})
print(df.columns)

Index(['Name', 'Age', 'City'], dtype='object')

---

11. df.dtypes
Returns the data type of each column.

import pandas as pd
df = pd.DataFrame({'Name': ['Alice'], 'Age': [25], 'Salary': [75000.50]})
print(df.dtypes)

Name       object
Age         int64
Salary    float64
dtype: object

---

12. Selecting a Column
Select a single column, which returns a Pandas Series.

import pandas as pd
data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)
ages = df['Age']
print(ages)

0    25
1    30
Name: Age, dtype: int64

#DataSelection #Indexing #Statistics

---

13. df.loc[]
Access a group of rows and columns by label(s) or a boolean array.

import pandas as pd
data = {'Age': [25, 30, 35], 'City': ['NY', 'LA', 'CH']}
df = pd.DataFrame(data, index=['Alice', 'Bob', 'Charlie'])
print(df.loc['Bob'])

Age     30
City    LA
Name: Bob, dtype: object
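
The example above selects by label only. As a small sketch of the boolean-array form mentioned in the description, reusing the same data:

```python
import pandas as pd

data = {'Age': [25, 30, 35], 'City': ['NY', 'LA', 'CH']}
df = pd.DataFrame(data, index=['Alice', 'Bob', 'Charlie'])

# Boolean mask: rows where Age is at least 30, keeping only the City column.
older_cities = df.loc[df['Age'] >= 30, 'City']
print(older_cities)

# Label slice: unlike ordinary Python slices, both endpoints are included.
first_two = df.loc['Alice':'Bob']
print(first_two)
```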

---

14. df.iloc[]
Access a group of rows and columns by integer position(s).

import pandas as pd
data = {'Age': [25, 30, 35], 'City': ['NY', 'LA', 'CH']}
df = pd.DataFrame(data, index=['Alice', 'Bob', 'Charlie'])
print(df.iloc[1]) # Get the second row (position 1)

Age     30
City    LA
Name: Bob, dtype: object
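
A brief sketch of positional slicing with the same data, for comparison with df.loc[]:

```python
import pandas as pd

data = {'Age': [25, 30, 35], 'City': ['NY', 'LA', 'CH']}
df = pd.DataFrame(data, index=['Alice', 'Bob', 'Charlie'])

# Rows at positions 0 and 1, first column only; the end position is excluded.
first_two_ages = df.iloc[0:2, 0]
print(first_two_ages)

# Negative positions count from the end, as with Python lists.
last_row = df.iloc[-1]
print(last_row)
```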

---

15. df.isnull()
Returns a DataFrame of the same shape with boolean values indicating if a value is missing (NaN).

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, np.nan], 'B': [3, 4]})
print(df.isnull())

       A      B
0  False  False
1   True  False

---

16. df.dropna()
Removes rows (by default) or columns that contain missing values.

import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, np.nan, 3], 'B': [4, 5, 6]})
cleaned_df = df.dropna()
print(cleaned_df)

     A  B
0  1.0  4
2  3.0  6
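
Dropping every row with any missing value can discard a lot of data. The sketch below, on a hypothetical DataFrame, shows a few narrower options using the axis, how, subset, and thresh parameters:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'A': [1, np.nan, 3],
    'B': [4, 5, np.nan],
    'C': [np.nan, np.nan, np.nan],
})

# Drop only the columns that are entirely missing.
no_empty_cols = df.dropna(axis=1, how='all')

# Drop rows only when a specific column is missing.
rows_with_a = df.dropna(subset=['A'])

# Keep rows that have at least two non-missing values.
at_least_two = df.dropna(thresh=2)

print(no_empty_cols)
print(rows_with_a)
print(at_least_two)
```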

#DataCleaning #MissingData

---

17. df.fillna()
Fills missing (NaN) values with a specified value. To propagate neighboring values instead, use df.ffill() or df.bfill().

import pandas as pd
import numpy as np
df = pd.DataFrame({'Score': [90, 85, np.nan, 92]})
filled_df = df.fillna(0)
print(filled_df)

   Score
0   90.0
1   85.0
2    0.0
3   92.0
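
Filling everything with 0 is rarely the right choice for real data. As a hedged sketch on a hypothetical DataFrame (the Grade column is an assumption added for illustration), a dict gives each column its own fill value, and ffill() carries the previous value forward:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Score': [90, 85, np.nan, 92],
    'Grade': ['A', None, 'B', 'A'],
})

# A dict fills each column with its own value.
filled = df.fillna({'Score': df['Score'].mean(), 'Grade': 'unknown'})
print(filled)

# Forward fill copies the last valid value downward.
forward = df['Score'].ffill()
print(forward)
```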

---

18. df.drop_duplicates()
Removes duplicate rows from the DataFrame.

import pandas as pd
data = {'Name': ['Alice', 'Bob', 'Alice'], 'Age': [25, 30, 25]}
df = pd.DataFrame(data)
unique_df = df.drop_duplicates()
print(unique_df)

    Name  Age
0  Alice   25
1    Bob   30
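
By default a row counts as a duplicate only if every column matches. The sketch below uses slightly different hypothetical data to show the subset and keep parameters, plus duplicated() for inspecting rows before dropping them:

```python
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Alice'], 'Age': [25, 30, 26]}
df = pd.DataFrame(data)

# Flag rows whose Name has already appeared earlier.
dup_mask = df.duplicated(subset=['Name'])
print(dup_mask)

# Compare on Name only and keep the last occurrence of each name.
latest = df.drop_duplicates(subset=['Name'], keep='last')
print(latest)
```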

---

19. df.rename()
Alters axis labels (e.g., column names).

import pandas as pd
df = pd.DataFrame({'A': [1], 'B': [2]})
renamed_df = df.rename(columns={'A': 'Column_A', 'B': 'Column_B'})
print(renamed_df)

   Column_A  Column_B
0         1         2

---

20. series.value_counts()
Returns a Series containing counts of unique values.
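
The original example for this entry is not included here, so the following is only an illustrative sketch on a hypothetical Series of city codes:

```python
import pandas as pd

cities = pd.Series(['NY', 'LA', 'NY', 'CH', 'NY', 'LA'], name='City')

# Counts are sorted from most to least frequent.
counts = cities.value_counts()
print(counts)

# normalize=True returns proportions instead of raw counts.
shares = cities.value_counts(normalize=True)
print(shares)
```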


📌 I Cleaned a Messy CSV File Using Pandas. Here’s the Exact Process I Follow Every Time.

🗂 Category: DATA SCIENCE

🕒 Date: 2025-11-26 | ⏱️ Read time: 17 min read

Stop guessing when cleaning messy CSV files. This article details a repeatable 5-step workflow using Python's Pandas library to systematically diagnose and fix data quality issues. Learn a structured, practical process to transform your data preparation, moving from haphazard fixes to a reliable methodology for any data professional.

#Python #Pandas #DataCleaning #DataScience
