Outliers are data points that differ significantly from the rest of the values in a dataset. They are unusually high or low compared to the majority and can distort summary statistics such as the mean and standard deviation, which in turn affects the accuracy of your analysis.
For example, if most students score between 60 and 90 on a test, but one student scores 10, that 10 is likely an outlier.
—
🔍 How to Identify Outliers:
You can detect outliers using several common methods:
1. Visual methods:
- Box plot: Outliers appear as dots outside the “whiskers” of the box.
- Scatter plot: Outliers stand far away from the main cluster of points.
2. Statistical methods:
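- Z-score: Measures how many standard deviations a data point lies from the mean: z = (value − mean) / standard deviation. A score above 3 or below -3 is often considered an outlier.
- IQR (Interquartile Range): The spread of the middle 50% of the data, IQR = Q3 − Q1.
Outliers fall below Q1 − 1.5×IQR or above Q3 + 1.5×IQR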
3. Domain knowledge:
Sometimes, a value may look extreme but is valid based on real-world context. Always consider the background before deciding whether to treat it as an outlier.
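The z-score rule from the list above can be sketched in a few lines of Python. This is only a minimal illustration, not a definitive implementation: the function name `zscore_outliers` is hypothetical, the scores are made-up numbers that mimic the test-score example (most between 60 and 90, one score of 10), and it assumes the sample standard deviation is the one you want.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation (assumption)
    if stdev == 0:
        return []  # all values are identical, so nothing can stand out
    return [v for v in values if abs((v - mean) / stdev) > threshold]

# Hypothetical test scores (invented for illustration):
# most fall between 60 and 90, one student scored 10.
scores = [60, 62, 65, 68, 70, 72, 74, 75, 76, 78,
          80, 81, 83, 85, 86, 88, 90, 71, 79, 10]

flagged = zscore_outliers(scores)
print(flagged)  # [10]
```

The threshold of 3 is a convention, not a law. Also keep in mind that the outlier itself inflates the mean and standard deviation, so the z-score rule tends to work better on larger samples.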
Let’s say you have the following data on daily sales:
45, 48, 50, 47, 49, 100
Here, 100 is about twice as large as every other value, so it stands out from the rest and may be an outlier.
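To check the sales example against the IQR rule, here is a small sketch. The function name `iqr_outliers` is hypothetical, and the quartiles come from Python's `statistics.quantiles` with `method="inclusive"` (linear interpolation between data points). Other quartile conventions exist and can shift the fences slightly, so treat the exact numbers as one reasonable choice.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return (outliers, (lower_fence, upper_fence)) using the k x IQR rule."""
    # "inclusive" interpolates between data points; other conventions
    # may give slightly different quartiles.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return outliers, (lower, upper)

sales = [45, 48, 50, 47, 49, 100]

outliers, (lower, upper) = iqr_outliers(sales)
print(lower, upper)  # 43.5 53.5
print(outliers)      # [100]

# For comparison: the z-score of 100 in this tiny sample.
z_of_100 = (100 - statistics.mean(sales)) / statistics.stdev(sales)
print(round(z_of_100, 2))  # 2.03
```

Notice that the IQR rule flags 100, while its z-score is only about 2.0 and stays under the usual cutoff of 3. With just six values, a single extreme point inflates the standard deviation so much that it cannot reach a z-score of 3, which is one reason the IQR rule is often preferred for small samples.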
—
✅ How to Handle Outliers:
- Investigate: Is it a typo or a valid value?
- Remove: If it’s an error or not relevant, you can exclude it from analysis.
- Transform: Use techniques like log transformation (for positive values) to reduce its impact.
- Use robust statistics: Median and IQR are less affected by outliers than mean and standard deviation.
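The options above can be compared on the daily sales data from earlier. The sketch below is only an illustration: it assumes that investigation has shown 100 to be an error before removing it, and the variable names (`cleaned`, `log_sales`) are made up for this example.

```python
import math
import statistics

sales = [45, 48, 50, 47, 49, 100]

# Robust statistics: the median barely notices the extreme value.
print(statistics.mean(sales))    # 56.5
print(statistics.median(sales))  # 48.5

# Remove: only appropriate if 100 turned out to be an error (assumption).
cleaned = [v for v in sales if v != 100]
print(statistics.mean(cleaned))    # 47.8
print(statistics.median(cleaned))  # 48

# Transform: a log transform compresses large values (requires values > 0).
log_sales = [math.log(v) for v in sales]
raw_spread = max(sales) - min(sales)          # 55 on the original scale
log_spread = max(log_sales) - min(log_sales)  # under 1 on the log scale
print(raw_spread, round(log_spread, 2))
```

Removing the single value moves the mean from 56.5 to 47.8, while the median changes only from 48.5 to 48. That gap is a quick way to see how much one outlier is driving your results.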