DBSCAN
After you have gathered all your courage to start a machine learning project, you first have to think about how and where to actually begin. The picture below shows a typical machine learning workflow [1]. The very beginning of every machine learning project is collecting and preparing the essential data so it can be processed by those fancy algorithms, such as neural networks or support vector machines.
Besides creating some funky first visualizations of your sourced dataset or engineering the features you will use, you also want to find out whether there are any sneaky outliers in your dataset.
This article provides a step-by-step guide on how to detect outliers in the dependency between two features (aka attributes or variables), also called a two-dimensional feature space.
This project was developed in the PyCharm IDE with Python 3.7 as the project interpreter. However, you are free to apply the code in any other environment or notebook! The full project can be found here.
The first step is bringing in some visuals of the feature space we are dealing with. This project makes use of the wine dataset that ships with the framework scikit-learn.
We might not be enjoying wine as much as Amy Schumer does, but we still think wine is an amazingly tasty research object to investigate. The 2D feature space we will be examining is the dependency between the concentration of flavanoids (a class of metabolites) and the color intensity of wine.
The following code provides an initial visualization of the 2D feature space.
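The original code listing did not survive; here is a minimal sketch of such a visualization, assuming the standard `load_wine` loader from scikit-learn and selecting the two features by name (plot styling is my own choice):

```python
# Sketch of the initial scatter plot of the 2D feature space.
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine

wine = load_wine()

# Pick the two features by name rather than by hard-coded column index
flavanoids = wine.data[:, wine.feature_names.index("flavanoids")]
color_intensity = wine.data[:, wine.feature_names.index("color_intensity")]

plt.scatter(flavanoids, color_intensity)
plt.xlabel("Concentration of flavanoids")
plt.ylabel("Color intensity")
plt.title("Wine dataset: flavanoids vs. color intensity")
plt.show()
```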
Awesome! Now we have a first “feel” of the dependency we are dealing with.
We might already see some clusters and outliers in our first visualization, but to give our intuition a bit more statistical grounding, we enlist the help of a clustering algorithm. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm commonly used for outlier detection. Here, a data instance is considered an outlier if it does not belong to any cluster.
“DBSCAN algorithm requires 2 parameters — epsilon, which specifies how close points should be to each other to be considered a part of a cluster; and minPts, which specifies how many neighbors a point should have to be included into a cluster.” — alitouka
By adjusting and tweaking the model’s parameters epsilon and minPts (min_samples in Python), we can reveal some nice clusters in the feature space.
The following code clusters the 2D feature space.
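Since the original listing is missing, here is a hedged sketch of the clustering step using scikit-learn’s `DBSCAN`. The values for `eps` and `min_samples` are assumed starting points, not the article’s tuned parameters; in DBSCAN’s output, the label -1 marks noise points:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_wine

wine = load_wine()
cols = [wine.feature_names.index("flavanoids"),
        wine.feature_names.index("color_intensity")]
X = wine.data[:, cols]

# eps (neighbourhood radius) and min_samples are assumed starting values;
# tweak them until the clusters look reasonable for your data
model = DBSCAN(eps=0.5, min_samples=5)
labels = model.fit_predict(X)  # label -1 marks noise, i.e. outliers

# Colour each point by its cluster label
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("Concentration of flavanoids")
plt.ylabel("Color intensity")
plt.title("DBSCAN clusters (label -1 = noise)")
plt.show()
```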
Oh yes, and we have finally spotted them!
Wow!! Now we know that outliers really do exist in the dependency between the concentration of flavanoids and the color intensity of wine, yaaay (I still don’t actually get what flavanoids are, btw). Just one last step…
The following code calculates the outliers of the 2D feature space.
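The original listing is lost, so here is a minimal sketch of the extraction step: points labelled -1 by DBSCAN are collected and written to a CSV file. The `eps`/`min_samples` values and the output filename `outliers.csv` are my assumptions, so the outlier count may differ from the 11 reported below:

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_wine

wine = load_wine()
cols = [wine.feature_names.index("flavanoids"),
        wine.feature_names.index("color_intensity")]
X = wine.data[:, cols]

# Assumed parameters; tune for your own feature space
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1 — these are our outliers
outliers = X[labels == -1]
print(f"{len(outliers)} outliers found")

# Write the outliers to a separate CSV file
pd.DataFrame(outliers, columns=["flavanoids", "color_intensity"]) \
    .to_csv("outliers.csv", index=False)
```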
This code calculates the 11 outliers of the dataset and finally writes them to a separate CSV file.
This step-by-step guide for outlier detection was created during my thesis project on human-in-the-loop computing, an approach that combines human and machine intelligence. Stay tuned!
[1] Google Cloud, Machine learning workflow (2019)
[2] M. Ester, H.-P. Kriegel, J. Sander and X. Xu, A density-based algorithm for discovering clusters in large spatial databases with noise (1996), KDD’96: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining
[3] https://towardsdatascience.com/outlier-detection-python-cd22e6a12098