Monday, 5 January 2009

Implementing DBSCAN

The next stage is to implement the DBSCAN algorithm. DBSCAN is an algorithm for discovering clusters in point data. In my case I want to discover accident black spots in my accident dataset.

The algorithm was devised by Ester et al, in a paper A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.

Tan et al provide a good discussion of DBSCAN in a chapter of Introduction to Data Mining.

My intention is to implement DBSCAN in a VB.NET desktop application using as input a GML file saved from output of my WFS setup. In my actual implementation I will need to develop a web-based implementation of DBSCAN using an actual WFS request as input but I think this is a good starting point (debugging .NET desktop applications is easier than debugging web applications).

DBSCAN requires three inputs:
  • the data source
  • a parameter MinPts - which is the minimum number of points to define a cluster
  • a distance parameter, Eps - a distance parameter - if there are at least minPts within Eps of a point then that point is a core point in a cluster.
Now, Ester et al concluded that MinPts could safely be set to 4 for all 2D datasets. So MinPts is eliminated. The next step is to work out how to calculate Eps. The following was repeated with a number of susbsets of the full dataset:
  • Read in the GML file, saved from WFS, and store the points in an arraylist
  • Create a distance array that holds the distance of each point to every other point in the dataset
  • for each point calculate its 4th nearest point - what is called the 4-dist point
  • sort the 4-dist values in ascending order
  • plot them
This was repeated with three datasets:

  • All accidents in the (Manchester) City Centre ward (3236 points) graph
  • All accidents in the Heatons North ward (398 points) graph
  • All accidents in Greater Manchester in 2008 -up to September 30th (5309 points) graph

The first thing to note (gratifyingly!) is that all three graphs are similar in shape to that displayed in Tan et al. (Note that the graphs are the flipside of that displayed in Ester et al since they order the data in descending order.) The Eps value for a dataset can be estimated as the point where the graph "takes off". This is very much an estimate, but for the City Centre data it is about 50m, for the Heatons North data it is about 70m but for the whole of GM dataset the value is about 700m.

This value has to be provided to the DBSCAN algorithm before it can work out where the clusters are. As Ester et al state it is difficult to calculate this value automatically. So what's to be done? Some options:

  • Do we send the data to the WPS and get it to send back a graph from which we can estimate the value of Eps which we then send back to the WPS for it to calculate the clusters?
  • Do we assume the user "knows" his/her dataset and expect him/her to provide the value of Eps?
  • Can we make some estimate of Eps? If we look at the three graphs - the first two each cover a single ward. I think they are similar in area and have a similar value of Eps. The third graph covers a much larger area (all of GM) and returns a much larger value of Eps. Can this be used to estimate Eps?

Problems with large datasets
I did try with all accidents in Greater Manchester in 2000, 11998 points, but this meant an array of 11998 x 11998 which was too much for VB.NET. I need to have a think about this distance between points array - perhaps look at spatial indexing and doing the query in a spatial database - there's no point in calculating the distances between points that are nowhere near each other.

Conclusion
I don't want to get bogged down in the details of the processing part of the WPS, since the investigation is into the WPS itself, not any particular algorithm. So, I could simply specify that the user has to provide the value of Eps as a parameter for the WPS. However, I'd like to give it more consideration before taking this option - perhaps estimating the value of Eps if it is not supplied by the user.

2 comments:

  1. A discussion on the "Stateful" nature of desktop GIS and the "Stateless" nature of web services by Brauner (http://www.gi-tage.de/archive/2008/downloads/acceptedPapers/Papers/Brauner.pdf) highlights the problem. If I'm to do a two stage solution (1.get the data to do the Eps prediction graph then 2. supply the Eps value and return the clusters), then because WPS is stateless, I'll have to read in the dataset each time and repeat some of the processing. It would be handy to be able to save my distance array in a cache. Perhaps the option to save the data could be a parameter to the WPS which would return a process code which we could use with a subsequent call.

    ReplyDelete
  2. i need this data set for my experiment, can i get it on my mail id tariqali.1982@gmail.com

    ReplyDelete