Wednesday 28 January 2009

UML Experiments

I'm going to document my implementation using UML since it is prevalent in academia. I've been experimenting with the Poseidon for UML community edition modelling software, which was included with Teach Yourself UML in 24 Hours. The book gets panned on Amazon but I am up to hour 7 and I am fairly impressed so far. The important section for me will be sequence diagrams, which I have yet to reach (hour 9).

My first attempt at a (very basic) class diagram:
This is the point class used in my code that implements DBSCAN.
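
In code the class is along the lines of the sketch below (the member names are illustrative rather than the exact ones in my implementation):

Public Class Point
    ' Sketch of the point class in the diagram above.
    ' Member names are assumptions, not necessarily the actual code.
    Public X As Double          ' OSGB easting
    Public Y As Double          ' OSGB northing
    Public IsCore As Boolean    ' true if flagged as a core point
    Public ClusterId As Integer ' 0 = noise / unassigned

    Public Sub New(ByVal easting As Double, ByVal northing As Double)
        X = easting
        Y = northing
    End Sub

    ' Euclidean distance to another point
    Public Function DistanceTo(ByVal other As Point) As Double
        Return Math.Sqrt((X - other.X) ^ 2 + (Y - other.Y) ^ 2)
    End Function
End Class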

The software has an export function which created the above gif image. However, with the version I've got it puts a watermark on any exports:


This will not look too good in my thesis.

I'm going to look next at StarUML, an open source tool.

Tuesday 27 January 2009

More on DBSCAN

To recap, I'm attempting to implement the clustering algorithm DBSCAN as a desktop VB.NET application, reading the accident data from a GML file. I've managed to do this using a modified version of the method described by Tan et al (Algorithm 8.4, p. 528).

The implementation requires the user to supply a value for Eps. The process is as follows (a code sketch of the main steps follows the list):
  1. read all points into an arraylist of type point (where point is a user defined class)
  2. populate a distance array – containing the distance between each point and every other point
  3. create array for k-dist where k = 4 – i.e. for each point what is the distance to the 4th nearest neighbour?
  4. flag core points – i.e. where 4-dist is <= Eps
  5. flag border points – i.e. where the point is within Eps of a core point
  6. by default the remaining points are noise
  7. populate an array of linked core points – i.e. core points within Eps of each other
  8. where core points are linked put them in the same cluster
  9. add border points to same cluster as their nearest core point
(The code then generates INSERT statements that can be run against a postgreSQL database for plotting in QGIS).
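
In VB.NET the classification and linking steps look something like the sketch below. This is illustrative only – the names, and the stack-based flood-fill used to link the core points, are my shorthand for steps 4 to 9 rather than the exact code:

Imports System.Collections.Generic

Module DbscanSketch
    ' Sketch of steps 4 to 9. Assumes dist(i, j) already holds the
    ' distance between points i and j (step 2) and n is the point count.
    Sub ClusterPoints(ByVal dist(,) As Double, ByVal n As Integer, ByVal eps As Double)
        Dim isCore(n - 1) As Boolean
        Dim clusterId(n - 1) As Integer ' 0 = noise / unassigned

        ' step 4: flag core points - at least MinPts (4) neighbours within Eps
        For i As Integer = 0 To n - 1
            Dim neighbours As Integer = 0
            For j As Integer = 0 To n - 1
                If j <> i AndAlso dist(i, j) <= eps Then neighbours += 1
            Next
            isCore(i) = (neighbours >= 4)
        Next

        ' steps 7 and 8: core points within Eps of each other share a cluster
        Dim nextCluster As Integer = 0
        For i As Integer = 0 To n - 1
            If isCore(i) AndAlso clusterId(i) = 0 Then
                nextCluster += 1
                clusterId(i) = nextCluster
                Dim toVisit As New Stack(Of Integer)
                toVisit.Push(i)
                While toVisit.Count > 0
                    Dim c As Integer = toVisit.Pop()
                    For j As Integer = 0 To n - 1
                        If isCore(j) AndAlso clusterId(j) = 0 AndAlso dist(c, j) <= eps Then
                            clusterId(j) = nextCluster
                            toVisit.Push(j)
                        End If
                    Next
                End While
            End If
        Next

        ' steps 5, 6 and 9: border points join the cluster of their nearest
        ' core point; anything still at clusterId = 0 is noise
        For i As Integer = 0 To n - 1
            If Not isCore(i) Then
                Dim nearest As Integer = -1
                For j As Integer = 0 To n - 1
                    If isCore(j) AndAlso dist(i, j) <= eps Then
                        If nearest = -1 OrElse dist(i, j) < dist(i, nearest) Then nearest = j
                    End If
                Next
                If nearest >= 0 Then clusterId(i) = clusterId(nearest)
            End If
        Next
    End Sub
End Module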

The process was run on data from the Heatons North ward with Eps values of 10m and 20m. The following images focus on a particular section of road:

With an Eps of 10m:

With an Eps of 20m:


The core points are black, border points are yellow and noise points are crosses. The number indicates the cluster ID (the 10m Eps returns 14 clusters, the 20m Eps returns 19 clusters).

With the 10m Eps there is more noise and the cluster is more focussed – at 2 junctions instead of 3. The choice of Eps, then, is critical.

A large view of the 10m results can be seen here.

Summary
My implementation of DBSCAN will not be the most efficient but that doesn't matter. It's the implementation of WPS that I'm interested in.

The next stage is to implement the algorithm as a WPS. The first step will be to look at the input parameters of the WPS i.e. how can I make the accident data in WFS form available to a web service?


Saturday 10 January 2009

Useful references - GI Days 2008

I came across the GI Days 2008 conference website the other day. There were a good number of papers relating to WPS. In particular:

Towards a Quality Aware Web Processing Service (Donaubauer et al) describes a project to develop a web service that gives an indication of the quality of the results based on the quality of the input data. In their forestry use case the process is a polygon overlay. In the source data, accuracy is provided at the level of each polygon's line segments. The WPS returns the area of the overlay and its standard deviation. The accuracy data is passed to the process embedded in metaDataProperty tags in the source GML. Including something similar in my implementation of WPS is probably beyond the scope of my project, but it is something I need to at least make reference to.

Web-based geostatistics using WPS (Jorge de Jesus et al) describes the development of a WPS for spatial interpolation, Kriging in particular. The output from the process is a reference to a GeoTIFF document that can be downloaded using a browser. This project has similarities with my own in that it uses point source data. They note that the “verbosity” of XML is a problem when sending thousands of point observations. Another issue is that GML allows for multiple ways of describing the same data, which presents a problem to the programmer. They too raise the issue of quality and make reference to the work done with UncertML (original paper).

Providing GRASS with a Web Processing Service Interface (Brauner) describes how WPS can be integrated with the GRASS desktop GIS. In a discussion of state, he notes that in a stateless web-based system each time you read a dataset you process it and then throw it away. He contrasts this with the classic stateful desktop GIS, where you load up your data sources and work on them for as long as necessary.

Monday 5 January 2009

Implementing DBSCAN

The next stage is to implement the DBSCAN algorithm. DBSCAN is an algorithm for discovering clusters in point data. In my case I want to discover accident black spots in my accident dataset.

The algorithm was devised by Ester et al in the paper A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.

Tan et al provide a good discussion of DBSCAN in a chapter of Introduction to Data Mining.

My intention is to implement DBSCAN in a VB.NET desktop application, using as input a GML file saved from the output of my WFS setup. The final version will need a web-based implementation of DBSCAN that takes an actual WFS request as input, but I think this is a good starting point (debugging .NET desktop applications is easier than debugging web applications).
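
As a rough illustration of the reading-in step, something like the sketch below works where each point sits in a gml:coordinates element as an "x,y" pair – an assumption on my part, since the exact element names depend on the GML that GeoServer produces:

Imports System.Xml
Imports System.Collections

Module GmlReaderSketch
    ' Sketch of reading point coordinates from a saved GML file.
    ' Assumes each point is held in a gml:coordinates element as "x,y".
    Function ReadPoints(ByVal path As String) As ArrayList
        Dim points As New ArrayList()
        Dim doc As New XmlDocument()
        doc.Load(path)

        ' GML elements are namespace-qualified, so register the gml prefix
        Dim ns As New XmlNamespaceManager(doc.NameTable)
        ns.AddNamespace("gml", "http://www.opengis.net/gml")

        For Each node As XmlNode In doc.SelectNodes("//gml:coordinates", ns)
            Dim parts() As String = node.InnerText.Split(","c)
            Dim x As Double = Double.Parse(parts(0))
            Dim y As Double = Double.Parse(parts(1))
            points.Add(New Double() {x, y}) ' swap in the point class here
        Next
        Return points
    End Function
End Module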

DBSCAN requires three inputs:
  • the data source
  • a parameter, MinPts – the minimum number of points needed to define a cluster
  • a distance parameter, Eps – if there are at least MinPts points within Eps of a point then that point is a core point in a cluster.
Now, Ester et al concluded that MinPts could safely be set to 4 for all 2D datasets, so MinPts is eliminated. The next step is to work out how to calculate Eps. The following was repeated with a number of subsets of the full dataset (a sketch of the calculation follows the list):
  • read in the GML file, saved from WFS, and store the points in an arraylist
  • create a distance array that holds the distance from each point to every other point in the dataset
  • for each point find the distance to its 4th nearest point – what is called the 4-dist
  • sort the 4-dist values in ascending order
  • plot them
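
In code the calculation is along the lines of this sketch (the array names are illustrative):

Module KDistSketch
    ' Sketch of the 4-dist calculation. Assumes xs() and ys() hold the
    ' point coordinates read from the GML file.
    Function FourDistValues(ByVal xs() As Double, ByVal ys() As Double) As Double()
        Dim n As Integer = xs.Length
        Dim kDist(n - 1) As Double

        For i As Integer = 0 To n - 1
            ' distances from point i to every other point
            Dim d(n - 2) As Double
            Dim idx As Integer = 0
            For j As Integer = 0 To n - 1
                If j <> i Then
                    d(idx) = Math.Sqrt((xs(i) - xs(j)) ^ 2 + (ys(i) - ys(j)) ^ 2)
                    idx += 1
                End If
            Next
            Array.Sort(d)
            kDist(i) = d(3) ' distance to the 4th nearest neighbour
        Next

        Array.Sort(kDist) ' ascending order, ready for plotting
        Return kDist
    End Function
End Module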
This was repeated with three datasets:

  • All accidents in the (Manchester) City Centre ward (3236 points) graph
  • All accidents in the Heatons North ward (398 points) graph
  • All accidents in Greater Manchester in 2008 – up to September 30th (5309 points) graph

The first thing to note (gratifyingly!) is that all three graphs are similar in shape to that displayed in Tan et al. (Note that the graphs are the flipside of the one displayed in Ester et al, who order the data in descending order.) The Eps value for a dataset can be estimated as the point where the graph "takes off". This is very much an estimate, but for the City Centre data it is about 50m, for the Heatons North data it is about 70m, and for the whole-of-GM dataset the value is about 700m.

This value has to be provided to the DBSCAN algorithm before it can work out where the clusters are. As Ester et al state, it is difficult to calculate this value automatically. So what's to be done? Some options:

  • Do we send the data to the WPS and get it to send back a graph from which we can estimate the value of Eps which we then send back to the WPS for it to calculate the clusters?
  • Do we assume the user "knows" his/her dataset and expect him/her to provide the value of Eps?
  • Can we make some estimate of Eps? If we look at the three graphs, the first two each cover a single ward. I think they are similar in area and they have a similar value of Eps. The third graph covers a much larger area (all of GM) and returns a much larger value of Eps. Can the area covered be used to estimate Eps?

Problems with large datasets
I did try with all accidents in Greater Manchester in 2000 – 11998 points – but this meant an array of 11998 x 11998 distances (around 144 million entries, over 1GB stored as Doubles), which was too much for VB.NET. I need to have a think about this distance-between-points array – perhaps look at spatial indexing and doing the query in a spatial database. There's no point in calculating the distances between points that are nowhere near each other (see the sketch below).
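
One possibility – a sketch of the idea only, not something I have implemented – is to bucket the points into a grid of Eps-sized cells, so that distances are only ever calculated between points in the same or adjacent cells:

Imports System.Collections.Generic

Module GridIndexSketch
    ' Sketch only. A point's neighbours within Eps can only lie in its
    ' own grid cell or the eight surrounding cells, so the full n x n
    ' distance array is never built.
    Function NeighbourCounts(ByVal xs() As Double, ByVal ys() As Double, ByVal eps As Double) As Integer()
        Dim cells As New Dictionary(Of String, List(Of Integer))

        ' assign each point to a cell keyed by its grid coordinates
        For i As Integer = 0 To xs.Length - 1
            Dim key As String = CInt(Math.Floor(xs(i) / eps)).ToString() & ":" & CInt(Math.Floor(ys(i) / eps)).ToString()
            If Not cells.ContainsKey(key) Then cells.Add(key, New List(Of Integer))
            cells(key).Add(i)
        Next

        ' count neighbours by checking only the 3 x 3 block of cells around each point
        Dim counts(xs.Length - 1) As Integer
        For i As Integer = 0 To xs.Length - 1
            Dim cx As Integer = CInt(Math.Floor(xs(i) / eps))
            Dim cy As Integer = CInt(Math.Floor(ys(i) / eps))
            For dx As Integer = -1 To 1
                For dy As Integer = -1 To 1
                    Dim key As String = (cx + dx).ToString() & ":" & (cy + dy).ToString()
                    If cells.ContainsKey(key) Then
                        For Each j As Integer In cells(key)
                            Dim d As Double = Math.Sqrt((xs(i) - xs(j)) ^ 2 + (ys(i) - ys(j)) ^ 2)
                            If j <> i AndAlso d <= eps Then counts(i) += 1
                        Next
                    End If
                Next
            Next
        Next
        Return counts
    End Function
End Module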

Conclusion
I don't want to get bogged down in the details of the processing part of the WPS, since the investigation is into the WPS itself, not any particular algorithm. So, I could simply specify that the user has to provide the value of Eps as a parameter for the WPS. However, I'd like to give it more consideration before taking this option - perhaps estimating the value of Eps if it is not supplied by the user.

Saturday 3 January 2009

Setting up GeoServer

Having got my source data into postGRES I now need to configure GeoServer to deliver the data using WFS.

I installed GeoServer 1.7.1.

First step was to create a namespace:
Config > Data > Namespace > New

Enter prefix: GM
Click new.

Click Submit then Save

The next step is to create the datastore:

Config > Data > DataStores > New

Feature data set description: Postgis
Feature Data Set ID: accidents

Click New then:

Namespace: GM
Description: Road accidents in Greater Manchester
host: localhost
port: 5432
schema: public
database: accidents
user: ****
password: ****

(The username and password are those created when postgreSQL was set up.)

Click Submit; you are then prompted to create a featureType:

Click Create SLD and create the feature style settings, e.g. point size and colour.

In the SRS box enter 27700.

Click the Bounding Box Generate button.

Click Submit.

Then Apply, then Save.

To test the WFS in uDig, go to the GeoServer welcome page and drag WFS Capabilities onto the uDig window.

To test in a browser go to:

http://localhost:8080/geoserver/wfs?service=WFS&request=GetCapabilities

The accidents layer appears along with the sample datasets.

To extract a subset of the data using WFS:

http://localhost:8080/geoserver/wfs?service=WFS&request=GetFeature&typeName=GM:accident&CQL_FILTER=ward='CITY%20CENTRE'

This will list all accidents in the CITY CENTRE ward. This, however, does not work in Internet Explorer, which returns an error:

The namespace prefix is not allowed to start with the reserved string "xml".

See here for an explanation: http://support.microsoft.com/kb/262585

The problem seems to be with the xml keyword in:

xmlns:xml="http://www.w3.org/XML/1998/namespace"

I will need to investigate how to configure the namespaces in GeoServer (which seems to specify several of them for some reason).

Another problem is that GeoServer loses my new featureType each time it is restarted. The namespace and datastore are saved properly but not the featureType, which has to be recreated after each restart. The datastore details are stored in:

C:\Program Files\GeoServer 1.7.1\data_dir\catalog.xml.

and there is an entry for my datastore.

All the sample featureTypes have folders in:

C:\Program Files\GeoServer 1.7.1\data_dir\featureTypes\featureTypeName

but nothing is created in here when I create my featureType.

When I select GeoServer Demo > Map preview none of the options work for my data source but those for the demo sources do.

See the GeoServer documentation.

Next steps
These two problems (WFS not working in IE and, more importantly, the featureType not being saved) will need further investigation. However, I am going to put them to one side. I can now extract some data from WFS (using Firefox) and save it to a text file in GML format. I am now going to experiment with reading in the text file and processing the data using a VB.NET desktop application. My final implementation will involve developing a web application that reads the data directly from the GeoServer WFS and processes it. But I want to make a start on implementing an algorithm for searching for clusters in the data, and I need to simplify the number of steps in the process to do that. If I do it properly, the code I develop for the VB.NET desktop application will be easily ported to an ASP.NET application.

Why am I using .NET technologies? So far I have been using open source tools – postgreSQL/postGIS/GeoServer – and I would like to carry on with this, but I don't have the skills for programming open source web applications, so I'm going to use .NET.

Friday 2 January 2009

Data source

My dataset consists of 108,078 road traffic accidents in Greater Manchester between 1st January 1998 and 30th September 2008. Each accident is represented by a point object. The points are located using Ordnance Survey grid references. To get a feel for the dataset visit www.gmtu.gov.uk/gmaccidents. What is a road accident? The word accident will be used throughout. Some organisations prefer collision to accident, since accident implies that no-one was at fault. See more discussion here. Bizarrely, the issue crops up in the film Hot Fuzz.

The task is to get this data into postgreSQL. From there it can be served as a Web Feature Service (WFS) using GeoServer. The WFS will then act as input into my (yet to be developed) Web Processing Service (WPS).

postgreSQL was installed with the postGIS component (postGIS adds spatial capabilities to postGRES). A new database, accidents, was created using the postGIS template.

The source data is held in MapInfo format. This was exported to shapefile format. The shapefile was converted into a file of postGRES SQL INSERT statements using shp2pgsql.exe. From the DOS prompt:

shp2pgsql -c -s 27700 c:\temp\accidents_point accident accidents > c:\temp\accidents.sql

accidents.sql was run against the accidents database in postgreSQL.

The table fields are:

gid: unique identifier added by postGIS
accidentID: original unique identifier
year: year of accident
severity: accident severity. 1=fatal, 2=serious, 3=slight. The definition of accident severity can be found here.
numbercasu: number of casualties
date: date of accident
easting: OSGB easting
northing: OSGB northing
ward: ward (definition) where the accident occurred.
lacode: numeric code for the local authority (definition) where the accident occurred

The import into postgreSQL was tested in QGIS which can open postGIS tables:


Click here for larger image.

Thursday 1 January 2009

Welcome

This blog is intended to record progress on my MSc dissertation, entitled An exploration of the Web Processing Service: a road traffic accident case study.

I am studying with UNIGIS.

The project aim is to assess the suitability of the OGC's Web Processing Service to analyse road traffic accident data. The objectives are:
  • To develop a conceptual framework for remote “geoprocessing” by researching developments prior to the OGC standard and assessing the advantages and disadvantages of remote geoprocessing.
  • To conduct a survey of existing Web Processing Service (WPS) implementations. The survey will look at the types of analysis undertaken and the nature of the data used as input.
  • To identify two different algorithms for highlighting accident black spots that are suitable for implementing using WPS.
  • To develop a demonstration WPS designed to take road accident locations as input and output a black spot analysis.
  • To assess the performance of the demonstration service and thus assess the suitability of WPS as a whole.

The extended project outline can be downloaded (PDF file, 112Kb).