Tuesday 8 December 2009

Geoprocessing in the cloud

Interesting article in GeoInformatics (December 2009) entitled "Geoprocessing in the Clouds". The authors implement the 52North Web Processing Service on Google's and Amazon's cloud services.

They compare response times for both implementations and find that both scale well.

A more detailed paper describing the project can be found on the 52North website.

Two contrasting views of GeoWeb processing

Now that my MSc is done and dusted, I've decided to keep the blog going, focussing on the OGC's Web Processing Service.

I came across a couple of quotes recently. First from Jack Dangermond, ESRI CEO:

"So I think even 45 years ago when we were first inventing computer mapping at Harvard, our first attempts were to make beautiful maps. They weren't very beautiful; they were pretty crude. But very quickly thereafter, we moved from mapping into analytics, spatial decision-making to support land use planning and business planning of various sorts. So I think if you're asking me where will the web and the geospatial world move, it will be from simple map displays to analytics on the web. It'll move from simply a basemap with overlays on it to the notion of multiple distributed services that do spatial analysis, but combine various kinds of spatial data together and serve it back out as visualization communications."
full article...

And from an article on the Chrome JavaScript engine on the Spatial Miscellany blog:

"Chrome isn’t the only web browser to recognize the importance of working with JavaScript heavy web sites, for example, much work is being done on a new JavaScript engine for Firefox. Perhaps in the future, as this approach gains support, spatial analysis functionality can move from the server to the client, which would make for a more engaging web mapping experience?" full article...

So processing via services or on the client...?

Thursday 16 April 2009

WPS implementation (nearly)

My WPS system now consists of an ASP.NET application wrapped around an ASP.NET Web Service. The web application takes the input and handles the GetCapabilities and DescribeProcess requests. It also, after some error checking, passes the Execute request on to the Web Service, which processes the request asynchronously.
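As a rough illustration of this split, the wrapper page can dispatch on the Request parameter along these lines. This is only a sketch, not my actual code – the class name, the XML file paths and the bare-bones exception reports are placeholders:

Partial Class WpsDefault
    Inherits System.Web.UI.Page

    Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs) Handles Me.Load
        Response.ContentType = "text/xml"
        Select Case LCase("" & Request.QueryString("Request"))
            Case "getcapabilities"
                ' the capabilities document is largely static, so serve it from a file
                Response.WriteFile(Server.MapPath("~/xml/capabilities.xml"))
            Case "describeprocess"
                ' only one process is offered, so a simple identifier check will do
                If Request.QueryString("Identifier") = "DBSCAN" Then
                    Response.WriteFile(Server.MapPath("~/xml/describeprocess_dbscan.xml"))
                Else
                    Response.Write("<ExceptionReport><Exception exceptionCode=""InvalidParameterValue"" locator=""Identifier""/></ExceptionReport>")
                End If
            Case "execute"
                ' validate the inputs, write the ProcessAccepted status file,
                ' then hand the request over to the asynchronous web service
            Case Else
                Response.Write("<ExceptionReport><Exception exceptionCode=""OperationNotSupported""/></ExceptionReport>")
        End Select
    End Sub
End Class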

The WPS GetCapabilities response can be seen here:

http://www.gmtu.gov.uk/wps/?service=WPS&version=1.0.0&Request=GetCapabilities

The DescribeProcess response can be seen here:

http://www.gmtu.gov.uk/wps/?service=WPS&version=1.0.0&Request=DescribeProcess&Identifier=DBSCAN

The DescribeProcess output is not yet complete, as there is no description of the input and output data formats. There is some error trapping, so if you enter:

http://www.gmtu.gov.uk/wps/?service=WPS&version=1.0.0&Request=DescribeProcess&Identifier=noProcess

an error message will be returned.

The Execute process should return an XML response document that allows any client to monitor progress. In my implementation, for each request the WPS creates a file nnnnn.xml, where nnnnn is a unique integer. Both the wrapper application and the web service write to this file. When the Execute request has been parsed and all the inputs are present and correct, the wrapper creates the file with a ProcessAccepted message. When the Web Service is called, it overwrites the file with a ProcessStarted message. If there are any errors in processing then a ProcessFailed message is written. Finally, a ProcessSucceeded message is written to the ResponseDocument. The idea is that the client can poll this document to check the progress of the request.
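A minimal sketch of how such a status file might be written (the element names follow the Status block of the WPS ExecuteResponse, but namespaces and most attributes are left out here, and the module name is just illustrative):

Imports System.Xml

Module StatusFile
    ' Writes (or overwrites) the status file for one Execute request.
    ' state should be one of: ProcessAccepted, ProcessStarted, ProcessFailed, ProcessSucceeded.
    Sub WriteStatus(ByVal path As String, ByVal state As String, ByVal message As String)
        Dim w As New XmlTextWriter(path, System.Text.Encoding.UTF8)
        w.Formatting = Formatting.Indented
        w.WriteStartDocument()
        w.WriteStartElement("ExecuteResponse")
        w.WriteStartElement("Status")
        w.WriteAttributeString("creationTime", DateTime.Now.ToString("s"))
        w.WriteElementString(state, message)   ' e.g. <ProcessStarted>running DBSCAN</ProcessStarted>
        w.WriteEndElement() ' Status
        w.WriteEndElement() ' ExecuteResponse
        w.WriteEndDocument()
        w.Close()
    End Sub
End Module

The wrapper would call something like this with "ProcessAccepted" once the Execute request has been validated, and the web service would overwrite the same nnnnn.xml file as it progresses.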

This is sort of working, but I get occasional "file being used by another process" errors when I try to overwrite the file, or sometimes when reading it. I think this is due to the web server software (IIS) locking any file it has recently served. I'll need to think of a way around this.
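One possible workaround (just a sketch – I haven't tried this yet) would be to retry the write a few times with a short pause, rather than failing on the first locked attempt:

Imports System.IO
Imports System.Threading

Module LockRetry
    ' Attempt to overwrite a file that may briefly be locked by IIS, retrying a few times.
    Sub WriteWithRetry(ByVal path As String, ByVal contents As String)
        For attempt As Integer = 1 To 5
            Try
                File.WriteAllText(path, contents)
                Exit Sub                       ' success
            Catch ex As IOException
                Thread.Sleep(200)              ' file still locked; wait and try again
            End Try
        Next
        Throw New IOException("Could not write " & path & " after 5 attempts")
    End Sub
End Module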

Monday 2 March 2009

Final architecture?

After several weeks of exploring different solutions including remoting and using a Windows Service I have settled on using a Web Service.

The service takes as input a value for Eps (in metres) and a URL pointing to a WFS service. Currently I have the web service running on one physical server and the WFS service running on another. The service saves the results in a KML file on the server, which can be accessed via the web. The clustering algorithm may take several minutes to run (depending on the amount of source data) so the service needs to be called asynchronously.

Firstly, the server code that executes the DBSCAN algorithm has to be written as a Sub rather than a Function, since it cannot return a value to the client when running asynchronously. The Sub definition needs the following attribute set:

<SoapDocumentMethod(OneWay:=True)>
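For illustration, the declaration on the web service side looks something like the sketch below. I'm assuming here that the underlying method is called DBSCANasync – the extra Async suffix seen on the client is added by the generated proxy – and the body is only a comment:

Imports System.Web.Services
Imports System.Web.Services.Protocols

Public Class Service
    Inherits System.Web.Services.WebService

    ' OneWay means the SOAP response is sent back as soon as the request is received,
    ' before the Sub finishes, so the method cannot return anything to the caller.
    <WebMethod(), SoapDocumentMethod(OneWay:=True)> _
    Public Sub DBSCANasync(ByVal Eps As Double, ByVal Url As String)
        ' read the WFS data from Url, run DBSCAN with the supplied Eps,
        ' write the results out as a KML file on the server
    End Sub
End Class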

The timeout has to be increased in the service application, using the httpRuntime element (inside system.web) in the Web.config file:

<httpRuntime executionTimeout="1200" />

The service can be called from a web client. A web reference to the service is added to the client and, for the page that calls the service, the Async attribute needs to be set in the @Page directive:

Async="true"

The client can then call the service:

Dim ws As New WPSservice.Service   ' proxy class generated from the web reference
ws.DBSCANasyncAsync(Eps, Url)      ' asynchronous call; control returns immediately


Here, ws is a reference to the web service and DBSCANasyncAsync is the Sub in the web service.

I’ve got this working as it should. The client calls the service asynchronously and control returns to the client page straightaway whilst the service continues to run in the background. When complete it writes out a results file to the server. This is all well and good, but since the client does not wait for the service to finish, how does the client know when the job has been completed, where the results are, or whether there were any errors? This is the next stage: I need to set up a mechanism for polling the service for the results and reporting any errors back to the client. I need to start getting my system looking like a Web Processing Service as specified by the OGC.





Tuesday 3 February 2009

Output to KML file and more thoughts on implementation

So far I have written the results from DBSCAN as INSERT statements to be run against a PostgreSQL database. This is not going to be much use for a client application, so for the next step I’m going to generate a KML file showing the clusters (I’m going to omit noise points to keep the output file size low). KML has now been adopted as an OGC standard and can be used as input to Google Maps and Google Earth.

One problem is that my data is in OSGB co-ordinates whereas KML uses WGS84 lat/long. I've used the Grid InQuest co-ordinate translation program which comes with a DLL file that can be used in Visual Studio. My VB.NET application now outputs the results to a KML file. The code for generating the KML file was adapted from here.
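As a rough sketch of the KML side (illustrative only – it assumes the coordinates have already been converted to WGS84 lat/long, in my case via the Grid InQuest DLL, and the module and parameter names are placeholders):

Imports System.Xml

Module KmlOutput
    ' Write a minimal KML document with one Placemark per clustered point.
    ' lons/lats are WGS84 coordinates; ids holds the cluster ID for each point.
    Sub WriteKml(ByVal path As String, ByVal ids() As Integer, ByVal lons() As Double, ByVal lats() As Double)
        Dim w As New XmlTextWriter(path, System.Text.Encoding.UTF8)
        w.Formatting = Formatting.Indented
        w.WriteStartDocument()
        w.WriteStartElement("kml", "http://www.opengis.net/kml/2.2")
        w.WriteStartElement("Document")
        For i As Integer = 0 To ids.Length - 1
            w.WriteStartElement("Placemark")
            w.WriteElementString("name", ids(i).ToString())   ' cluster ID shown as the point label
            w.WriteStartElement("Point")
            ' KML expects longitude,latitude order
            w.WriteElementString("coordinates", lons(i).ToString() & "," & lats(i).ToString())
            w.WriteEndElement() ' Point
            w.WriteEndElement() ' Placemark
        Next
        w.WriteEndElement() ' Document
        w.WriteEndElement() ' kml
        w.WriteEndDocument()
        w.Close()
    End Sub
End Module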

Some sample results:

http://www.gmtu.gov.uk/nmg/citycentre.kml
http://www.gmtu.gov.uk/nmg/heatonsnorth.kml

If these URLs are pasted directly into the browser, Google Earth will be fired up (if installed). To use the KML files in Google Maps you can paste the URL into the Search Maps text box on the Google Maps page.

Here is the Heatons North data in Google Maps:


(Embedded Google Map showing the Heatons North clusters.)

The number on each point is the cluster ID (note that noise points have not been included).

As it stands the KML displays the individual points in each cluster. I perhaps need to think about generalization and generating polygons to display the clusters.

How does this fit in with my eventual implementation? My current thinking is this:

To develop an ASP.NET web application that acts as a “wrapper” around a VB.NET service/application. The wrapper is what the client sees and is in effect the WPS. This is the technique recommended by Michaelis and Ames in their early evaluation of WPS. They suggest that the geo-processing takes place in a separate process, which will stop a badly performing process from bringing down the entire WPS (on Windows Server I can put the geo-processing tools in a separate application pool).

My ASP.NET application will handle the GetCapabilities and DescribeProcess operations. It will also manage the Execute operation by calling the necessary geo-processing service. The geo-processing service will take input from the wrapper and write its results to a results file (e.g. a KML file as above). It will also write to a log file, reporting its progress. The log file can be queried by the ASP.NET wrapper (on request of the client), allowing it to check on progress. The relevant section of the OGC WPS specification is:

"The mandatory Execute operation allows WPS clients to run a specified process implemented by a server, using the input parameter values provided and returning the output values produced. Inputs can be included directly in the Execute request, or reference web accessible resources. The outputs can be returned in the form of an XML response document, either embedded within the response document or stored as web accessible resources. If the outputs are stored, the Execute response shall consist of a XML document that includes a URL for each stored output, which the client can use to retrieve those outputs. Alternatively, for a single output, the server can be directed to return that output in its raw form without being wrapped in an XML reponse document.

Normally, the response document is returned only after process execution is completed. However, a client can instruct the server to return the Execute response document immediately following acceptance by the server of the Execute request. In this case, the Execute response includes a URL from which the response document can later be retrieved during and after process execution. The server can be instructed to provide regular updates to a measure of the amount of processing remaining if the process is not complete. This allows the client to determine the process status by polling this URL. "

The response document will in effect be my log file (an XML document), which, while processing is ongoing, will provide status updates when queried and, when processing is complete, will contain URLs to any output, e.g. a KML file.

I think this method means generating a separate XML document each time the process is called (each file named with a unique reference number) unless I generate the response document dynamically from ASP.NET.

I need to examine further the approaches taken by 52North and deegree in their WPS implementations.

Wednesday 28 January 2009

UML Experiments

I'm going to document my implementation using UML since it is prevalent in academia. I've been experimenting with the Poseidon for UML Community Edition modelling software, which was included with Teach Yourself UML in 24 Hours. The book gets panned on Amazon, but I am up to hour 7 and I am fairly impressed so far. The important section for me will be sequence diagrams, which I have yet to reach (hour 9).

My first attempt at a (very basic) class diagram:
This is the point class used in my code that implements DBSCAN.

The software has an export function which created the above GIF image. However, the version I've got puts a watermark on any exports, which will not look too good in my thesis.

I'm going to look next at Star UML, an open source tool.

Tuesday 27 January 2009

More on DBSCAN

To recap, I'm attempting to implement the clustering algorithm DBSCAN as a desktop VB.NET application, reading the accident data from a GML file. I've managed to do this using a modified version of the method described by Tan et al (Algorithm 8.4, p. 528).

The implementation requires the user to supply a value for Eps. The process is as follows (a simplified code sketch follows the list):
  1. read all points into an arraylist of type point (where point is a user defined class)
  2. populate a distance array – containing the distance between each point and every other point
  3. create an array for k-dist where k = 4 – i.e. for each point, what is the distance to its 4th nearest neighbour?
  4. flag core points – i.e. where 4-dist is <= Eps
  5. flag border points – i.e. where the point is within Eps of a core point
  6. by default the remaining points are noise
  7. populate an array of linked core points – i.e. core points within Eps of each other
  8. where core points are linked put them in the same cluster
  9. add border points to same cluster as their nearest core point
(The code then generates INSERT statements that can be run against a PostgreSQL database for plotting in QGIS).
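For illustration, here is a stripped-down sketch of steps 4–9. It is not my actual code: it assumes the full distance matrix from step 2 has already been built, and it uses a simple recursive flood fill to link core points:

Module DbscanSketch
    Const MinPts As Integer = 4   ' Ester et al's recommended value for 2D data

    ' dist(i, j) is the precomputed distance between points i and j.
    ' Returns a cluster ID for each point (0 = noise).
    Function Classify(ByVal dist(,) As Double, ByVal eps As Double) As Integer()
        Dim n As Integer = dist.GetLength(0)
        Dim core(n - 1) As Boolean
        Dim cluster(n - 1) As Integer

        ' Step 4: a point is core if its 4th nearest neighbour lies within Eps,
        ' i.e. at least MinPts other points are within Eps.
        For i As Integer = 0 To n - 1
            Dim neighbours As Integer = 0
            For j As Integer = 0 To n - 1
                If j <> i AndAlso dist(i, j) <= eps Then neighbours += 1
            Next
            core(i) = (neighbours >= MinPts)
        Next

        ' Steps 7-8: core points within Eps of each other share a cluster.
        Dim nextId As Integer = 0
        For i As Integer = 0 To n - 1
            If core(i) AndAlso cluster(i) = 0 Then
                nextId += 1
                Expand(i, nextId, dist, eps, core, cluster)
            End If
        Next

        ' Steps 5, 6 and 9: border points join their nearest core point's cluster;
        ' anything still labelled 0 is noise.
        For i As Integer = 0 To n - 1
            If Not core(i) Then
                Dim best As Integer = -1
                For j As Integer = 0 To n - 1
                    If core(j) AndAlso dist(i, j) <= eps AndAlso _
                       (best = -1 OrElse dist(i, j) < dist(i, best)) Then best = j
                Next
                If best >= 0 Then cluster(i) = cluster(best)
            End If
        Next

        Return cluster
    End Function

    ' Recursively label all unclustered core points within Eps of point i.
    Private Sub Expand(ByVal i As Integer, ByVal id As Integer, ByVal dist(,) As Double, _
                       ByVal eps As Double, ByVal core() As Boolean, ByVal cluster() As Integer)
        cluster(i) = id
        For j As Integer = 0 To dist.GetLength(0) - 1
            If core(j) AndAlso cluster(j) = 0 AndAlso dist(i, j) <= eps Then
                Expand(j, id, dist, eps, core, cluster)
            End If
        Next
    End Sub
End Module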

The process was run using data from the Heatons North ward using values for Eps of 10m and 20m. The following images focus on a particular section of road:

With an Eps of 10m:

With an Eps of 20m:


The core points are black, border points are yellow and noise points are crosses. The number indicates the cluster ID (the 10m Eps returns 14 clusters, the 20m Eps returns 19 clusters).

With the 10m Eps there is more noise and the cluster is more focussed - at 2 junctions instead of 3. The choice of Eps is critical then.

A large view of the 10m results can be seen here.

Summary
My implementation of DBSCAN will not be the most efficient, but that doesn't matter. It's the implementation of WPS that I'm interested in.

The next stage is to implement the algorithm as a WPS. The first step will be to look at the input parameters of the WPS i.e. how can I make the accident data in WFS form available to a web service?


Saturday 10 January 2009

useful references - GI Days 2008

I came across the GI Days 2008 conference website the other day. There were a good number of papers relating to WPS. In particular:

Towards a Quality Aware Web Processing Service (Donaubauer et al) describes a project to develop a web service that gives an indication of the quality of the results based on the quality of the input data. In their forestry use case the process is a polygon overlay. With the source data, the accuracy is provided at the line segment level (for each polygon). The WPS returns the area and standard deviation of the overlay. The accuracy data is passed to the process embedded in metaDataProperty tags in the source GML. Including something similar in my implementation of WPS is probably beyond the scope of my project, but it is something I need to at least make reference to.

Web-based geostatistics using WPS (Jorge de Jesus et al) describes the development of a WPS for spatial interpolation, Kriging in particular. The output from the process is a reference to a GeoTIFF document that can be downloaded using a browser. This project has similarities with my own in that it uses point source data. They note that the “verbosity” of XML is a problem when sending thousands of point observations. Another issue is that GML allows for multiple ways of describing the same data, which presents a problem to the programmer. They too raise the issue of quality and make reference to the work done with UncertML (original paper).

Providing GRASS with a Web Processing Service Interface (Brauner) describes how WPS can be integrated with the GRASS desktop GIS. In a discussion of state he notes that, in a stateless web-based system, each time you read a dataset you process it and then throw it away. He contrasts this with the classic stateful desktop GIS, where you load up your data sources and work on them for as long as necessary.

Monday 5 January 2009

Implementing DBSCAN

The next stage is to implement the DBSCAN algorithm. DBSCAN is an algorithm for discovering clusters in point data. In my case I want to discover accident black spots in my accident dataset.

The algorithm was devised by Ester et al in the paper A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.

Tan et al provide a good discussion of DBSCAN in a chapter of Introduction to Data Mining.

My intention is to implement DBSCAN in a VB.NET desktop application, using as input a GML file saved from the output of my WFS setup. In my actual implementation I will need to develop a web-based implementation of DBSCAN using a live WFS request as input, but I think this is a good starting point (debugging .NET desktop applications is easier than debugging web applications).

DBSCAN requires three inputs:
  • the data source
  • a parameter MinPts - which is the minimum number of points to define a cluster
  • a distance parameter, Eps – if there are at least MinPts points within Eps of a point then that point is a core point in a cluster.
Now, Ester et al concluded that MinPts could safely be set to 4 for all 2D datasets, so MinPts is eliminated. The next step is to work out how to calculate Eps. The following was repeated with a number of subsets of the full dataset:
  • Read in the GML file, saved from WFS, and store the points in an arraylist
  • Create a distance array that holds the distance of each point to every other point in the dataset
  • for each point, calculate the distance to its 4th nearest point – what is called the 4-dist
  • sort the 4-dist values in ascending order
  • plot them
This was repeated with three datasets:

  • All accidents in the (Manchester) City Centre ward (3236 points) graph
  • All accidents in the Heatons North ward (398 points) graph
  • All accidents in Greater Manchester in 2008 – up to September 30th (5309 points) graph

The first thing to note (gratifyingly!) is that all three graphs are similar in shape to the one displayed in Tan et al. (Note that the graphs are the flipside of the one displayed in Ester et al, since they order the data in descending order.) The Eps value for a dataset can be estimated as the point where the graph "takes off". This is very much an estimate, but for the City Centre data it is about 50m, for the Heatons North data it is about 70m, and for the whole-of-GM dataset the value is about 700m.
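For reference, the 4-dist calculation behind these graphs can be sketched as follows. This is illustrative only: it assumes the coordinates have already been read out of the GML into plain arrays of eastings and northings in metres, and the module name is made up:

Imports System.Collections.Generic

Module KDistSketch
    ' For each point, find the distance to its kth nearest neighbour (the "k-dist"),
    ' then sort the values ready for plotting.
    Function SortedKDist(ByVal xs() As Double, ByVal ys() As Double, ByVal k As Integer) As Double()
        Dim n As Integer = xs.Length
        Dim kdist(n - 1) As Double
        For i As Integer = 0 To n - 1
            Dim d As New List(Of Double)
            For j As Integer = 0 To n - 1
                If j <> i Then
                    d.Add(Math.Sqrt((xs(i) - xs(j)) ^ 2 + (ys(i) - ys(j)) ^ 2))
                End If
            Next
            d.Sort()
            kdist(i) = d(k - 1)   ' distance to the kth nearest neighbour
        Next
        Array.Sort(kdist)         ' ascending, as in the graphs linked above
        Return kdist
    End Function
End Module

Plotting the array returned for k = 4 against point index gives the sort of curve shown in the graphs, with the "take off" point read off by eye.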

This value has to be provided to the DBSCAN algorithm before it can work out where the clusters are. As Ester et al state it is difficult to calculate this value automatically. So what's to be done? Some options:

  • Do we send the data to the WPS and get it to send back a graph from which we can estimate the value of Eps which we then send back to the WPS for it to calculate the clusters?
  • Do we assume the user "knows" his/her dataset and expect him/her to provide the value of Eps?
  • Can we make some estimate of Eps? If we look at the three graphs - the first two each cover a single ward. I think they are similar in area and have a similar value of Eps. The third graph covers a much larger area (all of GM) and returns a much larger value of Eps. Can this be used to estimate Eps?

Problems with large datasets
I did try with all accidents in Greater Manchester in 2000, 11998 points, but this meant an 11998 x 11998 array – around 144 million entries, or roughly 1.1 GB as doubles – which was too much for VB.NET. I need to have a think about this distance-between-points array – perhaps look at spatial indexing and doing the query in a spatial database – there's no point in calculating the distances between points that are nowhere near each other.
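One simple idea (just a sketch, not something I have implemented) would be to bucket the points into a grid whose cell size is Eps, so only points in the same or neighbouring cells ever need their distance calculated:

Imports System.Collections.Generic

Module GridSketch
    ' Bucket points into square cells of side Eps. Any point within Eps of point i
    ' must lie in i's cell or one of the eight neighbouring cells, so the full
    ' n x n distance matrix is never needed.
    Function BuildGrid(ByVal xs() As Double, ByVal ys() As Double, ByVal eps As Double) _
        As Dictionary(Of String, List(Of Integer))

        Dim grid As New Dictionary(Of String, List(Of Integer))
        For i As Integer = 0 To xs.Length - 1
            Dim key As String = CInt(Math.Floor(xs(i) / eps)) & ":" & CInt(Math.Floor(ys(i) / eps))
            If Not grid.ContainsKey(key) Then grid.Add(key, New List(Of Integer))
            grid(key).Add(i)
        Next
        Return grid
    End Function

    ' Indices of all points within Eps of point i, found by searching a 3 x 3 block of cells.
    Function Neighbours(ByVal i As Integer, ByVal xs() As Double, ByVal ys() As Double, _
                        ByVal eps As Double, ByVal grid As Dictionary(Of String, List(Of Integer))) _
                        As List(Of Integer)

        Dim result As New List(Of Integer)
        Dim cx As Integer = CInt(Math.Floor(xs(i) / eps))
        Dim cy As Integer = CInt(Math.Floor(ys(i) / eps))
        For dx As Integer = -1 To 1
            For dy As Integer = -1 To 1
                Dim key As String = (cx + dx) & ":" & (cy + dy)
                If grid.ContainsKey(key) Then
                    For Each j As Integer In grid(key)
                        If j <> i AndAlso _
                           Math.Sqrt((xs(i) - xs(j)) ^ 2 + (ys(i) - ys(j)) ^ 2) <= eps Then
                            result.Add(j)
                        End If
                    Next
                End If
            Next
        Next
        Return result
    End Function
End Module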

Conclusion
I don't want to get bogged down in the details of the processing part of the WPS, since the investigation is into the WPS itself, not any particular algorithm. So, I could simply specify that the user has to provide the value of Eps as a parameter for the WPS. However, I'd like to give it more consideration before taking this option - perhaps estimating the value of Eps if it is not supplied by the user.

Saturday 3 January 2009

Setting up GeoServer

Having got my source data into PostgreSQL, I now need to configure GeoServer to deliver the data using WFS.

I installed GeoServer 1.7.1.

First step was to create a namespace:
Config > Data > Namespace > New

Enter prefix: GM
Click new.

Click Submit then Save

The next step is to create the datastore:

Config > Data > DataStores > New

Feature data set description: Postgis
Feature Data Set ID: accidents

Click New then:

Namespace: GM
Description: Road accidents in Greater Manchester
host: localhost
port: 5432
schema: public
database: accidents
user: ****
password: ****

(The username and password are those created when postgreSQL was set up.)

Click Submit; you are then prompted to create a featureType:

click create SLD and create feature style settings, e.g. point size and colour

In SRS box enter 27700

click Bounding box Generate button.

click Submit.

Then Apply then Save

To test WFS in uDig go to the GeoServer welcome page and drag WFS Capabilities onto uDig window.

To test in a browser go to:

http://localhost:8080/geoserver/wfs?service=WFS&request=GetCapabilities

The accidents layer appears along with the sample datasets.

To extract a subset of the data using WFS:

http://localhost:8080/geoserver/wfs?service=WFS&request=Getfeature&typeName=GM:accident&CQL_FILTER=ward='CITY%20CENTRE'

This will list all accidents in the CITY CENTRE ward. This, however, does not work in Internet Explorer, which returns an error:

The namespace prefix is not allowed to start with the reserved string "xml".

See here for an explanation: http://support.microsoft.com/kb/262585

The problem seems to be with the xml keyword in:

xmlns:xml="http://www.w3.org/XML/1998/namespace"

I will need to investigate how to configure the namespaces in GeoServer (which seems to specify several of them for some reason).

Another problem is that GeoServer loses my new featureType each time it is restarted. The namespace and datastore are saved properly, but not the featureType, which has to be recreated after each restart. The datastore details are stored in:

C:\Program Files\GeoServer 1.7.1\data_dir\catalog.xml.

and there is an entry for my datastore.

All the sample featureTypes have folders in:

C:\Program Files\GeoServer 1.7.1\data_dir\featureTypes\featureTypeName

but nothing is created in here when I create my featureType.

When I select GeoServer Demo > Map preview none of the options work for my data source but those for the demo sources do.

GeoServer documentation

Next steps
These two problems (WFS not working in IE and, more importantly, the featureType not being saved) will need further investigation. However, I am going to put them to one side. I can now extract some data via WFS (using Firefox), which I can save to a text file (GML format). I am now going to experiment with reading in the text file and processing the data using a VB.NET desktop application. My final implementation will involve developing a web application that reads the data from the GeoServer WFS and processes it. But I want to make a start on implementing an algorithm for searching for clusters in the data, and I need to simplify the number of steps in the process to do that. If I do it properly, the code I develop for the VB.NET desktop application will be easily ported to an ASP.NET application.

Why am I using .NET technologies? So far I have been using open source tools – PostgreSQL/PostGIS/GeoServer – and I would like to carry on with this, but I don't have the skills for programming open source web applications, so I'm going to use .NET.

Friday 2 January 2009

Data source

My dataset consists of 108,078 road traffic accidents in Greater Manchester between 1st January 1998 and 30th September 2008. Each accident is represented by a point object. The points are located using Ordnance Survey grid references. To get a feel for the dataset visit www.gmtu.gov.uk/gmaccidents. What is a road accident? The word accident will be used throughout. Some organisations prefer the term collision rather than accident, since accident implies that no-one was at fault. See more discussion here. Bizarrely, the issue crops up in the film Hot Fuzz.

The task is to get this data into PostgreSQL. From there it can be served as a Web Feature Service (WFS) using GeoServer. The WFS will then act as input into my (yet to be developed) Web Processing Service (WPS).

PostgreSQL was installed with the PostGIS component (PostGIS adds spatial capabilities to PostgreSQL). A new database, accidents, was created using the PostGIS template.

The source data is held in MapInfo format. This was exported to shapefile format. The shapefile was converted into a file of PostgreSQL INSERT statements using shp2pgsql.exe (-c creates a new table, -s 27700 sets the SRID to British National Grid). From the DOS prompt:

shp2pgsql -c -s 27700 c:\temp\accidents_point accident accidents > c:\temp\accidents.sql

accidents.sql was then run against the accidents database in PostgreSQL.

The table fields are:

gid: unique identifier added by postGIS
accidentID: original unique identifier
year: year of accident
severity: accident severity. 1 = fatal, 2 = serious, 3 = slight. The definition of accident severity can be found here.
numbercasu: number of casualties
date: date of accident
easting: OSGB easting
northing: OSGB northing
ward: ward (definition) where the accident occurred.
lacode: numeric code for the local authority (definition) where the accident occurred

The import into postgreSQL was tested in QGIS which can open postGIS tables:


Click here for larger image.

Thursday 1 January 2009

Welcome

This blog is intended to record progress on my MSc dissertation, entitled An exploration of the Web Processing Service: a road traffic accident case study.

I am studying with UNIGIS.

The project aim is to assess the suitability of the OGC's Web Processing Service to analyze road traffic accident data. The objectives are:
  • To develop a conceptual framework for remote “geoprocessing” by researching developments prior to the OGC standard and assessing the advantages and disadvantages of remote geoprocessing.
  • To conduct a survey of existing Web Processing Service (WPS) implementations. The survey will look at the types of analysis undertaken and the nature of the data used as input.
  • To identify two different algorithms for highlighting accident black spots that are suitable for implementing using WPS.
  • To develop a demonstration WPS designed to take road accident locations as input and output a black spot analysis.
  • To assess the performance of the demonstration service and thus assess the suitability of WPS as a whole.

The extended project outline can be downloaded (PDF file, 112Kb).