Spatial Data Management: A tutorial

Spatial data is information about the locations and shapes of geographic features and the relationships between them, usually stored as coordinates and topology (ESRI, 2017). In other words, it is data that is connected to a ‘location’ on earth, usually represented as latitude and longitude coordinates. Such data is generated by sensors, GPS devices, smartphones and is at the center of GIS systems. Since spatial data is multidimensional (the queries involve both latitude and longitude, in some cases even time), managing it requires special techniques and indexes, some of which will be covered in the papers we read in the course. The simplest spatial indexes are R-trees, Quadtrees and kD-trees. Specialized databases called geodatabases implement these techniques and are used by GIS implementations like ESRI’s ArcGIS and the open source QGIS and GRASS. Popular databases like MySQL, PostgreSQL, Oracle, SQL Server all support spatial data either natively or through extensions.

Figure 1.2: A heat map using spatial data

Figure 1.3: A chloropleth map using spatial data

Spatial datatypes and SQL extensions for spatial data

Since spatial data is inherently multidimensional, the Open Geospatial Consortium (OGC) defines standards that define the types of spatial objects that can be created and stored. Simple spatial datatypes include:

  • Points: Simple points in a 2D plane
  • Lines: Infinite lines or finite line segments
  • Polylines: Also called ‘paths’ by certain implementations, used to represent irregular continuous lines like roads and rivers
  • Polygons: Used to represent areas

Figure 1.4: Spatial datatypes supported by SQL Server

Different spatial DBMS implement different subsets of the spatial datatypes of the original OGC specification. Spatial DBMS also extend the regular SQL to support spatial datatypes and relations between the objects. We shall look at a simple example of PostGIS and see how these extensions work. Please note that PostGIS ships as an extension to regular PostgreSQL and needs to be installed separately.

Spatial SQL example: Neighborhoods in NYC

New York has a rich history of neighborhood names and extent. Neighborhoods are social constructs that do not follow lines laid down by the government. For example, the Brooklyn neighborhoods of Carroll Gardens, Red Hook, and Cobble Hill were once collectively known as “South Brooklyn.” And now, depending on which real estate agent you talk to, the same four blocks in the-neighborhood-formerly-known-as-Red-Hook can be referred to as Columbia Heights, Carroll Gardens West, or Red Hook!

Let’s say the dataset has the following schema:

Column name Description
name Name of the neighborhood
boroname Name of the New York borough. Manhattan, The Bronx, Brooklyn, Staten Island, Queens
geom Polygon boundary of the neighborhood

Figure 1.5: The neighbourhoods of NYC

  • Creating a spatial table

    CREATE TABLE nyc_neighbourhoods (name varchar, boroname varchar, geom geometry);
    
  • Inserting data

    INSERT INTO nyc_neighbourhoods VALUES ('Carroll Hills', ’Brooklyn’, 'POLYGON((0 0, 1 0, 1 1, 0 1, 0 0))');
    

    Since the datasets are usually large, SQL IMPORT statements are used to populate the tables.

  • Intersect Query

    SELECT name, boroname
    FROM nyc_neighborhoods
    WHERE ST_Intersects(geom, ST_GeomFromText('POINT(583571 4506714)',26918));
    

    The above query can be read as: Return the name of a neighborhood and the borough that this point (583571, 4506714) intersects with.

The ST_GeomFromText() is a function used to create a point object from the text specified in the query. In addition to intersection, the OGC specification also lists several other spatial relationships: Contains, Crosses, Disjoint, Distance, DistanceWithin, Equals, Intersects, Touches, Within.

A detailed tutorial for spatial data handling in PostGIS can be found here.