# Urban Problems and Spatial Methods, Volume 17, Number 1 • 2015

U.S. Department of Housing and Urban Development | Office of Policy Development and ...

Moreover, those without the prerequisite technical skills can acquire ready-made datasets through third-party vendors, such as Gnip. In stark contrast with the situation facing social science research for most of the 20th century, today we certainly do not suffer from a lack of data. For example, the DOLLY project at the University of Kentucky has been collecting all geotagged tweets in the world since June 2012, totaling more than 9 billion data points and counting. Even small subsets of such data, from people talking about receiving a flu shot to people tweeting about their favorite beer brands, yield many thousands of data points.

Instead, the pressing problem that presents itself is how to gain meaningful insights from such large collections of spatial points. Although research may take any number of approaches (Crampton et al., 2013), an early step is to simply map or visualize these data in ways that reveal the presence (or absence) of underlying spatial processes and distributions.

The easiest approach for visualizing the spatial distribution of data—in this case, tweets from Hurricane Sandy (Shelton et al., 2014)—is, quite literally, putting the points on the map (exhibit 1).

This relatively straightforward one-to-one plotting of data points on a map presents two specific problems. First, such maps suffer from “overplotting”: many overlapping points obscure each other and make it difficult to assess the total number of points in each area. Second, even if the problem of overplotting were solved, these patterns largely mimic population density: in the case of Twitter, more tweets are generally sent from densely populated areas because these locations simply have more Twitter users. This second problem is prevalent in maps of online phenomena and even reached modest Internet fame after Randall Munroe devoted a popular XKCD comic to it (exhibit 2).

In the next sections, we walk through a step-by-step approach that first addresses the problem of overplotting by aggregating individual points to a hexagonal lattice. It subsequently corrects for this population density “mirroring” by normalizing raw counts through the calculation of an odds ratio.

Exhibit 2. Population Density?

Source: Reprinted from http://xkcd.com/1138/

## Fixing the Overplotting Problem

A range of common cartographic and geographic information science, or GIScience, approaches can solve the problem of overplotting. The first is to make each data point slightly transparent.

This approach is fine with only a small number of overlapping points but, in the case of big geodata, we are often confronted with hundreds of points overlapping in one location, while other locations have only one or two points. Another approach would be to visually “explode” overlapping features, slightly offsetting their position to prevent overlap. Again, this approach works well with smaller datasets (John Snow’s classic cholera map is a prime example of this approach [Snow, 1855]) but is not well suited for large datasets.

Another way to address the issue of overplotting is generating what is colloquially called a “heatmap.” Techniques such as kernel density estimation or kriging are used to create a (smooth) density surface. A major caveat, however, is that these techniques interpolate or “smooth” values in between actual data points and thus assume that the underlying spatial processes are continuous.

This assumption of continuity holds for many natural phenomena, such as temperature and precipitation, but is more problematic when applied to social phenomena. It is especially problematic on an urban scale, where stark differences in demographics, retailing, and so on are often present between neighborhoods or even from block to block. Although heatmaps are visually pleasing (and hence popular), they are not necessarily the most appropriate technique for gaining meaningful insight from online social media data.

A more suitable approach is to aggregate individual points to larger areas or polygons. These areas could be administrative regions, such as census tracts or counties, or they could be arbitrary spatial areas, such as rectangles, circles, or hexagons. Unless the final goal of the analysis is to compare the point data under study with other datasets that are available only for certain administrative units, aggregating to a lattice of arbitrary areas (such as hexagons) has two specific advantages from an analytical perspective.

First, administrative units often have varying sizes. For example, counties in the western part of the United States are, in general, much larger than their eastern counterparts. The larger counties not only have a higher chance of containing more points within their borders, but they also stand out much more visually. Aggregating to a regular lattice of rectangles or hexagons, in which every area has exactly the same size, solves this problem. Second, such a lattice enables us to address, although not solve per se, the Modifiable Areal Unit Problem (MAUP) by intentionally modifying the size of the rectangles or hexagons. This can be done to test whether the spatial patterns indeed change due to MAUP or simply to choose the “best” cell size based on the underlying phenomenon (see Wilson, 2013, for an example of the consequences of changing areal units).
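MAUP sensitivity is easy to demonstrate with a toy example. The following Python sketch (hypothetical one-dimensional points, not the article's Twitter data; the article's own code is in R) bins the same point pattern at two cell widths and arrives at different apparent hotspots:

```python
from collections import Counter

# Hypothetical 1-D "point pattern" (e.g., positions along a street).
points = [0.9, 1.1, 1.9, 2.1, 3.9]

def bin_counts(points, width):
    """Count points per cell of the given width (cells start at 0)."""
    return Counter(int(p // width) for p in points)

# Width-1 cells: the busiest cell is [1, 2).
print(dict(bin_counts(points, 1)))  # {0: 1, 1: 2, 2: 1, 3: 1}

# Width-2 cells: the busiest cell is now [0, 2), a different "hotspot."
print(dict(bin_counts(points, 2)))  # {0: 3, 1: 2}
```

Re-running an analysis at several cell sizes, as the article suggests, is a simple way to check whether a pattern is an artifact of the chosen lattice.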

## Creating a Hexagonal Lattice

Hexagonal lattices have seen a recent surge in popularity within online mapping, but they are more than just the latest fad; they have a few distinct advantages over rectangular grids. First, in cartographic terms, rectangular cells are more distracting: the eye is drawn to the horizontal and vertical grid lines, making it more difficult for the reader of the map to distinguish spatial patterns (Carr, Olsen, and White, 1992). Second, in analytical terms, hexagonal lattices have a higher representational accuracy than square or rectangular grids (Burt, 1980; Scott, 1988), meaning that they represent the underlying point pattern more closely. The hexagon is the highest-sided regular polygon that can still be used to tessellate (that is, cover a surface without gaps or overlap), and, as such, it is closest to the ideal of a circle. The closer a polygon is to a circle, the closer its border points are to its center, which partly explains the higher representational accuracy.
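How much closer a hexagon is to a circle than a square can be quantified with the isoperimetric quotient, 4πA/P², which equals 1 for a circle. This short Python sketch (illustrative only; it uses the standard area formula for a regular n-gon) compares the two cell shapes:

```python
import math

def isoperimetric_quotient(n_sides, side):
    """4*pi*A / P^2 for a regular polygon; equals 1 for a circle."""
    area = n_sides * side**2 / (4 * math.tan(math.pi / n_sides))
    perimeter = n_sides * side
    return 4 * math.pi * area / perimeter**2

print(round(isoperimetric_quotient(4, 1.0), 3))  # square:  0.785
print(round(isoperimetric_quotient(6, 1.0), 3))  # hexagon: 0.907
```

The hexagon scores roughly 0.91 against the square's 0.79, consistent with the representational-accuracy argument above.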

As such, the first practical step in generating a hexagonal lattice is to aggregate up from the original point pattern. This aggregation can be done quite easily in ArcMap¹ and QGIS² or using R (R Core Team, 2014). Given the power of R and its relative newness to geospatial analysis, we include the code snippets used for generating the maps in this article. More extensive (and commented) code with some sample data is available at https://github.com/atepoorthuis/smallstoriesbigdata.

```r
library(sp)

# Read the tweets and turn them into a spatial points object.
tweets <- read.csv("tweets.csv", colClasses=c("latitude"="numeric", "longitude"="numeric"))
coordinates(tweets) <- c("longitude", "latitude")
proj4string(tweets) <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"

# Generate a hexagonal lattice (about 3,000 cells) over the same spatial extent.
hex <- HexPoints2SpatialPolygons(spsample(tweets, n=3000, "hexagonal"))
```

At this point we read a dataset of tweets from a .csv file, point to the longitude and latitude columns for the spatial coordinates, set a projection, and then generate a hexagonal grid over the same spatial extent. A key variable in this code is the number of cells in the lattice (3,000 in this example), but this number can be readily changed to explore how changes in cell size affect the resulting visualization. After we have produced a hexagonal lattice, we can then simply spatially join each individual tweet to a grid cell.

```r
tweets$hex <- over(tweets, hex)$id
```

We take this intermediate step of adding the identifier of the corresponding grid cell to each individual tweet, because it also allows for the flexibility of sampling down power users (Poorthuis and Zook, 2014). Within most online social media, a power law, or close approximation, can be found in which a few users contribute by far the most content (Clauset, Shalizi, and Newman, 2009). If we wanted to correct for that effect, we could, for example, randomly sample down active users to a maximum of 5 data points (or some other selected value) per grid cell.
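Before the R implementation below, the capping logic itself can be sketched in outline. This Python version (hypothetical user and cell identifiers; the function name and the cap of 5 are illustrative) keeps at most five randomly chosen points per user per grid cell:

```python
import random
from collections import defaultdict

def cap_per_user_cell(records, cap=5, seed=42):
    """records: list of (user_id, hex_id) tuples.
    Keep at most `cap` randomly chosen records per (user, cell) pair."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, (user, cell) in enumerate(records):
        groups[(user, cell)].append(i)
    kept = []
    for indices in groups.values():
        kept.extend(indices if len(indices) <= cap else rng.sample(indices, cap))
    return sorted(records[i] for i in kept)

# One power user with 100 tweets in a single cell, one casual user with 2.
data = [("poweruser", "hex1")] * 100 + [("casual", "hex2")] * 2
capped = cap_per_user_cell(data)
print(len(capped))  # 7 (the power user is sampled down to 5; the casual user keeps 2)
```

Grouping by both user and cell, rather than by user alone, preserves a power user's presence across many cells while preventing any single cell from being dominated.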

```r
library(data.table)
tweets.dt <- data.table(tweets)

# For each user and grid cell, keep at most 5 randomly sampled tweets.
tweets.dt[, sample(.I, if (.N > 5) 5 else .N), by=list(u_id, hex)]
```

After that, it is just a matter of counting the number of tweets per grid cell and visualizing the results.

```r
tweets.dt[, tweets := .N, by=list(hex)]
```

1. http://www.arcgis.com/home/item.html?id=03388990d3274160afe240ac54763e57.

2. http://michaelminn.com/linux/mmqgis/.

An example of this step in the process is provided in exhibit 3, which shows the spatial pattern of tweets related to Hurricane Sandy sent in October 2012 (Shelton et al., 2014).

Exhibit 3. Aggregation to Hexagons: Number of Hurricane Sandy-related tweets (classed 1–84, 85–253, 254–562, 563–1,312, and 1,313–4,374)

## Normalizing the Cells Using an Odds Ratio

Although aggregating to hexagons solves the problem of overplotting to a large extent, the resulting spatial pattern still follows very closely the distribution of population. Given that this dataset is derived from social media, this problem is to be expected. We are, after all, still looking at the raw count of the number of tweets, which is heavily influenced by how many people happen to live in each hexagon.

A fortunate side effect of the aggregation to polygons is that it becomes much easier to normalize each raw count. For conventional data, we would likely choose to normalize a phenomenon by simply dividing raw counts by the total population or, for example, the area of each polygon.

In the case of online social media data, this approach has two specific disadvantages. First, the approach yields a ratio that becomes difficult to understand; for example, what does 15 tweets per square mile, or per 100,000 people, actually mean? Second, the total population might very well not be the same as the total tweeting population.

Instead, we calculate an odds ratio, which is slightly more sophisticated but has the great advantage of allowing us to normalize by any other variable, and the resulting ratios are easy to interpret (Edwards, 1963). In the case of social media data such as Twitter, it often makes sense to normalize by a random sample of all tweets, which stands in as a proxy for the total tweeting population, rather than by the total population in and of itself. By using the total tweeting population, we can visualize the distribution of a phenomenon within social media use, rather than the popularity of a social media service within the overall population. The formula for the odds ratio is

$$OR_i = \frac{p_i / p}{r_i / r} \qquad (1)$$

where $p_i$ is the number of tweets in hexagon $i$ related to the phenomenon of interest (for example, flu shot tweets or tweets related to a certain beer brand) and $p$ is the sum of all tweets related to that phenomenon in all hexagons; $r_i$ is the number of random tweets in hexagon $i$ and $r$ is the sum of all random tweets in all hexagons. We choose a random sample of all tweets at this point, but one could easily substitute other variables (for example, active Internet users or possibly another point-based phenomenon aggregated to the same hexagonal lattice).
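A quick numeric check of the odds ratio, in Python with hypothetical counts: suppose hexagon i contains 30 of 1,000 phenomenon tweets but only 10 of 1,000 random tweets.

```python
def odds_ratio(p_i, p, r_i, r):
    """OR_i = (p_i / p) / (r_i / r): share of phenomenon tweets in a
    hexagon relative to its share of the random baseline sample."""
    return (p_i / p) / (r_i / r)

# Hexagon i holds 3% of phenomenon tweets but only 1% of random tweets.
print(odds_ratio(30, 1000, 10, 1000))  # 3.0: three times more than expected
```

A hexagon with an equal share of both samples, say 10 of each 1,000, would return exactly 1.0, the midpoint discussed next.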

The resulting ratio has a midpoint of 1: at that value, a hexagon contains exactly as many data points related to our phenomenon of interest as we would expect based on the random sample of all tweets. Values lower than 1 indicate that we have fewer points of interest than expected, and vice versa.

For example, an odds ratio of 0.5 means that we find only half as many points of interest as expected, and a value of 2.0 means we find twice as many points as expected, based on the random sample of all tweets. We can easily calculate this odds ratio in R (see the result in exhibit 4).

Exhibit 4. Basic Odds Ratio (odds ratio of Hurricane Sandy-related tweets)

```r
# Read and prepare the random sample of tweets, exactly as before.
randomTweets <- read.csv("random.csv", colClasses=c("latitude"="numeric", "longitude"="numeric"))
coordinates(randomTweets) <- c("longitude", "latitude")
proj4string(randomTweets) <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
randomTweets$hex <- over(randomTweets, hex)$id

# Count random tweets per grid cell.
randomTweets.dt <- data.table(randomTweets)
randomTweets.dt[, random := .N, by=list(hex)]

# Join the two counts and calculate the odds ratio for each cell.
hex.join <- merge(tweets.dt, randomTweets.dt)
hex.join[, OR := (tweets/sum(tweets))/(random/sum(random)), ]
```

Because many cells contain only a small number of tweets, it is also useful to calculate a confidence interval around the odds ratio. Based on the standard error of the log odds ratio, the interval is

$$CI = \exp\left(\ln(OR_i) \pm z \sqrt{\frac{1}{p_i} + \frac{1}{p} + \frac{1}{r_i} + \frac{1}{r}}\right) \qquad (2)$$

where z is the z-score of the chosen confidence level (for example, z = 1.96 for a 95-percent confidence level).

We can use this approach to calculate both the upper and lower bounds of the confidence interval but, if we are interested only in significant instances of higher odds ratios, we can calculate and visualize the lower bound only. This approach enables one to say, for example, that the odds ratio in hexagon i is at least 1.5, with 95-percent confidence. To calculate this value in R, we only have to adapt the formula (the last line of the previous code snippet) slightly.

```r
# Lower bound of the 95-percent confidence interval for the odds ratio.
hex.join[, ORlowerconf := exp(log(OR) - 1.96*sqrt(1/tweets + 1/sum(tweets) + 1/random + 1/sum(random))), ]
```

When we visualize this lower bound of the confidence interval for the odds ratio, we arrive at the final step in our approach, seen in exhibit 5, which results in a clear, and in this case expected, spatial pattern largely following the areas most affected by Hurricane Sandy (see Shelton, 2014, for a more in-depth discussion of this pattern).
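The lower-bound calculation can be verified with hypothetical counts (30 of 1,000 phenomenon tweets and 10 of 1,000 random tweets in a hexagon); this Python sketch mirrors the R one-liner:

```python
import math

def or_lower_bound(p_i, p, r_i, r, z=1.96):
    """Lower confidence bound for the odds ratio:
    exp(ln(OR) - z * sqrt(1/p_i + 1/p + 1/r_i + 1/r))."""
    odds_ratio = (p_i / p) / (r_i / r)
    se = math.sqrt(1/p_i + 1/p + 1/r_i + 1/r)
    return math.exp(math.log(odds_ratio) - z * se)

# Odds ratio is 3.0, but with these small counts we can only claim,
# with 95-percent confidence, that it is at least about 1.46.
print(round(or_lower_bound(30, 1000, 10, 1000), 2))  # 1.46
```

Note how the sparse random count (10 tweets) dominates the standard error, pulling the conservative estimate well below the raw odds ratio of 3.0.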

Source: Reprinted from Shelton (2014)

## Final Considerations

The approach outlined in this article starts with an arguably noisy and large set of point-level data derived from social media. Using aggregation to a hexagonal lattice and subsequent normalization through the calculation of an odds ratio with confidence intervals, we go from a raw view of the data (exhibit 1) to a clear spatial pattern (exhibit 5). Although we have used a random “population” sample to normalize in the example, this approach is flexible; the same approach can be used to directly compare two different point datasets (for example, artists versus bankers) or different time periods of the same dataset.