
«Urban Problems and Spatial Methods, Volume 17, Number 1 • 2015. U.S. Department of Housing and Urban Development | Office of Policy Development and ...»


Moreover, those without the prerequisite technical skills can acquire ready-made datasets through third party vendors, such as Gnip. In stark contrast with the situation facing social science research for most of the 20th century, today we certainly do not suffer from a lack of data. For example, the Dolly project at the University of Kentucky has been collecting all geotagged tweets in the world since June 2012—totaling more than 9 billion data points, and counting. Even small subsets of such data, from people talking about receiving a flu shot to people tweeting about their favorite beer brands, yield many thousands of data points.

Instead, the pressing problem that presents itself is how to gain meaningful insights from such large collections of spatial points. Although research may take any number of approaches (Crampton et al., 2013), an early step is to simply map or visualize these data in ways that reveal the presence (or absence) of underlying spatial processes and distributions.

The easiest approach for visualizing the spatial distribution of data—in this case, tweets from Hurricane Sandy (Shelton et al., 2014)—is, quite literally, putting the points on the map (exhibit 1).

This relatively straightforward one-to-one plotting of data points on a map presents two specific problems. First, such maps suffer from “overplotting”: many overlapping points obscure each other and make it difficult to assess the total number of points in each area. Second, even if the problem of overplotting were solved, these patterns largely mimic population density: in the case of Twitter, more tweets are generally sent from densely populated areas because these locations simply have more Twitter users. This second problem is very much prevalent in maps of online phenomena and even reached modest Internet fame after Randall Munroe devoted a popular XKCD comic to it (exhibit 2).

In the next sections, we walk through a step-by-step approach that first addresses the problem of overplotting by aggregating individual points to a hexagonal lattice. It subsequently provides a solution to correct for this population density “mirroring” by normalizing raw counts through the calculation of an odds ratio.

Exhibit 2 Population Density?

Source: Reprinted from http://xkcd.com/1138/


Fixing the Overplotting Problem

A range of common cartographic and geographic information science, or GIScience, approaches can solve the problem of overplotting. The first is to make each data point slightly transparent. This approach is fine with only a small number of overlapping points but, in the case of big geodata, we are often confronted with hundreds of points overlapping in one location, while other locations have only one or two points. Another approach would be to visually “explode” overlapping features, slightly offsetting their position to prevent overlap. Again, this approach works well with smaller datasets (John Snow’s classic cholera map is a prime example of this approach [Snow, 1855]) but is not well suited for large datasets.

Another way to address the issue of overplotting is generating what is colloquially called a “heatmap.” Techniques such as kernel density estimation or kriging are used to create a (smooth) density surface. A major caveat, however, is that these techniques interpolate or “smooth” values in between actual data points and thus assume that the underlying spatial processes are continuous. This assumption holds for many natural phenomena, such as temperature and precipitation, but is more problematic when applied to social phenomena. It is especially problematic at the urban scale, where stark differences in demographics, retailing, and so on, are often present between neighborhoods or even from block to block. Although heatmaps are visually pleasing (and hence popular), they are not necessarily the most appropriate technique for gaining meaningful insight from online social media data.

A more suitable approach is to aggregate individual points to larger areas or polygons. These areas could be administrative regions, such as census tracts or counties, or they could be arbitrary spatial areas, such as rectangles, circles, or hexagons. Unless the final goal of the analysis is to compare the point data under study with other datasets that are available only for certain administrative units, aggregating to a lattice of arbitrary areas (such as hexagons) has two specific advantages from an analytical perspective. First, administrative units often have varying sizes. For example, counties in the western part of the United States, in general, are much larger than their eastern counterparts. The larger counties not only have a higher chance of having more points inside their border, but they also stand out much more visually. Aggregating to a regular lattice of rectangles or hexagons, in which every area has the exact same size, solves this problem. Second, such a lattice enables us to address, although not solve per se, the Modifiable Areal Unit Problem (MAUP) by intentionally modifying the size of the rectangles or hexagons. This can be done to test whether the spatial patterns indeed change due to MAUP or simply to choose the “best” cell size based on the underlying phenomenon (see Wilson, 2013, for an example of the consequences of changing areal units).
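The effect of changing the cell size can be sketched in a few lines of base R. This is an illustration only, not the article's analysis: it uses simulated points and a square lattice (simpler to index than hexagons), and the names `pts` and `bin_counts` and the two cell sizes are assumptions made for the example.

```r
# Simulated point pattern: a dense cluster plus uniform background noise.
set.seed(42)
pts <- data.frame(
  x = c(rnorm(500, mean = 2, sd = 0.3), runif(500, 0, 10)),
  y = c(rnorm(500, mean = 2, sd = 0.3), runif(500, 0, 10))
)

# Assign each point to a square cell of the given size and count per cell.
bin_counts <- function(points, cell_size) {
  cell <- paste(floor(points$x / cell_size),
                floor(points$y / cell_size), sep = "_")
  table(cell)
}

coarse <- bin_counts(pts, cell_size = 5)  # few large cells
fine   <- bin_counts(pts, cell_size = 1)  # many small cells

# Every point lands in exactly one cell, whatever the cell size.
sum(coarse) == nrow(pts)
sum(fine) == nrow(pts)

# Share of all points that falls in the single busiest cell:
# the same data look more or less concentrated depending on cell size.
max(coarse) / nrow(pts)
max(fine) / nrow(pts)
```

Rerunning the last two lines with other cell sizes is a quick way to see how sensitive an apparent hotspot is to the choice of areal unit.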

Creating a Hexagonal Lattice

Hexagonal lattices have seen a recent surge in popularity within online mapping, but they are more than just the latest fad; they have a few distinct advantages over rectangular grids. First, in cartographical terms, rectangular cells are more distracting. The eye is drawn to the horizontal and vertical grid lines, making it more difficult for the reader of the map to distinguish spatial patterns (Carr, Olsen, and White, 1992). Second, in analytical terms, hexagonal lattices have a higher representational accuracy than square or rectangular grids (Burt, 1980; Scott, 1988), meaning that they represent the underlying point pattern more closely. The hexagon is the regular polygon with the most sides that can still be used to tessellate (that is, cover a surface without gaps or overlap), and, as such, it is closest to the ideal of a circle. The closer a polygon is to a circle, the closer its border points are to its center, which partly explains the higher representational accuracy.
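The geometric claim can be checked with a short calculation. The sketch below, in base R, compares the farthest distance from the center to the border for a square, a regular hexagon, and a circle of equal area; the unit area and the variable names are assumptions chosen for illustration.

```r
# Farthest distance from the center to the border for shapes of equal area.
area <- 1

# Square: the farthest border point is a corner, half the diagonal away.
square_side   <- sqrt(area)
square_radius <- square_side * sqrt(2) / 2

# Regular hexagon: area = (3 * sqrt(3) / 2) * side^2, and the farthest
# border point (a vertex) lies exactly one side length from the center.
hex_side   <- sqrt(2 * area / (3 * sqrt(3)))
hex_radius <- hex_side

# Circle: every border point is at the same distance from the center.
circle_radius <- sqrt(area / pi)

round(c(square = square_radius, hexagon = hex_radius, circle = circle_radius), 3)
```

For a unit area the values come out at roughly 0.71 for the square, 0.62 for the hexagon, and 0.56 for the circle, so the hexagon sits noticeably closer to the circular ideal than the square does.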

Given these advantages, the first practical step is to generate a hexagonal lattice to which the original point pattern can be aggregated. This aggregation can be done quite easily in ArcMap (see footnote 1) and QGIS, or using R (R Core Team, 2014). Given the power of R and its relative newness to geospatial analysis, we include code snippets used for generating the maps in this article. More extensive (and commented) code with some sample data is available at https://github.com/atepoorthuis/smallstoriesbigdata.

    library(sp)

    # Read the tweets; latitude and longitude are parsed as numeric columns
    tweets <- read.csv("tweets.csv", colClasses=c("latitude"="numeric", "longitude"="numeric"))
    # Turn the data frame into a spatial object and set its projection
    coordinates(tweets) <- c("longitude", "latitude")
    proj4string(tweets) <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
    # Build a hexagonal lattice of roughly 3,000 cells over the same extent
    hex <- HexPoints2SpatialPolygons(spsample(tweets, n=3000, "hexagonal"))

At this point we read a dataset of tweets from a .csv file, point to the longitude and latitude columns for the spatial coordinates, set a projection, and then generate a hexagonal grid over the same spatial extent. A key variable in this code is the number of cells in the lattice (3,000 in this example), but this number can be readily changed to explore how changes in cell size affect the resulting visualization. After we have produced a hexagonal lattice, we can then simply spatially join each individual tweet to a grid cell.

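    # Attach the identifier of the containing grid cell to each tweet.
    # The $id accessor assumes hex carries an id attribute (a SpatialPolygonsDataFrame);
    # with plain SpatialPolygons, over() already returns the cell index.
    tweets$hex <- over(tweets, hex)$id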

We take this intermediate step of adding the identifier of the corresponding grid cell to each individual tweet, because it also allows for the flexibility of sampling down power users (Poorthuis and Zook, 2014). Within most online social media, a power law, or close approximation, can be found in which a few users contribute by far the most content (Clauset, Shalizi, and Newman, 2009). If we wanted to correct for that effect, we could, for example, randomly sample down active users to a maximum of 5 data points (or some other selected value) per grid cell.

    library(data.table)

    tweets.dt <- data.table(tweets)
    # Row indices of at most 5 randomly sampled tweets per user per grid cell
    tweets.dt[, sample(.I, if(.N > 5) 5 else .N), by=list(u_id, hex)]

After that, it is just a matter of counting the number of tweets per grid cell and visualizing the results.

    # Count the number of tweets in each grid cell
    tweets.dt <- tweets.dt[, list(tweets=.N), by=list(hex)]

Footnote 1: http://www.arcgis.com/home/item.html?id=03388990d3274160afe240ac54763e57.
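The capping and counting steps can also be sketched in base R on a toy dataset. This is a minimal illustration, not the authors' code: the data, the names `max_per_user`, `groups`, and `capped`, and the use of `split()` in place of the data.table idiom are all assumptions made for the example. It draws positions with `sample.int()` because `sample(x, n)` treats a single number `x` as the range `1:x`, which can matter when a group has only one row.

```r
# Toy data: user "a" is a power user with 8 tweets in cell 1.
set.seed(1)
tweets <- data.frame(
  u_id = c(rep("a", 8), "b", "c", "c"),
  hex  = c(rep(1, 8), 1, 2, 2)
)

max_per_user <- 5  # cap per user per grid cell (the value used in the text)

# Row numbers grouped by (user, cell); empty combinations are dropped.
groups <- split(seq_len(nrow(tweets)), list(tweets$u_id, tweets$hex), drop = TRUE)

# Keep at most max_per_user randomly chosen rows from each group.
keep <- unlist(lapply(groups, function(rows) {
  rows[sample.int(length(rows), min(length(rows), max_per_user))]
}))
capped <- tweets[sort(keep), ]

# Count the remaining tweets per grid cell.
table(capped$hex)
```

In this toy case the power user contributes 5 rather than 8 tweets to cell 1, so the count for that cell drops from 9 to 6 while cell 2 is unchanged.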

An example of this step in the process is provided in exhibit 3, which shows the spatial pattern of tweets related to Hurricane Sandy sent in October 2012 (Shelton et al., 2014).


Exhibit 3 Aggregation to Hexagons
Legend: number of Hurricane Sandy-related tweets (classes: 1–84; 85–253; 254–562; 563–1,312; 1,313–4,374)

Normalizing the Cells Using an Odds Ratio

Although aggregating to hexagons solves the problem of overplotting, to a large extent, the resulting spatial pattern still follows very closely the distribution of population. Given that this dataset is derived from social media, this problem is to be expected. We are, after all, still looking at the raw count of the number of tweets, which is heavily influenced by how many people happen to live in each hexagon.

A fortunate side effect of the aggregation to polygons is that it becomes much easier to normalize each raw count. For conventional data, we would likely choose to normalize a phenomenon by simply dividing raw counts by the total population or, for example, the area of each polygon. In the case of online social media data, this approach has two specific disadvantages. First, the approach yields a ratio that becomes difficult to understand; for example, what does 15 tweets per square mile or per 100,000 people actually mean? Second, the total population might very well not be the same as the total tweeting population.

Instead, we calculate an odds ratio, which is slightly more sophisticated but has the great advantage of allowing us to normalize by any other variable, and the resulting ratios are easy to interpret (Edwards, 1963). In the case of social media data such as Twitter, it often makes sense to normalize by a random sample of all tweets that stands in as a proxy for the total tweeting population rather than the total population in and of itself. By using the total tweeting population, we can visualize the distribution of a phenomenon within social media use, rather than the popularity of a social media service within the overall population. The formula for the odds ratio is

\[ \mathrm{OR}_i = \frac{p_i / p}{r_i / r} \tag{1} \]

where p_i is the number of tweets in hexagon i related to the phenomenon of interest (for example, flu shot tweets or tweets related to a certain beer brand) and p is the sum of all tweets related to that phenomenon in all hexagons. r_i is the number of random tweets in hexagon i and r is the sum of all random tweets in all hexagons. We choose a random sample of all tweets at this point, but one could easily substitute other variables, for example, active Internet users or possibly another point-based phenomenon aggregated to the same hexagonal lattice.

The resulting ratio has a midpoint of 1. At that midpoint, a hexagon contains exactly as many data points related to our phenomenon of interest as we would expect based on the random sample of all tweets. Values lower than 1 indicate we have fewer points of interest than expected, and values higher than 1 indicate more. For example, an odds ratio of 0.5 means that we find only half as many points of interest as we expected, and a value of 2.0 means we find twice as many points as we expected, based on the total tweeting population. We can easily calculate this odds ratio in R (see result in exhibit 4).

Exhibit 4 Basic Odds Ratio Odds ratio of Hurricane Sandy-related tweets


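    # Read the random sample of tweets and prepare it like the tweets of interest
    randomTweets <- read.csv("random.csv", colClasses=c("latitude"="numeric", "longitude"="numeric"))
    coordinates(randomTweets) <- c("longitude", "latitude")
    proj4string(randomTweets) <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
    randomTweets$hex <- over(randomTweets, hex)$id
    randomTweets.dt <- data.table(randomTweets)
    # Count the random tweets in each grid cell
    randomTweets.dt <- randomTweets.dt[, list(random=.N), by=list(hex)]
    # Join both counts per grid cell and compute the odds ratio
    hex.join <- merge(tweets.dt, randomTweets.dt, by="hex")
    hex.join[, OR := (tweets/sum(tweets))/(random/sum(random))]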
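A toy calculation makes the interpretation concrete. The counts below are hypothetical (they are not taken from the Hurricane Sandy data), and the data frame `cells` is an assumption made for the example; the calculation itself is the odds ratio as defined in the text, written in base R.

```r
# Hypothetical counts for three hexagons.
cells <- data.frame(
  hex    = c("A", "B", "C"),
  tweets = c(50, 30, 20),   # p_i: tweets about the phenomenon of interest
  random = c(100, 240, 60)  # r_i: tweets from the random sample
)

# Odds ratio per cell: share of topical tweets divided by share of random tweets.
cells$OR <- (cells$tweets / sum(cells$tweets)) /
            (cells$random / sum(cells$random))

cells
```

Hexagon A holds half of the topical tweets but only a quarter of the random tweets, giving an odds ratio of 2.0; hexagon B holds 30 percent of the topical tweets and 60 percent of the random tweets, giving 0.5, even though its raw count is higher than that of hexagon C.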

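The bounds of the confidence interval for the odds ratio, consistent with the R code that follows, can be written as

\[ \exp\left( \ln \mathrm{OR}_i \pm z \sqrt{\frac{1}{p_i} + \frac{1}{p} + \frac{1}{r_i} + \frac{1}{r}} \right) \]

where z is the z-score of the chosen confidence level (for example, z = 1.96 for 95 percent confidence level).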

We can use this approach to calculate both the upper and lower bounds of the confidence interval but, if we are interested only in significant instances of higher odds ratios, we can calculate and visualize the lower bound only. This approach would enable one to say, for example, that the value in hexagon i is at least 1.5, with 95 percent confidence. To calculate this value in R, we only have to adapt the formula (the last line of the previous code snippet) a little bit.

    # Lower bound of the 95 percent confidence interval for the odds ratio
    hex.join[, ORlowerconf := exp(log(OR) - 1.96 * sqrt(1/tweets + 1/sum(tweets) + 1/random + 1/sum(random)))]

When we visualize this lower bound of the confidence interval for the odds ratio, we get to the final step in our approach, seen in exhibit 5, which results in a clear (and in this case expected) spatial pattern largely following the areas most affected by Hurricane Sandy (see Shelton, 2014, for a more in-depth discussion of this pattern).
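The effect of the lower bound is easiest to see by comparing two cells with the same odds ratio but very different counts. The sketch below wraps the calculation from the snippet above in a small base R function; the function name `or_lower`, its argument names, and the counts are hypothetical and chosen only for illustration.

```r
# Lower bound of the confidence interval for the odds ratio of one cell.
# p_i, r_i: topical and random tweets in the cell; p, r: totals over all cells.
or_lower <- function(p_i, p, r_i, r, z = 1.96) {
  or <- (p_i / p) / (r_i / r)
  exp(log(or) - z * sqrt(1 / p_i + 1 / p + 1 / r_i + 1 / r))
}

# Two hypothetical cells, both with an odds ratio of 2.
small <- or_lower(p_i = 2,   p = 1000, r_i = 10,   r = 10000)
large <- or_lower(p_i = 200, p = 1000, r_i = 1000, r = 10000)

round(c(small = small, large = large), 2)
```

With only a handful of tweets the lower bound falls below 1, so the apparent doubling cannot be distinguished from noise; with a hundred times as many tweets the lower bound stays well above 1.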

Exhibit 5
Source: Reprinted from Shelton (2014)

Final Considerations

The approach outlined in this article starts with an arguably noisy and large set of point-level data derived from social media. Using aggregation to a hexagonal lattice and subsequent normalization and calculation of an odds ratio with confidence intervals, we go from a raw view on the data (exhibit 1) to a clear spatial pattern (exhibit 5). Although we have used a random “population” sample to normalize in the example, this approach is flexible; thus, the same approach can be used to directly compare two different point datasets (for example, artists versus bankers) or different time periods of the same dataset.
