Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global News Media Tone in Time and Space

A paper published yesterday in the peer-reviewed journal First Monday combines advanced supercomputing with a quarter-century of worldwide news to forecast and visualize human behavior, from civil unrest to the movement of individuals.

The paper, titled “Culturomics 2.0: Forecasting Large-Scale Human Behavior Using Global News Media Tone in Time and Space,” uses the tone and location of news coverage from across the world to forecast country stability (including retroactively predicting the recent Arab Spring), estimate Osama Bin Laden’s final location as a 200-kilometer radius around Abbottabad, and uncover the six world civilizations of the global news media. The research also demonstrates that the news is indeed becoming more negative and even visualizes global human societal conflict and cooperation over the last quarter century.

It seems like a real attempt at psychohistory, the fictional science from Isaac Asimov’s Foundation series of novels.

Psychohistory is a fictional science in Isaac Asimov’s Foundation universe which combines history, sociology, and mathematical statistics to make general predictions about the future behavior of very large groups of people, such as the Galactic Empire.

Using the large shared-memory supercomputer Nautilus, Kalev Leetaru of the University of Illinois in Urbana-Champaign combined three massive news archives totaling more than 100 million articles worldwide to explore the global consciousness of the news media. The complete New York Times from 1945 to 2005, the unclassified edition of Summary of World Broadcasts from 1979 to 2010, and an archive of English-language Google News articles spanning 2006 to 2011 were used to capture a cross-section of the U.S. media spanning half a century and the global media over a quarter-century.

Advanced tonal, geographic, and network analysis methods were used to produce a network 2.4 petabytes in size containing more than 10 billion people, places, things, and activities connected by over 100 trillion relationships, capturing a cross-section of Earth from the news media. A subset of findings from this analysis was then reproduced for this study using more traditional methods and smaller-scale workflows that offer a model for a new class of digital humanities research that explores how the world views itself.

Introduction to Culturomics

The emerging field of “Culturomics” seeks to explore broad cultural trends through the computerized analysis of vast digital book archives, offering novel insights into the functioning of human society. Yet, books represent the “digested history” of humanity, written with the benefit of hindsight. People take action based on the imperfect information available to them at the time, and the news media captures a snapshot of the real–time public information environment. News contains far more than just factual details: an array of cultural and contextual influences strongly affects how events are framed for an outlet’s audience, offering a window into national consciousness. A growing body of work has shown that measuring the “tone” of this real–time consciousness can accurately forecast many broad social behaviors, ranging from box office sales to the stock market itself.

Can the public tone of global news data forecast even broader behaviors, such as the stability of nations, the location of terrorist leaders, or even offer new insight on conflict and cooperation among countries, as accurately as it predicts movie sales or stock movements? This study makes use of a 30–year translated archive of news reports from nearly every country of the world, applying a range of computational content analysis approaches including tone mining, geocoding, and network analysis, to present “Culturomics 2.0.” The traditional Culturomics approach treats every word or phrase as a generic object with no associated meaning and measures only the change in the frequency of its usage over time. The Culturomics 2.0 approach introduced in this paper focuses on extending this model by imbuing the system with higher–level knowledge about each word, specifically focusing on “news tone” and geographic location, given their importance to the understanding of news coverage. Translating textual geographic references into mappable coordinates and quantifying the latent “tone” of news into computable numeric data permits an entirely new class of research questions to be explored via the news media not possible through the traditional frequency count approach.
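
To make the contrast between the two approaches concrete, here is a minimal sketch in Python. It is not the paper's implementation: the word lists, the tiny gazetteer, and the function names are all hypothetical, and the tone formula (positive minus negative words, divided by total words) is just one simple assumption about how tone could be quantified.

```python
import re

# Hypothetical mini-lexicons and gazetteer, for illustration only.
POSITIVE = {"peace", "growth", "agreement", "celebrate"}
NEGATIVE = {"attack", "crisis", "bombing", "protest"}
GAZETTEER = {  # lowercase place name -> approximate (latitude, longitude)
    "cairo": (30.04, 31.24),
    "abbottabad": (34.17, 73.22),
}

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_frequency(text, term):
    """Frequency-count view: share of tokens equal to `term`, nothing more."""
    tokens = tokenize(text)
    return tokens.count(term) / len(tokens) if tokens else 0.0

def tone_score(text):
    """Tone view: (positive words - negative words) / total words."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    positive = sum(token in POSITIVE for token in tokens)
    negative = sum(token in NEGATIVE for token in tokens)
    return (positive - negative) / len(tokens)

def locations_mentioned(text):
    """Geocoding view: map single-word place names to coordinates."""
    return [(t, GAZETTEER[t]) for t in tokenize(text) if t in GAZETTEER]
```

The frequency function only knows how often a word occurs, while the other two attach meaning to it: a numeric tone and a point on the map. A real system would need far larger dictionaries and a geocoder that handles multi-word and ambiguous place names.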

This study will explore how the latent tone of a large digital news archive can be visualized to understand macro–level changes in global society in both time and space. By measuring the tone of news coverage about a single geography over time, the study develops a fundamentally new approach to conflict early warning that “passively crowdsources” the global mood about each country in the world. This is found to offer highly accurate short–term forecasts of national stability. Focusing on the spatial dimension and moving from the country to the city level, the geographic framing of the news is found to offer significant insights into both nationalistic views of the world and the way in which cultures and “civilizations” are portrayed by the media. Finally, mapping the geographies most closely associated with Osama bin Laden by the news media prior to his capture is found to fairly accurately pinpoint his actual location. Global news media tone that is temporally and spatially aware is found to offer an intriguing new approach to modeling the behavior of global society itself.
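
One simple way to picture tone-based early warning is to express each month's average tone for a country as a number of standard deviations from that country's long-run mean, and to flag unusually negative months. The sketch below does exactly that. The function names and the threshold of -2.0 are assumptions made for illustration, not details taken from this article, and because the mean is computed over the whole series it describes past data rather than producing a true forecast.

```python
from statistics import mean, pstdev

def tone_z_scores(monthly_tone):
    """Express each monthly tone as standard deviations from the series mean."""
    mu = mean(monthly_tone)
    sigma = pstdev(monthly_tone)
    if sigma == 0:
        # A perfectly flat series has no unusual months.
        return [0.0] * len(monthly_tone)
    return [(tone - mu) / sigma for tone in monthly_tone]

def flag_low_tone(monthly_tone, threshold=-2.0):
    """Return the indexes of months whose z-score is at or below `threshold`."""
    scores = tone_z_scores(monthly_tone)
    return [i for i, z in enumerate(scores) if z <= threshold]

# Nine neutral months followed by one sharply negative month (made-up data).
example = [0.0] * 9 + [-10.0]
flagged = flag_low_tone(example)
```

With this made-up series, only the final month is flagged. An operational version would compare each month against earlier months only, so that the signal is available before the events it is meant to anticipate.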

The spatial dimension of news

Conflict early warning uses the spatial dimension of the news only as a filtering mechanism, homing in on city–level references to reduce noise. Yet, location is a critical component of the news, with a typical news article averaging one location mention every 200–300 words. The New York Times, with 2.9 billion total words from 1945–2005, mentions 369,000 unique locations more than 10.4 million times, or around one location every 279 words. The Summary of World Broadcasts, with 1.2 billion words from 1979–2010, has 201,000 unique locations mentioned roughly 5.81 million times, or around one geographic reference every 215 words. While the previous section explored news tone visualized through time, this section will visualize it spatially, exploring several key questions about journalism.

The maps below compare the world in 2005 according to the New York Times and the Summary of World Broadcasts. Each city or other geographic landmark (such as islands, oceans, mountains, rivers, etc.) is color–coded on a 400–point scale from bright green (high positivity) to bright red (high negativity), based on the average tone of all articles mentioning that city in 2005. Each article mentioning two or more cities together results in a link being drawn between those cities, and the average tone of all articles mentioning both cities is used to color–code that link on the same color scale as the cities. Alternatively, a grammatical approach could have been used, with data mining tools teasing out the tone of each individual city mentioned, rather than assigning the overall document tone to all cities mentioned in that document. The simpler method was chosen here in order to capture geographic “framing.” In essence, if a city is mentioned in a positive light in highly negative documents over a long period of time, that city is being contextualized by the news media as having some relationship with the negative events, which this technique captures. More critically, this approach also captures the framed connections among cities. A typical New York Times article about a bombing in a foreign city will usually include a quote from the White House condemning the attack. The White House has nothing to do with the attack itself, but is being contextualized as an actor in the events as they are described to an American audience.
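
The document-level method described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the study's code: each article is assumed to arrive as a (tone, list of cities) pair, tone is assumed to lie between -1.0 and 1.0, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def build_city_network(articles):
    """Average document tone per city and per pair of co-mentioned cities.

    `articles` is an iterable of (tone, cities) pairs. Every city in an
    article inherits the whole article's tone, which is what captures framing.
    """
    city_tones = defaultdict(list)
    link_tones = defaultdict(list)
    for tone, cities in articles:
        unique = sorted(set(cities))
        for city in unique:
            city_tones[city].append(tone)
        # One link for every pair of cities mentioned in the same article.
        for pair in combinations(unique, 2):
            link_tones[pair].append(tone)
    city_avg = {c: sum(t) / len(t) for c, t in city_tones.items()}
    link_avg = {p: sum(t) / len(t) for p, t in link_tones.items()}
    return city_avg, link_avg

def tone_to_color_index(tone, lo=-1.0, hi=1.0, steps=400):
    """Map tone onto a 400-point scale: 0 is bright red, 399 is bright green."""
    clamped = max(lo, min(hi, tone))
    fraction = (clamped - lo) / (hi - lo)
    return min(steps - 1, int(fraction * steps))

# Made-up articles: a negative story that also quotes Washington, and so on.
articles = [
    (-1.0, ["Baghdad", "Washington"]),
    (0.5, ["Washington", "London"]),
    (0.0, ["Baghdad"]),
]
city_avg, link_avg = build_city_network(articles)
```

In this toy data, Washington ends up with a negative average tone and a strongly negative link to Baghdad purely because it was mentioned in the same negative article, which mirrors the White House example above.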
