It shouldn't surprise anyone that our ability to model the global climate, to visualize partial 3D neural pathways, to complete the standard model, and to predict the spread of disease epidemics or the outbreak of food riots all arrives at roughly the same time in history that computers start beating humans at Jeopardy! and 2.4 billion people (almost 70% of the West) are busy collectively churning out about 2.5 quintillion bytes of data every day.
In the last two decades, humanity has become very good at collecting, moving, and sorting through massive amounts of data, and has grown more comfortable with the network theory and computational tools for modeling information at these scales. These models let us view system-level activity and organizational behavior in a way we never could before. This shift has had a strongly unifying effect on the sciences, and in addition to changing many of our customs and norms (and successfully knocking the environment out of a relatively stable state), it has also brought some important changes to the way we think about science. The combination is already starting to have a dramatic impact on our future and what we can do with it.
"Big Data" is a term that is obviously designed to scare the ignorant and emasculate the public. What matters, of course, is not the size of our data but the fact that the mountains of information we all constantly generate can be harnessed and harvested to help us understand ourselves and how we work at many levels of analysis. In other words, "big data" is the lens that helps us see the complexity of our actual situation. And that situation is scary enough without scaremongering, thank you very much.
In this first post, I'm going to cite some current research in complexity science and some visualization techniques to give people a sense of the field and hopefully stimulate some discussion. I have plenty to say about both the theory and its implications, but hopefully that will come out over the course of the thread.
Big Data Needs a Big Theory to Go with It
As the world becomes increasingly complex and interconnected, some of our biggest challenges have begun to seem intractable. What should we do about uncertainty in the financial markets? How can we predict energy supply and demand? How will climate change play out? How do we cope with rapid urbanization? Our traditional approaches to these problems are often qualitative and disjointed and lead to unintended consequences. To bring scientific rigor to the challenges of our time, we need to develop a deeper understanding of complexity itself.
What does this mean? Complexity comes into play when there are many parts that can interact in many different ways so that the whole takes on a life of its own: it adapts and evolves in response to changing conditions. It can be prone to sudden and seemingly unpredictable changes—a market crash is the classic example. One or more trends can reinforce other trends in a “positive feedback loop” until things swiftly spiral out of control and cross a tipping point, beyond which behavior changes radically.
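The tipping-point behavior described above has a standard toy model that is not from the post itself: the fold (saddle-node) normal form dx/dt = -x³ + x + c, where c stands in for an external forcing. A minimal sketch, assuming this hypothetical model, shows how a small change in forcing produces a discontinuous jump in where the system settles:

```python
def equilibrium(c, x0=-1.0, dt=0.01, steps=20000):
    """Relax dx/dt = -x**3 + x + c to a stable equilibrium from x0
    by simple Euler integration."""
    x = x0
    for _ in range(steps):
        x += dt * (-x**3 + x + c)
    return x

# Below the tipping value c* = 2/(3*sqrt(3)) ~ 0.385 the system has two
# stable states and stays on the lower branch; just past c*, the lower
# equilibrium vanishes and the state jumps abruptly to the upper branch.
low = equilibrium(c=0.30)    # settles near the lower branch (~ -0.79)
high = equilibrium(c=0.45)   # lower branch gone: jumps to ~ +1.18
```

A small nudge in c (0.30 to 0.45) moves the outcome from one qualitative regime to another, which is exactly why behavior near a tipping point looks "sudden and seemingly unpredictable" from inside the system.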
What makes a “complex system” so vexing is that its collective characteristics cannot easily be predicted from underlying components: the whole is greater than, and often significantly different from, the sum of its parts. A city is much more than its buildings and people. Our bodies are more than the totality of our cells. This quality, called emergent behavior, is characteristic of economies, financial markets, urban communities, companies, organisms, the Internet, galaxies and the health care system.
The digital revolution is driving much of the increasing complexity and pace of life we are now seeing, but this technology also presents an opportunity. The ubiquity of cell phones and electronic transactions, the increasing use of personal medical probes, and the concept of the electronically wired “smart city” are already providing us with enormous amounts of data. With new computational tools and techniques to digest vast, interrelated databases, researchers and practitioners in science, technology, business and government have begun to bring large-scale simulations and models to bear on questions formerly out of reach of quantitative analysis, such as how cooperation emerges in society, what conditions promote innovation, and how conflicts spread and grow.
The trouble is, we don't have a unified, conceptual framework for addressing questions of complexity. We don't know what kind of data we need, nor how much, or what critical questions we should be asking. “Big data” without a “big theory” to go with it loses much of its potency and usefulness, potentially generating new unintended consequences.
When the industrial age focused society's attention on energy in its many manifestations—steam, chemical, mechanical, and so on—the universal laws of thermodynamics came as a response. We now need to ask if our age can produce universal laws of complexity that integrate energy with information. What are the underlying principles that transcend the extraordinary diversity and historical contingency and interconnectivity of financial markets, populations, ecosystems, war and conflict, pandemics and cancer? An overarching predictive, mathematical framework for complex systems would, in principle, incorporate the dynamics and organization of any complex system in a quantitative, computable framework.
We will probably never make detailed predictions of complex systems, but coarse-grained descriptions that lead to quantitative predictions for essential features are within our grasp. We won't predict when the next financial crash will occur, but we ought to be able to assign a probability of one occurring in the next few years. The field is in the midst of a broad synthesis of scientific disciplines, helping reverse the trend toward fragmentation and specialization, and is groping toward a more unified, holistic framework for tackling society's big questions. The future of the human enterprise may well depend on it.
In this thread, we talk about the universe from the perspective of complex systems theory.
Competition among memes in a world with limited attention (Sep. 2011)
... the abundance of information to which we are exposed through online social networks and other socio-technical systems is exceeding our capacity to consume it. Ideas must compete for our scarce individual and collective attention. As a result, the dynamic of information is driven more than ever before by the economy of attention...
Our results do not constitute a proof that exogenous features, like intrinsic values of memes, play no role in determining their popularity. However we have shown that at the statistical level it is not necessary to invoke external explanations for the observed global dynamics of memes. This appears as an arresting conclusion that makes information epidemics quite different from the basic modeling and conceptual framework of biological epidemics. While the intrinsic features of viruses and their adaptation to hosts are extremely relevant in determining the winning strains, in the information world the limited time and attention of human behavior are sufficient to generate a complex information landscape and define a wide range of different meme spreading patterns. This calls for a major revision of many concepts commonly used in the modeling and characterization of meme diffusion and opens the path to different frameworks for the analysis of competition among ideas and strategies for the optimization/suppression of their spread.
Fig. 1: Nodes represent Twitter users, and directed edges represent retweeted posts that carry the meme. The brightness of a node indicates the activity (number of retweets) of a user, and the weight of an edge reflects the number of retweets between two users. (a) The #Japan meme shows how news about the March 2011 earthquake propagated. (b) The #GOP tag stands for the US Republican Party and, like many political memes, displays a strong polarization between people with opposing views. Memes related to the “Arab Spring” and in particular the 2011 uprisings in (c) #Egypt and (d) #Syria display characteristic hub users and strong connections, respectively.
Quantifying Collective Attention from Tweet Stream
In this study, we focused on the emergence of “collective attention” on Twitter, a popular social networking service. We propose a simple method for detecting and measuring the collective attention evoked by various types of events. This method exploits the fact that tweeting activity exhibits a burst-like increase and an irregular oscillation when a particular real-world event occurs; otherwise, it follows regular circadian rhythms. The difference between regular and irregular states in the tweet stream was measured using the Jensen-Shannon divergence, which corresponds to the intensity of collective attention.
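The Jensen-Shannon divergence the study uses can be computed directly from two normalized histograms. A minimal sketch with toy data (the hourly histograms below are illustrative, not the paper's): comparing a flat circadian day against a day with a burst at noon yields a clearly nonzero divergence, which is the "intensity of collective attention" signal.

```python
from math import log2

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete
    probability distributions given as equal-length lists."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        # Kullback-Leibler divergence, skipping zero-probability bins
        return sum(ai * log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hourly tweet-volume histograms (normalized): a regular circadian
# day vs. a bursty day with half the volume concentrated at noon.
regular = [1 / 24] * 24
bursty = [0.5 if h == 12 else 0.5 / 23 for h in range(24)]
print(js_divergence(regular, regular))  # 0.0 -- identical states
print(js_divergence(regular, bursty))   # > 0 -- burst of attention
```

With base-2 logarithms the divergence is bounded between 0 and 1, which makes it convenient as a normalized attention score.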
Figure 6. Visualization of RT networks during collective attention.
(A) Tohoku-oki earthquake (# of nodes = 27,340, # of links = 27,709), (B) FIFA Women's World Cup 2011 (# of nodes = 9,277, # of links = 8,450), (C) Castle in the Sky (# of nodes = 1,183, # of links = 793), and (D) total lunar eclipse (# of nodes = 893, # of links = 553). The nodes represent users, which are connected if there is a RT with key terms related to the target event. In each figure, the node sizes are proportional to the number of retweeted tweets. Only the nodes with retweeted tweets are shown for clarity.
A map of people who can walk to fresh food in 5 minutes in Washington D.C.
Real-time visualization of air traffic.
The Twitter of Babel: Mapping World Languages through Microblogging Platforms
Large scale analysis and statistics of socio-technical systems that just a few short years ago would have required the use of consistent economic and human resources can nowadays be conveniently performed by mining the enormous amount of digital data produced by human activities. Although a characterization of several aspects of our societies is emerging from the data revolution, a number of questions concerning the reliability and the biases inherent to the big data “proxies” of social life are still open.
Here, we survey worldwide linguistic indicators and trends through the analysis of a large-scale dataset of microblogging posts. We show that available data allow for the study of language geography at scales ranging from country-level aggregation to specific city neighborhoods. The high resolution and coverage of the data allows us to investigate different indicators such as the linguistic homogeneity of different countries, the touristic seasonal patterns within countries and the geographical distribution of different languages in multilingual regions.
This work highlights the potential of geolocalized studies of open data sources to improve current analysis and develop indicators for major social phenomena in specific communities.
Figure 8. Language polarization in Belgium and Catalonia, Spain.
In each cell ( resolution) we compute the user-normalized ratio between the two languages being considered in each case. A) Belgium. B) Catalonia. The color bar is labeled according to the relative dominance of the language denoted by blue. In Belgium, English accounts for of the language share.
Figure 9. Language polarization in Montreal, QC, Canada.
English and French are considered. In each cell () we compute the user-normalized ratio between English and French (excluding all other languages). Blue - English, Yellow - French. The color bar is labeled according to the relative dominance of English to French.
Figure 10. Language polarization in New York City, NY, USA.
The second language by district or municipality (in the case of New Jersey state) is shown. Blue - Spanish, Light Green - Korean, Fuchsia - Russian, Red - Portuguese, Yellow - Japanese, Pink - Dutch, Grey - Danish, Coral - Indonesian.
Characterizing scientific production and consumption in Physics
The network structures of city-to-city citation networks. (A) The backbones (α = 0.1) of the citation networks at the city level within the United States in 1960, 1990, 2009 (from the left to right). (B) The backbones (α = 1, 0.1, 0.1 from left to right) of the citation networks at the city level within the European Union 27 countries as well as Switzerland and Norway in 1960, 1990, 2009 (from the left to right). In (A) and (B), the color shows the direction of links: if node i cites node j there is a link starting with blue and ending with red.
A data-driven study of the patterns of life for 180,000 people
The Bay Area is shown with a dot for each tweet, colored by the radius of gyration of its author. The color scale is logarithmic, so we can compare people with very different habits.
Movement pattern exhibited by 180,000 individuals in 2011, as inferred from 37 million geolocated tweets. Colormap shows the probability density in log10.
Several features of the map reveal interesting patterns. First, the teardrop shape of the contours demonstrates that people travel predominantly along their principal axis, with deviations becoming shorter and less frequent as they move farther away. Second, the appearance of two spatially distinct yellow regions suggests that people spend the vast majority of their time near two locations. We refer to these locations as the work and home locales, where the home locale is centered on the dark red region right of the origin, and the work locale is centered just left of the origin.
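The radius of gyration used to color the Bay Area map is just the RMS distance of a person's observed positions from their center of mass. A minimal planar sketch (real mobility studies compute this over great-circle distances on geographic coordinates; the example data is invented):

```python
from math import sqrt

def radius_of_gyration(points):
    """RMS distance of a set of (x, y) positions from their
    center of mass -- a single-number summary of how far a
    person ranges in their daily movements."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sqrt(sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n)

# A commuter whose tweets cluster around home (0, 0) and work (10, 0)
# gets a radius of gyration of roughly half the commute distance:
commuter = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(radius_of_gyration(commuter))  # ~5.02
```

A person who tweets from a single neighborhood gets a value near zero, while a long-haul commuter gets a large one, which is why a logarithmic color scale is needed to compare them on one map.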
Global Internet Activity
Wikipedia's early stock market warning signs
Suzy Moat and Tobias Preis, of Warwick Business School, UK, along with Chester Curme, Adam Avakian, Dror Y. Kenett and H. Eugene Stanley, of Boston University, USA, looked at how often the 30 pages describing the companies listed in the Dow Jones Industrial Average, such as Procter & Gamble, Bank of America, and The Walt Disney Company, were viewed between December 2007 and April 2012.
They found that a simple trading strategy based on changes in the frequency of views would have led to significant profits of up to 141 per cent.
Similarly, a strategy based on views of 285 pages relating to general financial topics, such as macroeconomics, capital and wealth, would have also generated profits of up to 297 per cent. A buy and hold strategy during this period would have led to only a 3 per cent profit.
Quantifying Trading Behavior in Financial Markets Using Google Trends
Figure 2: Cumulative performance of an investment strategy based on Google Trends data.
Profit and loss for an investment strategy based on the volume of the search term debt, the best performing keyword in our analysis, with Δt = 3 weeks, plotted as a function of time (blue line). This is compared to the “buy and hold” strategy (red line) and the standard deviation of 10,000 simulations using a purely random investment strategy (dashed lines). The Google Trends strategy using the search volume of the term debt would have yielded a profit of 326%.
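The rule behind both the Wikipedia and Google Trends results is a moving-average comparison: after a week of rising attention, go short; otherwise go long. The sketch below is my own minimal implementation of that rule on invented weekly data (the function name, toy volumes, and prices are all illustrative, not the papers' data):

```python
def mean(xs):
    return sum(xs) / len(xs)

def trend_strategy(volumes, prices, dt=3):
    """Cumulative value of $1 traded with the attention rule: if this
    week's page-view/search volume exceeds the mean of the previous
    dt weeks, short the index for the next week; otherwise go long."""
    value = 1.0
    for t in range(dt, len(prices) - 1):
        weekly_return = prices[t + 1] / prices[t] - 1
        if volumes[t] > mean(volumes[t - dt:t]):
            value *= 1 - weekly_return   # short: gain when price falls
        else:
            value *= 1 + weekly_return   # long: gain when price rises
    return value

# Toy weeks in which spikes of attention precede price drops:
volumes = [1, 1, 1, 5, 5, 1, 1, 6, 6, 1]
prices = [100, 100, 100, 100, 90, 90, 90, 90, 80, 80]
print(trend_strategy(volumes, prices))  # ~1.22 while the index falls 20%
```

The point of the toy numbers is the asymmetry: because attention spikes here lead the price drops, the rule profits even though buy-and-hold loses money over the same span.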
This video offers the first clear look at what those high-frequency trading robots are doing every day, all day, now that they control more than half of all market volume.
Inside of the one half-second of trading represented by the video, more than 1,200 orders and 215 actual trades occur, Hunsader says. (The colored boxes in the video represent exchanges, and the dots that go flying represent individual orders.)
Along the horizontal axis are the hours in the market day. Along the vertical axis is the volume of high-frequency stock trading on various exchanges, which are color-coded. As you can see, high-speed trading was almost non-existent in 2007, but has completely blown up and overwhelmed the stock market in the past few years.
To move the challenge of a "theory of complexity" along:
Evolutionary Origin of Scale Free Networks
Barabási & Oltvai, Nature Reviews Genetics, 2004
The ubiquity of scale-free networks and hubs in technological, biological and social systems requires an explanation. It has emerged that two fundamental processes have a key role in the development of real networks.
First, most networks are the result of a growth process, during which new nodes join the system over an extended time period. This is the case for the World Wide Web, which has grown from 1 to more than 3 billion web pages over a 10-year period.
Second, nodes prefer to connect to nodes that already have many links, a process that is known as preferential attachment. For example, on the World Wide Web we are more familiar with the highly connected web pages, and therefore are more likely to link to them.
Growth and preferential attachment are jointly responsible for the emergence of the scale-free property in complex networks. Indeed, if a node has many links, new nodes will tend to connect to it with a higher probability. This node will therefore gain new links at a higher rate than its less connected peers and will turn into a hub.
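Growth plus preferential attachment is the Barabási–Albert model, and it can be sketched in a few lines of standard-library Python (function name and parameters are mine, not from the review). Choosing attachment targets from a list that repeats each node once per edge endpoint makes the pick probability proportional to degree:

```python
import random

def barabasi_albert(n, m, seed=0):
    """Grow a network to n nodes; each new node attaches m edges to
    existing nodes chosen with probability proportional to degree."""
    rng = random.Random(seed)
    # Start from a small fully connected core of m + 1 nodes.
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    # Each endpoint appearance in `targets` weights a node by its degree.
    targets = [v for e in edges for v in e]
    for new in range(m + 1, n):
        chosen = set()
        while len(chosen) < m:           # m distinct neighbors
            chosen.add(rng.choice(targets))
        for old in chosen:
            edges.append((new, old))
            targets += [new, old]
    return edges

edges = barabasi_albert(n=2000, m=2)
degree = {}
for a, b in edges:
    degree[a] = degree.get(a, 0) + 1
    degree[b] = degree.get(b, 0) + 1
# Hubs emerge: the best-connected node far exceeds the mean degree (~4).
```

Running this, the maximum degree grows roughly like the square root of the network size while the average stays fixed, which is the rich-get-richer signature behind the scale-free degree distribution.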
Growth and preferential attachment have a common origin in protein networks that is probably rooted in gene duplication. Duplicated genes produce identical proteins that interact with the same protein partners. Therefore, each protein that is in contact with a duplicated protein gains an extra link.
Highly connected proteins have a natural advantage: it is not that they are more (or less) likely to be duplicated, but they are more likely to have a link to a duplicated protein than their weakly connected cousins, and therefore they are more likely to gain new links if a randomly selected protein is duplicated. This bias represents a subtle version of preferential attachment.
The most important feature of this explanation is that it traces the origin of the scale-free topology back to a well-known biological mechanism — gene duplication. Although the role of gene duplication has been shown only for protein interaction networks, it probably explains, with appropriate adjustments, the emergence of the scale-free features in the regulatory and metabolic networks as well.
It should be noted that, although the models show beyond doubt that gene duplication can lead to a scale-free topology, there is no direct proof that this mechanism is the only one, or the one that generates the observed power laws in cellular networks. However, as gene duplication is a major engineer of the genomic landscape, it is likely to be a key mechanism for generating the scale-free topology.
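The gene-duplication mechanism described above can also be sketched directly. The toy model below does pure duplication (no divergence, i.e., the copy keeps every link), which is a deliberate simplification of the duplication-divergence models the review discusses; the function name and seed network are mine:

```python
import random

def duplication_growth(n, seed=0):
    """Grow a protein-interaction-style network by node duplication:
    each step copies a randomly chosen node along with all its links,
    so every neighbor of the template gains one extra link."""
    rng = random.Random(seed)
    neighbors = {0: {1}, 1: {0}}           # seed: two interacting proteins
    for new in range(2, n):
        template = rng.randrange(new)      # gene picked for duplication
        neighbors[new] = set(neighbors[template])
        for partner in neighbors[new]:
            neighbors[partner].add(new)    # keep links symmetric
    return neighbors

net = duplication_growth(500)
degrees = sorted((len(v) for v in net.values()), reverse=True)
# Highly connected proteins are more likely to neighbor whichever node
# gets duplicated, so they gain links fastest -- preferential attachment
# arising implicitly, exactly as the text argues.
```

No node is ever chosen for duplication based on its degree, yet hubs still emerge, which illustrates the "subtle version of preferential attachment" the authors describe.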
And a few more resources for research purposes: