Making Big Data Small Data – The Role of Power Laws
At Near, we process 6 TB – 10 TB of data every day and this data covers a variety of phenomena and entities. These include:
Data about People – Where do they go? Places they visit? What media do they consume? When do they consume? Where do they live? Which apps are they on? Which websites are they on? Which brands do they actively engage with?
Data about Places – Which are the most congested intersections at different times of the day? Which malls are most occupied? What are the most common places people go to? Which are common brands/retail stores that they visit? How do density patterns vary in a place over time?
Our team of data scientists has to process this data to figure out the different interesting relationships between all these entities. Where do you begin in the data deluge? Picking subsets of relevant data to explore is a key first step. It helps the data scientists isolate phenomena without getting lost in the weeds. A key tool in this process is our use of “power laws” to understand what general relationships hold. From a statistical point of view, the phenomena underlying the data are quite complex and we want to look for simple relationships to grasp what is going on.
The power law, closely associated with the Pareto distribution (the source of the well-known 80-20 rule), Lotka’s law, and Zipf’s law, provides a useful alternative to the ‘normal’ (Gaussian) distribution (i.e., the bell curve). Many real-world phenomena do not follow an ideal normal distribution but have distinctive distributions of their own.
For example, have you considered the distribution of leading digits in naturally occurring numbers?
The graph below shows the digits 1 through 9 on the X-axis and, on the Y-axis, the probability that a number in the real world starts with that digit. For example, about 30% of numbers across many different domains start with the digit 1.
This pattern has its own name: Benford’s Law.
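Benford’s Law has a standard closed form: the probability that a number starts with digit d is log10(1 + 1/d). The short sketch below tabulates those probabilities; the function name `benford_probability` is ours, chosen only for illustration.

```python
import math


def benford_probability(digit):
    """Probability that a number's leading digit equals `digit` under Benford's Law."""
    if not 1 <= digit <= 9:
        raise ValueError("leading digit must be between 1 and 9")
    return math.log10(1 + 1 / digit)


# Tabulate the distribution for the nine possible leading digits.
benford = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford.items():
    print(f"{d}: {p:.3f}")
```

The nine probabilities sum to 1, and the value for digit 1 is roughly 0.30, which matches the 30% figure quoted above.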
The power law has gained popularity among numerate intellectuals, policy makers, and business people because it seems to fit common sense better than what we were told in basic statistics: extreme and rare events have a greater than expected impact, and a few products, people, and websites seem to hold the bulk of market share, wealth, and mindshare. In the world of big data, it is imperative to know which sets of “small data” one needs to pay attention to.
Many phenomena exhibit power law behavior, while others do not. Consider Shaquille O’Neal versus Bill Gates. Shaq represents the distribution of human heights, which follows the normal distribution; Bill Gates represents the distribution of human wealth, which follows the power law. No one is even twice as tall as the average adult, whereas the wealthiest individuals hold many thousands of times the average wealth.
The authorship of scientific papers exhibits a power law. The statistician and physicist Lotka deduced an inverse-square law much like Newton’s law of gravitation: the number of authors publishing n papers is about 1/n² of the number publishing one paper.
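To make Lotka’s inverse-square relation concrete, here is a minimal sketch. The function name `lotka_expected_authors` and the starting count of 100 single-paper authors are illustrative assumptions, not real publication data.

```python
def lotka_expected_authors(single_paper_authors, n, exponent=2.0):
    """Expected number of authors with n papers, given the number with one paper.

    Uses the inverse-square form described above; `exponent` is exposed
    as a parameter only so other values can be tried.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return single_paper_authors / n ** exponent


# Hypothetical field with 100 authors who have published exactly one paper.
for n in range(1, 6):
    print(n, round(lotka_expected_authors(100, n), 2))
```

With 100 single-paper authors, the relation implies about 25 authors with two papers and about 11 with three, so the counts fall off quickly.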
Other examples include the sizes of craters on the moon and of solar flares, the foraging patterns of various species, the sizes of activity patterns of cell populations in nervous systems, the frequencies of words in most languages, the frequencies of family names, the species richness in the reproduction of organisms, the sizes of power outages, criminal charges per convict, volcanic eruptions, earthquakes, accidents, company growth, company populations, and many other quantities. A good overview of power laws is provided here.
Given the rich applicability of power laws, we have analyzed our data to identify “data regimes” that are worth paying attention to. Some of the different kinds of relationships we have analyzed include –
- Spatial distribution of places
- Common activities of people in different time windows
- Most common websites folks visit in a given time window
- Common apps folks spend time on
- Most common activity spots
Once we identify the data regimes of interest, we focus in greater detail on building specific models for the phenomena of interest. This also guides our system-building efforts by focusing attention on what is relevant and what is not. A simple example of power law behavior from Near’s data is the following:
The Y-axis is a measure, the number of unique users we see in Allspark, whereas the X-axis is a rank ordering of all the sources that contribute users. The graph shows power law behavior, in which a few sources contribute most of the users for one market. Plotted on a log-log scale, this relationship appears as a straight line. Each individual market shows similar behavior, and the global characteristics are also similar; only the coefficients and exponents of the power laws differ. Understanding these data sources and their characteristics will help us understand the users in our system better. By partitioning this data into subsets, we can focus on what is essential.
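The log-log view can be sketched as follows. This is only a rough diagnostic under stated assumptions: the data is synthetic (generated from an exact power law, not Near’s data), the names `fit_power_law` and `users_per_source` are hypothetical, and a least-squares line through the log-log points is a quick check rather than a rigorous estimator of the exponent.

```python
import math


def fit_power_law(values):
    """Fit value ~ coefficient * rank**(-exponent) by least squares in log-log space.

    `values` must be positive and contain at least two entries.
    Returns (exponent, coefficient).
    """
    ranked = sorted(values, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(v) for v in ranked]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # A power law with exponent a has slope -a on a log-log plot.
    return -slope, math.exp(intercept)


# Synthetic example: 50 sources whose user counts follow rank**(-1.5).
users_per_source = [1_000_000 / rank ** 1.5 for rank in range(1, 51)]
exponent, coefficient = fit_power_law(users_per_source)

# Share of all users contributed by the five largest sources.
top5_share = sum(users_per_source[:5]) / sum(users_per_source)
print(f"exponent={exponent:.2f}, top-5 share={top5_share:.0%}")
```

On real data the points will scatter around the line, so the fitted exponent is best treated as a summary for comparing markets rather than a precise parameter.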
We will illustrate more detailed use cases in future blog posts. Understanding and using power laws is a key tool in the world of big data, helping keep your big data efforts focused and productive.