5 Lessons Learned for Building Companies with Data (Or How to Build the Next Bloomberg)

The A.C. Nielsen Company was launched in 1923 with the idea of selling engineering performance surveys – giving birth to one of the world’s first data businesses.  Today, Nielsen is still one of the largest data monopolies in the world and continues to be the primary source of audience measurement and business intelligence research across the globe. 

What’s most interesting is that the way Nielsen (and other similar traditional data companies) tracks and aggregates data hasn’t changed significantly over the past few decades.  At its core, this system relies on a panel-based method – specifically recruiting a large set of people to participate, monitoring their activities, and then weighting the sample to be representative of the broader population.

The result is data skewed by both human error (read: lying) and sampling error (who really has time to take surveys or wants to be tracked by Nielsen?), but it was the best we could do in a world with limited technology.
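The weighting step of the panel method can be sketched in a few lines. This is a minimal illustration of post-stratification weighting, not Nielsen's actual methodology; the age groups, population shares, and viewing hours are all made-up assumptions. Note that weighting only corrects for who is in the panel, not for what panelists misreport.

```python
from collections import Counter

def poststratification_weights(panel_groups, population_shares):
    """Weight each panelist by (population share) / (panel share) of their group."""
    counts = Counter(panel_groups)
    n = len(panel_groups)
    return [population_shares[g] / (counts[g] / n) for g in panel_groups]

def weighted_mean(values, weights):
    """Average of values where each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Hypothetical panel: one young panelist and three older ones,
# while the (assumed) real population is split 50/50.
panel_groups = ["18-34", "35+", "35+", "35+"]
hours_watched = [1.0, 3.0, 3.0, 3.0]
population_shares = {"18-34": 0.5, "35+": 0.5}

weights = poststratification_weights(panel_groups, population_shares)
raw = sum(hours_watched) / len(hours_watched)    # 2.5, dominated by the older group
adjusted = weighted_mean(hours_watched, weights)  # about 2.0 after reweighting

print(f"raw={raw:.2f} adjusted={adjusted:.2f}")
```

The under-represented group gets a weight above 1 and the over-represented group a weight below 1, so the adjusted estimate reflects the assumed population mix rather than the mix of people who agreed to join the panel.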

With the growth of cloud computing and the resulting decline in storage and compute costs, combined with the increased availability of passively tracked data (whether from inexpensive sensors or APIs), we’re entering an environment ripe for disruption of these old-line data monopolies. These include not only Nielsen but also companies such as Bloomberg, Dun & Bradstreet, and NPD (originally “National Purchase Diary”).

While a few early companies decreased the cost of data collection via crowd-sourcing (Euromonitor, Mintel, Data.com, et cetera), we’re at the front of the next wave of opportunity in the space. Here are five lessons for the next generation of data platform companies, drawn from the big winners of the past as well as some of our early investments in the space:

1. Don’t focus on changing existing standards, but rather become the new way businesses understand the world

Each of Nielsen’s products is built around a proprietary score that rates relative effectiveness in context. While in some cases they’ll provide you with the underlying raw data, the reality is that Nielsen has created a moat around itself by getting the industry to standardize on its own proprietary number, so simply having better data is not enough to displace it.

Therefore the next generation of data companies will focus on changing the conversation around different metrics that don’t displace but rather complement existing standards.  This immediately changes the first sales meeting with a potential customer from “You’ve been doing this wrong” to “Let me help you make this more effective”.

An example of this in our portfolio is Ginger.io, which passively collects human behavior data from cellphone sensors to map to specific disease states for medical purposes.  While there is some value in the data, the larger value comes from their software models, which are able to map certain behavior changes to disease states (built on top of their proprietary behavior modeling algorithms), thereby offering an entirely different data model than doctors use today.

2. Collect data indirectly and passively – it increases honesty and decreases cost

In the early 2000s, there was a music startup that enabled users to passively track their music listening habits privately and post curated playlists to share publicly.

One of the team’s favorite hobbies was to compare users’ publicly curated playlists with their private listening habits, which usually revealed folks publicly curating playlists full of obscure B-sides while listening to Madonna on repeat.

Even weirder, when the company made the scrobbling data (the passively tracked listening history) public, they watched as users became more conscious of what songs they were listening to and began actively turning the passive tracking on and off to curate their public persona.

The core issue is not that people lie but rather that what people do, what people say, and what people say they do are always going to be vastly different things.  In order to get good, unbiased data you have to take the human element almost entirely out of it.

Three good examples here are the startups Next Big Sound, Retailigence, and Bundle.  All three have relationships with a separate third party to help aggregate data about their end consumers.  In this way, their end consumers aren’t aware of the tracking and won’t change their behavior, thereby driving cleaner and higher quality data.

3. There is no junk data – just data that’s not valuable yet

In traditional model building, business users would have requirements and you’d build the model to the requirements; it worked because data was small and you knew what questions to ask.

In the new world, this process is flipped on its head: because it’s now far less expensive to store and process data at this scale, we search for answers (or at least trends) in seas of information rather than starting from a fixed set of questions.

For a specific example, until recently there were large swaths of human DNA that were considered unimportant, or “junk” DNA. However, the reality is we just didn’t know that they could be valuable because we had only fully sequenced a handful of humans.

At the time this was because it was so expensive to sequence and store human genomic data.  For reference, as part of the Human Genome Project, scientists developed the first full human genome sequence at a cost of almost $3 billion.  A similarly focused private project during the same time accomplished a similar goal at a total cost of $300 million.

As we crossed over into the first few thousand full genomes sequenced, we realized not only that there was no such thing as junk DNA, but in addition that we needed more data to even begin to understand how human genes worked and, more importantly, how we could use that information to improve health.

4. Power is surfacing relevant data in context

While traditional data companies would focus on selling packaged data, the real opportunity for the next generation of data companies is to heavily integrate into every individual’s job and to give them the power of traditional analytics tools but with a user experience that exists in their context and makes their job easier.

These systems leverage the ability of computers to store and process large amounts of information in a rules-based manner. Rather than making decisions for the end user, they simply surface relevant information and patterns, which lets the end user focus on the higher-level problem solving, deliberative reasoning, and pattern recognition that humans are better at.

In this way, humans don’t have to waste time trying to clean and understand all of the information.  Instead they can focus on understanding relevant information at the appropriate time.

Good early examples of this are the data products built by PayPal.

Early in its growth, PayPal had a serious and expanding problem with fraud. At first, the company built algorithms that each detected a specific method of fraud, but these failed as soon as the bad guys switched to new methods of attack that the computer wasn’t trained to find. Humans are good at catching these types of fraud, but the sheer amount of data being processed made it impractical to hire people to sift through it: an astronomical number of reviewers would have been required.

Faced with an impossible technology problem, the team devised an interesting solution. Rather than trying to identify fraudulent transactions in an absolute sense with algorithms, they used software to crawl through the data and surface potential issues to humans who could understand the broader context.

The solution works by combining the best of each skill set – computers are able to easily comb through tons of data and give each transaction a probability of being fraudulent, surfacing the highest risk transactions for a human to quickly check and intervene if necessary.
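The score-then-surface pattern described above can be sketched as follows. This is an illustrative toy, not PayPal's actual system: the `Transaction` fields, the hand-tuned `risk_score` rules, and the thresholds are all hypothetical assumptions, and a real system would use a trained model to produce the probability.

```python
import heapq
from dataclasses import dataclass

@dataclass
class Transaction:
    tx_id: str
    amount: float
    account_age_days: int
    country_mismatch: bool

def risk_score(tx):
    """Hypothetical hand-tuned risk score in [0, 1]; stands in for a trained model."""
    score = 0.0
    if tx.amount > 1000:
        score += 0.4
    if tx.account_age_days < 7:
        score += 0.3
    if tx.country_mismatch:
        score += 0.3
    return min(score, 1.0)

def review_queue(transactions, k):
    """Return the k highest-risk transactions for a human analyst, riskiest first."""
    return heapq.nlargest(k, transactions, key=risk_score)

# Made-up transactions for illustration only.
transactions = [
    Transaction("t1", 25.0, 400, False),
    Transaction("t2", 5000.0, 2, True),
    Transaction("t3", 1500.0, 90, False),
    Transaction("t4", 40.0, 3, False),
]

queue = review_queue(transactions, k=2)
print([tx.tx_id for tx in queue])  # ['t2', 't3']
```

The software never decides that a transaction is fraudulent. It only ranks, and the size of the queue (`k`) can be set to match how many cases the human reviewers can realistically check.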

5. Focus on solving a single problem and beware the big red button

There’s a romantic myth about the magic button that simply takes data from across the system and determines the answer whenever users press the red button.

In reality, systems that are too broad and remove the human too far from the data take too long to set up and become limiting down the line. They do most of the heavy lifting under the covers and, in the process, create lots of dependencies that are hard for future users to understand or hack against.

Rather, build tools that are flexible enough for power users to adjust and experiment with the data and bring in other tools.  Think more along the lines of Unix with loosely coupled tools that are more easily substitutable and testable, enabling users to play with pieces of the systems and slowly add additional pieces of functionality over time.

A modern example of this would be Spinnakr, a recent 500 Startups graduate that has built a platform to enable easy personalization of websites out of the box – simply deploy with two lines of code and the system will learn and improve over time using actual consumer behavior data.

In contrast, legacy products in the space such as Monetate or Amadesa require months of expensive integration to map every type of consumer behavior scenario before deployment, and are based on questionable data purchased from third-party ad networks.

Overall, tomorrow’s opportunity lies in redefining how customers think about and use data, collecting it in interesting ways, and providing tools that surface complexities to humans in a digestible and usable way.
