The ‘Adjacent Possible’ of Big Data: What Evolution Teaches About Insights Generation

Originally published on WIRED


Stuart Kauffman introduced the “adjacent possible” theory in 2002. The theory proposes that biological systems are able to morph into more complex systems by making incremental, relatively low-energy changes to their makeup. Steven Johnson uses this concept in his book “Where Good Ideas Come From” to describe how new insights can be generated in previously unexplored areas.

The theory of the “adjacent possible” extends to the insights generation process. In fact, it offers a highly predictable and deterministic path to generating business value through insights from data analysis. For enterprises struggling to get started with analyzing their big data, the theory of the “adjacent possible” offers an easy-to-adopt, easy-to-implement framework of incremental data analysis.

Why Is the Theory of Adjacent Possible Relevant to Insights Generation

Enterprises often embark on their big data journeys with the hope and expectation that business-critical insights will be revealed almost immediately, simply by virtue of being on a big data journey and building out their data infrastructure. The expectation is that insights can be generated within the same quarter in which the infrastructure and data pipelines have been set up. In addition, the insights generation process is typically driven by analysts who report up through the usual management chain. This puts undue pressure on the analysts and managers to show predictable, regular delivery of value, and it forces the process of insights generation to fit into project scope and delivery. However, the insights generation process is too ambiguous and too experimental to fit reliably within the bounds of a committed project.

Deterministic delivery of insights is not what enterprises find on the other side of their initial big data investment. What they almost always find is that data sources are in disarray, multiple data sets that were never primed for blending need to be combined, data quality is low, analytics generation is slow, derived insights are not trustworthy, and the enterprise lacks either the agility to implement the insights or the feedback loop to verify their value. Even when everything goes right, the value of the insights is often minuscule and insignificant to the bottom line.

This is the time when the enterprise has to adjust its expectations and its analytics modus operandi. If pipeline problems exist, they need to be fixed. If quality problems exist, they need to be diagnosed (data source quality vs. data analysis quality). In addition, an adjacent possible approach to insights needs to be considered and adopted.

The Adjacent Possible for Discovering Interesting Data

Looking adjacently from the data set that is the main target of analysis can uncover other related data sets that offer more context, signals and potential insights through their blending with the main data set. Enterprises can introspect the attributes of the records in their main data sets and look for other data sets whose attributes are adjacent to them. These datasets can be found within the walls of the enterprise or outside. Enterprises that are looking for adjacent data sets can look at both public and premium data set sources. These data sets should be imported and harmonized with existing data sets to create new data sets that contain a broader and crisper set of observations with a higher probability of generating higher quality insights.
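
As a concrete illustration, here is a minimal sketch of blending a main data set with an adjacent one that shares attributes with it. The data sets, column names and the weather source are hypothetical, and the harmonization step is reduced to a simple rename and join.

    import pandas as pd

    # Main data set: hypothetical order records keyed by date and store city.
    orders = pd.DataFrame({
        "order_date": ["2024-11-01", "2024-11-01", "2024-11-02"],
        "city": ["Austin", "Denver", "Austin"],
        "revenue": [120.0, 75.5, 210.0],
    })

    # Adjacent data set: hypothetical external weather observations that share
    # the date and city attributes with the main data set.
    weather = pd.DataFrame({
        "date": ["2024-11-01", "2024-11-01", "2024-11-02"],
        "city": ["Austin", "Denver", "Austin"],
        "max_temp_f": [81, 45, 79],
    })

    # Harmonize column names, then blend on the shared (adjacent) attributes.
    weather = weather.rename(columns={"date": "order_date"})
    blended = orders.merge(weather, on=["order_date", "city"], how="left")

    # The blended data set now carries extra context (temperature) for each order.
    print(blended)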

The Adjacent Possible for Exploratory Data Analysis

In the process of data analysis, one can apply the principle of the adjacent possible to uncovering hidden patterns in data. An iterative approach to segmentation analysis, with a focus on attribution through micro-segmentation, root cause analysis, change and predictive analysis, and anomaly detection through outlier analysis, can lead to a wider set of insights and conclusions to drive business strategy and tactics.
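
For instance, the outlier analysis mentioned above can start as something very simple. The sketch below flags days whose KPI deviates sharply from the mean; the data, metric name and threshold are hypothetical and purely illustrative.

    import pandas as pd

    # Hypothetical daily KPI series; the metric and threshold are illustrative.
    kpi = pd.DataFrame({
        "day": pd.date_range("2024-01-01", periods=10, freq="D"),
        "signups": [100, 98, 103, 101, 97, 250, 99, 102, 100, 96],
    })

    # Flag days whose signups deviate by more than 2 standard deviations
    # from the mean -- a basic outlier-analysis starting point.
    mean, std = kpi["signups"].mean(), kpi["signups"].std()
    kpi["is_outlier"] = (kpi["signups"] - mean).abs() > 2 * std

    print(kpi[kpi["is_outlier"]])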

Experimentation with different attributes such as time, location and other categorical dimensions can and should be the initial analytical approach. A good starting point is an iterative, incremental segmentation analysis that identifies the segments to which changes in key KPIs or measures can be attributed. Applying the adjacent possible means iteratively including additional attributes to fine-tune the segmentation scheme, which can lead to insights into significant segments and cohorts. In addition, the adjacent possible theory can also help in identifying systemic problems in the business process workflow. This can be achieved by walking upstream or downstream in the business workflow and diagnosing the point where the process workflow breaks down or slows down, through the identification of attributes that correlate highly with the breakdown or slowdown.
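
A minimal sketch of that iterative refinement, assuming a hypothetical event-level data set with a conversion KPI, might look like this: start with one attribute, then add an adjacent one to sharpen the attribution.

    import pandas as pd

    # Hypothetical event-level data with a KPI (converted) and candidate attributes.
    events = pd.DataFrame({
        "region":    ["west", "west", "east", "east", "west", "east"],
        "device":    ["ios", "android", "ios", "android", "ios", "ios"],
        "converted": [1, 0, 1, 0, 1, 1],
    })

    # Step 1: segment by a single attribute and compare the KPI across segments.
    print(events.groupby("region")["converted"].mean())

    # Step 2 (adjacent possible): add one more attribute to refine the segmentation
    # and see whether the KPI change can be attributed to a finer cohort.
    print(events.groupby(["region", "device"])["converted"].mean())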

The Adjacent Possible for Business Context

The process of data analysis is often fraught with siloed context, i.e., the analyst often does not have the full business context to understand the data, to understand the motivation for a business-driven question or to understand the implications of their insights. Applying the theory of the adjacent possible here means introducing collaboration into the insights generation process: inviting and including team members who each hold a slice of the business context from their own point of view can lead to higher-value conclusions and insights. Combining the context from each of these team members to design, verify, authenticate and validate the insights generation process and its results is the key to generating high-quality insights swiftly and deterministically.

Making incremental progress in the enterprise’s insights discovery efforts is a significant and valuable method to uncover insights with massive business implications. The insights generation process should be treated as an exercise in the adjacent possible, and incremental insights identification should be encouraged and valued. As this theory is put into practice, enterprises will find themselves with a steady stream of incrementally valuable insights with incrementally higher business impact.

The 2+2=5 Principle and the Perils of Analytics in a Vacuum

Published Originally on Wired

Strategic decision making in enterprises playing in a competitive field requires collaborative information seeking (CIS). Complex situations require analysis that spans multiple sessions with multiple participants (that collectively represent the entire context) who spend time jointly exploring, evaluating, and gathering relevant information to drive conclusions and decisions. This is the core of the 2+2=5 principle.

Analytics in a vacuum (i.e., non-collaborative analytics), due to missing or partial context, is highly likely to be of low quality, lacking key and relevant information and fraught with incorrect assumptions. Another characteristic of non-collaborative analytics is the use of general-purpose systems and tools like IM and email that are not designed for analytics. These tools lead to enterprises drowning in a sea of spreadsheets, context lost across thousands of IMs and emails, and an outcome that is guaranteed to be suboptimal.

A common but incorrect approach to collaborative analytics is to think of it as a post-analysis activity. This is the approach to collaboration taken by most analytics and BI products. Post-analysis publishing of results and insights is very important; however, pre-publishing collaboration plays a key role in ensuring that the generated results are accurate, informative and relevant. Analysis that terminates at the publishing point has a very short half-life.

Enterprises need to think of analysis as a living, breathing story that gets bigger over time as more people collaborate. More collaboration leads to more data, new data and disparate data, which in turn brings in more context, negating incorrect assumptions, missing or low-quality data issues and incorrect semantic understanding of the data.

Here are the most common pitfalls we have observed when analytics is carried out in a vacuum.

Wasted resources. If multiple teams or employees are seeking the same information or attempting to solve the same analytical problem, a non-collaborative approach leads to wasted resources and suboptimal results.

Collaboration can help the enterprise divide and conquer the problem more efficiently, with less time and manpower. Deconstructing an analytical hypothesis into smaller questions and distributing them across multiple employees leads to faster results.

Siloed analysis and conclusions. If the results of analysis, insights and decisions are not shared systematically across the organization, enterprises face a loss of productivity. This lack of shared context between employees tasked with the same goals causes organizational misalignment and a lack of coherence in strategy.

Enterprises need to ensure that there is common understanding of key data driven insights that are driving organizational strategy. In addition, the process to arrive at these insights should be transparent and repeatable, assumptions made should be clearly documented and a process/mechanism to challenge or question interpretations should be defined and publicized.

Assumptions and biases. Analytics done in a vacuum is hostage to the personal beliefs, assumptions, biases, clarity of purpose and comprehensiveness of the context in the analyzer’s mind. Without collaboration, such biases remain uncorrected and lead to flawed foundations for strategic decisions.

A process for challenging, inspecting and referencing the key interpretations and analytical decisions made en route to an insight, and the freedom to do so, is critical if enterprises are to enable and proliferate high-quality insights across the organization.

Drive-by analysis. When left unchecked, with top-down pressure to use analytics to drive strategic decision making, enterprises see an uptick in what we call “drive-by analysis.” In this case, employees jump into their favorite analytical tool, run some analysis to support their argument and publish the results.

This behavior leads to another danger of analytics without collaboration: instances where users, without full context and understanding of the data, its semantics and so on, perform analysis that is used to make critical decisions. Without supervision, such analytics can lead the organization down the wrong path. Supervision, fact checking and corroboration are needed to ensure that correct decisions are made.

Arbitration. Collaboration without a process for challenge and arbitration, and without an arbitration authority, is often found, almost always at a later point when it is too late, to be littered with misinterpretations and conclusions that are factually misaligned or have deviated from strategic patterns identified in the past.

Subject matter experts, or other employees with the bigger picture and an understanding of the various moving parts of the organization, need to verify and arbitrate on assumptions and insights at every step of the analysis, before those insights are disseminated across the enterprise and used to effect strategic change.

Collaboration theory has shown that information seeking in complex situations is better accomplished through active collaboration. There is a trend in the analytics industry to think of collaborative analytics as a vanity feature, and the simple sharing of results is being touted as collaborative analytics. However, collaboration in analytics requires a multi-pronged strategy with key processes and a product that delivers those capabilities: an investment in processes that allow arbitration, fact checking, interrogation and corroboration of analytics, and an investment in analytical products that are designed and optimized for collaborative analytics.

Your Big Data Needs Some TLC

Published Originally on Wired

In this customer-driven world, more and more businesses are relying on data to derive deep insights about the behavior and experience of end users with a business’ products. Yet end user logs, while interesting, often lack a 360-degree view of the “context” in which users consume a business’ products and services. The ability to analyze these logs in the relevant context is key to getting the maximum business value from big data analysis.

Basic contextual analysis requires a little TLC: Time, Location and Channel.

Thinking within a TLC framework will simplify the identification, collection, assimilation and analysis of context and make it more value driven. Enterprises can apply TLC for better attribution and explanation of end user behavior, to identify patterns and understand profiles that generate insights, and ultimately to enable the business to deliver better, customized, personalized products, services and experiences.

So what does it mean for business owners in the app economy to give their data and analytics a dose of TLC?

Time

Does the time of day, the week, the month, or a particular event impact app usage?

The hypothesis is that certain events at a point in time and certain classes of events have a positive or negative impact on app usage.

Are there different patterns of app usage on weekends versus weekdays, or on mornings versus afternoons versus evenings? Are you a retail business hurtling towards Black Friday (the biggest shopping day of the year in the US)? What patterns have you observed in recent years? What can you expect from your store locator app, your catalog app, your gift card and coupons app… in the days before the event and on Black Friday itself?

Are you running a Super Bowl ad? When it airs, will it drive traffic to your web and mobile apps? Will it cause a spike in API traffic?

Business executives need to understand how external events like these impact the use of their apps. Making these correlations and understanding the contexts in which the apps are used allows the business to promote or discourage certain usage of the app for maximum business impact.

  • What external events impact the use of my app?
  • Are there patterns? What types of external events impact the use of my app?
  • As users use an app over time, do their usage patterns change? Does the how/why/what of app usage change?
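
To start exploring the time questions above, a minimal sketch like the following can surface weekday/weekend and hour-of-day patterns. The session log, column names and timestamps are hypothetical.

    import pandas as pd

    # Hypothetical app-usage log with one row per session start.
    sessions = pd.DataFrame({
        "session_start": pd.to_datetime([
            "2024-11-29 09:15", "2024-11-29 20:40", "2024-11-30 10:05",
            "2024-12-02 08:50", "2024-12-02 21:10", "2024-12-03 13:30",
        ]),
    })

    # Derive simple time attributes and count sessions by weekend/weekday and hour.
    sessions["hour"] = sessions["session_start"].dt.hour
    sessions["is_weekend"] = sessions["session_start"].dt.dayofweek >= 5
    print(sessions.groupby(["is_weekend", "hour"]).size())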

Location

Does app usage lead to cross-channel transactions such as store foot traffic or web based fulfillment?

Retailers deploy mobile apps to enable enhanced shopping experiences and sometimes with the purpose to drive foot traffic to their stores.

Where are users before, during and after they use an app? Business owners can use information about where their apps are being used and where they are being the most effective to tune the user experience and maximize impact.

Is your store locator app used most in the vicinity of your store, or most in the vicinity of your competitors’ stores? Do users follow through and walk into your store after using the store locator app, the catalog app…?

Is there a pattern to where users are when they access a gas station app? Are they in the vicinity of a gas station and trying to find the cheapest gas? Are they trying to find the gas vendor to whose rewards program they belong? Are they in a rural setting and looking for the closest gas station?

Location information provides the app developer and service provider with context to answer questions that help chart a customer’s journey of interacting with the service provider across multiple channels and across multiple locations, allowing the identification of patterns that signify and impact the customer’s search, discovery, decision and transaction.

Business owners should be asking questions like:

  • Where are the users before and after they use the app?
  • Are users using the apps in the vicinity of retail stores? How close are they to the stores?
  • Are the users using the apps in the vicinity of competitor stores? How close are they to the stores?
  • Do users use apps and then walk into the stores? Vice versa?
  • Do multiple users use the app in the vicinity of a single store?
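
One lightweight way to approach the proximity questions above is to compute the distance from an app-usage event to the nearest store. The store coordinates and event location below are hypothetical; a real analysis would run this over the full usage log.

    from math import radians, sin, cos, asin, sqrt

    def haversine_km(lat1, lon1, lat2, lon2):
        """Great-circle distance between two points, in kilometers."""
        lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
        a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
        return 2 * 6371 * asin(sqrt(a))

    # Hypothetical store locations and one app-usage event with GPS coordinates.
    stores = [("downtown", 37.7793, -122.4193), ("mission", 37.7599, -122.4148)]
    event_lat, event_lon = 37.7780, -122.4170

    # Distance from the event to the nearest store answers "how close were they?"
    nearest = min(stores, key=lambda s: haversine_km(event_lat, event_lon, s[1], s[2]))
    print(nearest[0], round(haversine_km(event_lat, event_lon, nearest[1], nearest[2]), 2), "km")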

Channels

Is usage of online and mobile channels increasing? How do my business channels impact and improve transactions on neighboring channels?

The hypothesis is that the multiple channels of your business are symbiotic.

Does enabling one channel cannibalize, harm or improve business on other channels? Is your mobile app driving more traffic to your store… to your web site?

A powerful example of enabling business with apps, and the impact across their channels comes from Walgreens. The pharmacy chain made mobile technology a key part of its strategy and finds that half of the 12 million visits a week to its numerous online sites come from mobile devices. Additionally, Walgreens indicates that the customers who engage with Walgreens in person, online and via mobile apps spend six times more than those who only visit stores.

Some questions for business owners to ask about their channels include:

  • What is my strongest channel?
  • For multi-channel transactions: do transactions transcend multiple channels – that is, do users channel hop?
    • Which channel is responsible for starting most transactions?
    • Which channel is responsible for successfully completing most transactions?
    • Which channel is responsible for most abandoned transactions?
  • What does each channel contribute to meeting users’ needs and to driving improved experiences and transactions?
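
A minimal sketch of answering the start/complete/abandon questions above from raw transaction events might look like the following; the event schema and channel names are hypothetical.

    import pandas as pd

    # Hypothetical transaction events: one row per step, tagged with channel.
    steps = pd.DataFrame({
        "txn_id":  [1, 1, 2, 2, 3],
        "channel": ["mobile_app", "web", "web", "store", "mobile_app"],
        "step":    ["start", "complete", "start", "complete", "start"],
    })

    # Which channel starts transactions, and which channel completes them?
    starts = steps[steps["step"] == "start"]["channel"].value_counts()
    completes = steps[steps["step"] == "complete"]["channel"].value_counts()

    # Transactions with a start but no completion are treated as abandoned here.
    completed_ids = set(steps.loc[steps["step"] == "complete", "txn_id"])
    abandoned = steps[(steps["step"] == "start") & (~steps["txn_id"].isin(completed_ids))]

    print(starts, completes, abandoned["channel"].value_counts(), sep="\n")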

TLC for the User Experience

Consumers today are “always addressable.” We are increasingly surrounded by digital screens, which make us reachable at any time, in any place and on any device. This leads to a new type of problem and opportunity that I like to call “screen optimization.”

Screen optimization is the opportunity and the ability of a service provider to optimize the message and content delivered according to the user’s context – time, location, channel, and position in their journey.

  • Adjust and adapt a user’s experience to their context and screen across the various digital touch points on the customer’s journey
  • Adapt the content delivered to a user’s surrounding screens (mobile device to highway billboards) according to the user’s context
  • Provide a personalized, mobile-centric experience that enables a user to orchestrate their multi-channel experience successfully
  • Enable a user to enter and experience the appropriate channel given their stage in the journey of interaction and transaction with your business
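
To make the idea concrete, here is a deliberately simple, rule-based sketch of screen optimization that picks content from the user’s TLC context. The rules, thresholds and field names are all hypothetical; a real system would tune these rules from data.

    # Minimal, rule-based sketch of "screen optimization": choose content for a
    # user given their TLC context. The rules and field names are illustrative.
    def pick_content(hour: int, near_store: bool, channel: str) -> str:
        if channel == "mobile_app" and near_store:
            return "in-store coupon"       # nudge the nearby user into the store
        if channel == "web" and hour >= 20:
            return "evening promotion"     # time-of-day tailored message
        return "default home screen"

    print(pick_content(hour=21, near_store=False, channel="web"))
    print(pick_content(hour=12, near_store=True, channel="mobile_app"))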

So, apply a little TLC to your data and analytics and create better, customized, personalized products, services and experiences for consumers.

It’s the End of the (Analytics and BI) World as We Know It

Published Originally on Wired

“That’s great, it starts with an earthquake, birds and snakes, an aeroplane, and Lenny Bruce is not afraid.” –REM, “It’s the End of the World as We Know It (and I Feel Fine)”

REM’s famous “It’s the End of the World…” song rode high on the college radio circuit back in the late 1980s. It was a catchy tune, but it also stands out because of its rapid-fire, stream-of-consciousness lyrics and — at least in my mind — it symbolizes a key aspect of the future of data analytics.

The stream-of-consciousness narrative is a tool used by writers to depict their characters’ thought processes. It also represents a change in approach that traditional analytics product builders have to embrace and understand in order to boost the agility and efficiency of the data analysis process.

Traditional analytics products were designed for data scientists and business intelligence specialists; these users were responsible for not only correctly interpreting the requests from the business users, but also delivering accurate information to these users. In this brave new world, the decision makers expect to be empowered themselves, with tools that deliver information needed to make decisions required for their roles and their day to day responsibilities. They need tools that enable agility through directed, specific answers to their questions.

Decision-Making Delays

Gone are the days when the user of analytics tools shouldered the burden of forming a question and framing it according to the parameters and interfaces of the analytical product. This would be followed by a response that would need to be interpreted, insights gleaned and shared. Users would have to repeat this process if they had any follow up questions.

The drive to make these analytics products more powerful also made them difficult for business users to use. This led to a vicious cycle: the tools appealed only to analysts and data scientists, leading to these products becoming even more adapted to their needs. Analytics became the responsibility of a select group of people, and the limited population of these experts caused delays in data-driven decision making. Additionally, these experts were isolated from the business context needed to inform their analysis.

Precision Data Drill-Downs

In this new world, the business decision makers realize that they need access to information they can use to make decisions and course correct if needed. The distance between the analysis and the actor is shrinking, and employees now feel the need to be empowered and armed with data and analytics. This means that analytics products that are one size fits all do not make sense any more.

As decision makers look for analytics that make their day-to-day jobs successful, they will look to these new analytics tools to offer the same capabilities and luxuries that having a separate analytics team provides, including the ability to ask questions repeatedly based on responses to a previous question.

This is why modern analytics products have to support the user’s “stream of consciousness” and offer the ability to repeatedly ask questions to drill down with precision and comprehensiveness. This enables users to arrive at the analysis that leads to a decision that leads to an action that generates business value.
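
As an illustration of that question-upon-question flow, the sketch below narrows each answer with a follow-up query instead of starting a new analysis from scratch. The data set and column names are hypothetical.

    import pandas as pd

    # Hypothetical orders data used to mimic a decision maker's drill-down:
    # each question narrows the previous answer rather than starting over.
    orders = pd.DataFrame({
        "region":  ["west", "west", "east", "east", "west"],
        "product": ["a", "b", "a", "a", "b"],
        "revenue": [100, 40, 75, 60, 30],
    })

    # Q1: which region is driving revenue?
    by_region = orders.groupby("region")["revenue"].sum()

    # Q2 (follow-up): within the leading region, which product is responsible?
    top_region = by_region.idxmax()
    by_product = orders[orders["region"] == top_region].groupby("product")["revenue"].sum()

    print(by_region, by_product, sep="\n")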

Stream-of-consciousness support can only be offered through new, lightweight mini analytics apps that are purpose-built for specific user roles and functions and that deliver information and analytics for the specific use cases that users in a particular role care about. Modern analytics products have to become combinations of such apps to empower users and make their jobs decision- and action-oriented.

Changes in People, Process, and Product

Closely related to the change in analytics tools is a change in the usage patterns of these tools. There are generally three types of employees involved in the usage of traditional analytics tools:

  • The analyzer, who collects, analyzes, interprets, and shares analyses of collected data
  • The decision maker, who generates and decides on the options for actions
  • The actor, who acts on the results

These employees act separately to lead an enterprise toward becoming data-driven, but it’s a process fraught with inefficiencies, misinterpretations, and biases in data collection, analysis, and interpretation. The human latency and error potential makes the process slow and often inconsistent.

In the competitive new world, however, enterprises can’t afford such inefficiencies. Increasingly, we are seeing the need for the analyzer, decision maker, and actor to converge into one person, enabling faster data-driven actions and shorter time to value and growth.

This change will force analytics products to be designed for the decision maker/actor as opposed to the analyzer. They’ll be easy to master, simple to use, and tailored to cater to the needs of a specific use case or task.

Instant Insight

The process of analytics in the current world tends to be after-the-fact analysis of data that drives a product or marketing strategy and action.

However, in the new world, analytics products will need to provide insight into events as they happen, driven by user actions and behavior. Products will need the ability to change or impact the behavior of users, their transactions, and the workings of products and services in real time.

Analytics and BI Products and Platforms

In the traditional analytics world, analytics products tend to be bulky and broad in their flexibility and capabilities. These capabilities range from “data collection” to “analysis” to “visualization.” Traditional analytics products tend to offer different interfaces to the decision makers and the analyzers.

However, in the new world of analytics, products will need to be minimalistic. Analytics products will be tailored to the skills and needs of their particular users. They will directly provide recommendations for specific actions tied directly to a particular use case. They will provide, in real time, the impact of these actions and offer options and recommendations to the user to fine tune, if needed.

The Decision Maker’s Stream of Consciousness

In context of the changing people, process, and product constraints, analytics products will need to adapt to the needs of decision makers and their process of thinking, analyzing, and arriving at decisions. For every enterprise, a study of the decision maker’s job will reveal a certain set of decisions and actions that form the core of their responsibilities.

As we mentioned earlier, yesterday’s successful analytical products will morph into a set of mini analytics apps that deliver the analysis, recommendations, and actions that need to be carried out for each of these decisions/actions. Such mini apps will be tuned and optimized for each use case individually, for each enterprise.

These apps will also empower the decision maker’s stream of consciousness. This will be achieved by emulating the decision maker’s thought process as a series of analytics layered to offer a decision path to the user. In addition, these mini apps will enable the exploration of tangential questions that arise in the user’s decision making process.

Analytics products will evolve to become more predictive, recommendation-based, and action oriented; the focus will be on driving action and reaction. This doesn’t mean that the process of data collection, cleansing, transformation, and preparation is obsolete. However, it does mean that the analysis is pre-determined and pre-defined to deliver information to drive value for specific use cases that form the core of the decision maker’s responsibility in an enterprise.

This way, users can spend more time reacting to their discoveries, tapping into their streams of consciousness, taking action, and reacting again to fine-tune the analysis.

Four Common Mistakes That Can Make For A Toxic Data Lake

Originally Published on Forbes

Data lakes are increasingly becoming a popular approach to getting started with big data. Simply put, a data lake is a central location where all applications that generate or consume data go to get raw data in its native form. This enables faster application development, both transactional and analytical, as the application developer has a standard location and interface to write data that the application will generate and a standard location and interface to read data that it needs for the application.
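
As a rough illustration of the “standard location and interface” idea, the sketch below appends raw, newline-delimited JSON records under a per-data-set, per-day path. The layout, base path and file naming are assumptions for illustration, not any particular product’s convention.

    import json, datetime, pathlib

    # Minimal sketch of a "standard location and interface" for raw data:
    # applications write newline-delimited JSON under a dataset/date partition.
    # The base path and layout are illustrative, not a specific product's API.
    def write_raw(base: str, dataset: str, record: dict) -> pathlib.Path:
        day = datetime.date.today().isoformat()
        path = pathlib.Path(base) / dataset / f"dt={day}"
        path.mkdir(parents=True, exist_ok=True)
        out = path / "part-0000.json"
        with out.open("a") as f:
            f.write(json.dumps(record) + "\n")
        return out

    print(write_raw("/tmp/lake", "orders", {"order_id": 1, "amount": 42.0}))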

However, left unchecked, data lakes can quickly become toxic: a cost to maintain, while the value delivered from them shrinks or simply never materializes. Here are some common mistakes that can make your data lake toxic.

Your big data strategy ends at the data lake.

A common mistake is to treat a data lake as the implementation of the entire big data strategy. This is a common choice because building a data lake is a deterministic project that IT can plan for and deliver given a budget. However, the assumption that “if you build it, they will come” is not correct. Blindly hoping that the data lake will be filled with data from the various applications and systems that already exist or will be built or upgraded in the future, and that the data in the data lake will be consumed by data-driven applications, is a common mistake.

Enterprises need to ensure that the data lake is part of an overall big data strategy where application developers are trained and mandated to use the data lake as part of their application design and development process. In addition, applications (existing, in development or planned) need to be audited for their data needs and the usage of the data lake needs to be planned for and incorporated into the design.

Enterprises need to ensure that their business strategy is bound to the data lake and vice versa. Without this, a data lake is bound to be stunted into an IT project that never really lives up to its potential of generating incremental business value.

In addition, enterprises need to ensure that the organization does not use the data lake as a dumping ground. Data that enters the data lake should be of high quality and generated in a form that makes it easier to understand and consume in data driven or analytic applications. Data that gets generated without any thought given to how it would be consumed often ends up being dirty and unusable.

The data in your data lake is abstruse.

If attention is not paid to it, data in a data lake can easily become hard to discover, search or track. Without thinking through what it means to discover and use data, enterprises filling up the data lake will simply end up with data sets that are either unusable or untrustworthy.

The best practice for avoiding unusable data in the data lake is to capture, alongside the data, metadata that includes the lineage of the data, i.e., how it was created, where it was created, what its acceptable and expected schema is, what its field types are, how often the data set is refreshed, and so on. In addition, each data set should have an owner (application, system or entity), a categorization, tags, access controls and, if possible, the ability to preview a sample. This metadata organization ensures that application developers or data scientists looking to use the data can understand the data source and use it correctly in their applications.
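
A minimal sketch of such a per-data-set metadata record, with hypothetical field names rather than any standard catalog schema, might look like this:

    from dataclasses import dataclass, field
    from typing import List

    # Illustrative sketch of the per-data-set metadata described above; the
    # field names are assumptions, not a standard catalog schema.
    @dataclass
    class DatasetMetadata:
        name: str
        owner: str                      # owning application, system, or team
        lineage: str                    # how and where the data was created
        schema: dict                    # expected columns and their types
        refresh_frequency: str          # e.g. "hourly", "daily"
        tags: List[str] = field(default_factory=list)
        access_controls: List[str] = field(default_factory=list)
        sample_preview_uri: str = ""

    orders_meta = DatasetMetadata(
        name="orders_raw",
        owner="checkout-service",
        lineage="emitted by checkout-service on order completion",
        schema={"order_id": "string", "amount": "decimal", "ts": "timestamp"},
        refresh_frequency="hourly",
        tags=["orders", "revenue"],
        access_controls=["analytics-readers"],
    )
    print(orders_meta.name, orders_meta.refresh_frequency)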

All data sets in your data lake should have an associated “good” definition. For example, every data set should have a definition of an acceptable data record, including the data generation frequency, acceptable record breadth, expected volume per record and per time interval, expected and acceptable ranges for specific columnar values, any sampling or obfuscations applied and, if possible, the acceptable uses of the data.
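
Expressed as code, a “good record” definition for one hypothetical data set could be a handful of checks like the following; the fields and thresholds are illustrative.

    # Sketch of a "good record" definition for one data set, expressed as simple
    # checks; the thresholds and field names are illustrative assumptions.
    EXPECTED_FIELDS = {"order_id", "amount", "ts"}

    def is_good_record(record: dict) -> bool:
        if set(record) != EXPECTED_FIELDS:          # acceptable record breadth
            return False
        if not (0 < record["amount"] < 100_000):    # acceptable columnar range
            return False
        return True

    print(is_good_record({"order_id": "a1", "amount": 42.0, "ts": "2024-11-01T10:00:00Z"}))
    print(is_good_record({"order_id": "a2", "amount": -5.0, "ts": "2024-11-01T10:01:00Z"}))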

The data in your data lake is entity-unaware.

Often, when data is generated, attention is not paid to carefully recording the entities that were part of an event. For example, the identifiers for the user, the service, the partner and so on that came together in an event might not be recorded. This can severely restrict the data use cases that can be built on top of the data set. It is much easier to aggregate and obfuscate these identifiers later than to reconstruct them after the fact.
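
The difference is easy to see in a hypothetical event record: the entity-aware version captures the identifiers at generation time, and they can always be aggregated or obfuscated downstream.

    # Two ways to log the same hypothetical event: the entity-unaware version
    # cannot support user- or partner-level analysis later; the entity-aware
    # version records the identifiers that came together in the event.
    event_unaware = {
        "ts": "2024-11-01T10:00:00Z",
        "action": "api_call",
        "latency_ms": 120,
    }

    event_aware = {
        "ts": "2024-11-01T10:00:00Z",
        "action": "api_call",
        "latency_ms": 120,
        "user_id": "u-123",       # entity identifiers captured at generation time;
        "service_id": "svc-9",    # they can be aggregated or obfuscated later,
        "partner_id": "p-42",     # but not recovered if never recorded
    }
    print(len(event_aware) - len(event_unaware), "extra entity fields")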

Similarly, data that is not generated and stored at the highest possible level of granularity carries the risk of having its applicability and value diminished. This often happens when less data is preferable due to storage or compute concerns. It can also happen when the logging of data is not asynchronous, i.e., when logging impacts the transaction processing of the system.

The data in your data lake is not auditable.

Data lakes that do not track how their data is being used, and that cannot produce at any point in time the users that access the data, the processes that use or enhance the data, the redundant copies of the data and how they came to be, and the derivations of data sets, can quickly become a nightmare to maintain, upgrade and adapt.

Without such auditability built into the data lake, enterprises end up stuck with simply large data sets that consume disk and increase the time it takes to process data records, while increasing the probability that data is misused or misinterpreted.

In addition, if the data lake does not offer additional services that make it easier for consumers of the data to evaluate and actually use the data, the expected value from the data lake can be severely restricted. Enterprises should consider building and maintaining application directories that track contributors and readers (applications) of the data sets in the data lake, as well as an index of data sets organized by categories, tags, sources, applications and so on, including the ability to quickly surface related data sets and data sets with parent-child relationships.
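
A toy sketch of that application-directory idea, with hypothetical application and data set names, could be as simple as recording reader and writer applications per data set:

    from collections import defaultdict

    # Minimal sketch of the audit/application-directory idea: track which
    # applications read or write each data set. Names are illustrative.
    directory = defaultdict(lambda: {"writers": set(), "readers": set()})

    def record_access(dataset: str, application: str, mode: str) -> None:
        role = "writers" if mode == "write" else "readers"
        directory[dataset][role].add(application)

    record_access("orders_raw", "checkout-service", "write")
    record_access("orders_raw", "revenue-dashboard", "read")
    record_access("orders_raw", "churn-model", "read")

    # At any point, the lake can answer "who touches this data set?"
    print({k: {r: sorted(v) for r, v in d.items()} for k, d in directory.items()})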

As the volume of data grows, the number of disparate data sets grows and the number of consumers that interact with and impact these data sets increases, enterprises will increasingly be faced with a data lake management nightmare and will be forced to set aside more IT resources to track and maintain their data lakes. Some simple guidelines and best practices on how data (and its use) is generated, stored and cataloged can ensure that the data lake does not become toxic and that it delivers on the promised value that was the reason for its creation in the first place.