In a past post (“Data Science and LP: Much Ado about Something”), we discussed the rapidly growing amount of data and the expansion of retail data management into every aspect of our loss prevention business.
As we increasingly turn to data for the insights that drive our decisions, we do so with the assumptions that the data is available to us, that we can use it, and, even more importantly, that it is correct.
None of this happens without excellent retail data management, yet few of us are aware of the complexities of this practice.
Since it is critical to our understanding of the data world, this post focuses on the management of data in its many forms and what these practices provide for us in the realm of retail LP. We’ll then wrap up our discussion with some points about how data management is implemented in ways that are relevant to the retail industry.
What is Retail Data Management?
To establish a strong working definition of data management, we’ll turn to the Data Management Association International (DAMA International). To paraphrase, they define data management as “the development and management of architectures, processes, and policies that manage data through its full lifecycle as needed by the enterprise.”
Data management allows us to be confident in the availability and integrity of the data we use—something that is imperative to doing our jobs correctly.
As loss prevention professionals, we need to understand the full lifecycle of data, including:
- Acquisition and storage of data in its various forms.
- Transformation of that data into more useful formats to better support our practical needs.
- Validation to ensure downstream processes can rely on the integrity of the data.
- Extension of data: adding new sources, new use cases, and new users.
- Data governance, including policies and procedures for access, usage rights, operational management, and how to integrate new data assets. Documentation of the data assets stored in the environment is an important part of this process.
What Data Do We Have, and What Can We Do with It?
It’s helpful to understand a few terms when talking about data management:
Raw vs. Processed Data: Raw data, also referred to as ‘source data’, comes directly from the system of record and has not been processed for use. It can come in many forms including binary, video, audio, or formatted text. Typically, raw data has to be transformed in some way to make it usable in systems and processes.
Structured vs. Unstructured Data: Structured data is highly organized, usually in a form that can be easily manipulated or searched. For example, data in a spreadsheet or relational database is structured. Unstructured data has no predefined organization, and can include audio, visual, and written documents (such as magazine articles).
Directly Observed vs. Derived Data: Directly observed data is data that was explicitly described in the raw source data. Derived data is created from directly observed data by using some type of transformation such as math or logic. For example, the raw file for a point-of-sale (POS) transaction header may include the transaction total, but it may not include the item count. We can compute the item count (derived data) and insert it alongside the transaction total (directly observed data).
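The item-count example above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular POS system: the field names (`txn_id`, `total`, `qty`) and the sample records are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical raw records: the header carries the transaction total
# (directly observed) but no item count.
headers = [
    {"txn_id": "T1001", "total": "23.97"},
    {"txn_id": "T1002", "total": "5.49"},
]
line_items = [
    {"txn_id": "T1001", "sku": "A", "qty": 2},
    {"txn_id": "T1001", "sku": "B", "qty": 1},
    {"txn_id": "T1002", "sku": "C", "qty": 1},
]


def add_item_count(headers, line_items):
    """Return copies of the headers with a derived item_count field."""
    counts = Counter()
    for item in line_items:
        counts[item["txn_id"]] += item["qty"]
    # Build new dicts so the raw headers stay untouched.
    return [{**h, "item_count": counts[h["txn_id"]]} for h in headers]


enriched = add_item_count(headers, line_items)
```

Returning new records rather than modifying the raw ones keeps the directly observed data intact, so the derived value sits alongside the source instead of replacing it.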
An effective retail data management system allows different parties to easily access and work with all types of data, in both raw and processed forms, and allows us to link together directly observed and derived data. Interested parties may use the data in their applications, or may create derived data to be used by others in other applications. In this way, an effective data management system allows downstream analysis by end users who can extract value even when they don’t have the data expertise to work with raw or unstructured/unprocessed data.
For example, a data science team may wish to run video analytics on raw CCTV footage, or may want to perform audio analytics on recordings from the customer service department. They may take this newly derived data, save it, and link it to related transactions so that it can be used by others.
A loss prevention data analyst working on exception-based reporting (EBR) may want to process transactions by appending information that was not available at the time of the transaction itself (such as returned or post-void indicators). These results would also be saved into the environment for use by others.
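As a rough sketch of the EBR case, the snippet below appends returned and post-void indicators to transactions using events that arrive later. The record layout (`original_txn_id`, `event`) and the event names are hypothetical; real POS feeds will differ.

```python
# Hypothetical transactions as first recorded at the register.
transactions = [
    {"txn_id": "T1001", "amount": "23.97"},
    {"txn_id": "T1002", "amount": "5.49"},
    {"txn_id": "T1003", "amount": "112.00"},
]

# Hypothetical later events that point back at an original transaction.
later_events = [
    {"event": "post_void", "original_txn_id": "T1003"},
    {"event": "return", "original_txn_id": "T1001"},
]


def append_indicators(transactions, later_events):
    """Return copies of the transactions with post_voided/returned flags."""
    voided = {e["original_txn_id"] for e in later_events
              if e["event"] == "post_void"}
    returned = {e["original_txn_id"] for e in later_events
                if e["event"] == "return"}
    return [
        {**t,
         "post_voided": t["txn_id"] in voided,
         "returned": t["txn_id"] in returned}
        for t in transactions
    ]


flagged = append_indicators(transactions, later_events)
```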
Validation is a First-Class Process
As discussed before, users need to have confidence that the data they are working with is accurate. Therefore, as data is loaded and transformed, it’s important to have a validation process to monitor the accuracy and completeness of the results.
A strong data validation process creates confidence in your data systems and will create more wins from the models and analytics that use that data. Skipping validation may produce incorrect or misleading results, creating reputational risk for any team whose work is based on those results.
Validation is not simply ensuring that data was loaded without error. Validation should involve point-in-time, time-series, and business-rule validation. Some examples are:
- Beginning-to-end walkthroughs: “Can I reconcile today’s POS transaction amounts against an EOD sales report?”
- Over-time and time-series validation: “Is today’s value too different from the recent daily trend?”, “Does today’s amount represent a reasonable same-day-last-year change? How about same-day-last-week?”
- Business-rule validation: “Amounts must have two decimal places only.”
Observations falling outside of expectations should flag the related data as suspicious. Only when flagged records are corrected or confirmed as correct can they be released for downstream use.
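The three kinds of checks, and the rule that flagged data is held back, might be sketched as follows. This is one possible shape under stated assumptions: the function names, the three-standard-deviation threshold, and the sample figures are all illustrative choices, not a prescribed method.

```python
from decimal import Decimal
from statistics import mean, stdev


def has_two_decimals(amount: Decimal) -> bool:
    """Business rule: the amount carries exactly two decimal places."""
    return amount.as_tuple().exponent == -2


def within_trend(today: float, recent: list[float],
                 max_sigma: float = 3.0) -> bool:
    """Time-series check: today's value lies within max_sigma standard
    deviations of the recent mean. Needs at least two recent values."""
    mu = mean(recent)
    sigma = stdev(recent)
    if sigma == 0:
        return today == mu
    return abs(today - mu) <= max_sigma * sigma


def validate_day(pos_amounts, eod_total, recent_daily_totals):
    """Return the names of failed checks; an empty list means release."""
    failures = []
    pos_total = sum(pos_amounts, Decimal("0"))
    if not all(has_two_decimals(a) for a in pos_amounts):
        failures.append("business_rule")
    # Beginning-to-end: POS amounts must reconcile to the EOD report.
    if pos_total != eod_total:
        failures.append("reconciliation")
    if not within_trend(float(pos_total), recent_daily_totals):
        failures.append("trend")
    return failures


recent = [100.0, 102.0, 98.0, 101.0, 99.0]  # illustrative daily totals
failures = validate_day(
    [Decimal("40.00"), Decimal("60.50")], Decimal("100.50"), recent
)
release_downstream = not failures
```

Amounts are held as `Decimal` rather than floating-point values so that the reconciliation compares exact currency figures and the two-decimal rule can be checked directly.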
Best Practices in Retail Data Management
Having structured and unstructured data, and needing to allow access to the data in raw and various processed forms, presents some interesting challenges to traditional data warehouse architectures. “Data lakes” have been created as a way to address this.
In the same way that a lake ecosystem has varieties of flora and fauna, data lakes are storage environments that handle structured and unstructured data, while supporting managed access for various types of end-users. Many data lake environments use technology based on Hadoop, a collection of data management tools designed for handling high-volume structured and unstructured data, though this is not the only option available.
Hybrid database environments are also common. In these environments, additional data environments are created alongside a data lake to support specific use cases. These additional environments may use different storage technologies than the data lake, based on their target use cases.
For example, specialized applications such as modeling, analytics, and visualization can benefit from structured data environments. These structured environments could be stored in relational databases, which are typically queried with SQL. In other instances, highly scalable columnar storage systems are used. These systems work well with analytics-intensive operations, where data needs to be read frequently at high speed.
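To see why a columnar layout suits analytics-heavy reads, compare the two in-memory shapes below. This is a toy illustration in plain Python, not how any specific storage engine works; the field names and values are invented for the example.

```python
# Row-oriented: each record is stored together, which suits looking up
# or updating one transaction at a time.
rows = [
    {"txn_id": "T1001", "store": "S01", "amount_cents": 2397},
    {"txn_id": "T1002", "store": "S01", "amount_cents": 549},
    {"txn_id": "T1003", "store": "S02", "amount_cents": 11200},
]


def to_columns(rows):
    """Pivot row records into one list per field.

    Assumes every row has the same fields.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


# Column-oriented: each field is stored together, so an aggregate over
# one field reads only that field's values.
columns = to_columns(rows)
total_cents = sum(columns["amount_cents"])
```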
If you think of the data environment as a real lake, the above data stores can be thought of as streams that run from the lake. They bring the water further from the source, closer to its endpoint. The distributaries in this ecosystem would be the visualization and business intelligence (BI) tools that exist at these edges. They take the data that we have and distribute it to end users.
This can be in the form of reports, dashboards, and applications that let us see and understand the value that has been created from the initially unprocessed lake of data. Data starts off in the lake in its most natural form, but progressively gets processed and refined as it makes its way to the edge. Good retail data management supports this lifecycle and the use cases that exist within it.
Where Do We Go from Here?
Establishing a strong data environment forms the foundation for good analytics and modeling. Good data management can make all the difference between teams of data scientists finding success easily or struggling to get basic data.
Editor’s Note: Check out LPM’s podcast interview (“Why Data Science Is Everywhere Now”) with Cheryl Blake and Troy Rhein of Verisk in this audio snippet.
This post was updated April 25, 2019.