Normalizing Corporate Small Data

Fast. Today. Now. These are the terms of modern business. Data is the fuel of business activities. We need better fuel.

Why Data Normalization?

  • Meaningful analytics
  • Continuous improvement as business evolves
  • Low cycle time enables business tempo
  • Handles complex rules across multiple sources
  • Gracefully incorporates uncertainty
  • Visible, auditable functions
  • Rigorous Data Science foundation
  • Works when other methods do not

Corporate Small Data

Corporate Small Data is the structured data that fuels an organization's main activities, and whose problems with accuracy and trustworthiness are well documented rather than merely alleged. It includes financial, customer, company, inventory, medical, risk, supply chain, and other primary data used for decision making, applications, reports, and Business Intelligence. It is Small Data only relative to the much ballyhooed Big Data of the Terabyte range.

Corporate Small Data does not include the predominant Big Data examples, which are almost all stochastic use cases. These can succeed even when there is error in the source data and uncertainty in the results, since the business objective is identifying trends or making general associations. In stark contrast are deterministic use cases, where the ramifications of wrong results are severely negative: executive decision making, accounting, risk management, regulatory compliance, and security.

Data Normalization

Data Normalization is drawn from science and is therefore a Data Science approach to the corporate data challenge. In this sense, Data Normalization is not the same as normalization in data modeling (e.g. third normal form); it attacks a larger problem because it combines subject matter knowledge, governance, business rules, and raw data. Indeed, many scientific and engineering fields routinely solve challenges of much higher complexity than those found in corporate Small Data, and do so with an order of magnitude less expense, time, and organizational friction.

Data Normalization is closer to normalization in statistics, where the objective is to make values directly comparable by eliminating errors, biases, and scale differences. The same notion appears in financial regulation, where it comes with an ever-changing set of policies and rules, and in scientific studies, where it is used to constrain calculations and thereby make them solvable. Doing this successfully requires complete, evolvable, and integrated knowledge of all aspects of a challenge, rather than a segmented approach in which individual parts are handed off between groups without any real exchange of knowledge.

[Figure: normalized matrix]
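As a small, concrete illustration of the statistical sense of normalization mentioned above, the following Python sketch rescales the same metric reported by two hypothetical source systems so the values can be compared directly. The source names and figures are invented for this example.

```python
# A minimal sketch of statistical normalization (z-scores), assuming two
# hypothetical source systems report the same metric on different scales.
# All names and values here are illustrative only.

from statistics import mean, stdev

def z_scores(values):
    """Rescale values for comparison across sources:
    subtract the source's mean and divide by its standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]

# Hypothetical monthly revenue figures from two systems with different
# units and baselines (e.g. thousands vs. raw dollars).
source_a = [120, 135, 150, 160, 155]
source_b = [118_000, 133_000, 149_500, 161_000, 154_000]

normalized_a = z_scores(source_a)
normalized_b = z_scores(source_b)

# After normalization the two series are directly comparable,
# since scale and offset differences have been removed.
for a, b in zip(normalized_a, normalized_b):
    print(f"{a:+.2f}  {b:+.2f}")
```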

Normalizing data requires adjustable and powerful computing tools. A critical success factor is producing iterative, variable, and transparent results tuned to the personal and organizational work tempo of analysts, managers, and business product delivery. In almost all mission-critical activities, the business's specific requirements on the data environment are neither static nor known at a level of detail sufficient to feed traditional tools and methods. This leads to the accursed business-technical organizational chasm. Into this fray comes a little-noticed aspect of Hadoop: easily customizable parallel computing. We all know Hadoop provides parallel processing of distributed data; that is why it was built. But we are focusing on what real executives need done, done right, done now, and then redone tomorrow. This is where Data Normalization is most beneficial.
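To make the "easily customizable parallel computing" point concrete, here is a minimal sketch of a Hadoop Streaming job written in Python. The record layout, field names, and business rules it applies are assumptions for illustration, not a prescribed format.

```python
#!/usr/bin/env python3
# A minimal sketch of customizable parallel computing via Hadoop Streaming,
# which runs ordinary scripts as mappers and reducers over stdin/stdout.
# Record layout and business rules below are assumptions for illustration.

import sys

def mapper():
    # Input: tab-separated records, e.g. "customer_id<TAB>amount<TAB>currency".
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # rule: drop malformed records
        customer_id, amount, currency = fields[:3]
        try:
            value = float(amount)
        except ValueError:
            continue  # rule: drop non-numeric amounts
        if currency.upper() != "USD":
            continue  # rule: USD only today; tomorrow the rule changes, not the platform
        print(f"{customer_id}\t{value}")

def reducer():
    # Hadoop delivers mapper output to the reducer grouped and sorted by key.
    current_key, total = None, 0.0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key and current_key is not None:
            print(f"{current_key}\t{total}")
            total = 0.0
        current_key = key
        total += float(value)
    if current_key is not None:
        print(f"{current_key}\t{total}")

if __name__ == "__main__":
    # Hadoop Streaming invokes the same script twice: once as the mapper,
    # once as the reducer, selected here by a command-line argument.
    mapper() if "map" in sys.argv[1:] else reducer()
```

Submitted with the Hadoop Streaming jar (using the usual -input, -output, -mapper, and -reducer options; the jar's path varies by distribution), the same script fans out across the cluster, and changing a business rule is a one-line edit rather than a re-platforming effort.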

The ultimate goal of data management in general, and Data Normalization in particular, is to enable meaningful analysis that can then be used for human decision making, for secure and trusted software applications, and for meeting legal and regulatory compliance. Data for data's sake is not the goal. All of the different uses, definitions, assumed meanings, and commonly unknown semantics of real data make this imperative. Simple metadata management, governance, and even semantic technology cannot make progress if they are divorced from a combined view of design (i.e. architecture) and actual data values.

The first and most important question is: what is a meaningful analysis, and how does it differ from other analyses? A meaningful analytic product presents information in a concise and easy-to-understand manner, where the underlying data has gone through an iterative series of investigation, normalization, review, and, most importantly, adjudication. It does not have to be perfect.
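The following Python sketch illustrates that iterative cycle of investigation, normalization, review, and adjudication on a handful of invented records; the field names, rules, and adjudication log are hypothetical.

```python
# A minimal sketch of the investigate-normalize-review-adjudicate cycle.
# Records, rules, and the adjudication log are hypothetical; the point is
# that reviewer decisions are recorded, auditable, and reapplied each pass.

records = [
    {"id": 1, "region": "NE", "revenue": "1200"},
    {"id": 2, "region": "ne", "revenue": "950"},
    {"id": 3, "region": "Northeast", "revenue": "n/a"},
    {"id": 4, "region": "SW", "revenue": "pending"},
]

# Adjudication log: human decisions kept visible alongside the rules.
adjudications = {3: {"revenue": 0.0, "note": "reviewer confirmed no sales this period"}}

def normalize(record):
    """Apply the current business rules to one record."""
    out = dict(record)
    out["region"] = {"ne": "NE", "northeast": "NE"}.get(
        out["region"].lower(), out["region"].upper())
    try:
        out["revenue"] = float(out["revenue"])
    except ValueError:
        out["revenue"] = None  # flag for human review instead of guessing
    return out

def apply_adjudications(record):
    """Overlay recorded reviewer decisions on top of rule-based normalization."""
    decision = adjudications.get(record["id"], {})
    record.update({k: v for k, v in decision.items() if k != "note"})
    return record

normalized = [apply_adjudications(normalize(r)) for r in records]
still_open = [r["id"] for r in normalized if r["revenue"] is None]

for r in normalized:
    print(r)
print("awaiting adjudication:", still_open)  # the next pass of the cycle starts here
```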

Hadoop provides a complete management and execution environment for parallel processing and very large data storage. This makes it well suited to the iterative, constantly updated, multi-method analysis needed to explore, understand, and digest data into meaningful products, all within the daily tempo of modern organizations and analysts.