To improve approaches for analyzing very large quantities of data, computer scientists at the National Institute of Standards and Technology (NIST) have released broad specifications for how to build more widely useful technical tools for the job.
Following a multiyear effort, the agency has published the final version of the NIST Big Data Interoperability Framework, a collaboration between NIST and more than 800 experts from industry, academia and government.
Filling nine volumes, the framework is intended to guide developers on how to deploy software tools that can analyze data using any type of computing platform, be it a single laptop or the most powerful cloud-based environment. Just as important, it can allow analysts to move their work from one platform to another and substitute a more advanced algorithm without retooling the computing environment.
“We want to enable data scientists to do effective work using whatever platform they choose or have available, and however their operation grows or changes,” said Wo Chang, a NIST computer scientist and convener of one of the collaboration’s working groups, in a statement. “This framework is a reference for how to create an ‘agnostic’ environment for tool creation. If software vendors use the framework’s guidelines when developing analytical tools, then analysts’ results can flow uninterruptedly, even as their goals change and technology advances.”
The framework fills a long-standing need among data scientists, who are asked to extract meaning from ever-larger and more varied datasets while navigating a shifting technology ecosystem. Interoperability is increasingly important as these huge amounts of data pour in from a growing number of platforms, ranging from telescopes and physics experiments to the countless tiny sensors and devices we have linked into the internet of things. While several years ago the world was generating 2.5 exabytes (billion billion bytes) of data each day, that number is predicted to reach 463 exabytes daily by 2025. (This is far more than would fit on 212 million DVDs, which at 4.7 gigabytes per single-layer disc hold about 1 exabyte in total.)
Computer specialists use the term “big data analytics” to refer to the systematic approaches that draw insights from these ultra-large datasets. With the rapid growth of tool availability, data scientists now have the option of scaling up their work from a single, small desktop computing setup to a large, distributed cloud-based environment with many processor nodes. But this shift often places enormous demands on the analyst. For example, tools may have to be rebuilt from scratch in a different programming language or around a different algorithm, costing staff time and potentially delaying time-critical insights.
“Performing analytics with the newest machine learning and AI techniques while still employing older statistical methods will all be possible,” Chang said. “Any of these approaches will work. The reference architecture will let you choose.”