The U.S. government’s interest in supercomputing, from ENIAC’s hydrogen bomb simulations to NOAA’s hurricane landfall predictions to the IRS’s tax processing, is so well established that it’s only natural the government would find big data tools and techniques equally attractive. The term “big data” refers to data sets so large that they can’t be captured, processed, stored, searched or analyzed by standard relational database and data visualization technologies, nor stored in the usual manner; this information has to live in a data warehouse. The federal government regularly generates datasets of this size.
One 2012 study by the Federal Big Data Commission said, “In 2009, the U.S. Government produced 848 petabytes of data and U.S. healthcare data alone reached 150 exabytes. Five exabytes (10^18 bytes each) of data would contain all words ever spoken by human beings on earth.” The benefits of using big data tools have become apparent as the tools have matured and their cost has dropped into a range accessible to more users. Examples include the Centers for Medicare and Medicaid Services (CMS) using big data to analyze the Medicare reimbursement system for improper payments.
CMS generates terabytes of data every day, and only with the emergence of data warehousing technologies, MapReduce frameworks like Hadoop, and real-time streaming analytic tools has it become possible to analyze these torrents of data.
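The MapReduce pattern that Hadoop popularized can be illustrated with a small, self-contained sketch. The claim records, provider IDs, and dollar amounts below are entirely hypothetical, and plain Python stands in for a real Hadoop cluster; the point is only the map → shuffle → reduce flow that lets this kind of aggregation scale across machines.

```python
from collections import defaultdict

# Toy claim records (provider_id, amount); format and values are hypothetical.
claims = [
    ("P001", 120.0), ("P002", 90.0), ("P001", 450.0),
    ("P003", 60.0), ("P002", 310.0), ("P001", 75.0),
]

def map_phase(records):
    # Map: emit one (key, value) pair per input record.
    for provider, amount in records:
        yield provider, amount

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values (here, total billed per provider).
    return {key: sum(values) for key, values in groups.items()}

totals = reduce_phase(shuffle(map_phase(claims)))
print(totals)  # → {'P001': 645.0, 'P002': 400.0, 'P003': 60.0}
```

In a real deployment each phase runs in parallel across many nodes; the per-provider totals could then be compared against reimbursement schedules to surface candidates for improper-payment review.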
The goal is clear enough: to discover patterns and anomalies that were not visible in isolated data silos. By taking in data from multiple sources, it is now possible to see a holistic view of a government agency’s situation.
In March of 2012, the Obama Administration directed several agencies to create or enhance big data efforts. The Department of Defense, for example, was tasked with ADAMS (Anomaly Detection at Multiple Scales), an effort designed to ferret out cyber attacks by examining real-time streams of network data and identifying anomalous behavior. Additionally, the Department of Homeland Security’s CVADA project (Center of Excellence on Visualization and Data Analytics) is designed to organize multiple heterogeneous live data feeds for analysis by first responders.
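The kind of streaming anomaly detection ADAMS describes can be sketched with a simple rolling z-score: flag any value that deviates sharply from the recent baseline. This is a minimal illustration, not the actual ADAMS method; the traffic figures and thresholds below are invented for the example.

```python
from collections import deque
from statistics import mean, pstdev

def anomalies(stream, window=20, threshold=3.0):
    """Yield (index, value) for values more than `threshold` standard
    deviations from the mean of the last `window` observations."""
    recent = deque(maxlen=window)
    for i, value in enumerate(stream):
        if len(recent) >= 5:  # wait for a minimal baseline before scoring
            mu, sigma = mean(recent), pstdev(recent)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                yield i, value
        recent.append(value)

# Hypothetical per-second packet counts with one injected spike.
traffic = [100, 98, 103, 101, 99, 102, 97, 100, 5000, 101, 99]
print(list(anomalies(traffic)))  # → [(8, 5000)]
```

Real systems replace the fixed window with adaptive baselines and score many signals at once, but the core idea is the same: compare each new observation against a model of recent normal behavior as the data streams past.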
There are also applications that are not crisis-oriented, such as the National Institutes of Health’s Cancer Genome Atlas, which the White House describes as a “comprehensive and coordinated effort to accelerate understanding of the molecular basis of cancer through the application of genome analysis technologies, including large-scale genome sequencing.”
The CGA is expected to accumulate several petabytes of data by the end of 2014.
Other science-related big data initiatives to which the federal government has committed are underway at NASA, the Department of Energy, and the Centers for Disease Control and Prevention.
The Federal Government is also trying its hand at data warehousing. Tucked at the end of the White House statement is an offhand mention of the National Security Agency's Big Data initiatives.
“Wired” reporter James Bamford wrote that the NSA intends to “intercept, decipher, analyze and store vast amounts of the world’s communications from satellites and underground and undersea cables of international, foreign and domestic networks.” Bamford says this will include “private emails, mobile phone calls and Google searches, as well as personal data trails – travel itineraries, purchases and other digital ‘pocket litter’” to be stored in a dedicated Bluffdale, Utah data center expected to house an undisclosed number of yottabytes (one yottabyte = 10^24 bytes) – big data by anyone’s definition.
Not all government data comes from e-mail or phone intercepts, nor will it all be housed in dedicated data centers. Much of the government’s data will come from sources such as sensor networks, network logs, RFID readers, and satellite telemetry.
Over the next few years, the government is trying to consolidate its many data centers from more than 3,100 to a more manageable 1,800. One expected benefit of consolidation under the Federal Data Center Consolidation Initiative (FDCCI) is $5 billion in cost savings.
Big data is an extraordinary tool for the United States Federal Government, for better and for worse. The ability to store and sort through yottabytes of data could lead the government to spy on more citizens; at the same time, many agencies will be able to use this data to improve the lives of American citizens.
Edited by Braden Becker