Early Hadoop Projects Hitting Data Security Pangs

Many organizations adopt Hadoop and leverage existing business intelligence systems, using both legacy systems and the new NoSQL cluster storage model to increase visibility, Schreiber said. Enterprises that started several years ago have migrated data from mainframe systems to the large storage clusters. Other firms continue to maintain data connections to legacy mainframe storage environments while engineers plan out a data migration strategy, experts say. 

A Netflix project spun up a 500-node analytics cluster using masking to protect sensitive data; the spin-up took just eight hours. If data scientists probe the data with the right questions, the analytics can reveal a great deal of insight, said Adrian Lane, a security industry veteran who serves as an analyst and chief technology officer at Phoenix, Ariz.-based research firm Securosis.

"When entrenched in legacy land, newer, more agile NoSQL is blowing that out of the water, and organizations that want to be faster, more agile and more efficient can do it with these tools," Lane said. "If you are spinning up a new cluster, it can take only a matter of hours before you start pulling data in."

Lane has been researching the information security practices associated with Hadoop. Early Hadoop projects were built mainly on the early Apache Hadoop-based software from Palo Alto, Calif.-based Cloudera. Some early adopters used San Jose, Calif.-based MapR's Hadoop-based software. In 2013, experts told CRN that enterprises were increasingly adopting Hadoop software developed by Hortonworks, also based in Palo Alto. Database systems often connected to big data implementations include Apache Cassandra, the open source distributed database management system, and the NoSQL database management system from Austin, Texas-based DataStax.

A variety of vendors provide data security for Hadoop; among them are Vormetric, Protegrity and Voltage, which specialize in data tokenization and encryption, Lane said. Informatica, which sells data integration products, is one of the largest firms providing data masking and other protection measures for big data projects, Lane said.

In addition to insider threats, which pose the risk of data exposure or loss, Web-based applications open the door to intrusion, Lane said. Web-based applications that provide analytics should be checked for software vulnerabilities, he said.

"You've got a soup of Web-based technologies that work in concert with some big data tools, and all of them have their own set of issues," Lane said.

The underlying kernel management, query components and other processes that initiate the core pieces of the NoSQL environment can also be targeted by an attacker.

Some businesses address security via a walled-garden approach, putting a Hadoop cluster behind perimeter infrastructure to prevent intrusion, Lane said. Other firms protect the cluster itself by implementing security controls such as Kerberos for authentication, he said. A third approach involves protecting data before it goes into the cluster by implementing data masking, tokenization or format-preserving encryption.
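The third approach, protecting data before it ever reaches the cluster, can be illustrated with a minimal tokenization sketch. The `TokenVault` class, field names and sample record below are purely illustrative assumptions, not any vendor's implementation: sensitive fields are swapped for random surrogate tokens held in a separate vault, and only the tokens enter the analytics cluster.

```python
import secrets

class TokenVault:
    """Illustrative in-memory token vault. Real products keep the vault in
    hardened, separately secured storage; this sketch only shows the idea."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse an existing token so the same value always maps consistently.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = secrets.token_hex(8)  # random surrogate, unrelated to the value
        self._value_to_token[value] = token
        self._token_to_value[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Authorized systems outside the cluster can recover the original.
        return self._token_to_value[token]

vault = TokenVault()
record = {"name": "Jane Doe", "ssn": "123-45-6789", "watch_hours": 42}

# Only the hypothetical sensitive fields are tokenized; analytics fields
# stay usable inside the cluster.
SENSITIVE_FIELDS = {"name", "ssn"}
safe_record = {k: (vault.tokenize(v) if k in SENSITIVE_FIELDS else v)
               for k, v in record.items()}
```

Analysts can still group and count on the tokens, since a given value always maps to the same surrogate, but a breach of the cluster alone exposes no raw identifiers.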

Businesses can also leverage the cluster for monitoring log activity by either creating a custom application for monitoring, using third-party products or feeding the log data into a security information event management system, Lane said.
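As a rough illustration of the custom-application option, a small filter might scan a cluster's audit log for denied operations before forwarding them to a security information event management system. The log lines, regular expression and event shape below are assumptions made for the sketch; actual Hadoop audit log formats vary by distribution.

```python
import re

# Hypothetical audit-log pattern, loosely modeled on HDFS-style entries.
AUDIT_PATTERN = re.compile(
    r"allowed=(?P<allowed>true|false)\s+ugi=(?P<user>\S+)\s+cmd=(?P<cmd>\S+)"
)

def suspicious_events(log_lines):
    """Yield structured events for denied operations, the kind of records
    worth forwarding to a SIEM for correlation."""
    for line in log_lines:
        match = AUDIT_PATTERN.search(line)
        if match and match.group("allowed") == "false":
            yield {"user": match.group("user"), "cmd": match.group("cmd")}

sample = [
    "2013-06-01 12:00:01 INFO audit: allowed=true ugi=alice cmd=open",
    "2013-06-01 12:00:02 INFO audit: allowed=false ugi=mallory cmd=delete",
]
events = list(suspicious_events(sample))
```

The same parsed events could just as easily be written back into the cluster for historical analysis or shipped to a third-party monitoring product, the two other routes described above.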

Cloudwick's Schreiber said it is still the early days for big data projects and technologies for data security are rapidly emerging. Threats exist and businesses will use a variety of options to provide security controls, layering them in as part of a traditional defensive model, he said.

"Unfortunately data security issues are going to continue to exist until systems and applications have built-in defensive capabilities," Schreiber said. "I think the vast amount of Hadoop early adopters are doing their due diligence to mitigate risks, because the alternative would be a massive breach or embarrassing data exposure."