Early Hadoop Projects Hitting Data Security Pangs

Large enterprises involved in massive Hadoop projects to extend existing analytics capabilities are finding that data masking and other security measures are necessary to protect sensitive information from prying eyes.

Giant retailers, pharmaceutical firms and other large enterprises that use Apache Hadoop to apply analytics to an ever-expanding volume of data are finding that their early projects require data security and authentication controls to keep sensitive information from being exposed.

Access to pools of data measured in petabytes is often shared among business partners, raising the specter of data leakage. Third-party firms often contract out IT administrators to manage the infrastructure, raising the potential for intrusion by a trusted insider, a business partner or an external attacker probing for system vulnerabilities, said Mark Schreiber, general manager at Newark, Calif.-based Cloudwick, a firm that specializes in managing Apache Hadoop projects.

Schreiber said his firm is seeing an increasing need for data security as businesses transfer a broad range of data from legacy systems into Hadoop clusters. Founded in 2010, Cloudwick currently has 65 big data consultants working full time at many large businesses, including data storage firm NetApp, financial giant JP Morgan, and retailers Home Depot and Wal-Mart. Security is of greater concern at firms in heavily regulated industries, including the pharmaceutical, retail and financial sectors, Schreiber said.

Credit card information, Social Security numbers and other personally identifiable information are being secured through encryption, tokenization or a process called data masking, which sanitizes the data by replacing letters and numbers with random substitute characters. The goal is to address insider threats posed by outsourced Hadoop administrators or developers, said Schreiber, whose firm is an early Dataguise partner. Encryption remains an option, but it can impose a performance hit and create key management issues for Hadoop projects that are shared with business partners, as is often the case in the financial industry, Schreiber said.
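To make the masking idea concrete, here is a minimal Python sketch of character-level masking. It is an illustration only, not Dataguise's or Cloudwick's method; the function name `mask_value` and its `keep_last` parameter are hypothetical.

```python
import random
import string


def mask_value(value, keep_last=4, rng=None):
    """Replace letters and digits with random ones of the same kind.

    Separators such as dashes are kept so the field keeps its format,
    and the last `keep_last` letters/digits are left readable.
    Hypothetical helper for illustration only.
    """
    rng = rng or random.SystemRandom()
    # Positions of the characters that are candidates for masking.
    alnum_positions = [i for i, ch in enumerate(value) if ch.isalnum()]
    keep = set(alnum_positions[-keep_last:]) if keep_last > 0 else set()

    out = []
    for i, ch in enumerate(value):
        if i in keep or not ch.isalnum():
            out.append(ch)  # untouched: separator or retained suffix
        elif ch.isdigit():
            out.append(rng.choice(string.digits))
        elif ch.isupper():
            out.append(rng.choice(string.ascii_uppercase))
        else:
            out.append(rng.choice(string.ascii_lowercase))
    return "".join(out)


masked_card = mask_value("4111-1111-1111-1234")
```

Because the replacement characters are drawn at random and the originals are discarded, the masked field cannot be mapped back to the real card number, yet it still looks like a card number to downstream jobs.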

"There are so many analytical use cases that can provide benefit to businesses and data scientists, but personal information needs to be completely anonymized in many cases," said Manmeet Singh, co-founder and CEO of Fremont, Calif.-based Dataguise, a data security vendor that unveiled a formal partner program this month to identify systems integrators managing Hadoop.  "The data can be shared with anybody, but the sensitive data protection needs to be smart so all the analytics can be applied on that data."

Encryption is reversible, while masking is a one-way function that lets a business retain the analytical value of its data without any possibility of future exposure of that data. Encryption, even when it is format-preserving, often doesn't do a good job of retaining value for analytics, Singh said.
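One way to picture a one-way replacement that keeps analytical value is a keyed hash: the same input always yields the same token, so records can still be grouped and counted, but the token cannot be decrypted. The Python sketch below is an assumption-laden illustration, not the technique used by any vendor named here; `SECRET_KEY`, `pseudonymize` and the sample records are all hypothetical.

```python
import hashlib
import hmac
from collections import Counter

# Hypothetical key; in practice it would be stored outside the cluster.
SECRET_KEY = b"replace-with-a-secret-kept-outside-the-cluster"


def pseudonymize(value, key=SECRET_KEY):
    """Map a sensitive value to a fixed, irreversible token (HMAC-SHA256)."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]  # shortened for readability


# Made-up sample records: (Social Security number, purchase amount).
purchases = [
    ("123-45-6789", 20.0),
    ("987-65-4321", 5.0),
    ("123-45-6789", 12.5),
]

# Replace the identifier, keep the measure the analysts need.
masked = [(pseudonymize(ssn), amount) for ssn, amount in purchases]

# Purchases per customer can still be counted without seeing any SSN.
counts = Counter(token for token, _ in masked)
```

Anyone holding the key could recompute tokens for guessed inputs, so this sketch is only as strong as the key's protection.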

The massive data projects are a challenge because it is increasingly difficult to find engineers with the right skills to carry out infrastructure management and data migration, said Cloudwick's Schreiber. The systems integrator conducts a thorough three-month training program for its new hires in building out, operating and managing Hadoop implementations with petabytes of data, he said.

"We can't find enough people off the street with the experience and knowledge in big data," Schreiber said.