
In-Place Data Analytics For Unstructured Data is No Longer Science Fiction

By John Patzakis

AI-driven analytics supercharges compliance investigations, data security, privacy audits and eDiscovery document review. Machine learning employs mathematical models that assess enormous datasets and “learn” from feedback and exposure, gaining deep insights into key information. This enables the identification of discrete and hidden patterns across millions of emails and other electronic files, categorizing and clustering documents by concept, content, or topic. The process goes beyond keyword searching to identify anomalies, internal threats, or other indicators of relevant behavior. The enormous volume and scope of corporate data being generated have created numerous opportunities for investigators seeking deep information insights in support of internal compliance, civil litigation and regulatory matters.
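To make the idea of concept clustering a bit more concrete, here is a minimal sketch that groups a handful of short documents by topical similarity rather than by an exact keyword match. It assumes scikit-learn is available; the sample documents and the choice of two clusters are illustrative only and are not drawn from any particular X1 workflow.

```python
# Minimal concept-clustering sketch: group documents by topical similarity
# rather than exact keyword matches. The sample documents are illustrative.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Quarterly revenue forecast and budget approval",
    "Budget review meeting rescheduled to Friday",
    "Wire the payment to the offshore account",
    "Route the payment through the new intermediary account",
]

# Represent each document by weighted term frequencies (TF-IDF).
vectors = TfidfVectorizer(stop_words="english").fit_transform(documents)

# Partition the documents into two topical clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, doc in zip(labels, documents):
    print(f"cluster {label}: {doc}")
```

In a real matter the vectors would come from far richer features than four toy sentences, but the principle of grouping by content similarity rather than shared keywords is the same.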

The most effective uses of AI in investigations couple continuous active learning technology with concept clustering to discover the most relevant data in documents, emails, text and other sources. As the AI continues to learn and improve over time, the benefits of an effectively implemented approach also increase. In-house and outside counsel and compliance teams now rely on AI technology not only in response to government investigations, but also, increasingly, to identify risks before they escalate to that stage.
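For readers unfamiliar with how continuous active learning works mechanically, the sketch below shows the basic loop: train a model on a small reviewed seed set, score the unreviewed pool, send the highest-scoring documents to reviewers, and retrain as the new judgments come back. It assumes NumPy and scikit-learn; the feature vectors are random stand-ins for real document features, and the reviewer step is stubbed with random labels purely for illustration.

```python
# Simplified continuous active learning (CAL) loop. All data here is synthetic;
# in practice the vectors would be document features and the labels would come
# from human reviewers.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
pool = rng.normal(size=(1000, 50))        # unreviewed documents as feature vectors
reviewed_X = rng.normal(size=(20, 50))    # seed set already reviewed
reviewed_y = np.array([0, 1] * 10)        # seed judgments: 1 = relevant, 0 = not

for round_number in range(3):
    # Retrain on everything reviewed so far.
    model = LogisticRegression(max_iter=1000).fit(reviewed_X, reviewed_y)
    scores = model.predict_proba(pool)[:, 1]           # predicted relevance
    batch = np.argsort(scores)[::-1][:10]              # top-ranked docs go to reviewers
    new_labels = rng.integers(0, 2, size=batch.size)   # stand-in for reviewer decisions
    reviewed_X = np.vstack([reviewed_X, pool[batch]])
    reviewed_y = np.concatenate([reviewed_y, new_labels])
    pool = np.delete(pool, batch, axis=0)              # reviewed docs leave the pool
```

Each pass through the loop folds reviewer feedback back into the model, which is why an effectively implemented approach keeps improving over time.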


However, logistical and cost barriers have traditionally prevented organizations from taking advantage of AI on a systematic and proactive basis, especially for unstructured data, which, according to industry studies, constitutes 80 percent or more of all data (and data risk) in the enterprise. Before analytics engines can ingest the text of documents and emails, that text must be “mined” from the native originals, and the natives must first be collected and migrated to a centralized processing appliance. This arduous process is expensive and time consuming, particularly for unstructured data, which must be collected from the “wild” and then migrated to a central location, creating a stand-alone “data lake.”

Due to these limitations, otherwise effective AI capabilities are typically utilized only on very large matters, on a reactive basis that limits their benefits to the investigation at hand and the information within the captive data lake. Ongoing active learning is thus not generally applied across multiple matters or utilized proactively. And because that captive information consists of migrated copies of the originals, there is very limited ability to act on data insights, since the original data remains in its actual location in the enterprise.

The ideal architecture for the enterprise, then, is to move the data analytics “upstream,” where all the unstructured data resides. This would not only save up to millions of dollars per year in investigation, data audit and eDiscovery costs, but would also enable proactive use for compliance auditing, security and policy breach detection, and internal fraud detection. However, analytics engines require considerable computing resources, with the leading AI solutions typically requiring tens of thousands of dollars’ worth of high-end hardware for a single server instance. Those computing workloads simply cannot be forward deployed to the laptops and file servers where the bulk of unstructured data, and its associated enterprise risk, exists.

An alternative architecture solves this problem: a process that extracts text from unstructured, distributed data in place and systematically sends it, at massive scale, to the analytics platform along with the associated metadata and a globally unique identifier for each item. As mentioned, one of the many challenges with traditional workflows is the massive data transfer associated with ongoing migration of electronic files and emails, the latter of which must be sent in whole containers such as PST files. That process alone can take weeks, choke network bandwidth and disrupt operations. By contrast, the payload for text and metadata alone is less than 1 percent of the size of the full native item, which makes the possibilities here very compelling. This architecture enables scalable, proactive compliance, information security, and information governance use cases. Uploads to AI engines take hours instead of weeks, enabling continual machine learning that improves processes and accuracy over time and allowing immediate action to be taken on identified threats or otherwise relevant information.
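As a rough illustration of the “extract in place, ship only text and metadata” pattern described above, the sketch below reads a file where it sits, builds a small payload containing a content-derived identifier, basic metadata and the extracted text, and posts it to an analytics endpoint. The endpoint URL, payload schema and naive plain-text extraction are all hypothetical assumptions for illustration; they are not X1’s actual interfaces, and a real deployment would need proper extraction for binary formats such as PDF, Office documents and email containers.

```python
# Illustrative sketch only: ship extracted text and metadata, never the native file.
import hashlib
import json
import urllib.request
from pathlib import Path

ANALYTICS_ENDPOINT = "https://analytics.example.com/ingest"  # hypothetical endpoint


def build_payload(path: Path) -> dict:
    """Extract text and metadata from a file without moving the native."""
    stat = path.stat()
    return {
        "global_id": hashlib.sha256(path.read_bytes()).hexdigest(),  # stable unique ID
        "source_path": str(path),          # the original stays in place
        "size_bytes": stat.st_size,
        "modified": stat.st_mtime,
        "extracted_text": path.read_text(errors="ignore"),  # naive extraction for plain text
    }


def ship(path: Path) -> None:
    """Send only the text/metadata payload, a tiny fraction of the native's size."""
    body = json.dumps(build_payload(path)).encode("utf-8")
    request = urllib.request.Request(
        ANALYTICS_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    urllib.request.urlopen(request)  # a real pipeline would batch, retry and authenticate


if __name__ == "__main__":
    for file_path in Path("/data/custodian_share").rglob("*.txt"):  # hypothetical location
        ship(file_path)
```

Because only the text and metadata travel over the wire, the transfer per item is a small fraction of the native file, which is what makes hours-instead-of-weeks uploads plausible.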

The only solution that we are aware of that fulfills this vision is X1 Distributed GRC. X1’s unique distributed architecture upends the traditional collection process by indexing at the distributed endpoints, enabling a direct pipeline of extracted text to the analytics platform. This innovative technology and workflow results in far faster and more precise collections and a more informed strategy in any matter.

Deployed at each endpoint or centrally in virtualized environments, X1 Enterprise allows practitioners to query many thousands of devices simultaneously, apply analytics before collecting, and process data while collecting it directly into myriad review and analytics applications such as RelativityOne and Brainspace. X1 Enterprise empowers corporate eDiscovery, compliance, investigative, cybersecurity and privacy staff to find, analyze, collect and/or delete virtually any piece of unstructured user data wherever it resides, instantly and iteratively, all in a legally defensible fashion.

X1 demonstrated these capabilities with ComplianceDS in a recent webinar, which featured a brief but substantive demo of our X1 Distributed GRC solution and emphasized our innovative support of analytics engines through the game-changing ability to extract text in place and feed it directly into AI solutions.

Here is a link to the recording, with a direct link to the 5-minute demo portion.



Want Legal to Add A LOT More Value? Stop Over-Collecting Data


The 2019 CLOC (Corporate Legal Operations Consortium) Conference ended last week, and by all accounts it was another great event for an organization that continues to gain relevance and momentum.  A story in Thursday’s Legaltech News entitled “Why E-discovery Savings Is About Department Value for Corporate Legal” summarized a CLOC session focused on “streamlining e-discovery and information governance inside corporate legal departments.”  At the risk of sounding biased, that seems like a perfect topic to me.

The article’s conclusions from the panel session, namely adding value by wresting control of eDiscovery from outside counsel, consolidating hosting vendors and creating a “living data map”, were all spot on and certainly useful.  One way for legal to add enormous value, however, was NOT discussed: collecting far less data as part of the eDiscovery, investigatory and compliance processes.

As we highlighted on an insightful webinar with our partner Compliance Discovery Solutions last Tuesday (which can be viewed here), the way most eDiscovery practitioners conduct ESI collection remains remarkably unchanged from a decade ago, as illustrated in the infographic below: consult a data map, image entire drives from each and every custodian (e.g. with EnCase), load those many images into a processing application (e.g. Nuix), process the huge volume of data (most of which is entirely irrelevant), and only then move the now-processed data into a review application (e.g. Relativity).

[Infographic: the legacy eDiscovery collection workflow]

This legacy collection process for GRC (Governance, Risk & Compliance) and eDiscovery is wildly inefficient, disruptive to the business and costly, yet many if not most practitioners still use it, most likely because it’s the status quo and change is always hard in the legal technology world.  But change here is a must, as this “image everything → then process it all → and only then begin reviewing” workflow causes myriad issues not just for legal but for the company as well:

  • Increases eDiscovery costs exponentially. The still-seminal Rand study on eDiscovery pegged the overall cost of identification through production at $1,800 per GB.  While some elements of this price have come down in the intervening 6-7 years, especially processing and hosting rates, data volumes and variety have grown by at least as much, negating those reductions.  Imaging entire drives by definition collects far more data than could ever be relevant in any given matter – and the costs of this over-collection multiply at every step thereafter, forcing clients to pay hundreds of thousands if not millions of dollars more than they should (a rough worked example follows this list).
  • Is extremely disruptive to employees. Forensically imaging a drive usually requires gaining physical access to the laptop or desktop for some period of time, often a day or two.  Put yourself in each of those employees’ shoes: even if you are given a “loaner” machine, you still don’t have all of your local information, settings, bookmarks, etc. – a major disruption to your work day and therefore a significant drag on productivity.
  • Takes far too long. Because forensic imaging requires physical access to a device, each custodian’s machine must be dealt with individually.  In many collections, custodians are spread across multiple offices, on vacation, or working remotely, which often extends the process to many weeks if not months.  All of this time, lawyers are unable to access this critical data (e.g. to begin formulating case strategy, negotiating with opposing counsel or a regulator, etc.).
  • Creates unnecessary copies of data that could otherwise be remediated. An often-overlooked byproduct of over-collection is that it creates another copy of data that is outside of most (if not all) data remediation programs.  For companies that are regulated and/or encounter litigation regularly, this becomes a major headache and undermines data governance and remediation programs.
  • Forces counsel to “fly blind” for months. Every day that IT and legal teams spend forensically imaging each custodian’s drive, then processing it, and only then loading it into a review or analysis application is a day that in-house and outside counsel are flying blind – unable to look at key data to begin constructing case strategy, conduct informed interviews, negotiate with opposing counsel (e.g. on the scope of a matter, including discovery) or interact with regulators.  This is incredibly valuable time lost with nothing received in return.
  • Using forensic tools for non-forensic processes is unnecessary overkill. The irony of this “image everything” approach is that it is extreme overkill: it is like a doctor whose only procedure for removing a mole is to cut off the arm.  Forensic imaging can still be used on a one-off basis in the narrow circumstances where there are concerns about possible spoliation of evidence, but in the vast majority of matters a forensic image is completely unnecessary.
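To put rough numbers on the over-collection point above, here is a back-of-the-envelope comparison using the $1,800-per-GB figure from the Rand study cited in the first bullet. The custodian count and per-custodian volumes are illustrative assumptions only, and in practice not every collected gigabyte incurs the full end-to-end rate, but the multiplier effect of imaging entire drives is the point.

```python
# Rough cost comparison; only the $1,800/GB figure comes from the Rand study
# cited above, and the volumes below are illustrative assumptions.
COST_PER_GB = 1800        # USD, identification through production (Rand study)
CUSTODIANS = 50
FULL_IMAGE_GB = 120       # assumed full drive image per custodian
TARGETED_GB = 4           # assumed volume after targeted, in-place collection

full_cost = CUSTODIANS * FULL_IMAGE_GB * COST_PER_GB      # $10,800,000
targeted_cost = CUSTODIANS * TARGETED_GB * COST_PER_GB    # $360,000

print(f"Image-everything approach: ${full_cost:,}")
print(f"Targeted collection:       ${targeted_cost:,}")
print(f"Difference:                ${full_cost - targeted_cost:,}")
```

Even if the per-GB rate were a fraction of the Rand figure, collecting thirty times more data than necessary dominates the total.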

As the recent CLOC conference in Las Vegas made clear, corporate legal operations teams are quite correctly focused on showing the value legal brings to the business.  There is still one fundamental change they need to make to how they handle the collection of ESI for eDiscovery, GRC and privacy purposes, however, that would be an enormous value-add to all parts of the company, including legal: ending the systematic over-collection of data.  How this can be done quickly and cost-effectively has been the subject of previous blog posts, and will be addressed in detail over the next few weeks as well.

