Introducing Tonic Textual: Redact and synthesize sensitive free-text data
Synthetic Data for Model Training
As a data synthesis company, we like to say that we've been doing generative AI since before it was cool. Today, we're taking it one step further by packaging our industry-leading data synthesis capabilities into a new platform built specifically to help you maximize your use of generative AI. And in typical Tonic.ai style, we’ve built it in a way that champions our commitment to protecting customer privacy.
Meet Tonic Textual, our new sensitive data redaction and synthesis offering that allows you to protect sensitive values in unstructured free-text data so that you can responsibly train large language models (LLMs) and other text-based machine learning (ML) models, and build and test performant data pipelines on your organization’s data.
Why do you need Tonic Textual? Let's say that you're putting together an LLM for automated customer support, or need a trained model to use for testing in development pipelines. You have a treasure trove of files that contain customer support conversations and case notes. They're a perfect source for your project, but they're chock-full of sensitive values that you cannot legally or ethically expose: names, addresses, credit card numbers, member IDs... the list goes on.
Tonic Textual solves this problem. It identifies the sensitive values in your free-text files, and then gives you versions of those files with those values redacted or replaced. You can work with Textual through the web application or through the Tonic Textual Python SDK.
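For a quick taste of the SDK side, here is a minimal sketch of redacting a single string. It assumes the tonic-textual Python package is installed and that you have an API key; the exact class and method names may differ in your installed version, so treat this as illustrative and check the Textual SDK documentation.

```python
# Minimal sketch: redact one string with the Textual Python SDK.
# Assumes the tonic-textual package and an API key; class and method names
# may vary by SDK version, so verify against the Textual SDK docs.
from tonic_textual.redact_api import TextualNer

# Point the client at your Textual instance (URL and key are placeholders).
textual = TextualNer("https://textual.tonic.ai", api_key="<your-api-key>")

response = textual.redact("My name is Michael and my card number is 4111 1111 1111 1111.")
print(response.redacted_text)
```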
In this post, we'll take a quick tour of the platform. You can also view more detailed demo videos of the platform and Tonic Textual SDK, or review the Tonic Textual documentation.
Identifying and scanning the files
In Tonic Textual, you create collections of files called datasets.
When you add files to a dataset, Textual automatically scans the files to look for different types of sensitive values.
To identify sensitive values, Textual uses a built-in set of trained models, each associated with a specific type of value, such as names, locations, or identifiers. You can also set up your own custom models, which extend Textual's detection to types of values that appear in your files but aren't covered by the built-in models.
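If you prefer to script this step, the SDK also exposes dataset operations. The sketch below assumes a create-and-upload flow with placeholder file paths; the dataset-level calls are assumptions about the SDK surface, so confirm the exact methods in the SDK reference.

```python
# Sketch: create a dataset and upload files so Textual can scan them.
# The create_dataset/add_file calls and file paths below are assumptions;
# consult the Textual SDK documentation for the exact methods in your version.
from tonic_textual.redact_api import TextualNer

textual = TextualNer("https://textual.tonic.ai", api_key="<your-api-key>")

# Create a dataset to hold the support transcripts.
dataset = textual.create_dataset("support-transcripts")

# Upload each local file; Textual scans uploaded files for sensitive values.
for path in ["transcripts/case_0001.txt", "transcripts/case_0002.txt"]:
    dataset.add_file(file_path=path)
```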
Reviewing the scan results
After Tonic Textual identifies the sensitive values, the next step is to check the results. Textual lists the types of values that it uncovered in the dataset files.
For each type, the list includes the number of values that Textual detected. You can also view a sampling of the detected values.
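You can do the same kind of review in code. Continuing with the client from the earlier sketch, the snippet below prints each detected entity from a redaction response; it assumes the response exposes a list of detections with label, text, and score fields, which may be named differently in your SDK version.

```python
# Sketch: inspect what Textual detected in a piece of text.
# The de_identify_results list and its label/text/score fields are assumptions
# about the response shape; check the SDK reference for your version.
response = textual.redact("Michael's card 4111 1111 1111 1111 was charged in Atlanta.")

for entity in response.de_identify_results:
    print(f"{entity.label:>20}  {entity.text!r}  (score={entity.score:.2f})")
```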
Configuring how to protect the values
For each value type, you tell Textual whether and how to protect the detected values.
By default, Textual redacts the values. This means that it replaces the value with a placeholder that indicates the entity type. For example, "Michael" becomes "NAME".
To maintain data realism and utility, you can also synthesize fake, contextually relevant values. Textual replaces the sensitive data entity with a realistic, safe version of the entity that was removed. For example, "Michael" becomes "John".
Or you can decide to leave the values as is. For example, you might know that none of the values for a given entity type are actually sensitive. You can also choose to leave specific values as is, such as values that are not sensitive or that were identified incorrectly.
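In the SDK, these choices map to a per-entity-type configuration. The sketch below, continuing with the client from earlier, assumes a generator_config-style parameter that accepts values such as "Redaction", "Synthesis", and "Off" per entity type; the parameter name and the entity-type keys are assumptions, so verify them against the SDK reference.

```python
# Sketch: choose redaction, synthesis, or pass-through per entity type.
# The generator_config parameter and the entity-type keys below are assumptions
# about the SDK surface; verify the exact names in the Textual SDK reference.
response = textual.redact(
    "Michael lives in Atlanta and his member ID is 88-1234.",
    generator_config={
        "NAME_GIVEN": "Synthesis",   # "Michael" becomes a realistic fake name
        "LOCATION": "Redaction",     # "Atlanta" becomes a placeholder token
        "NUMERIC_VALUE": "Off",      # leave these values as is
    },
)
print(response.redacted_text)
```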
Previewing and downloading the protected content
For each file, you can display a preview that shows both the original file and the version with the redacted and synthesized values.
When you're happy with the results, you can download the protected versions of the files to distribute and use.
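If you are scripting the flow end to end, you can also write the protected output wherever your pipeline expects it. The sketch below simply saves the redacted text from an earlier response to a local file (the path is a placeholder); retrieving whole processed dataset files through the SDK has its own calls, which are covered in the SDK documentation.

```python
# Sketch: persist the protected text for downstream use.
# The output path is a placeholder; redacted_text comes from the earlier response.
from pathlib import Path

Path("protected").mkdir(exist_ok=True)
Path("protected/case_0001.txt").write_text(response.redacted_text, encoding="utf-8")
```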
Recap
Tonic Textual is a sensitive data redaction platform that you can use to protect your unstructured free-text data.
Textual scans your free-text files for sensitive values based on built-in and custom trained models. You decide whether to redact, replace, or ignore each type of detected value. You then download the protected versions of the files.
With just a few clicks, you can embed Textual into your data and ML pipelines to get realistic, de-identified text data that’s safe to use for training models, for LLMOps, and for building data pipelines.
With Tonic Textual, you can safely leverage your text data and practice responsible AI development while staying compliant with evolving data handling regulations.
Connect with our team to learn more, or sign up for an account today.