Artificial Intelligence : Masking Sensitive Data

I’ll step away from data analytics for a moment in this series of posts on GenAI to think about Data Masking. As I commented in Artificial Intelligence : OpenAI : Data Use and Privacy, there are privacy and security concerns when sending sensitive data to an external provider (in this case a GenAI provider), and where the data structure and volume are small enough one option is masking.

I’ll write another post about applying masking in data analytics – but to step through the process incrementally it’s instructive to think about masking within a single written document as it helps to identify what sort of masking is possible (and what isn’t) and how to apply it.

The key principle of Data Masking is to replace sensitive data with a randomised placeholder (one which should be retained by the external provider and returned intact so that it can be swapped back for the original data).
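
As a minimal sketch of that round trip (an illustration only, assuming a simple one-to-one lookup table; the function names mask_text and unmask_text and the substitute name “Michael” are hypothetical and not taken from any library), the substitution and its reversal might look like this in Python:

```python
import re

def _swap(text, lookup):
    """Replace every key of lookup found in text with its value, in one pass."""
    if not lookup:
        return text
    # Longest keys first so a full name wins over a first name on its own.
    keys = sorted(lookup, key=len, reverse=True)
    # The lookarounds stop "Andrew" from matching inside "Andrews".
    pattern = re.compile(
        r"(?<!\w)(?:" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: lookup[m.group(0)], text)

def mask_text(text, mapping):
    """Swap sensitive values for their substitutes before sending the text out."""
    return _swap(text, mapping)

def unmask_text(text, mapping):
    """Swap substitutes back for the original values in the returned text."""
    return _swap(text, {sub: orig for orig, sub in mapping.items()})

# Hypothetical lookup table: original value -> substitute value.
mapping = {"Andrew": "Michael"}

masked = mask_text("Hello Andrew", mapping)
restored = unmask_text("Hello Michael, welcome back", mapping)
```

Doing the replacement in a single pass avoids one substitution accidentally rewriting the output of another. The reversal only works if the provider returns the substitutes unchanged.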

To work well and avoid unexpected side-effects substitute values should match semantically (and also by type of data). Some examples:

  • People’s Names
    If possible a name should match a person’s characteristics (eg. gender, cultural origin, age cohort). This should be done with care as large language model bias is a significant concern; when setting this masking up it would be sensible to apply each type of characteristic matching incrementally and with testing/monitoring in place.

    Names need to look like names. For example (albeit not a very sensible one), substituting the name “Andrew” with “is it me you’re looking for” is likely to create strange results when the resulting sentence is “Hello is it me you’re looking for”…
  • Product Names
    If names of your products are something you want to mask then having a one-to-one lookup list of plausibly similar (but not real) product names may be needed.
    This is going to be more important if your product names have a known meaning (eg. “banana”), but even if your product names are all made-up words you may find that randomly selected replacements accidentally have meaning (again, eg. “banana”).
  • Numbers
    How numbers are substituted very much depends on what sort of numbers they are:
    • IDs (eg. credit card numbers, personal IDs) clearly need to be protected but as they are effectively meaningless strings you could use a placeholder like “XZYInsertIdHere” (tested to ensure it does not cause some unexpected LLM interpretation).
    • numbers with numerical meaning (eg. salaries, values, counts) may need to be masked even though other identifying information is removed. The risk here is that masking with meaningless numerical values also means the LLM output is likely to be affected. An example would be replacing the value of a banana at AU$1 with AU$54321.
      An option when I return to the data analytics approach is to vary values by some random capped percentage – thus masking real values while still allowing aggregate calculations to return approximate but still realistic values.
      In this example if it’s possible to leave numbers alone without compromising sensitive data that is probably the best approach.
  • Personal details
    Details like a home or business address also need to be protected. Similar to people’s names, addresses should be substituted with a geographically similar location (eg. changing someone’s address from Adelaide, Australia to ???, ?? is likely to cause unintended results; on the other hand, moving from one suburb of Adelaide to another is unlikely to cause the same problem).
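
The capped-percentage idea for numeric values mentioned in the list above could be sketched as follows. This is only an illustration: the function name jitter, the 10% cap and the sample salaries are all assumptions made for the example, and unlike a lookup-based substitution the original values cannot be recovered from the masked ones.

```python
import random

def jitter(value, max_pct=0.10, rng=random):
    """Shift value by a random percentage no larger than max_pct either way."""
    factor = 1 + rng.uniform(-max_pct, max_pct)
    return round(value * factor, 2)

# Seeded generator so the sketch is repeatable; real masking would not fix the seed.
rng = random.Random(42)

salaries = [52000, 61000, 75500]  # hypothetical values
masked = [jitter(s, 0.10, rng) for s in salaries]

# Every masked value stays within 10% of its original, so totals and
# averages computed from the masked data remain approximately realistic.
largest_shift = max(abs(m - s) / s for m, s in zip(masked, salaries))
```

Because each value moves independently, an aggregate such as the total can be off by at most the same cap, and in practice the individual shifts tend to partly cancel out.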

Example Input

I’ll try a simple example input and send it to both OpenAI ChatGPT and Google Bard.

Imagine the following LLM prompt:

Given the following client details:

Name: Celeste Brookes

Age: 43

Pronouns: They/Them

Address: 45 Bryant St, Brunswick VIC 3056

Current product subscriptions: Book of the week, Newly published books, Living with dyslexia

Please write a brief promotional letter advertising these new subscriptions: eBook of the week, Puzzle of the week

We might mask this as follows – noting the masked values have:

  • a culturally similar name with the same ‘name’ gender
  • a similar (but changed) age
  • a nearby address
  • I did not mask generic products like “Book of the week” but as “Living with dyslexia” is a more sensitive/medical topic I chose to substitute that name. Revealing anything medical might not be ideal so another option might be to simply remove/nullify product names that are sensitive if this information is not important to the output.
  • I also did not mask the new product names – again if there was something sensitive here masking might be a good idea (where nullifying the new products probably wouldn’t work as we’re trying to promote them!)

Given the following client details:
Name: Mary Taylor
Age: 40
Pronouns: They/Them
Address: 80 Gregory St, Brunswick VIC 3056
Current product subscriptions: Book of the week, Newly published books, Living with a medical condition

Please write a brief promotional letter advertising these new subscriptions: eBook of the week, Puzzle of the week

ChatGPT (GPT-4) Response

A few observations:

  • The masked name, address and medically sensitive product names all appear clearly and could be substituted back.
  • ChatGPT has generously decided to offer a 20% discount on my behalf: that’s a cautionary tale about validating output if ever there was one!

[Your Company Letterhead]

80 Gregory St, Brunswick VIC 3056 September 9, 2023

Dear Mary Taylor,

We hope this letter finds you well. At [Your Company Name], we always strive to ensure our dedicated clients, like you, are kept up-to-date with our latest offerings that may cater to their preferences and needs. Recognizing your current interests in the “Book of the week”, “Newly published books”, and “Living with a medical condition” subscriptions, we believe you might find our new subscription options appealing.

eBook of the Week Are you ever on the go and wish you had your favorite books right at your fingertips? Our brand-new “eBook of the Week” subscription might just be what you’re looking for. Dive into a digital reading experience where each week, we carefully select an eBook that aligns with your tastes and current reading trends. It’s a modern touch to our traditional “Book of the Week” offering.

Puzzle of the Week For those moments when you want a delightful break from reading, we have introduced the “Puzzle of the Week” subscription. It promises to stimulate your brain and entertain with a mix of crosswords, Sudoku, logic puzzles, and more. Tailored to varying difficulty levels, it can offer a pleasant diversion and sharpen your cognitive skills.

Given your rich history of subscriptions with us, Mary, we genuinely believe these new additions would complement your current selections wonderfully. We’re extending an exclusive offer: Sign up for either of these subscriptions within the next two weeks and enjoy a 20% discount for the first three months!

If you have any questions or would like further details about these subscriptions, please don’t hesitate to reach out. Our team is always here to support and ensure you have the best reading and leisure experience.

Thank you for being a cherished part of the [Your Company Name] family. We appreciate the trust you place in us and look forward to continuing to serve you with exciting content that resonates with your interests.

Warm regards,

[Your Name] [Your Position] [Your Company Name] [Company Contact Details]

P.S. Remember, the special 20% discount is only available for the next two weeks. Don’t miss out!

Google Bard Response

Some observations:

  • Similarly to ChatGPT the masked values appear and could be re-substituted.
  • Continuing the cautionary tale, Bard has both decided to offer a discount (a whole free month with a generated WELCOMEMARY code which never existed) and made up a telephone number for my organisation.

Dear Mary Taylor,

I hope this letter finds you well.

I’m writing to you today to let you know about two new subscription options that we think you might be interested in: the eBook of the week and the Puzzle of the week.

The eBook of the week is a great way to discover new books and authors. Each week, we’ll send you a free eBook that’s perfect for your reading taste. You can choose from a variety of genres, including fiction, non-fiction, and children’s books.

The Puzzle of the week is a fun and challenging way to keep your mind sharp. Each week, we’ll send you a new puzzle to solve, such as a crossword puzzle, a sudoku puzzle, or a word search.

We think you’d be a great fit for both of these subscriptions. You’re already a subscriber to our Book of the week and Newly published books subscriptions, so we know you enjoy reading. And the Puzzle of the week would be a great way to add some fun and challenge to your day.

To sign up for either of these subscriptions, simply visit our website and click on the “Subscribe” button. You can also call us at 1-800-555-1212.

As a special thank you for being a loyal customer, we’re offering you a free month of both subscriptions. Just use the code WELCOMEMARY at checkout.

We hope you’ll give these subscriptions a try. We think you’ll love them!

Sincerely,

The Subscriptions Team
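
Before substituting the original values back into either letter, it may be worth checking which substitute values actually came back. The sketch below is one hedged way to do that (the function names count_whole and substitute_report are hypothetical). A count of zero is not necessarily an error, since the model may simply not have used that detail, but a first name appearing on its own, as in “with us, Mary,” in the ChatGPT letter, would be missed by a lookup that only holds the full name.

```python
import re

def count_whole(text, phrase):
    """Count occurrences of phrase that are not embedded in a longer word."""
    return len(re.findall(r"(?<!\w)" + re.escape(phrase) + r"(?!\w)", text))

def substitute_report(response, substitutes):
    """Map each substitute value to the number of times the response uses it."""
    return {s: count_whole(response, s) for s in substitutes}

# Two sentences taken from the ChatGPT letter above.
excerpt = (
    "Dear Mary Taylor,\n"
    "Given your rich history of subscriptions with us, Mary, we genuinely "
    "believe these new additions would complement your current selections "
    "wonderfully."
)

report = substitute_report(excerpt, ["Mary Taylor", "80 Gregory St"])

# The first name also appears by itself, which a full-name lookup would miss.
first_name_alone = count_whole(excerpt, "Mary") - count_whole(excerpt, "Mary Taylor")
```

One way to handle the first-name case would be to add the first name on its own to the lookup table as a separate entry.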

Closing Thoughts

In Artificial Intelligence : OpenAI : Analytics, Open Data and A Few Simple Prompts I noted that one of the nice things about the data analytics feature in ChatGPT is that it generates working Python code that can be reviewed / audited and then applied / tested with different data sets. While the textual analysis generated by the LLM still needs to be considered, it has the scaffolding of code examples to help detect incorrect, biased or unexpected content in the results.

Long-hand prose is clearly a more complex space in terms of the sorts of results being generated, and the random creation of discounts, discount codes and phone numbers in the above examples shows why careful review of LLM output is required (and not only for concerns such as toxicity and bias).
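
Part of that review can be automated. As a rough sketch (the patterns and the function name flag_unprompted are assumptions for illustration, and a check like this supplements rather than replaces a human read-through), the following flags percentages, phone-like numbers and all-caps codes that appear in a response but not in the prompt:

```python
import re

# Hypothetical patterns for details an LLM might invent.
PATTERNS = {
    "percentage": r"\d+(?:\.\d+)?%",
    "phone number": r"\b\d[\d-]{7,}\d\b",
    "all-caps code": r"\b[A-Z]{6,}\b",
}

def flag_unprompted(prompt, response):
    """List (label, text) pairs found in the response but absent from the prompt."""
    findings = []
    for label, pattern in PATTERNS.items():
        for match in sorted(set(re.findall(pattern, response))):
            if match not in prompt:
                findings.append((label, match))
    return findings

prompt = (
    "Please write a brief promotional letter advertising these new "
    "subscriptions: eBook of the week, Puzzle of the week"
)
# Sentences taken from the two letters above.
response = (
    "You can also call us at 1-800-555-1212. "
    "Just use the code WELCOMEMARY at checkout. "
    "Remember, the special 20% discount is only available for the next two weeks."
)

findings = flag_unprompted(prompt, response)
```

An offer worded without digits, such as Bard’s free month, would slip past these patterns, which is one reason the flagged list is a prompt for review rather than a pass/fail test.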
