This begins a series of articles, code snippets and thoughts on using Generative Artificial Intelligence (GenAI) for data analytics, starting out primarily with OpenAI.
This is not a broader discussion of uses of GenAI (though they're kinda fun as well, and I'll probably write about chat- and API-style usage in the future).
In particular I'm looking at OpenAI's ChatGPT Plus, which (as of the date of writing) is the paid subscription tier of ChatGPT and allows access to additional functionality:
- use of the GPT-4 model
- data analysis (allowing uploads of data files for analysis)
- plugins (allowing external providers to be called from within the ChatGPT dialogue)
Data Privacy
First things first: if you're about to upload data and dialogue to OpenAI, you should know how that data will be stored and used.
- As of 1st March 2023, if you're using the OpenAI API (also see the docs), OpenAI state that "… data sent to the OpenAI API will not be used to train or improve OpenAI models (unless you explicitly opt in)."
- Similarly, if you are using ChatGPT Enterprise, OpenAI again state that "Customer prompts or data are not used for training models".
- HOWEVER, if you are using ChatGPT (free) or ChatGPT Plus, then by default OpenAI state: "When you use our non-API consumer services ChatGPT or DALL-E, we may use the data you provide us to improve our models." (see "How your data is used to improve model performance")
- It is possible to "turn off chat history and model training" in the ChatGPT settings; however, this means your conversations will no longer be saved for you to refer back to, so it may not be desirable.
- It’s also possible to submit a “User Content Opt Out Request” form (linked from the above page).
At the end of the day, if you own (or have sufficient authority over) the data and you don't mind it being ingested and used by OpenAI, then there is no issue.
But if there is any Personally Identifiable Information (PII), private/proprietary data or otherwise sensitive information, you need to think very carefully about what (if any) of that data should be sent to OpenAI.
Depending on the situation and the data, a simple risk analysis might help. For example, it might look like this:
| Scale / Sensitivity | Open/Public Data | Easily Masked Data | Complex Sensitive Data | PII |
| --- | --- | --- | --- | --- |
| Small | Low | Low | High | Very High |
| Medium | Low | Medium | High | Very High |
| Large | Low | Medium | Very High | Very High |
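If it helps to make that matrix executable, here's a trivial Python lookup built directly from the table above (the key names are just my shorthand for the table's row and column headings):

```python
# The risk matrix above, expressed as a simple nested-dict lookup.
# Keys are shorthand for the table's row/column headings.
RISK = {
    "small":  {"open": "Low", "masked": "Low",    "complex": "High",      "pii": "Very High"},
    "medium": {"open": "Low", "masked": "Medium", "complex": "High",      "pii": "Very High"},
    "large":  {"open": "Low", "masked": "Medium", "complex": "Very High", "pii": "Very High"},
}

def risk_level(scale: str, sensitivity: str) -> str:
    """Look up the risk level for a given data scale and sensitivity class."""
    return RISK[scale.lower()][sensitivity.lower()]

print(risk_level("Medium", "complex"))  # -> High
```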
The good news is that there are a number of mitigations available – even where data is at the High / Very High risk level:
- Where the scale or complexity of the data is not too great, Masking can be used: labels or actual values are modified to remove external meaning before upload, and the renaming is reversed locally once responses are returned from OpenAI (see the masking sketch after this list).
- Where the scale or complexity of the data is too great (or where PII cannot be safely masked), mock data can be constructed to mimic the live data without revealing any sensitive information. The mock data can be fed into OpenAI and the responses used to plan analysis run locally on the real data, including (for data analysis) reusing the actual source code generated by OpenAI (see the mock-data sketch after this list).
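To illustrate the masking approach, here is a minimal Python/pandas sketch. The column names, the `col_N`/`val_N` token schemes and the sample records are all made up for illustration; the essential point is that the forward mapping stays local, so OpenAI only ever sees neutral tokens and results can be translated back afterwards.

```python
import pandas as pd

def mask_columns(df, sensitive):
    """Rename sensitive columns to neutral tokens; return masked frame + reverse map."""
    forward = {col: f"col_{i}" for i, col in enumerate(sensitive)}
    reverse = {token: col for col, token in forward.items()}
    return df.rename(columns=forward), reverse

def mask_values(series):
    """Replace each distinct value with an opaque token; return masked series + reverse map."""
    forward = {v: f"val_{i}" for i, v in enumerate(series.unique())}
    reverse = {token: v for v, token in forward.items()}
    return series.map(forward), reverse

# Made-up example records -- not real data
df = pd.DataFrame({
    "customer_name": ["Alice", "Bob", "Alice"],
    "monthly_spend": [120.0, 80.0, 95.5],
})
df["customer_name"], name_map = mask_values(df["customer_name"])
masked, column_map = mask_columns(df, ["customer_name", "monthly_spend"])

print(masked)  # only tokens -- this is what gets uploaded
# ...after responses come back referencing col_0 / val_1 etc....
restored = masked.rename(columns=column_map)
restored["customer_name"] = restored["customer_name"].map(name_map)
print(restored)  # back to the original labels and values
```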
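And a sketch of the mock-data approach, again with an entirely invented schema (`customer_id`, `region` and `monthly_spend` are placeholders): generate synthetic rows that match the shape and types of the live data, upload only those, then run whatever code ChatGPT produces locally against the real file.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # fixed seed so the mock data is reproducible
n = 100

# Synthetic rows mimicking the live schema -- no real values anywhere
mock = pd.DataFrame({
    "customer_id": np.arange(1, n + 1),
    "region": rng.choice(["north", "south", "east", "west"], size=n),
    "monthly_spend": rng.normal(loc=100, scale=25, size=n).round(2),
})

mock.to_csv("mock_customers.csv", index=False)
# Upload mock_customers.csv to ChatGPT's data analysis, then take the source
# code it generates and run it locally against the real dataset.
```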