PII Scanning Software: How to Find Personal Data in Your Systems
Practical guide to PII scanning software. How to find personal data across your files, databases, email, and cloud storage for DSAR compliance.
Last updated: 2026-02-07
What Is PII Scanning Software?
PII scanning software searches your files, databases, email, and cloud storage for personally identifiable information — names, email addresses, Social Security numbers, credit card numbers, phone numbers, addresses, dates of birth, and other data that can identify a specific person.
Disclaimer: This article is for informational purposes only and does not constitute legal advice. Privacy regulations are complex and change frequently. You should consult a qualified attorney for guidance specific to your business. References to specific software products or services do not constitute endorsements. The regulatory context discussed here is based on the GDPR (Regulation (EU) 2016/679), the CCPA (Cal. Civ. Code §§ 1798.100–1798.199.100), and related regulations, as of the date of publication.
Think of it as a specialized search engine. Instead of searching for keywords you specify, it searches for patterns that match known types of personal data. A Social Security number has a recognizable format (XXX-XX-XXXX). So do credit card numbers, email addresses, and phone numbers. PII scanners look for these patterns across your systems and report what they find and where they found it.
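As a rough illustration of pattern-based detection, here is a minimal Python sketch. The three regular expressions and the sample sentence are simplified assumptions chosen for readability; production scanners use stricter patterns plus validation and context checks.

```python
import re

# Illustrative patterns only; real scanners use stricter rules plus validation.
PII_PATTERNS = {
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_phone": re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
    ),
}


def find_pii(text):
    """Return (pii_type, matched_text, start_offset) tuples, ordered by offset."""
    findings = []
    for pii_type, pattern in PII_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((pii_type, match.group(), match.start()))
    return sorted(findings, key=lambda finding: finding[2])


sample = "Contact Jane at jane.doe@example.com or (555) 010-4477. SSN: 123-45-6789."
for pii_type, value, offset in find_pii(sample):
    print(f"{pii_type:8} offset={offset:<3} {value}")
```

Note that the name "Jane" is not flagged: names have no fixed format, which is why tools that detect them rely on techniques beyond pattern matching.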
The "why" is straightforward: you cannot protect data you do not know you have, and you cannot respond to a data subject access request (DSAR) if you do not know where to look.
When You Need a PII Scanner
Not every business needs dedicated PII scanning software. Here are the situations where it genuinely helps.
Responding to DSARs
When someone submits a DSAR, you are legally obligated to find all their personal data across all your systems. For a small business with a handful of well-organized systems, a manual search with a checklist works. But when data is scattered across file shares, old spreadsheets, email archives, and cloud storage, a PII scanner can catch data that a manual search would miss.
The spreadsheet buried in a subfolder from three years ago with customer phone numbers in it? A PII scanner finds that. A manual search probably does not.
Breach Assessment
If you experience a data breach, the first question is: "what personal data was exposed?" A PII scanner can quickly analyze the affected systems or files to identify what types of PII were compromised. This is critical because breach notification requirements depend on the nature and scope of the exposed data. GDPR Article 33 requires notifying the supervisory authority without undue delay and, where feasible, within 72 hours of becoming aware of the breach. Under the CCPA, breaches resulting from a failure to maintain reasonable security measures can trigger a private right of action (Cal. Civ. Code § 1798.150), with statutory damages of $100 to $750 per consumer per incident.
Data Mapping and Inventory
Privacy regulations require you to know what personal data you hold and where — GDPR Article 30 mandates that controllers maintain records of processing activities. A PII scan can validate your data map by finding personal data in places you did not expect — the shared drive nobody uses anymore, the backup files from a migration, the test database with real customer data.
Data Minimization
You are supposed to keep only the personal data you need for legitimate purposes, and only for as long as you need it. These principles are embedded in GDPR Article 5(1)(c) (data minimisation) and Article 5(1)(e) (storage limitation), and reinforced by the CCPA's emphasis on disclosing the categories of data collected (Cal. Civ. Code § 1798.100). PII scanning can reveal data you have been hoarding without realizing it: old customer records that should have been deleted, personal data in log files, test environments with production data.
Pre-Migration or Decommissioning
Before migrating to a new system or decommissioning an old one, scanning for PII helps ensure you do not leave personal data behind in a system that will no longer be maintained or secured.
Types of PII Scanners
Different scanners are designed for different environments. Here is how they break down.
File System Scanners
These scan files on your computers, servers, and network drives. They open documents (Word, Excel, PDF, text files, CSVs, and sometimes images using OCR), read the content, and flag PII patterns.
What they are good at: Finding PII in unstructured data — the random spreadsheets, documents, and files scattered across your file storage. This is where "shadow data" (personal data that nobody actively manages) tends to hide.
Limitations: They need access to the files, which means they need to be installed on or connected to the machines where files are stored. Performance can be slow when scanning large volumes of files. They typically do not scan inside proprietary file formats or encrypted files.
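To make the idea concrete, here is a minimal sketch of a file system scan in Python. It assumes only plain-text file types and two simplified patterns; the suffix list, file names, and function name are hypothetical choices for illustration, and formats such as DOCX, PDF, or images would need a text extraction step first.

```python
import re
import tempfile
from pathlib import Path

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
EMAIL_RE = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
TEXT_SUFFIXES = {".txt", ".csv", ".log", ".json", ".md"}  # assumed plain-text types


def scan_directory(root):
    """Walk root and return {file_path: {"ssn": n, "email": n}} for files with hits."""
    report = {}
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.suffix.lower() not in TEXT_SUFFIXES:
            continue  # other formats need text extraction before scanning
        try:
            text = path.read_text(encoding="utf-8", errors="replace")
        except OSError:
            continue  # unreadable file: skipped here; a real tool would log it
        counts = {
            "ssn": len(SSN_RE.findall(text)),
            "email": len(EMAIL_RE.findall(text)),
        }
        if any(counts.values()):
            report[str(path)] = counts
    return report


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    archive = Path(tmp) / "archive" / "2021"
    archive.mkdir(parents=True)
    (archive / "customers.csv").write_text(
        "name,email\nJane,jane@example.com\n", encoding="utf-8"
    )
    (Path(tmp) / "notes.txt").write_text("nothing sensitive", encoding="utf-8")
    for file_path, counts in scan_directory(tmp).items():
        print(file_path, counts)
```

The report stores counts per file rather than the matched values, which keeps the scan output itself from becoming another copy of the personal data.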
Database Scanners
These connect to your databases (MySQL, PostgreSQL, SQL Server, Oracle, MongoDB, etc.) and scan table structures and data content for PII. They look at column names (a column called "ssn" is a strong indicator) and sample the actual data to classify it.
What they are good at: Finding PII in structured data stores. If your business application has a database, a database scanner tells you exactly which tables and columns contain personal data.
Limitations: They require database credentials and connectivity. They may not scan all rows (sampling is common for performance reasons), which means they could miss PII in unusual rows. They do not cover unstructured data like files and emails.
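The two signals described above, column names and sampled values, can be sketched as follows. This is a simplified illustration that uses Python's built-in sqlite3 module as a stand-in for a production database; the hint list, the value patterns, and the table contents are assumptions, not the logic of any particular product.

```python
import re
import sqlite3

# Assumed name hints and value patterns, kept short for illustration.
NAME_HINTS = re.compile(r"ssn|social|email|phone|birth|dob|address", re.IGNORECASE)
VALUE_PATTERNS = {
    "us_ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[A-Za-z]{2,}$"),
}


def classify_columns(conn, sample_size=100):
    """Return {(table, column): reason} for columns that look like they hold PII."""
    flagged = {}
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        for column in columns:
            if NAME_HINTS.search(column):
                flagged[(table, column)] = "column name"
                continue
            # Only the first sample_size rows are read, so PII that appears
            # solely in later rows would be missed.
            rows = conn.execute(
                f'SELECT "{column}" FROM "{table}" LIMIT ?', (sample_size,))
            values = [str(row[0]) for row in rows if row[0] is not None]
            for pii_type, pattern in VALUE_PATTERNS.items():
                if values and all(pattern.match(value) for value in values):
                    flagged[(table, column)] = f"sampled values look like {pii_type}"
                    break
    return flagged


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, field_7 TEXT, notes TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?, ?)", [
    (1, "a@example.com", "123-45-6789", "prefers phone calls"),
    (2, "b@example.com", "987-65-4321", "none"),
])
for (table, column), reason in classify_columns(conn).items():
    print(f"{table}.{column}: {reason}")
```

Here the email column is caught by its name, while the uninformatively named field_7 is caught only because its sampled values all match the SSN pattern.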
Cloud Storage Scanners
These connect to cloud services — Google Drive, OneDrive, Dropbox, Box, AWS S3, Azure Blob Storage — and scan the files stored there.
What they are good at: Covering the increasingly common scenario where business data lives in the cloud rather than on local servers. If your team works primarily in Google Workspace or Microsoft 365, this is where most of your unstructured data lives.
Limitations: They require API access or integration setup. Scanning speed depends on the cloud provider's API rate limits. Large cloud storage environments can take hours or days to fully scan.
Email Scanners
These scan email systems — Gmail, Outlook/Exchange, and email archives — for PII in message bodies, attachments, and headers.
What they are good at: Email is one of the richest and most overlooked repositories of personal data. Customers send you their details by email. Employees email spreadsheets with customer data to each other. Attachments accumulate over years. Email scanners are built to surface this kind of data.
Limitations: Email volumes are massive. Scanning a full email archive takes significant time and computing resources. Privacy considerations can arise when scanning employee email (make sure your policies allow this).
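For a sense of what scanning headers, bodies, and attachments involves, here is a minimal sketch using Python's standard email package on a single message. The demo message, addresses, and attachment name are hypothetical, and only text parts are inspected; binary attachments would need extraction or OCR first.

```python
import re
from email import message_from_string, policy
from email.message import EmailMessage

PATTERNS = (
    ("email", re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")),
    ("us_ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
)


def scan_message(raw_message):
    """Return (location, pii_type, value) tuples for one raw email message."""
    msg = message_from_string(raw_message, policy=policy.default)
    findings = []
    # Headers: sender and recipient addresses are personal data too.
    for header in ("From", "To", "Cc"):
        for value in msg.get_all(header, []):
            for address in PATTERNS[0][1].findall(str(value)):
                findings.append((f"header:{header}", "email", address))
    # Body and attachments: only text parts can be read directly.
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue
        filename = part.get_filename()
        location = f"attachment:{filename}" if filename else "body"
        text = part.get_content()
        for pii_type, pattern in PATTERNS:
            for value in pattern.findall(text):
                findings.append((location, pii_type, value))
    return findings


# Hypothetical message built in memory for the demo.
demo = EmailMessage()
demo["From"] = "customer@example.org"
demo["To"] = "support@example.com"
demo["Subject"] = "Account update"
demo.set_content("My SSN is 123-45-6789, please update my file.")
demo.add_attachment(
    "name,email\nJane,jane@example.net\n", subtype="csv", filename="contacts.csv"
)
for finding in scan_message(demo.as_string()):
    print(finding)
```

Scanning a real archive would mean repeating this per message, which is where the time and resource costs mentioned above come from.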
Image and OCR Scanners
Some PII scanners include optical character recognition (OCR) to find personal data in images — scanned documents, photos of IDs, screenshots containing personal information.
What they are good at: Finding PII that lives in non-text formats. A scanned PDF of a customer's driver's license, a screenshot of a support ticket with personal details, a photo of a whiteboard with customer names on it.
Limitations: OCR is not perfect. Accuracy depends on image quality, and the scanning process is significantly slower than text-based scanning.
Free and Open-Source PII Scanners
Before spending money, try these.
Microsoft Purview (Included with Microsoft 365)
If your business uses Microsoft 365, you already have basic PII scanning capabilities through Microsoft Purview (formerly the Microsoft 365 compliance portfolio).
What it does:
- Scans files in OneDrive, SharePoint, and Exchange for over 300 sensitive information types
- Classifies documents based on sensitivity
- Can apply labels and protection policies
- Includes a content search feature for DSAR-related data retrieval
How to use it for PII scanning:
- Go to the Microsoft Purview compliance portal (compliance.microsoft.com)
- Use Content Search to scan across Exchange, SharePoint, and OneDrive
- Use the Sensitive Information Types classification to identify PII patterns
- Review the results and export findings
Limitations: Only covers Microsoft 365 services. Does not scan external databases, other cloud services, or local files outside OneDrive/SharePoint. The free tier has limited features; advanced capabilities require E5 licensing or add-ons.
Our take: If you are on Microsoft 365, this should be your first stop. It covers your email and cloud files at no additional cost.
Presidio by Microsoft
Presidio is an open-source PII detection and anonymization tool. It analyzes text and identifies PII using a combination of pattern matching, named entity recognition, and configurable rules.
What it does:
- Detects dozens of PII types: names, phone numbers, email addresses, credit card numbers, SSNs, bank account numbers, passport numbers, and more
- Supports custom PII recognizers (you can add your own patterns)
- Can anonymize detected PII (replace, redact, hash, or mask)
- Available as a Python library or Docker container with REST API
How to use it:
- Install via pip (pip install presidio-analyzer presidio-anonymizer) or pull the Docker image
- Feed it text content: email bodies, document text, log files, CSV data
- Review the findings (entity type, confidence score, location in text)
Limitations: Processes text, not files directly. You need to extract text from your files first, then feed it to Presidio. Requires Python knowledge or comfort with Docker. No built-in file system crawling — it analyzes text you give it.
Our take: Excellent if you have a developer on your team. The detection accuracy is strong, especially for US and EU PII types. Not practical for non-technical users.
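The Presidio workflow described above might look roughly like this in Python. The analyze_text function uses Presidio's AnalyzerEngine and needs presidio-analyzer plus a spaCy English model installed. The summarize helper, the 0.5 score threshold, and the stand-in results are assumptions added so the sketch can run without the library; the stand-in scores are made up for illustration and are not real Presidio output.

```python
from types import SimpleNamespace


def analyze_text(text, language="en"):
    """Run Presidio's default analyzer over text and return its results."""
    # Imported lazily because Presidio is an optional, third-party dependency.
    from presidio_analyzer import AnalyzerEngine

    analyzer = AnalyzerEngine()
    return analyzer.analyze(text=text, language=language)


def summarize(text, results, min_score=0.5):
    """Turn analyzer results into (entity_type, matched_text, score) rows.

    Works with any objects exposing entity_type, start, end, and score,
    which are the fields Presidio reports for each finding.
    """
    rows = [
        (r.entity_type, text[r.start:r.end], round(r.score, 2))
        for r in results
        if r.score >= min_score  # hypothetical threshold; tune for your data
    ]
    return sorted(rows, key=lambda row: row[2], reverse=True)


# Stand-in results so the sketch runs without Presidio installed. With the
# library available you would call: results = analyze_text(text)
text = "Call Maria Lopez at 212-555-0147"
results = [
    SimpleNamespace(entity_type="PERSON", start=5, end=16, score=0.85),
    SimpleNamespace(entity_type="PHONE_NUMBER", start=20, end=32, score=0.75),
    SimpleNamespace(entity_type="DATE_TIME", start=24, end=32, score=0.2),
]
for row in summarize(text, results):
    print(row)
```

Filtering on the confidence score is one simple way to cut noise, at the cost of possibly dropping genuine low-confidence matches.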
piicatcher
piicatcher is an open-source tool specifically designed to find PII in databases. It connects to your database, examines table and column metadata, and samples data to identify columns containing personal information.
What it does:
- Scans database schemas for PII indicators (column names like "email," "ssn," "phone")
- Samples actual data to validate findings
- Supports PostgreSQL, MySQL, SQL Server, Oracle, and other databases via ODBC
- Generates reports listing which tables and columns contain PII
How to use it:
- Install via pip (pip install piicatcher)
- Configure your database connection
- Run the scan
- Review the report
Limitations: Database-only. Does not scan files, email, or cloud storage. Detection is based on column names and data patterns, so it may miss creatively named fields (a column called "field_7" that contains Social Security numbers will not be caught by name-based detection, though data sampling may catch it). Requires command-line comfort.
Our take: If you have databases with customer data, piicatcher is a quick, free way to audit them. Run it once to establish a baseline, then periodically to check for drift.
Google Cloud DLP API
Google Cloud's Data Loss Prevention API (now part of Google's Sensitive Data Protection service) inspects text, images, and structured data for sensitive information patterns.
What it does:
- Detects over 150 built-in sensitive data types
- Scans text, images (with OCR), and structured data
- Supports de-identification (masking, tokenization, encryption)
- Integrates with Google Cloud services (BigQuery, Cloud Storage, Datastore)
Free tier: 1 GB of content inspection per month at no cost. Beyond that, pricing is per GB.
Limitations: Works best within the Google Cloud ecosystem. Requires API setup and technical knowledge. The free tier is limited but sufficient for occasional scans.
Our take: Good for businesses that use Google Cloud Platform. The free tier is enough for periodic PII audits of specific datasets.
Bulk Extractor
A digital forensics tool that extracts PII patterns (email addresses, phone numbers, credit card numbers, URLs, domain names) from files, disk images, and directories.
Limitations: Forensic tool, not designed for regular business use. Command-line only. Output requires interpretation. But it is free, fast, and thorough.
Commercial PII Scanning Options
When free tools are not enough, here are commercial options organized by budget.
Budget-Friendly (Under $500/Month)
Nightfall AI — Cloud-native data loss prevention that integrates with Slack, Google Drive, GitHub, Jira, Confluence, and other SaaS tools. Scans for PII in real-time as data is created and shared. Also offers on-demand scanning. Pricing starts at a few hundred dollars per month for small teams.
Strac — Focuses on scanning SaaS applications (Gmail, Slack, Zendesk, Intercom) for PII. Lightweight setup, designed for smaller teams. Pricing is competitive for small businesses.
Mid-Range ($500 to $2,000/Month)
Spirion (formerly Identity Finder) — One of the most established PII scanning products. Scans endpoints, file servers, databases, email, and cloud storage. Strong pattern matching with low false positive rates. Pricing is per endpoint or per user.
Ground Labs Enterprise Recon — Scans a wide range of data sources for PII, with particular strength in payment card data (PCI compliance) and regulated data types. Mid-market pricing.
Egnyte — File sharing and governance platform with built-in PII scanning and classification. Good option if you also need cloud file storage. Per-user pricing.
Enterprise ($2,000+/Month)
BigID — AI-powered data discovery and intelligence. Scans virtually any data source. The detection engine is arguably the best in the market but the pricing reflects that.
Varonis — Data security platform with strong PII discovery across file systems, email, and cloud. Focused on data access governance alongside discovery.
OneTrust Data Discovery — Part of the broader OneTrust privacy platform. Comprehensive scanning with hundreds of data source integrations.
How to Scan Specific Systems
Let us get practical. Here is how to find PII in the systems small businesses commonly use.
Scanning Gmail and Google Workspace
With Google Vault (included in some Google Workspace plans):
- Open Google Vault (vault.google.com)
- Create a new search
- Search by user account or across the organization
- Filter by date range, keywords, or specific data types
- Export results for review
With Google Workspace Admin search:
- Go to admin.google.com
- Use the investigation tool to search across Gmail, Drive, and other services
- Search for specific PII patterns (email addresses, phone number formats)
With Microsoft Purview (if migrating from Google to Microsoft): Some businesses use Microsoft Purview's connectors to scan Google Workspace data.
Scanning Google Drive
Manual approach:
- Open Google Drive
- Search for the person's name, email address, phone number
- Check shared folders, recent files, and trash
- Review Google Sheets for PII columns
With Google Cloud DLP: Connect Google Cloud DLP to scan your Google Cloud Storage buckets. For Google Drive specifically, you may need to use the Drive API to export files and scan them with Presidio or DLP.
Scanning OneDrive and SharePoint
With Microsoft Purview:
- Use Content Search in the compliance portal
- Create a search query targeting specific users or keywords
- Preview results and export relevant data
This is the easiest scanning workflow for Microsoft 365 users since the tools are built in.
Scanning Databases
With piicatcher (free):
piicatcher detect --source-type postgresql --host localhost --port 5432 --database mydb --user myuser
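This produces a report of tables and columns containing PII. The piicatcher command-line syntax has changed between releases, so confirm the subcommands and flags against piicatcher --help for the version you installed.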
Manual approach (for small databases): Review your database schema. Look at column names and sample data from each table. Document which columns contain personal data. This takes an hour or two for a simple application database and gives you a clear PII map.
Scanning Local Files
On Windows: Windows Search can find files containing specific text. For a more thorough scan, run Bulk Extractor on target directories, or extract text from the files and pass it to Presidio.
On Mac: Spotlight searches file content. For deeper scanning, Python-based tools like Presidio run on macOS.
Manual approach: Focus on the directories most likely to contain PII — Downloads, Documents, Desktop, shared folders. Search for common file types (CSV, XLSX, DOCX, PDF) and spot-check them for personal data.
Building a PII Scanning Process
Rather than running scans ad hoc, build a repeatable process.
Initial Baseline Scan
Run a comprehensive scan of your systems to establish a baseline understanding of where PII lives. This informs your data map and helps you prepare for DSARs.
- List all your data storage locations (file servers, cloud storage, databases, email, SaaS applications)
- Choose the appropriate scanning tool for each location
- Run the scans
- Document findings — what PII types were found, in which systems, in which locations
- Compare findings to your existing data map and update as needed
- Identify data that should not exist (old records that should be deleted, PII in unexpected locations) and remediate
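The comparison step in the list above can be sketched as a set difference. This is a minimal illustration; the location names and PII type labels are hypothetical, and the input format (location mapped to a set of PII types) is an assumption about how you record your data map and scan output.

```python
def compare_to_data_map(data_map, scan_findings):
    """Compare documented PII locations with what a scan actually found.

    Both arguments map a location (system or path) to a set of PII types.
    Returns (undocumented, unconfirmed) dictionaries.
    """
    undocumented = {}  # found by the scan but missing from the data map
    unconfirmed = {}   # listed in the data map but not seen by the scan
    for location in sorted(set(data_map) | set(scan_findings)):
        documented = data_map.get(location, set())
        found = scan_findings.get(location, set())
        if found - documented:
            undocumented[location] = found - documented
        if documented - found:
            unconfirmed[location] = documented - found
    return undocumented, unconfirmed


# Hypothetical data map and scan output.
data_map = {
    "crm-db": {"email", "phone"},
    "billing-db": {"email", "credit_card"},
}
scan_findings = {
    "crm-db": {"email", "phone", "date_of_birth"},
    "old-shared-drive": {"us_ssn"},
}
undocumented, unconfirmed = compare_to_data_map(data_map, scan_findings)
print("Add to data map or remediate:", undocumented)
print("Verify or remove from data map:", unconfirmed)
```

Entries in the first group are candidates for either updating the data map or deleting the data; entries in the second group may simply reflect systems the scan did not cover.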
Periodic Rescans
Run scans quarterly or semi-annually to catch new PII that has accumulated. Focus on:
- Cloud storage (where new files appear constantly)
- Email (where PII accumulates over time)
- Any systems that have been added or changed since the last scan
DSAR-Triggered Scans
When a DSAR arrives, use scanning tools to supplement your manual search. The manual search (using your data search checklist) covers your known data sources. The PII scanner catches data in unexpected locations.
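A DSAR search differs from a general scan in that you know whose data you are looking for. The sketch below shows one way to match a requester's identifiers despite formatting differences. The function names are hypothetical, and the phone handling assumes 10-digit national numbers with an optional country code, which will not hold everywhere.

```python
import re

PHONE_LIKE = re.compile(r"\+?\d[\d\s().-]{8,}\d")


def normalize_phone(value):
    """Reduce a phone number to digits so formatting differences do not matter."""
    digits = re.sub(r"\D", "", value)
    # Assumption: 10-digit national numbers; any country code prefix is dropped.
    return digits[-10:] if len(digits) >= 10 else digits


def find_subject_mentions(text, full_name, email, phone):
    """Return the sorted identifier kinds ('email', 'name', 'phone') present in text."""
    hits = set()
    parts = full_name.split()
    first, last = re.escape(parts[0]), re.escape(parts[-1])
    # Match both "First Last" and "Last, First" orderings, ignoring case.
    name_forms = (rf"{first}\s+{last}", rf"{last},\s*{first}")
    if any(re.search(form, text, re.IGNORECASE) for form in name_forms):
        hits.add("name")
    if email.lower() in text.lower():
        hits.add("email")
    target = normalize_phone(phone)
    if any(normalize_phone(found) == target for found in PHONE_LIKE.findall(text)):
        hits.add("phone")
    return sorted(hits)


record = "Lopez, Maria | +1 (212) 555-0147 | M.Lopez@Example.com"
print(find_subject_mentions(record, "Maria Lopez", "m.lopez@example.com", "212-555-0147"))
```

Exact keyword search would miss all three identifiers in this record, which is the kind of gap the supplementary scan is meant to close.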
Post-Breach Scans
If you experience a breach, scan the affected systems immediately to determine what PII was potentially exposed. This information drives your breach notification decisions.
Practical Tips for Effective PII Scanning
Start With High-Risk Areas
You do not need to scan everything at once. Start with the areas most likely to contain PII and most likely to be searched during a DSAR:
- Email systems
- Cloud file storage (Google Drive, OneDrive)
- Customer databases
- CRM systems
- Shared network drives
Expect False Positives
PII scanners will flag things that are not actually PII. A 9-digit number that happens to match the SSN format, a test credit card number in a code repository, a fictional name in a document template. Review flagged items before acting on them.
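Some false positives can be filtered automatically before human review. The sketch below shows two common checks: the Luhn checksum that payment card numbers satisfy, and the number ranges the Social Security Administration does not issue. The function names are hypothetical, and plausible_ssn assumes its input is already in XXX-XX-XXXX form.

```python
def passes_luhn(number):
    """Return True if the digits in number satisfy the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not 13 <= len(digits) <= 19:  # typical payment card lengths
        return False
    total = 0
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


def plausible_ssn(candidate):
    """Reject SSN-shaped strings in ranges that are never issued.

    Expects the XXX-XX-XXXX format. Area 000, 666, and 900-999, group 00,
    and serial 0000 are not valid.
    """
    area, group, serial = candidate.split("-")
    if area in {"000", "666"} or area.startswith("9"):
        return False
    return group != "00" and serial != "0000"


for card in ("4111 1111 1111 1111", "4111 1111 1111 1112"):
    print(card, "passes Luhn" if passes_luhn(card) else "fails Luhn")
for ssn in ("123-45-6789", "000-12-3456", "912-34-5678"):
    print(ssn, "plausible" if plausible_ssn(ssn) else "not issuable")
```

These checks remove random digit runs, but they do not replace review: the widely used test card number 4111 1111 1111 1111 passes the Luhn check even though it belongs to no customer.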
Do Not Ignore False Negatives
PII scanners will also miss things. A person's name in a paragraph of text might not be flagged if it is not in a recognizable format. Data in proprietary file formats might be invisible to the scanner. Treat scanning as a supplement to your manual search, not a replacement for it.
Document Your Scans
Keep records of what you scanned, when, what tools you used, and what you found. This documentation demonstrates due diligence to regulators and helps you track how your PII landscape changes over time.
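One lightweight way to keep such records is an append-only log with one JSON object per scan. The sketch below is a minimal example under assumed field names (system, tool, findings, scanned_at) and a hypothetical file name; adapt the fields to whatever your documentation needs.

```python
import json
import tempfile
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path


def _utc_now():
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


@dataclass
class ScanRecord:
    system: str      # what was scanned
    tool: str        # which tool was used
    findings: dict   # PII type -> number of hits (counts only, no raw values)
    scanned_at: str = field(default_factory=_utc_now)  # when the scan ran (UTC)


def append_record(log_path, record):
    """Append one scan record as a JSON line so the log keeps a running history."""
    with open(log_path, "a", encoding="utf-8") as log_file:
        log_file.write(json.dumps(asdict(record), sort_keys=True) + "\n")


def load_records(log_path):
    """Read the log back as a list of dictionaries, oldest first."""
    with open(log_path, encoding="utf-8") as log_file:
        return [json.loads(line) for line in log_file if line.strip()]


# Demo with hypothetical systems, written to a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "pii_scans.jsonl"
    append_record(log_path, ScanRecord("shared-drive", "regex walker", {"email": 42}))
    append_record(log_path, ScanRecord("crm-db", "piicatcher", {"email": 1200}))
    for entry in load_records(log_path):
        print(entry["scanned_at"], entry["system"], entry["tool"], entry["findings"])
```

Recording counts rather than matched values keeps the log useful for tracking change over time without turning it into another store of personal data.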
Secure the Scan Results
PII scan reports are themselves sensitive documents — they contain information about where personal data lives. Store them securely and limit access to people who need it.
The Bottom Line
PII scanning software ranges from free and open-source to enterprise-grade platforms costing thousands per month. For most small businesses, the right approach is:
- Start with what you have. If you use Microsoft 365, explore Microsoft Purview. If you use Google Workspace, use its built-in search and consider Google Cloud DLP.
- Add free tools for gaps. Use piicatcher for databases, Presidio for text analysis, and manual searches for everything else.
- Invest in commercial tools when the manual approach fails — usually due to data volume, system complexity, or frequent DSAR/breach response needs.
The goal is not to scan everything all the time. The goal is to know where personal data lives so you can manage it properly, respond to requests efficiently, and react quickly when something goes wrong.
References
- General Data Protection Regulation (GDPR): Regulation (EU) 2016/679.
- California Consumer Privacy Act (CCPA): Cal. Civ. Code §§ 1798.100–1798.199.100.
- NIST Privacy Framework.
Last reviewed: February 2026. Privacy laws change frequently. Verify all statutory references against the current text of the law and consult qualified legal counsel before making compliance decisions for your business.
Get the Full DSAR Compliance Picture
PII scanning is one piece of the compliance puzzle. To understand how it fits into a complete DSAR response process — from receiving a request to delivering a compliant response — download our DSAR Compliance Guide. It includes data search checklists, system inventory templates, and step-by-step workflows for small businesses.