
AI And Your Data: Unmasking The Collection

2025-06-13 · Christopher Ramezan · 9 minute read
AI
Data Privacy
Cybersecurity

AI Is Everywhere, And So Is Data Collection

Artificial intelligence, or AI, has seamlessly integrated into our daily routines. Many everyday items, from electric razors to smart toothbrushes, are now marketed as "AI-powered." These devices employ machine learning algorithms to monitor usage patterns, assess real-time performance, and offer user feedback. Whether you're posing questions to AI assistants like ChatGPT or Microsoft Copilot, or tracking fitness goals with a smartwatch, you are interacting with AI systems more often than you may realize.

While these AI tools and technologies offer convenience, they concurrently bring up significant questions about data privacy. These systems are designed to collect vast quantities of data, often without individuals being fully aware of this collection. This gathered information can then be utilized to discern personal habits and preferences, and even to predict future actions by drawing inferences from the aggregated dataset.

Cybersecurity experts, including those studying emerging technologies and AI systems, focus on how personal data is managed and how more secure, privacy-respecting systems can be developed for the future. Generative AI software relies on extensive training data to produce new content like text or images. Predictive AI, on the other hand, uses data to forecast outcomes based on past behavior, such as how likely you are to hit your daily step goal or which movie you might want to watch next. Both types of AI can be employed to gather information about you.

How Generative AI Tools Learn From Your Input

Generative AI assistants such as ChatGPT and Google Gemini meticulously collect all information users input into their chat interfaces. Every query, response, and prompt entered is recorded, stored, and subsequently analyzed to enhance the AI model's capabilities.

OpenAI’s privacy policy explicitly states that user-provided content may be used to improve their services, for instance, to train the models that drive ChatGPT. Although OpenAI offers an opt-out for using content in model training, it nevertheless collects and retains your personal data. While some companies assert that they anonymize this data—storing it without direct identifiers—a risk of re-identification always persists.

ChatGPT stores and analyzes everything you type into a prompt screen. Screenshot by Christopher Ramezan, CC BY-ND

Predictive AI Tracking Your Online Behavior

Beyond generative AI, social media platforms like Facebook, Instagram, and TikTok continuously gather user data to train their predictive AI models. Every post, photo, video, like, share, and comment, along with the time spent viewing each, is collected as a data point. This data is used to construct detailed digital data profiles for each user.

These profiles serve to refine the platform’s AI recommender systems. They can also be sold to data brokers, who then sell this personal data to other companies, for example, to facilitate the development of targeted advertisements tailored to an individual’s interests.
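To make the idea of a digital data profile concrete, here is a minimal sketch of how engagement signals such as likes, shares and viewing time could be rolled up into per-topic interest scores. The event fields, action weights and categories are illustrative assumptions, not any platform's actual schema.

```python
from collections import defaultdict

# Hypothetical engagement events: (content_category, action, seconds_viewed).
# Real platforms log far richer data; these fields are illustrative only.
events = [
    ("fitness", "like", 45),
    ("fitness", "view", 30),
    ("travel", "share", 60),
    ("cooking", "view", 5),
]

# Assumed weights for how strongly each action signals interest.
ACTION_WEIGHT = {"view": 1.0, "like": 2.0, "share": 3.0}

def build_profile(events):
    """Aggregate raw engagement events into per-category interest scores."""
    profile = defaultdict(float)
    for category, action, seconds in events:
        profile[category] += ACTION_WEIGHT.get(action, 1.0) * seconds
    return dict(profile)

print(build_profile(events))
# {'fitness': 120.0, 'travel': 180.0, 'cooking': 5.0}
```

A recommender system or advertiser would simply rank topics by scores like these; the point is that ordinary clicks and watch time translate directly into a profile that can be refined, shared or sold.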

Many social media companies also employ methods like cookies and embedded tracking pixels to monitor users across various websites and applications. Cookies are small files that store information about your identity and browsing activity. A common use is in digital shopping carts, remembering items you've added even if you leave and return later. Tracking pixels are tiny, invisible images or code snippets on websites that inform companies of your visit, helping them track your online behavior.

This is why individuals often encounter advertisements related to their browsing and shopping history on unrelated websites and even across different devices like computers, phones, and smart speakers. One study discovered that some websites can place over 300 tracking cookies on a user's computer or mobile phone.

Here’s how websites you browse can track you using cookies or tracking pixels.
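As a sketch of that mechanism (not any real ad network's code), the snippet below runs a tiny tracking-pixel endpoint: it returns a 1x1 transparent GIF and logs who requested it and from which page. The port, cookie name and logging format are assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# A 1x1 transparent GIF: the classic "tracking pixel" payload.
PIXEL = (b"GIF89a\x01\x00\x01\x00\x80\x00\x00\x00\x00\x00\x00\x00\x00"
         b"!\xf9\x04\x01\x00\x00\x00\x00,\x00\x00\x00\x00\x01\x00\x01\x00"
         b"\x00\x02\x02D\x01\x00;")

class PixelHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The Referer header reveals which page embedded the pixel; a cookie
        # (here a hypothetical "visitor_id") links separate visits together.
        page = self.headers.get("Referer", "unknown page")
        visitor = self.headers.get("Cookie", "visitor_id=new-visitor")
        print(f"Pixel hit: {visitor} viewed {page}")

        self.send_response(200)
        self.send_header("Content-Type", "image/gif")
        # Set (or refresh) the identifying cookie so future visits can be linked.
        self.send_header("Set-Cookie", "visitor_id=abc123; Max-Age=31536000")
        self.end_headers()
        self.wfile.write(PIXEL)

if __name__ == "__main__":
    # Any page embedding <img src="http://localhost:8000/pixel.gif">
    # quietly reports the visit to this server.
    HTTPServer(("localhost", 8000), PixelHandler).serve_forever()
```

Any site that embeds the pixel's URL in an image tag reports the visit to that server, and the long-lived cookie lets separate visits be tied back to the same person.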

Smart Devices And The Illusion Of Privacy Control

Similar to generative AI platforms, social media services provide privacy settings and opt-out options. However, these offer limited control over how personal data is aggregated and monetized. As media theorist Douglas Rushkoff noted in 2011, if a service is free, the user often becomes the product.

Many AI-powered tools collect data without requiring any direct user action. Smart devices like home speakers, fitness trackers, and watches continuously gather information via biometric sensors, voice recognition, and location tracking. Smart home speakers are always listening for their wake word. In doing so, they capture all surrounding conversations, even when they appear inactive.

Some companies assert that voice data is stored only when the wake word is detected. However, concerns about accidental recordings have been raised, especially since these devices are often linked to cloud services. This connection allows voice data to be stored, synchronized, and shared across multiple devices like phones, smart speakers, and tablets. If permitted by the company, this data can also be accessed by third parties, such as advertisers, data analytics firms, or law enforcement agencies with a warrant.
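The snippet below is a deliberately simplified sketch of such an always-listening loop, assuming hypothetical microphone and wake-word functions rather than any vendor's actual implementation. It shows why audio from just before the wake word can end up in the cloud: the device keeps a short rolling buffer at all times.

```python
import collections
import random
import time

# A deliberately simplified "always listening" loop. The microphone and
# wake-word functions are placeholders, not any vendor's real code.

PRE_ROLL_SECONDS = 2  # audio kept from *before* the wake word is heard
buffer = collections.deque(maxlen=PRE_ROLL_SECONDS * 10)  # ~10 chunks/second

def read_audio_chunk():
    """Placeholder for a microphone read returning ~100 ms of audio."""
    time.sleep(0.1)
    return b"." * 100

def sounds_like_wake_word(chunk):
    """Placeholder for on-device wake-word detection (randomly triggers here)."""
    return random.random() < 0.01

while True:
    chunk = read_audio_chunk()
    buffer.append(chunk)  # the microphone is sampled constantly
    if sounds_like_wake_word(chunk):
        # On a (possibly mistaken) match, the buffered audio, including speech
        # captured just before the wake word, plus the follow-up request is
        # typically sent to the cloud for processing.
        audio_to_upload = b"".join(buffer)
        print(f"Uploading {len(audio_to_upload)} bytes of audio to the cloud")
        buffer.clear()
```

An accidental recording is simply this same path triggered by a false match on something that merely sounds like the wake word.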

Erosion Of Privacy Protections: A Growing Concern

This potential for third-party access extends to smartwatches and fitness trackers, which monitor health metrics and user activity. Companies producing wearable fitness devices are not typically considered "covered entities" under the Health Insurance Portability and Accountability Act (HIPAA). Consequently, they are legally permitted to sell health and location-related data collected from users.

A notable incident highlighting these concerns occurred in 2018, when the fitness company Strava released a global heat map of user exercise routes. It inadvertently revealed sensitive military locations worldwide by showcasing the exercise patterns of military personnel.

Smart speakers can collect information even when they’re sleeping. recep-bg/Getty Images

Reports indicate that the Trump administration utilized Palantir, a company specializing in AI-driven data analytics, to compile and analyze data on Americans. Concurrently, Palantir announced a partnership with a company operating self-checkout systems.

Such collaborations can significantly expand corporate and governmental insight into everyday consumer behavior. This particular partnership could be used to create detailed personal profiles by linking consumer habits with other personal data, raising concerns about heightened surveillance and diminished anonymity. It could enable the tracking and analysis of citizens across multiple life aspects without their knowledge or consent.

Some smart device manufacturers are also weakening privacy protections. Amazon announced that starting March 28, 2025, all voice recordings from Amazon Echo devices would be sent to Amazon’s cloud by default, removing the user's option to disable this. This marks a shift from previous settings that allowed users to limit private data collection. Changes like these raise concerns about how much control consumers have over their data when using smart devices. Many privacy experts consider cloud storage of voice recordings a form of data collection, especially when it is used for algorithm improvement or user profiling, which has implications under data privacy laws.

Key Data Privacy Implications To Understand

The pervasive data collection by AI tools raises serious privacy concerns for individuals and governments regarding how data is collected, stored, used, and transmitted. The foremost concern is transparency: people often do not know what data is being collected, how it is used, and who can access it.

Companies often use complex privacy policies filled with technical jargon, making it difficult for individuals to understand the terms of service they agree to. Furthermore, people generally do not thoroughly read these documents. One study found that individuals spent an average of only 73 seconds reading terms of service documents that had an estimated read time of 29-32 minutes.

Data collected by AI tools might initially be held by a trusted company but can easily be sold or transferred to entities you may not trust. Additionally, AI tools, the companies managing them, and those with access to the collected data are all susceptible to cyberattacks and data breaches, which can expose sensitive personal information. These attacks can be orchestrated by cybercriminals motivated by financial gain, or by so-called advanced persistent threats (APTs), typically nation-state-sponsored attackers who infiltrate networks and systems to covertly collect information for disruptive or harmful purposes.

While laws and regulations like the General Data Protection Regulation (GDPR) in the European Union and the California Consumer Privacy Act (CCPA) aim to protect user data, AI development and deployment have often outpaced legislative efforts. The legal framework is still striving to catch up with AI and data privacy challenges. For now, it is wise to assume that any AI-powered device or platform is collecting data on your inputs, behaviors, and patterns.

Despite the concerns about data collection and privacy, AI tools can be remarkably useful. AI-powered applications can enhance workflows, automate mundane tasks, and offer valuable insights. However, approaching these tools with awareness and caution is crucial.

When using generative AI platforms that provide answers to typed prompts, refrain from including any personally identifiable information (PII), such as names, birth dates, Social Security numbers, or home addresses. In a professional context, avoid inputting trade secrets or classified information. As a general rule, do not enter anything into a prompt that you would not be comfortable sharing publicly or seeing displayed on a billboard. Remember, once you submit a prompt, you lose control over that information.
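One lightweight precaution, sketched below, is to screen a prompt for obviously PII-shaped strings before sending it anywhere. The patterns are illustrative assumptions and will miss plenty, so treat this as a reminder rather than a real safeguard.

```python
import re

# Very rough patterns for a few common U.S. PII formats; these are
# illustrative assumptions and will miss many real-world cases.
PII_PATTERNS = {
    "Social Security number": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email address": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone number": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def check_prompt(prompt: str) -> list[str]:
    """Return a list of warnings for PII-looking strings found in a prompt."""
    return [label for label, pattern in PII_PATTERNS.items()
            if pattern.search(prompt)]

prompt = "My SSN is 123-45-6789, can you fill in this form for me?"
warnings = check_prompt(prompt)
if warnings:
    print("Think twice before sending; the prompt appears to contain:", warnings)
else:
    print("No obvious PII found, but review the prompt yourself as well.")
```

Even with a check like this, the safer habit is the one above: if you would not put it on a billboard, do not put it in a prompt.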

Be mindful that smart devices are effectively always listening whenever they are powered on, even when they appear to be 'asleep.' If you use smart home or embedded devices, turn them off completely during private conversations: a device in sleep mode looks inactive but remains powered on, waiting for its wake word. Unplugging a device or removing its batteries is a reliable way to ensure it is truly off.

Finally, diligently review the terms of service and data collection policies of the devices and platforms you use. You might be surprised by what you have already consented to.

Additional Resources

This article is part of a series on data privacy that explores who collects your data, what and how they collect, who sells and buys your data, what they all do with it, and what you can do about it.

Read more: How illicit markets fueled by data breaches sell your personal information to criminals

The Conversation will be hosting a free webinar on practical and safe use of AI with our tech editor and an AI expert on June 24 at 2pm ET/11am PT. Sign up to get your questions answered.
