Automattic, the parent company of Tumblr and WordPress.com, is reportedly close to striking deals to provide user data for training artificial intelligence models developed by OpenAI and Midjourney, according to 404 Media. The move would give those companies a large new source of training material, but it also raises important questions about privacy and transparency.
The Data Exchange
Automattic’s plan involves sharing data from both Tumblr and WordPress. However, exactly which data will be included remains unclear. Here’s what has been reported so far:
Initial Data Dump: Automattic allegedly compiled an “initial data dump” containing all of Tumblr’s public post content from 2014 to 2023. A collection of that size could serve as valuable training material for AI models.
Private and Partner-Related Content: The controversy arises from the inclusion of private and partner-related data. An internal post by Tumblr product manager Cyle Gage suggests that Automattic may have inadvertently included private posts, deleted or suspended blogs, unanswered questions, explicit content, and even premium partner blog data (such as Apple’s former music site).
Legal Implications: Automattic says it will share only public content from sites that haven’t opted out. However, no current regulation requires AI companies’ web crawlers to respect users’ opt-out preferences, which raises concerns about user consent and control over personal data.
AI Companies’ Perspective
Both OpenAI and Midjourney stand to benefit significantly from this data exchange. Training AI models requires vast amounts of diverse, real-world data, and Tumblr and WordPress content offers a rich source of it. However, the companies will need to tread carefully to maintain ethical practices and user trust.
User Opt-Out and Transparency
Automattic’s upcoming opt-out tool aims to give users more control. Users will be able to block third parties, including AI companies, from training on their data. The tool will maintain a disallowed list that prevents web crawlers from accessing content on opted-out sites. Additionally, Automattic plans to regularly update partners about users who opt out, so that their content can be removed from existing datasets and excluded from future training.
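For readers curious how a crawler "disallowed list" can work in practice, here is a minimal sketch in Python. It assumes the opt-out is expressed through a standard robots.txt file, which the reports do not confirm. The robots.txt contents and the `may_crawl` helper are hypothetical illustrations, not Automattic's actual tool or list. GPTBot is the user agent OpenAI publishes for its crawler; "SomeSearchBot" is a made-up name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an opted-out site. The entries are
# illustrative assumptions, not Automattic's real disallowed list.
OPTED_OUT_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/post/123"
    for agent in ("GPTBot", "SomeSearchBot"):
        verdict = "allowed" if may_crawl(OPTED_OUT_ROBOTS_TXT, agent, url) else "blocked"
        print(f"{agent}: {verdict}")
```

A list like this only works if the crawler chooses to honor it. As noted above, nothing currently requires AI companies to do so, which is why the plan to notify partners about opt-outs matters as a second line of defense.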