Commit from Zettel Notes at Feb 29, 2024, 07:00:48 PM

Iván Ruvalcaba 2024-02-29 19:00:48 -06:00
parent 2f6cd46288
commit d9a1771b49
10 changed files with 233 additions and 0 deletions

@ -11,3 +11,25 @@ untracked file, adding: .zettel-notes/patches/19/2d92c008-ec8e-4c83-8966-42f5504
==============
untracked file, adding: .zettel-notes/logs/sync/2024-02-29.log
==============
20240229184603
==============
untracked file, adding: .zettel-notes/patches/8/a178cc25-c894-4d89-af41-cba6c553dfc6.diff
untracked file, adding: .zettel-notes/patches/8/4f58dbf2-9688-49c6-9aec-12efd86dee4e.diff
untracked file, adding: .zettel-notes/patches/19/9ac4e85b-7085-41cf-a7d5-b7798ec9e6a8.diff
untracked file, adding: .zettel-notes/patches/12/41d7406c-25aa-4bb8-85b1-7d220cd2b89c.diff
untracked file, adding: fleeting-notes/20240221192403.md
untracked file, adding: .zettel-notes/patches/16/ee7b3dff-87d3-4b97-8d1e-63e342c9aec7.diff
untracked file, adding: .zettel-notes/patches/19/f6205780-5fc8-4684-9289-95b9f73767e6.diff
untracked file, adding: fleeting-notes/20240228093717.md
untracked file, adding: .zettel-notes/patches/19/23f53b89-67ea-4eef-88af-209938b4c0ce.diff
untracked file, adding: fleeting-notes/20240220072400.md.gpg
untracked file, adding: .zettel-notes/patches/12/0905a835-1f39-4f86-9434-ab8a57c5ba5c.diff
untracked file, adding: .zettel-notes/patches/16/0ce1902a-be84-4443-bebd-eeab54f76ae3.diff
untracked file, adding: .zettel-notes/patches/19/e76cd70d-848b-47e1-ac24-6f7958621aaf.diff
untracked file, adding: .zettel-notes/patches/17/4d57b3d0-fc2b-4afb-88b1-13bc633dbfee.diff
untracked file, adding: .zettel-notes/patches/19/746635a5-8ba3-4f06-8264-b129515eda1b.diff
untracked file, adding: fleeting-notes/20240220104508.md.gpg
untracked file, adding: .zettel-notes/patches/19/b23d3986-8c49-4692-b033-9fadb71a34fb.diff

@ -0,0 +1,21 @@
--- old
+++ new
@@ -28,1 +28,1 @@
-“the way the data was queried for the initial data dump to Midjourney/OpenAI means we compiled a list of all tumblrs public post content between 2014 and 2023, but also unfortunately it included, and should not have included:
+"the way the data was queried for the initial data dump to Midjourney/OpenAI means we compiled a list of all tumblrs public post content between 2014 and 2023, but also unfortunately it included, and should not have included:
@@ -30,6 +30,6 @@
-- private posts on public blogs
-- posts on deleted or suspended blogs
-- unanswered asks (normally these are not public until theyre answered)
-- private answers (these only show up to the receiver and are not public)
-- posts that are marked explicit / NSFW / mature by our more modern standards (this may not be a big deal, I dont know)
-- content from premium partner blogs (special brand blogs like Apples former music blog, for example, who spent money with us on an ad campaign) that may have creative that doesnt belong to us, and we dont have the rights to share with this-parties; this one is kinda unknown to me, what deals are in place historically and what they should prevent us from doing.”
+- private posts on public blogs
+- posts on deleted or suspended blogs
+- unanswered asks (normally these are not public until theyre answered)
+- private answers (these only show up to the receiver and are not public)
+- posts that are marked 'explicit' / NSFW / 'mature' by our more modern standards (this may not be a big deal, I don't know)
+- content from premium partner blogs (special brand blogs like Apples former music blog, for example, who spent money with us on an ad campaign) that may have creative that doesnt belong to us, and we dont have the rights to share with this-parties; this one is kinda unknown to me, what deals are in place historically and what they should prevent us from doing."
@@ -37,1 +37,1 @@
-Gages post makes clear that engineers are working on compiling a list of post IDs that should not have been included, and that password-protected posts, DMs, and media flagged as CSAM and other community guidelines violations were not included.
+Gage's post makes clear that engineers are working on compiling a list of post IDs that should not have been included, and that password-protected posts, DMs, and media flagged as CSAM and other community guidelines violations were not included.

@ -0,0 +1,5 @@
--- old
+++ new
@@ -10,2 +10,0 @@
- -
- categories:

@ -0,0 +1,13 @@
--- old
+++ new
@@ -57,10 +57,0 @@
-
-OpenAI and Midjourney did not respond to requests for comment. 
-
-_Updated 4:05 p.m. EST with a statement from Automattic and at 6:51 EST with additional information about JetPack._
-
-About the author
-
-Sam Cole is writing from the far reaches of the internet, about sexuality, the adult industry, online culture, and AI. She's the author of How Sex Changed the Internet and the Internet Changed Sex.
-
-![Samantha Cole](https://www.404media.co/content/images/2023/08/404-sam-10--1-.jpg)

@ -0,0 +1,71 @@
--- old
+++ new
@@ -2,1 +2,1 @@
-title: 20240229184600
+title: Tumblr and WordPress to Sell Users Data to Train AI Tools
@@ -6,0 +6,2 @@
+authors:
+ - Samantha Cole
@@ -13,1 +15,1 @@
-license: CC0-1.0
+license: ©Todos los derechos reservados
@@ -16,0 +18,59 @@
+
+Tumblr and WordPress to Sell Users Data to Train AI Tools
+Samantha Cole Samantha Cole
+Feb 27, 2024 at 1:21 PM
+
+Internal documents obtained by 404 Media show that Tumblr staff compiled users' data as part of a deal with Midjourney and OpenAI.
+
+Tumblr and WordPress.com are preparing to sell user data to Midjourney and OpenAI, according to a source with internal knowledge about the deals and internal documentation referring to the deals. 
+
+The exact types of data from each platform going to each company are not spelled out in documentation weve reviewed, but internal communications reviewed by 404 Media make clear that deals between Automattic, the platforms parent company, and OpenAI and Midjourney are imminent.
+
+The internal documentation details a messy and controversial process within Tumblr itself. One internal post made by Cyle Gage, a product manager at Tumblr, states that a query made to prepare data for OpenAI and Midjourney compiled a huge number of user posts that it wasnt supposed to. It is not clear from Gages post whether this data has already been sent to OpenAI and Midjourney, or whether Gage was detailing a process for scrubbing the data before it was to be sent. 
+
+_Subscribe to the 404 Media podcast on_ [_Apple Podcasts_](https://podcasts.apple.com/us/podcast/the-404-media-podcast/id1703615331?ref=404media.co)_,_ [_Google Podcasts_](https://podcasts.google.com/feed/aHR0cHM6Ly9mZWVkcy5hY2FzdC5jb20vcHVibGljL3Nob3dzL3RoZS00MDQtbWVkaWEtcG9kY2FzdA?ref=404media.co)_, or your favorite podcast app._
+
+Gage wrote:
+
+“the way the data was queried for the initial data dump to Midjourney/OpenAI means we compiled a list of all tumblrs public post content between 2014 and 2023, but also unfortunately it included, and should not have included:
+
+- private posts on public blogs
+- posts on deleted or suspended blogs
+- unanswered asks (normally these are not public until theyre answered)
+- private answers (these only show up to the receiver and are not public)
+- posts that are marked explicit / NSFW / mature by our more modern standards (this may not be a big deal, I dont know)
+- content from premium partner blogs (special brand blogs like Apples former music blog, for example, who spent money with us on an ad campaign) that may have creative that doesnt belong to us, and we dont have the rights to share with this-parties; this one is kinda unknown to me, what deals are in place historically and what they should prevent us from doing.”
+
+Gages post makes clear that engineers are working on compiling a list of post IDs that should not have been included, and that password-protected posts, DMs, and media flagged as CSAM and other community guidelines violations were not included.
+
+Automattic plans to launch a new setting on Wednesday that will allow users to opt-out of data sharing with third parties, including AI companies, according to the source, who spoke on the condition of anonymity, and internal documents. A new FAQ section we reviewed is titled “What happens when you opt out?” states that “If you opt out from the start, we will block crawlers from accessing your content by adding your site on a disallowed list. If you change your mind later, we also plan to update any partners about people who newly opt-out and ask that their content be removed from past sources and future training.” 
+
+💡
+
+****Do you work at Tumblr or Wordpress and have information about this deal or data compiling effort? I would love to hear from you. Using a non-work device, you can message me securely on Signal at +1 646 926 1726. Otherwise, send me an email at sam@404media.co.****
+
+
+404 Media has asked Automattic how it accidentally compiled data that it shouldnt share, and whether any of that content was shared with OpenAI. 404 Media asked Automattic about an imminent deal with Midjourney last week but did not hear back then, either. Instead of answering direct questions about these deals and the compiling of user data, [Automattic sent a statement, which it posted publicly after this story was published](https://automattic.com/2024/02/27/protecting-user-choice/?ref=404media.co), titled "Protecting User Choice." In it, Automattic promises that it's blocked AI crawlers from scraping its sites. The statement says, "We are also working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control. Our partnerships will respect all opt-out settings. We also plan to take that a step further and regularly update any partners about people who newly opt out and ask that their content be removed from past sources and future training."
+
+The statement published by Automattic after this article was published specifically mentions WordPress.com, which are blogs that Automattic hosts as a service. There is separately an open-source WordPress CMS (WordPress.org) that people and businesses use on self-hosted websites. What remains unclear is whether self-hosted WordPress blogs that use popular Automattic plugins like JetPack to connect those blogs with [Automattic's infrastructure](https://wordpress.com/tos/?ref=404media.co) are subject to the company's AI-scraping deals. Automattic did not immediately respond to a question about whether sites using JetPack are subject to its data sharing agreements.
+
+Another internal document shows that, on February 23, an employee asked in a staff-only thread, “Do we have assurances that if a user opts out of their data being shared with third parties that our existing data partners will be notified of such a change and remove their data?”
+
+Andrew Spittle, Automattics head of AI replied: “We **will** notify existing partners on a regular basis about anyone who's opted out since the last time we provided a list. I want this to be an ongoing process where we regularly advocate for past content to be excluded based on current preferences. We **will** ask that content be deleted and removed from any future training runs. I _believe_ partners will honor this based on our conversations with them to this point. I don't think they gain much overall by retaining it.” Automattic did not respond to a question from 404 Media about whether it could guarantee that people who opt out will have their data deleted retroactively.
+
+News about a deal between Tumblr and Midjourney has been rumored and [<u>speculated about on Tumblr</u>](https://tinystepsforward.tumblr.com/post/742824508024651776/matt-is-supposed-to-be-on-fucking-sabbatical-rn?ref=404media.co) for the last week. Someone claiming to be a former Tumblr employee announced in a Tumblr blog post that the platform was working on a deal with Midjourney, and the rumor made it onto Blind, an app for verified employees of companies to anonymously discuss their jobs. 404 Media has seen the Blind posts, in which what seems like an Automattic employee says, “I'm not sure why some of you are getting worked up or worried about this. It's totally legal, and sharing it publicly is perfectly fine since it's right there in the terms & conditions. So, go ahead and spread the word as much as you can with your friends and tech journalists, it's totally fine.”
+
+Separately, 404 Media viewed a public, now-deleted post by Gage, the product manager, where he said that he was deleting all of his images off of Tumblr, and would be putting them on his personal website. A [<u>still-live post</u>](https://www.tumblr.com/cyle/740896644859625500?ref=404media.co) says, “i've deleted my photography from tumblr and will be moving it slowly but surely over to [<u>cylegage.com</u>](https://cylegage.com/?ref=404media.co), which i'm building into a photography portfolio that i can control end-to-end.” At one point last week, his personal website had a specific note stating that he did not consent to AI scraping of his images. Gages original post has been deleted, and his website is now a blank page that just reads “Cyle.” Gage did not respond to a request for comment from 404 Media. 
+
+Several online platforms have made similar deals with AI companies recently, including Reddit, which entered into an [<u>AI content licensing deal with Google</u>](https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/?ref=404media.co) and [<u>said in its SEC filing last week</u>](https://www.404media.co/reddit-we-are-in-the-early-stages-of-monetizing-our-user-base-2/) that its “in the early stages of monetizing \[its\] user base” by training AI on users posts. Last year, [<u>Shutterstock signed a six year deal</u>](https://investor.shutterstock.com/news-releases/news-release-details/shutterstock-expands-partnership-openai-signs-new-six-year?ref=404media.co) with OpenAI to provide training data.
+
+OpenAI and Midjourney did not respond to requests for comment. 
+
+_Updated 4:05 p.m. EST with a statement from Automattic and at 6:51 EST with additional information about JetPack._
+
+About the author
+
+Sam Cole is writing from the far reaches of the internet, about sexuality, the adult industry, online culture, and AI. She's the author of How Sex Changed the Internet and the Internet Changed Sex.
+
+![Samantha Cole](https://www.404media.co/content/images/2023/08/404-sam-10--1-.jpg)

@ -0,0 +1,12 @@
--- old
+++ new
@@ -2,1 +2,1 @@
-title: Tumblr and WordPress to Sell Users Data to Train AI Tools
+title: Tumblr and WordPress to Sell Users' Data to Train AI Tools
@@ -18,6 +18,0 @@
-
-Tumblr and WordPress to Sell Users Data to Train AI Tools
-Samantha Cole Samantha Cole
-Feb 27, 2024 at 1:21 PM
-

@ -0,0 +1,8 @@
--- old
+++ new
@@ -24,1 +24,1 @@
-The internal documentation details a messy and controversial process within Tumblr itself. One internal post made by Cyle Gage, a product manager at Tumblr, states that a query made to prepare data for OpenAI and Midjourney compiled a huge number of user posts that it wasnt supposed to. It is not clear from Gages post whether this data has already been sent to OpenAI and Midjourney, or whether Gage was detailing a process for scrubbing the data before it was to be sent. 
+The internal documentation details a messy and controversial process within Tumblr itself. One internal post made by Cyle Gage, a product manager at Tumblr, states that a query made to prepare data for OpenAI and Midjourney compiled a huge number of user posts that it wasnt supposed to. It is not clear from Gages post whether this data has already been sent to OpenAI and Midjourney, or whether Gage was detailing a process for scrubbing the data before it was to be sent.
@@ -26,2 +26,0 @@
-_Subscribe to the 404 Media podcast on_ [_Apple Podcasts_](https://podcasts.apple.com/us/podcast/the-404-media-podcast/id1703615331?ref=404media.co)_,_ [_Google Podcasts_](https://podcasts.google.com/feed/aHR0cHM6Ly9mZWVkcy5hY2FzdC5jb20vcHVibGljL3Nob3dzL3RoZS00MDQtbWVkaWEtcG9kY2FzdA?ref=404media.co)_, or your favorite podcast app._
-

@ -0,0 +1,20 @@
--- old
+++ new
@@ -12,1 +12,1 @@
- -
+ - Inteligencia Artificial
@@ -14,1 +14,4 @@
- -
+ - Automattic
+ - Tumblr
+ - WordPress.com
+ - OpenAI
@@ -39,1 +42,1 @@
-Automattic plans to launch a new setting on Wednesday that will allow users to opt-out of data sharing with third parties, including AI companies, according to the source, who spoke on the condition of anonymity, and internal documents. A new FAQ section we reviewed is titled “What happens when you opt out?” states that “If you opt out from the start, we will block crawlers from accessing your content by adding your site on a disallowed list. If you change your mind later, we also plan to update any partners about people who newly opt-out and ask that their content be removed from past sources and future training.” 
+Automattic plans to launch a new setting on Wednesday that will allow users to opt-out of data sharing with third parties, including AI companies, according to the source, who spoke on the condition of anonymity, and internal documents. A new FAQ section we reviewed is titled “What happens when you opt out?” states that "If you opt out from the start, we will block crawlers from accessing your content by adding your site on a disallowed list. If you change your mind later, we also plan to update any partners about people who newly opt-out and ask that their content be removed from past sources and future training."
@@ -41,5 +44,0 @@
-💡
-
-****Do you work at Tumblr or Wordpress and have information about this deal or data compiling effort? I would love to hear from you. Using a non-work device, you can message me securely on Signal at +1 646 926 1726. Otherwise, send me an email at sam@404media.co.****
-
-

@ -0,0 +1,5 @@
--- old
+++ new
@@ -57,0 +57,2 @@
+
+OpenAI and Midjourney did not respond to requests for comment.

@ -0,0 +1,56 @@
---
title: Tumblr and WordPress to Sell Users' Data to Train AI Tools
description:
date: 2024-02-29
id: 20240229184600
authors:
- Samantha Cole
taxonomies:
categories:
- Inteligencia Artificial
tags:
- Automattic
- Tumblr
- WordPress.com
- OpenAI
license: ©Todos los derechos reservados
---
Internal documents obtained by 404 Media show that Tumblr staff compiled users' data as part of a deal with Midjourney and OpenAI.

Tumblr and WordPress.com are preparing to sell user data to Midjourney and OpenAI, according to a source with internal knowledge about the deals and internal documentation referring to the deals.

The exact types of data from each platform going to each company are not spelled out in documentation we've reviewed, but internal communications reviewed by 404 Media make clear that deals between Automattic, the platforms' parent company, and OpenAI and Midjourney are imminent.

The internal documentation details a messy and controversial process within Tumblr itself. One internal post made by Cyle Gage, a product manager at Tumblr, states that a query made to prepare data for OpenAI and Midjourney compiled a huge number of user posts that it wasn't supposed to. It is not clear from Gage's post whether this data has already been sent to OpenAI and Midjourney, or whether Gage was detailing a process for scrubbing the data before it was to be sent.

Gage wrote:

"the way the data was queried for the initial data dump to Midjourney/OpenAI means we compiled a list of all tumblr's public post content between 2014 and 2023, but also unfortunately it included, and should not have included:

- private posts on public blogs
- posts on deleted or suspended blogs
- unanswered asks (normally these are not public until they're answered)
- private answers (these only show up to the receiver and are not public)
- posts that are marked 'explicit' / NSFW / 'mature' by our more modern standards (this may not be a big deal, I don't know)
- content from premium partner blogs (special brand blogs like Apple's former music blog, for example, who spent money with us on an ad campaign) that may have creative that doesn't belong to us, and we don't have the rights to share with this-parties; this one is kinda unknown to me, what deals are in place historically and what they should prevent us from doing."

Gage's post makes clear that engineers are working on compiling a list of post IDs that should not have been included, and that password-protected posts, DMs, and media flagged as CSAM and other community guidelines violations were not included.

Automattic plans to launch a new setting on Wednesday that will allow users to opt out of data sharing with third parties, including AI companies, according to the source, who spoke on the condition of anonymity, and internal documents. A new FAQ section we reviewed, titled “What happens when you opt out?”, states that "If you opt out from the start, we will block crawlers from accessing your content by adding your site on a disallowed list. If you change your mind later, we also plan to update any partners about people who newly opt-out and ask that their content be removed from past sources and future training."

404 Media has asked Automattic how it accidentally compiled data that it shouldn't share, and whether any of that content was shared with OpenAI. 404 Media asked Automattic about an imminent deal with Midjourney last week but did not hear back then, either. Instead of answering direct questions about these deals and the compiling of user data, [Automattic sent a statement, which it posted publicly after this story was published](https://automattic.com/2024/02/27/protecting-user-choice/?ref=404media.co), titled "Protecting User Choice." In it, Automattic promises that it's blocked AI crawlers from scraping its sites. The statement says, "We are also working directly with select AI companies as long as their plans align with what our community cares about: attribution, opt-outs, and control. Our partnerships will respect all opt-out settings. We also plan to take that a step further and regularly update any partners about people who newly opt out and ask that their content be removed from past sources and future training."

The statement published by Automattic after this article was published specifically mentions WordPress.com, which are blogs that Automattic hosts as a service. There is separately an open-source WordPress CMS (WordPress.org) that people and businesses use on self-hosted websites. What remains unclear is whether self-hosted WordPress blogs that use popular Automattic plugins like JetPack to connect those blogs with [Automattic's infrastructure](https://wordpress.com/tos/?ref=404media.co) are subject to the company's AI-scraping deals. Automattic did not immediately respond to a question about whether sites using JetPack are subject to its data sharing agreements.

Another internal document shows that, on February 23, an employee asked in a staff-only thread, “Do we have assurances that if a user opts out of their data being shared with third parties that our existing data partners will be notified of such a change and remove their data?”

Andrew Spittle, Automattic's head of AI, replied: “We **will** notify existing partners on a regular basis about anyone who's opted out since the last time we provided a list. I want this to be an ongoing process where we regularly advocate for past content to be excluded based on current preferences. We **will** ask that content be deleted and removed from any future training runs. I _believe_ partners will honor this based on our conversations with them to this point. I don't think they gain much overall by retaining it.” Automattic did not respond to a question from 404 Media about whether it could guarantee that people who opt out will have their data deleted retroactively.

News about a deal between Tumblr and Midjourney has been rumored and [<u>speculated about on Tumblr</u>](https://tinystepsforward.tumblr.com/post/742824508024651776/matt-is-supposed-to-be-on-fucking-sabbatical-rn?ref=404media.co) for the last week. Someone claiming to be a former Tumblr employee announced in a Tumblr blog post that the platform was working on a deal with Midjourney, and the rumor made it onto Blind, an app for verified employees of companies to anonymously discuss their jobs. 404 Media has seen the Blind posts, in which what seems like an Automattic employee says, “I'm not sure why some of you are getting worked up or worried about this. It's totally legal, and sharing it publicly is perfectly fine since it's right there in the terms & conditions. So, go ahead and spread the word as much as you can with your friends and tech journalists, it's totally fine.”

Separately, 404 Media viewed a public, now-deleted post by Gage, the product manager, where he said that he was deleting all of his images off of Tumblr, and would be putting them on his personal website. A [<u>still-live post</u>](https://www.tumblr.com/cyle/740896644859625500?ref=404media.co) says, “i've deleted my photography from tumblr and will be moving it slowly but surely over to [<u>cylegage.com</u>](https://cylegage.com/?ref=404media.co), which i'm building into a photography portfolio that i can control end-to-end.” At one point last week, his personal website had a specific note stating that he did not consent to AI scraping of his images. Gage's original post has been deleted, and his website is now a blank page that just reads “Cyle.” Gage did not respond to a request for comment from 404 Media.

Several online platforms have made similar deals with AI companies recently, including Reddit, which entered into an [<u>AI content licensing deal with Google</u>](https://www.reuters.com/technology/reddit-ai-content-licensing-deal-with-google-sources-say-2024-02-22/?ref=404media.co) and [<u>said in its SEC filing last week</u>](https://www.404media.co/reddit-we-are-in-the-early-stages-of-monetizing-our-user-base-2/) that it's “in the early stages of monetizing \[its\] user base” by training AI on users' posts. Last year, [<u>Shutterstock signed a six year deal</u>](https://investor.shutterstock.com/news-releases/news-release-details/shutterstock-expands-partnership-openai-signs-new-six-year?ref=404media.co) with OpenAI to provide training data.

OpenAI and Midjourney did not respond to requests for comment.