I see my "Tumblr users should have at least two months to get their affairs in order before potentially having to make a hard decision" postulation was a bit too optimistic. Here's what we had to learn through a third party!
Tumblr and Wordpress to Sell Users’ Data to Train AI Tools
Edit: Since you need a (free) account to view the page,
trans-mando graciously took a screencap for us; I transcribed it and offered my own questions in a reblog, which I will copy here.
My response: Honestly, this does not fill me with hope. First of all, how do you remove content that a machine learning program has already been trained on? Is that even possible?
Secondly, Tumblr's whole system is based on reblogging stuff. How does that work in conjunction with bots being blocked from training on stuff? Does the address of the person being reblogged still count, or not? What if they change their name? What about deactivated accounts? Will the bots be trained on their data as well? What if someone deactivates their account after having opted out of having their data used to train bots; is it still off-limits? If it's not off-limits, what prevents malicious actors from unfairly targeting creatives whose work they want to steal? We've already seen how easy it is for malicious actors to spam-report perfectly harmless posts and blogs hosted by trans people; what protections, if any, are there from this scenario?
Thirdly, since Tumblr is selling this data, do any of the people who created the works in the first place get to see any of that money? Or are we being stolen from twice over?
Edit 2:
trans-mando reports in reblog tags that "they did say that they didn't want to scrape deleted blogs but hadn't figured out how to exclude them yet".
[...] AI companies, according to the source, who spoke on the condition of anonymity, and internal documents. A new FAQ section we reviewed, titled "What happens when you opt out?", states that "If you opt out from the start, we will block crawlers from accessing your content by adding your site on a disallowed list. If you change your mind later, we also plan to update any partners about people who newly opt-out and ask that their content be removed from past sources and future training."
404 Media has asked Automattic how it accidentally compiled data that it shouldn't share, and whether any of that content was shared with OpenAI, but did not immediately hear back from the company. 404 Media asked Automattic about an imminent deal with Midjourney last week but did not hear back then, either.
no subject
Date: 2024-02-27 09:26 pm (UTC)
no subject
Date: 2024-02-28 02:33 am (UTC)
If I’m reading this right
Date: 2024-02-27 11:47 pm (UTC)
Wow howdy that is a grim finale. “Can’t figure out how to monetize this, guess we’ll sell it all to the AI scrapers.”
Re: If I’m reading this right
Date: 2024-02-28 02:31 am (UTC)
no subject
Date: 2024-02-29 04:42 am (UTC)
No, it is not possible to remove data from a training set once it was included, just to potentially exclude it from any future training runs. And it sounds awfully like they may have already provided a scrape of the site before this announcement and the ability to opt-out. (But again, unclear!)
The rest are very good questions that I am not certain we will get good answers for... particularly things like reblogs, and whether someone opted-in reblogging someone opted-out opens the OP's content to being used. What about content that the poster does not own, like the aesthetic blogs that post hundreds of unsourced, uncredited images? Seems unlikely they'll give us real answers.
no subject
Date: 2024-02-29 05:39 am (UTC)
I'm also kinda pissed about Tumblr selling this data. I'm guessing they (probably Matt) count our usage of the site as "payment enough" in exchange, but still! And that's not even getting into the fan content copyright issues...
no subject
Date: 2024-03-02 01:52 am (UTC)
And yeah, the selling part is real crappy. (Worse, it's happening to Wordpress users as well! I sort of expect social media to be awful and exploiting their users, but Wordpress is a paid host for a lot of people, and finding out their work is also being sold is real awful.)