Machine learning and repair

Panda · 14 August 2023 09:13

The training data for ChatGPT is, I believe, secret. What I did find is that both therestartproject.org and restarters.net are in Google’s C4 AI model training dataset:

RANK        DOMAIN                 TOKENS  PERCENT OF ALL TOKENS
337,559     therestartproject.org  69k     0.00004%
9,529,350   restarters.net         510     0.0000003%

Source: https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/

(My personal website surprisingly ranks even higher. I’m not sure what to make of this.)

Monique · 14 August 2023 09:19

It is also very so perhaps it has not ingested the Wiki.

Monique · 14 August 2023 09:26

Although it does know what a Restart Party is. Had to limit it to one paragraph as it did bang on a bit.

Panda · 14 August 2023 09:39

I’m happily surprised by the first response. I tried it in a different context and the response was very materialistic, missing all the power dynamics. Here I was expecting something along the lines of “it’s a place where you get a free fix”, but the response does capture the essential peer knowledge-sharing aspect of Restart Parties. I guess it shows that the Restart Project’s communication on this is really good.

Monique · 14 August 2023 10:15

Here is ChatGPT hallucinating about the Wiki languages.

The real answer is English, Dutch and French.

(Wow, my question was very poorly constructed!)

Ian_Barnard · 14 August 2023 12:22

Maybe this might help with prompt construction: https://www.promptingguide.ai/introduction/examples

Ian_Barnard · 14 August 2023 12:45

ChatGPT is very secretive about the source of the massive amount of training material they used (possibly one reason is that they then can’t be accused of scraping copyrighted material?), and they also rely on very heavily hand-crafted fine-tuning for specific input patterns, again without publishing any specification of it. More positively, there are now open-source LLMs, and some companies are making a point of curating sources and (for example) removing hate speech from the training material.

The really scary thought is that at some point the training material for the next gen LLMs will start including output of previous generations, hallucinations and all. What happens to truth then?

The type of prompting we were playing with last week was like this (the first line is the anti-hallucination instruction; not sure if it will work with ChatGPT):

Answer the question below using only the information provided here.

[text of info from wiki about clock repair]

What are the possible problems with a carriage clock, and how do I repair them?
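
If anyone wants to build prompts like this programmatically, here is a minimal Python sketch of the same three-part layout (instruction, source text, question). The function name build_grounded_prompt and the variable names are made up for illustration, and the wiki text is left as a placeholder. It only assembles the prompt string; it does not call any chatbot, and it makes no promise that the instruction actually prevents hallucination.

```python
def build_grounded_prompt(context: str, question: str) -> str:
    """Assemble a prompt: instruction, then source text, then the question.

    Hypothetical helper for illustration; it only builds a string.
    """
    instruction = (
        "Answer the question below using only the information provided here."
    )
    # Blank lines separate the three parts, as in the example prompt.
    return "\n\n".join([instruction, context.strip(), question.strip()])


# Placeholder: paste the relevant wiki text here.
wiki_text = "[text of info from wiki about clock repair]"
question = (
    "What are the possible problems with a carriage clock, "
    "and how do I repair them?"
)

prompt = build_grounded_prompt(wiki_text, question)
print(prompt)
```

The resulting string can then be pasted into whichever chatbot you are testing.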

Panda · 16 August 2023 18:43

Not more positively for the poorly paid workers tasked with doing this without any support for the trauma they’re experiencing.