[Linux-aus] Open Source AI or LLM people + projects in Australia/NZ

Sun Jan 7 16:00:39 AEDT 2024

Hi Lyndsey,

The comments here are from my personal perspective, not those of any 
academic or institutional affiliations I have.

Firstly I want to clarify some terms. AI and machine learning are a very 
broad space - they're used in almost every vertical - defence, aerospace, 
marketing, business analytics, education and so on. A large language 
model (LLM) is a specific type of machine learning model that is 
_predictive_ and _generative_. Given a prompt, it responds with the 
output that best matches the prompt _based on the data it has been 
trained on_. Other types of machine learning models (for example, image 
generators like DALL-E 2) use different algorithms and are trained on 
different types of data. If you train an LLM on a particular set of 
data, for example, "movies", then its predictions will reflect that data 
and little else.
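
To make the "predicts what it's been trained on" point concrete, here is a minimal sketch in Python. It is not how a real LLM works internally (those are large neural networks); it is a toy word-pair counter I'm using as a stand-in, and the training sentences and function names are made up for illustration. It only shows the principle that predictions come from the training data.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count which word follows which in the training sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical, made-up training data about movies.
movie_corpus = [
    "the film won the award",
    "the film opened in cinemas",
    "the director praised the film",
]

model = train_bigram(movie_corpus)
print(predict_next(model, "the"))    # the most common follower seen in training
print(predict_next(model, "servo"))  # None: the word never appeared in training
```

A word the toy model has never seen gets no useful prediction at all. Real LLMs degrade more gracefully, but the dependence on training data is the same.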

## Distinguishing open source models from open source data

Models are built on data.

While a model can be open sourced, that isn't enough to make it "open 
source AI", IMHO - and the Open Source Initiative is having a broader 
conversation around what is meant by "open source AI" [5]. The 
algorithm, code, dependencies etc. also need to be available, so that 
the model can be reproduced, before I'd call it "truly open".

The key thing to keep in mind is that models are trained on *data* - 
it's not so important where the model is developed, but it's really 
important where the data comes from (this is the focus of the movement 
called "data-centric AI" [0]).

## LLMs in the Australian context

The challenge here for the Australian context is that many LLMs are 
either not trained on Australian-specific data, _or_, their predictions 
do not mirror the Australian context because their training set has 
little Australian content compared to American content. The context can 
be specified in the prompt - for example "Please create a news headline and 
two paragraphs of copy of an important event in Melbourne in 2002. Use 
Australian English." - and given this prompt, ChatGPT (3) will use 
Australian spelling and Melbourne-specific landmarks. ChatGPT also 
seems to have some grounding in Indigenous knowledges, such as Bunjil. 
But if I ask ChatGPT the question "There's a bingle at Broady and the 
Western's chokkas back to the servo. What should I do?" then it quickly 
degrades to general advice, rather than context-specific suggestions 
("Mate, can you turn off to Donnybrook Road or Plenty Road?").

AFAIK there are no open source, public LLMs being developed within 
Australia.

The main academic conference for LLM work in Australia (which is 
considered less prestigious than international conferences such as 
NeurIPS [1] and ICML [2]) is the Australasian Language Technology 
Conference (ALTA), which I attended in December. My notes from this are 
public [3]. The main focus of this conference was the application of LLM 
technology to healthcare applications - such as mining medical records 
to assist health professionals in making accurate and timely diagnoses. 
These language models *are* being made open source, but they are smaller 
and much more specific than ChatGPT.

Indeed, there was a lot of conversation at the conference about the 
*need* for a research or open source LLM in Australia because the costs 
of using ChatGPT and others (Claude, Bard) quickly add up.

In summary, there is a lot of scope for creating an Australian-specific, 
open source LLM. AFAIK, one doesn't exist.

## Other open source LLM efforts

The main open source LLM efforts are:

* https://www.eleuther.ai/

* Llama 2 by Meta

* https://falconllm.tii.ae/

* Mistral AI

None of these efforts are Australian-based.

## Other Australian open AI efforts

Many universities are tackling the AI-generated / LLM-generated text 
issue by using AI-detection tools, primarily within the Turnitin suite. 
Turnitin LLC is headquartered in California, USA.

From conversations I've had with archivists in Australian collection 
institutions, there is also a need for Australian-specific speech 
recognition tools - because tools like Whisper do not recognise 
Australian accented speech as well as other accents [4]. These tools are 
being used to transcribe audio visual archives. Again, a lot of this 
comes down to the *data* that the model is trained on. The key problem 
here is that Whisper was trained on 680k hours of speech data. To train 
a comparable model, you would need hundreds of thousands of hours of 
Australian-accented speech. The AusTalk archive, for example, from 
memory has maybe a couple of thousand hours (it's offline so I can't 
check).
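
To put that gap in rough numbers, here is a back-of-envelope sketch in Python. The 680k figure is the one quoted above; the 2,000 hours for AusTalk is only my from-memory "couple of thousand" estimate, so treat the output as indicative rather than exact.

```python
WHISPER_TRAINING_HOURS = 680_000  # figure quoted above
AUSTALK_HOURS_ESTIMATE = 2_000    # assumption: "a couple of thousand", unverified

# What fraction of a Whisper-sized training set the archive would cover.
share = AUSTALK_HOURS_ESTIMATE / WHISPER_TRAINING_HOURS

# How many more hours would be needed to match Whisper's training set.
shortfall_hours = WHISPER_TRAINING_HOURS - AUSTALK_HOURS_ESTIMATE

# 680k hours expressed as years of continuous, round-the-clock audio.
years_of_audio = WHISPER_TRAINING_HOURS / (24 * 365)

print(f"Archive covers {share:.2%} of a Whisper-sized training set")
print(f"Shortfall: {shortfall_hours:,} hours")
print(f"680k hours is about {years_of_audio:.0f} years of continuous audio")
```

Even if my AusTalk estimate is off by a factor of several, the archive is still well under a few percent of what a comparable model would need.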

In the last week, we've also seen rapid advances in voice synthesis - 
with MyShell releasing OpenVoice voice cloning technology [6]. The 
*model* is openly available, but again, the data, algorithms and 
code are not. The question here for Australia is - do we want to be 
able to clone Australian-accented voices? (yeah, nah ;-) There are no 
Australian TTS / voice cloning efforts that are open source, AFAIK. 
Voice cloning raises major ethical questions for the likes of the ATO, 
which uses voice recognition (a system that has previously been 
spoofed [7]).

AI is rapidly being adopted into cybersecurity efforts, particularly in 
the field of adversarial AI. These capabilities are predominantly the 
domain of the Acronym Agencies (ASIO, ASD etc), and the folks at BSides 
might be useful to talk to about Australian efforts here.

In terms of Australian AI institutes, there are a few:

* The Data61 CSIRO National Artificial Intelligence Centre - which 
doesn't actually produce any AI or ML; its remit is to encourage AI 
adoption - 
https://www.csiro.au/en/work-with-us/industries/technology/National-AI-Centre

* Australian Institute for Machine Learning at University of Adelaide - 
research into AI / ML - https://www.adelaide.edu.au/aiml/

* A2I2 Institute at Deakin University - https://a2i2.deakin.edu.au/

* UNSW AI Institute - https://www.unsw.edu.au/unsw-ai

Invariably, the academic institutions offer various (mostly 
postgrad) programs, with a heavy emphasis on "engaging industry" (read: 
getting industry to fund AI research because the government's research 
funding is paltry).

## The problem of national capability and why the business adoption centres are exacerbating rather than addressing this, IMHO

(I was about to write "sovereign capability" but as I was reminded, 
correctly, recently, sovereignty was never ceded).

This all brings me to the key problem I have with the business adoption 
initiative. What I've outlined above is that Australia has very little 
national capability in AI, and even less in open source AI. What we 
adopt is, predominantly, American-owned AI that might then be 
shoe-horned into an Australian context. Sure, businesses should be 
looking to adopt AI to remain competitive. But the only AI they can 
adopt at the moment is, largely, American AI.

What the business adoption initiative seeks to do is spur *adoption* 
rather than *development*. What I would like to see happen is the 
development of national AI capability, preferably in the form of open 
source products that can be used by Australian businesses and 
organisations nationally. Perhaps one of our national organisations 
should focus on that, rather than encouraging Australian businesses to 
spend money overseas ...

Kind regards,

Kathy Reid

[0] https://dcai.csail.mit.edu/

[1] https://neurips.cc/

[2] https://icml.cc/

[3] https://blog.kathyreid.id.au/2023/12/10/alta2023/

[4] My PhD research, forthcoming

[5] https://blog.opensource.org/open-source-ai-establishing-a-common-ground/

[6] https://research.myshell.ai/open-voice

[7] https://www.theguardian.com/technology/2023/mar/16/voice-system-used-to-verify-identity-by-centrelink-can-be-fooled-by-ai

On 7/1/24 14:54, Lyndsey Jackson via linux-aus wrote:
> Hi all,
>
> on a bit of a fact finding reach out for people or connections from 
> people working on open AI/LLM projects.
>
> Late last year a proposal for AI Centres to help SMEs adopt AI 
> dropped. 
> https://business.gov.au/grants-and-programs/artificial-intelligence-ai-adopt-program
> Before the holiday break I did some work on a proposal concept for 
> agricultural value add (which is very, very broad), and I have insight 
> into how some key groups were considering approaching it.
>
> And if you have any tech/advice/ideas/groups please let me know, I 
> might not get a group to put a bid in but that's ok. I still want to 
> know what's happening in open source and who is working on it.
>
>
> Thanks,
>
> Lyndsey