2023: The year of generative AI

It’s the end of 2023, and as the calendar flips, there’s no better time to take stock of the year’s innovation and its impact on the Future of Teams. 2023 was definitely the year of Generative AI tools. And when I write this post again at the end of 2024, the improvements between now and then will be even more mind-blowing. This is a long post, but it was a big year, so I wanted to go into the most important parts in enough detail for you and your team to go into 2024 as prepared as possible. Feel free to skip around. In this post, I cover:

  1. WHAT IS GPT?

  2. CHAT ASSISTANTS

  3. IMAGE GENERATION

  4. MEETING NOTES & VOICE TRANSCRIPTION

  5. DEEPFAKING, VOICE-CLONING, & LANGUAGE TRANSLATION

What is GPT?

I don't say AI that much. Like 'Big Data' and 'Blockchain' before it, a new technology's hype cycle leads people to overuse the new term, and people are definitely doing that with AI right now. AI means so many things. And when it means so many things, it often means nothing. This post is about a specific kind of AI. This post is about Generative Pre-trained Transformers.

  • Generative: it makes something for you.

  • Pre-trained: it scanned a large amount of data in advance in order to find patterns.

  • Transformer: it takes something that you type (a prompt) and transforms it into something that it creates (this can be text, or image, or video).

Large Language Models (LLMs): 

Although this technology has been in the works for more than three decades, the world of generative text really started to take off with the release of GPT3 from a research organization called OpenAI in May of 2020. Although there were many language models previously, GPT3 was the first LARGE language model (LLM). It was built mostly on Microsoft's servers, paid for through a somewhat confusing partnership and investment deal.

What’s the next word in this sentence: “Once upon a ___”

(The next word is 'time', right? How do you know that? Easy: because you've seen enough sentences in the world to know that stories often start with “Once upon a time…”. That’s what Pre-trained means. You, yourself, have been pre-trained: you looked at examples and now you just know.)

A language model is just a huge database of words with probabilities next to each one. And with that database it can predict the next word with amazing accuracy. This allows any GPT to answer questions, write stories, summarize meetings, and all kinds of other things. That's all it really does. It's just ‘autocomplete’. But it's really great at it.
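To make that ‘autocomplete’ idea concrete, here’s a tiny toy sketch. This is not how any real LLM is implemented (real models learn patterns across billions of words rather than storing a lookup table), but it shows the core trick: given some context, pick the next word from a table of probabilities. The words and numbers here are made up for illustration.

```python
import random

# A toy "language model": for a given context, the possible next
# words and how likely each one is to follow that context.
next_word_probabilities = {
    "once upon a": {"time": 0.95, "mattress": 0.03, "hill": 0.02},
    "the quick brown": {"fox": 0.90, "dog": 0.07, "bear": 0.03},
}

def predict_next_word(context: str) -> str:
    """Pick the next word by sampling from the stored probabilities."""
    choices = next_word_probabilities[context.lower()]
    words = list(choices.keys())
    weights = list(choices.values())
    return random.choices(words, weights=weights, k=1)[0]

print(predict_next_word("Once upon a"))   # almost always prints "time"
```

A real GPT does this same next-word guess over and over, feeding each new word back in as context, until it has written you a whole answer.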

And it was LARGE because of its sheer scale: GPT3 was pre-trained on hundreds of billions of words of text, and the resulting model has 175 billion parameters (the internal settings it tunes while looking for word-patterns), roughly a hundred times more than its predecessor, GPT2. Everything before this was small, or medium, and because of that, models before GPT3 weren’t that accurate or impressive. With GPT3 and its much bigger pre-training data set, the results were starting to feel like magic. But it was still hard to use for people who weren't programmers, because you could only get to it through APIs.
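For the curious, here’s roughly what “getting to it through an API” looked like in the GPT-3 era: a minimal sketch using the pre-1.0 OpenAI Python library and its legacy completions endpoint. The API key is a placeholder, the model name is just the original GPT-3 base model, and the library has changed since then, so treat this as illustrative rather than a recipe.

```python
import openai  # OpenAI's Python library (pre-1.0 "completions" style)

openai.api_key = "YOUR_API_KEY"  # placeholder; you'd use your own key

# Ask the model to continue a prompt: the same "predict the next word"
# trick, repeated until it has written a few words for you.
response = openai.Completion.create(
    model="davinci",       # the original GPT-3 base model
    prompt="Once upon a",
    max_tokens=5,
)

print(response.choices[0].text)  # very likely " time," plus a few more words
```

Useful if you were a programmer; not so useful for everyone else. That’s what changed next.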

Chat assistants:

ChatGPT (from OpenAI, with investment from Microsoft):

https://chat.openai.com/

Then on November 30th, 2022, OpenAI released a newly tuned version of the model, GPT-3.5, refined to follow instructions and hold a conversation. More importantly, they blew up the Internet when they attached a Chat interface to it! ChatGPT was born. Now anyone in the world could start using it. And they did. In less than 2 months, ChatGPT had more than 100 million monthly active users, faster than any consumer application in history.

ChatGPT is designed to work as an assistant. You ask it a question, you get an answer, and usually a pretty great one. You can ask for virtually anything, though it’s generally bad at math; it’s a language model, not a computational one.

It also builds sentences using probability, so it’s bad at answering questions that look for “the best of X.” That’s because there are only a tiny number of ‘best’ things, and they get averaged out of the data, dramatically overshadowed by the sea of most common things. So it’s much better at answering the question, “what is the most common X?” But it’s really great at many, many other questions. It can write code. It can plan vacations. It can help you write better email.

An important caveat… prompt information entered into ChatGPT may be used to train the model, and thus your information can be used to answer other people’s questions in the future. Don’t put anything into a prompt that you wouldn’t post on your own website. Even better, don’t put anything into a prompt that you haven’t already put on your own website. 

Claude (from Anthropic, with an investment from Amazon):

https://claude.ai/

Claude is also a chat-driven assistant, powered by Anthropic’s large language model of the same name. It essentially does the same thing as ChatGPT, but there are more controls designed to prevent bias, maintain security, and provide source attribution. Additionally, the privacy policy currently says that they will not use your prompt information to train the underlying model. This could change at any time, so I would still be sparing about putting your intellectual property into any of these models.

Bard (from Google, soon to be powered by Gemini):

https://bard.google.com/

Note: full disclosure… I used to work at Google, but did not have access to Gemini and have no proprietary insight on plans or capabilities. Information on my site and in my talks is from publicly available Google information only.

Right now, Bard is not actually powered by Gemini. It’s running on the PaLM 2 large language model, and it’s not as good as ChatGPT or Claude. However, Google is currently switching all of their products to run on the new model called Gemini. Gemini, when released, should be dramatically better, because it will be able to handle text, pictures, and video together in a single thread. All of the other models have segregated text and image AIs into different tools. Gemini will be able to be used with combinations of phone pictures, voice, and text. This should make it dramatically more useful as an app (or a bunch of apps) on a mobile device when you’re out in the world. Expect that every tool from Google going forward just has Gemini built into it: Search, Maps, Mail, Meet, Drive, Docs, YouTube. I would expect these tools to officially roll out in the spring of 2024 at Google Cloud Next (in April) or Google I/O (in May).

Other AI Assistants:

https://www.jasper.ai/, https://www.perplexity.ai/, https://clickup.com/, https://komo.ai/

There are hundreds of AI-based assistant apps out there: Jasper, Perplexity, ClickUp, Komo, and many, many others. However, when you look closely, these are thin app layers running on top of the big Large Language Models from OpenAI (GPT), Google (PaLM 2/Gemini), or Anthropic (Claude). Because these tools aren’t LLM-native and are built on someone else’s model, you often only get access to the free-tier features and lag behind on the latest model results. Generally, you’ll get better results from one of the three assistants above. It takes a few billion dollars to train a large language model, so it’s likely that there will only be a few real LLMs in the world.

Image generation:

In the above examples we were talking about generating text. But generating pictures works in much the same way. If I asked you to draw me a picture of the beach one dot at a time, you might guess that the first pixel at the very top left should be a sky-blue one. Next pixel: still blue. A few more blues, and then maybe a few white ones (because there’s a good possibility that there is a cloud in the sky). Predicting the next pixel isn’t that different from predicting the next word; you’re just writing a story one dot at a time. And so in the new world of GPTs, there are Image Generation tools that you can use as well.
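If you’re curious what “prompt in, picture out” looks like from code, here’s a minimal sketch using OpenAI’s image API (the DALL-E endpoint in the pre-1.0 Python library). The API key is a placeholder and the details are illustrative of the general pattern, not a recommendation of any particular tool; the chat-based tools below do all of this for you.

```python
import openai  # OpenAI's Python library (pre-1.0 style)

openai.api_key = "YOUR_API_KEY"  # placeholder; you'd use your own key

# Ask the image model for a picture; the response contains a URL to the result.
response = openai.Image.create(
    prompt="4 monkeys working together on a team",
    n=1,                 # how many images to generate
    size="1024x1024",
)

print(response["data"][0]["url"])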

Midjourney:

https://www.midjourney.com/ (but to talk to Midjourney, you’ll also need an account for Discord chat)

The leading Image Generation Model is called Midjourney. Midjourney generates pictures from a prompt. And it’s pretty amazing. A picture is worth a thousand words, so I’ll just show you:

“I need a picture of 4 monkeys working together on a team.”


Dall-e (from OpenAI):

https://openai.com/dall-e-2

Dall-e is ChatGPT’s Image Generation sibling. If you’re using the newest paid version of ChatGPT, then when you ask it for a picture, it will generate a picture using Dall-e.  It’s not as good as Midjourney.  See:

“I need a picture of 4 monkeys working together on a team.”


Stable Diffusion (from Stability AI):

https://stability.ai/

Stable Diffusion is another great image generation tool. I’ve heard that it’s “not as good as MidJourney, but better than Dall-e.”  But I haven’t used it, so I can’t confirm that first hand. It’s probably worth checking out if you’re just getting started and are having trouble with the slightly confusing juggling of Discord and Midjourney together.

Gemini (from Google):

https://blog.google/technology/ai/google-gemini-ai/

Although not yet released, Google’s Gemini model should allow for Image Generation (and text at the same time). For straight image requests, I think Midjourney will be better for a while. But I’m truly excited for the multi-modal (image+video+text) possibilities once Gemini is released, especially on mobile devices when we’re out in the world.

Meeting notes & voice transcription:

Another very useful thing we can now do with GPT is transcribe and summarize our meetings (predicting the next word is even easier, when the software can just listen for it!). Recording and transcribing meetings is quickly becoming the new standard. There are many tools that do this, so I won’t go into them all here. I have highlighted three that you are likely to see.

A quick note on the law and on etiquette.

Many U.S. states require that you notify the participants of a discussion when it is being recorded, and some require everyone’s consent. So make sure to let people know that you are recording and transcribing if you use one of these tools. Many AI transcription tools will let you connect right into Zoom so that the tool joins the meeting as if it’s a participant. If you do this, that’s generally legally compliant (because people are notified). But it’s good practice to ask the others if it’s ok, and to make sure you know how to turn it off if someone says no.

Otter:

https://otter.ai/

I use Otter every day. With Otter, I can connect directly to Zoom and get a recording of my meetings. The transcriptions are almost always completely perfect and it even keeps track of multiple participants as they speak. There’s also a mobile app that I often use to record and transcribe my own thoughts and notes. I’ve written large chunks of my book, “The Future of Teams”, this way – talking to Otter while I drive and then polishing the transcription later. I love it.

Fireflies.ai:

https://fireflies.ai/

I don’t use Fireflies, but I’ve seen it connect to many of my meetings along with other participants. Like Otter, it’s a meeting transcription tool. I can’t comment directly since I haven’t used it, but people like it.

Zoom:

https://zoom.us/

You’re probably already using Zoom, but you may not have heard that Zoom has added the ability to get a GPT-driven meeting summary. A bit of caution: Zoom’s privacy policy gives them permission to do virtually anything they want with your meeting’s words, including selling them to third parties. I wouldn’t use Zoom’s meeting summary feature for privacy policy reasons.

Deepfaking, voice-cloning, & language translation:

Eleven Labs:

https://elevenlabs.io/

This year, I started playing around with voice cloning, discovered ElevenLabs, and it instantly blew my mind. With ElevenLabs, you can do three things:

  1. Speech Synthesis (GOOD): Generate lifelike-sounding speech from the provided voices in the voice library. You’ve probably seen the ability to pick a voice from a list in other tools, like where Tessa (UK) reads your turn-by-turn directions for you in your favorite maps app. But with ElevenLabs, you can type any text and get the synthesized voice back. These are very real-sounding voices. If you are trying to narrate a training video, this is ideal. It sounds pretty good and will only keep getting better.

  2. Speech Cloning (STILL A LITTLE ROUGH): With speech cloning, you can upload a 3-5 minute audio clip OF YOURSELF, and ElevenLabs will train a synthesized voice to talk like you. It’s not terrible. But when you, or people who know you, hear it, you’ll recognize instantly that it doesn’t sound precisely like you. It’s close, but it’s still in the uncanny valley, sounding a little weird. Not quite usable, but improving fast.

  3. Language Dubbing in 50 languages (GREAT): Once you’ve cloned your own speech (via #2), you can then instantly translate audio of yourself speaking into any of 50 possible languages. While synthesized-Dan doesn’t quite sound enough like me to be useful, synthesized-Dan speaking Spanish sounds just like I’m actually speaking Spanish. It’s magic.

All of that is just audio, which is useful for podcasts, but for video you’ll need more. That’s where HeyGen comes in.

HeyGen:

https://www.heygen.com/

HeyGen works with ElevenLabs and generates synthesized and translated video (if you want video, you don’t need both; just go straight to HeyGen). With HeyGen, you can do the same three things as with ElevenLabs, but with deepfaked video as an amazing added bonus.

  1. Video Synthesis (GOOD): Just like with ElevenLabs, you can select a lifelike, video avatar of an algorithmically-generated narrator. You can feed the algorithm text and you’ll get a video of them saying your words. These are emotionless robots, so if you’re an influencer or thought-leader, you probably won’t want to use them for much. But the use cases will continue to expand as the quality improves. 

  2. Avatar Cloning (STILL A LITTLE ROUGH): Again, just like with ElevenLabs, this is early in its life. The audio here comes from ElevenLabs, so the results should be precisely the same. HeyGen also attempts to train a video avatar of YOU from a 5-minute video clip. After you have this avatar, you can upload a paragraph of text, and a few minutes later, you’ll get a video clip of YOU saying those words. It’s ok, but again, it’s probably not up to influencer-quality live video. If you try this, keep your hands in a neutral pose (as with the avatars above)... if you don’t, you’ll end up with randomized hand gestures that don’t match your new words.

  3. Language Dubbing (GREAT): Check out the clip below of me speaking in English, Spanish, Hindi, and Mandarin. This was done using HeyGen language dubbing. The first 30 seconds is a real recording of me speaking English. That 30 seconds was then translated and dubbed into the other languages, and it’s absolutely good enough to use. I can’t recommend HeyGen enough if you’re attempting to serve international audiences. This is the future of international collaboration.

Huge thanks to Matt Wolfe (with both his YouTube channel and his site FutureTools.io) for giving me the latest in AI news nearly every single day! Couldn’t keep up without you, Matt!
