In this article, we will discuss the current state of the OpenAI API, some context about how GPT models work, and how to implement the API in code.
Topics:
- Introduction
- API models general descriptions
- GPT
- GPT models
- Tokens
- API implementations
Introduction
OpenAI is an artificial intelligence research lab that develops AI technologies. Founded as a non-profit in 2015 by Elon Musk and Sam Altman, among others, it has evolved into a capped-profit model to scale its impact. OpenAI is renowned for its natural language processing and machine learning breakthroughs, including the GPT series and DALL·E.
Nowadays, there are several available APIs and two ways to use them: via raw API endpoint calls, which work with any programming language, or via the official "openai" Python module developed by OpenAI themselves, Python being the leading language in the AI field. Understanding either method provides the knowledge to handle both. That said, this article explains the implementation using the Python module.
The available APIs are:
- Chat: It is widely recognized and extensively used. It allows the user to complete any prompt, from finishing a phrase to responding to questions. It provides the flexibility to specify the response format and even set instructions for generating the output via system instructions. While it is possible to add a thread-like context to it, it is more suitable for one-time tasks; there is a better option for context-heavy applications called the Assistant API.
- Assistant: It’s similar to Chat API but is more suitable for context-heavy or conversational tasks. It can retrieve files, review code, and even initiate function calls when encountering a task beyond its capabilities. Designed for context-heavy tasks, it has built-in context and threads, making it the optimal choice for sustaining a conversation. Moreover, there’s the option to pre-load instructions to configure its behavior, personality, and responses as per requirement, making it ideal for chatbots.
- Moderation: This tool is based on the same text models but fine-tuned to take text and check how the content is classified in hate, hate/threatening, self-harm, sexual, sexual/minors, violence, and violence/graphic categories. Each category is assigned a score from 0 to 1. This tool is useful for reviewing comments, reviews, messages, and alerts when necessary.
- Image: Nowadays, we can access DALL-E in its versions DALL-E 2 and DALL-E 3. It is a neural-network-based image generation system that takes text as input and produces a URL with a generated image. It can create images from scratch, alter existing images by inpainting (providing a mask), or develop entirely new images based on another image as the input. The returned image URL is temporary and expires after about an hour.
- Embeddings: This model takes text as input and produces a vector that encapsulates the context information of the input. This vector can then be manipulated to conduct mathematical calculations, allowing us to measure its distance in relation to other vectors and determine their semantic relatedness. This feature is handy for performing natural language searches or receiving recommendations, such as for movies, books, or even products in a store.
- Whisper: This is a speech recognition model that lets the user input an audio file; it can identify the language and translate or transcribe its content to text.
- TTS: Accepts text as the input and outputs an audio file containing the speech conversion.
OpenAI’s Codex model powers GitHub’s Copilot, an AI pair programmer.
GPT
GPT models are large language models (LLMs). GPT stands for Generative Pre-trained Transformer, a type of neural network (NN) built from an input layer, hidden layers, and an output layer.
There are some types of Neural Networks:
- Convolutional NN: great for analyzing images.
- Recurrent NN: works well for processing text and translation, but it handles one word at a time, which makes long pieces of text hard to process, training slow, and parallelization of training impossible because of the sequential processing.
- Transformers: really good at text processing and a relatively recent approach (2017). Because they process the entire input at once, there is less risk of forgetting the previous context, and they can be trained in parallel, which is much faster.
Two key innovations make Transformers so powerful, one of which is positional encoding. Instead of dealing with one word at a time, it encodes positional data, labeling the input with positional information and storing the word order in the actual data itself rather than in the network structure. This way, the network learns the significance of word order from the data.
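As an illustrative sketch (plain numpy, following the sinusoidal scheme from the original Transformer paper rather than any specific GPT implementation), positional encoding turns each position index into a vector that gets added to the token embedding:

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal positional encoding from "Attention Is All You Need"."""
    positions = np.arange(seq_len)[:, np.newaxis]    # (seq_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]         # (1, d_model)
    angle_rates = 1 / np.power(10000, (2 * (dims // 2)) / d_model)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])      # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])      # odd dimensions: cosine
    return encoding                                  # added to the token embeddings

print(positional_encoding(seq_len=4, d_model=8).round(2))
```

Because each row is unique and varies smoothly with position, the word order travels with the data itself instead of being baked into the network structure.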
The other innovation is self-attention, a form of "attention": a mechanism that allows the network to selectively focus on certain parts of the data, ignoring others, and dynamically adjusting the focus as it processes the data. Self-attention is a unique form of attention; each element is compared to every other, and attention weights are computed based on the similarity and relation between each pair of elements. The "self" part refers to the same sequence that is currently being encoded.
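As a minimal illustration (plain numpy, a single head with random projection weights rather than learned ones), scaled dot-product self-attention can be written in a few lines; every position attends to every position of the same sequence:

```python
import numpy as np

def self_attention(x: np.ndarray) -> np.ndarray:
    """Toy single-head self-attention; x has shape (seq_len, d_model).
    Real models learn separate projection matrices for Q, K, and V."""
    rng = np.random.default_rng(0)
    d = x.shape[-1]
    w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(d)                    # pairwise similarity
    weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # softmax
    return weights @ v                               # weighted mix of the values

x = np.random.default_rng(1).standard_normal((5, 8))  # 5 tokens, 8 dimensions
print(self_attention(x).shape)                        # (5, 8)
```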
The two main models of the GPT family are GPT-3.5-turbo and GPT-4 (plus the GPT-4-turbo variant).
The model utilizes a vast amount of data to determine the most likely token (explained below) to be generated next. It considers the preceding token and all tokens in the sequence before it. Utilizing the probability function created by the LLM, it selects one of the most probable subsequent tokens and incorporates it into the sequence. This process is similar to the “typing” animation, where text sequentially appears on the OpenAI chat website. Although some initially perceived this as a gimmick, it’s actually how the model generates sequences – one token at a time.
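To make this loop concrete, here is a toy sketch of the generation process; the tiny vocabulary and the random "probability function" stand in for the real model:

```python
import numpy as np

rng = np.random.default_rng(42)
vocab = ["the", "cat", "sat", "on", "mat", "."]

def next_token_probs(sequence: list[str]) -> np.ndarray:
    """Stand-in for the LLM: returns a probability for every vocabulary
    token given the sequence so far (here just random numbers)."""
    logits = rng.standard_normal(len(vocab))
    return np.exp(logits) / np.exp(logits).sum()   # softmax over the vocabulary

sequence = ["the"]
for _ in range(5):                                 # generate 5 tokens
    probs = next_token_probs(sequence)
    token = rng.choice(vocab, p=probs)             # sample one of the likely tokens
    sequence.append(token)                         # ...and extend the sequence
print(" ".join(sequence))
```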
GPT-3.5-turbo
It is a fine-tuned version of the GPT-3 model, trained on around 45 TB of data. Just storing the model itself takes 800 GB.
Training it initially cost about $4.6M in GPU compute and used nearly 500 billion tokens of training data.
It was trained on a quality-filtered subset of the Common Crawl dataset, an expanded version of the WebText dataset, all outbound links from Reddit with >3 karma, two databases of online books, and the entire English-language Wikipedia.
It outputs a maximum of 4,096 tokens, is trained on data up to September 2021, and has a context window of 16,385 tokens.
GPT-4 (and GPT-4-turbo)
It is a totally different model from GPT-3.5 and is estimated to have been trained on over 1 petabyte of data.
GPT-4-turbo is the most powerful model. It has been trained with data up to December 2023, has a context window of 128,000 tokens, and outputs a maximum of 4,096 tokens.
On the other hand, GPT-4's context window holds just 8,192 tokens, and its most recent snapshot (gpt-4-0613) dates from June 13th, 2023.
Tokens
The pricing of GPT and other text-processing models is measured in tokens; the amount depends on how much data is being processed. Each model counts the tokens it uses with its own encoding system, and GPT-3.5 and GPT-4 use the same encoding.
Tokens are not restricted to whole words. Special characters like the question mark can consume an entire token, and long words, e.g., "hamburgers," cost a total of 3 tokens (h-amburg-ers).
Each model’s input and output tokens have their own cost, and the cheapest model for text generation is GPT-3.5-turbo.
OpenAI provides a tool called tiktoken to count the tokens in an input, making it possible to calculate how much a request would cost.
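For example, assuming tiktoken is installed (pip install tiktoken):

```python
import tiktoken

# Get the encoding used by a given model
encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

text = "How many tokens does this sentence use?"
tokens = encoding.encode(text)

print(tokens)       # the token IDs
print(len(tokens))  # the token count, used to estimate the cost
```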
Chat API
Code example (Python):
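A minimal call with the openai Python module (v1.x), assuming the OPENAI_API_KEY environment variable is set:

```python
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Finish this phrase: to be or not to"},
    ],
)

print(response.choices[0].message.content)
```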
In this example, the chat doesn't keep track of the context. We can implement that by keeping a messages array, appending the user prompts and the chat's responses, and passing it via the messages property, as shown below.
It accepts parameters like temperature, frequency_penalty, presence_penalty, n, and max_tokens.
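A minimal sketch of that context-keeping loop, with a few of those parameters set:

```python
from openai import OpenAI

client = OpenAI()
messages = [{"role": "system", "content": "You are a helpful assistant."}]

while True:
    prompt = input("You: ")
    if prompt.lower() == "quit":
        break
    messages.append({"role": "user", "content": prompt})  # keep the user turn

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,      # the whole history provides the context
        temperature=0.7,
        max_tokens=256,
        presence_penalty=0.5,
    )

    answer = response.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})  # keep the reply
    print("Assistant:", answer)
```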
Assistant API
It outperforms the Chat API in context-heavy conversational applications, knowledge management, and uses involving information that hasn’t been utilized to train the LLM, such as company-related data or post-training date information.
The platform leverages other OpenAI models, offering persistent threads, built-in context, and the capability to access files in various formats. It can concurrently access tools such as the code interpreter, knowledge retrieval, and function calling. The latter allows creators to program functions for the model to call when necessary, including web-scraping data, invoking an API, reading a file, or querying a database for information.
It is currently in beta, so there are some bugs. For example, the links to generated files, such as Excel or PDF files, don’t work at times.
To utilize this system effectively: create an Assistant by assigning it a name; select a model like GPT-3.5-turbo or GPT-4; provide instructions on how it should respond and its personality; define user interaction guidelines including “I don’t know” responses if applicable; specify its role; and determine output formatting based on creator preferences.
When a thread is created, it can be initiated with a set of messages or left empty.
Then, the user is prompted for a message, and a corresponding message with its role, as described in the Chat API section, is generated and added to the thread using its thread ID.
Then the run is executed with the assistant and thread IDs; it powers the execution of the assistant on a thread, including textual responses and multi-step tool use.
When the run is completed, the last message is retrieved from the message list. The sketch below puts these steps together.
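A minimal sketch with the beta Assistants endpoints of the openai v1.x module (the math-tutor setup is just an example, and the polling loop is simplified):

```python
import time
from openai import OpenAI

client = OpenAI()

# 1. Create the assistant: name, model, instructions, and personality
assistant = client.beta.assistants.create(
    name="Math Tutor",
    instructions="You are a friendly math tutor. Answer briefly.",
    model="gpt-3.5-turbo",
)

# 2. Create a thread (empty here, but it could start with messages)
thread = client.beta.threads.create()

# 3. Add the user's message to the thread using its ID
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="What is 12 * 7?"
)

# 4. Execute a run of the assistant on the thread
run = client.beta.threads.runs.create(thread_id=thread.id, assistant_id=assistant.id)

# 5. Poll until the run finishes, then read the last message
while run.status in ("queued", "in_progress"):
    time.sleep(1)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

messages = client.beta.threads.messages.list(thread_id=thread.id)
print(messages.data[0].content[0].text.value)  # the newest message comes first
```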
In order to maintain context and avoid creating multiple instances of the assistant and thread, it would be beneficial to introduce an AssistantManager class responsible for tracking the created instances.
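A sketch of such a class; AssistantManager and its ask method are hypothetical names for illustration, not part of the OpenAI module:

```python
import time
from openai import OpenAI

class AssistantManager:
    """Illustrative wrapper: one assistant, one persistent thread."""

    def __init__(self, name: str, instructions: str, model: str = "gpt-3.5-turbo"):
        self.client = OpenAI()
        self.assistant = self.client.beta.assistants.create(
            name=name, instructions=instructions, model=model
        )
        self.thread = self.client.beta.threads.create()  # created only once

    def ask(self, prompt: str) -> str:
        self.client.beta.threads.messages.create(
            thread_id=self.thread.id, role="user", content=prompt
        )
        run = self.client.beta.threads.runs.create(
            thread_id=self.thread.id, assistant_id=self.assistant.id
        )
        while run.status in ("queued", "in_progress"):  # same polling as above
            time.sleep(1)
            run = self.client.beta.threads.runs.retrieve(
                thread_id=self.thread.id, run_id=run.id
            )
        messages = self.client.beta.threads.messages.list(thread_id=self.thread.id)
        return messages.data[0].content[0].text.value

manager = AssistantManager("Math Tutor", "You are a friendly math tutor.")
print(manager.ask("What is 12 * 7?"))
print(manager.ask("And divided by 2?"))  # same thread, so context is kept
```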
To activate custom tools, they need to be provided during the creation of the assistant. This could involve any data retrieval, from loading a .txt file to web scraping. Handling the tool output is a more intricate subject; detailed documentation can be found on the OpenAI website.
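For instance, tools are declared at creation time roughly like this; the get_stock_price function is a made-up example:

```python
from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Data Helper",
    instructions="Use the provided tools when you need external data.",
    model="gpt-4",
    tools=[
        {"type": "code_interpreter"},       # built-in tool
        {
            "type": "function",             # custom function tool
            "function": {
                "name": "get_stock_price",  # hypothetical function name
                "description": "Return the latest price for a ticker symbol.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "ticker": {"type": "string", "description": "e.g. AAPL"}
                    },
                    "required": ["ticker"],
                },
            },
        },
    ],
)
```

When the model decides to call get_stock_price, the run pauses and waits for the tool output to be submitted back; that flow is covered in detail in the OpenAI documentation.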
Images API
It's a neural-network-based image generation system, trained on over 250 million images and their associated text descriptions.
Currently, there are two available models: DALL-E 2 and DALL-E 3, each with distinct features. These models can create images from scratch based on a given text prompt.
| Feature | DALL-E 2 | DALL-E 3 |
| --- | --- | --- |
| Batch size | Up to 10 | 1 |
| Image sizes (px) | 256×256, 512×512, or 1024×1024 | 1024×1024, 1024×1792, or 1792×1024 |
| Edit images (inpaint) | Yes | No |
| Variations | Yes | No |
While DALL-E 2 offers a broader range of features and can generate multiple images simultaneously, enabling users to create variations and make edits, DALL-E 3 excels in image quality and supports "hd" and "standard" quality parameters. DALL-E 3 also takes the prompt provided by the user and automatically rewrites it, both for safety reasons and to add more detail for a higher-quality result (the updated prompt is visible in the revised_prompt field of the response object).
Code example (Python):
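A minimal generation sketch with DALL-E 3 (the prompt is just an example; the returned URL is temporary):

```python
from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="A watercolor painting of a lighthouse at sunset",
    size="1024x1024",
    quality="standard",  # "hd" is also available for DALL-E 3
    n=1,                 # DALL-E 3 only supports one image per request
)

print(response.data[0].url)             # temporary link to the image
print(response.data[0].revised_prompt)  # the automatically rewritten prompt
```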
For image variations:
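A sketch assuming an original.png exists in the working directory (variations are a DALL-E 2 only feature):

```python
from openai import OpenAI

client = OpenAI()

# The input must be a square PNG under 4 MB
response = client.images.create_variation(
    image=open("original.png", "rb"),
    n=2,              # DALL-E 2 supports up to 10 images per request
    size="1024x1024",
)

for item in response.data:
    print(item.url)
```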
For Inpainting, the model requires two files: the original image and a PNG mask indicating the area for regeneration.
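A sketch assuming original.png and mask.png exist locally; the transparent areas of the mask mark the region to regenerate:

```python
from openai import OpenAI

client = OpenAI()

response = client.images.edit(
    image=open("original.png", "rb"),
    mask=open("mask.png", "rb"),   # transparent pixels = area to regenerate
    prompt="A small sailboat on the water",
    n=1,
    size="1024x1024",
)

print(response.data[0].url)
```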
Embeddings
An embedding is a vector (or list) of floating point numbers generated from a text string input. The mathematical distance between two vectors measures their relatedness; small distances suggest high relatedness.
They are widely used in Vector Databases to search, cluster, recommend, classify, and detect anomalies in the data.
The most performant models at the moment are text-embedding-3-small, which generates a vector of 1536 dimensions, and text-embedding-3-large, which generates a vector of 3072 dimensions. An older model, text-embedding-ada-002, is also available. The dimension can be reduced by passing in the dimensions parameter.
Larger embeddings generally cost more and consume more compute, memory, and storage. Both of the newer models, however, were trained in a way that lets developers shorten the embeddings without losing their concept-representing properties, trading off performance against cost: on the MTEB (Massive Text Embedding Benchmark), a text-embedding-3-large embedding shortened to 256 dimensions still outperforms the older text-embedding-ada-002 at its full size of 1536.
Code example (Python):
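A minimal sketch that embeds three strings and compares them with cosine similarity (the example texts are arbitrary):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text,
        # dimensions=256,  # optionally shorten the vector
    )
    return np.array(response.data[0].embedding)

a = embed("A recipe for chocolate cake")
b = embed("How to bake a dessert")
c = embed("Quarterly stock market report")

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity: higher means more semantically related."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine(a, b))  # relatively high: related topics
print(cosine(a, c))  # relatively low: unrelated topics
```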
Whisper
The Audio API provides the "Whisper" model, a speech recognition (speech-to-text) model. It exposes two endpoints: translation, which transcribes the audio and translates it into English, and transcription, which transcribes the audio in whatever language it is in, automatically recognizing the language.
Audio files are limited to 25 MB, and the supported types are mp3, mp4, mpeg, mpga, m4a, wav, and webm.
Code example (Python):
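A sketch of both endpoints, assuming a local speech.mp3 file:

```python
from openai import OpenAI

client = OpenAI()

audio = open("speech.mp3", "rb")

# Transcription: text in the same language as the audio
transcript = client.audio.transcriptions.create(model="whisper-1", file=audio)
print(transcript.text)

# Translation: transcribes and translates the audio into English
audio.seek(0)  # rewind the file handle before reusing it
translation = client.audio.translations.create(model="whisper-1", file=audio)
print(translation.text)
```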
Moderation
The Moderations API is a valuable tool for assessing the potential harm of text. Users can leverage this tool to detect content that could be harmful and then take appropriate actions, such as filtering messages or even banning a user.
It processes the text and generates a score from 0 to 1 for each category (listed below; a code sketch follows the list):
- hate
- hate/threatening
- harassment
- harassment/threatening
- self-harm
- self-harm/intent
- self-harm/instructions
- sexual
- sexual/minors
- violence
- violence/graphic
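A minimal sketch with the openai Python module (the input string is just an example):

```python
from openai import OpenAI

client = OpenAI()

response = client.moderations.create(input="I want to hurt someone.")
result = response.results[0]

print(result.flagged)                   # True if any category is triggered
print(result.categories.violence)       # boolean verdict per category
print(result.category_scores.violence)  # raw score between 0 and 1
```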
This service is free for most developers. For optimal precision, it is advisable to segment the text into chunks of fewer than 2,000 characters.
Please note that support for non-English languages is currently limited.
Currently, the available content moderation models include text-moderation-stable and text-moderation-latest (this one is the default if not specified).
The latest model undergoes automatic updates over time to guarantee you use the most precise model. Conversely, the stable model remains relatively consistent over time, with advance notice provided before any updates.
TTS
This API provides a speech endpoint based on the TTS (text-to-speech) model. It currently has two versions: tts-1, optimized for speed, and tts-1-hd, optimized for quality.
Six built-in voices are available, which can be used to narrate text, produce spoken audio in different languages, or provide real-time audio output via streaming.
Available voices are Alloy, Echo, Fable, Onyx, Nova, and Shimmer, optimized for English.
The endpoint takes three key inputs: the model it should use, the text that is going to be turned into speech, and the voice to use for the audio generation. It defaults to “mp3” files, but other formats like “opus,” “aac,” “flac,” and “pcm” are available.
There is no current way to manipulate the emotion of the generated audio, but certain things like punctuation, caps, or grammar may affect how it’s generated. There is no support for custom voices at the moment.
Code example (Python):
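A minimal sketch with the openai v1.x Python module (stream_to_file saves the generated audio to disk):

```python
from openai import OpenAI

client = OpenAI()

response = client.audio.speech.create(
    model="tts-1",             # or "tts-1-hd" for higher quality
    voice="alloy",
    input="Hello! This text will be converted into speech.",
    # response_format="opus",  # defaults to "mp3"
)

response.stream_to_file("speech.mp3")  # save the audio file
```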
Conclusion
Drawing insights from Rich Sutton's "The Bitter Lesson," it's clear that sheer computational power, coupled with an ever-growing body of data, holds powerful sway in AI research. GPT-3, with its ability to handle straightforward tasks at high speed and lower cost, exemplifies this perfectly. Moreover, the fact that it fooled 48% of the subjects into believing its text was human-authored in a recent experiment indicates just how advanced this AI model has become.
Some critics may question whether this approach of throwing more data and processing power at AI problems is sustainable or effective in the long term. However, applications like Viable, which provide instant summaries of aggregated feedback, reflect the raw potential and applications that these advanced AI systems deliver.
In conclusion, the evolution of OpenAI echoes the dynamism and potential that AI holds. As we continue to explore, experiment, and harness its power, the future promises to be nothing short of transformative for businesses, industries, and individual lives alike.