Considerations about Chat-GPT and Large Language Models

"What does ChatGPT mean for you?"

“Just deploy a large language model” has become the new answer since ChatGPT was released last November. “What does ChatGPT mean for you? Is it going to replace years of development of your in-house solution?” are questions asked by potential clients and investors.

In our team, we have worked with a multitude of NLP models, ranging from small to large, and have been thinking about many of these questions for some time. This article is an attempt to answer a few of them directly by sharing the reflections and discussions that have taken place in our team since the advent of ChatGPT, one recent product of the general trend towards large language models (LLMs).

ChatGPT is an extension of GPT-3 (Generative Pre-trained Transformer 3), specially trained with reinforcement learning to hold conversations and answer questions in any domain. GPT-3 is a transformer-based autoregressive model, which aims to predict the next token given a sequence of tokens. GPT’s decoder-only architecture is less complex than alternatives such as encoder-decoder architectures. It is trained with a unidirectional pre-training objective, that is, only the context on one side (the preceding tokens) is considered when predicting the next token, as opposed to a bidirectional pre-training objective, where the context on both sides of the token is considered.
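The difference between the two objectives can be pictured as a visibility mask over token positions. The following is a minimal sketch in plain Python, not taken from any particular model implementation; the function names and the example sentence are hypothetical.

```python
def causal_mask(n):
    """Unidirectional (GPT-style): position i may attend to positions 0..i only."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Bidirectional (BERT-style): every position may attend to every position."""
    return [[1] * n for _ in range(n)]


tokens = ["the", "market", "fell", "sharply"]
for name, mask in [("causal", causal_mask(len(tokens))),
                   ("bidirectional", bidirectional_mask(len(tokens)))]:
    print(name)
    for token, row in zip(tokens, mask):
        # List the tokens that are visible from this position.
        visible = [t for t, m in zip(tokens, row) if m]
        print(f"  {token!r} sees {visible}")
```

In the causal case the lower-triangular mask hides everything to the right of a position, which is what makes next-token prediction a meaningful training signal.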

The less complex architecture and learning objective allow training very large models, with more than a hundred billion parameters, on huge amounts of documents. Those models are supposed to work well in few-shot learning, that is, given a few examples of an unseen task, they can perform decently on that task. They are particularly well suited for language generation tasks, including conversational question-answering. The current trend in developing large language models is moving in the direction of generative decoder models.

YUKKA Lab develops a real-time news analytics product with two main use cases: helping users find relevant information in the financial domain, and developing financial models and scores based on insights obtained from the content of financial news articles through NLP. Concretely, we analyse the world’s news for relevant business events when it comes to risk, underwriting, sustainability or investment. The core of our NLP analysis is thus text mining for real-world entities and the relations between those entities in business-relevant events, as well as entity-based sentiment detection. Document retrieval, question-answering, and summaries are also relevant features for us, especially regarding the first use case.


Is Chat-GPT well suited for real-time question answering in a continuously changing world?

While GPT-like models seem to be very well suited for conversational question-answering and summary generation, they come with a range of drawbacks. ChatGPT, as a general-purpose question-answering system, can provide a plausible answer to almost every question in impressive human-like wording, sounding extremely convincing, at least to non-experts. ChatGPT is a generator-only Open Domain Question Answering (ODQA) system (see Zhang et al. 2022 for a categorization of ODQA systems). Generator-only ODQA systems encode the corpus of documents in the parameters of the model, which means that they are only able to answer questions about the documents they have been trained on.

In addition, training these systems is so costly that it cannot be carried out frequently. Consequently, they cannot keep up with a dynamic reality and quickly become outdated. A further drawback is that the accuracy of the responses is not high enough to replace proper research on any topic that we seriously want to become acquainted with.


Example of Chat-GPT not keeping up to date: its knowledge is limited to facts that occurred until September 2021.

Specific use cases and a high velocity of data pose additional challenges to this model architecture. Concretely, for YUKKA Lab, the following questions arise. With millions of new articles coming in every day, how can such a model be made to work dynamically? Truth changes over time, so how can we make sure that answers hold in the present, especially when dealing with a historic timeline of events? Retriever-reader models, which can also build on LLMs, attempt to tackle these problems in a promising way. These models first retrieve a set of documents that possibly contain the answer and then extract or generate the answer, taking those documents as additional input. The answer accuracy of those models is considerably higher than that of generator-only models (see Zhang et al. 2022 for a survey of ODQA).
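The two-step retriever-reader idea can be sketched as follows. This is a toy illustration under stated assumptions, not our production setup: the retriever simply counts overlapping tokens (real systems typically use sparse or dense indexes), the reader step only assembles the input that would be handed to an extractive or generative model, and the corpus, company name and function names are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, documents, k=2):
    """Toy retriever: rank documents by token overlap with the question."""
    q = Counter(tokenize(question))
    scored = []
    for doc_id, text in documents.items():
        d = Counter(tokenize(text))
        overlap = sum(min(q[t], d[t]) for t in q)
        scored.append((overlap, doc_id))
    # Highest overlap first; ties are broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for overlap, doc_id in scored[:k] if overlap > 0]


def build_reader_input(question, documents, doc_ids):
    """Combine the retrieved passages and the question into one reader input."""
    passages = "\n".join(f"[{doc_id}] {documents[doc_id]}" for doc_id in doc_ids)
    return f"Context:\n{passages}\n\nQuestion: {question}\nAnswer:"


# Hypothetical mini-corpus; in a live system new articles are indexed continuously.
documents = {
    "a1": "Acme Corp appointed Jane Doe as chief executive in March.",
    "a2": "Oil prices rose after the production cut was announced.",
    "a3": "Acme Corp shares fell after the chief executive resigned.",
}

question = "Who is the chief executive of Acme Corp?"
doc_ids = retrieve(question, documents)
print(build_reader_input(question, documents, doc_ids))
```

Because the corpus lives outside the model parameters, keeping up with new articles means updating the index rather than retraining the model, and the document ids in the reader input can be shown to the user as sources.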

A further advantage of those systems is that they provide the documents to the users and possibly highlight the spans where the answer has been found, so that the answer can be verified, ensuring reliability and promoting trust on the user’s side. Some examples of retriever-reader models can be found in Shuster et al. 2021, Glaese et al. 2022, and Nakano et al. 2022. At YUKKA Lab we have also experimented with such a retriever setting for text summarization, by providing the relevant documents to ChatGPT and asking it to create a summary, obtaining very satisfactory results.

Are GPT-like models well suited for language understanding tasks?

When we look at our second use case, developing financial models and scores, we need structured data, obtained by information extraction tasks. For any task related to acquiring and storing factual knowledge about the entities of interest, we need to transform the text into structured information. Classification and extractive question-answering tasks are best suited for this kind of use case. 

Encoder-only models, such as BERT, have proved very successful in language understanding tasks, such as classification and extractive question-answering. Their use is widespread in industrial applications because of their good performance with only small sets of labeled data and their feasibility (unlike LLMs with billions of parameters, any small or medium-sized enterprise can afford to run its own instance of BERT). Since BERT-like models learn the embedding of a token by taking into consideration the context from both directions, with the masked language model pre-training objective, they may provide a better representation of meaning than the one obtained with a unidirectional learning objective, which only considers the context from one side.

The current trend in developing large language models, however, leans towards decoder-only or encoder-decoder models, to the point that some authors even consider encoder-only models “deprecated” (e.g., Tay et al. 2022). Language understanding tasks are also possible with decoder-only models such as GPT-like models: the input is formulated as a prompt and the label is generated as the answer. Still, the current recommendation from Hugging Face is to use GPT-like models for text generation tasks.
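To make the prompt formulation concrete, here is a minimal sketch of how a classification task might be phrased for a generative model. The prompt wording, the label set and both function names are assumptions chosen for illustration; no model is called, the code only builds the prompt and maps a generated string back onto the label set.

```python
def build_few_shot_prompt(examples, text, labels):
    """Format labeled examples and a new input as one classification prompt."""
    lines = [f"Classify the sentiment of each sentence as one of: {', '.join(labels)}.", ""]
    for sentence, label in examples:
        lines.append(f"Sentence: {sentence}")
        lines.append(f"Sentiment: {label}")
        lines.append("")
    # The prompt ends where the model is expected to generate the label.
    lines.append(f"Sentence: {text}")
    lines.append("Sentiment:")
    return "\n".join(lines)


def parse_label(generated, labels, default=None):
    """Map the free-text generation back onto the closed label set."""
    words = generated.strip().lower().split()
    first = words[0].strip(".,!") if words else ""
    return first if first in labels else default


labels = ["positive", "negative", "neutral"]
examples = [
    ("Profits rose by ten percent.", "positive"),
    ("The company filed for insolvency.", "negative"),
]
print(build_few_shot_prompt(examples, "Shares fell after the announcement.", labels))
```

The parsing step matters in practice: unlike a classifier head, a generative model is not constrained to the label set, so unexpected outputs need a fallback.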

There are few comparative results on the performance of decoder-only versus encoder-only models in text classification or sentence similarity tasks. An interesting comparison of different text embeddings, including those of ChatGPT and SentenceBERT, for text similarity tasks can be found in a recent blog article.

After carrying out several experiments, the author concludes that sentence transformer models significantly outperform ST5 and ChatGPT in both quality and price on this type of task. A further comparison can be found in Leippold (2023). The author compares ChatGPT with 6-shot learning against FinBERT fine-tuned for sentiment detection in the financial domain. While ChatGPT only reaches an F1 score of around 72%, FinBERT reaches 89%.

One could argue that ChatGPT has only seen a limited number of examples of the task. While true, from an industrial point of view an F1 score of around 70% is not an option if we want customers to trust our solution. The requirement is rather >90%, especially for precision. It seems, thus, that labeled data still has some time to live. It is very impressive that such good results can be achieved with unsupervised learning, but for now, we will still need to stick to supervised learning if we want our applications to provide value.
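For readers less familiar with the metrics, precision, recall and F1 for a single class can be computed as in the sketch below. The labels are made-up toy data, and the function name is hypothetical; the cited comparisons may aggregate over classes differently (for example macro or weighted averaging), so this illustrates the definitions rather than reproducing their evaluation.

```python
def precision_recall_f1(gold, predicted, positive):
    """Precision, recall and F1 for one class, from parallel label lists."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if p == positive and g == positive)
    fp = sum(1 for g, p in pairs if p == positive and g != positive)
    fn = sum(1 for g, p in pairs if p != positive and g == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy data: five sentences with gold and predicted sentiment labels.
gold = ["pos", "pos", "neg", "neg", "pos"]
predicted = ["pos", "neg", "pos", "neg", "pos"]
precision, recall, f1 = precision_recall_f1(gold, predicted, positive="pos")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision is singled out in the text because a false positive, such as a negative event wrongly attached to a company, is directly visible to the customer.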

A few other questions arise: Would a GPT-like model outperform BERT for language understanding if we fine-tuned it with the same data? Can the size of GPT-like models eventually compensate for a less adequate learning objective or architecture regarding language understanding tasks? Will the size of the models at some point make supervised learning superfluous? But even if all those questions were to be answered affirmatively, the main issue to be addressed would still remain: feasibility.

Such large models require enormous computational resources (e.g., a single instance of GPT-3 needs around 700 GB of GPU memory to run, and Megatron-Turing around 2.1 TB) and are slow at inference time (around 500-900 ms per paragraph, it seems), so that at this point in time it does not seem manageable for any small or medium-sized business to use them in production. This cost factor is not negligible, so if these large models are to be adopted broadly in industry, this will need to come hand in hand with cheaper and more accessible computational resources.
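The memory figures can be sanity-checked with simple arithmetic. The sketch below assumes that each parameter is stored as a 32-bit float (4 bytes) and counts the weights only, ignoring activations and any runtime overhead; the parameter counts of 175 billion for GPT-3 and 530 billion for Megatron-Turing NLG come from the published model descriptions, not from this article. Under these assumptions the results line up with the numbers quoted above.

```python
def weight_memory_gb(num_parameters, bytes_per_parameter=4):
    """Memory needed for the weights alone, in gigabytes (1 GB = 10**9 bytes)."""
    return num_parameters * bytes_per_parameter / 1e9


# Assumed parameter counts (from the published model descriptions).
models = {"GPT-3": 175e9, "Megatron-Turing NLG": 530e9}

for name, params in models.items():
    fp32 = weight_memory_gb(params)                         # 4 bytes per parameter
    fp16 = weight_memory_gb(params, bytes_per_parameter=2)  # half precision
    print(f"{name}: {fp32:.0f} GB in fp32, {fp16:.0f} GB in fp16")
```

Storing weights in half precision halves the requirement, which is still far beyond a single commodity GPU.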

Towards universal models

Finally, an interesting direction in current research is “universal models”, a single-model-fits-all-tasks approach. Decoder-only and encoder-decoder architectures are the only ones considered adequate for a universal model. Experiments suggest that decoder-only models trained to predict the next token are best at zero-shot tasks in comparison with encoder-decoder models or prefix decoder models (Wang et al. 2022). However, after multi-task fine-tuning, encoder-decoder models trained with a masked-language-model objective perform much better at zero-shot tasks. Another finding by this group of researchers is that a decoder-only model trained to predict the next token can be efficiently adapted with masked language modeling, with the goal of obtaining a model that is both generative and multitask.

Tay et al. 2022 propose a mixture of training objectives: left-to-right language modeling, prefix language modeling, and span corruption. The authors argue that encoder-decoder models should be preferred over decoder-only models if there are no concerns about storage, since they perform significantly better and have similar speed, although they have twice the number of parameters. It seems, thus, that the encoder and decoder components complement each other and that models benefit from a mixture of training objectives. One option in the future could be to have a single model and use only the decoder for language generation tasks and only the encoder for language understanding tasks.
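The three objectives differ mainly in how a piece of text is split into model input and prediction target. The following is a simplified sketch of that split, not the authors' implementation: the function names are hypothetical, the span positions are fixed by hand rather than sampled, and the `<extra_id_N>` sentinel naming is borrowed from the T5 convention purely for illustration.

```python
def left_to_right(tokens):
    """Causal LM: no separate input; every token is predicted from its left context."""
    return [], list(tokens)


def prefix_lm(tokens, split):
    """Prefix LM: the first `split` tokens are visible input, the rest is the target."""
    return list(tokens[:split]), list(tokens[split:])


def span_corruption(tokens, spans):
    """Span corruption: replace each (start, end) span with a sentinel in the input;
    the target lists each sentinel followed by the tokens it replaced."""
    inputs, targets, cursor = [], [], 0
    for i, (start, end) in enumerate(sorted(spans)):
        sentinel = f"<extra_id_{i}>"
        inputs.extend(tokens[cursor:start])
        inputs.append(sentinel)
        targets.append(sentinel)
        targets.extend(tokens[start:end])
        cursor = end
    inputs.extend(tokens[cursor:])
    return inputs, targets


tokens = "the central bank raised interest rates again today".split()
for name, (inp, tgt) in [
    ("left-to-right", left_to_right(tokens)),
    ("prefix LM", prefix_lm(tokens, split=4)),
    ("span corruption", span_corruption(tokens, spans=[(1, 3), (5, 6)])),
]:
    print(f"{name}:\n  input:  {' '.join(inp)}\n  target: {' '.join(tgt)}")
```

Seen this way, mixing the objectives amounts to mixing different input/target splits of the same text during pre-training.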

In summary, our take is that we will keep closely following developments in this area and be prepared to adopt those models if they turn out to be beneficial for some of our tasks. We already see that they are well suited for specific text generation and summarization tasks, where we see a good fit for planned product extensions. Regarding language understanding, we need to evaluate whether generative models are suitable for text classification and extractive question-answering tasks, that is, whether they perform better than our current models or have the potential to do so.

In 2024 the OpenGPT-X model will be released, an open-source generative model for European languages. If the availability of computational resources increases and it turns out that such models perform well at our tasks, then we will be among the first to embrace them. In addition, we pursue several research collaborations with universities, some of them doing foundational research on LLMs and related technologies. With our significant expertise in the NLP field, we see many opportunities to fuse our current developments with these new developments.

  • Zhang et al. 2022, A Survey for Efficient Open Domain Question Answering
  • Shuster et al. 2021, Retrieval Augmentation Reduces Hallucination in Conversation
  • Glaese et al. 2022, Improving alignment of dialogue agents via targeted human judgements
  • Nakano et al. 2022, WebGPT: Browser-assisted question-answering with human feedback
  • Wang et al. 2022, What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization?
  • Tay et al. 2022, UL2: Unifying Language Learning Paradigms
  • Leippold 2023, Sentiment Spin: Attacking Financial Sentiment with GPT-3