Florence-2 primarily aims to master multiple “vision tasks” simultaneously. A “vision task” is, for example, classifying images or performing object detection, i.e., determining where objects are located in an image. Counting objects or segmenting objects in images also belong to this category. While separate neural networks, or AI models in general, were previously trained for each of these tasks, Florence-2 can perform several of them at once. Florence-2 is a multimodal AI model, as it can combine images, texts, and label information.
Lidar information
Lidar information is used, for example, in self-driving cars to determine the distance of objects from the vehicle. Lidar data can be combined with ordinary camera images of the road situation, which show, among other things, whether pedestrians are on the road, in order to derive new information. RAG models (RAG stands for Retrieval-Augmented Generation) make it possible to query your own files, including texts with tables and images, using large language models.
Multimodal AI models
Multimodal models are currently the talk of the town. A modality can be, for example, spoken language, images, videos, or even Lidar information. Modality in multimodal AI thus refers to the different types of data or input an AI system can process. Multimodal AI models integrate and analyze these different modalities simultaneously to achieve more comprehensive and accurate results.
In the tasks that Florence-2 handles, two dimensions can be compared: semantic granularity and spatial hierarchy. Imagine an image of a typical street scene placed at the origin of a coordinate system. Semantic granularity increases along the y-axis: from the image to a label, to a caption, i.e., a brief description, up to a detailed description in which individual text passages correspond to specific scenes in the image. Spatial hierarchy is plotted on the x-axis: from the image level to individual scenes or regions in the image, down to the pixel level, where segmentation determines for each pixel whether it belongs to a specific class.
In addition to the model, Microsoft has also created its own large training dataset, FLD-5B.
Figure 1: On the y-axis, semantic granularity is shown, from the simple image up to a fine-grained textual representation of the individual image contents. On the x-axis, the spatial hierarchy is plotted, from the whole image down to pixel-level segmentation.
From “Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks” page 1. https://arxiv.org/pdf/2311.06242
Florence-2 is thus a multi-task model with a “unified architecture” that allows the user to specify the task the model should perform simply via a so-called “task_prompt” in the code.
The model is very small compared to other current models and can even be used on mobile phones. Specifically, two models are available: “microsoft/Florence-2-base” and “microsoft/Florence-2-large”. Both run on a CPU as well as on a GPU. The dataset with which Microsoft trained the model was released at the same time; it contains 126 million images. That sounds like a lot, but compared to the datasets used for larger models it is rather modest.
The model can be tested at Huggingface, or the Python source code can simply be run locally in a Jupyter notebook. Both variants are straightforward to try out.
- Huggingface: https://huggingface.co/microsoft/Florence-2-large
- Huggingface demo: https://huggingface.co/spaces/gokaygokay/Florence-2
- Github: https://github.com/retkowsky/florence-2/blob/main/Florence-2%20-%20Advancing%20a%20Unified%20Representation%20for%20a%20Variety%20of%20Vision%20Tasks.ipynb
One of the special features is that Florence-2 enables automatic data labeling. What does that mean? If, for example, you have a classification task and want to use AI to distinguish dogs from cats, or components with and without defects, you need as many example images as possible, each with a label. The label describes what can be seen in the image. This way the AI, which initially knows nothing, can learn by comparing with matching examples: if the picture shows a dog and the corresponding label is “dog,” everything fits; if the label says “cat,” something does not fit and still has to be learned.
Last but not least, Florence-2 brings excellent zero-shot learning qualities. Zero-shot learning describes the ability of an AI model to handle tasks without having been explicitly trained with examples for them. The model generalizes knowledge from other, known categories.
Here are a few examples that show how the model works. In the code, you can simply specify the desired task in a “task_prompt” (e.g., task_prompt = '<CAPTION>') or, as here with the Huggingface demo, make a selection from a dropdown.
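For local experiments, the following is a minimal sketch of how such a task prompt can be passed to the model in Python, closely following the usage example on the Huggingface model card; the image URL is only a placeholder, and details such as the generation parameters may vary in your setup.

from PIL import Image
import requests
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
# trust_remote_code is needed because Florence-2 ships its own model and processor code
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Placeholder image: replace the URL or open a local file instead
url = "https://example.com/street_scene.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The task is chosen purely via the task prompt, e.g. <CAPTION>, <MORE_DETAILED_CAPTION>, <OD>
task_prompt = "<CAPTION>"
inputs = processor(text=task_prompt, images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )

generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# The Florence-2 processor converts the raw output into a task-specific result dictionary
result = processor.post_process_generation(
    generated_text, task=task_prompt, image_size=(image.width, image.height)
)
print(result)

Changing the string in task_prompt, for example to “<MORE_DETAILED_CAPTION>” or “<OD>”, is all it takes to switch to another task.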
The desired task can simply be selected in the dropdown.
Selecting the “CAPTION” option gives you a simple image description.
If you choose another option, such as “MORE_DETAILED_CAPTION,” you also get a detailed image description.
Here, the task “MORE_DETAILED_CAPTION” was chosen. An image with a diseased leaf was uploaded, and the model automatically generated the following caption:
“The image shows a close-up of a green leaf with multiple small brown spots scattered across its surface. The spots appear to be of different sizes and shapes, and they are clustered together in a circular pattern. The leaf appears to be healthy and well-maintained, with no visible signs of wear or damage. The background is blurred, but it seems to be a garden or a natural setting with other plants and foliage visible.”
In the next example, “OD” was selected for object detection.
Here, the task “OD” was chosen. The objects in the picture are to be found, framed, and named. Conveniently, you can also easily get the coordinates:
{'bboxes': [[560.6400146484375, 344.5760192871094, 916.9920654296875, 878.0800170898438], [279.0400085449219, 140.8000030517578, 665.0880126953125, 843.2640380859375]], 'labels': ['bear', 'lion']}
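If you want to visualize such a result yourself, the bounding boxes can be drawn onto the image with a few lines of Pillow. This is only an illustrative sketch: the file name image.jpg is a placeholder, and the coordinates are the (rounded) values from the output above.

from PIL import Image, ImageDraw

# Result as returned by the <OD> task (values rounded from the output above)
detection = {
    "bboxes": [
        [560.64, 344.58, 916.99, 878.08],
        [279.04, 140.80, 665.09, 843.26],
    ],
    "labels": ["bear", "lion"],
}

image = Image.open("image.jpg").convert("RGB")  # placeholder path: the image passed to the model
draw = ImageDraw.Draw(image)

for (x1, y1, x2, y2), label in zip(detection["bboxes"], detection["labels"]):
    draw.rectangle([x1, y1, x2, y2], outline="red", width=3)  # frame the object
    draw.text((x1, max(0, y1 - 12)), label, fill="red")       # write the label above the box

image.save("image_with_boxes.jpg")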
As another example, “CAPTION_TO_PHRASE_GROUNDING” is worth mentioning: it allows you to provide a text describing the object to be found in the image. In this case, a lion.
In this context, “grounding” means the ability of an AI model to link textual information with visual content in images. It’s about connecting text (e.g., descriptions or terms) with specific areas or objects in an image so that the model understands which text refers to which image content.
For the task “CAPTION_TO_PHRASE_GROUNDING,” you can textually specify which object should be searched for in the image. In this case, it should search for “a lion.”
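In code, this search phrase is simply appended to the task prompt. The following sketch again follows the pattern from the model card; the model and processor are loaded exactly as in the caption example above, and the image path and the phrase are just examples.

from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("lion.jpg")  # placeholder path
task_prompt = "<CAPTION_TO_PHRASE_GROUNDING>"
text_input = "a lion"  # the phrase to be grounded in the image

# Task prompt and additional text input are concatenated into a single prompt
inputs = processor(text=task_prompt + text_input, images=image, return_tensors="pt")
generated_ids = model.generate(
    input_ids=inputs["input_ids"],
    pixel_values=inputs["pixel_values"],
    max_new_tokens=1024,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
result = processor.post_process_generation(
    generated_text, task=task_prompt, image_size=(image.width, image.height)
)
print(result)  # bounding box(es) and label for the phrase "a lion"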
How does it all work?
Let’s take a quick look at the architecture sketch from the paper to understand how Microsoft implemented Florence-2.
Microsoft generates encodings and embeddings for the images and for the texts of the multi-task prompts. The visual embeddings are fed into a transformer together with the text and location embeddings. Location embeddings represent, for example, the bounding boxes used in object detection. The transformer then outputs text and location tokens, which establish the associations between image elements and text elements. As described above, this works at different levels of granularity.
Architecture of Florence-2 as presented in the paper “Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks”. Page 3.
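To make this data flow a little more tangible, here is a deliberately simplified toy sketch in PyTorch. It is not the real implementation (Florence-2 uses a DaViT image encoder and a much larger encoder-decoder, and its location tokens are part of the vocabulary), but it illustrates the principle: visual embeddings and prompt embeddings are concatenated into one input sequence, and the model generates a sequence of text and location tokens from it.

import torch
import torch.nn as nn

D = 256        # shared embedding width (toy size)
VOCAB = 1000   # toy vocabulary: text tokens plus quantized location tokens

class ToyFlorenceLikeModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for the image encoder (Florence-2 uses DaViT): one embedding per image patch
        self.vision_encoder = nn.Linear(3 * 16 * 16, D)
        # Text and location tokens share a single vocabulary and embedding table
        self.token_embedding = nn.Embedding(VOCAB, D)
        # Standard encoder-decoder transformer
        self.transformer = nn.Transformer(d_model=D, batch_first=True)
        # Predicts the next text/location token
        self.lm_head = nn.Linear(D, VOCAB)

    def forward(self, patches, prompt_ids, decoder_ids):
        visual = self.vision_encoder(patches)                # (batch, num_patches, D)
        prompt = self.token_embedding(prompt_ids)            # (batch, prompt_len, D)
        encoder_input = torch.cat([visual, prompt], dim=1)   # multimodal input sequence
        decoded = self.transformer(encoder_input, self.token_embedding(decoder_ids))
        return self.lm_head(decoded)                         # logits over text + location tokens

model = ToyFlorenceLikeModel()
patches = torch.randn(1, 196, 3 * 16 * 16)      # a 14 x 14 grid of toy image patches
prompt_ids = torch.randint(0, VOCAB, (1, 8))    # e.g. the tokenized task prompt
decoder_ids = torch.randint(0, VOCAB, (1, 16))  # previously generated output tokens
logits = model(patches, prompt_ids, decoder_ids)
print(logits.shape)  # torch.Size([1, 16, 1000])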
Florence-2 and AGI
AGI stands for “Artificial General Intelligence”. This refers to a form of artificial intelligence capable of performing a wide range of tasks that require human understanding and intelligence. Unlike specialized AI systems that only work well in narrow application areas, an AGI could handle intellectual tasks in various fields as well as, or even better than, a human.
At least three variants can be distinguished here.
- With specialized AI systems, the AI already has what could be called “island talents”, i.e., isolated areas in which it excels, even if this is rarely discussed. Of course, an AI can play chess or Go extremely well, run automated quality checks on images for hours, or, like Microsoft Translator, translate a sentence into more than a hundred languages. But there are also many things it cannot do as well as humans.
- In the second variant, an AI can do everything a human can do, and at least as well. How long it will take before we get there is anyone’s guess. Currently, humans are still far ahead.
- In the third variant, AI can do much of what humans can do, but beyond that considerably more, for example processing ultrasound, infrared, or other information that is not directly accessible to our senses.
In all three variants, however, the question also arises whether a single AI can perform several tasks simultaneously, just as we can handle tasks in parallel. Florence-2 is definitely a fascinating approach to such a capability.