Florence-2 explained: Microsoft's AI for complex image processing
Florence-2 is primarily aimed at mastering several “vision tasks” at the same time. A vision task is, for example, classifying images or performing object detection, i.e. determining where objects are located in an image. It also includes counting objects or segmenting objects in images. Whereas previously a separate neural network, or more generally a separate AI model, was trained for each of these tasks, Florence-2 can perform several of them at once. Florence-2 is a multimodal AI model: it can merge images, text and label information with one another.
Lidar information
Lidar information is used in self-driving cars, for example, to determine the distance of objects from the car. Lidar data can be combined with ordinary images of the road scene, which show, among other things, whether pedestrians are present, to create new information. In RAG models (RAG stands for Retrieval Augmented Generation), i.e. models that let users query their own files with the help of large language models, texts are likewise made available together with tables and images.
Multimodal AI models
Multimodal models are currently all the rage. A modality can be, for example, spoken language, images, videos or even lidar information. Modality in multimodal AI therefore refers to the different types of data or input that an AI system can process. Multimodal AI models integrate and analyze these different modalities simultaneously to achieve more comprehensive and accurate results.
Two-dimensional image processing with Florence-2
The tasks that Florence-2 handles can be plotted along two dimensions: semantic granularity versus spatial hierarchy. Imagine an image of an ordinary street scene placed at the origin. Semantic granularity is refined upwards along the y-axis: from the image via a label to a caption, i.e. a short description, through to a detailed description in which individual text passages match individual scenes in the image. The spatial hierarchy runs along the x-axis: from the image level to individual scenes or regions in the image down to the pixel level, at which segmentation can be carried out, i.e. for each pixel it can be determined whether it belongs to a certain class.
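The two axes can be illustrated with the task prompts that Florence-2 actually understands (they appear later in this article and in the Hugging Face demo). The placement of each task on the two axes in the sketch below is my own interpretation for illustration, not an official taxonomy.

```python
# Sketch: Florence-2 task prompts placed on the two axes
# (semantic granularity, spatial hierarchy). The mapping is an
# interpretation for illustration, not an official taxonomy.
TASK_AXES = {
    "<CAPTION>":               ("short text", "whole image"),
    "<DETAILED_CAPTION>":      ("fine-grained text", "whole image"),
    "<MORE_DETAILED_CAPTION>": ("finest-grained text", "whole image"),
    "<OD>":                    ("labels", "regions (bounding boxes)"),
    "<REFERRING_EXPRESSION_SEGMENTATION>": ("phrase", "pixel level"),
}

def describe(task: str) -> str:
    """Return a one-line summary of where a task sits on the two axes."""
    granularity, spatial = TASK_AXES[task]
    return f"{task}: semantic granularity={granularity}, spatial hierarchy={spatial}"
```

For example, `describe("<OD>")` places object detection at coarse semantic granularity but at the region level of the spatial hierarchy.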
In addition to the model, Microsoft has also created its own large training dataset, FLD-5B.
Figure 1: The y-axis shows the semantic granularity from the simple image upwards to the fine-grained, textual representation of the individual image contents. The x-axis shows the spatial hierarchy or the level of the images from the simple image to the segmentation.
From “Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks” page 1. https://arxiv.org/pdf/2311.06242
Florence-2 in practice: compact, flexible and ready to use
Florence-2 is therefore a multitask model with a “unified architecture”: the user simply specifies in code which task the model should perform, via the so-called “task_prompt”.
The model is very small compared to other current models and can even run on a cell phone. To be precise, two models are available: “microsoft/Florence-2-base” and “microsoft/Florence-2-large”. Both run on a CPU as well as on a GPU. Microsoft also immediately published the dataset used to train the model: FLD-5B contains over five billion annotations for roughly 126 million images.
The model can be tested on Hugging Face, or the Python source code can simply be copied into a Jupyter notebook and executed locally. Both variants are very easy to try out.
- Huggingface: https://huggingface.co/microsoft/Florence-2-large
- Huggingface demo: https://huggingface.co/spaces/gokaygokay/Florence-2
- Github: https://github.com/retkowsky/florence-2/blob/main/Florence-2%20-%20Advancing%20a%20Unified%20Representation%20for%20a%20Variety%20of%20Vision%20Tasks.ipynb
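The local variant can be sketched in a few lines. The snippet below follows the usage shown on the Hugging Face model card; the helper `build_prompt` is my own name for the simple concatenation of task prompt and optional free text (needed for tasks such as “CAPTION_TO_PHRASE_GROUNDING”). Model loading is kept inside a separate function because it downloads several gigabytes on first use.

```python
def build_prompt(task_prompt: str, text_input: str = "") -> str:
    """Combine a Florence-2 task prompt with optional free text,
    e.g. build_prompt("<CAPTION_TO_PHRASE_GROUNDING>", "a lion")."""
    return task_prompt + text_input

def caption_image(image_path: str, task_prompt: str = "<CAPTION>") -> dict:
    """Run one Florence-2 task on a local image.
    Heavy: downloads the model on first call
    (pip install transformers torch pillow)."""
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    model_id = "microsoft/Florence-2-base"
    model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

    image = Image.open(image_path)
    inputs = processor(text=build_prompt(task_prompt), images=image,
                       return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Converts the raw token string into a task-specific result,
    # e.g. a caption string or a dict with 'bboxes' and 'labels'.
    return processor.post_process_generation(
        raw, task=task_prompt, image_size=(image.width, image.height)
    )
```

Calling `caption_image("street_scene.jpg", "<OD>")` would return the object-detection dictionary shown further below.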
One of the special features is that Florence-2 enables automatic data labeling. What does that mean? If, for example, you have a classification task and want the AI to distinguish between dogs and cats – or workpieces with and without defects – then you need as many sample images as possible, each with a label. The label describes what can be seen in the image. This lets the AI, which initially knows nothing, learn from matching examples: if the image shows a dog and the corresponding label is “dog”, everything fits. If the model instead says “cat”, the prediction is wrong and must be corrected during training.
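As a sketch of how such auto-labeling could be used in practice: the helpers below are my own illustrative code, not part of Florence-2. They assign a class label to each image by checking which class name appears in a caption generated by the model; images with no match are flagged for manual review.

```python
def label_from_caption(caption: str, classes: list) -> "str | None":
    """Pick the first class whose name appears in a generated caption;
    return None when no class matches (image needs manual review)."""
    lowered = caption.lower()
    for cls in classes:
        if cls.lower() in lowered:
            return cls
    return None

def auto_label(captions: dict, classes: list) -> dict:
    """Map image names to labels derived from their generated captions."""
    return {name: label_from_caption(cap, classes) for name, cap in captions.items()}

captions = {
    "img1.jpg": "A brown dog running across a meadow",
    "img2.jpg": "A cat sleeping on a sofa",
    "img3.jpg": "A tree in a park",
}
# auto_label(captions, ["dog", "cat"])
# -> {'img1.jpg': 'dog', 'img2.jpg': 'cat', 'img3.jpg': None}
```

In a real pipeline, the captions would come from Florence-2’s “CAPTION” task instead of being hard-coded.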
Last but not least, Florence-2 has very good zero-shot learning qualities. Zero-shot learning describes the ability of an AI model to master tasks without having been explicitly trained with examples. The model generalizes knowledge from other, known categories.
Here are a few illustrations that show how the model works. You can simply specify the desired task via the “task_prompt” in code (task_prompt = '<CAPTION>') or make a selection, as shown here for the Hugging Face model.
The desired task can simply be selected in the dropdown.
If you select the “CAPTION” option, you will receive a simple description of the image.
If you select another option, such as “MORE_DETAILED_CAPTION”, you will also receive a detailed description of the image.
The task “MORE_DETAILED_CAPTION” was selected here. An image with a diseased leaf was uploaded. The model automatically added the caption.
“The image shows a close-up of a green leaf with multiple small brown spots scattered across its surface. The spots appear to be of different sizes and shapes, and they are clustered together in a circular pattern. The leaf appears to be healthy and well-maintained, with no visible signs of wear or damage. The background is blurred, but it seems to be a garden or a natural setting with other plants and foliage visible.”
In the next example, “OD” for Object Detection was selected.
The task “OD” was chosen here. The objects in the picture are to be found, framed and named. Conveniently, the coordinates are returned as well:
{'bboxes': [[560.6400146484375, 344.5760192871094, 916.9920654296875, 878.0800170898438],
            [279.0400085449219, 140.8000030517578, 665.0880126953125, 843.2640380859375]],
 'labels': ['bear', 'lion']}
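The returned dictionary can be processed directly. The small helper below is my own illustrative code, not part of Florence-2; it pairs each label with its box in (x1, y1, x2, y2) order and rounds the coordinates to whole pixels:

```python
def pair_detections(result: dict) -> list:
    """Pair each label with its (x1, y1, x2, y2) box, rounded to pixels."""
    return [
        (label, tuple(round(c) for c in box))
        for label, box in zip(result["labels"], result["bboxes"])
    ]

# Abbreviated output from the "OD" example above:
detections = {
    "bboxes": [[560.64, 344.58, 916.99, 878.08],
               [279.04, 140.80, 665.09, 843.26]],
    "labels": ["bear", "lion"],
}
# pair_detections(detections)
# -> [('bear', (561, 345, 917, 878)), ('lion', (279, 141, 665, 843))]
```

These integer corners can then be passed straight to a drawing routine to frame the objects in the image.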
Another example is “CAPTION_TO_PHRASE_GROUNDING”, which allows you to enter a text describing the object to be found in the image. In this case, a lion.
In this context, “grounding” means the ability of an AI model to link textual information with visual content in images. It is about linking text (e.g. descriptions or terms) with specific areas or objects in an image so that the model understands which text refers to which image content.
In the “CAPTION_TO_PHRASE_GROUNDING” task, you can specify in text form which object should be searched for in the image. In this case, you should search for “a lion”.
How does it all work?
Let’s take a quick look at the sketch to understand how Microsoft Florence-2 was implemented.
Microsoft generated encodings and embeddings for images and texts, more precisely for the multi-task prompts. These visual embeddings were fed into a transformer together with the text and location embeddings. Location embeddings represent, for example, the bounding boxes for object detection. The transformer then outputs text and location tokens, which are used to create the associations between image elements and text elements. As described above, different granularities are possible.
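The location tokens mentioned above can be illustrated with a small sketch. In the Florence-2 paper, coordinates are quantized into discrete bins and represented as additional tokens in the vocabulary; the bin count of 1,000 follows the paper, while the "<loc_N>" token naming below is my own illustrative notation.

```python
def bbox_to_location_tokens(bbox, image_size, num_bins: int = 1000) -> list:
    """Quantize an (x1, y1, x2, y2) box into discrete location tokens.
    num_bins=1000 follows the Florence-2 paper; the "<loc_N>" token
    naming here is illustrative, not the exact implementation."""
    width, height = image_size
    tokens = []
    for coord, extent in zip(bbox, (width, height, width, height)):
        bin_index = min(int(coord / extent * num_bins), num_bins - 1)
        tokens.append(f"<loc_{bin_index}>")
    return tokens

# A box covering the upper-left quadrant of a 1024x1024 image:
# bbox_to_location_tokens((0, 0, 512, 512), (1024, 1024))
# -> ['<loc_0>', '<loc_0>', '<loc_500>', '<loc_500>']
```

Because boxes become short token sequences like these, the same transformer that generates captions can also generate object locations.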
Architecture of Florence-2 as presented in the article “Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks”. Page 3.
Florence-2 and AGI
AGI stands for “Artificial General Intelligence”. This refers to a form of artificial intelligence that is capable of performing a wide range of tasks that require human understanding and intelligence. Unlike specialized AI systems that only work well in narrow application areas, an AGI would be able to perform intellectual tasks in various domains as well as or even better than a human.
At least three variants can be distinguished here:
- In the case of specialized AI systems, the AI is credited only with so-called island talents, and this is usually no longer considered remarkable. Of course, an AI can play chess or Go extremely well, carry out automated quality checks for hours using image analysis, or, like Microsoft Translator, translate a sentence into over a hundred languages. But there are many things it cannot do as well as a human.
- In the second variant, an AI can do everything a human can do and at least as well. How many years it will take before we get there is written in the stars. At the moment, humans are still far ahead.
- In the third variant, the AI can do a great deal that humans can, but can also process much more, such as ultrasound, infrared or other information that is not directly accessible to us.
In all three variants, however, the question also arises whether a single AI can perform several tasks simultaneously, just as we humans process tasks in parallel. Florence-2 is definitely a fascinating approach to such a capability.