• djoldman 33 minutes ago

    This is a key observation that is simple and intuitive:

    >All CLIP-like models perform poorly on mixed-modality search due to a phenomenon known as the modality gap. As illustrated in the figure below, the closest vector to the snippet “I address you, members of the Seventy-Seventh Congress…” is not its screenshot, but other texts. This leads to search results that are skewed towards items of the same modality; in other words, text vectors will be closer to irrelevant texts than relevant images in the embedding space.
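
    Here's a toy numpy sketch of what that looks like (the vectors are made up, not the model's actual embeddings): the query text scores higher against an unrelated text than against its own screenshot.

      import numpy as np

      def cos(a, b):
          return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

      # Made-up embeddings: text vectors cluster together, image vectors
      # sit in a different region of the space (the modality gap).
      query_text   = np.array([0.9, 0.1, 0.0])
      other_text   = np.array([0.8, 0.2, 0.1])  # irrelevant text
      matching_img = np.array([0.1, 0.9, 0.2])  # relevant screenshot

      print(cos(query_text, other_text))    # higher: same modality
      print(cos(query_text, matching_img))  # lower: cross-modality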

    • FergusArgyll 2 hours ago

      I'm missing something. Shouldn't any LLM that's 'natively multimodal' somehow include embeddings which are multimodal? For example, here's Google's blog post on Gemini:

        Until now, the standard approach to creating multimodal models involved 
        training separate components for different modalities and then stitching them 
        together to roughly mimic some of this functionality. These models can 
        sometimes be good at performing certain tasks, like describing images, but  
        struggle with more conceptual and complex reasoning.
      
        We designed Gemini to be natively multimodal, pre-trained from the start on 
        different modalities. Then we fine-tuned it with additional multimodal data to 
        further refine its effectiveness. This helps Gemini seamlessly understand and 
        reason about all kinds of inputs from the ground up, far better than existing 
        multimodal models — and its capabilities are state of the art in nearly every 
        domain.
      • aabhay an hour ago

        LLM embeddings contain superpositions of many concepts, so while they might predict the next token, they don't actually outperform contrastively pretrained embedding models.
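
        For contrast, here's a minimal numpy sketch of the CLIP-style contrastive objective those embedding models are trained with (toy random batch, and only one direction of the symmetric loss):

          import numpy as np

          def clip_style_loss(text_emb, image_emb, temperature=0.07):
              # Normalize, then score every text against every image in the batch.
              t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
              v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
              logits = t @ v.T / temperature
              # Matched pairs sit on the diagonal; maximize their log-softmax score.
              log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
              labels = np.arange(len(t))
              return -log_probs[labels, labels].mean()

          # Toy batch of 4 paired (text, image) embeddings.
          rng = np.random.default_rng(0)
          print(clip_style_loss(rng.normal(size=(4, 8)), rng.normal(size=(4, 8))))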

      • djoldman 39 minutes ago

        This is a cool way to look at multimodal embeddings. They look at how performance changes as the percentage of inputs slides from one modality to another:

        https://i0.wp.com/blog.voyageai.com/wp-content/uploads/2024/...
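
        Roughly, the sweep amounts to something like this (a toy harness with random data; the corpus construction and metric are my guesses, not their actual benchmark):

          import numpy as np

          def retrieval_accuracy(query_vecs, doc_vecs, relevant_idx):
              # Fraction of queries whose nearest document is the relevant one.
              sims = query_vecs @ doc_vecs.T
              return float((sims.argmax(axis=1) == relevant_idx).mean())

          def sweep_modality_mix(queries, text_docs, image_docs, steps=11):
              # Replace a growing fraction of the text corpus with the image
              # version of the same items and re-measure retrieval accuracy.
              n = len(text_docs)
              results = []
              for k in np.linspace(0, n, steps, dtype=int):
                  docs = np.vstack([image_docs[:k], text_docs[k:]])
                  results.append((k / n, retrieval_accuracy(queries, docs, np.arange(n))))
              return results

          rng = np.random.default_rng(0)
          q = rng.normal(size=(50, 16))
          txt = q + 0.1 * rng.normal(size=q.shape)   # text side: close to queries
          img = q + 0.5 * rng.normal(size=q.shape)   # image side: noisier (gap)
          print(sweep_modality_mix(q, txt, img))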

        • Zopieux 22 minutes ago

          API-only model. No thanks but congrats anyway.

          • carschno 4 hours ago

            This does read as very impressive. Any critical perspectives on the presented evaluation? What about non-English text?

            I understand the model is, like other commercial ones, available exclusively through their API, right?

            • stephantul 4 hours ago

              Yes, voyage models are API only.

              There was a part here about multilingualism but that was wrong! Sorry!

              FWIW: Voyage also has separate `law`, `code`, and `finance` models. See [1]

              Really cool results, anyway.

              [1]: https://docs.voyageai.com/docs/embeddings

              • fzliu 3 hours ago

                Glad you liked the results! We do have multilingual models (and rerankers) -- voyage-3, in particular, is multilingual: https://blog.voyageai.com/2024/09/18/voyage-3/

                voyage-multimodal-3 is multilingual as well, supporting the same set of languages as voyage-3.
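
                If it helps, a text embedding call looks roughly like this with the Python client (see the docs for the exact parameter names):

                  import voyageai

                  vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

                  result = vo.embed(
                      ["I address you, members of the Seventy-Seventh Congress..."],
                      model="voyage-3",
                      input_type="document",
                  )
                  print(len(result.embeddings[0]))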

                • stephantul 3 hours ago

                  Sorry for spreading false information. I edited the post above.

                  It is interesting that you're not as up front about multilingualism compared to Cohere. They seem to mention it a lot, which led to my confusion.

                  • fzliu 3 hours ago

                    No worries at all. That's great feedback and an area of improvement for us when it comes to future posts -- we'll be more explicit about multilingualism in blogs and in our docs.

            • unit149 3 hours ago

              In the traditional Python API, the Voyage engine tokenizes blocks of text and outputs a sequence of tokens. This model seems to extend that by vectorizing images in the same space.

              Words like 'you' and 'apple' will each be a single token. More complex terms like 'pikachu' may be divided into pieces like pik-a-chu.

              [1]: https://docs.voyageai.com/docs/tokenization
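
              You can see that subword behavior with any off-the-shelf tokenizer. A generic Hugging Face example (not Voyage's actual tokenizer, just an illustration of how splitting works):

                from transformers import AutoTokenizer

                # Illustration only: a generic WordPiece vocabulary, not Voyage's own.
                tok = AutoTokenizer.from_pretrained("bert-base-uncased")
                print(tok.tokenize("you apple pikachu"))
                # Common words map to single tokens; rarer ones split into subword pieces.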

              • mech4lunch an hour ago

                The colab measures dot product values of 0.428 and 0.498, describing them as "...similarity value is quite high." Is that high? Can you design a system that confidently labels data with a 0.4 threshold?
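
                A raw score like 0.4 only means something relative to a score distribution, so I'd pick the cutoff from labeled pairs rather than trust a fixed number. Something like this (made-up scores, not the colab's data):

                  import numpy as np

                  # Hypothetical similarity scores for labeled relevant / irrelevant pairs.
                  relevant   = np.array([0.49, 0.43, 0.51, 0.46, 0.40])
                  irrelevant = np.array([0.22, 0.31, 0.28, 0.19, 0.25])

                  # Sweep thresholds and keep the one with the best balanced accuracy.
                  def balanced_acc(t):
                      return 0.5 * ((relevant >= t).mean() + (irrelevant < t).mean())

                  best = max(np.linspace(0, 1, 101), key=balanced_acc)
                  print(best, balanced_acc(best))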

                • greatgib an hour ago

                  Indeed, it's sad that their models are both commercial/proprietary and API-only.