Recently, we launched our new project search and explained how our 'active learning' method makes it easier to quickly find relevant results. But why don't we simply search by keywords? Why use an abstract instead?
Choosing the right keywords for your search can be tricky. What if a potential collaborator from another discipline uses a different phrase for the same thing? Or what if the same keywords are used in very different contexts? A business economist associates something different with the word 'curator' than an art historian or a classicist would.
We had cases such as these in mind when we were developing Semantic Search for Impacters’ Project database. Impacters’ search algorithms use Artificial Intelligence to attempt to understand the meaning of what you’re looking for, and suggest relevant research projects.
Technologically, this version of semantic search is based on so-called Sentence Transformers models. These models take the context of words into account for short passages of text of up to around 100 words (128 tokens).
This means that we can match a text:
Provided with a huge corpus, is it possible for someone to detect an item that's almost the same to a certain input? Preparing the corpus to readily identify items that are identical is coined ‘closeness finding’.
To an actual project, containing mostly synonyms of the original:
Given a large dataset, how can one find a similar item to a given query? Preprocessing the dataset to quickly find these similar items is called “similarity search”.
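For readers who like to see the idea in code, here is a minimal sketch of how such a match can be scored. A sentence embedding model turns each text into a list of numbers (a vector), and texts with similar meaning end up with vectors that point in a similar direction. A common way to measure this is cosine similarity, which is also the measure used in the Sentence-BERT paper. Everything specific below is an assumption made for illustration: the three-number vectors are invented stand-ins for real embeddings (which have hundreds of dimensions), and the names `cosine_similarity`, `rank_projects` and the project titles are hypothetical, not part of Impacters' actual code.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction, so treat it as unrelated.
        return 0.0
    return dot / (norm_a * norm_b)


def rank_projects(query_vector, project_vectors):
    """Sort projects from most to least similar to the query vector."""
    scored = [
        (title, cosine_similarity(query_vector, vector))
        for title, vector in project_vectors.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented toy vectors standing in for real sentence embeddings.
query = [0.9, 0.1, 0.2]
projects = {
    "similarity search": [0.8, 0.2, 0.1],
    "museum curation": [0.1, 0.9, 0.3],
    "supply chain finance": [0.2, 0.1, 0.9],
}

ranking = rank_projects(query, projects)
for title, score in ranking:
    print(f"{title}: {score:.2f}")
```

With these toy numbers, the project whose vector points in nearly the same direction as the query comes out on top, even though no keywords were compared at any point.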
And we can do it across 52 languages!
The following languages are supported, in alphabetical order:
Albanian, Arabic, Armenian, Bulgarian, Burmese, Catalan (Valencian), Chinese, Chinese (Taiwan), Croatian, Czech, Danish, Dutch/Flemish, Estonian, Finnish, French, French (Canadian), Galician, Georgian, German, Greek (Modern), Gujarati, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Kurdish, Latvian, Lithuanian, Macedonian, Malay, Marathi, Mongolian, Norwegian Bokmål, Persian, Polish, Portuguese, Portuguese (Brazil), Romanian/Moldavian/Moldovan, Russian, Serbian, Slovak, Slovenian, Spanish (Castilian), Swedish, Thai, Turkish, Ukrainian, Urdu and Vietnamese.
Our approach builds on Reimers, N. and Gurevych, I. (2019), Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. Available at: https://arxiv.org/pdf/1908.10084.pdf