The language of digitalisation is structural

We produce enormous amounts of information whose lifecycle is often quite short. A painstakingly produced text ends up buried in one of many systems, and finding the desired information in them is not easy. According to a McKinsey study, knowledge workers spend as much as 19% of their workweek searching for and gathering information.

In addition to information being hidden away, the way we format our information also presents problems: for the most part, we produce information in the form that works best for the human eye, as free-form text. The automatic processing of unstructured information is difficult, particularly in Finnish and other inflection-rich languages. Lingsoft's technology makes it possible to express the desired message as free-form text and then convert it into a machine-readable form using language structure analysis.

Metadata improves discoverability

Language digitalisation is not just about replacing a pen with a keyboard; it is about structuring and enriching texts. Merely dividing a text into headings and paragraphs, as modern-day patient record systems do, is not enough. One way to digitalise language and improve the discoverability of information is text indexing, the process of enriching a document with detailed metadata, i.e. data about data. For example, we analysed and enriched all electronic patient record data, over 260 million texts, word for word for the Hospital District of Southwest Finland.

This enabled their search engine to find many more documents, including those where a given search word appears in an inflected form or as part of a compound word. The search results can be ranked according to the desired criteria, taking into account similarities not only between individual words but also between entire texts. Likewise, the user can choose what not to look for, thus preventing overly general phenomena from interfering with the search.
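As an illustration of why base-form indexing helps, here is a minimal sketch in Python. It is not Lingsoft's technology: the hand-written LEMMAS table is a hypothetical stand-in for real morphological analysis, and the example documents are invented. Each word is indexed under its base form and, for compounds, under the base forms of its parts.

```python
from collections import defaultdict

# Hypothetical stand-in for a morphological analyser: maps a surface form
# to its base form first, followed by the parts of a compound word.
LEMMAS = {
    "kädessä": ["käsi"],            # "in the hand"
    "käden": ["käsi"],              # "of the hand"
    "käsileikkaukseen": ["käsileikkaus", "käsi", "leikkaus"],  # "to hand surgery"
}

def analyse(token):
    """Return the base forms for a token, or the token itself if unknown."""
    return LEMMAS.get(token, [token])

def build_index(docs):
    """Build an inverted index from base form to the ids of matching documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            for lemma in analyse(token):
                index[lemma].add(doc_id)
    return index

def search(index, word):
    """Look up the base form of the query word in the index."""
    base_form = analyse(word.lower())[0]
    return sorted(index.get(base_form, set()))

# Invented example documents.
DOCS = {
    1: "haava vasemmassa kädessä",
    2: "käden murtuma",
    3: "potilas tulee käsileikkaukseen",
    4: "polven tutkimus",
}

index = build_index(DOCS)
print(search(index, "käsi"))
```

A plain string match for "käsi" would miss documents 1 and 2, because the stem changes when the word is inflected. The sketch finds them only because the table knows those forms; words outside the table are indexed as written.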

Closer to human understanding

In indexing, semantic data (i.e. information on the meanings of words) can also be added to words along with linguistic data. Many of our solutions make use of ontologies, which describe the relationships between concepts in a machine-readable form. In computer science, ontologies attempt to model the world in the way that humans experience it. Indeed, ontologies can be used to see what lies beyond the surface level: in an ontology, tobacco is a stimulant related to nicotine, which is a chemical compound. Likewise, the iPhone is a smartphone, which is a mobile phone, which is a mobile device and, at the end of a string of other conceptual levels, an inanimate, physical object. Even though all of this information is rather mundane to us, it still has to be taught to a machine, one way or another.  
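The concept chains described above can be sketched as a small lookup structure. This is only a toy model in Python, not the format of any real ontology: the IS_A and RELATED tables are hypothetical, they contain just the examples from the text, and the intermediate levels between "mobile device" and "physical object" are left out.

```python
# Toy "is a" hierarchy built from the examples in the text (hypothetical data).
IS_A = {
    "iPhone": "smartphone",
    "smartphone": "mobile phone",
    "mobile phone": "mobile device",
    "mobile device": "physical object",  # intermediate levels omitted
    "tobacco": "stimulant",
    "nicotine": "chemical compound",
}

# Associative links that are not "is a" relationships.
RELATED = {"tobacco": {"nicotine"}}

def ancestors(concept):
    """Follow the "is a" links upwards and return the broader concepts in order."""
    chain = []
    seen = set()
    while concept in IS_A and concept not in seen:
        seen.add(concept)  # guard against accidental cycles in the data
        concept = IS_A[concept]
        chain.append(concept)
    return chain

def is_a(concept, category):
    """True if category is reachable from concept through "is a" links."""
    return category in ancestors(concept)

print(ancestors("iPhone"))
print(is_a("iPhone", "mobile device"))
```

With such a structure, a search for "mobile device" could also return texts that only mention an iPhone. The related link keeps tobacco connected to nicotine without claiming that tobacco is itself a chemical compound.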

Indexing that utilises Finto, a public thesaurus and ontology service, makes data compatible between different organisations and units. Ontologies can be used to index large text materials that have already been collected, such as entire archives, in a quick and objective manner. Indexing extends the lifecycle and usability of data, because individual documents are more easily located. The process also helps reveal the connections and relationships between different concepts and phenomena. Semantic Web technologies and the data they link make data globally compatible.

Value according to the client's needs

What other phenomena and elements we can find in texts is simply a question of definition. For example, the EU General Data Protection Regulation (GDPR) has resulted in many companies and individuals asking how they can search for and anonymise names and other identifying information in their texts, a process that is no challenge for Lingsoft's solutions. Customer feedback can be classified by its tone or topic, the procedures performed on a patient can be identified from patient documentation, and offensive comments can be screened for on discussion boards. The client's needs determine what will be analysed in a given text and where the value of a solution will come from.
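To give a rough idea of what searching for identifying information can look like, here is a minimal Python sketch that replaces two kinds of identifiers with placeholders. It is an illustration only, not Lingsoft's solution: the patterns are simplified assumptions (an e-mail address and a Finnish-style personal identity code), the placeholder labels are invented, and finding personal names requires language analysis that simple patterns like these cannot provide.

```python
import re

# Simplified, assumed patterns. Real identifiers have more variants than these cover.
PATTERNS = [
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")),
    # Six digits, a century sign, three digits and a check character.
    ("[ID]", re.compile(r"\b\d{6}[+\-A]\d{3}[0-9A-Z]\b")),
]

def redact(text):
    """Replace every match of each pattern with its placeholder."""
    for placeholder, pattern in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text

print(redact("Contact anna@example.com, ID 131052-308T."))
```

Keeping a placeholder instead of deleting the match leaves the text readable and shows what kind of information was removed.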