
Communications of the ACM

Viewpoint

Building a Multilingual Wikipedia



Credit: Andrij Borys Associates

Wikipedia has more than 50 million articles in approximately 300 languages. The content in these languages is independently created and maintained. The knowledge in Wikipedia is very unevenly distributed over the languages: some languages have more than a million articles, but more than 50 languages have only a few hundred articles or fewer. More importantly, the number of contributors is also very unevenly distributed: English Wikipedia has more than 418,000 contributors, while the second-most active edition, Spanish, has 90,000. More than half of the language editions have fewer than 10 contributors making more than four edits per month. To assume that fewer than 10 active contributors can write and maintain a comprehensive encyclopedia in their spare time is optimistic at best.

To close these knowledge gaps, we are building a multilingual Wikipedia where content is created only once but made available in all languages. The multilingual Wikipedia has two main components: Abstract Wikipedia, where content is created and maintained in a language-independent notation, and Wikifunctions, a project to create, catalog, and maintain functions. For the multilingual Wikipedia, the most important function is one that takes content from Abstract Wikipedia and renders it in natural language, which in turn gets integrated into Wikipedia proper.

This will considerably reduce the effort required to create a comprehensive encyclopedia and keep it current in many languages. It will allow more people to share more knowledge in more languages than ever before. It will be particularly useful for under-served languages, providing an important way to help improve education and ready access to knowledge in many countries.[a]


Example

We follow a toy example. It does not cover the complexity of the problem space but serves to sketch the architecture. Consider the following two (simplified) sentences from English Wikipedia:

"San Francisco is the cultural, commercial, and financial center of Northern California. It is the fourth-most populous city in California, after Los Angeles, San Diego and San Jose."

Figure 1 shows how the text could be represented as abstract content. The example shows a single constructor of type Article with a single key called content. The value of content is a list with two constructors, one of type Instantiation, the other of type Ranking. The Instantiation constructor has two keys, instance and class, where the instance key has a simple entity as the value, San Francisco, and the class key has a complex value built from another constructor. San Francisco refers to an item from the Wikidata catalog of items, Q62.[b] Wikidata is a sister project of Wikipedia, an open knowledge base that anyone can edit,[10] which currently provides language-independent identifiers for 90 million entities, such as Q62 for San Francisco, and one billion machine-readable facts about these entities. Wikifunctions will be able to call Wikidata to request these facts and use them to enrich content.

Figure 1. An example of abstract content for two sentences describing San Francisco. The names of the constructors and their keys are given in English here, but that is merely for convenience. Just like the items in Wikidata, they will be represented by language-independent identifiers.
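
To make the structure described for Figure 1 concrete, the following is a minimal sketch of how such abstract content might be written down as nested Python dictionaries. Only the constructor names Article, Instantiation, and Ranking, the keys content, instance, and class, and the identifier Q62 come from the text. The "type" key, the ModifiedClass constructor, all keys of Ranking, and the English labels used as values are assumptions made for illustration; a real system would use language-independent identifiers throughout.

```python
# Sketch of abstract content as nested Python dicts (an assumed encoding).
# From the text: the constructors Article, Instantiation, and Ranking, the keys
# "content", "instance", and "class", and the Wikidata identifier Q62.
# Assumed for illustration: the "type" key, the ModifiedClass constructor,
# every key of Ranking, and the plain-English labels used as values.

SAN_FRANCISCO = "Q62"  # Wikidata item for San Francisco

abstract_content = {
    "type": "Article",
    "content": [
        {
            "type": "Instantiation",
            "instance": SAN_FRANCISCO,
            # The class value is itself built from another constructor.
            "class": {
                "type": "ModifiedClass",  # hypothetical constructor name
                "head": "center",
                "modifiers": ["cultural", "commercial", "financial"],
                "of": "Northern California",
            },
        },
        {
            "type": "Ranking",
            "subject": SAN_FRANCISCO,
            "rank": 4,
            "class": "city",
            "criterion": "population",
            "within": "California",
            "after": ["Los Angeles", "San Diego", "San Jose"],
        },
    ],
}


def constructor_types(node):
    """Collect constructor type names in depth-first order."""
    found = []
    if isinstance(node, dict):
        if "type" in node:
            found.append(node["type"])
        for value in node.values():
            found.extend(constructor_types(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(constructor_types(item))
    return found


print(constructor_types(abstract_content))
```

Walking the tree this way shows that content is plain nested data: a renderer can dispatch on the constructor type of each node without parsing any natural language.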

We require one renderer per constructor and language. A renderer is a function that takes abstract content and a language and turns it into natural language text (or an intermediate object for another renderer). Renderers are created and maintained by the community.
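
The idea of one renderer per constructor and language can be sketched as a lookup table keyed by that pair. This is a toy illustration and not the Wikifunctions design: the registry, the decorator, the label table, and the simplified content shape (a class given as a plain key rather than a nested constructor) are all assumptions.

```python
from typing import Callable, Dict, Tuple

# Hypothetical registry: one renderer per (constructor, language) pair.
Renderer = Callable[[dict], str]
RENDERERS: Dict[Tuple[str, str], Renderer] = {}

# Toy label table standing in for lexical knowledge (assumed data).
LABELS = {
    "Q62": {"en": "San Francisco", "de": "San Francisco"},
    "city": {"en": "a city", "de": "eine Stadt"},
}


def renderer(constructor: str, language: str):
    """Decorator that registers a renderer for one constructor and language."""
    def register(fn: Renderer) -> Renderer:
        RENDERERS[(constructor, language)] = fn
        return fn
    return register


@renderer("Instantiation", "en")
def instantiation_en(content: dict) -> str:
    subject = LABELS[content["instance"]]["en"]
    cls = LABELS[content["class"]]["en"]
    return f"{subject} is {cls}."


@renderer("Instantiation", "de")
def instantiation_de(content: dict) -> str:
    subject = LABELS[content["instance"]]["de"]
    cls = LABELS[content["class"]]["de"]
    return f"{subject} ist {cls}."


def render(content: dict, language: str) -> str:
    """Pick the renderer for this constructor and language, then apply it."""
    return RENDERERS[(content["type"], language)](content)


example = {"type": "Instantiation", "instance": "Q62", "class": "city"}
print(render(example, "en"))
print(render(example, "de"))
```

The same abstract content yields text in each language for which a renderer has been registered; adding a language means adding renderers, not rewriting content.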


Desiderata

Content has to be editable in any language. Note this does not mean we need a parser that can read arbitrary input in any language. A form-based editor could be sufficient and easy to localize.

The set of constructors has to be extensible by the community. We cannot assume we can create all necessary constructors to capture Wikipedia a priori.

Renderers have to be written by the community. This does not mean that every community member must be able to write renderers. Wikipedia and Wikidata have shown that contributors with different skill sets can successfully work together to tackle very difficult problems.[1]

Lexical knowledge must be easy to contribute. Rendering will require large amounts of lexical knowledge. Wikidata has recently been extended to express and maintain lexical knowledge.[c]

Content must be easy to contribute. Content will constitute the largest part of the system. Accordingly, the user experience for creating and maintaining content will be crucial to the success of the project (see Figure 2).

Figure 2. Mock up of the user interface.

Graceful degradation. The different languages will grow independently from each other at different speeds. It is important the system does not stop rendering the whole article because of a single missing lexicalization.
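
One way to picture graceful degradation is an article renderer that skips a sentence whose lexicalization is missing and reports it, instead of failing as a whole. The sketch below is a toy under stated assumptions: the lexicon, the exception class, and the sentence shape are all hypothetical.

```python
# Toy lexicon (assumed data): "port" has no German entry yet.
LEXICON = {
    "city": {"en": "a city", "de": "eine Stadt"},
    "port": {"en": "a port"},
}
COPULA = {"en": "is", "de": "ist"}


class MissingLexicalization(Exception):
    """Raised when a concept has no entry in the requested language."""


def render_sentence(subject: str, class_id: str, language: str) -> str:
    try:
        phrase = LEXICON[class_id][language]
    except KeyError:
        raise MissingLexicalization(
            f"{class_id!r} has no {language!r} entry"
        ) from None
    return f"{subject} {COPULA[language]} {phrase}."


def render_article(sentences, language):
    """Render what can be rendered; collect the rest instead of failing."""
    rendered, skipped = [], []
    for subject, class_id in sentences:
        try:
            rendered.append(render_sentence(subject, class_id, language))
        except MissingLexicalization as err:
            skipped.append(str(err))
    return " ".join(rendered), skipped


article = [("San Francisco", "city"), ("San Francisco", "port")]
print(render_article(article, "en"))
print(render_article(article, "de"))
```

The list of skipped items could also serve as a to-do list for contributors in that language, although the text does not prescribe this.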


Architecture

The community of a language Wikipedia can choose to use the abstract content that is stored alongside the items in Wikidata. Natural language text is generated from abstract content using the renderer function available in Wikifunctions and then displayed in Wikipedia to cover current gaps. Content from Abstract Wikipedia and locally created natural language content will live side by side. The abstract content in Wikidata and the renderers in Wikifunctions are composed from constructors and functions created and maintained in Wikifunctions. The functions can call the lexicographic knowledge in Wikidata, for example, for irregular inflections. This architecture is sketched in Figure 3. Each constructor specification states the type of the result it produces when rendered. This allows for a system built on the principles of functional programming, which has proven suitable for natural language generation.[6]

Figure 3. Architecture of the multilingual Wikipedia proposal.
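
The role of lexicographic knowledge for irregular inflections can be illustrated with a small sketch. The dictionary below is only a stand-in for lexical data that would be stored in Wikidata, and the regular fallback rule is deliberately naive; the function name and the data are assumptions.

```python
# Stand-in for lexicographic data (assumed): only irregular forms are stored.
IRREGULAR_PLURALS = {
    "en": {"child": "children", "person": "people", "mouse": "mice"},
}


def plural(lemma: str, language: str) -> str:
    """Return the plural, preferring a stored irregular form."""
    irregular = IRREGULAR_PLURALS.get(language, {}).get(lemma)
    if irregular is not None:
        return irregular
    if language == "en":
        # Naive regular rule for this sketch; real English needs more cases.
        return lemma + "s"
    raise LookupError(f"no plural rule for language {language!r}")


print(plural("child", "en"))
print(plural("article", "en"))
```

Separating the stored exceptions from the general rule keeps the rendering function small while the lexical data can grow independently.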

For every function, type, and constructor, there is a page in Wikifunctions with its definition and documentation, its keys, whether the keys are optional, and what types of values are allowed for each key. The most relevant function in Wikifunctions with regard to Abstract Wikipedia is a function to render abstract content in natural language. Wikipedia calls this rendering function with some content and the language of the given Wikipedia, and displays the result as article text.
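
A constructor definition of this kind, with keys, optionality, and allowed value types, might be modeled as follows. This is a sketch under assumptions: the class names, the result type label, and the optional "source" key are hypothetical, and the actual Wikifunctions data model may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class KeySpec:
    name: str
    value_types: Tuple[type, ...]  # allowed Python types for the value
    optional: bool = False


@dataclass(frozen=True)
class ConstructorSpec:
    name: str
    result_type: str  # type of the result when rendered (assumed label)
    keys: Tuple[KeySpec, ...]


# Hypothetical specification of the Instantiation constructor.
INSTANTIATION = ConstructorSpec(
    name="Instantiation",
    result_type="Clause",
    keys=(
        KeySpec("instance", (str,)),
        KeySpec("class", (str, dict)),  # an entity or a nested constructor
        KeySpec("source", (str,), optional=True),  # hypothetical key
    ),
)


def validate(spec: ConstructorSpec, content: dict) -> List[str]:
    """Return a list of problems; an empty list means the content conforms."""
    problems = []
    for key in spec.keys:
        if key.name not in content:
            if not key.optional:
                problems.append(f"missing required key: {key.name}")
        elif not isinstance(content[key.name], key.value_types):
            problems.append(f"wrong type for key: {key.name}")
    known = {key.name for key in spec.keys} | {"type"}
    for name in content:
        if name not in known:
            problems.append(f"unknown key: {name}")
    return problems


good = {"type": "Instantiation", "instance": "Q62", "class": "city"}
bad = {"type": "Instantiation", "instance": 62, "colour": "red"}
print(validate(INSTANTIATION, good))
print(validate(INSTANTIATION, bad))
```

A form-based editor could be generated from such a specification, since it lists exactly which fields exist and what each may contain.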

We have glossed over many issues such as agreement, saliency, or register. Wikifunctions will need to implement a natural language generation library [7] taking these issues into account.


Wikifunctions

The primary goal of Wikifunctions is to enable the multilingual Wikipedia: define the constructors for abstract content and implement the rendering function. This will lead to the creation of an open, widely usable, and well-tested multilingual natural language generation library.

But Wikifunctions also has a secondary goal: to provide a comprehensive library of functions, to allow everyone to create, maintain, run, and reference functions. This will enable people without a programming background to compute answers to many questions, either through Wikifunctions or through third-party sites accessing the functions. It offers a place where scientists collaboratively create models. It provides a persistent identifier scheme for functions, thus allowing processes, scientific publications, and standards to refer to functions unambiguously. New programming languages will be easier to create as they can rely on Wikifunctions for a comprehensive library. One obstacle in the democratization of programming is that most programming languages are based on English.[3] Wikifunctions will use language-independent identifiers, allowing code to be read and written in any natural language.
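
The effect of language-independent identifiers can be sketched with a small catalog in which each function has one identifier and labels in several languages. The identifiers F1 and F2, the labels, and the helper names are made up for this illustration and do not reflect the actual Wikifunctions identifier scheme.

```python
# Hypothetical catalog: identifiers are language-independent, labels are not.
CATALOG = {
    "F1": {
        "labels": {"en": "add", "de": "addieren", "es": "sumar"},
        "implementation": lambda a, b: a + b,
    },
    "F2": {
        "labels": {"en": "multiply", "de": "multiplizieren", "es": "multiplicar"},
        "implementation": lambda a, b: a * b,
    },
}


def resolve(label: str, language: str) -> str:
    """Map a label in some natural language to the function identifier."""
    for function_id, entry in CATALOG.items():
        if entry["labels"].get(language) == label:
            return function_id
    raise KeyError(f"no function labeled {label!r} in {language!r}")


def call(label: str, language: str, *args):
    """Call a function by its label in the caller's language."""
    return CATALOG[resolve(label, language)]["implementation"](*args)


print(resolve("sumar", "es"), resolve("add", "en"))
print(call("multiplizieren", "de", 6, 7))
```

Because references are stored as identifiers, the same call can be displayed with Spanish, German, or English labels without changing what it refers to.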




Function specifications can have multiple implementations. Implementations can be in a programming language such as JavaScript, WebAssembly, or C++, or composed from other functions. Evaluators can execute different implementations and compare their results and runtime behavior. Evaluators can be implemented in a multitude of backends: the browser, the cloud, the servers of the Wikimedia Foundation, on a distributed peer-to-peer evaluation platform, in a mobile app, or on the user's machine.
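
The following sketch illustrates how an evaluator might run several implementations of one function specification and compare their results and runtime. The factorial example, the evaluator function, and the report format are assumptions chosen for illustration; the third implementation is composed from existing library functions to mirror composition from other functions.

```python
import operator
import time
from functools import reduce


def factorial_iterative(n: int) -> int:
    result = 1
    for i in range(2, n + 1):
        result *= i
    return result


def factorial_recursive(n: int) -> int:
    return 1 if n < 2 else n * factorial_recursive(n - 1)


def factorial_composed(n: int) -> int:
    # Composed from other functions: a fold of multiplication over a range.
    return reduce(operator.mul, range(2, n + 1), 1)


IMPLEMENTATIONS = {
    "iterative": factorial_iterative,
    "recursive": factorial_recursive,
    "composed": factorial_composed,
}


def evaluate(implementations, args):
    """Run every implementation; report results, timings, and agreement."""
    report = {}
    for name, fn in implementations.items():
        start = time.perf_counter()
        result = fn(*args)
        elapsed = time.perf_counter() - start
        report[name] = {"result": result, "seconds": elapsed}
    results = [entry["result"] for entry in report.values()]
    agree = all(r == results[0] for r in results)
    return report, agree


report, agree = evaluate(IMPLEMENTATIONS, (10,))
print(agree, report["iterative"]["result"])
```

Disagreement between implementations would signal a bug in at least one of them, and the timings give a rough basis for choosing which implementation to run.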

Function calls to Wikifunctions can be embedded in several contexts. Wikifunctions will provide UIs to call individual functions, but it also lends itself to be used from a Web-based REPL, locally installed CLIs, a RESTful API, as a library imported in a programming language, through dedicated apps, Jupyter notebooks, natural language assistants, or spreadsheet cells.


Risks and Advantages

Leibniz was probably the best-known proponent of a universal language, the characteristica universalis. Such ambitions have repeatedly failed.[4] The main difference between Leibniz's project and Abstract Wikipedia is that Leibniz aimed not only for a notation for knowledge but also for a calculus to derive veracity; here, the focus is solely on notation.

A major risk is that contributing to Abstract Wikipedia and Wikifunctions becomes too difficult. Like all Wikimedia projects, the effort relies on a sufficient number of contributors. But Wikimedia communities have repeatedly tackled very hard tasks. They managed to self-organize and allow people with different skill sets to collaborate, and to succeed beyond expectations on projects such as an encyclopedia [1] or a knowledge base.[10] It will be crucial to provide an accessible user experience.

Another major risk is that the number of constructors is too high. If the number of constructors remains in the low thousands, a community of approximately five contributors can unlock a current and comprehensive encyclopedia for their language. Coverage experiments on texts [5] using FrameNet [2] give us reason for optimism, but the results are preliminary. Several further considerations support this optimism:

  • We aim only at a single genre, encyclopedias.
  • The exact surface text is not so important as long as the text retains fidelity.
  • We start simple and allow iteration.
  • We do not need natural language understanding, merely generation.
  • The baseline is very low.

Wikifunctions and Abstract Wikipedia are expected to drive a number of research directions in knowledge representation, natural language generation, collaborative systems, and computer-aided software engineering. Having a large catalog of functions will be valuable for many tasks, as will the creation of a large multilingual natural language generation library.

Advances in machine learning, for example, for article generation,[8] can be neatly tied into the architecture. ML systems for learning renderers from example texts, systems to improve the fluency of rendered text, or classifiers that help generate content from natural language input can all be important modules in the project. Machine-learned components can be made accessible like any other function in Wikifunctions, or they can be used for offline analysis of Abstract Wikipedia or potential input text. Such combinations will guide an interesting exploration of the mechanisms and effectiveness of human-machine teams.


Conclusion

Building a multilingual Wikipedia is a clearly defined and highly attractive goal with many challenging problems. We invite the research community to join us. With Abstract Wikipedia and Wikifunctions, we sketch out an architecture to get there. A major advantage of splitting up Wikifunctions and Abstract Wikipedia is that it recognizes the risks in the project. Wikifunctions defines a valuable goal by creating a catalog of functions. Abstract Wikipedia provides value by improving on the maintainability of currently bot-created articles. Even if the full vision is not achieved, we identify valuable intermediate milestones. The project is achievable without the need for research breakthroughs. The current state of the art in natural language generation, knowledge representation, and collaborative systems can be tied together to create a system that enables many more people than today to share in the sum of all knowledge.


References

1. Ayers, P., Matthews, C., and Yates, B. How Wikipedia Works (And How You Can Be a Part of It). No Starch Press, 2008.

2. Baker, C.F., Fillmore, C.J., and Lowe, J.B. The Berkeley FrameNet project. In Proceedings of the 17th International Conference on Computational Linguistics-Volume 1. Association for Computational Linguistics, 1998, 86–90.

3. Dasgupta, S. and Mako Hill, B. Learning to code in localized programming languages. In Proceedings of the Fourth (2017) ACM Conference on Learning @ Scale, L@S '17, New York, NY, USA, 2017, 33–39.

4. Eco, U. The Search for the Perfect Language (the Making of Europe). Blackwell, 1995.

5. Ferraro, F. et al. Concretely annotated corpora. In Proceedings of the AKBC Workshop at NIPS, 2014.

6. Ranta, A. Grammatical Framework: Programming with Multilingual Grammars. CSLI, 2011.

7. Reiter, E. and Dale, R. Building Natural Language Generation Systems. Cambridge University Press, 2000.

8. Vougiouklis, P. et al. Neural Wikipedian: Generating textual summaries from knowledge base triples. J. Web Semant., 52–53 (2017), 1–15.

9. Vrandečić, D. Architecture for a multilingual Wikipedia. arXiv preprint arXiv:2004.04733, 2020; https://arxiv.org/abs/2004.04733.

10. Vrandečić, D. and Krötzsch, M. Wikidata: A free collaborative knowledgebase. Commun. ACM 57, 10 (Oct. 2014), 78–85; doi: http://dx.doi.org/10.1145/2629489.


Author

Denny Vrandečić (denny@wikimedia.org) is Head of Special Projects at the Wikimedia Foundation in San Francisco, CA, USA.


Footnotes

a. This Viewpoint provides a brief summary of the full architecture available online.[9]

b. https://www.wikidata.org/entity/Q62

c. https://www.wikidata.org/wiki/Wikidata:Lexicographical_data


Copyright held by author.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2021 ACM, Inc.


 
