Communications of the ACM


Research for Practice: Knowledge Base Construction in the Machine-Learning Era

This installment of Research for Practice features a curated selection from Alex Ratner and Chris Ré, who provide an overview of recent developments in Knowledge Base Construction (KBC). While knowledge bases have a long history dating to the expert systems of the 1970s, recent advances in machine learning have led to a knowledge base renaissance, with knowledge bases now powering major product functionality including Google Assistant, Amazon Alexa, Apple Siri, and Wolfram Alpha. Ratner and Ré's selections highlight key considerations in the modern KBC process, from interfaces that extract knowledge from domain experts to algorithms and representations that transfer knowledge across tasks. Please enjoy!
        —Peter Bailis

More information is accessible today than at any other time in human history. From a software perspective, however, the vast majority of this data is unusable, as it is locked away in unstructured formats such as text, PDFs, Web pages, images, and other hard-to-parse formats. The goal of KBC (knowledge base construction) is to extract structured information automatically from this "dark data," so that it can be used in downstream applications for search, question-answering, link prediction, visualization, modeling and much more. Today, knowledge bases (KBs) are the central components of systems that help fight human trafficking,19 accelerate biomedical discovery,9 and, increasingly, power web-search and question-answering technologies.4

KBC is extremely challenging, however, as it involves dealing with highly complex input data and multiple connected subtasks such as parsing, extracting, cleaning, linking, and integration. Traditionally, even with machine learning, each of these subtasks would require arduous feature engineering (that is, manually crafting attributes of the input data to feed into the system). For this reason, KBC has traditionally been a months- or years-long process that was approached only by academic groups (for example, YAGO,8 DBPedia,7 KnowItNow,2 DeepDive,18 among others) or large, well-funded teams in industry and government (for example, Google's Knowledge Vault, IBM Watson, and Amazon's Product Graphs).

Today, however, there is a renewed sense of democratized progress in the area of KBC, thanks to powerful but easy-to-use deep-learning models that largely obviate the burdensome task of feature engineering. Instead, modern deep-learning models operate directly over raw input data such as text or images and achieve state-of-the-art performance on KBC subtasks such as parsing, tagging, classifying, and linking. Moreover, standard commodity architectures are often suitable for a wide range of domains and tasks; examples include the "hegemony"11 of the bi-LSTM (bidirectional long short-term memory) for text and the CNN (convolutional neural network) for images. Open source implementations can often be downloaded and run in several lines of code.

For these emerging deep-learning-based approaches to make KBC faster and easier, though, certain critical design decisions need to be addressed, such as how to piece them together, how to collect training data for them efficiently, and how to represent their input and output data. This article highlights three papers that focus on these critical design points: joint-learning approaches for pooling information and coordinating among subcomponents; more efficient methods of weakly supervising the machine-learning components of the system; and new ways of representing both inputs and outputs of the KB.

Joint Learning: Sharing Information and Avoiding Cascaded Errors
T.M. Mitchell et al.
Never-ending learning. In Proceedings of the Conference on Artificial Intelligence, 2015, 2302–2310.

KBC is particularly challenging because of the large number of related subtasks involved, each of which may use one or more ML (machine-learning) models. Performing these tasks in disconnected pipelines is suboptimal in at least two ways: it can lead to cascading errors (for example, an initial parsing error may throw off a downstream tagging or linking task); and it misses the opportunity to pool information and training signals among related tasks (for example, subcomponents that extract similar types of relations can probably use similar representations of the input data). The high-level idea of what are often termed joint inference and multitask learning—which we collectively refer to as joint learning—is to learn multiple related models jointly, connecting them by logical relations of their output values and/or shared representations of their input values.

Never-Ending Language Learner (NELL) is a classic example of the impact of joint learning on KBC at an impressive scale. NELL is a system that has been extracting various facts about the world (for example, ServedWith(Tea, Biscuits)) from the Internet since 2010, amounting to a KB containing (in 2015) more than 80 million entries. The problem setting approached by NELL consists of more than 2,500 distinct learning tasks, including categorizing noun phrases into specific categories, linking similar entities, and extracting relations between entities. Rather than learning all these tasks separately, NELL's formulation includes known (or learned) coupling constraints between the different tasks, which Mitchell et al. cite as critical to training NELL. These include logical relations such as subset/superset (for example, IsSandwich(Hamburger) ⇒ IsFood(Hamburger)) and mutual-exclusion constraints, which connect the many disparate tasks during inference and learning.
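
To make the idea of coupling constraints concrete, the following is a minimal sketch of how subset and mutual-exclusion relations could tie together the scores of otherwise independent category classifiers. It is not NELL's implementation: NELL uses such constraints during learning, whereas this toy only post-processes scores, and the ontology, category names, and function names are all hypothetical.

```python
# Hypothetical toy ontology; these names are illustrative, not NELL's schema.
SUBSET_OF = {"Sandwich": "Food"}          # child category -> parent category
MUTUALLY_EXCLUSIVE = [("Food", "City")]   # a noun phrase cannot be both


def ancestors(category):
    """Return every category implied by `category` through subset links."""
    implied = []
    while category in SUBSET_OF:
        category = SUBSET_OF[category]
        implied.append(category)
    return implied


def apply_constraints(scores):
    """Couple per-category confidence scores for a single noun phrase.

    `scores` maps category -> confidence in [0, 1], as produced by
    independent classifiers. The result enforces two rules:
    a parent is at least as likely as any of its children, and of two
    mutually exclusive categories only the more confident one survives.
    """
    coupled = dict(scores)
    # Subset/superset: IsSandwich(x) implies IsFood(x).
    for category, score in scores.items():
        for parent in ancestors(category):
            coupled[parent] = max(coupled.get(parent, 0.0), score)
    # Mutual exclusion: zero out the less confident category.
    for a, b in MUTUALLY_EXCLUSIVE:
        if a in coupled and b in coupled:
            loser = a if coupled[a] < coupled[b] else b
            coupled[loser] = 0.0
    return coupled


# Independent classifiers disagree about "Hamburger"; coupling resolves it.
hamburger = apply_constraints({"Sandwich": 0.9, "Food": 0.4, "City": 0.6})
print(hamburger)
```

Here the confident Sandwich prediction raises the Food score, which in turn suppresses the conflicting City prediction: evidence from one task corrects another.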

In other systems, the importance of connecting or coupling multiple tasks is echoed in slightly different contexts or formulations: for example, as a way to avoid cascading errors between different pipeline steps such as extraction and integration (for example, DeepDive18), or implemented by sharing weights or learned representations of the input data between tasks as in multitask learning.3,17 Either way, the decision of how to couple different subtasks is a critical one in any KBC system design.

Weak Supervision: Programming ML with Training Data
A.J. Ratner, S.H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré
Snorkel: Rapid training data creation with weak supervision. In Proceedings of the Very Large Database (VLDB) Endowment 11, 3 (2017), 269–282.

In almost all KBC systems today, many or all of the critical tasks are performed by increasingly complex machine-learning models, such as deep-learning ones. While these models indeed obviate much of the feature-engineering burden that was a traditional bottleneck in the KBC development process, they also require large volumes of labeled training data from which to learn. Having humans label this training data by hand is an expensive task that can take months or years, and the resulting labeled data set is frustratingly static: if the schema of a KB changes, as it frequently does in real production settings, the training set must be thrown out and relabeled. For these reasons, many KBC systems today use some form of weak supervision:15 noisier, higher-level supervision provided more efficiently by a domain expert.6,10 For example, a popular heuristic technique is distant supervision, where the entries of an existing knowledge base are heuristically aligned with new input data to label it as training data.1,13,16
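
As a rough illustration of distant supervision, the sketch below labels sentences by aligning them with a tiny seed knowledge base. The entity lists, the WorksFor relation, and the corpus are invented for the example, and the exact string matching stands in for the entity linking a real system would need.

```python
# Hypothetical seed data; none of it comes from a real knowledge base.
PEOPLE = ["Alice", "Bob"]
COMPANIES = ["Acme", "Globex"]
KB_WORKS_FOR = {("Alice", "Acme")}   # known WorksFor(person, company) facts


def distant_labels(sentences):
    """Heuristically label (sentence, person, company) candidates.

    A candidate is any sentence mentioning both a person and a company.
    It is labeled 1 if the pair appears in the seed KB and 0 otherwise.
    Both kinds of label are noisy: co-occurrence does not guarantee the
    sentence expresses the relation, and the KB may be incomplete.
    """
    labeled = []
    for sentence in sentences:
        for person in PEOPLE:
            for company in COMPANIES:
                if person in sentence and company in sentence:
                    label = 1 if (person, company) in KB_WORKS_FOR else 0
                    labeled.append((sentence, person, company, label))
    return labeled


corpus = [
    "Alice joined Acme as an engineer.",
    "Alice sued Acme last year.",        # labeled positive, but wrongly
    "Bob visited Globex on Tuesday.",
]
training_set = distant_labels(corpus)
for example in training_set:
    print(example)
```

The second sentence shows why the technique is called weak supervision: the alignment marks it as a positive example of WorksFor even though it expresses nothing of the kind.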

Snorkel provides an end-to-end framework for weakly supervising machine-learning models by having domain experts write LFs (labeling functions), which are simply black-box functions that programmatically label training data, rather than labeling any training data by hand. These LFs subsume a wide range of weak supervision techniques and effectively give non-machine-learning experts a simple way to "program" ML models. Moreover, Snorkel automatically learns the accuracies of the LFs and reweights their outputs using statistical modeling techniques, effectively denoising the training data, which can then be used to supervise the KBC system. In this paper, the authors demonstrate that Snorkel improves over prior weak supervision approaches by enabling the easy use of many noisy sources, and comes within several percentage points of performance using massive hand-labeled training sets, showing the efficacy of weak supervision for making high-performance KBC systems faster and easier to develop.
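
The labeling-function idea can be sketched in a few lines of plain Python. This is not Snorkel's API: the function names, the spouse-detection task, and the label encoding are assumptions made for illustration, and the simple majority vote stands in for the statistical model that Snorkel uses to learn LF accuracies and reweight their outputs.

```python
POSITIVE, NEGATIVE, ABSTAIN = 1, -1, 0   # assumed label encoding


# Each labeling function inspects a sentence and votes or abstains.
def lf_marriage_words(sentence):
    words = ("married", "wife", "husband")
    return POSITIVE if any(w in sentence for w in words) else ABSTAIN


def lf_sibling_words(sentence):
    words = ("brother", "sister")
    return NEGATIVE if any(w in sentence for w in words) else ABSTAIN


def lf_workplace_words(sentence):
    words = ("colleague", "boss")
    return NEGATIVE if any(w in sentence for w in words) else ABSTAIN


LABELING_FUNCTIONS = [lf_marriage_words, lf_sibling_words, lf_workplace_words]


def majority_vote(sentence):
    """Combine LF votes with an unweighted majority; ties abstain."""
    total = sum(lf(sentence) for lf in LABELING_FUNCTIONS)
    if total > 0:
        return POSITIVE
    if total < 0:
        return NEGATIVE
    return ABSTAIN


unlabeled = [
    "Alice married Bob in 2010.",
    "Carol is the sister of Dave.",
    "Erin met Frank at a conference.",
]
labels = [majority_vote(s) for s in unlabeled]
print(labels)
```

Sentences on which every function abstains receive no label and would simply be left out of the training set; a weighted combination matters most when functions overlap and conflict, which this tiny example does not exercise.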

Embeddings: Representation and Incorporation of Distributed Knowledge
S. Riedel, L. Yao, A. McCallum, and B.M. Marlin
Relation extraction with matrix factorization and universal schemas. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics–Human Language Technologies, 2013, 74–84.

Finally, a critical decision in KBC is how to represent data: both the input unstructured data and the resulting output constituting the knowledge base. In both KBC and more general ML settings, the use of dense vector embeddings to represent input data, especially text, has become an omnipresent tool.12 For example, word embeddings, learned by applying PCA (principal component analysis) or some approximate variant to large unlabeled corpora, can inherently represent meaningful semantics of text data, such as synonymy, and serve as a powerful but simple way to incorporate statistical knowledge from large corpora. Increasingly sophisticated types of embeddings, such as hyperbolic,14 multimodal, and graph5 embeddings, can provide powerful boosts to end-system performance in an expanded range of settings.

In their paper, Riedel et al. provide an interesting perspective by showing how embeddings can also be used to represent the knowledge base itself. In traditional KBC, an output schema (that is, which types of relations are to be extracted) is selected first and fixed, which is necessarily a manual process. Instead, Riedel et al. propose using dense embeddings to represent the KB itself and learning these from the union of all available or potential target schemas.

Moreover, they argue that such an approach unifies the traditionally separate tasks of extraction and integration. Generally, extraction is the process of going from input data to an entry in the KB—for example, mapping a text string X likes Y to a KB relation Likes(X,Y)—while integration is the task of merging or linking related entities and relations. In their approach, however, both input text and KB entries are represented in the same vector space, so these operations become essentially equivalent. These embeddings can then be learned jointly and queried for a variety of prediction tasks.
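
A small sketch may help show what it means to place text patterns and KB relations in one vector space. The rows of the matrix are entity pairs and the columns are the union of surface patterns and KB relations; factorizing it yields an embedding for every row and column. Everything here is a simplification under stated assumptions: the data is invented, unobserved cells are treated as negatives with a logistic loss (Riedel et al. instead train with a ranking objective), and the rank, learning rate, and epoch count are arbitrary.

```python
import math
import random

# Hypothetical toy data. Columns mix surface patterns and one KB relation.
ROWS = [("A", "B"), ("C", "D"), ("E", "F"), ("G", "H"), ("I", "J")]
COLUMNS = ["X is employed by Y", "X visited Y", "works_for"]
OBSERVED = {(0, 0), (0, 2), (1, 0), (1, 2), (2, 0), (3, 1), (4, 1)}
HELD_OUT = {(2, 2), (4, 2)}   # cells to predict, never used in training


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(rank=2, epochs=500, lr=0.1, seed=0):
    """Fit row and column embeddings by SGD on a logistic loss."""
    rng = random.Random(seed)
    row_vecs = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in ROWS]
    col_vecs = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in COLUMNS]
    cells = [(r, c) for r in range(len(ROWS)) for c in range(len(COLUMNS))
             if (r, c) not in HELD_OUT]
    for _ in range(epochs):
        rng.shuffle(cells)
        for r, c in cells:
            target = 1.0 if (r, c) in OBSERVED else 0.0
            dot = sum(a * b for a, b in zip(row_vecs[r], col_vecs[c]))
            error = sigmoid(dot) - target
            for k in range(rank):
                rv, cv = row_vecs[r][k], col_vecs[c][k]
                row_vecs[r][k] -= lr * error * cv
                col_vecs[c][k] -= lr * error * rv
    return row_vecs, col_vecs


def score(row_vecs, col_vecs, r, c):
    """Predicted probability that cell (r, c) holds."""
    return sigmoid(sum(a * b for a, b in zip(row_vecs[r], col_vecs[c])))


row_vecs, col_vecs = train()
inferred = score(row_vecs, col_vecs, 2, 2)    # works_for(E, F)?
unrelated = score(row_vecs, col_vecs, 4, 2)   # works_for(I, J)?
print(round(inferred, 3), round(unrelated, 3))
```

The pair (E, F) was only ever seen with the pattern "X is employed by Y", so any support for works_for(E, F) has to come from that pattern's column embedding resembling the works_for column. One would expect its score to exceed that of (I, J), which was seen only with "X visited Y", though the exact values depend on the seed and hyperparameters. Because the same scoring function applies to pattern columns and relation columns alike, extraction and integration reduce to the same operation.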

KBC Becoming More Accessible

This article has reviewed approaches to three critical design points of building a modern KBC system and how they have the potential to accelerate the KBC process: coupling multiple component models to learn them jointly; using weak supervision to supervise these models more efficiently and flexibly; and choosing a dense vector representation for the data. While ML-based KBC systems are still large and complex, one practical benefit of today's interest and investment in ML is the plethora of state-of-the-art models for various KBC subtasks available as open source, along with well-engineered frameworks such as PyTorch and TensorFlow with which to run them. Together with techniques and systems for putting all the pieces together, like those reviewed here, high-performance KBC is becoming more accessible than ever.

References


1. Bunescu, R.C., Mooney, R.J. Learning to extract relations from the Web using minimal supervision. In Proceedings of the 45th Annual Meeting Assoc. Computational Linguistics, 2007, 576–583.

2. Cafarella, M.J., Downey, D., Soderland, S., Etzioni, O. KnowItNow: Fast, scalable information extraction from the Web. In Proceedings of Conf. on Human Language Tech. Empirical Methods in Natural Language Processing, 2005, 563–570.

3. Caruana, R. Multitask learning: A knowledge-based source of inductive bias. In Proceedings of the 10th Intern. Conf. Machine Learning, 1993, 41–48.

4. Dong, X. et al. Knowledge Vault: A Web-scale approach to probabilistic knowledge fusion. In Proceedings of the 20th ACM SIGKDD Intern. Conf. Knowledge Discovery and Data Mining, 2014, 601–610.

5. Grover, A. and Leskovec, J. node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD Intern. Conf. Knowledge Discovery and Data Mining, 2016, 855–864.

6. Hoffmann, R., Zhang, C., Ling, X., Zettlemoyer, L., Weld, D.S. Knowledge-based weak supervision for information extraction of overlapping relations. In Proceedings of the 49th Annual Meeting of the Assoc. Computational Linguistics–Human Language Technologies, 1, 2011, 541–550.

7. Lehmann, J. et al. DBpedia—A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6, 2 (2014), 167–195.

8. Mahdisoltani, F., Biega, J. and Suchanek, F.M. YAGO3: A knowledge base from multilingual wikipedias. In Proceedings of the 7th Biennial Conf. Innovative Data Systems Research, 2013.

9. Mallory, E.K., Zhang, C., Ré, C. and Altman, R.B. Large-scale extraction of gene interactions from full-text literature using DeepDive. Bioinformatics 32, 1 (2015), 106–113.

10. Mann, G.S. and McCallum, A. Generalized expectation criteria for semi-supervised learning with weakly labeled data. J. Machine Learning Research 11 (Feb 2010), 955–984.

11. Manning, C. Representations for language: From word embeddings to sentence meanings. Presented at Simons Institute for the Theory of Computing, UC Berkeley.

12. Mikolov, T., Chen, K., Corrado, G. and Dean, J. Efficient estimation of word representations in vector space, 2013; arXiv preprint arXiv:1301.3781.

13. Mintz, M., Bills, S., Snow, R. and Jurafsky, D. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conf. 47th Annual Meeting of the Assoc. Computational Linguistics and the 4th Conf. Asian Federation of Natural Language Processing, 2009, 1003–1011.

14. Nickel, M. and Kiela, D. Poincaré embeddings for learning hierarchical representations. Advances in Neural Information Processing Systems 30 (2017), 6341–6350.

15. Ratner, A., Bach, S., Varma, P. and Ré, C. Weak supervision: The new programming paradigm for machine learning. Hazy Research.

16. Ren, X., He, W., Qu, M., Voss, C.R., Ji, H., Han, J. Label noise reduction in entity typing by heterogeneous partial-label embedding. In Proceedings of the 22nd ACM SIGKDD Intern. Conf. Knowledge Discovery and Data Mining, 2016, 1825–1834.

17. Ruder, S. An overview of multi-task learning in deep neural networks, 2017; arXiv preprint arXiv:1706.05098.

18. Zhang, C., Ré, C., Cafarella, M., De Sa, C., Ratner, A., Shin, J., Wang, F., Wu, S. DeepDive: Declarative knowledge base construction. Commun. ACM 60, 5 (May 2017), 93–102.

19. Zhang, C., Shin, J., Ré, C., Cafarella, M. and Niu, F. Extracting databases from dark data with DeepDive. In Proceedings of the Intern. Conf. Management of Data, 2016, 847–859.


Alex Ratner is a Ph.D. candidate in computer science at Stanford University, advised by Chris Ré, where his research focuses on weak supervision—using higher-level, noisier input from domain experts to train complex state-of-the-art models where limited hand-labeled training data is available. He leads the development of the Snorkel framework for weakly supervised ML, which has been applied to KBC problems in domains such as genomics, clinical diagnostics, and political science. He is supported by a Stanford Bio-X SIGF fellowship.

Christopher Ré is an associate professor of computer science at Stanford University. His work focuses on enabling users and developers to build applications that more deeply understand and exploit data. Work from his group has been incorporated into major scientific and humanitarian efforts, including the IceCube neutrino detector, PaleoDeepDive, and MEMEX in the fight against human trafficking, and into commercial products from major Web and enterprise companies.

Peter Bailis is an assistant professor of computer science at Stanford University. His research in the Future Data Systems group focuses on the design and implementation of next-generation data-intensive systems.

Copyright held by owners/authors. Publication rights licensed to ACM.

The Digital Library is published by the Association for Computing Machinery. Copyright © 2018 ACM, Inc.