Semantic Web: Mining Knowledge from the Web
Raghava Mutharaju
Knowledgeable Computing and Reasoning Lab
IIIT-Delhi
About Myself
Work Experience and Education
Assistant Professor (CSE), IIIT-D
Research Scientist, GE Research Center, New York
Internships at IBM Research, Bell Labs, Xerox Research, Stardog
Software Engineer, CA Technologies, Hyderabad
PhD from Wright State University
M.Tech from MNNIT, Allahabad
B.Tech from JNTU, Hyderabad
Research Interests
Semantic Web
Ontology Modelling
Ontology Reasoning
Knowledge Graphs
SPARQL Query Processing
Big Data
IoT
Introduction
Image source:
https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam
Who was the president of India who was also a scientist?
Who was the scientist who worked at DRDO and received the Bharat Ratna?
Who was the president of India with an aerospace engineering background?
Name the person who worked at DRDO and ISRO and was called the “People’s President”.
Who was the Bharat Ratna recipient from Rameswaram?
Humans
can read the text
can understand the text
can make inferences based on the text
can answer the questions
Machines
can read the text
cannot understand the text
cannot make any inferences
may answer some of the questions
Knowledge Graph
Knowledge Graph involving Abdul Kalam
Knowledge Graph
Capture the knowledge in a structured form
Machine processable
Inferences can be drawn
KG can be linked to other related KGs
Presidents of India
Books written by presidents
Details of DRDO and ISRO
Machines have a better chance of answering questions using KGs
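As a rough illustration of why structured knowledge helps, the sketch below stores a few facts as (subject, predicate, object) triples in plain Python and answers two of the earlier questions by intersecting conditions. The predicate names (heldPosition, occupation, workedAt, awarded) are made up for this example and do not come from any standard vocabulary.

```python
# Toy knowledge graph as a set of (subject, predicate, object) triples.
# Predicate names are illustrative assumptions, not a standard vocabulary.
TRIPLES = {
    ("Abdul Kalam", "heldPosition", "President of India"),
    ("Abdul Kalam", "occupation", "Scientist"),
    ("Abdul Kalam", "workedAt", "DRDO"),
    ("Abdul Kalam", "workedAt", "ISRO"),
    ("Abdul Kalam", "awarded", "Bharat Ratna"),
    ("Abdul Kalam", "bornIn", "Rameswaram"),
    ("Pratibha Patil", "heldPosition", "President of India"),
    ("C. V. Raman", "occupation", "Scientist"),
    ("C. V. Raman", "awarded", "Bharat Ratna"),
}


def subjects_with(triples, *conditions):
    """Return the subjects that satisfy every (predicate, object) condition."""
    candidates = {s for s, _, _ in triples}
    for predicate, obj in conditions:
        candidates &= {s for s, p, o in triples if p == predicate and o == obj}
    return candidates


# "Who was the president of India who was also a scientist?"
president_scientist = subjects_with(
    TRIPLES,
    ("heldPosition", "President of India"),
    ("occupation", "Scientist"),
)

# "Who was the scientist who worked at DRDO and received the Bharat Ratna?"
drdo_bharat_ratna = subjects_with(
    TRIPLES,
    ("occupation", "Scientist"),
    ("workedAt", "DRDO"),
    ("awarded", "Bharat Ratna"),
)

print(president_scientist)
print(drdo_bharat_ratna)
```

Each question becomes a conjunction of simple lookups, which is something a machine can do reliably once the text has been turned into triples.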
Knowledge Graph
Knowledge Graph involving Abdul Kalam
Domain(born in, Person)
Range(born in, Place)
Range(worked at, Organization)
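The domain and range declarations above let a machine infer the type of an entity from the relations it takes part in. A minimal sketch of that inference in plain Python follows; the identifiers bornIn and workedAt are assumed spellings of the relations "born in" and "worked at", and the "type" predicate is likewise an assumption made for this example.

```python
# Domain: the type of the subject of a relation.
# Range: the type of the object of a relation.
DOMAIN = {"bornIn": "Person"}
RANGE = {"bornIn": "Place", "workedAt": "Organization"}

FACTS = [
    ("Abdul Kalam", "bornIn", "Rameswaram"),
    ("Abdul Kalam", "workedAt", "DRDO"),
    ("Abdul Kalam", "workedAt", "ISRO"),
]


def infer_types(facts, domain, range_):
    """Derive (entity, 'type', class) triples from domain/range declarations."""
    inferred = set()
    for subject, predicate, obj in facts:
        if predicate in domain:
            inferred.add((subject, "type", domain[predicate]))
        if predicate in range_:
            inferred.add((obj, "type", range_[predicate]))
    return inferred


inferred_types = infer_types(FACTS, DOMAIN, RANGE)
for triple in sorted(inferred_types):
    print(triple)
```

None of the type facts were stated explicitly; they follow from the declarations, which is the kind of inference a knowledge graph makes possible.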
Knowledge Graph
Expanded Knowledge Graph involving Abdul Kalam
Google's knowledge card for Abdul Kalam
Knowledge Graph
Image source:
https://goo.gl/S2F3mH
Semantic Web
The purpose is to provide structure to the Web and to data in general
Move from a web of documents to a web of data
Proposed by Tim Berners-Lee in 1999
Semantic Web technologies and W3C standards
RDF (Resource Description Framework)
OWL (Web Ontology Language)
SPARQL (query language)
SHACL (Shapes Constraint Language)
Image source:
https://lod-cloud.net/
RDF
Resource Description Framework (RDF) is a data model that is used to describe resources
Physical things
Abstract concepts
Numbers and strings
We use the terms resource and entity synonymously here
It is a universal, machine-readable data exchange format
Resources are described using triples (subject, predicate, object)
A triple captures the relationship (predicate) between a subject and an object
<Delhi> <capitalOf> <India>
An RDF triple is also called an RDF statement
A set of triples can be represented as a directed labelled graph
Image source:
https://www.w3.org/TR/rdf11-concepts/rdf-graph.svg
Image source:
http://www.obitko.com/tutorials/ontologies-semantic-web/rdf-graph-and-syntax.html
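To make the triple and graph views of RDF concrete, the sketch below stores triples as Python tuples, builds the corresponding directed labelled graph as an adjacency list, and writes the triples out in N-Triples syntax. The namespace http://example.org/ and the City resource are placeholders assumed for the example; only the Delhi, capitalOf, India triple comes from the text above.

```python
from collections import defaultdict

EX = "http://example.org/"  # placeholder namespace, assumed for illustration

# Each triple is (subject, predicate, object), all given as IRIs here.
triples = [
    (EX + "Delhi", EX + "capitalOf", EX + "India"),
    (EX + "Delhi", EX + "type", EX + "City"),
]


def to_graph(triples):
    """Directed labelled graph: node -> list of (edge label, target node)."""
    graph = defaultdict(list)
    for subject, predicate, obj in triples:
        graph[subject].append((predicate, obj))
    return dict(graph)


def to_ntriples(triples):
    """Serialize IRI-only triples, one '<s> <p> <o> .' statement per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


graph = to_graph(triples)
ntriples = to_ntriples(triples)
print(ntriples)
```

The subject and object become nodes and the predicate becomes the label on the edge between them, so the same data can be read either as a list of statements or as a graph.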
Mining Knowledge
Image source:
https://en.wikipedia.org/wiki/A._P._J._Abdul_Kalam
Knowledge Graph Construction
Manual
Good quality KG (entities and relations)
Not scalable
Automatic
Quality is unpredictable and generally lower than that of a manually built KG
Scalable: large amount of text, any domain
Automatic Knowledge Graph Construction
Steps
Part-of-speech tagging
Identify nouns, verbs, adjectives, etc.
Dependency parsing
Recognize the syntactic structure of the sentence along with the dependencies among the words
Named Entity Recognition
Identify person, location, organization, etc.
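Real pipelines use trained models for these steps. Purely to show what the output of named entity recognition looks like, here is a toy dictionary-based (gazetteer) recognizer in plain Python. The gazetteer contents and the function name recognize_entities are assumptions made for this sketch; it does not perform part-of-speech tagging or dependency parsing.

```python
import re

# Hypothetical gazetteer: surface form -> entity type.
GAZETTEER = {
    "Abdul Kalam": "PERSON",
    "Rameswaram": "LOCATION",
    "India": "LOCATION",
    "DRDO": "ORGANIZATION",
    "ISRO": "ORGANIZATION",
}


def recognize_entities(text, gazetteer=GAZETTEER):
    """Return (mention, type) pairs in the order they appear in the text."""
    # Longer names go first so multi-word names win over shorter overlaps.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    return [(m.group(1), gazetteer[m.group(1)]) for m in pattern.finditer(text)]


sentence = "Abdul Kalam was born in Rameswaram and worked at DRDO and ISRO."
entities = recognize_entities(sentence)
for mention, entity_type in entities:
    print(mention, entity_type)
```

A lookup table like this cannot handle names it has never seen, which is one reason practical systems rely on statistical or neural taggers instead.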
Automatic Knowledge Graph Construction
Steps
Coreference resolution
Find all words that refer to the same entity, including pronouns (I, we, he, she, it)
Entity resolution and linking
Uniquely identify the entity that each mention refers to
Examples: Obama, Barack Obama and President may all denote one person; Paris and Amazon are ambiguous names
Information extraction
Define the domain (high level concepts and relations)
Learn extractors/templates
Rank/score the candidate facts
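The last two steps can be sketched with hand-written templates, an alias table, and a simple count-based score. Everything here is an assumption made for illustration: the regular expressions stand in for learned extractors, the ALIASES table stands in for an entity linker, and the number of supporting sentences stands in for a proper confidence score.

```python
import re
from collections import Counter

# Hypothetical alias table used for entity linking: mention -> canonical name.
ALIASES = {
    "Kalam": "Abdul Kalam",
    "A. P. J. Abdul Kalam": "Abdul Kalam",
}

# A capitalized name made of one or more tokens, e.g. "A. P. J. Abdul Kalam".
NAME = r"[A-Z][\w.]*(?: [A-Z][\w.]*)*"

# Hypothetical hand-written templates; real systems learn such extractors.
TEMPLATES = [
    (re.compile(rf"(?P<s>{NAME}) was born in (?P<o>[A-Z]\w*)"), "bornIn"),
    (re.compile(rf"(?P<s>{NAME}) worked at (?P<o>[A-Z]\w*)"), "workedAt"),
]


def extract_facts(sentences):
    """Apply every template to every sentence and count linked candidate facts."""
    support = Counter()
    for sentence in sentences:
        for pattern, relation in TEMPLATES:
            for match in pattern.finditer(sentence):
                subject = ALIASES.get(match.group("s"), match.group("s"))
                support[(subject, relation, match.group("o"))] += 1
    return support


sentences = [
    "Abdul Kalam was born in Rameswaram.",
    "Kalam was born in Rameswaram.",
    "A. P. J. Abdul Kalam worked at DRDO.",
]

# Rank candidate facts by how many sentences support them.
ranked_facts = extract_facts(sentences).most_common()
for fact, count in ranked_facts:
    print(count, fact)
```

Three different mentions collapse into one entity, and the fact seen in two sentences is ranked above the fact seen in one.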
Automatic Knowledge Graph Construction
Existing OpenIE Systems
OpenIE 5.0
FRED
ClausIE
MinIE
Demo
List of papers/code/data
https://github.com/gkiril/oie-resources
Challenges
A large-scale benchmark that covers all the OpenIE systems, with reproducible results
OpenIE systems for non-English languages
Handle all types of sentences from different domains
Long/short sentences, with clauses, numerical values, etc.
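One simple way to compare systems on a shared benchmark is to score each system's extracted triples against a gold set using precision, recall and F1. The sketch below assumes exact matching of triples, which is stricter than what a real benchmark would likely use, and the gold and predicted triples are invented for the example.

```python
def evaluate(predicted, gold):
    """Precision, recall and F1 of predicted triples under exact matching."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented gold annotations and system output, for illustration only.
gold = {
    ("Abdul Kalam", "bornIn", "Rameswaram"),
    ("Abdul Kalam", "workedAt", "DRDO"),
    ("Abdul Kalam", "workedAt", "ISRO"),
    ("Abdul Kalam", "awarded", "Bharat Ratna"),
}
predicted = {
    ("Abdul Kalam", "bornIn", "Rameswaram"),
    ("Abdul Kalam", "workedAt", "DRDO"),
    ("Abdul Kalam", "bornIn", "DRDO"),  # a wrong extraction
}

precision, recall, f1 = evaluate(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Fixing the gold set, the matching rule and the scoring code in advance is what makes results from different systems comparable and reproducible.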
Conclusion
For machines to understand and make sense of data, the data has to be in a structured form
Knowledge can be captured and (machine) processed if it is in the form of Knowledge Graphs
Automatically building Knowledge Graphs is hard
The quality of automatically extracted triples is often poor
Some of the open challenges were discussed