Course Project


Winter 2019
Instructor: Raghava Mutharaju
IIIT-Delhi

Logistics


  • Group project with team size of 2 or 3
    • If the team size is 1, you will automatically receive some extra credit
  • Each group works on a different dataset, and preferably a different domain as well
  • There will be intermediate deadlines
  • Expected deliverables
    • Demo (team)
    • Code, OWL file, query file (team)

Project Tasks


  1. Find the data that you want to model and query
  2. Build an ontology to model the data
  3. Convert the data into RDF triples that conform with the ontology you built
  4. Store the triples in a triple store
  5. Query (templates will be provided) the triples and show the results
  6. Add SHACL constraints (if time permits/extra credit)
  7. Develop an application on top of this structured data (extra credit)

Data


Ontology

  • Minimum requirements
    • 10 classes
    • 2 or 3 levels of class hierarchy
    • 3-5 object properties
    • 3-5 data properties
    • Property hierarchy
      • 1-2 subproperty relations
      • 1 property chain (encouraged but optional)
    • Domain and Range for all the properties
    • 3-5 Existential class expressions
    • 3 Disjoint classes
    • Use at least 1-2 ontology design patterns
  • Deadline: March 12, 2019, 11:59 pm
    • Submit the OWL file (it can be changed later)
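
As a sketch, a fragment of such an ontology in Turtle might cover several of the requirements at once (a hypothetical cooking domain matching the cr-ont prefix used later in these slides; all IRIs and names are placeholders):

```turtle
@prefix cr-ont: <http://example.org/cooking-ontology#> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix rdfs:   <http://www.w3.org/2000/01/rdf-schema#> .

# Class hierarchy (two levels)
cr-ont:Dish        a owl:Class .
cr-ont:VeganDish   a owl:Class ; rdfs:subClassOf cr-ont:Dish .
cr-ont:Ingredient  a owl:Class .

# Disjoint classes
cr-ont:Dish owl:disjointWith cr-ont:Ingredient .

# Object property with domain and range, plus a subproperty
cr-ont:hasIngredient a owl:ObjectProperty ;
    rdfs:domain cr-ont:Dish ;
    rdfs:range  cr-ont:Ingredient .
cr-ont:hasMainIngredient a owl:ObjectProperty ;
    rdfs:subPropertyOf cr-ont:hasIngredient .

# Existential class expression: every Dish has some Ingredient
cr-ont:Dish rdfs:subClassOf [
    a owl:Restriction ;
    owl:onProperty cr-ont:hasIngredient ;
    owl:someValuesFrom cr-ont:Ingredient
] .
```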

RDF Graph

  • Convert the data in the CSV file(s) to a set of triples
  • This is your instance data
  • Import the ontology and use its classes and properties in the triples
    • cr-data:raita rdf:type cr-ont:Dish
    • cr-data:raita cr-ont:hasIngredient cr-data:curd
    • cr-data:curd rdf:type cr-ont:Ingredient
  • Ignore empty and NULL values in the CSV
  • Be careful with complex cell values (they might have to be broken down into multiple separate triples)
  • Make use of a CSV library such as Apache Commons CSV
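
As an illustration of breaking down a complex cell value, a row such as `raita,"curd; cucumber"` in a hypothetical recipes CSV could expand into several triples (prefixes and IRIs are placeholders matching the cr-ont/cr-data examples above):

```turtle
@prefix rdf:     <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix cr-ont:  <http://example.org/cooking-ontology#> .
@prefix cr-data: <http://example.org/cooking-data#> .

# Row: raita,"curd; cucumber" — the multi-valued ingredients
# cell is split on ';' and each value becomes its own triple
cr-data:raita    rdf:type cr-ont:Dish ;
                 cr-ont:hasIngredient cr-data:curd ,
                                      cr-data:cucumber .
cr-data:curd     rdf:type cr-ont:Ingredient .
cr-data:cucumber rdf:type cr-ont:Ingredient .
```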

Minimum Requirements

  • Mapping file containing the mapping between the column number/name in the CSV and the class/property in the ontology. This can be in any format
  • Number of triples in the RDF Graph should not be less than 500k
  • Make use of at least one existing vocabulary such as Dublin Core, FOAF, schema.org, SKOS, SIOC, CC, etc. Modify your ontology accordingly
  • Choose any triple store (next slide). There should be an even spread of triple stores among the project groups
  • Load the triples into the triple store
  • Deadline: April 10, 2019
  • Deliverable: Submit the following screenshots
    • Mapping file
    • Your code/program that includes the triple store connection string and the result of the SPARQL count query
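
A sketch of the SPARQL count query referred to above, to run after loading the data:

```sparql
# Counts all triples in the default graph;
# the result should be at least 500k
SELECT (COUNT(*) AS ?count)
WHERE { ?s ?p ?o }
```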

Triple Stores


  1. Jena TDB
  2. Virtuoso
  3. GraphDB
  4. Blazegraph
  5. Stardog (trial version available for 60 days)
  6. RDFox
  7. AllegroGraph
  8. RDF4J
  9. 4store
  10. Mulgara

SPARQL Queries

  • Form meaningful SPARQL queries based on your data/triples and the broad templates given in the next two slides
  • Use at least two variables in each query
  • Put each query in a separate text file
  • Write code to read each query from the file
  • Use an API to submit the query to the triple store and collect the results
  • Write the results to a file
  1. At least 3 triple patterns involved in subject-subject joins. This is called a star query. You can use FILTER to limit the results (in case there are more than 100 results)
  2. At least 3 triple patterns involved in subject-object joins. This is called a chain query. Get only 100 results using the LIMIT clause
  3. At least 3 triple patterns involved in object-object joins. This is another form of star query. You can use FILTER/LIMIT to restrict the number of results
  4. Two star queries with 3 triple patterns each, joined on a common object. You can use FILTER/LIMIT to restrict the number of results
  5. Two star queries with 3 triple patterns each, connected to each other like a chain, i.e., the subject/object of the first star query is connected to the object/subject of the other star query. You can use FILTER/LIMIT to restrict the number of results
  6. Query involving reasoning: assuming that A $\sqsubseteq$ B is in your ontology and a_1, a_2, ..., a_n are instances of A (and not of B) in your RDF graph, write a query to check whether a_1, a_2, ..., a_n are also instances of B
  7. Query involving reasoning: your ontology has p rdfs:domain C for an object property p, and s p o is in your RDF graph. Write a query to check whether s rdf:type C holds. If s rdf:type C is already asserted in your graph, delete that triple and then run the query
  8. Property path query involving a sequence and at least one triple pattern. The sequence should contain at least two properties
  9. Property path query involving an alternative and at least one triple pattern. The alternative should contain at least two properties
  10. ASK query with GROUP BY and HAVING clauses and at least 3 triple patterns, such that the output is true
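
As a sketch, the first two templates could look like the following against the cooking data used elsewhere in these slides (cr-ont:hasCuisine, cr-ont:grownIn, and cr-ont:Region are hypothetical names; per the instructions above, each query would go in its own file):

```sparql
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX cr-ont: <http://example.org/cooking-ontology#>

# Star query: three triple patterns joined on the subject ?dish
SELECT ?dish ?ing ?cuisine
WHERE {
  ?dish rdf:type cr-ont:Dish ;
        cr-ont:hasIngredient ?ing ;
        cr-ont:hasCuisine ?cuisine .     # hypothetical property
}
LIMIT 100
```

```sparql
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX cr-ont: <http://example.org/cooking-ontology#>

# Chain query: the object of each pattern is the subject of the next
SELECT ?dish ?ing ?region
WHERE {
  ?dish   cr-ont:hasIngredient ?ing .
  ?ing    cr-ont:grownIn ?region .       # hypothetical property
  ?region rdf:type cr-ont:Region .       # hypothetical class
}
LIMIT 100
```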

SHACL Validation

  1. Construct a shapes graph for the following
    • Validate the datatype of a string data property (check whether the values involved with this property are indeed what you asserted in your ontology). Make use of any of the following to further validate the values connected to this data property: minCount, maxCount, minLength, maxLength
    • Validate the type of object to which the object property is connected, and check whether that node is indeed an IRI
  2. Construct a shapes graph for the following
    • Either a) validate the datatype of a string data property and use a regular expression to check whether the values of this property are well-formed (e.g., phone numbers, zip codes), or b) if you have date or numeric datatypes, make use of lessThan or lessThanOrEquals (e.g., startDate is lessThan endDate) and validate the datatype to be numeric
    • Make use of closed and ignoredProperties
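
A minimal Turtle sketch of a shapes graph for the first task, reusing the hypothetical cooking ontology from earlier slides (cr-ont:dishName is a placeholder data property):

```turtle
@prefix sh:     <http://www.w3.org/ns/shacl#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .
@prefix cr-ont: <http://example.org/cooking-ontology#> .

cr-ont:DishShape
    a sh:NodeShape ;
    sh:targetClass cr-ont:Dish ;
    # Datatype plus cardinality/length checks on a string data property
    sh:property [
        sh:path cr-ont:dishName ;      # hypothetical data property
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxLength 100 ;
    ] ;
    # Object type check plus node-kind check on an object property
    sh:property [
        sh:path cr-ont:hasIngredient ;
        sh:nodeKind sh:IRI ;
        sh:class cr-ont:Ingredient ;
    ] .
```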

Libraries