From PDF data sheets to shared understanding with serverless SHACL


Knowledge contained in PDF files

When crawling the web for information about products of a specific category, be it industrial machine parts, chemical components, or even household goods, one often finds that manufacturers provide the desired information as PDF data sheets. These documents are designed for printers and human readers rather than for processing by software agents. This regularly leads to challenges when comparing information about hundreds or thousands of products, due to the heterogeneity in structure and semantics of the information contained in PDF data sheets.

Inspired by the data-information-knowledge-wisdom hierarchy (DIKW pyramid) from information science, we propose a three-step approach to gain a shared understanding of the concepts referenced by PDF data sheets, as shown in Figure 1.

Figure 1: Proposed three-step approach inspired by the DIKW hierarchy.

The first step is to (1) parse data of PDF files in order to extract the information contained in these files. Such information could be unstructured text as well as key/value pairs describing the properties of a subject of interest.

For the second step, the extracted key/value pairs have to be (2) mapped to explicit semantics which represent formalized knowledge. The gained knowledge could be about the subject of interest or related subjects, as well as about the document itself. The mapping process therefore has to consider not only the key and value of an extracted property, but also the corresponding subject.

Finally, the third step employs mechanisms to (3) validate and enrich the formalized knowledge about subjects in relation to, e.g., shared concepts of linked data. The resulting shared knowledge graph forms the foundation for the wisdom that empowers smart agents, chatbots, and other AI tools.

In the following, we explain the implementation of the proposed three-step approach. An overview of the implementation architecture is shown in Figure 2.

Figure 2: Architecture of the implementation.

Starting from the PDF file on the left, we (1) parse PDF data sheets, (2) map the key/value pairs to explicit semantics, and (3) judge the gained knowledge using SHACL (Shapes Constraint Language).

Step 1: Parsing PDF data sheets

The problem of retrieving textual information from PDF files is probably as old as the PDF format itself, and there are a number of libraries that address it. For Python, for example, PDFMiner or PyPDF can extract all textual information from PDF files without any preparation. However, when it comes to data sheets, a lot of important information about the subject of interest is contained in tables that cannot be processed properly by such libraries. Those cases are addressed by dedicated libraries such as tabula-py, a Python wrapper around tabula-java, a Java library for extracting tables from PDF files.

Although this approach works well for extracting information from tables contained in PDF files, it requires some manual preparation of PDF files with a complex structure, for example where tables cannot be detected automatically. This is the case for most data sheets. For the preparation, Tabula offers a web view that allows users to define areas of interest within PDF files. To demonstrate the proposed three-step approach, we have gathered publicly available PDF data sheets for vacuum cleaners as provided by their respective manufacturers. One of these data sheets and the areas of interest identified using Tabula are shown in Figure 3.

Figure 3: Defining templates for Tabula.

These areas can be exported as templates for tabula-py and applied to all PDF files with a similar structure, which is typically the case for data sheets provided by the same manufacturer:

[
    {
        "page": 2,
        "extraction_method": "guess",
        "x1": 53.94688758850098,
        "x2": 237.73835289001465,
        "y1": 230.29740287780763,
        "y2": 385.8132581329346,
        "width": 183.79146530151368,
        "height": 155.51585525512695
    }, 
    {
        "page": 2,
        "extraction_method": "guess",
        "x1": 312.8919480133057,
        "x2": 511.5653133392334,
        "y1": 231.04149787902833,
        "y2": 369.4431681060791,
        "width": 198.67336532592773,
        "height": 138.4016702270508
    }
]

The result of the tabula-py export is a list of key/value pairs for each subject:

{
    "Staubbehälter Volumen": "0,6 l",
    "Filtersystem": "EPA Filter Klasse 11 (Permanent)",
    "Akkuleistung": "2.330 mAh",
    "Laufzeit": "100 min",
    "Ladezeit": "3 Stunden",
    "Flächenleistung": "150 m2",
    "Befahrbare Teppichhöhe": "15 mm",
    "Geräuschpegel": "60 dB",
    "Maße (H x Ø)": "89 x 340 mm",
    "Maße incl. Verpackung (H x B x T)": "163 x 443 x 545 mm",
    "Gewicht netto / brutto": "3 kg / 5,6 kg",
    "Besondere Ausstattung": "Intelligente Raumerkennung mit Kamera, […]",
    "Reinigungsmodi": "Zick Zack, My Space, Smart Turbo, Turbo, […]",
    "Bedienung": "Bedientasten auf der Oberfläche des Gerätes, […]",
    "Zubehör (im Lieferumfang enthalten)": "Fernbedienung, Ladestation, […]",
    "Korpus": "Ocean Black"
}
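In practice, tabula-py returns the extracted tables as a list of pandas DataFrames, one per template area. The following sketch shows how such two-column tables could be merged into a single key/value dictionary per subject; the function name `tables_to_properties` and the inline sample tables are our own illustration, not part of the original pipeline.

```python
import pandas as pd

def tables_to_properties(tables):
    """Merge two-column key/value tables (as returned by tabula-py)
    into a single properties dict for one subject."""
    properties = {}
    for table in tables:
        for key, value in table.itertuples(index=False):
            properties[str(key).strip()] = str(value).strip()
    return properties

# In the real pipeline, the tables would come from the template export, e.g.:
#   tables = tabula.read_pdf_with_template("datasheet.pdf", "template.json")
# Here we use hand-built sample tables instead:
tables = [
    pd.DataFrame([["Laufzeit", "100 min"], ["Geräuschpegel", "60 dB"]]),
    pd.DataFrame([["Korpus", "Ocean Black"]]),
]
print(tables_to_properties(tables))
```

The result has the same shape as the key/value listing shown above, ready for the mapping step.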

In this example, we see keys such as “Akkuleistung” or “Gewicht netto / brutto” which are not explicitly linked to concrete concepts. From the perspective of a machine, these keys are just strings. The values are likewise interpreted as plain strings, for example “2.330 mAh” or “3 kg / 5,6 kg”, and specify neither a data type nor a language. Therefore, the numbers are not automatically interpreted as numeric values and the unit symbols are just arbitrary characters.

Step 2: Mapping to explicit semantics

What we have after extracting table data out of PDFs is information in the form of a list of key/value pairs for each subject. But how can we recognize the knowledge that is contained within this information? For this task, we have to map ambiguous strings to explicit concepts which can be modeled using RDF as defined in the W3C recommendation.

In the case of physical quantities, we suggest employing the semantically rich and shared concepts of the QUDT project. Originally developed for the NASA Exploration Initiatives Ontology Models project, QUDT is nowadays maintained by the not-for-profit organization QUDT.org and provides detailed semantics for hundreds of quantities and thousands of associated units, including labels, descriptions, relationships with each other, and conversion information as shown in Figure 4.

Figure 4: Knowledge about the kilogram as provided by QUDT.

The importance of such explicit modelling of quantities and units is demonstrated not least by the Mars Climate Orbiter, a project with a budget of more than 300 million USD. The project failed in 1999 when the spacecraft was lost because one software component expected quantities in a different unit than the one actually provided by another component.

In order to provide explicit semantics for each subject, we have to state both: the assertions about an instance, in our case the properties of a concrete vacuum cleaner, as well as information about the terminology which is used to describe that instance.

Instance (assertions)

@prefix cc: <http://example.org/cc/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
 
cc:HOMBOT%20L77BK a cc:VacuumCleaner ;
    cc:type "Vacuum Cleaner" ;
    rdfs:label "HOMBOT L77BK" ;
    cc:mass "3 kg" ;
.

Schema (terminology)

@prefix cc: <http://example.org/cc/> .
@prefix qudt: <http://qudt.org/schema/qudt/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
 
cc:mass
  rdf:type owl:DatatypeProperty ;
  cc:rangeQuantity qudt:MassUnit ;
  rdfs:label "weight as string with any unit, e.g. \"50 kg\""@en ;
  rdfs:range xsd:string ;
.

In this example, we have the assertions that the subject is an instance of the class “cc:VacuumCleaner” and has a string value for the property “cc:mass”. Both terms are not only strings, but representations that symbolize concrete concepts. These concepts are contained in the schema that explicitly describes the intended terminology.

A notable aspect here is that the schema describes the property “cc:mass” as a reference to quantities of type “qudt:MassUnit”. This reference to the QUDT model is the precondition for exploiting schema knowledge as required in step 3 of the proposed approach. As shown here for the property “cc:mass”, the mapping has to be repeated in the same way for each property of the subject. We achieve this by comparing the concept labels derived from the schema with the strings of keys and values retrieved in step 1. Although this comparison works well in most cases, we have to bear in mind that it can also produce mismatches that later lead to incorrect results. It is therefore worth checking the results of the mapping process manually in case of doubt.
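Before a value string such as “3 kg / 5,6 kg” can be linked to a QUDT quantity, it has to be split into a numeric value and a unit symbol. The following sketch shows one way this could be done for German-formatted numbers; the regular expression and the function name `parse_quantity` are our own simplification and only handle straightforward quantity strings.

```python
import re

# Matches a German-formatted number ("2.330", "5,6", "100") followed by a
# unit symbol ("kg", "mAh", "l", ...). Intentionally simple; values such as
# dimension triples ("89 x 340 mm") would need extra handling.
QUANTITY = re.compile(r"(\d{1,3}(?:\.\d{3})*(?:,\d+)?)\s*([A-Za-zµ²³%]+)")

def parse_quantity(text):
    """Extract (value, unit) pairs from a German-formatted quantity string."""
    results = []
    for number, unit in QUANTITY.findall(text):
        # "2.330" -> 2330.0 (thousands separator), "5,6" -> 5.6 (decimal comma)
        value = float(number.replace(".", "").replace(",", "."))
        results.append((value, unit))
    return results

print(parse_quantity("3 kg / 5,6 kg"))   # [(3.0, 'kg'), (5.6, 'kg')]
print(parse_quantity("2.330 mAh"))       # [(2330.0, 'mAh')]
```

The extracted unit symbol can then be matched against the symbols that QUDT provides for its unit concepts.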

Step 3: Judging gained knowledge using SHACL

For the third step, we require tools to judge the previously gained knowledge with respect to completeness and consistency. In the case of RDF graphs, it makes sense to employ the Shapes Constraint Language (SHACL) for this purpose. SHACL was first published as a working draft in 2015 and has been a W3C recommendation since 2017. It makes use of so-called node shapes that are applied to instance data in order to produce reports on whether a subject is validly described within a certain domain. In the vacuum cleaner example, such a node shape definition could include property shapes that require exactly one value for the property “rdfs:label” and exactly one value for the property “cc:mass”. Each instance of the class “cc:VacuumCleaner” has to fulfill these constraints in order to be considered a valid instance of this class.

cc:VacuumCleaner
  rdf:type rdfs:Class ;
  rdf:type sh:NodeShape ;
  rdfs:label "Staubsauger"@de ;
  rdfs:label "vacuum cleaner"@en ;
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:path rdfs:label ;
      sh:datatype xsd:string ;
      sh:maxCount 1 ;
      sh:minCount 1 ;
      sh:name "label" ;
    ] ; 
  sh:property [
      rdf:type sh:PropertyShape ;
      sh:path cc:mass ;
      sh:datatype xsd:string ;
      sh:maxCount 1 ;
      sh:minCount 1 ;
      sh:name "mass" ;
    ] ;
.

This shape can now be applied to all instances of “cc:VacuumCleaner” to test whether a description is a valid vacuum cleaner description or not. A number of tools have been developed that support such validations using SHACL, e.g. TopBraid Composer by TopQuadrant. An exemplary validation report as created by TopBraid Composer is shown in Figure 5.

Figure 5: SHACL report created by TopBraid Composer.

Similar to the SHACL validation functionality of TopBraid Composer, the same SHACL shapes file can be applied to RDF graphs using programming libraries that implement SHACL functionality, such as the TopBraid SHACL API for Java or the pySHACL validator for Python. If the subject description passed to such a library complies with the SHACL shape definition associated with the class of the subject, the API returns a report in RDF which states that the constraints of the shape are fulfilled.

[] a sh:ValidationReport ;
    sh:conforms true .

In case of a constraint violation, the generated report contains the exact parts of the shape definition which are not fulfilled. This detailed report allows us to automatically identify any constraint violation of the subject description.

[] a sh:ValidationReport ;
    sh:conforms false ;
    sh:result [ a sh:ValidationResult ;
            sh:focusNode cc:113723_origin_3 ;
            sh:resultPath cc:mass ;
            sh:resultSeverity sh:Violation ;
            sh:sourceConstraintComponent sh:MinCountConstraintComponent ;
            sh:sourceShape [ a sh:PropertyShape ;
                    sh:datatype xsd:string ;
                    sh:maxCount 1 ;
                    sh:minCount 1 ;
                    sh:name "mass" ;
                    sh:path cc:mass ] ] .

In addition to validating RDF graphs, SHACL also provides advanced features for applying rules to RDF models. These rules can be defined as standard SPARQL queries, SPARQL being a W3C recommendation itself. Again, TopBraid Composer is capable of executing such rules and enriching subject descriptions based on them. Examples of such rules are shown in Figure 6 and Figure 7.

Figure 6: SHACL rule to transform a string quantity to a float quantity with explicit unit.
Figure 7: SHACL rule to transform a QUDT node to a string quantity with unit symbol.
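A rule in the spirit of Figure 6 could look roughly as follows. This is our own illustrative sketch: the property names “cc:massValue” and “cc:massUnit” are hypothetical, and the rule only covers values that literally end in “ kg”.

```turtle
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix cc:   <http://example.org/cc/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix unit: <http://qudt.org/vocab/unit/> .

cc:VacuumCleaner
    sh:rule [
        a sh:SPARQLRule ;
        rdfs:comment "Derive a numeric mass value with an explicit QUDT unit from the plain string."@en ;
        sh:construct """
            PREFIX cc:   <http://example.org/cc/>
            PREFIX unit: <http://qudt.org/vocab/unit/>
            PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
            CONSTRUCT {
                $this cc:massValue ?value ;
                      cc:massUnit  unit:KiloGM .
            }
            WHERE {
                $this cc:mass ?string .
                FILTER(STRENDS(?string, " kg"))
                BIND(xsd:float(REPLACE(STRBEFORE(?string, " kg"), ",", ".")) AS ?value)
            }
        """ ;
    ] .
```

During rule execution, `$this` is bound to each target node of the shape, here each instance of “cc:VacuumCleaner”.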

These exemplary rules can be employed to transform statements about quantities to statements that represent the same meaning using different units or a different representation as needed for varying use cases. The result of applying those rules to the example of a vacuum cleaner description is shown in Figure 8.

Figure 8: Equivalent string units inferred by the QUDT model.

In this example, the string “25 kg” is transformed to derived strings with an equivalent meaning such as “0.025 mT” or “55.12 lbm” just by exploiting the semantics provided by QUDT.

Thanks to this model-driven approach, it is possible to reuse the same programming code for plenty of use cases in varying domains, such as industrial machine parts, chemical components, or even household goods. All that has to be maintained is the model itself, which could also be done by people without programming skills due to the availability of suitable RDF and SHACL modelling software.

Serverless implementation

The proposed approach is not limited to processing pipelines that have to be installed and prepared on dedicated hardware for each use case. In fact, the generic functionality of SHACL allows for a reusable and even serverless implementation. For example, the SHACL shapes can be maintained and shared using a hosted version control system such as GitHub which allows you to keep track of changes within the schema definition as shown in Figure 9.

Figure 9: Keep track of schema changes with GitHub.

In order to validate a subject description, the RDF graph of that subject could be serialized as Turtle and sent to a serverless cloud function such as AWS Lambda or Azure Functions that evaluates the RDF graph on demand using the SHACL shape maintained in the hosted version control system. The functionality of such an implementation can easily be tested using any web API testing tool such as Postman, as shown in Figure 10.

Figure 10: Using Postman to send Turtle as POST request to an AWS Lambda function and receive a SHACL report as boolean, text, and JSON-LD.

In this example, the RDF graph describing the subject of interest is serialized as Turtle and sent as a POST request to an AWS Lambda function which is associated with the schema definition in GitHub. The response contains a Boolean value that states whether the described subject conforms to the SHACL constraints defined in the schema for instances of the associated class. In case of a constraint violation, the response also contains a textual description of the exact constraint violation and an RDF graph (in this example serialized as JSON-LD) that provides a machine-readable SHACL report. In the given example, the subject is described as an instance of the class “cc:VacuumCleaner”. According to the SHACL schema definition for this class, each instance has to state a value for the property “cc:mass”, which is not the case for the submitted subject description. Therefore, the validation function returns a report that states this constraint violation.

Conclusion

In this article we have implemented a generic approach to prepare formalized knowledge for a shared understanding by extracting the information contained in arbitrary PDF data sheets. By employing mostly W3C recommendations such as RDF, SPARQL, and SHACL, we ensure that the approach is future-proof and in accordance with the efforts of a global community. We also employ freely available schema knowledge as provided by QUDT as well as standardized tools to reduce the risk of untested software. In addition, we have introduced a reusable and serverless implementation for a SHACL validation pipeline using GitHub and AWS Lambda to reduce setup and maintenance costs in varying environments.


Matthias is an IT consultant at codecentric in Karlsruhe and has been involved in IT projects since 2003. At Siemens AG in Stuttgart, he learned the practical application of information systems from scratch. He studied business informatics in Karlsruhe with semesters abroad in Singapore, India, and Taiwan. Before joining codecentric, Matthias worked at the FZI Research Center for Information Technology and published the results of his scientific work in the area of semantic web, knowledge graphs, and the processing of IoT data streams.
