
Abstract
This project aims to automate the processing of Swiss tax-related documents while taking the profiles of tax households into account. The system helps companies profile their clients with regard to their tax and insurance situation, so that they can make better and faster decisions. By companies, we mean fiduciaries, insurance brokers and any other type of company that has a document-based relationship with individuals and needs access to client information and key documents in order to follow their personal situation.
We use an AI approach based on knowledge engineering to infer and reason about the classification of documents in the domain of Swiss insurance and fiduciary services, and to automatically build and reason about tax household profiles.
To reach this objective, we propose an innovative approach in which the system comprises several stages. First, the information extraction module processes native PDFs or scanned documents. For each document, this module identifies the document class (e.g. health insurance policy) and extracts specific information (e.g. date, amount), and it generates one JSON file per document. The JSON files are then processed by the Reasoning, Labelling and Profiles updates component, which contains an ontology of the Swiss tax declaration as well as people profiles. This component also contains a rule-based reasoner, which serves on the one hand to create or update the JSON files of profiles based on the provided documents (e.g. a new child belongs to the household because the system processed a health insurance policy for that child), and on the other hand to identify any missing documents based on the profiles the system already has (e.g. health fees are missing for a person identified as being part of the household).
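The two roles of the rule-based reasoner can be illustrated with a minimal Python sketch. It is not the project's implementation: the JSON field names (`class`, `person`, `role`), the `EXPECTED_DOCUMENTS` rule table and both function names are assumptions made for illustration only.

```python
import json

# Hypothetical per-document output of the information extraction module.
documents = [
    {"class": "salary_certificate", "person": "Parent A", "role": "adult", "amount": 85000},
    {"class": "health_insurance_policy", "person": "Child C", "role": "child", "date": "2021-01-01"},
]

# Hypothetical rule table: document classes expected for each kind of household member.
EXPECTED_DOCUMENTS = {
    "adult": {"salary_certificate", "health_insurance_policy"},
    "child": {"health_insurance_policy", "health_fees"},
}


def update_profile(profile, document):
    """Add the person named in a document to the household and record the document class."""
    member = profile["members"].setdefault(
        document["person"], {"role": document["role"], "documents": []}
    )
    if document["class"] not in member["documents"]:
        member["documents"].append(document["class"])
    return profile


def missing_documents(profile):
    """Return, per household member, the expected document classes not yet received."""
    missing = {}
    for name, member in profile["members"].items():
        gap = EXPECTED_DOCUMENTS[member["role"]] - set(member["documents"])
        if gap:
            missing[name] = sorted(gap)
    return missing


# Build the household profile from the documents, then look for gaps.
profile = {"members": {}}
for doc in documents:
    update_profile(profile, doc)

print(json.dumps(profile, indent=2))
print(missing_documents(profile))
```

In this sketch, processing the child's health insurance policy adds the child to the household, and the second function then reports the health fees as missing for that child.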
Our system works as follows. In the case of a tax household composed of two working parents and a child, the system extracts data from the two salary certificates. The information extraction module outputs JSON data that includes the extracted data, and provides the tax household profile (employees with one child). Once information extraction is complete, the data are sent to the mapping module, which outputs them as RDF triples. When the data mapping is complete, the rules for profile classification and document labelling can be run on the mapped data. The results of the profile classification show that: 1) the parents are identified as wageworkers; 2) they have to deliver some documents such as health insurance, etc.; 3) the salary certificate is multi-labelled as “Tax” and “Income”.
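The example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: triples are modelled as plain tuples rather than with an RDF library, and the `ex:` vocabulary (`ex:concerns`, `ex:hasLabel`, `ex:mustDeliver`, `ex:Wageworker`), the record fields and the function names are hypothetical, not the project's ontology or rules.

```python
# Hypothetical JSON records for the two salary certificates.
extracted = [
    {"id": "doc1", "class": "SalaryCertificate", "person": "parentA"},
    {"id": "doc2", "class": "SalaryCertificate", "person": "parentB"},
]


def to_triples(records):
    """Mapping step: turn each JSON record into (subject, predicate, object) triples."""
    triples = set()
    for rec in records:
        triples.add((f"ex:{rec['id']}", "rdf:type", f"ex:{rec['class']}"))
        triples.add((f"ex:{rec['id']}", "ex:concerns", f"ex:{rec['person']}"))
    return triples


def classify_profiles(triples):
    """Rule: a person concerned by a salary certificate is a wageworker."""
    certificates = {
        s for s, p, o in triples if p == "rdf:type" and o == "ex:SalaryCertificate"
    }
    return {
        (o, "rdf:type", "ex:Wageworker")
        for s, p, o in triples
        if p == "ex:concerns" and s in certificates
    }


def label_documents(triples):
    """Rule: a salary certificate carries both the Tax and the Income label."""
    labels = set()
    for s, p, o in triples:
        if p == "rdf:type" and o == "ex:SalaryCertificate":
            labels.add((s, "ex:hasLabel", "ex:Tax"))
            labels.add((s, "ex:hasLabel", "ex:Income"))
    return labels


def required_documents(profiles):
    """Rule: each wageworker has to deliver a health insurance document."""
    return {
        (s, "ex:mustDeliver", "ex:HealthInsurance")
        for s, p, o in profiles
        if o == "ex:Wageworker"
    }


triples = to_triples(extracted)
profiles = classify_profiles(triples)
labels = label_documents(triples)
required = required_documents(profiles)

for triple in sorted(triples | profiles | labels | required):
    print(triple)
```

Keeping the mapping step separate from the rules mirrors the order described above: the rules only ever see triples, never the raw JSON.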
As the example shows, our system enables the automatic processing of tax documents according to the profiles of tax households. Companies will therefore have better profiling and understanding of their clients and their documents, and will be able to automate work and focus more on value-added services such as advisory.
Through this project, we achieved the following results: 1) the development of an ontology covering Swiss tax households, documents, user profiles, changes to profiles, and tax sections; 2) the use of two alternative approaches for classifying documents and extracting information (rules based on keywords, and lexical annotation using the open-source tool GATE and based on the fiscal ontology), where the extracted information forms the content of the JSON files; 3) document labelling; 4) the definition of tax household profiles; 5) the representation of extracted data as RDF triples by way of the mapping process; 6) the multi-label classification of documents; 7) the recognition of tax documents based on their relevant features; 8) the classification of users into different user profiles, based on the tax documents that users deliver for the tax return.
Authors
Di Marzo Serugendo Giovanna; Falquet Gilles; Metral Claudine; Cappelli Maria Assunta; Wade Assane; Ghadfi Sami; Cutting-Decelle Anne-Françoise; Caselli Ashley; Cutting Graham.