Conceptual Framework

Step 1: Data Gathering
Step 2: Data Harmonization
Step 3: Data Access
Step 4: Data Analysis

Data gathering/collection (applying existing methodology) This step includes an inventory of the existing data from the first 6 months of the pandemic in Rwanda (the first case was identified in March 2020) and a 1-year data collection (through mobile application surveys, telephone calls and face-to-face interviews). The data sources are expected to come in different formats, ranging from COVID-19 related data registered in Excel documents, through data sources containing Minimum Clinical Data (MCD) in DHIS2 and other systems, to more granular Electronic Medical Record (EMR) data in Open Clinic, OpenMRS and other EMR systems. We will start by mapping full hospital patient records, focusing on 15 hospitals located in regions with a high number of COVID-19 patients, and complete these with other isolated datasets. The list of hospitals will be defined at the start of the project, as the volume of COVID-19 data is still increasing. In total, the study targets about 1 terabyte (1,000 gigabytes) of data, a volume we consider sufficient for the planned AI modelling.

The new data collection will follow validated guidelines and principles for data collection and will take a longitudinal approach using mobile app questionnaires. A minimum of 214 people per administrative district (6,420 persons throughout Rwanda, totalling 154,080 survey entries over 24 weeks) will be asked to respond through the mobile app weekly for 6 months (24 weeks).
A minimum sample of 10 persons per district will be contacted by a data collector twice (at the beginning and at the end of the study) through a validation phone call or, if the COVID-19 situation in Rwanda allows, a face-to-face questionnaire.
A sub-group of patients who have recovered from COVID-19 will be specifically followed. If a followed subject has a medical file in a participating hospital, the two datasets will be linked, with the possibility of data linkage requests in the future.
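The sample size figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in the text. The number of districts is not stated explicitly and is derived here from the two sample figures, and the total number of validation contacts is likewise a derived quantity, not a figure from the proposal.

```python
# Figures taken from the text above.
MIN_RESPONDENTS_PER_DISTRICT = 214
TOTAL_RESPONDENTS = 6_420
WEEKS = 24
VALIDATION_SAMPLE_PER_DISTRICT = 10
VALIDATION_ROUNDS = 2  # beginning and end of the study

# Derived (not stated in the text): number of administrative districts.
assert TOTAL_RESPONDENTS % MIN_RESPONDENTS_PER_DISTRICT == 0
districts = TOTAL_RESPONDENTS // MIN_RESPONDENTS_PER_DISTRICT

# One survey entry per respondent per week.
survey_entries = TOTAL_RESPONDENTS * WEEKS

# Derived: minimum number of validation calls or visits over the whole study.
validation_contacts = (
    districts * VALIDATION_SAMPLE_PER_DISTRICT * VALIDATION_ROUNDS
)

print(f"districts: {districts}")
print(f"weekly survey entries over {WEEKS} weeks: {survey_entries:,}")
print(f"minimum validation contacts: {validation_contacts}")
```

The computed number of survey entries matches the 154,080 quoted in the text.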

The respondents will be randomly sampled from the national population registry of each district, with the support of the National Institute of Statistics of Rwanda (NISR). The sample will include males and females in proportion to the number of inhabitants. There is a risk that the number of respondents will be insufficient or that respondents will not report regularly; for this reason, a team of data collectors will call subjects once a week to complete missing data and improve the response rate. Each participant will receive a weekly mobile connection fee and internet bundle to allow data collection. To mitigate the expected effect of the gender digital divide, and to reach selected persons who no longer have a mobile phone, the consortium will establish mitigation measures including, but not limited to, leveraging the community healthcare workers (CHWs). Each village in Rwanda has a CHW who participates in various Ministry of Health (MoH) programs, and all CHWs have received mobile phones from the MoH. If a selected respondent has no mobile phone, we will liaise with the nearest CHW to reach that person. The budget includes a service payment for the CHWs involved. Other measures will be specified and tested in the sampling plans and practical data collection plans, which will be developed at the beginning of the project. The questionnaires (which will be available in the mobile application in 3 languages: Kinyarwanda, English and French) include 10 modules (at least 8 of which must be fulfilled by the project):

1. Demographics;
2. Face mask use;
3. Hand hygiene;
4. Respect of social distancing measures and risk minimization measures;
5. Recent exposure to risk situations and COVID-19 measures.

On the outcome side, the collected data will include: 6. Coronavirus-like signs and symptoms; 7. Mental health indicators (based on the Generalized Anxiety Disorder, GAD, scale); 8. Socio-economic impact (based on loss of income, or categories); 9. COVID-19 test results; and 10. geofencing data, if available (no personal data will be collected): only anonymous phone tracking approved by the Ethical Committee and enabled on the individual device on a voluntary basis. The sampling plans and practical data collection plans will be developed at the beginning of the project. They will help to overcome biases, in particular by integrating the gender dimension to address the gender digital divide, which is known worldwide and also in Rwanda.
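As an illustration of the proportional sampling by sex mentioned above, the following minimal sketch allocates a district sample across strata with the largest-remainder method and then draws respondents at random within each stratum. The registry contents, the identifiers and the function names are hypothetical, and the actual procedure will be fixed in the sampling plans.

```python
import random


def proportional_allocation(stratum_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes
    (largest-remainder rounding so the parts add up exactly)."""
    total = sum(stratum_sizes.values())
    quotas = {k: sample_size * n / total for k, n in stratum_sizes.items()}
    alloc = {k: int(q) for k, q in quotas.items()}
    shortfall = sample_size - sum(alloc.values())
    # Hand the remaining seats to the strata with the largest fractions.
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:shortfall]:
        alloc[k] += 1
    return alloc


def draw_sample(registry, sample_size, seed=0):
    """registry: iterable of (person_id, sex) pairs for one district."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    by_sex = {}
    for person_id, sex in registry:
        by_sex.setdefault(sex, []).append(person_id)
    sizes = {sex: len(ids) for sex, ids in by_sex.items()}
    alloc = proportional_allocation(sizes, sample_size)
    return {sex: rng.sample(by_sex[sex], alloc[sex]) for sex in by_sex}


# Hypothetical district registry: 520 women and 480 men.
registry = [(f"P{i:04d}", "F" if i < 520 else "M") for i in range(1000)]
sample = draw_sample(registry, sample_size=214)
print({sex: len(ids) for sex, ids in sample.items()})
```

With this made-up registry, the 214 respondents of a district split into 111 women and 103 men, mirroring the 52/48 population split.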

Infrastructure for data harmonization (developing novel techniques) For data harmonization, custom-designed ETL scripts will be developed per data source to extract, transform and load the source data into an OMOP CDM database instance. In the early stages, when the hospital EHRs are not yet harmonized, we will also use synthetic data approaches to help automate the harmonization processes. The infrastructure on the data owner side will include the OMOP CDM database instance, the Arachne client, the OHDSI Atlas analytical tool, R Studio, and Jupyter. The data harmonization process converts the observational data from the format of the source data system to the OMOP Common Data Model (OMOP CDM), the CDM supported by the Observational Health Data Sciences and Informatics (OHDSI) organization. The project will benefit from the experience of consortium members (led by UGent, with support from Edence Health NV company experts) in the steps typically involved in the data harmonization process:

Mapping workshop: this is a live workshop (in person or via video conference), usually a full day, where the initial mapping from source data to OMOP CDM is discussed in detail.
Structure mapping + final mapping doc: Based on the mapping workshop, documentation and notes, the structure mapping is finalized and documented in the mapping document. This forms the basis of the ETL design.
Code mapping: depending on which source terminologies are used in the data source, mapping the local codes to the standard vocabularies used in OMOP CDM (LOINC, SNOMED, RxNorm, etc.) can be either a short, easy process or a long, involved one with multiple iterations.
Implementation of ETL (extract, transform and load: database functions combined into one tool that pulls data out of one database and places it into another): the ETL script(s) that transform the source data into the OMOP CDM database instance are written, normally in Python.
ETL testing: the ETL scripts are tested on development data and, ideally, also on the data source's test data.
ETL deployment: once the ETL scripts have been tested successfully, they are packaged and deployed using GitHub and Docker.
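To make the code mapping and ETL implementation steps more concrete, the sketch below loads a small patient extract into a simplified OMOP CDM person table. It is an illustration under stated assumptions, not the project's ETL: the source column names (patient_id, sex, dob) and the sample rows are hypothetical, only a subset of the person table columns is shown, and an in-memory SQLite database stands in for the real OMOP CDM database instance. The gender concept IDs 8507 (male) and 8532 (female) are the standard OMOP concepts; unmapped values get concept 0.

```python
import csv
import io
import sqlite3
from datetime import date

# Code mapping: local source codes -> standard OMOP gender concepts.
GENDER_CONCEPTS = {"M": 8507, "F": 8532}
UNMAPPED_CONCEPT = 0

# Hypothetical source extract (column names and rows are made up).
SOURCE_CSV = """patient_id,sex,dob
H01-0001,F,1985-03-14
H01-0002,M,1972-11-02
H01-0003,,1990-07-21
"""


def extract(text):
    """Read the source extract into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Map source rows to tuples matching the simplified person table."""
    person_rows = []
    for person_id, row in enumerate(rows, start=1):  # assumed surrogate key
        born = date.fromisoformat(row["dob"])
        sex = row["sex"].strip().upper()
        person_rows.append((
            person_id,
            GENDER_CONCEPTS.get(sex, UNMAPPED_CONCEPT),
            born.year, born.month, born.day,
            row["patient_id"],  # person_source_value keeps the local ID
            row["sex"],         # gender_source_value keeps the local code
        ))
    return person_rows


def load(conn, person_rows):
    """Create the simplified person table if needed and insert the rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS person ("
        "person_id INTEGER PRIMARY KEY, gender_concept_id INTEGER NOT NULL, "
        "year_of_birth INTEGER NOT NULL, month_of_birth INTEGER, "
        "day_of_birth INTEGER, person_source_value TEXT, "
        "gender_source_value TEXT)"
    )
    conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?, ?, ?, ?)", person_rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the OMOP CDM instance
load(conn, transform(extract(SOURCE_CSV)))
for row in conn.execute("SELECT person_id, gender_concept_id, year_of_birth FROM person"):
    print(row)
```

Keeping extract, transform and load as separate functions makes each stage testable on development data before the script is packaged for deployment.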

The data harmonization process will differ quite substantially for different data sources. In terms of architecture design, we propose the following conceptual framework:

HIGH-LEVEL CONCEPTUAL FRAMEWORK OF THE PROJECT

Infrastructure for data access, query, and data analysis (Mixing existing methods and innovative techniques) The central platform for data access, query, and data analysis, or central site setup, will manage and coordinate the studies performed across the participating data sources. The central site should at a minimum consist of a database, an Atlas instance, a catalogue of data sources, an R Studio instance, and possibly also a Jupyter server instance. Depending on the network infrastructure chosen (see above), there may also be an installation of the Arachne central server. The database, for example a PostgreSQL database, will include an OMOP CDM schema, as well as additional schema(s) to support a central data catalogue and study coordination.

New techniques exist for creating synthetic data and for using data to help automate harmonization processes and train models. These approaches will also be used in our project from the very beginning, when the harmonized data from hospital EHRs are not yet available, in particular by leveraging mock-up data available from the OHDSI community (such as Synthea) to train different algorithms and models before we use them on real data.
The OMOP CDM schema will have the same OMOP CDM vocabulary version as the participating sites and will allow studies to be prepared and tested. If needed, a synthetic data set (e.g. Synthea) or available local data set can be loaded.
There will also be result schemas that will be able to hold the Achilles output per data source site – this will allow a central view on the descriptive statistics for each site.
The database will also be the place to gather aggregated results from the data source sites as part of defined studies.
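The gathering of aggregated results described above can be sketched as follows. This is a minimal illustration, assuming each data source site returns only aggregate counts per cohort (no patient-level rows). The site names, cohort name, field names and numbers are all made up, and the actual exchange format will depend on the study definitions and the network tooling.

```python
from collections import defaultdict

# Hypothetical aggregate results returned by two sites for one study.
SITE_RESULTS = [
    {"site": "hospital_a", "cohort": "covid_positive", "n_persons": 120, "n_events": 15},
    {"site": "hospital_b", "cohort": "covid_positive", "n_persons": 80, "n_events": 5},
]


def pool_site_results(results):
    """Sum per-site aggregate counts per cohort and derive a pooled rate."""
    pooled = defaultdict(lambda: {"n_sites": 0, "n_persons": 0, "n_events": 0})
    for r in results:
        entry = pooled[r["cohort"]]
        entry["n_sites"] += 1
        entry["n_persons"] += r["n_persons"]
        entry["n_events"] += r["n_events"]
    summary = {}
    for cohort, entry in pooled.items():
        # Avoid dividing by zero when a cohort is empty at every site.
        rate = entry["n_events"] / entry["n_persons"] if entry["n_persons"] else None
        summary[cohort] = {**entry, "event_rate": rate}
    return summary


pooled = pool_site_results(SITE_RESULTS)
print(pooled)
```

Because only counts leave each site, the central database can hold study results without receiving any individual records.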

The OHDSI Atlas instance is integrated with the PostgreSQL database (open source and already in use). The central Atlas instance will, as mentioned above, allow cohort definitions and studies to be prepared, and descriptive statistics to be viewed for each participating site. The R Studio and Jupyter instances will allow R scripts to be developed and tested as part of a study design, and data collected from data source sites to be analyzed as part of studies. The Arachne central server setup will allow central management of network studies, with tight integration with OHDSI tools such as Atlas.

Data analysis and interpretation (Mixing existing methods and innovative techniques) The federated datasets are challenging to analyse with traditional statistical methods because, like other real-world data (RWD), they are 1) collected without any intention of being used in research; 2) incomplete and not cleaned; and 3) collected sporadically rather than with a strictly longitudinal approach, so cohort-like data cannot be derived from them directly. The current project will leverage AI techniques, including machine learning and data mining, which add value by discovering hidden patterns or relationships between data points. The machine learning model consists of two modules: GRU-ODE[1], responsible for learning the continuous dynamics of the latent process that generates the observations, and GRU-Bayes, responsible for processing incoming observations and updating the current conditional estimate of the latent process. These two modules are similar in essence to the propagation and update steps of a Kalman filter. With GRU-ODE, we are able to project the hidden process h(t) forward in time and hence, indirectly, future observations. GRU-Bayes performs the update of the hidden state conditioned on new observations. Yet, unlike a Kalman filter, this approach can learn very complex dynamics for the latent process.

The figure below shows the overall architecture that we propose to support this project. The design incorporates the following parts: Central platform: includes a data catalogue describing the different data sources, the Arachne central hub, a central OHDSI Atlas instance, a central database, as well as R Studio and Jupyter