Memex - Pandemic Dataset Example
Thankfully, the latest Covid-19-related publications are made available every day.
The latest *.tar.gz files are unpacked onto the file system.
The metadata.csv file is 51 MB in size and contains all the publication abstracts. There are approximately 270K publications in this data set. Each record contains more than 1000 fields, of which the abstract text is one, and the file also holds references to the original PubMed articles. One example record, with more than 1500 fields, is shown below:

In this example the user searched for the author "Korth"; the record above also indicates "liver" as a significant finding. Domain experts in many fields have good personal memories, and "liver" is only one of many words that could be of interest. They look for word combinations co-occurring within the same abstract or text. In this data set they may also look for new publications, novel findings, or findings that match the results of a hypothesis under test. In the typical case, the domain user may want to find either trivial matches or specific matches. Another representation of the same record, within a modern user interface, is shown below:
This representation gives more thorough context by showing a timeline and two articles that match both author:Korth and abstract:liver, an easy conjunctive query. Let's relax the requirement that author:Korth be associated with records involving "abnormal livers".
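As a rough illustration, the conjunctive query above can be written in the Elasticsearch query DSL as a bool query. The sketch below only builds the request bodies as Python dictionaries. The field names `author` and `abstract` are assumptions about how the index is mapped, and moving the author clause into `should` is just one possible reading of "relaxing" the requirement.

```python
import json


def conjunctive_query(author, abstract_term):
    """Bool query where BOTH clauses must match (logical AND)."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"match": {"author": author}},           # assumed field name
                    {"match": {"abstract": abstract_term}},  # assumed field name
                ]
            }
        }
    }


def relaxed_query(author, abstract_term):
    """Abstract term is required; the author clause is optional and only
    raises the score of records that also match it."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"abstract": abstract_term}}],
                "should": [{"match": {"author": author}}],
            }
        }
    }


strict_body = conjunctive_query("Korth", "liver")
relaxed_body = relaxed_query("Korth", "liver")
print(json.dumps(strict_body, indent=2))
```

Either body could be sent to an index's `_search` endpoint; the relaxed form returns every record mentioning "liver" and ranks the Korth records higher.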
It is possible to continually narrow down the records selected by the search query, but this is a tedious process. Typically a domain expert finds a short list of records that are close to where their interest lies. A partial "memex-like" search capability called "more like this" is a non-trivial query that finds the best matches by considering many terms at once, collected from any set of fields; in the example below, abstract fields were specified:
Given 9 different terms, the query finds the 14 best-matching documents on the timeline. The same query, viewed from a time-invariant perspective, orders the best matches by a default score that can be tuned to suit a user group's needs.
Alternatively, given a specific record, matching as many terms as possible from its fields is another way to invoke the same "more like this" API.
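Both ways of invoking "more like this" map onto the Elasticsearch `more_like_this` query, whose `like` parameter accepts free text, references to indexed documents, or both. The sketch below builds the two request bodies. The index name `covid-abstracts`, the document id, the field name `abstract` and the sample terms are all hypothetical placeholders, not values from the data set.

```python
def more_like_this(fields, like, max_query_terms=25, min_term_freq=1):
    """Build a more_like_this request body.

    `like` may be a text string, a document reference such as
    {"_index": ..., "_id": ...}, or a list mixing both.
    """
    likes = like if isinstance(like, list) else [like]
    return {
        "query": {
            "more_like_this": {
                "fields": list(fields),
                "like": likes,
                "min_term_freq": min_term_freq,      # keep terms seen only once
                "max_query_terms": max_query_terms,  # cap on selected terms
            }
        }
    }


# Variant 1: start from a handful of terms (hypothetical terms).
by_terms = more_like_this(["abstract"], "liver injury enzyme elevation")

# Variant 2: start from one indexed record (hypothetical index and id).
by_record = more_like_this(
    ["abstract"], {"_index": "covid-abstracts", "_id": "example-id"}
)
```

Lowering `min_term_freq` to 1 matters for short inputs, since the default would ignore terms that occur only once in the text supplied.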
The Saffron Tech Memex acted in a similar manner, enabling a query called "Entities Like Me", but it kept a compressed graph of all co-occurring terms across all data sources. The Saffron Tech Memex's largest query contained nearly 1 million attributes in order to find similar genomic patterns against a graph with more than 10 billion correlated attributes.
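To make the idea of a graph of co-occurring terms concrete, here is a toy sketch that counts how often each pair of terms appears in the same record. It is uncompressed and in memory, and it is not the Saffron Tech implementation, whose internals are not described here. The sample records are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(records):
    """Count, for every unordered pair of terms, how many records contain both.

    Each pair is stored as a sorted tuple so (a, b) and (b, a) share one edge.
    """
    counts = Counter()
    for terms in records:
        for a, b in combinations(sorted(set(terms)), 2):
            counts[(a, b)] += 1
    return counts


# Invented sample records: each is the set of terms found in one document.
sample_records = [
    ["korth", "liver", "covid-19"],
    ["liver", "covid-19", "enzyme"],
    ["korth", "covid-19"],
]
graph = cooccurrence_counts(sample_records)
print(graph.most_common(3))
```

The edge weights are what a similarity query would later consult; at the scale mentioned above, the hard part is compressing this structure, which the sketch does not attempt.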
The software technical challenges are described further below. The associated data set files are useful for re-ingesting into other software tools; many groups use tool sets that obscure information that may be available within the data sets. PubMed publications as a rule contain copious labelled fields, text, abstracts, authors, bibliographic references and URLs pointing to the full published articles. The data set also contains PubMed cited versions and PDF extractions from the original articles.
There are more than 1000 fields contained within the abstracts and more than 2000 fields contained within the published articles. Many databases limit a table to around 1000 fields. Databases are usually set up to assure consistency and optimize storage. SQL is the standard method for querying databases, and most users lack the skills needed to join all the tables required to view every available field in a perspective conditioned on their queries. Inevitably this leads to end-user consternation: their data is trapped inside the database and, without access to a skilled SQL user, it stays there.
Fortunately, a few companies have stepped up to give users easier access to their data. MongoDB was one of the first. More recently came the ELK stack, originally built to find hackers efficiently across the internet and large data centers; it can also be used for other data sets. Elasticsearch lets domain users view and search all of their data relatively easily and quickly, rather than leaving it accessible only to skilled SQL users. The ELK stack is built from three subsystems: Elasticsearch, the index server; Kibana, a Node.js server that serves JavaScript-rendered pages; and Logstash, a Ruby-based (JRuby) server that pushes data into or pulls data from the index. The number of fields is user configurable; in my personal and corporate-sponsored studies, I have built indexes with more than 3000 searchable fields.
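Logstash normally handles ingestion, but it can help to see what pushing data into the index amounts to. The sketch below formats records as the newline-delimited JSON body that the Elasticsearch `_bulk` endpoint expects: one action line followed by one source line per record. The index name and the sample records are hypothetical, and the sketch stops short of sending anything over the network.

```python
import json


def to_bulk_ndjson(index_name, records):
    """Format records as a _bulk request body (NDJSON).

    Every record becomes two lines: an "index" action, then the document.
    The body must end with a trailing newline.
    """
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(record))
    return "\n".join(lines) + "\n"


# Hypothetical records, standing in for rows parsed from metadata.csv.
sample_rows = [
    {"author": "Korth", "abstract": "abnormal liver findings"},
    {"author": "Example", "abstract": "lung damage in patients"},
]
payload = to_bulk_ndjson("covid-abstracts", sample_rows)
print(payload)
```

The payload would be posted to `_bulk` with the content type `application/x-ndjson`. Fields not yet known to the index are added to the mapping dynamically by default, which is one reason wide records are easy to load.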
Most humans don't have the capacity or inclination to search through 3000 fields. The ELK stack is not the best possible Memex implementation. Memexes are useful because they are designed to return the set of matches most similar to an example, or set of examples, based on a similarity score. The Saffron Tech Memex contained an efficient, complex graph index and a set of similarity metrics that could be applied to it.
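A minimal sketch of "return the most similar matches to an example" follows. It uses Jaccard similarity over sets of terms, which is a stand-in chosen for brevity and not one of the Saffron Tech metrics. The record ids and terms are invented.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def most_similar(example_terms, records, top_k=3):
    """Rank records by similarity to the example, best first.

    `records` maps a record id to its list of terms. Ties are broken by id
    so the ordering is deterministic.
    """
    scored = [(jaccard(example_terms, terms), rid) for rid, terms in records.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored[:top_k]


# Invented records for illustration.
term_records = {
    "r1": ["liver", "covid-19", "korth"],
    "r2": ["lung", "covid-19"],
    "r3": ["liver", "enzyme", "covid-19"],
}
ranked = most_similar(["liver", "covid-19"], term_records)
print(ranked)
```

The user supplies an example rather than a field-by-field query, which is what makes thousands of fields tractable.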
At search time the ELK stack uses a bitmap index implementation called roaring bitmaps. An earlier bitmap index implementation, FastBit, originated at Lawrence Berkeley National Laboratory (LBNL) and was used to search for particles within CERN's LHC experiment. I have built FastBit indexes with more than 1 million searchable fields.
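The core idea of a bitmap index can be sketched in a few lines: each (field, value) pair owns a bitmap with one bit per record, and a conjunctive query is a bitwise AND. The toy below uses plain Python integers as uncompressed bitmaps. It is neither roaring bitmaps nor FastBit, both of which add compression schemes that this sketch leaves out, and the field names and values are invented.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy bitmap index: one integer bitmap per (field, value) pair."""

    def __init__(self):
        self.bitmaps = defaultdict(int)

    def add(self, row_id, field, value):
        # Set bit `row_id` in the bitmap for this (field, value) pair.
        self.bitmaps[(field, value)] |= 1 << row_id

    @staticmethod
    def rows(bitmap):
        # Decode a bitmap back into the list of row ids whose bit is set.
        return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

    def query_and(self, *conditions):
        # AND together the bitmaps of all (field, value) conditions.
        if not conditions:
            return []
        result = self.bitmaps.get(conditions[0], 0)
        for condition in conditions[1:]:
            result &= self.bitmaps.get(condition, 0)
        return self.rows(result)


index = BitmapIndex()
index.add(0, "author", "korth")
index.add(0, "abstract", "liver")
index.add(1, "abstract", "liver")
index.add(2, "author", "korth")

both = index.query_and(("author", "korth"), ("abstract", "liver"))
liver_only = index.query_and(("abstract", "liver"))
print(both, liver_only)
```

Adding a field only adds more bitmaps, so the number of searchable fields is bounded mainly by storage, and that is where compression comes in.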
The figure above shows the open query against all the abstracts, indicating the distribution of publications across time.
Above is a time-filtered view of all Covid-19 PubMed publications; below is a specific record that has been selected by the user.
The full abstract document contains a URL that points to the actual PubMed document.
1) How to add numeric profiles for specific parameters. I have examples for NASA public data sets and could demonstrate with simulated models.
2) How to add multiple parameter profiles and find similar profiles using non-linear metrics.
3) How to add image-based profiles. I have a specific example for diabetic eye-degradation images and could demonstrate with lung damage associated with Covid patients.