Elasticsearch (Partial Memex Example) using COVID-19 terms of interest

 COVID-19 Pharma Terms of Interest (originally requested around May 2020):

                   cd4, cd8

                   IgG, IgA

                   nsp3, nsp4

                   orf3a, orf8


Elasticsearch can be configured to store all words contained within unstructured text documents by using the fielddata: true mapping.  At both index time and query time, standard English words are recognized.  Alternate analyzers can be built to meet domain needs.  Ontological data, such as LOINC, can be used inside Elasticsearch to provide a domain-specific synonym list.  Typical analysis is driven by keyword matches.  The picture below shows a set of matches across the keyword sets within COVID-19-related abstracts.
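As a minimal sketch of the configuration described above, the index body might look like the following. The field name "abstract", the analyzer and filter names, and the two synonym entries are assumptions made for illustration; they are not taken from the original setup, and a real deployment would load its synonym list from an ontology export.

```python
def build_index_body(synonyms, field="abstract"):
    """Return settings and mappings for a hypothetical abstracts index."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Domain synonym list, e.g. derived from an ontology.
                    "domain_synonyms": {
                        "type": "synonym",
                        "synonyms": list(synonyms),
                    }
                },
                "analyzer": {
                    # Custom analyzer: standard tokenizer, lowercase, synonyms.
                    "domain_analyzer": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "domain_synonyms"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                field: {
                    "type": "text",
                    "analyzer": "domain_analyzer",
                    # Enables term aggregations on the analyzed text field.
                    "fielddata": True,
                }
            }
        },
    }


# Illustrative synonym entries only.
index_body = build_index_body(
    ["igg, immunoglobulin g", "iga, immunoglobulin a"]
)
```

The resulting dictionary could be sent as the request body when creating the index. Note that fielddata on a text field is held in heap memory, so it is worth enabling only on the fields that need term-level aggregations.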




The matching abstracts can be reviewed sequentially for matching words.  

Another way to see word match correlations is to use heat-maps built on the Elasticsearch "significant terms" aggregation, which can use a mutual information metric (one of its selectable heuristics; the out-of-the-box default is JLH) to find co-associated words.  This method can find closely related words that may help confirm a researcher's hypothesis.  The heat-maps serve as visual topological maps.
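A heat-map of this kind can be backed by a query along the following lines. This is a sketch under assumptions: the field name "abstract", the aggregation names "by_term" and "co_terms", and the bucket size are hypothetical, and the exact request behind the original heat-maps is not known. It builds one bucket per search term with a "filters" aggregation, then asks "significant_terms" (scored with the mutual_information heuristic) for the words associated with each bucket.

```python
def build_heatmap_query(terms, field="abstract", size=20):
    """Build a request body: one row of significant terms per search term."""
    terms = [t.lower() for t in terms]
    return {
        "size": 0,  # only aggregations are needed, not the hits themselves
        "query": {"terms": {field: terms}},
        "aggs": {
            "by_term": {
                # One named bucket per search term (heat-map rows).
                "filters": {
                    "filters": {t: {"term": {field: t}} for t in terms}
                },
                "aggs": {
                    "co_terms": {
                        # Co-associated words per bucket (heat-map columns).
                        "significant_terms": {
                            "field": field,
                            "size": size,
                            "mutual_information": {"include_negatives": False},
                        }
                    }
                },
            }
        },
    }


heatmap_query = build_heatmap_query(["cd4", "cd8"])
```

Each bucket in the response would then carry a list of terms with scores, which maps naturally onto the cells of a heat-map.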

Green indicates a higher number of co-occurrences between the selected words cd4 or cd8.  As expected, there are a large number of co-occurrences.  Other associated words that were not directly searched for are found along the axes.  This enables researchers to quickly assess the significant terms within a specific knowledge base and time period.  Some of these terms are peripherally associated within the overall context, providing additional terms of interest to domain experts.
For example, the terms listed along the left-hand side include:
cd4, cd8, t, cd3, subsets, lymphocyte(s), cd16, cd19, nk, cell(s), cd56, etc.



I call this a partial Memex compared to the Saffron Tech implementation because Elasticsearch only generates a limited number of co-occurrences.  This can be remedied by querying the abstracts and then generating topological maps for all significant terms using a similarity metric.  In the Saffron Tech implementation, a user would specify "entities", and all entity values would then be the sources for the generated topological maps.
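The remedy described above, computing associations for every term pair rather than a limited set, can be sketched outside of Elasticsearch once the abstracts have been retrieved. The example below uses pointwise mutual information over document-level co-occurrence as the similarity metric; that choice, the function name, and the toy token lists are assumptions for illustration, not the Saffron Tech method or actual query results.

```python
import math
from collections import Counter
from itertools import combinations


def cooccurrence_pmi(docs):
    """Return {(term_a, term_b): PMI} for every pair sharing a document.

    docs is a list of token lists, one per abstract. Pair keys are in
    alphabetical order.
    """
    n_docs = len(docs)
    term_df = Counter()  # number of documents containing each term
    pair_df = Counter()  # number of documents containing both terms
    for tokens in docs:
        present = sorted(set(tokens))
        term_df.update(present)
        pair_df.update(combinations(present, 2))
    return {
        (a, b): math.log2(
            (count / n_docs) / ((term_df[a] / n_docs) * (term_df[b] / n_docs))
        )
        for (a, b), count in pair_df.items()
    }


# Toy tokenized abstracts, made up for illustration.
toy_docs = [["cd4", "cd8"], ["cd4", "cd8"], ["nsp4"], ["orf8"]]
pmi = cooccurrence_pmi(toy_docs)
```

The resulting dictionary is a sparse term-by-term matrix: pairs that never share a document are simply absent, and higher values mark terms that appear together more often than their individual frequencies would suggest.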

In my mind, the topological maps act as hyper-planes that intersect across entity values.  In turn, these hyper-dimensional planes represent a complex lattice-graph in which entities are edges that intersect with other entities at points, and the entity-entity matrix values lie on the hyper-plane.  For two or more different co-occurrence lattice-graphs, re-applying the mutual information metric makes it possible to find both the common entity-attributes and the differences.

As time passes, the topological map changes because new knowledge has been gained.  Ideally, indexes of past query results can be defined within the index, enabling retrospective analyses of an adaptive knowledge-based system.
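One simple way to make such a retrospective comparison concrete is to keep the significant terms returned at each point in time and diff them. The sketch below assumes each snapshot is just a collection of terms; the function name and the two term lists are hypothetical and do not reproduce real query output.

```python
def compare_snapshots(earlier, later):
    """Compare two snapshots of significant terms taken at different times."""
    earlier_terms, later_terms = set(earlier), set(later)
    return {
        "common": sorted(earlier_terms & later_terms),
        "forgotten": sorted(earlier_terms - later_terms),  # dropped out later
        "gained": sorted(later_terms - earlier_terms),     # newly appeared
    }


# Made-up term lists, for illustration only.
earlier_snapshot = ["cd4", "cd8", "cd3", "nk"]
later_snapshot = ["cd4", "cd8", "orf8"]
diff = compare_snapshots(earlier_snapshot, later_snapshot)
```

Storing each snapshot alongside its query and date would let the same comparison be repeated for any pair of time periods.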


An interesting correlation appears when a search for "nsp4 and/or nsp8" finds "orf3a, orf8" in co-occurrence; the popup can be seen if you scroll to the right in the window below:




Many terms that were relevant previously are not present six months later because the overall knowledge base has changed.  From the domain expert's perspective, this might represent "forgetting" knowledge.  Forgetting may be advantageous in some circumstances.  On the other hand, forgetting also limits overall human understanding.  If we could continuously learn and recall, then we could build infinite memories.




