Posts

Showing posts from December, 2020

Elasticsearch (Partial Memex Example) using CV-19 terms of interest

Covid-19 pharma terms of interest (originally requested around May 2020):
cd4, cd8
IgG, IgA
nsp3, nsp4
orf3a, orf8

Elasticsearch can be configured to expose all words contained within unstructured text documents (for example, for aggregations) by using the fielddata: true mapping on a text field. At both index time and query time, standard English words are recognized. Alternate analyzers can be built to meet domain needs. Ontological data such as LOINC can be used inside Elasticsearch to provide domain-specific synonym lists. Typical analysis is driven by keyword matches. The picture below indicates a set of matches across the keyword sets within Covid-19 related abstracts. The matching abstracts can be reviewed sequentially for mat...
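As a rough illustration of the configuration described above, here is a minimal sketch that builds an Elasticsearch index body with a fielddata-enabled text field and a synonym-aware analyzer, plus a keyword match query. The names abstract, domain_text and domain_synonyms, and the example synonym entry, are assumptions made for illustration; they are not taken from the original setup. The sketch only builds the JSON bodies and does not contact a cluster.

```python
import json


def build_index_body(synonyms):
    """Return a create-index body (plain dict) for a hypothetical abstracts index."""
    return {
        "settings": {
            "analysis": {
                "filter": {
                    # Synonym token filter; entries use the "a, b" equivalence syntax.
                    "domain_synonyms": {"type": "synonym", "synonyms": list(synonyms)}
                },
                "analyzer": {
                    # Custom analyzer: standard tokenizer, lowercase, then synonyms.
                    "domain_text": {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", "domain_synonyms"],
                    }
                },
            }
        },
        "mappings": {
            "properties": {
                # "abstract" is an assumed field name; fielddata exposes its terms
                # to aggregations at the cost of heap memory.
                "abstract": {
                    "type": "text",
                    "analyzer": "domain_text",
                    "fielddata": True,
                }
            }
        },
    }


def build_keyword_query(terms):
    """Return a match query that hits abstracts containing any of the terms."""
    return {
        "query": {"match": {"abstract": {"query": " ".join(terms), "operator": "or"}}}
    }


if __name__ == "__main__":
    # Illustrative synonym entry only; a real list might be derived from LOINC.
    body = build_index_body(["igg, immunoglobulin g"])
    print(json.dumps(body, indent=2))
    print(json.dumps(build_keyword_query(["cd4", "cd8"]), indent=2))
```

The two dicts could be passed to whichever Elasticsearch client is in use; enabling fielddata on large text fields can consume significant memory, so it is usually limited to the fields that need term aggregations.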

PubMed Comments - Similarity of Published Articles

Researchers need to find other published materials that provide supporting evidence for an internal hypothesis, expressed in the specific functional terms under study. A research team will have a specific term set. Some terms will match and some will not, and it is those detailed differences that help their endeavor. PubMed is a great resource, built over the years, that returns matches to keyword searches. In many domains, groups collaborate on naming conventions, sometimes captured in ontological formats.

MeSH - https://meshb.nlm.nih.gov/search - is one of many standards within the medical community for using proper terminology when reporting or publishing articles.
LOINC - https://loinc.org/ - is another mapping standard.
HL7 -
And other datasets like:
Searchable databases - CPT, Rx, ICD9
Access to Therapeutic Specific Databases
Data aggregation from EMR, claims, Rx data
as listed by AdviseClinical LLC - They are associates of Dr Galpin ...

Memex - Pandemic Dataset Example

The latest Covid-19 related publications are available every day, thankfully. The latest *.tar.gz files get unpacked onto the file system. The metadata.csv file is 51Mb in size; it contains all the publication abstracts. Each record contains more than 1000 fields. The abstract text is one field. Inside this file are references to the original PubMed articles. There are approximately 270K publications in this data set. The number of fields per record is more than 1500 in one example, shown below:

In this example the user searched for the author "Korth"; the above record also indicates "liver" as a significant finding. Domain experts across many domains have good personal memories; the word "liver" is one word of interest out of many that could be of interest. Domain experts look for word combinations co-occurring within the same abstract or text. In this data set they may also look for new publications, or novel findings, or findings similar to a hypo...
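The co-occurrence search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the column names authors and abstract are assumed rather than confirmed by the post, the helper find_cooccurring is hypothetical, and the two sample rows are made up so the snippet runs without the real metadata.csv.

```python
import csv
import io
import re

# Made-up rows for illustration only; the real metadata.csv is far larger.
SAMPLE_CSV = (
    "title,authors,abstract\n"
    '"Sample A","Korth, J","Liver injury was observed alongside reduced cd8 counts."\n'
    '"Sample B","Doe, A","No hepatic findings; cd4 counts were stable."\n'
)


def find_cooccurring(rows, terms, author=None):
    """Return rows whose abstract contains every term, optionally filtered by author."""
    wanted = {t.lower() for t in terms}
    hits = []
    for row in rows:
        # Optional case-insensitive substring filter on the (assumed) authors column.
        if author and author.lower() not in (row.get("authors") or "").lower():
            continue
        # Tokenize the abstract into lowercase alphanumeric words.
        words = set(re.findall(r"[a-z0-9]+", (row.get("abstract") or "").lower()))
        # Keep the row only when all wanted terms co-occur in this abstract.
        if wanted <= words:
            hits.append(row)
    return hits


if __name__ == "__main__":
    reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
    for hit in find_cooccurring(reader, ["liver", "cd8"], author="Korth"):
        print(hit["title"])
```

To run it against the real file, the StringIO sample would be replaced with an open file handle on metadata.csv, after checking the actual column names in its header row.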