

Medical and Healthcare Database

Prof. Francesco Pinciroli - Prof. Stefano Bonacina
 

Abstract
In the medical field, the effective use of Information and Communication Technologies (ICT) requires that all the actors involved in the healthcare process know the informatics aspects that characterize clinical data and how those data are organized. From reception personnel to the Medical Appointment Services and the Administrative Director, and from ward personnel to the chief physician and the Healthcare Director: to cooperate for the patient's benefit, all these professional profiles must reach a proper and shared level of direct understanding of the advantages and weaknesses of ICT for healthcare. This article presents the elements of informatics terminology needed to understand clinical data management. It also highlights what databases for Medicine and Healthcare are, overall, for. The resulting set of characteristics allows us to focus on the powerful and flexible features that databases must provide if we want their users to perceive them as effective. Then classification criteria of informatics origin, such as "Local versus at a distance", "Data versus signals and images", and "Codes versus Languages", are described. The basic elements of the hierarchical, network, relational, and object-oriented database models are described, and examples from the medical field where these models are successfully applied are summarized. Different classes of database queries are then introduced: by keywords, by intervals, by concepts, and the semantic queries of bio-image databanks. Attention is paid to the need for returns to the users. Examples and references on bio-signal databanks, bio-image databanks, and genomic databanks close the article.

1 - Scenario mapping elements.
Even if we are not specialists in Medical Informatics and Telemedicine, certain words naturally come to mind when we think about databases for Medicine and Healthcare. It makes sense to start from those words, as they represent implicit and widespread knowledge from which to begin. To take advantage of this, we begin by briefly reviewing some of them. We take them as keywords and recall their meaning. In doing so we also recall the general expectations that they generate among the different user profiles of medical and healthcare databases. Sometimes it will be easy to understand the degree to which the expectations are currently satisfied. In other cases we will sense how much database methods and technologies still have to improve in order to solve the problems of their users in Medicine and Healthcare.
One of the keywords is storage. Over the last decades, the storage capacity of digital devices has grown from year to year, and it has continued to do so in the last few years. At present, storage is probably not a significant bottleneck in medical and healthcare applications. In the year 2003, systems on offer were able to reach 12 petabytes, and a cost of 6000$ per terabyte was relatively common (Table 1). For the storage capacity usually needed, storage devices have capacities that are large enough and prices that are low enough. A non-negligible aspect to take care of is keeping the storage readable in the future, over several decades. While we can easily read a paper document written centuries ago, unfortunately this is not at all true for data and documents stored on electronic media.

 

 

A second keyword is archiving.
Archiving needs storage, but it also needs organization. To understand this, let us assume that we archive because we need to make some future uses of our information easier. Retrieving, visualizing and processing are the major classes of data use that archiving should not hamper at all. Moreover, for any of these uses to be possible, the data must have been stored well and safely. We mean that, whatever physical technology the storage devices and mechanisms are based on, the archiving action should provide reliable storage, so that the information users can perform their pertinent actions, such as retrieving, visualizing and processing data. To some extent, what is done in a library for books is similar to what is to be done in an archive for data. For instance, a classification system is needed to help the users. It also guides the allocation of the books on the shelves. But not all books have the same size. Even if books of different sizes go on different shelves, this should not cause any difficulty to users who want to find books on similar subjects. If we transfer the concept from books to medical and healthcare data, archiving should not cause any difficulty to the users, despite the many differences among the basic medical and healthcare data types [2]. Let us just recall how different words are, for any computer, from biomedical signals and from bio-images.

A third keyword is integration.
The specific functions to which the concept of integration can be applied are many, and several of them differ from one another. One form of integration concerns the data that a single patient generates in the information systems of the different hospitals where he has been treated. Another concerns the textual description of a medical diagnosis and the specific bio-images that support it [3]. Yet another concerns clinical data and administrative data. More forms of integration could be outlined. The general expectation is that information technology products will manage comprehensive and effective integration. But in the field of clinical and healthcare data, after some decades of efforts devoted to implementing effective integration, there is still bad news. Even at the patient level, we still experience that integration is far from the naturally desired levels. For instance, two different hospitals that each hold some of my clinical data will not merge them under a single patient identification. Despite the present disappointments, databases remain the only way towards effective integration, as needed to let users retrieve, visualize or process medical and healthcare data. In summary, "integration" is a keyword strongly desired by the customers. The supplying firms therefore care about it, but it is still waiting for systems that perform better than those offered today. At present, the pragmatic recommendation is to keep the integration needs of interest limited, to analyze them professionally, and to find the solutions that fit them.

A fourth keyword is visualization.
Thanks to the "big is beautiful" trend, big flat screens triumph everywhere. On the one hand they may appear to be just an attraction; on the other hand they indicate objective improvements. Big is not everything, however. It is not enough that visualization is made "plug and play". To guarantee the fidelity levels required by medical applications, it is also necessary to consider the pixel shape, the number of color levels, the dynamic response for moving images, the lighting conditions of the environment, and so on. For all these reasons, in the medical arena, once all the necessary cautions are considered, a bigger screen does not by itself grant better performance. Nevertheless it encourages a new start in updating the effectiveness analyses done in the past.
In dealing with medical and healthcare databases, an unavoidable key problem is privacy. It can be treated by underlining three different aspects on which tools and techniques can be based: what we know, what we have, and what we are [4]. Respective examples are a password, a "card" of any technology, and a fingerprint. Even when we focus on the "what we are" category, many difficulties appear. Reliable algorithms that keep recognition errors - whether false positives or false negatives - at a completely tolerable level are not widespread, at least when the populations to be considered are large. There are nevertheless many initiatives by companies. Some of them offer services based on a certain number of biometric quantities, for instance as a verification step placed at the end of a password-based identification [5].
Management is often attached to any of the keywords considered above. We may be familiar with storage management, integration management and so on. Management evokes key problems.
In the healthcare environment, for several years the expression "PACS management" - where PACS means "Picture Archiving and Communication Systems" - has acquired its own identity and a global character [6]. In PACS management, the 2003 edition of the Healthcare Information and Management Systems Society (HIMSS) annual conference confirmed a specific tendency: web-based PACS. The default concept is to adopt web methods and technologies also for operations on the hospital intranet. The tendency is sometimes really remarkable [7].
We now have the elements for understanding what databases for Medicine and Healthcare are, overall, for. They are for archiving medical and healthcare information effectively, in such a way as to support the needs of the several different users of that information. They should also help the users in retrieving, visualizing and processing what they are interested in. To do their job properly, databases for Medicine and Healthcare should take care of the storage, integration, visualization, and management of medical information, down to the smallest level of its atomic data. They should also grant privacy and security [8]. Last but not least, they should be built (also cooperatively) and used (mainly individually) from a distance. Moreover, everything should be done "on time". This is essential, even though "on time" is not easy to quantify. Different users will ask for different numbers, and these numbers may change over time. The outlined comprehensive set of characteristics defines the framework within which databases are requested to serve Medicine and Healthcare. Nowadays, however, a set like this cannot be considered easy to satisfy. The reason is that, in its entirety, it still lies well beyond the default performance of the available database methods and technologies. Of course this does not mean that databases can do nothing useful in Medicine and Healthcare.
But if we wish to obtain returns in the short term, the policy to be preferred is to implement subsets of characteristics that are both affordable and significant, and that are of interest to a few user profiles at a time. Usually this leads to satisfactory results. In doing so, we accept going for specific solutions, often limited but both affordable and useful. Many examples prove that this is the widespread successful approach [9]. As citizens, when we ask our national healthcare insurance program for our badge - the one to be shown any time we ask for medical assistance - they use a database. It will do nothing for bio-images, but it does well in administering the healthcare rights of the population. As patients, when we see our family doctor retrieving our old health data, he is using a database. It will probably do nothing for comparing similar data of different patients, but it does well in retrieving our own medical data. As doctors working in a hospital department, when we compare the medical data of patients affected by the same disease, a database is serving us. It probably does nothing to help us understand the costs of the care given to the patients, but it serves our clinical comparison purposes well. As hospital administrators performing complex budget control functions, we surely use a database. It does nothing to help the real-time analysis of the vital signs of a patient lying in the intensive care unit, but it does well in helping us keep the hospital budget on its feet. As researchers working in the bioinformatics field, we frequently use data coming from several different databases [10]. Generally speaking, they give nothing that is ready to be integrated with the patient data collected in the medical record, but they serve well for investigating the various research hypotheses.

2 - Informatics-oriented classification criteria
Local versus at a distance.
Family doctor databases are the default example of local databases. They are locally filled and locally used. It does not matter that the family doctor uses a notebook computer so that he can also operate at the patient's home. We only mean that there is no need for any communication infrastructure: everything is in that single machine. Organ transplant databases are the opposite. They are not local at all. Usually the database is physically located at the main site of the transplant management service. The "candidate receivers" side of the database is filled at a distance, from departments belonging to different hospitals, usually without significant real-time requirements. The "candidate donors" side of the database is also filled at a distance. Even if the data arrive through a hospital department, their origin may be an ambulance still serving at the site of a highway accident. Here the real-time requirements are very high. Non-proprietary communication infrastructures, where our data may not have a priority channel, may be involved in the needed network.

Data versus signals and images.
We know that a classic medical record folder collects paper documents as well as other information supports. Bio-signal tracings (such as electrocardiograms, electroencephalograms, and others) and bio-images (such as large X-rays of the thorax or small X-rays of a hand, ultrasound pictures of a breast or of the heart, NMR or PET images, and so on) are commonly present in a medical record folder. We observe that, if the doctor holds the knowledge necessary for understanding the specific and different information contents present in the images, such a variety of supports does not cause him any trouble. As a human, he browses bio-data, bio-signals, and bio-images with the same naturalness. This is normal for our sense of sight. Unfortunately, nothing like this has happened to a computer system yet. In saying so we do not refer to the data being digital, which is technologically natural to any computer system: any information type - bio-data, bio-signal or bio-image - will quietly end up in bits. The difficulties come from the differences in the intermediate steps required to turn each information type into digits. To explain the major differences, let us observe that bio-data usually enter a computer via a keyboard. Examples are a disease name and a heart rate value. To turn them into digits, we must have answered the question "how many bits do we need to code the number of characters present in our alphabet?" This is about all we need. Bio-data deal with characters and numbers. But bio-signals and bio-images - like all signals and images - do not. Bio-images usually enter a computer via a scanner. Examples are a mammography and foetal ultrasound scans.
To turn these pictures into digits, we must have answered the questions "In order not to lose details, how many dots per inch should my scanner be able to resolve? And, in order not to lose precision, how many different color mixtures and tones should I be able to code for each of those dots?" Dots per inch and number of colors define the basic performance of any scanner. Moreover, when the dot is not a circle, the dots-per-inch figure of a scanner has to be stated more than once (for instance, once per direction). As for bio-signals, it is true that when, for example, an electrocardiogram is drawn on a sheet of graph paper, it can be seen as an image. Nevertheless bio-signals usually enter a computer via an analog-to-digital converter. To turn these signals into digits, we must have answered the questions "In order not to lose details, how many times per second should I read - i.e. sample - the signal? And, in order not to lose precision, how many different coded levels should I have available for coding the amplitude of each sample without falling into intolerable approximations?" Sampling frequency and quantization levels define the basic performance of any analog-to-digital converter. Moreover, when multiple channels can be acquired, how simultaneous the samples from the different channels are is an important quantity too.
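The three digitization questions above can be turned into a small back-of-the-envelope calculation. The following is only a minimal sketch: the function names and all the numeric values in the usage note (sheet size, dots per inch, sampling frequency, number of levels) are illustrative assumptions, not figures taken from any real device.

```python
def bits_for(levels: int) -> int:
    """Smallest number of bits able to code `levels` distinct values
    (characters of an alphabet, color tones, or amplitude levels)."""
    return max(1, (levels - 1).bit_length())


def image_bytes(width_in: float, height_in: float,
                dpi_x: int, dpi_y: int, color_levels: int) -> int:
    """Uncompressed size of a scanned picture.
    dpi_x and dpi_y are kept separate because the dot may not be round."""
    dots = round(width_in * dpi_x) * round(height_in * dpi_y)
    return (dots * bits_for(color_levels) + 7) // 8


def signal_bytes(seconds: float, sampling_hz: int,
                 quant_levels: int, channels: int = 1) -> int:
    """Uncompressed size of a sampled signal over all channels."""
    samples = round(seconds * sampling_hz) * channels
    return (samples * bits_for(quant_levels) + 7) // 8
```

With these assumed numbers, an alphabet of 26 letters needs `bits_for(26)` = 5 bits per character, an 8 x 10 inch film scanned at 300 x 300 dots per inch with 4096 gray tones needs `image_bytes(8, 10, 300, 300, 4096)` = 10,800,000 bytes, and ten seconds of a 12-channel tracing sampled 500 times per second on 4096 levels needs `signal_bytes(10, 500, 4096, channels=12)` = 90,000 bytes. The orders of magnitude, more than the exact values, show why data, signals and images place such different demands on a database.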

Codes versus Languages.
"The more we code what we store in a database, the easier it will be to query it" is a statement that is easy to agree with. Nevertheless it masks problems. A poor reading of the statement may quietly push us towards very simple coding systems. We have to keep in mind that, if we go that way, we accept simple queries only. Generally this is not what the users want: the expectations of database users - in Medicine and Healthcare as well as in other fields - are usually not satisfied by simple queries alone. Additionally, if we choose coding systems that are too simple, their granularity may not be sufficient for the degree of reality that we want to record. For instance, when saying "eye inflammation", we usually want to know which eye is involved. An improved and usually more accepted statement is "The more widely we code what we store in a database, the more effectively we will query it". But this statement masks problems too. Some remarkable coding systems are certainly available. The Unified Medical Language System (UMLS) [11], the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED-CT) [12], and the International Classification of Diseases (ICD-9, ICD-10) [13] are major examples of widely extended codes.
Unfortunately, those medical dictionaries are so extensive that they cannot be used in the daily clinical practice of a ward. This is certainly unsatisfactory, yet it has its own usefulness.
Since we always have to consider the wish to reduce the richness of our language in order to be more effective and better understood by our audience, the breadth and accuracy of the language used in a clinical ward will inevitably end up being smaller than those of the language we use in writing a paper for submission to an international scientific journal.

Models for Databases.
Hierarchical, network, relational and object-oriented are the major models in the field of databases. Each of them has preferred applications in the wide field of Medicine and Healthcare. As hierarchical databases [14] are effective in applications where there is only one dominant point of view, they are good for family doctors managing normal families. These are families whose members suffer from only one healthcare problem at a time. Any new problem comes well after the previous one and is quite independent of it. When a problem occurs, the family member is visited and the problem is identified. Laboratory tests are usually prescribed. Drugs are prescribed too, sometimes before the laboratory results arrive. Over the duration of the problem, the patient can be visited several times and the therapy can be adjusted. But the problem comes to an end. The next healthcare problem, if it occurs, can be assumed to be largely independent of the previous one. For the database organization, the root of the hierarchy is the family doctor's name (Fig. 1).

 

 

One level down there is the name of the patient. At the second level down there is the event name. This may be "eye inflammation". At the third level down there are the visits for that problem. At the fourth level down we find drug prescriptions and behavior recommendations. The hierarchy implies that any prescription should be interpreted as belonging to that visit, that event, and that patient only. A second medical example of a clear hierarchy relates to the family history section of a patient medical record. Probably nothing is more hierarchical than a family tree. Unfortunately, almost all the other sections of a hospital patient medical record do not show a hierarchy like that of the family history section. And the prospect of using different databases for specific chapters of one document is not viable.
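The doctor, patient, event, visit and prescription levels can be sketched with nested Python dictionaries. This is only an illustration under stated assumptions: the names, dates and drugs are invented, and a real hierarchical database management system would not be built this way.

```python
# Levels: family doctor -> patient -> event -> visits -> prescriptions.
# Every name, date and drug below is a made-up example.
hierarchy = {
    "Dr. Rossi": {                            # root: family doctor
        "Patient A": {                        # level 1: patient
            "eye inflammation": [             # level 2: event
                {                             # level 3: first visit
                    "date": "2004-03-01",
                    "prescriptions": ["eye drops"],          # level 4
                    "recommendations": ["avoid contact lenses"],
                },
                {                             # level 3: follow-up visit
                    "date": "2004-03-08",
                    "prescriptions": ["antibiotic ointment"],
                    "recommendations": [],
                },
            ],
        },
        "Patient B": {
            "flu": [
                {"date": "2004-01-10", "prescriptions": ["paracetamol"],
                 "recommendations": ["rest"]},
            ],
        },
    },
}


def prescriptions_for(db, doctor, patient, event):
    """Follow the single access path that the hierarchy provides."""
    return [drug for visit in db[doctor][patient][event]
            for drug in visit["prescriptions"]]


def patients_with_drug(db, drug):
    """The reverse question has no direct path: every branch is scanned."""
    found = []
    for patients in db.values():
        for patient, events in patients.items():
            if any(drug in visit["prescriptions"]
                   for visits in events.values() for visit in visits):
                found.append(patient)
    return found
```

Going down the hierarchy (`prescriptions_for`) is a direct walk, whereas asking from a different point of view (`patients_with_drug`) forces a scan of the whole tree. This is the practical meaning of "one dominant point of view".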
Network databases [15] are for the cases where there are several points of view - i.e. several user profiles - and none of them can be considered clearly dominant (Fig. 2).

 


The following set of query pairs summarizes what we mean. All the drugs taken by a patient; all the patients who have taken a specified drug. All the departments present in a hospital; all the hospitals having a specified clinical department. All the patients cared for by a specified doctor; all the doctors caring for specified patients. And so on. With examples like these, it is easy to admit that there is no dominant point of view, i.e. there is no hierarchy. The network database idea is more suitable for these conditions. However, network databases have weaknesses too. In principle, at any stage they deal only with the entirety of the data to be scanned by queries. This is true even when all such queries belong to a specified application subset, for which a large part of the stored data will never be the object of any analysis. One example is administrative data: they will never be of interest in a clinical query. A second example is the full set of identification data, including those that may vary over time, like the street address and the telephone number. These data are mainly needed for sending mail to the patient's home. They are not needed for answering clinical queries. This entirety weakness is solved by relational databases [16]. They may be seen as coming from a clever segmentation of network databases. Relations are the results of the segmentation. Essentially a relation is a plain table with rows and columns. The columns are for the attributes that properly describe what the specific relation is for. One of the attributes is the key attribute (Fig. 3).
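One of the query pairs above can be sketched on two small relations. The example assumes SQLite, accessed through Python's standard `sqlite3` module, as a stand-in for any relational system; the table names, column names and rows are hypothetical, and `patient_id` plays the role of the key attribute.

```python
import sqlite3

# In-memory database with two relations; all contents are invented examples.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,   -- key attribute
    name       TEXT,
    address    TEXT                   -- identification data, unused by clinical queries
);
CREATE TABLE prescription (
    patient_id INTEGER REFERENCES patient(patient_id),
    drug       TEXT
);
INSERT INTO patient VALUES (1, 'Patient A', 'Street 1'), (2, 'Patient B', 'Street 2');
INSERT INTO prescription VALUES (1, 'aspirin'), (1, 'insulin'), (2, 'aspirin');
""")


def drugs_of(patient_id):
    """All the drugs taken by a patient."""
    rows = conn.execute(
        "SELECT drug FROM prescription WHERE patient_id = ? ORDER BY drug",
        (patient_id,))
    return [row[0] for row in rows]


def patients_taking(drug):
    """All the patients who have taken a specified drug."""
    rows = conn.execute(
        "SELECT p.name FROM patient AS p "
        "JOIN prescription AS r ON r.patient_id = p.patient_id "
        "WHERE r.drug = ? ORDER BY p.name",
        (drug,))
    return [row[0] for row in rows]
```

Neither direction of the query is privileged, and neither touches the `address` column: each query reads only the relations and attributes it needs.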

 


In a patient medical record, typical relations are for the identification data, the family history, the past pathological history, the recent pathological history, etcetera. In each relation, a suitable key attribute can be the code that the hospital gives to the patient for that hospitalization. The rows of a relation are filled with the attribute values specific to the case at hand. The set of attribute values present in a row is usually called a tuple. Segmenting a database should not lead to relations that are too small: the smaller the relations are, the more numerous they become, and the higher the querying time and complexity. Even though querying relational databases takes advantage of the well-defined and powerful relational algebra, the querying time may become significant. The complexity to be managed may also require too much programming work. Nevertheless, whenever computing time is not critical, relational databases usually provide the flexibility desired for querying the database contents. Often this is done with the widely used Structured Query Language (SQL) [17]. To some extent, what relational databases essentially provide follows two major lines: the former is the increase in detail, the latter is the power of the available querying tools. There is no doubt that relational databases are the most widely used. Nevertheless, for many medical and healthcare applications, relational databases also show major weaknesses. For instance, they do not easily integrate data and images: it remains very hard to consider an image just as the value of an attribute, to be written in the proper cell of a relational table. Additionally, relational databases do not cope easily with multiple-valued attributes, as needed for filling the "risk factors" attribute column for a patient affected, for example, by both obesity and alcoholism.
The technical expedient of adding rows to a relational table - so as to include only one of the multiple risk factors in one cell at a time - is a criticized possibility. From the clinical point of view, a patient with two simultaneous risk factors is intuitively in a more severe condition.
Object-oriented databases [18] come from the courageous attempt, and the hope, to suggest a new vision for solving the problem of managing data effectively. We start from a broad level. Compared with words like hierarchical, network and relational, the word object sounds more generic. It apparently sounds weak with respect to the power needed to allow detailed descriptions. An object can be anything, and very different things can be labeled as objects. But the word carries the power of giving relevance to comprehensive, powerful and known identities. Frequently identities include other identities, and this happens at an implicit level. For example, when in a generic conversation we say "he is diabetic", we mean that "he is a person who is sick because of diabetes", without the need to say it explicitly. The object "diabetic" inherits the properties of the object "patient", which in turn inherits the properties of the object "person" (Fig. 4).

 


It remains true that object is quite a generic term, but we exploit its positive flexibility of meaning, provided that we have the - usually implicit - knowledge of what a given identity means in our environment. On the side of informatics technicalities, we say that an object includes both its suitably defined data set and the related methods for managing those data.
Default methods are those required for data input, storing, relations, querying, saving, recovery, integration, etcetera. In addition to inheritance, other frequently mentioned properties of object-oriented databases are encapsulation (hiding the data structures from the outside), overriding, overloading, late binding, persistence (keeping results also after logout), concurrency (different procedures acting simultaneously on the same data), and some others.
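The person, patient and diabetic chain, together with inheritance, encapsulation and overriding, can be sketched in a few Python classes. This is only an illustration of the vocabulary: the attribute names and sample values are assumptions, and no object-oriented database product is implied.

```python
class Person:
    def __init__(self, name, birth_year):
        self.name = name
        self.birth_year = birth_year

    def describe(self):
        return f"{self.name} (born {self.birth_year})"


class Patient(Person):                      # a patient inherits from person
    def __init__(self, name, birth_year, hospital_code):
        super().__init__(name, birth_year)
        self.hospital_code = hospital_code
        self._risk_factors = []             # encapsulated, multi-valued data

    def add_risk_factor(self, factor):
        """Method bundled with the data it manages."""
        if factor not in self._risk_factors:
            self._risk_factors.append(factor)

    def risk_factors(self):
        return tuple(self._risk_factors)    # read-only view for the outside

    def describe(self):                     # overriding Person.describe
        return f"patient {super().describe()}, code {self.hospital_code}"


class Diabetic(Patient):                    # a diabetic inherits from patient
    def __init__(self, name, birth_year, hospital_code, diabetes_type):
        super().__init__(name, birth_year, hospital_code)
        self.diabetes_type = diabetes_type

    def describe(self):                     # overriding Patient.describe
        return f"diabetic (type {self.diabetes_type}) {super().describe()}"
```

A `Diabetic` instance is implicitly also a `Patient` and a `Person`, exactly as in the spoken sentence "he is diabetic". The list of risk factors can hold obesity and alcoholism together, with no need to split them over several rows.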
The database lifecycle is a complex process, usually composed of the following main phases:
1. Requirements collection and analysis
2. Conceptual database design
3. Choice of a Data Base Management System
4. Logical database design
5. Physical database design
6. Database implementation
7. Use & maintenance
For the conceptual database design phase, the most widely used conceptual data model is the Entity-Relationship (E-R) data model (Fig. 5).

 

 

3 - User returns
Querying classes
By keywords

Over the last decade most of us have become familiar with searching the Internet by using keywords - mostly one at a time - and search engines. If, in a given hospital, we want to list all the patient medical records where those keywords are written, the method is successful. Practical examples are: all the patients who have taken a specified drug; all the drugs taken by a specified patient; all the doctors in charge of a specified department; and so on. Likewise, if in a medical literature search, probably using PubMed [23], we want to list all the journal articles where the keyword is present - whether in the abstract only or within the body of the text - the method is successful too. Of course, a result list will also include those patients (or those journal articles) where the keyword is preceded by "absence of" or followed by "not present". When we query a hospital database, we do not want this to happen: we want higher effectiveness. Nevertheless we have to admit that the querying program is very simple. We also have to admit that some risks of misunderstanding are unlikely.
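The weakness of plain keyword matching can be shown on three invented record fragments. The sketch below is deliberately naive: the record texts are made up, and the second function only recognizes the two negation patterns quoted above, which is an assumption far short of real clinical text processing.

```python
import re

# Hypothetical free-text fragments of medical records.
records = {
    "rec-1": "Patient reports fever and cough.",
    "rec-2": "Absence of fever. Mild headache.",
    "rec-3": "Cough persists; fever not present.",
}


def keyword_search(records, keyword):
    """Plain keyword query: every record containing the word is returned."""
    kw = keyword.lower()
    return sorted(rid for rid, text in records.items() if kw in text.lower())


def keyword_search_skip_negated(records, keyword):
    """Same query, ignoring occurrences negated by the two quoted patterns."""
    kw = re.escape(keyword.lower())
    negated = re.compile(rf"absence of {kw}|{kw} not present")
    hits = []
    for rid, text in records.items():
        remaining = negated.sub("", text.lower())   # drop negated mentions
        if re.search(kw, remaining):
            hits.append(rid)
    return sorted(hits)
```

Here `keyword_search(records, "fever")` returns all three records, while `keyword_search_skip_negated(records, "fever")` returns only `rec-1`. The simple program is easy to write, but its result list needs the kind of filtering that the second function only begins to approximate.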

By intervals
In Medicine, for many measurable quantities, the knowledge acquired in the past lets us define a normality interval. Given the result of a laboratory test for a patient, we want to know whether the result is normal or not. For a query of this type, even the data organization offered by a spreadsheet is usually enough. Slightly more complex queries are like "we want to know if the patient has had fever while staying at the hospital". Queries of higher complexity are like "select the patients who, after having taken a certain drug at a certain dosage, had that drug stopped because of undesired side effects and, after a wash-out period of at least five days, were moved to another drug, without showing any side effect during at least three months of treatment with the new drug at medium or higher dosage".
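The two simplest interval queries can be sketched as follows. The interval bounds and the fever threshold are placeholders chosen only to make the example run; they are assumptions and must not be read as clinical reference values.

```python
# Placeholder normality intervals (low, high); NOT clinical reference values.
NORMAL_RANGES = {
    "glucose_mg_dl": (70.0, 100.0),
}


def is_normal(test_name, value, ranges=NORMAL_RANGES):
    """True when the value lies inside the normality interval of the test."""
    low, high = ranges[test_name]
    return low <= value <= high


def had_fever(temperatures_c, threshold_c=38.0):
    """True when at least one reading during the stay reaches the threshold.
    The 38.0 default is an assumed cut-off used only for illustration."""
    return any(t >= threshold_c for t in temperatures_c)
```

For instance, `is_normal("glucose_mg_dl", 85.0)` is true and `had_fever([36.5, 37.2, 38.4])` is true. The third query quoted above is of a different nature: it chains several intervals on time and dosage across different events, and for that reason a spreadsheet is no longer enough.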

By concepts
In a query, cardiopathy can be just a keyword. A result may be the number of times this word is present in the text. But we may use cardiopathies to query for something else.
We mean that we may look for the list of the specific names of all the cardiac pathologies. In practice, this is far from just counting how many times a word appears in a given text. The work now starts with looking for a medical dictionary, which will frequently be a structured one. A possible example is the terminology part of the Unified Medical Language System. Then the dictionary has to be explored according to its structuring criteria.
All the cardiopathy names should be listed. Synonyms should be properly treated. The results should also fit the granularity level requested by the user. A search like the one just described is often called a query by concepts.
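A query by concepts can be sketched on a tiny hand-made dictionary. The two tables below are assumptions written for this example only: they are not an extract of the UMLS or of any other real terminology, and the `depth` parameter is just one possible way of expressing the granularity requested by the user.

```python
# Hypothetical mini-dictionary: each concept points to its narrower concepts.
NARROWER = {
    "cardiopathy": ["ischemic heart disease", "cardiomyopathy"],
    "ischemic heart disease": ["myocardial infarction", "angina pectoris"],
    "cardiomyopathy": [],
    "myocardial infarction": [],
    "angina pectoris": [],
}

# Hypothetical synonym table.
SYNONYMS = {
    "myocardial infarction": ["heart attack"],
}


def expand_concept(concept, depth=None):
    """List the names below a concept; `depth` limits the granularity
    (None means: go down to the most specific names)."""
    names = []
    for child in NARROWER.get(concept, []):
        names.append(child)
        if depth is None or depth > 1:
            names.extend(
                expand_concept(child, None if depth is None else depth - 1))
    return names


def with_synonyms(names):
    """Add the known synonyms right after each name."""
    expanded = []
    for name in names:
        expanded.append(name)
        expanded.extend(SYNONYMS.get(name, []))
    return expanded
```

Calling `expand_concept("cardiopathy", depth=1)` stops at the two broad groups, while calling it without a depth limit also lists the specific pathologies. Passing that list through `with_synonyms` adds "heart attack" next to "myocardial infarction". The names obtained in this way could then be used, one by one, as keywords.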
Semantic queries of bio-image databases.
The Visible Human Dataset (VHD) [24] is an exhaustive collection of images of slices of two human bodies. The body of the man was sliced every millimeter, which led to a series of some 1870 pictures. The body of the woman, sliced three times per millimeter, led to some 5100 pictures. Pictures in such quantities cannot be managed by hand.
Queries like "for the images coming from all the slices where the liver is present, give me all the data related to the liver only" should be considered quite natural, even by first-year students. But the software architecture for performing such a query is quite peculiar.
The serial visualization of the raw files of the VHD images, the selection of those where the liver is present, the contouring of the liver on each of the selected images, the validation of the contouring, the saving of the original VHD data lying inside the contour, their storage in a new "VHD-liver" database, the choice of an anatomy knowledge corpus (e.g. the thesaurus of the Unified Medical Language System - UMLS), and several other aspects to take care of: all of these are among the major blocks that should be present in any system performing semantic queries of bio-image databases [25].

4 - Some major databases
To get a general sense of what is going on in the field of "Medical and Healthcare Databases", anyone will look for a preliminary list by doing some Internet searches. We know that the results will partially depend on the search engine used.
Professionals will perform the same searches too. But they will have the advantage of already holding the professional knowledge needed to give, or not to give, credit to the findings appearing among the huge number of those listed. General professional crediting criteria may relate to the activity segment with which the finding can be associated.
Examples of activity segments are production and use on the market side, education on the knowledge side, government on the regulatory side, and advancements on the research side. Specific professional crediting criteria relate to more focused targets.
Examples are the name of a database management system, the name of a disease, the name of a medical dictionary, the name of a drug, names of symptoms and signs, normal values of given measurements, the cost of medical treatments, survival tables, medical errors, post-surgery complications, patient rights, data security, electrocardiograms, nuclear magnetic resonance images, medical record privacy, DNA microarrays, and so on (Fig. 6).

 

 


At present, searching for medical and healthcare databases yields close to 100 thousand findings for "medical databases" and several thousand for "healthcare databases".
A detailed view of the field of databases for Medicine and Health emerges from the book "Elementi di Informatica BioMedica" [26], which devotes many pages to the subject. In particular, there are chapters devoted to banks of medical terminologies, medical bibliography banks, bio-signal archives, bio-image archives, and genomic databanks.
Each topic is addressed so as to answer the user requirements, making the user aware of those aspects of the construction and management phases that mainly determine the performance.
Sometimes good performance is within the user's reach; other times, unfortunately, it remains far from what the purpose requires. For each data bank presented, the book covers its characteristic aspects, the different existing typologies (for example primary genomic banks and derived genomic banks), and the problems to be faced for an effective implementation and set-up for use in the hospital environment. Where possible the book includes some comparisons. One example is the comparison among digital bio-image archives. Some characteristics of the Visible Human Dataset - which was created more than ten years ago and is widely known today - have been improved upon in the building of recent archives like the Visible Korean Human and the Chinese Visible Human. These have taken advantage of the technological improvements in acquisition systems that have occurred in recent years.

 

Prof. Francesco Pinciroli
Professor of Medical Informatics and Health Information Systems at the Politecnico di Milano - Italy
Prof. Stefano Bonacina
Department of Bioengineering, Politecnico di Milano - Italy
