Web-scraping with Java

Recently I decided to practise web-scraping. At the time I was looking into medical NLP, so I thought I would combine the two and collect some medical data. I chose the website for the German edition of the International Statistical Classification of Diseases and Related Health Problems (ICD), which can be thought of as semi-structured online data.

The initial page lists the 22 chapters of the ICD-10. Each chapter consists of bundles of sections on three levels of granularity. Scarlet fever, for example, is found by selecting chapter I, “certain infectious and parasitic diseases”, which comprises sections A00-B99. From there you go to sections A30-A49, “other bacterial diseases”, where you find section A38 for scarlet fever. Each section and each bundle of sections is on its own web page.

All sections are thus leaves in a tree, and the initial page showing the chapter index is the tree’s root. Intermediate tree nodes group related diseases and have a description of their own. Generally speaking, each node has a title and a description, and some also list the diseases they include and exclude. The goal of my web-scraping program is to collect all these data points from the website and store them in an XML document. Traversing the site is simple because of its tree-shaped structure: each non-leaf web document contains hyperlinks to its constituent documents.
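
The traversal described above can be sketched as a depth-first recursion. This is only a minimal illustration, not the code from my repository: the names `CrawlSketch`, `Page`, `Node`, `crawl` and `countLeaves` are hypothetical, and the "website" is an in-memory map so that the example runs without network access. A real fetch function would download each page and parse its HTML.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class CrawlSketch {

    /** What one web page yields: its title and the links to its child pages. */
    public static final class Page {
        public final String title;
        public final List<String> childLinks;

        public Page(String title, List<String> childLinks) {
            this.title = title;
            this.childLinks = childLinks;
        }
    }

    /** One node of the resulting tree. */
    public static final class Node {
        public final String link;
        public final String title;
        public final List<Node> children = new ArrayList<>();

        public Node(String link, String title) {
            this.link = link;
            this.title = title;
        }
    }

    /** Depth-first traversal: fetch a page, then recurse into each of its links. */
    public static Node crawl(String link, Function<String, Page> fetch) {
        Page page = fetch.apply(link);
        Node node = new Node(link, page.title);
        for (String child : page.childLinks) {
            node.children.add(crawl(child, fetch));
        }
        return node;
    }

    /** Counts the leaves of the tree, i.e. the individual sections. */
    public static int countLeaves(Node node) {
        if (node.children.isEmpty()) {
            return 1;
        }
        int total = 0;
        for (Node child : node.children) {
            total += countLeaves(child);
        }
        return total;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the website (assumption made for this sketch).
        Map<String, Page> site = Map.of(
            "index", new Page("ICD-10-GM", List.of("A00-B99")),
            "A00-B99", new Page("Certain infectious and parasitic diseases", List.of("A30-A49")),
            "A30-A49", new Page("Other bacterial diseases", List.of("A38", "A39")),
            "A38", new Page("Scarlet fever", List.of()),
            "A39", new Page("Meningococcal infection", List.of()));

        Node root = crawl("index", site::get);
        System.out.println(countLeaves(root)); // prints 2
    }
}
```

Passing the fetch step in as a function keeps the recursion independent of how pages are downloaded, which also makes it easy to try out on a small hand-made tree.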

The more difficult and unstructured part is extracting the text for each data point from the webpages’ HTML code. Since this differs at each level/generation in the tree, I created one extractor object per level:

  • The first extractor handles the root by going through each entry in the table of chapters.
  • The second extractor grabs the title, a description, the “included” and “excluded” information if it is available, and then the list of lower-order (bundles of) sections. This extractor is applied to two consecutive levels in the tree, as they are conceptually equal.
  • The third and final extractor is similar to the second except that it need not look for further subordinate documents. It also needs to be more flexible in parsing the HTML code because more diverse formatting is encountered at this level.
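
The idea of one extractor per level can be sketched roughly as follows. Everything here is an assumption made for illustration: the names `ExtractorSketch`, `PageData`, `Extractor` and `forDepth` are hypothetical, the markup (title in `<h1>`, children as plain `<a href="...">` links) is invented and is not the real site's HTML, and the "included"/"excluded" fields are left out for brevity. Regular expressions are used only to keep the sketch free of external libraries; a real scraper would more likely rely on an HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ExtractorSketch {

    /** Data points pulled from one page. */
    public static final class PageData {
        public String title = "";
        public final List<String> childLinks = new ArrayList<>();
    }

    /** One extractor per kind of page. */
    public interface Extractor {
        PageData extract(String html);
    }

    // Hypothetical markup: the title sits in <h1>, child pages are <a href="..."> links.
    private static final Pattern TITLE = Pattern.compile("<h1>(.*?)</h1>", Pattern.DOTALL);
    private static final Pattern LINK = Pattern.compile("<a\\s+href=\"([^\"]+)\"");

    static String title(String html) {
        Matcher m = TITLE.matcher(html);
        return m.find() ? m.group(1).trim() : "";
    }

    static List<String> links(String html) {
        List<String> result = new ArrayList<>();
        Matcher m = LINK.matcher(html);
        while (m.find()) {
            result.add(m.group(1));
        }
        return result;
    }

    /** Root page: only the links to the chapters matter. */
    public static final Extractor ROOT = html -> {
        PageData data = new PageData();
        data.childLinks.addAll(links(html));
        return data;
    };

    /** Intermediate pages: title plus links to lower-order pages. */
    public static final Extractor GROUP = html -> {
        PageData data = new PageData();
        data.title = title(html);
        data.childLinks.addAll(links(html));
        return data;
    };

    /** Leaf pages: title only, no further links are followed. */
    public static final Extractor LEAF = html -> {
        PageData data = new PageData();
        data.title = title(html);
        return data;
    };

    /** Picks the extractor by depth: 0 = root, 1 and 2 = intermediate levels, 3 = leaves. */
    public static Extractor forDepth(int depth) {
        if (depth <= 0) {
            return ROOT;
        }
        return depth <= 2 ? GROUP : LEAF;
    }

    public static void main(String[] args) {
        String html = "<h1>Cholera</h1><a href=\"a00.htm\">A00</a>";
        System.out.println(forDepth(1).extract(html).childLinks);
        System.out.println(forDepth(3).extract(html).title);
    }
}
```

Selecting the extractor by depth reflects that the two intermediate levels share one extractor, while the root and the leaves each get their own.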

My program’s parsing methods are ad hoc because I wrote them having only eyeballed the HTML documents. After some trial and error, the processing ran smoothly and culminated in a well-formed XML document. Below is an excerpt of the entire tree, first as XML text and below that as an interactive tree that you can expand and collapse. I created the latter using a JavaScript tool called jsTree.

  • ICD-10-GM
    • A00-B99
      • titel
        • Kapitel I
        • Bestimmte infektiöse und parasitäre Krankheiten
      • inklusive
        • Krankheiten, die allgemein als ansteckend oder übertragbar anerkannt sind
      • exklusive
        • Keimträger oder -ausscheider, einschließlich Verdachtsfällen (Z22.-) Bestimmte lokalisierte Infektionen - siehe im entsprechenden Kapitel des jeweiligen Körpersystems Infektiöse und parasitäre Krankheiten, die Schwangerschaft, Geburt und Wochenbett komplizieren [ausgenommen Tetanus in diesem Zeitabschnitt] (O98.-) Infektiöse und parasitäre Krankheiten, die spezifisch für die Perinatalperiode sind [ausgenommen Tetanus neonatorum, Keuchhusten, Syphilis connata, perinatale Gonokokkeninfektion und perinatale HIV-Krankheit] (P35-P39) Grippe und sonstige akute Infektionen der Atemwege (J00-J22)
      • A00-A09
        • titel
          • Infektiöse Darmkrankheiten
        • A00.-
          • titel
            • Cholera
          • A00.0
            • titel
              • Cholera durch Vibrio cholerae O:1, Biovar cholerae
            • inklusive
              • Klassische Cholera
          • A00.1
            • titel
              • Cholera durch Vibrio cholerae O:1, Biovar eltor
            • inklusive
              • El-Tor-Cholera
          • A00.9
            • titel
              • Cholera, nicht näher bezeichnet
        • A01.-
          • titel
            • Typhus abdominalis und Paratyphus
          • A01.0
            • titel
              • Typhus abdominalis
            • inklusive
              • Infektion durch Salmonella typhi
              • Typhoides Fieber
          • A01.1
            • titel
              • Paratyphus A
          • A01.2
            • titel
              • Paratyphus B
          • A01.3
            • titel
              • Paratyphus C
          • A01.4
            • titel
              • Paratyphus, nicht näher bezeichnet
            • inklusive
              • Infektion durch Salmonella paratyphi o.n.A.
        • A02.-
          • titel
            • Sonstige Salmonelleninfektionen
          • inklusive
            • Infektion oder Lebensmittelvergiftung durch Salmonellen außer durch Salmonella typhi und Salmonella paratyphi
          • A02.0
            • titel
              • Salmonellenenteritis
            • inklusive
              • Enteritis infectiosa durch Salmonellen
          • A02.1
            • titel
              • Salmonellensepsis
          • A02.2
            • titel
              • Lokalisierte Salmonelleninfektionen
            • inklusive
              • Arthritis+ (M01.3-*) durch Salmonellen
              • Meningitis+ (G01*) durch Salmonellen
              • Osteomyelitis+ (M90.2-*) durch Salmonellen
              • Pneumonie+ (J17.0*) durch Salmonellen
              • Tubulointerstitielle Nierenkrankheit+ (N16.0*) durch Salmonellen
          • A02.8
            • titel
              • Sonstige näher bezeichnete Salmonelleninfektionen
          • A02.9
            • titel
              • Salmonelleninfektion, nicht näher bezeichnet
        • A03.-
          • titel
            • Shigellose [Bakterielle Ruhr]
          • A03.0
            • titel
              • Shigellose durch Shigella dysenteriae
            • inklusive
              • Shigellose durch Shigellen der Gruppe A [Shiga-Kruse-Ruhr]
          • A03.1
            • titel
              • Shigellose durch Shigella flexneri
            • inklusive
              • Shigellose durch Shigellen der Gruppe B
          • A03.2
            • titel
              • Shigellose durch Shigella boydii
            • inklusive
              • Shigellose durch Shigellen der Gruppe C
          • A03.3
            • titel
              • Shigellose durch Shigella sonnei
            • inklusive
              • Shigellose durch Shigellen der Gruppe D
          • A03.8
            • titel
              • Sonstige Shigellosen
          • A03.9
            • titel
              • Shigellose, nicht näher bezeichnet
            • inklusive
              • Bakterielle Ruhr [Bakterielle Dysenterie] o.n.A.
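
Writing such a tree to a well-formed XML document can be sketched with the DOM and transformer APIs that ship with the JDK. This is a minimal illustration under assumptions: the class and method names are hypothetical, and representing each entry as a `node` element with a `code` attribute is a choice made for this sketch, not necessarily the schema my program uses. Only the element names `titel` and `inklusive` are taken from the excerpt above, and the intermediate levels are omitted for brevity.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class XmlExportSketch {

    /** Appends a text-only child element such as <titel> or <inklusive>. */
    public static void addText(Document doc, Element parent, String name, String text) {
        Element child = doc.createElement(name);
        child.setTextContent(text);
        parent.appendChild(child);
    }

    /** Appends a classification node with its code and title, and returns it. */
    public static Element addNode(Document doc, Element parent, String code, String title) {
        Element node = doc.createElement("node"); // hypothetical element name
        node.setAttribute("code", code);
        addText(doc, node, "titel", title);
        parent.appendChild(node);
        return node;
    }

    /** Builds a tiny excerpt of the tree. */
    public static Document buildExample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element root = doc.createElement("ICD-10-GM");
            doc.appendChild(root);
            Element cholera = addNode(doc, root, "A00.-", "Cholera");
            Element classic = addNode(doc, cholera, "A00.0",
                    "Cholera durch Vibrio cholerae O:1, Biovar cholerae");
            addText(doc, classic, "inklusive", "Klassische Cholera");
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Could not build the XML document", e);
        }
    }

    /** Serializes the document to an indented string. */
    public static String serialize(Document doc) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize the XML document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(serialize(buildExample()));
    }
}
```

Building the document through the DOM API rather than by string concatenation means that escaping and tag nesting are handled by the library, so the output stays well-formed.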

To view the code of my web scraper, go to this repository.

Update: In the meantime I also found editions of the ICD-10 and the recently published ICD-11 that are structured as an expandable/collapsible tree similar to the one above. The latter even comes with an API.
