STRIPHTML_C

Authors

Tim Swenson

Publication

QL Hacker's Journal

Pub Details

QL Hacker's Journal, Issue 23

Date

January 1996


With the popularity of the World Wide Web, more and more information is being formatted in HTML, the “language” of the Web. Since HTML is pure ASCII, this is not a problem for people who don’t have Web browsers. But HTML commands can make text look very convoluted. The following C program reads a file containing HTML commands, strips them out, and writes only the text to an output file.

Since HTML commands all start with a less-than sign ( < ) and end with a greater-than sign ( > ), stripping out the HTML is relatively easy: as you read in characters and echo them to the output file, turn off echoing when you see a < and turn it back on when you see a >. The end result is a regular ASCII file. It may not be formatted to look nice, but the HTML commands will be gone.

/* striphtml_c  
    This program takes in a file with HTML commands
    and outputs a file with the HTML commands
    stripped out.
*/

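#include <stdio.h>     /* written <stdio_h> in the original QL listing */
#include <stdlib.h>    /* exit() */
#include <string.h>    /* strcspn() */

#define NO  0
#define YES 1

int main(void) {
   char file1[30], file2[30];
   int  c;               /* int rather than char so EOF is distinguishable */
   int  html;            /* YES while inside an HTML command */
   FILE *fd1, *fd2;

   /* fgets() cannot overrun the buffer the way gets() can */
   printf("Enter Input File Name : \n");
   if (fgets(file1, sizeof file1, stdin) == NULL)
      exit(1);
   file1[strcspn(file1, "\n")] = '\0';   /* drop the trailing newline */

   printf("Enter Output File Name: \n");
   if (fgets(file2, sizeof file2, stdin) == NULL)
      exit(1);
   file2[strcspn(file2, "\n")] = '\0';

   fd1 = fopen(file1,"r");
   if (fd1 == NULL)  {
      printf("Did not open file: %s\n",file1);
      exit(1);
   }

   fd2 = fopen(file2,"w");
   if (fd2 == NULL) {
      printf("Did not open file: %s\n",file2);
      fclose(fd1);
      exit(1);
   }

   html = NO;

   while (( c = getc(fd1)) != EOF) {

      if ( html == NO ) {
         if ( c == '<' )
             html = YES;          /* start of a command: stop echoing */
         else
             putc(c,fd2);         /* ordinary text: copy it through */
      }
      else if ( c == '>' )
         html = NO;               /* end of the command: resume echoing */

   }
   fclose(fd1);
   fclose(fd2);

   return 0;
}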

Example HTML file:

<HTML>
<HEAD>
<TITLE>Title of Document</TITLE>
</HEAD>

<BODY>

<H1>Level 1 Text</H1>

This is a paragraph.  This is a paragraph.
This is a paragraph that will wrap in the
browser until the end paragraph marker.<P>
<P>
The Paragraph marker is also used to create
a blank line of text.<P>
<P>
</BODY>
</HTML>
