STRIPHTML_C

From QL Hacker's Journal 23

With the popularity of the World Wide Web, more and more information is being formatted in HTML, the “language” of the Web. Since HTML is pure ASCII, this is not a problem for people who don’t have Web browsers. But HTML commands can make text look very convoluted. The following C program reads a file containing HTML commands, strips them out, and writes only the text to an output file.

Since HTML commands all start with a less-than sign ( < ) and end with a greater-than sign ( > ), stripping out the HTML is relatively easy: as you read in characters and echo them to the output file, turn off echoing when you see a < and turn it back on when you see a >. The end result will be a regular ASCII file. It may not be formatted to look nice, but the HTML stuff will be gone.

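/* striphtml_c
   This program takes in a file with HTML commands
   and outputs a file with the HTML commands
   stripped out.
*/

/* Header names are spelled QL style, as in the original listing;
   most other compilers want <stdio.h>, <stdlib.h> and <string.h>. */
#include <stdio_h>
#include <stdlib_h>
#include <string_h>

#define NO  0
#define YES 1

int main()
{
    char file1[30], file2[30];
    FILE *fd1, *fd2;
    int c;       /* int rather than char, so EOF is distinct from data */
    int html;    /* YES while inside an HTML command */

    printf("Enter Input File Name : \n");
    if (fgets(file1, sizeof(file1), stdin) == NULL)
        exit(1);
    file1[strcspn(file1, "\n")] = '\0';    /* drop the newline fgets keeps */

    printf("Enter Output File Name: \n");
    if (fgets(file2, sizeof(file2), stdin) == NULL)
        exit(1);
    file2[strcspn(file2, "\n")] = '\0';

    fd1 = fopen(file1, "r");
    if (fd1 == NULL) {
        printf("Did not open file: %s\n", file1);
        exit(1);
    }

    fd2 = fopen(file2, "w");
    if (fd2 == NULL) {
        printf("Did not open file: %s\n", file2);
        fclose(fd1);
        exit(1);
    }

    html = NO;

    while ((c = getc(fd1)) != EOF) {

        if (html == NO) {
            if (c == '<')
                html = YES;         /* command starts: stop echoing */
            else
                putc(c, fd2);
        }
        else if (c == '>')
            html = NO;              /* command ends: resume echoing */
    }

    fclose(fd1);
    fclose(fd2);

    return 0;
}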

Example HTML file:

<HTML>
<HEAD>
<TITLE>Title of Document</TITLE>
</HEAD>

<BODY>

<H1>Level 1 Text</H1>

This is a paragraph. This is a paragraph.
This is a paragraph that will wrap in the
browser until the end paragraph marker.<P>
<P>
The Paragraph marker is also used to create
a blank line of text.<P>
<P>
</BODY>
</HTML>
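
The echo on/off idea can also be tried on a string held in memory, which makes it easy to check against single lines of the example file. What follows is only a minimal sketch, not part of the original listing: the function name strip_tags is hypothetical, the header name assumes a standard C compiler rather than the QL spelling, and the caller is assumed to supply an output buffer at least as large as the input string.

```c
#include <stddef.h>

/* Hypothetical helper, not from the original program.
   Copies src to dst, leaving out everything from a '<' up to and
   including the next '>'. dst must have room for strlen(src) + 1
   bytes. Returns the number of characters written, not counting
   the terminating '\0'. */
size_t strip_tags(const char *src, char *dst)
{
    int in_tag = 0;              /* 1 while inside a <...> command */
    size_t n = 0;

    for (; *src != '\0'; src++) {
        if (in_tag) {
            if (*src == '>')
                in_tag = 0;      /* command ended: resume echoing */
        } else if (*src == '<') {
            in_tag = 1;          /* command started: stop echoing */
        } else {
            dst[n++] = *src;     /* ordinary text: copy it through */
        }
    }
    dst[n] = '\0';
    return n;
}
```

Called on the line <H1>Level 1 Text</H1> from the example, this leaves Level 1 Text in the output buffer. Line breaks are copied through untouched, so a line holding nothing but <P> becomes an empty line, which is why the stripped file may not look nicely formatted.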
