With the popularity of the World Wide Web, more and more information is being formatted in HTML, the “language” of the Web. Since HTML is plain ASCII text, this is not a problem for people who don’t have Web browsers. However, HTML commands can make the text hard to read. The following C program reads a file containing HTML commands, strips them out, and writes only the text to an output file.
Since HTML commands all start with a less-than sign ( < ) and end with a greater-than sign ( > ), stripping out the HTML is relatively easy to do: as you read in characters and echo them to the output file, turn off echoing when you see a < and turn it back on when you see a >. The end result will be a regular ASCII file. It may not be formatted to look nice, but the HTML stuff will be gone.
/* striphtml.c
This program takes in a file with HTML commands
and outputs a file with the HTML commands
stripped out.
*/
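#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NO  0
#define YES 1

int main(void) {
    int c;                        /* int, not char, so EOF can be detected */
    char file1[256], file2[256];
    FILE *fd1, *fd2;
    int html;                     /* YES while inside an HTML command */

    printf("Enter Input File Name : \n");
    if (fgets(file1, sizeof file1, stdin) == NULL)
        exit(EXIT_FAILURE);
    file1[strcspn(file1, "\n")] = '\0';   /* remove the trailing newline */

    printf("Enter Output File Name: \n");
    if (fgets(file2, sizeof file2, stdin) == NULL)
        exit(EXIT_FAILURE);
    file2[strcspn(file2, "\n")] = '\0';

    fd1 = fopen(file1, "r");
    if (fd1 == NULL) {
        printf("Did not open file: %s\n", file1);
        exit(EXIT_FAILURE);
    }
    fd2 = fopen(file2, "w");
    if (fd2 == NULL) {
        printf("Did not open file: %s\n", file2);
        fclose(fd1);
        exit(EXIT_FAILURE);
    }

    html = NO;
    while ((c = getc(fd1)) != EOF) {
        if (html == NO) {
            if (c == '<')
                html = YES;       /* start of a command: stop echoing */
            else
                putc(c, fd2);
        } else if (c == '>') {
            html = NO;            /* end of the command: resume echoing */
        }
    }

    fclose(fd1);
    fclose(fd2);
    return 0;
}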
Example HTML file:
<HTML>
<HEAD>
<TITLE>Title of Document</TITLE>
</HEAD>
<BODY>
<H1>Level 1 Text</H1>
This is a paragraph. This is a paragraph.
This is a paragraph that will wrap in the
browser until the end paragraph marker.<P>
<P>
The Paragraph marker is also used to create
a blank line of text.<P>
<P>
</BODY>
</HTML>
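
To try the technique on a single line of the example without creating any files, the same on/off echoing can be sketched as a function that works on a string in memory. This is only an illustration, not part of the program above: the name strip_tags and its two-argument form are made up for this sketch, and it assumes the caller supplies an output buffer at least as large as the input.

```c
#include <stddef.h>

/* Hypothetical helper (not in the original program): copy src to dst,
   skipping everything from a '<' through the next '>'.
   Assumes dst has room for at least strlen(src) + 1 characters.
   Returns the number of characters written, not counting the '\0'. */
size_t strip_tags(const char *src, char *dst)
{
    int in_tag = 0;               /* nonzero while inside an HTML command */
    size_t n = 0;

    for (; *src != '\0'; src++) {
        if (!in_tag) {
            if (*src == '<')
                in_tag = 1;       /* stop echoing */
            else
                dst[n++] = *src;  /* ordinary text: copy it */
        } else if (*src == '>') {
            in_tag = 0;           /* resume echoing */
        }
    }
    dst[n] = '\0';
    return n;
}
```

Called on the line "<H1>Level 1 Text</H1>" from the example, it should leave just "Level 1 Text" in the output buffer. A > that appears outside of a command is copied through unchanged, just as in the file-based program.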