
Prog 3. WAP to remove HTML tags from the source code of a URL.

1.1 Algorithm
Step 1: Enter the URL.
Step 2: Retrieve the HTML source code using the DownloadString function of the WebClient class.
Step 3: Remove the script blocks and HTML tags from the retrieved source by matching them against regular expressions.
Step 4: Display the source code, with the tags removed, in a text box.

1.2 Abstract
An HTML element is an individual component of an HTML document. HTML documents are composed of a tree of HTML elements and other nodes, such as text nodes. Each element can have attributes specified. Elements can also have content, including other elements and text. HTML elements represent semantics, or meaning. For example, the title element represents the title of the document. When only the text of a page is of interest, as in searching, the tags need to be removed to increase searching efficiency.

1.3 Code
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;
using System.Text.RegularExpressions;

namespace Tags_Removal
{
    public partial class Home : Form
    {
        public Home()
        {
            InitializeComponent();
        }

        // Removes script blocks, comments, and tags, then replaces &nbsp; with a space.
        private string RemoveAllTags(string text)
        {
            string strResult = string.Empty;
            text = RemoveJScript(text);
            try
            {
                // (.|\n)*? matches lazily across line breaks, so multi-line
                // comments and tags are covered as well.
                strResult = Regex.Replace(
                    Regex.Replace(text, @"<!--(.|\n)*?-->|<(.|\n)*?>", " "),
                    "&nbsp;", " ");
            }
            catch (Exception)
            {
                // If the regular expression fails, return an empty result.
                strResult = string.Empty;
            }
            return strResult;
        }

        // Removes <script>...</script> blocks together with their contents.
        private string RemoveJScript(string text)
        {
            string strResult = string.Empty;
            try
            {
                strResult = Regex.Replace(text, @"<script(.|\n)*?</script>", "",
                                          RegexOptions.IgnoreCase);
            }
            catch (Exception)
            {
                strResult = string.Empty;
            }
            return strResult;
        }

        private void button1_Click(object sender, EventArgs e)
        {
            // Prepend a scheme only when the user did not type one.
            string url = textBox1.Text.Trim();
            if (!url.StartsWith("http://", StringComparison.OrdinalIgnoreCase) &&
                !url.StartsWith("https://", StringComparison.OrdinalIgnoreCase))
            {
                url = "http://" + url;
            }

            // Download the page source, strip the tags, and show the result.
            using (System.Net.WebClient webClient = new System.Net.WebClient())
            {
                string data = webClient.DownloadString(url);
                richTextBox1.Text = RemoveAllTags(data);
            }
        }
    }
}
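
The tag-removal step can also be tried on its own, without the form or a network call. The following is a minimal console sketch of the same regular-expression approach, not the program's own code. The names TagStripper, Strip, and the sample string are hypothetical and do not appear in the program above. As assumptions of this sketch, it uses RegexOptions.Singleline in place of (.|\n)*?, also drops style blocks, decodes entities with WebUtility.HtmlDecode, and collapses repeated whitespace.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of the Home form class.
public static class TagStripper
{
    public static string Strip(string html)
    {
        if (string.IsNullOrEmpty(html))
        {
            return string.Empty;
        }

        // Drop <script> and <style> blocks together with their contents.
        // Singleline lets '.' match newlines, so multi-line blocks are covered.
        string text = Regex.Replace(
            html,
            @"<(script|style)\b.*?</\1\s*>",
            " ",
            RegexOptions.IgnoreCase | RegexOptions.Singleline);

        // Drop comments first, then any remaining tags.
        text = Regex.Replace(text, @"<!--.*?-->|<[^>]*>", " ", RegexOptions.Singleline);

        // Decode entities such as &nbsp; and &amp; into characters.
        text = WebUtility.HtmlDecode(text);

        // Collapse runs of whitespace, including the non-breaking space from &nbsp;.
        return Regex.Replace(text, @"[\s\u00A0]+", " ").Trim();
    }
}

public static class Program
{
    public static void Main()
    {
        // Assumed sample input, used only to exercise the helper.
        string sample =
            "<html><head><title>Demo</title><script>var x = 1;</script></head>" +
            "<body><!-- note --><p>Hello&nbsp;<b>world</b></p></body></html>";

        Console.WriteLine(TagStripper.Strip(sample)); // prints "Demo Hello world"
    }
}
```

Regular expressions are adequate for an exercise like this, but they do not understand HTML structure. Input such as a ">" inside an attribute value can still confuse them, so an HTML parser is the safer choice for arbitrary pages.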

1.4 O/p Snapshot
