From: John Morley on 1 Mar 2010 13:41

I didn't have any luck posting this over in the 'Controls' group, so I'm reposting it here:

Hi All,

I'm sure this has been covered before, but I haven't found anything that seems to help.

I'm trying to read the raw html data from a web page, so that I can parse it and extract the information I need. The data is primarily text that I wish to capture.

When I try to grab a page using the Winsock control, I get a 'page not found error'. I know that the URL and page location are correct because I can see a test file using a web browser.

Some (possibly) relevant code:

Private Sub cmdconnect_Click()
    On Error Resume Next

    TxtWebPage.Text = "" ' clear the text window
    Winsock1.RemoteHost = ksBaseURL
    Winsock1.RemotePort = 80
    Winsock1.Connect

End Sub

Private Sub Winsock1_Connect()
    On Error Resume Next
    Dim strCommand As String
    Dim strWebPage As String

    strWebPage = TxtFileLocation.Text
    strCommand = "GET " + strWebPage + " HTTP/1.0" + vbCrLf
    strCommand = strCommand + "Accept: */*" + vbCrLf
    strCommand = strCommand + "Accept: text/html" + vbCrLf
    strCommand = strCommand + vbCrLf

    Debug.Print strCommand
    Winsock1.SendData strCommand

End Sub

I *think* my problem is with the GET request, but I haven't found out what it is yet!

Ideas?

Thanks,

John
From: Nobody on 1 Mar 2010 13:55

"John Morley" <jmorley(a)nospamanalysistech.com> wrote in message news:e7bJe7WuKHA.3504(a)TK2MSFTNGP06.phx.gbl...
> strCommand = "GET " + strWebPage + " HTTP/1.0" + vbCrLf
> strCommand = strCommand + "Accept: */*" + vbCrLf
> strCommand = strCommand + "Accept: text/html" + vbCrLf
> strCommand = strCommand + vbCrLf
>
> I *think* my problem is with the GET request, but I haven't found out what
> it is yet!
>
> Ideas?

You need to include, after the GET line:

    strCommand = strCommand + "Host: www.somesite.com" + vbCrLf

That line is needed if multiple hosts share the same IP. Also, you need to use a URLEncode function (search the web).

Also, why not use WinInet? It's easier to use. See this sample:

SAMPLE: Vbhttp.exe Demonstrates How to Use HTTP WinInet APIs in Visual Basic
http://support.microsoft.com/kb/259100

As for parsing HTML, try using the "Microsoft HTML Object Library", which is part of IE. See the sample in this post, which prints a list of links in a web page. It can be adapted to parse various aspects of HTML tags easily.

http://groups.google.com/group/microsoft.public.vb.general.discussion/msg/ce903530d703561c
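
A minimal sketch of the Host header and URL-encoding advice above, assuming the request string is built in a standard module and sent from the Winsock1_Connect event. BuildGetRequest and UrlEncodePath are hypothetical helper names, not part of the Winsock control or of the original posts, and the encoder assumes ASCII input only:

```vb
Option Explicit

' Hypothetical helper (not from the original posts): builds an HTTP/1.0
' GET request that includes the Host header.
Public Function BuildGetRequest(ByVal sHost As String, ByVal sPath As String) As String
    Dim sRequest As String

    ' The request path must start with "/"; default to the site root.
    If Len(sPath) = 0 Then
        sPath = "/"
    ElseIf Left$(sPath, 1) <> "/" Then
        sPath = "/" & sPath
    End If

    sRequest = "GET " & sPath & " HTTP/1.0" & vbCrLf
    sRequest = sRequest & "Host: " & sHost & vbCrLf
    sRequest = sRequest & "Accept: */*" & vbCrLf
    sRequest = sRequest & vbCrLf   ' blank line ends the header block

    BuildGetRequest = sRequest
End Function

' Hypothetical helper: percent-encodes a URL path, leaving "/" intact.
' Assumption: ASCII input only. Non-ASCII characters are passed through
' unchanged because they would need UTF-8 encoding first.
Public Function UrlEncodePath(ByVal sText As String) As String
    Dim i As Long
    Dim sChar As String
    Dim lCode As Long
    Dim sOut As String

    For i = 1 To Len(sText)
        sChar = Mid$(sText, i, 1)
        lCode = AscW(sChar)
        Select Case lCode
            Case 48 To 57, 65 To 90, 97 To 122   ' 0-9, A-Z, a-z
                sOut = sOut & sChar
            Case 45, 46, 47, 95, 126             ' - . / _ ~
                sOut = sOut & sChar
            Case 0 To 127                        ' other ASCII: %XX
                sOut = sOut & "%" & Right$("0" & Hex$(lCode), 2)
            Case Else                            ' non-ASCII: not handled here
                sOut = sOut & sChar
        End Select
    Next i

    UrlEncodePath = sOut
End Function
```

In Winsock1_Connect this could be called as Winsock1.SendData BuildGetRequest("www.somesite.com", UrlEncodePath(TxtFileLocation.Text)), where the host is the same bare host name (no "http://" prefix) that was assigned to Winsock1.RemoteHost. If the location includes a query string, only the path part should go through UrlEncodePath, since this sketch would also encode "?", "=" and "&".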
From: mayayana on 1 Mar 2010 14:10

"A client MUST include a Host header field in all HTTP/1.1 request messages .."
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html

I don't know about HTTP 1.0, but I'd guess that's the problem.

Also note: If you don't include an Accept-Encoding line in the header you *should* get back plain text, but it's a good idea to be prepared for gzip compression.

There's a UserControl here that might be useful:
http://www.jsware.net/jsware/vbcode.php5#htp

It encapsulates the process of downloading files via HTTP, using Windows sockets directly so that the Winsock control isn't necessary. You could use that instead of what you've got, or just use it as example code for the call.

There are a lot of possible details in terms of the HTTP header, but most of it's not necessary. Once you've downloaded some files you can see what a typical header looks like.
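
To see what a typical header looks like, the response can be split at the first blank line. The following is a rough sketch, assuming the whole response has already been collected into one string (for example by appending each chunk from Winsock1.GetData in the DataArrival event until the server closes the connection). SplitHttpResponse, GetStatusCode and IsGzipped are hypothetical names, not taken from the posts or from the jsware UserControl:

```vb
Option Explicit

' Hypothetical helper: separates the response header from the body.
' The header block ends at the first blank line (CRLF CRLF).
Public Sub SplitHttpResponse(ByVal sResponse As String, _
                             ByRef sHeader As String, _
                             ByRef sBody As String)
    Dim lPos As Long

    lPos = InStr(1, sResponse, vbCrLf & vbCrLf, vbBinaryCompare)
    If lPos > 0 Then
        sHeader = Left$(sResponse, lPos - 1)
        sBody = Mid$(sResponse, lPos + 4)   ' skip the 4-character separator
    Else
        ' No blank line found: treat everything as header.
        sHeader = sResponse
        sBody = ""
    End If
End Sub

' Hypothetical helper: reads the three-digit status code from the status
' line, e.g. "HTTP/1.0 404 Not Found". Returns 0 if none is found.
Public Function GetStatusCode(ByVal sHeader As String) As Long
    Dim lSpace As Long

    lSpace = InStr(1, sHeader, " ")
    If lSpace > 0 Then
        GetStatusCode = CLng(Val(Mid$(sHeader, lSpace + 1, 3)))
    End If
End Function

' Hypothetical helper: True if the server sent "Content-Encoding: gzip".
Public Function IsGzipped(ByVal sHeader As String) As Boolean
    Dim vLines As Variant
    Dim i As Long
    Dim sLine As String

    vLines = Split(sHeader, vbCrLf)
    For i = LBound(vLines) To UBound(vLines)
        sLine = LCase$(vLines(i))   ' header names are case-insensitive
        If Left$(sLine, 17) = "content-encoding:" Then
            IsGzipped = (InStr(sLine, "gzip") > 0)
            Exit Function
        End If
    Next i
End Function
```

For a header that starts with "HTTP/1.0 404 Not Found", GetStatusCode returns 404, which makes a 'page not found' reply easy to tell apart from a connection problem. A gzip-compressed body is binary data, so a String buffer is probably not a safe place to hold it; receiving into a Byte array would be the more careful choice in that case.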
From: C. Kevin Provance on 1 Mar 2010 14:56

"John Morley" <jmorley(a)nospamanalysistech.com> wrote in message news:e7bJe7WuKHA.3504(a)TK2MSFTNGP06.phx.gbl...
| Hi All,
|
| I'm sure this has been covered before, but I haven't found anything that
| seems to help.
|
| I'm trying to read the raw html data from a web page, so that I can
| parse it and extract the information I need. The data is primarily text
| that I wish to capture.
|
| When I try to grab a page using the Winsock control, I get a 'page not
| found error'. I know that the URL and page location are correct because
| I can see a test file using a web browser.

If you are attempting to screen scrape, ensure the page is not generated dynamically when it loads. Content that a script builds in the browser will not be in the raw HTML the server sends back.