Writing Your Own Search Engine Using SQL Server

My coffee site INeedCoffee needed a better search engine. I had thrown some basic SQL together when the site launched back in 1999. It did an OK job when the site didn’t have much content, but over the years the quality of the search results got worse and worse. So I did what any coder would do: I looked for a free solution.

Google did a better job searching my site than my own code, so I looked at their Google Custom Search solution. I didn’t like their free ad-supported version, and I didn’t want to pay them $100 each and every year for the non-ad version. I decided that not only could I write my own search engine that was just as fast, but I could also deliver better results to the users. After all, I knew my content better than anyone else.

Assigning a Quality Score

The first thing I noticed about Google’s search results is that the best article on a given topic often wasn’t listed first. Google had no way to know quality, but I did. So I added a quality score of 1 to 5 for every article. The default was 3. The best content was rated a 4 or 5. Articles that needed better photos or other improvements were given a 1 or 2. Later I’d also use this quality score when assigning weight on the sitemap.

Web Form -> Server Side Code -> Stored Procedure

The HTML search form is pretty basic: a single text box and a submit button. Which server-side language you use to call the stored procedure is irrelevant. ASP.NET, Classic ASP, PHP – it is all good. The server-side code simply passes the search string to the search stored procedure.

Two Temp Tables

The search stored procedure will have two temp tables: #searchWords and #searchResults. The purpose of #searchWords is to chop up any search phrase into individual words and record each word’s position. Later that position is used to order the search results, with more weight placed on the first and second words in a search query. The #searchResults table holds the results that are returned to the web page.

CREATE TABLE #searchWords (
    word      VARCHAR(100),
    position    INT
)
CREATE TABLE #searchResults (
    url        VARCHAR(100),
    title      VARCHAR(100),
    longDesc   VARCHAR(MAX),
    quality    TINYINT,
    score      INT
)

Splitting Search Phrases

For this functionality, I found some code on StackOverflow that did the job. The SplitWordList user-defined function by Terrapin works perfectly. If the user places the search term inside quotes, I do not call the SplitWordList function and instead enter the entire phrase as one row in the #searchWords table.

INSERT INTO #searchWords SELECT word, position from SplitWordList(@searchString)
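
To make the shape of the #searchWords rows concrete, here is a minimal Python sketch of the splitting step. It is not the SplitWordList function and it is not T-SQL; the name split_search_phrase is hypothetical, and two details are assumptions drawn from the description above: positions start at 1, and a phrase wrapped in double quotes is stored as a single row.

```python
def split_search_phrase(search_string):
    """Return (word, position) rows like those stored in #searchWords."""
    text = search_string.strip()
    # Assumption: a phrase wrapped in double quotes is kept whole as one row.
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        phrase = text[1:-1].strip()
        return [(phrase, 1)] if phrase else []
    # Otherwise split on whitespace and number the words starting at 1.
    return [(word, position) for position, word in enumerate(text.split(), start=1)]


print(split_search_phrase("french press coffee"))
print(split_search_phrase('"french press"'))
```

The position value is what the scoring step later uses to give the first and second words more weight.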

Count String Function

For the actual search, I used the Count String Occurrence Function. The search words are compared first against the article title and then the content itself.

CREATE FUNCTION [dbo].[udfCountString](
    @InputString    VARCHAR(MAX),
    @SearchString    VARCHAR(100)
)
RETURNS INT
BEGIN
    RETURN (LEN(@InputString) -
            LEN(REPLACE(@InputString, @SearchString, ''))) /
            LEN(@SearchString)
END
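
The arithmetic behind this function is easy to check outside the database. The following Python sketch mirrors the same length-difference trick: remove every occurrence of the search string, measure how much shorter the text became, and divide by the length of the search string. The name count_string and the empty-string guard are additions for illustration only. Python’s str.replace is case-sensitive, whereas SQL Server’s REPLACE follows the collation of its input, so the two may disagree on mixed-case text.

```python
def count_string(input_string, search_string):
    """Count non-overlapping occurrences using the length-difference trick."""
    if not search_string:
        # Guard added for the sketch: an empty search string would divide by zero.
        return 0
    removed = input_string.replace(search_string, "")
    return (len(input_string) - len(removed)) // len(search_string)


# "coffee and more coffee" is 22 characters; removing both "coffee"s leaves 10.
# (22 - 10) / 6 = 2 occurrences.
print(count_string("coffee and more coffee", "coffee"))
```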

I Like Cursors

The most straightforward approach I could think of for getting search results was to use two cursors: one over the content and one over the search words. Each hit is written to the #searchResults temp table. Cursors are often frowned upon for poor performance, so I decided to code the search engine using cursors first and come up with an alternate solution only if I ran into a performance problem. I didn’t need to, as I got rocking fast results using cursors.

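-- Working variables (types assumed to match the temp tables and udfCountString)
DECLARE @url VARCHAR(100), @title VARCHAR(100), @longDesc VARCHAR(MAX),
        @quality TINYINT, @page VARCHAR(MAX),
        @word VARCHAR(100), @position INT, @score INT, @count INT

DECLARE ContentCursor CURSOR FAST_FORWARD FOR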
SELECT url, title, longDesc, quality, page
FROM Articles 

DECLARE SearchWordCursor CURSOR DYNAMIC FOR
SELECT word, position FROM #searchWords
OPEN SearchWordCursor 

OPEN ContentCursor
FETCH NEXT FROM ContentCursor INTO @url, @title, @longDesc, @quality, @page

WHILE @@FETCH_STATUS = 0
BEGIN
    FETCH FIRST FROM SearchWordCursor INTO @word, @position
    WHILE @@FETCH_STATUS = 0
    BEGIN
        -- place more weight on the first search term
        SELECT @score = CASE @position
            WHEN 1 THEN 3
            WHEN 2 THEN 2
            ELSE 1
        END
        -- search the TITLE 
        SET @count = dbo.udfCountString(@title, @word)
        IF @count > 0
        BEGIN
            INSERT INTO #searchResults VALUES (@url, @title, @longDesc, @quality, @score * 10)
        END
        -- search the PAGE
        SET @count = dbo.udfCountString(@page, @word)
        IF @count > 0
        BEGIN
            INSERT INTO #searchResults VALUES (@url, @title, @longDesc, @quality, @score)
        END                    

        FETCH NEXT FROM SearchWordCursor INTO @word, @position
    END
    FETCH NEXT FROM ContentCursor INTO @url, @title, @longDesc, @quality, @page
END

CLOSE ContentCursor
DEALLOCATE ContentCursor

CLOSE SearchWordCursor
DEALLOCATE SearchWordCursor
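
For readers who want to trace the scoring by hand, here is a rough Python sketch of the same nested loop. It is an illustration of the logic only, not a replacement for the stored procedure. The function names and the dictionary keys are hypothetical, and the case-insensitive match is an assumption (in SQL Server that depends on the collation).

```python
def position_weight(position):
    """First search word weighs 3, second weighs 2, every later word weighs 1."""
    return {1: 3, 2: 2}.get(position, 1)


def score_articles(articles, search_words):
    """Return one (url, score) row per hit, like the INSERTs into #searchResults."""
    hits = []
    for article in articles:                  # outer loop: ContentCursor
        for word, position in search_words:   # inner loop: SearchWordCursor
            weight = position_weight(position)
            needle = word.lower()
            if needle in article["title"].lower():
                hits.append((article["url"], weight * 10))  # title hit counts 10x
            if needle in article["page"].lower():
                hits.append((article["url"], weight))       # page hit
    return hits


articles = [{
    "url": "/french-press",
    "title": "French Press Tutorial",
    "page": "How to brew coffee with a french press.",
}]
print(score_articles(articles, [("french", 1), ("coffee", 2)]))
```

In this example "french" hits both the title (3 x 10 = 30) and the page (3), while "coffee" hits only the page (2), so the article produces three rows that still need to be summed.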

Working With the Results

Before dropping both temp tables, here is the query used to return the search results. If you look at the SQL above, you will see that it is possible (even likely) that a search hit will take place on both the title and the page content. I ran some tests and determined that a hit against a word in the title was 10 times more important than a hit in the content, so I multiply the score by ten if there is a title match.

To flatten the results, I use a GROUP BY clause in the SQL. The results are then returned in order from the highest score to the lowest, with the quality score breaking ties.

SELECT TOP 20 S.url, S.title, S.longDesc, S.quality, SUM(S.score) AS Score
FROM #searchResults S
GROUP BY S.url, S.title, S.longDesc, S.quality
ORDER BY SUM(S.score) DESC, S.Quality DESC
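
The effect of the GROUP BY and ORDER BY can be sketched in a few lines of Python. This is only an illustration of the flattening step under simplified assumptions: each hit row is reduced to (url, quality, score), and the name top_results is hypothetical.

```python
from collections import defaultdict


def top_results(hit_rows, limit=20):
    """Sum scores per article, then sort by total score and quality, both descending."""
    totals = defaultdict(int)
    for url, quality, score in hit_rows:
        totals[(url, quality)] += score
    ranked = sorted(
        totals.items(),
        key=lambda item: (item[1], item[0][1]),  # (total score, quality)
        reverse=True,
    )
    return [(url, quality, total) for (url, quality), total in ranked[:limit]]


rows = [("/a", 3, 30), ("/a", 3, 3), ("/b", 5, 33), ("/c", 4, 2)]
print(top_results(rows))
```

Here /a and /b both total 33, so the higher quality score puts /b first, which is the role the quality column plays in the ORDER BY.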

Better Than Google?

I ran numerous tests comparing my search engine to Google. My hand-coded INeedCoffee search engine delivered better results at equal or faster speeds. And the best part is I don’t need to send Google a check for $100 every year.

Using a Classic ASP Dictionary Object To Handle Broken Links

My site INeedCoffee was first built way back in 1999 using Classic ASP.  Back then it was just called ASP or Active Server Pages.  It works well and doesn’t require moving to ASP.NET, so I never updated the code base.  If it isn’t broke, don’t fix it.

There was one thing I wanted to address better recently and that was how I handled incoming broken links.  I used to just redirect every request to the home page, but I’ve since learned that 404 error codes are OK to have.  If the page is missing, returning a 404 code is appropriate.

When I studied the bad incoming links, I isolated about 15 where I could tell what article was intended in the link.  For these links, I wanted to give a 404 error and provide a suggestion on what page is most likely the correct link.  For this I used the Scripting.Dictionary object.  In it, I matched the bad link with the good link.

RequestedBadLink = Request.ServerVariables("QUERY_STRING")

'- Dictionary of known bad URL requests
Set BadLinks = Server.CreateObject("Scripting.Dictionary")
BadLinks.Add "404;http://example.com:80/bad-link","http://example.com/good-link"
'- add more links here
If BadLinks.Exists(RequestedBadLink) Then
    SuggestedURL = BadLinks.Item(RequestedBadLink)
Else
    SuggestedURL = ""
End If

For the bad links, I added “404;” to the front of the link and embedded “:80” after the domain.  This is the format that the QUERY_STRING server variable uses on the 404 page.  Then on your 404 page, you can test whether a suggested URL was found and display it for the user.
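
Building those dictionary keys by hand is error-prone, so here is a small Python sketch of the key format described above. It is illustrative only and is not Classic ASP; the helper names to_404_key and suggest_url are hypothetical, and port 80 is assumed because that is what the example key uses.

```python
from urllib.parse import urlsplit


def to_404_key(bad_url, port=80):
    """Build a key in the form '404;scheme://host:port/path'."""
    parts = urlsplit(bad_url)
    key = f"404;{parts.scheme}://{parts.hostname}:{port}{parts.path}"
    if parts.query:
        key += "?" + parts.query
    return key


def suggest_url(query_string, bad_links):
    """Return the suggested good link, or an empty string when none is known."""
    return bad_links.get(query_string, "")


bad_links = {to_404_key("http://example.com/bad-link"): "http://example.com/good-link"}
print(suggest_url("404;http://example.com:80/bad-link", bad_links))
```

An empty result means the 404 page has no suggestion to show, which matches the Else branch in the ASP code.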

My Secret For Picking Excellent Web Hosts

I have paid for web hosting since 1995.  When it comes to service, I’ve done pretty well.  I follow a three-step process when searching for web hosting.

  1. Web Research – Everyone does this.  Find a host that meets all your requirements at the price point you are willing to pay.
  2. Sunday or Late Night Email – Let us say that you have 2 or 3 possible web hosts in mind.  Wait until Sunday and then fire off an email to their support staff asking a question.  The question shouldn’t be so difficult that it requires lots of research, but it shouldn’t be so easy that a salesperson could answer it.
  3. Wait For Responses – The winner of your business will be the web host that quickly and professionally responds to your off-hour email question.

When I pay for web hosting I want to know that they have competent people working during non-business hours.  Some of these discount web hosts are able to offer rock bottom prices because they have bare bones technical support.  They staff their rock stars during business hours, because that is when they are most likely to land a new customer.  Sending an email on Sunday, or late at night, is a perfect way to test a web host before handing over your credit card data.

After you have established a relationship with a web host, periodically send an off-hour email.  Maybe once or twice a year.  Keep them on their toes.  Consider this test to be part of your website backup strategy.

This site uses WinHost, which I highly recommend for ASP.NET and SQL Server web hosting.  They passed the Sunday email test.

Using Instapaper for Web Legibility

When I first looked over the Instapaper application I didn’t think I needed it.  I used Delicious as my social bookmarking site.  Why would I need Instapaper?  Then news stories came out saying that Yahoo! would either shut down or sell Delicious, so I gave Instapaper another look.

How did I miss the primary reason for using Instapaper?  Its primary feature is not bookmarking, it is making text legible.  There was this trend in web design around 2003 where font sizes got really small.  I’m afraid it may be coming back and my eyes just can’t take it.  Thankfully with Instapaper, I no longer have to suffer through reading painfully small fonts or gray text on a white background.

Here is how it works.

  1. Set up an account on Instapaper.
  2. Add the Read Later bookmarklet to your browser, per their instructions.
  3. Go to a page you wish to read that has a hideous web design with dreadful legibility.  For this example, I am going to use the outstanding article Vitamin D — Problems With the Latitude Hypothesis on the Weston A Price website.
  4. Press your Read Later bookmarklet.
  5. Go back to Instapaper, locate the article on your list and click the Text button.

Painful typography

Add it to Instapaper and then click the Text button.

Presto.  The article is now readable.

Instapaper doesn’t work for every website, but it works for most.  My online reading speed and comprehension have never been greater.  I highly recommend Instapaper.

UPDATE: Seems Instapaper has a feature in their Extras section called Instapaper Text.  Drag that bookmarklet to your toolbar and click it whenever you need text cleaned up.  No account needed.  Use this if you don’t need the bookmarking features.   They also give a shout out to a similar solution called Readability.

DreamHost and the Myth of Unlimited Domains

DreamHost promotes unlimited domain web hosting for $8.95/month.  It seemed like a sweet deal, so I signed up for it.  The reality is that their offer is misleading.  DreamHost offers 100 MB of memory per account.  If you exceed this number – even for a second – they unleash a procwatch to kill the process.

At this point you can reach out to DreamHost’s support team.  They will tell you that it is the fault of WordPress plugins and then try and upsell you on a VPS account.  They promise not to kill your processes if you get a VPS account.  Kind of like a shop keeper that pays the gangster protection money so his store doesn’t burn down.  How sweet!  My personal opinion is why should I pay for an enhanced service if the basic service is awful? I’ll just switch web hosts.

Back to the question of resources.  Just how much memory does a website running WordPress with some basic plugins use?  As of this writing, I have 2 WordPress sites on DreamHost (not DigitalColony.com).  I installed 2 plugins to help me track my memory usage: WP-Memory-Usage and TPC! Memory Usage.  Below is a screen shot from one of my sites.  The other site shows similar numbers.

With these two plugins I learned a few things:

  1. DreamHost limits your PHP memory to just 90 MB.
  2. A basic install of WordPress takes about 30 MB of memory on a 64 bit installation of PHP.
  3. I activated and deactivated every plugin.  Most used trivial amounts of memory.  No plugin exceeded 2 MB of memory.  Even the much-maligned All-In-One SEO plugin used only 1.05 MB.
  4. Switching themes had almost no impact on memory usage.
  5. With as few as two domains using WordPress on DreamHost you are already reaching the upper limits of the memory allocated.  So much for unlimited domains.  Perhaps they should rephrase it to unlimited unused domains?
  6. DreamHost has a serious LOAD AVERAGE problem.  The numbers in the above screen capture were the lowest I captured.  Often the load averages exceeded 10.

Even though I went looking for answers on memory usage, the load average numbers jumped out at me.  What do they mean and what is a good number?

The article Understanding Linux CPU Load – when should you be worried? is a great tutorial on the topic.  It makes the case that the maximum load should not exceed the number of cores on the server.  My DreamHost server has 4 cores.  I monitored this number all day and it was always in the red zone.  The CPU load on DreamHost servers is excessive.
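
That rule of thumb is simple enough to write down.  The Python sketch below divides a load average by the core count and flags anything above 1.0 per core.  The function names are hypothetical and the 1.0 threshold is just the "load should not exceed the number of cores" guideline restated; on a Unix machine the live numbers could come from os.getloadavg(), which is not used here so the example stays self-contained.

```python
def load_per_core(load_average, cores):
    """Normalize a load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be a positive number")
    return load_average / cores


def is_overloaded(load_average, cores):
    """True when the load exceeds the number of cores (more than 1.0 per core)."""
    return load_per_core(load_average, cores) > 1.0


# A load average of 10 on a 4-core server is 2.5 per core.
print(load_per_core(10, 4), is_overloaded(10, 4))
```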

My advice is to stay away from DreamHost.  Their servers are overloaded and if you plan to host more than one WordPress account you’ll experience problems.

UPDATE (Nov 25, 2010) – This morning the DreamHost load averages spiked much higher!

  • Load Averages: 144.95 45.93 21.12