We use the open-source search server Solr for real-time search on data stored in a Hadoop cluster. For our terabyte-scale dataset, we had to implement distributed search across multiple Lucene index partitions (shards). This article describes our solution for managing 40 independent Solr instances without human intervention, including transparent failover and automatic backup of the Lucene index.
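To illustrate the sharding mechanism: Solr fans a query out to all partitions listed in its `shards` request parameter and merges the partial results. The following request fragment is a minimal sketch — the host names and core paths are hypothetical, not our actual deployment:

```
# Distributed Solr query: the receiving instance forwards the query
# to every shard in the "shards" list and merges the ranked results.
http://solr1:8983/solr/select?q=error
    &shards=solr1:8983/solr,solr2:8983/solr,solr3:8983/solr
```

In a setup with many independent instances, this shard list is exactly the piece of configuration that failover logic has to keep up to date when an instance disappears.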
When building multi-platform projects that involve C++, developer teams face the task of compiling and running the project on different platforms with different compilers. Moreover, often only part of the project is written in C++ while other parts are written in Java, and an automated build process has to cover both languages. In our project we found a solution based on Hudson.
Serious on-site optimization begins with the head tags of your HTML documents. For example, up to 150 characters from the description are displayed on the search result pages. This article gives concrete guidelines for writing the HTML head section. You will also learn the meaning of important and even some mysterious meta tags, e.g. GEO values, and find out which ones are now deprecated.
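As a taste of what such a head section looks like, here is a minimal sketch; the description text and the GEO coordinates (Munich, in this example) are placeholder values:

```html
<head>
  <title>Concise, keyword-bearing page title</title>
  <!-- Up to ~150 characters of the description may appear on result pages -->
  <meta name="description" content="A short, accurate summary of this page's content for searchers.">
  <meta name="robots" content="index, follow">
  <!-- GEO meta tags (placeholder coordinates) -->
  <meta name="geo.region" content="DE-BY">
  <meta name="geo.position" content="48.137;11.575">
  <meta name="ICBM" content="48.137, 11.575">
</head>
```

The article discusses which of these tags search engines actually evaluate and which are ignored or deprecated.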
In the previous part of this article series we focused on the efficient storage of log data in Hadoop. We described how to store the data in Hadoop’s MapFiles, and we tweaked the configuration settings for increased data storage capacity and greater retrieval speed. Today, we discuss how to perform near real-time searches on up to 36.6 billion log messages by a clever combination of Hadoop with Lucene and Solr.
Hudson can be extended in every possible way by a whopping 350+ plugins! We were most impressed with the code analysis suite. This set of plugins helped us to significantly increase the efficiency of our code review process, and to improve our code quality. In this article, we share our experience with the suite and introduce many of the great code analysis features.
The URL of a page is an important SEO factor that influences relevancy and position in the search results. And since it is also displayed in the result, users quickly scan the URL to assess the page’s significance. This article shows how to design and structure the URLs of your pages in a way that is helpful to both people and search engines. We also discuss how to recover from missing (broken) URLs.
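One standard recovery technique when a URL changes is a permanent (301) redirect, so that existing links and rankings carry over to the new address. A minimal Apache mod_rewrite sketch, with hypothetical paths:

```
RewriteEngine On
# Old numeric URL permanently redirected to a new, descriptive URL
RewriteRule ^article/42$ /seo/url-design/ [R=301,L]
```

The 301 status tells search engines the move is permanent, which is the signal they need to transfer the old URL's standing to the new one.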
Warp Persist is a persistence integration library for Google Guice. It comes with built-in support for Hibernate, JPA, and db4o. Warp Persist proved to be lightweight and unobtrusive. Especially useful in our case was its runtime support for multiple databases, alongside other features such as declarative transactions and dynamic finders.
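The wiring is compact: a Warp Persist module is built fluently and handed to Guice alongside the application's own modules. The following is a sketch of that bootstrap for the Hibernate case — `MyAppModule` and `OrderService` are hypothetical, and the Guice, Warp Persist, and Hibernate jars must be on the classpath:

```java
import com.google.inject.Guice;
import com.google.inject.Injector;
import com.wideplay.warp.persist.PersistenceService;
import com.wideplay.warp.persist.Transactional;
import com.wideplay.warp.persist.UnitOfWork;

public class Bootstrap {
  public static void main(String[] args) {
    // Build the Warp Persist module for Hibernate with one unit of
    // work per transaction; MyAppModule (hypothetical) supplies the
    // application bindings, including the Hibernate Configuration.
    Injector injector = Guice.createInjector(
        PersistenceService.usingHibernate()
            .across(UnitOfWork.TRANSACTION)
            .buildModule(),
        new MyAppModule());

    // The persistence service must be started before first use.
    injector.getInstance(PersistenceService.class).start();
  }
}

class OrderService {
  // Declarative transactions: Warp Persist intercepts @Transactional
  // methods and wraps them in a database transaction.
  @Transactional
  public void placeOrder(String orderId) {
    // ... session work runs inside the transaction ...
  }
}
```

The `@Transactional` interception is what makes the library feel unobtrusive: services contain no explicit transaction handling.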
Our Java project team has only three developers. We liked the idea of having our tests and builds run automatically and of having a central dashboard. However, we didn’t want to invest much time and expected a continuous integration server to be overkill. But when we started to play around with Hudson, we were quite amazed: the system was up in 5 minutes, including builds, tests, and e-mail notification.
Designing a website nowadays always includes the task of optimizing the website for search engines. Otherwise you might have designed a brilliant website but nobody will be able to find it! Ideally, your site will be in the top 10 search results, i.e. on the first page. This blog series by Marcus Günther and Oliver Schmidt describes how to attain this goal. The first lesson is to master the art of being crawled by a search engine robot.
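A first, simple step toward being crawled well is a clean robots.txt at the site root. The sketch below uses hypothetical paths and an example domain:

```
# robots.txt: allow all crawlers everywhere except the admin area,
# and point them to the XML sitemap.
User-agent: *
Disallow: /admin/
Sitemap: http://www.example.com/sitemap.xml
```

The series goes into detail on how crawlers interpret these directives and what else influences whether your pages get indexed.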
In part 1 of this article series we described the various challenges of dealing with large amounts of logging data in a heavily distributed software ecosystem. After evaluating different approaches, we quickly selected Hadoop as the technology of our choice. In this article we describe some pitfalls we had to overcome when using Hadoop to store log messages.
At the Spring S2G Forum in Munich, the Groovy project lead Guillaume Laforge elaborated on Groovy 1.7’s new and noteworthy improvements. There was huge interest in Groovy at the forum, which reinforces our strategic investment in Groovy as a capable and mature JVM-based language. Overall, I gained a lot of insights and have prepared some illustrative examples here.
Our team has developed a system for storing and processing huge amounts of log data using Hadoop. The challenge was to handle gigabytes of log messages every day while enabling queries against the 30+ terabyte archive with instantaneous results. In this first part of our blog series we explain our motivation for using Hadoop and contrast this solution with the traditional relational database approach.
mgm develops several large-scale web applications containing hundreds of forms, e.g. the e-government ElsterOnline Portal. The design, implementation, testing, and maintenance of these forms involves significant effort. We are currently evaluating different technologies – including XForms, Wicket and JSF (ICEfaces) – that could help us reduce this effort, improve quality, and set the framework for our next generation of forms-centric web applications.