Data Smells
There has been much discussion in recent years about code smells and using them to drive refactoring. I have heard and read very little, however, about data smells. Data smells are different from code smells and they drive a different process, but I believe that they are equally as important.
Data is like fresh food from the grocery store; much of what you buy has a “use or freeze by” date. Data too has this expiration date and like the fresh food, if this date is ignored, there will be spoilage that results in unpleasant smells. In data’s case, the “use by” date can be quite obvious as in the case of accounting data or more subtle like the useful lifespan of customer purchasing data. The “freeze by” date gives an indication of when the data is no longer useful in daily operations but generally has value as historical information to be used in reporting and analysis. The most insidious data is comprised of the fields, tables and sometimes entire data stores that have been rarely or never used. The bad smells don’t reach us because this data or structure is not being actively used.
Where does this data originate? Sometimes, it comes from the pet projects of long-departed managers or executives that ultimately didn’t serve a useful business function. Others may come from projects that were dropped due to budget or other business constraints and left in a partially-complete state. In my experience, the greatest contributors to stale data are the large, enterprise applications like Onyx CRM / Consona CRM or Oracle Financials that have been implemented in a traditional waterfall fashion. SaaS applications such as Salesforce.com that allow well-intentioned but technically-challenged users to create entire data structures are a close second and gaining rapidly.
In waterfall implementations of large, enterprise-level applications, user groups often pile on requirements in an effort not to miss any possible use case, even those which should be considered to be marginal. Because requirements are gathered upfront and then cast in stone, it’s easy to see why so much marginally-useful or even useless data points are created to satisfy the perceived business need. How often do we review this data once it has been implemented? Monthly, yearly, once a decade? Unfortunately, the likely answer is: never. We have servers to manage, upgrades, updates and patches to test, database code to write and refactor, documentation to maintain, meetings to attend, newer, shinier applications to prepare and the list goes on and on. Meanwhile, columns that contain minimal or no data and tables with stale, useless data take up space and use server resources unnecessarily.
It is incumbent on data professionals, and I include DBAs, data developers and business analysts in this category, to find these issues and to offer solutions.
- Consider re-purposing custom and user-defined data columns to support newer business needs.
- Offer to archive (just in case) those old tables and to eliminate them from the production server.
- Refactor code to eliminate dead-ends and processes that are creating the offending rows. I have seen several cases where there hundreds of thousands of rows of data that is still being created to support a process that no longer exists or is assigned to a user who is long-gone.
We offer CRM “tune-ups” for Onyx CRM, Consona CRM, Oracle Financials, Microsoft Dynamics CRM and Salesforce.com which would include data analysis of this kind. Look for a blog post on this in the future.
In addition, I’ll address the “how-to” side of this issue in a future post. In the meantime, take a look at your biggest data stores. Do you have any columns that have never been populated, have not been populated in more than a year or have a population of less than 5% of the rows? If so, these are candidates for the big cleanup ahead.
EC2 vs VMforce
“Cloud computing” has replaced “Web 2.0″ and “social networking” as the latest buzz phrase. Today, every other company or organization is gearing itself to move onto the cloud. Getting onto the cloud is itself a big deal but what is more challenging is to decide on which cloud one should go to. Today, the three most impactful companies providing cloud services are Amazon (EC2), Salesforce and Microsoft (Azure). Other than this, VMforce, the first enterprise cloud computing service for java developers, is grabbing significant attention in the market.
Amazon’s EC2 is an Infrastructure-as-a-Service (IaaS) solution where customers can consume a generic server resource in the cloud. It still requires customers to install and manage an application stack including the application server, database, and other middleware elements. On the other hand, VMforce is a full Platform-as-a-service (PaaS) solution where the customer doesn’t need to worry about installing and managing the middleware and application stack. All that ‘plumbing’ is done as a service. EC2 pricing is per instance-hour consumed for each instance type and varies based on several different factors. Pricing for VMforce will be announced closer to its release date.
EC2 (Amazon):
Amazon’s EC2 is a cloud computing service that allows users to deploy and run their applications on rented virtual computers. Users can boot what are called Amazon Machine Images and create an instance, also known as a virtual machine, and pay for the amount of computing power they need by the hour. EC2 is particularly well suited for applications that experience hourly, daily, or weekly variability in usage. Below are some of its main features:
VMforce (SalesForce and VMWare):
VMforce is the shared vision of Salesforce.com and VMware. With VMforce, the customer focuses on the application logic and leaves the rest to VMware and Salesforce.com. VMforce is aimed at SaaS and Web services developers. The major components of VMforce are VMware’s Spring platform — and its community of Java developers — and Salesforce.com’s Force.com platform. 
Below is the table with brief comparison between the two:
|
Amazon Ec2 |
VMforce |
|
IaaS |
PaaS |
|
Virtual services in the cloud |
Complete platform as a service |
|
Generic Server Images |
Complete development service |
|
Self-assembly & management of stack |
Automatic stack management |
|
Self-managed database |
Database as a service |
|
Self-managed scalability |
Automatic scalability |
|
AWS Services |
Chatter services |
You can download this file here for detailed features of EC2 and VMforce.
login.salesforce.com – New API Login Endpoint
Interesting finding
Whilst perusing some of the Salesforce.com blogs noticed this interesting tidbit:
We will soon be updating the API login endpoint found in the enterprise and partner WSDL files. For the longest time, this endpoint was https://www.salesforce.com. Last year, we quietly launched a new endpoint: https://login.salesforce.com and in a future release this will become the official endpoint listed in the WSDL files. The old endpoint will continue to work, but you will shave off a bit of request time if you switch to the new endpoint, so why not do it right away?
Note that this change only impacts the login endpoint. As always, once you have logged in, you will receive the organization specific endpoint in the SOAP response and these endpoints have not changed.
So why does this matter? Well a couple of reasons:
- As mentioned, you will shave off a bit of request time switching over to the new endpoint. I’m sure this is in the order of milliseconds, but if you make hundreds of thousands of API calls per day, this could add up to saved wait time over a long period of time. It’s easy to change, so why not do it?
- While not explicitly stated, who’s to say that the current (www.salesforce.com) endpoint will continue to be supported ad infinitum? All documentation points to the new endpoint (login.salesforce.com) as the preferred method, meaning that new applications should use this endpoint.
Nothing we can do about bullet 2, but bullet 1 got me thinking, what sort of time savings are we talking about? After running many different trials on logins against both of the endpoints, I must say that my findings are inconclusive (please see graphs below). Over a sample of 100 logins, the average seemed to converge together with mere milliseconds separating the alternate login endpoints. Not only this, I believe other factors such as network latency which cause spikes in some of the login requests, lead me to believe that there is really no noticeable difference between the two endpoints.
So if nothing else than to code to the standard url endpoint specified in the WSDL, I would go with the salesforce.com suggestion of using the new login endpoint for new applications and not touch any of your legacy applications. That is, until there is an official communication that the old endpoint is no longer supported (if that time ever comes!).
Graphs (Y Axis in Milliseconds)
10 Logins
100 Logins
500 Logins
Are you tired of working with consulting companies who give you what you ask for, but not what you need? If you want to partner with us to extend your IT capabilities and produce real results, contact us
SOQL Group By – Grouping records in Salesforce.com
Overview
One of the most sought after features in SOQL (Salesforce.com Object Query Language) has been the inclusion of record grouping. While Aggregate functions such as Min, Max, Count, and Sum have been available, they have not been available to retrieve grouped Aggregate results. For example, if you wanted to retrieve the Count and Sum(Amount) of Opportunity Records grouped by an Account, you have to retrieve all of the Opportunities and create a loop to increment Counts and Sum based on the Account associated with the Opportunity. Ugly!
With the introduction of the Salesforce Platform Winter 2010 release and version 18 of the API, grouping and a few other functions are available. In this post I will concentrate on the “GROUP BY” clause. In future posts, I will examine other clauses such as:
- GROUP BY ROLLUP ()
- GROUPING ()
- GROUP BY CUBE ()
- HAVING ()
EXAMPLE
In this example, I want to showcase the different types of Grouping and Aggregate Functions that can be combined to come up with some interesting results. I want to create a Visual Force Page bound to a Custom Controller, where I can display the results of the SOQL statement.
First, let’s start with the Controller code. I wanted to bind the results of the query directly to a apex:dataTable, so I created an object to hold the Grouping values and Aggregated results. Below is the object:
public class GroupedQuery
{
public string GroupBy1 {get;set;}
public string GroupBy2 {get;set;}
public string GroupBy3 {get;set;}
public integer Count {get;set;}
public Decimal Sum {get;set;}
public Decimal Average {get;set;}
public Decimal Min {get;set;}
public Decimal Max {get;set;}
}
In addition, in order to bind an object to the dataTable within the Visual Force page, you need to expose a public member from your Controller code. For brevity sake, I just added a public List<> member like this:
public List<GroupedQuery> queryResults {get;set;}
Next in the Controller Constructor (keep in mind this is for demo purposes!) we use a SOQL query to retrieve a List of generic sObjects:
List<sObject> queries = [select Account.Name AccountName, Count(Name) CountResult
from Opportunity group by Account.Name];
Once we have the Results we iterate through a loop to create our collection of GroupedQuery objects:
queryResults = new List<GroupedQuery>();
for (sObject query: queries)
{
GroupedQuery myObject = new GroupedQuery();
myObject.Count = Integer.valueOf(query.get('CountResult'));
myObject.GroupBy1 = String.valueOf(query.get('AccountName'));
queryResults.Add(myObject);
}
The variable queryResults is exposed as a public variable from the controller and thus we can tie this List directly to a dataTable and access the public properties of the GroupedQuery object contained in the List. Below is the Visual Force page code:
<apex:dataTable id="queryResults" value="{!queryResults}"
var="queryResultsRow" border="1" width="100%">
<apex:column >
<apex:facet name="header">GroupBy1</apex:facet>
{!queryResultsRow.GroupBy1}
</apex:column>
<apex:column >
<apex:facet name="header">Count</apex:facet>
{!queryResultsRow.Count}
</apex:column>
</apex:dataTable>
The code above will output the following results (the Account data is data from the Developer Edition of Salesforce.com)
A more involved example is included below. This includes two levels of grouping, as well as Min, Max, Sum and Average functions:
Summary
As demonstrated, the new functionality released with the Salesforce.com API Version 18 with regards to Grouping is extremely powerful and addresses a pain point that many developers have had to overcome to achieve Rollup type queries. This information is contained within the What’s New notes for Salesforce.com Version 18. Keep in mind this is included with the Winter 2010 release of the Salesforce.com platform, so you will need to ensure that your organization has been upgraded in order to avail of this functionality.
Are you tired of working with consulting companies who give you what you ask for, but not what you need? If you want to partner with us to extend your IT capabilities and produce real results, contact us
{
public string GroupBy1 {get;set;}
public string GroupBy2 {get;set;}
public string GroupBy3 {get;set;}
public integer Count {get;set;}
public Decimal Sum {get;set;}
public Decimal Average {get;set;}
public Decimal Min {get;set;}
public Decimal Max {get;set;}
}
Matching Salesforce Data to Back Office Systems
Overview
Often when working with Salesforce Account and Contact records, you want to see if there’s a match with existing SQL Server data in back office systems. While there are several commercial solutions that perform data matching on the Force.com platform (just search the App Exchange for deduplication or lead management to see what’s out there), as well as SSIS fuzzy lookups on the Microsoft SQL Server platform, often you want a quick and free solution that compares data between Salesforce.com and SQL Server data.
To do this, you simply need to add a custom formula field in Salesforce.com and a similar query in your SQL Server data. You can tweak this formula to match different fields depending upon the demographic information contained in the back office system. I will demonstrate this with the Account object in Salesforce.com. In this example, I am assuming that the company name, street, city and postcode will be stored in the back office system.
Duplicate Check field
The field simply concatenates parts of the customer’s name and address together, converts it to upper case, and pads any empty spaces with a character not likely to appear in the name or address fields. I chose the asterisk “*” for this example.
| Name |
Street | City | Post Code | Duplicate Check Code |
| *************************** | ||||
| FELLS WARGO HOME EQUITY | FELLS W******************** | |||
| CATAMARAN EXPRESS COMPANY | SAN BERNADINO | 91911 | CATAMA E**********SAN B91911 | |
| GUMFORT CARE DENTURE CENTER | 900 HOCKSVILLE RD | MASSAPEQUA | 91758 | GUMFOR C900 HOCKSVMASSA91758 |
| ORLANDO/GREEN COUNTY CAB | 1234 LAKE ROAD | ORLANDO | 12319-5273 | ORLAND C1234 LAKE ORLAN12319 |
You’ll notice that this works best when the fields you choose to populate the check code are required, or at least generally populated in both systems. If you choose fields that are often left blank, you can turn up more false positives for matches, which defeats the purpose of the duplicate check.
Salesforce.com Custom Formula Field
To implement this in Salesforce, you need to add a custom formula field to the Account object in Salesforce.com:
- Login to Salesforce.com as an account with administrator privileges.
- Click the ‘Setup’ link at the top of the screen:

- On the left hand App Setup menu, click Customize, Accounts, Fields:

- In the Account Custom Fields and Relationships section, click New.
- Choose Formula as the field type:

- Enter a label and a name for the field. Choose Text as the output type of the formula field:

- Enter the matchcode formula into the Advanced Formula text area:

- Setup the field level security and add the field to the appropriate page layouts, and Save the field.
- That’s it! The field will now be available via the API, Apex code and can be exposed through data exports with the Apex Data Loader.
I’ve reproduced the code from the screenshot in a copyable format below to save you from RSI
UPPER (
MID(TRIM( NULLVALUE(Name, '') )& '******' , 1, 6)
& MID(TRIM(NULLVALUE(Name, '')) & '*' , FIND(' ', TRIM(NULLVALUE(Name, ''))
& '*'),1)
& MID(TRIM(NULLVALUE(Name, '')) & '***' , FIND(' ', MID(TRIM(NULLVALUE(Name,
'')) & ' ',
FIND(' ',TRIM(NULLVALUE(Name,'')) & ' ') + 1,59)) +
FIND(' ', TRIM(NULLVALUE(Name,'')) & ' ') + 1,1)
& MID(TRIM( NULLVALUE(BillingStreet, '')) &'**********' , 1, 10)
& MID(TRIM( NULLVALUE(BillingCity, '')) & '*****' , 1, 5)
& MID(TRIM( NULLVALUE(BillingPostalCode, '')) & '*****' , 1, 5)
)
SQL Custom Query
Although adding the custom field to Salesforce alone has some value (for example, you could use it in custom code to detect duplicates), the real value is when you can match up the values to your back office system. I’m assuming your back office server is running SQL Server 2005 or higher for this T-SQL example, but most forms of SQL have equivalent functions that can manipulate strings, so check your documentation if you are using a different flavour of SQL, and please post your custom formula in the comments section to help out others.
This query could be added to a view, stored procedure or custom field in SQL.
For this example I’m assuming that the fields are called ‘Name’, ‘StreetName’, ‘City’, and ‘ZipCode’, and are stored in a table called CompanyInfo.
SELECT UPPER (
SUBSTRING(LTRIM(RTRIM(ISNULL(Name, '')))+ '******' , 1, 6) +
SUBSTRING(LTRIM(RTRIM(ISNULL(Name, '')))+ '*' , CHARINDEX(' ', LTRIM(RTRIM(ISNULL
(Name, ''))) + '*'),1)
+ SUBSTRING(LTRIM(RTRIM(ISNULL(Name, ''))) + '***' , CHARINDEX(' ', CHARINDEX(LTRIM
(RTRIM(ISNULL(Name,''))) + ' ',
CHARINDEX(' ', LTRIM(RTRIM(ISNULL(Name,''))) + ' ') + 1,59)) +
CHARINDEX(' ', LTRIM(RTRIM(ISNULL(Name,''))) + ' ') + 1,1)
+ SUBSTRING(LTRIM(RTRIM(ISNULL(StreetName, ''))) + '**********' , 1, 10)
+ SUBSTRING(LTRIM(RTRIM(ISNULL(City, ''))) + '*****' , 1, 5)
+ SUBSTRING(LTRIM(RTRIM(ISNULL(ZipCode, ''))) + '*****' , 1, 5)
) AS 'DuplicateCheck'
FROM CompanyInfo
Using your duplicate check code
Once the check is in place in both systems, you can use it in several ways. You can match the fields directly, you could create a matching “confidence” score by using character by character comparisons or you could use more complex fuzzy lookup functions to further match the data.
Summary
Now that you have a single field to compare Salesforce and SQL data, you can tweak the values to find better matching algorithms, and use these fields as part of any duplicate checking process for data integration.




